How would scientists do on a public "science literacy" test -- and should we care? Stocklmayer & Bryant vs. NSF Indicators “Science literacy” scale part 2
Monday, August 4, 2014 at 3:02PM
Dan Kahan

So . . . this is the second post on the interesting paper Stocklmayer, S. M., & Bryant, C. Science and the Public—What should people know?, International Journal of Science Education, Part B, 2(1), 81-101 (2012)

Skip ahead to the bolded red text if you still vividly remember the first (as if it were posted “only yesterday”) or simply don’t care what it said & want to go straight to something quite interesting—the results of S&B's administration of a public “science literacy” test to trained scientists.

But by way of review, S&B don’t much like the NSF Science Indicators “factual knowledge” questions, the standard “science literacy” scale used in studies of public science comprehension.

The basic thrust of their  critique is that the Indicators battery is both undertheorized and unvalidated.

It’s “undertheorized” in the sense that no serious attention went into what the test was supposed to be measuring or why.

Its inventors viewed public “science literacy” as essential to informed personal decisionmaking, enlightened self-government, and a productive national economy. But they didn’t address what kinds of scientific knowledge conduce to these ends, or why the odd collection of true-false items featured in the Indicators (“Lasers work by focusing sound waves”; “The center of the earth is very hot”) should be expected to assess test takers’ possession of such knowledge.

The NSF “science literacy” test is unvalidated in the sense that no evidence was offered—either upon their introduction or thereafter—that scores on it are meaningfully correlated with giving proper effect to scientific information in any particular setting.

S&B propose that the Indicators battery be scrapped in favor of an assessment that reflects an “assets-based model of knowledge.” Instead of certifying test takers’ assimilation of some canonical set of propositions, the aim of such an instrument would be to gauge capacities essential to acquiring and effectively using scientific information in ordinary decisionmaking.

I went through S&B’s arguments to this effect last time, and why I found them persuasive. 

I did take issue, however, with their conclusion that the Indicators should simply be abandoned. Better, I think, would be for scholars to go ahead and use the Indicators battery but supplement it as necessary with items that validly measure the aspects of science comprehension genuinely relevant to their analyses.

It is more realistic to think a decent successor to the Indicators battery would evolve from this sort of process than it is to believe that a valid, new science comprehension scale will be invented from scratch.  The expected reward to scholars who contribute to development of the latter would be too low to justify the expected cost they’d incur, which would include having to endure the unwarranted but predictable resistance of many other scholars who are professionally invested in the Indicators battery.

Okay!  But I put off for “today’s” post a discussion of S&B’s very interesting original study, which consisted of the administration of the Indicators battery (supplemented with some related Eurobarometer “factual knowledge” items) to a group of 500 scientists.

The scientists generally outscored members of the public, although not by a very large margin (remember, one problem with the NSF battery is that it's too easy—the average score is too high to enable meaningful investigation of variance).

But the more interesting thing was how readily scientists who gave the “wrong” answer were able to offer a cogent account of why their response should in fact be regarded as correct.

For example, it is false to say that “the center of the earth is very hot,” one scientist pointed out, if we compare the temperature there to that on the surface of the sun or other stars.

Not true, 29% of the sample said, in response to the statement, “It is the father’s genes that determine whether the baby is a boy or girl”—not because “it is the mother’s genes” that do so but because it is the father’s chromosome that does.

No fair-minded grader would conclude that these scientists’ responses betray lack of comprehension of the relevant “facts.”  That their answers would be scored “incorrect” if they were among the test takers in an Indicators sample, S&B conclude, “cast[s] further doubts upon the value of such a survey as a tool for measurement of public ‘knowledge.’ ”

If I were asked my opinion in a survey, I’d “strongly disagree” with this conclusion!

Indeed, in my view, the idea that the validity of a public science comprehension instrument should be assessed by administering it to a sample of scientists reflects the very sort of misunderstandings—conceptual and psychometric—that S&B convincingly argue are reflected in the Indicators battery.

S&B sensibly advocate an “assets-based” assessment as opposed to a knowledge-inventory one.

Under the former, the value of test items consists not in their corroboration of a test taker's catechistic retention of a list of "foundational facts" but rather in the contribution those items make to measuring a personal trait or capacity essential to acquiring and using relevant scientific information.

The way to validate any particular item, then, isn't to show that 100% (or any particular percentage) of scientists would “agree” with the response scored as “correct.”

It is to show that that response genuinely correlates with the relevant comprehension capacity within the intended sample of test takers.

Indeed, while such an outcome is unlikely, an item could be valid even if the response scored as “correct” is indisputably wrong, so long as test takers with the relevant comprehension capacity are more likely to select that response.
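
For readers who like to see the logic in code, here is a minimal sketch of what that kind of validation might look like: score the item under each candidate key and see which credited response lines up with performance on the rest of the scale. The data are simulated and the helper function is purely illustrative; nothing here comes from the Indicators or from S&B.

```python
import numpy as np

def item_rest_correlation(responses, item_idx, key):
    """Correlate one item's score (under a candidate answer key) with the total
    score on the remaining items, a rough stand-in for the latent capacity."""
    scored = (responses == key).astype(float)   # 1 = "correct" under this key
    item = scored[:, item_idx]
    rest = np.delete(scored, item_idx, axis=1).sum(axis=1)
    return np.corrcoef(item, rest)[0, 1]

# Simulated data: 1,000 test takers, 10 true/false items coded 1 = "true", 0 = "false",
# with the probability of answering "true" rising with a single latent ability.
rng = np.random.default_rng(0)
ability = rng.normal(size=1000)
responses = (rng.random((1000, 10)) < 1 / (1 + np.exp(-ability[:, None]))).astype(int)

# Two candidate keys that differ only on item 0: credit "true" or credit "false"?
key_true = np.ones(10, dtype=int)
key_false = key_true.copy()
key_false[0] = 0

print(item_rest_correlation(responses, 0, key_true))    # clearly positive
print(item_rest_correlation(responses, 0, key_false))   # clearly negative
```

The response worth crediting is the one that correlates positively with the rest of the scale in the intended population of test takers, whatever a panel of experts might say about it.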

This point actually came up in connection with my prototype climate-science comprehension measure.

That instrument contained the item “Climate scientists believe that if the North Pole icecap melted as a result of human-caused global warming, global sea levels would rise—true or false?”

“False” was scored as correct, consistent with public-education outreach material prepared by NOAA and NASA and others, which explain that the “North Pole ice cap,” unlike the South Pole one, “is already floating,” and thus, like “an ice cube melting in a glass full of water,” already displaces a volume of water equivalent to the amount it would add when unfrozen. 

But an adroit reader of this blog—perhaps a climate scientist or maybe just a well educated nonexpert—objected that in fact floating sea ice has slightly less salinity than sea water, and as a result of some interesting mechanism or another displaces a teeny tiny bit less water than it would add if melted. Global sea levels would thus rise about 1/100th of “half a hair’s breadth”—the width of a human cell, within an order of magnitude—if the North Pole melted.
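
For anyone who wants the back-of-the-envelope version of the reader's point, here is a tiny sketch with rough, assumed density values (not NOAA's or the reader's figures): floating ice displaces its weight in seawater, but the fresher meltwater it adds is slightly less dense and so occupies a bit more volume than the seawater it displaced.

```python
# Rough, assumed densities (kg/m^3) -- illustrative only
RHO_SEAWATER = 1025.0    # typical surface seawater
RHO_MELTWATER = 1000.0   # fresh(er) meltwater from sea ice

def melt_volume_excess_fraction(rho_sw=RHO_SEAWATER, rho_fw=RHO_MELTWATER):
    """Floating ice of mass m displaces m/rho_sw of seawater while frozen but adds
    m/rho_fw of less-dense meltwater when it melts; the fractional excess volume
    is rho_sw/rho_fw - 1, independent of how much ice there is."""
    return rho_sw / rho_fw - 1

print(f"{melt_volume_excess_fraction():.1%}")   # on the order of a few percent of the displaced volume
```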

Disturbed to learn so many eminent science authorities were disseminating incorrect information in educational outreach materials, the blog reader prevailed on NOAA to change the answer its “Frequently asked questions about the Arctic” page gives to the question, “Will sea levels rise if the North Pole continues to melt?”

Before, the agency said there'd be "no effect" if the "North Pole ice cap melts"; now the page says there would be "little effect."

So at least on NOAA’s site (I haven’t checked to see whether all the other agencies and public educators have changed their materials) “little effect” is now the “correct answer”—one, sadly, that NOAA apparently expects members of the public to assimilate in a completely unreflective way, since the agency gives no account of why, if the “North Pole is already floating,” it wouldn’t behave just like an “ice cube floating in a glass of water.”

Great!

But as I also explained, among the general-population sample of test takers to whom I administered my proto-assessment, answering “true” rather than "false" to the “North Pole” item predicted a three times greater likelihood of incorrectly responding "true" as well to two other items: one stating that scientists expected global warming from CO2 emissions to “reduce photosynthesis by plants” ("photosynthesis"); and another that scientists believe global warming will "increase the risk of skin cancer” ("skin cancer").

If we assume that people who responded “false” to "photosynthesis" and "skin cancer" have a better grasp of the mechanisms of climate change than those who responded “true” to those items, then a “false” response to "North Pole” is a better indicator—or observable manifestation—of the latent or unobserved form of science comprehension that the “climate literacy” proto-assessment was designed to measure.
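
Here is a sketch of the sort of tabulation that kind of finding rests on; the response data below are simulated to mimic the pattern (a roughly threefold difference), not the actual survey responses.

```python
import numpy as np
import pandas as pd

# Simulated responses coded 1 = answered "true", 0 = answered "false"; the column
# names mirror the items discussed above, but every number here is made up.
rng = np.random.default_rng(1)
n = 1500
north_pole = rng.integers(0, 2, n)
p_wrong = np.where(north_pole == 1, 0.45, 0.15)   # inflate wrong answers for the "true" group
photosynthesis = (rng.random(n) < p_wrong).astype(int)
skin_cancer = (rng.random(n) < p_wrong).astype(int)

df = pd.DataFrame({"north_pole": north_pole,
                   "photosynthesis": photosynthesis,
                   "skin_cancer": skin_cancer})

# Relative likelihood of the incorrect "true" response on each companion item,
# conditional on the test taker's "north_pole" answer
for item in ("photosynthesis", "skin_cancer"):
    p_given_true = df.loc[df.north_pole == 1, item].mean()
    p_given_false = df.loc[df.north_pole == 0, item].mean()
    print(item, round(p_given_true / p_given_false, 2))   # roughly 3x
```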

Maybe some tiny fraction of the people who answered "true" to "North Pole" were aware of the cool information about the nano-sized differential between the volume of water the North Pole ice cap would add if it melted and the volume it displaces while frozen. But many times more than that no doubt simply didn't know that the North Pole is just an ice cube floating in the Arctic Ocean or didn't know that floating ice displaces a volume of water equivalent to the volume it would occupy when melted.

For that reason, when administered to a general-population sample, the instrument will do a better job of identifying those who get the mechanisms of human-caused climate change, and who can actually reason about information relating to them, if “false” rather than “true” is treated as correct.

This is simplifying a bit: the issue is not merely whether there is a positive correlation between the answer deemed "correct" and superior performance on a validated test as a whole, but whether the answer deemed correct makes scores on the test more reliable, for which such a correlation is necessary but not sufficient. But the same point applies: the response that makes a validated instrument more reliable could in theory be shown to be wrong, or no "more" right than an alternative response the crediting of which would reduce the reliability of the instrument.
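
If you want to see the mechanics, here is a toy illustration: with simulated responses driven by a single latent trait, flipping the scoring key on one item lowers Cronbach's alpha for the scale as a whole. The data and the helper functions are mine and purely illustrative; no claim is being made that this is how any actual scale was scored.

```python
import numpy as np

def cronbach_alpha(scored):
    """Cronbach's alpha for a matrix of 0/1 item scores (rows = test takers)."""
    k = scored.shape[1]
    item_vars = scored.var(axis=0, ddof=1)
    total_var = scored.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def alpha_under_key(responses, key):
    """Scale reliability when each item is credited according to the given key."""
    return cronbach_alpha((responses == key).astype(float))

# Simulated true/false responses (1 = "true") driven by a single latent trait
rng = np.random.default_rng(2)
trait = rng.normal(size=2000)
responses = (rng.random((2000, 8)) < 1 / (1 + np.exp(-trait[:, None]))).astype(int)

key_a = np.ones(8, dtype=int)     # credit "true" on every item
key_b = key_a.copy()
key_b[0] = 0                      # same key, but credit "false" on item 0

print(alpha_under_key(responses, key_a))   # higher alpha
print(alpha_under_key(responses, key_b))   # lower alpha: the flipped key hurts reliability
```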

The only person who would object to this understanding of how to score standardized test responses is someone who makes the mistake of thinking that a science-comprehension assessment is supposed to certify assimilation of some inventory of canonical “facts” rather than measure a latent or unobserved capacity to acquire and use scientific knowledge.

S&B don’t make that mistake. On the contrary, they assert that those who constructed the Indicators made it, and criticize the Indicators battery (and related Eurobarometer “factual knowledge” items) on that ground.

So I'm puzzled why they think it "casts further doubt" on the test to show that the "facts" in its science-literacy inventory are ones that scientists themselves might dispute.

Indeed, it is well known to experts in the design of assessments that sufficiently knowledgeable people will frequently be able to come up with perfectly acceptable accounts of why a “wrong” response to one or another test item could reasonably be seen as correct. 

Again, so long as it is less likely that any particular test taker who selected that response had such an account in mind than that he or she simply lacked some relevant form of comprehension, giving credit for the “wrong” answer to all who selected it will make the results less accurate.

Obviously, it would be a huge error to equate error with lack of knowledge when a “public science comprehension” assessment is administered to a group of expert scientists. As S&B discovered, it is highly likely that the test takers in such a sample will in fact be able to give a satisfactory account of why any “wrong” answer they select should be viewed as correct.

But just as obviously, it would be a mistake to assume that when a public science comprehension test is administered to members of the public the small fraction who say the “Sun go[es] around the Earth” rather than “Earth goes around the Sun” are more likely than not conjuring up the defense of such an answer that the astronomer Sir Fred Hoyle could have given: namely, that a geocentric model of planetary motion is no less "correct" than a "heliocentric" one; the latter, Hoyle points out in his essay on Copernicus,  is justified on grounds of its predictive fecundity, not its superior accuracy.  

True, if Hoyle by some chance happened to be among the members of the public randomly recruited to take the test, his science comprehension might end up being underestimated.

Only someone who doesn’t understand that a public science comprehension measure isn’t designed to assess the comprehension level of trained scientists, however, could possibly treat that as evidence that a particular item is invalid.

S&B certainly wouldn’t make that mistake either. The most important criticism they make of the Indicators is that insufficient attention was given in designing it to identifying what ordinary members of the public have to know, and what capacities they must have, in order to acquire and use scientific information relevant to ordinary decisionmaking in a technologically advanced, liberal democratic society.

So for this reason, too, I don't see why they would think the results of their scientist survey "cast[s] further doubts upon the value of [the Indicators]." A valid public science comprehension measure would surely produce the same amusing spectacle if administered to a group of trained scientists, so the demonstration is neither here nor there if we are trying to figure out whether and how to improve upon the Indicators.

As I said, I really like the S&B paper, and hope that other researchers take its central message to heart: that the study of public science comprehension is being stunted for want of a defensibly theorized, empirically validated instrument.

I’m pretty sure if they do, though, they’ll see why administering existing or prospective instruments to trained scientists is not a very useful way to proceed. 

Article originally appeared on cultural cognition project (http://www.culturalcognition.net/).
See website for complete article licensing information.