Monday, August 4, 2014

How would scientists do on a public "science literacy" test -- and should we care? Stocklmayer & Bryant vs. NSF Indicators “Science literacy” scale part 2

So . . . this is the second post on the interesting paper Stocklmayer, S. M., & Bryant, C., Science and the Public—What should people know?, International Journal of Science Education, Part B, 2(1), 81-101 (2012).

Skip ahead to the bolded red text if you still vividly remember the first (as if it were posted “only yesterday”) or simply don’t care what it said & want to go straight to something quite interesting—the results of S&B's administration of a public “science literacy” test to trained scientists.

But by way of review, S&B don’t much like the NSF Science Indicators “factual knowledge” questions, the standard “science literacy” scale used in studies of public science comprehension.

The basic thrust of their  critique is that the Indicators battery is both undertheorized and unvalidated.

It’s “undertheorized” in the sense that no serious attention went into what the test was supposed to be measuring or why.

Its inventors viewed public “science literacy” to be essential to informed personal decisionmaking, enlightened self-government, and a productive national economy. But they didn’t address what kinds of scientific knowledge conduce to these ends, or why the odd collection of true-false items featured in the Indicators (“Lasers work by focusing sound waves”; “The center of the earth is very hot”) should be expected to assess test takers’ possession of such knowledge.

The NSF “science literacy” test is unvalidated in the sense that no evidence was offered—either upon their introduction or thereafter—that scores on it are meaningfully correlated with giving proper effect to scientific information in any particular setting.

S&B propose that the Indicators battery be scrapped in favor of an assessment that reflects an “assets-based model of knowledge.” Instead of certifying test takers’ assimilation of some canonical set of propositions, the aim of such an instrument would be to gauge capacities essential to acquiring and effectively using scientific information in ordinary decisionmaking.

I went through S&B’s arguments to this effect last time, and explained why I found them persuasive.

I did take issue, however, with their conclusion that the Indicators should simply be abandoned. Better, I think, would be for scholars to go ahead and use the Indicators battery but supplement it as necessary with items that validly measure the aspects of science comprehension genuinely relevant to their analyses.

It is more realistic to think a decent successor to the Indicators battery would evolve from this sort of process than it is to believe that a valid, new science comprehension scale will be invented from scratch.  The expected reward to scholars who contribute to development of the latter would be too low to justify the expected cost they’d incur, which would include having to endure the unwarranted but predictable resistance of many other scholars who are professionally invested in the Indicators battery.

Okay!  But I put off for “today’s” post a discussion of S&B’s very interesting original study, which consisted of the administration of the Indicators battery (supplemented with some related Eurobarometer “factual knowledge” items) to a group of 500 scientists.

The scientists generally outscored members of the public, although not by a very large margin (remember, one problem with the NSF battery is that it's too easy—the average score is too high to enable meaningful investigation of variance).

But the more interesting thing was how readily scientists who gave the “wrong” answer were able to offer a cogent account of why their response should in fact be regarded as correct.

For example, it is false to say the “the center of the earth is very hot,” one scientist pointed out, if we compare the temperature there to that on the surface of the sun or other stars.

Not true, 29% of the sample said, in response to the statement, “It is the father’s genes that determine whether the baby is a boy or girl”—not because “it is the mother’s genes” that do so but because it is the father’s chromosome that does.

No fair-minded grader would conclude that these scientists’ responses betray lack of comprehension of the relevant “facts.”  That their answers would be scored “incorrect” if they were among the test takers in an Indicators sample, S&B conclude, “cast[s] further doubts upon the value of such a survey as a tool for measurement of public ‘knowledge.’ ”

If I were asked my opinion in a survey, I’d “strongly disagree” with this conclusion!

Indeed, in my view, the idea that the validity of a public science comprehension instrument should be assessed by administering it to a sample of scientists reflects the very sort of misunderstandings—conceptual and psychometric—that S&B convincingly argue are reflected in the Indicators battery.

S&B sensibly advocate an “assets-based” assessment as opposed to a knowledge-inventory one.

Under the former, the value of test items consists not in their corroboration of a test taker's catechistic retention of a list of "foundational facts" but rather in the contribution those items make to measuring a personal trait or capacity essential to acquiring and using relevant scientific information.

The way to validate any particular item, then, isn't to show that 100%—or any particular percentage—of scientists would “agree” with the response scored as “correct.”

It is to show that that response genuinely correlates with the relevant comprehension capacity within the intended sample of test takers.

Indeed, while such an outcome is unlikely, an item could be valid even if the response scored as “correct” is indisputably wrong, so long as test takers with the relevant comprehension capacity are more likely to select that response.
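To make that concrete, here is a minimal simulation (the item model and every number in it are hypothetical) of how one would check that a scored response genuinely correlates with the latent capacity it is supposed to indicate:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Latent comprehension capacity for each simulated test taker
theta = rng.normal(0, 1, n)

# A binary item whose scored-correct response becomes more likely as
# the latent capacity increases (a simple logistic item model)
p_correct = 1 / (1 + np.exp(-(1.2 * theta - 0.5)))
responses = rng.random(n) < p_correct

# Validation question: within this sample, does the response scored
# "correct" correlate with the latent capacity?
r = np.corrcoef(responses, theta)[0, 1]
print(f"item-capacity correlation: r = {r:.2f}")
```

The same check run on an item whose "correct" key anti-correlates with the capacity would flag it as a candidate for rescoring, regardless of what any panel of scientists thinks the "true" answer is.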

This point actually came up in connection with my proto-climate-science comprehension measure.

That instrument contained the item “Climate scientists believe that if the North Pole icecap melted as a result of human-caused global warming, global sea levels would rise—true or false?”

“False” was scored as correct, consistent with public-education outreach material prepared by NOAA, NASA, and others, which explains that the “North Pole ice cap,” unlike the South Pole one, “is already floating,” and thus, like “an ice cube melting in a glass full of water,” already displaces a volume of water equivalent to the amount it would add when unfrozen. 

But an adroit reader of this blog—perhaps a climate scientist or maybe just a well educated nonexpert—objected that in fact floating sea ice has slightly less salinity than sea water, and as a result of some interesting mechanism or another displaces a teeny tiny bit less water than it would add if melted. Global sea levels would thus rise about 1/100th of “half a hair’s breadth”—the width of a human cell, within an order of magnitude—if the North Pole melted.

Disturbed to learn so many eminent science authorities were disseminating incorrect information in educational outreach materials, the blog reader prevailed on NOAA to change the answer its “Frequently asked questions about the Arctic” page gives to the question, “Will sea levels rise if the North Pole continues to melt?”

Before, the agency said that there'd be "no effect" if the "North Pole ice cap melts"; now the page says there would be "little effect."

So at least on NOAA’s site (I haven’t checked to see if all the other agencies and public educators have changed their materials) “little effect” is now the “correct answer”—one, sadly, that NOAA apparently expects members of the public to assimilate in a completely unreflective way, since the agency gives no account of why, if the “North Pole is already floating,” it wouldn’t behave just like an “ice cube floating in a glass of water.”

Great!

But as I also explained, among the general-population sample of test takers to whom I administered my proto-assessment, answering “true” rather than "false" to the “North Pole” item predicted a three times greater likelihood of incorrectly responding “true” as well to two other items: one stating that scientists expected global warming from CO2 emissions to “reduce photosynthesis by plants” ("photosynthesis"); and another that scientists believe global warming will "increase the risk of skin cancer” ("skin cancer").

If we assume that people who responded “false” to "photosynthesis" and "skin cancer" have a better grasp of the mechanisms of climate change than those who responded “true” to those items, then a “false” response to "North Pole” is a better indicator—or observable manifestation—of the latent or unobserved form of science comprehension that the “climate literacy” proto-assessment was designed to measure.

Maybe some tiny fraction of the people who answered "true" to "North Pole" were aware of the cool information about the nano-sized differential between the volume of water the North Pole ice cap would add and the amount it displaces when frozen. But many times more than that no doubt simply didn't know that the North Pole is just an ice cube floating in the Arctic Ocean or didn't know that ice displaces a volume of water equivalent to the volume it would occupy when melted.

For that reason, when administered to a general population sample, the instrument will do a better job in identifying those who get the mechanisms of human-caused climate change, and who can actually reason about information relating to them, if “false” rather than “true” is treated as correct.

This is simplifying a bit: the issue is not merely whether there is a positive correlation between the answer deemed "correct" and superior performance on a validated test as a whole but whether the answer deemed correct makes scores on the test more reliable--for which such a correlation is necessary but not sufficient. But the same point applies: the response that makes a validated instrument more reliable could in theory be shown to be wrong or no "more" right than an alternative response the crediting of which would reduce the reliability of the instrument. 
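A toy computation can illustrate the reliability point. Everything below is simulated, and Cronbach's alpha stands in for "reliability" as one conventional index of it: reversing the scoring key on a single item, so that the credited response anti-correlates with the latent capacity, lowers the reliability of the whole scale.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for a (respondents x items) 0/1 score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(1)
n = 2000
theta = rng.normal(0, 1, n)  # latent comprehension capacity

# Five items of varying difficulty, all driven by the same capacity
probs = 1 / (1 + np.exp(-(theta[:, None] - np.linspace(-1, 1, 5))))
scored = (rng.random((n, 5)) < probs).astype(int)
alpha = cronbach_alpha(scored)

# Reverse the scoring key on one item: the "credited" response now
# anti-correlates with the capacity, and scale reliability drops
miskeyed = scored.copy()
miskeyed[:, 0] = 1 - miskeyed[:, 0]
alpha_miskeyed = cronbach_alpha(miskeyed)

print(f"alpha, correct key:  {alpha:.2f}")
print(f"alpha, reversed key: {alpha_miskeyed:.2f}")
```

The scoring key that maximizes reliability is a property of the sample and the latent trait, not of any expert consensus on which answer is "really" right.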

The only person who would object to this understanding of how to score standardized test responses is someone who makes the mistake of thinking that a science-comprehension assessment is supposed to certify assimilation of some inventory of canonical “facts” rather than measure a latent or unobserved capacity to acquire and use scientific knowledge.

S&B don’t make that mistake. On the contrary, they assert that those who constructed the Indicators made it, and criticize the Indicators battery (and related Eurobarometer “factual knowledge” items) on that ground.

So I'm puzzled why they think it "casts further doubt" on the test to show that the "facts" in its science literacy inventory are ones that scientists themselves might dispute.

Indeed, it is well known to experts in the design of assessments that sufficiently knowledgeable people will frequently be able to come up with perfectly acceptable accounts of why a “wrong” response to one or another test item could reasonably be seen as correct. 

Again, so long as it is less likely that any particular test taker who selected that response had such an account in mind than that he or she simply lacked some relevant form of comprehension, then giving credit for the “wrong” answer to all who selected it will make the results less accurate.

Obviously, it would be a huge error to equate error with lack of knowledge when a “public science comprehension” assessment is administered to a group of expert scientists.  As S&B discovered, it is highly likely that the test takers in such a sample will in fact be able to give a satisfactory account of why any “wrong” answer they select should be viewed as correct.

But just as obviously, it would be a mistake to assume that when a public science comprehension test is administered to members of the public the small fraction who say the “Sun go[es] around the Earth” rather than “Earth goes around the Sun” are more likely than not conjuring up the defense of such an answer that the astronomer Sir Fred Hoyle could have given: namely, that a geocentric model of planetary motion is no less "correct" than a "heliocentric" one; the latter, Hoyle points out in his essay on Copernicus,  is justified on grounds of its predictive fecundity, not its superior accuracy.  

True, if Hoyle by some chance happened to be among the members of the public randomly recruited to take the test, his science comprehension might end up being underestimated.

Only someone who doesn’t understand that a public science comprehension measure isn’t designed to assess the comprehension level of trained scientists, however, could possibly treat that as evidence that a particular item is invalid.

S&B certainly wouldn’t make that mistake either.  The most important criticism they make of the Indicators is that insufficient attention was given in designing it to identifying what ordinary members of the public have to know, and what capacities they must have, in order to acquire and use scientific information relevant to ordinary decisionmaking in a technologically advanced, liberal democratic society.

So for this reason, too, I don't see why they would think the results of their scientist survey "cast[s] further doubts upon the value of [the Indicators]." A valid public science comprehension measure would surely produce the same amusing spectacle if administered to a group of trained scientists--so the demonstration is neither here nor there if we are trying to figure out whether and how to improve upon the Indicators.

As I said, I really like the S&B paper, and hope that other researchers take its central message to heart: that the study of public science comprehension is being stunted for want of a defensibly theorized, empirically validated instrument.

I’m pretty sure if they do, though, they’ll see why administering existing or prospective instruments to trained scientists is not a very useful way to proceed. 



Reader Comments (8)

"The only person who would object to this understanding of how to score standardized test responses is someone who makes the mistake of thinking that a science-comprehension assessment is supposed to certify assimilation of some inventory of canonical “facts” rather than measure a latent or unobserved capacity to acquire and use scientific knowledge."

Or someone who thinks a science comprehension assessment is supposed to measure science comprehension.

It measures something. And that something is clearly correlated with science comprehension. But there is a U-shaped response curve, which means that it is correlated over part of the range and anti-correlated over the rest. They're not the same variable.

It's a bit like the way certain people figured tree ring width was correlated with temperature, even though it's well known that there is an optimum temperature for each tree species, and ring width rises and then falls with rising temperature, in an inverted U. Not taking this into account and simply taking temperature to be proportional to ring width, as a measurement of it, then gives quite misleading results.

Actually, there are a few more of those questions which are arguably 'wrong'. All milk contains trace radioisotopes and so can be technically radioactive but perfectly safe, so long as it is pasteurised (which is not generally boiling, but boiling would work too). Hot air rises only until a certain temperature gradient is reached, when it stops. (This is actually very relevant to how the greenhouse effect works, so is important in this context.) The liver makes some of the chemical components of urine, the kidneys just filter them out of the bloodstream. Electrons are arguably the same size as atoms if you count the size of the wavefunction (the cloud of quantum probability), since the outer boundary of the atom is defined by the outer boundary of this region. The claim that the continents are moving about on the surface of the Earth raises the question of relative velocities. (What are they moving relative to? How do you define this?) The children of a body builder can inherit more than genes - prize money from body-building competitions, for example. But even setting aside this obvious cheat, it has been fairly recently discovered that there are indeed acquired genetic characteristics that can be inherited: the methylation of genes that turns them on and off. Whether body building can introduce any such changes I don't know, but it's possible. There's also the potential for things like an increase/decrease in the probability of mutations. Antibiotics kill bacteria but bacteriophages are viruses that infect bacteria, so killing the bacterium kills them as well. And there are valid taxonomic arguments for saying that birds are a species of dinosaur (in much the same way that humans are a species of ape), and therefore not only did they exist when humans first evolved, they still do.

Mostly, you can work out what answer was intended by knowing what's taught in schools, or that it is a specific example commonly used to teach a particular general principle, or what the most common case is. But that indicates that it is measuring a knowledge of what scientific facts are taught to the general public, rather than what scientific facts are true, which of course is a different variable. It's correlated to it, but correlation does not imply identity any more than it implies causation.

And as I've often argued before, it doesn't rightly distinguish somebody who knows the right answer because an authority figure told them so, (which is a profoundly unscientific way of thinking), from someone who claims not to know because they haven't seen the scientific proof for themselves (which is). It's not measuring science comprehension, or even the capacity to acquire and use scientific knowledge. It's more like it's measuring the ability to regurgitate 'facts' that they have been told by someone about scientific topics (e.g. by a science teacher) which some people confuse for real scientific understanding.

Understanding the difference, though, is another piece of scientific knowledge which such tests don't measure. If you've only learnt science as some sort of trivia quiz challenge, and that you're supposed to trust science teachers and other authority figures and official sources implicitly, there's no wonder people don't understand it and think that's how science is supposed to work.

August 4, 2014 | Unregistered CommenterNiV

@NiV:

I agree the *Indicators* are not valid for the reason you state.

But the sort of amusing spectacle that S&B generated w/ their measure would be certain to happen to a *valid* public science comprehension measure-- one that measures a reasoning capacity & not assimilation of facts learned by rote-- too. It would happen to *any* valid psychometric measure of any skill, reasoning style, etc.

There wouldn't be the "U" shaped distribution you imagine. There isn't even such a distribution in the results S&B report.

There just wouldn't be any reason to think that a valid public science comprehension measure can accurately distinguish between Hoyle and a good high school student.

August 4, 2014 | Registered CommenterDan Kahan

"There wouldn't be the "U" shaped distribution you imagine. There isn't even such a distribution in the results S&B report."

How so? If you put Joe Public, a school science teacher, and Fred Hoyle on a line and plotted their scores on the "Earth goes round the sun" question, how would it not be (inverted) U shaped?

"There just wouldn't be any reason to think that a valid public science comprehension measure can accurately distinguish between Hoyle and a good high school student."

I hope you'll pardon my scepticism!

There are already some existing public science comprehension measures out there: academic science qualifications. And they can distinguish between Sir Fred and a bright high school student quite nicely. Unless you mean that there can be no such thing as a valid public science comprehension measure?

August 4, 2014 | Unregistered CommenterNiV

@NiV

1. On U-shape: Just look at the table. The scientists as a group would have been in the very top of the distribution on the Indicators battery. So would any bright high school student & many dimwitted ones too (again, I'm not defending the Indicators). The questions are too easy to discriminate between them. The clever scientists who gave the 'wrong' answers -- like the 1 in 500 who said that the surface of the sun is hotter than the center of the earth--aren't smarter than the others & statistically would be noise (i.e., they wouldn't "bend the curve" downward for PhD scientists).

2. If you think it makes sense for there to be 1 test for measuring both what members of the public should know to make appropriate use of science in ordinary decisionmaking contexts *&* what trained scientists should know to excel in their fields, then you have much more in common w/ the drafters of the Indicators than you appear to recognize.

August 4, 2014 | Unregistered Commenterdmk38

"The clever scientists who gave the 'wrong' answers -- like the 1 in 500 who said that the surface of the sun is hotter than the center of the earth--aren't smarter than the others & statistically would be noise"

Ah, right. I understand now. You mean because the scientists were smart enough to figure out the answers you expected, rather than the answers that were true, and were happy to play along, they scored highly. I was thinking about what would happen if they answered the questions truthfully, with strict scientific accuracy, for which the answers to quite a few would be 'I don't know', since they're ambiguous.

We get the same paradoxical effect when we ask people silly questions like "Do you believe in climate change?" They figure out what question you really meant to ask and answer that instead of the one on the paper, which can then conflict with some of their other answers.

"If you think it makes sense for there to be 1 test for measuring both what members of the public should know to make appropriate use of science in ordinary decisionmaknig contexts *&* what trained scienitsts should to excel in their fields, then you have much more in common w/ the drafters of the Indicators than you appear to recognize."

I think scientists are members of the general public, too, and a fair number of the rest of the general public have a deeper interest in and understanding of science than they're given credit for. A small minority, sure, but disproportionately significant in the social decisionmaking context because (as you yourself have indicated) people pick up their cues on what they should believe from their smarter friends and associates. If as you say people get their views partly from their social networks, rather than judging each individual issue themselves, then you need to measure the science comprehension of their social network.

There's no reason why you couldn't design the test to distinguish people more precisely at the top end of the scale. The accuracy of the test at each population percentile is one of the design parameters for the test. The more questions you ask, the more accurate you can make it, but the more risk there is of 'questionnaire fatigue'. You can develop it by asking a representative sample a lot of questions of the appropriate sort (with rest breaks), sorting people into order of overall knowledge, and then looking for individual questions (or small subsets) with sharp response curves. Each question has a mean threshold percentile and a spread, and so you can select an appropriate number of questions thresholding at each design percentile point to achieve the desired accuracy and error bars. Then validate the scale against a brand new sample.
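The item response curves described here are conventionally modeled with a two-parameter logistic (2PL) function, where each item has a threshold (difficulty) and a spread (the inverse of its discrimination). A minimal sketch, with purely illustrative parameter values:

```python
import math

def p_correct(theta, difficulty, discrimination=1.0):
    """Two-parameter logistic (2PL) item response function: probability
    that a test taker at latent ability `theta` answers correctly."""
    return 1 / (1 + math.exp(-discrimination * (theta - difficulty)))

# An easy item discriminates at the low end of the ability scale; a
# hard item with a steep slope separates the top end from everyone else
for theta in (-2, 0, 2, 3):
    easy = p_correct(theta, difficulty=-1.5)
    hard = p_correct(theta, difficulty=2.0, discrimination=2.5)
    print(f"theta={theta:+d}: easy {easy:.2f}, hard {hard:.2f}")
```

Choosing a set of items whose thresholds tile the target percentile range is what gives a scale roughly uniform accuracy across that range.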

For a general purpose score, you probably want fairly uniform accuracy across the range. Scientists and engineers make up about 5% of the US labour force, so it's not too unreasonable to require the test to distinguish them from high school students, (high school graduates making up about 70% of the population). Assuming even a bright student isn't going to be able to hold down a job in science and engineering, anything that gives around 5% error bars at the 95% mark is easily going to separate them. Does that sound unreasonable? It sounds achievable to me.

And to keep the number of redundant questions down, I'd probably design it in two phases - the first to tell roughly where they are on the scale to within 20%, say, and then based on the answer to select one from a set of further questionnaires to give a more precise answer for that capability level. (A cleverer approach would be to use a 'binary search' strategy. Ask a question in the middle of the range, then at each step choose what question to ask next, higher or lower, depending on the answer. But that could lead to all sorts of odd interactions and over-weighting effects which would be hard to analyse. Best not to risk it.)
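The binary-search strategy can be sketched as follows; a deterministic toy test taker stands in for real, noisy responses, and the item bank is hypothetical:

```python
def adaptive_test(answer, difficulties):
    """Binary-search item selection over a difficulty-sorted item bank.
    `answer(d)` returns True if the test taker gets an item of
    difficulty d right; returns the hardest level passed (or None)."""
    lo, hi = 0, len(difficulties) - 1
    estimate = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if answer(difficulties[mid]):
            estimate = difficulties[mid]  # passed: try harder items
            lo = mid + 1
        else:
            hi = mid - 1                  # failed: try easier items
    return estimate

# A deterministic toy test taker who can handle difficulty <= 3
bank = [0, 1, 2, 3, 4, 5, 6, 7]
print(adaptive_test(lambda d: d <= 3, bank))  # -> 3
```

Real computerized adaptive tests update a continuous ability estimate under an item response model rather than walking a list index, but the control flow is the same, including the over-weighting worry: early answers steer which items are ever seen.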

I had a quick look at your sample item response curves in your 'under the hood' post, but since they're plotted against the overall metric rather than the percentile of population, they're a bit hard to interpret. Assuming the scale is uniform, it looks like individual (good) questions are yielding a standard deviation around 15%, so you would want around 6 questions to cover the range and then a selection of 9 questions to reduce that spread three-fold. 36 questions would instead give a 6-fold reduction, and a 2-sigma accuracy around the desired 5%. Ballpark.

Of course, if the ability to answer the questions is correlated, you would likely need some more. I was assuming independence there. But I'm guessing that you ought to be able to get around 5-10% error bars which ought to resolve the professional scientists and engineers. If you can ask the right questions.

It depends what question you're asking, though. A measure of general science intelligence isn't the same thing as an assessment of whether people know what they need to make valid decisions on scientific topics. For that, you would want a test with more resolution up at the top end, around the decisionmaking-competence threshold. And you would definitely be wanting something that wasn't just a pop-science trivia quiz.

Why not ask actual everyday science questions? "Q1. You read the label on your shampoo bottle and it says it contains molecular super-nutrient pentapeptides to gently nourish your hair and slow the seven signs of aging. Is this science or BS?" Virtually nobody needs to know about the size of electrons or whether the sun orbits the Earth, but there are a lot of sciency questions that they do get asked every day. If that's what you want to know about, why not ask some of those instead?

August 5, 2014 | Unregistered CommenterNiV

Just to be nitpicky:
"if the North Pole icecap melted as a result of human-caused global warming, global sea levels would rise"
is true, because it would be accompanied by a melting of ice on land and thermal expansion of seawater.
So the NorthPoleIceMelting and SeaLevelRise are correlated, but not causally, as there is a third confounding variable.
Alas, how I have wished to apply this nerdy bit of statistics one day ;.)
Berry

August 5, 2014 | Unregistered CommenterBerry Boessenkool

I responded more at length on Izuru but I sense a discontinuity between the title and the subject but maybe this is your point. A literacy test is mildly useful and serves a purpose (measure public education effectiveness), a comprehension test cannot exist without literacy but also measures something else entirely, perhaps social utility or benefit to society for all that public education.

An example is Andre Norton's science fiction. In one story her hero is riding on a balloon in a high wind. The mistake was to have this high wind trying to pull the hero off the balloon. I bailed out on my disbelief at that point because it was stupid. I know that balloons travel *with* the wind and very little wind would be felt on the balloon even if it were traveling hundreds of km/h over land.

That a balloon travels with the wind that carries it is an inventory fact but to be *processing* that fact while reading a book is part of *comprehension*. It is the combination of the two that becomes socially useful (IMO).

Anyway, I sense that some surveys are less interested in the actual result as compared to instilling in the minds of the public that a particular answer is "correct". This is revealed particularly in the case when "I don't know" is not a permitted answer.

September 12, 2014 | Unregistered CommenterMichael 2

@Michael2--

Good example.

I'm not convinced that basic-fact questions are necessary for a useful "ordinary science intelligence" assessment, but for sure "knowledge" of facts w/o possession of critical reasoning dispositions won't do anyone any good. Accordingly, an 'assessment' that didn't combine the two couldn't be of much use.

You might find this post of interest -- it shows the relative contributions of the "factual knowledge" items, on one hand, and the "numeracy" & "cognitive reflection" ones, on the other, to the OSI_2.0 scale that I'm now using in my own research on risk perception & science communication.

--dankahan

September 12, 2014 | Unregistered Commenterdmk38
