Tuesday, June 9, 2015

A Pigovian tax solution (for now) for review/publication of studies that use M Turk samples

I often get asked to review papers that use M Turk samples.

This is a problem because I think M Turk samples, while not invalid for all forms of study, are invalid for studies of how individual differences in political predispositions and cognitive-reasoning proficiencies influence the processing of empirical information relevant to risk and other policy issues.

I've discussed this point at length.

And many serious scholars have now engaged this issue seriously.

"Seriously" not in the sense of merely collecting some data on the demographics of M Turk samples at one point in time and declaring them "okay" for all manner of studies once & for all. Anyone who produces a study like that, or who relies on it to assure readers that his or her own use of an M Turk sample is "okay," either doesn't get the underlying problem or doesn't care about it.

I mean really seriously, in the sense of trying to carefully document the features of the M Turk workforce that bear on its validity as a sample for various sorts of research, and of engaging in meaningful discussion of the technical and craft issues involved.

I myself think the work and reflections of these serious scholars reinforce the conclusion that it is highly problematic to rely on M Turk samples for the study of information processing relating to risk and other facts relevant to public policy.

The usual reply is, "but M Turk samples are inexpensive! They make it possible for lots & lots of scholars to do and publish empirical research!"

Well, thought experiments are even cheaper.  But they are not valid.  

If M Turk samples are not valid, it doesn't matter that they are cheap. Validity is a non-negotiable threshold requirement for use of a particular sampling method. It's not an asset or currency that can be spent down to buy "more" research-- for the research that such a "trade off" subsidizes in fact has no value.

Another argument is, "But they are better than university student samples!" If student samples are not valid for a particular kind of research, then journals shouldn't accept studies that use them either. In any case, it's now clear that M Turk workers don't behave the way U.S. university students do when responding to survey items that assess whether subjects display the sorts of reactions one would expect from people who claim to be members of the U.S. public with particular political outlooks (Krupnikov & Levine 2014).

I think serious journals should adopt policies announcing that they won't accept studies that use M Turk samples for types of studies they are not suited for.

But in any case, they ought at least to adopt policies one way or the other--rather than put authors in the position of not knowing before they collect the data whether journals will accept their studies, and authors and reviewers in the position of having a debate about the appropriateness of using such a sample over & over.  Case-by-case assessment is not a fair way to handle this issue, nor one that will generate a satisfactory overall outcome.

So ... here is my proposal: 

Pending a journal's adoption of a uniform policy on M Turk samples, the journal should oblige authors who do use M Turk samples to give a full account--in their paper--of why they believe it is appropriate to use M Turk workers to model the reasoning processes of ordinary members of the U.S. public. The explanation should consist of a full accounting of why the authors are not themselves troubled by the objections that have been raised to the use of such samples; they shouldn't be allowed to dodge the issue with boilerplate citations to studies that purport to "validate" such samples for all purposes, forever & ever. Such an account helps readers adjust the weight they give to study findings based on M Turk samples in two distinct ways: by flagging the relevant issues for their own critical attention; and by furnishing them with information about the depth and genuineness of the authors' commitment to reporting research findings worthy of being credited by people eager to figure out the truth about complex matters.

There are a variety of key points that authors should be obliged to address.

First, M Turk workers recruited to participate in "US resident only" studies have been shown to misrepresent their nationality. Obviously, inferences about the impact of partisan affiliations distinctive of the US general public cannot validly be drawn from samples that contain a "substantial" proportion of individuals from other societies (Shapiro, Chandler & Mueller 2013). Some scholars have recommended that researchers remove from their "US only" M Turk samples those subjects who have non-US IP addresses. However, M Turk workers are aware of this practice and openly discuss in on-line M Turk forums how to defeat it by obtaining US IP addresses for use on "US worker only" projects. If authors purport to test hypotheses about how members of the U.S. general public reason on politically contested matters, why don't they see the incentive of M Turk workers to misrepresent their nationality as a decisive objection to using them as a study sample?
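The IP screen described above amounts to a simple post-hoc filter. A minimal sketch follows; the field names, and the assumption that an IP-to-country lookup has already been run for each respondent, are mine, not the post's:

```python
# Hypothetical post-hoc screen: keep only respondents whose IP-derived
# country code is "US". Assumes each respondent record already carries
# an "ip_country" field produced by a separate geolocation lookup.
def screen_us_only(respondents):
    return [r for r in respondents if r.get("ip_country") == "US"]

sample = [
    {"worker_id": "w1", "ip_country": "US"},
    {"worker_id": "w2", "ip_country": "IN"},
    {"worker_id": "w3", "ip_country": "US"},
]
kept = screen_us_only(sample)
print([r["worker_id"] for r in kept])  # ['w1', 'w3']
```

As the paragraph notes, this screen is defeatable: a worker who obtains a US IP address sails right through it, so at best it bounds the problem rather than solving it.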

Second, M Turk workers have demonstrated by their behavior that they are not representative of the sorts of individuals that studies of political information-processing are supposed to be modeling. Conservatives are grossly under-represented among M Turk workers who represent themselves as being from the U.S. (Richey & Taylor 2012). One can easily "oversample" conservatives to generate adequate statistical power for analysis. But the question is whether it is satisfactory to draw inferences about real US conservatives generally from individuals who are doing something that only a small minority of real U.S. conservatives are willing to do. It's easy to imagine that the M Turk US conservatives (if really from the US) lack sensibilities that ordinary US conservatives normally have--such as the sort of disgust sensitivities that are integral to their political outlooks (Haidt & Hersh 2001), and that would likely deter them from participating in a "work force" a major business activity of which is "tagging" the content of on-line porn. These unrepresentative US conservatives might well not react as strongly or dismissively toward partisan arguments on a variety of issues. So why is this not a concern for the authors? It is for me, and I'm sure it would be for many readers trying to assess what to make of a study that nevertheless uses an M Turk sample.
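The "oversampling" point is just arithmetic: filling the conservative cell means inflating the total number of completes in proportion to the (low) base rate. The 15% base rate below is an illustrative assumption, not a figure from the post or from Richey & Taylor:

```python
import math

def completes_needed(target_cell_size, base_rate):
    # Total completes needed so the expected number of respondents in a
    # subgroup with the given base rate reaches the target cell size.
    return math.ceil(target_cell_size / base_rate)

# e.g., 200 self-identified conservatives at an assumed 15% base rate
print(completes_needed(200, 0.15))  # 1334
```

This buys statistical power but, as the paragraph argues, not representativeness: the 1,334th complete draws from the same unusual pool as the first.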

Third, there are in fact studies that have investigated this question and concluded that M Turk workers do not behave the way that US general population or even US student samples do when participating in political information-processing experiments (Krupnikov & Levine 2014).   Readers will care about this—and about whether the authors care.

Fourth, Amazon M Turk worker recruitment methods are not fixed and are neither designed nor warranted to generate samples suitable for scholarly research. No serious person who cares about getting at the truth would accept the idea that a particular study done at a particular time could “validate” M Turk, for the obvious reason that Amazon doesn’t publicly disclose its recruitment procedures, can change them anytime and has on multiple occasions, and is completely oblivious to what researchers care about.  A scholar who decides it’s “okay” to use M Turk anyway should tell readers why this does not trouble him or her.

Fifth, M Turk workers share information about studies and how to respond to them (Chandler, Mueller & Paolacci 2014). This makes them completely unsuitable for studies that use performance-based reasoning-proficiency measures, to which M Turk workers have been massively exposed. But it also suggests that the M Turk workforce is simply not an appropriate place to recruit subjects for any sort of study in which subject communication can contaminate the sample. Imagine you discovered that the firm you had retained to recruit your sample had a lounge in which subjects about to take the study could discuss it w/ those who had just completed it; would you use the sample, and would you keep coming back to that firm to supply you with study subjects in the future? If this does not bother the authors, they should say so; that's information that many critical readers will find helpful in evaluating their work.

I feel pretty confident M Turk samples are not long for this world for studies that examine individual differences in reasoning relating to politically contested risks and other policy-relevant facts (again, there are no doubt other research questions for which M Turk samples are not nearly so problematic).  

Researchers in this area will not give much weight to studies that rely on M Turk samples as scholarly discussion progresses.  

In addition, there is a very good likelihood that an on-line sampling resource that is comparably inexpensive but informed by genuine attention to validity issues will emerge in the not too distant future.

E.g., Google Consumer Surveys now enables researchers to field a limited number of questions for between $1.10 & $3.50 per complete-- a fraction of the cost charged by on-line firms that use valid & validated recruitment and stratification methods.

Google Consumer Surveys has proven its validity in the only way that a survey mode--random-digit dial, face-to-face, on-line--can: by predicting how individuals will actually evince their opinions or attitudes in real-world settings of consequence, such as elections. Moreover, if Google Surveys goes into the business of supplying high-quality scholarly samples, it will be obliged to be transparent about its sampling and stratification methods and to maintain them (or update them to make them even more suited for research) over time.

As I said, Amazon couldn't care less whether the recruitment methods it uses for M Turk workers now or in the future make them suited for scholarly research.

The problem right now w/ Google Consumer Surveys is that the number of questions is limited and so, as far as I can tell, is the complexity of the instrument that one is able to use to collect the data, making experiments infeasible.

But I predict that will change.

We'll see.

But in the meantime, obliging researchers who think it is "okay" to use M Turk samples to explain why they are apparently untroubled by the serious issues raised about the validity of those samples would, it seems to me, be an appropriate way to make those who use such samples internalize the cost that polluting the research environment with M Turk studies imposes on social science research on cognition and political conflict.

Refs

Chandler, J., Mueller, P. & Paolacci, G. Nonnaïveté among Amazon Mechanical Turk workers: Consequences and solutions for behavioral researchers. Behavior Research Methods 46, 112-130 (2014).

Haidt, J. & Hersh, M.A. Sexual morality: The cultures and emotions of conservatives and liberals. J Appl Soc Psychol 31, 191-221 (2001). 

Kahan, D. Fooled Twice, Shame on Who? Problems with Mechanical Turk Study Samples. Cultural Cognition Project (2013a), http://www.culturalcognition.net/blog/2013/7/10/fooled-twice-shame-on-who-problems-with-mechanical-turk-stud.html

Krupnikov, Y. & Levine, A.S. Cross-Sample Comparisons and External Validity. Journal of Experimental Political Science 1, 59-80 (2014).

Richey, S. & Taylor, B. How Representative Are Amazon Mechanical Turk Workers? The Monkey Cage (2012).

Shapiro, D.N., Chandler, J. & Mueller, P.A. Using Mechanical Turk to Study Clinical Populations. Clinical Psychological Science 1, 213-220 (2013).

Reader Comments (5)

This probably won't matter to you, because your views seem set, but:

1. Some of the best social psychological work on cognition has been done with extremely unrepresentative samples (e.g., the experiments by Tversky and Kahneman).

2. You don't know what you are talking about in regard to the comparability of MTurk to nationally representative samples. As but one example, check out this study which ran the same experiment on MTurk and Knowledge Networks/GfK Custom Research: https://www.sociologicalscience.com/articles-vol1-19-292/

Guess which sample (MTurk vs. GfK) provided higher quality data (e.g., passed comprehension checks) (see Weinberg et al., 2014, pp. 299-300)?

August 10, 2015 | Unregistered CommenterJustin

@ Justin--

Sure it matters.

1. As I've explained, my objection is not to the use of "nonrepresentative" samples; it's to *invalid* samples. There are lots of things for which M Turk samples might be valid. But for reasons I've stated, I don't see them as valid for testing hypotheses about how differences in group identity affect processing of information on risk & other culturally contested facts.

2. Thanks for the citation. I encourage others to take a look & decide for themselves what inferences the paper supports.

August 12, 2015 | Registered CommenterDan Kahan

I can't imagine anyone would contend MTurk is representative. If a paper relies on such a premise, I agree that it is problematic. But while MTurk should not be used to get point estimates, it is great for testing treatment effects. This is the case even if they involve the tasks you discuss. Using non-representative samples has proven to be extremely valuable in the decision sciences. And contrary to your claim, there are many studies that show observed behavior corresponds very well across MTurk samples and other convenience samples--behavioral patterns and treatment effects.

Note that many of the issues you raise are held constant across an experimental design, so they only matter if they are correlated with the treatment variable. And replication plays a role here. Even so, generalizability is an issue with all empirical studies, whether it is phone survey, internet survey, mail survey, field experiment, lab experiment or even secondary data (that we forget comes from dubious processes). The deal is that it has to be consistent and we have to know its limits, but we can't throw out useful methods because they aren't perfect. We would have nothing. Research is a collective process and knowledge is not generated by one study; rather it is created from different studies from different perspectives using different methods. No single method is perfect, each has strengths and weaknesses. We can do more by understanding how these complement one another than thinking one way is the only way. It kind of sounds like you wish to create barriers for others to join your research agenda.

August 19, 2015 | Unregistered CommenterKing Bostrom

Thank you for your illuminating series of posts about MTurk. I was wondering whether you would still say Google Surveys is becoming an alternative for MTurk. The practical problems have not changed, as far as I can tell. Since you wrote this, limits on survey length or complexity have not loosened. There is also no room for, say, an informed consent procedure.

Moreover, it seems that we know even less about Google Surveys respondents than we do about MTurkers. Google Surveys, too, does not disclose its recruitment methods. To me, those things signal that Google is also not interested in providing for an academic audience. I'd be interested in what you think about those issues.

January 5, 2017 | Unregistered CommenterClara

@Clara--

Nope, no evidence. Indeed, the form of questions on which they are willing to collect data is still too primitive to enable interesting experiments.

But if they do get into this mkt, they'll have to mkt themselves as professional data providers and thus do what is necessary to assemble valid general population samples. Researchers can then figure out if those methods are valid. Amazon doesn't make any pretense to being a professional data collection firm & has no commitment to recruiting MT workers w/ that objective in mind.

January 5, 2017 | Registered CommenterDan Kahan
