Friday, January 2, 2015

Humans using statistical models are embarrassingly bad at predicting Supreme Court decisions....

[Figure at top of post: Part of Lexy's brain. From Ruger et al. (2004).]

Demoralizingly (for some people; I don't mind!), computers have defeated us humans in highly discerning contests of intellectual acuity such as chess and Jeopardy.

But what about prediction of Supreme Court decisions?  Can we still at least claim superiority there?

Well, you tell me.

In 2002-03, a group of scholars organized a contest between a computer and a diverse group of human "experts" drawn from private practice and the academy (Ruger, Kim, Martin & Quinn 2004).

Political scientists have actually been toiling for quite a number of years to develop predictive models for the Supreme Court.  The premise of their models is that the Court’s decisionmaking can be explained by “ideological” variables (Edwards & Livermore 2008).

In the contest, the computer competitor, Lexy (let’s call it), was programmed using the field’s state-of-the-art model, which in effect tries to predict the Court's decisions based on a combination of variables relating to the nature of the case and the parties, on the one hand, and the ideological affinity of individual Justices as reflected by covariance in their votes, on the other.

For this reason, the contest could have been seen (and often is described) as one that tested the political scientists' "ideology thesis" against "formal legal reasoning."

But in fact, that's a silly characterization, since the informed professional judgment of genuine Supreme Court experts would certainly reflect the significance of "Justice ideology" along with all the other influences on the Court’s decisionmaking (Margolis 1987, 1996; Llewellyn, 1960).

In any case, Lexy trounced those playing the role of “experts” in this contest.  The political scientists' model correctly predicted the outcome in 75% of the decisions, while the experts collectively managed only 59% correct . . . .

The result was widely heralded as a triumph both for algorithmic decisionmaking over expert judgment and for the political scientists' "ideology thesis."

But here’s the problem: while Lexy “did significantly better at predicting outcomes than did the experts” (Ruger et al. 2004, p. 1152), Lexy did not perform significantly better than chance!

The Supreme Court’s docket is discretionary: parties who’ve lost in lower courts petition for review, and the Court decides whether to hear their cases.

It rejects the vast majority of review petitions—96% of the ones on the "paid" docket and 99% of those on the “in forma pauperis” docket, in which the petitioner (usually a self-represented prisoner) has requested waiver of the filing fee on grounds of economic hardship.

Not surprisingly, the Court is much more likely to accept for review a case in which it thinks the lower court has reached the wrong result.

Hence, the Court is far more likely to reverse than to affirm the lower court decision. It is not unusual for the Court to reverse in 70% of the cases it hears in a Term (Hofer 2010).  The average Supreme Court decision, in other words, is no coin toss!

Under these circumstances, the way to test the predictive value of a statistical model is to ask how much better someone using the model would have done than someone uniformly picking the most likely outcome--here, reversal-- in all cases (Long 1997, pp. 107-08; Pampel 2000, p. 51).
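Here is a minimal sketch of that kind of baseline comparison (in Python; the function names are mine, not anything from those texts). One conventional way to express it is as a "proportional reduction in error": the share of the no-model strategy's mistakes that the model actually eliminates.

```python
def modal_baseline_accuracy(outcomes):
    """Accuracy of the no-model strategy: always predict the most common outcome."""
    counts = {o: outcomes.count(o) for o in set(outcomes)}
    return max(counts.values()) / len(outcomes)

def proportional_reduction_in_error(model_accuracy, baseline_accuracy):
    """Share of the baseline's errors that the model eliminates."""
    return ((1 - baseline_accuracy) - (1 - model_accuracy)) / (1 - baseline_accuracy)

# Purely hypothetical illustration: 100 outcomes, 72 of them reversals.
outcomes = ["reverse"] * 72 + ["affirm"] * 28
baseline = modal_baseline_accuracy(outcomes)              # 0.72
print(proportional_reduction_in_error(0.80, baseline))    # ~0.286 for an 80%-accurate model
```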

In the year in which Lexy squared off against the experts, the Court heard only 68 cases.  It reversed in 72% of them. 

Thus, a non-expert who knew nothing more than that the Supreme Court reverses in a substantial majority of its cases, and who simply picked "reverse" in every case, would have correctly predicted 72% of the outcomes. The margin between her performance and Lexy's 75% -- a grand total of two correct predictions -- doesn't differ significantly (p = 0.58) or practically from zero.
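For the curious, here is a sketch of one way to get a p-value in that neighborhood -- a simple one-sample, normal-approximation test of Lexy's roughly 51-of-68 hit rate against the 72% "always reverse" benchmark (an exact binomial test comes out similarly unimpressive):

```python
from math import erf, sqrt

n = 68              # cases decided that Term
k_model = 51        # Lexy's ~75% hit rate, expressed as correct predictions
p_baseline = 0.72   # accuracy of uniformly predicting "reverse"

# Normal-approximation test of H0: the model's true hit rate equals the baseline rate.
z = (k_model - n * p_baseline) / sqrt(n * p_baseline * (1 - p_baseline))
p_two_sided = 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))
print(round(z, 2), round(p_two_sided, 2))   # roughly z = 0.55, p = 0.58
```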

A practical person, then, wouldn't bother to use Lexy instead of just uniformly predicting "reverse."

None of the “holy cow!” write-ups on Lexy’s triumph—which continue to this day, over a decade after the contest—mentioned that the algorithm used by Lexy had no genuine predictive value.

But to be fair, the researchers didn't mention that either.

They noted only in footnote 82 of their 60-page, 122-footnote paper that the Supreme Court had reversed in 72% of its cases. And even there they didn't acknowledge that the predictive superiority of their model over the "72% accuracy rate" one would have achieved by simply picking "reverse" in all cases was equivalent—practically and statistically—to chance.

Instead, they observed that the Court had in some "recent" Terms reversed in less than 70% of the granted cases. Those previous Terms were in fact the source of the researchers' "training data"--the cases they used to construct their statistical model.

They don't say, but one has to believe that their "model" did a lot better than 75% accuracy when it was "fit" retrospectively to those Terms' cases--or else the researchers would surely have tinkered with its parameters all the more. But that the resulting model performed no better than chance (i.e., than someone uniformly picking "reverse," the most likely result in the training data) when applied prospectively to a new sample is a resounding verdict of "useless" for the algorithm the researchers derived by those means.
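The general point is easy to demonstrate with a toy simulation (purely illustrative; random "features" that have no relationship to the outcome, nothing to do with the Ruger et al. variables): an unconstrained model can be fit until it looks brilliant on its training data and still add nothing, prospectively, over just predicting the modal outcome.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Toy data: 10 meaningless "features" and a binary outcome ("reverse" = 1)
# that occurs about 72% of the time, mimicking the Court's base rate.
X = rng.normal(size=(500, 10))
y = rng.binomial(1, 0.72, size=500)
X_train, y_train, X_test, y_test = X[:400], y[:400], X[400:], y[400:]

# An unconstrained tree can be "tuned" until it fits the training data nearly perfectly...
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("in-sample accuracy:     ", model.score(X_train, y_train))   # ~1.0

# ...but prospectively it typically does no better than the no-model strategy
# of predicting the modal outcome ("reverse") in every case.
print("out-of-sample accuracy: ", model.score(X_test, y_test))
print("always-reverse baseline:", y_test.mean())
```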

Sure, the “experts” did even worse, and that is embarrassing: it means they'd have done better by not "thinking" and instead just picking "reverse" in all cases, the strategy a non-expert possessed only of knowledge of the Court's lopsided proclivity to overturn lower court decisions would have selected.

But the "experts'" performance—a testament perhaps to their hubristic overconfidence in their abilities but also largely to the inclusion of many law professors who had no general specialty in Supreme Court decisionmaking—doesn't detract from the conclusion that the statistical model they were up against was a complete failure.

What’s more, I don’t think there’s anything for Lexy or computers generally to feel embarrassed about in this matter. After all, a computer didn’t program Lexy; a group of humans did.

The only thing being tested in that regard was the adequacy of the "ideology thesis" prediction model developed by the political scientists.

That the researchers who published this study either didn’t get or didn’t care that this model was shown to perform no better than chance makes them the ones who have the most to be embarrassed about.

References 

Edwards, H.T. & Livermore, M.A. Pitfalls of empirical studies that attempt to understand the factors affecting appellate decisionmaking. Duke LJ 58, 1895-1989 (2008).

Hofer, R. Supreme Court Reversal Rates: Evaluating the Federal Courts of Appeals. Landslide 2, 10 (2010).

Llewellyn, K.N. The Common Law Tradition: Deciding Appeals (1960).


Long, J.S. Regression Models for Categorical and Limited Dependent Variables (Sage Publications, Thousand Oaks, 1997).

Margolis, H. Dealing with Risk: Why the Public and the Experts Disagree on Environmental Issues (University of Chicago Press, Chicago, IL, 1996).

Margolis, H. Patterns, Thinking, and Cognition (1987).

Pampel, F.C. Logistic Regression: A Primer (Sage Publications, Thousand Oaks, Calif., 2000).

Ruger, T.W., Kim, P.T., Martin, A.D. & Quinn, K.M. The Supreme Court Forecasting Project: Legal and Political Science Approaches to Predicting Supreme Court Decisionmaking. Columbia Law Rev 104, 1150-1210 (2004).


Reader Comments (12)

I think people often don't understand that the metric to beat is often not a "50/50 coin toss" but rather "historic performance" or "persistence." Both of the latter are useful for weather prediction. I think weather modelers do hold models to a standard of needing to be better than "I predict tomorrow's weather will match today's."

Oddly, those who developed "Lexy" appear to have known they were using past performance to predict future decisions, yet didn't report how the simplest possible model would have done relative to their complicated one.

January 4, 2015 | Unregistered Commenterlucia

@Lucia-- I admit it can actually be pretty hard to add value w/ a predictive model when all the "value added" has to be extracted from the 30% that someone w/o any model (besides "uniformly predict most likely outcome") would fail to get right. But I can certainly think of contexts where one is trying to correctly classify cases that reflect an even more lopsided distribution of binary outcomes. Think of trying to screen people for a rare disease. There you likely will be willing to let Mr./Ms. Nonexpert beat you in predicting the "negative" cases (thus accepting a higher rate of "false positive" diagnoses) in order to get an advantage over him or her in predicting the "positive" ones (i.e., having fewer "false negatives" than if you just predicted everyone is free of the disease); the model might not have predictive accuracy much higher than chance but still have a "sensitivity" advantage that is really worthwhile. I don't see why that would be so, though, if one is predicting S Ct results, at least in general (maybe someone in a particular case is much more worried about incorrectly predicting "reverse" than incorrectly predicting "affirm").
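A toy illustration of that tradeoff (the numbers are invented purely for the example):

```python
# Screening 10,000 people for a disease with 1% prevalence.
population = 10_000
sick = 100
healthy = population - sick

# The "nonexpert" strategy -- call everyone disease-free -- is 99% accurate
# but has zero sensitivity: it misses every sick person.

# A hypothetical screening test: catches 90 of the 100 sick people,
# at the cost of falsely flagging 600 healthy ones.
true_pos, false_pos = 90, 600
true_neg = healthy - false_pos

accuracy = (true_pos + true_neg) / population   # 0.939 -- *lower* than 0.99
sensitivity = true_pos / sick                   # 0.90 -- vs. 0 for the nonexpert
print(accuracy, sensitivity)
```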

January 4, 2015 | Registered CommenterDan Kahan

==> "The Supreme Court’s docket is discretionary: parties who’ve lost in lower courts petition for review, and the Court decides whether to hear their cases....Not surprisingly, the Court is much more likely to accept for review a case in which it thinks the lower court has reached the wrong result."

My guess is that expert predictions (or computer algorithms) would do better than chance if asked to pick which cases SCOTUS will take up in the first place (i.e., where it thinks the lower court reached the wrong result).

January 4, 2015 | Unregistered CommenterJoshua

@Joshua--

that would be an even *easier* one for Mr/Ms Nonexpert to win! Or in any case, a *really* hard one to model given that only 4% of paid cert petns get granted.

they say "cases of interest to business community," particularly where it favors reversal, have inside track, but it's still a very very very narrow lane.

But I'm pretty sure SCOTUS blog makes lots of forecasts.

& for sure Linda Greenhouse could beat even 96% accurate Mr/Ms Nonexpert

January 4, 2015 | Registered CommenterDan Kahan

@Joshua, @DK

I think the algorithm "cert denied, unless there's a circuit split or the court below declared a federal law unconstitutional" would do significantly better than chance (although you'd still miss a bunch of the cert grants, of course). Maybe add "or if SG advocates for cert." SCOTUSblog's "petitions to watch" feature is great, but I still think most of those get denied. Greenhouse knows all, of course.

January 4, 2015 | Unregistered CommenterMW

@MW--

I think I'd bet on Nonexpert if the competitor were a computer whose model was "deny unless circuit conflict."

Nearly all the petns *assert* there is a conflict, precisely b/c that's the predominant basis on which today's Ct -- which is shockingly lazy in comparison to predecessors -- seems to grant review.

So obviously the computer would need to be programmed to distinguish real from bogus conflicts -- a programming task that I'm guessing would rival the complexity of teaching a machine to recognize human faces.

In addition, the Ct *doesn't* invariably grant review when there is a *genuine* conflict. Maybe sometimes the "cert pool" clerk (more laziness: the Justices can't even be bothered to take individual responsibility for assessing all the review petns independently) just misses one (it's easier for a clerk to recommend "deny" than justify "grant"). Moreover, the Ct sometimes concludes that it would rather wait for some other case presenting the conflict -- maybe b/c they think the lower court decided correctly anyway in the case at hand.

So the computer's model would need predictors for that too.

B/c it would need a model for identifying "conflicts" the resolution of which justifies granting a particular petn, I think a computer that used model "deny unless circuit conflict" would *grossly* overestimate # of grants.

Some empirics: S Ct litigation specialists *use* computer programs like this already to identify cases that they think are good candidates for review. They troll lower court cases w/ automated search routines that look for acknowledgment of a "conflict" in the lower court opinion or for cases deciding particular issues they know already have given rise to conflict. Then they call the losing party & offer their services.

Their search routines are pretty finely tuned; there's no point soliciting petns that are likely losers (unless someone is willing to pay you a lot). The whole point of looking for "likely grants" is to find cases where the publicity of having the rare prize of an argued case justifies pro bono or steeply discounted rates; lots of people will then pay you a ton to write a hopeless review petn!

Nevertheless, even when those attnys think they have a "really good" chance, the probability of a grant is substantially less than 50%.

January 5, 2015 | Registered CommenterDan Kahan

==> " Or in any case, a *really* hard one to model given that only 4% of paid cert petns get granted."

Well, it seems to me that, asked to predict which cases SCOTUS would take on in the upcoming year, Mr/Mrs non-expert-picking-by-chance and Mr/Mrs expert/computer-picking-by-algorithm-or-expertise would both be wrong most of the time, but the experts would be wrong less.

January 5, 2015 | Unregistered CommenterJoshua

BTW: I've been interested in a GA supreme court case (Chan v. Ellis). When I read this I couldn't help wondering about the % of reversals in State Supreme Courts -- and GA in particular.
It seems plausible that the likelihood a case will be heard by an upper court tends to improve when, at first reading, the lower court ruling appears tenuous to a judge. This should tend to result in more than 50% reversals in many upper courts. I recognize things can be 'upheld in part/reversed in part' and so on. But the idea that reversals are favored seems plausible. Has anyone tabulated this for courts in general?

January 5, 2015 | Unregistered Commenterlucia

@DK,

Fair points, all around. On the need for the machine to determine which circuit conflicts are genuine: don't Ruger et al. hand-code their variables? I don't think they've programmed their machine to recognize whether a decision is liberal or conservative, or what the primary issue is (I'm trying to think of how they would classify, say, NFIB v. Sebelius). I suppose whether there's a circuit split is a more subtle determination, but might it be possible to figure out whether there's really a conflict from the briefs? That's what the Court has to do, right? But yes, if they deny a substantial number of genuine conflicts, the model would overestimate grants.

January 5, 2015 | Unregistered CommenterMW

@MW:

Yes, Lexy characterized decision outcomes in ideological terms. See the "decision tree" diagram at the top of the post.

January 5, 2015 | Registered CommenterDan Kahan

@DK:

Lexy used ideological characterization in the tree, but the researchers, not Lexy, coded the decision as liberal or conservative, no? See http://wusct.wustl.edu/data.php, Article p. 1163 n.45, p. 1165, p. 1167, etc.

January 5, 2015 | Unregistered CommenterMW

@Mw:

Yes, of course. I misunderstood you.

Lexy 2.0 would code the decisions herself; indeed, she would just program herself using the sort of A.I. techniques that helped Watson become a champion at Jeopardy ("You say 'what is cheesecake' is not the answer to 'Manhattan Project'? Okay, thanks! I'll update.")

When this happens, it will warrant fawning, out-of-breath writeups on 538.com by writers whose credulousness will not be exploited by people who use statistics to deter rather than engage critical reflection.

January 6, 2015 | Registered CommenterDan Kahan