No, We Will Not Have Artificially Intelligent Judges or Lawyers Anytime Soon

So there’s this research project which produced an AI that can predict the outcome of court cases at the European Court of Human Rights (ECHR) with 79% accuracy. It received quite a bit of attention.

And indeed it sounds rather impressive. However, let’s not award this program the Loebner Prize just yet. There may be a lot less here than meets the eye.

I

As per the first of the links above, “the AI program worked by analyzing descriptions of court cases submitted to the ECHR. These descriptions included summaries of legal arguments, a brief case history, and an outline of the relevant legislation. […] The AI program then looked for patterns in this data, correlating the courts’ final judgements with, for example, the type of evidence submitted, and the exact part of the European Convention on Human Rights the case was alleged to violate. Aletras says a number of patterns emerged. For example, cases concerning detention conditions (eg access to food, legal support, etc.) were more likely to end in a positive judgement that an individual’s human rights had been violated; while cases involving sentencing issues (i.e., how long someone had been imprisoned) were more likely to end in acquittal.”

So, OK, the program mostly works off a set of heuristics such as “if your lawyer alleges that you were denied food in prison, you have a good chance of winning your case, but if you just claim that a six-year sentence is too harsh for what you did then you’re probably out of luck”, rather than diving into the legal subtleties of every individual case. Even so, 79% accuracy is not a bad score.

II

Or is it?

We’re talking about binary questions here: either the Court judges that a human rights violation did indeed take place, or not. Which means that you could very easily get a 50% prediction accuracy score just by flipping a coin for each case, no matter how lopsided the actual outcomes are.

And it is possible to do a lot better than that, without ever reading a single word of those case summaries which the program supposedly based its predictions on. Looking at the Court’s overview of recent judgments, it is clear that the Court rules in favor of the complainant the vast majority of the time. Just eyeballing the first page of results, I would say it does so at least 90% of the time. The researchers focused only on Articles 3, 6 and 8 of the European Convention on Human Rights, but filtering on those articles doesn’t change the picture.

(That doesn’t mean the Court is too softhearted or uncritical; it could just mean there’s a good process for weeding out hopeless cases in an early phase. Also, cases which are ruled inadmissible don’t get included in that list.)

So you could write an “AI” which does better than this one, just by having it predict “the complaint will be granted” every single time!

If you wanted to be a bit more subtle about it, you could add a randomizer. E.g. let’s say that the actual success rate of a complaint to the ECHR is 90%, and you write your program to predict a successful outcome also 90% of the time. Then your program’s expected accuracy on a set of randomly selected case summaries will be 0.9 × 0.9 + 0.1 × 0.1 = 82%. And to a not-too-critical observer, it won’t be immediately obvious what you are doing. To avoid being caught out too easily, you’d want to make sure that when the same case summary is fed into the program multiple times, it will always give the same output, but that’s easy enough.
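To make the cheat concrete, a “predictor” along those lines takes only a few lines of Python. This is purely my own illustration, under the assumption of a 90% base rate; the hashing trick is just one way to make the output deterministic per summary, and nothing like this appears in the paper.

    import hashlib

    # Illustrative do-nothing "predictor": it ignores what the case is about
    # and says "violation" 90% of the time. Hashing the summary text makes
    # the answer deterministic, so the same input always gets the same output.
    def lazy_predict(case_summary: str, violation_rate: float = 0.9) -> str:
        digest = hashlib.sha256(case_summary.encode("utf-8")).digest()
        draw = int.from_bytes(digest[:8], "big") / 2**64  # pseudo-random in [0, 1)
        return "violation" if draw < violation_rate else "no violation"

    # If 90% of real cases do end in a violation finding, the expected accuracy
    # is 0.9 * 0.9 + 0.1 * 0.1 = 0.82, without reading a single word.
    print(lazy_predict("The applicant complains about detention conditions ..."))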

Is that the trick which this program is using? Probably not — its authors seem sincere enough, I don’t think they would consciously cheat in such a blatant way. However, it would not terribly surprise me if this is effectively where the program gets its success rate from. I’m sure that if you were to look at the program’s source code, you’d see all kinds of clever stuff going on. But still, the fact that one could make an equally successful prediction system with a 10-sided die, or a parrot trained to continuously repeat the words “human rights violation!”, should give us some reassurance that this particular AI is probably not going to go Skynet on us anytime soon.

III

Alright, so I am probably being a bit too cynical about this. From reading the paper, it is clear that the program really does perform some textual analysis, using an N-gram based classifier, and it really can identify certain patterns in the text. For example, by looking for words such as “prison” and “detainee”, it can detect that the case involves the topic of “detention conditions”, which is apparently correlated with a higher probability of a violation complaint being upheld. Human-level AI it ain’t, but it appears to make a game attempt to do what it’s supposed to be doing.
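For the curious, the basic ingredient looks something like the sketch below, using scikit-learn. The toy texts and labels are invented by me purely for illustration; this is the general technique, not the authors’ actual pipeline.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Toy training data, made up for illustration only.
    texts = [
        "the applicant was held in an overcrowded prison cell without food",
        "the detainee was denied access to legal support while in detention",
        "the applicant argues that the six year sentence was disproportionate",
        "the complaint concerns the length of the prison sentence imposed",
    ]
    labels = ["violation", "violation", "no violation", "no violation"]

    # Unigrams and bigrams as features, with a linear classifier on top.
    model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LinearSVC())
    model.fit(texts, labels)

    print(model.predict(["the applicant was kept in detention without adequate food"]))

A model like this has no idea what a prison or a sentence is; it only learns that certain word combinations tend to co-occur with certain outcomes.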

And, to be fair, in their paper, the program’s authors are reasonably forthcoming about the limitations of their creation. Among other things, they point out that the program’s predictions are based on case summaries written by the judges, after they have already made their decision. Basically what they did was, they took the published judgments of cases that had already been decided, fed everything except the actual case outcome into the program, and then checked if the program’s prediction matched that outcome.

According to the authors, the predictions were based on the text of published judgments rather than lodged applications or briefs “simply because we did not have access to the relevant data set.”

If all you want is some data to test the latest version of your N-gram algorithm with, there’s nothing wrong with that approach. However, from the way this story has been reported in most of the mainstream press, one could easily get the impression that the AI can base its prediction on the documents produced by both sides’ lawyers before the judges give their verdict. That would be a lot more impressive.

(Or at least it would have been impressive, if its success rate were higher than that of a knock-knock joke with “Complaint who? Complaint granted!” as the punchline.)

The authors do their best to put a positive spin on this and express optimism that their program would do just as well if it were given the lawyers’ briefs as input. But until someone actually tests that, this remains a moderately interesting demonstration of how modern text-analysis algorithms work, and not much more than that.

IV

So, OK. A group of AI researchers does something with N-grams and clustering algorithms to analyze court summaries. I’m sure as a research project this is a perfectly valid exercise; I am not deep enough into the AI field to judge how far behind or ahead of the state of the art this is.

Then, the mainstream press runs with it and presents it as if non-human lawyers and judges are just around the corner. At first sight it looks impressive, but when you look into it you realize that the program is just looking for statistical correlations between keywords, has zero actual understanding of what it “reads”, and could in fact improve its accuracy by ignoring its input completely and always predicting a win for the complainant.

Unfortunately, that is standard.

Many, many years ago, Douglas Hofstadter wrote a chapter in his book “Fluid Concepts and Creative Analogies” about how egregiously overreported AI developments invariably are. You read an article and it appears to describe a computer understanding the subtleties of a complex story at a level which would elude most humans. Then you look into the details and you find out that the emperor is utterly naked. In some cases, this is clearly the fault of whichever journalist decided to ignore the facts in favor of a sexy headline. In other cases, one gets the impression that although the original researcher/developer may not be the origin of the journalist’s misrepresentation, they sure weren’t trying very hard to correct it.

On the one hand, that makes me rather pessimistic about the prospect that my self-driving car will be here anytime soon. On the other hand, I probably don’t need to lose too much sleep over being turned into paperclips in the near future either.

I guess I’ll count that as a win.