No, We Will Not Have Artificially Intelligent Judges or Lawyers Anytime Soon

So there’s this research project which produced an AI that can predict the outcome of court cases at the European Court of Human Rights (ECHR) with 79% accuracy. It received quite some attention.

And indeed it sounds rather impressive. However, let’s not award this program the Loebner Prize just yet. There may be a lot less here than meets the eye.

I

As per the first of the links above, “the AI program worked by analyzing descriptions of court cases submitted to the ECHR. These descriptions included summaries of legal arguments, a brief case history, and an outline of the relevant legislation. […] The AI program then looked for patterns in this data, correlating the courts’ final judgements with, for example, the type of evidence submitted, and the exact part of the European Convention on Human Rights the case was alleged to violate. Aletras says a number of patterns emerged. For example, cases concerning detention conditions (eg access to food, legal support, etc.) were more likely to end in a positive judgement that an individual’s human rights had been violated; while cases involving sentencing issues (i.e., how long someone had been imprisoned) were more likely to end in acquittal.”

So, OK, the program mostly works off a set of heuristics such as “if your lawyer alleges that you were denied food in prison, you have a good chance of winning your case, but if you just claim that a six-year sentence is too harsh for what you did then you’re probably out of luck”, rather than diving into the legal subtleties of every individual case. Even so, 79% accuracy is not a bad score.

II

Or is it?

We’re talking about binary questions here: either the Court judges that a human rights violation did indeed take place, or not. Which means that you could expect roughly 50% prediction accuracy just by flipping a coin for each case.

And it is possible to do a lot better than that, without ever reading a single word of those case summaries which the program supposedly based its predictions on. Looking at the Court’s overview of recent judgments, it is clear that the Court rules in favor of the complainant the vast majority of the time. Just eyeballing the first page of results, I would say it does so at least 90% of the time. The researchers only focused on Articles 3, 6 and 8 of the European Convention on Human Rights, but filtering on those articles doesn’t change the picture.

(That doesn’t mean the Court is too softhearted or uncritical; it could just mean there’s a good process for weeding out hopeless cases in an early phase. Also, cases which are ruled inadmissible don’t get included in that list.)

So you could write an “AI” which does better than this one, just by having it predict “the complaint will be granted” every single time!

If you wanted to be a bit more subtle about it, you could add a randomizer. Let’s say, for example, that the actual success rate of a complaint to the ECHR is 90%, and you write your program to predict a successful outcome 90% of the time as well. Then your program’s prediction accuracy on a set of randomly selected case summaries will be about 82% (0.9 × 0.9 + 0.1 × 0.1). And to a not-too-critical observer, it won’t be immediately obvious what you are doing. To avoid being caught out too easily, you’d want to make sure that when the same case summary is fed into the program multiple times, it always gives the same output, but that’s easy enough.
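
To make that concrete, here is a minimal Python sketch of such a cheating predictor. The 90% base rate is my own back-of-the-envelope figure from eyeballing the judgment list, not a number from the paper, and the function name is made up; deriving the decision from a hash of the summary is what keeps the output deterministic.

```python
import hashlib

# Assumed base rate of "violation found" judgments at the ECHR; this is my
# rough estimate from eyeballing the published judgment list, not a figure
# from the paper.
BASE_RATE = 0.90

def predict(case_summary: str) -> str:
    """Predict the outcome without actually reading the summary.

    Hashing the text makes the prediction deterministic: feeding in the
    same summary twice always gives the same answer, which makes the
    trick harder to spot.
    """
    digest = hashlib.sha256(case_summary.encode("utf-8")).digest()
    pseudo_random = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return "violation" if pseudo_random < BASE_RATE else "no violation"
```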

Is that the trick which this program is using? Probably not — its authors seem sincere enough, I don’t think they would consciously cheat in such a blatant way. However, it would not terribly surprise me if this is effectively where the program gets its success rate from. I’m sure that if you were to look at the program’s source code, you’d see all kinds of clever stuff going on. But still, the fact that one could make an equally successful prediction system with a 10-sided die, or a parrot trained to continuously repeat the words “human rights violation!”, should give us some reassurance that this particular AI is probably not going to go Skynet on us anytime soon.

III

Alright, so I am probably being a bit too cynical about this. From reading the paper, it is clear that the program really does perform some textual analysis, using an N-gram based classifier, and it really can identify certain patterns in the text. For example, by looking for words such as “prison” and “detainee”, it can detect that the case involves the topic of “detention conditions”, which is apparently correlated with a higher probability of a violation complaint being upheld. Human-level AI it ain’t, but it appears to make a game attempt to do what it’s supposed to be doing.
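
For the curious, the sketch below shows roughly what an n-gram text classifier of that general kind looks like, using scikit-learn. The two toy “cases” and their labels are made up, and the paper’s actual feature set, model and training data are of course different, so treat this purely as an illustration of the idea rather than a reconstruction of the authors’ system.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Two invented mini-"cases" and invented labels, purely for illustration.
train_texts = [
    "the applicant was held in a prison cell without access to food",
    "the applicant complains that a six year sentence was excessive",
]
train_labels = ["violation", "no violation"]

# Unigrams and bigrams, so phrases like "prison cell" become features too.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(train_texts, train_labels)

print(model.predict(["the detainee was denied legal support in detention"]))
```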

And, to be fair, in their paper, the program’s authors are reasonably forthcoming about the limitations of their creation. Among other things, they point out that the program’s predictions are based on case summaries written by the judges, after they have already made their decision. Basically what they did was, they took the published judgments of cases that had already been decided, fed everything except the actual case outcome into the program, and then checked if the program’s prediction matched that outcome.

According to the authors, the predictions were based on the text of published judgments rather than lodged applications or briefs “simply because we did not have access to the relevant data set.”

If all you want is some data to test the latest version of your N-gram algorithm with, there’s nothing wrong with that approach. However, from the way this story has been reported in most of the mainstream press, one could easily get the impression that the AI can base its prediction on the documents produced by both sides’ lawyers before the judges give their verdict. That would be a lot more impressive.

(Or at least it would have been impressive, if its success rate was higher than that of a knock-knock joke with “Complaint who? Complaint Granted!” as the punchline.)

The authors do their best to put a positive spin on this and express optimism that their program would do just as well if it were given the lawyers’ briefs as input. Until someone actually tests that, however, this remains a moderately interesting demonstration of how modern text analysis algorithms work, and not much more than that.

IV

So, OK. A group of AI researchers does something with N-grams and clustering algorithms to analyze court summaries. I’m sure as a research project this is a perfectly valid exercise; I am not deep enough into the AI field to judge how far behind or ahead of the state of the art this is.

Then, the mainstream press runs with it and presents it as if non-human lawyers and judges are just around the corner. At first sight it looks impressive, but when you look into it you realize that the program is just looking for statistical correlations between keywords, has zero actual understanding of what it “reads”, and could in fact improve its accuracy by ignoring its input completely and always predicting a win for the complainant.

Unfortunately, that is standard.

Many, many years ago, Douglas Hofstadter wrote a chapter in his book “Fluid Concepts and Creative Analogies” about how egregiously overreported AI developments invariably are. You read an article and it appears to describe a computer understanding the subtleties of a complex story at a level which would elude most humans. Then you look into the details and you find out that the emperor is utterly naked. In some cases, this is clearly the fault of whichever journalist decided to ignore the facts in favor of a sexy headline. In other cases, one gets the impression that although the original researcher/developer may not be the origin of the journalist’s misrepresentation, they sure weren’t trying very hard to correct it.

On the one hand, that makes me rather pessimistic about the prospect that my self-driving car will be here anytime soon. On the other hand, I probably don’t need to lose too much sleep over being turned into paperclips in the near future either.

I guess I’ll count that as a win.

Adding Pandoc to Gollum

The nice thing about Markdown is that it’s a very lightweight markup language which lets you write files in any text editor, and those files look pretty normal when read as plain text, because they mostly use the kind of idioms which people commonly use in text documents anyway. Yet you can run them through a simple converter to generate HTML or PDF output with all kinds of fancy formatting.

The not-so-nice thing about Markdown is that approximately seven seconds after its original creation, it split into a kazillion subtly different flavours, variants and dialects. There’s Original Markdown, PHP-Markdown (a.k.a. Markdown Extra), GitHub Flavored Markdown, Pandoc Markdown, and a whole bunch of others. So whenever you move from one tool to another, you have to learn and unlearn a few tricks, and fix some subtle breakage in any content files you try to take with you.

So, if you have free choice to pick any of the many command-line tools for converting Markdown to, say, HTML, which one should you choose? That one’s easy: Pandoc. It’s amazing. It can convert to and from pretty much every markup format ever invented. Its own default flavour of the Markdown syntax is incredibly complete (you can literally write books in it), and pretty much a superset of all of the others, but you can selectively disable features, or even put it into ‘strict’ mode where it will behave almost exactly like John Gruber’s original version.
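
To make the conversion part concrete, here is a small Python sketch that shells out to Pandoc. It assumes a reasonably recent Pandoc, where the input format `markdown` selects Pandoc’s own extended dialect and `markdown_strict` selects the Gruber-compatible one; the wrapper function name is mine.

```python
import subprocess

def markdown_to_html(text: str, strict: bool = False) -> str:
    """Convert Markdown to HTML by calling the pandoc executable."""
    source_format = "markdown_strict" if strict else "markdown"
    result = subprocess.run(
        ["pandoc", "--from", source_format, "--to", "html"],
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout

print(markdown_to_html("Some *emphasis*, a [link](http://example.com), and `code`."))
```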

The main disadvantage of Pandoc is that it’s addictive: after writing some documents in Pandoc Markdown, going back to one of the more limited variants feels like going back to a Punto after zipping around in a Ferrari for a while. This commonly happens when you are using a blog or wiki which lets users write pages in Markdown.

The obvious solution, if you happen to be the maintainer of that blog or wiki site, is to configure it to use Pandoc instead of some lesser Markdown-to-HTML converter. (If you’re not, you will just need to suck it up, I’m afraid.)

HTML5 Micro-app for Blogging From the iPad With Textastic and Octopress

I switched to using Octopress for my blog a few months ago. Octopress, as you may be aware, is a Jekyll derivative, which means that you can maintain your blog posts offline (typically in a Git repository) as a series of text files in a simple markup format such as Markdown or reStructuredText, and then generate plain HTML files which you deploy on your webserver.

I am running “rake watch” on the server, so that whenever I modify a file (via ssh, for example), it gets immediately picked up and the necessary pages get regenerated. Or I can write a page on a separate machine, commit it in a local git repository, and push the change to the server. So far, so good. However, I would like to be able to write blog posts on my iPad as well, preferably using Textastic, which is the best editor I’ve discovered for such purposes so far.

One of the things which makes Textastic nice is that it has a bunch of options for easily synchronizing with a remote server: WebDAV and SFTP, among others. So I can create a file on the iPad and then easily copy it to the server. (Textastic doesn’t understand Git, but that’s a different topic.)

So what’s the remaining problem?

OK, I Give Up. You Can Have My Privacy.

Any day now, I expect Google to release their latest product: Google Toilet Paper™.

It will be delivered to your doorstep for free, the production costs covered by targeted ads printed on the paper in biodegradable ink. It will be softer and nicer than any competing product (except Apple’s, but that’s quite expensive and only works in Apple Bathroom™s), and it will have several unique features which I’m not smart enough to anticipate but which will surely revolutionise the toilet paper industry overnight.

Inside every roll, there will be an RFID chip which can be read by the RFID reader in the next generation of Android phones. There will also be a simple rotation counter connected to the chip, so that the roll knows how often you use paper when you visit the toilet, and how much paper you use each time. All this information will be automatically uploaded to Google the next time you sync your Gmail account with your phone. By matching the pattern of your toilet visits to the e-mails and search queries of people with similar patterns, Google will then have a pretty good idea of any bowel-related afflictions you may have, often long before you do.

RIP Benoit Mandelbrot

Benoit Mandelbrot died a few days ago, at the age of 85. He is best known for discovering and popularizing the concept of fractals, in particular the Mandelbrot set and its close relative the Julia set.

In doing so, Mandelbrot gave the world much more than just a way to create pretty pictures and psychedelic animations. The realisation that such a literally infinite amount of complexity, not to mention beauty, could be created with a formula which you can write on the back of your hand, influenced thinkers in just about every field from philosophy to economics and from physics to biology.

As a little tribute to the man, if you are a programmer, I hereby encourage you to implement his famous fractal in your favourite programming language. It’s really simple; in many programming environments, opening a canvas to draw the pixels onto will be more work than actually calculating the fractal.
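
If you want a starting point, here is one quick-and-dirty way to do it in Python, printing the set as ASCII art, which conveniently sidesteps the canvas problem altogether:

```python
def mandelbrot(width=78, height=24, max_iter=40):
    for row in range(height):
        line = []
        for col in range(width):
            # Map the character grid onto the region [-2.5, 1.0] x [-1.25, 1.25]
            # of the complex plane.
            c = complex(-2.5 + 3.5 * col / width, -1.25 + 2.5 * row / height)
            z = 0j
            for _ in range(max_iter):
                z = z * z + c
                if abs(z) > 2:
                    line.append(" ")  # escaped, so the point is outside the set
                    break
            else:
                line.append("*")      # never escaped: (probably) inside the set
        print("".join(line))

mandelbrot()
```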

Silly Workaround for Android 2.2 Not Importing vCard Files via Bluetooth

I bought myself a new phone yesterday: an HTC Desire HD, running Android 2.2. On the whole, I’m extremely happy with it — especially since my previous phone was an LG Windows Mobile piece of junk which didn’t really deserve the name ‘smartphone’. There are a few quibbles, however.

One of them is this: I had a lot of contacts stored in the phone memory of the LG rather than on the SIM card, so naturally I wanted to transfer them to its successor, preferably via Bluetooth. The HTC actually offered that option during the initial setup, but only for a limited number of models, mine not included.

But, fear not: the LG offered the option to transmit contacts to another phone via Bluetooth. One at a time, unfortunately, but hey — it didn’t take that long, and it gave me the opportunity to clean out some contacts. After perhaps five minutes of diligent Bluetoothing, all the contacts I still cared about seemed to have been transferred, with the Android phone happily saying something along the lines of “transfer received successfully!” for each one.

And after that, they turned out to have disappeared.

Upgraded to Ubuntu 10.10

The upgrade to Ubuntu 10.10, “Maverick Meerkat”, went smoothly enough. Except for one thing: Grub, as always. For some reason, Grub really hates me: for the last three Ubuntu upgrades, every time I rebooted after the installation, it failed to boot. This despite the fact that I have a very straightforward setup: no dual-boot or anything, just a single Ubuntu installation on the machine. The error message this time was something along the lines of “unknown function”; I didn’t write it down exactly.

Martin Starts Link-blogging

When I started this blog, it was with the intention to write a post only when I had something interesting to say. Well, given my deplorably low posting frequency over the past year, I guess we have found out how interesting a person I really am.

So I am going to, every now and then, write a blog post just to share some links I found on the web recently which happened to catch my eye.

About Chess and Nuclear Reactors: The Case for Exception Handling

The world of software development has more than its fair share of topics where people tend to have long religious discussions about the “correct” way to do something. I think this is partly because the field for some reason attracts the kind of person who enjoys a nice bout of verbal fisticuffs, and partly because we spend a lot of time dealing with very abstract topics where the pros and cons of a given choice have more to do with differing philosophies than with objective facts.

One classic topic for this kind of discussion, which came up recently at work, is the use of exceptions for error handling. Nearly every modern programming language offers an exception mechanism for this purpose, and presumably it is there to be used. However, ever since exceptions were first introduced, there has been a large and vocal subset of the community arguing that they do more harm than good, and that you’ll be writing better code if you just use good old return values to report whether a method succeeded.

One representative example comes from Joel Spolsky, one of my favorite authors. Another oft-quoted article making the same arguments is found in the “Frequently Questioned Answers” by Yossi Kreinin. They both make the same basic points: exceptions do not reduce complexity but merely hide it, and when complexity is hidden people tend to forget about it.

These arguments have merit, but I still feel that (when properly used) exception handling delivers enough value to be worth the cost. So I am going to be arguing for the status quo here, for a change. Executive summary: the dangers of exceptions are real, but code readability trumps almost everything.
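
To illustrate the readability point, here is a deliberately contrived, runnable Python example (all names are made up) showing the same trivial operation written in both styles:

```python
PRICES = {"tea": 2, "coffee": 3}

# Return-value style: every caller must remember to check the status flag,
# and the happy path is buried in plumbing.
def get_price_rc(item):
    if item not in PRICES:
        return False, None
    return True, PRICES[item]

def order_total_rc(items):
    total = 0
    for item in items:
        ok, price = get_price_rc(item)
        if not ok:
            return False, None  # easy to forget, and silently wrong if you do
        total += price
    return True, total

# Exception style: the happy path reads straight through, and error handling
# can live in one place further up the call stack.
def get_price_ex(item):
    try:
        return PRICES[item]
    except KeyError:
        raise ValueError(f"unknown item: {item}")

def order_total_ex(items):
    return sum(get_price_ex(item) for item in items)

print(order_total_rc(["tea", "coffee"]))  # (True, 5)
print(order_total_ex(["tea", "coffee"]))  # 5
```

The return-code version is more explicit about where things can fail, which is exactly the point Spolsky and Kreinin are making; the exception version is the one you can actually read at a glance.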

LASEK

This is a blog post about something which took place six months ago, so it could be considered somewhat belated. On the other hand, that gives me the opportunity to give the full story in one go, rather than just posting “well, I had the operation two hours ago and they didn’t actually blow up my eyeball, so I guess it could have been worse, but I’m not really supposed to be staring at a computer screen just now and anyway I am doped up on painkillers so I’m leaving now, okthxbye.”