Blackett 2015: Kenneth Cukier 'In Defence of Big Data'


On the face of it, this seems a rather strange title for a talk to The OR Society, but Kenneth Cukier, Data Editor of The Economist, who presented this year’s Blackett Memorial Lecture, soon made it clear that it was actually a very appropriate one.

Traditionally, OR has been about identifying and understanding the problem, deciding what data is needed, collecting it and making sure it is as ‘clean’ as possible, then attempting to isolate the causes. Big data, on the other hand, is not particularly concerned with a specific problem, and it does not matter if the data is messy provided there is lots of it. If correlations can be discerned from the data, it may not be critical whether A causes B, B causes A, or A and B are not causally linked at all. If the data suggests that people tend to buy more ‘Pop Tarts’ just before the weather turns bad, then it makes sense to stack the shelves when bad weather is forecast: if they buy more, great; if not, it is no great loss. Sometimes it does not matter if there is no causal relationship.

In the lead-up to the 2008 Presidential Election, the Obama campaign had to choose which of three or four candidate website homepages would be the most effective. As it turned out, the most informal one ‘won’; no one knows why, but it need not matter.
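The idea of picking a winner purely on measured results, without any theory of why it wins, can be sketched in a few lines of Python. This is only an illustration: the variant names and visitor figures below are hypothetical, not the campaign's data, and a real test would also check that the difference is statistically significant.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    """One candidate homepage and the traffic it received (hypothetical)."""
    name: str
    visitors: int
    signups: int

    @property
    def rate(self) -> float:
        # Fraction of visitors who signed up; 0.0 if nobody saw the page.
        return self.signups / self.visitors if self.visitors else 0.0


def best_variant(variants):
    """Return the variant with the highest measured sign-up rate."""
    return max(variants, key=lambda v: v.rate)


# Hypothetical figures, invented for this sketch.
variants = [
    Variant("formal", 10_000, 820),
    Variant("informal", 10_000, 1_140),
    Variant("video", 10_000, 910),
]

winner = best_variant(variants)
print(winner.name, f"{winner.rate:.1%}")  # prints: informal 11.4%
```

Nothing in the code asks why the informal page does better; it only records that it does.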

Importantly, the effectiveness of the different homepages was measurable: the best-performing image and link drew in $60 million more in contributions than the worst-performing one.

Where big data has proved immensely successful is in areas involving machine learning. If you write an algorithm that tries to anticipate every question, there are very few cases in which you will succeed. Even in a simple game such as checkers (draughts), if you code all the rules and all the moves you can think of, the chances are you will still be able to beat the machine. But if you write an algorithm that allows the machine to work out the best play for itself, and let it play itself hundreds, thousands, even millions of times, you will have produced a truly formidable player.
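To give a flavour of what learning by self-play means, here is a minimal Python sketch. It is not the checkers program mentioned in the lecture: it swaps in a much smaller game (players alternately take 1 to 3 counters from a pile, and whoever takes the last counter wins) so that the whole thing fits on a page. The function names, the number of games and the exploration rate are all assumptions made for this illustration.

```python
import random
from collections import defaultdict

MAX_TAKE = 3  # a player may remove 1, 2 or 3 counters per turn


def legal_moves(pile):
    """Moves available when `pile` counters remain."""
    return range(1, min(MAX_TAKE, pile) + 1)


def choose(q, pile, epsilon, rng):
    """Mostly play the best-known move, but sometimes explore at random."""
    moves = list(legal_moves(pile))
    if rng.random() < epsilon:
        return rng.choice(moves)
    return max(moves, key=lambda m: q[(pile, m)])


def train(episodes=20_000, start=10, epsilon=0.2, seed=0):
    """Let the program play itself and average the results of each move."""
    rng = random.Random(seed)
    q = defaultdict(float)  # average result seen after (pile, move)
    n = defaultdict(int)    # how often each (pile, move) was tried
    for _ in range(episodes):
        pile = rng.randint(1, start)
        history = []  # moves in order; the two sides alternate
        while pile > 0:
            move = choose(q, pile, epsilon, rng)
            history.append((pile, move))
            pile -= move
        # Whoever made the final move took the last counter and won (+1).
        # Walking backwards, the result flips sign at every step.
        result = 1.0
        for key in reversed(history):
            n[key] += 1
            q[key] += (result - q[key]) / n[key]
            result = -result
    return q


def best_move(q, pile):
    """The move with the highest learned average result."""
    return max(legal_moves(pile), key=lambda m: q[(pile, m)])


q = train()
for pile in range(1, 8):
    print(pile, "counters left -> take", best_move(q, pile))
```

No strategy is written into the program beyond the rules of the game; whatever preferences it ends up with come from the results of its own games.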

Spell-checkers, grammar checkers and translators are other areas that greatly benefit from a big data approach. Microsoft used four different algorithms to carry out grammar checking. The one which performed best when trained on a small sample of cases performed worst with large data; in fact, it improved only a little. But the one which performed worst with small data (half a million words) performed far better than the others when given large data (one billion words).

In another example, Kenneth explained how machine learning had identified twelve critical markers when looking for cancerous cells. The experts had previously recognised only nine of these.

In the case of cars, researchers in Tokyo fitted a seat with hundreds of sensors to identify people via their posture. It could be used as an anti-theft device, checking whether the person sitting in the seat is the legal owner. If every car were so fitted, it might be possible to determine when a driver was falling asleep or not paying attention, and perhaps to avoid potential accidents. He also recognised the vast potential of the Internet of Things, most of which we have not yet thought about.

He finished on a more cautionary note. With machine learning we do not know how the rules have been derived or, indeed, what those rules are. Not only that, but they are constantly changing as more experience is gained. So, for example, if the algorithm used by VW had been derived via machine learning, would VW still be to blame? If a pedestrian is accidentally killed by a driverless car, who would be responsible? If a train hits someone, unless the driver has broken the rules, there is usually no case to answer, and we accept that, but trains do not have the ability to swerve, brake or take any other avoiding action. With the IoT, maybe the car will be able to alert the pedestrian via the mobile phone they were so engrossed in when they stepped off the pavement!

Legislation covering data, data ownership, data protection, machine learning and this whole area is seriously in need of a major review.