My first experience of big data and high-speed analytics was at CERN and the Rutherford Lab over 50 years ago. I was in the Rutherford Lab part of a large distributed team working on a CERN bubble chamber experiment. There was a proton-proton collision every second or so which, for the charged particles, produced curved tracks in the chamber that were photographed from three different angles. The data from these tracks was recorded by something called the Hough-Powell device (after its inventors) in real time. This data was then turned into geometry, and the geometry was passed to my program. I was at the end of the chain and my job was to take the geometry and work out, for each collision, which of a number of possible events it actually was – the so-called kinematics analysis. This was done by chi-squared testing, which seemed remarkably effective. The statistics of many events could then be computed, hopefully leading to the discovery of new (and anticipated) particles – in our case the Ω–. In principle, the whole process for each event, through to identification, could be done in real time – though in practice, my part was done off-line. It was in the early days of big computers: in our case, the IBM 7094. I suspect that it is now all done in real time. Interestingly, in a diary I kept at the time, I recorded my immediate boss, John Burren, as remarking that ‘we could do this for the economy you know’!
So if we could do it then for quite a complicated problem, why don’t we do it now? Even well-known and well-developed models – transport and retail for example – typically take weeks or even months to calibrate, usually from a data set that refers to a single point in time. We are progressing to a position at which, for these models, we could have the database continually updated from data flowing from sensors. (There is an intermediate processing point of course: to convert the sensor data to what is needed for model calibration.) This should be a feasible research challenge. What would have to be done? I guess the first step would be to establish data protocols so that, by the time the real data reached the model – the analytics platform – it was in some standard form. The concept of a platform is critical here. This would enable the user to select the analytical toolkit needed for a particular application. This could incorporate a whole spectrum from maps and other visualisation to the most sophisticated models – static and dynamic.
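To make the idea concrete, here is a minimal sketch in Python of the chain described above: raw sensor messages are converted to a standard form, accumulated, and used to recalibrate a model on demand. Everything in it is an assumption made for illustration, not a description of any existing platform: the message keys 'from' and 'to', the names TripRecord, normalise and RollingCalibrator, and the choice of a simple origin-constrained spatial interaction model whose single distance-decay parameter beta is fitted by matching the modelled mean trip cost to the observed one.

```python
from dataclasses import dataclass
from math import exp


@dataclass(frozen=True)
class TripRecord:
    """Hypothetical standard form for one observed trip between two zones."""
    origin: str
    destination: str


def normalise(raw: dict) -> TripRecord:
    """Intermediate processing step: raw sensor message -> standard form.

    The keys 'from' and 'to' are an assumed raw format, not a real protocol.
    """
    return TripRecord(
        origin=str(raw["from"]).strip().upper(),
        destination=str(raw["to"]).strip().upper(),
    )


class RollingCalibrator:
    """Keeps observed flows up to date and refits beta whenever asked."""

    def __init__(self, cost: dict, attractiveness: dict):
        self.cost = cost                      # cost[(i, j)]: travel cost from i to j
        self.attractiveness = attractiveness  # W_j: attractiveness of destination j
        self.flows = {}                       # observed trip counts per (i, j)
        self.beta = None

    def ingest(self, record: TripRecord) -> None:
        """Add one standardised observation to the running flow counts."""
        key = (record.origin, record.destination)
        self.flows[key] = self.flows.get(key, 0) + 1

    def observed_mean_cost(self) -> float:
        total = sum(self.flows.values())
        return sum(n * self.cost[k] for k, n in self.flows.items()) / total

    def modelled_mean_cost(self, beta: float) -> float:
        """Mean trip cost when T_ij is proportional to O_i * W_j * exp(-beta * c_ij)."""
        origins = {}
        for (i, _), n in self.flows.items():
            origins[i] = origins.get(i, 0) + n
        weighted = 0.0
        for i, o_i in origins.items():
            w = {j: w_j * exp(-beta * self.cost[(i, j)])
                 for j, w_j in self.attractiveness.items()}
            z = sum(w.values())
            weighted += o_i * sum(w[j] * self.cost[(i, j)] for j in w) / z
        return weighted / sum(origins.values())

    def recalibrate(self, lo: float = 0.0, hi: float = 5.0, tol: float = 1e-6) -> float:
        """Bisection on beta; the modelled mean cost falls as beta rises."""
        if not self.flows:
            raise ValueError("no observations ingested yet")
        target = self.observed_mean_cost()
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if self.modelled_mean_cost(mid) > target:
                lo = mid   # modelled trips too long: need stronger distance decay
            else:
                hi = mid
        self.beta = (lo + hi) / 2
        return self.beta


if __name__ == "__main__":
    cost = {("A", "A"): 1.0, ("A", "B"): 3.0, ("B", "A"): 3.0, ("B", "B"): 1.0}
    calibrator = RollingCalibrator(cost, {"A": 1.0, "B": 1.0})
    stream = [{"from": "a", "to": "a"}] * 3 + [{"from": "a", "to": "b"}]
    for message in stream:
        calibrator.ingest(normalise(message))
    print(calibrator.recalibrate())
```

The point of the design is the separation of concerns: the protocol (here, normalise) is the only part that knows about raw sensor formats, so the calibration can be rerun as often as new data arrives rather than once every few months.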
There are two possible guiding principles for the development of this kind of system: what is needed for the advance of the science, and what is needed for urban planning and policy development. In either case, we would start from an analysis of ‘need’ and thus evaluate what is available from the big data shopping list for a particular set of purposes – probably quite a small subset. There is a lesson in this alone: to think about what we need data for, rather than taking the items on the shopping list and asking what we can use them for.
Where do we start? The data requirements of various analytics procedures are pretty well known. There will be additions – for example incorporating new kinds of interaction from the Internet-of-Things world. This will be further developed in the upcoming blog piece on block chains.
So why don’t we do all this now? Essentially because the starting point – the first demo – is a big team job, and no funding council has been able to tackle something on this scale. Therein lies a major challenge. As I once titled a newspaper article: ‘A CERN for the social sciences?’