All the talk of ‘big data’ sometimes carries the implication that we must now surely have all the data that we need. However, frequently, crucial data is ‘missing’. This can then be seen as inhibiting research: ‘can’t work on that because there’s no data’! For important research, I want to make the case that missing data can often be estimated with reasonable results. This then links to the ‘statistics vs mathematical models’ issue: purist statisticians really do need the data – or at least a good sample; if there is a good model, then there is a better chance of getting good estimates of missing data.
As I mentioned in an earlier blog – Serendipity – I started my life in elementary particle physics at the Rutherford Lab. I was working in part at CERN on a bubble chamber experiment at the synchrotron. A high-energy proton collided with another proton, and the results of the collision left tracks in the chamber which were curved by a magnetic field, thereby offering a measurement of momentum. My job was to identify the particles generated in the collision. The ‘missing data’ was that of the neutral particles, which left no tracks. The solution came from a mix of model and statistics. The model offered a range of hypotheses of possible events, and chi-squared testing identified the most probable – actually with remarkable ease, though I confess that there was an element of mystery in this for me. But it made me realise, with some hindsight, that missing data could be discovered.
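The idea can be sketched in a few lines of code. This is a toy reconstruction, not the Rutherford Lab's actual kinematic fitting: the particle hypotheses, momenta and measurement errors below are all invented for illustration. Each hypothesis predicts what the visible (charged) tracks should look like once an unseen neutral particle is assumed to carry away the balance; the hypothesis with the lowest chi-squared wins.

```python
import math

# Hypothetical measured momentum components from the charged tracks
# (illustrative units), with assumed measurement errors.
measured = {"px": 0.52, "py": -0.11, "pz": 1.93}
sigma = {"px": 0.05, "py": 0.05, "pz": 0.10}

# Each hypothesis predicts the visible momenta under a different
# assumption about the unseen neutral particle. Invented numbers.
hypotheses = {
    "p p -> p p pi0":     {"px": 0.50, "py": -0.10, "pz": 1.90},
    "p p -> p n pi+":     {"px": 0.65, "py": -0.25, "pz": 2.10},
    "p p -> p p pi+ pi-": {"px": 0.40, "py": 0.05, "pz": 1.70},
}

def chi_squared(pred):
    # Sum of squared, error-normalised residuals between measurement
    # and the hypothesis's prediction.
    return sum(((measured[k] - pred[k]) / sigma[k]) ** 2 for k in measured)

scores = {name: chi_squared(pred) for name, pred in hypotheses.items()}
best = min(scores, key=scores.get)
print(best, round(scores[best], 2))  # -> p p -> p p pi0 0.29
```

The ‘missing data’ – the neutral particle – is never observed directly; it is inferred because only one hypothesis makes the visible data probable.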
Science needs data – cf. Nullius in verba – but it also needs hypotheses and theories to test – cf. Evolvere theoriae et intellectum. In the case of my current field – the analysis of states and dynamics of cities and regions – there is an enormous need for data. I once estimated that I needed 10^13 variables as the basis for a half-decent comprehensive urban model, and in many ways this takes us beyond big data – though real-time sensor data will generate these kinds of numbers very quickly. The question is: is it the data we need? We can set out a comprehensive theory – and an associated model – of cities and regions. We have core data from decennial censuses together with a large volume of administrative data and much survey data. The real-time data – e.g. positional data from mobile phones – can be used to estimate person flows, taking over from expensive (and infrequent) surveys. In practice, of course, much of the data available to us is sample data and we can use statistics – either directly or to calibrate models – to complete the set.
My own early work in urban modelling was a kind of inversion of the missing data problem: entropy maximising generated a model which provided the best fit to what is known – in effect a model for the missing data. It turns out, not surprisingly, to have a close relationship to Bayesian methods of adding knowledge to refine ‘beliefs’. In theory, this only works with large ‘populations’ but there have been hints that it can work quite well with small numbers. This only gets us so far. The collection (or identification) of data to help us build dynamic models is more difficult. Even more difficult is connecting these models which rely on averaging in large populations with micro ‘data’ – maybe qualitative – on individual behaviour. There are research challenges to be met here.
There are other kinds of challenges: what to do when critical elements of a simpler nature are missing. An example is the need for data on ‘import and export’ flows across urban boundaries, to be used in building input-output models at the city (or region) scale. We need these models so that we can work out the urban equivalent of the well-understood ‘balance of payments’ in the national accounts. How can we estimate something which is not measured at all, even on a sample basis? I recently started to ponder whether we could look at the sectors of an urban economy and make the bold assumption that the import and export propensities were identical to the national ones. This immediately throws up another problem: we have to distinguish between intra-national flows – that is, between cities – and international flows. It became apparent pretty quickly that we needed the model framework of interacting input-output models for the UK urban system before we could progress to making estimates, albeit very bold ones, of the missing data. We have done this for 200+ countries in a global dynamics research project, and the task was now to translate this to the urban scale but for the country as a whole. A ‘missing data’ problem thus turns out to be quite a tricky theoretical one.
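The bold assumption itself is simple arithmetic, and a sketch makes both the idea and its weakness concrete. All sectors, outputs and trade figures below are invented; the point is only the mechanics of borrowing national propensities.

```python
# A minimal sketch of the bold assumption: each urban sector is assumed
# to import and export in the same proportion to its output as the
# corresponding national sector. All figures are invented.

national = {
    # sector: (output, imports, exports) at the national scale
    "manufacturing": (1000.0, 300.0, 250.0),
    "services":      (2000.0, 200.0, 400.0),
}

city_output = {"manufacturing": 50.0, "services": 120.0}  # hypothetical city

def estimate_city_trade(national, city_output):
    estimates = {}
    for sector, (output, imports, exports) in national.items():
        m = imports / output   # national import propensity
        x = exports / output   # national export propensity
        estimates[sector] = {
            "imports": m * city_output[sector],
            "exports": x * city_output[sector],
        }
    return estimates

est = estimate_city_trade(national, city_output)
print(est)
```

The weakness is visible immediately: these estimates lump together flows to other UK cities and flows abroad, which is exactly why the interacting input-output framework is needed before the estimates can be trusted.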
Perhaps the best way to summarise the ‘missing data’ challenges is to refer back to the ‘Requisite knowledge’ argument of an earlier blog: what is the ‘requisite data set’ needed for an effective piece of research – e.g. to calibrate a model? Then, if the model is good, the model outputs look like ‘data’ for other purposes. More generally: do not be put off from doing something important by ‘missing data’. There are ways and means, albeit sometimes difficult ones!