‘Nullius in verba’ is the motto of the Royal Society. It can be roughly translated as ‘Don’t take anybody’s word for it’ with the implication, ‘verify through experiments’. Urban researchers – and social scientists more broadly – live in their laboratory and the data is created minute by minute. Our experiments are the interpretations of that data and the testing of theories and models – models as representations of theories. In the case of models, data is used for calibration and there has to be enough ‘left over’ for testing; or the calibrated model can be tested against a full, usually future, data set. We now live in an age of ‘big data’: the ‘minute by minute’ can be taken literally.
What does our data look like? We can start by defining our system of interest – people, organisations, activities, interactions. We have the usual challenges of scale: the groups and sectors – what kinds of people, of economic activity; how fine-grained the geography – continuous space or zones (in the latter case, what size?); how to describe activities and interactions. I once estimated that for a not very fine-grained description of a city of a million people, I would like to have 10¹³ variables. Ideally, then, we would need a measure of each of these for a set of points in time. With the increasing availability of real-time data – telecoms, mobile phone position data and social media, for example – this 10¹³ figure of possibilities would be much larger.
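To see how an estimate of this kind multiplies up, consider the dimensions of the description. The sizes below are illustrative assumptions for a sketch, not the original calculation:

```python
# Illustrative only: the dimension sizes are assumptions used to show
# how the variable count multiplies up, not the original derivation.
zones = 1_000          # spatial zones for a city of a million people
person_types = 100     # age x occupation x household-type categories
activity_types = 10    # work, retail, education, leisure, ...
time_periods = 10      # periods within a representative day

# One variable per origin zone x destination zone x person type
# x activity type x time period gives the interaction array alone:
interaction_vars = zones * zones * person_types * activity_types * time_periods
print(f"{interaction_vars:.0e}")  # 1e+10 for this coarse description
```

Even this coarse description yields 10¹⁰ interaction variables; finer zoning and finer categories push the total towards 10¹³.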
Let us classify what is now potentially available, starting with two dimensions:
- fast – slow
- open – closed.
The fast data is real-time derived – from sensors, phones, consumers via store cards, or social media, for example; an example of the slow is the decennial Census. There is a spectrum, of course: some surveys are annual, for example. Open data is freely available; closed is not. Again, there is a spectrum. As researchers, we would like the closed data to be open! Huge progress has been made in recent years through the Government’s open data policy, and it has been estimated that many thousands of data files are now open. At the closed end of the spectrum, there are serious issues of privacy, of course. This poses an interesting research question – particularly in relation to government administrative data: how can individual data be aggregated in such a way as to protect identity? If this could be solved systematically, it might be possible for HMRC, for example, to release income data. There are also issues of accuracy: the role of professional statisticians is important here.
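The aggregation question can be sketched with a simple threshold rule, in the spirit of k-anonymity: release a statistic for an area only when enough individuals contribute to it. The function name, the threshold of 10 and the data are all illustrative assumptions, not a description of any actual statistical-disclosure procedure:

```python
from collections import defaultdict

def aggregate_with_threshold(records, k=10):
    """Release mean income per area only where at least k individuals
    contribute; pool smaller areas so no small cell can identify anyone."""
    groups = defaultdict(list)
    for area, income in records:
        groups[area].append(income)

    released = {}
    pooled = []
    for area, incomes in groups.items():
        if len(incomes) >= k:
            released[area] = sum(incomes) / len(incomes)
        else:
            pooled.extend(incomes)  # too few people: suppress the area
    if pooled:
        released["suppressed"] = sum(pooled) / len(pooled)
    return released

# Illustrative data: (area, income) pairs.
records = [("A", 20000)] * 12 + [("B", 30000)] * 3
out = aggregate_with_threshold(records, k=10)
print(out)  # area A is released; area B is pooled into 'suppressed'
```

The design choice is the trade-off the text points to: the larger the threshold k, the stronger the protection of identity and the coarser the released geography.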
There are many interesting research questions posed by the data that is now available – all variants of ‘how to make the best use of it’. These are raised briefly here and mostly link to other entries for elaboration.
- Can we use fast data to update the slow? This is already done, for example by the ONS in relation to Census data and population forecasts but there must be many more opportunities.
- Can some of the fast data replace more laborious ways of data collection? A good example is the use of mobile phone data to estimate retail flows by Telefonica. Another would be the use of link-count data of traffic to estimate origin-destination flows (though as in many of these instances, there is a tricky modelling problem here – it is not just data collection).
- One of the arguments used to support the Government’s open data policy is that release of data will create new opportunities for application and will hence contribute both to efficiency and to economic growth and job creation. There are now many examples within the ‘smart city’ movement and actual and potential applications in health administration (though this is a good example of the difficulty in assembling a useable comprehensive system).
- The efficiency argument creates many new opportunities for that part of the operational research community that focuses on optimisation and efficiency.
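The link-count example above conceals exactly the tricky modelling problem noted: recovering origin-destination (O-D) flows from counts on links. A toy sketch, with an invented three-link network and assumed routes, shows the recovery step; in real networks there are far more O-D pairs than counted links, so a model (an entropy-maximising one, for instance) is needed to close the problem:

```python
import numpy as np

# Toy illustration only: network, routes and flows are invented.
# 3 O-D pairs, 3 links; A[l, p] = 1 if O-D pair p's route uses link l.
A = np.array([
    [1, 0, 1],   # link 1 carries pairs 1 and 3
    [0, 1, 1],   # link 2 carries pairs 2 and 3
    [1, 1, 0],   # link 3 carries pairs 1 and 2
], dtype=float)

true_flows = np.array([100.0, 200.0, 50.0])
link_counts = A @ true_flows   # what the traffic sensors would observe

# Here the counts determine the flows exactly; with more O-D pairs
# than links the system is underdetermined and a model must step in.
estimated, *_ = np.linalg.lstsq(A, link_counts, rcond=None)
print(np.round(estimated))
```

With as many independent link counts as unknown flows the least-squares step recovers the true flows; the data-collection gain is real, but only alongside the modelling.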
These are all rather specific applications and fit relatively narrow research objectives, usually with particular data sets. There are numerous examples in the programmes of the many smart-cities conferences. However, there are bigger questions.
- If we want a comprehensive portrait of a city – the 10¹³ question again – the data needs to be organised into an intelligent information system, and this has not been done in a comprehensive way – notwithstanding the long-term development of GISs, for example. This is a big, non-trivial issue and there are no signs of anyone taking it on systematically.
- The data-based applications are – again, actually and potentially – valuable, but they don’t contribute to the big problems. We might call this the ‘big data for big problems’ agenda (BDBP – a new acronym?). We should be reviewing the ‘wicked problems’ list – economic growth in different kinds of cities, social disparities, housing, future land use, and so on – and ask what the big data revolution is contributing to these.
The data revolution is creating a new profession of ‘data scientists’. An argument is sometimes put that we don’t need theories and models any more, just the data. I would argue that data scientists need to be part of a team and that the ‘only-need-data’ argument is nonsense. Can we imagine a physicist saying that they only need data? So we need theoreticians, model builders and statisticians as well as data scientists – but a serious fear is that these different professions will function in silos.
Data, and especially new data, certainly provide new opportunities. But recall that they are the basis of our experiments – and that experiments are to test theories and to deepen our understanding. Hence ‘Nullius in verba’ can be extended to ‘Evolvere theoria et intellectum’!
Alan Wilson, April 2015
Wilson, A.G., 2007. A generalised representation for a comprehensive urban and regional model. Computers, Environment and Urban Systems, 31(2), pp. 148-161.