Big Data, new epistemologies and paradigm shifts – Rob Kitchin – 2014
Introduction
Kitchin (2013) defines Big Data as being: huge in volume; high in velocity; diverse in variety; exhaustive in scope; fine-grained in resolution and indexical in identification; relational in nature; and flexible, meaning both extendable and scalable. It is important to note, then, that Big Data is not just about volume, as datasets like national censuses have been produced for a long time. However, these older forms of large datasets took a long time to compile, were fairly coarse in their findings and were inflexible. Big Data is continuously generated, fine-grained in scope, and flexible and scalable in production; in some cases this amounts to petabytes of data every hour, as opposed to a census once a decade.
Data collection on this scale is made possible by the ubiquity of computers and internet connectivity, as well as newer, more efficient data storage solutions. This changes how data analysis occurs, particularly as the datasets are so large and often generated without a question in mind. Analysing huge datasets without a specific purpose was not feasible with previous technologies; hundreds of algorithms can now be applied to a dataset in order to retroactively determine its use or worth. This creates a new epistemological positioning: 'rather than testing a theory by analysing relevant data, new data analytics seek to gain insights "born from the data"'.
Many argue that this has far-reaching consequences for the functioning of society that we cannot yet see (or are only now beginning to see), ranging from governance to knowledge production and beyond. The paper positions this as a Kuhnian paradigmatic revolution, and goes further to note that Gray (Hey et al., 2009) suggests a new, fourth paradigm of science driven by advances in data collection and analytical methods. A new epistemological paradigm is developing, then, but the form it is taking is contested, which this paper explores.
A fourth paradigm in science?
There is, though, the idea that Big Data is so exhaustive that it does away with the need for theory. Both in the private sphere and in academia there is a belief that Big Data will simply reveal an unquestionable truth by virtue of its size.
The end of theory: empiricism reborn
The paper then turns to the infamous Chris Anderson (2008) Wired article about the end of theory. The claim is essentially that all we need is correlation, which allows us to find patterns we did not know we were looking for (Dyche, 2012). The examples for this usually come from retail and marketing and involve finding correlations between products in order to get customers to purchase more. It would be desirable to be able to explain why these correlations exist, but doing so is not necessary in order to increase profits. Companies who sell this kind of data analysis (such as Ayasdi) claim that it removes elements of human bias from data analysis. This empiricist epistemology suggests a purely inductive (rather than deductive) mode of scientific research.
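To make the empiricist claim concrete, here is a minimal sketch (in Python, with invented product names and randomly generated basket data, not anything drawn from the paper) of what 'correlation first, explanation optional' mining looks like:

```python
# Hypothetical sketch: naive "let the data speak" correlation mining.
# Product names and baskets are invented; one association is planted
# artificially so that something rises to the top of the ranking.
from itertools import combinations

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

products = ["nappies", "beer", "crisps", "shampoo", "batteries"]
baskets = pd.DataFrame(
    rng.integers(0, 2, size=(1000, len(products))), columns=products
)
# Plant a correlation: beer tends to be bought when nappies are.
baskets["beer"] = np.where(rng.random(1000) < 0.7, baskets["nappies"], baskets["beer"])

# Compute and rank every pairwise correlation, with no prior hypothesis
# about which products should be related.
pairs = []
for a, b in combinations(products, 2):
    pairs.append((a, b, baskets[a].corr(baskets[b])))

for a, b, r in sorted(pairs, key=lambda t: abs(t[2]), reverse=True):
    print(f"{a} ~ {b}: r = {r:+.3f}")
```

An approach like this will happily surface and rank associations whether or not they mean anything, which is precisely where the critique begins.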
However, there are issues with this. Big Data might be exhaustive in its data collection, but it is both a representation and a sample, shaped by external factors including the data ontology employed. The human bias it supposedly removes is still there, as data can never truly provide a God's eye view. This connects to the second fault: systems are designed by people to capture certain types of data and to analyse those data in certain ways. The algorithms used arise from deductive science, despite the supposedly inductive nature of Big Data. The third fault follows: as the data are not generated free from theory, the results cannot be free from theory either. Even automated algorithms embed viewpoints in their code; they too lack a God's eye view. Equally, correlations may exist without causation, and interpreting them as meaningful can lead to ecological fallacies. Finally, taking this form of universal, quantitative approach to the social sciences results in 'an analysis of [cities] that is reductionist, functionalist and ignores the effects of culture, politics, policy, governance and capital'. In short, you might be able to analyse data outside of its context, but your results will, inevitably, be missing that context.
Data driven science
Data driven science keeps the scientific method but is more open to abductive, inductive, and deductive approaches being used where suitable. It does seek hypotheses born from data rather than from theory, using induction in research design, but this induction is situated and contextualised within a theoretical domain. This means that knowledge discovery is guided by theory rather than by seeking out every possible correlation in a dataset; 'strategies of data generation and repurposing are carefully thought out'. Similarly, data analysis is driven by theoretically informed decisions, helping to determine which correlations seem worth further exploration. This uses abductive reasoning, which seeks a plausible, logical answer without claiming it is definitive. The inductive processes here are contextually framed and used alongside theory, rather than as an end point in themselves. The argument for this is that the traditional scientific method suits conditions of scarce data and weak computation, so continuing to use it unchanged in an age of Big Data would be futile.
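As a contrast with the purely empiricist sketch above, a hedged illustration (again in Python, with made-up variable names and data) of what theory-guided analysis might look like: only theoretically motivated relationships are tested, a multiple-testing correction is applied, and anything that survives is treated as a candidate hypothesis rather than a finding.

```python
# Hypothetical sketch of theory-guided rather than exhaustive analysis.
# Variable names and data are invented for illustration only.
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

rng = np.random.default_rng(1)

n = 200
data = pd.DataFrame({
    "transit_use": rng.normal(size=n),
    "air_quality": rng.normal(size=n),
    "rent": rng.normal(size=n),
    "ice_cream_sales": rng.normal(size=n),
})

# Theory frames which relationships are worth examining at all;
# everything else in the dataset is deliberately left untested.
theoretically_motivated_pairs = [
    ("transit_use", "air_quality"),
    ("rent", "transit_use"),
]

# Bonferroni correction guards against mining significance from many tests.
alpha = 0.05 / len(theoretically_motivated_pairs)

for a, b in theoretically_motivated_pairs:
    r, p = pearsonr(data[a], data[b])
    verdict = "candidate hypothesis for further work" if p < alpha else "no support"
    print(f"{a} vs {b}: r = {r:+.3f}, p = {p:.3f} -> {verdict}")
```

The point of the sketch is the framing rather than the statistics: induction still happens, but inside a theoretically delimited space.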
Computational social sciences and digital humanities
While there are positivists within the social sciences and humanities, they are not as common as in the harder sciences, so the Big Data movement is not as clear cut here. For positivists, there are still issues such as data access and quality, but overall Big Data still offers studies of greater depth that are longitudinal in nature. The finer-grained nature of the data should, in theory, deal with charges of reductionism and universalism in positivist approaches, while also allowing findings to be examined across wider contextual settings.
Essentially, for post-positivists, Big Data should make it easier for wider-ranging pieces of research to take place, focusing on macro-level scales rather than the micro level of, for example, a handful of novels. Within the digital humanities this is split between those who seek to digitise and systematise previously unfocused disciplines and those who want to use these new techniques alongside existing methods and traditional theory building. There has been concern, though, that Big Data and the computerisation of methods would leave the researcher too far removed from the data itself, removing any context by rendering cultural artefacts as merely 'data'. This suggests that the digital humanities (and computational social sciences) can only address surface-level concerns when examining everyday life, rather than the deeper, underlying causes of issues. Big Data can act as a good starting point to identify issues in this sense, but it cannot be an end point.
People and societies are too complex and messy to be reduced to data points: people are not rational utility maximisers but unpredictable and often contradictory. Societies also vary greatly from one another, and making them fit into datasets reduces them in a form of universalism or essentialism, which is damaging both for knowledge creation and for non-dominant societies. Big Data therefore struggles with 'the social' and with context (Brooks, 2013), which has to be acknowledged in the digital humanities. While the digital humanities make use of descriptive statistics, the computational social sciences also use inferential statistics to identify associations and causality.
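To illustrate the descriptive/inferential distinction drawn here, a small hedged sketch (Python, with simulated word counts standing in for real corpus measurements; nothing here comes from the paper itself):

```python
# Hypothetical sketch of the descriptive vs inferential distinction.
# Word counts per text are simulated; the groups are purely illustrative.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)

group_a = rng.poisson(lam=12, size=50)  # e.g. texts from one period
group_b = rng.poisson(lam=15, size=50)  # e.g. texts from another period

# Descriptive statistics: summarise what the samples look like.
print(f"mean A = {group_a.mean():.1f}, mean B = {group_b.mean():.1f}")

# Inferential statistics: ask whether the observed difference in means
# would be unlikely if the two groups were really drawn from the same
# population, i.e. attempt to generalise beyond the sample.
t, p = ttest_ind(group_a, group_b)
print(f"t = {t:.2f}, p = {p:.4f}")
```

The descriptive step only characterises the sample; the inferential step is where claims start to extend beyond it, which is where the epistemological caution of the next paragraph applies.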
Critical GIS and radical statistics ‘employ quantitative techniques, inferential statistics, modelling and simulation whilst being mindful and open with respect to their epistemological shortcomings, drawing on critical social theory to frame how the research is conducted, how sense is made of the findings, and the knowledge employed’ in order to compensate for limitations in quantitative research. The research process here is made open and reflexive, giving a new epistemological framing which enables social scientists to draw insights from Big Data.