This post describes five relatively simple approaches for reducing the size and complexity of data. Computer performance can quickly become an issue when working with large datasets, and so it’s wise to have some quick and handy approaches for getting around these problems.
Big data has become all the rage in recent years, the term only having emerged (according to Google Trends) in late 2011. Yet, almost all datasets stored and analysed by public sector organisations are not ‘big’. Typically, real ‘big data’ requires the use of specialised software and approaches to undertake analyses in a practical and timely manner. These datasets can have hundreds of related tables of data, many millions of rows of data, or tens of thousands of variables.
Nevertheless, analysing the larger end of more normal-sized datasets can be difficult, particularly when using standard software like MS Excel and Access. These programs are not designed to cope with datasets containing millions of records or thousands of variables. For example, Access is limited to 2GB per database and 255 fields per table, and Excel cannot store more than 1,048,576 rows by 16,384 columns per sheet. These hard limits can be worked around (linking tables in Access, spreading data across multiple sheets in Excel, etc.), but in practice they are rarely reached anyway, because performance becomes the limiting factor first as processor load mounts. (Anyone who has filled an entire Excel sheet with calculations knows how sluggish it can become.)
So what can you do if you are coming up against these kinds of performance limits? There are several sneaky approaches you can use to reduce the size of datasets and minimise the time it takes to run analyses when things start slowing down:
1. Code your data
2. Sample your data
3. Collapse your data
4. Prune your data
5. Process your (open text) data
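To make the five approaches concrete, here is a minimal sketch in plain Python (standard library only) applied to a made-up 10,000-row dataset. The field names and values are entirely invented for illustration, and the text-processing step (5) is shown before the pruning step (4) simply because the free-text field has to be summarised before it can be dropped; the same ideas carry over to Excel, R, SQL and the rest.

```python
import random
from collections import Counter

# Toy dataset standing in for a large table; regions, values and
# comments are all hypothetical, purely for illustration.
regions = ["North", "South", "East", "West"]
rows = [
    {"region": regions[i % 4], "value": i % 50,
     "comment": "complaint about delay" if i % 3 == 0 else "no issues reported"}
    for i in range(10_000)
]

# 1. Code your data: swap repeated strings for small integer codes,
#    keeping a lookup table so the codes can be translated back later.
region_codes = {name: code for code, name in enumerate(regions)}
for r in rows:
    r["region"] = region_codes[r["region"]]

# 2. Sample your data: develop and test the analysis on a random
#    subset, then run the final version on the full dataset.
random.seed(42)
sample = random.sample(rows, 1_000)

# 3. Collapse your data: aggregate down to the level of detail the
#    analysis actually needs (here, totals per region).
value_by_region = Counter()
for r in rows:
    value_by_region[r["region"]] += r["value"]

# 5. Process your (open text) data: reduce free text to a simple flag,
#    which is far cheaper to store and filter than the text itself.
for r in rows:
    r["is_complaint"] = "complaint" in r["comment"]

# 4. Prune your data: keep only the variables the analysis uses,
#    dropping the bulky comment field now that it has been flagged.
pruned = [{"region": r["region"], "value": r["value"],
           "is_complaint": r["is_complaint"]} for r in rows]
```

Each step trades a little preparation time for smaller, faster data: coded and pruned records take up less memory, a sample keeps exploratory work responsive, and aggregates are often all the final output needs anyway.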
The great thing about these approaches is that they apply whether you are using standard software like Excel or Access; statistical software like R, Stata or SAS; or whether you are a hardcore data scientist working in Python, MySQL and the like. Whatever your setup, at least one of them should be able to help you out in a time of need.