Big Data (Topic)

What is "Big Data", and Why Is It Different?

Excellent question! The truth is the term means different things to different people. Within political science, for example, it is often used for any project that studies correlations in relatively large datasets. On this site, however, we will use a more precise and practical definition:

  • "Big Data" is data that is too big to load and work with entirely in RAM.

Note that programs very often make copies of your data as you work, so if your data only just fits into RAM (say your data is 7GB and you have 10GB of RAM), then you may still be working with "Big Data"!
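
To make this concrete, here is a minimal sketch in Python using pandas (with made-up data) showing how to check how much RAM a dataset is using, and why a single operation can briefly double that footprint:

  import numpy as np
  import pandas as pd

  # Made-up data for illustration: 5 million rows of random numbers.
  df = pd.DataFrame({"x": np.random.rand(5_000_000),
                     "group": np.random.randint(0, 100, 5_000_000)})

  # How much RAM does this object use?
  print(f"DataFrame size: {df.memory_usage(deep=True).sum() / 1e6:.0f} MB")

  # Many pandas operations return a *new* DataFrame rather than modifying
  # the old one in place, so while this line runs your computer briefly
  # holds two full copies of the data.
  df_sorted = df.sort_values("x")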

Why this definition? When your computer is manipulating data (defining variables, running regressions, etc.), what it's actually doing is grabbing a little bit of data from storage, moving it to the processor, doing math with it, and then putting it back in storage.

To simplify somewhat, we can think of your computer as having two forms of storage -- RAM (sometimes called main memory) and your harddrive. But these two forms of memory are very, very different. Your computer can grab data from RAM roughly 100,000 times faster than it can grab data from the harddrive. Because of this, your computer is much happier (and performs faster) when it's able to keep all the data you're working with in RAM. Indeed, if you're moving data back and forth to your harddrive, that traffic is almost certainly your biggest bottleneck -- on a modern computer, the actual computations are nearly instantaneous by comparison.

With that in mind, if you can't keep all your data in RAM, you have to use special tools that minimize the amount of time your computer spends going back and forth to the harddrive for data.

One side-note: this definition is somewhat specific to the social sciences. If you talk to someone in computer science, or who works for a big company, they may think of big data as data that not only doesn't fit in RAM, but also won't fit on a single computer. Keep that distinction in mind if you start googling "Big Data".

How do I know if my data is fitting into RAM?

Oddly, this is not actually a straightforward question, for two reasons:

  • Since programs often make copies of your data when they manipulate it, if you aren't careful about watching your memory use, your program may start storing things on your harddrive, which can turn a job that would take seconds into one that takes days or more to run.
  • The size of your data changes with its format. For example, if you're reading data from an inefficient file format (like a .csv text file), a file that looks large on disk may turn out to be quite small once you import it into a program like R or Stata that knows how to store data more efficiently (see the sketch below).
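
As a rough illustration of the second point, here is a short Python sketch using pandas (the file name is hypothetical) comparing a file's size on disk with its size once loaded into RAM:

  import os
  import pandas as pd

  # Hypothetical file name; substitute your own.
  path = "survey_responses.csv"
  print(f"Size on disk: {os.path.getsize(path) / 1e6:.0f} MB")

  # After import, the in-memory size can be quite different.
  df = pd.read_csv(path)
  print(f"Size in RAM:  {df.memory_usage(deep=True).sum() / 1e6:.0f} MB")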

So how do I check to see if my data is fitting in memory?

Unfortunately, if your program starts using more space than you have RAM, your operating system will usually just start using the harddrive for extra space without telling you (this is called "virtual memory"). As a result, your program may slow to a crawl without you ever realizing why.

To avoid this, use one of the tools provided by your operating system to monitor whether it's using the harddrive:

  • OSX: OSX comes with a program called "Activity Monitor" in the Applications > Utilities folder. In 10.10 (Yosemite), the memory tab of Activity Monitor has a graph at the bottom called "Memory Pressure". If this is green, you're good. If it's yellow or red, you're actively going back and forth to your harddrive. (Note your computer may be using "virtual memory" without affecting performance if it's just storing away things you aren't actively using).
  • Windows 7 / 8: Windows has a program called "Resource Monitor" that shows data on memory use. If the "Hard Faults" graph under the memory tab is spiking, your computer is using the harddrive.
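
In addition to these operating-system monitors, you can check memory use from inside your own scripts. Here is one possible sketch in Python, using the third-party psutil package (an assumption on our part -- it is not something this page otherwise requires):

  import psutil  # third-party package: pip install psutil

  mem = psutil.virtual_memory()
  print(f"Total RAM:     {mem.total / 1e9:.1f} GB")
  print(f"Available RAM: {mem.available / 1e9:.1f} GB")
  print(f"In use:        {mem.percent}%")

  # If swap use keeps climbing while your script runs, your data has
  # spilled out of RAM and the operating system is using the harddrive.
  swap = psutil.swap_memory()
  print(f"Swap in use:   {swap.used / 1e9:.1f} GB")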

Strategies for Working with Big Data

If you have big data, you basically have three options:

  1. Format and trim your data so it fits in memory or get more RAM;
  2. Chunk your data yourself;
  3. Use specialized software.


Avoid Spillover

Working with data that doesn't fit into RAM is a headache. If there's any way you can avoid it, you should. There are two main ways to do so:

  • Drop variables or store them more efficiently: If you're used to small datasets, you may be in the habit of always carrying around all the variables in a survey or dataset. But if you only keep the variables you absolutely need, or find ways to store them more efficiently, you can often stay within your size limits. For example, string variables take up lots of space -- if you can, turn them into numeric variables (see Data Types for more on data types, and the sketch after this list).
  • Get more RAM: RAM isn't cheap, but nor is your time, and buying more RAM can save you an amazing amount of trouble. And if you can't afford to buy your own, you may find that it's cheaper to rent a computer with more RAM from cloud services like Amazon, which are often surprisingly affordable! For more, see the page on Cloud Computing Resources.

Chunking

One way to manipulate large datasets is by chunking -- loading your data into RAM one chunk (say, 1 million rows) at a time, manipulating it, and then appending it to a new data file. This is relatively straightforward for operations that act only on single rows -- like recoding variables -- but a little harder when you start doing things that involve multiple rows at once (like sorting, or taking averages over groups).
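
To make the idea concrete, here is one possible sketch in Python using pandas (language-specific tools are listed below; the file and column names here are hypothetical):

  import pandas as pd

  # Read the (hypothetical) input file one million rows at a time.
  reader = pd.read_csv("big_input.csv", chunksize=1_000_000)

  for i, chunk in enumerate(reader):
      # Row-wise operations, like recoding a variable, are safe to do
      # one chunk at a time.
      chunk["income_topcoded"] = chunk["income"].clip(upper=250_000)

      # Append each processed chunk to a new file; write the header only once.
      chunk.to_csv("big_output.csv", mode="w" if i == 0 else "a",
                   header=(i == 0), index=False)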

Chunking in R

R has two main libraries for this type of data manipulation: ffbase and bigmemory.

Chunking in Python

The main library for chunking in Python is PyTables. pandas' tools for working with HDF files are built on top of PyTables, so if you're using pandas, take a look at its tools for manipulating, saving to, and reading from HDF files.
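
For instance, here is one possible sketch of that workflow (the file and column names are hypothetical, and PyTables must be installed for pandas' HDF functions to work):

  import pandas as pd

  # Write the (hypothetical) input file to an HDF file a chunk at a time.
  # The "table" format is what allows querying the file on disk later;
  # string columns may need a min_itemsize argument (see the pandas docs).
  with pd.HDFStore("big_data.h5", mode="w") as store:
      for chunk in pd.read_csv("big_input.csv", chunksize=1_000_000):
          store.append("survey", chunk, format="table", data_columns=["year"])

  # Later: pull only the rows you actually need back into RAM.
  recent = pd.read_hdf("big_data.h5", "survey", where="year >= 2010")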

Use Specialized Software

If you don't want to do this yourself, there is also software designed to handle these kinds of chunking operations behind the scenes. The most notable examples are SQL databases, which are designed to handle large datasets on standard computers, but new tools are emerging constantly.
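
As one small-scale illustration, SQLite (a simple SQL database that ships with Python) keeps the data on disk, so only your query results need to fit in RAM. A rough sketch, with hypothetical file, table, and column names:

  import sqlite3
  import pandas as pd

  conn = sqlite3.connect("big_data.sqlite")

  # Load the database once, a chunk at a time.
  for chunk in pd.read_csv("big_input.csv", chunksize=1_000_000):
      chunk.to_sql("survey", conn, if_exists="append", index=False)

  # Ask the database to do the heavy lifting; only the small summary table
  # comes back into memory.
  means = pd.read_sql_query(
      "SELECT state, AVG(income) AS mean_income FROM survey GROUP BY state",
      conn)
  conn.close()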

One option worth mentioning is a data management tool called Anatella, though please read the full Anatella page if you're thinking of using it, as it has several shortcomings.

Another option is to turn to what's called distributed computing, where both data and computations are distributed across lots of machines, for example on Amazon servers. Distributed computing tools not only have the advantage of being built to handle data that doesn't fit into memory, but also make it (relatively) easy to parallelize tasks across lots of computers, potentially speeding up computations dramatically, though doing so requires learning some new tricks. See the Distributed Computing page for more information.

REALLY Big Data

As noted above, while most social scientists are likely dealing with data that doesn't fit into RAM, there is also the case of data that is so big that it won't even fit on one computer. If you're in this situation, you need to use a distributed computing tool, like Hadoop or Spark. See the Distributed Computing page for more on these systems.