Python (Level 1)

From Tools for Applied Data Analysis


Pandas was designed by Wes McKinney as an alternative to R. As such, it is similar to R in many respects, but with a more "Pythonic" syntax (intuitive, and consistent with the conventions used across Python tools) and more user control over memory use.

Pandas is organized around two primary data structures -- Series (one-dimensional arrays, analogous to a vector in R) and DataFrames (two-dimensional tables, which are actually collections of Series: each column is a Series). As the name suggests, DataFrames are the Pandas analogue of data.frames in R.
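To make the two structures concrete, here is a minimal sketch, assuming pandas is installed. The column names and values are made up for illustration and are not from any real dataset.

```python
import pandas as pd

# A Series is a labelled one-dimensional array.
ages = pd.Series([34, 27, 45], name="age")

# A DataFrame is a table whose columns are Series.
# (Hypothetical example data.)
df = pd.DataFrame({"name": ["Ann", "Bob", "Cy"], "age": [34, 27, 45]})

# Selecting a single column gives you back a Series.
age_column = df["age"]

print(type(age_column).__name__)  # Series
print(df.shape)                   # (3, 2): three rows, two columns
```

If you know R, df["age"] plays roughly the same role as df$age.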



Since Python just has a boring text interface, a few people came up with iPython as a way to make Python a little friendlier. iPython is an interactive shell for Python with an augmented interface and a couple of extra tools (called "magic functions") designed to make Python a little easier to use. In particular, the emphasis is on making it easier to work with Python interactively ("iPython" stands for "interactive Python"). iPython also has its own tutorial.

There are a few components to iPython, which include (but are not limited to):

  • Magic Functions: quick commands to do things like time the execution of a set of commands
  • Easy access to help documentation: if you type ? after the name of an object, iPython will open associated help documents (if it can). So if you have a pandas dataframe called "df" and type df? you'll immediately be shown the help documentation for dataframe objects provided by the developer.
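The magic functions and the ? lookup only work inside iPython, but you can get a feel for what they do from plain Python. The sketch below uses the standard-library timeit module as a rough stand-in for the %timeit magic, and reads a docstring directly as a stand-in for ?. The statement being timed is just an arbitrary example.

```python
import timeit

# In iPython you would type:   %timeit sum(range(1000))
# In plain Python, the timeit module does a similar job.
seconds = timeit.timeit("sum(range(1000))", number=1000)
print("1000 runs took {:.4f} seconds".format(seconds))

# In iPython you would type:   len?
# In plain Python, the same text lives in the object's docstring
# (help(len) prints a formatted version of it).
doc = len.__doc__
print(doc)
```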

Not everyone uses iPython, but it is very popular among the "scientific computing" community (people who use Python for statistical analyses and data processing, as opposed to writing web applications). As a social scientist, you are in the "scientific computing" community.

Graphical User Interfaces

If you're gonna work with Python, you want to use an editor that will do things like check for syntax errors as you type.

When picking a GUI, keep in mind that most social scientists (even if they don't know it) like to do what's called interactive programming -- where you run a line, look at results, run another line, etc. This isn't actually the norm in computer science, so it's not something well supported by all editors. The editors that are best at this do so by integrating an iPython window.

Designed for Interactive Use

  • Enthought's Canopy: checks syntax as you type, uses iPython instead of regular Python (so it's easier to find help documentation and to use magic functions), and has a nice little module for adding functions you can use if you want.
    • Free for students; paid for others; OSX and Windows
  • Spyder IDE: Similar to Enthought's Canopy, but free for everyone. Also comes with iPython integration and all the standard libraries used for scientific computing, like Pandas, NumPy, etc.

More General GUIs

  • PyCharm: A very nice editor for Python with syntax checking.
    • Both free ("community") and paid versions; OSX and Windows
  • PyScripter: Open-source lightweight editor
    • Completely Free; Windows Only

Note that "IDE" stands for "Integrated Development Environment", which is a program that lets you edit code and run chunks as you work (for testing or other purposes) in an embedded Python window.

Getting Python

Currently the best and easiest way to get and work with Python is to use Anaconda. Anaconda is a great tool for installing and managing not just Python, but all the various libraries that are commonly used for social science.

Python 2.x versus Python 3.x

Some years ago, the designer of Python decided to fix some problems in the language, and concluded that to do so properly he had to create a version that was not backwards compatible (i.e. code written in Python 2 will not necessarily run in Python 3). This led to much disagreement, and many people have continued to work with (and support) Python 2.

While this bifurcated community has survived for several years, it seems likely that Python 2 is near its end. Most tools written in Python 2 are now available in Python 3, and the Python community has decided to no longer update Python 2 (it's currently at 2.7). So if you're just starting out, I'd recommend jumping on the Python 3 bandwagon, and even if you're not, but you're at the start of a new project, I'd think about making the switch.

The one caveat is that the GUI recommended above (Canopy from Enthought) does not yet support Python 3. With that said, its developers are hard at work on a Python 3 version, so hopefully you can switch soon. Moreover, most syntax changes in Python 3 (see below) are also available in Python 2.7: the print function and the new division behavior can be switched on with from __future__ import print_function, division at the top of a file, and .format() works as-is. So even if you're working in Python 2.7, you can start using the conventions of Python 3 to help with the transition.

Substantive Differences between Python 2 and Python 3

Most of the differences between Python 2 and Python 3 are under the hood, but there are at least three syntax changes all users will likely run into.


Print: In Python 3, "print" is now a function, so parentheses are mandatory. You can no longer type print "hello world!" . You have to type: print("hello world!") .
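One practical upside of print being a function is that it accepts keyword arguments. Here is a small sketch in Python 3. The in-memory buffer is only there so the output can be inspected and is not something you would normally need.

```python
import io

# Python 2 only (a syntax error in Python 3):
#     print "hello world!"

# Python 3:
print("hello world!")

# sep controls what goes between the arguments.
print("a", "b", "c", sep="-")  # prints a-b-c

# file sends the output somewhere other than the screen.
buffer = io.StringIO()
print("a", "b", "c", sep="-", file=buffer)
captured = buffer.getvalue()
```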

Division: As in many other languages, if you divide one integer by another in Python 2, you always get back an integer. For example, if you type 5 / 2 , Python does what's called integer division, meaning it only returns the integer component of the result (in this case, 2 ). But if you divide a float by an integer, you get the result you would expect -- for example, 5.0 / 2 or float(5) / 2 yields 2.5 .

After some reflection, Python developers decided this was dumb and non-intuitive -- after all, the whole idea of a dynamically-typed language is that you don't have to worry about whether your variables are integers or floats all the time! So in Python 3, 5 / 2 will now do what most people expect -- return 2.5 (stored as a float variable). If you want integer division in Python 3, you have to type 5 // 2 .
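The two operators side by side, as a quick sketch to run in Python 3. The negative example is included because // rounds down rather than toward zero, which can be a surprise.

```python
true_division = 5 / 2     # 2.5 (always a float in Python 3)
floor_division = 5 // 2   # 2   (integer division)
negative_floor = -5 // 2  # -3  (rounds down, not toward zero)
float_floor = 5.0 // 2    # 2.0 (floor division of a float stays a float)

print(true_division, floor_division, negative_floor, float_floor)
```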

String Insertions: If you wanted to put the numerical value of a variable z = 5 in a string in Python 2, you could type: 'the value of variable z is %i' % z , which results in the output 'the value of variable z is 5' .

This method still works in Python 3, but the preferred approach is now the ".format()" method, which replaces each pair of curly braces ( {} ) with the values of the variables passed to format as arguments. So now you would type: 'the value of variable z is {}'.format(z) , and get back 'the value of variable z is 5' .
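Both styles side by side, as a short sketch for Python 3. The last line goes slightly beyond the text above to show that .format() also accepts named fields and format specifications. The names value and half are just illustrative.

```python
z = 5

# Older %-style insertion.
old_style = 'the value of variable z is %i' % z

# .format() style: each {} is filled in by an argument, in order.
new_style = 'the value of variable z is {}'.format(z)

# Named fields, with one decimal place for the second value.
named = 'z is {value} and half of z is {half:.1f}'.format(value=z, half=z / 2)

print(old_style)
print(new_style)
print(named)
```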

Libraries for Data Analysis

  • pandas
  • numpy


If you start working with Pandas, you'll hear a lot about NumPy. NumPy is an extremely fast and memory-efficient library for array and matrix manipulation in Python, and has been used for quite a while by the scientific community. However, NumPy had two issues -- (a) it's a very low-level tool (basically like Matlab in Python) that's not very user friendly, and (b) it couldn't handle data with missing values very well. So Pandas was created on top of NumPy to make use of its speed and efficiency while adding lots of more user-friendly features.
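A small illustration of the missing-values point, assuming both numpy and pandas are installed. The numbers are made up. This only shows the default behavior of each library and is not a full comparison.

```python
import numpy as np
import pandas as pd

# Three observations, one of them missing (NaN).
values = [1.0, np.nan, 3.0]

# With a plain NumPy array, the missing value propagates into the result.
numpy_mean = np.array(values).mean()    # nan

# A pandas Series skips missing values by default.
pandas_mean = pd.Series(values).mean()  # 2.0

# pandas also makes it easy to count what is missing.
n_missing = pd.Series(values).isna().sum()

print(numpy_mean, pandas_mean, n_missing)
```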

Important Concepts and Tricks

All languages have their own quirks -- tricks or concepts that are unique to the language, and make working with it much, much easier. All too often, people who use the language forget that these aren't immediately obvious, and so don't talk about them when they create tutorials. So here are a few of the tricks and comments that fall into the category of "things we wish someone had made us aware of when we were getting started."


  • The best book on Pandas is Python for Data Analysis by Wes McKinney, the originator and lead developer of Pandas. Pandas has continued to evolve since this book was published, so there are some useful tools that aren't included in the book, but it's a great foundation.
  • has good documentation and introductory tutorials.

Not sure whether to use Panda(s)?

Don't say no to Pandas