Programming Languages (Topic)

From Tools for Applied Data Analysis

Python, Java, C++, Scheme, etc. The world is full of programming languages, so which one should you learn to use?


Guiding Principles

The Performance / Ease-of-Use Trade-Off

Before picking a language, it's useful to understand a core concept in programming -- the trade-off between performance and ease-of-use. Programming languages exist on a spectrum, from "High-Level" languages (like Python) to "Low-Level" languages (like C++). In a high-level language, you tell the computer what you want it to do, but you don't have to worry too much about how the computer is supposed to do it. As a result, high-level languages are generally easier to use, and do lots of things to make sure you (the programmer) aren't doing anything stupid.

Low-level languages, by contrast, require the programmer to be very precise about how the computer is supposed to do everything that is asked of it. For example, in low-level languages you can't just create a variable and put something in it -- you also have to tell the computer the data type of the variable (integer, double, float, string, etc.). In some very low-level languages (like C++), you often also have to manage explicitly where in memory your data is stored. Moreover, low-level programming languages won't spend time making sure that you (the programmer) aren't making any mistakes.

Because low-level languages don't waste time trying to figure out what the programmer wants them to do, and don't check to make sure the programmer hasn't made any mistakes, programs written in low-level languages tend to be much faster than programs written in high-level languages.

The Hybrid Approach

Does that mean you should be using a low-level language like C++? If you're a social scientist, then probably not. Most social scientists prefer easy-to-use high-level languages like Python -- they automatically check for stupid mistakes, and in most cases the performance gains aren't worth the time and energy required to learn a difficult low-level language like C++.

But if you do find you need to write a program where performance is really important, it's often possible to write just that small bit of code in C++. Most high-level languages have the ability to "call" bits of low-level code for individual tasks where high performance is important. For example, some people working in R will occasionally write short bits of code in C++ when they come across a problem (like numerical optimization) where performance is really, really important.

Moreover, when you're working with libraries or programs that other people have written, those libraries have often been written in C++ anyway! So even if you're invoking a library in Python, the code running in the background may actually be C++. If you don't know C++ you may not be able to open those libraries and modify them, but if you just want to use a pre-bundled program, then you can still basically get the performance gains of C++ without actually working in C++.
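
As a rough illustration of getting low-level speed from a high-level language, the sketch below adds up the same numbers twice: once with a loop written in Python, and once with Python's built-in sum function. It assumes you are running CPython (the standard Python interpreter), where the built-in is implemented in C. The function name python_loop_sum and the list size are made up for this example, and the exact timings will vary from machine to machine.

```python
import timeit


def python_loop_sum(values):
    """Add up the values one at a time in pure Python."""
    total = 0
    for v in values:
        total += v
    return total


values = list(range(100_000))

# Both approaches give the same answer...
loop_result = python_loop_sum(values)
builtin_result = sum(values)

# ...but the built-in typically finishes sooner, because in CPython
# its loop runs in compiled C code rather than in Python.
loop_seconds = timeit.timeit(lambda: python_loop_sum(values), number=20)
builtin_seconds = timeit.timeit(lambda: sum(values), number=20)

print(f"pure-Python loop: {loop_seconds:.4f} s")
print(f"built-in sum:     {builtin_seconds:.4f} s")
```

You never have to read or write any C to get the benefit; you just call the pre-built function.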

Static-Typed versus Dynamically-Typed Languages

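Broadly speaking, there are two classes of programming languages -- static-typed and dynamically-typed languages. (These are sometimes loosely called "typed" and "untyped" languages, although dynamically-typed languages do still have types.)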

In a static-typed language (like C++ or Java), whenever you create a variable, you also have to tell the program what kinds of values are going to go into that variable. For example, you can't just create a variable "x" and put whatever you want into it; you have to tell the computer what type of value it will see for that variable (integer, string, etc.), and that can't change. If you create a variable "x" and tell the program it's an integer variable, then try to assign the value "here's some text!" to that variable, the compiler will reject the program with an error instead of running it.

In a dynamic-typed language (like R or Python), the program figures this out for you. You just create your variable and put whatever you want in it, and based on what you put into it, the program decides how to store it. So if you create a variable "x" and then assign it the value 5, it will store that as a number. But if you then assign "x" the value "here's some text!", then the program will seamlessly change "x" to a string variable.
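
The example in the paragraph above can be tried directly in Python. This is only a small sketch; the names first_type and second_type are made up here to record what Python reports at each step.

```python
x = 5
first_type = type(x).__name__    # Python stored x as an integer

x = "here's some text!"
second_type = type(x).__name__   # the same name now holds a string

print(first_type, second_type)   # prints: int str
```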

Clearly, dynamic-typed languages are much easier to work with, but this ease of use comes at a cost. Every time you ask a program like R or Python to do something with a variable, it first has to stop and check the type of that variable, which takes time. This is a big part of why C++ and Java are much faster than R or Python -- they don't waste time checking these things!
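
To see that run-time type check in action, here is a minimal sketch in Python. The function add is a made-up example; the point is that the same line of code does different things depending on the types Python finds when the line actually runs, and that a mismatch is only discovered at that moment.

```python
def add(a, b):
    # Python inspects the types of a and b every time this line runs
    # in order to decide what "+" should mean.
    return a + b


number_result = add(2, 3)          # integer addition
text_result = add("data", "set")   # string concatenation

try:
    add(2, "set")                  # mixing types is only caught at run time
except TypeError as err:
    mixed_error = type(err).__name__

print(number_result, text_result, mixed_error)
```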

Compiled versus Interpreted Languages

A concept closely related to the difference between static-typed and dynamically-typed languages is the distinction between compiled and interpreted languages. Most high-level languages -- like R and Python -- are interpreted languages, meaning that when you run code, the interpreter still has to do a fair amount of work to figure out exactly what the hardware needs to do to accomplish what you've asked. As a result, they tend to be slower.

Compiled languages like Java and C / C++, by contrast, convert user code into specific instructions once (during a process known as compilation), and afterward simply execute those instructions without having to think about the code again. When you finish writing a program in C, you "compile" it: a program called a compiler reads the code and converts it into specific instructions for what the computer hardware needs to do ("machine code"). (Java's compiler produces an intermediate "bytecode" that the Java Virtual Machine then translates for the hardware, but the idea is similar.) Then in the future, whenever that program is called, the computer can just read the instructions that have already been written, which makes compiled programs much faster.

Most dynamically-typed languages are interpreted languages, and most static-typed languages are compiled. The reason for this is that in a dynamically-typed language, the program doesn't know the types of variables in advance, so it can't actually come up with directions for what exactly the computer hardware should be doing. By contrast, in a static-typed language the program knows exactly what it will get anytime a function is called and what it needs to do, so it can write hardware instructions before it's actually pointed at a dataset.

With that said, there is one huge cost to compiled languages -- because they are designed to read entire programs at once and convert the whole program into hardware instructions at once, they aren't good for interactive programming, where the user runs one line of code, examines the results, and then writes and runs another. So though interpreted languages may be slower, they are often much better suited to things like data-exploration than compiled languages.

Common Languages

Python

Python is a high-level, dynamically-typed language. That means you don't have to worry too much about defining variable types and allocating memory, and you can program interactively, making it very popular among political scientists. Python has also become very popular with social scientists because the language comes with lots of tools for dealing with text data (strings).
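
As a small taste of those text tools, the sketch below splits a sentence into words, cleans them up, and counts them using only what ships with Python. The sentence and the variable names are made up for illustration.

```python
from collections import Counter

sentence = "The world is full of programming languages, so which one should you learn?"

# Split on whitespace, strip punctuation from each word, and lowercase it.
words = [w.strip(",.?").lower() for w in sentence.split()]

counts = Counter(words)          # how many times each word appears
longest = max(words, key=len)    # the longest word in the sentence

print(len(words), longest)
```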

In addition, while Python is a high-level, dynamically-typed language (and is thus a little slow), it has lots of libraries that are actually written in C or C++, meaning it can do many things extremely fast.

R

R is actually a full programming language. Unlike Python, it is not really used outside of statistical analysis, but lots of tools have been written for R. Similar to Python, it is a high-level, dynamically-typed language, making it much easier to use than a language like Java or C++, but also slower. However, as with Python, many R tools have been written in C++, and so can still achieve high speeds.

Java

Java is a mid-level, compiled language. You have to define variable types, and it is not designed for interactive programming. However, it is a popular, mature language that many people use, so it has a strong community.

Java is also a gateway-language for C++, which is a much less forgiving language than Java, but uses similar syntax.

C / C++

C and C++ are the languages you use when you REALLY care about performance, and are REALLY good at programming.

(C and C++ are basically the same thing -- C++ just adds a few bells and whistles to C, like tools for Object Oriented Programming)

One of the things that makes C++ so fast is that, unlike languages like Python and Java, C++ doesn't waste any time trying to protect you. To give an example, if you were to try to grab the 10th item from a vector with only 9 items, Python and Java would both yell "YOU CAN'T DO THAT!". C++, by contrast, would go to the place in memory just past where the 9th item is stored, grab whatever random bits of data happen to be there, and give them back to you without telling you that something bad just happened. So... don't play with C++ if you don't know what you're doing!
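
For comparison, here is roughly what the "yelling" looks like on the Python side. This is a minimal sketch with a made-up list called items: asking for a 10th item that doesn't exist stops the program with an error rather than quietly handing back garbage.

```python
items = list(range(1, 10))   # nine items: 1 through 9

try:
    tenth = items[9]         # index 9 would be the 10th item
except IndexError as err:
    message = str(err)
    print("Python refused:", message)
```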