One possible drawback with high level languages such as Python, is that it’s difficult to understand what happens behind the scene when code is executed.

While it’s true that most Python developers probably don’t care about the dirty details of the Python Virtual Machine, some basic knowledge of this hidden side of the language can always be useful, especially when performance starts to enter the game.

Python is an interpreted language (isn’t it?)

Unlike « pure » programming languages such as C or C++ (who said Assembly?), Python is an interpreted language, running on a Virtual Machine (VM).

I’m actually talking about the standard implementation of Python, called CPython (not to be confused with « Cython », which is something else), but there are other versions of Python, like PyPy or Jython, with completely different implementations. Let’s focus on CPython anyway, as it’s what most Python developers use by default.

So, (C)Python is an interpreted language, which is the main reason why it’s considered as a slow language, compared to compiled languages such as C or even Java.

Actually this is not completely true: Python code is always compiled, not into native machine language but into bytecode, which is in turn interpreted by the Python Virtual Machine. Python bytecode is the internal representation of a Python program, and it’s what you can find inside .pyc files, which are produced when you execute a Python script.

A tour of Python bytecode

To understand Python bytecode, one has to understand first how the Python VM works.

The Python VM is a stack based virtual machine, which means its working memory is structured as a stack.

If you ever owned a calculator such as the HP-48, you probably already know what is a stack machine. With the HP-48, to make a simple computation such as 2 * (12 + 42), you had to use a syntax called RPN (Reverse Polish Notation), and enter something weird like 2 12 42 + *, which means:

  • Push « 2 » on the stack
  • Push « 12 » on the stack
  • Push « 42 » on the stack
  • Add the two numbers at the top of the stack (i.e. 12 and 42), and replace them with the result (i.e. 54)
  • Multiply the two numbers at the top of the stack (i.e. 2 and 54), and replace them with the result
  • In the end, you have only the result on the stack (i.e. 108)

The Python VM works exactly the same way, so Python bytecode contains a set of instructions operating on a stack.

Let’s define a trivial Python function:

def f(x, y, z):
    return x * (y + z)

The bytecode produced by this function can be displayed thanks to the built-in dis module:

>>> import dis
>>> dis.dis(f)   
  2           0 LOAD_FAST                0 (x)
              3 LOAD_FAST                1 (y)
              6 LOAD_FAST                2 (z)
              9 BINARY_ADD          
             10 BINARY_MULTIPLY     
             11 RETURN_VALUE

Here is the meaning of this bytecode display:

  • The first column shows the line number in the source code.
  • The second column is the address of the instruction.
  • The third column is the operation code name (or opcode).
  • The fourth column is the operation parameter.
  • The fifth column (inside parentheses) gives the meaning of the parameter, related to the source code.

As you can see, this bytecode is quite similar to the RPN example above:

  • The LOAD_FAST instruction pushes a variable (here a function parameter) on the stack.
  • The BINARY_ADD instruction pops two values from the top of the stack, adds them, and pushes the results on the stack.
  • The BINARY_MULTIPLY instruction does what you guess.
  • And at last, RETURN_VALUE returns the remaining value which is on top of the stack.

There are around 120 different bytecode instructions, which are listed in the Python documentation.

Now let’s see a slightly more complex function, printing the n first integers:

def print_seq(n):
    for i in xrange(n):
        print i

The bytecode produced by this function is the following:

  2           0 SETUP_LOOP              25 (to 28)
              3 LOAD_GLOBAL              0 (xrange)
              6 LOAD_FAST                0 (n)
              9 CALL_FUNCTION            1
             12 GET_ITER            
        >>   13 FOR_ITER                11 (to 27)
             16 STORE_FAST               1 (i)

  3          19 LOAD_FAST                1 (i)
             22 PRINT_ITEM          
             23 PRINT_NEWLINE       
             24 JUMP_ABSOLUTE           13
        >>   27 POP_BLOCK           
        >>   28 LOAD_CONST               0 (None)
             31 RETURN_VALUE        

You can easily understand how it works thanks to the documentation, but here are some comments:

There is an additional column in the display, containing the >> symbol. This indicates that the instruction is a destination of a jump from some other instruction.

CALL_FUNCTION 1 means that there is a function to call with one positional argument, to be found on the top of the stack. The name of the function itself is a value on the stack, pushed by the instruction LOAD_GLOBAL.

We can see that print produces built-in bytecode instructions (PRINT_ITEM and PRINT_NEWLINE), as it’s a special keyword, not a real function.

We can also see that return None is implicitly added at the end of the function, if it has nothing to return.

Inside the Python VM

To finish this overview, let’s look into the virtual machine in more details.

As the name « CPython » implies, the Python VM is written in C, so in the end every single bytecode instruction is interpreted by a bunch of C code.

The core of the VM is the function PyEval_EvalFrameEx, defined in Python/ceval.c, which executes a single bytecode instruction. This function is basically a big switch on the opcode, and each bytecode instruction is implemented inside a case block.

Here is for instance the C implementation of the BINARY_ADD instruction:

TARGET(BINARY_ADD) {
    PyObject *right = POP();
    PyObject *left = TOP();
    PyObject *sum;
    if (PyUnicode_CheckExact(left) &&
                PyUnicode_CheckExact(right)) {
        sum = unicode_concatenate(left, right, f, next_instr);
        /* unicode_concatenate consumed the ref to v */
    }
    else {
        sum = PyNumber_Add(left, right);
        Py_DECREF(left);
    }
    Py_DECREF(right);
    SET_TOP(sum);
    if (sum == NULL)
        goto error;
    DISPATCH();
}

This code just takes the parameters from the stack, and calls PyNumber_Add function, which is defined this way:

#define NB_SLOT(x) offsetof(PyNumberMethods, x)

PyObject *
PyNumber_Add(PyObject *v, PyObject *w)
{
    PyObject *result = binary_op1(v, w, NB_SLOT(nb_add));
    if (result == Py_NotImplemented) {
        PySequenceMethods *m = v->ob_type->tp_as_sequence;
        Py_DECREF(result);
        if (m && m->sq_concat) {
            return (*m->sq_concat)(v, w);
        }
        result = binop_type_error(v, w, "+");
    }
    return result;
}

As Python is a dynamic language, this function makes no assumption about the type of the parameters. The real work is delegated to another function, binary_op1, which is a generic function to perform a binary operation on a pair of objects:

#define NB_BINOP(nb_methods, slot) \
        (*(binaryfunc*)(& ((char*)nb_methods)[slot]))

static PyObject *
binary_op1(PyObject *v, PyObject *w, const int op_slot)
{
    PyObject *x;
    binaryfunc slotv = NULL;
    binaryfunc slotw = NULL;

    if (v->ob_type->tp_as_number != NULL)
        slotv = NB_BINOP(v->ob_type->tp_as_number, op_slot);
    if (w->ob_type != v->ob_type &&
        w->ob_type->tp_as_number != NULL) {
        slotw = NB_BINOP(w->ob_type->tp_as_number, op_slot);
        if (slotw == slotv)
            slotw = NULL;
    }
    if (slotv) {
        if (slotw && PyType_IsSubtype(w->ob_type, v->ob_type)) {
            x = slotw(v, w);
            if (x != Py_NotImplemented)
                return x;
            Py_DECREF(x); /* can't do it */
            slotw = NULL;
        }
        x = slotv(v, w);
        if (x != Py_NotImplemented)
            return x;
        Py_DECREF(x); /* can't do it */
    }
    if (slotw) {
        x = slotw(v, w);
        if (x != Py_NotImplemented)
            return x;
        Py_DECREF(x); /* can't do it */
    }
    Py_RETURN_NOTIMPLEMENTED;
}

And that’s not finished! The actual addition function depends on the type of the parameters, so if we assume that the parameters are float numbers, this addition is implemented by the following function:

static PyObject *
float_add(PyObject *v, PyObject *w)
{
    double a,b;
    CONVERT_TO_DOUBLE(v, a);
    CONVERT_TO_DOUBLE(w, b);
    PyFPE_START_PROTECT("add", return 0)
    a = a + b;
    PyFPE_END_PROTECT(a)
    return PyFloat_FromDouble(a);
}

Got it! We can at last see our addition, done by the line a = a + b !

Now you probably see why numeric computation done in pure Python code is far slower than in C, and why you should use numpy if performance is a concern for you!