While trying to solve the String Calculator code kata, I realized how this simple exercise is a very good example to illustrate how a single problem can be solved with different levels of abstraction.

The requirements are very simple: given a string in input containing a list of numbers separated with arbitrary (non digit) characters, we should have a function called “add”, returning the sum of all these numbers. For instance, given the string “1,2\n3bar4”, we should return 10.

I will show different solutions to solve this problem, first with imperative programming, then with functional programming, and each time raising the level of abstraction.

Let’s start with the most imperative code we could think about: C code without any dependency on external libraries (including the libC):

int add(const char *numbers)
{
    int sum = 0;
    int curNumber = 0;
    char curChar;
    /* split, parse and accumulate */
    do {
        curChar = *numbers;
        if (curChar >= '0' && curChar <= '9') {
            curNumber = 10 * curNumber + (curChar - '0');
        } else {
            sum += curNumber;
            curNumber = 0; 
        }
        numbers++;
    } while (curChar);
    return sum;
}

The algorithm is quite straightforward: we just read the input string character by character until we reach its end (i.e. when the current character is ‘\0’, which is the marker for the end of a C string), and we decode the numbers on the fly manually. When we encounter a separator, i.e. a character which is not a digit, we accumulate the current decoded number into the sum variable, and reset it.

This code does the job, and very efficiently actually. It relies on a few tricks, but nothing really unusual for a C programmer, so in the end it’s fairly readable if you know C.

This is a good example of pure imperative code, as it tells the computer exactly how the result has to be computed. As it is very low level, it can be translated very easily to assembly language, and if you don’t believe me, here is the equivalent assembly code (x86-64):

    xor     eax, eax              ; sum = 0
    xor     ecx, ecx              ; curNumber = 0   
    jmp     .LOOP2
.LOOP1:
    lea     ecx, [rcx+rcx*4]      ; curNumber = 10 * curNumber
    lea     ecx, [rdx-48+rcx*2]   ;     + (curChar - '0')    
.LOOP2:
    add     rdi, 1                ; numbers++
    movsx   edx, BYTE PTR [rdi-1] ; curChar = *(numbers-1);
    lea     esi, [rdx-48]         ; if ((curChar - '0')
    cmp     sil, 9                ;     <= 9)
    jbe     .LOOP1
    add     eax, ecx              ; sum += curNumber
    test    dl, dl                ; while (curChar)
    je      .RETURN
    xor     ecx, ecx              ; curNumber = 0
    jmp     .LOOP2
.RETURN:
    rep ret                       ; return sum

As you can see, there is an almost direct mapping from C code to assembly, with just minor clever optimizations and re-ordering to reduce the number of instructions (this is the code produced by gcc 6.3 with -O2 flag, slightly re-arranged).

Such imperative code is not bad by itself, especially if performance is a major concern, however this code has some drawbacks:

It assumes we know precisely how a string is represented in memory, and it will have to be completely rewritten if we had to support multiple string encodings (for instance UTF-8 and UTF-16) The algorithm mixes in the same loop very different things: splitting (finding the numbers in the input string), parsing (converting strings to integers), and accumulation (i.e. computing the sum). Though this has been done to maximize performance, the counterpart is it makes the code difficult to understand at first sight, and not very maintainable. Now let’s raise a bit the level of abstraction, by moving from C to C++:

int add(const string& numbers) 
{
    // split
    vector<string> tokens;
    boost::split(tokens, numbers, !boost::is_digit());
    // parse and accumulate
    int sum = 0; 
    for (const auto& token: tokens) {
        sum += atoi(token.c_str());
    }
    return sum;
}

This code is certainly less performant than the previous C version, but it’s much cleaner as it makes clear that the algorithm is made of two different parts:

  • First, the input string is split based on a predicate (!is_digit()), and the result of this split is a vector of strings
  • Then, each item of the vector is parsed into an integer and accumulated into the sum variable (Note: even though atoi is not the best function to convert a string to a integer, it has been used on purpose as it handles nicely the case of numbers being an empty string)

But parsing and accumulation are still mixed together in a single for loop, so if we try to fix that we can isolate these two steps in two different loops:

int add(const string& numbers) {
    // split
    vector<string> tokens;
    boost::split(tokens, numbers, !boost::is_digit());
    // parse
    vector<int> intTokens;
    for (const auto& token: tokens) {
        intTokens.push_back(atoi(token.c_str()));
    }
    // accumulate
    int sum = 0;
    for (int intToken: intTokens) {
        sum += intToken;
    }
    return sum;
}

At this stage, the code starts to be quite verbose (it’s still C++ after all!). But if we take a closer look at the for loops, we can notice two things:

The first loop converts a vector of strings to a vector of ints, by applying the same function to each item. In functional programming, this is a very common pattern, called a map operation. In C++, the word “map” already has a different meaning, so this is rather called transform.

The second loop walks through a vector of integers, and combines them with a “+” operation. In functional programming, this is again a common pattern called fold (or reduce if the function used to combine the items is commutative). As C++ likes to use different names, this operation is called accumulate, which is a good name in our case but misleading in the general case, as it can do many other things than computing a sum.

So, if we replace the loops by functional algorithms, we get this:

int add(const string& numbers) 
{
    // split
    vector<string> tokens;
    boost::split(tokens, numbers, !boost::is_digit());
    // parse
    vector<int> intTokens;
    boost::range::transform(tokens, back_inserter(intTokens), 
        [] (const string& token){ return atoi(token.c_str()); });
    // accumulate
    int sum = boost::accumulate(intTokens, 0);
    return sum;
}

This code is quite nice, but still polluted with the verbosity of C++. This language was not really meant to be used for functional programming at this beginning, so writing (and reading) functional code in C++ easily turns into a wizardry exercise!

What if we use a language more fit for functional programming, such as Python? Here is what we could get:

def add(numbers):
    // split
    tokens = re.split('[^\d]+', numbers)
    // parse
    int_tokens = map(int, tokens)
    // accumulate
    return sum(int_tokens)

Now, this is readable code! The 3 steps of the algorithm are clearly visible, the map operation is called map as expected, and computing the sum of a list is just a matter of passing it to the sum function.

If we inline all the function calls, this can be rewritten in this more concise form:

def add(numbers):
    return sum(map(int, re.split('[^\d]+', numbers)))

It’s hard to see how this code could be made more expressive and simple.

But this would be forgetting there is a language much more powerful than Python regarding functional programming: Haskell.

In Haskell, the “add” function can be rewritten like this:

add numbers = sum (map read (wordsBy (not . isDigit) numbers))

This is quite similar to the Python implementation, with a few differences:

  • The split is done with the wordsBy function, whose first argument is a predicate, i.e a function taking a Char as argument and returning a Bool
  • The conversion from String to Int is done by the read function
  • Unlike Python, Haskell is a statically typed language. Thanks to its powerful type inference mechanism, the compiler automatically knows that the add function takes a String as input and returns a number (though it doesn’t know this result has to be an Int, unless we specify it in the function type signature).

Note that the predicate not . isDigit is actually the composition of two functions: isDigit and not.

By using this powerful function composition operator (.), we can rewrite the add function this way:

add = sum . map read . wordsBy (not . isDigit)

Here the numbers parameter has disappeared! Indeed it was not really needed to define what the add function really is, so in Haskell we can just drop it (this is called η-conversion), and keep the strict minimum.

Now we have reached the highest possible level of abstraction: there is nothing to add, nothing to remove.

It’s interesting to go back to the etymology of the word abstraction, which means “draw away” in Latin. Indeed, abstraction means removing the unnecessary details from something, and extract a general concept from it.

This is actually what we do in functional programming: we give names to general algorithms, such as sum, ignoring the implementation details: is the computation done in sequence or in parallel? We don’t (need to) know. Is the input a vector, a linked list, or a tree? We don’t care.

This is abstraction, this is the beauty of functional programming!