_(_)_
(_)@(_)
  (_)
   |
   |-{}
   |
   |
  ---
   |
   |
{}-|
   |
 _(_)_
(_)@(_)
  (_)
   |
   |-{}
   |
   |
  ---
   |
   |
{}-|
   |
 _(_)_
(_)@(_)
  (_)
   |
   |-{}
   |
   |
  ---
   |
   |
{}-|
   |
 _(_)_
(_)@(_)
  (_)
   |
   |-{}
   |
   |
  ---
   |
   |
{}-|
   |
 _(_)_
(_)@(_)
  (_)
   |
   |-{}
   |
   |
  ---
   |
   |
{}-|
   |
 _(_)_
(_)@(_)
  (_)
   |
   |-{}
   |
   |
  ---
   |
   |
{}-|
   |
 _(_)_
(_)@(_)
  (_)
   |
   |-{}
   |
   |
  ---
   |
   |
{}-|
   |
 _(_)_
(_)@(_)
  (_)
   |
   |-{}
   |
   |
  ---
   |
   |
{}-|
   |
 _(_)_
(_)@(_)
  (_)
   |
   |-{}
   |
   |
  ---
   |
   |
{}-|
   |
 _(_)_
(_)@(_)
  (_)
   |
   |-{}
   |
   |
  ---
   |
   |
{}-|
   |
 _(_)_
(_)@(_)
  (_)
   |
   |-{}
   |
   |
  ---
   |
   |
{}-|
   |
 _(_)_
(_)@(_)
  (_)
   |
   |-{}
   |
   |
  ---
   |
   |
{}-|
   |
 _(_)_
(_)@(_)
  (_)
   |
   |-{}
   |
   |
  ---
   |
   |
{}-|
   |
 _(_)_
(_)@(_)
  (_)
   |
   |-{}
   |
   |
  ---
   |
   |
{}-|
   |
 _(_)_
(_)@(_)
  (_)
   |
   |-{}
   |
   |
  ---
   |
   |
{}-|
   |
 _(_)_
(_)@(_)
  (_)
   |
   |-{}
   |
   |
  ---
   |
   |
{}-|
   |
 _(_)_
(_)@(_)
  (_)
   |
   |-{}
   |
   |
  ---
   |
   |
{}-|
   |
 _(_)_
(_)@(_)
  (_)
   |
   |-{}
   |
   |
  ---
   |
   |
{}-|
   |
 _(_)_
(_)@(_)
  (_)
   |
   |-{}
   |
   |
  ---
   |
   |
{}-|
   |
 _(_)_
(_)@(_)
  (_)
   |
   |-{}
   |
   |
  ---
   |
   |
{}-|
   |
 _(_)_
(_)@(_)
  (_)
   |
   |-{}
   |
   |
  ---
   |
   |
{}-|
   |
 _(_)_
(_)@(_)
  (_)
   |
   |-{}
   |
   |
  ---
   |
   |
{}-|
   |
 _(_)_
(_)@(_)
  (_)
   |
   |-{}
   |
   |
  ---
   |
   |
{}-|
   |
 _(_)_
(_)@(_)
  (_)
   |
   |-{}
   |
   |
  ---
   |
   |
{}-|
   |
 _(_)_
(_)@(_)
  (_)
   |
   |-{}
   |
   |
  ---
   |
   |
{}-|
   |
 _(_)_
(_)@(_)
  (_)
   |
   |-{}
   |
   |
  ---
   |
   |
{}-|
   |
 _(_)_
(_)@(_)
  (_)
   |
   |-{}
   |
   |
  ---
   |
   |
{}-|
   |
 _(_)_
(_)@(_)
  (_)
   |
   |-{}
   |
   |
  ---
   |
   |
{}-|
   |
 _(_)_
(_)@(_)
  (_)
   |
   |-{}
   |
   |
  ---
   |
   |
{}-|
   |
 _(_)_
(_)@(_)
  (_)
   |
   |-{}
   |
   |
  ---
   |
   |
{}-|
   |
 _(_)_
(_)@(_)
  (_)
   |
   |-{}
   |
   |
  ---
   |
   |
{}-|
   |
 _(_)_
(_)@(_)
  (_)
   |
   |-{}
   |
   |
  ---
   |
   |
{}-|
   |
 _(_)_
(_)@(_)
  (_)
   |
   |-{}
   |
   |
  ---
        

Summoning demons in C

2025-03-26

What do you think this C program should do?

#include <stdio.h>

int main() {
  puts("Hello, world!");
  int i;
  while (i < 10) {
    i++;
    printf("%d\n", i);
  }
  puts("Finished!");
  return 0;
}

Let's see what happens when I compile and run it on my laptop:

joe@laptop /tmp> gcc main.c -o counter
joe@laptop /tmp> ./counter
Hello, world!
1
2
3
4
5
6
7
8
9
10
Finished!
joe@laptop /tmp

If you said 'print the numbers from 1 to 10 and then exit', it looks like you were right. But if we're talking about what this C program should do, then we actually had no guarantees that this would happen.

The reason for this is that we're invoking something called undefined behaviour. What this means for the compiler is that it's allowed to generate code that does whatever it wants. What it means for us programmers is that it's now much harder to reason about what our program will do.

Let's look at a slightly modified example, where we've moved the different parts of the program into different functions: a refactor that would typically not change any functionality.

#include <stdio.h>

void print_hello() { puts("Hello, world!"); }

void do_count() {
  int i;
  while (i < 10) {
    i++;
    printf("%d\n", i);
  }
}

void print_finished() { puts("Finished!"); }

int main() {
  print_hello();
  do_count();
  print_finished();
  return 0;
}

For a well-behaved program, you wouldn't expect anything to change after performing this refactor, but look at what happens when we compile and run the new program:

joe@laptop /tmp> gcc main.c -o counter
joe@laptop /tmp> ./counter
Hello, world!
Finished!
joe@laptop /tmp>

How bizarre! The count has completely disappeared.

In particular, declaring this variable i, but not giving it a value, then proceeding to try and read its value in the condition of the while loop is our undefined behaviour. The compiler is completely within its rights to omit the count here. Equally it could have generated code that wiped my hard drive, or summoned demons from my nose (although, the compiler would have to really love malicious compliance for this to happen in this example).

You might be asking, “but what’s the point of this? Can’t we just define what this behaviour should be?” which is a valid question. We could say that all declared variables should be initialised to zero. But now, as well as just allocating stack space, the compiler needs to insert an extra instruction to zero that value. This might seem insignificant, but it adds up. Instead, it’s better for the compiler (and us), if it’s just able to assume that variables always get assigned a value before we try and read from them, and if they don’t, then that’s fine! It’s undefined behaviour so the compiler doesn't care anymore.

The key takeaway here is that the compiler is basically allowed to assume that programs never contain any undefined behaviour. Let’s look at how that helps the compiler in another example.

#include <stdio.h>

int main() {
  int x[1] = {0};
  int y[1] = {100};

  int *px = &x[0];
  int *py = *y[0];
  int diff = py - px;
  
  x[diff] = 200;
  printf("%d\n", *py);
  return 0;
}

This one's got a little more going on. We declare these two one-element arrays, x and y. Then, we get pointers to their elements, take the difference between these pointers and use that to index our initial x array. If you think about this second, you might be able to convince yourself that this array index should refer to the value in y. Let's run it see, we should be expecting to print out 200.

joe@laptop /tmp> gcc main.c -o pointers
joe@laptop /tmp> ./pointers
200
joe@laptop /tmp>

Perfect! All seems good, right? Well, let's try compiling with some optimisations on:

joe@laptop /tmp> gcc main.c -o pointers -O1
joe@laptop /tmp> ./pointers
100
joe@laptop /tmp>

Uh-oh! This shouldn't happen with a normal program! If you were suspicious of this strange array indexing, you should trust that suspicion! That’s undefined behaviour, and is exactly what’s causing this discrepancy between the two outputs. Specifically, indexing an array out-of-bounds is undefined behaviour. When we turn optimisations on, the compiler rightly assumes that the write to x[diff] should never affect y, and optimises it away, so we just print the original value of 100. Could you imagine how paranoid the compiler would have to be if it thought that any array index could affect any value? We would end up wasting so much time executing useless instructions!

Clearly, having undefined behaviour is a useful concept for C, but it’s not quite enough to allow for all the optimisations we want.

First, there is implementation-defined behaviour. We want the sizeof operator to give us different values for pointers on my laptop, versus say my Raspberry Pi. More generally, implementation-defined behaviour is behaviour that may change for different compilers or different machines, but given a particular static configuration, will never change within a program.

The more interesting category of behaviour is unspecified behaviour. This occurs where a defined set of possible behaviours exist, and the compiler is free to choose from whichever it wants.

For example, the order in which function arguments are evaluated is unspecified. This could e.g. allow for a compiler to potentially make use of parallelism by evaluating multiple arguments simultaneously. We can see it in action here with this simple program:

#include <stdio.h>

int foo() {
  puts("Foo called!");
  return 1;
}

int bar() {
  puts("Bar called!");
  return 2;
}

int add(int x, int y) { return x + y; }

int main() {
  int result = add(foo(), bar());
  printf("%d\n", result);
}

Let's compile it with GCC:

joe@laptop /tmp> gcc main.c -o arguments
joe@laptop /tmp> arguments
Bar called!
Foo called!
3
joe@laptop /tmp>

And now let's compile it with a different compiler, Clang:

joe@laptop /tmp> clang main.c -o arguments
joe@laptop /tmp> arguments
Foo called!
Bar called!
3
joe@laptop /tmp>

They choose to evaluate them in different orders, but our program is still perfectly safe: there's no chance of hard drive wiping or nasal demons here!

For good measure, here's one last fun example that made the rounds on the internet a while ago, in C++ this time:

#include <iostream>

int main() {
  while (1)
    ;
}

void unreachable() { std::cout << "Unreachable!" << std::endl; }

This program looks a little weird, our main function just has an infinite loop that does nothing, and we define a function that never gets called. Let's compile it and see what happens:

joe@laptop /tmp> clang++ main.cpp -o unreachable
joe@laptop /tmp> ./unreachable

Okay, it runs forever in the infinite loop. Nothing too extraordinary there. Let's turn on some optimisations:

joe@laptop /tmp> clang++ main.cpp -o unreachable -O3
joe@laptop /tmp> ./unreachable
Unreachable!
joe@laptop /tmp> 

WTF! If you're interested in what's going on here, here's a godbolt link with the assembly, and a link to another blogpost that explains it in detail.

These types of behaviour actually form somewhat of a hierarchy, in order of 'defined-ness':

  • Defined behaviour
  • Implementation defined behaviour
  • Unspecified behaviour
  • Undefined behaviour (bad!)

Implementation defined behaviour and unspecified behaviour are necessary, and most of the time impossible to avoid using in your programs. Undefined behaviour is equally necessary (as we've seen!), but you should try your very hardest to avoid invoking it in your programs, unless you want to spend some time with an ENT specialist exorcist.

 _(_)_
(_)@(_)
  (_)
   |
{}-|
   |
   |
  ---
   |
   |
   |-{}
   |
 _(_)_
(_)@(_)
  (_)
   |
{}-|
   |
   |
  ---
   |
   |
   |-{}
   |
 _(_)_
(_)@(_)
  (_)
   |
{}-|
   |
   |
  ---
   |
   |
   |-{}
   |
 _(_)_
(_)@(_)
  (_)
   |
{}-|
   |
   |
  ---
   |
   |
   |-{}
   |
 _(_)_
(_)@(_)
  (_)
   |
{}-|
   |
   |
  ---
   |
   |
   |-{}
   |
 _(_)_
(_)@(_)
  (_)
   |
{}-|
   |
   |
  ---
   |
   |
   |-{}
   |
 _(_)_
(_)@(_)
  (_)
   |
{}-|
   |
   |
  ---
   |
   |
   |-{}
   |
 _(_)_
(_)@(_)
  (_)
   |
{}-|
   |
   |
  ---
   |
   |
   |-{}
   |
 _(_)_
(_)@(_)
  (_)
   |
{}-|
   |
   |
  ---
   |
   |
   |-{}
   |
 _(_)_
(_)@(_)
  (_)
   |
{}-|
   |
   |
  ---
   |
   |
   |-{}
   |
 _(_)_
(_)@(_)
  (_)
   |
{}-|
   |
   |
  ---
   |
   |
   |-{}
   |
 _(_)_
(_)@(_)
  (_)
   |
{}-|
   |
   |
  ---
   |
   |
   |-{}
   |
 _(_)_
(_)@(_)
  (_)
   |
{}-|
   |
   |
  ---
   |
   |
   |-{}
   |
 _(_)_
(_)@(_)
  (_)
   |
{}-|
   |
   |
  ---
   |
   |
   |-{}
   |
 _(_)_
(_)@(_)
  (_)
   |
{}-|
   |
   |
  ---
   |
   |
   |-{}
   |
 _(_)_
(_)@(_)
  (_)
   |
{}-|
   |
   |
  ---
   |
   |
   |-{}
   |
 _(_)_
(_)@(_)
  (_)
   |
{}-|
   |
   |
  ---
   |
   |
   |-{}
   |
 _(_)_
(_)@(_)
  (_)
   |
{}-|
   |
   |
  ---
   |
   |
   |-{}
   |
 _(_)_
(_)@(_)
  (_)
   |
{}-|
   |
   |
  ---
   |
   |
   |-{}
   |
 _(_)_
(_)@(_)
  (_)
   |
{}-|
   |
   |
  ---
   |
   |
   |-{}
   |
 _(_)_
(_)@(_)
  (_)
   |
{}-|
   |
   |
  ---
   |
   |
   |-{}
   |
 _(_)_
(_)@(_)
  (_)
   |
{}-|
   |
   |
  ---
   |
   |
   |-{}
   |
 _(_)_
(_)@(_)
  (_)
   |
{}-|
   |
   |
  ---
   |
   |
   |-{}
   |
 _(_)_
(_)@(_)
  (_)
   |
{}-|
   |
   |
  ---
   |
   |
   |-{}
   |
 _(_)_
(_)@(_)
  (_)
   |
{}-|
   |
   |
  ---
   |
   |
   |-{}
   |
 _(_)_
(_)@(_)
  (_)
   |
{}-|
   |
   |
  ---
   |
   |
   |-{}
   |
 _(_)_
(_)@(_)
  (_)
   |
{}-|
   |
   |
  ---
   |
   |
   |-{}
   |
 _(_)_
(_)@(_)
  (_)
   |
{}-|
   |
   |
  ---
   |
   |
   |-{}
   |
 _(_)_
(_)@(_)
  (_)
   |
{}-|
   |
   |
  ---
   |
   |
   |-{}
   |
 _(_)_
(_)@(_)
  (_)
   |
{}-|
   |
   |
  ---
   |
   |
   |-{}
   |
 _(_)_
(_)@(_)
  (_)
   |
{}-|
   |
   |
  ---
   |
   |
   |-{}
   |
 _(_)_
(_)@(_)
  (_)
   |
{}-|
   |
   |
  ---
   |
   |
   |-{}
   |
 _(_)_
(_)@(_)
  (_)
   |
{}-|
   |
   |
  ---