Ryan Hileman

C Runtime Overhead

What do you do when libc is your bottleneck? If the total runtime of your process is under 10ms, most of that time (depending on your hardware) is spent waiting for the dynamic linker and C runtime to set things up. Compare that to the Latency Numbers Every Programmer Should Know: it falls right above "disk seek" and "20x datacenter roundtrip".

Stripe launched their third CTF in early 2014. While their previous CTFs were based around finding security vulnerabilities, this one required you to submit working programs related to distributed system problems and graded you on their performance. The scoring architecture involved submitting your source code to a remote Git server, which would kick off a build process and eventually run and grade your program.

The scoring was based on the processor "user" time spent in your program, which can be read via clock(). This had interesting effects: you could sleep(1) without affecting your measured runtime. (I tried exploiting this by daemonizing a worker process, which wouldn't count toward my runtime score, and sleeping until it finished its output, but some of the required syscalls were disallowed.) The general premise of each level was to process the input with the fastest possible running time and no errors.
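
To illustrate why sleep(1) is free under this metric, here's a minimal sketch (not from the CTF) comparing clock() against a full second of wall-clock sleep:

// clock() reports CPU time consumed by the process, not wall-clock time,
// so a sleeping process accrues almost none of it.
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main() {
    clock_t start = clock();
    sleep(1);  /* one second of wall time, ~zero CPU time */
    printf("CPU seconds: %f\n", (double)(clock() - start) / CLOCKS_PER_SEC);
    return 0;
}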

The first level (Level 0) was meant to be introductory, and was thus very simple. You were given a list of words on standard input, and had to print every word to standard output, but with angle brackets around any word not found in a given dictionary. The example solution program was written in Ruby and used an O(n) array lookup for dictionary words. To beat the minimum required time and pass the level, you could simply convert the array of words into an O(1) hash table, like so:

-words = File.read(path).split("\n")
+words = Set.new(File.read(path).split("\n"))

By the time I started working on the CTF, this wasn't really fast enough to get anywhere near the (top 20) leaderboard solutions. I spent the next three days fighting for my spot in the Gitcoin leaderboards while idly looking for better solutions to Level 0.

I switched to a compiled language for a bit of a score bump, then got more creative with my solutions. I tried (and failed at) methods like statically compiling a tree into my binary. I eventually settled on a Bloom filter, and managed to get a working solution in C down to a 20kB binary with the Bloom filter embedded. The word list was 2.4MB, so I got a huge speed boost just by not having to read it from disk. (After asking around, I believe all of the top scoring solutions used either a Bloom filter or a perfect hash function.)
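
The core of a Bloom filter lookup is tiny. Here's a rough sketch of the shape of it; the size and hash functions are placeholders rather than the parameters I actually used, and the bit array would be generated offline from the dictionary and baked into the binary:

// Illustrative Bloom filter membership test against an embedded bit array.
#include <stdint.h>
#include <stdio.h>

#define FILTER_BITS (1u << 23)                      /* placeholder size */
static const uint8_t filter[FILTER_BITS / 8] = {0}; /* generated offline */

static uint32_t fnv1a(const char *s, uint32_t seed) {
    uint32_t h = 2166136261u ^ seed;
    while (*s) { h ^= (uint8_t)*s++; h *= 16777619u; }
    return h;
}

static int get_bit(uint32_t hash) {
    uint32_t i = hash % FILTER_BITS;
    return (filter[i / 8] >> (i % 8)) & 1;
}

/* A word is "maybe in the dictionary" only if every probed bit is set;
   a miss on any bit means it's definitely not in the dictionary. */
static int maybe_in_dict(const char *word) {
    return get_bit(fnv1a(word, 0)) && get_bit(fnv1a(word, 1));
}

int main() {
    printf("%s\n", maybe_in_dict("hello") ? "maybe present" : "definitely absent");
    return 0;
}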

At this point, I felt like I had a very fast solution, but I wasn't even halfway to the top score. I happened to run strace -tt (which adds microsecond-resolution timestamps to each syscall) against my solution, and was pretty surprised when I compared it against a completely empty program. As demonstrated in the following strace output, the empty program took as long as 9.4ms to execute on my machine, without processing any data. This was entirely dynamic linker and C runtime overhead.

// strace output for `int main() { return 0; }`

00:32:27.539873 execve("./overhead", ["./overhead"], [/* 15 vars */]) = 0
00:32:27.541060 brk(0)                  = 0xa7b000
00:32:27.541466 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
00:32:27.541834 open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
00:32:27.542199 fstat(3, {st_mode=S_IFREG|0644, st_size=36586, ...}) = 0
00:32:27.542438 mmap(NULL, 36586, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f4f640be000
00:32:27.542865 close(3)                = 0
00:32:27.543109 open("/usr/lib/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
00:32:27.543455 read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0`\1\2\0\0\0\0\0"..., 832) = 832
00:32:27.543782 fstat(3, {st_mode=S_IFREG|0755, st_size=1984416, ...}) = 0
00:32:27.544162 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4f640bd000
00:32:27.544466 mmap(NULL, 3813200, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f4f63b03000
00:32:27.544796 mprotect(0x7f4f63c9c000, 2097152, PROT_NONE) = 0
00:32:27.545104 mmap(0x7f4f63e9c000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x199000) = 0x7f4f63e9c000
00:32:27.545503 mmap(0x7f4f63ea2000, 16208, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f4f63ea2000
00:32:27.545775 close(3)                = 0
00:32:27.546087 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4f640bc000
00:32:27.546386 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4f640bb000
00:32:27.546664 arch_prctl(ARCH_SET_FS, 0x7f4f640bc700) = 0
00:32:27.547373 mprotect(0x7f4f63e9c000, 16384, PROT_READ) = 0
00:32:27.548209 mprotect(0x7f4f640c7000, 4096, PROT_READ) = 0
00:32:27.548533 munmap(0x7f4f640be000, 36586) = 0
00:32:27.548949 exit_group(0)           = ?
00:32:27.549272 +++ exited with 0 +++

The best part? My Level 0 solution also clocked in at 9ms. It dawned on me. Almost the entire runtime of my program was spent waiting for the C runtime to bootstrap my process. There was nothing left to optimize... unless I could start ripping out pieces of libc, or huck the whole thing and rewrite it in assembly.

I tried a quick runtime test using inline assembly and -ffreestanding -nostdlib to check my theory.

// gcc -m32 -ffreestanding -nostdlib
void _start() {
    /* exit(0): 32-bit Linux syscall 1 in %eax, status 0 in %ebx */
    asm("movl $1,%eax;"
        "xorl %ebx,%ebx;"
        "int  $0x80"
    );
}

00:41:38.884122 execve("./blank", ["./blank"], [/* 15 vars */]) = 0
00:41:38.884552 exit(0)                 = ?
00:41:38.884629 +++ exited with 0 +++

This assembly stub runs in about 0.5ms, and doesn't make any extra syscalls. This was extremely promising.
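
For reference, the same stub on x86-64 (the architecture the other traces in this post were captured on) looks roughly like this, using the syscall instruction and the 64-bit exit syscall number:

// gcc -ffreestanding -nostdlib (x86-64 sketch of the same stub)
void _start() {
    /* exit(0): 64-bit Linux syscall 60 in %rax, status 0 in %rdi */
    asm("movl $60,%eax;"
        "xorl %edi,%edi;"
        "syscall"
    );
}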

I now had a plan. My next step was to statically compile the program. This removed the first few calls from the trace (loading ld.so.cache, reading libc). I was down to about 8ms, with at least 5ms spent in libc before my program even read from standard input.

These were the syscalls made (and precious time spent) before my static program had a chance to do anything:

00:35:19.095150 execve("./main-static", ["./main-static"], [/* 15 vars */]) = 0
00:35:19.095973 uname({sys="Linux", node="vagrant-arch", ...}) = 0
00:35:19.096373 brk(0)                  = 0x16cf000
00:35:19.096724 brk(0x16d01c0)          = 0x16d01c0
00:35:19.097065 arch_prctl(ARCH_SET_FS, 0x16cf880) = 0
00:35:19.097387 readlink("/proc/self/exe", "/vagrant/level0/main-static", 4096) = 27
00:35:19.097779 brk(0x16f11c0)          = 0x16f11c0
00:35:19.098140 brk(0x16f2000)          = 0x16f2000
00:35:19.098539 fstat(0, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
00:35:19.098837 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8d5a7cb000
00:35:19.099192 read(0, "Collaboratively administrate emp"..., 4096) = 1398
00:35:19.099544 fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
00:35:19.099893 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8d5a7ca000

Most of this work actually happened the day after the CTF ended, and I shelved it until now. I've since written enough of a minimal libc to bootstrap my entire solution with -ffreestanding -nostdlib, so I finally have timing data to back up my original hypothesis. My initial times were very bad (around 40ms), but after implementing optimistic buffered IO, the strace for my full solution now looks like this:

00:53:41.882305 execve("./main-43.1", ["./main-43.1"], [/* 15 vars */]) = 0
00:53:41.882920 read(0, "Collaboratively administrate emp"..., 4096) = 1398
00:53:41.882989 read(0, "", 4096)       = 0
00:53:41.883006 write(1, "Collaboratively administrate emp"..., 1449) = 1449
00:53:41.883023 _exit(0)                = ?
00:53:41.883356 +++ exited with 0 +++

That's about 1ms... nearly 1/10th the runtime of the glibc version. It also makes only five syscalls in total.
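
The freestanding plumbing behind this is mostly thin syscall wrappers plus a hand-rolled _start. Here's roughly the shape of it (x86-64, illustrative only rather than the exact code, with an arbitrary buffer size):

// gcc -static -ffreestanding -nostdlib
// Illustrative sketch: raw syscall wrappers, one input buffer, one write.
#include <stddef.h>

static long sys(long n, long a, long b, long c) {
    long ret;
    /* x86-64 Linux syscall ABI: number in rax, args in rdi/rsi/rdx */
    asm volatile("syscall"
                 : "=a"(ret)
                 : "a"(n), "D"(a), "S"(b), "d"(c)
                 : "rcx", "r11", "memory");
    return ret;
}

static long sys_read(int fd, void *buf, size_t len)        { return sys(0, fd, (long)buf, (long)len); }
static long sys_write(int fd, const void *buf, size_t len) { return sys(1, fd, (long)buf, (long)len); }
static void sys_exit(int code)                             { sys(60, code, 0, 0); }

static char buf[1 << 22];  /* arbitrary size, big enough for the input */

static int run(void) {
    size_t used = 0;
    long n;
    /* slurp all of stdin, then emit the (processed) result in one write */
    while ((n = sys_read(0, buf + used, sizeof(buf) - used)) > 0)
        used += (size_t)n;
    /* ... dictionary lookups and bracketing would happen here ... */
    sys_write(1, buf, used);
    return 0;
}

void _start(void) {
    sys_exit(run());
}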

I reckon this solution would've beaten the winning high score for Level 0 by at least 2-4x. It's definitely worth mentioning that the level should have used larger inputs to offset process startup overhead, but then I wouldn't have gone down this road.

If you're running into process startup time issues in a real-world scenario and are tempted to go this far, it's probably worth your time to profile first and try one of the alternative libc implementations (like musl libc or diet libc).
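
With musl, for instance, swapping libc out can be as simple as changing the compiler driver; assuming the musl-gcc wrapper script is installed, something like:

// statically link the same source against musl instead of glibc
// musl-gcc -static -O2 main.c -o main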