Assembly instructions distribution

Frederic Cambus June 13, 2022 [Assembly] [Compilers] [Toolchains]

In my article about running FreeBSD on the Vortex86DX CPU, I mentioned using objdump to disassemble kernels in order to check whether they were using CMOV instructions or not.

One thing leading to another, I thought it would be fun to calculate the distribution of assembly instructions in ELF binaries. It turns out it can be done rather easily with a bit of Shell foo.

For the purpose of this article, I used SQLite 3.38.5 (2022-05-06) built with GCC 12.1.1 on Fedora 36 using the default optimization level (-O2) as a target binary to test instructions distribution against. It is a self-contained, full-featured SQL database engine as a single binary, making it an excellent choice for our experiment.

$ gcc --version
gcc (GCC) 12.1.1 20220507 (Red Hat 12.1.1-1)
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

For the record, the example below is using objdump 2.37 from GNU binutils on a Linux system.

The naive solution:

objdump -dj .text --no-show-raw-insn --no-addresses sqlite3 | \
	grep $'^\t' | awk '{ print $1 }' | \
	sort | uniq -c | sort -nr

There is a problem with this approach, however: it doesn't differentiate between instructions and prefixes, so for prefixed instructions we only count the prefix and discard the actual instruction. Given the choice between adding more shell hackery and reaching for a proper instruction decoder, I opted for the latter.
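To make the problem concrete, here is a minimal sketch of what taking the first field does to prefixed instructions. The sample lines are hypothetical objdump-style output written for illustration, not lines taken from the SQLite binary, and `first_field` is a made-up helper name.

```python
from collections import Counter

# Hypothetical objdump-style lines (AT&T syntax), for illustration only.
SAMPLE = [
    "\tmov    %rax,%rbx",
    "\trep stos %rax,%es:(%rdi)",
    "\tlock cmpxchg %rdx,(%rdi)",
    "\trep movsq %ds:(%rsi),%es:(%rdi)",
]


def first_field(line):
    """Mimic awk '{ print $1 }': return the first whitespace-separated field."""
    return line.split()[0]


# Tally the first field of every line, as the shell pipeline does.
counts = Counter(first_field(line) for line in SAMPLE)
for name, count in counts.most_common():
    print(f"{count:7d} {name}")
```

In this sample, `rep` is counted twice while `stos`, `movsq`, and `cmpxchg` never show up in the tally.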

I already knew about Capstone, so this is the one I decided to use. It is widely packaged, supports multiple architectures, and also has Python bindings, which makes it an extremely convenient option. As a word of caution, the reader should be aware that if we compare the results provided by different tools, we will notice divergences. Distinct disassemblers decode instructions differently, and this is exacerbated on x86-64 by how complex its instructions are to decode. Hardware decoders should be considered the only source of truth.

Here is a small Python snippet using Capstone 4.0.2 to print one instruction per line:

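#!/usr/bin/env python3
# Print one mnemonic per line for a raw blob of x86-64 code
import sys

from capstone import CS_ARCH_X86, CS_MODE_64, Cs

# Read the raw contents of the .text section
with open(sys.argv[1], "rb") as file:
    text = file.read()

md = Cs(CS_ARCH_X86, CS_MODE_64)
for insn in md.disasm(text, 0x0):
    # Capstone reports prefixes as part of the mnemonic (e.g. "rep stosq"),
    # so keep only the last word, which is the instruction itself
    print(insn.mnemonic.split()[-1])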

As with objdump earlier, we are only interested in the .text section, so we need to extract it from the binary using objcopy:

objcopy --dump-section .text=sqlite3.text sqlite3

We can then use the Python program we created to compute our distribution:

./insn.py sqlite3.text | sort | uniq -c | sort -nr
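The same tally can also be done in Python instead of the shell. The sketch below is only one possible way to do it, not what was used for this article; the helper name `distribution` and the sample input are hypothetical.

```python
from collections import Counter


def distribution(mnemonics):
    """Count mnemonics and return (name, count) pairs, most frequent first.

    Ties are broken alphabetically so the ordering is deterministic.
    """
    counts = Counter(mnemonics)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical input: one mnemonic per entry, as printed by insn.py.
sample = ["mov", "test", "mov", "call", "je", "mov", "test"]

for name, count in distribution(sample):
    print(f"{count:7d} {name}")
```

To feed it real data, the mnemonics could be read from standard input, for example with `distribution(line.strip() for line in sys.stdin)`. Either way, the output has the same shape: a count followed by a mnemonic.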

Here are the first 10 lines of results, showing the most used instructions:

 106158 mov
  19462 test
  17918 call
  15866 je
  15566 cmp
  13276 jmp
  12720 nop
  12072 jne
  11629 pop
  10873 xor

The full set of results is available for download here in CSV format.

Here is a visualization of the 50 most used instructions using a logarithmic scale:

While being able to visualize the number of occurrences for every instruction is valuable, the produced charts are a bit noisy if we want to include the whole set. We can remedy this by grouping instructions by categories.
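Grouping can be sketched as a lookup from mnemonic to category, followed by a second tally. The counts below are the top 10 listed earlier, but the category names and the mapping are hypothetical and chosen for illustration; they are not necessarily the classification used for the charts.

```python
from collections import Counter

# Top 10 counts from the results listed earlier in this article.
COUNTS = {
    "mov": 106158, "test": 19462, "call": 17918, "je": 15866, "cmp": 15566,
    "jmp": 13276, "nop": 12720, "jne": 12072, "pop": 11629, "xor": 10873,
}

# Hypothetical mapping; a real one would need to cover every mnemonic.
CATEGORIES = {
    "mov": "data transfer", "pop": "data transfer",
    "call": "control transfer", "je": "control transfer",
    "jmp": "control transfer", "jne": "control transfer",
    "cmp": "arithmetic", "xor": "logical",
    "test": "bit and byte", "nop": "miscellaneous",
}


def group_by_category(counts, categories):
    """Sum per-mnemonic counts into per-category counts."""
    grouped = Counter()
    for mnemonic, count in counts.items():
        # Mnemonics missing from the mapping land in a catch-all bucket.
        grouped[categories.get(mnemonic, "uncategorized")] += count
    return grouped


for category, count in group_by_category(COUNTS, CATEGORIES).most_common():
    print(f"{count:7d} {category}")
```

The catch-all bucket makes gaps in the mapping visible instead of silently dropping instructions.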

Here is another chart, showing instructions grouped by category:

The dataset with instructions grouped by categories is available for download here in CSV format.

So what can we take away from these numbers? MOV is by far the most used instruction, and grouping by category also shows that the top contenders are data transfer instructions. Not surprising, and this is why data locality is so important for performance.

We can also notice that by default, GCC doesn't take advantage of modern CPU features like SSE3, SSE4, and AVX instructions.

GCC 11 and Clang 12 introduced support for x86-64 micro-architecture levels as documented here and here respectively for each compiler. This was first discussed in 2020 and is now part of the x86-64 psABI. Each micro-architecture level can be enabled by setting the -march flag to one of the following values:

x86-64: CMOV, CMPXCHG8B, FPU, FXSR, MMX, OSFXSR, SCE, SSE, SSE2
x86-64-v2: (close to Nehalem) CMPXCHG16B, LAHF-SAHF, POPCNT, SSE3, SSE4.1, SSE4.2, SSSE3
x86-64-v3: (close to Haswell) AVX, AVX2, BMI1, BMI2, F16C, FMA, LZCNT, MOVBE, XSAVE
x86-64-v4: AVX512F, AVX512BW, AVX512CD, AVX512DQ, AVX512VL
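The levels are cumulative: each one requires every feature of the levels before it. As a minimal sketch of what that means, the snippet below picks the highest level covered by a given feature set. The feature names follow the list above (with OSFXSR as in the psABI), while the `highest_level` helper and the example CPU are hypothetical.

```python
# Features added by each level, in order; a level also requires
# everything listed for the levels before it.
LEVELS = [
    ("x86-64", {"CMOV", "CMPXCHG8B", "FPU", "FXSR", "MMX", "OSFXSR",
                "SCE", "SSE", "SSE2"}),
    ("x86-64-v2", {"CMPXCHG16B", "LAHF-SAHF", "POPCNT", "SSE3", "SSE4.1",
                   "SSE4.2", "SSSE3"}),
    ("x86-64-v3", {"AVX", "AVX2", "BMI1", "BMI2", "F16C", "FMA", "LZCNT",
                   "MOVBE", "XSAVE"}),
    ("x86-64-v4", {"AVX512F", "AVX512BW", "AVX512CD", "AVX512DQ",
                   "AVX512VL"}),
]


def highest_level(features):
    """Return the highest level fully covered by `features`, or None."""
    best = None
    for name, required in LEVELS:
        # Stop at the first level with a missing feature.
        if not required <= features:
            break
        best = name
    return best


# Hypothetical CPU: everything up to v2, plus AVX only.
cpu = LEVELS[0][1] | LEVELS[1][1] | {"AVX"}
print(highest_level(cpu))
```

Having AVX alone is not enough for x86-64-v3 here, since that level also requires AVX2, BMI1, BMI2, and the rest of its list.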

When building with x86-64-v2, here are the 50 most used instructions:

And here are instructions grouped by category:

We can notice that the x86-64-v2 level takes advantage of SSE3 instructions, and also of SSSE3 and SSE4.1 ones such as PSHUFB and ROUNDSD.

The following extra instructions are used:

ADDSUBPD, BLENDVPD, CMPLESD, FISTTP, MOVDDUP, PABSD, PACKUSDW, PADDQ, PALIGNR, PBLENDW, PEXTRD, PEXTRQ, PINSRD, PINSRQ, PMAXSD, PMAXUD, PMINSD, PMINUD, PMOVSXDQ, PMULLD, POR, PSHUFB, PSLLD, PSRLDQ, and ROUNDSD.
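One way such a list could be derived is a set difference between the mnemonics seen in two builds. The sketch below assumes that approach and uses tiny hypothetical counts rather than the real CSV data; `extra_instructions` is a made-up helper name.

```python
def extra_instructions(baseline, other):
    """Return mnemonics present in `other` but absent from `baseline`, sorted."""
    return sorted(set(other) - set(baseline))


# Hypothetical per-build counts (mnemonic -> occurrences), for illustration.
baseline = {"mov": 100, "paddd": 4, "call": 20}
v2 = {"mov": 98, "paddd": 4, "call": 20, "pshufb": 3, "roundsd": 1}

# Print the result in the same upper-case, comma-separated style as the list.
print(", ".join(name.upper() for name in extra_instructions(baseline, v2)))
```

Note that this only reports instructions that are entirely absent from the baseline build; an instruction that merely becomes more frequent would not appear.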

When building with x86-64-v3, here are the 50 most used instructions:

And here are instructions grouped by category:

The x86-64-v3 level takes advantage of AVX and AVX2 instructions, as well as BMI1, BMI2, and FMA instructions.

The following extra instructions are used:

ANDN, BLSR, LEAVE, LZCNT, MOVBE, RORX, SARX, SHLX, SHRX, VADDPD, VADDSD, VADDSS, VADDSUBPD, VBLENDVPD, VCMPLESD, VCMPLTSD, VCMPNLTSD, VCOMISD, VCOMISS, VCVTDQ2PD, VCVTPD2PS, VCVTPS2PD, VCVTSD2SS, VCVTSI2SD, VCVTSI2SS, VCVTSS2SD, VCVTTSD2SI, VDIVSD, VDIVSS, VEXTRACTI128, VFMADD132PD, VFMADD132SD, VFMADD213SD, VFMADD231SD, VFMSUB132SD, VFNMADD231SD, VINSERTF128, VINSERTI128, VMAXSS, VMINSD, VMINSS, VMOVAPD, VMOVAPS, VMOVD, VMOVDDUP, VMOVDQA, VMOVDQU, VMOVLHPS, VMOVLPS, VMOVQ, VMOVSD, VMOVSS, VMOVUPD, VMOVUPS, VMULPD, VMULSD, VMULSS, VPABSD, VPACKUSDW, VPACKUSWB, VPADDB, VPADDD, VPADDQ, VPADDW, VPALIGNR, VPAND, VPBLENDW, VPBROADCASTB, VPBROADCASTD, VPBROADCASTQ, VPBROADCASTW, VPCMPEQD, VPERMQ, VPEXTRD, VPEXTRQ, VPEXTRW, VPINSRD, VPINSRQ, VPMAXSD, VPMAXUD, VPMINSD, VPMINUD, VPMOVSXDQ, VPMULLD, VPOR, VPSHUFB, VPSHUFD, VPSHUFLW, VPSLLD, VPSUBD, VPUNPCKLQDQ, VPXOR, VROUNDSD, VSHUFPD, VSUBSD, VSUBSS, VUCOMISD, VUCOMISS, VUNPCKLPD, VUNPCKLPS, VXORPD, VXORPS, and VZEROUPPER.

The full counts of used instructions are available for download in CSV format for the x86-64-v2 and x86-64-v3 levels.

I'm leaving out the x86-64-v4 level for now, as I was getting weird instruction counts compared to the other binaries, so it seems something went wrong at some point. I might revisit this in the future.

EDIT: Harold kindly pointed out on Twitter that while it seems MMX instructions are used a lot, it's probably mainly/only the SSE2 versions that are used instead.
