Michael Altfield's gravatar

gcc Optimizations for Arithmetic Operations using Bit Shifts

I’ve got a hellacious project due and finals all next week, but this was just too much fun to pass up. In any case, compiler optimization increases compile time, and anything that gives me more time to sword fight on $1000 office chairs is worth a little R&D.

I’m working on writing this cut-down MIPS processor simulator for my Computer Organization class at UCF. I googled “word alignments” to help me better understand the most efficient calculations for converting the Byte Aligned Program Counter address to the Word Aligned Memory array when I ran across an interesting article showing that the mere *order* of variable declarations in a C program can affect the amount of memory used by that program.

The article explained the situation very well, and it makes sense why this issue would happen, but I was surprised that the compiler wouldn’t try to optimize situations like this by re-ordering a set of concurrent variable declarations of alternating data types.

In any case, I continued to hack away at my project when I began to think about whether or not gcc translates multiplication and division operations where one of the operands is a power of 2 into simple bit shift instructions at compile time. As it turns out, it does.

Basis

Alright, here’s the basics: Computers store data in binary. There are many shortcuts in the binary arithmetic that you learn to adore when you’re computing unsigned binary operations by hand. For example, just as shifting a decimal value (of base 10) to the left 1 digit will multiply that value by 10, shifting a binary value (of base 2) to the left 1 digit will multiply that value by 2. Shifting a binary value 2 digits to the left will multiply that value by 4, 3 digits by 8, 4 by 16, etc. This shortcut similarly applies to binary division when shifting to the right.

We can see then that it’s far simpler for a computer to shift a binary number to the left 4 digits like so:

0000 0010 1010 (42 in decimal)
0010 1010 0000 (42*16 = 672 in decimal)

rather than preforming addition over and over again.

So, the question is: when I write

int i = 42;
i = i * 16;

will the compiler translate my code into a series of addition instructions

movl    $42, -4(%rbp)
movl    -4(%rbp), %eax
addl    %eax, %eax
addl    %eax, %eax
addl    %eax, %eax
addl    %eax, %eax
movl    %eax, -4(%rbp)

or to a single left shift instruction

movl    $42, -4(%rbp)
sall    $4, -4(%rbp)

Moreover, gcc doesn’t go straight from C code to machine code. it preforms the following operations (in order)

  1. Preprocessing
  2. Compilation
  3. Assembly
  4. Linking

What I want is to analyze the 3rd step in this process–the assembly code. gcc allows you to (easily) analyze the assembly output of a program.

Note that a whole slew of options exist for analyzing the inner-workings of what gcc is doing as well as operations that directly alter the optimization capabilities of the compilation process. More info can be found from the gcc manual page.

To see a list of optimization options for your installed version of gcc, run the command

$ gcc --help=optimizers

The Test

Still with me? Good. Here’s how I preformed this test.

Below is a straight forward C program that shifts the integer i to the left 4 digits. Mathematically, this is the same as multiplying i by 16.

int main( ){

	int i = 42;
	// shift left 4x (equivalent to multiplying by 16)
	i = i << 4;

	return 0;
}

And below you can see the assembly code that gcc produces when compiling.

$ gcc -c -g -Wa,-a,-ad optest.c
GAS LISTING /tmp/ccQRJ4Fi.s                     page 1


   1                            .file   "optest.c"
   9                    .Ltext0:
  10                    .globl main
  12                    main:
  13                    .LFB2:
  14                            .file 1 "optest.c"
   1:optest.c      **** int main( ){
  15                            .loc 1 1 0
  16 0000 55                    pushq   %rbp
  17                    .LCFI0:
  18 0001 4889E5                movq    %rsp, %rbp
  19                    .LCFI1:
   2:optest.c      **** 
   3:optest.c      ****         int i = 42;
  20                            .loc 1 3 0
  21 0004 C745FC2A              movl    $42, -4(%rbp)
  21      000000
   4:optest.c      ****         // shift left 4x (equivalent to multiplying by 16)
   5:optest.c      ****         i = i &lt;&lt; 4;
  22                            .loc 1 5 0
  23 000b C165FC04              sall    $4, -4(%rbp)
   6:optest.c      **** 
   7:optest.c      ****         return 0;
  24                            .loc 1 7 0
  25 000f B8000000              movl    $0, %eax
  25      00
   8:optest.c      **** }
  26                            .loc 1 8 0
  27 0014 C9                    leave
  28 0015 C3                    ret
  29                    .LFE2:
  67                    .Lframe1:
  68 0000 14000000              .long   .LECIE1-.LSCIE1
  69                    .LSCIE1:
  70 0004 00000000              .long   0x0
  71 0008 01                    .byte   0x1
  72 0009 7A5200                .string &quot;zR&quot;
  73 000c 01                    .uleb128 0x1
  74 000d 78                    .sleb128 -8
  75 000e 10                    .byte   0x10
  76 000f 01                    .uleb128 0x1
  77 0010 03                    .byte   0x3
  78 0011 0C                    .byte   0xc
  79 0012 07                    .uleb128 0x7
  80 0013 08                    .uleb128 0x8
  81 0014 90                    .byte   0x90
  82 0015 01                    .uleb128 0x1
  83 0016 0000                  .align 8
  84                    .LECIE1:
  85                    .LSFDE1:
  86 0018 1C000000              .long   .LEFDE1-.LASFDE1
  87                    .LASFDE1:
  88 001c 1C000000              .long   .LASFDE1-.Lframe1
  89 0020 00000000              .long   .LFB2
  90 0024 16000000              .long   .LFE2-.LFB2
  91 0028 00                    .uleb128 0x0
  92 0029 41                    .byte   0x4

GAS LISTING /tmp/ccQRJ4Fi.s                     page 2


  93                            .long   .LCFI0-.LFB2
  94 002a 0E                    .byte   0xe
  95 002b 10                    .uleb128 0x10
  96 002c 86                    .byte   0x86
  97 002d 02                    .uleb128 0x2
  98 002e 43                    .byte   0x4
  99                            .long   .LCFI1-.LCFI0
 100 002f 0D                    .byte   0xd
 101 0030 06                    .uleb128 0x6
 102 0031 00000000              .align 8
 102      000000
 103                    .LEFDE1:
 104                            .text
 105                    .Letext0:

GAS LISTING /tmp/ccQRJ4Fi.s                     page 3


DEFINED SYMBOLS

The above output from the gcc command has both the C code intermixed with the assembly code. What we’re interested in is on line #23.

...
   3:optest.c      ****         int i = 42;
  20                            .loc 1 3 0
  21 0004 C745FC2A              movl    $42, -4(%rbp)
  21      000000
   4:optest.c      ****         // shift left 4x (equivalent to multiplying by 16)
   5:optest.c      ****         i = i &lt;&lt; 4;
  22                            .loc 1 5 0
  23 000b C165FC04              sall    $4, -4(%rbp)
...

Here’s a breakdown of line 23

  • sal specifies the operation (shift arithmetically left)
  • The suffix l to the sal operation specifies that the operands are double words
  • $4 is the immediate value specifying how many shifts to preform
  • -4(%rbp) specifies the location of the variable we want to shift

Now we can see that gcc takes i = i << 4; and converts this c code into sall $4, -4(%rbp) assembly code.

Multiplication by 3

Just so you can see what a multiplication instruction normally looks like, here’s what gcc produced when I multiplied an integer by 3.

	int i = 42;
	// multiplication by 3
	i = i * 3;

which produces

...
   3:optest.c      ****         int i = 42;
  20                            .loc 1 3 0
  21 0004 C745FC2A              movl    $42, -4(%rbp)
  21      000000
   4:optest.c      ****         // multiplication by 3
   5:optest.c      ****         i = i * 3;
  22                            .loc 1 5 0
  23 000b 8B55FC                movl    -4(%rbp), %edx
  24 000e 89D0                  movl    %edx, %eax
  25 0010 01C0                  addl    %eax, %eax
  26 0012 01D0                  addl    %edx, %eax
  27 0014 8945FC                movl    %eax, -4(%rbp)
...

Here’s a breakdown starting from line #23

  • 23 – load the integer (stored at -4(%rbp)) into the %edx register
  • 24 – move the contents of the %edx register (our int value) into the %eax register
  • 25 – add the %eax register to the %eax regsiter, and store it to the %eax register
  • 26 – add the %edx register to the %eax register and store it to the %eax register
  • 27 – store the contents of the %eax register to the integer in memory

One thing I noticed is that the instruction on line 25 is effectively multiplying the %eax register’s contents by 2. I’m not sure why this instruction doesn’t instead use the shift operation…

Multiplication by 16

Now I test to see what assembly is produced by gcc when I multiply an integer by 16.

	int i = 42;
	// multiply by 16 (equivalent to shifting left 4x)
	i = i * 16;

which produces

...
   3:optest.c      ****         int i = 42;
  20                            .loc 1 3 0
  21 0004 C745FC2A              movl    $42, -4(%rbp)
  21      000000
   4:optest.c      ****         // multiply by 16 (equivalent to shifting left 4x)
   5:optest.c      ****         i = i * 16;
  22                            .loc 1 5 0
  23 000b C165FC04              sall    $4, -4(%rbp)
...

Wonderful! As you can see, the same assembly code is produced as the previous example. This shows that gcc is intelligent enough to optimize the multiplication of an integer by a power of 2 by translating such instructions into bit shifts.

Division by 2

unsigned

Now I test to see what assembly is produced by gcc when I multiply an _unsigned_ integer by 16.

	unsigned int i = 42;
	// divison by 2 (equivalent to shifting right 1x)
	i = i / 2;

which produces

...
   3:optest.c      ****         unsigned int i = 42;
  20                            .loc 1 3 0
  21 0004 C745FC2A              movl    $42, -4(%rbp)
  21      000000
   4:optest.c      ****         // divison by 2 (equivalent to shifting right 1x)
   5:optest.c      ****         i = i / 2;
  22                            .loc 1 5 0
  23 000b 8B45FC                movl    -4(%rbp), %eax
  24 000e D1E8                  shrl    %eax
  25 0010 8945FC                movl    %eax, -4(%rbp)
...

We can see that gcc also interpreted the division by a power of 2 into a bit shift. However, for some reason, it decided to first load the integer from memory into a register, preform the shift, and then store the value back to memory. In any case, it optimized the division by making the instruction a bit shift.

signed

Note that I had to use an unsigned integer in the previous example. Binary division isn’t as straight forward as binary multiplication. Here’s the result of signed division.

	int i = 42;
	// divison by 2 (equivalent to shifting right 1x)
	i = i / 2;

which produces

...
   3:optest.c      ****         int i = 42;
  20                            .loc 1 3 0
  21 0004 C745FC2A              movl    $42, -4(%rbp)
  21      000000
   4:optest.c      ****         // divison by 2 (equivalent to shifting right 1x)
   5:optest.c      ****         i = i / 2;
  22                            .loc 1 5 0
  23 000b 8B55FC                movl    -4(%rbp), %edx
  24 000e 89D0                  movl    %edx, %eax
  25 0010 C1E81F                shrl    $31, %eax
  26 0013 01D0                  addl    %edx, %eax
  27 0015 D1F8                  sarl    %eax
  28 0017 8945FC                movl    %eax, -4(%rbp)
...
  • Lines #23 & #24 load the value from memory into a register
  • Lines #25 & #26 is a neat little trick to convert the signed integer into an unsigned value using Two’s Compliment (note how this works whether the int is positive or negative)
  • Line #27 is the optimized shift (yay! :D)
  • Line #28 stores the value in the register back to memory

This time it was a little more complicated, but–in the end–the compiler recognized that the division was by a power of 2, and it optimized it into a bit shift.

Mod by 4

If a binary number is divisible by decimal 4, then the last 2 binary digits of the number must be 0.

Interestingly enough, the compiler actually produced more optimized code than I planned on using myself.

There’s 3 (obvious) ways to check if a number is divisible by 4:

  1. i % 4 – check with modulus division that there are no remainders
  2. i << 30 – shift the 32 bit integer to the left 30x and check to see if the result is 0 (it will be if the 2 least significant bits are both 0)
  3. (i & 3) – logically AND the integer with a mask of 3 (binary 11) and check and see if the only 2 remaining bits (the 2 least significant bits) are 0.
  4. I tested all 3 programs:

    1. i % 4 == 0

    	int i = 42;
    	// check divisibility by 4
    	i = (i % 4 == 0 );
    

    which produced

    ...
       3:optest.c      ****         int i = 42;
      20                            .loc 1 3 0
      21 0004 C745FC2A              movl    $42, -4(%rbp)
      21      000000
       4:optest.c      ****         // check divisibility by 4
       5:optest.c      ****         i = (i % 4 == 0 );
      22                            .loc 1 5 0
      23 000b 8B45FC                movl    -4(%rbp), %eax
      24 000e 83E003                andl    $3, %eax
      25 0011 85C0                  testl   %eax, %eax
      26 0013 0F94C0                sete    %al
      27 0016 0FB6C0                movzbl  %al, %eax
      28 0019 8945FC                movl    %eax, -4(%rbp)
    ...
    

    Until I saw this output, I hadn’t actually considered AND-ing the integer with a mask. Still, I’m uncertain if AND-ing two 32 bit integers is actually cheaper than shifting a 32 bit integer to the left 30 digits.

    2. i << 30 == 0

    	int i = 42;
    	// check divisibility by 4
    	i = (i &lt;&lt; 30 == 0 );
    

    which produces

    ...
       5:optest.c      ****         i = (i &lt;&lt; 30 == 0 );
      22                            .loc 1 5 0
      23 000b 8B45FC                movl    -4(%rbp), %eax
      24 000e C1E01E                sall    $30, %eax
      25 0011 85C0                  testl   %eax, %eax
      26 0013 0F94C0                sete    %al
      27 0016 0FB6C0                movzbl  %al, %eax
      28 0019 8945FC                movl    %eax, -4(%rbp)
    ...
    

    Note that the instruction on line 24 changed from an and in the previous example to a @sal@. I’m not certain, but I imagine this is the most efficient of the 3.

    3. (i & 3) == 0

    	int i = 42;
    	// check divisibility by 4
    	i = ( (i &amp; 3) == 0 );
    

    which produces

    ...
       5:optest.c      ****         i = ( (i &amp; 3) == 0 );
      22                            .loc 1 5 0
      23 000b 8B45FC                movl    -4(%rbp), %eax
      24 000e 83E003                andl    $3, %eax
      25 0011 85C0                  testl   %eax, %eax
      26 0013 0F94C0                sete    %al
      27 0016 0FB6C0                movzbl  %al, %eax
      28 0019 8945FC                movl    %eax, -4(%rbp)
    ...
    

    Note that this assembly code is exactly the same as what was produced when we preformed modulus division. This shows that the compiler optimized a modulus operation with between an integer and a number that was a power of 2 using bit shifts.

    Conclusion

    The Good

    In many cases, the compiler successfully optimizes C code by interpreting arithmetic operations with an operand that is a power of 2 into bit shifts.

    The Bad

    In some cases, the compiler unnecessarily loads data from memory into a register, preforms a single operation on that data, and then stores the data back to memory. In such cases, using bit shifts operations instead of arithmetic operations in the C code sometimes prevents these inefficient loads and stores.

    The Ugly

    Bit shifts in C code are ugly. They’re hard to read. They’re hard to comprehend. And, IMHO, they decrease the potential longevity of your program by making it difficult to other programmers (besides the initial author) to follow.

    Just like many things in the engineering world, this is another trade off.

    If readability, scalability, and longevity of your program is important to you: don’t bother with bit shifts. Let the compiler optimize your beautiful, easy-to-read C code for you, and take comfort knowing that gcc will know when to use bit shifts instead of costly arithmetic operations–except that it might be unnecessarily preforming extra load and store instructions every now and then.

    If you don’t care about scalability, you’re working with a small program, and you never, ever expect that anyone else besides yourself will be using your program, go ahead and hack up some crazy low-level bit shifts in your C code. Take comfort knowing that you’ve spent an hour longer in development to save 20 picoseconds per execution.

    Further reading:

    1. An excellent reference to the x86 assembly language

2 comments to gcc Optimizations for Arithmetic Operations using Bit Shifts

  • Haywood Jablome

    Of your three methods for determining divisibility by four, which is more efficient for larger integers (Like on the same order of magnitude as Graham’s number)?

  • Anonymous

    FYI, “Lines #25 & #26 is a neat little trick to convert the signed integer into an unsigned value” is plain wrong. There’s no conversion. These add 1 to the dividend if the dividend is negative. The purpose is to compensate for the rounding towards minus infinity that the SAR instruction does with negative values. Try SAR’ing -1. You’ll get -1, while you expect 0 from division with rounding towards 0. Can you see it?

Leave a Reply

You can use these HTML tags

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>