I've got a hellacious project due and finals all next week, but this was just too much fun to pass up. In any case, compiler optimization increases compile time, and anything that gives me more time to sword fight on $1000 office chairs is worth a little R&D.

I'm working on writing this cut-down MIPS processor simulator for my Computer Organization class at UCF. I googled "word alignments" to help me better understand the most efficient calculations for converting the Byte Aligned Program Counter address to the Word Aligned `Memory`

array when I ran across an interesting article showing that the mere *order* of variable declarations in a C program can affect the amount of memory used by that program.

The article explained the situation very well, and it makes sense why this issue would happen, but I was surprised that the compiler wouldn't try to optimize situations like this by re-ordering a set of concurrent variable declarations of alternating data types.

In any case, I continued to hack away at my project when I began to think about whether or not gcc translates multiplication and division operations where one of the operands is a power of 2 into simple bit shift instructions at compile time. As it turns out, it does.

# Basis

Alright, here's the basics: Computers store data in binary. There are many shortcuts in the binary arithmetic that you learn to adore when you're computing unsigned binary operations by hand. For example, just as shifting a decimal value (of base 10) to the left 1 digit will multiply that value by 10, shifting a binary value (of base 2) to the left 1 digit will multiply that value by 2. Shifting a binary value 2 digits to the left will multiply that value by 4, 3 digits by 8, 4 by 16, etc. This shortcut similarly applies to binary division when shifting to the right.

We can see then that it's far simpler for a computer to shift a binary number to the left 4 digits like so:

0000 0010 1010 (42 in decimal) 0010 1010 0000 (42*16 = 672 in decimal)

rather than preforming addition over and over again.

So, the question is: when I write

int i = 42; i = i * 16;

will the compiler translate my code into a series of addition instructions

movl $42, -4(%rbp) movl -4(%rbp), %eax addl %eax, %eax addl %eax, %eax addl %eax, %eax addl %eax, %eax movl %eax, -4(%rbp)

or to a single left shift instruction

movl $42, -4(%rbp) sall $4, -4(%rbp)

Moreover, gcc doesn't go straight from C code to machine code. it preforms the following operations (in order)

- Preprocessing
- Compilation
- Assembly
- Linking

What I want is to analyze the 3rd step in this process--the assembly code. gcc allows you to (easily) analyze the assembly output of a program.

Note that a whole slew of options exist for analyzing the inner-workings of what gcc is doing as well as operations that directly alter the optimization capabilities of the compilation process. More info can be found from the gcc manual page.

To see a list of optimization options for your installed version of gcc, run the command

$ gcc --help=optimizers

# The Test

Still with me? Good. Here's how I preformed this test.

Below is a straight forward C program that shifts the integer `i`

to the left 4 digits. Mathematically, this is the same as multiplying `i`

by 16.

int main( ){ int i = 42; // shift left 4x (equivalent to multiplying by 16) i = i << 4; return 0; }

And below you can see the assembly code that gcc produces when compiling.

$ gcc -c -g -Wa,-a,-ad optest.c GAS LISTING /tmp/ccQRJ4Fi.s page 1 1 .file "optest.c" 9 .Ltext0: 10 .globl main 12 main: 13 .LFB2: 14 .file 1 "optest.c" 1:optest.c **** int main( ){ 15 .loc 1 1 0 16 0000 55 pushq %rbp 17 .LCFI0: 18 0001 4889E5 movq %rsp, %rbp 19 .LCFI1: 2:optest.c **** 3:optest.c **** int i = 42; 20 .loc 1 3 0 21 0004 C745FC2A movl $42, -4(%rbp) 21 000000 4:optest.c **** // shift left 4x (equivalent to multiplying by 16) 5:optest.c **** i = i << 4; 22 .loc 1 5 0 23 000b C165FC04 sall $4, -4(%rbp) 6:optest.c **** 7:optest.c **** return 0; 24 .loc 1 7 0 25 000f B8000000 movl $0, %eax 25 00 8:optest.c **** } 26 .loc 1 8 0 27 0014 C9 leave 28 0015 C3 ret 29 .LFE2: 67 .Lframe1: 68 0000 14000000 .long .LECIE1-.LSCIE1 69 .LSCIE1: 70 0004 00000000 .long 0x0 71 0008 01 .byte 0x1 72 0009 7A5200 .string "zR" 73 000c 01 .uleb128 0x1 74 000d 78 .sleb128 -8 75 000e 10 .byte 0x10 76 000f 01 .uleb128 0x1 77 0010 03 .byte 0x3 78 0011 0C .byte 0xc 79 0012 07 .uleb128 0x7 80 0013 08 .uleb128 0x8 81 0014 90 .byte 0x90 82 0015 01 .uleb128 0x1 83 0016 0000 .align 8 84 .LECIE1: 85 .LSFDE1: 86 0018 1C000000 .long .LEFDE1-.LASFDE1 87 .LASFDE1: 88 001c 1C000000 .long .LASFDE1-.Lframe1 89 0020 00000000 .long .LFB2 90 0024 16000000 .long .LFE2-.LFB2 91 0028 00 .uleb128 0x0 92 0029 41 .byte 0x4 GAS LISTING /tmp/ccQRJ4Fi.s page 2 93 .long .LCFI0-.LFB2 94 002a 0E .byte 0xe 95 002b 10 .uleb128 0x10 96 002c 86 .byte 0x86 97 002d 02 .uleb128 0x2 98 002e 43 .byte 0x4 99 .long .LCFI1-.LCFI0 100 002f 0D .byte 0xd 101 0030 06 .uleb128 0x6 102 0031 00000000 .align 8 102 000000 103 .LEFDE1: 104 .text 105 .Letext0: GAS LISTING /tmp/ccQRJ4Fi.s page 3 DEFINED SYMBOLS

The above output from the `gcc`

command has both the C code intermixed with the assembly code. What we're interested in is on line #23.

... 3:optest.c **** int i = 42; 20 .loc 1 3 0 21 0004 C745FC2A movl $42, -4(%rbp) 21 000000 4:optest.c **** // shift left 4x (equivalent to multiplying by 16) 5:optest.c **** i = i << 4; 22 .loc 1 5 0 23 000b C165FC04 sall $4, -4(%rbp) ...

Here's a breakdown of line 23

`sal`

specifies the operation (shift arithmetically left)- The suffix
`l`

to the`sal`

operation specifies that the operands are double words `$4`

is the immediate value specifying how many shifts to preform`-4(%rbp)`

specifies the location of the variable we want to shift

Now we can see that gcc takes `i = i << 4;`

and converts this c code into `sall $4, -4(%rbp)`

assembly code.

## Multiplication by 3

Just so you can see what a multiplication instruction normally looks like, here's what gcc produced when I multiplied an integer by 3.

int i = 42; // multiplication by 3 i = i * 3;

which produces

... 3:optest.c **** int i = 42; 20 .loc 1 3 0 21 0004 C745FC2A movl $42, -4(%rbp) 21 000000 4:optest.c **** // multiplication by 3 5:optest.c **** i = i * 3; 22 .loc 1 5 0 23 000b 8B55FC movl -4(%rbp), %edx 24 000e 89D0 movl %edx, %eax 25 0010 01C0 addl %eax, %eax 26 0012 01D0 addl %edx, %eax 27 0014 8945FC movl %eax, -4(%rbp) ...

Here's a breakdown starting from line #23

- 23 - load the integer (stored at
`-4(%rbp)`

) into the`%edx`

register - 24 - move the contents of the
`%edx`

register (our int value) into the`%eax`

register - 25 - add the
`%eax`

register to the`%eax`

regsiter, and store it to the`%eax`

register - 26 - add the
`%edx`

register to the`%eax`

register and store it to the`%eax`

register - 27 - store the contents of the
`%eax`

register to the integer in memory

One thing I noticed is that the instruction on line 25 is effectively multiplying the `%eax`

register's contents by 2. I'm not sure why this instruction doesn't instead use the shift operation...

## Multiplication by 16

Now I test to see what assembly is produced by gcc when I multiply an integer by 16.

int i = 42; // multiply by 16 (equivalent to shifting left 4x) i = i * 16;

which produces

... 3:optest.c **** int i = 42; 20 .loc 1 3 0 21 0004 C745FC2A movl $42, -4(%rbp) 21 000000 4:optest.c **** // multiply by 16 (equivalent to shifting left 4x) 5:optest.c **** i = i * 16; 22 .loc 1 5 0 23 000b C165FC04 sall $4, -4(%rbp) ...

Wonderful! As you can see, the same assembly code is produced as the previous example. This shows that gcc is intelligent enough to optimize the multiplication of an integer by a power of 2 by translating such instructions into bit shifts.

## Division by 2

### unsigned

Now I test to see what assembly is produced by gcc when I multiply an _unsigned_ integer by 16.

unsigned int i = 42; // divison by 2 (equivalent to shifting right 1x) i = i / 2;

which produces

... 3:optest.c **** unsigned int i = 42; 20 .loc 1 3 0 21 0004 C745FC2A movl $42, -4(%rbp) 21 000000 4:optest.c **** // divison by 2 (equivalent to shifting right 1x) 5:optest.c **** i = i / 2; 22 .loc 1 5 0 23 000b 8B45FC movl -4(%rbp), %eax 24 000e D1E8 shrl %eax 25 0010 8945FC movl %eax, -4(%rbp) ...

We can see that gcc also interpreted the division by a power of 2 into a bit shift. However, for some reason, it decided to first load the integer from memory into a register, preform the shift, and then store the value back to memory. In any case, it optimized the division by making the instruction a bit shift.

### signed

Note that I had to use an unsigned integer in the previous example. Binary division isn't as straight forward as binary multiplication. Here's the result of signed division.

int i = 42; // divison by 2 (equivalent to shifting right 1x) i = i / 2;

which produces

... 3:optest.c **** int i = 42; 20 .loc 1 3 0 21 0004 C745FC2A movl $42, -4(%rbp) 21 000000 4:optest.c **** // divison by 2 (equivalent to shifting right 1x) 5:optest.c **** i = i / 2; 22 .loc 1 5 0 23 000b 8B55FC movl -4(%rbp), %edx 24 000e 89D0 movl %edx, %eax 25 0010 C1E81F shrl $31, %eax 26 0013 01D0 addl %edx, %eax 27 0015 D1F8 sarl %eax 28 0017 8945FC movl %eax, -4(%rbp) ...

- Lines #23 & #24 load the value from memory into a register
- Lines #25 & #26 is a neat little trick to convert the signed integer into an unsigned value using Two's Compliment (note how this works whether the int is positive or negative)
- Line #27 is the optimized shift (yay! :D)
- Line #28 stores the value in the register back to memory

This time it was a little more complicated, but--in the end--the compiler recognized that the division was by a power of 2, and it optimized it into a bit shift.

## Mod by 4

If a binary number is divisible by decimal 4, then the last 2 binary digits of the number must be 0.

Interestingly enough, the compiler actually produced more optimized code than I planned on using myself.

There's 3 (obvious) ways to check if a number is divisible by 4:

`i % 4`

- check with modulus division that there are no remainders`i << 30`

- shift the 32 bit integer to the left 30x and check to see if the result is 0 (it will be if the 2 least significant bits are both 0)`(i & 3)`

- logically AND the integer with a mask of 3 (binary 11) and check and see if the only 2 remaining bits (the 2 least significant bits) are 0.

I tested all 3 programs:

### 1. i % 4 == 0

int i = 42; // check divisibility by 4 i = (i % 4 == 0 );

which produced

... 3:optest.c **** int i = 42; 20 .loc 1 3 0 21 0004 C745FC2A movl $42, -4(%rbp) 21 000000 4:optest.c **** // check divisibility by 4 5:optest.c **** i = (i % 4 == 0 ); 22 .loc 1 5 0 23 000b 8B45FC movl -4(%rbp), %eax 24 000e 83E003 andl $3, %eax 25 0011 85C0 testl %eax, %eax 26 0013 0F94C0 sete %al 27 0016 0FB6C0 movzbl %al, %eax 28 0019 8945FC movl %eax, -4(%rbp) ...

Until I saw this output, I hadn't actually considered AND-ing the integer with a mask. Still, I'm uncertain if AND-ing two 32 bit integers is actually cheaper than shifting a 32 bit integer to the left 30 digits.

### 2. i << 30 == 0

int i = 42; // check divisibility by 4 i = (i << 30 == 0 );

which produces

... 5:optest.c **** i = (i << 30 == 0 ); 22 .loc 1 5 0 23 000b 8B45FC movl -4(%rbp), %eax 24 000e C1E01E sall $30, %eax 25 0011 85C0 testl %eax, %eax 26 0013 0F94C0 sete %al 27 0016 0FB6C0 movzbl %al, %eax 28 0019 8945FC movl %eax, -4(%rbp) ...

Note that the instruction on line 24 changed from an `and`

in the previous example to a @sal@. I'm not certain, but I imagine this is the most efficient of the 3.

### 3. (i & 3) == 0

int i = 42; // check divisibility by 4 i = ( (i & 3) == 0 );

which produces

... 5:optest.c **** i = ( (i & 3) == 0 ); 22 .loc 1 5 0 23 000b 8B45FC movl -4(%rbp), %eax 24 000e 83E003 andl $3, %eax 25 0011 85C0 testl %eax, %eax 26 0013 0F94C0 sete %al 27 0016 0FB6C0 movzbl %al, %eax 28 0019 8945FC movl %eax, -4(%rbp) ...

Note that this assembly code is exactly the same as what was produced when we preformed modulus division. This shows that the compiler optimized a modulus operation with between an integer and a number that was a power of 2 using bit shifts.

# Conclusion

## The Good

In many cases, the compiler successfully optimizes C code by interpreting arithmetic operations with an operand that is a power of 2 into bit shifts.

## The Bad

In some cases, the compiler unnecessarily loads data from memory into a register, preforms a single operation on that data, and then stores the data back to memory. In such cases, using bit shifts operations instead of arithmetic operations in the C code sometimes prevents these inefficient loads and stores.

## The Ugly

Bit shifts in C code are ugly. They're hard to read. They're hard to comprehend. And, IMHO, they decrease the potential longevity of your program by making it difficult to other programmers (besides the initial author) to follow.

Just like many things in the engineering world, this is another trade off.

If readability, scalability, and longevity of your program is important to you: don't bother with bit shifts. Let the compiler optimize your beautiful, easy-to-read C code for you, and take comfort knowing that gcc will know when to use bit shifts instead of costly arithmetic operations--except that it might be unnecessarily preforming extra load and store instructions every now and then.

If you don't care about scalability, you're working with a small program, and you never, ever expect that anyone else besides yourself will be using your program, go ahead and hack up some crazy low-level bit shifts in your C code. Take comfort knowing that you've spent an hour longer in development to save 20 picoseconds per execution.

Further reading:

Of your three methods for determining divisibility by four, which is more efficient for larger integers (Like on the same order of magnitude as Graham's number)?

FYI, "Lines #25 & #26 is a neat little trick to convert the signed integer into an unsigned value" is plain wrong. There's no conversion. These add 1 to the dividend if the dividend is negative. The purpose is to compensate for the rounding towards minus infinity that the SAR instruction does with negative values. Try SAR'ing -1. You'll get -1, while you expect 0 from division with rounding towards 0. Can you see it?