 # gcc Optimizations for Arithmetic Operations using Bit Shifts

I've got a hellacious project due and finals all next week, but this was just too much fun to pass up. In any case, compiler optimization increases compile time, and anything that gives me more time to sword fight on \$1000 office chairs is worth a little R&D.

I'm working on writing this cut-down MIPS processor simulator for my Computer Organization class at UCF. I googled "word alignments" to help me better understand the most efficient calculations for converting the Byte Aligned Program Counter address to the Word Aligned `Memory` array when I ran across an interesting article showing that the mere *order* of variable declarations in a C program can affect the amount of memory used by that program.

The article explained the situation very well, and it makes sense why this issue would happen, but I was surprised that the compiler wouldn't try to optimize situations like this by re-ordering a set of concurrent variable declarations of alternating data types.

In any case, I continued to hack away at my project when I began to think about whether or not gcc translates multiplication and division operations where one of the operands is a power of 2 into simple bit shift instructions at compile time. As it turns out, it does.

# Basis

Alright, here's the basics: Computers store data in binary. There are many shortcuts in the binary arithmetic that you learn to adore when you're computing unsigned binary operations by hand. For example, just as shifting a decimal value (of base 10) to the left 1 digit will multiply that value by 10, shifting a binary value (of base 2) to the left 1 digit will multiply that value by 2. Shifting a binary value 2 digits to the left will multiply that value by 4, 3 digits by 8, 4 by 16, etc. This shortcut similarly applies to binary division when shifting to the right.

We can see then that it's far simpler for a computer to shift a binary number to the left 4 digits like so:

```0000 0010 1010 (42 in decimal)
0010 1010 0000 (42*16 = 672 in decimal)
```

rather than preforming addition over and over again.

So, the question is: when I write

```int i = 42;
i = i * 16;
```

will the compiler translate my code into a series of addition instructions

```movl    \$42, -4(%rbp)
movl    -4(%rbp), %eax
movl    %eax, -4(%rbp)
```

or to a single left shift instruction

```movl    \$42, -4(%rbp)
sall    \$4, -4(%rbp)
```

Moreover, gcc doesn't go straight from C code to machine code. it preforms the following operations (in order)

1. Preprocessing
2. Compilation
3. Assembly

What I want is to analyze the 3rd step in this process--the assembly code. gcc allows you to (easily) analyze the assembly output of a program.

Note that a whole slew of options exist for analyzing the inner-workings of what gcc is doing as well as operations that directly alter the optimization capabilities of the compilation process. More info can be found from the gcc manual page.

To see a list of optimization options for your installed version of gcc, run the command

```\$ gcc --help=optimizers
```

# The Test

Still with me? Good. Here's how I preformed this test.

Below is a straight forward C program that shifts the integer `i` to the left 4 digits. Mathematically, this is the same as multiplying `i` by 16.

```int main( ){

int i = 42;
// shift left 4x (equivalent to multiplying by 16)
i = i << 4;

return 0;
}
```

And below you can see the assembly code that gcc produces when compiling.

```\$ gcc -c -g -Wa,-a,-ad optest.c
GAS LISTING /tmp/ccQRJ4Fi.s                     page 1

1                            .file   "optest.c"
9                    .Ltext0:
10                    .globl main
12                    main:
13                    .LFB2:
14                            .file 1 "optest.c"
1:optest.c      **** int main( ){
15                            .loc 1 1 0
16 0000 55                    pushq   %rbp
17                    .LCFI0:
18 0001 4889E5                movq    %rsp, %rbp
19                    .LCFI1:
2:optest.c      ****
3:optest.c      ****         int i = 42;
20                            .loc 1 3 0
21 0004 C745FC2A              movl    \$42, -4(%rbp)
21      000000
4:optest.c      ****         // shift left 4x (equivalent to multiplying by 16)
5:optest.c      ****         i = i &lt;&lt; 4;
22                            .loc 1 5 0
23 000b C165FC04              sall    \$4, -4(%rbp)
6:optest.c      ****
7:optest.c      ****         return 0;
24                            .loc 1 7 0
25 000f B8000000              movl    \$0, %eax
25      00
8:optest.c      **** }
26                            .loc 1 8 0
27 0014 C9                    leave
28 0015 C3                    ret
29                    .LFE2:
67                    .Lframe1:
68 0000 14000000              .long   .LECIE1-.LSCIE1
69                    .LSCIE1:
70 0004 00000000              .long   0x0
71 0008 01                    .byte   0x1
72 0009 7A5200                .string &quot;zR&quot;
73 000c 01                    .uleb128 0x1
74 000d 78                    .sleb128 -8
75 000e 10                    .byte   0x10
76 000f 01                    .uleb128 0x1
77 0010 03                    .byte   0x3
78 0011 0C                    .byte   0xc
79 0012 07                    .uleb128 0x7
80 0013 08                    .uleb128 0x8
81 0014 90                    .byte   0x90
82 0015 01                    .uleb128 0x1
83 0016 0000                  .align 8
84                    .LECIE1:
85                    .LSFDE1:
86 0018 1C000000              .long   .LEFDE1-.LASFDE1
87                    .LASFDE1:
88 001c 1C000000              .long   .LASFDE1-.Lframe1
89 0020 00000000              .long   .LFB2
90 0024 16000000              .long   .LFE2-.LFB2
91 0028 00                    .uleb128 0x0
92 0029 41                    .byte   0x4

GAS LISTING /tmp/ccQRJ4Fi.s                     page 2

93                            .long   .LCFI0-.LFB2
94 002a 0E                    .byte   0xe
95 002b 10                    .uleb128 0x10
96 002c 86                    .byte   0x86
97 002d 02                    .uleb128 0x2
98 002e 43                    .byte   0x4
99                            .long   .LCFI1-.LCFI0
100 002f 0D                    .byte   0xd
101 0030 06                    .uleb128 0x6
102 0031 00000000              .align 8
102      000000
103                    .LEFDE1:
104                            .text
105                    .Letext0:

GAS LISTING /tmp/ccQRJ4Fi.s                     page 3

DEFINED SYMBOLS
```

The above output from the `gcc` command has both the C code intermixed with the assembly code. What we're interested in is on line #23.

```...
3:optest.c      ****         int i = 42;
20                            .loc 1 3 0
21 0004 C745FC2A              movl    \$42, -4(%rbp)
21      000000
4:optest.c      ****         // shift left 4x (equivalent to multiplying by 16)
5:optest.c      ****         i = i &lt;&lt; 4;
22                            .loc 1 5 0
23 000b C165FC04              sall    \$4, -4(%rbp)
...
```

Here's a breakdown of line 23

• `sal` specifies the operation (shift arithmetically left)
• The suffix `l` to the `sal` operation specifies that the operands are double words
• `\$4` is the immediate value specifying how many shifts to preform
• `-4(%rbp)` specifies the location of the variable we want to shift

Now we can see that gcc takes `i = i << 4;` and converts this c code into `sall \$4, -4(%rbp)` assembly code.

## Multiplication by 3

Just so you can see what a multiplication instruction normally looks like, here's what gcc produced when I multiplied an integer by 3.

```	int i = 42;
// multiplication by 3
i = i * 3;
```

which produces

```...
3:optest.c      ****         int i = 42;
20                            .loc 1 3 0
21 0004 C745FC2A              movl    \$42, -4(%rbp)
21      000000
4:optest.c      ****         // multiplication by 3
5:optest.c      ****         i = i * 3;
22                            .loc 1 5 0
23 000b 8B55FC                movl    -4(%rbp), %edx
24 000e 89D0                  movl    %edx, %eax
25 0010 01C0                  addl    %eax, %eax
26 0012 01D0                  addl    %edx, %eax
27 0014 8945FC                movl    %eax, -4(%rbp)
...
```

Here's a breakdown starting from line #23

• 23 - load the integer (stored at `-4(%rbp)`) into the `%edx` register
• 24 - move the contents of the `%edx` register (our int value) into the `%eax` register
• 25 - add the `%eax` register to the `%eax` regsiter, and store it to the `%eax` register
• 26 - add the `%edx` register to the `%eax` register and store it to the `%eax` register
• 27 - store the contents of the `%eax` register to the integer in memory

One thing I noticed is that the instruction on line 25 is effectively multiplying the `%eax` register's contents by 2. I'm not sure why this instruction doesn't instead use the shift operation...

## Multiplication by 16

Now I test to see what assembly is produced by gcc when I multiply an integer by 16.

```	int i = 42;
// multiply by 16 (equivalent to shifting left 4x)
i = i * 16;
```

which produces

```...
3:optest.c      ****         int i = 42;
20                            .loc 1 3 0
21 0004 C745FC2A              movl    \$42, -4(%rbp)
21      000000
4:optest.c      ****         // multiply by 16 (equivalent to shifting left 4x)
5:optest.c      ****         i = i * 16;
22                            .loc 1 5 0
23 000b C165FC04              sall    \$4, -4(%rbp)
...
```

Wonderful! As you can see, the same assembly code is produced as the previous example. This shows that gcc is intelligent enough to optimize the multiplication of an integer by a power of 2 by translating such instructions into bit shifts.

## Division by 2

### unsigned

Now I test to see what assembly is produced by gcc when I multiply an _unsigned_ integer by 16.

```	unsigned int i = 42;
// divison by 2 (equivalent to shifting right 1x)
i = i / 2;
```

which produces

```...
3:optest.c      ****         unsigned int i = 42;
20                            .loc 1 3 0
21 0004 C745FC2A              movl    \$42, -4(%rbp)
21      000000
4:optest.c      ****         // divison by 2 (equivalent to shifting right 1x)
5:optest.c      ****         i = i / 2;
22                            .loc 1 5 0
23 000b 8B45FC                movl    -4(%rbp), %eax
24 000e D1E8                  shrl    %eax
25 0010 8945FC                movl    %eax, -4(%rbp)
...
```

We can see that gcc also interpreted the division by a power of 2 into a bit shift. However, for some reason, it decided to first load the integer from memory into a register, preform the shift, and then store the value back to memory. In any case, it optimized the division by making the instruction a bit shift.

### signed

Note that I had to use an unsigned integer in the previous example. Binary division isn't as straight forward as binary multiplication. Here's the result of signed division.

```	int i = 42;
// divison by 2 (equivalent to shifting right 1x)
i = i / 2;
```

which produces

```...
3:optest.c      ****         int i = 42;
20                            .loc 1 3 0
21 0004 C745FC2A              movl    \$42, -4(%rbp)
21      000000
4:optest.c      ****         // divison by 2 (equivalent to shifting right 1x)
5:optest.c      ****         i = i / 2;
22                            .loc 1 5 0
23 000b 8B55FC                movl    -4(%rbp), %edx
24 000e 89D0                  movl    %edx, %eax
25 0010 C1E81F                shrl    \$31, %eax
26 0013 01D0                  addl    %edx, %eax
27 0015 D1F8                  sarl    %eax
28 0017 8945FC                movl    %eax, -4(%rbp)
...
```
• Lines #23 & #24 load the value from memory into a register
• Lines #25 & #26 is a neat little trick to convert the signed integer into an unsigned value using Two's Compliment (note how this works whether the int is positive or negative)
• Line #27 is the optimized shift (yay! :D)
• Line #28 stores the value in the register back to memory

This time it was a little more complicated, but--in the end--the compiler recognized that the division was by a power of 2, and it optimized it into a bit shift.

## Mod by 4

If a binary number is divisible by decimal 4, then the last 2 binary digits of the number must be 0.

Interestingly enough, the compiler actually produced more optimized code than I planned on using myself.

There's 3 (obvious) ways to check if a number is divisible by 4:

1. `i % 4` - check with modulus division that there are no remainders
2. `i << 30` - shift the 32 bit integer to the left 30x and check to see if the result is 0 (it will be if the 2 least significant bits are both 0)
3. `(i & 3)` - logically AND the integer with a mask of 3 (binary 11) and check and see if the only 2 remaining bits (the 2 least significant bits) are 0.
4. I tested all 3 programs:

### 1. i % 4 == 0

```	int i = 42;
// check divisibility by 4
i = (i % 4 == 0 );
```

which produced

```...
3:optest.c      ****         int i = 42;
20                            .loc 1 3 0
21 0004 C745FC2A              movl    \$42, -4(%rbp)
21      000000
4:optest.c      ****         // check divisibility by 4
5:optest.c      ****         i = (i % 4 == 0 );
22                            .loc 1 5 0
23 000b 8B45FC                movl    -4(%rbp), %eax
24 000e 83E003                andl    \$3, %eax
25 0011 85C0                  testl   %eax, %eax
26 0013 0F94C0                sete    %al
27 0016 0FB6C0                movzbl  %al, %eax
28 0019 8945FC                movl    %eax, -4(%rbp)
...
```

Until I saw this output, I hadn't actually considered AND-ing the integer with a mask. Still, I'm uncertain if AND-ing two 32 bit integers is actually cheaper than shifting a 32 bit integer to the left 30 digits.

### 2. i << 30 == 0

```	int i = 42;
// check divisibility by 4
i = (i &lt;&lt; 30 == 0 );
```

which produces

```...
5:optest.c      ****         i = (i &lt;&lt; 30 == 0 );
22                            .loc 1 5 0
23 000b 8B45FC                movl    -4(%rbp), %eax
24 000e C1E01E                sall    \$30, %eax
25 0011 85C0                  testl   %eax, %eax
26 0013 0F94C0                sete    %al
27 0016 0FB6C0                movzbl  %al, %eax
28 0019 8945FC                movl    %eax, -4(%rbp)
...
```

Note that the instruction on line 24 changed from an `and` in the previous example to a @sal@. I'm not certain, but I imagine this is the most efficient of the 3.

### 3. (i & 3) == 0

```	int i = 42;
// check divisibility by 4
i = ( (i &amp; 3) == 0 );
```

which produces

```...
5:optest.c      ****         i = ( (i &amp; 3) == 0 );
22                            .loc 1 5 0
23 000b 8B45FC                movl    -4(%rbp), %eax
24 000e 83E003                andl    \$3, %eax
25 0011 85C0                  testl   %eax, %eax
26 0013 0F94C0                sete    %al
27 0016 0FB6C0                movzbl  %al, %eax
28 0019 8945FC                movl    %eax, -4(%rbp)
...
```

Note that this assembly code is exactly the same as what was produced when we preformed modulus division. This shows that the compiler optimized a modulus operation with between an integer and a number that was a power of 2 using bit shifts.

# Conclusion

## The Good

In many cases, the compiler successfully optimizes C code by interpreting arithmetic operations with an operand that is a power of 2 into bit shifts.

In some cases, the compiler unnecessarily loads data from memory into a register, preforms a single operation on that data, and then stores the data back to memory. In such cases, using bit shifts operations instead of arithmetic operations in the C code sometimes prevents these inefficient loads and stores.

## The Ugly

Bit shifts in C code are ugly. They're hard to read. They're hard to comprehend. And, IMHO, they decrease the potential longevity of your program by making it difficult to other programmers (besides the initial author) to follow.

Just like many things in the engineering world, this is another trade off.

If readability, scalability, and longevity of your program is important to you: don't bother with bit shifts. Let the compiler optimize your beautiful, easy-to-read C code for you, and take comfort knowing that gcc will know when to use bit shifts instead of costly arithmetic operations--except that it might be unnecessarily preforming extra load and store instructions every now and then.

If you don't care about scalability, you're working with a small program, and you never, ever expect that anyone else besides yourself will be using your program, go ahead and hack up some crazy low-level bit shifts in your C code. Take comfort knowing that you've spent an hour longer in development to save 20 picoseconds per execution.

• Haywood Jablome
• Anonymous