I enjoy assembly language programming. Enough that I want to try to write a few blog posts explaining some stuff you can do using this language.
As a quick introduction to assembly language, I want to start with the “Blink” example used by Arduino Uno.
// Examples/Basics/Blink
The objective for this application is to toggle the onboard LED of the Arduino. This requires 3 steps:
- Set the GPIO pin to output mode
- Toggle the output continuously
- Add some delay between each toggle, to make the blink visible
Three fairly simple tasks. Each of these requirements takes only one or two lines of C++ code.
Building this program, and uploading it to your Arduino Uno, will indeed cause the LED to start blinking. Which is a great starting point for testing and programming your new Arduino Uno.
However, it also reveals that the size of the executable is 924 bytes.
Arduino Uno uses an AVR-based microcontroller, the ATmega328P, which has 32 kB of internal flash for the application. 924 bytes is only 2.8% of the total flash. However, if you are into low-level programming, and know your architecture, you will immediately realize that there is a lot of overhead causing this simple task to take 924 bytes. What are these bytes actually used for?
To see what these 924 bytes actually contain, you can open the compiled output and examine it. In the case of Arduino, it is a little challenging to examine the actual output. First, you should enable verbose output using File - Preferences - Show verbose output. Examining this output reveals that the Arduino was programmed using a “hex file”.
C:\Program Files (x86)\Arduino\hardware\tools\avr/bin/avrdude -CC:\Program Files (x86)\Arduino\hardware\tools\avr/etc/avrdude.conf -v -patmega328p -carduino -PCOM4 -b115200 -D -Uflash:w:C:\Users\Eirik\AppData\Local\Temp\arduino_build_938801/Blink.ino.hex:i
Building with the Arduino IDE places the build output in a temporary folder. To examine the files, you have to look for this line in the output, and open the relevant folder.
The .hex file contains the program in machine code. You can open this file in a text editor, and it will show the code in the Intel Hex format (16 bytes per line).
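To make the Intel Hex layout a little more concrete, here is a minimal sketch in C that validates a single record. It is not part of the original toolchain. The function name `ihex_record_data_len` is made up for this illustration. Each record is a `:` followed by hex-encoded bytes: a data byte count, a 16-bit address, a record type, the data, and a checksum chosen so that all bytes sum to zero modulo 256.

```c
#include <stdint.h>
#include <string.h>

/* Convert one hex digit to its value, or -1 if it is not a hex digit. */
static int hex_digit(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/*
 * Validate one Intel Hex record such as ":00000001FF".
 * Returns the number of data bytes in the record, or -1 if the
 * record is malformed or its checksum does not match.
 * (Hypothetical helper, written for this post only.)
 */
int ihex_record_data_len(const char *line)
{
    size_t n = strlen(line);

    /* Ignore a trailing CR/LF if present. */
    while (n > 0 && (line[n - 1] == '\n' || line[n - 1] == '\r'))
        n--;

    /* ':' plus at least 5 bytes: count, address (2), type, checksum. */
    if (n < 11 || line[0] != ':' || (n - 1) % 2 != 0)
        return -1;

    size_t nbytes = (n - 1) / 2;
    uint8_t sum = 0;
    int count = 0;

    for (size_t i = 0; i < nbytes; i++) {
        int hi = hex_digit(line[1 + 2 * i]);
        int lo = hex_digit(line[2 + 2 * i]);
        if (hi < 0 || lo < 0)
            return -1;
        uint8_t byte = (uint8_t)((hi << 4) | lo);
        if (i == 0)
            count = byte;            /* first byte is the data length */
        sum = (uint8_t)(sum + byte); /* running sum modulo 256 */
    }

    /* The length field must agree with the actual record length. */
    if (nbytes != (size_t)count + 5)
        return -1;

    /* A valid record sums to zero, checksum byte included. */
    return sum == 0 ? count : -1;
}
```

Calling it on the end-of-file record `:00000001FF` returns 0 data bytes, while a full line of program data returns 16.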
Using a hex editor, you can also examine the raw binary data. To do that, first execute the command
avr-objcopy -O binary Blink.ino.elf Blink.ino.bin
in the temporary folder. Then you can open the .bin file in any hex editor.
Reading machine code is not easy for human programmers. This is why assembly language was invented, as a human-readable counterpart to machine code.
Running the command
avr-objdump -S Blink.ino.elf > Blink.ino.lst
will disassemble the object file. This will print the output to a .lst file, which can be opened in a text editor.
When analyzing the usage, I found these results:
000-067: (104 bytes) Interrupt vectors: 26 vectors of 4 bytes, each containing a 4-byte JMP instruction
I will not dig into the details for now, but one observation is that one of the biggest functions is “digitalWrite” taking 144 bytes. This function contains much overhead for the fairly simple task we want to solve.
Optimization 1: Avoid digitalWrite()
If you know the details of the Arduino Uno, you know that:
- LED_BUILTIN is logical pin number 13
- Logical pin 13 maps to Port B5 (PB5)
- To configure PB5 as an output, bit 5 of the DDRB hardware register needs to be high
- To set the output high or low, set bit 5 of the PORTB hardware register high or low
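The bit operations in the list above can be sketched in plain C. This is only an illustration: the helper names `set_bit` and `clear_bit` are made up here, and the registers are passed in as pointers so the same code can be tried on a PC with ordinary variables standing in for DDRB and PORTB.

```c
#include <stdint.h>

/* Set one bit in a register, leaving the other bits untouched. */
static inline void set_bit(volatile uint8_t *reg, uint8_t bit)
{
    *reg |= (uint8_t)(1u << bit);
}

/* Clear one bit in a register, leaving the other bits untouched. */
static inline void clear_bit(volatile uint8_t *reg, uint8_t bit)
{
    *reg &= (uint8_t)~(1u << bit);
}
```

On the real hardware, with `<avr/io.h>` included, the equivalent calls would be `set_bit(&DDRB, 5)` to make PB5 an output, and `set_bit(&PORTB, 5)` / `clear_bit(&PORTB, 5)` to switch the LED on and off.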
Using this knowledge, we can replace the digitalWrite functions with
//pinMode(LED_BUILTIN, OUTPUT);
This simple optimization causes the code size to be reduced from 924 to 640 bytes (!).
Optimization steps:
- Instead of digitalWrite(): Use DDRB/PORTB directly
- Reduces code size from 924 to 640 bytes
- digitalWrite seems to be a very bloated function
- Instead of setting and clearing bit: Use xor:
- Reduces code size from 640 to 602 bytes
/*
PORTB |= (1 << 5);
delay(1000);
PORTB &= ~(1 << 5);
delay(1000);
*/
PORTB ^= (1 << 5);
delay(1000);
- Instead of delay(): Use _delay_ms() from avr-libc
- Reduces code size from 602 bytes to 474 bytes
// delay(1000);
_delay_ms(1000);
- Instead of toggling PORTB bit, write 1 to PINB bit
- Reduces code size from 474 bytes to 470 bytes
- Hardware feature on AVR which causes the PORTB bit to be toggled
// PORTB ^= (1 << 5);
PINB = (1 << 5);
Optimized code using Arduino IDE and C++
void setup() {
Replacing Arduino IDE with avr-gcc
Time to try something else. Instead of using Arduino IDE, make a new folder and insert this C file:
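As a rough idea of what such a plain C file can look like, here is a hypothetical sketch. It is not the original listing, so the byte counts it produces may differ from the ones quoted in this post. The names `blink_init`, `blink_step`, `LED_MASK` and `WAIT_MS` are made up for this example, and the `#else` branch only provides PC-side stand-ins so the logic can be compiled without an AVR toolchain.

```c
/* blink.c: hypothetical sketch, not the original file from this post */
#ifdef __AVR__
#define F_CPU 16000000UL          /* Uno crystal frequency, needed by <util/delay.h> */
#include <avr/io.h>
#include <util/delay.h>
#define WAIT_MS(ms) _delay_ms(ms)
#else
/* Stand-ins (assumption) so the logic can be compiled and checked on a PC. */
#include <stdint.h>
static volatile uint8_t DDRB, PINB;
#define WAIT_MS(ms) ((void)0)
#endif

#define LED_MASK (1 << 5)         /* PB5 = Arduino pin 13 = onboard LED */

/* Configure PB5 as an output. */
static void blink_init(void)
{
    DDRB |= LED_MASK;
}

/* Toggle the LED once, then wait. */
static void blink_step(void)
{
    PINB = LED_MASK;              /* on AVR hardware, writing 1 to a PINB bit toggles the PORTB bit */
    WAIT_MS(1000);
}

#ifdef __AVR__
int main(void)
{
    blink_init();
    for (;;)
        blink_step();
}
#endif
```

Note that `_delay_ms()` expects compiler optimization to be enabled, which the `-Os` flag in the build command takes care of.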
Build the program using
avr-gcc -mmcu=atmega328p -Os blink.c -o blink.elf
Doing this change, from Arduino IDE and C++, to avr-gcc and plain C, causes the file size to decrease from 470 bytes to 158 bytes, shaving off lots of bloated code.
To program your Arduino manually, run this command (replace COM4 with the actual port):
avrdude "-CC:\Program Files (x86)\Arduino\hardware\tools\avr/etc/avrdude.conf" -v -patmega328p -carduino -PCOM4 -b115200 -e -U flash:w:blink.hex
This command will connect to the Arduino, erase it, and program it with the file blink.hex.
If you examine the assembly listing, you will see that the first 104 bytes are the 26 interrupt vectors. We are not using any interrupts, so this space can hold instructions instead of vectors. A hack to remove these vectors is to ditch the -mmcu=atmega328p specifier, and manually define the symbol __AVR_ATmega328P__ before including io.h.
Ditching the interrupt vectors and software initialization causes the entire code to become 26 bytes.
Assembly listing:
0: 25 9a sbi 0x04, 5 ; 4
This program makes the Arduino behave the same as the original blink example, but instead of occupying 924 bytes to do the task, it does it in only 26 bytes.
Can we do any better?
Time for….
Assembler optimization
Put this code into a new file called “blink.S”
sbi 0x04, 5 ; Set bit 5 of the DDRB register to 1 -> DDRB |= (1 << 5);
This is just a dumb copy of the assembly listing generated using the optimized C code, with some comments. Build it with
avr-gcc blink.S -o blink.elf
Now we can start optimizing.
; Blink
Assembly listing:
00000000 <__ctors_end>:
This code does several things to reduce the size:
- Instead of using the out instruction to write to PINB, it uses the sbi instruction. This saves the instruction that loads the r24 register.
- Instead of initializing the 24-bit counter to some value (0x30D3FF) and decrementing it by one until we reach zero, we skip the initialization and decrement by a larger number on each iteration until we reach zero.
- The low 2 bytes of the counter can be decremented in a single instruction using the SBIW instruction.
- The two rjmp/nop instructions are removed. These were likely added by the _delay_ms function to get a more accurate delay.
- Instead of Arduino IDE: Build from a C file using avr-gcc directly
- Instead of building a “proper” application with Interrupt Vectors: Ignore them
- Instead of C file: Build from an Assembler source file
- Instead of busy-waiting in a loop: Reduce clock speed using fuse settings
Optimization overview
- Building “blink” in Arduino, C++: 924 bytes (due to library- and C++ overhead)
- Building “blink” in Arduino, C++, using DDRB and PORTB: 640 bytes
- Building “blink” in Arduino, C++, using DDRB and PORTB and xor: 602 bytes
- Building “blink” in Arduino, C++, using DDRB and PORTB and xor and _delay_ms: 474 bytes
- Building “blink” in Arduino, C++, using DDRB and PINB and _delay_ms: 470 bytes
- Building “blink” in C using avr-gcc: 158 bytes (including interrupt vectors etc)
- Building “blink” in C using avr-gcc: 26 bytes (interrupt vectors optimized away)
- Building “blink” in ASM: 12 bytes
Can we go any smaller?
The smallest program I was able to write, causing a standard Arduino to do visible blinking, is 6 instructions.
Instructions:
0: Set pin to output
1: Toggle output
2-4: Busy wait
5: Jump back to toggle
Skipping the last instruction
What if we skip the last instruction? Then, the CPU will continue executing code past the program. When erasing the flash memory, all memory becomes 0xFF. According to what I found on the internet, the instruction 0xFFFF appears to be an undefined instruction. However, it appears that this instruction behaves like a NOP instruction (no operation).
Thus, if the rest of the flash is 0xFFFF, the CPU will execute “NOP” instructions until the instruction pointer wraps around back to the beginning of the program. Reducing “code size” from 6 to 5 instructions.
Doing this hack will only work if you program your Arduino with a real ISP programmer (or a secondary Arduino with ArduinoISP).
When using the Arduino bootloader (programming the Arduino via USB), the bootloader pretends to be a real ISP programmer. It will respond to the “erase flash” command, and perform write commands to the start of the flash. But the erase command will not actually wipe all the flash, which causes the hack of removing the last jump instruction to fail.
If programming using a real ISP programmer, this will work. But it will also erase the bootloader, so you can no longer program the Arduino using the standard USB interface. If you want to restore the ordinary programming interface, you have to open the Arduino IDE, and use Tools -> Burn Bootloader.
Reducing clock speed
3 of the remaining 6 or 5 instructions are instructions to busy-wait. The CPU runs at 16 MHz. If we reduce the clock speed significantly, we may be able to remove some of these instructions to busy-wait. The default clock setting is to use the external crystal, which is 16 MHz. To reduce the clock speed, we can
- Replace the external crystal
- Set clock prescaler (writing to register or using fuse setting)
- Use a different clock source using fuse setting
Replacing the external crystal may reduce the clock speed, but this will modify the Arduino.
The clock prescaler can reduce the clock speed by at most a factor of 256, and requires instructions to write to its register (CLKPR). However, the prescaler can be set to 8 using the CKDIV8 fuse, which does not require any code.
Using fuse settings, it is possible to change CPU clock source. Instead of using the external crystal as clock source, you can set the CKSEL fuses to use the 128 kHz internal oscillator. Combining this with CKDIV8, the clock speed becomes 128 / 8 = 16 kHz.
This reduces the clock speed from 16 MHz to 16 kHz, a factor of 1000. This means that the 3 busy-wait instructions can likely be reduced to 2.
But reduced clock speed can also be combined with the hack of omitting the jump instruction and cycling through the blank flash. The ATmega328P has 32 kB of flash, which is 16k 16-bit words. Given that the 0xFFFF instruction behaves like a NOP, each one takes 1 clock cycle, so cycling through the flash takes about 16k clock cycles. At 16 MHz, this happens roughly 1000 times per second. At 16 kHz, it happens roughly once per second. Thus, we can remove all busy-wait instructions and be left with two instructions.
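The arithmetic behind this can be checked with a small host-side C sketch. The function names are made up for this illustration, and it assumes, as described above, that every blank 0xFFFF word takes exactly one clock cycle. It ignores the few extra cycles used by the two real instructions.

```c
#include <stdint.h>

#define FLASH_BYTES 32768u               /* ATmega328P flash size */
#define FLASH_WORDS (FLASH_BYTES / 2u)   /* 16-bit instruction words: 16384 */

/* Clock frequency after a prescaler, e.g. 128 kHz with CKDIV8. */
uint32_t divided_clock_hz(uint32_t source_hz, uint32_t prescaler)
{
    return source_hz / prescaler;
}

/*
 * Time in microseconds for the program counter to run through the
 * whole flash once, assuming one clock cycle per word.
 */
uint64_t flash_wrap_us(uint32_t cpu_hz)
{
    return (uint64_t)FLASH_WORDS * 1000000u / cpu_hz;
}
```

With these assumptions, `flash_wrap_us(16000000)` gives 1024 microseconds per pass (about 977 passes per second), and `flash_wrap_us(divided_clock_hz(128000, 8))` gives 1024000 microseconds, so the LED toggles roughly once per second.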
; Blink
Assembly listing:
00000000 <__ctors_end>:
This 4-byte program will cause the LED to blink at a visible rate. However, it requires adjusted fuse settings and a completely wiped flash. These requirements demand an ISP programmer, and will leave the Arduino unprogrammable over the ordinary USB interface (until the bootloader is manually programmed again, using an ISP). Because of this, I do not consider this a “valid” Arduino program. But if you accept the inconvenience of programming and reprogramming it this way, this is indeed a 4-byte version of the blink program.