This is my second stab at toggling / counting as fast as possible in software. Please notice that this is a somewhat pointless optimization except for the sake of showing how to squeeze the last bit of performance out of an Arduino.
As we already know the fast counter runs at 15 824 484 ticks per second. Or with other words it is 175 516 ticks per second slower than the theoretical maximum.
Most of this overhead is probably due to the function calls to count_8191(). This is because there are not a lot of other places in the code where the overhead could occur.
void count_8191() { COUNT_4095; PINB = bit_4; COUNT_4095; }}
At 15 824 484 ticks per second count_8191() gets called 15 824 484 / 8192 = 1931.7 times per second. Thus the call overhead is roughly 175 516 / 1931.7 = 90.8 cycles per call. So lets have a look where these cycles are processed.
In order to figure out the cause for the overhead we will disassemble the compiled sketch. To prepare for this we will switch the IDE to “verbose” mode.
After switching to verbose mode we compile the sketch to create an *.elf file and to figure out the build directory.
Then we use avr-objdump to disassemble the compiled code and any editor to display the results.
udo:/tmp/build1588086228058204772.tmp$ avr-objdump -dS *.elf >disas.txt udo:/tmp/build1588086228058204772.tmp$ gedit disas.txt
So let’s have a look at the assembler code for count_8191().
void count_8191() { COUNT_4095; PINB = bit_4; COUNT_4095; } a6: ff 92 push r15 a8: 0f 93 push r16 aa: 1f 93 push r17 ac: 81 e0 ldi r24, 0x01 ; 1 ae: 89 b9 out 0x09, r24 ; 9 b0: 92 e0 ldi r25, 0x02 ; 2 b2: 99 b9 out 0x09, r25 ; 9 b4: 89 b9 out 0x09, r24 ; 9 b6: 24 e0 ldi r18, 0x04 ; 4 b8: 29 b9 out 0x09, r18 ; 9 ba: 89 b9 out 0x09, r24 ; 9 bc: 99 b9 out 0x09, r25 ; 9 be: 89 b9 out 0x09, r24 ; 9 c0: 38 e0 ldi r19, 0x08 ; 8 c2: 39 b9 out 0x09, r19 ; 9 c4: 89 b9 out 0x09, r24 ; 9 c6: 99 b9 out 0x09, r25 ; 9 c8: 89 b9 out 0x09, r24 ; 9 ca: 29 b9 out 0x09, r18 ; 9 cc: 89 b9 out 0x09, r24 ; 9 ce: 99 b9 out 0x09, r25 ; 9 d0: 89 b9 out 0x09, r24 ; 9 d2: 40 e1 ldi r20, 0x10 ; 16 d4: 49 b9 out 0x09, r20 ; 9 ... lots of lines deleted ... 3fcc: 49 b9 out 0x09, r20 ; 9 3fce: 29 b9 out 0x09, r18 ; 9 3fd0: 40 e1 ldi r20, 0x10 ; 16 3fd2: 49 b9 out 0x09, r20 ; 9 3fd4: 89 b9 out 0x09, r24 ; 9 3fd6: 99 b9 out 0x09, r25 ; 9 3fd8: 89 b9 out 0x09, r24 ; 9 3fda: 24 e0 ldi r18, 0x04 ; 4 3fdc: 29 b9 out 0x09, r18 ; 9 3fde: 89 b9 out 0x09, r24 ; 9 3fe0: 99 b9 out 0x09, r25 ; 9 3fe2: 89 b9 out 0x09, r24 ; 9 3fe4: 38 e0 ldi r19, 0x08 ; 8 3fe6: 39 b9 out 0x09, r19 ; 9 3fe8: 89 b9 out 0x09, r24 ; 9 3fea: 99 b9 out 0x09, r25 ; 9 3fec: 89 b9 out 0x09, r24 ; 9 3fee: 29 b9 out 0x09, r18 ; 9 3ff0: 89 b9 out 0x09, r24 ; 9 3ff2: 99 b9 out 0x09, r25 ; 9 3ff4: 89 b9 out 0x09, r24 ; 9 3ff6: 50 e2 ldi r21, 0x20 ; 32 3ff8: 59 b9 out 0x09, r21 ; 9 3ffa: 89 b9 out 0x09, r24 ; 9 3ffc: 99 b9 out 0x09, r25 ; 9 3ffe: 89 b9 out 0x09, r24 ; 9 4000: 29 b9 out 0x09, r18 ; 9 4002: 89 b9 out 0x09, r24 ; 9 4004: 99 b9 out 0x09, r25 ; 9 4006: 89 b9 out 0x09, r24 ; 9 4008: 39 b9 out 0x09, r19 ; 9 400a: 89 b9 out 0x09, r24 ; 9 400c: 99 b9 out 0x09, r25 ; 9 400e: 89 b9 out 0x09, r24 ; 9 4010: 29 b9 out 0x09, r18 ; 9 4012: 89 b9 out 0x09, r24 ; 9 4014: 99 b9 out 0x09, r25 ; 9 4016: 89 b9 out 0x09, r24 ; 9 4018: 49 b9 out 0x09, r20 ; 9 401a: 89 b9 out 0x09, r24 ; 9 401c: 99 b9 out 0x09, r25 ; 9 401e: 89 b9 out 0x09, r24 ; 9 4020: 29 b9 out 0x09, r18 ; 9 4022: 89 b9 out 0x09, r24 ; 9 4024: 99 b9 out 0x09, r25 ; 9 4026: 89 b9 out 0x09, r24 ; 9 4028: 39 b9 out 0x09, r19 ; 9 402a: 89 b9 out 0x09, r24 ; 9 402c: 99 b9 out 0x09, r25 ; 9 402e: 89 b9 out 0x09, r24 ; 9 4030: 29 b9 out 0x09, r18 ; 9 4032: 89 b9 out 0x09, r24 ; 9 4034: 99 b9 out 0x09, r25 ; 9 4036: 89 b9 out 0x09, r24 ; 9 4038: b9 b9 out 0x09, r27 ; 9 403a: 89 b9 out 0x09, r24 ; 9 403c: 99 b9 out 0x09, r25 ; 9 403e: 89 b9 out 0x09, r24 ; 9 4040: 29 b9 out 0x09, r18 ; 9 4042: 89 b9 out 0x09, r24 ; 9 4044: 99 b9 out 0x09, r25 ; 9 4046: 89 b9 out 0x09, r24 ; 9 4048: 39 b9 out 0x09, r19 ; 9 404a: 89 b9 out 0x09, r24 ; 9 404c: 99 b9 out 0x09, r25 ; 9 404e: 89 b9 out 0x09, r24 ; 9 4050: 29 b9 out 0x09, r18 ; 9 4052: 89 b9 out 0x09, r24 ; 9 4054: 99 b9 out 0x09, r25 ; 9 4056: 89 b9 out 0x09, r24 ; 9 4058: 49 b9 out 0x09, r20 ; 9 405a: 89 b9 out 0x09, r24 ; 9 405c: 99 b9 out 0x09, r25 ; 9 405e: 89 b9 out 0x09, r24 ; 9 4060: 29 b9 out 0x09, r18 ; 9 4062: 89 b9 out 0x09, r24 ; 9 4064: 99 b9 out 0x09, r25 ; 9 4066: 89 b9 out 0x09, r24 ; 9 4068: 39 b9 out 0x09, r19 ; 9 406a: 89 b9 out 0x09, r24 ; 9 406c: 99 b9 out 0x09, r25 ; 9 406e: 89 b9 out 0x09, r24 ; 9 4070: 29 b9 out 0x09, r18 ; 9 4072: 89 b9 out 0x09, r24 ; 9 4074: 99 b9 out 0x09, r25 ; 9 4076: 89 b9 out 0x09, r24 ; 9 4078: 59 b9 out 0x09, r21 ; 9 407a: 89 b9 out 0x09, r24 ; 9 407c: 99 b9 out 0x09, r25 ; 9 407e: 89 b9 out 0x09, r24 ; 9 4080: 29 b9 out 0x09, r18 ; 9 4082: 89 b9 out 0x09, r24 ; 9 4084: 99 b9 out 0x09, r25 ; 9 4086: 89 b9 out 0x09, r24 ; 9 4088: 39 b9 out 0x09, r19 ; 9 408a: 89 b9 out 0x09, r24 ; 9 408c: 99 b9 out 0x09, r25 ; 9 408e: 89 b9 out 0x09, r24 ; 9 4090: 29 b9 out 0x09, r18 ; 9 4092: 89 b9 out 0x09, r24 ; 9 4094: 99 b9 out 0x09, r25 ; 9 4096: 89 b9 out 0x09, r24 ; 9 4098: 49 b9 out 0x09, r20 ; 9 409a: 89 b9 out 0x09, r24 ; 9 409c: 99 b9 out 0x09, r25 ; 9 409e: 89 b9 out 0x09, r24 ; 9 40a0: 29 b9 out 0x09, r18 ; 9 40a2: 89 b9 out 0x09, r24 ; 9 40a4: 99 b9 out 0x09, r25 ; 9 40a6: 89 b9 out 0x09, r24 ; 9 40a8: 39 b9 out 0x09, r19 ; 9 40aa: 89 b9 out 0x09, r24 ; 9 40ac: 99 b9 out 0x09, r25 ; 9 40ae: 89 b9 out 0x09, r24 ; 9 40b0: 29 b9 out 0x09, r18 ; 9 40b2: 89 b9 out 0x09, r24 ; 9 40b4: 99 b9 out 0x09, r25 ; 9 40b6: 89 b9 out 0x09, r24 ; 9 40b8: 19 b9 out 0x09, r17 ; 9 40ba: 89 b9 out 0x09, r24 ; 9 40bc: 99 b9 out 0x09, r25 ; 9 40be: 89 b9 out 0x09, r24 ; 9 40c0: 29 b9 out 0x09, r18 ; 9 40c2: 89 b9 out 0x09, r24 ; 9 40c4: 99 b9 out 0x09, r25 ; 9 40c6: 89 b9 out 0x09, r24 ; 9 40c8: 39 b9 out 0x09, r19 ; 9 40ca: 89 b9 out 0x09, r24 ; 9 40cc: 99 b9 out 0x09, r25 ; 9 40ce: 89 b9 out 0x09, r24 ; 9 40d0: 29 b9 out 0x09, r18 ; 9 40d2: 89 b9 out 0x09, r24 ; 9 40d4: 99 b9 out 0x09, r25 ; 9 40d6: 89 b9 out 0x09, r24 ; 9 40d8: 49 b9 out 0x09, r20 ; 9 40da: 89 b9 out 0x09, r24 ; 9 40dc: 99 b9 out 0x09, r25 ; 9 40de: 89 b9 out 0x09, r24 ; 9 40e0: 29 b9 out 0x09, r18 ; 9 40e2: 89 b9 out 0x09, r24 ; 9 40e4: 99 b9 out 0x09, r25 ; 9 40e6: 89 b9 out 0x09, r24 ; 9 40e8: 39 b9 out 0x09, r19 ; 9 40ea: 89 b9 out 0x09, r24 ; 9 40ec: 99 b9 out 0x09, r25 ; 9 40ee: 89 b9 out 0x09, r24 ; 9 40f0: 29 b9 out 0x09, r18 ; 9 40f2: 89 b9 out 0x09, r24 ; 9 40f4: 99 b9 out 0x09, r25 ; 9 40f6: 89 b9 out 0x09, r24 ; 9 40f8: 59 b9 out 0x09, r21 ; 9 40fa: 89 b9 out 0x09, r24 ; 9 40fc: 99 b9 out 0x09, r25 ; 9 40fe: 89 b9 out 0x09, r24 ; 9 4100: 29 b9 out 0x09, r18 ; 9 4102: 89 b9 out 0x09, r24 ; 9 4104: 99 b9 out 0x09, r25 ; 9 4106: 89 b9 out 0x09, r24 ; 9 4108: 39 b9 out 0x09, r19 ; 9 410a: 89 b9 out 0x09, r24 ; 9 410c: 99 b9 out 0x09, r25 ; 9 410e: 89 b9 out 0x09, r24 ; 9 4110: 29 b9 out 0x09, r18 ; 9 4112: 89 b9 out 0x09, r24 ; 9 4114: 99 b9 out 0x09, r25 ; 9 4116: 89 b9 out 0x09, r24 ; 9 4118: 49 b9 out 0x09, r20 ; 9 411a: 89 b9 out 0x09, r24 ; 9 411c: 99 b9 out 0x09, r25 ; 9 411e: 89 b9 out 0x09, r24 ; 9 4120: 29 b9 out 0x09, r18 ; 9 4122: 89 b9 out 0x09, r24 ; 9 4124: 99 b9 out 0x09, r25 ; 9 4126: 89 b9 out 0x09, r24 ; 9 4128: 39 b9 out 0x09, r19 ; 9 412a: 89 b9 out 0x09, r24 ; 9 412c: 99 b9 out 0x09, r25 ; 9 412e: 89 b9 out 0x09, r24 ; 9 4130: 29 b9 out 0x09, r18 ; 9 4132: 89 b9 out 0x09, r24 ; 9 4134: 99 b9 out 0x09, r25 ; 9 4136: 89 b9 out 0x09, r24 ; 9 4138: 1f 91 pop r17 413a: 0f 91 pop r16 413c: ff 90 pop r15 413e: 08 95 ret
Whenever count_8191 is called the registers r15, r16, r17 will be pushed to the stack and restored later on. Also the constants 1, 2, 4, …, 128 are loaded into some registers. But if you look closely the compiler will not use 8 registers and then leave them as they are. Instead it will use less registers overwriting some of them every once in a while. This results in additional ldi instructions. This is where additional cycles are burned.
If we look at the code for loop() we also see such ldi statements.
0000415a <loop>: void loop() { 415a: cf 92 push r12 415c: df 92 push r13 415e: ef 92 push r14 4160: ff 92 push r15 4162: 0f 93 push r16 4164: 1f 93 push r17 l0: TWICE(TWICE(TWICE(TWICE(TWICE(COUNT_524288))))); 4166: 10 e2 ldi r17, 0x20 ; 32 4168: 01 e0 ldi r16, 0x01 ; 1 416a: 22 e0 ldi r18, 0x02 ; 2 416c: f2 2e mov r15, r18 416e: 94 e0 ldi r25, 0x04 ; 4 4170: e9 2e mov r14, r25 4172: 88 e0 ldi r24, 0x08 ; 8 4174: d8 2e mov r13, r24 4176: b0 e1 ldi r27, 0x10 ; 16 4178: cb 2e mov r12, r27 417a: 0e 94 53 00 call 0xa6 ; 0xa6 <_Z10count_8191v> 417e: 13 b9 out 0x03, r17 ; 3 4180: 0e 94 53 00 call 0xa6 ; 0xa6 <_Z10count_8191v> 4184: 06 b9 out 0x06, r16 ; 6 4186: 0e 94 53 00 call 0xa6 ; 0xa6 <_Z10count_8191v> 418a: 13 b9 out 0x03, r17 ; 3 418c: 0e 94 53 00 call 0xa6 ; 0xa6 <_Z10count_8191v> 4190: f6 b8 out 0x06, r15 ; 6 4192: 0e 94 53 00 call 0xa6 ; 0xa6 <_Z10count_8191v> 4196: 13 b9 out 0x03, r17 ; 3 4198: 0e 94 53 00 call 0xa6 ; 0xa6 <_Z10count_8191v> 419c: 06 b9 out 0x06, r16 ; 6 419e: 0e 94 53 00 call 0xa6 ; 0xa6 <_Z10count_8191v> 41a2: 13 b9 out 0x03, r17 ; 3 41a4: 0e 94 53 00 call 0xa6 ; 0xa6 <_Z10count_8191v>
Looking at the overall structure it is clear that it would be much better to preload 8 registers with the constans 1, 2, 4, …, 128 and then use just out instructions. Since we there are no other computations anywhere in the program it would be appropriate to not push any registers to the stack for subroutine processing. Instead we should reuse the same registers over and over again.
// // www.blinkenlight.net // // Copyright 2012 Udo Klein // // This program is free software: you can redistribute it and/or modify // it under the terms of the GNU General Public License as published by // the Free Software Foundation, either version 3 of the License, or // (at your option) any later version. // // This program is distributed in the hope that it will be useful, // but WITHOUT ANY WARRANTY; without even the implied warranty of // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the // GNU General Public License for more details. // // You should have received a copy of the GNU General Public License // along with this program. If not, see http://www.gnu.org/licenses/ #define COUNT_1 asm volatile ("out 0x09, r20"); #define COUNT_3 COUNT_1; asm volatile ("out 0x09, r21"); COUNT_1; #define COUNT_7 COUNT_3; asm volatile ("out 0x09, r22"); COUNT_3; #define COUNT_15 COUNT_7; asm volatile ("out 0x09, r23"); COUNT_7; #define COUNT_31 COUNT_15; asm volatile ("out 0x09, r24"); COUNT_15; #define COUNT_63 COUNT_31; asm volatile ("out 0x09, r25"); COUNT_31; #define COUNT_127 COUNT_63; asm volatile ("out 0x09, r26"); COUNT_63; #define COUNT_255 COUNT_127; asm volatile ("out 0x09, r27"); COUNT_127; #define COUNT_511 COUNT_255; asm volatile ("out 0x03, r20"); COUNT_255; #define COUNT_1023 COUNT_511; asm volatile ("out 0x03, r21"); COUNT_511; #define COUNT_2047 COUNT_1023; asm volatile ("out 0x03, r22"); COUNT_1023; #define COUNT_4095 COUNT_2047; asm volatile ("out 0x03, r23"); COUNT_2047; void count_8191() { COUNT_4095; asm volatile ("out 0x03, r24"); COUNT_4095; } #define COUNT_16383 count_8191(); asm volatile ("out 0x03, r25"); count_8191(); #define COUNT_32767 COUNT_16383; asm volatile ("out 0x06, r20"); COUNT_16383; #define COUNT_65535 COUNT_32767; asm volatile ("out 0x06, r21"); COUNT_32767; #define COUNT_131071 COUNT_65535; asm volatile ("out 0x06, r22"); COUNT_65535; #define COUNT_262143 COUNT_131071; asm volatile ("out 0x06, r23"); COUNT_131071; #define COUNT_524287 COUNT_262143; asm volatile ("out 0x06, r24"); COUNT_262143; #define COUNT_524288 COUNT_524287; asm volatile ("out 0x06, r25"); #define TWICE(a) a a void setup() { DDRD = 0b11111111; // set digital 0- 7 to output DDRB = 0b00111111; // set digital 8-13 to output DDRC = 0b00111111; // set digital 14-19 to output (coincidences with analog 0-5) // ensure we do not get interrupted during prescaler manipulation cli(); /* // activate this block by removing the previous comment // set clock prescaler as desired to slow down the timer const uint8_t clock_prescaler_1 = 0; const uint8_t clock_prescaler_2 = 1; const uint8_t clock_prescaler_4 = 2; const uint8_t clock_prescaler_8 = 3; const uint8_t clock_prescaler_16 = 4; const uint8_t clock_prescaler_32 = 5; const uint8_t clock_prescaler_64 = 6; const uint8_t clock_prescaler_128 = 7; const uint8_t clock_prescaler_256 = 8; // prepare to set clock prescaler: write CLKPCE bit to one and all the other to zero CLKPR = 1<<CLKPCE; // set clock prescaler immediately after preparing to do so CLKPR = clock_prescaler_256; /**/ // do not enable interrupts back again because we do not want // interrupts to slow us any down } void loop() { asm volatile ("ldi r20, 0x01"); asm volatile ("ldi r21, 0x02"); asm volatile ("ldi r22, 0x04"); asm volatile ("ldi r23, 0x08"); asm volatile ("ldi r24, 0x10"); asm volatile ("ldi r25, 0x20"); asm volatile ("ldi r26, 0x40"); asm volatile ("ldi r27, 0x80"); l0: TWICE(TWICE(TWICE(TWICE(TWICE(COUNT_524288))))); goto l0; }
Now how much overhead will we still have? Since our pieces of assembler code did not tell the compiler that any registers will be changed the compiler is smart enough to generate no code for pushing registers to the stack. Thus the only remaining overhead are the subroutine calls and the goto statement. According to the datasheet call and ret take 4 cycles each. So each call to count_8191 will take 8191 cycles for counting and 8 cycles overhead. Counting to 524 288 will take 64 calls or 64*8 = 512 cycles of overhead.
After counting 32 times to 524 288 (=16 777 216 ticks) there is one goto statemement that will consume 3 additional cycles. Since 3/16 777 216 < 1/1 000 000 this is below the clock precision, we can and will ignore this.
So basically we should now be running at 1-(8/8192) = 99.90% of the hardware speed. With other words I would expect to measure 3 996 094 Hz for the lowest significant bit. Or in terms of overhead about 15625 overhead cycles per second (or ~15628 cycles if we do not ignore the goto statement).
This time I measure 3 996 167 Hz for the lowest significant bit. This is 73 Hz or ~18 ppm faster than computed. Obviously my Arduino is running at a slightly higher frequency than the advertised 16 MHz. This is something we already explored in the Crystal Deviations experiment.
If you watch closely and compare it with the other counter video you notice that the improved counter is only marginally faster although we got rid of 90% of the overhead. How come? Well, the fast counter runs at 98.9% of the theoretical limit. The faster counter code reduces the overhead by 90% and runs at 99.9% of the theoretical limit. This is just 1% faster. This effect is also known as the “law of diminishing returns”. The closer you get to the optimum the less improvement you will see even if you try harder and harder.
Decompile the compiled program?
Don’t you know GCC can generate list files of your code? They are much easier to analyze.
I did not decompile it, I disassembled it. Technically this is something different. I am aware that GCC can generate said files. However I am not at all aware how to tell the Arduino IDE to pass the necessary flags to the compiler. So if you can provide instructions on how to do this “much easier” I would be more than happy to follow them.
dimishing? you most likely meant: ‘law of diminishing returns’.
Thanks for pointing this out. I fixed it.
On topic of typos, see “one goto stamement” (third paragraph before video).
Very interesting article, I have an interesting project that led me to this post. I need to replace the top octave synthesizer in an old electric organ, it is a 16 pin IC that takes a 2MHz reference signal and divides it by varying divisors (200 to 500) and outputs 12 different frequencies on each pin. I am emulating this with 12 attiny85s running to tone() command, but there are few issues with that method that make it less than ideal. Any thoughts on the best way to implement this correctly… A datasheet for the AY-3-0214 TOS chip will help explain what I am talking about…
Sorry Steve, but this is not the way to ask questions. If it is not worth your time to give sufficient details on what you are doing it is definitely not worth my time to guess what you might need. Having said that I doubt that your issue is related to my experiments. Beginner questions of unrelated Arduino stuff belong into the Arduino forum. You might also want to read this: http://stackoverflow.com/help/how-to-ask. This advice applies to any forum.