Faster Counter

This is my second stab at toggling / counting as fast as possible in software. Please notice that this is a somewhat pointless optimization except for the sake of showing how to squeeze the last bit of performance out of an Arduino.

As we already know the fast counter runs at 15 824 484 ticks per second. Or with other words it is 175 516 ticks per second slower than the theoretical maximum.

Most of this overhead is probably due to the function calls to count_8191(). This is because there are not a lot of other places in the code where the overhead could occur.

	void count_8191() { COUNT_4095; PINB = bit_4; COUNT_4095; }}

At 15 824 484 ticks per second count_8191() gets called 15 824 484 / 8192 = 1931.7 times per second. Thus the call overhead is roughly 175 516 / 1931.7 = 90.8 cycles per call. So lets have a look where these cycles are processed.

In order to figure out the cause for the overhead we will disassemble the compiled sketch. To prepare for this we will switch the IDE to “verbose” mode.

Navigate to Preferences

Activate Verbose Mode

After switching to verbose mode we compile the sketch to create an *.elf file and to figure out the build directory.

Determine Build Folder

Then we use avr-objdump to disassemble the compiled code and any editor to display the results.

udo:/tmp/build1588086228058204772.tmp$ avr-objdump -dS *.elf >disas.txt
udo:/tmp/build1588086228058204772.tmp$ gedit disas.txt

So let’s have a look at the assembler code for count_8191().

void count_8191() { COUNT_4095; PINB = bit_4; COUNT_4095; }
      a6:	ff 92       	push	r15
      a8:	0f 93       	push	r16
      aa:	1f 93       	push	r17
      ac:	81 e0       	ldi	r24, 0x01	; 1
      ae:	89 b9       	out	0x09, r24	; 9
      b0:	92 e0       	ldi	r25, 0x02	; 2
      b2:	99 b9       	out	0x09, r25	; 9
      b4:	89 b9       	out	0x09, r24	; 9
      b6:	24 e0       	ldi	r18, 0x04	; 4
      b8:	29 b9       	out	0x09, r18	; 9
      ba:	89 b9       	out	0x09, r24	; 9
      bc:	99 b9       	out	0x09, r25	; 9
      be:	89 b9       	out	0x09, r24	; 9
      c0:	38 e0       	ldi	r19, 0x08	; 8
      c2:	39 b9       	out	0x09, r19	; 9
      c4:	89 b9       	out	0x09, r24	; 9
      c6:	99 b9       	out	0x09, r25	; 9
      c8:	89 b9       	out	0x09, r24	; 9
      ca:	29 b9       	out	0x09, r18	; 9
      cc:	89 b9       	out	0x09, r24	; 9
      ce:	99 b9       	out	0x09, r25	; 9
      d0:	89 b9       	out	0x09, r24	; 9
      d2:	40 e1       	ldi	r20, 0x10	; 16
      d4:	49 b9       	out	0x09, r20	; 9

... lots of lines deleted ...

    3fcc:	49 b9       	out	0x09, r20	; 9
    3fce:	29 b9       	out	0x09, r18	; 9
    3fd0:	40 e1       	ldi	r20, 0x10	; 16
    3fd2:	49 b9       	out	0x09, r20	; 9
    3fd4:	89 b9       	out	0x09, r24	; 9
    3fd6:	99 b9       	out	0x09, r25	; 9
    3fd8:	89 b9       	out	0x09, r24	; 9
    3fda:	24 e0       	ldi	r18, 0x04	; 4
    3fdc:	29 b9       	out	0x09, r18	; 9
    3fde:	89 b9       	out	0x09, r24	; 9
    3fe0:	99 b9       	out	0x09, r25	; 9
    3fe2:	89 b9       	out	0x09, r24	; 9
    3fe4:	38 e0       	ldi	r19, 0x08	; 8
    3fe6:	39 b9       	out	0x09, r19	; 9
    3fe8:	89 b9       	out	0x09, r24	; 9
    3fea:	99 b9       	out	0x09, r25	; 9
    3fec:	89 b9       	out	0x09, r24	; 9
    3fee:	29 b9       	out	0x09, r18	; 9
    3ff0:	89 b9       	out	0x09, r24	; 9
    3ff2:	99 b9       	out	0x09, r25	; 9
    3ff4:	89 b9       	out	0x09, r24	; 9
    3ff6:	50 e2       	ldi	r21, 0x20	; 32
    3ff8:	59 b9       	out	0x09, r21	; 9
    3ffa:	89 b9       	out	0x09, r24	; 9
    3ffc:	99 b9       	out	0x09, r25	; 9
    3ffe:	89 b9       	out	0x09, r24	; 9
    4000:	29 b9       	out	0x09, r18	; 9
    4002:	89 b9       	out	0x09, r24	; 9
    4004:	99 b9       	out	0x09, r25	; 9
    4006:	89 b9       	out	0x09, r24	; 9
    4008:	39 b9       	out	0x09, r19	; 9
    400a:	89 b9       	out	0x09, r24	; 9
    400c:	99 b9       	out	0x09, r25	; 9
    400e:	89 b9       	out	0x09, r24	; 9
    4010:	29 b9       	out	0x09, r18	; 9
    4012:	89 b9       	out	0x09, r24	; 9
    4014:	99 b9       	out	0x09, r25	; 9
    4016:	89 b9       	out	0x09, r24	; 9
    4018:	49 b9       	out	0x09, r20	; 9
    401a:	89 b9       	out	0x09, r24	; 9
    401c:	99 b9       	out	0x09, r25	; 9
    401e:	89 b9       	out	0x09, r24	; 9
    4020:	29 b9       	out	0x09, r18	; 9
    4022:	89 b9       	out	0x09, r24	; 9
    4024:	99 b9       	out	0x09, r25	; 9
    4026:	89 b9       	out	0x09, r24	; 9
    4028:	39 b9       	out	0x09, r19	; 9
    402a:	89 b9       	out	0x09, r24	; 9
    402c:	99 b9       	out	0x09, r25	; 9
    402e:	89 b9       	out	0x09, r24	; 9
    4030:	29 b9       	out	0x09, r18	; 9
    4032:	89 b9       	out	0x09, r24	; 9
    4034:	99 b9       	out	0x09, r25	; 9
    4036:	89 b9       	out	0x09, r24	; 9
    4038:	b9 b9       	out	0x09, r27	; 9
    403a:	89 b9       	out	0x09, r24	; 9
    403c:	99 b9       	out	0x09, r25	; 9
    403e:	89 b9       	out	0x09, r24	; 9
    4040:	29 b9       	out	0x09, r18	; 9
    4042:	89 b9       	out	0x09, r24	; 9
    4044:	99 b9       	out	0x09, r25	; 9
    4046:	89 b9       	out	0x09, r24	; 9
    4048:	39 b9       	out	0x09, r19	; 9
    404a:	89 b9       	out	0x09, r24	; 9
    404c:	99 b9       	out	0x09, r25	; 9
    404e:	89 b9       	out	0x09, r24	; 9
    4050:	29 b9       	out	0x09, r18	; 9
    4052:	89 b9       	out	0x09, r24	; 9
    4054:	99 b9       	out	0x09, r25	; 9
    4056:	89 b9       	out	0x09, r24	; 9
    4058:	49 b9       	out	0x09, r20	; 9
    405a:	89 b9       	out	0x09, r24	; 9
    405c:	99 b9       	out	0x09, r25	; 9
    405e:	89 b9       	out	0x09, r24	; 9
    4060:	29 b9       	out	0x09, r18	; 9
    4062:	89 b9       	out	0x09, r24	; 9
    4064:	99 b9       	out	0x09, r25	; 9
    4066:	89 b9       	out	0x09, r24	; 9
    4068:	39 b9       	out	0x09, r19	; 9
    406a:	89 b9       	out	0x09, r24	; 9
    406c:	99 b9       	out	0x09, r25	; 9
    406e:	89 b9       	out	0x09, r24	; 9
    4070:	29 b9       	out	0x09, r18	; 9
    4072:	89 b9       	out	0x09, r24	; 9
    4074:	99 b9       	out	0x09, r25	; 9
    4076:	89 b9       	out	0x09, r24	; 9
    4078:	59 b9       	out	0x09, r21	; 9
    407a:	89 b9       	out	0x09, r24	; 9
    407c:	99 b9       	out	0x09, r25	; 9
    407e:	89 b9       	out	0x09, r24	; 9
    4080:	29 b9       	out	0x09, r18	; 9
    4082:	89 b9       	out	0x09, r24	; 9
    4084:	99 b9       	out	0x09, r25	; 9
    4086:	89 b9       	out	0x09, r24	; 9
    4088:	39 b9       	out	0x09, r19	; 9
    408a:	89 b9       	out	0x09, r24	; 9
    408c:	99 b9       	out	0x09, r25	; 9
    408e:	89 b9       	out	0x09, r24	; 9
    4090:	29 b9       	out	0x09, r18	; 9
    4092:	89 b9       	out	0x09, r24	; 9
    4094:	99 b9       	out	0x09, r25	; 9
    4096:	89 b9       	out	0x09, r24	; 9
    4098:	49 b9       	out	0x09, r20	; 9
    409a:	89 b9       	out	0x09, r24	; 9
    409c:	99 b9       	out	0x09, r25	; 9
    409e:	89 b9       	out	0x09, r24	; 9
    40a0:	29 b9       	out	0x09, r18	; 9
    40a2:	89 b9       	out	0x09, r24	; 9
    40a4:	99 b9       	out	0x09, r25	; 9
    40a6:	89 b9       	out	0x09, r24	; 9
    40a8:	39 b9       	out	0x09, r19	; 9
    40aa:	89 b9       	out	0x09, r24	; 9
    40ac:	99 b9       	out	0x09, r25	; 9
    40ae:	89 b9       	out	0x09, r24	; 9
    40b0:	29 b9       	out	0x09, r18	; 9
    40b2:	89 b9       	out	0x09, r24	; 9
    40b4:	99 b9       	out	0x09, r25	; 9
    40b6:	89 b9       	out	0x09, r24	; 9
    40b8:	19 b9       	out	0x09, r17	; 9
    40ba:	89 b9       	out	0x09, r24	; 9
    40bc:	99 b9       	out	0x09, r25	; 9
    40be:	89 b9       	out	0x09, r24	; 9
    40c0:	29 b9       	out	0x09, r18	; 9
    40c2:	89 b9       	out	0x09, r24	; 9
    40c4:	99 b9       	out	0x09, r25	; 9
    40c6:	89 b9       	out	0x09, r24	; 9
    40c8:	39 b9       	out	0x09, r19	; 9
    40ca:	89 b9       	out	0x09, r24	; 9
    40cc:	99 b9       	out	0x09, r25	; 9
    40ce:	89 b9       	out	0x09, r24	; 9
    40d0:	29 b9       	out	0x09, r18	; 9
    40d2:	89 b9       	out	0x09, r24	; 9
    40d4:	99 b9       	out	0x09, r25	; 9
    40d6:	89 b9       	out	0x09, r24	; 9
    40d8:	49 b9       	out	0x09, r20	; 9
    40da:	89 b9       	out	0x09, r24	; 9
    40dc:	99 b9       	out	0x09, r25	; 9
    40de:	89 b9       	out	0x09, r24	; 9
    40e0:	29 b9       	out	0x09, r18	; 9
    40e2:	89 b9       	out	0x09, r24	; 9
    40e4:	99 b9       	out	0x09, r25	; 9
    40e6:	89 b9       	out	0x09, r24	; 9
    40e8:	39 b9       	out	0x09, r19	; 9
    40ea:	89 b9       	out	0x09, r24	; 9
    40ec:	99 b9       	out	0x09, r25	; 9
    40ee:	89 b9       	out	0x09, r24	; 9
    40f0:	29 b9       	out	0x09, r18	; 9
    40f2:	89 b9       	out	0x09, r24	; 9
    40f4:	99 b9       	out	0x09, r25	; 9
    40f6:	89 b9       	out	0x09, r24	; 9
    40f8:	59 b9       	out	0x09, r21	; 9
    40fa:	89 b9       	out	0x09, r24	; 9
    40fc:	99 b9       	out	0x09, r25	; 9
    40fe:	89 b9       	out	0x09, r24	; 9
    4100:	29 b9       	out	0x09, r18	; 9
    4102:	89 b9       	out	0x09, r24	; 9
    4104:	99 b9       	out	0x09, r25	; 9
    4106:	89 b9       	out	0x09, r24	; 9
    4108:	39 b9       	out	0x09, r19	; 9
    410a:	89 b9       	out	0x09, r24	; 9
    410c:	99 b9       	out	0x09, r25	; 9
    410e:	89 b9       	out	0x09, r24	; 9
    4110:	29 b9       	out	0x09, r18	; 9
    4112:	89 b9       	out	0x09, r24	; 9
    4114:	99 b9       	out	0x09, r25	; 9
    4116:	89 b9       	out	0x09, r24	; 9
    4118:	49 b9       	out	0x09, r20	; 9
    411a:	89 b9       	out	0x09, r24	; 9
    411c:	99 b9       	out	0x09, r25	; 9
    411e:	89 b9       	out	0x09, r24	; 9
    4120:	29 b9       	out	0x09, r18	; 9
    4122:	89 b9       	out	0x09, r24	; 9
    4124:	99 b9       	out	0x09, r25	; 9
    4126:	89 b9       	out	0x09, r24	; 9
    4128:	39 b9       	out	0x09, r19	; 9
    412a:	89 b9       	out	0x09, r24	; 9
    412c:	99 b9       	out	0x09, r25	; 9
    412e:	89 b9       	out	0x09, r24	; 9
    4130:	29 b9       	out	0x09, r18	; 9
    4132:	89 b9       	out	0x09, r24	; 9
    4134:	99 b9       	out	0x09, r25	; 9
    4136:	89 b9       	out	0x09, r24	; 9
    4138:	1f 91       	pop	r17
    413a:	0f 91       	pop	r16
    413c:	ff 90       	pop	r15
    413e:	08 95       	ret

Whenever count_8191 is called the registers r15, r16, r17 will be pushed to the stack and restored later on. Also the constants 1, 2, 4, …, 128 are loaded into some registers. But if you look closely the compiler will not use 8 registers and then leave them as they are. Instead it will use less registers overwriting some of them every once in a while. This results in additional ldi instructions. This is where additional cycles are burned.

If we look at the code for loop() we also see such ldi statements.

0000415a <loop>:

void loop() {
    415a:	cf 92       	push	r12
    415c:	df 92       	push	r13
    415e:	ef 92       	push	r14
    4160:	ff 92       	push	r15
    4162:	0f 93       	push	r16
    4164:	1f 93       	push	r17
    l0:
        TWICE(TWICE(TWICE(TWICE(TWICE(COUNT_524288)))));
    4166:	10 e2       	ldi	r17, 0x20	; 32
    4168:	01 e0       	ldi	r16, 0x01	; 1
    416a:	22 e0       	ldi	r18, 0x02	; 2
    416c:	f2 2e       	mov	r15, r18
    416e:	94 e0       	ldi	r25, 0x04	; 4
    4170:	e9 2e       	mov	r14, r25
    4172:	88 e0       	ldi	r24, 0x08	; 8
    4174:	d8 2e       	mov	r13, r24
    4176:	b0 e1       	ldi	r27, 0x10	; 16
    4178:	cb 2e       	mov	r12, r27
    417a:	0e 94 53 00 	call	0xa6	; 0xa6 <_Z10count_8191v>
    417e:	13 b9       	out	0x03, r17	; 3
    4180:	0e 94 53 00 	call	0xa6	; 0xa6 <_Z10count_8191v>
    4184:	06 b9       	out	0x06, r16	; 6
    4186:	0e 94 53 00 	call	0xa6	; 0xa6 <_Z10count_8191v>
    418a:	13 b9       	out	0x03, r17	; 3
    418c:	0e 94 53 00 	call	0xa6	; 0xa6 <_Z10count_8191v>
    4190:	f6 b8       	out	0x06, r15	; 6
    4192:	0e 94 53 00 	call	0xa6	; 0xa6 <_Z10count_8191v>
    4196:	13 b9       	out	0x03, r17	; 3
    4198:	0e 94 53 00 	call	0xa6	; 0xa6 <_Z10count_8191v>
    419c:	06 b9       	out	0x06, r16	; 6
    419e:	0e 94 53 00 	call	0xa6	; 0xa6 <_Z10count_8191v>
    41a2:	13 b9       	out	0x03, r17	; 3
    41a4:	0e 94 53 00 	call	0xa6	; 0xa6 <_Z10count_8191v>

Looking at the overall structure it is clear that it would be much better to preload 8 registers with the constans 1, 2, 4, …, 128 and then use just out instructions. Since we there are no other computations anywhere in the program it would be appropriate to not push any registers to the stack for subroutine processing. Instead we should reuse the same registers over and over again.

//
//  www.blinkenlight.net
//
//  Copyright 2012 Udo Klein
//
//  This program is free software: you can redistribute it and/or modify
//  it under the terms of the GNU General Public License as published by
//  the Free Software Foundation, either version 3 of the License, or
//  (at your option) any later version.
//
//  This program is distributed in the hope that it will be useful,
//  but WITHOUT ANY WARRANTY; without even the implied warranty of
//  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
//  GNU General Public License for more details.
//
//  You should have received a copy of the GNU General Public License
//  along with this program. If not, see http://www.gnu.org/licenses/


#define COUNT_1                asm volatile ("out 0x09, r20");
#define COUNT_3    COUNT_1;    asm volatile ("out 0x09, r21"); COUNT_1;
#define COUNT_7    COUNT_3;    asm volatile ("out 0x09, r22"); COUNT_3;
#define COUNT_15   COUNT_7;    asm volatile ("out 0x09, r23"); COUNT_7;
#define COUNT_31   COUNT_15;   asm volatile ("out 0x09, r24"); COUNT_15;
#define COUNT_63   COUNT_31;   asm volatile ("out 0x09, r25"); COUNT_31;
#define COUNT_127  COUNT_63;   asm volatile ("out 0x09, r26"); COUNT_63;
#define COUNT_255  COUNT_127;  asm volatile ("out 0x09, r27"); COUNT_127;
#define COUNT_511  COUNT_255;  asm volatile ("out 0x03, r20"); COUNT_255;
#define COUNT_1023 COUNT_511;  asm volatile ("out 0x03, r21"); COUNT_511;
#define COUNT_2047 COUNT_1023; asm volatile ("out 0x03, r22"); COUNT_1023;
#define COUNT_4095 COUNT_2047; asm volatile ("out 0x03, r23"); COUNT_2047;

void count_8191() { COUNT_4095; asm volatile ("out 0x03, r24"); COUNT_4095; }

#define COUNT_16383  count_8191(); asm volatile ("out 0x03, r25"); count_8191();
#define COUNT_32767  COUNT_16383;  asm volatile ("out 0x06, r20"); COUNT_16383;
#define COUNT_65535  COUNT_32767;  asm volatile ("out 0x06, r21"); COUNT_32767;
#define COUNT_131071 COUNT_65535;  asm volatile ("out 0x06, r22"); COUNT_65535;
#define COUNT_262143 COUNT_131071; asm volatile ("out 0x06, r23"); COUNT_131071;
#define COUNT_524287 COUNT_262143; asm volatile ("out 0x06, r24"); COUNT_262143;
#define COUNT_524288 COUNT_524287; asm volatile ("out 0x06, r25");

#define TWICE(a) a a

void setup() {
    DDRD = 0b11111111; // set digital  0- 7 to output
    DDRB = 0b00111111; // set digital  8-13 to output
    DDRC = 0b00111111; // set digital 14-19 to output (coincidences with analog 0-5)

    // ensure we do not get interrupted during prescaler manipulation
    cli();

/*
    // activate this block by removing the previous comment
    // set clock prescaler as desired to slow down the timer
    const uint8_t clock_prescaler_1   = 0;
    const uint8_t clock_prescaler_2   = 1;
    const uint8_t clock_prescaler_4   = 2;
    const uint8_t clock_prescaler_8   = 3;
    const uint8_t clock_prescaler_16  = 4;
    const uint8_t clock_prescaler_32  = 5;
    const uint8_t clock_prescaler_64  = 6;
    const uint8_t clock_prescaler_128 = 7;
    const uint8_t clock_prescaler_256 = 8;

    // prepare to set clock prescaler: write CLKPCE bit to one and all the other to zero
    CLKPR = 1<<CLKPCE;
    // set clock prescaler immediately after preparing to do so
    CLKPR = clock_prescaler_256;
/**/
    // do not enable interrupts back again because we do not want
    // interrupts to slow us any down
}

void loop() {
     asm volatile ("ldi	r20, 0x01");
     asm volatile ("ldi	r21, 0x02");
     asm volatile ("ldi	r22, 0x04");
     asm volatile ("ldi	r23, 0x08");
     asm volatile ("ldi	r24, 0x10");
     asm volatile ("ldi	r25, 0x20");
     asm volatile ("ldi	r26, 0x40");
     asm volatile ("ldi	r27, 0x80");

    l0:
        TWICE(TWICE(TWICE(TWICE(TWICE(COUNT_524288)))));
    goto l0;
}

Now how much overhead will we still have? Since our pieces of assembler code did not tell the compiler that any registers will be changed the compiler is smart enough to generate no code for pushing registers to the stack. Thus the only remaining overhead are the subroutine calls and the goto statement. According to the datasheet call and ret take 4 cycles each. So each call to count_8191 will take 8191 cycles for counting and 8 cycles overhead. Counting to 524 288 will take 64 calls or 64*8 = 512 cycles of overhead.

After counting 32 times to 524 288 (=16 777 216 ticks) there is one goto stamement that will consume 3 additional cycles. Since 3/16 777 216 < 1/1 000 000 this is below the clock precision, we can and will ignore this.

So basically we should now be running at 1-(8/8192) = 99.90% of the hardware speed. With other words I would expect to measure 3 996 094 Hz for the lowest significant bit. Or in terms of overhead about 15625 overhead cycles per second (or ~15628 cycles if we do not ignore the goto statement).

This time I measure 3 996 167 Hz for the lowest significant bit. This is 73 Hz or ~18 ppm faster than computed. Obviously my Arduino is running at a slightly higher frequency than the advertised 16 MHz. This is something we already explored in the Crystal Deviations experiment.

If you watch closely and compare it with the other counter video you notice that the improved counter is only marginally faster although we got rid of 90% of the overhead. How come? Well, the fast counter runs at 98.9% of the theoretical limit. The faster counter code reduces the overhead by 90% and runs at 99.9% of the theoretical limit. This is just 1% faster. This effect is also known as the “law of diminishing returns”. The closer you get to the optimum the less improvement you will see even if you try harder and harder.

4 Responses to Faster Counter

  1. paul says:

    Decompile the compiled program?
    Don’t you know GCC can generate list files of your code? They are much easier to analyze.

    • I did not decompile it, I disassembled it. Technically this is something different. I am aware that GCC can generate said files. However I am not at all aware how to tell the Arduino IDE to pass the necessary flags to the compiler. So if you can provide instructions on how to do this “much easier” I would be more than happy to follow them.

  2. Andreas Schoelver says:

    dimishing? you most likely meant: ‘law of diminishing returns’.

  3. Thanks for pointing this out. I fixed it.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s