Fast Counter

Every once in a while there is a question in the Arduino forum on how fast you can toggle an IO pin. The answer to this question depends on your line of thinking / what you will allow as an answer. So here is my take on performance optimization for its own sake.

First of all there are pins PB6 and PB7 which are connected to the crystal. Since these pins can be used as IO pins (but only if running without crystal) I would conclude the maximum speed is 16 MHz. However this feels like cheating as these pins just oscillate due to the crystal. Let us look at the next option for super fast output. According to the data sheet the fuses can be changed such that the system clock will be passed to PB0 which corresponds to digital 8. This will again give a 16 MHz output. But it is still somehow cheating because it requires to change the fuses which is probably out of reach for a lot of Arduino tinkerers. The next thing is to go for one of the hardware timers and toggle one of the output pins at each clock cycle. This would give an output of 8 MHz. I would not consider this cheating but it is somewhat boring because the hardware does all the work.

Pondering about this kind of arguments I started to wonder how fast can I toggle a pin in software? And as an afterthought how fast can I count a 20 bit counter (all the IO pins of my Arduino) in software? Of course this is just a somewhat theoretical exercise / performance optimization for its own sake. In a production environment one would feed a clock signal into counter chips. This would be cheaper and faster. However back to my issue: how fast can an Arduino count a 20 bit output without using a hardware counter?

First of all it is clear that instead of digitalWrite() I will use direct port manipulation. The naive solution would be to increment port D (digital 0-7), then carry on to port B (digital 8-13) and finally carry on to port C (digital 14-19).

This solution has two significant drawbacks though. The most significant drawback is that this approach does not work at all. The issue arises once this counter reaches 255. That is the point in time all bits of port D are set to 1 and all other ports are 0. The next step will set all bits of port D to 0 and bit 0 of port B to 1. Since this is performed in software the output will jump from 255 to 0 (after port D is incremented) and then to 256 (after the increment is carried to port B). Changing the update order does not make this any better. Then the output will jump from 255 to 0 and then 256. In both cases it is just impossible to count glitch free.

The other issue with the naive solution is that every time a carry operation is required two output ports must be toggled thus reducing the maximum reachable speed.

In order to resolve both issues I figured that it is better to not count in binary at all. Instead I will implement a Gray code counter. See the table below how Gray code looks compared to binary.

hexadecimal binary gray bit toggled
0 0000 0000 n/a
1 0001 0001 0
2 0010 0011 1
3 0011 0010 0
4 0100 0110 2
5 0101 0111 0
6 0110 0101 1
7 0111 0100 0
8 1000 1100 3
9 1001 1101 0
a 1010 1111 1
b 1011 1110 0
c 1100 1010 2
d 1101 1011 0
e 1110 1001 1
f 1111 1000 0

The point is that counting in Gray code will only toggle 1 bit per step. Thus it avoids both issues I already mentioned. As it will turn out it also allows to come up with a very fast implementation.

Besides this theoretical considerations there are practical considerations as well. How to implement this with as little computations as possible? And along the same train of thought: how to implement this with as little memory operations as possible? How to keep as much as possible in some CPU registers?

You may want to ponder this a little bit before you read on and have a look at my first solution.

The picture is exhibits how Arduino pins belong to AVR ports.

//
//  www.blinkenlight.net
//
//  Copyright 2012 Udo Klein
//
//  This program is free software: you can redistribute it and/or modify
//  it under the terms of the GNU General Public License as published by
//  the Free Software Foundation, either version 3 of the License, or
//  (at your option) any later version.
//
//  This program is distributed in the hope that it will be useful,
//  but WITHOUT ANY WARRANTY; without even the implied warranty of
//  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
//  GNU General Public License for more details.
//
//  You should have received a copy of the GNU General Public License
//  along with this program. If not, see http://www.gnu.org/licenses/


const uint8_t bit_0 = 1<<0;
const uint8_t bit_1 = 1<<1;
const uint8_t bit_2 = 1<<2;
const uint8_t bit_3 = 1<<3;
const uint8_t bit_4 = 1<<4;
const uint8_t bit_5 = 1<<5;
const uint8_t bit_6 = 1<<6;
const uint8_t bit_7 = 1<<7;

#define COUNT_1                PIND = bit_0;
#define COUNT_3    COUNT_1;    PIND = bit_1; COUNT_1;
#define COUNT_7    COUNT_3;    PIND = bit_2; COUNT_3;
#define COUNT_15   COUNT_7;    PIND = bit_3; COUNT_7;
#define COUNT_31   COUNT_15;   PIND = bit_4; COUNT_15;
#define COUNT_63   COUNT_31;   PIND = bit_5; COUNT_31;
#define COUNT_127  COUNT_63;   PIND = bit_6; COUNT_63;
#define COUNT_255  COUNT_127;  PIND = bit_7; COUNT_127;
#define COUNT_511  COUNT_255;  PINB = bit_0; COUNT_255;
#define COUNT_1023 COUNT_511;  PINB = bit_1; COUNT_511;
#define COUNT_2047 COUNT_1023; PINB = bit_2; COUNT_1023;
#define COUNT_4095 COUNT_2047; PINB = bit_3; COUNT_2047;

void count_8191() { COUNT_4095; PINB = bit_4; COUNT_4095; }

#define COUNT_16383  count_8191(); PINB = bit_5; count_8191();
#define COUNT_32767  COUNT_16383;  PINC = bit_0; COUNT_16383;
#define COUNT_65535  COUNT_32767;  PINC = bit_1; COUNT_32767;
#define COUNT_131071 COUNT_65535;  PINC = bit_2; COUNT_65535;
#define COUNT_262143 COUNT_131071; PINC = bit_3; COUNT_131071;
#define COUNT_524287 COUNT_262143; PINC = bit_4; COUNT_262143;
#define COUNT_524288 COUNT_524287; PINC = bit_5;

#define TWICE(a) a a

void setup() {
    DDRD = 0b11111111; // set digital  0- 7 to output
    DDRB = 0b00111111; // set digital  8-13 to output
    DDRC = 0b00111111; // set digital 14-19 to output (coincidences with analog 0-5)

    // ensure we do not get interrupted during prescaler manipulation
    cli();

/*  
    // activate this block by removing the previous comment
    // set clock prescaler as desired to slow down the timer
    const uint8_t clock_prescaler_1   = 0;
    const uint8_t clock_prescaler_2   = 1;
    const uint8_t clock_prescaler_4   = 2;
    const uint8_t clock_prescaler_8   = 3;
    const uint8_t clock_prescaler_16  = 4;
    const uint8_t clock_prescaler_32  = 5;
    const uint8_t clock_prescaler_64  = 6;
    const uint8_t clock_prescaler_128 = 7;
    const uint8_t clock_prescaler_256 = 8;

    // prepare to set clock prescaler: write CLKPCE bit to one and all the other to zero
    CLKPR = 1<<CLKPCE;
    // set clock prescaler immediately after preparing to do so
    CLKPR = clock_prescaler_256;
/**/
    // do not enable interrupts again because we do not want
    // interrupts to slow us any down
}

void loop() {
    l0:
        TWICE(TWICE(TWICE(TWICE(TWICE(COUNT_524288)))));
    goto l0;
}

If you look at my code it sure looks strange. This is due to the aggressive optimizations. So let me dissect them one by one.

The simplest optimization is the use of the goto statement. Although the loop function will be repeated forever there is some slight overhead attached to it. This overhead consists of 3 things: entering the function, leaving the function and processing a tiny bit of code outside of the function that is not relevant for counting at all. The goto statement will get rid of this overhead. Compared to the other optimizations it is the simplest but also the least effective.

The next optimization is that I do not use digitalRead() and digitalWrite(). Instead I use direct port manipulation. If you already have some experience with direct port manipulation you will know that pin states can be read from the PINB, PINC and PIND while pin states will be set by writing to PORTB, PORTC and PORTD. However here I do something strange. I write to PINB, PINC and PIND. This is a not so well known functionality that is documented in the datasheet. Writing a bit mask into the PIN registers will toggle all pins that are set to 1 in this mask. So instead of writing the desired output bit mask into the PORT registers I write to the PIN register to indicate which bit I want to toggle.

At first glance this does not seem to give any performance advantage. However there is a subtle detail here. The Arduino processor has 32 general registers. That is the processor can hold easily 8 values in these registers. As I am toggling at most 8 different bits per port these 8 values will fit into the processor’s general registers. If I would be pushing the desired values directly into the PORT registers I would require 256 different values. Obviously 256 values will never fit into 32 registers no matter how hard you try. So if the compiler is clever the PIN register approach will allow to keep the values in the registers and thus avoid any value reloading in the general registers. As it turns out the GNU compiler is smart enough to apply this optimization. Thus the bit toggling approach is faster than pushing precomputed values into the ports.

Now to the last and most aggressive of the optimizations – the COUNT macros. These macros are an “unrolled loop”. Instead of toggling the pins in some loop I have the macro processor output the toggle statements to the compiler.

The simplest of the macros is COUNT_1

#define COUNT_1                PIND = bit_0;

The macro processor will expand COUNT_1 to the following code containing 1 bit toggle operation.

PIND = bit_0;

The next macro is COUNT_3.

#define COUNT_3    COUNT_1;    PIND = bit_1; COUNT_1;

Since COUNT_3 contains the macro COUNT_1 the macro processor will expand this recursively and end up with 3 toggle operations.

PIND = bit_0;
PIND = bit_1; 
PIND = bit_0;

Similarily COUNT_7 gets expanded to 7 toggle operations.

PIND = bit_0;
PIND = bit_1; 
PIND = bit_0;
PIND = bit_2;
PIND = bit_0;
PIND = bit_1; 
PIND = bit_0;

If you look up in the table above you see that these macros toggle the bits exactly as needed for counting in gray code.

As you probably have already guessed COUNT_4095 gets expanded into 4095 bit toggle operations. But then there is something different. I do not introduce a COUNT_8191 macro. Instead I introduce a count_8191() function. Here is why. Have a look at the implementation of this function. It contains 8191 toggle operations. Each of these operations will take up at least one machine language statement. Since a machine language statement will take at least 1 word (=2 bytes) this function occupies at least 16382 bytes of flash memory. As an Arduino has only about 32 kBytes of flash memory I had to stop here because otherwise I would be running out of memory.

So basically this function stops the exponential size growth of the expanded macros. Then the game continues. The difference is now that the expanded macros will contain function calls and toggle operations.

For example COUNT_65535 gets expanded to the following code.

count_8191(); 
PINB = bit_5;
count_8191();
PINC = bit_0;
count_8191(); 
PINB = bit_5;
count_8191();
PINC = bit_1;
count_8191(); 
PINB = bit_5;
count_8191();
PINC = bit_0;
count_8191(); 
PINB = bit_5;
count_8191();

That is 8 function calls and 7 toggle operations or a total of 15 C statements. Much less than the 65535 statements if count_8191 would be a macro. If you do a little bit of math you will figure out that COUNT_524288 will be expanded to a total of 256 C statements.

As the final topping the successive application of the TWICE macro concatenates the COUNT_524288 macro 32 times. This is to reduce the overhead introduced by the goto statements as much as possible.

Now how fast is this thing counting? My frequency counter says the lowest significant bit toggles at 3 956 121 Hz. This implies it toggles 7 912 242 times per second. Since the lowest significant bit toggles every second count this thing counts at 15 824 484 ticks per second. This is roughly 98.9% of the hardware clock. Not to bad for a software solution.

If you want to verify this without a frequency counter there is a simple trick. My code is already prepared to divide the system clock by 256. You just need to remove one comment. This will effectively run the Arduino at 62.5 kHz. Thus the lowest bit will toggle at 15 454 Hz. Since each bit will toggle at half the speed of the previous bit, the highest significant bit will toggle at 1/2^18 times this frequency. This is 0.058 951 Hz or a period length of 16.963 seconds. This makes it feasible to verify the timing with a stop watch or a video camera. The video shows an Arduino running with this prescaler setup.

Not fast enough? Read the faster counter experiment on how to push this just one step further ;)

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s