Comparing the performance of a de Bruijn sequence against a simple lookup table
While optimising the naive proof-of-work algorithm in my dave protocol implementation, I compared the efficiency of some methods of counting the number of leading zero bits in a byte slice.
Approaches
1. Naive approach
func nzerobit_1(key []byte) int {
var n int
for _, b := range key {
for i := 0; i < 8; i++ {
if b&(1<<i) == 0 {
n++
} else {
return n
}
}
}
return n
}
Iterating over each byte in the given key, we use the bitwise AND operator &
to check if the i
-th bit of the byte is zero. The expression 1<<i
creates a bitmask with only the i
-th bit set to 1, and the &
operator performs a bitwise AND operation between the byte and the bitmask. If the result is zero, it means that the i
-th bit is zero, and we increment n
. If the bit is set, we return the current value of n
.
2. Right shift
Instead of using b&(1<<i) == 0
, we can use (b>>i)&1 == 0
, avoiding the expensive left shift operation. This should yeild a small performance improvement.
func nzerobit_2(key []byte) int {
var n int
for _, b := range key {
for i := 0; i < 8; i++ {
if (b>>i)&1 == 0 {
n++
} else {
return n
}
}
}
return n
}
3. Unroll loop
Loops are expensive. Loop unrolling can help reduce overhead, and enable better instruction-level parallelism.
func nzerobit_3(key []byte) int {
var n int
for _, b := range key {
if (b>>0)&1 == 0 { n++ } else { return n }
if (b>>1)&1 == 0 { n++ } else { return n }
if (b>>2)&1 == 0 { n++ } else { return n }
if (b>>3)&1 == 0 { n++ } else { return n }
if (b>>4)&1 == 0 { n++ } else { return n }
if (b>>5)&1 == 0 { n++ } else { return n }
if (b>>6)&1 == 0 { n++ } else { return n }
if (b>>7)&1 == 0 { n++ } else { return n }
}
return n
}
4. Simple lookup table
We can create a lookup table storing the number of leading zero bits for every possible byte value.
var zeros = [256]int{
8, 7, 6, 6, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
}
func nzerobit_4(key []byte) int {
var n int
for _, b := range key {
n += zeros[b]
if b != 0 {
return n
}
}
return n
}
While simple, this approach can be significantly faster for small input sizes. As my use case is counting the number of leading zero bits in a 256 bit hash, this method could yeild great results.
5. De Bruijn sequence
var debruijn64 = [64]byte{
0, 1, 2, 53, 3, 7, 54, 27, 4, 38, 41, 8, 34,
55, 48, 28, 62, 5, 39, 46, 44, 42, 22, 9, 24,
35, 59, 56, 49, 18, 29, 11, 63, 52, 6, 26,
37, 40, 33, 47, 61, 45, 43, 21, 23, 58, 17,
10, 51, 25, 36, 32, 60, 20, 57, 16, 50, 31,
19, 15, 30, 14, 13, 12,
}
func nzerobit_5(key []byte) int {
for i, b := range key {
if b != 0 {
x := uint64(b&-b)*0x03f79d71b4ca8b09>>58
return i*8 + int(debruijn64[x])
}
}
return len(key) * 8
}
The debruijn64 array is a precomputed lookup table that maps the position of the least significant set bit in a byte to the number of trailing zeros.
i*8
gives the number of zero bits in the previous bytes. b&-b
isolates the least significant set bit in the current byte. uint64(b&-b)*0x03f79d71b4ca8b09
multiplies the isolated bit with a magic constant to generate a unique index into the table. >>58
shifts the result to the right by 58 bits to bring the relevant bits into the least significant position. The looked-up value represents the number of trailing zeros in the current byte, which is added to i*8
to get the total number of leading zero bits.
The de Bruijn sequence lookup table provides a constant-time method to determine the number of trailing zeros in a byte. By combining this with the byte index, the function efficiently calculates the total number of leading zero bits in the byte slice.
Benchmarks
The following results are from a 12-core Apple M2 Pro.
The simple lookup and de Bruijn sequence are the clear winners, regardless of byte length and number of leading zero bytes. As there is no clear winner between these two, I suggest running tests for your specific use case.
16 bytes, 4 leading zero bytes
Benchmark_1-12 53595952 22.68 ns/op
Benchmark_2-12 54691712 22.68 ns/op
Benchmark_3-12 76803686 15.74 ns/op
Benchmark_4-12 421736720 2.861 ns/op
Benchmark_5-12 423180349 2.864 ns/op
32 bytes, 2 leading zero bytes
Benchmark_1-12 80881609 14.65 ns/op
Benchmark_2-12 86697114 13.87 ns/op
Benchmark_3-12 100000000 11.22 ns/op
Benchmark_4-12 577513431 2.087 ns/op
Benchmark_5-12 570226458 2.076 ns/op
32 bytes, 4 leading zero bytes
Benchmark_1-12 49990105 23.07 ns/op
Benchmark_2-12 54375429 21.99 ns/op
Benchmark_3-12 76524105 15.73 ns/op
Benchmark_4-12 413261740 2.862 ns/op
Benchmark_5-12 403608087 2.912 ns/op
32 bytes, 8 leading zero bytes
Benchmark_1-12 34218296 35.23 ns/op
Benchmark_2-12 35122378 34.62 ns/op
Benchmark_3-12 62954121 19.30 ns/op
Benchmark_4-12 315074923 3.810 ns/op
Benchmark_5-12 315627609 3.805 ns/op
64 bytes, 4 leading zero bytes
Benchmark_1-12 50546887 23.29 ns/op
Benchmark_2-12 51958827 22.18 ns/op
Benchmark_3-12 73563591 15.65 ns/op
Benchmark_4-12 401217024 3.052 ns/op
Benchmark_5-12 386922139 3.040 ns/op
128 bytes, 4 leading zero bytes
Benchmark_1-12 53089023 22.98 ns/op
Benchmark_2-12 53183925 22.37 ns/op
Benchmark_3-12 74646594 16.12 ns/op
Benchmark_4-12 142048627 8.512 ns/op
Benchmark_5-12 142594008 8.442 ns/op
256 bytes, 4 leading zero bytes
Benchmark_1-12 51616419 23.62 ns/op
Benchmark_2-12 53396121 22.98 ns/op
Benchmark_3-12 57543146 18.51 ns/op
Benchmark_4-12 100000000 10.83 ns/op
Benchmark_5-12 100000000 10.89 ns/op
256 bytes, 64 leading zero bytes
Benchmark_1-12 5252018 231.0 ns/op
Benchmark_2-12 5165072 231.0 ns/op
Benchmark_3-12 12656263 97.03 ns/op
Benchmark_4-12 57885492 20.43 ns/op
Benchmark_5-12 58068614 20.46 ns/op