Simon Fell > Its just code > January 2021
Saturday, January 30, 2021
We're continuing with the array search covered in episode 1. I'd recommend reading that before this one.
Let's start with an easy one. One implementation we didn't cover in episode 1 is to use bytes.IndexByte. IndexByte ends up calling a hand-crafted assembly version, so there's some hope this might be decent.
func (n *nodeLoop) getIndexByte(k byte) interface{} {
    idx := bytes.IndexByte(n.key[:n.count], k)
    if idx < 0 {
        return nil
    }
    return n.val[idx]
}
Benchmark_IndexByte-8 11620299 102 ns/op
Benchmark_Loop-8 11181962 102 ns/op
Benchmark_LoopOneBoundsCheck-8 13485289 87.1 ns/op
Benchmark_LoopRev-8 10736464 111 ns/op
Benchmark_LoopRevOneBoundsCheck-8 14661828 81.0 ns/op
Benchmark_BinarySearch-8 4895956 246 ns/op
Benchmark_BinarySearchOneBoundsCheck-8 4806734 249 ns/op
Benchmark_BinarySearchInlined-8 10140198 116 ns/op
Benchmark_BinarySearchInlinedOneBoundsCheck-8 11359298 106 ns/op
PASS
That's disappointing. I generated a CPU profile, and it seems like the overhead of having to deal with an arbitrary-sized slice outweighs any gains from the more efficient comparison eventually done. A variation that always passes 16 bytes to bytes.IndexByte and checks the result against n.count performs almost the same. We'll skip the steps for profiling and disassembly; check out episode 1 if you need the details on how to do that.
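For reference, that fixed-width variation would look something like this. It's a sketch rather than the exact code from the repo: it always hands bytes.IndexByte the full 16-byte array and then rejects any hit that lands at or beyond n.count.
func (n *nodeLoop) getIndexByte16(k byte) interface{} {
    // always search all 16 bytes, so the slice length is a constant
    idx := bytes.IndexByte(n.key[:], k)
    if idx < 0 || idx >= n.count {
        return nil
    }
    return n.val[idx]
}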
Loop unrolling is a common optimization. From the assembly we looked at previously, we know the compiler isn't doing this, so we can write a manually unrolled version. Let's see how this does.
func (n *nodeLoop) getUnrolledLoop(k byte) interface{} {
    switch n.count {
    case 16:
        if n.key[15] == k {
            return n.val[15]
        }
        fallthrough
    case 15:
        if n.key[14] == k {
            return n.val[14]
        }
        fallthrough
    case 14:
        if n.key[13] == k {
            return n.val[13]
        }
        fallthrough
    case 13:
        if n.key[12] == k {
            return n.val[12]
        }
        fallthrough
...
Benchmark_IndexByte-8 11525317 104 ns/op
Benchmark_UnrolledLoop-8 13063185 93.0 ns/op
Benchmark_Loop-8 11788951 101 ns/op
Benchmark_LoopOneBoundsCheck-8 13612688 87.5 ns/op
Benchmark_LoopRev-8 10756114 112 ns/op
Benchmark_LoopRevOneBoundsCheck-8 14641873 82.3 ns/op
Benchmark_BinarySearch-8 4833198 248 ns/op
Benchmark_BinarySearchOneBoundsCheck-8 4789767 250 ns/op
Benchmark_BinarySearchInlined-8 10150819 118 ns/op
Benchmark_BinarySearchInlinedOneBoundsCheck-8 11339012 107 ns/op
There are still a lot of branches in the unrolled version, so I'm not totally surprised by the outcome. Reviewing the assembly for this one, Go doesn't use a jump table for the switch statement, which means it's doing more comparisons than expected as well.
One thing that stands out from the loop versions' assembly is that we have a 64-bit CPU which we're forcing to do things one byte at a time. Can we work 64 bits at a time instead, without resorting to hand-crafted assembly? What if we pack 8 keys into a uint64 and use bitwise operations to compare against the target key? Worst case, we'd have to do this on two uint64s instead of 16 bytes. It's doable, but at the cost of being significantly harder to understand than the loop version.
Starting with put, we need to work out which uint64 to update, and then shift the key value into the relevant byte of that uint64.
type nodeMasks struct {
    keys1 uint64
    keys2 uint64
    vals  [16]interface{}
    count int
}

func (n *nodeMasks) put(k byte, v interface{}) {
    m := &n.keys1
    c := n.count
    if n.count >= 8 {
        m = &n.keys2
        c = c - 8
    }
    *m = *m | (uint64(k) << (c * 8))
    n.vals[n.count] = v
    n.count++
}
get is more work. First off, we need a uint64 that has the value k in each byte. This can be constructed with a bunch of shifts, but that turns out to be a significant percentage of the overall work. There are only 256 possible values for it though, so we can calculate them all once at startup, and grab the right one when needed. The lookup table of masks takes 2KB (256 * 8), a more than reasonable tradeoff.
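The table setup isn't shown here, so here's a sketch of how masks, active, and the per-byte constants b1 through b8 that appear in get below could be defined. These are my assumptions about their shapes; the actual definitions in the repo may differ.
// b1..b8 each select one byte of a uint64, lowest byte first.
const (
    b1 uint64 = 0x00000000000000FF
    b2 uint64 = 0x000000000000FF00
    b3 uint64 = 0x0000000000FF0000
    b4 uint64 = 0x00000000FF000000
    b5 uint64 = 0x000000FF00000000
    b6 uint64 = 0x0000FF0000000000
    b7 uint64 = 0x00FF000000000000
    b8 uint64 = 0xFF00000000000000
)

var masks [256]uint64 // masks[k] has the byte k repeated in all 8 positions
var active [16]uint64 // active[i] has FF in every byte position past the first i+1 (capped at 8)

func init() {
    for k := 0; k < 256; k++ {
        v := uint64(k)
        masks[k] = v | v<<8 | v<<16 | v<<24 | v<<32 | v<<40 | v<<48 | v<<56
    }
    for c := 1; c <= 16; c++ {
        valid := c
        if valid > 8 {
            valid = 8 // a single uint64 only holds 8 keys
        }
        // all-FF above the first 'valid' bytes, so those positions can
        // never look like a match
        active[c-1] = ^uint64(0) << (valid * 8)
    }
}
put only ever writes the low n.count key bytes, so ORing in active stops the unwritten zero bytes from looking like a match when k happens to be 0.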
func (n *nodeMasks) get(k byte) interface{} {
    if n.count == 0 {
        return nil
    }
    // mask is a uint64 with k at each byte position
    mask := masks[k]
    // act has bytes with value FF in positions we don't want to consider
    act := active[n.count-1]
    // XOR the mask and the keys; for any bytes that match the result will be 0,
    // for ones that don't match the result will be not 0.
    // OR the result with act so that any key positions we shouldn't consider
    // get set to FF.
    r := (mask ^ n.keys1) | act
    // now check each byte in the result for a zero
    if (r & b1) == 0 {
        return n.vals[0]
    }
    if (r & b2) == 0 {
        return n.vals[1]
    }
    if (r & b3) == 0 {
        return n.vals[2]
    }
    if (r & b4) == 0 {
        return n.vals[3]
    }
    if (r & b5) == 0 {
        return n.vals[4]
    }
    if (r & b6) == 0 {
        return n.vals[5]
    }
    if (r & b7) == 0 {
        return n.vals[6]
    }
    if (r & b8) == 0 {
        return n.vals[7]
    }
    if n.count < 9 {
        return nil
    }
    // same again for the upper 8 keys
    r = (mask ^ n.keys2) | active[n.count-9]
    if (r & b1) == 0 {
        return n.vals[8]
    }
    if (r & b2) == 0 {
        return n.vals[9]
    }
    if (r & b3) == 0 {
        return n.vals[10]
    }
    if (r & b4) == 0 {
        return n.vals[11]
    }
    if (r & b5) == 0 {
        return n.vals[12]
    }
    if (r & b6) == 0 {
        return n.vals[13]
    }
    if (r & b7) == 0 {
        return n.vals[14]
    }
    if (r & b8) == 0 {
        return n.vals[15]
    }
    return nil
}
Once we've finished with the bit operations, we're left with a uint64. Where the key matches the target value, the byte in the uint64 will be zero; otherwise it's non-zero. We could write a loop to check each byte; the above code is the unrolled equivalent of that loop, and it provides a decent gain over the looped version, 70.7ns vs 103ns.
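For comparison, the looped version of that final check (MasksWithFinalLoop in the benchmark output below) would look roughly like this; again a sketch, the repo's version may differ slightly.
func (n *nodeMasks) getWithFinalLoop(k byte) interface{} {
    if n.count == 0 {
        return nil
    }
    mask := masks[k]
    r := (mask ^ n.keys1) | active[n.count-1]
    for i := 0; i < 8; i++ {
        if (r>>(i*8))&0xFF == 0 { // byte i is zero, so key i matched
            return n.vals[i]
        }
    }
    if n.count < 9 {
        return nil
    }
    r = (mask ^ n.keys2) | active[n.count-9]
    for i := 0; i < 8; i++ {
        if (r>>(i*8))&0xFF == 0 {
            return n.vals[i+8]
        }
    }
    return nil
}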
Benchmark_IndexByte-8 11373612 104 ns/op
Benchmark_UnrolledLoop-8 12968982 92.5 ns/op
Benchmark_Masks-8 16908472 70.7 ns/op
Benchmark_MasksWithFinalLoop-8 11663716 103 ns/op
Benchmark_Loop-8 11187504 105 ns/op
Benchmark_LoopOneBoundsCheck-8 13619496 88.3 ns/op
Benchmark_LoopRev-8 10784320 112 ns/op
Benchmark_LoopRevOneBoundsCheck-8 14605256 82.5 ns/op
Benchmark_BinarySearch-8 4791508 254 ns/op
Benchmark_BinarySearchOneBoundsCheck-8 4614759 252 ns/op
Benchmark_BinarySearchInlined-8 10562584 114 ns/op
Benchmark_BinarySearchInlinedOneBoundsCheck-8 11154210 107 ns/op
But we're back to a byte at a time; can we do more bit twiddling to decode the index out of the uint64? Turns out we can. With some additional bit fiddling, we can get the result to have just the high bit of the byte set where there was a match, and zero otherwise. At that point we can count the number of trailing zeros to determine which high bit was set.
func (n *nodeMasks) getMoreBitTwiddling(k byte) interface{} {
    if n.count == 0 {
        return nil
    }
    // This follows the same approach as get, but uses additional
    // bit twiddling to determine if any of the bytes are zero.
    mask := masks[k]
    r := (mask ^ n.keys1) | active[n.count-1]
    // see https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord
    x := (r - 0x0101010101010101) & ^(r) & 0x8080808080808080
    idx := bits.TrailingZeros64(x) / 8
    if idx < 8 {
        return n.vals[idx]
    }
    if n.count < 9 {
        return nil
    }
    r = (mask ^ n.keys2) | active[n.count-9]
    x = (r - 0x0101010101010101) & ^(r) & 0x8080808080808080
    idx = bits.TrailingZeros64(x) / 8
    if idx < 8 {
        return n.vals[idx+8]
    }
    return nil
}
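To make the zero-in-word hack concrete, here's a small worked example with a made-up value of r where only byte index 2 is zero (i.e. the key matched at position 2). It needs the fmt and math/bits imports.
package main

import (
    "fmt"
    "math/bits"
)

func main() {
    // byte 2 is 0x00, every other byte is non-zero
    const r uint64 = 0xFFFFFFFFFF003412
    x := (r - 0x0101010101010101) & ^(r) & 0x8080808080808080
    // only the high bit of byte 2 survives: x == 0x0000000000800000
    fmt.Println(bits.TrailingZeros64(x) / 8) // prints 2
}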
As an aside, despite the fact that bits.TrailingZeros64 appears to be a pure Go implementation, the compiler will replace it with the relevant assembly instruction.
Benchmark_IndexByte-8 10000000 103 ns/op
Benchmark_UnrolledLoop-8 10000000 92.9 ns/op
Benchmark_Masks-8 10000000 71.0 ns/op
Benchmark_MasksWithFinalLoop-8 10000000 105 ns/op
Benchmark_MasksWithBitTwiddling-8 10000000 78.1 ns/op
Benchmark_Loop-8 10000000 105 ns/op
Benchmark_LoopOneBoundsCheck-8 10000000 88.2 ns/op
Benchmark_LoopRev-8 10000000 113 ns/op
Benchmark_LoopRevOneBoundsCheck-8 10000000 82.2 ns/op
Benchmark_BinarySearch-8 10000000 250 ns/op
Benchmark_BinarySearchOneBoundsCheck-8 10000000 254 ns/op
Benchmark_BinarySearchInlined-8 10000000 115 ns/op
Benchmark_BinarySearchInlinedOneBoundsCheck-8 10000000 107 ns/op
Unexpectedly, the bit twiddling approach is slightly slower than the version that used an unrolled set of checks to decode the return index. Perhaps because the zero-byte checks don't depend on each other they can all be speculatively executed, while each step of the bit twiddling approach depends on the result of the previous instruction. Concrete explanations for this are welcome. Notice though that either approach is faster than anything else so far.
Our final version is the one described in the ART paper, which uses SIMD instructions to compare all 16 keys at once. There are no SIMD intrinsics for Go, so this'll have to be in straight-up assembly. I used avo to help deal with some of the drudgery involved (see asm/asm.go in the repo for the full code). XMM registers are 16 bytes wide, and VPBROADCASTB and PCMPEQB are the primary SIMD instructions involved.
TEXT("Lookup", NOSPLIT, "func(k byte, x *[16]byte) int32")
Pragma("noescape")
Doc("Lookup returns the index into the array 'x' of the value 'k', or -1 if its not there." +
" If k appears at multiple locations, you'll get one of them as the return value, it may not be the first one.")
x := Load(Param("x"), GP64())
k := Load(Param("k"), GP32())
xKey := XMM()
MOVD(k, xKey)
VPBROADCASTB(xKey, xKey) // xmm register now contains the value k in all 16 bytes
xArr := XMM()
MOVUPD(Mem{Base: x}, xArr) // xmm register now contains the 16 bytes of the array x
// Compare bytes for equality between the 2 xmm registers.
// xArr is updated with the result. Where they're equal the byte is set to FF
// otherwise its set to 0
PCMPEQB(xKey, xArr)
rv := GP64()
rOffset := GP64()
XORQ(rOffset, rOffset) // resOffset = 0
MOVQ(xArr, rv) // get the lower 8 bytes from the xmm register into rv
TESTQ(rv, rv) // is rv 0? if not, at least one byte was equal
JNZ(LabelRef("returnCount")) // jump to converting that back to a index
MOVHLPS(xArr, xArr) // move top 64 bits to lower 64 bits in xmm register
MOVQ(xArr, rv) // move lower 8 bytes into rv
TESTQ(rv, rv)
JZ(LabelRef("notFound")) // is rv 0? if so there's no matches, so return -1
// the match was found in the top 8 bytes, so we need
// to offset the final calculated index by 8.
MOVQ(U64(8), rOffset)
Label("returnCount") // return tailing zeros / 8 + offset
idx := GP64()
TZCNTQ(rv, idx) // set idx to the number of trailing zeros in rv
SHRQ(Imm(3), idx) // divide idx by 8 to get from bit position to byte posn.
ADDQ(rOffset, idx) // add the result offset in.
Store(idx.As32(), ReturnIndex(0)) // return the final index as the result.
RET()
Label("notFound")
rMiss := GP32()
MOVL(U32(0xFFFFFFFF), rMiss)
Store(rMiss, ReturnIndex(0)) // return -1
RET()
The VPBROADCASTB and PCMPEQB instructions do the heavy lifting, but otherwise it bears a strong resemblance to the bit mask approach we wrote earlier.
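avo generates a Go declaration for Lookup with the signature from the TEXT directive, so on the Go side the node's get ends up as a thin wrapper, roughly like this (my sketch of how it would be wired up, not necessarily the repo's exact code):
func (n *nodeLoop) getLookupAsm(k byte) interface{} {
    idx := Lookup(k, &n.key)
    // Lookup searches all 16 bytes, so reject hits at or beyond the live count
    if idx < 0 || int(idx) >= n.count {
        return nil
    }
    return n.val[idx]
}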
Benchmark_IndexByte-8 11691013 103 ns/op
Benchmark_UnrolledLoop-8 12977037 91.9 ns/op
Benchmark_GetLookupAsm-8 23960334 48.9 ns/op
Benchmark_Masks-8 16845604 70.4 ns/op
Benchmark_MasksWithFinalLoop-8 11721877 102 ns/op
Benchmark_MasksWithBitTwiddling-8 15233120 77.3 ns/op
Benchmark_Loop-8 11848963 101 ns/op
Benchmark_LoopOneBoundsCheck-8 13606868 87.6 ns/op
Benchmark_LoopRev-8 10635825 113 ns/op
Benchmark_LoopRevOneBoundsCheck-8 13355061 88.9 ns/op
Benchmark_BinarySearch-8 4826577 253 ns/op
Benchmark_BinarySearchOneBoundsCheck-8 4798706 250 ns/op
Benchmark_BinarySearchInlined-8 10516956 114 ns/op
Benchmark_BinarySearchInlinedOneBoundsCheck-8 11263238 107 ns/op
A clear winner, that's half the time our original loop took, and 30% faster than the next best version.
But which one would I really use? Despite the giant trip down the performance rabbit hole, I'd pick the one that performs best out of the ones that are easily understandable. That'd be the reverse loop with the tweak to remove all but one of the bounds checks, LoopRevOneBoundsCheck. If it later turned out that this particular part of the system was a bottleneck that needed improving, then I'd probably go with the assembly version. However, I'd need to do a lot more studying first to validate its correctness: not its logical correctness, but its correctness as an assembly function inside the Go execution environment.
As a final exercise, I'll leave you to work out at what size binary search actually starts to win. Let me know what results you get.
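If you want a starting point for that exercise, one way to scaffold it (my sketch, using plain slices rather than the node types above, and needing the fmt and sort imports) is a table-driven benchmark over a few sizes:
var sink int // stops the compiler optimizing the searches away

func Benchmark_BySize(b *testing.B) {
    for _, size := range []int{16, 32, 64, 128, 256} {
        keys := make([]byte, size)
        for i := range keys {
            keys[i] = byte(i) // already sorted, as binary search requires
        }
        b.Run(fmt.Sprintf("Loop-%d", size), func(b *testing.B) {
            for i := 0; i < b.N; i++ {
                for _, k := range keys {
                    for j := range keys {
                        if keys[j] == k {
                            sink = j
                            break
                        }
                    }
                }
            }
        })
        b.Run(fmt.Sprintf("BinarySearch-%d", size), func(b *testing.B) {
            for i := 0; i < b.N; i++ {
                for _, k := range keys {
                    sink = sort.Search(len(keys), func(j int) bool { return keys[j] >= k })
                }
            }
        })
    }
}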
Remember how I said it might depend on the OS and CPU? Here are some runs from other machines, all using Go 1.15.7.
Windows 10 with an Intel 9900KS. Faster than the iMac runs, which isn't surprising given it's a processor three generations newer. But the relative performance between the approaches is pretty similar.
goos: windows
goarch: amd64
pkg: github.com/superfell/loopVsBinarySearch
Benchmark_IndexByte-16 14651924 81.8 ns/op
Benchmark_UnrolledLoop-16 16365361 72.8 ns/op
Benchmark_GetLookupAsm-16 31375419 37.9 ns/op
Benchmark_Masks-16 21862701 55.0 ns/op
Benchmark_MasksWithFinalLoop-16 14606589 81.3 ns/op
Benchmark_MasksWithBitTwiddling-16 18442527 63.3 ns/op
Benchmark_Loop-16 15587856 77.0 ns/op
Benchmark_LoopOneBoundsCheck-16 17430231 68.1 ns/op
Benchmark_LoopRev-16 14259435 85.0 ns/op
Benchmark_LoopRevOneBoundsCheck-16 19040486 64.0 ns/op
Benchmark_BinarySearch-16 6129547 195 ns/op
Benchmark_BinarySearchOneBoundsCheck-16 6129829 196 ns/op
Benchmark_BinarySearchInlined-16 13593127 86.6 ns/op
Benchmark_BinarySearchInlinedOneBoundsCheck-16 14756770 80.6 ns/op
PASS
And finally ARM/Linux running on a Raspberry Pi 4 (this doesn't include a version of the SIMD solution). The relative performance between the approaches is different now: the unrolled loop does better than any of the others, and the inlined binary search is not far behind it.
goos: linux
goarch: arm
pkg: github.com/superfell/loopVsBinarySearch
Benchmark_IndexByte-4 1443998 830 ns/op
Benchmark_UnrolledLoop-4 2522744 477 ns/op
Benchmark_Masks-4 2023891 594 ns/op
Benchmark_MasksWithFinalLoop-4 1422764 857 ns/op
Benchmark_MasksWithBitTwiddling-4 1911872 629 ns/op
Benchmark_Loop-4 1708760 700 ns/op
Benchmark_LoopOneBoundsCheck-4 1619178 679 ns/op
Benchmark_LoopRev-4 1728384 737 ns/op
Benchmark_LoopRevOneBoundsCheck-4 1907144 610 ns/op
Benchmark_BinarySearch-4 1000000 1309 ns/op
Benchmark_BinarySearchOneBoundsCheck-4 990722 1219 ns/op
Benchmark_BinarySearchInlined-4 2160622 556 ns/op
Benchmark_BinarySearchInlinedOneBoundsCheck-4 2185908 549 ns/op
Tuesday, January 26, 2021
The answer to almost every software engineering question is "It Depends". In this post we explore options for finding an item in a small array. Grab a cup of tea and down the rabbit hole we go.
I've been working on art, a Go implementation of an adaptive radix tree. The tree can use a few differently sized nodes, and moves between them to optimize memory usage. One of these node sizes is node16, a node that can store up to 16 child items. Each item is identified by a single byte key. The common read path is: given a key, return the matching child item, or report that it's not there.
My first implementation stored the keys & values in no particular order and scanned them for lookups. At this point in the project I was more interested in something that worked than something that was optimal. Here's a simplified version of the code. Note that put makes many simplifying assumptions that aren't relevant to the discussion at hand. You can get all the code from this article from the loopVsBinarySearch repo.
type nodeLoop struct {
    key   [16]byte
    val   [16]interface{}
    count int
}

func (n *nodeLoop) put(k byte, v interface{}) {
    idx := n.count
    n.key[idx] = k
    n.val[idx] = v
    n.count++
}

func (n *nodeLoop) get(k byte) interface{} {
    for i := 0; i < n.count; i++ {
        if n.key[i] == k {
            return n.val[i]
        }
    }
    return nil
}
Should I use a binary search instead? It depends. Binary search would do better for larger arrays, but is it still better for our little 16-byte array? The loop is O(n) while the binary search is O(log n). In my experience, though, Big O estimates of algorithmic time can be very misleading for small values of n. Let's find out; Go has a binary search function in the sort package, which makes this straightforward.
type nodeSorted struct {
    key   [16]byte
    val   [16]interface{}
    count int
}

func (n *nodeSorted) put(k byte, v interface{}) {
    idx := sort.Search(n.count, func(i int) bool {
        return n.key[i] >= k
    })
    copy(n.key[idx+1:], n.key[idx:int(n.count)])
    copy(n.val[idx+1:], n.val[idx:int(n.count)])
    n.key[idx] = k
    n.val[idx] = v
    n.count++
}

func (n *nodeSorted) get(k byte) interface{} {
    idx := sort.Search(n.count, func(i int) bool {
        return n.key[i] >= k
    })
    if idx < int(n.count) && n.key[idx] == k {
        return n.val[idx]
    }
    return nil
}
And let's write a benchmark to compare the two.
var rnd = rand.New(rand.NewSource(42))
var keys = make([]byte, 16)

func init() {
    for i := 0; i < len(keys); i++ {
        keys[i] = byte(i)
    }
    rnd.Shuffle(len(keys), func(i, j int) {
        keys[i], keys[j] = keys[j], keys[i]
    })
}

func Benchmark_Loop(b *testing.B) {
    n := new(nodeLoop)
    for _, k := range keys {
        n.put(k, int(k))
    }
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        for _, k := range keys {
            v := n.get(k)
            if v != int(k) {
                b.Errorf("Got unexpected value of %d for key %d", v, k)
            }
        }
    }
}

func Benchmark_BinarySearch(b *testing.B) {
    n := new(nodeSorted)
    for _, k := range keys {
        n.put(k, int(k))
    }
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        for _, k := range keys {
            v := n.get(k)
            if v != int(k) {
                b.Errorf("Got unexpected value of %d for key %d", v, k)
            }
        }
    }
}
Let's run these and see what we get. All these tests were done using Go 1.15.7 on OSX 10.15.7, running on an iMac with an Intel i7-6700K CPU @ 4.00GHz. You may get different results with different combinations of OS, Go version and processor; it depends.
$ go test -bench .
goos: darwin
goarch: amd64
pkg: github.com/superfell/loopVsBinarySearch
Benchmark_Loop-8 11602429 104 ns/op
Benchmark_BinarySearch-8 4848440 247 ns/op
PASS
The loop version doesn't look so shabby anymore; it runs in ~40% of the time of the binary search. (Side note for VSCode users: the Go plugin defaults to running code coverage when running benchmarks, which can change the relative output between benchmarks quite a bit. Either turn this off, or run them from the command line.) What's going on? Fortunately Go has some great tools integrated into the tool chain, so let's grab a CPU profile and look at that. We'll use the `-benchtime=10000000x` option to have each benchmark run the same number of times rather than for the same duration. This will make comparing times between the different implementations easier.
$ go test -bench . -benchtime=10000000x -cpuprofile=cpu.prof
goos: darwin
goarch: amd64
pkg: github.com/superfell/loopVsBinarySearch
Benchmark_Loop-8 10000000 104 ns/op
Benchmark_BinarySearch-8 10000000 253 ns/op
PASS
We can use pprof to examine the generated CPU profile and produce timing-annotated versions of the source code. The pprof command `list get` shows the functions called get.
$ go tool pprof cpu.prof
Type: cpu
Time: Jan 26, 2021 at 11:05am (PST)
Duration: 3.76s, Total samples = 3.13s (83.32%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof)
Type `list get` at the (pprof) prompt.
(pprof) list get
Total: 3.13s
ROUTINE ============= github.com/superfell/loopVsBinarySearch.(*nodeLoop).get in /Users/simon/Github/loopVsBinarySearch/n.go
480ms 560ms (flat, cum) 17.89% of Total
. . 15: n.key[idx] = k
. . 16: n.val[idx] = v
. . 17: n.count++
. . 18:}
. . 19:
20ms 40ms 20:func (n *nodeLoop) get(k byte) interface{} {
110ms 170ms 21: for i := 0; i < n.count; i++ {
260ms 260ms 22: if n.key[i] == k {
90ms 90ms 23: return n.val[i]
. . 24: }
. . 25: }
. . 26: return nil
. . 27:}
. . 28:
ROUTINE ============= github.com/superfell/loopVsBinarySearch.(*nodeSorted).get in /Users/simon/Github/loopVsBinarySearch/n.go
290ms 1.93s (flat, cum) 61.66% of Total
. . 157: n.key[idx] = k
. . 158: n.val[idx] = v
. . 159: n.count++
. . 160:}
. . 161:
50ms 50ms 162:func (n *nodeSorted) get(k byte) interface{} {
110ms 1.75s 163: idx := sort.Search(n.count, func(i int) bool {
. . 164: return n.key[i] >= k
. . 165: })
70ms 70ms 166: if idx < int(n.count) && n.key[idx] == k {
60ms 60ms 167: return n.val[idx]
. . 168: }
. . 169: return nil
. . 170:}
. . 171:
ROUTINE ============= github.com/superfell/loopVsBinarySearch.(*nodeSorted).get.func1 in /Users/simon/Github/loopVsBinarySearch/n.go
680ms 680ms (flat, cum) 21.73% of Total
. . 158: n.val[idx] = v
. . 159: n.count++
. . 160:}
. . 161:
. . 162:func (n *nodeSorted) get(k byte) interface{} {
260ms 260ms 163: idx := sort.Search(n.count, func(i int) bool {
420ms 420ms 164: return n.key[i] >= k
. . 165: })
. . 166: if idx < int(n.count) && n.key[idx] == k {
. . 167: return n.val[idx]
. . 168: }
. . 169: return nil
Hmmmm, that call to Search is taking a lot of time; let's see what that's doing: (pprof) list Search
ROUTINE ======================== sort.Search in /usr/local/go/src/sort/search.go
950ms 1.64s (flat, cum) 52.40% of Total
. . 54:// return s != "" && s[0] == 'y'
. . 55:// })
. . 56:// fmt.Printf("Your number is %d.\n", answer)
. . 57:// }
. . 58://
50ms 50ms 59:func Search(n int, f func(int) bool) int {
. . 60: // Define f(-1) == false and f(n) == true.
. . 61: // Invariant: f(i-1) == false, f(j) == true.
. . 62: i, j := 0, n
290ms 290ms 63: for i < j {
110ms 110ms 64: h := int(uint(i+j) >> 1) // avoid overflow when computing h
. . 65: // i ≤ h < j
430ms 1.12s 66: if !f(h) {
30ms 30ms 67: i = h + 1 // preserves f(i-1) == false
. . 68: } else {
. . 69: j = h // preserves f(j) == true
. . 70: }
. . 71: }
. . 72: // i == j, f(i-1) == false, and f(j) (= f(i)) == true => answer is i.
40ms 40ms 73: return i
. . 74:}
. . 75:
The f(h) call (line 66) is consuming most of the time here. This is the callback function we passed in. At a guess, the compiler can't inline the callback function, and because there's so little else going on, the function call overhead comes to dominate the time taken. As an experiment we can be our own optimizer and manually inline the search function. This is basically a cut-and-paste job from sort.Search, replacing !f(h) with n.key[h] < k.
func (n *nodeSorted) getInlinedBinSearch(k byte) interface{} {
    // impl of sort.Search manually inlined here
    i, j := 0, int(n.count)
    for i < j {
        h := int(uint(i+j) >> 1) // avoid overflow when computing h
        // i ≤ h < j
        if n.key[h] < k {
            i = h + 1 // preserves f(i-1) == false
        } else {
            j = h // preserves f(j) == true
        }
    }
    // i == j, f(i-1) == false, and f(j) (= f(i)) == true => answer is i.
    if i < int(n.count) && n.key[i] == k {
        return n.val[i]
    }
    return nil
}
$ go test -bench .
Benchmark_Loop-8 11763114 101 ns/op
Benchmark_BinarySearch-8 4796678 246 ns/op
Benchmark_BinarySearchInlined-8 10631300 114 ns/op
PASS
That looks better but, surprisingly, still slower than the initial loop version. Let's grab another CPU profile and look at the getInlinedBinSearch function.
$ go test -bench . -benchtime=10000000x -cpuprofile=cpu.prof
...
$ go tool pprof cpu.prof
...
(pprof) list getInlinedBinSearch
ROUTINE === github.com/superfell/loopVsBinarySearch.(*nodeSorted).getInlinedBinSearch in /Users/simon/Github/loopVsBinarySearch/n.go
660ms 680ms (flat, cum) 16.67% of Total
. . 178: return n.val[idx]
. . 179: }
. . 180: return nil
. . 181:}
. . 182:
40ms 40ms 183:func (n *nodeSorted) getInlinedBinSearch(k byte) interface{} {
. . 184: // impl of sort.Search manually inlined here
70ms 70ms 185: i, j := 0, int(n.count)
180ms 200ms 186: for i < j {
40ms 40ms 187: h := int(uint(i+j) >> 1) // avoid overflow when computing h
. . 188: // i ≤ h < j
70ms 70ms 189: if n.key[h] < k {
100ms 100ms 190: i = h + 1 // preserves f(i-1) == false
. . 191: } else {
. . 192: j = h // preserves f(j) == true
. . 193: }
. . 194: }
. . 195: // i == j, f(i-1) == false, and f(j) (= f(i)) == true => answer is i.
20ms 20ms 196: if i < int(n.count) && n.key[i] == k {
140ms 140ms 197: return n.val[i]
. . 198: }
. . 199: return nil
. . 200:}
. . 201:
Comparing back to the profile of the original loop, this version spends less time comparing the key to the array entries. That makes sense; that's the whole point of the binary search. The rest of the loop overhead negates that, though. Possibly at this point we're at the mercy of the CPU's branch predictor. If you know more about the relative overhead of the two loops, let me know.
Is there anything else we can do to make this faster? In both cases the n.key[index] access comes with a bounds check. I vaguely remember reading somewhere that the compiler can optimize away some of the bounds checks if it sees a check against a higher bound first. Makes sense: if n[10] is not out of bounds of the array, then you already know that a subsequent n[9] access is also not out of bounds. Let's try reversing the order of our loop and see if that helps.
func (n *nodeLoop) getReverse(k byte) interface{} {
    for i := n.count - 1; i >= 0; i-- {
        if n.key[i] == k {
            return n.val[i]
        }
    }
    return nil
}
$ go test -bench . -benchtime=10000000x -cpuprofile=cpu.prof
Benchmark_Loop-8 10000000 101 ns/op
Benchmark_LoopRev-8 10000000 108 ns/op
Benchmark_BinarySearch-8 10000000 248 ns/op
Benchmark_BinarySearchInlined-8 10000000 114 ns/op
Slightly worse, but were those bounds checks skipped? I'd guess not, given the runtime, but how can we tell? As far as I know, at this point we have to look at the code generated by the compiler. We can get the compiler to give us an assembly dump of the generated code.
If you get this far, it's time to make another cup of tea.
$ go tool compile -S n.go > n5.asm
Look through the file and find the section for (*nodeLoop).getReverse. Now, I haven't done assembly since writing 68000 assembly for an Amiga A500, so after copious web searching, I've annotated the assembly with what (I think) is going on.
"".(*nodeLoop).getReverse STEXT nosplit size=125 args=0x20 locals=0x18
0x0000 00000 (n.go:39) TEXT "".(*nodeLoop).getReverse(SB), NOSPLIT|ABIInternal, $24-32
0x0000 00000 (n.go:39) SUBQ $24, SP
0x0004 00004 (n.go:39) MOVQ BP, 16(SP)
0x0009 00009 (n.go:39) LEAQ 16(SP), BP
0x000e 00014 (n.go:39) FUNCDATA $0, gclocals·4032f753396f2012ad1784f398b170f4(SB)
0x000e 00014 (n.go:39) FUNCDATA $1, gclocals·69c1753bd5f81501d95132d08af04464(SB)
0x000e 00014 (n.go:40) MOVQ "".n+32(SP), DX ;; register DX points to the start of the nodeLoop struct
;; in memory (the n variable in the code)
0x0013 00019 (n.go:40) MOVQ 272(DX), BX ;; copy the value from DX + 272 bytes into BX. 16 for n.keys,
;; 16 x 16 for n.vals get you 272, so BX has the value of the
;; count field
0x001a 00026 (n.go:40) DECQ BX ;; bx = bx-1. register bx is the loop count variable i
0x001d 00029 (n.go:40) MOVBLZX "".k+40(SP), SI ;; set register SI to the value of the parameter k
0x0022 00034 (n.go:40) JMP 39
0x0024 00036 (n.go:40) DECQ BX ;; i--
0x0027 00039 (n.go:40) TESTQ BX, BX ;; bitwise AND of BX and BX, a way to see if i == 0
0x002a 00042 (n.go:40) JLT 93 ;; if i < 0 goto 93 which will do the stack dance for the
;; return nil statement.
0x002c 00044 (n.go:41) CMPQ BX, $16 ;; cmp i to the literal value 16
0x0030 00048 (n.go:41) JCC 111 ;; if i >= 16 goto 111 which will generate an array index out
;; of bounds panic
0x0032 00050 (n.go:41) MOVBLZX (DX)(BX*1), DI ;; set register DI to n.keys[i]
0x0036 00054 (n.go:41) CMPB DIB, SIB
0x0039 00057 (n.go:41) JNE 36 ;; if n.keys[i] != k goto back to 36, which decrements the loop
;; counter and repeats the above tests
0x003b 00059 (n.go:42) SHLQ $4, BX ;; bit shift BX left 4, i = i * 16
;; remember that interface{} takes up 16 bytes
0x003f 00063 (n.go:42) MOVQ 16(DX)(BX*1), AX ;;
0x0044 00068 (n.go:42) MOVQ 24(DX)(BX*1), CX ;; ax & cx contain n.vals[i]
0x0049 00073 (n.go:42) MOVQ AX, "".~r1+48(SP) ;; shuffle them onto the stack for the return value
0x004e 00078 (n.go:42) MOVQ CX, "".~r1+56(SP)
0x0053 00083 (n.go:42) MOVQ 16(SP), BP
0x0058 00088 (n.go:42) ADDQ $24, SP
0x005c 00092 (n.go:42) RET ;; we're done
0x005d 00093 (n.go:45) XORPS X0, X0 ;; here to 110 is for the return nil
0x0060 00096 (n.go:45) MOVUPS X0, "".~r1+48(SP)
0x0065 00101 (n.go:45) MOVQ 16(SP), BP
0x006a 00106 (n.go:45) ADDQ $24, SP
0x006e 00110 (n.go:45) RET ;; we're done
0x006f 00111 (n.go:41) MOVQ BX, AX
0x0072 00114 (n.go:41) MOVL $16, CX
0x0077 00119 (n.go:41) PCDATA $1, $1
0x0077 00119 (n.go:41) CALL runtime.panicIndex(SB)
0x007c 00124 (n.go:41) XCHGL AX, AX
There's some pointer math going on to access the fields and array indexes in the nodeLoop struct. The struct is laid out in memory following the order of the fields in its definition.
Byte Offset | Value |
---|---|
0-15 | n.key[0] through n.key[15] |
16-31 | n.val[0] |
32-47 | n.val[1] |
... | ... |
256-271 | n.val[15] |
272-279 | n.count |
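If you'd rather not count the offsets by hand, you can ask the compiler; this little check is my addition (it needs the fmt and unsafe imports), and the numbers are for amd64, where an interface{} is 16 bytes and an int is 8.
func printLayout() {
    var n nodeLoop
    fmt.Println(unsafe.Offsetof(n.key))   // 0
    fmt.Println(unsafe.Offsetof(n.val))   // 16
    fmt.Println(unsafe.Offsetof(n.count)) // 272
    fmt.Println(unsafe.Sizeof(n))         // 280
}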
Code offsets 36 through 57 are the loop, and you can see that it checks BX (the loop variable i) against 16 every time around. So it has not optimized away any of the bounds checks. Let's see if we can do better at convincing the compiler. We'll access the largest array index the loop will use before starting the loop.
func (n *nodeLoop) getReverse2(k byte) interface{} {
    _ = n.key[n.count-1]
    for i := n.count - 1; i >= 0; i-- {
        if n.key[i] == k {
            return n.val[i]
        }
    }
    return nil
}
$ go test -bench . -benchtime=10000000x -cpuprofile=cpu.prof
Benchmark_Loop-8 10000000 102 ns/op
Benchmark_LoopRev-8 10000000 108 ns/op
Benchmark_LoopRevOneBoundsCheck-8 10000000 82.4 ns/op
Benchmark_BinarySearch-8 10000000 250 ns/op
Benchmark_BinarySearchInlined-8 10000000 116 ns/op
Alright, looks like we might have done it this time; that's a ~20% decrease in time. Let's double-check the assembly to confirm it's doing what we think.
"".(*nodeLoop).getReverse2 STEXT nosplit size=124 args=0x20 locals=0x18
0x0000 00000 (n.go:48) TEXT "".(*nodeLoop).getReverse2(SB), NOSPLIT|ABIInternal, $24-32
0x0000 00000 (n.go:48) SUBQ $24, SP
0x0004 00004 (n.go:48) MOVQ BP, 16(SP)
0x0009 00009 (n.go:48) LEAQ 16(SP), BP
0x000e 00014 (n.go:48) FUNCDATA $0, gclocals·4032f753396f2012ad1784f398b170f4(SB)
0x000e 00014 (n.go:48) FUNCDATA $1, gclocals·69c1753bd5f81501d95132d08af04464(SB)
0x000e 00014 (n.go:49) MOVQ "".n+32(SP), DX ;; register DX points to the start of the nodeLoop struct
0x0013 00019 (n.go:49) MOVQ 272(DX), BX ;; BX = n.count
0x001a 00026 (n.go:49) LEAQ -1(BX), AX ;; AX = BX-1 (I think, not sure)
0x001e 00030 (n.go:49) NOP
0x0020 00032 (n.go:49) CMPQ AX, $16
0x0024 00036 (n.go:49) JCC 113 ;; if AX >= 16 goto 113, which will generate a panic
0x0026 00038 (n.go:50) MOVBLZX "".k+40(SP), CX ;; CX = param k
0x002b 00043 (n.go:50) JMP 48
0x002d 00045 (n.go:50) DECQ AX ;; AX-- (AX is our loop variable i)
0x0030 00048 (n.go:50) TESTQ AX, AX
0x0033 00051 (n.go:50) JLT 95 ;; if i < 0 goto 95, which is return nil
0x0035 00053 (n.go:51) MOVBLZX (DX)(AX*1), BX ;; BX = n.keys[i]
0x0039 00057 (n.go:51) CMPB BL, CL
0x003b 00059 (n.go:51) JNE 45 ;; if n.keys[i] != k goto 45
0x003d 00061 (n.go:52) SHLQ $4, AX
0x0041 00065 (n.go:52) MOVQ 16(DX)(AX*1), CX ;; here through 94 are return n.vals[i]
0x0046 00070 (n.go:52) MOVQ 24(DX)(AX*1), AX
0x004b 00075 (n.go:52) MOVQ CX, "".~r1+48(SP)
0x0050 00080 (n.go:52) MOVQ AX, "".~r1+56(SP)
0x0055 00085 (n.go:52) MOVQ 16(SP), BP
0x005a 00090 (n.go:52) ADDQ $24, SP
0x005e 00094 (n.go:52) RET
0x005f 00095 (n.go:55) XORPS X0, X0
0x0062 00098 (n.go:55) MOVUPS X0, "".~r1+48(SP)
0x0067 00103 (n.go:55) MOVQ 16(SP), BP
0x006c 00108 (n.go:55) ADDQ $24, SP
0x0070 00112 (n.go:55) RET
0x0071 00113 (n.go:49) MOVL $16, CX ; array out of bounds panic
0x0076 00118 (n.go:49) PCDATA $1, $1
0x0076 00118 (n.go:49) CALL runtime.panicIndex(SB)
0x007b 00123 (n.go:49) XCHGL AX, AX
Now the loop is between offsets 45 and 59, and you can see that the bounds check (at 32/36) only occurs once.
We can add the same additional array check to the other methods and see if it helps those.
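Applied to the forward loop, the tweak looks like this (the repo's version may be named differently):
func (n *nodeLoop) getOneBoundsCheck(k byte) interface{} {
    _ = n.key[n.count-1] // hint to the compiler: index n.count-1 is in bounds
    for i := 0; i < n.count; i++ {
        if n.key[i] == k {
            return n.val[i]
        }
    }
    return nil
}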
Benchmark_Loop-8 10000000 102 ns/op
Benchmark_LoopOneBoundsCheck-8 10000000 88.3 ns/op
Benchmark_LoopRev-8 10000000 113 ns/op
Benchmark_LoopRevOneBoundsCheck-8 10000000 82.1 ns/op
Benchmark_BinarySearch-8 10000000 250 ns/op
Benchmark_BinarySearchOneBoundsCheck-8 10000000 253 ns/op
Benchmark_BinarySearchInlined-8 10000000 114 ns/op
Benchmark_BinarySearchInlinedOneBoundsCheck-8 10000000 106 ns/op
That worked for the forward loop as well, but it didn't work for the function-based binary search. From the numbers it's not clear what's going on with the inlined binary search versions. Looking at the assembly, they both seem to have the same main loop, and both end up doing bounds checks each time. Why the last one is quicker I'm not sure.
For the loops, it's interesting that with the bounds checks limited to one the reverse loop is faster, but with a bounds check each time the forward loop is faster. Comparing assembly again, the reverse loop with one bounds check uses TESTQ to compare against 0, while the forward one has to use CMPQ to test against n.count. Perhaps TESTQ is faster than CMPQ? Perhaps the branch predictor can predict the terminate-on-zero case for the reverse loop better than it can predict the terminate-on-n.count case for the forward loop. If you have a more concrete explanation for the difference between these two, I'd love to hear it.
Come back for episode 2 where we look at some more esoteric approaches to this array search problem.