It all started out with a simple single stream reading test. Just a simple request for the entirety of an 8GB file. We do this stuff all the time. Except this time instead of 700 MB/s I was getting 130 MB/s. What?
Usually we test with jumbo frames (9000 MTU) but for this exercise we were using standard frames (1500 MTU). Still, there’s no way that was the difference. After 2 days I discover a method to consistently reproduce the problem: while the streaming test is running, toggle the LRO flag on the server’s network interface. This is just as crazy as making your car go faster by removing your soda from the cupholder. There’s no way that it has anything to do with it, but for some reason it does. Consistently. At last I have a reproducible, if ludicrous, defect.
Fast forward through 5 days of eliminating nodes, clients, switches, and NFS overcommits. Add in packet traces, kernel debugging output, and assorted analysis. Eventually Case catches the first real clue: the congestion window behaves distinctly differently in the ‘fast’ and ‘slow’ states. In the ‘fast’ state, the congestion window stays fairly constant. In the ‘slow’ state, the window oscillates wildly: it starts at the MTU, grows really large, and then starts over.
The LRO trick worked by causing enough retransmits that the stack dropped into slow start mode. One mystery solved. The reason we haven’t clearly identified this before is that once a node-client pair gets into the fast state, the slow start threshold is retained in the TCP hostcache between connections. Another mystery solved.
Fast forward through a few more days of slogging through TCP code down the path of blaming the slow start threshold (or rather the lack of slow start in the slow state). By this time I’m way more familiar with the TCP code, and our kernel debugging framework, than I want to be. I notice that every time the congestion window drops back to the MTU it’s caused by an ENOBUFS error. It’s very unlikely we’re running out of buffer space though. Checking the called function reveals that the error shows up not only when we’re out of buffers, but also when it can’t return one immediately. We surmise the problem is some contention that prevents us from getting the requested buffer immediately. So I change the code to reduce the congestion window by a single segment size (aka MTU) instead of dropping it all the way down to the segment size. The assumption is that the next time we request a buffer of this size, we’re likely to get one.
And performance shoots up to 900 MB/s — even higher than the previous fast state.
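To get a feel for why such a small change matters so much, here is a toy model of the two reactions to a transient ENOBUFS. It is only a sketch and not the kernel code: the 1460-byte segment size, the 64-segment ceiling, the one-segment-per-round growth, and the fixed "one failure every 50 rounds" are all made-up assumptions for illustration, and the names `simulate` and `mean` are hypothetical.

```python
MSS = 1460  # assumed segment size in bytes for a 1500-byte MTU (illustrative)


def simulate(policy, rounds=1000, fail_every=50, max_cwnd=64 * MSS):
    """Return the congestion window (in bytes) after each round of a toy model.

    policy: "reset" drops the window to one segment on a buffer failure
            (the behavior described as the slow state);
            "step" shrinks it by one segment (the workaround).
    The failure pattern and the growth rule are simplifying assumptions.
    """
    if policy not in ("reset", "step"):
        raise ValueError("policy must be 'reset' or 'step'")
    cwnd = max_cwnd  # start from a fully opened window
    history = []
    for i in range(1, rounds + 1):
        if i % fail_every == 0:
            # Hypothetical transient ENOBUFS: a buffer was not available right now.
            if policy == "reset":
                cwnd = MSS
            else:
                cwnd = max(MSS, cwnd - MSS)
        else:
            # No error this round: grow by one segment, up to the ceiling.
            cwnd = min(max_cwnd, cwnd + MSS)
        history.append(cwnd)
    return history


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


for policy in ("reset", "step"):
    h = simulate(policy)
    print(policy, "mean window (segments):", round(mean(h) / MSS, 1),
          "min window (segments):", min(h) // MSS)
```

In this model the "reset" policy spends most of its time climbing back up from a single segment, while the "step" policy barely notices the failure. The real numbers depend on the actual stack and traffic, so treat the output as a picture of the shape of the problem rather than a prediction of throughput.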
The reason we’re unable to return the requested buffer immediately is unclear, and frankly above my pay grade. I’ll happily let the kernel devs work on that (it involves slabs and UMA and things geekier than me).
The core of the problem remains “why aren’t we able to return the requested buffer immediately” but until the devs conquer that one we have a valid, shippable, workaround. And a lowly tester found, identified, and fixed it!
4 thoughts on “Mystery of the terrible throughput (or how I solved a TCP problem)”
Nice! I often find extended time with the debugger somewhat frustrating, but there’s no greater reward than tracking down an elusive, important bug! Especially if it’s one that’s flown under the radar for some time!
“After 2 days I discover a method to consistently reproduce the problem: while the streaming test is running, toggle the LRO flag on the server’s network interface. This is just as crazy as making your car go faster by removing your soda from the cupholder. There’s no way that it has anything to do with it, but for some reason it does.”
To be fair, it’s more like making your car go faster by pinching and releasing your gas line. The cup holder analogy would apply if, say, you restarted the WebUI and things went faster. Which I can’t say never happened.
That’s a better analogy, but maybe overstating it a bit given that the LRO was being toggled once a second. Maybe having a single spark-plug misfire once a second is more accurate.
Luckily everything can be reduced to a car analogy!