Some time back, I wrote a post about hunting down dropped packets on our network nodes. Here's a small follow-up with some performance hints.
My happy glow after fixing the previous issue didn't last for long. Our network nodes had hit the next problem.
Our network nodes are virtualized for management simplicity. The hypervisor nodes have 10 Gbit interfaces, and we were regularly passing 3 - 5 Gbps through our network VMs, with no issue. We can even do parallel iperf up to ~8.5 Gbps through them, with no obvious problems.
The problem is that now and then we saw large peaks in packet counts, and large drains on the CPUs of our network nodes. This escalated to the point that we started dropping packets. You might be thinking, "A compromised VM DoSing something?". It wouldn't be the first time that happened. And when it happened, it did bring our network nodes to their knees.
Not this time though. Luckily we had improved our monitoring, so tracking down the source of the traffic was quite easy. It turned out to be a large Aspera data transfer over UDP, which generated ~100k packets per second. Luckily it was a friendly internal customer who did this, so we kindly asked them to rate limit while we debugged.
How did math work again?
This next part took me way too long to figure out. I'll write it out in redacted form in the hopes that you can retain some respect for my mental faculties.
We had some baseline network stats from our monitoring. When we tested with iperf, a large transfer (8 Gbps) could generate up to 25k packets per second. So why are we now seeing 100k? With ~1 Gbps of traffic?
So 1500 bytes per packet, 8 bits per byte, 1 Gbps. This gives 1024^3 / 8 / 1500 ≈ 90k packets per second. Wait, what? We're seeing this packet rate with our maximum MTU? But how do we get 8 Gbit in 25k packets? That means packets of somewhere around 43000 bytes?
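The back-of-the-envelope numbers above can be reproduced with shell integer arithmetic. This is a minimal sketch, assuming bash (for the `**` operator) and the same 1 Gbit = 1024^3 bits convention as the text; the function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for the arithmetic above. Integer math only.

GBIT=$(( 1024 ** 3 ))   # bits in one Gbit, same convention as the text

# packets_per_second BITS_PER_SECOND BYTES_PER_PACKET
packets_per_second() {
    echo $(( $1 / 8 / $2 ))
}

# avg_packet_bytes BITS_PER_SECOND PACKETS_PER_SECOND
avg_packet_bytes() {
    echo $(( $1 / 8 / $2 ))
}

# 1 Gbps made of full-size 1500 byte packets
packets_per_second "$GBIT" 1500

# 8 Gbps that shows up as only 25k packets per second
avg_packet_bytes $(( 8 * GBIT )) 25000
```

The first call lands just under 90k packets per second, the second at roughly 43000 bytes per packet, which is far more than any MTU we use.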
There is this thing called Generic Receive Offload (GRO). I knew it existed, but I never knew exactly what it did. In short, it takes incoming packets, and if they're essentially in the same flow, it combines them into a larger packet to reduce packet handling overhead. This seems to be on by default for the virtio network driver. So when we did our iperf TCP transfer tests, GRO combined the individual 1500 byte packets into larger chunks.
However, at least on CentOS 7, virtio doesn't seem to do it for UDP. I never found a great resource on this, but it might partly be because it can mess up the order of the packets. So this explains the discrepancy in the packet counts between UDP and TCP transfers.
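To check whether GRO is switched on for an interface, `ethtool -k` lists the offload features. Here is a small sketch that picks out the GRO line; the helper name `gro_state` and the interface name `eth0` are assumptions for illustration.

```shell
#!/usr/bin/env bash
# gro_state: read `ethtool -k` output on stdin and print the GRO setting
# (for example "on" or "off"). Hypothetical helper name.
gro_state() {
    awk -F': *' '$1 == "generic-receive-offload" { print $2 }'
}

# Usage on a real machine (eth0 is an assumed interface name):
#   ethtool -k eth0 | gro_state
```

Note that this only tells you the feature is enabled on the interface, not which protocols actually get coalesced.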
When we had large packet counts, there was a process called ksoftirqd that was the CPU hog. There's a great blog series about the kernel networking internals. From there I learned that ksoftirqd does a lot of the packet handling, including the application of the netfilter rules. Now the CPU load starts making sense.
Each CPU has its own ksoftirqd process. Each interrupt is generally assigned to one CPU. There's a service called irqbalance that tries to balance these out between CPUs.
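One way to see how the receive work is spread over CPUs is the NET_RX row in /proc/softirqs. A rough sketch, with a made-up helper name, assuming the columns map to CPU0, CPU1 and so on in order (which holds when all CPUs are online):

```shell
#!/usr/bin/env bash
# net_rx_per_cpu: read /proc/softirqs style text on stdin and print one
# "CPUn count" line per column of the NET_RX row. Hypothetical helper.
net_rx_per_cpu() {
    awk '$1 == "NET_RX:" {
        for (i = 2; i <= NF; i++) printf "CPU%d %s\n", i - 2, $i
    }'
}

# Usage on a real machine:
#   net_rx_per_cpu < /proc/softirqs
```

If one CPU's counter grows much faster than the others during a transfer, that CPU's ksoftirqd is the one doing the work.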
Leeloo Dallas Multiqueue
Now let's look at our interrupts.
# cat /proc/interrupts |grep virtio
11: 0 0 0 IO-APIC-fasteoi virtio4
24: 0 0 0 PCI-MSI-edge virtio0-config
25: 2715189728 409036136 2434880172 PCI-MSI-edge virtio0-input.0
26: 2 42915 10564825 PCI-MSI-edge virtio0-output.0
27: 0 0 0 PCI-MSI-edge virtio1-config
28: 895637747 2067589528 710928967 PCI-MSI-edge virtio1-input.0
29: 2 276587 502494 PCI-MSI-edge virtio1-output.0
30: 0 0 0 PCI-MSI-edge virtio2-config
31: 2 185023862 385281087 PCI-MSI-edge virtio2-input.0
32: 1 0 0 PCI-MSI-edge virtio2-output.0
33: 0 0 0 PCI-MSI-edge virtio3-config
34: 2575 218858 333654 PCI-MSI-edge virtio3-req.0
As we see, the virtio devices are our network devices, and e.g. virtioN-input.0 is the input queue for the NIC N. In this case we have one input queue for the NIC. Most modern hardware NICs have multiple input queues. They can split the incoming traffic into different queues based on some metric. Each queue has its own interrupt, and thus the CPU load of packet handling can be spread to multiple CPUs.
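Picking the IRQ number for a given queue out of that listing is easy to script. A minimal sketch, where `irq_for` is a made-up helper name:

```shell
#!/usr/bin/env bash
# irq_for NAME: read /proc/interrupts style text on stdin and print the IRQ
# number of the line whose last field is NAME. Hypothetical helper.
irq_for() {
    awk -v name="$1" '$NF == name { sub(/:$/, "", $1); print $1 }'
}

# Usage on a real machine:
#   irq_for virtio1-input.0 < /proc/interrupts
```

With the listing above, looking up virtio1-input.0 gives IRQ 28.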
Virtio can also do this, so I decided to test virtio multiqueue. Excitedly I launched a new network machine in our test environment, and did the twiddly bits to get it set up on the VM. This excitement didn't last for long. In practice, I didn't see a difference when I had aggressive UDP use. The algorithms used to decide which queue a packet goes to put all packets of one flow into the same queue, so a single heavy UDP transfer still lands on one queue. While this is an improvement, since some other flows will probably go to other queues, it still means a large UDP transfer can disturb 1/N of all your flows, where N is the number of queues. Better, but not optimal.
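Why one big transfer ends up in one queue is easier to see with a toy model: hash the flow's addresses and ports, then take the result modulo the number of queues. The sketch below uses `cksum` purely as a stand-in hash. It is not the hash virtio or the kernel actually uses, and the addresses and ports are made up.

```shell
#!/usr/bin/env bash
# queue_for_flow "SRC DST PROTO SPORT DPORT" NUM_QUEUES
# Toy model only: CRC from cksum stands in for the real flow hash.
queue_for_flow() {
    local hash
    hash=$(printf '%s' "$1" | cksum | cut -d' ' -f1)
    echo $(( hash % $2 ))
}

flow="10.0.0.5 192.0.2.10 udp 40000 33001"   # made-up addresses and ports

queue_for_flow "$flow" 4
queue_for_flow "$flow" 4   # same flow, same queue, however many packets it sends
```

Every packet of the transfer carries the same addresses and ports, so the hash never changes and neither does the queue.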
Could I Request an Interrupt
Then I stumbled on Red Hat documentation (and a bunch of blog posts) that said that you could make multiple CPUs handle one IRQ. That had my interest! Maybe it helps if we throw more cycles at the IRQ?
So first, let's stop the irqbalance service that automatically assigns IRQs to CPUs.
# systemctl stop irqbalance
Then, let's try this with virtio1, which is our external NIC.
# cat /proc/interrupts | grep virtio1
27: 0 0 0 PCI-MSI-edge virtio1-config
28: 896954790 2067589528 710928967 PCI-MSI-edge virtio1-input.0
29: 2 276587 502509 PCI-MSI-edge virtio1-output.0
IRQ 28 it is. Let's see what's happening with that IRQ.
This is a 3 CPU VM. The smp_affinity is a bitmask of the CPUs which are allowed to handle the IRQ.
Let's make all CPUs able to handle the IRQ.
# echo 0007 > /proc/irq/28/smp_affinity
You basically echo the hex bitmask of which CPUs are allowed to handle this IRQ. It seems like it needs to be a 16 bit (4 hex chars) value. Here we echo 0007 for all our three cores.
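If you don't feel like doing the bit twiddling in your head, the mask can be built from a list of CPU numbers. A small sketch, assuming bash; `cpu_mask` is a made-up helper name, and the 4 digit padding just follows the observation above.

```shell
#!/usr/bin/env bash
# cpu_mask CPU...: print the hex bitmask with one bit set per listed CPU,
# zero-padded to 4 hex digits. Hypothetical helper.
cpu_mask() {
    local mask=0 cpu
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%04x\n' "$mask"
}

cpu_mask 0 1 2   # all three CPUs of this VM
cpu_mask 1       # CPU 1 only
```

The first call prints 0007, the value used here, and the second prints 0002.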
This is not enough. You apparently also need to enable Receive Packet Steering (RPS) to allow all CPUs to read the rx queue. This is basically a poor man's multiqueue. This is done on the interface itself, and is a per-queue setting. You do it with the same bitmask, but apparently the leading zeroes aren't required.
# echo 7 > /sys/class/net/eth0/queues/rx-0/rps_cpus
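Since RPS is a per-queue setting, an interface with several rx queues needs the mask written to each of them. A sketch of a loop that does this; `set_rps` is a made-up helper, and the optional third argument only exists so it can be tried against a scratch directory instead of the real /sys/class/net.

```shell
#!/usr/bin/env bash
# set_rps IFACE MASK [SYSFS_NET_DIR]
# Write MASK to rps_cpus of every rx queue of IFACE. Hypothetical helper.
set_rps() {
    local iface=$1 mask=$2 root=${3:-/sys/class/net} queue
    for queue in "$root/$iface"/queues/rx-*; do
        # Skip the unexpanded glob or queues without an rps_cpus file
        [ -e "$queue/rps_cpus" ] || continue
        echo "$mask" > "$queue/rps_cpus"
    done
}

# Usage as root on a real machine (eth0 is an assumed interface name):
#   set_rps eth0 7
```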
After this, a few more tweaks to enable Receive Flow Steering (RFS), which I guess should reduce out-of-order packets.
# echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
# echo 32768 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt
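With a single rx queue, both values are the same. As far as I understand the kernel's scaling documentation, with several rx queues the per-queue rps_flow_cnt would typically be the global rps_sock_flow_entries divided by the number of queues. A tiny sketch of that division, with a made-up helper name:

```shell
#!/usr/bin/env bash
# rfs_flow_cnt SOCK_FLOW_ENTRIES NUM_RX_QUEUES
# Per-queue rps_flow_cnt as global entries divided by queue count.
# Hypothetical helper; the division rule is my reading of the kernel docs.
rfs_flow_cnt() {
    echo $(( $1 / $2 ))
}

rfs_flow_cnt 32768 1   # single rx queue, as in this setup
rfs_flow_cnt 32768 4   # what a 4 queue interface would get
```

For our single-queue setup this gives 32768 again, and a 4 queue interface would get 8192 per queue.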
I started generating UDP load after this, and voilà! "top" shows that there is load on three ksoftirqd processes! 100k packets per second flew through with no problems, 150k as well. I didn't test how far this stretches, since this was where our customer's transfer maxed out.
I'm not certain about any possible negative effects this might have. There are probably reasons all this isn't on by default, but I'll let you know when this blows up in our face.