Mumbling about computers

Analyzing mysterious dropped packets

2021-12-23 [ linux networking performance ]

I was asked to investigate "Why are we dropping packets at 6:30-6:40 most mornings on most colocations?"

Given the packets are dropped, I will not be able to see what data was lost, but if the drops are due to a burst of data, the packets that do make it to the kernel will be a good indicator of what's spamming the network.

As the packet drops are not consistent, the easiest way to get a meaningful capture of the data is to execute tcpdump on one of the servers that are impacted most often.

I captured a single 10 minute slice of data while the event was happening by running:

/usr/sbin/tcpdump -i ens3f0 -W 1 -G 600 -w /tmp/capture-$(date +%Y%m%d-%H%M).pcap

The -G 600 flag rotates the capture file every 600s (10 min), and -W 1 limits the rotation to a single file, so this writes exactly one 10-minute capture.

The capture file was 14GB, which is fairly unwieldy to manipulate, and as I have somewhat precise details on when packets were dropped (from host metrics), I can trim the capture down with editcap:

$ editcap -A '2020-08-11 06:31:20' -B '2020-08-11 06:31:40' /tmp/capture-20200811-0630.pcap short.pcap
$ editcap -A '2020-08-11 06:31:30' -B '2020-08-11 06:31:32' /tmp/capture-20200811-0630.pcap super-short.pcap

These will give you a 20s capture (451MB) and a 2s capture (63MB), which are much more reasonable to quickly iterate on.

Initial analysis

The basic idea when trying to find whether micro-bursts are happening is: find a criterion on which you can bucket (~group) the data, such that the outlier becomes immediately visible.

First, split the data on protocol (TCP / UDP) and look for immediate spikes. In this case, there wasn't anything on TCP, but there was a big spike on UDP.

To get timestamps from all UDP packets, relative to the start of the capture:

tshark -r short.pcap -Y 'udp' -e frame.time_relative -T fields > timestamps

The easiest way for me to verify and iterate on the analysis is to plot the data and drill down on whatever looks suspicious. For that I wrote this small script:

import matplotlib.pyplot as plt
import matplotlib.ticker as mticker

from collections import defaultdict

plt.style.use('ggplot')

bucket_ms = 100
if bucket_ms > 1:
    label = f'%.2f pkt/{bucket_ms}ms'
else:
    label = '%.2f pkt/ms'

# Count packets per time bucket
ts = defaultdict(int)
for t in open('timestamps').readlines():
    t = float(t.strip())
    bucket = int(t * (1000 / bucket_ms))
    ts[bucket] += 1

x = list(ts.keys())
y = [ts[k] for k in ts]

plt.bar(x, y, color='green')
# Label the y-axis ticks with the bucket size
plt.gca().yaxis.set_major_formatter(mticker.FormatStrFormatter(label))
plt.show()

When looking at UDP packets per 100ms bucket there's a very clear spike:

Data bucketed every 100ms

Better bucketing

If I group UDP traffic into dst_port buckets (regardless of time):

$ tshark -r short.pcap -Y udp -e udp.dstport -T fields | uniqcnt | tail -5
   3083 28010
   5895 28060
  10472 28070
  12129 28874
  18556 28020

in the 20s capture file, port 28874 received 12k packets

$ tshark -r super-short.pcap -Y udp -e udp.dstport -T fields | uniqcnt | tail -5
    444 28010
    485 28060
    879 28070
   1867 28020
   7142 28874

but in the 2s capture it saw 7k! Let's graph again with more refined data (packets-per-ms but only from the data source we care about) to confirm our hypothesis:

tshark -r short.pcap -Y 'udp.dstport == 28874' -e frame.time_relative -T fields > timestamps
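(An aside: uniqcnt in the commands above is a local helper, not a standard tool; I'm assuming it's equivalent to sort | uniq -c | sort -n. A minimal Python stand-in would be:)

```python
from collections import Counter

def uniqcnt(lines):
    # Count identical lines and return (count, value) pairs sorted by
    # count, ascending -- the same shape as `sort | uniq -c | sort -n`.
    counts = Counter(line.strip() for line in lines if line.strip())
    return sorted((n, v) for v, n in counts.items())
```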

Data bucketed every 100ms

5000 packets in 100ms is quite a bit, but not dramatic. Always need to check there's no aliasing due to the bucketing or zoom level.
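To illustrate why the bucket width matters: the same burst can look tame or extreme depending on how you bucket it. A toy example with made-up timestamps:

```python
from collections import defaultdict

def peak_rate(timestamps, bucket_ms):
    # Bucket packet timestamps (in seconds) and return the busiest
    # bucket's packet count, normalized to packets per millisecond.
    buckets = defaultdict(int)
    for t in timestamps:
        buckets[int(t * 1000 / bucket_ms)] += 1
    return max(buckets.values()) / bucket_ms

# 1000 packets crammed into a single millisecond of the capture
burst = [0.5 + i * 1e-6 for i in range(1000)]
print(peak_rate(burst, 100))  # 100ms buckets average it out: 10 pkt/ms
print(peak_rate(burst, 1))    # 1ms buckets show the true 1000 pkt/ms peak
```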

Data bucketed every 1ms
Data bucketed every 1ms, zoomed to cover 500ms total time

~800 packets / ms is completely unreasonable!

Assuming these packets are MTU sized (1500 bytes; and why would the other side send so many packets if they weren't?) this is effectively line rate (9.1Gb/s), and remember these are the packets we did not lose.
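The back-of-the-envelope math, ignoring Ethernet framing overhead (preamble, inter-frame gap, etc., which is why the exact on-the-wire figure differs slightly):

```python
# ~800 MTU-sized packets per millisecond, counting payload bytes only
pkts_per_ms = 800
bytes_per_pkt = 1500
gbps = pkts_per_ms * 1000 * bytes_per_pkt * 8 / 1e9
print(gbps)  # 9.6 -- saturating a 10GbE link
```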

Finding the source culprit

Now that we cornered the suspect, we can check who is sending this data:

$ tshark -T fields -r super-short.pcap -Y 'udp.dstport == 28874' -e ip.src_host | uniqcnt
     10 1xxxxxxxxxx16
     10 1xxxxxxxxxx50
     10 1xxxxxxxxxx75
     11 1xxxxxxxxxx21
     20 1xxxxxxxxxx1
     24 1xxxxxxxxxx31
     86 1xxxxxxxxxx58
   6908 1xxxxxxxxxx11

The source host was sending a lot of data via multicast without any kind of throttling. A very reasonable throttle in the application solved the issue permanently.
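The fix itself isn't shown here; for illustration, a throttle like the one described could be as simple as a token bucket in the sending application (the class and rates below are hypothetical, not the actual fix):

```python
import time

class TokenBucket:
    """Allow at most `rate` sends per second, with bursts up to `burst`."""
    def __init__(self, rate, burst):
        self.rate = rate
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens for the elapsed time, capped at the burst size
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Hypothetical usage: only send a multicast packet when the bucket
# allows it, otherwise back off briefly and retry.
bucket = TokenBucket(rate=10_000, burst=100)
```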