Analyzing mysterious dropped packets
2021-12-23 [ linux networking performance ]

I was asked to investigate "Why are we dropping packets at 6:30-6:40 most mornings on most colocations?"
Given the packets are dropped, I will not be able to see what data was lost, but if the drops are due to a burst of data, the packets that do make it to the kernel will be a good indicator of what's spamming the network.
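(The drop counts themselves come from host metrics; if you want to check by hand, something like the following usually shows them, though the exact counter names depend on the NIC driver and kernel:)
ethtool -S ens3f0 | grep -i drop
netstat -su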
As the packet drops are not consistent, the easiest way to get a meaningful capture of the data is to run tcpdump
on one of the servers that are impacted most often.
I captured a single 10 minute slice of data while the event was happening by running:
/usr/sbin/tcpdump -i ens3f0 -W 1 -G 600 -w /tmp/capture-$(date +%Y%m%d-%H%M).pcap
This rotates the dump file after 600s (10 min, -G 600) and limits the run to a total of 1 file (-W 1), so tcpdump exits after a single 10 minute interval.
The capture file was 14GB, which is fairly unwieldy to manipulate, and as I have somewhat precise details on when packets were dropped (from host metrics), I can trim the capture down with editcap:
$ editcap -A '2020-08-11 06:31:20' -B '2020-08-11 06:31:40' /tmp/capture-20200811-0630.pcap short.pcap
$ editcap -A '2020-08-11 06:31:30' -B '2020-08-11 06:31:32' /tmp/capture-20200811-0630.pcap super-short.pcap
These will give you a 20s capture (451MB) and a 2s capture (63MB), which are more reasonable to iterate on quickly.
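If you're not sure exactly which window the original capture covers, capinfos (shipped with Wireshark alongside editcap) prints the first and last packet times, which makes picking the -A/-B bounds easier:
$ capinfos /tmp/capture-20200811-0630.pcap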
Initial analysis
The basic idea when trying to find whether micro-bursts are happening is: there is some criterion on which you can bucket (~group) the data, and the outlier will be immediately visible.
First, split the data on protocol (TCP / UDP) and look for immediate spikes. In this case, there wasn't anything on TCP, but there was a big spike on UDP.
To get timestamps from all UDP packets, relative to the start of the capture:
tshark -r short.pcap -Y 'udp' -e frame.time_relative -T fields > timestamps
The easiest way for me to verify and iterate on the analysis is to plot the data and drill down on whatever looks suspicious. For that I wrote this small script:
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
from collections import defaultdict

plt.style.use('ggplot')
plt.xticks(rotation='vertical')

# Size of each histogram bucket, in milliseconds.
bucket_ms = 100
if bucket_ms > 1:
    label = f'%.2f pkt/{bucket_ms}ms'
else:
    label = '%.2f pkt/ms'
plt.gca().yaxis.set_major_formatter(mticker.FormatStrFormatter(label))

# Count packets per bucket from the relative timestamps extracted by tshark.
ts = defaultdict(int)
for t in open('timestamps').readlines():
    t = float(t.strip())
    bucket = int(t * (1000 / bucket_ms))
    ts[bucket] += 1

x = list(ts.keys())
y = [ts[k] for k in ts]
plt.bar(x, y, color='green')
plt.xlabel("time")
plt.ylabel("Packets")
plt.show()
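The script reads the timestamps file written by tshark above; bucket_ms controls the bucket width, so the same script covers both the coarse (100ms) and the finer (1ms) views further down.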
When looking at UDP packets per 100ms bucket there's a very clear spike:
Data bucketed every 100ms
Better bucketing
If I group UDP traffic into dst_port buckets (regardless of time):
$ tshark -r short.pcap -Y udp -e udp.dstport -T fields | uniqcnt | tail -5
3083 28010
5895 28060
10472 28070
12129 28874
18556 28020
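(uniqcnt is not a standard tool; assume it's a tiny helper along the lines of the alias below, i.e. count identical lines and sort ascending by count:)
alias uniqcnt='sort | uniq -c | sort -n'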
In the 20s capture file, port 28874 received 12k packets.
$ tshark -r super-short.pcap -Y udp -e udp.dstport -T fields | uniqcnt | tail -5
444 28010
485 28060
879 28070
1867 28020
7142 28874
But in the 2s capture it saw 7k! Let's graph again with more refined data (packets-per-ms, but only from the destination port we care about) to confirm our hypothesis:
tshark -r short.pcap -Y 'udp.dstport == 28874' -e frame.time_relative -T fields > timestamps
Data bucketed every 100ms
5000 packets in 100ms is quite a bit, but not dramatic. It's always worth checking there's no aliasing due to the bucketing or zoom level.
Data bucketed every 1ms, zoomed to cover 500ms total time
~800 packets / ms is completely unreasonable!
Assuming these packets are MTU-sized (1500 bytes, and why would the other side send so many packets if they weren't?), this is effectively line rate (9.1Gb/s), and remember: these are only the packets we did not lose.
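As a quick sanity check on that claim (a back-of-envelope sketch; the exact figure depends on the per-ms packet count and framing overhead you assume):
# Rough throughput estimate (values read off the plot, not exact measurements):
pkts_per_ms = 800          # peak rate from the 1ms-bucketed plot
bytes_per_pkt = 1500       # assuming MTU-sized packets
gbps = pkts_per_ms * 1000 * bytes_per_pkt * 8 / 1e9
print(f"~{gbps:.1f} Gb/s") # ~9.6 Gb/s of payload alone, i.e. a saturated 10GbE link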
Finding the source culprit
Now that we've cornered the suspect, we can check who is sending this data:
$ tshark -T fields -r super-short.pcap -Y 'udp.dstport == 28874' -e ip.src_host | uniqcnt
10 "ip.src_host": "1xxxxxxxxxx16",
10 "ip.src_host": "1xxxxxxxxxx50",
10 "ip.src_host": "1xxxxxxxxxx75",
11 "ip.src_host": "1xxxxxxxxxx21",
20 "ip.src_host": "1xxxxxxxxxx1",
24 "ip.src_host": "1xxxxxxxxxx31",
86 "ip.src_host": "1xxxxxxxxxx58",
6908 "ip.src_host": "1xxxxxxxxxx11",
The source host was sending a lot of data via multicast without any kind of throttling. A very reasonable throttle on the application solved the issues permanently.
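The actual fix lived in the sending application and isn't shown here; purely as an illustration (all names hypothetical), a minimal token-bucket pacer looks roughly like:
import time

class TokenBucket:
    """Allow at most `rate` sends per second, with bursts of up to `burst` packets."""
    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.tokens = burst
        self.last = time.monotonic()

    def wait(self):
        """Block until one token is available, then consume it."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep((1 - self.tokens) / self.rate)

# Hypothetical usage: pace multicast sends to ~50k packets/s instead of bursting at line rate.
# bucket = TokenBucket(rate=50_000, burst=1_000)
# for payload in outgoing_messages:
#     bucket.wait()
#     sock.sendto(payload, (MCAST_GROUP, MCAST_PORT))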