So I was working on a college assignment where I was asked to characterize large amounts of data contained in PCAP files to isolate particular traits, mainly TCP retransmissions. I’ll note for those who don’t know that I’m REALLY lazy by default and didn’t want to spend all evening hand-cramming data from tcpdump into some file for analysis.
First I thought about just running tcpdump from the command line and parsing the text output for analysis. Retransmissions are pretty easy to find since they reuse a sequence number that has already been seen on the same connection. Then I thought I shouldn’t depend on tcpdump since it would be a multi-step process (run tcpdump to output to a file, then ingest that file).
So I downloaded pcap and dpkt for Python from Google Code. The documentation on dpkt is a little sparse, but I was able to get something going. I’ll post some code once I clean it up, but there are a couple of things worth noting.
1. My code assumed all frames are Ethernet. That’s easy since I’m capturing from an Ethernet device (and the sample files I was given were captured the same way). Once you have the packet from pcap (called “pkt” in this example), get dpkt to decode it for you using:
eth = dpkt.ethernet.Ethernet(pkt)
2. Getting the IP and TCP structures is as easy as pulling the “data” out of each enclosing layer:
ip = eth.data
tcp = ip.data
3. It isn’t that easy. If we are looking for TCP data, we need to make sure we only deal with TCP, which is IP protocol number 6, and not UDP or something else. Before getting the data from the IP packet (or treating that data as if it’s TCP), make sure it really is:
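if ip.p != dpkt.ip.IP_PROTO_TCP:  # protocol 6 is TCP; anything else we don't care about in this case
    continue  # skip to the next packet (assumes we're inside the per-packet loop)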
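4. Even treating the eth.data field as IP is a bad idea. ARP will bite you there: dpkt hands back an ARP object, which has no protocol field, so asking for ip.p raises an AttributeError. Defensive programming is your friend, so check the type before you touch any IP fields: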
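if not isinstance(ip, dpkt.ip.IP):  # ARP or some other non-IP payload
    continue  # skip to the next packet (assumes we're inside the per-packet loop)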
So those were my quick pitfalls. Like I said, I’ll try to post some code later when it gets cleaned up. For now, don’t assume Ethernet frames have an IP payload and don’t assume that IP packets are TCP, and you should be good to find any duplicate sequence numbers.
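As for the duplicate-sequence-number hunt itself, here is a rough sketch of the idea. It isn't my cleaned-up code, just the simplest version of the heuristic. It assumes each TCP segment has already been boiled down to a (src, dst, sport, dport, seq, payload_len) tuple. That layout and the name find_duplicate_seqs are made up for the example. It skips segments with no payload, because bare ACKs legitimately repeat the same sequence number. That also means it will miss retransmitted SYNs and FINs.

```python
from collections import defaultdict


def find_duplicate_seqs(segments):
    """Return (flow, seq) for every data segment whose seq was already seen.

    `segments` is an iterable of (src, dst, sport, dport, seq, payload_len)
    tuples. This tuple layout is an assumption made for this sketch.
    """
    seen = defaultdict(set)  # one set of sequence numbers per direction of a flow
    duplicates = []
    for src, dst, sport, dport, seq, payload_len in segments:
        if payload_len == 0:
            continue  # bare ACKs repeat seq numbers without being retransmissions
        flow = (src, dst, sport, dport)
        if seq in seen[flow]:
            duplicates.append((flow, seq))
        else:
            seen[flow].add(seq)
    return duplicates
```

Keying on the full address and port tuple keeps the two directions of a connection apart, so a sequence number used by the client is never confused with the same number used by the server.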