Have you ever used Wireshark?

It’s an amazing network debugging and investigation tool. Anyone remotely related to connected computers will use it regularly: networking, security analysis, network application development, penetration testing… either directly through the graphical software, through the CLI tool tshark, or through a lighter variation for quick debugging, like tcpdump. It will start sniffing on whatever network interface you tell it to listen on and log every single packet it sees into a file for storage, or even straight to standard output for a quick check, for example to see if one of your programs is actually sending keepalives.
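
To make that “quick check” concrete, here’s a minimal sketch of the idea (the interface name is a made-up example and the keepalive display filter is my assumption of how I’d do it, not a prescription): stream the live capture to standard output and only keep the packets Wireshark flags as TCP keepalives.

```python
# Minimal sketch: watch an interface live and print only TCP keepalive packets.
# "eth0" is a placeholder interface name.
import subprocess

cmd = [
    "tshark",
    "-i", "eth0",                     # interface to sniff on
    "-Y", "tcp.analysis.keep_alive",  # display filter: only TCP keepalives
    "-l",                             # flush stdout after each packet, for live reading
]
with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
    for line in proc.stdout:          # one summary line per matching packet
        print(line.rstrip())
```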

But that’s where the problems with the “let’s store everything in one file” approach start popping up: scalability basically disappears, thanks to concurrency and management issues we’ll get into later.

Have you ever tried to open up Wireshark on a Windows machine? Those have a lot of background traffic. Like, a lot. Phoning home for update checks, antivirus definition changes, hell, even the Windows calculator phones home for its currency conversion feature. The future is now, and it’s connected. Network noise galore.

The truth is you don’t even need Windows for this to be a problem. Linux doesn’t have that noise issue, but if you have a lot of Linux boxes, the problem pops up again. So when you work in the security field, say SOC analysis, you get quite a big amount of traffic to process. Then you picture what most offices are, which is basically a big bunch of Windows machines, and you quickly realize you’re going to run into some big data-related problems.

Now, there’s a problem with file storage even when you’re only handling a single data node, and it’s that the format can be inefficient. It’s convenient, but you can’t do a lot with it before things get out of control. Try to open a 1-gigabyte .pcap file and see what it takes to open it and filter its content. That’s one gigabyte. A twenty-minute 1080p YouTube video is roughly a gigabyte of traffic. One client can generate that in twenty minutes without doing anything else, which is unlikely: email sending, open browser tabs with live connections, background programs polling for updates… multitasking in computers, and multitasking in your coworkers! Concurrency for the concurrency gods!

So we have a size problem. How was this solved originally? With a little something called file rotation: once a capture file reaches a certain size, the program considers it finished and rotates into a new one, like moving from netlog.pcap.000001 to netlog.pcap.000002. And if the system runs constantly, you can circle back and overwrite the first one, so you never run out of space! Cool solution! Problem solved, right?
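
As a rough illustration of what that looks like in practice, tshark ships a built-in ring-buffer mode that does exactly this rotation; the interface, file size and file count below are made-up example values, not a recommendation.

```python
# Sketch of capture-file rotation using tshark's ring buffer:
# rotate to a new file every ~100 MB and keep at most 10 files,
# then circle back and overwrite the oldest one.
# "eth0" and "netlog.pcap" are placeholder names.
import subprocess

subprocess.run([
    "tshark",
    "-i", "eth0",             # interface to capture on
    "-w", "netlog.pcap",      # base name; tshark numbers the rotated files
    "-b", "filesize:100000",  # switch files at ~100 MB (value is in kB)
    "-b", "files:10",         # keep 10 files, then overwrite the first one
])
```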

Well, if this were a done and solved problem for my use case, I wouldn’t be here writing a blog post series about it, would I?

It is solved in terms of enterprise tools available for production use, but there’s no fun in just installing a readily available tool that handles it. Learning happens when you break and fix stuff! And what better thing to fix roadblocks on than your own new from-scratch tool?

My problem here is that I like centralizing content, so I don’t like having many files lying around and worrying that they will one day be overwritten. But even for non-lazy people, this creates a data concurrency problem: only one writer can own a file at a time, which means you could only have one capture client at a time. Pretty useless data aggregation system, if you ask me.

I obviously decided to overengineer an open-source-licensed solution that only I would use, and that is probably worse than the actual solutions out there. Because if you know this blog, you know I’m about security and IT design, not application development. That said, I still like learning new stuff and challenging myself, so project PCAPAnalyzer was born.

Now, this happened around two and a half years ago, and it has been in development intermittently since, because of self-learning roadblocks. I hadn’t gotten into desktop app development then and I still haven’t (that may be a future series? Stay tuned!), so I turned to the next best thing I knew: web applications. I knew HTML, I had heard about PHP, that’s how those work, right? Open up a browser, click the bookmark and the entire tool is at your disposal! No dependency hell, no distributed app update management for potentially hundreds of machines, no client drift from a valid config… beautiful words for a maintainer who is really not a maintainer!

The first thing I needed was to set app requirements and priorities. You can’t build something if you don’t know what you want and need to build. I got to thinking about the frontend first. Now, I’m a simple man: I see a standard, I use a standard. Django? That looks like too much abstraction, and since I already know some Python 3, I wanted something tried and tested that was also new to me.

On the web, PHP seemed to be my best bet, since it’s also a useful skill to pick up. No Laravel or anything; I don’t want any unnecessary abstraction layers. You learn the basics first, then build up with simplifying tools. The rest was the standard: HTML, CSS3 (I could use Bootstrap, but I think I’m gonna leave learning responsive design for another time) and the Apache2 web server.

Next comes the backend. This was, at the time, completely new territory to me, so I did some searching on good ol’ DuckDuckGo and looked at what could best suit my needs. I wanted to aggregate everything I gathered into one place and make it really fast in terms of data search at scale (I was going to accumulate a lot of data quickly; remember the YouTube video example?), so that screamed “database” to me.

I had also done a bit of learning in class about the LAMP stack, so I was familiar with MySQL. Free, open source, enterprise-ready scalable software, there for me to use: it was the perfect match. The data capture source was Wireshark on the terminal, that one’s easy! Unfortunately, Wireshark does not have a “put into database” terminal option, and MySQL doesn’t have a “Wireshark capture” data source. So I needed some data processing.

This is the part of the backend that was going to be the most engaging: something I’d build basically from scratch. In class we processed data with Python scripts, and that seemed very extensible and easy to implement. So I chose Python for data processing and transformation, with the results then going into a database. Processing strings and inserting them into a database.
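
Something like this is what I had in mind. A minimal sketch of the “strings into a database” step, assuming a hypothetical capture.csv with source, destination, protocol and length columns, a made-up packets table and credentials, and the mysql-connector-python library:

```python
# Minimal sketch: read a CSV of packet summaries and bulk-insert it into MySQL.
# The CSV columns, table name and credentials are made-up examples.
import csv
import mysql.connector  # pip install mysql-connector-python

conn = mysql.connector.connect(
    host="localhost", user="pcap", password="secret", database="pcapanalyzer"
)
cursor = conn.cursor()

with open("capture.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    rows = [(src, dst, proto, int(length)) for src, dst, proto, length in reader]

cursor.executemany(
    "INSERT INTO packets (src_ip, dst_ip, protocol, length) VALUES (%s, %s, %s, %s)",
    rows,
)
conn.commit()
cursor.close()
conn.close()
```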

Let’s list the application requirements we finally decided on and the choices made to actually meet them:

  • frontend
    1. We want this tool to be available inside a website, solved with Apache2 for delivery because it’s the gold standard
    2. The site should be dynamic, and it needs to handle database management around our MySQL backend, solved using PHP
    3. Make it pretty: CSS3 is tried and tested for a reason! The web standard is very much a good tool for us
  • backend
    1. Network interface(s) data capture with Wireshark, more precisely its terminal counterpart tshark, because this is a no-GUI Debian server!
    2. Store the data consistently for fast retrieval: MySQL has been getting more efficient for the last twenty-five years, so there’s a good amount of work there
    3. Data treatment: there are a lot of binary formats around that are not compatible with each other. That’s why CSV was created, the age of interoperability. Let’s create some of those with Python 3 (see the sketch right after this list)!
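
To give an idea of how those backend pieces chain together, here’s a hedged sketch of the capture-to-CSV step: tshark’s field output mode writing comma-separated rows that the Python and MySQL side can later pick up. The interface, field list, packet count and file name are placeholder assumptions, not the project’s final command.

```python
# Sketch of the capture -> CSV step: ask tshark for a few fields per packet
# in comma-separated form and write them to a file for later processing.
# Interface, field list and file name are made-up examples.
import subprocess

fields = ["ip.src", "ip.dst", "_ws.col.Protocol", "frame.len"]
cmd = ["tshark", "-i", "eth0", "-c", "1000",  # stop after 1000 packets, just for the demo
       "-T", "fields",                        # field output mode
       "-E", "separator=,", "-E", "header=y"] # comma-separated, with a header row
for f in fields:
    cmd += ["-e", f]                          # one -e per exported field

with open("capture.csv", "w") as out:
    subprocess.run(cmd, stdout=out, check=True)
```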

I had seen all of this in separate class assignments, so it was a good cross-understanding test! Project PCAPAnalyzer was feasible, and it was just about time for some investigation and building! An actual product on the horizon!

But that’s a story for the next dev diary, where we’ll start taking a look at how much I underestimated the complexity of data processing as a non-developer. See you then!