Hello, r/MachineLearning! Over the last two months, in my free time outside of school and work, I've been working on my first major machine learning project, "Deepdos". Deepdos is a network tool that analyzes (and, in the future, will mitigate) all traffic coming over whatever network adapter you specify. The analysis uses a logistic regression model that classifies traffic as either safe or malicious, based on packet capture data aggregated into flows with CICFlowMeter (the group that created the tool also created the dataset used for training). The mitigation, which will be Linux-only, will create and manage firewall rules written directly to iptables. While the name includes "deep", there is actually no deep learning involved at all (at least not yet).
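To give a feel for what the classification step looks like, here is a minimal sketch of scoring a single aggregated flow with a logistic model. The feature names, weights, and threshold below are hypothetical placeholders for illustration, not the project's actual trained model:

```python
import math

# Hypothetical coefficients for a few CICFlowMeter-style flow features.
# These are illustrative placeholders, NOT the project's trained weights.
WEIGHTS = {"flow_duration": -0.002, "fwd_pkts_per_s": 0.004, "pkt_len_mean": 0.001}
BIAS = -1.5

def classify_flow(features: dict) -> tuple:
    """Return ("malicious" or "safe", probability) for one flow record."""
    z = BIAS + sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())
    p = 1.0 / (1.0 + math.exp(-z))  # sigmoid squashes the score into (0, 1)
    return ("malicious" if p >= 0.5 else "safe", p)
```

The same structure carries over directly to a real scikit-learn `LogisticRegression` model, where the weights come from training rather than being hardcoded.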
The project source code can be found here: deepdos
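For the Linux-only mitigation side mentioned above, a minimal sketch of writing a blocking rule to iptables via a subprocess could look like this. The helper name, rule layout, and dry-run flag are my assumptions, not the project's actual implementation, and actually applying the rule requires root:

```python
import subprocess

def block_ip(ip: str, dry_run: bool = True) -> list:
    """Append an iptables rule dropping all inbound traffic from `ip`.

    Hypothetical helper for illustration; run with dry_run=False (as root)
    to actually write the rule.
    """
    cmd = ["iptables", "-A", "INPUT", "-s", ip, "-j", "DROP"]
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd
```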
Currently the project is listed as pre-alpha, as there are a lot of milestones to hit before I can consider it stable/production-ready. Hopefully some of you can help me get there! I'm looking for constructive feedback on the project's current state, additions I should be making, and really anything else that can help me grow this into something useful for companies. Here is a snapshot of the project without having to look at any of the code:
Where I’m at:
- I currently use a logistic regression model trained on 200,000 samples of network traffic: 100,000 "normal" and 100,000 malicious.
- Packet capture data aggregation via tcpdump. For development I currently listen in very short bursts, but I'll be ramping this window up to more accurately reflect the communication between two devices.
- Published on PyPI (not stable yet).
- I've rebuilt the structure of the application three times now for scalability, and I think I've finally developed a system that works.
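The tcpdump capture step above could be sketched roughly like this. The interface name, burst length, helper name, and dry-run flag are illustrative assumptions; actually capturing requires tcpdump and usually root:

```python
import subprocess

def capture_burst(interface: str, seconds: int, out_pcap: str,
                  dry_run: bool = True) -> list:
    """Capture a short burst of packets to a pcap file for flow aggregation.

    Hypothetical helper: -i picks the adapter, -w writes raw packets, and
    the subprocess timeout ends the capture once the burst window elapses.
    """
    cmd = ["tcpdump", "-i", interface, "-w", out_pcap]
    if not dry_run:
        try:
            subprocess.run(cmd, timeout=seconds)
        except subprocess.TimeoutExpired:
            pass  # expected: we stop the capture after `seconds`
    return cmd
```

The resulting pcap is what a tool like CICFlowMeter would then aggregate into per-flow feature rows for the classifier.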
Where I’m trying to go:
- I'm currently thinking about how to develop a robust testing system so that this project can continue to scale reliably.
- Training on the full dataset, which comprises roughly 57 million samples; I'm currently only using 200,000 of them. :[
- Experimenting with different machine learning and deep learning models to see how I can maximize both classification performance and the performance of the overall application.
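On scaling from 200k to ~57M samples: at that size the data probably can't sit in memory at once, so one option is streaming minibatch SGD for logistic regression. Here's a minimal NumPy sketch under that assumption; the batch-iterator plumbing is made up for illustration:

```python
import numpy as np

def sgd_logistic(batch_iter, n_features: int, lr: float = 0.1, epochs: int = 20):
    """Fit logistic regression with minibatch SGD.

    `batch_iter` is any callable yielding (X, y) chunks, so the full
    dataset never has to fit in memory at once.
    """
    w, b = np.zeros(n_features), 0.0
    for _ in range(epochs):
        for X, y in batch_iter():
            p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid predictions
            grad = p - y                             # dLoss/dz per sample
            w -= lr * (X.T @ grad) / len(y)          # average gradient step
            b -= lr * grad.mean()
    return w, b
```

In practice, scikit-learn's `SGDClassifier` with `partial_fit` offers this kind of out-of-core training without rolling your own update loop.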
Working on this project has been quite the learning experience and, honestly, a really enjoyable time. I really appreciate those of you who took time out of your day to read this, and I hope I can draw on the opinions and expertise of this thread to turn it into something awesome.