Machine Learning Meets Cybersecurity

Machine Learning (ML) and Cybersecurity: is this a match made in heaven or yet another tech hype? It is an interesting combination worth exploring, both for the security professional who wants to solve problems in a non-traditional manner and for the data scientist who wants to work in a high-impact area. Where does one start, though? This series of blogs aims to give you a roadmap, tools, inspiration, and technical knowledge to combine these two areas.

Nowadays, there is a lot of hype about Large Language Models (LLMs). However, these models are built on fundamentals such as statistics and traditional ML concepts like embeddings. Learning those fundamentals leads to a better understanding of today’s technologies.

This is a blog series that has been inspired by my personal background as a researcher. I enjoy both the ML and Cybersecurity areas and find it difficult to choose one over the other. I am passionate about conveying knowledge, especially theory and math, in an intuitive manner.

Why ML & Cybersecurity?

What reasons do you have to dive into ML and Cybersecurity, other than these being hot technical areas? First, as technical professionals, we all unavoidably become data analysts at some point, because we have to handle data and make sense of it. In the security area, thanks to advances in observability, efficient time series databases, and cheap storage, we have lots of data that keeps accumulating and remains unexplored. These data are a treasure trove of information about potential vulnerabilities in our systems, as long as we can decipher the patterns. In addition to all the text data, there is network data, since a network is a large data producer. The network produces packet captures that we typically study closely only when there is a problem. However, it is better to be proactive and have the ML tools to decipher network behavior. Finally, how about the unknowns? ML can help discover unknown malware, unknown exploits, and zero-days.

[Figure: Intersection between Cybersecurity & ML]

There is some intersection between Cybersecurity and ML that helps make the learning curve smoother. The figure above is by no means an exhaustive list of topics. What other topics do you think overlap (or not) between Cybersecurity and ML?

What this blog series is

The prevalent theme in the blogs is a data-driven approach to solving cybersecurity problems, as you can observe in the diagram below. We will take this journey from the perspective of the security analyst, with examples, concepts, and problems inspired by the field.

- ML & Cybersecurity
  - The basics
    - Dev environment zero to hero
    - Show me the data!
    - Let's explore: Exploratory Data Analysis (EDA)
      - Graphs
      - The maths
      - AI
  - Feature Engineering
    - Numerical
    - Categorical
    - Embeddings
    - Selection 
  - ML Algorithms for Cybersecurity
    - Classification
    - Evaluation

What this blog series is not

This is not an online class, nor is it a solution to all security problems using ML. There are a lot of good-quality, free classes for ML. I would personally recommend the Deep Learning AI courses, Google Foundational ML courses, and Andrew Ng’s classes. There is also a plethora of great, free courses for security, such as MIT Computer Systems Security, Web Security, and Responsible Red Teaming, to name a few.

What I did not find in these courses was the application of ML to solving Cybersecurity problems from the perspective of a security analyst. That is why I decided to create a blog series as a guide to get started combining these two areas.

Structure of the blogs

These blog posts will include the following parts:

- 1,000-foot view: The what, how, and why of the blog. If you do not have time to read the whole blog or prefer to do your own research, this is the place to start.
- Technical details: This will be the main part of the blogs, with plentiful code examples and intuitive explanations.
- Cybersecurity perspective: This part will cover cybersecurity applications of what was discussed in the blog. I will share my experiences and hurdles with using ML in the day-to-day flow.
- Going even deeper: Code, tools, and other resources, such as academic papers and books, will be provided at the end so that you can go deeper into the topics.

Prerequisites

In this series of blogs, I assume the following:

- You have used Python and have basic knowledge of the language’s keywords and structure.
- You have installed at least one Python package using pip.
- You have basic knowledge of the Linux command line.

If you have never used Python before, here are a couple of free courses that I recommend:

- Udemy: Python from Beginner to Intermediate - 30 mins
- FreeCodeCamp: Python for Beginners – Full Course - 4 hours

Where do we go from here?

I hope you are as excited as I am about starting this journey. First things first, though, we need a development environment that fits each one of us. We will discuss this in detail in the next post.

– Xenia

Going even deeper

ML Courses:
- Google Foundational ML courses: I have taken most of these courses; there is always something to learn, even if you are an advanced data scientist.
- Andrew Ng’s classes
- Awesome ML Courses
Cybersecurity Courses:
- MIT Computer Systems Security
- Web Security: great tool (Burp) and tutorials
- University of Maryland Software Security
- Responsible Red Teaming
- Splunk
- Awesome cybersecurity university
- MOBISEC: by one of my favorite researchers, who also has lots of great papers, including my personal favorite, Cloak and Dagger
More awesome lists:
- Awesome ML for Cybersecurity
- Awesome Python Data Science Books: definitely look at Khuyen Tran’s book and material for data science from this repo.
- Awesome list of awesome lists