Show me the data (part 1)

Image credit: https://hackersbyrez0.com/

You have probably heard this saying many times before: “Garbage in, garbage out.” It is a big deal in machine learning projects, and even more so in ML cybersecurity projects, because a model can only be as good as the data you feed it. That is why the first pillar of this ML cybersecurity series is data. Data collection, preparation, and cleanup are essential parts of an ML project, and we will dedicate three blogs to these topics.

You will find the code from this blog in the notebook blog_show_data_1.ipynb.

1,000 ft view

Why?

First, we need a good dataset to build ML models around. What makes a good dataset?

  • Correctness: I cannot stress this enough! An invalid IP address such as 278.12345.213.545 makes no sense, yet your model will happily train on it without complaint.
  • No gaps: missing values can be a menace in your cybersecurity ML project. When we start preparing data, we will see how to fill these gaps using sleek Python libraries (a quick sanity check for gaps and duplicates is sketched right after this list).
  • No duplicates: we do not want to reprocess the same records over and over, and duplicates can also bias our results.
  • Structure: unstructured data is tough (but not impossible) to process, and performance takes a big hit with it. Data may be organized in Elastic documents, spreadsheets, SQL databases, etc.
  • Balance: highly imbalanced datasets are unfortunately common in cybersecurity. We have a lot of benign data in a client environment and a lot of malicious data in a red team engagement. We need to balance our data creatively without biasing it in the process.
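To make the gaps and duplicates checks concrete, here is a minimal sketch using pandas (which we will cover properly in part 2); the tiny table of connection records is hypothetical:

import pandas as pd

# A tiny hypothetical table of connection records (note the gap and the duplicate)
records = pd.DataFrame({
    "src_ip": ["10.0.0.5", "10.0.0.5", None, "10.0.0.7"],
    "dst_port": [443, 443, 80, 22],
})

print(records.isna().sum())        # count missing values per column
print(records.duplicated().sum())  # count fully duplicated rows

clean = records.dropna().drop_duplicates()  # drop gaps and duplicates
print(clean)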

What?

The diagram below shows the types of cybersecurity data and where each is covered in this series. Note that some of these will be analyzed in parts 2 and 3:

- Cybersecurity Data
    - Logs/Events (we are here)
        - syslog
        - server log
        - host event logs
    - Pandas (part 2)
    - Unstructured text
        - domains
        - urls
    - Binaries
    - Traces (part 3)
        - pcaps
        - netflow

How?

The actions for data revolve around collection, storage, and processing.

- Collect
  - Monitoring
  - Observability
  - Metadata
  - Public datasets
- Store
  - Time series
  - Structured
  - Unstructured
- Load for processing

Collect & Store

In this section, I will explain the collection and storage techniques and technologies. As you will see, you are probably already collecting data with your SIEM and observability tools, and most organizations store their data in a database.

Collect

Definitions: monitoring vs observability

First, let’s clarify that monitoring and observability are not the same thing; the terms are often confused. Monitoring is the process of collecting, analyzing, and using information to track events and make educated decisions. Observability is the ability to infer a system’s internal state from the data it emits: logs, traces, and metrics.

Examples of monitoring & observability

A Security Information and Event Management (SIEM) system, such as Splunk or Elastic Security, performs monitoring: it collects, analyzes, and uses security data for you. An Intrusion Detection System (IDS) produces observability data about the internal state of your system. Suricata, Zeek, and Snort are good examples of IDSs that produce logs about the state of the system. Syslog is another example of observable system state. These are ways you are already collecting data, and you can use them in ML projects.
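IDS logs are often straightforward to consume programmatically. Here is a minimal sketch that parses one line of Suricata’s EVE JSON output; the sample event is hypothetical but follows the EVE format:

import json

# A hypothetical alert line in Suricata's EVE JSON format
eve_line = '{"timestamp": "2024-02-25T14:09:07.000000+0000", "event_type": "alert", "src_ip": "10.0.0.5", "dest_ip": "93.184.216.34", "alert": {"signature": "ET POLICY example rule"}}'

event = json.loads(eve_line)
if event["event_type"] == "alert":
    print(event["src_ip"], "->", event["dest_ip"], ":", event["alert"]["signature"])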

You can set up your own observability stack with Telegraf, InfluxDB, and Grafana (TIG) running in Docker containers using this repository. Telegraf provides several plugins for network and log observability, InfluxDB is a time series database optimized for time-stamped data, and Grafana offers some sleek visualization capabilities.
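To give you a feel for how such a stack ingests data, here is a minimal sketch that writes a security metric to InfluxDB 2.x with the influxdb-client package; the URL, token, org, and bucket are placeholder assumptions for a local setup:

from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

# Placeholder connection details for a hypothetical local InfluxDB instance
client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")
write_api = client.write_api(write_options=SYNCHRONOUS)

# Record a count of failed logins observed on a host
point = Point("auth_events").tag("host", "webserver").field("failed_logins", 3)
write_api.write(bucket="security", record=point)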

Metadata

Metadata describes your data. A malware sandbox produces metadata about the sample under analysis, such as static analysis data (strings, headers, etc.) and dynamic analysis data (processes, network activity, etc.). Some open-source tools for security metadata that I recommend checking out are Cuckoo, LiSa for IoT, and MobSF for Android apps.
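As a tiny taste of static metadata, here is a minimal sketch that extracts printable ASCII strings from a binary, similar to the Unix strings utility; the file path is a placeholder:

import re

def extract_strings(path: str, min_len: int = 4) -> list[bytes]:
    """Extract printable ASCII strings from a binary file."""
    with open(path, "rb") as f:
        data = f.read()
    # Runs of printable ASCII characters at least min_len long
    return re.findall(rb"[\x20-\x7e]{%d,}" % min_len, data)

# Example usage with a hypothetical sample:
# for s in extract_strings("suspicious_sample.bin")[:10]:
#     print(s.decode("ascii"))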

Public datasets

Finally, there is a wealth of public datasets that you can use to train your ML models, i.e., teach them to recognize patterns. A few that I recommend (with more in the resources at the end of the blog):

  • Awesome cybersecurity datasets: a curated list of different types of datasets grouped in categories such as malware, passwords, etc.
  • SecRepo: grouped data in categories such as network.
  • CAIDA: not security data and specific to networks, but the largest network telescope data collection that you can use to train your models. It can be good for research; however, it is not labeled, and you do not know much about what you are observing beyond Internet background noise.

Store

There are a few different types of databases for your monitoring and observability data:

  • Time series: these revolve around real-time data and use as their “primary key” a timestamp that marks when the data was recorded. These databases are preferred for monitoring and observability data.
  • SQL: Structured Query Language databases are often likened to Excel spreadsheets, and they offer standardized querying, performance, and atomicity. A minimal storage sketch follows this list.
  • NoSQL: unstructured databases may also be a choice for monitoring data because of their retrieval speed when it comes to real-time data streams.
  • Vector: vector databases are great if you have text data, such as logs, that you convert to a single row of numbers (a vector) and want to retrieve quickly and efficiently. They are the latest and greatest for Large Language Models (LLMs) that you want to feed with your custom data and enhance using Retrieval-Augmented Generation (RAG).
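To make the structured option concrete, here is a minimal sketch that stores parsed log records in SQLite, which ships with Python; the table layout is an assumption for illustration:

import sqlite3

# A hypothetical schema for parsed syslog records
conn = sqlite3.connect("logs.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS syslog "
    "(timestamp TEXT, hostname TEXT, application TEXT, message TEXT)"
)

record = ("Feb 28 14:21:30", "example-hostname", "kernel", "This is a log message")
conn.execute("INSERT INTO syslog VALUES (?, ?, ?, ?)", record)
conn.commit()

# Standardized querying: all records from one host
for row in conn.execute("SELECT * FROM syslog WHERE hostname = ?", ("example-hostname",)):
    print(row)
conn.close()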

Load cybersecurity data

In this section, I will discuss cybersecurity data, with a focus on how to load it for processing. If you want tips on how to set up your development environment to run these examples, check out my previous post. The next several blogs will be dedicated to processing the data and drawing insights from it.

Logs/Events

There are simple ways to load log data, such as reading files directly:

# Read the entire log file into a single string
with open("my_log.txt", "r") as file_read:
    data = file_read.read()

This is the basic code to read a text file. I strongly recommend the Working with Files in Python post from Real Python if you have never worked with files or just need to brush up your knowledge.
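Since log files usually hold one record per line (and can be large), iterating over the file is often more practical than reading it all at once; here is a minimal sketch:

# Process a log file one line at a time without loading it all into memory
with open("my_log.txt", "r") as file_read:
    for line in file_read:
        line = line.rstrip("\n")
        # each line can now be parsed individually
        print(line)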

Syslog

One of the most common log formats is syslog, and naturally there are Python libraries for parsing it. I have personally used syslogmp. Here is a code example that uses it and transforms your data into a dictionary for further processing.

from syslogmp import parse

def load_syslog(log_message: bytes) -> dict:
    # syslogmp expects the raw syslog message as bytes
    message = parse(log_message)

    # Keep only the fields we need in a dictionary
    return {
        'timestamp': message.timestamp,
        'hostname': message.hostname,
        'message': message.message,
    }

# Example usage:
log_message = bytes('<133>Feb 25 14:09:07 webserver syslogd: restart', 'utf-8')
log_data = load_syslog(log_message)
print(log_data)

In this example, the function parses a message in the BSD syslog (RFC 3164) format and returns a structured dictionary that includes a timestamp, hostname, and message. Of course, not all syslogs are in BSD format. To parse those into a structured outcome, you may need to use regular expressions. The code below takes a generic log line as input and uses a regular expression to break it into a structured dictionary.

import re

def process_syslog(log_message):
    log_regex = r'^(\w{3}\s\d{1,2}\s\d{2}:\d{2}:\d{2})\s(\S+)\s(\S+):\s\[(\d+)\]\s(.*)$'
    # The above regex pattern captures the following groups:
    # 1. Timestamp
    # 2. Hostname
    # 3. Application
    # 4. PID
    # 5. Message

    match = re.match(log_regex, log_message)
    if not match:
        return None

    timestamp = match.group(1) # (\w{3}\s\d{1,2}\s\d{2}:\d{2}:\d{2})
    hostname = match.group(2) # (\S+)
    application = match.group(3) # (\S+)
    pid = int(match.group(4)) # \[(\d+)\]
    message = match.group(5) # (.*)

    return {
        'timestamp': timestamp,
        'hostname': hostname,
        'application': application,
        'pid': pid,
        'message': message,
    }

# Example usage:
log_message = 'Feb 28 14:21:30 example-hostname kernel: [12345] This is a log message'
print(process_syslog(log_message))

Now, this example can give you a headache. The regular expression on its own is admittedly messy; however, if you know how to write regexes, you can process any syslog or plaintext file. I tried to simplify it by copying the regex next to each group in the comments. For example, the timestamp variable has its regular expression group next to it:

timestamp = match.group(1) # (\w{3}\s\d{1,2}\s\d{2}:\d{2}:\d{2})

If you want to understand the regular expression better, take each of these group regexes and put them in regex101 and test them with the text example provided.
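You can also test a single group directly in Python; a quick sketch for the timestamp group, reusing the sample log line from above:

import re

# Test the timestamp group on its own against the sample log line
timestamp_regex = r'(\w{3}\s\d{1,2}\s\d{2}:\d{2}:\d{2})'
sample = 'Feb 28 14:21:30 example-hostname kernel: [12345] This is a log message'
print(re.search(timestamp_regex, sample).group(1))  # Feb 28 14:21:30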

After we process the regular expression, we return a handy data structure, called a dictionary, that organizes our data in an easily retrievable way.

Recap

We have discussed the importance of data and how to get it. We have only scratched the surface of loading data for your security ML project and analyzed one category of data: text. In the next part, we will add packet captures and binaries. We will also learn about Pandas, the package data scientists predominantly use to organize data.

– Xenia

Going Even Deeper

Public datasets


Xenia Mountrouidou
Senior Security Researcher

Xenia Mountrouidou is a Senior Security Researcher at Cyber adAPT with versatile experience in academia and industry. She has over 10 years of research experience in network security, machine learning, and data analytics for computer networks. She enjoys researching novel intrusion detection techniques, finding interesting patterns with machine learning algorithms, and writing Python scripts to automate boring tasks. Her research interests revolve around network security, Internet of Things, intrusion detection, and machine learning.