This blog is the second part of our Data Analysis in Threat Intelligence series, where we focus on the tools and techniques used by Digital Shadows (now ReliaQuest) to convert raw, chaotic data into valuable intelligence for our clients. The first chapter of this series focused on the fundamental processes of data collection and data mining, two critical steps to ensure a smooth intake of data for our threat intelligence purposes.

Today, we will discuss how machine learning models can help us solve more complex data problems, and help us form the bigger intelligence picture.

Beware: Don’t be scared! We won’t delve too deeply into the mathematics of machine learning and we’ll keep the discussion around machine learning and its underlying principles at a high level.

## Machine Learning: Supervised vs Unsupervised

The first important distinction to make is the one between supervised and unsupervised learning. Supervised learning is used for predicting continuous variables (regression) or classifying data points into distinct categories (classification). Supervised learning methods need to be trained with the subset of a pre-existing dataset (training data) beforehand, and validated using another subset (test data) to ensure that the model has been fitted correctly. After this process, they can be used to predict outcomes based on new data.

On the other hand, unsupervised learning methods do not need to be trained or validated beforehand. They are run on the whole dataset, and their outputs are used to infer new conclusions about the data. Unsupervised learning is used for exploratory data analysis, which entails speculatively examining the data to see if some form of pattern is present. An example of exploratory analysis would be looking through the logs of a server to see if multiple connection requests are somehow linked (for example originating from the same IP address).

## Supervised Learning: Regression

Regression analysis allows us to determine the relationship between two or more variables (also known as dimensions). For example: how does the size of a data breach affect how much it is offered for on the dark web?

The simplest form of regression is linear regression. It is used when there is a linear relationship between the independent and dependent variable. The aim of linear regression is to draw a straight “line of best fit” through the data.;.

Where linear regression is not appropriate, for example due to a non-linear relationship between the dependent and independent variable, polynomial regression can be used. The technique is similar to linear regression, except that instead of a straight line, a curve is fitted with an equation containing x squared and x cubed values.

A combination of regression methods are often used in forecasting, for example to forecast how many more ransomware attacks can be expected in the coming months and years. This helps the user decide how to strategically allocate resources to defend against cyberattacks.

If one wishes to predict the odds of an outcome, for example the probability of ransomware victimization based on company turnover, logistic regression is used. This returns a probability between zero (indicating impossible) and one (indicating certain) of an outcome based on the value of a dependent variable, obtained from the output of the logistic function.

## Supervised Learning: Classification

Classification is one of the most common applications of supervised learning. There are several distinct methods to choose from. In all cases, classification accuracy is assessed by the confusion matrix and the receiver operator characteristics (ROC) curve. The confusion matrix shows the number of true positives, false positives, false negatives and true negatives obtained from running the model on the testing dataset. ROC curves are obtained by adjusting the sensitivity parameters of the model and plotting the true positive rate against the false positive rate. A line with the equation y=x indicates a classifier working at random. The closer the curve is to the y axis, the better the model performs. An example ROC curve is shown in fig 2.

K nearest neighbor assigns data points to the same class as its nearest neighbors. It calculates the euclidean distance between the datapoint in question and a K number of its nearest neighbors. It then assigns the new data point a class on a “majority vote” basis (i.e. the same class as the majority of its nearest neighbors). While this is a simple algorithm, its results are largely dependent on the distribution of the training dataset’s points. As a result, it is prone to a phenomenon known as overfitting, in which the thresholds are specific to the variance of the training dataset and do not take into account the greater variance of “real world” data.

A support vector machine (SVM) seeks to classify plotted data points by drawing a line known as a hyperplane to separate the two classes. Once that line is drawn the algorithm then seeks to maximize the margin between the hyperplane and the support vectors, which are parallel lines either side of the hyperplane that intersect at least one data point in the class corresponding to that side. It does this by altering the gradient of the hyperplane until the accuracy is maximized. A potential use for an SVM would be to classify malicious packets based on their length and payload.

Decision trees classify data points with binary decisions regarding their properties. They consist of nodes and branches. The root node is at the start of the tree, this determines which variable splits the data best. Intermediate nodes evaluate other variables against specific threshold values to decide which branch the data point should be sent down. At the end are the “leaf” nodes, which are the final classification categories.  For example, in a decision tree trained to classify DNS records by whether they were part of a cache-poisoning attack, one of the intermediate nodes would be the TTL value. If it is over a certain threshold, the tree would classify it as malicious. If not, the tree would send it to another node which, for example, tests the start of the IP address against a certain threshold. During the training process, the threshold values are decided and fine-tuned until the accuracy is maximized. Decision trees are prone to overfitting. To overcome this, a random forest can be used. This is where many decision trees are trained simultaneously, and their outputs are summated to give the probability of a data point falling in each class.

The classification techniques listed above all have their pros and cons, and which one is to be used depends on the task at hand. Beyond these, there are neural networks. These work in a manner that mimics the human brain, a network of perceptrons which are a mathematical model of the human neuron. While they have found some use in image recognition and object detection, they are experimental and computationally very expensive. As a result, they are normally used as a last-resort when other methods fail to achieve sufficient accuracy.

## Unsupervised Learning: Principal Component Analysis and Clustering

Principal Component Analysis (PCA) is an example of dimensionality reduction, which reduces the number of dimensions (variables) a dataset is spread across and reframes the data in a way that shows distinct groups. PCA starts by drawing a “line of best fit” across a standardised dataset with the aim of accounting for the greatest amount of variance within the entire dataset. This process continues until the number of principal components is the same as the number of original dimensions.

The data is then projected onto these principal components (i.e the principal components replace the original axes of the dataset). The result is two plots, the first being a PCA plot which shows the data distributed across the first two (or other if the user choses) principle components. From this one can see if there are distinct groups within the dataset. The second output is the loading plot. This shows the correlation between each of the original dimensions and each principal component. PCA can allow the user to visually spot distinct groups of data points. A clustering algorithm can also be run on the results with a higher chance of success than if the data points are left plotted along their original dimensions. The disadvantage of PCA is that because the data is not plotted against its original dimensions, it can be hard to interpret.

Clustering is normally used to explore the data to spot distinct groups of data points in a multi-dimensional dataset (one that has many variables). There are several techniques to choose from all with varying degrees of computational expense and complexity.

The simplest form of clustering is partition based. The most common example of which is k-means which moves the centrepoint of the clusters around until the cluster assignments for datapoints stop changing. K means is simple, but is vulnerable to being skewed by outliers. K medoids is another partition based clustering algorithm, which starts by selecting k number of data points at random to be medoids, and assigns all other data points to their nearest medoid. It then calculates the overall change in cost (reduction in distance between non medoids and their nearest medoids) associated with swapping each non medoid-medoid pair. While this algorithm is less vulnerable to outliers, it’s more computationally expensive than k means.

Density based assigns points in areas of high density to a single cluster. The clusters obtained from this method don’t always have a regular shape and can be quite convoluted. A frequently used implementation of this method is DBSCAN. It classifies a datapoint as a core point if it has a certain number of points within range. A reachable point is either within range of a core point, or is reachable via a path, outliers or noise points are neither core points nor reachable points. All points within range of a core point are assigned to the same cluster.

Distribution based clustering seeks to assign all data points which are part of the same normal distribution to their own cluster. The Gaussian mixture model works by creating a number of normal distributions with set parameters for mean and variance. It then calculates the probability of each point belonging to one of these distributions, and adjusts the distributions’ parameters to maximize this value across all data points. This is done iteratively until the probability is maximized in a process known as expectation-maximization.

## Text based analysis

A significant portion of the data we process is text based. Natural language processing covers a whole range of techniques from word distributions to sentiment analysis. To describe this area would take an entire dedicated blog, so for the purposes of this one simpler text-based analysis techniques will be summarized.

Before any analysis can be done on text based data, it needs to be normalized. Normalization actions include but are not limited to; converting all characters to lowercase, removal of some punctuation marks, setting all text to the same encoding format (ascii, unicode, UTF8 etc), removal or replacement of special characters (such as an e with an accent or an a with a circumflex).

If there are keywords of interest (for example brand or domain names) then a trie search can be run on the data. The text data is converted into a trie, a tree where each node is part of a word. The root node is blank, the next node contains a single letter and is linked to following nodes by the next letter in the sequence, as illustrated in Fig 4. For example when the word “digital” is added to a trie, the first node after the root contains the letter d. It is linked to the next node, containing “di” by the letter I, and so on until one reaches the end node containing the full word. On the way, one will encounter the words “dig” and “digit”. A trie search is a very efficient and scalable way of searching for the presence of a keyword within a text based dataset.