The Power of Data Analysis in Threat Intelligence – Part 1: Data Collection and Data Mining

ReliaQuest 31 March 2022


In 2020, there was an estimated 59 trillion gigabytes of data in the world, most of which was created in the latter half of the 2010s. This figure continues to grow. To convert this raw, chaotic data into valuable intelligence, we use data mining tools and analytical techniques. Digital Shadows (now ReliaQuest) routinely uses several of these tools and techniques to help its clients identify trends within the cyber threat landscape. This has included our recent blogs on vulnerability intelligence and initial access brokers (IABs). In this blog, we’ll detail some of the common techniques used to support our research.

The guiding principle behind data analysis is the data, information and intelligence pipeline. Raw data comes in many forms including text based, numerical, date, boolean and many more. In order to be useful, it must be converted to information. This is achieved through cleaning the data before running statistical tests and/or analytical algorithms on it. The results of the analysis are then interpreted to present an overall view of the intelligence picture. Ultimately, this process exists to give anyone who needs to make security decisions the capability to make more informed ones.

This blog is the first in a two-part series in which we’ll dive into how we use data analysis in threat intelligence here at Digital Shadows (now ReliaQuest). Today we’ll focus on the initial steps we take before working on our data and on some use cases related to data visualization and data analysis. These models will help us cover the basics of data analysis before we delve into more advanced techniques.

First step: Cleaning and Extracting the Data

Data can come from many sources. The first step in data mining is finding where the data is located and determining whether it is appropriate and sufficient for the intended analysis. This will involve studying the data schema and structure (or lack thereof) and having an unambiguous knowledge of how the data was collected. Knowing this is fundamental for determining what conclusions the data can theoretically support. The following factors must also be considered: scope for sample bias, collection gaps and sample size.

Depending on the analysis being performed, the presence of sample bias doesn’t necessarily make a dataset unsuitable if the conclusions are heavily caveated. An incomplete conclusion drawn from a slightly biased sample can still offer some useful insights. Data from multiple sources can be combined to perform an integrated analysis. When performing such a task, it is vital to consider how the datasets are related.
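As a rough illustration of combining two sources, the sketch below joins rows on a shared key and reports which keys had no match. The field names (`cve_id`, `mentions`, `cvss`) and the placeholder records are assumptions made up for this example, not data from our research.

```python
def left_join(primary, secondary, key):
    """Attach fields from `secondary` to each row of `primary` sharing `key`.

    Returns the joined rows plus the keys that found no match, so that
    collection gaps between the two sources stay visible.
    """
    # If `secondary` repeats a key, the last row wins. Whether that is
    # acceptable depends on how the two datasets are related.
    index = {row[key]: row for row in secondary}
    joined, unmatched = [], []
    for row in primary:
        match = index.get(row[key])
        if match is None:
            unmatched.append(row[key])
            joined.append(dict(row))
        else:
            joined.append({**match, **row})
    return joined, unmatched


# Hypothetical placeholder records, for illustration only.
forum_mentions = [
    {"cve_id": "CVE-A", "mentions": 12},
    {"cve_id": "CVE-C", "mentions": 3},
]
severity_scores = [
    {"cve_id": "CVE-A", "cvss": 9.8},
    {"cve_id": "CVE-B", "cvss": 5.3},
]

joined, unmatched = left_join(forum_mentions, severity_scores, "cve_id")
print(joined)
print("no match for:", unmatched)
```

Keeping the unmatched keys makes it easier to judge whether the combined dataset still supports the intended conclusions.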

After the dataset has been extracted, and is known to be appropriate for the task at hand, it must be cleaned. Data cleaning actions will depend on the type of data in each column, but they include:

Removing missing values
Removing illogical values (e.g. numbers outside an expected range)
Converting text to all lowercase and removing punctuation
Ensuring that data is stored as the correct type within the tool being used
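
The cleaning actions listed above could be sketched in plain Python as follows. This is a minimal example, assuming each record has a free-text `vendor` field and a numeric `score` field expected to lie between 0 and 10. Both field names and the sample rows are hypothetical.

```python
import string


def clean_records(records, score_range=(0.0, 10.0)):
    """Apply the four cleaning actions to a list of dict rows."""
    low, high = score_range
    strip_punct = str.maketrans("", "", string.punctuation)
    cleaned = []
    for row in records:
        vendor, score = row.get("vendor"), row.get("score")
        # Remove rows with missing values.
        if vendor in (None, "") or score in (None, ""):
            continue
        # Store the score as the correct type; drop values that cannot be parsed.
        try:
            score = float(score)
        except (TypeError, ValueError):
            continue
        # Remove illogical values outside the expected range.
        if not low <= score <= high:
            continue
        # Convert text to lowercase and remove punctuation.
        vendor = vendor.lower().translate(strip_punct).strip()
        cleaned.append({"vendor": vendor, "score": score})
    return cleaned


# Hypothetical raw rows, for illustration only.
raw = [
    {"vendor": "Example, Inc.", "score": "7.5"},
    {"vendor": "Other-Vendor", "score": 42},
    {"vendor": None, "score": "5.0"},
    {"vendor": "ACME!", "score": "n/a"},
]
print(clean_records(raw))
```

Dropping rows is only one option. Depending on the analysis, missing values might instead be imputed or flagged.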

After finding, extracting, joining and cleaning the data, the analysis can commence.

A Universal Truth: The Normal Distribution

When sampling a continuous variable, most values will be found near the average. As one moves away from the average in either direction, the frequency of values decreases. Extreme values are rare. This is because continuous variables are a summation of multiple factors, and there are more possible combinations that make the middle values than the extremes.

To visualize these values, we often use a model called the “normal distribution”. A normal distribution is a probability distribution used to model phenomena that have a default behavior and cumulative possible deviations from that behavior. Wherever one sees a mean value (i.e. the arithmetic average of a set of values), there will almost certainly be a bell curve behind it. This is the underpinning principle behind most statistical analysis techniques.

In an ideal normal distribution, the mean, median and mode are equal. About 68% of the data falls within one standard deviation of the mean, 95% within two standard deviations and 99.7% within three. Real-world distributions often depart from this ideal: they can have skew, in which the peak of the curve is biased in a particular direction, as well as kurtosis, in which the curve is broader or narrower than a true bell curve.
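
The coverage figures for one, two and three standard deviations can be checked with Python's standard library. The sketch below uses `statistics.NormalDist`. The helper name `share_within` is our own.

```python
from statistics import NormalDist

# A standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.1%}")
```

The result is the same for any mean and standard deviation, because the rule is expressed in units of standard deviations.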

**Normal Distribution of CVSS scores sourced from the MITRE CVE database.**

Normal distributions are universal, and visualizing them can help us determine the overall spread of the data. For example, the figure above shows the distribution of CVSS scores. While it may not be a very smooth bell curve, it does show that most vulnerabilities have a score between 5 and 7 and that extreme CVSS scores are rare.
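
A distribution like this can be inspected before any plotting by counting values per bin. The sketch below is a minimal text histogram. The scores are made up for illustration and are not taken from the MITRE CVE database.

```python
import math
from collections import Counter


def score_histogram(scores, bin_width=1.0):
    """Count scores per bin, labelling each bin by its lower edge."""
    return Counter(math.floor(s / bin_width) * bin_width for s in scores)


# Made-up scores, for illustration only.
sample_scores = [2.1, 4.3, 5.0, 5.5, 5.9, 6.1, 6.5, 6.8, 7.2, 7.5, 8.8, 9.8]

hist = score_histogram(sample_scores)
for edge in sorted(hist):
    print(f"{edge:4.1f} | {'#' * hist[edge]}")
```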

Visualising Data: The Box Plots

The graphs used to visualise and show patterns in data come in many forms. The best graphs are simple and intuitive, while simultaneously conveying as much useful information as possible. They must be able to convey a message without requiring more than a couple of lines of explanatory text below.

The most appropriate graph types will depend on the type of analysis being performed and what conclusions one wishes to portray. Bar graphs are mainly used to compare categories, or show a trend over time. Line graphs are normally used to compare continuous variables, but can also be used when one wishes to depict several trends over time without overcrowding the page. In addition to these commonly used graphs, there are highly specialised plots.

One example that has been used in a Digital Shadows (now ReliaQuest) blog on Initial Access Brokers is the boxplot. This model is used to visualize and compare several normal distributions. The figure below depicts a boxplot used to compare the price distributions of several types of initial access.

**A box plot depicting the price distributions of several types of initial access**

The box indicates the interquartile range, where half of the data points lie. The whiskers on the ends of the box indicate the extreme ends of the distribution, where each whisker accounts for a quarter of the data. Outliers are usually depicted separately, but have been omitted from this particular analysis.
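
The numbers a box plot draws can be computed directly. The sketch below derives the quartiles with `statistics.quantiles` and, like the plot described above, places the whiskers at the minimum and maximum rather than separating outliers. The access-type names and prices are hypothetical.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return the statistics a box plot draws, with whiskers at the min and max."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "iqr": q3 - q1,  # the height of the box
    }


# Made-up listing prices in USD, for illustration only.
hypothetical_prices = {
    "access_type_a": [100, 200, 300, 400, 500],
    "access_type_b": [50, 75, 100, 150, 2000],
}

for name, prices in hypothetical_prices.items():
    print(name, five_number_summary(prices))
```

Comparing the `median` and `iqr` values across categories gives the same reading as comparing the boxes side by side.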

From this graph we can see that WebShell is the most valuable access type on average and shows the greatest spread. We can infer from this that WebShell allows a threat actor the highest level of access on average, but that the level of access also varies greatly. RDP shows the lowest average value despite offering effectively a “hands on keyboard” level of access to a machine. From this it can be inferred that most RDP offerings are low-privileged machines, and are of little value to threat actors.

Test your Hypotheses: Null Hypothesis

Null hypothesis based statistical testing is a model used to determine whether the relationship seen between samples is reflected in the population, or is down to sampling error. The starting assumption is called the null hypothesis, which assumes that there is no significant relationship between the samples and that any relationship observed is by chance alone. The type of test used will depend on the type of relationship being tested for, the type of data within each sample and whether population statistics are known. The table below summarizes some of the most commonly used tests.

**Commonly Used Statistical Tests**

The output figures of these statistical tests are converted into a ‘p’ value. A p value is the probability, expressed as a decimal, of observing results at least as extreme if the null hypothesis is correct. If this value is less than 0.05 (i.e. there is a less than 5% chance of us seeing the result if there is no significant difference between the populations) then we reject the null hypothesis and assume the results to be statistically significant.
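
To make the meaning of a p value concrete, the sketch below estimates one with a permutation test on the difference between two sample means. A permutation test is a stand-in chosen here because it needs no statistical tables. It is not necessarily one of the tests used in our research, and the sample values are invented.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Estimate how often a difference in means at least as large as the
    observed one arises when group labels are assigned at random, which
    is what the null hypothesis assumes."""
    rng = random.Random(seed)

    def mean(xs):
        return sum(xs) / len(xs)

    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Adding 1 to both counts keeps the estimate above zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Invented samples, for illustration only.
p = permutation_p_value([1, 2, 3, 4, 5], [101, 102, 103, 104, 105])
print("reject the null hypothesis" if p < 0.05 else "fail to reject the null hypothesis")
```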

An example of ANOVA in action can be seen in the IAB graph above, where it shows a significant difference between the types of access as indicated by the presence of p<0.05 in the plot title. This tells us that we can draw conclusions from the graph.
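
For readers who want to see what ANOVA computes, the following is a minimal sketch of the one-way ANOVA F statistic: the variance between group means divided by the variance within groups. The function name and the sample groups are our own illustration, not the code behind the plot.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of individual values around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    df_between, df_within = k - 1, n_total - k
    if df_between < 1 or df_within < 1 or ss_within == 0:
        raise ValueError("need at least two groups and some within-group variation")

    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Invented samples, for illustration only.
f_stat, df_b, df_w = one_way_anova_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
print(f"F = {f_stat:.2f} with ({df_b}, {df_w}) degrees of freedom")
```

Turning F into a p value requires the F distribution, which the standard library does not provide. In practice a routine such as `scipy.stats.f_oneway` returns both the statistic and the p value.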

Conclusion

This blog covered the early stages of the data analysis process, spanning data collection through to basic analytical techniques. Some data problems, however, require more advanced solutions. Such problems often require the use of machine learning methods, which will be covered in the second part of this blog series. If you’d like to access the constantly updated threat intelligence library from Digital Shadows (now ReliaQuest), which provides insights on a wide range of cyber threats, sign up for a demo of SearchLight (now ReliaQuest’s GreyMatter Digital Risk Protection) here.