Back to blog

A Journey Through _time: How to Resolve Common Time-Based Issues in Splunk

ReliaQuest 5 November 2020

ReliaQuest

“All we have to decide is what to do with the time that is given us.” – J. R. R. Tolkien

Time is one of the most critical details included in any logs, whether they are security related or not. This is doubly important for Splunk, as it can affect whether alerts trigger, whether analysis is performed correctly for those alerts, and even whether the logs are stored and retained properly.

Time issues often go unnoticed, as it’s natural for people to organize thoughts or events around time, often assuming time is correct and if anything is wrong, it has to be somewhere else. And sometimes time-based issues go misunderstood or poorly communicated, as any Sci-Fi fan who has dealt with time travel could tell you.

Today I’m going to go over how Splunk processes time, how to find problems with time, and how to fix some common (and not-so-common) issues we’ve seen before. This will help you identify gaps in your alerting and visibility that you weren’t aware of, and give you better understanding of some of your log sources as well.

WHY IS _TIME SO IMPORTANT?

Splunk excels at historical searches looking back in time and generates alerts on a near real-time basis instead of leveraging real-time correlation like traditional SIEMs use. For example, you can design an alert that looks over the last 70 minutes and runs once an hour, or design one that runs every minute and looks at the last 2 minutes. However, if your timestamps are wrong, you could easily fall outside of the range selected.

Splunk also stores events and handles retention based on buckets organized by event time, so if you have events with timestamp parsing issues, you could lose data before you intend to, or get stuck holding data for decades if Splunk thinks events are from the future

When responding to security alerts, the goal is to gain a full understanding of the chain of events leading up to compromise and identify if the infection has spread from there. If some of the log sources have the wrong timestamp, it’d be easy to misinterpret the chain of events – that is, if you even saw all the logs relevant within the time range you chose.

HOW SPLUNK DETERMINES _TIME

In Splunk, the primary internal time field is “_time”. This is stored in metadata in GMT and is converted to whatever a user’s time zone preference is. When integrating log sources, most will have TAs that will have props.conf to identify what portion of the log is the event timestamp and should be used as the “_time”. The props.conf will either specify a prefix for the timestamp or specify field if it’s JSON or KV extraction. You can even specify a time zone in the props.conf if you really need to, but we’ll talk more about that later.

Splunk will use a timestamp processor to interpret the timestamp. If nothing was set in the props.conf to tell Splunk where the timestamp is, it’ll use the timestamp processor to try to find the first thing that looks to Splunk like a timestamp. You can look at the timestamp processor at $SPLUNK_HOME/etc/datetime.xml. With a notable exception in 2019, this isn’t something you should need to ever edit.

The timestamp processor uses regex to identify timestamps, and having it try to ‘guess’ what in the logs is the timestamp is asking for trouble, since epoch timestamps are just 10 digit strings, and sometimes the first actual timestamp is not the event time. The timestamp processor also helps identify what time zone the log is in, assuming the log source has a GMT offset or some form of time zone info.

So what happens if Splunk looks at a log but doesn’t see anything like time? Splunk is built on _time, it needs to be something. If there is no timestamp found, Splunk will use the time from the Splunk server that received the log, which is also stored as _indextime for all logs. If there is a timestamp, but no time zone, Splunk will treat it as GMT. As you can imagine, this can cause issues, so let’s look at how to find and fix them

HOW TO FIND WHEN _TIME GOES WRONG

Luckily, it’s pretty easy to find if there are _time issues in Splunk. If you are trying to figure out if any of the timestamps in your logs are getting parsed wrong, just compare the _time (Event Time) to the _indextime (Splunk Receipt time).

Both _time and _indextime are normalized in GMT and stored in metadata for Splunk. They will always exist, so they are reliable sources in every environment. Just keep in mind when comparing _time and _indextime:

If _time is larger than _indextime, the events are “from the future”. They somehow got to Splunk before they actually happened. This clearly looks like a timestamp issue.
If _time is equal to, or slightly smaller than _indextime, the events are getting to Splunk shortly after they happen, which is normal.
If _time is much smaller than _indextime, there is either a timestamp issue, a networking issue, or something else unique about the log source.

You can use eval and stats to look at things in a variety of different ways. Here is an example of a search I ran to check some firewalls for issues:

As you can see, some of them are off by quite a lot. To help understand how timestamps get messed up, lets go over some types of issues we’ve seen before:

Double Header – Common

In distributed systems, there are plenty of cases where log sources aggregate and forward their logs before even reaching Splunk. Sometimes they end up in Splunk with double syslog headers, with one being from the original log source and another being from the syslog aggregator/forwarder. These are normally fairly minor _time discrepancies, so many might just ignore them in favor of larger priorities. Typically, to fix these within Splunk, you need to update the props.conf to account for the extra header, either by modifying the regex used to extract the log, or by adding in a TIME_PREFIX to match what’s before the true timestamp – even if that’s the first timestamp.

If you can have the syslog forwarder not add on a second header, that’d be the best fix to be applied outside of Splunk.

Wrong Time Field – Common

Some log sources have multiple timestamps by design. These can range from minutes, to hours, days, or even decades. For example, antivirus logs might tell you when a file is downloaded, when it was detected as malicious, and when it was removed from the device. It might also tell you when it last updated its malware database, when the specific IOC was added, etc. If not told anything specifically, Splunk will pick whichever one is first. Sometimes that might even be a 10/16 digit ID number in the beginning of the log that gets picked up as epoch time.

The solution again is in the props.conf. The ‘TIME_PREFIX’ allows you to tell it what is before the timestamp you want to use, but if it’s JSON or in a Key-Value extracted setup, you can also just specify the TIME_FIELD. For both of these, you can configure the TIME_FORMAT for the exact format of the timestamp.

Troubles in Time Zones – Uncommon

Source: XKCD

One of the most easily identified issues are time zone issues. If there is an issue, you’ll see differences between _time and _indextime that the size of common GMT offsets, such as 4 hours often for US Eastern time, or 7 hours for US Western time. It won’t always be whole numbers – remember that latency and half-hour offset time zones exist – but if it’s really close to matching up to an offset, time zones are the issue.

This happens when Splunk incorrectly identifies the time zone of the logs – either because there isn’t a time zone in the logs, or there is an incorrect one, or Splunk doesn’t understand the format the timestamp is in. As mentioned before, this means Splunk defaults to assuming GMT/UTC.

If the timestamp is in the wrong format, you can configure the TIME_FORMAT in the props.conf for Splunk to understand it. If the log source has the wrong time zone, you’ll need to fix that on the log source side.

Most vendors either have timestamps formatted with time zones by default or allow you to configure them that way. However, not all do – two notable exceptions are Cisco ASA and Palo Alto firewall logs.

If you can’t set a time zone in the timestamp in the logs, your best bet is to make sure all those log sources are set to GMT/UTC. Even the documentation from Palo Alto and Cisco recommend keeping their Firewalls on GMT.

It’s important to note that Splunk CAN adjust time zones to accommodate – as mentioned before, you can set time zone in the props.conf with the TZ setting. However, this gets messy quickly and is not normally recommended.

For example, you could set up a props.conf on a forwarder to handle Hong Kong ASA logs sending in GMT +8, but need another to handle your Tampa logs coming in GMT -4, and your other data centers will all need their own to cover them sending in their own local times. Managing all those configurations to ensure things come in correctly will add a lot of extra work for little to no gain. Not to mention a good chunk may break when daylight savings happens, or you open a small office which sends to an existing data center, but exists in a different time zone.

Now, this can be manageable if your business is all in one time zone, or you want all devices on one time zone. However, if you need to unify under one time zone anyway, GMT/UTC is often the best one for consistency, instead of the local time of your main office.

If you are seeing time zone issues like these, it means that time settings across your organization are inconsistent. This will require ensuring the team or individuals managing the firewalls update their settings to conform to a uniform standard – and GMT/UTC is typically the universal standard.

Legitimate Log Source “Delays” – Rare (Log Source Dependent)

Of all the issues in the article, this is both the most confusing and oftentimes the most impactful. Thankfully, it’s relatively rare, and understanding it will help you better understand how your own log sources work.

We first noticed this issue when investigating some false negatives for a customer. Proofpoint had alerted the customer that some potentially malicious activity had occurred; however our alerts to detect the same conditions didn’t trigger. While originally it was thought to be an issue with the wrong timestamp being parsed, the real issue was much more complicated.

This isn’t limited to Proofpoint TAP logs – for example, the Microsoft Office 365 Reporting Add-on also mentions that it “May delay trace logs up to 24 hours”. But Proofpoint TAP is a great use case to dig into.

Proofpoint TAP is an advanced email security tool. Like many such products, it analyzes emails for potential malicious attachments, and analyzes URLs included in emails and attachments. If it identifies the email as malicious, it will block the email. URLs are rewritten to go through Proofpoint – when clicked, Proofpoint will block the link if the URL is detected as malicious. All of this is logged, as is all allowed activity.

No security device is perfect, however, and sometimes Proofpoint allows malicious emails through or allows someone to click on a malicious link. However, Proofpoint will also keep track of what URL clicks and emails were allowed. If it later receives new intel that the URL or email it previously allowed was malicious, it will send a new log – telling you something previously allowed has been re-identified as potentially malicious.

Sometimes the update is within minutes, but sometimes it’s hours, days, or in rare cases weeks later that the threat is identified. Regardless of when the update happened, the _time of the original email/URL click stays the same – after all, that’s when the user clicked the URL, or got that malicious attachment.

So how do you handle cases where Splunk gets a log now describing activity from last week? Especially one you likely want alerting on since potentially malicious activity occurred.

Well, let’s first discuss what NOT to do. DO NOT try to treat this as a parsing issue and remap the _time to some other timestamp in the logs. For starters, the updated timestamp isn’t going to be in every log so it’d make parsing inconsistent. But more importantly, when the alert does fire, the investigation/analysis would go poorly since you’d likely be looking in the wrong time window. If I clicked a phishing URL two weeks ago, and you get an update today saying it was malicious, you need to be looking at my web traffic two weeks ago, not what I did today, to understand the situation. Having _time correct matters for more than just alerting – it matters for analyzing and context.

We can’t ignore the discrepancy and we want to make sure we are getting alerts and analyzing them correctly, but we don’t want to have huge searches looking over several weeks running constantly. There are two things that need to be done to correct the issue. First, adjust alerts to search based on both _time and _indextime. Second, make sure those responding to the incident understand how _time/_indextime works.

Both _time and _indextime can be used to filter searches. Although _indextime doesn’t have a handy drop down, it is just as potent as _time filters for creating efficient searches.

See the below search:

As you can see, the search is setup to look for the last 90 days’ worth of traffic, but it also uses the ‘_index_earliest=-3d@d’. This tells Splunk to look at events indexed in the last three days, but whose event timestamps are within the last 90 days. I went back 3 days to just show examples where the ‘clickTime’ of an even was drastically different than the indextime. For an alert, setting ‘_index_earliest=-70m’ and running it every 60 minutes covers all gaps.

Now, you may think running a 90 day search every hour is inefficient and would take forever to process, and normally you’d be right. However, ‘_index_earliest’ is nearly identical in efficiency as selecting the same range for _time. You could search ‘_index_earliest=-70m’ from ‘All Time’, and it would finish in nearly the same time as just searching for the last 70 minutes. The example search I ran above took only a few seconds to run since Splunk is smart enough to know I don’t need to look at events that weren’t indexed within the last three days.

REMEDIATING TIME-BASED ISSUES WITH RELIAQUEST GREYMATTER

At ReliaQuest, we understand that ensuring Splunk is accurately keeping track of the _time of your logs is a critical part of making sure you have proper visibility into your environment. That’s why checking and remediating time-based issues is a consistent part of our deep dive health checks that ReliaQuest GreyMatter customers receive. Between how Splunk stores data to how Splunk searches data, even to how people interpret data when looking at Splunk, everything is based on understanding the flow of time.

Hopefully, this article has given you insight into why time is such a critical component, how to detect issues, and how to resolve them when they do come up. While the types of errors covered isn’t an exhaustive list, it should give you starting point to address time-based concerns at various points. From log source settings, to parsing updates and search configuration changes, each proposed fix can be leveraged to accommodate a wide range of issues and help highlight Splunk’s flexibility.