Sequential Pattern Mining Approach for Watering Hole Attack Detection

October 22, 2015 Anirudh Kondaveeti

Work jointly performed by Anirudh Kondaveeti & Jin Yu

As malware techniques continue to evolve, it becomes increasingly challenging to detect network security threats, especially Advanced Persistent Threats (APTs) that are orchestrated by sophisticated adversaries. An increasingly common strategy adopted by APT actors to carry out targeted attacks is the watering hole technique. Watering hole attacks target a group of users in an organization by infesting the websites that are most often visited by these users. An example of watering hole attacks is shown in Fig. 1, where the user visits watering hole domain A, then gets redirected to a relatively unknown malicious domain B within a short time window, typically lasting a few seconds. In this blog post, we will discuss the application of sequential pattern mining to detect coordinated network attacks such as the watering hole variety.

Fig. 1: An example of Watering hole attack

Outbound network traffic data are readily available in enterprise firewall/proxy server logs where we observe internal users’ web access activities. In the following example, we will analyze billions of connection records between internal users and external destinations. Data preprocessing steps, as shown in Fig. 2 and listed below, are applied to the proxy logs before the subsequent analysis:

Fig. 2 : Preprocessing proxy logs

Host name normalization: Appropriate host names are normalized to the second level; for example, pivotaluser.facebook.com is normalized to facebook.com. The goal is to establish a stable list of identifiers for destination hosts, instead of tracking different variations of a single entity. There are cases where host names are kept un-normalized. For example, subdomains of no-ip.com (a popular dynamic DNS site) are treated as different entities.
Filtering invalid host names: To account for data collection errors, host names containing invalid characters are filtered out.
Identifying unpopular domains: Malicious domains are typically unpopular among users. The analysis therefore focuses on finding associations with domains that are visited by only a small number of users.
User-specific sessionization: Sessionization is a data processing procedure, widely used for analyzing user activities within a certain time window. We sessionize proxy logs to create time-ordered domain sequences, in order to analyze sequential patterns later. A new session is created for a user if the time to the previous domain visit is greater than a pre-specified timeout threshold. Fig. 3 shows an example where the domains visited by a user are converted into two sessions, using a timeout threshold of 60 seconds. The time threshold (60 secs) to create the sessions varies depending on the application. Since watering hole attacks happen within a span of few seconds, we chose a threshold of 60 seconds or lesser.

Sequential Pattern Mining (SPM) aims to discover sequential patterns in time-ordered data. It is applied to various domains such as market basket analysis, border security, vehicle trajectory analysis and health care. The quality of a sequential pattern is typically measured by its support and confidence. The support of a pattern <A, B => C> is the proportion of sequences in which the pattern occurs. The confidence is the ratio between support(<A, B, C>) and support(<A, B>), essentially estimating the conditional probability of the sequence <A, B, C> given the prior presence of <A, B>.

Fig. 3: Sessions created from a user’s web access records

In order to detect watering hole attacks, we aim to find low support and high confidence sequential patterns of domains. These patterns correspond to uncommon domain sequences (low support) that however have high conditional probabilities of redirect (e.g. traffic to a particular website almost always gets redirected to another particular site). Especially, we are interested in capturing low support and high confidence patterns that are present in multiple users’ web access records.

The steps involved in the modeling process are outlined in Fig. 4. First, time-ordered domain sequences are created from user-specific sessions (cf. Fig. 3). Earliest connection time is used for domain ordering in case of multiple visits to the same domain within a session. For example, the sequence from Session 1 in Fig. 3 is <{googlevideo.com}, {twitter.com}, {youtube.com}, {doubleclick.net}, {google.com, googleanalytics.com}>, where the last element contains two domains visited at the same time. We then select subset of sequences that contain domains of interest (e.g. unpopular domains visited by a small number of users), and distribute them across multiple segment servers of Pivotal Greenplum Database (GPDB). Leveraging the MPP architecture of GPDB, a modified Sequential Pattern Mining algorithm (m-SPM) is run on sequences residing in different segment servers in parallel to detect correlated domains, that are potentially part of watering hole attacks.

Fig. 4: Model execution of sequential pattern mining

A standard implementation of SPM (e.g. in R) follows the Apriori principle: If a sequence is not frequent (has low support), then none of its super-sequences is frequent, i.e., support has the anti-monotone property: <A>⊂<A,B>, hence support(<A>) ≥ support(<A,B>). Given sequences found so far, SPM grows each sequence by one element at each iteration, using Apriori-based candidate sequence generation to ensure minimum support. In our case, minimum support is enforced to prune domain sequences that are not sufficiently supported by the data. For example, sequences that are seen in one occasion or two are generally not interesting. Furthermore, taking into account the fact that the number of users also follows the anti-monotone property: #users(<A>) ≥ #users(<A,B>), we modify the standard SPM implementation to also filter out sequences that do not meet the minimum-number-of-user condition at each iteration, e.g. sequences from just a single user are less interesting, hence can be pruned.

For each domain of interest, the modified SPM (m-SPM) is run only on the subset of sequences containing that domain. This may cause some sequential patterns to have artificially high confidence. Recall that the confidence of a pattern <A, B => C> is defined as the ratio of support(<A,B,C>) to support(<A,B>), which is equivalent to the ratio of #(<A, B, C>) (the number of sequences containing <A, B, C>) to #(<A, B>) (the number of sequences containing <A, B>) i.e.

If #(<A, B>) is evaluated on the subset of sequences, it may not reflect the popularity of <A, B> in the whole dataset. A high confidence value as evaluated on the subset can therefore be misleading. We therefore design a new confidence metric, called relative confidence, to better estimate the popularity the left-hand-side sequences. The relative confidence of a pattern <A, B => C> is calculated as the ratio of #(<A, B, C>) in the subset to the minimum of #(<A>) and #(<B>) in the whole dataset i.e.

Intuitively, relative confidence favors the pattern whose left hand side contains less popular domains. For example, in Fig. 5, even though the confidence of both patterns are high, the relative confidence for the pattern containing google.com and facebook.com on the left hand side is lower. The resultant patterns returned by m-SPM are ordered by the relative confidence value, with more interesting patterns having higher values of relative confidence.

Fig. 5: Relative confidence favors patterns with unpopular left-hand-side domains

The model execution steps as depicted in Fig. 4 can be easily adapted to obtain new patterns from new proxy logs in an incremental fashion. One additional step is to merge the new results with the existing set of patterns, mainly involving simple summation of sequence counts from two sets of results for the recalculation of support and confidence measures.

In this blog post, we have introduced a data science approach which involves sequential pattern mining to perform watering hole attack detection. The approach has been applied to proxy logs from a large enterprise network. It not only captured known watering hole attacks, but also flagged new attacks that were previously unknown to the more traditional network security solutions. In our next blog post, we will discuss a graph mining based approach for a similar use case.

Read more:

Learn more about Pivotal Greenplum Database
Learn more about Pivotal Data Science
Read more about data science-driven security techniques

About the Author

Anirudh Kondaveeti is a Principal Data Scientist and the lead for Security Data Science at Pivotal. Prior to joining Pivotal, he received his B.Tech from Indian Institute of Technology (IIT) Madras and Ph.D. from Arizona State University (ASU) specializing in the area of machine learning and spatio-temporal data mining. He has developed statistical models and machine learning algorithms to detect insider and external threats and "needle-in-hay-stack" anomalies in machine generated network data for leading industries.

Revelations From the Field – Life in the Operations Team

I have spent the last 25 years doing application development, and really have had little or no appreciation...

A Look At Spring Cloud Services For Pivotal Cloud Foundry

Developing modern software emphasizes loosely coupled services, resilience, ease of scaling and the like. B...

Sequential Pattern Mining Approach for Watering Hole Attack Detection

About the Author

Previous

Next

Sequential Pattern Mining Approach for Watering Hole Attack Detection

About the Author

Previous

Next

Related content in this Stream

VMware Tanzu empowers Netflix accelerates its service evolution and boosts the capabilities of its development teams. Tanzu helps to provide them with the platform to run on and scale.

Unveil regulatory compliance ease with VMware Tanzu Spring Runtime! Elevate audits, adhere to FIPS & NIST standards, benefit IT, DevOps, and Auditors.

Uncover open source risks and the 'Zero CVE' myth with insights on continuous lifecycle management. Discover how VMware Tanzu supports diverse projects effectively.

This 7-part blog series provides a roadmap for architecting a data science platform using VMware Tanzu. We'll delve into the building blocks of a successful platform that drives data-driven insights.

This 7-part blog series provides a roadmap for architecting a data science platform using VMware Tanzu. We'll delve into the building blocks of a successful platform that drives data-driven insights.

This 7-part blog series provides a roadmap for architecting a data science platform using VMware Tanzu. We'll delve into the building blocks of a successful platform that drives data-driven insights.

This blog provides a summary of VMware Tanzu CloudHealth news and product updates for the month of April, 2024

This 7-part blog series provides a roadmap for architecting a data science platform using VMware Tanzu. We'll delve into the building blocks of a successful platform that drives data-driven insights.

How VMware Tanzu CloudHealth helps customers uncover spiraling AWS Extended Support charges.

VMware Tanzu enhances Spring development with simplified operations, accelerated innovation, seamless microservices transition, increased security, and effortless scaling.

This 7-part blog series provides a roadmap for architecting a data science platform using VMware Tanzu. We'll delve into the building blocks of a successful platform that drives data-driven insights.

This 7-part blog series provides a roadmap for architecting a data science platform using VMware Tanzu. We'll delve into the building blocks of a successful platform that drives data-driven insights.

This 7-part blog series provides a roadmap for architecting a data science platform using VMware Tanzu. We'll delve into the building blocks of a successful platform that drives data-driven insights.

Bitnami-packaged open source software is loved by developers for its ease of use, which enables developers to directly pull a Bitnami package and seamlessly start using it with little effort.

VMware Tanzu announces the General Availability of AWS Commitment Discount Recommendations, which provides recommendations for all reservable services in AWS through VMware Tanzu CloudHealth.

Introducing VMWare Tanzu Data Hub, a self-managed Database as a Service (DBaaS) Platform, providing enterprises a way to host their internal DBaaS offering for internal business users.

In the cloud-native landscape, MCAs drive seamless compliance integration. Their expertise ensures proactive security measures align with regulatory standards for sustained innovation & collaboration.

Tanzu Application Platform brings innovation faster with more frequent feature updates. With 1.9, take advantage of enhanced DORA metrics visibility and improved compliance options for companies.