Big Data Analytics for Network Security Monitoring

March 4, 2013 Derek Lin


After years of enterprise security breaches, one would think companies have learned much and are improving their security posture. In reality, the bad guys continue to have the upper hand in this game of cat-and-mouse. The intruders are creative and nimble in their efforts to penetrate network infrastructures, discovering new vulnerabilities that must be patched on a regular basis. While necessary, endpoint machine protection offers no guarantee of network security, and the users themselves pose an even bigger breach risk because they are susceptible to phishing. In this climate, enterprises are recognizing that their infrastructures are full of holes and that attacks can’t be prevented. A key security program is needed to identify threats early and mitigate their effects.

That’s easier said than done, as evidenced by the slew of security monitoring products out there. Security information and event management (SIEM) products are mostly about regulatory compliance. They allow you to store a ton of data, but that’s about it. Packet-level capture tools serve forensic purposes: they’re only useful if you know what to look for during post-incident management. Dashboarding tools may aggregate high-level statistics, but don’t identify specific, actionable threats. The majority of commercial traffic monitoring tools offer signature-based payload analysis, and are ineffective against zero-day attacks, for which no signature exists yet and vendor updates arrive too infrequently. Malware designers can adapt quickly, since they have the same level of access to vendors’ commercial signature files as any user.

Use Cases

Security practitioners are beginning to see the need to conduct behavior profiling to counter security attacks. Whether the attacks are carried out by malware (e.g., network topology reconnaissance) or by humans (e.g., illegitimate network resource access), they often exhibit behaviors that deviate from statistical norms. If security practitioners can model the norms, statistical outliers could point to potential attacks worthy of further investigation. This is called behavior-based anomaly detection, and it complements conventional signature-based detection.
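To make the idea of "modeling the norm and flagging outliers" concrete, here is a minimal sketch in Python. It is not a method described in this post: it swaps in the modified z-score, which is built on the median and the median absolute deviation, as one common robust outlier test. The function names, the 3.5 threshold, and the sample byte counts are all illustrative assumptions.

```python
from statistics import median

def robust_z_scores(values):
    """Modified z-scores based on the median and the median absolute deviation (MAD)."""
    values = list(values)
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # More than half the values equal the median; this simple sketch gives up here.
        return [0.0 for _ in values]
    # 0.6745 rescales the MAD so scores are comparable to standard z-scores on normal data.
    return [0.6745 * (v - med) / mad for v in values]

def flag_outliers(values, threshold=3.5):
    """Return the indices whose modified z-score exceeds the (assumed) threshold."""
    return [i for i, z in enumerate(robust_z_scores(values)) if abs(z) > threshold]

# Hypothetical daily outbound megabytes for one host; the last day is unusual.
daily_mb = [100, 110, 95, 105, 98, 102, 5000]
print(flag_outliers(daily_mb))
```

A median-based score is used here because a single extreme day inflates the mean and standard deviation enough to hide itself, while the median barely moves.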

In anomaly detection, every entity on the network is profiled and monitored. For example, one Big Data use case for security analytics is malware beacon activity detection. A malware-infected machine connects to an external IP address to get instructions from its command and control center. However, such connections are well hidden in the tremendous volume of legitimate outbound traffic, especially when the malware traffic happens over port 80, where HTTP web traffic occurs. By designing statistical indicators over the traffic, such as the entropy of between-connection time intervals or the number of bytes transferred over the previous seven-day period, we can gain information on what activity is normal and what is not.
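As a rough illustration of the first indicator, the sketch below computes the Shannon entropy of the gaps between successive connections from one host to one destination. The function name, the 10-second bin width, and the sample timestamps are assumptions made for this example, not details from any production system. The intuition is that a beacon calling home on a fixed schedule produces nearly identical gaps and therefore low entropy, while human browsing produces irregular gaps and higher entropy.

```python
import math
from collections import Counter

def interval_entropy(timestamps, bin_seconds=10):
    """Shannon entropy (in bits) of binned gaps between consecutive connection times."""
    ts = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ts, ts[1:])]
    if not gaps:
        return 0.0
    # Group gaps into fixed-width bins so that small jitter lands in the same bucket.
    bins = Counter(int(gap // bin_seconds) for gap in gaps)
    probs = [count / len(gaps) for count in bins.values()]
    return sum(-p * math.log2(p) for p in probs)

# Hypothetical connection times in seconds since the start of the observation window.
beacon_like = [0, 300, 600, 900, 1200]   # one connection every 5 minutes
human_like = [0, 5, 52, 352, 1552]       # irregular gaps

print(interval_entropy(beacon_like))
print(interval_entropy(human_like))
```

In practice the bin width matters: too narrow and a jittered beacon looks random, too wide and everything looks regular. An indicator like this is best treated as one signal among several and not as a verdict.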

Another Big Data use case for security analytics is the detection of anomalous user access to network resources. Users with the same function tend to demonstrate similar network resource access patterns. We can cluster users based on patterns learned from months of resource access logs via graph-theoretic analysis. By comparing this behavior-based user grouping with the user data managed in Active Directory, we can identify users whose resource access patterns do not conform to the Active Directory policy.
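One simple way to picture this, offered as a sketch and not as the analysis our team actually runs: link two users whenever their sets of accessed resources overlap strongly (Jaccard similarity), treat each connected component of that graph as a behavioral group, and flag users whose directory group differs from the majority of their behavioral group. All names, the 0.5 threshold, and the sample data below are hypothetical.

```python
from collections import Counter
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two sets of resource names."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_users(access, threshold=0.5):
    """Connected components of the graph linking users with similar access sets."""
    parent = {user: user for user in access}

    def find(user):
        # Union-find lookup with path halving.
        while parent[user] != user:
            parent[user] = parent[parent[user]]
            user = parent[user]
        return user

    for u, v in combinations(access, 2):
        if jaccard(access[u], access[v]) >= threshold:
            parent[find(u)] = find(v)

    groups = {}
    for user in access:
        groups.setdefault(find(user), set()).add(user)
    return list(groups.values())

def mismatched_users(clusters, directory_group):
    """Users whose directory group differs from the majority of their behavioral cluster."""
    flagged = set()
    for cluster in clusters:
        majority, _ = Counter(directory_group[u] for u in cluster).most_common(1)[0]
        flagged.update(u for u in cluster if directory_group[u] != majority)
    return flagged

# Hypothetical access logs (user -> resources touched) and directory assignments.
access = {
    "alice": {"git", "build", "wiki"},
    "bob": {"git", "build"},
    "carol": {"payroll", "ledger"},
    "dave": {"payroll", "ledger", "wiki"},
    "erin": {"git", "build", "wiki"},
}
directory = {"alice": "eng", "bob": "eng", "carol": "finance",
             "dave": "finance", "erin": "finance"}

clusters = cluster_users(access)
print(mismatched_users(clusters, directory))
```

Here "erin" is listed under finance in the directory but behaves like the engineering users, which is the kind of mismatch worth a closer look. Comparing every pair of users does not scale to a large enterprise, so a real system would need a more efficient graph construction.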

Challenges of Security Analytics

Unfortunately, security practitioners today face three big challenges when trying to solve the use cases mentioned above:

Challenge #1: Sheer volume and velocity of data

Network security is uniquely a Big Data problem. Machines on a network generate tons of data every day; within an enterprise, one terabyte of data is easily generated daily. Such a large volume practically prevents commercial security tools from performing long-range analysis, such as base-lining the behavior of every object on the network over a period of 30 days or more. Large data volume also hampers researchers’ ability to perform the data mining experiments needed to gain insights.

The volume of data within the enterprise

Challenge #2: The variety of data sources

In any given enterprise, there is a plethora of devices and data sources, each generating data in a different format. Within each device, there are potentially different logging levels and upgrades to the logging mechanism. Network security demands the flexibility to absorb new device logs and quickly evaluate and leverage the information. Currently, there are vendors that excel in collecting and normalizing the data. However, their focus is more on collection and less on preventive analytics. Additionally, these solutions don’t have the agility to quickly utilize logs from new kinds of devices, and tend to stick with known data sources. To be effective in network security analytics, we neither want to be nor should be constrained by the status quo.

Challenge #3: No marriage between network security engineering and data science

In her blog post Disruptive Data Science – Transforming Your Company into a Data Science-Driven Enterprise, Annika Jimenez reiterated the desperate need to develop data science skills and train existing analytics team members. In other words, enterprises need to develop horizontal data science expertise on top of a vertical knowledge base. We are seeing the same need and challenge in the network security industry.

Most security practitioners don’t have the proper math background to engage in the kind of statistical analysis behavior profiling requires. Behavior profiling is both a mathematical exercise and an art. Data science moves beyond the simple statistical metrics of mean and standard deviation. It addresses questions such as what behavior indicators to design, and how to evaluate and combine them in a principled way. Enterprises often do not have these data science skills readily at their disposal.

On the other hand, network security knowledge takes years to acquire. Just as most security practitioners don’t have the math training necessary to perform behavior profiling, many data scientists lack knowledge of network security monitoring or do not understand its nuances. Unless the two fields converge in academia, as data science and biology have in the field of bioinformatics, this is a difficult gap to cross in the industry.

These are the roadblocks that I believe Greenplum is in a unique position to address. Greenplum’s massively parallel processing (MPP) makes long-range profiling possible, which is crucial for detecting the low-lying and slow-moving behavior of advanced malware. The MPP architecture allows researchers to try out ideas and iterate faster in order to obtain insights. Greenplum’s IT Operation and Security Analytics Data Science Team boasts deep expertise in machine learning, and is well connected to the security industry. Within Greenplum, there is a synergy emerging between machine learning scientists and network security experts. In future blog posts, our team will outline more use cases, challenges, and opportunities in this growing space.
