Pivotal Data Science Team Iterates Faster, Beats Existing Malware Detection Tools

April 20, 2015 Alexander Kagoshima


Perhaps unexpected, but true—data science innovations beat today’s best malware detection tools.

Security organizations know that most of today’s third-party SIEM, forensic, and malware detection applications have limits, mostly because they resort to matching byte patterns instead of analyzing network traffic patterns intelligently. This is why companies are creating their own agile security labs based on data science: to outpace the growing size and sophistication of malicious software.

Recently, a six-week Pivotal Data Science Labs project helped one of the world’s largest insurance providers understand where its existing, best-in-class malware detection tool was still falling short. In that time, Pivotal helped the customer detect malware it couldn’t previously detect, provided a new level of analytical power against malware threats, and advanced the customer’s growing, enterprise-wide data science group.

The Challenge—Data Science Applied to Malware Detection

According to a recent PwC survey of more than 9,700 people, security incidents increased 48% in 2014 to 42.8 million attacks, which is over 100,000 attacks a day. The 2014 DBIR summarizes the problem well: “Attackers are getting better/faster at what they do at a higher rate than defenders are improving their trade. This doesn’t scale well, people.” The report also highlights that more than 75% of compromises happen in days and less than 25% are discovered in days.

Ultimately, the malware detection challenge is about quickly and comprehensively finding suspicious communication patterns, because even one malware-infected user or server can cause financial loss or damage a brand. Within banks, insurance companies, retailers, healthcare providers, or any company storing personally identifiable information, malware behavior can be captured within proxy log files as the malicious apps try to communicate with and pass along information to their command and control servers outside the firewall.

A large part of the analytical problem has to do with the sheer volume of data and how to do a better job of finding the needles in the haystack. In this project, one month of network data amounted to almost 1 terabyte, 1.5 billion rows of connections, close to 75,000 user (employee) accounts, almost 100,000 internal server IP addresses, and 500,000 external web domains. In addition, the target analytical system needed to scale as much as 10-fold, and the team needed to prove the concept within 6 weeks of elapsed time.

Given the enormous data volumes involved, it also becomes clear why current security solutions fall short in analyzing this traffic intelligently. Until recently, data of this volume simply couldn’t be processed in a sensible amount of time. As mentioned, this is why today’s security landscape mostly focuses on matching known byte patterns in the network traffic, an operation that is computationally inexpensive. To intelligently analyze this kind of data and correlate its features, you need a sophisticated big data infrastructure like Pivotal Greenplum DB and new methods from data science, both of which are still mostly unheard of in the security space.

The Approach—Applying Data Science to Malware’s Network Traffic

The data science team turned to Pivotal Greenplum Database (GPDB), whose petabyte-scale, massively parallel, shared-nothing architecture moves the compute power to the data instead of moving the data to the compute. Choosing this type of platform was an incredibly important part of the approach and allowed data scientists to iterate through algorithms very quickly during development. The single most complex model ran in under an hour on the whole dataset of 1.5 billion rows, versus taking a day or more to run and test on Apache Hive. This saved data scientists a lot of time and allowed for a truly agile, rapid development approach.

The project’s security lab infrastructure included an on-premise quarter rack of Pivotal Data Computing Appliances (DCA), and the four main GPDB compute nodes included 64 cores, 256 GB of memory, and 9 terabytes of disk. An overarching premise of data science engagements of this kind is to avoid working on a sampled data set, because what you are looking for is a couple hundred malware connections hidden somewhere inside the company’s network traffic, which can often include more than 1 billion connections per month. In other words, if you want to find the needles in the haystack, you need the whole haystack.

The data was prepared and the models were run completely within GPDB, and multiple models, built with MADlib, R, and Python libraries, were used to identify different behavior patterns. In fact, the IT team expected various algorithms to be used and evolved, helping them stay ahead of evolving threats. Architecturally, the data processing framework included four main stages. First, raw proxy logs were landed, and unsuspicious domains and unsuccessful communications were filtered out. Then, domains and distinct users were extracted to whitelist additional domains. In the third step, external data was brought in to add intelligence, like threat likelihood or popularity, to existing domain information. In the last and most important step, the developed data science models were run to produce the results, which were usually lists of internal clients or external web domains, ranked by infection probability. Throughout these steps, the Pivotal Data Science team collaborated with the customer’s subject matter experts to improve the results.
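To make the four stages more concrete, here is a minimal Python sketch of how such a pipeline could be wired together. It is an illustration only, not the project’s code: the function name `run_pipeline`, the `(user, domain, status)` row layout, the `status < 400` success test, the `min_users` whitelisting threshold, and the `threat_intel` lookup are all assumptions made for this example. In the project itself, this work was done inside GPDB.

```python
from collections import defaultdict


def run_pipeline(rows, known_good, threat_intel, min_users=3):
    """Hypothetical four-stage sketch.

    rows: iterable of (user, domain, http_status) tuples from proxy logs.
    known_good: set of domains already considered unsuspicious.
    threat_intel: dict mapping domain -> external threat score in [0, 1].
    """
    # Stage 1: drop unsuccessful connections and known unsuspicious domains.
    # (Treating status < 400 as "successful" is an assumption.)
    stage1 = [(user, domain) for user, domain, status in rows
              if status < 400 and domain not in known_good]

    # Stage 2: extract distinct users per domain and whitelist domains
    # visited by many users. (This whitelisting rule is an assumption.)
    users_by_domain = defaultdict(set)
    for user, domain in stage1:
        users_by_domain[domain].add(user)
    candidates = [d for d, users in users_by_domain.items()
                  if len(users) < min_users]

    # Stage 3: enrich each remaining domain with external intelligence.
    # Unknown domains get a neutral placeholder score of 0.5.
    enriched = {d: threat_intel.get(d, 0.5) for d in candidates}

    # Stage 4: stand-in for the real models; rank by score, highest first.
    return sorted(enriched.items(), key=lambda kv: kv[1], reverse=True)
```

The last stage here is only a placeholder for the actual models. The point of the sketch is the shape of the flow: each stage shrinks or enriches the candidate set before the expensive scoring step runs.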

The modeling methods included algorithms based on graph theory, natural language processing (NLP), anomaly detection, and clustering. Graph theory allowed the team to see which of the customer’s internal clients made a lot of connections to obscure domains and how they interacted within the customer’s network afterwards. NLP methods were used to analyze suspicious domain names, since malware often tries to hide its traffic to command and control servers behind certain types of domain names. Anomaly detection and clustering helped build a baseline of standard profiles to identify any clients that deviated in a significant way. When it came to visualization, various R and Python libraries were used during development to help ensure algorithms were working properly, and these were also used to help the customer envision future production dashboards. However, the most valuable insight for the customer came from a simple list of suspicious nodes ranked by probability, handed over to the customer’s IT security team.
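As one illustration of what text-based scoring of domain names can look like, the sketch below ranks domains by the Shannon entropy of the characters in their first label, on the assumption that random-looking names score higher than ordinary words. This is a swapped-in, simplified technique chosen for the example. The article does not say which NLP methods the team used, and the names `name_entropy` and `rank_domains` are hypothetical.

```python
import math
from collections import Counter


def name_entropy(label):
    """Shannon entropy, in bits per character, of a domain label."""
    if not label:
        return 0.0
    counts = Counter(label)
    n = len(label)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def rank_domains(domains):
    """Rank domains by the entropy of their first label, highest first.

    Using only the first dot-separated label is a simplification;
    a real system would handle subdomains and public suffixes.
    """
    scored = [(d, name_entropy(d.split(".")[0])) for d in domains]
    return sorted(scored, key=lambda kv: kv[1], reverse=True)
```

For example, `rank_domains(["google.com", "xk2q9vz7.net"])` places `xk2q9vz7.net` first, because its eight distinct characters carry more entropy than the repeated letters in `google`. Entropy alone produces false positives (for instance, on legitimate hashed CDN hostnames), so a score like this would be only one feature among several.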

The Results of Data Science and Malware Detection

Many positive outcomes were achieved in this project. Most importantly, the approach found over a dozen infected nodes, all of which had previously gone undetected by the current, sophisticated production security solution. The IT team was able to quarantine these users, confirm infection, and remove the threats.

The team also saw that GPDB was much faster than the current approaches with Apache Hive™. For example, GPDB could group 1.5 billion connections by domain name in under 3 minutes and score 500,000 domain names using the previously mentioned NLP methods in under 10 seconds. They saw how this speed was essential for iterating over the malware detection models, allowing data scientists to quickly try variants or tweak parameters. It was also clear how new, innovative approaches with data science algorithms could help the IT team achieve security goals.
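To put the grouping figure in perspective, the short Python sketch below shows the same kind of operation on a toy scale, plus the throughput implied by the numbers above. The helper names and the `(user, domain)` row layout are assumptions for illustration. In GPDB, this step would be an SQL aggregation running in parallel across the segments rather than a single-process loop.

```python
from collections import Counter


def connections_per_domain(rows):
    """Count connections per domain from (user, domain) pairs."""
    return Counter(domain for _user, domain in rows)


def min_rows_per_second(total_rows, seconds):
    """Lower bound on throughput when a job finishes within `seconds`."""
    return total_rows / seconds


# 1.5 billion rows grouped in under 3 minutes (180 seconds) implies
# a sustained rate of at least roughly 8.3 million rows per second.
implied_rate = min_rows_per_second(1_500_000_000, 180)
```

At that rate, a change to a model’s input aggregation can be re-run in minutes rather than overnight, which is what makes the fast iteration described above practical.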

Lastly, Pivotal Data Labs provided the code and training so that the customer’s IT security team and internal data science team could collaborate to maintain and refine the developed models on their own.

Learning More and Recommended Reading

Pivotal Data Science Blog:
- Automotive: A Peek Under the Hood of The Connected Car: What It Does & How It Applies to IoT Systems
- Media: Using Data Science to Predict TV Viewer Behavior and Formulate Hit TV Shows
- Financial Services: Financial Compliance: New Frontiers with Data Science
- Bio-Science/Health: Re-Architecting Genomics Pipelines to Handle the Rising Wave of Data
- Article: Distributed Deep Learning on MPP and Hadoop
Pivotal Greenplum Database: Data, Downloads, and Documentation

Editor’s Note: Apache, Apache Hadoop, Hadoop and Apache Hive are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.
