Financial Compliance: New Frontiers with Data Science

February 18, 2015 Niels Kasch

Joint work performed by Niels Kasch and Mariann Micsinai of Pivotal’s Data Science Labs.

Financial institutions must overcome the shortcomings of existing compliance pipelines that do not live up to the standards of expanding new regulations. In this blog article, we share experience from real-life engagements and show how an innovative, agile and real-time computational platform can re-architect compliance workflows and provide several advantages over existing solutions.

Our solution, a data lake platform coupled with cutting-edge data science techniques, helps to identify underlying risk and fraud while reducing the burden of manual review on the compliance department. The approach also advocates a flexible user interface to promote an adaptive, continuously learning compliance framework.

Current Challenges

After the financial crisis of 2008, banks were subjected to increasingly intense regulatory scrutiny. US regulatory agencies tightened their enforcement of responsible conduct and put renewed vigor into the elimination of unfair, deceptive or abusive practices. As a result, a myriad of strict rules under the comprehensive Dodd-Frank and Basel Committee regulations are enforced. Violations of any of these rules bring mounting fines and litigation costs, as is evident in recent news headlines quoting heavy fines.

Banks face immense challenges to revise their compliance and governance infrastructure to meet regulatory standards in a timely manner. These challenging areas for financial institutions include:

  • Aggregation and real-time analysis of large, diverse and rapidly growing datasets across the institution
  • Real-time data reporting at all levels of granularity
  • Overburdened manual review within compliance departments
  • Flexibility to respond to new regulations, e.g., changing OFAC lists, new sanctions against Russia, and the implementation of rules under the Dodd-Frank Act that are not yet in force

Next Generation Compliance Platform

Current compliance systems focus on only a small part of compliance needs, be it archiving or basic analytics. To prevent the next Bernie Madoff or Libor scandal, a next generation storage and processing platform is required.

The platform needs to address three main components:

  1. A compliance data lake with:
    1. The ability to handle the archiving and storage requirements of large volumes of complex, structured (e.g., transactions) and unstructured (e.g., text data: emails and chats) data.
    2. A home to easily integrate various data assets such as transactions, securities information, and governmental watch lists.
    3. An open and extensible architecture to better facilitate enhancements, such as an increase in capacity to address scale and performance requirements or an integration with new technologies as the business requirements change (e.g., real time data ingest and model scoring).
  2. Capabilities for scaling predictive models, advanced machine learning, and natural language processing over large, compliance-related data sets while supporting agile data science methodologies.
  3. Support for additional compliance apps and user interfaces that drive analytical insights and decision making for business users as well as capture new feedback-based data to increase the predictive power of the models.

The platform that brings all of the above components together is Pivotal’s Big Data Suite (BDS), with the option to add Pivotal Cloud Foundry (PCF) as a PaaS for additional application or integration workloads. While PCF is the leading enterprise PaaS, Pivotal’s BDS allows for extensive storage and agile analytics on massive data sets using three paths: an MPP and column store database, in-memory data processing, or Hadoop. This combination is a data scientist’s dream because it facilitates agile data exploration and data integration coupled with advanced machine learning algorithms (e.g., MADlib and MLlib) to derive the most value from your data.

The Next Generation Financial Compliance Solution

Before getting into the details of the analytical components, it is worth pointing out how the architecture can extend for similar analytical scenarios with additional requirements for high-scale applications or integrations, as with financial trading information. These cases can benefit from inserting PaaS-based services at various places within the architecture to provide automated scale, lower development complexity, and fast, iterative development cycles. More importantly, the next generation financial compliance solution is driven by advanced analytics capabilities. Next, we will address each of the analytical components individually.

Fig. 1: Overview of next generation analytics solution for financial compliance. The 3 main components (the data lake, analytics, and feedback-centric user interface) and their interactions are depicted.


The Financial Compliance Data Lake

The data lake is a data-centered architecture, where all types of data come together in one place. The key here is to bring as much information together as possible to support the analytics behind financial compliance. For example, to analyze emails and chats, the data lake can serve as the archiving solution while simultaneously making the data available for analytics. Pivotal’s Big Data solution incorporates an MPP RDBMS that enhances data integration tasks such as resolving and joining entities across multiple and diverse data sources. Such a capability also allows for the integration of unstructured text with structured data (e.g., transactions, trades). This makes catching insider trading easier since compliance analysts can link trades to various communication channels. But the data lake does not stop there. For example, an organization’s hierarchy can be part of the data lake as well and support legislative requirements that prohibit certain interactions within a company (e.g., a Chinese wall policy between traders and trade clearing). Other data assets can also be incorporated into the data lake to benefit compliance use cases, including updates or retention policies for:

  • News feeds
  • OFAC watch lists (e.g., countries at high risk for money laundering)
  • Securities information
  • Portfolio information and risk metrics
  • Other communication channels (phone conversations, social media, text messages)
  • Access logs (building entry logs, system access logs, weblogs)
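
To make the idea of linking trades to communication channels concrete, here is a minimal sketch. It is not the pipeline from our engagements: it uses Python's built-in `sqlite3` module as a stand-in for an MPP RDBMS, and the table layout, the column names, the function `link_trades_to_messages`, and the 60-minute window are all assumptions made for illustration.

```python
import sqlite3


def link_trades_to_messages(trades, messages, window_minutes=60):
    """Pair each trade with messages its trader sent shortly before it.

    trades:   iterable of (trade_id, trader, ts) with ts in epoch seconds
    messages: iterable of (message_id, sender, ts) with ts in epoch seconds
    Both schemas are hypothetical and exist only for this sketch.
    """
    conn = sqlite3.connect(":memory:")  # stand-in for the data lake's RDBMS
    conn.execute("CREATE TABLE trades (trade_id TEXT, trader TEXT, ts INTEGER)")
    conn.execute("CREATE TABLE messages (message_id TEXT, sender TEXT, ts INTEGER)")
    conn.executemany("INSERT INTO trades VALUES (?, ?, ?)", trades)
    conn.executemany("INSERT INTO messages VALUES (?, ?, ?)", messages)

    # Join structured trades to communication records on the person,
    # keeping only messages sent inside the window that ends at the trade.
    rows = conn.execute(
        """
        SELECT t.trade_id, m.message_id
        FROM trades AS t
        JOIN messages AS m
          ON m.sender = t.trader
         AND m.ts BETWEEN t.ts - ? AND t.ts
        ORDER BY t.trade_id, m.message_id
        """,
        (window_minutes * 60,),
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    sample_trades = [("T1", "alice", 10_000), ("T2", "bob", 20_000)]
    sample_messages = [
        ("M1", "alice", 9_000),   # inside the window before T1
        ("M2", "alice", 1_000),   # too early
        ("M3", "bob", 20_500),    # sent after T2
        ("M4", "bob", 19_990),    # inside the window before T2
    ]
    print(link_trades_to_messages(sample_trades, sample_messages))
```

In a real deployment the join key would come out of entity resolution (matching trader IDs to email addresses and chat handles) rather than a shared name column, and the linked messages would then feed the analytics described in the next section.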

Financial Compliance Analytics

The analytics pipeline is the heart of the solution. It determines whether a given trade or communication item violates a regulation. The platform supports traditional e-discovery methods, such as search, but, more importantly, it features a complete machine-learning pipeline with multiple predictive models and modeling techniques:

  • Classification: To identify and filter out irrelevant messages such as automated notifications, newsletters, out-of-office messages, and print job notifications.
  • Graph analysis: To build communication profiles of individuals. This technique is often used in security analytics and malware detection to identify anomalous behavior. Graph analysis can establish hot spots of fraudulent activity based on who is talking to whom.
  • Text analytics: To identify the language behind fraud and to determine the sentiment and certainty in a trader’s language before and after executing trades. Semantically, it can interpret whether too much information (e.g., deal coloring) was shared with communication partners.
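
As an illustration of the graph analysis step, the sketch below builds a simple communication profile per person from a baseline period and flags sender-recipient pairs that appear only in a more recent period. This is a deliberately simplified stand-in for the anomaly detection described above; the function names `build_profiles` and `unusual_contacts` and the `(sender, recipient)` input format are assumptions, not part of any Pivotal product.

```python
from collections import Counter, defaultdict


def build_profiles(messages):
    """Map each sender to a Counter of recipients.

    The result is a weighted, directed "who talks to whom" graph,
    stored as an adjacency map. `messages` is an iterable of
    (sender, recipient) pairs, a format assumed for this sketch.
    """
    profiles = defaultdict(Counter)
    for sender, recipient in messages:
        profiles[sender][recipient] += 1
    return profiles


def unusual_contacts(baseline, recent):
    """Return sorted (sender, recipient) edges seen in `recent` only.

    An edge that never occurred in the baseline period deviates from
    the sender's established profile and may merit analyst review.
    """
    known = build_profiles(baseline)
    flagged = set()
    for sender, recipient in recent:
        if recipient not in known.get(sender, {}):
            flagged.add((sender, recipient))
    return sorted(flagged)


if __name__ == "__main__":
    baseline = [("alice", "bob"), ("alice", "bob"), ("bob", "carol")]
    recent = [("alice", "bob"), ("alice", "dave"), ("carol", "dave")]
    print(unusual_contacts(baseline, recent))
```

A production system would weigh edges by frequency and timing instead of treating every new contact as suspicious, and would combine this signal with the classification and text analytics results before surfacing anything to an analyst.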

Financial Compliance User Interface and Feedback

The point is not to replace compliance analysts but to focus their attention on actual fraud cases. To enable effective compliance reviews, a dynamic user interface is an absolute must. The user interface also provides the opportunity to make the system smarter as a whole. For example, a properly designed UI can solicit decision-making information from compliance analysts that can be automatically integrated into a feedback loop for analytics, yielding a continuous learning system that gets smarter over time. Such feedback is instrumental to the system for the following reasons:

  • Keeps the system up to date, for example, in the face of changing regulations.
  • Provides more algorithmic training information in the form of fraudulent trades or communications.
  • Injects additional domain knowledge such as new expressions used by traders or new types of fraudulent transactions.
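
The feedback loop can be sketched as a model whose training data grows with every analyst decision. The example below is a toy, assuming a bag-of-words scorer with add-one smoothing; the class `FeedbackScorer` and its methods are hypothetical names for illustration and do not describe the models used in our engagements.

```python
import math
from collections import Counter


class FeedbackScorer:
    """Toy text scorer that learns from analyst decisions (illustrative only)."""

    def __init__(self):
        # Token counts per analyst verdict: True = violation, False = cleared.
        self.counts = {True: Counter(), False: Counter()}

    def record_feedback(self, text, is_violation):
        """Fold one analyst decision from the UI back into the model."""
        self.counts[is_violation].update(text.lower().split())

    def score(self, text):
        """Sum of smoothed log-odds per token; above zero leans toward violation."""
        vocab = set(self.counts[True]) | set(self.counts[False])
        size = len(vocab) + 1  # extra slot for unseen tokens
        total_bad = sum(self.counts[True].values())
        total_ok = sum(self.counts[False].values())
        result = 0.0
        for token in text.lower().split():
            p_bad = (self.counts[True][token] + 1) / (total_bad + size)
            p_ok = (self.counts[False][token] + 1) / (total_ok + size)
            result += math.log(p_bad / p_ok)
        return result


if __name__ == "__main__":
    scorer = FeedbackScorer()
    print(scorer.score("off the desk"))  # no feedback yet, so the score is neutral

    # Analyst verdicts captured through the review UI.
    scorer.record_feedback("keep this off the desk", True)
    scorer.record_feedback("lunch at noon", False)

    print(scorer.score("off the desk") > 0)   # now leans toward violation
    print(scorer.score("lunch at noon") < 0)  # leans toward cleared
```

The design point is that new expressions used by traders enter the model as soon as an analyst labels them, without waiting for a separate retraining project.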

In our compliance pipeline, platform (Pivotal Cloud Foundry and Pivotal’s Big Data solutions), data science (Pivotal Data Labs), and software development (Pivotal Labs) come together to stand up next generation financial compliance solutions.

Learning More

In this blog, we described the most important challenges that financial institutions face in the current regulatory environment. We presented an innovative, agile and real-time computational platform that addresses financial compliance needs and explained how cutting-edge data science can reduce the compliance department’s manual review workload. The framework is easily extended to other industries where fraudulent activities need to be identified, for example, the insurance industry.
