July 30, 2012 Josh Klahr

Agile development has been all the rage for a while now – extreme programming, scrum, user stories, epics, backlogs, etc. have become the lingua franca of any software development organization worth its salt. And although the notion of agile development hasn’t yet completely penetrated other parts of the enterprise, there is an increasing awareness of its benefits. One of the areas where the concept is starting to gain traction is analytics. As Jim Kobielus noted in a recent post, organizations that are able to quickly learn and iterate using experimentation gain a competitive advantage; analytics vendors like SAS have also been promoting the concept of agile applied to big data analytics.

This development is not surprising, as the values of agile development should resonate with anyone who’s involved in delivering data and insights. At its core, agile values:

  • Individuals and interactions over processes and tools
  • Working software over comprehensive documentation
  • Customer collaboration over contract negotiation
  • Responding to change over following a plan

In the realm of analytics, what do these values mean? And more importantly, how can they be put into practice in order to realize the vision of “Agile Analytics”?

Let’s start with the implications of “agile” in the world of Big Data Analytics (and also, Big Data Applications). Based on my experience at Yahoo! (and on what I am starting to see from Greenplum’s customers), there is an evolution of needs that takes place during the lifecycle of Big Data Analytics application development – let’s call this the Analytics Application Development Lifecycle. Before diving into what this lifecycle looks like, let’s first talk about the environment that enterprises with Big Data are dealing with today. In general, I’ve seen the following characteristics, which end up informing what an Agile Big Data environment needs to support:

  • Underlying Data Sets are Fast Changing: In this environment, timely analysis of new products and concepts is a competitive advantage. As a result, data processing and analysis systems need to be flexible enough to support underlying changes without requiring a rewrite or a new data model.
  • Demand for Analytics is Time Sensitive: In the big data world, the ability to analyze new features that are in production and impact revenue/monetization is critical. Delays in turning around new requests can result in serious financial impact or customer risk.
  • Business Questions and Data Needs are Unpredictable: Anyone who is supporting the Business Intelligence (BI) needs of a “Big-Data-Driven” organization will tell you that reporting and analysis needs for new features can’t be anticipated – additional data needs often arise as the result of first-pass analyses. This means that data query and analysis systems must be built for unpredictable demands.
  • Volumes of Data and Data Consumers are Extremely Large: Analytics systems need to support deep analysis by data scientists, dashboards and reporting for larger internal user bases, and consumption by operational systems. To complicate things, all of these capabilities need to scale to support massive & growing data sets.

Given the above, what does a Big Data & Analytics platform need to do? It needs to support the analytics lifecycle as shown below.

A system that can easily support the above flow – with a focus on iterative, collaborative development within the “Ad Hoc” and “Proving Ground” quadrants – is well positioned to drive success for Big Data, Big Analytics initiatives. (Obviously I am biased, but check out Greenplum’s recent launch of Chorus to understand our vision here.) When evaluating your own platform to assess whether it’s ready to support this lifecycle, my advice is to focus on the capabilities described in the Top Ten list below. This is not a comprehensive list, but it captures the core elements that one should be looking for as part of a data platform rollout.

  1. Ad-hoc access to “raw” event/user level data
  2. Data source agnosticism – Hadoop & RDBMS interop
  3. Data search and discovery
  4. Analysis- and Developer-friendly environment – SQL, Code
  5. Lower-than-average cost of change for new data, metrics
  6. Schedule and publish capabilities for views, tables, insights
  7. Unified catalog/metadata service
  8. 3rd Party Tool “friendliness”
  9. Resource management for ad-hoc & production workloads
  10. Enterprise features for the entire data system

There are plenty of other things to think about as well: do you have the right “Data Scientists” within your organization to leverage this platform? Are you properly instrumenting your products and processes to drive data into your data platform? Are you thinking about closing the loop by building applications and systems that can leverage the insights delivered by your data science team (operationalization, as it were)? All things to keep in mind as you venture into the exciting new world of Big Data, and Big Analytics.
