Build a Data Analytics Platform in Minutes Using Deployment Blueprints

June 25, 2021 Harmen van der Linde

To stay competitive, enterprise organizations are engaged in an ongoing drive to optimize and scale the delivery of their products and services. Data has become a critical component in achieving these goals.

The growing number of data use cases, and their increasing scale, require new data management applications that go beyond traditional databases. In addition to data storage, modern application deployments require new capabilities such as caching, enrichment, real-time analysis, search, and visualization to better manage and utilize large data sets. Such data management capabilities require application teams to deploy and integrate multiple data management software tools as part of a single data platform. Examples include data ingest services supported by caching, data preparation, and analysis; storage tiers for real-time and historical data access; and data search and analytics dashboards. 

Meanwhile, as part of enterprise cloud adoption, application architecture and deployment patterns are evolving into distributed systems of interconnected software runtimes (i.e., microservices). Similarly, in line with cloud native deployment principles, modern data platform implementations require deploying a distributed system of stateful software runtimes.

In other words, despite the benefits of horizontal scale and deployment agility that they bring to enterprises, modern data platforms also introduce new deployment challenges. With that in mind, let’s explore cloud native deployment patterns for data platforms and solutions that simplify deployment complexity and enable data platforms to be deployed in minutes instead of hours or days.

Modern data platforms

Open source and third-party software serve as the basis for many enterprise applications today. Frequently used functionality supported by these software components simplifies application development and allows developers to focus on business logic implementation. These software components, which are referred to as middleware, provide shared application services such as messaging, API management, and authentication. 

A particular class of middleware software is focused on data management. It covers a range of functions such as data ingestion, distribution, processing, and storage; examples include distributed databases and data caching software. Many of these software stacks originated in open source and have since evolved into commercially supported software and service offerings. 

When deployed and integrated as a data platform, these software stacks support a shared set of data services for integration with application business logic. Implementation of a data platform typically involves the deployment and replication of multiple types of data software runtimes across various server nodes to handle scale and resiliency. As a result, data platforms are multi-node systems supporting distributed (and often interconnected) software runtimes. 

A data platform deployed on cloud infrastructure creates a new, distributed, functional layer of data services that integrate with application runtimes via APIs and data function calls.

In addition to self-managed deployments with virtualized and containerized software runtimes, new data platform architectures are emerging that cover both hosted and hybrid cloud (services) deployments.

Modern data platforms are easier to implement, easier to scale, and easier to manage than their predecessors. And by leveraging cloud native technologies and deployment patterns, modern data platforms are more cost-effective and resilient as well. 

To support operational agility and cost efficiency, cloud data platforms require the following deployment capabilities:

  • System-level deployment automation – Implementation of modern data platforms requires deploying a combination of different data software stacks. Deployment automation for data platforms needs to cover the installation and configuration of multiple data software runtimes and the placement of these software runtimes across server and cloud infrastructure. Deployment manifests that capture data software configuration and placement details enable repeatable deployment patterns for data platforms that can be automated. 

  • Infrastructure capacity elasticity – The types and amount of data that need to be processed by a data platform will change—and likely increase—over time. As a result, a modern data platform must have a scale-out architecture, which enables incremental storage and processing capacity upgrades via the deployment of additional data platform nodes and software runtimes. 

  • Right-sized resource utilization – Data platform deployments consume infrastructure resources, and that consumption is likely to grow over time as the volume of data to be processed increases. Controls must be put in place that match the sizing of data platform runtime components with the resources available in underlying infrastructure server nodes and services so as to minimize underutilization and avoid resource bottlenecks. In short, data platforms must be deployed with the infrastructure resources they actually require.

  • Full-stack, system-level, and data observability – Cloud native monitoring principles for infrastructure and application runtimes also apply to modern data platforms. In addition to full-stack monitoring and platform-level status visibility, observability instrumentation must be in place for service-level objective (SLO) monitoring of end-to-end data flows within a data platform.

Implementation of modern data platforms requires automation and configuration manifests for the coordinated deployment of data software runtimes. When software runtimes are sized and placed correctly, utilization of the underlying server and cloud infrastructure is optimized, driving down deployment costs. Metric collection from data software runtimes and infrastructure, together with cross-platform tracing, provides the building blocks for SLO-based data platform observability to support operational insights and agility.
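To make the idea of a deployment manifest more concrete, here is a minimal Python sketch. It assumes a manifest is simply a list of runtime specifications with per-replica resource requests, and it compares the total requested resources with the capacity of a cluster. The class names, runtime names, and figures are all hypothetical placeholders, not sizing guidance from any blueprint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuntimeSpec:
    """One data software runtime in the platform (all names are illustrative)."""
    name: str
    replicas: int
    cpu_millicores: int   # CPU requested per replica
    memory_mib: int       # memory requested per replica


@dataclass(frozen=True)
class ClusterCapacity:
    """Total resources available across the cluster (assumed figures)."""
    cpu_millicores: int
    memory_mib: int


def total_requests(manifest):
    """Sum the CPU and memory requested by every replica in the manifest."""
    cpu = sum(r.replicas * r.cpu_millicores for r in manifest)
    mem = sum(r.replicas * r.memory_mib for r in manifest)
    return cpu, mem


def utilization(manifest, capacity):
    """Return requested/available ratios for CPU and memory."""
    cpu, mem = total_requests(manifest)
    return cpu / capacity.cpu_millicores, mem / capacity.memory_mib


# Hypothetical manifest: the figures are placeholders, not recommendations.
MANIFEST = [
    RuntimeSpec("kafka", replicas=3, cpu_millicores=1000, memory_mib=4096),
    RuntimeSpec("spark-worker", replicas=2, cpu_millicores=2000, memory_mib=8192),
    RuntimeSpec("solr", replicas=2, cpu_millicores=1000, memory_mib=4096),
]
CLUSTER = ClusterCapacity(cpu_millicores=16000, memory_mib=65536)
```

Calling `utilization(MANIFEST, CLUSTER)` gives the share of cluster CPU and memory that the manifest would claim. A ratio close to 1 suggests a likely bottleneck, while a very low ratio suggests underutilized infrastructure.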

Deployment challenges

In addition to increasing agility and scale, the adoption of cloud infrastructure and containerization has created a new set of deployment challenges for data platforms, specifically for stateful software runtimes and data workloads.

Stitching together multiple software runtimes – Most data platforms require the deployment of multiple data software stacks. In addition, for each software stack, multiple container runtime instances are deployed to meet defined scale and resiliency requirements, creating even more complexity. All of this requires numerous deployment steps, which can easily take hours or even days to complete.

Software runtime placement – Data platform implementations require that multiple types and instances of data software runtimes are deployed. How these software components are deployed across server compute and cloud infrastructure matters. For example, to ensure node failure resiliency, software runtime instances need to be placed on separate server nodes. At the same time, there is an incentive to stack together multiple runtime instances to optimize resource utilization of underlying server nodes. Handling these (sometimes competing) runtime placement requirements creates additional deployment complexity and risk. For example, when placement affinity rules are defined incorrectly, multiple data runtimes can inadvertently be placed into the same fault domain, creating a single point of failure. 
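The fault-domain risk described above can be checked mechanically. The following Python sketch assumes placements are available as simple (runtime, replica, fault domain) tuples, and it reports any runtime that has more than one replica in the same fault domain. The data layout and all names are hypothetical and do not reflect any specific scheduler's API.

```python
from collections import defaultdict


def colocated_replicas(placements):
    """Find runtimes with more than one replica in the same fault domain.

    `placements` is a list of (runtime, replica_id, fault_domain) tuples.
    Returns {runtime: {fault_domain: [replica_id, ...]}} for violations only.
    """
    by_domain = defaultdict(lambda: defaultdict(list))
    for runtime, replica, domain in placements:
        by_domain[runtime][domain].append(replica)

    violations = {}
    for runtime, domains in by_domain.items():
        shared = {d: ids for d, ids in domains.items() if len(ids) > 1}
        if shared:
            violations[runtime] = shared
    return violations


# Hypothetical placement: node names stand in for fault domains.
PLACEMENTS = [
    ("kafka", "kafka-0", "node-a"),
    ("kafka", "kafka-1", "node-b"),
    ("kafka", "kafka-2", "node-b"),  # shares a node with kafka-1
    ("solr", "solr-0", "node-a"),    # different runtimes may share a node
    ("solr", "solr-1", "node-c"),
]
```

In this sketch, stacking different runtimes on one node is allowed, which preserves the utilization benefit. Only replicas of the same runtime in one fault domain are flagged.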

Resource sizing – Correctly allocating memory and compute resources is essential for the optimal execution of data software runtimes—tuning parameters for JVM heap memory is one example. At the same time, minimizing the overallocation of resources requires the configuration and tuning of resource (request) sizes and limits, potentially at multiple (cloud) infrastructure layers. These various levels of deployment tuning create new operational complexities, especially for mature data platform deployments where resource capacity and performance tuning are essential to optimize infrastructure costs.
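As an illustration of sizing at more than one layer, the Python sketch below derives JVM heap flags from a container memory limit and checks that a memory request, its limit, and the node's allocatable memory are consistent. The 50 percent heap fraction is an assumed rule of thumb chosen for the example, not a recommendation from the blueprints, and the function names are hypothetical.

```python
def jvm_heap_flags(container_limit_mib, heap_fraction=0.5):
    """Derive -Xms/-Xmx from a container memory limit.

    heap_fraction is an assumed rule of thumb; the remainder is left for
    off-heap memory, thread stacks, and the OS page cache.
    """
    if not 0 < heap_fraction < 1:
        raise ValueError("heap_fraction must be between 0 and 1")
    heap_mib = int(container_limit_mib * heap_fraction)
    return [f"-Xms{heap_mib}m", f"-Xmx{heap_mib}m"]


def check_sizing(request_mib, limit_mib, node_allocatable_mib):
    """Return a list of human-readable sizing problems (empty if none)."""
    problems = []
    if request_mib > limit_mib:
        problems.append("request exceeds limit")
    if limit_mib > node_allocatable_mib:
        problems.append("limit exceeds node allocatable memory")
    return problems
```

Deriving the heap size from the container limit keeps the two layers in step. If the limit changes, the JVM flags follow instead of being tuned separately.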

Implementing data observability – Managing data platforms requires monitoring the operational state and performance of multiple, often many, server nodes and data software runtimes running on those nodes. To derive the overall functional status of a data platform, observability metrics from potentially many (runtime and server) endpoints have to be collected, aggregated, and processed, often causing scale and time-series data query complexities. In addition, measuring data SLO throughput characteristics for data platforms requires end-to-end tracing supported by data software and application runtimes, which requires custom code and additional runtime deployment configurations. 
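A simplified view of this aggregation can be sketched in Python. The example assumes health checks and end-to-end latency samples have already been collected, and it shows only the final step: collapsing many endpoint states into one platform status and computing how often a latency objective was met. The status labels, function names, and threshold are hypothetical.

```python
def platform_status(endpoint_up):
    """Collapse per-endpoint health into one platform-level status.

    endpoint_up maps an endpoint name to True (healthy) or False.
    """
    if not endpoint_up:
        return "unknown"
    healthy = sum(endpoint_up.values())
    if healthy == len(endpoint_up):
        return "healthy"
    return "degraded" if healthy > 0 else "down"


def slo_compliance(latencies_ms, threshold_ms):
    """Fraction of end-to-end samples at or below the latency threshold."""
    if not latencies_ms:
        raise ValueError("no samples")
    within = sum(1 for value in latencies_ms if value <= threshold_ms)
    return within / len(latencies_ms)
```

In a real deployment the inputs would come from a metrics or tracing backend, and the collection and query steps omitted here are where most of the scale complexity arises.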

Maintaining open source data software – Open source software used for implementing data platforms is typically manually downloaded from various public software repositories and repackaged before deployment. Unfortunately, this process needs to be repeated every time new software upgrades and fixes are released. And those releases need to be tracked separately for each open source software component of the data platform deployment, which creates operational risk and overhead.
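The per-component tracking burden can be illustrated with a short Python sketch that compares deployed versions against the latest known releases. It assumes purely numeric dotted version strings and that the latest versions have already been looked up by some other means. The function names are hypothetical, and any version strings passed in are the caller's own data.

```python
def parse_version(text):
    """Turn '3.1.2' into (3, 1, 2); assumes purely numeric dotted versions."""
    return tuple(int(part) for part in text.split("."))


def outdated(deployed, latest):
    """Return {component: (deployed, latest)} where a newer release exists.

    Both arguments map a component name to a version string. Components
    with no known latest release are skipped rather than reported.
    """
    stale = {}
    for name, current in deployed.items():
        newest = latest.get(name)
        if newest is not None and parse_version(newest) > parse_version(current):
            stale[name] = (current, newest)
    return stale
```

Versions are compared as integer tuples rather than strings so that, for example, 1.10.0 is correctly treated as newer than 1.9.0.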

Blueprint deployments

New application catalog offerings and deployment automation tools have simplified software runtime installs and reduced the complexity of software lifecycle management for data platforms. In addition to validated software runtimes, application catalogs can also offer deployment automation manifests and configuration guidelines. 

DIY open source – A wide variety of public software repositories are available for downloading open source data management software. Once downloaded (and repackaged), software runtimes can be deployed manually or automatically, which gives application teams both choice and flexibility. Along with this flexibility, however, comes the complexity and ongoing burden of tracking and managing new software releases and testing patch upgrades to fix security vulnerabilities. All of this requires technical expertise, time, and resources, which complicates Day 2 operations.

Trusted open source – Application catalog and software marketplace offerings simplify Day 2 operations by making validated and packaged software runtimes and related artifacts available to application teams. As a result, searching for and testing new data software releases and patch updates becomes easier, which simplifies the software lifecycle management for data platforms. Today, there is a growing list of software catalog offerings available, such as the Bitnami Application Catalog and VMware Marketplace.

Deployment blueprints – Deployment automation focuses on installing and configuring software runtimes, and includes resource sizing, security settings, and placement rules. With the adoption of declarative software automation solutions, deployment blueprints have become a tool to capture many of these (complex) configuration parameters. Especially for data platform deployments, where multiple data software runtimes need to be stitched together and configured, blueprints simplify initial implementations as part of Day 1 operations as well as ongoing Day 2 software patching and upgrade activities. Nowadays, blueprints are part of the software deployment artifacts made available in application catalogs.

Many operational challenges associated with self-managed open source and third-party software deployments go away when using a centralized application catalog with validated data software runtimes and deployment artifacts. In addition, blueprints expand the automation semantics of software runtime implementations, simplifying deployments of distributed data platforms with multiple software stacks.

The use of blueprints reduces the need for reconfigurations after initial data platform deployments. Moreover, right-sized blueprint configurations reduce the amount of software runtime tuning work required throughout the remaining data platform deployment lifecycle. 

In effect, we see a shift left: post-deployment tuning activities are being replaced by carefully engineered blueprint configurations, which enable right-sized deployments of data platforms on Day 1.

Getting started

You can find validated blueprint designs in the Bitnami Application Catalog and VMware Marketplace, including blueprints for building containerized data platforms with Kafka, Apache Spark, Solr, and Elasticsearch.

These engineered and tested data platform blueprints are implemented via Helm charts. They capture security and resource settings, affinity placement parameters, and observability endpoint configurations for data software runtimes. Deployed with the Helm CLI or the Kubeapps tool, these Helm charts enable the single-step, production-ready deployment of a data platform in a Kubernetes cluster, covering the automated installation and configuration of multiple containerized data software runtimes.
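As a rough sketch of what a single-step Helm deployment looks like, the Python helper below assembles a `helm upgrade --install` command line without running it. The release name, namespace, chart reference, and values keys are supplied by the caller. The example values in the usage note are placeholders, so check the catalog entry and the chart's README for the real chart name and supported values.

```python
import shlex


def helm_install_command(release, chart, namespace, values=None):
    """Build (but do not run) a `helm upgrade --install` argument list."""
    cmd = [
        "helm", "upgrade", "--install", release, chart,
        "--namespace", namespace, "--create-namespace",
    ]
    # Values keys are chart-specific; sorting keeps the output stable.
    for key, value in sorted((values or {}).items()):
        cmd += ["--set", f"{key}={value}"]
    return cmd


def render(cmd):
    """Render the argument list as a copy-pasteable shell line."""
    return shlex.join(cmd)
```

For example, `render(helm_install_command("data-platform", "bitnami/CHART_NAME", "data"))` produces a shell line for review, where `CHART_NAME` is a placeholder to replace with the blueprint chart you choose. Using `upgrade --install` means the same command serves both the first deployment and later upgrades.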

Each data platform blueprint comes with Kubernetes cluster node and resource configuration guidelines to ensure the optimized sizing and utilization of underlying Kubernetes cluster compute, memory, and storage resources. For example, the README.md for the Kafka, Apache Spark, and Solr blueprint covers its Kubernetes deployment guidelines.

We are always interested in getting feedback from users! You can submit questions and issues via the Bitnami GitHub page and use the Bitnami Community page to submit enhancement requests and ideas for new data platform blueprints.

About the Author

Harmen van der Linde is a senior director at VMware’s Office of the CTO; he also leads the Cloud Data Platforms team. He is currently focused on projects at the intersection of data and cloud architectures, covering deployment automation, migration, and lifecycle management. Before VMware, Harmen was the director and global head of Citigroup’s Monitoring and Logging organization. In that role, he led several observability projects focused on Kubernetes monitoring, bridging the gap between application development and IT infrastructure operations. Harmen also held various leadership roles at both Cisco and AT&T, where he focused on product management and network operations planning.
