How to Build a Kafka-Spark-Solr Data Analytics Platform Using Deployment Blueprints

June 25, 2021 Geeta Kulkarni

Enterprise applications rely on large amounts of data that needs to be distributed, processed, and stored. Data platforms offer data management services via a combination of open source and commercially supported software stacks. These services enable accelerated development and deployment of data-hungry business applications.

Building a containerized data analytics platform comprising different software stacks comes with several deployment challenges, such as stitching together, sizing, and placing multiple data software runtimes across Kubernetes nodes. For a better understanding of these challenges and how deployment blueprints solve them, check out this blog post.

The data platform deployment blueprints available in the Bitnami Application Catalog help you build optimal, resilient data platforms by covering the following:  

  • Pod placement rules – Affinity rules to ensure placement diversity so as to prevent single points of failure and optimize load distribution

  • Pod resource sizing rules – Optimized pod and JVM sizing settings for optimal performance and efficient resource usage

  • Default settings – To ensure pod access security

  • Optional VMware Tanzu Observability framework configuration – With out-of-the-box dashboards to monitor your data platform

  • Use of curated, trusted Bitnami images – To make the data platforms secure

In addition, all the blueprints are validated and tested to provide Kubernetes node count and sizing recommendations in order to facilitate cloud platform capacity planning. The goal is to optimize the server and storage footprint to minimize infrastructure cost.

Bitnami's Data Platform Blueprint1 with Kafka-Spark-Solr enables the fully automated deployment of a multistack data platform in a multinode Kubernetes cluster by covering the following software components:

  • Apache Kafka – A data distribution bus with buffering capabilities

  • Apache Spark – A cluster-computing platform that provides in-memory data analytics

  • Solr – An open source platform for data persistence and search

In this post, we will walk you through the steps to deploy this blueprint on a Kubernetes cluster.

Prerequisites 

In order to deploy this blueprint according to our instructions, you will need:

  • a Docker environment installed and configured 

  • PersistentVolume provisioner support in the underlying infrastructure

  • ReadWriteMany volumes for deployment scaling

  • a Tanzu Observability cluster up and running, to enable the Tanzu Observability framework for the data platform

Deploying a data platform on a Kubernetes cluster

Throughout this post, we will use a VMware Tanzu Kubernetes guest cluster as the underlying infrastructure for the deployment of the data platform. We’ll focus on the use case of building a data platform with Kafka-Spark-Solr, which could be used for data and application evaluation, development, and functional testing.

In order to build this data platform, we need a Kubernetes cluster with the following configuration:

  • One master node (2 CPU, 4Gi memory)

  • Three worker nodes (4 CPU, 32Gi memory)
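Once the cluster is provisioned, a quick sanity check with standard kubectl commands (assuming your kubeconfig points at the guest cluster) confirms the node count and capacity match the sizing above:

```shell
# List the nodes and confirm the control plane / worker split
kubectl get nodes -o wide

# Verify allocatable CPU and memory on each node
kubectl describe nodes | grep -A 5 "Allocatable:"
```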

Deployment architecture

The diagram below depicts the deployment architecture. 

The Kubernetes objects that this chart deploys include:

  • ZooKeeper with three nodes, used by both Kafka and Solr

  • Kafka with three nodes, using the ZooKeeper deployed above

  • Solr with two nodes, using the ZooKeeper deployed above

  • Spark with one master and two worker nodes

Now, let’s go ahead and deploy the Kafka-Spark-Solr data platform on a Kubernetes cluster:
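The deployment follows the standard Bitnami Helm workflow. The chart name (`dataplatform-bp1`), release name, and namespace below are assumptions based on the Bitnami catalog naming at the time of writing; check the catalog for the current chart name before running:

```shell
# Add the Bitnami chart repository (skip if already configured)
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update

# Deploy Data Platform Blueprint1 (Kafka-Spark-Solr) into its own namespace
helm install my-dataplatform bitnami/dataplatform-bp1 \
  --namespace dataplatform --create-namespace
```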

A resilient, optimized data platform comprising Kafka-Spark-Solr is now up and running on the Kubernetes cluster, within minutes, using a single command! 
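You can watch the rollout with standard kubectl commands; the `dataplatform` namespace below is a name chosen for illustration and should match whatever namespace you installed the chart into:

```shell
# Watch until all Kafka, Spark, Solr, and ZooKeeper pods reach Running
kubectl get pods -n dataplatform -w

# Confirm the services the chart exposes
kubectl get svc -n dataplatform
```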

Data platform deployment with the Tanzu Observability framework

Tanzu Observability is an enterprise-grade observability solution. Bitnami Data Platform Blueprint1 with Kafka-Spark-Solr has an out-of-the-box integration with Tanzu Observability that is turned off by default, but can be enabled via certain parameters. Once you have enabled the observability framework, you can use the out-of-the-box dashboards to monitor your data platform, from viewing the health and utilization of the data platform at a high level to taking an in-depth view of each software stack runtime.

Deployment architecture with the Tanzu Observability framework

The diagram below depicts the data platform architecture with the Tanzu Observability framework. 

The list of Kubernetes object details that this chart deploys is as follows:

  • ZooKeeper with three nodes, used by both Kafka and Solr

  • Kafka with three nodes, using the ZooKeeper deployed above

  • Solr with two nodes, using the ZooKeeper deployed above

  • Spark with one master and two worker nodes

  • Wavefront Collector’s DaemonSet to enable runtime feed into the Tanzu Observability service 

  • Wavefront proxy

Let’s now go ahead with the deployment.

As mentioned in the prerequisites, you should have a Tanzu Observability cluster up and running before starting the deployment. Please note that the API token for your Tanzu Observability cluster user can be fetched from User → Settings → Username → API Access from the Tanzu Observability user interface. 
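With the API token in hand, the observability framework is enabled through chart parameters at install time. The parameter names below follow the Wavefront subchart convention used by Bitnami charts, but they are assumptions; verify them against the chart's README for your version, and substitute your own cluster URL and token for the placeholders:

```shell
# Deploy the blueprint with the Tanzu Observability (Wavefront) framework enabled.
# Chart, release, and parameter names are assumptions; confirm against the
# chart README before running.
helm install my-dataplatform bitnami/dataplatform-bp1 \
  --namespace dataplatform --create-namespace \
  --set wavefront.enabled=true \
  --set wavefront.clusterName=my-k8s-cluster \
  --set wavefront.wavefront.url=https://YOUR_CLUSTER.wavefront.com \
  --set wavefront.wavefront.token=YOUR_API_TOKEN
```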

Note: The Solr exporter is not supported when deploying Solr with authentication enabled.

Once the data platform is up and running, we can log in to the Tanzu Observability user interface and navigate to Integrations → Big Data to find the “Data Platforms” tile, as shown in the image below. 

Click the “Data Platforms” tile and navigate to the “Dashboards” tab. Then click “Data Platform Blueprint1 Kafka-Spark-Solr Dashboard.” 

Below are the most important sections that you will see in the dashboard.

Overview – This section gives you a single view of the data platform cluster, including the health, utilization, and applications that make up the cluster. 

Kubernetes platform – This section gives you a detailed view of the underlying Kubernetes cluster, specifically the node-to-pod mapping so you can understand the placement of application pods on Kubernetes cluster nodes.

Individual applications – This section gives you a detailed view of the individual application metrics; in this case, detailed metrics of Kafka, Spark, and Solr.

Kafka application metrics 

Spark application metrics 

Solr application metrics 

As illustrated, you can build a data platform with Kafka-Spark-Solr together with the Tanzu Observability framework on the Kubernetes cluster, within minutes, using a single command. 

The Bitnami Data Platform Blueprint1 Helm chart makes this a quick, easy, and secure process, allowing you to focus your time and effort on business logic rather than a deployment configuration that uses multistep runbooks. The best part is that out-of-the-box dashboards are available in Tanzu Observability to monitor your data platform from day one.
