How to Build a Kafka-Spark-Solr Data Analytics Platform Using Deployment Blueprints

June 25, 2021 Geeta Kulkarni

Enterprise applications rely on large amounts of data that needs to be distributed, processed, and stored. Data platforms offer data management services via a combination of open source and commercially supported software stacks. These services enable accelerated development and deployment of data-hungry business applications.

Building a containerized data analytics platform comprising different software stacks comes with several deployment challenges, such as stitching together, sizing, and placing multiple data software runtimes across Kubernetes nodes. For a better understanding of these challenges and how deployment blueprints solve them, check out this blog post.

The data platform deployment blueprints available in the Bitnami Application Catalog help you build optimal, resilient data platforms by covering the following:  

  • Pod placement rules – Affinity rules to ensure placement diversity so as to prevent single points of failure and optimize load distribution

  • Pod resource sizing rules – Optimized pod and JVM sizing settings for optimal performance and efficient resource usage

  • Default settings – To ensure pod access security

  • Optional VMware Tanzu Observability framework configuration – With out-of-the-box dashboards to monitor your data platform

  • Use of curated, trusted Bitnami images – To make the data platforms secure

In addition, all the blueprints are validated and tested to provide Kubernetes node count and sizing recommendations in order to facilitate cloud platform capacity planning. The goal is to optimize the server and storage footprint to minimize infrastructure cost.

Bitnami's Data Platform Blueprint1 with Kafka-Spark-Solr enables the fully automated deployment of a multistack data platform in a multinode Kubernetes cluster by covering the following software components:

  • Apache Kafka – A data distribution bus with buffering capabilities

  • Apache Spark – A cluster-computing platform that provides in-memory data analytics

  • Solr – An open source platform for data persistence and search

In this post, we will walk you through the steps to deploy this blueprint on a Kubernetes cluster.

Prerequisites 

In order to deploy this blueprint according to our instructions, you will need:

  • a Docker environment installed and configured 

  • PersistentVolume provisioner support in the underlying infrastructure

  • ReadWriteMany volumes for deployment scaling

  • a Tanzu Observability cluster up and running, to enable the Tanzu Observability framework for the data platform

Deploying a data platform on a Kubernetes cluster

Throughout this post, we will use a VMware Tanzu Kubernetes guest cluster as the underlying infrastructure for the deployment of the data platform. We’ll focus on the use case of building a data platform with Kafka-Spark-Solr, which could be used for data and application evaluation, development, and functional testing.

In order to build this data platform, we need a Kubernetes cluster with the following configuration:

  • One master node (2 CPU, 4Gi memory)

  • Three worker nodes (4 CPU, 32Gi memory)
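Once the cluster is provisioned, a quick sanity check with standard kubectl commands (assuming your kubeconfig points at the guest cluster) confirms the node count and capacity match the sizing above:

```shell
# List the nodes and confirm the control plane / worker split
kubectl get nodes -o wide

# Verify allocatable CPU and memory on each node
kubectl describe nodes | grep -A 5 "Allocatable:"
```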

Deployment architecture

The diagram below depicts the deployment architecture. 

The Kubernetes objects that this chart deploys include:

  • ZooKeeper with three nodes, used by both Kafka and Solr

  • Kafka with three nodes, using the ZooKeeper deployed above

  • Solr with two nodes, using the ZooKeeper deployed above

  • Spark with one master and two worker nodes

Now, let’s go ahead and deploy the Kafka-Spark-Solr data platform on a Kubernetes cluster:
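The deployment follows the standard Bitnami Helm workflow. The chart name (`dataplatform-bp1`), release name, and namespace below are assumptions based on the Bitnami catalog naming at the time of writing; check the catalog for the current chart name before running:

```shell
# Add the Bitnami chart repository (skip if already configured)
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update

# Deploy Data Platform Blueprint1 (Kafka-Spark-Solr) into its own namespace
helm install my-dataplatform bitnami/dataplatform-bp1 \
  --namespace dataplatform --create-namespace
```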

A resilient, optimized data platform comprising Kafka-Spark-Solr is now up and running on the Kubernetes cluster, within minutes, using a single command! 
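You can watch the rollout with standard kubectl commands; the `dataplatform` namespace below is a name chosen for illustration and should match whatever namespace you installed the chart into:

```shell
# Watch until all Kafka, Spark, Solr, and ZooKeeper pods reach Running
kubectl get pods -n dataplatform -w

# Confirm the services the chart exposes
kubectl get svc -n dataplatform
```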

Data platform deployment with the Tanzu Observability framework

Tanzu Observability is an enterprise-grade observability solution. Bitnami Data Platform Blueprint1 with Kafka-Spark-Solr has an out-of-the-box integration with Tanzu Observability that is turned off by default, but can be enabled via certain parameters. Once you have enabled the observability framework, you can use the out-of-the-box dashboards to monitor your data platform, from viewing the health and utilization of the data platform at a high level to taking an in-depth view of each software stack runtime.

Deployment architecture with the Tanzu Observability framework

The diagram below depicts the data platform architecture with the Tanzu Observability framework. 

The list of Kubernetes object details that this chart deploys is as follows:

  • ZooKeeper with three nodes, used by both Kafka and Solr

  • Kafka with three nodes, using the ZooKeeper deployed above

  • Solr with two nodes, using the ZooKeeper deployed above

  • Spark with one master and two worker nodes

  • Wavefront Collector’s DaemonSet to enable runtime feed into the Tanzu Observability service 

  • Wavefront proxy

Let’s now go ahead with the deployment.

As mentioned in the prerequisites, you should have a Tanzu Observability cluster up and running before starting the deployment. Please note that the API token for your Tanzu Observability cluster user can be fetched from User → Settings → Username → API Access from the Tanzu Observability user interface. 
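With the API token in hand, the observability framework is enabled through chart parameters at install time. The parameter names below follow the Wavefront subchart convention used by Bitnami charts, but they are assumptions; verify them against the chart's README for your version, and substitute your own cluster URL and token for the placeholders:

```shell
# Deploy the blueprint with the Tanzu Observability (Wavefront) framework enabled.
# Chart, release, and parameter names are assumptions; confirm against the
# chart README before running.
helm install my-dataplatform bitnami/dataplatform-bp1 \
  --namespace dataplatform --create-namespace \
  --set wavefront.enabled=true \
  --set wavefront.clusterName=my-k8s-cluster \
  --set wavefront.wavefront.url=https://YOUR_CLUSTER.wavefront.com \
  --set wavefront.wavefront.token=YOUR_API_TOKEN
```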

Note: The Solr exporter is not supported when deploying Solr with authentication enabled.

Once the data platform is up and running, we can log in to the Tanzu Observability user interface and navigate to Integrations → Big Data to find the “Data Platforms” tile, as shown in the image below. 

Click the “Data Platforms” tile and navigate to the “Dashboards” tab. Then click “Data Platform Blueprint1 Kafka-Spark-Solr Dashboard.” 

Below are the most important sections that you will see in the dashboard.

Overview – This section gives you a single view of the data platform cluster, including the health, utilization, and applications that make up the cluster. 

Kubernetes platform – This section gives you a detailed view of the underlying Kubernetes cluster, specifically the node-to-pod mapping so you can understand the placement of application pods on Kubernetes cluster nodes.

Individual applications – This section gives you a detailed view of the individual application metrics; in this case, detailed metrics of Kafka, Spark, and Solr.

Kafka application metrics 

Spark application metrics 

Solr application metrics 

As illustrated, you can build a data platform with Kafka-Spark-Solr together with the Tanzu Observability framework on the Kubernetes cluster, within minutes, using a single command. 

The Bitnami Data Platform Blueprint1 Helm chart makes this a quick, easy, and secure process, allowing you to focus your time and effort on business logic rather than a deployment configuration that uses multistep runbooks. The best part is that out-of-the-box dashboards are available in Tanzu Observability to monitor your data platform from day one.
