How to Build a Kafka-Spark-Solr Data Analytics Platform Using Deployment Blueprints

June 25, 2021 Geeta Kulkarni

Enterprise applications rely on large amounts of data that must be distributed, processed, and stored. Data platforms offer data management services through a combination of open source and commercially supported software stacks, enabling accelerated development and deployment of data-hungry business applications.

Building a containerized data analytics platform from multiple software stacks comes with several deployment challenges, such as stitching together, sizing, and placing multiple data software runtimes across Kubernetes nodes. For a better understanding of these challenges and how deployment blueprints solve them, check out this blog post.

The data platform deployment blueprints available in the Bitnami Application Catalog help you build optimal, resilient data platforms by covering the following:

  • Pod placement rules – Affinity rules that ensure placement diversity to prevent single points of failure and optimize load distribution (see the illustrative sketch below)

  • Pod resource sizing rules – Pod and JVM sizing settings tuned for performance and efficient resource usage

  • Default settings – To ensure pod access security

  • Optional VMware Tanzu Observability framework configuration – With out-of-the-box dashboards to monitor your data platform

  • Use of curated, trusted Bitnami images – To make the data platforms secure

In addition, all the blueprints are validated and tested to provide Kubernetes node count and sizing recommendations in order to facilitate cloud platform capacity planning. The goal is to optimize the server and storage footprint to minimize infrastructure cost.
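
As a concrete illustration of the placement rules, the sketch below shows the kind of anti-affinity setting a blueprint relies on to keep Kafka brokers on separate worker nodes. The values layout here is an assumption for illustration, not the chart's exact schema; consult the chart's README for the real parameters.

# Illustrative only: an anti-affinity rule of the kind the blueprint uses to
# spread Kafka broker pods across distinct worker nodes. The values layout
# below is assumed for illustration, not the chart's exact schema.
cat > placement-values.yaml <<'EOF'
kafka:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/name: kafka
          topologyKey: kubernetes.io/hostname
EOF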

Bitnami's Data Platform Blueprint1 with Kafka-Spark-Solr enables the fully automated deployment of a multistack data platform in a multinode Kubernetes cluster by covering the following software components:

  • Apache Kafka – A data distribution bus with buffering capabilities

  • Apache Spark – A cluster-computing platform that provides in-memory data analytics

  • Apache Solr – An open source platform for data persistence and search

In this post, we will walk you through the steps to deploy this blueprint on a Kubernetes cluster.

Prerequisites 

In order to deploy this blueprint according to our instructions, you will need:

  • A Docker environment installed and configured

  • PersistentVolume provisioner support in the underlying infrastructure

  • ReadWriteMany volumes for deployment scaling

  • A Tanzu Observability cluster up and running, to enable the Tanzu Observability framework for the data platform

Deploying a data platform on a Kubernetes cluster

Throughout this post, we will use a VMware Tanzu Kubernetes guest cluster as the underlying infrastructure for the deployment of the data platform. We’ll focus on the use case of building a data platform with Kafka-Spark-Solr, which could be used for data and application evaluation, development, and functional testing.

In order to build this data platform, we need a Kubernetes cluster with the following configuration:

  • One master node (2 CPU, 4Gi memory)

  • Three worker nodes (4 CPU, 32Gi memory)
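
Before installing anything, you can confirm the cluster matches this sizing with standard kubectl; the column layout below is just one convenient way to display it.

# Check node CPU and memory capacity against the recommended sizing
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.capacity.cpu,MEMORY:.status.capacity.memory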

Deployment architecture

The diagram below depicts the deployment architecture. 

This chart deploys the following Kubernetes objects:

  • ZooKeeper with three nodes, used by both Kafka and Solr

  • Kafka with three nodes, using the ZooKeeper deployed above

  • Solr with two nodes, using the ZooKeeper deployed above

  • Spark with one master and two worker nodes

Now, let’s go ahead and deploy the Kafka-Spark-Solr data platform on a Kubernetes cluster:
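
A minimal sketch of that single command follows, assuming the blueprint is published as the dataplatform-bp1 chart in the Bitnami Helm repository; the release name my-dataplatform is a placeholder.

# Add the Bitnami chart repository (skip if it is already configured)
helm repo add bitnami https://charts.bitnami.com/bitnami

# Deploy Data Platform Blueprint1 (Kafka, Spark, and Solr) with one command
helm install my-dataplatform bitnami/dataplatform-bp1

# Watch the pods come up: 3 ZooKeeper, 3 Kafka, 2 Solr, 1 Spark master + 2 workers
kubectl get pods -w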

A resilient and optimal data platform comprising Kafka-Spark-Solr is now up and running on the Kubernetes cluster, within minutes, using a single command!

Data platform deployment with the Tanzu Observability framework

Tanzu Observability is an enterprise-grade observability solution. Bitnami Data Platform Blueprint1 with Kafka-Spark-Solr has an out-of-the-box integration with Tanzu Observability that is turned off by default, but can be enabled via certain parameters. Once you have enabled the observability framework, you can use the out-of-the-box dashboards to monitor your data platform, from viewing the health and utilization of the data platform at a high level to taking an in-depth view of each software stack runtime.

Deployment architecture with the Tanzu Observability framework

The diagram below depicts the data platform architecture with the Tanzu Observability framework. 

This chart deploys the following Kubernetes objects:

  • ZooKeeper with three nodes, used by both Kafka and Solr

  • Kafka with three nodes, using the ZooKeeper deployed above

  • Solr with two nodes, using the ZooKeeper deployed above

  • Spark with one master and two worker nodes

  • A Wavefront Collector DaemonSet that feeds runtime metrics into the Tanzu Observability service

  • A Wavefront proxy

Let’s now go ahead with the deployment.

As mentioned in the prerequisites, you should have a Tanzu Observability cluster up and running before starting the deployment. Note that the API token for your Tanzu Observability cluster user can be fetched from the Tanzu Observability user interface under User → Settings → Username → API Access.
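
With the API token in hand, the observability-enabled deployment is again a single command. The parameter names below assume the chart exposes the integration under a wavefront values section (check the chart's README); the cluster name, instance URL, and token are placeholders.

# Deploy the blueprint with the Tanzu Observability (Wavefront) integration enabled.
# Replace the placeholders with your own cluster name, instance URL, and API token.
helm install my-dataplatform bitnami/dataplatform-bp1 \
  --set wavefront.enabled=true \
  --set wavefront.clusterName=<K8S-CLUSTER-NAME> \
  --set wavefront.wavefront.url=https://<YOUR-INSTANCE>.wavefront.com \
  --set wavefront.wavefront.token=<YOUR-API-TOKEN>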

Note: The Solr exporter is not supported when deploying Solr with authentication enabled.
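
In practice, that means enabling the exporter only when Solr authentication is off. A hedged sketch follows, assuming the Solr sub-chart exposes exporter.enabled and authentication.enabled flags; verify the names against the chart's README.

# Assumed parameter names from the Bitnami Solr sub-chart; verify before use.
# Reuses the values of the existing release and toggles only the Solr flags.
helm upgrade my-dataplatform bitnami/dataplatform-bp1 \
  --reuse-values \
  --set solr.exporter.enabled=true \
  --set solr.authentication.enabled=false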

Once the data platform is up and running, we can log in to the Tanzu Observability user interface and navigate to Integrations → Big Data to find the “Data Platforms” tile, as shown in the image below. 

Click the “Data Platforms” tile and navigate to the “Dashboards” tab. Then click “Data Platform Blueprint1 Kafka-Spark-Solr Dashboard.” 

Below are the most important sections that you will see in the dashboard.

Overview – This section gives you a single view of the data platform cluster, including its health, its utilization, and the applications that make up the cluster.

Kubernetes platform – This section gives you a detailed view of the underlying Kubernetes cluster, specifically the node-to-pod mapping so you can understand the placement of application pods on Kubernetes cluster nodes.

Individual applications – This section gives you a detailed view of the individual application metrics; in this case, detailed metrics of Kafka, Spark, and Solr.

Kafka application metrics 

Spark application metrics 

Solr application metrics 

As illustrated, you can build a data platform with Kafka-Spark-Solr together with the Tanzu Observability framework on the Kubernetes cluster, within minutes, using a single command. 

The Bitnami Data Platform Blueprint1 Helm chart makes this a quick, easy, and secure process, allowing you to focus your time and effort on business logic rather than on deployment configuration driven by multistep runbooks. The best part: you have out-of-the-box dashboards available in Tanzu Observability to monitor your data platform from Day 1.
