Pivotal Greenplum 5.10 Introduces Greenplum-Kafka Connector for Real Time Data Loading

July 26, 2018 Ivan Novick

Pivotal is well known for its agile development processes, and the Pivotal Greenplum product is no exception. Pivotal Greenplum is built on an agile development cadence: version 5.10, the 10th release, arrived in July 2018, just 10 months after the initial release of 5.0.

The headline feature of the 5.10 release of Pivotal Greenplum is Apache Kafka® integration, provided by the Pivotal Greenplum-Kafka Connector. Apache Kafka has recently become an industry-standard technology for stream processing, data ingestion, and enterprise bus use cases. In the world of big data, the velocity and volume of incoming data are ever increasing, and a system is needed to capture this data as it comes into the enterprise.

Key criteria for efficient data ingestion provided by Apache Kafka include:

Non-blocking data ingestion to ensure data can be consumed as fast as it is generated
Scalability to petabyte scale
Scalability of data readers so that the incoming data can be processed by a growing number of consumer processes throughout the enterprise
A clear, automated data retention policy to counter the natural desire to collect all data indefinitely
In-place analytics and processing on the data streams as needed

Apache Kafka and Greenplum: Better Together

Apache Kafka is complementary to the Relational Database model and not an alternative to it. With a Relational Database like Pivotal Greenplum, data can be ingested and stored for longer time periods, and users can perform aggregation, grouping, and summary-type business reporting as well as advanced analytics that require historical data analysis. This type of full data scanning and aggregation is a perfect fit for an RDBMS or a Data Warehouse and does not fit the real-time streaming world.
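As a small illustration of the kind of full-scan, multi-row aggregation described above, here is a minimal Python sketch. It uses the standard library's SQLite as a stand-in for a relational database; the `trades` table, its columns, and the function name are hypothetical and are not taken from Pivotal Greenplum.

```python
import sqlite3


def daily_volume_by_symbol(trades):
    """Aggregate (trade_date, symbol, price, quantity) tuples per day and symbol.

    SQLite is only a stand-in here; the table layout is a hypothetical example.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE trades "
        "(trade_date TEXT, symbol TEXT, price REAL, quantity INTEGER)"
    )
    conn.executemany("INSERT INTO trades VALUES (?, ?, ?, ?)", trades)
    # The query scans every stored row and groups them, which is the
    # workload that suits a database rather than a message stream.
    rows = conn.execute(
        "SELECT trade_date, symbol, SUM(quantity), AVG(price) "
        "FROM trades GROUP BY trade_date, symbol "
        "ORDER BY trade_date, symbol"
    ).fetchall()
    conn.close()
    return rows
```

A query like this needs all historical rows to be available at once, which is why it belongs in the database rather than in the stream.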

Users want access to the data in real time for in-line processing in Apache Kafka and they also want the data to be delivered reliably from Apache Kafka and into the RDBMS or Data Warehouse for SQL analysis.

A Stock Exchange Use Case

Let's take a hypothetical use case where a Stock Exchange wants to store all trades on the stock exchange for the last 10 years in Pivotal Greenplum and be able to do analytics and reporting on the trades with SQL and advanced machine learning and analytics libraries. They also want the latency from the time of a trade happening on the exchange to it being ingested into Pivotal Greenplum for analysis to be minimized to just several seconds. This is possible with the new Pivotal Greenplum-Kafka Connector.

Using the connector, DBAs and application developers can create a YAML configuration file that maps the incoming data from Apache Kafka topics into Pivotal Greenplum database tables, columns, and rows. A sample YAML configuration file is available in the documentation.
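To make the mapping idea concrete, here is a minimal Python sketch rather than the connector's actual YAML format. It assumes a pipe-delimited trade message; the field names, column names, and the `message_to_row` helper are all hypothetical and chosen only for illustration.

```python
from decimal import Decimal

# Hypothetical order of fields inside one delimited Kafka message.
INPUT_FIELDS = ["symbol", "price", "quantity"]

# Hypothetical mapping: target table column -> expression over the input fields.
COLUMN_MAPPING = {
    "symbol": lambda f: f["symbol"],
    "price": lambda f: Decimal(f["price"]),
    "quantity": lambda f: int(f["quantity"]),
    # A derived column computed from two input fields.
    "notional": lambda f: Decimal(f["price"]) * int(f["quantity"]),
}


def message_to_row(message, delimiter="|"):
    """Turn one delimited message into a dict keyed by target column name."""
    values = message.split(delimiter)
    if len(values) != len(INPUT_FIELDS):
        raise ValueError(
            f"expected {len(INPUT_FIELDS)} fields, got {len(values)}"
        )
    fields = dict(zip(INPUT_FIELDS, values))
    return {column: expr(fields) for column, expr in COLUMN_MAPPING.items()}
```

The point of a declarative mapping is that target columns need not match the message fields one to one: some can be copied, some converted, and some derived.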

The Greenplum-Kafka Connector can then be started using the YAML configuration file, and all data published to the relevant Apache Kafka topics will be captured and loaded using Pivotal Greenplum's high-speed, direct-to-segment data loading architecture. Returning to our hypothetical stock exchange use case, once this process is started, it will ensure that all trades published in the queue are loaded into the Pivotal Greenplum database and available for query and analysis with a delay of only several seconds.

One of the reasons Apache Kafka is so elegant and successful is the scalability of its readers, which comes from the fact that Apache Kafka servers do not track their readers. The burden of reading data from Apache Kafka falls entirely on the consumer, in this case the Greenplum-Kafka Connector. The connector can use the atomic ACID properties of Pivotal Greenplum to store the state of its data loading progress and resume from the offset where it left off in the Kafka topic whenever needed. Data could even be truncated from Pivotal Greenplum and the ETL re-run from an earlier point in the Kafka topic by modifying the stored offsets.
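The idea of storing load progress alongside the data can be sketched in a few lines of Python. This is not the connector's implementation: it assumes SQLite (from the standard library) as a stand-in for the database, a plain Python list as a stand-in for a Kafka topic, and hypothetical table and function names.

```python
import sqlite3


def open_database():
    """Create an in-memory database standing in for the target warehouse."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE trades (kafka_offset INTEGER PRIMARY KEY, payload TEXT)"
    )
    conn.execute(
        "CREATE TABLE load_progress (topic TEXT PRIMARY KEY, next_offset INTEGER)"
    )
    conn.commit()
    return conn


def next_offset(conn, topic):
    """Return the first offset that has not been loaded yet (0 if none)."""
    row = conn.execute(
        "SELECT next_offset FROM load_progress WHERE topic = ?", (topic,)
    ).fetchone()
    return row[0] if row else 0


def load_batch(conn, topic, log, batch_size):
    """Load up to batch_size messages and advance the offset atomically."""
    start = next_offset(conn, topic)
    batch = log[start:start + batch_size]
    if not batch:
        return 0
    # "with conn" commits on success and rolls back on any exception, so the
    # loaded rows and the stored offset cannot get out of step.
    with conn:
        conn.executemany(
            "INSERT INTO trades (kafka_offset, payload) VALUES (?, ?)",
            [(start + i, message) for i, message in enumerate(batch)],
        )
        conn.execute(
            "INSERT OR REPLACE INTO load_progress (topic, next_offset) "
            "VALUES (?, ?)",
            (topic, start + len(batch)),
        )
    return len(batch)


def rewind(conn, topic, offset):
    """Discard rows at or after offset and reset progress so they reload."""
    with conn:
        conn.execute("DELETE FROM trades WHERE kafka_offset >= ?", (offset,))
        conn.execute(
            "INSERT OR REPLACE INTO load_progress (topic, next_offset) "
            "VALUES (?, ?)",
            (topic, offset),
        )
```

Because the rows and the offset are written in one transaction, a crash between batches leaves the loader able to resume exactly where it stopped, and `rewind` models re-running a load from an earlier point in the topic.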

All of these characteristics make for a future world of continual data loading in real time, where in-place data transformations and processing can be done in Apache Kafka and then Apache Kafka topic data can be reliably and atomically transported into Pivotal Greenplum for deep analytics that require multi-row aggregation and analysis.

Welcome to the future today!

For more information about the Greenplum-Kafka Connector, please read the Greenplum documentation.

About the Author

Ivan has been working on big data, databases, and enterprise systems for over a decade. He spent 5 years in the financial industry building trading systems, worked at Yahoo on the data warehouse system before Hadoop was created, hacked on a MySQL storage engine for a year, and has spent the last 10 years in various capacities working on the Pivotal Greenplum product. Ivan's passion is building next-generation data platforms. In his free time, he has also been a beginning yoga student for the last 10 years. Born and raised in NYC, Ivan is now enjoying the California lifestyle, where he has resided since 2006.


