Pivotal Greenplum Database® (GPDB) is an advanced, fully featured, open source data warehouse. GPDB provides powerful and rapid analytics on petabyte scale data volumes. Greenplum 5.17.0 brings support to access highly-scalable cloud object storage systems such as Amazon S3, Azure Data Lake, Azure Blob Storage, and Google Cloud Storage.
Minio is a high performance distributed object storage server, designed for large-scale private cloud infrastructure. Since Minio supports S3 protocol, GPDB can also access Minio server that is deployed on-premise or cloud. One of the advantages of using Minio is pluggable storage backend that supports DAS, JBODs, external storage backends such as NAS, Google Cloud Storage and as well as Azure Blob Storage.
In this post, you will learn to setup Greenplum with Minio in 10 minutes.
Storing cold data
Enterprises are leveraging external storage systems to store cold data such as historical sales data, old transaction data, and so on. Data that can be effectively stored on external storage systems such as Minio distributed object storage. Whenever Greenplum customers want to run analytics workloads on such datasets, customers can leverage PXF to dynamically load data from Minio into their Greenplum cluster. Since Minio provides virtual storage for Kubernetes, local drive, NAS, Azure, GCP, Cloud Foundry and DC/OS, this use cases enable import / export operations to those virtual storage systems.
Sharing data with external systems
Typically, enterprises have needs to share data with multiple RDBMS and systems across the organization. One of the data sharing patterns is to store the data in an distributed object storage system such as Minio. Greenplum users export existing data into Minio so other applications can access the shared data from Minio.
How to configure Minio in Greenplum
$ su - gpadmin
Create a PXF Server Configuration.
$ mkdir -p $PXF_CONF/servers/minio
*Note: A PXF server configuration in
$PXF_CONF/serversis analogous to Foreign Data Wrapper Servers where each server represents a distinct remote system you want to connect to.
Copy the provided minio template into the server.
$ cp $PXF_CONF/templates/minio-site.xml $PXF_CONF/servers/minio $ cat $PXF_CONF/servers/minio/minio-site.xml <?xml version="1.0" encoding="UTF-8"?> <configuration> <property> <name>fs.s3a.endpoint</name> <value>YOUR_MINIO_URL</value> </property> <property> <name>fs.s3a.access.key</name> <value>YOUR_AWS_ACCESS_KEY_ID</value> </property> <property> <name>fs.s3a.secret.key</name> <value>YOUR_AWS_SECRET_ACCESS_KEY</value> </property> <property> <name>fs.s3a.fast.upload</name> <value>true</value> </property> <property> <name>fs.s3a.path.style.access</name> <value>true</value> </property> </configuration>
$ sed -i "s|YOUR_MINIO_URL|http://minio1:9000|" $PXF_CONF/servers/minio/minio-site.xml $ sed -i "s|YOUR_AWS_ACCESS_KEY_ID|minio|" $PXF_CONF/servers/minio/minio-site.xml $ sed -i "s|YOUR_AWS_SECRET_ACCESS_KEY|minio123|" $PXF_CONF/servers/minio/minio-site.xml *Note: sed in mac has some issues. If you have issues in mac use `sed -i '' -i ...`.
Use psql to create external table that uses the
minioserver to access the
stocks.csvtext file in our minio
CREATE EXTERNAL TABLE stock_fact_external ( stock text, stock_date text, price text) LOCATION('pxf://testbucket/stocks.csv?PROFILE=s3:text&SERVER=minio') FORMAT 'TEXT';
Use SQL query to retrieve data from Minio. This query returns the resultset from Minio servers that are preloaded with sample files under
gpadmin=# select count(*) from stock_fact_external; count ------- 561 (1 row) gpadmin=# select * from stock_fact_external limit 10; stock | stock_date | price --------+------------+------- symbol | date | price MSFT | Jan 1 2000 | 39.81 MSFT | Feb 1 2000 | 36.35 MSFT | Mar 1 2000 | 43.22 MSFT | Apr 1 2000 | 28.37 MSFT | May 1 2000 | 25.45 MSFT | Jun 1 2000 | 32.54 MSFT | Jul 1 2000 | 28.4 MSFT | Aug 1 2000 | 28.4 MSFT | Sep 1 2000 | 24.53 (10 rows)
In summary, you can use Minio, distributed object storage to dynamically scale your Greenplum clusters.