Using Greenplum to access Minio distributed object storage server

February 25, 2019 Kong-Yew Chan),engineering-blog@pivotal.io (Francisco Guerrero

Pivotal Greenplum Database® (GPDB) is an advanced, fully featured, open source data warehouse. GPDB provides powerful and rapid analytics on petabyte scale data volumes. Greenplum 5.17.0 brings support to access highly-scalable cloud object storage systems such as Amazon S3, Azure Data Lake, Azure Blob Storage, and Google Cloud Storage.

Minio is a high performance distributed object storage server, designed for large-scale private cloud infrastructure. Since Minio supports S3 protocol, GPDB can also access Minio server that is deployed on-premise or cloud. One of the advantages of using Minio is pluggable storage backend that supports DAS, JBODs, external storage backends such as NAS, Google Cloud Storage and as well as Azure Blob Storage.

In this post, you will learn to setup Greenplum with Minio in 10 minutes.

Use cases

Storing cold data

Enterprises are leveraging external storage systems to store cold data such as historical sales data, old transaction data, and so on. Data that can be effectively stored on external storage systems such as Minio distributed object storage. Whenever Greenplum customers want to run analytics workloads on such datasets, customers can leverage PXF to dynamically load data from Minio into their Greenplum cluster. Since Minio provides virtual storage for Kubernetes, local drive, NAS, Azure, GCP, Cloud Foundry and DC/OS, this use cases enable import / export operations to those virtual storage systems.

Sharing data with external systems

Typically, enterprises have needs to share data with multiple RDBMS and systems across the organization. One of the data sharing patterns is to store the data in an distributed object storage system such as Minio. Greenplum users export existing data into Minio so other applications can access the shared data from Minio.

How to configure Minio in Greenplum

You can configure GPDB to access external tables such as Minio, S3 and any S3 compatible object storage including Dell EMC Elastic Cloud Storage(ECS).

  1. Login as gpadmin.

    $ su - gpadmin
    
  2. Create a PXF Server Configuration.

    $ mkdir -p $PXF_CONF/servers/minio
    

    *Note: A PXF server configuration in $PXF_CONF/servers is analogous to Foreign Data Wrapper Servers where each server represents a distinct remote system you want to connect to.

  3. Copy the provided minio template into the server.

    $ cp $PXF_CONF/templates/minio-site.xml $PXF_CONF/servers/minio
    $ cat $PXF_CONF/servers/minio/minio-site.xml
    <?xml version="1.0" encoding="UTF-8"?>
    <configuration>
        <property>
            <name>fs.s3a.endpoint</name>
            <value>YOUR_MINIO_URL</value>
        </property>
        <property>
            <name>fs.s3a.access.key</name>
            <value>YOUR_AWS_ACCESS_KEY_ID</value>
        </property>
        <property>
            <name>fs.s3a.secret.key</name>
            <value>YOUR_AWS_SECRET_ACCESS_KEY</value>
        </property>
        <property>
            <name>fs.s3a.fast.upload</name>
            <value>true</value>
        </property>
        <property>
            <name>fs.s3a.path.style.access</name>
            <value>true</value>
        </property>
    </configuration>
    
  4. Configure YOUR_MINIO_URL, YOUR_AWS_ACCESS_KEY_ID, and YOUR_AWS_SECRET_ACCESS_KEY properties in $PXF_CONF/servers/minio/minio-site.xml.

    $ sed -i "s|YOUR_MINIO_URL|http://minio1:9000|" $PXF_CONF/servers/minio/minio-site.xml
    $ sed -i "s|YOUR_AWS_ACCESS_KEY_ID|minio|" $PXF_CONF/servers/minio/minio-site.xml
    $ sed -i "s|YOUR_AWS_SECRET_ACCESS_KEY|minio123|" $PXF_CONF/servers/minio/minio-site.xml
    *Note: sed in mac has some issues. If you have issues in mac use `sed -i '' -i ...`.
    
  5. Use psql to create external table that uses the minio server to access the stocks.csv text file in our minio testbucket.

    CREATE EXTERNAL TABLE stock_fact_external (
    stock text,
    stock_date text,
    price text)
    LOCATION('pxf://testbucket/stocks.csv?PROFILE=s3:text&SERVER=minio')
    FORMAT 'TEXT';
    
  6. Use SQL query to retrieve data from Minio. This query returns the resultset from Minio servers that are preloaded with sample files under testbucket.

    gpadmin=# select count(*) from stock_fact_external;
    count
    -------
       561
    (1 row)
    
    gpadmin=# select * from stock_fact_external limit 10;
     stock  | stock_date | price
    --------+------------+-------
     symbol | date       | price
     MSFT   | Jan 1 2000 | 39.81
     MSFT   | Feb 1 2000 | 36.35
     MSFT   | Mar 1 2000 | 43.22
     MSFT   | Apr 1 2000 | 28.37
     MSFT   | May 1 2000 | 25.45
     MSFT   | Jun 1 2000 | 32.54
     MSFT   | Jul 1 2000 | 28.4
     MSFT   | Aug 1 2000 | 28.4
     MSFT   | Sep 1 2000 | 24.53
    (10 rows)
    

Conclusion

This post describes how to configure Greenplum to access Minio. For more details, please read this example on this github repository. For more information about PXF, please read this page

In summary, you can use Minio, distributed object storage to dynamically scale your Greenplum clusters.

Previous
Introducing the Monitoring Indicator Protocol, a New Observability as Code Project for Cloud Foundry and Kubernetes
Introducing the Monitoring Indicator Protocol, a New Observability as Code Project for Cloud Foundry and Kubernetes

The Indicator Protocol is an observability-as-code project which allows developers to define and expose per...

Next
Out-of-the-Box Application Observability with Spring Boot & Pivotal Cloud Foundry
Out-of-the-Box Application Observability with Spring Boot & Pivotal Cloud Foundry

How to monitor your app and keep it healthy using platform metrics and application metrics.