With the explosion of data stores and cloud services, data now resides across many disparate systems and in a variety of formats. When multiple datasets exist in external systems, it is often necessary to perform a lengthy ETL (extract, transform, load) operation to get data into the database. But what if we only need a small subset of the data? What if we only want to query the data to answer a specific question or to create a specific visualization? In this case, it's often more efficient to join datasets remotely and return only the results, rather than negotiate the time and storage requirements of a full data load operation.
Here’s a better way to query distributed data
Enterprises have moved away from the traditional approach of storing all data in a centralized data warehouse. It’s easy to see why.
Consider a data source tracking monthly sales across many years (shown in Figure 1 below). Transactional data represents real-time information, also known as “hot data.” Unsurprisingly, the most recent data is most frequently accessed by the application. This operational data is stored in MySQL.
Data subject to more analytics queries is “warm data” is stored in Greenplum. Older or “cold data” is accessed rarely and is archived in object storage like AWS S3.
Figure 1: Rationalizing data: hot, warm, and cold
It becomes cost prohibitive to re-architect and migrate these data stores. Yet, the business still needs to query hot, warm, and cold data from time to time.
Consider skipping the pain of a full ETL operation. Just join the remote datasets, and then return only the results. Don’t waste the time and processing power to perform a full loading operation. Instead, use a federated query!
A federated query joins different data sources, and returns only a copy of the data we need. The original dataset remains stored in its source system. You can perform such a search using the Greenplum Platform Extension Framework (PXF). PXF is designed to make it easy for Greenplum to query data in external data sources. Here’s how it works.
What is the Greenplum Platform Extension Framework?
The PXF framework maps Greenplum external table definitions to a variety of external data sources: data objects in the cloud, in Hadoop, in SQL databases, in a variety of common formats. PXF speeds retrieval of data with parallel, high-throughput access.
The PXF framework is composed of three main components.
PXF Extension is a C client library that implements a PostgreSQL extension. This extension runs on each database segment and communicates with the PXF Server using REST API calls.
PXF Server is a Java web application deployed in Apache Tomcat running on each Greenplum segment host in the cluster. The PXF Server receives requests from the PXF extension and translates those requests to the external systems.
PXF Command Line Interface is an application that enables users to manage the PXF Servers in their Greenplum installation.
Now, let’s explore how these components interact.
When a Greenplum user runs a query against a PXF external table, the query plan is generated, and then dispatched from the Greenplum master host to the Greenplum segments. The PXF Extension running on each Greenplum segment process, then forwards the request to the PXF Server running on the same host. A typical PXF deployment topology, as shown in Figure 2 below, places a PXF Server on each host that runs one or more Greenplum segment processes. A PXF Server receives at least as many requests as the number of segment processes running on the host, and potentially more if the query is complex and contains multiple processing slices. Each such request gets a PXF Server thread assigned to it.
Figure 2: PXF Architecture
PXF CLI is a utility that provides a convenient way to administer PXF servers running on Greenplum. With the PXF CLI, you can run commands on a specific PXF server (resident on an individual Greenplum segment host) or on all PXF servers at once. It enables you to perform actions like starting and stopping the server, checking the server's status, and other administrative tasks.
For more details on the architecture, watch our PXF overview.
An inside look at PXF concepts
PXF has three interfaces to read and write data from external data sources: the Fragmenter, Accessor, and Resolver.
Fragmenter is a functional interface that splits the overall dataset from an external data source into a list of independent fragments that can be read in parallel. The Fragmenter does not retrieve the actual data; it works only with metadata.
Accessor is a functional interface that reads or writes data from/to external data sources. It also converts data into individual records. PXF provides different accessors to connect to remote databases using JDBC, cloud object storage vendors, and remote Hadoop clusters using HDFS Client libraries.
Resolver is a functional interface that decodes (when reading) or encodes (when writing). The decode/encode operation is performed by an Accessor into individual fields when the record is read from/to the external system. It then maps field data types and values into a format that Greenplum (or the external system) understands.
PXF supplies multiple implementations for these interfaces that utilize different communication protocols and support multiple formats of data.
To improve usability, PXF provides a concept of a Profile. A profile is a simple named mapping that represents two things: the protocol for connecting to an external data source, and the format of the data that needs to be processed. For example, let’s say we want to read a text file from S3 cloud storage. We could then specify the profile “s3:text”. We could create a “hdfs:parquet” profile when reading a parquet file from HDFS.
Users specify the profile parameter value when they define the Greenplum external table. At runtime, PXF receives the value of the profile parameter, along with the request for data from Greenplum. PXF then refers to its system configuration to select the appropriate Fragmenter, Accessor, and Resolver implementations for data retrieval and processing.
Out of the box, Greenplum PXF supports many of the most popular data sources and formats:
Text, Parquet, Avro, ORC, Hive RC, JSON, SequenceFile
HDFS, HBase, Hive, Amazon S3, Azure Blob Storage, Azure Data Lake, Google Cloud Storage, Minio, SQL Databases (via JDBC)
The video Using PXF with External Data goes into greater detail on how PXF retrieves data from a variety of diverse sources.
PXF: the top 5 features
Now that you know how to query all of the data across your enterprise, you may be wondering how fast your queries will be. You’ll appreciate these key capabilities to speed things up.
Filter Pushdown & Column Projection: Reduce Data for Performance
With filter pushdown, PXF uses query constraints (e.g., dates, ranges, booleans, etc. from a WHERE clause) to improve query performance and reduce unwanted data movement.
The following PXF connectors support filter pushdown:
S3 Connector using the Amazon S3 Select service
Another way to limit data returned is with column projection. With column projection, only the columns required by a SELECT query on an external table are returned, instead of the entire table.
Amazon Web Services S3 Select API: Easily Query Your Cloud Data Lake
S3 Select is a feature of the popular S3 object store that PXF supports. For queries on CSV and Parquet formatted data, S3 Select only returns a portion of the data, instead of the entire object. With S3 Select, cloud applications can land data in an S3 data lake. From there, Greenplum can query and analyze the data via PXF. Reading is done in parallel; results can be written back to S3.
JDBC Read Partitioning: Fast, Efficient Parallel Querying
PXF supports partitioning during JDBC query processing when you specify the PARTITION_BY option in the external table definition. Note that partitioning does not refer to whether the data is actually stored in partitioned tables in the remote database. Rather, it provides a hint to PXF that the dataset in the remote table can be processed in parallel, with each partition being retrieved by a separate PXF thread. When the data is partitioned the same way in both storage and the remote database, the retrieval by the remote database will also be very efficient. The remote database need only scan a given partition (and not the whole table) to satisfy a query for a partition from PXF.
JDBC Connector and Named Queries
PXF also provides a JDBC connector to many common SQL, and SQL-like databases with read/write access. The JDBC connector supports named queries, which are static. These use an explicit name (instead of an external table name) to run against remote SQL databases.
When might you want to use named queries?
You need to join tables within the same external database
You want to perform complex aggregations close to the data source
You’re not allowed to create a VIEW in the external database
The external system has better computational performance
Learn more about named queries in this video.
Greenplum Users - Give PXF a Try!
For times when we want to query data from a remote system, federated queries are a powerful tool to avoid wasted time and processing power involved in a full loading operation. Review the documentation, and learn how to leverage the power of PXF!
This post also featured contributions from the Pivotal Unmanaged Data team: Alexander Denissov, Francisco Guerrero, Venkatesh Raghavan, Oliver Albertini, Divya Bhargov, and Lisa Owen.
About the AuthorFollow on Twitter Follow on Linkedin More Content by Bob Glithero