Joint work performed by Scott Kahler and Ian Redzic.
With the rampant, well-publicized data breaches of the past several years and associated regulatory requirements, companies are increasingly adding security and cryptographic functions to their data at rest. This applies to big data is well, including users of Pivotal Greenplum. For these companies, Protegrity offers Functional Data Encryption—a strong alternative over default crypto in Pivotal Greenplum.
In this post, we will explain how traditional Greenplum crypto works in comparison to Protegrity’s solution, walk through Protegrity set-up on Greenplum, show how data at rest is encrypted, and give examples of SQL code for Protegrity crypto function calls.
Default Greenplum Crypto & Protegrity’s Functional Data Encryption
In 2015, Pivotal released a technical white paper Basics of Encrypting Data in Greenplum Database, providing high-level instructions on how to use the PostgreSQL
pgcrypto extension to encrypt data in Greenplum. While this approach is appropriate for many situations, using
pgcrypto to encrypt data-at-rest can be cumbersome depending on your organizational and user needs. For example, you must first create encryption keys using GPG to encrypt plaintext/bytes with the
pgcrypto extension, then run the following SQL command for every INSERT with the encryption key:
INSERT INTO userssn(username, ssn) SELECT robotccs.username, pgp_pub_encrypt(robotccs.ssn, keys.pubkey) AS ssn FROM ( VALUES ('Alice', '123-45-6788'), ('Bob', '123-45-6799')) AS robotccs(username, ssn) CROSS JOIN (SELECT dearmor('-----BEGIN PGP PUBLIC KEY BLOCK----- Version: GnuPG v2.0.14 (GNU/Linux) mQENBFS1Zf0BCADNw8Qvk1V1C36Kfcwd3Kpm/dijPfRyyEwB6PqKyA05jtWiXZTh 2His1ojSP6LI0cSkIqMU9LAlncecZhRIhBhuVgKlGSgd9texg2nnSL9Admqik/yX ... /UUZB+dYqCwtvX0nnBu1KNCmk2AkEcFK3YoliCxomdOxhFOv9AKjjojDyC65KJci Pv2MikPS2fKOAg1R3LpMa8zDEtl4w3vckPQNrQNnYuUtfj6ZoCxv =XZ8J -----END PGP PUBLIC KEY BLOCK-----' AS pubkey) AS keys;
Additionally, the use of
pgcrypto requires a separate key management solution. While PL/SQL triggers can make use of
pgcrypto much easier, using
pgcrypto with triggers in Greenplum is not possible due to its MPP architecture. So, Pivotal has partnered with Protegrity to simplify encrypting data-at-rest in Greenplum.
Protegrity provides a set of functions that datapasses through and is encrypted in a specialized transformation process. After the function executes, the physical data resides on disk in a new format and satisfies encryption at rest requirements. A similar process is used to decrypt the data and determine if a user can access all or part of it. The function accomplishes crypto processes by capturing the user information prior to encryption and passing those credentials to a local Protegrity agent, which is installed on all Greenplum nodes. This process communicates with the server, where a catalog of policies is maintained, accessed, and applied. During decryption, the function accesses the policies to apply, determines data access rights for the user and decrypts and/or masks the data according to the policy. These policies are created and associated with users in a centralized Protegrity Enterprise Security Administrator (ESA) service, giving data security administrators a central point to maintain data policies among various platforms. This also removes any need for admins to log into the database to change a policy. In addition, the ESA helps establish a separation of duties, where operational users cannot access data in the clear without security administrator permission.
Architecture Diagram and Overview
At a high level the main components are the Pivotal Greenplum cluster, the Protegrity PEP agents and the Protegrity ESA.
The security administrator will interact with the Protegrity ESA. There they will declare data policies that determine what encryption algorithms will be used to store the data. Additionally they define user policies as to who is allowed to access and decrypt the data as well as any rules that determine if the user will see all of the data or only masked portions. These policies are then pushed down to the PEP agents running on all of the Pivotal Greenplum servers.
On the Pivotal Greenplum cluster, when a user issues a query that utilizes one of the Protegrity provided functions, the query will retrieve the data and then the function will contact the local PEP agent. The function provides the PEP agent with the user the query is using. This way, it can look into the catalog provided by the ESA for rules that apply to this user and the data they are attempting to access. It takes the piece of data and checks if they are authorized to access it through the function. The PEP agent will then run any encryption/decryption on the data as well as apply any masking as set forth in its catalog of rules and return the value.
In addition to this methodology of function execution within the database, Protegrity provides tools that provide for the functions to encrypt the data outside of the database. This could be used as part of the ETL process to transform the data before it is loaded into Pivotal Greenplum, speeding up the data ingestion process by removing the need to encrypt the data on the fly while ingesting.
Setting Up Protegrity in Pivotal Greenplum
The use of any Protegrity product first requires installation of the Protegrity ESA server. Installing the Protegrity ESA is beyond the scope of this blog, but this information can be easily obtained from your Protegrity account team.
After the ESA is in place, the Protegrity Database Protector for Pivotal Greenplum is then installed. This includes an installation for PEP (Protegrity Enforcement Point) server processes on all of Greenplum nodes and the master. This also means it is necessary that all of the servers within the Greenplum cluster can reach the Protegrity ESA server, wherever it resides on the network.
Protegrity maintains an installation guide called Protegrity Database Protector – Pivotal Greenplum. This document lists the steps necessary to install the PEP servers on the Greenplum master and nodes. It also contains directions on how to import the UDFs (User-Defined Functions, which are small code routines embedded in the database) necessary for Greenplum to manipulate the data and communicate with the PEP application.
In order to verify the installation completed successfully, look to see if the Protegrity functions have been created:
Additionally check to see if the
pty_whoami function returns the id of the user executing the function.
If these functions are working, you have a successful install and are ready to start obfuscating some data.
Encrypting Data-at-Rest using Protegrity Data Protector
As an example of what functionality the Database Protector can provide, let’s look at the following. In this instance, we have a simple table with an id used to identify the row, ssn to identify the individual, a rating the individual provided, and the date in which they provided it.
When the user
gpuser wants to access the full data, including the encrypted SSN, they use the view
v_sample_ssn, which is simply a view of the
sample_ssn_parts table with a function applied to the SSN field. The function uses the
gpuser identity to make a request to the ESA for an access policy and gets a key. It applies the policy and key to the data and decrypts it so they see the proper data.
Also, another rule within Protegrity ESA has been applied, and it ensures data will be masked for any other user attempting to look at that data through the view. When the data is accessed by a sneaky admin trying to get social security numbers, it will only return the last 4 digits—masking and protecting the rest of the sensitive data.
The million dollar question is, what data actually exists in the table being accessed? If a user were to access the underlying table directly, they will see the following information:
These are tokenized versions of the existing data. Even though it is not the actual data, the representation is consistent and maintains the format of the original data. Protegrity can use a variety of encryption algorithms, but this data tokenization is especially useful for data scientists. Constantly passing data through functions is an additional overhead to processing, and normal encryption usually makes it impossible to use the data until it is decrypted. Tokenization, in contrast, transforms data into a form that masks its actual value but can still be used by analytic algorithms.
Implementing SQL with Protegrity Data Protector Tokenization
Once the ESA is installed, PEP processes are running, and the Protegrity functions have been installed, the next step is to work with a Protegrity administrator to setup a data protection policy. The administrator must create a policy in the Protegrity Security Manager for an SSN element that will use credit card tokenization, allow
gpuser to detokenize the data, and let all other users to see masked versions of the data. The policy needs to be configured to be pushed down to the PEP agents, which are running on each server in the Greenplum cluster to be referenced as the functions execute within the database. All of this is managed outside of Greenplum, which sets up a good separation of duties.
Once this is setup, we are ready to get working on what we need inside the database. The first thing we will need is the table to store the data:
CREATE TABLE sample_ssn_parts ( id INT PRIMARY KEY, ssn VARCHAR, rating INT, rating_date DATE ) DISTRIBUTED BY (id);
This will hold the raw data that is put on disk. Data that goes into the SSN field will need to be accessed through the function provided by Protegrity to tokenize the data and detokenize it as it is extracted. If we want to simply put tokenized data into the database we would use something like this:
INSERT INTO sample_ssn_parts ( id, ssn, rating, rating_date ) VALUES ( 1, pty_varcharins(‘123-45-6789,'ssn'), 2, ‘2016-04-01’);
This will call up the tokenization function, which contacts the local PEP server to determine what tokenized value to place in the actual table.
If you were to attempt to retrieve that data back “in the clear,” the SSN would appear as a totally different number, which is it’s tokenized value.
In order to retrieve the value back in it’s original form, the Protegrity function needs to be called on that piece of data.
SELECT id, pty_varcharsel(ssn,'ssn') as ssn, rating, rating_date FROM sample_ssn_parts WHERE id = 1;
As the data is selected, the function will check with the PEP process to see if the user is allowed to access the data and, if so, what format it should be returned in. In this case,
gpuser executed this query, and it returns the original inserted values. Any other user would get back ###-##-6789 for the SSN value. This is based on the policies created on the ESA and pushed down to the local PEP process.
It becomes a bit burdensome remembering to add these functions each time you enter a query and you may wish to restrict user access away from the base table. This is where it is beneficial to create a view that automatically applies the function.
CREATE VIEW v_sample_ssn_parts AS SELECT id, pty_varcharsel(ssn,'ssn') as ssn, rating, rating_date FROM sample_ssn_parts;
Now anyone entering the query such as
SELECT id, ssn, rating, rating_date FROM v_sample_ssn_parts WHERE id = 1;
will automatically have the function applied, and this view is what most users would gain access to. Next, we need to make it easier to insert, update and delete the data. Ideally, the functions being applied are as transparent as possible. This can be done by creating rules for INSERT, UPDATE and DELETE against the table.
CREATE OR REPLACE RULE insert_sample_ssn AS ON INSERT TO v_sample_ssn_parts DO INSTEAD INSERT INTO sample_ssn_parts ( id, ssn, rating, rating_date ) VALUES ( NEW.id, pty_varcharins(NEW.ssn,'ssn'), NEW.rating, NEW.rating_date); CREATE OR REPLACE RULE update_sample_ssn AS ON UPDATE TO v_sample_ssn_parts DO INSTEAD UPDATE sample_ssn_parts SET ssn = pty_varcharins(NEW.ssn,'ssn'), rating = NEW.rating, rating_date = NEW.rating_date WHERE id = NEW.id; CREATE OR REPLACE RULE delete_sample_ssn AS ON DELETE TO v_sample_ssn_parts DO INSTEAD DELETE FROM sample_ssn_parts WHERE id = OLD.id;
It should be noted that for these rules to work, tuples in the table must be uniquely identifiable. The RULE can capture work to be done, but it will need to go back and apply the logic set forth in the RULE to the specific records in which the login in the original query would have affected.
All of the data for ssn will be stored in it’s tokenized format in the actual
sample_ssn table, while users can access
v_sample_ssn and treat it much like a normal table. If a Data Scientist wanted to run a clustering algorithm on this data, rather than passing it through the view they could instead access the raw data in
sample_ssn to avoid the overhead but still not access the actual ssn numbers.
As you can see, the Database Protector from Protegrity provides a compelling way to tokenize and encrypt data at rest, manage those rights in a system that exists outside the database, and obfuscate the data in a way that doesn’t require it to be decrypted for use.
- Read the Pivotal Greenplum Administrator Guide
- Download the Pivotal Greenplum Best Practices Guide
- Find out more from the Pivotal Greenplum product page or read more blog articles
- Watch the latest Pivotal Greenplum videos
- Be part of the Pivotal Greenplum social channels: Twitter and LinkedIn
About the Author