Inside The New Solr-powered, SQL Text Analytics Engine For Greenplum

December 13, 2016 Bharath Sitaraman


When it comes to searching for and correlating pretty much anything online, text still anchors web and app experiences. Yet despite (or because of) its ubiquity, the slew of languages, slang, and colloquialisms tends to muddy the waters of text search.

Under the hood, only sophisticated search metrics and analytics can make sense of all this structured and unstructured text.

That’s where GPText comes in.

Understanding and leveraging text is a complex problem, but with GPText, we solve important parts of it. GPText is a combination of Apache Solr and Pivotal Greenplum. Solr is a popular open source search engine server for enterprises. Greenplum is a massively parallel processing data warehouse adept at in-database analytics and data science workloads.

GPText takes the flexibility and configurability of Solr and merges it with the scalability and easy SQL interface of Greenplum. The result is a tool that enables organizations to process mass quantities of raw text data for large-scale text analytics, including semi-structured and structured data (social media feeds, email databases, documents, etc.).

Greenplum users can index tables filled with raw text columns into Solr indexes. This means they can use the familiar SQL interface (SQL user-defined functions) to search quickly and efficiently through raw text data and filter on the other structured columns in their tables. Further, users can pipe the results of their searches into Apache MADlib’s analytical libraries for clustering, classification, sentiment analysis, and other advanced analytics capabilities. Apache MADlib (incubating), for those not familiar with it, is a SQL-based open source machine learning library.
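To make the idea of combining text search with structured filters concrete, here is a minimal sketch in Python. It is a toy model only: the rows, column names, and the `build_index` and `search` helpers are hypothetical and are not GPText functions. In GPText itself the index lives in Solr and the query is issued through SQL.

```python
from collections import defaultdict

# Hypothetical rows standing in for a Greenplum table with one raw-text column.
ROWS = [
    {"id": 1, "region": "EMEA", "body": "Brake pedal feels soft after service"},
    {"id": 2, "region": "APAC", "body": "Customer praised the new brake pads"},
    {"id": 3, "region": "EMEA", "body": "Engine noise reported at low speed"},
]


def build_index(rows, text_col):
    """Map each lowercased token to the set of row ids that contain it."""
    index = defaultdict(set)
    for row in rows:
        for token in row[text_col].lower().split():
            index[token].add(row["id"])
    return index


def search(index, rows, term, **filters):
    """Return ids of rows that contain `term` and match every structured filter."""
    hits = index.get(term.lower(), set())
    return sorted(
        row["id"]
        for row in rows
        if row["id"] in hits and all(row[k] == v for k, v in filters.items())
    )


INDEX = build_index(ROWS, "body")
print(search(INDEX, ROWS, "brake"))                 # text match only
print(search(INDEX, ROWS, "brake", region="EMEA"))  # text match plus structured filter
```

The second call narrows the text hits with an ordinary column predicate, which is the same division of labor the paragraph above describes: the index answers the text question, and the table columns answer the structured one.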

Using Text Analytics In Healthcare, Financial Services, And Beyond

Many Pivotal customers could potentially use GPText in conjunction with other Greenplum offerings to tackle a varied set of problems. In fact, several use cases are currently in flight with Pivotal customers. In the medical field, for example, analyzing text helps detect and assess patient risk patterns and factors, which in turn helps clinicians decide which tests to run and other courses of action for their patients. Financial firms can examine internal company communications, such as emails and instant messages, to detect possible instances of fraud. Within the auto industry, manufacturers can mine service records to identify potential defects early on and issue recalls or improvements in subsequent models.

Brief Overview Of Solr/GPText Architecture

There are two ways to think about GPText architecture: physical and logical.

From a physical architecture perspective, GPText allows the user to run a number of Solr processes that is independent of the number of Greenplum segment processes running on the system. This allows flexibility in terms of performance as well as memory management. In the figure below, GPText is configured for an equal mapping of Solr processes and Greenplum segment processes on each host (Figure 1), but this configuration can be adjusted to a customer’s needs. Users looking to improve indexing and search performance can simply increase the number of Solr nodes running on each host. Users looking to be more conservative in resource consumption can run fewer Solr nodes while maintaining parallelism.


Figure 1. Greenplum and Solr processes run independently.

From a logical perspective, GPText shards indexes with the same distribution as the original table. Each shard contains a leader (similar to a Greenplum primary segment) and a configurable number of followers (similar to Greenplum mirror segments) (Figure 2).


Figure 2. GPText shards indexes with the same distribution as the original table.

How exactly do the two layers map to each other? As shown below, the replica blocks of each shard are distributed across the Solr nodes, maintaining an even and highly available distribution (Figure 3).

Figure 3. How the GPText replica blocks of each Greenplum shard get distributed across the Solr nodes.
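One simple way to picture an even, highly available layout is round-robin placement, sketched below in Python. This is an illustrative assumption, not the placement algorithm GPText or Solr actually uses; the `place_replicas` function and the node names are hypothetical.

```python
def place_replicas(num_shards, replicas_per_shard, nodes):
    """Assign each shard's replicas to distinct nodes in round-robin order.

    Returns {shard: [node, ...]}. By convention in this sketch, the first
    node in each list is the leader and the rest are followers.
    """
    if replicas_per_shard > len(nodes):
        raise ValueError("need at least as many nodes as replicas per shard")
    placement = {}
    for shard in range(num_shards):
        # Consecutive offsets guarantee the replicas of one shard never
        # share a node, so losing one node cannot remove a whole shard.
        placement[shard] = [
            nodes[(shard + r) % len(nodes)] for r in range(replicas_per_shard)
        ]
    return placement


NODES = ["solr0", "solr1", "solr2", "solr3"]
PLACEMENT = place_replicas(num_shards=4, replicas_per_shard=2, nodes=NODES)
for shard, assigned in PLACEMENT.items():
    print(f"shard{shard}: leader={assigned[0]} followers={assigned[1:]}")
```

With four shards, two replicas each, and four nodes, every node ends up hosting exactly two replicas, and no shard keeps both of its replicas on the same node.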


GPText 2.0: Highly Available For Richer Text Analytics

The primary focus of this release of GPText is high availability. The design ensures that if any single node goes down, at least one replica of each shard is still available. If a leader replica is lost, ZooKeeper (the configuration manager and monitor) simply elects a new leader. As long as a single replica of each shard is available, GPText remains operational and able to index and search. This proves extremely valuable for customers operating in high-risk environments where even a second of downtime is unacceptable.
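The availability rule above can be sketched in a few lines of Python. This is a deliberate simplification: real leader election is coordinated through ZooKeeper, whereas the hypothetical `fail_node` helper below just promotes the first surviving replica. The shard and node names are made up for illustration.

```python
def fail_node(placement, failed):
    """Drop `failed` from every shard; the first surviving replica leads.

    `placement` maps shard -> ordered replica nodes, leader first. Returns
    the new placement and whether every shard still has a replica.
    """
    survivors = {
        shard: [node for node in replicas if node != failed]
        for shard, replicas in placement.items()
    }
    # The cluster stays operational only if no shard has lost all replicas.
    operational = all(survivors.values())
    return survivors, operational


placement = {
    "shard0": ["solr0", "solr1"],
    "shard1": ["solr1", "solr2"],
    "shard2": ["solr2", "solr0"],
}

after_one, ok_one = fail_node(placement, "solr1")
print(after_one["shard1"], ok_one)   # shard1 lost its leader; solr2 takes over

after_two, ok_two = fail_node(after_one, "solr2")
print(after_two["shard1"], ok_two)   # shard1 now has no replicas left
```

Losing one node leaves every shard with a replica, so indexing and search can continue. Losing a second node that holds the last replica of a shard is what breaks availability, which is why the number of followers per shard is configurable.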

Beyond these architectural enhancements, GPText offers many improvements in text search. On top of the analyzer chains available through Solr, GPText adds two text processing analyzers: one that can tokenize and parse international text (multiple languages), and one for social media text (including #hashtags, URLs, emoticons, and @mentions).
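To show why social media text needs its own tokenization rules, here is a minimal regex-based sketch in Python. It is not the GPText analyzer, whose implementation is not described here; the pattern and the `tokenize` function are assumptions chosen to keep hashtags, mentions, URLs, and a few common emoticons intact instead of splitting them on punctuation.

```python
import re

# Alternatives are tried in order, so URLs, mentions, hashtags, and emoticons
# are recognized before falling back to plain words.
TOKEN_RE = re.compile(
    r"""
    (?P<url>https?://\S+)
  | (?P<mention>@\w+)
  | (?P<hashtag>\#\w+)
  | (?P<emoticon>[:;=][-']?[)(DPp])
  | (?P<word>\w+(?:'\w+)?)
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return (token_type, token) pairs, skipping anything unrecognized."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]


for kind, token in tokenize("Loving #GPText from @pivotal :) see https://example.com/docs"):
    print(f"{kind:9} {token}")
```

A whitespace-and-punctuation tokenizer would reduce `#GPText` to a bare word and discard `:)` entirely, losing exactly the signals that matter for tasks such as sentiment analysis.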

From a search perspective, GPText provides a Unified Query Parser (UQP) that can process queries combining boolean operators, complex regular expressions, and proximity searches all at once. This gives users a clean interface for writing extremely complex queries to search their data effectively.
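The sketch below illustrates, in Python, what it means to evaluate boolean, regular-expression, and proximity conditions in a single query. It does not reproduce UQP syntax or behavior; the combinators (`near`, `matches`, `all_of`, `negate`) and the sample documents are hypothetical.

```python
import re


def tokens(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def near(a, b, distance):
    """Proximity: `a` and `b` occur within `distance` token positions."""
    def check(toks):
        pos_a = [i for i, t in enumerate(toks) if t == a]
        pos_b = [i for i, t in enumerate(toks) if t == b]
        return any(abs(i - j) <= distance for i in pos_a for j in pos_b)
    return check


def matches(pattern):
    """Regular expression: some token matches `pattern` in full."""
    rx = re.compile(pattern)
    return lambda toks: any(rx.fullmatch(t) for t in toks)


def has(term):
    return lambda toks: term in toks


def all_of(*preds):
    return lambda toks: all(p(toks) for p in preds)


def negate(pred):
    return lambda toks: not pred(toks)


# Roughly: (brake within 3 tokens of failure) AND /recall.*/ AND NOT resolved
QUERY = all_of(near("brake", "failure", 3), matches(r"recall.*"), negate(has("resolved")))

DOCS = {
    "d1": "Brake system failure reported; recall pending",
    "d2": "Brake failure recalled and resolved",
    "d3": "Brake pads were replaced long before the engine failure and recall",
}
RESULTS = [name for name, text in DOCS.items() if QUERY(tokens(text))]
print(RESULTS)
```

Only the first document satisfies all three conditions: the second is excluded by the NOT clause, and in the third the two terms are too far apart for the proximity clause.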

Learn More And Start Using GPText 2.0

Text plays an ever-increasing role in a large number of organizations. With its ability to catalogue and correlate text, GPText 2.0 will help Pivotal customers decipher the meaning behind words.

We are very proud to now make this unique technology available to our Pivotal Greenplum customers as a free add-on!

