In this week’s episode we cover a couple of important updates in the world of Big Data and Fast Data. Firstly, we discuss the new Apache Software Foundation Incubator project called Geode—which releases the core of Pivotal GemFire as true open-source software.
Then we discuss the world of Query Optimisation and in particular the new Pivotal Query Optimizer for Pivotal Greenplum Database. If you are into running your big queries on big data efficiently, you will want to take a look at this!
- Subscribe to the feed
- Feedback: firstname.lastname@example.org
Welcome to the Pivotal Perspectives Podcast, the podcast at the intersection of Agile, Cloud, and Big Data. Stay tuned for regular updates, technical deep dives, architecture discussions, and interviews. Now let’s join Pivotal’s Australia and New Zealand CTO Simon Elisha for the Pivotal Perspectives Podcast.
Hello everyone and welcome back to the podcast. Fantastic to have you back. Just me on my own today, Simon Elisha. Good to talk to you again. Going to talk about a couple of new releases, new capabilities, things that might be of interest to you. The first one is something called Geode. You may be familiar with the word geode. Often in our childhood, we played with these things. These are really defined as a small cavity in rock lined with crystals or other mineral matter. I know back in the 70s, it was very popular to have your geode proudly displayed on the tabletop or on the bench, or in some public place in your house to show these shiny, shiny crystals. That is not the geode we’re talking about today. We’re talking about a new project that’s just entered incubation as part of the Apache Software Foundation. This is something called Project Geode. Geode is essentially the open sourcing of the core of GemFire.
This is kind of a big deal because it’s a very capable component that’s now being brought into the open source community to allow it to grow, to thrive, and to flourish. What does it actually do? How does it all fit together? Geode is a data management platform. It gives you real-time, consistent access to data-intensive applications throughout widely distributed cloud architectures.
Sounds good so far, but how does it actually make this happen? It basically pools together memory, CPU, and network resources, and also optionally some local disk, across multiple processes to manage application objects and behavior. What it does is take on what is an incredibly complex problem, which is essentially the in-memory data grid: the ability to access data at RAM-based speeds, very quickly, in a very predictable and precise fashion across the network, and to do this while ensuring high availability, consistency of access, and also fault tolerance. If you’ve ever worked in distributed systems, you’ll know that none of these problems are easy to solve. Once you add in the concept of volatile memory, they become even harder to solve.
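To make the data grid idea a little more concrete, here is a minimal sketch in Python. It is not Geode's API or implementation. The `ToyGrid` class, its method names, and the "primary plus one backup" placement rule are all assumptions made up for illustration. It only shows the general shape of the problem: keys are spread across several in-memory stores, each key is kept in two places, and a read still succeeds after one node loses its volatile memory.

```python
import zlib


class ToyGrid:
    """Toy in-memory data grid (hypothetical, not Geode's API).

    Each key is stored on a primary node and on one backup node.
    Needs at least two nodes so that primary and backup differ.
    """

    def __init__(self, node_names):
        self.names = sorted(node_names)
        # Each "node" is just a dict held in RAM.
        self.stores = {name: {} for name in self.names}
        self.alive = set(self.names)

    def _owners(self, key):
        # A stable hash picks the primary; the next node is the backup.
        i = zlib.crc32(key.encode("utf-8")) % len(self.names)
        return [self.names[i], self.names[(i + 1) % len(self.names)]]

    def put(self, key, value):
        # Write to every owner that is currently alive.
        for name in self._owners(key):
            if name in self.alive:
                self.stores[name][key] = value

    def get(self, key):
        # Read from the first live owner that holds the key.
        for name in self._owners(key):
            if name in self.alive and key in self.stores[name]:
                return self.stores[name][key]
        raise KeyError(key)

    def fail(self, name):
        # Simulate a crash: volatile memory on that node is lost.
        self.alive.discard(name)
        self.stores[name].clear()


grid = ToyGrid(["node-a", "node-b", "node-c"])
grid.put("order-1", 100)
primary = grid._owners("order-1")[0]
grid.fail(primary)              # lose the primary copy
value = grid.get("order-1")     # still served from the backup
```

A real grid also has to handle membership changes, re-creating lost copies, and consistency between copies during concurrent writes, which is where the hard distributed systems problems mentioned above come in.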
Geode does this very, very effectively. It also provides reliable asynchronous event notifications and guaranteed message delivery. You could see how this technology fits into a very specific set of use cases. It actually goes back a long way because these use cases have existed for a long time. In fact, the origins of Geode go back to the first object database for Smalltalk, which is called GemStone.
As GemFire, the commercial product, it was first deployed in the financial sector, and really focused on transactional, low-latency data engines that were used in trading platforms. Things that need to be done quickly, very efficiently, and very much in a performant fashion. Where we see it deployed very commonly today is around things like very high-scale businesses, 24/7 business-critical applications, things like large-scale ticketing systems, etc., that have to run all the time and have to cope with very heavy workloads.
The open sourcing of this component and the entry into incubation is a really exciting thing for the community because it means that developers can now pull down this code and have a look at it, and start to play with it, and start to give new ideas of what to do with it, etc., building upon a really solid foundation of something that’s already been built. It’s all built in Java, so Java would be the way you want to get going. It literally takes you about 5 minutes to get started, in terms of deploying your environment and starting to poke around at the code and seeing what it can do.
It’s a pretty exciting time, because it means that we can see where this particular component will go in the context of the broader ecosystem, and what’s going to go on. I’ll add the link to the GitHub repo so that you can start to branch and start to issue some pull requests, and start to get involved with that side of the community. Again, if you’re into in-memory databases, if you’re into real-time, if you’re into events, if you’re into notifications, if you’re into big scale, then Geode is an Apache Software Foundation Incubator project you should really look at.
Speaking of other things related to data, and handling of queries, and getting information that you need very quickly, there’s been a significant update to the Pivotal Greenplum Database. The Greenplum Database, just to remind you, is our massively parallel processing database. It can handle honking great amounts of data in a very performant fashion, and can run across lots of commodity servers and storage. It provides you with that real big data experience at very low cost, but with a very performant and SQL-compliant approach.
When we write SQL, one of the things that takes place is we have to figure out how to execute that SQL. Executing SQL is what we call a non-trivial exercise. How do you access data that’s spread across multiple tables with multiple access patterns, multiple indices, etc.? There are lots of variables that go into that.
Typically, we have 2 types of optimizers. We have what’s called a rules-based optimizer and a cost-based optimizer. Rules-based optimizers use heuristics or rules of thumb, or good processes that are known to work in most cases to optimize how queries work. Cost-based optimizers are different as they actually play out in virtual time if you like, or virtually, all the potential access paths that would make sense for a particular query and then estimate the cost that it would take to run each of them. The optimizer then selects the optimum path based on the lowest estimated cost, which kind of makes sense.
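As a rough illustration of the cost-based idea, here is a small Python sketch that enumerates every join order for three tables, estimates a cost for each, and picks the cheapest. This is not how the Pivotal Query Optimizer works internally. The table names, row counts, selectivities, and the cost formula (sum of estimated intermediate row counts) are all assumptions invented for the example, and brute-force enumeration like this only works for a handful of tables.

```python
from itertools import permutations

# Hypothetical table sizes (made up for illustration).
ROWS = {"orders": 1_000_000, "customers": 50_000, "regions": 20}

# Hypothetical fraction of row pairs that survive each join predicate.
# Pairs not listed have no predicate, so joining them is a cross product.
SELECTIVITY = {
    frozenset(["orders", "customers"]): 1 / 50_000,
    frozenset(["customers", "regions"]): 1 / 20,
}


def plan_cost(order):
    """Estimated cost of a left-deep join order.

    The cost is the sum of the estimated row counts of every
    intermediate result, a deliberately simple stand-in for a
    real cost model.
    """
    joined = [order[0]]
    rows = ROWS[order[0]]
    cost = 0.0
    for table in order[1:]:
        # Apply every predicate linking the new table to tables joined so far.
        sel = 1.0
        for other in joined:
            sel *= SELECTIVITY.get(frozenset([table, other]), 1.0)
        rows = rows * ROWS[table] * sel
        cost += rows
        joined.append(table)
    return cost


def best_plan(tables):
    """Play out every join order virtually and keep the cheapest one."""
    return min(permutations(tables), key=plan_cost)


best = best_plan(["orders", "customers", "regions"])
for order in permutations(ROWS):
    print(order, round(plan_cost(order)))
print("chosen:", best)
```

With these made-up numbers, the orders that join the two small tables first and bring in the big `orders` table last come out cheapest, while any order that cross-joins `orders` with `regions` early is the most expensive.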
You could imagine that playing out how the query would execute in all its permutations before actually executing the query could take time. We don’t want to take more time to do the query plan than it takes to actually execute the query. The optimizer in Greenplum has always been incredibly powerful, and has delivered amazing optimization for queries. What we’ve done in release 4.3.5 is introduce the new Pivotal Query Optimizer. This co-exists with the legacy query optimizer and both of them can be used. What the Pivotal Query Optimizer does is extend the planning and optimization capabilities of the database. It introduces a very extensible and verifiable optimizing platform, something we can grow with, something we can extend over time as the demands on databases grow. It also takes advantage of multi-core architectures in a far more effective way.
This means that we can execute queries faster. We can get better answers, and we can calculate the cost more accurately. Also, many of you will use very specific types of queries from time to time. Things like queries against partitioned tables, queries that have subqueries, queries that have a common table expression, DML operations, etc., will be greatly enhanced by using this optimizer.
Also, improved join ordering, join-aggregate reordering, sort order optimization, and data skew estimates all come into play. What this means is that in general, you’ll get a significantly faster performance experience for most types of queries in your database. It also means that it can calculate plans very, very quickly. There are a whole bunch of benchmarks out there in the query world. One good one is TPC-H Query 21. For this particular query, the optimizer was able to generate 1.2 billion possible plans in just 250 milliseconds.
You can imagine how, if you’re running plans that have to operate across tens or hundreds of terabytes of data, or even petabytes of data, the time to create these plans and this variety of plan options is very important. I, for one, would be very happy to spend 250 milliseconds to make sure I get my optimum plan. This is a big deal in the world of Greenplum Database users, because it means you can get better performance for a wider range of your queries without having to do anything except upgrade your software. Both options work together. If you choose to use the new Pivotal Query Optimizer as your primary optimizer, it will do its best to get you an answer. If, for some reason, it cannot be used, the database will fall back to the legacy query optimizer so you always get an answer. Something new to look at.
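The fallback behavior described here follows a common pattern: try the preferred planner first, and hand the query to the older one if the preferred planner cannot handle it. The Python sketch below shows only that pattern. The function names, the `UnsupportedQuery` exception, and the `UNSUPPORTED_FEATURE` marker are hypothetical and are not taken from Greenplum.

```python
class UnsupportedQuery(Exception):
    """Raised by the hypothetical primary planner when it cannot plan a query."""


def primary_planner(query):
    # Made-up limitation, purely to trigger the fallback in this sketch.
    if "UNSUPPORTED_FEATURE" in query:
        raise UnsupportedQuery(query)
    return ("primary", query)


def legacy_planner(query):
    # The older planner accepts everything in this toy model.
    return ("legacy", query)


def plan(query):
    """Prefer the primary planner; fall back so a plan is always produced."""
    try:
        return primary_planner(query)
    except UnsupportedQuery:
        return legacy_planner(query)


print(plan("SELECT 1"))
print(plan("SELECT UNSUPPORTED_FEATURE"))
```

The point of the design is that the caller never sees the failure of the preferred planner; it always gets a plan back, just not always from the newer optimizer.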
As ever, both these topics will have links in the show notes to investigate further: the repo for Geode and also some more detail around the new Pivotal Query Optimizer. I look forward to talking to you soon, and until then, keep on building.
Thanks for listening to the Pivotal Perspectives Podcast with Simon Elisha. We trust you’ve enjoyed it and ask that you share it with other people who may also be interested. We’d love to hear your feedback. Please send any comments or suggestions to email@example.com. We look forward to having you join us next time on the Pivotal Perspectives Podcast.