This week I am joined by Alexey Grishchenko, a Pivotal Enterprise Architect, to discuss the evolution from traditional data architectures through advanced, Lambda, and streaming architectures. He also explains how modern data stores like massively parallel processing (MPP) databases, Apache Hadoop®, and in-memory data stores fit into these modern data architectures.
- Feedback: firstname.lastname@example.org
Welcome to the Pivotal Perspectives podcast. The podcast at the intersection of Agile, Cloud, and Big Data. Stay tuned for regular updates, technical deep-dives, architecture discussions, and interviews. Let’s join Pivotal’s Australian and New Zealand CTO, Simon Elisha for the Pivotal Perspectives podcast.
Welcome back to the podcast, great to have you with us. Today, I have a really special guest. I’m joined by Alexey Grishchenko. Welcome Alexey.
Good to have you on board, Alexey. Alexey is actually a Pivotal Enterprise Architect, and he’s based in Cork, Ireland. I’m speaking from Melbourne, Australia, he’s in Cork, Ireland. I think it’s early where you are and late where I am. Is that about right, Alexey?
Yeah, here it’s 11am.
That’s not too bad. I’ve got the 8pm side of the equation, so we made it work. I’ve asked Alexey to come on and have a chat with us today because he’s been doing some really interesting work around modern data architecture. This came out of a customer request he was helping with. Let’s face it, often the best things come out of what customers ask for and what we can respond to. Alexey, before we launch into that topic of modern data architecture, let’s maybe talk about what the ask was, what that customer was asking for.
Yes, it was a traditional data warehouse customer running an MPP solution, with a number of source systems like Oracle and MS SQL. They started to face problems with ingesting real-time data, with working with unstructured data, and also with big amounts of data. They asked us what our vision is for delivering new architectures and new data processing pipelines for their environment. I started with a simple introduction of the data lake concept and the Lambda architecture concept, which are basically well-known concepts in current data design.
Later, I understood that it would be much better to give the customer some background and show how the industry evolved over time, going from traditional data architecture to the advanced options, and then to the modern options. If you just put a customer with a traditional MPP data warehouse in front of the Lambda architecture, which introduces streaming solutions and Hadoop solutions and basically doesn’t cover the traditional data warehouse, they will struggle to understand how they can get there. So I built this deck to show them the real evolution, what they can use, and where they can go from that point.
Sure, and it is a real journey for many customers, and there are many terms you’ve already mentioned, things like MPP, Hadoop, etc., that are familiar to some people and not to others. Let’s maybe take it step-by-step. Let’s define what most customers would see as a traditional current state architecture that they may have. Maybe speak to that for a moment.
Traditional data architecture is the thing that you will face at most enterprise customers. They have a number of source systems, like a CRM system and an automated banking system. All of these systems run mostly on traditional OLTP databases like Oracle, MS SQL, and so on. After this, they have a nightly or daily process, or some customers even have weekly processes, that extracts the data from these sources and puts it into the data warehouse.
After this, they have a typical ETL cycle of processing the data and moving it from the detailed data layer to the aggregated layer, reports, data marts, and so on. After this, BI users can access the data from the data warehouse. Typically, this process takes anything from one day to three or four days. With some customers, I even see a latency of one week from the data arriving in the source OLTP system until the data appears in the data warehouse, in the reports of the end users.
These are the traditional systems. They are implemented at a wide range of customers, including telecom customers and banks.
They tend to be quite expensive and slow to change, and they did a great job compared to what we had before that. They’re kind of reaching the limit of what they can do, wouldn’t you say?
Yes, so they’re really limited in terms of scalability, and they are not processing the data as rapidly as the business wants. As time passed, the business started to struggle with stale data: they work with data that is two or three days or even one week late to arrive, and they cannot make current business decisions based on this stale data. The second problem that arises is how to implement disaster recovery, because basically with the traditional architecture you have only one data warehouse as a single point of truth.
Then they introduced something like backup and restore solutions to have a secondary site for the data warehouse. Again, this solution is not that good because it introduces even greater latency. When you switch to the disaster recovery site, you will have stale data. Those were the biggest problems that traditional data architecture faced. Another problem is the extensibility of the system. With a traditional system, you are just writing some data in your source system, and then you are designing the processes to extract the data.
These processes are sometimes complex because you have to plan the source system so that the data can be easily extracted, like adding timestamps to the source tables to be able to get only the changed data, or implementing various custom CDC solutions. When the traditional data architecture is implemented for the customer, they face all these issues, and they start to think about something a bit more advanced, because the traditional approach is ETL, where you first extract the data, transform it on the fly, and load it into your data warehouse.
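Editor’s Note: The timestamp technique mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not something shown in the podcast: the `accounts` table, its `updated_at` column, and the `extract_changed_rows` helper are hypothetical names, and an in-memory SQLite database stands in for the OLTP source.

```python
import sqlite3

def extract_changed_rows(conn, last_watermark):
    """Return rows modified after last_watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, balance, updated_at FROM accounts "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    # Keep the old watermark when nothing changed since the last run.
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark

# Hypothetical source table with a timestamp column added for extraction.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?, ?)",
    [
        (1, 100.0, "2015-01-01T00:00:00"),
        (2, 250.0, "2015-01-02T00:00:00"),
        (3, 75.0, "2015-01-03T00:00:00"),
    ],
)

# Only rows changed since the previous extraction run are read.
changed, watermark = extract_changed_rows(conn, "2015-01-01T12:00:00")
# A second run with the new watermark finds nothing new.
changed_again, _ = extract_changed_rows(conn, watermark)
```

The extraction job has to persist the watermark between runs, which is part of the planning effort on the source side that Alexey describes.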
The next step for customers is usually to implement ELT, where they take data from the source system, load it into the data warehouse as is, and then implement a number of layers of transformations. Why did customers go from ETL to ELT? Mostly, this is caused by the fact that storage became many times cheaper. On cheaper storage, you can store more data and pay less for it.
The first MPP solutions and the first data warehouse solutions were really limited in space, so implementing ETL was almost the only option: aggregating data on the fly and extracting not all the data but only the subset needed for analytics was a good option at that moment. Today, most users follow some of the advanced data architecture practices, and one of them is ELT.
They’re using this concept to load the data. Of course, their data warehouse became bigger, because now in the data warehouse they have a layer called the operational data store, which is just a copy of the source operational systems. It gives them better flexibility: when the logic of extracting the data changes, they don’t need to implement any changes on the source systems or in the extraction processes. They are free to make these changes only on the data warehouse side.
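Editor’s Note: A minimal sketch of the ELT idea described above, assuming hypothetical table names (`ods_calls` for the operational data store copy, `agg_calls` for an aggregate) and an in-memory SQLite database standing in for the data warehouse. The source rows are loaded unchanged, and the transformation runs afterwards inside the warehouse.

```python
import sqlite3

# Rows as they come from a hypothetical source system: customer, date, seconds.
source_rows = [
    ("alice", "2015-06-01", 120),
    ("bob", "2015-06-01", 45),
    ("alice", "2015-06-02", 300),
]

warehouse = sqlite3.connect(":memory:")

# Extract + Load: copy the source data as is into the operational data store.
warehouse.execute(
    "CREATE TABLE ods_calls (customer TEXT, call_date TEXT, seconds INTEGER)"
)
warehouse.executemany("INSERT INTO ods_calls VALUES (?, ?, ?)", source_rows)

# Transform: runs inside the warehouse, so changing this logic
# does not touch the source system or the extraction process.
warehouse.execute(
    "CREATE TABLE agg_calls AS "
    "SELECT customer, SUM(seconds) AS total_seconds "
    "FROM ods_calls GROUP BY customer"
)

totals = dict(warehouse.execute("SELECT customer, total_seconds FROM agg_calls"))
```

Because the raw copy stays in `ods_calls`, a new aggregate can be built later from the same data without another extraction from the source.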
Implementing ELT was the next step towards the architecture with change data capture, or CDC. Change data capture is a way to offload the source OLTP system, because over time all OLTP systems became more and more loaded. Initially, telecom companies had a small number of customers, cell phones and other devices were not that widely used, and the billing system was not that big and not that loaded.
Over time, the load increased and increased. At the moment, most customers don’t want to tolerate the workload that data warehouse extraction processes put on the source systems. You can understand why: if you’re running a traditional Oracle OLTP database and you’ve got an ETL process from the data warehouse side that is scanning a full table, it requires a lot of I/O because the table is big. It also requires a lot of effort from the Oracle side, because the workload that specific Oracle instance is designed for is OLTP.
That means writing one row of data, reading one row of data, calculating some information on a couple of rows of data. This requires mostly random I/O, and most OLTP systems, at least at the advanced customers, have now transitioned to SSD storage, which gives much better performance for random I/O. Then you have your data warehouse extraction process that is running strictly sequential I/O on huge amounts of data. This invalidates the Oracle cache and the file system cache, and it causes performance problems on the OLTP side.
The modern approach is to use CDC. CDC is just parsing the write-ahead log of the database and getting the data directly after it is inserted into the database. Originally, the vendors of CDC solutions claimed that you can have a latency of five or even two minutes from the moment the update event occurs on the source system until it is loaded into your data warehouse. In fact, it’s not the truth. The truth is that if you have many, many source tables and you’re implementing CDC for all of them, you would be able to load them something like one or two hours after the data appeared on the source system.
Again, one or two hours is much better than nightly batch and it is much better than three to four days that business users wait for the whole ETL cycle. It gives you a better flexibility, better performance.
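Editor’s Note: To make the CDC flow more concrete, here is a hedged sketch of replaying a parsed change log into a warehouse table in small batches. The event format (`op`, `key`, `row`) and the helper names are hypothetical and do not correspond to any particular CDC product; a plain dictionary stands in for the target table.

```python
from itertools import islice

def micro_batches(events, size):
    """Group a stream of change events into batches of at most `size`."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch

def apply_batch(target, batch):
    """Apply insert/update/delete events to the target table in log order."""
    for event in batch:
        if event["op"] == "delete":
            target.pop(event["key"], None)
        else:  # insert and update both overwrite the row for this key
            target[event["key"]] = event["row"]

# Hypothetical change events, as a CDC tool might emit after parsing the log.
change_log = [
    {"op": "insert", "key": 1, "row": {"balance": 100}},
    {"op": "insert", "key": 2, "row": {"balance": 50}},
    {"op": "update", "key": 1, "row": {"balance": 80}},
    {"op": "delete", "key": 2, "row": None},
]

warehouse_table = {}
batches_loaded = 0
for batch in micro_batches(change_log, size=3):
    apply_batch(warehouse_table, batch)
    batches_loaded += 1
```

The source database is never queried here; only its log is read, which is why CDC takes the extraction load off the OLTP system.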
It’s really a big step up from where you were, and it’s really about giving the business answers much more quickly than they’ve ever had before. We’re trying to build this technical capability in what is typically quite a complex world of different systems and different data, so that we can get answers far more quickly and feed them into the decision making process.
Yes, but CDC has its own drawbacks, for instance when you have a really big number of source systems. Some customers have only one or two OLTP source databases. But if you have 10 or 15 of them, you will spend a really huge amount of time and effort on integrating all of these solutions, implementing CDC from them, and loading their data into the data warehouse, because switching from ELT to CDC is not that simple. After all that, when you have implemented CDC for all of them, you will see that your data warehouse is getting overloaded by the amount of data it’s trying to load in real time.
Originally, data warehouse solutions, MPP databases, and even traditional databases used as data warehouses are designed for analytical workloads and for batch loading of data: you load in batches, then you run analytical queries over huge numbers of tables and gigabytes or terabytes of data. When you start loading really small chunks of data, your data warehouse solution will not be that simple to maintain and extend.
This is why, after experimenting with all of this ELT and CDC, customers start to transition to the modern data architecture, which can solve these problems. For example, we’ve got a customer that has more than 15 different data sources, and with CDC they’re loading more than 1,000 tables into their data warehouse. You can imagine 1,000 constantly changing tables in the source systems: each of them generates a stream of change events, and all of these events are loaded into 1,000 objects on the data warehouse side.
It creates just a massive load on the MPP solution, and this leaves not much room for the real analytical workload in this data warehouse. This is why, over time, you naturally transition to the modern data architecture. With modern data architecture, you’ve got a number of approaches. One of them is the data lake. The data lake, as we are discussing it at the moment, has nothing to do with the speed of ingestion; mostly it tries to solve the problem of new data types and really huge amounts of data. Over time, your data warehouse grows.
Initially, when companies transition to MPP solutions, they have a data warehouse of something like five to 10 terabytes on average. Then they get an MPP solution, they know that it can scale easily, they know that they can put a lot of data into it, and they start to collect all the data they can take. Over time, this data warehouse grows to 20 terabytes, 50 terabytes, 100 terabytes. Then they start to understand that not all of this data is really that relevant for the business.
For instance, only 10% of the data stored in a 100 terabyte data warehouse is really used daily by the end users. The other 90% is also important, but it’s not used daily. It’s used once or twice a week, and still you spend disk storage resources to store this data in a high performance MPP solution. This is one of the reasons the data lake architecture arrived: to offload the less relevant data, which we usually call “cold data” or “warm data”, to Hadoop. It allows you to save space and capacity on the complex, performant MPP solution by offloading cold data to the cheap and still performant Hadoop solution.
You still have the option at any time to select, from the MPP solution, the data residing on Hadoop. It remains available. Of course, the access time will be a bit slower, but you should understand that if you are accessing this data just a couple of times a week, this is not a very big problem for most end users. They can tolerate spending a bit more time on accessing this data than before, given the fact that they can store more hot information in the traditional data warehouse solution.
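Editor’s Note: The hot and cold split described above can be sketched as a simple age-based rule. This is an assumption made for illustration; the podcast does not say how customers decide what counts as cold. The function names, the 30-day window, and the in-memory dictionaries standing in for the MPP and Hadoop tiers are all hypothetical.

```python
from datetime import date, timedelta

def split_hot_cold(partitions, today, hot_days):
    """Keep recent daily partitions in the hot tier, offload the rest."""
    cutoff = today - timedelta(days=hot_days)
    hot = {day: rows for day, rows in partitions.items() if day >= cutoff}
    cold = {day: rows for day, rows in partitions.items() if day < cutoff}
    return hot, cold

def query_total(hot, cold, since):
    """Sum values across both tiers; cold data stays queryable."""
    total = 0
    for tier in (hot, cold):
        for day, rows in tier.items():
            if day >= since:
                total += sum(rows)
    return total

# Hypothetical daily partitions of a fact table (values are call durations).
partitions = {
    date(2015, 6, 29): [10, 20],
    date(2015, 6, 1): [5],
    date(2015, 1, 15): [7],
}

mpp_tier, hadoop_tier = split_hot_cold(partitions, date(2015, 6, 30), hot_days=30)
# A report over the whole year still sees the offloaded partition.
year_total = query_total(mpp_tier, hadoop_tier, since=date(2015, 1, 1))
```

In a real deployment the query over the cold tier would be slower than the one over the hot tier, which is the trade-off Alexey points out.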
The second reason why the modern architecture with the data lake has arisen is the boom of social networks, the Internet of Things, and all the other stuff around this. The amount of unstructured data generated by all of these is becoming bigger and bigger over time. Now most companies can extract data from social networks. They can store the access logs of their websites. All of this information can easily be put into a Hadoop cluster and analyzed there.
If you take a look at the format of, for instance, the JSON data that comes with a tweet, you will see that it’s semi-structured data. It has many optional fields, it has nested structures, and so on. Of course, you can put it into a traditional relational schema, but it doesn’t make much sense, because most of the data that is loaded with the tweet in JSON is not relevant for you. You are mostly interested in the text and maybe the location information.
This is the way you use Hadoop. You put all the raw source data in the Hadoop layer, and then you run some kind of extraction processes that process only the relevant part of the information. This relevant part gets loaded into your data warehouse and analyzed there. Using this concept of the data lake is a very basic step of the modern architecture. If you take a look at big companies like Facebook and LinkedIn and all the others, they’re following a similar approach. They have really huge Hadoop clusters, but for real-time analytics they’re extracting aggregated data from these Hadoop clusters and loading it into MPP solutions.
Here the approach is similar. You’re extracting really huge amounts of data from different external systems, and maybe even internal systems, putting it into the Hadoop cluster, running extraction and aggregation processes on the Hadoop layer, and then loading this aggregated data into the data warehouse for deeper analytics. This way the data lake approach really works. The two main drivers are offloading of historical data and processing huge amounts of unstructured or semi-structured data.
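Editor’s Note: As a small illustration of extracting only the relevant part of semi-structured data, the sketch below keeps the raw JSON lines untouched and pulls out just the text and location. The field names (`text`, `location`, `city`) and the sample records are hypothetical and are not the schema of any real social network API.

```python
import json

# Raw records as they would sit in the data lake, one JSON document per line.
# Optional and nested fields differ from record to record.
raw_lines = [
    '{"id": 1, "text": "Great coverage today", '
    '"location": {"city": "Cork"}, "user": {"followers": 10, "lang": "en"}}',
    '{"id": 2, "text": "Dropped call again", "entities": {"hashtags": []}}',
]

def extract_relevant(line):
    """Keep only the fields the warehouse cares about; tolerate missing ones."""
    doc = json.loads(line)
    location = doc.get("location") or {}
    return {
        "id": doc["id"],
        "text": doc.get("text", ""),
        "city": location.get("city"),
    }

# Only this flat, relevant subset is loaded into the relational warehouse.
warehouse_rows = [extract_relevant(line) for line in raw_lines]
```

If the business later needs another field, the extraction can be rerun over the raw lines, because nothing was thrown away in the lake.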
For sure. One of the interesting evolutions we’ve seen recently is the prevalence of real-time and streaming data. Given this modern data architecture, describe how that streaming piece fits in, how we overlay it on top of that.
As I described, the data lake mostly covers integration between the traditional data warehouse and the Hadoop layer, which some would call the big data layer. In fact, it does not cover streaming data very well. If you want to process a stream of real-time data, you have to invent something different. This is where the Lambda architecture comes in. In the Lambda architecture, you have two separate layers of data processing. One is called the batch layer, where you process data in traditional ETL cycles, maybe nightly or even weekly. The other is the speed layer, where you process the real-time stream of events and get real-time metrics out of the data, real-time aggregates available to the business as soon as the data has arrived in the system. This way you have dual loading. You get a single stream of data and it’s loaded into both the speed layer and the batch layer.
Then from the serving layer you can select the data from both of them, grouping it together. This gives you some unique advantages. Over the whole time, you store all the source data in your batch processing layer. This gives you an opportunity to review the final data marts or the final aggregates at any point in time. You’re completely free to do it. If you have lost your real-time data, it’s not a problem, because you can just reload the real-time cluster anew and recalculate the aggregates based on the batch layer information.
Also, it gives you an opportunity to see the data, and the results of maybe applying some statistical or analytical models on top of the data, as soon as the data has arrived in your cluster. This is the great opportunity that is introduced by the Lambda architecture, but it comes at a cost. The cost is that you have to implement the processing logic for the data, the same logic, in both the speed layer and the batch layer. Usually this might cause some problems, because for the batch layer you can use Hadoop. For Hadoop, the native approach for processing and aggregating this kind of data is to use MapReduce or something based on Hive and so on.
In the speed layer, you have to use a different approach. When you use solutions like Storm or Spark Streaming, they have a different language, different capabilities, different functions, and you still have to maintain the same logic in these two systems simultaneously. If you change the logic in one system, you have to change it in the second system and deploy at the same time. This sometimes creates problems with the Lambda architecture.
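Editor’s Note: The three layers described above can be sketched in plain Python. This is a toy model under stated assumptions: the event format and all names are hypothetical, and in practice the batch and speed layers would run on different systems. Note that the rule for what counts as a dropped call appears twice, once per layer, which is exactly the maintenance cost Alexey mentions.

```python
from collections import Counter

def batch_view(master_events):
    """Batch layer: recompute the aggregate from the full stored history."""
    return Counter(
        e["station"] for e in master_events if e["type"] == "dropped"
    )

class SpeedLayer:
    """Speed layer: update counts incrementally as each event arrives."""

    def __init__(self):
        self.counts = Counter()

    def on_event(self, event):
        # Same business rule as in batch_view, implemented a second time.
        if event["type"] == "dropped":
            self.counts[event["station"]] += 1

def serve(batch_counts, speed_counts):
    """Serving layer: merge the last batch result with the real-time delta."""
    return batch_counts + speed_counts

# Events already covered by the last nightly batch run.
history = [
    {"station": "A", "type": "dropped"},
    {"station": "A", "type": "completed"},
    {"station": "B", "type": "dropped"},
]
# Events that arrived after that batch run.
recent = [
    {"station": "A", "type": "dropped"},
    {"station": "C", "type": "dropped"},
]

nightly = batch_view(history)
speed = SpeedLayer()
for event in recent:
    speed.on_event(event)

merged = serve(nightly, speed.counts)
# If the speed layer is lost, the same answer can be rebuilt from the batch
# layer once the recent events are part of the stored history.
rebuilt = batch_view(history + recent)
```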
There’s no such thing as a free lunch is there?
Of course, yes. You maintain two layers of the data, and of course you have to maintain two branches of the logic for processing this data. This approach is somewhat separate from the data lake approach, and not all customers see how you can bring them together. This is what I want to cover in the next step, on streaming architecture. One of the modern data architectures is the data lake, or data hub; different vendors call it different things, but the idea is exactly the same. Next is the Lambda architecture, and the last one is the streaming architecture.
This architecture is just starting to emerge. You can see this, for instance, in Spark Streaming, in the new Apache top-level project called Samza, and in some other technologies. Also, the biggest driver of streaming architecture is reactive programming. In reactive programming, you start to decouple the big monolithic application into a set of separate actors and a message queue connecting these actors with each other. At first it might be an internal message queue, but eventually you might find out that implementing an external message queue is a better solution, even for reactive programming.
From reactive programming, it is a small step to the microservices approach. All of these together give you the next level of approach for the modern data architecture. Let’s take a look at how the traditional system works. You’ve got a client using, for instance, a banking system. He opens a web browser, opens the page of the bank, and this page connects him to the front end layer of applications of the banking system. The front end is calling some kind of services running on an … enterprise … integration bus layer … something like …
Yes, like a TIBCO Service Bus or something like that.
Yes, on a typical service bus, you run big services written in Java, for instance. You access them from the front end using a graphical interface. Then the services connect to the databases using JDBC or ODBC connections, and in these databases they call stored procedures that populate tables with data or calculate something. For instance, when the customer has made a call, it goes to billing, which calls the stored procedure that calculates the new balance and puts the information about this transaction into the database.
Then, in the database, when you write the data to the table, this data is also written to the log. It’s the write-ahead log, or in Oracle the redo log, and it is what CDC reads. You parse this log and extract the data into the CDC solution, and the CDC solution groups it together into batches and loads it into the data warehouse. In the data warehouse, you go from the detailed layer to the aggregated layer and to the data marts layer, and finally the data is consumed by the BI application.
This flow is really, really long. What can you do, and how can you improve this? Basically, when you introduce the message queue approach and the microservices approach, you might find that these microservices running on your front end cluster are talking with each other using the message queue. The idea is that when they’re talking with each other, you can replicate this stream of messages to different solutions. You can replicate this stream of messages in such a way that one copy of each message is processed by the traditional system, for example using JDBC to call the OLTP system to process the information about the call that just happened.
So one of the streams goes to the OLTP solution. A second stream might go to a solution like the speed layer in the Lambda architecture, where you process events in real time and produce real-time aggregates. A third stream goes to the traditional data warehouse consumption layer. A fourth stream might go to the Hadoop layer, where the events are just stored the same way as they arrived. It gives you high flexibility. Imagine that you are a telecom operator. You get a huge stream of events happening on your network. People are calling, calls are dropping, and so on.
One of the consumers of the stream is the typical OLTP database, your billing system: information about phone calls that happened goes to billing, it’s processed, and the balance is calculated. The second stream is the real-time solution. This real-time solution might calculate the average call time, might calculate the number of dropped calls on a single base station, and might visualize it in real time on a dashboard. Of course, this is not for the end customer; the business user is our customer in this case. The business user would be able to see on which base station calls are being dropped in almost real time, because you are processing the data at the same time as billing receives the event.
The third stream is data ingestion into the data warehouse. Instead of running the traditional CDC approach and extracting the data after it is loaded into the source database, you’re processing the data before it is loaded into the traditional system. It gets delivered to the data warehouse, processed, grouped into batches, and loaded into your tables. The fourth stream might load the data into the Hadoop layer. This way you are storing all the historical information about all the events that happened on your network. In the data warehouse, you usually tend to store just the last month of events, because even one month will be a huge amount of data.
In Hadoop, you can easily store almost all of them or information for … We’ve got customers that intend to store five years of detailed information for all the events that happened on the network. This is the modern architecture with streaming. When you go to the deeper levels and start to design your data pipeline and data ingestion not from the OLTP systems, extracting the data from them, but starting with the front end and back end services and redesigning them, you will get much more flexibility in terms of what is happening and how you can integrate different systems.
You should understand that if you want to implement a real-time data processing system, you should have access to the real-time data stream. The data stream that is the output of the OLTP system is not real-time. It is already stale data; it might be stale by five minutes, it might be stale by one hour. This approach, with microservices and message queues serving the data to a number of consumers, is a completely separate approach, which gives you much more flexibility in terms of the data processing logic.
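Editor’s Note: The fan-out of one event stream to four consumers, as in the telecom example above, can be sketched with a tiny in-process publish and subscribe class. This is an illustration only: `Topic` stands in for a real message queue, and all class names, fields, and the batch size are hypothetical.

```python
class Topic:
    """A tiny in-process stand-in for a message queue topic."""

    def __init__(self):
        self._handlers = []

    def subscribe(self, handler):
        self._handlers.append(handler)

    def publish(self, message):
        # Every subscriber receives its own copy of the stream.
        for handler in self._handlers:
            handler(message)

class Billing:
    """Stream 1: OLTP-style processing, charges completed calls."""

    def __init__(self):
        self.charges = {}

    def handle(self, msg):
        if msg["type"] == "completed":
            key = msg["subscriber"]
            self.charges[key] = self.charges.get(key, 0) + msg["cost"]

class DroppedCallMonitor:
    """Stream 2: real-time aggregate of dropped calls per base station."""

    def __init__(self):
        self.dropped = {}

    def handle(self, msg):
        if msg["type"] == "dropped":
            key = msg["station"]
            self.dropped[key] = self.dropped.get(key, 0) + 1

class WarehouseLoader:
    """Stream 3: group events into batches before loading the warehouse."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.pending = []
        self.loaded_batches = []

    def handle(self, msg):
        self.pending.append(msg)
        if len(self.pending) >= self.batch_size:
            self.loaded_batches.append(self.pending)
            self.pending = []

class RawArchive:
    """Stream 4: keep every event exactly as it arrived (the Hadoop layer)."""

    def __init__(self):
        self.events = []

    def handle(self, msg):
        self.events.append(msg)

calls = Topic()
billing, monitor = Billing(), DroppedCallMonitor()
loader, archive = WarehouseLoader(batch_size=2), RawArchive()
for consumer in (billing, monitor, loader, archive):
    calls.subscribe(consumer.handle)

for event in [
    {"type": "completed", "subscriber": "s1", "station": "A", "cost": 2},
    {"type": "dropped", "subscriber": "s2", "station": "A", "cost": 0},
    {"type": "completed", "subscriber": "s1", "station": "B", "cost": 3},
]:
    calls.publish(event)
```

Adding a fifth consumer means one more `subscribe` call; the producers and the other consumers do not change, which is the flexibility the streaming approach is meant to give.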
For sure. I think that gives a really good view of the evolution of this data technology for organizations that are trying to build this real-time experience. If nothing changes, nothing changes, so you have to make change. I think it’s important that our listeners also understand that one of the benefits of the way we tackle this at Pivotal is that we have something called Big Data Suite. The Big Data Suite includes all the components that Alexey was talking about.
Things like queueing systems with RabbitMQ, Hadoop systems with Pivotal HD, in-memory data grids with GemFire, Spring XD to handle streaming data, Spark to handle streaming data, Greenplum for that MPP database, and HAWQ for SQL MPP on top of Hadoop. Cloud Foundry, of course, to create and implement those microservices. You really need to be instrumenting your environment with some different components to give you the flexibility to implement this kind of architecture.
The change that happens in an organization is pretty radical once they start having access to real-time data and up-to-date dashboards, and start implementing some more advanced data analytics and data science to be more proactive about what they do. It’s really pretty phenomenal, the change that you see.
Yes, I agree. Basically, the evolution shows that with the change in the industry, with the change in the instruments and tools you’re working with, you have a broader capability in the ways you can process the data. Initially, and this is just within the last 10 years, you had only traditional SMP databases. You had no option to store all the raw data, all the events, and so on. You just didn’t have this option. Over time, you see that real-time systems have appeared. Again, integrating with real-time systems requires you to change your data infrastructure, and integrating with Hadoop solutions and processing big amounts of data requires you to make another change in it.
If these changes are not led by proper design, you might end up with a set of completely separate systems: a completely separate in-memory system, a completely separate data warehouse, and a completely separate Hadoop installation, which will turn into a big headache, because your business users will constantly require complex queries and complex requests that integrate the data between them. This will become a very big problem if you do not consider the proper design from the very beginning.
For sure. Excellent. Alexey, thanks so much for sharing that background or that overview. I’m sure our listeners got a lot out of it, a lot more context and really appreciate you taking the time and speaking to us from Cork in Ireland today. Thank you so much, Alexey.
Thank you everyone for listening. As always, we love to get your feedback; email@example.com is the email address to hit us up at. Until then, keep on building.
Thanks for listening to the Pivotal Perspectives podcast with Simon Elisha. We trust you’ve enjoyed it and ask that you share it with other people who may also be interested. We’d love to hear your feedback so please send any comments or suggestions to firstname.lastname@example.org. We look forward to having you join us next time on the Pivotal Perspectives podcast.
Editor’s Note: Apache, Apache Hadoop, Hadoop, Apache Spark, Apache Storm and Apache Samza are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.