People often talk about Big Data in vague and general terms. And often the subject of Fast Data is totally ignored. The reality is that many organisations struggle to figure out just how to take advantage of the data they have, identify what data they might need and recognize the particular style of data they are dealing with.
By breaking down what actually makes up Big Data and Fast Data, and what they are most suited for, we can understand how best to apply them. It is only then that business-centric use cases become self-evident. It is these use cases that drive the desire for innovation and exploration, and create the need for Data Science.
This week, Simon discusses Big vs Fast data and how you might use them, the components involved and some important considerations.
- Find more episodes of All Things Pivotal podcast
- Feedback: firstname.lastname@example.org
Welcome to the All Things Pivotal Podcast, the podcast at the intersection of agile, cloud and big data. Stay tuned for regular updates, deep dives, architecture discussions and interviews. Please share your feedback with us by emailing email@example.com.
Hello everybody and welcome back to the All Things Pivotal Podcast. Fantastic to have you back. This is episode number three. My name is Simon Elisha and I’m CTO and Senior Manager of Field Engineering here at Pivotal in Australia and New Zealand.
What are we going to speak about today? Today’s topic is comparing big data and fast data. A few buzzwords there that we will demystify, decode and give some realistic relevance to.
Big data is a very loaded term. It’s been interesting watching it evolve over the years and seeing people’s perceptions of this particular capability. Big data really indicates a volume of data or how hard it is to process that particular corpus of data. Now, this is where things get tricky because your definition of big will vary. For some people a few terabytes is considered a big data set, for others it’s a few petabytes. I’ve even met people for whom a few exabytes is a big data set.
Really, things can vary based on your perception of big. You can think of things like all the Google searches that were ever done, or every transaction that your business has ever performed or been involved in, or all the PDFs stored by an organization. Say for example you send out emails or statements to your customers and these get stored as PDFs, imagine all of those over the lifetime of the business.
Another good example is all the call records from a call center. All those things coming in and out of a particular call center, you may want to store them. That could get very big particularly if you’re storing it in an audio sense.
Big is not just the volume of data. We’re often tempted to say, ‘How many petabytes do you have? How many terabytes do you have?’ That’s a bit of a game from that perspective. It can also just be big in terms of an individual data set or table, if you like, where you may have billions of rows in a particular data set that you need to process.
Now, as anyone who’s tried to process that type of thing knows, that’s substantially difficult because you start to run out of memory on a single machine, or you then have to partition up the workload to do it more effectively, and things start to take longer. Getting an answer takes an extended period of time. Big is not just the macro view of the data but also the more focused view of the data as it sits within a data model, for example.
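To make that partitioning idea a little more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the row layout (`customer`, `amount`) and the function names are invented for this example, and the partitions are processed one after another on a single machine rather than across a cluster. The point is simply that each partition is aggregated on its own and the partial results are then combined.

```python
from collections import Counter
from typing import Iterable, Iterator, List


def partition(rows: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield successive chunks of at most `size` rows."""
    chunk: List[dict] = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def partial_totals(chunk: List[dict]) -> Counter:
    """Aggregate a single partition: total amount per customer."""
    totals: Counter = Counter()
    for row in chunk:
        totals[row["customer"]] += row["amount"]
    return totals


def total_by_customer(rows: Iterable[dict], size: int = 2) -> Counter:
    """Combine the per-partition totals into one overall result."""
    combined: Counter = Counter()
    for chunk in partition(rows, size):
        combined.update(partial_totals(chunk))  # Counter.update adds counts
    return combined


# Hypothetical sample data; a real table would have billions of rows.
rows = [
    {"customer": "a", "amount": 10},
    {"customer": "b", "amount": 5},
    {"customer": "a", "amount": 25},
    {"customer": "c", "amount": 7},
    {"customer": "b", "amount": 3},
]
result = total_by_customer(rows)
```

Because each partition only needs its own rows in memory, the same shape of computation can be spread across many machines, which is broadly what a massively parallel database or a MapReduce job does for you.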
Now, fast data is really about how quickly you can process and action the data that you have to hand. This typically relates to real time or near real time activity. Types of things to consider in this space are things like online share trading, or a real time stream of customer activity. For example, seeing a customer purchase something in real time so you can make a recommendation, or seeing them read something so you know they may be interested in something related to that particular topic.
Another is mobile phone activity, so maybe you want to track dropped calls. You want to see if there’s an impending issue on the network that you need to deal with, or offer some sort of location-based service where you would like an action to closely follow a particular event. For example, if I’m visiting your supermarket and I have your particular business’ supermarket application, maybe you’d like to trigger a particular special for me in real time.
This is really what fast data is about. Fast data provides some interesting technical challenges in and of itself because typically we’re working with reasonably large volumes of data, typically a large flow. Often we can’t control the nature of that flow, so we need to be able to accept it reliably and in a consistent fashion. We potentially want to store that information for later analysis as well. It’s one thing to take in the fast information very quickly and process it very quickly, but then dispose of it to the bit bucket. In most cases, we don’t want to do that. We actually want to retain that information for the long term to perform more analysis on it, different analysis on it, et cetera.
Do we have to choose between big data and fast data? One of the things we try and do in IT is to make sure you don’t have to make those types of tradeoffs and choices, and that you can have everything that you want. Really the idea is to not have to choose. It’s to have everything at once, and we call this the business data lake architecture. This really understands that we have multiple data sources coming in.
Let me kind of break this down for you. On the source side, we have real time ingestion of data. We have micro batch ingestion, so very small updates happening, and then batch ingestion, the more traditional ‘here’s all the transactions that took place in a database yesterday, here we go’. Those sources need to come in through an ingestion tier. Again, these are handled in real time, micro batch and mega batch, and they get stored via a processing tier into a number of different layers.
These layers could be an in-memory database or data grid, so a very large scale clustered collection of servers that can hold a large amount of data in RAM. As we know in the world of IT, RAM is pretty much the fastest place you can access data from, so that’s the best place for it to be if you need it to be quick. But then we may move it down to some sort of massively parallel processing database to be able to store it in a more persistent fashion and use it and query it on an ongoing basis.
Eventually in the big data space and also in the fast data space, all roads do tend to lead to HDFS, the Hadoop Distributed File System, and we may store it there, be it structured or unstructured data, for a long period of time.
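The micro batch style of ingestion mentioned above can be sketched in a few lines of Python. This is a hedged illustration rather than how any particular product does it: the event shape (a dict with a `ts` timestamp in seconds), the function name `micro_batches` and the `max_size` and `max_age` parameters are all assumptions made for the example. A batch is flushed either when it is full or when the next event arrives too long after the first event in the batch.

```python
from typing import Iterable, Iterator, List


def micro_batches(
    events: Iterable[dict], max_size: int, max_age: float
) -> Iterator[List[dict]]:
    """Group a time-ordered stream of events into small batches.

    A batch is flushed when it holds `max_size` events, or when the next
    event arrives more than `max_age` seconds after the batch's first event.
    """
    batch: List[dict] = []
    for event in events:
        if batch and event["ts"] - batch[0]["ts"] > max_age:
            yield batch  # the batch is too old, hand it on
            batch = []
        batch.append(event)
        if len(batch) >= max_size:
            yield batch  # the batch is full, hand it on
            batch = []
    if batch:
        yield batch  # flush whatever is left at the end of the stream


# Hypothetical stream: two early events, a pause, then a burst of four.
events = [{"ts": t} for t in (0.0, 0.2, 3.0, 3.1, 3.2, 3.3)]
batches = list(micro_batches(events, max_size=3, max_age=1.0))
sizes = [len(b) for b in batches]
```

Setting `max_size` to 1 gives you something close to real time handling, while a very large `max_size` and `max_age` behaves like a traditional batch load, so the three ingestion styles sit on one continuum.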
Once we have that data there, we can now choose where we access the data from, and this is where, again, we take a multiple view of how we query this data and a multiple method approach to querying the data as well. We don’t just say, ‘I have a single view of my information,’ we say, ‘I have multiple ways in which to access my information, suited to the problem at hand or the question that I’m trying to ask.’
For example, I may want to query my in-memory database or in-memory data grid in real time and say, ‘What’s happening now? Now what’s happening?’ Or create some sort of event trigger type approach where if a particular type of information comes in and it’s correlated with something else, an action takes place. I may achieve this using SQL-type queries at the in-memory data grid layer or I may use NoSQL interactions as well, I can choose.
Then I may want to do some sort of interactive type work. This is less real time but close to real time type work where I may want to see what’s the prevailing condition of my environment minute by minute. In this case, typically, I’d write myself some SQL. SQL runs the engine of business still today, and I can do some interactive work. Now, this is where the speed of processing and the ability to action and act upon this big corpus of data, this very large data set we’re trying to work with, becomes really important.
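Here is a small Python sketch of that event trigger idea, using the supermarket example from earlier. Everything in it is hypothetical: the `OfferTrigger` class, the event types `viewed` and `entered_store`, and the plain dictionary standing in for the in-memory data grid are assumptions for illustration only. An incoming event is correlated with state already held in memory, and an action fires when the two line up.

```python
from typing import Dict, List, Optional


class OfferTrigger:
    """Correlate a store visit with what the customer last looked at."""

    def __init__(self) -> None:
        # customer id -> last product category viewed (held in memory)
        self.last_viewed: Dict[str, str] = {}
        # actions fired so far, kept so they can be inspected later
        self.actions: List[str] = []

    def on_event(self, event: dict) -> Optional[str]:
        """Process one event; return the action fired, if any."""
        customer = event["customer"]
        if event["type"] == "viewed":
            self.last_viewed[customer] = event["category"]
            return None
        if event["type"] == "entered_store" and customer in self.last_viewed:
            action = f"offer:{customer}:{self.last_viewed[customer]}"
            self.actions.append(action)
            return action
        return None


trigger = OfferTrigger()
first = trigger.on_event({"type": "viewed", "customer": "c1", "category": "coffee"})
second = trigger.on_event({"type": "entered_store", "customer": "c1"})
third = trigger.on_event({"type": "entered_store", "customer": "c2"})
```

The second event fires an offer because it matches what is already known about customer `c1`, while the visit by `c2` does nothing because there is nothing to correlate it with.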
Because if I want some sort of interactive insight, I can’t wait an hour for the report to run. I can probably wait a minute or two, maybe five minutes depending on the nature of the report and the complexity, but I typically want it to happen reasonably fast in a sort of a tolerable timeframe. I’d argue that less than the time it takes to make a coffee is probably your upper bound that you’re looking for.
These are the things that happen during the day, and they can give you nice updates on an ongoing basis. These are the kinds of things that, if you’re using an information radiator type approach where you’ve got boards up around the business and you’re showing what’s going on in the business in near real time, would be feeding that information.
Then there are the batch insights. These are the really large, typical reports that get run on a regular basis for most businesses, that work on a much larger amount of data, maybe all the data that’s in the environment. This will typically be done again using SQL or MapReduce or variations on that as well. This is often where a lot of the heavy duty data science takes place as well, and we’ll speak more about that shortly. This is where we try to action or work upon all the information we have.
Whereas in the real time space we may be just working with a very small data set that’s just come in recently, here we may choose to be experimenting with creating models and applying them to the larger set of data to understand if we’re seeing the correct relationships, the correct causality that we’re looking for in real time. We can then feed that back into the real time processing engine and do it the same way.
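As a rough idea of what one of those minute by minute interactive queries might look like, here is a small sketch using Python’s built-in `sqlite3` module. SQLite is only a stand-in for whichever SQL engine you actually run, and the `orders` table, its columns and the sample rows are all made up for the example.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the real SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (ts TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        ("2014-06-01 09:00:12", 10),
        ("2014-06-01 09:00:48", 5),
        ("2014-06-01 09:01:03", 20),
    ],
)

# Summarise activity per minute: how many orders and how much revenue.
rows = conn.execute(
    """
    SELECT strftime('%Y-%m-%d %H:%M', ts) AS minute,
           COUNT(*)                       AS orders,
           SUM(amount)                    AS revenue
    FROM orders
    GROUP BY minute
    ORDER BY minute
    """
).fetchall()
conn.close()
```

A query of this shape is the sort of thing that could feed a board on the wall, provided the engine underneath can answer it within that coffee-making timeframe.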
I’ve talked about a few different concepts and this is sort of how they fit together. Often people say, ‘What are the components or products that fit into this space?’ Let me maybe translate that a little bit for you. If we think about things like clickstream data or sensor data that’s coming in, a great place to land that is something called GemFire XD, also using Spring XD. This is the ability to take real time information into an in-memory data grid and process it very effectively.
We can process within that environment, we can use things like SQL if we want to, we can use REST based interfaces, et cetera, to access that information and use that information in real time.
If we’ve got things that are less real time, that sort of micro batch approach to things like weblogs and network data, that often gets ingested directly either into GemFire XD or into the Hawq or [inaudible 00:09:25] databases. These are massively parallel processing databases that distribute workload very effectively, and they allow you to ingest huge amounts of information, store it and query it very effectively as well.
Hawq gives you the ability to access information that is stored on HDFS using ANSI-standard SQL, which is very important, and we’ll talk about that shortly. You may use other tools to bring that data in, you may use various data loaders, you may use Sqoop, you may use Flume. There’s all kinds of cool names out there for all these different projects that run to bring data in. Then if you have your more traditional sort of CRM or ERP data, you may use a data loader type process to bring that data across from those traditional environments into what’s called Pivotal HD, which is a Hadoop distribution, allowing it to be stored in HDFS on a large scale.
Then on the action side, you can use this information a number of ways. As I mentioned, you can query using SQL, you can use MapReduce, you can use Hive, you can use Pig, you can use a whole raft of different tools. You can also do things like send notifications or things out onto queues, so things like [inaudible 00:10:34] Q or [inaudible 00:10:34] to populate other data sets or to trigger events. You can create applications using Pivotal Cloud Foundry that consume the data that exists in these services so you can attach to those services and pull that data off as well. It gives you a lot of flexibility and a lot of capability.
It’s interesting as we move through the big data space, a lot of thought goes into how these architectures get put together because this stuff is complicated. There’s a lot of data you’re working with, quite a few moving parts, things that have to all fit together and work seamlessly to be managed and maintained.
One of the architecture frameworks that you probably want to be familiar with is something called the Lambda architecture, and this is something that Nathan Marz came up with from his work at BackType and Twitter. It tries to deal with the need for a robust system that’s fault-tolerant against hardware failures and human mistakes and that serves a wide range of workloads and use cases. Things like the low-latency needs, the batch needs, et cetera.
If you have some time, I recommend having a look at that as well. It gives you, again, an alternate perspective on what we talk about in terms of the business [inaudible 00:11:38] lake architecture, where we’re melding together what is fast and what is big and trying to get access to that very quickly.
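For a flavour of how the Lambda architecture fits batch and low-latency needs together, here is a deliberately tiny Python sketch. It is an assumption-laden toy, not Nathan Marz’s implementation: the page view events, the function names `build_batch_view` and `query`, and the `SpeedLayer` class are all invented for illustration. A batch view is recomputed from the full master data set, a speed layer keeps a running view of events the batch has not yet seen, and a query merges the two.

```python
from collections import Counter
from typing import Iterable, List


def build_batch_view(master: Iterable[dict]) -> Counter:
    """Batch layer: recompute page view counts from the whole data set."""
    view: Counter = Counter()
    for event in master:
        view[event["page"]] += 1
    return view


class SpeedLayer:
    """Speed layer: incremental counts for events not yet in a batch view."""

    def __init__(self) -> None:
        self.realtime_view: Counter = Counter()

    def ingest(self, event: dict) -> None:
        self.realtime_view[event["page"]] += 1

    def reset(self) -> None:
        """Discard realtime state once a new batch view covers it."""
        self.realtime_view.clear()


def query(page: str, batch_view: Counter, speed: SpeedLayer) -> int:
    """Serving side: merge the batch view with the realtime view."""
    return batch_view[page] + speed.realtime_view[page]


# Hypothetical master data set and a couple of fresh events.
master: List[dict] = [{"page": "home"}, {"page": "home"}, {"page": "pricing"}]
recent: List[dict] = [{"page": "home"}, {"page": "docs"}]

batch_view = build_batch_view(master)
speed = SpeedLayer()
for event in recent:
    speed.ingest(event)
before = query("home", batch_view, speed)

# Later, the batch layer catches up and the speed layer is reset.
master.extend(recent)
batch_view = build_batch_view(master)
speed.reset()
after = query("home", batch_view, speed)
```

The answer for `home` is the same before and after the batch recompute, which is the property that makes the approach tolerant of mistakes: the realtime view is disposable, and the batch view can always be rebuilt from the master data.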
I’ve talked about accessing the data quite a lot. I want to dive a little bit more into that. This business [inaudible 00:11:51] is intended to be a platform. The word platform implies it’s something upon which you can build. We’re not making any decisions for you around the particular visualization tool you should be using or the analytical tools you should be using or the modeling tools you should be using.
Essentially what we’re trying to provide is a basis upon which you can effectively store and access, in the manner that you see fit, the information you need to have to hand to run the business. This means that we can provide access via ANSI SQL, you can have NoSQL access, you can have direct REST-based access to this data. You can consume it the way you want to.
It’s interesting in speaking with customers that they all have very different needs around the type of visualization that they want. Some want the ability to write reports very quickly and easily, to be able to distribute that capability to the workforce, so they use various tools around that. Others want really rich geographically related visualization tools so they can overlay data sets onto really detailed maps and create really visually appealing displays. Others are looking for very scientifically oriented visualization tools. There’s no one size fits all and in fact, it’s a case of choosing the correct tool for the correct task for the correct user.
Obviously when we talk about big data, we talk about bigness and bigness involves infrastructure, it involves software, it involves licensing. People are concerned, what am I up for? If I’m starting to store these really large amounts of information, am I kind of signing an open check to spend a huge amount of money? Let me talk about it from the software perspective.
Firstly, any software needs to show value and any solution needs to show value and it should be appropriate to the solution, to the problem at hand and it should be solving a big business issue that can either generate more revenue or save a lot of cost. What we’ve tried to do with the licensing of the Pivotal Big Data Suite is create a package where you can get access to all the components that I’ve spoken about. These are things like GemFire, like Hawq, like Pivotal HD, like the [inaudible 00:13:58] database, et cetera, in one package.
This is a CPU-based package where you license the number of CPUs that you’re using for the particular components. However, what this also allows you to do is to have access to unlimited Pivotal HD when you license this in this particular way. This means that you start to manage the number of cores you need for the additional components. For example, if you want to run the Hawq database on 100 cores for argument’s sake, you would license it for that. If you’re running your Pivotal HD on 1000 cores, you can go right ahead and do that if you want to, there’s no limitation. You can also mix and match between the particular components as you like.
What it’s not tied to is the amount of data that you’re storing or processing. This gives you immense flexibility to keep a lot of historic data and to manage that data set effectively and to not have to throw away important information because you’re licensing the compute power, you’re not licensing the storage capacity. This becomes really appealing to a lot of customers.
Speaking of customers, I thought I’d mention a case study. We’ll link to this in the show notes, but it’s a really interesting case of a company called Aridhia, which is one of the world’s leading health informatics companies. They work on technology that helps support the innovation, collaboration and management of chronic diseases, stratified medicine and biomedical research, and they use biomedical informatics and analytics.
Really, they’re combining some really interesting and innovative data sets together to try and identify the cause of and solution to various diseases. Specifically in their case, they’re using Hadoop and other big data and application technology from Pivotal for gene sequencing and mapping to uncover correlations between genotype and clinical data. They can provide a really effective platform upon which they can drive new applications, drive new models, et cetera, and their goal is to revolutionize the treatment and management of chronic diseases.
This is a lofty goal, a tough goal, an important goal that needs to be supported by a wide variety of data sets that are changing all the time as the research changes. That’s a really interesting case study, so I’ll link to that in the show notes for you.
Now, I mentioned we’d talk about the way data science fits. One of the challenges, if you talk to any data scientist, is not knowing what tools you might need for the problem domain as you commence your exploration. You may also not know what data sets you are going to need, you may not know what processing types you’re going to use, you won’t know how the models will look, you don’t know what the tool sets will look like. It’s kind of a sea of unknowns, which is part of the challenge and part of the opportunity, as they say.
Having the platform upon which you can say, ‘If I need SQL type interaction, I can use it. If I need access to an open source tool I’d be [inaudible 00:16:46] I’ve got it. If I need to run some type of real time eventing scenario, I can do it.’ Gives you immense flexibility. Also, if you start to say, ‘I’ve got these 10 different data sets that I’m combining but I could do something really cool if I could get a real time feed from Twitter and Facebook and something else,’ then the ability to just plug those in and go becomes very effective in terms of solving particular data science problems.
Really, having access to big and fast data puts you in the box seat for getting the best outcome for the problems you’re trying to solve. That’s a bit of an overview of big and fast data, I hope it was useful to you. Again, please tell others about the podcast if you’re enjoying it because it is new and we’d like to get the…