When building systems to handle real-time or streaming data, we need to look at some architectural elements differently. We cannot rely on the patterns of the past, but we can learn from the mistakes and successes of others! In this episode, we explore some of the key considerations when designing software to handle real-time data.
- Subscribe to the feed
- Feedback: email@example.com
- Links referred to in the show:
Welcome to the Pivotal Perspectives Podcast. The podcast of the intersection of Agile, Cloud and Big Data. Stay tuned for regular updates, technical deep dives, architecture discussions, and interviews. Now let’s join Pivotals Australia and New Zealand CTO, Simon Elisha for the Pivotal Perspectives Podcast.
Hello everybody, and welcome back to the podcast. Great to have you back. Guess what? I’m in another hotel room. It seems to be a trend with the podcast at the moment. This time in beautiful Sydney, Australia, a town many of you will be familiar with, and some of you want to visit. I can recommend it, it’s a nice place, but enough about that.
This week I had a question where someone asked about working with real-time data, and some of the things to think about with this. I thought I would refer a little bit about some of the things I’ve seen, some of the things I’ve worked with, some things I’ve seen customers doing well, some things not so well that may help inform some of your thought processes and some of your decisions. Where to start?
Working with real-time data or streaming data, the rules are different, very different to what we’re used to handling. This is the case on a number of dimensions, and I want to walk through these dimensions step by step but will bounce around a little bit. The main difference is that of scale. The scale in real-time data particularly if you’re using external data sources that are not within the purview of your own control, the scale can be unpredictable, and potentially uncontrollable. What do I mean by that? There will be firehouses that you’re attached to that will send you as much information as you can take. Whether or not, you’re back in systems can cope is your problem, not their problem. You may say, “That’s fine. I’ll just drop some data on the floor,” but if you need that data for longitudinal studies or trending analysis etc., dropping data on the floor is also not appropriate. What do we do?
We have to think about a number of dimensions. One of those is the consumption side. If we’re pulling streaming data from someone, or receiving streaming data from somewhere, we’re going to be doing something with it. Typically that will be in the guise of what we call worker nodes. These are little system service containers, run-time engines whatever, depending on your architecture that are consuming from the stream. Now the stream could be additional time stream so it may be like a HDFS connection, or it would be better if it was some form of queuing system where there was some sort of intermediate persistence of some nature, and some in order reply capability, and that’s something like a RabbitMQ for example. I’m a big fan of queues. They help with decoupling of systems. They help with resilience or reliability of systems. They give you nice, clean interfaces.
Irrespective to have that data coming, you need to consume that data in a timely, efficient, and effective way. Part of that is constructing your working node or nodes, there will be more than one typically. It’s important to consider the size of each node because when you’re scaling your nodes, you’re going to watch them move in a kind of increment that makes sense, but from a performance perspective and an economical perspective. You don’t want a node size that means that you jumped too high, so you go from being fully saturated on your worker node to being under-utilized on your new worker node. You also don’t want one that’s too small so you’re spinning a bazillion little worker nodes that are in aggregate quite inefficient, you need to find the sweet spot. How do you find the sweet spot?
Testing metrics and testing. Basically you need to instrument your worker nodes and understand how they’re performing, try different combinations of sizes, be it memory or CPU, and see what makes sense. You can, of course, always adjust those later on, but you’ll see trend lines over time. You’ll see as the scaling goes up and down, what makes sense, what smooths out your line, etc., but you need to make sure you have this worker node farm able to cope with the incoming traffic. Now again, these worker nodes themselves are stateless, we talk about statelessness a lot on this podcast. They need to be able to simply pull work off a queue, work on that individual piece of work, and then pull the next piece off. They don’t know the rest of the context. They don’t know what’s going on. If they die for some reason, another node can pick up that piece of work and use it as well.
Now related to this concept of the worker nodes or that incoming stream of real-time data, and it’s not one stream, it can be many streams, it can be many steps of the streams, but I’ll simplify it for the purpose of the conversation. We have no time for downtime or outages. Now you may say will we never do and our systems always have to be up, but really most systems go down for sometime unless they’re built in a coordinated way and are deployed properly on a platform that can support that, typically you’ll see some form of downtime or outages from time to time. In the real-time world, particularly because we want you to capture all the information, and particularly in the case where we cannot replay information from the source because it’s a one-shot deal then there is no time for downtime or outages. We need to build assuming failure of related components and underlying components, and keep operating irrespective of that.
Again, things like that clustered and highly available distributed queues can come into play, RabbitMQ, great track record in this space are really important, also your order of scaled worker nodes are very important so that they can come and go as well, and a robust and controlled endpoint is important. Now the flip-side of the real-time data is you also need to consider what if you are actually providing some data or responding to some requests, how do you handle backlogs and how do you handle that pressure on services that have failed? One of the key disaster patterns that we see with services is the service goes down and it can’t cope with what’s going on, and of course, a whole lot of requests start backing up from other systems, and then when the system comes back up, it can’t cope because it’s getting slammed by all these systems.
Typically you want to implement some form of throttling with some brownout operation mode where we cope with certain types of workload that may be disguised as others. It could be true for some prioritization system. This can be done using headers on messages, for example. It could be done using some time-stamp mechanism, some source mechanism. There’s a whole lot of approaches you can take, it’s not one size fits all, but you got to think about it. It’s really, really important to think about how you handle the failure scenarios. I would argue even more importantly to recover from failure scenarios. Too often you see these scenarios where the service goes down, it’s overwhelmed, it’s brought back up only to go down again, and you rinse and repeat, no one’s very happy with that. Again, the liberal use and intelligent use, and correctly configured use of queues becomes very important in this context, and can make your life a lot easier as well.
Other things like flow control, authentication need to be kept in place, throttling of API cores etc. are also really good strategies to consider. Now this data that’s flowing through the system needs to be used for things, and it’s typically wanting to be viewed in real-time, typically you’re aggregating some data, you may be visualizing some data, a whole lot of things you want to with it, where do you put it? Now again if you come from the old school as I do, your default answer to most things is, “I’ll check that data in the database and probably relational database, and it’ll be great,” and then I run some queries on it, and I’ll do some SQL, and I’m all good all be producing data. Well, that just won’t work in a streaming environment. It’s too slow, it doesn’t provide a real-time view, it adds a huge amount of latency to any traffic, and in many cases the workload on the database means that you need some messy database or some sort of shadow database, or some sort of unnaturally scaled thing to store that data, so I don’t recommend doing that, certainly not for your real-time data visualization.
Memory is your friend in this case because memory is quick and it works effectively, and it is now cost-effective in this day and age, which let me tell you, 25 years ago, it was not cost-effective for any sort of thing. You should use memory in memory cases to store state information or visualization information, or aggregate information. This is the location that things like dashboard etc. where you read it from. Let me give you a really simple example, let’s say you’re doing some voting application, you got real-time votes coming into the system. You could have as part of your worker nodes connecting to a ready cache, that’s in memory case and simply using the increment command to increment a particular counter or value, that could be counting up for you then you could have an application that goes and reads from that ready cache, reads from a particular key, can grab that information anywhere you go. You can even be fancier and you could segment those keys. You could do all sorts of hot spot avoidance, etc.
The fundamental thing is you’re going to be able to send a lot of data into the cache really, really quickly. If you have that cache in a highly valuable configuration, it’s going to work for you both from a performance and resilience perspective. It means that the overall flow is super, super fast. Now the cache, the memory cache is not a final destination because memory for all our work that we may do is still volatile, so we still do need to store things somewhere else. So, Hadoop is a popular location to store information. You can store a lot of your information at low cost, so you may want to send it off there. You may just want to send it off into an object store as a raw data for an analysis later, it’s really up to you, but that should happen outside of the stream of work that’s going on, so in memory is what you want.
If you want to do something even more advanced with in-memory stuff, obviously something like GemFire or Apache Geode is the open source version is really powerful in terms of doing event-based works. You can do things like having a continuous query running on the data so that’s like an object query language. When a particular query is satisfied, it will actually trigger an event within your application so rather than pulling the application, pulling the memory all the time, it’s actually getting a push event, and it can do something intelligent, so that’s really one of the cool, very effective features that exist within GemFire or Apache Geode that you may want to take advantage for more event-based processing, or complex event handling.
Something else to consider when you’re building these types of services is what development language, and frameworks you might use for the worker nodes, and for other components. Some of the things you want to look for are low connection overheads of a network type stack the way it handles threads, its efficiency, and its performance parameters. It’s not a case of one size fits all. There are different things that are good for different application types, and again you should test them. If you’re working at agile world or a scrum world, spiking the idea, so grabbing a technology and mocking up quick proof of concept, and doing some measurements then trying something else, and doing some measurements really works very, very well. It’s important to dive deeply on the stack that you’re building, and to have as few layers as possible, but I wouldn’t go crazy with optimization. I see some people go super, super crazy on explicit hardware optimizations etc.
Hardware evolves really, really fast, and actually you’re chasing your own tail if you’re chasing hardware advancements. It’s better to have an optimized software stack that is new, instance types become available from the various eyes, pliers, or your own internal IT provides you with different instance types. You can simply migrate the application to that independent of the implementation details, and get the performance being made, so that type of stuff is really important as well. Also if you’re moving a lot of data around on a regular basis, you could do a lot worse than looking at something like spring cloud data flow, the old SpringXD, which has some fantastic capabilities around the movement of data. One of the things I really like within that piece of software is the concept of taps on the streams that it creates. This is really important because when you have streaming data, you’ll be using this data for many different purposes, and you may want to have different transformations happen to the same piece of data. You may not want to have one particular transformation affecting another transformation.
By having this tap concept, you can tap off one original stream website, the Twitter feed coming in. One tap could be searching for a particular string match. Another one could be filtering that data based on a particular language. Another one could be doing geo tagging. Another one could be doing some brand new enrichment process that you’re experimenting with, you’ve got the choice. Having lots of different data paths can make your life a lot easier in terms of affecting change within a real-time stream because, again, as we said before no time for downtime, no time for adages, but we still need to be iterating what we’re doing, we still need to be building, so that’s real-time.
As you can see, once you dive into real-time, you start to get your arms around a whole lot of different problem domains, a whole lot of performance criteria, a whole lot of different tolerances in what you’re building. You cannot use a batch mindset. You cannot use, what I’ll call, more conventional architecture. You’ve got to use all the powers at your disposal in terms of a distributing computing architecture, but also be aware of some of the fallacies of distributed processing and some of the challenges in most types of architectural patents. They have been talked about many times, things like CAP theorem are things you should be looking at as well to understand the effects of a design decisions that you make.
Stand upon the shoulders of giants. Look what some other people have done particularly those people who worked in the real time messaging space do some real interesting work, and share a lot quite publicly on the internet. There are a lot of really interesting academic papers that can also inform you as to some of the patterns to go for, and some of the patterns to stay away from as well. I hope that’s been useful, a bit of a riff on some architectural components. Real-time is a very cool and interesting area for most organizations, and certainly we see the move into that more and more as design patterns for things that are meaningful to the users. We all want instant gratification, real-time data gives you that. I hope you enjoyed listening. As always please do send us your suggestions, we’d love to hear from you, firstname.lastname@example.org and until next time, keep on building.
Thanks for listening to the Pivotal Perspectives Podcast with Simon Elisha. We trust you’ve enjoyed it, and ask that you share with other people who may also be interested. Now we’d love to hear your feedback, so please send any comments or suggestions to email@example.com. We look forward to having you join us next time on the Pivotal Perspectives Podcast.
About the Author