A Quick Look at Spring Cloud Data Flow

October 7, 2015 Simon Elisha


The pressure for real-time data in applications is picking up at the same rate that applications are gravitating toward modern Cloud-Native architectures. Last month at SpringOne 2GX, Pivotal announced the release of Spring Cloud Data Flow, which moves many of the capabilities of Spring XD to a Cloud-Native architecture. In this episode, we will walk through the changes and how it fits into your Cloud-Native application architectures.




Speaker 1:
Welcome to the Pivotal Perspectives podcast. The podcast at the intersection of agile, cloud and big data. Stay tuned for regular updates, technical deep dives, architecture discussions and interviews. Now, let’s join Pivotal’s Australia and New Zealand’s CTO, Simon Elisha, for the Pivotal Perspectives podcast.

Simon Elisha:
Hi, everyone. Welcome back to the podcast. So glad you can join me. Simon Elisha as always coming to you from the palatial small office in Melbourne. I’m sitting in an office with no windows and very bad air conditioning. There’s a little context setting for you. It’s not always that bad. I just didn’t get a good room this week.

I want to give you a very quick update on a new, or probably more accurately a re-imagined, version of a product and capability that we have available here at Pivotal. This is something called Spring Cloud Data Flow. Now, Spring Cloud Data Flow was announced at the most recent Spring conference, SpringOne.

You may be familiar with some of the concepts in Spring Cloud Data Flow if you’ve heard me speak about something called Spring XD. Now, Spring XD was created originally to provide very sophisticated, distributed data pipelines for real-time and batch processing: taking on the challenge of picking up data and moving it from various sources to various sinks, doing it in a performant, scalable fashion, and allowing you to transform the data between those movement components. Maybe you want to filter some data, maybe you want to augment some data, maybe you want to run some live PMML models against it for data science; the choice is up to you. Super, super powerful.
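Those pipelines are typically expressed in a Unix-pipe-style DSL of sources, processors and sinks. As an illustrative sketch only (module names and options vary by release, and the filter expression here is made up for the example), a pipeline that ingests HTTP posts, keeps only matching payloads, and writes them to a file might be defined from the shell as:

```
stream create --name orders --definition "http | filter --expression=payload.contains('urgent') | file" --deploy
```

The pipe symbol wires each module to the next over the configured message transport, which is what lets each stage be scaled independently.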

Now, what the team did is look at what they’d built and recognize that the world around Spring XD had changed somewhat. In particular, there’s a lot more capability available under the covers on various infrastructure platforms to do what it was doing from a scaling perspective. This is a great example of an application becoming Cloud-Native and leveraging things like Spring Boot to make itself more efficient.

Firstly, they looked at the platform and said, “Well, at the moment it is work to operate. To set up Spring XD you need a whole lot of stuff: ZooKeeper, a message transport, a database, etc.” Quite a few moving parts. So: let’s re-imagine this as a set of message-driven microservices. Let’s use Spring Boot data microservices. Let’s use Spring Cloud services to include things like Eureka, etc. Let’s enable it to be deployed on Pivotal Cloud Foundry. What this means is that the Spring XD runtime is gone, replaced now by what’s called a Service Provider Interface, or SPI.

The SPI takes advantage of the native platform capabilities. There are three key platforms: Pivotal Cloud Foundry, which is what you use in production; Lattice, which is kind of a smaller version of PCF that you can run on your laptop or in a workgroup environment; or YARN, if you’re running this application on Hadoop with HDFS. What this does is create a radically simpler architecture, and it allows you to deploy onto any of the platforms I’ve listed very, very simply and very, very easily.

Now, what has not changed is the ease of use, the power and the sophistication of Spring XD in its new guise as Spring Cloud Data Flow. You still use things like the Java DSL. You use the XD shell to manage the environment. You can use the admin UI with Flo, which is the new, very cool GUI, and, of course, the REST APIs. You can take advantage of all of that to very simply create the pipelines you need to move data hither and yon.
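Managing streams from the shell is a handful of commands, and everything the shell does goes through the REST API underneath. A hedged illustration (the exact prompt and commands differ slightly between Spring XD and Spring Cloud Data Flow releases, and the port shown is only the server’s common default):

```
dataflow:> stream create --name ticker --definition "time | log" --deploy
dataflow:> stream list
dataflow:> stream destroy --name ticker
```

The same stream definitions can be fetched over plain HTTP, e.g. `curl http://localhost:9393/streams/definitions`, which is what the admin UI and Flo use behind the scenes.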

In fact, we often demonstrate Spring Cloud Data Flow to customers, and one of the canonical reference examples we’ll show them is consuming data from Twitter, doing some form of transformation and displaying it. The fact that we can set that up in a matter of minutes is pretty impressive, because it’s an example of how powerful the platform is for doing data-movement work.
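That Twitter demo boils down to a one-line stream definition. As a sketch (the credential placeholders are obviously yours to fill in, and the `#jsonPath` transform expression is just one illustrative way to pull out the tweet text):

```
stream create --name tweets --definition "twitterstream --consumerKey=<key> --consumerSecret=<secret> | transform --expression=#jsonPath(payload,'$.text') | log" --deploy
```

Swap the `log` sink for `hdfs`, `gemfire` or similar and the same pipeline feeds an analytics store instead of the console.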

Now, the Spring Cloud Data Flow admin is a Spring Boot application itself. It does things like service binding, service discovery and all the channel bindings; it does everything it needs to do for you. From an operations perspective, all you do is deploy the admin component onto the targeted environment, bind it to the external services such as Kafka, Redis or RabbitMQ, and you’re good to go. This becomes a really cool junction point for moving data around: you could be consuming data off an HTTP stream or off a social media stream, doing some form of transformation, and then depositing it into HDFS or some other analytics workspace, or you may want to send it into in-memory data grids. You may want to send it to GemFire. You may want to send it to Redis, for example.
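On Pivotal Cloud Foundry that deploy-and-bind step is ordinary cf CLI work. A hedged sketch only; the jar name, service offering and plan names here are placeholders that depend on your server build and marketplace:

```
cf push dataflow-server -p spring-cloud-dataflow-server.jar --no-start
cf create-service p-redis shared-vm dataflow-redis
cf bind-service dataflow-server dataflow-redis
cf start dataflow-server
```

Once the server is bound and started, the platform, not the server, handles scheduling and scaling of the individual stream apps.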

It is really focused on combining both real-time and batch processing. What we see more and more often is applications that need to take advantage of both of these components: the real-time components provide that live-streaming, interactive experience, and the batch components do far more detailed historical analytics and analysis.

The other thing that’s important is the ability to scale, so we need to be able to scale the stream and batch pipelines without interrupting data flows. This means having the resilience of something like Pivotal Cloud Foundry underneath to provide a containerized, distributed platform that can scale appropriately and provide performance management. We also want to be able to look at the different applications using metrics, health checks and remote management functionality. The ability to jump in and control what’s going on at the Spring Cloud Data Flow level is really, really important.

What this really does is provide you with a very easily accessible, open-source mechanism for hooking together the various components of your application, and more importantly the various data sources and sinks that form part of it, in a loosely coupled, highly scalable way. This really takes Spring XD to the next level. It’s now Spring Cloud Data Flow. Definitely worth looking at.

I’ll link some of the documentation in the show notes, and you can, of course, fork it on GitHub if you want and contribute as much as you like. It’s a really fantastic, robust, easy-to-use platform. It’s a case of building a tool to do a specific set of tasks really, really well, minimizing the amount of fiddling and configuration you need to do and maximizing performance and flexibility. Great to see the evolution of this component. Something certainly to look at. Until next time, keep on building.

Speaker 1:
Thanks for listening to the Pivotal Perspectives podcast with Simon Elisha. We trust you’ve enjoyed it and ask that you share it with other people who may also be interested. Now, we’d love to hear your feedback so please send any comments or suggestions to podcast@pivotal.io. We look forward to having you join us next time on the Pivotal Perspectives podcast.


About the Author

Simon Elisha is CTO & Senior Manager of Field Engineering for Australia & New Zealand at Pivotal. With over 24 years’ industry experience in everything from mainframes to the latest cloud architectures, Simon brings a refreshing and insightful view of the business value of IT. Passionate about technology, he is a pragmatist who looks for the best solution to the task at hand. He has held roles at EDS, PricewaterhouseCoopers, VERITAS Software, Hitachi Data Systems, Cisco Systems and Amazon Web Services.


