Organizations are collecting vast amounts of unstructured data. In fact, as much as 80% of the data stored within an organization is unstructured. Organizations are often challenged with how to store, analyze and act on the insights contained within this type of data. Further, they often wish to combine these new unstructured data sets with the structured data sets that they already rely upon.
The other challenge commonly faced by businesses is customer churn: the desire to reduce the number of customers they lose, and to deeply understand the reasons customers leave. This is not just for its own sake, but to be able to take clear actions to reduce the rate of churn and to improve either the customer experience or the products that they offer.
There is a raft of technologies and techniques that can be applied in this area, but where to start? This week we speak with Data Scientists Mariann Micsinai and Niels Kasch to get their insights into these domains.
PLAY EPISODE #10
Welcome to the All Things Pivotal Podcast. The podcast at the intersection of agile, cloud and big data. Stay tuned for regular updates to technical deep dives, architecture discussions and interviews. Please share your feedback with us by emailing email@example.com.
Hello everybody, and welcome back to the All Things Pivotal Podcast. Fantastic to have you back, as always. Another new year that we’re celebrating in podcast land. This week we’re going to have a very special conversation with two very smart folks from around the world who have joined me today. I’m joined by Mariann Micsinai and Niels Kasch, who work in the data science part of the house, if you like, in the world of Pivotal. As you know, we do data science, we do agile, we do platform as a service, we do big data.
So they’ve come along to talk about some really interesting work they’ve been doing in a number of different fields. Without further ado, I’d like to introduce Mariann and Niels. Perhaps Mariann, would you like to start with a bit of an introduction about yourself, and your role at Pivotal, and a bit about your background?
Absolutely. I work for Pivotal as a Senior Data Scientist. I joined Pivotal two years ago, straight out of my PhD program. I completed a PhD in computational biology. I actually have a very varied background, ranging from linguistics to economics, mathematics, medicine. It really helps me a lot to have a very integrated and interdisciplinary view of the field, which I think is actually very nice and great to have.
I guess diversity is the watchword of your background there. Was data science always a goal for you or was it something you sort of fell into as the field became visible to you?
I think it evolved over time. When I joined computational genomics, it was an emerging field with a ton of data. Imagine that sequencing machines were producing so much data that hospitals and medical IT departments couldn’t deal with it. It was a very exciting time because we had to do everything from scratch, programming, and learning medicine and applying and pushing the boundaries of math and computational sciences. Having a background in this field, plus I had a Wall Street background, where I worked on the trading floor, again dealing with a lot of data. It definitely evolved in me over time to become part of this field, which I find very exciting.
Fantastic. Fantastic. And Niels, what about yourself? What’s your story?
(laughs). So, I’ve been with Pivotal probably almost three years now. I’m also a Senior Data Scientist. Like Mariann, I was hired straight out of school after completing my PhD. I did my PhD in computer science, with a heavy emphasis on artificial intelligence and machine learning. My particular focus area was actually natural language processing, NLP for short. As part of my graduate work I had been working on web-scale text extraction to extract common sense knowledge, which in turn can be used for reasoning about all kinds of language input. It’s a really valuable skill to have on two fronts. One, dealing with large data assets, web-scale data sources. Two, the text components. Now a lot of companies are moving into the space of analyzing text data, so it’s a very exciting field to be in.
At Pivotal, essentially I’ve taken this expertise and applied it to a bunch of different verticals, namely the finance industry, but also insurance, oil and gas. Finance is a very interesting domain, because you get a lot of different types of data assets, a lot of structured data, but also a lot of unstructured data, now at this point in time, which most of the time is text data.
So it’s a very exciting field to be in. My background definitely helps with that.
For sure, for sure. As our listeners can see, I’m the least qualified person on the call today. Very underrepresented in my number of doctorates, personally, compared to you guys. You know, one of the things we find at Pivotal is that you need to find smart folks like Niels and Mariann, and throw them at big, ugly problems. That’s kind of the way you solve things.
Certainly one of the big problems that many organizations face is getting value from very large sets of unstructured texts. These texts could be emails. It could be chats. It could be transcribed call center calls. If you think about all the unstructured texts you personally produce every day in your daily work, imagine how much a typical large enterprise would be collecting or having flow through the organization.
Guys, let’s talk about how organizations can deal with this. Niels, maybe you want to start, given your background in that space specifically: how do people get their hands around this? What do they do?
So, yeah, this is actually a really interesting problem. Estimates suggest that up to eighty percent of the data being collected is actually unstructured text data. Like you mentioned, you know, email, finance blogs, but also social media. Think of Twitter feeds and Facebook streams and that kind of stuff. Also call center data.
There’s actually incredible value in this data, so for example, if you’re trying to understand your customers better, maybe you want to look at social media streams where they’re tagging your products and making statements about your products. If you have access to this kind of data, you can, of course, analyze it, and move towards producing value for your company.
To get a handle on this, what you really need to have is the right platform. You can’t analyze petabytes of text data on just a single machine. It just takes too much time to do that. So you want to have a parallel machine, or a parallel processing environment, where you can quickly, in an agile way, explore the data, get a feel for the data, and then actually do something with it by building a predictive or explanatory model.
So this is really a classic decomposition-type problem, where we have lots of stuff that we want to break into lots of smaller things and attack it kind of like an army of ants would attack it.
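As an editorial aside, the "army of ants" decomposition can be sketched in a few lines of Python. This is a minimal illustration, not the platform discussed in the episode: the function names are hypothetical, and a thread pool stands in for the parallel environment so that the sketch runs anywhere. A real deployment would push the per-chunk work out to a Hadoop cluster or an MPP database.

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Lowercase a document and split it into simple word tokens."""
    return TOKEN_RE.findall(text.lower())


def count_chunk(docs):
    """Map step: count tokens in one chunk, independently of all other chunks."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    return counts


def split_into_chunks(docs, n_chunks):
    """Deal the documents round-robin into n_chunks lists."""
    return [docs[i::n_chunks] for i in range(n_chunks)]


def parallel_token_counts(docs, n_workers=4):
    """Count tokens chunk by chunk, then merge the partial counts (reduce step)."""
    chunks = split_into_chunks(docs, n_workers)
    # Stand-in for a cluster: each chunk is handled by its own worker.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(count_chunk, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total
```

Because no chunk needs to know about any other chunk, the result is the same however many workers are used, which is what makes this kind of text pre-processing easy to scale out.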
Mariann, do you want to talk about maybe some of the things you’ve seen customers do with this kind of analytics? It’s one thing to say, well, I can collect the world’s Tweets, and I can have them and I can process them. Certainly in my experience the most successful data science projects are the ones that come from the line of business, where you say, ‘I’m trying to solve for this,’ and then the data scientists put together models and data sets that solve that particular problem.
Is there one that springs to mind for you around textual-type data and unstructured data that you’ve seen recently?
Absolutely, Simon. So what we see is actually a huge trend to integrate structured data that already exists within an organization with their untapped, unstructured data sources. Bringing these two types of data sets together for the first time produces much more intelligent models for an organization and helps them answer their questions in greater detail.
In terms of questions, you can think that an organization tries to either explain something or predict something, right? If you have different types of data sets, these two tasks, either explanation or prediction, are more accurate and more intelligent because they are relying on more data, and more different types of data.
As Niels said, this is especially apparent in the finance field. There, compliance is a huge issue. Risk is a huge issue. Understanding customers and giving the three hundred and sixty degree profile that Niels was referring to, these are very valuable assets for organizations. Not only in finance, but actually all over the various industries.
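As an editorial aside, here is one small way to picture bringing structured and unstructured data together. It is a sketch under stated assumptions: the field names, the keyword list and the helper functions are all hypothetical, and a real project would use far richer text features than keyword counts.

```python
# Hypothetical keyword list; a real project would learn signals from the data.
COMPLAINT_TERMS = {"fee", "fees", "cancel", "close", "unhappy"}


def text_features(notes):
    """Turn a customer's free-text call notes into a few numeric features."""
    tokens = [t.strip(".,!?").lower() for note in notes for t in note.split()]
    return {
        "n_notes": len(notes),
        "n_complaint_terms": sum(t in COMPLAINT_TERMS for t in tokens),
    }


def build_feature_table(accounts, call_notes):
    """Join structured account fields with text-derived features by customer id."""
    table = {}
    for customer_id, structured in accounts.items():
        row = dict(structured)  # copy the structured fields
        row.update(text_features(call_notes.get(customer_id, [])))
        table[customer_id] = row
    return table
```

Each row now carries both kinds of information, so a single explanatory or predictive model can draw on the account data and the call notes together.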
Fantastic. It’s interesting, that comment you made about companies wanting to either explain or predict something. It’s kind of like the data corollary to the commercial construct of most companies, which is they’re either there to increase their revenue or decrease costs. If you think of the data economy, it has a similar kind of concept. I think most of us don’t have that perspective on how data can be used.
If we’re talking about this sort of really large set of data, one thing we try to do on the Podcast is to go from the high-level concept of hey this is cool, you can do stuff, to how would you actually do this stuff? How would you dive into it, or what are some of the tools that you use from a technical perspective to take these embarrassingly parallel problems, as we like to say, and break them down and actually get answers in a timely fashion?
So it’s actually interesting that you mentioned embarrassingly parallel. Not everything we do is embarrassingly parallel. A lot of the pre-processing text [on the base 10:22], yes, that’s embarrassingly parallel. You can distribute it over, for example, the [duposso 10:30] and MPP database very nicely, so you get linear scale-out of the processing environment, in terms of these data assets, very fast. But not every task within machine learning, for example, and not every model easily decomposes into an embarrassingly parallel task.
What we have invested in heavily at Pivotal, in our machine learning library, is to take machine learning algorithms and parallelize them in such a fashion that they will execute in a distributed environment, so that you can still leverage this super [inaudible 11:08] environment without needing an embarrassingly parallel task.
The right platform is definitely a thing that you need to have. [Wa 11:26] compute power by itself doesn’t do the job either. You need the right tools. At Pivotal we use code that’s out in the open source community. For text analytics, for example, we utilize a lot of NLTK and OpenNLP, which seamlessly integrate into our environment, so we can just use existing resources in our analytics processes.
Also, in-house-grown [inaudible 12:02], which are also available as open source now, that we can leverage in our analytics process.
Fantastic. So really we’re talking about that storage substrate of things like Pivotal HD for the Hadoop layer, [Horp 12:20] for that MPP database, and the SQL to our [baxis 12:22]. Then more detailed analytic tools like MADlib and those other libraries that you mentioned that run close to the data.
Mariann, perhaps you’ve got some perspective of tools that you’ve used, and things you’ve seen that are really important to have available to you.
I really agree with what Niels said, and if I were to summarize it I would probably put it in three points.
One, we need to have a parallel environment where we can process very fast. Two, this environment needs to be flexible enough for us to put a large amount of data on it, of all different types. Most of you probably have heard the concept of data [read 13:02] that is becoming extremely important. Then, the third component is having the software tools that Niels mentioned, the analytic tools that take us from the data processing step all the way to actually running machine-learning algorithms in a very streamlined and performant fashion.
What really works for us at Pivotal, whether it’s the database or Hadoop, is that we have this flexibility and we have open source tools available. We are not locked into having to use one machine-learning library versus the other, so it’s a platform that is very flexible for us. Most importantly, we can also develop our own tools if they are not available. We contribute this way back to the open source community.
You know, one of the facets, it’s been a real joy for me to interact with a lot of data scientists at Pivotal. You go [coolive 14:00] from people talk about these data scientists that are like unicorns, that no one ever seems to meet, but we have a lot of them here, so it’s good to actually talk to the real ones. It seems that there’s a feature of data science which is that genuine concept of science, which is exploration. Creating hypotheses, testing them out, etc. To do so, you have no idea what tools you’ll need at the outset, and traditionally for IT this is really hard, because IT will say, ‘What database do you want?’ and you’ll say, ‘I don’t know.’ ‘How much storage do you need?’ ‘I don’t know.’ ‘How much compute do you need?’ ‘I don’t know.’ ‘What tool sets are you going to use?’ ‘I don’t know.’
These are not good answers for people who are trying to build that infrastructure. Having access to those open source tools that you can kind of swing in and swing out as you need to becomes really important. It’s been very interesting to see that certainly most data scientists I’ve met have no strong allegiance to any particular set of tools, per se. They like to have lots of tools, and then they’ll just bring them in to use as they like.
Would you say that’s a fair assessment, Niels?
Absolutely. The other point is that the tools we work with develop fast. There are new tools on the horizon, up-and-coming platforms, and we would like to leverage them. No particular tool gives us everything that we need, so being well-versed in a variety of tools really significantly expands the types of analyses that we can execute, and in turn provides the most value for the customer. That’s what we’re trying to do, provide value to the customer.
Now, one of the other concerns that we see in nearly every business that we deal with is the idea of wanting to retain customers. We all know how expensive it is to get a customer, and to lose a customer is something that most organizations don’t like. This is something typically called churn. I know that churn analysis occupies the minds of a lot of the C-suite folks, in the line of business management. It’s often one of the main things they worry about, or one of the top three things they worry about. Particularly in financial services and the [telquest 16:15] space, churn is a very big issue.
Mariann, you mentioned you have a background in this space as well. Would you like to talk a little bit about churn analysis and some of the thinking in that space and what you see happening?
Absolutely. As you said, churn is actually a very difficult case that we usually get involved in. It varies all across the financial field, from big banks to asset managers, to even smaller companies. We see it a lot in many other industries as well. Insurance, for example, but obviously media, etc. Any subscription-based service provider is also a good candidate for it. It’s a very wide-ranging, obvious concern.
In finance, what we see in terms of retail finance is that banks try to understand their customers, and provide a three hundred and sixty degree view. It means that they probably will have to integrate data sets they haven’t thought about before. They have their own data that they know about, and there is some dark data within any organization that is lying around, that no one has looked at. When we engage with banks to understand their customers, we also mostly bring in a lot of external data sets. The US government has a lot of public data sets. We can bring in time series about, obviously, market performance. We can bring in other proprietary data providers, like IXI, [Love Complete 18:00] [Action 18:01], that collect other, different types of data.
In addition, as we’ve said before, unstructured data, for example, call center logs, are very important. Or even [vetclick 18:14], right, so clickstream data to see how a customer navigates through a bank’s website. When you bring all these data types together, then you get a deeper view of how your customers are behaving. We are really trying to understand the behavior of these customers, and how we can produce actionable results for a bank to stop the outflow of assets, or customers, obviously.
This is again a very deep problem, and it starts from even defining what churn is. How do you define churn? Do you define churn in terms of external churn, where customers just leave the bank? Or do you define it as internal churn, as part of the problem? Most of the time we have to guide our customers, these banks, to think harder about how they define churn and what population we should consider for the analysis: the population of the bank’s customers that would become part of the analytical models.
For sure, for sure. It’s interesting, often when we talk about churn, people tend to take a point-in-time view of that model and say, ‘We’re just going to see, here’s churn last month or why we think they might’ve done so, etc.’ Whereas, really, it seems to be moving towards a more real-time interactive model, where we’re seeing, if you like, the precursors to, or the leading indicators of a potentially churning customer, based on a number of factors.
We need to feed that back in real time.
Niels, are you seeing good software solutions to this in terms of the crafting of applications or alerting systems, etc., that kind of tie off from these models?
Yes. But before answering that particular question, one more point to bring the story together. Earlier, we mentioned that we’re building predictive models and explanatory models. Churn is actually a case where both types of models really have applicability. So one, as you mentioned, is where we want to score, we want to predict: is this particular customer leaving in the next week or the next month or the next six months? So that’s a predictive model.
The explanatory side is: why are customers leaving? Is there something wrong with our product? Are my customers unhappy? Those kinds of things we can tell by looking at the details of the model, the predictive drivers that are influential in the model, and that definitely drives other decisions that are important for the business, such as the product mix to offer an individual group of customers, or how they engage with their customers in the call centers. Very good points.
In terms of moving this entire environment into real-time scoring, yes, the Pivotal platform has ways of taking the predictive or explanatory models that we built and moving them into, for example, GemFire, for real-time scoring. This is really the emphasis on operationalization that we try to make. It’s not enough to hand the customer a model and say, ‘Okay, we’re done. You figure out what to do with the model.’ The point is that we leave the customer with, potentially, an app developed by Pivotal Labs, our in-house software company, where they get actual insight right away.
For example, we develop a churn model and we leave the customer with a fully operationalized mechanism to actively score their customers in real time or in batch mode every month or every week or every hour, something like that.
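As an editorial aside, the difference between building a model and scoring with it can be sketched as follows. This assumes a logistic-regression-style churn model, which the episode does not specify; the feature names, weights and intercept are made up for illustration and stand in for coefficients that would be fitted offline.

```python
import math

# Hypothetical coefficients, as if fitted offline on historical churn data.
WEIGHTS = {
    "n_complaint_terms": 0.8,
    "months_since_last_login": 0.3,
    "tenure_years": -0.2,
}
INTERCEPT = -2.0


def churn_probability(features):
    """Score one customer: logistic function of a weighted sum of features."""
    z = INTERCEPT + sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def score_batch(customers):
    """Batch mode: score every customer in a dict of id -> feature dict."""
    return {cid: churn_probability(feats) for cid, feats in customers.items()}
```

The same `churn_probability` function serves both modes: call it once per incoming event for real-time scoring, or run `score_batch` on a schedule. The signs of the weights also carry the explanatory story, since a positive weight marks a driver that pushes a customer towards leaving.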
Fantastic. Mariann, you’ve seen a lot of this in your time. Some perspective from you?
Again, let me just give an example to illustrate Niels’ excellent points. We’re digging up all of these very complicated and complex mathematical models, but the point is that we would like the banks to use them, and use them in a manner that gives them a return on their analytics investments.
What we did, for example, with the churn model is that it got integrated into the bank’s call centers. The bank could rank its customers in terms of probability of churn, and assets flowing out of the bank. Then they could actively campaign, in terms of marketing campaigns, for these high-risk customers, or act when the customers call into a call center. The call center can reroute based on the probability of churn assigned by our models to these customers. If there is a high-risk customer, they might get a more experienced call-center analyst. This is one example from real life of how our models help banks.
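As an editorial aside, the ranking and rerouting described here can be sketched in a few lines. The details are assumptions rather than the bank’s actual rules: customers are ranked by churn probability multiplied by assets (a simple "expected assets at risk" measure), and the queue names and the 0.5 threshold are hypothetical.

```python
def rank_by_assets_at_risk(scored):
    """Order customer ids by churn probability times assets, highest first.

    `scored` maps customer id -> (churn_probability, assets).
    """
    return sorted(scored, key=lambda cid: scored[cid][0] * scored[cid][1], reverse=True)


def route_call(churn_prob, threshold=0.5):
    """Send likely churners to experienced agents, everyone else to the normal queue."""
    return "senior_agent" if churn_prob >= threshold else "standard_queue"
```

For example, a customer with a 0.2 churn probability and 100,000 in assets ranks above one with a 0.9 probability and 1,000 in assets, because more money is at stake. Call routing, by contrast, looks only at the probability.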
Fantastic. That really is the key, isn’t it? It’s making use of this information. Everyone talks about the conceptual side of things, but when you see it in a tangible form where people are actually integrating into their call center, into their customer-interaction-type work, then it becomes really powerful.
Some really great insights there from both Niels and Mariann, so appreciate you taking the time there. Both of you have written some really cool blogs on this topic, so we’ll link to them in the show notes, so people listening can have a look. We’ll wrap it up there.
So, first of all, Mariann, thank you so much for joining us today.
Thank you Simon.
And Niels, thank you so much for coming on the Podcast.
Yeah, thank you Simon.
So thanks everyone for listening. Please do share it with others if you’re enjoying it. If you have any feedback we’d love to hear from you. Podcast@pivotal.io. More episodes to come. But until then, keep on building.
Thanks for listening to the All Things Pivotal Podcast. If you enjoyed it, please share it with others. We love hearing your feedback, so please send any comments or suggestions to firstname.lastname@example.org.