Making Or Saving Money With Big Data

June 24, 2015 Simon Elisha

Ultimately companies end up focussing on two things:

1. Making money
2. Saving money

These are universal across industries and concerns. So how does data science help these ends?

In this episode we dive into two real retail use-cases where data science helped make money and save money.




Speaker 1:
Welcome to the Pivotal Perspectives Podcast, the podcast at the intersection of Agile, Cloud and Big Data. Stay tuned for regular updates, technical deep dives, architecture discussions and interviews. Now let’s join Pivotal’s Australia and New Zealand CTO, Simon Elisha, for the Pivotal Perspectives Podcast.

Simon Elisha:
Hello everyone and welcome back to the podcast. So glad you could join us, we really do appreciate that you take the time to listen. This week, a bit of a listener suggestion from Chris, who wanted to have us talk more about some of the data science that we do, what that looks like and what that means, because there are some cool stories and some interesting stories. I thought about using maybe some of the extra fancy ones or the really unusual ones but I thought, you know what, let’s stick to basics. Let’s stick to something that everyone understands.

Really when we think about what businesses are trying to do if we strip back all the mission statements and values etc., kind of what they’re there for is to either make money or save money. They tend to be pressured to do one or the other or in some cases, both depending on where the business is in its marketplace and its competitive construct, etc. This is where business stakeholders will often initiate ideas that lead to data science because they’re looking for answers that they can’t find at the moment. They can’t find it intuitively, they don’t have enough data to find it or maybe they have sufficient data or they suspect they may have sufficient data, but they can’t find the insights or the answers that they’re looking for.

I wanted to share with you two real examples of what was done in this space. They both actually are retail examples, and retail is good because we all understand how retail works, at least at a superficial level. Once you dig beneath the surface, there are nuances we could never imagine. One of the big problems around retail is a little thing called shrinkage. Now I’ll avoid the obvious shrinkage jokes and simply say that shrinkage is the term used in retail for goods being stolen, going missing, magically ending up in the pockets of the employees sometimes. Essentially it’s stock loss, and stock loss has a direct impact on cost because obviously this is inventory that you’re purchasing, that you’re spending money on but not getting any money back from.

Focus on shrinkage is a big driver in most retailers and certainly very much falls in the category of saving money. All retailers do a lot of work around shrinkage already. They have big teams in some cases that deal purely with it. In this case, the business goal was to understand the shrinkage drivers, to validate some hypotheses about why shrinkage is happening, and then to actually reduce the loss due to shrinkage, so to provably show that we can find the source of shrinkage, make some changes and fix it. Also, within this particular organization, their analytical sophistication was low, so they wanted some easy-to-use tools, and they wanted the answers, etc.

What’s a framework that we can use? First, we need to measure the shrinkage. We do this by auditing the stores, collecting any records that may exist, correlating inventory orders with sales to look for gaps, etc. We need to get some source data, and most organizations have this through point of sale, etc. Then we need to analyze. We need to start looking for patterns and use machine learning techniques (I’ll talk about this more shortly) to cross-check the patterns, to identify causality and to look for new patterns that may not be evident to the occasional user. Then also, it’s important to work with the loss prevention department to try and identify some of those more intangible factors that need to be factored in.
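As a rough illustration of the "correlating inventory orders with sales to look for gaps" step, here is a minimal sketch. The record layout, field names and numbers are assumptions made up for this example, not the retailer's actual schema or data.

```python
from dataclasses import dataclass


@dataclass
class StockRecord:
    # Hypothetical record layout, invented for illustration.
    store: str
    product: str
    opening_units: int   # units on hand at the start of the period
    received_units: int  # units delivered from inventory orders
    sold_units: int      # units recorded at point of sale
    counted_units: int   # units found in the end-of-period audit


def shrinkage_units(rec: StockRecord) -> int:
    """Units that should be on the shelf but were not found in the audit."""
    expected = rec.opening_units + rec.received_units - rec.sold_units
    return expected - rec.counted_units


# Made-up sample records.
records = [
    StockRecord("store_a", "sunglasses", 100, 50, 120, 25),
    StockRecord("store_b", "sunglasses", 80, 40, 90, 30),
]
gaps = {(r.store, r.product): shrinkage_units(r) for r in records}
```

A positive gap means stock went missing somewhere between delivery and the audit; it does not by itself say why.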

Then we need to develop and implement solutions based upon the findings. Each pattern of findings can drive a different solution, so it may be that we pay closer attention to demand forecasting to see if there are anomalies there. We may have closer supervision of staff, or we put in those shelf protection devices. You may have seen those little annoying plastic things that stop you from taking stuff. We may introduce some sort of video or RFID-based analytics. I would argue that the most important step is this last step, which is we evaluate and revisit the results of the actions we’ve taken. We make sure that we measure the impact of the solutions, and then we improve those solutions appropriately.

What happened in this particular case from a technical perspective? The team gathered a large amount of data: multiple terabytes of data, a huge number of different tables, data from accounting, from point of sale, from historical in-store order reports, etc. What they did is, in this case, they loaded it into a Greenplum Database, and they did some modeling. In this case, they used k-means clustering-based analysis. Will I try and explain to you what k-means clustering-based analysis is on the podcast? No, I will not. There will be a link in the show notes, because this is one that’s much easier to read than to explain. They used our open-sourced big data library called MADlib for the analysis, and some other open-sourced tools like Python and R. They did some visualization using R and Python, Matplotlib, etc.
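For readers who would rather see k-means than hear it described, here is a minimal sketch of the standard Lloyd's algorithm in plain Python. It is not MADlib's in-database implementation and not the team's code. The two store features (shrinkage rate, sales in millions) and their values are assumptions for illustration only.

```python
import math
import random


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm: returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
    return centroids, labels


# Hypothetical stores described by (shrinkage rate, sales in millions).
points = [
    (0.010, 1.00), (0.012, 1.10), (0.011, 0.90),
    (0.050, 0.40), (0.055, 0.50), (0.048, 0.45),
]
centroids, labels = kmeans(points, k=2)
```

In practice the features would normally be scaled to comparable ranges first, since k-means relies on Euclidean distance and the column with the largest spread dominates otherwise.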

They were able to take this huge amount of information and run a lot of statistical analysis very, very quickly. I got some stats that I think will help contextualize that for you. They were able to run the k-means clustering on 5,000 rows with 300 columns, so this is a big matrix, in under 2 seconds. They could do the same thing on 400,000 rows with 300 columns in less than 30 seconds. They were also able to run very complex join and aggregation queries on tables that had over 1.6 billion rows in less than 260 seconds. The speed element and the performance element is important because we’re running lots and lots of different models.

We want to get answers quickly. Waiting 2 seconds for something is a lot better than waiting 20 minutes for something, or not being able to process the data in the first place, which often happens. You throw 1.6 billion rows into a table for many databases, and they start to choke.

What were the answers? What did they find? What they found is that concentrating on a few stores and a few products in their stores can have a really high impact. They were able to focus in the solution domain. What they found is that the top x-number of stores by shrinkage contribute to 20 percent of total shrinkage, but less than 10 percent of total sales. What am I saying there? I’m saying that stores that they’re losing money on from a cost perspective are also contributing far less from an income perspective as well. Which is really interesting.
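The comparison of shrinkage share against sales share is simple arithmetic. A minimal sketch follows, with store names and figures invented for illustration; they are not the retailer's numbers and are not meant to reproduce the 20 percent and 10 percent figures.

```python
# Hypothetical per-store totals for one period (made-up values).
stores = {
    "store_a": {"sales": 80_000, "shrinkage": 4_000},
    "store_b": {"sales": 420_000, "shrinkage": 3_000},
    "store_c": {"sales": 500_000, "shrinkage": 3_000},
}


def top_store_shares(stores, top_n):
    """Share of total shrinkage and of total sales held by the top-N shrinkage stores."""
    ranked = sorted(stores, key=lambda s: stores[s]["shrinkage"], reverse=True)[:top_n]
    total_sales = sum(v["sales"] for v in stores.values())
    total_shrinkage = sum(v["shrinkage"] for v in stores.values())
    shrinkage_share = sum(stores[s]["shrinkage"] for s in ranked) / total_shrinkage
    sales_share = sum(stores[s]["sales"] for s in ranked) / total_sales
    return ranked, shrinkage_share, sales_share


ranked, shrinkage_share, sales_share = top_store_shares(stores, top_n=1)
```

When the shrinkage share of the top stores is well above their sales share, as in this toy data, those stores are the obvious place to focus.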

And in general, shrinkage increased linearly with sales, so there’s a parallel there. Some stores have a much higher or lower shrinkage as a percentage of sales, so these were indicators. If you had a high percentage of shrinkage, obviously, more focus is required. If you have a low percentage of shrinkage, great opportunity to learn what is happening at that store, and why things are going on. What they also found is that there’s a correlation between inventory data and store orders. They’re able to compare what’s coming in and what’s actually there. They can create a shrinkage early warning system, a really simple way to identify when something is going awry at the store. Management needs to get involved, maybe the loss prevention team needs to get involved more closely, etc.
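One very simple way to turn "shrinkage as a percentage of sales" into an early warning is to compare each store's rate with the typical rate. The sketch below does that with a median and two thresholds. The thresholds, labels and figures are assumptions for illustration, not what the team actually deployed.

```python
from statistics import median


def classify_stores(stores, high=1.5, low=0.5):
    """Label each store by its shrinkage rate relative to the median rate."""
    rates = {name: s["shrinkage"] / s["sales"] for name, s in stores.items()}
    typical = median(rates.values())
    flags = {}
    for name, rate in rates.items():
        if rate > high * typical:
            flags[name] = "investigate"   # unusually high loss: needs focus
        elif rate < low * typical:
            flags[name] = "learn_from"    # unusually low loss: find out why
        else:
            flags[name] = "typical"
    return flags


# Made-up store totals.
stores = {
    "store_a": {"sales": 100_000, "shrinkage": 1_000},
    "store_b": {"sales": 200_000, "shrinkage": 2_100},
    "store_c": {"sales": 150_000, "shrinkage": 1_500},
    "store_d": {"sales": 120_000, "shrinkage": 3_600},
    "store_e": {"sales": 300_000, "shrinkage": 900},
}
flags = classify_stores(stores)
```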

It really gave them a handle on where their losses were taking place, a really interesting case study.

What about the making money side? Well, we’ll continue with the retail story. In this case, it was around demand modeling. What does a demand model do? It establishes the relationship between prices, promotions and other relevant drivers on sales. How do things work? How do I sell stuff? I might do marketing promotions, social media activity. I might do markdowns on prices. I might do sales. I might do gimmicks. There’s a whole range of things that I can do. We need to correct and modify these models on a regular basis.
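To make the idea of a demand model concrete, here is a deliberately tiny sketch with a single driver, price, fitted as a log-log regression so that the slope reads as a price elasticity. The real engagement used many more drivers; the data here is synthetic, generated from a known elasticity of -1.5 so the fit can be checked.

```python
import math


def fit_elasticity(prices, units):
    """Least-squares slope of log(units) on log(price)."""
    xs = [math.log(p) for p in prices]
    ys = [math.log(u) for u in units]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var = sum((x - x_bar) ** 2 for x in xs)
    return cov / var


# Synthetic data: units = 1000 * price^-1.5, with no noise.
prices = [10.0, 12.0, 15.0, 20.0]
units = [1000.0 * p ** -1.5 for p in prices]
elasticity = fit_elasticity(prices, units)
```

An elasticity of -1.5 means a 1 percent price cut goes with roughly a 1.5 percent rise in unit sales under this model.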

In this particular engagement, they decided to create a rich set of reusable data assets. They decided to create an ongoing analysis and reporting framework. They decided to put together some 50-plus different features for 150 million different types of transactions. These were things like promotion type, inventory, prices, discounting, seasonality, etc., really complicated stuff. What they wanted to do was to build a demand model to explain and predict unit sales of different products. What they wanted to do is to dive underneath the fundamentals of the primary demand model, and look at what the different levers were that they could handle, by department and by store.

It wasn’t just a general, hey, if we do this we’ll have some more of that across the board. It was specific to the department, specific to the store, which meant you had geographic location factors, demographic factors, etc. They wanted to see which levers were more effective and what metrics were more elastic to these different levers. In this case, they used Bayesian hierarchical modeling. Again, will I try to explain what Bayesian hierarchical modeling is? No, I will not, as I will fail miserably. You can look it up, and I’ll share a link in the show notes. What they did was they found 14 key levers that worked really effectively.
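The core intuition behind hierarchical modeling by department and by store is partial pooling: a store with little data borrows strength from its department, while a store with lots of data mostly speaks for itself. The sketch below shows that one idea for a normal-normal model with variances assumed known. It is a simplification for illustration, not a full Bayesian hierarchical fit and not the model the team built; all parameter names and values are hypothetical.

```python
def pooled_estimate(store_mean, n_obs, group_mean, noise_var, between_var):
    """Posterior mean of a store-level effect under a normal-normal model.

    store_mean:  average effect observed at the store
    n_obs:       number of observations behind store_mean
    group_mean:  department-level average effect (the prior mean)
    noise_var:   variance of a single observation (assumed known)
    between_var: variance of true store effects around group_mean (assumed known)
    """
    precision_data = n_obs / noise_var
    precision_prior = 1.0 / between_var
    weight = precision_data / (precision_data + precision_prior)
    return weight * store_mean + (1 - weight) * group_mean


# A store with few observations is pulled halfway to the department mean...
few = pooled_estimate(-2.0, n_obs=4, group_mean=-1.0, noise_var=1.0, between_var=0.25)
# ...while one with many observations stays close to its own estimate.
many = pooled_estimate(-2.0, n_obs=36, group_mean=-1.0, noise_var=1.0, between_var=0.25)
```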

Then they were able to create a what-if analysis. They were able to run real-time scenarios to inform pre-season and in-season plans, so they could make changes on the fly as to what activities were going to take place. What this meant was that we could see what the positive and negative impacts of all the levers were. We would be able to create these predictive sales reports to say here’s what we think it’s going to look like. Obviously, the actual sales could be fed back into the model as well, so we could tune that and understand where things were going awry, and where things should be, and where the predictions are working.

There was a really interesting simulation example that was used. I’ll throw this one out there because it’s interesting. They had a situation where they said, let’s have a goal. We’ll build a scenario, and we’ll simulate what happens. The goal was to increase Q3 sales within a particular brand style. In this case, it was sunglasses. They built a scenario where they offered a promotional discount of 15 percent. They also offered 3 more color varieties than in the previous season. They fed that into the model to see what the answer would be. In this case, it resulted in 10 percent greater unit sales compared to a business as usual plan.

It was driven by both a quantified positive impact of discounts on sales, so what they had observed before in that particular category, and also a quantified positive impact of color assortment on sales in that particular category. What this means is that the retailer can create dynamic models, and understand what happens if they make different stock choices, what is the effect on their overall sales, and do that with a reasonable level of precision and a high degree of confidence. This really changes the method in which we go and order stock. You can imagine there’s a big responsibility being undertaken, it’s a big expense, and there’s a lot of risk involved.
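A what-if scenario like the sunglasses one can be sketched as plugging lever changes into a fitted model and comparing against business as usual. The sketch assumes a log-linear demand model, and the two coefficients are invented for illustration so that the result lands in the neighbourhood of the figure mentioned; they are not the retailer's fitted values.

```python
import math


def scenario_uplift(levers, coefficients):
    """Relative change in unit sales versus business as usual (log-linear model)."""
    log_change = sum(coefficients[name] * delta for name, delta in levers.items())
    return math.exp(log_change) - 1.0


# Hypothetical coefficients: effect on log(unit sales) per unit of each lever.
coefficients = {"discount_pct": 0.004, "extra_colors": 0.012}

# Scenario from the episode: 15 percent discount and 3 more color varieties.
scenario = {"discount_pct": 15, "extra_colors": 3}
uplift = scenario_uplift(scenario, coefficients)

# Business as usual changes nothing, so the uplift is zero.
baseline = scenario_uplift({"discount_pct": 0, "extra_colors": 0}, coefficients)
```

Swapping in different lever values gives a different scenario, which is the on-the-fly planning use described above.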

If we can reduce the risk by feeding in real-time data, or actual data I should say, and historical data as well, we can create models that are far more relevant and useful, and that get you the outcome you want. In this case, show me how I can make more money. There you go, two examples of data science in action, supported by technology, but driven by business needs and driven by the business goal. I hope you found that interesting and useful. Again, lots of links in the show notes, particularly for the statistical stuff for those of you who are interested. Until then, keep on building.

Speaker 1:
Thanks for listening to the Pivotal Perspectives Podcast, with Simon Elisha. We trust you’ve enjoyed it, and ask that you share it with other people who may also be interested. We’d love to hear your feedback, so please send us any comments or suggestions. We look forward to having you join us next time on the Pivotal Perspectives Podcast.

About the Author

Simon Elisha is CTO & Senior Manager of Field Engineering for Australia & New Zealand at Pivotal. With over 24 years industry experience in everything from Mainframes to the latest Cloud architectures - Simon brings a refreshing and insightful view of the business value of IT. Passionate about technology, he is a pragmatist who looks for the best solution to the task at hand. He has held roles at EDS, PricewaterhouseCoopers, VERITAS Software, Hitachi Data Systems, Cisco Systems and Amazon Web Services.
