How Data Science Delivers Retail E-Commerce Analytics, Comparisons & Decisions

November 10, 2014 Ellie Dobson

Unexpectedly, the next thing to become “in fashion” in the fashion industry is data science.

While the fashion industry is known for its big events in New York City, with high-end brands draping celebrities as they walk down the red carpet and the media commenting on the “hot look” for the season, data science is changing the industry. Specifically, it is allowing fashion leaders to gain a faster, more complete understanding of the consumer and of how consumer behavior impacts marketing, merchandising, margins, inventory, production, and supply.

Background—The State of Fashion Trend-Spotting and Intelligence

Women’s apparel is a $116.4 billion market in the U.S., and men’s apparel has been gaining ground along with a continued uptick in e-commerce sales for apparel. Behind the big business of fashion are legions of people who stay on top of culture, catwalks, and consumer trends, feeding the information to all players in the ecosystem: designers, textile procurement agents, manufacturers, buyers, retailers, merchandisers, marketers, advertisers, and consumers.

Today, companies use consulting services or internal teams to research what is happening with consumers and fashion around the world, even sending out photographers to find sub-culture looks that are gaining mainstream ground. In addition, information is collected from inside retail environments by reporting on patterns, colours, hot products, mark-downs, garment counts, what is selling well, and what is not. As in other sectors of retail and manufacturing, information about what is out of stock or overstocked has a considerable impact on margins and profits. Historically, much of this information has been captured and presented through largely inefficient, manual approaches. To address this, one company recently brought in Pivotal’s Data Science team to help figure out what was possible with big data.

Fashion and Data Science—Project Goals, Challenges, and Approach

After some initial brainstorming about the possibilities of applying data science, from consumer sentiment analysis to behavioral cross-selling opportunities, a joint Pivotal and customer team settled on a focus and built a roadmap. The goal was to understand what online retailers were doing on their websites and translate this information into actionable insights. The project was defined as a trend forecasting and intelligence service that captured fields of unstructured data from various websites and provided dashboards for self-serve analytics. The service would allow end-users to visually interact with product categories and sub-categories, colours, and retailers as well as metrics like new merchandise, mark-downs, and out-of-stock items. In particular, the customer saw value in showing how online retailing and merchandising change over time. For example, they wanted to see how a sub-category was being marked down during a season or whether one retailer was selling out faster than others. These insights would help the entire supply chain, from designer to advertiser, with decision-making.

One of the key challenges with building such a system was starting with data in its raw form: there were no clean, reliable, consistent metrics. The Pivotal team had to iterate on cleansing and presenting the website data until it made sense, while working with a group of subject matter experts who were not data-centric. The development process included a series of six weekly review and feedback cycles with a variety of stakeholders to improve the results. This meant getting feedback on what the dashboards suggested, cycling back to the source data, looking at each step, and improving how the data was sourced, cleansed, transformed, filtered, and presented. Ultimately, this cooperative innovation ensured the results delivered value.

The Data Science Architecture and Solution

At a high level, the data was captured from the web, processed, and made available to dashboards in Tableau. The diagram below is an overview of the overall workflow. Of course, the front-end dashboards could have been built with any visualization tool or language that offers SQL connectors, such as Python, Java, or Matlab.


After the initial load of 18 months of daily data, web crawlers continued to scrape and parse 100+ retailer websites on a daily basis and push the raw, semi-structured information into a Pivotal HD cluster with a virtualized HAWQ environment. The data was a daily snapshot of what existed on the retailer websites, including what was added and what changed. Information on product availability, pricing, colour, country, retailer, and more was extracted from the web data. The initial sizing was for 1.5 billion rows.
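To make the shape of one daily snapshot row concrete, here is a minimal Python sketch. It is an illustration only: the article does not describe the crawler output or the table schema, so the `ProductSnapshot` class, the `parse_record` helper, and every field name and sample value below are assumptions.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ProductSnapshot:
    """One product as seen on one retailer website on one day (hypothetical schema)."""
    snapshot_date: date
    retailer: str
    country: str
    product_id: str
    colour: Optional[str]
    price: Optional[float]
    in_stock: bool


def parse_record(raw: dict, snapshot_date: date) -> ProductSnapshot:
    """Turn one semi-structured scraped record into a typed row.

    The keys of `raw` are assumptions made for this sketch.
    """
    price_text = raw.get("price")
    return ProductSnapshot(
        snapshot_date=snapshot_date,
        retailer=raw["retailer"],
        country=raw.get("country", "unknown"),
        product_id=raw["product_id"],
        # Normalise colour text; empty or missing values become None.
        colour=(raw.get("colour") or "").strip().lower() or None,
        price=float(price_text) if price_text not in (None, "") else None,
        in_stock=raw.get("availability", "").strip().lower() == "in stock",
    )


# Made-up example record, not real retailer data.
raw = {
    "retailer": "retailer_a",
    "country": "UK",
    "product_id": "jeans-001",
    "colour": " Indigo ",
    "price": "45.00",
    "availability": "In Stock",
}
row = parse_record(raw, date(2014, 11, 10))
print(row)
```

Keeping one typed row per product per day is what lets later steps compare consecutive days; in the project itself this kind of cleaning was done in SQL inside HAWQ.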

Next, the data was processed into intermediate tables within HAWQ, using SQL to filter, clean, and analyse data from multiple datasets. A third-party tool was used to tag web page data; a page for a pair of jeans, for example, was tagged as belonging to the jeans category and the cropped-jeans sub-category. Metrics related to the behavior of new-in and sold-out items were developed to allow the entire team to spot trend behavior over time without direct access to point of sale or e-commerce CRM systems. This was the heart of the data science team’s effort. In traditional systems, one might assume a flag in a database marks something as out of stock. By looking at the website each day, however, the system could see how items sold out and came back, and could establish a trend for how that value changed over time.
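The idea of deriving new-in and sold-out behavior from daily snapshots, rather than from a stock flag, can be sketched in a few lines of Python. This is a simplified illustration and not the team's actual metric definitions, which the article does not give; the `snapshots` layout, the product IDs, and the event names are all hypothetical.

```python
from datetime import date

# Hypothetical daily snapshots: date -> {product_id: in_stock}
snapshots = {
    date(2014, 9, 1): {"jeans-001": True, "jeans-002": True},
    date(2014, 9, 2): {"jeans-001": False, "jeans-002": True, "jeans-003": True},
    date(2014, 9, 3): {"jeans-001": True, "jeans-002": False, "jeans-003": True},
}


def daily_events(snapshots):
    """Compare consecutive days and list new-in, sold-out and restocked items."""
    events = []
    days = sorted(snapshots)
    for prev_day, day in zip(days, days[1:]):
        prev, curr = snapshots[prev_day], snapshots[day]
        for product_id, in_stock in sorted(curr.items()):
            if product_id not in prev:
                # First time the product is seen on the site.
                events.append((day, product_id, "new_in"))
            elif prev[product_id] and not in_stock:
                events.append((day, product_id, "sold_out"))
            elif not prev[product_id] and in_stock:
                # The item came back after selling out.
                events.append((day, product_id, "restocked"))
    return events


events = daily_events(snapshots)
for event in events:
    print(event)
```

Counting such events per day and per sub-category gives a time series that can be tracked without any access to sales systems. Products that disappear from a site entirely are ignored here; a real pipeline would need a rule for them.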

Further SQL processing inside HAWQ categorized garments by attributes such as price, colour, and size. This allowed the team to group similar items together and compare trend patterns for such groups across multiple retailers. Techniques were designed to normalise the data appropriately across all categories so that comparisons across retailers or sub-categories were fair and unbiased. Groups with too few items to be statistically meaningful were also filtered out to avoid misleading conclusions.
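As a rough illustration of normalising and filtering, the Python sketch below turns raw sold-out counts into rates per retailer and drops groups with too few items. The article does not specify the normalisation techniques that were actually used, so dividing by assortment size is only one plausible choice, and the `MIN_ITEMS` threshold, the retailer names, and the counts are made up.

```python
MIN_ITEMS = 20  # hypothetical minimum group size; the real threshold is not given

# Hypothetical counts: (retailer, sub_category) -> (sold_out_items, total_items)
counts = {
    ("retailer_a", "cropped-jeans"): (30, 200),
    ("retailer_b", "cropped-jeans"): (6, 40),
    ("retailer_c", "cropped-jeans"): (2, 4),
}


def sold_out_rates(counts, min_items=MIN_ITEMS):
    """Return the sold-out rate per group, skipping groups below min_items."""
    rates = {}
    for key, (sold_out, total) in counts.items():
        if total < min_items:
            # Too few items: a rate here would be noise, so leave the group out.
            continue
        # Dividing by assortment size makes large and small retailers comparable.
        rates[key] = sold_out / total
    return rates


rates = sold_out_rates(counts)
for key, rate in sorted(rates.items()):
    print(key, round(rate, 3))
```

On raw counts, retailer_a (30 sold out) looks far ahead of retailer_b (6 sold out), yet both have the same 15% rate once assortment size is taken into account. Retailer_c would show a 50% rate based on only four items, which is exactly the kind of misleading figure the filter removes.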

Lastly, the team converted the data into insight tables that were consumed by the dashboards and, ultimately, users who could get value without any technical understanding. Cycling through the finer details and meaning of each dashboard and data set included a considerable amount of collaboration with subject matter experts and a cross-disciplined team of stakeholders. In addition, the data science team addressed where the data and system could introduce bias.

The Business Results and Overall Impact

Once the production system was operational, retailers or other members of the ecosystem could answer a variety of questions based on what was really happening on retailer websites. End users were equipped with visual and interactive dashboards to analyse on the fly and ask questions like: What buying trends exist? Are they long-term or short-term trends? Do different markets show differences in purchase behavior? Who competes in a category of a particular colour? Does a particular retailer lag or lead trends? The engagement also facilitated knowledge transfer from Pivotal Data Labs’ data scientists to practitioners on the customer side.

The dashboard below has a main plot with trend metrics: the higher the line, the bigger the trend. The dots below the main plot show the trends by retailer. In this example, we can see that skinny jeans seem to be going out of style, and certain retailers are leading in cropped jeans.


From a competitive standpoint, the solution provides new-to-the-world retailer intelligence where data and analysis didn’t exist before, clearly a “Post-Paper Era” application with big data and the foundation for a data lake. By shifting the process of data collection from manual to systematic, business leaders can infer what is happening without using direct sales data gleaned from separate point of sale or e-commerce systems. The tool provides a much clearer picture about what is ringing the cash register and what isn’t across a variety of perspectives and retail outlets, including competitive comparisons.

Importantly, the platform had to scale with a multi-terabyte volume of data that grew daily, so that development stayed quick and cost-effective. The overall process of prototyping the capture, processing, analysis, and review of dashboard data was iterated on holistically and incrementally by the data science team on a multi-node HDFS back end with a SQL interface. In the Pivotal environment, the team did not run into speed constraints from the data store or the SQL processing engine. While many data science developers wait hours or days to review query results, the front-end queries here returned insights in under a second.

As part of Pivotal Big Data Suite, Pivotal HD with HAWQ allowed this company to achieve greater agility and speed to market, quickly enabling a new type of data-driven business and product. The customer is now looking to scale up its architecture.
