Using Data Science to Predict TV Viewer Behavior and Formulate a Hit TV Show

March 16, 2015 Jarrod Vawdrey

This blog is part of a series with joint work performed by Jarrod Vawdrey and Noelle Sio.

Television executives and media companies are beginning to embrace data science as a way to understand viewership. By combining unstructured data (e.g., text and video) with traditional data sources, data scientists are using machine learning to identify how production decisions affect ratings and which decisions have the greatest influence.

In a recent engagement with a global media conglomerate, the Pivotal Data Labs team investigated what makes viewers tune in to and tune out of specific television shows. The challenge at the outset was that large and diverse datasets for broadcast shows are not generally available, given that most data is still collected manually. Thus, we had to be creative about what additional datasets could be used to feed a predictive model.

Existing efforts, which used manually collected metadata, had reached a ceiling in predictive performance. Although these models were sophisticated, they were all fed with features based on structured data. To improve upon these existing efforts, we explored augmenting this dataset with unstructured sources such as video, audio, transcripts, and social media. Ultimately, we decided to use transcript data in our modeling efforts since it was the most readily available. By doing so, we were able to improve upon existing models and provide actionable insights that could be taken directly to TV show producers.

In this blog, we will describe our approach, the tools that we used, and some lessons learned.

Background: Adding More Data (Science) to Traditional Ratings

Historically, media companies have been limited in their understanding of viewers, using only third-party data sources, such as Nielsen television ratings, to track and analyze audience size and composition. Nielsen collects data from both diaries and television-connected devices to measure viewing habits across demographics such as age, gender, race, economic class, and area. However, for a TV show producer, this data does not give specific feedback about how to improve an individual broadcast or episode.

Unlike in the digital, social world, using data to drive decisions is not common in the television world and is considered quite innovative. The only companies doing something similar are newer media companies like Amazon and Netflix, which have been tracking actual viewing data to determine which shows are likely to be successful, such as House of Cards with Kevin Spacey. For example, online-focused companies use meta-tags together with information on as many as 30 million plays per day to determine what will be a hit, what viewers like, and what keeps them watching.

Goals: Bringing Pivotal Data Science into the Picture

In order to help our customer improve their understanding of viewer behavior, we delivered an end-to-end solution—this encompassed a framework to ingest and manipulate the unstructured transcripts, predictive models, and a means to interact with the data and models.

While many commercial solutions are specialized and proprietary, we were able to build an open solution using the Pivotal platform, which sets the foundation for future advanced analytics work. Additionally, this solution was built to scale both in terms of the number of programs (i.e., every show in their network) and the number of broadcasts (i.e., every show that has ever aired).

The project deliverables included:

A text analytics framework—ingesting, transforming and modeling transcript data in a scalable way
In-database machine learning models—using predictive toolsets, like MADlib or Python libraries via PL/Python
An application—incorporating the data and models into a lightweight application to explore the data and provide what-if simulations

Data, Platform, and Approach

Data

Multiple sources were made available for the project: Nielsen ratings data, manually collected metadata, and show transcripts.

Each data source differed in format and quality. The Nielsen data was provided in report form and required minimal effort to load into the Pivotal platform for analysis. The manually collected data was also in report form; as with most manually collected data, it contained many entry errors, which typically makes such data unreliable for modeling purposes. Finally, the show transcripts were in text format and held little to no consistent structure from one broadcast to another.

Platform

The final model was deployed on a Pivotal Hadoop/HAWQ instance exposed to Pivotal Cloud Foundry as a service for production usage. A prototype Node.js application was pushed to the same Cloud Foundry instance, which exposed end-users to analytical insights and allowed them to interact with model results.

Approach

As with most data science projects, and text analytics in particular, the majority of effort is spent cleaning and manipulating data. Part of this effort was developing a framework that would take the inconsistently formatted transcripts and prepare them so that one could apply any number of sophisticated NLP approaches and algorithms. In this project, we used a topic model to generate features for the overall model. This framework could also have been used for additional features based on tone, language complexity, and more.
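To make the clean-up problem concrete, here is a minimal sketch of attributing spoken text to speakers when the label format changes from one transcript to the next. The three label formats and the function name `attribute_speakers` are assumptions chosen for illustration; the actual transcript formats in the project are not described here, and real transcripts would likely need more patterns and some manual review.

```python
import re

# Hypothetical speaker-label formats; real transcripts will vary by broadcast.
SPEAKER_PATTERNS = [
    re.compile(r"^\s*([A-Z][A-Z .'-]+):\s*(.*)$"),         # "JANE DOE: text"
    re.compile(r"^\s*>>\s*([A-Za-z][\w .'-]*):\s*(.*)$"),  # ">> Jane Doe: text"
    re.compile(r"^\s*\[([^\]]+)\]\s*(.*)$"),               # "[Jane Doe] text"
]


def attribute_speakers(lines):
    """Return a list of (speaker, utterance) pairs.

    Lines without a recognizable label are treated as a continuation
    of the previous speaker's utterance.
    """
    turns = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        for pattern in SPEAKER_PATTERNS:
            match = pattern.match(line)
            if match:
                # Normalize whitespace and casing so "JANE DOE" and
                # "jane doe" resolve to the same speaker.
                speaker = " ".join(match.group(1).split()).title()
                turns.append([speaker, match.group(2).strip()])
                break
        else:
            if turns:
                turns[-1][1] = (turns[-1][1] + " " + line).strip()
            else:
                turns.append(["Unknown", line])
    return [(speaker, text) for speaker, text in turns]


# Made-up transcript lines, for illustration only.
sample_lines = [
    "JANE DOE: Welcome back to the show.",
    "We have a great lineup today.",
    ">> John Smith: Thanks for having me.",
    "[jane doe] Let's get started.",
]
sample_turns = attribute_speakers(sample_lines)
```

Note that an all-caps prefix followed by a colon that is not a name would be misread as a speaker under the first pattern, which is one reason this step tends to need iteration on real data.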

The text framework included the following steps:

Data Clean Up: Matching up spoken text with speakers in non-standardized text
In-database Text Transformation: Parsing, Tokenization, Lemmatization, and TF-IDF
Corpus Reduction: Defining the dictionary of interest
Text Modeling: LDA modeling to identify the underlying topics within the transcripts
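The transformation and corpus reduction steps above can be sketched in a few lines of plain Python. This is an illustration only: the project ran these steps in-database, and the stop-word list, the document-frequency thresholds (`min_df`, `max_df_ratio`), and the function names are assumptions. Lemmatization and the LDA step are omitted to keep the sketch short.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list; a real dictionary would be far larger.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "on", "for", "we"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOP_WORDS]


def build_dictionary(token_lists, min_df=2, max_df_ratio=0.9):
    """Corpus reduction: keep terms that are neither too rare nor too common.

    Returns a mapping of term -> number of documents containing it.
    """
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    return {
        term: df
        for term, df in doc_freq.items()
        if df >= min_df and df / n_docs <= max_df_ratio
    }


def tf_idf(token_lists, dictionary):
    """Return one {term: weight} mapping per document.

    Weight = (term count / in-dictionary terms in the document)
             * log(number of documents / document frequency).
    """
    n_docs = len(token_lists)
    vectors = []
    for tokens in token_lists:
        counts = Counter(t for t in tokens if t in dictionary)
        total = sum(counts.values())
        vectors.append({
            term: (count / total) * math.log(n_docs / dictionary[term])
            for term, count in counts.items()
        })
    return vectors


# Made-up snippets standing in for cleaned transcript text.
docs = [
    "The election results and the debate",
    "Weather forecast for the weekend storm",
    "Debate highlights and election analysis",
    "Storm damage and weather warnings",
]
token_lists = [tokenize(d) for d in docs]
dictionary = build_dictionary(token_lists)
vectors = tf_idf(token_lists, dictionary)
```

The reduced dictionary is what keeps a downstream topic model tractable: terms that appear in only one broadcast, or in nearly all of them, contribute little to separating topics.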

The output from the text framework was combined with other features and fed into a series of supervised models built for each viewer population.

Overall Model

The modeling stage started with narrowing down the tens of thousands of features generated to those found to be most predictive of viewership metrics. Using MADlib’s parallelized implementation of linear regression, a regression was run for every feature to calculate its specific influence on ratings. The most relevant features were then filtered further for multicollinearity. Several algorithms were then compared to identify the best-performing model for the data, with elastic net regression resulting in the highest predictive accuracy.
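As a rough illustration of the screening idea, the sketch below ranks features by the fit of a one-variable regression and then drops features that are highly correlated with one already kept. It is a plain-Python stand-in, not the MADlib workflow used in the project; the feature names, the numbers, the `top_k` cutoff, and the 0.9 correlation threshold are all assumptions. The final elastic net fit is not shown.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def screen_features(features, target, top_k=3):
    """Rank features by the R^2 of a one-variable linear regression.

    For simple linear regression, R^2 equals the squared Pearson correlation.
    """
    scores = {name: pearson(col, target) ** 2 for name, col in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k], scores


def drop_collinear(ranked, features, max_abs_corr=0.9):
    """Greedily keep features not strongly correlated with any already kept."""
    kept = []
    for name in ranked:
        if all(abs(pearson(features[name], features[k])) <= max_abs_corr
               for k in kept):
            kept.append(name)
    return kept


# Made-up values, for illustration only (one entry per broadcast).
features = {
    "topic_politics": [0.1, 0.4, 0.35, 0.8, 0.6, 0.9],
    "topic_politics_scaled": [0.2, 0.8, 0.7, 1.6, 1.2, 1.8],  # redundant copy
    "num_speakers": [3, 2, 4, 2, 5, 3],
}
ratings = [1.0, 1.9, 1.8, 3.1, 2.4, 3.4]

ranked, scores = screen_features(features, ratings, top_k=3)
kept = drop_collinear(ranked, features)
```

In this toy data the two politics features are perfectly correlated, so only one of them survives the filter, while `num_speakers` is kept because it is nearly uncorrelated with either.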

The Insights and Results

It is a commonly held belief that show format (and specifically commercial breaks) has the highest measurable impact on viewership. However, we found that viewership is driven by a mix of format, content, and the people on a show. These factors also differ depending on the demographics of the audience.

Unexpectedly important variables included:

Speaker characteristics
Number of people shown on screen at a time
Broadcast topics

Although we included thousands of features based on the manually collected metadata, the vast majority of them ended up falling out of the final model because they had no predictive power. Instead, the most relevant variables were derived from the transcript data. This analysis delivered a clear perspective on the drivers of show viewership and of popularity changes over time, a new and significant input to decision-making.

The project was delivered in about 8 weeks. It demonstrated the power of leveraging unstructured data and showed the extensibility of the Pivotal platform. Armed with the code, the platform, and training via knowledge transfer, the company has taken the next steps towards becoming a data-driven enterprise by building an application that leverages a wide set of data and data science to provide actionable insights directly to TV broadcast decision makers.

Learn More:

Pivotal HD and HAWQ: Product | Docs | Downloads or Blog Articles
Pivotal Data Labs: About or Blog Articles
