Data Science 101: Using Election Tweets To Understand Text Analytics

May 25, 2016 Srivatsan Ramanujam

 

Joint work by Srivatsan Ramanujam, Regunathan Radhakrishnan, and Scott Hajek of Pivotal Data Science.

Natural Language Processing (NLP) is a field of data science that is enormously powerful and accessible for businesses, and yet is still widely untapped. It can help companies make sense of archived documents, reveal customer sentiment, and even detect customer churn.

To illustrate the power of NLP in terms everyone can understand, the Pivotal Data Science team recently collected Twitter data from the New York Democratic and Republican primaries and used it as a backdrop for three common NLP techniques that can help businesses: word embeddings, sentiment analysis, and topic analysis.

Why New York? The front-runners in both parties are from New York, and since the primary was a current event, we could count on collecting a lot of real-time sentiment quickly, which gives us good fodder for discussing insights.

Word Embeddings

[Figure: Number of tweets per day]

We ingested Twitter data corresponding to roughly 30 hand-picked terms and handles from April 14, 2016 to April 20, 2016. The New York primary occurred on April 19.

The idea behind word embeddings is pretty simple. In the most basic approach, one could represent every word as its own unique vector—data science-speak for a data point—but this approach does not capture any semantic similarities between words, so your analysis can be very skewed. Logic like this suggests that the word “dog” is as dissimilar from “puppy” as it is from “banana.”
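To make the "dog", "puppy", "banana" point concrete, here is a minimal sketch of the one-vector-per-word approach (often called one-hot encoding). The three-word vocabulary and the helper names are illustrative assumptions, not part of our pipeline.

```python
# Toy vocabulary; the three words come from the example in the text.
vocab = ["dog", "puppy", "banana"]

def one_hot(word):
    """Return a vector with 1.0 in the word's own slot and 0.0 elsewhere."""
    return [1.0 if w == word else 0.0 for w in vocab]

def euclidean(u, v):
    """Straight-line distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

d_dog_puppy = euclidean(one_hot("dog"), one_hot("puppy"))
d_dog_banana = euclidean(one_hot("dog"), one_hot("banana"))

# Both distances are identical, so the encoding carries no notion of meaning.
print(d_dog_puppy, d_dog_banana)
```

Every pair of distinct words ends up exactly the same distance apart, which is why this representation cannot tell a near-synonym from an unrelated word.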

In word embeddings, we use statistical algorithms to create a mapping of words to vectors such that semantically similar words are closer in distance than other words. For instance, the vector representation for “powerful” and “strong” will be closer in distance than the representation for the word “Paris”.
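As a rough illustration of "closer in distance", the sketch below compares words with cosine similarity, a common choice for embeddings. The three-dimensional vectors are hand-made for this example and are an assumption; a real model learns vectors with many more dimensions from the text itself.

```python
import math

# Hand-made vectors, purely illustrative (not learned from any data).
toy_vectors = {
    "powerful": [0.9, 0.8, 0.1],
    "strong":   [0.8, 0.9, 0.2],
    "paris":    [0.1, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_close = cosine(toy_vectors["powerful"], toy_vectors["strong"])
sim_far = cosine(toy_vectors["powerful"], toy_vectors["paris"])

# The near-synonyms score higher than the unrelated pair.
print(round(sim_close, 3), round(sim_far, 3))
```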

Once we have a vector representation, we can query the model to find a word that doesn’t belong in a list. Here’s how we queried the model for our election report:

[Figure: word embedding query]

You’ll notice the word representations for “bernie”, “hillary” and “trump” are closer to each other than any of them is to the representation for the word “president”. This means none of the candidates’ names occur strongly enough in contexts similar to where the word “president” is found. This makes sense because none of them are president yet.

Similarly, the model returns “trump” as the odd one out from the list [“bernie”, “trump”, “hillary”]. This implies that “hillary” and “bernie” occur in similar contexts more often than either (“trump” and “hillary”) or (“trump” and “bernie”). This also makes sense: Hillary and Bernie are both Democrats and Trump is a Republican, so the two Democrats tend to be discussed together, separately from Trump.
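One common way to answer an odd-one-out query is to average the normalized vectors of the listed words and return the word least similar to that average. The sketch below assumes that approach; it is not necessarily the exact query we ran. The vectors are hand-made so that they mirror the behavior described above; they are not taken from our trained model.

```python
import math

# Hand-made vectors chosen to mimic the reported behavior (an assumption).
toy_vectors = {
    "bernie":    [0.9, 0.2, 0.1],
    "hillary":   [0.8, 0.3, 0.1],
    "trump":     [0.3, 0.9, 0.1],
    "president": [0.1, 0.1, 0.9],
}

def unit(v):
    """Scale a vector to length 1 so only its direction matters."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def odd_one_out(words, vectors):
    """Return the word whose unit vector is least similar to the group mean."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(vectors[words[0]])
    mean = [sum(units[w][i] for w in words) / len(words) for i in range(dim)]
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

print(odd_one_out(["bernie", "hillary", "trump", "president"], toy_vectors))
print(odd_one_out(["bernie", "trump", "hillary"], toy_vectors))
```

With these toy vectors the first query singles out "president" and the second singles out "trump", matching the pattern discussed above.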

We also computed similarity between word representations and found some patterns:

[Figure: word embeddings]

From the similarity scores, we can see that Donald Trump has been associated with the words “mexico” and “china” more strongly than the other candidates. While this is not shocking, it confirms that public Twitter users are making this association in their discussions.

So, word embeddings can help us glean interesting relationships between words (or sentences, paragraphs and documents) in a collection, in a completely unsupervised fashion. If you expand this thinking, determining these relationships automatically can help you categorize documents or discover trends in them. In the litigation process, tools employing such techniques are frequently used to perform eDiscovery on case documents, while businesses can use this technique to help categorize emails, summarize documents, or generate recommendations.

Sentiment Analysis

Language carries sentiment at multiple levels. Individual words can carry positive or negative connotations, but they can also interact with each other in phrases leading to an overall sentiment that is more than the simple sum of its parts. For that reason, we recommend using an approach that analyzes short phrases for their similarity with known positive or negative adjectives.
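A minimal sketch of this idea, under several assumptions: the phrase is represented as the average of its word vectors, the seed adjectives are a tiny hand-picked list, and the two-dimensional vectors are invented for the example. None of these names or values come from our actual model.

```python
import math

# Invented 2-d vectors for illustration only.
toy_vectors = {
    "great": [0.9, 0.1], "good": [0.8, 0.2],
    "terrible": [0.1, 0.9], "bad": [0.2, 0.8],
    "huge": [0.6, 0.4], "win": [0.7, 0.3],
    "total": [0.5, 0.5], "disaster": [0.2, 0.9],
}
POSITIVE_SEEDS = ["great", "good"]      # hypothetical seed adjectives
NEGATIVE_SEEDS = ["terrible", "bad"]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def phrase_vector(phrase):
    """Average the vectors of the known words in a short phrase."""
    vecs = [toy_vectors[w] for w in phrase.lower().split() if w in toy_vectors]
    if not vecs:
        raise ValueError("no known words in phrase")
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def sentiment_score(phrase):
    """Mean similarity to positive seeds minus mean similarity to negative seeds."""
    p = phrase_vector(phrase)
    pos = sum(cosine(p, toy_vectors[s]) for s in POSITIVE_SEEDS) / len(POSITIVE_SEEDS)
    neg = sum(cosine(p, toy_vectors[s]) for s in NEGATIVE_SEEDS) / len(NEGATIVE_SEEDS)
    return pos - neg

for phrase in ["huge win", "total disaster"]:
    print(phrase, round(sentiment_score(phrase), 3))
```

A positive score suggests the phrase sits closer to the positive seeds, a negative score closer to the negative ones. Averaging word vectors is a simplification and will miss interactions such as negation.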

The figure below shows the number of positive and negative sentiment tweets for every hour starting from April 14 until April 20.

[Figures: negative sentiment for Trump; positive and negative sentiment for Hillary]

Note that for both candidates the number of negative sentiment tweets outweighs the number of positive sentiment tweets most of the time, supporting the view that there is indeed a large amount of negative rhetoric in this election cycle. Also, note that the peaks in the sentiment score time series for Hillary and Bernie line up with moments during the Democratic debate on April 14 and the announcement of the New York primary results on April 19.
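The hourly counts behind a plot like this can be produced with a simple aggregation once each tweet has a sentiment label. The sketch below assumes tweets arrive as (timestamp, label) pairs; the sample rows are made up to show the shape of the data and are not real tweets or results.

```python
from collections import Counter
from datetime import datetime

# Hypothetical pre-labeled tweets: (timestamp string, sentiment label).
labeled_tweets = [
    ("2016-04-14 21:05:00", "negative"),
    ("2016-04-14 21:40:00", "negative"),
    ("2016-04-14 21:55:00", "positive"),
    ("2016-04-19 22:10:00", "positive"),
    ("2016-04-19 22:30:00", "negative"),
]

def hourly_counts(tweets):
    """Count tweets per (hour, sentiment label) bucket."""
    counts = Counter()
    for timestamp, label in tweets:
        parsed = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
        hour = parsed.replace(minute=0, second=0)  # truncate to the hour
        counts[(hour, label)] += 1
    return counts

counts = hourly_counts(labeled_tweets)
for (hour, label), n in sorted(counts.items()):
    print(hour.isoformat(), label, n)
```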

We can also use sentiment analysis to isolate and visualize the positive phrases present in the tweets during the NY primary results announcements through word clouds:

[Figure: trump positive phrases]

We can also use this same technique to create word clouds of top adjectives found for each of the front-runners.


These techniques for sentiment analysis can help surface a vast amount of insight into how consumers perceive a given person or topic. At Pivotal, we’ve applied sentiment analysis to predict customer churn, to predict commodity futures using Twitter, and to measure brand perception and monitor campaigns.

Topic Analysis

The goal of Topic Analysis is to take large collections of documents and organize them by topics. A topic in this case is a collection of words that tend to co-occur in similar contexts. Topic analysis can aid in document summarization by identifying the key themes in a large collection of text. Further, by converting a document into a vector (over topic probabilities) we can use it in many prediction problems like document classification or document recommendation. Tag clouds can aid in visualizing topics and in identifying the prominent topics in a document.
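To show what "a vector over topic probabilities" looks like in practice, here is a compact sketch of Latent Dirichlet Allocation (LDA) fitted with collapsed Gibbs sampling. LDA is one common topic model and is assumed here for illustration; this post does not depend on that particular choice. The tokenized documents, hyperparameters, and function name are all made up for the example.

```python
import random

def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over a list of tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]                 # tokens per (doc, topic)
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]
    topic_total = [0] * num_topics                               # tokens per topic
    assignments = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z = [rng.randrange(num_topics) for _ in doc]
        assignments.append(z)
        for w, k in zip(doc, z):
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the token, then resample its topic given all the others.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha) * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    # Smoothed per-document topic probabilities and top words per topic.
    theta = [[(n + alpha) / (len(doc) + num_topics * alpha) for n in row]
             for row, doc in zip(doc_topic, docs)]
    top_words = [sorted(tw, key=tw.get, reverse=True)[:3] for tw in topic_word]
    return theta, top_words

# Made-up tokenized "tweets" for illustration.
docs = [
    ["primary", "newyork", "vote", "results"],
    ["vote", "primary", "results", "newyork"],
    ["debate", "climate", "change", "debate"],
    ["climate", "debate", "change", "energy"],
]
theta, top_words = fit_lda(docs, num_topics=2)
for row in theta:
    print([round(p, 2) for p in row])
print(top_words)
```

Each row of `theta` is one document expressed as probabilities over the topics, which is the vector that can then feed a classifier or recommender. The top words per topic are what a word cloud would visualize.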

We apply this technique on a collection of tweets corresponding to a candidate to identify trending topics. Topics can be analyzed over time to identify changes, for example: Benghazi was a big deal several months ago, now not so much.

The figure below shows the word clouds for two interesting topics that were discovered in tweets related to Donald Trump.

[Figure: trump topic analysis]

Topic 22 was related to tweets that talked about Donald Trump’s reference to 9/11 as 7/11, while topic 12 was related to tweets talking about the New York primaries. We did a similar topic analysis exercise with tweets related to Hillary Clinton and found the following interesting topics:

[Figures: hillary topic 21, hillary topic 22, hillary topic 23]

Topic 21 was related to tweets about a hashtag (#DemocraticWhores) that was trending on Twitter that week. One of the speakers at a Bernie Sanders rally referred to Hillary Clinton as a “Corporate Democratic Whore”, which raised a lot of eyebrows for its inappropriateness. Topic 23 was related to tweets talking about George Clooney backing Hillary over Bernie Sanders. Topic 22 was related to tweets that talked about a particular moment in the NY debate between Bernie and Hillary, when Hillary had a grin on her face while Bernie was talking about climate change, which generated a lot of negative sentiment tweets.

Now, to extend these concepts to the land of business, imagine running these types of analysis on your customer communications, including support emails. This could surface key areas of concern for customers that the product organization could use to drive development. Taking this one step further, Pivotal has used topic analysis to predict customer churn for a major telco company, finding that customer calls with a dropped-call topic tended to be associated with customers with lower satisfaction ratings than those who called about a device upgrade. Armed with this kind of knowledge, you can then take action to change call scripts or offers, or even guide future product design, in order to raise customer satisfaction and retention.

Getting Started With Data Science

Data science has tremendous potential to provide the concrete guidance and insight that businesses crave. The days of focus groups, surveys and relying exclusively on experience are numbered because they are imperfect data sets. Businesses simply can’t afford to miss these signals through the noise.

Word Embeddings, Sentiment Analysis, and Topic Analysis are just a handful of the techniques the Pivotal Data Science team uses to solve a variety of business problems, ranging from brand analysis to predicting customer churn and surfacing retail recommendations. To have our team help you apply data science to bolster your business, simply contact us here.

For those looking to start playing with data science on their own, the Pivotal Big Data Suite makes it easy to get started ingesting and analyzing Twitter data. And, since all of its major components are now open source, getting started is even easier. In a follow-up blog post, we will share detailed technical instructions on how we set up our environment so that you can try this at home (or the office). As a quick guide though, we used Pivotal’s Spring XD data pipeline framework, which comes with Twitterstream, a module for tapping into the Twitter Streaming API. Spring XD facilitates real-time processing of the tweets as well as sinking the data to HDFS or a variety of other storage options. For more details, see the quick-start tutorial as well as the Pivotal blog post on creating real-time counters and gauges for a Twitter stream and our Decahose pipeline.

 
