November revealed a plethora of fascinating data-driven insights into how teams work together, the ways that social behavior dictates our approach to online privacy, the activities of Yelp users, and much more. Here’s our roundup of the biggest data science news of the month, from Pivotal, academia, and the tech industry.
Project management app Asana released a stunning animated visualization on its blog demonstrating some of the many insights the company has gleaned about teamwork from its user data. Sociology PhD student Clark Bernier adds context for the data science heads in a post on the company’s engineering blog.
Yelp released a deep dataset for researchers this month, drawing on the voluminous information the company has collected from Phoenix, Las Vegas, Madison, Waterloo, and Edinburgh. Yelp’s Challenge Dataset encompasses 42,153 businesses, 320,002 business attributes, 252,898 users, a social graph of 955,999 edges, 1,125,458 reviews, and much more.
After Bing successfully predicted 15 out of 16 World Cup rounds and posted strong results with its NFL predictions, its UK data science team is setting out to repeat the success by predicting the outcome of matches in the English Premier League.
Nilesh Karnik of Aureus Analytics provides an in-depth primer on unsupervised learning algorithms, machine learning techniques that group similar documents together and then create descriptions for those document groups in order to identify distinct categories.
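For readers who want a feel for the idea, here is a minimal sketch of grouping similar documents and then describing each group by its most frequent terms. It is not taken from Karnik’s primer: the greedy single-pass strategy, the tiny stopword list, the 0.3 similarity threshold, and the function names (`cluster_documents`, `describe`) are all assumptions made for illustration, and real systems typically use more robust methods.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for this sketch.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "for", "is", "on", "with"}


def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_documents(docs, threshold=0.3):
    """Greedy single-pass clustering.

    Each document joins the most similar existing cluster if the similarity
    to that cluster's summed term counts reaches the threshold; otherwise it
    starts a new cluster.
    """
    clusters = []  # each cluster: {"centroid": Counter, "members": [doc indices]}
    for index, doc in enumerate(docs):
        vector = Counter(tokenize(doc))
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vector, cluster["centroid"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"centroid": Counter(vector), "members": [index]})
        else:
            best["centroid"].update(vector)  # add this document's term counts
            best["members"].append(index)
    return clusters


def describe(cluster, n=3):
    """Describe a cluster by its n most frequent terms."""
    return [term for term, _ in cluster["centroid"].most_common(n)]


if __name__ == "__main__":
    sample = [
        "The striker scored a goal in the football match",
        "The football team won the match with a late goal",
        "The central bank raised interest rates",
        "Interest rates and bank lending rose",
    ]
    for cluster in cluster_documents(sample):
        print(cluster["members"], describe(cluster))
```

No category labels are supplied up front; the groups and their descriptions emerge from the documents themselves, which is what makes the approach unsupervised.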
Despite their initial popularity, sites like Quora and Yahoo Answers now lie fallow for a number of reasons: unreliable answers, the rise of more specialized forums such as Stack Overflow, and the fading novelty of such crowdsourced Q&A services. Wired profiles Aaron Patzer, founder of Mint, who believes that a rigorously data-driven approach can reverse this trend. His new site Fountain is patterned after a Siri-like interface and uses data science techniques such as natural language processing to connect queries with expert advice.
Aylin Caliskan-Islam, a graduate student at Drexel, discusses her current research on the Freedom to Tinker blog. Her work uses machine learning and natural language processing to gain insight into the social behavior patterns that dictate our privacy practices online.
Visualoop announces an update to the World Digital Library website, which collects vintage infographics and data visualizations dating all the way back to the 17th century, demonstrating that while the tools for data analysis have been radically transformed, the art and design of visualization have a storied, and often beautiful, past.
This Month in Pivotal Data Science
3 Key Capabilities Necessary for Text Analytics & Natural Language Processing in the Era of Big Data
This post explains common, unstructured text processing tasks in detail so we can understand how they merge with traditional analytics on structured data. Then, we outline the three key capabilities that data scientists must have to help businesses reach this new generation of analytical applications. Lastly, we explore how data scientists can approach in-database text analytics, text analytics on Apache Hadoop® and Spark, and list many other open source natural language processing toolsets available on a Pivotal Platform.
On this podcast, Simon speaks “live to tape” with Ailey Crow, Senior Data Scientist, about this year’s Strata + Apache Hadoop® World conference that concluded last month, the “Data Jam” Pivotal held there, and how she became a Data Scientist.
Data science might become the next “hot item” in the fashion industry, and this post provides a business and technical overview of why the entire fashion industry is moving towards big data analytics. It explains how a recent Pivotal customer worked with Pivotal Data Labs to develop a system that analyzed daily web-scrape data from e-commerce retailers to infer fashion trends from metrics on stockouts, pricing, color, and more.
Helping a troubled teenager through a crisis isn’t what comes to mind when you think of a data scientist’s regular activities. But courtesy of the Pivotal for Good program, Noelle Sio is doing just that, in partnership with the data philanthropy nonprofit DataKind. Crisis Text Line, whose trained counselors assist hundreds of at-risk teens every day, is the beneficiary of Noelle’s current work.
From Sea to Trees, Pivotal Data Science Looks at Climate Change in Acadia National Park: Day Field Reports
Data science is being used to transform the way we measure the world around us, particularly in the area of climate change. In this series, Pivotal’s Senior Data Scientist, Srivatsan Ramanujam, shares a journal of his trip to Acadia National Park to work with a cross-functional team from Pivotal, EMC, Earthwatch, and Schoodic Research Institute on improving the way they collect and analyze data within the context of a data lake. From studying birds to barnacles, this is important work to extend the reach of citizen science and our understanding of climate change. Read all 4 days: Day 1, Day 2, Day 3, Day 4
With the right approach to data science and the right technology platform, companies can gain a shorter time to insight, utilize more dark data, improve model quality and decision-making, and minimize data movement. Yet, switching data science toolsets can be a daunting task. In this story, the Pivotal Data Science team has proven change is merited in 5 powerful ways.
December Pivotal Data Science Events
December 16th, 2014 10:00 AM PST / 6:00 PM GMT
With 2015 just around the corner, the Pivotal Data Science team has been challenged to point its predictive inclinations toward spotting emerging trends in Data Science. With a global team of 30, doing innovative work in almost every vertical market, Pivotal’s data scientists have a rich view into the underlying trends and shifts impacting their craft.
In this webcast, leaders from the team – Annika Jimenez, Kaushik Das and Hulya Farinas – will share their insights on the key Data Science industry trends for the coming year. Every angle of Data Science is fair game:
- New use cases at the vertical level
- Analytical tool usage trends
- Implications of the shift in focus to model operationalization
- Meta observations about maturity of the craft
- Ethics evolution in Data Science
- Venture capital activity
Join us for this lively discussion of top predictions for Data Science in 2015. The presentation will be followed by a Q&A session where attendees will have an opportunity to share their own thoughts and predictions.
Editor’s Note: Apache, Apache Hadoop, Hadoop, and the yellow elephant logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.