Data scientists come from various research backgrounds, bringing a wide array of interests to bear upon their work. Noah Zimmerman, Senior Data Scientist at Greenplum, embodies this eclecticism. Boasting a background in immunology and biomedical informatics, Zimmerman is also fascinated with human-computer interaction and design for collaboration. All of these interests inform Zimmerman’s work with Greenplum customers. In this interview, he speaks about his background, and the path from immunology and design to data science.
Datastream: Can you give us some detail on your background and diverse set of interests?
I did my PhD in biomedical informatics at Stanford. I was originally interested in artificial intelligence, in particular knowledge representation: how do you structure knowledge in logical formalisms so that you can derive conclusions you didn’t have before from the information you do have? That means inference and deduction, using technologies developed for the semantic web. I think I was drawn to the knowledge representation side of things because I had some background in philosophy and theory of mind. Ultimately my interests and focus shifted from the representation of information to the source of the information: data. It became clear to me in graduate school that data is king.
My PhD work was advised by Guenther Walther in the statistics department and Lee Herzenberg in Genetics. Together we worked on the analysis of data produced by an instrument called a flow cytometer, which allows you to count proteins on the surface and inside of cells very rapidly, roughly 5,000–10,000 cells per second. At that rate, it’s trivial to collect data on millions of cells, measuring up to 20 different characteristics on each of those cells. With that kind of throughput, you can imagine how you get to large datasets pretty quickly.
It became interesting to me to see what different types of techniques could be used to mine information from that data. We worked on unsupervised statistical learning algorithms and distance metrics, and worked out computational approaches to isolate homogeneous populations of cells and quantify changes in those cells across different conditions, such as disease states or stimulations. Some of this work went on to be applied in the area of allergy diagnostics.
So I was working in immunology and statistics, and at the same time I was involved with the d.school (design) at Stanford. I always loved that way of thinking: human-centered design, empathy, rapid prototyping, iterative thinking, structured approaches to brainstorming on lots of different ideas. Learning how to fail quickly, and why you failed. I thought that all of those things were important for science in general. At the time, I couldn’t figure out how to work those things directly into my dissertation, but they certainly had a big impact on my thinking. I am thrilled to be co-teaching a new course at the d.school in the Winter quarter called Design for Science, where we explore the intersection of science and design.
Datastream: How did you end up working in data science? Does your interest in design inform the work you do at Greenplum?
Data science encapsulates a lot of those processes. Customers exist on a spectrum in terms of their analytic capabilities, from organizations making a first-ever foray into an analytic project to large academic medical centers with teams of graduate students and post-docs who write their own algorithms and want a platform on which to, for instance, parallelize their work. In the d.school parlance we call these groups extreme users, and it turns out you can learn a lot about ordinary users by focusing on the extremes.
Many of our customers need help brainstorming around the data assets they have, the problems that they’re having in their company, and how they think we can use the data they have, or bring in external data sources in conjunction with their data, to solve some real problem. In the design process, we’d call that stage “empathy,” which seems like a touchy-feely kind of word for a corporate setting, but that’s really what it is: practicing active listening and trying to uncover what the problems are beneath the surface, which are sometimes different from the problems that people say they have. Some of the art in both design and data science is figuring out what people mean versus what people say.
But the biggest draw of data science is the opportunity to work with lots of large-scale datasets in their natural habitat. Sometimes in academia, it’s difficult to get access to datasets: you might have to fill out piles of forms, pay someone to do a database dump, and the whole process could take months. On the industry side, instead of panhandling for data, we have access to interesting datasets generated by large businesses that might not otherwise be available to a researcher. The fun part for me is getting access to those kinds of datasets and helping turn them into insights and actions for an organization.
Datastream: How are you helping CareCore implement data-driven decision-making?
In general, one of the things that I really like about this job is having access to the datasets, but the other thing is the ability to effect change in a real way. Here, the terminology we use is “operationalizing the model,” but in real English what that means is putting the model into a living ecosystem of people and software and measuring the results. At the end of a lab, when you engage with our team, our expectation is to deliver a working predictive model that you can use to make business decisions.
I think that’s pretty cool: we’re not just doing this as a science experiment to show what we can do. We’re actually implementing models in the Greenplum database within our customers’ systems, and allowing them to effect change within their organization in a real and meaningful way.
One of my broader theories about this stuff is that these models shouldn’t be used to make clinical determinations about patient care. We’re not saying we should make decisions based solely on the results of a model. Instead, we think of it as: given the fixed resources in health care, how can we help providers prioritize where to allocate resources, and how can we help them bring relevant data to bear on their decision-making process? I think of these computational approaches as cognitive tools to support decision-making, to surface relevant information at the time of the decision, not as a way to replace human decision-making.