All Things Pivotal Episode #5: Interview with Ailey Crow at the Strata + Hadoop Conference

November 5, 2014 Simon Elisha

Data science is an incredibly interesting pursuit, and the Strata + Apache Hadoop® World conference is right at the epicenter of that. To quote the organisers:

“Strata + Apache Hadoop® World is where cutting-edge science and new business fundamentals intersect—and merge. It’s a deep-immersion event where data scientists, analysts, and executives get up to speed on emerging techniques and technologies by dissecting case studies, developing new skills through in-depth tutorials, sharing emerging best practices in data science, and imagining the future.

“Strata + Apache Hadoop® World is now the largest conference of its kind in the world, yet it’s kept the informal, collegial spirit that makes it one of the best places to connect and collaborate.”

Naturally, Pivotal Software had a significant presence at the conference, and even ran a “Data Jam” for attendees.

On this week’s podcast Simon speaks “live to tape” with Ailey Crow, Senior Data Scientist, about the conference, the “Data Jam” Pivotal held there, and how she became a Data Scientist.


Transcript

Simon:
Welcome to the All Things Pivotal podcast, the podcast at the intersection of agile, cloud and big data. Stay tuned for regular updates, technical deep dives, architecture discussions and interviews. Please share your feedback with us by emailing podcast@pivotal.io.

Hello everybody and welcome back to the All Things Pivotal podcast. Fantastic to have you back. This is episode number 5 and a special interview episode today. We’re coming to you quasi-live, I guess, recording live from New York City. I’m not in the beautiful Big Apple, although my colleague Ailey Crow is. Ailey, welcome to the show.

Ailey:
Great. Thank you very much, Simon. Great to be here.

Simon:
Ailey is a senior data scientist and has been attending Strata and Hadoop World this week. We’ll be learning all about what she’s been learning which is pretty exciting. A little bit about the conference. This is a conference that takes place each year in New York City. A few other cities as well, but New York City is the big gathering.

It’s billed as where cutting edge science and new business fundamentals intersect and merge. A deep immersion event where data scientists, analysts and executives get up to speed on emerging techniques and technologies by dissecting case studies, developing new skills through in-depth tutorials, sharing emerging best practices in data science and imagining the future. Sounds pretty cool.

It is actually the largest conference of its kind in the world, but it’s still a little informal, a fair bit of collegial action, etc. We don’t have to take the written word’s word for it. We can ask someone who was there. Ailey, how has your experience been so far at the conference, just from a high level, in terms of the field, the types of people you’ve been talking to, etc.?

Ailey:
Sure, good. Again, this is Strata Hadoop World, so definitely a lot of emphasis on Hadoop itself. Sort of runs the gamut in terms of who’s here, from people who have barely just heard of Hadoop and are interested in hearing more about it, to absolute experts. That leads to a lot of interesting conversations, both in and out of talks, of people exchanging expertise and problems and challenges and solutions.

Simon:
For sure. It’s always great if you can learn from other people’s mistakes, I find, so that’s probably …

Ailey:
Definitely.

Simon:
… a big part of the conference as well. Ailey, let’s talk a bit about yourself firstly. You’re a senior data scientist at Pivotal Software. Take us through the journey to this current role. Give us a feel for where you came from, how you got here.

Ailey:
Sure. I’m not sure I have a typical path, although I’m not sure there is a typical path for data science at this point. My undergrad degree was in chemical physics from Brown University in Providence, Rhode Island. Then went on to do my PhD in biophysics, focusing a lot on microscopy actually, so different types of microscopes and some signal processing mixed in. All applied to the field of cell mechanics, so biology at the cellular level. I ended up after that doing an internship that then led to a position at Genentech, which was the first biotech ever, based in South San Francisco.

Simon:
Wow.

Ailey:
Running a lab of microscopes there for the basic research division, but spending most of my time doing image analysis. A lot of programming and generally automated analysis of images in the biomedical field. That launched me straight into learning, into machine learning, so got good experience in that field and what do you know, I guess that made me qualified to be a data scientist, so I was lucky enough to end up at Pivotal where I focus on life sciences and health care within the data science team.

Simon:
Fantastic, so using data science for the good of society really is the key there.

Ailey:
I’d like to think so. Great.

Simon:
One of the challenges of working at a company like Pivotal, particularly for myself, is being one of the least smart people in the company, compared to all of the absolutely brilliant people we have here.

Ailey:
I think we all have different expertise is the way I like to think about it.

Simon:
It’s quite daunting. It’s a nice background there. It’s a tough one to compete with. Ailey, this week, you’ve been obviously at the conference now, but I believe a big part of what you’ve been doing has been participating in something called the data jam. Tell us about what that is.

Ailey:
Sure. The data jam session took place today and was a panel including myself and Ryan Peterson from Isilon. Was essentially a nice intimate small group where we could essentially just discuss the current challenges and answer questions from a small group of people who are interested in hearing a little more about work done by the federation and particularly, Pivotal and Isilon.

Had a fair number of people from the financial sector, but also a number of people who just had questions about HAWQ and about MADlib and what exactly are these products and what are they used for and how do I actually get into data science and what are the current trends in the field and things along those lines. Some interesting questions during the session and then a lot of interesting discussions following the session as well. People had more specific questions.

Simon:
For sure. I guess people in conferences like to ask the more general question and then buttonhole you afterwards and get into the weeds there.

Ailey:
Yes, you’ve got it.

Simon:
Was there a particular theme that came out of that conversation that really struck you as being unusual or likely to be interesting to a lot of people?

Ailey:
I think what I found most interesting were some of the modeling questions I was asked afterwards, but that’s probably less interesting to a wider audience. Let’s see, I’m trying to think. For a wider audience, one question that was raised that I thought was particularly insightful is the question of how you get buy-in across an organization for data science.

Often you have, say, the marketing team that’s interested or the IT team that’s interested, but how do you get an entire organization to consider data science as something that’s actually useful and so that you can then get the resources required to do that. I thought that was a particularly insightful question because that’s something that we see all the time that I didn’t necessarily expect someone who wasn’t dealing with data science every day to expect or to understand.

Simon:
It is an interesting combination, isn’t it, because data scientists need … Well, firstly, it’s a very specific discipline itself, so it needs a particular background and data scientists, from what I can tell, need a very specific set of tools or quite a broad range but particular set of tools and capabilities, but then have some options in terms of how they can be deployed or accessed.

Often, I see people who are saying, ‘Well, we’ve got the data science capability,’ and you got some poor data scientist with a server under their desk desperately trying to model as much as they can with the RAM they have available, versus IT going out and building a full-fledged data lab, lots of CPU cores, lots of storage, etc. but no one to actually take advantage of it.

Ailey:
Right, exactly. Exactly.

Simon:
It’s bringing the two together that is important and having a strategy around that, isn’t it?

Ailey:
Right. Something that we’ve done in the past which has been fun for us and I think for the clients as well is to host a … It’s a confusing terminology here, but a data jam. It’s wherein … This was more of a week or at least 4 days of training with anyone at the company who’s interested who had some level of programming background, but followed by a 48 hour data jam where we brought in some external data sets of interest and the clients could bring in some of their own data if they were interested and essentially just have people hack away with the tools that they have just learned.

Ideally actually draw some insights out of those 48 hours to prove that, yes, they can use the tools even with minimal training, 4 days of training and that they can draw insights very quickly from the data. This is something that we’ve had pretty great success with. We ran a data jam at a biotech over the summer and were able to actually draw scientific insights out of 36 hours of data jam.

Simon:
Wow.

Ailey:
It’s been a lot of fun.

Simon:
That’s fantastic. It just shows the power, that if you bring the right people and the right data together, with the right processing, you actually can find those answers. It’s not so hard, but getting all those pieces together can be hard, I think.

Ailey:
Right.

Simon:
That’s a fantastic story. You’ve been bouncing around that conference. I know it’s a pretty busy week. I’m amazed your voice is lasting so well actually. What’s the most interesting thing you’ve seen?

Ailey:
Oh gosh. A lot of fun talks. I think some of my favorites have been different … talks by data scientists from different organizations that you may or may not expect. There was actually a great talk from [hood 00:08:43], who was the director of data science and data analytics at Etsy. Etsy is a forum where people can sell homemade goods or homemade arts and crafts. It’s incredibly popular. But it hadn’t occurred to me that they were actually running analysis on their clickstream data and on their transactional data in the background and doing some really interesting recommendation models out of that. That was fun to hear about. Another fun one was Spotify, which is a music [crosstalk 00:09:14].

Simon:
The music folks, yeah.

Ailey:
Listening, yes, [working 00:09:15], exactly. Another is a recommendation system but they’re doing all kinds of fun things with feature generation out of the actual audio and not only recommending music and finding better ways to recommend music to listeners but also looking at, for example, can you predict political affiliations just based on the type of music that people are listening to and other fun things like that.

Simon:
Wow. They’re looking for really, more unusual, insights.

Ailey:
Yes, exactly, so interesting insights from all kinds of domains and all kinds of different organizations. That’s been a lot of fun.

Simon:
That’s great. Certainly many of those organizations that were born online have a very keen focus on understanding their data and operationalizing that data and using it for meaningful things. It’s so valuable and so interesting [crosstalk 00:10:02].

Ailey:
Yes and they actually have the data, which helps.

Simon:
Having the data is step 1, isn’t it?

Ailey:
Definitely.

Simon:
Speaking of insights, you work with a lot of our customers, etc. and I’m not asking you to name names because I know a lot of these things are sensitive, but what are some of the best insights you found for customers recently in particular domains? What’s a discovery you made and thought, ‘Oh, that’s pretty cool’?

Ailey:
I think some of my favorite insights have been cases where we’re really able to enable customers to do their own work, in some sense. Ideally, if we’ve been successful in an engagement, we can walk away and they will continue to run models and create features and draw insights on their own.

Actually I think my favorite is one that I’ve already mentioned, is this data jam that we were able to host at this biotech company, wherein we were able to work with them to take a process that they’d already developed but were running not in an optimal way, help them get it up and running on the Pivotal platform, such that we could speed it up over 360-fold in the case of one particular step and actually get novel scientific insights for them that could then be fed into their new drug pipeline or lead to some actual real results that could help cure diseases. In 36 hours, I thought that was pretty good.

Simon:
That’s very good. It’s interesting because a lot of this search for insight is exploratory.

Ailey:
Yes.

Simon:
It’s kind of, I’m going to try something, now try something again, now try something again. If every time you try something, as you say, it’s tens of hours or days or what have you, if you can squeeze that down to minutes or seconds, that totally changes the domain and the way you do things. Also, your attention span. It’s hard to be focused on a problem if, now I’m going to try this, walk away for 2 days, come back and continue where I left off.

Ailey:
Absolutely.

Simon:
Our brains don’t work that way, do they?

Ailey:
Right, no. If you can essentially reduce the cost of asking questions, it allows you to be more creative and get to insights and answers more quickly and therefore do more iterations and improve models and be more creative with your solutions.

Simon:
Fantastic. I know data scientists are all about the scientific method of being very precise and very evidence-based, but I’m going to ask you a less science-y question.

Ailey:
Sure.

Simon:
What are your favorite data tools and what parts of the Pivotal software are you using most frequently? I want to see if you’ve got some favorites in there.

Ailey:
Sure, sure. I’m going to give you my non-answer first, which is that, what I actually like best about both the Pivotal and open source tools, is the flexibility. The ability to, given the problem, find the right tool or the optimal tool to answer your question. Having said that, of course, I have my preferences.

Within the Pivotal stack, mostly I’ve been using a lot of GPDB and HAWQ recently. The great thing about that is if you’ve developed code in one, it ports very nicely over into the other, so you can use a lot of the same methods and codes and queries that you’ve developed in GPDB also in HAWQ. It’s just that your data is coming from HDFS instead of from the database. That’s been great and you can implement MADlib in both, also fantastic.

Simon:
For sure. We’ll link to information about MADlib and GPDB and HAWQ, of course, in the show notes if you want to dive into those in some more detail. Now, we talked a little bit about some of the challenges of marrying together data science and infrastructure, etc. and it’s interesting that attendees at the conference were raising that. I actually had that specific question to talk about with you. You’ve seen this done probably well, you’ve seen this done maybe less than well. What advice would you give to companies who are looking to have a data science capability?

Ailey:
Right. I think that it’s tricky to give one answer to that because it really depends on where that company’s coming from. If, for example, they already have some talents that they can leverage, versus really, starting in the dark where they know this is important but don’t necessarily know how to go about it [crosstalk 00:14:10].

Simon:
Would it be really understanding where you’re starting from, like being realistic about the starting point? Is that really what we’re getting to here?

Ailey:
Sure, absolutely, so it definitely helps to know if you have that talent already within your organization. Something, for example, like a data jam can be helpful there in that you can actually let people loose on some data and see what you have there. That can also help with, as we were mentioning earlier, with the buy-in, of demonstrating the value of data science across the organization to acquire more and more resources. I think that that is a big part of it, understanding where exactly you are. Talking to a number of other organizations that are also doing something similar, have done something similar, I’m sure could be helpful there.

Simon:
For sure, for sure. Now, it’s interesting that the term data scientist has only really emerged in the last 2-3 years into the public consciousness. Many people are sort of like, ‘What is a data scientist?’ But the other thing is people say, ‘Well, how do I become one? This sounds cool and interesting and something I might like.’ What advice would you give to someone wanting to become a data scientist? What should they be doing?

Ailey:
Right, sure. I guess a couple of things. The first of which is just get your hands dirty. Go out and find some data. There’s plenty of public data out there and just try it out. Play around with some algorithms. Play around with different ways of doing analysis, different visualization, draw some insights. Just see what you like. I think that’s the number 1 thing, is just go for it, try it out.

Simon:
Actually do it.

Ailey:
Right, actually do it. There are plenty of ways of doing that, including looking at, I think, data.gov, from the US government, has a number of different public data sources. DataKind is an organization that works with nonprofits and has meetups where you can go in and work with other people on different data sets, etc., and for a great cause. I think that would be thing 1. Go ahead and try it out. Get your hands dirty.

Other than that, I’d say, sort of a two-part answer is, definitely brush up on your machine learning and there are a lot of great online courses for that, but don’t forget where you come from. Essentially whatever domain knowledge you already have or whatever expertise in a given field is going to really be something useful and you should use that to your advantage.

For example, I came from biotech and that’s really an area that I feel comfortable in and can leverage what I already know towards the data science work that I’m doing. If you aren’t coming from a particular background, you can find something you’re passionate about and learn something about health care or finance or green energy or whatever it might be and then really leverage that, because it really helps to know a particular field so that you actually know what the challenges are and the terminology before you try to solve the problem. Make sure you’re actually solving the right problem so that your results will actually get used.

Simon:
For sure. I think in many cases understanding what the actual problem is is the biggest challenge …

Ailey:
Absolutely.

Simon:
… rather than solving it.

Ailey:
Mm-hmm (affirmative). Sure.

Simon:
Get that domain knowledge happening is the big tip from Ailey there. Sensational. Ailey, I know that you’re in a conference room. You’re likely to be barged in on at any moment, so what we’ll do is we’ll wrap it up there. Thank you so much for joining us on the All Things Pivotal podcast. It’s been fantastic to have you with us.

Ailey:
Great. No. Thank you.

Simon:
Thank you everyone for listening. Again, if you’re enjoying the podcast, we’d like to hear from you, podcast@pivotal.io. Please let others know that the podcast exists. It is very brand new and shiny, so we want to get the word out. Until next time, keep on building. See you later everyone.

Thanks for listening to the All Things Pivotal podcast. If you enjoyed it, please share it with others. We love hearing your feedback, so please send any comments or suggestions to podcast@pivotal.io.

Editor’s Note: Apache, Apache Hadoop, Hadoop, and the yellow elephant logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.

About the Author

Simon Elisha is CTO & Senior Manager of Field Engineering for Australia & New Zealand at Pivotal. With over 24 years of industry experience in everything from mainframes to the latest cloud architectures, Simon brings a refreshing and insightful view of the business value of IT. Passionate about technology, he is a pragmatist who looks for the best solution to the task at hand. He has held roles at EDS, PricewaterhouseCoopers, VERITAS Software, Hitachi Data Systems, Cisco Systems and Amazon Web Services.
