Scientists around the world are performing experiments and doing analysis with a focus on investigating the nature of Climate Change.
The Big Data vs. Climate Change program is a joint effort by EMC Corporation, Pivotal and the Earthwatch Institute. It enables the study of interactions between nature and climate, and promotes the engagement of citizen scientists using data lakes, analytic tools and visualizations.
In this episode Simon is joined by Vatsan Ramanujan who is a Principal Data Scientist at Pivotal. Vatsan shares some insight into the work that was done, and some interesting stories from “out in the field”.
Episode #24
- Subscribe to the feed: http://pivotalsoftwarepodcast.libsyn.com/rss
- Visit http://pivotal.io/podcasts for show notes and other episodes.
- Feedback: firstname.lastname@example.org
- Links Referred to in the Show:
- Live Visualisation: http://www.emc.com/big-data/insights.htm
Welcome to the All Things Pivotal podcast, the podcast at the intersection of agile, cloud, and big data. Stay tuned for regular updates, technical deep dives, architecture discussions, and interviews. Please share your feedback with us by emailing email@example.com.
Hello everybody and welcome back to the All Things Pivotal podcast. Fantastic to have you back for another episode. My name’s Simon Elisha, CTO and director of field engineering here in beautiful Melbourne, Australia. I’m joined here by a very special guest today, I’m joined by Vatsan, who is a Principal Data Scientist here at Pivotal. Good day, Vatsan, thanks for joining us.
Thanks, Simon. Good to meet you.
Good to meet you. Vatsan has been doing some really interesting work of late that we wanted to share with you and have him talk it through. What I’ll do is I’ll set the scene just at the high level, but I’m going to let Vatsan do most of the talking, because he’s the brains of the operation, I’m just the voice, so it makes him do the work. What we’re talking about today is something called the big data versus climate change program, and this is a joint effort undertaken by EMC Corporation, Pivotal, and the Earthwatch Institute. The goal is to enable the study of interaction between nature and climate and it promotes the engagement of citizen scientists using data lakes, analytic tools, and visualizations. In the show notes I’ll share with you some really cool websites and visualizations that you can play with yourself.
I guess, Vatsan, what I want you to do is to really peel it right back and start at the start. Often when we’re talking about big data and data lake, everyone gets excited about the technology and how you can use it and all that sort of stuff, but really, any initiative of this nature actually starts with a scientific problem or a business problem, something someone wants to solve, so maybe give us some context about the problem domain you were looking at, you were given, and what you were trying to solve for.
Sure thing, great question. Climate change is something which most people do not debate. It’s a fact, and a lot of people understand the causes of climate change, the impact of climate change, but what a lot of people don’t think about when they are talking about climate change and the problems it poses is how citizens are contributing towards our understanding of climate change and its effect on plant and animal species. Now, an interesting aspect to this whole climate change study is that there have been hundreds of thousands of volunteers dedicating their time on a voluntary basis to help real scientists, climate scientists or ecologists, by collecting data which helps us in improving our understanding of the effect of climate change on plant and animal species. These citizens are called citizen scientists, and some of the activities that they might engage in are things like counting the number of migrating birds of a certain species at a particular location.
We went to Acadia National Park in the state of Maine, it’s one of the most eastern points in the contiguous United States, where along with the scientists at Acadia, the scientists from Earthwatch Institute, and EMC and Pivotal employees, we saw what it meant to be a citizen scientist. For the five days that we were there, we spent two days essentially collecting data, acting as citizen scientists so that we could understand some of the problems and challenges that they face, and what we could do to improve their experience. If you think of a typical citizen scientist, he or she probably does not have a lot of experience or expertise in climate change itself. They are willing to volunteer their time in assisting the data collection efforts, and once they’ve collected this data, a lot of it is just painful manual work, the data gets uploaded to a whole host of databases, so for example, the bird-related data goes to a database called eBird, and of course there’s climate-related data and a bunch of different such data sources where the data gets stored.
They don’t really know what happens to that data. Did it really contribute toward somebody’s research? Did it really help us in improving our understanding of climate change? They don’t know that, so our goal is to set up a data lake which will house all data related to climate, weather, changing patterns of seasons, the appearance and disappearance rate of different plant and animal species, and provide a portal of visualizations which will allow citizen scientists to drill down and understand the effect of these different variables on the disappearance rate of animal species, for instance. What that would essentially do is complete this feedback loop where citizen scientists can understand the true value of the data collection efforts, and also assist ecologists and scientists in their models by making a whole host of tools available for them to do that kind of work.
Fantastic. It sounds like it’s a classic case of where you’ve, if you like, gone into the trenches to understand what people are doing with this kind of work. They’re collecting data in, if I can use a traditional word I think of, we get very focused on Internet of Things, and devices, and sensors, and what have you, whereas, from what you’re saying, this is good old fashioned people getting into the field, making observations, recording those observations, making subsequent observations, and comparing those observations, which is kind of the core of observational science, really, isn’t it?
Exactly. As a data scientist, for me, some of these methods seemed primitive, but then I really started understanding their challenges and I could see ways in which technology could address some of these problems. As you mentioned, a lot of the data collection effort by citizen scientists is extremely manual. In the two days that we spent in the field, we had a paper, a pad, and we had a pencil, and we essentially were looking for the migrations of three different bird species at Schoodic Point, which is a very prominent observation point at Acadia National Park. From there you can see migratory birds going north or south across the Atlantic. We would split out into teams and we would each call out that, “Hey, I see a gannet,” or, “I see a different bird species,” and then everybody would make note of that observation, and sometimes just to minimize variance, we’d have an independent group also tracking the count of these different bird species at different fifteen minute intervals so that then we can compare our results and determine what is the true number of birds that actually migrated.
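The comparison between two independent counting teams described above could look something like the following. This is only a rough sketch: the tallies, the `reconcile` function, and the 15% disagreement threshold are all invented for illustration and are not taken from the project.

```python
from statistics import mean

# Hypothetical tallies: birds counted per fifteen-minute interval by two
# independent teams watching the same point. The numbers are made up.
team_a = {"06:00": 42, "06:15": 37, "06:30": 51}
team_b = {"06:00": 40, "06:15": 45, "06:30": 50}


def reconcile(a, b, tolerance=0.15):
    """Average the two counts per interval and flag large disagreements.

    tolerance is an assumed threshold: the relative gap between the two
    teams above which an interval is marked for review.
    """
    result = {}
    for interval in sorted(a.keys() & b.keys()):
        x, y = a[interval], b[interval]
        estimate = mean([x, y])
        gap = abs(x - y) / estimate if estimate else 0.0
        result[interval] = {"estimate": estimate, "review": gap > tolerance}
    return result


summary = reconcile(team_a, team_b)
for interval, row in summary.items():
    print(interval, row)
```

Averaging is the simplest possible way to combine the two tallies; intervals where the teams disagree strongly are flagged rather than silently averaged away.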
Likewise, we were in the field counting barnacles, so it’s a certain organism which you find on rocks, especially in inter-tidal pools. They use this method called the point-intercept method, where they use this physical device which is called a quadrat, it’s shaped like a square and it’s got a series of wires, both horizontally and vertically, and you place it at a fixed point in an inter-tidal pool region, and you count the number of barnacles at each horizontal wire and vertical wire intersection point. They repeat this observation every six months or so, and that way they’ll know if the barnacles are decreasing in numbers or increasing in numbers. This was really involved manual work, and it is not a good use of scientists’ time when they should be studying more challenging problems rather than spending their time in manual data collection.
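As a rough illustration of the point-intercept idea, the wire grid can be treated as a table of intersection points, each recording whether a barnacle was found there, with two surveys compared over time. The 4x4 grid size and all readings below are invented; real surveys may record counts or species per point rather than a simple yes or no.

```python
# Each survey is a grid of wire intersections on the square frame:
# True means a barnacle was found under that intersection.
# The 4x4 grid size and the readings are invented for illustration.
survey_spring = [
    [True,  True,  False, True],
    [False, True,  True,  True],
    [True,  False, True,  False],
    [True,  True,  False, True],
]
survey_autumn = [
    [True,  False, False, True],
    [False, True,  False, True],
    [False, False, True,  False],
    [True,  False, False, True],
]


def cover(grid):
    """Fraction of intersection points where a barnacle was recorded."""
    points = [cell for row in grid for cell in row]
    return sum(points) / len(points)


spring, autumn = cover(survey_spring), cover(survey_autumn)
change = autumn - spring
trend = "increasing" if change > 0 else "decreasing" if change < 0 else "stable"
print(f"spring={spring:.4f} autumn={autumn:.4f} trend={trend}")
```

Because the frame is placed at the same fixed point each time, the change in the fraction of occupied intersections gives a simple indicator of whether the population is growing or shrinking.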
We saw a lot of avenues where data science could have equipped citizen scientists to better collect data to minimize the variance in the observations. For example, we noticed that when we split into a couple of groups to count bird migrations, first off, we had to see what a bird looked like and, based on our knowledge of different bird species, we had to identify if it is a gannet or if it is another bird species. We noticed that for the few of us that had literally no knowledge of ornithology or different kinds of birds, our numbers were as good as a random guess, pretty much, and the expert bird counters could count those birds really accurately. That is not really a human being’s specialty, right?
Imagine if we could take high resolution images for every two minute intervals at a certain region in Acadia National Park. Then, by doing image processing algorithms on those images, we can segment regions which depict birds, and then you could use a classifier to essentially say if a certain image belongs to bird species A or bird species B, and you could have hundreds of thousands of such images collected every day hosted in the data lake. A machine learning algorithm could crunch through this in no time and tell you how many bird migrations actually happened, and the variance is going to be fairly … You’re going to have very low false positive rate compared to the way humans were collecting data. Once you have these bird counts, then scientists can go and build models to forecast what would be the expected number of birds in the next year, so on and so forth, given certain climate variables. That is a more valuable use of human beings’ time compared to staying for six hours at a particular point, especially in a cold region such as Acadia National Park, and having to count these birds.
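A toy version of that segment-then-classify pipeline is sketched below. It is not the project's method: the "image" is a tiny hand-written grid, segmentation is a plain flood fill over dark pixels, and the `classify` function is a placeholder that labels regions by size where a real system would use a trained classifier.

```python
from collections import Counter

# A tiny stand-in for a photo of the sky: 1 marks a dark pixel, 0 is background.
# The values are invented for illustration.
frame = [
    [0, 1, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 1, 1, 1, 0],
]


def segment(image):
    """Return connected regions of dark pixels (4-neighbour flood fill)."""
    rows, cols = len(image), len(image[0])
    seen = set()
    regions = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or (r, c) in seen:
                continue
            stack, region = [(r, c)], []
            seen.add((r, c))
            while stack:
                y, x = stack.pop()
                region.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and image[ny][nx] == 1 and (ny, nx) not in seen:
                        seen.add((ny, nx))
                        stack.append((ny, nx))
            regions.append(region)
    return regions


def classify(region):
    """Placeholder for a trained classifier: labels a region by pixel area."""
    return "species_a" if len(region) >= 4 else "species_b"


counts = Counter(classify(region) for region in segment(frame))
print(dict(counts))
```

Running the same two steps over every image collected in a day, and summing the per-image counts, would give the kind of automated tally described above.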
For sure. It’s a great example of using humans for what they’re meant to be doing and machines for what they’re meant to be doing.
In terms of the type of data gathered and the volume of data gathered, can you give us a sense of … You mention we’re getting human based readings and also some automated feeds, etc. In the data lake, what sort of volumes of data are we talking about?
Sure. The human collected data is not a lot, so you have four or five full time volunteers, for example, they call those bird counters, in Acadia National Park, and most of them know each other because it’s a small community, and they take turns counting birds at different days of the month, so that’s really not a lot. You might probably see, in a one hour time interval, some three hundred or four hundred different birds having migrated when observed from a particular point. These volunteers spread out and count these birds at different points along the Acadia coast, so you have this interesting situation where you have to ensure that people are not double counting birds, that a bird which appeared in one region at time X is not counted again in a different region a couple of minutes later. There are all these issues involved in showing that the data is valid, so manual data collection is not a huge size, but we do have data in the form of weather, so this could be in minute intervals, hour intervals. A lot of that data is hosted on this website called NOAA, so you do have access to that. Pretty much, these are the two major data sources.
By having those data sources ingested into a data lake, what does that platform enable you to do as a data scientist, as a citizen scientist, as a casual observer, that you couldn’t do if you had, let’s say, just thrown it into a traditional monolithic SQL-type database or some other sort of a typical proprietary platform?
Sure thing. If you look at what the data lake enables, it essentially democratizes the data collection process and also the data availability. Now, these different data sources that I mentioned, each of those is hosted on different servers. For instance, if a scientist wanted to study the effect of climate change on bird migration, they first have to go to, say, the NOAA’s website and pull out a sample of data about weather at a given region for a specific time period, then they have to retrieve the corresponding data about bird migrations from the eBird website, then they have to build whatever statistical models that they’re trying to build to make projections and predictions. That is quite an involved process, especially because of the number of sources and people you have to interact with in order to get the data. A lot of valuable time is wasted in getting this data before people can start building models, and sometimes people are even concerned as to why a certain request for data is being made. Perhaps they need attributions, they’re concerned about the motives behind why the data was requested. Is it for research, is it for some commercial purpose, so on and so forth.
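Once both sources have been pulled down, the step that precedes any modelling is lining up weather and bird counts for the same day and place. A minimal sketch of that join follows; the field names, dates, and values are invented and do not reflect the actual NOAA or eBird schemas.

```python
# Hypothetical records; the field names and values are invented and do not
# reflect the actual NOAA or eBird schemas.
weather = [
    {"date": "2014-10-01", "site": "Schoodic Point", "temp_c": 9.5},
    {"date": "2014-10-02", "site": "Schoodic Point", "temp_c": 7.0},
    {"date": "2014-10-03", "site": "Schoodic Point", "temp_c": 11.0},
]
bird_counts = [
    {"date": "2014-10-01", "site": "Schoodic Point", "species": "gannet", "count": 120},
    {"date": "2014-10-02", "site": "Schoodic Point", "species": "gannet", "count": 85},
]


def join_on_date_and_site(weather_rows, count_rows):
    """Inner join: attach the day's weather to each bird count record."""
    weather_by_key = {(w["date"], w["site"]): w for w in weather_rows}
    joined = []
    for row in count_rows:
        match = weather_by_key.get((row["date"], row["site"]))
        if match is not None:
            joined.append({**row, "temp_c": match["temp_c"]})
    return joined


combined = join_on_date_and_site(weather, bird_counts)
for row in combined:
    print(row)
```

The joined rows are what a statistical model would then consume; days with weather but no bird count are simply dropped by this inner join.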
A data lake would help mitigate this problem by making sure that there’s a single repository where you have both structured data and unstructured data. For instance, observations made by people could be in the form of text. That’s unstructured data, and typically traditional databases are not the right place to store such textual data. Data about weather is fairly structured, and traditional databases can store it, but again, then you have to deal with the problem of scale, because weather is very location-specific, so depending on which location’s weather information do you want and for what historical period, you might have a hard time getting quick access to that data.
A data lake, by making sure that all of this is housed in a single location, and you have access to some big data crunching tools, be it for machine learning or even for simple things like data cleansing, joining different data sources, that would empower scientists to spend more time doing important things than doing tedious manual work. They could build models faster, they could clean data faster, and they could house the results in the same data lake. Thereby, if they were to build an application, that application can consume the models that they’ve built and show it to people who don’t necessarily understand data or statistics, but can understand visualizations. It makes climate change and its effect on phenology accessible to the masses.
For sure. It’s interesting, that combination of the challenge of storing large amounts of data, processing unstructured data with structured data, etc., as you said, and then consequently visualizing that is often the only way that it’s made accessible or real to people, and you’ll see on the website that we’ll link to a great example of that in terms of the migratory patterns that we’ve been speaking about. Vatsan, I’d really be interested to hear what was the most surprising or interesting thing you learned during this project.
That’s a good question. I would say the most surprising and interesting thing that I learned through this expedition is things that you take for granted. The real citizen scientists spend a lot of hours on the field and they have a lot of commitment to this cause. For instance, we worked with this great gentleman called Seth Benz, who is a bird counter, and he’s part of Earthwatch. He’s been collecting data for tens of years of bird migrations, and he goes to this particular point called Schoodic Point at Acadia National Park, no matter what the time of day it could be. Sometimes he’s there at 6am, sometimes he’s there at 5am when the sun rises, no matter what the temperature may be. It could be sub-zero, he’s still there and he is a solitary bird counter.
He spends an hour or so with his log book, he counts the number of bird migrations, and then he goes and uploads this data to the eBird website. If we could give Seth and other bird counters like him a tool which makes his life easier, he’s so committed to the cause that he might better spend his time doing work which cannot be solved by a machine, that would be an amazing contribution. It first off would empower people like him to be more effective in the cause that they care about.
I think that’s fantastic, and I think that’s, again, that classic learning of taking the experience that you often have sitting in an office thinking about a problem domain and making it real by going and seeing the people at the workplace and then changing that entire workflow, that’s where efficiencies and improvements happen. Vatsan, thanks so much for sharing that insight and a little bit of detail around that project. We really appreciate you coming on the podcast and sharing a little more color around what data science is all about and what it can do for society. Thank you so much, Vatsan.
Fantastic, and thanks everyone for listening again. As ever, weekly podcasts coming up, lots more topics on the way. We do love to get your feedback, so please do share it with us, firstname.lastname@example.org. Until next time, keep on building. Thanks for listening to the All Things Pivotal podcast. If you enjoyed it, please share it with others. We love hearing your feedback, so please send any comments or suggestions to email@example.com.