Yesterday was the orientation day of our expedition to Acadia National Park. Today was all about data collection and citizen science. In this field report, I’ll describe our data collection activities, the challenges they involve, and the ways in which we see a climate data lake addressing some of those challenges. I also got to interact more closely with one of our scientists, who is working with Acadia National Park on his research. There is significant potential for a data lake and big data tools to help these researchers focus more on research and less on data wrangling. In one of our orientation sessions yesterday, our principal investigator for Acadia National Park, Dr. Abe Miller-Rushing, pointed out that 50% of their time is spent cleaning up data, which is a lot of valuable time not spent on core science such as studying phenology. While data wrangling is part and parcel of our jobs as data scientists, phenologists and ecologists should not have to spend their time on it. Phenologists like Dr. Abe obtain data from multiple sources, much of it in the form of manually recorded observations. Today, as citizen scientists, we participated in two such manual data collection exercises.
Collecting Data on Bird Migration: Manual vs. Machine-based Image Processing
The first of our exercises was counting migrating birds. Right after breakfast at 8 am, we hiked a mile along the Acadia coast from the Schoodic Research Institute to Schoodic Point. Schoodic Point is on the southern tip of Winter Harbor, and its location makes it an ideal spot for observing bird migrations along the Atlantic. Our task was to work in teams of two (one observer and one note taker) to count sightings of three migratory bird species: the Common Loon, the Northern Gannet, and the Common Eider. We were to count only birds migrating from north to south and to disregard birds that were simply milling around looking for food. Each of us had a pair of binoculars and a telescope to spot birds up to a couple of miles from the coast. We soon realized the challenges in this approach. My ability to tell one bird from another, especially between species that looked very similar, was little better than chance. Furthermore, the freezing temperatures and 20 mph winds made it quite uncomfortable to stay focused with our eyes on the horizon. When we compared the initial results recorded by our five teams, we found considerable variability in the number of bird sightings. We seemed to do better when we worked as a single unit; more pairs of eyes certainly helped. Acadia and Schoodic Institute scientists and ecologists like Seth Benz, our field team leader this morning, routinely record such observations and make them available for research by uploading the data to websites such as ebird.org. While observations from experts like Seth are likely to be much more accurate than those of citizen scientists, we see many ways in which technology, particularly a climate data lake, could help scientists and ecologists like Seth and his team record these bird migration sightings more reliably.
Using Technology to Automate Data Capture and Improve Quality
For example, if we had a network of stationary cameras taking high-resolution images of the horizon every few seconds each day, these images could be ingested regularly into a data lake for analysis. Through image processing and object recognition techniques, we could separate out the blocks representing birds in these images and run a content-based information retrieval (CBIR) engine to match the detected objects against a database of images of migratory birds observed in the region. The timestamped images of detected birds could then be presented in a smartphone app to researchers like Seth, allowing them to override any misclassifications by the CBIR engine. Such a system, powered by a data lake infrastructure, could greatly reduce observer error and cut the time scientists spend on problems that machines can solve through automation. Furthermore, if the data lake is the central repository of such bird sightings, the data will be immediately available to other researchers. Note: At Pivotal Data Labs, we’ve built prototypes of this kind of system as a proof-of-concept of our big data technology and data science applications. You can find more information in earlier blog posts by Pivotal data scientists Gautam and Ailey, such as Content Based Information Retrieval on Apache Hadoop® and Massively Parallel In-Database Image Processing.
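To make the matching step a little more concrete, here is a minimal sketch in Python. It is not the Pivotal prototype mentioned above. It assumes that an upstream step has already cropped each detected bird and reduced it to a small feature vector, and it uses a plain nearest-neighbor search with cosine similarity as a stand-in for a real CBIR engine. The function names, the three-number feature vectors, and the review threshold are all hypothetical values chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_species(query, reference_db, review_threshold=0.9):
    """Return (species, score, needs_review) for the closest reference image.

    reference_db maps a species name to a list of feature vectors, one per
    reference image. Matches scoring below review_threshold are flagged so
    that an expert can confirm or override them.
    """
    best_species, best_score = None, -1.0
    for species, vectors in reference_db.items():
        for vec in vectors:
            score = cosine_similarity(query, vec)
            if score > best_score:
                best_species, best_score = species, score
    return best_species, best_score, best_score < review_threshold


# Made-up feature vectors, for illustration only (not real image data).
REFERENCE_DB = {
    "Common Loon": [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]],
    "Northern Gannet": [[0.1, 0.1, 0.8]],
    "Common Eider": [[0.3, 0.5, 0.2]],
}

species, score, needs_review = match_species([0.12, 0.08, 0.80], REFERENCE_DB)
print(species, round(score, 3), needs_review)
```

The `needs_review` flag is the piece that corresponds to the smartphone app: confident matches could be logged automatically, while uncertain ones would be queued for a researcher to check.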
Collecting Data on Tide Pools
Our second data collection exercise measured the effect of climate change on intertidal ecology. It was led by Hannah Webber, field team leader and education projects manager for Schoodic Institute. Right after lunch we hiked about a mile to Diagon Alley, a tide pool ecosystem of barnacles, blue mussels, and a variety of seaweeds. Again, our task was to work in pairs, this time recording the presence or absence of several species. We used the Point-Intercept method with a quadrat to record our observations at four different marked spots in three different tide pool regions. As with the bird count, I could see how technology could assist and simplify this data collection task. A smartphone app could be developed to take a picture of the quadrat. Then, an image processing program could automatically fill out a matrix of hits and misses for the different species of organisms. This timestamped data, along with the GPS coordinates of the tide pool, could be uploaded to the data lake and made available to researchers worldwide.
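As a rough illustration of what such a record might look like, the sketch below stores one quadrat's hits and misses per species, along with a timestamp and GPS coordinates, and derives percent cover from it. The class name, field names, coordinates, and observations are all assumptions made up for this example; this is not a format used by Schoodic Institute.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class QuadratObservation:
    """One quadrat sampled with the point-intercept method."""
    pool_id: str
    latitude: float
    longitude: float
    recorded_at: datetime
    hits: dict  # species name -> list of booleans, one per intercept point

    def percent_cover(self):
        """Percentage of intercept points at which each species was present."""
        return {
            species: 100.0 * sum(points) / len(points)
            for species, points in self.hits.items()
            if points
        }

    def to_json(self):
        """Serialize the observation so it could be uploaded to a data store."""
        return json.dumps({
            "pool_id": self.pool_id,
            "latitude": self.latitude,
            "longitude": self.longitude,
            "recorded_at": self.recorded_at.isoformat(),
            "hits": self.hits,
            "percent_cover": self.percent_cover(),
        })


# Hypothetical observation: 10 intercept points, placeholder coordinates.
obs = QuadratObservation(
    pool_id="diagon-alley-spot-1",
    latitude=44.33,
    longitude=-68.06,
    recorded_at=datetime(2015, 1, 1, 13, 0, tzinfo=timezone.utc),
    hits={
        "barnacle": [True, True, False, True, True, False, True, False, True, False],
        "blue mussel": [False, True, False, False, True, False, False, True, False, False],
    },
)

cover = obs.percent_cover()
print(cover)
print(obs.to_json())
```

In a real app, the `hits` matrix would be filled in by the image processing step rather than typed by hand, and the observer would only need to correct any points the program got wrong.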
Wrapping Up Day 2
As our second day of the expedition comes to an end, I can see more ways in which big data and a climate-focused data lake could have an impact on climate change research. Research can be greatly aided and automated by technology, allowing phenologists and ecologists to spend more time doing what they are best at: finding scientific meaning in data rather than performing tedious collection tasks. Data science on a data lake could take over much of the heavy lifting and allow researchers to get results and insights more quickly. Tomorrow, we’ll be heading out to Mount Desert Island for our field trip and will return to the Schoodic Research Institute in the afternoon to continue our brainstorming on the climate data lake. I will also blog my thoughts on how data lakes and big data tools, such as those used by Pivotal Data Labs, could assist Earthwatch scientist Dr. Richard Feldman in analyzing spatiotemporal changes in duck abundances.
Editor’s Note: Apache, Apache Hadoop, Hadoop, and the yellow elephant logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.
About the Author