8 “Simple” Guidelines For Data Projects

April 21, 2017 Dat Tran

Advice for building impactful data products

Countless data posts out there will tell you to do things like “harness the cloud” or “run experiments.” The vagueness of these posts is not helpful. You can’t “tip and trick” your way to a successful data product. You have to have the right mindset. I got frustrated reading these posts and decided to write my own, presented not as a collection of tips, tricks, or rules, but as guidelines. Following all of them doesn’t guarantee success, but they might be useful for you…

What follows is a collection of things I have recently observed at client meetings and during project work. This post is inspired by an excellent article by Martin Goodson, “Ten Ways Your Data Project is Going to Fail,” and includes my personal views on many things I currently see in data projects. (That’s one problem right there: thinking in terms of projects instead of products.)

1. Think simple first and then, if it’s really needed, get more complex

Since AlphaGo beat one of the top Go players last year, artificial intelligence (a.k.a. deep learning) has been a hot topic. Today, if you don’t do AI, you are not one of the cool kids. Too many clients ask me how they can use AI to do cool stuff. The problem is that in most cases a simple model is enough. Start with the simple model and add complexity only when it clearly falls short. Don’t overthink it.
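To make “simple first” concrete, here is a minimal sketch of what I mean by a baseline. The toy numbers and the names `fit_line` and `mean_abs_error` are made up for illustration; the idea is only that you measure a trivial predictor and a straight line before reaching for anything deeper.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for one feature: y ~ slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def mean_abs_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


# Hypothetical toy data: hours of machine use vs. energy consumed.
train_x, train_y = [1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1]
test_x, test_y = [6, 7], [12.0, 14.1]

# Baseline 0: always predict the training mean.
mean_pred = [mean(train_y)] * len(test_y)

# Baseline 1: a straight line fitted on the training data.
slope, intercept = fit_line(train_x, train_y)
line_pred = [slope * x + intercept for x in test_x]

baseline_mae = mean_abs_error(test_y, mean_pred)
line_mae = mean_abs_error(test_y, line_pred)
print(f"mean baseline MAE: {baseline_mae:.2f}, linear model MAE: {line_mae:.2f}")
```

If a more complex model cannot clearly beat numbers like these on held-out data, the extra complexity is probably not paying for itself.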

2. Define your data product MVP and release as early as possible

Establish a working end-to-end pipeline early for your data minimum viable product and deploy it. Only data science models that are in production generate real value, even if they are not perfect. So do yourself a favor and create an API-first culture. Many data science teams still focus too much on improving their models instead of seeing the big picture of the problem. That big picture will become clearer if you deploy early. And you know what? Collecting more data will help improve your model for free. 😉
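As a rough sketch of what an API-first MVP could look like, the snippet below puts a placeholder “model” behind a JSON endpoint using only Python’s standard library. Everything specific here is an assumption for illustration: the feature names, the hand-set scoring rule, the threshold, and the port. The point is that the request/response contract exists from day one, and the model behind it can be swapped out later.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

MODEL_VERSION = "0.1.0"  # hypothetical version tag returned with every prediction


def predict(features):
    """Placeholder model: a hand-set rule standing in for a trained model."""
    score = 0.5 * features.get("temperature", 0.0) + 0.1 * features.get("vibration", 0.0)
    return {"score": score, "alert": score > 40.0, "model_version": MODEL_VERSION}


def handle_request(body):
    """Decode a JSON request body, score it, and encode the JSON response."""
    features = json.loads(body)
    return json.dumps(predict(features)).encode("utf-8")


class PredictHandler(BaseHTTPRequestHandler):
    """Answers POST requests with a prediction for the posted features."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        response = handle_request(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(response)


def serve(port=8000):
    """Start the server; blocks until interrupted."""
    HTTPServer(("", port), PredictHandler).serve_forever()
```

Calling `serve()` starts the endpoint. Because `handle_request` is a plain function, it can be tested without a running server, and replacing the rule in `predict` with a real model does not change the contract that consumers depend on.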

3. Establish your target architecture and workload while you go

Many clients ask me about the optimal sizing of the required data environment upfront. They want to know, for example, whether 1.5 TB of memory is enough for their Spark cluster before we even get to work on the problem. I understand they need this for budgeting, but my approach is iterative. I will get a better understanding of what I need while working on the problem. All together now: “Things will change.”

4. Use the right tool for the right problem

Is there anything worse than walking into a heated debate over which tool is better than another? This is typical whether it’s R vs. Python vs. SAS vs. Spark vs. Dask vs. Flink vs. Kafka vs. RabbitMQ vs. Redis vs. Geode vs. TensorFlow vs. Theano vs. Torch and so on… But here’s the rub: there isn’t a single tool that can solve every problem. My answer to this is: always work with the tool that best solves your problem. Tools come and go, so don’t become too infatuated with any particular one. (In particular, I’m looking at you, SAS.)

5. Build things that are meaningful

Sometimes I work with clients who want to build features that don’t make any sense. You don’t need to use word2vec if you are analyzing sensor data! So do your user research before you invest a lot of money and time in building a given feature. If it doesn’t affect or inform the product’s goals, don’t build it.

6. Prioritize the projects with the biggest business impact

Let’s say you have this huge digital/data/big data transformation project with, say, thousands of projects going on at the same time. Where do you focus? You have to define what’s important and validate that assumption before you dive into building data tools and reporting to support a project. Data work takes a ton of time, so it needs to be taken on methodically, not on a whim.

7. Measure your model and improve it from time to time

There is a saying, “Don’t change a running system.” But if you stick to this, then the system may run away from you. I had an engagement where a model had been running for 3+ years before the client realized that the results it produced didn’t match the problem anymore. That’s the problem with data. It changes. And this is especially problematic if you do “select all” in your code. 😉

So recheck your model regularly; you’ll thank me later. A test-driven development culture can also help to mitigate this issue.
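Here is a small sketch of the kind of checks I have in mind, under two assumptions of mine: that “select all” means pulling every column instead of naming the ones you need (as with `SELECT *` in SQL), and that you keep the prediction errors measured at deployment time as a reference. The column names, the function names, and the 1.5x tolerance are all hypothetical.

```python
from statistics import mean

# Hypothetical schema: the columns the model was actually built on.
EXPECTED_COLUMNS = ("sensor_id", "temperature", "vibration")


def select_columns(row):
    """Pick the expected columns by name and fail loudly if one is missing."""
    missing = [c for c in EXPECTED_COLUMNS if c not in row]
    if missing:
        raise KeyError(f"input is missing columns: {missing}")
    # Extra columns that appeared upstream are dropped instead of leaking in.
    return {c: row[c] for c in EXPECTED_COLUMNS}


def needs_review(reference_errors, recent_errors, tolerance=1.5):
    """Flag the model when the recent mean absolute error exceeds
    the reference mean absolute error by more than `tolerance` times."""
    reference = mean(abs(e) for e in reference_errors)
    recent = mean(abs(e) for e in recent_errors)
    return recent > tolerance * reference
```

Run on a schedule or as part of a test suite, checks like these turn “the data changed” from a surprise after three years into an alert after a few days.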

8. Communication + collaboration = 🔑

A data science team I recently talked to told me they have problems getting meaningful work within the company. The problem was that their CIO liked the idea of “big data” but established the data science team as a separate initiative instead of embedding it in the business. They’ve been isolated for two years now. I told them to implement pair programming to break down barriers between different teams. Working in a balanced team will also help a lot.

Here are some useful pieces I’ve read that help me figure out how to approach a data project/product:

Last but not least, fellow data folks, tell me:

What are some guidelines you use to ensure a successful project?


8 “Simple” Guidelines For Data Projects was originally published in Built to Adapt on Medium, where people are continuing the conversation by highlighting and responding to this story.

About the Author

Dat works as a Senior Data Scientist at Pivotal. His focus is helping clients understand their data and how it can be used to add value. To do so, he employs a wide range of machine learning algorithms, statistics, and open source tools to help solve his clients’ problems. He is a regular speaker and has presented at PyData and Cloud Foundry Summit. His background is in operations research and econometrics. Dat received his MSc in economics from Humboldt University of Berlin.


