When Harvard Business Review published its now-famous article on data scientists in 2012, everything in the world of big data was still fresh, shiny and new. There were no schools where the eager masses could pursue a graduate degree and emerge from their chrysalis ready to tackle the world. Fast forward to 2018, and thousands upon thousands of newly minted data scientists are coming into the industry each year, many of whom are largely unprepared for how messy the real world of data is. If they have the training, what’s the problem?
It was all so easy in school
In school, academic exercises tend to be fairly neat: the data is smallish (by modern standards) and the focus is mainly on the modeling part of the lifecycle, the fun part. The figure below shows the full data science process, with modeling alongside the other important steps of building a machine learning model.
In most cases, modeling makes up only about 20 percent of the overall work in a project. What about the other 80 percent? A data scientist often does more than just modeling, but many of the remaining tasks fall to another professional: the data engineer. Now we need some definitions. What does a data scientist do? What does a data engineer do? What is the overlap, and do I need both?
You would think that defining the most popular career path in the world would no longer present a problem. Not so! In his epic post, “The Battle of the Data Science Venn Diagrams,” David Taylor provides a history of the attempts. I agree that Conway’s combination of Hacking Skills, Math & Statistics, and Substantive Expertise (aka business knowledge) sums it up pretty well, especially if you add in Communication, contributed by Stephen Kolassa in 2015. It’s often the hacking that drives the need for a data engineer.
Data engineers, as KDnuggets put it in 2017, “are less concerned with statistics, analytics, and modeling as their data scientist/analyst counterparts, and are much more concerned with data architecture, computing and data storage infrastructure, data flow”, etc. Take another look at the earlier Team Data Science Process (TDSP) diagram: data engineers focus mostly on the gray and blue items (data acquisition and deployment). In a nutshell, data engineers go to extraordinary lengths to help data scientists spend more time modeling.
In a high-functioning, mature team, this relationship can become almost symbiotic, and more specifically mutualistic. In biology, mutualism is a class of symbiosis involving a long-term relationship between individuals of different species in which both benefit. That is not far off here. The relationship between data scientists and data engineers is mutually beneficial because it allows each to focus on their core strengths, so the final result is better than either could likely have accomplished alone.
While I won’t specify who is the goat and who is the bird, this blog series will detail a very real partnership within the CSEO data team at Microsoft between the data scientists and data engineers who are helping to drive our digital transformation story. Stay tuned for the next installment, titled “You want what?!”, where we will discuss the ups and downs of delivering data science on a real project.
Learn more about the company’s approach to Data and AI by checking out the case studies and articles from the IT Showcase team here.