Data Scientist, Data Engineer or Parallel and Distributed Algorithmist?

Written by Wilfried Kirschenmann, on 06 February 2018

In recruitment, I often wonder what would lead me to advise a candidate to position themselves as a data scientist, a data engineer, or a developer specializing in parallel and distributed algorithms.

On paper, these profiles are quite similar: all three require skills in development and algorithms, statistics and applied mathematics, and data handling. The required skill levels do differ from one profile to another, as illustrated in the table below.

Competency/Metrics Chart

However, young candidates still have much to learn and will gain competence with experience. In reality, it is their soft skills that will primarily make the difference. To understand what makes each profile unique, we need to analyze the environment it works in and the interactions it requires. From these interactions, we can identify the criteria that legitimize each profile's actions and expertise.

  • Data Scientist

    The role of a Data Scientist is the simplest to explain: it is primarily about designing algorithms that transform company or public data into information, or even decisions, that serve business objectives. Data Scientists must therefore interact with the business to understand its challenges, and communicate the value, constraints, and limitations of their solutions effectively. They must demonstrate listening, teaching, and storytelling skills.

  • Data Engineer

    The role of a Data Engineer is to industrialize the systems designed by the Data Scientist. They implement solutions that enable operation under industrial conditions. For this, they may need to adapt the algorithms designed by Data Scientists. In addition to Data Scientists, the Data Engineer interacts with infrastructure support services and, for sufficiently complex issues, with users as part of technical support. Given the production challenges Data Engineers often face, they must remain calm, take a step back from problems, and demonstrate listening skills.

  • Parallel and Distributed Algorithmist

    The role of the parallel and distributed algorithmist is less well known, yet quite simple to understand: they must ensure that the models developed by Data Scientists meet the performance requirements of the business. The role can be considered a sub-specialty of the Data Engineer, although most do not follow that career path: they are usually already expert developers of parallel and distributed algorithms and systems. For example, an HPC expert is generally, by default, a parallel and distributed algorithmist. They primarily interact with Data Scientists and Data Engineers. Sitting at the intersection of the Data Scientist and Data Engineer roles but without user interaction, the parallel and distributed algorithmist must demonstrate a strong aptitude for abstraction, intellectual rigor, the ability to teach effectively, and the capacity to take a step back.

In many projects, the first two profiles are sufficient. However, as soon as performance issues arise, the parallel and distributed algorithmist becomes indispensable. This is when High-Performance Data Analytics (HPDA) issues come into play.

                                     Data Scientist   Data Engineer   Parallel and Distributed Algorithmist
Development and algorithms           -                =               +
Statistics and applied mathematics   +                -               =
Data handling                        =                +               -