What does it mean to be a Data Scientist?
If you’ve watched Moneyball, the critical scene is when Peter Brand (Jonah Hill) talks to Billy Beane (Brad Pitt) about what he thinks is wrong with baseball teams, highlighting that teams should be buying runs rather than players.
That’s what it means to be a Data Scientist. I can’t say I’ve had epiphany moments of that magnitude, but I’m getting close. What makes that scene such a Data Science moment?
i) The problem is framed around mathematical principles
ii) There is context awareness
iii) The solution answers the business problem
Why is this important? Why does it matter to understand what a Data Scientist is and what a Data Scientist is supposed to do?
a) I’m seeking to add more information to the library of the internet so that anyone interested in the profession can build a rough picture of what the role entails. This is important because not so long ago I was that individual googling what a Data Scientist is and does.
b) I’m adding to the conversation banter library for the next time you, the reader, meet a ‘Data Scientist’.
Firstly, I want to point out that my views of the role are shaped by the context in which I get to do “Data Science”. Being a Data Scientist is context specific, and the scope of the work depends on the organizational problem. Therefore, the way I view my role, my responsibilities and the impact I have on the organization’s activities is not necessarily a reflection of the role of every Data Scientist.
However, as this role has been dubbed the “Sexiest Job of the 21st Century” by the Harvard Business Review, I’m trying to give some insight into the core competencies of the role. I’ve mentioned what makes a Data Science moment in relation to Moneyball, but these moments are only loosely communicated in an HR job spec when you’re applying. What you’ll see on the job spec is a long list of programming language requirements and some proprietary software experience. Unless you’re a PhD candidate, meeting those requirements seems like a tall order.
Let’s break down these requirements in a way that makes sense. Imagine a Data Scientist as a CrossFit athlete. The core skills a CrossFit athlete needs are fitness (the ability to recover quickly between workouts), strength (the technical know-how to lift heavy stuff) and grit (the soft skill that really sets one individual apart from the rest). For a Data Scientist, these translate to Statistics/Mathematics, Computer Science and Communication skills.
I find there’s a large emphasis on the Computer Science component. SQL, Python and R are the most common requirements. This makes sense, as these languages allow you to get the data into the format you need before you start the analysis. In addition, Python and R (especially R) have statistical packages [prewritten code] that make it easier to then run the data through a magical ‘black box’ that gives you an output. Thus, the Computer Science component is the technical prowess that is required to do well in the role. Like any technical exercise, practice makes perfect, and the more scenarios you are exposed to, the better you become.
The ‘Science-y’ component is having the mathematical/statistical background to recognize what the aforementioned ‘black boxes’ do. Like any ‘black box’ widget, you can plug and play it with any dataset, but to obtain useful results you need to adjust the black box to your dataset so that the output makes sense. This is the Feature Engineering part. All data is important, but sometimes there are parts of the data that are more important than the rest. Using mathematical tools standardizes the methodology, and this gives us a working framework for identifying the data that is more important. Like fitness, it’s a bit of natural talent and hard work. I’m one of those who wasn’t blessed with a natural affinity for mathematics, so I had to grind it out to fully understand a law/theory/concept. The good news is that this means anyone can grind it out and build this skill into their arsenal.
The communication bit requires context awareness. You need to translate the findings of the mathematical analysis back to the problem you have before you. This is what Peter Brand communicates well to Billy Beane, who can then strategize how to meet his objectives with the information available. There’s no point in having fancy information that doesn’t lead to any action. Technically, that’s noise, and the point of communicating the results of a data analysis exercise is to lessen the noise. This distinguishes the really good Data Scientists from the good ones. Being aware of the bigger question guides how you communicate back to stakeholders and helps you communicate in a manner that anyone can understand.
Data Analysis is supposed to communicate an action that needs to be taken, and Data Science is supposed to guide strategy and shine a light into the future. As a science, it’s based on the assumptions made, and the strength of the results depends on the strength of those assumptions. Thus, to a certain degree of certainty, I hope this article has been enlightening, and if not, it’s a chance occurrence 😊
For budding Data Scientists and veterans who are looking to enrich their experience, here are a few resources that can help you on your journey:
GitHub: Literally has coding exercises in all the languages under the sun. You just need to be prepared to put in the work to research how the code works and when certain packages are used.
Kaggle.com: Specifically designed for the Data Science professional. It’s all in one site, from technical assistance to a community of individuals who are figuring out what they are doing as they are doing it. Do a few challenges and post questions.
Stack Overflow: My SQL savior! This is my query library. Sometimes I think I have an awesome script, only to find out I could cut the number of lines by following a different chain of thought.