A Beginner's Guide to Data Science

Charlie | February 7th 2018

As a data science startup, one of the toughest challenges we face is explaining to family and friends what data science actually is. This article attempts to explain what data science is, what it can do and why it is giving so many companies a competitive advantage in their field.

Put simply, data science is the collection, storage and processing of data in order to make predictions or extract meaning. The process can be thought of as cooking a meal. The different types of data are the ingredients. The recipe is the algorithm, a set of instructions for how to combine the ingredients (data). The oven is the computer, which does the cooking. The data scientist is the chef, and the prediction is the meal.

To cook this data science meal effectively, the data scientist must be able to combine methods and principles from maths, statistics, software engineering and, in particular, machine learning. For example, software engineering skills are needed to handle and store data safely. Maths and statistics are applied to critically analyse and interpret the data. Machine learning is a subfield of artificial intelligence that allows computers to learn without being explicitly programmed, which makes it possible to extract knowledge from data automatically. This might sound like magic, but fundamentally machine learning can only really answer five key questions.

Is this A or B?
Is this normal?
How much or how many?
How is this arranged?
What should I do next?

Question 1: Is this A or B? For example, is this person male or female? Will this customer like this product, yes or no? Machine learning answers questions like these with classification algorithms, which take a new data point and predict whether it is A or B.
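To make this concrete, here is a minimal sketch of one simple classification approach, a nearest-neighbour vote, written in plain Python. The customer data, the two measurements per customer and the function name classify are all made up for illustration; a real project would more likely use an established machine learning library.

```python
import math

# Hypothetical past customers: two made-up measurements each,
# plus whether they liked the product ("yes") or not ("no").
training_data = [
    ((1.0, 1.0), "no"),
    ((1.5, 2.0), "no"),
    ((2.0, 1.5), "no"),
    ((6.0, 7.0), "yes"),
    ((7.0, 6.5), "yes"),
    ((6.5, 8.0), "yes"),
]

def classify(point, examples, k=3):
    """Predict a label by majority vote among the k closest examples."""
    # Sort the known examples by straight-line distance to the new point.
    by_distance = sorted(examples, key=lambda ex: math.dist(point, ex[0]))
    nearest_labels = [label for _, label in by_distance[:k]]
    # Return the label that appears most often among the neighbours.
    return max(set(nearest_labels), key=nearest_labels.count)

# A new customer who looks similar to the ones who liked the product.
prediction = classify((6.8, 7.2), training_data)
```

The idea is simply that a new data point is given the same answer as the known examples it most resembles.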

Question 2: Is this normal? This is known as anomaly detection. The algorithms are designed to identify whether something is unexpected compared to the rest of the data. This is already used in fraud detection by major banks.
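As a rough illustration, one of the simplest ways to ask "is this normal?" is to measure how far a new value sits from the average of past values. The sketch below assumes a hypothetical list of past card payments and a threshold of three standard deviations; both are illustrative choices, not how any particular bank does it.

```python
import statistics

def is_anomaly(value, history, threshold=3.0):
    """Flag a value that lies far from the average of past values."""
    mean = statistics.mean(history)
    spread = statistics.stdev(history)
    # How many standard deviations away from the average is this value?
    z_score = abs(value - mean) / spread
    return z_score > threshold

# Hypothetical past payments made by one customer.
past_payments = [12.5, 9.9, 14.2, 11.0, 13.3, 10.7, 12.1]

typical = is_anomaly(12.0, past_payments)    # close to the usual amounts
unusual = is_anomaly(950.0, past_payments)   # far outside the usual range
```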

Question 3: How much or how many? Predicting the numerical value of something is known as regression. For example, a regression model might predict the height of a child given the heights of their parents.
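The height example can be sketched with the simplest form of regression: fitting a straight line through known data points and reading a prediction off that line. The heights below are invented for illustration, and fit_line is a hypothetical helper using the standard least-squares formula.

```python
def fit_line(xs, ys):
    """Find the straight line (slope, intercept) that best fits the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Least-squares slope: how much y changes, on average, per unit of x.
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data: average height of the two parents and the adult
# height of their child, both in centimetres.
parent_heights = [160, 165, 170, 175, 180]
child_heights = [162, 166, 171, 174, 179]

slope, intercept = fit_line(parent_heights, child_heights)

# Predict the height of a child whose parents average 172 cm.
predicted_height = slope * 172 + intercept
```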

Question 4: How is this arranged? This question is about grouping together data points that have things in common, and the technique is known as clustering.
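A minimal sketch of one well-known clustering method, k-means, is shown below. The points and the starting guesses for the group centres are invented for illustration. Note that, unlike classification, nobody tells the algorithm which group each point belongs to.

```python
import math

def k_means(points, centroids, iterations=10):
    """Group points around centres, moving each centre to its group's average."""
    for _ in range(iterations):
        # Step 1: assign every point to its nearest centre.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # Step 2: move each centre to the average of its assigned points
        # (a centre with no points stays where it is).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

# Made-up data with two visibly separate groups.
points = [(1, 1), (1.5, 2), (2, 1.5), (8, 8), (8.5, 9), (9, 8.5)]

# Starting guesses for the two centres (here, two of the points).
centres, groups = k_means(points, centroids=[points[0], points[3]])
```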

Question 5: What should I do next? This is one of the most interesting questions and is solved by algorithms that use reinforcement learning. Reinforcement learning is inspired by how brains respond to punishment and reward. These algorithms use trial and error to find the best solution, learning as they go.
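The trial-and-error idea can be sketched with one of the simplest reinforcement learning setups, often called a multi-armed bandit. The learner repeatedly picks one of three actions, receives a reward, and gradually works out which action pays best. The reward values, the function names and the 10% exploration rate below are all assumptions made for illustration.

```python
import random

# Hypothetical average reward for each of three actions.
# The learner cannot see these; it has to discover them by trying.
TRUE_MEAN_REWARDS = [1.0, 5.0, 2.0]

def pull(action, rng):
    """The environment: return a slightly noisy reward for the chosen action."""
    return TRUE_MEAN_REWARDS[action] + rng.uniform(-0.5, 0.5)

def learn_best_action(n_actions, steps=500, epsilon=0.1, seed=0):
    """Epsilon-greedy learning: mostly repeat what worked, sometimes explore."""
    rng = random.Random(seed)
    estimates = [0.0] * n_actions  # current guess of each action's reward
    counts = [0] * n_actions       # how often each action has been tried
    for step in range(steps):
        if step < n_actions:
            action = step  # try every action once to begin with
        elif rng.random() < epsilon:
            action = rng.randrange(n_actions)  # explore: pick at random
        else:
            # Exploit: pick the action that looks best so far.
            action = max(range(n_actions), key=lambda a: estimates[a])
        reward = pull(action, rng)
        counts[action] += 1
        # Nudge the running average for this action towards the new reward.
        estimates[action] += (reward - estimates[action]) / counts[action]
    best = max(range(n_actions), key=lambda a: estimates[a])
    return best, estimates

best_action, learned_estimates = learn_best_action(len(TRUE_MEAN_REWARDS))
```

The occasional random choice matters: without it, the learner could settle on the first action that gives any reward and never discover a better one.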

Public interest in machine learning and data science has soared in recent years, as shown by this graph of the number of Google searches for data science and machine learning since 2004. There are two main reasons for this. Firstly, born-digital companies such as Facebook, Google and Amazon have shown the world the power of using these techniques in business. Secondly, there is a genuine need for data science to make sense of all the data accumulating on hard drives around the world.

Data science is becoming increasingly important in our rapidly expanding digital world. In every aspect of society, from healthcare and transport to our social lives and TV habits, extensive amounts of data are being produced. Applying machine learning to this vast ocean of data allows us to extract information that influences decision making. Training a model to recognise a cancerous tumour in an MRI scan can help a doctor to identify a life-threatening disease. Businesses are now able to predict when a user paying for their service is about to terminate their subscription. The applications of data science are both powerful and widespread, and it has a big role to play in our ongoing digital evolution.

For more detailed examples of the capabilities of data science in action, see our articles about data science at Netflix, and the best Go player in the world, who happens to be a computer.