When solving a real-world problem with Data Science, the first step is Data Cleaning and Preprocessing. A dataset handed to a Data Scientist is often unstructured and contains various inconsistencies.
Organizing the data and removing erroneous information makes it easier to analyze and draw insights from. This process involves removing redundant data, transforming data into a prescribed format, handling missing values, and so on.
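The three cleaning steps named above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the records, the field names (`customer_id`, `city`, `age`) and the choice to fill missing ages with the median are all assumptions made for the example, not part of any real dataset.

```python
from statistics import median

# Hypothetical raw records: inconsistent casing, a duplicate row, a missing age.
raw_records = [
    {"customer_id": 1, "city": " delhi ", "age": 34},
    {"customer_id": 2, "city": "Mumbai", "age": None},
    {"customer_id": 1, "city": " delhi ", "age": 34},  # redundant duplicate
    {"customer_id": 3, "city": "MUMBAI", "age": 28},
]

def clean(records):
    # 1. Remove redundant rows, keeping the first row seen for each customer_id.
    seen, unique = set(), []
    for row in records:
        if row["customer_id"] not in seen:
            seen.add(row["customer_id"])
            unique.append(dict(row))  # copy so the raw data stays untouched
    # 2. Transform text fields into one prescribed format.
    for row in unique:
        row["city"] = row["city"].strip().title()
    # 3. Handle missing values: fill unknown ages with the median known age.
    known_ages = [row["age"] for row in unique if row["age"] is not None]
    fill_value = median(known_ages)
    for row in unique:
        if row["age"] is None:
            row["age"] = fill_value
    return unique

cleaned = clean(raw_records)
for row in cleaned:
    print(row)
```

Whether a missing value should be filled, dropped, or flagged depends on the problem; the median is used here only because it is simple and not easily skewed by outliers.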
A Data Scientist then analyzes the data through various statistical procedures. The two main types of procedures are:
- Descriptive Statistics
- Inferential Statistics
Assume that you are a Data Scientist working for a company that manufactures cell phones. You have to analyze the customers who use your company's phones. To do so, you first take a thorough look at the data to understand the trends and patterns in it.
In the end, you summarize the data and present it in the form of a graph or a chart. You therefore apply Descriptive Statistics to solve the problem.
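As a small sketch of what such a summary might look like, the snippet below computes a few common descriptive statistics with Python's built-in `statistics` module. The usage figures are invented for illustration.

```python
import statistics

# Hypothetical data: daily phone usage (hours) reported by seven customers.
daily_usage_hours = [2.5, 3.0, 4.5, 3.0, 5.0, 1.5, 3.5]

summary = {
    "mean": statistics.mean(daily_usage_hours),      # average usage
    "median": statistics.median(daily_usage_hours),  # middle value
    "mode": statistics.mode(daily_usage_hours),      # most frequent value
    "stdev": statistics.stdev(daily_usage_hours),    # sample standard deviation
    "min": min(daily_usage_hours),
    "max": max(daily_usage_hours),
}

for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```

These numbers describe only the data at hand; they make no claim about customers outside the dataset.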
You will then draw ‘inferences’ or conclusions from the data. We will understand Inferential Statistics through the following example: assume that you wish to find out the number of defects that occurred during manufacturing.
However, testing every mobile phone individually takes too much time. Therefore, you test a sample of the phones and generalize from it to the number of defective phones in the entire batch.
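One common way to make this kind of generalization is to estimate the defect rate from the sample and attach a confidence interval to it. The sketch below uses the normal approximation for a proportion; the sample size, defect count, and batch size are hypothetical, and the approximation assumes a reasonably large random sample.

```python
import math

def defect_rate_estimate(defective_in_sample, sample_size, z=1.96):
    """Point estimate and normal-approximation interval for a proportion.

    z=1.96 corresponds to roughly 95% confidence.
    """
    p_hat = defective_in_sample / sample_size
    standard_error = math.sqrt(p_hat * (1 - p_hat) / sample_size)
    margin = z * standard_error
    return p_hat, max(0.0, p_hat - margin), min(1.0, p_hat + margin)

# Hypothetical numbers: 12 defective phones found in a random sample of 400,
# drawn from a production batch of 50,000 phones.
batch_size = 50_000
p_hat, low, high = defect_rate_estimate(12, 400)

print(f"Estimated defect rate: {p_hat:.1%} (approx. 95% CI {low:.1%} to {high:.1%})")
print(f"Estimated defective phones in the batch: {p_hat * batch_size:.0f}")
```

The interval is what makes this inferential rather than descriptive: it expresses how far the true defect rate of the whole batch may plausibly be from the rate observed in the sample.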
Now suppose you have to predict the sales of mobile phones over a period of two years. Since the quantity to predict is a continuous number, this is a regression problem: you fit a Regression Algorithm to the historical sales and use it to predict sales over time.
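As an illustration, the simplest regression algorithm is a straight line fitted by ordinary least squares. The sketch below fits such a line to two years of invented quarterly sales and extends it over the next eight quarters. Real sales data usually has seasonality and other effects that a straight line will not capture, so treat this as a toy example only.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical history: units sold (in thousands) in each of the last 8 quarters.
quarters = [1, 2, 3, 4, 5, 6, 7, 8]
sales = [110, 118, 131, 139, 152, 158, 171, 180]

slope, intercept = fit_line(quarters, sales)

# Predict the next two years (quarters 9 to 16) by extending the fitted line.
forecast = [slope * q + intercept for q in range(9, 17)]
for q, value in zip(range(9, 17), forecast):
    print(f"Quarter {q}: about {value:.0f} thousand units")
```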
Furthermore, you wish to predict whether customers will purchase the product based on their annual salary, age, gender, and credit score. You will use historical data to find out whether a customer will buy (1) or not (0). Since there are two outputs or ‘classes’, you will use a Binary Classification Algorithm.
If there are more than two output classes, you use a Multiclass Classification Algorithm instead. Both regression and classification are part of ‘Supervised Learning’, because the historical data comes with known outputs, or labels.
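To make the classification idea concrete, here is a minimal sketch using k-nearest neighbours, one of many possible classification algorithms. It predicts the label of a new customer by majority vote among the most similar past customers. The history below is invented, and for brevity it uses only salary, age, and credit score (gender would first need to be encoded as a number).

```python
import math
from collections import Counter

def min_max_scaler(rows):
    """Return a function that rescales each feature to the range 0..1."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]

    def scale(row):
        return tuple(
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        )

    return scale

def knn_predict(history, new_row, k=3):
    """Predict a label by majority vote among the k nearest labelled rows."""
    scale = min_max_scaler([features for features, _ in history])
    target = scale(new_row)
    by_distance = sorted(
        history, key=lambda item: math.dist(scale(item[0]), target)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical history: (annual salary in thousands, age, credit score) -> bought?
history = [
    ((30, 22, 580), 0),
    ((35, 25, 600), 0),
    ((40, 30, 620), 0),
    ((85, 35, 720), 1),
    ((90, 40, 750), 1),
    ((100, 45, 780), 1),
]

likely_buyer = knn_predict(history, (95, 38, 760))
unlikely_buyer = knn_predict(history, (32, 24, 590))
print(likely_buyer, unlikely_buyer)
```

Because the prediction is a majority vote over labels, the same function would also work unchanged if the history contained more than two classes.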
There are also instances of ‘unlabeled’ data, where the output is not divided into fixed classes as above. Suppose that you have to find clusters of potential customers and leads based on their socio-economic background.
Since you do not have a fixed set of classes in your historical data, you will use a Clustering Algorithm to identify clusters or sets of potential clients. Clustering is an ‘Unsupervised Learning’ technique.
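One widely used clustering algorithm is k-means, which repeatedly assigns each point to its nearest cluster centre and then moves each centre to the mean of its points. The sketch below is a bare-bones version with a naive, deterministic initialization; the leads and their two features (annual income in thousands, years of education) are assumptions made for the example.

```python
import math

def k_means(points, k, iterations=20):
    """Very small k-means: returns (centroids, clusters)."""
    # Naive deterministic start: use the first k points as initial centroids.
    centroids = [points[i] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical leads: (annual income in thousands, years of education).
# Note there are no labels; the algorithm has to find the groups itself.
leads = [(25, 10), (28, 12), (30, 11), (85, 16), (90, 18), (95, 17)]

centroids, clusters = k_means(leads, k=2)
for centre, members in zip(centroids, clusters):
    print(f"centre {centre}: {members}")
```

In practice the features would usually be scaled first and the starting centroids chosen more carefully, since both affect which clusters k-means finds.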
Self-driving cars have become a trending technology. The principle behind the self-driving car is autonomy, that is, being able to make decisions without human interference. Traditional computer programs require human input to yield output. Reinforcement Learning reduces this dependence on humans by letting a system learn from the consequences of its own actions.
Reinforcement Learning is about taking actions that accumulate the maximum reward. You can understand this with the following example: assume that you are training a dog to fetch a ball. You reward the dog with a treat each time it fetches the ball.
You do not give it a treat if it does not fetch the ball. The dog learns that it gets a treat when it brings the ball back. Reinforcement Learning uses the same principle: we give the agent a reward based on its action, and it tries to maximize that reward.
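The dog example can be turned into a tiny simulation. The sketch below is a deliberately simplified, single-situation version of Reinforcement Learning (often called a bandit problem): the agent keeps an estimated value for each action, mostly picks the action it currently values most, and occasionally explores. The action names, reward values, and parameters are all made up for illustration.

```python
import random

def train_dog(episodes=500, epsilon=0.1, learning_rate=0.1, seed=0):
    """Learn action values from rewards using an epsilon-greedy strategy."""
    rng = random.Random(seed)
    actions = ["ignore_ball", "fetch_ball"]
    rewards = {"ignore_ball": 0.0, "fetch_ball": 1.0}  # a treat only for fetching
    value = {action: 0.0 for action in actions}        # the agent's estimates

    for _ in range(episodes):
        if rng.random() < epsilon:
            action = rng.choice(actions)           # explore: try something at random
        else:
            action = max(actions, key=value.get)   # exploit: best-known action
        reward = rewards[action]
        # Nudge the estimate for this action toward the reward just received.
        value[action] += learning_rate * (reward - value[action])
    return value

learned = train_dog()
best_action = max(learned, key=learned.get)
print(learned)
print("Preferred action:", best_action)
```

Nobody tells the agent that fetching is the right answer; it discovers this only through the rewards it receives. Full Reinforcement Learning problems, such as driving, add many situations (states) and delayed rewards on top of this basic loop.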
A Data Scientist requires tools and software to tackle the above-mentioned problems. We will now take a look at some of the tools that a Data Scientist uses to solve those problems.