Linear regression is a statistical technique for modeling the relationship between a dependent variable and one or more independent variables. It is used to predict the value of a continuous (numeric) variable and is widely applied in stock market analysis, weather forecasting, and sales prediction.
Linear regression is applied in two steps:
1. Estimate the relationship between two variables.
Examples: Does body weight influence the blood cholesterol level? Will the size of the house affect house prices?
2. Predict the value of the dependent variable based on other independent variables.
The simplest form of linear regression, with one dependent variable and one independent variable, is given by the following formula:
y = mx + c
Where y is the dependent variable, x is the independent variable, m is the slope, and c is the intercept of the line.
The slope m is calculated as:
m = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²
Here, x̄ and ȳ are the means of x and y.
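Below are the two types of linear regression: simple linear regression, which uses one independent variable, and multiple linear regression, which uses two or more.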
Let’s understand the intuition behind the regression line by using an example:
The table on the left represents the data; the data points are plotted on the graph on the right.
The next step is to calculate the mean of X and Y and plot the values on the graph.
Here, the mean of X is three, and the mean of Y is five.
The least-squares regression line always passes through the point given by the mean of X and the mean of Y, here (3, 5).
Now, we need to find the equation of the regression line. For that, we need to calculate the following parameters.
Based on the calculated values, the slope (m) and the intercept (c) are solved. Because the line passes through the means, the intercept follows from c = (mean of Y) - m × (mean of X) = 5 - 1.3 × 3 = 1.1.
Let’s calculate the predicted values of Y for the corresponding values of X using the linear equation with m = 1.3 and c = 1.1.
The error for each data point is the difference between the actual and the predicted value of Y. The best-fit line is the one with the least sum of squares of these errors, also known as e square.
The sum of the squared errors for this regression line is 3.9. Comparing this value across candidate lines, the best-fit line is the one with the smallest e square value.
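The calculation described above can be sketched in R. The data below is hypothetical (the article's original table is not reproduced here) and was chosen only so that the means are 3 and 5, so the resulting sum of squared errors need not match the 3.9 quoted above. The intercept is stored as c0, an assumed name, to avoid masking R's built-in c() function.

```r
# Hypothetical data (an assumption, not the article's table);
# chosen so that mean(x) is 3 and mean(y) is 5.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 5, 8)

x_bar <- mean(x)
y_bar <- mean(y)

# Least-squares slope and intercept
m  <- sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)
c0 <- y_bar - m * x_bar

# Predicted values, per-point errors, and the sum of squared errors
y_pred <- m * x + c0
errors <- y - y_pred
sse    <- sum(errors^2)

# Cross-check the hand calculation against R's built-in linear model
fit <- lm(y ~ x)

cat("slope:", m, "intercept:", c0, "sum of squared errors:", sse, "\n")
print(coef(fit))
```

The coefficients reported by lm() should agree with the hand-computed slope and intercept, which is a quick way to confirm the formulas.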
A typical data science life cycle consists of the following stages:
Data acquisition: The first step in the life cycle of any data science project is to acquire the right data. This involves gathering data from internal and external sources that can help answer business questions, such as logs from web servers, social media data, online repositories, or databases.
Data preparation: Often referred to as data cleaning or data wrangling, this is a critical step in the life cycle. Data collected from different sources is frequently messy and often has missing values, so it must be cleaned before any value can be derived from it.
Data exploration: After cleaning the data, you can perform hypothesis testing and visualize the data to understand the data better. Data exploration is sometimes called data mining. It is used to identify patterns in your data set and find important potential features with statistical analysis.
Predictive modeling: To train your machine to make predictions, you need to build predictive models. For this, you have to choose the right algorithm on which the machine is to be trained. Historical data is then split into training and validation sets. The model is trained using the training set. The trained model is validated using the validation dataset, and the model is then evaluated for accuracy and efficiency.
Model interpretation and deployment: After a rigorous evaluation of the model, you can deploy it into a production-like environment for final user acceptance. You’ll also want to present your model to non-technical stakeholders and convey the actionable insights derived from the data.
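The preparation, exploration, and modeling stages above can be sketched in a few lines of R. This is only a minimal illustration under stated assumptions: it uses the mtcars data set that ships with R, blanks out two values to imitate messy data, and picks a linear model and a 70/30 split arbitrarily.

```r
set.seed(42)  # make the random split reproducible

# Data acquisition: mtcars ships with R. Two values are blanked out
# purely to imitate messy real-world data (an illustrative assumption).
cars <- mtcars[, c("mpg", "wt", "hp")]
cars$hp[c(3, 17)] <- NA

# Data preparation: drop rows that contain missing values
cars_clean <- na.omit(cars)

# Data exploration: summary statistics and a simple correlation
print(summary(cars_clean))
print(cor(cars_clean$mpg, cars_clean$wt))

# Predictive modeling: split into training and validation sets (about 70/30)
n         <- nrow(cars_clean)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train     <- cars_clean[train_idx, ]
valid     <- cars_clean[-train_idx, ]

# Train on the training set only
model <- lm(mpg ~ wt + hp, data = train)

# Evaluate on the held-out validation set with root mean squared error
pred <- predict(model, newdata = valid)
rmse <- sqrt(mean((valid$mpg - pred)^2))
cat("Validation RMSE:", rmse, "\n")
```

Keeping the validation rows out of the call to lm() is the point of the split: the error is measured on data the model has not seen.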
Now that we have looked at the different data science life cycle stages, let’s look at some of the data science algorithms that can help you solve complex business problems.
R has powerful graphics packages that help in data visualization. These graphics can be viewed on the screen and saved in various formats, including .pdf, .png, .jpg, .wmf, and .ps. They can be customized to suit various graphic needs and copied and pasted into Word or PowerPoint files.
You can create a bar chart, pie chart, histogram, kernel density plot, line chart, boxplot, heat map, and word cloud.
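As a minimal sketch of saving a chart to a file, the snippet below writes a histogram to a PDF using base R graphics. The temporary file path and the choice of the built-in mtcars data are assumptions made for illustration; other devices such as png() follow the same open, draw, close pattern.

```r
# A temporary path is used as a placeholder; replace it with your own file name
out_file <- tempfile(fileext = ".pdf")

pdf(out_file, width = 6, height = 4)  # open the PDF graphics device
hist(mtcars$mpg,
     main = "Miles per gallon",
     xlab = "mpg",
     col  = "lightblue")
invisible(dev.off())                  # close the device so the file is written

print(file.exists(out_file))
```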
Let’s look at boxplots in R.
Boxplots are also known as box-and-whisker diagrams. They display the distribution of data based on the following parameters: the minimum, the first quartile, the median, the third quartile, and the maximum.
To create a boxplot, call boxplot(data).
The lines at the bottom and top of the box are the first and third quartiles, and the dark line inside the box is the median value. The whiskers extend to the smallest and largest values that lie within 1.5 times the interquartile range of the box (the default in R), and the points lying beyond the whiskers are plotted as outliers.
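A short sketch of a boxplot in base R follows. The values are hypothetical, with one deliberately extreme point, and pdf(NULL) is used only so the example runs without opening a plot window.

```r
# Hypothetical values; 40 is deliberately extreme
values <- c(12, 15, 14, 10, 18, 20, 16, 13, 17, 40)

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(values)
print(five)

# boxplot() draws the plot and invisibly returns the statistics it used
pdf(NULL)  # discard graphics output; remove this line to see the plot on screen
bp <- boxplot(values, main = "Example boxplot", ylab = "Value")
invisible(dev.off())

print(bp$stats)  # lower whisker, lower hinge, median, upper hinge, upper whisker
print(bp$out)    # points drawn individually as outliers
```

Comparing five with bp$stats shows where the two differ: the upper whisker stops at the largest value that is not flagged as an outlier, while the maximum in the five-number summary includes it.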
Now that you know more about data visualization in R, let’s jump into learning the different phases of the data science life cycle.
R can be easily downloaded and installed from the CRAN website.
You can select a suitable operating system and click on it to download. Here, “Download R for Windows” has been selected.
Follow the default options to finish the installation.
You can also install RStudio, which is an integrated development environment for R. It is available in two formats: RStudio Desktop is a regular desktop application, while RStudio Server runs on a remote server and is accessed through a web browser.
The following is what the interface of RStudio looks like:
Here is a small script that is used to perform some basic operations and plot a graph.
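The original script is not reproduced here. As an assumption about what such a script might contain, the following minimal sketch performs a few basic operations and draws a simple plot; all variable names are illustrative.

```r
# Basic arithmetic and assignment
a <- 5
b <- 3
total   <- a + b
product <- a * b

# Vectorised operation: square every element of x
x <- 1:10
y <- x^2

print(total)
print(product)
print(mean(y))

# Plot y against x as points joined by lines
plot(x, y, type = "b", main = "y = x^2", xlab = "x", ylab = "y")
```

In RStudio, the printed values appear in the Console pane and the graph appears in the Plots pane.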
Before you start programming in R, you should install the packages you need, along with their dependencies. Packages provide pre-assembled collections of functions and objects. Many packages are hosted on the CRAN repository. Only a core set of packages comes with R; others can be installed on demand and must be loaded with library() before use.
To install a new package in RStudio, go to Tools -> Install Packages.
Then, you can search for the package you want to install and select the location where you want to install the package.
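Packages can also be installed from the R console with install.packages() and loaded with library(). The sketch below wraps both in a small helper; ensure_package is a hypothetical name introduced here for illustration, and the CRAN mirror URL is an assumed default.

```r
# Hypothetical helper: install a package from CRAN only if it is missing,
# then attach it to the current session.
ensure_package <- function(pkg, repos = "https://cloud.r-project.org") {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = repos)  # required dependencies are installed too
  }
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

# "tools" ships with R, so this call loads it without downloading anything
ensure_package("tools")

# For a CRAN package, the call would look like:
# ensure_package("ggplot2")
```

Checking with requireNamespace() first avoids reinstalling a package on every run of a script.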
Now, let’s discuss the different data structures available in the R programming language.
1. Vectors: A vector is the most basic R object. It holds elements of a single atomic type, such as numeric, character, or logical values.
2. Matrices: These are R objects in which the elements are arranged in a two-dimensional layout. Like vectors, they contain elements of a single type.
3. Arrays: They can store data in more than two dimensions. For example, an array with dimensions (2, 3, 4) holds four rectangular matrices, each with two rows and three columns.
4. Data Frames: A data frame is a table in which each column contains values of one variable, and each row contains one set of values from each column.
5. Lists: A list contains elements of different types (numbers, strings, vectors, etc.). It can also include a matrix or a function as one of its elements. A list is created using the list() function.
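The five data structures above can be created as follows. This is a minimal sketch; the variable names and values are illustrative only.

```r
# 1. Vector: elements of one atomic type
v <- c(10, 20, 30)

# 2. Matrix: two-dimensional layout, one type
m <- matrix(1:6, nrow = 2, ncol = 3)

# 3. Array: dimensions (2, 3, 4) give four matrices of two rows and three columns
arr <- array(1:24, dim = c(2, 3, 4))

# 4. Data frame: each column is a variable, each row is one observation
df <- data.frame(name = c("Ann", "Bob"), age = c(28, 35))

# 5. List: elements of different types, including a matrix and a function
lst <- list(number = 42, text = "hello", vec = v, mat = m, fn = mean)

str(v)
print(dim(m))
print(dim(arr))
print(arr[, , 1])         # the first of the four 2 x 3 matrices
str(df)
print(lst$fn(lst$vec))    # call the stored function on the stored vector
```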
R offers various statistical and graphical techniques. It has an extensive library of packages that makes it easy to implement machine learning algorithms. It can be easily integrated with popular software, such as Tableau and Microsoft SQL Server.
R is not just a programming language; it has a worldwide repository system called CRAN (Comprehensive R Archive Network). You can access it at https://cran.r-project.org/.
CRAN also provides R sources, R binaries, R packages, updates, and documentation. It hosts more than 10,000 R packages.