Categories
4. Linear Regression In R programming language

Use Case: Revenue Prediction with Linear Regression

Predicting the revenue from paid, organic, and social media traffic using a linear regression model in R.

We will now look at a real-life scenario where we will predict the revenue by using regression analysis in R. The sample dataset we will be working with is shown below:

sample-dataset.

In this demo, we will work with the following three attributes to predict the revenue:

  1. Paid Traffic – Traffic coming through advertisements
  2. Organic Traffic – Non-paid traffic from search engines
  3. Social Traffic – Traffic coming in from various social networking sites
traffic

We will be making use of multiple linear regression. The linear regression formula is:

multiple-linear-regression

Before we begin, let’s have a look at the program’s flow:

  1. Generate inputs using CSV files
  2. Import the required libraries
  3. Split the dataset into train and test
  4. Apply the regression on paid traffic, organic traffic, and social traffic
  5. Validate the model 

So let’s start our step-by-step linear regression demo! Since we will perform linear regression in RStudio, we will open that first.

We type the following code in R:

# Import the dataset
sales <- read.csv("Mention your download path")
head(sales)    # Displays the first 6 rows of the dataset
summary(sales) # Gives summary statistics for each column. The output looks like this:
head-sales
dim(sales) # Displays the dimensions of the dataset
dim-sales

Now, we move on to plotting the variables.

plot(sales) # Plot the variables to see their trends
plot-variables

Let’s now see how the variables are correlated to each other. For that, we’ll take only the numeric column values.

library(corrplot) # Library for visualizing the correlations between the variables
num.cols <- sapply(sales, is.numeric) # Flags which columns are numeric
num.cols
cor.data <- cor(sales[, num.cols])
cor.data
corrplot(cor.data, method = "color")
cor
correlation-matrix

As you can see from the above correlation matrix, the variables are highly correlated with each other and with the Revenue variable.
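
To see what the numeric-column filter and cor() do on a small scale, here is a minimal sketch. The data frame and its column names (Month, Paid, Revenue) are made up for illustration and are not the tutorial's dataset.

```r
# Made-up data frame with one non-numeric column; names and values are illustrative only
demo <- data.frame(
  Month   = c("Jan", "Feb", "Mar", "Apr", "May"),
  Paid    = c(10, 20, 30, 40, 50),
  Revenue = c(25, 45, 65, 85, 105),
  stringsAsFactors = FALSE
)

# sapply() applies is.numeric to every column and returns a named logical vector
num_cols <- sapply(demo, is.numeric)
num_cols

# cor() only accepts numeric data, so keep just the numeric columns
cor_demo <- cor(demo[, num_cols])
cor_demo  # Revenue is an exact linear function of Paid here, so their correlation is 1
```

On real data the off-diagonal values will fall between -1 and 1, and values near either end indicate a strong linear relationship.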

Let’s now split the data into training and testing sets.

# Split the data into training and testing sets
set.seed(2)
library(caTools) # caTools provides the sample.split function
# sample.split expects a vector (here the target column) and returns one TRUE/FALSE flag per row.
# With a SplitRatio of 0.7, about 70% of the rows are flagged TRUE (training) and 30% FALSE (testing).
split <- sample.split(sales$Revenue, SplitRatio = 0.7)
split
train <- subset(sales, split == TRUE)  # Create the training set from the rows flagged TRUE
test <- subset(sales, split == FALSE)  # Create the testing set from the rows flagged FALSE
head(train)
head(test)
View(train)
View(test)
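
For readers who do not have caTools installed, a row-wise 70/30 split can also be sketched in base R. The data frame below is synthetic: its column names (Paid, Organic, Social, Revenue) and the coefficients used to generate Revenue are assumptions for illustration only.

```r
# Synthetic stand-in for the sales data; names and values are assumed for illustration only
set.seed(2)
n <- 100
demo <- data.frame(
  Paid    = runif(n, 100, 1000),
  Organic = runif(n, 100, 1000),
  Social  = runif(n, 100, 1000)
)
demo$Revenue <- 50 + 2 * demo$Paid + 1.5 * demo$Organic + 0.5 * demo$Social + rnorm(n, sd = 20)

# Draw 70% of the row indices at random for training; the remaining rows form the test set
train_idx  <- sample(seq_len(n), size = round(0.7 * n))
demo_train <- demo[train_idx, ]
demo_test  <- demo[-train_idx, ]

# Every row lands in exactly one of the two sets
nrow(demo_train)
nrow(demo_test)
```

Sampling row indices keeps the split explicit: the negative index selects exactly the rows that were not drawn for training.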

Now that we have the test and train variables, let’s go ahead and create the model:

Model <- lm(Revenue ~ ., data = train) # Creates the model. Here, lm stands for linear model. Revenue is the target variable we want to predict, and the dot means all other columns are used as predictors.
summary(Model) 
call-formula
# Prediction
pred <- predict(Model, test) #The test data was kept for this purpose
pred #This displays the predicted values
res <- residuals(Model) # Find the residuals
res <- as.data.frame(res) # Convert the residuals into a data frame
res # Prints the residuals
# compare the predicted vs actual values
results<-cbind(pred,test$Revenue)
results
colnames(results) <- c("predicted", "real")
results<-as.data.frame(results)
head(results)
head-results
# Plot the actual revenue from the test set
plot(test$Revenue, type = "l", lty = 1, col = "red")

The output of the above command is a graph of the actual revenue in the test set, shown below.

predicted-revenue

Now let’s overlay the predicted revenue on the same plot with the following command:

lines(pred, type = "l", col = "blue") # The output looks like below

Let’s also plot the predictions on their own with the following command:

plot(pred, type = "l", lty = 1, col = "blue") # The output looks like below; this graph shows the predicted Revenue
pred-index

From the above output, we can see that the graphs of the predicted revenue and the actual revenue are very close. Let’s calculate the error so we can validate the comparison.

# Calculating the accuracy
rmse <- sqrt(mean((pred - test$Revenue)^2)) # Root Mean Square Error: the square root of the average squared difference between the predicted and actual test values
rmse

The output looks like this:

rmse
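
To make the RMSE calculation concrete, here is a small sketch with made-up numbers. The helper name rmse_of and both vectors are hypothetical and chosen so the result is easy to verify by hand.

```r
# Hypothetical helper: root mean squared error between actual and predicted values
rmse_of <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

actual    <- c(100, 200, 300, 400)  # made-up actual values
predicted <- c(110, 190, 310, 390)  # made-up predictions, each off by 10

rmse_of(actual, predicted)  # every error has magnitude 10, so the RMSE is 10

# The placement of the parentheses matters: squaring the mean of the errors
# lets positive and negative errors cancel, which gives 0 here
sqrt(mean(actual - predicted)^2)
```

RMSE is expressed in the same units as the target, so it is best judged against the typical size of the values being predicted.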

An RMSE that is small relative to the typical Revenue values indicates that the model’s predictions are sound. This brings us to the end of the demo.

Working of Linear Regression

We can better understand how linear regression works by using the example of a dataset that contains two fields, Area and Rent, and is used to predict the house’s rent based on the area where it is located. The dataset is:

area-image

As you can see, we are using a simple dataset for our example. Using this uncomplicated data, let’s have a look at how linear regression works, step by step:

1. With the available data, we plot a graph with Area on the X-axis and Rent on the Y-axis. The graph will look like the following. Notice that it is a linear pattern with a slight dip.

rent-area

2. Next, we find the mean of Area and Rent.

mean-area-rent

3. We then plot the mean on the graph.

4. We draw a line of best fit that passes through the mean.

rent-area

5. But we encounter a problem. As you can see below, multiple lines can be drawn through the mean: 

rent-area-multiple-lines

6. To overcome this problem, we keep moving the line until we find the one with the least squared distance from the data points. That line is the best-fit line.

best-fit-line

7. The least-squares distance is found by adding up the squares of the residuals.

adding-square

8. Here, a residual is the distance between Y-actual and Y-pred.

rent-residual

9. The values of m and c for the best-fit line, y = mx + c, can be calculated using these formulas:

value-m-c-1

10. This helps us find the corresponding values:

corresponding-values

11. With that, we can obtain the values of m & c.

value-m-c-3

12. Now, we can find the value of Y-pred.

pred-y.

13. After calculating, we find that the least-squares value for the line below is 3.02.

least-square

14. Finally, we are able to plot the Y-pred and this is found out to be the best fit line.

plot-the-y-pred
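
The steps above can be sketched in R. The Area and Rent values below are made up for illustration (they are not the dataset shown in the figures), and the slope and intercept use the standard closed-form least-squares expressions for a single predictor.

```r
# Made-up data for illustration only (not the dataset shown in the figures)
area <- c(1, 2, 3, 4, 5)
rent <- c(3, 4, 2, 4, 5)

# Step 2: means of both variables
mean_area <- mean(area)
mean_rent <- mean(rent)

# Steps 9-11: closed-form least-squares slope (m) and intercept
m  <- sum((area - mean_area) * (rent - mean_rent)) / sum((area - mean_area)^2)
c0 <- mean_rent - m * mean_area  # named c0 rather than c, since c() is a base R function

# Step 12: predicted values on the fitted line
rent_pred <- m * area + c0

# Steps 7-8: residuals and their sum of squares
res    <- rent - rent_pred
sum_sq <- sum(res^2)

# Cross-check against R's built-in linear model
fit <- lm(rent ~ area)
coef(fit)  # intercept and slope should match c0 and m
```

For this toy data the slope works out to 0.4 and the intercept to 2.4, and the fitted line passes through the point of means (3, 3.6), which matches step 4.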

This shows how the linear regression algorithm works. Now let’s move on to our use case.

What Is Linear Regression?

Linear regression is a form of statistical analysis that models the relationship between two or more continuous variables. It builds a predictive model from relevant data to show trends. Analysts typically use the “least squares method” to fit the model. There are other methods, but least squares is the most commonly used.

Below is a graph that depicts the relationship between the heights and weights of a sample of people. The red line is the regression line, which shows that a person’s height is positively related to their weight.

linear-reg-height-wt

Now that we understand what linear regression is, let’s learn how linear regression works and how we use the linear regression formula to derive the regression line.

Understanding the Need for Linear Regression

Before we try to understand what linear regression is, let’s quickly explore the need for a linear regression algorithm by means of an analogy. 

Imagine that we were required to predict the number of skiers at a resort, based on the area’s snowfall. The easiest way would be to plot a simple graph with snowfall amounts on the X-axis and the number of skiers on the Y-axis. Based on the graph, we could infer that as the amount of snowfall increases, the number of skiers increases as well.

Hence, the graph makes it easy to see the relationship between skiers and snowfall. The number of skiers increases in direct proportion to the amount of snowfall. Based upon the knowledge the graph imparts, we can make better decisions relating to the operations of a ski area.

To understand linear regression, we need to understand the term “regression” first. Regression is used to find relationships between a dependent variable (Y) and one or more independent variables (X). Here, the independent variables are known as the predictors or explanatory variables, and the dependent variable is referred to as the response or target variable.

A linear regression’s equation looks like this:

y = B0 + B1x1 + B2x2 + B3x3 + …

Where B0 is the intercept (the value of y when every x is 0)

B1, B2, B3 are the slopes, one for each independent variable

x1, x2, x3 are the independent variables
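
As a sketch of how this equation maps onto R, the snippet below simulates data with known coefficients and fits it with lm(). The variable names and the coefficient values (5, 2, -1 and 0.5) are assumptions chosen for illustration.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)

# Simulate y = B0 + B1*x1 + B2*x2 + B3*x3 + noise, using assumed coefficients
y <- 5 + 2 * x1 - 1 * x2 + 0.5 * x3 + rnorm(n, sd = 0.1)

# lm() estimates the intercept (B0) and one slope per independent variable
fit <- lm(y ~ x1 + x2 + x3)
coef(fit)  # estimates should land close to 5, 2, -1 and 0.5
```

Because the noise is small, the fitted coefficients recover the assumed values closely; with real data the estimates are only as reliable as the linear assumption and the amount of noise allow.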

In this case, snowfall is an independent variable and the number of skiers is a dependent variable. So, since regression finds relationships between dependent and independent variables, then what exactly is linear regression?