4. Linear Regression in the R Programming Language

Use Case of revenue prediction, featuring linear regression

Predicting the revenue from paid, organic, and social media traffic using a linear regression model in R.

We will now look at a real-life scenario where we will predict the revenue by using regression analysis in R. The sample dataset we will be working with is shown below:

[Figure: sample dataset]

In this demo, we will work with the following three attributes to predict the revenue:

  1. Paid Traffic – Traffic coming through advertisement
  2. Organic Traffic – Traffic from search engines, which is non-paid
  3. Social Traffic – Traffic coming in from various social networking sites

We will be making use of multiple linear regression. The linear regression formula is:

[Figure: multiple linear regression formula]
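
For reference, the general form of a multiple linear regression with the three predictors used in this demo can be written as follows. The symbols are an assumed notation, since the formula image from the original page is not reproduced here:

```latex
\[
\text{Revenue} = \beta_0 + \beta_1\,\text{Paid} + \beta_2\,\text{Organic} + \beta_3\,\text{Social} + \varepsilon
\]
```

Here \beta_0 is the intercept, \beta_1 to \beta_3 are the coefficients estimated for each traffic source, and \varepsilon is the error term that the model cannot explain.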

Before we begin, let’s have a look at the program’s flow:

  1. Generate inputs using csv files
  2. Import the required libraries
  3. Split the dataset into train and test
  4. Apply the regression on paid traffic, organic traffic, and social traffic
  5. Validate the model 

So let’s start our step-by-step linear regression demo! Since we will perform linear regression in RStudio, we will open that first.

We type the following code in R:

# Import the dataset
sales <- read.csv('Mention your download path')
head(sales)    # Displays the first 6 rows of the dataset
summary(sales) # Gives summary statistics for each column

The output will look like below:

[Figure: output of head(sales)]

dim(sales) # Displays the dimensions of the dataset (rows, columns)

[Figure: output of dim(sales)]
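
If you do not have the CSV file at hand, a synthetic stand-in lets you try the same inspection commands. This is only a sketch: the column names (Paid, Organic, Social, Revenue), the value ranges, and the coefficients are all assumptions made up for illustration, not values from the real dataset.

```r
# Hypothetical stand-in for the CSV file; every number here is invented
set.seed(42)
n <- 200
sales.demo <- data.frame(
  Paid    = round(runif(n, 1000, 5000)),
  Organic = round(runif(n, 2000, 8000)),
  Social  = round(runif(n, 500, 3000))
)
# Revenue is an assumed linear combination of the traffic columns plus noise
sales.demo$Revenue <- 50 + 0.8 * sales.demo$Paid + 0.5 * sales.demo$Organic +
  0.3 * sales.demo$Social + rnorm(n, sd = 100)

head(sales.demo)    # first 6 rows
summary(sales.demo) # min, quartiles, mean, and max per column
dim(sales.demo)     # number of rows and columns
```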

Now, we move on to plotting the variables.

plot(sales) # Plot the variables to see their trends

[Figure: pairwise scatter plots of the variables]

Let’s now see how the variables are correlated with each other. For that, we’ll take only the numeric columns.

library(corrplot) # Library for visualizing a correlation matrix
num.cols <- sapply(sales, is.numeric) # Flag the numeric columns
num.cols
cor.data <- cor(sales[, num.cols]) # Correlation matrix of the numeric columns
cor.data
corrplot(cor.data, method = 'color')

[Figure: correlation matrix plot]

As you can see from the above correlation matrix, the variables are highly correlated with each other and with the Revenue variable.
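
To read the same information as numbers rather than colors, you can pull the target's column out of the correlation matrix. The sketch below uses base R only and synthetic data; the column names and the way the columns are generated are assumptions for illustration.

```r
# Synthetic stand-in data; the column names and relationships are invented
set.seed(1)
paid <- runif(100, 1000, 5000)
demo <- data.frame(
  Paid    = paid,
  Organic = 1.5 * paid + rnorm(100, sd = 500),
  Social  = 0.4 * paid + rnorm(100, sd = 300)
)
demo$Revenue <- 0.8 * demo$Paid + 0.5 * demo$Organic + 0.3 * demo$Social +
  rnorm(100, sd = 200)

cor.demo <- cor(demo) # full correlation matrix

# Correlation of each predictor with Revenue, strongest first
is.predictor <- rownames(cor.demo) != "Revenue"
rev.cor <- sort(cor.demo[is.predictor, "Revenue"], decreasing = TRUE)
round(rev.cor, 2)
```

Sorting the values makes it easy to see which traffic source moves most closely with revenue.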

Let’s now split the data into training and testing sets.

# Split the data into training and testing sets
set.seed(2) # Makes the random split reproducible
library(caTools) # caTools provides sample.split()
# sample.split() takes a vector (here the target column) and returns a logical
# vector with one entry per row. A SplitRatio of 0.7 marks about 70% of the rows
# TRUE (training) and the remaining 30% FALSE (testing).
split <- sample.split(sales$Revenue, SplitRatio = 0.7)
split
train <- subset(sales, split == TRUE)  # Training set: rows where split is TRUE
test <- subset(sales, split == FALSE)  # Testing set: rows where split is FALSE
head(train)
head(test)
View(train)
View(test)
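
It is worth checking that a split really produces two disjoint sets of the expected sizes. The sketch below shows the idea with base R's sample() on a small made-up data frame; it is an illustrative alternative to caTools, and the id column and row count are assumptions.

```r
# Base-R sketch of a 70/30 row split on hypothetical data
set.seed(2)
n <- 50 # assumed number of rows
demo <- data.frame(id = seq_len(n), Revenue = rnorm(n, mean = 1000, sd = 100))

train.idx <- sample(seq_len(n), size = round(0.7 * n)) # rows chosen for training
is.train <- seq_len(n) %in% train.idx                  # one TRUE/FALSE per row

train.demo <- demo[is.train, ]
test.demo <- demo[!is.train, ]

nrow(train.demo)                               # 35
nrow(test.demo)                                # 15
length(intersect(train.demo$id, test.demo$id)) # 0: no row is in both sets
```

If the two row counts do not add up to the size of the full dataset, or if the sets overlap, the subsetting condition is the first thing to inspect.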

Now that we have the test and train variables, let’s go ahead and create the model:

Model <- lm(Revenue ~ ., data = train) # Fits the model. lm stands for "linear model". Revenue is the target variable, and the dot means "use all other columns as predictors".
summary(Model)

[Figure: output of summary(Model)]
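
The printed summary can also be queried programmatically. The following sketch fits a model on synthetic data and extracts the pieces most people look at first; the data, column names, and true coefficients are assumptions chosen for illustration.

```r
# Synthetic data with known (invented) coefficients
set.seed(3)
demo <- data.frame(
  Paid    = runif(80, 1000, 5000),
  Organic = runif(80, 2000, 8000),
  Social  = runif(80, 500, 3000)
)
demo$Revenue <- 100 + 0.8 * demo$Paid + 0.5 * demo$Organic +
  0.3 * demo$Social + rnorm(80, sd = 50)

fit <- lm(Revenue ~ ., data = demo)
fit.summary <- summary(fit)

coef(fit)                               # estimated intercept and slopes
fit.summary$r.squared                   # share of variance explained
fit.summary$adj.r.squared               # R-squared adjusted for predictor count
fit.summary$coefficients[, "Pr(>|t|)"]  # p-value for each coefficient
```

Because the data were generated with a slope of 0.8 for Paid, the estimated coefficient should land close to that value, which is a quick way to convince yourself that lm() recovers the relationship.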
# Prediction
pred <- predict(Model, test) # The test data was kept for this purpose
pred # Displays the predicted values
res <- residuals(Model) # Find the residuals
res <- as.data.frame(res) # Convert the residuals into a data frame
res # Prints the residuals
# Compare the predicted vs actual values
results <- cbind(pred, test$Revenue)
results
colnames(results) <- c('predicted', 'real')
results <- as.data.frame(results)
head(results)

[Figure: output of head(results)]
# Plot the actual revenue in the test set
plot(test$Revenue, type = 'l', lty = 1, col = "red")

The output of the above command is a graph of the actual revenue in the test set:

[Figure: revenue line plot]

Now let’s overlay the predicted revenue on the same plot with the following command:

lines(pred, type = "l", col = "blue") # Adds the predictions as a blue line

Let’s also plot the predictions on their own with the following command:

plot(pred, type = "l", lty = 1, col = "blue") # This graph shows the predicted revenue

[Figure: predicted revenue by index]

From the above output, we can see that the graphs of the predicted revenue and the actual revenue are very close. Let’s calculate the prediction error to validate the comparison.

# Calculating the error
rmse <- sqrt(mean((pred - test$Revenue)^2)) # Root mean square error: square the prediction errors on the test set, average them, then take the square root
rmse

The output looks like below:

[Figure: RMSE output]
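
The order of operations matters when computing RMSE: the errors must be squared before they are averaged. The small worked example below uses made-up numbers to show how the result changes when the mean is squared instead; the rmse.of() helper name is hypothetical.

```r
# Made-up actual and predicted values; each prediction is off by exactly 10
actual <- c(100, 200, 300, 400)
predicted <- c(110, 190, 310, 390)

# Hypothetical helper: square the errors, average them, take the square root
rmse.of <- function(predicted, actual) {
  sqrt(mean((predicted - actual)^2))
}

rmse.of(predicted, actual)        # 10
sqrt(mean(predicted - actual)^2)  # 0: positive and negative errors cancel first
```

The second expression reports zero error even though every prediction misses by 10, which is why the squaring has to happen inside mean().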

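The RMSE is expressed in the same units as Revenue, so a value that is small relative to typical revenue figures indicates that the model’s accuracy is sound. This brings us to the end of the demo.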
