Predicting the revenue from paid, organic, and social media traffic using a linear regression model in R.
We will now look at a real-life scenario where we will predict the revenue by using regression analysis in R. The sample dataset we will be working with is shown below:
In this demo, we will work with the following three attributes to predict the revenue:
- Paid Traffic – Traffic coming through advertisement
- Organic Traffic – Traffic from search engines, which is non-paid
- Social Traffic – Traffic coming in from various social networking sites
We will be making use of multiple linear regression. The linear regression formula is:
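For the three traffic attributes used in this demo, the general form of a multiple linear regression model can be written as below. The symbols $\beta_0, \dots, \beta_3$ and $\varepsilon$ are standard textbook notation and are assumptions here, not names taken from the dataset:

```latex
\[
\text{Revenue} = \beta_0
  + \beta_1 \,\text{Paid}
  + \beta_2 \,\text{Organic}
  + \beta_3 \,\text{Social}
  + \varepsilon
\]
```

Here $\beta_0$ is the intercept, each remaining $\beta_i$ is the expected change in revenue for a one-unit change in that traffic source with the others held fixed, and $\varepsilon$ is the error term.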
Before we begin, let’s have a look at the program’s flow:
- Generate inputs using csv files
- Import the required libraries
- Split the dataset into train and test
- Apply the regression on paid traffic, organic traffic, and social traffic
- Validate the model
So let’s start our step-by-step linear regression demo! Since we will perform linear regression in RStudio, we will open that first.
We type the following code in R:
# Import the dataset
sales <- read.csv('Mention your download path')
head(sales)    # Displays the first 6 rows of the dataset
summary(sales) # Gives summary statistics for each column
The output will look like below:
dim(sales) # Displays the dimensions (rows, columns) of the dataset
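If you do not have the dataset at hand, the import and inspection steps can be tried on a tiny stand-in file. This is only a sketch: the file contents, the column names, and the name `sales_demo` are all made up for illustration.

```r
# Write a tiny hypothetical CSV to a temporary file (columns and values are made up)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Paid,Organic,Social,Revenue",
  "120,300,80,1500",
  "150,320,95,1720",
  "90,280,60,1310"
), csv_path)

sales_demo <- read.csv(csv_path)  # read.csv treats the first line as the header by default
head(sales_demo)                  # shows up to the first 6 rows
summary(sales_demo)               # per-column summary statistics
dim(sales_demo)                   # number of rows, then number of columns
```

The same three inspection calls work unchanged on the real file once `read.csv` points at your download path.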
Now, we move on to plotting the variables.
plot(sales) # Plot the variables to see their trends
Let’s now see how the variables are correlated to each other. For that, we’ll take only the numeric column values.
library(corrplot) # Library for visualizing the correlation between the variables
num.cols <- sapply(sales, is.numeric) # Identify the numeric columns
As you can see from the above correlation matrix, the variables are highly correlated with one another and with the Revenue variable.
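A correlation matrix of this kind can be computed with base R's `cor` on the numeric columns. The sketch below uses a synthetic data frame, `sales_demo`, whose columns and values are assumptions made for illustration; it is not the demo dataset.

```r
# Hypothetical stand-in for the sales data; column names and values are assumptions
set.seed(42)
paid    <- runif(50, 100, 1000)
organic <- 0.8 * paid + rnorm(50, sd = 50)
social  <- 0.5 * paid + rnorm(50, sd = 40)
sales_demo <- data.frame(
  Month   = rep(month.abb, length.out = 50),  # non-numeric column, to show the filter at work
  Paid    = paid,
  Organic = organic,
  Social  = social,
  Revenue = 2 * paid + 1.5 * organic + social + rnorm(50, sd = 100)
)

num.cols <- sapply(sales_demo, is.numeric)    # TRUE for numeric columns only
cor.data <- cor(sales_demo[, num.cols])       # pairwise Pearson correlations
print(round(cor.data, 2))

# With the corrplot package installed, the matrix can then be drawn with:
# corrplot::corrplot(cor.data, method = "color")
```

Filtering with `num.cols` first matters because `cor` stops with an error if it is handed a non-numeric column.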
Let’s now split the data into training and testing sets.
# Split the data into training and testing sets
library(caTools) # caTools provides the sample.split function
# sample.split takes the outcome vector and returns a logical vector.
# With a SplitRatio of 0.7, about 70% of the rows are marked TRUE (training)
# and the remaining 30% are marked FALSE (testing).
split <- sample.split(sales$Revenue, SplitRatio = 0.7)
train <- subset(sales, split == TRUE)  # Creating the training set from the rows marked TRUE
test <- subset(sales, split == FALSE)  # Creating the testing set from the rows marked FALSE
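To see what a 70/30 split does to the row counts, here is a minimal sketch of the same idea using base R's `sample` instead of `caTools::sample.split`. The data frame `sales_demo` and the names `train_demo` and `test_demo` are hypothetical.

```r
# Hypothetical data frame standing in for the sales data
set.seed(123)                                  # makes the random split reproducible
sales_demo <- data.frame(Paid = runif(100), Revenue = runif(100))

n         <- nrow(sales_demo)
train_idx <- sample(seq_len(n), size = round(0.7 * n))  # 70% of the row indices, chosen at random

train_demo <- sales_demo[train_idx, ]          # rows used to fit the model
test_demo  <- sales_demo[-train_idx, ]         # remaining rows, held out for testing

nrow(train_demo)                               # size of the training set
nrow(test_demo)                                # size of the testing set
```

Checking the two row counts is a quick way to confirm that the split removed rows from each set. If both sets are as large as the original data, the split was not applied.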
Now that we have the test and train variables, let’s go ahead and create the model:
Model <- lm(Revenue ~ ., data = train) # Creates the model. Here, lm stands for linear model. Revenue is the target variable we want to predict.
pred <- predict(Model, test) # The test data was kept for this purpose
pred # This displays the predicted values
res <- residuals(Model) # Find the residuals
res <- as.data.frame(res) # Convert the residuals into a data frame
res # Prints the residuals
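To get a feel for what `lm` estimates, it can help to fit it on synthetic data where the true coefficients are known. In the sketch below, the data frame `demo`, its values, and the chosen coefficients (10, 2, 1.5, 0.5) are all assumptions made for illustration.

```r
# Synthetic data with known coefficients (all values are made up)
set.seed(1)
n <- 200
demo <- data.frame(
  Paid    = runif(n, 0, 100),
  Organic = runif(n, 0, 100),
  Social  = runif(n, 0, 100)
)
demo$Revenue <- 10 + 2 * demo$Paid + 1.5 * demo$Organic + 0.5 * demo$Social +
  rnorm(n, sd = 5)                             # random noise around the true relationship

fit <- lm(Revenue ~ ., data = demo)            # "." means: use every other column as a predictor
print(coef(fit))                               # estimates should land near 10, 2, 1.5, 0.5

# Predict revenue for one new observation; the noise-free value would be 210
new_obs  <- data.frame(Paid = 50, Organic = 50, Social = 50)
pred_new <- predict(fit, newdata = new_obs)
print(pred_new)

res_demo <- residuals(fit)                     # differences between actual and fitted Revenue
print(summary(res_demo))
```

On real data the true coefficients are unknown, so `summary(fit)` is the usual place to look at the standard errors and the R-squared value.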
# Let's now compare the predicted vs. actual values
plot(test$Revenue, type = 'l', lty = 1, col = "red")
The output of the above command is shown below in a graph that shows the actual revenue in the test set.
Now let’s overlay the predicted revenue on the same graph with the following command:
lines(pred, type = "l", col = "blue") # The output looks like below
Let’s go ahead and plot the predictions on their own with the following command:
plot(pred, type = "l", lty = 1, col = "blue") # The output looks like below; this graph shows the predicted Revenue
From the above output, we can see that the graphs of the actual revenue and the predicted revenue are very close. Let’s calculate the prediction error so we can validate the comparison.
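# Calculating the accuracy using the Root Mean Square Error (RMSE)
rmse <- sqrt(mean((pred - test$Revenue)^2)) # RMSE is the square root of the average squared difference between the predicted and actual values
rmse # Prints the RMSE; lower values mean predictions closer to the actual revenue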
The output looks like below:
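As a worked illustration of the RMSE calculation, here is a sketch on three toy values. These numbers are made up and are not the demo's output. The names `actual_demo`, `pred_demo`, and `rmse_demo` are hypothetical.

```r
# Toy values, made up for illustration only
actual_demo <- c(100, 200, 300)
pred_demo   <- c(110, 190, 330)

errors    <- pred_demo - actual_demo           # 10, -10, 30
rmse_demo <- sqrt(mean(errors^2))              # square each error, then average, then take the root
print(rmse_demo)                               # about 19.15

# Squaring after averaging gives a different quantity: the absolute mean error
abs_mean_error <- sqrt(mean(errors)^2)
print(abs_mean_error)                          # 10
```

The placement of the parentheses matters. Positive and negative errors cancel when they are averaged before squaring, which makes the model look better than it is.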
You can see that this model’s accuracy is sound. This brings us to the end of the demo.