In this analysis, we’ll use R’s built-in cars dataset to examine the correlation between its variables.
head(cars) – Displays the first six rows of the data frame
str(cars) – Displays the structure of the data frame (50 observations of two variables: speed and dist)
plot(cars) – Produces a scatter plot of speed vs. dist
plot(cars$speed, cars$dist) – Equivalent scatter plot with speed on the x-axis and dist on the y-axis
Correlation analysis studies the strength of the relationship between two continuous variables. It involves computing the correlation coefficient between the two variables.
If one variable consistently increases as the value of the other increases, the two have a strong positive correlation (a value close to +1).
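As a quick illustration with the cars dataset, the correlation coefficient can be computed with cor():

# Pearson correlation between speed and stopping distance
cor(cars$speed, cars$dist)  # approximately 0.81, a strong positive correlation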
Let’s build a linear regression model on the entire dataset to estimate the coefficients:
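A minimal sketch of this step might look like the following (the object name linearMod is an illustrative choice):

# Fit a simple linear model: dist as a function of speed, using all 50 observations
linearMod <- lm(dist ~ speed, data = cars)
linearMod  # prints the intercept and slope (the model coefficients)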
We can use the model to predict the dependent variable only if the model is statistically significant.
For the model to be statistically significant, the p-value should be less than 0.05.
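One way to check this, assuming the linearMod object fitted above, is to inspect the coefficient table returned by summary():

# The Pr(>|t|) column holds the p-value for each coefficient;
# the p-value for speed should be well below 0.05
summary(linearMod)
summary(linearMod)$coefficients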
Split the dataset into training and testing:
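A common approach is a random 80/20 split; the proportion and the seed below are illustrative choices:

set.seed(100)  # for reproducibility
trainingRowIndex <- sample(1:nrow(cars), 0.8 * nrow(cars))  # row indices for training
trainingData <- cars[trainingRowIndex, ]   # 80% of rows for training
testData <- cars[-trainingRowIndex, ]      # remaining 20% for testing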
Fit the model on training data and predict ‘dist’ on test data:
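Using the split above, the model (named lmMod here, to match the summary() call that follows) can be fit and used for prediction like so:

lmMod <- lm(dist ~ speed, data = trainingData)  # fit on the training set
distPred <- predict(lmMod, testData)            # predict 'dist' for the test set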
Review model diagnostic measures: summary(lmMod)
A simple correlation between the actual and predicted values can be used to measure accuracy:
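For example, assuming the distPred predictions from the previous step, the actual and predicted values can be paired and correlated:

actuals_preds <- data.frame(actuals = testData$dist, predicteds = distPred)
cor(actuals_preds$actuals, actuals_preds$predicteds)  # higher correlation implies better predictive accuracy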
You can compute all the error metrics in one go using the regr.eval() function from the DMwR package. Run install.packages('DMwR') first if you haven't installed it.
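A sketch of that call, assuming the actuals_preds data frame built above:

# install.packages('DMwR')  # run once if the package is not yet installed
library(DMwR)
# Returns MAE, MSE, RMSE, and MAPE in a single call
regr.eval(actuals_preds$actuals, actuals_preds$predicteds)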
Now that we have seen how linear regression works in R, let’s move on to decision trees.