7. Random Forest in R programming language

Predicting the Quality of Wine

The following use case shows how this algorithm can be used to predict the quality of wine from features such as chloride content, alcohol content, sugar content, and pH value.

To do this, we have randomly assigned the variables to our root node and the internal nodes.

Usually, with decision trees or random forest algorithms, the variables for the root node and the internal nodes are chosen using the Gini index (Gini impurity) values.
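As a rough illustration of the Gini impurity mentioned above, here is a minimal base R sketch. The impurity of a node is 1 minus the sum of the squared class proportions, and a split is scored by the size-weighted impurity of its two child nodes. The function names `gini_impurity` and `gini_split` and the label vector are made up for this example and are not part of any package.

```r
# Gini impurity of a set of class labels: 1 - sum(p_k^2)
gini_impurity <- function(labels) {
  p <- table(labels) / length(labels)  # class proportions
  1 - sum(p^2)
}

# Size-weighted Gini impurity of the two child nodes produced by a split
gini_split <- function(left, right) {
  n <- length(left) + length(right)
  (length(left) / n) * gini_impurity(left) +
    (length(right) / n) * gini_impurity(right)
}

# Hypothetical labels, invented for illustration
labels <- c("good", "good", "normal", "normal", "normal", "normal")

parent <- gini_impurity(labels)                      # impurity before splitting
after_split <- gini_split(labels[1:2], labels[3:6])  # a split that separates the classes perfectly

parent
after_split
```

A lower value after the split means the child nodes are purer than the parent, so a tree-growing algorithm prefers the variable and threshold that reduce this value the most.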

1. We have the first decision tree, which is going to take chlorides and alcohol content into consideration. If the chloride value is less than 0.08 and the alcohol content is greater than six, then the quality is high (in this case, it’s eight). Otherwise, the quality is five. This decision tree is shown below:

[Image: decision tree splitting on chlorides and alcohol]

2. Our second decision tree will be split based on pH and sulphate content. If the sulphate value is less than 0.6 and the pH is less than 3.5, then the quality is six. Otherwise, it is five. The decision tree is shown below:

[Image: decision tree splitting on sulphates and pH]

3. Our last decision tree will be split based on sugar and chloride content. If sugar is less than 2.5 and the chloride content is less than 0.08, then we get the quality of the wine to be five. Otherwise, it’s four. The decision tree is shown below:

[Image: decision tree splitting on sugar and chlorides]

Two out of the three decision trees above indicate the quality of our wine to be five, so the forest, which takes the majority vote, predicts five as well.
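The three trees and the majority vote described above can be sketched in a few lines of base R. This is only an illustration of the voting idea, not how a real random forest is trained: the thresholds are taken from the text, the element names follow the column names of the red wine dataset, and the sample wine is a hypothetical one chosen so that two trees vote five and one votes four.

```r
# Hand-written versions of the three trees described above
tree1 <- function(w) if (w$chlorides < 0.08 && w$alcohol > 6) 8 else 5
tree2 <- function(w) if (w$sulphates < 0.6 && w$pH < 3.5) 6 else 5
tree3 <- function(w) if (w$residual.sugar < 2.5 && w$chlorides < 0.08) 5 else 4

# The forest returns the most common vote among its trees
forest_vote <- function(w) {
  votes <- c(tree1(w), tree2(w), tree3(w))
  counts <- table(votes)
  # which.max() picks the first maximum, so ties go to the lowest quality value
  as.numeric(names(counts)[which.max(counts)])
}

# Hypothetical wine, invented for illustration
sample_wine <- list(chlorides = 0.09, alcohol = 9.4, sulphates = 0.56,
                    pH = 3.6, residual.sugar = 1.9)

c(tree1(sample_wine), tree2(sample_wine), tree3(sample_wine))  # individual votes
forest_vote(sample_wine)                                       # majority vote
```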

In this demo, we will run an R program to predict the wine’s quality. The image below shows the dataset, which holds all the attribute values required to predict the wine’s quality.

[Image: preview of the wine quality dataset]

So, let’s get coding!

wine <- read.csv(url("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"), header = TRUE, sep = ";") # Load the red wine dataset (the file is semicolon-separated)
head(wine) # Display the head and dimensions of wine dataset
dim(wine)
barplot(table(wine$quality)) # Bar plot of how many wines received each quality score
# [Image: bar plot of wine quality counts]
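# Now, we label each wine by taste: quality below 5 is "bad", 5 or 6 is "normal", above 6 is "good"
wine$taste <- ifelse(wine$quality < 5, "bad", "good")
wine$taste[wine$quality == 5] <- "normal"
wine$taste[wine$quality == 6] <- "normal"
wine$taste <- as.factor(wine$taste) # Convert the labels into a factor so the model treats this as classification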
str(wine$taste)
barplot(table(wine$taste)) # Bar plot of the number of wines in each taste class
table(wine$taste) # Counts per taste class
# Next, we need to split the data into training and testing. 80% for training, 20% for testing.
set.seed(123)
samp <- sample(nrow(wine), 0.8 * nrow(wine))
train <- wine[samp, ]
test <- wine[-samp, ]
# Moving on to the data visualization
library(ggplot2)
ggplot(wine, aes(fixed.acidity, volatile.acidity)) + geom_point(aes(color = taste)) # Scatter plot of volatile acidity against fixed acidity, colored by taste
# [Image: scatter plot of fixed acidity vs. volatile acidity]

ggplot(wine, aes(alcohol)) + geom_histogram(aes(fill = taste), color = "black", bins = 50) # Stacked histogram of alcohol content, filled by taste

dim(train)
dim(test)  # Checks the dimensions of training and testing dataset
install.packages("randomForest") # Install the randomForest package (needed only once; the name is case-sensitive)
library(randomForest)            # Load the randomForest package
# Now that we have loaded the randomForest package, let’s build the random forest model
model <- randomForest(taste ~ . - quality, data = train, ntree = 1000, mtry = 5) # Use every column except quality as a predictor
model
model$confusion
# The next step is to validate our model using the test data
prediction <- predict(model, newdata = test)
table(prediction, test$taste)
prediction
# [Image: random forest model output]

# Now, let’s display the predicted vs. the actual values
results <- data.frame(pred = prediction, real = test$taste) # A data frame keeps the class labels; cbind() on factors would leave only integer codes
results
View(results)
# Finally, let’s calculate the accuracy of the model
sum(prediction == test$taste) / nrow(test) # Fraction of test wines whose predicted taste matches the actual taste
# [Image: results table and accuracy output]

You can see that this model’s accuracy on the test data is 90 percent, which is great. We have now automated the process of predicting wine quality. This brings us to the end of this demo on random forest.
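To make the accuracy figure easier to interpret, the short base R sketch below shows how overall accuracy relates to the confusion table and how per-class results can be read from the same table. The label vectors are hypothetical, invented for illustration; they are not output from the wine model, and the numbers they produce say nothing about the real data.

```r
# Hypothetical actual and predicted labels, invented for illustration
lvls <- c("bad", "good", "normal")
actual <- factor(c("normal", "normal", "normal", "good",
                   "good", "bad", "normal", "good"), levels = lvls)
predicted <- factor(c("normal", "normal", "normal", "good",
                      "normal", "normal", "normal", "good"), levels = lvls)

# Rows are predictions, columns are actual classes
confusion <- table(predicted, actual)

# Overall accuracy: correct predictions (the diagonal) over all predictions
accuracy <- sum(diag(confusion)) / sum(confusion)

# Per-class recall: correct predictions for a class over its actual count
recall <- diag(confusion) / colSums(confusion)

confusion
accuracy
recall
```

In this made-up example the overall accuracy looks reasonable even though the single "bad" wine is never identified, which is why it can be worth checking the per-class figures when one class dominates the data.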
