The following use case shows how the random forest algorithm can be used to predict the quality of a wine from features such as chloride content, alcohol content, sugar content, and pH value.
To keep the example simple, we have assigned the variables to the root node and the internal nodes at random.
In practice, decision tree and random forest algorithms choose the split at the root node and at each internal node using the Gini index (Gini impurity).
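To make the Gini criterion concrete, here is a minimal sketch in base R. The function names (`gini_impurity`, `split_gini`) and the six-row toy data are assumptions made up for illustration; they are not part of the wine dataset or of any package.

```r
# Gini impurity of a set of class labels: 1 - sum over classes of p_k^2
gini_impurity <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Weighted Gini impurity of a binary split; goes_left is a logical vector
split_gini <- function(labels, goes_left) {
  n <- length(labels)
  left <- labels[goes_left]
  right <- labels[!goes_left]
  (length(left) / n) * gini_impurity(left) +
    (length(right) / n) * gini_impurity(right)
}

# Hypothetical toy data (not taken from the real dataset)
quality   <- c("high", "high", "low", "low", "low", "high")
alcohol   <- c(11.2, 10.8, 9.1, 9.4, 9.0, 12.0)
chlorides <- c(0.07, 0.09, 0.08, 0.06, 0.10, 0.07)

gini_impurity(quality)                 # impurity before any split
split_gini(quality, alcohol > 10)      # impurity after splitting on alcohol
split_gini(quality, chlorides < 0.08)  # impurity after splitting on chlorides
```

On this toy data the alcohol split separates the two classes completely, so it has the lower weighted impurity and would be the one chosen for the node.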
1. We have the first decision tree, which is going to take chlorides and alcohol content into consideration. If the chloride value is less than 0.08 and the alcohol content is greater than six, then the quality is high (in this case, it’s eight). Otherwise, the quality is five. This decision tree is shown below:
2. Our second decision tree will be split based on pH and sulphate content. If the sulphate value is less than 0.6 and the pH is less than 3.5, then the quality is six. Otherwise, it is five. The decision tree is shown below:
3. Our last decision tree will be split based on sugar and chloride content. If sugar is less than 2.5 and the chloride content is less than 0.08, then we get the quality of the wine to be five. Otherwise, it’s four. The decision tree is shown below:
Two out of the three decision trees above predict a quality of five, so the forest, which takes the majority vote of its trees, predicts five as well.
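The three trees and the vote can be sketched in a few lines of base R. The thresholds come from the descriptions above; the function names and the sample wine's feature values are hypothetical, chosen only so that two of the three trees return five.

```r
# Each tree is written as a plain function of one wine (a named list)
tree1 <- function(w) if (w$chlorides < 0.08 && w$alcohol > 6) 8 else 5
tree2 <- function(w) if (w$sulphates < 0.6 && w$pH < 3.5) 6 else 5
tree3 <- function(w) if (w$sugar < 2.5 && w$chlorides < 0.08) 5 else 4

# Hypothetical wine, invented for illustration
wine_sample <- list(chlorides = 0.07, alcohol = 9.0,
                    sulphates = 0.7, pH = 3.3, sugar = 2.0)

# Collect one vote per tree
votes <- c(tree1(wine_sample), tree2(wine_sample), tree3(wine_sample))

# Majority vote: the most frequent value wins
# (on a tie, which.max() simply returns the first of the tied values)
forest_prediction <- as.numeric(names(which.max(table(votes))))

votes
forest_prediction
```

A real random forest also grows each tree on a bootstrap sample and considers a random subset of features at each split; this sketch shows only the voting step.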
In this demo, we will run an R program to predict the wine’s quality. The image shown below is the dataset that holds all attribute values required to predict the wine’s quality.
So, let’s get coding!
wine <- read.csv(url("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"), header = TRUE, sep = ";") # Load the dataset
head(wine) # Display the first rows of the wine dataset
dim(wine) # Display the dimensions of the wine dataset
barplot(table(wine$quality)) # Bar plot of the wine quality scores. The output looks like below
# Now, we convert the quality scores into a factor with three levels: "bad" (below 5), "normal" (5 or 6), and "good" (7 and above)
wine$taste <- ifelse(wine$quality < 5, "bad", "good")
wine$taste[wine$quality == 5] <- "normal"
wine$taste[wine$quality == 6] <- "normal"
wine$taste <- as.factor(wine$taste)
barplot(table(wine$taste)) # Bar plot of the taste categories. The output is shown below.
# Next, we need to split the data into training and testing sets: 80% for training, 20% for testing.
samp <- sample(nrow(wine), 0.8 * nrow(wine))
train <- wine[samp, ]
test <- wine[-samp, ]
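
Because `sample()` draws rows at random, each run of the split produces different training and test sets, and therefore a slightly different model. A minimal sketch of a reproducible 80/20 split follows; the seed value 123 is arbitrary, and the built-in `iris` data stands in for the wine data so the snippet runs without a download. The `demo_` names are hypothetical.

```r
set.seed(123)                     # arbitrary seed, fixed only for repeatability

demo_data <- iris                 # built-in stand-in dataset (150 rows)
n <- nrow(demo_data)

# Draw 80% of the row indices without replacement
demo_samp  <- sample(n, floor(0.8 * n))
demo_train <- demo_data[demo_samp, ]
demo_test  <- demo_data[-demo_samp, ]

nrow(demo_train)                  # number of training rows
nrow(demo_test)                   # number of testing rows
```
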
# Moving on to the data visualization
library(ggplot2) # Load ggplot2, which provides ggplot()
ggplot(wine, aes(fixed.acidity, volatile.acidity)) + geom_point(aes(color = taste)) # Scatter plot of fixed vs. volatile acidity, colored by taste. The output looks like below
ggplot(wine, aes(alcohol)) + geom_histogram(aes(fill = taste), color = 'black', bins = 50) # Stacked histogram of alcohol content, filled by taste. The output looks like below
dim(train) # Check the dimensions of the training dataset
dim(test) # Check the dimensions of the testing dataset
library(randomForest) # Load the randomForest package (install it first with install.packages("randomForest") if needed)
# Now that the randomForest package is loaded, we can build the random forest model
model <- randomForest(taste ~ . - quality, data = train, ntree = 1000, mtry = 5)
# The next step is to validate our model using the test data
prediction <- predict(model, newdata = test)
# Now, let’s display the predicted vs. the actual values
table(prediction, test$taste)
# Finally, let’s calculate the accuracy of the model
sum(prediction==test$taste) / nrow(test) # The output is as shown below
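You can see that this model’s accuracy is 90 percent, which is a good result. The exact figure will vary a little from run to run, because the train/test split is random. Now we have automated the process of predicting wine quality. This brings us to the end of this demo on random forest.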
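A single accuracy number can hide how the model does on each class, especially when one class is much more common than the others. The sketch below computes overall accuracy and per-class recall from a confusion table in base R. The eight actual and predicted labels are made up for illustration and are not output from the model above.

```r
taste_levels <- c("bad", "good", "normal")

# Hypothetical labels, invented for illustration only
actual <- factor(c("bad", "good", "normal", "normal",
                   "normal", "good", "normal", "bad"),
                 levels = taste_levels)
predicted <- factor(c("normal", "good", "normal", "normal",
                      "normal", "normal", "normal", "bad"),
                    levels = taste_levels)

# Rows are predictions, columns are actual classes
confusion <- table(predicted, actual)

# Overall accuracy: correct predictions (the diagonal) over all predictions
accuracy <- sum(diag(confusion)) / sum(confusion)

# Per-class recall: correct predictions over actual members of each class
recall <- diag(confusion) / colSums(confusion)

confusion
accuracy
recall
```

In this toy example every "normal" wine is classified correctly, while only half of the "bad" and "good" wines are, which the overall accuracy alone does not reveal.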