Random Forest in R Programming Language

Predicting the Quality of Wine

The following use case shows how this algorithm can be used to predict the quality of wine based on certain features, such as chloride content, alcohol content, sugar content, and pH value.

To do this, we have randomly assigned the variables to our root node and the internal nodes.

Usually, with decision trees or random forest algorithms, the root nodes and the internal nodes are chosen using the Gini index/Gini impurity values.

1. We have the first decision tree, which is going to take chlorides and alcohol content into consideration. If the chloride value is less than 0.08 and the alcohol content is greater than six, then the quality is high (in this case, it’s eight). Otherwise, the quality is five. This decision tree is shown below:

[Decision tree 1: splits on chloride and alcohol content]

2. Our second decision tree will be split based on pH and sulphate content. If the sulphate value is less than 0.6 and the pH is less than 3.5, then the quality is six. Otherwise, it is five. The decision tree is shown below:

[Decision tree 2: splits on sulphate content and pH]

3. Our last decision tree will be split based on sugar and chloride content. If sugar is less than 2.5 and the chloride content is less than 0.08, then the quality of the wine is five. Otherwise, it's four. The decision tree is shown below:

[Decision tree 3: splits on sugar and chloride content]

Two out of three decision trees above indicate the quality of our wine to be five, so the forest predicts the same.
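To make this concrete, here is a minimal R sketch of the three rule-based trees and the majority vote. The thresholds come from the descriptions above; the sample values are made up purely for illustration.

# Three hand-built rule trees and a majority vote (sample values are hypothetical)
tree1 <- function(w) if (w$chlorides < 0.08 & w$alcohol > 6) 8 else 5
tree2 <- function(w) if (w$sulphates < 0.6 & w$pH < 3.5) 6 else 5
tree3 <- function(w) if (w$residual.sugar < 2.5 & w$chlorides < 0.08) 5 else 4

wine_sample <- data.frame(chlorides = 0.07, alcohol = 9.5, sulphates = 0.7,
                          pH = 3.3, residual.sugar = 1.9)
votes <- c(tree1(wine_sample), tree2(wine_sample), tree3(wine_sample))  # 8, 5, 5
as.numeric(names(which.max(table(votes))))  # the majority vote: 5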

In this demo, we will run an R program to predict the wine’s quality. The image shown below is the dataset that holds all attribute values required to predict the wine’s quality.

[Figure: preview of the red wine quality dataset]

So, let’s get coding!

wine <- read.csv(url("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"), header = TRUE, sep = ";") # This command is used to load the dataset
head(wine) # Display the head and dimensions of wine dataset
dim(wine)
barplot(table(wine$quality)) # Bar plot of the wine quality counts
# Now, we have to convert the quality values into factors
wine$taste <- ifelse(wine$quality < 5, "bad", "good")
wine$taste[wine$quality == 5] <- "normal"
wine$taste[wine$quality == 6] <- "normal"
wine$taste <- as.factor(wine$taste)
str(wine$taste)
barplot(table(wine$taste)) # Bar plot of the taste categories
table(wine$taste) 
# Next, we need to split the data into training and testing. 80% for training, 20% for testing.
set.seed(123)
samp <- sample(nrow(wine), 0.8 * nrow(wine))
train <- wine[samp, ]
test <- wine[-samp, ]
# Moving onto the Data visualization
library(ggplot2)
ggplot(wine, aes(fixed.acidity, volatile.acidity)) + geom_point(aes(color = taste)) # Scatter plot of fixed vs. volatile acidity, colored by taste

ggplot(wine, aes(alcohol)) + geom_histogram(aes(fill = taste), color = "black", bins = 50) # Stacked histogram of alcohol content, filled by taste

dim(train)
dim(test)  # Checks the dimensions of training and testing dataset
install.packages("randomForest") # Install the randomForest package
library(randomForest)            # Load the randomForest package
# Now that we have loaded the randomForest package, let's build the random forest model
model <- randomForest(taste ~ . - quality, data = train, ntree = 1000, mtry = 5)
model
model$confusion
# The next step is to validate our model using the test data
prediction <- predict(model, newdata = test)
table(prediction, test$taste)
prediction

# Now, let’s display the predicted vs. the actual values
results <- cbind(prediction, test$taste)
results
colnames(results) <- c("pred", "real")
results <- as.data.frame(results)
View(results)
# Finally, let’s calculate the accuracy of the model
sum(prediction == test$taste) / nrow(test) # Accuracy: proportion of correct predictions on the test set

This model's accuracy comes out to around 90 percent, which is great. Now we have automated the process of predicting wine quality. This brings us to the end of this demo on random forest.


A Few Terms Used in the Random Forest Algorithm

Before we go further, we need to understand a few terms that are used in random forest algorithms:

1. Variance – A measure of how much the algorithm's predictions change when the training data changes.

2. Bagging – This is a variance-reducing method that trains the model based on random subsamples of training data. 

3. Out-of-bag (oob) error estimate – The random forest classifier is trained using bootstrap aggregation, where each new tree is fit on a bootstrap sample of the training observations. The out-of-bag (oob) error for an observation is the average prediction error from the trees whose bootstrap samples did not include that observation. This allows the random forest classifier to be tuned and validated during training.

4. Information gain – Used to determine which feature/attribute gives us the maximum information about a class. It is based on the concept of entropy, which is the degree of uncertainty, impurity, or disorder. It aims to reduce the level of entropy, starting from the root node to the leaf nodes. 

The formula for entropy is as shown below:

E(S) = Σᵢ −pᵢ log₂(pᵢ)

where pᵢ is the probability of class i, and E(S) is the entropy of the set S.

5. Gini index – The Gini index, or Gini impurity, measures the probability of a particular element being classified incorrectly when it is picked at random. The Gini index varies between zero and one: zero denotes that all elements belong to a single class, and one denotes that the elements are randomly distributed across various classes. A Gini index of 0.5 denotes that the elements are equally distributed across classes.

The Gini index formula is shown below:

Gini = 1 − Σᵢ pᵢ²

where pᵢ is the probability of an object being classified to a particular class.
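As a quick illustration of these two measures, the small R sketch below computes entropy and Gini impurity from a vector of class labels; the helper names are our own, not from any package.

# Entropy and Gini impurity of a vector of class labels
entropy <- function(labels) {
  p <- prop.table(table(labels))   # class probabilities p_i
  -sum(p * log2(p))                # E(S) = sum over i of -p_i * log2(p_i)
}
gini <- function(labels) {
  p <- prop.table(table(labels))
  1 - sum(p^2)                     # Gini = 1 - sum over i of p_i^2
}

labels <- c("good", "good", "bad", "normal", "good", "bad")
entropy(labels)   # ~1.46
gini(labels)      # ~0.61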

Let's now look at a few applications of the random forest algorithm.


Applications of Random Forest


Random forest classifiers have a plethora of applications in the market today. Let’s go ahead and look at a few of them:

  1. In banking, it is used to detect fraudulent customers
  2. In healthcare, it is used to analyze a patient's symptoms and help diagnose diseases
  3. In ecommerce, it is used to predict customer preferences based on past activity and build recommendation lists
  4. In the stock market, it is used to analyze trends and predict profit or loss

Let's now look at the steps involved in building a random forest.


Steps for Building a Random Forest

  • Randomly select "K" features from the total "m" features, where K < m
  • Among the "K" features, calculate the node "d" using the best split point
  • Split the node into daughter nodes using the best split method
  • Repeat the previous steps until you reach "l" number of nodes
  • Build a forest by repeating all steps "n" times to create "n" trees

After the random forest trees and classifiers are created, predictions can be made using the following steps (a small R sketch of the whole build-and-predict procedure follows this list):

  • Run the test data through the rules of each decision tree to predict the outcome and then store that predicted target outcome
  • Calculate the votes for each of the predicted targets
  • The most highly voted predicted target is the final prediction 
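The sketch below is one way these steps could look in R, reusing the train and test data frames from the wine demo. It is only a toy illustration: a real random forest picks a fresh random feature subset at every split, while this sketch picks one subset of K features per tree.

library(rpart)

build_toy_forest <- function(data, target, n_trees = 25, k = 4) {
  features <- setdiff(names(data), c(target, "quality"))
  lapply(seq_len(n_trees), function(i) {
    boot  <- data[sample(nrow(data), replace = TRUE), ]        # bootstrap sample
    feats <- sample(features, k)                               # K of the m features
    rpart(reformulate(feats, response = target), data = boot)  # fit one tree
  })
}

predict_toy_forest <- function(forest, newdata) {
  # run the data through every tree, then take the majority vote per row
  votes <- sapply(forest, function(t) as.character(predict(t, newdata, type = "class")))
  apply(votes, 1, function(v) names(which.max(table(v))))
}

forest <- build_toy_forest(train, "taste")
head(predict_toy_forest(forest, test))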

Random Forest Algorithm Assumptions

  • There should be some actual values in the feature variables of the dataset, so the classifier can predict accurate results rather than an estimation; missing values should be handled before training the model (see the sketch after this list).
  • The predictions from each tree must have very low correlations with one another.
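For instance, missing values could be handled before training with helpers from the randomForest package; the sketch below assumes the wine data frame from the demo and introduces a few NAs only for illustration.

library(randomForest)

wine_na <- wine
wine_na$alcohol[sample(nrow(wine_na), 10)] <- NA   # pretend some values are missing

# Quick fix: replace NAs with the column median (numeric) or most frequent level (factor)
wine_fixed <- na.roughfix(wine_na)

# Alternative: impute NAs iteratively using random forest proximities
wine_imputed <- rfImpute(taste ~ ., data = wine_na)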

How Does the Random Forest Algorithm Work?

Before understanding how the random forest algorithm works, let's first look at how a decision tree works with the following example:

Suppose you want to predict whether a person will buy a phone or not based on the phone’s features. For that, you can build a simple decision tree.

[Figure: decision tree based on phone price, RAM, and internal storage]

In this decision tree, the parent/root node and the internal nodes represent the phone’s features, while the leaf nodes are the outputs. The edges represent the connections between the nodes based on the values from the features. Based on the price, RAM, and internal storage, consumers can decide whether they want to purchase the phone. The problem with this decision tree is that you only have limited information, which may not always provide accurate results.
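As a rough sketch, such a decision tree could be built in R with the rpart package; the phones data frame below is entirely made up for illustration.

library(rpart)

phones <- data.frame(
  price   = c(300, 900, 450, 1200, 250, 800, 500, 1100),   # in dollars
  ram     = c(4, 8, 6, 12, 3, 8, 6, 12),                   # in GB
  storage = c(64, 128, 64, 256, 32, 128, 64, 256),         # in GB
  buy     = factor(c("yes", "no", "yes", "no", "yes", "no", "yes", "no"))
)

tree <- rpart(buy ~ price + ram + storage, data = phones,
              method = "class", control = rpart.control(minsplit = 2))
predict(tree, data.frame(price = 700, ram = 8, storage = 128), type = "class")

Here the root and internal nodes are chosen automatically from the features, and the leaves hold the buy/no-buy outputs.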

Using a random forest model will improve your results, as it brings diversity into the model by building several trees on different features.

[Figure: three decision trees built on different phone features]

We have created three different decision trees to build a random forest model.

Now, suppose a new phone is launched with specific features, and you want to decide whether to buy that phone or not.

[Figure: features of the newly launched phone]

Let's pass this data to our random forest model and look at the model's output.

[Figure: predictions from the three decision trees]

The first two trees predict that the phone will be purchased, while the third predicts that it will not. Since the majority of the trees vote in favor, our model predicts that you should buy the newly launched phone.


Features of the Random Forest Algorithm

  • Provides higher accuracy than individual decision trees
  • Gives estimates of which variables are important in the classification
  • Handles missing data efficiently, and the generated forests can be saved for future use on other data
  • Computes proximities between pairs of cases, which can be used for clustering, locating outliers, or giving interesting views of the data (see the sketch after this list)
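For example, with the randomForest package these features can be accessed roughly as follows, assuming the train data from the earlier demo; the model is refit here with importance and proximity enabled.

library(randomForest)

rf_model <- randomForest(taste ~ . - quality, data = train,
                         ntree = 500, importance = TRUE, proximity = TRUE)

importance(rf_model)    # estimates of how important each variable is
varImpPlot(rf_model)    # plot those importance measures

# Proximities: how often two observations land in the same terminal node;
# useful for clustering or spotting outliers
dim(rf_model$proximity)
head(outlier(rf_model)) # outlying-ness measure derived from the proximities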

What Is Random Forest?

Random forest is a popular supervised machine learning algorithm used for both classification and regression problems. It is based on the concept of ensemble learning, which combines multiple classifiers to solve a complex problem and improve the model's performance.

The random forest algorithm relies on multiple decision trees and collects the prediction from each tree. Based on the majority vote of these predictions, it determines the final result.

The following is an example of what a random forest classifier in general looks like:

[Figure: a random forest classifier built from multiple training subsets, decision trees, and a majority vote]

The classifier is built from multiple training subsets, each containing different samples of the data. A decision tree model is created from each subset. The outputs of these models are combined by a vote, and the result with the highest frequency wins. The test set is then run through these trees to obtain the final predicted results.
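To see this voting in action, here is a minimal sketch on the built-in iris dataset with the randomForest package; predict() can return the per-class vote proportions as well as the winning class.

library(randomForest)

set.seed(42)
idx <- sample(nrow(iris), 0.8 * nrow(iris))
rf  <- randomForest(Species ~ ., data = iris[idx, ], ntree = 200)

head(predict(rf, newdata = iris[-idx, ], type = "vote"))      # vote share per class
head(predict(rf, newdata = iris[-idx, ], type = "response"))  # majority-vote class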