The problem statement is simple. You have a dataset, and you need to predict whether a candidate will be admitted to the desired college, based on the person’s GPA and college rank.
It’s important to note that the dataset we’ve imported gives the GPAs and college ranks for several students, and it also has a column indicating whether each student was admitted. Based on this labeled data, you can train the model, validate it, and then use it to predict admission for any GPA and college rank. Once you split the data into training and test sets, you will apply logistic regression to the two independent variables (GPA and rank), generate the model, and then run the test set through the model. Once that is complete, you will validate the model to see how well it performed.
The video below walks through the steps used to implement this use case.
The very first thing you need to do is import the data set you were given in CSV (comma-separated values) format. Next, select and import the libraries you will need. Although R is an excellent programming language with many built-in functions, it is easily and powerfully extended through libraries and packages. Then you need to split the data set into a training set and a test set.
After the libraries are loaded, you set your working directory. In that working directory, there’s a file called binary.csv, and that’s the CSV file from the college. In this case, the data has four columns: GRE, GPA, rank, and then the answer column, which indicates whether someone was admitted (1) or not (0).
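A minimal sketch of the loading step, assuming the file is named binary.csv and sits in the working directory; the directory path and the lowercase column names are assumptions, not something specified in the demo:

```r
# Set the working directory; this path is hypothetical.
setwd("~/college-admissions")

# Read the CSV file described above into a data frame.
admissions <- read.csv("binary.csv")

# Inspect the four columns: gre, gpa, rank, and admit
# (1 = admitted, 0 = not admitted). Column names are assumed.
str(admissions)
head(admissions)
```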
Now it’s time to split the data. Take the data frame and split it into two groups: a training set and a test set. The demo uses an 80/20 ratio, so 80 percent of the data goes into the training set and 20 percent goes into the test set. The ratio could just as well be 60/40 or 70/30; it depends on the size of your data, but for our purposes, 80/20 works well. Next, we’ll do a little data munging. In general, you munge the data early, right after ingestion, and you have to be careful. In this case, there are no missing values and no real outliers; the data was clean when we got it and ingested it. In general, though, that isn’t the case, and the munging process demands a lot of work and attention.
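Here’s a rough sketch of the split and the basic sanity checks, continuing from the data frame above; the seed value is illustrative, and the checks simply confirm what the demo observed about the data being clean:

```r
set.seed(42)  # illustrative seed so the split is reproducible

# Basic munging checks: this dataset happens to be clean, but in
# general you would look for missing values and outliers here.
sum(is.na(admissions))  # count of missing values (expect 0 here)
summary(admissions)     # quick scan for obvious outliers

# 80/20 split: sample 80 percent of the row indices for training.
train_idx <- sample(seq_len(nrow(admissions)),
                    size = floor(0.8 * nrow(admissions)))
train <- admissions[train_idx, ]
test  <- admissions[-train_idx, ]
```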
We’re going to use the glm() function (the generalized linear model function) to train our logistic regression model. The dependent variable is admit, the independent variables are GPA and rank, and the tilde (~) in the formula says that the dependent variable is a function of GPA and rank. The data argument will be the training set, and the family will be binomial; binomial indicates that it’s a binary classifier. It’s a logistic regression problem.
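In R, that call looks roughly like the following; the data frame and column names carry over from the sketches above, and treating rank as a factor is an assumption (it is an ordinal category rather than a continuous measure):

```r
# Treat college rank as a categorical predictor (an assumption).
train$rank <- factor(train$rank)
test$rank  <- factor(test$rank, levels = levels(train$rank))

# admit ~ gpa + rank: admit is modeled as a function of GPA and rank.
# family = binomial makes glm() fit a logistic regression.
model <- glm(admit ~ gpa + rank, data = train, family = binomial)

summary(model)  # coefficients, p-values, deviance
```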
There it is: You ran your model, and there’s a summary of it. From the coefficients and the rest of the output, you can see that GPA and rank have some statistical significance. Next, run the test data through the model, then set up a confusion matrix to compare your predictions against the actual values. Again, this is important: you had the answers, and you predicted some answers, so hopefully the predicted answers match up with the actual ones.
To check that, build a confusion matrix so you can see the predicted values versus the actual values. What matters here is how often a predicted negative was actually negative, and a predicted positive was actually positive.
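A sketch of that step, continuing from the model above; the 0.5 cutoff for turning probabilities into 0/1 predictions is a common default, not something mandated by the demo:

```r
# Predicted admission probabilities for the held-out test set.
probs <- predict(model, newdata = test, type = "response")

# Convert probabilities to 0/1 predictions with a 0.5 cutoff (assumed).
preds <- ifelse(probs > 0.5, 1, 0)

# Confusion matrix: predicted values versus actual values.
confusion <- table(Predicted = preds, Actual = test$admit)
confusion

# Accuracy: the share of predictions that match the actual answers.
sum(diag(confusion)) / sum(confusion)
```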
Logistic regression is a binary classifier, and in general it’s very good at that job. Are there other binary classifiers? Yes, but logistic regression is easy to understand and easy to implement, which is why it’s often the first choice.