5. Logistic Regression in R language

Why Do We Need Logistic Regression?

You need to understand why you would use logistic regression rather than linear regression. Picking the right machine learning algorithm for your problem is no small task, so it pays to understand how the two differ.

Linear regression answers the question, “How much?” In our earlier example, as website traffic grows, how much will revenue grow?

Logistic regression, by contrast, predicts whether something will happen or not. Linear regression is generally used to predict a continuous variable, such as height or weight. Logistic regression is used when the response variable has only two outcomes: yes or no, true or false.

We refer to logistic regression as a binary classifier, since there are only two outcomes. Let’s try to understand this with an example. Let’s say you have a startup company, and you are trying to figure out whether the startup will be profitable or not. That’s binary, with two possible outcomes: profitable or not profitable. So let’s use initial funding as the independent variable.

Funding vs. Profit graph is linear

This graph shows funding versus profit, and it appears linear. Once again, our intuition tells us that the more funding a startup has, the more profitable it will be, but of course, data science doesn’t depend on intuition; it depends on data.

This graph does not tell us whether the startup will be profitable or not; it shows only that profit increases as funding increases. That’s not binary. If you want to predict how much profit will be made, linear regression would be useful, but that’s not what you are trying to figure out here. Hence you need logistic regression, which models two outcomes: in our case, profitable and not profitable.

Two outcomes - Profitable and not Profitable

In the next graph, the x-axis is our independent variable, funding. The y-axis is no longer the dependent variable, profit, but rather the probability of profit. For example, if you look at a company with funding of, say, 40, then the probability that the company will be profitable is around 0.8, or 80 percent, based on the best-fit curve, called a sigmoid curve.
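To make the sigmoid curve concrete, here is a minimal sketch in R. The coefficients `b0` and `b1` are hypothetical values, picked only so that the curve gives a probability near 0.8 at a funding level of 40, as in the graph; they are not fitted to any real data.

```r
# Sigmoid (logistic) function: maps any real number to a value between 0 and 1.
sigmoid <- function(z) 1 / (1 + exp(-z))

# Hypothetical coefficients, chosen for illustration only (not estimated from data).
b0 <- -2.6  # intercept
b1 <- 0.1   # change in log-odds per unit of funding

# Probability of being profitable at three funding levels.
funding <- c(10, 40, 70)
prob_profitable <- sigmoid(b0 + b1 * funding)

print(round(prob_profitable, 2))  # roughly 0.17, 0.80, 0.99
```

Base R also ships this function as `plogis()`, so `plogis(b0 + b1 * funding)` should give the same values.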

In the example, we plotted several companies with funding levels from 10 to 70 and marked each one on the graph as either 0 (not profitable) or 1 (profitable). This is how you should think of logistic regression.
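A setup like this could be fitted in R with `glm()` and `family = binomial`. The sketch below uses a small made-up data set; the `startups` data frame and its values are assumptions for illustration, not the data behind the graph.

```r
# Hypothetical data: funding levels from 10 to 70 and a 0/1 profitability flag.
# These values are invented for illustration only.
startups <- data.frame(
  funding    = c(10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70),
  profitable = c(0,  0,  0,  1,  0,  0,  1,  1,  0,  1,  1,  1,  1)
)

# Fit a logistic regression: probability of profit as a function of funding.
model <- glm(profitable ~ funding, data = startups, family = binomial)
print(coef(model))

# Predicted probability of profit for a company with funding of 40.
# type = "response" returns a probability rather than log-odds.
new_company <- data.frame(funding = 40)
p40 <- predict(model, newdata = new_company, type = "response")
print(p40)
```

The fitted probability depends entirely on the data supplied, so the number printed here will not necessarily match the 0.8 read off the graph.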

In this example, given the amount of funding, we can calculate the probability that a company will be profitable. If you use a threshold of 0.5, then you have your classifier: if the probability is 0.5 or higher, the company is classified as profitable; if it is lower than 0.5, it is classified as not profitable.
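The threshold rule can be sketched in a few lines of R. The helper name `classify_startup` and the example probabilities are hypothetical, used only to show the 0.5 cutoff in action.

```r
# Turn probabilities into class labels using a cutoff (0.5 by default).
# classify_startup is a hypothetical helper name used for this illustration.
classify_startup <- function(prob, threshold = 0.5) {
  ifelse(prob >= threshold, "profitable", "not profitable")
}

# Example probabilities (made up), including one exactly at the threshold.
probs <- c(0.17, 0.50, 0.80)
labels <- classify_startup(probs)

print(labels)  # "not profitable" "profitable" "profitable"
```

Note that a probability of exactly 0.5 falls on the "profitable" side, matching the "0.5 or higher" rule.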

Before getting into the depths of understanding logistic regression in R, let us first understand what it is.
