Basic Tutorials


Use of Population Stability Index (PSI)

There are multiple uses of Population Stability Index (PSI). They are listed below –

Model might be influenced by economic changes. Suppose you built a risk model during economic recession (year 2008) and you are using the same model to score datasets in year 2016. There is a high chance that various attributes of the model are changed drastically over last 8 years. It means it does not make sense to use this model anymore if features of the model are changed significantly.

Change in product offerings due to internal policy changes. For example, one of your product are relaunched recently so attributes may behave differently as compared to attributes of your model.

PSI can detect if any data integration or programming issues to run the scoring code.

How PSI is calculated?

PSI = (% of records based on scoring variable in Scoring Sample (A) – % of records based on scoring variable in Training Sample (B)) * In(A/ B)


  1. Sort scoring variable on descending order in scoring sample
  2. Split the data into 10 or 20 groups (deciling)
  3. Calculate % of records in each group based on scoring sample
  4. Calculate % of records in each group based on training sample
  5. Calculate difference between Step 3 and Step 4
  6. Take Natural Log of (Step3 / Step4)
  7. Multiply Step5 and Step6


  1. PSI < 0.1 – No change. You can continue using existing model.
  2. PSI >=0.1 but less than 0.2 – Slight change is required.
  3. PSI >=0.2 – Significant change is required. Ideally, you should not use this model any more.

To understand the cause of a change, we need to generate the characteristic analysis report.

Characteristic Analysis

It answers which variable is causing a shift in population distribution. It compares the distribution of an independent variable in the scoring data set to a development data set. It detects shifts in the distributions of input variables that are submitted for scoring over time.

It helps to determine which changing variable is most influential in causing the model score shift.

Most Important –
Check the direction of impact due to model variable shifts.

Check the signs of the shifted attributes and the average values of those attributes compared to those from the previously scored population or development sample. This will indicate whether the model attribute shifts are increasing or decreasing the model scores.

Basic Tutorials

Mixed-Effects Regression Modeling

Mixed effects models work for correlated data regression models, including repeated measures, longitudinal, time series, clustered, and other related methods.

Why not to use simple regression for correlated data

One key assumption of ordinary linear regression is that the errors are independent of each other. However, with repeated measures or time series data, the ordinary regression residuals usually are correlated over time. Hence, this assumption is violated for correlated data.

Definition of Mixed Regression Model

It includes features of both fixed and random effects. Whereas, standard regression includes only fixed effects.

Example :

Examination Result (target variable) could be related to how many hours students study (fixed effect), but might also be dependent on the school they go to (random effect), as well as simple variation between students (residual error).

Fixed Effects vs. Random Effects

Fixed effects assume observations are independent while random effects assume some type of relationship exists between some observations.

Gender is a fixed effect variable because the values of male / female are independent of one another (mutually exclusive); and they do not change. Whereas, school has random effects because we can only sample some of the schools which exist; not to mention, students move into and out of those schools each year.

A target variable is contributed to by additive fixed and random effects as well as an error term.

     yij = β1x1ij + β2x2ij … βnxnij + bi1z1ij + bi2z2ij … binznij + εij

where yij is the value of the outcome variable for a particular ij case, β1 through βn are the fixed effect coefficients (like regression coefficients), x1ij through xnij are the fixed effect variables (predictors) for observation j in group i (usually the first is reserved for the intercept/constant; x1ij = 1), bi1 through bin are the random effect coefficients which are assumed to be multivariate normally distributed, z1ij through znij are the random effect variables (predictors), and εij is the error for case j in group i where each group’s error is assumed to be multivariate normally distributed.

Mixed Models vs. Time Series Models

  1. Time series analysis usually has long time series and the primary goal is to look at how a single variable changes over time. There are sophisticated methods to deal with many problems – not just autocorrelation, but seasonality and other periodic changes and so on.
  2. Mixed models are not as good at dealing with complex relationships between a variable and time, partly because they usually have fewer time points (it’s hard to look at seasonality if you don’t have multiple data for each season).
  3. It is not necessary to have time series data for mixed models.


The ARIMA and AUTOREG procedures provide more time series structures than PROC MIXED.

The data used in the example below contains the interval scaled outcome variable Extroversion (extro) is predicted by fixed effects for the interval scaled predictor Openness to new experiences (open), the interval scaled predictor Agreeableness (agree), the interval scaled predictor Social engagement (social), and the nominal scaled predictor Class (class); as well as the random (nested) effect of Class within School (school). The data contains 1200 cases evenly distributed among 24 nested groups (4 classes within 6 schools).

R Code :
Step I  : Load Data

# Read data <- read.table(“”,
                       header=TRUE, sep=”,”, na.strings=”NA”, dec=”.”, strip.white=TRUE)

Step II : Install and load library

# Install and load library

Step III : Building a linear mixed model

# Building a linear mixed model
lmm.2 <- lmer(formula = extro ~ open + agree + social + class + (1| school/class), data =,  REML = TRUE, verbose = FALSE)

# Check Summary

Calculating total variance of the random effects
Add all the variance together to find the total variance (of the random effects) and then divide that total by each random effect to see what proportion of the random effect variance is attributable to each random effect (similar to R² in traditional regression). So, if we add the variance components:

 = 2.8836 + 95.1718 + 0.9684
 = 99.02541

Then we can divide this total variance by our nested effect variance to give us the proportion of variance accounted for, which indicates whether or not this effect is meaningful.
 = 2.8836/99.02541 = 0.02912030

So, we can see that only 2.9% of the total variance of the random effects is attributed to the nested effect. If all the percentages for each random effect are very small, then the random effects are not present and linear mixed modeling is not appropriate (i.e. remove the random effects from the model and use general linear or generalized linear modeling instead). We can see that the effect of school alone is quite substantial (96%) = 95.17339/99.02541

Output : Estimates of the fixed effects

These estimates are interpreted the same way as one would interpret estimates from a traditional ordinary least squares linear regression.
A one unit increase in the predictor Openness to new experiences (open) corresponds to a 0.0061302 increase in the outcome Extroversion (extro). Likewise, a one unit increase in the predictor Agreeableness (agree) corresponds to a 0.0077361 decrease in the outcome Extroversion (extro). Furthermore, the categorical predictor classb has a coefficient of 2.0547978; which means, the mean Extroversion score of the second group of class (b) is 2.0547978 higher than the mean Extroversion score of the first group of class (a).

Extract Estimates of Fixed and Random Effects

#To extract the estimates of the fixed effects
#To extract the estimates of the random effects

#To extract the coefficients for the random effects intercept (2 groups of school) and each group of the random effect factor, which here is a nested set of groups (4 groups of class within 6 groups of school)

Calculating Predicted Values

#To extract the fitted or predicted values based on the model parameters and data
yhat <- fitted(lmm.2)
lmm.data2 = cbind(,yhat)
#Score a new dataset
yhat1 = predict(lmm.2,
#To extract the residuals (errors) and summarize them, as well as plot them
residuals <- resid(lmm.2)

Check Intra Class Correlation It allows us to assess whether or not the random effect is present in the data.

lmm.null <- lmer(extro ~ 1 + (1|school), data =

SAS Code :

CLASS class school;
MODEL extro = open  agree  social  class school*class / solution outp=test;
random school / solution SUBJECT = id TYPE = UN;
ods output solutionf=sf(keep=effect estimate rename=(estimate=overall));
ods output solutionr=sr(rename=(estimate=ssdev));

Basic Tutorials

Why Variable Selection is important?

Removing a redundant variable helps to improve accuracy. Similarly, inclusion of a relevant variable has a positive effect on model accuracy.

Too many variables might result to overfitting which means model is not able to generalize pattern

Too many variables leads to slow computation which in turns requires more memory and hardware.

Why Boruta Package?

There are a lot of packages for feature selection in R. The question arises ” What makes boruta package so special”.  See the following reasons to use boruta package for feature selection.

It works well for both classification and regression problem.

It takes into account multi-variable relationships.

It is an improvement on random forest variable importance measure which is a very popular method for variable selection.

It follows an all-relevant variable selection method in which it considers all features which are relevant to the outcome variable. Whereas, most of the other variable selection algorithms follow a minimal optimal method where they rely on a small subset of features which yields a minimal error on a chosen classifier.

It can handle interactions between variables

It can deal with fluctuating nature of random a random forest importance measure

Basic Idea of Boruta Algorithm

Perform shuffling of predictors’ values and join them with the original predictors and then build random forest on the merged dataset. Then make comparison of original variables with the randomised variables to measure variable importance. Only variables having higher importance than that of the randomised variables are considered important.

How Boruta Algorithm Works
Follow the steps below to understand the algorithm –

Create duplicate copies of all independent variables. When the number of independent variables in the original data is less than 5, create at least 5 copies using existing variables.

Shuffle the values of added duplicate copies to remove their correlations with the target variable. It is called shadow features or permuted copies.

Combine the original ones with shuffled copies

Run a random forest classifier on the combined dataset and performs a variable importance measure (the default is Mean Decrease Accuracy) to evaluate the importance of each variable where higher means more important.

Then Z score is computed. It means mean of accuracy loss divided by standard deviation of accuracy loss.

Find the maximum Z score among shadow attributes (MZSA)

Tag the variables as ‘unimportant’  when they have importance significantly lower than MZSA. Then we permanently remove them from the process.

Tag the variables as ‘important’  when they have importance significantly higher than MZSA.

Repeat the above steps for predefined number of iterations (random forest runs), or until all attributes are either tagged ‘unimportant’ or ‘important’, whichever comes first.

Difference between Boruta and Random Forest Importance Measure

When i first learnt this algorithm, this question ‘RF importance measure vs. Boruta’ made me puzzled for hours. After reading a lot about it, I figured out the exact difference between these two variable selection algorithms.

In random forest, the Z score is computed by dividing the average accuracy loss by its standard deviation. It is used as the importance measure for all the variables. But we cannot use Z Score which is calculated in random forest, as a measure for finding variable importance as this Z score is not directly related to the statistical significance of the variable importance. To workaround this problem, boruta package runs random forest on both original and random attributes and compute the importance of all variables. Since the whole process is dependent on permuted copies, we repeat random permutation procedure to get statistically robust results.

Is Boruta a solution for all?

Answer is NO. You need to test other algorithms. It is not possible to judge the best algorithm without knowing data and assumptions. Since it is an improvement on random forest variable importance measure, it should work well on most of the times.

What is shuffled feature or permuted copies?

It simply means changing order of values of a variable. See the practical example below –

mydata = data.frame(var1 = 1 : 6, var2=runif(6))
shuffle = data.frame(apply(mydata,2,sample))
head(cbind(mydata, shuffle))

    Original         Shuffled
   var1   var2    var1      var2
1    1 0.2875775    4 0.9404673
2    2 0.7883051    5 0.4089769
3    3 0.4089769    3 0.2875775
4    4 0.8830174    2 0.0455565
5    5 0.9404673    6 0.8830174
6    6 0.0455565    1 0.7883051

R : Feature Selection with Boruta Package

1. Get Data into R

The read.csv() function is used to read data from CSV and import it into R environment.

#Read data
df = read.csv(“”)

2. List of variables

#Column Names

Result : “admit” “gre”   “gpa”   “rank”

3. Define categorical variables

df$admit = as.factor(df$admit)
df$rank = as.factor(df$rank)

Basic Tutorials


I. Relative Clustering Validation

Relative clustering validation, which evaluates the clustering structure by varying different parameter values for the same algorithm (e.g.,: varying the number of clusters k). It’s generally used for determining the optimal number of clusters.
It is already discussed in the previous article – How to determine the optimal number of clusters

II. Internal Clustering Validation
Internal clustering validation, which use the internal information of the clustering process to evaluate the goodness of a clustering structure. It can be also used for estimating the number of clusters and the appropriate clustering algorithm.

The internal measures included in clValid package are:

  1. Connectivity – what extent items are placed in the same cluster as their nearest neighbors in the data space. It has a value between 0 and infinity and should be minimized.
  2. Average Silhouette width – It lies between -1 (poorly clustered observations) to 1 (well clustered observations). It should be maximized.
  3. Dunn index – It is the ratio between the smallest distance between observations not in the same cluster to the largest intra-cluster distance. It has a value between 0 and infinity and should be maximized.

III. Clustering Stability Validation
Clustering stability validation, which is a special version of internal validation. It evaluates the consistency of a clustering result by comparing it with the clusters obtained after each column is removed, one at a time.
The cluster stability measures includes:

  1. The average proportion of non-overlap (APN)
  2. The average distance (AD)
  3. The average distance between means (ADM)
  4. The figure of merit (FOM)

The APN measures the average proportion of observations not placed in the same cluster by clustering based on the full data and clustering based on the data with a single column removed.

The AD measures the average distance between observations placed in the same cluster under both cases (full dataset and removal of one column).

The ADM measures the average distance between cluster centers for observations placed in the same cluster under both cases.

The FOM measures the average intra-cluster variance of the deleted column, where the clustering is based on the remaining (undeleted) columns. It also has a value between zero and 1, and again smaller values are preferred.

The values of APN, ADM and FOM ranges from 0 to 1, with smaller value corresponding with highly consistent clustering results. AD has a value between 0 and infinity, and smaller values are also preferred.

IV. External Clustering Validation
External cluster validation uses ground truth information. That is, the user has an idea how the data should be grouped. This could be a know class label not provided to the clustering algorithm. Since we know the “true” cluster number in advance, this approach is mainly used for selecting the right clustering algorithm for a specific dataset.
The external cluster validation measures includes:

  1. Corrected Rand Index
  2. Variation of Information (VI)

The Corrected Rand Index provides a measure for assessing the similarity between two partitions, adjusted for chance. Its range is -1 (no agreement) to 1 (perfect agreement). It should be maximized.

The Variation of Information is a measure of the distance between two clusterings (partitions of elements). It is closely related to mutual information. It should be minimized.

Important Point

It is essential to validate clustering with both internal and external validation measures. If you do not have access to external (actual label) data, the analyst should focus on internal and stability validation measures.

R Code : Validate Cluster Analysis

####################### Clustering Validation #############################

#Load data
data = iris[,-c(5)]
data = scale(data)

#Loading desired package

# Internal Validation
clmethods <- c(“hierarchical”,”kmeans”,”pam”)
internval <- clValid(data, nClust = 2:5, clMethods = clmethods, validation = “internal”)

# Summary

# Hierarchical clustering with two clusters was found the best clustering algorithm in each case (i.e., for connectivity, Dunn and Silhouette measures)

# Stability measure Validation
clmethods <- c(“hierarchical”,”kmeans”,”pam”)
stabval <- clValid(data, nClust = 2:6, clMethods = clmethods,
                validation = “stability”)

# Display only optimal Scores

# External Clustering Validation

# K-Means Cluster Analysis
fit <- kmeans(data,2)

# Compute cluster stats
species <- as.numeric(iris$Species)
clust_stats <- cluster.stats(d = dist(data), species, fit$cluster)

# Corrected Rand index and VI Score
# Rand Index should be maximized and VI score should be minimized

# k means Cluster Analysis
fit <- kmeans(data,2)

# Compute cluster stats
species <- as.numeric(iris$Species)
clust_stats <- cluster.stats(d = dist(data), species, fit$cluster)

# Corrected Rand index and VI Score
# Rand Index should be maximized and VI score should be minimized

# Same analysis for Ward Hierarchical Clustering
d2 <- dist(data, method = “euclidean”)
fit2 <- hclust(d2, method=”ward.D2″)

# cluster assignment (members)
groups <- cutree(fit2, k=2)

# Compute cluster stats
clust_stats2 <- cluster.stats(d = dist(data), species, groups)

# Corrected Rand index and VI Score
# Rand Index should be maximized and VI score should be minimized

Basic Tutorials


Logistic Regression

It is used to predict the result of a categorical dependent variable based on one or more continuous or categorical independent variables. In other words, it is multiple regression analysis but with a dependent variable is categorical.


1. An employee may get promoted or not based on age, years of experience, last performance rating etc. We are trying to calculate the factors that affects promotion. In this case, two possible categories in dependent variable : “Promoted” and “Not Promoted“.

2. We are interested in knowing how variables, such as age, sex, body mass index, effect blood pressure (sbp). In this case, two possible categories in dependent variable : “High Blood Pressure” and “Normal Blood Pressure“.


Logistic regression is based on Maximum Likelihood (ML) Estimation which says coefficients should be chosen in such a way that it maximizes the Probability of Y given X (likelihood). With ML, the computer uses different “iterations” in which it tries different solutions until it gets the maximum likelihood estimates. Fisher Scoring is the most popular iterative method of estimating the regression parameters.

logit(p) = b0 + b1X1 + b2X2 + —— + bk Xk

where logit(p) = loge(p / (1-p))
Take exponential both the sides

p : the probability of the dependent variable equaling a “success” or “event”.


Binary logistic regression model assumes binomial distribution of the response with N (number of trials) and p(probability of success). Logistic regression is in the ‘binomial family’ of GLMs. The dependent variable does not need to be normally distributed.

Example – If you flip a coin twice, what is the probability of getting one or more heads? It’s a binomial distribution with N=2 and p=0.5. The binomial distribution consists of the probabilities of each of the possible numbers of successes on N trials for independent events that each have a probability of p.

Interpretation of Logistic Regression Estimates

If X increases by one unit, the log-odds of Y increases by k unit, given the other variables in the model are held constant.

In logistic regression, the odds ratio is easier to interpret. That is also called Point estimate. It is exponential value of estimate.
For Continuous Predictor

An unit increase in years of experience increases the odds of getting a job by a multiplicative factor of 4.27, given the other variables in the model are held constant. In other words, the odds of getting a job are increased by 327% (4.27-1)*100 for an unit increase in years of experience.

For Binary Predictor

The odds of a person having years of experience getting a job are 4.27 times greater than the odds of a person having no experience. 

Magnitude : If you want to compare the magnitudes of positive and negative effects, simply take the inverse of the negative effects. For example, if Exp(B) = 2 on a positive effect variable, this has the same magnitude as variable with Exp(B) = 0.5 = ½ but in the opposite direction.

Odd Ratio (exp of estimate) less than 1 ==> Negative relationship (It means negative coefficient value of estimate coefficients)

Standardized Coefficients

The concept of standardization or standardized coefficients (aka estimates) comes into picture when predictors (aka independent variables) are expressed in different units. Suppose you have 3 independent variables – age, height and weight. The variable ‘age’ is expressed in years, height in cm, weight in kg. If we need to rank these predictors based on the unstandardized coefficient, it would not be a fair comparison as the unit of these variable is not same.

Standardized Coefficients (or Estimates) are mainly used to rank predictors (or independent or explanatory variables) as it eliminate the units of measurement of independent and dependent variables). We can rank independent variables with absolute value of standardized coefficients. The most important variable will have maximum absolute value of standardized coefficient.

Interpretation of Standardized Coefficient

A standardized coefficient value of 2.5 explains one standard deviation increase in independent variable on average, a 2.5 standard deviation increase in the log odds of dependent variable. 

Assumptions of Logistic Regression

  1. The logit transformation of the outcome variable has a linear relationship with the predictor variables. The one way to check the assumption is to categorize the independent variables. Transform the numeric variables to 10/20 groups and then check whether they have linear or monotonic relationship.
  2. No multicollinearity problem. No high correlationship between predictors.
  3. No influential observations (Outliers).
  4. Large Sample Size – It requires at least 10 events per independent variable.

Test Overall Fit of the Model : -2 Log L , Score and Wald Chi-Square
These are Chi-Square tests. They test against the null hypothesis that at least one of the predictors’ regression coefficient is not equal to zero in the model.

Important Performance Metrics
1. Percent Concordant : Percentage of pairs where the observation with the desired outcome (event) has a higher predicted probability than the observation without the outcome (non-event).
Rule: Higher the percentage of concordant pairs the better is the fit of the model. Above 80% considered good model.

2. Percent Discordant : Percentage of pairs where the observation with the desired outcome (event) has a lower predicted probability than the observation without the outcome (non-event). 3. Percent Tied : Percentage of pairs where the observation with the desired outcome (event) has same predicted probability than the observation without the outcome (non-event).  4.  Area under curve (c statistics) – It ranges from 0.5 to 1, where 0.5 corresponds to the model randomly predicting the response, and a 1 corresponds to the model perfectly discriminating the response.

C = Area under Curve = %concordant + (0.5 * %tied) 

.90-1 = excellent (A).80-.90 = good (B).70-.80 = fair (C).60-.70 = poor (D).50-.60 = fail (F)5. Classification Table (Confusion Matrix)
Sensitivity (True Positive Rate) – % of events of dependent variable successfully predicted as events.

Sensitivity = TRUE POS / (TRUE POS + FALSE NEG)

Specificity (True Negative Rate) – % of non-events of dependent variable successfully predicted as non-events.

Specificity = TRUE NEG / (TRUE NEG + FALSE POS)

Correct (Accuracy) = Number of correct prediction (TRUE POS + TRUE NEG) divided by sample size.
6. KS Statistics
It looks at maximum difference between distribution of cumulative events and cumulative non-events.

Detailed Explanation – How to Check Model Performance 
Problem Statement –
A researcher is interested in how variables, such as GRE (Graduate Record Exam scores), GPA (grade point average) and prestige of the undergraduate institution, affect admission into graduate school. The outcome variable, admit/don’t admit, is binary.

This data set has a binary response (outcome, dependent) variable called admit, which is equal to 1 if the individual was admitted to graduate school, and 0 otherwise. There are three predictor variables: gre, gpa, and rank. We will treat the variables gre and gpa as continuous. The variable rank takes on the values 1 through 4. Institutions with a rank of 1 have the highest prestige, while those with a rank of 4 have the lowest. [Source : UCLA]

R Code : Logistic Regression
#Read Data Filemydata <- read.csv(“”)


#Cross Tabxtabs(~admit + rank, data = mydata)

#Data Preparationmydata$rank <- factor(mydata$rank)

# Split data into training (70%) and validation (30%)dt = sort(sample(nrow(mydata), nrow(mydata)*.7))train<-mydata[dt,]val<-mydata[-dt,] 

# Check number of rows in training and validation data setsnrow(train)nrow(val)

#Run Logistic Regressionmylogistic <- glm(admit ~ ., data = train, family = “binomial”)summary(mylogistic)$coefficient

#Stepwise Logistic Regressionmylogit = step(mylogistic)

#Logistic Regression Coefficientsummary.coeff0 = summary(mylogit)$coefficient

#Calculating Odd RatiosOddRatio = exp(coef(mylogit))summary.coeff = cbind(Variable = row.names(summary.coeff0), OddRatio, summary.coeff0)row.names(summary.coeff) = NULL

#R Function : Standardized Coefficientsstdz.coff <- function (regmodel) { b <- summary(regmodel)$coef[-1,1]  sx <- sapply(regmodel$model[-1], sd)  beta <-(3^(1/2))/pi * sx * b  return(beta)}
std.Coeff = data.frame(Standardized.Coeff = stdz.coff(mylogit))std.Coeff = cbind(Variable = row.names(std.Coeff), std.Coeff)row.names(std.Coeff) = NULL

#Final Summary Reportfinal = merge(summary.coeff, std.Coeff, by = “Variable”, all.x = TRUE)

#Predictionpred = predict(mylogit,val, type = “response”)finaldata = cbind(val, pred)

#Storing Model Performance Scoreslibrary(ROCR)pred_val <-prediction(pred ,finaldata$admit)

# Maximum Accuracy and prob. cutoff against itacc.perf <- performance(pred_val, “acc”)ind = which.max( slot(acc.perf, “y.values”)[[1]])acc = slot(acc.perf, “y.values”)[[1]][ind]cutoff = slot(acc.perf, “x.values”)[[1]][ind]

# Print Resultsprint(c(accuracy= acc, cutoff = cutoff))

# Calculating Area under Curveperf_val <- performance(pred_val,”auc”)perf_val

# Plotting Lift curveplot(performance(pred_val, measure=”lift”, x.measure=”rpp”), colorize=TRUE)

# Plot the ROC curveperf_val2 <- performance(pred_val, “tpr”, “fpr”)plot(perf_val2, col = “green”, lwd = 1.5)

#Calculating KS statisticsks1.tree <- max(attr(perf_val2, “y.values”)[[1]] – (attr(perf_val2, “x.values”)[[1]]))ks1.tree
Interpret Results :

1. Odd Ratio of GRE (Exp value of GRE Estimate) = 1.002 implies for a one unit increase in gre, the odds of being admitted to graduate school increase by a factor of 1.002.
2. AUC value shows model is not able to distinguish events and non-events well.3. The model can be improved further either adding more variables or transforming existing predictors.

Basic Tutorials


IntroductionPython is widely used and very popular for a variety of software engineering tasks such as website development, cloud-architecture, back-end etc. It is equally popular in data science world. In advanced analytics world, there has been several debates on R vs. Python. There are some areas such as number of libraries for statistical analysis, where R wins over Python but Python is catching up very fast.With popularity of big data and data science, Python has become first programming language of data scientists.

There are several reasons to learn Python. Some of them are as follows –

  1. Python runs well in automating various steps of a predictive model.
  2. Python has awesome robust libraries for machine learning, natural language processing, deep learning, big data and artificial Intelligence.
  3. Python wins over R when it comes to deploying machine learning models in production.
  4. It can be easily integrated with big data frameworks such as Spark and Hadoop.
  5. Python has a great online community support.

Do you know these sites are developed in Python?

  1. YouTube
  2. Instagram
  3. Reddit
  4. Dropbox
  5. Disqus

Python 2 vs. 3

Google yields thousands of articles on this topic. Some bloggers opposed and some in favor of 2.7. If you filter your search criteria and look for only recent articles, you would find Python 2 is no longer supported by the Python Software Foundation. Hence it does not make any sense to learn 2.7 if you start learning it today. Python 3 supports all the packages. Python 3 is cleaner and faster. It is a language for the future. It fixed major issues with versions of Python 2 series. Python 3 was first released in year 2008. It has been 12 years releasing robust versions of Python 3 series. You should go for latest version of Python 3.

How to install Python?

There are two ways to download and install Python

  1. Download AnacondaIt comes with Python software along with preinstalled popular libraries.
  2. Download Pythonfrom its official website. You have to manually install libraries.

Recommended : Go for first option and download anaconda. It saves a lot of time in learning and coding Python

Coding EnvironmentsAnaconda comes with two popular IDE :

  1. Jupyter (Ipython) Notebook
  2. Spyder

Spyder. It is like RStudio for Python. It gives an environment wherein writing python code is user-friendly. If you are a SAS User, you can think of it as SAS Enterprise Guide / SAS Studio. It comes with a syntax editor where you can write programs. It has a console to check each and every line of code. Under the ‘Variable explorer’, you can access your created data files and function. I highly recommend Spyder!

Jupyter (Ipython) Notebook Jupyter is equivalent to markdown in R. It is useful when you need to present your work to others or when you need to create step by step project report as it can combine code, output, words, and graphics.

Spyder Shortcut keysThe following is a list of some useful spyder shortcut keys which makes you more productive.

  1. Press F5 to run the entire script
  2. Press F9 to run selection or line
  3. Press Ctrl+ 1 to comment / uncomment
  4. Go to front of function and then press Ctrl + I to see documentation of the function
  5. Run %reset -f to clean workspace
  6. Ctrl + Left click on object to see source code
  7. Ctrl+Enter executes the current cell.
  8. Shift+Enter executes the current cell and advances the cursor to the next cell

List of arithmetic operators with examples

Arithmetic OperatorsOperationExample
+Addition10 + 2 = 12
Subtraction10 – 2 = 8
*Multiplication10 * 2 = 20
/Division10 / 2 = 5.0
%Modulus (Remainder)10 % 3 = 1
**Power10 ** 2 = 100
//Floor17 // 3 = 5
(x + (d-1)) // dCeiling(17 +(3-1)) // 3 = 6

Basic programs in Python

Example 1

x = 10
y = 3
print("10 divided by 3 is", x/y)
print("remainder after 10 divided by 3 is", x%y)

Result :
10 divided by 3 is 3.33
remainder after 10 divided by 3 is 1
Example 2

x = 100
x > 80 and x <=95
x > 35 or x < 60
x > 80 and x <=95
Out[45]: False
x > 35 or x < 60
Out[46]: True

Comparison, Logical and Assignment Operators

Comparison & Logical OperatorsDescriptionExample
>Greater than5 > 3 returns True
<Less than5 < 3 returns False
>=Greater than or equal to5 >= 3 returns True
<=Less than or equal to5 <= 3 return False
==Equal to5 == 3 returns False
!=Not equal to5 != 3 returns True
andCheck both the conditionsx > 18 and x <=35
orIf atleast one condition hold Truex > 35 or x < 60
notOpposite of Conditionnot(x>7)

Assignment Operators

It is used to assign a value to the declared variable. For e.g. x += 25 means x = x+25.

x = 100
y = 10
x += y

In this case, x+=y implies x=x+y which is x = 100+ 10.

Similarly, you can use x-=y, x*=y and x /=y

Python Data Structures

In every programming language, it is important to understand the data structures. Following are some data structures used in Python.
1. ListIt is a sequence of multiple values. It allows us to store different types of data such as integer, float, string etc. See the examples of list below. First one is an integer list containing only integer. Second one is string list containing only string values. Third one is mixed list containing integer, string and float values.

  1. x = [1, 2, 3, 4, 5]
  2. y = [‘A’, ‘O’, ‘G’, ‘M’]
  3. z = [‘A’, 4, 5.1, ‘M’]

Get List ItemWe can extract list item using Indexes. Index starts from 0 and end with (number of elements-1).

x = [1, 2, 3, 4, 5]
Out[68]: 1

Out[69]: 2

Out[70]: 5

Out[71]: 5

Out[72]: 4

x[0] picks first element from list. Negative sign tells Python to search list item from right to left. x[-1] selects the last element from list.

You can select multiple elements from a list using the following method

x[:3] returns[1, 2, 3]

2. TupleA tuple is similar to a list in the sense that it is a sequence of elements. The difference between list and tuple are as follows –

  1. A tuple cannot be changed once constructed whereas list can be modified.
  2. A tuple is created by placing comma-separated values inside parentheses ( ). Whereas, list is created inside square brackets [ ]


K = (1,2,3)
State = (‘Delhi’,’Maharashtra’,’Karnataka’)

Perform for loop on Tuple

for i in State:


Detailed Tutorial : Python Data Structures
FunctionsLike print(), you can create your own custom function. It is also called user-defined functions. It helps you in automating the repetitive task and calling reusable code in easier way.
Rules to define a function

  1. Function starts with def keyword followed by function name and ( )
  2. Function body starts with a colon (:) and is indented
  3. The keyword return ends a function andgive value of previous expression.
def sum_fun(a, b):
result = a + b
return result

z = sum_fun(10, 15)

Result : z = 25

Suppose you want python to assume 0 as default value if no value is specified for parameter b.

def sum_fun(a, b=0):
result = a + b
return result
z = sum_fun(10)

In the above function, b is set to be 0 if no value is provided for parameter b. It does not mean no other value than 0 can be set here. It can also be used asz = sum_fun(10, 15)

Python Conditional StatementsConditional statements are commonly used in coding. It is IF ELSE statements. It can be read like : ” if a condition holds true, then execute something. Else execute something else”
Note : The if and else statements ends with a colon :

k = 27
if k%5 == 0:
 print('Multiple of 5')
 print('Not a Multiple of 5')

Result :Not a Multiple of 5

List of popular packages (comparison with R)

Some of the leading packages in Python along with equivalent libraries in R are as follows-

  1. pandas. For data manipulation and data wrangling. A collections of functions to understand and explore data. It is counterpart of dplyr and reshape2 packages in R.
  2. NumPy. For numerical computing. It’s a package for efficient array computations. It allows us to do some operations on an entire column or table in one line. It is roughly approximate to Rcpp package in R which eliminates the limitation of slow speed in R. Numpy Tutorial
  3. Scipy.For mathematical and scientific functions such asintegration, interpolation, signal processing, linear algebra, statistics, etc. It is built on Numpy.
  4. Scikit-learn. A collection of machine learning algorithms. It is built on Numpy and Scipy. It can perform all the techniques that can be done in R usingglm, knn, randomForest, rpart, e1071 packages.
  5. Matplotlib.For data visualization. It’s a leading package for graphics in Python. It is equivalent to ggplot2 package in R.
  6. Statsmodels.For statistical and predictive modeling. It includes various functions to explore data and generate descriptive and predictive analytics. It allows users to run descriptive statistics, methods to impute missing values, statistical tests and take table output to HTML format.
  7. pandasql. It allows SQL users to write SQL queries in Python. It is very helpful for people who loves writing SQL queries to manipulate data. It is equivalent to sqldf package in R.

Maximum of the above packages are already preinstalled in Spyder.
Comparison of Python and R Packages by Data Mining Task

TaskPython PackageR Package
IDERodeo / SpyderRstudio
Data Manipulationpandasdplyr and reshape2
Machine LearningScikit-learnglm, knn, randomForest, rpart, e1071
Data Visualizationggplot + seaborn + bokehggplot2
Character FunctionsBuilt-In Functionsstringr
SQL Queriespandasqlsqldf
Working with Datesdatetimelubridate
Web Scrapingbeautifulsouprvest

Popular python commands

The commands below would help you to install and update new and existing packages. Let’s say, you want to install / uninstall pandas package.
Run these commands from IPython console window. Don’t forget to add ! before pip otherwise it would return syntax error.Install Package
!pip install pandas

Uninstall Package
!pip uninstall pandas

Show Information about Installed Package
!pip show pandas

List of Installed Packages
!pip list

Upgrade a package
!pip install –upgrade pandas –userHow to import a packageThere are multiple ways to import a package in Python. It is important to understand the difference between these styles.

1. import pandas as pd It imports the package pandas under the alias pd. A function DataFrame in package pandas is then submitted with pd.DataFrame.

2. import pandas
It imports the package without using alias but here the function DataFrame is submitted with full package name pandas.DataFrame

3. from pandas import *
It imports the whole package and the function DataFrame is executed simply by typing DataFrame. It sometimes creates confusion when same function name exists in more than one package.Pandas Data Structures – Series and DataFrameIn pandas package, there are two data structures – series and dataframe. These structures are explained below in detail -1. SeriesIt is a one-dimensional array. You can access individual elements of a series using position. It’s similar to vector in R. In the example below, we are generating 5 random values.

import pandas as pd
import numpy as  np
s1 = pd.Series(np.random.randn(5))
0   -2.412015
1   -0.451752
2    1.174207
3    0.766348
4   -0.361815
dtype: float64

Extract first and second valueYou can get a particular element of a series using index value. See the examples below –




Basic Tutorials

Linear Regression

It is a way of finding a relationship between a single, continuous variable called Dependent or Target variable and one or more other variables (continuous or not) called Independent Variables.

It’s a straight line curve. In the above figure, diagonal red line is a regression line which is also called best-fitting straight line. The distance between dots and regression line is errors. Linear regression aims at finding best fitting straight line by minimizing the sum of squared vertical distance between dots and regression line.

Variable Type

Linear regression requires the dependent variable to be continuous i.e. numeric values (no categories or groups).

Simple vs. Multiple Linear Regression

Linear regression can be simple linear regression when you have only one independent variable . Whereas Multiple linear regression will have more than one independent variable.

Regression Equation


b0 is the intercept the expected mean value of dependent variable (Y) when all independent variables (Xs) are equal to 0. and b1 is the slope.
b1 represents the amount by which dependent variable (Y) changes if we change X1 by one unit keeping other variables constant.

Important Term : Residual

The difference between an observed (actual) value of the dependent variable and the value of the dependent variable predicted from the regression line.


Linear regression is based on least square estimation which says regression coefficients (estimates) should be chosen in such a way that it minimizes the sum of the squared distances of each observed response to its fitted value.

Minimum Sample Size

Linear regression requires 5 cases per independent variable in the analysis.

Assumptions of Linear Regression Analysis 

1. Linear Relationship : Linear regression needs a linear relationship between the dependent and independent variables.
2. Normality of Residual : Linear regression requires residuals should be normally distributed.
3. Homoscedasticity :  Linear regression assumes that residuals are approximately equal for all predicted dependent variable values. In other words, it means constant variance of errors.
4. No Outlier Problem
5. Multicollinearity : It means there is a high correlation between independent variables. The linear regression model MUST NOT be faced with problem of multicollinearity.
6. Independence of error terms – No Autocorrelation
It states that the errors associated with one observation are not correlated with the errors of any other observation. It is a problem when you use time series data. Suppose you have collected data from labors in eight different districts. It is likely that the labors within each district will tend to be more like one another that labors from different districts, that is, their errors are not independent.

Distribution of Linear Regression

Linear regression assumes target or dependent variable to be normally distributed. Normal Distribution is same as Gaussian distribution. It uses identity link function of gaussian family

Standardized Coefficients

The concept of standardization or standardized coefficients (aka estimates) comes into picture when predictors (aka independent variables) are expressed in different units. Suppose you have 3 independent variables – age, height and weight. The variable ‘age’ is expressed in years, height in cm, weight in kg. If we need to rank these predictors based on the unstandardized coefficient, it would not be a fair comparison as the unit of these variable is not same.

Standardized Coefficients (or Estimates) are mainly used to rank predictors (or independent or explanatory variables) as it eliminate the units of measurement of  independent and dependent variables). We can rank independent variables with absolute value of standardized coefficients. The most important variable will have maximum absolute value of standardized coefficient.

Interpretation of Standardized Coefficient

A standardized coefficient value of 1.25 indicates that a change of one standard deviation in the independent variable results in a 1.25 standard deviations increase in the dependent variable.

Detailed Explanation : Standardized vs. Unstandardized Coefficient
Measures of Model Performance

1. R-squared

It measures the proportion of the variation in your dependent variable explained by all of your independent variables in the model. It assumes that every independent variable in the model helps to explain variation in the dependent variable. In reality, some variables don’t affect dependent variable and they don’t help building a good model.

In the numerator of equation above, yi-hat is the predicted value. Mean value of Y appears in denominator.

Rule :

Higher the R-squared, the better the model fits your data. In psychological surveys or studies, we generally found low R-squared values lower than 0.5. It is because we are trying to predict human behavior and it is not easy to predict humans. In these cases, if your R-squared value is low but you have statistically significant independent variables (aka predictors), you can still generate insights about how changes in the predictor values are associated with changes in the response value.

Can R-Squared be negative?

Yes, it is when horizontal line explains the data better than your model. It mostly happens when you do not include intercept. Without an intercept, the regression could do worse than the sample mean in terms of predicting the target variable. It is not only because of exclusion of intercept. It can be negative even with inclusion of intercept.

Mathematically, it is possible when error sum-of-squares from the model is larger than the total sum-of-squares from the horizontal line.

R-squared = 1 – [(Sum of Square Error)/(Total Sum of Square)]

2. Adjusted R-squared

It measures the proportion of variation explained by only those independent variables that really affect the dependent variable. It penalizes you for adding independent variable that do not affect the dependent variable.

Adjusted R-Squared is more important metrics than R-squared

Every time you add a independent variable to a model, the R-squared increases, even if the independent variable is insignificant. It never declines. Whereas Adjusted R-squared increases only when independent variable is significant and affects dependent variable.

3. RMSE ( Root Mean Square Error)
It explains how close the actual data points are to the model’s predicted values. It measures standard deviation of the residuals.

In the formula above, yi is the actual values of dependent variable and yi-hat is the predicted values, n – sample size.
Important Point

RMSE has the same unit as dependent variable. For example, dependent variable is sales which is measured in dollars. Let’s say RMSE of this sales model comes out 21. We can say it is 21 dollars.

What is good RMSE score?

There is no thumb rule regarding good or bad RMSE score. It is because it is dependent on your dependent variable. If your target variable lies between 0 to 100. RMSE of 0.5 can be considered as good but same 0.5 RMSE can be considered as a poor score if dependent variable ranges from 0 to 10. Hence there is so such good or bad RMSE by simply looking at the value.
Lower values of RMSE indicate better fit. RMSE is a good measure of how accurately the model predicts the response, and is the most important criterion for fit if the main purpose of the model is prediction.

The RMSE for your training and your test sets should be very similar if you have built a good model. If the RMSE for the test set is much higher than that of the training set, it is likely that you’ve badly over fit the data, i.e. you’ve created a model that works well in sample, but has little predictive value when tested out of sample


RMSE amplifies and severely punishes large errors as compared to mean absolute error (MAE).

R-Squared vs RMSE

R-squared is in proportion and has no units associated to target variable whereas RMSE has units associated to target variable. Hence, R-squared is a relative measure of fit, RMSE is an absolute measure of fit.

The code below covers the assumption testing and evaluation of model performance :

  1. Data Preparation
  2. Testing of Multicollinearity
  3. Treatment of Multicollinearity
  4. Checking for Autocorrelation
  5. Checking for Outliers
  6. Checking for Heteroscedasticity
  7. Testing of Normality of Residuals
  8. Forward, Backward and Stepwise Selection
  9. Calculating RMSE
  10. Box Cox Transformation of Dependent Variable
  11. Calculating R-Squared and Adj, R-squared manually
  12. Calculating Residual and Predicted values
  13. Calculating Standardized Coefficient

R Script : Linear Regression

Theory part is over. Let’s implement linear regression with  R –

Load required packages


Make sure the above listed packages are already installed and loaded into R. If they are not already installed, you need to install it by using the command install.packages(“package-name”)

Read Data

We will use mtcars dataset from cars package. This data was extracted from the Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles.

#Loading data

# Looking at variables

'data.frame': 32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Variable description of the above variables are listed below against their respective variable names.

mpgMiles/(US) gallon
cylNumber of cylinders
dispDisplacement (
hpGross horsepower
dratRear axle ratio
wtWeight (1000 lbs)
qsec1/4 mile time
amTransmission (0 = automatic, 1 = manual)
gearNumber of forward gears
carbNumber of carburetors

Summarize Data

In this dataset, mpg is a target variable. See first 6 rows of data by using head() function.

> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1


> summary(mtcars)
      mpg             cyl             disp             hp       
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
      drat             wt             qsec             vs        
 Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
 Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
 Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
 3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
 Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
       am              gear            carb      
 Min.   :0.0000   Min.   :3.000   Min.   :1.000  
 1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
 Median :0.0000   Median :4.000   Median :2.000  
 Mean   :0.4062   Mean   :3.688   Mean   :2.812  
 3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
 Max.   :1.0000   Max.   :5.000   Max.   :8.000 

Data Preparation

Make sure categorical variables are stored as factors. In the program below, we are converting variables to factors.

mtcars$am   = as.factor(mtcars$am)
mtcars$cyl  = as.factor(mtcars$cyl)
mtcars$vs   = as.factor(mtcars$vs)
mtcars$gear = as.factor(mtcars$gear)

Identifying and Correcting Collinearity

In this step, we are identifying independent variables which are highly correlated to each other. Since mpg is a dependent variable, we are removing it in the code below.

#Dropping dependent variable for calculating Multicollinearity
mtcars_a = subset(mtcars, select = -c(mpg))

#Identifying numeric variables
numericData <- mtcars_a[sapply(mtcars_a, is.numeric)]

#Calculating Correlation
descrCor <- cor(numericData)

# Print correlation matrix and look at max correlation
           disp         hp        drat         wt        qsec       carb
disp  1.0000000  0.7909486 -0.71021393  0.8879799 -0.43369788  0.3949769
hp    0.7909486  1.0000000 -0.44875912  0.6587479 -0.70822339  0.7498125
drat -0.7102139 -0.4487591  1.00000000 -0.7124406  0.09120476 -0.0907898
wt    0.8879799  0.6587479 -0.71244065  1.0000000 -0.17471588  0.4276059
qsec -0.4336979 -0.7082234  0.09120476 -0.1747159  1.00000000 -0.6562492
carb  0.3949769  0.7498125 -0.09078980  0.4276059 -0.65624923  1.0000000

# Visualize Correlation Matrix
corrplot(descrCor, order = "FPC", method = "color", type = "lower", tl.cex = 0.7, tl.col = rgb(0, 0, 0))
# Checking Variables that are highly correlated
highlyCorrelated = findCorrelation(descrCor, cutoff=0.7)

#Identifying Variable Names of Highly Correlated Variables
highlyCorCol = colnames(numericData)[highlyCorrelated]

#Print highly correlated attributes

[1] "hp"   "disp" "wt"  

#Remove highly correlated variables and create a new dataset
dat3 = mtcars[, -which(colnames(mtcars) %in% highlyCorCol)]

[1] 32  8

There are three variables “hp”   “disp” “wt” that found to be highly correlated. We have removed them to avoid collinearity. Now, we have 7 independent variables and 1 dependent variable.

Developing Regression Model

At this step, we are building multiple linear regression model.

#Build Linear Regression Model
fit = lm(mpg ~ ., data=dat3)

#Check Model Performance

#Extracting Coefficients


See the coefficients of Linear Regression Model and ANOVA table

Linear regression model tests the null hypothesis that the estimate is equal to zero. An independent variable that has a p-value less than 0.05 means we are rejecting the null hypothesis at 5% level of significance. It means the coefficient of that variable is not equal to 0. A large p-value implies variable is meaningless in order to predict target variable.

> summary(fit)
lm(formula = mpg ~ ., data = dat3)

    Min      1Q  Median      3Q     Max 
-5.4850 -1.3058  0.1856  1.5278  5.2439 

            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  16.7823    19.6148   0.856    0.401  
cyl6         -1.8025     2.6085  -0.691    0.497  
cyl8         -3.5873     4.0324  -0.890    0.383  
drat          1.4283     2.1997   0.649    0.523  
qsec          0.1003     0.7729   0.130    0.898  
vs1           0.7068     2.3291   0.303    0.764  
am1           3.2396     2.4702   1.311    0.203  
gear4         1.3869     3.0466   0.455    0.653  
gear5         2.3776     3.4334   0.692    0.496  
carb         -1.4836     0.6305  -2.353    0.028 *
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.004 on 22 degrees of freedom
Multiple R-squared:  0.8237, Adjusted R-squared:  0.7516 
F-statistic: 11.42 on 9 and 22 DF,  p-value: 1.991e-06

> anova(fit)
Analysis of Variance Table

Response: mpg
          Df Sum Sq Mean Sq F value    Pr(>F)    
cyl        2 824.78  412.39 45.7033 1.464e-08 ***
drat       1  14.45   14.45  1.6017   0.21890    
qsec       1   2.83    2.83  0.3137   0.58108    
vs         1   1.02    1.02  0.1132   0.73969    
am         1  26.35   26.35  2.9198   0.10157    
gear       2   8.15    4.07  0.4513   0.64254    
carb       1  49.96   49.96  5.5363   0.02798 *  
Residuals 22 198.51    9.02                      
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The critical plots of linear regression model are shown below –

  1. Residuals vs Fitted
  2. Normal Q-Q
  3. Scale Location
  4. Residuals vs Leverage

Calculating Model Performance Metrics

#Extracting R-squared value

[1] 0.8237094

#Extracting Adjusted R-squared value

[1] 0.7515905

[1] 171.2156


[1] 187.3387

Higher R-Squared and Adjusted R-Squared value, better the model. Whereas, lower the AIC and BIC score, better the model.

Understanding AIC and BIC

AIC and BIC are measures of goodness of fit. They penalize complex models. In other words, it penalize the higher number of estimated parameters. It believes in a concept that a model with fewer parameters is to be preferred to one with more. In general, BIC penalizes models more for free parameters than does AIC.  Both criteria depend on the maximized value of the likelihood function L for the estimated model.

AIC value roughly equals the number of parameters minus the likelihood of the overall model. Suppose you have two models, the model with the lower AIC and BIC score is better.

Variable Selection Methods

There are three variable selection methods – Forward, Backward, Stepwise.
1. Starts with a single variable, then adds variables one at a time based on AIC (‘Forward’)
2. Starts with all variables, iteratively removing those of low importance based on AIC (‘Backward’)
3. Run in both directions (‘Stepwise’)

#Stepwise Selection based on AIC
step <- stepAIC(fit, direction="both")

#Backward Selection based on AIC
step <- stepAIC(fit, direction="backward")

#Forward Selection based on AIC
step <- stepAIC(fit, direction="forward")

#Stepwise Selection with BIC
n = dim(dat3)[1]
stepBIC = stepAIC(fit,k=log(n))

> summary(stepBIC)
lm(formula = mpg ~ vs + am + carb, data = dat3)

    Min      1Q  Median      3Q     Max 
-6.2803 -1.2308  0.4078  2.0519  4.8197 

            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  19.5174     1.6091  12.130 1.16e-12 ***
vs1           4.1957     1.3246   3.168  0.00370 ** 
am1           6.7980     1.1015   6.172 1.15e-06 ***
carb         -1.4308     0.4081  -3.506  0.00155 ** 
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.962 on 28 degrees of freedom
Multiple R-squared:  0.7818, Adjusted R-squared:  0.7585 
F-statistic: 33.45 on 3 and 28 DF,  p-value: 2.138e-09

Look at the estimates above after performing stepwise selection based on BIC. Variables have been reduced but Adjusted R-Squared remains same (very slightly improved). AIC and BIC scores also went down which indicates a better model.


Calculate Standardized Coefficients

Standardized Coefficients helps to rank predictors based on absolute value of standardized estimates. Higher the value, more important the variable.

#Standardised coefficients

#R Function : Manual Calculation of Standardised coefficients
stdz.coff <- function (regmodel)
{ b <- summary(regmodel)$coef[-1,1]
sx <- sapply(regmodel$model[-1], sd)
sy <- sapply(regmodel$model[1], sd)
beta <- b * sx / sy

std.Coeff = data.frame(Standardized.Coeff = stdz.coff(stepBIC))
std.Coeff = cbind(Variable = row.names(std.Coeff), std.Coeff)
row.names(std.Coeff) = NULL

Calculating Variance Inflation Factor (VIF)

Variance inflation factor measure how much the variance of the coefficients are inflated as compared to when independent variables are not highly non-correlated. It should be less than 5.


Testing Other Assumptions

#Autocorrelation Test

#Normality Of Residuals (Should be > 0.05)

#Testing for heteroscedasticity (Should be > 0.05)

#Outliers – Bonferonni test

#See Residuals
resid = residuals(stepBIC)

#Relative Importance

See Actual vs. Prediction

#See Predicted Value
pred = predict(stepBIC,dat3)
#See Actual vs. Predicted Value
finaldata = cbind(mtcars,pred)
print(head(subset(finaldata, select = c(mpg,pred))))

                   mpg     pred
Mazda RX4         21.0 20.59222
Mazda RX4 Wag     21.0 20.59222
Datsun 710        22.8 29.08031
Hornet 4 Drive    21.4 22.28235
Hornet Sportabout 18.7 16.65583
Valiant           18.1 22.28235

Other Useful Functions

#Calculating RMSE
rmse = sqrt(mean((dat3$mpg - pred)^2))

#Calculating Rsquared manually
y = dat3[,c("mpg")]
R.squared = 1 - sum((y-pred)^2)/sum((y-mean(y))^2)

#Calculating Adj. Rsquared manually
n = dim(dat3)[1]
p = dim(summary(stepBIC)$coeff)[1] - 1
adj.r.squared = 1 - (1 - R.squared) * ((n - 1)/(n-p-1))

#Box Cox Transformation

K-fold cross-validation

In the program below, we are performing 5-fold cross-validation. In 5-fold cross-validation, data is randomly split into 5 equal sized samples. Out of the 5 samples, a single sample which is random 20% of data is retained as validation data , and the remaining 80% is used as training data. This process is then repeated 5 times, with each of the 5 samples used exactly once as validation data. Later we average out the results.

kfold = cv.lm(data=dat3, stepBIC, m=5)

Basic Tutorials

What is Partial Correlation?

Partial correlation explains the correlation between two continuous variables (let’s say X1 and X2) holding X3 constant for both X1 and X2.

Partial Correlation Mathematical Formula

In this case, r12.3 is the correlation between variables x1 and x2 keeping x3 constant. r₁3 is the correlation between variables x1 and x3.

Let’s take an example –

Suppose we want to see the relationship between sales and number of high performing employees keeping promotion budget constant. In this case, sales is the variable1 and high performing employees is the variable 2 and promotion budget the variable3.


  1. Relationship between demand of coffee and tea keeping prices of tea controlled.
  2. Relationship between GMAT score and number of hours studied keeping SAT score constant.
  3. Relationship between weight and number of meals intake while controlling age
  4. Relationship between bank deposits and interest rate keeping household rate constant.

What is Semipartial Correlation

Semipartial correlation measures the strength of linear relationship between variables X1 and X2 holding X3 constant for just X1 or just X2. It is also called part correlation.

In the above image,  r1(2.3) means the semipartial correlation between variables X1 and X2 where X3 is constant for X2.

Difference between Partial and Semipartial Correlation

Partial correlation holds variable X3 constant for both the other two variables. Whereas, Semipartial correlation holds variable X3 for only one variable (either X1 or X2). Hence, it is called ‘semi‘partial.

Assumptions : Partial and Semipartial Correlation

  1. Variables should be continuous in nature. For example, weight, GMAT score, sales etc
  2. There should be linear relationship between all the three variables. If a variable has non-linear relationship, transform it or ignore the variable.
  3. There should be no extreme values (i.e outliers). If outliers are present, we need to treat them either by percentile capping or remove the outlier observations
  4. Variables you want to hold constant can be one or more than one

SAS Code : Partial Correlation Coefficient

In this example, we are checking association between height and weight keeping age constant.

PROC CORR data=sashelp.class;
 Var Height;
 With weight;
 Partial age;

The partial correlation coefficient between weight and height is 0.70467 holding age constant. The p-value for the coefficient is 0.0011. It means we can reject the null hypothesis and concludes that coefficient is significantly different from zero.

R Script : Partial Correlation Coefficient

# Load Library
# Read data
# Partial correlation between “height” and “weight” given “age”
with(mydata, pcor.test(Height,Weight,Age))

R Script : Semipartial Correlation Coefficient

Semi partial correlation – Age constant for Weight only

with(mydata, spcor.test(Height,Weight,Age))

estimate    p.value statistic  n gp  Method
0.4118409 0.08947395  1.807795 19  1 pearson
The estimate value is Pearson Semipartial correlation coefficient.

Semi partial correlation coefficient – Age constant for Height only

with(mydata, spcor.test(Weight,Height,Age))

   estimate    p.value statistic  n gp  Method

1 0.4732797 0.04727912  2.149044 19  1 pearson

Squared Partial and Semipartial Correlation

In regression, squared partial and squared semipartial correlation coefficients are used.

Squared partial correlation tells us how much of the variance in dependent variable (Y) that is not explained by variable X2 but explained by X1. In other words, it is the proportion of the variation in dependent variable that was left unexplained by other predictors / independent variables but has been explained by independent variable X1.

Here, R²y.12 is the r-squared from the regression model in which X1 and X2 are independent variables.

Squared Semi-partial correlation tells us how much of the unique contribution of an independent variable to the total variation in dependent variable. In other words, it explains increment in R-square when an independent variable is added.

Squared Partial correlation will always be greater than or equal to squared semi-partial correlation.

Squared Partial Correlation >= Squared Semi-partial Correlation

SAS Code  : Squared Partial and Semi-Partial Correlation
In PROC REG, the PCORR2 option tells SAS to produce squared-partial correlation and SCORR2 option tells SAS to produce squared semi-partial correlation. The STB option is used to generate standardized estimate and TOL is used to calculate tolerance.

Proc Reg data= Readin;
Model Overall = VAR1 – VAR5 / SCORR2 PCORR2 STB TOL ;

The squared semi-partial correlation between Overall and VAR1 tells us model R-square is added by 0.18325 if  VAR1 is included in the model.
The squared partial correlation between Overall and VAR1 tells us the proportion of variance in Overall that is not explained by the other independent variables, 43% is explained by VAR1.

Which indicates variable importance?

Squared Semipartial correlation indicates variable importance because it measures incremental value in R-Square. We can rank variables based on high to low values of squared semipartial correlation score.

Relationship between Squared Semipartial correlation and Standardized Estimate

Squared Semipartial Correlation = (Standardized Estimate)² * Tolerance

Can individual squared semi-partial correlation add to R-squared?
Answer is NO. It is because the total variation in dependent variable also constitutes a portion that is due to within correlation between two independent variables.

Basic Tutorials

Hypothesis Testing : Meaning

You have sample data and you are asked to assess the credibility of a statement about population using sample data.

In other words, we use a random sample of data taken from a population to describe and make inferences about the population. For example, Indian Government wants to know about response of its citizens on a new policy. It is not possible to reach out each citizen to collect feedback as it’s a very expensive and time-consuming process. Instead they reach out to a sample of it from each district or state and make judgement whether people are happy with the policy or not.

Statistical significance evaluates the likelihood that an observed (actual) difference is due to chance. It deals with the following question :

If we selected many samples from the same population, would we still find the same relationship between these two variables in every sample? Or is our finding due only to random chance?

Independent T-Test

The independent t test evaluates whether the means for two independent groups are significantly different from each other. It is used for just 2 groups of samples. If you have more than 2 groups of samples, you should use ANOVA.

  1. Each score is sampled independently and randomly. 
  2. The scores are normally distributed within each of the two groups.
  3. The variance in each of the groups is equal. 

Case StudyHow powerful are rumors? Frequently, students ask friends and/or look at instructor evaluations to decide if a class is worth taking. Kelley (1950) found that instructor reputation has a profound impact on actual teaching ratings.  [Source : Journal of Personality, 18, 431-439]Experimental Design Before viewing the lecture, students were given a summary of the instructors prior teaching evaluations. There were two types of instructors : Charismatic instructor and Punitive instructor.Null HypothesisIt is a statement that you want to test. It usually states that there is no relationship between the two variables. In this case, the null hypothesis states that there is no difference between the mean ratings of the charismatic-teacher-reputation condition and the punitive-teacher-reputation condition.Alternate HypothesisIt is contrary to the null hypothesis. It usually states that there is a relationship between the two variables. In this case, the alternate hypothesis states that there is a difference between the mean ratings of the charismatic-teacher-reputation condition and the punitive-teacher-reputation condition.

What is p-value in simple terms?

P-value evaluate how well the sample data support that the null hypothesis is true. A low P value means that your sample provides enough evidence that you can reject the null hypothesis for the entire population. In technical language, it means lowest level of significance at which you can reject the null hypothesis.Type I and II Errors Examples1. Let’s say you are testing a new drug for some disease.
Null Hypothesis : New Drug has no effect on disease. In a test of its effectiveness, a type I error would be to say it has an effect when it does not (False Positive) ; a type II error would be to say it has no effect when it does (False Negative).

2. Null Hypothesis : Person is innocent
If an innocent person is convicted, it is a type I error (False Positive). If court lets guilty person go free, it is type II error (False Negative). Description of type I and II error is shown in the image below –

Hypothesis Testing : Type I and II Errors

InterpretationAn independent-samples t-test was used to test the difference between the mean ratings of the charismatic-teacher-reputation condition and the punitive-teacher-reputation condition. The output from SPSS is shown below 

Assumption Check

The columns labeled “Levene’s Test for Equality of Variances” tell us whether an assumption of the t-test has been met. The t-test assumes that the variance in each of the groups is approximately equal.

Look at the column labeled “Sig.” under the heading “Levene’s Test for Equality of Variances”. In this example, the significance (p value) of Levene’s test is .880. If this value is less than or equal to 5% level of significance (.05), then you can reject the null hypothesis that the variability of the two groups is equal, implying that the variances are unequal. 

If the significance (p value) of Levene’s test is less than or equal to 5% level of significance (.05), then you should use the bottom row of the output (the row labeled “Equal variances not assumed”

If the significance (p value) of Levene’s test is greater than 5% level of significance (.05), then you should use the middle row of the output (the row labeled “Equal variances assumed”

In this example, .880 is larger than 0.05, so we will assume that the variances are equal and we will use the middle row of the output.ConclusionThe column labeled “Sig. (2-tailed)” gives the two-tailed p value associated with the test. In this example, the p value is .018. Since p-value .018 is less than .05, so we reject null hypothesis. That implies that there is a significant difference between the mean ratings of the charismatic-teacher-reputation condition and the punitive-teacher-reputation condition.

Basic Tutorials

Measure of Central Tendency

It  describes a whole set of data with a single value that represents the centre of its distribution.
There are three main measures of central tendency: the mode, the median and the mean. 

It is the sum of the observations divided by the sample size.

The mean of the values 5,6,6,8,9,9,9,9,10,10 is (5+6+6+8+9+9+9+9+10+10)/10 = 8.1

Limitation :  
It is affected by extreme values. Very large or very small numbers can distort the answer

It is the middle value. It splits the data in half. Half of the data are above the median; half of the data are below the median.

Advantage :  It is NOT affected by extreme values. Very large or very small numbers does not affect it

It is the value that occurs most frequently in a dataset

Advantage :  It can be used when the data is not numerical.

Disadvantage :
1. There may be no mode at all if none of the data is the same
2. There may be more than one mode