Linear Regression in R


  • Used when the independent variables are continuous only, or a mix of continuous and categorical
    • For example, Attrition = f(Salary), where Salary is a continuous independent variable
    • The relation is linear if the dependent variable is linearly related to the independent variable; here the dependent variable Attrition is linearly related to the independent variable Salary
  • NOTE: If the linear regression problem contains categorical independent variables, it reduces to an ANOVA model
  • Equation of the line is y = mx + c + E where 
    • y is Dependent variable
    • m is Slope/Coefficient
    • x is Independent variable
    • c is Intercept 
    • E is Error - the residual, i.e. the combined effect of factors not captured by the model; as more of these factors are accommodated in the model, E decreases
SCENARIO: Consider the details of credit card expenses for households with a certain income level
NULL Hypothesis H0: For y = mx + c + E, the coefficient (slope) m is 0, i.e. the dependent and independent variables are not related to each other, and the given independent variable can therefore be dropped
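
A minimal sketch, using simulated data (the names x, y and toy_model are illustrative, not from the dataset), showing how lm() estimates m and c and how the slope's p-value tests the null hypothesis m = 0:

## Simulate y = mx + c + E with m = 25, c = 1500 and random error E
set.seed(42)
x <- runif(100, 20, 70)
y <- 25 * x + 1500 + rnorm(100, sd = 300)
toy_model <- lm(y ~ x)
coef(toy_model)                                   # estimated c (Intercept) and m (slope of x)
summary(toy_model)$coefficients["x", "Pr(>|t|)"]  # p-value for H0: m = 0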

The R code below uses this dataset, saved in a .csv file
## Read Consumer data
Consumer <- read.csv(file.choose())

##View Structure of Data
str(Consumer)
tail(Consumer)

##Clean the data by removing the four redundant columns
Consumer$X <- NULL
Consumer$X.1 <- NULL
Consumer$X.2 <- NULL
Consumer$X.3 <- NULL
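
## Alternative (a sketch; assumes the data has exactly these three meaningful columns):
## keep only the columns used in the analysis below
# Consumer <- Consumer[, c("Income", "HouseholdSize", "AmountCharged")]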

##Load the psych and lmtest Libraries
library(psych)
library(lmtest)#install.packages("lmtest")

## Summary Statistics
describe(Consumer)
cor(Consumer) #Correlation
# Implies - the dependent variable AmountCharged is strongly correlated with the
# independent variables Income and HouseholdSize; Income and HouseholdSize are not
# strongly correlated with each other
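
## Optional sketch: formally test whether Income and HouseholdSize are correlated
## (cor.test is in base R's stats package)
with(Consumer, cor.test(Income, HouseholdSize))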

#install.packages("ggplot2")
library(ggplot2)

par(mfrow = c(3, 1))

with(Consumer, boxplot(Income, main="Income (1000) US$"))
with(Consumer, boxplot(HouseholdSize, main="Household Size"))
with(Consumer, boxplot(AmountCharged, main="Amount Charged US$"))

par(mfrow = c(2, 1))

with(Consumer, plot(HouseholdSize, AmountCharged, pch=19, cex=0.6))
with(Consumer, plot(Income, AmountCharged, pch=19, cex=0.6))
## Simple Regressions with one Independent Variable
reg1 <- lm(AmountCharged ~ Income, data=Consumer)
reg1
summary(reg1)

anova(reg1)

reg2 <- lm(AmountCharged ~ HouseholdSize, data=Consumer)
reg2
summary(reg2)

anova(reg2)

## Multiple Regression with 2 Independent Variables
reg3 <- lm(AmountCharged ~ Income + HouseholdSize, data=Consumer)
reg3
summary(reg3)
anova(reg3)
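
## Optional sketch: partial F-test comparing the nested models - does adding
## HouseholdSize to the Income-only model significantly improve the fit?
anova(reg1, reg3)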

##Extract the fitted values and residual values from the reg3 output
fitted(reg3)
residuals(reg3)

fit3 <- fitted(reg3)
res3 <- residuals(reg3)

##Merge the fitted and residual values with Consumer dataset for comparison sake
ConsumerReg <- cbind(Consumer, fit3, res3)

##Plot the actual versus fitted values in a plot
with(ConsumerReg, plot(AmountCharged, fit3, pch=19, cex=0.6))
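
## Optional sketch: add a 45-degree reference line; points lying close to it mean
## the fitted values track the actual values well
abline(0, 1, col = "red")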

## Prediction of new observations
newobs <- data.frame(Income = 40, HouseholdSize = 3)
newobs

predict.lm(reg3, newdata=newobs)

# newobs <- data.frame(Income = c(40,50), HouseholdSize = c(3, 4))
# predict.lm(reg3, newdata=newobs)
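
## Optional sketch: interval estimates for the prediction - "confidence" gives an
## interval for the mean response, "prediction" for an individual household
predict.lm(reg3, newdata=newobs, interval="confidence")
predict.lm(reg3, newdata=newobs, interval="prediction")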

##Assumptions
##Linear relationship
par(mfrow = c(2, 1))
with(Consumer, plot(HouseholdSize, AmountCharged, pch=19, cex=0.6))
with(Consumer, plot(Income, AmountCharged, pch=19, cex=0.6))

##Multivariate normality - checked here with a Shapiro-Wilk test on each variable
with(Consumer, shapiro.test(Income))
with(Consumer, shapiro.test(HouseholdSize))

par(mfrow = c(2, 1))
with(Consumer, qqnorm(Income, pch=19, cex=0.6))
with(Consumer, qqline(Income, col='red'))
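
## Optional sketch: the normality assumption is often checked on the residuals
## directly (res3 from above)
shapiro.test(res3)
qqnorm(res3, pch=19, cex=0.6)
qqline(res3, col='red')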

##No or little multicollinearity - Correlation or VIF test
library(corrplot)#install.packages("corrplot")
corrplot(cor(Consumer))

#library(car)#install.packages("car") - car::vif() works directly on an lm fit
#vif(reg3)   # variance inflation factors; values above ~5-10 suggest multicollinearity
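
## VIF can also be computed by hand (a sketch): regress one independent variable on
## the others and use VIF = 1 / (1 - R-squared)
1 / (1 - summary(lm(Income ~ HouseholdSize, data = Consumer))$r.squared)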

##No auto-correlation - Durbin Watson Test
#Null Hypothesis - there is no autocorrelation
dwtest(reg3)

##Homoscedasticity (constant error variance) - Goldfeld-Quandt Test
#Null hypothesis : Data is homoscedastic
gqtest(reg3)
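
## Optional sketch: the Breusch-Pagan test (bptest, also in lmtest) is another
## common check; its null hypothesis is likewise homoscedasticity
bptest(reg3)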

NOTE:
  • P-value of the model: the summary() output gives the p-value for the regression model as a whole; the model is significant if p < 0.05
  • Adjusted R-squared is used to compare models. Say a model is built with 10 variables and its Adjusted R-squared is calculated; the business then adds one more variable and a new model is built with these 11 variables. The effect of adding the new variable is judged by comparing the Adjusted R-squared values of the two models, as sketched below
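A minimal sketch of that comparison, reusing the models fitted above (reg1 uses Income only, reg3 adds HouseholdSize):
summary(reg1)$adj.r.squared  # Adjusted R-squared with Income only
summary(reg3)$adj.r.squared  # Adjusted R-squared after adding HouseholdSize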
NOTE:
  • For two independent variables, Y = c + m1*X1 + m2*X2 + E, i.e. AmountCharged = c + m1*Income + m2*HouseholdSize + E, where m1 and m2 are separate coefficients (see the sketch below)
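The fitted coefficients from reg3 map directly onto this equation (a sketch reusing the model fitted above):
coef(reg3)  # (Intercept) = c, Income coefficient = m1, HouseholdSize coefficient = m2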