Linear Regression in R


  • Used when the independent variables are continuous only, or a mix of continuous and categorical
    • For example, Attrition = f(Salary), where Salary is a continuous independent variable
    • The relation is linear if the dependent variable is linearly related to the independent variable; here the dependent variable Attrition is linearly related to the independent variable Salary
  • NOTE: If the linear regression problem contains categorical independent variables, it reduces to an ANOVA model
  • Equation of the line is y = mx + c + E where 
    • y is Dependent variable
    • m is Slope/Coefficient
    • x is Independent variable
    • c is Intercept 
    • E is Error - the residual, i.e. the combined effect of factors not captured by the model; as more of these factors are accommodated in the model, E decreases
SCENARIO: Consider the details of credit card expenses for households with a certain income level
NULL Hypothesis H0: For y = mx + c + E, the coefficient (slope) m is 0, i.e. the dependent and independent variables are not related to each other, and the given independent variable can therefore be dropped
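
A minimal sketch, using simulated data (the names x, y and toy_model are illustrative, not from the dataset), showing how lm() estimates m and c and how the slope's p-value tests the null hypothesis m = 0:

## Simulate y = mx + c + E with m = 25, c = 1500 and random error E
set.seed(42)
x <- runif(100, 20, 70)
y <- 25 * x + 1500 + rnorm(100, sd = 300)
toy_model <- lm(y ~ x)
coef(toy_model)                                   # estimated c (Intercept) and m (slope of x)
summary(toy_model)$coefficients["x", "Pr(>|t|)"]  # p-value for H0: m = 0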

The R code below uses this dataset, saved in a .csv file
## Read Consumer data
Consumer <- read.csv(file.choose())

##View Structure of Data
str(Consumer)
tail(Consumer)

##Clean the data by removing the four redundant columns
Consumer$X <- NULL
Consumer$X.1 <- NULL
Consumer$X.2 <- NULL
Consumer$X.3 <- NULL
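
## Alternative (a sketch; assumes the data has exactly these three meaningful columns):
## keep only the columns used in the analysis below
# Consumer <- Consumer[, c("Income", "HouseholdSize", "AmountCharged")]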

##Load the psych and lmtest Libraries
library(psych)
library(lmtest)#install.packages("lmtest")

## Summary Statistics
describe(Consumer)
cor(Consumer) #Correlation
# Implies - the dependent variable AmountCharged is strongly correlated with the
# independent variables Income and HouseholdSize; Income and HouseholdSize are not
# strongly correlated with each other
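
## Optional sketch: formally test whether Income and HouseholdSize are correlated
## (cor.test is in base R's stats package)
with(Consumer, cor.test(Income, HouseholdSize))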

#install.packages("ggplot2")
library(ggplot2)

par(mfrow = c(3, 1))

with(Consumer, boxplot(Income, main="Income (1000) US$"))
with(Consumer, boxplot(HouseholdSize, main="Household Size"))
with(Consumer, boxplot(AmountCharged, main="Amount Charged US$"))

par(mfrow = c(2, 1))

with(Consumer, plot(HouseholdSize, AmountCharged, pch=19, cex=0.6))
with(Consumer, plot(Income, AmountCharged, pch=19, cex=0.6))
## Simple Regressions with one Independent Variable
reg1 <- lm(AmountCharged ~ Income, data=Consumer)
reg1
summary(reg1)

anova(reg1)

reg2 <- lm(AmountCharged ~ HouseholdSize, data=Consumer)
reg2
summary(reg2)

anova(reg2)

## Multiple Regression with 2 Independent Variables
reg3 <- lm(AmountCharged ~ Income + HouseholdSize, data=Consumer)
reg3
summary(reg3)
anova(reg3)
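
## Optional sketch: partial F-test comparing the nested models - does adding
## HouseholdSize to the Income-only model significantly improve the fit?
anova(reg1, reg3)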

##Extract the fitted values and residual values from the reg3 output
fitted(reg3)
residuals(reg3)

fit3 <- fitted(reg3)
res3 <- residuals(reg3)

##Merge the fitted and residual values with Consumer dataset for comparison sake
ConsumerReg <- cbind(Consumer, fit3, res3)

##Plot the actual versus fitted values in a plot
with(ConsumerReg, plot(AmountCharged, fit3, pch=19, cex=0.6))
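
## Optional sketch: add a 45-degree reference line; points lying close to it mean
## the fitted values track the actual values well
abline(0, 1, col = "red")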

## Prediction of new observations
newobs <- data.frame(Income = 40, HouseholdSize = 3)
newobs

predict.lm(reg3, newdata=newobs)

# newobs <- data.frame(Income = c(40,50), HouseholdSize = c(3, 4))
# predict.lm(reg3, newdata=newobs)
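
## Optional sketch: interval estimates for the prediction - "confidence" gives an
## interval for the mean response, "prediction" for an individual household
predict.lm(reg3, newdata=newobs, interval="confidence")
predict.lm(reg3, newdata=newobs, interval="prediction")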

##Assumptions
##Linear relationship
par(mfrow = c(2, 1))
with(Consumer, plot(HouseholdSize, AmountCharged, pch=19, cex=0.6))
with(Consumer, plot(Income, AmountCharged, pch=19, cex=0.6))

##Multivariate normality - checked here with a Shapiro-Wilk test on each variable
with(Consumer, shapiro.test(Income))
with(Consumer, shapiro.test(HouseholdSize))

par(mfrow = c(2, 1))
with(Consumer, qqnorm(Income, pch=19, cex=0.6))
with(Consumer, qqline(Income, col='red'))
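
## Optional sketch: the normality assumption is often checked on the residuals
## directly (res3 from above)
shapiro.test(res3)
qqnorm(res3, pch=19, cex=0.6)
qqline(res3, col='red')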

##No or little multicollinearity - Correlation or VIF test
library(corrplot)#install.packages("corrplot")
corrplot(cor(Consumer))

#library(car)#install.packages("car") - car::vif() works directly on an lm fit
#vif(reg3)   # variance inflation factors; values above ~5-10 suggest multicollinearity
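
## VIF can also be computed by hand (a sketch): regress one independent variable on
## the others and use VIF = 1 / (1 - R-squared)
1 / (1 - summary(lm(Income ~ HouseholdSize, data = Consumer))$r.squared)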

##No auto-correlation - Durbin Watson Test
#Null Hypothesis - there is no autocorrelation
dwtest(reg3)

##Homoscedasticity (constant error variance) - Goldfeld-Quandt Test
#Null hypothesis : Data is homoscedastic
gqtest(reg3)
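
## Optional sketch: the Breusch-Pagan test (bptest, also in lmtest) is another
## common check; its null hypothesis is likewise homoscedasticity
bptest(reg3)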

NOTE:
  • P-value of the model: the summary() output gives the p-value for the regression model as a whole; the model is significant if p < 0.05
  • Adjusted R-squared is used to compare models. Say a model is built with 10 variables and its Adjusted R-squared is calculated; the business then adds one more variable and a new model is built with these 11 variables. The effect of adding the new variable is judged by comparing the Adjusted R-squared values of the two models, as sketched below
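A minimal sketch of that comparison, reusing the models fitted above (reg1 uses Income only, reg3 adds HouseholdSize):
summary(reg1)$adj.r.squared  # Adjusted R-squared with Income only
summary(reg3)$adj.r.squared  # Adjusted R-squared after adding HouseholdSize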
NOTE:
  • For two independent variables, Y = c + m1*X1 + m2*X2 + E, i.e. AmountCharged = c + m1*Income + m2*HouseholdSize + E, where m1 and m2 are separate coefficients (see the sketch below)
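The fitted coefficients from reg3 map directly onto this equation (a sketch reusing the model fitted above):
coef(reg3)  # (Intercept) = c, Income coefficient = m1, HouseholdSize coefficient = m2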