- Used when the independent variables are continuous only, or a mix of continuous and categorical
- For example, Attrition = f(Salary), where Salary is a continuous independent variable
- The relation is linear if the dependent variable changes in proportion to the independent variable. Here, the dependent variable Attrition is assumed to be linearly related to the independent variable Salary.
- NOTE: If all independent variables in a linear regression problem are categorical, the model is equivalent to an ANOVA model (a mix of continuous and categorical variables gives an ANCOVA model)
- Equation of the line is y = mx + c + E, where
- y is the dependent variable
- m is the slope (coefficient)
- x is the independent variable
- c is the intercept
- E is the error: the residual, i.e. the part of y that the line does not explain, including factors left out of the model. As relevant factors are added to the model, E tends to shrink.
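To make the roles of m, c and E concrete, here is a minimal sketch on simulated data (not the Consumer dataset). The values of `m_true`, `c_true` and the noise level are arbitrary assumptions chosen for illustration; `lm()` should recover estimates close to them.

```r
# Hypothetical illustration: simulated data, not the Consumer dataset
set.seed(42)

x      <- seq(1, 100)                         # independent variable
m_true <- 2.5                                 # assumed slope
c_true <- 10                                  # assumed intercept
E      <- rnorm(length(x), mean = 0, sd = 5)  # random error term

y <- m_true * x + c_true + E                  # y = mx + c + E

fit <- lm(y ~ x)

# "(Intercept)" estimates c, "x" estimates m
print(coef(fit))

# The residuals are the sample estimate of E; with an intercept in the
# model they sum to (numerically) zero
print(sum(residuals(fit)))
```

The estimates differ slightly from the assumed values because of the error term E.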
NULL Hypothesis H0: For y = mx + c + E, the coefficient (slope) m is 0, meaning the dependent and independent variables are not linearly related. If H0 cannot be rejected, the given independent variable can be dropped from the model.
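As a rough sketch of how this hypothesis is checked in R, the example below fits a model on simulated data with one predictor that truly drives y and one that is pure noise. The variable names `x_related` and `x_noise` and the 0.05 threshold are assumptions for illustration; the per-coefficient p-values come from the `Pr(>|t|)` column of `summary()`.

```r
# Hypothetical illustration: simulated data, not the Consumer dataset
set.seed(123)
n <- 200

x_related <- rnorm(n)                  # truly related to y (slope 3)
x_noise   <- rnorm(n)                  # unrelated to y (true slope 0)
y <- 3 * x_related + 5 + rnorm(n)

fit <- lm(y ~ x_related + x_noise)

# p-value of the t-test of H0: coefficient = 0, one per term
p_values <- summary(fit)$coefficients[, "Pr(>|t|)"]
print(p_values)

# TRUE means H0 is rejected at the 5% level and the variable is kept
reject_h0 <- p_values < 0.05
print(reject_h0)
```

For the related predictor the p-value is tiny, so H0 is rejected. For the noise predictor H0 is usually not rejected, although about 5% of random samples will reject it by chance.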
The R code below uses this dataset, saved as a .csv file
## Read Consumer data
Consumer <- read.csv(file.choose())
##View Structure of Data
str(Consumer)
tail(Consumer)
##Clean the data by removing the redundant 4 columns
Consumer$X <- NULL
Consumer$X.1 <- NULL
Consumer$X.2 <- NULL
Consumer$X.3 <- NULL
##Load the psych and lmtest Libraries
library(psych)
library(lmtest)#install.packages("lmtest")
## Summary Statistics
describe(Consumer)
cor(Consumer) #Correlation
# Implies - Dependent variable AmountCharged is strongly correlated with the
# independent variables Income and HouseholdSize. Income is not strongly
# correlated with HouseholdSize or vice versa
#install.packages("ggplot2")
library(ggplot2)
par(mfrow = c(3, 1))
with(Consumer, boxplot(Income, main="Income (1000) US$"))
with(Consumer, boxplot(HouseholdSize, main="Household Size"))
with(Consumer, boxplot(AmountCharged, main="Amount Charged US$"))
par(mfrow = c(2, 1))
with(Consumer, plot(HouseholdSize, AmountCharged, pch=19, cex=0.6))
with(Consumer, plot(Income, AmountCharged, pch=19, cex=0.6))
## Simple Regressions with one Independent Variable
reg1 <- lm(AmountCharged ~ Income, data=Consumer)
reg1
summary(reg1)
anova(reg1)
reg2 <- lm(AmountCharged ~ HouseholdSize, data=Consumer)
reg2
summary(reg2)
anova(reg2)
## Multiple Regression with 2 Independent Variables
reg3 <- lm(AmountCharged ~ Income + HouseholdSize, data=Consumer)
reg3
summary(reg3)
anova(reg3)
##Extract the fitted values and residual values from the reg3 output
fitted(reg3)
residuals(reg3)
fit3 <- fitted(reg3)
res3 <- residuals(reg3)
##Merge the fitted and residual values with Consumer dataset for comparison sake
ConsumerReg <- cbind(Consumer, fit3, res3)
##Plot the actual versus fitted values in a plot
with(ConsumerReg, plot(AmountCharged, fit3, pch=19, cex=0.6))
## Prediction of new observations
newobs <- data.frame(Income = 40, HouseholdSize = 3)
newobs
predict.lm(reg3, newdata=newobs)
# newobs <- data.frame(Income = c(40,50), HouseholdSize = c(3, 4))
# predict.lm(reg3, newdata=newobs)
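A point prediction says nothing about its uncertainty. The sketch below shows how `predict()` can also return confidence and prediction intervals. Since the Consumer file is not reproduced here, it uses simulated stand-in data with the same column names; the coefficients used to generate it are arbitrary assumptions, not results from the real dataset.

```r
# Hypothetical stand-in data; only the column names mirror Consumer
set.seed(1)
n <- 50
demo <- data.frame(
  Income        = runif(n, 20, 70),
  HouseholdSize = sample(1:7, n, replace = TRUE)
)
# Arbitrary generating coefficients, chosen only for illustration
demo$AmountCharged <- 1000 + 30 * demo$Income +
  350 * demo$HouseholdSize + rnorm(n, sd = 400)

fit <- lm(AmountCharged ~ Income + HouseholdSize, data = demo)

newobs <- data.frame(Income = c(40, 50), HouseholdSize = c(3, 4))

# Interval for the mean response at the new points
conf_int <- predict(fit, newdata = newobs, interval = "confidence", level = 0.95)

# Interval for an individual new observation (always wider)
pred_int <- predict(fit, newdata = newobs, interval = "prediction", level = 0.95)

print(conf_int)
print(pred_int)
```

Both results are matrices with columns `fit`, `lwr` and `upr`. The same calls should work on `reg3` with the real data.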
##Assumptions
##Linear relationship
par(mfrow = c(2, 1))
with(Consumer, plot(HouseholdSize, AmountCharged, pch=19, cex=0.6))
with(Consumer, plot(Income, AmountCharged, pch=19, cex=0.6))
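##Normality - Shapiro-Wilk test and Q-Q plot
#Null hypothesis: the data are normally distributed
#The regression assumption concerns the residuals, so they are tested as well
with(Consumer, shapiro.test(Income))
with(Consumer, shapiro.test(HouseholdSize))
shapiro.test(residuals(reg3))
par(mfrow = c(2, 1))
with(Consumer, qqnorm(Income, pch=19, cex=0.6))
with(Consumer, qqline(Income, col='red'))
qqnorm(residuals(reg3), pch=19, cex=0.6)
qqline(residuals(reg3), col='red')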
##No or little multicollinearity - Correlation or VIF test
library(corrplot)#install.packages("corrplot")
corrplot(cor(Consumer))
#vif() for lm objects is provided by the car package
#library(car)#install.packages("car")
#vif(reg3)
##No auto-correlation - Durbin Watson Test
#Null Hypothesis - there is no autocorrelation
dwtest(reg3)
##Homoscedasticity (constant error variance) - Goldfeld-Quandt Test
#Null hypothesis: the residuals are homoscedastic
gqtest(reg3)
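The VIF mentioned in the multicollinearity check can also be computed by hand in base R, which shows what it measures: each predictor is regressed on the other predictors, and VIF = 1 / (1 - R-squared) of that auxiliary regression. The sketch below uses simulated stand-in data (an assumption, not the Consumer file). A common rule of thumb treats values above roughly 5 to 10 as a sign of multicollinearity.

```r
# Hypothetical stand-in data; only the column names mirror Consumer
set.seed(2)
n <- 50
demo <- data.frame(
  Income        = runif(n, 20, 70),
  HouseholdSize = sample(1:7, n, replace = TRUE)
)

# Auxiliary regressions: each predictor explained by the other one
r2_income <- summary(lm(Income ~ HouseholdSize, data = demo))$r.squared
r2_hsize  <- summary(lm(HouseholdSize ~ Income, data = demo))$r.squared

vif_income <- 1 / (1 - r2_income)
vif_hsize  <- 1 / (1 - r2_hsize)

# Values close to 1 mean the predictors carry little shared information
print(c(Income = vif_income, HouseholdSize = vif_hsize))
```

With only two predictors both VIF values are identical, because each auxiliary R-squared equals the squared correlation between the two variables.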
NOTE:
- P-value: The summary() output reports, along with the Adjusted R-squared value, the p-value for the overall regression model. The model is significant if p < 0.05
- The Adjusted R-squared value is used to compare models. Say a model starts with 10 variables and its Adjusted R-squared value is calculated. The business then adds one more variable, and a new model is built using these 11 variables. The Adjusted R-squared value is calculated for the new model as well. The effect of adding the new variable is judged by comparing the two Adjusted R-squared values: an increase means the new variable improves the model
- For two independent variables, each variable gets its own coefficient: Y = c + m1*X1 + m2*X2 + E, i.e. AmountCharged = c + m1*Income + m2*HouseholdSize + E
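The comparison described above can be sketched as follows. The example uses simulated data (an assumption, not the Consumer dataset) and adds a pure-noise variable to a one-variable model. Plain R-squared never decreases when a variable is added, while Adjusted R-squared applies a penalty for the extra variable. The helper `adj_r2()` is a hypothetical name introduced here to show the formula; it should match what `summary()` reports.

```r
# Hypothetical illustration: simulated data, not the Consumer dataset
set.seed(7)
n <- 100

x1    <- rnorm(n)                 # useful predictor
noise <- rnorm(n)                 # irrelevant predictor
y     <- 2 * x1 + 1 + rnorm(n)

small <- lm(y ~ x1)               # model before adding the variable
big   <- lm(y ~ x1 + noise)       # model after adding the variable

# Adjusted R-squared = 1 - (1 - R2) * (n - 1) / (n - p - 1),
# where p is the number of predictors (intercept excluded)
adj_r2 <- function(model) {
  r2    <- summary(model)$r.squared
  n_obs <- length(residuals(model))
  p     <- length(coef(model)) - 1
  1 - (1 - r2) * (n_obs - 1) / (n_obs - p - 1)
}

comparison <- data.frame(
  model         = c("small", "big"),
  r_squared     = c(summary(small)$r.squared, summary(big)$r.squared),
  adj_r_squared = c(adj_r2(small), adj_r2(big))
)
print(comparison)
```

If the Adjusted R-squared of the larger model is not higher than that of the smaller one, the added variable is not earning its place.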