In this set of exercises, we are going to use the
lm
and glm
functions to fit several generalized linear models to one dataset. Since this is a basic set of exercises, we will take a closer look at the arguments of these functions and at how to take advantage of the output of each function so we can find a model that fits our data.
Before starting this set of exercises, I strongly suggest you look at the R documentation of
lm
and glm
. Note: This set of exercises assumes that you have a basic understanding of generalized linear models.
Answers to the exercises are available here.
If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.
The dataset we will be using contains information about passengers of the Titanic, including whether they survived or not.
To obtain the data, run these lines of code.
if (!'titanic' %in% installed.packages()) install.packages('titanic')
library(titanic)
DATA <- titanic_train[,-c(1,4,9,11)]
Exercise 1
Linear regression
1. Use
DATA
to fit a linear model using the function lm
with the variables Age and Fare as independent variables and Survived as the dependent one. Save the regression in an object called lm_reg.
2. Use the function
glm
to perform the same task and save the regression in an object called glm_reg.
Exercise 2
If you print any of the previous objects, you will realize that there's not much information about the performance of the models; fortunately,
summary
is a great function to find out more about any statistical model you fit to a dataset. Depending on the model, summary
will produce different outputs.
- Apply
summary
to lm_reg
and to glm_reg
. You will find a slight difference between the two outputs; that is because glm
is more flexible than lm
.
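To see why the outputs differ, it helps to look at the class of each fitted object; this small sketch (assuming lm_reg and glm_reg were fitted as described above) shows that summary is a generic function that dispatches to a different method for each class:

```r
# lm() returns an object of class "lm"; glm() returns an object of
# class c("glm", "lm"). summary() is generic, so summary(lm_reg)
# calls summary.lm (reporting R-squared), while summary(glm_reg)
# calls summary.glm (reporting deviances and the AIC).
class(lm_reg)   # "lm"
class(glm_reg)  # "glm" "lm"
```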
Exercise 3
So far we have been assuming (incorrectly) that the dependent variable (
Survived
) follows a normal distribution, and that's why we have been performing a linear regression. Obviously, Survived
follows a binomial distribution: there are only two options, either the passenger survived (1) or the passenger wasn't that lucky and died (0). Since the data has a binomial distribution, we should perform a logistic regression. To do this exercise, use the function glm
to perform a logistic regression using Age
and Fare
as independent variables and save it in an object called bin_model
. Hint: Define the value of the argument family
properly. Exercise 4
Inside the family argument you can always specify a particular link; in case you don't, a default link will be used depending on the family you chose.
1. To find out the default link associated with a certain family, you can write the family name followed by parentheses (e.g.
gaussian()
). Find the default link associated with the binomial family.
2. Create a probit model with the same variables used in
bin_model
and save it in an object called bin_probit_model
. Exercise 5
Finding the right model requires comparing different models and selecting the best one. Although there are many performance measures, for now we will use the
AIC
as our measure (a smaller AIC is better). This means that bin_model
is better than bin_probit_model
, so let's continue working with bin_model
. Until now, an intercept has been part of the models. Create a logistic regression with the same variables but with no intercept.
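If you want to compare the candidate models side by side, the AIC function accepts several fitted models at once; this is a small sketch assuming bin_model and bin_probit_model were created in the previous exercises:

```r
# AIC() returns a data frame with one row per model, listing the
# degrees of freedom and the AIC; the model with the smaller AIC
# is preferred.
AIC(bin_model, bin_probit_model)
```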
Exercise 6
Impute data. If you run the
summary
function on any of the previous models, you will find out that 177 observations have been deleted due to missingness. This happens because the glm
function has the default argument na.action = "na.omit"
. This makes it easier to run a model with messier data, but that is not always great; you want to have full control and an understanding of what the function is doing. 1. There are some missing values in
Age
; replace these values with the median.
2. Update
glm_model
with the updated data, specifying na.action = 'na.fail'
. This will assure us that the dataset has no missing values; otherwise it will show an error. Exercise 7
Add polynomial independent variables. Some variables have a quadratic relationship with the dependent variable; this can be handled by specifying a quadratic term in the formula of the model.
Add a quadratic term for the variable
Fare
to the current model, specified in glm_model.
Exercise 8
Add categorical variables. Add
Sex
as an independent variable to the current model, specified in glm_model
. Note that Sex is not a numeric variable. Exercise 9
Now that we have found a good model that fits our data, it's time to use the
predict
function to see how well the model predicts on our own data. Use the function predict
to obtain the predictions of the model on DATA
and save them in Pred.default
Exercise 10
Pred.default
shows the predicted values under the link transformation, in this case logit. This is not easily interpretable; to fix this problem, we can specify the type
of prediction we want.- Obtain the predictions as probability values.
- Extra: What's the percent accuracy of this model if we assign died (0) when the predicted probability is less than 0.5 and survived (1) otherwise?
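One way to work out that accuracy is to cross-tabulate the thresholded predictions against the observed outcomes; the sketch below assumes bin_model is the final fitted model and that DATA has no missing values left in the predictors:

```r
# Response-scale predictions are probabilities of survival.
prob <- predict(bin_model, type = 'response')
# Apply the 0.5 threshold: below it predict died (0), otherwise survived (1).
pred <- ifelse(prob < 0.5, 0, 1)
# Confusion matrix; correct predictions sit on the diagonal.
conf <- table(Predicted = pred, Observed = DATA$Survived)
sum(diag(conf)) / sum(conf)   # proportion of passengers classified correctly
```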
_______________________________________________
Below are the solutions to these exercises on generalized linear models.
if (!'titanic' %in% installed.packages()) install.packages('titanic')
library(titanic)
## Warning: package 'titanic' was built under R version 3.3.3
DATA <- titanic_train[,-c(1,4,9,11)]

####################
#                  #
#    Exercise 1    #
#                  #
####################

(lm_reg <- lm(formula = Survived ~ Age + Fare, data = DATA))
##
## Call:
## lm(formula = Survived ~ Age + Fare, data = DATA)
##
## Coefficients:
## (Intercept)          Age         Fare
##    0.420973    -0.003517     0.002583
(glm_model <- glm(formula = Survived ~ Age + Fare, data = DATA, family = gaussian))
##
## Call:  glm(formula = Survived ~ Age + Fare, family = gaussian, data = DATA)
##
## Coefficients:
## (Intercept)          Age         Fare
##    0.420973    -0.003517     0.002583
##
## Degrees of Freedom: 713 Total (i.e. Null);  711 Residual
##   (177 observations deleted due to missingness)
## Null Deviance:       172.2
## Residual Deviance: 158   AIC: 957.2
####################
#                  #
#    Exercise 2    #
#                  #
####################

summary(lm_reg)
##
## Call:
## lm(formula = Survived ~ Age + Fare, data = DATA)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.0336 -0.3675 -0.3110  0.5563  0.7829
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.4209734  0.0409896  10.270  < 2e-16 ***
## Age         -0.0035166  0.0012209  -2.880  0.00409 **
## Fare         0.0025834  0.0003351   7.708  4.3e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4714 on 711 degrees of freedom
##   (177 observations deleted due to missingness)
## Multiple R-squared:  0.08263, Adjusted R-squared:  0.08005
## F-statistic: 32.02 on 2 and 711 DF,  p-value: 4.837e-14
summary(glm_model)
##
## Call:
## glm(formula = Survived ~ Age + Fare, family = gaussian, data = DATA)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.0336  -0.3675  -0.3110   0.5563   0.7829
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.4209734  0.0409896  10.270  < 2e-16 ***
## Age         -0.0035166  0.0012209  -2.880  0.00409 **
## Fare         0.0025834  0.0003351   7.708  4.3e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.2221983)
##
##     Null deviance: 172.21  on 713  degrees of freedom
## Residual deviance: 157.98  on 711  degrees of freedom
##   (177 observations deleted due to missingness)
## AIC: 957.25
##
## Number of Fisher Scoring iterations: 2
####################
#                  #
#    Exercise 3    #
#                  #
####################

(bin_model <- glm(formula = Survived ~ Age + Fare, data = DATA, family = binomial))
##
## Call:  glm(formula = Survived ~ Age + Fare, family = binomial, data = DATA)
##
## Coefficients:
## (Intercept)          Age         Fare
##    -0.41706     -0.01758      0.01726
##
## Degrees of Freedom: 713 Total (i.e. Null);  711 Residual
##   (177 observations deleted due to missingness)
## Null Deviance:       964.5
## Residual Deviance: 891.3   AIC: 897.3
####################
#                  #
#    Exercise 4    #
#                  #
####################

binomial()
##
## Family: binomial
## Link function: logit
(bin_probit_model <- glm(formula = Survived ~ Age + Fare, data = DATA, family = binomial(link = probit)))
##
## Call:  glm(formula = Survived ~ Age + Fare, family = binomial(link = probit),
##     data = DATA)
##
## Coefficients:
## (Intercept)          Age         Fare
##    -0.24598     -0.01028      0.00933
##
## Degrees of Freedom: 713 Total (i.e. Null);  711 Residual
##   (177 observations deleted due to missingness)
## Null Deviance:       964.5
## Residual Deviance: 894.4   AIC: 900.4
####################
#                  #
#    Exercise 5    #
#                  #
####################

(bin_model_no_int <- glm(formula = Survived ~ 0 + Age + Fare, data = DATA, family = binomial(link = logit)))
##
## Call:  glm(formula = Survived ~ 0 + Age + Fare, family = binomial(link = logit),
##     data = DATA)
##
## Coefficients:
##      Age     Fare
## -0.02805  0.01594
##
## Degrees of Freedom: 714 Total (i.e. Null);  712 Residual
##   (177 observations deleted due to missingness)
## Null Deviance:       989.8
## Residual Deviance: 896.4   AIC: 900.4
####################
#                  #
#    Exercise 6    #
#                  #
####################

(bin_model <- glm(formula = Survived ~ Age + Fare, data = DATA, family = binomial(link = logit), na.action = 'na.omit'))
##
## Call:  glm(formula = Survived ~ Age + Fare, family = binomial(link = logit),
##     data = DATA, na.action = "na.omit")
##
## Coefficients:
## (Intercept)          Age         Fare
##    -0.41706     -0.01758      0.01726
##
## Degrees of Freedom: 713 Total (i.e. Null);  711 Residual
##   (177 observations deleted due to missingness)
## Null Deviance:       964.5
## Residual Deviance: 891.3   AIC: 897.3
Impute <- median(DATA$Age, na.rm = TRUE)
DATA$Age[is.na(DATA$Age)] <- Impute
(bin_model_Impute <- glm(formula = Survived ~ Age + Fare, data = DATA, family = binomial(link = logit), na.action = 'na.fail'))
##
## Call:  glm(formula = Survived ~ Age + Fare, family = binomial(link = logit),
##     data = DATA, na.action = "na.fail")
##
## Coefficients:
## (Intercept)          Age         Fare
##    -0.47997     -0.01682      0.01620
##
## Degrees of Freedom: 890 Total (i.e. Null);  888 Residual
## Null Deviance:       1187
## Residual Deviance: 1109    AIC: 1115
####################
#                  #
#    Exercise 7    #
#                  #
####################

(bin_model <- glm(formula = Survived ~ Age + poly(Fare, 2), data = DATA, family = binomial(link = logit)))
##
## Call:  glm(formula = Survived ~ Age + poly(Fare, 2), family = binomial(link = logit),
##     data = DATA)
##
## Coefficients:
##    (Intercept)             Age  poly(Fare, 2)1  poly(Fare, 2)2
##        0.05118        -0.01812        18.41909       -10.24135
##
## Degrees of Freedom: 890 Total (i.e. Null);  887 Residual
## Null Deviance:       1187
## Residual Deviance: 1097    AIC: 1105
summary(bin_model)
##
## Call:
## glm(formula = Survived ~ Age + poly(Fare, 2), family = binomial(link = logit),
##     data = DATA)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.3643  -0.8806  -0.8030   1.2209   1.8474
##
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)
## (Intercept)      0.051181   0.180492   0.284  0.77674
## Age             -0.018117   0.005714  -3.171  0.00152 **
## poly(Fare, 2)1  18.419094   2.554456   7.211 5.57e-13 ***
## poly(Fare, 2)2 -10.241350   2.229437  -4.594 4.35e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 1186.7  on 890  degrees of freedom
## Residual deviance: 1096.6  on 887  degrees of freedom
## AIC: 1104.6
##
## Number of Fisher Scoring iterations: 4
####################
#                  #
#    Exercise 8    #
#                  #
####################

(bin_model <- glm(formula = Survived ~ Age + poly(Fare, 2) + as.factor(Sex), data = DATA, family = binomial(link = logit)))
##
## Call:  glm(formula = Survived ~ Age + poly(Fare, 2) + as.factor(Sex),
##     family = binomial(link = logit), data = DATA)
##
## Coefficients:
##        (Intercept)                 Age      poly(Fare, 2)1
##            1.28411            -0.01077            15.37627
##     poly(Fare, 2)2  as.factor(Sex)male
##           -6.59275            -2.37887
##
## Degrees of Freedom: 890 Total (i.e. Null);  886 Residual
## Null Deviance:       1187
## Residual Deviance: 877     AIC: 887
summary(bin_model)
##
## Call:
## glm(formula = Survived ~ Age + poly(Fare, 2) + as.factor(Sex),
##     family = binomial(link = logit), data = DATA)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.4390  -0.6053  -0.5619   0.7994   2.0913
##
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)
## (Intercept)         1.284114   0.225901   5.684 1.31e-08 ***
## Age                -0.010767   0.006558  -1.642  0.10066
## poly(Fare, 2)1     15.376271   2.770311   5.550 2.85e-08 ***
## poly(Fare, 2)2     -6.592748   2.521885  -2.614  0.00894 **
## as.factor(Sex)male -2.378874   0.171822 -13.845  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  877.04  on 886  degrees of freedom
## AIC: 887.04
##
## Number of Fisher Scoring iterations: 4
####################
#                  #
#    Exercise 9    #
#                  #
####################

DATA$Pred.default <- predict(bin_model)

####################
#                  #
#    Exercise 10   #
#                  #
####################

DATA$Prob <- predict(bin_model, type = 'response')
DATA$Pred <- ifelse(DATA$Prob < .5, 0, 1)
sum(DATA$Pred == DATA$Survived) / nrow(DATA)
## [1] 0.7822671
http://www.r-exercises.com/2017/09/16/generalized-linear-functions-beginners/
http://www.r-exercises.com/2017/09/16/generalized-linear-models-solutionbeginners/