The last few blog posts of this series discussed regression models. Fernando has selected the
best model: a multivariate regression model. The model takes the following form:
price = -55089.98 + 87.34 engineSize + 60.93 horsePower + 770.42 width
The model predicts or estimates price (target) as a function of engine size, horsepower, and width (predictors). All the predictors in the model are numeric.
What if there are qualitative variables? How can qualitative variables be used to enhance the models? How are they interpreted?
These are the questions this blog post will answer.
Fernando gets two such qualitative variables:
- fuelType: The type of fuel used. The value can be gas or diesel.
- driveWheels: The type of drive wheel. It has three values: four-wheel drive (4WD), rear-wheel drive (RWD), and front-wheel drive (FWD).
The data set looks like this.
Fernando wants to find out the effect these qualitative variables have on the price of the car.
Concept
Qualitative variables are variables that are not numerical. They fit the data into categories. They are also called categorical variables or factors.
Factors have levels. Levels are nothing but the unique values of a specific qualitative variable.
- Fuel type has two unique values: gas or diesel. This implies that fuel type has two levels.
- Drive wheels has three unique values: four-wheel drive, rear-wheel drive, and front-wheel drive. This means that drive wheels has three levels.
When a regression model uses a qualitative variable, the statistical engine creates dummy variables. The concept of the dummy variable is simple. It takes only two values: 0 or 1.
Let us look at an example. The sample data has five cars, and each car has a diesel or gas fuel type.
The fuel type is a qualitative variable. It has two levels (diesel or gas). The statistical package creates one dummy variable, named fuelTypegas. This variable takes the value 0 or 1. If the fuel type is gas, the dummy variable is 1; otherwise it is 0.
Mathematically, it can be written as:
- xi = 1 if fuel type is gas
- xi = 0 if fuel type is diesel
The number of dummy variables created by the regression model is one less than the number of levels in the qualitative variable.
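The dummy-variable rule above can be sketched in a few lines of Python. This is only an illustration, not what any particular statistical package does internally: the helper name make_dummies, the choice of diesel as the baseline, and the five fuel types listed are all assumptions made for the example.

```python
def make_dummies(values, baseline):
    """Return one 0/1 column per non-baseline level of a qualitative variable.

    A variable with k levels yields k - 1 dummy columns; the baseline
    level gets no column of its own.
    """
    levels = sorted(set(values))
    if baseline not in levels:
        raise ValueError(f"baseline {baseline!r} is not one of {levels}")
    columns = {}
    for level in levels:
        if level == baseline:
            continue  # no dummy variable for the baseline level
        columns[level] = [1 if v == level else 0 for v in values]
    return columns


# Hypothetical sample of five cars (illustrative values only).
fuel_type = ["gas", "diesel", "gas", "gas", "diesel"]

dummies = make_dummies(fuel_type, baseline="diesel")
fuel_type_gas = dummies["gas"]  # plays the role of fuelTypegas

print(fuel_type_gas)   # [1, 0, 1, 1, 0]
print(len(dummies))    # 1 dummy column for a 2-level variable
```

With two levels and diesel as the baseline, only one column is produced, which matches the "one less than the number of levels" rule.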
Let us examine how this manifests in a regression model. A simple regression model with only price and fuel type as input provides the following coefficients:
There is only one coefficient and one intercept. The regression model creates a dummy variable for one level of the qualitative variable (gas in this case).
It says the following:
- If the dummy variable is 0, i.e. the fuel type of the car is diesel, then price = 18348 + 0 x (-6925) = $18348
- If the dummy variable is 1, i.e. the fuel type of the car is gas, then price = 18348 + 1 x (-6925) = $11423
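The two cases above are the same equation evaluated at 0 and at 1. A minimal sketch, using the intercept and coefficient quoted in this post (the function name predict_price is just an illustrative choice):

```python
INTERCEPT = 18348   # estimated average price of the baseline level (diesel)
COEF_GAS = -6925    # estimated price difference of gas relative to diesel


def predict_price(fuel_type):
    """Estimate price from fuel type alone, using a single 0/1 dummy."""
    fuel_type_gas = 1 if fuel_type == "gas" else 0
    return INTERCEPT + COEF_GAS * fuel_type_gas


print(predict_price("diesel"))  # 18348
print(predict_price("gas"))     # 11423
```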
The way qualitative variables with two levels are treated is clear. How about variables with more than two levels? Let us examine another example to understand it.
The drive wheel is a qualitative variable with three levels. In this case, the regression model creates two dummy variables. Let us look at an example. The sample data has four cars.
Two dummy variables are created:
- driveWheelsfwd: 1 if drive wheel type is FWD else 0
- driveWheelsrwd: 1 if drive wheel type is RWD else 0
Mathematically, it can be written as:
- xi1 = 1 if the drive wheel is front; 0 if the drive wheel is not front.
- xi2 = 1 if the drive wheel is rear; 0 if the drive wheel is not rear.
Note that there is no dummy variable for 4WD.
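As a rough sketch of this encoding, each car can be mapped to a pair (driveWheelsfwd, driveWheelsrwd). The four drive wheel values below are made up for illustration, and the function name encode_drive_wheels is hypothetical.

```python
def encode_drive_wheels(drive_wheels):
    """Map a drive wheel type to the pair (driveWheelsfwd, driveWheelsrwd).

    4WD is the baseline level, so it is encoded as (0, 0).
    """
    if drive_wheels not in ("4wd", "fwd", "rwd"):
        raise ValueError(f"unknown drive wheel type: {drive_wheels!r}")
    fwd = 1 if drive_wheels == "fwd" else 0
    rwd = 1 if drive_wheels == "rwd" else 0
    return fwd, rwd


# Hypothetical sample of four cars (illustrative values only).
cars = ["4wd", "fwd", "rwd", "fwd"]
encoded = [encode_drive_wheels(c) for c in cars]

print(encoded)  # [(0, 0), (1, 0), (0, 1), (1, 0)]
```

A 4WD car is identified by both dummies being 0, which is why no third column is needed.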
How do they manifest in the regression model? The way the regression model treats them is as follows:
- First, it creates a baseline for the price estimation. The baseline is the average price for the level of the qualitative variable for which no dummy variable is created. It is the intercept value. Here the baseline equation is for 4WD: the intercept is the average price of a 4WD car.
- For FWD: The average price for front-wheel drive (fwd) is estimated as baseline + 1 x coefficient of FWD, i.e. price = 7603 + 1 x 1405 + 0 x 10704 = $9008. This means that, on average, an FWD car costs $1405 more than a 4WD.
- For RWD: The average price for rear-wheel drive (rwd) is estimated as baseline + 1 x coefficient of RWD, i.e. price = 7603 + 0 x 1405 + 1 x 10704 = $18307. This means that, on average, an RWD car costs $10704 more than a 4WD.
All qualitative variables with more than two levels are treated similarly.
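The baseline-plus-coefficient arithmetic for the three drive wheel levels can be checked with a short sketch. The intercept and coefficients are the ones quoted in this post; the function and constant names are illustrative.

```python
INTERCEPT = 7603    # estimated average price of the baseline level (4WD)
COEF_FWD = 1405     # estimated difference of FWD relative to 4WD
COEF_RWD = 10704    # estimated difference of RWD relative to 4WD


def predict_price(drive_wheels):
    """Estimate price from drive wheel type alone, using two 0/1 dummies."""
    drive_wheels_fwd = 1 if drive_wheels == "fwd" else 0
    drive_wheels_rwd = 1 if drive_wheels == "rwd" else 0
    return INTERCEPT + COEF_FWD * drive_wheels_fwd + COEF_RWD * drive_wheels_rwd


for level in ("4wd", "fwd", "rwd"):
    print(level, predict_price(level))
# 4wd 7603
# fwd 9008
# rwd 18307
```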
Model Building
Now that the mechanics of the treatment of qualitative variables are clear, let us see how Fernando applies them to his model. His original model was the following:
price = -55089.98 + 87.34 engineSize + 60.93 horsePower + 770.42 width
He adds the two qualitative variables into the mix: fuel type and drive wheels. The general form of the model is written as:
price = β0 + β1.engineSize + β2.horsePower + β3.width + β4.fuelTypegas + β5.driveWheelsfwd + β6.driveWheelsrwd
Fernando trains the model in his statistical package and gets the following coefficients.
The equation of the model is:
price = -76404.83 + 57.20 * engineSize + 23.72 * horsePower + 1214.42 * width - 1381.47 * fuelTypegas - 344.62 * driveWheelsfwd + 2189.16 * driveWheelsrwd
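To see how the numeric predictors and the dummy variables combine, the fitted equation can be written as a small function. This is a sketch only: the coefficients are copied from the equation above, while the function name, argument names, and the example car (engine size 130, horsepower 111, width 64.1) are assumptions made for illustration.

```python
COEFFICIENTS = {
    "intercept": -76404.83,
    "engineSize": 57.20,
    "horsePower": 23.72,
    "width": 1214.42,
    "fuelTypegas": -1381.47,
    "driveWheelsfwd": -344.62,
    "driveWheelsrwd": 2189.16,
}


def predict_price(engine_size, horse_power, width, fuel_type, drive_wheels):
    """Evaluate the fitted equation; diesel and 4WD are the baseline levels."""
    c = COEFFICIENTS
    fuel_type_gas = 1 if fuel_type == "gas" else 0
    drive_wheels_fwd = 1 if drive_wheels == "fwd" else 0
    drive_wheels_rwd = 1 if drive_wheels == "rwd" else 0
    return (
        c["intercept"]
        + c["engineSize"] * engine_size
        + c["horsePower"] * horse_power
        + c["width"] * width
        + c["fuelTypegas"] * fuel_type_gas
        + c["driveWheelsfwd"] * drive_wheels_fwd
        + c["driveWheelsrwd"] * drive_wheels_rwd
    )


# Hypothetical car: same numeric predictors, only the fuel type differs.
diesel_4wd = predict_price(130, 111, 64.1, "diesel", "4wd")
gas_4wd = predict_price(130, 111, 64.1, "gas", "4wd")

# The gap between the two estimates is the fuelTypegas coefficient.
print(round(gas_4wd - diesel_4wd, 2))
```

Changing only a dummy variable shifts the estimate by exactly that dummy's coefficient, regardless of the numeric predictor values.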
Here there is a mix of quantitative and qualitative variables. The variables are treated as independent of each other. Let us now interpret the coefficients:
- β0: Note that there are no dummy variables created for diesel cars and 4WD cars. β0 represents the estimated average price of a diesel 4WD car when engine size, horsepower, and width are all zero. It is a negative value. This implies that for such a diesel 4WD car the average price would be negative, which is not possible. The model may be violating linear regression assumptions.
- β1: This interpretation is the same as the one for multivariate regression. It is interpreted as the average increase in car price if the engine size is increased by 1 unit. An increase of the engine size by 1 unit results in an average increase of car price by $57.20.
- β2: This interpretation is the same as the one for multivariate regression. It is interpreted as the average increase in car price if the horsepower is increased by 1 unit. An increase of the horsepower by 1 unit results in an average increase of car price by $23.72.
- β3: This interpretation is the same as the one for multivariate regression. It is interpreted as the average increase in car price if the width is increased by 1 unit. An increase of the width by 1 unit results in an average increase of car price by $1214.42.
- β4: This coefficient is the result of a dummy variable (fuel type gas). It is interpreted as the average difference in price between a diesel-fueled car and a gas-fueled car. This means that, on average, a gas car will cost $1381.47 less than a diesel car.
- β5: This coefficient is the result of a dummy variable (drive wheel fwd). It is interpreted as the average difference in price between a 4WD and an FWD car. This means that, on average, an FWD car will cost $344.62 less than a 4WD car.
- β6: This coefficient is the result of a dummy variable (drive wheel rwd). It is interpreted as the average difference in price between a 4WD and an RWD car. This means that, on average, an RWD car will cost $2189.16 more than a 4WD car.
- Adjusted R-squared is 0.8183. This implies that the model explains 81.83% of the variation in the training data.
- Note that not all the coefficients are significant. In fact, in this case, the qualitative variables have no significant effect on the model performance.
Conclusion
This model is not better than the original model. However, it has done its job: we understand the way qualitative variables are interpreted in a regression model. It is evident that the original model with horsepower, engine size, and width is better. However, Fernando wonders: horsepower, engine size, and width are treated independently.
What if there are relations between horsepower, engine size, and width? Can these relationships be modelled?
The next blog post of this series will address these questions. It will explain the concept of interactions.