INTRODUCTION
Dear reader,
If you are new to the world of machine learning, then this tutorial is just what you need to introduce yourself to this exciting new part of the data science world.
This post includes a complete machine learning project that will guide you step by step to create a “template,” which you can use later on other datasets.
In this step-by-step tutorial you will:
1. Use one of the most popular machine learning packages in R.
2. Explore a dataset by using statistical summaries and data visualization.
3. Build 5 machine-learning models, pick the best, and build confidence that the accuracy is reliable.
The process of a machine learning project may not always be exactly the same, but there are certain standard and necessary steps:
1. Define Problem.
2. Prepare Data.
3. Evaluate Algorithms.
4. Improve Results.
5. Present Results.
1. PACKAGE INSTALLATION & DATA SET
The first thing you have to do is install and load the “caret” package with:
install.packages("caret")
library(caret)
Moreover, we need a dataset to work with. The dataset we chose in our example is “iris,” which contains 150 observations of iris flowers. There are 4 columns of measurements of the flowers in centimeters. The 5th column is the species of the flower observed. All observed flowers belong to one of 3 species. To attach it to the environment and copy it into the dataset variable used in the rest of this tutorial, use:
data(iris)
dataset <- iris
1.1 Create a Validation Dataset
First of all, we need a way to check that the models we build are any good. Later, we will use statistical methods to estimate the accuracy of the models that we create on unseen data. To be sure about the accuracy of the best model on unseen data, we will also evaluate it on actual unseen data. To do this, we will hold back some data that the algorithms will not see and use this data later to get a second and independent idea of how accurate the best model actually is.
We will split the loaded dataset into two, 80% of which we will use to train our models and 20% of which we will hold back as a validation dataset. Look at the example below:
# create a list of 80% of the rows in the original dataset to use for training
validation_index <- createDataPartition(dataset$Species, p=0.80, list=FALSE)
# select the remaining 20% of the data for validation
validation <- dataset[-validation_index,]
# use the selected 80% of the data for training and testing the models
dataset <- dataset[validation_index,]
You now have training data in the dataset variable and a validation set that will be used later in the validation variable.
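To make the idea of a stratified split more concrete, here is a minimal base-R sketch of roughly what the partitioning step does: it samples 80% of the rows within each species, so the class proportions are preserved in both parts. The names train_idx, train_set and holdout_set are illustrative only and are not used elsewhere in this tutorial, and caret's createDataPartition may pick different rows.

```r
# Sketch only: a hand-rolled stratified 80/20 split in base R.
# train_idx, train_set and holdout_set are hypothetical names.
data(iris)
set.seed(7)

# Group the row numbers by species, then sample 80% of each group
rows_by_species <- split(seq_len(nrow(iris)), iris$Species)
train_idx <- unlist(lapply(rows_by_species, function(rows) {
  sample(rows, size = round(0.8 * length(rows)))
}))

train_set   <- iris[train_idx, ]   # 80% used for modeling
holdout_set <- iris[-train_idx, ]  # 20% held back for validation

# Each species contributes the same share to both parts
print(table(train_set$Species))
print(table(holdout_set$Species))
```

Because iris has 50 flowers per species, this leaves 40 per species for training and 10 per species for validation.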
2. DATASET SUMMARY
In this step, we are going to explore our data set. More specifically, we need to know certain features of our dataset, like:
1. Dimensions of the dataset.
2. Types of the attributes.
3. Details of the data.
4. Levels of the class attribute.
5. Breakdown of the instances in each class.
6. Statistical summary of all attributes.
2.1 Dimensions of Dataset
We can see how many instances (rows) and how many attributes (columns) the data contains with the dim function. Look at the example below:
dim(dataset)
2.2 Types of Attributes
Knowing the types is important as it can help you summarize the data you have and decide which transformations you might need to apply to prepare the data before modeling. They could be doubles, integers, strings, factors and other types. You can find them with:
sapply(dataset, class)
2.3 Details of the Data
You can take a look at the first 6 rows of the data (the default for head) with:
head(dataset)
2.4 Levels of the Class
The class variable is a factor that has multiple class labels or levels. Let’s look at the levels:
levels(dataset$Species)
There are two types of classification problems: multinomial (multi-class), like this one with three levels, and binary, which it would be if there were only two levels.
2.5 Class Distribution
Let’s now take a look at the number of instances that belong to each class. We can view this as an absolute count and as a percentage with:
percentage <- prop.table(table(dataset$Species)) * 100
cbind(freq=table(dataset$Species), percentage=percentage)
2.6 Statistical Summary
For each numeric attribute, this includes the mean, the min and max values, as well as some percentiles. Look at the example below:
summary(dataset)
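As a rough illustration of what the summary reports for a numeric column, the same statistics can be computed by hand. This sketch uses the full iris data rather than the 80% training subset, so the numbers will differ slightly from those printed for dataset; the names sepal_length and by_hand are illustrative.

```r
# Sketch only: the statistics summary() reports, computed by hand for one
# column of the full iris data (not the 80% training subset).
data(iris)
sepal_length <- iris$Sepal.Length

by_hand <- c(
  Min    = min(sepal_length),
  quantile(sepal_length, 0.25),   # first quartile, named "25%"
  Median = median(sepal_length),
  Mean   = mean(sepal_length),
  quantile(sepal_length, 0.75),   # third quartile, named "75%"
  Max    = max(sepal_length)
)
print(round(by_hand, 3))
```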
3. DATASET VISUALIZATION
We now have a basic idea about the data. We need to extend that with some visualizations, and for that reason we are going to use two types of plots:
1. Univariate plots to understand each attribute.
2. Multivariate plots to understand the relationships between attributes.
3.1 Univariate Plots
We can visualize the input attributes and the output attribute separately. Let’s set that up and call the input attributes x and the output attribute y.
x <- dataset[,1:4]
y <- dataset[,5]
Since the input variables are numeric, we can create box and whisker plots of each one with:
par(mfrow=c(1,4))
for(i in 1:4) {
  boxplot(x[,i], main=names(iris)[i])
}
We can also create a barplot of the Species class variable to graphically display the class distribution.
plot(y)
3.2 Multivariate Plots
First, we create scatterplots of all pairs of attributes and color the points by class. Then, we can draw ellipses around them to make the classes easier to tell apart.
You have to install and load the “ellipse” package to do this.
install.packages("ellipse")
library(ellipse)
featurePlot(x=x, y=y, plot="ellipse")
We can also create box and whisker plots of each input variable, but this time they are broken down into separate plots for each class.
featurePlot(x=x, y=y, plot="box")
Next, we can get an idea of the distribution of each attribute. We will use some probability density plots to give smooth lines for each distribution.
scales <- list(x=list(relation="free"), y=list(relation="free"))
featurePlot(x=x, y=y, plot="density", scales=scales)
4. ALGORITHMS EVALUATION
Now it is time to create some models of the data and estimate their accuracy on unseen data.
1. Set up the test harness to use 10-fold cross validation.
2. Build 5 different models to predict species from flower measurements.
3. Select the best model.
4.1 Test Harness
This will split our dataset into 10 parts, train on 9, test on 1, and repeat for all combinations of train-test splits.
control <- trainControl(method="cv", number=10)
metric <- "Accuracy"
We are using the metric of “Accuracy” to evaluate models. This is the number of correctly predicted instances divided by the total number of instances in the dataset, multiplied by 100 to give a percentage.
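The following is a minimal sketch of what 10-fold cross validation with the accuracy metric amounts to, written by hand with lda from the MASS package (which ships with standard R distributions). It is a simplification and not caret's actual implementation: the folds here are assigned purely at random, it runs on the full iris data, and the names k, fold_id and fold_accuracy are illustrative.

```r
# Sketch only: manual 10-fold cross validation with accuracy as the metric.
# This is a simplified illustration, not what caret does internally.
library(MASS)  # provides lda()
data(iris)
set.seed(7)

k <- 10
# Randomly assign every row to one of k equally sized folds
fold_id <- sample(rep(seq_len(k), length.out = nrow(iris)))

fold_accuracy <- numeric(k)
for (fold in seq_len(k)) {
  train_rows <- iris[fold_id != fold, ]  # 9 parts for training
  test_rows  <- iris[fold_id == fold, ]  # 1 part for testing

  model     <- lda(Species ~ ., data = train_rows)
  predicted <- predict(model, test_rows)$class

  # Accuracy: correct predictions / total instances * 100
  fold_accuracy[fold] <- mean(predicted == test_rows$Species) * 100
}

print(fold_accuracy)        # one accuracy value per fold
print(mean(fold_accuracy))  # cross-validated accuracy estimate
```

Each row is used for testing exactly once, which is why the procedure yields 10 accuracy values per model.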
4.2 Build Models
We don’t know which algorithms would be good on this problem or what configurations to use. We can only get a rough idea from the plots that we created earlier.
The algorithms we will evaluate are:
1. Linear Discriminant Analysis (LDA).
2. Classification and Regression Trees (CART).
3. k-Nearest Neighbors (kNN).
4. Support Vector Machines (SVM) with a radial kernel.
5. Random Forest (RF).
This is a good mixture of simple linear (LDA), nonlinear (CART, kNN) and complex nonlinear methods (SVM, RF). We reset the random number seed before each run to ensure that the evaluation of each algorithm is performed using exactly the same data splits. This ensures the results are directly comparable.
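The effect of resetting the seed can be seen in a small base-R sketch. It assumes, for illustration only, that folds are drawn with sample; caret creates its folds differently, but the principle of "same seed, same random draws" is the same. The names make_folds, folds_a and folds_b are hypothetical.

```r
# Sketch only: resetting the seed reproduces the same random fold assignment.
# make_folds is a hypothetical helper, not part of caret.
make_folds <- function(n_rows, k = 10) {
  sample(rep(seq_len(k), length.out = n_rows))
}

set.seed(7)
folds_a <- make_folds(120)

set.seed(7)  # reset before the second "run"
folds_b <- make_folds(120)

# Same seed, same splits
print(identical(folds_a, folds_b))

# Without resetting the seed the assignment will very likely differ
folds_c <- make_folds(120)
print(identical(folds_a, folds_c))
```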
NOTE: To proceed, first install and load the following packages: “rpart”, “kernlab”, “e1071” and “randomForest”.
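One possible way to check which of these packages are still missing before installing anything is sketched below. The variable names needed and missing_pkgs are illustrative, and the install line is left commented out so that nothing is downloaded unless you choose to run it.

```r
# Sketch only: find out which of the required packages are not installed yet.
needed <- c("rpart", "kernlab", "e1071", "randomForest")

# requireNamespace() returns TRUE when a package can be loaded
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Not installed yet: ", paste(missing_pkgs, collapse = ", "))
  # Uncomment to install the missing packages:
  # install.packages(missing_pkgs)
} else {
  message("All required packages are installed.")
}
```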
Let’s build our 5 models:
# a) linear algorithms
# LDA
set.seed(7)
fit.lda <- train(Species ~ ., data=dataset, method="lda", metric=metric, trControl=control)
# b) nonlinear algorithms
# CART
set.seed(7)
fit.cart <- train(Species ~ ., data=dataset, method="rpart", metric=metric, trControl=control)
# kNN
set.seed(7)
fit.knn <- train(Species ~ ., data=dataset, method="knn", metric=metric, trControl=control)
# c) advanced algorithms
# SVM (radial kernel)
set.seed(7)
fit.svm <- train(Species ~ ., data=dataset, method="svmRadial", metric=metric, trControl=control)
# Random Forest
set.seed(7)
fit.rf <- train(Species ~ ., data=dataset, method="rf", metric=metric, trControl=control)
4.3 Select the Best Model
We now have 5 models and an accuracy estimate for each, so we have to compare them.
It is a good idea to create a list of the created models and use the summary function.
results <- resamples(list(lda=fit.lda, cart=fit.cart, knn=fit.knn, svm=fit.svm, rf=fit.rf))
summary(results)
Moreover, we can create a plot of the model evaluation results and compare the spread and the mean accuracy of each model. There is a population of accuracy measures for each algorithm because each algorithm was evaluated 10 times (once per cross-validation fold).
dotplot(results)
You can summarize the results for just the LDA model, which seems to be the most accurate.
print(fit.lda)
5. MAKE PREDICTIONS
LDA was the most accurate model. Now we want to get an idea of the accuracy of the model on our validation set.
We can run the LDA model directly on the validation set and summarize the results in a confusion matrix.
predictions <- predict(fit.lda, validation)
confusionMatrix(predictions, validation$Species)
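To show how a confusion matrix is read, here is a small base-R sketch built from made-up labels. The vectors actual and predicted are invented for illustration and are not results from the models above; the table function stands in for the richer output of confusionMatrix.

```r
# Sketch only: a confusion matrix from invented labels (not real model output).
species <- c("setosa", "versicolor", "virginica")

actual <- factor(rep(species, each = 3), levels = species)
predicted <- factor(
  c("setosa", "setosa", "setosa",
    "versicolor", "versicolor", "virginica",   # one versicolor misclassified
    "virginica", "virginica", "virginica"),
  levels = species
)

# Rows are predicted classes, columns are the true classes
cm <- table(Predicted = predicted, Actual = actual)
print(cm)

# Correct predictions sit on the diagonal
accuracy_pct <- sum(diag(cm)) / sum(cm) * 100
print(accuracy_pct)
```

Off-diagonal cells show which classes get confused with each other; here one versicolor flower is predicted as virginica.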
August 25, 2017
By Euthymios Kasvikis
Source: https://www.r-bloggers.com/how-to-prepare-and-apply-machine-learning-to-your-dataset/