Intro to ML
Machine Learning (ML) gives machines the ability to learn autonomously from past experiences and observations, by analyzing patterns within a given dataset, without being explicitly programmed. But how do we teach machines? Teaching a machine involves a structured process in which every stage builds a better version of the machine. The process typically involves loading the data, representing the data in the chosen algorithmic format (modeling), and then making generalizations (prediction).
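As a minimal sketch of this load, model, predict cycle (assuming scikit-learn and its bundled iris dataset, neither of which is prescribed here):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# 1. Load the data.
X, y = load_iris(return_X_y=True)

# 2. Represent the data in the chosen algorithmic format (modeling).
model = DecisionTreeClassifier().fit(X, y)

# 3. Make generalizations (prediction) on observations.
print(model.predict(X[:3]))
```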
Some of the ML methods include:
Supervised learning: the training data includes the desired outputs (see the sketch after this list).
Unsupervised learning: the training data does not include desired outputs. Clustering is a typical example; without labels, it is hard to tell what counts as good learning and what does not.
Semi-supervised learning: the training data includes desired outputs for only a few observations.
Reinforcement learning: the model learns from rewards earned over a sequence of actions; it is a favorite in AI, and arguably the most ambitious type of learning.
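To make the supervised/unsupervised contrast concrete, here is a small sketch; scikit-learn and the toy arrays are illustrative assumptions, not part of the discussion above:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])  # features
y = np.array([0, 0, 1, 1])                                      # desired outputs

# Supervised: the training data includes the desired outputs y.
clf = LogisticRegression().fit(X, y)

# Unsupervised: only X is given; the algorithm must discover structure on its own.
km = KMeans(n_clusters=2, n_init=10).fit(X)

print(clf.predict(X))   # learned labels, checked against y
print(km.labels_)       # discovered clusters, with no ground truth to check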
In this and the following sections, we will mainly focus on supervised learning.
Goals of supervised learning (each illustrated in the sketch after this list):
Classification: when the function being learned is discrete.
Regression: when the function being learned is continuous.
Probability Estimation: when the output of the function is a probability.
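A small sketch showing the three goals side by side (scikit-learn and the synthetic one-feature dataset are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.linspace(0.0, 10.0, 50).reshape(-1, 1)
y_continuous = 2.0 * X.ravel() + 1.0        # continuous target -> regression
y_discrete = (X.ravel() > 5.0).astype(int)  # discrete target   -> classification

reg = LinearRegression().fit(X, y_continuous)
clf = LogisticRegression().fit(X, y_discrete)

print(reg.predict([[7.0]]))        # regression: a continuous value
print(clf.predict([[7.0]]))        # classification: a discrete label
print(clf.predict_proba([[7.0]]))  # probability estimation: P(class | x)
```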
Supervised learning algorithms:
Seeing the myriad of algorithms, one might ask: how do we determine which algorithm to choose? What constitutes the "best" model? In this introductory section, we introduce several model validation criteria that will hopefully aid you in your future research.
PART ONE: Training/Testing Split
Generally, before any model evaluation, we split the full data into a training dataset and a testing dataset. While performing our usual statistical analysis on the training dataset, we hold the testing dataset back from model training, because the testing dataset ideally serves as a representation of real-life unseen data. Although evaluation on the training data may yield an optimistically biased score, the held-out sample is likely to give an unbiased estimate of model skill.
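In code, such a split might look like the following sketch, using scikit-learn's train_test_split (the dataset, model, and 80/20 ratio are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold back 20% of the data as the unseen test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))  # potentially biased
print("test accuracy: ", model.score(X_test, y_test))    # unbiased estimate
```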
PART TWO: Cross Validation
You might be inclined to ask: does a randomly selected sample truly represent the complexity of unseen data? What if, by sheer luck, all observations with the smallest y values are grouped into the testing set? To combat this issue and make our estimate of model performance more robust, we introduce K-fold cross-validation. Cross-validation is a resampling method used to evaluate machine learning models on a limited data sample. Users first choose K, the total number of splits they wish to perform. The general procedure of K-fold cross-validation is as follows (a code sketch follows the steps):
Shuffle the data randomly
Split the dataset into K groups. For each unique group, treat that group as the test data and the remaining K-1 groups as the training data; fit a model on the training set and evaluate its performance on the test set.
Calculate the overall model performance as the average of the scores across the K folds.
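Here is one way to realize the procedure, assuming scikit-learn's KFold and cross_val_score (the dataset, model, and K = 5 are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# K = 5: shuffle, split into 5 groups, and rotate which group is held out.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kfold)

print("per-fold accuracy:", scores)           # one score per held-out fold
print("average performance:", scores.mean())  # the cross-validated estimate
```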
It's worth noting that:
Each observation is assigned to a single group and stays there for the duration of the procedure. This means that each sample is used in the training set K-1 times and in the holdout set exactly once.
The individual model from each fold is not saved; K-fold cross-validation attempts to estimate the average performance of the proposed model.
After learning about this practical procedure for obtaining an unbiased account of model accuracy, you might be wondering: why don't we simply evaluate our model's performance on the dataset it was trained on? And what metrics should we use to score each fold? In the next section, we will delve into the mathematical foundation that justifies the procedures above. Be prepared for some fun fun fun math formulas :)