Overview of Statistical Learning

Statistical learning simply refers to the broad set of tools that are available for understanding data. There are two main types of statistical learning: supervised and unsupervised.

Supervised Learning

Supervised learning involves building statistical models to predict outputs \( (Y) \) from inputs \( (X) \). For example, assume that we have a salary dataset for statisticians. The dataset consists of the experience level and salary for 10 different statisticians.

Years of Experience (X) Salary (Y)
0.5 70000
1.0 74000
1.5 75000
2.0 75000
2.5 77000
3.0 80000
3.5 78000
4.0 79000
4.5 82000
5.0 85000

We could build a simple linear regression model to predict the salary of statisticians by using experience level as a predictor. This is an example of supervised learning, where we have supervising outputs (salary values) that guide us in developing a statistical model to determine the relationship between experience level and salary.

In general, there are two main types of supervised learning: regression and classification.

Regression

Predicting a quantitative output is known as a regression problem. For example, predicting someone's salary is a regression problem.

Classification

Predicting a qualitative output is known as a classification problem. For example, predicting whether a stock will go up or down is a classification problem.

Unsupervised Learning

Unsupervised learning involves building statistical models to determine relationships from inputs \( (X) \). There are no supervising outputs. For example, assume that we have a customer dataset. The dataset consists of the annual salary and annual spend on Amazon for 10 different individuals.

Customer ID Salary Spend
1 54425 1103
2 67953 1353
3 53135 3216
4 61452 4146
5 59374 2890
6 66544 975
7 55550 1240
8 58064 956
9 58918 6791
10 54162 4800
... ... ...

We could use a statistical clustering algorithm to group customers by their purchasing behavior. This is an example of unsupervised learning, where we do not have supervising outputs that already inform us which customers are low spenders, average spenders, or high spenders. Instead, we have to come up with the determination ourselves.

In general, there are two main types of unsupervised learning: clustering and association.

Clustering

Determining groupings is known as a clustering problem. For example, grouping customers together based on purchasing behavior is a clustering problem.

Association

Determining rules that describe large portions of a dataset is known as an association problem. For example, determining that people who buy X also buy Y is an association problem. A modern real-world example of this is Amazon's "frequently bought together" product recommendations.

Next Chapter →