# Getting Started with Statistical Learning: Insights from Chapter 1 of ISLR



Hey readers!! I am Brahma, a passionate software engineer. I will be sharing my insights on the chapters of the famous *An Introduction to Statistical Learning with Applications in R*, aka ISLR.

So, here we are with Chapter 1. I will highlight the important parts that one should keep in mind while reading this chapter.

# Introduction

The text kicks off with an introduction to three datasets that are used throughout the book.

**Wage Data**

This dataset relates the wages of a group of men from the Mid-Atlantic region of the US to a variety of factors that can affect an individual's salary. In particular, it aims to understand how an employee's age, education and the calendar year are associated with their wage. Here is a pictorial representation of what's there in the data:

Here are a few inferences we can draw:

- The `age` vs `wage` graph suggests that an individual's salary increases significantly until the age of 35-40. It then stays roughly the same, with a slight increase, until about 60. After that, which is generally the age of retirement in many places, the salary starts to drop.
- The same graph also suggests that most of the working population is roughly between 20 and 60 years of age. The largest share is around 30-50 years of age, which can be seen as the age of maximum productivity.
- The `year` vs `wage` graph suggests that overall salaries remained almost the same between 2003 and 2009, with only a very small increase in wages.
- The `education level` vs `wage` graph shows salary increasing significantly with the level of education, implying that more educated folks get better salaries.
- Overall, looking at all three graphs, we can say that the majority of salaries are concentrated between 50-200k USD.

**Stock Market Data**

The Wage Data involved predicting a quantitative output value. But at times the output may be qualitative, i.e., it can't be represented as a numeric value, e.g., has a car [yes or no], gender [male or female or other]. Basically, the output can only take one of a limited set of values (classes).

One such example is the Stock Market Data, where we analyse the past performance of a stock, let's say over a period of six months or maybe a year, and predict whether the price of that stock will go up or down.

This is known as a classification problem where the output is classified into one of the values in the output class [here, up or down].

The graphs for this data relate whether today's stock price goes up or down to the price change one, two and three days earlier.

Perhaps surprisingly, this data is not really enough to predict a stock's performance, i.e., we cannot reliably predict today's price from the previous days' prices, because the price of a stock depends on many other factors, like the company's major decisions or maybe a new product launch.

In simple terms, it is very hard to predict stock prices from historical prices alone. That makes sense, else everyone would have been a millionaire by now.
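To make the setup concrete, here is a minimal sketch of how such a classification dataset could be put together: each day's inputs are the percentage changes on the previous one, two and three days, and the output is the class Up or Down. The `returns` values and the helper name `build_lagged_dataset` are made up for illustration; they are not taken from the book's data.

```python
# Hypothetical daily percentage changes (made-up numbers, not real market data).
returns = [0.4, -1.1, 0.3, 0.9, -0.2, 0.5, -0.7]


def build_lagged_dataset(returns, n_lags=3):
    """Pair each day's Up/Down label with the previous n_lags daily changes."""
    rows = []
    for t in range(n_lags, len(returns)):
        # Inputs: the change 1 day ago, 2 days ago, ..., n_lags days ago.
        lags = [returns[t - k] for k in range(1, n_lags + 1)]
        # Output: a qualitative label (a class) rather than a number.
        direction = "Up" if returns[t] > 0 else "Down"
        rows.append((lags, direction))
    return rows


dataset = build_lagged_dataset(returns)
for lags, direction in dataset:
    print(lags, "->", direction)
```

A classifier would then be trained on these (inputs, label) pairs. The sketch only builds the pairs; it does not claim the lags are actually predictive.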

**Gene Expression Data**

The previous two examples had data that contained both input and output values. But what if we don't have an output value? Weird situation, right? But it is a very real real-world scenario. The first case, with both input and output values, is called **Supervised Learning** and the latter is called **Unsupervised Learning**.

We consider the NCI60 data set [Gene Expression Data], which consists of 6,830 gene expression measurements for each of 64 cancer cell lines. Instead of predicting a particular output variable, we are interested in determining whether there are groups, or clusters, among the cell lines based on their gene expression measurements.

Left: Representation of the NCI60 gene expression data set in a two-dimensional space, Z1 and Z2. Each point corresponds to one of the 64 cell lines. There appear to be four groups of cell lines, which we have represented using different colours.

Right: Same as the left panel except that we have represented each of the 14 different types of cancer using a different coloured symbol. Cell lines corresponding to the same cancer type tend to be nearby in the two-dimensional space.
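To get a feel for what "finding groups without an output variable" means, here is a minimal k-means sketch on a handful of made-up 2-D points. k-means is just one common clustering method that I picked for illustration; I am not claiming it is how the groups in the figure were produced. The points, the `kmeans` helper and the starting centroids are all hypothetical.

```python
import math

# Hypothetical 2-D points standing in for cell lines plotted on Z1 and Z2.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]


def kmeans(points, centroids, n_iter=10):
    """Plain k-means: assign each point to its nearest centroid, then update."""
    labels = []
    for _ in range(n_iter):
        # Assignment step: index of the closest centroid for every point.
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return labels, centroids


# Start from two of the points as initial centroids (an arbitrary choice).
labels, centroids = kmeans(points, [points[0], points[3]])
print(labels)
```

Notice that no output value is passed in anywhere: the groups come purely from how close the points are to each other.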

# Notation and Simple Matrix Algebra

There are some common notations that are used throughout the book, and I will be following them throughout this series too. Let's look at some of them:

- **n** represents the number of distinct data points, or observations, in our sample.
- **p** denotes the number of variables that are available for use in making predictions.

For example, the Wage data set consists of 11 variables for 3,000 people, so we have n = 3,000 observations and p = 11 variables (such as year, age, race, and more).

Simple Matrix Notation:

x_ij represents the value of the jth variable for the ith observation, where i = 1, 2,...,n and j = 1, 2,...,p.

Let X denote an n × p matrix whose (i, j)th element is x_ij.
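In code, this notation maps nicely onto a list of rows. Below is a small sketch with made-up numbers (n = 3 observations, p = 2 variables); the values and the helper `x` are hypothetical and only there to show how n, p and x_ij line up.

```python
# Hypothetical toy data: n = 3 observations (rows), p = 2 variables (columns).
X = [
    [25, 40.0],   # observation 1: e.g. age, wage
    [38, 75.5],   # observation 2
    [52, 90.0],   # observation 3
]

n = len(X)      # number of observations (rows)
p = len(X[0])   # number of variables (columns)


def x(i, j):
    """Return x_ij using the book's 1-based indices (Python lists are 0-based)."""
    return X[i - 1][j - 1]


print(n, p)      # 3 2
print(x(2, 1))   # 38
```

So each row of X is one observation and each column is one variable.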

Simple Matrix Algebra:

Matrix Multiplication:
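As a refresher: if A is an r × d matrix and B is a d × s matrix, their product AB is the r × s matrix whose (i, j)th element is the sum over k of a_ik * b_kj. Here is a minimal pure-Python sketch of that rule; the function name `matmul` and the 2 × 2 example values are just for illustration.

```python
def matmul(A, B):
    """Multiply an r x d matrix A by a d x s matrix B (both lists of rows)."""
    r, d, s = len(A), len(B), len(B[0])
    if any(len(row) != d for row in A):
        raise ValueError("columns of A must match rows of B")
    # (AB)_ij = sum over k of a_ik * b_kj
    return [
        [sum(A[i][k] * B[k][j] for k in range(d)) for j in range(s)]
        for i in range(r)
    ]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul(A, B))  # [[19, 22], [43, 50]]
```

For instance, the top-left entry is 1 × 5 + 2 × 7 = 19. The product is only defined when the number of columns of A equals the number of rows of B, which is what the check at the top guards against.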

# Adios

That's all for now folks!! See you in the next one.

Signing off!!!