suppressPackageStartupMessages({
library(tidyverse)
library(scales)
library(glue)
library(datarium)
})
Linear regression is the simplest form of machine learning: it lets us build a linear model from the available data and use it to predict outcomes for new (previously unseen) observations.
set.seed(42)
df <- tibble(x = seq(0, 10, length.out = 100)) %>%
  mutate(y = rnorm(n = 100, mean = 2, sd = 0.5) * x + rnorm(100, mean = 10, sd = 5)) # noisy slope around 2 plus noisy intercept around 10
df
df %>%
  ggplot(aes(x, y)) +
  geom_point(size = 1) +
  geom_smooth(method = "lm", se = FALSE) +
  geom_line(aes(y = 2 * x + 10), color = "red") + # the true underlying line
  theme_bw() +
  labs(title = "One-dimensional linear regression")
How do we find the equation of the regression line?
The definition:
Linear regression finds the line of the form \(y = mx+b\) that minimizes the mean squared error of the prediction.
For an observation \(x_i\), the prediction error is defined as
\[ e(i) = y_i-y_{i,pred} = y_i - (mx_i+b) \]
The squared error is:
\[ e^2(i) = (y_i-y_{i,pred})^2 = (y_i - (mx_i+b))^2 \] Mean squared error (MSE) across all \(N\) observations
\[ MSE = \frac{1}{N}\sum_{i=1}^{N}e^2(i) = \frac{1}{N}\sum_{i=1}^{N} (y_i-y_{i,pred})^2 = \frac{1}{N}\sum_{i=1}^{N}(y_i - mx_i - b)^2 \] The linear regression coefficients are then found by minimizing the MSE:
\[ m,b \leftarrow \arg \min \sum_{i=1}^{N}(y_i - mx_i - b)^2 \]
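To build intuition, we can write the MSE as an R function of the candidate slope and intercept (a small sketch using the simulated df from above) and see that the true generating line scores far better than an arbitrary guess:
mse <- function(m, b, data = df) mean((data$y - (m * data$x + b))^2) # MSE of the candidate line y = m*x + b
mse(2, 10) # the true generating line: small error
mse(0, 0)  # a poor guess: much larger error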
Finding the solution requires some calculus, but the end result is given by
\[ m = \frac{\sum_{i=1}^{N}(x_i-\overline x)(y_i-\overline y)}{\sum_{i=1}^{N}(x_i-\overline x)^2} = \frac{Cov(x,y)}{Var(x)} = r_{xy}\sqrt{\frac{Var(y)}{Var(x)}} \] \[ b = \overline y - m\overline x \]
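Before reaching for any libraries, we can verify these formulas directly on the simulated data:
m_hat <- cov(df$x, df$y) / var(df$x) # sample covariance over sample variance
b_hat <- mean(df$y) - m_hat * mean(df$x)
c(slope = m_hat, intercept = b_hat) # should land close to the true slope 2 and intercept 10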
Luckily, it is even easier to fit a linear regression in R:
fit <- lm(y~x, data = df)
summary(fit)
Call:
lm(formula = y ~ x, data = df)
Residuals:
Min 1Q Median 3Q Max
-11.8128 -3.8642 0.3551 2.8431 12.0895
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.8910 1.0201 8.716 7.35e-14 ***
x 2.1445 0.1762 12.168 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.139 on 98 degrees of freedom
Multiple R-squared: 0.6017, Adjusted R-squared: 0.5977
F-statistic: 148.1 on 1 and 98 DF, p-value: < 2.2e-16
m <- round(fit$coefficients[2],2)
b <- round(fit$coefficients[1],2)
So from the noisy data we recovered the line \(y = 2.14x + 8.89\), reasonably close to the true \(y = 2x + 10\).
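Since glue is already loaded, we can build that equation string programmatically from the rounded coefficients:
glue("y = {m}x + {b}")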
R-squared
R-squared, also called the coefficient of determination (in simple linear regression it equals the square of Pearson's correlation coefficient), is a simple way to assess the quality of the fit. The closer R-squared is to 1, the better the linear regression represents the data.
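For simple linear regression we can verify this relationship directly:
cor(df$x, df$y)^2      # squared Pearson correlation...
summary(fit)$r.squared # ...matches the R-squared reported by lm()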
y_pred <- predict(fit, tibble(x = 1:100)) # note: x > 10 extrapolates beyond the training range
tibble(x = 1:100,
       y = 2 * x + 10,
       y_pred = y_pred) %>%
  ggplot(aes(x, y)) +
  geom_line(color = "red") + # the true line
  geom_point(aes(y = y_pred), color = "blue") # model predictions
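predict() can also quantify uncertainty; for example, 95% prediction intervals for a few new points:
predict(fit, tibble(x = c(2, 5, 8)), interval = "prediction")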
When not to use linear regression :)
Multiple linear regression
The idea is to use a linear combination of multiple variables (predictors) to predict the outcome:
\[ \hat y = b + m_1x_1+m_2x_2+...+m_nx_n \] where \(b\) is the intercept (or bias), \(m_1,...,m_n\) are linear coefficients and \(x_1,...,x_n\) are predictors (or independent variables).
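For completeness, this is the standard least-squares problem in matrix form: collecting the predictors (plus a column of ones for the intercept) into a design matrix \(X\), the coefficients that minimize the squared error have the well-known closed form
\[ \hat\beta = (X^TX)^{-1}X^Ty \]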
For this section we will be using the marketing dataset from the datarium package.
From the marketing help page: “A data frame containing the impact of three advertising medias (youtube, facebook and newspaper) on sales. Data are the advertising budget in thousands of dollars along with the sales. The advertising experiment has been repeated 200 times.”
data("marketing")
head(marketing)
First, let's explore the data with a plot:
marketing %>%
pivot_longer(cols = c(youtube, facebook, newspaper), names_to = "media", values_to = "budget") %>%
ggplot(aes(x = budget, y = sales, color = media)) +
geom_point()+
geom_smooth(se = FALSE, method = "lm")+
facet_wrap(~media, scales = "free")+
theme_bw()+
labs(title = "Sales vs Advertisement Budget - Univariate Analysis",
x = "Budget [$, thousands]", y = "Sales [$, millions]")
We can see a relatively strong correlation between sales and the youtube budget, some correlation with facebook, and not much correlation with newspaper advertising.
The challenge is that all three media contribute to sales together: even though we can plot them separately, the effect is compound, and it is difficult to disentangle the impact of each channel.
Technically it is possible to use all of the data to fit the model, but instead we will use 75% of the data for fitting and keep the remaining 25% for testing how the model performs on unseen data. This train/test split is a generally accepted approach when training any machine learning model.
set.seed(42)
N <- nrow(marketing)
in_train <- sample(seq(1, N), size = floor(N * 0.75)) # floor() guards against N * 0.75 not being an integer
train <- marketing %>%
slice(in_train)
test <- marketing %>%
slice(-in_train)
nrow(train)
[1] 150
nrow(test)
[1] 50
fit <- lm(sales ~ youtube + facebook + newspaper, data = train)
summary(fit)
Call:
lm(formula = sales ~ youtube + facebook + newspaper, data = train)
Residuals:
Min 1Q Median 3Q Max
-10.1290 -0.9734 0.4990 1.4681 3.2090
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.655491 0.458705 7.969 4.19e-13 ***
youtube 0.046252 0.001673 27.647 < 2e-16 ***
facebook 0.175388 0.010508 16.691 < 2e-16 ***
newspaper 0.001930 0.007118 0.271 0.787
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.121 on 146 degrees of freedom
Multiple R-squared: 0.886, Adjusted R-squared: 0.8836
F-statistic: 378.1 on 3 and 146 DF, p-value: < 2.2e-16
We can see a relatively good fit, with R-squared equal to 0.89. We can also see that youtube and facebook are strong predictors (three asterisks mark a highly significant coefficient), while newspaper is a poor predictor.
In practical terms, spending an additional 1000 dollars on facebook advertising increases sales by approximately 175 thousand dollars, while the same 1000 dollar increase in the youtube budget yields an increase of about 46 thousand dollars, on average.
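We can sanity-check this interpretation with predict() on two hypothetical budgets that differ only in facebook spend (the baseline numbers below are made up for illustration):
base <- tibble(youtube = 150, facebook = 25, newspaper = 30) # hypothetical baseline budget, in thousands
bumped <- base %>% mutate(facebook = facebook + 1) # +$1,000 on facebook
predict(fit, bumped) - predict(fit, base) # ~0.175, i.e. ~$175k of extra sales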
Since newspaper advertising does not contribute much to the sales prediction, it can be removed from the analysis, as well as from the marketing strategy. The updated model looks like this:
fit <- lm(sales ~ youtube + facebook, data = train)
summary(fit)
Call:
lm(formula = sales ~ youtube + facebook, data = train)
Residuals:
Min 1Q Median 3Q Max
-10.1963 -0.9979 0.4783 1.4798 3.2449
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.695331 0.433163 8.531 1.64e-14 ***
youtube 0.046268 0.001667 27.761 < 2e-16 ***
facebook 0.176391 0.009805 17.989 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.115 on 147 degrees of freedom
Multiple R-squared: 0.8859, Adjusted R-squared: 0.8844
F-statistic: 570.7 on 2 and 147 DF, p-value: < 2.2e-16
\[ sales = 3.7 + 0.046 \cdot youtube + 0.176 \cdot facebook \] (here sales is expressed in millions of dollars and the youtube/facebook advertising budgets in thousands)
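As a quick worked example (with made-up budgets): $100k on youtube plus $30k on facebook predicts roughly \(3.7 + 0.046 \cdot 100 + 0.176 \cdot 30 \approx 13.6\) million in sales, which predict() confirms:
predict(fit, tibble(youtube = 100, facebook = 30)) # ~13.6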
pred <- predict(fit, test)
df <- test %>%
mutate(predicted = pred,
pred_error = predicted/sales - 1)
summary(df)
youtube facebook newspaper sales predicted pred_error
Min. : 4.92 Min. : 2.40 Min. : 2.64 Min. : 3.84 Min. : 6.378 Min. :-0.27067
1st Qu.: 80.55 1st Qu.:11.25 1st Qu.: 16.47 1st Qu.:12.63 1st Qu.:11.462 1st Qu.:-0.08124
Median :168.30 Median :25.08 Median : 31.20 Median :14.64 Median :15.468 Median :-0.01436
Mean :165.75 Mean :27.78 Mean : 36.76 Mean :16.44 Mean :16.264 Mean : 0.00790
3rd Qu.:259.14 3rd Qu.:43.89 3rd Qu.: 48.03 3rd Qu.:19.68 3rd Qu.:19.818 3rd Qu.: 0.04947
Max. :340.32 Max. :59.52 Max. :136.80 Max. :30.60 Max. :28.331 Max. : 0.66102
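Beyond the column-wise summary above, a couple of aggregate error metrics give a one-number view of test performance:
df %>%
  summarise(rmse = sqrt(mean((predicted - sales)^2)), # root mean squared error
            mape = mean(abs(pred_error)))             # mean absolute percentage error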
df %>%
ggplot(aes(x = sales, y = pred_error))+
geom_point()+
scale_y_continuous(labels = scales::percent, breaks = pretty_breaks(6))+
theme_bw()