In this lab you will build a linear model that takes into account both advertising and airplay to predict album sales. First we fit a model using a set of known values of sales, advertising and airplay, and from that fit we derive the values of b0, b1 and b2 (the intercept and the two slopes). We then need to know the advertising budget spent on the album of interest and how many times it was played on the radio. With these five values (b0, b1, b2, and the advertising and airplay for the album) we can produce a prediction of how much the album will sell.
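As a sketch of that prediction step, assuming the coefficients have already been estimated (all the values below are placeholders for illustration, not estimates from any real fit):
b0 <- 100            # hypothetical intercept
b1 <- 0.08           # hypothetical slope for advertising
b2 <- 3.5            # hypothetical slope for airplay
new_adverts <- 500   # hypothetical advertising budget for the album of interest
new_airplay <- 30    # hypothetical number of radio plays
b0 + b1 * new_adverts + b2 * new_airplay   # predicted sales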
sales <- read.delim("path/sales.dat")
print(sales)
Examine your data. Are there data points that require special attention? How can you find out?
The function summary() is a good starting point.
summary(sales)
The output highlights that sales is a factor (it should be an integer, since it is the number of albums sold), and that the airplay and attract columns contain NA values. Also, the maximum value of airplay looks suspiciously large (it is unlikely that any song has been played on the radio 99 million times in a week, even if it may feel like it for some pop hits).
Check if your data frame contains NA values.
To find which rows contain NA values you can use the command:
which(is.na(sales$airplay))
which(is.na(sales$attract))
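If you want a quick count of missing values per column (a small extra check, not required by the lab), you can combine is.na() with colSums():
colSums(is.na(sales))   # number of NA values in each column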
To remove a row by index:
sales <- sales[-9, ]
To find and remove all rows containing NA values (all in one go):
sales <- sales[complete.cases(sales), ]
print(sales)
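An equivalent shortcut, if you prefer, is na.omit(), which also drops every row containing at least one NA and would have achieved the same result as the complete.cases() line above:
na.omit(sales)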
Check the type of each column:
sapply(sales, class)
Now, correct the data and make sure the column type is appropriate for the analysis. First convert the column to character, then to numeric (forcing a column into another data type is called coercion):
converted_sales_column <- as.numeric(as.character(sales$sales))
Then find the NA values indices:
which(is.na(converted_sales_column))
Or, in one long command:
which(is.na(as.numeric(as.character(sales$sales))))
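The intermediate as.character() step matters: calling as.numeric() directly on a factor returns the internal level codes, not the values you see printed. A small illustration, with a made-up factor unrelated to the sales data:
x <- factor(c("10", "200", "30"))
as.numeric(x)                  # returns 1 2 3 (the level codes)
as.numeric(as.character(x))    # returns 10 200 30 (the actual values)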
Or you can
- Coerce the column from factor to integer (not numeric, as there cannot be decimal places in the number of albums sold, in this example you cannot sell a third of an album)
- Correct the now missing (NA) cell value in row 28
- Put the column back into the data frame
corrected_sales <- as.integer(as.character(sales$sales))
corrected_sales
corrected_sales[28] <- 249
sales$sales <- corrected_sales
print(sales)
Now that the data is technically correct, let's make it consistent as well, that is, let's keep only data points that are meaningful for this domain.
Remove the rows containing dramatic outliers.
plot(sales$adverts, sales$airplay)
You can find the data point by selecting any row with high airplay:
which(sales$airplay > 50)
which(sales$airplay > 100)
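A boxplot is another quick way to make extreme values stand out (optional, just for visual confirmation):
boxplot(sales$airplay)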
Remove row 18:
sales <- sales[-18, ]
We had 203 records (data frame rows) initially; now we have 201 left:
length(sales$airplay)
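Equivalently, nrow() gives the number of rows of the whole data frame:
nrow(sales)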
Let's have a look at the plots:
plot(sales$adverts, sales$airplay)
plot(sales$sales, sales$airplay)
Generate in R a multiple linear model to predict sales from airplay and adverts.
sales_model <- lm(sales ~ adverts + airplay, data = sales)
summary(sales_model)
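Once the model is fitted you can use predict() to obtain the forecast described at the start of the lab. The budget and airplay figures below are hypothetical, just to show the mechanics:
new_album <- data.frame(adverts = 500, airplay = 30)
predict(sales_model, newdata = new_album)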
Generate a model with only one predictor, advertisement:
sales_model_1var <- lm(sales ~ adverts, data = sales)
Now update the model so that it keeps the same outcome and predictors as the input model (this is what the .~. notation means; it is not a frowning emoticon) and add the "airplay" and "attract" predictors:
sales_model_3var <- update(object = sales_model_1var, .~. + airplay + attract)
Now compare the two models. The 1var model contains only adverts:
summary(sales_model_1var)
While the 3var model also contains airplay and attractiveness:
summary(sales_model_3var)
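Because the 1var model is nested within the 3var model (it uses a subset of the same predictors on the same data), you can also compare them formally with an F-test via anova():
anova(sales_model_1var, sales_model_3var)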
The adverts-only model accounts for about 32.7% of the variation. The three-predictor model accounts for 65.9% of the variation. Comparing the R squared to the adjusted R squared can give us an idea of how well a model generalises. Ideally, we want the R squared and adjusted R squared values to be close. This means that the amount of variation accounted for does not shrink too much when we move from a model derived from a sample (R squared) to a model derived – ideally – from the population (adjusted R squared). Luckily, the variance accounted for does not shrink much for either model (from 33.1% to 32.7% and from 66.4% to 65.9%).
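If you want to extract these values directly rather than reading them off the printed summaries, they are stored in the summary objects:
summary(sales_model_1var)$r.squared
summary(sales_model_1var)$adj.r.squared
summary(sales_model_3var)$r.squared
summary(sales_model_3var)$adj.r.squared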
Adapted from:
Field, A., Miles, J., & Field, Z. (2012). Discovering Statistics Using R. Sage. (Chapter 7, Multiple regression, pp. 261-311).
Booths, T., & Doumas, A. (2018). Research Methods and Statistics 2.