The aim of the two ANOVA labs is to explore one-way ANOVA as a generalised linear model.
In this lab we will work mostly through an example analysis using the linear model function, and in the next lab we will tackle the aov() ANOVA function.
(Example adapted from A. Field, “Discovering statistics using R”, Sage, chapter 10, p. 400)
The example compares the effect -- measured in some meaningful scale -- of administering a sugar pill to a patient (placebo condition, dose code = 1), or a low dose of a drug (dose code = 2) or a high dose of the same drug (dose code = 3).
Import the data to create a data frame using the following command:
data <- data.frame(person = c( 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15), dose = c( 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3), effect = c(3, 2, 1, 1, 4, 5, 2, 4, 2, 3, 7, 4, 5, 3, 6))
Then you can use:
data$dose <- factor(data$dose, levels=c(1,2,3), labels = c("Placebo", "Low_dose", "High_dose"))
And now a sanity check or two:
#View(data)
summary(data)
What do you think the factor() (or the related gl()) command did? Check the documentation for these functions.
- The command takes the data from the “dose” column and transforms it into a factor with three levels (1, 2 and 3) and three corresponding text labels (placebo=1, low dose=2, high dose=3). Having human-readable names for the factor levels will help in the interpretation later on.
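As a sketch of the gl() alternative mentioned above: gl() generates a factor directly, without converting an existing column. The arguments below assume (as is true for our data) that the groups sit in contiguous blocks of five rows:

```r
# gl(n, k, labels): n levels, each repeated in a block of k values.
# This reproduces the dose factor because our 15 rows are ordered as
# five placebo, five low-dose, five high-dose observations.
dose_gl <- gl(n = 3, k = 5, labels = c("Placebo", "Low_dose", "High_dose"))
dose_gl
identical(dose_gl, data$dose)  # should be TRUE for this data frame
```

Note that gl() only works here because the rows are sorted by group; factor() makes no such assumption.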
Using this command, calculate the mean effect for each group (placebo, low dose, high dose).
by(data = data[, "effect"], INDICES = data$dose, FUN = mean)
Then check the documentation of the by() function. What does by() do? If by() did not exist, how would you calculate the three means using three separate commands?
- The by() command applies the function specified in FUN to the data. More specifically, before applying the function, it groups the data according to the levels of the factor given in INDICES. So this command takes the data from the “effect” column of the data data frame, groups it according to the dose factor (levels 1, 2 and 3), applies the mean() function to level 1 (placebo), then to level 2 (low dose), then to level 3 (high dose), and prints the results:
means <- by(data = data[, "effect"], INDICES = data$dose, FUN = mean)
means
You can extract each individual value using the index of the value you require:
means[2]
Or its name:
names(means)
means["Low_dose"]
The by() command is equivalent to the three separate commands:
placebo_mean <- mean(c(3,2,1,1,4)) # 2.2
low_dose_mean <- mean(c(5,2,4,2,3)) # 3.2
high_dose_mean <- mean(c(7,4,5,3,6)) # 5.0
Do you know what the # character means here? It’s a code comment: anything following the # up to the end of the line is human-readable text that is not interpreted by the R language.
The by() command is also equivalent to the commands:
placebo_mean <- mean(data[data$dose=="Placebo", "effect"])
low_dose_mean <- mean(data[data$dose=="Low_dose", "effect"])
high_dose_mean <- mean(data[data$dose=="High_dose", "effect"])
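For comparison, the same three means can also be obtained in a single call with tapply(), which returns a plain named vector rather than the object produced by by() (a sketch, assuming the data frame built above):

```r
# tapply() groups its first argument by the factor and applies FUN to each group
means_t <- tapply(data$effect, data$dose, mean)
means_t
means_t[["High_dose"]]  # 5.0, matching high_dose_mean above
```

Which to use is largely a matter of taste; tapply() is convenient when you want the results as a vector for further computation.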
To make sure you have understood how the by() function works, use by() to calculate the standard deviation for each group (placebo, low dose, high dose). You can also use by() to run the pastecs package function stat.desc() that will output a variety of statistics for each group.
by(data = data[, "effect"], INDICES = data$dose, FUN = sd)
install.packages("pastecs")
library(pastecs)
by(data = data[, "effect"], INDICES = data$dose, FUN = stat.desc)
In order to run the linear regression, we need to tell R about our dummy coding. Add to your data frame the values for the dummy_low and dummy_high variables using the following commands:
data$dummy_low <- c(0,0,0,0,0,1,1,1,1,1,0,0,0,0,0)
data$dummy_high <- c(0,0,0,0,0,0,0,0,0,0,1,1,1,1,1)
print(data)
Now run an appropriate lm() command to predict the effect from the dummy_low and dummy_high variables. Call the resulting model drug_lm. What do the coefficient values represent? You can find the coefficients under the “Estimate” column in the summary(drug_lm).
Tip: to answer, examine the means you calculated above and consider the other exercises you did to glean the meaning of the intercept and slopes in the model.
You may have to run (though it is not strictly necessary here, since the lm() call below passes data = data):
attach(data)
Generate the linear model:
drug_lm <- lm(formula = effect ~ dummy_low + dummy_high , data = data)
summary(drug_lm)
The intercept (2.2) represents the mean effect for the placebo group.
Dummy_low = b1, represents the difference between the low-dose mean effect and the placebo mean effect: Placebo mean effect + b1 = 2.2 + 1 = 3.2 = low-dose mean effect.
Dummy_high = b2, represents the difference between the high-dose mean effect and the placebo mean effect: Placebo mean effect + b2 = 2.2 + 2.8 = 5.0 = high-dose mean effect.
Continue to use the problem set model. Interpret the significance of the model coefficients using their p values. Use a significance threshold of .05. What does it mean in practical terms?
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.2000 0.6272 3.508 0.00432 **
dummy_low 1.0000 0.8869 1.127 0.28158
dummy_high 2.8000 0.8869 3.157 0.00827 **
dummy_low is not significant (p value > .05) so the difference between the placebo group and the low dose group is not significant. However, dummy_high is significant (p value < .05) so the difference between the placebo group and the high dose group is indeed significant.
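If you want to pull the p values out programmatically instead of reading them off the printout, the coefficient matrix stored in the model summary can be indexed by column name (a sketch, assuming the drug_lm model fitted above):

```r
# summary(drug_lm)$coefficients is a matrix with columns
# "Estimate", "Std. Error", "t value" and "Pr(>|t|)"
coefs <- summary(drug_lm)$coefficients
p_values <- coefs[, "Pr(>|t|)"]
p_values < 0.05  # TRUE for (Intercept) and dummy_high, FALSE for dummy_low
```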
Observe the diagram. What do the following elements represent?
- The long black line at about 3.5 effect units
- The small circles, triangles and squares
- The light blue, medium blue and dark blue short horizontal lines
- The small thick double-headed arrow
- The big, thick double-headed arrow
- The long black line at about 3.5 effect units: is the grand mean (the overall mean of all effects for all 15 participants). It is the simplest model that we can fit to the data.
- The small circles, triangles and squares: represent individual data points, individual participant effects for placebo (circles), low dose (triangles) and high dose (squares)
- The light blue, medium blue and dark blue short horizontal lines: are the placebo effect mean, the low-dose mean and the high-dose mean
- The small thick double-headed arrow: is b1, the difference between the placebo group mean effect and the low-dose mean effect
- The big, thick double-headed arrow: is b2, the difference between the placebo group mean effect and the high-dose mean effect
So in correlational research, the regression coefficients represent the slope of the line, but in experimental research, they represent the differences between group means.
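You can verify this numerically: the fitted coefficients reproduce the group means we calculated earlier (a quick check, assuming drug_lm from above):

```r
b <- coef(drug_lm)
b[["(Intercept)"]]                      # 2.2 = placebo mean
b[["(Intercept)"]] + b[["dummy_low"]]   # 2.2 + 1.0 = 3.2 = low-dose mean
b[["(Intercept)"]] + b[["dummy_high"]]  # 2.2 + 2.8 = 5.0 = high-dose mean
```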
Let’s work on the sum of squares, to evaluate our model.
All sums of squares follow the same basic pattern:

SS = sum of (observed − model)²

What changes between the total sum of squares, the residual (or error) sum of squares and the model sum of squares is which data and which model are used in the basic equation above.
The simplest model that we can fit to the data is the grand mean. In other words, in the absence of other ideas, we can use the overall mean as a prediction of the value of the dependent (outcome) variable for a new data point. Obviously, if a model is good, it should do better than the grand mean, that is, it should be closer to the data points (have smaller overall deviation and smaller residuals).
We can compare the improvement in fit due to using our model (rather than the grand mean) to the error that still remains, since our model won’t usually be perfect. Another way of saying this is that when the grand mean is used as a model, there will be a certain amount of variation between the data and the grand mean. When a model is fitted it will explain some of this variation, but some will be left unexplained.
The F ratio is the ratio of the explained to unexplained variation. Look at the figure below (Figure 10.3 from the Field book) and name each sum of squares (put the name in the red rounded boxes).
Calculate the Total Sum of Squares for the drug data:
SS_tot <- sum((data$effect - mean(data$effect))^2)
SS_tot
SS_tot <- var(data$effect)*(length(data$effect)-1)
SS_tot
Now calculate the Model Sum of Squares:
Calculate the grand mean of all data, regardless of treatment:
grand_mean <- mean(data$effect)
print(grand_mean)
Get the data by treatment:
pl <- data[data$dose == "Placebo", "effect"]
low <- data[data$dose == "Low_dose", "effect"]
high <- data[data$dose == "High_dose", "effect"]
Calculate the sum of squares (one squared value for each treatment):
SS_model <- sum(length(pl)*((mean(pl)-grand_mean))^2, length(low)*((mean(low)-grand_mean))^2, length(high)*((mean(high)-grand_mean))^2)
SS_model
We saw that there are about 43.7 units of total variation; the model accounts for about 20.1 of them, so roughly half.
Now calculate the Residuals Sum of Squares.
pl <- data[data$dose == "Placebo", "effect"]
low <- data[data$dose == "Low_dose", "effect"]
high <- data[data$dose == "High_dose", "effect"]
SS_residual <- sum((pl-mean(pl))^2, (low-mean(low))^2, (high-mean(high))^2)
SS_residual
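With all three sums of squares in hand, we can assemble the F ratio by hand before letting aov() do it for us. Each SS is divided by its degrees of freedom (k − 1 = 2 for the model, with k = 3 groups; N − k = 12 for the residuals, with N = 15 observations) to give a mean square, and F is the ratio of the two (assuming SS_model and SS_residual from above):

```r
df_model <- 2                             # k - 1
df_residual <- 12                         # N - k
MS_model <- SS_model / df_model           # about 20.13 / 2  = 10.07
MS_residual <- SS_residual / df_residual  # about 23.60 / 12 = 1.97
F_ratio <- MS_model / MS_residual         # about 5.12
F_ratio
pf(F_ratio, df_model, df_residual, lower.tail = FALSE)  # p value, about .025
```

These values should match the F value and Pr(>F) reported by summary() on the aov model later in the lab.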
Let’s move on to the ANOVA analysis. The first step is to check the homogeneity-of-variance assumption with Levene’s test:
library(car)
leveneTest(y = data$effect, group = data$dose, center = median)
The Levene test is not significant (p value .89 >> .05), so we can proceed with our ANOVA (or linear model) analysis:
Before we create an ANOVA model, create a linear model using effect ~ dose as formula and examine the output.
drug_lm_2 <- lm(formula = effect ~ dose, data = data)
summary(drug_lm_2)
Create an ANOVA model with the same data and formula as the linear model.
drug_anova <- aov(formula = effect ~ dose, data = data)
drug_anova
summary(drug_anova)
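As a final check that aov() and lm() are the same model underneath, you can ask each object for the other’s style of output (assuming drug_lm_2 and drug_anova from above):

```r
# ANOVA table computed from the linear model:
# same Df, Sum Sq, F value and Pr(>F) as summary(drug_anova)
anova(drug_lm_2)

# Coefficient table computed from the aov model:
# same estimates and p values as summary(drug_lm_2)
summary.lm(drug_anova)
```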
Material adapted from:
•Discovering statistics using R. Authors: A. Field, J. Miles, Z. Field. Publisher: Sage, 2012
•The R book (second edition). Author: Michael J. Crawley. Publisher: Wiley, 2013
•Research methods and statistics 2, Tom Booths and Alex Doumas 2018.