Lab 2: One-way ANOVA

The aim of the two ANOVA labs is to explore one-way ANOVA as a general linear model.
In this lab we will work mostly through an example analysis using the linear model function, and in the next lab we will tackle the aov() ANOVA function.

  • If you have only one independent (manipulated) variable and only two conditions (for example, no drug vs with drug) you can compare the means with a t-test.
  • If you have more than two conditions (for example no drug, a low dose of drug and a high dose of drug) or two or more independent variables, you cannot use multiple t-tests, as this inflates the familywise Type I error rate. You should use ANOVA or linear regression instead.
  • ANOVA (analysis of variance) and linear regression are both instances of the general linear model: they are mathematically the same thing.
  • For historical reasons ANOVA has become more popular in experimental research (controlled experiments) and regression more popular in correlational research (looking for relationships in real-world data), but they are calculated using the same R functions behind the scenes.
  • ANOVA produces a single F-statistic (or F-ratio) covering all comparisons at once: it is an omnibus test.
  • Just as the t-test can be represented by the linear regression equation (see previous lab), ANOVA can be represented by a multiple regression equation (in which the number of predictors is one less than the number of categories of the independent variable).
  • The F-ratio can be used both to assess differences between means and to test whether a regression model fits the data, because assessing differences between group means amounts to fitting a regression model that encodes those group means.
  • One-way ANOVA: you have a single independent variable with more than two levels (if you only had 2 levels, you would run a t-test…):
    • Example: studying the effectiveness of three types of pain reliever: aspirin, ibuprofen and gin
  • Two-way ANOVA: you have two or more independent variables (each with two or more conditions):
    • Example: studying pain relief based on pain reliever and type of pain
      • Factor A: Pain reliever (aspirin versus ibuprofen)
      • Factor B: type of pain (headache versus back pain)

PS 1

(Example adapted from A. Field, “Discovering statistics using R”, Sage, chapter 10, p. 400)

The example compares the effect -- measured in some meaningful scale -- of administering a sugar pill to a patient (placebo condition, dose code = 1), or a low dose of a drug (dose code = 2) or a high dose of the same drug (dose code = 3).

Import the data to create a data frame using the following command:

In [66]:
data <- data.frame(
    person = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15),
    dose   = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3),
    effect = c(3, 2, 1, 1, 4, 5, 2, 4, 2, 3, 7, 4, 5, 3, 6))

Then you can use:

In [67]:
data$dose <- factor(data$dose, levels=c(1,2,3), labels = c("Placebo", "Low_dose", "High_dose"))

And now a sanity check or two:

In [68]:
#View(data)

summary(data)
     person            dose       effect     
 Min.   : 1.0   Placebo  :5   Min.   :1.000  
 1st Qu.: 4.5   Low_dose :5   1st Qu.:2.000  
 Median : 8.0   High_dose:5   Median :3.000  
 Mean   : 8.0                 Mean   :3.467  
 3rd Qu.:11.5                 3rd Qu.:4.500  
 Max.   :15.0                 Max.   :7.000  

What do you think the factor() or gl() command did? Check the documentation for these functions.

- The command takes the data from the “dose” column and transforms it into a factor with three levels (1, 2 and 3) and three corresponding text labels (placebo=1, low dose=2, high dose=3). Having human-readable names for the factor levels will help in the interpretation later on.
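For comparison, gl() ("generate levels") can build an equivalent factor directly, without starting from a numeric column. A minimal sketch (dose_gl is our own name, not from the lab data):

```r
# gl(n, k) generates a factor with n levels, each repeated k times in a
# block -- exactly the pattern of the dose column (five 1s, five 2s, five 3s)
dose_gl <- gl(n = 3, k = 5, labels = c("Placebo", "Low_dose", "High_dose"))
dose_gl
```

See ?gl for the extra length argument, which recycles the pattern to a longer vector.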

PS 2

Using the by() command, calculate the mean effect for each group (placebo, low dose, high dose).

In [69]:
by(data = data[, "effect"], INDICES = data$dose, FUN = mean)
data$dose: Placebo
[1] 2.2
------------------------------------------------------------ 
data$dose: Low_dose
[1] 3.2
------------------------------------------------------------ 
data$dose: High_dose
[1] 5

Then check the documentation of the by() function. What does by() do? If by() did not exist, how would you calculate the three means using three separate commands?

- The by() command splits the data into groups and applies the function given as FUN to each group. More specifically, this command takes the data from the “effect” column of the data data frame, groups it according to the levels of the dose factor given as INDICES, and then applies the mean() function to level 1 (placebo), level 2 (low dose) and level 3 (high dose), printing the result for each:
In [70]:
means <- by(data = data[, "effect"], INDICES = data$dose, FUN = mean)
means
data$dose: Placebo
[1] 2.2
------------------------------------------------------------ 
data$dose: Low_dose
[1] 3.2
------------------------------------------------------------ 
data$dose: High_dose
[1] 5

You can extract each individual value using the index of the value you require:

In [71]:
means[2]
Low_dose: 3.2

Or its name:

In [72]:
names(means)
  1. 'Placebo'
  2. 'Low_dose'
  3. 'High_dose'
In [73]:
means["Low_dose"]
Low_dose: 3.2

The by() command is equivalent to the three separate commands:

In [74]:
placebo_mean <- mean(c(3,2,1,1,4)) # 2.2
low_dose_mean <- mean(c(5,2,4,2,3)) # 3.2
high_dose_mean <- mean(c(7,4,5,3,6)) # 5.0

Do you know what the # character means here? It’s a code comment: anything following the # up to the end of the line is human-readable text that is not interpreted by the R language.

The by() command is also equivalent to the commands:

In [75]:
placebo_mean <- mean(data[data$dose=="Placebo", "effect"])
low_dose_mean <- mean(data[data$dose=="Low_dose", "effect"])
high_dose_mean <- mean(data[data$dose=="High_dose", "effect"])
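Yet another base-R equivalent is tapply(), which returns the group means as a plain named vector rather than a by object. A standalone sketch (the data frame is rebuilt here so the block runs on its own):

```r
# Rebuild the lab data: three dose groups of five participants each
data <- data.frame(
    dose   = gl(3, 5, labels = c("Placebo", "Low_dose", "High_dose")),
    effect = c(3, 2, 1, 1, 4, 5, 2, 4, 2, 3, 7, 4, 5, 3, 6))

# tapply(values, grouping factor, function): the same grouping idea as by(),
# but the result is a simple named vector, handy for further calculation
group_means <- tapply(data$effect, data$dose, mean)
group_means  # Placebo 2.2, Low_dose 3.2, High_dose 5.0
```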

PS 3

To make sure you have understood how the by() function works, use by() to calculate the standard deviation for each group (placebo, low dose, high dose). You can also use by() to run the pastecs package function stat.desc() that will output a variety of statistics for each group.

In [76]:
by(data = data[, "effect"], INDICES = data$dose, FUN = sd)
data$dose: Placebo
[1] 1.30384
------------------------------------------------------------ 
data$dose: Low_dose
[1] 1.30384
------------------------------------------------------------ 
data$dose: High_dose
[1] 1.581139
In [77]:
install.packages("pastecs")
library(pastecs)
by(data = data[, "effect"], INDICES = data$dose, FUN = stat.desc)
Warning message:
"package 'pastecs' is in use and will not be installed"
data$dose: Placebo
     nbr.val     nbr.null       nbr.na          min          max        range 
   5.0000000    0.0000000    0.0000000    1.0000000    4.0000000    3.0000000 
         sum       median         mean      SE.mean CI.mean.0.95          var 
  11.0000000    2.0000000    2.2000000    0.5830952    1.6189318    1.7000000 
     std.dev     coef.var 
   1.3038405    0.5926548 
------------------------------------------------------------ 
data$dose: Low_dose
     nbr.val     nbr.null       nbr.na          min          max        range 
   5.0000000    0.0000000    0.0000000    2.0000000    5.0000000    3.0000000 
         sum       median         mean      SE.mean CI.mean.0.95          var 
  16.0000000    3.0000000    3.2000000    0.5830952    1.6189318    1.7000000 
     std.dev     coef.var 
   1.3038405    0.4074502 
------------------------------------------------------------ 
data$dose: High_dose
     nbr.val     nbr.null       nbr.na          min          max        range 
   5.0000000    0.0000000    0.0000000    3.0000000    7.0000000    4.0000000 
         sum       median         mean      SE.mean CI.mean.0.95          var 
  25.0000000    5.0000000    5.0000000    0.7071068    1.9632432    2.5000000 
     std.dev     coef.var 
   1.5811388    0.3162278 
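Two of the stat.desc() columns can be checked by hand: SE.mean is the standard deviation divided by the square root of n, and CI.mean.0.95 is that standard error multiplied by the 97.5% t quantile. A quick check for the placebo group:

```r
pl <- c(3, 2, 1, 1, 4)                     # placebo effect scores
se <- sd(pl) / sqrt(length(pl))            # standard error of the mean
ci <- qt(0.975, df = length(pl) - 1) * se  # half-width of the 95% CI
round(c(SE.mean = se, CI.mean.0.95 = ci), 7)  # 0.5830952, 1.6189318
```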

PS 9

In order to run the linear regression, we need to tell R about our dummy coding. Add to your data frame the values for the dummy_low and dummy_high variables using the following commands:

In [78]:
data$dummy_low <- c(0,0,0,0,0,1,1,1,1,1,0,0,0,0,0)
data$dummy_high <- c(0,0,0,0,0,0,0,0,0,0,1,1,1,1,1)
print(data)
   person      dose effect dummy_low dummy_high
1       1   Placebo      3         0          0
2       2   Placebo      2         0          0
3       3   Placebo      1         0          0
4       4   Placebo      1         0          0
5       5   Placebo      4         0          0
6       6  Low_dose      5         1          0
7       7  Low_dose      2         1          0
8       8  Low_dose      4         1          0
9       9  Low_dose      2         1          0
10     10  Low_dose      3         1          0
11     11 High_dose      7         0          1
12     12 High_dose      4         0          1
13     13 High_dose      5         0          1
14     14 High_dose      3         0          1
15     15 High_dose      6         0          1

PS 10

Now run an appropriate lm() command to predict the effect from the dummy_low and dummy_high variables. Call the resulting model drug_lm. What do the coefficient values represent? You can find the coefficients under the “Estimate” column in the summary(drug_lm).

Tip: to answer, examine the means you calculated above and consider the other exercises you did to glean the meaning of the intercept and slopes in the model.

You may have to run:

In [79]:
attach(data)
The following objects are masked from data (pos = 6):

    dose, dummy_high, dummy_low, effect, person


Generate the linear model:

In [80]:
drug_lm <- lm(formula = effect ~ dummy_low + dummy_high , data = data)
summary(drug_lm)
Call:
lm(formula = effect ~ dummy_low + dummy_high, data = data)

Residuals:
   Min     1Q Median     3Q    Max 
  -2.0   -1.2   -0.2    0.9    2.0 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   2.2000     0.6272   3.508  0.00432 **
dummy_low     1.0000     0.8869   1.127  0.28158   
dummy_high    2.8000     0.8869   3.157  0.00827 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.402 on 12 degrees of freedom
Multiple R-squared:  0.4604,	Adjusted R-squared:  0.3704 
F-statistic: 5.119 on 2 and 12 DF,  p-value: 0.02469

The intercept (2.2) represents the mean effect for the placebo group.

Dummy_low = b1, represents the difference between the low-dose mean effect and the placebo mean effect: Placebo mean effect + b1 = 2.2 + 1 = 3.2 = low-dose mean effect.

Dummy_high = b2, represents the difference between the high-dose mean effect and the placebo mean effect: Placebo mean effect + b2 = 2.2 + 2.8 = 5.0 = high-dose mean effect.
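You can confirm this numerically: the fitted coefficients reproduce the three group means exactly. A standalone sketch of the same model (the data frame is rebuilt so the block runs on its own):

```r
# Rebuild the lab data with the two dummy predictors
data <- data.frame(
    effect     = c(3, 2, 1, 1, 4, 5, 2, 4, 2, 3, 7, 4, 5, 3, 6),
    dummy_low  = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0),
    dummy_high = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1))

drug_lm <- lm(effect ~ dummy_low + dummy_high, data = data)
b <- coef(drug_lm)
b["(Intercept)"]                    # placebo mean: 2.2
b["(Intercept)"] + b["dummy_low"]   # low-dose mean: 2.2 + 1.0 = 3.2
b["(Intercept)"] + b["dummy_high"]  # high-dose mean: 2.2 + 2.8 = 5.0
```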

PS 11

Continue with the same model. Interpret the significance of the model coefficients using their p values, with a significance threshold of .05. What do they mean in practical terms?

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   2.2000     0.6272   3.508  0.00432 **
dummy_low     1.0000     0.8869   1.127  0.28158   
dummy_high    2.8000     0.8869   3.157  0.00827 **

dummy_low is not significant (p value > .05) so the difference between the placebo group and the low dose group is not significant. However, dummy_high is significant (p value < .05) so the difference between the placebo group and the high dose group is indeed significant.

PS 12

Observe the diagram. What do the following elements represent?

  • The long black line at about 3.5 effect units
  • The small circles, triangles and squares
  • The light blue, medium blue and dark blue short horizontal lines
  • The small thick double-headed arrow
  • The big, thick double-headed arrow

- The long black line at about 3.5 effect units: is the grand mean (the overall mean of all effects for all 15 participants). It is the simplest model that we can fit to the data.
- The small circles, triangles and squares: represent individual data points, individual participant effects for placebo (circles), low dose (triangles) and high dose (squares)
- The light blue, medium blue and dark blue short horizontal lines: are the placebo effect mean, the low-dose mean and the high-dose mean
- The small thick double-headed arrow: is b1, the difference between the placebo group mean effect and the low-dose mean effect
- The big, thick double-headed arrow: is b2, the difference between the placebo group mean effect and the high-dose mean effect

So in correlational research, the regression coefficients represent the slope of the line, but in experimental research, they represent the differences between group means.

Let’s work on the sum of squares, to evaluate our model.

All sums of squares follow the same basic pattern:

    deviation = Σ(observed − model)²

What changes between the total sum of squares, the residual (or error) sum of squares and the model sum of squares is which data and which model are used in this basic equation.
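The shared pattern is small enough to write as a one-line helper (a sketch; ss is our own name, not a built-in):

```r
# deviation = sum of (observed - model)^2, for any choice of model values
ss <- function(observed, model) sum((observed - model)^2)

effect <- c(3, 2, 1, 1, 4, 5, 2, 4, 2, 3, 7, 4, 5, 3, 6)
ss(effect, mean(effect))  # total SS: the grand mean as the model
```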

The simplest model that we can fit to the data is the grand mean. In other words, in the absence of other ideas, we can use the overall mean as a prediction of the value of the dependent (outcome) variable for a new data point. Obviously, if a model is good, it should be better than the grand mean, that is, it should be closer to the data points (have smaller overall deviation and smaller residuals).

We can compare the improvement in fit due to using our model (rather than the grand mean) to the error that still remains, since our model won’t usually be perfect. Another way of saying this is that when the grand mean is used as a model, there will be a certain amount of variation between the data and the grand mean. When a model is fitted it will explain some of this variation, but some will be left unexplained.

The F ratio is the ratio of the explained to unexplained variation. Look at the figure below (Figure 10.3 from the Field book) and name each sum of squares (put the name in the red rounded boxes).

Calculate the Total Sum of Squares for the drug data:

In [81]:
SS_tot <- sum((data$effect - mean(data$effect))^2)
SS_tot
43.7333333333333
In [82]:
SS_tot <- var(data$effect)*(length(data$effect)-1)
SS_tot
43.7333333333333

Now calculate the Model Sum of Squares:

Calculate the grand mean of all data, regardless of treatment:

In [83]:
grand_mean <- mean(data$effect)
print(grand_mean)
[1] 3.466667

Get the data by treatment:

In [84]:
pl <- data[data$dose == "Placebo", "effect"]
low <- data[data$dose == "Low_dose", "effect"]
high <- data[data$dose == "High_dose", "effect"]

Calculate the model sum of squares (one squared deviation from the grand mean, weighted by group size, for each treatment):

In [85]:
SS_model <- sum(length(pl)*((mean(pl)-grand_mean))^2, length(low)*((mean(low)-grand_mean))^2, length(high)*((mean(high)-grand_mean))^2)
SS_model
20.1333333333333

We saw that there are about 43 units of variation in total; the model accounts for about 20 of them, so roughly half.

Now calculate the Residual Sum of Squares.

In [86]:
pl <- data[data$dose == "Placebo", "effect"]
low <- data[data$dose == "Low_dose", "effect"]
high <- data[data$dose == "High_dose", "effect"]

SS_residual <- sum((pl-mean(pl))^2, (low-mean(low))^2, (high-mean(high))^2)

SS_residual
23.6
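With all three sums of squares in hand, the F-ratio follows directly: divide each SS by its degrees of freedom (k − 1 = 2 for the model, N − k = 12 for the residuals) to get mean squares, then take their ratio. A standalone sketch:

```r
effect <- c(3, 2, 1, 1, 4, 5, 2, 4, 2, 3, 7, 4, 5, 3, 6)
dose   <- gl(3, 5, labels = c("Placebo", "Low_dose", "High_dose"))

grand_mean  <- mean(effect)
group_means <- tapply(effect, dose, mean)
SS_model    <- sum(5 * (group_means - grand_mean)^2)  # 20.13
SS_residual <- sum((effect - group_means[dose])^2)    # 23.60

MS_model    <- SS_model / 2        # df = k - 1 = 2
MS_residual <- SS_residual / 12    # df = N - k = 12
F_ratio     <- MS_model / MS_residual
c(F = F_ratio, p = pf(F_ratio, 2, 12, lower.tail = FALSE))
```

This reproduces the F-statistic of 5.119 on 2 and 12 degrees of freedom reported by summary(drug_lm) above.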

Let’s plan an ANOVA analysis. The steps are, roughly:

  • Enter the data (done)
  • Compute Levene’s test to check for homogeneity of variance
  • Compute the basic ANOVA (or linear model)
  • Compute contrasts to see which group differences are significant (advanced topic, not covered in this lab)
  • Compute post-hoc tests (advanced topic, not covered in this lab)
In [87]:
library(car)
leveneTest(y = data$effect, group = data$dose, center = median)
Levene's Test for Homogeneity of Variance (center = median)
       Df   F value    Pr(>F)
group   2 0.1176471 0.8900225
       12

The Levene test is not significant (p value .89 >> .05), so we can proceed with our ANOVA (or linear model) analysis:

Before we create an ANOVA model, create a linear model using effect ~ dose as formula and examine the output.

In [88]:
drug_lm_2 <- lm(formula = effect ~ dose, data = data)
summary(drug_lm_2)
Call:
lm(formula = effect ~ dose, data = data)

Residuals:
   Min     1Q Median     3Q    Max 
  -2.0   -1.2   -0.2    0.9    2.0 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)   
(Intercept)     2.2000     0.6272   3.508  0.00432 **
doseLow_dose    1.0000     0.8869   1.127  0.28158   
doseHigh_dose   2.8000     0.8869   3.157  0.00827 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.402 on 12 degrees of freedom
Multiple R-squared:  0.4604,	Adjusted R-squared:  0.3704 
F-statistic: 5.119 on 2 and 12 DF,  p-value: 0.02469

Create an ANOVA model with the same data and formula as the linear model.

In [89]:
drug_anova <- aov(formula = effect ~ dose, data = data)

drug_anova

summary(drug_anova)
Call:
   aov(formula = effect ~ dose, data = data)

Terms:
                    dose Residuals
Sum of Squares  20.13333  23.60000
Deg. of Freedom        2        12

Residual standard error: 1.402379
Estimated effects may be unbalanced
            Df Sum Sq Mean Sq F value Pr(>F)  
dose         2  20.13  10.067   5.119 0.0247 *
Residuals   12  23.60   1.967                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
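The same F table can also be obtained from the linear model itself with anova(), confirming that lm() and aov() fit the identical model. A standalone sketch (the data frame is rebuilt so the block runs on its own):

```r
data <- data.frame(
    dose   = gl(3, 5, labels = c("Placebo", "Low_dose", "High_dose")),
    effect = c(3, 2, 1, 1, 4, 5, 2, 4, 2, 3, 7, 4, 5, 3, 6))

drug_lm_2 <- lm(effect ~ dose, data = data)
anova(drug_lm_2)  # same SS (20.13 / 23.60), same F (5.119), same p (0.0247)
```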

Material adapted from:

• Discovering Statistics Using R. Authors: A. Field, J. Miles, Z. Field. Publisher: Sage, 2012.

• The R Book (second edition). Author: Michael J. Crawley. Publisher: Wiley, 2013.

• Research Methods and Statistics 2. Authors: Tom Booths and Alex Doumas, 2018.
