Lab 1: T-test and simple linear regression¶

PS1¶

The t-test is any statistical hypothesis test in which the test statistic follows a Student's t-distribution under the null hypothesis.

A t-test is used to decide whether two sample means are significantly different
Observed = observed difference between sample means
Expected = expected difference by chance between population means, if the null hypothesis is true (null hypothesis = all the data comes from the same population)
Error = estimate of the standard error of the difference between two sample means
The t-test is a general linear model. Yep. Just like linear regression.

A t-test is the most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known. When the scaling term is unknown and is replaced by an estimate based on the data, the test statistics (under certain conditions) follow a Student's t distribution. The t-test can be used, for example, to determine if the means of two sets of data are significantly different from each other.

Calculate the mean anxiety for the picture group using R. By hand, plug the mean anxiety value for the picture group into the line equation.
Tip: ignore the error term, and remember that the picture group was given the group number zero. What is the intercept value for the line? (b0) What does the intercept represent, in term of the group data?

Create the dataframe using the following command:

Group <- gl(2, 12, labels = c("Picture", "Real Spider"))
Anxiety <- c(30, 35, 45, 40, 50, 35, 55, 25, 30, 45, 40, 50, 40, 35, 50, 55, 65, 55, 50, 35, 30, 50, 60, 39 )
spiderLong <- data.frame(Group, Anxiety)

Calculate the mean using the group name (“Picture”) to select the picture data, and the data frame column number (2nd column) to select the anxiety column:

meanPicture <- mean(spiderLong[Group=="Picture", 2])
meanPicture

Or using the dataframe column name (“Anxiety”):

meanPicture <- mean(spiderLong[Group=="Picture", "Anxiety"])
meanPicture

Complete the line equation for the picture group, using group number zero and the mean anxiety value for the picture group:

𝑎𝑛𝑥𝑖𝑒𝑡𝑦0 = 𝑏0 + 𝑏10

𝑎𝑛𝑥𝑖𝑒𝑡𝑦0 = 𝑚𝑒𝑎𝑛 𝑎𝑛𝑥𝑖𝑒𝑡𝑦 𝑓𝑜𝑟 𝑡ℎ𝑒 𝑝𝑖𝑐𝑡𝑢𝑟𝑒 𝑔𝑟𝑜𝑢𝑝 = 𝑏0 + 0

𝑀𝑒𝑎𝑛 𝑎𝑛𝑥𝑖𝑒𝑡𝑦 𝑓𝑜𝑟 𝑡ℎ𝑒 𝑝𝑖𝑐𝑡𝑢𝑟𝑒 𝑔𝑟𝑜𝑢𝑝 = 𝑏0

𝑏0 = 40

b0, the intercept, represents the average of the picture-only group.

Now calculate the mean anxiety for the real spider group using R. By hand, plug the mean anxiety value for the real-spider group into the line equation.
Tip: ignore the error term, and remember that the real-spider group was given the group number one. What is the slope of the line? (b1) What does it represent in terms of the two conditions? Calculate the mean of the real-spider data:

meanRealSpider <- mean(spiderLong[Group=="Real Spider", "Anxiety"])
meanRealSpider

𝑎𝑛𝑥𝑖𝑒𝑡𝑦1 = 𝑏0 + 𝑏11

𝑎𝑛𝑥𝑖𝑒𝑡𝑦1 = 𝑚𝑒𝑎𝑛 𝑎𝑛𝑥𝑖𝑒𝑡𝑦 𝑓𝑜𝑟 𝑡ℎ𝑒 𝑟𝑒𝑎𝑙𝑠𝑝𝑖𝑑𝑒𝑟 𝑔𝑟𝑜𝑢𝑝 = 40 + 𝑏11

𝑀𝑒𝑎𝑛 𝑎𝑛𝑥𝑖𝑒𝑡𝑦 𝑓𝑜𝑟 𝑡ℎ𝑒 𝑟𝑒𝑎𝑙𝑠𝑝𝑖𝑑𝑒𝑟 𝑔𝑟𝑜𝑢𝑝 = 𝑚𝑒𝑎𝑛 𝑎𝑛𝑥𝑖𝑒𝑡𝑦 𝑓𝑜𝑟 𝑡ℎ𝑒 𝑝𝑖𝑐𝑡𝑢𝑟𝑒 𝑔𝑟𝑜𝑢𝑝 + 𝑏1

47 = 40 + 𝑏1

𝑏1 = 47 − 40 = 7

𝑏1 = 𝑚𝑒𝑎𝑛 𝑟𝑒𝑎𝑙𝑠𝑝𝑖𝑑𝑒𝑟 − 𝑚𝑒𝑎𝑛 𝑝𝑖𝑐𝑡𝑢𝑟𝑒 = 7

b1, the slope, represents the difference between the real-spider group mean and the baseline (the picture-only group) mean.

Write, by hand, the full equation for the anxiety model, including the b coefficients.

𝑎𝑛𝑥𝑖𝑒𝑡𝑦𝑖 = 𝑏0 + 𝑏1𝑔𝑟𝑜𝑢𝑝𝑖

𝑎𝑛𝑥𝑖𝑒𝑡𝑦𝑖 = 40 + 7𝑔𝑟𝑜𝑢𝑝𝑖

Run a linear regression model over the data, using anxiety as the dependent variable, and the group as the independent variable.

spider_lm <- lm(spiderLong$Anxiety ~ spiderLong$Group)
spider_lm

Call:
lm(formula = spiderLong$Anxiety ~ spiderLong$Group)

Coefficients:
                (Intercept)  spiderLong$GroupReal Spider  
                         40                            7

Examine the model:

summary(spider_lm)

Call:
lm(formula = spiderLong$Anxiety ~ spiderLong$Group)

Residuals:
   Min     1Q Median     3Q    Max 
 -17.0   -8.5    1.5    8.0   18.0 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                   40.000      2.944  13.587 3.53e-12 ***
spiderLong$GroupReal Spider    7.000      4.163   1.681    0.107    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.2 on 22 degrees of freedom
Multiple R-squared:  0.1139,	Adjusted R-squared:  0.07359 
F-statistic: 2.827 on 1 and 22 DF,  p-value: 0.1068

Is the difference in anxiety significant between the picture-only and the real-spider groups?

  - No, the difference between groups is not significant if we set an alpha of 5%. The P value for the slope (spiderLong$GroupReal Spider) has a P value > 0.05.  
  - Note: you must decide your significance threshold before collecting your data.

Now run an independent t-test over the data, like your learnt in RMS1 lab 14 (revise it if needed). Examine and interpret the output. You can write the function input comparing a list x versus a list y:

t.test(Anxiety ~ Group, data = spiderLong, paired = FALSE, var.equal = TRUE, conf.level = 0.95)

	Two Sample t-test

data:  Anxiety by Group
t = -1.6813, df = 22, p-value = 0.1068
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -15.634222   1.634222
sample estimates:
    mean in group Picture mean in group Real Spider 
                       40                        47

PS2¶

Simple linear regression: A simple linear regression is a linear regression model with a single explanatory variable.

[From M. Crawley, “The R book” (second edition) Wiley p. 450] This example exercise uses data about the growth of caterpillars fed on experimental diets containing increasing quantities of tannin.

Create a dataframe using the command:

reg.data <- data.frame(growth=c(12,10,8,11,6,7,2,3,3), 
tannin=c(0,1,2,3,4,5,6,7,8))

Plot a scatterplot:

reg.data

plot(reg.data)

Calculate slope and intercept to 2 decimal digits using the lm() function. Tip: decide which variable you want to use as dependent variable. Growth or tannin? Here we are measuring growth, and we are observing how tannin influences it, so growth is the dependent variable (often called y) and tannin is the independent variable (often called x):

model <- lm(reg.data$growth ~ reg.data$tannin)
model

Call:
lm(formula = reg.data$growth ~ reg.data$tannin)

Coefficients:
    (Intercept)  reg.data$tannin  
         11.756           -1.217

Write the line equation for the linear model obtained with lm():

  𝑦 ≈ 11.76 − 1.22 𝑥

Plot the linear model over the scatterplot. Tip: use the abline() function.

plot(x = reg.data$tannin, y = reg.data$growth)
abline(model, col="red")

How can you extract more information about your linear model?
Tip: try to jot down a few possible commands now. You can use the function summary():

summary(model)

Call:
lm(formula = reg.data$growth ~ reg.data$tannin)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.4556 -0.8889 -0.2389  0.9778  2.8944 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      11.7556     1.0408  11.295 9.54e-06 ***
reg.data$tannin  -1.2167     0.2186  -5.565 0.000846 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.693 on 7 degrees of freedom
Multiple R-squared:  0.8157,	Adjusted R-squared:  0.7893 
F-statistic: 30.97 on 1 and 7 DF,  p-value: 0.0008461

And you could use the attributes() function to list the elements present inside the model object:

attributes(model)

model$residuals

model$residuals

You can also access the elements of the model summary using the dollar notation

attributes(summary(model))

So you can get, for example, the proportion of variance accounted for by the model using the notation:

summary(model)$r.squared

Get the amount of variance accounted for by your linear model (R squared) from the model object in R. Is the model a good model?

summary(model)$r.squared

The model accounts for about 80% of the variance in the data, but we need more information to judge the model. For example more information from the model plots (to see if the assumptions are upheld), from the F value for the regression sum of squares, from the confidence intervals for slope and intercept and so on.

Use anova() on the model to find out the value of SSR and SSM .
ANOVA and linear models are the same thing. You’ll later see that one or the other can be more practical.

anova(model)

The ANOVA output also includes the F ratio. Calculate the F ratio.

attributes(anova(model))

The data can be accessed using the row and column names, or using the row and column numbers:

anova(model)["reg.data$tannin","Mean Sq"]

anova(model)[1, 3]

anova(model)["Residuals","Mean Sq"]

anova(model)[2, 3]

Now you can calculate the F-ratio:

MSM <- anova(model)["reg.data$tannin","Mean Sq"]
MSR <- anova(model)["Residuals","Mean Sq"]
F_ratio <- MSM/MSR
F_ratio

Material adapted from:

•Discovering statistics using R. Authors: A. Field, J. Miles, Z. Field. Publisher: Sage, 2012

•The R book (second edition). Authors: Michael J. Crawley. Publisher: Wiley, 2013

•Research methods and statistics 2, Tom Booths and Alex Doumas 2018.

	Df	Sum Sq	Mean Sq	F value	Pr(>F)
	<int>	<dbl>	<dbl>	<dbl>	<dbl>
reg.data$tannin	1	88.81667	88.81667	30.97398	0.0008460738
Residuals	7	20.07222	2.86746	NA	NA