Lab 1: T-test and simple linear regression

PS1

The t-test is any statistical hypothesis test in which the test statistic follows a Student's t-distribution under the null hypothesis.

  • A t-test is used to decide whether two sample means are significantly different
  • Observed = observed difference between sample means
  • Expected = expected difference by chance between population means, if the null hypothesis is true (null hypothesis = all the data comes from the same population)
  • Error = estimate of the standard error of the difference between two sample means
  • The t-test is a general linear model. Yep. Just like linear regression.

A t-test is most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known. When the scaling term is unknown and is replaced by an estimate based on the data, the test statistic (under certain conditions) follows a Student's t-distribution. The t-test can be used, for example, to determine whether the means of two sets of data are significantly different from each other.
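The Observed/Expected/Error recipe above can be computed by hand. As a sketch, using the spider-anxiety data introduced below (the vector names picture and real are ours, for illustration), the independent-samples t statistic with equal variances assumed is the difference between sample means divided by the pooled standard error of that difference:

```r
# Hand-compute the independent-samples t statistic (equal variances assumed)
picture <- c(30, 35, 45, 40, 50, 35, 55, 25, 30, 45, 40, 50)
real    <- c(40, 35, 50, 55, 65, 55, 50, 35, 30, 50, 60, 39)
n1 <- length(picture); n2 <- length(real)

# Observed = difference between sample means (Expected is 0 under the null)
diff_means <- mean(picture) - mean(real)

# Error = standard error of the difference, based on the pooled variance
pooled_var <- ((n1 - 1) * var(picture) + (n2 - 1) * var(real)) / (n1 + n2 - 2)
se_diff <- sqrt(pooled_var * (1 / n1 + 1 / n2))

t_stat <- diff_means / se_diff
round(t_stat, 4)   # -1.6813
```

This matches the t.test() output shown later in this lab.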

Calculate the mean anxiety for the picture group using R. By hand, plug the mean anxiety value for the picture group into the line equation.
Tip: ignore the error term, and remember that the picture group was given the group number zero. What is the intercept value for the line (b0)? What does the intercept represent, in terms of the group data?

Create the dataframe using the following command:

In [1]:
Group <- gl(2, 12, labels = c("Picture", "Real Spider"))
Anxiety <- c(30, 35, 45, 40, 50, 35, 55, 25, 30, 45, 40, 50, 40, 35, 50, 55, 65, 55, 50, 35, 30, 50, 60, 39 )
spiderLong <- data.frame(Group, Anxiety)

Calculate the mean using the group name ("Picture") to select the picture data, and the data frame column number (2nd column) to select the anxiety column:

In [2]:
meanPicture <- mean(spiderLong[Group=="Picture", 2])
meanPicture
40

Or using the dataframe column name ("Anxiety"):

In [3]:
meanPicture <- mean(spiderLong[Group=="Picture", "Anxiety"])
meanPicture
40

Complete the line equation for the picture group, using group number zero and the mean anxiety value for the picture group:

π‘Žπ‘›π‘₯𝑖𝑒𝑑𝑦0 = 𝑏0 + 𝑏10

π‘Žπ‘›π‘₯𝑖𝑒𝑑𝑦0 = π‘šπ‘’π‘Žπ‘› π‘Žπ‘›π‘₯𝑖𝑒𝑑𝑦 π‘“π‘œπ‘Ÿ π‘‘β„Žπ‘’ π‘π‘–π‘π‘‘π‘’π‘Ÿπ‘’ π‘”π‘Ÿπ‘œπ‘’π‘ = 𝑏0 + 0

π‘€π‘’π‘Žπ‘› π‘Žπ‘›π‘₯𝑖𝑒𝑑𝑦 π‘“π‘œπ‘Ÿ π‘‘β„Žπ‘’ π‘π‘–π‘π‘‘π‘’π‘Ÿπ‘’ π‘”π‘Ÿπ‘œπ‘’π‘ = 𝑏0

𝑏0 = 40

b0, the intercept, represents the mean anxiety of the picture-only (baseline) group.

Now calculate the mean anxiety for the real spider group using R. By hand, plug the mean anxiety value for the real-spider group into the line equation.
Tip: ignore the error term, and remember that the real-spider group was given the group number one. What is the slope of the line (b1)? What does it represent in terms of the two conditions? Calculate the mean of the real-spider data:

In [4]:
meanRealSpider <- mean(spiderLong[Group=="Real Spider", "Anxiety"])
meanRealSpider
47

π‘Žπ‘›π‘₯𝑖𝑒𝑑𝑦1 = 𝑏0 + 𝑏11

π‘Žπ‘›π‘₯𝑖𝑒𝑑𝑦1 = π‘šπ‘’π‘Žπ‘› π‘Žπ‘›π‘₯𝑖𝑒𝑑𝑦 π‘“π‘œπ‘Ÿ π‘‘β„Žπ‘’ π‘Ÿπ‘’π‘Žπ‘™π‘ π‘π‘–π‘‘π‘’π‘Ÿ π‘”π‘Ÿπ‘œπ‘’π‘ = 40 + 𝑏11

π‘€π‘’π‘Žπ‘› π‘Žπ‘›π‘₯𝑖𝑒𝑑𝑦 π‘“π‘œπ‘Ÿ π‘‘β„Žπ‘’ π‘Ÿπ‘’π‘Žπ‘™π‘ π‘π‘–π‘‘π‘’π‘Ÿ π‘”π‘Ÿπ‘œπ‘’π‘ = π‘šπ‘’π‘Žπ‘› π‘Žπ‘›π‘₯𝑖𝑒𝑑𝑦 π‘“π‘œπ‘Ÿ π‘‘β„Žπ‘’ π‘π‘–π‘π‘‘π‘’π‘Ÿπ‘’ π‘”π‘Ÿπ‘œπ‘’π‘ + 𝑏1

47 = 40 + 𝑏1

𝑏1 = 47 βˆ’ 40 = 7

𝑏1 = π‘šπ‘’π‘Žπ‘› π‘Ÿπ‘’π‘Žπ‘™π‘ π‘π‘–π‘‘π‘’π‘Ÿ βˆ’ π‘šπ‘’π‘Žπ‘› π‘π‘–π‘π‘‘π‘’π‘Ÿπ‘’ = 7

b1, the slope, represents the difference between the real-spider group mean and the baseline (the picture-only group) mean.

Write, by hand, the full equation for the anxiety model, including the b coefficients.

π‘Žπ‘›π‘₯𝑖𝑒𝑑𝑦𝑖 = 𝑏0 + 𝑏1π‘”π‘Ÿπ‘œπ‘’π‘π‘–

π‘Žπ‘›π‘₯𝑖𝑒𝑑𝑦𝑖 = 40 + 7π‘”π‘Ÿπ‘œπ‘’π‘π‘–

Run a linear regression model over the data, using anxiety as the dependent variable, and the group as the independent variable.

In [5]:
spider_lm <- lm(spiderLong$Anxiety ~ spiderLong$Group)
spider_lm
Call:
lm(formula = spiderLong$Anxiety ~ spiderLong$Group)

Coefficients:
                (Intercept)  spiderLong$GroupReal Spider  
                         40                            7  

Examine the model:

In [6]:
summary(spider_lm)
Call:
lm(formula = spiderLong$Anxiety ~ spiderLong$Group)

Residuals:
   Min     1Q Median     3Q    Max 
 -17.0   -8.5    1.5    8.0   18.0 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                   40.000      2.944  13.587 3.53e-12 ***
spiderLong$GroupReal Spider    7.000      4.163   1.681    0.107    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.2 on 22 degrees of freedom
Multiple R-squared:  0.1139,	Adjusted R-squared:  0.07359 
F-statistic: 2.827 on 1 and 22 DF,  p-value: 0.1068

Is the difference in anxiety significant between the picture-only and the real-spider groups?

  - No: with an alpha of 5%, the difference between groups is not significant, because the p-value for the slope (spiderLong$GroupReal Spider) is greater than 0.05 (p = 0.107).
  - Note: you must decide your significance threshold before collecting your data. 

Now run an independent t-test over the data, as you learnt in RMS1 lab 14 (revise it if needed). Examine and interpret the output. You can call t.test() with two vectors x and y, or, as here, with a formula:

In [7]:
t.test(Anxiety ~ Group, data = spiderLong, paired = FALSE, var.equal = TRUE, conf.level = 0.95)
	Two Sample t-test

data:  Anxiety by Group
t = -1.6813, df = 22, p-value = 0.1068
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -15.634222   1.634222
sample estimates:
    mean in group Picture mean in group Real Spider 
                       40                        47 
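The t-test and the regression really are the same model: the squared t statistic equals the regression F statistic, and both p-values are 0.1068. A sketch of that check, rebuilding the data so it runs on its own:

```r
# Rebuild the data, then compare t^2 (from t.test) with F (from lm)
Group <- gl(2, 12, labels = c("Picture", "Real Spider"))
Anxiety <- c(30, 35, 45, 40, 50, 35, 55, 25, 30, 45, 40, 50,
             40, 35, 50, 55, 65, 55, 50, 35, 30, 50, 60, 39)
spiderLong <- data.frame(Group, Anxiety)

t_out  <- t.test(Anxiety ~ Group, data = spiderLong, var.equal = TRUE)
lm_out <- lm(Anxiety ~ Group, data = spiderLong)

round(unname(t_out$statistic)^2, 3)                    # 2.827
round(unname(summary(lm_out)$fstatistic["value"]), 3)  # 2.827
```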

PS2

Simple linear regression: A simple linear regression is a linear regression model with a single explanatory variable.

[From M. Crawley, "The R book" (second edition), Wiley, p. 450] This example exercise uses data about the growth of caterpillars fed on experimental diets containing increasing quantities of tannin.

Create a dataframe using the command:

In [8]:
reg.data <- data.frame(growth=c(12,10,8,11,6,7,2,3,3), 
tannin=c(0,1,2,3,4,5,6,7,8))

Inspect the data frame, then plot a scatterplot:

In [9]:
reg.data
A data.frame: 9 × 2
  growth tannin
   <dbl>  <dbl>
      12      0
      10      1
       8      2
      11      3
       6      4
       7      5
       2      6
       3      7
       3      8
In [10]:
plot(reg.data)

Calculate the slope and intercept to 2 decimal digits using the lm() function. Tip: decide which variable you want to use as the dependent variable: growth or tannin? Here we are measuring growth and observing how tannin influences it, so growth is the dependent variable (often called y) and tannin is the independent variable (often called x):

In [11]:
model <- lm(reg.data$growth ~ reg.data$tannin)
model
Call:
lm(formula = reg.data$growth ~ reg.data$tannin)

Coefficients:
    (Intercept)  reg.data$tannin  
         11.756           -1.217  

Write the line equation for the linear model obtained with lm():

  y ≈ 11.76 − 1.22 x
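The coefficients reported by lm() can also be reproduced by hand from the least-squares formulas, b1 = cov(x, y) / var(x) and b0 = mean(y) − b1 × mean(x). A sketch:

```r
tannin <- c(0, 1, 2, 3, 4, 5, 6, 7, 8)
growth <- c(12, 10, 8, 11, 6, 7, 2, 3, 3)

# Least-squares slope and intercept
b1 <- cov(tannin, growth) / var(tannin)
b0 <- mean(growth) - b1 * mean(tannin)
round(c(intercept = b0, slope = b1), 2)   # 11.76 -1.22
```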

Plot the linear model over the scatterplot. Tip: use the abline() function.

In [12]:
plot(x = reg.data$tannin, y = reg.data$growth)
abline(model, col="red")

How can you extract more information about your linear model?
Tip: try to jot down a few possible commands now. You can use the function summary():

In [13]:
summary(model)
Call:
lm(formula = reg.data$growth ~ reg.data$tannin)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.4556 -0.8889 -0.2389  0.9778  2.8944 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      11.7556     1.0408  11.295 9.54e-06 ***
reg.data$tannin  -1.2167     0.2186  -5.565 0.000846 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.693 on 7 degrees of freedom
Multiple R-squared:  0.8157,	Adjusted R-squared:  0.7893 
F-statistic: 30.97 on 1 and 7 DF,  p-value: 0.0008461

And you could use the attributes() function to list the elements present inside the model object:

In [14]:
attributes(model)
$names
  1. 'coefficients'
  2. 'residuals'
  3. 'effects'
  4. 'rank'
  5. 'fitted.values'
  6. 'assign'
  7. 'qr'
  8. 'df.residual'
  9. 'xlevels'
  10. 'call'
  11. 'terms'
  12. 'model'
$class
'lm'

Once you know the attribute name, you can get each attribute value using the dollar notation ($):

In [15]:
model$residuals
         1          2          3          4          5          6          7          8          9 
 0.2444444 -0.5388889 -1.3222222  2.8944444 -0.8888889  1.3277778 -2.4555556 -0.2388889  0.9777778 

You can also access the elements of the model summary using the dollar notation:

In [16]:
attributes(summary(model))
$names
  1. 'call'
  2. 'terms'
  3. 'residuals'
  4. 'coefficients'
  5. 'aliased'
  6. 'sigma'
  7. 'df'
  8. 'r.squared'
  9. 'adj.r.squared'
  10. 'fstatistic'
  11. 'cov.unscaled'
$class
'summary.lm'

So you can get, for example, the proportion of variance accounted for by the model using the notation:

In [17]:
summary(model)$r.squared
0.815663265306122

Get the amount of variance accounted for by your linear model (R squared) from the model object in R. Is the model a good model?

In [18]:
summary(model)$r.squared
0.815663265306122

The model accounts for about 80% of the variance in the data, but we need more information to judge it: for example, the model plots (to check whether the assumptions are upheld), the F value for the regression sum of squares, and the confidence intervals for the slope and intercept.
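For instance, confint() gives confidence intervals for the coefficients, and plot() on an lm object draws the standard diagnostic plots (residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage). A sketch, rebuilding the model so it runs on its own (using the data = argument, which fits the same model as before with cleaner coefficient names):

```r
reg.data <- data.frame(growth = c(12, 10, 8, 11, 6, 7, 2, 3, 3),
                       tannin = c(0, 1, 2, 3, 4, 5, 6, 7, 8))
model <- lm(growth ~ tannin, data = reg.data)

# 95% confidence intervals for intercept and slope
confint(model, level = 0.95)

# Four diagnostic plots on one page, to check the model assumptions
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))
```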

Use anova() on the model to find out the values of SSM (the model sum of squares) and SSR (the residual sum of squares).
ANOVA and linear models are the same thing. You'll later see that one or the other can be more practical.

In [19]:
anova(model)
A anova: 2 × 5
                Df   Sum Sq  Mean Sq  F value       Pr(>F)
                <int>  <dbl>    <dbl>    <dbl>        <dbl>
reg.data$tannin  1 88.81667 88.81667 30.97398 0.0008460738
Residuals        7 20.07222  2.86746       NA           NA

The ANOVA output also includes the F ratio. Calculate the F ratio.

In [20]:
attributes(anova(model))
$names
  1. 'Df'
  2. 'Sum Sq'
  3. 'Mean Sq'
  4. 'F value'
  5. 'Pr(>F)'
$row.names
  1. 'reg.data$tannin'
  2. 'Residuals'
$class
  1. 'anova'
  2. 'data.frame'
$heading
  1. 'Analysis of Variance Table\n'
  2. 'Response: reg.data$growth'

The data can be accessed using the row and column names, or using the row and column numbers:

In [21]:
anova(model)["reg.data$tannin","Mean Sq"]
88.8166666666667
In [22]:
anova(model)[1, 3]
88.8166666666667
In [23]:
anova(model)["Residuals","Mean Sq"]
2.86746031746032
In [24]:
anova(model)[2, 3]
2.86746031746032

Now you can calculate the F-ratio:

In [25]:
MSM <- anova(model)["reg.data$tannin","Mean Sq"]
MSR <- anova(model)["Residuals","Mean Sq"]
F_ratio <- MSM/MSR
F_ratio
30.9739828397454

Material adapted from:

• Discovering statistics using R. Authors: A. Field, J. Miles, Z. Field. Publisher: Sage, 2012.

• The R book (second edition). Author: Michael J. Crawley. Publisher: Wiley, 2013.

• Research methods and statistics 2. Authors: Tom Booths and Alex Doumas, 2018.
