Download the script here
Often, before conducting regression analysis, quantitative papers will present a series of bivariate relationships in order to establish unconditional relationships. Here are some common ones.
You can run an independent sample t-test across groups using the t.test() command. You have to reference the two groups separately.
# Conduct an independent sample t-test of the miles per gallon variable across cylinders
t.test(data$mpg[data$cyl==4],
data$mpg[data$cyl==8])
Welch Two Sample t-test
data: data$mpg[data$cyl == 4] and data$mpg[data$cyl == 8]
t = 7.5967, df = 14.967, p-value = 1.641e-06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
8.318518 14.808755
sample estimates:
mean of x mean of y
26.66364 15.10000
The t.test() command takes two arguments at its most basic, and it conducts an independent sample t-test on those variables.
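The same test can also be written with R's formula interface, which is often less error-prone than subsetting the groups by hand. The sketch below assumes `data` is the built-in mtcars data set, which matches the output shown throughout this tutorial; the grouping variable on the right of the `~` must have exactly two levels, so the 6-cylinder cars are dropped first.

```r
# Assumption: 'data' is the built-in mtcars data set used in this tutorial
data <- mtcars

# Formula interface: outcome ~ group; keep only the 4- and 8-cylinder cars
# so that the grouping variable has exactly two levels
t.test(mpg ~ cyl, data = subset(data, cyl %in% c(4, 8)))
```

This produces the same Welch two-sample t-test as referencing the two groups separately.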
You can also run a dependent (paired) sample t-test between two variables. You have to reference the two variables separately and specify that the test is paired.
t.test(data$gear, data$carb, paired=TRUE)
Paired t-test
data: data$gear and data$carb
t = 3.1305, df = 31, p-value = 0.003789
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.3049387 1.4450613
sample estimates:
mean of the differences
0.875
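A paired t-test is mathematically equivalent to a one-sample t-test on the within-pair differences, which makes for a quick sanity check. The sketch below assumes `data` is the mtcars data set, as elsewhere in this tutorial.

```r
# Assumption: 'data' is the built-in mtcars data set
data <- mtcars

# A paired t-test is equivalent to a one-sample t-test on the differences;
# this reproduces the t, df, and p-value from the paired test above
t.test(data$gear - data$carb, mu = 0)
```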
Though it is not used often, you can indeed do one-way analysis of variance in R. However, I will note that the results you get in terms of statistical significance are identical to those you could get from running a linear model. See below for the code to run a one-way ANOVA.
# Run the one-way ANOVA model
aov(data$mpg ~ data$cyl)
# Display the full ANOVA table
summary(aov(data$mpg ~ data$cyl))
Df Sum Sq Mean Sq F value Pr(>F)
data$cyl 1 817.7 817.7 79.56 6.11e-10 ***
Residuals 30 308.3 10.3
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
There are three important differences between this command and the other commands in this tutorial. They are worth noting because they will come up again in the next discussion of linear regression.
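The claim above, that the one-way ANOVA and a linear model give the same answer on statistical significance, is easy to verify directly. The sketch below assumes `data` is the mtcars data set; the F statistic and p-value from `anova()` on the fitted linear model match the `aov()` output.

```r
# Assumption: 'data' is the built-in mtcars data set
data <- mtcars

# The one-way ANOVA and the linear model produce the same F test
summary(aov(mpg ~ cyl, data = data))
anova(lm(mpg ~ cyl, data = data))
```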
You can test for the statistical relationship between two categorical variables using the chisq.test() function. At its most basic, it takes two variables as input.
chisq.test(data$mpg, data$cyl, correct=FALSE)
Pearson's Chi-squared test
data: data$mpg and data$cyl
X-squared = 56.831, df = 48, p-value = 0.1792
Notice that we include the correct=FALSE argument to make sure that R does not automatically apply a continuity correction (the default behavior, as you can see in the help file).
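One detail worth knowing is that R's continuity correction only applies to 2x2 tables. The sketch below assumes `data` is the mtcars data set and uses its two binary 0/1 variables, am and vs, to show the corrected and uncorrected statistics side by side; the Yates correction always shrinks the test statistic.

```r
# Assumption: 'data' is the built-in mtcars data set; am and vs are 0/1 variables
data <- mtcars

# On a 2x2 table, correct=TRUE (the default) applies Yates' continuity
# correction, which shrinks the chi-squared statistic
chisq.test(data$am, data$vs, correct = TRUE)

# correct=FALSE suppresses the correction, giving a larger statistic
chisq.test(data$am, data$vs, correct = FALSE)
```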
You can calculate the correlation between two variables specifically with the cor() command.
cor(data$cyl, data$mpg, use="complete.obs")
[1] -0.852162
# Get a subset of variables
correlationsubset <- data[c("mpg","cyl","hp")]
# Get the correlation matrix
cor(correlationsubset, use="complete.obs")
mpg cyl hp
mpg 1.0000000 -0.8521620 -0.7761684
cyl -0.8521620 1.0000000 0.8324475
hp -0.7761684 0.8324475 1.0000000
Here, you enter the two variables as separate arguments. In addition, note the use="complete.obs" setting. Like the mean() and sd() functions from before, cor() does not omit missing data by default. You need to include use="complete.obs" to make cor() drop incomplete observations.
To make a correlation matrix of several variables, the command is the same except instead of entering the variables separately, you create a subset of the variables, and just enter them all at once.
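Note that cor() only returns the point estimate. If you also want a significance test for a single pair of variables, cor.test() is a natural next step: it reports a t statistic, p-value, and confidence interval alongside the correlation. The sketch below assumes `data` is the mtcars data set.

```r
# Assumption: 'data' is the built-in mtcars data set
data <- mtcars

# cor.test() gives the correlation plus a t statistic, p-value,
# and 95 percent confidence interval
cor.test(data$cyl, data$mpg)
```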