
Hypothesis testing
Often when we analyze data, we would like to know whether the mean of our sample distribution is different from some theoretical value or expected average. Suppose we measured the heights of 12 females and want to know whether the average calculated from our sample is significantly different from the theoretical average height of females, which is 171 cm. A simple test of this hypothesis is the Wilcoxon signed-rank test. To perform it in R, we use the wilcox.test() function with the mu argument set to 171:
> female.heights <- c(117, 162, 143, 120, 183, 175, 147, 145, 165, 167, 179, 116)
> mean(female.heights)
[1] 151.5833
> wilcox.test(female.heights, mu=171)

        Wilcoxon signed rank test with continuity correction

data:  female.heights
V = 11.5, p-value = 0.0341
alternative hypothesis: true location is not equal to 171

Warning message:
In wilcox.test.default(female.heights, mu = 171) :
  cannot compute exact p-value with ties
Because the p-value is less than 0.05, we reject the null hypothesis and conclude that the location (median) of our sample distribution is not equal to 171. You will also notice in the output that R produces a warning message stating that the exact p-value cannot be computed with ties; this is because some measurements are repeated in our dataset, which creates ties in the ranks. When there are ties in the data, the wilcox.test() function will return a p-value and, if requested, confidence interval estimates based on a normal approximation to the signed-rank statistic. Let's have a look at the following function call:
> wilcox.test(female.heights, mu=171, conf.int=TRUE, conf.level=0.99)

        Wilcoxon signed rank test with continuity correction

data:  female.heights
V = 11.5, p-value = 0.0341
alternative hypothesis: true location is not equal to 171
99 percent confidence interval:
 120 175
sample estimates:
(pseudo)median 
      151.9893 

Warning messages:
1: In wilcox.test.default(female.heights, mu = 171, conf.int = TRUE,  :
  cannot compute exact p-value with ties
2: In wilcox.test.default(female.heights, mu = 171, conf.int = TRUE,  :
  cannot compute exact confidence interval with ties
Instead of using the Wilcoxon signed-rank test with continuity correction, we can test whether the mean of our sample distribution is different from the theoretical average of 171 using a bootstrap approach, which approximates the sampling distribution of the mean by taking the mean of a large number of random resamples (in this case, 10,000), each of size 12 drawn from our female.heights dataset. To allow R to generate these samples, we will use the sample() function and specify replace=TRUE so that repetitions are allowed and we don't always end up with all twelve values from our original dataset.
Here is the code you would use in R to apply the bootstrap approach to test whether the mean of our sample population is significantly different from the theoretical average.
First, we create a numeric vector to store the 10,000 means generated from resampling the female.heights dataset, as follows:
> f <- numeric(10000)
> for(i in 1:10000) {
+   f[i] <- mean(sample(female.heights, replace=TRUE))
+ }
> hist(f, xlab="bootstrap means")
The result is shown in the following plot:

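Beyond inspecting the histogram, the bootstrap means can be used directly: the 2.5 and 97.5 percent quantiles of f form a 95 percent bootstrap percentile confidence interval for the mean, and if 171 falls outside this interval, that supports the conclusion that our sample mean differs from the theoretical value. Here is a minimal sketch (the exact bounds will vary from run to run):

> quantile(f, c(0.025, 0.975))  # 95% bootstrap percentile interval for the mean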
Now, we can perform a t-test to determine whether the mean of our sample distribution is different from the theoretical value using the t.test() function with the mu argument set to 171, as follows:
> t.test(female.heights, mu=171)

        One Sample t-test

data:  female.heights
t = -2.7861, df = 11, p-value = 0.01771
alternative hypothesis: true mean is not equal to 171
95 percent confidence interval:
 136.2446 166.9221
sample estimates:
mean of x 
 151.5833 
Since the p-value is less than 0.05, we reject the null hypothesis and conclude that the female heights in our dataset are significantly different from the theoretical average.
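The next few examples reuse the probeA and probeB gene expression vectors introduced earlier in the chapter. If you are working through this section on its own and don't have them in your workspace, a rough stand-in can be simulated as follows (the means and standard deviations are placeholder values chosen to resemble the outputs shown, not the original data, so your test statistics will differ slightly):

> set.seed(123)
> probeA <- rnorm(42, mean=4.77, sd=0.26)  # hypothetical stand-in for probeA
> probeB <- rnorm(42, mean=8.08, sd=0.54)  # hypothetical stand-in for probeB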
Instead of comparing the means of two sample distributions, you might want to compare their variances by performing an F-test. You can perform this test in R using the var.test() function, as follows:
> var.test(probeA, probeB)

        F test to compare two variances

data:  probeA and probeB
F = 0.2301, num df = 41, denom df = 41, p-value = 7.182e-06
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.1236921 0.4281015
sample estimates:
ratio of variances 
         0.2301147 
Another useful statistical test is the Kruskal-Wallis test, which is used to determine whether two or more sample populations are identical without assuming that they follow a normal distribution. For example, if we go back to our data frame containing the gene expression values for probeA and probeB and their response to the drugA treatment (df), we can determine whether the expression values for probeA depend on the response to the drugA treatment using the kruskal.test() function.
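If you no longer have the df data frame from earlier in the chapter, a hypothetical stand-in with the same structure can be simulated as follows (the expression values and GSM identifiers are placeholders, not the original data):

> set.seed(123)
> df <- data.frame(
+   expr_probeA = rnorm(42, mean=4.5, sd=0.2),
+   expr_probeB = rnorm(42, mean=8.1, sd=0.25),
+   drugA_response = rep(c("success", "fail"), times=21)
+ )
> rownames(df) <- paste0("GSM", seq(487973, length.out=42, by=-1))

With a df data frame in hand, let's have a look at the function: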
> head(df)
          expr_probeA expr_probeB drugA_response
GSM487973    4.372242    8.176033        success
GSM487972    4.293080    8.541016           fail
GSM487971    4.210850    8.338475        success
GSM487970    4.707231    7.935021           fail
GSM487969    4.345101    7.868985        success
GSM487968    4.586062    7.909702           fail
> kruskal.test(expr_probeA ~ drugA_response, data = df)

        Kruskal-Wallis rank sum test

data:  expr_probeA by drugA_response
Kruskal-Wallis chi-squared = 0.0571, df = 1, p-value = 0.8111
Since the p-value is greater than 0.05, we cannot reject the null hypothesis that the probeA expression values for the two drugA response groups come from the same population.
Proportion tests
Sometimes we want to evaluate proportions instead of individual measurements. Several statistical tests can be used to determine whether the proportion observed in your dataset is significantly different from some theoretical value. Two useful tests for hypothesis testing on proportions are the traditional Z-test and the exact binomial test. Say we want to test the hypothesis that 50 percent of patients have to wait more than 4 hours to see a doctor. We can record the waiting period (in hours) for 11 patients and store that information in a numeric vector, as follows:
> waiting.period <- c(3, 5, 4, 5.5, 3.5, 2.5, 3, 5, 4.5, 3, 3.5)
Next, we can count the number of patients who waited for more than 4 hours by creating a table using the following commands:
> above4.hrs <- ifelse(waiting.period > 4, "yes", "no")
> above4hs.table <- table(above4.hrs)
> above4hs.table
above4.hrs
 no yes 
  7   4 
Now, to test the hypothesis that the proportion of patients who wait more than 4 hours to see a doctor is indeed 50 percent, we can use a Z-test with the prop.test() function:
> prop.test(4, n=11, p=0.5, alternative="two.sided", correct=FALSE)

        1-sample proportions test without continuity correction

data:  4 out of 11, null probability 0.5
X-squared = 0.8182, df = 1, p-value = 0.3657
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.1516647 0.6461988
sample estimates:
        p 
0.3636364 
By explicitly stating correct=FALSE in the prop.test() function, we are ensuring that no continuity correction is applied when determining the Z value.
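To see how this Z-test relates to the X-squared value reported by prop.test(), we can compute the uncorrected Z statistic by hand; squaring it reproduces the X-squared value. Here is a quick sketch:

> p.hat <- 4/11
> z <- (p.hat - 0.5) / sqrt(0.5 * (1 - 0.5) / 11)
> z^2  # squaring the Z statistic reproduces X-squared = 0.8182
[1] 0.8181818
> 2 * pnorm(-abs(z))  # two-sided p-value, approximately 0.3657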
Since the sample size of our waiting.period dataset is small (n=11), it might be a better idea to use the "continuity-adjusted" Z value when deciding whether to reject the null hypothesis, by setting the correct argument to TRUE:
> prop.test(4, n=11, p=0.5, alternative="two.sided", correct=TRUE)

        1-sample proportions test with continuity correction

data:  4 out of 11, null probability 0.5
X-squared = 0.3636, df = 1, p-value = 0.5465
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.1236508 0.6838637
sample estimates:
        p 
0.3636364 
From this test, we fail to reject the null hypothesis: we don't have sufficient evidence to say that the proportion of patients who wait more than four hours to see a doctor differs from 50 percent. In fact, from the 95 percent confidence interval, we can see that the proportion of patients who will likely wait more than four hours to see a doctor is potentially between 0.12 and 0.68.
Instead of using the continuity-adjusted Z statistic to decide whether to reject the null hypothesis, we can also perform an exact binomial test with the binom.test() function, as follows:
> binom.test(4, n=11, p=0.5)

        Exact binomial test

data:  4 and 11
number of successes = 4, number of trials = 11, p-value = 0.5488
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.1092634 0.6920953
sample estimates:
probability of success 
             0.3636364 
Similarly, the exact binomial test indicates that we don't have enough evidence to reject the null hypothesis, and that the proportion of patients who wait more than four hours to see a doctor likely falls between 11 and 69 percent.
Two-sample hypothesis tests
So far, we have covered how to use statistical tests for hypothesis testing on a single sample distribution. However, these statistical tests can also be applied to compare two sample distributions. For example, we could test whether the mean of the probeA sample distribution is significantly different from that of the probeB sample distribution. If we can assume that the errors of both samples are normally distributed, we can use a t-test to compare the means of both samples by entering t.test(probeA, probeB), as follows:
> t.test(probeA, probeB)

        Welch Two Sample t-test

data:  probeA and probeB
t = -35.8398, df = 58.92, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.486162 -3.117460
sample estimates:
mean of x mean of y 
 4.773866  8.075677 
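Whether the normality assumption is reasonable can be checked before relying on the t-test. One quick option is the Shapiro-Wilk test, combined with a normal quantile-quantile plot (a minimal sketch; output not shown):

> shapiro.test(probeA)  # a p-value below 0.05 would suggest non-normality
> qqnorm(probeA); qqline(probeA)  # visual check against a normal distribution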
If the errors are not normally distributed for these samples, we can run the Wilcoxon rank-sum test, as follows:
> wilcox.test(probeA, probeB)

        Wilcoxon rank sum test

data:  probeA and probeB
W = 0, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0
If these samples are paired, you can run a paired t-test or a Wilcoxon signed-rank test by setting the paired argument to TRUE:
> wilcox.test(probeA, probeB, paired=TRUE)

        Wilcoxon signed rank test

data:  probeA and probeB
V = 0, p-value = 4.547e-13
alternative hypothesis: true location shift is not equal to 0
Instead of comparing the means of two populations, you might be interested in comparing two proportions using the prop.test() function. Suppose you want to test the hypothesis that women are favored over men for promotions in company A. If 16 out of 430 female employees and 63 out of 1,053 male employees were promoted this year, you could perform a two-sample proportions test by passing two vectors to the prop.test() function. The first vector will contain the number of female and male employees promoted this year, and the other vector will store the total number of females and males employed by company A. Let's have a look at the following code:
> promoted.employees <- c(16, 63)
> total.employees <- c(430, 1053)
> prop.test(promoted.employees, total.employees)

        2-sample test for equality of proportions with continuity correction

data:  promoted.employees out of total.employees
X-squared = 2.6653, df = 1, p-value = 0.1026
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.04717571  0.00193619
sample estimates:
    prop 1     prop 2 
0.03720930 0.05982906 
You can test for independence in contingency tables using the Chi-squared test or Fisher's exact test, with the chisq.test() and fisher.test() functions, both of which take data stored in a matrix. For example, we can create a matrix that stores the counts observed for blue eyes, brown eyes, blond hair, and dark brown hair in a sample population of 96 Caucasian males. To test whether eye color and hair color are independent, we can perform a Chi-squared test with the chisq.test() function, as follows:
> trait.counts <- matrix(c(24, 14, 11, 47), nrow=2)
> colnames(trait.counts) <- c("Blue eyes", "Brown eyes")
> rownames(trait.counts) <- c("Blond hair", "Dark brown hair")
> trait.counts
                Blue eyes Brown eyes
Blond hair             24         11
Dark brown hair        14         47
> chisq.test(trait.counts)

        Pearson's Chi-squared test with Yates' continuity correction

data:  trait.counts
X-squared = 17.4938, df = 1, p-value = 2.882e-05
As you can see, Yates' continuity correction is applied by default. To remove Yates' continuity correction, you just have to specify correct=FALSE, as follows:
> chisq.test(trait.counts, correct=FALSE)

        Pearson's Chi-squared test

data:  trait.counts
X-squared = 19.3544, df = 1, p-value = 1.086e-05
Since the p-value is less than 0.05, we can reject the null hypothesis that eye color and hair color are independent traits in this sample population.
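The Chi-squared approximation is most reliable when the expected cell counts are reasonably large (a common rule of thumb is at least five per cell). You can inspect the expected counts directly from the test object, as in this quick check:

> chisq.test(trait.counts)$expected
                Blue eyes Brown eyes
Blond hair       13.85417   21.14583
Dark brown hair  24.14583   36.85417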
Instead of a Chi-squared test, you may be interested in performing Fisher's exact test. This statistical test is usually preferred for the analysis of contingency tables in which one or more of the expected frequencies are less than five. To illustrate the fisher.test() function, let's create a 2 x 2 matrix containing hypothetical data, as follows:
> data.counts <- matrix(c(7, 5, 2, 6), nrow=2)
> fisher.test(data.counts)

        Fisher's Exact Test for Count Data

data:  data.counts
p-value = 0.1968
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
  0.438792 55.616983
sample estimates:
odds ratio 
  3.895711 
Unit root tests
R has a variety of packages that allow you to perform many other statistical tests. For example, the tseries package allows you to perform unit root tests, which determine whether a time series variable is non-stationary, using an autoregressive model. One such test is the Augmented Dickey-Fuller (ADF) test, which tests for a unit root in a time series sample. The null hypothesis of this test is that the time series has a unit root, that is, it is not stationary. The ADF test statistic is a negative number, and the more negative it is, the more strongly we can reject the null hypothesis in favor of the alternative hypothesis that the time series is stationary.
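Concretely, the ADF test fits a regression of the change in the series on its lagged level and lagged changes. In the standard formulation (sketched here; the tseries implementation used below includes an intercept and a linear trend term), this is:

Δy(t) = α + β·t + γ·y(t-1) + δ1·Δy(t-1) + ... + δk·Δy(t-k) + ε(t)

The unit root null hypothesis corresponds to γ = 0, and the stationary alternative to γ < 0; the Dickey-Fuller statistic reported by R is the t-statistic for γ.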
For example, let's create a time series representing the fertility rates of women aged 35 to 39 and 40 to 44 between 1997 and 2012 for residents of New York City, using the annual vital statistics tables provided by the Department of Health at http://www.health.ny.gov/statistics/vital_statistics/. The fertility rates provided in these tables are based on the number of live births per 1,000 females aged 35 to 39 and 40 to 44. A summary of this data is provided in the fertilityData.xlsx file, which can be downloaded from https://www.packtpub.com/books/content/support/19729/fertilityData.xlsx. Save this file to your current working directory, as shown:
> setwd("/directory_where_file_was_downloaded/")
Next, load the data into your workspace using the read.xls() function, which is available as part of the gdata package:
> library("gdata") > fertility_rates.df <- read.xls("fertilityData.xlsx") # Change the colnames for the two time series accordingly. > colnames(fertility_rates.df)[2:3] <- c("Age 35-39", "Age 40-44")
We can take a look at the data frame using the head() function:
> head(fertility_rates.df)
  year Age 35-39 Age 40-44
1 1997      46.4      12.4
2 1998      48.1      12.8
3 1999      49.7      12.3
4 2000      49.8      13.5
5 2001      49.2      13.2
6 2002      49.4      13.5
Next, we need to create a time series object for the fertility rate of each age group using the ts() function. The ts() function allows us to specify the times of the first and last observations, together with the frequency of observation, using the start, end, and frequency arguments, respectively. The frequency is the number of observations per unit of time, where 1 corresponds to annual, 4 to quarterly, and 12 to monthly data. To create a time series from our data frame, we will drop the year column and specify the frequency as annual with frequency=1, as shown in the following code:
> fertilityRates.ts <- ts(fertility_rates.df[, 2:3], start=c(1997, 1), end=c(2012, 1), frequency=1)
To inspect our time series object, we can print it, as follows:
> fertilityRates.ts
Time Series:
Start = 1997 
End = 2012 
Frequency = 1 
     Age 35-39 Age 40-44
1997      46.4      12.4
1998      48.1      12.8
1999      49.7      12.3
2000      49.8      13.5
2001      49.2      13.2
2002      49.4      13.5
2003      51.6      14.8
2004      52.5      14.2
2005      52.0      14.3
2006      53.5      14.5
2007      53.6      15.0
2008      54.5      15.0
2009      58.2      17.1
2010      63.6      18.2
2011      65.2      18.0
2012      66.9      19.1
We can also plot the time series with the plot() function, which calls the plot.ts() function to create a graph of time series objects, as shown in the following graph:
> plot(fertilityRates.ts, main="Fertility Rates for Females in NYC from 1997 to 2012", xlab="Year")

Now, let's use the ADF test to check whether our time series is stationary for both female age groups, using the adf.test() function from the tseries package. Since the function can only be applied to univariate time series, we will evaluate each age group separately. First, load the package:
> library("tseries")
For the time series for women aged 35 to 39, we use the following code:
> adf.test(fertilityRates.ts[, 1])

        Augmented Dickey-Fuller Test

data:  fertilityRates.ts[, 1]
Dickey-Fuller = 0.3567, Lag order = 2, p-value = 0.99
alternative hypothesis: stationary

Warning message:
In adf.test(fertilityRates.ts[, 1]) : p-value greater than printed p-value
For the time series for women aged 40 to 44, we use the following code:
> adf.test(fertilityRates.ts[, 2])

        Augmented Dickey-Fuller Test

data:  fertilityRates.ts[, 2]
Dickey-Fuller = -0.3712, Lag order = 2, p-value = 0.9805
alternative hypothesis: stationary
In both cases, we cannot reject the null hypothesis, so we conclude that both time series are not stationary. Indeed, we can see from the graphs that the fertility rate for both age groups increased over this period. Now, what if we are only interested in the fertility rate of women aged 40 to 44 between 1997 and 2003? We can test whether this shorter time series is stationary, as follows:
> testTS <- ts(fertility_rates.df[1:7, 3], start=c(1997, 1), frequency=1)
> adf.test(testTS)

        Augmented Dickey-Fuller Test

data:  testTS
Dickey-Fuller = -7.4101, Lag order = 1, p-value = 0.01
alternative hypothesis: stationary

Warning message:
In adf.test(testTS) : p-value smaller than printed p-value
In this case, the p-value is much smaller than 0.05, so we can reject the null hypothesis and conclude that the time series for the fertility rate of women aged 40 to 44 was stationary between 1997 and 2003.
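As an aside, rather than rebuilding the sub-series from the data frame, you could extract it from the existing time series object with the window() function; a quick sketch:

> testTS <- window(fertilityRates.ts[, "Age 40-44"], start=1997, end=2003)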
You may be interested in using the adf.test() function to find out whether the closing price of a particular stock is stationary over a given period of time. So, for our next example, we will test whether the Facebook stock price at closing was stationary from 2012-06-18 to 2014-11-28. We can download the stock price information from Yahoo! using the getSymbols() function from the quantmod package. First, install and load the package:
> install.packages("quantmod")
> library("quantmod")
The stock symbol for Facebook is "FB" on the NASDAQ stock exchange:
> fbstock <- getSymbols("FB", src="yahoo", from='2012-06-18', to='2014-11-28', auto.assign=FALSE)
We can inspect the object using the head() function to print the first six dates, as follows:
> head(fbstock)
           FB.Open FB.High FB.Low FB.Close FB.Volume FB.Adjusted
2012-06-18   29.96   32.08  29.41    31.41  42978900       31.41
2012-06-19   31.54   32.18  30.70    31.91  30849000       31.91
2012-06-20   31.92   31.93  31.15    31.60  15553600       31.60
2012-06-21   31.67   32.50  31.51    31.84  21875300       31.84
2012-06-22   32.41   33.45  32.06    33.05  74834000       33.05
2012-06-25   32.86   33.02  31.55    32.06  24352900       32.06
We can plot the fbstock information using the chartSeries() function, which is also part of the quantmod package. Since we are just interested in viewing the stock price at closing and the total volume, we will only plot the fourth and fifth columns, as shown in the following code:
> chartSeries(fbstock[, 4:5], theme="white", up.col="black")
The plot is shown in the following graph:

Alternatively, you could plot the fbstock data with the plot.ts() function after converting the fbstock data from an extended time series (xts) object to a standard time series (ts) object with the as.ts() function:
> plot.ts(as.ts(fbstock[, 4:5]), main="FACEBOOK Stock Information from 2012-06-18 to 2014-11-28")
Here is the plot the previous command generated:

Now, let's perform the ADF test with the adf.test() function, which is part of the tseries package, as follows:
> require(tseries)
> adf.test(fbstock[, 4])

        Augmented Dickey-Fuller Test

data:  fbstock[, 4]
Dickey-Fuller = -3.123, Lag order = 8, p-value = 0.103
alternative hypothesis: stationary
Based on the p-value, we cannot reject the null hypothesis, so we conclude that the time series is not stationary and thus contains a unit root. We can cross-check the ADF test result by performing a Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test. The KPSS test is also used to assess whether a time series is stationary, except in this case the null hypothesis is that the observed time series is stationary around a deterministic trend. To perform the KPSS test on our fbstock time series, we can use the kpss.test() function, which is also part of the tseries package.
Again, since we are interested in the Facebook stock price at closing, we will perform the KPSS test on the fourth column:
> kpss.test(fbstock[, 4])

        KPSS Test for Level Stationarity

data:  fbstock[, 4]
KPSS Level = 9.6561, Truncation lag parameter = 5, p-value = 0.01

Warning message:
In kpss.test(fbstock[, 4]) : p-value smaller than printed p-value
As you can see, our p-value is below 0.05, so we can reject the null hypothesis of stationarity and conclude that the time series for the Facebook stock price at closing, between 2012-06-18 and 2014-11-28, is not stationary.
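A common next step when a price series contains a unit root is to difference it: the first differences (daily price changes) of a random-walk-like series are typically stationary. Here is a sketch of how you might check this (output not shown, since it depends on the data you download):

> fb.close.diff <- na.omit(diff(fbstock[, 4]))  # first differences of the closing price
> adf.test(as.ts(fb.close.diff))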
Note
If you are interested in additional packages for time series analysis in R, we recommend you consult the CRAN R project Time Series Analysis website at http://cran.r-project.org/web/views/TimeSeries.html.