Mastering Scientific Computing with R
上QQ阅读APP看书,第一时间看更新

Chapter 2. Statistical Methods with R

This chapter will present an overview of how to summarize your data and get useful statistical information for downstream analysis. We will also show you how to plot and get statistical information from probability distributions and how to test the fit of your sample distribution to well-defined probability distributions. We will also go over some of the functions used to perform hypothesis testing including the Student's t-test, Wilcoxon rank-sum test, z-test, chi-squared test, Fisher's exact test, and F-test.

Before we begin, we will load the gene expression profiling data from the E-GEOD-19577 study entitled "MLL partner genes confer distinct biological and clinical signatures of pediatric AML, an AIEOP study" from the ArrayExpress website to use as a sample dataset for some of our examples. For simplicity, we will not go into the details of how the data was generated, except to mention that the study evaluates the expression level of 54,675 probes in 42 leukemia patients' samples. If you would like to learn more about the study, please consult the experiment web page at http://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-19577. Here are the steps we will follow to load the data into R:

  1. Download the R ExpressionSet (E-GEOD-19577.eSet.r).
  2. Load the dataset with the load() function. This command will create the study object, which contains the raw experimental data.
  3. Rename the study object as MLLpartner.dataset.
  4. Load the Biobase and affy bioconductor packages.
  5. Normalize the data with the rma() function.
  6. Inspect the data.

These steps can be implemented in R, as follows:

> load(url("http://www.ebi.ac.uk/arrayexpress/files/E-GEOD-19577/E-GEOD-19577.eSet.r"))
> MLLpartner.ds <- study
> library("affy")
> library("Biobase")
> AEsetnorm = rma(MLLpartner.ds)
Background correcting
Normalizing
Calculating Expression
> head(exprs(AEsetnorm)) #output shown truncated
          GSM487973 GSM487972 GSM487971 GSM487970 GSM487969
1007_s_at  4.372242  4.293080  4.210850  4.707231  4.345101
1053_at    8.176033  8.541016  8.338475  7.935021  7.868985
117_at     5.730343  8.723568  5.172717  5.404062  5.731468
121_at     7.744487  6.951332  7.202343  7.158402  6.959318
1255_g_at  2.707115  2.717625  2.699625  2.698669  2.701679
1294_at    9.077232  7.611238  9.649630  7.911132  9.732346

Now, let's get the expression values for two probes to be used in our examples, as follows:

> probeA <- as.numeric(exprs(AEsetnorm)[1,])
> probeA <- setNames(probeA, colnames(exprs(AEsetnorm)))
> probeB <- as.numeric(exprs(AEsetnorm)[2,])
> probeB <-setNames(probeB, colnames(exprs(AEsetnorm)))

Now, let's create a matrix with all the expression values for each probe evaluated in the 42 patient samples, as follows:

> MLLpartner.mx <- as.matrix(exprs(AEsetnorm))
> #Lets save the object to our session
> dump("MLLpartner.mx", "MLLpartner.R")
> class(MLLpartner.mx)
[1] "matrix"
> dim(MLLpartner.mx)
[1] 54675    42