on June 21, 2011 by Alastair Kerr in Statistics, Tutorials

Using R: a guide for complete beginners

This tutorial is intended to introduce users quickly to the basics of R, focusing on a few common tasks that biologists need for basic analysis: loading a table, plotting some graphs, and performing some basic statistics. More extensive tutorials can be found on the project website and via Bioconductor (not covered here).

R-language: http://www.r-project.org

Bioconductor: http://www.bioconductor.org

Advantages of R

  • Free!
  • Powerful: many libraries have been created to perform application-specific tasks, e.g. analysis of microarray experiments and Next-Gen sequencing (Bioconductor, including the Bioseq group).
  • Presentation quality graphics
    • Save as a png, pdf or svg
  • History
    • What you do can be saved for the next time you use R.
    • Ability to turn it into an automated script to perform again and again on different data

Disadvantages

    Preparation

    • (Optional) Download and save the tutorial data set from
      • http://bioinformatics.knowledgeblog.org/wp-content/uploads/bioinf/kerr/data.tsv
    • Start R (type R in a Linux or Mac terminal, or launch it from its start link on a PC)

    Getting More Help

    • Project Home page
      • http://www.r-project.org/
      • Check out ‘An Introduction to R’, which is a much more in-depth guide.
      • Also R has a built-in help system (see later)

    Working directory

    This is the directory used to store your data and results. It is useful if it is also the directory where your input data is stored.

    • Mac/Linux: this is the directory you were in when you typed R
    • PC: change it using the change working directory option

    R – terms used in workshop

    • Vector: a list of values such as numbers (equivalent to a column in a table)
    • Data Frame: a group of vectors (a table where the rows are not necessarily related)
    • Matrix: a table where columns and rows are related
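As a rough illustration of these three terms, the sketch below builds one of each from made-up values. The object names and numbers are invented for this example and are not part of the tutorial data set.

```r
# A vector: a single series of values
heights <- c(1.2, 3.4, 5.6)

# A data frame: a table built from vectors of equal length
df <- data.frame(name = c("a", "b", "c"), height = heights)

# A matrix: a grid of values of one type, filled column by column
m <- matrix(1:6, nrow = 2, ncol = 3)

is.numeric(heights)  # TRUE
is.data.frame(df)    # TRUE
dim(m)               # 2 3 (rows, columns)
```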

    R Syntax

    R is a function-based language: the inputs to a function, including options, are in brackets. Note that all data and options are separated by commas.

    • Function(data, options)

    Even quit is a function

    • q()

    So is help

    help(read.table)

    Provides the help page for the FUNCTION ‘read.table’

    help.search("t test")

    Searches for help pages that might relate to the phrase ‘t test’

    NOTE: quotes are needed for search strings; they are not needed when referring to data objects or function names.

    There is a shortcut for help:

    ? shows the help page for a function name, same as help(function)

    ?read.table

    ?? searches for help pages on functions, same as help.search('phrase')

    ??"t test"

    Information is usually returned from a function; by default this is printed to the screen

    read.table('data.tsv') or read.table('http://bioinformatics.knowledgeblog.org/wp-content/uploads/bioinf/kerr/data.tsv')

    This can always be stored; we call what it is stored in an ‘object’

    mydata <- read.table('data.tsv')

    Here mydata is an object of type data frame.
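To make the idea of a stored object more concrete, here is a small sketch that reads a table and then inspects it. It assumes a made-up three-row table passed through the `text` argument of `read.table` rather than the real data.tsv, so the values shown are purely illustrative.

```r
# Hypothetical three-row table standing in for 'data.tsv'
example_text <- "1 10 100
2 20 200
3 30 300"

# Store the result of read.table in an object instead of printing it
mydata <- read.table(text = example_text)

is.data.frame(mydata)  # TRUE: the object is a data frame
dim(mydata)            # 3 rows and 3 columns
head(mydata)           # first rows of the table
str(mydata)            # compact description of each column
```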

    Reminder:

    • Vector: a list of numbers, equivalent to a column in a table
    • Data Frame = a collection of vectors. Equivalent to a table

    Hint:

    • Up/Down arrow keys can be used to cycle through previous commands

    Getting data into R

    For a beginner this can be the hardest part; it is also the most important to get right.

    It is possible to create a vector by typing data directly into R using the combine function ‘c’

    x <- c(1,2,3,4,5);

    same as

    x <- c(1:5);

    creates the vector x with the numbers from 1 to 5.

    You can see what is in an object at any time by typing its name:

    x

    will produce the output ‘[1] 1 2 3 4 5’

    Note that text values such as names need to be quoted

    daysofweek <- c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday');
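Once a vector exists, a few basic operations are worth trying. The sketch below reuses the two vectors from the text; the particular operations chosen (length, indexing, arithmetic, mean) are just a sample picked for illustration.

```r
x <- c(1, 2, 3, 4, 5)

length(x)   # number of elements: 5
x[2]        # second element: 2
x * 2       # arithmetic applies to every element: 2 4 6 8 10
mean(x)     # 3

daysofweek <- c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday')
daysofweek[1]   # first element: Monday
```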

    Usually however you want to input from a file. We have touched on the ‘read.table’ function already.

    mydata <- read.table('data.tsv')

    Now mydata is a data frame with multiple vectors.

    Each vector can be identified by the default syntax

    mydata$V1 # typing any of these prints that column to the screen
    mydata$V2
    mydata$V3

    By default the function assumes certain things from the file

    • The file is a plain text file (there are functions to read Excel files: not covered here)
    • columns are separated by any number of tabs or spaces
    • there is the same number of data points in each column
    • there is no header row (labels for the columns)
    • there is no column with names for the rows (see the note on row names further down).

    If any of these are false, we need to tell that to the function

    If it has a header row

    mydata <- read.table('data.tsv', header=TRUE) # header=T also works

    Note that there is a comma between the different parts of the function's arguments

    If there is one less column in the header row than in the data, then R assumes that the 1st column of data after the header contains the row names
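The row-name rule is easier to see with a tiny example. This sketch assumes a made-up table (the labels row1, row2 and the values are invented) whose header line has three entries while each data line has four.

```r
# Header has 3 names, each data line has 4 fields,
# so the first field of each line becomes the row name
example_text <- "A B C
row1 1 2 3
row2 4 5 6"

withnames <- read.table(text = example_text, header = TRUE)

rownames(withnames)  # row1 row2
names(withnames)     # A B C
withnames$B          # 2 5
```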

    Now the vectors (columns) are identified by their name

    mydata$A # typing any of these prints that column to the screen
    mydata$B
    mydata$C

    summary(mydata) # Summary about the whole data frame

    summary(mydata$A) # Summary information of column A

    We can shortcut having to type the data frame each time by attaching it

    attach(mydata)

    summary(B) # summary of column B as ‘mydata’ is attached

    Two other important options for read.table

    If it is separated only by tabs and has a header

    mydata <- read.table('data.tsv', header=T, sep='\t')

    This is really useful if you have spaces in the contents of some columns, so R does not mess up reading the columns. However, if the columns are of uneven length, R will report an error.

    If you know that the file has uneven columns

    mydata <- read.table('data.tsv', header=T, sep='\t', fill=TRUE)

    This causes R to fill empty spaces in a column with ‘NA’.

    The last two examples will still work with our file and give the same result as with only header=T
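A small sketch of what `sep` and `fill` do in practice, assuming an invented tab-separated table in which one value contains a space and the last row is one field short. The table contents are hypothetical and only serve to show the behaviour.

```r
# Tab-separated text: 'wild type' contains a space, the last row lacks a value for C
example_text <- "A\tB\tC\nwild type\t1\t2\nmutant\t3"

filled <- read.table(text = example_text, header = TRUE, sep = "\t", fill = TRUE)

as.character(filled$A)  # the first value keeps its space because only tabs split columns
filled$C                # the missing value in the short row is filled with NA
```

Without `fill = TRUE` the same call would stop with an error, because the last line has fewer fields than the header.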

    Graphs

    To get an idea of what R is capable of, type

    demo(graphics)

    <return> steps through the examples, and the code is printed to the screen

    We will work with simpler examples that have immediate use to biologists.

    Remember: to get more information about the options to a function, type ‘?function’

    Histogram of A

    hist(mydata$A)

    If there were more data, we could increase the number of vertical bars with the option breaks=50 (or another relevant number).
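To see the effect of the breaks option, the sketch below uses simulated values instead of the tutorial data (the variable name `values` and the choice of 1000 random numbers are assumptions made for this example). R treats the number given to breaks as a suggestion, so the exact number of bars may differ slightly.

```r
set.seed(1)                        # make the simulated data reproducible
values <- rnorm(1000, mean = 10)   # 1000 made-up measurements

# Ask for roughly 50 bars instead of the default
h <- hist(values, breaks = 50,
          main = "Histogram of simulated values", xlab = "Value")

length(h$counts)   # number of bars actually drawn
sum(h$counts)      # every value falls into exactly one bar: 1000
```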

    boxplot(mydata)

    We can get rid of the need to type the data frame each time by using the attach function

    attach(mydata) # if not already done so

    boxplot(mydata$A, mydata$B, names=c("Value A", "Value B"), ylab="Count of Something")

    same as

    boxplot(A, B, names=c("Value A", "Value B"), ylab="Count of Something")

    Scatter plot

    attach(mydata) # if not already done so

    plot(A,B) # or plot(mydata$A, mydata$B)

    SAVING an image

    Windows users (Rgui): right-click on the image and select the option you want.

    The instructions below work for everyone.

    You need to create a new device for the type of file you need, then send the plot to that device.

    To save as a png file (easy to load into the likes of PowerPoint, and also great for web applications):

    png('filename.png')

    boxplot(A, B, names=c("Value A", "Value B"), ylab="Count of Something")

    or to save as a pdf

    pdf('filename.pdf')

    boxplot(A, B, names=c("Value A", "Value B"), ylab="Count of Something")

    Note

    • Nothing will appear on screen; the output is going to the file
    • Also, it may not be saved immediately, but it will be once the device is turned off (or R is quit).

    To quit R type

    q() # If you save your session, next time you start R, you will have your data preloaded.

    Or if you want to remain in R

    dev.off() # turns off the png (or pdf etc.) device, thus forcing the data to be saved
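Putting the steps together, a complete save might look like the sketch below. It assumes made-up values for A and B and writes to a temporary directory under a hypothetical file name, so nothing in your working directory is touched; the pdf device is used here, but png works the same way.

```r
# Hypothetical output file in R's temporary directory
outfile <- file.path(tempdir(), "boxplot_example.pdf")

A <- c(5, 7, 8, 6, 9)        # made-up values
B <- c(10, 12, 11, 14, 13)

pdf(outfile)                                   # 1. open the file device
boxplot(A, B, names = c("Value A", "Value B"),
        ylab = "Count of Something")           # 2. draw the plot into it
dev.off()                                      # 3. close the device so the file is written

file.exists(outfile)   # TRUE once the device has been closed
```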

    Testing Data

    IMPORTANT: use non-parametric tests for non-parametric data (data that are not normally distributed)

    Always best to check: e.g. use the Shapiro Wilk test for normality

    shapiro.test(A)

    You are comparing your data set to a hypothetical ‘normal’ distribution. Therefore if the result is significant, the data are NOT normally distributed (NON-parametric). A good cut-off for this is p < 0.05. You do not usually need a stricter criterion, as the t-test is quite forgiving for near-normal data.

    Do this for each data set: A, B, C, D & E
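For readers without the data set to hand, here is a sketch of the normality check on simulated data. The two samples (one drawn from a normal distribution, one from a skewed exponential distribution) and their names are assumptions made for illustration.

```r
set.seed(42)
normal_like <- rnorm(50, mean = 10, sd = 2)   # drawn from a normal distribution
skewed      <- rexp(50, rate = 1)             # drawn from a skewed distribution

result_normal <- shapiro.test(normal_like)
result_skewed <- shapiro.test(skewed)

# A p-value below 0.05 suggests the sample is NOT normally distributed
result_normal$p.value
result_skewed$p.value
```

The returned object also holds the test statistic (`$statistic`), which is handy when the check is part of a script.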

    Do A and E differ by random chance?

    As A and E are both normally distributed, we use a parametric test (one that tests the means)

    t.test(A,E)

    If the data is paired you should use the option “paired=TRUE”

    PAIRED DATA: values on the same row of each data set belong together because they came from the same source (e.g. the same time-frame)
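The difference the paired option makes can be sketched with a small invented example: six measurements taken before and after some treatment on the same six samples. The vector names and values are hypothetical.

```r
# Made-up paired measurements: position i in each vector is the same sample
before <- c(12.1, 11.4, 13.3, 12.8, 11.9, 12.5)
after  <- c(12.9, 12.0, 13.8, 13.5, 12.4, 13.4)

unpaired <- t.test(before, after)                 # treats the two sets as independent
paired   <- t.test(before, after, paired = TRUE)  # tests the per-sample differences

unpaired$p.value
paired$p.value   # smaller here, because every sample shifts in the same direction
```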

    What about A and D?

    As D is NOT normally distributed (i.e. non-parametric), we use a non-parametric test. Using a parametric test on non-parametric data gives unreliable results. Non-parametric tests will work on parametric (normally distributed) data; it is just that they are less sensitive to change than a parametric test.

    Suggestions for Non-parametric tests:

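    1. Wilcoxon test.

    wilcox.test(A,D)

    Note: as written this runs the unpaired Wilcoxon rank-sum test. Adding paired=TRUE gives the Wilcoxon signed-rank test, which cannot be used if the paired differences between samples are not symmetrical around a median.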

    2. Kolmogorov-Smirnov Test (aka KS test)

    ks.test(A, D)

    From Wikipedia: “The two-sample KS test is one of the most useful and general nonparametric methods for comparing two samples, as it is sensitive to differences in both location and shape of the empirical cumulative distribution functions of the two samples”.

    Note however that this test does not cope well with ties: sorted paired values that are the same between the two samples. This can be a problem when looking at data that has a small number of possible states (e.g. only 1, 2, 3, 4).
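Both suggested tests can be tried on simulated data, as in the sketch below. The two groups and their names are invented for illustration: continuous random values with different means, so there are no ties to worry about.

```r
set.seed(7)
group1 <- rnorm(30, mean = 0)     # made-up sample
group2 <- rnorm(30, mean = 1.5)   # made-up sample shifted upwards

w <- wilcox.test(group1, group2)  # unpaired Wilcoxon (rank-sum) test
k <- ks.test(group1, group2)      # two-sample Kolmogorov-Smirnov test

w$p.value   # a small value suggests the two samples differ in location
k$p.value   # a small value suggests the two distributions differ
```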


    2 Comments

    1. George Moulton

      June 21, 2011 @ 2:46 pm

      A concise article that provides a useful quick reference guide to main functions to be used in R.

      In the disadvantages, it would be useful to know why the LIMMA-GUI and the R-commander are disadvantages.

      In the statistical test section, I would change ‘vector’ in the Shapiro-Wilk test command line to the actual vector that you use in the data provided.

      Small typos
      Change “Therefore if it significant then it is NON-parametric” to Therefore if it is significant then it is NON-parametric”

      Change “samples are not symetrical around a medien” to “samples are not symmetrical around a median”

    2. Alastair Kerr

      June 21, 2011 @ 3:02 pm

      Thanks, I have added your improvements
