Friday, 10 December 2010

R: Basic R Skills - Splitting and Plotting

I am giving a short R course next year, so I am going to write a series of blog posts to help get my thoughts and example code in order. The aim is to introduce people with little or no experience of R to the language through self-contained examples. The order of the posts will not reflect the order of the course, just what I feel like writing about at the time.

This first post deals with splitting and plotting data. You will often have data where you want to split the values in one column according to the values in another column. Maybe you want to split an experimental result by age or gender, for example, to see whether the distribution of results differs between males and females. The example code below goes through one such hypothetical example.



The figure shows the output you should get from running the code. Essentially the example is designed to illustrate the split function and the ~ (tilde) character. 


The split function does what it says: it splits a vector of data (A) based on another vector (B). It returns a list with one element for each unique value in B, and each element holds the values of A at the positions where B takes that value. For example:



A <- c(1,2,3,4)
B <- c("X","Y","X","Y")
sp <- split(A,B)
sp
$X
[1] 1 3
$Y
[1] 2 4
Now we have a list, and we can operate on each element of the list using the apply functions, such as lapply:

lapply(sp,sum)
$X
[1] 4
$Y
[1] 6

There are lots of different apply functions; a good introduction is here.
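As a rough sketch of a few other members of the apply family, here are sapply, vapply and tapply applied to the same A and B vectors (redefined so the snippet stands alone). All three are base R functions; the choice of which ones to show is mine.

```r
A <- c(1, 2, 3, 4)
B <- c("X", "Y", "X", "Y")
sp <- split(A, B)

# sapply works like lapply but simplifies the result to a named vector
s1 <- sapply(sp, sum)

# vapply does the same, but you declare the expected type and length of each result
s2 <- vapply(sp, sum, numeric(1))

# tapply splits and applies in one step, without building the list first
s3 <- tapply(A, B, sum)

s1
```

Each of these gives 4 for X and 6 for Y, the same sums as the lapply call above, just in a different container.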

The other main way of splitting is to use the ~ (tilde) operator. In my head I always read this as 'given', so plot(A ~ B) is "plot A given B". This is an example of the formula notation in R, but here we are using it very simply. In this context it has essentially the same effect as split: the values of A are grouped by the values of B.

Note: You actually need to do plot(A ~ factor(B)) if B isn't already a factor.
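A minimal sketch of the two routes side by side, using simulated data. The variable names result and gender and the numbers are made up for illustration and are not the data behind the figure.

```r
set.seed(1)  # make the simulated data reproducible

# Hypothetical data: 50 results for each of two groups
gender <- rep(c("F", "M"), each = 50)
result <- c(rnorm(50, mean = 10), rnorm(50, mean = 12))

# Formula route: gender is a character vector, so wrap it in factor()
plot(result ~ factor(gender), xlab = "gender", ylab = "result")

# split route: build the list of groups explicitly, then plot it
groups <- split(result, gender)
boxplot(groups, xlab = "gender", ylab = "result")
```

Both calls should draw one box per group, which is why I think of the formula as a shorthand for the split.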

Lots of functions support the formula notation, such as t.test in the example. For others you can use the lapply and split version, as is done for density in the example.
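A small sketch of both approaches on simulated data (the names and values are invented for illustration, not the data behind the figure). It assumes only base R: t.test accepts a formula directly, while density is given one group at a time through split and lapply.

```r
set.seed(42)  # reproducible simulated data

# Hypothetical results for two groups of 40
gender <- factor(rep(c("F", "M"), each = 40))
result <- c(rnorm(40, mean = 10), rnorm(40, mean = 11))

# t.test understands the formula: compare result between the two genders
tt <- t.test(result ~ gender)
tt$p.value

# density takes a plain vector, so split first and apply it to each group
dens <- lapply(split(result, gender), density)

# Overlay the two density curves on one plot
plot(dens$F, main = "Result by gender",
     xlim = range(dens$F$x, dens$M$x),
     ylim = range(dens$F$y, dens$M$y))
lines(dens$M, lty = 2)
legend("topright", legend = names(dens), lty = 1:2)
```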

I also mention the aggregate function, which essentially is the same as lapply and split but seems slower on large datasets. 
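For comparison, here is a short sketch of aggregate next to the split version on the small A and B vectors from earlier, put into a data frame that I have called df for illustration. It only shows that the results agree and makes no attempt to measure speed.

```r
# The same toy data as before, as a data frame
df <- data.frame(A = c(1, 2, 3, 4), B = c("X", "Y", "X", "Y"))

# aggregate with a formula: sum A given B, returned as a data frame
agg <- aggregate(A ~ B, data = df, FUN = sum)
agg

# The split version, simplified to a named vector with sapply
sums <- sapply(split(df$A, df$B), sum)
sums
```

The main practical difference is the shape of the result: aggregate returns a data frame with one row per group, whereas the split version returns a list or vector.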

1 comment:

  1. I would think basic R should take better advantage of existing packages and leave the details of apply() to later. Take a look at the wonderful package plyr that has a very elegant solution to the split-apply-combine paradigm. This dovetails very well with ggplot2 as well.
