Thursday 15 May 2014

Geeknote - Command line interface for Evernote

I have been using Evernote for several years and have been a premium member since 2011, as I was using it so much before then. I use it for work and for home, for recipes and for code, for pretty much everything. I use a modified version of The Secret Weapon for tracking tasks and priorities. I also use IFTTT to connect it to my Google calendar for repeating reminders, such as recurring tasks at work and worming the cat. You get the message: I like Evernote. However, I also spend a lot of time at the command line, and while the global shortcut keys are really useful, they are a very one-way process. I often use them to save beautifully crafted one-liners, quickly store the results of an ad-hoc analysis or take a quick screenshot. They are great, but they don't really let me get at the project management and task tracking features that I use Evernote for. It has never been a big deal, just a niggle that it doesn't quite fit together as well as I would like.

Then I found evernote-mode for emacs and thought I had found the solution. I hadn't. I could see that it was useful, but it didn't really fit with how I work, even though I spend a lot of time in emacs. I realised what I wanted was something like the amazing todo.txt, but with Evernote as the backend. Then I found Geeknote, and I could tell from the tag line "Are you a geek? Do you like Evernote? Geeknote - is for you!" that I had found what I wanted. I got it installed without too much trouble and logged in, which I was pleased to see supported the two-step authentication that lets me sleep at night. I just needed to change the editor to emacs (actually 'emacsclient -t', but that is for another post, maybe) and I was set. I could create new notes, complete with my Secret Weapon GTD tags, edit existing notes, re-tag notes to change priorities, everything I wanted. I love the search functionality and the ability to refer to notes from previous searches simply by their number. I have now set up a bunch of bash aliases to add new tasks and view what I should be doing. It isn't quite there yet, but I will tweak and expand on my workflow.
For those who are interested, these are the kind of aliases I have in my .bashrc at the moment:
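Something along these lines (a sketch rather than an exact copy; the geeknote flags and the notebook and tag names are assumptions based on geeknote's create and find commands, so check geeknote --help for your version):

# quick-add a task with a Secret Weapon priority tag
alias todo.add='geeknote create --notebook "Tasks" --tags "1-Now" --title'
# view tasks by priority
alias show.todo.now='geeknote find --notebooks "Tasks" --tags "1-Now"'
alias show.todo.next='geeknote find --notebooks "Tasks" --tags "2-Next"'
# everything in the Tasks notebook
alias todo='geeknote find --notebooks "Tasks"'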
These just allow me to quickly add notes with a priority and then view notes by priority. I just need to type show.todo.now and I know what I should be doing. I will probably add more wrapping to simplify some tasks, such as changing the priority of a task, but for now I am just loving the way I can add a note at the command line and within seconds it appears on my phone's Evernote widget; amazing integration. In reverse, I can add a reminder on my phone while dropping the kids at school and find it waiting on my command line at work (after typing todo).

I will try and post more of my config and experience after I have used it for a while and see how it fits with my workflow. 








Thursday 16 August 2012

Randomize lines in two files, keeping relative order


I recently wanted to randomize the lines in two files, but keep the relative order of the lines between the files. So that I can remember how to do this next time, I am posting it here.
# generate one random sort key per line of file1.txt
while read -r line; do echo $RANDOM; done < file1.txt > randomOrder.txt
# pair each key with the corresponding lines of both files, then sort on the key
paste randomOrder.txt file1.txt file2.txt | sort -k1,1n > sorted.txt
# split the shuffled columns back out into the original files
cut -f 2 sorted.txt > file1.txt
cut -f 3 sorted.txt > file2.txt
# clean up
rm -f sorted.txt randomOrder.txt
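If you have GNU coreutils available, shuf will do the shuffling in one step; a variation on the same idea:

# paste the two files side by side, shuffle the paired lines, then split them back out
# (assumes the lines themselves contain no tabs)
paste file1.txt file2.txt | shuf > shuffled.txt
cut -f 1 shuffled.txt > file1.txt
cut -f 2 shuffled.txt > file2.txt
rm -f shuffled.txt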

Thursday 28 July 2011

Perl One Liner: Delete files with wrong number of lines

Just a quick perl one-liner for future reference. I needed to delete some text files that didn't have the correct number of lines, as they would break a downstream R script that parses the results.
for f in *.txt; do perl -ne 'END{unlink $ARGV unless $. == 200}' "${f}"; done
I always forget that $ARGV is the variable for the input file name in a one-liner.
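If you want to check which files would go before actually deleting anything, the same idea works as a dry run with print instead of unlink:

# list the files that do not have exactly 200 lines, without deleting them
for f in *.txt; do perl -ne 'END{print "$ARGV\n" unless $. == 200}' "${f}"; done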

Thursday 7 July 2011

Things I would tell a budding bioinformatician to learn.

I recently read Ewan Birney's blog post, which I found echoed a lot of my own thoughts about the use of statistics in computational biology. I thought I would compile my own similar list, but for bioinformatics / computational biology in general. I have not been in the field as long as Ewan, and I certainly still have a lot to learn, particularly about statistics due to my biological background, but I have learnt some things over the last ten years that, like Ewan, I wish someone had told me long ago. The points are in no particular order.

Thursday 16 December 2010

LSF: Using job arrays



Our cluster uses the LSF job scheduler. One feature that I find useful is the ability to create job arrays. These are similar jobs that differ in just one respect, such as the input file or a parameter, or they could be identical, such as for simulations or modelling. The main benefit of job arrays, at least for me, is the ability to control the number of jobs running and to change it on the fly. For example, say I need to run 500 jobs, but I can only run 240 jobs at any one time on my group's queue on the cluster, and taking all of them would exclude others in my group from getting anything done. So I can use a job array to submit 500 jobs, but only allow 100 to run at any time. When one finishes another one starts, until they are all finished.


Another benefit of job arrays is that they are submitted as a single job, so a job array with a thousand parts is submitted instantly, whereas submitting a thousand separate jobs would take a long time.


Below is an example bash script that does a distributed sort; it is designed to show how to use job arrays and dependencies, not necessarily how best to do sorting.

### Generate a random big file that we want to sort, 10 Million lines
perl -e 'for (1..1E7){printf("%.0f\n",rand()*1E7)};' > bigFile
### Split the file up into chunks with 10,000 lines in each chunk
split -a 3 -d -l 10000 bigFile split
### rename the files on a 1-1000 scheme not 0-999
for f in split*;do mv ${f} $(echo ${f} |perl -ne 'm/split(0*)(\d+)/g;print "Split",$2+1,"\n";');done
### submit a job array, allowing 50 jobs to run at any one time
ID=$(bsub -J "sort[1-1000]%50" "sort -n Split\$LSB_JOBINDEX >Split\$LSB_JOBINDEX.sorted" |perl -ne 'm/<(\d+)>/;print "$1"')
### merge the sorted files together once all the jobs are finished, using the -w dependency
ID2=$(bsub -w "done($ID)" "sort -n -m *.sorted >bigFile.sorted" |perl -ne 'm/<(\d+)>/;print "$1"')
### Delete the temp files, waits for the merge to finish first
bsub -w "done($ID2)" "rm -f Split*"

The main point is that the jobs differ only by the value passed to them in the $LSB_JOBINDEX environment variable. Each job gets a different value, taken from the range specified in the square brackets earlier, [1-1000] in this case. There is also additional notation for doing steps, such as 10,20,30, and you can also just specify a list of numbers such as 1,5,10,22,999, as shown below.
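For example (the job names and the echo command here are just placeholders):

### every tenth index: runs elements 10, 20 and 30
bsub -J "steps[10-30:10]" "echo \$LSB_JOBINDEX"
### an explicit list of indices
bsub -J "picked[1,5,10,22,999]" "echo \$LSB_JOBINDEX"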


The hard part is making this simple number map to something useful for your task. In this case it was easy, as I used split to name the files with sequential numbers, but perhaps you have 500 data-sets you want to perform the same analysis on. In that case you can either rename the data-sets with a sequential naming scheme, or use a lookup table to associate input files with the numbers given by $LSB_JOBINDEX, and have your analysis script use the lookup table to convert the number from $LSB_JOBINDEX into an input filename or parameter, as in the sketch below.
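For example, a file with one input filename per line can act as the lookup table, with sed pulling out the line matching the job index (files.txt and run_analysis.sh are made-up names for this sketch):

### line N of files.txt is the input for array element N
bsub -J "analysis[1-500]%50" "INPUT=\$(sed -n \${LSB_JOBINDEX}p files.txt); ./run_analysis.sh \$INPUT"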
The key point in the code is using the %50 notation to choose how many jobs to run at any one time. This can be changed with bmod, for example:


bmod -J"%100" JOBID

This would now allow 100 jobs to run simultaneously, rather than 50. Notice also the use of the perl one-liner (I am sure awk would work too) to get the job ID and store it ready to use as a dependency for the next step. This is another benefit of the job array: there is just one job ID, which makes modifying and killing jobs much easier.

You can monitor the status of job arrays with the -A flag to bjobs (bjobs -A), which will show you how many jobs are pending, running, done or exited.

If you want to check the progress of a particular job you can do a bpeek using its job ID and array index, e.g. bpeek 1234542[101]; the same notation works for bkill and bjobs.

Friday 10 December 2010

R: Basic R Skills - Splitting and Plotting

I am giving a short R course next year, so I am going to make a series of blog posts to help get my thoughts and example code in order. The aim is to introduce people with little or no experience of R to the language with self-contained examples. The order of the posts is not going to reflect any order in the course, just what I feel like doing at the time.

This first post is going to deal with splitting and plotting data. It is a common occurrence to have data in such a form that you want to split the data in one column based on the data in another column. Maybe you want to split an experimental result by age or gender, for example, to see if there is a difference in the distribution of results between males and females. The example code below goes through one such hypothetical example.
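A minimal sketch of that kind of example (the data are simulated and the variable names are just illustrative):

# simulated data: a measurement and a gender label for 200 individuals
result <- c(rnorm(100, mean = 5), rnorm(100, mean = 6))
gender <- rep(c("Male", "Female"), each = 100)

# split the results by gender, giving a list with one element per group
sp <- split(result, gender)
lapply(sp, summary)

# plot the groups with the formula (tilde) notation, and their densities via split
par(mfrow = c(2, 2))
plot(result ~ factor(gender), main = "Result by gender")
dens <- lapply(sp, density)
plot(dens$Male, main = "Male")
plot(dens$Female, main = "Female")

# test for a difference between the groups, again with the formula notation
t.test(result ~ gender)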



The figure shows the output you should get from running the code. Essentially the example is designed to illustrate the split function and the ~ (tilde) character. 


The split function will do what it says: split a vector of data (A) based on another vector (B). It returns a list, with each element of the list containing all of the elements in A that correspond to one of the unique values in B. For example:



A <- c(1,2,3,4)
B <- c("X","Y","X","Y")
sp <- split(A,B)
sp
$X
[1] 1 3
$Y
[1] 2 4
Now we have a list, and we can operate on each element of the list using the apply functions, such as lapply:

lapply(sp,sum)
$X
[1] 4
$Y
[1] 6

There are lots of different apply functions; a good introduction is here.
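For instance, sapply does the same job as lapply but simplifies the result to a named vector where it can:

sapply(sp, sum)
X Y 
4 6 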

The other main way of splitting is using the ~ (tilde) operator. In my head I always read this as 'given', so plot(A ~ B) is "plot A given B". This is an example of the formula notation in R, but here we are using it very simply; it essentially does the same thing as split.

Note: You actually need to do plot(A ~ factor(B)) if B isn't already a factor.

Lots of functions support the formula notation, such as t.test in the example; for others you can use the lapply and split version, as for density in the example.

I also mention the aggregate function, which is essentially the same as lapply plus split but seems slower on large datasets.
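For the simulated gender example above, the aggregate equivalent would be something like:

# mean result per gender, equivalent to lapply(split(result, gender), mean)
aggregate(result, by = list(gender = gender), FUN = mean)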

Wednesday 8 December 2010

R: Using RColorBrewer to colour your figures in R

RColorBrewer is an R package that uses the work from http://colorbrewer2.org/ to help you choose sensible colour schemes for figures in R. For example, if you are making a boxplot with eight boxes, what colours would you use? Or if you are drawing six lines on an x-y plot, what colours would you use so you can easily distinguish the lines and look them up in a key? RColorBrewer helps you to do this.

Below is some example R code that generates a few plots, coloured by RColorBrewer.
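A sketch along those lines (the palette choices and simulated data are just for illustration):

library(RColorBrewer)

# show every available palette
display.brewer.all()

# a boxplot with eight boxes, coloured with a qualitative palette
boxplot(matrix(rnorm(800), ncol = 8), col = brewer.pal(8, "Set2"))

# six lines on an x-y plot, coloured with a different qualitative palette
matplot(matrix(cumsum(rnorm(600)), ncol = 6), type = "l", lty = 1,
        col = brewer.pal(6, "Dark2"), ylab = "value")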



The colours are split into three groups: sequential, diverging, and qualitative.


  1. Sequential - Light colours for low data, dark for high data
  2. Diverging -  Light colours for mid-range data, low and high contrasting dark colours
  3. Qualitative - Colours designed to give maximum visual difference between classes
The main function is brewer.pal, to which you simply give the number of colours you want and the name of the palette; you can choose a palette by running display.brewer.all()

There are limits on the number of colours you can get, but if you want to extend the Sequential or Diverging groups you can do so with the colorRampPalette command, for example:

colorRampPalette(brewer.pal(9,"Blues"))(100)

This will generate 100 colours based on the 9 from the 'Blues' palette; the snippet below draws the original palette and the ramp side by side for comparison.
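A quick way to draw that contrast yourself (a sketch; image() is only used here to show the colours as strips):

# the 9 original 'Blues' colours next to the 100-colour ramp
ramp <- colorRampPalette(brewer.pal(9, "Blues"))(100)
par(mfrow = c(2, 1))
image(matrix(1:9), col = brewer.pal(9, "Blues"), axes = FALSE, main = "brewer.pal(9, 'Blues')")
image(matrix(1:100), col = ramp, axes = FALSE, main = "colorRampPalette, 100 colours")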

