Showing posts with label bash.

Thursday, 16 August 2012

Randomize lines in two files, keeping relative order


I recently wanted to randomize the lines in two files while keeping the relative order of the lines between the files (i.e. line N of file1 still pairs with line N of file2 after shuffling). Note that paste and cut use tabs, so this assumes the files do not themselves contain tab characters. I am posting it here so I can remember how to do it next time.
### generate one random sort key per line of file1.txt
while read -r line;do echo $RANDOM;done <file1.txt >randomOrder.txt
paste randomOrder.txt file1.txt file2.txt |sort -k1n >sorted.txt
cut -f 2 sorted.txt >file1.txt
cut -f 3 sorted.txt >file2.txt
rm -f sorted.txt randomOrder.txt

Thursday, 16 December 2010

LSF: Using job arrays



Our cluster uses the LSF job scheduler. One feature that I find useful is the ability to create job arrays. These are similar jobs that differ in just one respect, such as the input file or a parameter, or they could be identical, such as for simulations, modelling etc. The main benefit of job arrays, at least for me, is the ability to control the number of jobs running and to change it on the fly. For example, I might need to run 500 jobs, but I can only run 240 jobs at any one time on my group's queue on the cluster, and that would prevent others in my group from getting anything done. So I can use a job array to submit 500 jobs but only allow 100 to run at any time; when one finishes another starts, until they are all finished.
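In its minimal form that looks something like this (myJob and myScript.sh are just placeholders for your own job name and script):

bsub -J "myJob[1-500]%100" "./myScript.sh \$LSB_JOBINDEX"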


Another benefit of job arrays is that they are submitted as a single job, so a job array with a thousand parts is submitted instantly, whereas submitting a thousand separate jobs would take a long time.


Below is an example bash script that does a distributed sort; it is designed to show how to use job arrays and dependencies, not necessarily how best to do sorting.

### Generate a random big file that we want to sort, 10 Million lines
perl -e 'for (1..1E7){printf("%.0f\n",rand()*1E7)};' > bigFile
### Split the file up into chunks with 10,000 lines in each chunk
split -a 3 -d -l 10000 bigFile split
### rename the files to a 1-1000 numbering scheme rather than 0-999
for f in split*;do mv ${f} $(echo ${f} |perl -ne 'm/split(0*)(\d+)/g;print "Split",$2+1,"\n";');done
### submit a job array, allowing 50 jobs to run at any one time
ID=$(bsub -J "sort[1-1000]%50" "sort -n Split\$LSB_JOBINDEX >Split\$LSB_JOBINDEX.sorted" |perl -ne 'm/<(\d+)>/;print "$1"')
### merge the sorted files together once all the jobs have finished, using the -w dependency
ID2=$(bsub -w "done($ID)" "sort -n -m *.sorted >bigFile.sorted" |perl -ne 'm/<(\d+)>/;print "$1"')
### Delete the temp files, waits for the merge to finish first
bsub -w "done($ID2)" "rm -f Split*"

The main point is that the jobs differ only by the value passed to them in the $LSB_JOBINDEX environment variable. Each job gets a different value of this, taken from the range specified in the square brackets earlier, [1-1000] in this case. There is also notation for doing steps, such as 10,20,30, and you can also just specify a list of numbers such as 1,5,10,22,999 etc.
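If I remember the notation correctly, the step and list forms look something like this (check the bsub documentation on your LSF version for the exact syntax):

### indices 10,20,...,100, with at most 5 running at once
bsub -J "stepArray[10-100:10]%5" "echo \$LSB_JOBINDEX"
### an explicit list of indices
bsub -J "listArray[1,5,10,22,999]" "echo \$LSB_JOBINDEX"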


The hard part is making this simple number map to something useful for your task. In this case it was easy, as I used split to name the files with sequential numbers, but perhaps you have 500 data-sets you want to perform the same analysis on. In that case you can either rename the data-sets with a sequential naming scheme, or use a lookup table to associate input files with the numbers given by $LSB_JOBINDEX, and have your analysis script convert the number from $LSB_JOBINDEX into an input filename or parameter.
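A minimal sketch of the lookup-table idea, assuming your input files are listed one per line in fileList.txt, and with myData and myAnalysisScript.pl standing in for your real data and analysis script:

### build the lookup table, one input file per line
ls -1 myData/* >fileList.txt
### each array element pulls its own line out of the table
bsub -J "analysis[1-500]%50" 'myAnalysisScript.pl $(sed -n "${LSB_JOBINDEX}p" fileList.txt)'

The single quotes matter here: they stop the submitting shell from expanding $LSB_JOBINDEX and the sed command, so they are evaluated when each array element actually runs.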
The key point in the code is using the %50 notation to choose how many jobs can run at any one time. This can be changed with bmod, for example:


bmod -J"%100" JOBID This would now allow 100 jobs to be run simultaneously, rather then 50. Notice also the use of the perl one liner (I am sure awk would work too) to get the job ID and store it ready to use as a dependency for the next step. This is another benefit of the job array, in that there is just one job id, which makes modifying and killing jobs much easier.

You can monitor the status of job arrays with the -A flag to bjobs (bjobs -A), which will show you how many jobs are pending, running, done or exited.

If you want to check the progress of a particular job you can do a bpeek using its job ID and array index, e.g. bpeek 1234542[101]. The same notation works for bkill and bjobs.
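For example (the job ID here is made up), something along these lines:

bjobs "1234542[101]"       ### status of just element 101
bkill "1234542[200-300]"   ### kill a range of elements
bkill 1234542              ### kill the whole array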

Thursday, 9 September 2010

Quick Tip: PUT - scp files from remote machine to local machine

It is often desirable to have a quick look at a file on a remote machine, such as an HPC cluster, on your desktop machine. I use MacFUSE and SSHFS for this, but sometimes I want to move a file without finding my current location in a Finder window. So I have written a quick wrapper around scp, with a line to determine the IP address of my local machine, stored as an environment variable.

Add the following lines to your .bashrc on the remote machine.


export DESKTOP=$(last -100 |grep $(whoami) |head -n 1 |perl -ane 'print $F[2]')
put(){
    if [ -z "$1" ]; then
        echo "put - Sends specified files to Desktop of local machine";
        echo "usage: put filesToSend";
    else
        find "$@" |xargs -I % scp % $DESKTOP:~/Desktop
    fi
}

Then reload the file with source .bashrc. You should now be able to type put fileName and the file will appear on your desktop, as long as you have your RSA keys set up; otherwise it will ask for a password each time. It works for multiple files too, such as *.png.
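If you have not set the keys up yet, the usual recipe is roughly the following, run on the remote machine. It assumes your username is the same on both machines and that ssh-copy-id is installed; if it is not, append the public key to ~/.ssh/authorized_keys on your desktop by hand.

ssh-keygen -t rsa                 ### generate a key pair, accepting the defaults
ssh-copy-id $(whoami)@$DESKTOP    ### authorize the key on the local (desktop) machine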

I find it useful for quickly viewing pdfs and images. 

Monday, 5 July 2010

BASH: Per directory BASH history

A while ago I found this post, describing how to have a per-directory bash history. I have been meaning to implement it for a while and today I managed to.

It enables each directory to have a separate, searchable bash history file. So simply by changing into a directory you can view what you did in there. I think this will be a useful backup for remembering what commands I used, and for editing them into scripts etc.

I have also added an hgrep command to search all of the history files, as I can imagine I will forget which directory I was in when I did something I want to do again. I have not decided whether this grep function should return the directory where the command was run or not; I will see after using it for a while.

If you would like to implement this, add the following code to your .bashrc file


hgrep (){ find ~/.dir_bash_history/ -type f |xargs grep -h "$*";}


# Usage: mycd
# Replacement for builtin 'cd', which keeps a separate bash history
# for every directory.
shopt -s histappend
alias cd="mycd"
export HISTFILE="$HOME/.dir_bash_history$PWD/bash_history.txt"

function mycd()
{
    history -w                                    # write the current history file
    builtin cd "$@"                               # do the actual cd
    local HISTDIR="$HOME/.dir_bash_history$PWD"   # use nested folders for history
    if [ ! -d "$HISTDIR" ]; then                  # create the folder if needed
        mkdir -p "$HISTDIR"
    fi
    export HISTFILE="$HISTDIR/bash_history.txt"   # set the new history file
    history -c                                    # clear the in-memory history
    history -r                                    # read from the new histfile
}


The top line is my hgrep function; remove the -h if you want to see which directory each of the returned history items was run in. The rest should be transparent: just cd as normal and a new history file will be created.

Friday, 18 June 2010

R: Command Line Calculator using Rscript

I currently use an awesome little bash trick to get a command line calculator that was posted on lifehacker, and that I blogged about previously.

calc(){ awk "BEGIN{ print $* }" ;}
You just add this to your .bashrc file and then you can use it just like calc 2+2. 


This is really useful, however I recently stumbled upon Rscript. This comes with the standard R install and allows you to write scripts similar to perl or bash with the shebang #!/usr/bin/Rscript, or wherever your Rscript is (you can check with whereis Rscript). The nice thing is that it also has a -e option for evaluating an expression at the command line, just like perl -e for perl one-liners. For example:


Rscript -e "round(runif(10,1,100),0)"


[1] 17 23 21 36 10 47 90 81 83  5


This gives you 10 random numbers uniformly distributed between 1 and 100. You can use any R functions this way, even plot for making figures.

Anyway, it seemed that Rscript would be really useful as a command line calculator too. So after a bit of playing and Googling I adapted a nice alias found in a comment on this blog post. Here it is:

alias Calc='Rscript -e "cat( file=stdout(), eval( parse( text=paste( commandArgs(TRUE), collapse=\"\"))),\"\n\")"'

So now you can type things like Calc "-log10(0.05)", whereas my above mentioned calc would just stare at me blinking, looking a bit embarrassed. You can really go to town if you like:


Calc "round(log2(sqrt(100)/exp(0.05)*(1/factorial(10))),2)"
Calc "plot(hist(rnorm(1E6),br=100))" 
I think I will probably keep the calc version too, as it is a bit quicker with its lower overhead, but Calc should be useful for more complex things.

Friday, 14 May 2010

Bash: Remove spaces from filename

Some people use spaces in file names, silly I know. Remove them!
find . -name "* *" |while IFS= read -r file;do mv "${file}" "$(echo "${file}" | perl -ne 's/ /_/g;print;')";done

Tuesday, 11 May 2010

Bash: mkcd - mkdir and cd

I found the useful hint here to create a new bash command which both recursively creates a directory and changes into that directory. Just add this line to your .bashrc file.

mkcd (){ mkdir -p "$*";cd "$*";}

It supports spaces in directory names; however, I do not! It is evil.

I am sure if I ever remember to use it, this will save literally tens of key strokes each day. Now that is efficiency.

Thursday, 1 April 2010

Timing the Command Line: time

I have been working more and more with ChIP sequencing data recently, which can be pretty huge. Even simple tasks such as counting the number of lines in a file, sorting, filtering etc. now have a considerable time cost.

In order to assess the most efficient way of performing some operations I have been using the time command. For example:

wc -l test.txt
19050959 test.txt


time sort test.txt >test_normalsort.txt
real    1m59.395s


time distSort test.txt
real    2m18.901s


In this case the normal sort was faster than a distributed sort and merge, but that could just be because our cluster was really busy when I ran this. Either way, time is very useful.

Monday, 18 January 2010

BASH: Changing ls colours

I have squinted at dark blue directories on a black background when doing an ls in my terminal for a long time, and have always meant to change them to something easier to read. As our network is down (again) today, I spent half an hour figuring it out.

There are two methods as my mac and the linux cluster use different versions of ls.

On the Mac

The Mac (I am using 10.5) uses the environment variable LSCOLORS to change the colours used by ls when colour output is enabled with the -G flag (which you should probably set as an alias in your .bash_profile file). You can look at man ls for all the options, but basically something along the lines of

export LSCOLORS="gxfxcxdxbxegedabagacad"

will change the directories from blue to cyan, which is much easier to read on a dark background terminal. All I changed was the first letter, from an e to a g (blue to cyan).
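The alias I mean is just a line like this in your .bash_profile (on the Mac's BSD ls the colour switch is -G; setting export CLICOLOR=1 has the same effect):

alias ls='ls -G'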

On Linux

I am using ls from coreutils version 5.2.1. This has a much more flexible system than the Mac's, and uses a different environment variable, but the idea is similar.

The default colour configuration is in /etc/DIR_COLORS, which you can copy to your home directory using cp /etc/DIR_COLORS ~/.dir_colors. You can then edit this file to change the colours as you wish. The colour codes can be found here.
I just changed one line,

DIR 01;34 # directory
to
DIR 01;36 # directory

Which changes blue to cyan. Then you can just add this line to your .bash_profile

eval $(dircolors -b .dir_colors)

Which just runs a short shell script to generate the correct environment variable.

You could achieve the same thing with

export LS_COLORS='di=01;36'

But I quite like having a config file. I might play around a little with some of the other settings and see how it goes.

Tuesday, 24 November 2009

LSF: Job Array Modification

I use job arrays on LSF to control running large numbers of jobs. One nice feature of job arrays is being able to control the maximum number of jobs running at once, and so be nice to my fellow cluster users. I use the following bash one-liner to modify the maximum number of running jobs on all of my job arrays at once.

bjobs -A |cut -f 1 -d " " |grep -v JOBID |while read seq;do bmod -J "%11" $seq;done

Just change the %11 part to whatever number you want; well, whatever number you can get away with.

Friday, 20 November 2009

BASH: randomize the lines in a file

A colleague needed to randomize the lines in a text file, and as usual Google had the answer. I removed the sed and replaced it with cut. It works thanks to the $RANDOM variable, which returns a pseudo-random number each time you call it. Nice.

while read -r line;do echo "$RANDOM $line";done <textFile.csv |sort -n -k 1 |cut -f 2- -d " "

So it adds a random number before each line, then sorts on this number. Simple but clever.

Wednesday, 4 November 2009

Perl one liner: Random Lines from a File

I have some bed files that are too large to process in a reasonable time, so I need to randomly sample lines from them to create files of a workable size.

I used some bash and perl magic for this.

for f in *.bed;do export WC=$(wc -l ${f} |cut -f 1 -d " ");perl -i -ne 'BEGIN{srand} print if rand() < 1500/$ENV{WC}' ${f};done


Basically, it counts the number of lines in the file and stores the result in the environment variable WC, then it reads the file line by line and only prints a line if a random number between 0 and 1 is less than the ratio of the required sample size (1500 in this case) to the file length (WC).

This is looped round all bed files in the current directory.

Edit:
You could also do something like this:

perl -ne 'print rand;print "\t";print;' FILENAME |sort -n |head -n 100 |cut -f 2- >NEWFILENAME


Which will return a random 100 lines from the file.

Friday, 9 October 2009

BASH History Cheat Sheet

I came across this great page on using the BASH history effectively.

The Definitive Guide to Bash Command Line History

I will never keep tapping up to find the command I want again, or almost never.

Monday, 21 September 2009

LSF Job Array

A useful feature of the LSF cluster management system is job arrays. They enable you to submit many jobs at the same time that differ by only one parameter, such as the input file. In this case all the jobs are actually identical, as they randomly permute an input file and score the results, so I am using the array simply as a way of running more iterations in the same amount of time.

bsub -q qname -J "jobName[1-1000]%240" -o /dev/null '~/script/script_name.pl inputFile $(echo $LSB_JOBINDEX).txt option1 option2'

You could also use them to run the same script on many input files, or with a range of parameters. I have another use where I compare many position weight matrices to a sequence using a job array, with one job for each PWM.


Monday, 10 August 2009

One Liner: Remove File Extensions

Just a quick one: I downloaded the latest genome build already repeat-masked, but the script I am running requires the files to be named just chromosome.fa (not chromosome.fa.masked). This quick bash one-liner removes the .masked part using basename.

for f in *.masked;do mv ${f} $(basename ${f} .masked);done

Sunday, 9 August 2009

One Liner: Count occurrences of multiple patterns in multiple files

There may be a more elegant solution to this, but I wanted to count the number of times each of a set of sequences occurs in a number of files. Replace FILES with the list of files you want to search (e.g. *.txt) and replace PATTERNS with a file containing the things you want to search for, one entry per line. This bash one-liner should do the rest.

for f in FILES;do cat PATTERNS |while read seq;do grep ${seq} ${f} |wc -l|xargs echo ${f} ${seq};done;done

It is basically two loops: one goes through the files, the other through each line in the PATTERNS file, and xargs is used to output the results in a sensible order. If you don't care about the count for each individual pattern in a file, but just the total, the -f option to grep would work.
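Something like this should give the per-file totals, with the caveat that a line matching several patterns is only counted once:

for f in FILES;do grep -c -f PATTERNS ${f} |xargs echo ${f};done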

Monday, 3 August 2009

Bash Scripting: for loop

Sometimes you want to run a program over and over with different parameters. You can do this with a bash for loop wrapped around your program to cover a range of numbers:

for ((i=5;i<=15;i++));do echo ${i};done
So you could do something like this:

for ((i=0;i<=10;i++));do fancy_program --option ${i} --output ${i}_output.out;done
Or even something like this:

for f in *.fas;
do for ((i=0;i<=10;i++));
do fancy_program --input ${f} -n ${i} -o $(basename ${f} .fas)_${i}_output.out; done;
done
This loops through all the files and all the numbers. Of course, you probably want to do this by submitting the jobs to a computing cluster, using bsub for example with LSF, otherwise things might get a bit slow.
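For example, a rough sketch of the cluster version, keeping the made-up fancy_program from above and assuming a queue called qname:

for f in *.fas;
do for ((i=0;i<=10;i++));
do bsub -q qname "fancy_program --input ${f} -n ${i} -o $(basename ${f} .fas)_${i}_output.out"; done;
done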

Tuesday, 28 July 2009

Quick one liner: Remove empty lines from files

This quick bash and perl one liner will remove any empty lines from all the files in the current directory.

for f in *;do perl -i -ne 'print unless /^$/' ${f};done
It will not remove lines that contain only whitespace, however. For that you would need:


for f in *;do perl -i -ne 'print if /\S/' ${f};done