Tuesday, December 3, 2013

How to Manipulate Big Data with One-liner Commands


Here is a list of simple one-line awk, grep and sed commands that are very useful when dealing with large datasets. This is particularly true for work involving genomics, sequencing, etc. I have used them numerous times with data files generated by Plink, and for genome assemblies and protein and gene databases. 


Let's say you have the following space-separated file long.txt, giving the heart rate (column 2, in beats per minute) and longevity (column 3, in years) of various species.

cat 150 15
chicken 275 15
hamster 450 3
rabbit 205 9
elephant 30 70
whale 20 80
giraffe 66 20
human 70 70
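To follow along, the file can be recreated directly from the shell (this is just the sample data above written with a heredoc):

```shell
# Write the sample dataset to long.txt using a heredoc
cat > long.txt <<'EOF'
cat 150 15
chicken 275 15
hamster 450 3
rabbit 205 9
elephant 30 70
whale 20 80
giraffe 66 20
human 70 70
EOF
```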

 

awk

Calculate the average longevity of the species in the whole data set.

   awk '{sum+=$3} END {print sum/NR}' long.txt   # → 35.25

Count the number of animals with a heart rate greater than 70 beats per minute.

   awk '($2 > 70) {++count} END {print count+0}' long.txt   # → 4

(The +0 makes awk print 0 rather than an empty line when no line matches.)

Find the average longevity of all species with heart rate greater than 70 beats per minute.

   awk '($2 > 70) {count++; sum+=$3} END {print sum/count}' long.txt   # → 10.5

grep


Count the number of lines that do not contain chicken. Note the -l flag: plain wc prints lines, words and bytes, while wc -l prints only the line count.

        grep -v chicken long.txt | wc -l   # → 7
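As a side note, grep can do the counting itself with -c, which saves the pipe to wc entirely (assuming the same long.txt as above):

```shell
# -c prints the number of matching lines; combined with -v, the non-matching ones
grep -vc chicken long.txt
```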

sed


Replace the line cat 150 15 with dog 90 15, editing the file in place.

        sed -i 's/cat 150 15/dog 90 15/g' long.txt

The quotes go around the whole sed expression, not around the search and replacement strings. Also note that on BSD/macOS sed, -i takes a mandatory backup-suffix argument (use sed -i '' ... for no backup).
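Because -i overwrites long.txt, it can be worth previewing the substitution first; without -i, sed writes the edited stream to stdout and leaves the file untouched:

```shell
# Preview the edit: the modified text goes to stdout, long.txt is unchanged
sed 's/cat 150 15/dog 90 15/' long.txt
```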


If you want to know more about awk and sed, have a look at sed and awk Pocket Reference, 2nd Edition.
