Tuesday, December 3, 2013

How to Manipulate Big Data with One-liner Commands


Here is a list of simple one-line awk, grep and sed commands that are very useful when dealing with large datasets. This is particularly true for work involving genomics, sequencing, etc. I have used them numerous times with data files generated by Plink, and for genome assemblies and protein and gene databases. 


Let's say you have the following space-separated file long.txt, giving the heart rate (column 2, in beats per minute) and longevity (column 3, in years) of various species.

cat 150 15
chicken 275 15
hamster 450 3
rabbit 205 9
elephant 30 70
whale 20 80
giraffe 66 20
human 70 70
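To follow along, the file can be recreated directly from the shell (this is just the sample data above written with a heredoc):

```shell
# Write the sample dataset to long.txt using a heredoc
cat > long.txt <<'EOF'
cat 150 15
chicken 275 15
hamster 450 3
rabbit 205 9
elephant 30 70
whale 20 80
giraffe 66 20
human 70 70
EOF
```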

 

awk

Calculate the average longevity of the species in the whole data set.

   awk '{sum+=$3} END {print sum/NR}' long.txt   # → 35.25

Count the number of animals with a heart rate greater than 70 beats per minute.

   awk '($2 > 70) {++count} END {print count+0}' long.txt   # → 4

(The +0 makes awk print 0 rather than an empty line when no line matches.)

Find the average longevity of all species with heart rate greater than 70 beats per minute.

   awk '($2 > 70) {count++; sum+=$3} END {print sum/count}' long.txt   # → 10.5

grep


Count the number of lines that do not contain chicken. Note the -l flag: plain wc prints lines, words and bytes, while wc -l prints only the line count.

        grep -v chicken long.txt | wc -l   # → 7
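As a side note, grep can do the counting itself with -c, which saves the pipe to wc entirely (assuming the same long.txt as above):

```shell
# -c prints the number of matching lines; combined with -v, the non-matching ones
grep -vc chicken long.txt
```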

sed


Replace the line cat 150 15 with dog 90 15, editing the file in place.

        sed -i 's/cat 150 15/dog 90 15/g' long.txt

The quotes go around the whole sed expression, not around the search and replacement strings. Also note that on BSD/macOS sed, -i takes a mandatory backup-suffix argument (use sed -i '' ... for no backup).
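Because -i overwrites long.txt, it can be worth previewing the substitution first; without -i, sed writes the edited stream to stdout and leaves the file untouched:

```shell
# Preview the edit: the modified text goes to stdout, long.txt is unchanged
sed 's/cat 150 15/dog 90 15/' long.txt
```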


If you want to know more about awk and sed, have a look at sed and awk Pocket Reference, 2nd Edition.
