Here is a list of simple one-line awk, grep and sed commands that are very useful when dealing with large datasets. This is particularly true for work involving genomics, sequencing, etc. I have used them numerous times with data files generated by Plink, and for genome assembly and protein and gene databases.
Let's say you have the following space-separated file long.txt giving the heart rate (column 2) and longevity (column 3) of various species.
cat 150 15
chicken 275 15
hamster 450 3
rabbit 205 9
elephant 30 70
whale 20 80
giraffe 66 20
human 70 70
awk
Calculate the average longevity of the species in the whole data set.
awk '{count+=$3} END {print count/NR}' long.txt
Count the number of animals with a heart rate greater than 70 beats per minute.
awk '($2 > 70) {++count} END {print count}' long.txt
Find the average longevity of all species with heart rate greater than 70 beats per minute.
awk '($2 > 70) {count++;long+=$3} END {print long/count}' long.txt
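One caveat with the conditional average: if no line passes the filter, count stays empty and awk stops with a division-by-zero error. The following is a minimal sketch of a guarded variant, not a replacement for the one-liner above. It recreates the sample data in a temporary file, and the names threshold, n and total are my own choices for illustration.

```shell
# Recreate the sample data in a temporary file so the sketch is self-contained.
data=$(mktemp)
cat > "$data" <<'EOF'
cat 150 15
chicken 275 15
hamster 450 3
rabbit 205 9
elephant 30 70
whale 20 80
giraffe 66 20
human 70 70
EOF

# "threshold" is a hypothetical variable name, set from the shell with -v.
# The END block only divides when at least one row matched.
avg=$(awk -v threshold=70 '
    $2 > threshold { n++; total += $3 }
    END { if (n > 0) print total / n; else print "no matching rows" }
' "$data")

echo "$avg"   # prints 10.5 for the sample data
rm -f "$data"
```

Passing the cutoff with -v also makes it easy to rerun the same command with a different heart rate without editing the awk program.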
grep
Count number of lines that do not contain chicken.
grep -v chicken long.txt | wc -l
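As an alternative to piping into wc, grep can do the counting itself with the -c option. A small sketch, assuming the same sample data written to a temporary file (the variable names data and n_lines are just for this example):

```shell
# Recreate the sample data in a temporary file so the sketch is self-contained.
data=$(mktemp)
printf '%s\n' 'cat 150 15' 'chicken 275 15' 'hamster 450 3' 'rabbit 205 9' \
    'elephant 30 70' 'whale 20 80' 'giraffe 66 20' 'human 70 70' > "$data"

# -v inverts the match, -c prints the number of selected lines.
n_lines=$(grep -vc chicken "$data")

echo "$n_lines"   # prints 7 for the sample data
rm -f "$data"
```

Note that the pattern matches anywhere on the line, so a species named chickenhawk would be excluded as well.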
sed
Replace cat 150 15 with dog 90 15.
sed -i 's/cat 150 15/dog 90 15/g' long.txt
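Because -i rewrites the file in place, it can be worth previewing the substitution first. The sketch below runs sed without -i, so the result goes to standard output and the file is left untouched. It also anchors the pattern with ^ and $ so that only a line that is exactly cat 150 15 is replaced. The temporary file and variable names are assumptions made for the example. Also keep in mind that GNU sed accepts -i on its own, while BSD/macOS sed expects a backup suffix (possibly empty) after -i.

```shell
# A shortened copy of the sample data, written to a temporary file.
data=$(mktemp)
printf '%s\n' 'cat 150 15' 'chicken 275 15' 'hamster 450 3' > "$data"

# No -i here: sed prints the edited text and does not modify the file.
# ^ and $ restrict the match to a line that is exactly "cat 150 15".
preview=$(sed 's/^cat 150 15$/dog 90 15/' "$data")

echo "$preview"
first_line=$(printf '%s\n' "$preview" | head -n 1)
rm -f "$data"
```

Once the preview looks right, the same expression can be rerun with -i to apply it to the file.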
If you want to know more about awk and sed, see sed and awk Pocket Reference, 2nd Edition.