Introduction
In this tutorial, I have put together some Linux/Unix commands that
I find useful for day-to-day text processing tasks. I used the tutorial in a
GED lab meeting to help summer students (SROP) get started with sequence and
data analyses.
I did not cover all the basic functionality of each tool because you can always
find tutorials and basic examples online. I also assume that you are familiar
with Linux/Unix command-line features such as pipes and redirection.
Display text with cat command
cat is a very simple command you can use to display the content of a file.
Running:
cat sample.txt
will display all the content of sample.txt.
chr1 1000 2000 gene +
chr1 1000 2000 exon +
chr2 2000 3400 exon +
chr2 3000 5000 exon +
chr3 3001 5002 exon +
chr12 1000 2000 gene +
chr12 1000 2000 intron +
You can also run:
cat -Te sample.txt
to display tab and end-of-line characters (tabs appear as ^I and line ends
as $). This is helpful when you need to investigate the file format.
chr1^I1000^I2000^Igene^I+$
chr1^I1000^I2000^Iexon^I+$
chr2^I2000^I3400^Iexon^I+$
chr2^I3000^I5000^Iexon^I+$
chr3^I3001^I5002^Iexon^I+$
chr12^I1000^I2000^Igene^I+$
chr12^I1000^I2000^Iintron^I+$
You can also run:
then type some text and press Ctrl-D to add it to the beginning of
sample.txt.
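The command for this step does not appear above. One way to get the described effect, offered as an assumption rather than as the original command, is to let cat read standard input (written as "-") before the file and save the result under a new name. In this sketch, combined.txt is a hypothetical output file and printf stands in for text typed at the keyboard.

```shell
# Work in a scratch directory so no real files are touched.
cd "$(mktemp -d)"
printf 'chr1\t1000\t2000\tgene\t+\n' > sample.txt

# "-" makes cat read standard input first, then sample.txt.
# printf replaces the interactive typing that would end with Ctrl-D.
printf '# my header\n' | cat - sample.txt > combined.txt

cat combined.txt
```

Writing to a separate file avoids redirecting output onto sample.txt while cat is still reading it.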
Manipulate columns and rows with awk
awk accepts conditional expressions that you can use to select or filter
data in a text file.
awk '$2>=2000&&$3<=4000' sample.txt
will select rows whose second column is at least 2000 and whose third
column is at most 4000.
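A condition can also be followed by an action in braces. The sketch below recreates the sample data in a scratch directory and prints the chromosome and the feature length (third column minus second) for the rows that pass the same filter. The length calculation is an illustration, not part of the original tutorial.

```shell
# Recreate the tab-delimited sample file in a scratch directory.
cd "$(mktemp -d)"
printf '%s\t%s\t%s\t%s\t%s\n' \
  chr1 1000 2000 gene + \
  chr1 1000 2000 exon + \
  chr2 2000 3400 exon + \
  chr2 3000 5000 exon + \
  chr3 3001 5002 exon + \
  chr12 1000 2000 gene + \
  chr12 1000 2000 intron + > sample.txt

# Condition first, then the action to run on matching rows.
lengths=$(awk '$2 >= 2000 && $3 <= 4000 {print $1, $3 - $2}' sample.txt)
echo "$lengths"   # prints: chr2 1400
```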
You can also give awk a pattern to search for.
awk '/chr3/' sample.txt
will search for rows that contain chr3.
To search for a match in a specific column use:
awk '$4 ~/gene/' sample.txt
awk will select rows that contain "gene" in the 4th column.
chr1 1000 2000 gene +
chr12 1000 2000 gene +
To print out specific columns use:
awk '{print $1,$2,$3}' sample.txt
Only 1st, 2nd and 3rd columns are printed.
chr1 1000 2000
chr1 1000 2000
chr2 2000 3400
chr2 3000 5000
chr3 3001 5002
chr12 1000 2000
chr12 1000 2000
You can also use the print command to reorder the columns. For example,
awk '{print $2,$3,$1}' sample.txt
will print columns 2 and 3 first, followed by column 1.
1000 2000 chr1
1000 2000 chr1
2000 3400 chr2
3000 5000 chr2
3001 5002 chr3
1000 2000 chr12
1000 2000 chr12
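By default awk separates printed fields with a single space, so the output above is no longer tab-delimited. A minimal sketch of keeping tabs in the output by setting the OFS (output field separator) variable; reordered.txt is a hypothetical output file name.

```shell
# Recreate two rows of the tab-delimited sample file in a scratch directory.
cd "$(mktemp -d)"
printf '%s\t%s\t%s\t%s\t%s\n' \
  chr1 1000 2000 gene + \
  chr2 2000 3400 exon + > sample.txt

# OFS controls what print puts between comma-separated arguments.
awk 'BEGIN {OFS = "\t"} {print $2, $3, $1}' sample.txt > reordered.txt

# cat -Te shows each tab as ^I and each line end as $ (GNU cat).
cat -Te reordered.txt
first_line=$(head -n 1 reordered.txt)
```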
Search with grep
With the -c option, grep displays the number of lines that contain a search word.
We can use this option to count sequences in FASTA format, where every
sequence header starts with ">".
grep -c '^>' par.fa
5
There are five sequences in par.fa. "^" means the beginning of the line.
By default, grep also matches substrings, therefore
grep chr1 sample.txt
will match both "chr1" and "chr12".
chr1 1000 2000 gene +
chr1 1000 2000 exon +
chr12 1000 2000 gene +
chr12 1000 2000 intron +
If you only want to search for "chr1", you need to add the -w option.
In this case, grep will only match a whole word that is identical to the
search word.
grep -w chr1 sample.txt
chr1 1000 2000 gene +
chr1 1000 2000 exon +
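The -w option matches the word anywhere on the line. When the match should be restricted to one column, an exact string comparison in awk is an alternative. This is a sketch of that alternative, not a replacement for the grep command above.

```shell
# Recreate the tab-delimited sample file in a scratch directory.
cd "$(mktemp -d)"
printf '%s\t%s\t%s\t%s\t%s\n' \
  chr1 1000 2000 gene + \
  chr1 1000 2000 exon + \
  chr2 2000 3400 exon + \
  chr12 1000 2000 gene + \
  chr12 1000 2000 intron + > sample.txt

# == compares the whole first field, so chr12 is not selected.
chr1_rows=$(awk '$1 == "chr1"' sample.txt)
echo "$chr1_rows"
```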
You can also search for more than one keyword using the -e option.
grep -e gene -e intron sample.txt
will search for lines that contain either "gene" or "intron".
chr1 1000 2000 gene +
chr12 1000 2000 gene +
chr12 1000 2000 intron +
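You can also use the OR operator (written \| in basic patterns, an extension
supported by GNU grep) to do the same thing.
grep 'gene\|intron' sample.txt
Or you can use egrep, which is the same as grep -E and takes | without a backslash.
egrep 'gene|intron' sample.txt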
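grep has no AND operator, but a single pattern can require two words in a
fixed order. To search for lines with "chr1" followed later by "gene" run:
grep -E "chr1\s.*gene" sample.txt
Here \s (a GNU extension) matches a whitespace character, so "chr12" is not
matched. You will get:
chr1 1000 2000 gene +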
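A pattern such as "chr1\s.*gene" only matches when "chr1" comes before "gene" on the line. An order-independent way to require both words is to pipe one grep into another, sketched here on a copy of the sample data.

```shell
# Recreate part of the tab-delimited sample file in a scratch directory.
cd "$(mktemp -d)"
printf '%s\t%s\t%s\t%s\t%s\n' \
  chr1 1000 2000 gene + \
  chr1 1000 2000 exon + \
  chr12 1000 2000 gene + \
  chr12 1000 2000 intron + > sample.txt

# The first grep keeps lines with the word chr1,
# the second keeps only those that also contain the word gene.
both=$(grep -w chr1 sample.txt | grep -w gene)
echo "$both"
```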
To invert the search, use the -v option. For example,
grep -v exon sample.txt
will exclude any line that has exon in it.
chr1 1000 2000 gene +
chr12 1000 2000 gene +
chr12 1000 2000 intron +
With the -l option, grep displays the name of each file that contains the
search word instead of the matching lines.
will give:
With -c, grep displays the number of lines that contain the
search word in each file.
You will get:
sample2.txt:0
sample3.txt:2
sample4.txt:0
sample.txt:2
With -n, grep displays the line number along with each line that contains
the search word.
will give:
sample3.txt:6:chr12,,,,1000,,,,2000,,,,gene,,,,,,+
sample3.txt:7:chr12,,,,1000,,,,,,,2000,,,,intron,,,,+
sample.txt:6:chr12 1000 2000 gene +
sample.txt:7:chr12 1000 2000 intron +
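The three options can be compared side by side on a small, made-up pair of files. In this sketch a.txt and b.txt are hypothetical file names, and the files are listed explicitly so the output order does not depend on how the shell sorts a wildcard.

```shell
# Two small hypothetical files in a scratch directory.
cd "$(mktemp -d)"
printf 'chr1\t1000\nchr12\t1000\nchr12\t3000\n' > a.txt
printf 'chr2\t2000\n' > b.txt

# -l: names of files with at least one match.
names=$(grep -l chr12 a.txt b.txt)
# -c: number of matching lines per file, including files with none.
counts=$(grep -c chr12 a.txt b.txt)
# -n: file name, line number and the matching line itself.
numbered=$(grep -n chr12 a.txt b.txt)

printf '%s\n---\n%s\n---\n%s\n' "$names" "$counts" "$numbered"
```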
Search and substitute with sed
To substitute a word/pattern use:
sed 's/chr/scaffold/' sample.txt
sed will substitute the first "chr" on each line with "scaffold".
scaffold1 1000 2000 gene +
scaffold1 1000 2000 exon +
scaffold2 2000 3400 exon +
scaffold2 3000 5000 exon +
scaffold3 3001 5002 exon +
scaffold12 1000 2000 gene +
scaffold12 1000 2000 intron +
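Without a flag, the s command replaces only the first match on each line. Adding g at the end replaces every match. A short sketch on a made-up string that contains the pattern twice:

```shell
# Only the first "chr" on the line is replaced.
first_only=$(echo "chr1_chr2" | sed 's/chr/scaffold/')
# With the g flag, every "chr" on the line is replaced.
every=$(echo "chr1_chr2" | sed 's/chr/scaffold/g')

echo "$first_only"   # prints: scaffold1_chr2
echo "$every"        # prints: scaffold1_scaffold2
```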
You can also remove a word by substituting it with an empty string.
sed 's/chr//' sample.txt
You will get:
1 1000 2000 gene +
1 1000 2000 exon +
2 2000 3400 exon +
2 3000 5000 exon +
3 3001 5002 exon +
12 1000 2000 gene +
12 1000 2000 intron +
To add a word to the beginning of the line use:
sed 's/^/chr/' sample2.txt
The original content of sample2.txt is:
1 1000 2000 gene +
1 1000 2000 exon +
2 2000 3400 exon +
2 3000 5000 exon +
3 3001 5002 exon +
12 1000 2000 gene +
12 1000 2000 intron +
With the above command, you will get:
chr1 1000 2000 gene +
chr1 1000 2000 exon +
chr2 2000 3400 exon +
chr2 3000 5000 exon +
chr3 3001 5002 exon +
chr12 1000 2000 gene +
chr12 1000 2000 intron +
Translate with tr
I rarely use this command, but it can be used to deal with messed-up CSV or
space/tab-delimited files.
For example, you are handed a file with this format:
chr1,,,,,,1000,,,,2000,,,,gene,,,,+
chr1,,,,1000,,,,,,,,2000,,,,exon,,,,+
chr2,,,,2000,,,,3400,,,,exon,,,,+
chr2,,,,3000,,,,,,,5000,,,,exon,,,,,,,+
chr3,,,,3001,,,,5002,,,,exon,,,,+
chr12,,,,1000,,,,2000,,,,gene,,,,,,+
chr12,,,,1000,,,,,,,2000,,,,intron,,,,+
You can try to use awk to print the first three columns.
awk -F',' '{print $1,$2,$3}' sample3.txt
But here is what you get.
chr1
chr1
chr2
chr2
chr3
chr12
chr12
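awk fails to recognize the columns because it treats every single comma as a
separator, so the second and third fields are empty.
You can also try to use ",,,," as the separator instead.
awk -F',,,,' '{print $1,$2,$3}' sample3.txt
but you still cannot get rid of all the commas.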
chr1 ,,1000 2000
chr1 1000
chr2 2000 3400
chr2 3000 ,,,5000
chr3 3001 5002
chr12 1000 2000
chr12 1000 ,,,2000
Now we can use tr with the -s option to replace every comma with a tab and
squeeze each run of repeated tabs into one.
tr -s ',' '\t' < sample3.txt | cat -Te
The output is piped to cat -Te so that each tab is displayed as ^I.
chr1^I1000^I2000^Igene^I+$
chr1^I1000^I2000^Iexon^I+$
chr2^I2000^I3400^Iexon^I+$
chr2^I3000^I5000^Iexon^I+$
chr3^I3001^I5002^Iexon^I+$
chr12^I1000^I2000^Igene^I+$
chr12^I1000^I2000^Iintron^I+$
You can see that multiple commas are converted to a single tab.
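Once the commas are squeezed into single tabs, the original goal of printing the first three columns works. A minimal sketch that chains tr and awk on two of the messy lines; cols.txt is a hypothetical output file name.

```shell
# Two of the messy comma-delimited lines, in a scratch directory.
cd "$(mktemp -d)"
cat > sample3.txt <<'EOF'
chr1,,,,,,1000,,,,2000,,,,gene,,,,+
chr2,,,,3000,,,,,,,5000,,,,exon,,,,,,,+
EOF

# tr cleans up the delimiters, then awk sees five proper columns.
tr -s ',' '\t' < sample3.txt | awk '{print $1, $2, $3}' > cols.txt
cat cols.txt
```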
tr can also convert other characters, for example:
echo "accggtgt" | tr a-z A-Z
will convert lowercase letters to uppercase letters.
Or we can use it to convert a soft mask (lowercase bases) in a genome sequence
to a hard mask by replacing every lowercase letter with X.
echo "AAAGGTTGAGAGGTGCCaccggtgtCCCAAGGTTTT" | tr a-z X
Here is what you will get:
AAAGGTTGAGAGGTGCCXXXXXXXXCCCAAGGTTTT
Merge data with paste and join command
paste command can be used to append columns from one file to another.
For example, sample.txt contains:
chr1 1000 2000 gene +
chr1 1000 2000 exon +
chr2 2000 3400 exon +
chr2 3000 5000 exon +
chr3 3001 5002 exon +
chr12 1000 2000 gene +
chr12 1000 2000 intron +
and sample4.txt contains:
1000
1000
3000
5000
2000
3000
6000
We can add data from sample4.txt to sample.txt as the sixth column by
the following command:
paste sample.txt sample4.txt
Here is the output:
chr1 1000 2000 gene + 1000
chr1 1000 2000 exon + 1000
chr2 2000 3400 exon + 3000
chr2 3000 5000 exon + 5000
chr3 3001 5002 exon + 2000
chr12 1000 2000 gene + 3000
chr12 1000 2000 intron + 6000
The join command works like paste except that the two files have to
share a common column. For example, genes1.txt contains a list of genes with
their expression levels like this:
GNBP3 3400
GNBP1 50
Toll 1230
Spatzle 2300
dorsal 57000
Dmik2 34000
And genes2.txt contains the same list of genes with different expression
levels. We can merge them together to get a list of genes with expression
levels from two datasets with join.
join genes1.txt genes2.txt
You will get:
GNBP3 3400 400
GNBP1 50 30050
Toll 1230 1240
Spatzle 2300 4400
dorsal 57000 87000
Dmik2 34000 14000
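It works because genes1.txt and genes2.txt have a common column. Note that
join expects both files to be sorted on that column; with unsorted input,
lines can be skipped unless both files list the keys in the same order.
The join command still works if some data are missing. For example: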
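join reads both files in step, so it expects them to be sorted on the common column. When two files list the genes in different orders, sorting them first is the usual fix. In this sketch the ordering of genes2.txt and the *.sorted.txt file names are assumptions made for illustration, and LC_ALL=C keeps sort and join in agreement about the ordering.

```shell
# Small copies of the gene lists, deliberately in different orders.
cd "$(mktemp -d)"
printf 'GNBP3 3400\nGNBP1 50\nToll 1230\n' > genes1.txt
printf 'Toll 1240\nGNBP3 400\nGNBP1 30050\n' > genes2.txt

# Sort both files on the first column, then join them.
LC_ALL=C sort -k1,1 genes1.txt > genes1.sorted.txt
LC_ALL=C sort -k1,1 genes2.txt > genes2.sorted.txt
LC_ALL=C join genes1.sorted.txt genes2.sorted.txt > merged.txt

cat merged.txt
```

The merged output follows the sorted order of the gene names rather than the order of the original files.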
join genes1.txt genes3.txt
will give:
GNBP3 3400 400
GNBP1 50 30050
Toll 1230
Spatzle 2300 4400
dorsal 57000 87000
Dmik2 34000 14000
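In this case, the expression of the Toll gene in genes3.txt is missing.
We can use -e STRING to replace a missing value with STRING. Because -e is
only guaranteed to fill in fields that are listed with -o, the output fields
are named explicitly: 0 is the common column, and 1.2 and 2.2 are the second
column of each file. For example, with
join -e NA -o 0,1.2,2.2 genes1.txt genes3.txt
the missing value is replaced by "NA".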
GNBP3 3400 400
GNBP1 50 30050
Toll 1230 NA
Spatzle 2300 4400
dorsal 57000 87000
Dmik2 34000 14000
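A related case is a gene that is absent from the second file altogether rather than present with an empty value. By default join drops such lines. A sketch, assuming a genes3.txt without any Toll line: -a 1 keeps the unpaired lines from the first file, -o names the output fields (0 is the common column, 1.2 and 2.2 are the second column of each file), and -e fills the fields that are missing.

```shell
# Sorted copies of the gene lists; Toll is absent from genes3.txt.
cd "$(mktemp -d)"
printf 'GNBP1 50\nGNBP3 3400\nToll 1230\n' > genes1.txt
printf 'GNBP1 30050\nGNBP3 400\n' > genes3.txt

# Keep unpaired lines from file 1 and fill the missing field with NA.
LC_ALL=C join -a 1 -e NA -o 0,1.2,2.2 genes1.txt genes3.txt > merged.txt

cat merged.txt
```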