Use awk to count number of sequences in a FASTA file

Nice one-liner that simply counts the number of greater-than (>) symbols in a file. In a FASTA file, there should only be a single “>” for each sequence in the file.

$awk '/>/ { count++ } END { print count }' InputFastaFile.fasta

Here’s an example of a FASTA file format for those who don’t know:

>sequence_ID_1
atcgatcgggatcaatgacttcattggagaccgaga
>sequence_ID_2
gatccatggacgtttaacgcgatgacatactaggatcagat

Use sed (or awk) to remove newline breaks from FASTA file

So, in my last post, I was stoked to find a way to filter FASTA files by a minimum sequence length using awk. However, that little one-liner fails miserably if your FASTA file isn’t formatted “correctly.” It turns out that the FASTA file I was working on (which was generated by a SOAP de novo assembly using iPlant) was technically formatted “correctly,” in that it had the necessary features to be called a FASTA file. It looked like this:

>sequence_ID_1
atcgatcgggatc
aatgacttcattg
gagaccgaga
>sequence_ID_2
gatccatggacgt
ttaacgcgatgac
atactaggatcag
at

For all intents and purposes, that is a proper FASTA file. However, that formatting is problematic when using a program like sed or awk because both of those programs examine files by line. So, when I would run my awk one-liner to filter out all sequences (in this specific case, contigs) greater than a specified length,

awk '!/^>/ { next } { getline seq } length(seq) >= 200 { print $0 "\n" seq }' FastaFileInput.fa > FastaFileOutput.fa

 


I wasn’t getting any results that had sequences longer than 200bp because of the line breaks within each individual sequence!

So, how did I solve this issue?

Of course, there are programs/scripts that are able to handle instances like this. In fact, iPlant even has a sequence length restriction App built-in, so I don’t even need to take my SOAP FASTA file out of iPlant at all. However, I am trying to learn sed and awk, so I turned this problem into a learning exercise. Additionally, there may be times when I’m working with FASTA files outside of iPlant and may need/want a tool to manipulate a FASTA file without having to rely on a special script, program or website.

Unfortunately, the solution was WAY beyond anything I could end up finding on my own. I did find a way to remove “newlines” (indicated as ‘n’ in sed and awk) using sed and/or translate (“tr” command in Terminal). Here’s the sed command:

$sed ':a;N;$!ba;s/\n//g'

 


I still don’t fully follow how this works, but I do know that this part

s/\n//g

means that when a newline is encountered,

\n

substitute

s/

it with nothing throughout the entire file

/g

The remainder of the command is complicated and involves labels and branching and other stuff I don’t fully follow yet.

Using the built-in translate command in Terminal is MUCH cleaner and MUCH easier to understand for newbs like me:

$tr -d '\n'

This means use translate

tr

 

to delete all newlines

-d '\n'

 

The above are great for general usage, but they don’t address the idea of how to ignore certain lines. I ended up turning to the amazing folks at StackOverflow for help and they came up with solutions for both sed and awk amazingly quickly; it was great!

Here’s a solution with sed:

$sed ':a;N;/^>/M!s/\n//;ta;P;D' InputFastaFile.fasta > OutputFastaFile.fasta

I can’t even come close to explaining this, but the key to this is the

/^>/

 

portion. That is where you can enter any regular expression (i.e. regex) that you want ignored. So, this sed one-liner will ignore any line that contains a “>” at the beginning of the line.

Here’s a solution with awk:

$awk '/^>/{print (NR==1)?$0:"n"$0;next}{printf "%s", $0}END{print ""}' InputFastaFile.fasta > OutputFastaFile.fasta

I haven’t fully explored this, but I’m fairly certain you can replace the

"^>"

 

with any other regex to skip lines containing that regex.

In any case, using either of those above solutions creates a new FASTA file that looks like this:

>sequence_ID_1
atcgatcgggatcaatgacttcattggagaccgaga
>sequence_ID_2
gatccatggacgtttaacgcgatgacatactaggatcagat

You can see that the sequence lines are now all on a single line, which means that my original awk one-liner will now be able to pull out records of a minimum sequence length!

Use awk to filter FASTA file by minimum sequence length

Nice little one-liner to filter a FASTA file by sequence length:

$awk '!/^>/ { next } { getline seq } length(seq) >= 200 { print $0 "\n" seq }' FastaFileInput.fa > FastaFileOutput.fa

Simply change the number “200” to any number to set your desired minimum sequence length.

And, for those who don’t know what a FASTA file format is, it is a format to delineate biological sequence (DNA or protein sequences) data. Here’s a short example of what one looks like:

>sequence_ID_1
atcgatcgggatcaatgacttcattggagaccgaga
>sequence_ID_2
gatccatggacgtttaacgcgatgacatactaggatcagat

Each individual sequence is preceded by a sequence identifier line. This identifier line is always indicated by a “>” at the beginning of this line.

Here’s a quick explanation of how it works, as I currently understand it:

!/^>/ {next}

 

– If a line (i.e. record) begins with a “>”, go to the next line (record).

{getline seq}

 

– “getline” reads the next record and assigns the entire record to a variable called “seq”

length(seq) >= 200

 

– If the length of the “seq” record is greater than, or equal to, 200 then…

{print $0 "n" seq}

 

– Print all records ($0) of the variable “seq” in the file that matched our conditions, each on a new line (“\n”)

 

Important note: this will only work on sequences that exist on a single line in the file. If the sequence wraps to multiple lines, the code above will not work. You can fix your FASTA files so that the sequences for each entry exist on single lines: http://itrylinux.com/use-sed-or-awk-to-remove-newline-breaks-from-fasta-file/