Use awk to filter FASTA file by minimum sequence length

Nice little one-liner to filter a FASTA file by sequence length:

$ awk '!/^>/ { next } { getline seq } length(seq) >= 200 { print $0 "\n" seq }' FastaFileInput.fa > FastaFileOutput.fa

Simply change the number “200” to any number to set your desired minimum sequence length.
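
If you’d rather not edit the awk program each time, the threshold can be passed in with awk’s -v option. Here’s a minimal sketch, assuming a shell function called filter_fasta_min (a made-up name, not a standard tool) and the same kind of input, with each sequence on a single line:

```shell
# filter_fasta_min is a hypothetical helper name, not part of awk.
# Usage: filter_fasta_min MIN_LENGTH INPUT.fa
filter_fasta_min() {
    # -v min="$1" makes the threshold available inside awk as "min";
    # "min + 0" forces a numeric comparison.
    awk -v min="$1" '!/^>/ { next } { getline seq } length(seq) >= min + 0 { print $0 "\n" seq }' "$2"
}
```

Something like `filter_fasta_min 500 FastaFileInput.fa > FastaFileOutput.fa` should then keep only the sequences that are at least 500 characters long.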

And, for those who don’t know what the FASTA file format is, it is a plain-text format for biological sequence data (DNA, RNA, or protein sequences). Here’s a short example of what one looks like:

>sequence_ID_1
atcgatcgggatcaatgacttcattggagaccgaga
>sequence_ID_2
gatccatggacgtttaacgcgatgacatactaggatcagat

Each individual sequence is preceded by a sequence identifier line, which always begins with a “>”.
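
Since every sequence has exactly one “>” line, counting those lines counts the sequences, which makes a handy sanity check before and after filtering. A small sketch (count_fasta_records is a name invented here for illustration):

```shell
# count_fasta_records is a hypothetical helper name.
# Usage: count_fasta_records INPUT.fa
count_fasta_records() {
    # Count lines beginning with ">"; "n + 0" prints 0 when none are found.
    awk '/^>/ { n++ } END { print n + 0 }' "$1"
}
```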

Here’s a quick explanation of how it works, as I currently understand it:

!/^>/ {next}

– If a line (i.e. record) does not begin with a “>”, skip it and go to the next line (record). Only identifier lines get past this point.

{getline seq}

– “getline” reads the next record and assigns the entire record to a variable called “seq”. $0 still holds the identifier line.

length(seq) >= 200

– If the length of the “seq” record is greater than or equal to 200, then…

{print $0 "\n" seq}

– Print the identifier line ($0), then a newline (“\n”), then the sequence stored in the variable “seq”.
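
The same program, spread over several lines with comments, may make the flow easier to follow. This is only a sketch of the one-liner above wrapped in a shell function (filter_fasta_200 is a made-up name). The one behavioural difference is a guard that stops if no line follows an identifier, which is an addition and not part of the original one-liner.

```shell
# filter_fasta_200 is a hypothetical name used only for this illustration.
# Usage: filter_fasta_200 INPUT.fa
filter_fasta_200() {
    awk '
        # Not an identifier line: skip it.
        !/^>/ { next }

        # Identifier line: $0 holds the ">" line; read the following line into seq.
        {
            # Added guard (not in the one-liner): stop if no line follows.
            if ((getline seq) <= 0) exit
        }

        # Keep the record only if the sequence is long enough.
        length(seq) >= 200 { print $0 "\n" seq }
    ' "$1"
}
```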

 

Important note: this will only work when each sequence sits on a single line in the file. If a sequence wraps across multiple lines, the code above will test (and print) only the first line of each sequence, so the results will be wrong. You can fix your FASTA files so that the sequence for each entry sits on a single line: http://itrylinux.com/use-sed-or-awk-to-remove-newline-breaks-from-fasta-file/
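
Alternatively, awk can join the wrapped lines itself before testing the length. The following is a rough sketch, not the method from the linked post. The names filter_fasta_wrapped and emit are made up for this example, and the output is written with each sequence on one line.

```shell
# filter_fasta_wrapped is a hypothetical helper name.
# Usage: filter_fasta_wrapped MIN_LENGTH INPUT.fa
filter_fasta_wrapped() {
    awk -v min="$1" '
        # Print the buffered record if its joined sequence is long enough.
        function emit() {
            if (header != "" && length(seq) >= min + 0) print header "\n" seq
        }
        # A new identifier line: output the previous record, then start a new one.
        /^>/ { emit(); header = $0; seq = ""; next }
        # Any other line is part of the current sequence: append it.
        { seq = seq $0 }
        # Output the last record in the file.
        END { emit() }
    ' "$2"
}
```

For example, `filter_fasta_wrapped 200 FastaFileInput.fa > FastaFileOutput.fa` should behave like the one-liner, but without needing to unwrap the sequences first.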
