Use awk to filter FASTA file by minimum sequence length

UPDATED 20180530

Revised code:

awk '/^>/ { getline seq } length(seq) > 100 { print $0 "\n" seq }' FastaFileInput.fa > FastaFileOutput.fa

Thanks to the comment from cpad0112 for the clarification and streamlined code!

Original post is below – for posterity.


Nice little one-liner to filter a FASTA file by sequence length:

awk '!/^>/ { next } { getline seq } length(seq) >= 200 { print $0 "\n" seq }' FastaFileInput.fa > FastaFileOutput.fa

Simply change the number “200” to any number to set your desired minimum sequence length.
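If you'd rather not edit the program text each time, one option is to pass the threshold in as an awk variable with -v. This is a minimal sketch, not part of the original one-liner: the variable names (min, min_len) and the demo file names are made up for illustration, and the input is a tiny invented file.

```shell
# Hypothetical demo input: two single-line records, 10 and 30 bases long.
printf '>short\nACGTACGTAC\n>long\nACGTACGTACGTACGTACGTACGTACGTAC\n' > demo_input.fa

# Pass the threshold in with -v instead of editing the awk program itself.
min_len=20
awk -v min="$min_len" '!/^>/ { next } { getline seq } length(seq) >= min { print $0 "\n" seq }' demo_input.fa > demo_output.fa

# Show what passed the filter.
cat demo_output.fa
```

With a threshold of 20, only the 30-base record should end up in demo_output.fa.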

And, for those who don’t know what the FASTA file format is, it is a plain-text format for storing biological sequence data (DNA or protein sequences). Here’s a short example of what one looks like:


Each individual sequence is preceded by a sequence identifier line. The identifier line always begins with a “>”.
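For illustration, a small FASTA file can be written by hand. The identifiers and sequences below are invented for this sketch, and example.fa is a hypothetical file name:

```shell
# Write a tiny, made-up FASTA file (identifiers and sequences are invented).
cat > example.fa <<'EOF'
>seq1 hypothetical short sequence
ACGTACGTAC
>seq2 hypothetical longer sequence
ACGTACGTACGTACGTACGTACGTACGTAC
EOF

# Count the records by counting the identifier lines.
grep -c '^>' example.fa
```

Each record here is an identifier line followed by its sequence on a single line, which is the layout the one-liner expects.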

Here’s a quick explanation of how it works, as I currently understand it:

!/^>/ {next}

– If a line (i.e. record) does not begin with a “>”, skip it and go to the next line (record). Only identifier lines get past this step.


{getline seq}

– “getline” reads the next record and assigns the entire record to a variable called “seq”


length(seq) >= 200

– If the length of the “seq” record is greater than, or equal to, 200 then…


{print $0 "\n" seq}

– Print the identifier line ($0), then a newline (“\n”), then the sequence stored in the variable “seq”
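The same steps can be laid out as a multi-line awk program with a comment on each rule. This is a sketch of the one-liner above, not a different method; the test file and its record names are invented so the example can be run on its own.

```shell
# Build a hypothetical test file: one 250-base record and one 50-base record.
{
  echo ">long_record"
  awk 'BEGIN { for (i = 0; i < 250; i++) printf "A"; print "" }'
  echo ">short_record"
  awk 'BEGIN { for (i = 0; i < 50; i++) printf "C"; print "" }'
} > annotated_demo.fa

awk '
  # Not an identifier line: skip it and move on to the next record.
  !/^>/ { next }
  # Identifier line: read the following line into seq. $0 still holds the identifier.
  { getline seq }
  # Print identifier and sequence only when the sequence is long enough.
  length(seq) >= 200 { print $0 "\n" seq }
' annotated_demo.fa > annotated_demo.filtered.fa

# List the identifiers that survived the filter.
grep '^>' annotated_demo.filtered.fa
```

Note that getline with a variable fills only that variable, so $0 keeps the identifier line and both pieces are available to print.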


Important note: this will only work on sequences that exist on a single line in the file. If the sequence wraps to multiple lines, the code above will not work. You can fix your FASTA files so that the sequences for each entry exist on single lines:
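One possible way to do this with awk is sketched below. It is an assumption on my part rather than a specific tool or recipe: it joins every line between two identifier lines into a single sequence line. The file names and records are invented for the demo.

```shell
# Hypothetical wrapped input: rec1's sequence is split across two lines.
printf '>rec1\nACGT\nACGT\n>rec2\nGG\n' > wrapped.fa

awk '
  # Identifier line: flush the sequence collected so far, then print the identifier.
  /^>/ { if (seq != "") print seq; seq = ""; print; next }
  # Sequence line: append it to the sequence being collected.
  { seq = seq $0 }
  # End of file: flush the last sequence.
  END { if (seq != "") print seq }
' wrapped.fa > unwrapped.fa

cat unwrapped.fa
```

After this step each record is two lines (identifier, then sequence), so the length filter above can be applied to unwrapped.fa.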

10 thoughts on “Use awk to filter FASTA file by minimum sequence length”

  1. Code is correct (in a roundabout way, especially the filtering by >), but the explanation is incorrect.
    1. !/^>/ {next}
    a) !/^>/ – match lines that don’t start with > (emphasis on don’t)
    b) {next} – skip those lines and move on to the next one, so only lines that start with > (the header lines) reach the rest of the program
    2. {getline seq} – read the line after the header into seq. This gets there through a double negation; the code should have been direct instead, i.e. come back to step 1a.

    Pick up the lines that start with > and then store the next line in a variable. The code should have been:

    awk '/^>/ { getline se } length(se) > 100 { print $0 "\n" se }' test.fa

    As per the amended code: pick up the lines that start with >, store the next line in a variable, filter by length, and then print the result along with the headers.
