Use sed (or awk) to remove newline breaks from FASTA file

So, in my last post, I was stoked to find a way to filter FASTA files by a minimum sequence length using awk. However, that little one-liner fails miserably if your FASTA file isn’t formatted “correctly.” It turns out that the FASTA file I was working on (which was generated by a SOAP de novo assembly using iPlant) was technically formatted “correctly,” in that it had the necessary features to be called a FASTA file. It looked like this:


For all intents and purposes, that is a proper FASTA file. However, that formatting is problematic when using a program like sed or awk because both of those programs process files line by line. So, when I would run my awk one-liner to keep only the sequences (in this specific case, contigs) at or above a specified length,

awk '!/^>/ { next } { getline seq } length(seq) >= 200 { print $0 "\n" seq }' FastaFileInput.fa > FastaFileOutput.fa


I wasn’t getting any results that had sequences longer than 200bp because of the line breaks within each individual sequence!
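To make the failure concrete, here is a minimal sketch on a made-up record (the name contig1 and the 4 bp wrapping are just for illustration), with the threshold lowered from 200 to 10 so the toy data stays small:

```shell
# Toy record (made up): one 12 bp sequence wrapped at 4 bp per line.
wrapped=$(printf '>contig1\nACGT\nACGT\nACGT\n')

# Same filter as above, but with the threshold lowered to 10.
# getline only picks up the FIRST sequence line (4 characters), so the
# record is dropped even though the full sequence is 12 bp long.
filtered=$(printf '%s\n' "$wrapped" |
  awk '!/^>/ { next } { getline seq } length(seq) >= 10 { print $0 "\n" seq }')

echo "filtered output: [$filtered]"   # the brackets come back empty
```

The filter only ever measures the first wrapped line after each header, which is why nothing made it through.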

So, how did I solve this issue?

Of course, there are programs/scripts that are able to handle instances like this. In fact, iPlant even has a sequence length restriction App built-in, so I don’t even need to take my SOAP FASTA file out of iPlant at all. However, I am trying to learn sed and awk, so I turned this problem into a learning exercise. Additionally, there may be times when I’m working with FASTA files outside of iPlant and may need/want a tool to manipulate a FASTA file without having to rely on a special script, program or website.

Unfortunately, the solution was WAY beyond anything I could end up finding on my own. I did find a way to remove newlines (written as “\n” in sed and awk) using sed and/or translate (the “tr” command in Terminal). Here’s the sed command:

$ sed ':a;N;$!ba;s/\n//g'


I still don’t fully follow how this works, but I do know that this part

s/\n//g

means that when a newline (\n) is encountered, sed substitutes (s) it with nothing (//) throughout the entire file (g).


The remainder of the command is complicated and involves labels and branching and other stuff I don’t fully follow yet.
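For what it’s worth, here is my best reading of the rest of that command, as a rough sketch. I’ve split it into one -e expression per step (which should behave the same as the semicolon version) and fed it a made-up three-line input:

```shell
# Toy input (made up): three short lines.
joined=$(printf 'AAA\nCCC\nGGG\n' |
  sed -e ':a' -e 'N' -e '$!ba' -e 's/\n//g')
#   :a        define a label named "a"
#   N         append the next input line to the pattern space,
#             separated by an embedded newline
#   $!ba      if this is not the last line, branch back to label "a"
#   s/\n//g   once every line has been loaded, delete each embedded newline

echo "$joined"   # prints AAACCCGGG
```

In other words, the loop slurps the whole file into sed’s buffer first, and only then does the substitution run. That is why the “g” really does apply across the entire file.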

Using the built-in translate command in Terminal is MUCH cleaner and MUCH easier to understand for newbs like me:

$ tr -d '\n'

This means use translate

tr

to delete all newlines

-d '\n'


The above are great for general usage, but they don’t address the idea of how to ignore certain lines. I ended up turning to the amazing folks at StackOverflow for help and they came up with solutions for both sed and awk amazingly quickly; it was great!
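To see why ignoring certain lines matters, here is a quick sketch of what the plain tr approach does to a made-up FASTA record (contig1 is a hypothetical name):

```shell
# Toy FASTA (made up): one record wrapped across two lines.
flat=$(printf '>contig1\nACGT\nTTGA\n' | tr -d '\n')

echo "$flat"   # prints >contig1ACGTTTGA
```

The header gets glued straight onto the sequence, so the result is no longer a usable FASTA file. The header lines need to keep their newlines.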

Here’s a solution with sed:

$ sed ':a;N;/^>/M!s/\n//;ta;P;D' InputFastaFile.fasta > OutputFastaFile.fasta

I can’t even come close to explaining this, but the key to this is the

/^>/

portion. That is where you can enter any regular expression (i.e. regex) that you want ignored. So, this sed one-liner will leave alone any line that has a “>” at the beginning of the line.
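Here is a small sketch of that sed one-liner on a made-up two-record file, with my tentative reading of each piece in the comments. It assumes GNU sed, since the M (multi-line) flag after the regex is a GNU extension:

```shell
# Toy FASTA (made up): two records, each wrapped across two lines.
# Assumes GNU sed, because the M flag is a GNU extension.
unwrapped=$(printf '>c1\nACGT\nTTGA\n>c2\nGG\nCC\n' |
  sed ':a;N;/^>/M!s/\n//;ta;P;D')
#   :a            label "a"
#   N             append the next line to the pattern space
#   /^>/M!s/\n//  unless some line in the pattern space starts with ">",
#                 delete the first embedded newline
#   ta            if that substitution happened, loop back to "a"
#   P;D           otherwise print and drop the first line, then start over
#                 with whatever is left in the pattern space

echo "$unwrapped"
```

On this input I would expect four lines back: each header on its own line, followed by its sequence joined onto a single line.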

Here’s a solution with awk:

$ awk '/^>/{print (NR==1)?$0:"\n"$0;next}{printf "%s", $0}END{print ""}' InputFastaFile.fasta > OutputFastaFile.fasta

I haven’t fully explored this, but I’m fairly certain you can replace the

/^>/

with any other regex to skip lines containing that regex.
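The awk solution is easier to follow when it is spread over several lines. This is a sketch on made-up data, with the ternary wrapped in its own parentheses, which I find a little easier to read:

```shell
# Toy FASTA (made up): two records, each wrapped across two lines.
unwrapped=$(printf '>c1\nACGT\nTTGA\n>c2\nGG\nCC\n' | awk '
  /^>/ {                             # header line
    print (NR == 1 ? $0 : "\n" $0)   # end the previous sequence line first,
    next                             # except before the very first header
  }
  { printf "%s", $0 }                # sequence line: print without a newline
  END { print "" }                   # terminate the last sequence line
')

echo "$unwrapped"
```

Headers are printed with their newline intact, while sequence lines are printed back to back, so each record ends up as exactly two lines.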

In any case, using either of those above solutions creates a new FASTA file that looks like this:


You can see that the sequence lines are now all on a single line, which means that my original awk one-liner will now be able to pull out records of a minimum sequence length!
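As a final sanity check, here is a sketch that chains the two steps together on made-up data: unwrap first, then apply the length filter. The record names and the threshold of 10 are hypothetical, chosen to keep the example tiny:

```shell
# Toy FASTA (made up): a 12 bp contig wrapped at 4 bp, plus a 4 bp contig.
kept=$(printf '>long\nACGT\nACGT\nACGT\n>short\nACGT\n' |
  # Step 1: join wrapped sequence lines, leaving header lines alone.
  awk '/^>/ { print (NR == 1 ? $0 : "\n" $0); next }
       { printf "%s", $0 }
       END { print "" }' |
  # Step 2: keep records whose (now single-line) sequence is >= 10 bp.
  awk '!/^>/ { next } { getline seq } length(seq) >= 10 { print $0 "\n" seq }')

echo "$kept"
```

Only the “long” record should survive, with its sequence on one line. With real data you would swap the printf for the input file and put the threshold back to 200.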