Use sed (or awk) to remove newline breaks from FASTA file

So, in my last post, I was stoked to find a way to filter FASTA files by a minimum sequence length using awk. However, that little one-liner fails miserably if your FASTA file isn’t formatted the way it expects, with each sequence on a single line. It turns out that the FASTA file I was working on (which was generated by a SOAP de novo assembly using iPlant) was technically formatted “correctly,” in that it had the necessary features to be called a FASTA file, but its sequences were wrapped across multiple lines. It looked like this:

>sequence_ID_1
atcgatcgggatc
aatgacttcattg
gagaccgaga
>sequence_ID_2
gatccatggacgt
ttaacgcgatgac
atactaggatcag
at

For all intents and purposes, that is a proper FASTA file. However, that formatting is problematic when using a program like sed or awk because both of those programs process files line by line. So, when I would run my awk one-liner to keep only the sequences (in this specific case, contigs) at or above a specified length,

awk '!/^>/ { next } { getline seq } length(seq) >= 200 { print $0 "\n" seq }' FastaFileInput.fa > FastaFileOutput.fa

I wasn’t getting any results that had sequences longer than 200bp because of the line breaks within each individual sequence!
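Here is a minimal sketch of that failure using the toy records from above. The variable names and the lowered threshold of 20 are assumptions made for illustration (the toy sequences are only 36 and 41 bases long). The point is that getline seq reads only the one line that follows the header, so length(seq) measures a single wrapped line (13 characters here), never the whole sequence:

```shell
# Toy FASTA from above: sequences wrapped at 13 characters per line.
wrapped='>sequence_ID_1
atcgatcgggatc
aatgacttcattg
gagaccgaga
>sequence_ID_2
gatccatggacgt
ttaacgcgatgac
atactaggatcag
at'

# Same filter as above, with the threshold lowered from 200 to 20
# (an assumption for this toy data).
filtered=$(printf '%s\n' "$wrapped" |
  awk '!/^>/ { next } { getline seq } length(seq) >= 20 { print $0 "\n" seq }')

# getline picked up only the first 13-character line of each record,
# so no record reaches the threshold and nothing is kept.
printf 'records kept: %s\n' "$(printf '%s' "$filtered" | grep -c '^>')"
```

Both records are longer than 20 bases, yet the filter keeps neither of them.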

So, how did I solve this issue?

Of course, there are programs/scripts that are able to handle instances like this. In fact, iPlant even has a sequence length restriction App built-in, so I don’t even need to take my SOAP FASTA file out of iPlant at all. However, I am trying to learn sed and awk, so I turned this problem into a learning exercise. Additionally, there may be times when I’m working with FASTA files outside of iPlant and may need/want a tool to manipulate a FASTA file without having to rely on a special script, program or website.

Unfortunately, the solution was WAY beyond anything I could end up finding on my own. I did find a way to remove newlines (written as ‘\n’ in sed and awk) using sed and/or translate (the “tr” command in Terminal). Here’s the sed command:

$ sed ':a;N;$!ba;s/\n//g'

I still don’t fully follow how this works, but I do know that the s/\n//g part means: when a newline (\n) is encountered, substitute (s/) it with nothing, everywhere it occurs (/g).

The remainder of the command is complicated and involves labels and branching and other stuff I don’t fully follow yet.
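For reference, here is a sketch of how the rest of that command reads when it is split into one instruction per -e, with a comment for each piece. This is my reading of the standard sed commands, and the three-line input is made up for illustration:

```shell
# :a       define a label named "a"
# N        append the next input line to the pattern space, joined by \n
# $!ba     if this is not the last line ($!), branch (b) back to label a
# s/\n//g  once every line has been loaded, delete all embedded newlines
joined=$(printf 'atcg\nggat\ncc\n' |
  sed -e ':a' -e 'N' -e '$!ba' -e 's/\n//g')

echo "$joined"   # atcgggatcc
```

In other words, the loop pulls the entire input into sed's pattern space before the substitution runs, which also means the whole file is held in memory at once.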

Using the built-in translate command in Terminal is MUCH cleaner and MUCH easier to understand for newbs like me:

$ tr -d '\n'

This means use translate (tr) to delete (-d) all newlines ('\n').
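One caveat is easy to see in action: tr deletes every newline, including the one that separates a header from its sequence. A quick sketch with a made-up single record:

```shell
# The header and both sequence lines are fused into one line.
flat=$(printf '>sequence_ID_1\natcgatcgggatc\naatgacttcattg\n' | tr -d '\n')

echo "$flat"   # >sequence_ID_1atcgatcgggatcaatgacttcattg
```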

 

The above are great for general usage, but they delete every newline, including the ones around the “>” header lines, so the whole file collapses into one long line. They don’t address how to leave certain lines alone. I ended up turning to the amazing folks at StackOverflow for help and they came up with solutions for both sed and awk amazingly quickly; it was great!

Here’s a solution with sed:

$ sed ':a;N;/^>/M!s/\n//;ta;P;D' InputFastaFile.fasta > OutputFastaFile.fasta

I can’t even come close to explaining this, but the key to this is the /^>/ portion. That is where you can enter any regular expression (i.e. regex) that you want ignored. So, this sed one-liner will leave alone any line that begins with a “>”. (Note that the M flag after the regex is a GNU sed extension, so this may not work with non-GNU versions of sed.)
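As a sketch of what each piece appears to do, here is the same one-liner split into one instruction per -e. The comments are my reading of the GNU sed commands, and the short record names and sequences are made up for illustration:

```shell
# Requires GNU sed: the M (multi-line) flag on the address is a GNU extension.
# :a            define a label named "a"
# N             append the next input line to the pattern space
# /^>/M!s/\n//  unless some line in the pattern space starts with ">",
#               delete the newline that N just added (joins two sequence lines)
# ta            if that substitution happened, loop back to a and keep joining
# P             otherwise print the first line of the pattern space
# D             delete that first line and restart the cycle with what is left
unwrapped=$(printf '>id_1\natcg\nggat\n>id_2\nttaa\ncc\n' |
  sed -e ':a' -e 'N' -e '/^>/M!s/\n//' -e 'ta' -e 'P' -e 'D')

echo "$unwrapped"
```

With this input the output has four lines: >id_1, atcgggat, >id_2, and ttaacc.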

Here’s a solution with awk:

$ awk '/^>/{print (NR==1)?$0:"\n"$0;next}{printf "%s", $0}END{print ""}' InputFastaFile.fasta > OutputFastaFile.fasta

I haven’t fully explored this, but I’m fairly certain you can replace the /^>/ with any other regex to skip lines matching that regex.
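Spread over several lines, the same idea might look like the sketch below. It is a rewrite for readability rather than the exact StackOverflow answer (it uses an if instead of the ?: expression), and the input records are made up:

```shell
unwrapped=$(printf '>id_1\natcg\nggat\n>id_2\nttaa\ncc\n' | awk '
  /^>/ {                       # header line
    if (NR > 1) printf "\n"    # first end the sequence line printed so far
    print                      # then print the header on its own line
    next                       # and move on to the next input line
  }
  { printf "%s", $0 }          # sequence line: print it with no trailing newline
  END { print "" }             # terminate the final sequence line
')

echo "$unwrapped"
```

Because sequence lines are printed without a newline, consecutive ones run together until the next header (or the end of the file) closes the line.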

In any case, using either of those above solutions creates a new FASTA file that looks like this:

>sequence_ID_1
atcgatcgggatcaatgacttcattggagaccgaga
>sequence_ID_2
gatccatggacgtttaacgcgatgacatactaggatcagat

You can see that each sequence is now on a single line, which means that my original awk one-liner will now be able to pull out records of a minimum sequence length!
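Putting the two steps together, a sketch of the whole pipeline on the toy records looks like this. The threshold of 40 is an assumption chosen so that only the second record (41 bases) passes; on real contigs it would be 200 as in the original one-liner:

```shell
# Toy FASTA from the top of the post, wrapped at 13 characters per line.
wrapped='>sequence_ID_1
atcgatcgggatc
aatgacttcattg
gagaccgaga
>sequence_ID_2
gatccatggacgt
ttaacgcgatgac
atactaggatcag
at'

# Step 1: put each sequence on a single line.
# Step 2: keep only records whose sequence is at least 40 bases long.
kept=$(printf '%s\n' "$wrapped" |
  awk '/^>/ { if (NR > 1) printf "\n"; print; next } { printf "%s", $0 } END { print "" }' |
  awk '!/^>/ { next } { getline seq } length(seq) >= 40 { print $0 "\n" seq }')

echo "$kept"
```

Only sequence_ID_2 survives here, since sequence_ID_1 is 36 bases long.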