Use sed (or awk) to remove newline breaks from FASTA file

So, in my last post, I was stoked to find a way to filter FASTA files by a minimum sequence length using awk. However, that little one-liner fails miserably if your FASTA file isn’t formatted “correctly.” It turns out that the FASTA file I was working on (which was generated by a SOAP de novo assembly using iPlant) was technically formatted “correctly,” in that it had the necessary features to be called a FASTA file. It looked like this:


For all intents and purposes, that is a proper FASTA file. However, that formatting is problematic when using a program like sed or awk because both of those programs examine files by line. So, when I would run my awk one-liner to filter out all sequences (in this specific case, contigs) greater than a specified length,

awk '!/^>/ { next } { getline seq } length(seq) >= 200 { print $0 "\n" seq }' FastaFileInput.fa > FastaFileOutput.fa


I wasn’t getting any results that had sequences longer than 200bp because of the line breaks within each individual sequence!

So, how did I solve this issue?

Of course, there are programs/scripts that are able to handle instances like this. In fact, iPlant even has a sequence length restriction App built-in, so I don’t even need to take my SOAP FASTA file out of iPlant at all. However, I am trying to learn sed and awk, so I turned this problem into a learning exercise. Additionally, there may be times when I’m working with FASTA files outside of iPlant and may need/want a tool to manipulate a FASTA file without having to rely on a special script, program or website.

Unfortunately, the solution was WAY beyond anything I could end up finding on my own. I did find a way to remove “newlines” (indicated as ‘\n’ in sed and awk) using sed and/or translate (“tr” command in Terminal). Here’s the sed command:

$sed ':a;N;$!ba;s/\n//g'


I still don’t fully follow how this works, but I do know that this part

s/\n//g

means that when a newline (\n) is encountered, it gets replaced with nothing (the empty //) throughout the entire file (g).


The remainder of the command is complicated and involves labels and branching and other stuff I don’t fully follow yet.

Using the built-in translate command in Terminal is MUCH cleaner and MUCH easier to understand for newbs like me:

$tr -d '\n'

This means use translate

tr

to delete all newlines

-d '\n'


The above are great for general usage, but they don’t address the idea of how to ignore certain lines. I ended up turning to the amazing folks at StackOverflow for help and they came up with solutions for both sed and awk amazingly quickly; it was great!
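To make the problem concrete, here is a minimal sketch of what happens when every newline is deleted from a FASTA file. The two-record input and the variable names are made up for illustration.

```shell
# Made-up two-record FASTA with wrapped sequence lines (illustration only)
fasta='>contig1
ACGT
TTGA
>contig2
GGCC'

# tr deletes every newline, including the ones that separate
# identifier lines from sequence lines
flattened=$(printf '%s\n' "$fasta" | tr -d '\n')
printf '%s\n' "$flattened"
# prints: >contig1ACGTTTGA>contig2GGCC
```

Everything ends up on one line, identifiers included, which is why the identifier lines need to be skipped.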

Here’s a solution with sed:

$sed ':a;N;/^>/M!s/\n//;ta;P;D' InputFastaFile.fasta > OutputFastaFile.fasta

I can’t even come close to explaining this, but the key to this is the

/^>/

portion. That is where you can enter any regular expression (i.e. regex) that you want ignored. So, this sed one-liner will ignore any line that contains a “>” at the beginning of the line.

Here’s a solution with awk:

$awk '/^>/{print (NR==1)?$0:"\n"$0;next}{printf "%s", $0}END{print ""}' InputFastaFile.fasta > OutputFastaFile.fasta

I haven’t fully explored this, but I’m fairly certain you can replace the

/^>/

with any other regex to skip lines containing that regex.
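As a quick check of the awk approach, here is a small sketch run against a made-up wrapped FASTA. The input and variable names are assumptions for illustration. The logic is the same as the awk one-liner, with an extra pair of parentheses around the conditional so that every awk implementation parses the print statement the same way.

```shell
# Made-up FASTA whose sequences wrap across two lines each (illustration only)
fasta='>contig1
ACGT
TTGA
>contig2
GGCC
AATT'

# Identifier lines are printed on their own line; every other line is
# printed without a trailing newline, so wrapped sequence lines get joined
linearized=$(printf '%s\n' "$fasta" |
  awk '/^>/ { print ((NR == 1) ? $0 : "\n" $0); next }
       { printf "%s", $0 }
       END { print "" }')

printf '%s\n' "$linearized"
```

Each identifier line should now be followed by exactly one sequence line (ACGTTTGA and GGCCAATT here).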

In any case, using either of those above solutions creates a new FASTA file that looks like this:


You can see that the sequence lines are now all on a single line, which means that my original awk one-liner will now be able to pull out records of a minimum sequence length!

Use awk to filter FASTA file by minimum sequence length

UPDATED 20180530

Revised code:

awk '/^>/ { getline seq } length(seq) >100 { print $0 "\n" seq }' FastaFileInput.fa > FastaFileOutput.fa

Thanks to the comment from cpad0112 for the clarification and streamlined code!

Original post is below – for posterity.


Nice little one-liner to filter a FASTA file by sequence length:

$awk '!/^>/ { next } { getline seq } length(seq) >= 200 { print $0 "\n" seq }' FastaFileInput.fa > FastaFileOutput.fa

Simply change the number “200” to any number to set your desired minimum sequence length.

And, for those who don’t know what a FASTA file format is, it is a format to delineate biological sequence (DNA or protein sequences) data. Here’s a short example of what one looks like:


Each individual sequence is preceded by a sequence identifier line. This identifier line is always indicated by a “>” at the beginning of this line.

Here’s a quick explanation of how it works, as I currently understand it:

!/^>/ {next}

– If a line (i.e. record) does NOT begin with a “>”, skip it and go to the next line (record).


{getline seq}

– “getline” reads the next record and assigns the entire record to a variable called “seq”


length(seq) >= 200

– If the length of the “seq” record is greater than, or equal to, 200 then…


{print $0 "\n" seq}

– Print the current record ($0, which is the identifier line), then a newline (“\n”), then the sequence stored in the variable “seq”
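The steps above can be tried out on a tiny made-up FASTA. This is only a sketch: the input, the variable names, and passing the threshold in with awk's -v option (instead of typing the number into the one-liner) are my own choices for illustration.

```shell
# Made-up single-line FASTA: one 4 bp record and one 10 bp record
fasta='>short
ACGT
>long
ACGTACGTAC'

min_len=5   # hypothetical minimum sequence length

# Same one-liner as above, with the hard-coded 200 replaced by the awk variable "min"
filtered=$(printf '%s\n' "$fasta" |
  awk -v min="$min_len" '!/^>/ { next } { getline seq } length(seq) >= min { print $0 "\n" seq }')

printf '%s\n' "$filtered"
```

Only the “>long” record and its sequence are printed, since the “short” sequence is under the 5 bp cutoff.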


Important note: this will only work on sequences that exist on a single line in the file. If the sequence wraps to multiple lines, the code above will not work. You can fix your FASTA files so that the sequences for each entry exist on single lines:

Secure Shell (SSH) SSHure iSSH SSHweet!

SSH allows you to connect to a remote computer and run tasks remotely. In my situation, this is great for remotely logging in to one of our lab computers that is designed for intensive computing tasks (24GB of RAM!).

Using SSH is also fairly straightforward. To get started logging in to a remote computer/server that you have access to, just type the following in Terminal (and substitute your own username and the address of your target computer):

$ssh username@remotecomputeraddress

Enter your password for the remote computer.

Alternatively, instead of dealing with passwords every time you log in to a remote computer, generate some SSH keys!  Not only can you eliminate the need to use a password and automatically log in when you type your ssh command, but by using keys you can virtually eliminate people being able to use a brute force password attack to break in to your computer/server!

First, generate your key set.  The following command will generate a private and a public key.  The public key can be placed on any server you want SSH key access to.  You can just send the public key to anyone who has the capabilities (both the know-how and authorization) to install it in the correct location on the computer/server you’d like to connect to.  The private key on your computer will then be able to match with your public key on any computer that the public key has been installed on!  No passwords needed for connection!

Generate the keys:

$ssh-keygen -t rsa

Feel free to use an empty password when you are prompted; just hit the “Enter” button and then confirm by hitting the “Enter” button again. This password is only used when physically using your computer to initiate an SSH session. For most people, having a password to initiate an SSH session from their computer becomes more of a hassle than it’s worth. However, if you anticipate someone else using your computer, and you’d like to prevent them from easily using SSH to remotely log in to servers that you’ve installed SSH keys on, then it would be advised to enable a password for your SSH sessions.

Looking in your ~/.ssh folder reveals the following:

$ls ~/.ssh
id_rsa  known_hosts

The “” file is your public key file. This is the file that can be transferred to other computers to enable password-free SSH capabilities on those computers.

Now that we have our keys, we need to transfer the public key to the server. Assuming you have administrative privileges for the server, there are two options for putting the public key on the server. If it’s the first key, we can use the following command:

$ssh-copy-id username@remotecomputeraddress

That will not only copy the public key from the computer to the server, but it will also create the proper directories if they don’t already exist on the server.

Otherwise, if you have the appropriate permissions, you can also use the following command to append your public key to an existing “authorized_keys” file on the server:

cat ~/.ssh/ | ssh username@remotecomputeraddress 'cat >> .ssh/authorized_keys'
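What the appending step does on the server side can be tried out safely on a scratch file. This is only a sketch: the add_key helper, the scratch directory, and the placeholder key text are all made up for illustration, and the duplicate check and the chmod are my own additions rather than part of the command above.

```shell
# Scratch directory standing in for the server's home directory (hypothetical)
scratch=$(mktemp -d)
mkdir -p "$scratch/.ssh"

# Placeholder text standing in for a real public key line (not a valid key)
pubkey='ssh-rsa AAAAexamplekeydata user@client'

# Hypothetical helper: append a key line to an authorized_keys file,
# but only if that exact line is not already in it
add_key() {
  auth_file=$1
  key_line=$2
  touch "$auth_file"
  grep -qxF -- "$key_line" "$auth_file" || printf '%s\n' "$key_line" >> "$auth_file"
  # Keep the file readable and writable by its owner only
  chmod 600 "$auth_file"
}

add_key "$scratch/.ssh/authorized_keys" "$pubkey"
add_key "$scratch/.ssh/authorized_keys" "$pubkey"   # second call changes nothing

# Count the lines in the file
key_count=$(grep -c '' "$scratch/.ssh/authorized_keys")
printf '%s\n' "$key_count"
```

The key ends up in the file once, no matter how many times the helper is run.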

But, in order for the value of SSH keys to be fully realized, the destination computer/server should have password authentication disabled. Doing so means that only computers with authorized SSH keys will be allowed access.

I’m using SSH keys to lock down my home Synology server.  To do this, I SSH’d into the Synology as user “root”, since “root” is the only user authorized to make system changes.

$ssh root@SynologyIPaddress

By default, Synology only seems to have the text editing program “vi”. Let me tell you, it is NOT intuitive how to use it.  For example, to delete characters, you have to use the ‘x’ key!  Luckily the University of Washington has a nice tutorial on how to use “vi” for editing documents.

$vi /etc/ssh/sshd_config

Once you’re in the file, remove the “#” from in front of the two lines shown below AND change “yes” to “no” in the line “PasswordAuthentication”.  Then, be sure to save the file.

After quitting (don’t forget to save changes!), we need to restart the SSH service. I ended up doing this via the GUI since some of the common command line suggestions for restarting SSH didn’t work.

Now, when trying to SSH in, you’ll only be allowed in if you’re trying to do so from an authorized computer whose public key is installed on the server. On that note, it would be prudent to back up your private key so that if your computer dies, you’ll still be able to authenticate with the remote computer by installing your private key on a new client computer.

Changing the computer (host) name in Ubuntu

You’d think this would be easy.  Using the GUI, just go to “System Settings” > “Details”.  See the box with my computer’s name (Device name) in it?


In theory, it seems like I should just be able to click on that and type a new name in.


Instead, for some unknown reason, I have to open up Terminal and perform this from the command line.  And, it requires modification of two different files!  Why?  I guess one of the continuing quirks of using Ubuntu.

Anyway, here’re the files that need to be edited and here’re the commands to do so:

1.  Edit the /etc/hostname file

$gksudo gedit /etc/hostname

Remember to use gksudo to open gedit. I discovered this not too long ago. Once the file opens, just replace the existing Device name with the new one. In my file, there was no other text; only the current Device name.

2. Edit the /etc/hosts file

$gksudo gedit /etc/hosts

Once this file opens, find your current Device name and replace it with the new one that you entered in Step 1 above.
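For anyone who would rather script the two edits, here is a rough sketch that works on scratch copies of the two files instead of the real ones. The rename_host helper, the host names, and the scratch files are all made up for illustration; on a real system the files live in /etc and need root permissions to change.

```shell
# Hypothetical helper: put the new name in a hostname file and swap the
# old name for the new one in a hosts file
rename_host() {
  old=$1; new=$2; hostname_file=$3; hosts_file=$4
  # The hostname file holds only the name, so just overwrite it
  printf '%s\n' "$new" > "$hostname_file"
  # In the hosts file, replace only whole fields that equal the old name
  awk -v old="$old" -v new="$new" '
    { for (i = 1; i <= NF; i++) if ($i == old) $i = new; print }
  ' "$hosts_file" > "$hosts_file.tmp" && mv "$hosts_file.tmp" "$hosts_file"
}

# Scratch copies standing in for /etc/hostname and /etc/hosts
scratch=$(mktemp -d)
printf 'oldbox\n' > "$scratch/hostname"
printf '127.0.0.1\tlocalhost\n127.0.1.1\toldbox\n' > "$scratch/hosts"

rename_host oldbox newbox "$scratch/hostname" "$scratch/hosts"
cat "$scratch/hostname" "$scratch/hosts"
```

Matching whole fields (rather than any text containing the old name) avoids accidentally changing entries like “localhost” when the old name is a substring of something else.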

So, that’s it, I guess. It certainly isn’t the most intuitive way to accomplish this, but I guess it’s the only way to accomplish the task.

sudo echo solved!

In an earlier post when I was installing and configuring BLAST, I was unable to append a file from the command line using:

$sudo echo 'text I want appended to file' >>

I would always get a permission denied error, despite being able to open (and edit) that same file by using the graphical text editor program, “gedit”.

Turns out that the problem is related to how the “sudo” command is applied to the commands following it. The issue is that “sudo” only gets applied to “echo” and does not get applied to the redirect (the ‘>>’). The redirect is set up by my regular, non-root shell before “sudo” even runs. Thus, “echo” has sudo permissions, but the redirect does not, and the text cannot be appended to the file.

So, how does one get around this? Apparently, there are two relatively easy ways.

1. Use “sudo” to run a “subshell”. Example:

$sudo sh -c "echo 'text I want to append to a file' >> /path/to/file"

The “sudo” command gets applied to a new subshell (that’s what the ‘sh’ means; a new shell). Thus, everything that runs in that subshell is governed by the “sudo” permissions.

2. Use “sudo” to execute “tee”. Example:

$echo 'text I want to append to a file' | sudo tee -a /path/to/file

The “sudo” command gets applied to the “tee” command. As such, anything that “tee” executes already has sudo permissions. In the above example, the “echo” command normally sends the text in the single quotes to the screen, but in this instance the text gets “piped” (the ‘|’ symbol) to the subsequent command. The subsequent command is running “tee” with sudo permissions and tells “tee” to append (the ‘-a’ argument) the piped text to our file. Additionally, “tee” will also print the text to the screen, as that’s its primary function.
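Both patterns can be tried out safely on a scratch file. In this sketch “sudo” is left out on purpose so nothing on the system gets touched; for a root-owned file, “sudo” would go in front of “sh” and in front of “tee”. The scratch file and the text are made up for illustration.

```shell
# Scratch file standing in for the file we want to append to
target=$(mktemp)

# Pattern 1: the whole command, redirect included, runs inside the new shell
sh -c "echo 'line via subshell' >> '$target'"

# Pattern 2: tee opens the file itself and appends (-a) whatever is piped in;
# its copy to the screen is thrown away here to keep the output clean
echo 'line via tee' | tee -a "$target" > /dev/null

cat "$target"
```

In both cases the program that opens the file is the one that would be running under “sudo”, which is the whole point.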

Can’t update IPython

Well, been trying to get the IPython “nbconvert” command to run, to convert some IPython notebooks to html and/or slides. However, when I type this in Terminal:

samb@Mephisto:~$ ipython nbconvert

I get this output:

[TerminalIPythonApp] File not found: u'nbconvert'

This is weird because “nbconvert” is supposed to be built into IPython. Looking around the web, it seems as though “nbconvert” is only built in to IPython versions 1.0 and later. Let’s check my version of IPython:

samb@Mephisto:~$ ipython --version

So, that explains it. I’m running v0.13.2. This is mildly irritating, as I just installed IPython a week ago. In fact, the latest stable version listed on the IPython install page is 1.2.1 from February.



However, I can’t get IPython to update. When I try to use “apt-get”, this is what happens:

samb@Mephisto:~$ sudo apt-get install ipython-notebook
Reading package lists... Done
Building dependency tree
Reading state information... Done
ipython-notebook is already the newest version.
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.

Sooooooo, now what?

I’m going to try to remove IPython via the Ubuntu Software Center.

OK, done.  To be safe, will reboot computer and then verify IPython removal has been completed.

After restarting, typing “ipython notebook” in to Terminal indicates that it is not installed and suggests I can get, and install, IPython by entering “sudo apt-get install ipython”. However, I’m going to follow the instructions on the IPython install page, which instructs the user to enter: sudo apt-get install ipython-notebook

This is how things turned out:


See that? “Setting up ipython-notebook (0.13.2-2)…” That’s the same version I had previously! So, I guess I need to hunt down the more recent version. At the bottom of the IPython install page there is a link to the IPython Downloads Archive and this is what’s in it:


Lo and behold, the current version, which is MANY versions removed from the version that “apt-get” is retrieving.  Clicking on that link brings me to a page with a “tarball” (a file that ends with “.tar.gz”) and a ZIP file.

Downloaded the tarball by clicking on the link (I should learn how to do that in the command line).

Un-tarred and un-gzipped the file in Terminal:

samb@Mephisto:~/Downloads$ tar -xzf ipython-1.2.1.tar.gz

I then moved (using the GUI) the un-tarred and un-gzipped ipython-1.2.1 folder to my /home/samb directory and followed the installation directions for installing IPython from source.

The Terminal showed a bunch of stuff. Let’s see if using “nbconvert” does anything.

Result? Typing “ipython nbconvert” spits out a whole bunch of stuff; which is great!

I’ll launch IPython and see if it still sees my configuration file from the initial install/configuration last week.

Didn’t work! Launched IPython (ipython notebook) and this is what I got:

ImportError: No module named jinja2

Looking back at the installing IPython from source page, it doesn’t indicate that required dependencies are not installed by default. Maybe that’s always the case when installing packages from source when using Linux? I have no idea, but it would certainly be nice if the IPython instructions indicated that, particularly since using “apt-get” doesn’t retrieve the most recent IPython build, thus potentially requiring people to download the source file.

Now that we’ve encountered this problem, I’m going to try to install IPython using “pip”. However, when I try running “pip” from Terminal, I’m informed that “pip” is not installed on the system. So, we’ll get it:

samb@Mephisto:~$ sudo apt-get install python-pip

Now, let’s run (as shown on the IPython install page):

samb@Mephisto:~$ pip install ipython[all]
Requirement already satisfied (use --upgrade to upgrade): ipython[all] in /usr/local/lib/python2.7/dist-packages
Installing extra requirements: 'all'
Cleaning up...
samb@Mephisto:~$ ipython --version

Not thrilled by the “Requirement already satisfied” output, since I uninstalled the last IPython (used the Ubuntu Software Center Package Manager). And, guess what. The notebook won’t launch and still spits out the error message about no jinja2 module. I’ll try restarting the computer and see if that helps.


It did not help. I’m going to just delete the python2.7 directory found in /usr/local/lib, since that’s where all the IPython stuff seems to live. Deleted the directory:

samb@Mephisto:~$ sudo rm -rv /usr/local/lib/python2.7

rm = remove
-rv = recursively (r), verbose (v) to see all the files as they get deleted

Tried using “pip” again:

pip install ipython

It downloaded the ipython1.2.1 tarball, extracted stuff and then…

couldn’t write to the /usr/local/lib directory due to lack of permissions.

So, I re-ran the command with “sudo” and everything seems to have installed properly and completely. Trying to launch IPython notebook yields…

THE SAME JINJA2 module message!!!

Ahhhhh. Just realized I didn’t run:

sudo pip install ipython[all]

Totally forgot the “[all]”. Let’s try again.

Removed the python2.7 directory, as before. Ran sudo pip install ipython[all]. Ha! Well look at that! Here’s the end portion of the output from that:

changing mode of /usr/local/bin/ to 755
changing mode of /usr/local/bin/ to 755
changing mode of /usr/local/bin/ to 755
Successfully installed ipython Sphinx pygments jinja2 nose docutils
Cleaning up...

Now, try to launch IPython and… Success!! I have finally managed to upgrade IPython! Not only that, but it has retained my previous configuration set up for the default notebook location. And, finally, test out if “nbconvert” is present. Yep! It’s there (just typed: ipython nbconvert)!

Here’s the quick summary of what I ended up having to do:

1. Had to delete the python2.7 folder (located at /usr/local/lib).

2. Had to install “pip” (sudo apt-get install python-pip)

3. Used “pip” to retrieve and install the most current version of IPython (sudo pip install ipython[all])




Use gksudo instead of sudo?

When using the “sudo gedit” command in Terminal to open my script “/etc/profile.d/”, I noticed I was getting the error message:

(gedit:2359): IBUS-WARNING **: The owner of /home/samb/.config/ibus/bus is not root!

The error message wasn’t preventing “gedit” from opening or anything; I just noticed that the Terminal had spit this out while I was editing my script file in “gedit”. So, curiosity got the best of me and I started to see if I could find a way to eliminate the message.

After looking around the web, it seems that I should be using the “gksudo” command instead of regular old “sudo” when opening programs that have a graphical user interface (GUI).

So, let’s try opening the script above using “gedit” and “gksudo”.

samb@Mephisto://etc/profile.d$ gksudo gedit
The program 'gksudo' is currently not installed. You can install it by typing:
sudo apt-get install gksu

Well, it turns out that “gksudo” isn’t installed by default on this version of Ubuntu (13.10). However, it’s nice that the output tells you how to obtain “gksudo”. So, let’s try that out and try running “gksudo” again.

Using “gksudo” to run “gedit” works; no more warning message.

Installation and configuration of NCBI standalone BLAST

The Basic Local Alignment Search Tool (BLAST) is available to download for a variety of platforms here:

I snagged this one:

Downloaded to my large Windows partition and extracted the files by double-clicking.

The primary “configuration” that I’m interested in is adding the BLAST directory (/media/B0FE4B1FFE4ADD6A/BioinformaticsTools/ncbi-blast-2.2.29+/bin) and the BLAST database directory (/media/B0FE4B1FFE4ADD6A/BioinformaticsTools/ncbi-blast-2.2.29+/dbs) to the Linux “PATH”.

One question is, why? What’s the point? Well, I guess those are two questions, but whatever.

The primary advantage to adding these two directories to the system PATH, is simply a matter of efficiency. To run BLAST in the current configuration, I have to type something like this (after the $):

Change to the proper directory:
samb@Mephisto:/$ cd /media/B0FE4B1FFE4ADD6A/BioinformaticsTools/ncbi-blast-2.2.29+/bin

Then, type out the necessary instructions to initiate BLAST and provide the location of the BLAST database:
samb@Mephisto:/media/B0FE4B1FFE4ADD6A/BioinformaticsTools/ncbi-blast-2.2.29+/bin$ ./blastn -db /media/B0FE4B1FFE4ADD6A/BioinformaticsTools/ncbi-blast-2.2.29+/dbs/RickettsiaGBnt20140228

By adding the BLAST and database directories to the system PATH, all I’d have to enter would be:

samb@Mephisto:/$ blastn -db RickettsiaGBnt20140228

Much cleaner and more efficient. No changing directories, no lengthy directory paths.

So, let’s see if we can get that set up.

Tried to follow some instructions on the NCBI website:

But, those instructions don’t tell you how to make a script to automatically append the BLAST and BLAST database locations to the system PATH so that you don’t have to amend the PATH each time you want to use BLAST! Lame!

Additionally, the instructions say that a “.ncbirc” file is required in the Home directory. Well, should the file have a name or do I just leave the file literally “.ncbirc”? Is it like a .htaccess file then? I guess I’ll just have to try it out. Ugh.

Anyway, here’s how I got it working so that I don’t have to change directories to the BLAST folder each time I run BLAST and so I can just enter the name of a BLAST database without having to type out the full path of the database.

1. For dual boot systems, BLAST cannot be installed on the Windows partition. Linux can’t do the proper writing/executing of files on that partition due to a difference in partition formatting (Windows is NTFS).

2. Can’t simply move BLAST package from one location to a new one. It seems that the BLAST package needs to be re-installed in the user’s desired location. However, this could be an issue related to partition formatting, as I tried to simply move the BLAST folders from the Windows partition to the Linux partition. When I did this, couldn’t get BLAST to run, even though it was now on the Linux partition.

3. Created a script file in the /etc/profile.d folder called “” so that the locations of the BLAST executables and BLAST databases will be loaded into the system PATH upon logging in to the computer, as opposed to having to manually append the locations to the PATH each time I start the computer. Put this info in the file:
#!/bin/bash
export PATH=$PATH:/media/B0FE4B1FFE4ADD6A/BioinformaticsTools/ncbi-blast-2.2.29+/bin
export BLASTDB=/home/samb/BioinformaticsTools/ncbi-blast-2.2.29+/db

The “#!/bin/bash” line identifies the file as a bash script to the system. The other two lines “export” the variables (PATH and BLASTDB) so that they are available for use by the rest of the system.
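To check that appending a directory to PATH really lets a program be run by name alone, here is a small sketch. It uses a throwaway directory and a fake program called “blastn_demo”, both made up for illustration, so it can be run without BLAST installed.

```shell
# Throwaway directory with a fake executable standing in for the BLAST bin folder
scratch=$(mktemp -d)
printf '#!/bin/sh\necho "fake blastn"\n' > "$scratch/blastn_demo"
chmod +x "$scratch/blastn_demo"

# Same pattern as in the script above: append the directory to the existing PATH
export PATH="$PATH:$scratch"

# "command -v" reports where the shell finds a program, if it finds it at all
found=$(command -v blastn_demo)
printf '%s\n' "$found"
```

If the directory made it into PATH, the full path to the fake program is printed; if not, nothing is printed.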

4. Created the “.ncbirc” file (called it blastdb.ncbirc) and put it in my computer’s “Home” folder. File contains this text:

5. Everything runs wonderfully!

I documented everything (including all failures, file locations changes, etc) in an IPython Notebook: InstallingBLAST.

Customizing IPython default notebook location

Did some searching around and it turns out, the “easiest” way to do this is to simply navigate (using Terminal) to your desired notebook save location and then launch IPython (just enter the text that’s listed after the $):

samb@Mephisto:~$ ipython notebook

I’ve been using IPython for some months now and never realized that it was simply saving notebooks in my current working directory! Doh!  I could’ve just been changing to the desired directory this entire time!

Knowing this, I can just create a symbolic link to my desired default IPython directory and quickly change to the directory when starting IPython without having to go through a bunch of cd steps to get to the location.

However, if you have a default location you’d like IPython to save your IPython notebooks to, I found out how to edit some of the IPython profile configuration files on this Stack Overflow entry.

Briefly, here’s what needs to be done.

1. Locate the IPython configuration directory, using Terminal.


samb@Mephisto:~$ ipython locate



2. My IPython location did NOT contain an “” file, so one had to be created. Oh, and to figure out that I did NOT have that Python (.py) file, I just navigated there using the GUI (Nautilus), but I did have to allow for viewing Hidden Files by pressing Ctrl-h. Used Terminal to create this file with the following command (and subsequent output):

The input:

samb@Mephisto:~$ ipython profile create

The output:

[ProfileCreate] Generating default config file: u'/home/samb/.config/ipython/profile_default/'
[ProfileCreate] Generating default config file: u'/home/samb/.config/ipython/profile_default/'

3. I opened my “” file with the gedit program (the default program that opens text-based files when I double-click on them) and changed this:

# The directory to use for notebooks.
#c.NotebookManager.notebook_dir = u'/home/samb'

to this:

# The directory to use for notebooks.
c.NotebookManager.notebook_dir = u'/media/B0FE4B1FFE4ADD6A/Users/Samb/Dropbox/Lab/IPython_nbs'

Notice, I took out the hash/pound symbol (#) before the line that has the directory to my desired notebook save location, as the hash symbol indicates that a particular line should NOT be interpreted by whatever program is reading the file.

One thing I did notice is that the Stack Overflow entry indicates there should be two lines that need to be adjusted. However, my file only had one of the two entries described, so I simply changed that. I’ll launch IPython and see what happens.

My expectation is that IPython will open and see the existing IPython notebooks in the directory I specified in the config file.

The result?

It worked perfectly (see the area surrounded by the red rectangle)!

The biggest benefit of this setup is that I will now have a fixed file path for my IPython notebooks, whether I’m working at home, at work, or some other remote location.  After playing around some more, I may decide to set up a Git repository so that the IPython files are publicly accessible.  I’m also going to try out the Synology Cloud Station (basically the same as Dropbox) functionality on our lab’s server.  I think the latter might be ideal, since I’d be saving my IPython notebooks in our “web” folder, which is already publicly accessible.  But, using a Git repository would also have the benefit of version control, which might be nice.

Whatever the case, I can now quickly and easily modify the default IPython notebook save location!


Installing IPython

Installation instructions for IPython are here and it looks like using Linux has an advantage over the other two operating systems; no installation of extraneous software needed!

Just have to open Terminal and enter: sudo apt-get install ipython-notebook

Now, to test out if that worked. In Terminal: ipython notebook


Cool, it’s up and running and I’m not sure it could have been any easier! Next IPython steps will be to configure the default save location for the IPython notebook files.