Creating SQLite3 Databases from a .csv File

When you’re working with a spreadsheet that requires more complex filtering/sorting (i.e. based on conditions that are satisified (or not) across multiple columns/rows) than what’s generally available in Microsoft Excel/Libre Office/Google Sheets, it turns out that SQLite3 (which comes installed with Ubuntu) is a pretty easy/quick way to approach complicated filtering. Additionally, this can all be done from the shell (command line).

The spreadsheet I’m working with in this example contains data for abalone samples that have been evaluated for infection levels of the withering syndrome bacterium in various tissues.

Before using SQLite:

  • Change the names of any columns that are duplicate names.
  • Export the spreadsheet as a comma-separated file (.csv)

The remaining steps are all performed using the shell. Additionally, none of the commands below actually alter your file in any way. So, you can play around all you want without worrying about modifying/destroying the source data set.

Check the header (column names) of the spreadsheet .csv file:

$head -1 RedPinkPinto.csv

The output:

Accession #,Type,Date,Species,Control/Exp,Shell Length (mm),Total Weight (g),Shell weight (g),Body weight (g),Sex,Foot Condition,PE RLO,PE St,PE New,PE total RLOs,DG RLO,DG St,DG New,DG Tot RLOs,Metaplasia,Coccidia (Y/N),Comments

Launch SQLite3:

$sqlite3

The shell prompt should change to:

sqlite>

Specify the delimiter of the file to be imported (comma in this instance):

sqlite>.separator ","

Import the .csv file (which creates a SQLite table) and name the table:

sqlite> .import RedPinkPinto.csv RedPinkPinto

Check the column names of our new table:

sqlite>.schema

Screenshot of the output:

Selection_026

Change the output view of the new table from comma-separated to a columnar view:

sqlite>.mode column

Run a query (filter) to view only those records (rows) where the Species column contains “Red” and where the value in the “PE St” column is greater than the value in the “PE New” column.

Additionally, only show the following columns in the output:

  • Accession #
  • Species
  • PE RLO
  • PE St
  • PE New
  • DG RLO
  • DG St
  • DG New
sqlite> SELECT "Accession #", Species, "PE RLO", "PE St", "PE New","DG RLO", "DG St", "DG New" FROM RedPinkPinto WHERE "Species"="Red" AND "PE St">"PE New";

Here’s a screenshot of the output:

Selection_029

This is cool, but it’s a pain to have to try to remember the order of the columns.

Enable column headers in the output:

sqlite>.header ON

Re-run the query to view the output difference (NOTE: Quotes are needed around column names that contain spaces):

sqlite> SELECT "Accession #", Species, "PE RLO", "PE St", "PE New","DG RLO", "DG St", "DG New" FROM RedPinkPinto WHERE "Species"="Red" AND "PE St">"PE New";

Screenshot of the output with column headers enabled:

Selection_030

Much nicer and easier to read!

Overall, using SQLite is a much faster, easier, and more flexible process to filter datasets than trying to do so in a traditional spreadsheet.

SSH tunnels and IPython

We have an OS X server blade at the lab that’s been suped up with some extra RAM (I think 24GB) and is very useful for our bioinformatics processing, as there are a number of bioinformatics tools installed on it. Additionally, running sequence analysis jobs on the server keeps my personal computer free to do other tasks instead of being swamped by these computationally expensive processes.

With that being the case, it’s also much more convenient to log in to the server remotely instead of having to sit directly at the computer to initiate jobs.

I’ve previously written how I’ve managed to set up SSH and the SSH config file to simplify my life for remotely logging in to the server. That part is pretty straightforward. However, we also use IPython Notebooks to document how we’ve processed data. IPython Notebooks run through a web browser, so how can we still use them when we’ve SSH’d into the remote server via the command line?

It turns out, you can set up a SSH tunnel and then connect to the server through your browser. This does require that you have access to the server, as well as available ports on both your client and server to connect through.

Here’s how I create a tunnel into our server:

ssh -L client_port_Number_over8000:localhost:server_port_Number_over8000 user_name@server_ip_address

That’s a lot to type, especially if you’re logging into the server multiple times a week or even multiple times a day. To get around this, we can edit the SSH config file (found in ~/.ssh/config)

Host easy_to_remember_nickname
HostName IP_address_of_remote_server
User remote_server_login_username
ControlMaster auto
LocalForward localhost:client_portNumber_over8000 localhost:server_portNumber_over8000
ServerAliveInterval 30

With this info in the config file, recreating the tunnel is as easy as typing the following on the command line:

$ssh easy_to_remember_nickname

Once the tunnel is established, open a new Terminal window and SSH into the server. Once the SSH connection is established, change to the directory containing your IPython notebook(s) (Note: this is not needed if you’ve set up an IPython config file that sets the default IPython directory), then launch IPython on the remote machine:

$ipython notebook --no-browser --port=first_port_number_listed_in_tunnel_command

Now, on the machine you’re using, start your browser and enter the following into the URL of your browser:

localhost:first_port_number_listed_in_tunnel_command

This should connect through the tunnel and you’ll have the normal web-based view of your notebooks, but those notebooks are actually housed on a remote machine!

Installing Eclipse IDE for Java Developers

I feel like a lot of  my posts are basic software installs.  One day, this will become second nature, I’m sure…

To play around with more Java programming, I’ve decided to try out Eclipse and use Eclipse instead of JGRASP.  Eclipse has so many features to help simplify the programming process; I’m irritated I didn’t take my sister’s advice and install this during my computer programming course.

Anyway, I grabbed the 64-bit Eclipse IDE for Java Developers tarball and extracted it using the Ubuntu graphical user interface (GUI).  I moved the extracted folder to my home folder and then added an alias to my bash_aliases file.

I don’t know how to restart bash to honor updates to system files, so I restarted the computer.  Booya!  Eclipse installed and launchable from Terminal by simply typing:

$eclipse

Creating aliases (i.e. shortcuts) to launch programs from Terminal

As part of a computer programming course (which utilizes Java) that I started taking, we had to install a program called JGRASP.  In order to launch the program, I have to type the following in Terminal:

$java -jar jgrasp.jar

This is long and difficult to remember.  Luckily, there’s a way to shorten this up and use a custom name that’s easier to remember.

It turns out Ubuntu is set up to handle this type of thing.  By creating a file in the Home directory called:

.bash_aliases

Upon starting Terminal, Ubuntu will check for the existence of this file. If it exists, it will use any aliases present.

So, for this specific example, I added the following to my .bash_aliases file:

alias jgrasp=’java -jar jgrasp.jar’

The format of the alias command is rather straightforward:
alias – Invokes the alias Terminal command.
jgrasp – The name of my alias that I decided upon.
=’ ‘ – The equals sign assigns the following path and/or command(s) that are contained in single quotation marks to the alias.

So, now to launch JGRASP, all I have to type in Terminal is:

$jgrasp

Sweet! And, with the ability to assign a path to a specific alias, that means I also don’t have to change directories to launch programs!

Installing a version control system (SmartGit)

I’ve been taking a computer programming course and finally decided I’d like to have some sort of version control so that I can revert to older versions of the programs I’m writing in case I totally jack something up.

I selected SmartGit from the Git website’s list of Linux version with a graphical user interface (GUI).

Downloaded SmartGit and moved the tarball gzip to my Home directory and upacked/extracted the files by double-clicking.

Ironically, since I selected SmartGit based on it having a GUI, after unpacking the file in my Home directory, I couldn’t figure out how to start/install the program (no GUI for that, I guess!). So, I took a look at the readme.txt file in the SmartGit folder. Here’s what it reads:

Installation

SmartGit itself does not need to be installed; just unpack it to your preferred
location and launch the bin/smartgit.sh script. It might be necessary to define
the SMARTGIT_JAVA_HOME environment variable pointing to the used Java home.

Well, this doesn’t seem like the best start for selecting a program that should make using easier, due to the existence of a GUI. I guess I’ll have to launch it the “old fashioned” way…

Change to the directory containing the smartgit.sh script:
$cd /smartgithg-6_0_5/bin

Launch SmartGit:
$./smartgithg.sh

SmartGit launched and here are the steps that I went through to get it configured:

1.See Git Executable location – Didn’t change anything.

Setup SmartGit-Hg _031

2.Select SSH Client – Didn’t change anything.

Setup SmartGit-Hg _032

3. Enter username/e-mail – Didn’t enter anything here.

Setup SmartGit-Hg _033

4a. Select a hosting provider for online repositories – I already have a GitHub account, so I selected that.

Setup SmartGit-Hg _034

4b. Selected GitHub…

Setup SmartGit-Hg _035

4c. Generated a GitHub API token on the GitHub website…

Authorized applications - Mozilla Firefox_037

4d. Continued API token stuff…

New personal access token - Mozilla Firefox_038

5.SmartGit detected “existing” local Git repositories – Not sure how/why these get detected as Git repositories, since I haven’t used them as such.

Setup SmartGit-Hg _039

6.New or existing repository – Selected to create a new repository.

Welcome to SmartGit-Hg _040

7.Weird message – I just hit “Initialize”

Add or Create Repository _041

Now, I want to make an alias so that I can launch the program without having to remember the location all the time. So, going back to my post on creating aliases in the .bash_aliases file, I’ll add the following to the .bash_aliases file so that I can easily launch SmartGit without having to remember where it is on my system:

alias smartgit=’/home/samb/smargithg-6_0_5/bin/smartgithg.sh’

Now, to launch SmartGit, all I have to type into Terminal is this:

$smartgit

Very nice!

SSH Config File for Easier Logins

Using SSH is relatively straightforward and pretty easy. Simply enter something like the following into Terminal:

$ssh user@computer.ip.address

If you don’t already utilize security keys, you’ll be prompted for the password associated with the user that you’re trying to login as. Voila, you’re in and commanding a computer/server remotely.

Despite this simplicity, if you need to access multiple computers/servers and those computers/servers don’t have nice, easy-to-remember domain names, then you’re left with the task of remembering IP addresses for all the different computers/servers. Of course, you can write these down somewhere for future reference, but it would be much nicer if SSH could remember this info for you. As it turns out, SSH can store that info for you! All you have to do is create a config file with the necessary info.

First, on your client computer (i.e. the computer you’ll be using to connect to the remote computer/server), use Terminal to change to the hidden SSH directory:

$cd ~/.ssh

This can also be done via the graphical user interface (GUI) of your computer, if it’s easier. You’ll just have to enable viewing of hidden files/folders. To do so, while you are in a file browser window, go to the “Edit” menu at the top of your screen, select “Preferences” and then check to box to “Show hidden and backup files.” Now you’ll be able to see the “.ssh” folder.

Once there, create a file called “config”. You can do this in a variety of ways. I’ve used the “touch” command in Terminal:

$touch config

Now, we need to add the needed text to this config file. I’ve just used a graphical text editor (gedit) by entering the following in Terminal:

$gksudo gedit config

In the file, I’ve entered the minimally necessary info to accomplish what I want (which is to be able to connect to remote computers/servers without remembering long IP addresses or domain names). For each remote computer/server that you’ll connect to, you’ll have the following text in the config file:

Host easy_to_remember_nickname
HostName IP_address_of_remote_server
User remote_server_login_username
ControlMaster auto
ServerAliveInterval 30
ServerAliveCountMax 3

Now, just save this config file and you’re done! Here’s an example of what the config file might look like with actual info:

Host hotdogs
HostName 192.168.1.1
User MyUserIDforTheServer
ControlMaster auto
ServerAliveInterval 30
ServerAliveCountMax 3

Now, when I decide to use SSH to login to the server located at address 192.168.1.1, all I have to enter into Terminal is this:

$ssh hotdogs

Additionally, you can append additional entries to this file for each server that you connect to.

Oh, and as for the other info in the config file, I’m not fully certain what they all do (the info for this was provided by a computer guru who works with our lab):

ControlMaster – Don’t know
SererAliveInterval 30 – This is for keeping the connection alive.
ServerAliveCountMax – This is for keeping the connection alive.

Of course, you still have to remember the “pet names” that you give to your servers, but it’s much easier to remember English words that it is to remember a series of digits associated with a particular server.

Updating IPython (again)

IPython 2.0 has just been released, so I figured I’d go ahead and update it.

I’ve already had one post dedicated to updating IPython, but that was complicated due to the discrepancy between the version available in the Ubuntu Software Center and the version available in the pip repository. After some struggles, I figured out how to install pip (since pip is not installed with Ubuntu by default) and use pip to download and install the full IPython package:

$sudo pip install ipython[all]

To use pip to update a previously installed package (in my case, IPython), I just typed the following in to Terminal:

$sudo pip install --upgrade ipython

And, it quickly downloaded IPython 2.0 and uninstalled the previous version of IPython (1.2.1). Easy!

samb@Mephistopheles: ~_020

Well, I guess I spoke too soon! Trying to run IPython:

$ipython notebook

I get this error message:

ImportError: The IPython Notebook requires tornado >= 3.1.0, but you have 2.4.1

Great! I have absolutely no idea what “tornado” is, nor am I totally certain how to update it.

Tried

$sudo apt-get tornado

and got this: “E: Invalid operation tornado

For some reason, I decided to try the IPython update command I used above, but add “[all]” to the end of the command like was necessary for the initial installation.

$sudo pip install --upgrade ipython[all]

Guess what! It friggin’ worked! The very first thing that popped up on screen was info regarding updating “tornado”! Can’t believe I got burned by the “[all]” command again when dealing with installing IPython! But, it doesn’t seem logical to me that if you’re updating an existing version of IPython that it wouldn’t update all existing components/dependencies of IPython that are already installed on the system.

Whatever. It’s updated and it’s running!

Synology-specific and user-specific SSH key requirements

Earlier, I figured out how to set up SSH keys for SSH authentication in to a Synology server, and eliminate the use of a password for authentication.

These were steps towards securing the Synology, but what I really wanted to accomplish was being able to disable the “root” account to really put the server on lockdown. Although the process is technically very easy (just edit the /etc/ssh/sshd_config file and change “PermitRootLogin” to “yes”), I still needed to verify that I could SSH in to the Synology with another user account. Having SSH’d in to the Synology in the past, I had learned that the only other user account (besides “root”) that has SSH permissions by default is the “admin” account.

So, knowing that I had already established private and public keys on my computer AND had connected to the Synology from my computer using said keys, I tried to SSH with the admin account. What happened? Got this message: “Permission denied (publickey).”

That’s both good and bad.

Good that the Synology is definitely not using password authentication any more.
Bad that I can only login using the “root” account.

So, how to resolve this? Since I had been under the impression that the server just needed a single public key that corresponded to a single private key, I was a bit stumped. I had assumed that the public key provided to the Synology would apply to all existing user accounts. After a fair amount of searching (and I think perusing the Synology forums), I stumbled across the reason for this.

It turns out that each individual user has their own location on the Synology to store authorized public keys! After discovering that, it was fairly straightforward to add my existing public key on my computer to the appropriate user’s (in this case, “admin”) SSH authorized keys file. I used the following command to accomplish this:

$cat /path/to/ssh_public_key/id_rsa.pub | ssh root@synologyaddress 'cat >> /volume1/homes/SynologyUserName/.ssh/authorized_keys'

The brief explanation of the command:
First we used “cat”, which normally prints the text of your intended file (in this case, the id_rsa.pub) to the screen. But, instead of printing the info to the screen we “piped” that info (that’s the “|” character) to our SSH connection. The info of the “id_rsa.pub” file gets sent to server over our SSH connection and then we tell the server to use “cat” to append (that’s the purpose of the “>>“) that information to the authorized keys file for the specific user.

After doing that, I can now connect using the “admin” user account. That means I can now disable the “root” user login capabilities.

Use awk to count number of sequences in a FASTA file

Nice one-liner that simply counts the number of greater-than (>) symbols in a file. In a FASTA file, there should only be a single “>” for each sequence in the file.

$awk '/>/ { count++ } END { print count }' InputFastaFile.fasta

Here’s an example of a FASTA file format for those who don’t know:

>sequence_ID_1
atcgatcgggatcaatgacttcattggagaccgaga
>sequence_ID_2
gatccatggacgtttaacgcgatgacatactaggatcagat