Creating SQLite3 Databases from a .csv File

When you’re working with a spreadsheet that requires more complex filtering/sorting (i.e. based on conditions that are satisified (or not) across multiple columns/rows) than what’s generally available in Microsoft Excel/Libre Office/Google Sheets, it turns out that SQLite3 (which comes installed with Ubuntu) is a pretty easy/quick way to approach complicated filtering. Additionally, this can all be done from the shell (command line).

The spreadsheet I’m working with in this example contains data for abalone samples that have been evaluated for infection levels of the withering syndrome bacterium in various tissues.

Before using SQLite:

  • Change the names of any columns that are duplicate names.
  • Export the spreadsheet as a comma-separated file (.csv)

The remaining steps are all performed using the shell. Additionally, none of the commands below actually alter your file in any way. So, you can play around all you want without worrying about modifying/destroying the source data set.

Check the header (column names) of the spreadsheet .csv file:

$head -1 RedPinkPinto.csv

The output:

Accession #,Type,Date,Species,Control/Exp,Shell Length (mm),Total Weight (g),Shell weight (g),Body weight (g),Sex,Foot Condition,PE RLO,PE St,PE New,PE total RLOs,DG RLO,DG St,DG New,DG Tot RLOs,Metaplasia,Coccidia (Y/N),Comments

Launch SQLite3:

$sqlite3

The shell prompt should change to:

sqlite>

Specify the delimiter of the file to be imported (comma in this instance):

sqlite>.separator ","

Import the .csv file (which creates a SQLite table) and name the table:

sqlite> .import RedPinkPinto.csv RedPinkPinto

Check the column names of our new table:

sqlite>.schema

Screenshot of the output:

Selection_026

Change the output view of the new table from comma-separated to a columnar view:

sqlite>.mode column

Run a query (filter) to view only those records (rows) where the Species column contains “Red” and where the value in the “PE St” column is greater than the value in the “PE New” column.

Additionally, only show the following columns in the output:

  • Accession #
  • Species
  • PE RLO
  • PE St
  • PE New
  • DG RLO
  • DG St
  • DG New
sqlite> SELECT "Accession #", Species, "PE RLO", "PE St", "PE New","DG RLO", "DG St", "DG New" FROM RedPinkPinto WHERE "Species"="Red" AND "PE St">"PE New";

Here’s a screenshot of the output:

Selection_029

This is cool, but it’s a pain to have to try to remember the order of the columns.

Enable column headers in the output:

sqlite>.header ON

Re-run the query to view the output difference (NOTE: Quotes are needed around column names that contain spaces):

sqlite> SELECT "Accession #", Species, "PE RLO", "PE St", "PE New","DG RLO", "DG St", "DG New" FROM RedPinkPinto WHERE "Species"="Red" AND "PE St">"PE New";

Screenshot of the output with column headers enabled:

Selection_030

Much nicer and easier to read!

Overall, using SQLite is a much faster, easier, and more flexible process to filter datasets than trying to do so in a traditional spreadsheet.

Creating aliases (i.e. shortcuts) to launch programs from Terminal

As part of a computer programming course (which utilizes Java) that I started taking, we had to install a program called JGRASP.  In order to launch the program, I have to type the following in Terminal:

$java -jar jgrasp.jar

This is long and difficult to remember.  Luckily, there’s a way to shorten this up and use a custom name that’s easier to remember.

It turns out Ubuntu is set up to handle this type of thing.  By creating a file in the Home directory called:

.bash_aliases

Upon starting Terminal, Ubuntu will check for the existence of this file. If it exists, it will use any aliases present.

So, for this specific example, I added the following to my .bash_aliases file:

alias jgrasp=’java -jar jgrasp.jar’

The format of the alias command is rather straightforward:
alias – Invokes the alias Terminal command.
jgrasp – The name of my alias that I decided upon.
=’ ‘ – The equals sign assigns the following path and/or command(s) that are contained in single quotation marks to the alias.

So, now to launch JGRASP, all I have to type in Terminal is:

$jgrasp

Sweet! And, with the ability to assign a path to a specific alias, that means I also don’t have to change directories to launch programs!

Installing a version control system (SmartGit)

I’ve been taking a computer programming course and finally decided I’d like to have some sort of version control so that I can revert to older versions of the programs I’m writing in case I totally jack something up.

I selected SmartGit from the Git website’s list of Linux version with a graphical user interface (GUI).

Downloaded SmartGit and moved the tarball gzip to my Home directory and upacked/extracted the files by double-clicking.

Ironically, since I selected SmartGit based on it having a GUI, after unpacking the file in my Home directory, I couldn’t figure out how to start/install the program (no GUI for that, I guess!). So, I took a look at the readme.txt file in the SmartGit folder. Here’s what it reads:

Installation

SmartGit itself does not need to be installed; just unpack it to your preferred
location and launch the bin/smartgit.sh script. It might be necessary to define
the SMARTGIT_JAVA_HOME environment variable pointing to the used Java home.

Well, this doesn’t seem like the best start for selecting a program that should make using easier, due to the existence of a GUI. I guess I’ll have to launch it the “old fashioned” way…

Change to the directory containing the smartgit.sh script:
$cd /smartgithg-6_0_5/bin

Launch SmartGit:
$./smartgithg.sh

SmartGit launched and here are the steps that I went through to get it configured:

1.See Git Executable location – Didn’t change anything.

Setup SmartGit-Hg _031

2.Select SSH Client – Didn’t change anything.

Setup SmartGit-Hg _032

3. Enter username/e-mail – Didn’t enter anything here.

Setup SmartGit-Hg _033

4a. Select a hosting provider for online repositories – I already have a GitHub account, so I selected that.

Setup SmartGit-Hg _034

4b. Selected GitHub…

Setup SmartGit-Hg _035

4c. Generated a GitHub API token on the GitHub website…

Authorized applications - Mozilla Firefox_037

4d. Continued API token stuff…

New personal access token - Mozilla Firefox_038

5.SmartGit detected “existing” local Git repositories – Not sure how/why these get detected as Git repositories, since I haven’t used them as such.

Setup SmartGit-Hg _039

6.New or existing repository – Selected to create a new repository.

Welcome to SmartGit-Hg _040

7.Weird message – I just hit “Initialize”

Add or Create Repository _041

Now, I want to make an alias so that I can launch the program without having to remember the location all the time. So, going back to my post on creating aliases in the .bash_aliases file, I’ll add the following to the .bash_aliases file so that I can easily launch SmartGit without having to remember where it is on my system:

alias smartgit=’/home/samb/smargithg-6_0_5/bin/smartgithg.sh’

Now, to launch SmartGit, all I have to type into Terminal is this:

$smartgit

Very nice!

Use awk to count number of sequences in a FASTA file

Nice one-liner that simply counts the number of greater-than (>) symbols in a file. In a FASTA file, there should only be a single “>” for each sequence in the file.

$awk '/>/ { count++ } END { print count }' InputFastaFile.fasta

Here’s an example of a FASTA file format for those who don’t know:

>sequence_ID_1
atcgatcgggatcaatgacttcattggagaccgaga
>sequence_ID_2
gatccatggacgtttaacgcgatgacatactaggatcagat

Use sed (or awk) to remove newline breaks from FASTA file

So, in my last post, I was stoked to find a way to filter FASTA files by a minimum sequence length using awk. However, that little one-liner fails miserably if your FASTA file isn’t formatted “correctly.” It turns out that the FASTA file I was working on (which was generated by a SOAP de novo assembly using iPlant) was technically formatted “correctly,” in that it had the necessary features to be called a FASTA file. It looked like this:

>sequence_ID_1
atcgatcgggatc
aatgacttcattg
gagaccgaga
>sequence_ID_2
gatccatggacgt
ttaacgcgatgac
atactaggatcag
at

For all intents and purposes, that is a proper FASTA file. However, that formatting is problematic when using a program like sed or awk because both of those programs examine files by line. So, when I would run my awk one-liner to filter out all sequences (in this specific case, contigs) greater than a specified length,

awk '!/^>/ { next } { getline seq } length(seq) >= 200 { print $0 "\n" seq }' FastaFileInput.fa > FastaFileOutput.fa

 


I wasn’t getting any results that had sequences longer than 200bp because of the line breaks within each individual sequence!

So, how did I solve this issue?

Of course, there are programs/scripts that are able to handle instances like this. In fact, iPlant even has a sequence length restriction App built-in, so I don’t even need to take my SOAP FASTA file out of iPlant at all. However, I am trying to learn sed and awk, so I turned this problem into a learning exercise. Additionally, there may be times when I’m working with FASTA files outside of iPlant and may need/want a tool to manipulate a FASTA file without having to rely on a special script, program or website.

Unfortunately, the solution was WAY beyond anything I could end up finding on my own. I did find a way to remove “newlines” (indicated as ‘n’ in sed and awk) using sed and/or translate (“tr” command in Terminal). Here’s the sed command:

$sed ':a;N;$!ba;s/\n//g'

 


I still don’t fully follow how this works, but I do know that this part

s/\n//g

means that when a newline is encountered,

\n

substitute

s/

it with nothing throughout the entire file

/g

The remainder of the command is complicated and involves labels and branching and other stuff I don’t fully follow yet.

Using the built-in translate command in Terminal is MUCH cleaner and MUCH easier to understand for newbs like me:

$tr -d '\n'

This means use translate

tr

 

to delete all newlines

-d '\n'

 

The above are great for general usage, but they don’t address the idea of how to ignore certain lines. I ended up turning to the amazing folks at StackOverflow for help and they came up with solutions for both sed and awk amazingly quickly; it was great!

Here’s a solution with sed:

$sed ':a;N;/^>/M!s/\n//;ta;P;D' InputFastaFile.fasta > OutputFastaFile.fasta

I can’t even come close to explaining this, but the key to this is the

/^>/

 

portion. That is where you can enter any regular expression (i.e. regex) that you want ignored. So, this sed one-liner will ignore any line that contains a “>” at the beginning of the line.

Here’s a solution with awk:

$awk '/^>/{print (NR==1)?$0:"n"$0;next}{printf "%s", $0}END{print ""}' InputFastaFile.fasta > OutputFastaFile.fasta

I haven’t fully explored this, but I’m fairly certain you can replace the

"^>"

 

with any other regex to skip lines containing that regex.

In any case, using either of those above solutions creates a new FASTA file that looks like this:

>sequence_ID_1
atcgatcgggatcaatgacttcattggagaccgaga
>sequence_ID_2
gatccatggacgtttaacgcgatgacatactaggatcagat

You can see that the sequence lines are now all on a single line, which means that my original awk one-liner will now be able to pull out records of a minimum sequence length!

Use awk to filter FASTA file by minimum sequence length

Nice little one-liner to filter a FASTA file by sequence length:

$awk '!/^>/ { next } { getline seq } length(seq) >= 200 { print $0 "\n" seq }' FastaFileInput.fa > FastaFileOutput.fa

Simply change the number “200” to any number to set your desired minimum sequence length.

And, for those who don’t know what a FASTA file format is, it is a format to delineate biological sequence (DNA or protein sequences) data. Here’s a short example of what one looks like:

>sequence_ID_1
atcgatcgggatcaatgacttcattggagaccgaga
>sequence_ID_2
gatccatggacgtttaacgcgatgacatactaggatcagat

Each individual sequence is preceded by a sequence identifier line. This identifier line is always indicated by a “>” at the beginning of this line.

Here’s a quick explanation of how it works, as I currently understand it:

!/^>/ {next}

 

– If a line (i.e. record) begins with a “>”, go to the next line (record).

{getline seq}

 

– “getline” reads the next record and assigns the entire record to a variable called “seq”

length(seq) >= 200

 

– If the length of the “seq” record is greater than, or equal to, 200 then…

{print $0 "n" seq}

 

– Print all records ($0) of the variable “seq” in the file that matched our conditions, each on a new line (“\n”)

 

Important note: this will only work on sequences that exist on a single line in the file. If the sequence wraps to multiple lines, the code above will not work. You can fix your FASTA files so that the sequences for each entry exist on single lines: http://itrylinux.com/use-sed-or-awk-to-remove-newline-breaks-from-fasta-file/

Secure Shell (SSH) SSHure iSSH SSHweet!

SSH allows you to connect to a remote computer and run task remotely. In my situation, this is great for remotely logging in to one of our lab computers that is designed for intensive computing tasks (24GB of RAM!).

Using SSH is also fairly straightforward. To get started logging in to a remote computer/server that you have access to, just type the following in Terminal (and substitute your own username and the address of your target computer):

$ssh username@remotecomputeraddress

Enter your password for the remote computer.

Alternatively, instead of dealing with passwords every time you log in to a remote computer, generate some SSH keys!  Not only can you eliminate the need to use a password and automatically log in when you type your ssh command, but by using keys you can virtually eliminate people being able to use a brute force password attack to break in to your computer/server!

First, generate your key set.  The following command will generate a private and a public key.  The public key can be placed on any server you want SSH key access to.  You can just send the public key to anyone who has the capabilities (both the know-how and authorization) to install it in the correct location on the computer/server you’d like to connect to.  The private key on your computer will then be able to match with your public key on any computer that the public key has been installed on!  No passwords needed for connection!

Generate the keys:

$ssh-keygen -t rsa


Feel free to use an empty password when you are prompted; just hit the “Enter” button and then confirm by hitting the “Enter” button again. This password is only used when physically using your computer to initiate a SSH session. For most people, having a password to initiate a SSH from their computer becomes more of a hassle than it’s worth. However, if you anticipate someone else using your computer, and you’d like to prevent them from easily using SSH to remotely login to servers that you’ve installed SSH keys on, then it would be advised to enable a password for your SSH sessions.

Looking in your

~/.ssh


folder reveals the following:

$ls ~/.ssh
id_rsa  id_rsa.pub  known_hosts


The “id_rsa.pub” file is your public key file. This is the file that can be transferred to other computers to enable password-free SSH capabilities on those computers.

Now that we have our keys, we need to transfer the public key to the server. Assuming you have administrative privileges for the server, there are two options for putting the public key on the server. If it’s the first key, we can use the following command:

$ssh-copy-id username@remotecomputeraddress


That will not only copy the public key from the computer to the server, but it will also create the proper directories if they don’t already exist on the server.

Otherwise, if you have the appropriate permissions, you can also use the following command to append your public key to an existing “authorized_keys” file on the server:

cat ~/.ssh/id_rsa.pub | ssh username@remotecomputeraddress 'cat >> .ssh/authorized_keys'


But, in order for the value of SSH keys to be fully realized, the destination computer/server should have password authentication disabled.  Doing so means that only computers with authorized SSH keys will be allowed access

I’m using SSH keys to lock down my home Synology server.  To do this, I SSH’d into the Synology as user “root”, since “root” is the only user authorized to make system changes.

$ssh root@SynologyIPaddress


By default, Synology only seems to have the text editing program “vi”. Let me tell you, it is NOT intuitive how to use it.  For example, to delete characters, you have to use the ‘x’ key!  Luckily the University of Washington has a nice tutorial on how to use “vi” for editing documents.

$vi /etc/ssh/sshd_config


Once you’re in the file, remove the “#” from in front of the two lines shown below AND change “yes” to “no” in the line “PasswordAuthentication”.  Then, be sure to save the file.
samb@Mephistopheles: ~_018

After quitting (don’t forget to save changes!), we need to restart the SSH service. I ended up doing this via the GUI since some of the common command line suggestions for restarting SSH didn’t work.

Now, when trying to SSH in, you’ll only be allowed in if you’re doing trying to do so from an authorized computer that has a public key installed on the server. On that note, it would be prudent to backup your private key so that if your computer dies, you’ll still be able to authenticate with the remote computer by installing your private key on a new client computer.

Changing the computer (host) name in Ubuntu

You’d think this would be easy.  Using the GUI, just go to “System Settings” > “Details”.  See the box with my computer’s name (Device name) in it?

Details_017

In theory, it seems like I should just be able to click on that and type a new name in.

Nope.

Instead, for some unknown reason, I have to open up Terminal and perform this from the command line.  And, it requires modification of two different files!  Why?  I guess one of the continuing quirks of using Ubuntu.

Anyway, here’re the files that need to be edited and here’re the commands to do so:

1.  Edit the /etc/hostname file

$gksudo gedit /etc/hostname

Remember to use gksudo to open gedit. I discovered this not too long ago. Once the file opens, just replace the existing Device name with the new one. In my file, there was no other text; only the current Device name.

2. Edit the /etc/hosts file

$gksudo gedit /etc/hosts

Once this file opens, find your current Device name and replace it with the new one that you entered in Step 1 above.

So, that’s it, I guess. It certainly isn’t the most intuitive way to accomplish this, but I guess it’s the only way to accomplish the task.

sudo echo solved!

In an earlier post when I was installing and configuring BLAST, I was unable to append a file from the command line using:

$sudo echo 'text I want appended to file' >> filename.sh

I would always get a permission denied error, despite being able to open (and edit) that same file by using the graphical text editor program, “gedit”.

Turns out, that the problem is related to how the “sudo” command is applied to the commands following it. The issue is that “sudo” only gets applied to “echo” and does not get applied to the “redirect” command (the ‘>>’). Thus, “echo” has sudo permissions, but the redirect does not, and cannot append the text to the file.

So, how does one get around this? Apparently, there are two relatively easy ways.

1. Use “sudo” to run a “subshell”. Example:

$sudo sh -c "echo 'text I want to append to a file' >> /path/to/file/filename.sh"

The “sudo” command gets applied to a a new subshell (that’s what the ‘sh’ means; a new shell). Thus, everything that runs in that subshell is governed by the “sudo” permissions.

2. Use “sudo” to execute “tee”. Example:

$echo 'text I want to append to a file' | sudo tee -a /path/to/file/filename.sh

The “sudo” command gets applied to the “tee” function. As such, anything that “tee” executes already has sudo permissions. In the above example, the “echo” command normally sends the text in the single quotes to the screen, but in this instance the text gets “piped” (the ‘|’ symbol) to the subsequent command. The subsequent command is running “tee” with sudo permissions and tells “tee” to append (the ‘-a’ argument) the piped text to our file. Additionally, “tee” will also print the text to the screen, as that’s it’s primary function.