
KEEP CALM and USE PYTHON for GENOMICS

Cristian Taccioli
Contents

About the Author
Introduction
    Fun with Python
    The basics
Math with Python
    Trash your old calculators
Variables in Python
    A variable is a small box
The If statement
    If this happens, then do that, else...
The For loop
    Let's iterate!
The While loop
    Let's iterate... what? again?
Lists&Dictionaries
    The big variables
Working with Strings
More on Strings
    Strings, Strings and Strings
Read and Write files
    Reading and writing is important!
Functions
    Don't copy and paste your code, please!
Regular Expressions
    The advanced way!
List Comprehensions & Lambda Functions
Set, zip, map & reduce
    The four horsemen!
The Counter
The Itertools Module
    Combinations and Permutations
The Numpy Module
    Working with "a" Matrix
The Pandas Module
    Not an animal, nor a disease
More on Pandas
    The revenge
Matplotlib Module
    Let's plot it!
BioPython Module
    With a Little Help from My Friends
Running an external program
    How do I run an external program?
Command line arguments
    Arguments... the easy way!
Files and Folders
    How to list files and create folders
Surfin' the Web
    The Selenium module
Classes and objects
    The big containers
Inheritance
    Inheritance is important
Conclusion
    This is the End (our only friend)
EXERCISE: HOW TO COUNT BASES IN GENOMIC SEQUENCES
    Count DNA made Easy
    Counting Bases using the Count Function
    Counting Bases using Dictionaries
    How to Create a Nice Format for your Counter
EXERCISE: HOW TO CREATE RANDOM SEQUENCES
    Random DNA or RNA sequences
EXERCISE: HOW TO CREATE RANDOM SENTENCES
    Random Sentences
EXERCISE: HOW TO COUNT KMERS
    K-mers counting
EXERCISE: HOW TO WORK WITH MATPLOTLIB AND PANDAS
    Matplotlib and Pandas
EXERCISE: CALCULATING THE ENTROPY
    Shannon Entropy
EXERCISE: CHARGAFF'S SECOND RULE
    Chargaff plots
EXERCISE: BROWSING THE WEB
    Neanderthal and Selenium

About the Author
Professor C. Taccioli is an Associate Professor at the University of Padova and is in charge of the MAPS Bioinformatics Laboratory at the M.A.P.S. Department. He has a Master's degree in "Molecular Biology", a PhD in "Pharmacology and Molecular Oncology" and a BSc in "Biostatistics". He has previously worked at the University of Bologna, the University of Ferrara, the University of Modena and Reggio Emilia, The Ohio State University (OSU) and University College London (UCL). His research activity focuses on genomics (including cancer genomics).

®© C.T 2018
Introduction
The first edition of this book (2018) was written for students who want to learn Python in order to analyze genomic data. This is a very short text for beginners, such as students and/or normal people. If you already know how to program in another language such as C++, this book is going to be “super easy” for you; otherwise it will be just “easy”. Don't worry about understanding this book: I'm going to teach you the basics of Python together with many examples about genomics and other fancy stuff. Each chapter includes two or more exercises, and at the end of the book you'll find many more. Have fun!
Fun with Python

The basics

HISTORY OF PYTHON
I believe that Python is the easiest programming language ever created. It is used by major companies such as Amazon and Google. Unfortunately, Python is not very fast, but life is life, and speed is not everything. Anyway, if you want to go fast, I'll teach you some tricks. If you want to know more about Python history, take a look here: https://en.wikipedia.org/wiki/History_of_Python

Python is an “interpreted language”, whereas other languages such as C and C++ are “compiled languages”. A compiled language is one where the program, once compiled, is expressed in the instructions of the target machine. Python code is interpreted, and for this reason it is usually much slower than C++, but it has the advantage of being a high-level programming language.

SO YOU WANT TO INSTALL PYTHON
There are many ways to start using Python, but I suggest you install Anaconda. It's a tool that works on Linux, Mac, and Windows. It has a lot of libraries (things to do math, graphics, etc.) already installed, editors to write your Python code, and tutorials to learn how to write good code. So, go to https://www.continuum.io/downloads, choose your operating system (Linux, Mac or Windows), choose the version you like (I suggest version 3.6 or higher), choose the 32 or 64 bit installation (if you don't know what I'm talking about, get the “32” one) and install it. After you have installed it, run it (it takes a bit to load; if you use Linux, run the command “anaconda-navigator” without quotes) and a window will show up (anaconda-navigator). Click on Spyder (it's an editor), and write this command inside the window to the left:

    print("Ciao")

Select it and click on “Run” and “Run cell and advance” in the menu bar (on the top). Do you see the word 'Ciao' in the lower left window? Yes? Good. You're done. Python is installed!!!

SHOULD I USE LINUX?
If you use Linux or Mac, you don't need to install Anaconda because Python is already installed and can be used through the terminal. In Linux, the terminal is a black screen that you find in "Development" and that shows up if you press ctrl + alt + t together on the keyboard. If you use a Mac, the terminal is located in Applications and then Utilities. In the terminal type python. That's the Python interpreter. Using Python this way means using native Python without an editor (Spyder is an editor). The good thing about the interpreter is that you don't have to select and run your code every time, you just need to press enter; the bad thing is that if you write several lines of code, sometimes it gives you an error and you don't know why. So, I suggest using Anaconda and the Spyder editor. I also suggest using Linux. If you use Linux, you can write some code in a text file, call it, for example, "mycode.py", and then run it from the terminal with the command python mycode.py. This is a good way to run Python if your code is very long. Remember that in this book I will capitalize the first letter of the word “Python”, whereas you have to write it in lowercase when you use a terminal in Linux, Mac or Windows. Take a look here: https://www.python.org/downloads/.

THE PRINT FUNCTION
The “print” command is the first you will learn. Let's see an example:

    print("Hello, guys!")
    Hello, guys!

OK. That is the first code you have written in Python. I hope it's not the last one :) Ah right... I forgot to tell you something. Run the command import this and something like this will appear:

Beautiful is better than ugly. Explicit is better than implicit. Simple is better than complex. Complex is better than complicated. Flat is better than nested. Sparse is better than dense. Readability counts. Special cases aren't special enough to break the rules. Although practicality beats purity. Errors should never pass silently. Unless explicitly silenced. In the face of ambiguity, refuse the temptation to guess. There should be one-- and preferably only one --obvious way to do it. Although that way may not be obvious at first unless you're Dutch. Now is better than never. Although never is often better than *right* now [...]

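As a small companion to the print function introduced in this chapter, here is a minimal sketch of how print() handles more than one value. It assumes Python 3; the variable names are only illustrative.

```python
# A sketch, assuming Python 3: print() accepts several values at once.
greeting = "Hello"
name = "guys"

# Build the sentence by hand with "+" and print it.
line = greeting + ", " + name + "!"
print(line)

# Pass two values: print() puts a space between them by default.
print(greeting, name)

# "sep" changes what goes between the values, "end" what goes after them.
print(greeting, name, sep=", ", end="!\n")
```

The first and the last call should print the same sentence; the second one prints the two words separated by a single space.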
Math with Python

Trash your old calculators

WHO CARES ABOUT MATH?
Python is an amazing math tool. It uses the standard mathematical operations as taught at high school or secondary school. In other words, mathematical expressions and parentheses are evaluated in the same order we learned many years ago.

PYTHON AS A CALCULATOR
Let's do some calculations now:

    print(4+3)
    7

    print(4*88458457485)
    353833829940

    print(4**2)
    16

    print(((4**2)+4)*99)
    1980

    print(4/3)
    1.3333333333333333    # Python 2.7 prints 1 here

    print(4/3.0)
    1.3333333333333333

As you see, Python can be used like a calculator, using the symbol + to add, - to subtract, * to multiply, / to divide, and ** to raise a number to a power. So, Python is a powerful calculator, but be careful. If you use Python version 2.7 or lower and you want to force Python to give you a floating point result (e.g. 4/3 = 1.3333333), you have to add a dot to at least one number in the calculation. For example, 4/3 will give you 1, whereas 4/3.0 will give you 1.333333. If you use Python version 3 or higher you don't have to use this trick, everything will be just fine.

HOW TO IMPORT MODULES TO GET MORE MATH... AND IN GENERAL HOW TO IMPORT MODULES
In order to use logarithms, sine, cosine and all the other mathematical functions we have to import a library called “math”. That's very easy: just write import math and then use the math functions. There are a lot of modules (libraries) you can import to do a lot of things. A module allows you to logically organize your Python code. Grouping related code into a module makes the code easier to understand and use. A module is a Python thing (it's a file on your computer) consisting of Python code that can be used by typing the command import and the name of the specific module. In the case below we want to import the math module. If you have an old version of Python (e.g. Python 2.7) you can import division in order to always get floating point numbers (from __future__ import division).

PYTHON AS A POWERFUL CALCULATOR
Let's do some powerful calculations now by importing the math module:

    import math

    print(math.sqrt(16))
    4.0

    print(math.log(100,10))
    2.0

    print(math.pi*2)
    6.283185307179586

    print(math.sin(math.radians(90)))
    1.0

    print(math.sin(math.degrees(90)))
    -0.954091467473    # shown rounded

Don't worry! You have imported the library math, so if you want to use the math functions you have to write math, a dot, and the name of a specific math operation. In the first example, you want to get the "square root" of 16; in the second you want the "log" of 100 using a base equal to 10; in the third you want to multiply the Greek letter pi by 2. The "sin" function expects an angle in radians, so in the fourth example math.radians(90) first converts 90 degrees to radians, and the sine of 90 degrees is 1.0. Finally, in the last example math.degrees(90) does the opposite: it treats 90 as radians and converts it to degrees before "sin" is applied, which is rarely what you want. For a complete list of math commands, take a look here: https://docs.python.org/2/library/math.html

Sometimes you want to comment your code to remember things and describe what you have done. If you want this, just add the symbol # before the comment.

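To round off the calculator examples in this chapter, the sketch below shows the three division-related operators side by side. It assumes Python 3; the sequence is a made-up example, not taken from a real genome.

```python
# A sketch, assuming Python 3 division rules.
seq = "ATGGCCTAAG"           # hypothetical 10-base sequence
length = len(seq)

exact = length / 3           # "/" always gives a floating point number
whole_codons = length // 3   # "//" keeps only the whole part
leftover = length % 3        # "%" gives the remainder

print(exact)
print(whole_codons)
print(leftover)
```

Here "//" tells us how many complete three-base codons fit in the sequence, and "%" tells us how many bases are left over.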
Variables in Python

A variable is a small box

PUT SOME STUFF INTO A VARIABLE
A variable is something that holds a value that may change. In simplest terms, a variable is just a box that you can put stuff in. You can use variables to store numbers, strings and other things.

PYTHON AND ITS VARIABLES
Let's see now some examples that show how to use variables in Python:

    a = 4
    print(a)
    4

This code creates a variable called “a”, and assigns to it the integer number 4. When we ask Python to tell us what is stored in the variable “a”, it returns that number again. We can also change what is inside a variable. For example:

    a = 10
    print(a)
    10

    a = "Ciao"
    print(a)
    Ciao

As you see, we used double quotes to store the string “Ciao” in the variable “a”, but we didn't use any quotes for the numbers. That's because Python wants you to use single or double quotes when you create string variables, whereas numbers do not need them. Moreover, both strings and numbers can be added using the symbol “+”. Let's see an example:

    a = "Hello "
    b = "World!"
    print(a+b)
    Hello World!

    a = 9
    b = 6.9
    print(a+b)
    15.9

That's pretty cool. If you are using strings, Python concatenates the variables, whereas if you are using numbers, it adds them.

VARIABLES IN DETAIL
When you create a variable you reserve a specific amount of space in the memory (RAM) of your computer. The equal symbol “=” is used to store values in variables; the operand to the left of the “=” sign is the name of the variable, whereas the operand to the right is the value.

CONCATENATE VARIABLES USING PRINT
Now we know everything about variables, so let's see some other examples:

    name = "Paul"
    age = 18
    print(name + " is " + str(age))
    Paul is 18

So, why did we use str(age) instead of age to concatenate the variables? Because age is not a string, it's a number. So, if we want to concatenate a string (e.g. "Paul") and a number (e.g. 18), we have to transform the number into a string using the command str().

HOW TO KNOW IF A VARIABLE IS A STRING OR A NUMBER
So you want to know if a variable is a string or a number:

    name = "Paul"
    age = 18
    weight = 75.66
    a = 3945034850298340598304598203

    print(type(name))
    <class 'str'>

    print(type(age))
    <class 'int'>

    print(type(weight))
    <class 'float'>

    print(type(a))
    <class 'int'>

So, the variable "name" is a string, "age" is an integer, "weight" is a float, and "a" is a very long integer. A "float" is a number with decimals. In Python 3 an integer can be as long as you like; in Python 2.7 a very long number like "a" is reported as a "long", and the output reads <type 'str'>, <type 'int'> and so on. So, if you want to know what kind of variable you have just created, use the command "type".

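The str() command used in this chapter has siblings that go in the other direction. The sketch below, with made-up values, shows int() and float() turning text into numbers so that they can be used in calculations.

```python
# A sketch of converting between strings and numbers (illustrative values).
age_text = "18"              # text, e.g. typed by a user or read from a file
age = int(age_text)          # now a real integer
next_year = age + 1

weight = float("75.66")      # text to floating point number

# Numbers must go back through str() before being joined to a string.
message = "Paul will be " + str(next_year)
print(message)
print(type(age), type(weight))
```

Trying to compute age_text + 1 directly raises a TypeError, because Python refuses to guess whether you meant to add numbers or to join strings.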
The If statement

If this happens, then do that, else...

MAKE A DECISION... PLEASE
Decisions are when a program has more than one choice of things to do. Think about food. If you cook some "spaghetti", you eat them, otherwise you trash everything. What? This is not a good example? OK, never mind. Luckily, Python has a decision statement to help us when our application needs to make such decisions. It's like writing in English: "If this, then this, else something else".

SOME EXAMPLES WITH IF
Let's see some examples:

    age = 20
    if (age < 20):
        print("I'm so young")
    elif (age == 20):
        print("I'm 20... I'm so cool")
    else:
        print("I'm older than 20")
        print("I'm gonna die!!!")
    I'm 20... I'm so cool

We have some new things to remember here. First, after an "if", "elif" ("elif" means else if) or "else" statement you have to add ":". Second, you have to indent after the "if", "elif", or "else" statement in order to tell Python which lines are part of the "if", "elif" or "else" block. You don't have to use "elif" and "else" together all the time... sometimes you use "else", sometimes you might want to use "elif". What's the meaning of indentation? See the next section. By the way, if you change age from 20 to 5 the result will be "I'm so young". If you change the age from 20 to 43, the result will be "I'm older than 20" and "I'm gonna die!!!". Cool, isn't it?

INDENTATION
Indentation is the space between the page border and the start of the text. All statements at the same distance from the left margin belong to the same block of code. If a block is deeply nested, it is simply indented further to the right, so it belongs to the statement above. Python is unusual among programming languages in relying on this concept. In Python you are forced to indent the statements within a block, which also makes it easier for a human to see where the block begins and ends without having to scan for braces. The interpreter can tell where a block begins and ends based on whitespace (usually four spaces). Indentation should be used for the if statement and the for loop (we'll see it later). If you use four spaces and then three spaces, this is an error.

DON'T DO THAT
If you use four spaces the first time you indent things, then you always have to use four spaces. Python is very picky. Let's see the example we have seen earlier:

    age = 20
    if (age < 20):
        print("I'm younger than 20")
        print("I'm so young")
    elif (age == 20):
        print("I'm 20... I'm so cool")
    else:
        print("I'm older than 20")
       print("I'm gonna die!!!")  # ERROR!!!
       # You used three spaces here
       # You should've used four spaces
       # Like in the lines above

SOME EXAMPLES WITH IF
    DNA = "ATATATATG"
    if (len(DNA) > 2):
        print("DNA length is greater than 2")
        if(len(DNA) == 9):
            print("DNA is > 2 and == 9")
    else:
        print("DNA length is 2 or less")
    DNA length is greater than 2
    DNA is > 2 and == 9

The command len() is used to get the length of a string. If you run the command print(len(DNA)) you'll see that the result is 9. Another important thing here is that we have nested an “if” statement inside another “if” statement (the first 4 lines starting from the first "if"). In English, we can translate it like this: "If the length of the DNA sequence is greater than 2, then print “DNA length is greater than 2”, and if it is also equal to 9 print “DNA is > 2 and == 9”; else print “DNA length is 2 or less”". The sequence has 9 bases, so the result is both “DNA length is greater than 2” and “DNA is > 2 and == 9”; the "else" block is skipped.

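To see the indentation rule from this chapter in action without crashing your script, the sketch below hands two small pieces of code to Python's built-in compile() function, which checks them without running them. The code strings are made up for the demonstration.

```python
# A sketch: ask Python to check code with consistent and inconsistent indentation.
good_code = (
    "age = 43\n"
    "if age > 20:\n"
    "    print('older than 20')\n"
    "    print('four spaces again')\n"
)
bad_code = (
    "age = 43\n"
    "if age > 20:\n"
    "    print('older than 20')\n"
    "   print('only three spaces')\n"
)

# compile() only checks the code; it does not run it.
compile(good_code, "<good example>", "exec")

try:
    compile(bad_code, "<bad example>", "exec")
    error_name = "no error"
except IndentationError as err:
    error_name = type(err).__name__

print(error_name)
```

The second piece of code is rejected because its last line is indented by three spaces, which matches neither the "if" line nor the block below it.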
The For loop

Let's iterate!

WHAT IS A FOR LOOP STATEMENT?
A "For loop" in Python is a particular way to repeat a group of statements a specified number of times. You can iterate over many kinds of objects (such as strings, ranges of numbers, and so on), but every statement that you want to repeat should be indented. Let's see some examples below.

SOME EXAMPLES WITH FOR
    for i in range(1,4):
        print(i)
    print("END of LOOP")
    1
    2
    3
    END of LOOP

    DNA = "GATT"
    for i in DNA:
        print(i)
    G
    A
    T
    T

    # If you're using Python 2.7 or lower
    # insert the line below:
    # from __future__ import print_function
    DNA = "GATTACA"
    for i in DNA:
        print(i, end="")
    GATTACA

    # from __future__ import print_function
    DNA = "GATTACA"
    for i in DNA:
        print(i, end=".")
    G.A.T.T.A.C.A.

    # Let's convert DNA to RNA ("T" -> "U")
    DNA = "GATTACA"
    for i in DNA:
        if(i == "T"):
            print("U",end="")
        else:
            print(i,end="")
    GAUUACA

    # Let's convert uppercase DNA
    # to lowercase DNA. All the other
    # characters will be converted to "-"
    DNA = "ATTACKA"
    for i in DNA:
        if(i == "A"):
            print("a", end="")
        elif(i == "T"):
            print("t", end="")
        elif(i == "G"):
            print("g", end="")
        elif(i == "C"):
            print("c", end="")
        else:
            print("-", end="")
    attac-a

I told you that programming in Python is like writing in English. So in the first example the first line, for i in range(1,4):, means: let's do something for each element “i” contained in the list (in this case the list has 3 objects). Instead of "i", you can write "c" or "count" or "JoeSatriani", it does not really matter. Then, everything that is below and indented will be processed 3 times. It can be anything. You can print "Hello" 3 times, or whatever. Range is a very useful command but it's weird. If you write range(1,3), it's going to iterate from 1 to 2. So if you want to count from 1 to 3, you have to write range(1,4). As you see, the “for loop” statement (like the “if” statement) forces you to use “:” and indentation, but we'll get used to it. If you don't indent, you exit from the loop. See the first example... the last print ("END of LOOP") is not indented; that means that it's outside the iteration and it'll be printed just once. The second example is cool. It shows you that you can also iterate over strings. In this case, you don't need range, that's for numbers. The third example looks difficult but it's not. Since we want to print stuff horizontally, we have to use the print function with the "end" option. In Python 3 it is already there; in Python 2.7 or lower we first have to import the "new print function" from the library "__future__". Then we use the command print("something",end=""), which means that we want to print characters separated by an empty string. If you want to separate your characters using a comma do this: print("something",end=","). The last examples show how to use the "if" statement within a "for loop". In the first case each "T" is converted to "U", whereas in the second one uppercase letters are converted to lowercase characters; any character that is not A, T, G or C is converted to "-". The use of "for" and "if" in the same code is very powerful and it's the most important thing to learn. Understanding these two statements is like understanding 50% of the Python language.

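The last two loops of this chapter print their result one character at a time. When you want to keep the result in a variable instead, one possible sketch is to collect the characters inside a function (functions are covered later in the book; the names transcribe and mask_lowercase are invented here).

```python
# A sketch: the same "for" + "if" logic, but returning a string instead of printing.
def transcribe(dna):
    """Return an RNA version of dna, with every "T" replaced by "U"."""
    rna = ""
    for base in dna:
        if base == "T":
            rna = rna + "U"
        else:
            rna = rna + base
    return rna


def mask_lowercase(dna):
    """Lowercase A, T, G and C; replace any other character with "-"."""
    result = ""
    for base in dna:
        if base in "ATGC":
            result = result + base.lower()
        else:
            result = result + "-"
    return result


print(transcribe("GATTACA"))
print(mask_lowercase("ATTACKA"))
```

For the simple T-to-U case, the built-in string method "GATTACA".replace("T", "U") gives the same answer in one line; the explicit loop is shown because it generalises to rules that replace() cannot express.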
The While loop

Let's iterate... what? again?

ITERATIONS USING 'WHILE'
There is another way to iterate things. This method is called while. So... when should I use the while loop and when the for loop? Why do people seem to prefer the for loop... less code to write? Is there any specific situation in which I should use one or the other? Is it a matter of personal preference?

ANSWER
While loops, like for loops, are used to repeat code, but unlike for loops, “while loops” don't run n times; they run for as long as a defined condition is met. Moreover, “for statements” in Python are able to iterate directly through a string, a list and so on, and need fewer lines of code. Doing the same with a “while loop” takes more work. On the other hand, “while statements” loop until a condition is False and are able to iterate even if we don't know the number of iterations we need. In all the other cases we can use either “while” or “for”. Let's see some examples!

DIFF. BETWEEN FOR AND WHILE
    # For example let's create
    # two loops... one with "for"
    # and the other one with "while"

    # for loop... from 1 to 4
    for i in range(1,5):
        print(i)
    1
    2
    3
    4

    # while loop... from 1 to 4
    i = 1
    while(i<5):
        print(i)
        i = i + 1
    1
    2
    3
    4

WHEN YOU SHOULD USE 'WHILE'
    # For example, we want the user
    # to enter the word 'Python'
    # to stop the script;
    # if not, the script continues.
    # P.S. if you get an error try to
    # use raw_input instead of input...
    # it depends on your Python version

    while True:
        n = input("Please enter the word Python: ")
        if n.strip() == "Python":
            break

As we said before, while and for loops are both used to repeat code, but while loops do not run for a specific number of times; they run until a defined condition is no longer met. If the condition is initially false, the loop body will not be executed at all. In other words, we can summarize the difference between for and while like this:

• for loop: should be used to loop through a list, a string, etc.;

• while loop: should be used to loop an indeterminate number of times until some condition is met or not.

USING WHILE TO ASK A PASSWORD
    # Let's see how to ask the users
    # to type a password for your script
    password = ""
    while password != "GATTACA":
        password = input("Please enter \
    your password: ")
        if password == "GATTACA":
            print("Correct password")
            break
        else:
            print("Incorrect password -\
     try again")

    Please enter your password: adsf
    Incorrect password - try again

    Please enter your password: GATTACA
    Correct password

The backslash "\" character in the code above was used to break a long line into multiple lines. Note that strings are compared with != (or ==), not with "is not", which checks something different. You're now able to use “for” and “while”. These are the most important things in Python. Congrats!

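The examples in this chapter wait for keyboard input. As a complementary sketch that runs without any typing, the loop below reads a DNA string three bases at a time and stops at the first stop codon, so the number of iterations is not known in advance. The function name and the test sequence are invented for this illustration; it also borrows lists, slicing and functions, which are explained in later chapters.

```python
# A sketch: a "while" loop whose number of iterations depends on the data.
STOP_CODONS = ["TAA", "TAG", "TGA"]


def codons_before_stop(dna):
    """Collect three-base codons from the start of dna until a stop codon."""
    codons = []
    position = 0
    # Keep going while a full codon is still available.
    while position + 3 <= len(dna):
        codon = dna[position:position + 3]
        if codon in STOP_CODONS:
            break                      # leave the loop early
        codons.append(codon)
        position = position + 3
    return codons


print(codons_before_stop("ATGGCCTAAGGG"))
```

The condition in the while line guards against running off the end of the sequence, while break handles the "stop as soon as something happens" case.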
Lists&Dictionaries

The big variables

DO I FEEL LIKE A LIST OR A DICTIONARY?
Your brain is still burning from the last lesson, I know. Don't worry. Lists and dictionaries are easy. They are just like variables, but can contain more stuff inside them.

LISTS
Lists are lists of values (strings, numbers, and so on). Each element in a list is indexed starting from zero; the first one is numbered zero, the second 1, the third 2, etc. You can remove and add values. They are defined within brackets "[]", and each element is separated by a comma ",". There are also variables called tuples that are unchangeable and are defined by parentheses.

DICTIONARIES
In a dictionary, you have an "index" that is a word (the "key"), and for each of them a definition that is called "value". The values in a dictionary aren't numbered. You can add, remove, and modify the values in dictionaries.

EXAMPLE WITH LISTS
    # Insert "Jason" at the end of list1
    list1 = ["Kirk","James","Lars"]
    list1.append("Jason")
    print(list1)
    ['Kirk', 'James', 'Lars', 'Jason']

    # Let's remove "Jason"
    del list1[3] # remove the 4th element
    print(list1)
    ['Kirk', 'James', 'Lars']

    # Insert "Roberto" in the second position
    list1.insert(1,"Roberto") # insert in 2nd pos.
    print(list1)
    ['Kirk', 'Roberto', 'James', 'Lars']

    # Iterate dna, make uppercase and put in DNA
    DNA = []
    dna = ["tata","auu","aug"]
    for i in dna:
        up = i.upper() # make "i" uppercase
        DNA.append(up)
    print(DNA)
    ['TATA', 'AUU', 'AUG']

So the first three examples are easy. We created a list called "list1". This list contains the names of members of the band "Metallica". In the first case we added the name "Jason" at the end of the list. Then, we removed him. Then, we inserted the string "Roberto" in the second position (the index is not 2 but 1, because the index starts from 0 at the beginning of a list). The fourth example is pretty cool. First, we created an empty list called "DNA". Then, we iterated over each element of the list "dna", and used the command upper() to make it uppercase. Then, we put each element in the new list "DNA" with the command append(). Python is "case sensitive", so uppercase and lowercase words have a different meaning in Python. In this case, for example, "DNA" is not the same as "dna".

EXAMPLE WITH DICTIONARIES
    # Let's create & iterate a dictionary
    dict1 = {"Seq3": "AAA", "Seq4":"CGC"}
    for k, v in dict1.items():
        print(k + " is: " + v)
    Seq3 is: AAA
    Seq4 is: CGC

    # Add and remove an element from a dictionary
    dict_RNA = {"Seq7": "AUA", "Seq8":"AUU"}
    dict_RNA["Seq9"] = "UUU" # add element
    print(dict_RNA)
    del dict_RNA["Seq9"] # remove element
    print(dict_RNA)
    {'Seq7': 'AUA', 'Seq8': 'AUU', 'Seq9': 'UUU'}
    {'Seq7': 'AUA', 'Seq8': 'AUU'}

As you see, dictionaries are created using curly brackets, whereas keys and values are separated by colons ":". If the values are strings you must use single or double quotes; if they are numbers don't use anything. Each pair of key and value is then separated from the next by a comma ",". In the first example we have also iterated over each key and value using the names "k" and "v" (you can call them as you like), whereas in the second example we added and removed a key ("Seq9") and its value ("UUU"). From Python 3.7 onwards the elements are printed in the order in which they were inserted; in older versions the order may be different.

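Two details mentioned in this chapter, looking up dictionary values and the "unchangeable" nature of tuples, can be tried out with the short sketch below. The three-entry codon table is a toy example invented for the illustration, not a complete genetic code.

```python
# A sketch: dictionary lookups and tuple immutability (toy data).
codon_table = {"AUG": "Met", "UUU": "Phe", "UAA": "Stop"}

has_aug = "AUG" in codon_table              # is this key present?
phe = codon_table["UUU"]                    # direct lookup (fails if the key is missing)
unknown = codon_table.get("GGG", "unknown") # lookup with a fallback value

# A tuple looks like a list with parentheses, but it cannot be modified.
position = (3, 5)
try:
    position[0] = 10
    tuple_changed = True
except TypeError:
    tuple_changed = False

print(has_aug, phe, unknown, tuple_changed)
```

Using get() with a fallback avoids the KeyError that codon_table["GGG"] would raise for a key that is not in the dictionary.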
Working with Strings

Strings are important!!!

STRING INDEXES
Now that you know most of the Python programming language, you need to rest a little bit. For this reason, I'll show you how to play with strings, using a lot of examples and little theory. The only thing to remember is that every character of a "word" or "string" is associated with an index, starting from the first character with the number 0. There is also another system of indexing that starts from -1, which is the last letter of a string.

INDEX EXAMPLE
These are the indexes for the word "Python":

    +---+---+---+---+---+---+
    | P | y | t | h | o | n |
    +---+---+---+---+---+---+
      0   1   2   3   4   5
     -6  -5  -4  -3  -2  -1

You can use the positive indexes or the negative ones, depending on the problem you want to solve. Now, let's see some real stuff.

SELECT CHARACTERS
    word = "Python"

    # character in position 0
    print(word[0])
    P

    # character in position 5
    print(word[5])
    n

    # last character
    print(word[-1])
    n

    # second-last character
    print(word[-2])
    o

    # sixth-last character
    print(word[-6])
    P

SLICING STRINGS
    word = "Python"

    # chars from pos. 0 (included) to 2 (excluded)
    print(word[0:2])
    Py

    # chars from pos. 2 (included) to 5 (excluded)
    print(word[2:5])
    tho

    # chars from 2 (included) to -1 (excluded)
    print(word[2:-1])
    tho

    # chars from -5 (included) to -1 (excluded)
    print(word[-5:-1])
    ytho

    # chars from -6 (included) to -1 (excluded)
    print(word[-6:-1])
    Pytho

SLICING TO THE END OF A STRING
    word = "Python"

    # chars from position 0 (included) to the end
    print(word[0:])
    Python

    # chars from position 4 (included) to the end
    print(word[4:])
    on

    # from the second-last (included) to the end
    print(word[-2:])
    on

    # from the sixth-last (included) to the end
    print(word[-6:])
    Python

    # from the beginning to position 2 (excluded)
    print(word[:2])
    Py

    # from the beginning to position 6 (excluded)
    print(word[:6])
    Python

    # concatenate two slices 'Py'+'thon'
    print(word[:2] + word[2:])
    Python

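Genomic coordinates are often given as 1-based positions with both ends included, whereas the Python slices shown in this chapter are 0-based with the end excluded. The sketch below shows one way to translate between the two; the helper name subsequence is hypothetical, and the 1-based convention is an assumption about your input data.

```python
# A sketch: convert 1-based inclusive coordinates into a Python slice.
def subsequence(seq, start, end):
    """Return bases start..end of seq, counting from 1 and including both ends."""
    return seq[start - 1:end]


dna = "GATTACA"
print(subsequence(dna, 2, 4))   # bases 2, 3 and 4
print(subsequence(dna, 1, 7))   # the whole sequence

# Slices that run past the end do not raise an error; they are simply cut short.
tail = "Python"[4:100]
print(tail)
```

Only the start needs to be shifted by one: because the slice end is excluded, the 1-based inclusive end and the 0-based exclusive end are the same number.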
More on Strings

Strings, Strings and Strings

WHY STRINGS AGAIN?
Strings are very important in genomics because the primary structure of DNA or RNA is a string. Previously we've seen that we can make a string uppercase or lowercase using the commands upper() and lower() respectively. However, there's more to strings. Let's see some examples.

INDEX EXAMPLE
These are the indexes of the string Python, in case you don't remember them.

    +---+---+---+---+---+---+
    | P | y | t | h | o | n |
    +---+---+---+---+---+---+
      0   1   2   3   4   5
     -6  -5  -4  -3  -2  -1

SKIPPING SLICING
    word = "Python"

    # Through the end, moving 1 place each time
    print(word[::1])
    Python

    # Through the end, moving 2 places each time
    print(word[::2])
    Pto

    # Through the end, moving 3 places each time
    print(word[::3])
    Ph

    # Through the end, moving 4 places each time
    print(word[::4])
    Po

    # Through the end, moving 5 places each time
    print(word[::5])
    Pn

    # Through the end, moving 6 places each time
    print(word[::6])
    P

REVERSE SKIPPING SLICING
    word = "Python"

    # Reverse string
    print(word[::-1])
    nohtyP

    # From the end, moving 2 places each time
    print(word[::-2])
    nhy

    # From the end, moving 3 places each time
    print(word[::-3])
    nt

    # From the end, moving 4 places each time
    print(word[::-4])
    ny

    # From the end, moving 5 places each time
    print(word[::-5])
    nP

    # From the end, moving 6 places each time
    print(word[::-6])
    n

FIND - LEN - SPLIT - JOIN
    word = "Python"

    # Find "h" within the string "Python"
    print(word.find("h"))
    3 # "h" is at index position 3

    # Count "o" occurrences in the "Python" string
    print(word.count("o")) # "o" found once
    1

    # Calculate "Python" length
    print(len(word))
    6

    # Split "Python" where "o" is found
    print(word.split("o"))
    ['Pyth', 'n']

    # Join together a list of strings
    word2 = ["P","y","t","h","o","n"]
    print("".join(word2))
    Python

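The reverse slice [::-1] from this chapter is half of a very common genomics operation, the reverse complement of a DNA strand. The sketch below combines it with the standard string methods str.maketrans() and translate(); the function name is invented here, and the code assumes an uppercase sequence containing only A, C, G and T.

```python
# A sketch: reverse complement of a DNA string (uppercase A, C, G, T assumed).
def reverse_complement(dna):
    """Swap A<->T and C<->G, then reverse the string with a [::-1] slice."""
    complement_table = str.maketrans("ACGT", "TGCA")
    return dna.translate(complement_table)[::-1]


print(reverse_complement("GATTACA"))
```

Applying the function twice gives back the original sequence, which is a handy sanity check for this kind of code.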
Read and Write files

Reading and writing is important!

To write or read a file in Python, you just need to follow three steps: "open", "read/write", and "close".

CREATE A FILE AND WRITE IT
    # Let's open the file text.txt
    file = open("/home/chris/Desktop/text.txt","w")
    file.write("Hello!")
    file.close()

The first command is "open". It creates a file called "text.txt" in a directory called "/home/chris/Desktop". Of course, you will have a different directory because your name is not "chris" (is it?). Moreover, if you use Windows, the path to your file should be written this way: "C:\Users\All Users\desktop\text.txt". Linux and Mac operating systems use the slash symbol in paths, whereas Windows uses the backslash. For example, in Windows you write "C:\Users\All Users\desktop\gene.txt", whereas in Linux or Mac you should write /home/jimi/gene.txt. Inside Python code, put an r before the quotes of a Windows path (r"C:\Users\...") so that the backslashes are taken literally. Did you notice the "w" letter within the open() command? That means that we have created the file "text.txt" because we want to write to it ("w" means write). The opened file is now stored in the variable "file". The second command is "file.write()". It writes the sentence "Hello!" inside the file text.txt stored in "file". The last command is "file.close()". It closes the file "text.txt". If you don't close the file, what you have written may not actually be saved.

READ A FILE AS A VARIABLE
    # Let's read a file and include it
    # in a variable
    file = open("/home/chris/Desktop/text.txt","r")
    text = file.read()
    file.close()
    print(text)

You just opened a file called "text.txt" and stored it in a variable called "file". Then, you read "file" and put its content in another variable called "text". Then, you closed it. That's it.

    # Read a fasta file and put the sequence
    # in a variable without the fasta header
    fasta = "".join(open("myfile.fasta").\
    readlines()[1:]).replace("\n",'')

DOWNLOAD AND READ A GENOME
    import wget
    import gzip
    file = wget.download("http://hgdownload.soe.ucsc.edu/goldenPath/eboVir3/bigZips/KM034562v1.fa.gz")

    with gzip.open(file,"rt") as f:
        file_content = f.read()

    splitted_list = file_content.split("\n")
    ebola_list = splitted_list[1:]
    ebola_genome = "".join(ebola_list)
    print(ebola_genome)

First, we have imported the library "wget". This library gives you the possibility to download whatever you want from the Internet. If you get an error importing this library, it means it's not installed. So, if you use Anaconda, click on "Environments", then click on the window "Installed" and select the option "Not Installed". Then, check the library "wget". Now, click on "Apply". Great. The library "wget" is installed, and you can import it when running your code. Then, import the library gzip, which provides a fast way to decompress files. gzip is part of standard Python, so there is nothing to install. If some other module is missing, you can usually install it by running a command like this in your terminal: "pip install wget". That's the common way you install modules in Python. If you don't have pip installed on your computer, find it on Google and install it. Now, let's download the Ebola genome from the UCSC web site through "wget" and decompress it via "gzip". In this case we use "with" to open a file. If you use "with" you don't need to close the file, because Python closes it for you. The option "rt" means "read as text": the Ebola genome file is a compressed binary file (.gz), and gzip decompresses it and gives us ordinary text. Now, we have our genome sequence inside a variable called "file_content". Unfortunately, this variable is full of "end-line" symbols ("\n"). Since we don't want them, we split the text into a list of lines, dropped the first line (the header), and joined all the remaining lines in a single sequence that represents the genome.
R EAD A FILE AS A LIST
In order to put the rows of a text file in a list you should
use the command "open", "readlines" and "replace".
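To make the "open, read, close" steps concrete, here is a minimal sketch that reads a fasta file with "with" and separates the header from the sequence. The file name "example.fasta", its content and the variable names are assumptions made for this illustration; the file is written to a temporary folder so the example can run anywhere.

```python
import os
import tempfile

# Write a small fasta file to a temporary folder (hypothetical content)
folder = tempfile.mkdtemp()
path = os.path.join(folder, "example.fasta")
with open(path, "w") as handle:
    handle.write(">seq1 toy sequence\nATGC\nGGTA\n")

# Read it back: "with" closes the file automatically
with open(path) as handle:
    lines = handle.read().splitlines()

header = lines[0]              # the first line is the fasta header
sequence = "".join(lines[1:])  # the remaining lines are the sequence

print(header)
print(sequence)
```

Using splitlines() removes the end-line symbols while splitting, so no extra replace("\n","") is needed in this version.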

Functions

Don’t copy and paste your code, please!

FUNCTIONS ARE BLOCKS
Functions are a way to organize parts of your code into blocks, allowing you to use them many times without copying and pasting your scripts every time. Moreover, functions make the code more readable, allowing programmers to create good software and share it with friends and colleagues.

COMMANDS OR FUNCTIONS?
Functions are blocks of code that can be reused. Many times in this book you have read the word "command" for things such as print(), len(), join(), etc. In fact, what we’ve called "commands" are "functions". If you want to create your own function you just need to write def and the function’s name followed by (), placing any variables (also called arguments) within the brackets, and then type the colon ":". With functions too, you need to indent your code. Are you happy? Yeah, I think so.

SOME EASY EXAMPLES
# Function that adds a number (a) to another (b)
def sum_numbers(a, b):
    return a + b
sum_numbers(10,5)
15

# from __future__ import print_function
# uncomment the line above if you
# are using Python 2.6 or 2.7
def convertoRNA(DNA):
    for i in DNA:
        if(i=="T"):
            print("U",end="")
        else:
            print(i,end="")
    print()
convertoRNA("GATTACA")
GAUUACA

Be careful! If you write the DNA sequence without quotes ("") Python thinks that it is a variable, not a string, so use something like "GATTACA", not GATTACA. Let’s see now two examples with numbers:

def cube(a):
    print("the cube is " + str(a**3))
cube(3)
the cube is 27

def cube(a):
    print("the cube is ")
    return a**3
cube(3)
the cube is
27

What’s the difference between these two functions? Read more...

THE RETURN STATEMENT AND THE ALIENS
The previous functions give the same results (if we exclude the fact that the second one prints the result on two lines... but we don’t care about it). In the first case we use the print() function only, whereas in the second case we use the return statement. Well... to tell you the truth... the second case is better than the first. The return statement gives a result back to the code that called the function, so you can store it in a variable or use it in other calculations, while print only produces text on the screen. If you use print and not return, it’s like saying "Hi, John" and he says "Hi", but his eyes are empty like he has been kidnapped by an alien. It happened to me once, so I use return ;(.

THE TWIN PARADOX
Let’s suppose you have a twin. If you build a spacecraft and you fly it at close to the speed of light, when you come back to the earth your twin will be older than you. That’s what Einstein said. Now, let’s create a function that calculates the reduced time you experience while traveling with your super spacecraft:

from math import sqrt

def einstein(speed, timearth):
    # speed is a percentage of the speed of light
    fraction = float(speed)/100
    # Einstein-Lorentz formula for time dilation
    factor = sqrt(1-fraction**2)
    timespace = timearth * factor
    return timespace
einstein(80,10)
5.999999999999998

So... if you travel for 10 years (earth time) at 80% of the speed of light, when you come back to this planet, your twin is 10 years older, whereas you are only 6 years older. Pretty cool!
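To see why returning a value is more useful than printing it, here is a small sketch of the DNA to RNA conversion from this page, written so that it gives its result back. The function name to_rna and the example sequences are assumptions made for this illustration, not part of the original code.

```python
def to_rna(dna):
    """Return the RNA version of a DNA string (T replaced by U)."""
    return dna.replace("T", "U")

# Because the function returns a value, we can store it and reuse it
rna = to_rna("GATTACA")
length = len(rna)
both = [to_rna(s) for s in ["ATG", "TTT"]]

print(rna, length, both)
```

A function that only prints could not be used inside len() or inside a list like this, because there would be no result to pass along.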

Regular Expressions

The advanced way!

When writing programs for genomics we often search for patterns (substrings) in strings. For example, if we want to search for a "TATA box" in a promoter or the start codon "AUG" in an mRNA sequence we can use regular expressions. Regular expressions are a formal language that uses symbols to define a search pattern. This language was created in the 1950s by the mathematician Stephen Cole Kleene.

SEARCH VS MATCH
# Let's import all the functions (*)
# from the regular expression module "re"
from re import *
dna = "AAACCCCCCCCCATG"
# if ATG is found at the beginning of
# the dna string, "Good!" is printed
# out, otherwise you'll see the word "Bad!"
if match(r"ATG", dna):
    print("Good!")
else:
    print("Bad!")
Bad!

dna = "ATCGCGAATTCAC"
# if GAATTC is found, "Restriction
# site found" string is printed out
if search(r"GAATTC", dna):
    print("Restriction site found!")
Restriction site found!

dna = "ATCGGGTCCTTCAC"
# if GG followed by A or T followed by CC
# are found, "Restriction site found"
# string is printed out. The symbol "|"
# means OR.
if search(r"GG(A|T)CC", dna):
    print("Restriction site found!")
Restriction site found!

As you see, each of these examples is explained using the comments that start with the symbol "#". Now we will introduce the function group(): group(1) returns the part of the string matched by the section of the pattern in the first set of parentheses, whereas group(2) returns the part matched by the second, etc. Let’s see some other examples:

SEARCH IN DETAIL
from re import *
dna = "ATGACGTACGTACGACTG"
# store the match object in the variable m
m=search(r"GA([ATGC]{3})AC([ATGC]{2})AC",dna)
print("entire match: " + m.group())
print("first match: " + m.group(1))
print("second match: " + m.group(2))
entire match: GACGTACGTAC
first match: CGT
second match: GT

As you can see, the function "match" only looks for the pattern at the beginning of a string, whereas the function "search" looks for the first match anywhere in the string. If we want to extract parts of a match, we have to write those parts of the regular expression within parentheses and print them with the function "group()". The regular expression GA([ATGC]{3})AC([ATGC]{2})AC means that we want something that starts with GA, followed by three letters among A, T, G or C (that is [ATGC]{3}), then AC, then two more letters among A, T, G or C (that is [ATGC]{2}), and finally AC again. Since we have included [ATGC]{3} and [ATGC]{2} within parentheses, we can select them separately using "group(1)" and "group(2)". If we use "group()" we select the entire match. Remember, you have to write m.group(), m.group(1) or m.group(2) because the variable "m" stores the match object returned by search. What if we want all the substrings described by our regular expression in just one shot? And what if we also want to know where a match starts and ends? In the first case we use the "findall" function; in the second, we use the function "finditer". Let’s see some examples:

FINDALL VS FINDITER
from re import *

dna = "AAAATATAAACCCTATATT"
runs = findall(r"TATA[AT]+", dna)
# The symbol "+" means "one or more"
# The symbol "*" means "zero or more"
print(runs)
['TATAAA', 'TATATT']

dna = 'AAAATATAAACCCTATATT'
for i in finditer(r"TATA[AT]",dna):
    print(i.group() + " " + str(i.start()) + "," + str(i.end()))
TATAA 4,9
TATAT 13,18
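The ideas on this page (search, groups, and the start and end of each match) can be combined in a few lines. The sketch below is an illustration only: the toy sequences and the variable names are assumptions. It also uses "import re" with the "re." prefix, instead of "from re import *", simply to make clear where each function comes from.

```python
import re

dna = "ATCGAATTCGGGAATTCAC"  # toy sequence, hypothetical

# Compile the pattern once, then scan the whole string
ecori = re.compile(r"GAATTC")
sites = [(m.group(), m.start(), m.end()) for m in ecori.finditer(dna)]
print(sites)

# Groups capture parts of a match: here the base between GG and CC
m = re.search(r"GG([AT])CC", "ATCGGGTCCTTCAC")
middle = m.group(1) if m else None
print(middle)
```

Checking "if m" before calling group() is a useful habit, because search returns None when nothing is found.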

List Comprehensions & Lambda Functions

Python list comprehensions let you manipulate lists, whereas with lambda functions we can pass functions to other functions to do stuff. They are powerful because with a line of code you can do many things.

LIST COMPREHENSIONS
List comprehensions provide a nice way to create new lists. They consist of square brackets containing an expression followed by one or more for loops. The expression can be anything, meaning you can put in calculations with numbers or search patterns for substrings, and the result will be a new list.

A PROMISE IS A PROMISE
Using list comprehensions we will be able to run faster programs in Python. I promised you this at the beginning of this book. List comprehensions are usually a little bit faster than the equivalent for loop. The reason is, most likely, that list comprehensions do not have to look up and call the append function of the list on every iteration.

LIST COMPREHENSIONS
# Let's see some examples. The first one
# calculates the square root of the
# numbers contained in a list:

from math import *
list1 = [1,2,3,4,5]
# This is the List Comprehension
[sqrt(i) for i in list1]
[1.0, 1.4142135623730951, 1.7320508075688772, 2.0, 2.23606797749979]

# Using List Comprehensions, we can replace
# a loop like this one with just one line

from math import *
list1 = [1,2,3]
for i in list1:       # You can avoid this
    print(sqrt(i))    # using List Comprehension
1.0
1.4142135623730951
1.7320508075688772

LAMBDA FUNCTIONS
Python supports a clear syntax that lets programmers define small functions in just one line. This way of writing functions was borrowed from a programming language called Lisp. Lambda functions can be used anywhere a function is expected.

LAMBDA FUNCTIONS EXAMPLES
Let’s see some examples:

# Calculate the cube of a number
cube = lambda x: x**3
print(cube(2))
8

# Calculate the cube of a list
list1 = [1,2,3]
print(list(map(lambda x: x**3,list1)))
[1, 8, 27]

# Filter for even numbers
list1 = [2,17,3,9,4]
print(list(filter(lambda x: x%2==0,list1)))
[2, 4]

# Find the maximum
from functools import reduce
list1 = [1,2,44,9]
mx = lambda a,b: a if (a > b) else b
print(reduce(mx,list1))
44

As you see, in the first example we created a function called "cube" using a lambda function that calculates the cube of each number we pass to it. In the second example we apply a lambda function to all the elements of a list using another function called "map". To print the result of a "map" function, we have to convert it into a list. Using "map" with a lambda function gives the same result as using a list comprehension. Same thing for the third example, where we filtered for even numbers using the "%" modulo operation (x%2==0 means that there is no remainder when dividing by 2, so it is an even number). In the fourth example, instead, we used the function "reduce", which applies the lambda function to the first two numbers, then to that result and the third number, then to that result and the fourth, etc. At the end we have the maximum number in the list as the result. At the end of the day, using lambda functions you will be able to create new functions in just one line. For example, try to create a lambda function for the "Twin Paradox" we have seen in the chapter about "Functions". Good Luck!
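Since "map" and "filter" with lambda functions can be replaced by a list comprehension, it may help to see both versions side by side. This is only a sketch; the list of numbers and the variable names are assumptions made for the comparison.

```python
numbers = [2, 17, 3, 9, 4]

# map + filter with lambda functions: cube only the even numbers
cubes_of_even_map = list(
    map(lambda x: x**3, filter(lambda x: x % 2 == 0, numbers))
)

# The same result with a single list comprehension
cubes_of_even_lc = [x**3 for x in numbers if x % 2 == 0]

print(cubes_of_even_map)
print(cubes_of_even_lc)
```

Which one to use is mostly a matter of taste, although many people find the list comprehension easier to read when a condition is involved.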

Set, zip, map & reduce

The four horsemen!

What if I ask you to remove duplicates from a list, or intersect two groups, or apply a function to a list? I know, you can use a for loop or stuff like that, but there are some other easy ways to do it. These ways are the functions: set, zip, map and reduce.

USING SET AND ZIP
# set removes duplicates
a = [1,2,3,2,4]
print(set(a))
{1, 2, 3, 4}

# set can be used for intersection
a = [1,2,3,4,5]
b = [2,3,4]
print(set(a) & set(b))
{2, 3, 4}

# set can be used for union
a = [1,2,3,4,5]
b = [2,3,4]
print(set(a) | set(b))
{1, 2, 3, 4, 5}

# set can be used for difference
a = [1,2,3,4,5]
b = [2,3,4,9,10]
print(set(a) - set(b))
print(set(b) - set(a))
{1, 5}
{9, 10}

# zip pairs two lists
# and in this example is used to get
# the max between two numbers
a = [1,2,3,4,5]
b = [2,3,4,9,10]
c = list(zip(a,b))
for i in c:
    print(max(i))
2
3
4
9
10

USING MAP AND REDUCE
# map applies a function to
# all items in a list
import math
items = [1, 2, 3, 4, 5]
list(map(math.log, items))
[0.0, 0.6931471805599453, 1.0986122886681098, 1.3862943611198906, 1.6094379124341003]

# map can be used also with lambda functions
items = [1, 2, 3, 4, 5]
list(map((lambda x: x **2), items))
[1, 4, 9, 16, 25]

# same as before:
items = [1, 2, 3, 4, 5]
def square(x): return x ** 2
list(map(square,items))
[1, 4, 9, 16, 25]

# filter creates a list of elements
# for which a function returns true,
# here it's used to get numbers < 0
numb = list(range(-3, 5))
minor = list(filter(lambda x: x < 0, numb))
print(minor)
[-3, -2, -1]

# The function reduce(func, seq)
# continually applies the function
# func() to the sequence seq.
# in python2 you don't need to import
# functools... just use reduce
import functools
functools.reduce(lambda x,y: x+y,[7,11,42,13])
73

WHEN TO USE THEM!
Python is a high-level programming language, which means that many functions are already "built-in" inside the language. So if you need to remove duplicates, find elements shared by groups, apply functions to lists and other things like that, you don’t have to write the code yourself: just use the functions that I showed you on this page. In this way the code will be much shorter and clearer. Imagine using a for loop in order to remove the duplicates from a list. Using "set" instead, you don’t need to slow down your application and make the code really "Perly". Sorry Larry!
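Here is how set and zip might be used on sequences rather than numbers. It is a sketch under stated assumptions: the helper function kmers, the two toy sequences and the choice of k=3 are made up for this illustration.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k found in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

seq1 = "GATTACA"
seq2 = "ATTACCA"

shared = kmers(seq1, 3) & kmers(seq2, 3)      # intersection
only_first = kmers(seq1, 3) - kmers(seq2, 3)  # difference

# zip pairs the bases position by position; count the mismatches
mismatches = sum(1 for a, b in zip(seq1, seq2) if a != b)

print(sorted(shared), sorted(only_first), mismatches)
```

The sets are sorted before printing because a set has no fixed order, so the printed order could otherwise change from run to run.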

The Counter

Counter is a container imported from the module "collections". It lets you count letters or words without writing a for loop. It’s faster and easier, so I suggest using it.

THE COUNTER!
What if we want to count the frequency of words in a text or the distribution of letters in a sentence? And what if we want to identify the most common letters in a text? Don’t worry, with Counter you can do it with a few lines of code.

COUNT ON ME!
# Let's count the letters in a string
from collections import Counter

print(Counter('GATTACA'))
Counter({'A': 3, 'T': 2, 'G': 1, 'C': 1})

# A better way to visualize it
# "c" in this case is a special dictionary
# and "i" represents its keys
c = Counter('GATTACA')
for i in 'ATGC':
    print(i, "=", c[i])
A = 3
T = 2
G = 1
C = 1

# Let's count the letters in a word
c = Counter('pyth')
print(c)
Counter({'p': 1, 'y': 1, 't': 1, 'h': 1})

# Most common letters in a text
c = Counter() # create c as Counter
f = open('/home/chris/Desktop/rings.txt', 'r')
file = f.read()
f.close()
for i in file:
    c.update(i.rstrip().lower())
print('Most common letters:')
# Let's use a built-in function called
# most_common
for letter, count in c.most_common(3):
    print(letter,"=", count)
Most common letters:
e = 235331
i = 201032
a = 199554

EASY EXAMPLES
# Most common words in a file
import re
from collections import Counter

with open('/Users/me/Share/myfile.txt') as f:
    file = f.read()
words = re.findall(r'\b[a-zA-Z]+\b', file)
wordcounts = Counter(words)
ignore=['The', 'the','And','and', 'of', 'to']
for word in list(wordcounts):
    if word in ignore:
        del wordcounts[word]
print(wordcounts.most_common(3))
[('that',12577),('in',12331), ('shall',9760)]

I know... you understood absolutely nothing about the code above. The first examples were pretty easy and self-explanatory but this one... Well, first of all, we imported the module "re", which is the package that programmers use when dealing with regular expressions and strings (see the paragraph on regular expressions), and then we imported Counter from the collections module. Then we opened a text file (it can be a book or whatever contains letters) and we put the words in a list called "words". How do we do that? With the regular expression "\b[a-zA-Z]+\b". What does it mean? It means that we want to select all the words. In fact, \b marks the start and the end of a word, "a-zA-Z" means that a word can contain upper or lower case letters, and the symbol "+" means that a word can be of whatever length. This stuff is really cool, guys. Try with Shakespeare or Dante and see the most common words they used in their poems. Ah right, I forgot to tell you. The list "ignore" contains the words that we want to exclude from the counting. We removed these words from our counter using the "del" statement.

WHAT IS COUNTER?
Counter is a dictionary subclass for counting elements. The elements are stored as dictionary keys and their counts are stored as dictionary values. In this way, it’s easier to calculate frequencies with Counter than with "while" or "for" loops.

Counter was primarily designed to work with positive integers that represent counts, for example the frequency of a word or of a letter in a text. However, negative numbers can be used. See the documentation: https://docs.python.org/3/library/collections.html
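Counter is not limited to single letters. The sketch below counts overlapping pairs of bases and then shows how a count can become negative, as mentioned on this page. The toy sequence and the variable names are assumptions made for the illustration.

```python
from collections import Counter

dna = "GATTACAGATTACA"  # toy sequence, hypothetical

# Count overlapping dinucleotides (pairs of neighbouring bases)
pairs = Counter(dna[i:i + 2] for i in range(len(dna) - 1))
top = pairs.most_common(1)

# subtract() can leave zero or negative counts
c = Counter("AAT")
c.subtract("ATTT")

print(top)
print(dict(c))
```

Here "T" ends up with a count of -2 because it appears once in "AAT" and three times in "ATTT".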

The Itertools Module

Combinations and Permutations
The itertools module contains a lot of functions. We’ll see just the combinatoric generators in order to calculate permutations and combinations with or without repetitions. Below, we define n as the total number of elements and k as the number of elements picked at a time. So, k is called the class and n is called the number of elements.

COMBINATIONS
A combination is a selection of elements in which the order of the elements is not important. For example, the number of combinations of 3 elements of class 2 is 3. Why? Because the combinations without repetition of n=3 and k=2 (for ex. a, b and c) are: {a,b} {a,c} {b,c}

COMBINATIONS WITH REPETITION
A combination with repetition is a selection of elements in which the order of the elements is not important and the elements can be repeated. For example, the number of combinations with repetition of 3 elements of class 2 is 6. Why? Because the combinations with repetition of n=3 and k=2 (for ex. a, b and c) are: {a,a} {a,b} {a,c} {b,b} {b,c} {c,c}

PERMUTATIONS
A permutation is an ordered combination. In other words, order is important. The permutations without repetition of n=3 and k=2 (for ex. a, b and c) are: {a,b} {a,c} {b,a} {b,c} {c,a} {c,b}

PERMUTATIONS WITH REPETITION
A permutation with repetition is an ordered combination in which the elements can be repeated. The permutations with repetition of n=3 and k=2 (for ex. a, b and c) are: {a,a} {a,b} {a,c} {b,a} {b,b} {b,c} {c,a} {c,b} {c,c}

ANAGRAMS
Anagrams are permutations without repetition where the number of elements is equal to the class (n = k). In the last example we will find all the anagrams of the word "yep".

COMBINATIONS AND PERMUTATIONS
# Let's see some examples now

# Combinations without repetition
# (order doesn't matter)
import itertools

print(list(itertools.combinations("ATGC", 2)))
[('A', 'T'),('A', 'G'),('A', 'C'),('T', 'G'),
('T', 'C'), ('G', 'C')]

# Permutations without repetition
# (order matters)
print(list(itertools.permutations("ATGC", 2)))
[('A', 'T'),('A', 'G'),('A', 'C'),('T', 'A'),
('T', 'G'),('T', 'C'),('G', 'A'),('G', 'T'),
('G', 'C'),('C', 'A'),('C', 'T'),('C', 'G')]

# Combinations with repetition
# (order doesn't matter)
from itertools import combinations_with_replacement

print(list(combinations_with_replacement("ATGC", 2)))
[('A', 'A'),('A', 'T'),('A', 'G'),('A','C'),
('T', 'T'),('T', 'G'),('T', 'C'), ('G', 'G'),
('G', 'C'),('C', 'C')]

# Permutations with repetition
# (order matters)
from itertools import product

print(list(product("ATGC", repeat=2)))
[('A', 'A'),('A', 'T'),('A', 'G'),('A','C'),
('T', 'A'),('T', 'T'),('T', 'G'),('T', 'C'),
('G', 'A'),('G', 'T'),('G', 'G'),('G', 'C'),
('C', 'A'),('C', 'T'),('C', 'G'),('C', 'C')]

# Let's find all the anagrams for the
# word "yep"
yep = list(itertools.permutations("yep", 3))
for i in yep:
    print("".join(i))
yep
ype
eyp
epy
pye
pey
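The number of results in each case can also be calculated with the standard counting formulas. The sketch below compares what itertools generates with those formulas for n=4 bases and k=2. The dictionary names are assumptions made for this illustration, and math.comb needs Python 3.8 or later.

```python
from itertools import (combinations, permutations, product,
                       combinations_with_replacement)
from math import comb, factorial

n, k = 4, 2      # 4 bases, picked 2 at a time
bases = "ATGC"

# How many results itertools actually generates
counts = {
    "combinations": len(list(combinations(bases, k))),
    "permutations": len(list(permutations(bases, k))),
    "combinations_with_repetition":
        len(list(combinations_with_replacement(bases, k))),
    "permutations_with_repetition": len(list(product(bases, repeat=k))),
}

# What the counting formulas predict
expected = {
    "combinations": comb(n, k),
    "permutations": factorial(n) // factorial(n - k),
    "combinations_with_repetition": comb(n + k - 1, k),
    "permutations_with_repetition": n ** k,
}

print(counts == expected)
print(counts)
```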

The Numpy Module

Working with "a" Matrix

"You take the blue pill, the story ends. You wake up in your bed and believe whatever you want to believe. You take the red pill, you stay in Wonderland, and I show you how deep the rabbit hole goes." [Morpheus to Neo, The Matrix (1999)]

Sorry guys, this is not about the movie "The Matrix", it’s about working with matrices of numbers. In mathematics, a matrix is a rectangular group of numbers, arranged in rows and columns. In the case of just one line of numbers, the matrix is called a vector. The easiest way to work with them is using the module Numpy. Importing Numpy is super-easy. Just write: from numpy import *. In this way you will import all the functions within the Numpy module. The examples shown below are commented and self-explanatory. Enjoy the Numpy module!

CREATING VECTORS & MATRICES
# Let's see how to work with matrices and
# vectors with Numpy:

# Import numpy
from numpy import *

# Creates a vector
a = array([1,2,9,5])
print(a)
[1 2 9 5]

# Creates another vector called a2
# and adds vector a to vector a2
a2 = array([1,4,5,5])
print(a+a2)
[ 2 6 14 10]

# Creates a matrix 4x3 filled with 0
b = zeros((4,3))
print(b)
[[ 0. 0. 0.]
[ 0. 0. 0.]
[ 0. 0. 0.]
[ 0. 0. 0.]]

# Creates a matrix 4x3 filled with 1
c = ones((4,3))
print(c)
[[ 1. 1. 1.]
[ 1. 1. 1.]
[ 1. 1. 1.]
[ 1. 1. 1.]]

OPERATION WITH MATRICES
# Multiply the matrix c by 2
d = c*2
print(d)
[[ 2. 2. 2.]
[ 2. 2. 2.]
[ 2. 2. 2.]
[ 2. 2. 2.]]

# Adds matrix c to d
e = c+d
print(e)
[[ 3. 3. 3.]
[ 3. 3. 3.]
[ 3. 3. 3.]
[ 3. 3. 3.]]

# Creates a random matrix 4x3
# filled with integers from 0 to 9
r=random.randint(10, size=(4,3))
print(r)
# of course the random matrix shown
# on your screen will be different
# from the one below because it's random
[[5 7 3]
[6 1 7]
[7 6 4]
[5 3 6]]

ROWS AND COLUMNS
# Prints the first row of r matrix
print(r[[0],:])
[[5 7 3]]

# Prints the second row of r matrix
print(r[[1],:])
[[6 1 7]]

# Prints the first column of r matrix
print(r[:,[0]])
[[5]
[6]
[7]
[5]]

# Prints the second column of r matrix
print(r[:,[1]])
[[7]
[1]
[6]
[3]]

The Pandas Module

Not an animal, nor a disease

In the real world, pandas are cute animals, and PANDAS is also a pediatric autoimmune neuropsychiatric disorder associated with streptococcal infections... in the Python world, it is a module. It’s very similar to numpy but some prefer pandas because they feel it’s easier to use, since it is able to create dataframes.

PANDAS&DATAFRAMES
Pandas is a Python module similar to numpy. It’s used for data manipulation of dataframes and time series. A dataframe is a matrix that has column names and row names. A matrix, instead, has only numbers.

SERIES AND DATAFRAMES
# Let's create a series letting pandas create
# a default integer index:

from pandas import *
Series([1,3,5,6,8])
0    1
1    3
2    5
3    6
4    8
dtype: int64

As you see, pandas created a column using the numbers we passed in. Note that pandas created a default index that numbers the rows (to the left of our numbers).

# Let's create a dataframe

from pandas import *
from numpy import *
df = DataFrame(random.randint(10,size=(6,4)),
               index=["a", "b", "c", "d", "e","f"],
               columns=list("ABCD"))
print(df)
   A  B  C  D
a  7  7  3  6
b  9  5  8  5
c  3  6  9  5
d  0  1  1  2
e  9  4  2  1
f  9  5  0  7

We have filled the matrix with random numbers using the numpy module. The function random.randint allowed us to create a matrix with random integers.

WORKING WITH DATAFRAMES
# Select column A and B
df.loc[:,["A","B"]]
   A  B
a  7  7
b  9  5
c  3  6
d  0  1
e  9  4
f  9  5

# Select rows a and b
df.loc[["a","b"],:]
   A  B  C  D
a  7  7  3  6
b  9  5  8  5

# Select rows a and c, and columns A and B
df.loc[["a","c"],["A","B"]]
   A  B
a  7  7
c  3  6

# Select rows from a to c, and
# columns from A to C by headers
df.loc["a":"c","A":"C"]
   A  B  C
a  7  7  3
b  9  5  8
c  3  6  9

# Select rows from a to c, and
# columns from A to C by indexes
# Be careful, indexes start from 0
df.iloc[0:3,0:3]
   A  B  C
a  7  7  3
b  9  5  8
c  3  6  9

# Calculate the mean for all the columns
df.mean()
A    6.166667
B    4.666667
C    3.833333
D    4.333333

# Calculate the mean for all the rows
df.mean(1)
a    5.75
b    6.75
c    5.75
d    1.00
e    4.00
f    5.25

More on Pandas

The revenge

Guys... I noticed you are very good with pandas, so I decided to show you some more examples. These are about subsetting a dataframe and doing some nice operations with it.

WORKING WITH PANDAS ... AGAIN
# Let's take the dataframe df we saw
# on the previous page, then select the
# rows from a to c, and the columns from
# A to C by headers. Now let's calculate
# the mean of the selected rows
df.loc["a":"c","A":"C"].mean(1)
a    5.666667
b    7.333333
c    6.000000

# Select rows from a to c, and
# columns from A to C by headers
# and calculate the mean of
# the selected columns
df.loc["a":"c","A":"C"].mean(0)
A    6.333333
B    6.000000
C    6.666667

# Copy dataframe df to df2
df2=df.copy()
print(df2)
   A  B  C  D
a  7  7  3  6
b  9  5  8  5
c  3  6  9  5
d  0  1  1  2
e  9  4  2  1
f  9  5  0  7

# Change values to "NaN" if
# they are less than 3
# (note: this stores the text "NaN"; use
# numpy's nan for a real missing value)
df2[df2 < 3] = "NaN"
print(df2)
     A    B    C    D
a    7    7    3    6
b    9    5    8    5
c    3    6    9    5
d  NaN  NaN  NaN  NaN
e    9    4  NaN  NaN
f    9    5  NaN    7

Pretty clear, right?

... AND MORE ABOUT DATAFRAMES
# Calculate square root of
# a subset of df
df.loc["a":"c","A":"C"].apply(sqrt)
          A         B         C
a  2.645751  2.645751  1.732051
b  3.000000  2.236068  2.828427
c  1.732051  2.449490  3.000000

# Change the number in the 4th row
# and 4th column (index 3,3) to 100
df2.iloc[3,3]=100
print(df2)
     A    B    C    D
a    7    7    3    6
b    9    5    8    5
c    3    6    9    5
d  NaN  NaN  NaN  100
e    9    4  NaN  NaN
f    9    5  NaN    7

HEAD - TAIL - DESCRIBE
# Show the first two rows
df.head(2)
   A  B  C  D
a  7  7  3  6
b  9  5  8  5

# Show the last two rows
df.tail(2)
   A  B  C  D
e  9  4  2  1
f  9  5  0  7

# Give a summary of the columns of the
# "df" dataframe, such as:
# counts, mean (average), std (standard
# deviation), min (minimum), max
# (maximum) and the percentiles

df.describe()
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   6.166667  4.666667  3.833333  4.333333
std    3.816630  2.065591  3.763863  2.338090
min    0.000000  1.000000  0.000000  1.000000
25%    4.000000  4.250000  1.250000  2.750000
50%    8.000000  5.000000  2.500000  5.000000
75%    9.000000  5.750000  6.750000  5.750000
max    9.000000  7.000000  9.000000  7.000000

Matplotlib Module

Let’s plot it!

What? Of course... Python is able to create graphs. We just need a module to do that. This module is called "Matplotlib" and it’s very easy to use. We just need to write: import matplotlib.pyplot as plt in order to import it in our code and use it with a "nickname" such as plt (you can use the name you prefer to call it).

WHAT IS MATPLOTLIB?
Matplotlib is a module able to create charts within Python together with the NumPy/Pandas math libraries.

PYTHON VS R
If you don’t need very difficult statistical tests, Numpy/Pandas and Matplotlib are easier to use than R. R is a programming language used for statistics, but it is a bit difficult and its syntax is not very clear.

SCATTERPLOTS AND BARPLOTS
import matplotlib.pyplot as plt
import numpy as np
from random import randint

# It creates a scatter plot
plt.plot([1,2,3,4], [1,4,9,16],'go')
plt.axis([0, 5, 0, 20])
plt.show()

# It creates a barplot with int
x=[randint(0,9) for i in range(0,1000000)]
plt.hist(x,bins=10)
plt.axis([0, 9, 0,100000])
plt.show()

The result is a scatter plot and a bar plot (a histogram). In the case of the bar plot, the bars have about the same size because we have used the randint function, which creates uniformly distributed random integers. Let’s create now a fancier bar plot!

TWO NICER BAR PLOTS
import matplotlib.pyplot as plt
import numpy as np

# Setting the seed of the random numbers for
# reproducibility. It can be any number
np.random.seed(123)

# Let's create two lists x and y with random
# standardized numbers created using the
# function random.randn from the module numpy
x = np.random.randn(10000)
y = np.random.randn(9000)

# the histogram of the data
plt.hist(x, 50, facecolor='g',alpha=0.50)
plt.hist(y, 50, facecolor='r',alpha=0.50)

# Caption for x and y axis
plt.xlabel('Random gaussian numbers')
plt.ylabel('Frequency')

# Title of the graph and text
plt.title('Hist. of gaussian rand. numbers')
plt.text(-2, 600, r"n = 10000, α = 0.50")
plt.text(-2.8, 400, r"n = 9000, α = 0.50")
plt.axis([-3, 3, 0, 1000])
plt.grid(True) # let's include the grid
plt.show()

# It creates a boxplot
plt.boxplot(x, notch=False, sym='')
plt.show()

Calm down... it’s easy. We first created two lists x and y with standardized numbers using the function randn. They are two samples from the "standard normal" distribution. The seed is useful if you want to obtain the same random numbers the next time you run the randn function. Then we created the first and the second bar plot with a title and text labels. Using the parameter alpha you can set the opacity of the bars. Take a look at the code of the boxplot... that’s super easy, isn’t it?

BioPython Module That’s supereasy, isn’t it? We have imported Se-
qIO from Bio, then we have open the fasta file and
then we have used the "for loop" in order to get
With a Little Help from My Friends the id and the sequence for every record contained
in the FASTA file. Of course, using "sequence.id"
Hi guys, do you know the song "With a Little Help from and "sequence.seq" we can do whatever you want,
My Friends"? It was written by "The Beatles". This band such as calculate the length of each sequence with
was very famous in the 60’ and played, among the others, the function len "len(sequence.seq)", or convert
amazing songs like "Toxicity", "Infected Voice", "The Wolf every sequence in lower case using the function lower
I feed" and many others... Anyway, sometimes it happens "str(sequence.seq.lower())". In the last case, we have
that we are tired, angry or sad. However, if we are lucky, used the str function in order to convert everything
there are friends who listen to us... and if we are very, in a string because sequence.seq is not a string is a
very lucky, they also help us. When programming in Bio.Seq.Seq object.
Python for Genomics we are not alone... take a look here:
http://biopython.org/DIST/docs/tutorial/Tutorial.html

BIOPYTHON
BioPython has many functions already written for researchers who are dealing with genomics. It can be installed by
running the command pip install biopython or by using the Anaconda Navigator interface -> Environments.

WHAT IS A FASTA FILE?
A FASTA file is a text file filled with DNA/RNA (or protein) sequences formatted like this:

> Sequence 1
ATATATATATTATATG
> Sequence 2
ATATATGGGGGGATAT

OPEN A FASTA FILE WITH SEQIO
# From BioPython let's import the function
# SeqIO. SeqIO is written with the letters
# I and O (not the number 10)
from Bio import SeqIO

# Let's open the fasta file and print
# Remember to include the correct
# path where your fasta file is saved
for sequence in SeqIO.parse("/Users/cris/Desktop/sequence.txt", "fasta"):
    print(sequence.id)
    print(sequence.seq)
Sequence 1
ATATATATATTATATG
Sequence 2
ATATATGGGGGGATAT

ALIGNMENT WITH PAIRWISE2
Suppose that we want to do a global pairwise alignment between two hemoglobin sequences stored in the files
hba.fasta and hbb.fasta, downloaded respectively from:
https://www.ncbi.nlm.nih.gov/protein/4504347?report=fasta
and
https://www.ncbi.nlm.nih.gov/protein/4504349?report=fasta
Each sequence should be saved in its own file (one in hba.fasta and the other in hbb.fasta). These are the
hemoglobin alpha and hemoglobin beta protein sequences from the human genome, and the code used for the alignment is
shown here below:

# Let's import pairwise2 and SeqIO from Bio
from Bio import pairwise2
from Bio import SeqIO
from Bio.pairwise2 import format_alignment
# Let's open the fasta files and
# remember to change the file path
seq1 = SeqIO.read("/Users/cris/hba.fasta", "fasta")
seq2 = SeqIO.read("/Users/cris/hbb.fasta", "fasta")
# Let's print the first alignment -> aligns[0]
aligns = pairwise2.align.globalxx(seq1.seq, seq2.seq)
print(pairwise2.format_alignment(*aligns[0]))
MV-LSPADKTNV--SPADNVKKK # yours might be different
|||||||||||||||||||||||
MVHL-----T----SPADNVKKK

As you see, we have called the alignment function "align.globalxx". The last two letters of the function name
(here: xx) set the scores and penalties for matches and gaps. The first letter sets the match score: "x" means that
a match counts 1 while mismatches cost nothing, whereas with "m" the values for matches and mismatches are defined by
the user. The second letter sets the cost of gaps: "x" means no gap costs at all, whereas with "s" the user assigns
different penalties for opening and for extending a gap. So, "globalxx" means that only the matches between the two
sequences are counted. One can also use "localxx" for a local alignment.
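
To make the "xx" scoring concrete: when a match counts 1 and mismatches and gaps cost nothing, the best global score is simply the length of the longest common subsequence of the two sequences. The sketch below is not BioPython's implementation, just a plain dynamic-programming illustration of that scoring scheme; the function name globalxx_score and the two short peptides are made up for the example.

```python
def globalxx_score(seq1, seq2):
    """Best global alignment score when a match counts 1 and
    mismatches and gaps cost nothing (the "xx" scheme)."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    # score[i][j] = best score for aligning seq1[:i] with seq2[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            match = 1 if seq1[i - 1] == seq2[j - 1] else 0
            score[i][j] = max(
                score[i - 1][j - 1] + match,  # align the two letters
                score[i - 1][j],              # gap in seq2 (free)
                score[i][j - 1],              # gap in seq1 (free)
            )
    return score[-1][-1]

# Two made-up peptides, only for illustration
print(globalxx_score("MVLSPADK", "MVHLTPEEK"))
```

With user-defined penalties (the "m" and "s" letters) the three options inside max would subtract a mismatch or gap cost instead of adding zero.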

Running an external program

How do I run an external program?

An interesting feature of Python is the ability to run external programs. Running a tool from within your Python code
is very useful in genomics because it allows you to use software that already exists, so you don’t have to write it
yourself.

THE OUTSIDER
Being able to run external programs from within Python code is very cool. It allows you to run existing tools as if
you were using a terminal.

RUNNING PROGRAMS THROUGH A TERMINAL
If you use Linux, you probably run a program like this:
/home/cris/software/mytool

If you use Windows XP or greater, you run it like this:
c://windows//Program//myprogram//myprogram.exe

If you use a Mac, you run it like this:
/Application/myprogram

RUNNING STUFF WITH PYTHON 2.7
# Let's run the command date in Linux
# using Python 2.7
import subprocess
subprocess.call("/bin/date")
Thu Feb 16 17:05:56 CET 2017

# Let's run the command date in Linux
# and store it in a variable
# using Python 2.7
date_command = subprocess.check_output("date", shell=True).rstrip("\n")
print(date_command)
'Thu Feb 16 17:07:37 CET 2017'

# Let's run the command ls (list of files)
# with the argument -l (description of files)
# using Python 2.7
subprocess.check_call(["ls", "-l"])
# The result depends on your OS
autodiskmount fstyp

RUNNING STUFF WITH PYTHON 3
# Let's run the command ls -l
# in the root directory (/)
# using Python 3.5 or later (subprocess.run does not exist before 3.5)
import subprocess
subprocess.run(["ls", "-l", "/"], stdout=subprocess.PIPE)
CompletedProcess(args=['ls', '-l', '/'],\
returncode=0, stdout=b'total 53\
drwxrwxr-x+ 98 root admin 3332\
Feb 27 13:06 Applications
# ... so all the stuff above means that
# I have the folder Applications in the root directory

# Let's run the command ls -lht
# in the root directory (/)
# using Python 3.5 or later and print it
# in a different way
import subprocess
a = subprocess.run(["ls", "-lht", "/"], stdout=subprocess.PIPE)
b = str(a)
print(b.replace("\\n", "\n"))
# the command above replaces every literal \n with a real end of line
CompletedProcess(args=['ls', '-lht', '/'],
returncode=0, stdout=b'total 53
drwxrwxrwt@ 4 root 136B Mar 2 17:50 Volumes
drwxrwxr-x+ 98 root 3.3K Feb 27 13:06 Applic
dr-xr-xr-x 2 root 1B Feb 27 12:40 home
dr-xr-xr-x 2 root 1B Feb 27 12:40 net
drwxr-xr-x+ 67 root 2.2K Dec 13 17:15 Library
lrwxr-xr-x 1 root 42B Dec 2 13:36 Developer
drwxr-xr-x@ 39 root 1.3K Sep 5 18:02 bin
dr-xr-xr-x 3 root 4.2K Feb 27 12:39 dev
dr-xr-xr-x 2 root 1B Feb 27 12:40 home
drwxr-xr-x@ 4 root 136B Jan 19 02:05 System
drwxr-xr-x@ 59 root 2.0K Jan 19 02:03 sbin
drwxr-xr-x+ 67 root 2.2K Dec 13 17:15 Library
lrwxr-xr-x 1 root 42B Dec 2 13:36 Developer
drwxr-xr-x@ 39 root 1.3K Sep 5 18:02 bin
drwxr-xr-x@ 13 root 442B Nov 4 2015 usr
drwxr-xr-x 3 root 102B Oct 14 2015 opt
drwxr-xr-x@ 2 root 68B Oct 13 2015 Network
drwxr-xr-x 6 root 204B Oct 13 2015 Users
drwxrwxr-t@ 2 root 68B Oct 13 2015 cores
')

These super amazing examples just showed you how to run commands like ls -l or ls -lht using Python. Of course you
can do it using the terminal in Linux, Windows or Mac, but you can also do it from within Python code. In the first
example the result was a mess, but in the second one we replaced the \n symbol with a real end of line to get a
really nice output... try to use other Linux commands by yourself. Enjoy it!
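
If you only need the text that a program prints, a tidier route (available from Python 3.7) is to let subprocess.run capture and decode the output for you. The sketch below is only an illustration: it runs the Python interpreter itself (sys.executable) as the "external program" because that exists on every operating system; in real life you would put your own tool and its arguments in the list.

```python
import subprocess
import sys

# Run an external program and capture what it prints.
# sys.executable is used here only because it exists on every system;
# replace the list with the tool you want to run and its arguments.
result = subprocess.run(
    [sys.executable, "-c", "print('hello from outside')"],
    capture_output=True,  # collect stdout and stderr
    text=True,            # decode bytes into normal strings
    check=True,           # raise an error if the program fails
)

print(result.returncode)
print(result.stdout.strip())
```

Because text=True already gives a normal string, there is no need to convert the CompletedProcess object with str and replace the \n symbols by hand.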

Command line arguments

Arguments... the easy way!

As we said in the "Introduction", a Python program can be saved in a text file with the extension ".py" and then run
through the terminal. For example, we can save a file "hello.py" containing the script print('Hello!') and then we
can run it with the command python hello.py through our terminal. On the screen we will see the line Hello!. Now...
if you want to pass some parameters to the script you can write some strings after the name of the script, such as:
python hello.py name. Don’t worry, let’s see some examples...

LINE ARGUMENTS
If you are able to use Linux, you have probably used a command like ls, which shows all the files you have in a
folder. If you run ls -l it prints the files in a folder plus some information about these files. The parameter -l
is what everybody calls a command line argument.

COMM. LINE ARG. EXAMPLE I
# Let's write this script
import sys
print(sys.argv)

# Let's save it and run it
# through the terminal like this:
python myprogram.py A B C
['myprogram.py', 'A', 'B', 'C']

# Remember, never run this while you're inside
# a Python interpreter or an IDE. Instead,
# you have to save your script in a text file,
# give it a name such as program.py
# and then run it with your terminal (google
# "what is a terminal"). The first element of
# the list sys.argv is the name of the
# script and then the other parameters
# follow, in this case: 'A', 'B', 'C'. Command
# line arguments are useful because they
# allow the user to specify some numbers
# or strings that our script can use
# before it is launched. Let's see some other
# examples.

COMM. LINE ARG. EXAMPLE II
# Let's write and save our script
# in a file called name.py
import sys
name = sys.argv[1]
print("Hello " + name + "!")

# Let's run our script through
# a terminal
python name.py Jimi
Hello Jimi!

# Instead of arguments we
# can use the input function
# in case you are using Python 3 or greater
# (or the raw_input function if you are using
# Python 2.7)
# Suppose we give the parameters 3 and 4
a = input("Insert a number for the short side of a rectangle: ")
b = input("Insert a number for the long side of a rectangle: ")
area = int(a)*int(b)
print("The area of the rectangle is: " + str(area))
The area of the rectangle is: 12

# Write the code below in a text file
# This program takes the two parameters
# from the command line and prints the
# rectangle area
import sys
short_side = sys.argv[1]
long_side = sys.argv[2]
area = int(short_side)*int(long_side)
print("The rectangle area is: " + str(area))

# Save the script written above in a file
# called area.py and run it through a
# terminal like here below:
python area.py 3 4
The rectangle area is: 12

As you can see, if you use the raw_input or input function the script asks the user to type a value that it can use
to get the result. This is a very cool function, but typing the values every time takes time. Instead, you can use
the command line arguments. Did you notice that we used the function int when we calculated the area? It’s because
the parameters are always considered strings, and for this reason we have to transform them into numbers. Then we had
to transform the variable "area" back into a string when we wanted to print it. Moreover, we had to use the sys
module, which provides information about constants, functions and methods of the Python interpreter.
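
When a script takes more than one or two parameters, the standard library module argparse can do the string-to-number conversion and the error messages for you. The following is a minimal sketch of the rectangle script rewritten that way; the function name parse_sides is made up for the example, and the list ["3", "4"] stands in for what the user would type after python area.py.

```python
import argparse

def parse_sides(argv=None):
    """Read the two sides of the rectangle from the command line."""
    parser = argparse.ArgumentParser(description="Print the area of a rectangle.")
    # type=int converts the strings to numbers (and complains if it can't)
    parser.add_argument("short_side", type=int, help="length of the short side")
    parser.add_argument("long_side", type=int, help="length of the long side")
    # argv=None means "use the real command line (sys.argv)"
    return parser.parse_args(argv)

# Same as running: python area.py 3 4
args = parse_sides(["3", "4"])
area = args.short_side * args.long_side
print("The rectangle area is: " + str(area))
```

In a real script you would call parse_sides() without arguments; running the script with -h then prints a help message built from the help strings.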

Files and Folders

How to list files and create folders

Python gives you the opportunity to work directly with files and directories. This is made possible through the os
module and its os.path submodule, which have many functions that allow the user to create, remove, list and modify
files and folders. Suppose we have some files on our Desktop called seq1.txt, seq2.txt and seq3.txt, each containing
a DNA sequence. We want to open and read them. What can we do? Let’s see...

WORKING WITH FILES AND DIRECTORIES
Python comes with libraries that allow your programs to interact with files and directories in a way that is
independent from your operating system.

WORKING WITH FILES AND FOLDERS
# Let's open three files we have
# on our Desktop and read the
# sequences that are in them.
# These files are called seq1.txt, seq2.txt
# and seq3.txt. Of course you have to write
# the name of the directory where you saved
# your files. I have a Mac and my Desktop is
# in /Users/cris/Desktop, in your case
# it'll probably be in a different path.

import os

directory = "/Users/cris/Desktop/"

for filename in os.listdir(directory):
    if filename.endswith(".txt"):
        print("Reading from " + filename)
        path = directory + filename
        with open(path) as myfile:
            myfile_list = myfile.readlines()
        # remove the end-of-line character from each line
        mylist_stripped = [i.replace("\n", "") for i in myfile_list]
        print(mylist_stripped)
Reading from seq1.txt
['TATAT', 'ATTAG']
Reading from seq2.txt
['TATAT', 'AGAGA']
Reading from seq3.txt
['TATAT', 'AGAGA']

THE OS AND THE OS.PATH MODULE
The os.path module contains many functions that deal with file names without having to worry about forward and
backward slashes, and so on. It is part of the os module, which is much bigger. The os module provides a number of
functions that work directly with the operating system, and it is very useful for those who want to program in an
advanced way.

A MORE COMPLICATED EXAMPLE
# Suppose we want to move
# our files called seq1.txt seq2.txt
# and seq3.txt that are located on the Desktop
# to three folders called seq1 seq2 and seq3
import os, shutil
# path of the Desktop
directory = "/Users/cristian/Desktop/"
# for each file on the Desktop
for filename in os.listdir(directory):
    # check if the filename ends with txt
    if filename.endswith(".txt"):
        # get the name of the file without
        # the .txt extension in order to have
        # the name of the new directory
        dirname = filename.replace(".txt", "")
        # create the folder on the Desktop
        os.mkdir(directory + dirname)
        # move the txt file into the new
        # directory
        shutil.move(directory + filename, directory + dirname)

If you run the code above and replace that path with yours, you will see that three folders have been created and the
three files (seq1.txt, seq2.txt and seq3.txt) have been moved to the new folders (seq1, seq2 and seq3). How is that
possible? First of all, we have imported the os module in order to use the mkdir function, which creates the
directories. Then, we have imported the shutil module in order to use the move function to “move” the files to the
new directories. It is possible to do other beautiful things with folders and files using Python. For example, the
os.path.exists function checks if a directory or a file exists. Moreover, by importing the glob module and using the
glob function it is possible to get a list of paths matching a pattern, for example:

import glob
print(glob.glob("c:/windows/*.fasta"))
DNA1.fasta
DNA1.fasta
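
The same "one folder per file" idea can also be written with the pathlib module from the standard library, which treats paths as objects instead of strings to glue together. This is only a sketch of an alternative, not the method used above: it works inside a temporary folder so that nothing on your Desktop is touched, and the three tiny files and their sequences are made up for the example.

```python
import tempfile
from pathlib import Path

# Work in a throwaway folder so nothing on your Desktop is touched
with tempfile.TemporaryDirectory() as tmp:
    directory = Path(tmp)

    # Create three small example files (made-up sequences)
    for name, seq in [("seq1", "TATAT"), ("seq2", "AGAGA"), ("seq3", "ATTAG")]:
        (directory / (name + ".txt")).write_text(seq + "\n")

    moved = []
    # sorted() builds the full list first, so moving files is safe
    for path in sorted(directory.glob("*.txt")):
        new_dir = directory / path.stem   # seq1.txt -> seq1
        new_dir.mkdir(exist_ok=True)      # no error if it already exists
        target = new_dir / path.name
        path.rename(target)               # move the file into its folder
        moved.append(target.relative_to(directory).as_posix())

    print(moved)
```

Here path.stem plays the role of filename.replace(".txt", ""), and the / operator joins path pieces with the right slash for your operating system.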

Surfin’ the Web

The Selenium module

Selenium is a Python module that provides a simple way to browse the internet using the most famous browsers, like
Firefox, IE, Chrome, Remote etc. The currently supported Python versions are 2.7, 3.6 and above. You can download
selenium from the PyPI page. However, you can also use pip to install the Selenium package like this:

pip install selenium

Unfortunately selenium requires a driver to interface with your browser. Firefox, for example, requires the
"geckodriver" in /usr/bin or /usr/local/bin. Chrome requires, instead, "ChromeDriver", which should be installed in
your chrome-driver folder or in /usr/bin or /usr/local/bin. The setting that I’ve just described works fine for Linux
and Mac. If you are using Windows, instead, search on Google how to set up the drivers for Firefox or Chrome.

SELENIUM
Would you like to automate all the boring and repetitive activities you do using the internet? Like "check the
weather" or "check the new articles published in pubmed about your favorite disease"? Yes? Ok, let’s see how to use
selenium to scrape the web!!!

FIRST EXAMPLE WITH SELENIUM
import os
from selenium import webdriver

# Let's select the driver path... I've put it
# in this folder because I like it
# but it should work also if you put it in
# /usr/bin if you have a Mac or Linux
chromedriver = "/usr/local/Cellar/chromedriver/2.30/bin/chromedriver"

# Let's load the driver
os.environ["webdriver.chrome.driver"] = chromedriver

# Let's create the webdriver.Chrome object
browser = webdriver.Chrome(chromedriver)

# Suppose you have a Yahoo account
# Let's go to the Yahoo.com web page
browser.get("https://mail.yahoo.com")

... CONTINUE
# Let's go to the email text box
emailElem = browser.find_element_by_id("login-username")

# Let's insert our email
emailElem.send_keys("youremail@yahoo.com")

# Let's go to the password text box
passwordElem = browser.find_element_by_id("login-signin")

# Let's insert our password
passwordElem.send_keys("12345")

# Let's click on the submit
passwordElem.submit()

# Of course it won't work because you have
# to insert your real email and password

This is very cool but there is a problem... it’s difficult to use. In fact, before using it you need to know how
HTML works, because most web pages are written in HTML. So if you use Chrome and you want to browse the Yahoo web
page, you have to click with the right button of your mouse and then click on "Web Page Source". Then you have to
search for the word "login" or something like that. In the case of the Yahoo web page the id of the text box is
"login-signin". Of course it is not always like that. For example... if you want to search for something through
selenium on the Google web page the situation is different. Let’s see an example:

SELENIUM AND GOOGLE
# Let's search for "1000 genomes"
# using Google and print the results
from selenium.webdriver.common.by import By

driver = webdriver.Chrome(chromedriver)
driver.get("http://www.google.com")
input_element = driver.find_element_by_name("q")
input_element.send_keys("1000 genomes")
input_element.submit()

# this is the HTML path found
# in the Google page close to the results
RESULTS_LOCATOR = "//div/h3/a"

# Let's put the results in a variable
page1_results = driver.find_elements(By.XPATH, RESULTS_LOCATOR)

# Let's print the results in the first page
for item in page1_results:
    print(item.text)

Classes and objects

The big containers

Objects in Python are big containers of variables and functions. When referring to objects, the variables are called
attributes whereas the functions are called methods of an object. Objects are created using special templates called
classes. The beautiful part is that we can create as many objects as we want from a single class.

MEANING OF CLASSES AND OBJECTS
Classes are essentially templates to create objects.
Objects are containers of variables and functions.

THE RECTANGLE CLASS
# Let's create the class Rectangle
class Rectangle:
    # Use three double quotes to describe classes
    """ This class is about Rectangles! """
    def __init__(self, side1, side2):
        self.side1 = side1
        self.side2 = side2

    def Area(self):
        return self.side1*self.side2

# Now... let's create the first object
rect1 = Rectangle(3,5)
print(rect1.Area())
15

# Now... let's create the second object
rect2 = Rectangle(2,4)
print(rect2.Area())
8

# Now... let's create the third object
rect3 = Rectangle(3,3)
print(rect3.Area())
9

First of all: 1) the first line of the code is the word class and the name of the class, in this case Rectangle; 2)
the second line is not mandatory, it’s the description of the class. So... if we want to know something about the
class or its objects we just type help(name_of_the_class); 3) the special function or method called __init__ (two
underscores before and after init) is automatically called when an object of the class is created. It’s also called
the constructor and it’s used, in this case, because we want to give an object some parameters when we create it (ex:
side1 and side2). In fact, when we created the object named rect1 we wrote rect1=Rectangle(3,5). What’s the meaning
of the word self? It’s the first parameter of the method __init__ and it’s mandatory. It refers to the object itself
and allows us to use the object when we create it. Moreover, every time we create an object passing some parameters,
these must be stored as self.parameter, for example self.side1 = side1 or self.side2 = side2. The rest is easy. In
order to create an object we have to write the name of the object (for example rect1) equal to the name of the class
(for example Rectangle) followed by the parameters, if they exist. If you read some books about Python and classes,
you’ll probably notice that the authors call the objects instances of a class and its functions methods. As I said
previously, they also call the variables inside a class attributes of a class! Now, let’s see an example of classes
in genomics.

THE DNA CLASS
# Let's create the class 'DNA'
class DNA:
    """ This class includes a method that calculates DNA length and much more """
    def __init__(self, seq=""):
        self.seq = seq

    def lendna(self):
        countA = self.seq.count("A")
        countT = self.seq.count("T")
        countG = self.seq.count("G")
        countC = self.seq.count("C")
        return countA+countT+countG+countC

    def convert(self, mode):
        self.mode = mode
        if mode == "upper":
            return self.seq.upper()
        elif mode == "lower":
            return self.seq.lower()
        else:
            print("Please use the mode 'lower' or 'upper'!")

# Let's create an object
dna1 = DNA("ATTGC")
dna1.lendna()
5
dna1.convert("lower")
attgc

First of all, the name of the class this time is DNA. Second, we have created a constructor that accepts one
parameter (seq), which is a DNA string. Third, we have created a method or function called lendna that adds up the
counts of the letters "A", "T", "G" and "C" of the DNA sequence seq. Fourth, we have created another method called
convert that converts the DNA sequence to upper or lower case depending on the parameter we use.
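
To see that every object really keeps its own attributes, here is a small sketch of a variation on the DNA class. The method gc_content and the special method __len__ are not part of the class above; they are added here only for illustration. Defining __len__ lets the built-in function len work directly on our objects.

```python
class DNA:
    """A DNA sequence with a couple of convenience methods."""

    def __init__(self, seq=""):
        # attribute: every object stores its own sequence
        self.seq = seq.upper()

    def __len__(self):
        # lets the built-in len() work on our objects
        return len(self.seq)

    def gc_content(self):
        """Fraction of bases that are G or C (0.0 for an empty sequence)."""
        if not self.seq:
            return 0.0
        gc = self.seq.count("G") + self.seq.count("C")
        return gc / len(self.seq)

# Two objects created from the same class, each with its own data
dna1 = DNA("ATTGC")
dna2 = DNA("ggcc")
print(len(dna1), dna1.gc_content())
print(len(dna2), dna2.gc_content())
```

Calling dna1.gc_content() only looks at the sequence stored in dna1; dna2 is not affected in any way.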

Inheritance

Inheritance is important

Classes can inherit functionality from other classes. In fact, the most common feature associated with
object-oriented programming is inheritance, that is, the ability to define a new class as a modified version of an
existing class. The main advantage of inheritance is that you can add new methods to a class without having to change
the original one. It is called "inheritance" because the new class "inherits" all the methods of the original class.
By extension, the original class is often called the "superclass" and the derived class the "daughter" class or
"subclass".

INHERIT ME!
Python supports inheritance from multiple classes, unlike some other programming languages such as Java.

THE DNA CLASS
# Let's create the class 'DNA'
class DNA:
    ''' This class includes a method that calculates DNA length '''
    def __init__(self, seq):
        self.seq = seq

    def length(self):
        countA = self.seq.count("A")
        countT = self.seq.count("T")
        countG = self.seq.count("G")
        countC = self.seq.count("C")
        return countA+countT+countG+countC

class RNA(DNA):
    ''' This class inherits the method that calculates the length '''
    def __init__(self, seq):
        self.seq = seq

# Let's see if it works:
a = DNA('ATGC')
a.length()
4

b = RNA('ATGCA')
b.length()
5

As you see, the class RNA has inherited the method length from the class DNA. Is it possible to create some functions
that exist exclusively inside and for the RNA class? Yes, of course. Let’s see some examples:

THE DNA SUPERCLASS
# Let's create the class 'DNA'
class DNA:
    ''' This class includes a method that calculates DNA length '''
    def __init__(self, seq=""):
        self.seq = seq

    def length(self):
        countA = self.seq.count("A")
        countT = self.seq.count("T")
        countG = self.seq.count("G")
        countC = self.seq.count("C")
        return countA+countT+countG+countC

class RNA(DNA):
    # Here below you are encouraged to always write
    # a brief description of your class
    """ This class includes methods that are able
    to convert DNA sequences to RNA """
    def __init__(self, seq):
        self.seq = seq

    def convertL(self):
        new = self.seq.lower()
        return new

    def convert2RNA(self):
        if self.seq.isupper():
            return self.seq.replace("T","U")
        else:
            return self.seq.replace("t","u")

# Let's see if it works:
b = RNA('ATGCA')
b.length()
5

b.convertL()
atgca

b.convert2RNA()
AUGCA

As you see, we have created the class RNA, which has inherited the method length from the superclass DNA. Moreover, we
have created two methods (or functions) only for the class RNA. In summary, inheritance is when a class uses
attributes or methods created within another class. If we think of inheritance in terms of genomics, we can think of
a bacterium inheriting certain genes from another bacterium. That is, a bacterium can inherit antibiotic resistance.
In this way we can save code and make our programs shorter.
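
A subclass does not have to copy the constructor of its superclass: it can call it through super(). The sketch below is a simplified variation on the DNA and RNA classes (here length just uses len, and the method to_dna is made up for the example); it also shows that an RNA object counts as a DNA object for isinstance.

```python
class DNA:
    """Superclass: stores a sequence and knows its length."""

    def __init__(self, seq=""):
        self.seq = seq

    def length(self):
        return len(self.seq)

class RNA(DNA):
    """Subclass: inherits length and adds one method of its own."""

    def __init__(self, seq=""):
        super().__init__(seq)   # reuse the constructor of DNA

    def to_dna(self):
        # hypothetical helper: swap U for T and return a DNA object
        return DNA(self.seq.replace("U", "T").replace("u", "t"))

b = RNA("AUGCA")
print(b.length())            # inherited from DNA
print(isinstance(b, DNA))    # an RNA object is also a DNA object
print(b.to_dna().seq)        # only RNA objects have this method
```

If the DNA constructor changes later (for example, to check the letters of the sequence), the RNA class picks up the change automatically.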

Conclusion
This is the End (our only friend)
This text showed you how to program in Python and
how to deal with genomics using this easy programming
language. Remember... the only way to learn how to program
is programming. For this reason, in the next sections
I have selected some examples you can use to practice and
learn much faster than by reading theoretical stuff. They are
ordered by complexity... at least, take a look. Have fun!
EXERCISE: HOW TO COUNT BASES IN GENOMIC SEQUENCES
Many Ways to Count Bases
In this section we’ll learn how to count bases in a DNA string. This is the way I do it, but everyone can solve the
problem using their own imagination. There are many ways to do it, of course, but I suggest creating functions or
classes in order to get the job done.

Count DNA made Easy

CODE
import time

def count1(DNA):
    t0 = time.time()
    countA = 0
    countT = 0
    countG = 0
    countC = 0
    for i in DNA:
        if i == "A":
            countA += 1
        elif i == "T":
            countT += 1
        elif i == "G":
            countG += 1
        elif i == "C":
            countC += 1

    # Let's print the count for each base
    print("A: " + str(countA))
    print("T: " + str(countT))
    print("G: " + str(countG))
    print("C: " + str(countC))

    # This prints the elapsed time
    t1 = time.time()
    cpu_time = t1 - t0
    print("cpu_time = " + str(cpu_time))

OUTPUT
count1("ATGCC")
A: 1
T: 1
G: 1
C: 2
cpu_time = 3.5999999999702936e-05

Counting Bases using the Count Function

CODE
def count2(DNA):
    t0 = time.time()
    countA = DNA.count("A")
    countT = DNA.count("T")
    countG = DNA.count("G")
    countC = DNA.count("C")

    # This prints the elapsed time
    t1 = time.time()
    cpu_time = t1 - t0
    print("cpu_time = " + str(cpu_time))

    # This returns a list of counts
    return [countA,countT,countG,countC]

OUTPUT
count2("ATGCC")
cpu_time = 6.000000000838668e-06
[1,1,1,2]

Counting Bases using Dictionaries

CODE
def count3(DNA):
    t0 = time.time()
    countA = DNA.count("A")
    countT = DNA.count("T")
    countG = DNA.count("G")
    countC = DNA.count("C")

    bases = ["A", "T", "G", "C"]
    counts = [countA,countT,countG,countC]

    # This creates an empty dictionary
    dna_dict = {}

    # This fills the dictionary with counts
    # for each base
    for i in range(len(bases)):
        dna_dict[bases[i]] = counts[i]

    # This prints the elapsed time
    t1 = time.time()
    cpu_time = t1 - t0
    print("cpu_time = " + str(cpu_time))

    return dna_dict

OUTPUT
count3("ATGCC")
cpu_time = 7.000000000090267e-06
{'A': 1, 'C': 2, 'G': 1, 'T': 1}

How to Create a Nice Format for your Counter

CODE
def count4(DNA):
    t0 = time.time()
    countA = DNA.count("A")
    countT = DNA.count("T")
    countG = DNA.count("G")
    countC = DNA.count("C")

    bases = ["A", "T", "G", "C"]
    counts = [countA,countT,countG,countC]

    # This creates an empty dictionary
    dna_dict = {}

    # This fills the dictionary with counts
    # for each base
    for i in range(len(bases)):
        dna_dict[bases[i]] = counts[i]

    # This prints the header of the table (\n starts a new line)
    print("\n---------------- -----")
    print("Nucleotide Code: Base:\n---------------- -----")
    for k, v in dna_dict.items():
        if k == "A":
            print("Adenine......... ..." + str(v) + ".")
        if k == "T":
            print("Thymine......... ..." + str(v) + ".")
        if k == "G":
            print("Guanine......... ..." + str(v) + ".")
        if k == "C":
            print("Cytosine........ ..." + str(v) + ".")

    # This prints the elapsed time
    t1 = time.time()
    cpu_time = t1 - t0
    print("\ncpu_time = " + str(cpu_time))

OUTPUT
count4("ATGCCCA")

---------------- -----
Nucleotide Code: Base:
---------------- -----
Cytosine........ ...3.
Guanine......... ...1.
Thymine......... ...1.
Adenine......... ...2.

cpu_time = 4.7999999999603915e-05

As you can see, all four examples use the module time (imported once, at the top of the first example). Why? Don’t
worry! It’s not mandatory, we just want to know how fast our script is. It appears that the script of example 1.2
(count2) is the fastest one. In example 1.1 (count1) we used a for loop in order to count the bases, whereas in
example 1.2 we used the count function and put the results in a list. The most complete code is the one in example
1.3 (count3), where we used the count function and returned a dictionary. Example 1.4 (count4) is the same as example
1.3, it’s just “fancier”.
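
A single difference between two time.time() calls is a rather noisy measurement for code this short. As a hedged alternative, the sketch below counts the bases in one pass with collections.Counter and times the function with the timeit module, which repeats the call many times. The function name count_bases is made up for the example, and the timings will of course be different on your computer.

```python
import timeit
from collections import Counter

def count_bases(dna):
    """Count every base in one pass and return a dictionary."""
    counts = Counter(dna)
    # keep a fixed order and report 0 for the bases that are missing
    return {base: counts[base] for base in "ATGC"}

print(count_bases("ATGCC"))

# timeit runs the call many times, which gives a steadier number
# than a single time.time() difference
seconds = timeit.timeit(lambda: count_bases("ATGCC" * 1000), number=100)
print("100 runs took " + str(round(seconds, 4)) + " seconds")
```

The same timeit call can be wrapped around count2 or count3 to compare the different approaches on a longer sequence.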

EXERCISE: HOW TO CREATE RANDOM SEQUENCES
Random DNA or RNA sequences
Random sequences are very important in genomics because we can use them to figure out whether real genomes are
shaped by chance or by selective pressure. In this section we’ll learn how to create random DNA or RNA sequences
in a very easy way.

CODE
import random

def randomDNA(length, alphabet="ATGC"):
    return "".join([random.choice(alphabet) for i in range(length)])

def randomRNA(length, alphabet="AUGC"):
    return "".join([random.choice(alphabet) for i in range(length)])

OUTPUT
randomDNA(3)
’GTC’

randomRNA(3)
’UGC’

randomDNA(10)
’GCCGTAAGAT’

randomRNA(10)
’GAUCUCAGUG’

randomDNA(50)
’AAAAGTTCTATCTTGACTAATCAACGGCCAGCCGTACATAGCGCGCTAAG’

randomRNA(50)
’ACUCGCGUUAGGAAGUCAUCCUGUUAGCUCCAACUAAGGUUCAGUGGUGA’

randomDNA(90)
’CGTAGGTTCTAGGAGGCACGTAGTTTCGGAAGTGTAACTACGCCCGGTCAAACATGCCCGGTTGCCCCCTGAAGGAAACCTGCAGGCGGT’

randomRNA(90)
’CACAGAACAUAUCAUCAUUAUUUCAAUGCCCCCAAAUCGUGGCGUCGAUUGGUGUGUCUGGCAGUUAGUGGGAACUGAAUACCGGGACCG’

This code is simple. We have imported the module random. From this module we have used the function choice. The
function choice picks one random character from a set of characters. Then, we used a list comprehension to call
random.choice once for every position, and joined the results in order to get a sequence with the length we set in
the variable length. In this way, we have written two functions. One creates random DNA sequences whereas the other
one creates random RNA sequences.
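
Since the two functions differ only in their alphabet, they can be merged into one. The sketch below is one possible way to do it, not the only one: it uses random.choices, which draws all the letters in a single call, and an optional seed (an extra parameter introduced here) so that the same "random" sequence can be reproduced when you need it, for example in a test.

```python
import random

def random_sequence(length, alphabet="ATGC", seed=None):
    """Return a random sequence of the given length.
    With a seed, the same sequence is produced every time."""
    rng = random.Random(seed)   # private generator, does not touch the global one
    return "".join(rng.choices(alphabet, k=length))

dna = random_sequence(20, seed=42)
rna = random_sequence(20, alphabet="AUGC", seed=42)
print(dna)
print(rna)

# Same seed, same sequence
print(dna == random_sequence(20, seed=42))
```

Leaving seed=None gives a different sequence on every call, just like randomDNA and randomRNA.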

EXERCISE: HOW TO CREATE RANDOM SENTENCES
Random Sentences
Ok... I’m going to show you something very cool. Let’s say I’m a monkey and I start to type things on the keyboard
of your computer. How long does it take to get the sentence "hi guys"? We can try to simulate it. To make it simple,
we will use only the characters of the English alphabet. Moreover, let’s suppose that the monkey is able to separate
the words with a space. So, first we are going to write a function that creates random words, then we are going to
set the maximum number of characters and words and finally, we are going to simulate the monkey’s job. I know it has
nothing to do with genomics but it’s a lot of fun. Let’s see how it works!

CODE
import random
import time

def random_sentence(max, nwords, alphabet="abcdefghijklmnopqrstuvwxyz"):
    sentence = '' # empty variable
    # this loop is needed to create more than one word
    for j in range(nwords):
        length = random.randint(1,max) # random length
        word = "".join([random.choice(alphabet) for i in range(length)]) # random word
        sentence = sentence + ' ' + word # fill the variable sentence with random words
    sentence = sentence.lstrip() # remove the first space from the variable sentence
    return sentence

randsentence = random_sentence(4,2) # random sentence
sent_to_find = 'hi guys' # the sentence to find
counter = 0

t0 = time.time() # start to count the time

# this loop is needed in order to create many random sentences until we find
# the one in the variable sent_to_find
while randsentence != sent_to_find:
    counter = counter + 1
    print(randsentence) # print old (non-sent_to_find) random sentence
    randsentence = random_sentence(4,2) # pick a new sentence.

cpu_time = round(time.time()-t0,2) # stop time

print("The script found '{0}' after {1} iterations. Congrats!".format(sent_to_find,counter))
print("cpu_time = " + str(cpu_time) + " seconds")

OUTPUT
random_sentence(6,2)
ht wzvg
.
The script found ’hi guys’ after 38931601 iterations. Congrats!
cpu_time = 508.06 seconds

As you see, there are new things here. First, we used a for loop in order to create more than one word in a random
sentence, depending on the parameter we set in the random_sentence function. Then we have concatenated all the words
in the initially empty variable 'sentence'. This variable ends up with a space before the first word, so we have
deleted it with the function lstrip. Then, we have created a while loop in order to generate random sentences until
we find the one contained in the variable sent_to_find. Awesome, isn’t it?
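
How many sentences should the monkey expect to type? This can be worked out without any simulation. Under the assumptions of the script (every word gets a length drawn uniformly between 1 and the maximum, every letter is drawn uniformly from the alphabet, and the number of words is already the right one), the chance of hitting the target in one try is a simple product. The sketch below does the arithmetic with exact fractions; the function name chance_of_sentence is made up, and the 26-letter alphabet string.ascii_lowercase is an assumption.

```python
import string
from fractions import Fraction

def chance_of_sentence(sentence, max_length, alphabet=string.ascii_lowercase):
    """Probability that one random sentence equals `sentence`, when every
    word gets a length drawn uniformly from 1..max_length and every letter
    is drawn uniformly from `alphabet`."""
    p = Fraction(1)
    for word in sentence.split(" "):
        if len(word) > max_length or any(c not in alphabet for c in word):
            return Fraction(0)   # this word can never be typed
        p *= Fraction(1, max_length)                   # the right word length
        p *= Fraction(1, len(alphabet)) ** len(word)   # the right letters
    return p

p = chance_of_sentence("hi guys", max_length=4)
# For a rare event the average number of tries is 1/p
print("chance per try: 1 in " + str(1 / p))
```

With these assumptions the chance is 1 in 16 * 26**6, that is, roughly one in 4.9 billion, so the number of iterations in any single run can be very different from the average.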

EXERCISE: HOW TO COUNT KMERS
K-mers counting
In genomics, the term k-mer usually refers to all the possible substrings of length k that are contained in a DNA
sequence. In the context of genome analysis, k-mers are often used to identify the characteristics of a particular
chromosome or of genome regions in general. Let’s see a script that can help us to identify a specific set of k-mers.

CODE
def kmers(dna, n):
    # it creates an empty dictionary
    kmer_dictionary = {}

    # it fills the dictionary with k-mers
    for i in range(len(dna)+1-n):
        kmer = dna[i:i+n] # this is the kmer
        kmer_dictionary[kmer] = kmer_dictionary.get(kmer, 0) + 1 # it's filling the dict. up

    # it returns the filled dictionary
    return kmer_dictionary

OUTPUT

kmers("ATA",1)
{’A’: 2, ’T’: 1}

kmers("ATA",2)
{’AT’: 1, ’TA’: 1}

kmers("ATA",3)
{’ATA’: 1}

kmers("ATATATA",4)
{’TATA’: 2, ’ATAT’: 2}

kmers("ATATATA",5)
{’TATAT’: 1, ’ATATA’: 2}

kmers("ATATATA",6)
{’ATATAT’: 1, ’TATATA’: 1}

kmers("ATGACGTTGGCGCAGCAGACTTT",1)
{’A’: 5, ’C’: 5, ’T’: 6, ’G’: 7}

kmers("ATGACGTTGGCGCAGCAGACTTTTTTTTTTTTTTTTTTTTTTTTTT",1)
{’A’: 5, ’C’: 5, ’T’: 29, ’G’: 7}

In order to solve the problem we slide a window as long as the k-mer (n) along the sequence, through to its end. The
number of window positions is equal to the length of the sequence minus the length of the k-mer plus 1, because the
loop should stop when the last k-mer at the end of the sequence has been read. Then, we used the dictionary method
get. When it’s called as get(kmer, 0), it checks if the specified key exists in the dictionary. If it does, it
returns the value of that key; otherwise it returns the default value 0. Adding 1 to the result and storing it back
updates the count, so that in the end each key (k-mer) is associated with its frequency.
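
The get trick is exactly what collections.Counter does behind the scenes. As a sketch of a shorter alternative (the function name count_kmers is made up for the example), the same sliding window can be fed directly to a Counter, which also offers most_common to ask for the most frequent k-mers.

```python
from collections import Counter

def count_kmers(dna, k):
    """Count all overlapping k-mers of length k in dna."""
    # one window for every start position from 0 to len(dna) - k
    return Counter(dna[i:i + k] for i in range(len(dna) - k + 1))

counts = count_kmers("ATATATA", 4)
print(dict(counts))

# The most frequent k-mer(s), here the top one
print(counts.most_common(1))

# A k-mer longer than the sequence gives an empty result
print(count_kmers("ATA", 5))
```

A Counter behaves like a dictionary, so everything done with kmer_dictionary above works with it as well.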

EXERCISE: HOW TO WORK WITH MATPLOTLIB AND PANDAS
Matplotlib and Pandas

For Python programmers, matplotlib is the library of choice when it comes to plotting things. Moreover, another
module called Pandas is the favorite one for reading and working with tables. So, let’s say we have an Excel file
with a table in it. We can save it as a tab-delimited text file and then we can import it in Python using Pandas.
Let’s say we have a file like this:

TEST 1.TXT FILE


col1 col2
row1 15 445
row2 15 4954
row3 70 94

So, let’s import it through Pandas and create a pie plot using Matplotlib:

EXAMPLE 5.1
import pandas as pd
import matplotlib.pyplot as plt

# Let's import the table within our file (tab-delimited, row names in the first column)
matrix = pd.read_csv("your_file", sep="\t", index_col=0)

# Pie chart, where the slices will be ordered and plotted counter-clockwise:
labels = list(matrix.index)  # getting the row names
headers = list(matrix)       # getting the headers (we will not use it)

# Let's get the first column from our matrix, we won't use the second column
sizes = matrix.col1

# Let's highlight the second slice of the pie plot
explode = (0, 0.1, 0)  # only "explodes" the 2nd slice

# Let's create the subplots
fig1, ax1 = plt.subplots()  # creates a figure with one set of axes
ax1.pie(sizes, explode=explode, labels=labels, autopct="%1.1f%%", shadow=True, startangle=90)

# This ensures that the pie is drawn as a circle.
ax1.axis("equal")

plt.show()

Here we are! That’s the pie plot. What do you think? I think it’s cool. You can import whatever you want and plot
it. Let’s see another example! Now, suppose we have created a matrix of 3 columns and 100 rows filled with random
numbers from 0 to 99:

TEST2.TXT FILE

        col1    col2    col3
row0    98      88      24
row1    14      74      84
row2    54      44      88
.
.
.
row98   66      77      84
row99   64      54      14
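A table like the one above could be generated with a sketch such as the following, which uses only the standard library. The function name random_table and its parameters are assumptions made for illustration; they are not part of the original exercise.

```python
import random


def random_table(n_rows=100, n_cols=3, low=0, high=99):
    """Return a tab-delimited table of random integers as a string."""
    # The header starts with an empty cell, above the column of row names
    header = "\t".join([""] + ["col%d" % (c + 1) for c in range(n_cols)])
    lines = [header]
    for r in range(n_rows):
        values = [str(random.randint(low, high)) for _ in range(n_cols)]
        lines.append("\t".join(["row%d" % r] + values))
    return "\n".join(lines) + "\n"


table = random_table()
# Show the header and the first two rows
print("".join(table.splitlines(keepends=True)[:3]))
```

Writing the returned string to a text file should give an input that Pandas can read with sep="\t" and index_col=0, as in the previous example.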

CODE
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the "3d" projection (needed by old matplotlib versions)

# It imports the file
m = pd.read_csv("your_file", sep="\t", index_col=0)

# It creates the colors list (one color for each row of the matrix)
cm = plt.get_cmap("RdYlGn")
col = [cm(i / len(m)) for i in range(len(m))]

# 2D Plot
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(m.col1, m.col2, s=10, c=col, marker="o")

# 3D Plot
fig = plt.figure()
ax3D = fig.add_subplot(111, projection="3d")
ax3D.scatter(m.col1, m.col2, m.col3, s=10, c=col, marker="o")

plt.show()

Great! We have imported our matrix with Pandas and plotted it in 2D and 3D (Axes3D module).

EXERCISE: CALCULATING THE ENTROPY
Shannon Entropy
The Shannon entropy measures how evenly the bases are distributed in a DNA sequence: it is 0 when the sequence contains only one kind of base and it reaches its maximum, 2 bits, when A, T, G and C are equally frequent. Let’s see a script that calculates the Shannon entropy of all the sliding windows of a random DNA sequence.

CODE 6.1
import random
import matplotlib.pyplot as plt
from collections import Counter
from math import log2

# It creates a random DNA sequence
def randomDNA(length, alphabet='ATGC'):
    return "".join([random.choice(alphabet) for i in range(length)])

# It calculates the Shannon entropy of a DNA sequence
def entropy(dna):
    return -sum(i/len(dna) * log2(i/len(dna)) for i in Counter(dna).values())

# It calculates the Shannon entropy of every sliding window of a given length
# along a new random sequence (it uses the global variable "length" defined below)
def entropy_window(window_length):
    x = []
    dna = randomDNA(length)
    if(window_length>len(dna)):
        print("Window size exceeds DNA length: Execution Aborted!")
    else:
        for i in range(len(dna)):
            window = dna[i:i+window_length]
            if(len(window)==window_length):
                x.append(entropy(window))
            else:
                break
    return x

# It plots the histogram of the Shannon entropy for each sliding window
# with the length contained in the list called "lista"
length = 1000  # length of a test DNA sequence
lista = [10, 20, 50, 100, 200]  # size of the sliding windows

# It plots the histograms (bootstrap=10)
# That means that we will have 10 plots with 5 histograms in each
for j in range(10):
    for i in lista:
        x = entropy_window(i)
        plt.hist(x, bins=10)  # histogram with bins=10
    plt.show()

This code is difficult. It creates a random DNA sequence (see the randomDNA function) of a given length and then calculates the Shannon entropy of all the sliding windows (moving one base at a time) of a given length (see the variable lista). Then it plots the histograms of the frequency distribution for each window length. It is also possible to test more DNA sequences by changing the number of repetitions of the outer loop, which we called bootstrap (it’s not a real bootstrap, but it looks like one). Let’s see a picture to understand how the sliding windows work:
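The way the windows slide can also be sketched in a few lines of code. The helper sliding_windows and the short test sequence are assumptions made for illustration; the slicing is the same one used inside entropy_window.

```python
def sliding_windows(dna, window_length):
    """Return every window of the given length, moving one base at a time."""
    return [dna[i:i + window_length]
            for i in range(len(dna) - window_length + 1)]


for start, window in enumerate(sliding_windows("ATGCATG", 4)):
    # Indent each window by its start position to show how it slides
    print(" " * start + window)

# Printed output:
# ATGC
#  TGCA
#   GCAT
#    CATG
```

A sequence of 7 bases contains 7 - 4 + 1 = 4 windows of length 4; each one starts one base after the previous one.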

What does it mean in the real world? It means that if you have a sequence like this: "ATGATATATATGAGGCCCC", the Shannon entropy is:

H(X) = -[(0.316*log2 0.316)+(0.211*log2 0.211)+(0.211*log2 0.211)+(0.263*log2 0.263)] = 1.97848

Why? Because the sequence is 19 bases long and contains 6 A, 5 T, 4 C and 4 G, so the frequency of A is 0.316, the frequency of T is 0.263, the frequency of C is 0.211 and the frequency of G is 0.211. So at the end of your long day you get 1.97848 (using the exact, unrounded frequencies). In the code, the bases are counted using the Counter class of the collections module. Do you remember it?
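The worked example can be reproduced with a short sketch. The function name shannon_entropy is an assumption; the formula is the same one used by the entropy function of the script.

```python
from collections import Counter
from math import log2


def shannon_entropy(dna):
    """Shannon entropy (in bits) of the base frequencies of a sequence."""
    total = len(dna)
    return -sum(n / total * log2(n / total) for n in Counter(dna).values())


sequence = "ATGATATATATGAGGCCCC"
for base, n in sorted(Counter(sequence).items()):
    # base, number of occurrences, frequency rounded to three decimals
    print(base, n, round(n / len(sequence), 3))
print(round(shannon_entropy(sequence), 5))
```

The last line should agree with the value computed by hand above. A sequence made of a single base gives 0, and a sequence with the four bases in equal amounts gives 2.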

EXERCISE: CHARGAFF’S SECOND RULE

Chargaff plots

Chargaff’s second parity rule states that, in a single strand of DNA of any organism, the amount of guanine is "almost" equal to the amount of cytosine and the amount of adenine is "almost" equal to the amount of thymine. Let’s create a random genome and then calculate the average of the ratios A/T and C/G in order to test the hypothesis that this average is close to 1. We’ll do it for all the sliding windows of different lengths through our random genome.

CODE
import random
import matplotlib.pyplot as plt

# Using the function "random.seed" you can always obtain the same random sequence.
# random.seed(444)
# The number '444' is arbitrary. Change it if you want.

def randomDNA(length, alphabet='ATGC'):
    return "".join([random.choice(alphabet) for i in range(length)])

# It calculates the mean of A/T and C/G
# (it returns None when one of the four bases is missing)
def charg_calc(dna):
    nA = dna.count('A')
    nT = dna.count('T')
    nG = dna.count('G')
    nC = dna.count('C')
    if((nA != 0) and (nT != 0) and (nG != 0) and (nC != 0)):
        percAT = nA/nT
        percCG = nC/nG
        average = (percAT+percCG)/2
        return average

# It calculates the mean of A/T and C/G for each sliding window
def charg_window(window_length):
    x = []
    dna = randomDNA(length)
    if(window_length>len(dna)):
        print("Window size exceeds DNA length: Execution Aborted!")
    else:
        for i in range(len(dna)):
            window = dna[i:i+window_length]
            if(len(window)==window_length):
                average = charg_calc(window)
                if(average is not None):  # it skips the windows without all four bases
                    x.append(average)
            else:
                break
    return x

# It plots the histogram of the mean of A/T and C/G for each sliding window
length = 50000  # length of the sequence
lista = [50, 100, 500, 1000, 2500, 3000, 5000]  # size of the sliding windows

# It plots the four histograms! It's a kind of bootstrap :)
for j in range(4):
    for i in lista:
        x = charg_window(i)
        plt.hist(x, bins=20)  # histogram with bins=20
    plt.show()

So what’s the meaning of this code? We just wanted to create a frequency distribution of a value calculated as the mean of the ratios A/T and C/G. Why? Because if Chargaff was right, it should always be almost equal to 1. As you can see, when we use a sliding window of 50 it’s not like that, but when we start to use windows longer than 1000 this is the case. At the beginning of the code, we mentioned the function random.seed (it’s commented out: remove the # to use it). It makes Python create the same random sequence every time we use a specific number. For example, if we use the number 444 we will always obtain the same sequence, whereas if we use the seed 1234 we will always obtain another one. Of course, instead of using random sequences you can use real sequences or genomes from different organisms. Be careful: Python is pretty slow, so if you use big genomes it’s going to take ages.
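The effect of random.seed can be checked with a minimal sketch. The function random_dna is a renamed copy of randomDNA, and the sequence length of 20 is an arbitrary choice made for illustration.

```python
import random


def random_dna(length, alphabet="ATGC"):
    return "".join(random.choice(alphabet) for _ in range(length))


random.seed(444)
first = random_dna(20)

random.seed(444)   # same seed: the generator restarts from the same state
second = random_dna(20)

random.seed(1234)  # different seed: a different but equally repeatable sequence
third = random_dna(20)

print(first == second)  # True
print(first, third)     # the two sequences can be compared by eye
```

Setting the seed once at the top of a script is usually enough to make a whole run reproducible.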

EXERCISE: BROWSING THE WEB

Neanderthal and Selenium

Suppose we have a list of Homo sapiens genes and we want to know if they are the same in the Neanderthal genome. Well... we can do it manually, visiting the web page http://neandertal.ensemblgenomes.org/index.html, pasting the name of each gene in the text box, clicking on the result link and downloading the table of the mutations. That’s cool, but what if you have to do it for 19,000 genes? Let’s say we have a txt file containing all the Homo sapiens genes, alphabetically ordered in one column. Well, we can use the script below, which uses Selenium, to get the number of "Non-synonymous" mutations for each gene when comparing Homo sapiens and Homo neanderthalensis:

CODE
import os
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import TimeoutException

neandertal = []

# Let's read the file that contains the names of the genes
with open('/Users/cristian/Desktop/genes.txt') as f:
    names = f.read().splitlines()

# Let's create the file that will include the results
thefile = open('/Users/yourname/Desktop/neandertal.txt', 'w')

for i in names:
    try:  # try to run this block; in case of an error, report it and continue
        chromedriver = "/usr/local/Cellar/chromedriver/2.30/bin/chromedriver"
        os.environ["webdriver.chrome.driver"] = chromedriver
        # Selenium 4 syntax: the driver path goes in a Service object and
        # find_element(By.X, ...) replaces the old find_element_by_X(...) methods
        driver = webdriver.Chrome(service=Service(chromedriver))
        driver.get('http://neandertal.ensemblgenomes.org/index.html')
        element1 = driver.find_element(By.ID, 'q')
        element1.send_keys(i)
        element2 = driver.find_element(By.XPATH, "//input[@type='submit' and @value='Go']")
        element2.click()
        protein_link = driver.find_element(By.PARTIAL_LINK_TEXT, 'ENSP')
        protein_link.click()
        variation_link = driver.find_element(By.PARTIAL_LINK_TEXT, 'Variations')
        variation_link.click()
        wait = WebDriverWait(driver, 10)
        table1 = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR,
            "#main > div:nth-child(2) > div.content > div > table"))).text
        t = table1.split()
        count = 0
        for j in t:
            if (j=="Non-synonymous"):  # Looking for Non-synonymous mutations
                count = count + 1
        stringa = i+": "+ str(count)
        neandertal.append(stringa)
        thefile.write("%s\n" % stringa)
        print(stringa)
        driver.quit()
    except (NoSuchElementException):  # in case the script doesn't find the correct element
                                      # in the HTML code
        print(i + " was no valid entry and was skipped.")
        driver.quit()
        continue
    except (TimeoutException):  # in case the web page doesn't load on time
        print(i + " timed out and was skipped.")
        driver.quit()
        continue

thefile.close()

OUTPUT
A2M: 11
ABL1: 23
ADCY5: 1
AGPAT2: 4
AGTR1: 9
AIFM1: 1
AKT1: 6
APEX1: 8
APOC3: 0
APOE: 18
APP: 21
APTX: 4
AR: 8
ARHGAP1: 1
ARNTL: 0
ATF2: 1
.
.
.

The most difficult part of the code is the one that finds the elements in the HTML code of the web page: the text box where the name of the gene is inserted, the button and the links. There are many ways to do it, see http://selenium-python.readthedocs.io/locating-elements.html for more info. In our case, we located the elements with driver.find_element by id, by XPath and by partial link text, but you can use the method that you prefer. You have to practice a bit in order to understand the way Selenium works. At the beginning it goes to the web page, then it inserts the name of the first gene of the file, then it clicks on the protein name and on the variations link, and finally it shows and writes to a file called "neandertal.txt" the number of "Non-synonymous" mutations for that specific gene. As you can see, the code uses the statements "try" and "except", which allow the program to keep running in case of errors: instead of stopping, it shows a message and goes on with the next gene. That’s one of the most difficult but nicest pieces of code you’ll ever see in your entire life. Am I right? No? Ok. Bye.
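The counting step and the "try"/"except" logic can be tried without a browser. In the sketch below the gene names and table texts are made up for illustration, and count_non_synonymous is a hypothetical helper that is not part of the original script.

```python
def count_non_synonymous(table_text):
    """Count the occurrences of the word "Non-synonymous" in a table's text."""
    return table_text.split().count("Non-synonymous")


# Made-up table texts standing in for what the browser would return
fake_tables = {
    "GENE1": "Variant1 Non-synonymous Variant2 Synonymous Variant3 Non-synonymous",
    "GENE2": None,  # simulates a gene for which no table was found
    "GENE3": "Variant1 Synonymous",
}

results = {}
for gene, text in fake_tables.items():
    try:
        results[gene] = count_non_synonymous(text)
    except AttributeError:
        # None has no split method: report the problem and move on
        print(gene + " was no valid entry and was skipped.")
        continue

print(results)  # {'GENE1': 2, 'GENE3': 0}
```

The failing entry does not stop the loop: the error is caught, a message is printed and the next gene is processed, which is the same behavior the Selenium script relies on.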

Index

A
addition function, 5
align.globalxx function, 24
align.localxx function, 24
Anaconda, 4
anagrams, 19
arguments of a function, 14
attribute of an object, 29
attributes, 29
Axes3D module, 43

B
BioPython module, 24

C
Chargaff rules, 46 f
choice function, 36
classes and objects, 29
collections module, 18
combinations, 19
command line arguments, 26
comment symbol, 5
concatenation function, 6
constructor, 29
Counter, 18
Counter module, 45

D
dataframe, 21
DataFrame function, 21
degrees function, 5
del function, 18
dictionaries, 10
division function, 5
driver.find_element function, 49

E
Editor, 4

F
find function, 12
findall function, 15
finditer function, 15
for loops, 8
functions, 14

G
glob function, 27
glob module, 27

H
Homo neanderthalensis, 48
Homo sapiens, 48

I
if, else and elif statements, 7
indentation, 7
inheritance, 30
input function, 26
install modules, 24
installing modules through pip, 13
instances, 29
itertools module, 19

J
join function, 12

K
k-mers, 39, 44

L
lambda functions, 16
len function, 12
Linux, 4
list comprehensions, 16
lists, 10
log function, 5
lstrip function, 38

M
Mac, 4
map function, 17
match function, 15
math module, 5
Matplotlib module, 23, 41
matrix, 20
methods, 29
multiplication function, 5

N
numpy module, 20

O
os module, 27
os.path module, 27
os.path.exists function, 27

P
pairwise2 module, 24
pandas module, 21, 41
permutations, 19
power function, 5
print function, 4
print_function function, 8

R
radians function, 5
rand.randint, 21
randint function, 23
randn function, 23
random module, 36
random.seed function, 47
range function, 8
raw_input function, 26
re module, 18
read files, 13
readlines function, 13
reduce function, 17
regular expressions, 15
replace function, 13, 25

S
search function, 15
seed function, 23
Selenium module, 28, 48 f
self, 29
SeqIO module, 24
Series function, 21
set, 17
Shannon entropy, 45
shutil module, 27
sin function, 5
split function, 12
splitlines function, 25
Spyder, 4
sqrt function, 5
str function, 6
string indexes, 11
strings, 6
strings selection, 11
strings slicing, 11
subtraction function, 5
sys module, 26

T
terminal, 4, 26
time module, 35
type function, 6

V
variables, 6

W
while loops, 9, 38
Windows, 4
write files, 13

Z
zip function, 17
