CALM and USE PYTHON for GENOMICS
Cristian Taccioli
Contents
About the Author
More on Pandas
    The revenge
Matplotlib Module
    Let's plot it!
EXERCISE: BROWSING THE WEB
    Neanderthal and Selenium
About the Author
Professor C. Taccioli is an Associate Professor at the University of Padova and is in charge of the MAPS Bioinformatics Laboratory at the M.A.P.S. Department. He has a Master's degree in "Molecular Biology", a PhD in "Pharmacology and Molecular Oncology" and a BSc in "Biostatistics". He has previously worked at the University of Bologna, the University of Ferrara, the University of Modena and Reggio Emilia, The Ohio State University (OSU) and University College London (UCL). His research activity focuses on Genomics (including Cancer genomics).
®© C.T 2018
Introduction
The first edition of this book (2018) was written for students who want to learn Python in order to analyze genomic data. This is a very short text for beginners, whether they are students or not. If you already know how to program in other languages such as C++, this book is going to be "super easy" for you, otherwise it will be just "easy". Don't worry about understanding this book: I'm going to teach you the basics of Python together with many examples about genomics and other fancy stuff. Each chapter includes two or more exercises, and at the end of the book you'll find many more of them. Have fun!
Fun with Python
The basics

HISTORY OF PYTHON
I believe that Python is the easiest programming language ever created. It is used by the most important companies such as Amazon and Google. Unfortunately, Python is not very fast but life is life, and it is not a matter of increasing its speed. Anyway, if you want to go fast, I'll teach you some tricks. If you want to know more about Python history take a look here: https://en.wikipedia.org/wiki/History_of_Python

Python is an "interpreted language" whereas other languages such as C and C++ are "compiled languages". A compiled language is one where the program, once compiled, is expressed in the instructions of the target machine. Python code is interpreted and for this reason it is usually much slower than C++, but it has the advantage of being a high-level programming language.

SHOULD I USE LINUX?
If you use Linux or Mac, you don't need to install Anaconda because Python is usually already installed and can be used through the terminal. In Linux, the terminal is a black screen that you find in "Development" and that shows up if you press the keys ctrl + alt + t together on the keyboard. If you use a Mac, the terminal is located in Applications and then Utilities. On the terminal type python. That's the Python interpreter. Using Python this way means using the native Python without editors (Spyder is an editor). The good thing with the interpreter is that you don't have to select and run your code all the time, you just need to press enter; the bad thing is that if you write several lines of code, sometimes it gives you an error and you don't know why. So, I suggest using Anaconda and the Spyder editor. I also suggest using Linux. If you use Linux, you can write some code in a text file and call it, for example, "mycode.py" and then, using the terminal, you can run it with the command python mycode.py. This is a good way to run Python if your code is very long. Remember that in this book I will capitalize the first letter of the word "Python" whereas you have to write it lowercase when you use a terminal in Linux, Mac or Windows. Take a look here: https://www.python.org/downloads/.
Math with Python
Trash your old calculators

WHO CARES ABOUT MATH?
Python is an amazing math tool. It uses the standard mathematical operations as taught at high school or secondary school. In other words, mathematical expressions and parentheses are evaluated in the same order we learned many years ago.

[...] math and then use the math function. There are a lot of modules (libraries) you can import to do a lot of things. A module allows you to logically organize your Python code. Grouping related code into a module makes the code easier to understand and use. A module is a Python thing (it's a file inside your computer) consisting of Python code that can be used by typing the command import followed by the name of a specific module. In the case below we want to import the math module. If you have an old version of Python (e.g. Python 2.7) you can import division from the __future__ module so that the "/" operator always returns floating-point numbers (from __future__ import division).
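As a minimal sketch of the two ideas above (evaluation order and importing the math module), assuming Python 3, something like the following could be tried. The variable names are only illustrative.

```python
import math

# Standard precedence: multiplication comes before addition,
# and parentheses are evaluated first.
no_parens = 2 + 3 * 4        # 14
with_parens = (2 + 3) * 4    # 20

# Functions of the math module are called as math.<name>.
root = math.sqrt(16)         # 4.0

# In Python 3 "/" always returns a float, "//" is floor division.
half = 7 / 2                 # 3.5
floor_half = 7 // 2          # 3

print(no_parens, with_parens, root, half, floor_half)
```

In Python 2.7 the line 7 / 2 would give 3 unless division is imported from __future__ first.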
Variables in Python

a = "Ciao"
print(a)
Ciao

As you see, we used double quotes to store the string "Ciao" into the variable "a" but we didn't use any quotes for the numbers. That's because Python wants you to use single or double quotes when you create string variables, whereas numbers do not need them. Moreover, both strings and numbers can be added using the symbol "+". Let's see an example:

a = "Hello "
b = "World!"
print(a+b)
Hello World!

a = 9
b = 6.9
print(a+b)
15.9

That's pretty cool. If you are using strings, Python concatenates the variables, whereas if you are using numbers, it adds them.

HOW TO KNOW IF A VARIABLE IS A STRING OR A NUMBER
So you want to know if a variable is a string or a number:

name = "Paul"
age = 18
weight = 75.66
a = 3945034850298340598304598203

print(type(name))
<class 'str'>

print(type(age))
<class 'int'>

print(type(weight))
<class 'float'>

So, the variable "name" is a string, "age" is an integer, "weight" is a float, and "a" is a very long integer. A "float" is a number with decimals. In Python 2 a very long number like "a" had its own type called "long" (and the output looked like <type 'str'> instead of <class 'str'>); in Python 3 it is simply an "int" of unlimited size. So, if you want to know what kind of variable you have just created, use the command "type".
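A small sketch, assuming Python 3, of how the type check can be used inside a program rather than just printed. The helper name describe is hypothetical and not part of the book.

```python
def describe(value):
    """Return a short label for the kind of value (hypothetical helper)."""
    if isinstance(value, str):
        return "string"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "float"
    return "something else"

name = "Paul"
age = 18
weight = 75.66
big = 3945034850298340598304598203

for value in (name, age, weight, big):
    print(type(value).__name__, "->", describe(value))
```

Note that the very long number is reported as an integer, because Python 3 has a single int type.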
The If statement
If this happens, then do that, else...

MAKE A DECISION... PLEASE
Decisions are when a program has more than one choice of things to do. Think about food. If you cook some "spaghetti", you eat them, otherwise you trash everything. What? This is not a good example? Ok, never mind, luckily, Python has a decision statement to help us when our application needs to make such decisions. It's like writing in English: "If this, Then this, Else something else".

SOME EXAMPLES WITH IF
Let's see some examples:

age = 20
if (age < 20):
    print("I'm so young")
elif (age == 20):
    print("I'm 20... I'm so cool")
else:
    print("I'm older than 20")
    print("I'm gonna die!!!")
I'm 20... I'm so cool

[...] language you are forced to indent the statements within the block so that it is easier for a human to see where the block begins and ends without having to scan for the braces. In Python, the interpreter can tell where the block begins and ends based on whitespace (usually four spaces). Indentation must be used for the if statement and the for loop (we'll see it later). If you use four spaces and then three spaces in the same block, this is an error. For example:

DON'T DO THAT
If you use four spaces the first time to indent things, then you always have to use four spaces. Python is very picky. Let's see the example we have seen earlier:

age = 20
if (age < 20):
    print("I'm younger than 20")
    print("I'm so young")
elif (age == 20):
    print("I'm 20... I'm so cool")
else:
    print("I'm older than 20")
   print("I'm gonna die!!!") # ERROR!!!
   # You used three spaces here
   # You should've used four spaces
   # Like in the lines above
The For loop
Let's iterate!

WHAT IS A FOR LOOP STATEMENT?
A "For loop" in Python is a way to repeat a group of statements once for every element of a sequence. You can iterate over any iterable object (such as strings, lists, and so on) but every statement that you want to repeat must be indented. Let's see an example below.

# Let's convert uppercase DNA
# to lowercase DNA. All the other
# characters will be converted to "-"
DNA = "ATTACKA"
for i in DNA:
    if(i == "A"):
        print("a", end="")
    elif(i == "T"):
        print("t", end="")
    elif(i == "G"):
        print("g", end="")
    elif(i == "C"):
        print("c", end="")
    else:
        print("-", end="")
attac-a
Lists & Dictionaries
The big variables

DO I FEEL LIKE A LIST OR A DICTIONARY?
Your brain is still burning from the last lesson, I know. Don't worry. Lists and dictionaries are easy. They are just like variables, but can contain more stuff inside them.

LISTS
Lists are lists of values (strings, numbers, and so on). Each element in a list is indexed starting from zero; the first one is numbered zero, the second 1, the third 2, etc. You can remove and add values. They are defined within brackets "[]", and each element is separated with a comma ",". There are also variables called Tuples that are immutable (unchangeable) and are defined by parentheses.

# Iterate dna, make uppercase and put in DNA
DNA = []
dna = ["tata","auu","aug"]
for i in dna:
    up = i.upper() # make "i" uppercase
    DNA.append(up)
print(DNA)
['TATA', 'AUU', 'AUG']

So the first two examples are easy. We created a list called "list1". This list contains the names of the members of the band "Metallica". So, in the first case we added the name "James" at the end of the list. Then, we removed him. Then, we inserted the string "Roberto" in second position (the index is not 2 but 1, because the index starts from 0, at the beginning of a list). The third example is pretty cool. First, we created an empty list called "DNA". Then, we iterated over each element of the list "dna", and used the method upper() to make them uppercase. Then, we put each element in the new list "DNA" with the method append(). Python is "case sensitive"; so uppercase and lowercase words have a different meaning in Python. In this case, for example, "DNA" is not the same as "dna".
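The list operations described in this section (adding at the end, removing, inserting at a given index) can be sketched as follows. The starting content of list1 is an assumption made only for illustration; the original first two examples are not reproduced here.

```python
# Hypothetical starting list (placeholder names, assumed for this sketch).
list1 = ["Lars", "Kirk"]

list1.append("James")          # add "James" at the end of the list
after_append = list(list1)     # keep a copy to look at later

list1.remove("James")          # remove him again
list1.insert(1, "Roberto")     # index 1 is the SECOND position

print(after_append)
print(list1)

# A tuple looks like a list with parentheses, but it cannot be changed.
sides = (3, 5)
try:
    sides[0] = 10
except TypeError:
    print("Tuples are immutable!")
```

The try/except at the end only shows that assigning to a tuple element raises a TypeError.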
Working with Strings
Strings are important!!!

STRING INDEXES
Now that you know most of the Python programming language, you need to rest a little bit. For this reason, I'll show you how to play with strings, using a lot of examples and a little theory. The only thing to remember is that every character of a "word" or "string" is associated with an index, which starts from the first character with the number 0. There is also another system of indexing that starts from -1, which is the last letter of a string.

INDEX EXAMPLE
These are the indexes for the word "Python":
+---+---+---+---+---+---+
| P | y | t | h | o | n |
+---+---+---+---+---+---+
  0   1   2   3   4   5
 -6  -5  -4  -3  -2  -1

You can use the positive indexes or the negative ones, depending on the problem you want to solve. Now, let's see some real stuff.

SLICING STRINGS
word = "Python"

# chars from pos. 0 (included) to 2 (excluded)
print(word[0:2])
Py

# chars from pos. 2 (included) to 5 (excluded)
print(word[2:5])
tho

# chars from 2 (included) to -1 (excluded)
print(word[2:-1])
tho

# chars from -5 (included) to -1 (excluded)
print(word[-5:-1])
ytho

# chars from -6 (included) to -1 (excluded)
print(word[-6:-1])
Pytho

SLICING TO THE END OF A STRING
word = "Python"

# chars from position 0 (included) to the end
print(word[0:])
Python
More on Strings

word = "Python"

# Through the end with step 1 (every character)
print(word[::1])
Python

# Through the end with step 2 (every second character)
print(word[::2])
Pto

# Through the end with step 4 (every fourth character)
print(word[::4])
Po

# Through the end with step 5 (every fifth character)
print(word[::5])
Pn

# Through the end with step 6 (every sixth character)
print(word[::6])
P

REVERSE SKIPPING SLICING
# Find "h" within the string "Python"
print(word.find("h"))
3 # "h" is at index 3 (the fourth character)

# Count "o" occurrences in the "Python" string
print(word.count("o")) # "o" found once
1

# Split "Python" where "o" is found
print(word.split("o"))
['Pyth', 'n']

# Join together a list of strings
word2 = ["P","y","t","h","o","n"]
print("".join(word2))
Python
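Slicing with a negative step walks through the string backwards. The following is a minimal sketch of reverse slicing; the DNA string is a made-up example.

```python
word = "Python"

# Step -1: every character, from the last one to the first one.
reversed_word = word[::-1]
print(reversed_word)          # nohtyP

# Step -2: every second character, going backwards.
every_second_back = word[::-2]
print(every_second_back)      # nhy

# The same trick reverses a DNA sequence (made-up example).
dna = "ATTGC"
print(dna[::-1])              # CGTTA
```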
Read and Write files
Reading and writing is important!

# Read a fasta file and put the sequence
# in a variable without the fasta header
fasta = "".join(open("myfile.fasta").\
readlines()[1:]).replace("\n","")
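The one-liner for reading a FASTA file can also be written step by step. This is only a sketch under the assumption that the file contains a single record (one header line followed by sequence lines); the function name read_fasta_sequence and the demo file content are hypothetical.

```python
import os
import tempfile

def read_fasta_sequence(path):
    """Return the sequence of a single-record FASTA file as one string."""
    with open(path) as handle:          # the file is closed automatically
        lines = handle.read().splitlines()
    return "".join(lines[1:])           # skip the header line (">...")

# Write a tiny demo file (made-up content) so the sketch is self-contained.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.fasta")
with open(demo_path, "w") as handle:
    handle.write(">seq1 demo record\nATTG\nCCGA\n")

sequence = read_fasta_sequence(demo_path)
print(sequence)
```

Using "with open(...)" is a design choice: the file is closed even if an error occurs while reading.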
Functions
Don't copy and paste your code,

def cube(a):
    print("the cube is " + str(a**3))

cube(3)
the cube is 27
Regular Expressions
The advanced way!

When writing programs for genomics we often search for patterns (substrings) in strings. For example, if we want to search for a "TATA box" in a promoter or the start codon "AUG" in an mRNA sequence we can use regular expressions. Regular expressions are part of formal language theory and use symbols to define a search pattern. This language was created in the 1950s by the mathematician Stephen Cole Kleene.

SEARCH VS MATCH
# Let's import all the functions (*)
# from the regular expression module "re"
from re import *
dna = "AAACCCCCCCCCATG"
# if ATG is found at the beginning of
# the dna string, "Good!" is printed
# out otherwise you'll see the word "Bad!"
if match(r"ATG", dna):
    print("Good!")
else:
    print("Bad!")
Bad!

dna = "ATCGCGAATTCAC"
# if GAATTC is found, "Restriction
# site found" string is printed out
if search(r"GAATTC", dna):
    print("Restriction site found!")
Restriction site found!

dna = "ATCGGGTCCTTCAC"
# if GG followed by A or T followed by CC
# are found, "Restriction site found"
# string is printed out. The symbol "|"
# means OR (regular expressions have no
# "&" symbol for AND).
if search(r"GG(A|T)CC", dna):
    print("Restriction site found!")
Restriction site found!

As you see, each of these examples is explained using the comments delimited by the symbol "#". Now we will introduce the concept of the function group(); group(1) will return the part of the string matched by the section of the pattern in the first set of parentheses, whereas group(2) will return the part matched by the second, etc. Let's see some other examples:

SEARCH IN DETAIL
from re import *
dna = "ATGACGTACGTACGACTG"
# store the match object in the variable m
m=search(r"GA([ATGC]{3})AC([ATGC]{2})AC",dna)
print("entire match: " + m.group())
print("first match: " + m.group(1))
print("second match: " + m.group(2))
entire match: GACGTACGTAC
first match: CGT
second match: GT

As you can see, the function "match" only looks for the pattern at the beginning of a string, whereas the function "search" looks for the first match in the entire string. If we want to extract parts of the match, we have to write those parts of the regular expression within parentheses and print them with the function "group()". The regular expression GA([ATGC]{3})AC([ATGC]{2})AC means that we want something that starts with GA, then there are three letters among the group A or T or G or C (that is [ATGC]{3}), then there is AC and two other letters among the group A or T or G or C (that is [ATGC]{2}). Finally we also want to match AC. Since we have included [ATGC]{3} and [ATGC]{2} within parentheses, we can select them separately using the functions "group(1)" and "group(2)". If we use the function "group()" we will select everything. Remember, you have to write m.group() or m.group(1) or m.group(2) because we have created the variable "m" that stores the match object returned by search. What if we want to have all the substrings described by our Regular Expression in just one shot? And what if we also want to know where a match starts and ends? In the first case, we want to use the "findall" function, in the second, we'll use the function "finditer". Let's see some examples:

FINDALL VS FINDITER
from re import *

dna = "AAAATATAAACCCTATATT"
runs = findall(r"TATA[AT]+", dna)
# The symbol "+" means "one or more"
# The symbol "*" means "zero or more"
print(runs)
['TATAAA', 'TATATT']

dna = 'AAAATATAAACCCTATATT'
for i in finditer(r"TATA[AT]",dna):
    print(i.group() + " " + str(i.start())\
    + "," + str(i.end()))
TATAA 4,9
TATAT 13,18
List Comprehensions & Lambda Functions

Python list comprehensions enable the manipulation of lists, whereas with Lambda functions we can pass functions to other functions to do stuff. They are powerful because with a line of code you can do many things.

LIST COMPREHENSIONS
List comprehensions provide a nice way to create new lists. They consist of square brackets containing an expression followed by one or more for loops. The expression can be anything, meaning you can put in calculations with numbers or search patterns for substrings, and the result will be a new list.

LAMBDA FUNCTIONS
Python supports a clear syntax that lets Python programmers define functions in just one line. This way of writing functions has been borrowed from a programming language called Lisp. Lambda functions can be used anytime we need a function to be declared.

LAMBDA FUNCTIONS EXAMPLES
Let's see some examples:

# Calculate the cube of a number
cube = lambda x: x**3
print(cube(2))
8

# Calculate the cube of each element of a list
list1 = [1,2,3]
print(list(map(lambda x: x**3,list1)))
[1, 8, 27]
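Going back to list comprehensions, here is a minimal sketch of the "square brackets, expression, for loop" form described in this chapter. The lists are made-up examples.

```python
# Cube of every number: same result as the map + lambda example.
numbers = [1, 2, 3]
cubes = [x**3 for x in numbers]
print(cubes)                      # [1, 8, 27]

# Uppercase every string of a list in one line.
dna = ["tata", "auu", "aug"]
upper_dna = [s.upper() for s in dna]
print(upper_dna)                  # ['TATA', 'AUU', 'AUG']

# An optional "if" keeps only the elements that pass a test.
starts_with_au = [s for s in dna if s.startswith("au")]
print(starts_with_au)             # ['auu', 'aug']
```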
The Counter

Counter is a container imported from the module 'collections'. It lets you count letters or words without writing a for loop. It's faster and easier, so I suggest using it.

THE COUNTER!
What if we want to count the frequency of words in a text or the distribution of letters in a sentence? And what if we want to identify the most common letters in a text? Don't worry, with Counter, you can do it with a few lines of code.

COUNT ON ME!
# Let's count the letters in a string
from collections import Counter

print(Counter('GATTACA'))
Counter({'A': 3, 'T': 2, 'G': 1, 'C': 1})

# A better way to visualize it
# "c" in this case is a special dictionary
# and "i" represents its keys
c = Counter('GATTACA')
for i in 'ATGC':
    print(i, "=", c[i])
A = 3
T = 2
G = 1
C = 1

# Let's count the letters in a word
c = Counter('pyth')
print(c)
Counter({'p': 1, 'y': 1, 't': 1, 'h': 1})

# Most common letters in a text
c = Counter() # create c as Counter
f = open('/home/chris/Desktop/rings.txt', 'r')
file = f.read()
for i in file:
    c.update(i.rstrip().lower())
print('Most common letters:')
# Let's use a built-in method called
# most_common
for letter, count in c.most_common(3):
    print(letter, "=", count)
Most common letters:
e = 235331
i = 201032
a = 199554

EASY EXAMPLES
# Most common words in a file
import re
from collections import Counter

with open('/Users/me/Share/myfile.txt') as f:
    file = f.read()
words = re.findall(r'\b[a-zA-Z]+\b', file)
wordcounts = Counter(words)
ignore=['The', 'the','And','and', 'of', 'to']
for word in list(wordcounts):
    if word in ignore:
        del wordcounts[word]
print(wordcounts.most_common(3))
[('that', 12577), ('in', 12331), ('shall', 9760)]

I know... you understood absolutely nothing about the code above. The first examples were pretty easy and self-explanatory but this one... Well, first of all, we have imported the module "re", which is the package that programmers use when dealing with regular expressions and strings (see the paragraph on regular expressions), and then we have imported Counter from the collections module. Then we opened a text file (it can be a book or whatever contains letters) and we put the words in a list called "words". How do we do that? With the regular expression "\b[a-zA-Z]+\b". What does it mean? It means that we want to select all the words. In fact, \b is the start and the end of a word. "a-zA-Z" means that a word can be upper or lower case and the symbol "+" means that a word can be of whatever length. This stuff is really cool guys. Try with Shakespeare or Dante and see the most common words they used in their poems. Ah right, I forgot to tell you. The list "ignore" contains the words that we want to exclude from the counting. We removed these words from our Counter using the statement "del".

WHAT IS COUNTER?
Counter is a dictionary subclass for counting elements. The elements are stored as dictionary keys and their counts are stored as dictionary values. In this way, it's easy to calculate frequencies with Counter instead of using "while" or "for" loops.

Counter was primarily designed to work with positive integers representing counts, for example the frequency of a word or of a letter in a text. However, negative numbers can be used. See the documentation: https://docs.python.org/3/library/collections.html
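The word-counting example needs a file on disk. A minimal sketch that works on an in-memory string instead (the sentence is made up, and lowercasing the text first is an assumption that is not in the original code) could look like this:

```python
import re
from collections import Counter

# Made-up text used instead of a file, so the sketch is self-contained.
text = "The gene and the protein: the gene encodes the protein of the cell"

# \b marks a word boundary, [a-zA-Z]+ is one or more letters.
words = re.findall(r"\b[a-zA-Z]+\b", text.lower())
wordcounts = Counter(words)

# Remove the words we do not want to count.
ignore = ["the", "and", "of", "to"]
for word in list(wordcounts):
    if word in ignore:
        del wordcounts[word]

print(wordcounts["gene"], wordcounts["protein"], wordcounts["cell"])
```

Asking a Counter for a missing key, such as wordcounts["the"] after the deletion, returns 0 instead of raising an error.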
The Itertools Module
Combinations and Permutations

The itertools module contains a lot of functions. We'll see just the combinatoric generators in order to calculate permutations and combinations with or without repetitions. Below, k is the number of elements picked each time and n is the total number of elements. So, k is called the class and n is called the number of elements.

COMBINATIONS
A combination is a selection of elements in which the order of the elements does not matter. For example, the number of combinations of 3 elements of class 2 is 3. Why? Because the combinations without repetition of n=3 and k=2 (for ex. a, b and c) are: {a,b} {a,c} {b,c}

ANAGRAMS
Anagrams are permutations without repetitions where the number of elements is equal to the class (n = k). In the last example we will find all the anagrams for the word "yep".

COMBINATIONS AND PERMUTATIONS
# Let's see some examples now

# Combinations without repetition
# (order doesn't matter)
import itertools

print(list(itertools.combinations("ATGC", 2)))
[('A', 'T'), ('A', 'G'), ('A', 'C'), ('T', 'G'),
('T', 'C'), ('G', 'C')]

# Permutations without repetition
# (order matters)
print(list(itertools.permutations \
("ATGC", 2)))
[('A', 'T'), ('A', 'G'), ('A', 'C'), ('T', 'A'),
('T', 'G'), ('T', 'C'), ('G', 'A'), ('G', 'T'),
('G', 'C'), ('C', 'A'), ('C', 'T'), ('C', 'G')]

'yep'
'ype'
'eyp'
'epy'
'pye'
'pey'
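The code that produces the anagrams is not shown here, so the following is only a sketch of one way to obtain them with itertools.permutations. It also checks the number of combinations and permutations against the usual formulas n!/(k!(n-k)!) and n!/(n-k)!; the helper names are hypothetical.

```python
import itertools
from math import factorial

# Anagrams: permutations where n = k (all the letters are used).
anagrams = ["".join(p) for p in itertools.permutations("yep")]
for a in anagrams:
    print(repr(a))

def n_combinations(n, k):
    """Number of combinations without repetition (hypothetical helper)."""
    return factorial(n) // (factorial(k) * factorial(n - k))

def n_permutations(n, k):
    """Number of permutations without repetition (hypothetical helper)."""
    return factorial(n) // factorial(n - k)

print(n_combinations(4, 2), n_permutations(4, 2))
```

For "ATGC" with k=2 the formulas give 6 and 12, the same sizes as the two lists printed by itertools.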
The Numpy Module
Working with "a" Matrix

"You take the blue pill, the story ends. You wake up in your bed and believe whatever you want to believe. You take the red pill, you stay in Wonderland, and I show you how deep the rabbit hole goes." [Morpheus to Neo, The Matrix (1999)]

Sorry guys, this is not about the movie "The Matrix", it's about working with matrices of numbers. In mathematics, a matrix is a rectangular group of numbers, arranged in rows and columns. In case of just one line of numbers, the matrix is called a vector. The easiest way to work with them is using the module Numpy. Importing Numpy is super-easy. Just write: from numpy import *. In this way you will import all the functions within the Numpy module. The examples shown below are commented and self-explanatory. Enjoy the Numpy module!

CREATING VECTORS & MATRICES
# Let's see how to work with matrices and
# vectors with Numpy:

# Import numpy
from numpy import *

# Creates a vector
a = array([1,2,9,5])
print(a)
[1 2 9 5]

# Creates another vector called a2
# and adds vector a to vector a2
a2 = array([1,4,5,5])
print(a+a2)
[ 2 6 14 10]

# Creates a matrix 4x3 filled with 0
b = zeros((4,3))
print(b)
[[ 0. 0. 0.]
[ 0. 0. 0.]
[ 0. 0. 0.]
[ 0. 0. 0.]]

# Creates a matrix 4x3 filled with 1
c = ones((4,3))
print(c)
[[ 1. 1. 1.]
[ 1. 1. 1.]
[ 1. 1. 1.]
[ 1. 1. 1.]]

OPERATION WITH MATRICES
# Multiply the matrix c by 2
d = c*2
print(d)
[[ 2. 2. 2.]
[ 2. 2. 2.]
[ 2. 2. 2.]
[ 2. 2. 2.]]

# Adds matrix c to d
e = c+d
print(e)
[[ 3. 3. 3.]
[ 3. 3. 3.]
[ 3. 3. 3.]
[ 3. 3. 3.]]

# Creates a random matrix 4x3
# filled with integers
r=random.randint(10, size=(4,3))
print(r)
# of course the random matrix shown
# on your screen will be different
# from the one below because it's random
[[5 7 3]
[6 1 7]
[7 6 4]
[5 3 6]]

ROWS AND COLUMNS
# Prints the first row of r matrix
print(r[[0],:])
[[5 7 3]]

# Prints the second row of r matrix
print(r[[1],:])
[[6 1 7]]

# Prints the first column of r matrix
print(r[:,[0]])
[[5]
[6]
[7]
[5]]

# Prints the second column of r matrix
print(r[:,[1]])
[[7]
[1]
[6]
[3]]
More on Pandas
The revenge

Guys... I noticed you are very good with pandas, so I decided to show you some more examples. These are about subsetting a dataframe and doing some nice operations with it.

WORKING WITH PANDAS... AGAIN
# Let's create the dataframe df we saw
# in the previous page then select the
# rows from a to c, and the columns from
# A to C by headers. Now let's calculate
# the mean of the selected rows
df.loc["a":"c","A":"C"].mean(1)
a    5.666667
b    7.333333
c    6.000000

... AND MORE ABOUT DATAFRAMES
# Calculate square root of
# a subset of df
df.loc["a":"c","A":"C"].apply(sqrt)
          A         B         C
a  2.645751  2.645751  1.732051
b  3.000000  2.236068  2.828427
c  1.732051  2.449490  3.000000

# Change the number in the fourth row
# and fourth column (position [3,3]) to 100
df2.iloc[3,3]=100
print(df2)
     A    B    C    D
a    7    7    3    6
b    9    5    8    5
c    3    6    9    5
d  NaN  NaN  NaN  100
e    9    4  NaN  NaN
f    9    5  NaN    7
Matplotlib Module
[...] have used the randint function that creates random integer numbers. Let's now create a fancier bar plot!
BioPython Module
With a Little Help from My Friends

Hi guys, do you know the song "With a Little Help from My Friends"? It was written by "The Beatles". This band was very famous in the 60s and played, among the others, amazing songs like "Toxicity", "Infected Voice", "The Wolf I feed" and many others... Anyway, sometimes it happens that we are tired, angry or sad. However, if we are lucky, there are friends who listen to us... and if we are very, very lucky, they also help us. When programming in Python for Genomics we are not alone... take a look here: http://biopython.org/DIST/docs/tutorial/Tutorial.html

That's supereasy, isn't it? We have imported SeqIO from Bio, then we have opened the fasta file and then we have used the "for loop" in order to get the id and the sequence for every record contained in the FASTA file. Of course, using "sequence.id" and "sequence.seq" we can do whatever we want, such as calculating the length of each sequence with the function len "len(sequence.seq)", or converting every sequence to lower case using the function lower "str(sequence.seq.lower())". In the last case, we have used the str function in order to convert everything to a string, because sequence.seq is not a string but a Bio.Seq.Seq object.
Running an external program
How do I run an external program?

An interesting feature of Python is the ability to run external programs. Running tools within your Python code is very useful in genomics because it allows you to use software already created so you don't have to write it by yourself.

THE OUTSIDER
The possibility to run external programs from within Python code is very cool. It allows you to run existing tools as if you were using a terminal.

RUNNING PROGRAMS THROUGH A TERMINAL
If you use Linux, you probably run a program like that:
/home/cris/software/mytool

If you use Windows XP or greater, you run it like that:
c://windows//Program//myprogram//myprogram.exe

If you use a Mac, you run it like that:
/Application/myprogram

RUNNING STUFF WITH PYTHON 2.7
# Let's run the command date in Linux
# using Python 2.7
import subprocess
subprocess.call("/bin/date")
Thu Feb 16 17:05:56 CET 2017

# Let's run the command date in Linux
# and store it in a variable
# using Python 2.7
date_command=subprocess.check_output("date",\
shell=True).rstrip("\n")
print(date_command)
Thu Feb 16 17:07:37 CET 2017

# Let's run the command ls (list of files)
# with the argument -l (description of files)
# using Python 2.7
subprocess.check_call(["ls", "-l"])
# The results depend on your OS
autodiskmount fstyp

RUNNING STUFF WITH PYTHON 3
# Let's run the command ls -l
# in the root directory /
# using Python 3.5 or later
# (subprocess.run does not exist before 3.5)
import subprocess
subprocess.run(["ls", "-l", "/"],\
stdout=subprocess.PIPE)
CompletedProcess(args=['ls', '-l', '/'],\
returncode=0, stdout=b'total 53\
drwxrwxr-x+ 98 root admin 3332\
Feb 27 13:06 Applications
# ... so all the stuff above means that
# I have the folder Applications in root

# Let's run the command ls -lht
# in the root directory /
# using Python 3.5 or later and print it
# in a different way
import subprocess
a=subprocess.run(["ls", "-lht", "/"],stdout=\
subprocess.PIPE)
b=str(a)
print(b.replace('\\n', '\n'))
# the command above replaces the two characters
# \n with a real end of line
CompletedProcess(args=['ls', '-lht', '/'],
returncode=0, stdout=b'total 53
drwxrwxrwt@ 4 root 136B Mar 2 17:50 Volumes
drwxrwxr-x+ 98 root 3.3K Feb 27 13:06 Applic
dr-xr-xr-x 2 root 1B Feb 27 12:40 home
dr-xr-xr-x 2 root 1B Feb 27 12:40 net
drwxr-xr-x+ 67 root 2.2K Dec 13 17:15 Library
lrwxr-xr-x 1 root 42B Dec 2 13:36 Developer
drwxr-xr-x@ 39 root 1.3K Sep 5 18:02 bin
dr-xr-xr-x 3 root 4.2K Feb 27 12:39 dev
dr-xr-xr-x 2 root 1B Feb 27 12:40 home
drwxr-xr-x@ 4 root 136B Jan 19 02:05 System
drwxr-xr-x@ 59 root 2.0K Jan 19 02:03 sbin
drwxr-xr-x+ 67 root 2.2K Dec 13 17:15 Library
lrwxr-xr-x 1 root 42B Dec 2 13:36 Developer
drwxr-xr-x@ 39 root 1.3K Sep 5 18:02 bin
drwxr-xr-x@ 13 root 442B Nov 4 2015 usr
drwxr-xr-x 3 root 102B Oct 14 2015 opt
drwxr-xr-x@ 2 root 68B Oct 13 2015 Network
drwxr-xr-x 6 root 204B Oct 13 2015 Users
drwxrwxr-t@ 2 root 68B Oct 13 2015 cores
')

These super amazing examples just showed you how to run commands like ls -l or ls -lht using Python. Of course you can do it using the terminal in Linux, Windows or Mac but you can also do it within Python code. In the first example the result was a mess, but in the second one we replaced the \n symbol with a real end of line to get a really nice output... try to use other Linux commands by yourself. Enjoy it!
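An alternative to turning the whole CompletedProcess into a string is to read its stdout attribute directly. The sketch below assumes Python 3.7 or later (for the capture_output and text parameters) and, so that it runs on any operating system, launches the Python interpreter itself instead of ls; the message printed by the child process is made up.

```python
import subprocess
import sys

# Run an external program (here: another Python interpreter) and
# capture what it prints as text instead of bytes.
result = subprocess.run(
    [sys.executable, "-c", "print('hello from a child process')"],
    capture_output=True,   # collect stdout and stderr
    text=True,             # decode bytes to str
)

output = result.stdout.strip()
print("return code:", result.returncode)
print("output:", output)
```

On Linux or Mac the list could be replaced with ["ls", "-l", "/"], and result.stdout would then contain the listing with real line breaks.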
Command line arguments
Arguments... the easy way!

As we said in the "Introduction", a Python program can be saved in a text file with the extension ".py" and then run through the terminal. For example, we can save a file "hello.py" containing the script print('Hello!') and then we can run it with the command python hello.py through our terminal. On the screen we will see the line Hello!. Now... if you want to pass some parameters to the script you can write some strings after the name of the script such as: python hello.py name. Don't worry, let's see some examples...

LINE ARGUMENTS
If you are able to use Linux, you have probably used some command like ls, which shows all the files you have in a folder. If you run ls -l it prints the files in a folder plus some information about these files. The parameter -l is what everybody calls a command line argument.

COMM. LINE ARG. EXAMPLE II
# Let's write and save our script
# in a file called name.py
import sys
name=sys.argv[1]
print("Hello " + name + "!")

# Let's run our script through
# a terminal
python name.py Jimi
Hello Jimi!

# Instead of arguments we
# can use the input function
# in case you are using Python 3 or greater
# (or the raw_input function if you are using
# Python 2.7)
# Suppose we give the parameters 3 and 4
a = input("Insert a number for the short side of a rectangle: ")
b = input("Insert a number for the long side of a rectangle: ")
area = int(a)*int(b)
print("The area of the rectangle is: " +\
str(area))
The area of the rectangle is: 12
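A script that reads sys.argv[1] stops with an IndexError when no argument is given. A minimal sketch of a more forgiving version follows; the function name greeting and the default name "stranger" are assumptions made for illustration.

```python
import sys

def greeting(argv):
    """Build the greeting from a list shaped like sys.argv."""
    # argv[0] is the script name, argv[1] the first argument (if any).
    name = argv[1] if len(argv) > 1 else "stranger"
    return "Hello " + name + "!"

if __name__ == "__main__":
    print(greeting(sys.argv))
```

Saved as name.py, running python name.py Jimi would print Hello Jimi!, while python name.py alone would fall back to the default name.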
import os

directory = "/Users/cris/Desktop/"

for filename in os.listdir(directory):
    if filename.endswith(".txt"):
        print("Reading from " + filename)
        path = directory+filename
        myfile = open(path)
        myfile_list = myfile.readlines()
        myfile.close()
        # remove the newline character at the end of each line
        mylist_stripped = [i.replace("\n","") for i in myfile_list]
        print(mylist_stripped)
Reading from seq1.txt
['TATAT', 'ATTAG']
Reading from seq2.txt
['TATAT', 'AGAGA']
Reading from seq3.txt
['TATAT', 'AGAGA']
If you run the code that creates the folders, changing the path to yours, you will see that three folders have
been created and the three files (seq1.txt, seq2.txt and seq3.txt) have been moved to the new folders (seq1,
seq2 and seq3). How is that possible? First of all, we have imported the os module in order to use the mkdir
function that creates the directories. Then, we have imported the shutil module in order to use the move
function to “move” the files to the new directories. It is possible to do other beautiful things with folders
and files using Python. For example, the os.path.exists function checks if a directory or a file exists.
Moreover, by importing the glob module and using the glob function it is possible to get a list of paths
matching a pattern, for example:
import glob
print(glob.glob("c:/windows/*.fasta"))
DNA1.fasta
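The folder operations described above (mkdir, move, exists and glob) can be tried safely inside a temporary directory. This is a hedged sketch and not the listing from the text: the helper name move_into_folders is hypothetical, and the three small files are created by the example itself so that nothing on your Desktop is touched.

```python
import glob
import os
import shutil
import tempfile

def move_into_folders(directory):
    """Move every .txt file in directory into a new folder named after the file."""
    moved = []
    for path in sorted(glob.glob(os.path.join(directory, "*.txt"))):
        filename = os.path.basename(path)            # for example "seq1.txt"
        folder_name = os.path.splitext(filename)[0]  # for example "seq1"
        folder = os.path.join(directory, folder_name)
        if not os.path.exists(folder):               # create the folder only once
            os.mkdir(folder)
        moved.append(shutil.move(path, os.path.join(folder, filename)))
    return moved

# Try it on three throwaway files inside a temporary directory
with tempfile.TemporaryDirectory() as directory:
    for name in ("seq1.txt", "seq2.txt", "seq3.txt"):
        with open(os.path.join(directory, name), "w") as handle:
            handle.write("TATAT\n")
    move_into_folders(directory)
    print(sorted(os.listdir(directory)))
```

After the call, the temporary directory contains only the folders seq1, seq2 and seq3, each holding its own file.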
Surfin’ the Web
The Selenium module
Selenium is a Python module that provides a simple way to browse the internet using the most famous browsers,
like Firefox, IE, Chrome, Remote, etc. The currently supported Python versions are 2.7, 3.6 and above. You can
download Selenium from the PyPI page. However, you can also use pip to install the Selenium package like this:
pip install selenium
# Suppose you have a Yahoo account
# Let's go to the Yahoo.com web page
browser.get("https://mail.yahoo.com")
... CONTINUE
# Let's go to the email text box
emailElem = browser.find_element_by_id("login-username")
# Let's insert our email
emailElem.send_keys("youremail@yahoo.com")

# Let's go to the password text box
passwordElem = browser.find_element_by_id("login-signin")
# Let's print the results in the first page
for item in page1_results:
    print(item.text)
THE RECTANGLE CLASS
# Let's create the class Rectangle
class Rectangle:
    # Use three double quotes to describe classes
    """ This class is about Rectangles! """
    def __init__(self, side1, side2):
        self.side1 = side1
        self.side2 = side2

    def Area(self):
        return self.side1*self.side2

# Now... let's create the first object
rect1 = Rectangle(3,5)
print(rect1.Area())
15

# Now... let's create the second object
rect2 = Rectangle(2,4)
print(rect2.Area())
8

# Now... let's create the third object
rect3 = Rectangle(3,3)
print(rect3.Area())
9

First of all: 1) the first line of the code is the word class and the name of the class, in this case
Rectangle; 2) the second line is not mandatory, it’s the description of the class. So... if we want to know
something about the class or its objects just type help(name_of_the_class); 3) the special function or method
called __init__ (two underscores before and after init) is automatically called when an object of the class is
created. It’s also called constructor and it’s used, in this case, because we want to give an object some
parameters when we create it (ex: side1 and side2).

THE DNA CLASS
# Let's create the class 'DNA'
class DNA:
    """ This class includes a method that\
    calculates DNA length and much more"""
    def __init__(self, seq=""):
        self.seq = seq

    def lendna(self):
        countA = self.seq.count("A")
        countT = self.seq.count("T")
        countG = self.seq.count("G")
        countC = self.seq.count("C")
        return countA+countT+countG+countC

    def convert(self, mode):
        self.mode = mode
        if mode == "upper":
            return self.seq.upper()
        elif mode == "lower":
            return self.seq.lower()
        else:
            print("Please use the mode 'lower' or 'upper'!")

# Let's create an object
dna1 = DNA("ATTGC")
dna1.lendna()
5
dna1.convert("lower")
attgc

First of all, the name of the class this time is DNA. Second, we have created a constructor that accepts one
parameter (seq) that is a DNA string. Third, we have created a method or function called lendna that calculates
the counts of the letters "A", "T", "G" and "C" of the DNA sequence seq. Fourth, we have created another method
called convert that converts the DNA sequence to uppercase or lowercase depending on the parameter we use.
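To see the three ingredients of a class working together (the description, the constructor and the methods), here is a small sketch based on the Rectangle class. The Perimeter method is a hypothetical addition used only for illustration; it does not appear in the text.

```python
class Rectangle:
    """ This class is about Rectangles! """
    def __init__(self, side1, side2):
        # the constructor stores the parameters as attributes of the object
        self.side1 = side1
        self.side2 = side2

    def Area(self):
        return self.side1 * self.side2

    def Perimeter(self):
        # hypothetical extra method, added only for this sketch
        return 2 * (self.side1 + self.side2)

rect1 = Rectangle(3, 5)
print(Rectangle.__doc__)               # the description written between the triple quotes
print(rect1.side1, rect1.side2)        # the attributes set by __init__
print(rect1.Area(), rect1.Perimeter())
```

The description printed through __doc__ is the same text that help(Rectangle) shows.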
Inheritance
Inheritance is important
Classes can inherit functionality from other classes. In fact, the most common feature associated with
object-oriented programming is inheritance, that is, the ability to define a new class as a modified version of
an existing class. The main advantage of inheritance is that you can add new methods to a class without having
to change the original one. It is called "inheritance" because the new class "inherits" all the methods of the
original class. By extension, the original class is often called "superclass" and the derived class "daughter"
or "subclass".
INHERIT ME!
Python supports inheritance from multiple classes, like C++ but unlike other programming languages such as Java.

THE DNA CLASS
# Let's create the class 'DNA'
class DNA:
    ''' This class includes a method that\
    calculates DNA length '''
    def __init__(self, seq):
        self.seq = seq

    def length(self):
        countA = self.seq.count("A")
        countT = self.seq.count("T")
        countG = self.seq.count("G")
        countC = self.seq.count("C")
        return countA+countT+countG+countC

class RNA(DNA):
    ''' This class includes a method that\
    calculates RNA length '''
    def __init__(self, seq):
        self.seq = seq

# Let's see if it works:
a = DNA('ATGC')
a.length()
4

b = RNA('ATGCA')
b.length()
5

As you see the class RNA has inherited the method length from the class DNA. Is it possible to create some
functions that are created exclusively inside and for the RNA class? Yes, of course. Let’s see some examples:

THE DNA SUPERCLASS
# Let's create the class 'DNA'
class DNA:
    ''' This class includes a method that\
    calculates DNA length '''
    def __init__(self, seq=""):
        self.seq = seq

    def length(self):
        countA = self.seq.count("A")
        countT = self.seq.count("T")
        countG = self.seq.count("G")
        countC = self.seq.count("C")
        return countA+countT+countG+countC

class RNA(DNA):
    # Here below you are encouraged to always write
    # a brief description of your class
    """ This class includes methods that are able
    to convert DNA sequences to RNA"""
    def __init__(self, seq):
        self.seq = seq

    def convertL(self):
        new = self.seq.lower()
        return new

    def convert2RNA(self):
        if self.seq.isupper():
            return self.seq.replace("T","U")
        else:
            return self.seq.replace("t","u")

# Let's see if it works:
b = RNA('ATGCA')
b.length()
5

b.convertL()
atgca

b.convert2RNA()
AUGCA

As you see we have created the class RNA and then we have inherited the method length from the superclass DNA.
Moreover we have created two methods (or functions) only for the class RNA. In summary, inheritance is when a
class uses attributes or methods created within another class. If we think of inheritance in terms of genomics,
we can think of a bacterium inheriting certain genes from another bacterium. That is, a bacterium can inherit
antibiotic resistance. In this way we can save code and make our programs shorter.
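Two further points about inheritance can be shown with a short sketch: a subclass does not need its own constructor, because it inherits the one of the superclass, and it can redefine (override) an inherited method. The code below is an illustrative variant and not the listing from the text; here the RNA class holds an RNA string containing "U", which is an assumption made for this example.

```python
class DNA:
    """ This class includes a method that calculates DNA length """
    def __init__(self, seq=""):
        self.seq = seq

    def length(self):
        return sum(self.seq.count(base) for base in "ATGC")

class RNA(DNA):
    """ This class holds an RNA sequence (assumption: it contains U, not T) """
    # No __init__ here: the constructor of DNA is inherited as it is.

    def length(self):
        # Overriding: same method name, but it counts U instead of T
        return sum(self.seq.count(base) for base in "AUGC")

    def back_to_dna(self):
        # Hypothetical method that exists only in the subclass
        return self.seq.replace("U", "T")

a = DNA("ATGC")
b = RNA("AUGCA")
print(a.length())
print(b.length())
print(b.back_to_dna())
print(isinstance(b, DNA))   # an RNA object is also a DNA object
```

Overriding keeps the original DNA class untouched while the subclass adapts the behaviour it needs.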
Conclusion
This is the End (our only friend)
This text showed you how to program in Python and how to deal with genomics using this easy programming
language. Remember... the only way to learn how to program is programming. For this reason, in the next
paragraphs, I selected some examples you can use to train and learn much faster than by reading theoretical
stuff. They are ordered by complexity... at least, take a look. Have fun!
EXERCISE: HOW TO COUNT BASES IN GENOMIC SEQUENCES
Many Ways to Count Bases
In this paragraph we’ll learn how to count bases in a DNA string. That’s the way I do it. Everyone can solve the problem
using their own imagination. There are many ways to do it, of course, but I suggest creating functions or classes in
order to get the job done.
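CODE
import time
def count1(DNA):
    t0 = time.time()
    countA=0
    countT=0
    countG=0
    countC=0
    for i in DNA:
        if i == "A":
            countA += 1
        elif i == "T":
            countT += 1
        elif i == "G":
            countG += 1
        elif i == "C":
            countC += 1
    # print the counts and the elapsed time, as in the output below
    print("A: " + str(countA))
    print("T: " + str(countT))
    print("G: " + str(countG))
    print("C: " + str(countC))
    print("cpu_time = " + str(time.time() - t0))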
OUTPUT
count1("ATGCC")
A: 1
T: 1
G: 1
C: 2
cpu_time = 3.5999999999702936e-05
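CODE
def count2(DNA):
    t0 = time.time()
    countA = DNA.count("A")
    countT = DNA.count("T")
    countG = DNA.count("G")
    countC = DNA.count("C")
    # print the elapsed time and return the counts as a list, as in the output below
    print("cpu_time = " + str(time.time() - t0))
    return [countA, countT, countG, countC]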
OUTPUT
count2("ATGCC")
cpu_time = 6.000000000838668e-06
[1,1,1,2]
CODE
def count3(DNA):
    t0 = time.time()
    countA = DNA.count("A")
    countT = DNA.count("T")
    countG = DNA.count("G")
    countC = DNA.count("C")
    # put the counts in a dictionary
    dna_dict = {"A": countA, "T": countT, "G": countG, "C": countC}
    return dna_dict
CODE
def count4(DNA):
    t0 = time.time()
    countA = DNA.count("A")
    countT = DNA.count("T")
    countG = DNA.count("G")
    countC = DNA.count("C")
    # put the counts in a dictionary
    dna_dict = {"A": countA, "T": countT, "G": countG, "C": countC}
    print("\n---------------- -----")
    print("Nucleotide Code: Base:\n---------------- -----")
    for k, v in dna_dict.items():
        if k == "A":
            print("Adenine......... ..." + str(v) + ".")
        if k == "T":
            print("Thymine......... ..." + str(v) + ".")
        if k == "G":
            print("Guanine......... ..." + str(v) + ".")
        if k == "C":
            print("Cytosine........ ..." + str(v) + ".")
    print("cpu_time = " + str(time.time() - t0))
---------------- -----
Nucleotide Code: Base:
---------------- -----
Cytosine........ ...3.
Guanine......... ...1.
Thymine......... ...1.
Adenine......... ...2.
cpu_time = 4.7999999999603915e-05
As you can see, in all of these examples we have imported the module time. Why? Don’t worry! It’s not mandatory,
we just want to know the speed of our script. It looks like the script of example 1.2 is the fastest one. In
example 1.1 we used a for loop in order to count the bases, whereas in example 1.2 we used the count function and put
the result in a list. The most complete code is the one in example 1.3, where we used the count function and returned
a dictionary as a result. Example 1.4 is the same as the example above, it’s just “fancier”.
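Yet another way to count the bases is the Counter class of the collections module. The sketch below is an additional variant and not one of the four examples above; the function name count_bases is hypothetical, and time.perf_counter is used here as one possible way to measure the elapsed time.

```python
from collections import Counter
import time

def count_bases(dna):
    """Return a dictionary with the number of A, T, G and C found in dna."""
    counts = Counter(dna.upper())
    # a Counter returns 0 for a missing key, so every base gets a value
    return {base: counts[base] for base in "ATGC"}

start = time.perf_counter()
result = count_bases("ATGCC")
elapsed = time.perf_counter() - start

print(result)
print("elapsed = " + str(elapsed) + " seconds")
```

Timings of such short sequences change at every run, so compare the functions on long sequences if you want a meaningful answer.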
C ODE
import random
OUTPUT
randomDNA(3)
’GTC’
randomRNA(3)
’UGC’
randomDNA(10)
’GCCGTAAGAT’
randomRNA(10)
’GAUCUCAGUG’
randomDNA(50)
’AAAAGTTCTATCTTGACTAATCAACGGCCAGCCGTACATAGCGCGCTAAG’
randomRNA(50)
’ACUCGCGUUAGGAAGUCAUCCUGUUAGCUCCAACUAAGGUUCAGUGGUGA’
randomDNA(90)
’CGTAGGTTCTAGGAGGCACGTAGTTTCGGAAGTGTAACTACGCCCGGTCAAACATGCCCGGTTGCCCCCTGAAGGAAACCTGCAGGCGGT’
randomRNA(90)
’CACAGAACAUAUCAUCAUUAUUUCAAUGCCCCCAAAUCGUGGCGUCGAUUGGUGUGUCUGGCAGUUAGUGGGAACUGAAUACCGGGACCG’
This code is simple. We have imported the module random. From this module we have used the function choice, which
picks one random element from a set of characters. Then, we used a list comprehension and the random.choice function
in order to get a sequence with the length we set in the variable length. In this way, we have written two
functions: one creates random DNA sequences whereas the other one creates random RNA sequences.
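The description above can be sketched as follows. The names randomDNA and randomRNA come from the output shown in the text, but the bodies are a reconstruction under that description and may differ from the original listing.

```python
import random

def randomDNA(length):
    # one random base for each position, joined into a single string
    return "".join([random.choice("ATGC") for _ in range(length)])

def randomRNA(length):
    # same idea, with U instead of T
    return "".join([random.choice("AUGC") for _ in range(length)])

print(randomDNA(10))
print(randomRNA(10))
```

Every run prints different sequences unless you fix the seed with random.seed before calling the functions.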
CODE
import random
import time
# this loop is needed in order to create many random sequences until we find
# the one in the variable sent_to_find
while randsentence != sent_to_find:
    counter = counter + 1
    print(randsentence) # print old (non-sent_to_find) random sentence
    randsentence = random_sentence(4,2) # pick a new sentence.
OUTPUT
random_sentence(6,2)
ht wzvg
.
The script found ’hi guys’ after 38931601 iterations. Congrats!
cpu_time = 508.06 seconds
As you see, there are new things here. First, we used a for loop in order to create more than one word in a random
sentence depending on the parameter we set in the random_sentence function. Then we have concatenated all the words
in the initially empty variable ’sentence’. This variable will have a space before the first word, so we have deleted
it with the function lstrip. Then, we have created a while loop in order to generate random sentences until we find
the one contained in the variable sent_to_find. Awesome, isn’t it?
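The steps just described can be put together in a short sketch. It is a reconstruction and not the original script: it assumes that the first parameter of random_sentence is the maximum length of each word and that the second one is the number of words, and it searches for a very short target so that the loop ends quickly.

```python
import random
import string

def random_sentence(max_word_length, n_words):
    """Build n_words random lowercase words, each from 1 to max_word_length letters long."""
    sentence = ""
    for _ in range(n_words):
        word_length = random.randint(1, max_word_length)
        word = "".join(random.choice(string.ascii_lowercase) for _ in range(word_length))
        sentence = sentence + " " + word   # this leaves a space before the first word
    return sentence.lstrip()               # lstrip deletes that leading space

sent_to_find = "hi"   # a short target: longer sentences need many more iterations
counter = 1
randsentence = random_sentence(2, 1)
while randsentence != sent_to_find:
    counter = counter + 1
    randsentence = random_sentence(2, 1)

print("The script found '" + sent_to_find + "' after " + str(counter) + " iterations.")
```

The number of iterations grows very quickly with the length of the sentence, which is why the run shown in the text needed millions of attempts.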
CODE
def kmers(dna, n):
    # it creates an empty dictionary
    kmer_dictionary = {}
OUTPUT
kmers("ATA",1)
{’A’: 2, ’T’: 1}
kmers("ATA",2)
{’AT’: 1, ’TA’: 1}
kmers("ATA",3)
{’ATA’: 1}
kmers("ATATATA",4)
{’TATA’: 2, ’ATAT’: 2}
kmers("ATATATA",5)
{’TATAT’: 1, ’ATATA’: 2}
kmers("ATATATA",6)
{’ATATAT’: 1, ’TATATA’: 1}
kmers("ATGACGTTGGCGCAGCAGACTTT",1)
{’A’: 5, ’C’: 5, ’T’: 6, ’G’: 7}
kmers("ATGACGTTGGCGCAGCAGACTTTTTTTTTTTTTTTTTTTTTTTTTT",1)
{’A’: 5, ’C’: 5, ’T’: 29, ’G’: 7}
In order to solve the problem we had to create a window as long as the k-mer length (n) and we had to slide it through
the sequence along its entire length. The number of windows is equal to the length of the sequence minus the length of
the k-mer plus 1. It’s like that because the loop should stop when the last k-mer has been identified at the end of
the sequence. Then, we used the function get. This function, when it’s called, checks if the specified key exists in
the dictionary. If it does, then it returns the value of that key; otherwise it returns a default value. In this way,
for each key (k-mer) its frequency is returned.
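The sliding window and the get function described above can be sketched like this. The function name and its parameters follow the calls shown in the output; the body is a reconstruction under that description, with 0 assumed as the default value passed to get.

```python
def kmers(dna, n):
    # it creates an empty dictionary
    kmer_dictionary = {}
    # the number of windows is: length of the sequence - length of the k-mer + 1
    for i in range(len(dna) - n + 1):
        kmer = dna[i:i + n]
        # get returns the current count, or 0 if the k-mer has not been seen yet
        kmer_dictionary[kmer] = kmer_dictionary.get(kmer, 0) + 1
    return kmer_dictionary

print(kmers("ATA", 1))       # {'A': 2, 'T': 1}
print(kmers("ATATATA", 4))   # {'ATAT': 2, 'TATA': 2}
```

If n is larger than the sequence, the range is empty and the function returns an empty dictionary.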
For Python programmers, matplotlib is the library of choice when it comes to plotting things. Moreover, another module
called Pandas is the favorite one for reading and working with matrices. So, let’s say we have an Excel file with a
table in it. We can save it as a tab-delimited text file and then we can import it in Python using Pandas. Let’s say
we have a file like that:
So, let’s import it through Pandas and create a pie plot using Matplotlib:
EXAMPLE 5.1
import pandas as pd
import matplotlib.pyplot as plt
# Pie chart, where the slices will be ordered and plotted counter-clockwise:
labels = list(matrix.index) # getting the row names
headers = list(matrix) # getting the headers (we will not use it)
# Let’s get the first column from our matrix, we won’t use the second column
sizes = matrix.col1
plt.show()
Here we are! That’s the pie plot. What do you think? I think it’s cool. You can import whatever you want and plot
it. Let’s see another example! Now, suppose we have created a matrix of 3 columns and 100 rows filled with random
numbers from 0 to 99:
TEST 2.TXT FILE
col1 col2 col3
row0 98 88 24
row1 14 74 84
row2 54 44 88
.
.
.
row98 66 77 84
row99 64 54 14
C ODE
from mpl_toolkits.mplot3d import Axes3D
# 2D Plot
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(m.col1, m.col2, s=10, c=col, marker="o")
# 3D Plot
fig = plt.figure()
ax3D = fig.add_subplot(111, projection="3d")
ax3D.scatter(m.col1, m.col2, m.col3, s=10, c=col, marker="o")
plt.show()
C ODE 6.1
import random
import matplotlib.pyplot as plt
from collections import Counter
from math import log2
# It plots the histogram of the mean of A/T and C/G for each sliding window
# with the length contained in the vector called "lista"
length = 1000 # length of a test dna sequence
lista = [10, 20, 50, 100, 200] # size of the sliding windows
What does it mean in the real world? It means that if you have a sequence like "ATGATATATATGAGGCCCC", the
Shannon entropy is:
$$H = -\sum_{i \in \{A,T,C,G\}} p_i \log_2 p_i = -(0.316 \log_2 0.316 + 0.263 \log_2 0.263 + 0.211 \log_2 0.211 + 0.211 \log_2 0.211)$$
Why? Because the sequence is 19 bases long, so the frequency of A is 0.316 (6/19), the frequency of T is 0.263
(5/19), the frequency of C is 0.211 (4/19) and the frequency of G is 0.211 (4/19). So at the end of your long day you
get 1.97848. The Shannon entropy has been calculated using Counter from the collections module. Do you remember it?
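The calculation can be checked with a few lines of code. This is a minimal sketch that uses Counter and log2, the two names imported in the code of this exercise; the function name shannon_entropy is hypothetical.

```python
from collections import Counter
from math import log2

def shannon_entropy(sequence):
    """Shannon entropy (in bits) of the letters of a sequence."""
    if not sequence:
        return 0.0
    counts = Counter(sequence)
    total = len(sequence)
    # sum of -p * log2(p) over the letters that occur in the sequence
    return -sum((c / total) * log2(c / total) for c in counts.values())

print(round(shannon_entropy("ATGATATATATGAGGCCCC"), 5))
```

A sequence made of a single repeated letter has entropy 0, whereas a sequence with the four bases in equal amounts reaches the maximum of 2 bits.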
EXERCISE: CHARGAFF’S SECOND RULE
Chargaff plots
Chargaff’s second parity rule states that, in a single strand of DNA in any organism, the amount of Guanine is "almost"
equal to that of Cytosine and the amount of Adenine is "almost" equal to that of Thymine. Let’s create a random genome
and then calculate the average of the ratios A/T and C/G in order to test the hypothesis that this average is close
to 1. We’ll do it for all the sliding windows of different lengths through our random genome.
CODE
import random
import matplotlib.pyplot as plt
# Using the function "random.seed" you can always obtain the same random sequence.
# random.seed(444)
# The number '444' is arbitrary. Change it if you want.
# It calculates the mean of A/T and C/G for each sliding window
def charg_window(window_length):
    x = []
    dna = randomDNA(length)
    if(window_length>len(dna)):
        print("Window size exceeds DNA length: Execution Aborted!")
    else:
        for i in range(len(dna)):
            window = dna[i:i+window_length]
            if(len(window)==window_length):
                x.append(charg_calc(window))
            else:
                break
    return x
So what’s the meaning of this code? We just wanted to create a frequency distribution of a value calculated as the mean
of the ratios A/T and C/G. Why? Because if Chargaff was right, it should always be almost equal to 1. As you can see,
when we use a sliding window of 50 it’s not like that, but when we start to use windows greater than ’1000’ in length
this is the case. At the beginning of the code, we used the function "random.seed". It makes the random generator
produce the same sequence every time we use a specific number. For example, if we use the number ’444’ we will always
obtain the same sequence, whereas if we use the seed ’1234’ we will always obtain another one. Of course, instead of
using random sequences you can use real sequences or genomes from different organisms. Be careful: Python is pretty
slow, so if you use big genomes it’s going to take ages.
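The sliding-window calculation can be tried without the plotting part. The sketch below is a simplified reconstruction: randomDNA and charg_calc are written here under the description in the text and may differ from the original ones, charg_window takes the sequence as a parameter instead of creating it, and windows without any T or G are skipped (an assumption made to avoid a division by zero).

```python
import random

def randomDNA(length):
    return "".join(random.choice("ATGC") for _ in range(length))

def charg_calc(window):
    """Mean of the ratios A/T and C/G in one window (None if T or G is absent)."""
    a, t = window.count("A"), window.count("T")
    c, g = window.count("C"), window.count("G")
    if t == 0 or g == 0:
        return None
    return (a / t + c / g) / 2

def charg_window(dna, window_length):
    values = []
    # one window for each start position that still fits in the sequence
    for i in range(len(dna) - window_length + 1):
        value = charg_calc(dna[i:i + window_length])
        if value is not None:
            values.append(value)
    return values

random.seed(444)   # the same seed always gives the same random sequence
dna = randomDNA(2000)
for window_length in (50, 1000):
    values = charg_window(dna, window_length)
    print(window_length, sum(values) / len(values))
```

The two printed averages show how much the mean ratio depends on the size of the window for this particular random sequence.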
EXERCISE: BROWSING THE WEB
Suppose we have a list of Homo sapiens genes and we want to know if they are the same in the Neanderthal genome.
Well... we can do it manually, visiting the web page http://neandertal.ensemblgenomes.org/index.html, pasting
the name of each gene in the text box, clicking on the result link and downloading the table of the mutations. That’s
cool, but what if you have to do it for ’19,000’ genes? Let’s say we have a txt file containing all the Homo sapiens
genes alphabetically ordered in one column. Well, we can use the script below, which uses Selenium, in order to get all
the "Non-synonymous" mutations for each gene when comparing Homo sapiens and Homo neanderthalensis:
CODE
import os
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import TimeoutException
neandertal = []
# Let's read the file that contains the name of the genes
with open('/Users/cristian/Desktop/genes.txt') as f:
    names = f.read().splitlines()
for i in names:
    try: # try to run this block; otherwise report an error and continue
        chromedriver = "/usr/local/Cellar/chromedriver/2.30/bin/chromedriver"
        os.environ["webdriver.chrome.driver"] = chromedriver
        driver = webdriver.Chrome(chromedriver)
        driver.get('http://neandertal.ensemblgenomes.org/index.html')
        element1 = driver.find_element_by_id('q')
        element1.send_keys(i)
        element2 = driver.find_element_by_xpath("//input[@type='submit' and @value='Go']")
        element2.click()
        protein_link = driver.find_element_by_partial_link_text('ENSP')
        protein_link.click()
        variation_link = driver.find_element_by_partial_link_text('Variations')
        variation_link.click()
        wait = WebDriverWait(driver, 10)
        table1 = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, \
            "#main > div:nth-child(2) > div.content > div > table"))).text
thefile.close()
OUTPUT
A2M: 11
ABL1: 23
ADCY5: 1
AGPAT2: 4
AGTR1: 9
AIFM1: 1
AKT1: 6
APEX1: 8
APOC3: 0
APOE: 18
APP: 21
APTX: 4
AR: 8
ARHGAP1: 1
ARNTL: 0
ATF2: 1
.
.
.
The most difficult part of the code is the part that tries to find an element in the HTML code of the web
page in order to insert the name of the genes and click the button link. There are many ways to do it, see
http://selenium-python.readthedocs.io/locating-elements.html for more info. In our case, we used
driver.find_element_by_partial_link_text and driver.find_element_by_xpath, but you can use the method that you prefer
the most. You have to practice a bit in order to understand the way Selenium works. At the beginning the script goes
to the web page, then inserts the name of the first gene of the file, then clicks on the protein name and finally shows
and writes to a file called "neandertal.txt" the number of "Non-synonymous" mutations for that specific gene. As you
can see, the code uses the statements "try" and "except", which allow the programmer to run a program that, in case of
errors, does not stop but shows the name of the error or performs a particular task. That’s one of the most difficult
but nicest pieces of code you’ll ever see in your entire life. Am I right? No? Ok. Bye.
Index
A glob module, 27
addition function, 5 H
align.globalxx function, 24
align.localxx function, 24 Homo neanderthalensis, 48
Anaconda, 4 Homo sapiens, 48
anagrams, 19
arguments of a function, 14 I
attribute of an object, 29
if, else and elif statements, 7
attributes, 29
indentation, 7
Axes3D module, 43
inheritance, 30
B input function, 26
install modules, 24
BioPython module, 24 installing modules through pip, 13
instances, 29
C itertools module, 19
Chargaff rules, 46 f J
choiche, function, 36
classes and objects, 29 join function, 12
collections module, 18
combinations, 19 K
command line arguments, 26
comment symbol, 5 k-mers, 39, 44
concatenation function, 6
L
constructor, 29
Counter, 18 lambda functions, 16
Counter module, 45 len function, 12
Linux, 4
D
list comprehensions, 16
dataframe, 21 lists, 10
DataFrame function, 21 log function, 5
degrees function, 5 lstrip, function, 38
del function, 18
dictionaries, 10 M
division function, 5 Mac, 4
driver.find_element function, 49 map function, 17
E match function, 15
math module, 5
Editor, 4 Matplotlib module, 23, 41
matrix, 20
F methods, 29
multiplication function, 5
find function, 12
findall function, 15 N
finditer function, 15
for loops, 8 numpy module, 20
functions, 14
O
G
os module, 27
glob function, 27 os.path module, 27
os.path.exists function, 27 V
P variables, 6
pairwise2 module, 24 W
pandas module, 21, 41
permutations, 19 while loops, 9, 38
power function, 5 Windows, 4
print function, 4 write files, 13
print_function function, 8 Z
R zip function, 17
radians function, 5
rand.randint, 21
randint function, 23
randn function, 23
random module, 36
random.seed function, 47
range function, 8
raw_input function, 26
re module, 18
read files, 13
readlines function, 13
reduce function, 17
regular expressions, 15
replace function, 13, 25
search function, 15
seed function, 23
Selenium module, 28, 48 f
self, 29
SeqIO module, 24
Series function, 21
set, 17
Shannon entropy, 45
shutil module, 27
sin function, 5
split function, 12
splitlines function, 25
Spyder, 4
sqrt function, 5
str function, 6
string indexes, 11
strings, 6
strings selection, 11
strings slicing, 11
subtraction function, 5
sys module, 26
terminal, 4, 26
time module, 35
type function, 6