
M.Tech Final Year Project


(Phase - 3)

Submitted to: Varshapriya J.N (Computer & IT department)

Submitted by: Anshul Vyas (M.Tech 2nd Year)

Target

Performance evaluation of TopHat.

Using the OpenStack cloud environment.

Using a black-box technique with different algorithms.

Next Target

Study all available alignment algorithms.

Study the algorithm on which TopHat works.

Tweak the mapper to evaluate and improve its performance.

Challenging keywords

RNA sequence

Alignment

Transcript/Transcriptome

Splice junction

Intron/Exon

Coding/non-coding part

RNA Sequence

RNA-seq (RNA sequencing), also called whole transcriptome
shotgun sequencing (WTSS), is used
to reveal the presence and quantity of RNA in a biological sample
at a given moment in time.

RNA-Seq is used to analyze the continually changing transcriptome, including:

Spliced transcripts,

Post-transcriptional modifications,

Gene fusion,

RNA-Seq can also be used to determine exon/intron boundaries and
verify or amend previously annotated 5′ and 3′ gene boundaries.

Alignment

In bioinformatics, a sequence alignment is a way of arranging
the sequences of DNA, RNA, or protein to identify regions of
similarity that may be a consequence of functional, structural,
or evolutionary relationships between the sequences.
Aligned sequences of nucleotide or amino acid residues are
typically represented as rows within a matrix. Gaps are
inserted between the residues so that identical or similar
characters are aligned in successive columns. Sequence
alignments are also used for non-biological sequences, such
as calculating the edit distance cost between strings in a
natural language or in financial data.
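
As a small illustration of the edit distance mentioned above, the following is a minimal sketch of the classic Levenshtein dynamic program. It is not the scoring scheme used by TopHat or any particular aligner; the function name `edit_distance` and the unit cost for every insertion, deletion, and substitution are assumptions chosen for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn `a` into `b`.
    Unit costs are an assumption made for this sketch."""
    # previous[j] is the distance between the processed prefix of `a` and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    # Two short sequences that differ at two positions.
    print(edit_distance("GATTACA", "GACTATA"))
```

Real aligners usually replace the unit costs with a substitution matrix and separate gap-open and gap-extend penalties, but the table-filling idea is the same.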

Transcriptome

The sum total of all the messenger RNA
molecules expressed from the genes of an
organism.

Exon and Intron

Sequence Editor generates FASTQ.

Splice Junction

In molecular biology, splicing is the editing of
the precursor messenger RNA (pre-mRNA) transcript, in
which introns are removed and exons are
joined together (ligated).
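
The removal of introns and joining of exons can be sketched on a plain string. This is only a toy model: the function name `splice`, the 0-based half-open intron coordinates, and the example sequence are assumptions made for illustration, not real annotation data.

```python
from typing import List, Tuple


def splice(pre_mrna: str, introns: List[Tuple[int, int]]) -> str:
    """Remove the given introns and join the remaining exons.

    Each intron is a 0-based, half-open interval (start, end);
    the intervals are assumed not to overlap.
    """
    exons = []
    position = 0
    for start, end in sorted(introns):
        exons.append(pre_mrna[position:start])  # exon before this intron
        position = end                          # skip over the intron
    exons.append(pre_mrna[position:])           # final exon
    return "".join(exons)


if __name__ == "__main__":
    # Hypothetical transcript: exon, intron, exon.
    exon_1, intron, exon_2 = "AUGGCC", "GUAAGUUUCAG", "GAUUAA"
    transcript = exon_1 + intron + exon_2
    intron_interval = (len(exon_1), len(exon_1) + len(intron))
    print(splice(transcript, [intron_interval]))
```

The position where two exons meet in the spliced output is the splice junction; a read that spans it cannot be aligned contiguously to the genome, which is why spliced aligners treat such reads specially.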

Splicing

Refer to the RNA-seq paper presentation.

All Available Algorithms


1. Hash-based algorithms (Mosaik, SwiftWat, AGILE (AliGnIng Long Reads))
2. Prefix/suffix tree-based algorithms
3. Merge sort-based algorithms
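
To make the hash-based family concrete, here is a minimal sketch of a k-mer hash index with seed lookup and exact verification. It does not reproduce any of the tools listed above; the names `build_kmer_index` and `map_read`, the seed length `k`, and the exact-match verification step are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List


def build_kmer_index(reference: str, k: int) -> Dict[str, List[int]]:
    """Hash every k-mer of the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for start in range(len(reference) - k + 1):
        index[reference[start:start + k]].append(start)
    return index


def map_read(read: str, reference: str, index: Dict[str, List[int]], k: int) -> List[int]:
    """Look up the read's first k-mer as a seed, then keep only the
    candidate positions where the whole read matches exactly."""
    seed = read[:k]
    hits = []
    for position in index.get(seed, []):
        if reference[position:position + len(read)] == read:
            hits.append(position)
    return hits


if __name__ == "__main__":
    reference = "ACGTACGTTAGC"  # hypothetical reference sequence
    index = build_kmer_index(reference, k=4)
    print(map_read("ACGTTAGC", reference, index, k=4))
```

Production hash-based mappers typically use several seeds per read and tolerate mismatches during verification; this sketch keeps a single seed and exact matching to stay short.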

Current limitations:

High mapping error rates,

Low mapping speed,

Read length limitation

Study TopHat

It is open source.

Created by Johns Hopkins University.

It is not a single program.

It is a combination of different bioinformatics
tools.

FASTQ Format
@Read_id_1
CTGATGTGCCGCCTCACTTCGGTGGT
+
@@@DDDDDH8<BAHG@BHGIHIII>(
@Read_id_2
TGATGTGCCGCCTCACTACGGTGGTG
+
FHHHHHJIJIJIJIIIJJIIJGIGII
@Read_id_3
...

The four lines are:

The name/ID of the read, preceded by a "@". For read pairs, there will be two entries with that name, either in the same or a
second FASTQ file.

The sequence of the read.

A "+" sign. In very old FASTQ files, this is followed by the read name from the first line. Today, this line is present for historical
reasons and backwards compatibility only.

The quality scores of the bases from line 2. The scores are generated by the sequencing machine, and encoded as ASCII
(33+score) characters. The line should have the same length as line 2, as there is one quality score per base.
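
The four-line layout and the ASCII (33+score) encoding described above can be sketched as a small parser. This is an illustrative reader, not a replacement for a tested FASTQ library; the function name `parse_fastq` is an assumption, and the sample records are the two reads shown above.

```python
from typing import Iterator, List, Tuple


def parse_fastq(text: str) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, quality_scores) for each four-line record.

    Quality characters are decoded as ord(character) - 33.
    Blank lines are ignored; records are assumed to span exactly four lines.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    for i in range(0, len(lines) - 3, 4):
        header, sequence, plus, quality = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        scores = [ord(character) - 33 for character in quality]
        yield header[1:], sequence, scores


SAMPLE = """@Read_id_1
CTGATGTGCCGCCTCACTTCGGTGGT
+
@@@DDDDDH8<BAHG@BHGIHIII>(
@Read_id_2
TGATGTGCCGCCTCACTACGGTGGTG
+
FHHHHHJIJIJIJIIIJJIIJGIGII
"""

if __name__ == "__main__":
    for read_id, sequence, scores in parse_fastq(SAMPLE):
        print(read_id, len(sequence), min(scores), max(scores))
```

Reading in fixed groups of four lines matters because a quality line may itself begin with "@", as the first record shows, so the "@" character alone cannot be used to find record boundaries.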

BlackBox Architecture
[Diagram: FASTQ-1 and FASTQ-2 are fed into a black box (?), which outputs the mapped sequence and the unmapped sequence.]

Set up TopHat on the machine


1. Take FASTQ as input.
2. Process it for mapping.
3. Use different algorithms for mapping.
4. Evaluate the result.
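
The four steps above can be sketched as a small black-box harness that accepts any mapping function, splits the reads into mapped and unmapped, and records the wall-clock time. Everything here is hypothetical scaffolding: the `Mapper` signature, the `exact_match_mapper` stand-in, and the `evaluate` function are assumptions for illustration and do not call TopHat itself.

```python
import time
from typing import Callable, Dict, List, Optional

# A mapper takes (read, reference) and returns a position, or None if unmapped.
Mapper = Callable[[str, str], Optional[int]]


def exact_match_mapper(read: str, reference: str) -> Optional[int]:
    """Stand-in mapper: first exact occurrence of the read in the reference."""
    position = reference.find(read)
    return position if position >= 0 else None


def evaluate(mapper: Mapper, reads: List[str], reference: str) -> Dict[str, object]:
    """Run the mapper as a black box and report its results and elapsed time."""
    start = time.perf_counter()
    mapped: Dict[str, int] = {}
    unmapped: List[str] = []
    for read in reads:
        position = mapper(read, reference)
        if position is None:
            unmapped.append(read)
        else:
            mapped[read] = position
    elapsed = time.perf_counter() - start
    return {"mapped": mapped, "unmapped": unmapped, "seconds": elapsed}


if __name__ == "__main__":
    reference = "ACGTACGTTAGC"  # hypothetical reference sequence
    result = evaluate(exact_match_mapper, ["CGTT", "TTTT"], reference)
    print(result["mapped"], result["unmapped"])
```

Because the harness only depends on the `Mapper` signature, a different algorithm can be swapped in and timed on the same reads without changing the evaluation code.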

Result after Implementing BWT

Default TopHat algorithm: 8 seconds for rice RNA.

After replacing it with the BWT algorithm:
it took 7 seconds.
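
For readers unfamiliar with the BWT (Burrows-Wheeler transform), the following is a minimal sketch of the transform and its inverse using sorted rotations. It is a textbook construction for short strings, not the implementation that was benchmarked above; the function names and the "$" terminator are assumptions for the example.

```python
def bwt(text: str, terminator: str = "$") -> str:
    """Burrows-Wheeler transform: last column of the sorted rotations.

    The terminator must not occur in the text and must sort before
    every other character ("$" does for letters)."""
    if terminator in text:
        raise ValueError("terminator must not occur in the text")
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed: str, terminator: str = "$") -> str:
    """Invert the transform by repeatedly prepending the BWT column and sorting.

    This naive method is quadratic in memory and is only meant for short inputs."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(char + row for char, row in zip(transformed, table))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed)               # annb$aa
    print(inverse_bwt(transformed))  # banana
```

Real mappers build the transform from a suffix array rather than from explicit rotations, since sorting full rotations is too slow and memory-hungry for genome-sized references.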

Next tasks

On the basis of algorithm: analyze the performance of the
mapper with
1) BWT-SW
2) FM index
3) GFM index
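
As background for the FM index item, here is a minimal sketch of an FM index with backward search that counts exact occurrences of a pattern. It stores a full occurrence table for simplicity, whereas practical indexes sample it to save memory; the names `build_fm_index` and `count_occurrences` and the "$" terminator are assumptions for the example.

```python
from collections import Counter
from typing import Dict, List, Tuple

FMIndex = Tuple[Dict[str, int], Dict[str, List[int]], int]


def build_fm_index(text: str, terminator: str = "$") -> FMIndex:
    """Build the two FM-index tables from the BWT of text + terminator."""
    if terminator in text:
        raise ValueError("terminator must not occur in the text")
    s = text + terminator
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:])
    # The BWT character is the one preceding each sorted suffix (wraps for i == 0).
    bwt = "".join(s[i - 1] for i in suffix_array)
    # first[c]: number of characters in s that sort strictly before c.
    counts = Counter(s)
    first, total = {}, 0
    for char in sorted(counts):
        first[char] = total
        total += counts[char]
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {char: [0] * (len(bwt) + 1) for char in counts}
    for i, bwt_char in enumerate(bwt):
        for char in occ:
            occ[char][i + 1] = occ[char][i] + (1 if char == bwt_char else 0)
    return first, occ, len(s)


def count_occurrences(pattern: str, index: FMIndex) -> int:
    """Backward search: narrow a half-open suffix-array range one character
    at a time, starting from the last character of the pattern."""
    first, occ, length = index
    if not pattern:
        return 0
    low, high = 0, length
    for char in reversed(pattern):
        if char not in first:
            return 0
        low = first[char] + occ[char][low]
        high = first[char] + occ[char][high]
        if low >= high:
            return 0
    return high - low


if __name__ == "__main__":
    index = build_fm_index("ACGTACGTTAGC")  # hypothetical reference sequence
    print(count_occurrences("ACGT", index))
```

The search cost depends on the pattern length rather than the reference length, which is the property that makes this family of indexes attractive for read mapping.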

Next tasks

On the basis of the length of the RNA sequence:


1) Humans and animals
2) Different organic crops
3) Change the base length (bp)

Thank You
