
GENE MINING

 1. DEFINITION:- Gene mining is the process of exploiting the
deoxyribonucleic acid (DNA) sequence of one genotype to isolate useful genes
from related genotypes.

HISTORICAL BACKGROUND

EVENTS YEAR

Sequencing of first plant genome (Arabidopsis thaliana) 2000

COMPLETION OF HUMAN GENOME PROJECT 2003

FIRST WORK ON GENE MINING 2003

SOME ORGANISMS WHOSE GENOMES HAVE BEEN COMPLETELY SEQUENCED

EUKARYOTES (Mb)                     PROKARYOTES (Mb)

Arabidopsis thaliana (114.5)        Bacillus subtilis (4.20)
Saccharomyces cerevisiae (12)       Haemophilus influenzae (1.83)
Oryza sativa (466)                  Escherichia coli (4.6)
Homo sapiens                        Vibrio cholerae (4.0)
Drosophila melanogaster (120)       Mycobacterium tuberculosis (4.40)
Plasmodium falciparum (23)          Treponema pallidum (1.14)

PROGRAMMES / SOFTWARE / DATABASES USED FOR GENE MINING

BLAST


 2. INTRODUCTION:-

CENTRAL DOGMA:-


WHERE IS GENE LOCATED?

INTRODUCTION TO GENE MINING:-

The principal reason for gene mining is to identify and isolate genes that are
characterised as conferring essential traits. The widespread use and availability of
molecular biological techniques have allowed for the rapid development and
identification of nucleic acid derived sequences. With laboratory equipment now
integrated with advanced computer software, researchers are able to conduct
advanced quantitative analyses and database comparisons and to run computational
algorithms that seek and identify gene sequences. Genetic databases for organisms
such as Escherichia coli, Haemophilus influenzae, Mycoplasma genitalium, and
Mycoplasma pneumoniae, to name a few, are publicly available. These
biological databases store searchable information from which biological
information may be retrieved. This work illustrates the exploitation of publicly
available sequence databases on the Internet for the identification of useful genes.
The Internet is readily accessible to scientists worldwide. The main resources used
are GenBank at the National Center for Biotechnology Information
(http://www.ncbi.nlm.nih.gov/) and the A. thaliana TC database at TIGR
(http://www.tigr.org/tdb/tgi/agi/). This work was carried out using an Internet
connection, a DOS-based sequence analysis software package, and the facilities
of a basic molecular biology laboratory for PCR verification. It is recommended to
carry out this type of sequence analysis in a Windows or Apple Macintosh
environment, since these are the most readily available platforms and software
packages for sequence analysis are now available in the public domain.
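
The database comparison described in this introduction can be illustrated with a small local-alignment search. The sketch below is only an illustration and is not the software used in this work: it scores a query against a toy in-memory "database" with the Smith-Waterman algorithm, whereas BLAST is a faster heuristic for the same kind of local similarity. The record names, sequences and scoring values (match, mismatch, gap) are assumptions made up for the example.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def rank_hits(query, database):
    """Rank database records by local alignment score, best first."""
    scored = [(smith_waterman_score(query, seq), name) for name, seq in database.items()]
    return sorted(scored, reverse=True)


# Hypothetical toy database; real searches run against GenBank-scale collections.
toy_db = {
    "record_A": "TTGACGTACGTTAG",
    "record_B": "CCCCCCCCCC",
    "record_C": "ACGTTCGT",
}
hits = rank_hits("ACGTACGT", toy_db)
```

Here record_A contains the query exactly and therefore ranks first. In practice the top-ranked records would be the candidates carried forward to PCR verification.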

 3. VARIOUS METHODS ADOPTED FOR GENE MINING:


1. DNA Extraction and PCR-based Gene Mining

2. Data Mining

3. Using Genetic Algorithm

4. Peptide mass fingerprinting

5. From Biomedical Literature: weighing protein-protein interactions and connectivity

6. With the help of GenoWatch

7. ORIEL

8. DNA chip analysis (Microarray)


1. DNA Extraction and PCR BASED Gene mining

 Plant Material

The germplasm used for allele mining is leaf material of the
genotypes concerned, collected from Genetic Resources Centers.

 DNA Extraction

Total genomic DNA is isolated from fresh green leaves (approx. 5 g) according to
the methodology of Dellaporta et al. with minor modifications. The quality and
quantity of the extracted DNA are confirmed to be consistent both
spectrophotometrically and by running the extracted DNA on 1.0% agarose gels
stained with ethidium bromide.

 PCR Analysis

PCR amplification of genomic DNA was carried out using gene-specific primers.
The PCR amplification consisted of a total of 40 cycles of melting (94°C for 1
min), annealing (55°C for 1 min 15 s), and elongation (72°C for 3 min 5 s). The
PCR-amplified products were electrophoresed in 1.4% agarose gel in 1X Tris-
acetate-ethylenediaminetetraacetic acid (TAE) buffer. The gels were
photographed under an ultraviolet transilluminator.
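
As a quick check on the cycling conditions above, the arithmetic for the programmed run time can be worked out as follows. This is a minimal sketch that counts only the three hold steps stated in the text. It assumes no initial denaturation, final extension or ramp time, since none is given.

```python
# Cycling conditions as stated in the text (durations in seconds).
steps = {
    "melting (94 C)": 60,       # 1 min
    "annealing (55 C)": 75,     # 1 min 15 s
    "elongation (72 C)": 185,   # 3 min 5 s
}
cycles = 40

seconds_per_cycle = sum(steps.values())
total_seconds = cycles * seconds_per_cycle

hours, remainder = divmod(total_seconds, 3600)
minutes, seconds = divmod(remainder, 60)
print(f"{seconds_per_cycle} s per cycle, {hours} h {minutes} min {seconds} s in total")
```

Each cycle holds for 320 s, so 40 cycles take 12,800 s, or about 3 h 33 min of hold time.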


5' 3' AND NC PRIMERS USED FOR ALLELE MINING

Gene-specific primers amplify the DNA of each accession, and the amplified
product represents either the entire allele or some functional component of the
allele, such as the promoter or the coding sequences.
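
The idea of a primer pair delimiting the amplified product can be sketched as a simple "in silico PCR". The example below is illustrative only: it assumes exact primer matches with no mismatch tolerance, and the template and primer sequences are hypothetical. A returned list with more than one entry would indicate that the primer pair amplifies more than one locus.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs[base] for base in reversed(seq))


def find_all(template, site):
    """Return every start position at which site occurs in template."""
    positions, start = [], template.find(site)
    while start != -1:
        positions.append(start)
        start = template.find(site, start + 1)
    return positions


def in_silico_pcr(template, forward, reverse):
    """Return all predicted amplicons for an exact-match primer pair.

    The forward primer matches the top strand directly; the reverse primer
    anneals where the top strand carries its reverse complement.
    """
    reverse_site = reverse_complement(reverse)
    amplicons = []
    for f in find_all(template, forward):
        for r in find_all(template, reverse_site):
            if r >= f + len(forward):  # reverse site must lie downstream
                amplicons.append(template[f : r + len(reverse_site)])
    return amplicons


# Hypothetical template and primers, chosen only to keep the example readable.
template = "GGATCCATGAAACCCGGGTTTTAAGAATTC"
products = in_silico_pcr(template, forward="ATGAAACC", reverse="TTAAAACCC")
```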


PROBLEMS

 Amplification of more than one gene (lack of specificity)

 Failure to amplify alleles in distantly related genera.


2. DATA Mining

Definitions of DATA Mining

Data mining is mainly about extracting information and knowledge
from text.

Two definitions:

 Any operation related to gathering and analyzing text from external sources for
business intelligence purposes;

 Discovery of knowledge previously unknown to the user in data or text.

Data mining is the process of compiling, organizing, and analyzing large document
collections to support the delivery of targeted types of information to analysts and
decision makers and to discover relationships between related facts that span wide
domains of inquiry.

Data Mining PROBLEMS

 Data mining systems induce knowledge from datasets which are huge, noisy
(incorrect), incomplete, inconsistent, imprecise (fuzzy), and uncertain.

 The problem is that existing systems use a limiting attribute value language for
representing the training examples and induced knowledge.

 Furthermore, some important patterns are ignored because they are statistically
insignificant.


3. Using Genetic Algorithm

The rapid growth of data available in digital format increases the need for methods
to analyze them. Research on topics such as text classification, information
retrieval and automatic text summarization has therefore become an important field.

Researchers in Knowledge Discovery in Databases (KDD) have provided new
tools for analyzing and accessing data in databases. Some of them are based on
term frequency and are used in text processing. The goal of an automatic text
summarization system is to generate a summary of the original text that allows
users to obtain the main pieces of information available in that text, but with a
much shorter reading time. In addition, an important data preprocessing task for
effective classification is attribute selection, which consists of selecting the
most relevant attributes for classification purposes.

4. PEPTIDE MASS FINGERPRINTING

Public protein sequence databases such as SWISS-PROT are used in practice for
protein identification from Matrix-Assisted Laser Desorption
Ionization-Time Of Flight (MALDI-TOF) data, one of the popular techniques in
proteomic studies. However, because these databases hold little protein information
for specific plant species, a private protein database must be constructed that
contains sufficient protein information for interpreting large numbers of peptide
mass fingerprinting (PMF) results for each specific plant species. Thus we tried to
build the protein database by translating the large number of coding region sequences
obtained from EST analysis, and the PMF software that works on this database.
In this study, therefore, we first tried to build individual systems for EST-based
data analysis, regulatory motif information from chromosomal mapping of ESTs,
microarray data and PMF information from bench work, and finally to integrate
these individual systems.
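
The core of peptide mass fingerprinting, comparing observed peptide masses with masses predicted from database sequences, can be sketched as follows. This is a minimal illustration under stated assumptions, not the PMF software described above: it assumes trypsin digestion with no missed cleavages, unmodified residues, and neutral monoisotopic masses (real MALDI-TOF peaks are usually singly protonated ions). The example sequence and the mass tolerance are hypothetical.

```python
# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the terminal H and OH


def tryptic_digest(protein):
    """Cleave after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start : i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER


def count_matches(observed, protein, tolerance=0.2):
    """Count observed masses that fall within tolerance of a predicted peptide."""
    theoretical = [peptide_mass(p) for p in tryptic_digest(protein)]
    return sum(any(abs(m - t) <= tolerance for t in theoretical) for m in observed)


# Hypothetical translated EST sequence and observed masses.
matches = count_matches([291.12, 640.40, 1000.0], "AGKPLRSW")
```

Ranking every translated sequence in a private database by `count_matches` gives a simple candidate list. Production tools use more elaborate scoring.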


5. From Biomedical Literature: weighing protein-protein interactions and
connectivity

An initial set of genes and proteins is obtained from gene-disease relationships
extracted from PubMed abstracts using natural language processing. Interactions
involving the corresponding proteins are similarly extracted and integrated with
interactions from curated databases (such as BIND and DIP), assigning a
confidence measure to each interaction depending on its source. The augmented
list of genes and gene products is then ranked combining two scores: one that
reflects the strength of the relationship with the initial set of genes and incorporates
user-defined weights and another that reflects the importance of the gene in
maintaining the connectivity of the network. We can apply the method to
atherosclerosis to assess its effectiveness.

The method can be summarized as follows:

 1. Obtain a list of genes or gene products known to be involved with the target
disease from the CBioC[5] database.

 2. Apply heuristics to unify variants of extracted names, and use HUGO [8] to
normalize both the set obtained in the previous step and the names stored in
CBioC. This will be referred to as the initial set.

 3. Apply nearest-neighbor expansion to the initial set to build a protein interaction
network using data from the CBioC database and curated databases. Analyze the
connectivity of the network. The genes and proteins in this network (derived from
the interactions) form the extended set.

4. Apply a heuristic scoring formula to the extended set to predict the proteins most
likely related to the disease.
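
Steps 3 and 4 can be sketched in a few lines. The text does not give the scoring formula, so the one below is an assumption made purely for illustration: the "relationship" score is the summed confidence of a candidate's interactions with the initial set, and node degree stands in for the candidate's importance to network connectivity. The gene names, confidence values and the `weight` parameter are all hypothetical.

```python
from collections import defaultdict

# Hypothetical interactions: (protein_a, protein_b, confidence in [0, 1]).
interactions = [
    ("G1", "P1", 0.9), ("G2", "P1", 0.8), ("G1", "P2", 0.4),
    ("P1", "P3", 0.7), ("P2", "P3", 0.5),
]
initial_set = {"G1", "G2"}


def build_network(edges):
    """Build an undirected network as {node: {neighbor: confidence}}."""
    network = defaultdict(dict)
    for a, b, conf in edges:
        network[a][b] = conf
        network[b][a] = conf
    return network


def expand(network, seeds):
    """Nearest-neighbor expansion: the seeds plus their direct interactors."""
    extended = set(seeds)
    for seed in seeds:
        extended.update(network[seed])
    return extended


def rank_candidates(network, seeds, weight=0.5):
    """Rank non-seed members of the extended set by a combined score."""
    extended = expand(network, seeds)
    max_degree = max(len(network[node]) for node in extended)
    scores = {}
    for node in extended - set(seeds):
        to_seeds = [c for neighbor, c in network[node].items() if neighbor in seeds]
        relation = sum(to_seeds) / len(seeds)          # strength of link to initial set
        connectivity = len(network[node]) / max_degree  # degree as a connectivity proxy
        scores[node] = weight * relation + (1 - weight) * connectivity
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_candidates(build_network(interactions), initial_set)
```

With these toy values P1 ranks above P2, because it interacts strongly with both seed genes and has the highest degree. P3 is not scored because it is two steps away from the initial set.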


6. GenoWatch: a disease gene mining browser for association study.

A human gene association study often involves several genomic markers such as
single nucleotide polymorphisms (SNPs) or short tandem repeat polymorphisms,
and many statistically significant markers may be identified during the study.
GenoWatch can efficiently extract up-to-date information about multiple markers
and their associated genes in batch mode from many relevant biological databases
in real-time. The comprehensive gene information retrieved includes gene
ontology, function, pathway, disease, related articles in PubMed and so on.
Subsequent SNP functional impact analysis and primer design of a target gene for
re-sequencing can also be done in a few clicks. The presentation of results has been
carefully designed to be as intuitive as possible to all users. GenoWatch is
available at the website http://genepipe.ngc.sinica.edu.tw/genowatch.

7. ORIEL

 Introduction
The ORIEL Project (Online Research Information Environment for the Life
Sciences) is a European project that will develop tools and procedures to promote
access to and integration of a wide range of information resources in the life
sciences.
The tools developed through ORIEL will:
- enable effective linking of different types of biological information (literature,
factual and multimedia databases);
- make navigation easy, thereby encouraging the creative exploration of the
information landscape;
- facilitate communication by making data presentation and information
visualisation user-friendly.


 Project Description
Aims

The ORIEL Project (IST-2001-32688), funded by the EU and coordinated by the
European Molecular Biology Organisation (EMBO), aims to provide research
communities with tools to manage large, complex, multimedia datasets and to
navigate through an increasingly intricate and potentially confusing information
landscape.

 Methodologies

ORIEL's methodologies will be tested and applied in a critical user
environment represented by the EU-funded E-BioSci platform (QLRI-CT-2001-30266).

 Developments

Methods leading to the creation of new concepts of the scientific literature, based
on machine-understandable documents.
Technologies permitting effective linking of a wide range of biological digital
information sources, including molecular, genomic and multi-dimensional image
databases, promoting ease of cross-database navigation and leading to creative
exploration of the information landscape.
Protocols facilitating effective data representation and information
visualisation through the construction of adaptive interfaces that meet the needs of
individual users.

 Background

The emerging fields of genomics and bio-informatics are having far-reaching
effects on all aspects of the Life Sciences. Additionally, biotechnology and
biomedicine, which will benefit enormously from new genome-based technologies,
have become important growth areas in the European life sciences industry.
Genomics research is characterized by the production of vast amounts of raw and
derived data. The integration of the exponentially growing amounts of these and
associated biological information in digital form (publications, sequence and
sequence-related information, digital image data) is presenting one of the most
demanding current challenges to information technology. There is an urgent need
to better exploit the potential of the Internet and other communication networks to
develop novel technology and intelligent middleware for the integration of large,
complex and disparate information resources.

 Objectives

The ORIEL project will explore and further develop methods, technologies and
protocols aimed at the integration, dissemination and exploitation of large,
complex and disparate digital information resources. With a view to making such
technologies widely available, it will focus on the Life Sciences as a data-intensive
and highly demanding testbed that will:
- permit effective linking of different types of biological information displaying
complex inter-relationships (literature, factual and multi-media image databases)
- promote ease of navigation leading to creative exploration of the information
landscape and facilitate user-friendly data presentation and information
visualisation.

Milestones

The development of new concepts that will enhance the efficiency of integration of
different types of biological data currently maintained in a wide spectrum of digital
collections and resources across Europe.

The development and optimization of interactive and adaptive user interfaces to
promote intelligent access to, retrieval and analysis of data stored in digital form.

8. DNA chip analysis (Microarray)

This is a recently developed technique for the analysis of gene expression and has
the following features.

• The expression of many genes can be investigated at the same time (i.e. in one
experiment)

• This requires the availability of many cloned genes

• Allows the elucidation of complex responses

• Based on two RNA samples, a control and a sample of interest (e.g. heat stressed/
mutant)
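
The comparison of a control with a sample of interest is usually summarized per gene as a log ratio. The sketch below illustrates that calculation only. The gene names and intensity values are hypothetical, the intensities are assumed to be already background-corrected and normalized, and the twofold cutoff is a commonly used convention rather than something prescribed here.

```python
import math


def log2_ratios(control, sample):
    """Return log2(sample / control) for every gene measured in both samples."""
    return {g: math.log2(sample[g] / control[g]) for g in control if g in sample}


def changed_genes(ratios, fold=2.0):
    """Split genes into up- and down-regulated sets at the given fold change."""
    cutoff = math.log2(fold)
    up = sorted(g for g, r in ratios.items() if r >= cutoff)
    down = sorted(g for g, r in ratios.items() if r <= -cutoff)
    return up, down


# Hypothetical spot intensities for a control and a heat-stressed sample.
control = {"geneA": 100.0, "geneB": 400.0, "geneC": 250.0}
heat_stressed = {"geneA": 800.0, "geneB": 100.0, "geneC": 300.0}

ratios = log2_ratios(control, heat_stressed)
up, down = changed_genes(ratios)
```

In this toy data geneA is eightfold up, geneB is fourfold down, and geneC changes by less than twofold and is reported in neither list.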

Limitations:-

• High tech.

• Expensive: requires fancy equipment and expensive reagents

• Analysis not straightforward and still under development

• Available at Purdue at the Genome Center in the basement of Whistler.


INSTRUMENT USED OUTPUT OF SCANNER


 4. APPLICATIONS OF GENE MINING

1. Allele Mining for Stress Tolerance Genes in Oryza Species and Related
Germplasm.

2. Allele mining and sequence diversity at the wheat powdery mildew resistance
locus Pm3.

3. Gene-mining the Arabidopsis thaliana genome: applications for biotechnology in
Africa

4. Isolation of Nucleic Acid Molecules Related to Integrin

5. Gene mining strategies of drug discovery

6. Isolation of a Known Gene to Validate System

7. Mining colon tumor-relevant genes

8. Mining molecular signatures for leukemia subtypes

9. Gene mining: classification of biological types

10. Metagenomics ( Uncultivable Microbes & Novel genes)

11. Genomics Industry - Gene Mining Companies

12. Gene mining in African rice germplasm to improve drought resistance in rainfed
production systems for resource-poor farmers of Africa
13. Mining the Epigenome for Methylated Genes in Lung Cancer


1. Allele Mining for Stress Tolerance Genes in Oryza Species and
Related Germplasm.

The international project to sequence the genome of Oryza sativa L. cv. Nipponbare
has made allele mining possible for all genes of rice. Scientists used a rice
calmodulin gene, a rice gene encoding a late embryogenesis-associated protein,
and a salt-inducible rice gene to optimize the polymerase chain reaction (PCR) for
allele mining of stress tolerance genes on identified accessions of rice and related
germplasm. Two sets of PCR primers were designed for each gene. Primers based
on the 5' and 3' untranslated regions of genes were found to be sufficiently
conserved to be effective over the entire range of germplasm in rice for which
the concept of allelism is applicable.

However, the primers based on the adjacent amino (N) and carboxy (C)
termini amplify additional loci. Field-based phenotyping of germplasm identifies
tolerant accessions, biochemical and physiological analysis groups them by
mechanism, and the existing and emerging tools of genomics and proteomics help
to identify key genes or key members of a gene family involved in each
mechanism. The technique of choice for allele mining is
PCR. Gene-specific primers amplify the DNA of each accession, and the amplified
product represents either the entire allele or some functional component of the
allele, such as the promoter or the coding sequences.


HOW TO FIND NOVEL GENE

2. Allele mining and sequence diversity at the wheat powdery mildew
resistance locus Pm3.
The production of wheat is threatened by a constantly changing population of
pathogen races. Considering the capability of many pathogens to overcome genetic
resistance, the identification and implementation of new sources of resistance is
essential. Landraces and wild relatives of wheat have played an important role as
genetic resources for the improvement of disease resistance. Here, we discuss the
allele mining approach to characterize and utilize the naturally occurring resistance
diversity in wheat. This study is a large scale systematic allele mining, including
1320 hexaploid wheat landraces selected on the basis of ecogeographical
parameters favouring growth of powdery mildew. The landraces were infected
with a set of differential powdery mildew isolates, which allowed the selection of
resistant lines. The molecular tools derived from Pm3 haplotype studies were
applied to study the genetic diversity at this locus. From the known Pm3 R alleles,
Pm3b was the only one frequently identified. In the same set, we also found a high
frequency of landraces carrying a susceptible haplotype. This analysis allowed the
identification of candidate resistant lines that were further tested for the presence
of new potentially functional alleles. Based on transient expression assays as well
as Virus Induced Gene Silencing (VIGS), we conclude that we have identified at
least two new functional Pm3 alleles. The new interesting and functional alleles
can be transferred to susceptible but economically important wheat varieties as
single genes or R-gene cassettes to achieve efficient control of mildew. This study
contributes to targeted use of genetic diversity resources for research and breeding.

3. Gene-mining the Arabidopsis thaliana genome: applications for
biotechnology in Africa

Plant science research has reached the post-genome era with the completion
of the genome sequences of both a dicotyledonous (Arabidopsis thaliana) (The
Arabidopsis Genome Initiative 2000) and a monocotyledonous (rice: Oryza sativa)
species (Yu et al. 2002). These genome sequences obtained through publicly
funded research have been made available through the Internet with new sequence
information appearing each day. In addition, large collections of cDNA libraries
derived from different plant tissues or growth conditions have been subjected to
single pass sequencing, often from the 3' ends, to derive expressed sequence tag
(EST) databases (Bennetzen 1999, Quackenbush et al. 2000). All these data
present an opportunity for researchers to enhance studies on non-model crop plants
by identifying homologues in the more tractable model species. This can lead to
design of experiments, such as the study of mutants in the model plant, which can
provide rapid answers to gene function in the crop plant. The plant cell wall, often
containing a matrix of pectic components, is the first line of defence against fungal
pathogens (Esquerre-Tugaye et al. 2000). In addition, pectic fragments broken
down from plant cell walls are elicitors of the plant defense response (Boudart et
al. 1998). Several lines of evidence indicate that the polygalacturonase inhibiting
protein (PGIP), which is associated with the cell walls of many plants, has a role to
play in plant resistance to fungal pathogens (De Lorenzo and Cervone 1997). This
work was initiated to identify a homologue of the gene for PGIP in A. thaliana
since it has relevance as a model system for protein-protein interactions, as well as
practical application in engineering fungal resistance in crop plants (Powell et al.
2000). PGIPs have been identified in a variety of plant species such as bean, pear
and apple (Toubart et al. 1992, Stotz et al. 1993, Arendse et al. 1999). PGIPs are
characterised by their ability to bind to fungal polygalacturonases (PGs) and this
has led to the hypothesis that PGIP plays a role in the plant defence response by
modulating the activity of endo- PGs produced by invading fungi (Cervone et al.
1989). In addition, PGIPs are interesting for protein-protein interaction studies,
since they are made up of leucine rich repeats (LRRs) (De Lorenzo and Cervone
1997). The main model system for studying this process has been the interaction
between the Fusarium moniliforme PG and the bean PGIP (Desiderio et al. 1997).
These studies have given in vitro evidence for this protein-protein interaction and
enabled identification of specific PGIP amino acids in this interaction (Leckie et al.
1999). However, this is a heterologous system employing use of a tobacco
expression system for production of variants of the bean PGIP. Testing of the
hypothesis in vivo has been hampered by lack of a model plant system, which can
be readily transformed and manipulated. This provided the rationale for searching
for a pgip homologue in the model dicotyledon A. thaliana.

4. Isolation of Nucleic Acid Molecules Related to Integrin


The integrin family of cell adhesion receptors plays a fundamental role in
the processes involved in cell division, differentiation and movement. The specific
function identified was that the target be an integral membrane protein involved in
cytoskeletal formation. The localization selected was that the protein be expressed
in the midgut of an organism. These structural-functional parameters were then
used to target potential genes based on the function identified from the PubMed
database on all organisms. The primer design software was the MacVector
software, and following an initial round of sequence determination, the primer
design was improved.


5. Gene mining strategies of drug discovery


The strategies for identifying the limited number of genes that will be
relevant to any given disease (i.e., "gene mining") have been evolving at a rapid
pace. Until recently, technological developments restricted the genomics
specialist's sense of accomplishment to the mere compiling of "possibly relevant"
genes. The most common approach was to define the "mutant" (or "diseased") and
"wild-type" (or "normal") sets of genes in terms of the ensemble of mRNAs
produced by cells under a given circumstance (e.g., drug treatment). Because of a
certain focus on technology, a tidal wave of descriptive information (e.g., a long
list of mRNAs whose levels differed by twofold or more between experimentally
fixed conditions) threatened to obscure the identification of truly relevant genes. A
more systematic approach to pharmacogenomics, and in particular to the
pharmacogenomics of cystic fibrosis (CF), can now benefit from hypothesis-driven bioinformatic
tools to identify disease- and drug-specific patterns of gene expression. In an
iterative scheme, a hypothesis is developed and used to design investigations of
cells or tissues. Microarrays are used to analyze the samples, and the resulting data
are installed into a database. Once a CF database is generated, specific algorithms
are used as bioinformatic tools, extracting meaning out of the data. Some of these
tools, such as the hierarchical clustering algorithm (see below), are available within
the public domain of the Internet. We have developed two additional tools, called
GRASP (for Gene Ratio Analysis Paradigm) and GENESAVER (Gene Space
Vector). Both are hypothesis-driven techniques, which we shall describe in detail
as they have been applied to CF. The analyzed data must then be integrated into
the larger scope of bioinformation available through the Internet. In this way, a
new, refined hypothesis can be developed for the next cycle of investigation. In
our experience, several tactical cycles through this strategic approach have been
necessary to develop insight into a given problem.

6. Isolation of a Known Gene to Validate System


In order to validate the system, it was used to isolate a known gene, in this case
the Manduca sexta aminopeptidase gene. Aminopeptidase is involved in the
modulation of various cellular responses, especially in cell-cell adhesion and signal
transduction. Aminopeptidase is directly involved in resistance by insects to
insecticidal toxins of Bacillus thuringiensis. The M. sexta aminopeptidase gene
was mined based on nucleotide and amino acid sequence alignment with the
existing aminopeptidase-related sequences.

7. Mining molecular signatures for leukemia subtypes


Here, the target phenotypes are two distinct leukemia subtypes, AML and ALL.
Thus, an ensemble decision analysis is conducted to identify the significant
molecular signatures (subtype-relevant genes) that underpin the complex molecular
mechanisms for distinction between the two subtypes. These data contain
measurements corresponding to ALL and AML samples from bone marrow and
peripheral blood.

Leukemia: Acute Lymphoblastic (ALL) vs. Acute Myeloid (AML)

ALL AML

Visually similar, but genetically very different


8. Gene mining: classification of biological types


Working on the same data allows us to show the differences between the two
targets. As a result of the ensemble gene subset selection, three best subsets
determined by their classification performance on the holdout samples using
equation 3 are obtained, all with a χ² value of 9.118 (P = 0.003). Best subset 1
(Best tree 1) contains four genes: M26383 (human monocyte-derived neutrophil-
activating protein mRNA, MONAP), T51849 (tyrosine-protein kinase receptor
ELK precursor, R.norvegicus), Z24727 (Homo sapiens tropomyosin isoform
mRNA) and H55758 (H.sapiens α-enolase). Best subset 2 also contains four genes:
M26383, T94993 (H.sapiens fibroblast growth factor receptor 2 precursor),
T58861 (60S ribosomal protein L30E, Kluyveromyces lactis) and R39465
(eukaryotic initiation factor 4A, Oryctolagus cuniculus). Best subset 3 contains
five genes: M63391 (H.sapiens desmin gene), D14812 (H.sapiens KIAA0026
mRNA, complete cds), H44011 (myosin heavy chain, non-muscle type A,
H.sapiens), T58861 and H55933 (H.sapiens mRNA homolog of yeast ribosomal
protein L41). To unravel the relationships between the two targets, classify the
biological types and mine disease-relevant genes, we construct a classification rule
using all 20 colon tumor-relevant genes. To allow for (lessen) selection bias due to
either the same approach being used for feature gene selection and prediction or
the induced rule being tested on tissue samples that had been used in the first
instance to select the feature genes, we perform a de novo validation procedure
called external cross-validation, with newly permutated data sets and with separate
classifiers from that used for feature gene subset selection. The classifiers
considered are a SVM with five different kernel functions, FLD, LNR, KNN and
MD, reflecting the diversity of discriminant methods putatively useful for
microarray data analysis.


Although there are some variations between the different classifiers, on average all
four subsets (three best trees and the feature set with the 20 relevant genes)
identified by our ensemble approach perform comparably with or better than the
feature set with all 2000 genes. Best tree 3, although it does not include the top
gene (M26383), achieves the highest performance across the multiple external
classifiers and even performs better than the feature set of the top 20 colon tumor-
relevant genes, with the highest performance (92.1%) attained using a SVM with a
polynomial 2-D kernel, which is the highest attainable so far. The second best
feature set is the top 20 relevant genes, reflecting the fact that the relevant genes
are extracted from trees, which are in turn built with a target of high classification
performance given a data structure. Nevertheless, this feature set is neither
necessarily the most economical (minimal) nor the most efficient set for
classification or prediction because there are ‘redundant’ features among the top 20
genes (e.g. the two replicates of R39465). Indeed, mining these ‘redundant’ genes
is one of the major goals for ensemble decision analysis of microarrays.

9. Mining colon tumor-relevant genes


A 5-fold cross-validation (CV) resampling approach is used to construct the
training and test sets. First, colon tumor and normal samples are randomly divided
into five non-overlapping subsets of roughly equal size, i.e. tumor subsets Di (i =
1, 2, ..., 5) and normal subsets Ni (i = 1, 2, ..., 5). The resampling is repeated 20 times
to obtain 500 pairs of training and test sets. The proposed gene extraction
approach is then applied to each pair. In order to obtain a statistical measure of
significance for each gene, a null distribution FV0 is constructed, as described
previously. An empirical threshold of 0.035 at the significance level of 0.01 is
chosen, denoted as FV0β = 0.035 (β = 0.01). The colon tumor-relevant
genes of high significance (P < 0.01) are those extracted by analyzing the 500 pairs.
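
The text does not spell out how five subsets and 20 repetitions yield 500 pairs. One reading consistent with that total, offered here as an assumption, is that each test set combines one tumor subset Di with one normal subset Nj, giving 5 x 5 = 25 pairs per repetition. The sketch below implements that reading. The sample labels and sample counts are hypothetical placeholders.

```python
import random


def split_into_folds(items, k, rng):
    """Shuffle items and deal them into k non-overlapping folds of similar size."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def resample_pairs(tumor, normal, k=5, repeats=20, seed=0):
    """Return (training set, test set) pairs, one per (Di, Nj) combination."""
    rng = random.Random(seed)
    everything = set(tumor) | set(normal)
    pairs = []
    for _ in range(repeats):
        tumor_folds = split_into_folds(tumor, k, rng)
        normal_folds = split_into_folds(normal, k, rng)
        for d in tumor_folds:
            for n in normal_folds:
                test = set(d) | set(n)
                pairs.append((everything - test, test))
    return pairs


# Hypothetical sample labels; the counts are placeholders, not taken from the text.
tumor = [f"T{i}" for i in range(40)]
normal = [f"N{i}" for i in range(22)]
pairs = resample_pairs(tumor, normal)
```

Every test set then contains both tumor and normal samples, and the training set is simply everything not held out.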


10. Metagenomics ( Uncultivable Microbes & Novel genes)


Modern biotechnology has a steadily increasing demand for novel genes for
application in various industrial processes and development of genetically
modified organisms. Identification, isolation and cloning of novel genes at a
reasonable pace is the main driving force behind the development of
unprecedented experimental approaches. Metagenomics is one such novel
approach for discovering novel genes. Metagenomics of complex microbial
communities (both cultivable and uncultivable) is a rich source of novel genes for
biotechnological purposes.

11. Genomics Industry & Gene Mining Companies


Now that the sequence data are available and placed in the public
domain, companies have been created to "mine" the data, that is, to analyze the
genomic sequences to identify genes, their function, and their relationships to
health and disease processes. Companies pioneering in this area included Sequana
and Millennium Pharmaceuticals. Worldwide efforts are ongoing in optimizing
medical treatment by searching for the right medicine at the right dose for the
individual. Metabolism is regulated by polymorphisms, which may be tested by
relatively simple SNP analysis, however requiring DNA from the test individuals.
Target genes for the efficiency of a given medicine or predisposition of a given
disease are also subject to population studies, e.g., in Iceland, Estonia, Sweden, etc.
For hypothesis testing and generation, several bio-banks with samples from
patients and healthy persons within the pharmaceutical industry have been
established during the past 10 years. Thus, more than 100,000 samples are stored
in the freezers of either the pharmaceutical companies or their contractual partners
at universities and test institutions.

Ethical issues related to data protection of the individuals providing samples to
bio-banks are several: nature and extent of information prior to consent, coverage
of the consent given by the study person, labeling and storage of the sample and
data (coded or anonymized). In general, genetic test data, once obtained, are
permanent and cannot be changed. The test data may imply information that is not
beneficial to the patient and his/her family (e.g., employment opportunities,
insurance, etc.). Furthermore, there may be a long latency between the analysis of
the genetic test and the clinical expression of the disease and wide differences in
the disease patterns. Consequently, information about some genetic test data may
stigmatize patients leading to poor quality of life. This has raised the issue of
‘genetic exceptionalism’ justifying specific regulation of use of genetic
information.

Discussions on how to handle sampling and data are ongoing within the industry
and the regulatory sphere: the European Agency for the Evaluation of Medicinal
Products (EMEA) has issued a position paper, the Council for International
Organizations of Medical Sciences (CIOMS) has a working group on this issue,
and the European Society of Human Genetics is preparing a background paper on
‘Polymorphic sequence variants in medicine: Technical, social, legal and ethical
issues. Pharmacogenetics as an example’. Within the European project Privacy in
Research Ethics and Law (PRIVIREAL), recommendations for common European
guidelines for membership in research ethical committees have been discussed,
balancing the interests and assuring independence and legal competence. Good
decision making, assurance of the legality of protocols and assessment of data
protection are suggested to be part of any evaluation of protocols.

12. Gene mining in African rice germplasm to improve drought resistance in rainfed production systems for resource-poor farmers of Africa
Rice has been cultivated in western and central Africa for centuries and is now a
staple food in the region. But drought is a major problem as it severely depresses
yield in upland and rainfed lowlands, where most of the producers are
resource-poor farmers. Drought resistance in plants, however, is a complex trait,
controlled by the interaction of many genes, as it involves several physiological,
phenological and morphological mechanisms. Consequently, conventional breeding
for drought resistance in Africa has had only limited success. DNA markers and
genetic mapping are expected to provide impetus, both by giving a better
understanding of the traits associated with drought and by enhancing selection
efficiency.

The project seeks to 1) characterize drought in different environments and
identify the most important traits associated with drought resistance, 2) select
and characterize sources of drought resistance for genetic mapping and
quantitative trait locus (QTL) analysis, and 3) develop advanced lines combining
drought resistance with high yield and agronomic and quality characters
acceptable to farmers and consumers. To achieve these objectives the project will
exploit a core germplasm pool (Oryza glaberrima and O. sativa) of 1) drought-
resistant O. glaberrima accessions, collected and screened in Mali by the Institut
d’Economie Rurale (IER), 2) drought-tolerant interspecific breeding lines
developed by the Africa Rice Centre (WARDA) from crosses between O.
glaberrima and O. sativa, and 3) a range of traditional O. glaberrima and O. sativa
accessions from WARDA’s gene bank. Confirmed sources of resistance among this
core germplasm will be crossed with elite but drought-susceptible O. sativa lines
to develop interspecific and intraspecific populations segregating for drought
resistance. These populations will be phenotyped in replicated field trials in
different environments in Mali and Nigeria. QTL analysis will be performed to
identify, across environments, alleles that improve drought resistance for future breeding. In
other populations, selection will be conducted to generate agronomically superior
drought-resistant lines.

13. Mining the Epigenome for Methylated Genes in Lung Cancer
Lung cancer has become a global public health burden, further substantiating the
need for early diagnosis and more effective targeted therapies. The key to
accomplishing both these goals is a better understanding of the genes and
pathways disrupted during the initiation and progression of this disease. Gene
promoter hypermethylation is an epigenetic modification of DNA at promoter
CpG islands that together with changes in histone structure culminates in loss of
transcription. The fact that gene promoter hypermethylation is a major
mechanism for silencing genes in lung cancer has stimulated the development of
screening approaches to identify additional genes and pathways that are
disrupted within the epigenome. Some of these approaches include restriction
landmark genomic scanning, methylated CpG island amplification coupled with
representational difference analysis, and transcriptome-wide screening. Genes
identified by these approaches, their function, and prevalence in lung cancer are
described. Recently, we used global screening approaches to interrogate 43 genes
in and around the candidate lung cancer susceptibility locus, 6q23–25. Five genes,
TCF21, SYNE1, AKAP12, IL20RA, and ACAT2, were methylated at 14 to 81%
prevalence, but methylation was not associated with age at diagnosis or stage of
lung cancer. These candidate tumor suppressor genes likely play key roles in
contributing to sporadic lung cancer. The realization that methylation is a
dominant mechanism in lung cancer etiology and its reversibility by
pharmacologic agents has led to the initiation of translational studies to develop
biomarkers in sputum for early detection and the testing of demethylating agents
and histone deacetylase inhibitors for treatment of lung cancer.

Lung cancer has become a global public health burden, with 1.5 million deaths
expected by 2010. The high mortality from this disease stems from the lack of an
effective screening approach for early diagnosis and the refractoriness of advanced
cancers to conventional therapies, substantiating the need to develop more
effective targeted therapies and chemoprevention. Although smoking cessation does
reduce risk for lung cancer, approximately half of lung cancers diagnosed are in
former smokers. Adenocarcinoma is the major histologic type of cancer diagnosed
in smokers in the United States and now Europe. An incidence rate of 40% and up
to 80% has been reported for this histologic type of cancer in smokers and never
smokers, respectively, diagnosed with lung cancer. Non–small cell lung cancer
(NSCLC, comprising mainly adenocarcinoma, squamous cell carcinoma, and large
cell carcinoma) is diagnosed in approximately 80% of patients, while the remaining
20% of tumors appear to be small cell lung cancer (SCLC). The detection of numerous
cytogenetic changes provided the first link to the molecular pathogenesis of lung
cancer. Mapping of chromosomal sites for rearrangement, breakpoints, and losses
revealed both common and distinct changes in SCLC and NSCLC. The commonality
for specific regions in the genome for allelic loss suggested the presence of tumor
suppressor genes (TSGs) within these loci. The retinoblastoma gene was the first
TSG linked to lung cancer. Loss of function of this gene through either deletion or
point mutation occurs in 90% of SCLC, while less than 15% of NSCLCs harbor
changes in this TSG. The second major TSG inactivated in lung cancer is p53.
Although p53 inactivation is common across many malignancies, the mutation
spectrum within this gene tracks with specific tumor types.

In lung cancer, the most common mutation seen is the G:C to T:A transversion, an
alteration potentially stemming from the inability to repair DNA damage caused
by polyaromatic hydrocarbons such as benzo[a]pyrene, which is present in
tobacco. Consistent with this hypothesis, the prevalence of transversion
mutations increased in tumors with increasing cumulative exposure to cigarette
smoke. Mutations in p53 are found in 70% of SCLC, 65% of squamous cell cancer,
and 33% of adenocarcinoma. In lung cancer, the search for TSGs inactivated
through the two-hit mechanism of loss of one allele and mutation of the
remaining allele has not identified any genes whose prevalence of inactivation
approaches that seen for the retinoblastoma and p53 genes. The exception to this
is the LKB1 gene, which is mutated exclusively in approximately one-third of
adenocarcinomas. The most commonly mutated oncogene in lung cancer is K-ras,
with approximately 30 to 40% of adenocarcinomas harboring an activating
mutation, while mutations in squamous cell carcinoma and SCLC are rarely observed.
Mutations are localized to codons 12, 13, and 61, with the majority (> 85%)
occurring within codon 12. Nearly 70% of the mutations seen are G to T
transversions within codon 12 that change a glycine codon (GGT) to valine (GTT)
or cysteine (TGT), which may reflect DNA adducts formed by metabolism of
polyaromatic hydrocarbons in tobacco. Recently, a whole genomic approach was
taken to address how many mutations are seen in cancer.
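The codon 12 changes described above can be checked with a few lines of code. The following is a minimal sketch, not part of the cited study: it builds the standard genetic code and lists every codon produced by a single G to T substitution in the glycine codon GGT. The helper name single_base_substitutions is hypothetical and chosen only for illustration.

```python
from itertools import product

# Standard genetic code, with bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def single_base_substitutions(codon, ref_base, alt_base):
    """Hypothetical helper: codons made by changing one ref_base to alt_base."""
    return [
        codon[:i] + alt_base + codon[i + 1:]
        for i, base in enumerate(codon)
        if base == ref_base
    ]


wild_type = "GGT"  # glycine codon at K-ras codon 12, as given in the text
for mutant in single_base_substitutions(wild_type, "G", "T"):
    print(f"{wild_type} ({CODON_TABLE[wild_type]}) -> "
          f"{mutant} ({CODON_TABLE[mutant]})")
```

The two G to T transversions possible in GGT give TGT (cysteine, C) and GTT (valine, V), which are the two amino acid changes named in the text.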

These studies were focused on breast and colon cancer, but most likely reflect the
paradigm seen in lung cancer studies that have evaluated candidate genes
discovered through various screening modalities. In this whole genome
sequencing study, approximately 80 gene mutations were identified that alter
amino acids. What was surprising was that the prevalence of the majority of these
mutations in primary tumors was less than 5%. The authors concluded that these
minor mutations would each be associated with a "small fitness advantage" that
would drive tumor progression, and thus, it is not the most common genetic
changes but these rare changes that dominate the cancer genome landscape.
While this is an interesting hypothesis, the emergence of epigenetic modifications
of critical regulatory genes indicates that the epigenome may play an equal, if not
greater, role in driving cancer initiation and progression than genetic mutations.
The most common epigenetic change in cancer is methylation of DNA at the fifth
position of the cytosine ring. Cytosine located 5' to guanine (CpG) is the prime
target of methylation in the mammalian genome, and in regions called CpG islands
this dinucleotide occurs at a much higher frequency than expected from a random
genome-wide distribution. About 50% of human promoters contain CpG islands
that often extend into exon 1 of many critical regulatory genes. When DNA
hypermethylation occurs within a CpG island located in the promoter region of a
gene, it is also accompanied by histone modifications (such as acetylation,
methylation, or phosphorylation of histone tails) within the island. Together, these
two epigenetic changes create a closed chromatin configuration around the
promoter region, denying access to RNA polymerase and regulatory proteins
needed for transcription.
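The idea that CpG dinucleotides are enriched in CpG islands relative to a random distribution can be made concrete with the observed/expected CpG ratio. The sketch below is illustrative only and is not taken from the study described here. The thresholds (length of at least 200 bp, GC fraction of at least 0.5, observed/expected ratio of at least 0.6) are commonly cited screening values and are an assumption, as are the function names cpg_stats and looks_like_cpg_island.

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c_count = seq.count("C")
    g_count = seq.count("G")
    # "CG" cannot overlap with itself, so str.count gives the exact number.
    cpg_count = seq.count("CG")
    gc_fraction = (c_count + g_count) / n
    # Number of CpG sites expected if C and G were placed independently.
    expected_cpg = c_count * g_count / n
    obs_exp = cpg_count / expected_cpg if expected_cpg else 0.0
    return gc_fraction, obs_exp


def looks_like_cpg_island(seq, min_length=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply assumed screening thresholds to one candidate region."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return (
        len(seq) >= min_length
        and gc_fraction >= min_gc
        and obs_exp >= min_obs_exp
    )


# Toy sequences, invented for illustration only.
print(looks_like_cpg_island("CG" * 100))
print(looks_like_cpg_island("AT" * 100))
```

In practice such a test would be applied to a sliding window along a promoter sequence; the thresholds are parameters so that other criteria can be substituted.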

The end result of this process is loss of gene transcription and hence "silencing of
gene function." With the development of the methylation-specific PCR assay that
can screen for gene methylation in specific promoters, there has been
tremendous growth over the past decade in the identification of genes that are
silenced in lung cancer through promoter hypermethylation. Transcriptional
silencing by CpG island hypermethylation now rivals genetic changes that affect
coding sequence as a critical trigger for neoplastic development and progression.
Genes responsible for all types of normal cellular function are targeted for
inactivation by methylation at prevalences of 15 to 80% in lung tumors. These
include genes involved in cell cycle regulation (e.g., p16), apoptosis (e.g.,
death-associated protein kinase), DNA repair (e.g., O6-methylguanine-DNA
methyltransferase), cell adhesion (e.g., H-cadherin), signal transduction (e.g., ras
effector homolog 1 [RASSF1A]), and cell differentiation (e.g., RAR-β). Importantly,
many of these genes appear to be inactivated at the earliest histologic stage of
lung cancer and in cytologically normal-appearing bronchial epithelial cells from
smokers. Understanding which pathways are inactivated in the tumor cell and
bronchial epithelium of smokers will be essential for developing targeted therapy
for lung cancer and cancer prevention. The realization of gene promoter
methylation as a major alteration in the cancer cell has stimulated the
development of screening approaches to identify additional genes and pathways
that are disrupted within the epigenome. The sections below describe some of
the high-throughput genome screening approaches used to identify methylated
genes in cancer and a recent study by our group evaluating promoter methylation
of genes in and around the candidate lung cancer susceptibility locus 6q23–25
using a combination of screening approaches.

5. FUTURE SCENARIO:-

➢ Advancement in gene mining companies
➢ Metagenomics for mining new genetic resources of microbial communities
➢ Gene mining with the hierarchical clustering algorithm
➢ "Gene-mining" strategies of drug discovery
➢ Mining the mouse genome
➢ Global gene mining and the pharmaceutical industry
➢ Synthetic life and gene mining

Synthetic life and gene mining:


Our DNA may contain errors that make us more likely to develop a disease such as
breast cancer, diabetes or a mental illness. Medical science has become quite good
at finding such errors, but until now it has only been scratching the surface. The
field of genetics in medicine is about to expand rapidly in many different fields,
but particularly in cancer. It offers the opportunity to target drugs that are both
very expensive and toxic at the people who need them most and are most likely to
respond. Genes are being sought that are associated not just with cancer but with
a whole range of diseases, from obesity to heart disease. Less than a decade ago it
cost billions of dollars to sequence the first human genome. Today it might cost
several hundred thousand dollars, and within a few years that could be as little as
$1,000. Once that happens, we will all have access to an enormous amount of
information about our future health. Within the last two years over 100 new genes
have been identified that are associated with the risk of developing different types
of disease: not just cancer, but also diseases often regarded as lifestyle diseases,
such as diabetes, obesity, high blood pressure and neurological diseases. These
newly identified genes offer tremendous opportunities for prevention and for
better therapy. There are also big ethical questions still to be answered about
tinkering with these basic building blocks of life.

Mining the mouse genome

The published mouse genome sequence has already made a huge impact on the
research community. Although only a draft, it is clear that the sequence is a very
high-quality product, with excellent coverage and reliability over large genomic
expanses. It is a huge asset to researchers, and its significance matches that of the
human genome. In the past six months, for example, the Ensembl genome
browser of the Sanger/European Bioinformatics Institute dealt with 2.6 million
requests for detailed information about the mouse genome, and 3.2 million
queries about the human sequence.

But there is one important difference between these two resources — the mouse
genome encodes an experimentally tractable organism. This means that it is now
truly possible to determine the function of each and every component gene by
experimental manipulation and evaluation, in the context of the whole organism.

6. CONCLUSION:-
Work on gene mining is expanding rapidly worldwide. Indian scientists are also
working hard, but some bottlenecks prevent them from keeping pace. Gene mining
is a boon not only for plant biotechnology but equally for the animal sciences. It
has provided molecular biologists with a powerful and usable tool for extracting
disease-relevant genes, a major theme in the post-genomic era. The technique
still leaves open questions about target-driven gene function.

BLAST TYPES:-

BLASTp - Compares an amino acid query sequence against a protein sequence
database.

BLASTn - Compares a nucleotide query sequence against a nucleotide sequence
database.

BLASTx - Compares the six-frame conceptual translation products of a nucleotide
query sequence against a protein sequence database.

tBLASTn - Compares a protein query sequence against a nucleotide sequence
database dynamically translated in all six reading frames.

tBLASTx - Compares the six-frame translations of a nucleotide query sequence
against the six-frame translations of a nucleotide sequence database.
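The choice of BLAST program listed above follows from the type of the query and of the database, and the translated programs rely on six-frame translation. The following is a rough sketch of both ideas under stated assumptions: it does not call BLAST itself, the function names choose_blast_program and six_frame_translation are hypothetical, and the translation uses the standard genetic code and accepts only the unambiguous bases A, C, G and T.

```python
from itertools import product

# Standard genetic code, with bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def choose_blast_program(query_type, db_type, translated=False):
    """Pick a program name from the query and database types.

    Types are "protein" or "nucleotide". The translated flag only matters
    for nucleotide against nucleotide, where it selects tblastx.
    Any other combination raises KeyError.
    """
    if query_type == "nucleotide" and db_type == "nucleotide":
        return "tblastx" if translated else "blastn"
    return {
        ("protein", "protein"): "blastp",
        ("nucleotide", "protein"): "blastx",
        ("protein", "nucleotide"): "tblastn",
    }[(query_type, db_type)]


def translate(seq):
    """Translate complete codons from the start of seq; '*' marks a stop."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


def six_frame_translation(seq):
    """Return translations of frames +1..+3 and -1..-3 of a DNA string."""
    seq = seq.upper()
    reverse_complement = seq.translate(COMPLEMENT)[::-1]
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement)):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


print(choose_blast_program("nucleotide", "protein"))
for frame, peptide in six_frame_translation("ATGGCCATTGTAATGGGCCGC").items():
    print(frame, peptide)
```

The example sequence is invented for illustration. A real search would pass the chosen program name and the query to a BLAST installation or web service, which is outside the scope of this sketch.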

7. RESEARCH PAPERS PUBLISHED:

1. R. Latha, L. Rubia, J. Bennett and M. S. Swaminathan. 2004. Allele Mining for
Stress Tolerance Genes in Oryza Species and Related Germplasm.
Molecular Biotechnology 27: 101-108.

2. Kaur N, Street K, Mackay M, Yahiaoui N, Keller B. Allele mining and
sequence diversity at the wheat powdery mildew resistance locus Pm3.
Plant Molecular Biology 65: 93-106.

3. DK Berger. 2004. Gene-mining the Arabidopsis thaliana genome: applications
for biotechnology in Africa. South African Journal of Botany 70(1): 173-180.

4. Graciela Gonzalez, Juan C. Uribe, Luis Tari, Colleen Brophy and Chitta Baral.
2007. Mining gene-disease relationships from biomedical literature:
weighting protein-protein interactions and connectivity measures. Pacific
Symposium on Biocomputing 12: 28-39.

5. Seokkyung Chung, Jongeun Jun, Dennis McLeod. 2004. Mining Gene
Expression Datasets using Density-based Clustering. CIKM (04). 8-13.

6. Gerard R. Lazo, Debbie Laudencia-Chingcuanco, Yong Q. Gu, Olin D.
Anderson. 2004. Gene Mining for Conserved cis Elements in Model
Genomes Using Gene Expression Patterns. In: The NCBI Handbook. 106-109.

7. S. M. Khalessizadeh, R. Zaefarian, S.H. Nasseri, and E. Ardil. 2006. Genetic
Mining: Using Genetic Algorithm for Topic based on Concept Distribution.
World Academy of Science, Engineering and Technology 13: 144-147.

8. Patents (program)

➢ Peptide Mass Fingerprinting Database Management program using AMWISE and
fBIND technique. Registration Number: 2004-01-12-835

➢ Peptide Mass Fingerprinting program using AMWISE and fBIND technique.
Registration Number: 2004-01-12-836

➢ cDNA Microarray data Classification tool. Registration Number: 2004-01-22-839

➢ cDNA Microarray data Clustering tool. Registration Number: 2004-01-22-840
