Sie sind auf Seite 1von 18

See

discussions, stats, and author profiles for this publication at: http://www.researchgate.net/publication/216384474

A Systematic, Data-driven Approach to the


Combined Analysis of Microarray and QTL Data
BOOK JANUARY 2008
DOI: 10.1159/000317174

READS

19

8 AUTHORS, INCLUDING:
Catriona Mary Tate

Morris Agaba

Congenica

The Nelson Mandela African Institute of Scie

22 PUBLICATIONS 93 CITATIONS

53 PUBLICATIONS 1,347 CITATIONS

SEE PROFILE

SEE PROFILE

Stephen J Kemp

Andy Brass

University of Liverpool

The University of Manchester

165 PUBLICATIONS 2,809 CITATIONS

151 PUBLICATIONS 5,336 CITATIONS

SEE PROFILE

SEE PROFILE

Available from: Laurence Stuart Hall


Retrieved on: 04 January 2016

A SYSTEMATIC, DATA-DRIVEN APPROACH TO THE COMBINED ANALYSIS


OF MICROARRAY AND QTL DATA

RENNIE, C1,2
HULME, H2
FISHER, P2
HALL, L3
AGABA, M4
NOYES, HA1
KEMP, SJ1,4
BRASS, A2

1. School of Biological Sciences, Biosciences Building, Liverpool L69 7ZB, UK


2. School of Computer Science/Faculty of Life Sciences, University of Manchester, UK
3. Roslin Institute, Roslin, Midlothian EH25 9PS, Scotland, UK
4. ILRI P.O. Box 30709, Nairobi, 00100 Kenya

Short title: Microarray and QTL data analysis

Summary

High throughput technologies inevitably produce vast quantities of data. This presents
challenges in terms of developing effective analysis methods, particularly where the
analysis involves combining data derived from different experimental technologies.

In this investigation, we applied a systematic approach to combine microarray gene


expression data, QTL data and pathway analysis resources in order to identify functional
candidate genes underlying tolerance of Trypanosoma congolense infection in cattle. We
automated much of the analysis using Taverna workflows previously developed for the
study of trypanotolerance in the mouse model.

We identified pathways represented by genes within the QTL regions, and subsequently
ranked this list according to which pathways were over-represented in the set of genes
that were differentially expressed (over time or between tolerant Ndama and susceptible
Boran breeds) at various timepoints after T. congolense infection. The genes within the
QTL that played a role in the highest-ranked pathways were flagged as good targets for
further investigation and experimental confirmation.

Introduction

The analysis of microarray gene expression data can present difficulties due to the vast
size of the datasets. Depending on the purpose of the study, analysis may be further
complicated by the need to combine data produced using different experimental
techniques or by the underlying complexity of the phenotype being investigated. A
systematic, data-driven, semi-automated analysis pipeline was developed for the
pathway-based combined analysis of microarray and Quantitative Trait Locus (QTL) data
as part of a study investigating the genetics underlying tolerance to African bovine
trypanosomiasis (nagana) [1].

Nagana is transmitted by the tsetse fly, leading to loss of productivity and often death in
infected cattle. It represents a major constraint on livestock production in Africa [2].

Some breeds of cattle, such as the Boran, are susceptible to the pathological
consequences of trypanosomiasis. Others, such as the N'dama, are more resistant to these
effects (trypanotolerant) [3]. However, the susceptible breeds have desirable traits, such
as greater size, and can be preferred by farmers. Identification of genes that influence
response to trypanosomiasis might inform new treatment approaches, or even pave the
way for creating transgenic breeds that combine the desirable traits of susceptible and
trypanotolerant cattle.

Trypanotolerance is a complex phenotype including several distinct components, likely to


involve separate genetic control. Features include the ability to control anaemia, control
parasitaemia and maintain bodyweight. Previous studies provide evidence of the
complexity of trypanotolerance. The trypanosomiasis response of haematopoietic
chimaeric twins bred from one Boran and one Ndama parent was studied, demonstrating
that control of anaemia depends on bone marrow from a trypanotolerant background,
whereas control of parasitaemia does not [4]. The mapping study that provided QTL data
used in this analysis showed that the proportion of phenotypic variation explained by
each QTL was between 6 and 20%, suggesting that multiple genes, or complex epistatic
or environmental effects, may influence each trait [5].

A microarray gene expression timecourse study was carried out to investigate gene
expression differences between (trypanotolerant) N'dama and (trypanosusceptible) Boran
cattle infected with T. congolense strain IL1180 [1]. This study generated a vast dataset.
Thousands of probesets on the array generated signals that were significantly different
between timepoints and/or between the two breeds (in T-tests or paired T-tests with p<=
0.01).

A mapping study identified QTL for 16 phenotypic traits associated with


trypanotolerance in Boran and Ndama cattle [5]. The gene underlying a QTL is not
assumed to be differentially expressed. However, it is expected to connect biologically
with differentially expressed genes. The known pathways that included a gene within one
of the five QTL of largest effect were identified and compared with the known pathways

that included a differentially expressed gene. The rationale behind this approach was to
establish the possible connections between the QTLs and the differentially expressed
genes.

A systematic strategy was used to enable an objective triage of the datasets, resulting in a
shortlist of strong candidate pathways that included both a differentially expressed gene
and a gene within a trypanotolerance QTL. These pathways were ranked according to the
results of a Fisher exact test performed using the Database for Annotation, Visualisation
and Integrated Discovery (DAVID) [6]. A literature search was carried out to determine
whether the biological function of each pathway was likely to be linked to the phenotypic
trait influenced by the QTL.

Large sections of the analysis were automated by adapting Taverna workflows originally
developed for the study of trypanotolerance in the mouse model [7]. This allows the
entire analysis to be repeated consistently and relatively quickly, for example to
incorporate information on the bovine genome from a different EnsEMBL build. It would
also be possible to adapt the analysis procedure to examine a different species or a
different phenotype for which QTL data are available.

Materials and methods

Microarray gene expression data were acquired using Affymetrix Bovine Genome 100
format (Midi) microarrays for liver samples harvested from Boran and Ndama cattle at
0, 12, 15, 18, 21, 26, 29, 32 and 35 days post-infection [1]

These data were analysed with dChip [8] to identify and remove outliers before
normalisation using the Robust Multi-Array (RMA) method.. Principal Components
Analysis (PCA) was used to check that hybridisations grouped as expected.

T-tests were used to compare gene expression between breeds at each timepoint. Paired
T-tests (using data for the same individual animals at different timepoints) were used to
compare gene expression for each timepoint with day 0. Lists of probes that showed
differential gene expression (p<=0.01) between breeds or over time were compiled.

A previous study identified trypanotolerance QTL in Ndama and Boran [5]. Five QTL
were selected to include in this analysis based on phenotypic trait, mapping resolution
and strength of effect. Base-pair positions of QTL relative to the EnsEMBL bovine
genome preliminary build Btau2.0 were determined manually. Names and phenotypes for
the five QTL are shown in table 1.

To combine microarray and QTL data, we adapted a Taverna workflow previously


developed for the study of trypanotolerance in the mouse model [7]. The paper cited
provides a full description. In brief, lists of differentially expressed genes (over time or
between breeds) were associated with Kyoto Encyclopaedia of Genes and Genomes
(KEGG) pathways. A separate process identified genes within QTL and associated these
with KEGG pathways. A third process compared these lists to produce a list of KEGG
pathways that contained both differentially expressed genes and genes from the QTL.

Some adaptations of the workflow were necessary. Rather than the mouse EnsEMBL
build and IDs, the bovine EnsEMBL preliminary build (Btau2.0) was used. Bovine gene
IDs were retrieved for Affymetrix probes then mapped to human homologues (using
EnsEMBL data for NCBI build 36) so that human IDs could be used for the remainder of
the analysis (available annotation on bovine genes is very limited). Output was in the
same form as the original, comprising a list of KEGG pathways that included at least one
differentially expressed gene and at least one gene from the QTL.

This list was ranked based on the p-value of each pathway in a Fisher exact test
performed on the microarray data using DAVID indicating whether pathway genes
showed more differential expression than expected by chance. The list was annotated to
add gene symbols for pathway genes in the QTL and to indicate the breeds and
timepoints in which pathway genes were differentially expressed. Further annotation was
derived from gene and pathway resources including GenBank, iHOP, GenMAPP and
GeneGo: MetaCore.

Figure 1 summarises the analysis protocol described above.

Results

The analysis procedure could be re-used or adapted to examine another species or


phenotype for which QTL data are available. The modular nature of the protocol and of
Taverna workflows facilitates adding or altering analysis stages. Workflow scripts and
supplementary data (e.g. files from intermediate stages) are available from the authors..

In the bovine trypanotolerance study, the result of the analysis procedure was to provide a
shortlist of targets for further investigation. This result can be quantified by assessing the
numbers of genes requiring further investigation based on the combined analysis results
or on the original data.

Out of 24128 probesets on the array, 12591 were significantly differentially expressed (p
<= 0.01) in T-tests or paired T-tests comparing expression between breeds or over time.
8342 of these probesets were mapped to a known gene, in total representing 7071 unique
gene symbols.

After combining the pathway lists for differentially expressed and QTL genes, pathway
genes within QTL provide a list of 127 targets. Restricting the pathway list to those with
a significant (p<=0.05) score in the DAVID Fisher exact test reduces this to 51 targets (it
could be reduced more by checking whether expression changes are downstream of the
QTL gene and whether the pathway function is related to the QTL phenotype).

The list of pathways with a significant score (p <=0.05) in the DAVID Fisher exact test is
displayed in Table 2. Pathway genes lying within each QTL are also listed. Note that
these data are based on an analysis using the EnsEMBL Btau2.0 preliminary build. A
more recent preliminary build is available, and the analysis will be repeated, and key
findings discussed, in a future publication.

Discussion

When studying complex phenotypes, analysis based on biological processes already


known to be involved may be insufficient. It is possible that other key biological
pathways, or complex interactions between them, could be missed. Data-driven
approaches are useful to identify the biological processes showing strongest variation in
the results.

The aim of a pathway-based approach to analysing microarray and QTL data is to


identify biologically meaningful links between the two datasets. The gene underlying a
QTL is not necessarily differentially expressed, but may influence expression of other
genes downstream in a known pathway. This approach allows such genes to be identified
without detailed investigation of every gene in the QTL regions or every gene that is
differentially expressed.

Automation is increasingly necessary to handle the vast quantities of data produced by


high-throughput technologies, where manual analysis of the entire dataset is not feasible.
Automated approaches are systematic, promoting consistency and reducing bias.
Consistent replication of automated analyses is relatively simple, allowing separate
studies to produce comparable results and allowing analyses to be repeated in order to
incorporate new information (e.g. from updates to genome build and gene information

available in public databases). Automated analysis can be an effective triage process,


producing a shortlist of strong targets for thorough manual investigation.

Conclusion

Systematic data-driven automated approaches offer an excellent means to triage data


from high-throughput technologies, providing a shortlist of viable targets for thorough
manual analysis and experimental confirmation.

Acknowledgements

This work was wholly funded by The Wellcome Trust.

References

1. Agaba M, Hulme H, Rennie C, Mwakaya J, Ogugo M, Brass A, Kemp S J:


Genome responses of tolerant and susceptible cattle infected with Trypanosoma
congolense. In these proceedings.
2. Kristjanson PM, Swallow BM, Rowlands GJ, Kruska RL, de Leeuw PN:
Measuring the costs of African animal trypanosomosis, the potential benefits of
control and returns to research. Agric Syst 1999;59;79-98.
3. Murray M, DIeteren G, Teale AJ. Trypanotolerance, in Maudlin I, Holmes PH,
Miles MA (eds): The Trypanosomiases. Wallingford UK, CABI Publishing, 2004,
pp 461-477.
4. Naessens J, Leak SG, Kennedy D, Kemp SJ, Teale AJ: Responses of bovine
chimaeras combining trypanosomosis resistant and susceptible genotypes to
experimental infection with Trypanosoma congelense. Vet Parasitol
2003;111;125-142.
5. Hanotte O, Ronin Y, Agaba M, Nilsson P, Gelhaus A, Horstmann R, Sugimoto Y,
Kemp S, Gibson J, Korol A, Soller M, Teale A: Mapping of quantitative trait loci
controlling trypanotolerance in a cross of tolerant West African Ndama and
susceptible East African Boran cattle. Proc Natl Acad Sci USA
2003;100(13);7443-7448.

6. Dennis GJ, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, Lempicki RA:
DAVID: Database for Annotation, Visualisation, and Integrated Discovery.
Genome Biol 2003;4(9);R60.
7. Fisher P, Hedeler C, Wolstencroft K, Hulme H, Noyes H, Kemp S, Stevens R,
Brass A: A systematic strategy for large-scale analysis of genotype-phenotype
correlations: identification of candidate genes involved in African
trypanosomiasis. Nucl Acids Res 2007;35(16);5625-5633.
8. Li C, Wong WH: Model-based analysis of oligonucleotide arrays: Expression
index computation and outlier detection. Proc Natl Acad Sci USA 2001;98;3136(2001a), Proc. Natl. Acad. Sci. Vol. 98, 31-36.

Corresponding author: Catriona Rennie


Address: LF8, Kilburn Building, The University of Manchester, Oxford Rd, Manchester,
M13 9PL, UK.
Email: catriona.rennie@postgrad.manchester.ac.uk

Figure 1

Table 1
QTL
BTA2
BTA4
BTA7
BTA16
BTA27

Phenotype
Anaemia
Parasitaemia
Anaemia and parasitaemia
Anaemia
Anaemia

Table 2
KEGG pathway name
Leukocyte transendothelial migration
Regulation of actin cytoskeleton
Cell cycle

BTA2

BTA4

FN1
PIP5K3
MCM6
ORC4L

CHRM2

Focal adhesion

FN1

ZYX

MAPK signalling pathway

CASP8

CASP2

Hematopoietic cell lineage

Glycerophospholipid metabolism
Adherens junction
Neurodegenerative disorders
T cell receptor signalling pathway

BTA27
CLDN23
BRAF
FGF20

PRKACA
TUBB4
COL5A3
VAV1
ECSIT
PRKACA

GNAQ
CAPN2

BRAF

DUSP10

BRAF
DUSP4
FGF20
IKBKB

EPOR
FCER2
CASP8
LCT
EFNB1

DGKI
EPHA1
EPHB6
DGKI

CASP8
CD28
CTLA4
ICOS

Long-term potentiation
Apoptosis

CASP8

Toll-like receptor signalling pathway

CASP8

Wnt signalling pathway


Glutathione metabolism
Calcium signalling pathway

BTA16

CDKN2D

Gap junction

Huntingtons disease
Glycerolipid metabolism
Axon guidance

BTA7
VAV1
VAV1

IDH1

GSTK1
CHRM2

AGPAT6
UNC5D
ARD1A
INSP

AGPAT6
FGFR1

VAV1

IKBKB

PRKACA
IRAK1
PRKACA
IRAK1
TICAM1
PRKACA

GNAQ
CAPN2

PTGER1
PRKACA

GNAQ
GNA14
ITPKB

BRAF
IKBKB
IKBKB
DKK4
SFRP1
ADRB3
VDAC3

Captions:

Figure 1 Summary of the analysis procedure. Automated sections are indicated using
grey shading.

Table 1. Name and phenotype for the five trypanotolerance QTL used in this analysis.
For more detailed information, please refer to the original mapping study [Hanotte 2003]

Table 2. Pathways with a significant (p<=0.05) score in a Fisher exact test to determine
whether the differential expression of pathway genes is higher than expected by chance.
The columns on the right give the gene symbols for pathway genes within each of the
QTL.

Das könnte Ihnen auch gefallen