EXPLORATIONS IN LOGIC PROGRAMMING FOR BIOINFORMATICS

A thesis submitted to King’s College London
for the degree of Doctor of Philosophy
in the Faculty of Natural and Mathematical Sciences

February 1, 2019

Samuel R Neaves
Department of Informatics
Abstract

Logic programming is a programming paradigm in which knowledge is represented in a restricted form of first-order logic, as a set of facts and rules that describe data and the relationships within it. While logic programming is now a mature field of research, its applications in biology are sparse. This thesis advances the argument that Programming in Logic (Prolog) is ideally suited to the analysis and manipulation of data in bioinformatics tasks. We demonstrate this with three core application areas.
First, we describe how some common biological data analysis tasks can be formulated as subgroup discovery tasks (finding interesting subgroups of a population in which the class distribution differs from the overall class distribution). We show how a researcher can solve these tasks in Prolog, by first developing a logic-based subgroup discovery program and then optimising it so that it can run on large datasets. We apply this program to two examples, identifying subgroups of: 1) CpG sites that are differentially methylated in cancer, and 2) microbes that are present in lesional psoriasis.
Second, we develop a web-logic application programming interface (API) for human Reactome data. This API allows researchers to send logic programs that query Reactome to be computed in the cloud – thereby sending the small program to the large data. This is useful because the size and number of biological databases are increasing to the point where downloading each one for local computation is becoming infeasible.
Third, we use Inductive Logic Programming (ILP) to identify pathway activation patterns in Reactome pathways whose presence predicts lung cancer type. The identified model has similar performance to the state of the art while being interpretable, so that it provides biological insight into these conditions.

Acknowledgements

I would first like to thank my supervisor Sophia Tsoka. She has listened to me and helped
me throughout the process of the PhD. Her willingness to spend her time with me and be
flexible in dealing with the challenges I have faced has been much appreciated.
I would also like to thank the people who have been in the Tsoka research group, who have provided stimulating scientific discussions and moral support: Gareth Muirhead, Jonathan Cardoso, Laura Bennett and Aristotelis Kittas.
My friends, especially John Burley, Chris Goff, Stuart Lock, Paola Di Pietro, Louise Poulter and Anna Dodridge. They may have occasionally led me astray (some more than others...) but in the process kept me sane and happy. I will note Stuart for his hospitality when I arrived in London, putting me up and keeping a fun house. Anna for keeping me connected to my first world of student unions and continuing to inspire me with her attitude to life. John for putting me up on countless weekends, never saying no to meeting up for a drink and even paying for some incredible holidays to see walruses and catch up with friends abroad. Paola for her warm, friendly personality, which made many a London evening enjoyable. Louise because, when I really need to go out dancing, we always have a great time. Chris for our regular pub sessions putting the world to rights, exploring technology and politics; these have been hugely appreciated. Many other friends whom I cannot all name here, but believe me, you are appreciated.
My family have been a huge support to me, with their love and sacrifices. My Mum
and Dad, sister Lucy, Granny and Bernard and my extended family – all have helped me
in countless ways. The most special thanks go to Louise – for being an amazing scientific
collaborator and so much more, thank you.

Contents

Abstract 2

Acknowledgements 3

1 Introduction 10
1.1 How this thesis is organised . . . . . . . . . . . . . . . . . . . . . . . . . . 14

I Background 16

2 Biology 17
2.1 Genomic information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2 Cells . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3 Tissues and organs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4 Organisms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5 Species . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.6 Ecosystems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3 Example health problems 23


3.1 Cancer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1.1 Breast cancer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.1.2 Lung cancer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.1.3 Gene expression in cancer . . . . . . . . . . . . . . . . . . . . . . . 24
3.1.4 DNA methylation in cancer . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Psoriasis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2.1 The skin microbiome . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.2 What is known about skin microbiota . . . . . . . . . . . . . . . . . 26
3.2.3 The skin microbiome in the presence of psoriasis . . . . . . . . . . . 27

3.3 Health problems summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

4 Data 28
4.1 Sequencing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.1.1 Sequence conservation data and 16S rRNA microbiome data . . . . 29
4.2 Microarrays – CpG methylation data and gene expression data . . . . . . . 29
4.3 Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.3.1 Reactome . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.3.2 IMG/M data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.3.3 Encode - The encyclopedia of DNA elements . . . . . . . . . . . . . 30

II Logic programming in Prolog and descriptive rule induction from biological data 32

5 Logic programming in Prolog: an overview for bioinformaticians 33


5.1 Foundational concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.2 Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.3 Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.3.1 Formal meaning of words and sentences . . . . . . . . . . . . . . . . 36
5.3.2 Proof theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.4 Meta-theoretical issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.4.1 Prolog soundness and completeness . . . . . . . . . . . . . . . . . . 41
5.5 Declarative and procedural reading of a Prolog program . . . . . . . . . . 42
5.5.1 Procedural readings (of logic programs). . . . . . . . . . . . . . . . 43
5.6 Arithmetic operations in Prolog . . . . . . . . . . . . . . . . . . . . . . . . 46
5.6.1 Peano arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.6.2 Constraint satisfaction and constraint logic programming . . . . . . 49
5.6.3 Low level arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.7 Other features of Prolog . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.7.1 Meta-predicates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.7.2 Second order predicates . . . . . . . . . . . . . . . . . . . . . . . . 58
5.7.3 Definite Clause Grammars (DCGs) . . . . . . . . . . . . . . . . . . 60
5.8 Prolog documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.9 SWI-Prolog version 7 extensions . . . . . . . . . . . . . . . . . . . . . . . . 65

6 Descriptive rule induction from biological data in Prolog - subgroup discovery 66
6.1 An introduction to subgroup discovery . . . . . . . . . . . . . . . . . . . . 67
6.1.1 Data description language . . . . . . . . . . . . . . . . . . . . . . . 68
6.1.2 Rule language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.1.3 Coverage function . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.1.4 Learning subgroup rules the general framework . . . . . . . . . . . 72
6.1.5 Existing rule learning algorithms . . . . . . . . . . . . . . . . . . . 74
6.1.6 Rule length . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.1.7 Formulation of common bioinformatics tasks as subgroup discovery
tasks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.2 Implementation of subgroup discovery task in Prolog . . . . . . . . . . . . 85
6.2.1 Method 1: Pure constraint logic programming implementation . . . 85
6.2.2 Method 2: Heuristic top-down search using a constraint covering
function and weighted instances . . . . . . . . . . . . . . . . . . . . 93
6.2.3 Method 3: Genetic algorithm for searching a very large hypothesis
space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.3 Application of constructed subgroup discovery algorithms to CpG methyla-
tion and microbiome . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.3.1 Application to the CpG sites . . . . . . . . . . . . . . . . . . . . . . 105
6.3.2 Application to the microbiome . . . . . . . . . . . . . . . . . . . . . 109
6.4 Summary and discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113

III Structured data 114

7 Reactome Pengine 115


7.1 Description of content . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
7.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
7.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.4 Interfacing with Reactome Pengine . . . . . . . . . . . . . . . . . . . . . . 118
7.4.1 Reactome Pengine data flows . . . . . . . . . . . . . . . . . . . . . 118
7.5 Example usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
7.5.1 SWISH examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
7.5.2 Example Prolog script for Unix pipeline . . . . . . . . . . . . . . . 137

7.6 Comparison of Reactome Pengine to existing data access options . . . . . . 139
7.6.1 Amount of data exchanged . . . . . . . . . . . . . . . . . . . . . . . 139
7.6.2 Flexibility of querying . . . . . . . . . . . . . . . . . . . . . . . . . 140
7.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143

8 Using ILP to identify pathway activation patterns in systems biology 144


8.1 Description of content . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
8.2 Introduction and background . . . . . . . . . . . . . . . . . . . . . . . . . 144
8.3 Overview of propositionalization . . . . . . . . . . . . . . . . . . . . . . . . 147
8.4 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
8.4.1 Raw Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
8.4.2 Data Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
8.4.3 Searching for pathway activation patterns . . . . . . . . . . . . . . 153
8.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
8.5.1 Quantitative evaluation and comparison with SBV Improver model 156
8.5.2 Results for Warmr method . . . . . . . . . . . . . . . . . . . . . . . 158
8.5.3 Results for Warmr/TreeLiker combined method . . . . . . . . . . . 159
8.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159

9 Summary and future directions 161


9.1 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
9.1.1 Integrate biological expertise and information . . . . . . . . . . . . 163
9.1.2 Evaluate Prolog versions for bioinformatics . . . . . . . . . . . . . . 164
9.1.3 Improve algorithms for subgroup discovery . . . . . . . . . . . . . . 164
9.1.4 Expand applications of subgroup discovery . . . . . . . . . . . . . . 164
9.1.5 Deploy large pengine networks . . . . . . . . . . . . . . . . . . . . 165
9.1.6 Use even more structural knowledge in model building . . . . . . . 165
9.1.7 Use the data collected from Reactome Pengine . . . . . . . . . . . . 165
9.1.8 Incorporate reframing . . . . . . . . . . . . . . . . . . . . . . . . . 166
9.2 Implications and recommendations . . . . . . . . . . . . . . . . . . . . . . 166
9.3 The imagined future . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167

Appendix A Prolog execution algorithm 168

Bibliography 170

List of Tables

6.1 An attribute-value data table . . . . . . . . . . . . . . . . . . . . . . . . . 70


6.2 A multi-instance data table . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.3 A coverage table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.4 Subgroup discovery algorithms . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.5 Illustrative dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.6 Example rules from the data in Table 6.5 . . . . . . . . . . . . . . . . . . . 79
6.7 Data table with gene expression as features . . . . . . . . . . . . . . . . . . 80
6.8 Data table with genes as instances . . . . . . . . . . . . . . . . . . . . . . . 81
6.9 Vocabulary translation from GSEA to subgroup discovery . . . . . . . . . . 81
6.10 Vocabulary translation from DMR to subgroup discovery . . . . . . . . . . 84
6.11 Vocabulary translation from microbiome to subgroup discovery . . . . . . . 84
6.12 Method 1 running time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.13 Method 2 running time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.14 Method 3 running time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.15 Attributes used in the CpG data mining task. . . . . . . . . . . . . . . . . 107
6.16 A multi-instance data table, each OTU has a bag of species attributes . . . 111
8.1 Top 5 identified pathways . . . . . . . . . . . . . . . . . . . . . . . . . . . 156

List of Figures

5.1 Proof tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40


5.2 A SLD tree for the query ?-u(A) . . . . . . . . . . . . . . . . . . . . . . . 41
6.1 Coverage illustration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.2 Feature construction algorithm . . . . . . . . . . . . . . . . . . . . . . . . 73
6.3 Bumphunter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
7.1 Reactome Pengine data flows . . . . . . . . . . . . . . . . . . . . . . . . . 119
7.2 Reactome Pengine example . . . . . . . . . . . . . . . . . . . . . . . . . . 121
7.3 Hive plot of Pathway 21 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
8.1 ILP method overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
8.2 Reaction graph illustrations . . . . . . . . . . . . . . . . . . . . . . . . . . 151
8.3 Logical aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
8.4 ILP results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
8.5 Warmr pattern . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
8.6 Three pathway activation patterns . . . . . . . . . . . . . . . . . . . . . . 159

Chapter 1

Introduction

The current situation in biological research laboratories is unpalatable: tremendous potential for new knowledge is within our collective grasp due to the incredible advances in molecular biology, such as next generation sequencing technology. However, this potential is not being met. We are collectively wasting time and energy when there is much work to be done. Individuals afflicted with disease have an urgent yearning for us – the research community – to deliver knowledge and technology that will alleviate suffering.
Currently, a number of key problems include:

• Data sits in difficult-to-combine silos. There are now countless biological databases containing a huge diversity of data, ranging from raw sequencing data to descriptions of chemical reactions and multitudes of recorded phenotypes. A researcher who attempts to study these collectively has to download masses of data from many services in many different data formats. The data is not conceived of as a whole, and an appropriate unifying knowledge representation is missing. This is highly problematic, because researchers combining data need either to use one of the many software packages that provide mappings of identifiers or to derive their own methods to combine the data. This often results in a duplication of effort, with different researchers writing their own subtly different programs to combine data in their own research groups.

• Data and knowledge are stored separately. Raw data such as the sequence of a gene is stored in one set of databases, and gene-to-gene interactions are perhaps stored in another set, but the real knowledge about what a gene does and how it works is in journal articles or, occasionally, isolated computer programs. For the
most part, our collective knowledge exists in journal articles as free text that requires a person’s time to read, understand and act upon. The knowledge that is contained in computer programs and algorithms is isolated, and human knowledge is needed to combine these into ad hoc pipelines to answer specific research questions [93].

• Data storage and transfer is resource intensive and error prone. Many terabytes of data are being generated by biological research [63]. For instance, the European Bioinformatics Institute reported storing 20 petabytes of data in 2013 [98] and 75 petabytes by 2015 [26] - nearly quadrupling in two years. It is increasingly difficult to store, and even to transfer, data from database providers to the machines on which researchers compute analyses.
Researchers currently resort to techniques such as downloading a database to their local machine, collectively taking up double the storage space and consuming much bandwidth. They then need to use cryptographic hashing to ensure that data has not been corrupted in transfer or storage [16].

• Difficulties choosing the right analysis technique. Many researchers fail to choose the correct machine learning task for their data analysis and default to tasks such as classification or clustering, which are well known and understood. They may also attempt to optimise an easy-to-quantify metric (such as accuracy or the area under the receiver operating characteristic curve) without taking into account the purpose of the model and the context in which the model will be deployed. For example, often in biology we wish to characterise differences in experimental conditions or phenotypes rather than build predictive models.

• Incorrect assumptions in models. Many statistical analyses of biological data make incorrect assumptions about variable interactions; they use univariate regression, which fails to take into account existing knowledge. For example, we know that genes operate in a complex web of interactions that form biological pathways, yet the most common study method to try to ascertain what a gene does is to perform a Genome Wide Association Study (GWAS), which treats each gene as an independent variable [152].

• Overly complex models. Conversely, some studies will construct complex ‘black box’ models (perhaps built with a deep neural network or a support vector machine with a complex combined kernel function) [122, 13]. These models may have uses in, for example, predicting prognosis from imaging data [135]. However, it is often the case that we need to justify our clinical decision making (which admittedly is a difficult philosophical task); put simply, having reasons in the form of rules which a human can intuitively understand, question and reason about is more informative and useful to researchers than a classification model with many thousands of weights contributing to a decision [65].

The combined effect of these problems contributes to the crisis of reproducibility, a duplication of effort and resources, redundancy of work, flawed analyses, and models that are not fit for purpose [9, 51, 78].
We argue that the formal knowledge representation capabilities of logic, honed over many centuries of collective deliberation, are the best foundation for representing and reasoning with scientific biological knowledge. Representing knowledge in logical statements allows us to use formal logical reasoning. A single step of formal reasoning is performed as a symbolic logical inference - that is, a syntactic manipulation of two existing statements to form a new statement. Chains of logical inferences form arguments that allow us to come to new conclusions, which we can employ in medical science. In addition, the limits of different types of logic have been studied and there are formal means to understand these limitations. For example, the concepts of soundness and completeness for a logic system describe in what circumstances an argument is valid, what statements can be inferred, and whether a given logical system contains statements that cannot be inferred. Logic provides the glue to combine our collective knowledge. Being able to reason with existing knowledge to derive new knowledge allows for better decisions in future tasks, whether these tasks are further research or clinical decision making.
The efforts over the last few decades to automate logical reasoning have resulted in the robust and powerful programming language of Prolog. Although there are different flavours [162], they are sufficiently similar for us to talk about them collectively (when we make use of a specific feature in a specific Prolog implementation we will point this out).
Even though Prolog was first implemented in 1972 [24], it has only recently matured into a practical tool for bioinformatics. Outside of computer science settings, paradigms such as logic programming are rarely encountered, and certain developments needed to happen for the language to evolve from a computer scientists’ tool to a life scientists’ tool. Prerequisites for this to happen include having sufficient data and the ability to codify biological
knowledge into formal rules. Indeed, the requirement to curate experts’ knowledge into formal rules was one of the major issues encountered when the first attempts to apply Prolog to practical tasks began; these involved the development of ‘expert systems’ [59]. Some successes were achieved, but the manual curation of expert rules proved to be a large bottleneck in the adoption of these automated reasoning systems.
In order to tackle the problem of codifying human knowledge, computer scientists turned their attention to the possibility of automatically learning rules. This has driven the development of machine learning algorithms, and the field of machine learning has ballooned into a major scientific undertaking. Many remarkable results have been achieved and large numbers of scientific papers rely on these algorithms to draw out results. State-of-the-art machine learning algorithms often need two key resources: the first is a large amount of data, and the second, driven by the first, is a lot of computing power.
The outputs of machine learning algorithms are ‘models’, which represent some knowledge about a concept. One thing that seems to have been lost in the success of modern machine learning methods is the concept of how to deploy these models collectively – the idea of what was once called an expert system.
Currently, there is a problem: models sit in isolation and do not sit in a grand ‘expert’ system. One learnt model cannot automatically reason with another learnt model, leaving the knowledge contained in each isolated and underused. Further, an expert system would justify its decisions by being capable of showing what rules have been used in answering a query; this feature has been neglected in modern machine learning classification models, where a decision could be made by weighing many thousands of weak contributions from variables.
So far we have identified the following practical requirements to enable bioinformaticians to use the automatic reasoning that Prolog provides: 1) machine learning algorithms to automatically codify knowledge into rules, 2) a large amount of data to feed the machine learning algorithms and 3) sufficient computing power to handle the large amount of data. Others include having appropriate software libraries, for example to read data files or to write web applications. These allow the computational bioinformatician to do their job without reinventing the wheel. Many of these libraries are now available in SWI-Prolog and we will demonstrate and describe their usage in this thesis.
Whilst we are not the first to recognise the suitability of logic programming for bioinformatics [156, 6, 108], the topic has not been fully explored. In particular, recent developments in Prolog for deploying applications to the web, improving the efficiency of logical reasoning, and learning with structured data have not been fully exploited.
As we will see, the idea of a grand unified web-scale logic program, with automated learning of characteristic rules and from these automated reasoning to answer research questions, is tantalisingly close. Our attempts to move the life sciences and bioinformatics in this direction are the subject of this thesis.

1.1 How this thesis is organised


The thesis is broken into three parts, sandwiched between this introductory chapter and a concluding chapter. The first part provides background information on a number of relevant topics, including: a) basic biology, in order to help a computational reader understand the domain that we are working in, b) example health problems, to provide the context of how this work can be mapped to problems in health research, and c) a description of the different types of data used in this thesis and how they have been collected and made available.
The second and third parts of this thesis constitute the core work to advance our argument; each part illustrates a number of advantages of using logic programming when accessing and manipulating biological data. We will show how using logic programming allows researchers to combine their reasoning with automatic reasoning provided by the logic engine – thus amplifying the power of the data.
The second part of this thesis has two distinct goals. The first goal [Chapter 5] is to provide an up-to-date synthesis of Prolog knowledge; this will form the basis of understanding for a bioinformatician who is new to this paradigm. It includes formal descriptions of how Prolog works and describes a number of built-in and library predicates, as well as techniques that a bioinformatician would use in order to perform reliable, reproducible data analysis. It describes a methodology for programming in Prolog that is implemented in the second chapter of this part of the thesis [Chapter 6].
The second goal explores the hypothesis that descriptive rule learning of the microbiome and the epigenome is an effective way to mine these biological datasets [Chapter 6]. Descriptive rule learning is very suitable for logic programming because the learnt rules can be immediately used on the logical database. In the process of learning these rules we demonstrate our methodology for programming in Prolog and a number of features of logic programming that allow for efficient manipulation and effective reasoning with data.
The third part of the thesis concentrates on complex structural data – primarily the human reactome. It consists of two chapters [Chapters 7 and 8]. In Chapter 7 we show how modern logic programming techniques can improve access to complex datasets on the web, providing an example data service that is ideal for the distribution of large complex datasets. In Chapter 8 we then go on to give examples of structured data mining of gene expression data in the presence of complex background knowledge (Reactome pathway data). The two chapters in the third part of the thesis are based on published work. Chapter 7 is based on Reactome Pengine: A web-logic API to the homo sapiens Reactome (Bioinformatics 2018) [111]. Chapter 8 is based on Using ILP to Identify Pathway Activation Patterns in Systems Biology (International Conference on Inductive Logic Programming, Springer, Cham, 2015) [110]. Finally, Chapter 9 summarises our argument and presents a vision for the future.
Part I

Background

Chapter 2

Biology

When Charles Darwin proposed the theory of evolution by means of natural selection [35], he explained both the unity and the diversity of life. The diversity is that life viewed at human scale is highly heterogeneous, with birds flying, fish swimming, plants photosynthesising and everything in between. In contrast, there are two striking unities in life: first, everything under a microscope is very similar, with the same cellular machinery visible, and secondly, all living beings strive to reproduce and survive but no living individual is immortal. It is the lineages that survive, not the individuals.
Two ideas are worth stating to help a computer scientist understand biology as a science.
The first is summed up by the quote:

“Nothing in biology makes sense except in the light of evolution.” – Theodosius Dobzhansky [43].

This quote emphasises that in order to understand a biological phenomenon it is always worth attempting to see how a trait might have evolved. A computer scientist can understand that evolution is an optimising algorithm in which the frequency of the replicating agents – the genes – is being optimised. As Dawkins explains in “Climbing Mount Improbable” [37], the algorithm of evolution is not a global optimising algorithm but a hill-climbing algorithm with many local optima - these local optima correspond to the life forms we observe in the world.
The second statement is based on a widely shared colloquialism:

“Biology is the science of exceptions”.

That is, whatever a biologist tries to state as a general law, another biologist will pipe up with an exception they have found. A number of well-known examples of this colloquialism include:

• All mammals give birth to live young – except the platypus and echidna that lay
eggs.

• All eukaryotic cells have a nucleus – except red blood cells (it’s ejected to make room
for oxygen).

• All blood is red – except for some fish that have colourless blood.

Both of these ideas – that evolution can be used as a framework to understand a biological phenomenon, and that there will often be an exception to any biological phenomenon described – can be helpful when reading and understanding biological research.

2.1 Genomic information


Whilst Darwin had explained a theory of how life evolved, the mechanism for inherited information remained unexplained. Mendel and others ascertained that the unit of genetic information must be discrete and that evolution would not work if the unit of inheritance was diluted in each generation [100]. These discrete units came to be known as genes. A gene is what is inherited and is the unit that natural selection acts upon.
In the 1950s Watson and Crick discovered the structure of DNA [155] (a large complex molecule inside a cell’s nucleus) and understood that it could provide the mechanism for the discrete unit of heritability that evolution required:

“It has not escaped our notice that the specific pairing we have postulated
immediately suggests a possible copying mechanism for the genetic material.”
– Watson and Crick.

What was discovered is that DNA is a molecular structure composed of two paired chains of chemical bases forming a double helix. Each chain has a backbone of sugars and phosphates, and off this backbone there are four different kinds of bases: adenine, cytosine, guanine and thymine, referred to as A, C, G and T.
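Since this thesis goes on to represent biological data as logic programs, it is worth noting how directly such sequence data maps onto Prolog. The following minimal sketch (illustrative only, not code used elsewhere in this thesis) encodes Watson-Crick base pairing as facts and derives the complementary strand of a sequence represented as a list of bases:

% Watson-Crick base pairing as Prolog facts.
complement(a, t).
complement(t, a).
complement(c, g).
complement(g, c).

% strand_complement(+Strand, -Complement): the complementary strand of a
% sequence of bases, both represented as lists of atoms.
strand_complement([], []).
strand_complement([Base|Bases], [Partner|Partners]) :-
    complement(Base, Partner),
    strand_complement(Bases, Partners).

% Example query:
% ?- strand_complement([a, c, g, t], Complement).
% Complement = [t, g, c, a].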
Once DNA had been described, it became possible for biologists to define a gene as a sequence of DNA that works together and travels across generations as a unit. From this discovery, scientists in the second half of the twentieth century derived the central dogma
of molecular biology: that DNA sequences are transcribed into RNA sequences, which are translated into protein sequences [29].
RNA is a molecule that is similar to DNA, but is single stranded. When a gene is to be
expressed the DNA is unzipped and a matching RNA molecule is built. This is then fed
into the cellular machinery as the instructions to build a protein sequence. These protein
sequences are chains of amino acids. When they are formed they fold into complex shapes.
These shapes are the molecular machines that build and form cells, which are the primary
unit of life.
Scientists have developed a number of terms to describe the study of all of these processes, including many ’omics terms. The original ’omic is genomics, the study of the genome, which is the collection of genes that exist in an individual; the word can also be used to describe the collection of genes in a species. For example, the human genome is the collection of all genes found in humans, and a subset of this is your personal genome, which is unique to you (unless you have an identical twin). Many other omics terms have been coined in order to describe the large-scale study of some aspect of biology. These include the transcriptome, epigenome, reactome and microbiome.
The transcriptome is the set of genes that have been transcribed, that is, the set of RNA molecules that has been constructed from parts of the DNA. The transcriptome is different in different cells, at different times and under different circumstances. When a gene is transcribed it is said to be expressed. It is the expression of different genes that allows different types of cells to exist – despite all the cells in the same organism having the same DNA.
What controls the expression of genes into RNA molecules is sometimes called epigenetics [71]. In our context we use this word to describe the molecular features that attach to DNA and affect whether the DNA is transcribed or not. The epigenome is the set of modifications to DNA that affect transcription. These include CpG methylation and histone modifications.
The proteins that have been constructed interact with themselves and with other biological molecules in the cell; these interactions are chemical reactions. The chains of reactions that control a process in a living organism are called biological pathways, and the set of reactions is called the reactome.
Finally, the microbiome is the set of all microbiota in a body location, environment or
species, for instance the gut microbiome (the set of microbes found in human guts), or the
human microbiome (the set of all microbes living on or in humans).

In this thesis we will present methods to study the epigenome (Chapter 6), the microbiome (Chapter 6) and the transcriptome (Chapter 8), and we will make use of the genome (Chapters 6, 7 and 8) and the reactome (Chapters 7 and 8).

2.2 Cells
Cells are the building blocks of all organisms and understanding cells is a key goal for
biologists [1]. Billions of years ago the main lineage of life diverged into prokaryotic and
eukaryotic life. Prokaryotic life is unicellular (such as bacteria) whereas eukaryotic life is
more complex and is often multi-cellular, such as plants and animals. Prokaryotic and
eukaryotic life have evolved different types of cells and in the case of multi-cellular life
there will be different types of cells in the same life form. Prokaryotic and eukaryotic cells
differ in how they store their DNA – in eukaryotic life it is stored in a structure called the nucleus, whereas in prokaryotic life it floats more freely. The difference in the storage of the genetic information leads to differences in cell reproduction, with eukaryotic cell reproduction being the more complicated of the two. It is more complicated because two types of reproduction are needed. The first is called mitosis, and is when one cell replicates itself with the same DNA. The second is called meiosis, and is when one cell’s DNA combines with another’s in order to form a third cell. This cell has some of the DNA from the first cell and some from the second - this is sexual reproduction.
The instructions to build a new cell are stored in the DNA and, via biological pathways, ultimately control how a cell will reproduce. It is important to remember that whether life is unicellular or multi-cellular, evolution does not work at the level of individuals but at the level of genes. This is the modern synthesis [74]. Dawkins famously described an organism as a vehicle that has been built by a set of selfish genes in order to help replicate themselves [36].

2.3 Tissues and organs


In multi-cellular life forms, a group of specialised cells that work together is called a tissue, and all these cells share the same DNA. Inter-cellular signalling controls how these cells communicate so that they can work together to successfully replicate the DNA by creating a new individual organism. A group of specialised tissues forms an organ, and an
organ controls some major aspect of life support. For example, the lungs are made up of
different tissues that allow oxygen to be taken from the air and transferred into the red blood cells of an animal. The lungs themselves are part of the respiratory system, which is the group of organs including the chest muscles, tongue and nose that together accomplish the task of breathing.

2.4 Organisms
We usually think of an organism as a discrete individual but the demarcation of individuals
can be blurred. There are many examples of parasitic and symbiotic relationships where one
organism could not survive without the other. Indeed, even at the level of the cell we can
see these phenomena. For example, eukaryotic cells have an organelle, essential to their survival, called the mitochondrion, which is thought to have evolved from a deep symbiotic relationship with a prokaryotic organism in ancient times [62]. The evolution of these symbiotic and parasitic relationships can be understood by using game theory [163]. This also allows us to understand altruistic ‘behaviour’ between closely related individuals and symbionts. A recent prominent area of study for biologists is the relationship between an organism and its resident colony of microbiota [126]. These colonies are always present in animals and plants and consist of many thousands of different bacteria, archaea and viruses. Some biologists have taken to describing the microbiome as the missing organ [10] to emphasise how important it is to an individual.

2.5 Species
For multi-cellular life, a species can be defined as a group of organisms that can breed [60].
The evolutionary history of a species can be inferred by comparing the genomes of different
species. The study of this is called phylogenetics [119]. Biologists have also categorised life into a taxonomic hierarchy, the levels of which are: domain → kingdom → phylum → class → order → family → genus → species [133]. The demarcation of these levels is ill-defined, especially in organisms such as bacteria, but it is a valuable tool for understanding life.
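To foreshadow the representation used later in this thesis, such a hierarchy maps naturally onto Prolog facts and rules. The following is a hedged, illustrative sketch only (the taxa and predicate names are examples, not data used in this work):

% A fragment of a taxonomic lineage as parent_taxon(Child, Parent) facts.
parent_taxon(homo_sapiens, homo).      % species -> genus
parent_taxon(homo, hominidae).         % genus   -> family
parent_taxon(hominidae, primates).     % family  -> order

% A taxon's ancestors are its parent and, recursively, its parent's ancestors.
ancestor_taxon(Taxon, Ancestor) :-
    parent_taxon(Taxon, Ancestor).
ancestor_taxon(Taxon, Ancestor) :-
    parent_taxon(Taxon, Parent),
    ancestor_taxon(Parent, Ancestor).

% Example query:
% ?- ancestor_taxon(homo_sapiens, primates).
% true.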

2.6 Ecosystems
Collections of species in an environment form ecosystems. Although ecosystems are often imagined as fixed systems in balance, evolution shows us this is not the case. It is a
constant arms race as different genes compete to replicate by creating ever more elaborate
vehicles. Some sets of genes will team up at the cellular level, others at the organism
and species level. The microbiome of an organism can also be studied as an ecosystem [126].
Chapter 3

Example health problems

We will now give some background information on a number of health problems which will
serve as examples in this thesis. We do this in order to provide the context in which we
apply logic programming.

3.1 Cancer
Cancer is when aberrant cells divide and multiply in an uncontrolled fashion – this leads
to neoplasms (an abnormal growth of tissue – a tumour). Cancer tumours begin when one
or a number of cells incorrectly divide due to a signalling problem amongst or inside the
cells. As we described in section 2.2 the machinery of cells is ultimately controlled by DNA,
via RNA and protein production. Therefore, broken aspects of any of these processes can
lead to cancer. A cancer proliferates when daughter cells inherit the erroneous DNA. In
the USA one in two people will get cancer in their lifetime and 600,000 people will die of
cancer each year [20].
Cancer can be understood in terms of local evolution, where cancer cells vary, compete
and the fittest survive [61, 18]. Cancer cells compete against the cell’s tumour suppressing
machinery which attempts to (a) repair DNA in the cell and (b) attack cancerous cells.
Because evolutionary history is on the side of the immune system, many copying mistakes and problems will be stopped quickly. However, the evolutionary pressure to develop defences against cancer applies mainly to young organisms, before they have had a chance to reproduce (and pass on their genes); the evolutionary pressure has therefore been weaker at older ages, which is one of the main reasons cancers affect older people more on average [18].
Cancers often occur in parts of the body where cells reproduce quickly. For example,
in breast tissue, milk is being produced and in the lungs, mucus is being produced.
In this thesis we use a number of techniques to build models that give researchers insights into the malfunctions in the body’s signalling system which lead to these cancers. The studies used are on two types of cancer that are very serious for human health: breast cancer and lung cancer.

3.1.1 Breast cancer


In Chapter 6 we use a dataset that has been collected in order to study patients with breast cancer, specifically Ductal Carcinoma in situ (DCIS). DCIS is a breast neoplasm in which the affected cells are located in the basement membrane of breast ducts, and it is a precursor of further invasive carcinomas.

3.1.2 Lung cancer


In Chapter 8, Squamous Cell Carcinoma (SCC) and Adenocarcinoma (AC) lung cancers are studied. Since AC and SCC differ in their cell of origin, location within the lung, and growth patterns, they are considered distinct diseases. SCC develops in the flat cells that cover the surface of the airways and usually grows in the centre of the lungs, whereas AC starts in the mucus-making gland cells of the airways. Both AC and SCC are usually caused by smoking. Chemicals contained in cigarettes, such as chromium, make poisons such as benzo(a)pyrene bind strongly to DNA, causing serious damage. Other chemicals, like arsenic and nickel, interfere with the biological pathways for repairing damaged DNA [66]. AC is the most common type of lung cancer, accounting for 40% of cases, while SCC accounts for 30% [169].

3.1.3 Gene expression in cancer


Many genes act as tumour suppressors; these normally act to inhibit cell growth and division, while other genes act to accelerate cell growth and division. If there is a problem in a cell that reduces the expression of tumour suppressor genes or increases the expression of genes that promote cell growth, then this can lead to cancer. For example, a DNA mutation could occur that disrupts the promoter region of a tumour suppressor gene, leading to a loss or reduction in its expression.

3.1.4 DNA methylation in cancer


DNA methylation is a physical process that is used by cells to regulate gene expression. In DNA methylation, methyl groups attach to DNA at CpG sites (a CpG site is where a cytosine nucleotide is followed by a guanine nucleotide in the linear sequence of bases along the 5' → 3' direction). DNA methylation patterns are known to play an important role in cancer [33, 11]. These epigenetic marks can cause the repression or promotion of genes, leading to loss of cell cycle control, altered function of transcription factors, altered receptor functions, and disruption of normal cell-cell and cell-substratum interactions. DNA methylation is important because it is possible to design drugs that precisely target methylated regions of DNA and remove or add methylation as required [33].
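To make the notion of a CpG site concrete in the logic programming setting used later in this thesis, the following is a minimal illustrative sketch (not the code used in Chapter 6) that finds CpG sites in a sequence represented as a list of bases:

% cpg_site(+Sequence, -Position): Position is a 0-based index at which a
% cytosine is immediately followed by a guanine.
cpg_site(Sequence, Position) :-
    nth0(Position, Sequence, c),
    Next is Position + 1,
    nth0(Next, Sequence, g).

% Example query (using SWI-Prolog's library(lists) nth0/3):
% ?- cpg_site([a, c, g, t, t, c, g], Position).
% Position = 1 ;
% Position = 5.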

3.2 Psoriasis
In Chapter 6 we will be applying machine learning algorithms to data for knowledge discovery in psoriasis. Specifically, we will be looking at the microbiome (the community of microbes that inhabit the human body) in psoriasis.
Psoriasis is an inflammatory skin condition that is currently incurable [112]. The symp-
toms of psoriasis are silvery plaques that occur due to an increase in keratinocytes (a type
of skin cell) resulting in incomplete cornification (the formation of a dead layer of skin that
acts as a barrier) in the stratum corneum (the outermost layer of skin). The keratinocytes
multiply at a faster rate than normal leading to an abnormal epidermis where the outer
layer is defective [116].
It is known that a perturbed immune system is associated with psoriasis. The affected skin contains components of the immune system such as T-cells and cytokines [112]. These components are pro-inflammatory and lead to inflammation in the skin.
Psoriasis is a growing problem in the developed world, with nearly three in one hundred people in the USA now suffering from the condition [125]. It is not known what triggers psoriasis, but GWAS studies have identified some genetic variants that affect risk (e.g. in the IL21A and IL23R genes [109]). It is thought that environmental factors beyond genetics are likely to have an effect, such as stress, trauma and the composition of the skin microbiome [112].

3.2.1 The skin microbiome


Skin is mostly composed of two structures: the epidermis and the dermis. The dermis provides structural support and transports nutrients, while the epidermis forms the main barrier that protects the body from the external environment. A major function of the skin is to regulate temperature, using sweat glands and hairs (which have varying abundance at different locations on the body). Human skin is home to a large, diverse community of microorganisms with which we have a symbiotic relationship. The community is supported by the secretion of nutrients, and in return it is thought that the community helps to protect us from dangerous bacteria [23]. The microbiome of a person consists of a large number of cells, of the same order of magnitude as the number of human cells; however, the metagenome (the aggregate genome of all the microorganisms) is an order of magnitude larger [10]. The microbiome is not fixed and there is considerable diversity between individuals, and the composition of the community has been associated with a number of diseases. For example, deviation from a healthy ecological equilibrium has been linked to inflammatory disease [96, 104, 58]. Little is known about the relationship between the skin microbiome and skin pathologies, such as psoriasis [107].

3.2.2 What is known about skin microbiota


The skin is home to a complex community including fungi, viruses, mites and bacteria [64]. The bacterial population is acquired at birth and closely resembles the vaginal microbiome. The skin environment is not the same across the body, with differences in pH, sebum, moisture levels and temperature – all factors that change due to the different qualities of the skin in different body locations.
Although body site is a major factor in shaping the communities of bacteria, the human environment has also been shown to affect the communities. Yang et al [168] showed community differences by location (urban versus rural areas) and age. However, there is still much to learn about the relationship of the composition of these communities with other factors, such as clothing material, soaps and cosmetic use [64].
The human microbiome project aims to understand the human microbiome; it has found that some skin sites are among the most diverse microbiomes in the human body [73, 28]. However, studies such as Costello et al [168] show that the composition of the communities is fairly stable over an individual’s lifetime, and the high degree of individual variability has led to the idea of using the microbiome left in the environment for forensic analysis.

Many researchers now wish to learn more about the relationships between the host and the skin microbiome communities in the presence and absence of conditions such as psoriasis.

3.2.3 The skin microbiome in the presence of psoriasis


Several studies have explored the microbiome of psoriasis [56, 107]. Findings include the fact that Propionibacterium acnes is reduced in lesional skin, and that there were reduced abundances of the Proteobacteria and Actinobacteria phyla and an increase in Firmicutes, showing that the composition of the microbiota changes between lesional and non-lesional skin. Further studies have defined ‘cutaneotypes’ of different clusters of microbial compositions [2]. One of these ‘cutaneotypes’ – dominated by Firmicutes-Actinobacteria – was associated with lesional psoriasis.

3.3 Health problems summary


As we have seen, health problems are complex, and increasingly complex related data are being gathered. Ultimately we are interested in causality; however, as a first step, understanding the properties of the different bacteria that appear in psoriasis, the different gene expression patterns that occur in lung cancer, and the epigenetics of breast cancer are all important research topics.
Chapter 4

Data

There are now many types of technologies to capture ever more types of biological data. This has been described as the data deluge [12]. In order to show the broad usage and diversity of types of data available to the modern biologist, this thesis provides exploratory case studies of a number of tasks that a bioinformatician may undertake. The data types include sequence conservation data, 16S rRNA microbiome data, CpG methylation data, gene expression data, structured biological pathway data, data about the physical characteristics of microbiota and further genome annotation data. Each of these types of data will be explained in this chapter. As sequence conservation data and 16S rRNA microbiome data are both derived from genome sequence data, we start with an overview of the sequencing process.

4.1 Sequencing
Sequencing technology allows the genome and transcriptome of an organism to be directly read. The technology for sequencing was initially prohibitively expensive; however, developments that came about due to the human genome project greatly reduced the cost [151]. These technologies are known as second and third generation sequencing technology [124]. A prominent modern technology for sequencing is RNASeq [105]. Inexpensive sequencing technology makes it feasible to sequence many more organisms and to compare them. For example, sequencing all the microbes living on a person’s skin would not have been considered feasible at the time of the human genome project, as the size of this metagenome is an order of magnitude larger than the human genome. However, just a few years later this undertaking was started by the human microbiome project [147].


4.1.1 Sequence conservation data and 16S rRNA microbiome data
Sequence conservation data is derived from sequence comparisons between different species
and different genes [121]. We use conservation data in two distinct ways in this thesis.
Firstly, in the CpG methylation subgroup discovery task (the subject of Chapter 6) a
number of conservation-based attributes for each CpG site are used. This data is ascer-
tained by comparing up to 100 genomes from different species and scoring each CpG site
by how many times it is conserved. If a CpG site is highly conserved it is assumed that it has had strong selective evolutionary pressure on it, i.e. it has a crucial function in keeping the organism alive and successfully replicating.
Secondly, we use conservation data that estimates the distinct species of bacteria found
from a biological sample (for example, a swab taken from a patient with psoriasis). Re-
searchers have found that one particular gene (known as the 16S rRNA gene) acts as an
evolutionary clock [164]. Using this knowledge and RNASeq technology, it is possible to
identify different strains and species of bacteria, some of which may be new to microbiol-
ogists because they have previously not been isolated and cultured in a lab [165].
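As an illustration of the first use of conservation data described above, a per-site conservation score (how many aligned genomes carry the reference base at a given position) could be expressed in Prolog along the following lines. This is a hedged sketch under simplifying assumptions (pre-aligned genomes as lists of bases; hypothetical predicate names), not the pipeline used in Chapter 6:

:- use_module(library(lists)).  % nth0/3
:- use_module(library(apply)).  % include/3

% conservation_score(+Position, +RefBase, +Genomes, -Score): Score is the
% number of aligned genomes whose base at Position matches RefBase.
conservation_score(Position, RefBase, Genomes, Score) :-
    include(genome_matches_at(Position, RefBase), Genomes, Matching),
    length(Matching, Score).

genome_matches_at(Position, RefBase, Genome) :-
    nth0(Position, Genome, RefBase).

% Example query:
% ?- conservation_score(1, c, [[a,c,g],[a,c,g],[a,t,g]], Score).
% Score = 2.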

4.2 Microarrays – CpG methylation data and gene expression data
Once the initial genome of a species has been ascertained through sequencing it is possible
to design microarrays which make use of this information in order to inexpensively and
quickly perform a number of data collection tasks.
A DNA array [3] is a substrate on which single-stranded DNA (ssDNA) with various sequences is deposited. These single strands are called ‘probes’ [45]. They are arranged in a grid, and each probe is designed to answer a particular biological question. The array is interrogated by washing it with a solution of ssDNA (the target) that has been generated from a biological sample. The solution is intended to contain DNA complementary to the probes on the array. When these two single strands fuse, or hybridise, this can be detected because the DNA in the probes has been labelled with a fluorescent dye. Computational imaging techniques scan the arrays and output data matrices of the hybridisation levels of each probe on the array. DNA array technology can be adapted to measure the transcriptome (measuring RNA) and the epigenome (measuring methylation of CpG sites)
[72, 44, 14]. In Chapters 6 and 8 we use data from CpG methylation arrays and RNA gene
expression arrays, respectively.

4.3 Databases
4.3.1 Reactome
Reactome is a database of biological pathways that is made available for free online [30].
The complete database contains details of biochemical reactions in a number of species; however, in this thesis we limit our use to the data on humans.
Reactome uses a reaction-centric data model, in which entities participate in reactions that are manually curated into pathways. Each step in a pathway carries the literature citations of the experiments used to ascertain that it exists.
We make use of the Reactome database in part three of this thesis. We use it as an exemplar for a modern bioinformatics API in Chapter 7 and as background knowledge to inform model construction for classification models in the task of distinguishing between two types of lung cancer in Chapter 8.

4.3.2 IMG/M data


Another source of data that we use is derived from The Integrated Microbial Genomes and Microbiomes (IMG/M) database from the Joint Genome Institute (JGI) [97]. This database primarily collects genomic sequence data of different microbes. IMG/M contains all the draft and complete microbial genomes sequenced by the JGI – integrated with other publicly available genomes (including archaea, bacteria, eukarya, viruses and plasmids). In addition to the genomic data, IMG/M contains metadata about many strains of bacteria. This includes phenotypes such as shape and size, as well as details about their habitats and other data. In this thesis it is this metadata that we utilise in order to characterise the individual types of microbes that are present or not in lesional psoriasis (Chapter 6).

4.3.3 Encode - The encyclopedia of DNA elements


The Encode consortium manages a database of genome annotations that it has collected using several different technologies, with the aim of identifying all the functional elements of the human genome. The data includes transcription factor binding sites, non-coding RNA (that is, sites where DNA is transcribed into RNA but this RNA is not then translated into protein) and sites with different chromatin structure (chromatin holds the DNA molecule in knots, controlling which parts of the DNA are accessible to the cellular machinery, using histone proteins). We use features derived from this data in Chapter 6.
Part II

Logic programming in Prolog and descriptive rule induction from biological data

Chapter 5

Logic programming in Prolog: an overview for bioinformaticians

In this chapter we will provide an up to date synthesis of Prolog knowledge. We will


do this by firstly presenting a guide to the theoretical underpinnings of Prolog, including
technical details about how Prolog works and a description of important in-built proce-
dures. Next, we will show how any Prolog program can be read declaratively as well as
procedurally - and why this matters. Understanding these concepts will be necessary for a
bioinformatician undertaking a scientific task using Prolog. We will then go on to describe
some programming techniques to illustrate how a bioinformatician can use Prolog effec-
tively. This will include choosing an arithmetic representation and making use of meta
predicates, second order predicates and definite clause grammars. Throughout, we will
highlight the features that make Prolog especially suitable for reproducible research by
ensuring logical correctness and reducing the number of mistakes in programs. After this
we will describe the SWI-Prolog’s inbuilt functionality to create useful documentation and
finally we will highlight some differences between the main Prolog used in this thesis (SWI-
Prolog) and the ISO Prolog standard (ISO/IEC 13211-1) that could potentially catch out
an uninformed user.

5.1 Foundational concepts


The type of logic used when logic programming in Prolog is called ‘Clausal Logic’ – specifi-
cally Definite Clausal Logic (DCL), which is a subset of full Clausal Logic. When describing


a logic such as Clausal Logic we need to at least describe three elements: syntax, seman-
tics and proof theory [47]. The formal description of these elements will allow us to reason
about arguments such as:

VirusX infects all bacteria.


MicrobeY is a bacterium.
Therefore VirusX infects MicrobeY.

That is, we will be able to make statements and follow arguments that concern indi-
viduals such as VirusX and MicrobeY, sets of individuals such as bacteria, and relations
between individuals such as VirusX infects MicrobeY. In addition we would also like to be
able to reason with statements such as :

All carnivores eat some animal.

In such statements we are reasoning about infinite domains (ALL carnivores). We can do this by giving abstract names to entities without explicitly naming them, as is explained further in the following description of the syntax of DCL.

5.2 Syntax
The description of a logic’s syntax describes, a) what is the alphabet we are using, b) what
different types of words there are and c) what are the allowable ‘sentences’ in the logic.
In DCL individual names are called constants and in Prolog these are single words
starting with a lowercase letter, or strings of characters in single quotes
(constant, ‘Also a constant’). An arbitrary individual is denoted by a single word
starting with an uppercase letter; this is a variable (an anonymous variable begins with an underscore, and denotes an individual that we do not care to name) (Var, _AnonVar). Constants
and variables together are referred to as simple terms. In order to use an abstract name
to, for example, refer to an infinite domain, a complex term is used. A complex term
consists of a functor (also a word beginning with a lowercase letter) followed by a number
of terms separated by commas and encased in brackets (complexterm(Var,const)). A
ground term does not have any variables.
In order to define a relation between two (or more) individuals we use predicates which
are notationally the same as a constant. An atom is a predicate followed by a number
of terms encased in brackets (the same syntax as a complex term). Both for atoms and

complex terms, the number of terms encased in brackets is referred to as the arity of the
atom/predicate or complex term. Each term in the brackets is referred to as an argument.
Predicates with a different arity refer to different relations and do not mean the same thing.
A predicate can also have zero arity, in which case it is a simple proposition that can either
be true or false.
We can connect atoms to make statements called clauses using the symbols , for ‘and’
; for ‘or’ and :- for ‘if’. A statement will only have one ‘if’ but may have multiple ‘ors’
and ‘ands’. The left part of a statement is the head of a clause and the right part of the
statement is the body of the clause, the ‘if’ (:-) is sometimes known as the ‘neck’. In DCL
(as opposed to full Clausal Logic) we only allow there to be one literal in the head of a
clause (these types of clauses are called Horn clauses), for reasons explained in section 5.3
on semantics.
A predicate definition can be made up of a number of clauses, which can be read as a
disjunction. Different predicate definitions are combined into a program and can be read
as a conjunction. The collection of predicate definitions is also sometimes known as a
Knowledgebase.
For example if we take this (meaningless) program:

u(A):-w(b,q(A)),x(a).
u(A):-x(A),z.

u(A,B):-y(n,m).

v(I):-v(b,Q).

w(b,q(1)):-true.

x(a):-true.
x(b):-true.

y(n,m):-true.

z:- true.

There are seven predicate definitions. In order to refer to a predicate we use the combination of its name and its arity, written like u/1. The first predicate definition is

therefore u/1 which is made up of two clauses. The head of the first clause is the single
atom u(A). As this atom is in the head of the clause it is a positive literal. There are two
atoms in the body of the clause (negative literals); the first atom is w/2, whose first argument is the constant b and whose second argument is the complex term q/1 (q is the functor and its only argument is the variable A). The second body atom in the first clause is x/1, which has a single argument - the constant a. The second clause that completes the predicate definition for u/1 has the same head atom as the first clause, but the body is made up of the two atoms x(A) and z. The second predicate is u/2, which is distinct from u/1 as it has a different arity. u/2 is defined by a single clause. The predicate w/2's second argument is a complex (compound) term with the functor q and a single argument – the number 1. The predicates w/2, x/1, y/2 and z are all defined by facts, because the only body atom is the atom true, which is a special atom that always evaluates to true. Clauses which
have this form are normally abbreviated to omit the neck and body and would therefore
simply be written as:

w(b,q(1)).

x(a).
x(b).

y(n,m).

z.

The predicate x/1 is the only other predicate with two clauses, representing the disjunc-
tion x(a) or x(b) as separate facts. The predicate z has an arity of zero and is therefore
a simple proposition which is true. A predicate can also be described as a relation when
its arity is greater than or equal to two e.g. the predicate y/2 is a relation but z/0 is not.

5.3 Semantics
5.3.1 Formal meaning of words and sentences
Sentences in Clausal Logic are said to be ‘truth functional’. This means that the meaning
of a sentence in the logic is assigned a truth value (in our case simply true or false), and
the semantics specifies under what conditions this can happen. For this we use a number

of concepts such as the Herbrand universe, Herbrand base and Herbrand interpretation.
The Herbrand universe is the set of all individuals we are talking about in our clauses. So
for a program P it is the set of all ground terms that can be built from the functors and
constants in P. Therefore if a program contains a functor then its Herbrand universe is
infinite (because a nested infinite term can be created). The Herbrand base of a program
is the set of ground atoms that can be constructed using the predicates in P along with the
ground terms. A Herbrand interpretation is a possible state of the universe we are describing; it assigns a mapping from the items in the Herbrand base of P to either true or false. By convention we can treat a Herbrand interpretation as a subset of the Herbrand base, by stating that the entities in the subset are true and the entities not in the set are false.
Example:
The Herbrand universe for the above program is the following infinite set:

{a,b,n,m,1,q(a),q(b),q(n),q(m),q(1),q(q(a)),...}

The Herbrand base is:

{u(a),u(b),u(n),u(m),u(q(a)),...
w(a,a),w(a,b), ...
x(a),x(b), ...
u(a,a),u(a,b), ...
y(a,a),y(a,b), ...
v(a),v(b), ...
z}

An example Herbrand interpretation would be:

{u(a),u(q(a))}

In order to assign a truth value to each clause we use the following rules and the fact that
the body of a clause is a conjunction.

1. The body is true, and the head is true -> Clause is true
2. The body is true, and the head is false -> Clause is false
3. The body is false, and the head is true -> Clause is true
4. The body is false, and the head is false -> Clause is true

Thus a clause is equivalent to the statement ‘head or not body’ and so a clause is a
disjunction of atoms with each atom in the body of the clause being negated. Body atoms
are therefore called negative literals and head atoms are called positive literals. If a clause
is true in an interpretation then that interpretation is a model for that clause and an
interpretation is a model for an entire program, if it is a model for each clause in the
program.
Adding further literals to a clause can therefore only restrict the possible models for
a program (this is an important property for reasoning about Prolog programs, both for
correcting mistakes in programs and for automatic reasoning of programs). If an atom is in
every model of a program, then it is a logical consequence of the program. The semantics
we use is called minimal model semantics and that means we only accept things as true
if they are true in every model. This avoids having to state everything that is not true in
our domain of discourse, and is also the reason we are restricted to having one atom in
the head of our clauses, because otherwise there could be multiple minimal models. This
is the difference between DCL and full Clausal Logic. This also means that when we want
to have a statement such as x;y:-z. (If z then x or y) we actually need to write either
x:-z, not(y). or y:-z,not(x). to explicitly choose the minimal model we intend. We
will discuss further the semantics of negation in a later section.

5.3.2 Proof theory


Proof theory pertains to how we can obtain new sentences from given axioms using only symbolic manipulation; the allowable symbolic manipulations are defined by inference rules.
In DCL and Prolog we only have one inference rule: resolution [130]. We use resolution
as an inference rule because the alternative of checking if every model of a program P is a
model of a clause C would be unfeasible in practice. If we can derive C from P by chaining
together applications of the inference rule then we can say that C can be proved from P.
Resolution is carried out by taking two clauses from the program and relating them by
one atom that is the head of one clause and where the same atom appears in the body of
the second clause (if at least one of these atoms contains a variable, then they need to be
able to be made equal via a substitution - see the next paragraph). The resolvent clause has as its head the head of the clause whose body contains the resolved-upon atom, and a body made up of all the body literals of the two input clauses except the resolved-upon atom. Any negative literal that appeared in both input clauses appears only once in the body of the resolvent clause.

a:-b,c.
b:-d,e.

Resolving on b leads to :

a:-c,d,e.

When an atom that is being resolved upon contains a variable, we need to apply a
substitution so that the pair of atoms are made equal. We substitute terms for variables –
for example if we apply resolution on the following two clauses and wish to unify b(V,u)
and b(w,X) we would use the substitution {V->w,X->u}.

a(u,V) :-b(V,u).
b(w,X) :- c(w,Z),d(X,Z).

This process is called unification and the substitution is called a unifier. The resulting
clause is then:

a(u,w):-c(w,Z),d(u,Z).

Unifiers do not have to be grounding substitutions, as a variable can be mapped to a different variable. For example: a(X) and a(Y) could be unified with {X->Y} or {Y->X}. Clauses can be unified by many different substitutions, but we want to obtain the most general unifier (MGU). For example p(a,f(Y)) could be unified with p(X,f(g(Z))) by using the substitution {X->a,Y->g(B),B->Z} but the unifier {X->a, Y->g(Z)} is more general.
The MGU is the unifier with the least number of substitutions needed to unify the two
atoms. If an MGU exists it is unique (the two substitutions X->Y and Y->X in the first
example are in fact instances of the same MGU – they just have a renaming of variables).
The proof method for DCL is proof by refutation - we prove that the opposite is not
true. We attempt to derive the empty clause (which means false) in order to demonstrate
that the set of clauses is inconsistent using the substitutions that unify the literals in the
derivation. Therefore in order to query a program, we add the negated clause of our query
to the program, in order to try and refute the query by finding an inconsistency using the
inference rule. For example we could add ¬u(X); a proof tree for this query is shown
in Figure 5.1:

:-u(X).              u(A):-w(b,q(A)),x(a).      {A->X}
:-w(b,q(A)),x(a).    w(b,q(1)):-true.           {A->1}
:-x(a).              x(a):-true.
(the empty clause is derived)

Figure 5.1: A proof tree for the query ?-u(X)

The proof tree shows each MGU, as Prolog attempts to derive the empty clause. In
this example the empty clause can be reached in three different ways, hence the query is refuted; the refutation succeeds not only with A->1 but also with A->a and A->b. These substitutions can be perceived of as the answers to the query. The first solution corresponds to the first proof found, and the other solutions to this query would be found on backtracking, which means trying different branches of the proof tree. How these branches are selected, and the order in which they are selected, is dependent on the selection rule of the inference engine.
When unifying a variable with a complex term, we should check that the variable does not itself occur inside that term (the so-called 'occurs check'). However, for efficiency reasons Prolog does not normally perform this check unless you explicitly ask
for it. This means that under specific circumstances resolution in Prolog is unsound (see
section 5.4 for more details).
Prolog’s implementation of resolution uses Selection Linear Definite (SLD) resolution
[84]. The resolution strategy of SLD states how a literal to resolve upon is selected. This
is done by the selection rule, in Prolog the selection rule is left to right, and the search for
a matching clause is top down in a program. This means Prolog searches ‘Depth First’ by
default. This is illustrated in Figure 5.2

[SLD tree: the root ?-u(A) branches into :-w(b,q(A)),x(a) and :-x(A),z, and the three successful branches correspond to the substitutions A/1, A/a and A/b.]

Figure 5.2: An SLD tree for the query ?-u(A)

5.4 Meta-theoretical issues


In addition to the three elements in clausal logic we have outlined above (syntax, semantics
and proof theory), we are also concerned with soundness and completeness [47]. These
concepts are about the ‘Clausal logic’ rather than in the logic and so are meta-theoretical.

Soundness: If an inference rule is sound, then any theorem derived from the axioms using the inference rule is guaranteed to be true.

Completeness: Inference rules that are complete allow us to derive any true sen-
tence from our axioms. If an inference rule is not complete there will be sentences
that are true in the logic that are not derivable.

5.4.1 Prolog soundness and completeness


The requirements of building a practical programming language means that in fact Prolog
is neither sound nor complete. This means that Prolog will in fact give incorrect answers in
a limited number of cases and there will be cases where an answer exists but it will not be
able to be found. Prolog is normally unsound due to the need for efficiency, its unification
omits the occurs check , which means in principle it is unsound. However, a careful Prolog
programmer will avoid these cases and therefore the reasoning can generally be relied upon.
Prolog also offers facilities to perform the occurs check making the reasoning sound (at the
cost of efficiency), if the user deems this necessary.
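As a brief sketch of the difference (in SWI-Prolog; the occurs_check flag is an assumption to verify against your system's documentation), compare ordinary unification with unify_with_occurs_check/2:

% Without the occurs check, unifying X with f(X) succeeds, creating a cyclic term:
?- X = f(X).
X = f(X).

% With the occurs check, the same unification correctly fails:
?- unify_with_occurs_check(X, f(X)).
false.

% SWI-Prolog can also enable the check globally:
?- set_prolog_flag(occurs_check, true).
true.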
Prolog is incomplete because of the top down search of the selection rule. This means
that if a recursive rule appears before a base case (a non-recursive clause, i.e. a clause that does not have its head literal also in its body), Prolog will enter a loop and never find
the base case.
For example:

a(X):-a(X).
a(b).

This is a valid logical program containing a tautology and a second case; simply querying
?-a(X) will result in an infinite loop in Prolog. Each time the system attempts to resolve
on a(X) it will generate a new identical clause. Prolog can be made complete by changing
its search strategy to breadth first. This can be done by using a meta-interpreter which
we will briefly describe in a later section.
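A minimal sketch of how clause order affects this: the same logical program with the base case written first yields an answer before looping on backtracking.

a(b).
a(X):-a(X).

?- a(X).
X = b ;
% the query then loops when searching for further solutions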

5.5 Declarative and procedural reading of a Prolog program
Because of the need for the selection strategy for the inference rule a Prolog program has
both a declarative and procedural reading. Both readings are useful when writing programs
and the different readings are appropriate for different aspects of Prolog programming.
In computer science, algorithms are a core subject of study and understanding their
properties is very important for the successful achievement of biological data mining.
An algorithm is a description of how a task should be completed (different algorithms
have different properties that allow us to compare two algorithms, for example, how much
memory an algorithm would need for a given input, or if one algorithm will run quicker
than another algorithm). Discussion of an algorithm’s properties are independent of specific
implementation details and computer hardware.
Kowalski [83] gave the following equation to understand what an algorithm is, in the
context of a Prolog program.

algorithm = logic + control

The logic and control components are the declarative and procedural knowledge of a Prolog
program respectively. The declarative reading of a Prolog program will define the result
of a query if one is returned. Adding atoms to a clause can only reduce the number of
solutions to a query, and removing atoms from a clause or adding additional clauses to
a program can only increase the number of solutions to a program. These procedures (refinements) are called specialising a program and generalising a program, respectively.


These properties are important in two respects. Firstly, these properties allow us to reason about mistakes in our programs. If we write a program and expect an answer to be returned, but the Prolog engine answers us with 'false', i.e. it has not been able to find an answer, we know that we will need to generalise our program by either removing atoms from clauses or adding clauses. If on the other hand we expect a program to give a single answer, but Prolog gives us many answers, we know that our program is too general and we should specialise it by adding atoms to our clauses or removing clauses. (Other refinements include distinguishing
between variables). Secondly, this property of Prolog is also useful in machine learning, as
we will see in the latter parts of this thesis, when we will use heuristic search algorithms
to learn rules from data. These algorithms make use of specialisation and generalisation
by defining refinement operators. Further, we can say that a declarative reading of a Prolog program is general: there is no need to understand the various different instantiation patterns for queries that may be called, and it is thus the easiest way to understand what a program is doing. In comparison to logic programming in Prolog, many programming paradigms do not have a declarative reading and can only be understood procedurally. As we will see, the execution strategy of a Prolog program can easily be changed by means
of a meta interpreter. This will not affect the declarative reading of a Prolog program but
may affect the procedural reading.

A methodology for writing Prolog Programs

A methodology for writing Prolog Programs is to first aim to logically describe the solu-
tion to the problem with declarative predicates. Writing this general program will often
allow queries with some instantiation patterns to run efficiently, while others may not immediately allow efficient computation. The implementation can then be adapted taking into account
procedural knowledge. In Chapter 6 of this thesis we will illustrate this methodology when
we implement subgroup discovery in Prolog in order to mine biological datasets.

5.5.1 Procedural readings (of logic programs).


A Prolog program is started by giving a sequence of goals (also known as a query), combined
by either conjuncts or disjuncts. These are then executed in left to right order. For example
the query ?-u(X),u(Y). would first try to prove u(X) and then u(Y). Similarly, a clause has a procedural reading. For example:

X:-Y,Z.

This would be read as: to satisfy X, satisfy Y and then satisfy Z. When reading a clause pro-
cedurally the order that the goals are processed is a factor to consider. Because SLD
resolution is top down we should normally write non-recursive cases before recursive cases.
The complete algorithm for the standard Prolog execution is presented in Appendix A.
A Prolog program can be declaratively correct, but procedurally incorrect. For example,
this could happen when the selection strategy for the inference rule causes the execution
of the program to enter an infinite loop. By reading a Prolog program procedurally we can
avoid this and further we can implement algorithms - a key advantage. For example, the
predicate unsortedlist_sortedlist/2 could define the logical relation between a sorted
and an unsorted list, but the programmer will often need to understand how to write the
different algorithms which will be most efficient for the intended use of the relation (e.g.
merge sort versus quick sort). This is especially relevant when implementing algorithms
that have been described in the literature and written in an imperative style. In Chapter 6
of this thesis we will show different implementations of algorithms for subgroup discovery
– that aim to define a relation between our datasets and a set of interesting subgroup rules.
Even though these algorithms attempt to define the same relation – the implementation
will be different and they will find different rules, and be applicable to data with different
properties.
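As a sketch of this point (the definitions below are illustrative and assume lists of numbers; permutation/2 comes from library(lists)), a declaratively clear definition of unsortedlist_sortedlist/2 states that the sorted list is an ordered permutation of the unsorted list, but it runs in factorial time, whereas an algorithmic implementation such as merge sort (or SWI-Prolog's msort/2) computes the same answer for ground inputs far more efficiently:

% Declaratively clear but procedurally very inefficient (permutation sort):
unsortedlist_sortedlist(Us, Ss) :-
    permutation(Us, Ss),
    ascending(Ss).

ascending([]).
ascending([_]).
ascending([X,Y|T]) :- X =< Y, ascending([Y|T]).

% The same answer computed efficiently for a ground list:
% ?- msort([3,1,2], Ss).
% Ss = [1, 2, 3].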
Interestingly, there is research on automatically finding efficient programs and algo-
rithms but until this area is more mature it is the responsibility of the Prolog programmer
[31].
In addition to the logical features of Prolog, there are also non-logical features which
can only be understood procedurally. These include input and output where a program
needs to read a file or user input, or write output to a screen or other device. These features
fall outside of logic and are controlled by non-logical predicates. They are known as side
effect predicates and in later chapters we will see some of them in use.
The procedural understanding of a Prolog program can be difficult because in contrast
to the declarative reading, we need to understand the instantiation of variables and what
alternatives are found on backtracking. Indeed different algorithms could be implemented
for different variable instantiations. This is because a program can be queried in multiple
directions. We recommend that a Prolog programmer keeps the logic of their problem
separate from any input/output requirements. This separation of concerns allows for clean
declarative logical code for core functionality, and efficient reusable input/output code in

other components of the program.

Goal termination

An important consideration in writing Prolog programs is the termination of goals. Some-


times practitioners of Prolog will run into situations where they query a Prolog system
and enter an infinite loop which they were not expecting. This is actually the unexpected
invocation of a powerful feature of Prolog, that is, it is a relational language that automat-
ically backtracks to find further solutions. It is important to remember that some Prolog
programs should not terminate. For example, the following query on backtracking will
continue forever generating ever longer lists:

?- length(Ls, _).
Ls = [] ;
Ls = [_8662] ;
Ls = [_8662, _8668] ;
Ls = [_8662, _8668, _8674] ;
...

Another important class of problems where Prolog is useful but termination may not
appear to occur, is when we are searching for properties of objects that we do not in
advance know to exist. For instance, if searching for a sub-graph with a certain substructure
within an infinite graph, and that sub-graph does not exist, then the program should not terminate [143].
In order to better understand termination in Prolog it is useful to distinguish two types
of termination [143].

• Existential: A program query Q terminates existentially if and only if we receive


an answer when posting Q. Note that this is different to functional programming
languages because functions only have one answer, whereas Prolog is a relational
language, and multiple results can be returned on backtracking.

• Universal: A Prolog query Q terminates universally if and only if the query

?- Q, false.

terminates existentially.

A logical reading of the query ‘?-Q,false.’ is always false, so if an answer is reported it


will be false.
The halting problem [146] tells us that in principle both are undecidable but observation
of calls to our predicates can give us insight into our programs. A Prolog query Q does
not terminate if and only if it does not terminate universally, i.e., if the query:

?- Q, false.
...
does not terminate.

If a query does not terminate existentially, then it also does not terminate universally
[143]. A debugging technique called failure slices [114] (and the related technique of pro-
gram slicing) makes use of the logical properties of a Prolog program in order to help a
programmer fix mistakes in their programs. This technique can produce explanations for
non-termination. The idea is to slice the program into segments and insert a false goal.
This will allow us to isolate where the non-termination originates in our program. This kind of debugging technique is not available in imperative programming languages such as Java or Python and is a key advantage of Prolog programming. It is even possible
to use an external Prolog program that will automatically generate program slices to help
find reasons for non-termination in our programs [114].
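A minimal sketch of a failure slice, using an illustrative naive list-reversal predicate rev/2 (append/3 is the standard library predicate):

rev([], []).
rev([X|Xs], Rs) :- rev(Xs, Ss), append(Ss, [X], Rs).

% The query ?- rev(Ls, [1,2,3]). yields Ls = [3,2,1] but loops on backtracking.
% A failure slice inserts false goals to hide parts of the program:
%
%    rev([], []) :- false.
%    rev([X|Xs], Rs) :- rev(Xs, Ss), false, append(Ss, [X], Rs).
%
% The remaining visible fragment never inspects the second argument, and the
% query still does not terminate, so the non-termination originates in this
% fragment rather than in the call to append/3.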

5.6 Arithmetic operations in Prolog


In Prolog there are at least three implementations of arithmetic as predicates: Peano,
constraints and low level built in predicates. We explain these here because they serve as a
good illustration of some of the trade offs that a programmer must take into consideration
when writing Prolog programs. In the programs in this thesis we attempt to choose the
best method for each problem we encounter.

5.6.1 Peano arithmetic


Peano arithmetic could be argued to be the most natural method for Prolog. In Peano
arithmetic integers would be represented by terms. For instance, 0, s(0),s(s(0)) would
mean 0,1,2.
Then, using the following predicate definition, which relates two numbers to their sum
we have defined a predicate that can add, subtract and count.

n_n2_sum(0,X,X).
n_n2_sum(s(X),Y,s(Z)):-n_n2_sum(X,Y,Z).

We can use this predicate for a variety of tasks, as demonstrated by the following queries.

% This general query can be used to generate integers on backtracking:


?- n_n2_sum(N,_,_).
N = 0;
N = s(0);
N = s(s(0));
N = s(s(s(0)));

%Any number added to zero is itself


?- n_n2_sum(0,X,X).
true.

%One plus one is two.


?- n_n2_sum(s(0),s(0),X).
X = s(s(0)).

%One subtracted from three is two.


?- n_n2_sum(X,s(0),s(s(s(0)))).
X = s(s(0)) ;
false.

%What two numbers sum to two.


?- n_n2_sum(X,Y,s(s(0))).
X = 0,
Y = s(s(0)) ;
X = Y, Y = s(0) ;
X = s(s(0)),
Y = 0 ;
false.

In fact other Prolog data structures can be used this way to implicitly represent integers.

We do not have to use the term s/1. For example we can implicitly represent the size of
a list by a list itself. This is illustrated by the following predicate definition.

n_n2_sum([],L,L).
n_n2_sum([H|T],L2,[H|L3]) :- n_n2_sum(T,L2,L3).

This predicate is normally called append/3 and it can be used to combine lists (to add two
numbers together), to split a list (to subtract one number from another), or to generate
lists of increasing length (to count). This can be achieved by making queries with different
instantiation patterns – that is with some arguments bound to explicit values so that they
are ground or other semi-instantiated, for example, [X|Y] is a list but we do not know how
long it is, or a argument could be a free (or unbound) variable.
The following shows example queries for adding 2+1, finding what numbers sum to
three and counting.

%2+1 =3.
?- append([_,_],[_],X).
X =[_,_,_].

%What numbers sum to three.


?-append(X,Y,[_,_,_]).
X = [],
Y = [_11396, _11402, _11408] ;
X = [_11396],
Y = [_11402, _11408] ;
X = [_11396, _11402],
Y = [_11408] ;
X = [_11396, _11402, _11408],
Y = [] ;

%counting
?-append(_,_,X).
true;
X = [_6662|_6434] ;
X = [_6662, _6674|_6434] ;
X = [_6662, _6674, _6686|_6434];

A disadvantage of Peano arithmetic is that algorithms for multiplication by successive


addition are slow and inefficient. Also, large numbers take up a lot of memory as they
are represented as nested terms (the actual display of a number is not important as this
could be changed with a translation program from terms to decimal numbers). For these
reasons Peano arithmetic is not often used in production programs on its own but when list
processing is needed we can often implicitly have arithmetic for free and there is no need
to have counting variables that might be required in imperative languages, for example a
counter in a loop.

5.6.2 Constraint satisfaction and constraint logic programming


An alternative method to Peano arithmetic which maintains many of the advantages (in-
cluding maintaining logical consistency in different goal orders and flexibility of querying
instantiation patterns) with fewer disadvantages (it is more efficient in terms of memory and
easier to read) is Constraint Logic Programming (CLP). CLP also offers the ability for
more complex reasoning and efficient searches for solutions which can be used for combi-
natorial problems (many bioinformatics tasks can be considered combinatorial problems -
for example sequence alignment). We make use of this method of representing and rea-
soning arithmetically throughout the code presented in this thesis. We also implement a
constraint based combinatorial subgroup discovery method in Chapter 6 of this thesis. We
use clp(b) and clp(fd) systems [145, 144]. Some Prolog versions also have systems for clp(z), clp(q) and clp(r). The clp(b) system is for variables constrained to Boolean values, clp(q) is for rational numbers, clp(r) for floating point values, clp(z) for integers and clp(fd) for finite domains. It is also possible to consider standard Prolog as a constraint system over Herbrand terms; this is referred to as clp(H). In that reading every predicate
will impose constraints on the set of solutions, because it will at most restrict the set of
solutions (i.e. never increase it). When reasoning over Herbrand terms, the only constraints
are equality and inequality of terms, which are implemented by the predicates (=)/2 and
dif/2, respectively. We will discuss the dif/2 constraint in more detail in the section on
negation in Prolog. Finally, it is possible to use Constraint Handling Rules (CHR) to
reason using user defined constraints [50].
Systems such as clp(fd) and clp(b) augment standard Prolog. The clp(fd) and
clp(b) systems introduce new syntax which becomes available when each library is loaded.
For clp(fd), operators are prefixed with the # sign. So = becomes #= and >= becomes #>=
etc.

The following query asks the question, what X when added to two equals five.

?-5 #=X+2.
X=3.

In contrast clp(b) systems will maintain standard operators but these will be enclosed
in the sat/1 predicate. This predicate will attempt to satisfy the enclosed equation. For
example, the following query shows that for X*Y to be satisfied, that is, to evaluate to true (or 1), both X and Y have to be equal to one.

?-sat(X*Y).
X=Y,X=1.

Both clp(b) and clp(fd) systems work by first issuing constraint goals, followed by
labelling goals, which will enumerate solutions. As observed in the above two examples,
if a goal can immediately be satisfied then there is no need to explicitly call the labelling
predicates. The following example illustrates a constraint satisfaction problem where this
is not the case and the labelling predicate is required to enumerate the solutions:

?-sat(X*Y + X*Z), labeling([X,Y,Z]).


X = Z, Z = 1,
Y = 0 ;
X = Y, Y = 1,
Z = 0 ;
X = Y, Y = Z, Z = 1.

Using clpfd we can define our n_n2_sum/3 predicate as:

:-use_module(library(clpfd)).
n_n2_sum(N,N2,Sum):-
Sum #=N+N2.
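This definition retains the relational flexibility of the Peano version, so it can be queried in any direction. A brief sketch of some queries (output as printed by SWI-Prolog, abbreviated):

?- n_n2_sum(2, N2, 5).
N2 = 3.

?- n_n2_sum(N, N2, 5), [N,N2] ins 0..5, label([N,N2]).
N = 0, N2 = 5 ;
N = 1, N2 = 4 ;
...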

How constraint satisfaction programs work

In order to use a CSP system we will need to define:

1. A set of variables.

2. The domains from which the variables can take values.



3. Constraints that the variables have to satisfy.

In both clp(b) and clp(fd), variables look like normal Prolog variables; however, if on querying a clp(fd) variable is instantiated to something other than an integer, a type error will be thrown. If a clp(b) variable is instantiated to something other than zero or one then an error will also be thrown. The domain of a clp(fd)
variable can initially be set with the in/2 or ins/2 predicates. For example to state that
the variable X is in the domain of 1 to 10 inclusive we would query:

?- X in 1..10.
X in 1..10.

If, on labelling of variables, there are several values that meet the constraints, then a
criterion for choosing amongst these can also be specified. This is called the labelling
strategy.
The actual algorithms for constraint satisfaction are not directly visible to a Prolog
programmer. The programmer simply interacts with the constraint solver. The intuition
behind a constraint satisfaction problem is that it can be conceived of as a hyper-graph, where the nodes are the variables and the edges are the constraints. So the constraint p(X, Y) would have two arcs: (X, Y) and (Y, X). A consistency algorithm will then control the search for a solution.
We now describe a simplified algorithm of how this might work. We assume we have the variables X and Y with the domains DX and DY, and that there is a constraint p(X, Y) present. The arc (X, Y) is said to be arc consistent if for each value of X in DX there is some value for Y in DY satisfying the constraint p(X, Y). If the arc is not consistent then each value in DX that does not have a corresponding value in DY may be deleted from DX. This is repeated until we end up with a consistent arc (X, Y).
For a concrete example, we set the domains of X and Y to the integers 1 to 10. If we have a constraint p(X, Y) : X - 4 >= Y, then the arc (X, Y) is not consistent, because if, for example, X = 3 then Y cannot take a value in DY (as 3 - 4 = -1, which is not in the domain of Y). So the domain of X would be reduced to 5 to 10. When a domain is reduced the CSP algorithm will check every other arc, as they may now not be consistent. This effect percolates through the system, sometimes in a loop, until either a domain is empty (in which case there is no answer), or an answer is reached, which may be a residual constraint or a concrete answer.

Examples:

%Example query for a constraint that can not be satisfied.


?-X#>=10, X#<2.
false.

%Example where the answer to a query is a constraint.


?- X#>=10.
X in 10..sup.

%Example query where a concrete answer is given.


?- X#>10,X#=<11.
X = 11.

In the case where constraints remain, the system can be asked to label the remaining variables.

?- X#>=10, X#<13,labeling([min(X)],[X]).
X = 10 ;
X = 11 ;
X = 12.

A user can define what order they wish to see the labelled results by using the first argument
of the labeling/2 predicate. In the previous query we asked for labels that minimise our
value of X. In this way certain classes of optimisation problems can be achieved – using
only declarative knowledge but still having good efficiency.

5.6.3 Low level arithmetic


The final method to implement arithmetic in Prolog is to use the low level operator is.

?- X is 5+2.
X=7.

However, unlike using the clp(fd) system, this can not be used in all directions. As
can be seen in the following query.

?- 5 is X +2.
ERROR: Arguments are not sufficiently instantiated
ERROR: In:
ERROR: [8] 5 is _2816+2
ERROR: [7] <user>

One advantage is that the low level approach is compatible with floating point arith-
metic, allowing for decimal numbers. However, floating point arithmetic is often the cause
of programming errors, no matter what language they are written in. In the case where
you know how much precision you require for a given problem, it will often be a good idea
to multiply out the decimal numbers so that you can use integers, which will allow you to
use the constraint libraries and will make the rounding explicit which can be hidden with
floating point operations.
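A small sketch of this idea: a price of 10.99 pounds can be represented as the integer 1099 (pence), keeping the computation within clp(fd) and making the rounding explicit.

% 20% of 10.99 pounds, computed in pence with explicit truncating division:
?- Price = 1099, Vat #= Price * 20 // 100.
Price = 1099,
Vat = 219.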

5.7 Other features of Prolog


Cut in Prolog (predicate: !/0)

The cut predicate is a special predicate that prevents back-tracking. It is an important


control construct that affects both the procedural and declarative reading of a program.
An informal description of cut is to imagine a one way gate (like a turnstile), whereas
conjunction (,) is a two way gate. When the Prolog engine moves through a ‘gate’ to
try and prove the next goal, and fails it will try and go back through the gate (called
backtracking) and attempt another way to prove this goal. However, if there is a cut, then
the Prolog engine will not back-track and will report a failure for that goal. So the cut
goal always succeeds and commits Prolog to all the choices made since the parent goal was
called.
Cuts are part of the procedural understanding of a Prolog program. They often re-
sult in programs that are logically incorrect, because (often inadvertently) they remove
logical solutions to a query. This can happen when a naive programmer uses a cut in an
attempt to improve the speed of computation of a query with one instantiation pattern,
but this actually removes valid solutions for another query instantiation pattern. In Prolog
terminology these are called ‘red cuts’ [118].

Negation in Prolog (predicate: \+/1)

Negation in Prolog can be a misunderstood concept. The core principle is that Prolog has
a closed world assumption (CWA). Under the CWA, negation is defined as negation as failure; that is, if a goal cannot be proved it is said to be false. This does not correspond to the usual logical definition of 'not'. For this reason negation in modern Prolog uses the \+/1 predicate rather than the not/1 predicate (they perform exactly the same computation)
in order to emphasise this difference. A subtlety is that \+/1 means “cannot be proven
at this time”. This means that the order of the goals is important and we lose the pure
logical monotonic reading of the program. If the goal will only ever be called when ground
this will not be a problem. When the goal is ground it can be understood declaratively,
as it is sound. However negation as failure can only be understood procedurally and not
declaratively when the goal is not ground.
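A short sketch of this order sensitivity: with a ground goal the result is reliable, but with an unbound variable the answer depends on where the negation appears.

?- X = 2, \+ X = 1.
X = 2.

?- \+ X = 1, X = 2.
false.
% In the second query X is still unbound when \+/1 is called, so the goal X = 1
% can be proved and the negation fails, even though the conjunction is satisfiable.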

The dif/2 constraint

An alternative to using \+/1 predicate is to use the dif/2 predicate. This predicate is a
constraint predicate that is used to express inequality. When constraints are in a program
or goal, the solution found is not just a binding to variables but may be a list of constraints.
These constraints can be called upon later if their results are used in further goals.
For example, if we post a dif/2 goal:

?- dif(X,1).
dif(X, 1).

Instead of returning a normal Prolog answer where a variable is bound to a value,


Prolog returns a constraint which in this case is the query goal itself.
When defining a relation in the form of a predicate it is often useful to have mutually
exclusive clauses. For example, if we wanted to say 'if an animal is a dog then it is a pet, else it is a wild animal', we can write:

animal(dog,pet).
animal(A,wild):- dif(A,dog).

This predicate maintains logical properties such as monotonicity. This means that when
we add and subtract clauses and predicates the set of solutions expands and contracts in a
predictable way. Also, the meaning of the program is not changed when the order of its literals is changed.

5.7.1 Meta-predicates
Prolog programs can themselves be represented as terms. This is an important feature that is made use of in 'Pengines', the subject of Chapter 7 of this thesis. Meta-predicates take as an argument a Prolog term and this term is treated as a goal. This allows us to call predicates that are constructed at run time. The basic meta-predicate is the call/1 predicate. This predicate returns true if the goal passed to it can be proved. From this predicate further predicates can be defined, for example, the call/N family. These predicates allow us to augment the term passed to the call with additional arguments.
Other meta-predicates apply a goal to a list of inputs. These include the maplist family, the foldl family and scanl (these are called families because there are multiple
predicates defined that accept as inputs predicates of different arities).
These predicates often replace the use of loops in imperative languages. For example, if
a bioinformatician wanted to apply a predicate to corresponding elements of several lists then they would use one of the maplist predicates (maplist/4 in the first query below). Here we use our n_n2_sum/3 clp(fd) predicate from our discussion on arithmetic, but we note that any predicate can be passed into these meta-predicates, providing much power. This first query uses maplist to sum corresponding elements from
two lists.

?- maplist(n_n2_sum,[1,2,3],[4,5,6],Sums).
Sums = [5, 7, 9].

In the next query we pass the value 3 into the first argument of maplist to add 3 to
each element.

?- maplist(n_n2_sum(3),[4,5,6],Sums).
Sums = [7, 8, 9].

Now we show a query that ‘folds’ a list from the left direction. The effect of this using
the n_n2_sum/3 predicate is to sum the list.

?- foldl(n_n2_sum,[4,5,6],0,Sum).
Sum = 15.

Next we show the scanl/4 predicate, that when used with n_n2_sum/3, predicate
produces a list of iterative values:

?- scanl(n_n2_sum,[4,5,6],0,Sum).
Sum = [0, 4, 9, 15].

These structures are more powerful than their functional equivalents (such as those implemented in the Haskell programming language) because they are relational and therefore
can be used in multiple directions, including as a generator or checker.
The next query illustrates how backtracking can be used to find constraints on numbers
(that sum up to 5).

?- foldl(n_n2_sum,X,0,5).
X = [5] ;
X = [_3074, _3080],
_3080+_3074#=5 ;
X = [_3518, _3524, _3530],
_3524+_3518#=_3550,
_3530+_3550#=5 ;
...

Finally, we demonstrate how to check that the third list is the element-wise sum of the first two lists.

?- maplist(n_n2_sum,[-1,-2,-3],[1,2,3],[0,0,0]).
true.

Immediate uses for meta-predicates that can be imagined for a bioinformatician include various traversals over graphs and applying data transformations to lists.
Four advantages of using these meta-predicates versus a loop construct are:

1. We do not need to track iterators, removing the possibility that you have an off by
one error.

2. The code is often shorter and, it can be argued, easier to understand, as we have declaratively defined the relation that the predicate is working with; we do not have to understand the internal state of a changing system, but rather we can reason about what the results will be.

3. The code is encapsulated: the relation is separated from the looping. This means that
there is more chance that your code will be correct and that your resulting analysis
and conclusions can be relied upon.

4. Flexibility of querying patterns: the same code can 'undo' a loop, so we do not need to write the opposite function and potentially have a mismatch in functionality.

Reification of predicate calls

Recent developments in the Prolog community have defined a number of meta predicates
that are made use of in this thesis. These are described in Neumerkel and Kral [113] and
are implemented in a library called “reif”.
The primary predicate that is defined in library reif is if_/3 and its purpose is to
index the dif/2 predicate described above. The reason why it is important to do this is
that we want computational efficiency while retaining logical properties (i.e. the goal order
does not matter and predicates define true relations that can be used in multiple directions
and as a generator). This means that predicate calls should be deterministic when possible,
which means the called goal should not leave open or create redundant choice points.
The if_/3 predicate is a new building block that can be used to make a large number
of useful predicates, that are both efficient and retain the logical properties desired. It
also simplifies code writing as it reduces the number of conditions in comparison to using
dif/2 directly where a programmer needs to write all the conditions twice. Once for the
positive case and once for the negative case, but with if_/3 the condition only needs to
be stated once.
In order to use if_/3 you need a reified predicate definition. A number of common reified predicates have already been defined in the library reif, such as (=)/3, (#>=)/3 and (#=<)/3.
For example =(X, E, T) is a reified instance of the disjunction X = E ; dif(X, E). These
have been combined into this new predicate which has an additional argument, T, which
is true if the terms are equal and false otherwise.
It is a simple task to define additional reified predicates when needed and from these a
large number of utility predicates can be defined that meet the properties described before.
Examples include tfilter/3 and tpartition/3. The former of these filters an input list, keeping only the elements for which the supplied reified predicate is true, for example:

?- tfilter(dif(1),[a,1,b],Xs).
Xs = [a, b].

This goal succeeds deterministically. The latter creates a partition of the input list based on the supplied reified predicate. For example:

?- tpartition(=(0), [1,2,3,4,0,1,2,3,0,0,1], Ts, Fs).


Ts = [0, 0, 0],
Fs = [1, 2, 3, 4, 1, 2, 3, 1].
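For completeness, a small sketch of if_/3 used directly with the reified (=)/3 (the exact top-level formatting of the residual constraint may vary between systems):

?- if_(X = 1, Y = one, Y = other).
X = 1,
Y = one ;
Y = other,
dif(X, 1).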

The efficiency gains of using these meta-predicates are important for dealing with the large datasets that bioinformaticians work with, as we will see in Chapter 6 of this thesis. They
also reduce the amount of code and help reduce the number of mistakes in our programs.

Meta interpreters

A powerful feature of Prolog programming is the ability to meta-program [118]. This is


achieved by means of a meta-interpreter. A meta-interpreter is a Prolog program that
takes as input another Prolog program and executes it. Meta-interpreters have a number
of uses for a bioinformatician. Firstly, by using a meta-interpreter the search strategy and selection rule used for resolution can be changed. For example, instead of top down and left to right, it could be changed to bottom up, right to left, or even a random order. This can be
helpful when using Prolog to search for solutions to a task and we wish for the search to be
complete or to explore a very large space, in which case adding some probabilistic aspects
to a search can be an effective means to traverse this space.
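A minimal sketch of the classic 'vanilla' meta-interpreter, which mirrors Prolog's own depth-first, left-to-right execution; variations of solve/1 alter how the next goal or clause is chosen (note that clause/2 may require the interpreted predicates to be dynamic or otherwise accessible on some systems):

solve(true).
solve((A, B)) :-
    solve(A),
    solve(B).
solve(Goal) :-
    Goal \= true,
    Goal \= (_ , _),      % do not call clause/2 on control constructs
    clause(Goal, Body),
    solve(Body).

% For the example program of section 5.2, ?- solve(u(X)). behaves like ?- u(X).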
Another important use for meta-interpreters is when we wish to apply the models (rules) that we have learnt from our data back to that data. Using a meta-interpreter for this involves augmenting the existing rules that are stored in our knowledgebase with the rules we have learnt, or are in the process of learning (represented as terms that correspond to the clauses of a program). This allows us to test our rules without changing our knowledgebase
[47].
Finally, advanced machine learning techniques such as the ILP system Metagol [32] use
meta interpreters to learn entire Prolog programs from examples, including the difficult
task of inventing required ‘predicates’. These techniques undoubtedly will have many
applications in bioinformatics tasks (including learning small data manipulation programs,
such as presented in [94]).

5.7.2 Second order predicates


There are a number of second order predicates available in Prolog sometimes called ‘all-
solutions’ predicates. These predicates allow us to construct sets of solutions as a data
structure, which is in contrast to the standard way of querying Prolog where we are
given one answer at a time and backtrack to find further solutions. The three second
order predicates commonly used are; findall/3, setof/3 and bagof/3. The predi-
cate findall/3 is problematic to use from a logical point of view. It has the form:

findall(Template,Enumerator,Instances), which can be read as: “Instances is the


sequence of instances of Template which correspond to proofs of Enumerator in the order
which they are found by the Prolog system” [118]. This reading is meta logical because it
depends on the instantiation of variables in Template and Enumerator and on the proof
strategy of the Prolog system. So for example if we used a meta interpreter to change the
search strategy for our Prolog program this would change the results of any query using
findall/3. Another logical issue is that the findall/3 predicate will not fail, but will return the empty list, if the Enumerator goal has no solutions.
Despite findall/3 being problematic from a logical perspective, it is a useful predicate
when requiring output from a Prolog query to either be written to a file or as a set of
solutions returned from an API call, as we will see in Chapter 7 on Reactome Pengine.
The lesson is that we should remember to avoid mixing our application logic and our
input/output code. The other two predicates having the same shape as findall/3 are:
setof(Template,Enumerator,Instances)
bagof(Template,Enumerator,Instances)
The predicate setof/3 would be used as setof(X,p(X),L). Here L is the set of instances
X for which p(X) could be proved. It will find unique solutions and sort these in order.
When there are no solutions setof/3 fails. The predicate setof/3 can not return the
empty list as there would be infinitely many ways to prove that a set of solutions does not
exist. These two properties (unique ordered solutions and unable to return the empty list)
mean that setof/3 retains some of the logical consistency that findall/3 lacks.
The predicate bagof/3 also suffers from some logical inconsistency because the bags of
solutions are ordered by the proof mechanism. However, it is useful to use when there are
free variables that are not bound in Template. It can then enumerate (on backtracking)
bindings to these variables.
In both setof/3 and bagof/3 the Enumerator for a bag or set can include existentially
quantified variables, for variables that are not bound in Template. The syntax for this is
X1^X2^Goal. This results in a similar feature to the SQL GROUP BY command where
results are grouped according to a particular variable.
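A short sketch, assuming facts of the form reaction_pathway(Reaction, Pathway) like those used in the example below:

% One answer per Pathway, with Reactions grouped by that pathway:
?- bagof(Reaction, reaction_pathway(Reaction, Pathway), Reactions).

% With Pathway existentially quantified, a single bag of all reactions:
?- bagof(Reaction, Pathway^reaction_pathway(Reaction, Pathway), Reactions).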
One useful querying pattern with setof/3 is a nested setof/3 query. This can be used
to convert implicit graphs into explicit adjacency lists. For example say we have a list of
reaction_pathway(Reaction,Pathway) facts in our database (that relate what reactions
happen in what pathways) and we wish to find all the pathways that have at least four
reactions, we could use the following query:

?-setof(Pathway,
setof(Reaction,reaction_pathway(Reaction,Pathway),[_,_,_,_|_]),
Pathways).

Note how we have used the implicit ‘list’ representation of the number four as discussed
in the section on Prolog arithmetic.
In SWI-Prolog version 7 there are also a number of other second order predicates that
are useful to know about. These are aggregate/4, aggregate/3, aggregate_all/3 and
findnsols/4. The predicate aggregate/4 internally uses setof/3, aggregate/3 uses
bagof/3 whereas aggregate_all/3 uses findall/3. All aggregate and aggregate_all
predicates are used with Template terms for either count, sum/1, min/1 or max/1. These
could also be nested in an arbitrary named compound term. For example r(min(X),max(X)).
These second order predicates provide efficient SQL-like database functions for sets of so-
lutions to Prolog queries and will often be useful for a bioinformatician. For full details of
how these work see the SWI-Prolog documentation.
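A brief sketch using the same hypothetical reaction_pathway/2 facts (aggregate/3 groups over free variables in the same way as bagof/3, whereas aggregate_all/3 does not):

% Count all reaction-pathway pairs in the knowledgebase:
?- aggregate_all(count, reaction_pathway(_, _), N).

% One answer per Pathway, giving the number of reactions in that pathway:
?- aggregate(count, Reaction^reaction_pathway(Reaction, Pathway), N).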
The final predicate we will mention is findnsols/4. This is similar to findall/3 but
limits the number of solutions found to n solutions. On back-tracking it will find the next
n solutions. This predicate is useful when using pengines (covered in Chapter 7). It also
corresponds to the SQL LIMIT statement that is available in some database systems.

5.7.3 Definite Clause Grammars (DCGs)


In this thesis we sometimes use a technique called Definite Clause Grammars (DCGs). DCGs are effectively syntactic sugar for standard Prolog syntax. They do not add any
functionality but can make reading certain programs and queries easier.
A programmer would use a DCG when they need to parse, generate or check lists. A bioinformatician will often need to do these tasks in their work. For example, to read data from a file, the syntax used for storing data in the file can be conveniently parsed by a DCG. Another example is the description of data processing pipelines of programs that will be applied to a dataset, using Prolog as the glue language. Such software pipelines are an important aspect of work for a bioinformatician. The formal description of a pipeline as a grammar can aid understanding and maintainability, as well as allowing
of pipeline as a grammar can aid understanding and maintainability, as well as allowing
logical reasoning about the pipeline.
A further example is when a bioinformatician needs to implement an algorithm that
has been described procedurally in the literature. It is possible to describe such iterative

algorithms as lists of state changes and a DCG is an appropriate way to model this. In order
to implement the threading of state, you would use semicontext notation, which requires
fewer arguments (than not using a DCG), making it easier to read and understand. For
more information on this technique see [143].
A DCG predicate is a rule with a head and a body. The body consists of terminals and non-terminals. A terminal is a list, which represents the elements it contains. A non-terminal is a reference to another DCG rule, which represents the elements that that rule itself describes. We refer to a DCG predicate as f//n (f being the functor and n the arity); this distinguishes it from the standard predicate indicator in Prolog (which only has one /). An important feature of DCGs is that it is possible to insert regular Prolog goals by enclosing them in braces {} inside the DCG rule.
The following DCG describes DNA, as a list of a,c,g and t elements.

dna --> [].


dna --> [a], dna.
dna --> [c], dna.
dna --> [g], dna.
dna --> [t], dna.

To invoke a grammar rule we use the phrase/2 (or phrase/3) predicate, as shown in the following exam-
ple:

%Is the list [a,c,g,t] accepted by this grammar?
?-phrase(dna,[a,c,g,t]).
true.
%Generate examples from this grammar.
?-phrase(dna,X).
X = [];
X = [a];
X = [a, a];
X = [a, a, a];
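To illustrate the embedding of ordinary Prolog goals in braces, the following sketch (assuming library(clpfd) is loaded) relates a DNA sequence to its length and, like the grammar above, can be used both to check and to generate sequences:

dna_length(0) --> [].
dna_length(N) --> { N #> 0, N #= N0 + 1 },
                  [Base],
                  { member(Base, [a,c,g,t]) },
                  dna_length(N0).

?- phrase(dna_length(N), [a,c,g]).
N = 3 ;
false.

?- phrase(dna_length(2), Seq).
Seq = [a, a] ;
Seq = [a, c] ;
...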

Another useful predicate is phrase_from_file/3 which allows us to apply a DCG


directly to an input file. This predicate is defined in library PIO [115]. This functionality
would be frequently used by a bioinformatician wishing to read biological data files into
Prolog data structures.

In order to see what a DCG predicate looks like as a standard Prolog predicate you
can query your Prolog system with the listing/1 predicate. For example:

?-listing(dna).
dna(A, A).
dna([a|A], B) :-
    dna(A, B).
dna([c|A], B) :-
    dna(A, B).
dna([g|A], B) :-
    dna(A, B).
dna([t|A], B) :-
    dna(A, B).

There are some key advantages to using DCGs in bioinformatics tasks. First, they are very
readable; only a few predicate definitions will actually need to be used or modified, and
these will have fewer arguments than non-DCG predicates accomplishing the same task.
Second, using a DCG we explicitly describe our lists where we need them, making it clear
what our data structure is and formally describing its requirements, with automatic checking
that our data meets these requirements. This makes for better self-documentation and less
chance of mistakes in our programs.

5.8 Prolog documentation


An important consideration for a bioinformatician writing computer code is how to docu-
ment their programs, so that the code can be maintained by the author and by others.
Good documentation facilitates reproducible research by allowing other scientists to read
and understand code that has been used for an analysis.
Documentation for an entire software project will often need three components.

1. An overview or guide (often in the form of a scientific paper in science applications).

2. A tutorial – often including demonstrations of code and allowing the opportunity for
a new user to play with concepts in order to facilitate understanding.

3. A technical reference – a detailed technical description of each part of the code that
users will interact with.

In part three of this thesis, when we describe the Reactome Pengine tool, we will illustrate
these three types of documentation.
The declarative nature of Prolog aids understanding, as a user can read a predicate's
name to determine its function. To a large extent this can mean that Prolog code is
self-documenting. In order for this to be true, effort does need to be made to name
predicates correctly. Predicates should be named so that they declare relations; verbs
should be avoided because they imply a direction of use and a procedural reading, and
hence can lead developers to miss some of the functionality that their program provides.
The exception to this rule is when a predicate controls a side effect, for example writing
output or reading input. Hence, it is a good idea for predicate names to correspond to
their arguments, and where possible we adopt this convention.
Apart from the names of predicates, code can be made easier to use by providing com-
ments, which allow greater detail about a predicate's usage to be given as reference material.
The PlDoc system is provided by SWI-Prolog to generate reference documentation. PlDoc
uses structured comments, from which it can generate high-quality documentation. These
structured comments can be used in two ways. Firstly, the system can automatically gen-
erate LaTeX documentation for publication as a PDF file. Secondly, and more powerfully,
it provides a web server that allows a developer to browse documentation in situ and share
live documentation with others in a simple manner.
The structure of a PlDoc comment allows for a free-text description of a predicate
but also, optionally, semi-formal information. For example, although Prolog is a typeless
language (variables can be bound to anything), it is often useful to specify types for the
intended use of an argument in a predicate. These informal types are often referenced in
the literature and the documentation of a Prolog system. There are no standard types, so
care must be taken to be consistent; for example, do not use both ‘int’ and ‘integer’ to refer
to an argument expecting an integer. The second feature that can be documented for an
argument is its expected ‘mode’, i.e. whether the argument is to be bound to a ground or
semi-ground term (input arguments) or is expected to be a free variable or at least a
structure that is only partially ground/instantiated (output arguments).

The recommended style for showing an argument's mode uses the following symbols:

• ++ argument is fully ground at call-time.

• + argument is fully instantiated at call time, i.e. it is an input; it does not necessarily
have to be fully ground.

• - argument is an output; it may be unbound or a term with free variables.

• -- argument is completely unbound.

• ? argument is bound to a partial term.

• : argument is a meta argument.

• @ argument will not be further instantiated than at call time.

• ! argument contains a mutable structure.

A Prolog compiler will not enforce anything that is written in the Prolog documentation
concerning types and modes; the structured comments are solely intended for human read-
ers. Additionally, a predicate can be marked as deterministic, semi-deterministic, non-
deterministic or multi, as follows.

• If a predicate is not expected to fail and can only generate one value it is deterministic.

• If a predicate leaves choice points and may give multiple answers but could also fail,
then it is non-deterministic.

• If a predicate can return multiple answers but cannot fail it is multi.

• If a predicate can succeed precisely once or fail then it is semi-deterministic.

Sometimes a developer will decide to document each predicate with multiple type and
mode declarations to show each intended use. For example, the length/2 predicate has
the following documentation:

length(+List:list, -Length:int) is det. % 1.


length(?List:list, -Length:int) is nondet. % 2.
length(?List:list, +Length:int) is det. % 3.
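As an illustration, a structured comment for a hypothetical predicate (gene_length/2 and
gene_sequence/2 are invented here purely to show the PlDoc layout) might look as follows:

%! gene_length(+Gene:atom, -Length:integer) is semidet.
%
%  True when Length is the number of bases in the sequence stored for
%  Gene. Fails if no sequence is known for Gene.
gene_length(Gene, Length) :-
    gene_sequence(Gene, Sequence),
    length(Sequence, Length).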

5.9 SWI-Prolog version 7 extensions


This thesis mainly involves development and experiments using SWI-Prolog [159]. This
Prolog is free, open source and well maintained. There are many other Prolog implementa-
tions, including commercial Prologs such as SICStus [27, 17].
SWI-Prolog has a number of differences compared to ISO standard Prolog that are
worth highlighting. Two important differences are SWI's string data type and dicts
data structure. Strings are text enclosed in double quotes, e.g. "This is a string". This
representation of a character sequence allows more efficient computation compared to
atoms and has other internal benefits. In contrast, ISO-compliant Prologs typically read
double-quoted text as a list of character codes. For more details see the SWI-Prolog
documentation.
The second important difference is that SWI-Prolog offers a dictionary data structure.
Dicts provide a notation for accessing named members. The order in which items are stored
in a dict is not significant, so dicts are a useful tool when many items need to be stored in
a single data structure, which would prove awkward as a long standard Prolog compound
term. Dicts use the dot operator, which in standard ISO Prolog is the functor of list cells.
This means that in SWI-Prolog the functor for a list is no longer the dot, but the term [|].
For full details see the SWI-Prolog documentation.
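The following queries give a brief illustration of these two extensions; the dict tag and keys
are arbitrary examples invented here.

?- string_length("This is a string", Length).
Length = 16.

?- Sample = sample{id: 'Id1', tissue: lesional, reads: 52000},
   Reads = Sample.reads.
Sample = sample{id:'Id1', reads:52000, tissue:lesional},
Reads = 52000.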
Chapter 6

Descriptive rule induction from biological data in Prolog - subgroup discovery

In this chapter we demonstrate a process for analysing biological datasets where we
wish to understand the differences between groups of data. We show that this common
biological task can be mapped to the machine learning task of subgroup discovery. To
do this, we first describe formally what subgroup discovery is (section 6.1) and survey
existing algorithms that solve this task (section 6.1.5). We then identify example research
questions from the biological literature that have not been explicitly framed as subgroup
discovery tasks (section 6.1.7). We describe some existing bioinformatics methods (e.g.
Bumphunter) and show how these can be framed as subgroup discovery tasks. This allows
us to benefit from the formal subgroup discovery frameworks that exist, to find a wider
range of differences between instance groups.

With two identified example problems, and the framework for subgroup discovery, we
show how using Prolog to implement programs to solve this task is appropriate. We first
make a direct mapping from the formal description of the subgroup discovery task to
a prototype declarative program. This program performs an exhaustive search for rules
that characterise subgroups of the data (section 6.2.1). We run experiments to show that
this prototype method is not able to mine the large datasets in our examples. Next, we
demonstrate the algorithmic thinking required to implement two further versions
of subgroup discovery in Prolog. These versions take into account procedural aspects of
Prolog's operation and are able to mine larger datasets. Using these implementations we
demonstrate the ability to find rules that describe subgroups of CpG sites that are different
in cancer (section 6.3.1) and subgroups of microbes that are present or not in lesional
psoriasis samples (section 6.3.2).

6.1 An introduction to subgroup discovery


Subgroup discovery is a type of descriptive data mining [68, 117]. It is a machine learn-
ing task, like classification and clustering. In contrast to the task of classification, which
attempts to learn a model (or theory) of a concept that can be used for prediction, the
goal of the data analysis here is to discover individual patterns that describe regularities in
the data. Other descriptive data mining tasks include association rule learning and cluster-
ing. However, subgroup discovery, in contrast to association rule learning and clustering,
is a supervised learning task: each instance has a label of specific interest (a class).
In subgroup discovery we wish to find large population subgroups that have a significantly
different class distribution from the entire dataset. The results of subgroup discovery are
individual rules where each rule's consequent (head) is the class label of interest. In con-
trast to a classification task, these rules do not form a rule set or rule list to be used
for prediction; rather, the single rules are of interest because they reveal interesting
properties of the data. By implementing subgroup discovery algorithms in Prolog, we end
up with rules that can be immediately applied and combined in our knowledge base.
The task of subgroup discovery was initially defined by Klosgen [80] and Wrobel [166]
as: Given a population of individuals and a property of those individuals that
we are interested in, find population subgroups that are statistically ‘most
interesting’, e.g., are as large as possible and have the most unusual statistical
(distributional) characteristics with respect to the property of interest.
A subgroup is defined by a rule; the rule is a description of the subgroup. It consists
of a conjunction of features that are characteristic for a selected class of individuals. A
subgroup description B is the condition part of a rule SubgroupDescription -> C, where
C is the property of interest (the class). Subgroup discovery is a special case of a more
general rule learning task.
An instance is described by a vector of attribute values (or, in the multi-instance
case, a bag of feature vectors), and examples are instances labelled by the class label.
An instance is covered by a rule when its description satisfies the rule conditions. An
example is correctly covered by the rule if it is covered and the class of the rule matches
the class label of the example. An example is incorrectly covered if its description
satisfies the rule conditions but the class of the rule does not match the class label of the
example. Figure 6.1 illustrates examples of how rules can cover examples.

Formally, a subgroup discovery task is defined as follows. Given:

• a data description language defining the form of data,

• a rule description language, defining the form of rules,

• a coverage function COVERED(r,e), defining whether rule r covers example e

• a class attribute C, and

• a set of examples ε, instances for which the class labels are known, described in the
data description language

Find:

• Subgroup descriptions in the form of individual rules R, formulated in the rule de-
scription language, each of which should be as large and as consistent as possible.

In a classification task we would look for a minimal rule set which is complete and
consistent but in subgroup discovery this is relaxed. The union of the subgroups does
not need to have complete coverage of the positive class as we do not need to describe
every instance as belonging to a subgroup. Additionally a subgroup can have a fairly large
coverage of the negative class and still be interesting. Finally, instances can belong to
multiple subgroups so there can be some redundancy because we are then able to describe
multiple facets of the instances.

6.1.1 Data description language


In this chapter we use two different data description languages. The first is the attribute-
value representation, which is the most common representation used in machine learning
tasks. In this case an instance description has the form (v_{1,j}, . . . , v_{n,j}), where each
v_{i,j} is the value of attribute A_i, i ∈ {1, . . . , n}. An attribute can either have a finite set
of values (discrete) or take real numbers as values (continuous or numerical). An example
e_j is a vector of attribute values labelled by a class label, e_j = (v_{1,j}, . . . , v_{n,j}, c_j),
where each v_{i,j} is a value of attribute A_i and c_j ∈ {c_1, . . . , c_C} is one of the C
possible values of the class attribute C. The class label is also known as the target attribute
in subgroup discovery. Attribute-value data can be seen as a table with columns for
attributes and rows (tuples) for each example, as shown in Table 6.1.

Figure 6.1: If target classes are not disjoint then a rule cannot be complete and consistent
[53]

InstanceId A1 A2 A3 C
Id1 a c 1.1 Pos
Id2 b z 0.2 Neg
Id3 q c 1.1 Pos
... ... ... ... ...
Idn A1n A2n A3n Cn

Table 6.1: An attribute-value data table

The second data description language used in this chapter is the multi-instance (MI)
representation. This representation is suitable for noisy data, where we cannot label each
individual, or, as we will see in our microbiome example, when we are not sure which set
of attributes applies to an example. Here examples are bags of tuples (individuals) and the
class label is applied to the bag rather than to each individual tuple, such that an instance
has the form {(v_{1,j}, . . . , v_{n,j})_1, (v_{1,j}, . . . , v_{n,j})_2, . . . , (v_{1,j}, . . . , v_{n,j})_B},
where B is the number of individuals in the bag. Each instance can have a different
number of individuals in its bag. An example in the multi-instance data representation is
e_j = ({(v_{1,j}, . . . , v_{n,j})_1, (v_{1,j}, . . . , v_{n,j})_2, . . . , (v_{1,j}, . . . , v_{n,j})_B}, c_j). If we
represent an MI dataset as a table, multiple rows will correspond to each instance, as shown
in Table 6.2.

InstanceId A1 A2 A3 C
Id1 a c 1.1 Pos
b z 0.2
q c 1.1
Id2 q c 0.2 Neg
r c 0.1
v z 1.0
Id3 z v 0.2 Pos
e c 0.1
a q 0.3
... ... ... ... ...
Idn ... ... ... cn

Table 6.2: A multi-instance data table

In our Prolog implementations an AVL instance will be represented by a flat list
[a1,a2,a3,class] or a simple term e(a1,a2,a3,class), and an MI instance will be represented
by a list of terms [i(a1,a2,a3),i(a1,a2,a3),class], or a nested list [[a1,a2,a3],[a1,a2,a3],class].

6.1.2 Rule language


A subgroup rule will have the form:
IF f_1 AND f_2 AND . . . AND f_l THEN Class = c_i.
The condition part of the rule is a logical conjunction of features, where a feature f_k
is a test that checks whether an example has the specified property or not. The number of
features in a rule, l, is the length of the rule.
Features have the form A_i = v_{i,j} for discrete attributes and A_i < v or A_i >= v for
continuous attribute values. In formal logic the rule would be written as:
f_1 ∧ f_2 ∧ . . . ∧ f_l ⇒ c_i
And in Prolog-like syntax this would be:
Ci :- f1, f2, ..., fl.

6.1.3 Coverage function


The coverage function is different in the Attribute Value Learning (AVL) case and the
Multi-Instance Learning (MIL) case. In the AVL case an instance is covered if its repre-
sentation satisfies the rule conditions. In contrast, in the MI learning case an instance is
covered if at least one individual in the instance's bag satisfies the rule.
This difference can easily be accommodated in our Prolog implementations by prefixing
our coverage predicate with a call to the memberd_t/3 predicate from library reif, which
will pull an individual out of the bag of an instance and check for satisfaction in the same
way as the AVL instance coverage check. Alternatively, we can adapt the coverage function
by applying the AVL coverage function to all the individuals in each bag, counting how
many individuals in the bag are covered, and then ensuring that this count is at least one,
to obtain the overall MIL coverage function. Regardless of the setting, the coverage function
determines whether a rule is complete, consistent, or neither. An illustration of this is
shown in Figure 6.1.
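As a minimal sketch (assuming a covers/2 predicate that implements the AVL coverage
check described above; neither predicate below appears in the later code blocks), the two
settings differ only in how an instance is passed to the check:

%An AVL instance is covered if its description satisfies the rule.
avl_covered(Rule, Instance) :-
    covers(Rule, Instance).

%An MIL instance (a bag of individuals) is covered if the number of
%covered individuals in its bag is at least one.
mil_covered(Rule, Bag) :-
    include(covers(Rule), Bag, CoveredIndividuals),
    length(CoveredIndividuals, NumberCovered),
    NumberCovered >= 1.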

6.1.4 Learning subgroup rules: the general framework


Each algorithm for learning subgroups has two distinct stages: feature construction and
rule construction. In feature construction we transform the data from the attribute values
of the instances to the features of the instances. Algorithms for feature construction aim
to create a minimal set of features where each feature is relevant. Optionally, after feature
construction, feature selection algorithms can be used to further reduce the number of
features that go forward to the rule construction stage. A good feature is able to distinguish
between many positive-negative pairs. That is, if we took each possible pair of positive and
negative instances, the number of times a feature can distinguish between them is a mark
of how good that feature is. A feature that cannot distinguish between any positive-negative
pairs is completely irrelevant and can be immediately discarded from consideration. In our
tasks, in order to generate features, we use the algorithm in Figure 6.2, first presented in
[54]. This algorithm generates features separately and independently for each attribute.
The features that it builds are sufficiently rich that complete and consistent rules can in
principle be built from them when the data allows it [53]. In the multi-instance case, each
individual in the bag is assumed to have the label of the bag for the purposes of feature
construction.
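A minimal sketch of this idea for a single discrete attribute is shown below;
attribute_positives_features/3 is a hypothetical helper invented here, not the full algorithm
of Figure 6.2. It proposes one Attribute=Value candidate feature for every value that the
attribute takes in the positive examples.

attribute_positives_features(AttributeIndex, PositiveExamples, Features) :-
    findall(AttributeIndex=Value,
            ( member(Example, PositiveExamples),
              nth1(AttributeIndex, Example, Value) ),
            Values),
    sort(Values, Features).   %remove duplicate candidate features

%?- attribute_positives_features(2, [[a,c,1.1],[b,z,0.2],[q,c,1.1]], Fs).
%Fs = [2=c, 2=z].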
Figure 6.2: An algorithm for the transformation of a set of attributes into a set of features
[53]

Another consideration when constructing features is which strategy to use to handle
missing attribute values. In this chapter we adopt the pessimistic value strategy to handle
missing values; other strategies are detailed in [53]. The motivation for this strategy is
that unknown values should not affect the quality and potential usefulness of a feature in
rule construction: a feature should not be able to distinguish between a positive-negative
pair when one of the instance values in the pair is unknown. Therefore, if a value is missing
for a feature in a positive example then that feature value is set to false, and it is set to true
for a negative example. Thus the resulting positive-negative pairs cannot be distinguished
by that feature. This results in a smaller number of discriminated pairs, meaning that these
features are penalised because they are built from attribute values with unknown values.
When feature construction is performed explicitly as a first step in rule learning, the result
is a coverage table. Table 6.3 is an example coverage table for an AVL dataset (an MI
dataset would be suitably adapted).

InstanceId F1 F2 F3 ... Fl C
Id1 T F T ... T Pos
Id2 T T F ... F Neg
Id3 F T T ... T Pos
... ... ... ... ... ... ...
Idn {T∨F} {T∨F} {T∨F} ... ... Cn

Table 6.3: A coverage table

6.1.5 Existing rule learning algorithms


The second stage after feature construction is rule construction. Rule learning algorithms
can be characterised according to two key properties. The first property is how thorough the
search for rules is – algorithms can either use a heuristic search covering some of the search
space (to a greater or lesser extent) or an exhaustive search. Heuristic algorithms search
the rule space by following a heuristic; they are not guaranteed to find a globally optimal
rule but will instead find a local optimum. In contrast, exhaustive learning algorithms
are guaranteed to find the optimum rule but can be computationally expensive, hence
impractical for many problems. A second key property that can be used to characterise
rule learning algorithms is the strategy for learning multiple rules. Algorithms can either
learn a set of rules all together or each rule can be learnt sequentially one after the other.
For heuristic learning algorithms the search space can be structured in order to direct
the order in which rules are found; the structuring can be either general-to-specific or specific-to-general.
General-to-specific, ‘top down’ learners start with the most general rule and repeatedly
specialise it by adding features. As features are added the number of instances covered is
reduced and the aim is to reduce coverage of negative instances while maintaining coverage
of positive instances. In contrast, specific-to-general, ‘bottom up’ learners start with the
most specific rule covering an example and then generalise the rule until it cannot be
generalised further. Another alternative search strategy for heuristic algorithms is not to
impose a structure on the search space but instead to use a randomised beam. Randomised
beam search algorithms include Evolutionary Algorithms and Genetic Algorithms (EA, GA),
both of which use a search method inspired by evolutionary biology. A further randomised
beam approach is Ant Colony optimisation, which uses a search method inspired by social
insect behaviour. Table 6.4 details existing subgroup discovery algorithms, including which
search strategy they use and whether they learn a set of rules all at once or learn rules
sequentially one by one.
General-to-specific search is guided by a heuristic and is the most common search
strategy. A number of these algorithms such as CN2-SD can be set so that the search
maintains a ‘beam of rules’ rather than a single rule. Specific-to-general search is useful
when there are too few instances available for the incremental search of general-to-specific
to be effective; however, it is not commonly used in subgroup discovery and has had the
most success in Inductive Logic Programming applications.
General-to-specific heuristic search algorithms will learn one rule at a time. In order to
stop the same rule being learnt repeatedly, instances that have already been covered by a
rule need to be penalised. In classification rule learning, covered instances are removed
completely from the subsequent rule learning process. In subgroup discovery, however, it
may be interesting for an instance to be described in multiple ways, and because, unlike in
a classification task, each subgroup rule is independent (the rules do not collectively form a
rule list), completely removing covered instances is not appropriate. Hence, algorithms such
as CN2-SD and RSD employ a method where a weight for each instance is maintained:
when an instance is covered by a rule, its weight is reduced. The heuristic used to guide the
search is also adapted (from classification) to take into account the weights of the instances
covered. Alternatively, the adaptation of AQ (Algorithm Quasi-Optimal) in [138] keeps a
list of uncovered positive examples and restricts any feature added to a rule to be a value
of a randomly chosen uncovered instance. This guarantees that this instance will be covered.
Specific-to-general algorithms will also learn a single rule for a given set of input in-
stances, and in this way the number of subgroups can be predefined. However, the effect of
this search strategy is that the question becomes ‘how can we describe what these instances
have in common, compared to the negative class?’ rather than ‘What are interesting sub-
groups of this data?’ because we have generalised a set of instances. Another potential
problem with this approach is that if a large number of instances are generalised into
one subgroup rule this will often result in very long rules (potentially infinite if the data
description language is sufficiently rich) which are difficult to interpret.
Exhaustive algorithms will output a complete set of rules; the researcher can then set
a cut-off quality value or a fixed maximum number of subgroups. Randomised beam
algorithms such as genetic algorithms will output a set of rules equal to the beam width
(population), which is normally set to between 10 and 100. These can then be further
filtered using a quality criterion (e.g. all rules above an accuracy threshold), by taking the
top k subgroups, or by using a statistical test to see how likely a permuted dataset would
be to result in subgroups of this size.

Exhaustive rule learning algorithms:
Name           Citation
Explora        [80]
SD-Map         [7]
Apriori-SD     [77]

Heuristic rule learning algorithms:
Name           Search-Order          Learns-Sequentially/Learns as a set   Citation
SD             General-to-Specific   Sequentially                          [54]
AQ-Family      General-to-Specific   Sequentially                          [101]
Explora        General-to-Specific   Sequentially                          [80]
Prism          General-to-Specific   Sequentially                          [19]
CN2-SD         General-to-Specific   Sequentially                          [90]
Midos          General-to-Specific   Sequentially                          [166]
Subgroupminer  General-to-Specific   Sequentially                          [81]
RSD            General-to-Specific   Sequentially                          [92]
Golem          Specific-to-General   Set                                   [106]
SDIGA          Randomised-Beam       Set                                   [34]
ANT-Miner      Randomised-Beam       Set                                   [120]
CGBA-SD        Randomised-Beam       Set                                   [95]

Table 6.4: Subgroup discovery algorithms
Heuristic general-to-specific algorithms and genetic algorithms can be used with a va-
riety of heuristics to guide the search. In fact, for rule learning algorithms that work by
sequentially finding rules one after the other, it is possible to employ two different heuris-
tics – one for learning each individual rule and one for selecting rules [150]. An effective
heuristic is the Weighted Relative Accuracy [92]; it is a generalisation of the rate difference
heuristic that works with weighted instances (required in the sequential learning strategy).
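For reference, the weighted relative accuracy of a rule Cond -> Class is commonly defined
as WRA(Cond -> Class) = p(Cond) · (p(Class | Cond) − p(Class)); it trades the generality
of the rule (its coverage p(Cond)) against the gain in accuracy relative to the default class
distribution, and in the weighted covering setting the probabilities are estimated from the
current instance weights rather than from raw counts.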

6.1.6 Rule length


Rule length is an important consideration in subgroup discovery [138]. Although it
can be argued that shorter rules offer better predictive performance (shorter rules are
more general and therefore tend to cover more examples, allowing stronger statistical
statements), this property is not of exclusive importance when learning rules to describe
subgroups. This is because we are seeking to describe our given datasets, not to make
predictions on unseen data. Michalski [102] has noted that there are two different types
of rules: characteristic and discriminative. Stecher et al. [138] give the following example to
show the difference between these.
Example discriminative rule:
elephant :- trunk.
This rule states that an animal with a trunk is an elephant and allows us to quickly
discriminate elephants from other animals. However, this rule fails to show what collection
of characteristics is typical of an elephant. We do not see that an elephant has thick grey
skin, has long tusks and is massive.
Consider instead the following characteristic rule:
trunk, tusks, greyskin, massive :- elephant.
Here the implication sign is the opposite way round from before: we give the set of properties
that are implied by the target class.
Characteristic rules are related to formal concept analysis [102]. In Michalski’s termi-
nology a concept is both discriminative and characteristic i.e. where the head is equivalent
to the body. Therefore, if we reverse the implication (we can do this because of the
equivalence) to the standard rule direction, we obtain the following rule:

elephant :- trunk, tusks, greyskin, massive.


This form of rule can be learnt with subgroup discovery algorithms. We could also wish
that a subgroup mining algorithm finds longer rules such as:
elephant :- trunk, tusks, greyskin, massive, bigears.
elephant :- trunk, tusks, greyskin, massive, smallears.
These rules correspond to the African and Asian subgroups of elephants. This type of rule
(i.e. long, characteristic) will often be useful when mining datasets collected from cancer
and psoriasis patients, where we would like to discover previously unknown properties of
psoriasis or cancer; we tackle these tasks in section 6.3.
Stecher et al [138] propose an algorithm DoubleBeam-S that uses an inverted heuristic
for rule selection (and standard heuristics for single rule construction), specifically the m-
estimate heuristic, in order to find longer rules which have good coverage. However they
show that inverted heuristics do not work well for sparse datasets because the rules found
have many negative properties i.e. the rule describes many properties that are NOT the
case for a class. This is unintuitive in many cases (you would not describe an elephant
as ‘not having wings’ and ‘not having fins’ . . . ). They also show that in the case of
sparse datasets, inverted heuristics may only learn a long single rule for each class - so this
technique does not always find subgroups of the data.

Illustrative example

Table 6.5 shows an artificial example for subgroup discovery. Each row in the table is a
microbe and each attribute (column) corresponds to some property of that microbe. The
class label for each instance describes whether the environment in which the microbe lives
is the sea or not.

No Cell Arrangement Size Shape Breathes Oxygen Lives in the sea


1 Clusters Medium Circular N N
2 Clusters Medium Circular Y N
3 Clusters Small Circular N Y
4 Chains Large Spiky N Y
5 Chains Small Spiky Y Y
6 Singles Medium Circular N N
7 Chains Medium Spiky N Y
8 Singles Large Spiky N Y
9 Singles Medium Spiky Y Y
10 Singles Small Spiky N Y
11 Clusters Small Spiky N Y
12 Singles Large Circular Y N
13 Chains Large Spiky Y N
14 Singles Large Circular N Y

Table 6.5: Illustrative dataset

From this data three example subgroups can be found (Table 6.6).

Rule                                                                Coverage
IF Size = Medium AND Shape = Circular THEN ‘lives in the sea’ = N   Yes (0/9), No (3/5)
IF Size = Medium THEN ‘lives in the sea’ = N                        Yes (2/9), No (3/5)
IF Shape = Spiky THEN ‘lives in the sea’ = Y                        Yes (6/9), No (1/5)

Table 6.6: Example rules from the data in Table 6.5
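To connect this back to the Prolog-like rule syntax of section 6.1.2, the third rule of Table 6.6
could be written as the following clause, assuming a hypothetical microbe/5 fact per row of
Table 6.5 with arguments (Id, CellArrangement, Size, Shape, BreathesOxygen):

%microbe(Id, CellArrangement, Size, Shape, BreathesOxygen).
microbe(1, clusters, medium, circular, n).
microbe(4, chains, large, spiky, n).
%... the remaining rows of Table 6.5 ...

lives_in_sea(Id) :-
    microbe(Id, _CellArrangement, _Size, spiky, _BreathesOxygen).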

6.1.7 Formulation of common bioinformatics tasks as subgroup discovery tasks
A number of data mining tasks in the fields of biology and medicine have been explicitly
framed as subgroup discovery tasks. These include mining Coronary Heart Disease
(CHD) datasets [54] for patient screening and early detection of patient groups at risk of
CHD, mining brain ischemia datasets [85] to characterise stroke patients, and mining
sonographic examination datasets [8] to identify interesting diagnostic patterns to supple-
ment medical documentation. Subgroups of patients with different types of cancer have
also been mined for [55]; the data table organisation for this study is shown in Table 6.7.

Patient Id Gene1 Gene2 ... GeneP Class


1 Expressed Not-Expressed ... Expressed Cancer
2 Not-Expressed Expressed ... Expressed Cancer
3 Expressed Not-Expressed ... Expressed Healthy
4 Not-Expressed Not-Expressed ... Not-Expressed Healthy

Table 6.7: How the data table is organised for studies such as [55]. There are P genes
(attributes) and N patients. N << P .

In that study it was shown how to mine patients' gene-expression data in order to find
rules of the following form:

PatientHasCancer <-
Gene1=Expressed ^ Gene2=Expressed ^ ... ^ GeneP=NotExpressed.

This method of data analysis from Gamberger et al. [55] has the advantage that the
found rules are easier to understand than a model built from many thousands of small
contributions of gene expressions using techniques such as support vector machines.
However, as recognised in that paper, the large number of genes compared to the number
of patients [82] remains statistically difficult. An alternative way to analyse datasets such as
this is, instead of attempting to find disease markers for patients by finding subgroups of
patients, to attempt to find subgroups of expressed genes that differ between the exper-
imental conditions (depicted in Table 6.8). This allows researchers to better understand
how the described subgroups of genes affect a phenotype or experimental condition. It
has been recognised [89, 142] that the data analysis technique of Gene Set Enrichment
Analysis (GSEA), a common biological data analysis technique, is equivalent to restricting
subgroup rules to having a single feature. In effect the subgroups are predefined and are
not searched for; the data analysis simply finds which ones are significant. To see this,
take note of the following rule:
GeneIsDifferentlyExpressed <- GO-Term1.
This rule would typically be described by a biologist as “Genes annotated with GO-Term1
are enriched” and would be found by performing GSEA with genes that have been labelled
with Gene Ontology terms on a set of data from an experiment or set of phenotypes.

GeneId GoTerm1 GoTerm2 . . . GoTerm Differentially Expressed in cancer


Gene1 Yes No ... Yes Yes
Gene2 No Yes ... Yes Yes
Gene3 No No ... Yes No
Gene4 No No ... No No

Table 6.8: How the data table can be organised for studies such as SEGS[142]. There are
P Gene Ontology Terms (attributes) and N Genes. N >> P .

Where this has been recognised, researchers have used subgroup rule learning algo-
rithms to make new ‘gene sets’ by taking conjunctions of existing sets. These new sets are
the intersection of all the sets in a rule's features. In the paper ‘Contrasting subgroup
discovery’ [89] a terminology comparison between these two fields is given, and we recreate
it here in Table 6.9.

Subgroup discovery                             Bioinformatics: Gene Set Expression Analysis
Object or instance                             Gene
Attribute value or feature value               Annotation or biological concept, e.g. GO Term
Class attribute                                Gene expression under a specific experimental condition such as a specific time point or phenotype
Class or class attribute value, e.g. pos/neg   Differentially/non-differentially expressed gene
Subgroup of objects                            Gene set
Interesting subgroup                           Enriched gene set

Table 6.9: A translation of the vocabulary used in the subgroup discovery literature and the
gene set expression analysis literature, from the paper ‘Contrasting subgroup discovery’ [89].

If we turn our attention to the related field of epidemiology then we can see other data
analysis techniques that we can recognise as restricted forms of subgroup discovery. For
instance, a common task in the field of epidemiology when mining epigenetics data is to
search for Differentially Methylated Regions (DMRs). DMRs are thought to have an effect
on cancer and other diseases. Two examples of methods to find DMRs are Adjacent Site
Clustering (A-Clustering) [136] and ‘Bumphunter’ [75], which we will now briefly describe.

A-Clustering

A-Clustering is a method for the detection of co-regulated methylation regions, and re-
gions associated with an exposure. CpG sites within regions identified by A-Clustering are
modelled as multivariate responses to an environmental exposure, assuming that the ex-
posure affects all CpG sites in the region equally. A-Clustering first identifies methylation
regions based on correlation between methylation sites, independently of any exposure. It
then analyses these regions to identify those affected by an exposure. The clustering that
A-Clustering performs includes possible restrictions based on the distance between CpG
sites on the DNA. The algorithm can also have a pre-step called dbp-merge, which merges
highly correlated sites that are physically close on a chromosome into sets.
In effect, this technique is clustering instances (CpG sites) and then testing which of
these clusters are associated with the phenotype or experimental condition of interest. As
this method does not take the labels of instances into account when building clusters it
is not using all the information available to build the groups which subgroup discovery
algorithms use.

Bumphunter

In contrast to A-Clustering, Bumphunter does use the labels of the instances. The red
lines in Figure 6.3 (panel B) are the cut-off that defines which class each CpG site is in.
Viewing this in the context of our formal description of subgroup discovery, we can see this
method as restrictions on the data description language and rule description language.

The ‘bump’ or DMR in the image can be described by the subgroup rule:
CpGDifferentInCancer <- CpG-Location >= 42233400 ^ CpG-Location < 42234400
Thus the rules are all built from one attribute (the genome location) and have a fixed length
of two features, one using >= and the other <, with the threshold of the second strictly
larger than that of the first. Recognising that research activities such as searching for DMRs
are equivalent to subgroup discovery allows us to formulate the problem in the formal manner
of subgroup discovery and make use of the knowledge obtained in the machine learning
literature. This allows us to expand and generalise these methods by adding further
attributes, or changing the attributes used to describe CpG sites, and by allowing rules to
be more expressive (for example to be longer and to be made from features built from
multiple attributes, i.e. not just genomic location).
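In Prolog syntax, and assuming a hypothetical cpg_site_location/2 fact that gives the genomic
location of each CpG site, this DMR rule would be written as:

cpg_different_in_cancer(Site) :-
    cpg_site_location(Site, Location),
    Location >= 42233400,
    Location < 42234400.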

Figure 6.3: This figure is taken from Jaffe et al. [75]; it shows an example of a Differentially
Methylated Region (DMR). Panel A shows the methylation measurements from a colon
cancer dataset plotted against genomic location on chromosome 2. Eight normal (blue)
and eight cancer (red) samples are shown for each location. The curves represent a smooth
estimate of the population-level methylation for cancer (red) and normal (blue) samples.
The green bar represents a region found to be a cancer DMR (interesting subgroup). In
panel B the black curve is an estimate of the population-level difference between normal and
cancer. The curve is expected to vary due to measurement error and biological variation
but rarely exceed a certain threshold, for example those represented by the red horizontal
lines. A candidate DMR is a region where the black curve is outside these boundaries.

Novel subgroup discovery formulations

By observing other modern genomic technology applications and research articles we can
also identify other datasets, such as those from 16S rRNA microbiome studies, which have
similarities to gene expression and CpG methylation datasets. Hence, we can also formulate
a useful data mining task of subgroup discovery of the microbiome.

Evaluations of subgroups

As in GSEA and techniques such as Bumphunter, a statistical evaluation of the significance
of subgroups is the final stage required to strengthen the case that discovered subgroups
are not simply anomalies but represent true patterns in the data. Once subgroups have
been found and checked for significance they can be shared with the research communities
and domain experts in order for them to judge the rules on their utility. Subgroups can
potentially be used for decision support, for example to decide which microbes should be
isolated for further study, or to ask what the evolutionary history of a CpG site tells us
about our vulnerability to cancer.

Subgroup discovery                             Methylation: Finding differentially methylated regions
Object or instance                             CpG site
Attribute value or feature value               Location of CpG site on the genome
Class attribute                                Methylation under a specific experimental condition such as a specific time point or phenotype
Class or class attribute value, e.g. pos/neg   Differentially/non-differentially methylated CpG site
Subgroup of objects                            Methylated region
Interesting subgroup                           Differentially methylated region

Table 6.10: A translation of the vocabulary used in the subgroup discovery literature and
the DMR literature that we have constructed.

Subgroup discovery                             Microbiome: Finding subgroups of bacteria
Object or instance                             Bacteria
Attribute value or feature value               Property of Bacteria
Class attribute                                Read counts under a specific experimental condition such as a specific time point or phenotype
Class or class attribute value, e.g. pos/neg   Differentially/non-differentially abundant microbe
Subgroup of objects                            Microbes with specific properties
Interesting subgroup                           Microbes with specific properties associated with differential abundance in the phenotype

Table 6.11: A translation of the vocabulary used in the subgroup discovery literature and a
new task for characterising subgroups of microbes that we have identified.

6.2 Implementation of subgroup discovery task in Prolog
In this section we follow the methodology for Prolog programming that we set out in sec-
tion 5.5. We first attempt to describe our problem declaratively, using pure logic and
constraint libraries. We then determine whether a heuristic is needed due to the size of the
search space. There are three iterations in total, allowing for increasingly large dataset sizes.
In this section we are primarily concerned with illustrating the computational thinking a
bioinformatician requires in order to develop Prolog implementations for identified data
mining tasks. For the experiments in this section we use a PC with an Intel(R)
Core(TM) i7-4712MQ CPU @ 2.30GHz and 15GiB RAM. The OS is Ubuntu 16.04 and the
SWI-Prolog version is 7.7.1-23-gc81adf9.

6.2.1 Method 1: Pure constraint logic programming implementation
Code block 1 shows an implementation of the task of subgroup discovery as a purely declar-
ative data mining program. We first import the relevant libraries on lines 1 and 2. Lines
4-10 consist of a small dataset for illustration. There are three positive and three negative
examples; each example is a vector of four binary values that correspond to features
of these examples. Lines 12-21 define our core relation, data_rulefeatures_value/3,
which describes the relationship between our data, our rule and the ‘value’ we give to
that rule on this data. Here we simply use TruePositives − FalsePositives, the unnor-
malised rate difference [53]. We use the clp(fd) library to enumerate (lines 17 and 18)
the true positive and false positive counts in order to find the highest ‘value’, subject to
a clp(b) constraint list which corresponds to which features will be in the rule. This
clp(b) constraint list is constructed in the predicate features_example_constraints/3
on lines 31-33. This list is turned into a satisfiability problem on line 29 in the predicate
features_example_covered/3. This predicate defines when a rule covers an example.
The predicate rulefeatures_examples_coverednumber/3 (defined on lines 23 to 25) de-
fines the relation between a rule, a set of examples and how many of those examples are
covered by the rule. It does this by mapping the features_example_covered/3 predicate
over each example in its given list and counting how many of those examples are covered by
the rule using a satisfiability card goal (line 25). The rulefeatures_examples_coverednumber/3
predicate is called twice – once for the positive examples and once for the negative examples
– in the main relation data_rulefeatures_value/3. Finally, on line 21 the clp(b) labelling is
applied to the list of features that represents the rule in order to find the best rule.
Code Block 1
1 :- use_module(library(clpb)).
2 :- use_module(library(clpfd)).
3
4 data(PosExamples,NegExamples):-
5 PosExamples=[ [0,1,0,1],
6 [0,1,0,1],
7 [1,0,0,1]],
8 NegExamples=[ [1,0,1,0],
9 [1,0,1,0],
10 [0,1,1,0]].
11
12 data_rulefeatures_value(data(Positives,Negatives),Features, Value):-
13 Positives =[Example|_Rest],
14 same_length(Features, Example),
15 length(Positives,NumberOfPositives),
16 [TP,FP] ins 0..NumberOfPositives,
17 Value #= TP-FP,
18 labeling([max(Value)], [TP,FP]),
19 rulefeatures_examples_coverednumber(Features,Positives, TP),
20 rulefeatures_examples_coverednumber(Features,Negatives, FP),
21 labeling(Features).
22
23 rulefeatures_examples_coverednumber(Features, Examples, Number):-
24 maplist(features_example_covered(Features), Examples, Numbers),
25 sat(card([Number], Numbers)).
26
27 features_example_covered(FeatureList,ExampleList,Covered):-
28 features_example_constraints(FeatureList,ExampleList,Structure),
29 sat(Covered =:= *(Structure)).
30
31 features_example_constraints([],[],[]).
32 features_example_constraints([H1|T1],[H2|T2],[(H1=:=(H1*H2))|Structure]):-
33 features_example_constraints(T1,T2,Structure).

We illustrate this implementation of subgroup discovery with a minimal running example
by calling the following goals:

?- data(Ps,Ns),data_rulefeatures_value(data(Ps,Ns),Fs,V).
Ps = [[0, 1, 0, 1], [0, 1, 0, 1], [1, 0, 0, 1]],
Ns = [[1, 0, 1, 0], [1, 0, 1, 0], [0, 1, 1, 0]],
Fs = [0, 0, 0, 1],
V = 3 ;

This returns the best rule, represented as a list F of 0s and 1s. The value 1 in a
position in the list indicates that this is a feature of the identified rule. In this first
example the rule found is that F4 = 1.

Pos ⇐ F4 = true.

In words:

If an example has F4 = true, then it is a positive example.

In this case the rule covers all 3 positives and 0 negative examples, so it does not strictly
represent a subgroup (as the coverage is complete). On backtracking we would find the next
best rule according to our heuristic value. In this way every subgroup is found alongside
its value.

?- data(P,N),data_rulefeatures_value(data(P,N),F,V).
P = [[0, 1, 0, 1], [0, 1, 0, 1], [1, 0, 0, 1]],
N = [[1, 0, 1, 0], [1, 0, 1, 0], [0, 1, 1, 0]],
F = [0, 0, 0, 1],
V = 3 ;
P = [[0, 1, 0, 1], [0, 1, 0, 1], [1, 0, 0, 1]],
N = [[1, 0, 1, 0], [1, 0, 1, 0], [0, 1, 1, 0]],
F = [0, 1, 0, 1],
V = 2 ;
P = [[0, 1, 0, 1], [0, 1, 0, 1], [1, 0, 0, 1]],
N = [[1, 0, 1, 0], [1, 0, 1, 0], [0, 1, 1, 0]],
F = [1, 0, 0, 1],
V = 1 ;
P = [[0, 1, 0, 1], [0, 1, 0, 1], [1, 0, 0, 1]],


N = [[1, 0, 1, 0], [1, 0, 1, 0], [0, 1, 1, 0]],
F = [0, 1, 0, 0],
V = 1
...

In order to adapt this algorithm for the multi-instance case, we make a number of
changes. Code Block 2 shows a small MI dataset that we use for this example. Code
Block 3 shows the adapted code that makes use of the library reif (line 3) described
previously (section 5.7.1). On line 17 we use the if_/3 predicate to reify the relation
#>= with the second argument set to one. We do this so that we can define the relation
numberbiggerthan1_t/2. We can use this predicate in conjunction with
rulefeatures_milInstanceBag_coverednumber/3 to define the relation
rulefeatures_milExamples_coverednumber/3, by mapping the result of calls to
rulefeatures_milInstanceBag_coverednumber/3 with numberbiggerthan1_t/2. This
is a direct translation of the definition of the MI Learning coverage rule (an instance is
covered by a rule, if at least one individual in its bag is covered by the rule). Suitable
adaptations are given to the other predicates and the names are changed to reflect that
they now describe relations between bags of individuals as instances rather than instances
as simple vectors.

Code Block 2
1 mil_data(PosExamples,NegExamples):-
2 PosExamples =
3 [
4 [[1, 0, 0], [1, 0, 0], [1, 0, 1]],
5 [[1, 1, 0], [1, 1, 0], [0, 1, 1]],
6 [[1, 1, 1], [0, 1, 1], [0, 1, 1]],
7 [[0, 1, 0], [1, 1, 1], [0, 1, 1]],
8 [[1, 0, 1], [1, 0, 1], [0, 1, 0]],
9 [[0, 0, 0], [1, 0, 0], [1, 0, 1]],
10 [[0, 0, 1], [1, 1, 0], [0, 0, 0]],
11 [[1, 0, 1], [1, 0, 1], [1, 0, 1]],
12 [[0, 0, 0], [0, 0, 0], [1, 1, 1]],
13 [[1, 0, 1], [1, 1, 0], [1, 0, 1]]
14 ],
15
16 NegExamples =
17 [
18 [[0, 0, 0], [0, 0, 1], [0, 0, 0]] ,
19 [[0, 1, 0], [1, 1, 1], [0, 0, 1]] ,
20 [[0, 1, 0], [0, 1, 0], [1, 1, 1]] ,
21 [[1, 0, 1], [1, 1, 0], [1, 1, 1]] ,
22 [[1, 1, 1], [1, 1, 0], [1, 1, 1]] ,
23 [[1, 0, 0], [0, 1, 0], [0, 1, 1]] ,
24 [[1, 1, 0], [1, 1, 0], [0, 1, 0]] ,
25 [[0, 1, 1], [1, 0, 1], [1, 1, 0]] ,
26 [[0, 0, 1], [0, 1, 0], [0, 0, 0]] ,
27 [[0, 0, 0], [0, 1, 0], [0, 1, 0]]
28 ].

Code Block 3
1 :- use_module(library(clpb)).
2 :- use_module(library(clpfd)).
3 :- use_module(library(reif)).
4
5 mildata_rulefeatures_value(data(Positives,Negatives),Features, Value):-
6 Positives =[Example|_Rest],
7 same_length(Features, Example),
8 length(Positives,NumberOfPositives),
9 [TP,FP] ins 0..NumberOfPositives,
10 Value #= TP-FP,
11 labeling([max(Value)], [TP,FP]),
12 rulefeatures_milExamples_coverednumber(Features,Positives, TP),
13 rulefeatures_milExamples_coverednumber(Features,Negatives, FP),
14 labeling(Features).
15
16 numberbiggerthan1_t(Number,Value):-
17 if_(#>=(Number,1),Value=1,Value=0).
18
19 rulefeatures_milExamples_coverednumber(Features,Examples,ExamplesCoveredNumber):-
20 length(Examples,ExampleSize),
21 ExamplesCoveredNumber in 0 .. ExampleSize,
22 labeling([max(ExamplesCoveredNumber)],[ExamplesCoveredNumber]),
23 maplist(rulefeatures_milInstanceBag_coverednumber(Features),Examples,Numbers),
24 maplist(numberbiggerthan1_t,Numbers,Truths),
25 sat(card([ExamplesCoveredNumber],Truths)).
26
27 rulefeatures_milInstanceBag_coverednumber(Features, Bag, NumberInBagCovered):-
28 length(Bag,BagSize),
29 NumberInBagCovered in 0..BagSize,
30 labeling([min(NumberInBagCovered)],[NumberInBagCovered]),
31 maplist(features_milIndivdual_covered(Features), Bag, Numbers),
32 sat(card([NumberInBagCovered], Numbers)).
33
34 features_milIndivdual_covered(FeatureList,ExampleList,Covered):-
35 features_milIndividual_constraints(FeatureList,ExampleList,Structure),
36 sat(Covered =:= *(Structure)).
37
38 features_milIndividual_constraints([],[],[]).
39 features_milIndividual_constraints([H1|T1],[H2|T2],[(H1=:=(H1*H2))|Structure]):-
40 features_milIndividual_constraints(T1,T2,Structure).

When we test this implementation we obtain a number of subgroups (we omit the return
of variables P and N for brevity):

?- mil_data(_P,_N),mildata_rulefeatures_value(data(_P,_N),F,V).

F = [1, 0, 1],
V = 3 ;

F = [1, 0, 0],
V = 3 ;

F = [0, 0, 1],
V = 2

...

The first returned answer would be interpreted as the rule:

IF an example has an individual in its bag where F1 = true and F3 = true


THEN it is a positive example.

This rule describes a subgroup of our (artificial) data where the class distribution of the
subgroup is different to the class distribution of the entire dataset.

Running time

To understand the running time of the AVL implementation as the size of the datasets
increase we use the time/1 predicate. The data is two dimensional so we sum the number
of examples and the number of features together to derive a single problem dimension
shown as Size in Table 6.12.

Inferences CPU time (seconds) LIPS Size


5,300 0.001 6145198 1
19,164 0.003 7251666 2
36,977 0.004 9904799 3
476,434 0.045 10681199 4
452,660 0.042 10906260 5
4,576,880 0.393 11634970 6
5,532,995 0.482 11480329 7
5,894,758 0.496 11890537 8
12,387,874 1.041 11901689 9
17,508,397 1.439 12171159 10
37,888,052 3.054 12406813 11
Missing - - -
75,932,587 5.941 12781284 13
160,286,558 12.877 12448492 14
272,617,307 21.811 12500336 15
266,568,631 21.887 12180032 16
595,571,244 49.728 11977273 17
1,108,293,345 91.438 12121283 18
1,761,835,015 152.673 11540810 19
3,641,736,671 319.629 11394628 20

Table 6.12: Number of inferences, CPU time in seconds and Logical inferences per second
for method 1 with increasing sizes of random datasets. Size is the number of features and
the number of instances of each class.

The running time results shown in Table 6.12 include the number of inferences and the
number of logical inferences per second as well as the CPU time. We conclude from this
time analysis that the Binary Decision Diagram algorithm behind the clp(b) system (which
attempts to solve the satisfiability problem) is currently too slow to work on our complete
(real rather than test) datasets. However, our implementation may still be useful in the
future when improved satisfiability libraries are implemented. If a future Prolog version
incorporates these improvements then this implementation will likely run without any
changes (it is known that SWI-Prolog does not contain the state-of-the-art implementations
of these libraries).

6.2.2 Method 2: Heuristic top-down search using a constraint covering function and weighted instances
In this method we structure the search space in a similar way to the existing CN2-SD and
RSD algorithms [90, 92]. However, we make use of the clp(fd), clp(b) and reif libraries
to define our coverage relation and to reduce the weights of our instances. Each instance is
initially given a weight of 100, and every time that instance is covered by a rule its weight
is reduced by 50% (floored integer division).

Code block 4 shows the predicates used to learn a single rule. The main predicate is
ps_ns_rulepath/3 (defined on line 40); this predicate relates lists of positive and negative
instances to a rule path. The rule path argument is a list of refinements to a rule,
which can be thought of as a path through coverage space [52]. The ps_ns_rulepath/3
predicate works by calling the recursive predicate ps_ns_rule_tv_fv_rulepath/6. This
predicate finds all the specialisations of the current rule with specialise/2, defined on
line 19, applies each of these candidates to the data (line 32), selects the best rule (line 36)
and finally recurses to find further specialisations. The stopping criterion is simple: the
search stops when the number of negatives covered is equal to zero (line 28).

Code block 5 shows the predicates that learn multiple rules by calling the predicate
ps_ns_rulepath/3 in code block 4 multiple times. The predicate data_rules/2 expects the
instance feature values as a list paired with a weight for that instance. These pairs are
constructed by a call to maplist/3 on the data with the i_wi/2 predicate on line 1. Lines
3-8 define the predicate that reduces the weight of an instance by using the reified
rule_instance_truth/3 predicate.

Code Block 4
1 rule_instances_weight(Features, WExamples, Number):-
2 maplist(x_y_xypair,Weights,Examples,WExamples),
3 maplist(rule_instance_t(Features), Examples, Numbers),
4 maplist(n_n2_product,Weights,Numbers,Products),
5 list_sum(Products,Number).
6
7 rule_instance_t(FeatureList,ExampleList,Covered):-
8 build_structure(FeatureList,ExampleList,Structure),
9 sat(Covered =:= *(Structure)).
10
11 rule_instance_truth(Rule,Instance,Truth):-
12 rule_instance_t(Rule,Instance,T),
13 bool01_t(T,Truth).
14
15 build_structure([],[],[]).
16 build_structure([H1|T1],[H2|T2],[(H1=:=(H1*H2))|Structure]):-
17 build_structure(T1,T2,Structure).
18
19 specialise(Input,Output):-
20 select(0,Input,1,Output).
21
22 ps_ns_rule_sumP_sumN_value(Ps,Ns,Rule,Pws,Nws,Value):-
23 rule_instances_weight(Rule,Ps,Pws),
24 rule_instances_weight(Rule,Ns,Nws),
25 Value #=Pws-Nws.
26
27 ps_ns_rule_tv_fv_rulepath(Ps,Ns,Start,TP,FP,[r(Special,Value,cm(TP2,FP2))|Result]):-
28 FP#\=0, %Stop when 0 covered negatives.
29 TP #>0,
30 Value #>0,
31 findall(Special,specialise(Start,Special),Specials),
32 maplist(ps_ns_rule_sumP_sumN_value(Ps,Ns),Specials,TP_X,FP_X,Values),
33 maplist(w_x_y_z_wxyz,Values,TP_X,FP_X,Specials,Weight_RuleStructures),
34 keysort(Weight_RuleStructures,Sorted),
35 reverse(Sorted,RSorted),
36 RSorted =[Value-s(TP2,FP2,Special)|_Rest],
37 ps_ns_rule_tv_fv_rulepath(Ps,Ns,Special,TP2,FP2,Result).
38 ps_ns_rule_tv_fv_rulepath(_,_,_,_,_,[]).
39
40 ps_ns_rulepath(Ps,Ns,Result):-
41 Ps=[_W-Example|_],
42 length(Example,Size),
43 length(Start,Size),
44 Start ins 0..0,
45 ps_ns_rule_tv_fv_rulepath(Ps,Ns,Start,_TP,_FP,Result).
46
47 x_y_xypair(X,Y,X-Y).
48
49 w_x_y_z_wxyz(W,X,Y,Z,W-s(X,Y,Z)).
50
51 n_n2_sum(X,Y,Z):-
52 Z#=X+Y.
53
54 n_n2_product(X,Y,Z):-
55 Z#=X*Y.
56
57 list_sum(List,Sum):-
58 foldl(n_n2_sum,List,0,Sum).

Code Block 5
1 i_wi(I,100-I).
2
3 t_weighted_reduced(P_2,FatList,DietList):- t_weighted_reduced_(FatList,DietList,P_2).
4
5 t_weighted_reduced_([],[],_).
6 t_weighted_reduced_([W-X|Xs0],Ts,P_2):-
7 if_(call(P_2,X),(w0_w1(W,W2),Ts= [W2-X|Ts0]),(Ts=[W-X|Ts0])),
8 t_weighted_reduced_(Xs0,Ts0,P_2).
9
10 w0_w1(W,W1):-
11 W1 #= W div 2.
12
13 empty_t([],true).
14 empty_t([_|_],false).
15
16 data_rules(data([],_Neg),[]).
17 data_rules(data(Pos,Neg),[Rule|Rules]):-
18 once(ps_ns_rulepath(Pos,Neg,RulePath)),
19 if_(empty_t(RulePath),Rules=[],(
20 reverse(RulePath,[r(Rule,_V,_CM)|_]),
21 t_weighted_reduced(rule_instance_truth(Rule),Pos,PosReduced),
22 t_weighted_reduced(rule_instance_truth(Rule),Neg,NegReduced),
23 data_rules(data(PosReduced,NegReduced),Rules)
24 )).

Running time

Inferences CPU time LIPS Size


15,912,043 1.711 9298652 10
16,586,224 1.891 8769365 20
43,101,571 4.817 8949020 30
132,083,484 14.483 9120095 40
393,833,750 45.282 8698220 50
476,817,636 60.564 7873602 60
830,805,016 95.323 8715972 70
1,734,559,570 219.135 7915991 80
2,535,043,462 357.792 7087721 90
3,812,853,495 463.530 8226328 100

Table 6.13: Number of Inferences, CPU time in seconds and Logical inferences per second
for Method 2 with increasing sizes of random datasets. Size is the number of features and
the number of instances of each class.

Table 6.13 shows that Method 2 can cope with much larger dataset sizes than Method
1, but it is still not powerful enough for our dataset sizes. This algorithm has the same
time complexity as CN2 and CN2-SD [21, 90]. An additional problem with these first two
implementations (Method 1 and Method 2) is that representing instances as lists of features
does not scale to our dataset sizes: our hardware cannot learn with rules represented as
lists of length 300k over 500k instances.

6.2.3 Method 3: Genetic algorithm for searching a very large hypothesis space
For the discussion of this implementation we have extracted the pertinent parts of the
code and separated them into a number of code blocks to be discussed independently.
Before we do that we will make a note on the terminology used. When using genetic
algorithms [158] for biological research, the overlap in terminology between the two fields
can sometimes be confusing. In the genetic algorithm a candidate rule is referred to as a
chromosome. In the data mining task of finding subgroups of CpG sites, we do not use
features based on the actual chromosomes of the patients. Therefore, when we
refer to ‘chromosomes’ in the following, we are using the term in the candidate rule sense,
and similarly when we refer to parent, mother, father and child: these are all candidate
rules and have nothing to do with our data instances. Finally, the ‘population’ is the current set of
candidate rules, not the population of microbes in the microbiome study.
In contrast to the previous methods, where a rule (for the purposes of the search) was
represented as a list of binary values (where a 1 indicates that the feature is included in
the rule), in this method we represent a rule as a list of indices into the set of features,
i.e. a rule could now be: [f1,f500,f2000,f8400]. This would represent the rule:

IF f1 = true AND f500 = true AND f2000 = true AND f8400 = true THEN pos class

In this way the rule size can be fixed, which allows us to manually set a preference for
shorter or longer rules. It is possible to search for very long rules, which would not be
explored by the previous two methods.
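The two representations carry the same information. As an illustration (binary_indices/2 is a hypothetical helper written for this explanation, not part of the implementation), the following sketch relates a fixed-length 0/1 rule, as used by Methods 1 and 2, to the list of included feature indices used in this method:

:- use_module(library(lists)).
:- use_module(library(apply)).
:- use_module(library(pairs)).

% binary_indices(+Binary, -Indices): Indices are the (1-based) positions of the
% features that are switched on in the 0/1 list Binary.
binary_indices(Binary, Indices) :-
    length(Binary, Len),
    numlist(1, Len, Positions),
    pairs_keys_values(Pairs, Positions, Binary),
    include(index_selected, Pairs, Selected),
    pairs_keys(Selected, Indices).

index_selected(_Position-1).

For example, ?- binary_indices([0,1,0,0,1], Is). gives Is = [2, 5].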
We will now provide a discussion of the main predicate definitions in this implemen-
tation. Firstly, code block 6 shows how cross over is implemented [158]. The primary
predicate is mum_dad_c1_c2/4. This relates a pair of ‘mother’ and ‘father’ chromosomes to
two ‘child’ chromosomes. This predicate definition makes use of the association list data
structure, which has better performance for look up than traversing a list using member/2.
On line 7 the list of index positions is split for the mother and father chromosomes at the
cross over points (the number of cross over points is set to 3 and is stored as a Prolog fact
in the program, but for searching for very long rules this could be increased).
The split chromosomes of the ‘mother’ and ‘father’ are then inter-weaved to form the
new ‘child’ chromosomes by using the predicate s_L1_L2_LA_LB/4. This is illustrated by
the following query:

s_L1_L2_LA_LB([[0,0,0,0,0],[0,0],[0,0]],[[1,1,1,1,1],[1,1],[1,1]],L1,L2).
L1 = [[1, 1, 1, 1, 1], [0, 0], [1, 1]],
L2 = [[0, 0, 0, 0, 0], [1, 1], [0, 0]] ;
false.

These ‘child’ chromosomes are created as split lists, and they are concatenated into a
single list using a DCG (lines 13-14). This is the primary method to generate new rules in
the search.
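The concatenation//1 non-terminal used here (and again in code block 12) is not listed in the extracted code blocks; a common definition, which we assume is essentially the one used, is:

% concatenation(Lists) describes the concatenation of a list of lists.
concatenation([]) --> [].
concatenation([List|Lists]) -->
    seq(List),
    concatenation(Lists).

% seq(List) describes exactly the elements of List.
seq([]) --> [].
seq([E|Es]) --> [E], seq(Es).

For example, ?- phrase(concatenation([[a,b],[c],[d,e]]), L). gives L = [a, b, c, d, e], and because seq//1 can also generate, the same non-terminal can split a list into sublists, as is done when discarding the two worst chromosomes in code block 12.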

Code Block 6
1 mum_dad_c1_c2(M,D,C1,C2):-
2 length(M,LM),
3 numlist(1,LM,Index),
4 maplist(x_y_xypair,Index,M,MPair),
5 maplist(x_y_xypair,Index,D,DPair),
6 splitlocs_length(SplitLocs,LM),
7 list_splitlocs_splits(Index,SplitLocs,SplitIndexes),
8 list_to_assoc(MPair,MAssoc),
9 list_to_assoc(DPair,DAssoc),
10 maplist(assoc_keys_values(MAssoc),SplitIndexes,MumSplit),
11 maplist(assoc_keys_values(DAssoc),SplitIndexes,DadSplit),
12 s_L1_L2_LA_LB(MumSplit,DadSplit,C1Split,C2Split),
13 phrase(concatenation(C1Split), C1),
14 phrase(concatenation(C2Split),C2),!.
15
16 s_L1_L2_LA_LB(L1,L2,LA,LB):-
17 interweave_L1_L2_LA_LB(odd,L1,L2,LA,LB),!.
18
19 interweave_L1_L2_LA_LB(_,[],[],[],[]).
20 interweave_L1_L2_LA_LB(odd,[L1_H|L1_T],[L2_H|L2_T],[L2_H|LA_T],[L1_H|LB_T]):-
21 interweave_L1_L2_LA_LB(even,L1_T,L2_T,LA_T,LB_T).
22 interweave_L1_L2_LA_LB(even,[L1_H|L1_T],[L2_H|L2_T],[L1_H|LA_T],[L2_H|LB_T]):-
23 interweave_L1_L2_LA_LB(odd,L1_T,L2_T,LA_T,LB_T).
24
25 x_y_xypair(X,Y,X-Y).
26
27 my_get_assoc(A,K,V):-
28 get_assoc(K,A,V).
29 assoc_keys_values(A,K,V):-
30 maplist(my_get_assoc(A),K,V).
31
32 list_splitlocs_splits(L,SplitLocs,S):-
33 list_splitlocs_splits(L,SplitLocs,S,1),!.
34
35 list_splitlocs_splits(L,[],[L],_).
36 list_splitlocs_splits(L,SplitLocs,[[H|Split1]|RestSplits],Count):-
37 L=[H|T],
38 SplitLocs =[First|Rest],
39 H #< First,
40 Count2 #= Count+1,
41 list_splitlocs_splits(T,[First|Rest],[Split1|RestSplits],Count2).
42 list_splitlocs_splits(L,SplitLocs,[[H]|RestSplits],Count):-
43 L=[H|T],
44 SplitLocs =[First|Rest],
45 H = First,
46 Count2 #= Count+1,
47 list_splitlocs_splits(T,Rest,RestSplits,Count2).

Code block 7 shows the tournament selection [103]. The predicate
pop_tournementselected/2 relates a population of chromosomes/candidate rules to a
number of rules that win a ‘tournament’. The tournament size is fixed at six random
candidates, and the top two chromosomes from the tournament are selected. In this way
the population of rules becomes ‘fitter’.
Code Block 7
1 pop_tournementselected(P,S):-
2 length(TournementMembers,6),
3 maplist(random_member(P),TournementMembers),
4 pop_sortedpop(TournementMembers,[_X-Best,_Y-SecondBest|_Rest]),
5 S=[Best,SecondBest].

Code block 8 shows the hill climbing mutation operation [134]. This mutation operation
performs a local hill climbing search to find whether there is a chromosome nearby (in feature
space) to the current candidate chromosome that has a better fitness evaluation. The
search takes a random allele in the chromosome and tries to find local improvements that
are nearby, constraining the search using clp(fd) to ensure that we do not try to test
an invalid feature (a feature id of -1, for example, would not make sense). This assumes
that the feature order has some semantic meaning. For example, the features for CpG
sites which relate to a measure of how closely a site is conserved have an ordering. A list
of ordered features (made from this real valued attribute) could be:

f(1,attribute_1,>=,11), f(2,attribute_1,>=,12), f(3,attribute_1,>=,13).

This shows that there is a relationship (in this case +10) between the feature id and
what the feature is testing for. When the features refer to attributes that do not have an
ordering, this process is more of a random search rather than a hill climbing search.
Code Block 8
1 chromosome_hillclimbmutated(Range,C1,C2a,C_Fitnesses):-
2 list_listindexed(C1,C1Index),
3 random_member(Feature-Index,C1Index),
4 Value #= Feature,
5 numberoffeatures(NumberOfFeatures),
6 findall(NewValue,(
7 Max #=NumberOfFeatures,
8 Min #=1,
9 Top #=Value+Range,
10 Bottom #= Value-Range,
11 NewValue in Min..Max,
12 NewValue #=<Top,
13 NewValue #\=Value,
14 NewValue #>=Bottom,
15 label([NewValue])
16 ),
17 NewValues),
18 maplist(old_index_value_new(C1,Index),NewValues,NewCs),
19 maplist(heuristic_chromosome_fitness(rate_diff),NewCs,ChildFitness),
20 maplist(x_y_xypair,ChildFitness,NewCs,C_Fitnesses),
21 pop_sortedpop(C_Fitnesses,[C2|Rest]),
22 C2 = _ValueOfC2-C2a.

This is demonstrated by the following query, where we see that the 4th allele has been
randomly selected to be ‘hill climbed’ (for this demonstration the fitness of a chromosome
is simply the sum of its values, and we have added screen output):

?-chromosome_hillclimbmutated(4,[1,10,5,11],C2,F).
Mutated Sorted:
31-[1, 10, 5, 15].
30-[1, 10, 5, 14].
29-[1, 10, 5, 13].
28-[1, 10, 5, 12].
26-[1, 10, 5, 10].
25-[1, 10, 5, 9].
24-[1, 10, 5, 8].
23-[1, 10, 5, 7].
C2 = [1, 10, 5, 15],
F = [23-[1, 10, 5, 7], 24-[1, 10, 5, 8], 25-[1, 10, 5, 9],
26-[1, 10, 5, 10], 28-[1, 10, 5|...], 29-[1, 10|...],
30-[1|...], 31-[...|...]].

In order to see which instances are covered by a feature, different partitioning predicates
are implemented (Code Block 9 and Code Block 10). They have a similar implementation
in both data mining tasks (with suitable adaptations for the multi-instance case shown in
Code Block 10), and there is a version for the positive instances and the negative instances
in both tasks, in order to account for missing values following our pessimistic value strategy
from [53]. For instance, in the CpG mining task the cpgpartition_pos_ts_fs_feature/4
predicate is used. This predicate makes use of the meta predicate if_/3 (in a nested
call) from library reif, which we described in the section on meta predicates in Chapter
5. This enables a fast, deterministic implementation that does not run out of memory
when applying the feature to the approximately 500k instances in the CpG mining task.
Nevertheless, this is the most computationally intensive part of the algorithm, as the number
of meta-calls is proportional to the number of instances in the case of the CpG methylation
task, and in the worst case it is proportional to the combined number of individuals in all
the bag instances in the microbiome case (the library reif actually uses a technique called
‘goal expansion’ that is able to reduce the number of these meta-calls).
Code Block 9
1 cpgpartition_pos_ts_fs_feature([],[],[],_).
2 cpgpartition_pos_ts_fs_feature([X|Xs0],Ts,Fs,feature(At,_,Op,FValue)):-
3 cpg_ats_i(X,AtList),
4 atom_concat(#,Op,Op2),
5 maplist(atterm_atname,AtList,Ats),
6 if_(memberd_t(At,Ats),
7 (
8 memberd(attribute(At,AtValue3),AtList),
9 if_(call(Op2,AtValue3,FValue), (Ts=[X|Ts0],Fs=Fs0),
10 ( Ts =Ts0,Fs=[X|Fs0]))
11 )
12 ,(Fs=[X|Fs0],Ts=Ts0)),
13 cpgpartition_pos_ts_fs_feature(Xs0,Ts0,Fs0,feature(At,_,Op,FValue)).

Code Block 10
1 otupartition_pos_ts_fs_feature([],[],[],_).
2 otupartition_pos_ts_fs_feature([X|Xs0],Ts,Fs,Feature):-
3 Feature = f(At,Op,FValue),
4 instance(X,XBag,_),
5 length(XBag,BagSize),
6 if_(bag_feature_t(BagSize,XBag,f(At,Op,FValue)),
7 ( Ts=[X|Ts0],Fs=Fs0),
8 ( Fs=[X|Fs0],Ts=Ts0)),
9 otupartition_pos_ts_fs_feature(Xs0,Ts0,Fs0,f(At,Op,FValue)).
10
11 individual_atvalues(I,AtValues):-
12 findall(A-V,img_data(I,A,V),AtValues).
13
14 individualcovered_feature_t(I,f(At,_Op,V),T):-
15 memo(individual_atvalues(I,AtValueList)),
16 if_(memberd_t(At-V,AtValueList),T=true,T=false).
17
18 bag_feature_t(_Size,[],_,false).
19 bag_feature_t(Size,[OneFromBag|RestOfBag],F,Truth):-
20 length(RestOfBag,L),
21 if_(individualcovered_feature_t(OneFromBag,F),Truth=true,bag_feature_t(Size,RestOfBag,F,Truth)).

Code block 11 shows how the rate difference heuristic is implemented, taking into account
the number of instances covered by the chromosome/rule.
Code Block 11
1 heuristic_chromosome_fitness(rate_diff,C,Fitness):-
2 posExamples(PosExamples),
3 negExamples(NegExamples),
4 maplist(f,C,Features),
5 length(PosExamples,PositiveLength),
6 length(NegExamples,NegativeLength),
7 examples_features_filtered(pos,PosExamples,Features,TruePos,1),
8 examples_features_filtered(neg,NegExamples,Features,FalsePos,1),
9 length(TruePos,TruePositives),
10 length(FalsePos,FalsePositives),
11 Fitness is (TruePositives/PositiveLength)-(FalsePositives/NegativeLength).

Finally, the algorithm is initiated by gensize_popsize_genomelength_chromosome_type/5
(Code block 12, line 1), which has the number of generations, the initial population size,
the genome (rule) length, the best rule found, and the type of chromosome as arguments
(binary or indexed in a range). This predicate creates the initial random population, sorts
it, computes the first set of cross over, and then calls the recursive predicate g_pop_best/3,
which runs through
the generations performing the mutation and cross over operations.


Code Block 12
1 gensize_popsize_genomelength_chromosome_type(Generations,PopSize, GenomeLength,NewBest,Type):-
2 population_size_len_type(P,PopSize,GenomeLength,Type),
3 maplist(heuristic_chromosome_fitness(rate_diff),P,Fitnesses,Sums),
4 maplist(x_y_xypair,Fitnesses,P,P_Fitnesses),
5 maplist(x_y_xypair,Sums,P,P_Sums),
6 pop_tournementselected(P_Fitnesses,[Best,SecondBest]),
7 mum_dad_c1_c2(Best,SecondBest,C1,C2),
8 maplist(heuristic_chromosome_fitness(rate_diff),[C1,C2],ChildFitness),
9 maplist(x_y_xypair,ChildFitness,[C1,C2],C_Fitnesses),
10 append(P_Fitnesses,C_Fitnesses,NewPop),
11 pop_sortedpop(NewPop,Sorted),
12 maplist(portray_clause,Sorted),
13 phrase(concatenation([Middle,[_Last1,_Last2]]),Sorted),
14 Generation2 #= Generations -1,
15 g_pop_best(Generation2,Middle,NewBest),!.
16
17 g_pop_best(0,[Best|_Rest],Best).
18 g_pop_best(Generations,Population,NewBest):-
19 pop_tournementselected(Population,[Best,SecondBest]),
20 mum_dad_c1_c2(Best,SecondBest,C1,C2),
21 maplist(heuristic_chromosome_fitness(rate_diff),[C1,C2],ChildFitness),
22 maplist(x_y_xypair,ChildFitness,[C1,C2],C_Fitnesses),
23 append(Population,C_Fitnesses,NewPop),
24 pop_sortedpop(NewPop,Sorted),
25 phrase(concatenation([Middle,[_Last1,_Last2]]),Sorted),
26 Generation2 #= Generations -1,
27 g_pop_best(Generation2,Middle,NewBest).

In order to demonstrate the genetic algorithm we provide the following example query,
which simply optimises the sum of the chromosome numbers. The query output below
shows how the performance improves after each generation (note that the generations
count down rather than up):

?-gensize_popsize_genomelength_chromosome_type(4,4,5,X,number(1-1000)).
Init pop:
2076-[154, 212, 279, 645, 786].
2196-[637, 223, 405, 316, 615].
2865-[467, 803, 640, 892, 63].
2307-[596, 701, 87, 519, 404].
Sorted pop:
2962-[596, 803, 640, 519, 404].
2865-[467, 803, 640, 892, 63].
2307-[596, 701, 87, 519, 404].
2210-[467, 701, 87, 892, 63].
2196-[637, 223, 405, 316, 615].
2076-[154, 212, 279, 645, 786].
Generation 2
3335-[596, 803, 640, 892, 404].
2962-[596, 803, 640, 519, 404].
2865-[467, 803, 640, 892, 63].
2496-[471, 803, 640, 519, 63].
2307-[596, 701, 87, 519, 404].
2210-[467, 701, 87, 892, 63].
Generation 1
3339-[596, 803, 640, 892, 408].
3335-[596, 803, 640, 892, 404].
3335-[596, 803, 640, 892, 404].
2962-[596, 803, 640, 519, 404].
2865-[467, 803, 640, 892, 63].
2496-[471, 803, 640, 519, 63].
Generation 0
3343-[596, 803, 640, 892, 412].
3339-[596, 803, 640, 892, 408].
3339-[596, 803, 640, 892, 408].
3335-[596, 803, 640, 892, 404].
3335-[596, 803, 640, 892, 404].
2962-[596, 803, 640, 519, 404].
X = 3343-[596, 803, 640, 892, 412].

Running time

This core algorithm is very fast compared to the other two implementations. However, it
takes time to check how many instances are covered by a rule, as this is not done in advance
(as it is in the other two algorithms). This is done to save space, as it is not possible to
load into the search algorithm the coverage values of all candidate rules.
In order to see how this algorithm performs, we test with increasing sub-samples of the
CpG data. We fix the algorithm at 3 generations, 3 features and an initial population size
of 3. We then modify the size of the dataset by taking sub-samples of the CpG dataset.
Table 6.14 shows the result of this and we can see that the algorithm scales linearly in
terms of the dataset size.

Inferences CPU time Lips DataSetSize


238,774 0.084 2848566 10
21,411,702 1.643 13030997 100
60,169,616 4.938 12184540 1000
343,173,209 33.645 10200498 10000
2,690,849,758 328.527 8191122 100000

Table 6.14: CPU time for genetic algorithm for increasing sizes of CpG dataset samples

6.3 Application of constructed subgroup discovery algorithms to CpG methylation and microbiome
In this section we will illustrate the application of the genetic subgroup discovery algorithm
(described in section 6.2.3) to CpG methylation data in cancer and microbiome data in
psoriasis patients. We aim to better understand the characteristics of these diseases. We
do not seek to build a diagnostic tool, but to characterise these conditions by identifying
subgroups in these data. The descriptive rules that define subgroups of our data will allow
researchers to generate hypotheses about causes and mechanisms of the conditions that
can be further investigated in later studies.

6.3.1 Application to the CpG sites


Method

We use the CpG dataset from Gene Expression Omnibus (GEO) dataset GSE60185 [48],
consisting of 285 array samples. Of these, 47 are taken from normal breast tissue, and
238 from breast tissue afflicted with cancer. The methylation array (Illumina Infinium
HumanMethylation450 microarray) recorded methylation levels across the whole genome
– 468,424 CpG sites. We used the normalised version of the dataset, where the data has
been processed for probe filtering, color bias correction, background subtraction and subset
quantile normalisation [141].

As we described earlier in this chapter, the objects of study here are the CpG sites,
not the participants. Therefore, in order to assign class labels to each CpG site we first
take the mean difference of the methylation levels between people with breast cancer and
those with normal samples. Then, taking this vector of mean differences, we apply optimal
2-means clustering [153] in order to assign each CpG site to one of two categories – differentially
methylated in breast cancer and non-differentially methylated in breast cancer. We refer
to these as the positive and negative classes, respectively.
This results in:

358,625 positive CpGs (differentially methylated in the cancer array samples)

98,321 negative CpGs (not differentially methylated in the cancer array samples).

We care equally about false positive and false negative identification of differentially
methylated CpG sites, because when learning rules based on this classification both of
these errors will affect the quality of the rules produced. The clustering method described
above suitably trades off the false positive and false negative rates because it finds the
optimal threshold for dividing our CpG sites into two categories. This assumes that the
data can be split into two categories based on there being two methylation profiles in cancer
(i.e. if there are no real clusters this algorithm will still return some division of the
data into two arbitrary groups). This process can be thought of as setting the red line in
Figure 6.3.
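A minimal sketch of this labelling step is given below. It is illustrative only: cpg_meandiff/2 stands for the computed vector of mean differences, the example values are invented, and we assume here that the boundary found by the 2-means step is applied to the absolute mean difference.

% cpg_class(+Threshold, ?Cpg, ?Class): a CpG site is positive if its absolute
% mean methylation difference reaches the 2-means boundary.
cpg_class(Threshold, Cpg, pos) :-
    cpg_meandiff(Cpg, Diff),
    abs(Diff) >= Threshold.
cpg_class(Threshold, Cpg, neg) :-
    cpg_meandiff(Cpg, Diff),
    abs(Diff) < Threshold.

% Illustrative data only.
cpg_meandiff(cpg_site_1, 0.31).
cpg_meandiff(cpg_site_2, 0.02).

With a boundary of 0.1, the query ?- cpg_class(0.1, Cpg, Class). assigns cpg_site_1 to the positive class and cpg_site_2 to the negative class.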
Our features are constructed by the algorithm in Figure 6.2, derived from the attributes
used in a previous study [132]. These attributes consist of a number of genomic sequence
annotations, including conservation data throughout the genome in both protein coding
and non protein coding regions. Other annotations are taken from the ENCODE Project
[25], including transcribed non-coding RNAs, transcription factor binding sites and
chromatin structure. In total there are 10 attribute groups – see Table 6.15 (Hash et
al. call these feature groups, but we follow the convention in the rule learning literature
of reserving the word feature to mean a binary test [53]). In these attribute groups are a
number of attributes, which correspond to the individual data files that contain these sequence
annotations.

We derived 333,719 features from these 10 attribute groups, where each feature consists
of an attribute, an attribute value and an operator. For example, the following feature
consists of the attribute ”46-Way Sequence Conservation”, attribute value 0.2 and operator
>=.

Id  Name: Description
A   46-Way Sequence Conservation: Based on multiple sequence alignment scores, at the nucleotide level, of 46 vertebrate genomes compared with the human genome.
B   Histone Modifications (ChIP-Seq): Based on ChIP-Seq peak calls for histone modifications.
C   Transcription Factor Binding Sites (TFBS PeakSeq): Based on PeakSeq peak calls for various transcription factors.
D   Open Chromatin (DNase-Seq): Based on DNase-Seq peak calls.
E   100-Way Sequence Conservation: Based on multiple sequence alignment scores, at the nucleotide level, of 100 vertebrate genomes compared with the human genome.
F   GC Content: Based on a single measure for GC content calculated using a span of five nucleotide bases from the UCSC Genome Browser.
G   Open Chromatin (FAIRE): Based on formaldehyde-assisted isolation of regulatory elements (FAIRE) peak calls.
H   Transcription Factor Binding Sites (TFBS SPP): Based on SPP peak calls for various transcription factors.
I   Genome Segmentation: Based on genome-segmentation states using a consensus merge of segmentations produced by the ChromHMM and Segway software.
J   Footprints: Based on annotations describing DNA footprints across cell types from ENCODE.

Table 6.15: Attributes used in the CpG data mining task. For more details on these see
[132]

f(fid, '46-Way Sequence Conservation', >=, 0.2)

The large number of instances results in a large number of features, but importantly
we still have more objects (CpG sites) than features, which means we do not have
a “large p small n problem” [42]. Also, these features are binary tests rather than real valued
numbers, and so the hypothesis space is smaller than if the real valued attributes were used directly.
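The feature construction algorithm of Figure 6.2 is not reproduced here; the following sketch shows, under our own simplifying assumptions, how binary threshold features of the form used above can be derived from a real valued attribute (attribute_values_features/3 is a hypothetical helper):

:- use_module(library(lists)).
:- use_module(library(apply)).

% attribute_values_features(+Attribute, +Values, -Features): one >= threshold
% feature per distinct observed value of the attribute.
attribute_values_features(Attribute, Values, Features) :-
    sort(Values, Thresholds),          % distinct values, ascending
    length(Thresholds, N),
    numlist(1, N, Ids),
    maplist(id_value_feature(Attribute), Ids, Thresholds, Features).

id_value_feature(Attribute, Id, Value, f(Id, Attribute, >=, Value)).

For example, ?- attribute_values_features('46-Way Sequence Conservation', [0.2, 0.5, 0.2, 0.9], Fs). produces three features, the first being f(1, '46-Way Sequence Conservation', >=, 0.2).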
We ran the genetic algorithm for 20 generations with an initial population of 100 and
a chromosome size set to 6.

Results

The genetic algorithm took 5 hours to complete. The rules found incorporate features
that have been built from ChIP-Seq and DNase-Seq files, as well as the 100-Way
sequence conservation scores (attribute groups B, D and E). There are two different types of files
for attribute group B: gapped peaks and broad peaks. BroadPeak files contain peaks where
histone modifications span wider ranges of genomic region. In contrast, GappedPeak files
contain interpreted data where the regions may have been spliced or incorporate gaps in the
genomic sequence [149]. The final attributes used in the rules are thresholds on the 100-Way
multiple alignment sequence conservation (attribute group E).
The top 3 identified rules (and the number they cover of each class) are as follows:

IF
gappedPeak_E108-H3K9ac.gappedPeak.gz < 13613081900 = TRUE
AND gappedPeak_E097-H3K9me3.gappedPeak.gz < 13524832050 = TRUE
AND broadPeak_E094-H3K4me1.broadPeak.gz >= 216 = TRUE
AND broadPeak_E069-H3K27ac.broadPeak.gz < 52 ≠ TRUE
AND wgEncodeCshlLongRnaSeq_wgEncodeCshlLongRnaSeqBjCellPapContigs-bedRnaElements.gz < 624876 = TRUE
AND broadPeak_E121-H3K9ac.broadPeak.gz < 808 = TRUE

THEN PosClass.
Covering 150 positive instances and 20 negative instances.

IF
gappedPeak_E019-H3K9ac.gappedPeak.gz < 3041216650 = TRUE
AND broadPeak_E065-H3K27me3.broadPeak.gz >= 159 = TRUE
AND gappedPeak_E009-H3K4me1.gappedPeak.gz >= 13684353550 = TRUE
AND broadPeak_E107-H3K27me3.broadPeak.gz < 1395 = TRUE
AND gappedPeak_E003-H2A.Z.gappedPeak.gz < 9625166000 = TRUE
AND 100-Way_PHYLOP (f13) < -1003 = TRUE

THEN PosClass.
Covering 211 positive instances and 44 negative instances.

IF
100-Way_PHYLOP < 4 = TRUE
AND broadPeak_E065-H3K27me3.broadPeak.gz >= 159 = TRUE
AND gappedPeak_E009-H3K4me1.gappedPeak.gz >= 13684353550 = TRUE
AND broadPeak_E003-H3K4me1.broadPeak.gz < 300 = TRUE
AND gappedPeak_E003-H2A.Z.gappedPeak.gz < 9185280200 = TRUE
AND 100-Way_PHYLOP (f13) > -1003 = TRUE

THEN PosClass.
Covering 381 positive instances and 120 negative instances.

Interestingly, this last rule has two features built from the same attribute, 100-Way_PHYLOP,
which means that the subgroup contains examples that lie in a range of sequence conservation
values. This perhaps indicates that the covered CpG sites lie in parts of the DNA that have
diverged from a common ancestor for the same amount of time but are now distinct areas
(though this is speculation).

6.3.2 Application to the microbiome


Method

For this application we use a dataset from Muirhead [107] combined with the meta data
from the IMG/M database. In this dataset there are 255 samples taken from patients with
lesional psoriasis and 257 samples taken from people without lesional psoriasis. For each
sample we have the abundance levels of different microbes obtained by using 16S rRNA
Sequencing. The 16S rRNA sequencing data is processed by assigning reads to operational
taxonomic units (OTUs). For further details of how this is done see [107].
As in section 6.3.1, the object of study here is not the participants, but the individual
OTUs. The OTUs are the instances for our data analysis. In order to assign class labels to
each OTU we aggregate the samples from each person in the two different groups (lesional
and non lesional). If an OTU is only present in lesional samples then this is assigned the
positive class. If an OTU is present in non lesional samples then this is in the negative
class (so this class contains OTUs that are only found in non lesional samples as well as
OTUs found in both lesional and non lesional samples). This results in 891 positive OTUs
and 2,185 negative OTUs (3,076 total instances).
The process from Muirhead [107] assigned each OTU the most specific taxonomic unit
possible. This means that some OTUs can be associated with a species of bacteria, whereas
others can only be determined at the genus (or higher) taxonomic level. As attributes for
the microbes we used IMG/M meta data. The IMG/M meta data on bacterial genomes is
for a more specific level in the taxonomic tree than the data we have from Muirhead [107].
This means that the attributes in the IMG/M data are associated with at least the species
level, and sometimes with individual strains of a species. This means that our data is noisy –
we do not know exactly which set of attributes apply to which microbe from the Muirhead
study [107]. In order to account for this we use the multi-instance version of our genetic
subgroup discovery algorithm. We create bags for the lowest taxonomic level available for
each instance. For example, if the lowest level is species, then we assign all the strains
from that species in the IMG/M data into a bag for that instance.
To give a concrete example, OTU 5118 (a positive class instance) has the follow-
ing taxonomic classification: phylum=‘Proteobacteria’, class=‘Alphaproteobacteria’, or-
der=‘Rhizobiales’, family=‘Phyllobacteriaceae’, genus=‘Phyllobacterium’. We do not know
the exact species, and in our IMG/M data we have two species in this taxonomic group:
‘Phyllobacterium sp. YR531’ and ‘Phyllobacterium sp. UNC302MFCol5.2’, so the at-
tributes from these two species are in the instance bag for OTU 5118. This is illustrated
in Table 6.16. In total, we use 25 discrete attributes. Our features are constructed by the
algorithm in Figure 6.2 and this results in a total of 200 features.

OTU-ID A1 A2 A3 C
OTU1 a c 1.1 Pos
b z 0.2
q c 1.1
OTU2 q c 0.2 Neg
r c 0.1
v z 1.0
OTU3 z v 0.2 Pos
e c 0.1
a q 0.3
... ... ... ... ...
OTUdn ... ... ... cn

Table 6.16: A multi-instance data table, each OTU has a bag of species attributes
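The bag construction described above can be sketched as follows, where otu_lowest_taxon/3, imgm_genome_taxon/3 and imgm_data/3 are hypothetical lookup relations standing in for the Muirhead OTU assignments and the IMG/M meta data:

% otu_bag(+OTU, -Bag): Bag is a list of attribute-value lists, one list per
% IMG/M genome that falls under the OTU's lowest assigned taxonomic level.
otu_bag(OTU, Bag) :-
    otu_lowest_taxon(OTU, Rank, Taxon),
    findall(Attributes,
            ( imgm_genome_taxon(Genome, Rank, Taxon),
              findall(Attribute-Value, imgm_data(Genome, Attribute, Value), Attributes)
            ),
            Bag).

For OTU 5118, otu_lowest_taxon/3 would report the genus ‘Phyllobacterium’, and the resulting bag would contain one attribute list for each of the two Phyllobacterium strains present in the IMG/M data.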

We ran the genetic algorithm for 20 generations with an initial population of 50 and a
chromosome size set to 3.

Results

The genetic algorithm took 4 hours to complete, and the top 5 identified rules (and the
number they cover of each class) are as follows:

IF Ecosystem = ‘Ponds’
AND Phenotype = ‘Unknown’
AND Metabolism = ‘Chlorophenol degrading’
THEN OTU is present in lesional psoriasis

Covering 15 positive instances and 2 negative instances.

IF Ecosystem Category = ‘Protozoa’
AND Phenotype = ‘Catalase positive, Non-Haemolytic’
AND Habitat = ‘Biofilm’
THEN OTU is present in lesional psoriasis.

Covering 13 positive instances and 6 negative instances.

IF Metabolism = ‘Iron oxidizer, Sulfur oxidizer’
AND Habitat = ‘Aquatic Soil’
AND Habitat = ‘Host skin’
THEN OTU is present in lesional psoriasis.

Covering 17 positive instances and 10 negative instances.

IF Metabolism = ‘Iron oxidizer, Sulfur oxidizer’
AND Phenotype = ‘Catalase positive, Non-Haemolytic’
AND Habitat = ‘Host, Human upper respiratory tract’
THEN OTU is present in lesional psoriasis.

Covering 10 positive instances and 8 negative instances.

IF Ecosystem Category = ‘Protozoa’
AND Habitat = ‘Aquatic Soil’
AND Habitat = ‘Biofilm’
THEN OTU is present in lesional psoriasis

Covering 38 positive instances and 29 negative instances.

Each of these rules describes a subgroup of bacteria whose class distribution is different
from the overall class distribution. These rules are interpretable, such that they can be used by
a biologist to generate hypotheses for further exploration. For example, the last subgroup
found describes specific bacteria that are said to live in ‘Aquatic Soil’ and on a ‘Biofilm’;
a biologist could attempt to culture these microbes and study under what conditions
they grow and how they interact with different environments. Further questions could be
asked about the ways in which these bacteria react to different classes of antibiotics. Another area
of study could be prompted by the second to last subgroup, where the metabolism and
phenotype of the microbiota has been characterised. An expert biologist may be able to
postulate different host/microbiota interactions based on these specific identified concepts.
These could then be tested in the laboratory.

6.4 Summary and discussion


In this chapter we have identified a number of biological data mining tasks that can be
formulated as subgroup discovery tasks. We focus on two particular applications, where we
seek to find subgroups of: 1) differentially methylated CpG sites in cancer and 2) microbes
present in lesional psoriasis. We argued that Prolog was an ideal programming language
for developing subgroup discovery algorithms for two reasons. First, Prolog allows the
task to be formulated as a fully declarative program, which allowed us to write small (in
terms of code base) and clear prototypes. Second, by taking into account the procedural
aspects of Prolog programming, we could develop efficient algorithms that
allowed us to mine large datasets. We used two different data representations, attribute
value representation and multi-instance, for the CpG and microbiome data, respectively.
The latter allows us to account for noise in the data, where we could not be certain which
set of attributes applied to our instances. In both tasks, we identified a number of rules
that characterised subgroups in these data, having a different distribution of class labels
than the data as a whole. These will be potentially interesting to biologists who study
cancer or psoriasis, and can be thought of as promising avenues to follow up with further
investigations.
In order to check that these rules are significant, permutation testing would be required.
The process for this would be to permute our data labels 1000 times and run the genetic
algorithm on the permuted data. We would then identify the top rule from each permutation,
and calculate the proportion of these whose class distribution differs from the
underlying class distribution by more than that of the ‘true’ rule we have identified. This
proportion represents the probability of identifying a rule as ‘interesting’ as the one we have
identified by chance alone. This process requires access to a high performance computing
resource, which we did not have access to (alternatively, parallel versions of the algorithms
could be developed).
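A sketch of such a permutation test is given below. It assumes hypothetical helpers: run_ga/3 wraps one run of the genetic algorithm and rule_value/2 scores a rule (for example by its rate difference). Since each permutation is independent, the iterations could also be distributed across machines.

:- use_module(library(random)).
:- use_module(library(apply)).

% permutation_pvalue(+Pos, +Neg, +ObservedValue, +N, -PValue): proportion of N
% label permutations whose best rule scores at least as well as the observed rule.
permutation_pvalue(Pos, Neg, ObservedValue, N, PValue) :-
    length(Pos, NPos),
    findall(Value,
            ( between(1, N, _),
              append(Pos, Neg, All),
              random_permutation(All, Shuffled),
              length(PermPos, NPos),
              append(PermPos, PermNeg, Shuffled),
              run_ga(PermPos, PermNeg, BestRule),
              rule_value(BestRule, Value)
            ),
            Values),
    include(value_at_least(ObservedValue), Values, AtLeastAsGood),
    length(AtLeastAsGood, K),
    PValue is K / N.

value_at_least(Threshold, Value) :- Value >= Threshold.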
The identified rules represent some new knowledge about these domains, and can be
easily transformed into predicates that can be deployed on the relevant knowledgebases.
They can also serve as building blocks for other predicate definitions that may be written
manually or learnt from data.
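For example, the last microbiome subgroup above could be deployed as a predicate over a suitable knowledge base, where otu_attribute/3 is an assumed relation between an OTU, an attribute name and a value:

% candidate_lesional_otu(?OTU): OTUs matching the last identified subgroup rule.
candidate_lesional_otu(OTU) :-
    otu_attribute(OTU, 'Ecosystem Category', 'Protozoa'),
    otu_attribute(OTU, 'Habitat', 'Aquatic Soil'),
    otu_attribute(OTU, 'Habitat', 'Biofilm').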
Part III

Structured data

Chapter 7

Reactome Pengine

The work presented in this chapter is based on the publication: Neaves, Samuel R., Sophia
Tsoka and Louise A. C. Millard. “Reactome Pengine: A web-logic API to the homo sapiens
reactome.” Bioinformatics 1 (2018): 3.

7.1 Description of content


Existing ways of accessing data from the Reactome database are limited. Either a re-
searcher is restricted to particular queries defined by a web application programming inter-
face (API), or they have to download the whole database. This chapter presents Reactome
Pengine which is a web service providing a logic programming based API to the human
reactome. This gives researchers greater flexibility in data access than existing APIs, as
users can send their own small programs (alongside queries) to Reactome Pengine.
Availability: The server and an example notebook can be found at:
https://apps.nms.kcl.ac.uk/reactome-pengine.
Source code is available at:
https://github.com/samwalrus/reactome-pengine
A Docker image is available at:
https://hub.docker.com/r/samneaves/rp4/ .

7.2 Introduction
Reactome [30] is a web service that includes a database of the molecular details of cellular
processes, and is one of the leading tools for bioinformaticians working with biological

pathways. Currently, users access data in Reactome using either HTML (the website), a
REST API, a SPARQL API, or by downloading the complete dataset for local processing.
The APIs provide a convenient way to access the data but restrict this access to a set
of predefined API calls (for example, a user can choose to download only the subsets of data
that these calls return), whereas downloading the complete dataset means that the data can be
processed exactly as required. This chapter presents the tool Reactome Pengine, which
allows the flexibility of the latter, with the convenience of the former. This makes queries
more efficient, saving both bandwidth and storage space, and this is achieved by using
the logic programming language, Prolog. Logic programming is a paradigm for computer
programming, where knowledge is represented in a restricted form of first order logic, as
a set of facts and rules called a knowledge base [22]. See Chapter 5 for more details. The
knowledge base is interrogated with queries, which are powerful due to inbuilt procedures
that use the facts and rules together to infer solutions. As we argue throughout this thesis
Logic programming has much potential in bioinformatics [108, 4, 6] and it can be used
with Reactome to build predictive models which we demonstrate in Chapter 8.

7.3 Implementation
Recently, a library for building web servers using SWI-Prolog [160], called Pengines [88],
has been developed. Pengines allows data providers to make their Prolog knowledge base
available to users via a web service (that uses a web-logic API), accessed as if it were on
the user’s machine. In addition, users can send programs to the pengine to manipulate
the data as they wish. This is very different from the traditional way of accessing data,
where a user is either constrained to the set of queries defined in a (non-logical) web API,
or has to download a dataset in bulk. Pengine services support federated queries, similar
to SPARQL, but with Turing complete programs executing on remote services rather than
SQL-like queries. We will describe this in more detail in later sections.
The Reactome Pengine tool presented here uses the Pengine library to make a Prolog
knowledge base, built on Reactome data, available to researchers on the internet. The
mainstay of the knowledge base is the set of facts retrieved from the Reactome HomoSapiens.owl
RDF file, which contains circa 1.35 million RDF triples. In addition, we have
provided an intuitive set of data access predicates (which are similar to functions in other
programming paradigms) that sit on top of the RDF data. These define relations between
Reactome entities and provide access to the data at a higher level of abstraction. Users
can query the RDF directly or use this abstraction layer (or both), or make their own
abstraction of the data. Our abstraction layer includes predicates that represent reactions
as nodes on a graph, with edges between the nodes as described in the following two cases.
First, an edge exists when an output of a reaction is an input of another reaction. This edge
type we name precedes. Second, an edge exists when an output of a reaction r1 is a control
of another reaction r2 , and the particular edge type depends on how the output of r1
controls r2 (e.g. activation or inhibition, or subtypes of these). An example predicate that
relates two reactions via a linking entity is ridReaction_ridLink_type_ridReaction/4.
We also provide predicates with indexed (therefore fast) access to a set of queries that
we expect to be useful for researchers, but that are computationally intensive (and hence
slow without indexing). For example ridPathway_reactions/2 relates pathways to the
complete list of biochemical reaction IDs.
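As a brief illustration (the pathway identifier is only an example, and pengine_rpc/3 is described later in this chapter), these predicates can be called from a local SWI-Prolog session as follows:

:- use_module(library(pengines)).

% List the biochemical reactions of a pathway using the indexed predicate.
pathway_reactions(Pathway, Reactions) :-
    pengine_rpc('https://apps.nms.kcl.ac.uk/reactome-pengine',
                ridPathway_reactions(Pathway, Reactions),
                []).

% Find an outgoing edge of a reaction in the reaction graph.
reaction_successor(Reaction, Type, Next) :-
    pengine_rpc('https://apps.nms.kcl.ac.uk/reactome-pengine',
                ridReaction_ridLink_type_ridReaction(Reaction, _Link, Type, Next),
                []).

For example, ?- pathway_reactions('Pathway1', Reactions). returns the complete list of reaction IDs for that pathway without downloading the underlying RDF file.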
The Pengine library has inbuilt mechanisms to ensure the integrity of the server on
which it is hosted. Security is ensured by allowing only safe predicates to be run on the
Pengine server. Upon running a query the service first checks that the query is safe and
returns an error if this is not the case. For example, sending a program that calls shell/1
would result in an error (because a user could for example send a shutdown command to
the server). The Pengine library also contains a number of methods to manage resource
allocation on the server, including restricting request execution time and the maximum
number of requests that can be executed simultaneously. For more details see [88]. Finally,
the service runs inside a docker container which isolates the service from the underlying
machine, and facilitates scaling and load balancing to meet demand. Queries to Reactome
Pengine are logged such that over time we can augment the inbuilt predicates with the
popular queries and programs, and also build new indexes to improve performance and
functionality as the service is used. This also means we can explore the possibility of
applying machine learning on the collected programs to automatically learn predicates
that are useful for users. Documentation describing the logical API is available at:
https://apps.nms.kcl.ac.uk/reactome-pengine/.
To use SWISH to interact with Reactome Pengine (shown as grey/solid arrows in
Figure 7.1) the user writes a program (program A in Figure 7.1) for SWISH, that will
itself contain a query or program (program B in Figure 7.1) to be processed on Reactome
Pengine. When SWISH executes program A, the constituent program B is forwarded to
Reactome Pengine. Upon receiving program B Reactome Pengine executes it and sends
the results back to SWISH. SWISH continues program A and then displays the results in
the user’s browser.

7.4 Interfacing with Reactome Pengine


There are two main ways to access Reactome Pengine. Firstly, the Reactome Pengine
can be accessed using a web application, such as SWISH web notebooks [161]. As with
notebooks of other languages (such as Jupyter for Python), a SWISH notebook includes
executable code interwoven with text explanations, which is ideal when wishing to share code
with other researchers or work collaboratively. Furthermore, accessing Reactome Pengine
via a notebook means that the researcher does not need to set up SWI-Prolog on their
machine. SWISH includes graphical renderers such as C3 for simple chart generation and
Graphviz for graph visualisation. Users can also include Javascript and R code in their
SWISH notebook. For example, the Javascript D3 library can be used to generate inter-
active visualisations. R can be used to perform statistical analyses and plot results. An
example notebook demonstrating these capabilities is available at
https://apps.nms.kcl.ac.uk/reactome-pengine/ and it is also recreated in this section.
The second main way to access Reactome Pengine is within a SWI-Prolog program
running on a local machine. We can do this by adding the directive
:-use_module(library(pengines)). and using the pengine_rpc/3 predicate. This can
be useful for writing local application pipelines that need to access Reactome data. The
output of the Reactome Pengine request can then be used as input to the next step in the
pipeline (see Section 7.5.2). This is especially useful where data cannot be processed in the
cloud due to regulatory constraints. In addition to these two recommended Prolog based
access options, it is also possible to access the service with other languages that support
HTTP such as JavaScript Node.js. See code block 6 in this chapter for details including
an example JavaScript function that calls Reactome Pengine.

7.4.1 Reactome Pengine data flows


Figure 7.1 illustrates the flow of data between the user’s computer and Reactome Pengine,
which could occur either directly or through SWISH. To use Reactome Pengine directly
(shown as yellow/dashed arrows in Figure 7.1) the user writes a local Prolog program
containing a subroutine that sends a query or program to Reactome Pengine. Reactome
Pengine executes the program and returns data back to the local calling program to finish
its execution.

Figure 7.1: Diagram of possible user interactions with the Reactome Pengine server.
Thick arrows: data sent to Reactome Pengine; thin arrows: data returned from Reac-
tome Pengine. Yellow, dashed: Direct interaction with Reactome Pengine; grey, solid:
Interaction with Reactome Pengine via SWISH

7.5 Example usage


We first describe two particular use case examples, both using SWI-Prolog 7.7, and
then present a static version of the queries from the online interactive SWISH notebook.
The first example in Command line interaction 1 shows a simple interactive Prolog session
with two queries. The first query imports the Pengine library and the second query calls
the pengine_rpc/2 predicate. The pengine_rpc/2 predicate has two arguments: 1) the
server address of Reactome Pengine, and 2) the query we wish to run on Reactome Pengine.
In this example we issue the query rid_name(Protein1,Name), to find the common name
for the Reactome identifier Protein1. The result obtained is the variable Name, bound to
"Rnf111".
Command line interaction 1
?- use_module(library(pengines)).
true.
?- pengine_rpc(‘https://apps.nms.kcl.ac.uk/reactome-pengine/’,rid_name(‘Protein1’,Name)).
Name = "Rnf111" .

The second more advanced example shows how to send a program (alongside a query
that calls the program) to Reactome Pengine to be computed remotely. For instance,
a bioinformatician can use Reactome Pengine to explore paths of reactions through the
human reactome. Code block 1 shows an example Prolog program that can be used for this
purpose (adapted from https://stackoverflow.com/questions/30328433/definition-of-a-path-trail-walk/30595271#30595271; available on GitHub at https://github.com/samwalrus/reactome-pengine).
This program includes two core elements. First, the predicate path_program/1 retrieves a
list of clauses that themselves define a program that will be sent to the Reactome Pengine.
Second, the predicate path_from_to/3 is the main predicate that a bioinformatician would
use to query the Reactome for paths in a variety of ways (without downloading the entire
dataset to their machine). For instance, a researcher can use this predicate to: a) establish
whether a path exists from a particular reaction to another, b) retrieve all paths from
a reaction, or c) retrieve all paths to a reaction. The path_from_to/3 predicate first
retrieves the Reactome Pengine server address (line 25) and the program (specified in
path_program/1 lines 4-23 and called on line 26), and then sends this program alongside
a specified query to Reactome Pengine (lines 27-29). Command line interaction 2 shows
example commands that use this program. Furthermore, examples 6 and 7 in the notebook
show additional refinements to this program, such as further constraints on properties
of reaction paths.
Code Block 1: paths.pl
1 :-use_module(library(pengines)).
2 reactome_server(S):-
3 S=‘https://apps.nms.kcl.ac.uk/reactome-pengine’.
4 path_program(Program):-
5 Program=[
6 (:- meta_predicate path(2,?,?,?)),
7 (:- meta_predicate path(2,?,?,?,+)),
8 (graph_path_from_to(P_2,Path,From,To):-
9 path(P_2,Path,From,To)),
10 (path(R_2, [X0|Ys], X0,X):-
11 path(R_2, Ys, X0,X, [X0])),
12 (path(_R_2, [], X,X, _)),
13 (path(R_2, [X1|Ys], X0,X, Xs) :-
14 call(R_2, X0,X1),
15 non_member(X1, Xs),
16 path(R_2, Ys, X1,X, [X1|Xs])),
17 (non_member(_E, [])),
18 (non_member(E, [X|Xs]) :-
19 dif(E,X),non_member(E, Xs)),
20 ( e(R1,R2):-
21 ridReaction_ridLink_type_ridReaction(R1,_,_,R2)
22 )
23 ].
24 path_from_to(Path,From,To):-
25 reactome_server(Server),
26 path_program(Program),
27 pengine_rpc(Server,
28 graph_path_from_to(e,Path,From,To),
29 [src_list(Program)]).

Figure 7.2: Lines 6-22 are the program sent to Reactome Pengine. In this example, the
program is a list of terms, where each term is a clause that will be interpreted by Reactome
Pengine.

Command line interaction 2


?- consult(‘codeblock1.pl’).
true.
?- path_from_to(Path,From,To).
Path = [To],
From = To ;
Path = [‘BiochemicalReaction4884’, ‘BiochemicalReaction4068’],
From = ‘BiochemicalReaction4884’,
To = ‘BiochemicalReaction4068’ .

7.5.1 SWISH examples


This section is a plain text version of the accompanying online interactive SWISH notebook.
It can be found at: https://apps.nms.kcl.ac.uk/reactome-pengine/. The following
queries can be run in the browser to see live results, but are also presented here in plain
text. The queries are designed to be run in the notebook and they may produce different
results outside of that context.
The examples of functionality in this notebook are a fraction of the possible ways to
use Reactome Pengine, intended to illustrate calls to the API.

Basic usage

The basic usage section of the notebook contains 6 interactive queries (1-6). We first briefly
explain the data model of the underlying Reactome dataset. Secondly, we explain the data
model that we have built into Reactome Pengine. We then explain where computations
take place. Next we give some example queries that illustrate the capabilities of Reactome
Pengine in the context of a SWISH notebook, including graphical rendering, R integration
and Javascript applications. Also note that further functionality of the Reactome Pengine
can be utilised by building client applications in the full desktop version of SWI-Prolog.
1. Underlying data model of Reactome
The underlying data model is based on an RDF triple graph. This is fully documented on the
Reactome website [30]. In brief, the principal type of entity in Reactome is the reaction, and
reactions have input and output entities. Reactions are also optionally controlled. Each
biological entity, such as a protein, small molecule or reaction, is given an ID. Entities
also include ’Complexes’ and ’Protein sets’. A complex is a set of molecular entities that
have combined together. A protein set is a set of proteins that can perform the same
biological function. Both complexes and protein sets can themselves be composed of
complexes and protein sets. We can query this data directly with the rdf/3 predicate. In
the online version you can see how this is done by clicking the blue play triangle in the
relevant cell. You would then also be able to find further solutions with the ’next’ button.
SWISH Query 1
:-use_module(library(pengines)).
reactome_server(‘https://apps.nms.kcl.ac.uk/reactome-pengine’).
rdf(X,Y,Z):-
reactome_server(S),
pengine_rpc(S,rdf(X,Y,Z),[]).

?-rdf(X,Y,Z).
X = ‘http://www.reactome.org/biopax/47/48887#PublicationXref12’,
Y = ‘http://www.biopax.org/release/biopax-level3.owl#author’,
Z = ^^("Chang, Nan-Chi", ‘http://www.w3.org/2001/XMLSchema#string’)

2. Our Reactome Pengine Data model


In addition to the direct method of querying the RDF file using the Prolog semantic web
library, a number of higher level predicates built on top of this library are also provided.
Often these higher level predicates give a more intuitive and compact view of the data.
Our model states that reactions are nodes on a graph. There are edges between these
nodes in two cases:

1. When an output of a reaction is an input to another reaction. This edge type we name ’precedes’.

2. An output of a reaction r1 is a control of another reaction r2. This results in a number of different edge types depending on how the output of r1 controls r2. The types are:

– ACTIVATION

– ACTIVATION-ALLOSTERIC

– INHIBITION

– INHIBITION-ALLOSTERIC

– INHIBITION-COMPETITIVE

– INHIBITION-NONCOMPETITIVE

We can view this graph for an individual pathway using ridPathway_links/2 shown below.
We can use this predicate in all directions, i.e. 1) enumerate pathways and link pairs (neither
argument is instantiated), 2) find the links for a particular pathway (pathway argument is
instantiated), 3) see which pathway has a set of links (links argument is instantiated), or 4)
check whether a given pathway has a given set of links (both arguments are instantiated).
Most predicates in the API are capable of multi-directional queries. Details are given in
the Reactome Pengine API documentation.
Generate pathways and a list of links in the pathway using:
SWISH Query 2.1
reactome_server(S),
pengine_rpc(S,ridPathway_links(P,Ls)).

?-ridPathway_links(P,L).
L = [ridReaction_ridLink_type_ridReaction(‘BiochemicalReaction6’, ‘Complex18’, precedes,
‘BiochemicalReaction7’), ridReaction_ridLink_type_ridReaction(‘BiochemicalReaction5’, ‘Complex17’,
‘ACTIVATION’, ‘BiochemicalReaction6’), ridReaction_ridLink_type_ridReaction(‘BiochemicalReaction5’,
‘Complex17’, precedes, ‘BiochemicalReaction6’), ridReaction_ridLink_type_ridReaction(‘BiochemicalReaction7’,
‘Complex14’, precedes, ‘BiochemicalReaction5’), ridReaction_ridLink_type_ridReaction(‘BiochemicalReaction7’,
‘Protein53’, precedes, ‘BiochemicalReaction6’)],
P = ‘Pathway1’;

Generate pathways that have an activation link where the activating entity is ’Complex452’
using:
SWISH Query 2.2
?-ridPathway_links(P,_Ls),
member(ridReaction_ridLink_type_ridReaction(R1, ‘Complex452’, ‘ACTIVATION’, R2),_Ls).

P = ‘Pathway100’,
R1 = ‘BiochemicalReaction301’,
R2 = ‘BiochemicalReaction302’;

3. Converting between Reactome Identifiers and familiar names/identifiers


In order to convert between the Reactome identifiers and the common names of entities
you can use the rid_name/2 predicate. You may also want to convert Reactome protein
identifiers to Uniprot identifiers, and for this you can use ridProtein_uniprotId/2. Both
of these predicates are full relations so a user can query in both directions and use the
predicate to generate examples or to check that a pair are related. The following two
examples retrieve the name and Uniprot ID for the Reactome IDs ‘Complex452’ and
‘Protein1’, respectively.
SWISH Query 3.1
?-reactome_server(S),pengine_rpc(S,rid_name(‘Complex452’,Name),[]).
Name = "Active TrkA receptor:Phospho-Frs2:CrkL:C3G complex",
S = ‘https://apps.nms.kcl.ac.uk/reactome-pengine’.

SWISH Query 3.2


?-reactome_server(S),pengine_rpc(S,ridProtein_uniprotId(‘Protein1’,UniprotId),[]).
S = ‘https://apps.nms.kcl.ac.uk/reactome-pengine’,
UniprotId = ‘Q99ML9’;

4. Finding the Protein Complex that contains the Sonic Hedgehog Protein
Sonic Hedgehog (SHH) is a well studied protein. If we want to query Reactome Pengine
to find which protein complex it takes part in, we can use the following query.
SWISH Query 4.1
?-reactome_server(S),pengine_rpc(S,(rid_name(RidSonic,"SHH"),ridProteinSet_component(RidProteinSet,RidSonic),
ridComplex_component(RidComplex,RidProteinSet),rid_name(RidComplex,ComplexName)),[]).
ComplexName = "Patched:Hedgehog",
RidComplex = ‘Complex1074’,
RidProteinSet = ‘Protein1901’,
RidSonic = ‘Protein1902’,
S = ‘https://apps.nms.kcl.ac.uk/reactome-pengine’;

5. Naming and reuse of queries


In Prolog a query is a call to a predicate (or as we have seen a conjunction of predicates).
As with any Prolog program we can reuse a query and assign it a name by writing a new
predicate definition containing the query. This is useful for complex queries as we can
decompose the complexity into small parts.
In the following program we name the previous query as
pathway_with_complex452_activation/1. This predicate is true if P is instantiated with
a pathway that has an activation edge where the activation entity is complex452. We also
define a predicate small_pathway/1 for pathways with fewer than 35 edges. We then define a
new predicate, reusing pathway_with_complex452_activation/1 and small_pathway/1
to define small_pathway_with_complex452_activation/1. As this new predicate is made
from the composition of pathway_with_complex452_activation/1 and small_pathway/1,
we can logically reason that the new predicate will be true for the intersection of the answers
to the original two predicates.


SWISH Program 5.1
pathway_with_complex452_activation(P):-
ridPathway_links(P,Ls),
member(ridReaction_ridLink_type_ridReaction(_R1, ‘Complex452’, ‘ACTIVATION’, _R2),Ls).

small_pathway(P):-
ridPathway_links(P,L),
length(L,S),
S<35.

small_pathway_with_complex452_activation(P):-
small_pathway(P),
pathway_with_complex452_activation(P).

Running these three queries allows you to see that the answers to the third query are
the intersection of the answers to the first two queries, as we expected.
SWISH Query 5.1
?-pathway_with_complex452_activation(P).
P = ‘Pathway100’;
P = ‘Pathway101’

SWISH Query 5.2


?-small_pathway(P).
P = ‘Pathway1’;

SWISH Query 5.3


?-small_pathway_with_complex452_activation(P).
P = ‘Pathway101’;

6. Where the execution of our programs takes place


As we are seeing, using Prolog allows us to specify programs and algorithms, not just
queries (in contrast to SQL, Cypher, REST APIs and SPARQL).
When using SWISH our programs are either executed on Reactome Pengine or on
the SWISH server (itself a pengine application). When using the desktop version of SWI
Prolog, programs are either executed on Reactome Pengine or your local machine. The idea
is that small programs can be brought to the large data, rather than needing to transfer
large datasets to a user’s machine. This, alongside the ability to send programs to the
pengine, makes for an extremely flexible logical API. Both SWISH and Reactome Pengine
have a time limit on queries. So far we have made simple queries of Reactome Pengine,
and executed our programs in SWISH. In order to reduce the data transfer, and to send a
program to the Pengine Reactome we use the 3rd argument of pengine_rpc/3.

Here we write a program to find a path on the graph of reactions across the whole
Reactome. The predicate path_program/1 returns the program we wish to run, as a
list of clauses. The predicate path_from_to/3 retrieves the server address and program
and sends this, along with the query, to Reactome Pengine. The identified path is then
returned. To perform the same query without using Reactome Pengine the entire database
would need to be downloaded.
SWISH Program 6.1
:-use_module(library(pengines)).

path_program(Program):-
Program=[
(:- meta_predicate path(2,?,?,?)),
(:- meta_predicate path(2,?,?,?,+)),
(graph_path_from_to(P_2,Path,From,To):-path(P_2,Path,From,To)),
(path(R_2, [X0|Ys], X0,X):-path(R_2, Ys, X0,X, [X0])),
(path(_R_2, [], X,X, _)),
(path(R_2, [X1|Ys], X0,X, Xs) :- call(R_2, X0,X1),non_member(X1, Xs),path(R_2, Ys, X1,X, [X1|Xs])),
(non_member(_E, [])),
(non_member(E, [X|Xs]) :- dif(E,X),non_member(E, Xs)),
( e(R1,R2):-ridReaction_ridLink_type_ridReaction(R1,_,_,R2)) %this makes a two place edge term
].

%Send a program and a query to the pengine reactome and return the result.
path_from_to(Path,From,To):-
reactome_server(Server),
path_program(Program),
Query=graph_path_from_to(e,Path,From,To),
pengine_rpc(Server,Query,[src_list(Program)]).

SWISH Query 6.1


?-path_from_to(Path,From,To).
From = To,
Path = [To]
From = ‘BiochemicalReaction4884’,
Path = [‘BiochemicalReaction4884’, ‘BiochemicalReaction4068’],
To = ‘BiochemicalReaction4068’;

The following query enumerates paths in order of increasing length (an iterative-deepening
strategy that, like a breadth-first search, finds the shortest paths first), producing each
new result on backtracking.


SWISH Query 6.2
?-length(Path,_),path_from_to(Path,From,To).
From = To,
Path = [To]
From = ‘BiochemicalReaction4884’,
Path = [‘BiochemicalReaction4884’, ‘BiochemicalReaction4068’],
To = ‘BiochemicalReaction4068’;
From = ‘BiochemicalReaction4884’,
Path = [‘BiochemicalReaction4884’, ‘BiochemicalReaction4883’],
To = ‘BiochemicalReaction4883’;

Advanced usage

The advanced usage section of the notebook contains further examples (7-11).
7. Definite Clause Grammars
As we have discussed (in Chapter 5), in Prolog we can describe lists declaratively, so, for
instance, we can write a definite clause grammar (DCG) for paths. We remind the reader
that DCG rules have a slightly different syntax to standard Prolog predicates, but
this does not affect their ability to be sent to Reactome Pengine. For simplicity of
presentation the DCG in the example below runs on the SWISH server. This example finds
paths that pass through a reaction that satisfies a user-defined rule. In this case we
simply ask for a path that passes through a reaction that has CTDP1 (Protein11301) as
an input. Additionally we add Path=[_,_|_] to the query. This serves as a constraint
to find paths containing at least two reactions.
SWISH Program 7.1
ridReaction_input(Rid,I):-
reactome_server(S),
pengine_rpc(S,ridReaction_input(Rid,I),[]).

okPath --> begin, needs,end.

begin -->[].
begin -->[_],begin.

needs -->{ridReaction_input(R,'Protein11301')},[R].

end --> [].


end --> [_],end.

SWISH Query 7.1


?-path_from_to(Path,From,To), phrase(okPath,Path), Path=[_,_|_].
From = ‘BiochemicalReaction4884’,
Path = [‘BiochemicalReaction4884’, ‘BiochemicalReaction4068’, ‘BiochemicalReaction4064’],
To = ‘BiochemicalReaction4064’;

8. Rendering the query results


Graphviz
Up to this point we have been representing the reaction graph as lists and terms. While
this is useful for computations, for conveying results to human users it is often better to
use graphical representations. To this end, in SWISH we can use its Graphviz renderer.
Additionally, to make the graph visualisation more meaningful, we map the Reactome IDs
to the reaction names.
SWISH Program 8.1
:- use_rendering(graphviz).

graphviz_program(Program):-
Program =[
(linktype_color(X,green):- member(X,['ACTIVATION', 'ACTIVATION-ALLOSTERIC'])),
(linktype_color(X,red):- member(X,['INHIBITION',
'INHIBITION-ALLOSTERIC',
'INHIBITION-COMPETITIVE',
'INHIBITION-NONCOMPETITIVE'])),
(linktype_color(precedes,black)),
(link_graphvizedge(ridReaction_ridLink_type_ridReaction(R1, _, Type, R2),
edge(R1Name->R2Name,[color=Color])):-
linktype_color(Type,Color),
rid_name(R1,R1Name),
rid_name(R2,R2Name))
].

pathway_graphviz(P,G):-
reactome_server(S),
graphviz_program(Program),
pengine_rpc(S,(ridPathway_links(P,L),maplist(link_graphvizedge,L,GE)),[src_list(Program)]),
G = digraph(GE).

SWISH Query 8.1


?-pathway_graphviz(P,G).
P = 'Pathway1',
G=

Charting data with C3


We can chart data from Reactome using C3, a Javascript visualisation library. The follow-
ing code makes a simple plot showing the number of reactions in a set of pathways.
SWISH Program 8.2
:- use_rendering(c3).

pathways(P):-
P=['Pathway1','Pathway2','Pathway5','Pathway12'].

pathway_pairsize(P,PName-L):-
reactome_server(S),
pengine_rpc(S,(ridPathway_reactions(P,R),rid_name(P,PName)),[]),
length(R,L).

chart(Chart):-
pathways(P),
maplist(pathway_pairsize,P,Pairs),
Chart = c3{data:_{x:elem, rows:[elem-count|Pairs], type:bar},
axis:_{x:_{type:category}}}.

SWISH Query 8.2


?-chart(C).
C=

9. Scraping the web and integrating with Reactome Pengine


Scraping the traditional HTML web for data makes it possible to combine many existing
remote data sites with Reactome Pengine. Web scraping inside SWISH is currently limited
to using the predicate load_html/3, whereas if you build a client application with the full
version of SWI Prolog then predicates such as http_open/3, are available. After retrieving
the HTML using either of these methods, we can then utilise libraries sgml and xpath to
manipulate the data.
In the example below we scrape gene expression data from the GEO website for sample
GSM38051. We also demonstrate how results can be displayed with the SWISH table

renderer.
SWISH Program 9.1
:- use_rendering(table).
:- use_module(library(sgml)).
:- use_module(library(xpath)).

elem_in(URL, Elem,X,Y) :-
load_html(URL, DOM, []),
    xpath(DOM, //'*'(self), element(Elem,X,Y)).

url('https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?view=data&acc=GSM38051&id=4196&db=GeoDb_blob01').

splitter_row_cols(S,R,C):-
    split_string(R,S,'',C).

url_datatable(U,Table):-
    url(U),
    elem_in(U,pre,_,[_One,_Two,_Three,_Four,_Five,Data|_Rest]),
    split_string(Data,'\n','',Rows),
    maplist(splitter_row_cols('\t'),Rows,Table).

row_pair([X,Y,_,_],X-Y).
assoc_key_val(Assoc,Key,Value):- get_assoc(Key, Assoc,Value).
assoc_key_val(Assoc,Key,na):- \+get_assoc(Key,Assoc,_Value).

SWISH Query 9.1

?-length(List,10),append([_|List],_Rest,_All),url_datatable(_U,_All).
List=
|-------------------+-----------+-----+------------|
| "AFFX-BioB-5_at" | "757.7" | "P" | "0.00039" |
| "AFFX-BioB-M_at" | "933.7" | "P" | "0.000095" |
| "AFFX-BioB-3_at" | "525.6" | "P" | "0.000095" |
| "AFFX-BioC-5_at" | "1999.5" | "P" | "0.000044" |
| "AFFX-BioC-3_at" | "2339.5" | "P" | "0.000044" |
| "AFFX-BioDn-5_at" | "4321.3" | "P" | "0.000044" |
| "AFFX-BioDn-3_at" | "9229.4" | "P" | "0.00007" |
| "AFFX-CreX-5_at" | "21949.9" | "P" | "0.000044" |
| "AFFX-CreX-3_at" | "26022.8" | "P" | "0.000044" |
| "AFFX-DapX-5_at" | "1171.1" | "P" | "0.00006" |
|-------------------+-----------+-----+------------|

We can now integrate our web scraping and Reactome Pengine in a single query. Here

we find the expression values for the probes for Protein11042.


SWISH Query 9.2

?-reactome_server(_S),
length(_List,12050),append([_|_List],_Rest,_All),url_datatable(_U,_All), maplist(row_pair,_List,_Pairs),
list_to_assoc(_Pairs,_Assoc),pengine_rpc(_S,ridProtein_probelist('Protein11042',Probelist),[]),
maplist(atom_string,Probelist,_ProbelistStrings), maplist(assoc_key_val(_Assoc),_ProbelistStrings,Valuelist).

Probelist = [‘1562682_at’, ‘210426_x_at’, ‘210479_s_at’, ‘226682_at’, ‘235567_at’, ‘236266_at’, ‘239550_at’,


‘240163_at’, ‘240951_at’, ‘241760_x_at’],
Valuelist = [na, "66.3", "105.6", na, na, na, na, na, na, na]

10. R Integration
The R programming language is built into SWISH, which means that we can perform
statistical analysis using familiar tools. The syntax used is based on Real [5]. In the
example below we query the number of edges in a set of pathways and the number of
reactions in the same set. We then use R to calculate the correlation, fit a line and plot
these using R’s qplot function.
SWISH Program 10.1
:- <- library("ggplot2").
program(Program):-
Program=[(pathways_30(P):- findnsols(30,Rid,rid_type_iri(Rid,'Pathway',_),P)),
(ridPathway_reaction_(P,R):-ridPathway_links(P,L),member(E,L),E=..[_,R,_,_,_]),
(ridPathway_reaction_(P,R):-ridPathway_links(P,L),member(E,L),E=..[_,_,_,_,R]),
(ridPathway_reactions_(P,Rs):-
setof(R,ridPathway_reaction_(P,R),Rs)),
(pathway_numberedges_numberreactions(P,NE,NR):-
ridPathway_links(P,Ls),length(Ls,NE),
ridPathway_reactions_(P,Rs),length(Rs,NR)
),
(xs_ys(Xs,Ys):-pathways_30(P),maplist(pathway_numberedges_numberreactions,P,Xs,Ys))].

get_data(Xs,Ys):-
reactome_server(S),
program(Program),
pengine_rpc(S,xs_ys(Xs,Ys),[src_list(Program)]).

We use this query to see the correlation:


SWISH Query 10.1
?-get_data(NumberOfEdges,NumberOfReactions),
Correlation <- {|r(NumberOfEdges,NumberOfReactions)||cor(NumberOfEdges,NumberOfReactions)|}.

Correlation = [0.9967855883592985],
NumberOfEdges = [734, 174, 8, 57, 57, 7, 6, 2, 30, 2, 5, 9, 61, 60, 56, 9, 2, 2, 7, 57, 66, 66, 45, 7, 5, 20,
80, 4, 11, 14],
NumberOfReactions = [469, 134, 7, 52, 52, 4, 5, 3, 30, 3, 5, 11, 54, 52, 51, 7, 3, 2, 6, 52, 57, 57, 42, 4, 5
, 14, 54, 4, 11, 15]

And we use this query to plot the data with a fitted line:
SWISH Query 10.2
?-get_data(NumberOfEdges,NumberOfReactions),
<-qplot(NumberOfEdges,NumberOfReactions,geom=c("point","smooth"),
xlab="Number of Edges", ylab="Number of Reactions").

This graph shows different pathways and how the number of their edges and the number of their reactions are related. It
is plotted by calling the R function qplot with the parameter 'geom' set to 'point' and 'smooth'. This results in a loess fit
line with confidence limits, which are added by default.

11. Integrating with Javascript (and the D3 library)


SWISH allows the inclusion of Javascript code, which means we can use libraries such as
D3. Therefore, we can make interactive charts and web applications with Reactome data
inside a SWISH notebook. To see the Javascript code, double click the text in the notebook
and scroll down to the 'script' tags.
We illustrate this functionality by combining Reactome Pengine and web scraping to
build a simple application to show Hive Plots [86]. In Hive Plots the geometric placement
of nodes on a graph has meaning, based on user defined rules. As we are using Prolog, we
can build these rules easily. In the example below we visualise two features of reactions.
These features are mapped to the geometric placement of nodes in the graph. The first
feature is based on the network properties (the degree) of the reaction nodes. A reaction
is assigned to one of three categories:

1. Reactions that have a larger outdegree than indegree

2. Reactions that have a larger indegree than outdegree.

3. Reactions that have an equal in and outdegree.

This first feature is illustrated by placing nodes on one of three axes: Vertical axis for
category 1, 4 o’clock axis for category 2, and 8 o’clock axis for category 3.
For the second feature, we use the data of gene expression that we have scraped from
the GEO website and perform the following steps:

1. Query Reactome Pengine for the set of probes that code for the set of proteins that
are inputs for each reaction in the graph.

2. For each probe we retrieve the expression levels from the scraped GEO data.

3. Calculate the sum of the expression levels of the probes of each reaction.

4. Re-scale the summations to be between 0 and 1.

This second feature is illustrated by the distance of the reaction node from the centre
of the graph. In the notebook a drop-down menu is presented which, when selected, will
run Query 11.1. This results in a term which is parsed by Javascript to produce
Figure 7.3. The visualisation enables comparison of the graph properties and expression

levels of different pathways.


SWISH Query 11.1
:-use_module(library(pengines)).
ridPathway(X):-
reactome_server(S),
pengine_rpc(S,rid_type_iri(X,'Pathway',_),[]).
item_index_itemindex(I,In,I-In).
edgelist_hive(Edges,hive(Terms,NewEdges)):-
setof(Node,E^(member(E,Edges),edge_node(E,Node)),Nodes),
length(Nodes,L),
L2 is L-1,
numlist(0,L2,Indexes),
maplist(item_index_itemindex,Nodes,Indexes,Pairs),
findall(edge(X,Y),(member(ridReaction_ridLink_type_ridReaction(R1,_,_,R2),Edges),
member(R1-X,Pairs),
member(R2-Y,Pairs)),
NewEdges),
maplist(graph_node_group(NewEdges),Pairs,Groups),
maplist(pair_probelist,Pairs,Probelists),
url_datatable(_U,Table),
maplist(datatable_probelists_valuelists(Table),Probelists,_Valueslists,Sums),
max_list(Sums,Max),
maplist(divide(Max),Sums,Scaled),
maplist(pair_group_expressions_term,Pairs,Groups,Scaled,Terms).
divide(X,Y,Z):-
Z is Y/X.
ridPathway_hive(P,Hive):-
ridPathway_links(P,List),
edgelist_hive(List,Hive).
pair_group_expressions_term(X-Y,Z,Q,node(X,Y,Z,Q)).
edge_node(ridReaction_ridLink_type_ridReaction(_, _, _, R2),R2).
edge_node(ridReaction_ridLink_type_ridReaction(R1, _, _, _),R1).
graph_node_indegree(G,N,D):-
findall(X,member(edge(X,N),G),Xs),
length(Xs,D).
graph_node_outdegree(G,N,D):-
findall(X,member(edge(N,X),G),Xs),
length(Xs,D).
graph_node_group(G,_-N,Group):-
graph_node_indegree(G,N,I),
graph_node_outdegree(G,N,O),
in_out_group(I,O,Group).
in_out_group(In,In,0).
in_out_group(In,Out,1):-
In>Out.
in_out_group(In,Out,2):-
In <Out.
pair_probelist(RidReaction-_,ProbelistString):-
reactome_server(S),
pengine_rpc(S,ridReaction_inputProbelist(RidReaction,Probelist),[]),
maplist(atom_string,Probelist,ProbelistString).
datatable_probelists_valuelists(D,P,Values,Sum):-
findall(Value,(member(X,P),member([X,Value,_,_],D)),Values),
maplist(number_string,NumberValues,Values),
sum_list(NumberValues,Sum).

Figure 7.3: Hive plot of Pathway 21: This plot gives a geometric representation of a
pathway network. This plot is created 'live' by scraping the NCBI website and combining
this information with the data in Reactome. It depicts the gene expression values
aggregated into reactions for GSM38051 alongside some network properties of the reaction
graph. Each node represents a reaction. Nodes are placed on one of three axes. 12 O’clock
axis: reactions that have a larger outdegree than indegree. 4 O’clock axis: reactions that
have a larger indegree than outdegree. 8 O’clock axis: reactions that have an equal in and
outdegree. Distance from the centre indicates an aggregate value of the expressed probes
that code for proteins that take part in this reaction. This kind of plot can be used to
quickly see if network properties are associated with gene expression and how this might
change for different samples or different pathways.

7.5.2 Example Prolog script for Unix pipeline


As we have shown, instead of using a SWISH notebook, Reactome Pengine can be used
directly in a Prolog program on a local machine. A useful technique is to write a Prolog

script that can be used as part of an existing UNIX pipeline. An example script is given
in Code Block 2; this is then demonstrated in a UNIX pipeline in Code Block 3 (using
example file proteins.txt). This script takes a file with a Reactome protein identifier on
each line and outputs the Affymetrix probe identifiers for each protein (retrieved from
Reactome Pengine). In order to execute this script the user will need to make the script
executable using the UNIX command ‘chmod’.
Code Block 2
#!/usr/bin/env swipl

:- use_module(library(pengines)).
:- initialization main.

server(S):-S="https://apps.nms.kcl.ac.uk/reactome-pengine".

main :-
    catch(readloop, E, (print_message(error, E), fail)),
    halt.
main :-
    halt(1).

readloop:-
    read_line_to_string(user_input,String),
    string_test(String).

string_test(String):-
    dif(String,end_of_file),
    atom_string(Atom,String),
    ridProtein_probelist(Atom,Probelist),
    writeln(Probelist),
    readloop.

string_test(Term):-
    Term = end_of_file,
    fail.

ridProtein_probelist(R,P):-
    server(S),
    pengine_rpc(S,ridProtein_probelist(R,P),[]).

Example File: proteins.txt


Protein56
Protein17
Protein34
...

Code Block 3: Example UNIX pipeline


./codeblock5.pl < proteins.txt | grep 1565484_x_at

7.6 Comparison of Reactome Pengine to existing data access options
This section compares the use of Reactome Pengine to: 1) downloading the entire dataset
directly from Reactome and 2) downloading subsets of the data using the existing Reactome
Application Programming Interfaces (APIs). We compare 1) the amount of data exchanged
between the user’s computer and Reactome Pengine, and 2) the degree of flexibility of
querying the data.

7.6.1 Amount of data exchanged


We compare the amount of data exchanged between the user's machine and Reactome
Pengine with that of downloading the complete dataset from Reactome and working
with a local copy of the data. Reactome Pengine is built on the BioPAX RDF file
(Homo_sapiens.owl) available at http://www.reactome.org/pages/download-data/.
Therefore, when not using Reactome Pengine, the amount of data transferred when
downloading this data to the user's machine is just the size of this file, which is 136.6 MB. It
is also possible to download the dataset in different formats, such as CSV files, an SQL
database or a Neo4J graph database, each with their own storage requirements.
The amount of data transferred when using Reactome Pengine is dependent on the
query or program that the user submits. As Reactome Pengine is designed for non-trivial
but not intensive use of the Reactome data, typically short programs will be sent to the server.
We gave an example program in Code Block 1. This program is able to find paths of
reactions in the Reactome, which is a more complex query than the REST API supports,
but one that Reactome Pengine can perform easily without downloading the whole
Reactome dataset. The size of the data exchanged using this query is in the order
of kilobytes. Another key advantage is that a user does not have to download the latest
version of the data but that the services can be centrally updated. This again reduces the
amount of data exchanged.
Reactome Pengine is intended as a pioneering application that demonstrates how pengine

technology is useful for bioinformatics research. While downloading the Reactome dataset
in its entirety is currently feasible, biological datasets will only increase in size, such that
efficient and flexible data querying approaches, such as pengines, will be imperative for
future analysis involving the integration of omics data.

7.6.2 Flexibility of querying


Reactome Pengine versus existing Reactome APIs

We compare Reactome Pengine with 1) the REST API and 2) the SPARQL API. First,
the REST API is limited to a number of queries designed by the Reactome maintainers
(see documentation here:
http://www.reactome.org/pages/documentation/developer-guide/restful-service/#API).
Any query that can be performed using the REST API can also be performed using
Reactome Pengine. For example, the REST service can be used to find the sub-pathways of
‘Apoptosis’, and an equivalent query using Reactome Pengine is given in Code Block 4.
Code Block 4
:-use_module(library(pengines)).
reactome_server('https://apps.nms.kcl.ac.uk/reactome-pengine').

my_program(P):-
    P=[
      (
      pathwayName_subpathway(PName,SubName):-
          rid_name(RidPathway,PName),
          ridPathway_component(RidPathway,RidComponent),
          rid_type_iri(RidComponent,'Pathway',_),
          rid_name(RidComponent,SubName)
      )
    ].

pathwayName_subpathway(PName,SubName):-
    reactome_server(S),
    my_program(P),
    pengine_rpc(S,pathwayName_subpathway(PName,SubName),[src_list(P)]).

We invoke the query with:


?-pathwayName_subpathway("Apoptosis",Subpathway).
In contrast to the REST API we can build upon this query to create composite queries.
For example, we could add further constraints to the query to find sub-pathways with
particular properties.
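As a minimal sketch of such a composite query (this program is not part of the thesis or the notebook), we could reuse pathwayName_subpathway/2 from Code Block 4 and filter its answers with an additional, purely client-side condition. The helper name_contains/2 is hypothetical, and we assume pathway names are returned as SWI-Prolog strings, as in the earlier examples.

% Hypothetical composite query built on Code Block 4.
% name_contains/2 is an illustrative helper, not a Reactome Pengine predicate.
name_contains(Name, Word) :-
    string_lower(Name, Lower),
    sub_string(Lower, _, _, _, Word).

apoptosis_caspase_subpathway(SubName) :-
    pathwayName_subpathway("Apoptosis", SubName),
    name_contains(SubName, "caspase").

Invoking ?-apoptosis_caspase_subpathway(SubName). would then enumerate only those sub-pathways of 'Apoptosis' whose names mention caspases.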
Any SPARQL query can be performed using Reactome Pengine. For example the
SPARQL documentation from Reactome gives an example query that finds pathways that

have entities in the cellular membrane:


https://www.ebi.ac.uk/rdf/documentation/reactome/.
The Prolog program in Code Block 5 uses Reactome Pengine to perform this query.
Code Block 5
:-use_module(library(pengines)).
reactome_server('https://apps.nms.kcl.ac.uk/reactome-pengine').

program(P):-
    P=[
      (
      pathway_acrossmembrane(Pathwayname):-
          Location = "plasma membrane",
          rid_type_iri(RidPathway,'Pathway',_Iri),
          rid_name(RidPathway,Pathwayname),
          ridPathway_component(RidPathway,RidReaction),
          ridReaction_input(RidReaction,RidEntity),
          rid_location(RidEntity,Location)
      )
    ].

pathway_acrossmembrane(PathwayName):-
    reactome_server(S),
    program(P),
    pengine_rpc(S,pathway_acrossmembrane(PathwayName),[src_list(P)]).

The Reactome SPARQL API is more flexible than the REST API for two key reasons.
First, SPARQL can be used to specify SQL like queries over the data, rather than a
predefined subset specified by an API. Second, SPARQL can be used to interrogate several
datasets in a single query (called a federated query). For example, a bioinformatician could
query both Reactome and Uniprot to integrate data from these disparate sources.
While SPARQL is more flexible than a REST API, it is less flexible than Reactome
Pengine because it is not a full programming language. This means that developers using
SPARQL will typically have a two-language setup, for example, SPARQL might
be embedded in Java. This can be problematic due to the paradigm mismatch, where
SPARQL is relational and Java is an imperative object oriented language. This is not the
case for Prolog which has a relational paradigm itself, and is a full programming language.
Therefore, using Reactome Pengine from within a Prolog program means that the data
can be queried and manipulated within a single program.
The Reactome Pengine Prolog API allows for simpler and more flexible federated queries
than using SPARQL. Complex federated queries are simpler to compose in the Reactome
Pengine due to its ability to build composite queries, as discussed above. Furthermore,
because we can make use of standard Prolog libraries within the query sent to the Reactome

Pengine, we can also include queries to other data services, including REST, SPARQL,
HTML and other pengine services. An example of this is given in the accompanying
SWISH notebook (example 9).

Reactome Pengine querying language

It is also possible to embed the web-logic query in another language that supports HTTP
requests, for example a shell script, JavaScript or Python. Code Block 6 gives an
example of a Node.js program that uses the pengines npm module:
https://www.npmjs.com/package/pengines
This feature is useful when introducing Reactome Pengine to existing code pipelines that
might not be written in Prolog. However, building Prolog pipelines and using Prolog as the
'glue' language is very powerful, as we have illustrated throughout this work. Notably, the
ability to 'name and reuse' queries (example 5 in the accompanying notebook) and the fact
that Prolog is a homoiconic language (where data and code use the same syntax) mean
that Prolog pipelines are very effective for bioinformatic work, especially for queries across
multiple data end points (sometimes known as federated queries).
Code Block 6
/* We first require the pengines library, then we define our Prolog query and define
   functions for success and failure which log to the console. */

pengines = require('pengines');

peng = pengines({
    server: "https://apps.nms.kcl.ac.uk/reactome-pengine/pengine",
    sourceText: 'small_pathway(P):- ridPathway_links(P,L), length(L,S), S<35.',
    ask: "small_pathway(X)",
    chunk: 100,
}).on('success', handleSuccess).on('error', handleError);

function handleSuccess(result) {
    console.log(result)
}
function handleError(result) {
    console.error(result)
}

Reactome Pengine's data access predicates

In addition to providing the data available from Reactome, Reactome Pengine also includes
over 30 public predicate definitions which offer intuitive and fast access to elements
of the data. Full details of these predicates are available in the online documentation:
https://apps.nms.kcl.ac.uk/reactome-pengine/documentation. Furthermore,
as Reactome Pengine is monitored, commonly used queries can be added to Reactome
Pengine as new predicate definitions.

Reactome Pengine's data output format

It is possible to change the format of replies from Reactome Pengine from Prolog
terms to either CSV or JSON. For an example of a shell script that queries
a pengine for a CSV file see: https://github.com/SWI-Prolog/swish/blob/master/
client/swish-ask.sh

7.7 Summary
Reactome Pengine is a web service that provides a simple way to logically query the human
reactome on the web. It provides both raw RDF data access and a set of built-in predicates
to facilitate this. It can be accessed by both local Prolog programs and web notebooks
such as SWISH. Programs hosted on SWISH notebooks allow for easy sharing and the
ability to render query solutions graphically. In contrast, programs running in SWI-Prolog
client applications have access to the full power of Prolog including system calls and they
also maintain privacy of local computations. Either of these options allow researchers to
perform analysis of data that requires querying the human reactome and integrating with
other data sources. The Pengine technology allows the user to bring the small program to
the large data. Increasingly more (and larger) biological datasets are becoming available
online and while we have presented a Pengine web service for Reactome, it is possible to
build these for any other online biological dataset. This is potentially very powerful, as
researchers will not have to download and manage these datasets but can build pipelines
that consist of a set of programs sent to these pengine web services. This could drive the
uptake of a unified knowledge representation based on first order logic, as well as substantial
resource savings.
Chapter 8

Using ILP to identify pathway activation patterns in systems biology

The work presented in this chapter is based on the publication: Neaves, Samuel R., Louise
AC Millard, and Sophia Tsoka. “Using ILP to Identify Pathway Activation Patterns in
Systems Biology.” International Conference on Inductive Logic Programming. Springer,
Cham, 2015.

8.1 Description of content


In this chapter we show a logical aggregation method that, combined with propositionaliza-
tion methods, can construct novel structured biological features from gene expression data.
We do this to gain understanding of pathway mechanisms, for instance, those associated
with a particular disease. We illustrate this method on the task of distinguishing between
two types of lung cancer; Squamous Cell Carcinoma (SCC) and Adenocarcinoma (AC). We
identify pathway activation patterns in pathways previously implicated in the development
of cancers. Our method identified a model with comparable predictive performance to the
winning algorithm of a recent challenge, while providing biologically relevant explanations
that may be useful to a biologist.

8.2 Introduction and background


In the field of Systems Biology researchers are often interested in identifying perturbations
within a biological system that are different across experimental conditions. Biological


systems consist of complex relationships between a number of different types of entities, of


which much is already known [30]. An Inductive Logic Programming (ILP) approach may
therefore be effective for this task, as it can represent the relationships of such a system as
background knowledge, and use this knowledge to learn potential reasons for differences in
a particular condition. We demonstrate a propositionalization-based ILP approach, and
apply this to the example of identifying differences in perturbations between two types of
lung cancer; Squamous Cell Carcinoma (SCC) and Adenocarcinoma (AC).
A recent large competition run by the SBV Improver organisation, called the Diagnostic
Signature Challenge, tasked competitors with finding a highly predictive model distinguish-
ing between these two lung cancer types [127]. The challenge was motivated by the many
studies that have also worked on similar tasks, with the aim to find a model with the
best predictive performance. The winning method from this competition is a pipeline that
exemplifies the classification approaches used for this task [140].
The typical pipeline has three distinct stages. The first stage uses technology such as
microarrays or RNAseq, to measure gene expressions across the genome in a number of
samples from each of the experimental conditions. The second stage identifies a subset of
genes whose expression values differ across conditions. This stage is commonly achieved
by performing differential expression analysis and ranking genes by a statistic such as fold
change values. A statistical test is then used to identify the set of genes to take forward to
stage 3. Alternatively for stage 2, researchers may train a model using machine learning to
classify samples into experimental conditions, often using an attribute-value representation
where the features are a vector of gene expression values (as performed by the winning SBV
Improver model). This approach has the advantage that the constructed model identifies
genes with important dependencies but which on their own would not have been selected.
Researchers use the ‘top’ features from the model to identify the set of genes to take on to
stage 3.
In stage 3 researchers look for connections between these genes by, for example, per-
forming a Gene Set Enrichment Analysis (GSEA). In GSEA, the selected genes are split
into subsets that match predefined sets, each of which groups genes satisfying a known
biological relation. For example, a gene set may have a related function, exist in the same
location in the cell, or take part in the same pathway.
To bring background knowledge of relations into the model building process, past ILP
research integrated stage 2 (finding differentially expressed genes) and stage 3 (GSEA),
into a single step [139]. This was achieved using Relational Subgroup Discovery, which has

the advantage of being able to construct novel sets by sharing variables across predicates
that define the sets. For example, a set could be defined as the genes annotated with two
Gene Ontology terms.
Other ways researchers have tried to integrate the use of known relations include
adapting the classification approach of stage 2. New features are built by aggregating
across a predefined set of genes. For example, an aggregation may calculate the average
expression value for a pathway [69].
A major limitation of current classification approaches is that the models are con-
structed from either genes or crude aggregates of sets of genes, and so ignore the detailed
relations between entities in a pathway. In order to incorporate more complex relations an
appropriate network representation is needed, such that biological relations are adequately
represented. For example, a simple directed network of genes and proteins does not repre-
sent all the complexities of biochemical pathways, such as the dependencies of biochemical
reactions. To do this bipartite graphs or hyper-graphs can be used [157].
One way to incorporate more complex relations is by creating topologically defined
sets, where a property of the network is used to group nodes into related sets. One method
to generate these sets is Community Detection. However, this approach can create crude
clusters of genes, that do not account for important biological concepts. Biologists may be
interested in complex biological interactions rather than just sets of genes.
Network motif and frequent sub-graph mining are methods that can look for structured
patterns in biological networks [79]. However, in these approaches the patterns are often
described in a language which is not as expressive as first order logic. This means they are
unable to find patterns with uninstantiated variables, or with relational concepts such as
paths or loops.
To our knowledge only one previous work used ILP for this task [70]. Here the authors
propose identifying features consisting of the longest possible chain of nodes in which
non-zero node activation implies a certain (non-zero) activation in its successors, which
they call a Fully Coupled Flux. Their work is preliminary, with limited evaluation of the
performance of this method.
The aim of this chapter is to illustrate how we can identify pathway activation patterns,
that differ between biological samples of different classes. A pathway activation pattern is
a pattern of active reactions on a pathway. Our novel approach uses known relations be-
tween entities in a pathway, and important biological concepts as background knowledge.
These patterns may give a biologist different information than models built from simple

gene features. Therefore, we seek to build models that are of comparative predictive per-
formance to those of previous works, while also providing potentially useful explanations.
In this work we take a propositionalization-based ILP approach, where we represent the
biological systems as a Prolog knowledge base (composed of first order rules and facts), and
then reduce this to an attribute-value representation (a set of propositions), before using
standard machine learning algorithms on this data. We therefore begin with an overview
of propositionalization, and a discussion of why it is appropriate for this task.

8.3 Overview of propositionalization


Propositionalization is a method that transforms data represented in first order logic to
a set of propositions, i.e. a single table representation where each example is represented
by a fixed length vector. This is called a reduction. It is possible to make a proper
reduction of the representation using Gödel numbering or well ordering arguments [38].
However, these will have limited practical value as useful structure can be lost or encoded
inefficiently, leading to poor inductive ability. Heuristic-based propositionalization methods
allow specification of a language bias and a heuristic, in order to search for a subset of
potential features which are useful for the learning task.
We have four reasons for adopting a propositionalization-based approach, rather than
directly applying an ILP learner. First, separating the feature construction from the model
construction means that we have an interesting output in the middle of the process, which
we would lose if they were coupled together. For example, the features constructed can
represent useful domain knowledge in their own right, as they can describe subgroups of
the data which have a different class distribution, or frequent item sets or queries on the
data.
Second, propositionalization can be seen as a limited form of predicate invention, where
the predicate refers to a property of an individual, or relationships amongst properties of
the individual. This means that, when building a model, the features may correspond
to complex relationships between the original properties of an individual. In our case
they correspond to potentially interesting pathway activation patterns. Hence, we can
understand predictions in terms of these higher order concepts, which may give important
insights to a biologist.
Third, propositionalization can impose an individual-centred learning approach [38].
This limits predicates to only refer to relationships between properties of an individual,

as under this approach we do not allow a predicate which relates individuals. This strong
inductive bias is appropriate for our case, as we do not wish to consider relationships
between the individuals. The fourth reason is that we can perform many other learning
tasks on the transformed data, with the vast array of algorithms available for attribute-
values datasets.
In this work we use query-based propositionalization methods, and now describe some
key algorithms. A review of some publicly available propositionalization methods was re-
cently performed by Lavrac et al. [91]. These include Linus, RSD, TreeLiker (HiFi and
RelF algorithms), RELAGGS, Stochastic Propositionalization, and Wordification, along-
side the more general ILP toolkit, Aleph [137]. Other methods that were not mentioned
in that review include Warmr [41], Cardinalisation [41], ACORA [123] and CILP++ [49].
There has also been work on creating propositionalization methods especially for linked
open data, both in an automatic way [129], and in a way where manual SPARQL queries
are made [128]. The methods in these papers are not appropriate for our work because
our data is not entirely made up of linked open data, and we wish to include background
rules encoding additional biological knowledge. It is also worth noting that certain kernel
methods can be thought of as propositionalization [38].
Wordification treats relational data as documents and constructs word-like features.
These are not appropriate for our task, as they do not correspond to the kind of
patterns we are looking for, i.e. features with uninstantiated variables. Stochastic proposi-
tionalization performs a randomised evolutionary search for features. This approach may
be interesting to consider for future work. CILP++ is a method for fast bottom-clause
construction, defined as the most specific clause that covers each example. This method is
primarily designed to facilitate the learning of neural networks, and has been reported to
perform no better than RSD when used with a rule-based model [49].
ACORA, Cardinalisation and RELAGGS are database inspired methods of proposition-
alization. They are primarily designed to perform aggregation across a secondary table,
with respect to a primary table. ACORA is designed to create aggregate operators for
categorical data, whereas RELAGGS performs standard aggregation functions (summa-
tion, maximum, average etc.) suitable for numeric data. Cardinalisation is designed to
use complex aggregates, where conditions are added to an aggregation. In our work we
manually design an aggregation method, described in Section 8.4.2. These aggregation
systems are not appropriate for graph-based datasets, because representing the graph as
two tables (denoting edges and nodes) and aggregating on paths through the graph would

[Pipeline diagram: Extracting the reaction graph → Inferring instantiated reaction graphs for each instance → Searching for pathway activation patterns (training data) → Evaluation (hold-out data)]
Figure 8.1: Method overview

require many self joins on the edge table. Relational databases are not optimised for this
task, such that the resulting queries would be inelegant and inefficient.
The propositionalization methods we use in this work are TreeLiker and Warmr. Tree-
Liker is a tool that provides a number of algorithms for propositionalization including
RelF [87]. RelF searches for relevant features in a block-wise manner, and this means that
irrelevant and irreducible features can be discarded during the search. The algorithms in
TreeLiker are limited to finding tree-like features where there are no cycles. RelF has been
shown to scale much better than previous systems such as RSD, and can learn features with
tens of literals. This is important for specifying non-trivial pathway activation patterns.
Warmr is a first order equivalent of frequent item-set mining, where a level wise search
of frequent queries in the knowledge base is performed. Warmr is used as a proposition-
alization tool by searching for frequent queries in each class. In Warmr it is possible to
specify the language bias using conjunctions of literals, rather than just individual literals,
and to put constraints on which literals are added. This allows strong control of the set
of possible hypotheses that can be considered. Finally, unlike TreeLiker, Warmr can use
background knowledge, defined as facts and rules.

8.4 Methods
An overview of the process we take is shown in Figure 8.1. First, we extract the reaction
graph for each pathway, from Reactome. Second, we infer the instantiated reaction graphs
for each instance in the dataset. Third, we identify pathway activation patterns using
propositionalization, and then build classification models to predict the lung cancer types.
Lastly, we evaluate our models using a hold-out dataset. We begin with a description of
the datasets we use in this work.

8.4.1 Raw Data


Our approach uses two sources of data: 1) a dataset from Gene Expression Omnibus
(GEO) [46] as the set of examples (gene expression values of a set of individuals), and 2)
information about biological systems from Reactome.

GEO data – main data

We use a two class lung cancer dataset obtained from GEO, which was previously used in
the SBV Improver challenge [127]. This dataset is made up from the following datasets:
GSE2109, GSE10245, GSE18842 and GSE29013 (n=174), used as training data, and
GSE43580 (n=150), used as hold-out data. We used the examples where the participants
were labelled as having either SCC or AC lung cancer. This is the same data organisation
as that used in the SBV Improver challenge, to allow us to compare our results with the top
performing method from this challenge.
This data contains gene expression measurements from across the genome measured by
Affymetrix chips. Each example is a vector of 54,614 real numbers. Each value denotes
the amount of expression of mRNA of a gene. There is a uniform class distribution of
examples, in both the training and holdout dataset.

Reactome – background knowledge

We use the Reactome database to provide the background knowledge, describing biological
pathways in humans. Reactome [30] is a collection of manually curated peer reviewed
pathways. Reactome is made available as an RDF file, which allows for simple parsing
using SWI-Prologs semantic web libraries, and contains 1,351,811 triples. Reactome uses
the bipartite network representation of entities and reactions. Entity types include nucleic
acids, proteins, protein complexes, protein sets and small molecules. Protein complexes and
protein sets can themselves comprise of other complexes or sets. In addition, a reaction may
be controlled (activated or inhibited) by particular entities. A reaction is a chemical event,
where input entities (known as substrates), facilitated by enzymes, form other entities
(known as products).
Figure 8.2a shows a simple illustration of a Reactome pathway. P nodes denote proteins
or protein complexes, R nodes denote reactions, and C nodes denote catalysts. A black
arrow illustrates that a protein is an input or output of a reaction. A green arrow illustrates
that an entity is an activating control for a reaction. A red arrow illustrates that an entity


Figure 8.2: Reaction graph illustrations. There are three types of relationships between
reactions: follows (black solid lines), activation (green dashed), and inhibition (red dash-
dotted). Figure a) is the initial extracted graph from Reactome which is bipartite with
Reactions and entities as nodes and figure b) is the reaction-centric graph where Reactions
are directly linked. Both a) and b) depict the same pathway.

is an inhibitory control for a reaction. Reaction R1 has 3 protein substrates and 3 protein
products, and is controlled by catalyst C. Reactions R3 and R4 both have one protein
substrate and one protein product. R3 is inhibited by P2, such that if P2 is present then
reaction R3 will not occur. R4 is activated by P3, such that P3 is required for reaction
R4 to occur.

8.4.2 Data Processing


Extracting reaction graphs

We reduce the Reactome bipartite graph to a Boolean network of reactions. This simplifies
the graphs while still adequately encoding the relationships between entities. Previous work
has shown that Boolean networks are a useful representation of biological systems [154],
and unlike gene and protein Boolean networks ours encodes the dependencies between
reactions.
The Boolean networks we create are reaction-centric graphs, where nodes are reactions
and directed edges are labelled as 'activation', 'inhibition' or 'follows', corresponding
to how reactions are connected. For example, Figure 8.2b shows the reaction-centric graph
corresponding to the Reactome graph shown in Figure 8.2a. Reaction R2 follows R1, because
in the Reactome graph P1 is an output of R1 and an input to R2. Reaction R1 inhibits
R3, because P2 is an output of R1, and it is also an inhibitory control of R3. Reaction R1
activates reaction R4, because P3 is an output of R1, and an activating control of R4.
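As a minimal sketch (not the thesis code), the reaction-centric edges could be derived from Prolog facts describing the bipartite graph along the following lines, where reaction_output/2, reaction_input/2, reaction_activator/2 and reaction_inhibitor/2 are assumed, illustrative fact names:

% Illustrative derivation of reaction-centric edges from assumed bipartite facts.
link(R1, R2, follows) :-
    reaction_output(R1, Entity),     % an output entity of R1 ...
    reaction_input(R2, Entity).      % ... is an input of R2
link(R1, R2, activation) :-
    reaction_output(R1, Entity),
    reaction_activator(R2, Entity).  % ... is an activating control of R2
link(R1, R2, inhibition) :-
    reaction_output(R1, Entity),
    reaction_inhibitor(R2, Entity).  % ... is an inhibitory control of R2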

[Diagram: a logical circuit for one reaction, with sub-circuits labelled A, B and C. Binary probe values feed OR gates into proteins; protein complexes are AND gates over their parts; protein sets are OR gates over their members; the reaction inputs and activation are combined with an AND gate.]

Figure 8.3: Illustration of logical aggregation. Known biological mechanisms can be repre-
sented as OR or AND gates. The triangular nodes are binary probe values, created using
Barcode.

Inferring instantiated reaction graphs

Boolean networks [154] are a common abstraction in biological research, but these are
normally applied at the gene or protein level not at the reaction level. In order to use a
Boolean network abstraction on a reaction network, we apply a logical aggregation method
that aggregates measured probe values (from the GEO dataset) into reactions. This produces
a binary value for each reaction, giving instantiated versions of the reaction-centric graph
created in the previous step.
Before we can use this logical aggregation we first transform the original probe values
into binary values, an estimated value denoting whether a gene is expressed or not. We
do this using Barcode [99], a tool for converting the continuous probe values to binary
variables, by applying previously learnt thresholds to microarray data. It is important to
note that Barcode makes it possible to compare gene expressions, both within a sample,
and between samples that are potentially measured by different arrays.
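A minimal sketch of this thresholding step is given below; it is illustrative only, with probe_value/2 and probe_threshold/2 as assumed fact names standing in for the measured values and the Barcode-derived cut-offs:

% A probe is called 'on' when its measured value exceeds its learnt threshold.
probe_on(Probe) :-
    probe_value(Probe, Value),
    probe_threshold(Probe, Threshold),
    Value > Threshold.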
The logical aggregation process is illustrated in Figure 8.3. This process takes the
binary probe values as input, and uses the structure provided by the Reactome graph, and
key biological concepts, to build reaction level features. As we have already described,
each reaction has a set of inputs that are required for a particular reaction. We interpret

each reaction input as a logical circuit with the following logical rules. The relationship
between probes and proteins is treated as an OR gate (matched by Uniprot IDs), because
multiple probes can map to the same protein. We are assuming that the measurement from
a single probe indicates with high probability that the protein product is present or not.
The formation of a protein complex requires all of its constituent proteins and therefore is
treated as an AND gate. A protein set is a set of molecules that are functionally equivalent
such that only one is needed for a given reaction, and so this is treated as an OR gate.
Inputs to a reaction are treated as an AND gate. A reaction is on if the inputs are on, any
activating agents are on, and any inhibitory agents are off. We note that both protein sets
and protein complexes can themselves comprise arbitrarily nested complexes or sets.
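These rules can be written directly as Prolog clauses. The following is a minimal sketch under assumed predicate names (probe_on/1 as sketched above, plus protein_probe/2, complex_part/2, set_member/2, reaction_input/2, reaction_activator/2 and reaction_inhibitor/2); it is illustrative rather than the implementation used in the thesis:

% OR over the probes that map to a protein.
entity_on(protein(P)) :-
    protein_probe(P, Probe),
    probe_on(Probe).
% AND over the (possibly nested) parts of a complex.
entity_on(complex(C)) :-
    forall(complex_part(C, Part), entity_on(Part)).
% OR over the members of a protein set.
entity_on(set(S)) :-
    set_member(S, Member),
    entity_on(Member).

% A reaction is on if all inputs and activators are on and no inhibitor is on.
reaction_on(R) :-
    forall(reaction_input(R, In), entity_on(In)),
    forall(reaction_activator(R, A), entity_on(A)),
    \+ ( reaction_inhibitor(R, I), entity_on(I) ).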
Figure 8.3 illustrates the logical aggregation rules of a single reaction. This reaction has
two inputs and one activating control. The two inputs are a protein complex and a protein
set, and the values of these are calculated using their own aggregation processes, labelled
A and B. The aggregation in process A, starts with the binary probe values, and first infers
the values of three proteins. The protein complex is then assigned a value of on because all
proteins required for this complex are present (are all on themselves). The aggregation in
process B starts by inferring the values of two proteins from the probe values. One protein
is on and the other is off. The protein set is assigned the value on because only one protein
in this set is required for this protein set to be on. There also exists an activating control
for the reaction, a protein whose value is determined by a process labelled C. This protein
is assigned the value on, because both probe values are on, when at least one is required.
As all inputs of the reaction are on and the activating control is also on, the reaction is
assigned the value on.

8.4.3 Searching for pathway activation patterns


In order to identify pathway activation patterns we first find pathways that are most likely
to contain these patterns, using the training data. We then use three approaches to identify
pathway activation patterns within the top pathways, and evaluate the identified activation
patterns using the hold-out data.

Identifying predictive pathways

To identify pathways we first run TreeLiker on each pathway. This generates a set of
attribute-value features for each instantiated pathway. We use TreeLiker with the RelF

algorithm and the following language bias:

set(template,[reaction(-R1,#onoroff),link(+R1,-R2,!T1),
reaction(+R2,#onoroff),link(+R2,-R3,!T2),reaction(+R3,#onoroff),
link(!RA,-R4,!T3),link(+R4,!RB,!T4),link(+R1,-R2,#T1),
link(+R2,-R3,#T2),link(!RA,-R4,#T3),link(+R4,!RB,#T4)])

This language bias contains two types of literals; reaction/2 and link/3. The second
argument of the reaction literal is always constrained to be a constant depicting if a reaction
is on or off. The link/3 literal depicts the relationship between two reactions, where the
third argument of the link literal is either a variable or a constant describing the type of
relationship - either follows, activates or inhibits. For example, an identified pattern may
contain the literal link(r1,r2,follows), specifying that an output entity of reaction r1
is an input to reaction r2.
We then test the performance of the features of each pathway, using 10 fold cross
validation. We use the J48 decision tree algorithm (from Weka) because this builds a
model that gives explanations for the predictions. We calculate the average accuracy across
folds, for each pathway, and rank the pathways from highest to lowest accuracy. We then
use the top ranked pathways as input to three different methods, to identify predictive
pathway activation patterns.

Method 1

This approach simply takes a pathway of interest, generates a single model using the
J48 algorithm using the training data, and then evaluates this performance on the hold-
out data. The decision tree can then be viewed to determine which activation patterns
are predictive of lung cancer type. We demonstrate this approach with the top-ranked
pathway.

Method 2: Warmr approach

We illustrate using Warmr to generate pathway activation patterns, using one of our iden-
tified ‘top’ pathways.
We use Warmr with two particular concepts in the background knowledge. First, we use
a predicate longestlen/3, that calculates the longest length of on reactions in an example,
for the pathway on which Warmr is being run. The arguments are: 1) the beginning
reaction of a path, 2) the end reaction of the path with longest length, and 3) the length of

this path. This longest-length concept corresponds to the fully coupled flux of a previous
work [70].
Second, we use the predicates inhibloop/1 and actloop/1, which depict inhibition and
activation loops, where a path of on reactions forms a loop and one of the edges is an
inhibition or activation edge, respectively. Inhibition and activation loops are common
biological regulatory mechanisms [148].
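As an illustration only (the thesis predicates themselves are not reproduced here), an activation-loop test over an instantiated reaction graph could be sketched as follows, using facts r(Sample, Reaction, OnOff) and link(R1, R2, Type) in the style of the Warmr language bias; the thesis predicate actloop/1 leaves the sample implicit, whereas this sketch takes it as an explicit argument, and inhibloop would be analogous with an inhibition edge:

% A reaction lies on an activation loop if an activation edge leaving it
% can be closed back to it through a chain of on reactions.
actloop(Sample, Reaction) :-
    r(Sample, Reaction, 1),
    link(Reaction, Next, activation),
    on_path(Sample, Next, Reaction, [Next]).

% on_path(+Sample, +From, +To, +Visited): a chain of on reactions from From to To.
on_path(Sample, To, To, _Visited) :-
    r(Sample, To, 1).
on_path(Sample, From, To, Visited) :-
    From \== To,
    r(Sample, From, 1),
    link(From, Step, _Type),
    \+ memberchk(Step, Visited),
    on_path(Sample, Step, To, [Step|Visited]).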
We then use the OneR (Weka) algorithm to identify the single best pathway activation
pattern found by Warmr, and then evaluate this pattern on the hold-out data.

Method 3: combined approach

Our combined method takes advantage of the beneficial properties of the two algorithms,
by using Warmr to extend the patterns identified by TreeLiker. This effectively switches
the search strategy from the block-wise approach of TreeLiker, to the level-wise approach of
Warmr. The reason for doing this is to identify any relations between reactions that exist
between entities within the TreeLiker feature, that could not be identified in TreeLiker
due to its restriction to tree structures. This results in long cyclical features that neither
TreeLiker nor Warmr would be able to find on their own.
While we could use the features generated by method 1, and extend these, in this section
we also demonstrate the possibility of using our approach for generating descriptions of
subgroups. We identify a subgroup with the CN2SD algorithm[90], using the training
data. The activation patterns defining this subgroup are then extended using Warmr. The
following code is an example language bias we use in Warmr:

rmode(1: (r(+S,-A,1),link(A,-\B,follows),link(B,-\C,_),r(S,C,0),
r(S,B,0), link(B,-\D,_),r(S,D,1),link(A,-\E,_),r(S,E,1))).
rmode(1: link(+A,+B,#)).

The first rmode contains the feature that was previously identified using TreeLiker. The
second rmode uses the literal link, to allow Warmr to add new links to the TreeLiker
feature. After extending the activation pattern using Warmr, we then evaluate this on the
hold-out data.

Ranking  Pathway                    Accuracy
1        Hexose uptake              78.74%
2        Hyaluronan biosynthesis    77.59%
3        Mitotic G1-G1/S phases     76.74%
4        Creatine metabolism        78.64%
5        Cell Cycle                 77.59%

Table 8.1: Top 5 pathways identified. Mean accuracy across 10 folds of cross-validation on
the training dataset.

8.5 Results
To reiterate, the aim of this work is to build explanatory models that help biologists
understand the system perturbations associated with conditions, in this case lung cancer.
Therefore although we give a quantitative classification performance of our models in order
to allow performance comparisons, we additionally emphasise the form that the classifi-
cation models take and how these are of interest to biologists. Table 8.1 shows the top 5
pathways found using our TreeLiker/J48 method.

8.5.1 Quantitative evaluation and comparison with SBV Improver model
In order to provide a quantitative comparison of our models, we compare to the winning
diagnostic classifier as identified by the SBV Improver challenge. We use the area under
the ROC curve (AUC) metric, to evaluate the ranking performance of the models. We
generate confidence intervals for the AUC using a stratified bootstrapped approach (with
2000 bootstrap samples). We use permutation testing to compare the performance of our
models with a random model. We generate 2000 random rankings, with the same class
distribution as our data, and calculate the AUC for each of these values. We then find the
proportion of random rankings with an AUC greater than that of our models. We refer to
this as the permutation P value.
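Purely as an illustration of this procedure (not the thesis code), the permutation P value could be computed along the following lines in Prolog, where a ranking is represented as a list of class labels (pos/neg) ordered from highest to lowest predicted score; permutation_p_value/4, ranking_auc/2 and count_correct_pairs/5 are hypothetical helper names:

% Fraction of random label permutations whose AUC exceeds the model's AUC.
permutation_p_value(Labels, ModelAUC, N, P) :-
    findall(A, ( between(1, N, _),
                 random_permutation(Labels, Ranked),
                 ranking_auc(Ranked, A) ), AUCs),
    aggregate_all(count, (member(A, AUCs), A > ModelAUC), G),
    P is G / N.

% AUC of a ranking: the proportion of (pos,neg) pairs ranked in the correct order.
ranking_auc(Ranked, AUC) :-
    reverse(Ranked, LowestFirst),
    count_correct_pairs(LowestFirst, 0, 0, Correct, NegTotal),
    length(Ranked, Len),
    PosTotal is Len - NegTotal,
    AUC is Correct / (PosTotal * NegTotal).

count_correct_pairs([], Correct, Negs, Correct, Negs).
count_correct_pairs([neg|T], C0, N0, C, N) :-
    N1 is N0 + 1,
    count_correct_pairs(T, C0, N1, C, N).
count_correct_pairs([pos|T], C0, N0, C, N) :-
    C1 is C0 + N0,                  % every neg already seen ranks below this pos
    count_correct_pairs(T, C1, N0, C, N).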
We select the top pathway identified in the training data, hexose uptake, and after
retraining J48 on the training data, evaluation on the hold out data gives an AUC of
0.82 (95% confidence interval: 0.764-0.890). This model is better than a random model
(p < 0.001 using permutation testing). The SBV method, evaluated on the hold-out data,
had an AUC of 0.913 (95% CI: 0.842-0.9466). The confidence intervals overlap such that
we cannot find a significant difference in performance between our model and the SBV

Figure 8.4: Results. (a) Example of the hexose uptake pathway for a particular individual. Green squares: on reactions; red octagons: off reactions. The identified feature of three on reactions is shown in a pink, dashed box. (b) ROC curves comparing performance. SBV Improver: blue, dashed; TreeLiker/J48: green, solid; Warmr/J48: red, dash-dotted.

The ROC curves of the SBV model and the hexose uptake model are shown in Figure 8.4. Our hexose uptake model is a decision tree with a single feature (i.e. a decision stump):

reaction(A,1), link(A,B,_), link(B,C,_), reaction(C,1), reaction(B,1).

This corresponds to a chain of three on reactions, where the model predicts SCC if this feature exists and AC otherwise. This Pathway Activation Pattern is present in 67 of the 76 individuals with SCC, and 17 of the 74 individuals with AC.
Figure 8.4a shows an example instantiation of the hexose uptake pathway for a particular individual. For this individual, the three variables A, B and C in the feature given above are instantiated to the following reactions:

A. GLUT1 + ATP <=> GLUT1:ATP.
B. GLUT1 + ATP <=> GLUT1 + ATP.
C. alpha-D-Glucose + ATP => alpha-D-glucose 6-Phosphate + ADP.
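To make the logical reading of this feature concrete, the sketch below shows how it could be checked against a per-individual knowledge base; the reaction identifiers and link facts are invented for illustration and are not the Reactome identifiers used in this work.

% Invented facts for one individual: reaction(Id, OnOff) and link(From, To, Type).
reaction(r_a, 1).
reaction(r_b, 1).
reaction(r_c, 1).
link(r_a, r_b, follows).
link(r_b, r_c, follows).

% The learnt decision stump as a Prolog rule: a chain of three 'on' reactions.
chain_of_three_on(A, B, C) :-
    reaction(A, 1),
    link(A, B, _),
    link(B, C, _),
    reaction(C, 1),
    reaction(B, 1).

% ?- chain_of_three_on(A, B, C).
% A = r_a, B = r_b, C = r_c.
% The individual is predicted SCC if the query succeeds, and AC otherwise.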

Figure 8.5: The pattern found by Warmr instantiated for individual GSM1065725. There is a self-activating loop, and this is highlighted by the grey box.

8.5.2 Results for Warmr method


We illustrate the value of our Warmr-only method using the cell cycle pathway (ranked fifth in Table 8.1). The more complex background predicates that we have defined for Warmr are only relevant when the pathway itself contains particular relationships. For example, the activation-loop predicate can only be beneficial when a pathway contains an activation edge that may be identified as the activation within an activation loop. The cell cycle is the largest of the top 5 pathways. It is also the first of these pathways that contains all three kinds of edge: follows, activation and inhibition. The OneR classifier generated with the Warmr features has an AUC of 0.699 (95% CI: 0.625-0.773) on the hold-out data.
While this model performs worse than the SBV Improver model and the hexose uptake
pathway TreeLiker/Combined model (in terms of AUC), it still has predictive value (p <
0.001 compared to a random model, using permutation testing). The identified rule is
complex and potentially interesting to a biologist:

actloop(C), largestlen(E,F,G), greaterthan(G,5), link(E,H,follows), r(H,0)

The rule states that a sample is classified as SCC if there is a self-activating loop for a reaction C, and the longest chain of on reactions runs from reaction E to reaction F and is at least 6 reactions long. Additionally, following reaction E there is also a reaction H that is itself not on.
This suggests that one of the differences between the SCC and AC cancers is that in the cell cycle pathway SCC tumours have a self-activating loop that causes a longer chain of reactions to occur than in the AC tumour type. The instantiation of the learnt rule/pattern for a particular individual is shown in Figure 8.5. In this example there is a chain of 7 on reactions, and this chain also contains the self-activating loop.
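The exact definitions of the background predicates actloop/1 and largestlen/3 are not reproduced here; the following is only a possible sketch of how they could be written over per-sample r/2 and link/3 facts, included to make the reading of the rule concrete.

% Sketch only: possible definitions of the background predicates.
greaterthan(X, Y) :- X > Y.

% actloop(R): reaction R activates itself, either directly or via a chain
% of 'follows' links leading back to R.
actloop(R) :-
    link(R, R, activation).
actloop(R) :-
    link(R, Next, activation),
    follows_chain(Next, R, [Next], _).

follows_chain(R, R, _, 1).
follows_chain(R, To, Visited, Len) :-
    link(R, Next, follows),
    \+ memberchk(Next, Visited),
    follows_chain(Next, To, [Next|Visited], Len0),
    Len is Len0 + 1.

% onchain(Start, End, Len): a chain of Len consecutive 'on' reactions.
onchain(Start, End, Len) :-
    onchain_(Start, End, [Start], Len).
onchain_(R, R, _, 1) :- r(R, 1).
onchain_(R, End, Visited, Len) :-
    r(R, 1),
    link(R, Next, follows),
    \+ memberchk(Next, Visited),
    onchain_(Next, End, [Next|Visited], Len0),
    Len is Len0 + 1.

% largestlen(Start, End, Len): the longest such chain in the sample.
largestlen(Start, End, Len) :-
    aggregate_all(max(L, S-E), onchain(S, E, L), max(Len, Start-End)).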

Figure 8.6: The three features in the subgroup description (F1: not present, F2: present, F3: present). The solid lines represent the feature found by TreeLiker, the dotted lines show the Warmr extensions. Green rounded squares: on reactions; red octagons: off reactions; blue squares: on or off reactions.

8.5.3 Results for Warmr/TreeLiker combined method


As explained above, the feature used in the top pathway was very simple, and hence we demonstrate the value of our Warmr/TreeLiker combined approach on more complex features, identified from the Hyaluronan biosynthesis and export pathway, ranked second in
Table 8.1. Figure 8.6 shows the three features describing the subgroup identified by this
approach. We can see that the additional edges that Warmr finds give a more complete
view of the relations between the reactions in these features. This information may be
important when a biologist analyses these results. The subgroup described by these three
features has 58 true positives and 9 false positives in the hold-out data.

8.6 Conclusions
In this work we have shown the potential of ILP methods for mining the abundance of highly
structured biological data. Using this method we have identified differences in Pathway
Activation Patterns that go beyond the standard analysis of differentially expressed genes,
enrichment analysis, gene feature ranking and pattern mining for common network motifs.
We have also demonstrated the use of logical aggregation with a reaction graph, and
how this simplifies the search for hypotheses to an extent where searching all pathways
is tractable. We have introduced a novel approach that uses Warmr to extend features
initially identified with TreeLiker. This makes it possible to search for long cyclical features.
We have identified pathway activation patterns predictive of the lung cancer type
in several pathways. The model we built on the hexose uptake pathway has predictive
performance comparable with the top method from a recent challenge, but also provides

biologically relevant explanations for its predictions. Each identified activation pattern is
evaluated on the hold-out data, such that this should be the expected performance on new,
unseen examples. The pathway activation patterns we have found are in clinically relevant
pathways [167, 40]. Patterns identified using this method may give diagnostic and clinical
insights that biologists can develop into new hypotheses for further investigation.
Chapter 9

Summary and future directions

In this thesis we argue that logic programming should be used to a greater extent than
present for computational biological tasks. Part 1 of this thesis introduced the domain
of discourse, explaining the background on health research for cancer and psoriasis. We
described how different types of biological data are collected and stored currently. Part
2 synthesised knowledge about Prolog and illustrated how to use Prolog to tackle two biological data mining tasks, namely subgroup discovery of CpG sites that are differentially methylated in breast cancer and of microbes that are present in lesional psoriasis samples.
A number of algorithms were implemented, including versions for attribute value learning,
and versions for multi-instance learning. Part 3 emphasised structural data, showing
how Prolog can be used to implement advanced web services to access large complex
biological datasets on the web. We then showed how using ILP techniques allows us to
mine structured biological data, in order to generate rules that gave explanations in the
form of ‘pathway activation patterns’.
Recalling Chapter 1, we identified a number of current problems in bioinformatics
research. These are important problems due to the need for more efficient and effective
health research. These were:

1. Data sitting in difficult to combine silos.
2. Data and knowledge are stored separately.
3. Data storage and transfer is resource intensive.
4. Difficulties choosing the right analysis technique.
5. Incorrect assumptions in models.
6. Overly complex models.


Throughout this thesis we tackled these points; we now summarise how each point was addressed by our contributions.
Data sitting in difficult to combine silos: This was primarily addressed by the implementation of Reactome Pengine (Chapter 7), but also in our synthesis of Prolog knowledge (Chapter 5). We described how logic could be used as the ‘glue’ for our collective knowledge: knowledge is related using rules defined by predicates, and new predicates can combine existing predicates to create flexible rules that query a knowledgebase. We then showed how these predicates could be extended across the world wide web by implementing a Prolog-based API to the human Reactome. This allows federated queries across multiple data sources.
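For example, a client-side query combining a local fact with a remote pengine service might look like the sketch below, using SWI-Prolog's pengines library; the service URL and the remote predicate reaction_of_gene/2 are placeholders for illustration rather than the actual Reactome Pengine interface.

:- use_module(library(pengines)).

% A hypothetical local fact about a gene of interest.
gene_of_interest(g123).

% A simple federated query: local knowledge is combined with a remote
% pengine service in a single rule.
reaction_for_interesting_gene(Reaction) :-
    gene_of_interest(Gene),
    pengine_rpc('https://example.org/reactome-pengine',
                reaction_of_gene(Gene, Reaction)).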
Data and knowledge are stored separately: This was addressed by both our Prolog synthesis (Chapter 5) and our description of the implementation of the Reactome Pengine API (Chapter 7). We argued that when a piece of knowledge has been generated (perhaps by a machine learning algorithm, in the form of a rule), we do not want that classification or subgroup rule to be presented only in a scientific paper; it should also be provided in a form that can be deployed on our own or other datasets. We described how Prolog allows us to have ‘knowledgebases’, not just ‘databases’, meaning that rules (representing knowledge) and data can be stored together. Our implemented Reactome Pengine tool allows rules to be sent to different datasets alongside a query. This aids reproducible research and a wider use of what is collectively known, by making it easier to deploy existing knowledge on existing data.
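The sketch below illustrates this idea of shipping a rule together with a query, using the src_text option of the pengines library; as before, the service URL and the remote predicate are placeholders, and the rule being sent is an invented example.

:- use_module(library(pengines)).

% A rule (for example, one learnt by a data mining algorithm) is sent as
% program text and then used by the query, which runs remotely next to the data.
remote_interesting_reaction(Reaction) :-
    Rule = "interesting(R) :- reaction_of_gene(_, R).",
    pengine_rpc('https://example.org/reactome-pengine',
                interesting(Reaction),
                [ src_text(Rule) ]).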
Data storage and transfer is resource intensive: This was addressed by our
implementation of Reactome Pengine (Chapter 7). Instead of needing to transfer terabytes
of data for analysis, small programs are transferred and run in the cloud, meaning that a
user does not need to download and store large biological datasets on their own machine.
Difficulties choosing the right analysis technique: We addressed this problem in two ways. First, in the subgroup discovery chapter we identified a number of common biological data analysis techniques, such as finding Differentially Methylated Regions, that can be considered as subgroup discovery tasks. Recognising this allowed us to use techniques developed in the machine learning literature to generalise these tasks and to apply existing frameworks, by implementing existing algorithms and developing new purely declarative pattern mining techniques. Secondly, in both our subgroup discovery work and our ILP work we argued that we need to take into account the purpose of learning a model from our data. Do we need to build a classifier that will be used for prediction on unseen data? Or do we wish to understand some of the aspects of why one set of data is different from another? If the latter, then the subgroup discovery task is a better match than the classification task, and what the features in our rules represent is also important - for example, we showed how pathway activation patterns give more insight than simple gene features.
Incorrect assumptions in models: This was primarily addressed by our work on using ILP to find pathway activation patterns (Chapter 8). There we fused biological pathway data from the Reactome database and gene expression calls (from the barcode tool) with example gene expression datasets, in order to create models that take into account some of the biological knowledge about how genes interact. We did not assume that genes are independent variables. By using ILP methods we were able to find rules that performed as well as state-of-the-art classification models, whilst at the same time offering further explanations and potentially new and interesting information in the form of ‘pathway activation patterns’ represented as first-order rules.
Overly complex models: This was again addressed by our work on using ILP and
our work on subgroup discovery. In both of these contributions we used data mining
techniques that resulted in comprehensible rules. These rules describe subgroups of CpG
sites, microbes and patients, and biologists will be able to plan future studies based on the
outputs of these data mining explorations.

9.1 Future work


We now describe the limitations of our research and highlight research goals that could
address them in future work.

9.1.1 Integrate biological expertise and information


In this thesis we attempted to give an overview of the domain of discourse and to provide
information about some of the data types that have so far been collected for biological studies. Necessarily this was incomplete, as there are now so many new ways to collect
data in ever finer detail. For instance, we did not talk about methods to isolate cell types in
order to eliminate this source of noise in microbiome, transcriptome and epigenome studies.
Future applications that mine the microbiome, transcriptome and epigenome would benefit
from incorporating this information. In addition, our work gave a limited introduction to
the conditions under study, and future studies would benefit from close collaboration with

domain experts that could inform the model construction process and provide evaluations
of the identified rules.

9.1.2 Evaluate Prolog versions for bioinformatics


In our synthesis of Prolog knowledge (Chapter 5), we did not compare and contrast Prolog
versions, in terms of how suitable they are for bioinformatics. We also did not explore the
efforts that have been made to make probabilistic versions of Prolog, such as ProbLog and PRISM [39, 131]; such extensions allow for advanced probabilistic reasoning, which is likely to be valuable for bioinformatics work. A useful future study would compare different
Prolog implementations with probabilistic Prolog in the domain of bioinformatics, in order
to highlight relevant libraries (including constraint solvers), programming techniques and
potential pitfalls.

9.1.3 Improve algorithms for subgroup discovery


Our implementations of descriptive rule learning algorithms for subgroup discovery were
intended to show the process of how to map a biological data analysis task to an appropriate
machine learning task and then implement an algorithm to solve this task in Prolog – taking
note of the declarative and procedural aspects of Prolog’s execution. Future work could
refine these algorithms. For instance, the genetic algorithm we implemented has a number of hyperparameters which could be optimised. Additionally, whilst we provided multi-
instance versions of these subgroup discovery algorithms, further improvements for the
multi-instance case should be possible. For example, it may be beneficial if the heuristic
takes into account the number of covered individuals in a bag. Other improvements to the
algorithms could include adding further constraints to restrict the number of features in
a rule, rather than having a fixed number of features in the genetic algorithm or allowing
the declarative algorithm to set the number of features without constraint.

9.1.4 Expand applications of subgroup discovery


The data we choose to analyse could also be expanded. For example, a promising area for the application of subgroup discovery in the microbiome would be to use the metagenome (the set of all genes that exist in an organism and its symbionts) as attributes. This would allow us to find subgroups of microbes defined by what genes the microbes in the subgroup have in common compared to the general population of microbes. This

would provide interesting insights into biological pathways that cross the human/microbe
interface.

9.1.5 Deploy large pengine networks


Reactome is actually a small dataset by the scale of modern biology, and although its high degree of structure makes it an ideal demonstration of the approach, many of the advantages would be more telling on larger datasets. However, applications using larger databases will need to be suitably resourced in order to work effectively. If the service were popular, it may need to be rationed or have a service charge in order to maintain and pay for the compute used. The true benefits of the pengine approach to APIs will only become
apparent if many biological data providers move to provide these services. It will then be
possible to have complex federated queries across large swathes of biological knowledge.
Another limitation of the current implementation of pengines is that the computation is not private. For some use cases privacy would be a desirable property, and technologies such
as homomorphic encryption (which allows computation on cipher text) may one day allow
pengine type services to be private [57].

9.1.6 Use even more structural knowledge in model building


A key limitation of our work on learning with structured data is that our current knowledge of biological pathways is incomplete. As these data become more available, searching for pathway activation patterns will become more reliable. In addition, we only demonstrated this work on human pathway data, but more may be known about common model organisms such as mice. Furthermore, many other databases could be used as background knowledge, for
instance, there are other pathway databases such as KEGG [76]. Additionally, other data
sources that detail other biological information could be included. This would allow a more
complete idea of the changes between two types of instances. If data providers made much
of their data available via pengine services, then it would be theoretically possible to learn
models using federated queries.

9.1.7 Use the data collected from Reactome Pengine


Another important feature of the implemented Reactome Pengine (and other potential
biological pengine services) is that queries to it can be logged. In this way we can collect
the programs that are sent to it. This can potentially be useful as they could be made

available to other users of the service, allowing for an automatic sharing of programs. It
could also act as a new data source for data mining in bioinformatics. Researchers could
investigate applying machine learning algorithms to learn from these submitted programs.
If we say that programs represent some knowledge about a domain rather than just data, mining such future ‘knowledge-sets’ could tap into a rich seam of new ideas and concepts.

9.1.8 Incorporate reframing


If a large network of pengine servers were deployed, it would benefit from the machine learning research topic of ‘reframing’. Reframing is where a model that has been learnt from data is systematically adapted to allow it to operate in a context that is different from the one in which it was trained [67]. Concepts such as this will be useful in the imagined world, as we
could query one pengine for a set of pre-learned rules, a second for a context manipulator,
and then send our adapted programs to a third dataset that we wish to interrogate.

9.2 Implications and recommendations


Now that we have demonstrated how to build and deploy a ‘pengine’ service, other bio-
logical data providers should consider adopting this or similar technologies. With this in
mind we recommend that Prolog programming should be taught to a larger extent than it
currently is. In order for the benefits of logic programming to be felt in the life sciences,
efforts need to be made to educate bioinformaticians in this paradigm. It requires a dif-
ferent way of thinking and can seem intimidating, but there would be many benefits to a
wider user base.
A further recommendation is that biological researchers should try to understand known machine learning tasks, so that the questions they wish to ask of their data can be mapped to these known tasks. This will allow them to make use of existing algorithms and methods rather than reinventing limited versions of these, such as ‘Bump hunter’. We also recommend that researchers should use techniques that make use of existing knowledge and do not make known incorrect assumptions. For instance, the machine learning models for predicting cancer subtypes in the SBV Improver challenge (Chapter 8) assume that genes are independent variables, when we know that this is not the case. We also think
that the form of the models built by researchers should be given more emphasis. Having
models represented as Prolog facts, i.e. in logic, allows these models to be reused as parts

of composite queries, and they can be read by humans as well as machines. This cannot
be said for deep neural networks.

9.3 The imagined future


We envision a future where many biological datasets, and biological knowledge sit in
‘pengine’ accessible servers. This future will benefit from the increased use of the for-
mal reasoning capabilities of Prolog. Formal logic in the form of Prolog programs will act
as ‘glue’ for our collective knowledge, which will enable better utilisation of what is already
known. Some of these pengine services will contain expertly curated rules, while others
will have rules learnt from data (with details about how they were learnt also encoded as
Prolog facts – it will be possible to reproduce exactly how these rules were learnt to enable
a scientific critical understanding of them). These rules will describe interesting subgroups
of data, as well as different classification and clustering rules and will have a form that give
insight into the underlying biology, taking into account known relations between aspects
of the objects being studied and how the objects themselves are related. It will be possi-
ble to query these datasets collectively, building composite federated queries. In this way
a web-scale Prolog based semantic knowledgebase could be constructed with automatic
adaptations of knowledge to new data and domains. We will be able to make queries of
different models that have been learnt from different datasets, combining these into new
knowledge – standing on the shoulders of giants will be easier with this support.
Appendix A

Prolog execution algorithm

The algorithm for Prolog execution follows [15]. In this algorithm, matching means unification without the occurs check. The algorithm omits details of how a user can ask for alternative solutions by forcing backtracking.


Code Block Appendix.1


procedure execute(Program, Goallist, Success);
Input arguments:
   Program: list of clauses
   Goallist: list of goals

Output argument:
   Success: truth value; Success will become true if Goallist is true with respect to Program
Local variables:
   Goal: goal
   OtherGoals: list of goals
   Satisfied: truth value
   MatchOk: truth value
   Instant: instantiation of variables
   H, H’, B1, B1’, ..., Bn, Bn’: goals
Auxiliary functions:
   empty(L): returns true if L is the empty list
   head(L): returns the first element of list L
   tail(L): returns the rest of the list L
   append(L1, L2): appends list L2 at the end of list L1
   match(T1, T2, MatchOk, Instant): tries to match terms T1 and T2; if this succeeds then
      MatchOk is true and Instant is the corresponding instantiation of variables
   substitute(Instant, Goals): substitutes variables in Goals according to instantiation Instant

Begin
   if empty(Goallist) then Success := true
   else
      begin
         Goal := head(Goallist);
         OtherGoals := tail(Goallist);
         Satisfied := false;
         while not Satisfied and "more clauses in program" do
            begin
               Let the next clause in Program be
                  H :- B1, ..., Bn.
               Construct a variant of this clause
                  H’ :- B1’, ..., Bn’.
               match(Goal, H’, MatchOk, Instant);
               if MatchOk then
                  begin
                     NewGoals := append([B1’, ..., Bn’], OtherGoals);
                     NewGoals := substitute(Instant, NewGoals);
                     execute(Program, NewGoals, Satisfied)
                  end
            end;
         Success := Satisfied
      end
end;
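The same strategy can also be written directly in Prolog as the classic ‘vanilla’ meta-interpreter. The sketch below is a minimal version for pure programs (no built-ins, no cut), included only as an executable counterpart to the pseudocode above; the interpreted clauses must be accessible to clause/2 (for example, declared dynamic in strictly ISO systems).

% solve(Goal): Goal is provable from the clauses of the loaded program.
% clause/2 retrieves a renamed variant of a matching clause, and head
% unification plays the role of match/4 in the pseudocode.
solve(true) :- !.
solve((A, B)) :- !,
    solve(A),
    solve(B).
solve(Goal) :-
    clause(Goal, Body),
    solve(Body).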
Bibliography

[1] Bruce Alberts. Molecular biology of the cell. Garland science, 2017.

[2] Alexander V Alekseyenko, Guillermo I Perez-Perez, Aieska De Souza, Bruce Strober,


Zhan Gao, Monika Bihan, Kelvin Li, Barbara A Methé, and Martin J Blaser. Com-
munity differentiation of the cutaneous microbiota in psoriasis. Microbiome, 1(1):31,
2013.

[3] David B Allison, Xiangqin Cui, Grier P Page, and Mahyar Sabripour. Microarray
data analysis: from disarray to consolidation and consensus. Nature reviews genetics,
7(1):55, 2006.

[4] Nicos Angelopoulos. A logical approach to working with biological databases. In


Technical Communications ICLP 2015, Dublin, 2015.

[5] Nicos Angelopoulos, Samer Abdallah, and Georgios Giamas. Advances in integrative
statistics for logic programming. International Journal of Approximate Reasoning,
78:103–115, 2016.

[6] Nicos Angelopoulos and Jan Wielemaker. Accessing biological data as Prolog facts.
In Proceedings of the 19th International Symposium on Principles and Practice of
Declarative Programming, pages 29–38. ACM, 2017.

[7] Martin Atzmueller and Frank Puppe. SD-Map–A fast algorithm for exhaustive sub-
group discovery. In European Conference on Principles of Data Mining and Knowl-
edge Discovery, pages 6–17. Springer, 2006.

[8] Martin Atzmüller, Frank Puppe, and Hans-Peter Buscher. Profiling examiners using
intelligent subgroup mining. In Proceedings of the 10th Workshop on Intelligent Data
Analysis in Medicine and Pharmacology (IDAMAP-05), pages 46–51, 2005.


[9] Monya Baker. 1,500 scientists lift the lid on reproducibility. Nature News,
533(7604):452, 2016.

[10] Fernando Baquero and César Nombela. The microbiome as a human organ. Clinical
Microbiology and Infection, 18(s4):2–4, 2012.

[11] Stephen B Baylin, Manel Esteller, Michael R Rountree, Kurtis E Bachman, Kornel
Schuebel, and James G Herman. Aberrant patterns of DNA methylation, chromatin
formation and gene expression in cancer. Human molecular genetics, 10(7):687–692,
2001.

[12] Gordon Bell, Tony Hey, and Alex Szalay. Beyond the data deluge. Science,
323(5919):1297–1298, 2009.

[13] Asa Ben-Hur and William Stafford Noble. Kernel methods for predicting protein–
protein interactions. Bioinformatics, 21(suppl 1):i38–i46, 2005.

[14] Marina Bibikova, Bret Barnes, Chan Tsan, Vincent Ho, Brandy Klotzle, Jennie M Le,
David Delano, Lu Zhang, Gary P Schroth, Kevin L Gunderson, et al. High density
DNA methylation array with single CpG site resolution. Genomics, 98(4):288–295,
2011.

[15] Ivan Bratko. Prolog programming for artificial intelligence. Pearson education, 2001.

[16] Vince Buffalo. Bioinformatics data skills: Reproducible and robust research with open
source tools. O’Reilly Media, Inc., 2015.

[17] Mats Carlsson. SICStus Prolog User’s Manual,. Swedish Institute of Computer
Science, 2016.

[18] Matias Casás-Selves and James DeGregori. How cancer shapes evolution and how
evolution shapes cancer. Evolution: Education and outreach, 4(4):624–634, 2011.

[19] Jadzia Cendrowska. Prism: An algorithm for inducing modular rules. International
Journal of Man-Machine Studies, 27(4):349–370, 1987.

[20] H Chial. Proto-oncogenes to oncogenes to cancer. Nature Education, 2008.

[21] Peter Clark and Tim Niblett. The CN2 induction algorithm. Machine learning,
3(4):261–283, 1989.

[22] William F Clocksin and Christopher S Mellish. Programming in PROLOG. Springer


Science & Business Media, 2003.

[23] AL Cogen, V Nizet, and RL Gallo. Skin microbiota: a source of disease or defence?
British Journal of Dermatology, 158(3):442–455, 2008.

[24] A Colmerauer, H Kanoui, Ph Roussel, and R Pasero. Un systeme de communica-


tion homme-machine en français, rapport de recherche, cri 72-18. UER de Luminy.
Université d’Aix-Marseille, 1973, 1972.

[25] ENCODE Project Consortium et al. An integrated encyclopedia of DNA elements


in the human genome. Nature, 489(7414):57, 2012.

[26] Charles E Cook, Mary Todd Bergman, Robert D Finn, Guy Cochrane, Ewan Birney,
and Rolf Apweiler. The European Bioinformatics Institute in 2016: data growth and
integration. Nucleic acids research, 44(D1):D20–D26, 2015.

[27] Vı́tor Santos Costa. The life of a logic programming system. In International Con-
ference on Logic Programming, pages 1–6. Springer, 2008.

[28] Elizabeth K Costello, Christian L Lauber, Micah Hamady, Noah Fierer, Jeffrey I
Gordon, and Rob Knight. Bacterial community variation in human body habitats
across space and time. Science, 326(5960):1694–1697, 2009.

[29] Francis Crick. Central dogma of molecular biology. Nature, 227(5258):561, 1970.

[30] D. Croft, A. F. Mundo, R. Haw, M. Milacic, J. Weiser, G. Wu, M. Caudy, P. Garapati,


M. Gillespie, M. R. Kamdar, B. Jassal, S. Jupe, L. Matthews, B. May, S. Palatnik,
K. Rothfels, V. Shamovsky, H. Song, M. Williams, E. Birney, H. Hermjakob, L. Stein,
and P. D’Eustachio. The Reactome pathway knowledgebase. Nucleic Acids Research,
42(D1):D472–D477, November 2013.

[31] Andrew Cropper and Stephen H Muggleton. Learning efficient logical robot strategies
involving composable objects. In IJCAI, pages 3423–3429, 2015.

[32] Andrew Cropper and Stephen H Muggleton. Logical minimisation of meta-rules


within meta-interpretive learning. In Inductive Logic Programming, pages 62–75.
Springer, 2015.

[33] Elodie M Da Costa, Gabrielle McInnes, Annie Beaudry, and Noël J-M Raynal. DNA
methylation–targeted drugs. The Cancer Journal, 23(5):270–276, 2017.

[34] Shuo Dai, Yong Zhang, Limin Jia, and Yong Qin. A subgroup discovery algorithm
based on genetic fuzzy systems. In Proceedings of the 2015 Chinese Intelligent Au-
tomation Conference, pages 171–177. Springer, 2015.

[35] Charles Darwin and William F Bynum. The origin of species by means of natural
selection: or, the preservation of favored races in the struggle for life. Penguin, 2009.

[36] Richard Dawkins. The Selfish Gene. Oxford University Press, USA, 1976.

[37] Richard Dawkins. Climbing mount improbable. WW Norton & Company, 1997.

[38] Luc De Raedt. Logical and relational learning. Springer Science & Business Media,
2008.

[39] Luc De Raedt, Angelika Kimmig, and Hannu Toivonen. Problog: A probabilistic
prolog and its application in link discovery. 2007.

[40] VK de Sá, TP Rocha, AL Moreira, FA Soares, T Takagaki, L Carvalho, AG Nichol-


son, and VL Capelozzi. Hyaluronidases and hyaluronan synthases expression is in-
versely correlated with malignancy in lung/bronchial pre-neoplastic and neoplastic
lesions, affecting prognosis. Brazilian Journal of Medical and Biological Research,
48(11):1039–1047, 2015.

[41] Luc Dehaspe and Luc De Raedt. Mining association rules in multiple relations. In
Nada Lavrač and Sašo Džeroski, editors, Inductive Logic Programming, number 1297
in Lecture Notes in Computer Science, pages 125–132. Springer Berlin Heidelberg,
January 1997.

[42] Guoqing Diao and Anand N Vidyashankar. Assessing genome-wide statistical signif-
icance for large p small n problems. Genetics, 194(3):781–783, 2013.

[43] Theodosius Dobzhansky. Nothing in biology makes sense except in the light of evo-
lution. The american biology teacher, 75(2):87–91, 2013.

[44] Sorin Drăghici. Statistics and data analysis for microarrays using R and bioconductor.
CRC Press, 2011.

[45] David J Duggan, Michael Bittner, Yidong Chen, Paul Meltzer, and Jeffrey M Trent.
Expression profiling using cDNA microarrays. Nature genetics, 21(1s):10, 1999.

[46] Ron Edgar, Michael Domrachev, and Alex E Lash. Gene expression omnibus: NCBI
gene expression and hybridization array data repository. Nucleic acids research,
30(1):207–210, 2002.

[47] Peter A Flach. Simply logical intelligent reasoning by example. 1994.

[48] Thomas Fleischer, Arnoldo Frigessi, Kevin C Johnson, Hege Edvardsen, Nizar
Touleimat, Jovana Klajic, Margit LH Riis, Vilde D Haakensen, Fredrik Wärnberg,
Bjørn Naume, et al. Genome-wide dna methylation profiles in progression to in situ
and invasive carcinoma of the breast with impact on gene transcription and prognosis.
Genome biology, 15(8):435, 2014.

[49] Manoel VM França, Gerson Zaverucha, and Artur S d’Avila Garcez. Fast relational
learning using bottom clause propositionalization with artificial neural networks.
Machine learning, 94(1):81–104, 2014.

[50] Thom Frühwirth. Constraint handling rules. In Constraint programming: Basics and
trends, pages 90–107. Springer, 1995.

[51] Jonathan C Fuller, Pierre Khoueiry, Holger Dinkel, Kristoffer Forslund, Alexan-
dros Stamatakis, Joseph Barry, Aidan Budd, Theodoros G Soldatos, Katja Linssen,
and Abdul Mateen Rajput. Biggest challenges in bioinformatics. EMBO reports,
14(4):302–304, 2013.

[52] Johannes Fürnkranz and Peter A Flach. ROC ‘n’ rule learning - towards a better understanding of covering algorithms. Machine Learning, 58(1):39–77, 2005.

[53] Johannes Fürnkranz, Dragan Gamberger, and Nada Lavrač. Foundations of rule
learning. Springer Science & Business Media, 2012.

[54] Dragan Gamberger and Nada Lavrač. Expert-guided subgroup discovery: Method-
ology and application. Journal of Artificial Intelligence Research, 17:501–527, 2002.

[55] Dragan Gamberger, Nada Lavrač, Filip Železnỳ, and Jakub Tolar. Induction of com-
prehensible models for gene expression datasets by subgroup discovery methodology.
Journal of biomedical informatics, 37(4):269–284, 2004.

[56] Zhan Gao, Chi-hong Tseng, Bruce E Strober, Zhiheng Pei, and Martin J Blaser.
Substantial alterations of the cutaneous bacterial biota in psoriatic lesions. PloS
one, 3(7):e2719, 2008.

[57] Craig Gentry. A fully homomorphic encryption scheme. Stanford University, 2009.

[58] Dirk Gevers, Subra Kugathasan, Lee A Denson, Yoshiki Vázquez-Baeza, Will
Van Treuren, Boyu Ren, Emma Schwager, Dan Knights, Se Jin Song, Moran Yas-
sour, et al. The treatment-naive microbiome in new-onset Crohn’s disease. Cell host
& microbe, 15(3):382–392, 2014.

[59] Joseph C Giarratano and Gary Riley. Expert systems: principles and programming.
Brooks/Cole Publishing Co., 1989.

[60] Stephen Jay Gould. What is a species. Discover, 13(12):40–44, 1992.

[61] Patrick Goymer. Natural selection: The evolution of cancer. Nature News,
454(7208):1046–1048, 2008.

[62] Michael W Gray, Gertraud Burger, and B Franz Lang. The origin and early evolution
of mitochondria. Genome biology, 2(6):reviews1018–1, 2001.

[63] Casey S Greene, Jie Tan, Matthew Ung, Jason H Moore, and Chao Cheng. Big data
bioinformatics. Journal of cellular physiology, 229(12):1896–1900, 2014.

[64] Elizabeth A Grice and Julia A Segre. The skin microbiome. Nature Reviews Micro-
biology, 9(4):244, 2011.

[65] Anna Hart and Jeremy Wyatt. Evaluating black-boxes as medical decision aids:
issues arising from a study of neural networks. Medical informatics, 15(3):229–236,
1990.

[66] Stephen S Hecht. Cigarette smoking and lung cancer: chemical mechanisms and
approaches to prevention. The lancet oncology, 3(8):461–469, 2002.

[67] José Hernández-Orallo, Adolfo Martı́nez-Usó, Ricardo BC Prudêncio, Meelis Kull,


Peter A Flach, Chowdhury Farhan Ahmed, and Nicolas Lachiche. Reframing in
context: A systematic approach for model reuse in machine learning. AI Communi-
cations, 29(5):551–566, 2016.

[68] Franciso Herrera, Cristóbal José Carmona, Pedro González, and Marı́a José Del Je-
sus. An overview on subgroup discovery: foundations and applications. Knowledge
and information systems, 29(3):495–525, 2011.

[69] Matěj Holec, Jiří Kléma, Filip Železný, and Jakub Tolar. Comparative evaluation
of set-level techniques in predictive classification of gene expression samples. BMC
Bioinformatics, 13(Suppl 10):S15, June 2012.

[70] Matěj Holec, Filip Železnỳ, Jiřı̀ Kléma, Jiřı̀ Svoboda, and Jakub Tolar. Using bio-
pathways in relational learning. Inductive Logic Programming, page 50, 2008.

[71] Robin Holliday. Epigenetics: a historical overview. Epigenetics, 1(2):76–80, 2006.

[72] Laura Hoopes. Genetic diagnosis: DNA microarrays and cancer. Nature Education,
1(3), 2008.

[73] Curtis Huttenhower, Dirk Gevers, Rob Knight, Sahar Abubucker, Jonathan H Bad-
ger, Asif T Chinwalla, Heather H Creasy, Ashlee M Earl, Michael G FitzGerald,
Robert S Fulton, et al. Structure, function and diversity of the healthy human mi-
crobiome. Nature, 486(7402):207, 2012.

[74] Julian Huxley. Evolution the modern synthesis. George Allen and Unwin, 1942.

[75] Andrew E Jaffe, Peter Murakami, Hwajin Lee, Jeffrey T Leek, M Daniele Fallin,
Andrew P Feinberg, and Rafael A Irizarry. Bump hunting to identify differentially
methylated regions in epigenetic epidemiology studies. International journal of epi-
demiology, 41(1):200–209, 2012.

[76] Minoru Kanehisa and Susumu Goto. Kegg: kyoto encyclopedia of genes and genomes.
Nucleic acids research, 28(1):27–30, 2000.

[77] Branko Kavšek and Nada Lavrač. APRIORI-SD: Adapting association rule learning
to subgroup discovery. Applied Artificial Intelligence, 20(7):543–583, 2006.

[78] Someswa Kesh and Wullianallur Raghupathi. Critical issues in bioinformatics and
computing. Perspectives in health information management/AHIMA, American
Health Information Management Association, 1, 2004.

[79] Wooyoung Kim, Min Li, Jianxin Wang, and Yi Pan. Biological network motif detec-
tion and evaluation. BMC Systems Biology, 5(Suppl 3):S5, December 2011.

[80] Willi Klösgen. Explora: A multipattern and multistrategy discovery assistant. In


Advances in knowledge discovery and data mining, pages 249–271. American Associ-
ation for Artificial Intelligence, 1996.

[81] Willi Klösgen and Michael May. Spatial subgroup mining integrated in an object-
relational spatial database. In European Conference on Principles of Data Mining
and Knowledge Discovery, pages 275–286. Springer, 2002.

[82] Michael R Kosorok, Shuangge Ma, et al. Marginal asymptotics for the large p,
small n paradigm: with applications to microarray data. The Annals of Statistics,
35(4):1456–1486, 2007.

[83] Robert Kowalski. Algorithm= logic+ control. Communications of the ACM,


22(7):424–436, 1979.

[84] Robert Kowalski and Donald Kuehner. Linear resolution with selection function. In
Automation of Reasoning, pages 542–577. Springer, 1983.

[85] Petra Kralj, N Lavrac, Dragan Gamberger, and Antonija Krstacic. Supporting fac-
tors to improve the explanatory potential of contrast set mining: Analyzing brain
ischaemia data. In 11th Mediterranean Conference on Medical and Biomedical Engi-
neering and Computing 2007, pages 157–161. Springer, 2007.

[86] Martin Krzywinski, Inanc Birol, Steven JM Jones, and Marco A Marra. Hive plots.
Rational approach to visualizing networks. Briefings in bioinformatics, 13(5):627–
644, 2011.

[87] Ondřej Kuželka and Filip Železnỳ. Block-wise construction of tree-like relational
features with monotone reducibility and redundancy. Machine Learning, 83(2):163–
192, 2011.

[88] Torbjörn Lager and Jan Wielemaker. Pengines: Web Logic Programming Made
Easy. Theory and Practice of Logic Programming, 14(4-5):539–552, July 2014.

[89] Laura Langohr, Vid Podpečan, Marko Petek, Igor Mozetič, Kristina Gruden, Nada
Lavrač, and Hannu Toivonen. Contrasting subgroup discovery. The Computer Jour-
nal, 56(3):289–303, 2012.

[90] Nada Lavrač, Branko Kavšek, Peter A Flach, and Ljupčo Todorovski. Subgroup
discovery with CN2-SD. Journal of Machine Learning Research, 5(Feb):153–188,
2004.

[91] Nada Lavrač and Anže Vavpetič. Relational and semantic data mining. In Logic
Programming and Nonmonotonic Reasoning, pages 20–31. Springer, 2015.

[92] Nada Lavrač, Filip Železnỳ, and Peter A Flach. RSD: Relational subgroup discovery
through first-order feature construction. In International Conference on Inductive
Logic Programming, pages 149–165. Springer, 2002.

[93] Jeremy Leipzig. A review of bioinformatic pipeline frameworks. Briefings in bioin-


formatics, 18(3):530–536, 2017.

[94] Dianhuan Lin, Eyal Dechter, Kevin Ellis, Joshua B Tenenbaum, and Stephen H
Muggleton. Bias reformulation for one-shot function induction. 2014.

[95] José Marı́a Luna, José Raúl Romero, Cristóbal Romero, and Sebastián Ventura.
On the use of genetic programming for mining comprehensible rules in subgroup
discovery. IEEE transactions on cybernetics, 44(12):2329–2341, 2014.

[96] Chaysavanh Manichanh, Lionel Rigottier-Gois, Elian Bonnaud, Karine Gloux, Eric
Pelletier, Lionel Frangeul, Renaud Nalin, Cyrille Jarrin, Patrick Chardon, Phillipe
Marteau, et al. Reduced diversity of faecal microbiota in Crohn’s disease revealed by
a metagenomic approach. Gut, 55(2):205–211, 2006.

[97] Victor M Markowitz, I-Min A Chen, Krishna Palaniappan, Ken Chu, Ernest Szeto,
Yuri Grechkin, Anna Ratner, Biju Jacob, Jinghua Huang, Peter Williams, et al. Img:
the integrated microbial genomes database and comparative analysis system. Nucleic
acids research, 40(D1):D115–D122, 2011.

[98] Vivien Marx. Biology: The big challenges of big data, 2013.

[99] M. N. McCall, H. A. Jaffee, S. J. Zelisko, N. Sinha, G. Hooiveld, R. A. Irizarry,


and M. J. Zilliox. The Gene Expression Barcode 3.0: improved data processing and
mining tools. Nucleic Acids Research, 42(D1):D938–D943, January 2014.

[100] Gregor Mendel. Experiments in plant hybridization (1865). Verhandlungen des naturforschenden Vereins Brünn. Available online: www.mendelweb.org/Mendel.html (accessed on 1 January 2013), 1996.

[101] Ryszard S Michalski. On the quasi-minimal solution of the general covering problem.
Proceedings of the International Symposium on Information Processing, 1969.

[102] Ryszard S Michalski. A theory and methodology of inductive learning. In Machine


Learning, Volume I, pages 83–134. Elsevier, 1983.

[103] Brad L Miller, David E Goldberg, et al. Genetic algorithms, tournament selection,
and the effects of noise. Complex systems, 9(3):193–212, 1995.

[104] Xochitl C Morgan, Timothy L Tickle, Harry Sokol, Dirk Gevers, Kathryn L Devaney,
Doyle V Ward, Joshua A Reyes, Samir A Shah, Neal LeLeiko, Scott B Snapper,
et al. Dysfunction of the intestinal microbiome in inflammatory bowel disease and
treatment. Genome biology, 13(9):R79, 2012.

[105] Ali Mortazavi, Brian A Williams, Kenneth McCue, Lorian Schaeffer, and Barbara
Wold. Mapping and quantifying mammalian transcriptomes by RNA-seq. Nature
methods, 5(7):621, 2008.

[106] Stephen Muggleton, Cao Feng, et al. Efficient induction of logic programs. Citeseer,
1990.

[107] Gareth Muirhead. Analysis of the microbiome and host-transcriptome in psoriasis


and atopic dermatitis. PhD thesis, King’s College London, 2017.

[108] Chris Mungall. Experiences Using Logic Programming in Bioinformatics, pages 1–21.
Springer Berlin Heidelberg, Berlin, Heidelberg, 2009.

[109] Rajan P Nair, Kristina Callis Duffin, Cynthia Helms, Jun Ding, Philip E Stuart,
David Goldgar, Johann E Gudjonsson, Yun Li, Trilokraj Tejasvi, Bing-Jian Feng,
et al. Genome-wide scan reveals association of psoriasis with il-23 and nf-κb path-
ways. Nature genetics, 41(2):199, 2009.

[110] Samuel R. Neaves et al. Using ILP to Identify Pathway Activation Patterns in
Systems Biology, pages 137–151. Springer International Publishing, 2016.

[111] Samuel R Neaves, Sophia Tsoka, and Louise AC Millard. Reactome pengine: A
web-logic API to the homo sapiens reactome. Bioinformatics, 1:3, 2018.

[112] Frank O. Nestle, Daniel H. Kaplan, and Jonathan Barker. Psoriasis. New England
Journal of Medicine, 361(5):496–509, 2009. PMID: 19641206.

[113] Ulrich Neumerkel and Stefan Kral. Indexing dif/2. arXiv preprint arXiv:1607.01590,
2016.

[114] Ulrich Neumerkel and Fred Mesnard. Localizing and explaining reasons for non-
terminating logic programs with failure-slices. In International Conference on Prin-
ciples and Practice of Declarative Programming, pages 328–341. Springer, 1999.

[115] Ulrich Neumerkel, Markus Triska, and Jan Wielemaker. Declarative language ex-
tensions for prolog courses. In Proceedings of the 2008 international workshop on
Functional and declarative programming in education, pages 73–78. ACM, 2008.

[116] F Niyonsaba, A Suzuki, H Ushio, I Nagaoka, H Ogawa, and K Okumura. The hu-
man antimicrobial peptide dermcidin activates normal human keratinocytes. British
Journal of Dermatology, 160(2):243–249, 2009.

[117] Petra Kralj Novak, Nada Lavrač, and Geoffrey I Webb. Supervised descriptive rule
discovery: A unifying survey of contrast set, emerging pattern and subgroup mining.
Journal of Machine Learning Research, 10(Feb):377–403, 2009.

[118] Richard A O’Keefe. The craft of Prolog, volume 86. MIT press Cambridge, 1990.

[119] Roderick DM Page and Edward C Holmes. Molecular evolution: a phylogenetic


approach. John Wiley & Sons, 2009.

[120] Rafael S Parpinelli, Heitor S Lopes, and Alex Alves Freitas. Data mining with an
ant colony optimization algorithm. IEEE transactions on evolutionary computation,
6(4):321–332, 2002.

[121] William R Pearson and David J Lipman. Improved tools for biological sequence
comparison. Proceedings of the National Academy of Sciences, 85(8):2444–2448, 1988.

[122] Sérgio Pereira, Adriano Pinto, Victor Alves, and Carlos A Silva. Brain tumor seg-
mentation using convolutional neural networks in mri images. IEEE transactions on
medical imaging, 35(5):1240–1251, 2016.

[123] Claudia Perlich and Foster Provost. Distribution-based aggregation for relational
learning with identifier attributes. Machine Learning, 62(1-2):65–105, 2006.

[124] Bahareh Rabbani, Hirofumi Nakaoka, Shahin Akhondzadeh, Mustafa Tekin, and
Nejat Mahdieh. Next generation sequencing: implications in personalized medicine
and pharmacogenomics. Molecular BioSystems, 12(6):1818–1830, 2016.

[125] Tara D Rachakonda, Clayton W Schupp, and April W Armstrong. Psoriasis preva-
lence among adults in the united states. Journal of the American Academy of Der-
matology, 70(3):512–516, 2014.

[126] David A Relman. The human microbiome: ecosystem resilience and health. Nutrition
reviews, 70(s1), 2012.

[127] Kahn Rhrissorrakrai, J. Jeremy Rice, Stephanie Boue, Marja Talikka, Erhan Bilal,
Florian Martin, Pablo Meyer, Raquel Norel, Yang Xiang, Gustavo Stolovitzky, Ju-
lia Hoeng, and Manuel C. Peitsch. SBV Improver Diagnostic Signature Challenge:
Design and results. Systems Biomedicine, 1(4):3–14, September 2013.

[128] Petar Ristoski. Towards linked open data enabled data mining. In The Semantic
Web. Latest Advances and New Domains, pages 772–782. Springer, 2015.

[129] Petar Ristoski and Heiko Paulheim. A comparison of propositionalization strategies


for creating features from linked open data. Linked Data for Knowledge Discovery,
page 6, 2014.

[130] John Alan Robinson. A machine-oriented logic based on the resolution principle.
Journal of the ACM (JACM), 12(1):23–41, 1965.

[131] Taisuke Sato and Yoshitaka Kameya. Prism: a language for symbolic-statistical
modeling. In IJCAI, volume 97, pages 1330–1339, 1997.

[132] Hashem A Shihab, Mark F Rogers, Julian Gough, Matthew Mort, David N Cooper,
Ian NM Day, Tom R Gaunt, and Colin Campbell. An integrative approach to pre-
dicting the functional effects of non-coding and coding sequence variation. Bioinfor-
matics, 31(10):1536–1543, 2015.

[133] George Gaylord Simpson. Principles of animal taxonomy. 1961.

[134] David B Skalak. Prototype and feature selection by sampling and random mutation
hill climbing algorithms. In Machine Learning Proceedings 1994, pages 293–301.
Elsevier, 1994.

[135] Peter B Snow, Deborah S Smith, and William J Catalona. Artificial neural networks
in the diagnosis and prognosis of prostate cancer: a pilot study. The Journal of
urology, 152(5):1923–1926, 1994.

[136] Tamar Sofer, Elizabeth D Schifano, Jane A Hoppin, Lifang Hou, and Andrea A
Baccarelli. A-clustering: a novel method for the detection of co-regulated methylation
regions, and regions associated with exposure. Bioinformatics, 29(22):2884–2891,
2013.

[137] Ashwin Srinivasan. The Aleph Manual, 2001. URL http://www.comlab.ox.ac.uk/activities/machinelearn/Aleph/aleph.html, 69, 2019.

[138] Julius Stecher, Frederik Janssen, and Johannes Fürnkranz. Shorter rules are better, aren’t they? In International Conference on Discovery Science, pages 279–294.
Springer, 2016.

[139] Aravind Subramanian, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee, Ben-
jamin L. Ebert, Michael A. Gillette, Amanda Paulovich, Scott L. Pomeroy, Todd R.
Golub, Eric S. Lander, and Jill P. Mesirov. Gene set enrichment analysis: A
knowledge-based approach for interpreting genome-wide expression profiles. Pro-
ceedings of the National Academy of Sciences, 102(43):15545–15550, October 2005.

[140] Adi L Tarca, Nandor Gabor Than, and Roberto Romero. Methodological approach
from the best overall team in the SBV Improver Diagnostic Signature Challenge.
Systems Biomedicine, 1(4):217–227, 2013.

[141] Nizar Touleimat and Jörg Tost. Complete pipeline for Infinium Human Methylation 450K BeadChip data processing using subset quantile normalization for accurate DNA methylation estimation. Epigenomics, 4(3):325–341, 2012.

[142] Igor Trajkovski, Nada Lavrač, and Jakub Tolar. Segs: Search for enriched gene sets
in microarray data. Journal of biomedical informatics, 41(4):588–601, 2008.

[143] Markus Triska. The Power of Prolog. 2018. Accessed: 2018-05-10.

[144] Markus Triska. The finite domain constraint solver of SWI-Prolog. In International
Symposium on Functional and Logic Programming, pages 307–316. Springer, 2012.

[145] Markus Triska. The Boolean constraint solver of SWI-Prolog: System description.
In FLOPS, volume 9613 of LNCS, pages 45–61, 2016.

[146] Alan Mathison Turing. On computable numbers, with an application to the


entscheidungsproblem. a correction. Proceedings of the London Mathematical So-
ciety, 2(1):544–546, 1938.

[147] Peter J Turnbaugh, Ruth E Ley, Micah Hamady, Claire M Fraser-Liggett,


Rob Knight, and Jeffrey I Gordon. The human microbiome project. Nature,
449(7164):804, 2007.

[148] John J Tyson, Katherine C Chen, and Bela Novak. Sniffers, buzzers, toggles and
blinkers: dynamics of regulatory and signaling pathways in the cell. Current opinion
in cell biology, 15(2):221–231, 2003.

[149] UCSC contributors. Frequently asked questions: Data file formats. https://genome.ucsc.edu/FAQ/FAQformat.html#format13, 2018. [Online; accessed 20-May-2018].

[150] Anita Valmarska, Nada Lavrač, Johannes Fürnkranz, and Marko Robnik-Šikonja.
Refinement and selection heuristics in subgroup discovery and classification rule
learning. Expert Systems with Applications, 81:147–162, 2017.

[151] J Craig Venter, Mark D Adams, Eugene W Myers, Peter W Li, Richard J Mural,
Granger G Sutton, Hamilton O Smith, Mark Yandell, Cheryl A Evans, Robert A
Holt, et al. The sequence of the human genome. Science, 291(5507):1304–1351,
2001.

[152] Peter M Visscher, Matthew A Brown, Mark I McCarthy, and Jian Yang. Five years
of GWAS discovery. The American Journal of Human Genetics, 90(1):7–24, 2012.

[153] Haizhou Wang and Mingzhou Song. Ckmeans.1d.dp: optimal k-means clustering in one dimension by dynamic programming. The R Journal, 3(2):29, 2011.

[154] Rui-Sheng Wang, Assieh Saadatpour, and Réka Albert. Boolean modeling in systems
biology: an overview of methodology and applications. Physical Biology, 9(5):055001,
October 2012.

[155] James D Watson, Francis HC Crick, et al. Molecular structure of nucleic acids.
Nature, 171(4356):737–738, 1953.

[156] Tyler Weirick, Giuseppe Militello, Yuliya Ponomareva, David John, Claudia Döring,
Stefanie Dimmeler, and Shizuka Uchida. Logic programming to infer complex RNA

expression patterns from RNA-seq data. Briefings in bioinformatics, page bbw117,


2016.

[157] Ken Whelan, Oliver Ray, and Ross D King. Representation, simulation, and hypoth-
esis generation in graph and logical models of biological networks. In Yeast Systems
Biology, pages 465–482. Springer, 2011.

[158] Darrell Whitley. A genetic algorithm tutorial. Statistics and computing, 4(2):65–85,
1994.

[159] Jan Wielemaker. SWI-Prolog version 7 extensions. In Workshop on Implementation


of Constraint and Logic Programming Systems and Logic-based Methods in Program-
ming Environments 2014, page 109. Citeseer, 2014.

[160] Jan Wielemaker et al. SWI-Prolog. Theory and Practice of Logic Programming,
12(1-2):67–96, January 2012.

[161] Jan Wielemaker, Torbjörn Lager, and Fabrizio Riguzzi. SWISH: SWI-Prolog for
sharing. CoRR, abs/1511.00915, 2015.

[162] Wikipedia contributors. Comparison of prolog implementations — Wikipedia, the


free encyclopedia, 2018. [Online; accessed 12-May-2018].

[163] Edward O Wilson. Sociobiology. Harvard University Press, 2000.

[164] Carl R Woese. Bacterial evolution. Microbiological reviews, 51(2):221, 1987.

[165] PCY Woo, SKP Lau, JLL Teng, H Tse, and K-Y Yuen. Then and now: use of 16S
rDNA gene sequencing for bacterial identification and discovery of novel bacteria in
clinical microbiology laboratories. Clinical Microbiology and Infection, 14(10):908–
934, 2008.

[166] Stefan Wrobel. An algorithm for multi-relational discovery of subgroups. In European


Symposium on Principles of Data Mining and Knowledge Discovery, pages 78–87.
Springer, 1997.

[167] Rongrong Wu, Lorena Galan-Acosta, and Erik Norberg. Glucose metabolism provide
distinct prosurvival benefits to non-small cell lung carcinomas. Biochemical and
biophysical research communications, 460(3):572–577, 2015.

[168] Shi Ying, Dan-Ning Zeng, Liang Chi, Yuan Tan, Carlos Galzote, Cesar Cardona,
Simon Lax, Jack Gilbert, and Zhe-Xue Quan. The influence of age and gender on
skin-associated microbial communities in urban and rural human populations. PloS
one, 10(10):e0141842, 2015.

[169] Lei Zhang, Linlin Wang, Bochuan Du, Tianjiao Wang, Pu Tian, and Suyan Tian.
Classification of non-small cell lung cancer using significance analysis of microarray-
gene set reduction algorithm. BioMed research international, 2016, 2016.
