
Bio-Ontologies 2010

Representation of research hypotheses


Larisa N. Soldatova* and Ross D. King
Aberystwyth University, Wales, UK

* To whom correspondence should be addressed.

ABSTRACT
Motivation: The motivation for the work reported in this paper is mainly technological: hypotheses are now being produced on an industrial scale by computers in biology. For example, the annotation of a genome is essentially a large set of hypotheses generated by sequence similarity programs, and Robot Scientists enable the full automation of a scientific investigation, including the generation and testing of research hypotheses. The ontology LABORS defines the representation and supports the recording of hypotheses automatically generated by Robot Scientists.
Availability: the RS webpage, LABORS.

1 INTRODUCTION
Research hypotheses are a key element of scientific investigations. Their accurate and semantically clear representation is therefore vital for the recording of investigations. A number of projects aim to address the need to represent and record research hypotheses in a semantically defined form. In our opinion, the most advanced ones are the Semantic Web Applications in Neuroscience (SWAN) [], and the ontology for biomedical investigations (OBI) [The OBI Consortium, 2010] projects. Hypotheses in the SWAN Alzheimer knowledge base are portions of natural language text which are represented as research statements (discourse-elements), and these are linked (via discourse-relations) to other discourse elements and to citations which specify the author's name, article, journal, etc. Similarly, OBI represents hypotheses as the class obi:hypothesis textual entity (here and further in the text we use italic for ontological classes where appropriate), where hypotheses are part of the obi:objective specification of an obi:investigation.
In this paper we extend the representation of hypotheses as textual entities to the representation of hypotheses that are automatically generated by a machine as logical entities. This representation of hypotheses is based on the version of LABORS (the LABoratory Ontology for Robot Scientist) published in [Science, see the Supplementary Materials, April 2009]. Note: LABORS was designed to support automated investigations in the areas of Systems Biology and Functional Genomics run by robots, and for this reason LABORS does not comply with all OBO Foundry requirements [ref], e.g. robots do not need textual definitions. The representation of research units is based on LABORS and DDI (an ontology for Drug Discovery Investigations) [DDI].
The examples in this paper are taken from the investigations run by the Robot Scientist Adam (A Discovery Machine): the investigation into the re-discovery of gene functions in the aromatic amino acids pathway and the investigation into novel biological knowledge, which are reported in [Nature, Science]. All the information about the investigations is available at:... This paper does not describe LABORS or DDI themselves, how they were built, or which upper-level ontologies and relations they use.
The rest of the paper is organized as follows. Section 2 explains how Adam generates hypotheses. Section 3 reports how LABORS supports the logical representation of research and negative hypotheses at different levels of granularity and how sets of hypotheses are represented. The discussion section includes lessons learned from the representation of automatically generated hypotheses and how they can be used to improve the recording of hypotheses and investigations.

2 AUTOMATIC GENERATION OF HYPOTHESES
Computers are increasingly automating the process of data analysis and hypothesis formation. For example, machine learning programs (based on induction) are used in chemistry to help design drugs, and in biology genome annotation is essentially a vast process of (abductive) hypothesis formation. It is now probable that the majority of hypotheses in biology are computer generated, though of course not the most original or important ones.
The idea of a Robot Scientist is to combine laboratory automation, automated hypothesis formation, and other techniques from Artificial Intelligence (AI) to close the loop and automate the whole scientific process. We have developed the Robot Scientist Adam [XX]. Adam follows a hypothetico-deductive methodology. First Adam abductively hypothesizes new facts about yeast functional biology, then it deduces the experimental consequences of these facts using its model of metabolism, and finally it tests these consequences experimentally. In our original work on robot scientists, we used a purely logical approach to hypothesis formation based on applying abductive logic programming to a logical model of a yeast metabolism subset [King,


Nature]. Unfortunately, this general method is too inefficient to deal with large bioinformatic models. We thus developed an alternative approach based on using standard bioinformatic methods coupled with the insight that standard genome annotation is a vast process of abductive hypothesis formation [King Science].
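
The hypothetico-deductive cycle described in this section can be illustrated with a minimal sketch. It is not Adam's implementation: the two lookup tables and the function names abduce, deduce and render are hypothetical, and the identifiers are borrowed from the example in Section 3.1. The sketch only shows abduction of candidate facts followed by deduction of their testable consequences, written out in the same fact notation as the paper.

```python
from typing import NamedTuple


class Hypothesis(NamedTuple):
    predicate: str
    args: tuple


# Hypothetical toy inputs (assumptions, not Adam's data):
# candidate genes proposed for an enzyme class by a bioinformatic method,
# and metabolites whose addition is predicted to restore growth of a
# strain lacking the gene for that enzyme class.
CANDIDATE_GENES = {"ec.2.6.1.39": ["yer152c"]}
RESCUE_METABOLITES = {"ec.2.6.1.39": ["c00047", "c00449", "c00956"]}


def abduce(enzyme_classes):
    """Abduction: propose encodes(gene, enzyme) facts for each enzyme class."""
    return [
        Hypothesis("encodes", (gene, ec))
        for ec in enzyme_classes
        for gene in CANDIDATE_GENES.get(ec, [])
    ]


def deduce(hypothesis):
    """Deduction: derive growth predictions that should hold if the hypothesis is true."""
    gene, ec = hypothesis.args
    strain = "delta_" + gene.upper()  # deletion strain for the gene
    return [
        Hypothesis("affects_growth", (metabolite, strain))
        for metabolite in RESCUE_METABOLITES.get(ec, [])
    ]


def render(h):
    """Write a hypothesis as a fact in the notation used in the paper."""
    return f"{h.predicate}({','.join(h.args)})."


for research_hypothesis in abduce(["ec.2.6.1.39"]):
    print(render(research_hypothesis))
    for prediction in deduce(research_hypothesis):
        print("  " + render(prediction))
```

Keeping abduction and deduction as separate functions mirrors the two steps named in the text; the experimental test of each prediction is deliberately left out of the sketch.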

3 REPRESENTATION OF AUTOMATICALLY GENERATED HYPOTHESES
3.1 Specification of hypotheses on different levels of granularity
Robot Scientist automated investigations are generally complex, with multiple study domains involved and different levels of granularity. For example, the investigation into the automation of science, in which Adam discovered novel knowledge about yeast genes, involves four different domains and has 10 levels (see Figure 1) [Science]. The levels are defined by the corresponding hypotheses. On the highest level is the hypothesis that it is possible to fully automate scientific discovery. This hypothesis is further decomposed into more specialized hypotheses, e.g. that it is possible to automatically re-discover biological knowledge, that manual experiments would confirm the results automatically obtained by the robot, etc. At the lowest level of the investigation are hypotheses about yeast growth, which are linked to the experimental data, i.e. optical density (OD) readings. Complicated logical inference is required in order to conclude, on the basis of a large number of OD readings, that scientific discovery can be fully automated. The representation of hypotheses plays an essential role in the logical representation of such investigations.

Figure 1: One example traversal of the levels of the representation of the investigation executed by the Robot Scientist Adam (a fragment). The relations are has part.

Let us consider the decomposition of hypotheses into more specialized ones in more detail. Using its background knowledge and bioinformatics facts, Adam generates hypotheses about yeast genes and enzymes, e.g. that gene YER152C encodes an enzyme with the enzyme class E.C.2.6.1.39. The research hypotheses are encoded in the logic programming language Prolog:

encodes(yer152c,ec.2.6.1.39).

Adam also explicitly records the corresponding negative hypothesis:

not_encodes(yer152c,ec.2.6.1.39).

These hypotheses are instances of the LABORS classes research hypothesis (H0) and negative hypothesis (H1), and they are stored in a relational database [Soldatova et al., 2006]. Adam used abduction to form these hypotheses, and therefore the hypotheses are not necessarily true facts. The LABORS class method of hypothesis formation allows this information to be recorded.
It is generally necessary to execute a real physical experiment to test whether a hypothesis is correct. However, such entities as the gene YER152C and an enzyme with the enzyme class E.C.2.6.1.39 exist only in Adam's memory and not in Adam's physical world. In the real physical world Adam can operate only with yeast strains and metabolites. The hypothesis that gene YER152C encodes an enzyme with the enzyme class E.C.2.6.1.39 has to be specialized to such a level that the robot can physically test it. Using its background knowledge, Adam infers that if the research hypothesis is correct, then the addition of the following metabolites .. with the KEGG numbers C00047, C00449 and C00956, respectively, to a yeast strain with the gene YER152C removed would restore the yeast growth (see Figure 2):


affects_growth(c00047,delta_YER152C).

affects_growth(c00449,delta_YER152C).

affects_growth(c00956,delta_YER152C).

Figure 2: Examples of hypotheses generated by Adam.

If the metabolites are available, then using the YER152C deletion strain from its yeast strain library, Adam can physically test the hypotheses above. Adam designs plate layouts with controls and replicates in order to collect enough data for an accurate statistical analysis of the results, and runs the experiments. Adam used decision trees and random forests, with resampling methods, to decide whether the difference in growth between the deletion strain growing on a defined medium and on a defined medium + metabolite could be discriminated from the difference in growth between the wild type growing on a defined medium and on a defined medium + metabolite. [...] computer science terminology, while the deletion strains are mutant versions in which genes hypothesized to encode an orphan enzyme have been removed. Adam used standard 96-well plates to grow the yeast, which enabled 24 repeats of each strain and medium combination. To control for intraplate environmental effects, we used Latin squares.
The results are represented with the use of the same terms as the hypotheses:

affects_growth(c00047,delta_YER152C).

not_affects_growth(c00449,delta_YER152C).

The conclusion as to whether a hypothesis has been confirmed or rejected is made by following the corresponding decision procedures on the basis of the results [all procedures are available at our website]. A more generic conclusion is made on the basis of more specific conclusions, which correspond to more specific hypotheses. For example, a conclusion that a generic hypothesis is confirmed may be made if two out of three more specific hypotheses have been confirmed.
If the hypotheses and conclusions of an investigation are recorded in such a way, then it is possible to check exactly how each conclusion has been made, on what basis and following what assumptions. We argue that all scientific investigations have to be recorded and reported in a similar way to enable a complete check of the consistency and validity of the results.

3.2 Sets of hypotheses for cyclic investigations
Robots can potentially generate thousands of hypotheses and test them in parallel. However, even for robots it is not practical to exhaustively test all possible hypotheses, because a hypothesis space can be very large. Adam selects hypotheses and designs experiments to test them following a combination of selection criteria:


- a hypothesis which gives the best split of a hypothesis space;
- a hypothesis which gives the most information about a domain of interest;
- a hypothesis with the maximum prior probability of being correct;
- a hypothesis which requires the minimum of resources to test.

Such selected hypotheses are not completely independent, and LABORS represents them as a set of hypotheses (the class hypotheses-set is a subclass of the class collection) where each particular hypothesis is a member of the set.
A set of hypotheses is tested in cycles. Each cycle of an investigation has a specified input hypotheses-set. Adam designs and runs experiments to test each hypothesis from the set. Then Adam analyses the results of the experiments and draws conclusions about whether a particular hypothesis has been confirmed or rejected. The rejected hypotheses are eliminated from the input hypotheses-set, and the remaining set of hypotheses is considered as the specified output of the current cycle of the study and as the specified input hypotheses-set for the next cycle of the study. Adam continues to run cycles of studies until the hypotheses-set is empty or the robot runs out of the specified resources [King Nature].
Current ontological representations are not sufficient for the recording of such hypotheses and investigations. OBI defines investigations and study design executions in such a way that they cannot have inputs. For example, hypotheses formed in an obi:hypothesis generating investigation (an investigation in which data is generated and analyzed with the purpose of generating new hypothesis) cannot be passed to an obi:hypothesis driven investigation (an investigation with the goal to test one or more hypothesis) (see also the classes expo:hypothesis forming investigation and expo:hypothesis generating investigation [Soldatova & King, 2006]). To overcome these difficulties, LABORS and DDI define, in addition to the class obi:investigation, various structural research units, e.g. study, study cycle, trial, replicate, mainly according to the hypotheses tested. For example, replicates test identical hypotheses and have identical study designs, and cycles of study test hypotheses-sets in cycles (for more detail and definitions of the research units see [DDI]).
Thus the analysis of the research hypotheses which were produced within Adam's investigations enabled us to improve the logical representation of the structural units of general scientific investigations.

4 DISCUSSION AND FUTURE WORK
Here we summarize the lessons learned from the representation of hypotheses automatically generated by robots and how these might be useful for improving the formulation and recording of research hypotheses by humans.
Explicitness. Research hypotheses should be recorded precisely and published explicitly. It is still common in the reporting of science for research hypotheses stated in the introduction to be implicitly replaced by other hypotheses in the conclusions [Interface]. Even more commonly, no research hypothesis is stated at all [Interface]. It is also important to record hypotheses which are formed during investigations, so that other researchers can easily find them (e.g. using text mining) and test them. This would speed up scientific progress.
Constructivism. Researchers should aim to formulate hypotheses in a constructive way, so that it is clear how to design experiments to test them.
Systematic approach. The automated approach to hypothesis generation has the advantage of being systematic. All possible hypotheses for a study domain are considered, and the best are selected for testing. The concept of the best hypothesis is explicitly defined, e.g. as the most probable, the cheapest or the most informative one. Such a systematic approach could be adopted by humans for the assessment of research hypotheses.
Statistical significance. It is relatively easy to repeat automated studies; for example, Adam executes 24 replicates of each study. This allows the detection of statistically significant differences in yeast growth which are often missed by human investigators [King science]. This demonstrates the importance of collecting experimental data over a large enough number of repeated experiments, of analysing the statistical significance of the results, and of excluding results obtained e.g. due to biological variability. We argue that the number of repeats is often not sufficient in biological experiments.
Learning from negative results. The hypotheses which have been rejected provide no less information about the domain of study than the hypotheses which have been confirmed. Therefore it is important to record and store the rejected hypotheses. Unfortunately, it is not normal scientific practice to report negative results.
In future, we plan to provide a more sophisticated classification of research hypotheses, e.g. a hypothesis about a new entity (an object, a process, an information content entity), about a new property of an already existing entity, or about a new relation between already existing entities. We would like to continue the work on the classification of alternative hypotheses. It would also be interesting to consider dynamic aspects of the hypotheses selection for the


area of drug discovery, e.g. to include in the selection criteria how much time it would take to test a hypothesis.
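
One way a time-aware selection criterion could look is sketched below. This is an illustration under stated assumptions, not part of LABORS or of Adam: the Candidate fields, the scoring rule (prior probability divided by resource cost plus weighted test time) and the time_weight parameter are all hypothetical, and only two of the selection criteria from Section 3.2 (prior probability and resources) are combined with time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    statement: str   # the hypothesis, e.g. written as a logical fact
    prior: float     # prior probability of being correct (0..1)
    cost: float      # resources needed to test it, arbitrary units (> 0)
    hours: float     # time needed to test it, in hours (>= 0)


def score(c, time_weight=1.0):
    """Hypothetical score: higher prior is better, higher cost and longer tests are worse."""
    return c.prior / (c.cost + time_weight * c.hours)


def select(candidates, k, time_weight=1.0):
    """Return the k best-scoring candidates, best first."""
    ranked = sorted(candidates, key=lambda c: score(c, time_weight), reverse=True)
    return ranked[:k]


# Placeholder candidates with made-up numbers.
slow = Candidate("hypothesis_a", prior=0.6, cost=1.0, hours=10.0)
fast = Candidate("hypothesis_b", prior=0.5, cost=2.0, hours=1.0)

# Ignoring time favours the more probable, cheaper hypothesis;
# taking time into account favours the one that is quicker to test.
print(select([slow, fast], k=1, time_weight=0.0)[0].statement)
print(select([slow, fast], k=1, time_weight=1.0)[0].statement)
```

Setting time_weight to zero reduces the score to probability per unit of resources, so the effect of adding time to the criteria can be inspected by varying a single parameter.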
Currently we are modifying LABORS in order to
provide a better mapping with OBI and to suit human users,
as (semi-)automated investigations are becoming
increasingly important for the area of biomedicine
[Alexandro paper].

ACKNOWLEDGEMENTS
We thank RC UK and BBSRC ... for funding the work
reported in this paper. We thank all the members of the
Computational Biology group at Aberystwyth University,
UK. We thank members of the OBI Consortium for the
discussions.

REFERENCES
King. Science
[Soldatova et al., 2006] EXPO-RS
[DDI]
[Interface]
[The OBI Consortium, 2010]
SWAN
Dormand,J.R. and Prince,P.J. (1980) A family of embedded
Runge-Kutta formulae. J. Comp. Appl. Math., 6, 19-26.
Alexandrescu,A. (2001) Modern C++ Design: Generic
Programming and Design Patterns Applied. Addison-Wesley
Professional, Boston.