Center for Proteomics and Metabolomics, Leiden University Medical Center, The Netherlands
Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, 53754 Sankt Augustin, Germany
Highlights
Proposal of a novel optimization phase to the common scientific workflow life cycle.
Development of an automated optimization framework implementing the optimization phase.
Plugin mechanism to support the development of arbitrary optimization methods.
Implementation of a Genetic Algorithm-based parameter optimization plugin.
Optimization of two use cases to demonstrate ease of use and computational efficiency.
Article info
Article history:
Received 1 February 2013
Received in revised form
21 June 2013
Accepted 5 September 2013
Available online 23 September 2013
Keywords:
e-science
Workflow optimization
Taverna
Genetic algorithm
Workflow life cycle
Abstract
Scientific workflows have emerged as an important tool for combining computational power with data analysis in all scientific domains of e-science, especially in the life sciences. They help scientists to design and execute complex in silico experiments. However, with rising complexity it becomes increasingly impractical to optimize scientific workflows by trial and error. To address this issue, we propose to insert a new optimization phase into the common scientific workflow life cycle. This paper describes the design and implementation of an automated optimization framework for scientific workflows that implements this phase. Our framework was integrated into Taverna, a life-science oriented workflow management system, and offers a versatile application programming interface (API), which enables easy integration of arbitrary optimization methods. We have used this API to develop an example plugin for parameter optimization that is based on a Genetic Algorithm. Two use cases taken from the areas of structural bioinformatics and proteomics demonstrate how our framework facilitates the setup, execution, and monitoring of workflow parameter optimization in high-performance computing e-science environments.
© 2013 Elsevier B.V. All rights reserved.
1. Introduction
Computational science has joined theory and experiment
as a third methodology to perform science. One of the main
applications of in silico experiments besides computer simulation
is data analysis: computational processes that analyze data to gain research results. Depending on the data volume, the data types, and the task in question, different combinations of computer-based methods have to be used in concert to obtain the desired
analysis result. To address the challenges in set-up, execution
and management of these complex in silico experiments, scientific
workflows have become an increasingly popular choice as they
Fig. 1. (a) The well-described scientific workflow life cycle. (b) The scientific workflow life cycle extended by the new optimization phase.
of the workflow and the results. The phase involves the examination of the obtained results as well as their comparison to those of other in silico and in vitro experiments. Commonly, scientists restart the workflow life cycle after this analysis step if the results do not match their goals or expectations. During subsequent cycles, the workflow is changed and tested again. Changes may involve the workflow parameters, the choice of components, or even the order in which these components are executed. This iterative refinement process is rather common in the daily life of a scientist and is called trial and error.
However, manual trial and error to improve scientific workflows is very time-consuming. Therefore, we have added a new phase to the scientific workflow life cycle to assist researchers in their search for the optimal setup of a specific scientific workflow. As shown in Fig. 1(b), the proposed optimization phase is placed between the planning and sharing phase and the execution phase. This arrangement ensures that the computationally expensive execution of the full workflow is not triggered in the initial cycle while intermediate results from partial workflow executions indicate that the current workflow would not lead to acceptable results. If the workflow involves large-scale, compute-intensive, or time-consuming components, our arrangement allows those components to be excluded (if they do not affect the optimization aim), approximated, or optimized with special attention. Most importantly, the new phase can be automated, i.e. it can perform an arbitrary number of optimization cycles without user intervention (cf. Sections 2.2 and 3). Manually triggered trial and error steps are thereby minimized or even eliminated.
result than others. To assist researchers in their search for optimal parameters, tools for automated parameter sweeps have been developed [13]. However, with increasing complexity of scientific workflows, i.e. with a rising number of parameters, trial and error as well as parameter sweeps become impractical. We therefore
propose the use of a Genetic Algorithm or Particle Swarm Optimization [14]. Compared to parameter sweep approaches these
metaheuristic algorithms typically need to explore a much smaller
part of the parameter search space to find good solutions. In the
case of a genetic algorithm, each parameter can be represented as
a gene on a chromosome. Hence, each chromosome can be used to
encode one set of input parameters to obtain one fitness value.
To reduce the search space even further, we take user-defined
constraints and dependencies between parameters into account.
These dependent parameters can be part of only one application
or belong to several applications. Dependencies can be described as mathematical or logical functions, or as a fixed set of allowed value combinations.
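To make this concrete, the following Java sketch illustrates one possible encoding of parameter sets and dependencies; it is a minimal sketch under our own naming assumptions, not the plugin's actual API.

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// One chromosome encodes one candidate parameter set: one gene per parameter.
class ParameterSet {
    final Map<String, Double> genes = new LinkedHashMap<>();
}

// A user-defined dependency, expressed as a mathematical or logical predicate.
interface Dependency extends Predicate<ParameterSet> {}

class ConstraintChecker {
    private final List<Dependency> dependencies;

    ConstraintChecker(List<Dependency> dependencies) {
        this.dependencies = dependencies;
    }

    // A candidate is only evaluated (i.e. a sub-workflow run is only
    // triggered) if every user-defined dependency holds, which prunes
    // the search space before any compute time is spent.
    boolean isFeasible(ParameterSet p) {
        return dependencies.stream().allMatch(d -> d.test(p));
    }
}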
Fig. 3. The processor dispatch stack with the extended optimization layer.
Fig. 4. The control flow of the new optimization framework and its API for workflow optimization as well as a GA-based parameter optimization plugin.
Fig. 5. The screenshot shows the mass spectrometry (MS)-based proteomics workflow in the new optimization perspective implemented by the Taverna optimization framework. (1) Workflow diagram, (2) general selection mechanism for the sub-workflow, (3) specific input window for parameter optimization sampling, (4) window for specifying parameter dependencies.
specific text fields, selection options, etc. for the particular optimization method. After providing the required inputs, the integrated run button activates the optimization framework and starts the fully automated optimization process.
3.3. Grid-based optimization in Taverna
As complex scientific workflows within the life science domain may incur high computational and data transfer loads,
workflow management systems benefit from the ability to use
high-performance computing (HPC) and high-throughput computing (HTC) resources, as provided, e.g. by e-science infrastructures
such as DEISA [17] or XSEDE [18]. Access to these computing infrastructures is mainly provided via Grid middleware systems, such
(cf. Fig. 4(7a, 7b)). Afterwards, a new generation is created and the
evolution continues until the optimal result is found or the limit of
evolutionary steps is reached.
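As an illustration, this evolutionary cycle can be written as a compact loop. The following is a minimal sketch reusing the ParameterSet type from the earlier sketch; initRandomPopulation, evaluateFitness, containsOptimum, tournamentSelection, and recombineAndMutate are assumed helper functions, not actual framework calls.

// Minimal GA driver loop (illustrative; helper functions assumed).
List<ParameterSet> population = initRandomPopulation(populationSize);
for (int gen = 0; gen < maxGenerations; gen++) {
    // One fitness value per chromosome; each evaluation corresponds to
    // one (potentially parallel) sub-workflow run.
    Map<ParameterSet, Double> fitness = evaluateFitness(population);
    if (containsOptimum(fitness)) break;   // optimal result found
    List<ParameterSet> parents = tournamentSelection(population, fitness);
    population = recombineAndMutate(parents, populationSize);
}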
5. Automated parameter optimization use cases
To experimentally test our optimization framework and the parameter optimization plugin, two real-world use cases from the life science domain were chosen. The first use case is from the area of structural bioinformatics and represents the common set-up for parameter optimization, namely a manually guided parameter sweep. The second workflow, performing a proteomics use case, had previously been tuned only by trial and error, based on user experience. Here, we used this experience to define specific user constraints for the search space.
5.1. Support vector regression
A recurring task in data-driven biosciences is the development
of predictors. Given a suitable set of training data, supervised machine learning algorithms such as Support Vector Machines are
trained, and the resulting model can then be used to classify new input data or to predict the value of a particular property. As
an example we have chosen the training of a local protein structure
predictor that estimates the structural similarity of a short protein
segment, for which only the sequence profile is known, to a reference structure. We define the local backbone structure of the
protein as a vector of 8 consecutive dihedral angles. By using a
training set that contains the sequence profiles as vector features and the structure similarity to the particular reference structure as the label, one can train a model that predicts the structure similarity of
any protein sequence to that respective reference, e.g. in terms of a
root mean square deviation. Support Vector Regression (SVR) [27],
a technique closely related to Support Vector Machines (SVM), can
be used to train such an estimator. A set of such estimators could
be used for secondary structure prediction, an important technique
in structural bioinformatics. In Fig. 6 we show the set-up of a Support Vector Regression training as a Taverna workflow. The server
components in this case are the scaling and training programs from
the libsvm library [28]. We perform SVR training using a so-called Radial Basis Function (RBF) kernel. This kernel function has one parameter, γ, that specifies the width of the Gaussian function of the RBF. The function for penalizing errors has a parameter ε that specifies the width of an insensitive zone, and there is a regularization parameter C that penalizes the complexity of the model. For these three real-valued parameters we set bounds and choose exponential generator functions for C and γ. As an example, we suggest setting the population size to 20 and optimizing for 5 generations. Thus, the automated search tests 100 different workflow
parameter sets. As the fitness function we use the squared correlation coefficient, which is one of the outputs of the libsvm program and has its optimum at 1. Performing this set-up as a manually guided parameter sweep with 420 trials found a maximal correlation of 0.527902 (C = 64, γ = 0.03125, ε = 0.75). In contrast, the automated GA-based optimization obtained a correlation of 0.588481 as its optimum (C = 61, γ = 0.03342, ε = 1.5758) using only 100 sub-workflow runs. The GA-based method was set up with constraints for C to [2^0, 2^8], for γ to [2^-12, 2^0], and for ε to [0.0, 10.0]. For each new generation the best 1/3 of the chromosomes of the population were chosen by tournament selection and recombined. The mutation probability was set to 1/5 and applied to all chromosomes. The population was filled back up to 20 with random chromosomes. In conclusion, the optimization process found a better solution using less than 25% of the trials required for the parameter sweep.
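To illustrate the exponential generator functions used for C and γ, a hypothetical sampling routine could draw the exponent uniformly and then exponentiate; this is a sketch under our own naming assumptions, not the plugin's actual code.

import java.util.Random;

// Exponential generators for C and gamma: sampling uniformly in the
// exponent covers small and large magnitudes evenly.
class SvrParameterSampler {
    private static final Random RNG = new Random();

    static double sampleC()       { return Math.pow(2, uniform(0, 8)); }    // C in [2^0, 2^8]
    static double sampleGamma()   { return Math.pow(2, uniform(-12, 0)); }  // gamma in [2^-12, 2^0]
    static double sampleEpsilon() { return uniform(0.0, 10.0); }            // epsilon sampled linearly

    private static double uniform(double lo, double hi) {
        return lo + (hi - lo) * RNG.nextDouble();
    }
}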
5.2. Mass spectrometry-based proteomics
In mass spectrometry (MS)-based proteomics [29], proteins are
typically identified via enzymatic digestion, producing peptides
Fig. 6. The screenshot shows the support vector regression workflow in Taverna.
by PeptideProphetTandem [34]. The PeptideProphetTandem component concludes with a value for the estimated number of correctly identified spectra (est_tot_num_correct). This value was used as the
fitness function in our optimization runs. The input parameters
typically used by the researchers are mass_error_plus = 0.5,
mass_error_minus = 0.5 and monoisotopic_mass_isotope_error =
yes, which produces est_tot_num_correct = 11375.1. In the
optimization setup for the GA, the constraints for mass_error_minus and mass_error_plus were set to [0, 0.5], and a dependency was defined by |mass_error_minus − mass_error_plus| < 0.2. This dependency was set because a larger discrepancy indicates that the instrument is poorly calibrated and the data should first be recalibrated before further analysis. For the automated GA
optimization, the population size was set to 20 and generation
size to 5. As optimum result we reached est_tot_num_correct =
13220.7 with mass_error_plus = 0.1709 and mass_error_minus =
0.1192. This result is significantly better than the value obtained by
the typical parameter set. The est_tot_num_correct is the number of correct peptide-spectrum matches estimated by the mixture modeling in PeptideProphet. It is the area under the positive
(correct) distribution fitted to the X!Tandem discriminant score
distribution and is a measure of sensitivity.
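Expressed with the hypothetical Dependency interface sketched earlier, the calibration constraint of this use case would be a one-line predicate:

// Only candidates with |mass_error_minus - mass_error_plus| < 0.2 are
// submitted for a sub-workflow run; all other candidates are rejected
// before any compute time is spent.
Dependency calibrated = p -> Math.abs(
        p.genes.get("mass_error_minus") - p.genes.get("mass_error_plus")) < 0.2;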
6. Discussion
We have sketched the possibilities of our generic optimization framework for Taverna by implementing a GA-based parameter optimization plugin and experimentally testing this set-up with two real-world use cases from the areas of structural bioinformatics and proteomics. For both use cases the fully automated optimization framework, using the example GA plugin, found a better optimum than a user-guided parameter sweep approach. Additionally, our approach demonstrated computational efficiency, using a comparatively small number of samples. Although many design frameworks [35] have been developed, they are not accessible to scientists working with scientific workflows and workflow management systems. Instead, parameter sweeps and manual trial and error are still the methods of choice for life science researchers today [36,37], and these were therefore used as the comparison here. To our knowledge there is no other comprehensive approach to general workflow optimization freely available today, except Nimrod/OK [7], a more specialized workflow parameter optimization tool from the engineering domain that is apparently not available for download. Nimrod/OK [7] is an extension of the Kepler WMS and supports model optimization as well as inverse problems. Optimization works by including special optimization actors in a workflow. These optimization actors implement various optimization methods, such as simplex, genetic algorithms, and simulated annealing.
Optimization with Nimrod/OK, however, is a complex process that requires enveloping the workflow to be optimized in a second, optimization-specific workflow, and manually editing a configuration file to set up the optimization parameters. In contrast, our approach integrates the optimization at a lower, more general level and thereby allows the optimization parameters to be set up directly from within the Taverna GUI, as well as seamless switching between optimization and execution mode. While the optimization extension for the WINGS/Pegasus WMS developed by Kumar et al. [6] is similar to our optimization phase technique, it currently focuses solely on optimizing runtime performance and therefore concentrates on the modification of parameters that are likely to influence the runtime. Another approach within the engineering domain for parameter optimization, using the method of moving asymptotes, is presented by Crick et al. [8]. Although Crick et al. also use the Taverna WMS to design a parameter optimization workflow, theirs is a static rather than a general approach. The
publication describes a manually created workflow, which comprises an optimization method for parameter optimization. Each cycle of optimization has to be designed manually in the outer optimization workflow. None of these approaches provides a general mechanism for scientific workflow optimization in scientific workflow management systems.
Compared to other established design frameworks [35], which are typically used in the engineering domain, our framework aims at providing parameter and other optimization methods to life scientists, who typically have experience in neither computer science nor optimization. This is achieved by providing direct access to optimization services from within Taverna, a commonly used WMS in the life sciences. Except for adding a fitness function, which could for many common cases be readily provided as Taverna components, a workflow requires no changes in order to be optimized, and it can therefore be monitored, edited, and shared like any other Taverna workflow. Such features have only recently been included in engineering optimization frameworks [38]. Additionally, our optimization runs in parallel and is therefore comparatively fast. The sum of these features should make scientific workflow optimization easy and fast enough to be adopted by life scientists.
At present our framework cannot be compared to design frameworks in the engineering domain regarding the breadth of methods and more advanced features like multi-objective optimization [39], hybrid optimization [40], etc. In principle, one could manipulate any design optimization framework and substitute the original model description with a Taverna command-line execution of the workflow. However, such an approach would require tedious manual configuration and is therefore unlikely to be used by life scientists. On the other hand, some requirements for future multidisciplinary design frameworks identified in [41], such as parallel processing, user guides, project templates, and GUI integration, are already included in our current framework. Additionally, Taverna's multipurpose approach meets the requirement to formulate fitness functions in arbitrary programming languages.
In order to enable fast porting of most optimization methods available in engineering design frameworks to our workflow optimization framework, we have designed an API that abstracts all Taverna internals such as security mechanisms, parallel execution and data handling on the Grid, as well as GUI integration. This provides an easy way for developers to add further optimization plugins that implement these more advanced workflow optimization techniques, including multi-objective optimization. Due to the complex fitness landscapes of scientific workflows, other metaheuristics such as particle swarm optimization (PSO) [42] or ant colony optimization (ACO) [43] for continuous problems would be useful additions.
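While the actual API is not reproduced here, a plugin contract along the following lines (all names hypothetical, including OptimizationConfig) conveys the idea: the framework drives the evaluation loop and hides the Taverna and Grid internals, while the plugin only proposes candidates and consumes fitness values.

import java.util.List;
import java.util.Map;

// Hypothetical container for the user's set-up from the GUI
// (bounds, dependencies, method-specific settings).
class OptimizationConfig { }

// Hypothetical shape of an optimization plugin; the framework handles
// security, parallel sub-workflow execution, and Grid data transfer.
interface OptimizationPlugin {
    void configure(OptimizationConfig config);          // user set-up from the GUI
    List<ParameterSet> nextCandidates();                // next batch to evaluate
    void reportFitness(Map<ParameterSet, Double> fit);  // results of parallel runs
    boolean isFinished();                               // e.g. generation limit hit
}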
Two important issues in optimization are the selection of an appropriate optimization target function and good parameter choices for the optimization method itself. The most common optimization approaches in the life sciences use comparison to a gold standard and hence optimization of a single objective measure such as RMSD or correlation. In other cases, finding a suitable fitness function may remain a challenge. In contrast to the engineering domain, multi-objective optimization has not yet been widely used for parameter optimization in the context of the life sciences [44], although it would be helpful to assess trade-offs, e.g. between sensitivity and specificity or between result quality and runtime. Using, e.g., a niched-Pareto GA [39, p. 218 ff.], an extension of our framework to multi-objective optimization is feasible. However, the execution of a workflow is often computationally costly, sometimes taking hours, and the calculation of a Pareto-optimal set is itself more costly and compute-intensive than the optimization of a single-objective problem [39, p. 46]. Hybrid optimization approaches [40] or other approximation approaches [39] used in concert with Pareto optimization would be a possible approach to
overcome compute efficiency issues. In other cases it may be sufficient to roughly approximate part of the Pareto front by choosing a single objective measure with an explicit trade-off parameter, such as Fβ [45], and performing separate optimizations for a few values of the trade-off parameter (β). For example, the Matthews correlation coefficient in the first use case could be substituted by an Fβ component.
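For reference, the standard Fβ measure from [45], which makes this trade-off explicit, can be written as

F_\beta = \frac{(1+\beta^2)\,P\,R}{\beta^2 P + R}

where P denotes precision and R recall; β < 1 weights precision more heavily, while β > 1 emphasizes recall.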
A crucial issue, especially for non-experts in optimization, is the parameterization of the optimization algorithms themselves. Currently our optimization set-up requires choices for the cross-over rate, mutation rate, number of generations, and population size of the GA, which may still need to be adapted to the search space. Initializing these values could in the future be supported by automated tools [46] or based on provenance, as outlined below.
As briefly mentioned in Section 2.2, parameters are not the only entities of a workflow that could be changed during optimization. Future plugins for our framework could target higher levels of the workflow, such as component optimization, where the choice of the components used for particular steps of the workflow is optimized. In certain cases even the order of components, i.e. the topology of the workflow, can be changed without changing the purpose of the workflow. An example of component optimization is a local sequence alignment component, for which one could choose between the heuristic BLAST method [47], the dynamic programming approach of Smith-Waterman [48], or a profile alignment method like PSI-BLAST [49]. As each of these alignment methods has its own set of parameters, component optimization should be used together with parameter optimization. In principle, both optimization types can be jointly encoded in a single Genetic Algorithm, e.g. by forming sub-populations or by defining separate chromosomes, or parts of a chromosome, for each set of component-specific parameters, of which only one is active at a time (a sketch is given below). Topology optimization, i.e. switching the order of components without changing the overall semantics, is only possible in certain cases where the respective methods do not produce any new data types but only change the number of items or their representation. Typical methods of this type are clustering, filtering, and scaling. The performance of the combination of a clustering method and a filter, for example, might differ depending on whether the filtering acts on the input items before the clustering step or on the cluster representatives. The result, however, would in both cases be a subset of the input items.
The implementation of a higher-level optimization of workflows was not addressed in this publication, as a major requirement for such transformations is the availability of additional logic, such as an ontology that enables the optimization method to determine which algorithms or sub-workflows perform equivalent
functions. A similar ontology and a set of methods for data transformations would be needed for input and output types. Such
ontologies are not yet available for the methods and data types
typically used in the life sciences.
For large numbers of components with alternative methods, and in general for the optimization of workflows with many parameters, more sophisticated techniques may be required to ensure sufficient sampling of the search space within a reasonable amount of compute time. Apart from user constraints and approximation methods, any information obtained from optimizations of similar workflows might help to guide the optimization process.
We therefore propose in the following a provenance-based optimization strategy. The basic idea is based on sharing knowledge between collaborators in e-science infrastructures [50]. While currently mainly workflow definitions are stored in community repositories, it would in principle be possible to store and retrieve additional information such as provenance data, performance measures of optimization runs, etc. If such an infrastructure for information-enriched workflows were available, other techniques could utilize this information for optimization. Certainly, optimization by learning from provenance in e-science is a
[33] R. Craig, R.C. Beavis, TANDEM: matching proteins with tandem mass spectra, Bioinformatics 20 (9) (2004) 1466–1467.
[34] A. Keller, A.I. Nesvizhskii, E. Kolker, R. Aebersold, Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search, Analytical Chemistry 74 (20) (2002) 5383–5392.
[35] S.S. Rao, Engineering Optimization: Theory and Practice, Wiley, 2009.
[36] S. Gesing, J. van Hemert, P. Kacsuk, O. Kohlbacher, Special Issue: portals for life sciences – providing intuitive access to bioinformatic tools, Concurrency and Computation: Practice and Experience 23 (3) (2011) 223–234.
[37] S. Smanchat, M. Indrawan, S. Ling, C. Enticott, D. Abramson, Scheduling parameter sweep workflow in the Grid based on resource competition, Future Generation Computer Systems 29 (5) (2013) 1164–1183.
[38] C. Heath, J. Gray, OpenMDAO: framework for flexible multidisciplinary
design, analysis and optimization methods, in: Proceedings of the 53rd AIAA
Structures, Structural Dynamics and Materials Conference, 2012.
[39] K. Deb, Multi-Objective Optimization Using Evolutionary Algorithms, John Wiley and Sons, 2001.
[40] T.A. El-Mihoub, A.A. Hopgood, L. Nolle, A. Battersby, Hybrid genetic algorithms: a review, Engineering Letters 13 (2) (2006) 124–137.
[41] J.A. Parejo, A. Ruiz-Cortés, S. Lozano, P. Fernandez, Metaheuristic optimization frameworks: a survey and benchmarking, Soft Computing 16 (3) (2012) 527–561.
[42] J. Kennedy, R. Eberhart, Particle swarm optimization, in: Proceedings of IEEE International Conference on Neural Networks, Vol. 4, 1995, pp. 1942–1948.
[43] M. Dorigo, G. Di Caro, Ant colony optimization: a new meta-heuristic, in: Proceedings of the 1999 Congress on Evolutionary Computation, CEC 99, Vol. 2, 1999, pp. 1470–1477.
[44] J. Sun, J.M. Garibaldi, C. Hodgman, Parameter estimation using metaheuristics in systems biology: a comprehensive review, IEEE/ACM Transactions on Computational Biology and Bioinformatics 9 (1) (2012) 185–202.
[45] C.J. van Rijsbergen, Information Retrieval, 1979.
[46] G. Ochoa, M. Preuss, T. Bartz-Beielstein, M. Schoenauer, Editorial for the special issue on automated design and assessment of heuristic search methods, Evolutionary Computation 20 (2) (2012) 161–163.
[47] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, D.J. Lipman, Basic local alignment search tool, Journal of Molecular Biology 215 (1990) 403–410.
[48] O. Gotoh, An improved algorithm for matching biological sequences, Journal of Molecular Biology 162 (1982) 705–708.
[49] S.F. Altschul, T.L. Madden, A.A. Schäffer, J. Zhang, Z. Zhang, W. Miller, D.J. Lipman, Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Research 25 (1997) 3389–3402.
[50] D. De Roure, C. Goble, R. Stevens, The design and realisation of the myExperiment Virtual Research Environment for social sharing of workflows, Future Generation Computer Systems 25 (5) (2009) 561–567.
Sonja Holl received her Bachelor and Master of Science degrees in computer science from the Heinrich Heine University, Düsseldorf, Germany, in 2008 and 2009, respectively. She is currently a Ph.D. candidate at the Jülich Supercomputing Centre at Forschungszentrum Jülich GmbH. Her research interests are in the areas of distributed computing and scientific workflows, with an emphasis on the optimization of bioinformatics workflows.