
Future Generation Computer Systems 36 (2014) 352–362


A new optimization phase for scientific workflow management systems

Sonja Holl a,∗, Olav Zimmermann a, Magnus Palmblad b, Yassene Mohammed b, Martin Hofmann-Apitius c
a Jülich Supercomputing Centre (JSC), Forschungszentrum Jülich, 52425 Jülich, Germany
b Center for Proteomics and Metabolomics, Leiden University Medical Center, The Netherlands
c Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, 53754 Sankt Augustin, Germany

highlights

Proposal of a novel optimization phase to the common scientific workflow life cycle.
Development of an automated optimization framework implementing the optimization phase.
Plugin mechanism to support the development of arbitrary optimization methods.
Implementation of a Genetic Algorithm-based parameter optimization plugin.
Optimization of two use cases to demonstrate ease of use and computational efficiency.

article info

Article history:
Received 1 February 2013
Received in revised form
21 June 2013
Accepted 5 September 2013
Available online 23 September 2013
Keywords:
e-science
Workflow optimization
Taverna
Genetic algorithm
Workflow life cycle

abstract
Scientific workflows have emerged as an important tool for combining computational power with data
analysis for all scientific domains in e-science, especially in the life sciences. They help scientists to design
and execute complex in silico experiments. However, with rising complexity it becomes increasingly
impractical to optimize scientific workflows by trial and error. To address this issue, we propose to insert
a new optimization phase into the common scientific workflow life cycle. This paper describes the design
and implementation of an automated optimization framework for scientific workflows to implement
this phase. Our framework was integrated into Taverna, a life-science oriented workflow management
system, and offers a versatile programming interface (API), which enables easy integration of arbitrary
optimization methods. We have used this API to develop an example plugin for parameter optimization
that is based on a Genetic Algorithm. Two use cases taken from the areas of structural bioinformatics and
proteomics demonstrate how our framework facilitates setup, execution, and monitoring of workflow
parameter optimization in high performance computing e-science environments.
© 2013 Elsevier B.V. All rights reserved.

1. Introduction
Computational science has joined theory and experiment
as a third methodology to perform science. One of the main
applications of in silico experiments besides computer simulation
is data analysis: computational processes that analyze data to gain
research results. Depending on the data volume, the data types,
and the task in question, different combinations of computer-based methods have to be used in concert to obtain the desired
analysis result. To address the challenges in set-up, execution
and management of these complex in silico experiments, scientific
workflows have become an increasingly popular choice as they

∗ Corresponding author. Tel.: +49 2461 612760.
E-mail address: s.holl@fz-juelich.de (S. Holl).

0167-739X/$ – see front matter © 2013 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.future.2013.09.005

allow for easy comprehension, editing, and dissemination of analysis recipes in the life science domain. In particular, they have
been successfully utilized in e-science environments [1], which
provide researchers access to large scale computational and data
resources and enable seamless and secure collaborations.
Originating from business workflows [2], important aspects of
scientific workflows are workflow enactment, modeling and execution as well as sharing. As they precisely define the execution
and data flow, scientific workflows enable the management of even
heterogeneous and complex scientific computing experiments. A
successful approach that has emerged organizes the development
of such in silico experiments in a scientific workflow life cycle [3].
This concept describes the cyclic process by which a workflow
passes through several phases including design, planning, sharing,
execution, analysis, and learning, until the goal of the in silico experiment has been achieved. However, the increasing complexity
of scientific workflows, in particular in the life sciences, poses new


challenges such as keeping track of applications, interrelations, and
dependencies. For complex workflows choosing parameters ad hoc
or using the default parameters of the applications will often be
suboptimal and may lead to poor results or even failure of the entire in silico experiment.
A scientific workflow typically consists of a set of components,
each representing a scientific application, which are logically connected to define a specific data flow. The parameters of these
components can have complex interrelationships and some components will have a stronger impact on the final result than others.
To allow for iterative improvement of these factors, the established
scientific workflow life cycle provides a learning phase after the execution [3–5]. In this phase, researchers analyze the results and refine the workflow, e.g. by changing parameters before continuing
the life cycle to obtain potentially better results. While for smaller
workflows it may be feasible to find near optimal workflow
settings using trial and error, such manual optimization becomes
impractical for complex workflows, due to the large number of
possible choices for parameters, components and workflow topology. Furthermore, we argue that this approach is not very resource-efficient as it requires that the whole workflow is executed while
typically only parts of a workflow require refinement. Instead, we
propose to insert a new phase into the scientific workflow life cycle which performs the scientific workflow optimization more efficiently and in an automated manner.
The typical scientific workflow life cycle is supported by various scientific workflow management systems [3]. They ease the
design, creation, automated execution and result analysis of scientific workflows. Some optimization approaches have already been
implemented for these workflow management systems in the life
science domain. For example, Kumar et al. [6] designed an integrated framework for parameter optimizations, but with particular
focus on runtime performance optimizations. Abramson et al. [7]
developed a tool that aims at parameter optimization for models
and inverse problems. In another approach [8], the authors defined a fixed optimization workflow to optimize the application
parameters. In contrast, our work describes the extension of a
workflow management system (WMS) by a general optimization
framework that can flexibly deal with many types of automated
scientific workflow optimizations. By developing a general framework, we want to ease the implementation and integration of optimization methods into WMS for developers. As optimizations may
become very compute intensive, we focus in our work on parallel
and distributed approaches for workflow optimization.
A shorter version of our approach has been presented at the IEEE
International Conference on E-Science [9]. In the present contribution we show the general aspects of our optimization framework
for the life science domain, outline potential optimization levels,
present an example implementation of a plugin for optimization
at the parameter level, and add a real world use case representing a typical analysis task for a proteomics lab.
In Section 2, we give a detailed description of the common scientific workflow life cycle, details of our new automated optimization phase and its generic approach, including potential scientific
workflow optimization levels. Section 3 describes our main contribution, an optimization framework embedded into an established
scientific workflow management system. Additionally, this section
provides details of our parallel and distributed approach. Section 4
describes the implemented example plugin for optimization at the
parameter level, explaining both the developer perspective and the
user perspective of our framework. Finally, two life science use
cases taken from structural bioinformatics and proteomics are presented in Section 5. Section 6 compares our work to related approaches and Section 7 concludes the paper.


Fig. 1. (a) The well described scientific workflow life cycle. (b) The scientific
workflow life cycle extended by the new optimization phase.

2. The scientific workflow life cycle


2.1. State-of-the-art
State-of-the-art development for an in silico experiment in the
life sciences is a cyclic process with several steps. In the literature the term workflow life cycle [2] has been coined for this development process and several successful applications to scientific
workflows have been described [3–5,10,11]. The definitions of the
life cycle may vary in their particular specifications but generally
lead to the same architecture shown in Fig. 1(a). The individual
phases are described in the following.
Design and refinement
The cycle usually starts with the design of a new or the refinement of an existing workflow taken from a repository. During this
phase the components are selected, representing the individual
steps of an experiment. At the same time, the composition of these
components is established. This includes the precise definition of
the dependencies of data and components. Although in some definitions the creation of an executable workflow from an abstract template belongs to the design and refinement phase, we will describe it as part of the planning and sharing phase.
Planning and sharing
The description of the planning and sharing phase in the literature is very heterogeneous. Planning is turning the abstract
workflow created during the design phase into a concrete executable workflow. This is achieved by mapping abstract parts to concrete applications or algorithms. Parameters and data sources are defined, and execution resources are selected. A thorough
planning is particularly important for large scale and compute intensive workflows (applications) as in these cases mappings to
high performance computing (HPC), Grid or Cloud resources have
to be precisely defined. After the last cycle this phase is used to
share the designed workflow with the community in an e-science
infrastructure so that other researchers can access and then run or extend it.
Workflow execution
Workflow execution is typically managed by a workflow engine. This engine maps the executable to an appropriate execution
environment by retrieving information about available software,
computing resources and data resources. The workflow components are then executed in the predefined order, consuming the
defined data while being monitored by the engine. The results of
the workflow execution are sent back to the engine, and passed on
to the user.
Analysis and learning
To successfully evolve the scientific experiment, the scientific
workflow life cycle contains an analysis and learning step as the last phase. Some authors also include the publishing of the workflow and the results in this phase. The phase involves the examination of the obtained results as well as their comparison to those
of other in silico and in vitro experiments. Commonly, scientists
restart the workflow life cycle after this analysis step, if results do
not match the goals or expectations. During subsequent cycles, the
workflow is changed and tested again. Changes may involve the
workflow parameters, choice of components or even the order in
which these components are executed. This iterative refinement
process is rather common in the daily life of a scientist and is called trial and error.
However, manual trial and error to improve scientific workflows is very time consuming. Therefore, we have added a new
phase to the scientific workflow life cycle, to assist researchers
in their search for the optimal setup of a specific scientific workflow. As shown in Fig. 1(b), the proposed optimization phase is
placed between the planning and sharing phase and the execution phase. This arrangement of phases ensures that in the initial
cycle the computationally expensive execution of the full workflow
is not triggered as long as the intermediate results from partial workflow executions indicate that the execution of the current workflow would not lead to acceptable results. If the workflow involves
large scale, compute intensive or time consuming components, our
arrangement allows for those components to be either excluded
(if not targeting the optimization aim), approximated or optimized
with special attention. Most importantly, the new phase can be automated, i.e. it can perform an arbitrary number of optimization
cycles without user intervention (cf. Sections 2.2 and 3). Manually
triggered trial and error steps are thereby minimized or even eliminated.

2.2. A concept for the workflow optimization extension

The proposed scientific workflow optimization phase (cf. Fig. 1(b)) is generally not limited to any particular type of workflow
optimization or scientific domain. Optimization can target different levels of the workflow description. In this paper, we have focused on workflow parameter optimization as the most important
level for the life science domain and will describe this level in full
detail. Other identified higher-level optimizations are the choice of
components or workflow topology. We will also briefly describe the idea behind these optimizations, even though we only take parameter optimization into account for the later framework evaluation,
due to additional requirements. To employ any kind of automated
optimization users need to select or define an objective function
that provides the optimization engine with a metric for the quality
of a workflow result. Fitness functions are specific for the particular workflow or research goal but independent of the optimization
type and the implementation in question. As an example for single-objective optimization, the Matthews correlation coefficient [12] is a quality measure for two-class classification results. In the life
science domain, objective functions are often distance measures
between a result and a given reference, which serves as a gold
standard. In principle, multi-criteria optimization would be conceivable; however, for reasons we discuss in Section 6 we have only
taken single-objective optimization into account. In the following,
we specify three conceivable workflow optimization levels.
Parameter optimization
The most common optimization level for scientific workflows is
parameter optimization. Workflows may combine various data inputs and contain several different components, each with its own
set of parameters. Thus, the number of possible parameter combinations for any non-trivial workflow provides a considerable
challenge for parameter optimization. Additionally, the optimal
parameter combination of a workflow may depend on its input data, and some parameters will have more influence on the final result than others. To assist researchers in their search for optimal parameters, tools for automated parameter sweeps have been developed [13]. However, with increasing complexity of scientific workflows, i.e. with a rising number of parameters, trial and error as well as parameter sweeps become impractical. We therefore propose the use of a Genetic Algorithm or Particle Swarm Optimization [14]. Compared to parameter sweep approaches, these metaheuristic algorithms typically need to explore a much smaller part of the parameter search space to find good solutions. In the case of a genetic algorithm, each parameter can be represented as a gene on a chromosome. Hence, each chromosome can be used to encode one set of input parameters to obtain one fitness value.

To reduce the search space even further, we take user-defined constraints and dependencies between parameters into account. These dependent parameters can be part of only one application or belong to several applications. Dependencies can be described as mathematical or logical functions as well as a fixed set of allowed value combinations.

Higher levels of workflow optimization


One such higher level of optimization is component optimization, which attempts to find the best algorithm to fulfill a given
subtask. Sometimes, two or more algorithms implement
equivalent functions but with different characteristics. Hence, each
of them may be optimal for different input data. Under certain
circumstances the order of the data or enactment flow could be
switched. Examples are cases where the execution sequence of consecutive filters or clusterings can be swapped. However, both cases
require additional logic to identify equivalence such as an ontology.
This aspect will be further discussed in Section 6.

3. A framework for workflow optimization in scientific workflow management systems

Concerning the scientific workflow life cycle, many techniques have been developed to assist users during the design, sharing,
automated execution and result analysis of scientific workflows.
These methods have been integrated to form scientific workflow
management systems (WMS) [3]. All WMSs support the various
phases of the workflow life cycle although differing considerably
in their implementation. Some widely used WMSs such as Taverna, Kepler or Pegasus/WINGS and their differences have been reviewed in [15]. However, our proposed optimization phase is not
yet supported in any of those nor in any other WMS. Some WMSs
implement parts of a workflow optimization process, which are described in Section 6. To support the optimization phase in a WMS,
we have developed a new generic optimization framework. This
framework enables developers to easily integrate new optimization levels and methods into a WMS by taking over the general
aspects needed for workflow optimization such as execution and
monitoring. In the following, we describe how we integrated this
optimization framework into an existing scientific workflow management system to support the optimization phase.
3.1. Concept for an optimization phase framework
The execution phase is the most compute intensive part of
the scientific workflow life cycle, as the workflow is going to
be executed as a whole. As previously mentioned, it is usually
beneficial to perform at least some kind of optimization before the
entire workflow is executed. This is based on the fact that typically
only some parts of a workflow require optimization. If learning,
optimization and refinement take place after the execution phase,
as in the common scientific workflow life cycle (cf. Fig. 1(a)) and
described in [3,4], or [5], the entire workflow with all its entities is
executed again in each iteration.


Fig. 2. The architecture of the Taverna workflow management system, the proposed extended optimization framework as well as extended plugins.

Appropriate optimization strategies can reduce or even avoid the re-execution of workflow parts that do not require optimization. This is of particular importance if these parts are compute intensive or involve large-scale data analysis. Accordingly, the new
optimization framework is triggered by the WMS before the execution phase starts. From this point on, only the sub-workflow that the user has marked for optimization is processed by the
framework. A sub-workflow is a nested part of a larger workflow, containing the set of required data, components and dependencies for the optimization. Neither the optimization levels nor
optimization methods have been predefined in our optimization
framework. Instead, different optimization plugins can independently plug into a well-defined framework application programming interface (API), each implementing a particular optimization
method for a particular optimization level. Thereby, we offer developers a platform to easily extend a WMS with various optimization methods. The framework forwards the sub-workflow to each
optimization plugin, which then applies its specific optimization
technique and returns accordingly a set of modified sub-workflows
to the framework API. The framework in turn is then responsible
for their execution. After each round of execution the framework
calls the respective optimization plugin again, which evaluates the
results and decides whether the optimization is completed. Optimized sub-workflows are returned to the WMS to continue the
scientific workflow life cycle.
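To make this division of labor concrete, the following minimal sketch outlines what such a plugin contract could look like; all type and method names are hypothetical illustrations chosen for this description, not the actual framework or Taverna API.

```java
import java.util.List;

// Hypothetical placeholder types standing in for the framework's own classes.
interface SubWorkflow { /* components, data links and parameters of the user-selected part */ }
interface WorkflowResult { /* outputs of one sub-workflow execution */ }
interface WorkflowEngine {
    // Execute all candidate sub-workflows (in parallel) and collect their results.
    List<WorkflowResult> executeAll(List<SubWorkflow> candidates);
}

// Contract an optimization plugin could fulfill for one optimization level and method.
interface OptimizationPlugin {
    // Derive an initial set of modified sub-workflows, e.g. an initial population of parameter sets.
    List<SubWorkflow> initialCandidates(SubWorkflow original);
    // Evaluate the last round of results and produce the next candidate set.
    List<SubWorkflow> nextCandidates(List<WorkflowResult> results);
    // Decide whether a quality threshold or the maximum number of cycles has been reached.
    boolean isFinished(List<WorkflowResult> results);
    // The best sub-workflow configuration found so far, returned to the WMS.
    SubWorkflow best();
}

// Control loop the framework could run around any such plugin.
class OptimizationDriver {
    SubWorkflow optimize(OptimizationPlugin plugin, SubWorkflow selected, WorkflowEngine engine) {
        List<SubWorkflow> candidates = plugin.initialCandidates(selected);
        List<WorkflowResult> results = engine.executeAll(candidates);
        while (!plugin.isFinished(results)) {
            candidates = plugin.nextCandidates(results);
            results = engine.executeAll(candidates);
        }
        return plugin.best();
    }
}
```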
As the framework cannot be adequately described in a general
manner, we focus in the remainder of this article on our implementation as an extension of the Taverna workflow management
system [16]. In the following, we briefly outline the Taverna workflow management system that was used as a reference implementation. We describe how we extended Taverna with the proposed
optimization framework in more detail.
3.2. Integration of the optimization framework into Taverna
While there are several open source WMSs that have been used
in science, few of them have a client and a server version and come
equipped with a wide range of bioinformatics methods, such as
Taverna [16]. It is an open source, domain-independent workflow
management system written in Java. The Taverna Workbench provides a graphical interface for creating and monitoring workflows,
while the Taverna Workflow Engine is responsible for their execution. Workflows consist of a user defined arrangement of activities.
Each activity provides input and output ports for the underlying
application that can be connected, thereby defining the dataflow
between the applications and hence the workflow structure. Each
activity in Taverna constitutes an abstract representation of an
available Web Service, Java class, local script or other service. The
Taverna system can be easily extended by means of several service
provider interfaces (SPIs). We have extended the Taverna WMS to
enable easy design of various optimization methods for developers in the life science domain. An architecture for this concept is
shown in Fig. 2. The implemented framework manages all general aspects of the optimization, such as invocation, enactment and
forwarding. For individual optimization techniques plugins can be
integrated independently into the specified framework API. An
optimization plugin can, e.g. implement parameter optimization


Fig. 3. The processor dispatch stack with the extended optimization layer.

using a genetic algorithm as the underlying method. The plugins manage everything specific to the particular type of optimization.
To implement the framework we make use of the processor dispatch stack of Taverna, described in [16] and shown in Fig. 3. The
processor dispatch stack is a mechanism to manage the execution
of a specific activity. Before the activity is invoked, a predefined
stack of layers is called from top to bottom and after the execution
from bottom to top. Each layer implements a special functionality,
important for the activity invocation. To integrate the new optimization framework into Taverna, we have extended this dispatch
stack with a new optimize layer on top of the stack, see Fig. 3.
Additionally, we take advantage of the sub-workflow concept,
provided by Taverna. It allows for the definition and execution of
a sub-workflow. This means that we can manually select a sub-workflow, which is interpreted by Taverna as a single activity and
therefore processed by the dispatch stack. Only at the lowest layer
is the sub-workflow decomposed into the individual activities,
each of which again traverses the stack itself. We exploit this fact
by defining those parts of the workflow that shall be optimized as
a sub-workflow. The stack then provides access to all parameters,
data structures, data flows etc. of the sub-workflow during the
invocation of our optimize layer.
The new layer in turn extracts the workflow details, creates the sub-workflow and forwards it to the particular optimization plugin, which in our case is the parameter optimization plugin
(cf. Fig. 4(2)).
The proper optimization level and sub-workflow refinement
is then handled by the respective optimization plugin and not
by the framework. The optimization plugin returns a set of new
sub-workflow entities, each with a different set of parameters (cf.
Fig. 4(3)).
The optimize layer executes this sub-workflow set by utilizing
the topmost Taverna standard layer, namely parallelize. More
precisely, it provides a queue filled with sub-workflows for the
parallelize layer, which in turn pushes each sub-workflow down the
stack in a separate thread. This triggers the parallel execution of all
sub-workflows (cf. Fig. 4(4, 5)).
After the execution of the sub-workflows the optimize layer
receives a set of results from the parallelize layer. This result set
is again passed through to the specific optimization plugin for
analysis and evaluation (cf. Fig. 4(6 and 7)). If any result has passed
the predefined optimization threshold or the maximum number of
cycles has been reached, the optimization is finished (cf. Fig. 4(8)).
Otherwise, the plugin creates another set of sub-workflows for the
next round of optimization (cf. Fig. 4(7 and 3c)). If the optimal sub-workflow has been found, the resulting parameter set is returned
to the user.
Our optimization framework also implements a new perspective in the Taverna Workbench, shown in Fig. 5, to enable an easy and convenient set-up of an optimization run. In the new
integrated perspective we arranged the workflow diagram (cf.
Fig. 5(1)) and a selection pane (cf. Fig. 5(2)) for defining the sub-workflow to optimize. After the selection, all components which
are not subject to optimization appear grayed out in the workflow
diagram. Internally a new sub-workflow is created. The pane at
the lower left shows a specific interface implemented by the respective optimization plugin (cf. Fig. 5(3)). This pane provides


Fig. 4. The control flow of the new optimization framework and its API for workflow optimization as well as a GA-based parameter optimization plugin.

Fig. 5. The screenshot shows the mass spectrometry (MS)-based proteomics workflow in the new optimization perspective implemented by the Taverna optimization framework. (1) Workflow
diagram, (2) general selection mechanism for the sub-workflow, (3) specific input window for parameter optimization sampling, (4) shows the window for specifying
parameter dependencies.

specific text fields, selection options, etc. for the particular optimization method. After providing the required inputs the integrated run button activates the optimization framework and starts
the fully automated optimization process.
3.3. Grid-based optimization in Taverna
As complex scientific workflows within the life science domain may incur high computational and data transfer loads,
workflow management systems benefit from the ability to use
high-performance computing (HPC) and high-throughput computing (HTC) resources, as provided, e.g. by e-science infrastructures
such as DEISA [17] or XSEDE [18]. Access to these computing infrastructures is mainly provided via Grid middleware systems, such as UNICORE [19] (UNiform Interface to COmputing REsources), ARC [20] (Advanced Resource Connector), or Globus Toolkit [21].
Our previously developed Grid plugin [22] for the Taverna Workbench provides seamless access to Grid infrastructures via the UNICORE Grid middleware. We used this tool to move parts of the
automated optimization to a Grid infrastructure. This design exploits the fact that optimization methods typically work as a multi-step concept, whereby each step contains several samples, which
can be executed independently. Thus, we equipped the framework
with an invocation mechanism to enable the parallel execution of
the samples on the Grid. A detailed description of how the security challenges arising in Grid execution of Taverna workflows have been addressed, as well as the set-up and performance metrics, can be found in [23].
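The key property exploited here is that the samples of one optimization step are independent of each other. The following self-contained sketch illustrates this fan-out with plain Java threads; the actual framework submits the samples as UNICORE Grid jobs rather than local tasks, and runSubWorkflow is only a stand-in for executing one sub-workflow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSampleEvaluation {

    // Stand-in for executing one sub-workflow with a given parameter set
    // and reading the resulting fitness value.
    static double runSubWorkflow(double[] parameterSet) {
        return -Math.abs(parameterSet[0] - 1.0); // dummy fitness
    }

    public static void main(String[] args) throws Exception {
        // One optimization step: several parameter samples, each independent of the others.
        List<double[]> samples = List.of(new double[]{0.5}, new double[]{1.0}, new double[]{2.0});

        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<Double>> fitnessValues = new ArrayList<>();
        for (double[] sample : samples) {
            Callable<Double> task = () -> runSubWorkflow(sample);
            fitnessValues.add(pool.submit(task));
        }
        for (int i = 0; i < fitnessValues.size(); i++) {
            System.out.println("sample " + i + " fitness = " + fitnessValues.get(i).get());
        }
        pool.shutdown();
    }
}
```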


4. Optimization use case: workflow parameter optimization


In this paper we focus on the basic case of workflow optimization for the life science domain, namely parameter optimization, and have developed a parameter optimization plugin that extends the previously developed generic framework for testing purposes.
Life science workflows may combine various data inputs and contain several different applications. Typically each component has a
number of parameters and the quality of the final result critically
depends on good choices for them. Additionally, the input data and
parameters may have different influences on the final result. Thus,
the combinatorial number of possibilities for parameter values becomes a challenge for researchers. Due to the requirement profile
of scientific workflows within the life science domain we have
decided to base our sample plugin implementation of workflow
parameter optimization on Genetic Algorithms [24] due to their
simplicity, proven performance, versatility and recent success in
life sciences [25]. A significant advantage of using an algorithm that
has been in use for a long time is that a range of mature, feature-rich and well-tested frameworks for Genetic Algorithms is available
as open source. As multi-objective optimization is not yet widely
used within life science workflows, and parameter optimization
methods in general are a new research field in this domain, we
have focused within this contribution on single-objective methods.
We will discuss details of multi-objective approaches in Section 6.
However, in principle the proposed generic optimization framework is neither fixed to single- or multi-criteria optimization nor to
any optimization method such as genetic algorithms.
4.1. The genetic algorithm library
Genetic Algorithms (GA) are among the most popular optimization methods and have been successfully used for more than
35 years to solve complex scientific optimization and search problems. GA do not require any derivatives and can encode both numerical and non-numerical problems. They mimic the mechanism
of natural evolution by applying recombination, mutation and selection to a population of parameter vectors. The search process
combines exploration and hill climbing and allows the algorithm
to sample many local minima to find the global optimum without
getting trapped. The evolution of the population of potential solutions is guided by an objective function called fitness function that
determines the probability of a solution being passed on to the next generation. The GA terminates either after a fixed number of iterations or if some fitness-based threshold is reached, and the fittest
individual is selected as the final solution. In our approach this individual represents the best parameter set for the workflow. After
evaluation of several GA implementations we have chosen JGAP for
our project.
JGAP [26] (Java Genetic Algorithms Package) is an open-source
GA library written in Java. It offers a general mechanism for genetic
evolution that can be easily adapted to the particular optimization
problem. For our optimization use case, we have mainly refined
JGAP's parallel mechanism and the breeder mechanism. To allow
for support of constrained parameter optimization, we have made
use of a special genotype in JGAP, called MapGene. MapGenes
can be used to define a fixed set of parameters, from which the
breeder can select only one. Additionally, we provide a mechanism
to add mathematical functions to numerical genes for mapping
gene values to parameter values. One of the major changes to
JGAP concerns the central breeder method. Our refined breeder
acts as follows: the generated population of sub-workflows is first
executed and the calculated results translated into fitness values
to determine the fittest chromosome. Then a new population
is generated by applying an a priori chosen natural selection
mechanism, e.g. tournament or best chromosome selection on the

old population. To these parents, mutation and crossover operators are applied and they are added to the new population. To fill the population to the predefined population size, randomly created
chromosomes are added. Now, the evolution cycle is continued
with the new population.
The extended parallel mechanism is able to determine the fitness values of the chromosomes in parallel. In combination with
the plugin, proposed in Section 4.2, the JGAP library suspends and
waits until all sub-workflows have been executed by our optimization framework and the fitness values have been extracted.
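As a minimal, self-contained illustration of the encoding described above (and not the plugin code itself), the following sketch uses standard JGAP classes to represent three numeric workflow parameters as genes of a chromosome; the fitness function is a dummy placeholder where the real plugin would instead wait for the corresponding sub-workflow execution.

```java
import org.jgap.Chromosome;
import org.jgap.Configuration;
import org.jgap.FitnessFunction;
import org.jgap.Gene;
import org.jgap.Genotype;
import org.jgap.IChromosome;
import org.jgap.impl.DefaultConfiguration;
import org.jgap.impl.DoubleGene;

public class ParameterEvolutionSketch {

    // Dummy objective; in the real plugin this value would be derived from the
    // output of one executed sub-workflow. JGAP treats larger values as fitter.
    static class WorkflowFitness extends FitnessFunction {
        @Override
        protected double evaluate(IChromosome c) {
            double p0 = (Double) c.getGene(0).getAllele();
            double p1 = (Double) c.getGene(1).getAllele();
            double p2 = (Double) c.getGene(2).getAllele();
            return 1.0 / (1.0 + Math.abs(p0 - 2.0) + Math.abs(p1 + 5.0) + p2);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new DefaultConfiguration();
        conf.setFitnessFunction(new WorkflowFitness());

        // One gene per workflow parameter, bounded as the user specified.
        Gene[] genes = new Gene[] {
            new DoubleGene(conf, 0.0, 8.0),    // e.g. the exponent of one parameter
            new DoubleGene(conf, -12.0, 0.0),  // e.g. the exponent of another parameter
            new DoubleGene(conf, 0.0, 10.0)    // e.g. a directly encoded parameter
        };
        conf.setSampleChromosome(new Chromosome(conf, genes));
        conf.setPopulationSize(20);

        Genotype population = Genotype.randomInitialGenotype(conf);
        for (int generation = 0; generation < 5; generation++) {
            population.evolve();   // selection, crossover and mutation
        }
        System.out.println("best fitness: "
                + population.getFittestChromosome().getFitnessValue());
    }
}
```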
4.2. Parameter optimization setup
In order to demonstrate the automated parameter optimization
of workflows from within Taverna we have developed a GA-based
optimization plugin that utilizes the described optimization framework API. By utilizing the previously described generic framework
we do not have to deal with any Taverna specific internals such
as execution or data handling. The parameter optimization plugin integrates a new pane into the optimization perspective of the
Taverna Workbench (Fig. 5(3)). It enables scientists to modify the
default parameter sampling. Our optimization setup allows all accessible parameters of the sub-workflow to be included in the parameter optimization process. However, without any constraints a
lot of computing time may be wasted on testing unreasonable or
even invalid parameter combinations. To optionally benefit from
user knowledge about the underlying workflow semantics and the
resulting constraints on the parameter space we have equipped
our parameter optimization plugin with several features to define
constraints. Based on the user-selected sub-workflow, a table of
its selected components including their accessible parameters is
created. For each of these components, the user can select which parameters shall be subject to the optimization. Optionally, a type-dependent set of properties such as upper and lower bounds and a
mathematical function: y = f (x) (where y defines the parameter
value, x the gene value) can be specified for each parameter. For
discrete parameter types the user can give a list of valid values. It
is also possible to specify dependencies between parameters either
as a fixed list of allowed combinations, or as a mathematical or logical formula. The only non-optional part of the setup process is the
description of the desired optimum in terms of a fitness function.
In the simplest case, one output of the workflow can be chosen as fitness value to be maximized or minimized. Alternatively
several outputs can be combined in a script to calculate a single
fitness value. An example would be the Matthews correlation coefficient [12], which measures the quality of a two-class classification. For convenience a set of predefined measures, which are
commonly used within the life sciences, could be provided as
Taverna components. During the optimization process this fitness
function is evaluated for each set of parameter values. Finally, the
user can configure the termination condition for the evolutionary
process.
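As a hypothetical example of such a combining script, the following small Java method computes the Matthews correlation coefficient from the four counts of a two-class classification; the counts and their names are illustrative and would in practice be wired to the corresponding sub-workflow outputs.

```java
public class MccFitness {

    // Matthews correlation coefficient for a two-class classification,
    // ranging from -1 to 1 (1 = perfect prediction). By convention the
    // value is 0 whenever the denominator would be 0.
    public static double mcc(long tp, long tn, long fp, long fn) {
        double denominator = Math.sqrt(
                (double) (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn));
        return denominator == 0.0 ? 0.0 : (tp * tn - fp * fn) / denominator;
    }

    public static void main(String[] args) {
        // Toy counts for illustration only.
        System.out.println(mcc(90, 85, 15, 10)); // approx. 0.75
    }
}
```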
4.3. The parameter optimization plugin
During the optimization run our optimization framework
invokes the parameter optimization plugin and forwards the user-selected sub-workflow (cf. Fig. 4(2)). The plugin initializes a new
evolutionary process by using an extended version of the JGAP
library. Sampling the parameter space to form a new population
in JGAP is guided by the constraints, dependencies and other
details entered by the user. The resulting parameter set samples
are returned to the plugin. The optimization plugin initializes the
corresponding set of Taverna sub-workflows (cf. Fig. 4(3a, 3b, 3c)).
Our optimization framework then manages the parallel execution
of the instantiated sub-workflows and forwards the results back to
the parameter plugin after execution. The parameter optimization
plugin extracts the fitness values and returns them to the JGAP
library, which evaluates the relative parameter set performance
(cf. Fig. 4(7a, 7b)). Afterwards, a new generation is created and the
evolution continues until the optimal result is found or the limit of
evolutionary steps is reached.
5. Automated parameter optimization use cases
To experimentally test our optimization framework and the
parameter optimization plugin, two real world use cases from the
life science domain were taken. The first use case is from the
area of structural bioinformatics and shows the common set-up
for parameter optimization, namely a manually guided parameter
sweep. The second workflow, performing a proteomics use case,
had previously only been tuned by trial and error based on user experience. Here we took this experience to define specific user
constraints for the search space.
5.1. Support vector regression
A recurring task in data driven biosciences is the development
of predictors. Given a suitable set of training data, supervised machine learning algorithms such as Support Vector Machines are
trained, and the resulting model can then be used to classify the
new input data or to predict the value of a particular property. As
an example we have chosen the training of a local protein structure
predictor that estimates the structural similarity of a short protein
segment, for which only the sequence profile is known, to a reference structure. We define the local backbone structure of the
protein as a vector of 8 consecutive dihedral angles. By using a
training set that contains the sequence profiles as vector features
and the structure similarity to the particular reference structure as
label one can train a model that predicts the structure similarity of
any protein sequence to that respective reference, e.g. in terms of a
root mean square deviation. Support Vector Regression (SVR) [27],
a technique closely related to Support Vector Machines (SVM), can
be used to train such an estimator. A set of such estimators could
be used for secondary structure prediction, an important technique
in structural bioinformatics. In Fig. 6 we show the set-up of a Support Vector Regression training as a Taverna workflow. The server
components in this case are the scaling and training programs from
the libsvm library [28]. We perform SVR training using a so-called Radial Basis Function (RBF) kernel. This kernel function has one parameter γ that specifies the width of the Gaussian function of the RBF. The function for penalizing errors has a parameter ε, which specifies the width of an insensitive zone, and there is a regularization parameter C that penalizes the complexity of the model. For these three real-valued parameters we set some bounds and choose exponential generator functions for C and γ. As an example, we
set the population size to 20 and optimize for 5 generations. Thus, the automated search tests 100 different workflow
parameter sets. As fitness function we use the squared correlation
coefficient that is one of the outputs of the libsvm program and
has its optimum at 1. Performing this set-up in a manually guided parameter sweep using 420 trials found 0.527902 as maximal correlation (C = 64, γ = 0.03125, ε = 0.75). In contrast, the automated GA-based optimization obtained a correlation of 0.588481 as optimum (C = 61, γ = 0.03342, ε = 1.5758) using only 100 sub-workflow runs. The GA-based method was set up with constraints for C to [2⁰, 2⁸], for γ to [2⁻¹², 2⁰], and for ε to [0.0, 10.0].
For each new generation the best 13 chromosomes of the population were chosen by tournament selection and were recombined.
Mutation probability was set to 15 and applied to all chromosomes.
Population size was filled to 20 by random chromosomes. In conclusion, the optimization process found a better solution using less
than 25% of the trials required for the parameter sweep.
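The exact generator functions are not spelled out above, but a natural choice consistent with the reported numbers is to let a real-valued gene x act as the exponent, i.e. C = 2^x with x ∈ [0, 8] and γ = 2^x with x ∈ [−12, 0], while ε is taken directly from its gene value in [0.0, 10.0]. Under such an assumed mapping, the sweep optimum C = 64 = 2^6 and γ = 0.03125 = 2^−5 correspond to integer exponents, whereas the GA is free to sample intermediate exponents such as C = 61 ≈ 2^5.93.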
5.2. Mass spectrometry-based proteomics
In mass spectrometry (MS)-based proteomics [29], proteins are
typically identified via enzymatic digestion, producing peptides

Fig. 6. The screenshot shows the support vector regression workflow in Taverna.

that in turn are identified by tandem mass spectrometry (MS/MS).


The mass spectrometer is often coupled on-line to a chromatographic system separating peptides based on their physicochemical properties, such as charge, size or hydrophobicity. In a
one-hour chromatographic gradient, the peptides are separated
and 20,000–50,000 of these individually selected for fragmentation and MS/MS. The peptide fragments, or tandem mass spectra, are then compared against calculated fragmentation patterns
for all possible peptides that could result from a digestion of proteins with the enzyme used and sequences from the analyzed biological species. Tandem mass spectra are only compared against
those theoretical peptides whose mass is within a user-defined
maximum mass measurement error from the experimentally determined mass of the whole peptide (the precursor), and most
algorithms also have user-tunable parameters for a maximum
mass measurement uncertainty in MS/MS, misidentification of the
monoisotopic peak resulting in systematic errors of an integer
number of mass units, minimum total signal intensity, number of
peaks, peptide mass range and what types of product ions are expected in MS/MS. The optimal choice of these and other parameters
depends on the species, sample preparation, enzymatic digestion,
chromatography, mass spectrometer, data acquisition settings and
search algorithm. In inter-laboratory comparisons where the same
data were made available to many different research groups, the
number of returned identified peptides varied by almost an order
of magnitude, demonstrating that finding good, or optimal, parameters for the peptide identification algorithms is a non-trivial problem [30].
Fig. 5 shows the workflow for MS/MS analysis. It was originally designed by Mohammed et al. [31], who developed a new
mechanism for the search of MS/MS mass spectrometry data in a
cloud environment using the Trans-Proteomic Pipeline (TPP) [32].
We have adapted this workflow for Grid execution using the Taverna UNICORE plugin [22]. In doing so, the execution time of
the database search tool X!Tandem [33] has been reduced from
5.5 min on a desktop computer to an average of 2 min. The workflow input fastaFile represents the library used for the evaluation,
mzXMLFile is the set of proteins to be searched, nrOfDaughters is
an integer value specifying the number of files the mzXMLFile
is split into. mass_error_minus and mass_error_plus describe the
lower and upper windows of mass tolerance and have been used as
optimization targets. The activity mzXMLDecomposer decomposes
the mzXML file into several child mzXML files. Each resulting mzXML file is afterwards processed by one instance of
X!Tandem represented by tandem. These tandem jobs are executed in parallel on the Grid. The activity mzXMLComposer composes the results back into one file, which is finally evaluated


by PeptideProphetTandem [34]. The PeptideProphetTandem concludes with a value for the estimated number of correctly identified spectra (est_tot_num_correct). This value was used as the
fitness function in our optimization runs. The input parameters
typically used by the researchers are mass_error_plus = 0.5,
mass_error_minus = 0.5 and monoisotopic_mass_isotope_error =
yes, which produces est_tot_num_correct = 11375.1. In the
optimization setup for the GA the constraints were set for
mass_error_minus and mass_error_plus to [0, 0.5], whereas a dependency was defined by |mass_error_minus − mass_error_plus| <
0.2. This dependency was set because a larger discrepancy indicates the instrument is poorly calibrated and the data should
first be recalibrated before further analysis. For the automated GA
optimization, the population size was set to 20 and generation
size to 5. As optimum result we reached est_tot_num_correct =
13220.7 with mass_error_plus = 0.1709 and mass_error_minus =
0.1192. This result is significantly better than the value obtained by
the typical parameter set. The est_tot_num_correct is the number
of correct peptide-spectrum matches estimated by the mixture modeling in PeptideProphet. It is the area under the positive
(correct) distribution fitted to the X!Tandem discriminant score
distribution and is a measure of sensitivity.
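A dependency of this kind can be expressed as a simple logical predicate. The following hypothetical sketch shows how a sampled parameter combination from this use case could be checked against the bounds and the dependency before a sub-workflow is executed; the names and placement are illustrative, not the plugin's actual code.

```java
public class MassToleranceConstraint {

    // Constraints from the proteomics use case: both tolerances in [0, 0.5]
    // and |mass_error_minus - mass_error_plus| < 0.2, otherwise the sampled
    // combination is rejected.
    public static boolean isValid(double massErrorMinus, double massErrorPlus) {
        boolean inBounds = massErrorMinus >= 0.0 && massErrorMinus <= 0.5
                && massErrorPlus >= 0.0 && massErrorPlus <= 0.5;
        return inBounds && Math.abs(massErrorMinus - massErrorPlus) < 0.2;
    }

    public static void main(String[] args) {
        System.out.println(isValid(0.1192, 0.1709)); // true: the GA optimum satisfies the dependency
        System.out.println(isValid(0.5, 0.1));       // false: a difference of 0.4 violates the dependency
    }
}
```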
6. Discussion
We have sketched the possibilities of our developed generic optimization framework for Taverna by implementing a GA-based
parameter optimization plugin and experimentally tested this set-up with two real world use cases from the areas of structural bioinformatics and proteomics. For both use cases the fully automated
optimization framework using the example GA plugin found a better optimum than a user guided parameter sweep approach. Additionally, our approach has demonstrated computational efficiency
using a comparatively small number of samples. Although many design frameworks [35] have been developed, they are not accessible for scientists working with scientific workflows and workflow management systems. Instead, parameter sweeps and manual trial and error are still the methods of choice for life science researchers today [36,37], and were therefore used as the comparison
here. To our knowledge there is no other comprehensive approach
for general workflow optimization freely available today, except
Nimrod/OK [7], which is a more specialized workflow parameter
optimization tool from the engineering domain and apparently not
available for download. Nimrod/OK [7] is an extension of the Kepler
WMS and supports model optimization as well as inverse problems. Optimization works by including special optimization actors
into a workflow. These optimization actors implement various optimization methods, such as simplex, genetic algorithms and simulated annealing.
Optimization with Nimrod/OK, however, is a complex process
that requires enveloping the workflow to be optimized into a second, optimization-specific workflow and manual editing of a configuration file to set up the optimization parameters. In contrast, our approach integrates the optimization on a lower, more general level and thereby allows the optimization parameters to be set up directly from within the Taverna GUI as well as seamless switching
between optimization and execution mode. While the optimization extension for the WINGS/Pegasus WMS developed by Kumar
et al. [6] is similar to our optimization phase technique, it currently focuses solely on optimizing the runtime performance and
is therefore concentrated on the modification of parameters, which
are likely to influence the runtime. Another approach within the
engineering domain for parameter optimization using the method
of moving asymptotes is presented by Crick et al. [8]. Although Crick et al. also use the Taverna WMS to design a parameter optimization workflow, this is a static rather than a general approach. The


publication describes a manually created workflow, which comprises an optimization method for parameter optimization. Each
cycle of optimization has to be designed manually in the outer optimization workflow. None of these approaches provides a general
mechanism to address the scientific workflow optimization in scientific workflow management systems.
Compared to other established design frameworks [35], which
are typically used in the engineering domain, our framework aims
at providing parameter and other optimization methods to life scientists, who typically have neither experience in computer science
nor in optimization. This is achieved by providing direct access to
optimization services from within Taverna, a commonly used WMS
in the life sciences. Except for adding a fitness function, which could for many common cases be readily provided as Taverna components, a workflow requires no changes to be optimized and
therefore can be monitored, edited, and shared as any other Taverna workflow. Such features have only recently been included into
engineering optimization frameworks [38]. Additionally, our optimization works in parallel and is therefore comparatively fast. The
sum of these features should make scientific workflow optimization easy and fast enough to be adopted by life scientists.
At present our framework cannot be compared to design
frameworks in the engineering domain regarding the breadth in
methods and more advanced features like multi-objective optimization [39], hybrid optimization [40] etc. In principle, one could
manipulate any design optimization framework and substitute the
original model description with a Taverna command line execution of the workflow. However such an approach would require
tedious manual configuration and therefore is unlikely to be used
by life scientists. On the other hand some requirements for future
multidisciplinary design frameworks identified in [41] such as parallel processing, user guides, project templates and GUI integration
are already included in our current framework. Additionally, Taverna's multipurpose approach meets the requirement to formulate
fitness functions in arbitrary programming languages.
In order to enable fast porting of most optimization methods
available in engineering design frameworks into our workflow
optimization framework we have designed an API that provides
abstractions of all Taverna internals like security mechanisms, parallel execution and data handling on the Grid as well as GUI integration. This opens an easy way for developers to add further
optimization plugins, which implement these more advanced
workflow optimization techniques including multi-objective optimization. Due to the complex fitness landscapes of scientific workflows other meta-heuristics such as particle swarm optimization
(PSO) [42] or ant-colony optimization (ACO) [43] for continuous
problems would be useful additions.
Two important issues in optimization are the selection of an appropriate optimization target function and good parameter choices
for the optimization method itself. The most common optimization approaches in the life sciences use comparison to a gold standard and hence minimization of single-objective measures such
as RMSD or correlation. In other cases finding a suitable fitness
function may remain a challenge. In contrast to the engineering domain, multi-objective optimization has not yet been widely used
for parameter optimization in the context of life sciences [44]
although it would be helpful to assess trade-offs, e.g. between
sensitivity and specificity or between result quality and runtime.
Using, e.g., a niched-Pareto GA [39, p. 218 ff.], an extension of our framework to multi-objective optimization is feasible. However, the execution of a workflow is often computationally costly, sometimes
taking hours to execute and also the calculation of a Pareto optimal set is more costly and computing intensive than the optimization of a single-objective problem [39, p. 46]. Hybrid optimization
approaches [40] or other approximation approaches [39] used in
concert with Pareto optimization would be a possible approach to


overcome compute efficiency issues. In other cases it may be sufficient to roughly approximate part of the Pareto front by choosing a single objective measure with an explicit trade-off parameter
such as Fβ [45] and perform separate optimizations for a few values of the trade-off parameter (β). For example, the Matthews correlation coefficient in the first use case could be substituted by an Fβ component.
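For reference, with precision P and recall R the Fβ measure is commonly defined as Fβ = (1 + β²) · P · R / (β² · P + R), so that β > 1 weights recall more strongly and β < 1 weights precision more strongly; running separate single-objective optimizations for a few values of β thus traces out a few points of the underlying trade-off.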
A crucial issue especially for non-experts in optimization is the
parameterization of the optimization algorithms themselves. Our current optimization set-up includes settings such as the cross-over rate, mutation rate, number of generations and population size of the GA, which may still need to be adapted to the search space. Initializing these
values could be supported in the future by automated tools [46] or
based on provenance as outlined below.
As briefly mentioned in Section 2.2 parameters are not the only
entities of a workflow that could be changed during optimization.
Future plugins for our framework could target higher levels of the
workflow such as component optimization where the choice of the
components used for particular steps of the workflow is optimized.
In certain cases even the order of components, i.e. the topology of
the workflow can be changed without changing the purpose of the
workflow. An example for component optimization is a local sequence alignment component, for which one could choose between the heuristic BLAST method [47], the dynamic programming approach of Smith–Waterman [48] or a profile alignment method like PSI-BLAST [49]. As each of these alignment methods has its own set
of parameters a component optimization should be used together
with parameter optimization. In principle, both optimization types
can be jointly encoded in a single Genetic Algorithm e.g. by forming sub-populations or by defining separate chromosomes or parts
on a chromosome for each set of component specific parameters
of which only one is active at a time. Topology optimization, i.e.
switching the order of components without changing the overall
semantics is only possible in certain cases where the respective
methods do not produce any new data types but only change the
number of items or their representation. Typical methods of this
type are clustering, filtering and scaling. The performance of the
combination of a clustering method and a filter, for example, might
differ depending on whether the filtering acts on the input items
before the clustering step or on the cluster representatives. The result, however, would be in both cases a subset of the input items.
The implementation of a higher level optimization of workflows
was not taken into account in this publication, as a major requirement of those transformations is the availability of additional logic,
such as an ontology that enables the optimization method to determine which algorithms or sub-workflows perform equivalent
functions. A similar ontology and a set of methods for data transformations would be needed for input and output types. Such
ontologies are not yet available for the methods and data types
typically used in the life sciences.
For large numbers of components with alternative methods and
in general for the optimization of workflows with many parameters more sophisticated techniques may be required to ensure sufficient sampling of the search space within a reasonable amount
of compute time. Apart from user constraints and approximation
methods any information obtained from optimizations of similar
workflows might be helpful to guide the optimization process.
We therefore propose in the following a provenance-based
optimization strategy. The basic idea is based on sharing of the
knowledge between collaborators in e-science infrastructures [50].
While currently mainly workflow definitions are stored in community repositories, it would in principle be possible to store and retrieve additional information such as provenance data, performance measures of optimization runs etc. If such an infrastructure for information-enriched workflows were available, other
techniques could utilize this information for optimization. Certainly, optimization by learning from provenance in e-science is a very complex concept due to its flexibility, but this is outweighed by its versatility as it could be applied on all optimization levels
described in Section 2.2. As an example, already optimized workflow
parts could guide the optimization processes. During parameter
optimization, a process could start with parameter values that have already been identified as optimal in a closely related application. Additionally, users could narrow the search space by searching and ranking similar workflows, whose provenance data can then be used to set ranges for parameters. Similar procedures can be applied for the other optimization types described. Another aspect of provenance-based optimization is the re-use of intermediate results. If single components of a workflow part do not require any optimization, the stored provenance data can be used to extract intermediate results for further processing. In particular for optimizations containing large-scale data analysis and compute-intensive applications, such approximation would be beneficial. Although provenance-based optimization could be established in larger local environments, its full potential depends on community-wide adoption.
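As a rough sketch of how such provenance data might be exploited for parameter optimization, assuming a hypothetical record format (parameter name, value, achieved fitness, and a similarity score of the originating workflow) rather than any existing repository API, the following Java snippet derives narrowed parameter ranges from the best runs of similar workflows, which could then seed the initial population of a Genetic Algorithm.

import java.util.*;
import java.util.stream.Collectors;

/**
 * Sketch of provenance-guided seeding for parameter optimization.
 * ProvenanceRecord, the similarity score and the example parameter name
 * are illustrative assumptions; a real implementation would query a
 * community repository for this information.
 */
public class ProvenanceSeedingSketch {

    record ProvenanceRecord(String parameter, double value, double fitness, double similarity) {}

    /** Keep only records from sufficiently similar workflows with good fitness. */
    static List<ProvenanceRecord> relevant(List<ProvenanceRecord> records,
                                           double minSimilarity, double minFitness) {
        return records.stream()
                .filter(r -> r.similarity() >= minSimilarity && r.fitness() >= minFitness)
                .collect(Collectors.toList());
    }

    /** Derive a narrowed [min, max] search range per parameter from the retained records. */
    static Map<String, double[]> parameterRanges(List<ProvenanceRecord> records) {
        Map<String, double[]> ranges = new HashMap<>();
        for (ProvenanceRecord r : records) {
            ranges.merge(r.parameter(), new double[]{r.value(), r.value()},
                    (a, b) -> new double[]{Math.min(a[0], b[0]), Math.max(a[1], b[1])});
        }
        return ranges;
    }

    public static void main(String[] args) {
        List<ProvenanceRecord> repo = List.of(   // illustrative repository content
                new ProvenanceRecord("fragment_mass_tolerance", 0.4, 0.91, 0.8),
                new ProvenanceRecord("fragment_mass_tolerance", 0.6, 0.88, 0.7),
                new ProvenanceRecord("fragment_mass_tolerance", 2.0, 0.55, 0.3));
        Map<String, double[]> ranges = parameterRanges(relevant(repo, 0.5, 0.8));
        ranges.forEach((p, r) -> System.out.println(p + " -> [" + r[0] + ", " + r[1] + "]"));
        // These ranges (or the single best values) could seed the initial GA
        // population instead of sampling the full, unconstrained parameter space.
    }
}

The snippet only illustrates the filtering and range-derivation step; the similarity score and the provenance records themselves would have to come from a community repository such as the one envisioned above.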
7. Conclusions
In this paper, we have presented a novel scientific workflow life
cycle phase to optimize scientific workflows for the life science domain. The proposed phase seamlessly fits into the common scientific workflow life cycle and augments the entire development
process of a scientific experiment. By inserting this phase before the whole workflow is executed, the optimization phase can serve as a substitute for several rounds of the whole scientific workflow life cycle. The idea behind it is that typically only parts of a workflow require optimization, and our optimization approach concentrates all computing resources on these parts by extracting and executing only sub-workflows. We have developed a
generic optimization framework within the Taverna WMS to test
the feasibility of a general approach to scientific workflow optimization. This generic approach enables other developers to easily
and quickly add other levels and methods for workflow optimization into Taverna. As an example, we have implemented a GA-based plugin for parameter optimization that shows the usefulness and ease of integrating workflow optimization methods via the framework API. In the two use cases taken from structural bioinformatics and proteomics, we have demonstrated that the automated parameter search gives results at least on par with manual search (trial and error) or parameter sweeps, while being more time- and resource-efficient. Finally, we have sketched how, in the future, community repositories could harness user knowledge to guide advanced optimization strategies.
Future work will include further optimization plugins as
well as heuristic approaches such as tracking intermediate results, so that workflow instances differing only in the parameters
of downstream components can use them and thereby avoid redundant computation of the upstream parts of the workflow. Further investigations may include other optimization algorithms and
ways to estimate the final fitness from the intermediate results.
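A minimal sketch of the intermediate-result tracking mentioned above, assuming hypothetical identifiers and a simple in-memory cache rather than any Taverna or framework API: upstream results are stored under a key derived from the upstream sub-workflow and its parameter values, so that candidate workflows differing only in downstream parameters can reuse them.

import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/**
 * Sketch of caching upstream intermediate results during optimization.
 * The cache key combines the upstream sub-workflow identifier with its
 * parameter values; all names are illustrative assumptions.
 */
public class IntermediateResultCache {

    private final Map<String, String> cache = new HashMap<>();

    /** Build a deterministic key from the upstream part and its (sorted) parameters. */
    static String key(String upstreamId, SortedMap<String, String> upstreamParams) {
        return upstreamId + "|" + upstreamParams.toString();
    }

    /** Return the cached upstream result or compute and store it once. */
    String upstreamResult(String upstreamId, SortedMap<String, String> upstreamParams) {
        return cache.computeIfAbsent(key(upstreamId, upstreamParams),
                k -> runUpstream(upstreamId, upstreamParams));
    }

    /** Placeholder for the (expensive) execution of the upstream sub-workflow. */
    private String runUpstream(String upstreamId, SortedMap<String, String> params) {
        return "result-of-" + upstreamId + "-" + params;
    }

    public static void main(String[] args) {
        IntermediateResultCache cache = new IntermediateResultCache();
        SortedMap<String, String> upstream = new TreeMap<>(Map.of("e_value", "0.001"));
        // Two candidate workflows that differ only in downstream parameters
        // trigger the upstream computation only once.
        String r1 = cache.upstreamResult("alignment", upstream);
        String r2 = cache.upstreamResult("alignment", upstream);
        System.out.println(r1.equals(r2) ? "upstream result reused" : "recomputed");
    }
}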
Acknowledgments
The authors would like to thank the Taverna development team
for the fruitful discussions and great support during the design and
implementation as well as the UNICORE support and development
group for their commitment to improving this work.
References
[1] I. Taylor, E. Deelman, D. Gannon, Workflows for e-Science: Scientific
Workflows for Grids, Springer, London, UK, 2007.
[2] W. van der Aalst, K. van Hee, Workflow Management: Models, Methods, and
Systems, MIT Press, USA, 2002.

[3] Y. Gil, E. Deelman, M. Ellisman, T. Fahringer, G. Fox, D. Gannon, C. Goble, M.
Livny, L. Moreau, J. Myers, Examining the challenges of scientific workflows,
IEEE Computer 40 (12) (2007) 26–34.
[4] B. Ludaescher, M. Weske, T. McPhillips, S. Bowers, Scientific workflows:
business as usual? in: 7th International Conference on Business Process Management, BPM, Ulm, Germany, 2009, pp. 31–47.
[5] E. Deelman, Y. Gil, Managing large-scale scientific workflows in distributed
environments: experiences and challenges, in: Proceedings of the 2nd IEEE
International Conference on e-Science and Grid Computing, Washington, DC,
USA, 2006.
[6] V.S. Kumar, P. Sadayappan, G. Mehta, K. Vahi, E. Deelman, V. Ratnakar, J. Kim,
Y. Gil, M. Hall, T. Kurc, J. Saltz, An integrated framework for parameter-based
optimization of scientific workflows, in: Proceedings of the 18th International
Symposium on High Performance Distributed Computing, HPDC, Munich,
Germany, 2009, pp. 177–186.
[7] D. Abramson, B. Bethwaite, C. Enticott, S. Garic, T. Peachey, A. Michailova,
S. Amirriazi, Embedding optimization in computational science workflows,
Journal of Computational Science 1 (1) (2010) 41–47.
[8] T. Crick, P. Dunning, H. Kim, J. Padget, Engineering design optimization using
services and workflows, Royal Society of London Philosophical Transactions
Series A 367 (2009) 2741–2751.
[9] S. Holl, O. Zimmermann, M. Hofmann-Apitius, A new optimization phase for
scientific workflow management systems, in: 2012 IEEE 8th International
Conference on E-Science, Chicago, USA.
[10] B. Ludaescher, I. Altintas, S. Bowers, J. Cummings, T. Critchlow, E. Deelman, D.
De Roure, J. Freire, C. Goble, M. Jones, S. Klasky, T. McPhillips, N. Podhorszki,
C. Silva, I. Taylor, M. Vouk, Scientific Process Automation and Workflow
Management, Scientific Data Management, Chapman and Hall/CRC, London,
2009, pp. 476–508.
[11] X. Fan, P. Brézillon, R. Zhang, L. Li, Making context explicit towards decision support for a flexible scientific workflow system, in: Fourth Workshop on Human Centered Processes, HCP, Genoa, Italy, 2011, pp. 3–9.
[12] B.W. Matthews, Comparison of the predicted and observed secondary
structure of T4 phage lysozyme, Biochimica et Biophysica Acta (BBA)–Protein Structure 405 (2) (1975) 442–451.
[13] D. Abramson, B. Bethwaite, C. Enticott, S. Garic, T. Peachey, Parameter space
exploration using scientific workflows, in: Proceedings of the 9th International
Conference on Computational Science: Part I, Germany, 2009, pp. 104–113.
[14] E.-G. Talbi, Metaheuristics: From Design to Implementation, Wiley, New York,
2009.
[15] E. Deelman, D. Gannon, M. Shields, I. Taylor, Workflows and e-Science: an
overview of workflow system features and capabilities, Future Generation
Computer Systems 25 (5) (2009) 528–540.
[16] P. Missier, S. Soiland-Reyes, S. Owen, W. Tan, A. Nenadic, I. Dunlop, A. Williams,
T. Oinn, C. Goble, Taverna, reloaded, in: Proceedings of the 22nd International
Conference on Scientific and Statistical Database Management, SSDBM '10, Germany, 2010, pp. 471–481.
[17] W. Gentzsch, D. Girou, A. Kennedy, H. Lederer, J. Reetz, M. Riedel, A. Schott,
A. Vanni, M. Vazquez, J. Wolfrat, DEISA–distributed European infrastructure for supercomputing applications, Journal of Grid Computing 9 (2) (2011) 259–277.
[18] Extreme science and engineering discovery environment (XSEDE), 16th
January 2013. www.xsede.org.
[19] A. Streit, et al., UNICORE 6–recent and future advancements, Annals of Telecommunications 65 (2010) 757–762.
[20] H.N. Krabbenhoeft, S. Moeller, D. Bayer, Integrating ARC grid middleware with
Taverna workflows, Bioinformatics 24 (2008) 1221–1222.
[21] I. Foster, Globus toolkit version 4: software for service-oriented systems,
in: Proceedings of the 2005 IFIP International Conference on Network and
Parallel Computing, in: LNCS, vol. 3779, 2005, pp. 2–13.
[22] S. Holl, O. Zimmermann, M. Hofmann-Apitius, A UNICORE plugin for HPC-enabled scientific workflows in Taverna 2.2, in: IEEE World Congress on Services, SERVICES 2011, pp. 220–223.
[23] S. Holl, O. Zimmermann, B. Schuller, B. Demuth, M. Hofmann-Apitius, Secure
multi-level parallel execution of scientific workflows on HPC Grid resources
by combining Taverna and UNICORE services, in: Proceedings of the UNICORE
Summit 2012, pp. 27–34.
[24] J.H. Holland, Adaptation in Natural and Artificial Systems: an Introductory
Analysis with Applications to Biology, Control and Artificial Intelligence, MIT
Press, USA, 1992.
[25] A. Niazi, R. Leardi, Genetic algorithms in chemometrics, Journal of Chemometrics 26 (6) (2012) 345–351.
[26] K. Meffert, et al., JGAP–Java genetic algorithms and genetic programming
package, 16th January 2013. http://jgap.sf.net.
[27] A.J. Smola, B. Schoelkopf, A tutorial on support vector regression, Statistics and
Computing 14 (3) (2004) 199–222.
[28] C.C. Chang, C.J. Lin, LIBSVM: a library for support vector machines, ACM
Transactions on Intelligent Systems and Technology 2 (3) (2011) 1–27.
[29] R. Aebersold, M. Mann, Mass spectrometry-based proteomics, Nature 422
(2003) 198–207.
[30] iPRG-2013 Study, 16th January 2013. http://www.abrf.org/index.cfm/group.
show/proteomicsinformaticsresearchgroup.53.htm.
[31] Y. Mohammed, E. Mostovenko, A.A. Henneman, R.J. Marissen, A.M. Deelder,
M. Palmblad, Cloud parallel processing of tandem mass spectrometry based
proteomics data, Journal of Proteome Research 11 (10) (2012) 5101–5108.
[32] A. Keller, J. Eng, N. Zhang, X. Li, R. Aebersold, A uniform proteomics MS/MS
analysis platform utilizing open XML file formats, Molecular Systems Biology
1 (0017) (2005) 1–8.

[33] R. Craig, R.C. Beavis, TANDEM: matching proteins with tandem mass spectra,
Bioinformatics 20 (9) (2004) 1466–1467.
[34] A. Keller, A.I. Nesvizhskii, E. Kolker, R. Aebersold, Empirical statistical model to
estimate the accuracy of peptide identifications made by MS/MS and database
search, Analytical Chemistry 74 (20) (2002) 5383–5392.
[35] S.S. Rao, Engineering Optimization: Theory and Practice, Wiley, 2009.
[36] S. Gesing, J. van Hemert, P. Kacsuk, O. Kohlbacher, Special Issue: portals for life
sciences–providing intuitive access to bioinformatic tools, Concurrency and Computation: Practice and Experience 23 (3) (2011) 223–234.
[37] S. Smanchat, M. Indrawan, S. Ling, C. Enticott, D. Abramson, Scheduling
parameter sweep workflow in the Grid based on resource competition, Future
Generation Computer Systems 29 (5) (2013) 1164–1183.
[38] C. Heath, J. Gray, OpenMDAO: framework for flexible multidisciplinary
design, analysis and optimization methods, in: Proceedings of the 53rd AIAA
Structures, Structural Dynamics and Materials Conference, 2012.
[39] K. Deb, Multi-Objective Optimization Using Evolutionary Algorithms, John
Wiley and Sons, 2001.
[40] T.A. El-Mihoub, A.A. Hopgood, L. Nolle, A. Battersby, Hybrid genetic algorithms:
a review, Engineering Letters 13 (2) (2006) 124–137.
[41] J.A. Parejo, A. Ruiz-Corts, S. Lozano, P. Fernandez, Metaheuristic optimization
frameworks: a survey and benchmarking, Soft Computing 16 (3) (2012)
527–561.
[42] J. Kennedy, R. Eberhart, Particle swarm optimization, in: Proceedings of IEEE
International Conference on Neural Networks, Vol. 4, 1995, pp. 1942–1948.
[43] M. Dorigo, G. Di Caro, Ant colony optimization: a new meta-heuristic, in:
Proceedings of the 1999 Congress on Evolutionary Computation, CEC 99,
Vol. 2, 1999, pp. 1470–1477.
[44] J. Sun, J.M. Garibaldi, C. Hodgman, Parameter estimation using metaheuristics
in systems biology: a comprehensive review, IEEE/ACM Transactions on
Computational Biology and Bioinformatics 9 (1) (2012) 185–202.
[45] C.J. van Rijsbergen, Information Retrieval, 1979.
[46] G. Ochoa, M. Preuss, T. Bartz-Beielstein, M. Schoenauer, Editorial for the special
issue on automated design and assessment of heuristic search methods,
Evolutionary Computation 20 (2) (2012) 161–163.
[47] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, D.J. Lipman, Basic local alignment
search tool, Journal of Molecular Biology 215 (1990) 403–410.
[48] O. Gotoh, An improved algorithm for matching biological sequences, Journal
of Molecular Biology 162 (1982) 705–708.
[49] S.F. Altschul, T.L. Madden, A.A. Schäffer, J. Zhang, Z. Zhang, W. Miller, D.J.
Lipman, Gapped BLAST and PSI-BLAST: a new generation of protein database
search programs, Nucleic Acids Research 25 (1997) 3389–3402.
[50] D. De Roure, C. Goble, R. Stevens, The design and realisation of the
myExperiment Virtual Research Environment for social sharing of workflows,
Future Generation Computer Systems 25 (5) (2009) 561–567.

Sonja Holl received her Bachelor and Master of Science degrees in computer science from the Heinrich-Heine University, Düsseldorf, Germany, in 2008 and 2009, respectively. She is currently a Ph.D. candidate at the Jülich Supercomputing Centre at the Forschungszentrum Jülich GmbH. Her research interests are in the areas of distributed computing and scientific workflows, with emphasis on the optimization of bioinformatics workflows.

Olav Zimmermann is a staff scientist at the Simulation Laboratory Biology, a research and support unit at the Jülich Supercomputing Centre, Germany. His research interests include bioinformatics and biophysics algorithms for HPC architectures, protein structure prediction, protein folding and protein design. He received his Ph.D. in Biology in 2003 from the University of Cologne, Germany, and was a co-founder and executive of Science-Factory GmbH, a German startup that developed large-scale workflow systems for bioinformatics.

Magnus Palmblad is an associate professor at the Biomolecular Mass Spectrometry Unit of the Leiden University Medical Center in the Netherlands. He received his Ph.D. in molecular biotechnology at Uppsala universitet in Sweden in 2002. His research concerns quantitative and systematic approaches to molecular biology, primarily using mass spectrometry-based proteomics and model systems. He has also published several papers on new algorithms for mass spectrometry data analysis.

Yassene Mohammed obtained his Ph.D. in Medical Informatics from the University of Göttingen. He is currently a member of the Biomolecular Mass Spectrometry Unit of the Leiden University Medical Center. The fields of Yassene's work and research have been biomedical signal and image processing, modeling and simulation of minimally invasive surgery, as well as healthgrids and cloud computing. Currently his main focus is computational high-throughput proteomics and using scientific workflows for processing biomedical big data.

Martin Hofmann-Apitius trained initially as a molecular biologist and shifted the focus of his work to applied bioinformatics at the end of the 1990s. In 2002, he accepted the position of Head of the Department of Bioinformatics at the Fraunhofer Institute for Algorithms and Scientific Computing (SCAI) in Sankt Augustin (Germany), a governmental non-profit research institute. In 2006, he was appointed Professor for Applied Life Science Informatics at the Bonn-Aachen International Center for Information Technology (B-IT). In his scientific work, he is interested in solving real-world problems in biomedicine by applying computational approaches.
