Nottingham, UK
Editor
Simon J Cox
Sponsored by
JISC, Microsoft®, ORACLE®, IBM®
OERC, eSI
Proceedings of the UK e-Science All Hands Meeting 2006, © NeSC 2006, ISBN 0-9553988-0-0
Premier Sponsor
Gold Sponsor
Silver Sponsor
Bronze Sponsors
This year has seen an important transition for the UK e-Science programme. Many
of the first set of pilot projects have completed and we now have clear outputs
from the programme. e-Science in the UK had its fifth birthday and moved into a
sustained mode of management, with the programme directorate being replaced by
an EPSRC programme manager and Malcolm Atkinson as e-Science Envoy.
The AHRC began their own e-Science programme and the e-Social science
activities have matured, while all of the Research Councils, JISC and other
stakeholders continue to develop e-Science and e-Infrastructure.
The theme for this year's UK e-Science All Hands Meeting, Achievements,
Challenges and New Opportunities, reflected this transition, as does the broad
range of presentations and discussions recorded in these proceedings: from the
results of completed activities to the challenges of new areas, and from
blue-skies research issues to industry take-up of prototype technologies.
Special thanks are due to the Steering Committee, especially to Ken Brodlie,
who as Vice-Chair will take the lead for next year's meeting; to the Programme
Committee, chaired by Paul Watson and Jie Xu, whose members handled the
refereeing of the large number of submitted papers; to the team at the National
e-Science Centre and EPSRC for organising the event and for managing the
website and the paper submission process; and to Susan Andrews for her tireless
efforts in producing these proceedings.
Finally we would like to acknowledge the support for the meeting from EPSRC,
JISC, Microsoft, IBM, ORACLE, Oxford e-Research Centre and the e-Science
Institute.
Dear Colleagues,
This was the first year in which we required full papers to be submitted for
review, rather than just abstracts. This change was made as the quality of the All
Hands papers has risen steadily since the very first meeting in 2001 and we wanted to
ensure that the very best papers were presented at the meeting. We believe that the
change has been a success, enabling us to select an exciting set of high quality papers.
For this, we are of course very grateful to those who submitted papers, successful or
otherwise.
All submissions to AHM 2006 were rigorously reviewed under the supervision of the
PC. We received 128 regular paper submissions, from which the PC selected 84 papers
to be presented. I would like to thank the members of the PC who were able to
maintain a high quality of reviewing even though the timescales were much tighter
than in previous years. Throughout the whole process, the members were unfailingly
helpful, hard-working and supportive. Those other reviewers who provided expert
advice on papers in their area were also of great help to us. Overall, I feel that this
reflects the importance and warmth with which the AHM is viewed within the e-
Science community.
The most oversubscribed part of the conference this year was the Workshop
programme for which we received 25 proposals for only 10 slots. It is good to see the
growing worldwide interest in the AHM reflected in the UK-Korean and UK-Chinese
workshops that we will be hosting. The Birds of a Feather sessions were also popular,
with 13 proposals for only 5 places. In the case of both Workshops and BoFs, the PC
tried to strike a balance between encouraging the formation of new communities in
interesting, emerging areas, and supporting mature communities that have developed
over the past years at the AHM.
The overall reviewing process was made straightforward thanks to the work of Susan
Andrews of NeSC who not only provided online tools that simplified our work, but
also gave me valuable help through the past 8 months, for which I am very grateful.
Finally, I would like to thank the Vice-Chair, Prof Jie Xu, for his constant
assistance and good advice.
I hope you find the AHM 2006 programme interesting and enjoyable.
Paul Watson
Chair
AHM 2006 Programme Committee
Other Reviewers
Welcome to the proceedings of the UK e-Science All Hands Meeting 2006. On this
CD you will find PDF versions of the papers presented in the regular sessions,
workshops, and poster sessions at the conference.
Particular thanks go to the invaluable Susan Andrews at NeSC, who did an
excellent and thorough job of bringing all aspects of the proceedings together,
from handling the initial submissions to delivering the final CD.
Periorellis P., Cook N., Hiden H., Conlin A., Hamilton M.D.,
Wu J., Bryans J., Gong X., Zhu F., Wright A.
Panayiotis.Periorellis@ncl.ac.uk
May 2006
Abstract
The principal aim of the GOLD project (Grid-based Information Models to Support the Rapid Innovation
of New High Value-Added Chemicals) is to carry out research into enabling technology to support the
formation, operation and termination of Virtual Organisations (VOs). This technology has been
implemented in the form of a set of middleware components which address issues such as trust, security,
contract monitoring and enforcement, information management and coordination. The paper discusses
these elements, presents the services required to implement them and analyses the architecture and
services. The paper follows a top-down approach, starting with a brief outline of the architectural
elements derived during the requirements engineering phase, and demonstrates how these architectural
elements were mapped onto actual services implemented according to SOA principles and related
technologies.
federated identity and the real identity of the participant, allowing both accountability and privacy to be supported. Secure message exchange between the VO participants is achieved by exchanging security tokens carried within the headers of the exchanged messages. The federation service signs and attaches these tokens to SOAP message headers according to the WS-Security specification [WS-Security 2004]. The specification supports various token structures such as X.509 certificates, Username or XML-based tokens, which have been customised, in the GOLD Middleware, to support SAML [OASIS 2004] assertions. The structure of a message carrying a token and the lifecycle of such a token from request to validation is shown in Figure 3.

Figure 3 - Security token lifecycle

The body of the message carries the actual message content, while the header is structured according to the WS-Security specification and carries a security token formatted in one of the standardised styles described earlier. The above description implies that a federation has been established and that there is a direct trust relationship between the federation and the participants. If this direct relationship is not present the above model fails. To alleviate this problem the GOLD Middleware makes use of the WS-Trust [2005] specification, which offers a mechanism for managing brokered trust relationships.

3.1.2 Authorisation

In earlier publications [Periorellis et al. 2004] we discussed the advantages and disadvantages of access control models, ranging from access control lists to role- and task-based systems. We concluded that the dynamic nature of virtual organisations makes it necessary for any VO infrastructure to support a mechanism that deals with dynamic rights activation and de-activation. In dynamic systems such as VOs, both the topology of the VO (the parties that form it and the links between them) and its workflow are subject to change. Therefore static rights assignment prior to a workflow does not capture the eventuality of a party leaving or being added, or any alterations to the workflow itself. Several authors have elaborated on this issue [Coulouris, G., et al. 1994, Roshan, K., T., et al. 1997, Sandu, R., S., et al. 1996, Thomas R., K., 1997]. In addition, given the sensitivity of the information that may be shared in some VOs (which raises concerns regarding competitive advantage), parties are not expected to be assigned a single set of rights that would last throughout the duration of a VO. It is more likely that VO participants would agree limited or gradual access to their resources depending on workflow progress. In GOLD we want to be able to restrict role access to resources depending on the execution context. In a virtual organisation the execution context can be regarded as the particular project or the goal that all participants have come together to achieve. In a virtual organisation there can be many projects with various participants, resulting in complicated interrelationships between them. For example, some of them may play the same role in various projects, carrying out exactly the same tasks, or have different roles within the same project depending on the task performed. The question we raise is 'Should a role have the same permissions and access rights throughout a set of similar or even identical projects and the tasks within those projects?' Our view is that in dynamic access control systems we should separate roles from role instances. To support this, we represent role instances as parameterised roles. The parameters can be used to express relations and to distinguish a role with task-specific or project-specific privileges from the general role. Also, some roles may be relational, e.g. originator(document, project), and it may be necessary to enforce separation of duties for such purposes as ensuring that the originator of a document cannot also sign it off. Policy on sign-off could include the possession of certain qualifications in combination with an organisational position; for example, a manager can sign off a document provided he/she also has a chemist qualification and is not the originator of that document. Different role instances may require different permissions and indeed additional levels of authorisation depending on the project and task in which they are active. To be able to handle such cases, GOLD needs to support adequately fine-grained access control mechanisms; parameterisation of roles supports both relational roles and fine-grained access control. Granularity refers to the level of detail for which one can define access rights. Fine-grained permissions are needed for instances of roles as well as instances of objects. For example, a chemist role may be granted access to chemical documents, but we do not wish to grant access to all chemical documents produced by the system. Instead, we want any access permissions granted to the chemist role to be project-specific (e.g., the instance of a particular collaboration) as well as task-specific (e.g., the instance of a particular pre-defined set of activities). So the management of roles and access permissions in
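The separation of roles from role instances described above can be sketched in a few lines of code. This is a minimal illustration rather than the GOLD implementation, and every name in it (RoleInstance, can_sign_off, the example projects and tasks) is invented for the example. It shows permissions attached to parameterised role instances rather than to general roles, plus a separation-of-duties check on sign-off.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RoleInstance:
    # A role parameterised by execution context: the same general
    # role ("chemist") yields distinct instances per project and task.
    name: str
    project: str
    task: str

# Hypothetical policy table: permissions are granted to role
# *instances*, not to the general role, giving fine-grained control.
PERMISSIONS = {
    RoleInstance("chemist", "scale-up", "thermal-analysis"): {"read", "edit"},
    RoleInstance("manager", "scale-up", "sign-off"): {"read", "sign-off"},
}

def allowed(role: RoleInstance, action: str) -> bool:
    return action in PERMISSIONS.get(role, set())

def can_sign_off(user: str, role: RoleInstance, originator: str,
                 qualifications: set[str]) -> bool:
    # Separation of duties: the originator of a document may not also
    # sign it off, and sign-off requires a chemist qualification.
    return (allowed(role, "sign-off")
            and "chemist" in qualifications
            and user != originator)
```

Because permissions key on the whole instance, a chemist on one project gains nothing on another project's documents: that is a different role instance with its own (here empty) permission set.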
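The token-based message exchange described in the security section above can also be sketched. This is not the GOLD Middleware: to stay runnable with only the standard library it uses an HMAC as a stand-in for the XML signatures that WS-Security applies to X.509 or SAML tokens, and all names (issue_token, attach, validate, the federation key) are invented for the example. The shape is the important part: the federation signs a token, the token travels in the message header, and the body carries the actual message.

```python
import hashlib
import hmac
import json

FEDERATION_KEY = b"federation-signing-key"  # stand-in for the federation's key

def issue_token(subject: str, role: str) -> dict:
    """Federation service: issue a signed token (a crude stand-in for
    a SAML assertion carried in a WS-Security header)."""
    claims = {"subject": subject, "role": role}
    payload = json.dumps(claims, sort_keys=True).encode()
    sig = hmac.new(FEDERATION_KEY, payload, hashlib.sha256).hexdigest()
    return {"claims": claims, "signature": sig}

def attach(token: dict, body: str) -> dict:
    # The header carries the security token; the body carries the message.
    return {"header": {"security": token}, "body": body}

def validate(message: dict) -> bool:
    """Receiving participant: it trusts the federation directly, so it
    only needs to check the federation's signature over the claims."""
    token = message["header"]["security"]
    payload = json.dumps(token["claims"], sort_keys=True).encode()
    expected = hmac.new(FEDERATION_KEY, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(token["signature"], expected)
```

Any tampering with the claims in transit breaks the signature check, which is what lets the federated identity stand in for the real identity while remaining accountable.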
architecture for subscribing, producing and delivering notification messages. It supports typical subscription operations such as subscribe, unsubscribe and renew, while the delivery mechanism is based on the pull model, similar to OMG's CORBA notification model [Bolton 2001]. When the participant subscribes to GOLD's notification service, the service includes information regarding the Subscription Management service in its response. Subsequent operations, such as getting the status of a subscription, renewing it and unsubscribing, are all directed to the subscription manager. The source (producer) sends both notifications and a message signifying the end of registered subscriptions to the sink (consumer). To provide notification services the GOLD Middleware makes use of NaradaBrokering [Pallickara and Fox 2003], which is a mature, open source, distributed messaging infrastructure. NaradaBrokering can support a number of notification and messaging standards, notably JMS [JMS 2001], WS-Eventing and WS-Notification. It is, therefore, suitable for intra- and inter-organisational notification. In the context of a VO, a significant advantage of a notification service built on NaradaBrokering is the flexibility of deployment options. The service could be deployed as a stand-alone service at a single site or, alternatively, as a peer-to-peer network of Narada brokers offering a federated notification service. For example, the notification service could be distributed across a set of Trusted Third Parties (TTPs) that support the VO or across the members of the VO itself. In either deployment NaradaBrokering provides scalable, secure and efficient routing of notifications.

The Enactment service operates in conjunction with the core GOLD Middleware services to provide support for coordination and workflow. The main components of the service are the workflow editor and the workflow enactment engine; see Figure 5. In addition, there are custom workflow tasks for manipulating VOs and managing documents.

Figure 5 - Enactment service

In Figure 5, participant P describes a workflow using a workflow editor and saves the output in a storage facility. Workflows may be represented graphically using an editor that subsequently creates an XML output document structured according to the BPEL4WS (Business Process Execution Language for Web Services) specification [BPEL4WS 2002]. The technology provides an XML schema (currently being considered as an OASIS standard) for describing tasks, roles, inputs, outputs, and pre- and post-conditions associated with each task. During workflow enactment, the workflow engine retrieves BPEL4WS documents and executes them, subsequently sending any notifications required to subscribers (S) interested in the state of the coordination activities. GOLD adopted ActiveBPEL [Emmerich et al. 2005] for workflow execution. The Workflow Enactment service responds to requests by initiating workflows to co-ordinate interactions between VO members. For example, the process of approving a chemical safety study may involve passing the study to several approvers, each of whom must sign the document before it is considered complete. The enactment engine notifies the participants about events and required activities, depending on the topics participants have registered for. For situations where it is not necessary to co-ordinate the detailed activities of VO participants, the GOLD Middleware can be used to provide an abstraction with a lower degree of integration, for instance a shared space that contains projects that are divided into discrete tasks. Each of these tasks can contain a set of data, which can be accessed and edited by participants with the correct security credentials. These documents can include any of the information types supported within the VO, including the results of any enacted workflows, and any workflows that are still on-going.

3.3 Regulation

Regulation helps govern interactions between parties, ensuring that participants' rights (in terms of the resources they are allowed to access) are properly granted and that obligations are properly dispatched (such as making resources available to others). This is achieved by the use of contracts and contract enforcement mechanisms as well as monitoring mechanisms for auditing and accountability.

3.3.1 Contract Enactment

Each member of a VO requires that their interests are protected, specifically:
• that other members comply with contracts governing the VO;
• that their own legitimate actions (such as delivery of work, commission of service) are recognised;
• that other members are accountable for their actions.
To support this, the GOLD Middleware records all activities to monitor for compliance with the regulatory regime. Furthermore, critical interactions between VO participants should be non-repudiable (no party should be able to deny their participation) and the auditing and monitoring functions must be fair (misbehaviour should not disadvantage well-behaved parties). For example, a business contract governing the scenario described in Section will specify sequencing constraints on the production of documents such as the requirements, recipe, thermal analysis and scale-up analysis. It will also require that the safety of the scaled-up process is assured, hence
the requirement that the Scale-up company employ the Thermal Safety company to provide an analysis and validation of the potential exotherm that was identified during the scale-up studies. In a complex natural language contract of the form typically negotiated between business partners, there may in fact be ambiguities that, given the sequencing constraints, could lead to deadlock during the chemical process development project. There is, therefore, a need to produce an electronic version of the contract that can be model-checked for correctness properties (e.g. safety and liveness). Having developed a contract that is free of ambiguities, it should then be possible to derive an executable version to regulate an interaction at run-time. That is, the business messages exchanged between participants during the process development should be validated with respect to the executable contract to ensure that they comply with contractual constraints. To hold participants to account, and to be able to resolve subsequent disputes, these message exchanges should be audited. In the scenario described in Section , the Scale-up company sends the Supplier company the scale-up model. The delivery of this document should be validated against the contract to ensure any pre-requisite obligations have been fulfilled. To safeguard their interests, the Supplier company will require evidence that the model originated at the Scale-up company. Similarly, the Scale-up company will require evidence that they fulfilled their obligation to deliver the model to the Supplier company.

Given these requirements, we identify two main aspects to contract enactment and accountability:
• high-level mechanisms to encode business contracts so that they are amenable to integrity and consistency checking and in order to derive an executable form of contract;
• a middleware infrastructure that can monitor a given interaction for conformance with the executable contract, ensuring accountability and acknowledgement.
To address the first aspect, Section 3.3.1.1 provides a summary of work on the derivation of electronic contracts and deployment models for contract-mediated interaction. This work appears in Molina et al. [2005], which also presents related work. The GOLD project extends this work to enact business contracts using infrastructure for accountability and non-repudiation as an enforcement mechanism. Section 3.3.2 presents this infrastructure and shows how it addresses the second aspect identified above. This paper is concerned with monitoring and enforcement of business operation clauses; of equal importance is the monitoring of the levels of Quality of Service (QoS) offered within a VO. This concerns the collection of statistical metrics about the performance of a service to evaluate whether a provider complies with the QoS that the consumer expects. Molina et al. [2004] examine this aspect of regulation and related work.

3.3.1.1 Contract-mediated interaction

The rules in a conventional, paper-based contract express what operations business partners are:
• permitted to perform if deemed necessary to fulfil the contract;
• obliged to perform to ensure contract fulfilment;
• prohibited from performing, as these actions would be illegal, unfair or detrimental to the other partners.
In addition, rules may stipulate when and in what order the operations are to be executed. To enable automatic management of partnerships within a VO, electronic representations of contracts must be used to mediate the rights and obligations that each member promises to honour. In the worst case, violations of agreed interactions are detected and all interested parties are notified. In order to support this, the original natural language contract that is in place to govern interactions between participants has to undergo a conversion process from its original format into an executable contract (x-contract) that acts as a mediator of the business conversations. This conversion process involves the creation, with the help of a formal notation, of one or more computational models of the contract with different levels of detail. To achieve these objectives, the Promela modelling language [Holzmann 1991] is used to represent the basic parameters that typical business contracts comprise, such as permissions, obligations, prohibitions, actors (agents), time constraints, and message type checking. The Promela representation can be validated with the help of the accompanying Spin model-checker tool [Holzmann 2004]. For example, model-checking the Promela representation can improve the original natural language contract by removing various forms of inconsistency, as discussed in Solaiman et al. [2003]. This implementation-neutral representation can be refined to include technical details such as acknowledgement and synchronisation messages. The details will vary depending on the specific implementation techniques and standards that are adopted. This implementation-specific representation can then be used for run-time monitoring.

Conceptually, an x-contract is placed between VO members to regulate their business interactions. In terms of the interaction model, the x-contract may be reactive or proactive. A reactive x-contract intercepts business messages, validates the messages and rejects invalid messages. A proactive x-contract drives the cross-organisational business process by inviting VO members to send legal messages of the right type, in the correct sequence, at the correct time, etc. Deployment can be either centralised or distributed. This leads to four deployment models:
• Reactive central - where all messages are intercepted by a centralised x-contract (at a TTP, for example) that is responsible for forwarding just the legal messages to their intended destination.
• Proactive central - where a centralised x-contract coordinates the business process on behalf of VO members and triggers the exchange of legal messages.
• Reactive distributed - where the x-contract is split into separate components that can be used to validate just those messages sent to an individual VO member and to reject illegal messages sent to that member.
• Proactive distributed - a distributed version of proactive central that coordinates the legal participation of each member in the business process.
Distributed deployments face the difficult challenge of keeping contract state information synchronised at both ends. For example, a valid message forwarded by the buyer's x-contract could be dropped at the seller's end because intervening communication delays render the message untimely (and therefore invalid) at the seller's side. State synchronisation is necessary to ensure that both parties agree to treat the message as either valid or invalid. One approach, which uses a non-repudiable state synchronisation protocol [Cook et al. 2002], is described in Molina et al. [2003]. The GOLD Middleware used to invoke validation with respect to a contract at runtime is discussed below in Section 3.3.2.

3.3.2 Accountability

This section focuses on accountability for the delivery of a single business message. However, this validated and non-repudiable message delivery can then be used as a building block for contract monitoring and enforcement of the kind envisaged in Section 3.3.1.

Figure 6 - Business message delivery with acknowledgements

Figure 6 shows the delivery of a business message and associated acknowledgements. Typically, for each business message, there should be an immediate acknowledgement of receipt indicating successful delivery of the message. Eventually, a second acknowledgement indicates whether the original business message is valid (or invalid) in the context of the given interaction. Finally, the validation message is acknowledged in return. Validation of the original business message (performed at B) can be arbitrarily complex. For example, it may simply involve verification that a message is syntactically valid and in the correct sequence with respect to a contract. Alternatively, a message may require validation with respect to more complex contractual conditions or with respect to local application state. Triggering validation at the level of business message delivery has the potential to allow specialisation of an application to meet the constraints of different regulatory regimes. Web Services are increasingly used to enable B2B interactions of this kind. However, there is currently no support to make the exchange of a set of business messages (and their associated acknowledgements) both fair and non-repudiable. A flexible framework for fair, non-repudiable message delivery has therefore been developed. The Web Services implementation of this framework comprises a set of services that are invoked at the middleware level and so enable the Web Services developer to concentrate on business functions. The GOLD Middleware renders the exchange of business messages fair and non-repudiable. Arbitrarily complex, application-level validation is supported through the registration of message validators. The framework is sufficiently flexible to adapt to different application requirements and, in particular, to execute different non-repudiation protocols to meet those requirements.

3.3.2.1 Basic concepts

Non-repudiation is the inability to deny an action or event. In the context of distributed systems, non-repudiation is applied to the sending and receiving of messages. For example, for the delivery of a message from A to B the following types of non-repudiation may be required:
• NRO - B may require Non-Repudiation of Origin of the message, i.e. irrefutable evidence that the message originated at A;
• NRR - A may require Non-Repudiation of Receipt of the message, i.e. irrefutable evidence that B received the message.
Non-repudiation is usually achieved using public key cryptography. If A signs a message with their private key, B can confirm the origin of the message by verifying the signature using A's public key, and vice versa. An additional requirement is that at the end of the interaction no well-behaved party is disadvantaged. For example, consider the selective receipt problem, where a sender provides NRO but the recipient does not provide the corresponding NRR. This problem is addressed by the fair exchange of items, where fairness is the property that all parties obtain their expected items or no party receives any useful information about the items to be exchanged [Markowitch et al. 2002]. Kremer et al. [2002] provide a survey of protocols to achieve fair, non-repudiable exchange of messages. The following discussion is based on the use of an in-line TTP to support the exchange. However, our execution framework is not restricted to this class of protocol.
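The NRO/NRR evidence just described can be illustrated with a small sketch. This is not the GOLD framework: to stay runnable with only the standard library it uses HMACs held by an in-line TTP as a stand-in for the public-key signatures a real non-repudiation protocol would use, and every name in it (DeliveryAgent, submit, deliver) is invented for the example.

```python
import hashlib
import hmac

def mac(key: bytes, data: bytes) -> bytes:
    return hmac.new(key, data, hashlib.sha256).digest()

class DeliveryAgent:
    """Toy in-line TTP: it holds each party's MAC key, so it can both
    verify evidence tokens and attribute them to a party. A real
    protocol would use asymmetric signatures instead of MACs, so that
    the evidence convinces third parties as well as the TTP."""
    def __init__(self):
        self.keys = {}
        self.evidence = {}  # evidence log, e.g. {"NRO": ..., "NRR": ...}

    def register(self, name: str, key: bytes):
        self.keys[name] = key

    def submit(self, sender: str, msg: bytes, nro: bytes) -> bool:
        # Check NRO (non-repudiation of origin) before forwarding.
        if not hmac.compare_digest(nro, mac(self.keys[sender], msg)):
            return False
        self.evidence["NRO"] = nro  # logged for later dispute resolution
        return True

    def deliver(self, recipient: str, msg: bytes) -> bytes:
        # The recipient returns NRR (non-repudiation of receipt) in
        # exchange for the message, so neither side gets ahead.
        nrr = mac(self.keys[recipient], b"received:" + msg)
        self.evidence["NRR"] = nrr
        return nrr

# A sends msg with proof of origin; the DA exchanges it for B's receipt.
da = DeliveryAgent()
da.register("A", b"key-of-A")
da.register("B", b"key-of-B")
msg = b"scale-up model v1"
assert da.submit("A", msg, mac(b"key-of-A", msg))  # NRO verified and logged
nrr = da.deliver("B", msg)                         # NRR generated and logged
```

In the fuller protocol described below, the NRS and NRV tokens would be logged by the agent in exactly the same way.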
Figure 7 - Executing a business interaction through a delivery agent

Figure 7 introduces a Delivery Agent (DA), or in-line TTP, to the interaction shown in Figure 6. Four types of evidence are generated:
• NRO - Non-Repudiation of Origin: that msg originated at A;
• NRS - Non-Repudiation of Submission to the DA of msg and NRO;
• NRR - Non-Repudiation of Receipt of msg by B;
• NRV - Non-Repudiation of Validation, valid or otherwise, as determined by validation of msg by B.
A starts an exchange by sending a message, with proof of origin, to the DA. This is the equivalent of Message 1 in Figure 6 with the NRO appended. The DA exchanges msg and NRO for NRR with B (before application-level validation of msg). Then the DA provides NRR to A, equivalent to Message 2 in Figure 6. Subsequently, B performs application-level validation of msg (as in Message 3 of Figure 6) and provides NRV to the DA. The DA, in turn, provides NRV to A and provides acknowledgement of NRV to B. The exact sequence of the exchange will be dictated by the actual protocol used and should not be inferred from Figure 7.

Figure 8 - Interceptor approach

As shown in Figure 8, our approach is to deploy interceptors that act on behalf of the end users in an interaction. An interceptor has two main functions:
• to protect the interests of the party on whose behalf it acts by executing appropriate protocols and accessing appropriate services, including TTP services;
• to abstract away the detail of the mechanisms used to render an interaction safe and reliable for its end user.
In this case, the mechanism used communicates through the DA. It is the responsibility of the DA to ensure fairness and liveness for well-behaved parties in interactions that the DA supports. The introduction of interceptors means that, as far as possible, A and B are

3.4 Storage

The storage element addresses the need to store, manage and access information. In addition there is a requirement to be able to determine how a piece of information was derived. The Information Management and Repository services meet this need by providing configurable information storage and logging/auditing functionality. VOs must control and manage the exchange of information between the participants, and the role of the Information Management service in the GOLD Middleware is to support this exchange in three ways:
• to ensure a common structure and meaning for information shared across the VO;
• to provide information services and tools to support the controlled exchange of information according to the policies and contracts that are in place within the VO;
• to extract value from the information stored during the lifetime of a VO.
To support the information management requirements of VOs the GOLD Middleware provides an Information Model that defines the structure and meaning of information shared by its participants. This model can be divided into three categories:
• Generic - represents information that is required by all VOs. This includes descriptions of the VO structure, the participants, the tasks being performed, security policies, etc. The services that make up the generic GOLD VO infrastructure (i.e. those comprising the security, coordination and regulation architectural elements) all exchange information defined in this category of the information model.
• Domain specific - within a particular domain, there are types of information that are generic across a broad range of VOs.
• Application specific - information in this category represents specialist information describing a particular domain.
This information model is based on the myGrid information model [Sharman et al. 2004], which was designed to support VOs in the bioinformatics domain.

4 Conclusions

GOLD middleware offers a set of services that can be used to assist in the formation, operation and termination of virtual organisations. The aim of the project and the proposed architecture is to offer VO developers the flexibility to configure the VO
according to their requirements without imposing too many constraints or dictating what should be done and how. In this limited space we touched on four fundamental architectural elements and discussed in turn how they could be implemented. Adhering to certain principles regarding privacy and trust, we devised a security policy for authorisation and authentication that is based primarily on current WS standards. Virtual organisations bring together a number of independent entities with the aim of collaborating to achieve a common goal. This creates the need for some form of coordination regarding the message exchanges between those entities. Coordination is therefore a key aspect of collaborative working. Participants have to remain informed of certain events, and it is also necessary to ensure their obligations are dispatched at the appropriate time. The paper showed how regulation helps govern interactions between parties, ensuring that obligations are properly dispatched and rights are properly granted, through the use of contracts and contract enforcement mechanisms as well as monitoring mechanisms for auditing and accountability. Finally, some light was shed on the issue of storage and information management, and several broad requirements and implementation strategies were discussed.

References

BOLTON, F. 2001. Pure CORBA, SAMS, ISBN 0672318121.
BPEL4WS. 2002. BPEL4WS V1.1 specification. ftp://www6.software.ibm.com/software/developer/library/ws-bpel1.pdf.
CHADWICK, D. and OTENKO, A. 2003. The PERMIS X.509 role based privilege management infrastructure. Future Generation Computer Systems, 19, 2, 277-289.
CHECKLAND, P.B. and SCHOLES, J. 1990. Soft Systems Methodology in Action, John Wiley and Sons, Chichester.
CONLIN, A.K., ENGLISH, P.J., HIDEN, H.G., MORRIS, A.J., SMITH, R. and WRIGHT, A.R. 2005. A Computer Architecture to Support the Operation of Virtual Organisations for the Chemical Development Lifecycle. In Proceedings of European Symposium on Computer Aided Process Engineering (ESCAPE 15), 1597-1602.
COOK, N., SHRIVASTAVA, S. and WHEATER, S. 2002. Distributed Object Middleware to Support Dependable Information Sharing between Organisations. In Proceedings of IEEE Int. Conf. on Dependable Systems and Networks (DSN), Washington DC, USA.
COULOURIS, G. and DOLLIMORE, J. 1994. Security Requirements for Cooperative Work: A Model and its System Implications. In Proceedings of ACM SIGOPS European Workshop: Matching Operating Systems to Application Needs, Wadern, Germany.
DEMCHENKO, Y. 2004. Virtual organisations in computer grids and identity management. Information Security Technical Report, 9, 1, 59-76.
EMMERICH, W., BUTCHART, L., CHEN, L., WASSERMANN, B. and PRICE, S.L. 2005. Grid Service Orchestration using the Business Process Execution Language (BPEL). UCL-CS Research Note RN/05/07, Gower St, London WC1E 6BT, UK.
ERL, T. 2004. Service-Oriented Architecture: A Field Guide to Integrating XML and Web Services, Prentice Hall PTR, ISBN 0131428985.
FISLER, K., KRISHNAMURTHI, S., MEYEROVICH, L.A. and TSCHANTZ, M.C. 2005. Verification and Change-Impact Analysis of Access-Control Policies. In Proceedings of the 27th International Conference on Software Engineering, 196-205, St. Louis, MO, USA.
GREENBERG, M.M., MARKS, C., MEYEROVICH, L.A. and TSCHANTZ, M.C. 2005. The Soundness and Completeness of Margrave with Respect to a Subset of XACML. Technical Report CS-05-05, Department of Computer Science, Brown University.
HOLZMANN, G.J. 1991. Design and Validation of Computer Protocols, Prentice Hall.
HOLZMANN, G.J. 2004. The SPIN Model Checker: Primer and Reference Manual, Prentice Hall.
JMS. 2001. Java Messaging Service Specification (JMS), http://java.sun.com/products/jms/docs.html.
KREMER, S., MARKOWITCH, O. and ZHOU, J. 2002. An Intensive Survey of Fair Non-repudiation Protocols. Computer Communications, 25, 1601-1621.
KRISHNA, A., TAN, V., LAWLEY, R., MILES, S. and MOREAU, L. 2003. The myGrid Notification Service. In Proceedings of UK OST e-Science All Hands Meeting (AHM'03), Nottingham, UK.
LIBERTY. 2005. Specification documentation of Liberty Alliance Project. https://www.projectliberty.org/resources/specifications.php.
MARKOWITCH, O., GOLLMANN, D. and KREMER, S. 2002. On Fairness in Exchange Protocols. In Proceedings of 5th International Conference on Information Security and Cryptology (ICISC 2002), Springer LNCS 2587.
MOLINA-JIMENEZ, C., SHRIVASTAVA, S., CROWCROFT, J. and GEVROS, P. 2004. On the Monitoring of Contractual Service Level Agreements. In Proceedings of IEEE International Conference on E-Commerce (CEC), 1st International Workshop on Electronic Contracting (WEC), San Diego.
MOLINA-JIMENEZ, C., SHRIVASTAVA, S., SOLAIMAN, E. and WARNE, J. 2003. Contract Representation for Run-time Monitoring and Enforcement. In Proceedings of IEEE International Conference on E-Commerce (CEC), Newport Beach, USA.
MOLINA-JIMENEZ, C., SHRIVASTAVA, S., SOLAIMAN, E. and WARNE, J. 2005. A Method for Specifying Contract Mediated Interactions. In Proceedings of 9th IEEE International Enterprise Computing Conference (EDOC), Enschede, Netherlands.
MORGAN, R.L., CANTOR, S., CARMODY, S., …
Abstract
Currently several production Grids offer their resources to academic communities. Access to these Grids is free but restricted to academic users. As such, the Grids offer only basic quality-of-service guarantees and minimal user support. This includes Grid portals without workflow editing and execution capabilities, brokering without QoS and SLA management, and security solutions without privacy and trust management. Today's Grids do not provide any kind of support for running legacy code applications, and offer only very basic accounting without any billing functionality. This academic, experimental phase should be followed by opening up the Grid to non-academic communities, such as business and industry. The widening of the Grid user community defines additional requirements. This paper discusses how the GEMLCA legacy code architecture is extended in order to fulfil the requirements of commercial production Grids. The aim of our research is to facilitate the seamless integration of GEMLCA into future commercial Grids. GEMLCA, in this way, could serve as a reference implementation for any third-party utility service added to production Grids in order to improve their usability.
To integrate GEMLCA with future commercial Grids, additional features, such as accounting and charging based on SLAs, are required.

This paper explores some of the enhancements that have already been added to GEMLCA, or are currently under specification and development, aiming to fulfil the requirements of both academic and commercial production Grids.

2. Production Grids and Legacy Code Support

There are several production Grid systems, like the TeraGrid [6] and the Open Science Grid (OSG) [7] in the US, or the EGEE Grid [8] and the UK National Grid Service [4] in Europe, that already provide reliable production-quality access to computational and data resources for the academic community. All these Grids were set up as resource-oriented Grids based on Globus Toolkit version 2 (GT2) [9]. All are considering moving towards a service-oriented architecture using either gLite [10] or GT4 [11]. They are also considering providing service not only for the academic communities but for business and industry too. These objectives require significant enhancement of these Grid infrastructures and also the incorporation of additional user support services.

Enabling the seamless migration of legacy applications onto the new Grid platforms is one of the most important of these user support services. There is a vast legacy of applications solving scientific problems or supporting business-critical functionalities which should be ported onto the Grid with the least possible effort and cost. The only solution that supports this functionality at production level and has already been integrated into production Grids (as in the case of the UK NGS), or will be integrated in the near future (as with the OSG), is GEMLCA [1], developed by the University of Westminster. It "gridifies" legacy code applications, i.e. converts these applications into Grid services. The novelty of the GEMLCA concept, compared to similar solutions like [12], [13] and [14], is that it requires minimal effort from both system administrators and end-users of the Grid, providing a high-level, user-friendly environment to deploy and execute legacy codes on service-oriented Grids. Deploying a new legacy code service with GEMLCA means exposing the functionality of the legacy application as a Grid service; this requires a description of the program's execution environment and input/output parameters in an XML-based Legacy Code Interface Description (LCID) file. This file is used by the GEMLCA Resource layer to handle the legacy application as a Grid service.

GEMLCA provides the capability to convert legacy codes into Grid services. However, end-users still require a user-friendly Web interface (portal) to access the GEMLCA functionalities: to deploy, execute and retrieve results from legacy applications. Instead of developing a new custom Grid portal, GEMLCA was integrated with the workflow-oriented P-GRADE Grid portal [15], extending its functionalities with new portlets. The P-GRADE portal enables the graphical development of workflows consisting of various types of executable components (sequential, MPI or PVM …
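To make the LCID idea concrete, the following sketch generates an LCID-style description for a hypothetical legacy executable. The element names (`legacyCode`, `environment`, `parameter`) and the `fft_solver` example are invented for illustration; the real schema is defined by GEMLCA.

```python
import xml.etree.ElementTree as ET

def make_lcid(executable: str, environment: dict, params: list) -> str:
    """Build a hypothetical LCID-style description of a legacy executable.

    Element and attribute names are illustrative only, not the real
    GEMLCA schema."""
    root = ET.Element("legacyCode", name=executable)
    env = ET.SubElement(root, "environment")
    for key, value in environment.items():
        ET.SubElement(env, "variable", name=key).text = value
    io = ET.SubElement(root, "parameters")
    for p in params:
        ET.SubElement(io, "parameter", name=p["name"],
                      direction=p["direction"], type=p["type"])
    return ET.tostring(root, encoding="unicode")

lcid = make_lcid("fft_solver",
                 {"WORKDIR": "/opt/legacy/fft"},
                 [{"name": "input", "direction": "in", "type": "file"},
                  {"name": "result", "direction": "out", "type": "file"}])
print(lcid)
```

A file like this is all an administrator would have to write; the middleware, not the legacy program, is then responsible for presenting the executable as a Grid service.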
[Figure: P-GRADE portal workflow lifecycle - log in, browse repository, publish legacy code, create workflow, map execution, execute workflow, visualise execution]
[Figure 3: Connecting the GMT to the Grid broker - the portal runs GMT probes against Grid sites with legacy code resources, stores historical test data in MDS4, predicts resource availability from current and historical data, and feeds the GMT classifier output to the Grid broker (e.g. the LCG resource broker) when mapping the execution of a submitted workflow]
… offering for mapping only those GEMLCA resources whose latest GMT test results were positive. Although this does not guarantee that the resource will actually work when executing the workflow, the probability of a successful execution is significantly increased. Work is also under way to connect the GMT to the LCG resource broker, as illustrated in Fig. 3. GMT, as …

… oriented architecture, the Grid Accounting and Charging architecture (GAC) has been elaborated for accounting and charging for the usage of GEMLCA resources and services. The architecture, shown in Fig. 4, incorporates the accounting service, the charging service, the GAC database with accounts and usage records, and portlets to manage accounting and charging.
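The resource filtering described above - offering for mapping only those resources whose latest GMT test passed - can be sketched as follows. The probe records and resource names are illustrative, not the actual GMT data model.

```python
from datetime import datetime

# Hypothetical probe results collected by the GMT (illustrative data).
probes = [
    {"resource": "ngs-node-1", "time": datetime(2006, 5, 1, 9, 0), "passed": True},
    {"resource": "ngs-node-1", "time": datetime(2006, 5, 1, 12, 0), "passed": False},
    {"resource": "ngs-node-2", "time": datetime(2006, 5, 1, 11, 0), "passed": True},
]

def available_resources(probes):
    """Offer for mapping only resources whose most recent GMT test passed."""
    latest = {}
    # Walk probes in time order so each resource ends up with its latest result.
    for p in sorted(probes, key=lambda p: p["time"]):
        latest[p["resource"]] = p["passed"]
    return sorted(r for r, ok in latest.items() if ok)

print(available_resources(probes))  # only ngs-node-2 passed most recently
```

Note that the filter is advisory: as the text says, a passing probe raises the probability of a successful execution but cannot guarantee it.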
[Figure 4: The GAC architecture - an accounting service with accounting database and historical test data logs, and a charging service, built around the GAC database of accounts and usage records; RUR and RCR records flow between the production Grid (e.g. EGEE, NGS), third-party service providers, the Grid broker and Grid sites with legacy code resources as workflows are submitted]
… and stores the RUR in the corresponding resource and user accounts.

resource data              user data
host name / IP address     host name / IP address
certificate name           certificate name
host type

… accounts. Having the RCR, the charging service transfers the calculated amount from the user's account to the provider's account. To provide access to the providers' and users' accounts and manage them, an account portlet will be added to the GEMLCA portal.
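The charging step can be sketched as a transfer driven by a resource charge record; the RCR field names and account keys below are invented for illustration and are not the actual GAC record format.

```python
# Illustrative account store: balances keyed by account name (invented keys).
accounts = {"user:alice": 100.0, "provider:westminster": 0.0}

def apply_rcr(accounts, rcr):
    """Move the charged amount from the user's account to the provider's,
    as the charging service does once it holds the RCR."""
    user, provider, amount = rcr["user"], rcr["provider"], rcr["amount"]
    if accounts[user] < amount:
        raise ValueError("insufficient funds")
    accounts[user] -= amount
    accounts[provider] += amount

apply_rcr(accounts, {"user": "user:alice",
                     "provider": "provider:westminster", "amount": 12.5})
print(accounts)  # alice down 12.5, westminster up 12.5
```

In the real architecture the account portlet mentioned above would sit in front of such a store, giving providers and users access to their balances.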
… Periodic Nonuniform Sampling Sequences, Conf. Proc. of the Grid-Enabling Legacy Applications and Supporting End Users Workshop, within the framework of the 15th IEEE International Symposium on High Performance Distributed Computing (HPDC'15), Paris, France, June 19-23, 2006.
[4] The UK National Grid Service Website, http://www.ngs.ac.uk/
[5] The WestFocus Grid Alliance Website, http://www.gridalliance.co.uk/
[6] The TeraGrid Website, http://www.teragrid.org
[7] The Open Science Grid Website, http://www.opensciencegrid.org/
[8] The EGEE Website, http://public.eu-egee.org/
[9] The Globus Toolkit GT2, http://www.globus.org/
[10] EGEE gLite version 1.5 Documentation, http://glite.web.cern.ch/glite/documentation/default.asp
[11] The Globus Toolkit GT4, http://www.globus.org/
[12] Y. Huang, I. Taylor, D. W. Walker, "Wrapping Legacy Codes for Grid-Based Applications", Proceedings of the 17th International Parallel and Distributed Processing Symposium (workshop on Java for HPC), 22-26 April 2003, Nice, France. ISBN 0-7695-1926-1
[13] B. Balis, M. Bubak, M. Wegiel, "A Solution for Adapting Legacy Code as Web Services", in Proceedings of Workshop on Component Models and Systems for Grid Applications, 18th Annual ACM International Conference on Supercomputing, Saint-Malo, July 2004
[14] D. Gannon, S. Krishnan, A. Slominski, G. Kandaswamy, L. Fang, "Building Applications from a Web Service based Component Architecture", in "Component Models and Systems for Grid Applications", edited by V. Getov and T. Kielmann, Springer, 2005, pp 3-17, ISBN 0-387-23351-2.
[15] P. Kacsuk, G. Sipos, "Multi-Grid, Multi-User Workflows in the P-GRADE Grid Portal", Journal of Grid Computing, Springer Science + Business Media B.V., Vol. 3, No. 3-4, Dec 2005, pp 221-238, ISSN 1570-7873
[16] J. Novotny, M. Russell, O. Wehrens, "GridSphere: An Advanced Portal Framework", Conf. Proc. of the 30th EUROMICRO Conference, August 31 - September 3, 2004, Rennes, France.
[17] James Frey, Condor DAGMan: Handling Inter-Job Dependencies, http://www.bo.infn.it/calcolo/condor/dagman/
[18] Globus Workspace Management Service, http://www.globus.org/
[19] G. Terstyansky, T. Delaitre, A. Goyeneche, T. Kiss, K. Sajadah, S.C. Winter, P. Kacsuk, "Security Mechanisms for Legacy Code Applications in GT3 Environment", Conf. Proc. of the 13th Euromicro Conference on Parallel, Distributed and Network-based Processing, Lugano, Switzerland, February 9-11, 2005
[20] Andrea Guarise, "Grid Accounting in EGEE using DGAS: Current Practices", Terena Networking Conference, Poznan, June 8, 2005
[21] Shawn Mullen (IBM), Matt Crawford (FNAL), Markus Lorch (VT), Dane Skow (FNAL), "Site Requirements for Grid Authentication, Authorization and Accounting", GFD-I.032, GGF, October 13, 2004
[22] Alexander Barmouta, Rajkumar Buyya, "GridBank: A Grid Accounting Services Architecture (GASA) for Distributed System Sharing and Integration"
[23] RUR, Resource Usage Record Working Group, Global Grid Forum, http://www.gridforum.org/3_SRM/ur.htm
[Figure: TCAD parameter-extraction flow - device simulators produce device characteristics from device models, and circuit simulators produce circuit characteristics from circuit models; parameter extractors derive compact model parameters for compact devices and compact circuits]

… will be a novel application of RealityGrid (www.realitygrid.org.uk) and myGrid (www.mygrid.org.uk) software, and their integration will give rise to new requirements on both. As the RealityGrid software is maintained by project partners, …
… information will be supported and incorporated within the workflow management framework to influence real-time workflow enactment. The definition of appropriate schemas will be fundamental to this work and we will build on the OGSA working group's current comparison of GLUE and DMTF CIM, where the e-Science partners are intimately involved [8].

Understanding where a given simulation should be executed can be an involved process, drawing on numerous metrics, e.g. the status of the Grid infrastructure at that time, the chip architecture that the code has been compiled for, the location of specific data sets, the expected duration of the job, the authorisation credentials of the end user wishing to run the workflow etc. To support the scientific needs of the project, we will survey on-going work in this area, and if necessary develop and deploy our own meta-scheduling and planning services, building on the basic meta-scheduler currently supported in the NeSC BRIDGES project.

Metrics on data movement, as well as existing and predicted resource usage, will be explored as part of this work. This will include the exploration of different economic models, such as maximal job throughput and minimal job cost, in the presence of separately owned and managed Grid resources, each with their own specific accounting and charging policies. Given that the fEC model of research funding requires monies to be set aside for time spent on major HPC clusters, we propose to explore the existing state of the art in Grid-based accounting services. One of these which we will explore and potentially enhance is the Resource Usage Service [9], which is currently being considered for deployment across the NGS, and will be considered for ScotGrid (www.scotgrid.ac.uk) and the Northwest Grid. We plan here to draw on expertise and software from the Market for Computational Services project (see http://www.lesc.ic.ac.uk/markets/). In the latter part of the project, we also plan to explore advanced resource brokering, reservation and allocation. We would expect to learn from and feed into the GRAAP-WG at GGF and on-going efforts in this area, such as the WS-Agreement standard specification [10]. The case for advanced reservation and allocation will be tempered by the practical impact on reduction of overall utilisation of the compute resources and the associated cost impact this will incur. A better understanding of these issues is crucial in the fEC era.

3.3 Data Management Framework

The data sets that are generated by device modellers and circuit/systems designers are significant, both in size and in number, as indicated in Table 1.

Task                                                Accumulated data (project lifetime)
SPICE cell & cct char.                              10GB
T-level sim. (nanosim)                              5-10TB
Gate-level sim. (Verilog)                           1TB
Behavioural sim. (Sys C)                            100GB
Extraction                                          20GB
3D TCAD sim.                                        1TB
3D 'atomistic' sim.                                 5-10TB
Compact models                                      30GB
Circuit level fault sim.                            100GB
Behavioural sim.                                    10GB
Evolutionary systems                                30GB
Fragment sims.                                      10TB
Cells sims.                                         5TB
Extraction to STA                                   100GB
Sim. mixed-mode circuits with noise and variation   5TB

Table 1: Summary of Expected Data Set Accumulation

The tight linkage and integration of these data sets and models is paramount. We plan to use and extend the OGSA-DAI system (www.ogsadai.org.uk) and the OGSA Data Architecture efforts to meet this need, including the current work in OGSA-DAI to manage files as well as databases. Particular areas of focus will include: (i) the integration of OGSA-DAI with workflow systems, including data transformation capabilities; (ii) the development of appropriate meta-data schemas specific to the electronics domain; (iii) attaching security restrictions to data as it is moved (rather than to the sources and sinks themselves); (iv) comparison of remote data access with …
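As a rough worked example, the accumulation figures in Table 1 can be totalled with a short script, treating the 5-10TB entries as ranges. The parsing helper below is ours, not part of the project infrastructure.

```python
def to_gb(size: str) -> tuple:
    """Parse entries like '10GB' or '5-10TB' into a (min, max) range in GB."""
    unit = 1000.0 if size.endswith("TB") else 1.0
    value = size[:-2]                 # strip the 'GB'/'TB' suffix
    lo, _, hi = value.partition("-")  # '5-10' -> ('5', '10'); '10' -> ('10', '')
    return float(lo) * unit, float(hi or lo) * unit

# Accumulated-data column of Table 1, in row order.
table1 = ["10GB", "5-10TB", "1TB", "100GB", "20GB", "1TB", "5-10TB",
          "30GB", "100GB", "10GB", "30GB", "10TB", "5TB", "100GB", "5TB"]

lo = sum(to_gb(s)[0] for s in table1)
hi = sum(to_gb(s)[1] for s in table1)
print(f"{lo/1000:.1f}-{hi/1000:.1f} TB")  # prints 32.4-42.4 TB
```

So the table implies on the order of 30-40 TB of accumulated data over the project lifetime, which motivates the OGSA-DAI based data management framework described here.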
… and implementation areas. We expect to directly input the security solutions incorporating Shibboleth and advanced authorisation into OMII-UK version 5 releases (currently scheduled for 2007 in the draft roadmap) and provide a rigorous assessment of, and feedback on, their workflow and enactment engine and their enhancements.

4. Initial Design Scenarios

To understand how these frameworks and components will be applied to support the nanoCMOS Grid infrastructure, we outline one of the key scenarios that we intend to support. The requirements capture, design and prototyping phases that run through the lifetime of the project will refine this scenario and produce numerous other scenarios.

We consider a scenario where a scientist wishes to run an atomistic device modelling simulation based on the commercial Taurus software, which requires a software licence, to generate statistically significant sets of I/V curves. These I/V curves will then be fed into the Aurora software to generate compact models, which will subsequently be fed into a circuit analysis software package such as SPICE. At each of these stages the scientists will be able to see the results of the software production runs and influence their behaviour. Note that Taurus, Aurora and SPICE are representative examples only; a far richer set of simulation software will be supported.

Step 1: A Glasgow researcher attempts to log in to the project portal and is automatically redirected via Shibboleth to a WAYF service (we will utilise the SDSS Federation (www.sdss.ac.uk)) and authenticates themselves at their home institution with their home username/password (denoted here by an LDAP server, as used at Glasgow).

Step 2: The security attributes agreed within the nanoCMOS federation are returned and used to configure the portal, i.e. researchers see the services (portlets which will access the Grid services) they are authorised for. These security attributes will include, amongst others, licences for software they possess at their home institution and their role in the federation.

Step 3: A client portlet for the Taurus software is selected by the scientist.

Step 4: The scientist then explores and accesses the virtualised data design space for input data sets to the Taurus software production run. This might consist of experimental data, first-principle simulation data or data stemming from circuit or system inputs. Once again, what the scientist can see is determined by their security privileges (to protect IP). The meta-data describing the characteristics of the data, such as the confidence rating (whether it has been validated or superseded), who created it, when it was created, what software (or software pipelines) and which versions of the software were used to create the data, and whether there are any IP issues associated with the data, will all be crucial here.

Step 5: Once the user has selected the appropriate data sets needed for generation of the appropriate I/V curves, the meta-scheduler/planner is contacted to submit the job. Where the jobs are submitted will depend on which sites have access to an appropriate licence for the Taurus software as well as the existing information on the state of the Grid at that time.

Step 6: The meta-scheduler/planner submits the jobs to the Grid resources and the portlet is updated with real-time information on the status of the jobs that have been submitted (whether they are queued, running or completed). The actual job submission might be involved here, for example when the input files are very large and need to be partitioned. We will draw on the JSDL standards work here, e.g. through the OMII GridSAM software.

Step 7: On completion (or during the running of the Taurus simulations), the resultant I/V data sets are either stored along with all appropriate meta-data (not shown here) or fed directly (potentially via appropriate data format transformation services, not shown here) into the Aurora software, which in turn will decide where best to run the Aurora simulation jobs using the meta-scheduler/planner and available run-time information.

Step 8: The Aurora client portlet will allow for monitoring of the resultant compact models and allow these to be fed
into the SPICE models (also started and run using the meta-scheduler/planner). This orchestration of the different simulations, and how they can feed directly into one another, typifies the capabilities of the Grid infrastructure we will support and indicates how we will support a virtual nanoCMOS electronics design space. The e-Science partners in the project will initially design families of such workflows which the scientists can parameterise and use for their daily research; however, as the scientists become more experienced in the Grid technologies, they will design and make available their own workflows for others to use.

An interesting challenge arising from this domain, which the Grid infrastructure will allow us to explore, is what happens when designs have been recognised to be invalid, erroneous or superseded. Tracking data sets, and the relations between the design space and models, to keep the data design space as accurate as possible is novel. Further key challenges will be to ensure that IP issues and associated policies are demonstrably enforced by the infrastructure. Drawing on the e-Health projects at NeSC, we will show how data security, confidentiality and rights management can be supported by the Grid infrastructure to protect commercial IP interests.

5. Conclusions

The electronics design industry is facing numerous challenges caused by the decreasing scale of transistors and the increase in device design flexibility. The challenges, whilst great, are not insurmountable. We believe that through a Grid infrastructure and associated know-how, a radical change in the design practices of the electronic design teams can be achieved.

To address these challenges, the scientists cannot work in isolation, but must address the issues in circuit and systems design in conjunction with the atomic-level aspects of nano-scale transistors. We also emphasise that the success of this project will not be based upon development of a Grid infrastructure alone. It requires the electronics design community "as a whole" to engage. This can only be achieved if they are integrally involved in the design of this infrastructure. To achieve this we plan to educate the electronics design teams to an extent whereby they can Grid-enable their own software, design their own workflows, annotate their own data sets etc. It is only by the successful adoption of these kinds of practices that the infrastructure will "revolutionise the electronics design industry" as we hope.

6. References

1. International Technology Roadmap for Semiconductors, Sematech, http://public.itrs.net
2. R. Khumakear et al., "An enhanced 90nm High Performance technology with Strong Performance Improvement from Stress and Mobility Increase through Simple Process Changes", 2004 Symposium on VLSI Technology, Digest of Technical Papers, pp 162-163, 2004.
3. H. Wakabayashi, "Sub 10-nm Planar-Bulk-CMOS Device using Lateral Junction Control", IEDM Tech. Digest, pp 989-991, 2003.
4. A. Asenov, A. R. Brown, J. H. Davies, S. Kaya and G. Slavcheva, "Simulation of Intrinsic Parameter Fluctuations in Decananometre and Nanometre scale MOSFETs", IEEE Trans. on Electron Devices, Vol. 50, No. 9, pp 1837-1852, 2003.
5. P.A. Stolk, H.P. Tuinhout, R. Duffy, et al., "CMOS Device Optimisation for Mixed-Signal Technologies", IEDM Tech. Digest, pp 215-218, 2001.
6. B. Cheng, S. Roy, G. Roy, F. Adamu-Lema and A. Asenov, "Impact of Intrinsic Parameter Fluctuations in Decanano MOSFETs on Yield and Functionality of SRAM Cells", Solid-State Electronics, Vol. 49, pp 740-746, 2005.
7. Globus Toolkit Monitoring and Discovery System, www.globus.org/toolkit/mds
8. Open Grid Service Architecture Common Information Model, www.ggf.org/cim
9. Resource Usage Service Working Group, www.doc.ic.ac.uk/~sjn5/GGF/rus-wg.html
10. A. Andrieux, K. Czajkowski, A. Dan, K. Keahey, H. Ludwig, J. Pruyne, J. Rofrano, S. Tuecke and M. Xu, Web Services Agreement Specification (WS-Agreement), draft, 2004.
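Step 2 of the scenario above - configuring the portal from the security attributes released by the researcher's home institution - could be sketched as follows. The portlet names, attribute keys and licence requirements are all invented for illustration; the real attribute set would be agreed within the nanoCMOS federation.

```python
# Hypothetical portlet registry: each portlet names the licence (if any) and
# the federation roles it requires (all names invented for illustration).
PORTLETS = {
    "taurus-client": {"licence": "taurus", "roles": {"device-modeller"}},
    "aurora-client": {"licence": "aurora", "roles": {"device-modeller"}},
    "spice-client": {"licence": None,
                     "roles": {"circuit-designer", "device-modeller"}},
}

def visible_portlets(attributes):
    """Return the portlets a user is authorised to see, given the security
    attributes (role, institutional licences) returned after login."""
    licences = set(attributes.get("licences", []))
    role = attributes.get("role")
    return sorted(
        name for name, req in PORTLETS.items()
        if (req["licence"] is None or req["licence"] in licences)
        and role in req["roles"]
    )

# A device modeller whose home institution holds a Taurus licence:
print(visible_portlets({"role": "device-modeller", "licences": ["taurus"]}))
```

The same attribute check would gate each later step of the scenario, so that licence-restricted software such as Taurus is only ever exposed to users whose institution actually holds a licence.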
Mark Norman,
University of Oxford
1. Abstract
Who will be the grid users of tomorrow? We propose a categorisation of ‘future grid’ users into the following categories: Service End-User, Power User (with three distinct sub-types), Service Provider and Infrastructure Sysadmin. A further basic type, the Third Party Beneficiary, could also be argued for. This paper outlines the possible characteristics of these ‘types’ of users. For users that have layers of applications or, for example, a portal between them and the grid resource, it is almost certain that heavyweight security solutions, such as client digital certificates, are too onerous and unnecessary. Some users will, however, still need client digital certificates, due to the level of control that they may exert on individual grid resources. We also outline a Customer-Service Provider model of grid use. It may be that authentication and authorisation for the SEU ‘customers’ should be the responsibility of the Service Providers (SPs). This would hint at a more legal framework for delegating authority to enable grid use, but one which could be more secure and easier to administer. Such a model could also simplify the challenges of accounting on grids, leaving much of this onerous task to the Service Providers.
2. Introduction

2.1. The ESP-GRID project

The Evaluation of Shibboleth and PKI for Grids (ESP-GRID) project’s central aim was to achieve a deeper understanding of the potential role that Shibboleth can play in grid authentication, authorisation and security. One of the main outcomes of the project has been that Shibboleth is applicable to some users in some situations, and client-based PKI is applicable largely to more technical users in other situations. This gave rise to an examination of the future types of grid users, which arose from a series of brainstorming and consultation sessions with current grid users and developers. The thinking within this paper represents some of this output, supported by anecdotal evidence and findings in the literature.

2.2. When will the grid be really useful?

Before Netscape’s browser, Mosaic, was given away free in 1994, the Internet was the domain of the educated and technically knowledgeable. Even within that educated elite, the use of the Internet was dominated by a few research subject areas, possibly arenas in which the development of computing itself had been highly relevant for many years. What changed? The introduction of a graphical interface that was easier to use and more intuitive did increase the rate of uptake of home computing.

Many interested groups must hope that grid technology is approaching the metaphoric ‘release of the browser’ stage some time soon. Whether there will be a surge in take-up, as seen with Internet technologies after 1994, or a more steady increase, remains to be seen. However, it is the availability and ease of use to the greater community that will make the breakthrough. This paper is focussed mainly upon the educational and research use of grid technology. The engagement of the average citizen with grid technology will take much longer. We believe that the experience of take-up
37
Proceedings of the UK e-Science All Hands Meeting 2006, © NeSC 2006, ISBN 0-9553988-0-0
of the Internet is relevant to the divide between 'researchers experienced in programming or scripting' and the 'rest' of the research community.

2.3. Are we talking to the right users?

Anecdotal evidence of researchers refusing to engage with and benefit from grid technology suggests that when an application interface is presented that is easy to use, the uptake is strong (e.g. Sinnott, 2006). As the Market for Computational Services project notes, the inability to use a simple 'service' such as a resource broker in itself leads to a lack of ease of use and little motivation for the end user, leading to little or no take-up for real use (Grid Markets, 2003).

Authors have previously noted that current grid middleware is too intimidating for many users, and have often focussed on the security aspects (e.g. Beckles, 2004a). These aspects are important as they are often the most onerous for the non-computer specialist. In temporary lieu of the work, noted above, to collect requirements from current non-users (Beckles, 2004b), we believe that we should examine the types of users that are emerging within grid computing and consider their generic security profiles as well as their likely access management requirements. This may assist in identifying such users in order to carry out a real-world requirements analysis. However, until such an analysis is made, our work is merely a guide to the likely categories of users.

We have, by necessity, very technical users at present. This may distract us from building an accessible grid for future users who may be far less computing-technical.

The following sections of this paper present our view of these users of tomorrow. This is a personal view, based partly on recent experience within the ESP-GRID project and partly on predictions arising from the use of the Internet and the Web.

3. Types of grid users

3.1. Categories of users

Table 1 presents a summary of the types of grid users that exist, or that will exist in the very near future. Clearly, as with any
'categorisation' activity, there will be users who move frequently between the groups, and who may occupy two or more categories simultaneously. However, we believe that the categories are useful in examining high-level requirements, especially those of access control and security.

executes (or interacts with) only one grid node (a 'single point' job), whereas another may run a job that is divided, or subsequently splits, into many sub-jobs that interact with many grid nodes. These are useful definitions, but are probably applicable to nearly all of the actors in Table 1.
security risk than the other users, apart from the SEU. Nevertheless, code written by or submitted by the PUA will be run on a grid node somewhere and therefore the security risk may be seen as being moderate.

The PUS interacts directly with grid nodes, running code on those nodes. The security risk, from the viewpoint of the resource owners, is therefore much higher from this type of user and, if her identity is not known, it would be a requirement that
it could be traced easily and very quickly, should any 'breach in security' occur. The benefits of a system whereby the user can be traced accurately, when problems occur, should outweigh the benefits (if there are any) of logging explicit identities at each grid node. See Norman (2006) for a further examination of the issues of asserting explicit identity 'up front'.

The PUDS has a similar security profile to the PUS, but is beginning to take on some of the aspects of an SP and therefore could pose a threat both to the grid and to the test SEUs involved in the development. When interacting with the grid, there may therefore be a requirement for the PUDS to be explicitly identified.

As Table 2 indicates, the SP has a complex security profile. The SP (machine) is likely to be trusted by and/or to be explicitly identified to both SEUs and grid nodes. The SP (user) is also likely to have a profile similar to a PUDS when developing, testing and connecting to grid nodes directly. The SP machines may or may not be considered part of the grid: these machines may simply be gateways to the grid and not contribute directly to grid computation.

The security profile of the Grid-Sys has been expressed as 'Moderate'. This is due to two opposing influences. Firstly, for each grid node there will be very few system administrators, almost certainly in single figures. This means that the task of managing these users' authorisation information – and possibly authentication mechanisms – is relatively simple. Secondly and conversely, if an impostor were able to bypass the access management system, the risks to the grid node are very high.

3.3. An access management scenario

3.3.1. Two major routes of entry to the grid

Figure 1 outlines a likely scenario illustrating the access management 'behaviour' of these different types of grid users. As we have already established in Table 2, there are a variety of ways in which the access management requirements of each set of users, and of each resource protecting itself from each user, may be fulfilled. The scenario presented in Figure 1 is merely one of many that are possible. Nevertheless, we have depicted the PUA acting in two different ways, as these two ways are likely to be significant.

3.4. Proportions of users and our effort in servicing them

The assertions made in Table 3 are opinion only. However, if they are correct, then we need to find a way of engaging with users in the categories that are likely to account for a medium or high proportion of grid use in the future. It is obviously quite difficult to service such users when their current abundance is low, and very tempting to over-engage with the currently most common type of user.

4. The Customer-Service Provider grid relationship

4.1. SEUs dominate

It is clear, from our earlier assumptions, that the vast majority of 'users' of the grid, in future, will probably be Service End Users and, individually, these SEUs pose the lowest security threat, as their activities are highly controlled by the SP and the service application. If we accept these as basic assumptions, we can see some advantages for the simplicity of a multi-tiered security architecture. As Figure 1 shows, there is trust between the grid and the SP, and between the SP and the entity or organisation managing the user's credentials. Furthermore, the SP and the IdP are clear auditable points. We can thus envisage the SP as the true grid user. It is the SP entity that runs jobs on the grid
Figure 1: Possible access management behaviour scenario for the different types of grid users. (The PUDS has been omitted as it would contain elements of both the PUS and the SP.)
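The multi-tiered arrangement depicted in Figure 1 – SEUs authenticated against an identity provider, with the SP acting as the 'true grid user' and as an auditable point – can be illustrated with a toy Python model. This is only an illustrative sketch: all class and method names here are hypothetical stand-ins, not part of any real Shibboleth or PKI toolkit.

```python
# Toy model of the Customer-Service Provider pattern: the grid sees only
# the SP's identity, while the SP's audit log lets an SEU be traced if a
# breach occurs. All names are invented for illustration.

class IdentityProvider:
    """Holds the SEU's credentials; one of the auditable points."""
    def __init__(self, known_users):
        self.known_users = set(known_users)

    def authenticate(self, username):
        return username in self.known_users


class ServiceProvider:
    """The SP runs jobs on the grid under its own identity and keeps an
    audit trail mapping each job back to the SEU who requested it."""
    def __init__(self, name, idp):
        self.name = name
        self.idp = idp
        self.audit_log = []  # (seu, job) pairs: traceable, not anonymous

    def submit(self, seu, job, grid):
        if not self.idp.authenticate(seu):
            raise PermissionError(f"{seu} is not known to the IdP")
        self.audit_log.append((seu, job))
        return grid.run(self.name, job)  # the grid sees only the SP


class Grid:
    def run(self, sp_name, job):
        return f"ran {job} as {sp_name}"


idp = IdentityProvider({"alice"})
sp = ServiceProvider("portal-sp", idp)
grid = Grid()
result = sp.submit("alice", "analysis-job", grid)
# The grid node never saw 'alice'; the SP's audit log can trace her.
```

The design point this sketch makes is the one argued above: accurate after-the-fact traceability at the SP, rather than explicit identity assertion at every grid node.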
1 See http://wiki.oucs.ox.ac.uk/esp-grid/NeSC_Shibbolized_Resources
• If such applications are placed in portals (probably using web technology), the security threat profile of this vast majority of users is relatively low (being partly mediated by the application). Thus, heavyweight security solutions will not be needed for the majority of users.

• In such a scenario, Power Users will exist as a small proportion of users. Those users probably merit heavyweight security solutions being applied to them.

• Where applications are based for the benefit of most users, these provide

References

Grid Markets (2003). A Market for Computational Services: A Proposal to the e-Science Core Technology Programme. http://www.lesc.ic.ac.uk/markets/Resources/Tag.pdf. Also http://www.sve.man.ac.uk/Research/AtoZ/MCS/RUS/.

Norman, M.D.P. (2006). A case for Shibboleth and grid security: are we paranoid about identity? Proceedings of the 2006 UK e-Science All Hands Meeting.

Sinnott, R. (2006). Development of Usable Grid Services for the Biomedical Community. Proceedings of the 'Designing for e-Science: Interrogating new scientific practice for usability, in the lab and beyond' workshop at the UK National e-Science Centre, January 25-26, 2006.
Abstract

This paper describes a work-flow language 'Martlet' for the analysis of large quantities of distributed data. This work-flow language is fundamentally different to other languages as it implements a new programming model. Inspired by the inductive constructs of functional programming, this programming model allows it to abstract the complexities of data and processing distribution. This means the user is not required to have any knowledge of the underlying architecture or how to write distributed programs.

As well as making distributed resources available to more people, this abstraction also reduces the potential for errors when writing distributed programs. While this abstraction places some restrictions on the user, it is descriptive enough to describe a large class of problems, including algorithms for solving Singular Value Decompositions and Least Squares problems. Currently this language runs on a stand-alone middleware. This middleware can, however, be adapted to run on top of a wide range of existing work-flow engines through the use of JIT compilers capable of producing other work-flow languages at run time. This makes this work applicable to a huge range of computing projects.
The inspiration for this programming model came from functional programming languages, where it is possible to write extremely concise, powerful functions based on recursion. The reverse of a list of elements, for instance, can be defined in Haskell [5] as:

reverse [] = []
reverse (x:xs) = reverse xs ++ [x]

This simply states that if the list is empty, the function will return an empty list; otherwise it will take the first element from the list and turn it into a singleton list. Then it will recursively call reverse on the rest of the list and concatenate the two lists back together. The importance of this example is the explicit separation between the base case and the inductive case. Using these ideas it has been possible to construct a programming model and language that abstracts the level of parallelisation of the data away from the user, leaving the user to define algorithms in terms of a base case and an inductive case.

Along with the use of functional programming constructs, two classes of data structure, local and distributed, were created. Local data structures are stored in a single piece; distributed data structures are stored in an unknown number of pieces spanning an unknown number of machines. Distributed data structures can be considered as a list of references to local data structures. These data structures allow the functional constructs to take a set of distributed and local data structures, and functions that perform operations on local data structures. These are then used as a base case and an inductive case to construct a work-flow where the base function gets applied to all the local data structures referenced in the distributed data structures, before the inductive function is used to reduce these partial results to a single result. So, for example, the distributed average problem looked at in Section 3, taking the distributed matrix A and returning the average in a column vector B, could be written in Martlet as the program in Figure 1.

Because this language was developed for large-scale distributed computing on huge data sets, data is passed by reference. In addition to data, functions are also passed by reference. This means that functions are first-class values that can be passed into and used in other functions, allowing the work-flows to be more generic.

5 Syntax and Semantics

To allow the global referencing of data and functions, both are referenced by URIs. The inclusion of these in scripts would make them very hard to read and would increase the potential for user errors. These problems are overcome using two techniques. First, local names for variables are used in the procedure, so the URIs for data only need to be entered when the procedure is invoked. This means that in the procedure itself all variable names are short, and can be made relevant to the data they represent. Second, a define block is included at the top of each procedure where the programmer can add abbreviations for parts of the URI. This works because the URIs have a logical pattern, set by whom the function or data belongs to and the server it exists on. As a result, the URIs in a given process are likely to have much in common.

The description of the process itself starts with the keyword proc, followed by a list of arguments that are passed to the procedure, of which there must be at least one due to the stateless nature of processes. While additional syntax describing the read/write nature of the arguments could improve readability, it is not included as it would also prevent certain patterns of use. This may change in future variants of the language. Finally there is a list of statements between a pair of curly braces, much like C. These statements are executed sequentially when the program is run.

There are two types of statement: normal statements and expandable statements. The difference between the two is the way they behave when the process is executed. At runtime, an expand call is made to the data structure representing the abstract syntax tree. This call makes it adjust its shape to suit the set of concrete data references it has been passed. Normal statements only propagate the expand call through to any children they have, whereas expandable statements adjust the structure of the tree to match the specific data set they are required to operate on.

5.1 Normal Statements

As the language currently stands, there are six different types of normal statement: if-else, sequential composition, asynchronous composition, while, temporary variable creation, and process calls. Their syntax is as follows:

Sequential composition is marked by the keyword seq, signalling the start of a list of statements that need to be called sequentially. Although the seq keyword can be used at any point where a statement would be expected, in most places sequential composition is implicit. The only location where this construct really is required is when one wants to create a function in which a set of sequential lists
proc(A,B)
{
    // Declare the required local variables for the computation. Y and Z
    // are used to represent the two sets of values Yi and Zi in the
    // example equations. ZTotal will hold the sum of all the Zi's.
    Y = new dismatrix(A);
    Z = new disinteger(A);
    ZTotal = new integer(B);

Figure 1: Function for computing the average of a matrix A split across an unknown number of servers. The syntax and semantics of this function are explained in Section 5.
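For readers unfamiliar with the work-flow, a hypothetical Python analogue of the average computation that Figure 1 expresses in Martlet may help. It is a plain sequential sketch with invented variable names: the distributed matrix becomes an ordinary list of local pieces (flattened to numbers for brevity), and the comments echo the function names used in the work-flow, whereas real Martlet would distribute the pieces across servers.

```python
# Sequential sketch (invented names, not Martlet) of the distributed
# average: A is split into pieces; Y holds per-piece sums, Z per-piece
# element counts, and ZTotal the reduced count.
A_pieces = [[1.0, 2.0], [3.0, 4.0, 5.0]]   # the distributed matrix A

Y = [sum(p) for p in A_pieces]   # base case per piece   (cf. MatrixSum)
Z = [len(p) for p in A_pieces]   # per-piece cardinality (cf. MatrixCardinality)

ZTotal = 0
for z in Z:                      # inductive reduction of the counts
    ZTotal += z                  # (cf. IntegerSum)

B = sum(Y) / ZTotal              # combine and divide    (cf. MatrixDivide)
print(B)                         # 3.0
```

The point of the abstraction is that only the base case (per-piece work) and the inductive case (the reduction) need to be written; how many pieces exist, and where, is invisible to the programmer.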
Figure 3: The abstract syntax tree for the example map statement after expand has been called, setting A = [A1, A2, A3] and B = [B1, B2, B3] (a seq of f1(A1); f2(A1, B1); f1(A2); f2(A2, B2); f1(A3); f2(A3, B3)).

Figure 4: The abstract syntax tree for the example foldr statement after expand has been called, setting A = [A1, A2, A3] and B = [B1, B2, B3] (a seq of f1(A3, C); f2(B3, C); f1(A2, C); f2(B2, C); f1(A1, C); f2(B1, C)).
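The effect of the expand call on a map statement, as depicted in Figure 3, can be simulated in a few lines of Python. This is an illustrative toy, not Martlet's implementation; the class and method names are invented.

```python
# Toy simulation of an expandable 'map' statement: at expand time it
# reshapes itself into one concrete call per piece of the distributed
# argument it has been given. Names are invented for illustration.

class Call:
    """A concrete call on one local piece of data."""
    def __init__(self, fn, arg):
        self.fn, self.arg = fn, arg

    def run(self):
        return self.fn(self.arg)


class Map:
    """Expandable statement: adjusts its shape to the concrete data set."""
    def __init__(self, fn):
        self.fn = fn
        self.calls = []

    def expand(self, pieces):
        # One child per concrete data reference passed in at runtime.
        self.calls = [Call(self.fn, p) for p in pieces]

    def run(self):
        return [c.run() for c in self.calls]


m = Map(sum)                        # base function applied to each local piece
m.expand([[1, 2], [3, 4], [5, 6, 7]])  # concrete distributed data: three pieces
partials = m.run()                  # [3, 7, 18]
```

A normal statement, by contrast, would simply forward the expand call to its children without changing its own shape.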
foldr is a way of applying a function and an accumulator value to each element of a list. This is defined in Haskell as:

foldr f e [] = e
foldr f e (x:xs) = f x (foldr f e xs)

This means that the elements of a list xs = [1,2,3,4,5] can be summed by the statement foldr (+) 0 xs, which evaluates to 1+(2+(3+(4+(5+0)))).

foldr statements are constructed from the foldr keyword followed by a list of one or more statements which represent f. An example is shown below:

foldr
{
    f1(A,C);
    f2(B,C);
}

Its corresponding abstract syntax tree is a foldr node whose child is a seq containing f1(A,C) and f2(B,C).

When this function is expanded, it is replaced by a sequential statement that keeps any non-distributed arguments constant and calls f repeatedly on each piece of the distributed arguments, as shown in Figure 4.

foldl is the mirror image of foldr, so the Haskell example would now evaluate to ((((0+1)+2)+3)+4)+5. The syntax tree in Martlet is expanded in almost exactly the same way as for foldr; the only difference is that the function calls from the sequential statement are in reverse order. The only time that there is any reason to choose between foldl and foldr is when f is not commutative.

tree is a more complex statement type. It constructs a binary tree with a part of the distributed data structure at each leaf, and the function f at each node. When executed, this is able to take advantage of the potential for parallel computation. A Haskell equivalent is:

tree f [x] = x
tree f (x:y:ys) = f (tree f xs') (tree f ys')
  where (xs',ys') = split (x:y:ys)

split is not defined here since the shape of the tree is not part of the specification. It will, however, always split the list so that neither half is empty.

Unlike the other expandable statements, each node in a tree takes 2n inputs from n distributed data structures and produces n outputs. As there is insufficient information in the structure to construct the mappings of values between nodes within the tree, the syntax requires the arguments that the statements use to be declared in brackets above the function in
such a way that the additional information is provided.

Non-distributed constants and processes used in f are simply denoted by a variable name. The relationship between the distributed inputs and the outputs of f is encoded as (ALeft,ARight)\A->B, where ALeft and ARight are two arguments drawn from the distributed input A that f will use as input. The output will then be placed in B and can be used as an input from A at the next level in the tree.

Let us consider a function that uses a method sum, passed into the statement as s, a distributed argument X as input, and outputs the result to the non-distributed argument A. This could be written as:

tree((XL,XR)\X -> A)
{
    s(A,XL,XR);
}

Its abstract syntax tree is a tree node containing s(A, XL, XR).

When this is expanded, it uses sequential and asynchronous composition and temporary variables in order to construct the tree, as shown in Figure 5. Because of the use of asynchronous statements, any value that is written to must be passed in as either an input or an output.

5.3 Example

If the Martlet program to calculate averages from the example in Figure 1 were submitted, it would produce the abstract syntax tree shown in Figure 6. This could then be expanded using the techniques shown here to produce concrete functions for different concrete datasets.

6 Conclusions

In this paper we have introduced a language and programming model that use functional constructs and two classes of data structure. Using these constructs, it is able to abstract from users the complexity of creating parallel processes over distributed data and computing resources. This allows the user simply to think about the functions they want to perform, and does not require them to worry about the implementation details.

Using this language, it has been possible to describe a wide range of algorithms, including algorithms for performing Singular Value Decomposition, North Atlantic Oscillation and Least Squares computations.

To allow the evaluation of this language and programming model, a supporting middleware has been constructed [10] using web services supported by Apache Axis [3] and Jakarta Tomcat [4]. As we have found no projects with a similar approach aimed at a similar style of environment, a direct comparison with other projects has not been possible. This work is, however, currently being tested with data from the ClimatePrediction.net project with favorable results, and will hopefully be deployed on all our servers over the course of the next year, allowing testing on a huge data set.

At runtime, when concrete values have been provided, it is possible to convert abstract functions into concrete functions. The concrete functions then contain no operations that are not supported by a range of other languages. As such, it is envisaged that the middleware will in time be cut back to a layer that can sit on top of existing work-flow engines, providing extended functionality to a wide range of distributed computing applications. This capability will be provided through the construction of a set of JIT compilers for different work-flow languages. Such compilers need only take a standard XML output produced at runtime and perform a transformation to produce the language of choice. This would then allow a layer supporting the construction of distributed data structures and the submission of abstract functions to be placed on top of a wide range of existing resources with minimal effort, extending their use without affecting their existing functionality. Such a middleware would dramatically increase the number of projects to which Martlet is applicable. Hopefully the ideas in Martlet will then be absorbed into the next generation of work-flow languages. This will allow both existing and future languages to deal with a type of problem that thus far has not been addressed, but will become ever more common as we generate ever-larger data sets.

References

[1] David P. Anderson, Jeff Cobb, Eric Korpela, Matt Lebofsky, and Dan Werthimer. SETI@home: an experiment in public-resource computing. Commun. ACM, 45(11):56–61, 2002.

[2] Tony Andrews, Francisco Curbera, Hitesh Dholakia, Yaron Goland, Johannes Klein, Frank Leymann, Kevin Liu, Dieter Roller, Doug Smith, Satish Thatte, Ivana Trickovic, and Sanjiva Weerawarana. BPEL4WS. Technical report, BEA Systems, IBM, Microsoft, SAP AG and Siebel Systems, 2003.

[3] Apache Software Foundation. Apache Axis, 2005. URL: http://ws.apache.org/axis/.
Figure 5: When the tree function on page 7 is expanded with X = [X1, X2, X3, X4, X5], this is one of the possible trees that could be generated.

Figure 6: The abstract syntax tree representing the generic work-flow to compute the average introduced in Figure 1: a map applying the seq of MatrixSum(A,Y) and MatrixCardinality(A,Z), a tree applying IntegerSum(ZL,ZR,ZTotal), and finally MatrixDivide(B,ZTotal,B).
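The pairwise reduction performed by a tree statement (Figure 5) can be mimicked sequentially in Python. This is an assumed rendering of the Haskell tree/split definition given earlier, with invented names; a real execution could evaluate the two subtrees of each node in parallel.

```python
# Sequential toy version of the binary-tree reduction: the distributed
# pieces sit at the leaves, and a two-argument function f combines
# results pairwise at each internal node.

def tree(f, xs):
    if len(xs) == 1:
        return xs[0]
    mid = len(xs) // 2   # any split is allowed as long as neither half is empty
    return f(tree(f, xs[:mid]), tree(f, xs[mid:]))

def integer_sum(a, b):   # stands in for a call such as IntegerSum(ZL, ZR, ZTotal)
    return a + b

ZTotal = tree(integer_sum, [2, 3, 1, 4, 5])   # per-piece counts reduce to 15
print(ZTotal)                                  # 15
```

Because the shape of the tree is deliberately left unspecified, different splits may group the applications of f differently; the independent subtrees are what expose the potential for parallel computation.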
[4] Apache Software Foundation. The Apache Jakarta Project, 2005. URL: http://jakarta.apache.org/tomcat/.

[5] Richard Bird. Introduction to Functional Programming using Haskell. Prentice Hall, second edition, 1998.

[6] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Technical report, Google Inc, December 2004.

[7] E. Deelman, J. Blythe, Y. Gil, and C. Kesselman. Pegasus: Planning for execution in grids. Technical report, Information Sciences Institute, 2002.

[8] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. In SOSP '03: Proceedings of the nineteenth ACM symposium on Operating systems principles, pages 29–43, New York, NY, USA, 2003. ACM Press.

[9] Daniel Goodman and Andrew Martin. Grid style web services for climateprediction.net. In Steven Newhouse and Savas Parastatidis, editors, GGF workshop on building Service-Based Grids. Global Grid Forum, 2004.

[10] Daniel Goodman and Andrew Martin. Scientific middleware for abstracted parallelisation. Technical Report RR-05-07, Oxford University Computing Lab, November 2005.

[11] Tom Oinn, Matthew Addis, Justin Ferris, Darren Marvin, Martin Senger, Mark Greenwood, Tim Carver, Kevin Glover, Matthew R. Pocock, Anil Wipat, and Peter Li. Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics, 20(17):3045–3054, 2004.

[12] David Stainforth, Jamie Kettleborough, Andrew Martin, Andrew Simpson, Richard Gillis, Ali Akkas, Richard Gault, Mat Collins, David Gavaghan, and Myles Allen. Climateprediction.net: Design principles for public-resource modeling research. In 14th IASTED International Conference on Parallel and Distributed Computing and Systems, Nov 2002.
Flooding is a growing problem in the UK. It has a significant effect on residents, businesses and
commuters in flood-prone areas. The cost of damage caused by flooding correlates closely with the
warning time given before a flood event, and this makes flood monitoring and prediction critical to
minimizing the cost of flood damage. This paper describes a wireless sensor network for flood
warning which is not only capable of integrating with remote fixed-network grids for
computationally-intensive flood modeling purposes, but is also capable of performing on-site flood
modeling by organising itself as a ‘local grid’. The combination of these two modes of grid
computation—local and remote—yields significant benefits. For example, local computation can be
used to provide timely warnings to local stakeholders, and a combination of local and remote
computation can inform adaptation of the sensor network to maintain optimal performance in
changing environmental conditions.
using that platform, and section 5 discusses factors that can be used to drive adaptation. Finally, section 6 discusses our ongoing deployment and evaluation work, section 7 discusses related work, and section 8 offers conclusions and outlines our plans for future work.

2. Exploiting Local Computation

Our prototype flood prediction system uses local grid computation to provide improved support for flood monitoring. This section discusses how local computation can be used to (i) inform system adaptation, (ii) support diverse sensors and (iii) provide timely warnings to local stakeholders.

First, local computation can be used to drive the adaptation of WSN behaviour based on awareness of environmental conditions such as flood data and power monitoring. For example, based on the execution of point prediction models, we can switch to a more reliable network topology (see below) at times when node failure seems more likely (i.e. when imminent flooding is predicted). Adaptation may also be informed by input from computationally intensive spatial prediction models executed in the remote fixed grid.

Second, the availability of local computation can support richer sensor modalities such as image-based flow prediction [Bradley '03]. Image-based flow prediction is a novel technique for measuring water flow rates that uses off-the-shelf digital cameras. It is cheaper and more convenient to deploy than the commonly-used ultrasound flow sensors, but can only be used where significant computational power is available. Flow-rate measurements are calculated from a series of images taken by a digital camera deployed overlooking the river. Naturally occurring tracer particles are identified on the water surface and tracked through subsequent images, from which the surface velocity of the water is inferred. The data set used by this method, a sequence of high-resolution images, is too large for off-site transmission to be feasible using GSM or GPRS technologies, and the method is therefore impractical in current sensor network deployments. However, organising computationally-capable sensor nodes into a local grid allows analysis to be performed on-site and the results of this analysis then transmitted off-site.

Finally, on-site flood modeling allows timely flood warnings to be distributed to local stakeholders. These flood warnings are based on the results of point prediction models executed by the local grid of GridStix nodes and disseminated to local stakeholders in a range of formats, including on-site audio/visual warnings, a public web-site and SMS alerts. Each of these media has associated benefits and drawbacks. For example, SMS warnings are an effective method of publishing timely alerts to local stakeholders. However, SMS warnings require that users register for the service in advance and are therefore ineffective for stakeholders who might be unaware of a flood risk. Local audio/visual flood warnings may be effective without the need for stakeholders to proactively participate; however, their effectiveness is dependent upon stakeholders being within audio/visual range.

3. The GridStix Platform

3.1 Overview

In order to achieve the 'local grid' functionality discussed in the previous section, a powerful and versatile sensor platform is required (in terms of both hardware and software). This section describes such a platform – GridStix. More information on the platform itself is given in [Hughes '06].

3.2 Hardware Platform

In order to support the proposed functionality, a sensor node device must be capable of interfacing with a variety of sensors, including traditional sensors (e.g. depth sensors) and more novel sensors (e.g. the digital imaging hardware that is used to support the image-based flow prediction discussed above). A suitable device must also be capable of supporting a variety of wireless communications technologies to provide redundancy and allow sensor nodes to switch physical network technologies as conditions require. Finally, the device must have sufficient computational and storage resources to support the GridKit software platform (see below).

Sensor networks often make use of devices with extremely constrained local resources such as the Berkeley Motes [Xbow '06]. This is because such devices have extremely modest power requirements and can therefore operate for long periods on small batteries. However, such constrained platforms do not offer sufficient computational power to support functionality such as on-site flood prediction, nor do they offer sufficient support for diverse networking technologies and sensor types. For this reason, more powerful embedded hardware
has been selected for use in the GridStix platform.

Each GridStix node is based on the Gumstix [Waysmall ‘06] embedded computing platform, so named because each device is roughly the same size as a pack of gum. Despite their small size, each of these devices is equipped with a 400MHz Intel XScale PXA255 CPU, 64MB of RAM and 16MB of flash memory. These hardware resources support the execution of a standard Linux kernel and Java Virtual Machine, making them inherently compatible with the GridKit platform, which has been successfully deployed on comparable hardware such as PDAs [Cooper ‘05]. Furthermore, the PXA255, which performs comparably to a 266MHz Pentium-class CPU, is capable of executing a single iteration of a point prediction model in a matter of seconds.

The Gumstix devices also provide a variety of hardware I/O mechanisms, enabling connection to a variety of sensors. For example, a network camera (for image-based flow measurement) can be connected via a standard wired Ethernet connection, while flow sensors can be connected via an on-board serial port, and depth sensors can be connected via the GPIO lines of the XScale I2C bus. In this way it is possible to connect multiple sensors to a single device. In terms of networking, each device is equipped with an onboard Bluetooth radio and Compact Flash 802.11b network hardware, which is used to provide an ad-hoc communications infrastructure. Furthermore, the devices can be equipped with GPRS modems for transmitting and receiving data from off-site.

Of course, the above capabilities come at the expense of increased power consumption: while a Berkeley Mica Mote consumes only 54mW during active operation [XBow ‘06], our devices consume around 1W during typical operation, and thus it would not be feasible to power them for long periods using batteries alone. To address this, solar panel arrays are employed. Given aggressive power management, we have found that a single 15cm2 mono-crystalline solar panel with a maximum output of 1.9 watts, combined with a 6V 10Ah battery, is sufficient to continually power a device.

Finally, to minimise the effects of harsh weather conditions, flood water, vandalism, uncooperative grazing animals, etc., we have housed the devices in durable, water-tight containers that can safely be buried. In some cases burial is not possible (e.g. solar panel deployment). In such cases, the device is situated as discreetly and securely as possible to avoid unwanted attention.

3.3 The GridKit Middleware Platform

The GridKit middleware platform [Coulson ‘05] provides the key functionality that is required to support distributed systems such as grids, peer-to-peer networks and WSNs. GridKit is based on the OpenCOM [Coulson ‘02] component model, and the various facets of system functionality are implemented as independent component frameworks. This component-based approach allows developers to build rich support for distributed systems or, conversely, to build stripped-down deployments suitable for execution on embedded hardware such as the Gumstix.

Importantly, GridKit offers rich support for application-level overlay networks. Its Overlay Framework [Coulson ‘05] supports the simultaneous deployment of multiple overlay networks and enables these to interoperate flexibly (e.g. by layering them vertically or composing them horizontally). It also supports the adaptation of overlays, allowing, for example, one overlay to be swapped for another at run-time. The use of adaptable overlays is discussed in detail in section 4, which illustrates how overlays with different performance characteristics can be used to adapt to changing environmental conditions.

4. Supporting Adaptation

4.1 Overview

This section examines situations in which WSN adaptation is possible and then goes on to consider the factors that can be used to inform such adaptation. Three discrete classes of adaptation are identified: (i) adaptation at the level of the physical network, (ii) adaptation at the overlay network level, and (iii) adaptation of CPU performance.

4.2 Physical Network Adaptation

Our flood prediction system makes use of three wireless networking technologies, Bluetooth, IEEE 802.11b and GPRS, each of which has very different performance characteristics:

• The compact flash 802.11b hardware (SanDisk Connect Plus) supports speeds of 11Mbps at a range of 137 meters, 5.5Mbps at 182 meters, 2Mbps at 243 meters and 1Mbps at 365 meters. It offers significantly better performance than Bluetooth or GPRS
and has a maximum power consumption of approximately 0.5 watts.

• The class 2 Bluetooth radio (Ericsson ROK-104-001) supports speeds of up to 768Kbps at a range of up to 25 meters. It offers QoS that is significantly lower than 802.11b, but significantly higher than GPRS. It consumes a maximum of 0.2 watts [Ericsson ‘02].

• The GPRS modem (Ambicom GPRS-CF) supports uplink speeds of up to 29kbps and downlink speeds of up to 58kbps. Range is not an issue as we assume that the entire deployment site is within the bounds of GPRS coverage. GPRS offers lower QoS than either 802.11b or Bluetooth and consumes a maximum of 2 watts. Its operating wavelength of around 30cm is longer than that of the other technologies (which operate at around 12cm) and it therefore performs better under water [Ambicom ‘06].

Each of our three communication technologies clearly has advantages and disadvantages. For example, 802.11b offers good QoS and long range; however, it consumes significantly more power than Bluetooth. Conversely, GPRS offers much poorer performance, but is not limited by range. We discuss below how these differing characteristics can best be exploited.

4.3 Application-level Overlay Adaptation

As previously discussed, we employ application-level overlays to provide communications support for our flood prediction system. There are a range of overlays which can be used, and each has advantages and disadvantages.

Dijkstra’s algorithm [Dijkstra ‘59]. Examples of SP and FH spanning trees are shown in figure 1.

Figure 1: FH and SP Spanning Trees.

SP and FH are just two common spanning tree types and many others are also available. We are currently investigating the performance of a range of spanning trees for off-site data dissemination. Nevertheless, these two examples serve to illustrate the trade-off that often exists between overlay performance and power consumption.

4.4 CPU Power Adaptation

The XScale PXA255 CPUs used in the GridStix platform support software-controlled frequency scaling, which allows the CPU to be set at a variety of speeds from 100MHz to 400MHz. Processing power increases with clock speed but at the cost of increased power consumption. Figure 2 shows the relationship between the clock frequency of the XScale CPU and the power it consumes.

Figure 2: Power consumption of the XScale CPU (power use in watts against clock frequency).
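The trade-off between Shortest-Path (SP) and Fewest-Hop (FH) spanning trees discussed in section 4.3 can be sketched in a few lines. The topology, node names and link weights below are invented for illustration; the paper confirms that SP trees are built with Dijkstra's algorithm, while constructing the FH tree with breadth-first search is our assumption:

```python
import heapq
from collections import deque

def sp_tree(graph, root):
    """Shortest-Path (SP) tree: Dijkstra over link weights (e.g. latency)."""
    dist = {root: 0}
    parent = {root: None}
    heap = [(0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    return parent

def fh_tree(graph, root):
    """Fewest-Hop (FH) tree: breadth-first search, ignoring link weights."""
    parent = {root: None}
    q = deque([root])
    while q:
        u = q.popleft()
        for v, _ in graph[u]:
            if v not in parent:
                parent[v] = u
                q.append(v)
    return parent

# Hypothetical 4-node topology: the direct link A-D is one hop but slow
# (weight 10); the relayed path A-B-C-D is three hops but fast (total 3).
graph = {
    "A": [("B", 1), ("D", 10)],
    "B": [("A", 1), ("C", 1)],
    "C": [("B", 1), ("D", 1)],
    "D": [("A", 10), ("C", 1)],
}
print(sp_tree(graph, "A")["D"])  # "C": SP attaches D via the fast relayed path
print(fh_tree(graph, "A")["D"])  # "A": FH attaches D directly to the root
```

The FH tree keeps every node close to the root (fewer radio hops, so less forwarding and lower power), while the SP tree minimises path cost and hence gives better end-to-end performance, which is the trade-off figure 1 illustrates.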
cost of increased network, CPU and power resources.

Note that the performance-maximisation actions taken when flood prediction becomes time-critical are in direct conflict with the power-saving actions taken when battery power runs low (see above). In cases where logical adaptive actions conflict, the system must determine which action is most critical. We are currently in the process of assessing this through evaluation of system performance. Furthermore, this is by no means the only situation in which adaptations interfere with each other. In particular, there are a number of ways in which the type of physical network in use affects the optimal choice of overlay, and vice versa.

6. Deployment and Evaluation

6.1 Deployment

The system described above has already been built and tested in the lab and we are now preparing to deploy it in a real-world environment. The planned deployment site is at Cow Bridge, which is located on the River Ribble in the Yorkshire Dales. This site is prone to flooding for much of the year and thus offers good potential for evaluating the system under real-world conditions. Flooding at the site affects the nearby village of Long Preston, which thus additionally presents us with a motivation for evaluating warning systems for local stakeholders. The site is largely rural, which minimises the risk to deployed hardware due to theft and/or vandalism. We anticipate initial deployment during summer 2006, when instances of flooding are relatively uncommon.

Deployment at the site will cover approximately 1km of river with an initial installation of 13 nodes. The majority of sensors deployed will be depth sensors, along with a single image-based flow sensor and a single ultrasound flow sensor.

In particular, we are currently engaged in assessing the performance of the solar panels in terms of the power they produce in various weather conditions (hours of daylight, cloud cover etc.), and are assessing battery performance in terms of charge retention. In addition, we are assessing the performance of the physical networking hardware (802.11b, Bluetooth and GPRS) in terms of power consumption and QoS characteristics such as throughput, loss, delay and jitter. These factors are being evaluated with various usage profiles, in various weather conditions and under varying levels of immersion.

Based upon these basic performance characteristics, a simulator is being constructed that will allow the testing of various application-level networks and power management strategies, using site-specific topographical information, past weather conditions and past flooding data. This will be used to prototype potential deployment technologies and investigate the performance of large-scale deployments that could not be easily tested in the real world.

7. Related Work

A number of grid-related projects have addressed the issues of WSN-grid integration in general and WSN-based flood prediction in particular. A prime example of the former is the Equator remote medical monitoring project [Rodden ‘05], and a prime example of the latter is Floodnet [DeRoure ‘06]. However, these systems (to the best of our knowledge) all employ a ‘dumb’ proxy-based approach to integrating WSNs with the grid and thus cannot take advantage of the local computational power that we employ to drive the adaptation of WSN behaviour, to support richer sensor modalities such as image-based flow prediction, and to provide timely flood warnings to local stakeholders.
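The case for on-site processing over the proxy-based approach can be made concrete with a rough bandwidth calculation. Apart from the 29kbps GPRS uplink rate quoted in section 4.2, all figures below are illustrative assumptions rather than measured values from the deployment:

```python
# Back-of-the-envelope comparison: shipping raw images off-site for
# flow measurement versus transmitting only the locally computed result.
GPRS_UPLINK_BPS = 29_000      # bits/s: the uplink rate quoted in section 4.2
IMAGE_BYTES = 1_000_000       # assumed size of one high-resolution image
IMAGES_PER_MEASUREMENT = 10   # assumed sequence length per flow estimate
RESULT_BYTES = 16             # assumed size of one derived velocity value

raw_seconds = IMAGES_PER_MEASUREMENT * IMAGE_BYTES * 8 / GPRS_UPLINK_BPS
result_seconds = RESULT_BYTES * 8 / GPRS_UPLINK_BPS

print(f"raw images:  {raw_seconds / 60:.0f} minutes per measurement")
print(f"result only: {result_seconds:.3f} seconds per measurement")
```

Under these assumptions a single measurement's image sequence would occupy the GPRS uplink for roughly three quarters of an hour, whereas the computed velocity is transmitted in a fraction of a second, which is why the grid-of-sensors approach performs the analysis on-site.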
Abstract
Patient recruitment for clinical trials and studies is a large-scale task. To test a given drug, for example, it is desirable that as large a pool of suitable candidates as possible is used, to support reliable assessment of the often moderate effects of the drugs. To make such a recruitment campaign successful, it
is necessary to efficiently target the petitioning of these potential subjects. Because of the necessarily
large numbers involved in such campaigns, this is a problem that naturally lends itself to the paradigm
of Grid technology. However the accumulation and linkage of data sets across clinical domain
boundaries poses challenges due to the sensitivity of the data involved that are atypical of other Grid
domains. This includes handling the privacy and integrity of data, and importantly the process by
which data can be collected and used, and ensuring for example that patient involvement and consent is
dealt with appropriately throughout the clinical trials process. This paper describes a Grid infrastructure
developed as part of the MRC funded VOTES project (Virtual Organisations for Trials and
Epidemiological Studies) at the National e-Science Centre in Glasgow that supports these processes
and the different security requirements specific to this domain.
studies: data collection and study management. Furthermore, it is a requirement that the infrastructure that will be developed will be effective yet simple to use for the non-Grid personnel involved in the clinical trials process.

2. Clinical Patient Recruitment

As described previously, clinical patient recruitment is a large-scale and resource-consuming exercise. The human challenge in co-ordinating such a large effort can be immense, and in some cases, such as the UK Biobank Project [2], the number of potential candidates is so large that the use of distributed technology is mandatory for the task to be completed in a meaningful time-scale. (UK Biobank expects to recruit 500,000 members of the population between 40-69 years of age.)

2.1 Crossing Domain Boundaries

To effectively identify suitable trial candidates on a large scale requires knowing the structure of patient information data sets across a broad set of domains. For instance, to sample the national population of the UK would require knowing the data structures of the health services in England and Scotland, knowing how they relate to each other and knowing how to translate between the schema of both.

Ideally there should be a single electronic health record which captures all necessary health information associated with a given patient and which is accessed and updated by all health care providers throughout a patient’s lifetime. This should support tracking of a patient’s place of residence throughout their lifetime, and allow for cross-checking of records in one area against those of another. However, such a single e-Health record remains a distant wish, and a variety of heterogeneous and largely non-interoperable legacy infrastructures and data sets is the norm across the NHS, with paper-based patient case histories and records still commonly used.

To support the linkage of distributed data sets associated with a given patient, it is beneficial to have a common, unique identifier for patients that spans all domains and can be used to join the patient records between databases. In Scotland this unique identifier is a number known as the Community Health Index (CHI), which is currently being rolled out across the nation as part of a new Scottish parliamentary initiative. It is planned that the CHI number will be rolled out across all of Scotland by mid-2006. In England, the unique identifier is the NHS Number, which is a distinct entity from the CHI. Relating these two numbers, which have different structures in different contexts, is a major, yet unavoidable, challenge.

2.2 Recruitment Work-Flow

A patient recruitment process must ideally capture patient consent as early as possible. Whilst information on patients is stored in a variety of digital formats and locations, a priori consent is needed before these data sets can be accessed and queried to decide whether a given patient should be recruited to a clinical trial. One of the best sources of information associated with potential trial candidates is primary care, i.e. GPs’ databases. Whether a given clinical trial is in the interest of a particular patient is a question best answered by these GPs.

Figure 1 gives a pictorial representation of the process/workflow through which a clinical trial investigator and a GP might interact to support primary care recruitment.

1. The trial co-ordinator wishes to set up a new clinical trial, with a specific description of the drug/treatment and the patients’ characteristics that would potentially qualify them for entry into the clinical trial. They need to contact a set of GPs to describe this new trial and find out if these GPs have patients that fit the required criteria.

2. Assuming that the GP is interested in participating in this particular trial, they need to search their own patient records for anyone that may fit these criteria.

3. Assuming that one or more matching patients have been found, the GP decides if it is really in the patients’ interest to participate in the clinical trial. If this is the case then said patient is contacted by the GP and told about the trial. If they are willing to participate, having had the potential benefits and issues that might be associated with the trial fully explained, they are asked for their consent to use their personal data for this purpose. Once consent is obtained, this information is recorded.

4. Based upon the specifics of the clinical trial, various data sets are collected from the GP’s database. These can be non-identifying information such as age, height, weight, medical conditions and social and demographic details. Identifying information may be anonymised, with the de-anonymising keys maintained by the GPs.

5. This information along with the note of consent is communicated back to the trial
co-ordinator. The information is then validated and stored for later use in the trial.

there are other more subtle requirements that must also be met when attempting to set up a system that is flexible enough to use Grid technologies effectively, but also maintains the high standards of privacy and integrity required in the clinical domain. A term used to describe this particular entity is a Clinical Virtual Organisation (CVO). The level of prescription of security policies, and how they are enforced, must be strongly adhered to within such a CVO.
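The five-step recruitment workflow of section 2.2 can be sketched in a few lines. All record fields, helper names and the pseudonymisation scheme below are hypothetical illustrations, not VOTES code:

```python
# Sketch of the section 2.2 workflow: the GP searches local records (step 2),
# and only consented patients contribute de-identified data that is returned
# to the trial co-ordinator (steps 3-5).

def find_candidates(gp_records, criteria):
    """Step 2: the GP searches their own patient records for matches."""
    return [r for r in gp_records
            if all(r.get(k) == v for k, v in criteria.items())]

def collect_trial_data(patient, consented, pseudonym):
    """Steps 3-5: only consented patients contribute data. Identifying
    fields are withheld; the GP keeps the pseudonym-to-CHI mapping
    (the de-anonymising key) locally."""
    if not consented:
        return None
    data = {k: v for k, v in patient.items() if k not in ("name", "chi")}
    data["id"] = pseudonym
    return data

gp_records = [
    {"name": "A", "chi": "0101010000", "age": 54, "condition": "hypertension"},
    {"name": "B", "chi": "0202020000", "age": 61, "condition": "asthma"},
]
matches = find_candidates(gp_records, {"condition": "hypertension"})
report = collect_trial_data(matches[0], consented=True, pseudonym="subject-001")
print(report)  # only non-identifying data plus a pseudonym reaches the co-ordinator
```

The point of the sketch is the boundary it enforces: the query runs inside the GP's domain, and only consented, de-identified records cross it.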
necessary to access this record, and time is of the essence. The system here is to let the HCP have access to this information but have an irreversible record that this unprivileged access has occurred. When the immediate situation is resolved, the logs of the event should be investigated by an auditing authority, to see whether the actions of the HCP were justified.

3.2 Clinical Data Security Policies

A security policy defined in one site node of a VO will not necessarily have the same structure as that of a different node within the same VO. This is inherently tied to the structure of object classification within a specific domain, as security can only be defined and enforced on data that itself has a well-defined structure. Put another way, sites must have their own autonomy and hence define and enforce their own security policies on access to different data sets.

There are, however, numerous developments in standards for the description of data sets used in the clinical domain. These are complex and evolving, with numerous commercial bodies and standards groups involved in developing strategies, and include major initiatives such as Health Level 7 (HL7) [4], SNOMED-CT [5] and OpenEHR [6]. There is often a wide range of legacy data sets and naming conventions which impact upon standardisation processes and their acceptance. The International Statistical Classification of Diseases and Related Health Problems version 10 (ICD-10) is used for the recording of diseases and health-related problems and is supported by the World Health Organisation. In Scotland ICD-10 is used within the NHS along with ICD version 9. ICD-10 was introduced in 1993, but the ICD classifications themselves have evolved since the 17th century [7].

To compound this problem of classification, there is the issue of the dynamic nature of virtual organisations. One of the standard characteristics of a VO is that it is not only a loose collaboration between sites but also a transient one, with a limited lifetime and for a specific purpose. As such, any security policy enforced will also have a limited lifetime – requiring the re-evaluation of security requests after a given time period. This must be considered when defining security policies and establishing chains of trust.

3.3 Grid Security Solutions

Security solutions in the Grid community are largely categorised by where they fit into the “AAA” scenario.

Authentication – this is almost always achieved using Public Key Infrastructures (PKIs), where public and private keys and certificates are used to verify the authenticity of a user’s identity.

Authorization – as this is a more complicated requirement, in terms of establishing privilege rights based on identity, there is a wider range of possible solutions in the community (PERMIS [8], CAS [9], VOMS [10], Akenti [11]). These applications all have various advantages and disadvantages depending on the needs of the developer and the implementation idiosyncrasies, but no clear leader has yet been established in the field. In the VOTES project, authorization is provided by a simple implementation of an Access Control Matrix (see section 4.3).

Auditing – this is a security measure that has more relevance later in the production cycle of a system. Whilst not underestimating the importance of logging all user activity and being able to attach events to individuals, design and production in the VOTES project is currently focused on securing access to the system in the first instance and supporting the recruitment process.

4. VOTES Implementation

The details of the portal application that is currently under development to provide solutions to these security, usability and process challenges are now described in this section.

4.1 Grid Technologies

The implementation of the VOTES project has built largely upon the expertise in certain Grid technologies based at the National e-Science Centre in Glasgow. These include the following applications that provide tools to develop and maintain the application: GridSphere, Globus and OGSA-DAI.

GridSphere [12] is a web portal technology that provides easy access to secure grid services. It can be presented as a group of layered portlets allowing simultaneous, and interactive, task processing. A development suite is available that provides easy-to-use tools and tutorials for building and deploying these portal services.

The Globus Toolkit [13] similarly provides a set of distinct modular tools that allow development of Grid services. Version 4.0 of the toolkit is implemented in the VOTES project; it has been developed to the Web Services Resource Framework (WS-RF) specification [14], in line with a drive by Globus to align their service architecture with the more common web services standard [15].
OGSA-DAI [16] (Open Grid Services Architecture – Data Access and Integration) is a project that has developed a toolkit specifically for grid services that federate data from distributed sources. The OGSA-DAI services developed in VOTES again follow the WS-RF specification, but this is implemented in addition to direct JDBC connections to local data sources. The major advantage of using OGSA-DAI is that many different data sources, from DBM systems to XML and flat files, can be queried. The data returned is then rendered in a standard XML format, and tools are provided to allow easy analysis and presentation of this format.

4.2 GPASS and SCI Store

In the VOTES implementation thus far, the development team have been given access to relevant software and realistic (representative) data sets used to explore the functionality of the NHS services and data. Negotiations are on-going in rolling out the services described in the following sections across the wider NHS in Scotland and in gaining access to actual clinical data sets.

The General Practice Administration System for Scotland (GPASS) [21] is a client-side application that is in use by a large number of general practitioners, in over 890 practices in Scotland. It provides a portable, ‘on-the-spot’ electronic interface for GPs to input patient data, to be synchronised to the centralised SCI Store [22] repository at such times as a connection can be made. It also contains a local data store that houses and inter-relates information on drugs and treatments, thus providing a rudimentary ability to advise on a particular treatment, depending on patient symptoms and history [23]. The interface used to show the electronic record of patient history is shown in Figure 2.

Because of its proximity to the ‘front-line’ of primary care, GPASS is the most pertinent technology when looking at patient recruitment for clinical trials. The distributed and asynchronous structure of the application makes the process of integrating it securely into a wider system a challenging but necessary one.

The Scottish Care Information (SCI) Store [22] is a batch storage system which allows hospitals to add a variety of information to be shared across the community; pathology, radiology and biochemistry lab results are just some of the data that are supported by SCI Store. Regular updates to SCI Store are provided by the commercial supplier using a web services interface. Currently there are 15 different SCI Stores across Scotland (with 3 across the Strathclyde region alone). Each of these SCI Store versions has its own data models (and schemas) based upon the regional hospital systems it supports. The schemas and the software itself are still undergoing development.

SCI Store serves as a repository for this data across regional and national domains. As individual practitioners update their GPASS databases, the information is automatically uploaded to this central repository. The VOTES infrastructure makes use of this by developing grid services that allow interrogation of the centralised SCI Store repository when collecting study data, and also allow interrogation of the individual practitioner databases for patient recruitment. An interface between the grid services already developed in VOTES and a test GPASS database is shown in Figure 3.
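The data-federation idea behind the use of OGSA-DAI (section 4.1) can be sketched with the standard library alone: rows drawn from heterogeneous sources (here a relational table standing in for a SCI Store feed, and a flat-file-style source standing in for GPASS) are rendered into one common XML result format. This is not the OGSA-DAI API; the source names and fields are hypothetical:

```python
import sqlite3
import xml.etree.ElementTree as ET

def rows_from_sql(conn, query):
    """Pull rows from a relational source as a list of column->value dicts."""
    cur = conn.execute(query)
    cols = [c[0] for c in cur.description]
    return [dict(zip(cols, row)) for row in cur.fetchall()]

def to_xml(source_rows):
    """Render rows from several named sources as one standard XML result set."""
    root = ET.Element("resultSet")
    for source, rows in source_rows.items():
        src = ET.SubElement(root, "source", name=source)
        for row in rows:
            rec = ET.SubElement(src, "record")
            for k, v in row.items():
                ET.SubElement(rec, k).text = str(v)
    return ET.tostring(root, encoding="unicode")

# One relational source (a stand-in for a SCI Store database)...
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE labs (chi TEXT, test TEXT, value REAL)")
conn.execute("INSERT INTO labs VALUES ('0101010000', 'bilirubin', 12.0)")
# ...and one flat-file-style source already parsed into dicts.
flat = [{"chi": "0101010000", "height_cm": 172}]

xml_out = to_xml({"sci_store": rows_from_sql(conn, "SELECT * FROM labs"),
                  "gpass": flat})
print(xml_out)
```

As in the text's description of OGSA-DAI, the consumer sees a single XML rendering regardless of whether the underlying source was a DBMS, an XML store or a flat file.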
Figure 5: Two views of the recruitment portal. The left image shows the options available in the portal if the role is
that of the GP. These options are to check the information on the clinical trial, to use GPASS to auto-complete the
patient information and to notify the trial coordinator that patient information and consent has been obtained. The
right image shows the coordinator role, which has two options, one to initiate the organisation of the trial and one
to upload the final data for the patient.
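The role-dependent views in Figure 5 rest on the Access Control Matrix authorization mentioned in section 3.3. A minimal sketch of that idea follows; the roles, resources and permissions are hypothetical examples, not the project's actual policy:

```python
# Access Control Matrix: each (role, resource) cell holds the permitted
# actions; anything not explicitly granted is denied by default.
ACM = {
    ("gp", "patient_record"): {"read", "write"},
    ("gp", "trial_info"): {"read"},
    ("coordinator", "trial_info"): {"read", "write"},
    ("coordinator", "patient_record"): set(),  # co-ordinators never see raw records
}

def authorize(role, resource, action):
    """Grant an action only if the (role, resource) cell contains it."""
    return action in ACM.get((role, resource), set())

print(authorize("gp", "patient_record", "read"))           # True
print(authorize("coordinator", "patient_record", "read"))  # False: denied by default
```

The deny-by-default lookup is what lets the portal show a GP and a trial co-ordinator different option sets from the same underlying services.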
Abstract
At the centre of the Clinical e-Science Framework (CLEF) project is a repository of well organised,
detailed clinical histories, encoded as data that will be available for use in clinical care and in-silico
medical experiments. We describe a system that we have developed as part of the CLEF project,
to perform the task of generating a diverse range of textual and graphical summaries of a patient’s
clinical history from a data-encoded model, a chronicle, representing the record of the patient’s
medical history. Although the focus of our current work is on cancer patients, the approach we
describe is generalisable to a wide range of medical areas.
by combining them with visual navigation tools. In this way, we take advantage of the better accessibility and interactivity offered by visual timelines, as well as of the ability of natural language to convey complex temporal information and to aggregate numerical data.

2 Types of report

The intended end-user of the generated reports is a GP or clinician who uses electronic patient records at the point of care to familiarise themselves with a patient’s medical history and current situation. A number of specific requirements arise from this particular setting:

• Events that deviate from the norm are more important than normal events (e.g., an examination of the lymph nodes that reveals lymphadenopathy is more important than an examination that doesn’t). However, normal events should also be available on demand.

• Some events are more important than others and they should not only be included in the summary but also highlighted (through linguistic means, colour coding, graphical timelines or similar display features).

• Having different views of the same data is a useful feature, because it allows the clinician to spot correlations between events that they may have missed otherwise.

• Summaries that provide a 30-second overview of the patient’s history are often desirable; ideally, these should fit entirely on a computer screen. However, users should be able to obtain more detailed information about specific events by expanding their description.

Following these requirements, we proposed in this project an integrated visualisation tool where users can use a graphical interface coupled with a text generation engine in order to navigate through patient records. Textual reports have the advantage of offering a snapshot view of a patient’s history at any point in time; they can be used for checking the consistency of a patient’s record, can be amended and printed, and used in communication between clinicians or between clinicians and patients. Text is a good way of describing temporal information (events that happened at a certain position in time with respect to another event) and of summarising numerical data (for example, specifying that liver tests were normal instead of listing individual measurements for bilirubin concentration, alanine aminotransferase, alkaline phosphatase, aspartate aminotransferase, albumin and total protein). However, pure text is not always the best medium for presenting large amounts of information, part of which is numerical and most of which is highly interconnected. Text loses the ability to navigate through information, to expand some events and to highlight important dependencies between pieces of information. A textual report alone cannot effectively combine the time-sequence element with the semantic dependencies - both of which are essential in representing patient records. Depending on the type of report chosen, either one or the other of these elements will necessarily be emphasised at the expense of the other.

We envisage therefore that, depending on circumstances, users may want to have fully textual reports (for example, for producing printed summaries of a patient’s history) or combined graphical and textual reports (for interactive visualisation). In the following, we will describe the two reports generated in either of the two scenarios. Section 3 will describe in more detail the natural language generation techniques employed in generating both independent textual reports and report snippets that support the graphical interface.

2.1 Textual reports

Textual reports are views of a data-encoded electronic patient record (a chronicle), which is itself both a distillation and an integration of the elements within the traditional EPR. In this respect, they do not correspond to the narratives traditionally contained in a patient record, such as letters from clinicians, discharge notes and consult summaries. They are a new type of text that aggregates information from the full record.

Based on our requirements analysis with clinicians, we identified two main types of textual report that could be used in different settings. The first is a longitudinal report, which is meant to provide a quick historical overview of the patient’s illness, whilst preserving the main events, such as diagnoses, investigations and interventions. It describes the events in the patient’s history ordered chronologically and grouped according to their type. It contains most events in the history, although some preliminary filtering is performed to remove what is usually a small number of isolated events. The following example displays a fragment of a generated longitudinal summary.

(1) The patient is diagnosed with grade 9 invasive medullary carcinoma of the breast. She was 39 years old when the first malignant cell was recorded. The history covers 1517
Proceedings of the UK e-Science All Hands Meeting 2006, © NeSC 2006, ISBN 0-9553988-0-0
weeks, from week 180 to week 1697. During this time, the patient attended 38 consults.

YEAR 3:
Week 183
• Radical mastectomy on the breast was performed to treat primary cancer of the left breast.
• Histopathology revealed primary cancer of the left breast.
Week 191
• Examination revealed no enlargement of the liver or of the spleen, no lymphadenopathy of the left axillary lymphnodes, no abnormality of the haemoglobin concentration or of the leucocyte count.
• Radiotherapy was initiated to treat primary cancer of the left breast.
• ...

The second class of summary focuses on a given type of event in a patient's history, such as the history of diagnoses, interventions, investigations or drug prescriptions. In contrast to the longitudinal summaries, which are generic, this type of report is query-oriented, since it summarises only events which the user deems relevant.

A summary of the diagnoses, for example, will focus on the Problem events that are recorded in the chronicle, whilst other events only appear if they are directly related to a Problem. This type of summary is necessarily more concise, since the events do not have to appear chronologically and thus can be grouped in larger clusters. Secondary events are also more highly aggregated. For example:

(2) In week 483, histopathology revealed primary cancer of the right breast. Radical mastectomy on the breast was performed to treat the cancer.
In week 491, no abnormality of the leucocyte count or of the haemoglobin concentration, no lymphadenopathy of the right axillary lymphnodes, no enlargement of the spleen or of the liver and no recurrent cancer of the right breast were revealed. Radiotherapy was initiated to treat primary cancer of the right breast.
In the weeks 492 to 496, five radiotherapy cycles were performed.

A subclass of reports in this category is represented by reports of selective events. For example, a clinician may suspect that a certain patient has interrupted their chemotherapy package repeatedly and wants to see if this correlates with a certain medical condition such as anaemia, or if there are other causes behind it. In this case, they may order a report focused on incomplete chemotherapy packages and investigations of type blood test.

2.2 Visual reports

A visual report is a one-screen overview of a patient record (see Fig. 1), where various types of events are colour-coded and displayed along a timeline. Selection of events can be used for highlighting event dependencies or for generating focused textual reports. Apart from general history timelines, users can also investigate the trend of numerical values, for example increases and decreases in the bilirubin concentration from one test to another.

Figure 1: Visual history snapshot: minimum zooming; the user has selected Problems in the graphical interface and a summary of all recorded problems is displayed in the bottom pane.

The advantage of a visual display is that the user can have a global view of a patient's history. However, much information is hidden behind each event displayed on the timeline. The user can reveal this information by interacting with the graphical display. By zooming in or out, events are collapsed or expanded. For example, in Fig. 1 there is a chemotherapy event spanning 8 weeks. In a minimum zoom view, its cycles appear as a single chemotherapy event; by zooming in, the user will be able to see that there have been 6 chemotherapy cycles given successfully, and 3 chemotherapy cycles have been deferred. Hovering the mouse over any event will display as a tooltip a short description of the event. For example, hovering over the chemotherapy event in Fig. 1, the user will see the tooltip "A complete chemotherapy course was given from week 312 to week 320".
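The zoom-dependent collapsing just described can be sketched in a few lines. The sketch below is only our illustration of the idea, not the actual CLEF interface code; the event representation and function names are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Event:
    kind: str       # e.g. "chemotherapy cycle"
    week: int
    status: str     # "given" or "deferred"

def collapse(events, zoomed_in):
    """At minimum zoom, merge cycles of the same kind into a single
    timeline entry carrying a tooltip-style summary string."""
    if zoomed_in:
        return [f"{e.kind} ({e.status}) in week {e.week}" for e in events]
    first, last = min(e.week for e in events), max(e.week for e in events)
    given = sum(e.status == "given" for e in events)
    deferred = len(events) - given
    return [f"chemotherapy from week {first} to week {last}: "
            f"{given} cycles given, {deferred} deferred"]

# Illustrative data mirroring the 6 given / 3 deferred cycles in the text.
cycles = [Event("chemotherapy cycle", w, s)
          for w, s in [(312, "given"), (313, "given"), (314, "deferred"),
                       (315, "given"), (316, "given"), (317, "deferred"),
                       (318, "given"), (319, "deferred"), (320, "given")]]
print(collapse(cycles, zoomed_in=False)[0])
# chemotherapy from week 312 to week 320: 6 cycles given, 3 deferred
```

Zooming in would instead return one entry per cycle, matching the expand/collapse behaviour of the timeline.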
Further information about an event can be obtained by clicking on its icon. A chemotherapy event, for example, "hides" information about the particular drug regimen used, the exact dates of chemotherapy cycles and the reasons for deferring a particular cycle. Since this information is better expressed as text than graphically, each selection of an event will trigger the production of a report snippet that describes that particular event in more detail.

Apart from individual events, the user can also select multiple events (by clicking on several event icons on the timeline), classes of events (by clicking on the event name on the left-hand side of the screen) or time spans (by selecting years on the horizontal axis). The effect of such selections will be the production of summaries similar to those described in the previous section. Selection of events will produce event-focused summaries, whilst selection of time spans will produce longitudinal summaries for that particular span.

Semantic relations between events are displayed on demand, allowing the user to see the logical sequence of events (tracing, for example, the reason for performing a packed red cell transfusion to anaemia, which was in turn caused by chemotherapy performed to treat cancer).

3 The Report Generator

In the following, the term Report Generator will be used to designate the software that performs text generation, as a result of either a direct request from the user for a specific type of report or a selection of events in the graphical timeline. The output of the report generator may be either a full report or a report snippet but, in practice, the type of selection employed in choosing the focus of the report does not influence the technique used in generating it.

3.1 Input

The input to the Report Generator is a chronicle, which is a highly structured representation of an Electronic Patient Record in the form of a semantic network. It is beyond the scope of this paper to describe the methodology involved in transforming an EPR into a chronicle; the chronicalisation process is complex and involves Information Extraction from narratives, solving multi-document coreference, temporal abstraction and inferencing over both structured and information extraction data (Harkema et al., 2005). For the purpose of this paper, we consider the input correct and complete. The main advantage in using a chronicle, as opposed to a less structured Electronic Patient Record, lies in the richness of the information provided. Having access not only to facts but also to the relations between them has important implications in the design of the content selection and text structuring stages. This facilitates better and easier text generation and allows for a higher degree of flexibility in the generated text.

The chronicle relations can be categorised into three types according to their role in the generation process. Rhetorical relations are relations of causality and consequence between two facts (such as Problem CAUSED-BY Intervention or Intervention INDICATED-BY Problem) and are used in the document planning stage for automatically inferring the rhetorical structure of the text, as will be described in 3.2.2. Ontological relations such as Intervention IS-SUBPART-OF Intervention bear no significance in text planning and realisation, but can be used in content selection. Attribute relations such as Problem HAS-LOCUS Locus or Investigation HAS-INDICATION Problem are used in grouping messages in a coherent way, prior to the construction of the rhetorical structure tree.

3.2 System design

The system design of the Report Generator follows a classical NLG pipeline architecture, with a Content Selector, a Content Planner and a Syntactic Realiser. The Content Planner is tightly coupled with the Content Selector, since part of the document structure is already decided in the event selection phase. Aggregation is mostly conceptual rather than syntactic, and is therefore performed in the content planning stage as well.

3.2.1 Content selection

The process of content selection is driven by two parameters: the type of summary and the length of the summary. We define the concept of a summary spine to represent a list of concepts that are essential to the building of the summary. For example, in a summary of the diagnoses, all events of type Problem will be part of the spine (Figure 2). Events linked to the spine through some kind of relation may or may not be included in the summary, depending on the specified type and length of the summary. The design of the system does not restrict the spine to containing only events of the same type: a spine may contain, for example, Problems of type cancer, Investigations of type x-ray and Interventions of type surgery.
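Whether a spine-linked event is included can be pictured as a bounded traversal of a cluster graph. The sketch below is our own illustration of that criterion, assuming a plain adjacency-set encoding; it is not the actual Content Selector, and the node names follow the cluster example in Figure 3.

```python
from collections import deque

def select_content(clusters, is_spine, depth, min_cluster_size=4):
    """Keep spine events plus all nodes within `depth` hops of one.
    Clusters below min_cluster_size are discarded first (the paper
    removes clusters containing at most three events)."""
    selected = set()
    for graph in clusters:                  # graph: node -> set of neighbours
        if len(graph) < min_cluster_size:
            continue
        frontier = deque((node, 0) for node in graph if is_spine(node))
        seen = {node for node, _ in frontier}
        while frontier:
            node, dist = frontier.popleft()
            selected.add(node)
            if dist < depth:
                for neighbour in graph[node] - seen:
                    seen.add(neighbour)
                    frontier.append((neighbour, dist + 1))
    return selected

# The cluster of Figure 3, as an undirected adjacency map:
cluster = {
    "cancer": {"left breast", "radiotherapy"},
    "left breast": {"cancer", "radiotherapy"},
    "radiotherapy": {"cancer", "left breast", "ulcer", "radiotherapy cycle"},
    "ulcer": {"radiotherapy"},
    "radiotherapy cycle": {"radiotherapy"},
}
chosen = select_content([cluster], is_spine=lambda e: e == "cancer", depth=1)
print(sorted(chosen))   # ['cancer', 'left breast', 'radiotherapy']
```

With a depth value of 1 and cancer as the only spine event, the traversal reproduces the selection described in the text: cancer, left breast and radiotherapy are chosen, but radiotherapy cycle and ulcer are not.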
Figure 2: Spine events for a summary of diagnoses. [The diagram shows a chain of linked Problem, Investigation, Intervention, Locus and Drug events, with the Problem events forming the spine.]

The relations stored in the chronicle help in the construction of clusters of related events. A typical cluster may represent, for example, that a patient was diagnosed with cancer following a clinical examination, a mastectomy was performed to remove the tumour, a histopathological investigation on the removed tumour confirmed the cancer, and radiotherapy was given to treat the cancer, which caused an ulcer that was then treated with some drug. Smaller clusters are generally not related to the main thread of events, therefore the first step in the summarisation process is to remove small clusters (in the current implementation, these are defined as clusters containing at most three events; the threshold was set following empirical evidence). The next step is the selection of important events, as defined by the type of summary. Each cluster of events is a strongly connected graph, with some nodes representing spine events. For each cluster, the spine events are selected, together with all nodes that are at a distance of less than n from spine events, where n is a user-defined parameter used to adjust the size of the summary. For example, in the cluster presented in Figure 3, assuming a depth value of 1, the content selector will choose cancer, left breast and radiotherapy, but not radiotherapy cycle, nor ulcer.

Figure 3: Example of cluster. [The diagram shows cancer linked to left breast (Has_Locus), radiotherapy linked to cancer (Indicated_By), ulcer linked to radiotherapy (Caused_By) and radiotherapy cycle linked to radiotherapy (Is_Subpart_Of).]

The result of the content selection phase is a list of messages, each describing an event with some of its attributes specified. The number of attributes specified depends on the depth level of a message (i.e., how far from the spine the event is). For example, a Problem event has a large number of attributes, consisting of name, status, existence, number of nodes counted, number of nodes involved, clinical course, tumour size, genotype, grade, tumour marker and histology, along with the usual time stamp. If the Problem is a spine event, all these attributes will be specified, whilst if the Problem is 2 levels away from the spine, only the name and the existence will be specified.

3.2.2 Document planning

The document planner component is concerned with the construction of complete document plans, according to the type of summary and the cohesive relations identified in the previous stage. The construction of document plans is, however, initiated in the content selection phase: content is selected according to the relations between events, which in turn informs the structure of the target text.

The document planner uses a combination of schemas and a bottom-up approach. A report is typically formed of three parts: a schematic description of the patient's demographic information (e.g., name, age, gender); a two-sentence summary of the patient's record (presenting the time span of the illness, the number of consults the patient attended and the number of investigations and interventions performed); and the actual summary of the record produced from the events selected to be part of the content. In what follows, we will concentrate on this latter part.

The first stage in structuring the summary is to combine messages linked through attributive relations. This results in instances such as that shown in example (3), where a Problem message is combined with a Locus message to give rise to the composite message Problem-Locus.

In the second stage, messages are grouped according to specific rules, depending on the type of summary. For longitudinal summaries, the grouping rules will, for example, stipulate that events occurring within the same week should be grouped together, and further grouped into years. In event-specific summaries, patterns of similar events are first identified and then grouped according to the week(s) they occur in.
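The week-and-year grouping rule for longitudinal summaries can be sketched as follows. The message representation is our own illustrative one, not the CLEF document planner's, and the year = week // 52 convention is a guess chosen to match the "YEAR 3: Week 183" layout of Example (1); the paper does not define it.

```python
from collections import defaultdict

def group_longitudinally(messages):
    """Group (week, text) messages by week, then group the weeks into
    years, as the longitudinal grouping rules stipulate."""
    by_year = defaultdict(lambda: defaultdict(list))
    for week, text in messages:
        by_year[week // 52][week].append(text)   # assumed year convention
    return {year: dict(weeks) for year, weeks in sorted(by_year.items())}

messages = [(183, "Radical mastectomy was performed."),
            (183, "Histopathology revealed primary cancer."),
            (191, "Radiotherapy was initiated.")]
grouped = group_longitudinally(messages)
print(list(grouped))          # [3]  -- all three messages fall in year 3
print(len(grouped[3][183]))   # 2   -- two messages grouped under week 183
```

A realiser would then walk the nested dictionary, emitting a year heading, week headings and one bullet per message, in the style of Example (1).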
For example, if in week 1 the patient was examined for enlargement of the liver and of the spleen with negative results, and in week 2 the patient was again examined with the same results and underwent a mastectomy, two groups of events will be constructed, leading to output such as:

(3) In weeks 1 and 2, examination of the abdomen revealed no enlargement of the liver or of the spleen.
In week 2, the patient underwent a mastectomy.

Within groups, messages are structured according to discourse relations that are either retrieved from the input database or automatically deduced by applying domain-specific rules. At the moment, the input provides three types of rhetorical relation: Cause, Result and Sequence. The domain-specific rules specify the ordering of messages, and always introduce a Sequence relation. An example of such a rule is that a histopathology event has to follow a biopsy event, if both of them are present and they start and end at the same time. These rules help build a partial rhetorical structure tree. Messages that are not connected in the tree are by default assumed to be in a List relation to the other messages in the group, and their position is set arbitrarily. Such events are grouped together according to their type; for example, all unconnected Intervention events, followed by all Investigations.

In producing multiple reports on the same patient from different perspectives, or of different types, we operate under the strong assumption that event-focussed reports should be organised in a way that emphasises the importance of the event in focus. From a document structure viewpoint, this equates to building rhetorical structures where the focus event (i.e., the spine event) is expressed in a nuclear unit, and skeleton events are preferably in satellite units.

At the sentence level, spine events are assigned salient syntactic roles that allow them to be kept in focus. For example, a relation such as Problem CAUSED-BY Intervention is more likely to be expressed as

"The patient developed a Problem as a result of an Intervention."

when the focus is on Problem events, but as

"An Intervention caused a Problem."

when the focus is on Interventions. This kind of variation reflects the different emphasis that is placed on spine events, although the wording in the actual report may be different.

Rhetorical relations holding between simple event descriptions are most often realised as a single sentence (as in the examples above). Complex individual events are realised in individual clauses or sentences, which are connected to other accompanying events through the appropriate rhetorical relation. Additionally, the number of attributes included in the description of a Problem is a decisive factor in realising an event as a phrase, a sentence or a group of sentences. In the following two examples, there are two Problem events (cancer and lymphnode count) linked through an Investigation event (excision biopsy), which is indicated by the first problem and has the second problem as a finding. In Example 4, the problems are first-mentioned spine events, while in Example 5, the problems are skeleton events (the cancer is a subsequent mention and the lymphnode count is a first mention), with the Investigation being the spine event.

(4) A 10mm, EGFR +ve, HER-2/neu +ve, oestrogen receptor positive cancer was found in the left breast (histology: invasive tubular adenocarcinoma). Consequently, an excision biopsy was performed which revealed no metastatic involvement in the five nodes sampled.

(5) An excision biopsy on the left breast was performed because of cancer. It revealed no metastatic involvement in the five nodes sampled.

As these examples show, the same basic rhetorical structure consisting of three leaf nodes and two relations (CAUSE and CONSEQUENCE) is realised differently in a Problem-focussed report compared to an Investigation-based report. The conceptual reformulation is guided by the type of report, which in turn has consequences at the syntactic level.

3.2.3 Aggregation

The fluency of the generated text is enhanced by conceptual aggregation, performed on messages that share common properties. Simple aggregation rules state, for example, that two investigations with the same name and two different target loci can be collapsed into one investigation with two target loci. Consider, for example, a case where each clinical examination consists of examinations of the abdomen for enlargement of internal organs (liver and spleen) and examination of the lymphnodes. Thus, each clinical examination will typically consist of three independent Investigation events.
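An aggregation rule of this kind can be sketched as below, reproducing the sentence of Example (7) from the three Investigation findings of Example (6). The flat (observation, locus) encoding of findings is our simplification, not the CLEF message format.

```python
from collections import defaultdict

def aggregate(findings):
    """Collapse negative findings that share the same observation into
    one clause with coordinated loci (first-level aggregation)."""
    by_obs = defaultdict(list)
    for observation, locus in findings:       # insertion order is preserved
        by_obs[observation].append(locus)
    clauses = [f"no {obs} of the " + " or of the ".join(loci)
               for obs, loci in by_obs.items()]
    return "Examination revealed " + " and ".join(clauses) + "."

findings = [("enlargement", "spleen"),
            ("enlargement", "liver"),
            ("lymphadenopathy", "axillary nodes")]
print(aggregate(findings))
# Examination revealed no enlargement of the spleen or of the liver
# and no lymphadenopathy of the axillary nodes.
```

The two "enlargement" findings are merged into one clause with coordinated loci, while the lymphadenopathy finding remains its own clause.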
If fully expanded, a description of the clinical examination may look like:

(6) • examination of the abdomen revealed no enlargement of the spleen
• examination of the abdomen revealed no enlargement of the liver
• examination of the axillary lymphnodes revealed no lymphadenopathy of the axillary nodes

With a first level of aggregation, this is reduced to:

(7) Examination revealed no enlargement of the spleen or of the liver and no lymphadenopathy of the axillary nodes.

However, even this level of aggregation may not be enough, since clinical examinations are performed repeatedly and consist of the same types of investigation. We employ two strategies for further aggregating similar events. The first is to report only those events that deviate from the norm, for example abnormal test results. The second, which leads to larger summaries, is to produce synthesised descriptions of events. In the case of clinical examinations, for example, the system can describe a sequence of investigations such as the one in Example 7 as "The results of a clinical examination were normal" or, if the examination result deviates from the norm on a restricted number of parameters, as "The results of clinical examination were normal, apart from an enlargement of the spleen".

4 Related work

Natural language generation has been used in the medical domain for various applications: to generate drug leaflets (i.e., pill inserts) in multiple languages and styles (PILLS (Bouayad-Agha et al., 2002)), letters to patients to help them stop smoking (STOP (Reiter et al., 2003)), and individualised patient-education brochures (MIGRANE (Buchanan et al., 1992); HealthDoc (Hirst et al., 1997); Piglit (Binsted et al., 1995)). There is also a body of work on the generation of summaries of patient records (e.g., Afantenos et al. (2005); Elhadad and McKeown (2001)). This work, however, differs from ours in that it concentrates on the summarization of textual records, while we deal with summarization of data from Electronic Patient Records.

Most computer-based patient record management systems have simple generation facilities built in, which produce simple text, normally consisting of unconnected sentences and thus lacking fluency. Natural language generation techniques have been applied in various reporting systems for generating telegraphic textual progress notes (Campbell et al., 1993), reports on radiographs (Abella et al., 1995), bone scans (Bernauer et al., 1991) and post-operative briefings (Dalal et al., 1996).

The timeline method has been used extensively in visualising patient histories. The Lifelines project (Plaisant et al., 1998) provides a method for visualising and accessing personal histories by means of a graphical interface, and has been used for both patient records and legal case histories. TeleMed (Kilman and Forslund, 1997) gathers patient information from distributed databases and presents it in a Web interface as icons on a timeline; interaction with the icons provides access to more detailed descriptions of the individual pieces of information. Various authors describe the advantages of the timeline approach to visualising temporal data of the kind present in patient histories (Tufte, 1983; Cousins and Kahn, 1991).

5 Conclusion

We presented in this paper an innovative approach to the problem of presenting and navigating through patient histories as a means of supporting clinical care and research. Our approach uses a visual navigation tool combined with natural language generated reports. Although developed for the domain of cancer, the very same methods could be used with little adaptation effort for general health care; they are based on a general model of the main events in a patient's medical history that occur across all diseases and ailments: symptoms, diagnoses, investigations, treatments, side effects and outcomes. As such, we only make use of general medical knowledge, which applies to any medical sub-domain. The design of both the text generator and the visual navigator is completely data-driven by the system input, the chronicle.

An important consequence of the use of chronicles as input is that our system does not require complex domain semantics, which has been regarded as one of the essential components of NLG systems (Reiter and Dale, 2000). This is partly because the inferences that are normally required to combine and order facts in the generated summary have already been performed prior to the language generation process, and their results have been stored in the chronicle as relations between facts. Indeed, a key feature of our system is that, apart from the relations present in the input data, it does not use any kind of external domain knowledge in the process of content selection.
The only domain-specific rules used are those governing the ordering of messages; these are not essential, although they do improve the fluency of the text.

Our lack of reliance on domain semantics is a clear advantage for the portability of the system to other domains. It is nevertheless true that more specific domain knowledge could improve the summarization process, for example in deciding which events should be considered important (which will clearly vary from one medical area to the next). In our report generation system, this type of knowledge can be encoded as a set of external rules, whose application would not be essential to the system. These rules can be specified without interfering with the main application, and require no changes to previous code.

Current electronic patient records are human-friendly but highly impoverished from the point of view of systematic machine analysis and aggregation. By contrast, the ideal machine representation is far too complex to be human-friendly. Our research suggests that, with the combined graphical and natural language generation approach we have described here, this complex machine representation can be made both relatively familiar and friendly.

Acknowledgment

The work described in this paper is part of the Clinical E-Science Framework (CLEF) project, funded by Medical Research Council grant G0100852 under the E-Science Initiative. We gratefully acknowledge the contribution of our clinical collaborators at the Royal Marsden and Royal Free hospitals, colleagues at the National Cancer Research Institute (NCRI) and NTRAC, and the CLEF industrial collaborators.

References

A. Abella, J. Kender, and J. Starren. 1995. Description generation of abnormal densities found in radiographs. In Proceedings of the Symposium on Computer Applications in Medical Care (SCAMC), pages 542–546.

Stergos D. Afantenos, Vangelis Karkaletsis, and Panagiotis Stamatopoulos. 2005. Summarization from medical documents: A survey. Artificial Intelligence in Medicine, 33(2):157–177.

J. Bernauer, K. Gumrich, S. Kutz, P. Linder, and D.P. Pretschner. 1991. An interactive report generator for bone scan studies. In Proceedings of the Symposium on Computer Applications in Medical Care (SCAMC), pages 858–860.

K. Binsted, A. Cawsey, and R.B. Jones. 1995. Generating personalised patient information using the medical record. In Proceedings of Artificial Intelligence in Medicine Europe, Pavia, Italy.

N. Bouayad-Agha, R. Power, D. Scott, and A. Belz. 2002. PILLS: Multilingual generation of medical information documents with overlapping content. In Proceedings of LREC 2002, pages 2111–2114.

B.G. Buchanan, J.D. Moore, D.E. Forsythe, G. Carenini, and S. Ohlsson. 1992. Involving patients in health care: explanation in the clinical setting. In Proceedings of the Sixteenth Annual Symposium on Computer Applications in Medical Care (SCAMC'92), pages 510–512.

K.E. Campbell, K. Wieckert, L.M. Fagan, and M. Musen. 1993. A computer-based tool for generation of progress notes. In Proceedings of the Symposium on Computer Applications in Medical Care (SCAMC), pages 284–288.

Cousins and M. Kahn. 1991. The visual display of temporal information. In Artificial Intelligence in Medicine, pages 341–357.

M. Dalal, S. Feiner, and K. McKeown. 1996. MAGIC: An experimental system for generating multimedia briefings about post-bypass patient status. Journal of the American Medical Informatics Association, pages 684–688.

N. Elhadad and K. McKeown. 2001. Towards generating patient specific summaries of medical articles. In Proceedings of the Workshop on Automatic Summarization, NAACL 2001, Pittsburgh, USA.

H. Harkema, I. Roberts, R. Gaizauskas, and M. Hepple. 2005. Information extraction from clinical records. In Proceedings of the 4th UK e-Science All Hands Meeting, Nottingham, UK.

Graeme Hirst, Chrysanne DiMarco, Eduard H. Hovy, and K. Parsons. 1997. Authoring and generating health-education documents that are tailored to the needs of the individual patient. In Proceedings of the 6th International Conference on User Modeling, Italy.

David G. Kilman and David W. Forslund. 1997. An international collaboratory based on virtual patient records. Communications of the ACM, 40(8):110–117.

C. Plaisant, R. Mushlin, A. Snyder, J. Li, D. Heller, and B. Shneiderman. 1998. LifeLines: Using visualization to enhance navigation and analysis of patient records. In Proceedings of the 1998 American Medical Informatics Association Annual Fall Symposium, pages 76–80.

Alan Rector, Jeremy Rogers, Adel Taweel, David Ingram, Dipak Kalra, Jo Milan, Robert Gaizauskas, Mark Hepple, Donia Scott, and Richard Power. 2003. CLEF: Joining up healthcare with clinical and post-genomic research. In Second UK e-Science "All Hands Meeting", Nottingham, UK.

Ehud Reiter and Robert Dale. 2000. Building Natural Language Generation Systems. Cambridge University Press, New York, NY, USA.

E. Reiter, R. Robertson, and L.M. Osman. 2003. Lessons from a failure: Generating tailored smoking cessation letters. Artificial Intelligence, 144:41–58.

E.R. Tufte. 1983. The Visual Display of Quantitative Information. Graphics Press, Cheshire, Connecticut.
Sanjay Kharche¹, Gunnar Seemann², Lee Margetts³, Joanna Leng³, Arun V Holden⁴, and Henggui Zhang¹

¹ School of Physics and Astronomy, University of Manchester, Manchester, M60 1QD, UK
² Institute of Biomedical Engineering, University of Karlsruhe (TH), Karlsruhe, Germany
³ Directorate of Information Systems, University of Manchester, Manchester, M13 9PL, UK
⁴ Institute of Cell Membrane and Systems Biology, University of Leeds, Leeds, LS1 9JT, UK
Abstract
Atrial fibrillation (AF) is a common cardiac disease with high rates of morbidity, leading to major
personal and NHS costs. Computer modeling of AF using a detailed cellular model with realistic
3D anatomical geometry allows investigation of the underlying ionic mechanisms in far more detail
than in a physiology laboratory. We have developed a 3D virtual human atrium that combines
detailed cellular electrophysiology, including ion channel kinetics and homeostasis of ionic
concentrations, with anatomical detail. The segmented anatomical structure and multi-variable
nature of the system make the 3D simulations of AF large and computationally intensive. The
computational demands are such that a full problem-solving environment requires access to
resources of High Performance Computing (HPC), High Performance Visualization (HPV), remote
data repositories and a backend infrastructure. This is a classic example of e-Science and Grid-enabled
computing. Initial work has been carried out using multiple-processor machines with shared
memory architectures. As the spatial resolution of anatomical models increases, the requirement for
HPC resources is predicted to increase many-fold (~1–10 teraflops). Distributed computing is
essential, both through massively parallel systems (a single supercomputer) and through multiple
parallel systems made accessible through the Grid.
1. Introduction

AF is a condition in which the upper chambers of the heart contract at a very high rate and in a disorganized manner. As a result, the pulse of AF patients is highly irregular. AF affects about 4% of the population over the age of 60 years. The mortality rate associated with AF has increased over the past two decades. In the US, the age-standardized death rate (per 100,000 US population) increased from 28 to 70 in 1998 [1]. In the UK alone, around 500,000 patients are affected by AF, and the mortality rate has doubled since 1993. AF also increases the risk of stroke and heart failure, among other complications.

Figure 1. 3D anatomical model of the human female atrium showing PM (blue), CT (red), RA (purple), LA (green). SAN and BB are embedded close to the CT and cannot be seen.

The electrophysiology of the atrium, as in other cardiac tissue, is extremely complex. The membrane potential is determined by the integrated actions of ionic channels, Na/Ca exchanger and Na/K pump currents. In addition, regulative mechanisms governing the homeostasis of the intracellular calcium concentration also affect the membrane potential. Modeling such a system requires the use of high-order, stiff ordinary differential equations. For example, the Courtemanche et al. [2] (CRN) model for human atrial cells comprises a system of 24 ODE state variables. Such a description is necessary for the simulation of the effects of drugs, age and genetic disorders on ionic channel kinetics, and thus on single-cell and tissue electrical activities. Simple models, e.g. the FitzHugh-Nagumo (FHN) excitation model, allow fast prototyping and offer a (relatively) computationally inexpensive alternative for simulating electrical propagation in 3D geometries. Such models can be used as a starting point for the insertion of biophysically detailed cell models into anatomically detailed tissue models.
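To give a flavour of the simple-model alternative, the standard two-variable FitzHugh-Nagumo equations can be integrated with forward Euler in a few lines. The parameter values below are generic textbook choices, not those used by the authors, and the sketch is purely illustrative of the model class.

```python
def fhn_step(v, w, dt, i_ext, a=0.7, b=0.8, eps=0.08):
    """One forward-Euler step of the FitzHugh-Nagumo model:
        dv/dt = v - v**3/3 - w + I_ext
        dw/dt = eps * (v + a - b*w)
    """
    dv = v - v**3 / 3.0 - w + i_ext
    dw = eps * (v + a - b * w)
    return v + dt * dv, w + dt * dw

# Integrate 500 time units with a constant supra-threshold stimulus;
# the trajectory settles onto a relaxation oscillation (repeated APs).
v, w = -1.0, 1.0
trace = []
for _ in range(5000):
    v, w = fhn_step(v, w, dt=0.1, i_ext=0.5)
    trace.append(v)
```

A biophysically detailed model such as CRN replaces the two variables above with 24 coupled state variables and much stiffer kinetics, which is precisely what drives the HPC requirements discussed in this paper.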
The human atrium consists of the left atrium (LA), the right atrium (RA), the pectinate muscles and other distinctively different regions, with cells in each of the regions having different AP characteristics. The LA, RA, PM and BB have similar APs, with an action potential duration (APD90) of approximately 325 ms. However, in the CT the APD90 is 330 ms, in the atrial appendage it is known to be 310 ms, and in the atrioventricular (AV) ring it is 300 ms. The CRN model can be adapted by adjusting the maximal conductances of the transient outward current (Ito), the L-type calcium current (ICaL) and the rapidly activated potassium current (IKr) to simulate these heterogeneous APs [3], based on experimental data [4]. The results are shown in Figure 2, along with the corresponding heterogeneous calcium transients. Although there is heterogeneity in the APs and calcium transients, the resting potential remains almost the same (-81.5 mV), as do the upstroke velocity (220 mV/ms) and the peak potential (25 mV).

Figure 2. (A) AP heterogeneity in the human atrium. LA, RA and BB have an APD90 of 325 ms (solid blue) and the PM likewise (dash-dotted black); in the CT the APD90 is 330 ms (dashed green), in the atrial appendage it is 310 ms (dotted red) and in the AV ring 300 ms (dash-dotted cyan). (B) Intracellular calcium concentrations associated with the APs in A. Colour coding and dash styles are as in A.

The SAN is a pacemaker and is auto-rhythmic, with a rate of approximately 70 per minute, pacing the atrium at the base of the PM close to the CT. We have modified the CRN model by including a hyperpolarising-activated current (If), as given in Zhang et al. [5], to simulate the auto-rhythmic AP in the SAN, as shown in Figure 3. Details of the modifications to the CRN model to simulate the SAN AP are given in [3].

The 3D anatomical model for an adult female human atrium [6] was obtained from the National Library of Medicine of the National Institutes of Health in Bethesda, Maryland, USA [7, 8]. The spatial resolution of this data set is much higher than that of clinical MRI; the voxel size is 0.33 mm x 0.33 mm x 0.33 mm. The data were segmented as described in [6].

In the atrium, fibre orientation is not as
(PM), cristae terminalis (CT), the Bachmann’s structured as in the ventricle. The fibre
bundles (BB), and sino-atrial node (SAN). orientation of the atrial working myocardium
Figure 1 shows a composite of all these has a complex and nearly random fashion [9].
components. The SAN is the pacemaker of the Only the conduction pathways, i.e. CT, BB and
heart that initiates electrical action potential PM and the tissue near the mitral and tricuspidal
(AP). PM, CT and BB assist in conduction of ostia, near the superior and inferior vena cavae
electrical excitation waves. The human atrium is and near the pulmonary veins have a consistent
a heterogeneous tissue with cells in different orientation. The macroscopic anisotropy of the
regions having different action potential electrical excitation conduction pattern, as well
characteristics, preferential conduction as mechanical contraction, is strongly
pathways and varied fibre orientations. influenced by the spatial layout of the cardio-
78
Proceedings of the UK e-Science All Hands Meeting 2006, © NeSC 2006, ISBN 0-9553988-0-0
-40
granted through a Class 3 project, available for
new users to gain access to the system for
-60 evaluation purposes.
The Sun-Fire machines have suitable up to
-80 date C/C++ compilers for parallelisation using
0 500 1000 1500 2000 2500
OpenMP. MPI codes are compiled with
t (ms)
MPICH (freely downloadable from the web) on
Figure 3. Pacemaking AP in the SAN simulated
the Sun machines. CSAR has MIPSPro 7 series
using modifications to the CRN model as given
and Intel v series of compilers.
in [3].
myocytes, i.e., the orientation of the muscle 4. Computing aspects of the 3D model
fibres and layers inside the myocardium. A – necessisity of HPC
nearly random fibre orientation is reported
The 3D anatomical atrial model, as described in
inside the human SAN [10]. These fibre
the previous section, consists of 325 x 269 x
orientations have been incorporated in our
298 nodes. This amounts to more than 26 x 106
model.
nodes. The following description of computing
Normal electrical excitation in the atrium
resource is based on implementation of the
has been reconstructed previously [5, 11].
CRN 24 variable cell model, within human
Normal conduction starts at the SAN, which is
virtual atria. During a 3D simulation, the
the primary pacemaker of the heart with auto-
following arrays are required to be held in
rhythmic activity. Then the fast right atrial
memory. A tissue information array (~108
conduction pathways i.e. CT and PM transmit
integer values), the state array (109 consisting of
the excitation. During that phase, the right atrial
double-precision floating-point data type), the
working myocardium gets depolarised. At the
diffusion matrix (109 double-precision floating-
same time, BBs transmit the activation towards
point data type), spatial derivatives are required
the LA and the whole atrium is activated. All
to compute the heterogeneous conduction (108
conduction pathways are characterized by a
double-precision floating-point data type). We
large electrical conductivity along the fibre
can see that a minimum of 17.2 GB of memory
direction. The excitation reaches the AV node,
is necessary. Instead of the CRN model, if a
which is the only electrical path between atria
simple FHN model is used, the memory
and ventricles in the physiologically normal
requirement is reduced to 10 GB due to a large
case. After a short delay, the excitation is then
reduction in number of independent variables
conducted into the ventricles. The simulated
associated with each node. In either case, large
conduction pattern of atrial excitation is shown
amounts of contiguous memory is required.
in Figure 4, using simple FHN cell model.
The serial code for 3D models with FHN as
cell model was ran to estimate scalability.
3. Computational resouces Simulating 1000 time units of activity took 5
The simulations were performed using a number hours of computer time on a standard Sun-Fire
of different computing facilities. These included 880 workstation. Upon parallelization, the same
systems hosted by the Computational Biology run for FHN case was reduced to elapsed time
Laboratories (CBL) at Leeds, the University of of 1.5 hours using 4 processors, showing good
Manchester and the National HPC Service, scalability. Further optimization was
CSAR [12]. At CBL Leeds a 24 processor Sun- implemented by use of binary output. This
Fire-880 with UltraSPARC chip and a memory further improved performance by 7 % for the
of 24 GB is available. The principal author’s 1000 time unit simulation. Results of this
laboratory has a 4 processor Sun-Fire-880 with simulation are shown in Figure 4.
UltraSPARC chip with 16 GB of memory. A Any pathology motivated case study
local university wide (University of requires many repeated simulations with
Manchester) SGI machine with 32 SGI R14k different stimulus protocols or parameters in the
processors and a memory of 16 GB is also model. This increases enormously the compute
available [13]. All these are shared memory demand. AF due to re-entrant atrial excitation in
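The 17.2 GB figure above follows directly from the array sizes listed. A quick sanity check, assuming 4-byte integers, 8-byte doubles and 1 GB = 10^9 bytes (conventions inferred from the text, not stated in it):

```python
# Sanity check of the memory estimate above. Assumes 4-byte integers,
# 8-byte doubles and 1 GB = 1e9 bytes; these conventions are inferred,
# not stated explicitly in the text.

nodes = 325 * 269 * 298            # full regular grid of the atrial data set

arrays_bytes = {
    "tissue info": int(1e8) * 4,   # ~1e8 integer values
    "state":       int(1e9) * 8,   # 1e9 doubles (CRN: 24 variables per node)
    "diffusion":   int(1e9) * 8,   # 1e9 doubles
    "derivatives": int(1e8) * 8,   # 1e8 doubles for heterogeneous conduction
}
total_gb = sum(arrays_bytes.values()) / 1e9
```

With these assumptions the grid has 26,052,650 nodes and the arrays total 17.2 GB, matching the figures quoted in the text.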
Figure 4. Simulation of the normal conduction pattern in the human female atrium. Translucent blue denotes the atrial geometry and solid yellow represents the excitation. Propagation is from the SAN to the LA. The model is stimulated at the SAN (A, t = 0) and conduction spreads in the RA and along the BB and PM into the LA (B-F). Simulating 1000 time units of activity (frame F) takes 95 minutes on a standard 4-processor Sun SPARC.

AF due to re-entrant atrial excitation in the 3D atrium consists primarily of propagating scroll waves. Scroll waves can be initiated in the atrium using several protocols: a simple S1-S2 or cut-wave protocol, or one of the following [14]:
1. S1 is a stimulus at the SAN, followed by an ectopic excitation S2 located within the RA near the superior vena cava, 15 mm away from the SAN.
2. The S1-S2-S3 protocol. S1-S2 is the same as described above but, after a time delay, an S3 stimulus is applied at the same location as the S2.
3. Rapid pacing of the SAN at high frequency. This results in re-entry on the RA surface after a sufficient number of stimuli.
Initiating re-entry at the required location in the 3D geometry using the S1-S2 or S1-cut-wave protocols, or protocols 1 and 2, has to be done by trial and error, involving several trial simulations. Other possible stimulation protocols to induce normal and scroll waves have been described in [15, 16]. Using the cut-wave protocol, we initiated re-entry; the results are shown in Figure 5. Such a simulation of 1 s takes about 44 hours on 16 processors, using an improvement described below.

The full geometrical model demands a very large amount of contiguous memory. We have, however, exploited problem-specific features in our simulations and reduced these overheads considerably. The atrial tissue occupies only about 8% of the total data set, because the atrium is thin-walled and the atrial chambers and venae cavae leave large empty regions. We re-structure (or renumber) the arrays mentioned in section 4 so that only the real atrial nodes of the data set (leaving out the empty nodes), i.e. only 8% of the total 26 million nodes, and the related information are stored. This improved the efficiency of memory usage: by re-numbering the real atrial nodes, we do not store any data points that are not atrium. This reduced the memory required by the FHN model to less than 3 GB, and that required by the CRN model to less than 10 GB.

As a first step toward biophysically detailed modeling of the human atrium, we have simulated electrical propagation using shared memory systems. A shared memory system is not essential, however: the same results can be obtained using distributed memory systems, and distributed memory parallelism may give better scalability.

5. 2D atrial and other small-sized simulations

Often, 2D tissue simulations offer useful insights into the mechanisms underlying the
genesis of AF without dealing with complex 3D simulations. 2D tissue simulations were performed to investigate the link between AF genesis and gain-of-function in IK1 due to Kir2.1 gene mutation [17]. In this simulation, the tissue size was taken to be 37.5 mm x 37.5 mm, in accordance with the normal size of the atrial appendage. Spiral re-entry was initiated with a standard cross-field stimulation protocol [18]. To characterize the stability of re-entry, 10 s long simulations were performed for three cases: control, heterozygous and homozygous Kir2.1 gene mutation. Surface potentials were recorded to allow reconstruction of an animation of the activity from t = 0 to t = 10 s. In addition, the pseudo-ECG and the spiral wave tip positions were computed at run time. Sample frames from the 2D simulation results are shown in Figure 6.

Figure 5. Initiation and propagation of a scroll wave in the human female atrium for the healthy (control) case. Translucent blue shows the atrial geometry, and solid yellow denotes propagating electrical activity. The SAN was stimulated to induce a solitary propagation. After an appropriate duration of time, the upper half of the geometry was reset to resting conditions, simulating the cut-wave protocol (A, t = 210 ms). Propagation of the scroll wave can be seen in B (t = 230 ms), C (t = 250 ms), D (t = 270 ms) and E (t = 290 ms). In the control case, re-entrant scroll waves self-terminate (F, t = 335 ms).

Parallelization helps to reduce computing time and allows for a more intensive investigation. The program was parallelized using OpenMP and was run on a dual-processor Sun workstation; the time taken was 29 hours. We then ran it on a 4-processor Sun-Fire and the elapsed time was reduced to 16 hours. Our 2D code shows good scalability with near-ideal speedup; the small fraction of the code that is necessarily serial is the essential file output.

Following are two examples where distributed memory computing can assist in the investigation of cell and 1D tissue behaviors.

Cell models need to be investigated for various behaviors [19]. In cell models, a generic investigation is that of the pacing-based behavior of the AP. Detailed biophysical cell models are not in themselves overly demanding on memory, requiring the storage of several tens to a few hundreds of variables (e.g. in models incorporating Markov chain formulations of cellular processes). Pacing-based investigations are, however, long by the nature of the problem. A ventricular cell model developed by Priebe and Beckmann (PB) [20] for the non-failing and failing heart was paced at various rates, from a basic cycle length (BCL) of 100 ms to a BCL of 1200 ms in increments of 5 ms, with trains of 100 stimuli. The AP behavior for the final 10 stimuli was noted and plotted against BCL as a bifurcation diagram, to investigate the existence of AP alternans or 2:2 responses. Figure 7 shows the results from such a simulation. Carried out as a serial run on a single processor, this involved running the model for 60 hours. Inter-process communication in such a simulation is minimal, the only synchronization required being the accumulation of the results: the pacing at a given BCL is independent of the pacing result at any other BCL. This being a cell model, the simulation is not demanding on the memory of each node in a distributed memory system, and shared memory systems with large amounts of memory are not required.
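Because the response at one BCL is independent of every other, the pacing sweep just described is embarrassingly parallel. A minimal sketch of how the 221 BCL values (100-1200 ms in 5 ms steps) could be farmed out to workers; the cell model here is a trivial stand-in, not the Priebe-Beckmann model:

```python
# Sketch of the embarrassingly parallel BCL sweep described above.
# apd90_stub is a hypothetical stand-in for pacing a real cell model.

def bcl_values(start_ms=100, stop_ms=1200, step_ms=5):
    """Basic cycle lengths used in the sweep: 100 ms .. 1200 ms in 5 ms steps."""
    return list(range(start_ms, stop_ms + 1, step_ms))

def partition(values, n_workers):
    """Round-robin split of the BCL list across workers (e.g. MPI ranks)."""
    return [values[i::n_workers] for i in range(n_workers)]

def apd90_stub(bcl_ms):
    """Placeholder for applying 100 stimuli at this BCL and measuring the
    APD90 of the final beats; a real run would integrate the model ODEs."""
    return 0.25 * bcl_ms  # toy restitution curve, for illustration only

def run_sweep(n_workers=4):
    results = {}
    for chunk in partition(bcl_values(), n_workers):  # each chunk -> one process
        for bcl in chunk:                             # no communication needed
            results[bcl] = apd90_stub(bcl)
    return results
```

The only synchronization point is gathering the per-BCL results at the end, which matches the minimal inter-process communication noted in the text.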
Figure 6. Representative frames from the 2D simulations investigating the effects of gain-of-function in IK1 on AF. The top panel shows frames for the control case, the middle panel for the heterozygous and the bottom panel for the homozygous mutant type. The 2D simulations are shown at t = 400 ms in column I, t = 1300 ms in column II and t = 2650 ms in column III. A total of 10 s of activity was simulated.

If the code is parallelized for a distributed memory system, the run time can be reduced drastically, depending on the number of nodes.

Another example where low-cost distributed memory systems can be utilized is shown in Figure 8. Here we have a 1D transmural strand of human ventricle of length 15 mm with a spatial resolution of 0.1 mm, i.e. 150 cells, of which 40 are endo-, 50 are mid- and 60 are epicardial. Computing the vulnerability window (VW) at a ventricular cell location requires running the 1D serial program on one processor for 1.3 hours.

Figure 7. APD90 (ms) against BCL for the control and heart failure cases.

Again, as in the cell model case, computation of
What is evident from clinical observations is that, whilst we can determine some information from activity on the surface of the heart and from observations of how it is pumping our blood, it is difficult to know what is actually happening within the vessels and the tissues. With experimental work on heart function limited, the need has long been recognized for alternative methods to determine how the heart works, how it behaves in arrhythmogenesis, and how to treat these diseases. Realistic computational models of the heart, sometimes referred to as "virtual heart simulators", are currently the only viable approach that allows us to observe all parameters of interest at sufficient spatial and temporal resolution. This section aims to highlight a selection of the heart modelling research being undertaken in conjunction with Integrative Biology.

Forty years of research have yielded extensive experience of modelling cellular activity [1], [2], commencing with the work of Professor Denis Noble in Oxford in 1960. Since then, researchers have been simulating how cells behave and how electrical activity in the heart can be determined using these cell models. Oxford and Auckland have collaborated on joint research to explore cardiac activity through cell and tissue models, and the recently funded Wellcome Trust Heart Physiome project shows their commitment to this field. This project aims to build on existing models to help understand different mechanisms of arrhythmias. A simple change in a computer model or input parameter can replicate a mutation in a channel protein, for example, and the impact of that change can be followed in the model of the whole organ. This approach is already showing promising results and providing insight into why arrhythmias that look very similar when recorded on an electrocardiogram (ECG) can have many possible underlying causes.

If we take the models used by Denis Noble, which form the basis for many areas of research into heart disease and its treatment, we see that his models focused on the discovery and modelling of potassium channels in heart cells and the electric current flow through the cell wall [3]. Models of calcium balance were made in the 1980s and have since progressed to a high degree of physiological detail. During the 1990s, these cell models were used to model anatomically detailed tissue and organ anatomy. Supplementing this is an understanding of how the tissues in the heart shear and move with every heartbeat and how blood flows through the heart when it is pumping, as well as an understanding of proton transport mechanisms, as modelled by Professor Richard Vaughan-Jones, since protons affect the metabolic function of heart cells. Molecular modelling by Professor Mark Sansom supports the work on proton transport.

Treatment of ischemia and fibrillation is an area of interest in heart modelling, as resuscitation techniques in hospitals require informed refinement to improve patient survival rates. In-silico studies in Oxford and Tulane have found that ventricular anatomy and transmural heterogeneities determine the mechanisms of cardiac vulnerability to electric shocks, and therefore of defibrillation failure. This knowledge is important for the design of new defibrillation protocols, which could consist, for example, in the application of small-magnitude shocks to particular locations in the ventricles. This could result in a dramatic decrease in the defibrillation threshold (which could be defined as the minimum shock strength required to terminate the arrhythmogenic activity in the ventricles in order to restore sinus rhythm, or "to reset the ventricles").

Collaboration between the Medical University of Graz in Austria and the University of Calgary has resulted in the development of a cardiac modelling package called CARP (Cardiac Arrhythmia Research Package, http://carp.meduni-graz.at). CARP is a multipurpose cardiac bidomain simulator which has been successfully applied to study the induction of arrhythmias, the termination of arrhythmias by electrical shocks (defibrillation), and the development of novel micro-mapping techniques. CARP is optimized to perform well in both shared memory and distributed memory environments. Besides pure computing capabilities, CARP also comprises visualisation tools (Meshalyzer) and mesh generation tools. In Sheffield, computer models have been developed to model the process of ventricular fibrillation and resulting sudden cardiac death, as well as electric wave re-entry and fibrillation when the heart becomes unstable. Complex visualisation techniques have been developed to aid in this process and allow the tracking of filaments (the centre of a spiral wave) in 2D and 3D, where re-entrant activity manifests itself in the form of spirals (2D) and scroll waves (3D).
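Tip and filament tracking of the kind just described is commonly implemented by locating phase singularities: grid plaquettes around which the local activation phase winds by a full ±2π. A minimal sketch of the winding-number test (a generic technique, not the Sheffield group's actual code; the synthetic field below is purely illustrative):

```python
# Sketch of phase-singularity detection for spiral-tip / filament tracking.
# Generic winding-number method; not taken from any project's actual code.
import numpy as np

def wrap(a):
    """Wrap angle differences into [-pi, pi)."""
    return (a + np.pi) % (2 * np.pi) - np.pi

def phase_singularities(phase):
    """Winding number of the phase field around each 2x2 plaquette;
    +1/-1 marks a spiral tip (2D) or a filament crossing (3D slice)."""
    d1 = wrap(phase[:-1, 1:] - phase[:-1, :-1])   # bottom edge, +x
    d2 = wrap(phase[1:, 1:] - phase[:-1, 1:])     # right edge, +y
    d3 = wrap(phase[1:, :-1] - phase[1:, 1:])     # top edge, -x
    d4 = wrap(phase[:-1, :-1] - phase[1:, :-1])   # left edge, -y
    return np.rint((d1 + d2 + d3 + d4) / (2 * np.pi)).astype(int)

# Synthetic field with one counter-clockwise phase singularity near the centre.
y, x = np.mgrid[-8:9, -8:9]
charge = phase_singularities(np.arctan2(y + 0.5, x + 0.5))
```

In a real simulation the phase would be reconstructed from the membrane variables rather than prescribed, and in 3D the nonzero plaquettes of successive slices would be linked into filaments.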
Optical mapping is a widely used experimental technique providing high-resolution measurements of cardiac electrical activation patterns on the epicardial surface through the use of voltage-sensitive fluorescent dyes. However, inherent to the mapping technique are blurring and distortion effects due to fluorescent photon scattering from the tissue depths. Researchers in Oxford, in collaboration with US partners, have developed a mathematical model of the fluorescent signal that allows simulation of the electrical recordings obtained from optical mapping experiments. A bidomain representation of electrical activity is combined with finite element solutions to the photon diffusion equation to simulate photon transport within cardiac tissue during both the excitation and emission processes. Importantly, these simulations are performed over anatomically based finite element models of the rabbit ventricles, allowing direct comparison between simulation and whole-heart experimental studies [4].

These various approaches show how individual research is progressing in different areas and how, through collaboration with partner sites, the skills of other scientists are being leveraged.

3. Heart modelling – the computing challenge

Modelling the human heart in its entirety is beyond the capabilities of even the largest supercomputer. Even the smallest modelling activity has historically proved to be a challenge for a typical modeller who has developed and refined their models on their local desktop or laptop. Models would typically run for many days, if not weeks, limiting the capacity of simulations achievable with existing infrastructures. Some laboratories have been lucky enough to have local clusters usable by scientists and, if these scientists were lucky enough to have the support of those managing these infrastructures, they could benefit from the use of local services. Management of their data has, however, been ad hoc, and scientists often lacked the facilities to manage data correctly, with suitable metadata and provenance capture capability. Preliminary statistics showed that running 2 ms of electrophysiological excitation of a slab of tissue from the left ventricular wall, using the Noble 98 heart model, would require 1 day to compute on HPCx. This was calculated using a mesh with 20,886 bilinear elements (spatial resolution 0.6 mm) over 40 timesteps of 0.05 ms.

Clearly this is a huge scientific task which will require the collaboration of many researchers, but the infrastructure needs will also be extensive: seamless and secure data repositories to manage both the input data and the generated results; fast compute facilities, with potentially high-speed networks enabling fast communication between the facilities; and extensive visualisation tools, generating geometry details remotely and enabling the scientists to visualise the results in real time. By working with scientists in the design of such an infrastructure, and in the development of overarching virtual research environments to support this research process, Integrative Biology aims to develop an infrastructure which enables this science to achieve the anticipated results in finding the key to preventing and treating disease.

The computing services we are developing within the project allow researchers to target simulations at the most appropriate computer system, depending on the resources and response needed. They provide data and metadata management facilities, using the Storage Resource Broker (from UCSD), for looking after the many datasets created in computational experiments, and visualisation tools for examining results and discussing them collaboratively within the consortium. An associated project, funded by the Joint Information Systems Committee, is developing a Virtual Research Environment that is embedding these services into a portal-based framework, so they are easily usable by researchers who are not themselves computing experts. Present capability allows users to submit their models to the NGS and HPCx through a portal or Matlab interface, and to manage their data and provenance information through the concept of an experiment. This has opened up a new way of working for our scientific collaborators, and the following section shows some of the results of this grid-enabled science.

4. Examples of improved science through e-science
The following sections describe the results of Integrative Biology users who have leveraged the technical capability offered to them to perform their science in new ways. These in-silico experiments have resulted in publications in renowned journals and conferences (not only in heart modelling, but also in electrophysiology and computer science).

4.1 Heart defibrillation on the NGS

Dr Blanca Rodriguez and her collaborators in Oxford and Tulane have worked on how to apply electric shocks to the heart to restart it, investigating the effect of heart disease on the shocks' effectiveness. Their research has explored "windows of vulnerability": timings of shock application at which efficacy is reduced. To do so, they have used a code called Memfem from Tulane, which uses a finite element solver and two PDEs (partial differential equations) coupled with a system of ODEs (ordinary differential equations). Each small project has about 150 jobs, and each job generates 250 output files (each 4 MB), which amounts to 150 GB per project.

size of oxygen-deprived region with a rabbit heart model only, and that they may not apply to larger regions or to the human heart. But they are sufficiently intriguing to warrant further investigation. It is hoped that subsequent experiments will be able to quantify the time limit for re-starting a heart and thus improve patient treatment.

4.2 Transmural electrophysiological heterogeneities

Experimental and theoretical studies in both isolated ventricular tissue [7] and single myocytes [8] have proved that transmural dispersion in action potential duration (APD) exists, resulting from changes in ionic properties through the depth of the ventricular wall. In particular, three layers of functionally different cell types have been identified, namely the epicardial, endocardial and midmyocardial layers. Experimental and theoretical studies have proved that transmural dispersion in APD in the ventricles, which varies with age [9] and animal species [9], may modulate the arrhythmogenic substrate and thus could alter defibrillation efficacy. However, the role of transmural heterogeneities in the mechanisms underlying defibrillation failure is unknown.
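The three-layer transmural organisation described above can be caricatured in a few lines of code: a 1D reaction-diffusion cable whose recovery kinetics differ by region, with FitzHugh-Nagumo dynamics standing in for the biophysically detailed models used in these studies. All parameter values and the equal 50/50/50 layer split are illustrative only:

```python
# Illustrative 1D transmural strand with region-dependent recovery rates.
# FitzHugh-Nagumo kinetics stand in for detailed ionic models; every
# parameter here is a made-up, dimensionless placeholder.
import numpy as np

def simulate_strand(n=150, dx=0.1, dt=0.01, steps=2000, D=0.1):
    v = np.zeros(n)          # membrane (excitation) variable
    w = np.zeros(n)          # recovery variable
    v[:5] = 1.0              # stimulus at the endocardial end of the cable
    # region-dependent recovery time scale (endo / mid / epi thirds);
    # slower recovery in the midmyocardial layer mimics its longer APD
    eps = np.concatenate([np.full(50, 0.010),
                          np.full(50, 0.006),
                          np.full(50, 0.012)])
    for _ in range(steps):
        lap = np.zeros(n)
        lap[1:-1] = (v[2:] - 2 * v[1:-1] + v[:-2]) / dx**2  # no-flux-like ends
        v = v + dt * (D * lap + v * (v - 0.1) * (1.0 - v) - w)
        w = w + dt * eps * (0.5 * v - w)
    return v

final_v = simulate_strand()  # excitation wave part-way along the strand
```

The diffusion number D*dt/dx^2 = 0.1 keeps the explicit scheme stable; a production code would use the real cell models, physical units and an implicit or operator-split solver.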
on HPCx, and these build on simulations using simplified models that are run on local HPC resources, including the White Rose Grid. A key aspect of this work is to relate the findings to clinical practice. Work on modelling the initiation of VF has already yielded information that could be used to identify patients at risk, and this is the basis of a pilot clinical study about to start in collaboration with clinical colleagues at the Northern General Hospital in Sheffield. Work on the mechanisms that sustain VF is also tightly meshed with clinical and experimental studies, and is one component of a wider project focusing on understanding VF in the human heart that also involves the Universities of Auckland, Utrecht, Oxford and UCL.

active tissue. The first frame (650 ms) shows the propagation of the third paced beat shortly after the stimulus has been applied to the apex (bottom) of the heart. The second frame (790 ms) shows the state of the heart just before the premature stimulus is applied over the region shown. Part of this region has yet to recover, as shown by the yellow and light blue regions. The third frame shows the effect of the premature stimulus: the activity resulting from the stimulus can only propagate outwards and downwards, because it cannot propagate into areas that are still recovering.
The following project has been conducted by Enabling this science is an architecture that
Martin Bishop in Oxford using complex abstracts away the complexity of working with
mathematical models from Tulane and results disparate compute and data resources. This
from experiments conducted by Igor Efimov’s architecture has leveraged work from early e-
laboratory in St. Louis. Science projects. In order to handle the diversity
of end-users – or to be more precise, the
For the first time, a model of photon scattering diversity of environments in which our end-
has been used to accurately synthesize users routinely work – and the diversity of IT
fluorescent signals over the irregular geometry systems which they wish to exploit, the
of the rabbit ventricles following the application infrastructure is organized into 4 major layers.
of strong defibrillation shocks. During such
stimulation protocols there is a large transmural The upper layer is provided in two formats, one
variation in transmembrane potential through portal based to minimize the footprint on the
the myocardial wall. It is thus thought that users desktop/laptop, and the other based on a
fluorescent photon scattering from depth plays variety of familiar desktop tools such as Matlab
a significant role in optical signal modulation at shock-end. A bidomain representation of electrical activity is combined with finite element solutions to the photon diffusion equation, simulating both the excitation and emission processes, over an anatomically-based model of ventricular geometry and fiber orientation. Photon scattering from within a 3D volume beneath the epicardial optical recording site is shown to transduce these differences in transmembrane potential within this volume through the myocardial wall. This leads directly to a significantly modulated optical signal response with respect to that predicted by the bidomain simulations, distorting epicardial virtual electrode polarization produced at shock-end. We can also show that this degree of distortion is very sensitive to the optical properties of the tissue, an important variable to consider during experimental mapping set-ups. These findings provide an essential first-step in aiding the interpretation of experimental optical mapping recordings following strong defibrillation shocks.

to provide modellers with access to IB services in an environment in which they have already invested a significant learning and development effort. For some services, such as advanced visualization specific to these types of applications, some client-side support is required, necessitating the downloading of specific clients. This layer will also provide the interaction services for collaborative visualization which are required for steering and in situ decision making by partners.

The next layer contains Integrative Biology (IB) Services. These provide a defined, extensible set of services offering a simple abstract interface to the range of IT systems and facilities required by IB cardiac and tumour modellers. Supporting the IB Services are a set of Basic Services targeted at the various IT systems available to IB. These include front-ends to the Storage Resource Broker (SRB) data management service running on the UK National Grid Service (NGS) and low-level job submission facilities provided by the Globus Toolkit 2 (GT2) and the Open Middleware Infrastructure Institute (OMII), plus access to resources such as visualization and steering developed by the project. This Basic Services layer and the higher IB Services layer are Web service based. The Basic Services provide access to the underlying IT infrastructure systems used by the project through a variety of access mechanisms.
concept of a collaborative experiment, with the principal investigator in the experiment controlling access rights; and leveraging mature facilities for data management already developed by one of the project partners, CCLRC. The total volume of project data held in SRB is approaching 0.5 TB, the majority of which is raw output from numerical simulations. IB files are organized by studies and experiments and managed through use of a metadata schema based on the CCLRC Schema for Scientific Metadata.

6. Conclusions

Researchers across the Integrative Biology consortium are developing complex computer models of how the heart operates; by coupling these models, they hope to be able to determine what causes heart disease and, potentially, how to prevent it. These complex models require immense amounts of compute power, which is not available on the scientist's desktop. For this reason, the project is enabling access to HPC resources through the National Grid Service (NGS) and the High Performance Computing Service (HPCx), using tools that hide the complexity of working with such infrastructures from the scientist. This infrastructure and these services are enabling modellers to do far more than build static models of the heart. By having almost instantaneous access to data, computing and visualisation software held on the NGS and HPCx, and ultimately on global IT infrastructures, the IB project hopes to allow researchers to change conditions in the heart and see how the model reacts in almost real time – the nearest thing to experimenting with a live heart. Early feedback from users has shown that by partnering with providers of advanced technology facilities, science can progress faster, and the results shown here are evidence of the advances made to date in benefiting clinical patient care.

7. Acknowledgements

We wish to thank the other members of the Integrative Biology consortium for their input into, and comments on, the work described in this paper, in particular the scientific collaborators who determined the requirements for the project and provided the results for this paper. We also wish to thank EPCC for help with installing and optimising our codes on HPCx. We acknowledge the financial support of the EPSRC (ref no: GR/S72023/01) and IBM for the Integrative Biology project, and of the Joint Information Systems Committee for the Integrative Biology Virtual Research Environment project. R. Clayton acknowledges support from the British Heart Foundation (PG03/102/15852). G. Plank was supported by the Austrian Science Fund FWF (R21-N04).
Vito Perrone1, Anthony Finkelstein1, Leah Goldin1, Jeff Kramer2, Helen Parkinson3,4,
Fiona Reddington3
1 University College London, London UK
2 Imperial College, London UK
3 NCRI Informatics Coordination Unit, London UK
4 European Bioinformatics Institute, Cambridge UK
Abstract
The NCRI Informatics Initiative has been established with the goal of using informatics to
maximise the impact of cancer research. A clear foundation to achieving this goal is to
enable the development of an informatics platform in the UK that facilitates access to, and
movement of, data generated from research funded by NCRI Partner organisations, across
the spectrum from genomics to clinical trials. To assure the success of such a system, an
initial project has been defined to establish and document the requirements for the platform
and to construct and validate the key information models around which the platform will be
built. The platform will need to leverage many projects, tools and resources including those
generated by many e-Science projects. It will also require contributing to the development of a
global platform through a close interaction with similar efforts being developed by the NCI
in the USA. This paper recounts our experience in analysing the requirements for the
platform, and explains the customised analysis approach and techniques utilised in the
project.
tools and resources available in the UK and worldwide can interoperate with one another as a coherent whole. This will reduce duplication of effort and funding and leverage existing resources for maximum benefit.

1.2 The NCRIII Platform Requirements Analysis Project

As with any complex and innovative software system project, a fundamental activity is to reach a clear understanding of what the system's goals and requirements are. Lack of proper requirements analysis is recognized as the major source (almost 50%) of failure in software system projects [2]; e-science projects are, we would contend, not immune to this. A proper requirements analysis should: identify what is missing in the context where the system will operate; define stakeholder goals and expectations; eliminate ambiguities, inconsistencies and problems potentially leading to late faults; provide the information needed to support the system's construction and the evaluation of competing technical options; and enable system change and evolution.

Within the NCRIII, a precise identification of the key goals and requirements has been considered of primary importance due to the complexity of the envisioned project and the apparent lack of solutions capable of filling the existing gaps. The platform development has thus been preceded by a project focusing on requirements analysis. A multidisciplinary analysis group was set up, and the analysis has been driven by a set of use cases acquired by interviewing practitioners working in the field. Unlike use cases collected in other e-science projects, which describe the main functionalities of the system being developed, our use cases needed to uncover interoperability needs. The main goals of our analysis have been to understand how the various initiatives operating in the cancer research field would effectively cooperate with one another, what their relative roles are, how they are actually used by practitioners, and what may make the community self-sustainable. This has required us to take a higher-level perspective in defining the use cases, which are closer to user stories as defined in Agile development [22] than to standard use cases as described in [3]. This core characteristic of our project, along with the intrinsically multidisciplinary nature of the project team and the need to work side by side with domain experts, has required us to adopt an innovative approach and to customize the analysis techniques used. In this paper we describe the approach and the techniques used, motivating them with respect to the project and domain characteristics.

Although our approach has been developed for supporting analysis of the cancer research domain, we believe that it has generic applicability to other projects in the broader context of e-science.

2. The Platform Context and Related Work

The context where the NCRI platform will operate is made up of a multitude of projects, resources and initiatives that aim to support cancer research in different ways. These range from local projects to global initiatives and address different issues, ranging from data acquisition, storage and mining to knowledge management. Furthermore, the field of cancer research is highly dynamic, reflecting the emergence of new technologies, developing infrastructure and scientific discovery.

Our field analysis has led us to organize the different entities involved in the platform context into the following categories.

Local Repositories: belong to institutions or research groups and tend to host data about investigations performed by the local teams. They are important sources of data, although the terminologies and data formats used to store the acquired data or specimens tend to be highly heterogeneous between repositories. Access to the data is usually limited to local researchers or obtained through direct contacts with them.

Specialized Global Repositories: aim to collect information about specific types of data, e.g. ArrayExpress [4] and caArray [5] for microarray data. Researchers need to be supported in both depositing and accessing the data they contain. Repositories may overlap with one another in the type of data they house and can use different data formats and metadata vocabularies to annotate the acquired data. Most of the repositories offer ad-hoc defined input forms for collecting data and a web site for accessing the stored data, typically through a search engine. Other access services such as APIs or Web Services may be offered.

Knowledge Bases: aim to collect domain knowledge in specific areas, organizing it so that deductive reasoning can be applied to it. Each repository focuses on specific data types and provides ad-hoc defined forms for input data and access functionalities, which may include advanced graphical user interfaces. Examples include REACTOME (a curated KB of biological pathways) [6], PharmGKB (an integrated resource about how variation in human
genes leads to variation in our response to drugs) [7], and UniProtKB/Swiss-Prot (a protein knowledgebase containing annotated protein sequences) [8].

Scientific Libraries: collect scientific publications containing research results in the field of interest. The basic information unit is a scientific paper. Examples include the Public Library of Science [9] and PubMed [10].

Policies: define the way information is collected and used. Examples include those provided by the Department of Health [12].

Terminology Sources (or ontologies): defined for offering terminology annotation services to the global cancer research community. They can vary in terms of addressed domains, coverage extension (domain-specific vs. cross-domain), description language, adoption level, access service interfaces, etc. Examples include the NCI Thesaurus (cross-domain annotations in the cancer research domain) [5], the GO ontology (annotation of gene products) [11], and the MGED ontology (descriptors for microarray experiments) [15].

Representation Formats: defined to represent data for specific domains. Common representation formats are not available for all the cancer domains, and often there are several proposals or "standards" in use. Examples include HL7 [13] and CDISC [14] for clinical information exchange, and MAGE-ML [15] for microarray experiments.

Bioinformatics Services: existing services used to process raw data (e.g. microarray experiment data clustering), to search for similarities between new biological data and existing data stored in repositories (e.g. BLAST), to translate data among different formats, etc.

Although not all existing projects, resources and initiatives can be easily categorized, this list provides a broad high-level view of the reference context, showing its multiple facets.

The issue of interoperability is currently being addressed via initiatives underway at the US National Cancer Institute Center for Bioinformatics (NCICB) [5] and the European Bioinformatics Institute (EBI) [4]. NCICB is coordinating an ongoing project called the cancer Biomedical Informatics Grid (caBIG), whose aim is to maximize interoperability and integration among initiatives funded by the NCI by providing open-source infrastructure and tools plus vocabulary services (EVS – Enterprise Vocabulary Services [5]) and a centralized metadata repository (caDSR – Data Standards Repository [5]). The EBI provides a centralized access point to several bioinformatics databases and services through a single web site.

Although the ultimate goal is similar, namely improving research efforts by enabling interoperability, a number of differences exist between the envisioned NCRIII platform and the existing initiatives [16]. Essentially, while the existing initiatives address the problem with an approach whereby tools, vocabularies, common repositories, etc. are built and access is centrally provided by the lead organization, the NCRI initiative aims to provide a 'glue' layer enabling the different initiatives, resources and tools to communicate with one another in a more effective fashion. This requires close collaboration with the aforementioned organisations and with other existing projects, including CancerGRID [17], CLEF [18], etc.

3. The Analysis Approach

A feature of e-science projects is the need to establish a multidisciplinary team including requirements engineers, software developers and domain experts with a broad vision of the specific fields, in this case the cancer biomedical domain. Such a team was composed of requirements experts from University College London and Imperial College and domain experts from the NCRI Informatics Coordination Unit and Task Force. In the initial phase, analysis of the cancer-related literature and periodic team meetings were intertwined to build up a common understanding of the platform's high-level goals and to define a common language. Outcomes of these preliminary activities were a context diagram (whose main components have been described in outline in section 2), an initial set of stakeholders and a preliminary domain model. Subsequently, the analysis focused on understanding how the various elements operating in the platform context can interoperate with one another to underpin the needs of scientific researchers. In accordance with the NCRI vision of supporting community-driven development, we have adopted a user-centred approach for the analysis activities. To this end, a number of use cases, covering the whole spectrum of cancer research as proposed by the NCRI matrix [1], have been acquired and used to drive the overall analysis process.

Figure 1 maps out the analysis flow, focusing on the use case analysis.
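As a minimal illustration, the categorization of context entities described above could be captured as a small type model that a 'glue' layer might use to route requests. The class and field names below are hypothetical, not part of the NCRI platform design.

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class ResourceKind(Enum):
    """The categories of entities identified in the platform context."""
    LOCAL_REPOSITORY = auto()
    SPECIALIZED_GLOBAL_REPOSITORY = auto()
    KNOWLEDGE_BASE = auto()
    SCIENTIFIC_LIBRARY = auto()
    POLICY = auto()
    TERMINOLOGY_SOURCE = auto()
    REPRESENTATION_FORMAT = auto()
    BIOINFORMATICS_SERVICE = auto()

@dataclass
class Resource:
    name: str
    kind: ResourceKind
    data_types: set = field(default_factory=set)  # e.g. {"microarray"}

def resources_of_kind(resources, kind, data_type=None):
    """Select context resources by category and, optionally, by data type."""
    return [r for r in resources
            if r.kind is kind and (data_type is None or data_type in r.data_types)]

# A tiny sample context using examples named in the text.
context = [
    Resource("ArrayExpress", ResourceKind.SPECIALIZED_GLOBAL_REPOSITORY, {"microarray"}),
    Resource("caArray", ResourceKind.SPECIALIZED_GLOBAL_REPOSITORY, {"microarray"}),
    Resource("REACTOME", ResourceKind.KNOWLEDGE_BASE, {"pathway"}),
    Resource("PubMed", ResourceKind.SCIENTIFIC_LIBRARY, {"publication"}),
]
```

Such a model makes explicit the observation above that repositories may overlap in the data types they house, which is exactly what a glue layer would need to reason about.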
multidisciplinary team. An important issue we have identified in this project is thus that, in order to clearly identify all the needed information, a multi-perspective analysis is needed. This entails defining a multi-dimensional domain model to support the analysis activities. Each dimension can be considered a domain model by itself, used alone or combined with the other dimensions in the analysis activities.

[Figure 4 shows the three dimensions of the domain model (Cancer Research, Cancer Biology, System Integrator) above, with examples of the included concepts below:

Cancer Research: Investigation, Experiment, Protocol, Individual Information, Research Sample, Clinical Record, Experiment Result, Policy, etc.
Cancer Biology: Gene, Protein, Pathway, Agent, Disease, Drug, Target, Tissue, SNP, etc.
System Integrator: Bioinformatics Repository, Ontology, Registry, Service, Data Format, Access Policy, Protocol, etc.]

Figure 4: The multi-dimensional domain model

In our project, the Platform domain model has been organized across three dimensions: Cancer Research, which includes all the concepts involved in investigation/experiment execution, such as samples, patient data, protocols, publishable data (results of the experiment executions), etc.; Cancer Biology, including concepts like tumour, drug, gene and pathway; and System Integrator, whose aim is to model the environment where the platform will operate, the different types of available resources, and their relationships. Lack of space prevents us from showing the three models in full, but Figure 4 shows the three dimensions of our domain model above and some examples of the included concepts below. The complete specification can be found in [19].

5.1 Example use case analysis through the domain model

In this section we show some excerpts from the analysis of the use case shown in Figure 2. The GQR analysis is usually conducted together with the interviewed domain experts. Generally, the expert writes down the use case description and then, together with the analysis team, it is progressively structured according to the method's concepts. The original description does not contain all the information that makes up the requirements. The GQR method has shown itself to be effective in supporting the conversation between the analysts and the domain experts. For each question, experts are asked to identify what data sets and/or services may be used to produce the needed result, drawing on their knowledge of the context. Requirements are identified through this interaction and recorded as notes at the beginning, and through semi-formal diagrams and tables afterwards.

Let us consider, as an example, two of the five questions that follow from the goal.

• Q1.1 can be answered by identifying suitable specimens to be used in the experiment. This information can be found in tissue banks belonging to hospitals (e.g. the Dundee tissue bank) or by means of global specialized repositories like those under construction in the UK (e.g. OnCore [26]) or in the US (e.g. caTissue [5]). The platform should thus be able to access such repositories, query them, and produce the result R1, that is, a list of specimens. Using the cancer research perspective in the domain model, this result can be explained by way of the entities CR_ResearchSamples and CR_MedicalRecords. The platform will thus need to search all repositories which are known to include data about such kinds of entities. From the system integrator perspective, these repositories are modelled as SI_InformaticsRepositories (in particular, SI_GlobalSpecializedRepositories for "OnCore" and SI_LocalRepositories for "Dundee"), and their data elements, representing the above-mentioned entities, should be semantically annotated by metadata elements (SI_MetaDataElements) whose domain is defined within an IS_TerminologySource like the "NCI-Thesaurus", providing, for instance, concepts like Organism, Disease, Disorder and Findings, and Drugs and Chemicals to match the researcher's query attributes.

• Q1.4 can be answered by using the list of genes identified by the clustering service to identify common genetic variations, querying existing knowledge bases like REACTOME [6], PharmGKB [7], KEGG (Kyoto Encyclopedia of Genes and Genomes) [27], etc., or scientific libraries (e.g. PubMed), looking for papers which have reported on similar experiments. Currently, researchers use screen-scraping technologies to analyze and integrate data sets collected from different repositories.
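The Q1.1 lookup just described, from domain-model entities to candidate repositories via their metadata annotations, can be sketched roughly as follows. The entity and repository names mirror the examples in the text, but the matching logic (and the CR_Publications entity) is a hypothetical illustration, not the platform's implementation.

```python
# Map repositories (system integrator view) to the domain-model entities
# their data elements are annotated with (cancer research view).
REPOSITORY_ANNOTATIONS = {
    "OnCore": {"CR_ResearchSamples", "CR_MedicalRecords"},  # SI_GlobalSpecializedRepository
    "Dundee": {"CR_ResearchSamples"},                       # SI_LocalRepository
    "PubMed": {"CR_Publications"},                          # SI_ScientificLibrary (entity name hypothetical)
}

def repositories_for(question_entities):
    """Return repositories annotated with at least one requested entity,
    ranked by how many of the question's entities they cover."""
    hits = [(name, len(entities & question_entities))
            for name, entities in REPOSITORY_ANNOTATIONS.items()
            if entities & question_entities]
    return [name for name, _ in sorted(hits, key=lambda h: -h[1])]

# Q1.1 asks for specimens with associated medical records:
q11 = {"CR_ResearchSamples", "CR_MedicalRecords"}
```

In the platform itself, the annotation table would not be hard-coded: the concepts would be resolved through an IS_TerminologySource such as the NCI-Thesaurus, with semantic matching in place of simple set intersection.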
A scientist wishes to investigate genetic variation in tumour response to treatment with a specific class of chemotherapy. She would like to identify specimens of a specific tumour type, flash-frozen and prepared using a specific methodology, and for which there are associated medical records for treatment outcome. With sections of those specimens, the researcher would like to carry out microarray experiments for tumour cells and normal cells on the periphery of the tumour. She needs to store and analyze the data using conventional clustering methodologies. She would also like to compare the clusters to currently-known metabolic pathways, some of which are known to be chemotherapy targets. With the list of genes from the pathways of interest showing expression variation related to the chemotherapy treatment, the investigator can then identify common genetic variations in the public databases for subsequent follow-up. At the time of publication of her study she wants to maximize the impact of her achievements on the scientific community for follow-up studies by depositing the microarray data in public repositories.

[The accompanying diagram decomposes the goal G1, "Investigate genetic variation in tumour response to chemotherapy", into questions (identify specimens; design and execute the microarray experiment; cluster and compare the microarray data; identify common genetic variations; publish the experiment results), each linked to results (labelled R1 and R3-R7 in the diagram) such as a list of specimens, conventional clustering methodologies, a data clustering service, microarray data, and common genetic variations.]

Figure 5: Excerpts from the use case analysis

R5 is the result of such integration, which should be supported by the platform (as much as possible). Using the biological perspective in the domain model, we can explain the result by saying that the platform should query repositories known to contain information about CB_Gene, CB_Pathways, CB_ExpressionVariation and CB_Agent (where a class of chemotherapy is considered an agent). From the system integrator perspective, these repositories are modelled as SI_KnowledgeBases (e.g. "REACTOME" and "PharmGKB") and SI_ScientificLibrary (e.g. PubMed). As in the previous example, their information should be annotated with proper metadata. To build the required result, the platform should integrate data coming from these sources by exploiting the semantic relationships among the concepts defined in the IS_TerminologySource used as the metadata source.

With these examples we have shown how the same use case is analyzed from different perspectives, providing an explanation, from both the user and system points of view, of how the different resources should interoperate with one another to fulfil the high-level research goals.

6. Validation

The approach and models introduced in this paper are currently being validated in a companion demonstrator project (see [1], section Demo Projects, Imaging and Pathology). This project aims at developing services for 3D reconstruction of macroscopic resected rectums using in vivo photographs, incorporating links to virtual histology images and MRI images. The project leverages technologies and services developed in the e.Diamond project [29] and is using data sets stored in repositories belonging to hospitals involved in a previous clinical trial, the Mercury Trial [28]. Data (including images) is currently being collected and anonymized manually and transferred by copying it onto CDs. Finally, the developed service has to be deployed so that it can be used in a distributed environment for more informed clinical decision making. Several user stories have been identified and are being analyzed in the context of this project to evaluate the effectiveness of our approach in a real case. Moreover, the resources the platform would need to support the project are being collected. These range across the different kinds of resource identified in section 2. Analyzing how they interoperate with one another to fulfil the various goals identified in the user stories will provide a suitable test-bed for validation of the whole information architecture.

7. Discussion and Future Work

In this paper we have recounted our experience in the early analysis of the NCRI Platform, a software system whose main aim is to enable the creation of a community supporting initiatives operating in the cancer research field in the UK and worldwide. A multidisciplinary team composed of requirements engineers and domain experts was set up to carry out the analysis activities. Preliminary steps in the analysis process were the identification of a core set of stakeholders and of the main entities operating in the cancer research context, and the definition of a preliminary domain model to gain an initial common understanding of the problem within the multidisciplinary analysis team. Afterwards, the analysis was driven by a set of use cases representing examples of research investigations that might benefit from the platform's introduction. The main aim of our analysis has been to clearly understand how the various highly heterogeneous resources making up the cancer research context might interoperate with one another to fulfil researchers' goals. The high heterogeneity of the cancer research domain and the need to work in a multidisciplinary team have required us to develop a customised approach to analyzing the acquired use cases. The approach involves using an ad-hoc defined high-level investigation model and a domain model. The former allows the analysis team, along with the interviewed domain experts, to structure an investigation throughout its lifecycle, pointing out what researchers would ask of the platform and what resources it can answer with. The latter allows detailed analysis of the multiple facets of the interoperability problem by approaching them from three orthogonal dimensions representing the researcher, biology and system points of view. The approach has proven to be effective in analysing a variety of use cases taken from different heterogeneous sub-domains of cancer research, and is particularly suitable for working with multidisciplinary teams. Although it has been defined and used in the context of cancer research, we believe that the basic ideas and principles can be of interest to other projects in the broader e-research field.

Besides supporting the analysis activities, the investigation and domain models mapped out in this paper constitute the key information model around which the platform will be built. To achieve the platform's objectives, a semantic web service architecture is being designed following the framework proposed by the Semantic Web Services Initiative [30]. The platform will offer an environment where service requests (issued by the platform users) and offered services (issued by service providers) can be published. Adopting a model similar to the Investigation Model introduced above, the platform will permit a simple workflow-like description of investigations, where questions will correspond to service requests, while results will correspond to offered services or to a set of available services integrated by matchmaker services provided by the platform environment. These services will be made accessible by providing semantic descriptions of their capabilities, whose core concepts are rooted in the Domain Model entities. These provide a high-level conceptualization of the platform domain from three fundamental perspectives. This conceptualization can be used as a high-level ontology for semantically describing the web services offered by our platform. Such a high-level conceptualization will then be combined with a number of specialized ontologies (most of which already exist in the field) through ontology composition, as described in [31].

Finally, we are currently analyzing and working in cooperation with several organizations which have already developed some of the informatics components the platform will need to use; among others, we can mention the NCICB in the US, the EBI, CancerGrid and CLEF.

Acknowledgements

The project team wish to acknowledge the funding for this work provided by Cancer Research UK and to thank all the researchers who have cooperated in the analysis activities.

References

[1] NCRI Informatics Initiative, www.cancerinformatics.org.uk
[2] Standish Group, CHAOS report, www.standishgroup.com/chaos.htm
[3] A. Cockburn: Writing Effective Use Cases. Addison-Wesley, 2001.
[4] European Bioinformatics Institute, http://www.ebi.ac.uk
[5] National Cancer Institute – Center for Bioinformatics, http://ncicb.nci.nih.gov/
[6] http://www.reactome.org/
[7] http://www.pharmgkb.org/
[8] http://www.ebi.ac.uk/swissprot/
[9] http://www.plos.org/
[10] http://www.ncbi.nlm.nih.gov
[11] http://www.geneontology.org/
[12] http://www.dh.gov.uk/PolicyAndGuidance
[13] www.hl7.org.uk/
[14] http://www.cdisc.org/
[15] http://www.mged.org/
[16] R. Begent, et al.: Challenges of Ultra Large Scale Integration of Biomedical Computing Systems. 18th IEEE International Symposium on Computer-Based Medical Systems, Dublin, Ireland, 2005.
[17] http://www.cancergrid.org/
[18] http://www.clinical-escience.org/
[19] www.cs.ucl.ac.uk/CancerInformatics
[20] I. Alexander, S. Robertson: Understanding project sociology by modelling stakeholders. IEEE Software, vol. 21(1), Jan. 2004.
[21] M. Kuniavsky: Observing the User Experience: A Practitioner's Guide to User Research. Morgan Kaufmann, 2003.
[22] M. Cohn: User Stories Applied: For Agile Software Development. Addison-Wesley, 2004.
[23] A. Dardenne, A. van Lamsweerde, S. Fickas: Goal-Directed Requirements Acquisition. Science of Computer Programming, 20 (1993).
[24] R. van Solingen, E. Berghout: The Goal/Question/Metric Method. McGraw-Hill, 1999.
[25] A. Finkelstein, et al.: Viewpoints: a framework for integrating multiple perspectives in system development. Int. Journal of Software Engineering and Knowledge Engineering, vol. 2, 1992.
[26] http://www.ntrac.org.uk/
[27] www.genome.jp/kegg/
[28] http://www.pelicancancer.org/researchprojects/mercury.html
[29] http://www.ediamond.ox.ac.uk/
[30] M. Burstein et al.: A Semantic Web Services Architecture. IEEE Internet Computing, Sept. 2005.
[31] J. Jannink, P. Srinivasan, D. Verheijen, G. Wiederhold: Encapsulation and composition of ontologies. Proc. of AAAI Summer Conference, AAAI (1998).
Abstract
CCLRC have developed a suite of data management tools for the Engineering and Physical Sciences Research
Council (EPSRC) funded e-Science project 'The Simulation of Complex Materials' [1], which ran from 2002 -
2005. The focus of the project was to aid the development of a computational technology for the prediction of
the polymorphs of an organic molecule prior to its synthesis, which would then provide the ability "to control
the unwanted appearance of polymorphism and to exploit its benefits in the development, manufacture and
processing of new molecular materials" [2]. Prior to the project the data of interest was distributed across a
multitude of sites and systems, with no simple formal methods for the management or distribution of data.
This was considerably hindering the analysis of the results and the refinement of the prediction process and it
was therefore essential to rationalise the data management process. The initial concern was for the collection
and safe storage of the raw data files produced by the simulations during the computation workflow [3]. This
data is now stored in a distributed file system, with tools provided for its access and sharing. As the data was
not annotated with metadata it was difficult for it to be discovered and reused by others and so web interfaces
were implemented to enable the cataloguing of data items and their subsequent browsing. In addition, there
was no fine-grained access to the data. Specific crystal data is now parsed from the simulation outputs and
stored in a relational database, and a web application has been deployed to enable extensive interrogation of
the database content. This paper will elaborate on these tools and describe their impact on the achievement of
the project's aims.
Background

A crystal may have different polymorphs (different arrangements of the molecules in the crystal lattice), and "different polymorphs have different physical properties, and so there are major problems in quality control in the manufacture of any polymorphic organic material. For example, a polymorphic transformation changes the melting point of cocoa butter and hence the taste, the detonation sensitivity of explosives producing industrial accidents, and the solubility changing the effective dose of pharmaceuticals." [2]

Therefore a method of predicting which crystal structure a given organic molecule will adopt under different conditions would have considerable benefit in product development across the range of molecular materials industries.

The computational chemistry group at UCL have developed computational methodologies for predicting the energetically feasible crystal structures of small, rigid, organic molecules [4]. Each simulation involves the running of multiple programs to generate these crystal structures, at considerable computational and human expense.
The process produces a great quantity of heterogeneous data estimating various mechanical, morphological and spectroscopic properties of the computed crystal structures. However there was little support for accessing and managing this data, and this hindered the project's progress and collaborative efforts. To counter this, CCLRC developed an effective data management infrastructure to facilitate the creation, storage and analysis of data and metadata.

Architecture

This infrastructure, as shown in figure 1, enables a more efficient cycle of discovery, analysis, creation and storage of both data and metadata. The tools provided for the management and discovery of metadata are the Data Portal [5], the Metadata Editor [6], the CCLRC metadata schema [7] and the metadata database [8]. The tools for the management of raw and processed data are the Computed Crystal Structures Database, the Crystal parser, the Crystal Structure Navigator and the Storage Resource Broker (SRB) [9]. The SRB was developed by the San Diego Supercomputer Center (SDSC).

After a simulation run the results are uploaded to the relevant storage media and can be discovered and shared with other scientists - for example, to use for comparison when a new polymorph is discovered experimentally that has already been computationally predicted [10].

From the simulation outputs the raw data is processed in three ways:

i. A subset of the data is parsed from the files and inserted into the crystal database.
ii. The data files are uploaded into SRB.
iii. Metadata about the data is entered into the metadata database using the metadata editor.

The Data Portal can then be used to discover and retrieve raw data from the SRB, and the Crystal Structure Navigator can be used to extract and compare data on the crystal structures from the computed crystal structure database.
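The three-way processing of simulation outputs described above can be sketched as a small pipeline. This is an illustrative sketch, not CCLRC's actual code: the output file format, the parsed field names, the SRB logical path and the in-memory stand-ins for SRB and the databases are all invented for the example.

```python
import re

def parse_crystal_data(text):
    # Step i (illustrative): pull a few named values out of an
    # unstructured simulation output file. Field names are hypothetical.
    values = {}
    for key in ("space_group", "total_lattice_energy", "density"):
        m = re.search(rf"{key}\s*=\s*(\S+)", text)
        if m:
            values[key] = m.group(1)
    return values

def process_simulation_output(filename, text, crystal_db, srb, metadata_db):
    # Step i: parse a subset of the data into the crystal database.
    crystal_db.append(parse_crystal_data(text))
    # Step ii: upload the raw file into SRB (modelled here as a dict
    # mapping logical path -> contents).
    srb_path = f"/home/simulations/{filename}"
    srb[srb_path] = text
    # Step iii: annotate the file with metadata, including its SRB location.
    metadata_db.append({"name": filename, "srb_path": srb_path})

crystal_db, srb, metadata_db = [], {}, []
output = "space_group = P21/c\ntotal_lattice_energy = -102.3\n"
process_simulation_output("run1.out", output, crystal_db, srb, metadata_db)
print(crystal_db[0]["total_lattice_energy"])  # -102.3
```

Recording the SRB location alongside the metadata is what later lets the Data Portal hand back the raw file for any catalogued entry.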
Storage Resource Broker (SRB)

The first issue to be overcome was the lack of tools to manage the collection and safe storage of simulation outputs, which also made it difficult to share data. Data files from simulation runs were generally located on users' machines or on the machine on which the computation had been run.

In 2002 SRB was chosen as the middleware because of its appropriateness and CCLRC's strong links with SDSC. GUIs (Data Portal and Metadata Manager) that used the Scommands behind the scenes were written to allow users access to the grid-enabled storage resources.

SRB is a distributed file system that allows files to be shared across multiple resources whilst providing access through a single interface and showing a unified logical view of the virtual organisation's data. As such, data is organised in a logical rather than a physical context, and the complexity of locating and mediating access to individual data items is abstracted away from the users. Access control is simply implemented, with the data originator able to control and configure access permissions to their data for other project members. The raw data produced by the simulations is now stored on a number of distributed resources managed by the SRB.

Metadata Database

Although the use of SRB facilitates data distribution and replication, the value of the data is in its discoverability – and the annotation of the data holdings with metadata is what achieves this.

The raw data files in SRB are now catalogued with metadata, and this data is stored in the metadata database, which uses the CCLRC metadata schema. This is a common model that has been re-used in other CCLRC projects that required metadata holdings. As the same logical model has been re-used across different projects, the applications written against this model (Data Portal and Metadata Manager) have also been reused on other projects, with very little customisation required.

Raw data files are grouped into datasets, and datasets are grouped under studies with a name, start date, end date and originator details. The studies are also linked to topics, keywords and investigator details. The database tables for the data file and data set entities also store the physical location of the data in SRB, alongside the metadata.

Metadata Manager

The CCLRC Metadata Manager is a web-based tool for the insertion and manipulation of entries in the metadata database. Typically users annotate their datasets and data files with details such as the provenance of the data and its location in SRB. Users can organise their data by creating and editing information about studies and then adding datasets and data files to the hierarchy. The metadata forms the basis for the search and retrieval of data by the Data Portal.

In the most recent version of the Metadata Manager users can also add topics, so removing the last vestiges of a centrally controlled vocabulary in the infrastructure. Scriptable command line tools for the insertion of metadata have also now been written.

Data Portal

The CCLRC Data Portal is a web-based front-end that enables the scientists to browse the metadata database and discover data resources from physically diverse institutions. Data is displayed in a virtual logical file system – so files are displayed by topic (such as the molecule) rather than by geographical location, this being of much more use to the project participants. The required data resources can then be downloaded from SRB directly to the users' machines.

The Computed Crystal Structure Database

Although SRB allowed the searching of related data, it did not provide any tooling for data comparisons and refinement, as there was no fine-grained access to the data, which was still stored in unstructured text files. This hindered analysis of the results, and it was very difficult to look at, for example, all the crystal structures with a total lattice energy beneath a certain threshold. Therefore the project requested a bespoke database (figure 2) specific to their requirements
and data, which would enable them to drill down and query across data.

The CCLRC Computed Crystal Structure Database is a relational database running on Oracle 10g and hosted on the National Grid Service (NGS). It is used to store specific data about crystal structures, rather than metadata or unprocessed raw data in a file or CLOB format. The data model for the calculated crystal database was designed to allow easy interrogation of the database and provide good performance for complex queries as the database grew.

There are five tables for representing the crystal data – …

… runs and populates the computed crystal structure database with a subset of the simulation data. Some extra information about the molecule or conformation or potential has to be entered into a configuration file by the project administrator (Louise Price), and the script is then run from the command line. If any information on the study being uploaded is already contained in the database it is updated; otherwise new records are inserted into the database.

Crystal Structure Navigator

The Crystal Structure Navigator is a powerful query tool that provides users with the ability to search for crystal data which fits certain criteria by allowing the construction and execution of queries against the calculated crystal database (figure 3). The application runs on a Tomcat web server and is written with JSPs and servlets. It has a simple and intuitive interface from which the user can compile their display and search options.
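As a rough illustration of the kind of fine-grained interrogation the Navigator enables, the sketch below compiles user-chosen display columns and criteria into a parameterised SQL query. This is not the project's code: the real database is Oracle 10g with five tables behind a JSP/servlet front-end, whereas here SQLite stands in, and the single table, its columns and the sample rows are all invented.

```python
import sqlite3

# Minimal, hypothetical stand-in for the computed crystal structure database.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE crystal_structure (
    molecule TEXT, space_group TEXT, total_lattice_energy REAL, density REAL)""")
conn.executemany(
    "INSERT INTO crystal_structure VALUES (?, ?, ?, ?)",
    [("naphthalene", "P21/a", -76.2, 1.19),
     ("naphthalene", "P21/c", -74.8, 1.15),
     ("alloxan", "P43212", -101.5, 1.93)])

def search(display_columns, criteria):
    """Navigator-style search: the user picks display columns and
    (column, operator, value) criteria, compiled into one SQL query.
    A real implementation must whitelist the column and operator names;
    only the values are passed as bound parameters here."""
    sql = f"SELECT {', '.join(display_columns)} FROM crystal_structure"
    if criteria:
        sql += " WHERE " + " AND ".join(f"{col} {op} ?" for col, op, _ in criteria)
    return conn.execute(sql, [v for _, _, v in criteria]).fetchall()

# e.g. all structures with a total lattice energy beneath a threshold:
rows = search(["molecule", "space_group"],
              [("total_lattice_energy", "<", -75.0)])
print(rows)  # [('naphthalene', 'P21/a'), ('alloxan', 'P43212')]
```

The threshold query above is exactly the kind of question ("all crystal structures with a total lattice energy beneath a certain threshold") that was impractical while the results lived in unstructured text files.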
Secured bulk file transfer over HTTP(S)
Yibiao Li and Andrew McNab
Manchester University
Abstract
A method of secured bulk file transfer over HTTP(S) for GridSite is discussed in this article. Unlike FTP, this method can transfer a file from one Grid node to another Grid node directly, without the file passing through the client computer's memory. The authentication information is transferred over HTTPS, while the file itself is transferred over HTTP. To speed up the file transfer, a multi-connection technique is adopted.
Keywords: bulk file, HTTP(S) transfer, GridSite, GridHTTP protocol
edited using the ACL editor built into the GridSite system by people who have the Admin permission within the ACL.

If a user has this permission in a given directory, when the user views directory listings or files with a browser in that directory, the user will see the option "Manage Directory" in the page footer. This allows the user to get a listing of the directory, and the .gacl file will appear at the top if it is present. If not, then there will be a button to create a new .gacl file with the same permissions as have been inherited by that directory from its parent.

GACL allows quite complex conditions to be imposed on access, but normally users can think of an ACL as being composed of a number of entries, each of which contains one condition (the required credential) and a set of allowed and denied permissions.

Credentials can be individual users' certificate names, or whole groups of certificate names if a DN List is given. (Users can also specify hostname patterns using Unix shell wildcards (e.g. *.ac.uk) or EDG VOMS attribute certificates; see the GACL Reference for details.)

Permissions can be Admin (edit the ACL), Write (create, modify or delete files), List (browse the directory) or Read (read files). Permissions can be allowed or denied. If denied by any entry, the permission is not available to that user or DN List (depending on what credential type was associated with the Deny).

2. Why transfer files over HTTP(S)?

Normally, there are the following ways to transfer files over the internet.

2.1 Email

One of the most important aspects of the Internet is the ability to send large files easily. Email is still the primary way to receive or send large files over the Internet. Unfortunately, using Email to send or receive large files is fraught with drawbacks. In today's world, file sizes are getting larger and larger, but email technology has not advanced at the same pace. It is no longer efficient to send large files via Email, and in many cases it is impossible.

2.2 FTP

FTP is a method for exchanging files over the internet utilizing standard TCP/IP protocols to enable data transfer. FTP can be used to upload and download files of almost any size from or to a central server. It is a well-established and consistently implemented protocol that can be enabled on the Windows Storage Server.

The advantages of FTP include:

Support for all kinds of clients: Standardized implementation of the protocol means that virtually any FTP client, running on a Microsoft or non-Microsoft operating system, can use the FTP server.

High performance and simplicity: The performance and simplicity of the protocol make it a convenient option for file transfers across the Internet.

The primary disadvantage of FTP is that data and logon information is sent unencrypted across the network. This could result in the discovery of logon accounts or passwords, and this information could be used by unauthorized individuals to access other systems.

2.3 HTTP

The HTTP [1] protocol is a protocol for file transfer over the internet. It is often used to download HTML files or image files through a web browser such as IE, Mozilla or Firefox, but it can also be used for file upload by some commands under a Unix/Linux OS.

2.4 HTTPS

HTTPS is a communications protocol designed to transfer encrypted information between computers over the Internet. HTTPS is HTTP using a Secure Socket Layer (SSL). It is recommended that users utilize HTTPS when transferring files containing security-sensitive information.

3. Design

3.1 GridHTTP protocol

To realize file transfer between GridSite nodes, the GridHTTP protocol was designed, which supports bulk data transfers via unencrypted HTTP while the authentication and authorization information, using the usual grid credentials, is carried over HTTPS.

To initiate a GridHTTP transfer, clients set an Upgrade: GridHTTP/1.0 header when making an HTTPS request for a file. This header notifies the server that the client would prefer to retrieve the file by HTTP rather than HTTPS, if possible. The authentication and authorization are done via HTTPS (X.509, VOMS, GACL etc. deciding whether access is allowed), and then the server may redirect the client to an HTTP version of the file using a standard HTTP 302 redirect
response giving the HTTP URL (which can be on a different server, in the general case). For small files, the server can choose to return the file over HTTPS as the response body. When contacting a legacy server, the Upgrade header will be silently ignored and the file will be returned via HTTPS as normal.

For redirection to plain HTTP transport, a standard HTTP Set-Cookie header is used to send the client a one-time passcode in the form of a cookie, GRIDHTTP_PASSCODE, which must be presented to obtain the file via HTTP. This one-time passcode only works for the file in question, and only works once: the current implementation stores it in a file and deletes the file when the passcode is used. (This mechanism is no worse than GridFTP for files being obtained by an attacker.)

As you can see, GridHTTP is really a profile for using the HTTP/1.1 standard, rather than a new protocol or a set of extensions: no new headers or methods are involved. Ways of extending it to support variable TCP window sizes, so it can be used for a mix of long and short distance connections (currently the TCP window size has to be set in the Apache configuration file), and support for third-party transfers using the HTTP COPY method from WebDAV, are being added to the GridSite implementation.

3.2 Advantages

One big advantage of redirecting to a pure HTTP GET transfer is not just that the server … sendfile() system call to tell the kernel to copy it …

3.3 Bulk file transfer between GridSite nodes

Assume that a grid user wants to copy a bulk file from one GridSite node (the source server) to another GridSite node by giving a batch of commands (so he cannot log on to the destination to copy the file directly) on a computer denoted as the client computer.

Now we consider a simple case, copying files without the security factor. In general, using a command like wget or curl, the user can copy the file to the local computer first, then upload it to the destination server, as shown in figure 1.

Fig 1: user downloads the file first to the local computer, then uploads it to the destination server

Though this method can do the job, it apparently wastes time, internet bandwidth, and local machine memory and disk space. Instead, we can seek a way to send a command to the destination server, and ask it to get the file from the source server.

[Figure: the destination server fetches the file directly from the source server over the internet]
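The GridHTTP negotiation described in section 3.1 can be sketched as a toy in-memory model. This is not GridSite's actual code: the server is a plain Python object, HTTPS/HTTP requests are method calls, and the passcode store is a dict; only the Upgrade-header negotiation, the 302 redirect carrying the GRIDHTTP_PASSCODE cookie, and the single-use passcode check are modelled.

```python
import secrets

class ToyGridHttpServer:
    """Toy model of a GridSite server handling a GridHTTP upgrade."""
    def __init__(self, files):
        self.files = files      # URL path -> contents
        self.passcodes = {}     # passcode -> path (deleted on use)

    def https_request(self, path, headers, authorized=True):
        # Authentication/authorization (X.509, VOMS, GACL) happens here
        # over HTTPS; it is reduced to a boolean for the sketch.
        if not authorized:
            return {"status": 403}
        if headers.get("Upgrade") == "GridHTTP/1.0":
            # Mint a one-time passcode and redirect the client to HTTP.
            code = secrets.token_hex(8)
            self.passcodes[code] = path
            return {"status": 302,
                    "Location": "http://example.org" + path,
                    "Set-Cookie": f"GRIDHTTP_PASSCODE={code}"}
        # Legacy behaviour: ignore the header, serve over HTTPS.
        return {"status": 200, "body": self.files[path]}

    def http_request(self, path, passcode):
        # The passcode works for this file only, and only once.
        if self.passcodes.get(passcode) != path:
            return {"status": 403}
        del self.passcodes[passcode]
        return {"status": 200, "body": self.files[path]}

server = ToyGridHttpServer({"/data.dat": b"payload"})
reply = server.https_request("/data.dat", {"Upgrade": "GridHTTP/1.0"})
code = reply["Set-Cookie"].split("=", 1)[1]
assert server.http_request("/data.dat", code)["status"] == 200
assert server.http_request("/data.dat", code)["status"] == 403  # single use
```

Deleting the passcode on first use mirrors the behaviour described above, where the server-side file holding the passcode is removed as soon as it is redeemed.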
Now let us add the security feature to the above case in the GridSite circumstance.

To support the remote bulk file copy, the GridSite nodes should:
1. verify the user certificate (source server).
2. produce a one-time passcode (source server).
3. check the user's access to the destination directory, and respond to the user if there is no write access (destination server).
4. send the file request with the passcode as a cookie via HTTP (destination server).
5. redirect the HTTPS request to HTTP (source server).
6. retrieve the file and save it to some directory (destination server).
7. respond to the user when finished (destination server).

The client end command should do the following:
1. send a file passcode request to the source GridSite node with the user certificate over HTTPS
2. retrieve the passcode from the source GridSite node over HTTPS
3. send a file copy request to the destination GridSite node with the passcode over HTTPS

A complete description of the bulk file copy system can be given as (see figure 3):
1. Client uses the user ID (user certificate and user key) to request a one-time passcode from the source GridSite server (HTTPS)
2. Source server verifies the user ID
3. Source server issues a one-time passcode to the client (HTTPS)
4. Client sends a request to the destination server to get the file from the source server with the one-time passcode (HTTPS)
5. Destination server sends a request to the source server with the passcode (HTTPS)
6. Source server verifies the passcode
7. Source server transfers the file to the destination server (HTTP)

Furthermore, considering that transferring a bulk file could take a lot of time, we can use the multi-connection technique to get a different part of the file in each connection at the same time, which will speed up the bulk file copying. But one problem is that when the destination server sends the request for the file copy, a one-time passcode is required for each connection. To get one-time passcodes for the connections (one passcode for each connection), a secured connection has to be created to transfer the request and response of each one-time passcode, and the source server must have a mechanism to produce a one-time passcode when it receives a request with the original passcode.

Thus the complete procedure for a multi-connection bulk file copy can be described as:
1. Client uses the user ID (user certificate and user key) to request a reusable passcode from the source server (HTTPS),
2. Source server verifies the user ID,
3. Source server issues a reusable passcode to the client computer if the ID is OK, or responds with an error message (HTTPS),
4. Client sends a request to the destination server to get the file from the source server with the reusable passcode (HTTPS),
5. Destination server sends a request to the source server with the reusable passcode to get the file size (HTTPS),
6. Destination server creates multiple connections, and gets a one-time passcode for each connection (HTTPS),
7. Source server verifies the passcode, produces a one-time passcode and sends it back to the destination server (HTTPS),
8. Destination server requests part of the file in each connection with its one-time passcode (HTTPS),
9. Source server checks the one-time passcode and transfers the part of the file to the destination server (HTTP),
10. Destination server combines the parts of the file into a complete file, and saves it to the required directory.

Please note that in steps 9 and 10, if we change the HTTP connection to an HTTPS connection, then the whole file will be transferred over HTTPS. So it is easily extended to the case of transferring both the file and security data over HTTPS.

Fig 3: a complete bulk file transfer system (single connection)
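The ten-step multi-connection procedure above can be mimicked end to end in a few lines. This is an illustrative simulation, not the GridSite implementation: the servers are in-memory objects, each HTTPS/HTTP exchange is a method call, the parts are fetched sequentially rather than in threads for clarity, and all names are invented.

```python
import secrets

class SourceServer:
    def __init__(self, data):
        self.data = data
        self.reusable = set()
        self.one_time = {}

    # Steps 1-3: verify the user and issue a reusable passcode (HTTPS).
    def issue_reusable(self, user_ok=True):
        if not user_ok:
            raise PermissionError("bad user ID")
        code = secrets.token_hex(8)
        self.reusable.add(code)
        return code

    # Step 5: report the file size against a reusable passcode (HTTPS).
    def file_size(self, reusable):
        assert reusable in self.reusable
        return len(self.data)

    # Steps 6-7: mint a one-time passcode for one connection's byte range (HTTPS).
    def issue_one_time(self, reusable, start, end):
        assert reusable in self.reusable
        code = secrets.token_hex(8)
        self.one_time[code] = (start, end)
        return code

    # Steps 8-9: serve one part of the file, once, over plain HTTP.
    def get_part(self, code):
        start, end = self.one_time.pop(code)  # pop() makes it single-use
        return self.data[start:end]

def destination_copy(source, reusable, connections=4):
    # Step 5: learn the size, then split it into one block per connection.
    size = source.file_size(reusable)
    block = -(-size // connections)  # ceiling division
    parts = []
    for start in range(0, size, block):
        code = source.issue_one_time(reusable, start, min(start + block, size))
        parts.append(source.get_part(code))        # steps 8-9
    return b"".join(parts)                         # step 10: reassemble

src = SourceServer(b"0123456789" * 100)
code = src.issue_reusable()
assert destination_copy(src, code, connections=4) == b"0123456789" * 100
```

The split into reusable and one-time passcodes matches the text: one cheap HTTPS exchange establishes the session, after which each plain-HTTP part request is protected by a credential that cannot be replayed.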
Fig 4: a complete bulk file transfer system (multi-connection)

4. Implementation

To realize the design in section 3, we consider the implementation on the server side and the client side separately.

4.1 Server side

On the server side, we need to realize two main modules. One is responsible for producing the one-time and reusable passcodes and the related checks (the source server side in the above discussion); the other is responsible for responding to client requests and copying the file (the destination server in the above discussion). In practice, we developed the first module (passcode) as part of GridSite, and developed the second module as a CGI program.

The passcode module is part of the GridSite package, which can be compiled and run as an Apache module [2][3]. The key task here is the one-time passcode encoding: as described in the protocol, the passcode should be related to one file, including its full path, and the time. Here we generate a random number and save it to a directory that is only accessible by the Apache server. Once the file related to this passcode has been sent, the passcode is deleted; it is also deleted after a specified period even if the file has not been sent. This passcode mechanism ensures that even if someone learns a passcode by some means, he cannot gain access rights to other files.

The second module is responsible for receiving the user request and obtaining the file from the source server. In the current version of the GridSite package, we developed it as a CGI program; in the Apache configuration file, it is mapped to the HTTP COPY method. It was developed with C and libcurl. The key matter in the one-connection case is to transfer the passcode as a cookie via the HTTPS connection, then get the file over HTTP. In the multi-connection case, we used a multi-thread technique: for each connection, we use the reusable passcode to get the one-time passcode over HTTPS, then get the part of the file over HTTP.

On the client side, the command was built into the powerful command htcp. The options needed to pass to the destination server, such as the connection number, the block size of each connection and the thread number specified by the user, are transferred with OPTS in the HTTP header over HTTPS.

5. The server configuration and command usage

5.1 The server configuration

As described in section 4.1, there are two modules on the server side. The passcode module is built into the GridSite package, so after the package is installed and the Apache service starts, this module is running; no extra configuration is needed.

The second module is also included in the GridSite package but runs as a separate program, named gridsite-copy.cgi. It should be installed into a specified directory that is normally mapped to /cgi-bin/ in the Apache configuration file. So after the GridSite package is installed, check the directory if you know where it is, or check the Apache configuration file httpd.conf first to find out where it is, then check that the file gridsite-copy.cgi is there. The following line needs to be added to the Apache configuration file httpd.conf:

Script COPY /cgi-bin/gridsite-copy.cgi

If you want to copy a file from one GridSite node to another, you must have write access to the destination directory on the destination node. The access configuration is described in section 1.3, access control.

5.2 Client command

To copy a file from one GridSite node to another, you use the htcp command provided by the GridSite package. Here we give two examples.

Example 1: copy a file data.dat from node A to node B's data directory using one connection:

htcp --rmtcp https://a/data.dat https://b/data/

Example 2: copy a file data.dat from node A to node B's data directory using multiple connections:

htcp --rmtcp --connection-number 5 --block-size 20 https://a/data.dat https://b/data/

Note that in the above examples, the option --rmtcp asks htcp to execute the remote copy module; the option --connection-number indicates how many connections will be used to get the
file, and the option --block-size indicates the maximum number of kilobytes for each connection.

To use the htcp command, users should copy their certificates into the .globus directory or the current directory.

6. Notes

There is another application tool for GridSites to transfer files, GridFTP [9], which is based on the popular File Transfer Protocol (FTP), supporting functionalities such as:

1) Grid Security Infrastructure (GSI)
2) Third-party control of data transfer
3) Parallel data transfer
4) Striped data transfer
5) Partial file transfer

As shown in the previous sections, GridHTTP provides similar functionalities; the main difference from GridFTP is that GridHTTP is based on the HTTP protocol. There are many arguments about the advantages and disadvantages of file transfer over HTTP versus FTP, and it is hard to determine from theory which is better. For grid environments, GridFTP needs extra installation and configuration, while htcp is embedded in the GridSite package as an Apache module and an independent CGI application program, and needs only quite simple configuration.

7. Acknowledgements

This work was funded by the Particle Physics and Astronomy Research Council through the GridPP programme.

References

[1] R. Fielding et al.: "Hypertext Transfer Protocol – HTTP/1.1", http://www.w3.org/Protocols/rfc2616/rfc2616.html, 1999
[2] L. Stein and D. MacEachern: "Writing Apache Modules with Perl and C", O'Reilly & Associates, 1999
[3] B. Laurie and P. Laurie: "Apache: The Definitive Guide", Third Edition, O'Reilly & Associates, 2002
[4] T. Boutell: "Featuring C and Perl 5 Source Code", Addison Wesley, 1996
[5] GridSite software and documents: http://www.gridsite.org
[6] Apache official website: http://www.apache.org
[7] curl and libcurl: http://curl.haxx.se
[8] GridPP website: http://www.gridpp.ac.uk
[9] GridFTP website: http://www.globus.org/toolkit/docs/3.2/gridftp/
112
Proceedings of the UK e-Science All Hands Meeting 2006, © NeSC 2006, ISBN 0-9553988-0-0
GridSim [5] is a grid simulation toolkit developed to investigate resource allocation techniques and, in particular, a computational economy. The focus is on scheduling and resource brokering; there is not, however, capability for data management such as would be required for investigation of replica optimisation strategies. ChicSim [12] is a simulator designed to investigate scheduling strategies in conjunction with data location. Bricks [13], while initially focusing on job scheduling, has been extended to include replica management. Its replica management components, however, are centralised rather than the distributed architecture used in OptorSim. All these projects therefore give a complementary approach to that used in OptorSim, allowing exploration of different areas of parameter space.

3 Grid Optimisation

In a grid environment, there are many variables which determine its overall performance, and which are impossible to harmonise into one optimal configuration for the whole grid. It is possible, however, to optimise those variables which are a part of the grid middleware itself. For a data grid, the most important areas to optimise will then be job scheduling and data management. This paper will concentrate on data management, and dynamic replica optimisation in particular.

3.1 Replication Model

There are several important design decisions behind the OptorSim replication model. First, it is distributed rather than centralised or hierarchical. Each site is able to manage its own replica content and there is no single point of failure; it also removes the need for sites to know the state of the whole grid.

Second, it operates on a pull rather than a push model. In a push model, the site containing a particular data file would decide when to replicate it and where to. In a pull model, a site which did not initially have the file would decide when to replicate it to itself, and where from. In a real particle physics grid, however, sites would be unlikely to accept spontaneous replication of data from some other site and so a pull model is favoured, with each site responsible for its own replica optimisation.

Finally, a replication trigger must be chosen. When a site is requested for a file which it does not have, for example, this could trigger the first stage of the replication strategy. Another possible trigger could be a file on some other site reaching a certain level of popularity. This would require monitoring of all file popularities, however. The simplest approach is to trigger on a file request, and this is what has been implemented in OptorSim.

The overall aim of such a distributed replication model would be to achieve global optimisation as a result of local optimisation by each site's Replica Optimiser (RO). Each RO therefore has two goals: the minimisation of individual job execution costs, and the maximisation of usefulness of locally stored files. A good replication strategy will be one which achieves a good trade-off between individual running times and overall resource utilisation.

3.2 Stages of Replication

Replication can be logically separated into three stages through which any replication strategy must proceed. These are delineated as follows, with each stage depending on the success of the preceding stage. First comes the replication decision, where, given the trigger condition, the RO at a site must decide whether or not to replicate the file to its local site. If it decides not to replicate, the file must be read remotely. The second stage is replica selection, where, if the RO has decided to replicate the file, it must choose which existing replica to copy. Finally, in the file deletion stage, if the local site does not have sufficient space to store the new replica, one or more files must be deleted until there is enough space. These three stages will each be discussed for the replication strategies which are described in Section 4.

4 OptorSim

OptorSim is an event-driven simulator, written in Java. As dynamic data replication involves automated decisions about replica placement and deletion, the emphasis is on simulation of the replica management infrastructure. The architecture and implementation are described in [8] and so only a brief description is given here.

4.1 Architecture

The conceptual model of the OptorSim architecture is shown in Figure 1. In this model, the grid consists of a number of sites, connected by network links. A grid site may have a Computing Element (CE), a Storage Element (SE) or both. Each site also has a Replica Optimiser (RO) which makes decisions on replications to that site. A Resource Broker (RB) handles the scheduling of jobs to sites, where they run on the CEs. Jobs process files, which are stored in the SEs and can be replicated between sites according to the decisions made by the RO.
A Replica Catalogue holds mappings of logical filenames to physical filenames, and a Replica Manager handles replications and registers them in the Catalogue.

[Figure: diagram of the grid as a set of sites, each with Computing and Storage Elements, connected by network links and scheduled through the Resource Broker.]
Figure 1: Grid architecture used in OptorSim.

To input a grid topology, a user specifies the storage capacity and computing power at each site, and the capacity and layout of the network links between each. SEs are defined to have a certain capacity, in GB, and CEs to have a certain number of "worker nodes" with a given processing power. Sites which have neither a CE nor an SE act as routers on the network. Background traffic on the network can also be simulated.

A physics analysis job usually processes a number of files. This is simulated by defining a list of jobs and the files that they need; a job will process some or all of the files in its dataset, according to the access pattern which has been chosen. Various access patterns have been implemented in OptorSim, aiming to simulate various possible grid scenarios; these include sequential access, where a job accesses each file it requires once, in order, and Zipf-like access.¹ The time a file takes to process depends on its size and on the number and processing power of worker nodes at the CE. Each site may also specify the set of job types which it will accept.

4.2 Optimisation Algorithms

There are two kinds of optimisation algorithm which may be investigated using OptorSim: the job scheduling algorithms used by the RB to decide which sites jobs should be sent to, and the data replication algorithms used by the RO at each site to decide when and how to replicate files. The focus of this paper is on the data replication algorithms, and so the job scheduling algorithms are not described here.

There are three broad options for replication strategies in OptorSim. Firstly, one can choose to perform no replication. Secondly, one can use a traditional algorithm which, when presented with a file request, always tries to replicate and, if necessary, deletes existing files to do so. Algorithms in this category are the LRU (Least Recently Used), which deletes those files which have been used least recently, and the LFU (Least Frequently Used), which deletes those which have been used least frequently in the recent past. Thirdly, one can use an economic model in which sites "buy" and "sell" files using an auction mechanism, and will only delete files if they are less valuable than the new file. Details of the auction mechanism and file value prediction algorithms can be found in [4]. There are currently two versions of the economic model: the binomial economic model, where file values are predicted by ranking the files in a binomial distribution according to their popularity in the recent past, and

¹ A Zipf-like distribution is defined as P(i) = K/i^a, where P(i) is the frequency of occurrence of the i-th ranked item; a pure Zipf distribution would have a = 1.
the Zipf economic model, where a Zipf-like distribution is used instead (a Zipf distribution is a power law in which a few events occur very frequently, while most events occur infrequently).

4.3 Evaluation Metrics

For evaluating grid performance, different users may have different criteria. An ordinary user will most likely be interested in the time a job takes to run, while a resource provider will want to see their resources being used efficiently. The evaluation metrics used in this paper are therefore the mean job time and the effective network usage (ENU). The ENU is the ratio of file requests which use network resources to the total number of file requests; the lower the ENU, the less loaded the network is and hence the more efficiently it is being used.

5 Experiments

To evaluate the performance of these replication strategies in a realistic grid scenario, a simulation of LCG, using the planned resources for physics analysis in 2008, was set up as follows. The topology, based on the layout of the national research networks, is shown in Figure 2. The network capacity was modified through the simulation by the addition of background network traffic, which varied according to the time of day as shown in Figure 3. This profile was based on measurements of available bandwidth between CERN and Lyon, and assumes that patterns of network usage will not change significantly over the next few years.

[Figure: daily profile of available bandwidth, ranging from about 100% overnight to roughly 50-70% during the day.]
Figure 3: Background network traffic profile used in the LCG 2008 simulation.

5.1 Grid Resources

The compute and storage requirements for LCG in the first few years of LHC data-taking were drawn from the LCG Technical Design Report [2]. While the data is initially stored at the Tier-0 and Tier-1 sites, user analysis jobs are expected to run at Tier-2 sites. Each Tier-2 was therefore given an averaged CE of 645 kSI2000. The Tier-0 and Tier-1 sites were given SEs according to their planned capacities, each being sufficient to hold a complete copy of all the data files. Each Tier-2 site was given a canonical SE size, averaging the total Tier-2 requirements over the number of Tier-2 sites; this gave an average SE size of 1.97 TB. Due to the limitations of available memory when running the simulation, however, the simulations were restricted to the order of 1000 jobs. To reach a state in the simulation where the SEs are full …

5.2 Jobs and Datasets

Four datasets were defined, corresponding to the four major LHC experiment collaborations. Each of these datasets was placed at the Tier-0 (CERN) and each Tier-1 at the beginning of the simulation, so each file initially had 12 replicas around the grid. Six job types were defined, with the job and file parameters as shown in Table 1. When a job ran on a site, it retrieved the files it required (according to the chosen access pattern) and processed them according to the computing resources available at that site. The probability of a particular job being run on the grid was modelled by the relative number of expected users for the different experiments.

[Table 1 lists, for each of the six job types, the dataset used, its size (TB), the total number of files, and the number of files per job.]
Table 1: Job configuration parameters used in the LCG 2008 configuration.
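The deletion stage of the LRU and LFU strategies described above can be sketched as follows. The data structures and function names here are illustrative assumptions for exposition, not OptorSim's actual interfaces:

```python
# Sketch of the file deletion stage for the "traditional" strategies:
# the Replica Optimiser must free enough space for a new replica by
# deleting locally stored files, ranked either by recency (LRU) or by
# frequency (LFU) of use. Structures are hypothetical, not OptorSim's.

def choose_victims(files, needed_gb, policy="LRU"):
    """files: dict name -> {'size': GB, 'last_access': t, 'accesses': n}"""
    if policy == "LRU":
        # delete the files used least recently
        ranked = sorted(files, key=lambda f: files[f]["last_access"])
    elif policy == "LFU":
        # delete the files used least frequently in the recent past
        ranked = sorted(files, key=lambda f: files[f]["accesses"])
    else:
        raise ValueError(policy)
    victims, freed = [], 0.0
    for name in ranked:
        if freed >= needed_gb:
            break
        victims.append(name)
        freed += files[name]["size"]
    return victims

files = {
    "a": {"size": 1.0, "last_access": 10, "accesses": 50},
    "b": {"size": 1.0, "last_access": 30, "accesses": 2},
    "c": {"size": 1.0, "last_access": 20, "accesses": 7},
}
print(choose_victims(files, 1.0, "LRU"))  # oldest access: ['a']
print(choose_victims(files, 1.0, "LFU"))  # fewest accesses: ['b']
```

The two policies differ only in the ranking key, which is why (as the results below suggest) their performance is often similar.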
[Figure: network topology diagram of grid sites, including CERN, FZK, DESY, FNAL, BNL, Lyon, CNAF, PIC, ASCC, Melbourne, TIFR and others, connected via the national research networks.]
Figure 2: Simplified topology of LCG in 2008.
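The effective network usage reported in the results that follow can be computed as a simple ratio. This sketch is one interpretation consistent with the definition of ENU as the fraction of file requests that use network resources; the function and its split into remote reads and replications are assumptions, not OptorSim's exact code:

```python
# Effective network usage (ENU) sketch: file requests satisfied over the
# network (remote reads and replica transfers) divided by all requests.
def effective_network_usage(remote_reads, replications, total_requests):
    # Both remote reads and replications consume network resources;
    # locally satisfied requests do not.
    return (remote_reads + replications) / total_requests

# e.g. 200 remote reads and 100 replications out of 1000 file requests:
print(effective_network_usage(200, 100, 1000))  # 0.3
```

A lower ENU means the network is less loaded and is therefore being used more efficiently.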
[Figure 4: Mean job time (s) and effective network usage for each replication strategy (no replication, LRU, LFU, binomial economic model, Zipf economic model).]

6 Results

… numbers of jobs, this relative improvement would hold. Replication is therefore an important way of reducing job times and network usage, and the relatively simple LRU and LFU strategies are the most effective for this topology. In the sections which follow, the results are taken with the low value of this parameter, and it should therefore be remembered that in reality, replication would have a much stronger effect.

6.2 Results with Site Policies

In the previous experiments, site policies (the types of jobs which would be accepted by each site) were set according to their planned usage. Here, the effect of site policies on the overall running of the grid is investigated. This was done by defining two extremes of policy. In the first, designated "all job types", all sites accepted all job types. In the second, designated "one job type", each site would accept only one job type, with an even distribution of sites for each job type. The default set of site policies is therefore in between these two extremes, and is designated in the results below as "mixed". The results are shown in Figure 5. These results show that the overall pattern of site policies on the grid has a powerful effect on performance. The mean job time with the "all job types" policy is about 60% lower than with the "one job type" policy. This is true across all the replication strategies, although the effect is strongest with no replication and with the LRU; it is again seen that the simple replication strategies are slightly faster than the economic models. "All job types" also gives a lower ENU (about 25% lower than the others). It seems clear that an egalitarian approach, in which resources are shared as much as possible, yields benefits to all grid users.

6.3 Results with Zipf Access

In all the experiments presented so far, jobs have accessed their files using a sequential access pattern. As mentioned in Section 4, however, Zipf-like file access may also be significant in a grid such as LCG. The performance of the replication strategies with a Zipf access pattern was therefore compared to that with sequential access, with the results shown in Figure 6. This shows quite a different pattern to the previous results. Although the four replication algorithms still have very similar performances, they are now about 75% faster than without replication. The ENU is correspondingly lower. This is due to the way in which a few files from each job's fileset are accessed many times during the jobs, while others are accessed infrequently. This allows the access histories to predict file values more accurately than with the sequential pattern, where they may see a file only once. As the number of jobs and the proportion of the whole dataset seen by an individual SE increases, however, the results with sequential access should tend towards a similar pattern as for the Zipf access. This is borne out by the results from varying this parameter with sequential access. The presence of any Zipf-like element, even if combined with a sequential pattern, would make dynamic replication highly desirable.
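The Zipf-like access pattern discussed above can be sketched by sampling file requests from a power-law weighting, following the footnoted definition P(i) = K/i^a. The helper names are illustrative, and a = 1 corresponds to a pure Zipf distribution:

```python
import random

# Sketch: sampling file requests from a Zipf-like distribution,
# P(i) proportional to 1/i**a over files ranked by popularity.
def zipf_weights(n_files, a=1.0):
    w = [1.0 / (i ** a) for i in range(1, n_files + 1)]
    s = sum(w)
    return [x / s for x in w]

def sample_requests(n_files, n_requests, a=1.0, seed=42):
    weights = zipf_weights(n_files, a)
    rng = random.Random(seed)
    return rng.choices(range(n_files), weights=weights, k=n_requests)

# With 100 files and a = 1, the top-ranked file alone receives about 19%
# of all requests (1 / H_100), which is why access histories can predict
# file values far better than under a sequential pattern.
print(zipf_weights(100)[0])
```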
[Figure 5: Mean job time (s) and effective network usage for each replication strategy under the three site policies (all job types, mixed, one job type).]

[Figure 6: Mean job time (s) and effective network usage for each replication strategy with the sequential and Zipf access patterns.]
University of Edinburgh, UK
Abstract

Grid enabled storage systems are a vital part of data processing in the grid environment. Even for sites primarily providing computing cycles, local storage caches will often be used to buffer input files and output results before transfer to other sites. In order for sites to process jobs efficiently it is necessary that site storage is not a bottleneck to Grid jobs.

dCache and DPM are two Grid middleware applications that provide disk pool management of storage resources. Their implementations of the storage resource manager (SRM) webservice allow them to provide a standard interface to this managed storage, enabling them to interact with other SRM enabled Grid middleware and storage devices. In this paper we present a comprehensive set of results showing the data transfer rates in and out of these two SRM systems when running on 2.4 series Linux kernels and with different underlying filesystems.

This benchmarking information is very important for the optimisation of Grid storage resources at smaller sites, which are required to provide an SRM interface to their available storage systems.
collaboration bears this out [3]. Sites will be able to use the results of this paper to make informed decisions about the optimal setup of their storage resources without the need to perform extensive analysis on their own.

Although in this paper we concentrate on particular SRM [4] grid storage software, the results will be of interest in optimising other types of grid storage at smaller sites.

This paper is organised as follows. Section 2 describes the grid middleware components that were used during the tests. Section 3 then goes on to describe the hardware, chosen to represent a typical Tier-2 setup, that was used during the tests. Section 3.1 details the filesystem formats that were studied and the Linux kernel that was used to operate the disk pools. Our testing method is outlined in Section 4 and the results of these tests are reported and discussed in Section 5. In Section 6 we present suggestions of possible future work that could be carried out to extend our understanding of optimisation of Grid enabled storage elements at small sites, and conclude in Section 7.

2 Grid middleware components

2.1 SRM

The use of standard interfaces to storage resources is essential in a Grid environment like the WLCG since it will enable interoperation of the heterogeneous collection of storage hardware that the collaborating institutes have available to them. Within the high energy physics community the storage resource manager (SRM) [4] interface has been chosen by the LHC experiments as one of the baseline services [5] that participating institutes should provide to allow access to their disk and tape storage across the Grid. It should be noted here that a storage element that provides an SRM interface will be referred to as 'an SRM'. The storage resource broker (SRB) [6] is an alternative technology developed by the San Diego Supercomputing Center [7] that uses a client-server approach to create a logical distributed file system for users, with a single global logical namespace or file hierarchy. This has not been chosen as one of the baseline services within the WLCG.

2.2 dCache

dCache [8] is a system jointly developed by DESY and FNAL that aims to provide a mechanism for storing and retrieving huge amounts of data among a large number of heterogeneous server nodes, which can be of varying architectures (x86, ia32, ia64). It provides a single namespace view of all of the files that it manages and allows access to these files using a variety of protocols, including SRM. By connecting dCache to a tape backend, it becomes a hierarchical storage manager. However, this is not of particular relevance to Tier-2 sites, who do not typically have tape storage. dCache is a highly configurable storage solution and can be deployed at Tier-2 sites where DPM is not sufficiently flexible.

2.3 Disk pool manager

Developed at CERN, and now part of the gLite middleware set, the disk pool manager (DPM) is similar to dCache in that it provides a single namespace view of all of the files stored on the multiple disk servers that it manages, and provides a variety of methods for accessing this data, including SRM. DPM was always intended to be used primarily at Tier-2 sites, so has an emphasis on ease of deployment and maintenance. Consequently it lacks some of the sophistication of dCache, but is simpler to configure and run.

2.4 File transfer service

The gLite file transfer service (FTS) [9] is a grid middleware component that aims to provide reliable file transfer between storage elements that provide the SRM or GridFTP [10] interface. It uses the concept of channels [11] to define unidirectional data transfer paths between storage elements, which usually map to dedicated network pipes. There are a number of transfer parameters that can be modified on each of these channels in order to control the behaviour of the file transfers between the source and destination storage elements. We concern ourselves with two of these: the number of concurrent file transfers (Nf) and the number of parallel GridFTP streams (Ns). Nf is the number of files that FTS will simultaneously transfer in any bulk file transfer operation. Ns is the number of simultaneous GridFTP channels that are opened up for each of these files.
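The interaction of these two FTS parameters can be illustrated with a small sketch. The helper functions are assumptions for exposition, not part of the gLite FTS:

```python
# Nf concurrent file transfers, each opened with Ns parallel GridFTP
# streams, gives Nf * Ns simultaneous data streams on the channel.
def total_streams(n_f, n_s):
    return n_f * n_s

# A bulk transfer of many files proceeds roughly Nf files at a time.
def batches(n_files, n_f):
    return (n_files + n_f - 1) // n_f  # ceiling division

print(total_streams(10, 5))  # 50 concurrent streams
print(batches(100, 10))      # 10 waves of concurrent transfers
```

This is why the two parameters trade off against each other: raising either one increases the load on the source and destination storage elements.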
Scientific Linux [18] 3.0.5 distribution. However, … Similarly, Table 3 shows the rate averaged over Nf.
Nf    ext2   ext3   jfs    xfs
1     157    146    137    156
3     207    176    236    236
5     209    162    246    245
10    207    165    244    247

Table 2: Average transfer rates per filesystem for dCache for each Nf.

Nf    ext2   ext3   jfs    xfs
1     214    192    141    204
3     297    252    357    341
5     300    261    368    354
10    282    253    379    356

Table 4: Average transfer rates per filesystem for DPM for each Nf.
Table 4 shows the transfer rate, averaged over Ns, for DPM for each of the filesystems. Similarly, Table 5 shows the rate averaged over Nf. The following conclusions can be drawn:

1. Transfer rates are greater when using modern high performance filesystems like jfs and xfs than the older ext2,3 filesystems.

2. Transfer rates for ext2 are higher than ext3, because it does not suffer a journalling overhead.

3. For all filesystems, having more than 1 simultaneous file transferred improves the average transfer rate substantially. There appears to be little dependence of the average transfer rate on the number of files in a multi-file transfer (for the range of Nf studied).

4a. With dCache, for all filesystems, single stream transfers achieve a higher average transfer rate than multistream transfers.

4b. With DPM, single stream transfers for ext2,3 also achieve a higher average rate than multistream transfers. For xfs and jfs multistreaming has little effect on the rate.

4c. In both cases, there appears to be little dependence of the average transfer rate on the number of streams in a multistream transfer (for the range of Ns studied).

5.2 Error Rates

Table 6 shows the average percentage error rates for the transfers obtained with both SRMs.

          ext2   ext3   jfs    xfs
dCache    0.21   0.05   0      0
DPM       0.05   0.10   1.04   0.21

Table 6: Percentage observed error rates for different filesystems and kernels with dCache and DPM.

5.2.1 dCache

With dCache there were a small number of transfer errors for ext2,3 filesystems. These can be traced back to FTS cancelling the transfer of a single file. It is likely that the high machine load generated by the I/O traffic impaired the performance of the dCache SRM service. No errors were reported with the xfs and jfs filesystems, which can be correlated with the correspondingly lower load that was observed on the system compared to the ext2,3 filesystems.

5.2.2 DPM

With DPM all of the filesystems can lead to errors in transfers. Similarly to dCache, these errors generally occur because of a failure to correctly call srmSetDone() in FTS. This can be traced back to the DPM SRM daemons being …
…ating system (which would also allow testing of a 2.6 Linux kernel).

3. Investigate the effect of using a different stripe size for the RAID configuration of the disk servers.

4. Vanilla installations of SL generally come with unsuitable default values for the TCP read and write buffer sizes. In light of this, it will be interesting to study how changes in the Linux kernel-networking tuning parameters change the FTS data transfer rates. Initial work in this area has shown that improvements can be made.

5. It would be interesting to investigate the effect of different TCP implementations within the Linux kernel, for example TCP-BIC [19], Westwood [20], Vegas [21] and Web100 [22].

6. Characterise the performance enhancement that can be gained by the addition of extra hardware, particularly the inclusion of extra disk servers.

When the WLCG goes into full production, a more realistic use case for the operation of an SRM will be one in which it is simultaneously writing (as is the case here) and reading files across the WAN. Such a simulation should also include a background of local file access to the storage element, as is expected to occur during periods of end user analysis on Tier-2 computing clusters. Preliminary work has already started in this area, where we have been observing the performance of existing SRMs within ScotGrid during simultaneous reading and writing of data.

7 Conclusion

…into a characteristic Tier-2 SRM setup, the results can be summarised as follows:

1. Pool filesystem: jfs or xfs.

2. FTS parameters: a low value for Ns and a high value for Nf (staggering the file transfer start times). In particular, for dCache Ns = 1 was identified as giving the optimal transfer rate.

It is not possible to make a single recommendation on the SRM application that sites should use based on the results presented here. This decision must be made depending upon the available hardware and manpower resources and consideration of the relative features of each SRM solution.

Extensions to this work were suggested, ranging from studies of the interface between the kernel and network layers all the way up to making changes to the hardware configuration. Only when we understand how the hardware and software components interact to make up an entire Grid-enabled storage element will we be able to give a fully informed set of recommendations to Tier-2 sites regarding the optimal SRM configuration. This information is required by them in order that they can provide the level of service expected by the WLCG and other Grid projects.

8 Acknowledgements

The research presented here was performed using ScotGrid [2] hardware and was made possible with funding from the Particle Physics and Astronomy Research Council through the GridPP project [23]. Thanks go to Paul Millar and David Martin for their technical assistance during the course of this work.
References

[5] LCG Baseline Services Group Report. http://cern.ch/LCG/peb/bs/BSReport-v1.0.pdf

[6] The SRB collaboration. http://www.sdsc.edu/srb/index.php/Main_Page

[7] The San Diego Supercomputing Center. http://www.sdsc.edu

[8] dCache collaboration. http://www.dcache.org

[9] gLite FTS. https://uimon.cern.ch/twiki/bin/view/LCG/FtsRelease14

[10] Data transport protocol developed by the Globus Alliance. http://www.globus.org/grid_software/data/gridftp.php

[11] P. Kunszt, P. Badino, G. McCance. The gLite File Transfer Service: Middleware Lessons Learned from the Service Challenges. CHEP 2006, Mumbai. http://indico.cern.ch/contributionDisplay.py?contribId=20&sessionId=7&confId=048

[12] LCG generic installation manual. http://grid-deployment.web.cern.ch/grid-deployment/documentation/LCG2-Manual-Install/

[13] PostgreSQL Database website. http://www.postgresql.org/

[14] R. Card, T. Ts'o, S. Tweedie (1994). "Design and implementation of the second extended filesystem." Proceedings of the First Dutch International Symposium on Linux. ISBN 90-367-0385-9.

[15] Linux ext3 FAQ. http://batleth.sapienti-sat.org/projects/FAQs/ext3-faq.html

[16] JFS filesystem homepage. http://jfs.sourceforge.net/

[17] XFS filesystem homepage. http://oss.sgi.com/projects/xfs/index.html

[18] Scientific Linux homepage. http://scientificlinux.org

[19] A TCP variant for high speed long distance networks. http://www.csc.ncsu.edu/faculty/rhee/export/bitcp/index.htm

[20] TCP Westwood details. http://www.cs.ucla.edu/NRL/hpi/tcpw/

[21] TCP Vegas details. http://www.cs.arizona.edu/protocols/

[22] TCP Web100 details. http://www.hep.ucl.ac.uk/ytl/tcpip/web100/

[23] GridPP - the UK particle physics Grid. http://gridpp.ac.uk
…the fact that common conventions are used and they are internally consistent.

The reduction in data size in going from a WebRowSet to a CSV format can be estimated by calculating the space required to represent the same result in each format. This can be done by calculating the number of extra characters needed to describe a row of data. Assuming for CSV data that all fields are wrapped in double quotes and that there are no escaped characters, the number of extra characters needed to represent a row in a CSV document is (number_of_columns * 3), i.e. two quotes and a comma per field, whereas for WebRowSet the use of specific WebRowSet-defined XML tags makes this number (number_of_columns * 27) + 25. So, WebRowSet always requires at least nine times as many non-data characters as CSV.

3.2. Use Case 2: Transferring Binary Data

A slight variation on the previous use case involves using OGSA-DAI to provide access to files stored on a server's file system. Commonly, these could be large binary data files (e.g., medical images), stored in a file system, with the associated metadata for these files stored separately in a relational database. The client queries the databases to locate any files of interest, which are then retrieved by using the OGSA-DAI delivery mechanisms. Files are retrieved separately from the SOAP interactions for data transport efficiency. In some cases, however, it can be more convenient for a client to receive the data back in a SOAP response message rather than using an alternative delivery mechanism.

The implementation of this scenario includes the same six steps as in the first use case, except that step 4 now requires the conversion of a binary file to a text-based format, usually Base64 encoding, in order for it to be sent back in a SOAP message, and step 6 includes decoding of the file back into its original binary format.

3.2.1. Improvement 3: SOAP with Attachments

The major bottleneck arising from this scenario is the Base64 encoding of the binary data for inclusion in a SOAP message. This encoding requires additional computation at both the client and the server side. Moreover, the converted data is approximately 135% of the size of the original file, clearly impacting the efficiency of the data transfer.

We have addressed both of these concerns by using SOAP messages with attachments [BTN00]. This approach significantly reduces the time required to process SOAP messages and allows the transfer of binary data to take place without necessitating Base64 encoding. The one difficulty with this approach is that, as SOAP messages with attachments are not a standard feature of all SOAP specifications, interoperability issues can arise.

4. Experimental Results

To quantify the effect of the performance improvements outlined in the preceding section, we compared the performance of OGSA-DAI WSRF v2.2 with OGSA-DAI WSRF v2.1.

4.1. Experimental Setup

We ran our experiments using an Apache Tomcat 5.0.28 / Globus Toolkit WS-Core 4.0.1 stack. Our experimental setup consisted of a client machine and a server machine on the same LAN. We ran the server code on a Sun Fire V240 Server, which has a dual 1.5 GHz UltraSPARC IIIi processor and 8 GB of memory, running Solaris 10 with J2SE 1.4.2_05. The client machine was a dual 2.40 GHz Intel Xeon system running Red Hat 9 Linux with the 2.4.21 kernel and J2SE 1.4.2_08. Both the client JVM and the JVM running the Tomcat container were started in server mode by using the -server -Xms256m -Xmx256m set of flags.

The network packets in these experiments traversed two routers, and iperf 1.7.0 found the network bandwidth to be approximately 94 Mbits/s. The average round-trip latency was less than 1 ms.
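The per-row overhead estimate from Section 3.1 (three extra characters per column for CSV, versus 27 per column plus 25 for WebRowSet's XML tags) can be checked with a short calculation:

```python
def csv_overhead(ncols):
    # two double quotes plus a separating comma per field
    return ncols * 3

def webrowset_overhead(ncols):
    # per-column XML tags plus the enclosing row element
    return ncols * 27 + 25

# For a 4-column row the ratio is (4*27 + 25) / (4*3) ≈ 11.1, and it can
# never fall below 27/3 = 9 however many columns the result set has.
ratio = webrowset_overhead(4) / csv_overhead(4)
```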
The database used in the experiments was MySQL 5.0.15 with MySQL Connector/J driver version 3.1.10. We used littleblackbook, the sample database table distributed with OGSA-DAI, for all experiments. The average row length for this table is 66 bytes. The rows have the following schema: int(11), varchar(64), varchar(128), varchar(20).

Before we took any measurements, both the client and server JVMs were warmed up. Then each test was executed 10 times. The results reported are the average of these runs, with error bars indicating +/- one standard deviation.

4.2. Faster Conversions and Change in Data Format

Our first set of experiments was based on the two improvements suggested by our first use case, namely, optimizing the code to do the data format conversions faster and evaluating the use of CSV instead of WebRowSet for an intermediate format. For a set of queries returning results consisting of 32 to 16,384 rows, we measured the time to perform the 6 steps involved in a client-service interaction, including the translation of the results into (and out of) WebRowSet or CSV formats as appropriate.

Figure 1 shows the overall timing results. Queries for 512 rows or greater show a significant improvement, up to 35%, by simply optimizing the WebRowSet conversion (Improvement 1). The use of CSV instead of WebRowSet (Improvement 2) also shows a significant improvement for larger queries, up to 65% over the original v2.1 and about 50% over simply optimizing the conversion.

Figure 2 shows in more detail the performance improvements on the server side, using Apache Axis logging to obtain the times spent in the different phases of the SOAP request processing. We divide the server performance into three phases:

1. Axis Parsing: the time spent in Apache Axis parsing a SOAP request.

2. OGSA-DAI Server: the time OGSA-DAI spent performing the requested activities and building the response document.

3. Message Transfer: the time the server spent sending a message back to the client.
Figure 1: Measurements comparing the effects of better conversion code (Improvement 1) and using
CSV formatting instead of WebRowSet format (Improvement 2) against the original v2.1 code. Results
include both client and server times.
Figure 2: Time spent in the server only, split into three phases: Apache Axis parsing, OGSA-DAI server
work, and the message transfer to the client.
The time spent in the Axis parsing phase is roughly constant, as we always perform the same operation. In all cases the time spent in the OGSA-DAI server phase dominates and generally increases systematically with the size of the query results obtained. The largest portion of this phase is spent translating the results from a Java ResultSet object into WebRowSet or CSV. The optimised conversion to WebRowSet (Improvement 1) works up to 50% faster than the original version for large result sets. By using CSV instead of WebRowSet (Improvement 2), we also see a large reduction in delivery time and reduced network traffic in the Message Transfer phase.

4.3. Using SOAP Attachments

The second set of experiments tested the use of SOAP with attachments, as outlined in our second use case and Improvement 3 in Section 3.2. We compare using Base64 encoding and returning the data in the body of a SOAP message to using SOAP messages with attachments to transfer a binary file.

Figure 3 shows the time taken to transfer binary data of increasing size using both delivery methods. We only have data for the original v2.1 code up to 8 MB file sizes because the process of Base64 encoding and building a SOAP response for file sizes of 16 MB upwards consumed all the heap memory available to the JVM and consequently caused the JVM to terminate. The performance gain shows nonlinear growth with increasing file size. Transferring an 8 MB file as a SOAP attachment takes only 25% of the time needed to transfer the same file inside the body of a SOAP message. This improvement is due to the fact that SOAP attachments do not need any special encoding, and less time is spent processing XML because the data is outside the body of the SOAP message.

Figure 4 gives additional detail about the server-side performance using the three previously defined phases. The Axis parsing phase is roughly constant, as before. During the OGSA-DAI server phase, the original SOAP delivery method is CPU bound because of the Base64 conversion required, while the performance of the new SOAP with attachments case is generally much better, limited mainly by the performance of the I/O operations rather than those of the CPU.
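The roughly 135% size figure quoted in Section 3.2.1 for Base64-encoded payloads is simple to reproduce: Base64 emits four output characters for every three input bytes (a 4/3 inflation), and MIME-style line wrapping adds a newline every 76 characters:

```python
import base64

def base64_inflation(nbytes):
    """Size ratio of a MIME-style Base64 body (76-char lines) to the raw bytes."""
    raw = bytes(nbytes)
    encoded = base64.encodebytes(raw)  # wraps lines as a MIME body would
    return len(encoded) / len(raw)

print(round(base64_inflation(8 * 1024 * 1024), 2))  # → 1.35
```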
Figure 3: Time taken to transfer a binary file using Base64 encoded data inside the body of a SOAP
message and as a SOAP attachment (Improvement 3).
Figure 4: Time spent on server side split into phases. For each group the bar on the left measures the time
spent sending binary data inside a SOAP message (Original) while the right bar corresponds to the
approach where binary data is sent as a SOAP attachment (Improvement 3).
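The "SOAP with attachments" packaging compared in Figures 3 and 4 places the binary payload outside the XML envelope, as a separate part of a MIME multipart/related message [BTN00]. The sketch below assembles such a message by hand to make the structure explicit; the boundary and Content-ID values are arbitrary placeholders:

```python
def build_swa_message(soap_envelope, attachment, content_id="payload"):
    """Package a SOAP envelope plus a raw binary attachment (SOAP with attachments)."""
    boundary = "MIME_boundary"
    head = "\r\n".join([
        "--" + boundary,
        'Content-Type: text/xml; charset="UTF-8"',
        "Content-ID: <soap-envelope>",
        "",
        soap_envelope,
        "--" + boundary,
        "Content-Type: application/octet-stream",
        "Content-Transfer-Encoding: binary",   # raw bytes, no Base64 inflation
        "Content-ID: <%s>" % content_id,
        "",
        "",
    ]).encode("utf-8")
    return head + attachment + ("\r\n--%s--\r\n" % boundary).encode("utf-8")
```

Because the attachment travels as raw bytes, the message grows only by a fixed few hundred bytes of MIME headers rather than by roughly a third of the payload size.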
Figure 5: Execution time for scenarios fetching SQL results converted to XML and CSV data using two
delivery mechanisms: delivery inside the body of a SOAP message and delivery as a SOAP attachment.
…small data sets (less than 512 rows). For larger data sets, however, SOAP with attachments can achieve transfer times that are up to 30% faster than when using WebRowSet for delivery. Only the largest data set sees a performance gain when CSV formatting is used with SOAP attachments.

5. Conclusions

This paper summarises part of an ongoing effort to improve the performance of OGSA-DAI. We have analysed two typical use patterns, which were then profiled and the results used as a basis for implementing a focused set of performance improvements. The benefit of these has been demonstrated by comparing the performance of the current release of OGSA-DAI, which includes the performance improvements, with the previous release, which does not. We have seen performance improvements of over 50% in some instances. Source code and the results data are available from [Dob06].

Acknowledgements

This work is supported by the UK e-Science Grid Core Programme, through the Open Middleware Infrastructure Institute, and by the Mathematical, Information, and Computational Sciences Division subprogram of the Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, under Contract W-31-109-ENG-38.

We also gratefully acknowledge the input of our past and present partners and contributors to the OGSA-DAI project, including EPCC, IBM UK, IBM Corp., NeSC, University of Manchester, University of Newcastle and Oracle UK.

References

[AAB+05] M. Antonioletti, M. Atkinson, R. Baxter, A. Borley, N. P. Chue Hong, P. Dantressangle, A. C. Hume, M. Jackson, A. Krause, S. Laws, M. Parsons, N. W. Paton, J. M. Schopf, T. Sugden, P. Watson, and D. Vyvyan. OGSA-DAI Status and Benchmarks. Proceedings of the UK e-Science All Hands Meeting, 2005.

[AAB+05a] M. Antonioletti, M. P. Atkinson, R. Baxter, A. Borley, N. P. Chue Hong, B. Collins, N. Hardman, A. Hume, A. Knox, M. Jackson, A. Krause, S. Laws, J. Magowan, N. W. Paton, D. Pearson, T. Sugden, P. Watson, and M. Westhead. The Design and Implementation of Grid Database Services in OGSA-DAI. Concurrency and Computation: Practice and Experience.
Proceedings of the UK e-Science All Hands Meeting 2006, © NeSC 2006, ISBN 0-9553988-0-0
135 9
Proceedings of the UK e-Science All Hands Meeting 2006, © NeSC 2006, ISBN 0-9553988-0-0
136 10
Proceedings of the UK e-Science All Hands Meeting 2006, © NeSC 2006, ISBN 0-9553988-0-0
Abstract
With the emergence of Grid computing and service-oriented architectures, computing is becoming less
confined to traditional computing platforms. Grid technology promises access to vast computing power
and large data resources across geographically dispersed areas. This capability can be significantly
enhanced by establishing support for small, mobile wireless devices to deliver Grid-enabled applications
under demanding circumstances. Grid access from mobile devices is a promising area where users in
remote environments can benefit from the compute and data resources usually unavailable while working
out in the field. This is reflected in the aircraft maintenance scenario where decisions are increasingly
pro-active in nature, requiring decision-makers to have up-to-date information and tools in order to avoid
impending failures and reduce aircraft downtime. In this paper, an application recently developed for
advanced aircraft health monitoring and diagnostics is presented. CBR technology for decision support is
implemented in a distributed, scalable manner on the Grid that can deliver increased value to remote
users on mobile devices via secure web services.
1 Introduction
This paper describes a distributed, service-
oriented Case-Based Reasoning (CBR) system
built to provide knowledge-based decision support
in order to improve the maintenance of Rolls-
Royce gas turbine engines [1]. Figure 1 shows a
typical gas turbine engine (GTE) used to power
modern aircraft. As of 2006 there are around
54,000 of these engines in service, with around
10M flying hours per month. These GTEs, also
known as aero engines, are extremely reliable
machines and operational failures are rare. However, currently great effort is being put into reducing the number of in-flight engine shutdowns, aborted take-offs and flight delays through the use of advanced health monitoring technology. In this maintenance environment, decisions are increasingly pro-active in nature, requiring decision-makers to have up-to-date information and tools in order to avoid impending failures and reduce aircraft downtime. This represents a benefit to society through reduced delays, reduced anxiety and reduced cost of ownership of aircraft.

Figure 1. A Rolls-Royce Gas Turbine Engine

Grid computing [2] from mobile devices is a promising area where users in remote environments can benefit from the compute and data resources usually unavailable while working out in the field. This paper describes how Grid, CBR and mobile technologies are brought together to deliver a substantial further improvement in maintenance ability by facilitating the wide-scale communication of information, knowledge and advice between individual aircraft, airline repair and overhaul bases, world-wide data warehouses and the engine manufacturer. This paper focuses on Proxim-CBR, a recently developed system, as an example of how Grid technology can be extended to handheld…
2 Related Work
The Rolls-Royce University Technology Centre (UTC) in Control and Systems Engineering at The University of Sheffield has a long history of using CBR for engine health monitoring and diagnostics. The UTC has actively explored a variety of on-wing control system diagnostic techniques and also portable maintenance aid applications. CBR techniques were applied in developing a portable PC-based Flightline Maintenance Advisor [3, 4] to correlate and integrate fault indicators from the engine monitoring systems, Built-In Test Equipment (BITE) reports, maintenance data and dialog with maintenance personnel to allow troubleshooting of faults (figure 2). The primary advantage of the CBR-based tool over a traditional paper-based Fault Isolation Manual is its capability to use knowledge of the pattern of multiple fault symptoms to isolate complex faults. This effort eventually escalated to the development of an improved Portable Maintenance Aid for the Boeing 777 aircraft. The outcome of these initiatives included the implementation of a portable Flightline Maintenance Advisor that was trialled with Singapore Airlines.

Figure 2. Structure of the Flightline Maintenance Advisor

More recently, rather than using a portable computer which needs updating with new data as it becomes available, it was clearly desirable for a CBR system to be accessed remotely by engineers over a computer network. The advantage of this is that it is easier to support, and it also allows search of an extensive knowledge base of historical maintenance incidents across an entire fleet of engines. This was evident in the outcome of a large-scale initiative that explored this potential and developed new solutions [5, 6]. The Distributed Aircraft Maintenance Environment (DAME) project was a pilot project supported under the United Kingdom e-Science research programme in Grid technologies. DAME was particularly focused on the notion of proof-of-concept, using the Globus Toolkit and other emerging Grid technologies to develop a demonstration system. This was known as the DAME Diagnostic and Prognostic Workbench, which was deployed as a Web-enabled service, linking together global members of the DAME Virtual Organisation and providing an Internet-accessible portal to very large, distributed engine data resources, advanced engine diagnostic applications and supercomputer clusters that delivered the required processing power. DAME tackled complex issues such as security and management of distributed and non-homogeneous data repositories within a diagnostic analysis framework with distributed users and computing resources.

Current research includes work being done in the Business Resource Optimisation for Aftermarket and Design on Engineering Networks (BROADEN) project. BROADEN is a UK DTI funded project [7], led by Rolls-Royce Plc, that aims to build an internal Grid at Rolls-Royce to further prove, in a production environment, the technologies developed in DAME. This will include integrated diagnostic and prognostic tools for health monitoring across development test, production pass-off and in-service aero engines, and the formulation of a strategy to transfer proven Grid technology to production networks. This project supports the "Health Management and Prognostics of Complex Systems" research theme from the Aerospace Innovation and Growth Team's National Aerospace Technology Strategy.

3 Engine Diagnostics and Maintenance

A fundamental change of emphasis is taking place within companies such as Rolls-Royce Plc where, instead of selling engines to customers, there is a shift to the adoption of power-by-the-hour contracts. In these contracts, the engines are effectively provisioned to clients whilst the engine manufacturer retains total responsibility for maintaining the engine. To support this new approach, improvements in in-flight monitoring of engines are being introduced, with the collection of much more detailed data on the operation of the engine. New engine monitoring equipment [8] can record up to 1 GB of data per engine for each flight. In the future, terabytes of data could be recorded every day for diagnostic and prognostic purposes.
…in our implementation. However, this does not reflect the actual software architecture: it depicts the functional as opposed to the physical configuration. In the following sections, these components are described in further detail.

4.1 Knowledge Base

Essential to our CBR system is the casebase that represents the knowledge repository. This contains detailed descriptions of engine faults and the best practice maintenance advice (solutions) gathered from engineers and experienced mechanics over the development and service life of engines. For a new engine type, little information is known initially, but with our CBR technique a casebase of independent records of fault and maintenance information can be developed in a piecemeal manner and updated as and when knowledge about the behaviour of the engine is gained. More importantly, the location of the knowledge base within a virtual maintenance facility on the Grid also allows the integration of diagnostic knowledge from multiple health information sources, which is vital in improving the accuracy and coverage of the CBR system. Useful diagnostic information previously available only from separate monitoring and maintenance systems, when brought together into a single system, provides for a more powerful diagnostic tool.

4.2 CBR Engine

The primary responsibility of the CBR Engine is to read the knowledge base into memory and perform retrieval, matching and ranking of the cases based on specific query input. The CBR Engine also provides the interface for the system to generate new cases, view and manage existing cases, manage the knowledge model and execute any external modules that contribute to the analysis of information within the system. Retrieval, matching and ranking are performed using an enhanced, weighted nearest-neighbour algorithm of progressive complexity. Using a well defined, internal Knowledge Model of the domain that it operates on, the algorithm effectively emulates how an expert would identify the problem from past knowledge by performing similarity measures across the available information. Although this is done automatically, trained users are allowed to influence the entire process by choosing specific attributes to analyse. They can configure specific weighting of attributes to correspond to the "importance" of each chosen attribute. In addition, hard constraints can be defined beforehand so that the algorithm can optimise the entire process by reducing the number of potential cases to be matched. With a continuously expanding knowledge base of cases at a global scale, the process described above presents an ever increasing demand for computing resources.

4.3 Recommendation

Traditionally, existing cases in the knowledge base are adapted to form a solution for a new problem. In this diagnostics and maintenance environment, however, the output of the CBR system is delivered in the form of a discrete set of potential candidates as opposed to a single solution. Concern for accountability in a safety-critical environment often dictates that the system not force a user into accepting a single solution. The system aims to recommend a narrowed-down choice of possible solutions, using knowledge accumulated previously about the domain on which it operates. This will enable the user, an aircraft expert in particular, to make an informed decision. In order to do that, each solution is accompanied by the system's confidence in that answer. Various threshold levels can be tuned both to limit the number of potential solutions and to eliminate those of inadequate confidence. Ultimately, the system acts to support the decision making task, but the aircraft expert retains the responsibility for the final decision.

4.4 Confidence

As described above, confidence is one of the critical factors of the CBR system in recommending solutions to the expert. The algorithm will compute the confidence of a solution by taking into account prior knowledge contained in the knowledge model, which includes a model for natural-language terminology, boundary values for each attribute, and attribute relationship models. The process is repeated for each candidate solution to finally obtain a ranked list of recommended solutions. It is important to note here again that a trained user is able to fine-tune the entire process by selecting specific attributes, configuring the weighting of each attribute to correspond to the relative "importance" of that attribute, and also defining hard constraints so that the algorithm can further optimise the process.
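The retrieval scheme described in Sections 4.2 and 4.4 (hard constraints pruning the casebase, then a weighted nearest-neighbour ranking with per-solution confidences and a tunable threshold) can be sketched as follows. The attribute names, weights and exact-match similarity measure are purely illustrative; the actual Proxim-CBR knowledge model is considerably richer:

```python
def similarity(case, query, weights):
    """Weighted nearest-neighbour score in [0, 1] (exact-match per attribute)."""
    total = sum(weights.values())
    matched = sum(w for attr, w in weights.items()
                  if case.get(attr) == query.get(attr))
    return matched / total

def recommend(casebase, query, weights, constraints=None, threshold=0.5):
    # Hard constraints prune the casebase before any matching is done
    constraints = constraints or {}
    candidates = [c for c in casebase
                  if all(c.get(k) == v for k, v in constraints.items())]
    # Rank survivors by confidence; keep only sufficiently confident solutions
    ranked = sorted(candidates,
                    key=lambda c: similarity(c, query, weights),
                    reverse=True)
    return [(similarity(c, query, weights), c["advice"])
            for c in ranked
            if similarity(c, query, weights) >= threshold]
```

Each returned pair carries the system's confidence alongside the advice, so the expert sees a ranked shortlist rather than a single forced answer, matching the accountability requirement described in Section 4.3.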
5 Grid Computing and CBR Deployment

The Grid is defined as an aggregation of geographically dispersed computing, storage and networking resources [2]. It provides many advantages over conventional high-performance computing because it is co-ordinated to deliver improved performance, better utilisation, scalability, higher quality of service and easier access to data. This makes the Grid ideal for collaborative decision support across virtual organisations because it enables the sharing of applications and data in an open, heterogeneous environment. Applications previously considered to be host-centric, such as the CBR system described in the previous sections, can now be distributed throughout a Grid computing system, thus improving quality of service while also offering significantly enhanced capabilities. Our use of industry standard technologies to implement such a Grid-enabled CBR system is discussed here.

5.1 Grid Service

Grid Services are essentially Web Services with improved characteristics such as state and life-cycle management [13]. Our CBR system's functionality has been delivered as a Grid service in a Service-Oriented Architecture (SOA) environment. SOAs are essentially a collection of services, focusing on interoperability and location transparency. These services communicate with each other, and can be seen as unique tools performing different parts of a complex task. Communication can involve either simple data passing or it could involve two or more services co-ordinating some activity. SOAs and services are about designing and building a system using heterogeneous network addressable software components. An important aspect of the service-based CBR delivery is that it separates the service's implementation from its interface.

A typical use scenario for the CBR service is as follows: A Web browser initiates a request to the CBR service at a given Internet address using the SOAP (Simple Object Access Protocol) protocol over HTTP (Hypertext Transfer Protocol). The CBR service receives the request, processes it, and returns a response. Based on emerging standards such as XML (eXtensible Markup Language), SOAP, UDDI (Universal Description, Discovery and Integration) and WSDL (Web Service Definition Language), the CBR service along with other diagnostic services form a distributed environment in which applications, or application components, can inter-operate seamlessly across the virtual organization in a platform-neutral, language-neutral fashion. Consumers of the CBR service are allowed to view the service simply as an endpoint that supports a particular request format or contract. Consumers need not be concerned about how the CBR service goes about executing their requests; they expect only that it will.

5.2 High Performance Computing

High performance computing support for the CBR service is provided by means of implementing Open Grid Services Architecture (OGSA) concepts with Web Service technologies. With this, the above-mentioned CBR service has benefited from both Web Service technologies as well as Grid functionality. More specifically, the compute-intensive tasks within the CBR matching process, previously computed on a single computer, can now be aggregated across any available Grid supercomputing node with the use of Grid middleware, the Grid's effective operating system [14]. With this, multiple instances of the CBR service can be generated on-demand. The advantage of this is that it allows multiple consumer requests to be processed simultaneously by separate nodes of the Grid. With our CBR system as a prime example, the integration of Grid computing with Web services has benefited aircraft experts at any remote geographical location by providing them with access to powerful diagnostic tools, data repositories and large computing resources via any Internet-enabled computing device.

6 Grid-Enabled Decision Support on Mobile Handheld Devices

In addition to the typical diagnostic and maintenance scenario described previously, a situation may arise where an aircraft engine expert, usually regarded as a high-value resource, is required to investigate a currently occurring problem under demanding circumstances; i.e. he/she may be traveling to a foreign aircraft site and a conventional computer is not available. Furthermore, the ability of a handheld computer to capture investigative findings on-site and instantly upload these findings to the Grid-based system greatly increases the accuracy of the CBR system in diagnosing the problem. The advantage in
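The request/response exchange described in Section 5.1 can be sketched in Python. The endpoint address, the "matchCase" operation name and the <symptom> elements below are illustrative assumptions, not the actual CBR service contract; only the SOAP-over-HTTP mechanics mirror the scenario in the text.

```python
# Sketch of a SOAP-over-HTTP call to a CBR-style service. The endpoint URL,
# "matchCase" operation and <symptom> elements are hypothetical.
import urllib.request

CBR_ENDPOINT = "http://example.org/cbr-service"  # hypothetical address


def build_soap_envelope(symptoms):
    """Wrap a diagnostic query in a minimal SOAP 1.1 envelope."""
    body = "".join(f"<symptom>{s}</symptom>" for s in symptoms)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">'
        f"<soap:Body><matchCase>{body}</matchCase></soap:Body>"
        "</soap:Envelope>"
    )


def query_cbr_service(symptoms):
    """POST the request over HTTP and return the service's XML response."""
    request = urllib.request.Request(
        CBR_ENDPOINT,
        data=build_soap_envelope(symptoms).encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8",
                 "SOAPAction": '"matchCase"'},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Because the consumer sees only the endpoint and the request contract, the service is free to execute the match on any available Grid node.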
the mini portal against a standard web portal, it has been found that the mini portal can deliver an identical quality of results to the user whilst minimising the graphical content load on the PDA display. An Apache Tomcat servlet container serves as the hosting environment for the mini portal on any Grid node.

The software and communication standards used in the demonstrator systems that have been implemented are based on widely accepted industry standards, thus it is highly feasible for the system to be migrated across various other application areas. Furthermore, the increasing availability of wireless networks [17, 21] and advances in Grid technology provide a strong case for mobile, Grid-enabled decision support with handheld computing devices.

In the following sections, the CBR system, Grid computing and mobile PDA work described thus far are implemented in two different areas of the aero engine diagnostic and maintenance scenarios – pro-active remote condition monitoring and fault investigation.

6.3 Pro-Active Remote Condition Monitoring

Condition monitoring systems are synonymous with fault diagnostics. Whether localised or distributed, they are common and widespread, with more and more organisations becoming dependent on such systems to support critical decision making in order to identify problems quickly, avoid unnecessary risks and reduce cost.

In the past, centralized high-performance ground-based systems have had the ability to support powerful applications, but they are limited in their scope of reach or can prove inaccessible when and where they could be most effective. At the opposite end of the scale, portable or on-board standalone systems can be very useful in remote locations or harsh operating environments, but they are very limited in terms of performance and capacity.

Today, the integration of service-oriented architectures, modern wireless technologies, advanced mobile devices and distributed computing presents a unique solution to overcome these limitations. This integration can provide the necessary high-performance computing capability required to continuously perform condition monitoring whilst transmitting essential information back to key users via a mobile handheld device. Figure 5 depicts a screen capture of this on a PDA that was implemented as part of our work. Here, possible causes of the occurring problems on different aircraft have been successfully identified. Note that any references to actual aircraft or operators have been anonymised.

The prime advantage of this approach is that any occurring problem condition can be made known immediately to the right person, in the right place at the right time, giving that person the opportunity to pro-actively handle the arising problem condition and collaborate on-line with a virtual network of experts to solve the problem. Complementary analysis tools and services on the Grid allow a widespread collection of knowledge, data and tools that were not previously available in this manner to be used on-line for further investigation. It is this model of problem-solving that we define here as pro-active mobile computing for decision support.

The condition monitoring demonstrator system has been implemented using a network of remote CBR service nodes deployed at different geographical locations. This will be described in close detail in Section 7. These remote nodes continuously receive diagnostic information, downloaded from aircraft on-wing systems, from multiple sources in real-time. Any newly received information will be automatically analysed to identify problems.

A unique feature of our distributed architecture is the ability to monitor health information across multiple nodes, at different locations, without actually transmitting that information off-site. In the event of a problem, alert messages can be automatically flagged to notify remote users via suitable communication channels based on the severity of the problem. At this point, the actual exchange of information between stakeholders can then be negotiated via established, trusted channels for further action based on the organisations' requirements and service-level agreements. A clear advantage of this approach is the significant reduction in the network bandwidth required. But more importantly, it further supports crucial security policies and privacy of any commercially sensitive data.

6.4 Pro-Active Fault Investigation

For a more user-driven, interactive fault investigation process, the PDA-based mini browser can be used to transmit specific user inputs to the CBR service in order to match the details of a currently occurring problem against the large CBR knowledge base on the Grid. Figure 6 depicts the match results of such a process on the PDA. For any new or existing case, additional knowledge gained from further analysis such as vibration spectral data, as depicted in figure 7,
8 Security

The dynamic and multi-institutional nature of Grid environments introduces challenging security issues that demand new technical approaches [23]. For instance, the CBR knowledge base and related engine diagnostic data could contain highly relevant information about an engine's design characteristics and operating parameters. This could potentially be misused. For this reason, access to the systems described here is highly restricted to authorised users only. To overcome the problem, the Grid Security Infrastructure (GSI) provided by the Globus Toolkit has been used to enable secure authentication and secure communication over an open network. GSI is composed of security elements that conform to the Generic Security Service API (GSS-API), which is a standard API for security systems promoted by the Internet Engineering Task Force (IETF).

GSI consists of a number of security features that include mutual authentication and single sign-on. This is based on public key encryption, digital certificates and Secure Sockets Layer (SSL) communication. At the core of the security

Figure 8. Full fault case details

Figure 9. Abnormal spike in engine data identified via the Engine Simulation Grid Service
References
[1] Rolls-Royce. “The Jet Engine”, Rolls-Royce Plc, 1986.
[2] I. Foster (Ed.) and C. Kesselman (Ed.), “The Grid 2”, Morgan Kaufmann, 2003.
[3] S.M. Hargrave. “Evaluation of Trent 800 Portable Maintenance Aid Demonstrator”, Rolls-Royce
University Technology Centre, University of Sheffield, Report No. RRUTC/Shef/R/98202, 1998.
[4] S.M. Hargrave. “Review of Performance-Based Diagnostic Tool”, Rolls-Royce University Technology
Centre, University of Sheffield, Report No. RRUTC/Shef/TN/98204, 1998.
[5] Distributed Aircraft Maintenance Environment (DAME) project; www.cs.york.ac.uk/dame, 2003.
[6] T. Jackson, J. Austin, M. Fletcher, and M. Jessop. “Delivering a Grid enabled Distributed Aircraft
Maintenance Environment (DAME)”, Proc UK e-Science All-Hands Meeting, pp. 420-427, 2003.
[7] Business Resource Optimisation for Aftermarket and Design on Engineering Networks (BROADEN)
project; www.shef.ac.uk/acse/research/themes/utc/broaden.html, 2005.
[8] G.F. Tanner and J.A. Crawford. “An Integrated Engine Health Monitoring System for Gas Turbine
Aero-Engines”, Proc. IEE Seminar on Aircraft Airborne Condition Monitoring, pp. 5/1-5/12, 2003.
[9] J. Croft. “Avionics: Beaming Bits and Bytes”, Air Transport World, p. 54, January 2006.
[10] H.A. Thompson, "The Use of Wireless and Internet Communications for Gas Turbine Engine
Monitoring and Control", Proc AIAA/ICAS International Air and Space Symposium and Exposition:
The Next 100 Years, July 2003.
[11] J. Kolodner. “Case-Based Reasoning”, Morgan Kaufmann, 1993.
[12] S.K. Pal, T.S. Dhillon and D.S. Yeung (Eds). “Soft Computing in Case Based Reasoning”, Springer-
Verlag UK, 2000.
[13] I. Foster, C. Kesselman, J. Nick, and S. Tuecke. “Grid Services for Distributed System Integration”,
Computer, 35(6), 2002.
[14] I. Foster and C. Kesselman. “Globus: A Metacomputing Infrastructure Toolkit”, Intl J. Supercomputer
Applications, vol. 11, no. 2, pp. 115-128, 1997.
[15] IEEE 802.11 Working Group, “The IEEE Std 802.11b-1999”, 1999.
[16] IEEE 802.15 Working Group. “The IEEE Std 802.15.1-2002”, 2002.
[17] A. Vollmer. “With Devices Ready to Go, Bluetooth is Poised to Make Its Move”, Electronic Design,
July 24, 2000.
[18] M. Mouly and M. Pautet. “The GSM System for Mobile Communications”, Published by
the authors, 1992.
[19] M. Rahnema. “Overview of the GSM System and Protocol Architecture”, IEEE Communications
Magazine, vol. 31, no. 4, pp. 92-100, April 1993.
[20] R. Kalden, I. Meirick, and M. Meyer, "Wireless Internet Access Based on GPRS," IEEE Personal
Communications Magazine, vol. 7, no. 2, pp. 8-18, April 2000.
[21] U. Varshney. “Recent Advances in Wireless Networking”, IEEE Computer, vol. 33, no. 6, June 2000.
[22] X. Ren, M. Ong, G. Allan, V. Kadirkamanathan, H.A. Thompson and P.J. Fleming. “Service-Oriented
Architecture on the Grid for Integrated Fault Diagnostics”, Journal of Concurrency and Computation:
Practice and Experience, John Wiley and Sons, 2006.
[23] V. Welch, F. Siebenlist, I. Foster, J. Bresnahan, K. Czajkowski, J Gawor, C. Kesselman, S. Meder, L.
Pearlman, and S. Tuecke. “Security for Grid Services”, Proc. 12th IEEE Int’l Symposium on High
Performance Distributed Computing (HPDC-12), pp. 48-57, IEEE Press, 2003.
One could view the job of a sensor as having two parts: to gather information about a target and to report this information to some information system. If these two parts were separated, then the collected information could be sent to any number of information systems, based on the sensor's configuration. This would allow the sensor to provide information to any number of interested parties, from local site-administrators through to VO- or Grid-specific monitoring.

We can define a universal sensor as being a sensor that can send gathered information to many different (ideally, to all necessary) information systems. To be useful, the sensor should be extensible, so support for additional information systems can be added.

Further, if the sensor configuration is separated into different parts (based on the different interested parties) then providing monitoring information for different people via different information systems becomes tractable. Each group interested in monitoring some grid services can specify their interests independent of others, and the universal sensor will capture sufficient information and deliver it according to the different delivery requirements.

The MonAMI project[7] provides a framework for developing a universal sensor. It uses a plugin infrastructure to support different information systems, support which can be extended to include additional information systems as needed.

It is important to state that MonAMI does not aim to be a complete monitoring solution. Data within any information system has value only if it is then further analysed, for example presented as a graph on a webpage, triggering an email, or used for trend-analysis to predict when computing resources need updating.

3.1 The MonAMI daemon

The MonAMI framework currently provides a light-weight monitoring daemon. The advantage of running a daemon process is that many targets can be monitored concurrently. The daemon will automatically aggregate information so that requests for the same monitoring information are consolidated and the impact of monitoring the target is minimised.

This daemon periodically checks the status of monitoring targets and reports their status to one or more reporting targets. This is the polling model of monitoring dataflow. At the time of writing, support for the asynchronous dataflows (alerts and DIG) is being added.

A target (whether monitoring or reporting) is a specific instance of a plugin. The plugin is a generic concept (a MySQL server, for example), whilst the target is specific (e.g. the MySQL server running on localhost). The advantage of this approach is that the monitoring plugins need know nothing about how the data is to be reported. Data from a monitoring target can be sent to any number of reporting targets. This also allows MonAMI to be extended, both in what information is gathered and to what information systems data is to be sent.

The list of available monitoring and reporting plugins is available on the MonAMI project webpage[7] and currently includes monitoring of the local filesystem, running processes, and Apache HTTP[17], MySQL[18] and Tomcat[19] servers. Support exists for reporting plugins, including Nagios[15], Ganglia[16] and MonaLisa[11] in addition to simple file-based logging.

Figure 3 shows a simple configuration for MonAMI with the configuration components shown as functional blocks. In this example, routine monitoring services are reported to two information systems (Nagios[15] and Ganglia[16]). There is also a more intensive test, which is conducted far less frequently to reduce its impact. The results of this test are reported to Nagios (to alert the sysadmin of undesirable results) and recorded in a log file.

3.2 MonAMI and other components

The grid community has several projects with overlapping goals and implementations. The grid monitoring workgroup of the Global Grid Forum (GGF)[8] has produced a specification for grid-level monitoring: GMA[9]. The R-GMA project[10] has produced a realisation of this work.

Separate from R-GMA, the MonaLisa project[11] has a monitoring infrastructure that is widely used, including a number of monitoring targets and end-client applications.

Within WLCG, effort is underway in developing site-level and grid-level functionality tests through the SAM[12] project. This information is aimed towards providing information for site-administrators. For end-users, the Dashboard[13] project aims to provide monitoring information that is easy to navigate.

As previously stated, MonAMI does not aim to be a complete solution, but rather to provide useful data to other existing projects. It aims to be part of these larger systems by providing information on key grid services. As a universal sensor, it aims to report to whichever combination of monitoring information systems is currently in use.

Support for additional systems can be added by writing an additional reporting plugin for MonAMI. The configuration can be updated to include the extra reporting, allowing data to be reported without affecting any existing monitoring.
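The separation described above — monitoring plugins that know nothing about reporting — can be sketched as follows. The class and method names are illustrative, not MonAMI's actual API:

```python
# Sketch of the universal-sensor idea: the daemon polls each monitoring
# target and fans the sample out to every reporting target, so monitoring
# plugins stay ignorant of how data is delivered. Names are illustrative.

class FilesystemMonitor:
    """Monitoring plugin: gathers information about one target."""
    def read(self):
        return {"target": "local-filesystem", "free_mb": 1024}


class Reporter:
    """Reporting plugin: delivers samples to one information system."""
    def __init__(self, name):
        self.name, self.delivered = name, []

    def write(self, sample):
        self.delivered.append(sample)


class UniversalSensor:
    def __init__(self, monitors, reporters):
        self.monitors, self.reporters = monitors, reporters

    def poll_once(self):
        # Polling model: one pass over the monitoring targets, with each
        # sample sent to every reporting target (e.g. Nagios and Ganglia).
        for monitor in self.monitors:
            sample = monitor.read()
            for reporter in self.reporters:
                reporter.write(sample)


nagios, ganglia = Reporter("nagios"), Reporter("ganglia")
sensor = UniversalSensor([FilesystemMonitor()], [nagios, ganglia])
sensor.poll_once()
```

In this shape, supporting another information system means adding one more reporter; no monitoring code changes, which is the extensibility property the text describes.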
distributions, it is now common to include support for (and use by default) the ELF file format for executable content. ELF includes support for dynamically loading data. The core part of MonAMI (MonAMI-core) uses the OS-provided linker support for loading shared objects to load plugins.

It is important to note that MonAMI-core has no internal list of what plugins are available. Instead, it scans for available plugins at run time and loads those that are available. This allows additional plugins to be included independently of upgrading the MonAMI distribution.

Each plugin contains a brief overview of the plugin, containing the plugin name, a brief description and what operations the plugin supports.

A plugin can be mentioned within the configuration multiple times, once for each service the plugin is to monitor, or information system to which the plugin is to report. Each mention within the configuration file creates a new target based on the named plugin (as described in 3.1).

The synchronous API requires a plugin to provide two functions (monami_open and monami_close) and at least one of a further two functions (monami_read and monami_write), depending on whether the plugin is providing support for monitoring or reporting activity respectively.

that provide three of the above four functions and use a provided macro to create the plugin description.

Further details on how to write plugins are available within the Developer's guide, which documents MonAMI and describes the pedagogical code included in the source distribution.

4 Conclusions

MonAMI aims to provide a framework for developing a universal sensor: a sensor that can report to many different information systems. It allows the monitoring of multiple targets, with information about each target being sent to any number of information systems. MonAMI uses a plugin system so additional monitoring can be supported by adding extra plugins.

For each server, the list of what is to be monitored is separable. This allows the addition of extra monitoring requirements without affecting the existing monitoring.

MonAMI does not aim to be a complete monitoring solution, but rather a building block: a low overhead, extensible method of providing multiple groups of people with the monitoring information they need.
Experience gained from the TeraGyroid and SPICE projects has shown that co-allocation and fault tolerance are important requirements for Grid computing. Co-allocation is necessary for distributed applications that require multiple resources that must be reserved before use. Fault tolerance is important because a computational Grid will always have faulty components and some of those faults may be Byzantine. We present HARC, Highly-Available Robust Co-allocator, an approach to building fault tolerant co-allocation services. HARC is designed according to the REST architectural style and uses the Paxos and Paxos Commit algorithms to provide fault tolerance. HARC/1, an implementation of HARC using HTTP and XML, has been demonstrated at SuperComputing 2005 and iGrid 2005.
portance to Grid computing means that it must be a reliable service.

III. FAULT TOLERANCE AND GRID COMPUTING

Whatever definition of Grid computing is used we are led to two inescapable conclusions: a computational Grid will always have faulty components and some of those faults will be Byzantine.

Computational Grids are internet scale distributed systems, implying large numbers of components and wide area networks. At this scale there will always be faulty components.

The case that a computational Grid will always have faulty components is illustrated by the fact that on average over ten percent of the daily operational monitoring tests, GITS [4], run on the UK National Grid Service, UK NGS [38], report failures. Since the start of the UK NGS a number of the core nodes have been unavailable for days at a time due to planned and unplanned maintenance.

Crossing administrative boundaries raises issues of trust: Can a user trust that a resource provider has configured and administrates the resource properly? Can a resource provider trust that a user will use the resource correctly? Distributed transactions are rarely used between organizations because of a lack of trust; one organization will not allow another or-

requests to a node. Unfortunately the DNS entries were mis-configured and reported the wrong hostname for one of the nodes, causing job submissions to fail randomly. Local testing at the site did not reveal the problem.

The third case involved a resource broker that assigned jobs to computational resources. Occasionally a computational resource would fail in a Byzantine way: it would accept jobs from the resource broker, fail to execute the job but report to the resource broker that the job had completed successfully. The resource broker would continue assigning jobs to the faulty resource until it was drained of jobs.

The examples illustrate the importance of end-to-end arguments in system design: error recovery at the application level is absolutely necessary for a reliable system, and any other error detection or recovery is not logically necessary but is strictly for performance [35]. In each case making the individual components more reliable would not have prevented the problem. Retrofitting reliability to an existing design is very difficult [30].

For co-allocation a very real example of Byzantine fault behaviour occurs when a resource provider accepts a reservation for a resource but at the scheduled time the user, and possibly the resource provider, discover the resource is unavailable.

IV. REST
intermediaries to cache the representation. The representation may contain links to other resources.

Fig. 1. State transitions for the RM in the two phase approach.

Solutions [3, 9, 24, 34, 39] for co-allocation are based on variations of these approaches.

A. One Phase Approach

In this approach a successful reservation is made in a single step. The user sends a booking request to each of the Resource Managers (RM) requesting a reservation. The RMs either accept or reject the booking. If one or more of the RMs rejects the booking the user must cancel any other bookings that have been made. This approach has the advantage of being simple but a potential drawback is that the user may be charged each time he cancels a reservation.

A further drawback is related to the question of trust: the RMs must trust the user to cancel any unwanted reservations. A partial solution is to add an intermediary service that is trusted by the user and RM to correctly make and cancel reservations as necessary. The user sends the set of bookings he requires to the intermediary, which handles all the interactions with the RMs, cancelling all the reservations if the proposed schedule is unacceptable to any of the RMs. The intermediary is still a single point of failure unless it can be replicated.

B. Two Phase Approach
Fig. 1 shows the state transitions for the RM in the two phase approach.

Unfortunately the two phase approach can block. If the TM fails after the first phase but before starting the second phase, before sending the Commit or Abort message, the RMs are left in the Booking Held state until the TM recovers. While in the Booking Held state the RMs may be unable to accept reservations for the scheduled time slot and the resource scheduler may be unable to run the optimal work load on the resource.

The blocking nature of the two phase approach may be acceptable for certain scenarios, depending on how long the TM is unavailable and providing it recovers correctly. To overcome the blocking nature of the two phase approach requires a three phase protocol [36] such as Paxos.

Another potential problem with the two phase approach is that RMs must support the Booking Held state. A solution is for RMs that do not support the Booking Held state to move straight to the Booking Committed state if the booking request is acceptable in the first phase. For the second phase they can ignore a Commit message and treat an Abort message as a cancellation of the reservation, for which the user may be charged. This is a relaxation of the consistency property of two phase commit.

It is also possible to relax the isolation property of two phase commit. If a RM receives a booking request while in the Booking Held state it can reject the request, advising the client that it is holding an uncommitted booking which may be released in the future. This prevents deadlock [19], when two users request the same resources at the same time.

The role of the TM can be played by the user but should be carried out by a trusted intermediary service.

VI. CO-ALLOCATION AND CONSENSUS

tectors or partial synchrony. Consensus is an important problem because it can be used to build replicas: start a set of deterministic state machines in the same state and use consensus to make them agree on the order of messages to process [25, 31].

Paxos is a well known fault tolerant consensus algorithm. We present only a description of its properties and refer the reader to the literature [28, 29, 31] for a full description of the algorithm. In Paxos a leader process tries to guide a set of acceptor processes to agree a value. Paxos will reach consensus even if messages are lost, delayed or duplicated. It can tolerate multiple simultaneous leaders, and any of the processes, leader or acceptor, can fail and recover multiple times. Consensus is reached if there is a single leader for a long enough time during which the leader can talk to a majority of the acceptor processes twice. It may not terminate if there are always too many leaders. There is also a Byzantine version of Paxos [8] to handle the case where acceptors may have Byzantine faults.

Paxos Commit [18] is the Paxos algorithm applied to the distributed transaction commit problem. Effectively the transaction manager of two phase commit is replaced by a group of acceptors; if a majority of acceptors are available for long enough the transaction will complete. Gray and Lamport [18] showed that Paxos Commit is efficient and has the same message delay as two phase commit for the fault free case. They also showed that two phase commit is Paxos Commit with only one acceptor.

Though it may be possible to apply Paxos directly to the co-allocation problem by having the RMs act as acceptors, we chose to use Paxos Commit instead. The advantage Paxos Commit has over using Paxos directly for co-allocation is that the role played by the RMs is no more complex than in the two phase approach. Limiting the role of the RM makes it more acceptable to resource providers and reduces the possibilities for faulty behaviour from the RM.
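Since the RM's role under Paxos Commit is no more complex than in the two phase approach, it can be sketched as the small state machine of Fig. 1. The state names follow the text; the class itself is an illustrative sketch, not HARC's implementation:

```python
# Sketch of an RM's states in the two phase approach (cf. Fig. 1).
# State names follow the text; the class is illustrative, not HARC's code.

class ResourceManager:
    def __init__(self):
        self.state = "Initial"

    def request_booking(self, acceptable):
        # Phase one: hold the booking if the request is acceptable,
        # otherwise reject it outright.
        if self.state == "Initial":
            self.state = "Booking Held" if acceptable else "Rejected"
        return self.state

    def decide(self, commit):
        # Phase two: the TM's Commit or Abort resolves a held booking.
        # If the TM fails before sending its decision, the RM blocks
        # in Booking Held -- the weakness Paxos Commit addresses.
        if self.state == "Booking Held":
            self.state = "Booking Committed" if commit else "Aborted"
        return self.state


rm = ResourceManager()
held = rm.request_booking(acceptable=True)
final = rm.decide(commit=True)
```

Under Paxos Commit the TM's single decision is simply replaced by the decision of a majority of acceptors; the RM's transitions are unchanged.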
timal schedule; how to manage a set of reservations once they have been made; how to negotiate quality of service with the resource provider or how to support compensation mechanisms when a resource fails to fulfil a reservation or when a user cancels a reservation. HARC is designed so that other services and protocols can be combined with it to solve these problems.

A. Choosing a Schedule

The RM advertises the schedule when a resource may be available through a URI. The user retrieves the schedule using a HTTP GET. The information retrieved is for guidance only; the RM is under no obligation by advertising it. It is the RM's prerogative as to how much information it makes available; it may choose not to advertise a schedule at all. The RM may supply caching information along with the schedule using the cache support facilities of HTTP. The caching information can indicate when the schedule was last modified or how long the schedule is good for. The RM may also support conditional GET [11] so that clients do not have to retrieve and parse the schedule if it hasn't changed since the last retrieval. From the schedules for all the resources the user chooses a suitable time when the resources he requires might be free.

A co-scheduling service for HARC could use HTTP conditional GET to maintain a cache of resource schedules from which to create co-schedules for the user. The co-scheduling service would have to store only soft state, making it easy to replicate.

B. Submitting the Booking Request

The user constructs a booking request which contains a sub-request for each resource that he wants to reserve. The booking request also contains an identifier chosen by the user, the UID. The combination of the identity of the user and the UID should be globally unique. The user sends the booking request to an acceptor using a HTTP POST. This acceptor will act as the leader for the whole booking process unless it fails, in which case another acceptor will take over.

The leader picks a transaction identifier, the TID, for the booking request from a set of TIDs that it has been initially assigned. Each acceptor has a different range of TIDs to choose from to prevent two acceptors trying to use the same TID. The acceptor effectively replicates the message to the other acceptors by having Paxos agree the TID for the message. This instance of Paxos agrees a TID for the combina-

If the user resends the message to the acceptor or to any other acceptor he will receive the same TID. The user should continue submitting the request to any of the acceptors until he receives a TID; the submission of the booking request is idempotent across all the acceptors.

If a user reuses a UID he will get the TID of the previous booking request associated with that UID. To avoid reusing a UID the user can record all the UIDs he has previously used; however, a more practical approach is to randomly choose a UID from a large namespace. Two users can use the same UID as it is the combination of the user's identity and the UID that is relevant.

The TID is a RequestURI [11]. The user can use a HTTP GET with the TID to retrieve the outcome of the booking request from any of the acceptors. The TID is returned to the user using the HTTP Location header in the response to the POST containing the booking request. The HTTP response code is 201, indicating that a new resource has been created. The acceptor should return a 303 HTTP response code if a TID has already been chosen for the request, to allow the user to detect the case when a UID may have been reused.

C. Making the Reservations

After the TID has been chosen the leader uses Paxos Commit to make the bookings with the RMs. The booking request is broken down into the sub-requests and the sub-requests are sent to the appropriate RM using HTTP POST. The TID is also sent to the RM as the Referer HTTP header in the POST message.

The RM responds with a URI that will represent the booking local to the RM and whether the booking is being held or has been rejected. It also broadcasts this message to all the other acceptors.

As in the Paxos Commit algorithm the acceptors decide whether the schedule chosen by the user is acceptable to all the RMs or has been rejected by any of the RMs. The leader informs each RM whether to commit or abort the booking using a HTTP PUT on the URI provided by the RM.

It is the RM's obligation to discover the outcome of the booking. If the RM does not receive the commit or abort message it should use a HTTP GET with the TID on any of the acceptors to discover the outcome of the booking. Once the acceptors have made a decision it can be cached by intermediaries. A HTTP GET on the TID returns the outcome of the booking request and a set of links to the reservations local to ea
h RM. Detailed information on a
tion of the user's identity and the user
hosen UID. reservation at a parti
ular RM
an be retrieved by
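Two pieces of the protocol above lend themselves to a short sketch: the disjoint TID ranges that stop two acceptors choosing the same TID, and the all-or-nothing rule the acceptors apply to the RMs' responses under Paxos Commit. The function names and data shapes below are illustrative only (HARC/1 itself is implemented in Java), a minimal model rather than the real API.

```python
# Sketch of two rules described above. Names and structures are
# illustrative, not the HARC/1 implementation.

def choose_tid(acceptor_index: int, range_size: int, used: set) -> int:
    """Pick an unused TID from this acceptor's disjoint range.

    Giving each acceptor its own range prevents two acceptors
    trying to use the same TID for different requests.
    """
    start = acceptor_index * range_size
    for tid in range(start, start + range_size):
        if tid not in used:
            used.add(tid)
            return tid
    raise RuntimeError("TID range exhausted")

def decide(rm_responses: dict) -> str:
    """Paxos Commit outcome: commit only if every RM holds its booking;
    a single rejection aborts the whole co-allocation."""
    if all(r == "held" for r in rm_responses.values()):
        return "commit"
    return "abort"
```

For example, a request held by both RMs commits, while any rejection aborts the entire booking, which is exactly the atomicity the user relies on for co-allocation.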
The user can cancel the reservation through the URI provided by the RM, and the RM can advertise that it has cancelled the reservation using the same URI. The user should poll the URI to monitor the status of the reservation; HTTP HEAD or conditional GET can be used to optimize the polling.

There is a question of what happens if a RM decides to unilaterally abort from the Booking Held state, which is not allowed in two-phase commit or Paxos Commit. Since HARC terminates when it decides whether a schedule should be committed or aborted, we can say that the RM cancelled the reservation after HARC terminated. This is possible because there is only a partial ordering of events in a distributed system [25]. The user can only find out that the RM cancelled the reservation after HARC terminated. The onus is on the user to monitor the reservations once they have been made and to deal with any eventualities that may arise.

D. HARC Actions

HARC supports a set of actions for making and manipulating reservations: Make, Modify, Move and Cancel. Make is used to create a new reservation, Modify to change an existing reservation (e.g. to change the number of CPUs requested), Move to change the time of a reservation and Cancel to cancel a reservation. A booking request can contain a mixture of actions and HARC will decide whether to commit or abort all of the actions. The HARC actions provide the user with flexibility for dealing with the case of a RM cancelling a reservation; for example, he could make a new reservation on another resource at a different time and move the existing reservations to the new time.

A third party service could monitor reservations on the behalf of the user and deal with any eventualities that arise using the HARC actions, in accordance with some policy provided by the user.

E. Security

The complete booking request sent to the acceptor is digitally signed [5] by the user with a X.509 [1] certificate. The acceptors use the signature to discover the identity of the user. Individual sub-requests are also digitally signed by the user so that the RMs can verify that the sub-request came from an authorized user. The acceptors also digitally sign the outcome of the booking request. A sub-request can also be digitally encrypted [21] by the user so that the acceptors cannot read it.

A HTTP GET using the TID returns only the outcome of a booking request and a set of links to the individual bookings, so there is no requirement for acceptors to provide access control to this information. The individual RMs control access to any detailed information on reservations they hold.

VIII. HARC AVAILABILITY

HARC has the same fault tolerance properties as Paxos Commit, which means it will progress if a majority of acceptors are available. If a majority of acceptors are not available HARC will block until a majority is restored; since blocking is unacceptable, we define this as HARC failing. To calculate the MTTF for HARC we use the notation and definitions from [19].

The probability that an acceptor fails is 1/MTTF, where MTTF is the Mean Time To Failure of the acceptor. The probability that an acceptor is in a failed state is approximately MTTR/MTTF, where MTTR is the Mean Time To Repair of the acceptor.

Given 2F + 1 acceptors, the probability that HARC blocks is the probability that F acceptors are in the failed state and another acceptor fails. The probability that F out of the 2F + 1 acceptors are in the failed state is:

    \frac{(2F+1)!}{F!\,(F+1)!}\left(\frac{MTTR}{MTTF}\right)^{F}    (1)

The probability that one of the remaining F + 1 acceptors fails is:

    (F+1)\,\frac{1}{MTTF}    (2)

The probability that HARC will block is the product of (1) and (2):

    \frac{(2F+1)!}{(F!)^{2}}\,\frac{MTTR^{F}}{MTTF^{F+1}}    (3)

The MTTF for HARC is the reciprocal of (3):

    MTTF_{HARC} \approx \frac{(F!)^{2}\,MTTF^{F+1}}{(2F+1)!\,MTTR^{F}}    (4)

Using 5 acceptors, F = 2, each with a MTTF of 120 days and a MTTR of 1 day, MTTF_HARC is approximately 57,600 days, or 160 years.
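Equation (4) can be checked numerically. The sketch below is ours (the function name is made up), and simply evaluates the closed form:

```python
# Numerical check of equation (4):
#   MTTF_HARC ~= (F!)^2 * MTTF^(F+1) / ((2F+1)! * MTTR^F)
from math import factorial

def harc_mttf(f: int, mttf_days: float, mttr_days: float) -> float:
    """Approximate MTTF of a HARC deployment with 2F+1 acceptors,
    each acceptor having the given MTTF and MTTR (in days)."""
    return (factorial(f) ** 2 * mttf_days ** (f + 1)) / (
        factorial(2 * f + 1) * mttr_days ** f)
```

With F = 2 (five acceptors), MTTF = 120 days and MTTR = 1 day, the function reproduces the figure quoted above: (2!)^2 x 120^3 / (5! x 1^2) = 57,600 days.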
IX. BYZANTINE FAULTS AND HARC
Byzantine faults can happen in a Grid environment. We believe that acceptors can be implemented as failstop [19] services that are provided as part of a Virtual Organization's role of creating trust between organizations. Provision of trusted services is one way of realizing the Virtual Organization concept. If Byzantine fault tolerance is necessary then Byzantine Paxos [8] could be used in HARC.

If a HARC acceptor does have a Byzantine fault it cannot make a reservation without a signed request from a user. HARC has been designed to deal with Byzantine fault behaviour from users and RMs.

No co-allocation protocol can guarantee that the reserved resources will actually be available at the scheduled time. HARC is a fault tolerant protocol for making a set of reservations; it does not attempt to make any guarantees once the reservations have been made. The user may be able to deal with the situation where a resource is unavailable at the scheduled time by booking extra resources that can act as backup. This is an application-specific solution and again illustrates the importance of end-to-end arguments [35]. HARC provides a fault tolerant service to the user, but fault tolerance is still necessary at the application level.

X. HARC/1

HARC/1, an implementation of HARC, has been successfully demonstrated at SuperComputing 2005 and iGrid 2005 [20], where it was used to co-schedule ten compute and two network resources. HARC/1 is implemented using Java, with sample RMs implemented in Perl. HARC/1 is available at http://www.cct.lsu.edu/personal/maclaren/CoSched.

XI. CONCLUSION

Building important services out of replicated commodity components can mean that the reliability of the overall system improves. Just as Beowulf systems built out of commodity components have displaced expensive supercomputers, so clusters of PCs are displacing expensive mainframe-type systems [15]. HARC demonstrates how important services can be made fault tolerant to create a fault tolerant Grid.

HARC has been demonstrated to be a secure, fault tolerant approach to building co-allocation services. It has also been shown that the user and RM roles in HARC are simple. The state transitions for the RM in HARC are the same as for the two-phase approach illustrated in Fig. 1. The only extra requirement for the RM over the two-phase approach is that it must broadcast a copy of its Booking Held or Abort message to all acceptors.

The use of HTTP contributes to the simplicity of HARC. HTTP is a well understood application protocol with strong library support in many programming languages. HARC demonstrates, through its use of HTTP and URIs, the effectiveness of REST as an approach to developing Grid services.

HARC's functionality can be extended by adding other services, for example to support co-scheduling and reservation monitoring. Extending functionality by adding services, rather than modifying existing services, indicates good design and a scalable system in accordance with REST.

The opaqueness of the sub-requests to the acceptors means that HARC has the potential to be used for something other than co-allocation. For example, a HARC implementation could be used as an implementation of Paxos Commit to support distributed transactions.

XII. ACKNOWLEDGEMENTS
REFERENCES

[1] C. Adams and S. Farrell. Internet X.509 Public Key Infrastructure Certificate Management Protocols. IETF RFC 2510, 1999.
[2] R. Allan et al. Building the e-Science Grid in the UK: Grid Information Services. Proceedings of UK e-Science All Hands Meeting, 2003.
[3] A. Andrieux et al. Web Services Agreement. Draft GGF Recommendation, September 2005.
[4] D. Baker and M. McKeown. Building the e-Science Grid in the UK: Providing a software toolkit to enable operational monitoring and Grid integration. Proceedings of UK e-Science All Hands Meeting, 2003.
[5] M. Bartel et al. XML-Signature Syntax and Processing. W3C Recommendation, February 2002.
[6] T. Berners-Lee, R. Fielding and L. Masinter. Uniform Resource Identifier (URI): Generic Syntax. IETF RFC 3986, 2005.
[7] R. Blake et al. The teragyroid experiment, supercomputing 2003. Scientific Computing, 13(1):1-17, 2005.
[8] M. Castro and B. Liskov. Practical Byzantine fault tolerance. Proceedings of 3rd OSDI, New Orleans, 1999.
[9] K. Czajkowski et al. A protocol for negotiating service level agreements and coordinating resource management on distributed systems. Proceedings of 8th International Workshop on Job Scheduling Strategies for Parallel Processing, eds D. Feitelson, L. Rudolph and U. Schwiegelshohn, Lecture Notes in Computer Science, 2537, Springer Verlag, 2002.
[10] K. Czajkowski et al. Grid Information Services for Distributed Resource Sharing. Proceedings of the Tenth IEEE International Symposium on High-Performance Distributed Computing (HPDC-10), IEEE Press, 2001.
[11] R. Fielding et al. Hypertext Transfer Protocol HTTP/1.1. IETF RFC 2616, 1999.
[12] R. Fielding. Architectural Styles and the Design of Network-based Software Architectures. PhD thesis, University of California, Irvine, 2000.
[13] M. Fischer, N. Lynch, and M. Paterson. Impossibility of distributed consensus with one faulty process. Journal of the ACM, 32(2), 1985.
[14] I. Foster, C. Kesselman and S. Tuecke. The Anatomy of the Grid: Enabling Scalable Virtual Organizations. International J. Supercomputer Applications, 15(3), 2001.
[15] S. Ghemawat, H. Gobioff, and S. Leung. The Google Filesystem. 19th ACM Symposium on Operating Systems Principles, Lake George, NY, October 2003.
[16] M. Gudgin et al. SOAP Version 1.2 Part 1: Messaging Framework. W3C Recommendation, June 2003.
[17] J. Gray. Notes on data base operating systems. In R. Bayer, R. Graham, and G. Seegmuller, editors, Operating Systems: An Advanced Course, volume 60 of Lecture Notes in Computer Science. Springer-Verlag, Berlin, Heidelberg, New York, 1978.
[18] J. Gray and L. Lamport. Consensus on Transaction Commit. Microsoft Research Technical Report MSR-TR-2003-96, 2005.
[19] J. Gray and A. Reuter. Transaction Processing: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1992.
[20] A. Hutanu et al. Distributed and collaborative visualization of large data sets using high-speed networks. Submitted to proceedings of iGrid 2005, Future Generation Computer Systems. The International Journal of Grid Computing: Theory, Methods and Applications, 2005.
[21] T. Imamura, B. Dillaway and E. Simon. XML Encryption Syntax and Processing. W3C Recommendation, December 2002.
[22] I. Jacobs and N. Walsh. Architecture of the World Wide Web, Volume One. W3C Recommendation, December 2004.
[23] S. Jha et al. Spice: Simulated pore interactive computing environment: using grid computing to understand dna translocation across protein nanopores embedded in lipid membranes. Proceedings of the UK e-Science All Hands Meetings, 2005.
[24] D. Kuo and M. McKeown. Advance reservation and co-allocation for Grid Computing. In First International Conference on e-Science and Grid Computing, volume e-science, IEEE Computer Society Press, 2005.
[25] L. Lamport. Time, Clocks and the Ordering of Events in a Distributed System. Communications of the ACM 21, 7, 1978.
[26] L. Lamport, M. Pease and R. Shostak. Reaching Agreement in the Presence of Faults. Journal of the Association for Computing Machinery 27, 2, 1980.
[27] L. Lamport, M. Pease and R. Shostak. The Byzantine Generals Problem. ACM Transactions on Programming Languages and Systems 4, 3, 382-401, 1982.
[28] L. Lamport. The Part-Time Parliament. ACM Transactions on Computer Systems 16, 2, 133-169, 1998.
[29] L. Lamport. Paxos Made Simple. ACM SIGACT News (Distributed Computing Column) 32, 4 (Whole Number 121, December 2001) 18-25, 2001.
[30] B. Lampson. Hints for computer system design. ACM Operating Systems Rev. 17, 5, pp 33-48, 1983.
[31] B. Lampson. How to build a highly available system using consensus. In Distributed Algorithms, ed. Babaoglu and Marzullo, Lecture Notes in Computer Science 1151, Springer, 1996.
[32] E. Rescorla. HTTP Over TLS. IETF RFC 2818, 2000.
[33] L. Roberts. Resource Sharing Computer Networks. IEEE International Conference, New York City, 1968.
[34] A. Roy. End-to-End Quality of Service for High-end Applications. PhD thesis, University of Chicago, Illinois, 2001.
[35] J. Saltzer, D. Reed and D. Clark. End-to-end arguments in system design. Proceedings of the 2nd International Conference on Distributed Computing Systems, Paris, 1981.
[36] D. Skeen. Nonblocking commit protocols. In SIGMOD '81: Proceedings of the 1981 ACM SIGMOD International Conference on Management of Data. ACM Press, 1981.
[37] TeraGrid. http://www.teragrid.org/.
[38] UK National Grid Service. http://www.ngs.ac.uk/.
[39] K. Yoshimoto, P. Kovatch and P. Andrews. Co-scheduling with user-settable reservations. In Job Scheduling Strategies for Parallel Processing, eds E. Frachtenberg, L. Rudolph and U. Schwiegelshohn, Lecture Notes in Computer Science, 3834, Springer, 2005.
common persistent storage, and access to the JMS service providers for the Job queue, the Job topic and the Status topic. The server hosting this service is also called the submission server.

A JobEntity instance has the following main variables: unique ID, execution status, workflow definition, workflow status information, last status update time, start date, end date, user information.

Execution servers host a pool of message-driven ExecutionHandler beans which, on receiving a message from the Job queue, subscribe to messages added to the Job topic. The pool size represents the maximum number of tasks that the execution server allows to be processed concurrently. The container hosting the execution server also has access to the same persistent storage and messaging service providers as the submission server and so can host instances of Job entities.

4.2 Submission

The following sequence of events happens when a workflow is submitted for execution (see Figure 3):

[...]

4. That execution is picked up by one of the ExecutionHandlers of any execution server, following the allocation policy of the JMS provider.

5. The ExecutionHandler subscribes to the Job topic, selecting only to be notified of messages related to the ID of the JobEntity it must handle.

6. The ExecutionHandler then instantiates the JobEntity object and starts its execution.

4.3 Control

The following sequence of events happens when the user sends a control command to the execution handler, such as pause, resume, stop or kill (see Figure 4):

1. The TaskManagement service receives the request for a control command to a given JobEntity ID.

2. If the request to execute that JobEntity is not in the Job queue and its state is running, then it posts the control request to the Job topic.

3. The listening ExecutionHandler receives the notification and performs the control action on the JobEntity, which will accordingly modify its execution status.

[...]

1. The ExecutionHandler for a running JobEntity regularly requests the latest workflow status information from the running workflow.
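The selector-based subscription of step 5 can be sketched with a minimal in-memory topic: each handler subscribes to the Job topic but is only notified of messages whose job ID matches the JobEntity it handles. The class and method names below are illustrative stand-ins, not the JMS API (which expresses selectors as SQL-like strings).

```python
# Minimal in-memory model of a publish/subscribe topic with per-subscriber
# selectors, as used by the ExecutionHandlers in step 5.
# Names are illustrative; this is not the JMS API.

class Topic:
    def __init__(self):
        self._subscribers = []  # list of (selector, callback) pairs

    def subscribe(self, selector, callback):
        """Register a callback invoked only for messages matching selector."""
        self._subscribers.append((selector, callback))

    def publish(self, message):
        """Deliver the message to every subscriber whose selector accepts it."""
        for selector, callback in self._subscribers:
            if selector(message):
                callback(message)

received = []
job_topic = Topic()
# The handler for JobEntity "42" filters on the message's job ID.
job_topic.subscribe(lambda m: m["job_id"] == "42", received.append)

job_topic.publish({"job_id": "42", "command": "pause"})
job_topic.publish({"job_id": "7", "command": "stop"})  # ignored by this handler
```

The selector keeps control messages for other jobs from ever reaching a handler, which is what lets many handlers share one Job topic.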
Figure 2: Architecture
2. If that status has changed since the last status update, the state of the JobEntity is updated and the new execution status and workflow status information is submitted to the Status topic.

3. While the client tool is started, it will be subscribing to publications on the Status topic for the tasks that have been submitted by the current user, and will therefore receive the notification and associated status information.

The policy followed by the ExecutionHandler to schedule the requests for the latest workflow status information is based on a base period and a maximum period. The status is requested according to the base period, except if the status has not changed, in which case the update period is doubled, up to the maximum period.

4.5 Failure detection

One problem with the de-coupled architecture presented is that there is no immediate notification that an execution server has failed, as it only communicates through the messaging service, which does not by default provide information to the application about its subscribers. The approach used to detect failures of the execution server is to check regularly, on the server hosting the TaskManagement service, the last status update time of all the running JobEntity instances. If that update time is significantly above the maximum update period, then that job is stopped, killed if necessary, and restarted.

4.6 Security

In order to make sure that workflows run in the correct context and have the correct associated roles and authorization, the ExecutionHandler needs to impersonate the user who submitted the workflow. This is implemented by a specific JAAS module that enables that impersonation to happen for a particular security policy used by the ExecutionHandler to log in.

4.7 Scheduling Policy

The scheduling policy is defined by the way the JMS service provider decides which ExecutionHandler should receive requests added to the Job queue. Several policies are provided by default with the JMS provider that was used; in particular, we used a simple round-robin policy at first.

In order to integrate at that stage with a resource management tool such as the Grid Engine [11], the scheduling policy was extended. To find out which subscriber to use, we assumed that the set of execution servers was the same as the set of resources managed by the grid engine, submitted a request to execute the command hostname, and used the information returned to choose the relevant subscriber.

Although these policies can be sufficient in general, for our application to workflow execution another policy was created, to make sure that we use the resource that holds any intermediate results that have already been processed in the workflow, in order to optimise its execution. In this case, the policy has to check the workflow description associated with the request to find out any intermediate results associated with any activity, and then decide to use the corresponding server if possible.

5 Deployment

The persistence service provider used is HSQL[5]. It provides support for the persistence of basic types as well as Java objects. The messaging service is JBossMQ[7]. Messages are stored if needed in the same HSQL database instance used for the persistence of container-managed entity objects.

5.1 Campus Grid scheduler

The first deployment of the scheduler uses protocols that are only suitable over an open network without communication restrictions or network address translation (NAT) in some parts. This is usually the case for deployment inside an organisation. The RMI protocol can be used for simplicity and efficiency, and direct connection and notification of the client tool can be performed without the risk of network configuration issues. This setup is described in Figure 6.

5.2 Scheduling over WAN

The second deployment of the scheduler uses HTTP tunnelling for method calls and an HTTP-based polling mechanism for queue and topic subscribers, as shown in Figure 7. The main advantage is that the client does not have to have an IP address accessible by the messaging service, as there is no direct call-back. This means that although the client tool, in our application, performs rich interactions with the workflow, it does not have to be on the same network as the task management server.

The execution servers also do not need to have a public IP, which makes it theoretically possible, given the right software delivery mechanism for the execution server, to use the scheduler in configurations such as those supported by the SETI@Home[2] scheduler.

Other configurations need to be modified to support such deployment. In particular, the time-out and retry values for connections to the persistence manager and the messaging service need to be increased, as network delays or even network failures are more likely.
We have implemented and tested the scheduler using a variety of Discovery Net bioinformatics and cheminformatics application workflows. A complete empirical evaluation of the scheduling is beyond the scope of this paper, since it takes into account the characteristics of the application workflows themselves. However, in this section we provide a brief overview of the experimental setting used to test and evaluate the scheduler implementation.

The scheduler was deployed for testing over an ad-hoc and heterogeneous set of machines. The submission server was hosted on a Linux server where the persistence and messaging services were also held. The execution servers were running on a set of 15 Windows single-processor desktop machines on the same organisation's network without restrictions. The scheduler sustained overnight constant submission and execution of workflows, each execution server handling a maximum of 3 concurrent executions, without apparent bottlenecks.

The Grid Application Toolkit [9] has wrappers for Java called JavaGAT. This interface is a wrapper over the main native GAT engine, which itself aims to provide a consistent interface layer above several Grid infrastructures such as Globus, Condor and Unicore. Again, the difference from our approach lies in the submission of native process executions over the grid, instead of handling that process at a higher level and leaving the scheduling, potentially, to a natively implemented Grid or resource management service.

Proactive [6] takes a Java-oriented approach by providing a library for parallel, distributed and concurrent computing interoperating with several Grid standards. While it uses a similar approach to ours, we based our engineering around Java commodity messaging and persistence services, so that the robustness of the system and the network protocols it uses depend mainly on these services rather than on our implementation.
References

[1] S. AlSairafi, F.-S. Emmanouil, M. Ghanem, N. Giannadakis, Y. Guo, D. Kalaitzopoulos, M. Osmond, A. Rowe, J. Syed, and P. Wendel. The design of discovery net: Towards open grid services for knowledge discovery. International Journal of High Performance Computing Applications, 17, Aug. 2003.

[7] JBossMQ. http://www.jboss.com/products/messaging.
Abstract
We describe the GridSite toolbar, an extension for the Mozilla Firefox web browser, and its
interaction with the GridSite delegation web service. A method for automatic discovery of the
delegation service is also introduced. The combination of the toolbar and automatic discovery
allows users to delegate credentials to a remote server in a simple and intuitive way from within the
web browser. Also discussed is the toolbar's ability to enable the use of the GridHTTP protocol,
also a part of the GridSite framework, in a similar way.
“Upgrade” header in the request, “GridHTTP/1.0”. This is entirely in keeping with the HTTP standard, so any server which does not understand the GridHTTP protocol, or does not have it enabled, will ignore the header. If GridHTTP is enabled, the server will determine if the client should have read access to the requested files, based on their X.509, GSI proxy or VOMS proxy certificate and any defined access control lists. If access is allowed, the server will respond with a redirection to a standard HTTP location for the file and with a single-use passcode, contained in a standard HTTP cookie. The client should then present this passcode cookie when requesting the file from the HTTP location and will be granted access to download the file.

The performance benefits of the GridHTTP protocol come not only from the data stream being unencrypted (saving CPU cycles at the server and client ends) but also from the highly optimised Apache file serving routines. In comparison, while GridFTP9 allows unencrypted data transfer, it does not make use of low-level system calls as in the case of GridHTTP/Apache. The performance difference between the two transfer methods (for unencrypted data streams) is evident in research carried out by R. Hughes-Jones10.

3. The GridSite Delegation Service

The GridSite delegation service is a web service that allows users to place proxy certificates on remote servers. The WSDL-based service uses standard plain-text SOAP messages, with the authentication coming from an HTTPS connection from the client. This makes it very easy to implement the service in any of a variety of programming languages, especially when using GridSite as the web services platform.

3.1 Delegation Service Internals

The delegation procedure is initiated by a client sending a getProxyReq message to the service. The service then produces a public/private key pair and generates a certificate request from the public key. The certificate request is then sent to the client. The client then signs the request with their private key and sends it back to the server in a putProxy message. The signed request combined with the associated private key forms a valid proxy certificate.

This approach has a distinct advantage over the usual methods of delegating credentials to a remote server. The standard “grid-proxy-init” and job submission routines produce the public and private keys for the proxy certificate locally, sign the public key and then transfer both to the remote server over the network. In the GridSite delegation service, the private key associated with the proxy certificate never leaves the remote server, adding an extra layer of security to the delegation process.

This is completely analogous to the processes involved in obtaining a standard X.509 certificate, but cast as a web service. The delegation service is treating the client as a Certificate Authority (CA) and requesting that the public key be signed by the user's private key (which acts like the CA's private key). The main difference is that both the client and the service can verify the identity of the other through trusted CAs, so there is no lengthy identification step involved.

The above description covers the functionality of the initial (version 1) delegation service interface. There is now an updated interface specification (version 1.1) that extends the functionality of the delegation service. The new functionality includes methods that allow users to check when an existing proxy certificate will expire, to renew such a certificate and to destroy the certificate.

3.2 GridSite Implementation

The GridSite implementation of the delegation service makes use of the gSOAP11 toolkit and is written in C. Running under the GridSite environment allows the service to obtain authentication information about connecting clients directly from environment variables. It is intended as a simple illustration of how web services can be created within the GridSite framework, but this implementation can also be incorporated into other web services that might require this functionality. Additionally, GridSite provides a command line client, also built using the gSOAP toolkit, in order to supply a complete solution.

The GridSite delegation service specification was created within the EGEE12 project. As a result, there are also Java client and server implementations available within the gLite framework. Since the service is based on a WSDL definition, all of the combinations of client and server implementations interoperate without any problems.

4. Service Discovery

Traditional web services require that the client is either configured for one particular instance of the service, with a hard-coded location (as is the case with the GridSite delegation service command line client, once compiled), or that the user of the client provides the URL of the service they wish to use. This can cause problems if locations of services change or users wish to locate alternative services providing the same functionality.
A system was developed to allow a browser (and in turn the GridSite toolbar) to be notified of an available delegation service by a web site. The URL of the service is provided either in HTTP headers or in a META element [13] in the HTML source of the page being viewed. The META element method is similar to the method employed by sites and browsers to enable automatic location of RSS feeds, which uses a LINK element.
The header used for the notification is "Proxy-Delegation-Service", the value of which should be the URL of the delegation service. The insertion of this header can be easily achieved with the GridSite Apache module by setting the GridSiteDelegationURI directive in the web server's configuration file.
To deliver the same information without using HTTP headers, a META element containing the http-equiv attribute can be inserted into the HEAD element of a web page. The META element would have the following format:

<META
  http-equiv = "Proxy-Delegation-Service"
  content = "https://eg.com/delegate.cgi"
>

The combination of the two methods above provides great flexibility for different groups of people to alert users to a delegation service. The HTTP header method allows a web server administrator to inform all visitors of a related delegation service, and the META element method allows an individual page (a personal home page, for example) to do the same. The latter may be useful in an environment where enabling server-wide options is not possible (e.g. pages located on unrelated servers) or on a server not using the GridSite software.

5. The GridSite Toolbar
As stated previously, it is possible for users to upload custom web services to a GridSite server, an example of which is the delegation service. Additionally, GridSite provides a wide variety of site management features, including editing/uploading/deleting files and folders and editing access control lists. All of these tasks can be carried out from within a web browser.
Although a command line client was created for the delegation service, it did not provide an easily accessible interface as in the case of the browser-enabled functionality mentioned above. A browser-based client for the delegation service was produced to illustrate a mechanism for allowing interaction with such web services. The GridSite toolbar is a client for both the GridSite delegation service and the GridHTTP protocol, wrapped up in an extension for the Mozilla Firefox web browser. It makes use of several features of the browser and the service discovery method, described in Section 4, to make delegation and using GridHTTP a simple point-and-click task.

5.1 Mozilla Firefox
Mozilla Firefox was chosen as a base platform for the GridSite toolbar for several key features. These include:

- Default web browser in the Scientific Linux distribution, recommended by EGEE/gLite.
- Explicitly designed to be easily extensible.
- Provides JavaScript objects for manipulation of SOAP messages and interacting with WSDL services.
- Provides a cross-platform development environment.
- Simple API for creating graphical interface elements.
- Built-in certificate verification of remote servers over all HTTPS connections.
- Security updates come "for free" from Mozilla.

In addition to these features, the use of Firefox keeps to the GridSite project's general philosophy of building on established, open source software and related protocols, such as Apache and HTTP, as much as possible. These projects have large development teams and are extensively tested by a wide user base. This allows the GridSite software to inherit the stability, performance and security of these projects and receive security updates for crucial elements (such as HTTPS communication, client and server side) for free. Trying to create stable, efficient solutions for the functions that GridSite and the toolbar provide in the form of bespoke software and protocols would be much more time consuming and much harder to maintain against possible security flaws.

5.2 Requirements
The GridSite toolbar makes several assumptions about the configuration of a user's environment. It requires that the user's certificate (as well as the relevant CA root certificate) is loaded into the Firefox software security device. It is also assumed that the user's certificate is stored in PEM-encoded usercert.pem and userkey.pem files in the user's ~/.globus directory. Finally,
the extension also requires the OpenSSL [14] command line tools to be installed.
The toolbar has been developed to work on Linux systems but could be ported to work in a Windows environment, provided that a method of producing secure named pipes, or an equivalent, is available (see section 5.4 for details of how these are used).

5.3 Service Detection
Every time a page (or tab) is loaded or brought into focus, the HTML source of the displayed page is searched for the relevant META element. Then an HTTP HEAD request is made to the same URL as the page being displayed and the response is inspected for the Proxy-Delegation-Service header. If a location is defined in both the HTTP headers and the META element, the value found in the headers will be used.

The browser's built-in functionality will produce a warning to the user if the server hosting the delegation service has a certificate that is either invalid or not from a recognised CA. The connection also uses the certificate in the software security device to authenticate to the service.
The user is then asked for their PEM passphrase, which is passed, through a named pipe, to a shell script wrapped around the OpenSSL command line program. This script signs the request to produce the proxy certificate. The shell script makes use of a file called usercert.srl in the user's ~/.globus directory (from the -CAcreateserial command line option for the OpenSSL X.509 tools) to ensure that the extension will never produce two proxy certificates with the same serial number, as recommended in the IETF RFC 3820 specification for GSI proxies.
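The detection logic of Sections 4 and 5.3 can be sketched as follows: scan the page's HTML for the META http-equiv element, check the HEAD response headers for Proxy-Delegation-Service, and prefer the header when both are present. This is an illustrative sketch, not the toolbar's actual JavaScript; fetching the page and issuing the HEAD request are left to the caller.

```python
from html.parser import HTMLParser

HEADER = "Proxy-Delegation-Service"

class MetaScanner(HTMLParser):
    """Find a <META http-equiv="Proxy-Delegation-Service" content=...>
    element in a page's HTML source."""
    def __init__(self):
        super().__init__()
        self.service_url = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)  # HTMLParser lowercases attribute names
        if tag == "meta" and attributes.get("http-equiv") == HEADER:
            self.service_url = attributes.get("content")

def find_delegation_service(head_headers, page_html):
    """Return the delegation service URL, or None if not advertised.
    The HTTP header takes precedence over the META element."""
    if HEADER in head_headers:
        return head_headers[HEADER]
    scanner = MetaScanner()
    scanner.feed(page_html)
    return scanner.service_url
```

Given the example page above, `find_delegation_service({}, html)` would return the META element's content, while a Proxy-Delegation-Service entry in the HEAD response headers would override it.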
META tags in preference over a service specified in HTTP headers. This can be achieved through the use of directory-specific Apache settings, by disabling the HTTP header for a particular area of a site.
Finally, the new features introduced in version 1.1 of the delegation service interface are not yet supported by the GridSite toolbar. This could be achieved relatively easily by extending the work already done.

5.7 GridHTTP
Access to files using the GridHTTP protocol is similarly an easy operation when using the GridSite toolbar. To do so, users right click an HTTPS link to a file and select "Get with GridHTTP" (Figure 2).

6. Conclusion
The GridSite toolbar is a simple example of how the GridSite functionality can be combined with the service discovery method and a web browser environment in order to simplify the use of web services. It not only avoids the use of long command line strings, but also uses the built-in functionality of the web browser (such as choosing where to save a file) to add graphical elements that make the process more intuitive.
Using the web service hosting features of GridSite and the methods outlined in this paper, it is possible for any user to create and deploy a web service, or a collection of services, configured to their own specific set of requirements. The web service can be written in almost any programming or scripting language, hosted as CGI scripts for Apache, and then made accessible through the familiar environment of a web browser.
Applications could range from something as simple as having a mechanism for notifying users of the status of submitted jobs to a complex Grid "portal" site, which could be used to submit jobs and retrieve output. In the latter case, the GridSite delegation service and GridSite toolbar functionality could be easily integrated to allow the delegation of credentials to the service as required.
The GridSite toolbar demonstrates that making grid applications as easy to use as any locally installed application is possible. Using such techniques can help make the Grid more accessible for new users and ease its adoption.
Acknowledgements
This work was funded by the Particle Physics
and Astronomy Research Council through their
GridPP and e-Science Studentship programmes.
We would also like to thank other members
of the EDG and EGEE security working groups
for providing much of the wider environment
into which this work fits.
References
1. X.509v3 is described in IETF RFC2459,
“Internet X.509 Public Key Infrastructure
Certificate and CRL Profile.”
2. Grid Security Infrastructure information is
available from Globus:
http://www.globus.org/toolkit/docs/4.0/security/
3. VOMS and GACL are described in the EDG
Security Co-ordination Group paper,
“Authentication and Authorization Mechanisms
for Multi-Domain Grid Environments”, L. A.
Cornwall et al, Journal of Grid Computing
(2004) 2: 301-311.
4. GridSite Software is available from
http://www.gridsite.org/
5. The Apache Web Server:
http://httpd.apache.org/
6. XACML specification is by OASIS:
http://www.oasis-open.org
7. “The GridSite Security Framework”, A.
McNab & S. Kaushal, Proceedings of All Hands
Meeting (2005).
8. Mozilla Firefox information and downloads:
http://www.mozilla.com
9. GridFTP:
http://www.globus.org/grid_software/data/gridftp.php
10. Richard Hughes-Jones, private
communication and talk at GNEW 2004.
11. The gSOAP C++ web services toolkit is
available from
http://www.cs.fsu.edu/~engelen/soap.html
12. The EGEE (Enabling Grids for E-sciencE)
project: http://public.eu-egee.org/
13. HTML 4.01 Specification detailing the use
of all valid elements:
http://www.w3.org/TR/REC-html40/
14. OpenSSL is available from
http://www.openssl.org/
Monica Duke
UKOLN, University of Bath
Abstract
Facilitating discovery is an aspect of curation that has been addressed by the eBank UK project.
This paper describes the metadata and associated issues considered when working within the
chemistry sub-discipline of crystallography. A metadata-mediated discovery model has been
implemented to aid the dissemination and sharing of research datasets; although the specific
characteristics of crystallography were the main driver in determining the metadata requirements,
consideration was also given to cross-disciplinary interaction and exchange. An overview of the
metadata profile that has been developed is provided, against the background of the project.
metadata exposed by the repository, thus facilitating re-use.

2.2 Qualified Dublin Core
Dublin Core is presented as two levels. The simplest and most basic level consists of a basic element set. This is then extended to allow for element refinements, that is, elements that refine (but do not extend) semantics in ways that may be useful for resource discovery.
Two categories of refinements (also called qualifiers) are recognised. The first category makes an element narrower or more specific; e.g. the relation element can be refined to isVersionOf or isFormatOf, two specific instances of the relation element which narrow down its meaning. The second method of qualifying an element is by specifying an "encoding scheme" which aids in the interpretation of the element value. Examples of encoding schemes include controlled vocabularies or formal notations. Thus an element qualified by an encoding scheme will have as its value a token taken from a controlled vocabulary (e.g. a classification system) or a string formatted according to a formal notation, e.g. an expression of a date.
The eBank UK application profile makes use of the basic Dublin Core elements as well as the two mechanisms of refinement to define an element set that is specialized to the domain-specific needs. The 'dumbing-down' principle of Dublin Core ensures that applications that cannot recognise the specialised element can safely ignore the qualifier and treat the element as unqualified. Despite the loss of specificity, this is intended to allow the remaining value of the element to be used correctly and usefully for resource discovery.

2.3 Generic Dublin Core descriptions
The Dublin Core metadata elements (indicated in this section by the prefix dc) comprise a number of elements that can (intentionally) be applied to a large number of different resources. For example, dc:title is the name given to a resource. This label can be applied equally to the title of a book, a photograph, a work of art or an experiment. In the eBank application profile, the title of the datasets takes the name of the molecular compound that is being determined in the crystallographic experiment.
dc:date is a date associated with an event in the life cycle of a resource. Typically the date is associated with the creation or availability of the resource; the date was considered a useful element to include since it could allow users to limit searches for datasets added within a specific time period. A scientist (or service) could then generate a 'latest additions' or 'added since' feature.
The dc:creator is the entity responsible for making the content of the resource. This has been interpreted as the names of the scientists themselves who create the datasets that are deposited. Previous work has identified issues of metadata quality, and it is recommended that content rules should be applied to the values of metadata elements to improve quality [14]. One recommendation is that author names should be entered in controlled form, and this has been implemented in the deposition software, so that name formats adhere to those recommended for eprints [15]. The organisations that the scientists belonged to were also included, but this was designated within the dc:publisher element, which is defined as an entity responsible for making the resource available.
dc:type describes the nature or genre of the content of the resource. Various types of resources are described and disseminated using OAI-PMH (text, image, audio and video are but a few examples available from OAIster [16]), therefore the value of this element was considered important to enable selection and sorting operations on the harvested metadata in a heterogeneous environment. The value currently implemented for the content of this element is 'crystal structure data holding'. There are other potential values that this element could take (not currently implemented), such as the more generic 'dataset'. Note that all elements in Dublin Core are repeatable, therefore using all the values in repeated dc:type elements is an option.
dc:identifier is defined in the Dublin Core documentation as an unambiguous reference to the resource within a given context. In the context of the service being deployed in the eBank project, the most useful identifier would be one that not only identifies the datasets but can also be dereferenced to access the resource. Two types of identifier are being used. The URL of the web entry for a holding in the archive is used as it provides an entry point to all the datasets for a crystal structure. On clicking the URL, the web page for the entry displays links that enable download of various datasets, and a selection of key metadata about the resource.
An alternative identifier is also being implemented. This is the Digital Object Identifier (DOI). DOIs are assigned in collaboration with a German agency. The DOI
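The 'dumbing-down' behaviour described in Section 2.2 can be sketched in a few lines: a consumer that does not recognise a refinement collapses it to its base element and keeps the value. The refinement table below is a small illustrative subset, not a complete Dublin Core mapping.

```python
# Illustrative subset: refinement name -> base Dublin Core element.
BASE_ELEMENT = {
    "isVersionOf": "relation",
    "isFormatOf": "relation",
    "created": "date",
    "available": "date",
}

def dumb_down(qualified):
    """Collapse refined elements to simple Dublin Core, merging values.
    Element names not in the table pass through unchanged."""
    simple = {}
    for name, values in qualified.items():
        base = BASE_ELEMENT.get(name, name)
        simple.setdefault(base, []).extend(values)
    return simple
```

For example, a record carrying isVersionOf would be treated by a simple-DC application as carrying relation, so the value still contributes to resource discovery even though its refined meaning is lost.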
from images of x-ray diffraction patterns, to structured documents, such as the CIF. One aim of the repository was to improve on the current publication and dissemination processes (which make the CIF file available) by giving access to all the results from a determination. Each entry in the repository therefore consists of a collection of files from different stages of the experiment, all arising from the determination of one chemical structure.
From the discovery perspective, the metadata should reveal to the user the existence of the multiple files available, together with an indication of their nature. In this specific domain, the experimental stage from which the files are generated is considered significant and indicative of the files available. Altogether, the files make up a collection, or more specifically a data holding, and share common characteristics such as the creator and chemical descriptors.
The model of the crystallography resource that emerges bears some resemblance to a category of digital objects termed complex objects. In the digital library domain, content packaging standards are being proposed as a tool to disseminate digital resources that are composed of more than one underlying file or object. Typically these standards consist of a mechanism for declaring the structure and make-up of digital items, and use XML schemas that define an XML format that either references or contains the data files that make up the package.
METS [17] is one such packaging format being investigated by the eBank UK project. METS is maintained by the Library of Congress; a METS document consists of seven major sections, including a METS header and descriptive metadata. The descriptive metadata section is designed so that metadata descriptions (e.g. those in Dublin Core) can either be included or referenced. This allowed the project to re-utilise the Dublin Core descriptions, outlined above, within a METS file.
The File section and the Structural Map section of METS are used to link (or contain) the files, and to describe the logical or physical relations, respectively. These are the sections currently being implemented and investigated.
The Behaviour section of METS (in common with other sections of some of the other packaging formats) associates files with executable code needed to read or manipulate them. The eBank project has so far concentrated on the dissemination role of the metadata, simply advertising the existence of datasets; specialised sources already exist with integrated facilities for access coupled with data manipulation, and it was not the intention of eBank to compete with these sources. We therefore do not have any immediate plans to implement this feature of the METS standard. However, it is an obvious advantage that the standard accommodates this information, which is available for others to use should they wish to extend the eBank output.

3. The Realities of Cross-Discovery
As a cross-disciplinary effort, one of the interests of the eBank UK project was to encourage and explore common technologies across disciplines and resource types, with a focus on linking between data sets and published literature. Given its uptake in the digital library world, and the experience of the project partners, the OAI-PMH was viewed as a promising candidate protocol to provide shared technology connecting digital libraries with scientific repositories.

3.1 Connections with published literature
Initially it was envisaged that an OAI-PMH repository of crystallography literature with known connections to the deposited data would be created for initial demonstration of cross-search capabilities. Unlike some other disciplines (such as physics), there was no existing body of publications of crystallography that was already accessible in this way. This plan was superseded since good collaborations that developed between project partners and the IUCr resulted in publication metadata being made available by the latter body for the purposes of demonstration in a web interface. Thus XML descriptions of a small sample of carefully selected publications were cross-searched with the metadata from the data archive.
The searching capabilities were shown in a demonstrator prototype which supported searching against author/creator names and chemical information. The publications included had been identified such that connections were known to exist between the articles and the datasets.
The use of the OAI-PMH had an unplanned consequence. Discussions are at an advanced stage to allow publishers to use the protocol to access data that is usually submitted in support of published articles. The use of the DOI also means that rather than reproduce the data in the published article (as sometimes happens), the data will simply be referenced, with more space in the article devoted to discussion of the
results. Thus, although not exactly as originally intended, OAI-PMH looks set to become part of the infrastructure for publishing data in crystallography, and this may lead to new dissemination routes for the data.

3.2 Connections with eLearning
One other angle that the eBank project is exploring is the pedagogical role of data. A study is currently being undertaken to examine the interaction of students directly through the web interface of the archive containing the data. However, from the discovery perspective, the potential of the metadata to be used to identify relationships or relevance between datasets and e-learning materials from other sources is still largely unknown.
The OAI-PMH is being used in a growing repository of learning resources funded by the JISC. JORUM [18] is a free online service for teaching and support staff in UK Further and Higher Education Institutions, helping to build a community for the sharing, reuse and repurposing of learning and teaching materials. It is hoped that searches can identify resources from JORUM that complement the datasets in the eBank repository, when used by students to search for data.
With such a specialist area as that addressed by eBank to date, it is always questionable whether the coverage of a generic resource such as JORUM will be wide enough to contain a sample of results relevant to an eBank search. Initial examination of the metadata revealed that a potential subset of resources (albeit small in number) was relevant to the topic of crystallography. Regretfully, at the time of writing, access to the learning resources themselves was still being negotiated (institutional sign-up and ATHENS authentication is required). Therefore their actual relevance and target audience could not be evaluated by the crystallography users in a demonstrator prototype. It is hoped that once the access issues are sorted out a demonstration and evaluation can take place.

3.3 Validating the output
Ideally, the true test of the metadata profile would be demonstrated by a cross-search service against a number of different repositories containing either crystallographic, or indeed other, resources. Due to the lack of suitable compatible repositories (i.e. ones supporting OAI-PMH) it is still too early to be able to carry out such validation.

3.4 Use of Dublin Core and OAI-PMH in eScience
[19], in a report on data curation for eScience in the UK in 2003, found an "almost total lack of knowledge of metadata tools such as the Dublin Core" in the responses to a questionnaire. However, eScience initiatives (such as those quoted in Section 1.4) do report using some of the Dublin Core elements amongst a number of other specialized descriptions adapted to their applications. There is not much evidence, however, of use of the OAI-PMH as a common framework to deliver research data or metadata (at least in the UK) to date.

4. Future Work
The eBank UK project has defined the metadata fields (described in sections 2.3 and 2.4) based on typical laboratory practice at a large national centre, with the aim of providing an example that could be re-used in other settings. Furthermore, the software that generates the metadata is specialised to the workflow at that laboratory. Work is ongoing at other centres to evaluate alternative OAI-PMH software in different laboratories, and it is hoped to evaluate the eBank software modifications against different laboratory practices in the future.
So far only a relatively small sample of datasets has become accessible and searchable through the eBank initiative. It is therefore difficult to evaluate the efficiency of searching the metadata fields exposed until a critical mass of metadata is available. Furthermore, for some of the fields used (e.g. the InChI), experience is still being gained within the discipline to establish helpful searching methods (e.g. substructure searching).
However, the notion of open access for crystallography data has been widely disseminated within the crystallography community, and generally very well received. Discussions are ongoing and funding is being sought to enable co-operation with other existing repositories to expose their metadata using OAI-PMH. These efforts, together with existing activities, may in the future allow for a more extensive evaluation of cross-searching and related discovery issues, with feedback from a larger group of users, beyond proof of concept.
The inclusion of a DOI as a citation for a dataset when referenced in a publication is still in the very early stages (although at least some publishers are receptive to the idea). This practice will be tested in the near future so that the reaction and interest of the community can
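The harvesting underlying the cross-search work in Section 3 can be sketched with the standard library: build an OAI-PMH ListRecords request (optionally date-limited, which is how an 'added since' feature over dc:date would be realised) and pull simple Dublin Core fields out of a harvested response. The endpoint URL is illustrative, and fetching the XML (e.g. with urllib.request) is left to the caller.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

DC_NS = "http://purl.org/dc/elements/1.1/"  # simple Dublin Core namespace

def list_records_url(base_url, added_since=None):
    """Build an OAI-PMH ListRecords request URL for oai_dc metadata.
    added_since is an ISO date used for selective ('from') harvesting."""
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    if added_since:
        params["from"] = added_since
    return base_url + "?" + urlencode(params)

def extract_dc(response_xml, element="title"):
    """Return all values of one simple Dublin Core element (e.g. title,
    type, creator) found anywhere in a ListRecords response."""
    root = ET.fromstring(response_xml)
    return [e.text for e in root.iter("{%s}%s" % (DC_NS, element))]
```

A cross-search service would harvest several repositories this way and index the extracted fields (creator names, compound names in dc:title, dc:type values such as 'crystal structure data holding') into one searchable store.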
As increasing amounts of data are generated and made available for sharing and re-use through eScience initiatives, the curation responsibility of enabling discovery of such data will become a more pressing issue. The eBank UK project has demonstrated the use of the OAI-PMH as a technology for metadata-mediated discovery in crystallography. For this protocol to become the connecting technology between repositories of crystallography data, and provide a shared infrastructure across different resource types, a more widespread adoption of the protocol amongst the applicable repositories would be required.

6. Acknowledgment
The eBank UK project (http://www.ukoln.ac.uk/projects/ebank-uk/) is funded under the JISC Semantic Grid and Autonomous Computing Programme. The project partners are UKOLN, University of Bath; the School of Electronics and Computer Science, University of Southampton; the School of Chemistry, University of Southampton; and PSIgate, University of Manchester. The work reported here is a result of collaborative effort between project members (past and present) from these institutions.

7. References
[1] Rusbridge, C. et al. The Digital Curation Centre: a vision for digital curation. From Local to Global: Data Interoperability — Challenges and Technologies, Mass Storage and Systems Technology Committee of the IEEE Computer Society, 20–24 June 2005, Sardinia, Italy.

[7] Duke, M., Day, M., Heery, R. et al. Enhancing access to research data: the challenge of crystallography. In: Proceedings of the 5th ACM/IEEE Joint Conference on Digital Libraries, Denver, CO, USA, June 7-11, 2005. New York: Association for Computing Machinery, 2005, pp. 46-55. ISBN 1-58113-876-8.

[8] Heery, R., Duke, M., Day, M. et al. Integrating research data into the publication workflow: eBank experience. In: Proceedings PV-2004: Ensuring the Long-Term Preservation and Adding Value to the Scientific and Technical Data, 5-7 October 2004, ESA/ESRIN, Frascati, Italy. Noordwijk: European Space Agency, 2004, pp. 135-142.

[9] Coles, S., Frey, J., Hursthouse, M. et al. Enabling the reusability of scientific data: experiences with designing an open access infrastructure for sharing datasets. Designing for Usability in e-Science, International Workshop, NeSC, Edinburgh, Scotland, 26-27 January 2006.

[10] Coles, S.J., Frey, J.G., Hursthouse, M.B. et al. An e-Science environment for service crystallography - from submission to dissemination. Journal of Chemical Information and Modeling, Special Issue on eScience, 2006 (in press).

[11] O'Neill, K., Cramer, R., Gutierrez, M. et al. A specialised metadata approach to discovery and use of data in the NERC DataGrid. Proceedings of the UK e-Science All Hands Meeting, 2004.

[12] Pignotti, E., Edwards, P., Preece, A. et al. Providing Ontology Support for Social
Abstract
The paper will illustrate the “CombeChem Project” experience of supporting the chemical data
lifecycle, from inception in the laboratory to organization of the data from the chemical literature.
The paper will follow the different parts of the data lifecycle, beginning with a discussion of how
the laboratory data could (or should) be recorded, and enriched with appropriate metadata, so as to
ensure that curated data can be understood within its original context when subsequently accessed,
as it is generated (the ideal of “Autonomic Annotation@Source”). Intrinsic to our argument is the
recording of the context as well as the data, and maintaining access to the data in the most flexible
form for potential future re-use for purposes that are not recognised when the data was collected.
This is likely to involve many routes to dissemination, with data and ideas being treated by parallel
but linked methods, which will influence traditional approaches to publication and dissemination,
giving rise to a Grid style access to the information working across several administrative domains
summarized by the concept of “Publication@Source”.
experimenters into an e-Science infrastructure development from the project may be easily
based on pervasive and Semantic Grid incorporated. Useful lessons have been learned
technology. Many phases of the knowledge in implementing security features for both the
cycle were explored in CombeChem, from user NCS systems, where a balance has been found
interaction with Grid-enabled high-throughput between cost and a very secure network. In
analysis, fed by smart laboratories (notebooks addition, remote and automated data collection
and monitoring), together with modern procedures, based on a combination of robotics
statistical design and analysis, to utilization of and scripted software routines, and coupled with
semantic techniques to accumulate and disperse automated structure solution and refinement, are
the chemical information efficiently. The way these investigations inform digital curation and dissemination is highlighted in the following sections. We consider the generation of data from instrumentation in section 2; crystallography supplies the major example used here to demonstrate the different approach to data dissemination discussed in section 6. The need to capture and link metadata in this and other chemical laboratories is covered in section 3, and examples are provided in sections 4 and 5. Wider community interaction is considered in section 7 and conclusions are drawn in section 8.

2 Instruments on the Grid and the National Crystallography Service (NCS)

In the Grid-enabling of research in structural chemistry, we focused on chemical crystallography, relating it to the chemical and physico-chemical properties of the target materials. We adopted a "high-throughput philosophy", processing large families of compounds while embedding protocols for the capture of metadata, date-stamping and the adoption of agreed standards for archival and dissemination.

The CombeChem philosophy was applied to the operation of the EPSRC Chemistry Programme funded National Crystallography Service (NCS). This facility is a global centre of excellence with a long-established service providing experimental structural chemistry resources for the UK chemistry community. The NCS involvement has provided an exemplar of how e-Science methodologies enhance user interaction with an operational service that provides resources to a distributed and varied user base, and has resulted in a set of recommendations for the construction of such an infrastructure (now being adopted by other EPSRC services). The NCS Grid facility has been deployed as a set of unified services that provide application, submission, certificated secure sample status monitoring, data access and data dissemination facilities for authenticated users, and is designed so that components currently under various stages of

2.1 Workflow Analysis

A first step towards designing such a complex system was the identification of the sequence of individual processes taken by users, service operators and samples, from an initial application to use the service to the final dissemination and further use of a crystal structure. All major activities, interactions and dependencies between the parties involved (both human and software components) may then be described as a workflow, from which an architecture that would accommodate all the processes could be designed.

The workflow for a typical service crystallography experiment is quite complex when considered at this level of granularity. For this reason it will not be reproduced here; a thorough discussion is provided in reference 6. A typical Grid, or web, service would only involve computing components (e.g. calculations, data retrieval services), hence the workflow involving these services is fairly trivial to derive and can be automated by an appropriate workflow engine. However, the service crystallography workflow also includes many manual operations, e.g. sending a sample to the service or mounting a sample on a diffractometer. From the analysis it was evident that the NCS Grid Service, in common with the few other scientific instruments on the Grid7,8,9,10, is server-driven, as opposed to purely computational Grid services that are generally orchestrated by the user.

The workflow gives rise to the design of a database which is core to the system and is capable of tracking a sample through the workflow. A sample is automatically given a different status in this database according to its position in the workflow, and each status has different authorisation conditions. The interplay between a generalised form of the workflow and the status of a sample is shown in Figure 1.
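The interplay between workflow position and sample status described above can be sketched as a small state machine. This is only an illustrative outline: the status names, the transition table and the role sets below are assumptions for the sketch, not the NCS database's actual values.

```python
from enum import Enum, auto

class SampleStatus(Enum):
    """Hypothetical status values; the NCS database's actual names are not given here."""
    SUBMITTED = auto()
    RECEIVED = auto()
    ON_DIFFRACTOMETER = auto()
    DATA_COLLECTED = auto()
    STRUCTURE_SOLVED = auto()
    DISSEMINATED = auto()

# Allowed forward transitions, mirroring a sample's position in the workflow
TRANSITIONS = {
    SampleStatus.SUBMITTED: SampleStatus.RECEIVED,
    SampleStatus.RECEIVED: SampleStatus.ON_DIFFRACTOMETER,
    SampleStatus.ON_DIFFRACTOMETER: SampleStatus.DATA_COLLECTED,
    SampleStatus.DATA_COLLECTED: SampleStatus.STRUCTURE_SOLVED,
    SampleStatus.STRUCTURE_SOLVED: SampleStatus.DISSEMINATED,
}

# Each status carries its own authorisation conditions (roles are illustrative)
AUTHORISED_ROLES = {
    SampleStatus.SUBMITTED: {"user", "operator"},
    SampleStatus.ON_DIFFRACTOMETER: {"operator"},
    SampleStatus.DISSEMINATED: {"user", "operator", "public"},
}

def advance(status: SampleStatus) -> SampleStatus:
    """Move a sample to the next workflow position, as the server would."""
    return TRANSITIONS[status]
```

The server-driven character of the service shows up here in that only the service, not the user, calls `advance`; the authorisation table is consulted before any party is shown a sample's data.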
Proceedings of the UK e-Science All Hands Meeting 2006, © NeSC 2006, ISBN 0-9553988-0-0
6.2 Metadata harvesting and value-added services

When an archive entry is made public, the metadata are presented to an interface with the internet in accordance with the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)18. OAI-PMH is an accepted standard in the digital libraries community for the publication of metadata by institutional repositories, which enables the harvesting of this metadata. Institutional repositories and archives that expose their metadata for harvesting using OAI-PMH provide baseline interoperability for metadata exchange and access to data, thus supporting the development of service providers that can add value. Although the provision of added value by service providers is not currently well developed, a number of experimental services are being explored. The eBank UK project has developed a pilot aggregator service19 that harvests metadata from the archive and from the literature and makes links between the two. The service is built on DC protocols and is therefore immediately capable of linking at the bibliographic level, but for linking at the chemical dataset level a different approach is required. The Dublin Core Metadata Initiative (DCMI) provides recommendations for including terms from vocabularies in the encoded XML, suggesting that encoding schemes should be implemented using the xsi:type attribute of the XML element for the property. As there are not as yet any designated names for chemistry vocabularies, for the purposes of building a working example the project defined some eBank terms as designators of the type of vocabulary being used to describe the molecule. Thus a chemical formula would be expressed in the metadata record as:

<dc:subject xsi:type="ebankterms:ChemicalFormula">C27H48</dc:subject>

There are currently no journals publishing crystal structure reports that disseminate their content through OAI protocols. In order to provide proof of concept for the linking process, the RSS feed of the International Union of Crystallography's publications website was used to provide metadata, and crystal structure reports published in these journals were then deposited in the archive, thus providing the aggregator service with two sources of information. Aggregation is performed on the following metadata: author, chemical name, chemical formula, compound class, keywords and publication date, thus providing a search and retrieval capability at a number of different chemical and bibliographic levels. The demonstrator system, along with searching guidelines, may be viewed and used at http://eprints-uk.rdn.ac.uk/ebank-demo/.

6.3 Data capture and management

Building on the advances made by the eBank project, we are now developing repositories to drive efficient and complete capture of data as it is generated in the laboratory. Embedding the archive deposition process into the laboratory workflow in a seamless and often automated manner allows the acquisition of all the necessary files and associated metadata. This procedure is determined and driven by the experimental workflow and publication process, and the 'Repository for the Laboratory' project (R4L: http://r4l.eprints.org) is in the process of capturing workflow and publisher requirements. The repository-driven experiment has the advantage that very accurate metadata can be recorded, and a registration process has been designed whereby an experimenter can unambiguously assert that he/she performed a particular action at a particular time. A schematic of this project is shown in figure 3.

Figure 3

7 Outreach: CombeChem and e-Malaria

It is generally acknowledged that the public reputation of science is poor. In schools, science teaching can be uninspiring, science is perceived as boring, hard and irrelevant to people's lives, and the decline in the number of pupils choosing science courses is worrying for the science community and society at large. To address this difficulty, we have, as part of a project jointly supported by CombeChem and
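The Dublin Core encoding shown in section 6.2 can be generated programmatically. A minimal sketch follows, assuming only the standard DC and XSI namespace URIs; the ebankterms prefix follows the project's own example rather than any registered vocabulary.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
XSI = "http://www.w3.org/2001/XMLSchema-instance"
ET.register_namespace("dc", DC)
ET.register_namespace("xsi", XSI)

def chemical_formula_subject(formula: str) -> ET.Element:
    """Build a Dublin Core subject element typed with an eBank vocabulary term,
    following the xsi:type convention recommended by the DCMI."""
    el = ET.Element(f"{{{DC}}}subject")
    el.set(f"{{{XSI}}}type", "ebankterms:ChemicalFormula")
    el.text = formula
    return el

# Serialises to a dc:subject element carrying the typed chemical formula
xml = ET.tostring(chemical_formula_subject("C27H48"), encoding="unicode")
```

An aggregator harvesting such records can then dispatch on the xsi:type value to decide whether to link at the bibliographic or the chemical-dataset level.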
Dissemination. Journal of Chemical Information and Modeling (doi:10.1021/ci050362w)
(7) R. Bramley, K. Chiu, J.C. Huffman, K. Huffman and D.F. McMullen, Instruments and Sensors as Network Services: Making Instruments First Class Members of the Grid, Indiana University CS Department Technical Report 588, 2003.
(8) http://nmr-rmn.nrc-cnrc.gc.ca/spectrogrid_e.html
(9) http://www.itg.uiuc.edu/technology/remote_microscopy/
(10) http://www.terena.nl/library/tnc2004-proceedings/papers/meyer.pdf
(11) A. Nash, W. Duane, C. Joseph, O. Brink, B. Duane, PKI: Implementing and Managing E-security, 2001, New York: Osborne/McGraw-Hill.
(12) R. Guida, R. Stahl, T. Blunt, G. Secrest, J. Moorcones, Deploying and using public key technology, IEEE Security and Privacy, 2004, 4, 67-71.
(13) A. Bingham, S. Coles, M. Light, M. Hursthouse, S. Peppe, J. Frey, M. Surridge, K. Meacham, S. Taylor, H. Mills, E. Zaluska, Security experiences in Grid-enabling an existing national service, conference submission to eScience 2005, Melbourne, Australia.
(14) http://www.e-science.clrc.ac.uk/web/services/datastore
(15) http://www.sdsc.edu/srb/
(19) M. Duke, M. Day, R. Heery, L.A. Carr, S.J. Coles, Enhancing access to research data: the challenge of crystallography, Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries, 2005, 46-55, ISBN 1-58113-876-8.
Abstract
The rapid increase in data volume and availability, along with the need for continual, quality-assured searching and indexing of such data, requires efficient and effective metadata management strategies. From this perspective, adequate, well-managed and high-quality metadata is becoming increasingly essential for successful long-term, high-quality data preservation. Metadata's assistance in the reconstruction or accessibility of preserved data, however, bears the same predicament as the efficient use of digital information over time: long-term metadata quality and integrity must be assured notwithstanding the rapid evolution of metadata formats and related technology. Therefore, in order to ascertain the overall quality and integrity of metadata over a sustained period of time, and thereby assist successful long-term digital preservation, effective long-term metadata curation is indispensable. This paper presents an approach to long-term metadata curation, which involves a provisional specification of the core requirements of long-term metadata curation. In addition, the paper introduces the "Metadata Curation Record", expressed in the form of an XML Schema, which captures additional statements about both data objects and associated metadata to aid long-term digital curation. The paper also presents a metadata schema mapping/migration tool, which may be regarded as a fundamental stride towards an efficient and curation-aware metadata migration strategy.
1 E-Science Centre, CCLRC - http://www.e-science.clrc.ac.uk/web
successful long-term2 data preservation, may seem incredibly challenging.

Under the challenges set by the task of successful long-term data preservation, the word 'metadata' is becoming increasingly prevalent, with a growing awareness of the role that it can play in accomplishing such a task. In fact, the digital preservation community has already envisaged the need for good-quality, well-managed metadata to reduce the likelihood of important data becoming unusable over substantially long periods of time (Calanag, Sugimoto, & Tabata, 2001). Furthermore, it has been recognized that metadata can be used to record the information required to reconstruct, or at the very least understand the reconstruction process of, digital resources on future technological platforms (Day, 1999).

Metadata's assistance in the reconstruction or accessibility of preserved data, however, bears the same predicament as the efficient use of digital information over time: long-term metadata quality and integrity must be assured notwithstanding the rapid evolution of metadata formats and related technology. The only solution to this problem is the employment of a well-conceived, efficient and scalable long-term curation plan or strategy for metadata. In effect, curation can prevent metadata from becoming out of step with the original data, or from undergoing additions, deletions or transformations that invalidly change its meaning. In other words, in order to ascertain the overall quality and integrity of metadata over a sustained period of time, and thereby help ensure continued access to high-quality data (particularly within the e-Science context), effective long-term metadata curation is indispensable. Given the evident necessity of long-term preservation of scientific data, the task of long-term curation of the metadata associated with that data is equally important and beneficial in the context of e-Science.

This paper endeavors to provide a concise discussion of the main requirements of long-term metadata curation. In addition, the paper introduces the "Metadata Curation Record", expressed in the form of an XML Schema, which captures additional statements about both data objects and associated metadata to aid long-term digital curation. The paper also presents a metadata schema mapping/migration tool, which may be regarded as a fundamental stride towards an efficient and curation-aware metadata migration strategy.

2. Motivation

Over the past few years, several organized and arguably successful endeavors (e.g. the NEDLIB project3) have been made to find an effective solution for successful long-term data preservation. However, the territory of long-term metadata curation, although increasingly acknowledged, thus far remains largely unexplored, let alone conquered. In fact, in most digital preservation or curation motivated workgroups and projects, the necessity of long-term metadata curation is relegated to the backseat and deemed secondary, mainly due to a lack of awareness of the criticality of the problem. As a result, no acceptable methods exist to date for the effective management and preservation of metadata over long periods of time. The Digital Curation Centre (DCC), UK4, in conjunction with the e-Science Centre of the CCLRC, aims to solve this rather cultural issue. The work presented in this paper contributes to the long-term metadata curation activity of the DCC.

3. Metadata Defined

The word "metadata" was invented on the model of meta-philosophy (the philosophy of philosophy), meta-language (a language to talk about language), etc., in which the prefix meta expresses reflexive application of a concept to itself (Kent and Schuerhoff, 1997). Therefore, at the most basic, metadata can be considered as data about data. However, as conformity of the middle term ("about") of this definition is crucial to a common understanding of metadata, this classical and simple definition of metadata has become ubiquitous and is understood in different ways by many different professional communities (Gorman, 2004). For example, from the bibliographic control outlook, the focus of "aboutness" is on the classification of the source data for identifying the location of information objects and facilitating the collocation of subject content. Conversely, from the computer-science-oriented data management perspective, "aboutness" may well emphasize the enhancement of use in relation to the source data. Moreover, this metadata or "aboutness" is synonymous with its context in the sense of contextual information.

2 The time period over which continued access to digital resources with an accepted level of quality, despite the impacts of technological changes, is beneficial to both the user base and the curatorial organization(s) of the resources.
3 Networked European Deposit Library - http://www.kb.nl/coop/nedlib/
4 Digital Curation Centre, UK - http://www.dcc.ac.uk
Nevertheless, in light of its acknowledged role in the organisation of and access to networked information, and its importance in long-term digital preservation, metadata may be defined as structured, standardized information that is crafted specifically to describe a digital resource, in order to aid the intelligent, efficient and enhanced discovery, retrieval, use and preservation of that resource over time. In the context of digital preservation, information about the technical processes associated with preservation is an ideal example of metadata.

4. Metadata Curation

In effect, the phrase "metadata curation" is an integral part of the phrase "digital or data curation", which has different interpretations within different information domains. From the museum perspective, data curation covers three core concepts: data conservation, data preservation and data access. Access to data or digital information in this sense may imply preserving data and making sure that the people to whom the data is relevant can locate it - that access is possible and useful. Another interpretation of the phrase "data curation" is the active management of information, involving planning, where re-use of the data is the core issue (Macdonald and Lord, 2002).

Therefore, in essence, long-term data or digital curation is the continuous activity of managing, improving and enhancing the use of data or other digital materials over their life-cycle and over time, for current and future generations of users, in order to ensure that their suitability for an intended purpose or range of purposes is sustained and that they remain available for discovery and re-use.

In light of the above construal of digital preservation, metadata curation may be defined as an inherent part of a digital curation process: the continuous management of metadata (which involves its creation and/or capture, as well as assurance of its overall integrity) over the life-cycle of the digital materials that it describes, in order to ascertain its suitability for facilitating the intelligent, efficient and enhanced discovery, retrieval, use and preservation of those digital materials over time.

5. Requirements of Metadata Curation

The efficacy of metadata curation largely relies upon the successful implementation of a number of requirements. Although metadata curation requirements may differ considerably according to the type of data described, the information outlined below attempts to provide a general overview of the main requirements.

5.1 Metadata Standards

Digital preservation professionals have already perceived the necessity of metadata formats/standards5 in forestalling the obsolescence of metadata (and hence of the actual data or resource) due to dynamic technological changes. In the context of long-term data curation, it is essential that the structure, semantics and syntax of metadata conform to widely supported standard(s), so that the metadata is effective for the widest possible constituency, maximises its longevity and facilitates automated processing.

As it would be impractical to even attempt to determine unequivocally what will be essential in order to curate metadata in the future, the metadata elements should reflect (along with other relevant information such as metadata creator, creation date, version, etc.) necessary assumptions about future access requirements. Furthermore, the metadata elements should be interchangeable with the elements of other approved standards across other systems with minimal manipulation, in order to ensure metadata interoperability. This will consequently help minimize overall metadata creation and maintenance costs. It may also be advantageous to define specific metadata elements that portray metadata quality.

5.2 Metadata Preservation

Metadata curation requires metadata to be preserved along with the data in order to ensure that its descriptions remain accurate over time. To date, the dominant approach to long-term digital preservation has been migration. Unfortunately, it poses the notable danger of data loss or, in some cases, the loss of the original appearance and structure (i.e. 'look and feel') of the data, as well as being highly labour intensive. However, in the context of metadata preservation, the 'look and feel' of metadata (e.g. differing date/time formats) is not as imperative as that of the original data, as long as the metadata maintains its aptness for describing the original data accurately over time. Therefore, despite the existence and availability of emulation (which seeks to solve the problem of 'look and feel' loss by mimicking the hardware/software of the original data analysis environment), migration would appear to be the better solution for long-term metadata preservation. Having said that, if a superior or

5 Fundamentally, a metadata standard or specification is a set of specified metadata elements or attributes (mandatory or optional) based on rules and guidance provided by the governing body or organisation(s).
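The interchangeability requirement of section 5.1 is essentially a crosswalk between standards. A minimal sketch follows, using a hand-written Dublin Core to DIF mapping; the specific element pairings below are illustrative assumptions, not an endorsed crosswalk.

```python
# Illustrative element crosswalk from Dublin Core to DIF-style names.
# The element names come from the two standards; the pairings are assumptions.
CROSSWALK = {
    "dc:creator": "Data_Set_Citation/Dataset_Creator",
    "dc:title": "Data_Set_Citation/Dataset_Title",
    "dc:subject": "Keyword",
    "dc:date": "Data_Set_Citation/Dataset_Release_Date",
}

def convert(record: dict) -> dict:
    """Map a Dublin Core-keyed record onto DIF-style element names,
    keeping unmapped elements under an 'unmapped' key for curator review."""
    out, unmapped = {}, {}
    for element, value in record.items():
        if element in CROSSWALK:
            out[CROSSWALK[element]] = value
        else:
            unmapped[element] = value
    if unmapped:
        out["unmapped"] = unmapped
    return out
```

Keeping unmapped elements visible, rather than silently dropping them, reflects the quality-assurance concern raised in section 5.3: loss during interchange is itself a metadata quality flaw.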
alternative preservation strategy is proposed, it would also be worth considering, as both emulation and migration have been criticised for being costly, highly technical and labour intensive.

However, a classic unresolved metadata migration issue is that of tracking and migrating changes to the metadata itself. This issue is likely to arise when currently used metadata standards/formats change and/or evolve in the future. For example, an element contained within a contemporary metadata format might be replaced in, or even excluded from, newer versions of that format, thus incurring the problem of migrating the information under that element to its corresponding element(s) (if any) of the new format. Therefore, in order to curate metadata successfully, a curation-aware migration strategy needs to facilitate both the migration of metadata (ideally from old formats to new formats) and the tracking/checking of changes (i.e. new formats against old formats) between different versions of its format, while remaining flexible to the addition of further requirements.

5.3 Metadata Quality Assurance

As highlighted earlier in this paper, quality assurance of metadata is an integral part of long-term metadata curation. It needs to be ensured that appropriate quality assurance procedures or mechanisms are in place to eliminate any quality flaws in a metadata record and thereby ascertain its suitability for its intended purpose(s). As identified in (JISC, 2003), incorrect and inconsistent metadata content significantly lowers the overall quality of metadata. Such inaccuracy and inconsistency often arise from a lack of validation and sanity checking at the time of metadata entry, and from a lack of metadata creation guidelines, respectively. Non-interoperable metadata formats and erroneous metadata creation and/or management tools are also considerable contributors to such metadata quality flaws.

In general, the quality of a metadata record is measured by its degree of consistency with, and/or accuracy in reference to, the actual dataset, and its conformance to some agreed standard(s). Therefore, the digital metadata curation process, at least the part or module contributing to metadata quality assurance, should be executed on the following two levels:

Semantic curation - organizing and managing the meaning of metadata, i.e. ensuring the semantic validity of metadata so that it describes the dataset meaningfully.

Representational curation - organizing and managing the formal representation (and utilization) of metadata, i.e. ascertaining structural validity against metadata schema(s) and the re-usability of metadata.

While representational curation (i.e. structural validation) typically begins during or after the creation of metadata records, semantic curation can in fact start even before metadata records come into existence. The importance of semantic curation is easily deducible from the fact that, in order to ensure the effective assistance of metadata in long-term digital curation, metadata curation (i.e. semantic curation) should begin at the very outset of the metadata lifecycle by developing or adopting a curation-aware metadata framework or ontology. It is also useful to have a rich set of metadata validation functionality incorporated within metadata creation and management tools.

5.4 Metadata Versioning

Throughout the vibrant process of long-term metadata curation, metadata is prone to be volatile. This volatility may well be caused by updates that involve amendments to, or deletions from, existing metadata records. However, previous versions of metadata may need to be retrieved in order to obtain vital information about the associated preserved information (e.g. in the case of an annotation, who made the annotation and which version(s) of a value it applies to). It is therefore essential to be able to discriminate, by versioning metadata information, between metadata in the different states that arise and co-exist over time.

5.5 Other Requirements

Aside from the requirements outlined above, long-term metadata curation needs to take the following additional issues into account:

Metadata Policy: A set of broad, high-level principles that form the guiding framework within which metadata curation can operate must be defined. The metadata policy would normally be a subsidiary policy of the organizational data policy statement and should reference the rules concerning legal and other related issues regarding the use and preservation of data and metadata, as governed by the data policy statement.

Audit Trail & Provenance Tracking: The metadata curation process should ensure the recording of information at the required granularity and facilitate the necessary means to track any significant changes (e.g. provenance changes) to both data and metadata over their life cycles.
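The versioning requirement of section 5.4 amounts to never overwriting a metadata record in place. A minimal append-only sketch follows; the class and method names are our own, not from the paper.

```python
from datetime import datetime, timezone

class VersionedMetadata:
    """Minimal sketch of versioned metadata records: every update is
    appended rather than overwritten, so earlier states stay retrievable."""

    def __init__(self):
        self._versions = []  # list of (version, timestamp, record) tuples

    def update(self, record: dict) -> int:
        """Record a new state of the metadata; returns the new version number."""
        version = len(self._versions) + 1
        self._versions.append((version, datetime.now(timezone.utc), dict(record)))
        return version

    def get(self, version: int = None) -> dict:
        """Return the latest record, or a specific earlier version."""
        v = self._versions[-1] if version is None else self._versions[version - 1]
        return dict(v[2])
```

Because each version carries its own timestamp, the annotation question raised in 5.4 (who changed what, and which version a value belongs to) can be answered by inspecting the stored tuples rather than reconstructing history after the fact.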
This will, amongst other things, help provide assurance as to the reliability of the content information of both the data and the associated metadata.

Access Constraints & Control: Appropriate security measures should be adopted to ensure that the metadata records have not been compromised by unauthorized sources, thereby ensuring the overall consistency of the metadata records. Furthermore, verification of the metadata records' authenticity before they are ingested into the system for long-term curation is also a crucial requirement. Techniques such as digital signatures and checksums may be employed for this function. Fixity information, as defined within the OAIS framework (OAIS, 2002), is an ideal example that advocates the use of such techniques or mechanisms.

6. Metadata Curation Record

It would not be an overstatement to regard the term "information" as highly crucial, as well as instrumental, in the context of long-term digital curation. To elaborate, the success of a long-term curation strategy predominantly relies on sufficient and accurate information about the resources being curated. Under the influence of this observation, the "Metadata Curation Record" (hereafter referred to as the MCR) has been constructed in the form of an XML Schema, which aims to record additional statements about both data objects and associated metadata to aid long-term digital curation.

In other words, the MCR essentially pursues two primary objectives. The first objective is to capture as much information about a digital information object as possible, in order to assist its long-term preservation, curation and accessibility. This objective may well echo the objectives of many widely used long-term-preservation-motivated XML schemas (e.g. PREMIS6). The second objective, on the other hand, is a feature that may not be discerned in most contemporary metadata standards: to provide additional statements about the metadata record itself, thereby supporting the long-term curation of that record.

In a digital curation system, the metadata ingest interface and/or metadata extraction tools/services can be developed based on the MCR to ensure that appropriate and sufficient metadata is acquired to aid in the curation of both data and metadata.

The approach employed to construct the curation record involved examining a range of different existing well-known metadata schemas, such as Dublin Core7, the Directory Interchange Format (DIF)8, the DCC RI Label9, the CCLRC Scientific Metadata Model Version 2 (Sufi & Mathews, 2004) and IEEE Learning Object Metadata (LOM)10, and importing the most relevant elements (in terms of curation, preservation and accessibility) from them. The rationale for this approach was to utilise existing resources and thereby avoid reinventing the wheel as much as possible.

6.1. Overview

In general, as depicted in figure 1, the elements contained within the Metadata Curation Record are divided into four categories: General, Availability, Preservation and Curation.

Firstly, the "General" category represents all generic information (e.g. Creator, Publisher, Keywords, etc.) about a data object. This category of elements is primarily required for presenting an overview of the digital object to its potential users. In addition, the elements that record keyword-related information (e.g. keywords, subject) may well be used to aid keyword-based searching for scientific data across disparate sources. Secondly, the "Availability" elements provide information with regard to accessing the data object, checking its integrity, and any access or use constraints associated with it.

The "Preservation" category presents information that will assist in the long-term preservation and accessibility of digital objects. Of particular mention is the OAIS-compliant (OAIS, 2002) "Representation Information Label", which captures the information required to enable access to the preserved digital object in a meaningful way. The use of the RI label can be recursive, especially in cases where the meaningful interpretation of one RI element requires further RI. This recursion continues until there is sufficient information available to render the digital object in a form the OAIS Designated Community can understand.

6 PREservation Metadata: Implementation Strategies - http://www.oclc.org/research/projects/pmwg/
7 Dublin Core Metadata Initiative - http://dublincore.org/
8 DIF Writer's Guide - http://gcmd.gsfc.nasa.gov/User/difguide/difman.html
9 DCC Info Label Report - http://dev.dcc.ac.uk/twiki/bin/view/Main/DCCInfoLabelReport
10 IMS LOM - http://www.imsproject.org/metadata/mdv1p3pd/imsmd_bestv1p3pd.html
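To make the four categories of section 6.1 concrete, a hypothetical MCR instance might look as follows. The schema itself is not reproduced in this paper, so every element name and value below is illustrative only, not the published MCR vocabulary.

```xml
<!-- Hypothetical instance; element names are illustrative, not the published schema -->
<MetadataCurationRecord>
  <General>
    <Creator>CCLRC e-Science Centre</Creator>
    <Keywords>neutron scattering</Keywords>
  </General>
  <Availability>
    <AccessConstraints>registered users only</AccessConstraints>
    <Checksum algorithm="MD5">9e107d9d372bb6826bd81d3542a419d6</Checksum>
  </Availability>
  <Preservation>
    <RepresentationInformationLabel>data file format description</RepresentationInformationLabel>
  </Preservation>
  <Curation>
    <LifeCycle>
      <Event date="2006-01-10" agent="curator">format migrated</Event>
    </LifeCycle>
    <MetaMetadata>
      <Creator>metadata ingest service</Creator>
    </MetaMetadata>
  </Curation>
</MetadataCurationRecord>
```

The nested MetaMetadata element reflects the record's second objective: statements about the metadata record itself travel alongside the statements about the data object.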
Finally, the "Curation" category elements provide information about the life cycle and other related aspects (e.g. version, annotation) of the digital object and of the metadata record itself, which may be used in the efficient, long-term curation of both. Of particular significance are the "LifeCycle" elements, which are defined to record all changes, along with information about who or what (e.g. factors, people, etc.) conducted or was responsible for the changes that a digital object undergoes throughout its life cycle. This type of information is vital for implementing some crucial curation-related functionalities, such as provenance tracking (as specified in the OAIS model) and audit trailing, which are essential for checking and ensuring the quality and integrity of data objects.

Furthermore, the "MetaMetadata" element in this category is dedicated to capturing the information required to curate the MCR itself efficiently. It has its own "General" (e.g. Creator, Indexing), "Preservation" (e.g. Identifier, Representation Information), "Availability" (e.g. Metadata Quality, Access, Rights, etc.) and "Curation" (e.g. Event) category elements. In a metadata curation system, the "MetaMetadata" elements would be implemented as a separate schema complementing the MCR, which consists of the rest of the elements.

7. Metadata Schema Mapping Tool

Successful long-term metadata curation demands a curation-aware migration strategy in order to cope with the metadata migration issue (see 5.2) that arises from metadata schema/format evolution. The metadata schema mapping tool, which has been developed using Java technologies, aims to resolve this issue by facilitating easy, semi-automatic migration of metadata between two co-existing versions of its format. The tool employs an efficient regular-expression-driven matching algorithm to determine all possible matches (direct or indirect) between two versions of a metadata schema, irrespective of their types (i.e. XML or relational), calculates mapping rules based on the matches, and finally migrates metadata from the source schema to the destination schema.

7.1 Rationale

There are numerous commercial and non-commercial database schema mapping or migration tools available at present. Most of these tools enable users, to varying extents, to automatically find matches between two database and/or XML schemas and to migrate and/or copy data across based on the matches. Examples of such tools include Altova MapForce11 and SwisSQL12. Many of these tools also facilitate interoperability between different databases by allowing users to perform cross-database schema migration, such as migration from Oracle to DB2, or MS SQL to Oracle. Naturally, the existence of these tools may somewhat question the necessity of, and the motivation for, the Metadata Schema Mapping Tool.

11 Altova MapForce - http://www.altova.com/download_mapforce.html
12 SwisSQL - www.swissql.com/
198
Proceedings of the UK e-Science All Hands Meeting 2006, © NeSC 2006, ISBN 0-9553988-0-0
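To make the mapping-rule step concrete, the following sketch applies a calculated rule (an old-field-name to new-field-name correspondence) while migrating a flat rendering of metadata records. The rule format, field names and values are invented for illustration; this is not the tool's actual rule language.

```shell
#!/bin/sh
# Sketch: apply calculated mapping rules while migrating metadata from a
# source schema to a destination schema. Rule table and records are made up.

cat > rules.txt <<'EOF'
DATAFILE_UPDATE_TIME=DATAFILE_MODIFY_TIME
EOF

cat > source.txt <<'EOF'
DATAFILE_ID=42
DATAFILE_UPDATE_TIME=2006-01-31
EOF

# Rename every field that has a mapping rule; copy the rest unchanged.
awk -F= 'NR == FNR { rule[$1] = $2; next }
         { name = ($1 in rule) ? rule[$1] : $1; print name "=" $2 }' \
    rules.txt source.txt > destination.txt

cat destination.txt
```

Running the sketch rewrites DATAFILE_UPDATE_TIME to DATAFILE_MODIFY_TIME while leaving DATAFILE_ID untouched; the real tool additionally has to handle type conversion and cross-database migration.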
Figure 2: Direct & indirect matches between two versions of DATAFILE table
In response to this question, it would not be an overstatement to say that the inability of currently available tools to find indirect or non-obvious matches between two schemas essentially justifies the necessity and confirms the uniqueness of the Metadata Schema Mapping tool. To illustrate, consider two database tables from two versions of the ICAT schema of CCLRC13, as shown in figure 2. Commercial tools will only be able to determine the direct matches between these two tables (as indicated by the thin arrows from the source table to the destination table). Here, the term “direct match” refers to an exact duplicate or replica of a field or column in a database table in terms of the name and type of that particular field or column.

These two database tables together provide an ideal example of the aforementioned metadata migration issue, i.e. migrating changes to the metadata itself, especially in cases where one or more crucial elements in the old format do not have any directly corresponding elements in the new format. For example, the field “DATAFILE_UPDATE_TIME” in version 1 of the DATAFILE table has been changed to “DATAFILE_MODIFY_TIME” in version 2 (indicated by a solid arrow in the diagram). Currently available tools will not be able to establish any link between these two fields, as they only search for direct matches, as shown in figure 2.

However, a look in any English dictionary will easily establish that the word “UPDATE” in the former field is in fact synonymous with the word “MODIFY” in the latter, and hence that the aforementioned two fields are an “indirect match”. The Metadata Schema Mapping tool is capable of doing exactly that, along with other features, such as determining matches between two tables based on their relationships (i.e. foreign-key-based relationships) with other tables in the corresponding schemas, in both cross-database and cross-schema-format settings.

In addition to its role in metadata curation, the mapping tool can play a perceivably significant role in any data management environment administering considerably large and evolving data sets. In such a dynamic environment, data evolution often results in evolution of the underlying schema(s) as well as a change of the databases employed (e.g. MS Access to Oracle) to assist and maintain the changes to the datasets. The schema mapping tool provides an easier and relatively less labour-intensive means (than the commercial tools) of identifying and reconciling complex and “non-obvious” differences between schemas. Thereby, the tool effectively facilitates more accurate migration of data from an old version of a schema to its newer version, irrespective of the schema type and/or the database they reside in, while enabling the use of both versions. This effectively makes the accessibility of the datasets more declarative, as users are able to pose queries to the datasets based on whichever version of the schema they are familiar with. From this perspective, the use of the tool is indeed beneficial to the efficient management of the ever-expanding scientific data generated from various e-Science activities.

13 The ISIS ICAT is a metadata catalogue of the previous 20 years of data collected at the ISIS facility. The database schema used by them is based heavily on the CCLRC Scientific Metadata Model version 2 (Sufi & Mathews, 2004).
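The distinction between direct and indirect matches can be sketched as follows; the "NAME:TYPE" field encoding and the one-entry UPDATE/MODIFY substitution are illustrative stand-ins for a dictionary lookup, not the tool's actual implementation.

```shell
#!/bin/sh
# Sketch: classify the match between two schema fields, each encoded as
# "NAME:TYPE". The UPDATE/MODIFY synonym rule stands in for a thesaurus.

classify_match() {
    # $1 = source field, $2 = destination field
    src_name=${1%%:*}; src_type=${1##*:}
    dst_name=${2%%:*}; dst_type=${2##*:}
    if [ "$src_name" = "$dst_name" ] && [ "$src_type" = "$dst_type" ]; then
        echo "direct"; return
    fi
    # Indirect match: same type, and the names agree once synonyms
    # are substituted.
    canon=$(echo "$src_name" | sed 's/UPDATE/MODIFY/')
    if [ "$canon" = "$dst_name" ] && [ "$src_type" = "$dst_type" ]; then
        echo "indirect"
    else
        echo "none"
    fi
}

classify_match 'DATAFILE_ID:NUMBER' 'DATAFILE_ID:NUMBER'                # direct
classify_match 'DATAFILE_UPDATE_TIME:DATE' 'DATAFILE_MODIFY_TIME:DATE'  # indirect
```

A full matcher would also consult the foreign-key relationships mentioned above; this sketch only shows the name/type and synonym checks.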
Ligang He, Martin Dove, Mark Hayes, Peter Murray-Rust, Mark Calleja, Xiaoyu Yang, Victor Milman*
University of Cambridge, Cambridge, UK
*Accelrys Inc., Cambridge, UK
lh340@cam.ac.uk
Abstract

CASTEP is an application which uses density functional theory to calculate the atomistic structure and physical properties of materials and molecules. This paper investigates execution mechanisms for running CASTEP applications using grid resources. First, a lightweight execution environment is developed for CASTEP, so that CASTEP applications can be run on any grid computing resource with a supported platform, even without the CASTEP server being installed. Then, two execution mechanisms are developed and implemented in this paper. One is based on Condor's file transfer mechanism, and the other on independent data servers. Both mechanisms are completed with the GridSAM job submission and monitoring service. Finally, the performance of these mechanisms is evaluated by deploying them on CamGrid, a university-wide Condor-based grid system at the University of Cambridge. The experimental results suggest that the execution mechanism based on data servers is able to deliver higher performance; moreover, if multiple data replicas are provided, even higher performance can be achieved. Although these mechanisms are motivated by running CASTEP applications in grids, they can equally be applied to other e-science applications, as long as the corresponding lightweight execution environments can be developed.

1. Introduction

Grid systems are becoming a popular platform for processing e-science applications [2][3]. CASTEP is a software package used for atomistic quantum-mechanical modelling of materials and molecules [7]. In order to run CASTEP applications, the CASTEP server is required to provide the necessary execution environment. Currently, a popular approach to running CASTEP applications is to employ Materials Studio Modeling from Accelrys Inc. [12]. Materials Studio Modeling 4.0 is a client/server software environment [12]. At the client side, the users create models of molecular structures and materials. The model information is transferred to the server side, where the calculations are performed and the results are returned to the users. The CASTEP server component is included in the server side. In grid environments, however, it cannot be assumed that Materials Studio is installed on all computational resources. This presents the challenge of harvesting computational grid resources that have no pre-installed CASTEP server environment. This paper tackles this problem by developing a lightweight execution environment for CASTEP applications. The lightweight environment includes the CASTEP executables, shared libraries, relevant configuration files, license checking-out executables and license data, etc.

The execution mechanisms developed in this paper are implemented and deployed on CamGrid, a university-wide grid system at the University of Cambridge [1]. CamGrid is based on the Condor system [8], in which a federated service is provided to flock the Condor pools located in different departments. The jobs submitted to a local Condor pool may be dispatched to another flocked remote pool depending on resource availability [9].

GridSAM is a job submission and monitoring web service [6]. One of the advantages of GridSAM is that the GridSAM client and server communicate with each other through the standard SOAP protocol. Moreover, GridSAM can be connected to different distributed resource managers such as Condor [4]. In this paper, GridSAM is deployed on CamGrid and configured to distribute jobs for execution through Condor. In this configuration the GridSAM web service is available to the Internet, and users can submit jobs over the Internet to the GridSAM service, where the jobs are further submitted by GridSAM to a Condor job submission host. The job submission host then schedules the jobs to the remote execute machines in CamGrid.

This paper investigates two approaches to transferring a lightweight CASTEP execution environment. One approach is to make use of Condor's file transfer mechanism. In the Condor submission description file, it can be specified which files need to be transferred to the execute nodes. The job submission host will copy these files over once the execute nodes have been determined.
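For illustration, a Condor submission description corresponding to this approach might look like the sketch below. The file names are those used elsewhere in the paper, but the submit file itself is our reconstruction, not one generated by GridSAM, whose classad.groovy-driven output may differ.

```
universe                = vanilla
executable              = RunCASTEP.sh
transfer_input_files    = MaterialsStudio.tar,LicensePack.tar,model.cell,model.param
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
output                  = castep.out
error                   = castep.err
log                     = castep.log
queue
```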
Another approach is to set up independent data servers for the lightweight CASTEP execution environment. When a job has been launched on an execute node, it first stages in the necessary CASTEP execution environment and then starts running.

These CASTEP execution mechanisms are presented in detail in this paper. Their performance is evaluated on CamGrid in terms of various metrics, including makespan, average turnaround time, overhead and speedup. The performance is also compared against that achieved by the Materials Studio mechanism.

The work developed in this paper is motivated by the MaterialsGrid project being conducted at the University of Cambridge [13]. The MaterialsGrid project aims to create a dynamic database of materials properties based on quantum mechanical simulations run in grid environments. CASTEP is a typical example of a computational application involved in this project.

2. Lightweight Execution Environment

After installing Materials Studio Modeling 4.0 on a machine, the CASTEP execution environment is built. By tracing the calling path of the CASTEP executables, the necessary files and settings for running CASTEP applications are identified in this paper. All files required to run a CASTEP application can be divided into the following three categories.

1. User-defined files: cell file and parameter file. A CASTEP application needs to take as input the information about the atomistic system users want to study. The information is summarised in two text files: model.cell and model.param (cell and param are the file extensions). The cell file defines the molecular system of interest, and the parameter file (.param) defines the type of calculation to be performed. These two files can be generated using Materials Studio Modeling 4.0 [12] or by the CML transformation toolkit [11][10].

2. Application-dependent system files: potential files. When calculating the properties of a given model, the pseudopotential files corresponding to the elements composing the structure will be used. A pseudopotential file for an element describes the screened electrostatic interaction between the valence electrons and the ionic core for that element. The pseudopotential files for all elements in the periodic table are provided in the Materials Studio Modeling package.

3. Application-independent system files. The files in this category are required by all calculations. These include:
- CASTEP executables and shell scripts – the CASTEP executables are the programs that perform the actual calculations. These executables are wrapped in shell scripts, in which the environment variables as well as other system configurations are defined and different CASTEP executables are invoked in different scenarios.
- Shared libraries – these libraries are required at run time by the CASTEP executables.
- License checking-out executables and relevant license data – the commercial version of Materials Studio Modeling 4.0 needs to check out the license from the pre-installed license server before being able to run the CASTEP executables. These license-relevant files are included in a license pack, which is located in a directory separate from the Materials Studio package. (Please note that a simpler license mechanism applies to academic versions of CASTEP for UK scientists; in this case the CASTEP executables do not need to check out the license files.)

When installing Materials Studio and the license pack, the system configurations need to be specified, such as the home directory of the installation and the location of the license server and license files. Under the Condor job manager, however, all files are placed in a single temporary working directory. Therefore, in order to successfully access the files on an execute node, these configurations need to be changed to adapt to the Condor working location.

After all the necessary changes are made, the required files for running CASTEP applications are packed into tar files. The lightweight execution environment can be constructed on an execute node by unpacking these tar files. The total size of Materials Studio and the license pack is 884 Megabytes. The size of the tar files for the lightweight CASTEP execution environment is reduced to 105 Megabytes.

3. Execution Mechanisms in Grids

In this section, the different CASTEP execution mechanisms are presented in detail. The execution mechanisms differ from each other mainly in their approaches to transferring the CASTEP lightweight environment.

3.1 Execution Mechanism Based on Condor's File Transfer Mechanism

When connecting GridSAM with the Condor job manager, two major configuration files need to be edited. One is the jobmanager.xml file, which specifies the job manager and its settings used to distribute the jobs received by GridSAM. The other is the file classad.groovy, in which the values of the ClassAd attributes of a Condor job can be specified.

In the classad.groovy file, the transferred input files are specified, which include the user-submitted cell file
and the parameter file as well as the tar files for the lightweight CASTEP execution environment. This classad.groovy file is used by GridSAM to construct the job submission description file for Condor. When the constructed Condor submission description file is submitted on the job submission host (where the tar files for the lightweight CASTEP execution environment are also stored), the Sched daemon sends the job advert to the matchmaker, and the matchmaker then looks for an execute node whose resource advert matches the job advert. After a suitable node is found, the Sched daemon starts a Shadow daemon acting on behalf of the submitted job, while the Start daemon at the execute node spawns a Starter daemon to represent the process which performs the actual calculation. After the communication between the Shadow daemon and the Starter daemon has been established, the specified input files are transferred from the job submission host to the execute node. The steps of constructing the CASTEP environment (i.e., unpacking the transferred tar files) and performing the actual CASTEP calculations are wrapped in a shell script, which is the executable sent to the execute nodes. This execution mechanism is illustrated in Fig.1.

[Figure 1. Execution mechanism based on Condor's file transfer: the cell and parameter files (*.cell, *.param) are submitted to GridSAM, which uses classad.groovy to construct a Condor job for the Condor job submission host.]

3.2 Execution mechanism based on independent data servers

1. FILE1='MaterialsStudio.tar'
2. FILE2='LicensePack.tar'
3. FILE3='Al_00.usp'
4. FILE4='O_00.usp'
5. flag=`expr $RANDOM % 2`
6. if [ $flag -lt 1 ]
7. then
8.     HOST=$HOST1
9.     PORT=$PORT1
10. else
11.     HOST=$HOST2
12.     PORT=$PORT2
13. fi
14. ftp -n $HOST $PORT << END_SCRIPT
15. quote USER $USER
16. quote PASS $PASSWD
17. get $FILE1
18. get $FILE2
19. get $FILE3
20. get $FILE4
21. quit
22. END_SCRIPT
23. tar -xf $FILE1
24. tar -xf $FILE2
25. Call the script RunCASTEP.sh, which is included in MaterialsStudio.tar, to calculate the specified model, Al2O3

Figure 2. Execution flow of the data server-based execution mechanism on an execution node

In this mechanism, independent data servers are set up to accommodate the tar files for the lightweight CASTEP
execution environment. These data servers can be any publicly accessible servers, such as an FTP server, an HTTP server or, more securely, an SFTP server. In the performance evaluation in Section 4, FTP servers are used to offer file downloading. Multiple data servers can be set up to provide data replicas, and the lightweight CASTEP environment can be downloaded from any one of them.

The access to the data servers and the execution of a CASTEP application are wrapped in an executable shell script. When the shell script is scheduled onto an execute node, it first stages in the necessary files from one of the data servers and then calls the downloaded CASTEP executable to perform the calculations. Condor's file transfer mechanism is only used to transfer the user-defined input files and this executable shell script. The shell script for the execution mechanism with 2 FTP servers is outlined in Fig.2. The submitted job request is to calculate the total energy of the Al2O3 crystal [5]. In the shell script, the files in steps 1-2 contain the lightweight environment for running the CASTEP application, and the files in steps 3-4 are the potential files needed for calculating the user-specified Al2O3 model. Step 5 generates a random number, which is 0 or 1 in this example. In steps 6-13, an FTP server is selected according to the result of step 5. In steps 14-22, the specified files are downloaded. Steps 23-24 build the execution environment for running the CASTEP application. Step 25 invokes the shell script contained in the constructed execution environment to calculate the model.

The execution mechanism based on data servers is illustrated in Figure 3.

[Figure 3. Execution mechanism based on independent data servers: the cell and parameter files (*.cell, *.param) are submitted to GridSAM (configured via jobmanager.xml and classad.groovy) and passed to the Condor job submission host, while the execute node downloads the lightweight environment from one of several data servers.]

4. Performance Evaluation

This section evaluates the performance of the execution mechanisms presented in this paper. Multiple users are simulated to simultaneously submit CASTEP execution requests via GridSAM (each user submits one request) to calculate the total energy of the Al2O3 molecular model [5]. GridSAM then launches the batch of jobs through Condor onto CamGrid. GridSAM 1.1 [14] is deployed in the experiments. Please refer to [1] and [15]
[Figure 4. Makespan (sec.) of the execution mechanisms against the number of users (2-20); the plotted series include "Data Server-1" and "File Transfer by Condor".]
for the detailed settings of CamGrid. In this paper, the experimental results for three mechanism settings are demonstrated. These three mechanism settings are:
1) the lightweight CASTEP execution environment is transferred through Condor's file transfer mechanism;
2) one FTP server is used to provide downloading of the lightweight CASTEP execution environment;
3) three FTP servers are employed, i.e., three replicas are provided for the downloading. Each FTP server offers the same network bandwidth.

4.1. Makespan

In this subsection, experiments have been conducted to compare the performance of the different execution mechanisms in terms of makespan as the number of users submitting CASTEP jobs increases. Makespan is defined as the duration between the time when GridSAM submits the batch of jobs to CamGrid and the time when all these jobs are completed. The shorter the makespan, the better the performance. The experimental results are shown in Fig.4.

It can be observed from this figure that when the number of users is small (less than 9) the performance curves of the three mechanisms are intertwined with each other in terms of makespan. As the number of users increases beyond 9, the performance of the three cases differentiates. The mechanism with 3 data servers outperforms the other two mechanisms, while the mechanism which makes use of Condor's file transfer mechanism shows the worst performance.

These results can be explained as follows. When using Condor's file transfer mechanism, the execute nodes obtain the files from the job submission host. In the Condor system, the job submission host is a very busy node. It has to generate a Shadow daemon for every execute node it sends a job to. The Shadow daemon is responsible for communicating with the Starter daemon on each execute node, which includes transferring the input files specified in the Condor submission description file. Therefore, although each execute node has a separate Shadow daemon on the job submission host for transferring files, these Shadow daemons work on the same node and share the same network card and the same stretch of network link; as a result, these Shadow daemons may not be able to transfer the files in parallel, especially when their number becomes high.

Setting up an independent data server to offer data downloading can shift the burden off the job submission host. This explains why the execution mechanism based on a data server outperforms that based on Condor's file transfer mechanism in Fig.4. When there is more than one data server to offer multiple data replicas, the file transfers can be processed with a higher
[Chart for Figure 5: average turnaround time (sec.) against the number of users (2-18).]
Figure 5. Performance comparison of different execution mechanisms in terms of average turnaround time
degree of parallelism. Hence, the mechanism with 3 data servers performs better than that with one data server. This result suggests that providing multiple data replicas is beneficial to improving performance in terms of makespan.

4.2. Average turnaround time

Fig.5 shows the performance of the different execution mechanisms in terms of average turnaround time under different numbers of users. A job's turnaround time is defined as the duration between the time when it is submitted and the time when the job is completed.

As can be observed from Fig.5, the execution mechanisms based on data servers significantly outperform the one based on Condor's file transfer mechanism, except in the case of 2 users. This may be due to smoother file transfer with independent data servers. It can also be observed from this figure that the mechanism with 3 data servers achieves better performance than the one with a single data server. Again, this can be explained by the fact that the files can be transferred with a higher degree of parallelism when multiple data servers are accessible.

4.3 Overhead of the mechanisms

Fig.6 shows the overhead of the mechanisms presented in this paper. The overhead imposed by using a mechanism to submit, schedule and run a CASTEP job is defined as all the time that is not spent on executing the job. The overhead includes the time for file transfer, the queuing time, the time used to dispatch jobs, and the time used to collect the calculated results. The average overhead of a batch of jobs is defined as the average of the overheads of all jobs. As can be seen from Fig.6, the data server-based execution mechanisms have much lower overhead than the one based on Condor's file transfer mechanism when the number of users is greater than 2. This may be because, in the mechanism using Condor to transfer files, the job submission host becomes a bottleneck, since the job management functionalities including file transfer are all performed on this node. In the mechanism based on data servers, a large portion of the file transfer functionality is shifted to the data servers. Therefore, the job management can be carried out with a higher degree of parallelism, so that the overhead endured by each job is likely to be reduced.

4.4 Speedup

The experiments conducted in this section investigate the speedup achieved by the CASTEP execution mechanisms presented in this paper. It is assumed that the CASTEP execution requests processed by the presented mechanisms are also sent to and run in sequence on the Materials Studio server, where the CASTEP execution
[Figure 6. Overhead (sec.) of the mechanisms against the number of users (2-18).]
Table 1. Speedup of the CASTEP execution mechanism with 3 data servers in terms of makespan (time unit: second)
The number of users 2 6 10 14 18
Data Server-3 217.8 340.8 541.7 641.4 796.2
Materials Studio 331.4 1230.2 2017.0 2282.7 3590.2
Speedup 1.5 3.6 3.7 3.6 4.5
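As a quick sanity check, the speedup row of Table 1 can be recomputed from its two makespan rows (speedup = Materials Studio makespan / mechanism makespan); the small awk program below reproduces the published values.

```shell
#!/bin/sh
# Recompute the Table 1 speedups from the two makespan rows.
speedups=$(awk 'BEGIN {
    n = split("217.8 340.8 541.7 641.4 796.2", ds)   # Data Server-3 makespans
    split("331.4 1230.2 2017.0 2282.7 3590.2", ms)   # Materials Studio makespans
    for (i = 1; i <= n; i++)
        printf "%.1f%s", ms[i] / ds[i], (i < n) ? " " : ""
}')
echo "$speedups"
```

This prints `1.5 3.6 3.7 3.6 4.5`, matching the speedup row of Table 1 after rounding to one decimal place.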
Table 2. Speedup of the CASTEP execution mechanism with a single data server in terms of makespan (time unit:
second)
The number of users 2 6 10 14 18
Data server-1 320.9 399.7 679.0 835.0 1040.6
Materials Studio 472.4 1257.3 2030.3 2802.1 3908.4
Speedup 1.5 3.1 3.0 3.4 3.8
Table 3. Speedup of the CASTEP execution mechanism based on Condor’s file transfer in terms of makespan (time unit:
second)
The number of users 2 6 10 14 18
File transfer by Condor 276.2 390.2 913.3 1099.2 1304.3
Materials Studio 423.8 1273.0 1863.7 3093.5 3920.5
Speedup 1.5 3.2 2.4 2.8 3.0
environment is available (therefore, there is no need to transfer the lightweight execution environment). The speedup of an execution mechanism in this paper over Materials Studio can be defined as the ratio of the makespan achieved by the Materials Studio server to the makespan achieved by the execution mechanism.

Table 1 shows the makespan achieved by the execution mechanism with 3 data servers, the makespan when the same batch of jobs is run in Materials Studio, and the resulting speedup. Table 2 and Table 3 show the experimental results of the other two mechanisms in terms of speedup.

It can be seen from these tables that the speedup achieved by all three mechanisms is greater than 1, even though there is the overhead of transferring the lightweight CASTEP execution environment and the overhead imposed by using the GridSAM job submission service and the Condor job manager. These overheads are relatively
constant for a given execution platform and execution mechanism. Therefore, if molecular models with longer calculation times are sent for execution, the speedup achieved is expected to be higher (the time for calculating the Al2O3 molecular model in these experiments is on the order of 200 seconds).

It can also be observed from these tables that the data server-based execution mechanisms achieve the higher speedup when the number of users is high (greater than 6). This result once again suggests that the data server-based execution mechanism is superior to the one based on Condor's file transfer mechanism when the number of jobs submitted to the system is high.

5. Conclusions

CASTEP is an application using density functional theory to calculate the structure and properties of materials and molecules. Currently, the client/server architecture of Materials Studio is the popular approach to running CASTEP applications. This paper investigates execution mechanisms for running CASTEP on grid resources, where CASTEP execution environments may not be available. In this paper, a lightweight CASTEP execution environment is developed which can be transferred to any grid computational resource. Based on the lightweight environment, two execution mechanisms have been developed in this work. One makes use of Condor's file transfer mechanism to transfer the lightweight environment, and the other employs independent data servers. The performance of these execution mechanisms is evaluated on CamGrid in terms of various metrics. The results show that the proposed execution mechanisms can speed up the batch processing of CASTEP applications; moreover, the mechanisms based on data servers can achieve higher performance than the one based on Condor's file transfer mechanism, especially when the workload level is high. These execution mechanisms also have excellent potential to be applied to other computational applications, as long as the corresponding lightweight execution environments can be developed.

Acknowledgement

The authors would like to thank the DTI for sponsoring the MaterialsGrid project.

6. References

[1] M. Calleja, B. Beckles, M. Keegan, M. A. Hayes, A. Parker, M. T. Dove, "CamGrid: Experiences in constructing a university-wide, Condor-based grid at the University of Cambridge", Proceedings of the 2004 UK e-Science All Hands Meeting.
[2] M. Dove, E. Artacho, T. White, R. Bruin, et al., "The eMinerals project: developing the concept of the virtual organisation to support collaborative work on molecular-scale environmental simulations", Proceedings of the 2005 UK e-Science All Hands Meeting.
[3] I. Foster and C. Kesselman, editors, The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufmann, 2003, 2nd edition. ISBN: 1-55860-933-4.
[4] C. Goonatilake, C. Chapman, W. Emmerich, M. Farrellee, T. Tannenbaum, M. Livny, M. Calleja and M. Dove, "Condor Birdbath - web service interface to Condor", Proceedings of the 2005 UK e-Science All Hands Meeting.
[5] V. Hoang and S. Oh, "Molecular dynamics study of aging effects in supercooled Al2O3", Phys. Rev. E, 70, 061203, 2004.
[6] W. Lee, A. McGough, and J. Darlington, "Performance Evaluation of the GridSAM Job Submission and Monitoring System", Proceedings of the 2005 UK e-Science All Hands Meeting, Nottingham, UK, 2005.
[7] M. D. Segall, P. J. D. Lindan, M. J. Probert, C. J. Pickard, P. J. Hasnip, S. J. Clark and M. C. Payne, "First-principles simulation: ideas, illustrations and the CASTEP code", J. Phys.: Condensed Matter 14 (2002) 2717.
[8] D. Thain, T. Tannenbaum, and M. Livny, "Condor and the Grid", in F. Berman, A. J. G. Hey, G. Fox, editors, Grid Computing: Making The Global Infrastructure a Reality, John Wiley, 2003. ISBN: 0-470-85319-0.
[9] D. Thain, T. Tannenbaum, and M. Livny, "Distributed Computing in Practice: The Condor Experience", Concurrency and Computation: Practice and Experience, Vol. 17, No. 2-4, pages 323-356, February-April 2005.
[10] J. Wakelin, A. García, P. Murray-Rust, "The use of XML and CML in Computational Chemistry and Physics Applications", Proceedings of the 2004 UK e-Science All Hands Meeting, 2004.
[11] Y. Zhang, P. Murray-Rust, M. T. Dove, R. C. Glen, H. S. Rzepa, J. A. Townsend, S. Tyrell, J. Wakelin, E. L. Willighagen, "JUMBO - An XML infrastructure for eScience", Proceedings of the 2004 UK e-Science All Hands Meeting.
[12] http://www.accelrys.com/products/mstudio/
[13] http://www.materialsgrid.org/
[14] http://gridsam.sourceforge.net/1.1/
[15] http://www.escience.cam.ac.uk/projects/camgrid/
Toby O. H. White1, Rik P. Tyer2, Richard P. Bruin1, Martin T. Dove1, Katrina F. Austen1
1Department of Earth Sciences, Downing Street, University of Cambridge. CB2 3EQ
2CCLRC Daresbury Laboratory, Daresbury, Warrington. WA4 4AD
Abstract
The San Diego Supercomputing Centre’s Storage Resource Broker (SRB) is in wide use in our
project. We found that none of the supplied interfaces fulfilled our needs, so we have developed a
new interface, which we call TobysSRB. It is a web-based application with a radically simplified
user interface, making it easy yet powerful to operate for novices and experts alike. The web
interface is easily extensible, and external viewers may be incorporated. In addition, it has been
designed such that interactions with TobysSRB are easily automatable, with a stable, well-defined
and well-documented HTTP API, conforming to the REST (Representational State Transfer)
philosophy. The focus has been on lightweightness, usability and scriptability.
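To illustrate what REST-style scriptability buys, the sketch below addresses stored files by URL, so that any generic HTTP client can operate on them. The host name and path layout are invented for illustration and are not TobysSRB's documented API.

```shell
#!/bin/sh
# Hypothetical sketch: in a REST design each stored file is a resource
# addressed by a URL, so generic HTTP clients can script the interface.
# The base URL and path layout are invented, not TobysSRB's actual API.

BASE='http://srb.example.org/tobyssrb'
COLLECTION='/home/user/results'
url="$BASE$COLLECTION/run1.xml"

# With such a layout, standard HTTP verbs cover the common operations, e.g.:
#   curl "$BASE$COLLECTION/"              # list a collection
#   curl -O "$url"                        # download a file
#   curl -T run1.xml "$BASE$COLLECTION/"  # upload a file

echo "$url"
```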
Again, there are also a number of third-party interfaces. Mostly these consist of language-specific wrappers around the C API.

Use of the SRB

The SRB's primary selling point is as a way to abstract access to files stored in multiple logical locations, through a single interface, which bears a resemblance to a hierarchical filesystem. In addition, it offers a number of additional features - limited user-editable metadata, and replication of files.

However, within our project, we have found that the only aspect of the SRB that we are interested in is the common view of a familiar filesystem-like interface. This manner of usage is encouraged by the analogies that can be drawn between the Scommands and native unix filesystem tools. Indeed, for both SRB versions 2 and 3, there are filesystem plugins which enable the use of the SRB transparently, allowing an SRB repository to be mounted as a native filesystem within Linux; a similar plugin for Windows Explorer enables an analogous interface for Windows.

Given that we use the SRB only as a network filesystem, many of the features offered by the existing interfaces are beyond our needs.

Deficiencies in current methods of SRB interaction

User interfaces

The Scommands are the SRB developers' recommended interface. However, navigating an SRB repository through the Scommands has many of the same strengths and drawbacks as navigating a normal filesystem from the command line.

In its disfavour is the fact that visualizing a directory structure, and navigating through it when you are not sure of the destination, can be clumsy from the command line; and for users unused to command-line interaction it is a daunting prospect. Indeed, it is for this reason that graphical filesystem browsers were invented in the 1970s. Furthermore, a network filesystem can never deliver the performance of a local filesystem, due to network latency. This is a familiar problem, from which all network filesystems (NFS, AFS) suffer. Thus Scommand-ing one's way through a repository is always slower and less efficient even than navigating through a filesystem using the analogous native unix tools.

A further deficiency is the necessity to install the Scommands on any system where access is required. It is impossible to connect to the SRB using this method from an arbitrary computer which knows nothing about the SRB. Furthermore, even when the Scommands are installed, each user must individually be set up correctly to use the SRB - thus it is not possible, for example, to lean over someone's shoulder while they are logged on and quickly retrieve a file from your own SRB collection.

In its favour is the fact that, since the interface is expressed and used through a unix shell, SRB interaction can very easily be incorporated in a script, whether written in shell script or in a higher-level language such as Perl or Python. However, again, network effects conspire to make repeated Sls's a great deal slower than lots of ls's. In addition, our experience has been that the SRB is not sufficiently robust to allow scripting of many SRB interactions (on the order of a few hundred in less than a minute). This is due to an architectural flaw in the SRB server, which will report success for a transaction even before the database has been updated, thus breaking the atomicity of SRB operations. Clearly this introduces an enormous number of potential race conditions. It is therefore impossible to set up the SRB infrastructure to allow high volumes of requests of the sort which are essentially trivial for a real filesystem; repeated failures of various sorts will be seen.

MySRB and InQ both try to solve some of the problems associated with the Scommands, in different ways.

InQ primarily tries to solve the problems associated with a textual interface. As a native Windows application, it must be installed locally, on a Windows computer; thus it does not solve the problems associated with needing to be at a particular computer. It does at least not require setting up for each user - it can be used to retrieve anyone's files from a single installation.

MySRB is a web-based interface to the SRB. It is a CGI script, implemented in C, which provides a stateful session of limited duration, and most SRB actions are possible through it. Since it is a web application, it is accessible from anywhere with a working web-browser and internet connection. However, it suffers from a number of deficiencies in its user interface (UI). Firstly, because almost all of the functionality of the SRB must be made available through MySRB, the interface suffers from complexity and overcrowding. Secondly, when using it as a simple method for retrieving files, there are two particular irritations which render it frustrating for the expert and confusing for the novice.
• It second-guesses the browser's ability to show files - if a request is made to view a file of a type MySRB is not aware of, or indeed one that MySRB thinks cannot be displayed by the browser, then MySRB will escape all HTML-sensitive characters and insert the resultant translated file into an HTML frame between <PRE> tags. This naturally renders it impossible for the browser to render the output sensibly when retrieving an SVG file.
• When downloading a file, MySRB communicates with the browser such that, every time, the browser's save dialogue tries to name the file MySRB.cgi.
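Scripting the Scommands in the face of this fragility therefore requires a layer that supplies its own error and timeout checking, which is the approach the authors take later when building TobysSRB. A minimal sketch of such a defensive wrapper in Python - the function name, the retry policy, and the use of /bin/echo as a stand-in for a real Scommand are our illustrative assumptions, not TobysSRB's actual code:

```python
import subprocess
import time

def run_scommand(argv, retries=3, timeout=30):
    """Run one Scommand defensively, retrying on failure or hang.

    Because the SRB server can report success prematurely and fail
    intermittently under load, a scripting layer needs its own
    error and timeout checking (hypothetical sketch).
    """
    for attempt in range(retries):
        try:
            result = subprocess.run(argv, capture_output=True,
                                    text=True, timeout=timeout)
            if result.returncode == 0:
                return result.stdout
        except subprocess.TimeoutExpired:
            pass                      # treat a hang like a failure
        time.sleep(2 ** attempt)      # back off before retrying
    raise RuntimeError("command failed after %d attempts: %r"
                       % (retries, argv))

# /bin/echo stands in here for a real Scommand such as Sls:
print(run_scommand(["/bin/echo", "file1.cml", "file2.xml"]))
```

A real wrapper would also parse the Scommand's output for the error strings the SRB emits, since the exit code alone is not always trustworthy.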
Developer interfaces

In terms of scriptability, the available developer APIs did not fit our needs either.

Jargon is in Java, which is of course only useful for Java applications. Very few of our tools are written in Java; and Java is not conducive to quick scripting of the sort that the Scommands can be used for.

Matrix is a web-services API, and as such is nominally platform-independent. However, in practice, it requires a large investment in the stack of SOAP/WS infrastructure on any computer which must communicate with it. Again, there is no sense in which it can be used in a script-like fashion.

Of the third-party interfaces, none suited our purposes.

Motivation

A further problem with all the available interfaces, with the obvious exceptions of MySRB and Matrix, is that they all require communication between client and server over a number of uncommon ports: primarily port 5544 for MCAT communication, and variable ports from 65000 upwards. This makes use of the SRB from behind firewalls tricky. Clearly MySRB and Matrix work entirely over ports 80 or 443, which are almost universally available.

So, in summary, the main problems we perceived with the available methods of interacting with the SRB were:
I. The UIs of the available graphical approaches were insufficiently user-friendly.
II. None of the interfaces available offered sufficiently robust scriptability.
III. Only the Scommands offered any scriptability at all, but required local installation and setup for each user, and could not be used from behind many firewalls.

To this end, we have built a new SRB frontend, which we call TobysSRB, ameliorating all of these issues.

Approaches to a solution

The primary motivation for the creation of TobysSRB was to solve problem I above - we needed a very simple UI for retrieval of files from the SRB; its initial intended audience was in fact undergraduate students, who had little to no knowledge of the software involved, and who had little to no control over the computers from which they need to access the SRB. The following requirements were thus essential.
• It should allow the very easy retrieval and viewing of files.
• It should not require any installation on client machines.
It was clear that a web-based solution would best fulfil these criteria, since web browsers are universally available, and a small amount of forethought combined with extensive user testing and feedback would ensure that the UI could be made sufficiently transparent that novice users would feel at ease, solving problem I above.

In addition, since we also perceived problems II and III above, we realised that, with proper design, a web-based solution could fix these also. Problem II can only sensibly be solved by wrapping the Scommands - a non-robust, but scriptable, interface - behind a layer which performs proper error and timeout checking. This layer can just as well be a web-based application as any other sort. Problem III is of course solved in the same way - by making the interface web-accessible, it is then accessible anywhere.

Overview of TobysSRB

TobysSRB is implemented as a CGI script, written in Python, and all its interactions with the SRB are performed by executing appropriate Scommands using Python's subprocess handling. For security, it is written such that it will only work over an HTTPS connection; the only information exposed to eavesdroppers is that a connection is being made; all data is encrypted. It requires no special configuration on the server other than that required for running CGI scripts in general; and of course the Scommands must be installed somewhere on the server.

In this section, we shall explain firstly its internal implementation, then the interfaces presented to the user, both through a web-browser, and through its web-accessible API.

Internal implementation

Configuration and authentication

Configuration of the Scommands for a user involves the creation of a directory, ~/.srb, and two files therein, .MdasEnv and .MdasAuth. Whenever the user wishes to interact with the SRB, they must first issue the command Sinit. This authenticates them against the server, and then creates a file within ~/.srb which keeps the SRB session's state (which consists only of the location of the current SRB collection). The session should be ended with an Sexit, which merely clears up this session file. (Of course, frequently one will pause in the middle of a session and forget to do an Sexit before logging out, with the result that the ~/.srb directory quickly fills up with stale session files.)

mdasCollectionHome '/home/tow.eminerals'
mdasDomainHome 'eminerals'
srbUser 'tow'
srbHost 'forth.dl.ac.uk'
srbPort '5544'
defaultResource 'CambsLake'
AUTH_SCHEME 'ENCRYPT1'

Figure 1: Example .MdasEnv File
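For illustration, the key/value format shown in figure 1 can be read with a few lines of Python. This parser and its use of the sample values are ours, written only to make the file format concrete; it is not part of the Scommands or of TobysSRB:

```python
# Each line of a .MdasEnv file is a keyword followed by a
# singly quoted value, as in figure 1.
SAMPLE = """\
mdasCollectionHome '/home/tow.eminerals'
mdasDomainHome 'eminerals'
srbUser 'tow'
srbHost 'forth.dl.ac.uk'
srbPort '5544'
defaultResource 'CambsLake'
AUTH_SCHEME 'ENCRYPT1'
"""

def parse_mdas_env(text):
    """Parse the keyword/quoted-value lines into a dict."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, value = line.split(None, 1)
        env[key] = value.strip("'")
    return env

env = parse_mdas_env(SAMPLE)
print(env["srbUser"], env["srbHost"], env["srbPort"])
# tow forth.dl.ac.uk 5544
```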
A typical example of a .MdasEnv file is shown in figure 1. (The .MdasAuth file contains nothing but a password.) Note, however, that much of its contents is either redundant, or irrelevant to the client. In almost all cases, the home directory can be constructed from the username - and in any case, the client ought not to need to specify their home directory when the server will also know it. The client ought not to care about the authentication mechanism, when the server could tell it. Since there is a default port, it should not be necessary to specify it unless using a non-default value.

Thus, in fact, the only information that the client should need to know is username, password, and the location of the MCAT server. In our case, since TobysSRB is wrapping all SRB details, the client need not even know this last. All the user need specify to a given TobysSRB instance are username and password; all additional information is held by TobysSRB in a configuration file.

So, in a given session with TobysSRB, the username and password are provided as CGI variables. TobysSRB then constructs a temporary directory, within which it constructs two files, corresponding to .MdasAuth and .MdasEnv. All necessary Scommands are then executed as follows:

MdasEnvFile=$TMPDIR/MdasEnvFile \
MdasAuthFile=$TMPDIR/MdasAuthFile \
Scommand

and at the end of a TobysSRB session, the files and the temporary directory are removed.

A brief note on password security: since all TobysSRB sessions occur over HTTPS, all information on passwords is secure from eavesdropping. However, it may be visible, as a CGI variable, in the URL bar of a web-browser. In order to obviate the possibility of password stealing by looking over shoulders, it is trivially obscured by firstly XORing the password character by character, and then encoding all URL-sensitive characters. This makes the password essentially immune to being picked up by a glance.

Session state

One of the main reasons why MySRB is difficult to script against is the fact that it is a stateful application, and cookie-handling is used to keep the session alive. This is illustrated in figure 2. This has the advantage, of course, that the user is not required to reauthenticate every time they perform some action - rather, the authentication data is held in a cookie. And since the authentication information required for MySRB is not merely username and password, but the full gamut of configuration information described in the previous section, it would be obnoxious to require typing it in every time.

Unfortunately, of course, it also means that any client must be prepared to handle cookies to interact with MySRB, which effectively restricts clients to browsers, and excludes simple scripts. It also makes the internal workings of MySRB significantly more complicated, since session-handling logic is required.

However, TobysSRB works in an entirely stateless fashion, as shown in figure 3. This effectively means that reauthentication occurs on every action. However, this can be made transparent to the user - once the user has authenticated, every page that is returned from TobysSRB has a username/password form, but the values are filled in by default, so the user need not worry about them.

For a session consisting of multiple commands, this stateless approach theoretically involves an increase in load on the server side, since now a temporary directory and files must be created and destroyed for every request. (For single requests there is obviously no difference.) However, in practice, we have found this increase entirely unmeasurable. And from the client's perspective, any marginal increase in time is insignificant compared with the time required for each interaction between TobysSRB and the underlying SRB servers.

In addition, MySRB need only initialize the SRB session once for each MySRB session, whereas TobysSRB in principle must reinitialize every time. However, since TobysSRB is stateless, and the only purpose of Sinit is to set the current directory, in fact we need not initialize our session at all. If all SRB locations are specified as full (not relative) paths, then the Scommands all work perfectly correctly without initialization; and this also obviates the need for us to worry about clearing up stale session files later on.

Furthermore, keeping the implementation entirely stateless means that the interface is trivially scriptable, since no cookie-handling need be performed by the client.

Interaction with the SRB and error handling

As described, interaction with the SRB is performed by Scommands executed from within TobysSRB. Some notes on implementation here are worthwhile.

Firstly, since the Scommands are being executed by the shell interpreter, it is vitally important that any input passed in from the user is checked before constructing the command line; otherwise malicious shell commands could easily be executed by the user. To this end, we check for safety all user input that will be passed to the command line, and escape any characters in SRB filenames that are also shell metacharacters.

This task is made more difficult by the fact that (prior to version 3.4.1 of the SRB) there is no definitive list of what characters are allowable in SRB filenames - indeed, different official SRB tools allow different sets of characters, and files created through one tool may not be retrievable through another (vide frequent discussions on SRB-Chat); and the list of problematic characters varies between SRB versions. Therefore, we simply disallow any characters that might cause problems on any version of the SRB. This does prevent certain filenames, which are otherwise legal in recent SRB versions, from being used, but we feel our conservative approach is entirely justified.
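This conservative policy - disallow anything that might upset any SRB version or carry meaning to the shell - can be sketched in a few lines of Python. The particular whitelist below is our guess for illustration; the paper does not list the exact character set TobysSRB permits:

```python
import string

# Allow only characters that are safe in every SRB version AND
# carry no meaning to the shell; reject everything else outright.
SAFE_CHARS = set(string.ascii_letters + string.digits + "._-+")

def srb_name_is_safe(name):
    """Hypothetical conservative check for user-supplied SRB filenames."""
    if not name or name.startswith("-"):
        return False        # could be mistaken for an Scommand option
    return all(c in SAFE_CHARS for c in name)

assert srb_name_is_safe("results_42.cml")
assert not srb_name_is_safe("evil;rm -rf ~")   # shell metacharacters
assert not srb_name_is_safe("file name")       # whitespace splits argv
```

Rejecting outright, rather than trying to escape every metacharacter of every shell, is exactly the trade-off the text describes: some legal SRB filenames become unusable, in exchange for safety on all SRB versions.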
[Figure 2: MySRB's stateful operation - the browser talks to the SRB server within a cookie-backed session, which must be cleared up at the end. Figure 3: TobysSRB's stateless operation - each HTTP request performs the necessary Scommands against the SRB server, clears up the temporary .Mdas files, and returns the HTTP response. Flow diagrams; only the labels are recoverable here.]
…retrieved and presented to the user. A screenshot is shown in figure 4.

Figure 4: Screenshot of TobysSRB (A: sub-collections; B: files; C: ancestor collections; D: permissions; E: external viewer)

As can be seen, collections (A) and individual data objects (B) are listed separately. The collection names are links which will return a page of the same format, displaying the contents of that collection. The filenames are all links which will directly return the files; thus following them will allow the browser to render them however it can, and using the normal browser mechanism (usually right-click and select "Save As...") will allow downloading the file. By constructing the URL appropriately, we have ensured that when saving the file, the browser knows the filename and will save it under that name.

In the top left of the screen there is a list (C) of all the parent collections, which allows quickly navigating back up the SRB. At the bottom of the screen are two forms; the first allows for quickly listing a given directory rather than having to navigate page-by-page through the hierarchy; the second allows uploading local files.

Two further things are worthy of note. Firstly, notice that each collection or filename is preceded by two links (D) labelled 'a+r' and 'a-r'. These change the permissions on the file or collection (recursively for collections), adding and removing, respectively, global read permissions. The SRB of course offers much greater granularity of permissions, but we have found that the only changes we generally make are to global read permissions, and a quick button click is distinctly easier than Schmod's arcane and poorly documented syntax. Trying to allow for the full range of allowable permission modifications would significantly complicate the interface.

Finally, as previously mentioned, all files suffixed .xml may have an additional link appended (E). This is because most of the XML files we produce within our project are CML files, and we have also developed an on-the-fly CML-to-XHTML convertor to allow easy viewing of such files [6].

Scriptable API

Part of the purpose in creating TobysSRB was to provide a stable and robust programmable interface to the SRB, accessible from computers where the Scommands are not installed. For this reason, all TobysSRB functions are accessible with a simple and well-documented API, described in this section. TobysSRB receives information from a client on two channels:
• the HTTP verb
• the CGI argument list.

HTTP verb

TobysSRB understands the following four HTTP verbs, and performs appropriate actions:
• GET - for retrieval of information, either of a directory listing or of file contents.
• PUT - a file will be uploaded.
• DELETE - a file or collection will be deleted.
• POST - one of a variety of other actions may be taken.
When TobysSRB is invoked from a web-browser, only GET and POST will be used; but PUT and DELETE can be used from scripts which issue HTTP requests.

CGI argument list

A range of CGI arguments are recognized, which TobysSRB will act upon. A full description of the API is beyond the scope of this paper, but by appropriate combinations of parameters, all of the SRB actions TobysSRB knows about may be performed. The API is clearly documented, and easily understood. Most importantly, files are easily accessible from a single, stable, easily generated URL, which looks like

https://my.server.ac.uk/TobysSRB.py/path/to/filename?username=user;password=pass;

By accessing that URL with GET, the object may be retrieved from anywhere; accessing it with PUT will place data into the file; accessing it with DELETE will remove the file; and - in concert with additional CGI parameters - accessing it with POST will perform all other available operations. If the URL ends in a '/' then it is assumed that the relevant SRB object is a collection, and so a listing of its contents will be returned; if not, then it is assumed to be a data object, whose contents will be retrieved and returned. If either assumption is wrong, a redirect will be issued, in the same way as would happen for an HTTP request.
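To make the shape of this API concrete, the stable URL and the verbs can be driven from Python's standard library in a few lines. This is a sketch only: the hostname is the paper's example, the percent-encoding via `quote` is our addition, and no request is actually sent:

```python
from urllib.parse import quote
from urllib.request import Request

BASE = "https://my.server.ac.uk/TobysSRB.py"

def srb_url(path, username, password):
    # Every SRB object is addressable through one stable, easily
    # generated URL, with credentials passed as CGI variables.
    return "%s/%s?username=%s;password=%s" % (
        BASE, quote(path), quote(username), quote(password))

url = srb_url("path/to/filename", "user", "pass")

get_file    = Request(url)                   # GET: retrieve object/listing
put_file    = Request(url, data=b"new data",
                      method="PUT")          # PUT: upload into the file
delete_file = Request(url, method="DELETE")  # DELETE: remove the file

print(get_file.get_method(), put_file.get_method(), delete_file.get_method())
# GET PUT DELETE
```

A POST carrying additional CGI parameters would cover the remaining operations, as described above; dispatching each request is then a single `urllib.request.urlopen` call.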
Philosophy

This API very much follows the REST (Representational State Transfer) philosophy [7], and may be seen as a lightweight alternative to wrapping the SRB with Web Services. Each SRB collection and file may be perceived, through TobysSRB, as an HTTP-accessible resource, upon which various operations (retrieval, modification, deletion, etc.) may be performed.

In this fashion, all of the SRB operations supported by TobysSRB may be used from a script capable of running on any internet-connected machine, without recourse to the Scommands. Thus the command

Sget /path/to/file_of_interest

may be replaced, when Scommands are unavailable, with

wget "https://my.server.ac.uk/TobysSRB.py/path/to/file?username=user;password=pass"

Other interactions are easily performed by generating more complex HTTP requests, some of which can be done with wget or curl (which are almost universally available on the Unix command line); and for those which cannot, generation of an HTTP request is less than 10 lines of easily abstractable Perl or Python, since modules exist for this in the standard libraries of both.

It should be noted that the SRB username and password must be known by the script in order to create the URLs. In simple cases, they could be stored inline in the script, but they could equally be dynamically read from a file, or generated by user input. This does not, however, make the method any less secure than other existing methods - the Scommands need to know these data as well, but simply store them in the .srb directory. By allowing them to be stored or generated elsewhere, we place more power in the hands of the user, since they do not need to create .srb directories on every machine; nor are the usernames and passwords so easily discoverable should a client account be compromised.

This API therefore provides an accessible way of automating SRB interactions. In addition, since TobysSRB acts as a buffering layer against the vagaries of the SRB, and returns well-documented error messages, it can act as a considerably more robust SRB interface for scripts, even where the Scommands are available. Since no additional client software is necessary, it can be used immediately on any internet-connected machine without preparation. And finally, the interaction occurs wholly over a single port - generally speaking port 443, which is universally available - so the resultant scripts are portable to any client machine without the need to worry about firewall issues.

Comparison with MySRB

Clearly, in some respects, TobysSRB is a direct replacement for MySRB, as a web-based interface to the SRB. A brief comparison follows.

MySRB supports a much wider range of SRB operations. It is the primary interface where new features in the C API are prototyped and exposed to the user. In comparison, TobysSRB supports only the operations:
• file retrieval
• file upload
• file deletion
• collection listing
• collection creation
• collection removal
• adding/removing global read permissions
However, in our experience, these compose by far the vast majority of operations performed, and are the only ones that we have found it useful to script - in any case, all other operations remain available through the Scommands.

Because TobysSRB supports a much reduced range of operations, its UI can be much simpler. This means that it could be designed in a fashion analogous to a typical graphical filesystem browser, which makes it easily accessible to any computer user familiar with that paradigm.

In addition, MySRB makes a distinction between viewing a file and retrieving it - when viewing it, MySRB second-guesses the browser's rendering capabilities, and massages the output in various ways before rendering it in a frame. This means that, for example, an SVG file may not be viewed through MySRB, because it is transformed into fully escaped HTML and presented to the browser as text. TobysSRB, on the other hand, presents the file straight to the browser, mime-typed accordingly, and allows the browser to render the file how it, or the user, chooses.

Further, since MySRB is a stateful application which uses cookies for session handling, it is very complicated to automate a MySRB session. TobysSRB, however, works entirely statelessly, and therefore scripting an upload or download is as simple as performing (through curl, or wget, or Perl or Python) an HTTP request to a URL.

Finally, although the source for MySRB is available - so in principle it could have been altered to fulfil our requirements, and would be extensible for the addition of external links - in practice this is not the case, since it consists of 18000 lines of code, which is somewhat opaque to the uninitiated. In contrast, TobysSRB consists of 500 lines of Python, and is much more easily altered.

User experiences

TobysSRB is now in use across the (multi-institutional) eMinerals project which employs the authors, both for project members and for the undergraduate students who work with the project. In addition, it has been disseminated to a number of other users within the institutions hosting eMinerals. To the best of the authors' knowledge, every user who has been exposed to TobysSRB prefers it to MySRB.
Some users, especially the undergraduate students at whom it was initially aimed, use TobysSRB as their only method of SRB interaction. Other more advanced users continue to use the Scommands for some, if not most, tasks; but when web access is required (for use from remote computers) or a browser interface preferred (for viewing XHTML or SVG files), then TobysSRB is unanimously preferred to MySRB.

The scriptable interface has not yet been as widely adopted, largely because there are a number of existing tools which already use the Scommands as their primary interface. However, several users do prefer the TobysSRB API, and a growing number of newer scripts are being written to work with that interface.

Summary

Finding the currently available methods for interaction with the SRB to be inadequate for our needs, a new front-end was developed. This interface, TobysSRB, pares down the facilities offered by MySRB and in so doing allows for the generation of a considerably more user-friendly interface. TobysSRB allows the viewing of any SRB file, and is easily extensible such that it can, for example, automatically create links to external services, such as the CML viewing tool described in [6]. The primary objective of TobysSRB was to provide users within the eMinerals project with a user-friendly interface that performs all the commonly required tasks with greater facility than those tools already available.

Furthermore, the requirement for intelligent error-detection and timeout handling when wrapping the Scommands has resulted in the creation of a RESTful HTTP API for interacting with the SRB, allowing SRB interaction in a robust, network-transparent and universally accessible fashion.

Both objectives have been achieved, and TobysSRB has become a transferable tool that can be used by any project using the SRB.

Acknowledgements

We are grateful for funding from NERC (grant reference numbers NER/T/S/2001/00855, NE/C515698/1 and NE/C515704/1).

References

[1] Moore, R.W. and Baru, C., "Virtualization services for data grids", in "Grid Computing: Making the Global Infrastructure a Reality" (ed. Berman, F., Hey, A.J.G. and Fox, G., John Wiley), Chapter 11 (2003); see also http://www.sdsc.edu/srb
[2] Doherty, M., et al., "SRB in Action", All Hands Meeting, Nottingham, 2003; Manandhar, A.S., et al., "Deploying a distributed data storage system in the UK National Grid Service using federated SRB", All Hands Meeting, Nottingham, 2004; Berrisford, P., et al., "SRB in a Production Context", All Hands Meeting, Nottingham, 2004.
[3] Bruin, R.P., et al., "Job submission to grid computing environments", All Hands Meeting, Nottingham (2006) - in press; Calleja, M., et al., "Grid Tool integration with the eMinerals project", All Hands Meeting, Nottingham (2005); Calleja, M., et al., "Collaborative grid infrastructure for molecular simulations: the eMinerals minigrid as a prototype integrated compute and data grid", Mol. Simul. 31, 303 (2005); Chapman, C., et al., "Managing Scientific Processes on the eMinerals mini-grid using BPEL", All Hands Meeting, Nottingham (2006) - in press.
[4] http://www.sdsc.edu/srb/index.php/Contributed_Software
[5] Fielding, R., et al., "Hypertext Transfer Protocol -- HTTP/1.1", RFC 2616, 1999.
[6] White, T.O.H., et al., "Application and Use of CML in the eMinerals project", All Hands Meeting, Nottingham, 2006.
[7] Fielding, R., "Architectural Styles and the Design of Network-based Software Architectures", PhD thesis, University of California, Irvine, 2000.
Abstract
Current grid computing [1, 2] technologies have often been seen as being too heavyweight
and unwieldy from a client perspective, requiring complicated installation and configuration
steps to be taken that are beyond the ability of most end users. This has led many of the
people who would benefit most from grid technology, namely application scientists, to avoid
using it. In response to this we have developed the Application Hosting Environment, a
lightweight, easily deployable environment designed to allow the scientist to quickly and easily
run unmodified applications on remote grid resources. We do this by building a layer of
middleware on top of existing technologies such as Globus, and expose the functionality as web
services using the WSRF::Lite toolkit to manage the running application’s state. The scientist
can start and manage the application he wants to use via these services, with the extra layer
of middleware abstracting the details of the particular underlying grid middleware in use. The
resulting system provides a great deal of flexibility, allowing clients to be developed for a range
of devices from PDAs to desktop machines, and command line clients which can be scripted
to produce complicated application workflows.
allows scientists to quickly and easily run unmodified, legacy applications on grid resources, managing the transfer of files to and from the grid resource and allowing the user to monitor the status of the application. The philosophy of the AHE is based on the fact that very often a group of researchers will all want to access the same application, but not all of them will possess the skill or inclination to install the application on a remote grid resource. In the AHE, an expert user installs the application and configures the AHE server, so that all participating users can share the same application. This draws a parallel with many different communities that use parallel applications on high performance compute resources, such as the UK Collaborative Computational Projects (CCPs) [8], where a group of expert users/developers develop a code, which they then share with the end user community. In the AHE model, once the expert user has configured the AHE to share their application, end users can use clients installed on their desktop workstations to launch and monitor the application across a variety of different computational resources.

The AHE focuses on applications not jobs, with the application instance being the central entity. We define an application as an entity that can be composed of multiple computational jobs; examples of applications are (a) a simulation that consists of two coupled models which may require two jobs to instantiate it and (b) a steerable simulation that requires both the simulation code itself and a steering web service to be instantiated. Currently the AHE has a one to one relationship between applications and jobs, but this restriction will be removed in a future release once we have more experience in applying these concepts to scenarios (a) and (b) detailed above.

III Design Considerations

The problems associated with ‘heavyweight’ middleware solutions described above have greatly influenced the design of the Application Hosting Environment. Specifically, they have led to the following constraints on the AHE design:

• the user’s machine does not have to have client software installed to talk directly to the middleware on the target grid resource. Instead the AHE client provides a uniform interface to multiple grid middlewares.

• the client machine is behind a firewall that uses network address translation (NAT) [27]. The client cannot therefore accept inbound network connections, and has to poll the AHE server to find the status of an application instance.

• the client machine needs to be able to upload input files to and download output files from a grid resource, but does not have GridFTP client software installed. An intermediate file staging area is therefore used to stage files between the client and the target grid resource.

• the client has no knowledge of the location of the application it wants to run on the target grid resource, and it maintains no information on specific environment variables that must be set to run the application. All information about an application and its environment is maintained on the AHE server.

• the client should not be affected by changes to a remote grid resource, such as if its underlying middleware changes from GT2 to GT4. Since GridSAM is used to provide an interface to the target grid resource, a change to the underlying middleware used on the resource doesn’t matter, as long as it is supported by GridSAM.

• the client doesn’t have to be installed on a single machine; the user can move between clients on different machines and access the applications that they have launched. The user can even use a combination of different clients, for example using a command line client to launch an application and a GUI client to monitor it. The client therefore must maintain no information about a running application’s state. All state information is maintained as a central service that is queried by the client.

These constraints have led to the design of a lightweight client for the AHE, which is simple to install and doesn’t require the user to install any extra libraries or software. The client is required to launch and monitor application instances, and stage files to and from a grid resource. The AHE server must provide an interface for the client to launch applications, and must store the state of application instances centrally. It should be noted that this design doesn’t remove the need for middleware solutions such as Globus on the target grid resource; indeed we provide an interface to run applications on several different underlying grid middlewares, so it is essential that grid resource providers maintain a
supported middleware installation on their machines. What the design does do is simplify the experience of the end user.

Communication in the AHE is secured using Transport Layer Security (TLS) [28]; our initial analysis showed that we did not need to use SOAP Message Level Security (MLS) as our SOAP messages would not need to pass through intermediate message processing steps. TLS is provided by mutually authenticated SSL between the client and the AHE, and between the AHE and GridSAM. This requires that the AHE server and associated GridSAM instances have X.509 certificates supplied by a trusted certificate authority (CA), as do any users connecting to the AHE. When using the AHE to access a computational grid, typically both user and server certificates will be supplied by the grid CA that the user is submitting to. Where proxy certificates are required, for example when using GridSAM to submit jobs via Globus, a MyProxy server is used to store proxy certificates uploaded by the user, which are retrieved by GridSAM in order to submit the job on the user’s behalf.

IV Architecture of the AHE

The AHE represents an application instance as a stateful WS-Resource [7], the properties of which include the application instance’s name, status, input and output files and the target grid resource that the application has been launched on. Details of how to launch the application are maintained on a central service, in order to reduce the complexity of the AHE client.

The design of the AHE has been greatly influenced by WEDS (WSRF-based Environment for Distributed Simulations) [9], a hosting environment designed for operation primarily within a single administrative domain. The AHE differs in that it is designed to operate across multiple administrative domains seamlessly, but can also be used to provide a uniform interface to applications deployed on both local HPC machines and remote grid resources.

The AHE is based on a number of pre-existing grid technologies, principally GridSAM [10] and WSRF::Lite [11]. WSRF::Lite is a Perl implementation of the OASIS Web Services Resource Framework specification. It is built using the Perl SOAP::Lite [12] web services toolkit, from which it derives its name. WSRF::Lite provides support for WS-Addressing [13], WS-ResourceProperties [14], WS-ResourceLifetime [15], WS-ServiceGroup [16] and WS-BaseFaults [17]. It also provides support for digitally signing SOAP [18] messages using X.509 digital certificates in accordance with the OASIS WS-Security [19] standard, as described in [20].

GridSAM provides a web services interface for submitting and monitoring computational jobs managed by a variety of Distributed Resource Managers (DRMs), including Globus [3], Condor [21] and Sun Grid Engine [22], and runs in an OMII [23] web services container. Jobs submitted to GridSAM are described using the Job Submission Description Language (JSDL) [24]. GridSAM uses this description to submit a job to a local resource, and has a plug-in architecture that allows adapters to be written for different types of resource manager. In contrast to WEDS, which represents jobs co-located on the hosting resource, the AHE can submit jobs to any resource manager for which a GridSAM plug-in exists.

Reflecting the flexible philosophy and nature of Perl, WSRF::Lite allows the developer to host WS-Resources in a variety of ways, for instance using the Apache web server or a standalone WSRF::Lite container. The AHE has been designed to run in the Apache [25] container, and has also been successfully deployed in a modified Tomcat [26] container.

Figure 1 shows the architecture and workflow of the AHE. Briefly, the core components of the AHE are: the App Server Registry, a registry of applications hosted in the AHE; and the App Server Factory, a “factory” according to the Factory pattern [29] used to produce a WS-Resource (the App WS-Resource) that acts as a representation of the instance of the executing application. The App Server Factory is itself a WSRF WS-Resource that supports the WS-ResourceProperties operations. The Application Registry is a registry of previously created App WS-Resources, which the user can query to find previously launched application instances. The File Staging Service is a WebDAV [30] file server which acts as an intermediate staging step for application input files from the user’s machine to the remote grid resource. We define the staging of files to the File Staging Service as “pass by value”, where the file is transferred from the user’s local machine to the File Staging Service. The AHE also supports “pass by reference”, where the client supplies a URI to a file required by the application. The MyProxy server is used to store proxy credentials required by GridSAM to submit to Globus job queues. As described above, we use GridSAM to provide a web services compliant front end to remote grid resources.
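The relationships between these components can be caricatured in a few lines. This is a hypothetical Python sketch of the Factory-pattern structure only (the real AHE services are Perl CGI scripts; the class names and the application name "myapp" are invented for illustration):

```python
import itertools

class AppWSResource:
    """Stands in for the stateful WS-Resource representing one application run."""
    def __init__(self, rid, app_name):
        self.rid = rid
        self.app_name = app_name
        # WS-ResourceProperties-style state held centrally, not on the client.
        self.properties = {"status": "PREPARED", "inputs": [], "outputs": []}

class AppServerFactory:
    """Produces App WS-Resources and records each one in the registry."""
    def __init__(self, app_name, registry):
        self.app_name = app_name
        self.registry = registry
        self._ids = itertools.count(1)

    def prepare(self):
        resource = AppWSResource(next(self._ids), self.app_name)
        self.registry.append(resource)   # Application Registry entry
        return resource

registry = []                            # previously created App WS-Resources
factory = AppServerFactory("myapp", registry)
run = factory.prepare()
# A user on any client can later query the registry to rediscover instances.
found = [r for r in registry if r.app_name == "myapp"]
```

The point of the pattern is that the factory, not the client, owns instance creation, so all state lands in the central registry from the moment a resource exists.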
All user interaction is via a client that communicates with the AHE using SOAP messages. The workflow of launching an application on a grid resource running the Globus middleware (shown in figure 1) is as follows: the user retrieves a list of App Server Factory URIs from the AHE (1). There is an application server for each application configured in the AHE. This step is optional, as the user may have already cached the URIs of the App Server Factories he wants to use. The user issues a “Prepare” message (2); this causes an App WS-Resource to be created (3) which represents this instance of the application’s execution. To start an application instance the user goes through the sequence: Prepare → Upload Input Files → Start, where Start actually causes the application to start executing. Next the user uploads the input files to the intermediate file staging service using the WebDAV protocol (4).

The user generates and uploads a proxy credential to the MyProxy server (5). The proxy credential is generated from the X.509 certificate issued by the user’s grid certificate authority. This step is optional, as the user may have previously uploaded a credential that is still valid. Once the user has uploaded all of the input files he sends the “Start” message to the App WS-Resource to start the application running (6). The Start message contains the locations of the files to be staged in to and out from the target grid resource, along with details of the user’s proxy credential and any arguments that the user wishes to pass to the application. The App WS-Resource maintains a registry of instantiated applications. Issuing a Prepare message causes a new entry to be added to the registry (7). A “Destroy” command sent to the App WS-Resource causes the corresponding entry to be removed from the registry.

The App WS-Resource creates a JSDL document for a specific application instance, using its configuration file to determine where the application is located on the resource. The JSDL is sent to the GridSAM instance acting as interface to the grid resource (8), and GridSAM handles authentication using the user’s proxy certificate. GridSAM retrieves the user’s proxy credential from the MyProxy server (9), which it uses to transfer any input files required to run the application from the intermediate File Staging Service to the grid resource (10), and to actually
submit the job to a Globus back-end.

The user can send Command messages to the App WS-Resource to monitor the application instance’s progress (11); for example, the user can send a “Monitor” message to check on the application’s status. The App WS-Resource queries the GridSAM instance on behalf of the user to update state information. The user can also send “Terminate” and “Destroy” messages to halt the application’s execution and destroy the App WS-Resource respectively. GridSAM submits the job to the target grid resource and the job completes. GridSAM then moves the output files back to the file staging locations that were specified in the JSDL document (12). Once the job is complete the user can retrieve any output files from the application from the File Staging Service to their local machine. The user can also query the Application Registry to find the end point references of jobs that have been previously prepared (14).

V AHE Deployment

As described above, the AHE is implemented as a client/server model. The client is designed to be easily deployed by an end user, without having to install any supporting software. The server is designed to be deployed and configured by an expert user, who installs and configures applications on behalf of other users.

Due to the reliance on WSRF::Lite, the AHE server is developed in Perl, and is hosted in a container such as Apache or Tomcat. The actual AHE services are an ensemble of Perl scripts that are deployed as CGI scripts in the hosting container. To install the AHE server, the expert user must download the AHE package and configure their container appropriately. The AHE server uses a PostgreSQL [31] database to store the state information of the App WS-Resources, which must also be configured by the expert user. We assume that a GridSAM instance has been configured for each resource that the AHE can submit to.

To host an application in the AHE, the expert user must first install and configure it on the target grid resource. The expert user then configures the location and settings of the application on the AHE server and creates a JSDL template document for the application and the resource. This can be done by cloning a pre-existing JSDL template. To complete the installation the expert user runs a script to repopulate the Application Server Registry; the AHE can be updated dynamically and doesn’t require restarting when a new application is added.

The AHE is designed to be interacted with by a variety of different clients. The clients we have developed are implemented in Java using the Apache Axis [32] web services toolkit. We have developed both GUI and command line clients from the same Java codebase. The GUI client uses a wizard to guide a user through the steps of starting their application instance. The wizard allows users to specify constraints for the application, such as the number of processors to use, choose a target grid resource to run their application on, stage all required input files to the grid resource, specify any extra arguments for the simulation, and set it running.

To install the AHE clients, all an end user need do is download and extract the client, load their X.509 certificate into a Java keystore using a provided script and set an environment variable to point to the location of the clients. The user also has to configure their client with the endpoints of the App Server Registry and Application Registry, and the URL of their file staging service, all supplied by their AHE server administrator.

The AHE client attempts to discover which files need to be staged to and from the resource by parsing the application’s configuration file. It features a plug-in architecture which allows new configuration file parsers to be developed for any application that is to be hosted in the AHE. The parser will also rewrite the user’s application configuration file, removing any relative paths, so that the application can be run on the target grid resource. If no plug-in is available for a certain application, then the user can specify input and output files manually.

Once an application instance has been prepared and submitted, the AHE GUI client allows the user to monitor the state of the application by polling its associated App WS-Resource. After the application has finished, the user can stage the application’s output files back to their local machine using the GUI client. The client also gives the user the ability to terminate an application while it is running on a grid resource, and to destroy an application instance, removing it from the AHE’s application registry. In addition to the GUI client, a set of command line clients are available which provide the same functionality as the GUI. The command line clients have the advantage that they can be called from a script to produce complex workflows with multiple application executions.
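The Prepare → Upload Input Files → Start sequence, followed by status polling, can be condensed into client pseudocode. The following is an illustrative Python sketch with stub functions standing in for the SOAP messages; the function names, URI and arguments are invented, not the Java client’s real API (note that the client polls because, behind NAT, it cannot accept callbacks):

```python
def prepare(factory_uri):
    # Stands in for the "Prepare" SOAP message; yields an App WS-Resource handle.
    return {"uri": factory_uri + "/instance/1", "staged": [], "state": "NEW"}

def upload(resource, filename):
    # Stands in for a WebDAV PUT to the intermediate File Staging Service.
    resource["staged"].append(filename)

def start(resource, args):
    # Stands in for the "Start" message carrying staging locations and arguments.
    resource["state"] = "RUNNING"
    resource["args"] = args

def monitor(resource):
    # Stands in for the "Monitor" message; the real client polls the central
    # service repeatedly until the state becomes terminal.
    resource["state"] = "DONE"
    return resource["state"]

app = prepare("https://ahe.example.org/AppServerFactory")  # step (2); URI invented
upload(app, "input.dat")                                   # step (4)
start(app, "-n 64")                                        # step (6)
status = monitor(app)                                      # step (11)
```

Because every call is client-initiated and all state lives on the server, the same sequence works unchanged from a PDA, a workstation, or a script.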
able) logic to orchestrate the distributed components implicates the end-user in ongoing maintenance of the client’s configuration (consider the difficulty of adding a new application or new resource, especially one operating a different middleware, to the set that the user can access).

• the client needs to be “attached” to the grid in order to launch and monitor jobs and retrieve results, which decreases client mobility.

The AHE approach alleviates these difficulties by moving as much of the complexity as possible into the service layer. The AHE decomposes the target audience into expert and end-users, where the expert user installs, configures and maintains the AHE server, and the end-users need simply to download the ready-to-go AHE client. The client itself becomes thinner, and with a reduced set of software dependencies is easier to install. All state persistence occurs at the service layer, which increases client mobility. Architecturally, the AHE is akin to a portal, but one where the client is not constrained to be a Web browser, increasing the flexibility of what the client can do, and permitting programmatic access, which allows power users to construct lightweight workflows through scripting languages.

VII Summary

By narrowing the focus of the AHE middleware to a small set of applications that a group of scientists will typically want to use, the task of launching and managing applications on a grid is greatly simplified. This translates to a smoother end user experience, removing many of the barriers that have previously deterred scientists from getting involved in grid computing. In a production environment we have found the AHE to be a useful way of providing a single interface to disparate grid resources, such as machines hosted on the NGS and TeraGrid.

By representing the execution of an application as a stateful web service, the AHE can easily be built on top of to form systems of arbitrary complexity, beyond its original design. For example, a BPEL engine could be developed to allow users to orchestrate the workflow of applications using the Business Process Execution Language. Employing a graphical BPEL workflow designer would ease the creation of workflows by users not comfortable with creating scripts to call the command line clients, which is something we hope to look at in the future.

In future we also hope to be able to use a GridSAM connector to the Unicore middleware to allow the AHE to submit jobs to the DEISA grid. By providing a uniform interface to these different back end middlewares, the AHE will provide a truly interoperable grid from the user’s perspective. We also plan to integrate support for the RealityGrid steering framework into the AHE, so that starting an application which is marked as steerable automatically starts all the necessary steering services, and also to extend the AHE to support multi-part applications such as coupled models. The end-user still deals with a single application, while the complexity of managing multiple constituent jobs is delegated to the service layer.

VIII Acknowledgements

The development of the AHE is funded by the projects “RealityGrid” (GR/R67699) and “Rapid Prototyping of Usable Grid Middleware” (GR/T27488/01), and also by OMII under the Managed Programme RAHWL project. The AHE can be obtained from the RealityGrid website: http://www.realitygrid.org/AHE.

References

[1] P. V. Coveney, editor. Scientific Grid Computing. Phil. Trans. R. Soc. A, 2005.

[2] I. Foster, C. Kesselman, and S. Tuecke. The anatomy of the grid: Enabling scalable virtual organizations. Intl J. Supercomputer Applications, 15:3–23, 2001.

[3] http://www.globus.org.

[4] http://www.unicore.org.

[5] J. Chin and P. V. Coveney. Towards tractable toolkits for the grid: a plea for lightweight, useable middleware. Technical report, UK e-Science Technical Report UKeS-2004-01, 2004. http://nesc.ac.uk/technical_papers/UKeS-2004-01.pdf.

[6] J. Kewley, R. Allen, R. Crouchley, D. Grose, T. van Ark, M. Hayes, and L. Morris. GROWL: A lightweight grid services toolkit and applications. 4th UK e-Science All Hands Meeting, 2005.

[7] S. Graham, A. Karmarkar, J. Mischkinsky, I. Robinson, and I. Sedukhin. Web Services Resource Framework. Technical report, OASIS Technical Report, 2006. http://docs.oasis-open.org/wsrf/wsrf-ws_resource-1.2-spec-os.pdf.
Reading e-Science Centre, Environmental Systems Science Centre, University of Reading, Harry
Pitt Building, University of Reading, Whiteknights, Reading RG6 6AL
* Corresponding author: jdb@mail.nerc-essc.ac.uk
Abstract
Grid systems have a reputation for being difficult to build and use. We describe how the
ease of use of the Styx Grid Services (SGS) software can be combined with the security and
trusted nature of the Secure Shell (SSH) to build simple Grid systems that are secure, robust
and easy to use and administer. We present a case study of how this method is applied to a
science project (GCEP), allowing the scientists to share resources securely and in a manner
that does not place unnecessary load on system administrators. Applications on the Grid can
be run exactly as if they were local programs from the point of view of the scientists. The
resources in the GCEP Grid are diverse, including dedicated clusters, Condor pools and data
archives but the system we present here provides a uniform interface to all these resources.
The same methods can be used to access Globus resources via GSISSH if required.
benefits of the Styx Grid Services system for enhancing the ease of use of Grids.

2 How Styx Grid Services make Grids easy to use

The Styx Grid Services (SGS) system is a means for running remote applications exactly as if they were installed locally. The software is pure Java, is very small in size (under 3 MB) and is easy to install and configure. The technical details of the SGS system have been discussed in previous papers [3, 4] and will not be repeated here. To illustrate how the SGS system is installed and used, we shall work through a concrete example.

Let us take the example of a simple visualization program called makepic that reads an input file and creates a visualization of the results as a PNG file. The names of the input and output files are specified on the command line, for example:

makepic -i input.dat -o pic.png

This program can be deployed on a server as a Styx Grid Service as follows. The makepic program is installed on the server. A simple XML configuration file is created on the server. This file describes the program in terms of its inputs, outputs and command-line arguments (see figure 1). The SGS server daemon is then started.

Clients can now run the makepic Styx Grid Service from remote locations, exactly as if the makepic program were deployed on their local machines. They do this using the SGSRun program, which is a generic client program for running any Styx Grid Service:

SGSRun <hostname> <port> \
makepic -i input.dat -o pic.png

where <hostname> and <port> are the host name (or IP address) and port of the SGS server respectively. The SGSRun program connects to the SGS server and downloads the XML description of the makepic program (figure 1). SGSRun uses this configuration information to parse the command line arguments that the user has provided. It then knows that input.dat is an input file and uploads it automatically from the user’s machine to the SGS server before the program is started. Having started the program, SGSRun knows that makepic will produce an output file called pic.png, which it downloads to the user’s machine.

It is a very easy task to create a simple shell script (or batch file in Windows) called makepic that wraps the SGSRun program and contains the host name and port of the SGS server, so the user can simply run:

makepic -i input.dat -o pic.png

exactly as before. Many scientific codes have corresponding graphical wrapper programs that use the command-line executable as an “engine”, display results graphically and perhaps allow the user to interact with the running program. The makepic script (which calls the remote Styx Grid Service) behaves identically to the original executable and so could be called by such a graphical wrapper program, thereby “Grid-enabling” the graphical program without changing it at all. More details on this process can be found on the project website [2].

By allowing the user to execute programs on a Grid exactly as if they were local programs, the Styx Grid Services software provides a very natural and familiar environment for users to work in. Users do not have to know the details of where or how the remote program runs, nor do they need to manually upload input files to the correct location: once the program is deployed as a Styx Grid Service, all this is handled automatically.

2.1 Styx Grid Services and workflows

Given that, in the SGS system, remote services can be executed exactly as if they were local programs, simple shell scripts can be used to combine several remote services to create a distributed application or “workflow”. This is described in detail in [4]. No special workflow tools are required on either the client or server side to do this. Data can be transferred directly between applications along the shortest network route, saving time and bandwidth (i.e. intermediate data do not have to pass through the client). It should be noted that this method does not provide all the features that might be required of a general workflow system, but it will be sufficient for many users.

2.2 Running jobs on Condor and Sun Grid Engine via SGS

In the above example the makepic program runs on the SGS server itself. By using a distributed resource management (DRM) system such
Figure 1: Portion of the configuration file on a Styx Grid Services server, describing the makepic
program that is deployed. This specifies that the program expects one input file, whose name is
given by the command-line argument following the “-i” flag. The program outputs one file, whose
name is given by the command-line argument following the “-o” flag, and also outputs data on its
standard output stream.
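The behaviour SGSRun derives from such a description can be illustrated with a toy parser. The XML element and attribute names below are invented for illustration only (the real SGS configuration schema shown in figure 1 differs); the sketch shows how a flag-to-role mapping lets a generic client decide which arguments name files to upload and which name files to download:

```python
import xml.etree.ElementTree as ET

# Invented stand-in for the downloaded service description; the real
# figure 1 schema is not reproduced here.
DESCRIPTION = """
<gridservice name="makepic">
  <input  flag="-i"/>
  <output flag="-o"/>
</gridservice>
"""

def classify_args(xml_text, argv):
    """Use the service description to spot input/output files in argv."""
    root = ET.fromstring(xml_text)
    flags = {e.get("flag"): e.tag for e in root}   # "-i" -> "input", ...
    files = {"input": [], "output": []}
    for flag, value in zip(argv, argv[1:]):
        if flag in flags:
            files[flags[flag]].append(value)
    return files

files = classify_args(DESCRIPTION, ["-i", "input.dat", "-o", "pic.png"])
# input.dat would be uploaded before the run; pic.png downloaded after it.
```

This is what makes SGSRun generic: the client itself knows nothing about makepic, so any command-line program can be hosted by writing only the description.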
as Condor [10] or Sun Grid Engine (SGE, [8]), the program can be run on a different machine, balancing the load between a set of machines in a Condor pool or cluster. With a very small change to the SGS configuration file, the SGS system can run programs through Condor or SGE in a manner that is completely transparent to the user: the client interface is identical in all cases. Plugins to allow the SGS system to run jobs on other DRMs such as PBS (Portable Batch System) can be created if required.

This is particularly useful when the user wishes to execute the same program many times over a number of input files. This is known as “high-throughput computing” and is commonly used in Monte Carlo simulations and parameter sweep studies. In the above example, the user might wish to execute the makepic program over a large number of input files, creating a visualization of each one. Normally this would require the user to upload the input files, name them in some structured fashion and create an appropriate job description file in a format that is understood by Condor, SGE or the DRM system in question.

In the Styx Grid Services system, the running of these high-throughput jobs is very simple. Let us imagine that the user has a set of input files for the makepic program in a directory called inputs on his or her local machine. The user simply runs the makepic Styx Grid Service as before but, instead of specifying a single input file on the command line, the user enters the name of the inputs directory:

makepic -i inputs -o pic.png

where makepic is the script that wraps the SGSRun executable as described above.

The input files are automatically uploaded to the SGS server as before. The server notices the presence of an input directory where it was expecting a single file. It takes this as a signal to run the makepic program over each file in the input directory, producing a picture for each file. The client then downloads these pictures automatically and places them in a directory called pic.png on the user’s local machine. The SGS server uses the underlying DRM system (e.g. Condor or SGE) to run these tasks in parallel on the worker nodes. The progress of the whole job is displayed to the client as the individual tasks are executed.

2.3 Disadvantages of the SGS system

The design of the SGS system has focused on ease of use, deployment and maintenance. There are a number of disadvantages of the system:

• It only works with command-line programs (this is true for most Grid systems).

• The current SGS system will not work well in cases where the output files that are produced by a program are not predictable, or where the program spawns other programs during execution. Work is underway to address this limitation.
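The directory convention can be sketched server-side as: if the staged input is a directory, fan out one task per file. This is an illustrative Python analogue of that dispatch logic, not the Java implementation (the `run_task` stub stands in for one makepic execution handed to the DRM):

```python
import os
import tempfile

def run_task(path):
    # Stand-in for one makepic execution dispatched to Condor/SGE;
    # here it just names the picture the task would produce.
    return os.path.basename(path) + ".png"

def run_job(staged_input):
    """Fan out over a directory of inputs, or run a single task."""
    if os.path.isdir(staged_input):
        return [run_task(os.path.join(staged_input, f))
                for f in sorted(os.listdir(staged_input))]
    return [run_task(staged_input)]

# Simulate a staged "inputs" directory holding two input files.
with tempfile.TemporaryDirectory() as inputs:
    for name in ("a.dat", "b.dat"):
        open(os.path.join(inputs, name), "w").close()
    pictures = run_job(inputs)
```

In the real system the per-file tasks run in parallel under the DRM and the resulting pictures are gathered into the output directory downloaded by the client.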
• The SGS system has its own security • The SGS system saves the user from hav-
model. The server administrator must ing to manually create job submission files
maintain an SGS user database in addi- for DRM systems such as Condor and SGE
tion to that of the underlying system. (section 2.2 above).
• On the server, each application runs with the permissions of the owner of the SGS server itself, rather than with the permissions of the specific user.

• Some resource providers might not trust the SGS software without considerable further testing and wider acceptance within the community.

3 SGS-SSH

The last three issues in the above list can be addressed by combining the SGS system with the Secure Shell (SSH). Essentially, the user uses the SGS system exactly as before (and benefits from its ease of use) but mutually authenticates with the server using SSH. All SGS traffic is transmitted over the secure channel that is created by the SSH connection. This combined system is known as SGS-SSH.

This is achieved through a relatively small modification to the SGS software. The original SGS server reads and writes messages through network sockets. The SGS-SSH “server” is actually a program that reads messages on its standard input and writes replies to its standard output. The SGS client is modified to connect to the remote server via SSH and execute the SGS-SSH “server” program via an exec request. The client then exchanges messages with the SGS-SSH server program through the secure channel.

Therefore, the only server that needs to be maintained by the system administrator is the SSH server itself. The user authenticates via the normal SSH procedures and the applications that the user runs through the SGS-SSH interface run with the permissions of the specific user in question, not a generic user. There is no separate user database.

One might ask what the value is of using the Styx Grid Services system at all - why not simply use SFTP and SSH? There are a number of advantages of the SGS system in this environment:

• The SGS system automatically takes care of uploading the input files and downloading the output files to and from the Grid resource.

• The SGS system automatically monitors the progress of jobs and displays this to the user.

• By running the application through an intermediary program (i.e. the SGS server program), there is an opportunity to automatically harvest metadata about each run. One could keep a record of each invocation of a particular program on the Grid and automatically store metadata about each run (this metadata could be provided by the user from the SGSRun program).

4 Case Study: the GCEP project

GCEP (Grid for Coupled Ensemble Prediction) is a NERC e-Science project that involves the Reading e-Science Centre (ReSC), the NERC Centre for Global Atmospheric Modelling (CGAM), the British Antarctic Survey (BAS) and the Council for the Central Laboratory of the Research Councils (CCLRC). GCEP aims to test the extent to which the Earth’s climate can be predicted on seasonal to decadal timescales. This will be achieved by running a large number of computer-based climate simulations, with each simulation being started from a different set of initial conditions of the oceans and the distributions of ice, soil moisture and snow cover. If any aspect of climate can be predicted on these timescales, it is these slowly-varying properties that will contain the necessary information.

The climate simulation program that will be run is the Hadley Centre’s full HadCM3 model [7], which has recently been ported so that it can run on PC clusters. Large numbers (ensembles) of simulations will need to be run in order to gather the necessary statistics. The process of analysing the results of the simulations will also be compute- and data-intensive. The compute resources available to the project include clusters at ReSC, BAS and CCLRC, a Condor pool at the University of Reading and the UK National Grid Service (NGS). The project will generate several terabytes of data and so an efficient data management system will be required to allow results to be shared amongst the project scientists.
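The exec-based transport described in section 3 can be sketched in a few lines of Python. This is an illustrative stand-in, not the actual SGS code: the stub “server” here is any program that reads a request on standard input and writes a reply on standard output; SGS-SSH runs its real server program the same way, but launches it with an SSH exec request so that the exchange travels over the secure channel and runs with the user’s own permissions.

```python
# Illustrative sketch (not the real SGS protocol): a "server" that reads one
# request on stdin and writes one reply on stdout. SGS-SSH launches such a
# program via an SSH exec request (conceptually `ssh host sgs-server`), so
# the same stdin/stdout exchange is tunnelled over the secure channel.
import subprocess
import sys

def exchange(server_cmd, request: str) -> str:
    """Send one request to a stdin/stdout 'server' and return its reply."""
    proc = subprocess.run(server_cmd, input=request,
                          capture_output=True, text=True, check=True)
    return proc.stdout

# Stand-in server: echoes the request back with an acknowledgement.
stub_server = [sys.executable, "-c",
               "import sys; print('ACK ' + sys.stdin.read().strip())"]
print(exchange(stub_server, "start-job mymodel"))  # prints: ACK start-job mymodel
```

Replacing the local subprocess call with an SSH exec request (for example via an SSH client library) yields the SGS-SSH arrangement, in which no dedicated network server is needed beyond SSH itself.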
Proceedings of the UK e-Science All Hands Meeting 2006, © NeSC 2006, ISBN 0-9553988-0-0
4.1 GCEP Grid architecture

The GCEP partners wish to focus on the science challenges and do not wish to devote large amounts of time to the installation and maintenance of complex Grid middleware. We propose to use SGS-SSH to build the Grid. In this way, the administrators of the GCEP clusters (at ReSC, BAS and CCLRC) only need to maintain an SSH server and set up a local user account for each of the GCEP users. Figure 2 gives an overview of the proposed GCEP architecture.

There are two types of job that will be run in GCEP. The first type is the running of the HadCM3 model itself to produce simulations of the evolution of the Earth’s climate arising from various initial conditions. HadCM3 is a complex model to configure and it only runs efficiently on certain clusters (e.g. it requires a large amount of memory per node). Therefore, we expect that scientists will run the HadCM3 model “manually”, by logging into the cluster in question (via SSH) and carefully configuring the program before running it. This is an expert task that is not easily automated.

The second job type that will be run in GCEP is the analysis of the output from the HadCM3 model runs. These data analysis jobs are typically much simpler to set up and can run on a wide variety of machines. It is these data analysis jobs that are most suitable for automation using the Styx Grid Services system. We propose that, when a scientist creates a useful data analysis tool (i.e. a program), he or she arranges for it to be installed on each of the GCEP resources as a Styx Grid Service. All other scientists can then use this tool exactly as if they had it installed locally. The SGS system would allow the tools to be run on a variety of resources (SGE clusters, Condor pools, individual machines) in a manner that is completely transparent to the scientists.

Although this does not use “traditional” Grid software, this is true Grid computing: it is distributed computing, performed transparently across multiple administrative domains.

4.2 Using the National Grid Service

When the GCEP scientists wish to use Globus resources such as the UK National Grid Service they will not be able to use SSH to access these resources directly. There are two approaches to this:

1. Use GSISSH as a direct replacement for SSH: this uses the user’s Grid certificate to authenticate with the server. This requires the user to obtain and care for a certificate and create proxy certificates whenever he or she needs to run a job.

2. Create a “gateway” machine, which the user accesses through SSH as before. This gateway machine accesses the Globus resource on the user’s behalf, creating the necessary proxy certificate automatically. The user’s certificate could be stored on this gateway machine (if it is sufficiently secure) or the certificate could be stored in a MyProxy server and retrieved automatically when the user runs a job.

At the time of writing it is not clear which approach will be more appropriate for the project and further investigation will be required.

4.3 Automatic resource selection

A greater degree of transparency could be obtained by providing a system that automatically finds the best resource in the GCEP Grid on which a given job should run. This might involve automatic selection of the “fastest” or “least busy” machine in the Grid to ensure that the scientists receive their results as quickly as possible. Although an attractive idea in theory, this is very difficult to achieve in practice. We envisage that the GCEP users will, initially at least, have to perform their own resource selection. It may be possible to create a basic resource selector that queries the queue management system on each of the GCEP resources and makes a decision; however, this is unlikely to be optimal as it is extremely difficult to quantify the “load” on any given machine or to estimate the true length of a queue.

4.4 Data management

The GCEP project will produce several terabytes of data, which will be stored at the various partner institutions and which must be shared amongst the project scientists at all the institutions. One option for handling this is to use the Storage Resource Broker (SRB). This would create a single virtual data store that combines all the data stores at the partner locations. Users would not have to know where each individual data file is stored: this is handled by the SRB system. Additionally, the SRB system can store metadata about each file in the
Figure 2: Overview of the proposed architecture for the GCEP project, omitting the CCLRC cluster for clarity. The dedicated GCEP resources simply run SSH servers, as does the Reading Condor pool. The National Grid Service resources are accessed through GSISSH. The resources themselves run a variety of Distributed Resource Management systems including Condor and Sun Grid Engine. The Styx Grid Services software is run over SSH and GSISSH to allow users to run Grid applications on these various resources just as easily as they would run local applications. Files are shared amongst the resources using SSH and sshfs; the Storage Resource Broker is another option for this (section 4.4).
system, making it easier for scientists to find files in the entire virtual store.

An alternative approach, which would sacrifice some functionality but require less installation and configuration, would be simply to share the files via SSH. Thanks to the widespread nature of SSH there are many tools for providing easy access to filesystems on an SSH server. On Linux clients, remote filesystems that are exposed through SSH can be mounted on the client’s machine through sshfs [9]. This has some significant advantages over SRB. With sshfs, users would not have to copy entire data files to the location at which a particular data analysis program needs to run: the machine running the program can mount the remote disk via sshfs and operate upon the data file as if it were on a local disk. This means that the data analysis program can extract just those portions of the data file that it needs, saving considerable file transfer time. Sshfs is only available to Linux users (of which there are a large proportion in GCEP), although an equivalent low-cost commercial product (www.sftpdrive.com) is available for Windows clients. In the absence of sshfs, users can use SFTP (secure FTP) to move files from place to place. An important limitation of an SSH-based distributed file system is that it lacks a dedicated metadata store.

5 Discussion

We have described how the ease of use of the Styx Grid Services system can be combined with the Secure Shell (SSH) to create Grid systems that are easy to build, use and administer. No complex middleware is required and the barrier to uptake by users is very low: most of the tools involved are already familiar to a large proportion of users. We have discussed how SGS-SSH can be applied to a multi-partner science project (GCEP), allowing compute and data resources to be shared transparently and securely with the minimum of administrative effort. This will provide a reliable and easy-to-use Grid system to the GCEP scientists, allowing them to focus on the considerable science challenges rather than having to spend time learning new tools.

SGS-SSH does not provide every possible feature that might be desirable in a full-scale production Grid. However, we argue that it pro-
References

[1] Beckles, B., V. Welch, and J. Basney, 2005: Mechanisms for increasing the usability of grid security. International Journal of Human-Computer Studies, 63, 74–101.

[2] Blower, J., 2006: Styx Grid Services. Online, http://www.resc.rdg.ac.uk/jstyx/sgs.

[3] Blower, J., K. Haines, and E. Llewellin, 2005: Data streaming, workflow and firewall-friendly Grid Services with Styx. Proceedings of the UK e-Science All Hands Meeting, S. Cox and D. Walker, eds., ISBN 1-904425-53-4.

[4] Blower, J., A. Harrison, and K. Haines, 2006: Styx Grid Services: Lightweight, easy-to-use middleware for scientific workflows. Lecture Notes in Computer Science, 3993, 996–1003.

[5] Chin, J. and P. V. Coveney, 2004: Towards tractable toolkits for the Grid: a plea for lightweight, usable middleware. UK e-Science Technical Report UKeS-2004-01, http://www.nesc.ac.uk/technical_papers/UKeS-2004-01.pdf.

[6] Foster, I., 2005: Globus Toolkit version 4: Software for service-oriented systems. IFIP International Conference on Network and Parallel Computing, Springer-Verlag, volume 3779 of LNCS, 2–13.

[7] Gordon, C., C. Cooper, C. A. Senior, H. Banks, J. M. Gregory, T. C. Johns, J. F. Mitchell, and R. A. Wood, 2000: The simulation of SST, sea ice extents and ocean heat transports in a version of the Hadley Centre coupled model without flux adjustments. Climate Dynamics, 16, 147–168.
Abstract

In standard analysis of simulations of the heart, usually only one state variable – the transmembrane voltage (or action potential) – is visualized. While this is the ‘most important’ variable to visualize, all but the most basic cardiac models have many state variables at each node; data that are not used when visualizing the output. In this paper, we present a novel visualization technique developed within the Integrative Biology project that uses the entire state of the cardiac virtual tissue to produce images based on the deviation from normal propagation of action potential.
may be left out of the visualization, and

2. A collection of state variables can be collapsed into a single ‘meta-variable’ which represents the value of all its constituent variables. For example, the parameters m, h, and j in Figure 3 all depend on membrane voltage and time, and determine the magnitude of current through the Na+ channel in the cell membrane, which is needed for an action potential to propagate.

The first point can be seen if the correlation coefficient is calculated between each pair of variables in the four-variable Fenton Karma model at each point in time. The correlation used every pixel in a pairwise fashion between each pair of state variables, and can be seen in Figure 4. Note that U and D are almost perfectly correlated across the whole simulation, which suggests there would be little extra information gained through including both of these variables in a visualization.

cycle of excitation and recovery. This structure becomes apparent, at least with simpler models, when the data is visualized in phase space. In the case of a tri-variate model, if the standard visualization at time t is to generate three n by m pixel images Ut(x, y), Vt(x, y), and Wt(x, y), for 0 < x < n and 0 < y < m, the phase space visualization is a single 3D scatter-plot image obtained by plotting n × m ‘dots’, one at each (Ut(x, y), Vt(x, y), Wt(x, y)). Figure 5 shows examples of phase space plots from a Fenton Karma three-variable model exhibiting normal propagation.
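The pairwise correlation just described can be sketched with NumPy (our illustration, not the authors’ code): each state-variable field is flattened to a vector of pixel values, and the Pearson correlation coefficient is computed for every pair of variables.

```python
# Illustrative sketch: pairwise Pearson correlation between state-variable
# fields at one time step. Each n-by-m field is flattened to a vector of
# pixel values; np.corrcoef then gives the coefficient for every pair.
import numpy as np

def pairwise_correlation(fields: dict) -> dict:
    names = list(fields)
    data = np.stack([fields[k].ravel() for k in names])  # one row per variable
    corr = np.corrcoef(data)
    return {(a, b): corr[i, j]
            for i, a in enumerate(names)
            for j, b in enumerate(names) if i < j}

# Two perfectly correlated fields (as U and D are in Figure 4) and one that
# is unrelated:
u = np.linspace(0.0, 1.0, 100).reshape(10, 10)
d = 2.0 * u + 3.0                           # linear in u: correlation 1
v = np.random.default_rng(0).random((10, 10))
corrs = pairwise_correlation({"U": u, "D": d, "V": v})
# corrs[("U", "D")] is 1.0 to within floating-point error
```

A pair whose coefficient stays near 1 across the whole simulation, like U and D here, is a candidate for dropping one member from the visualization.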
space. While this ‘snake’ is entirely expected – it occurs as cells depolarise then recover – it nevertheless provides an interesting view on wavefront propagation, and should be present even in higher-variate models.

Even if the dimensionality reduction techniques from section 2 reduced the phase space by a factor of three, there would still be too many variables to visualize directly. One approach is to create multiple phase space plots, which has the advantage that all the data are shown, but the disadvantage of still having to visually combine many data sources (see Figure 8, for example). The phase space plots also have the disadvantage of removing spatial information from the visualization.

One way forward is to form visualizations based on the density and/or location of the points in the phase space, through ‘hyper-histograms’ [5], but we found this to be a less promising approach than using the propagation-model described in this paper.
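Building the phase-space point cloud that underlies these plots is a simple reshaping: each field contributes one coordinate, so every grid point becomes a dot in 3D. A minimal NumPy sketch (our illustration, with randomly generated fields standing in for model output):

```python
# Minimal sketch (our illustration): the phase-space point cloud for a
# tri-variate model. Each of the n-by-m fields U, V, W contributes one
# coordinate, so every grid point becomes a single dot in 3D phase space.
import numpy as np

def phase_space_points(u, v, w):
    """Return an (n*m, 3) array of phase-space coordinates."""
    return np.stack([u.ravel(), v.ravel(), w.ravel()], axis=1)

n, m = 64, 48
rng = np.random.default_rng(1)
u, v, w = (rng.random((n, m)) for _ in range(3))
pts = phase_space_points(u, v, w)
# pts has shape (n*m, 3) and could be fed to any 3D scatter-plot routine
```

The spatial (x, y) indices are discarded in this construction, which is exactly the loss of spatial information noted above.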
Figure 7: The model of normal propagation for a Fenton Karma three variable simulation. (a) The full model, and (b) the decimated model capturing the essence of figure (a).

Each data point of CVT is assigned a value based on its Euclidean distance from the nearest point of the propagation model. The resulting scalar values, in the range [0, √n] for an n-dimensional model, can be used, at least in this 2-D case, to generate false colour images using a standard colourmap.
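This deviation measure can be sketched as follows; an illustrative reconstruction in NumPy, with a toy two-point “model” standing in for the decimated model of normal propagation:

```python
# Illustrative sketch: deviation from the model of normal propagation.
# The model is a (decimated) set of points in n-dimensional phase space;
# each tissue data point is assigned the Euclidean distance to its nearest
# model point, giving a scalar in [0, sqrt(n)] for state variables
# normalised to [0, 1].
import numpy as np

def deviation(data_pts, model_pts):
    """data_pts: (p, n) tissue points; model_pts: (q, n) model points."""
    # (p, q) matrix of pairwise distances, then minimum over model points
    diff = data_pts[:, None, :] - model_pts[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)

model = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])   # toy 2-point model
tissue = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
print(deviation(tissue, model))  # [0. 1.]
```

The resulting scalar field can then be mapped through a standard colourmap to produce the false colour images described above.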
images alone.

Figure 9: The visualization of the deviation from the model of normal propagation for a normal propagation in the Fenton Karma three variable simulation. (a) The action potential output, and (b) the deviation from the model.
technique appropriate? There may also be scope in normalising the deviation to be in the range [0, 1] rather than [0, √n] so that the results from different models can be directly compared.

This visualization technique can be extended to three-dimensional simulations fairly easily. The distance metric is dimension independent, as it is based in phase space. The significant distances occur at the wave-front of action potential, so an interesting visualization might be to form an iso-surface of action potential, and colour it according to the deviation from the model.

Acknowledgements

The authors wish to acknowledge the support provided by the funders of the UK e-Science Integrative Biology Project: the EPSRC (ref no: GR/S72023/01) and IBM.

References

[1] R.H. Clayton. Computational models of normal and abnormal action potential propagation in cardiac tissue: Linking experimental and clinical cardiology. Phys. Meas., 22:R15–R34, 2001.

[2] R.H. Clayton and A.V. Holden. Filament behaviour in a computational model of ventricular fibrillation in the canine heart. IEEE Trans. Biomed. Eng., 51:28–34, 2004.

[3] R.H. Clayton and A.V. Holden. Propagation of normal beats and re-entry in a computational model of ventricular cardiac tissue with regional differences in action potential shape and duration. Progress in Biophysics and Molecular Biology, 85:473–499, 2004.

[5] J.W. Handley, K.W. Brodlie, and R.H. Clayton. Multi-variate visualization of cardiac virtual tissue. In Proc. 19th IEEE International Symposium on Computer Based Medical Systems, 665–670, IEEE Computer Society, 2006.

[6] D. Noble and Y. Rudy. Models of cardiac ventricular action potentials: Iterative interactions between experiment and simulation. Philos. Trans. R. Soc. Lond. Ser. A-Math. Phys. Eng. Sci., 359:1127–1142, 2001.

[7] A.V. Panfilov and A.M. Pertsov. Ventricular fibrillation: Evolution of the multiple-wavelet hypothesis. Philos. Trans. R. Soc. Lond. Ser. A-Math. Phys. Eng. Sci., 359:1315–1325, 2001.

[8] Blanca Rodríguez, Brock M. Tice, James C. Eason, Felipe Aguel, José M. Ferrero, Jr., and Natalia Trayanova. Effect of acute global ischemia on the upper limit of vulnerability: a simulation study. Am. J. Physiol. Heart Circ. Physiol., 286(6):H2078–H2088, 2004.

[9] P. Wong and R. Bergeron. 30 years of multidimensional multivariate visualization. In Gregory M. Nielson, Hans Hagen, and Heinrich Müller, editors, Scientific Visualization – Overviews, Methodologies and Techniques, pages 3–33, Los Alamitos, CA, 1997. IEEE Computer Society Press.

[10] Fagen Xie, Zhilin Qu, Junzhong Yang, Ali Baher, James N. Weiss, and Alan Garfinkel. A simulation study of the effects of cardiac anatomy in ventricular fibrillation. Journal of Clinical Investigation, 113:686–693, 2004.
Abstract

This paper presents a new service-oriented approach to the design and implementation of visualization systems in a Grid computing environment. The approach evolves the traditional dataflow visualization system, based on processes communicating via shared memory or sockets, into an environment in which visualization Web services can be linked in a pipeline using the subscription and notification services available in Globus Toolkit 4. A specific aim of our design is to support collaborative visualization, allowing a geographically distributed research team to work collaboratively on visual analysis of data. A key feature of the system is the use of a formal description of the visualization pipeline, using the skML language first developed in the gViz e-science project. This description is shared by all collaborators in a session. In co-operation with the e-Viz project, we generate user interfaces for the visualization services automatically from the skML description. The new system is called NoCoV (Notification-service-based Collaborative Visualization). A simple prototype has been built and is used to illustrate the concepts.
the participation is spread over a period of time. The requirement here is for a persistent pipeline of services, which a collaborator can pick up and develop at a later time – important for collaboration with Australasia for example.
To support collaborative working over the construction of a pipeline there is a requirement to share pipeline description information between the Pipeline Controller Service and all the clients. An extension of skML (Duce and Sagar, 2005), an XML-based visualization pipeline description language, is used for this purpose. It has been extended to fit the NoCoV system with an emphasis on Service-Oriented features.

The NoCoV system has been implemented with Globus Toolkit 4 (GT4). The stateful Web Services provided by GT4 offer the capability of maintaining visualization pipeline information between sessions. This allows users to save the current status of the pipeline on the Pipeline Controller Service for use the next time they reconnect. We also exploit Notification Web Services in GT4. Moreover, GT4 provides a set of Grid security specifications which can be seamlessly applied to the NoCoV system in its future development to address one of the desired issues in collaboration: security.

4. NoCoV System design

4.1 Using Notification Web Service

WS-Notification includes a set of specifications (OASIS, 2006) and a whitepaper (Publish/Subscribe, 2006) which describe the use of a topic-based publish/subscribe pattern to achieve notification with standard Web Services.

[Figure 2 diagram: (1) the Service publishes a Notification Topic; (2) the Client subscribes to it; (3) a change happens at the Service; (4) the change triggers a notification to the Client.]

Figure 2: The working pattern of Notification Web Service

With the normal request/response communication mode between service and client, the client has to keep polling the service in order to get the latest changes. By contrast, the notification approach works in the manner shown in Figure 2. When the service publishes a set of notification topics, clients can subscribe to relevant topics according to their different interests. Every time a change happens in these notification topics, the service will automatically deliver the changed information to the subscribing clients.

The main reason for choosing Notification Web Services to implement visualization is their ‘push’ feature. It fits well with the requirement of a visualization pipeline which needs the services to send their results to other services connected to them ‘downstream’, each time they produce a new result.

The publish/subscribe pattern provides a data sharing approach for collaboration. The notification topics published by the visualization service can be either the data or the control parameters. Actions from participants (such as changing parameter values) are also published as notifications, allowing these to be shared.

The stateful feature inherited from GT4 Web Services makes it possible to achieve asynchronous collaboration as the status of the pipeline is persistent and users can retrieve the saved pipeline to carry on their previous work – or new users can take over and continue the development.

The publish/subscribe pattern can also reduce network traffic by only delivering changed information to clients. Moreover it can cut down service and client workloads as clients only need to subscribe to the service once and then just wait for the notification messages. It is similar to previously used technologies such as socket communication, but with the facility of WS-Notification, the system developers do not need to consider the details of physical communication ports, and the communication is relatively easy to set up compared to socket communication.

As XML/SOAP messages have a restriction on their maximum size, when large data (e.g. update of geometry) needs to be sent as a notification, only a reference to the data is included in the notification message, and the data itself can be transferred using http, ftp or gridftp separately. Another possible solution is to add data as an attachment (e.g. DIME) to the notification message.

There are however some disadvantages of notification services. When a client subscribes to a notification service, the client needs to be able to function as a listening service waiting for the notification. In the case of a GT4 implementation, it requires GT4 to be installed on the machine where the notification client runs. In addition, every time a change happens on a notification service, the service needs to start a connection to the subscriber. If the
notification client (subscriber) sits behind a firewall, the firewall may block all the connections initiated from outside. In this case, although the client can subscribe successfully to the service, it cannot receive any notifications from outside of the firewall. WS-Messenger (Huang et al, 2006) proposed an approach to address the issue of the delivery of notifications through a firewall by using a MessageBox service to store all notification messages and having consumers periodically pull notifications from the message box. Another alternative, which is less secure but more straightforward, is to set a small range of open ports in the firewall and configure the local notification system to only use ports within this range.

4.2 The Extended skML

As the proposed NoCoV system is expected to enable the collaborative creation and configuration of visualization pipelines, and the persistence of pipeline information, the visualization pipeline must be represented in such a way that it can be stored as service resource properties.

skML is an XML-based dataflow description language (Duce and Sagar, 2005) which describes visualization pipelines in a generic way, so that the skML description can be independent of the implementation of the pipeline. The skML language was developed as part of the gViz e-science project, and was heavily influenced by MVEs such as IRIS Explorer. However, it lacks certain features to describe characteristics of Service-Oriented collaborative visualization pipelines. Rather than create a different description language, we have chosen simply to modify and extend skML.

An ‘instance’ element replaces the ‘module’ element in skML to represent visualization service instances, but the ‘link’ element in skML is kept to represent the subscriptions to notifications. One of the significant differences is that the extension aims to present richer information about the visualization pipeline. For example, all the output and input ports for each service are recorded: this is then made visible at the user interface, to allow users the ability to change the connections to/from that service. In contrast with the original skML, all control parameters for a service instance must be explicit in the description, again so that they can be presented in an automatically generated user interface. Another difference is the adding of new properties such as ‘owner’ and ‘sharable’, which identify who owns this visualization service instance and who are allowed to access this service instance. These new properties will make it possible to add security control in NoCoV.

4.3 Pipeline Controller Service

A Pipeline Controller Service is placed between the end-users and the visualization notification services, in order to enable users to collaboratively build the pipeline and configure each visualization service linked within the pipeline. Figure 3 displays the components of the Pipeline Controller Service.

The Pipeline Controller stands between the distributed visualization services and the end-users as a proxy, through which users can send requests to create/destroy, connect/disconnect service instances and set parameters of these instances. The removing or adding of visualization services or any changes inside visualization services are transparent to end-users.

Figure 3: The Pipeline Controller Service

The Pipeline Controller also functions as a shared workspace for all the participants. For visualization experts who create visualization pipelines, the Pipeline Controller keeps a description of the current pipeline in the extended skML, which can be retrieved by later joiners to the collaborative session. The Pipeline Controller is implemented as a notification service, in which a notification topic about the latest status of the jointly created pipeline is published so that users can share the same pipeline and join in synchronous collaboration on the construction of a pipeline. Notification topics about control parameters and the latest visualization result generated from the pipeline are published for scientists who do not want to be involved with the building of the pipeline but simply wish to modify control
5. Prototype Implementation

A prototype was implemented as a proof of concept for the NoCoV system. The prototype involves a set of simplified visualization services, a Pipeline Controller Service with basic functions as proposed in the design, a Pipeline Editor Client for visualization experts and a Parameter Control Client for scientists.

5.1 Visualization Services implemented as Notification Web Services

In order to demonstrate the capability of building and controlling a visualization pipeline through end-user interfaces, four visualization services were created with simple visualization functions.

The data service retrieves data from a source according to the file name or URL provided by the user. The output of the data service can be subscribed to by an isosurface service or a slice service, which can generate output geometries in VRML format. The inline service can subscribe to both isosurface and slice services to combine multiple geometries into a single scene.

5.2 Pipeline Controller Service

The Pipeline Controller acts as an agent for end-users, releasing users from the burden of dealing

Figure 4 – Pipeline Editor Client

The Pipeline Editor Client (see Figure 4) is initialized with a list of available visualization services shown on the left hand side from an XML format service list file (which makes it possible to retrieve a list of available services from a UDDI server). By clicking on the visualization services, a corresponding instance will be created and displayed as a box in the editor window on the right hand side. The editor window supports drag-and-drop operation. By right clicking on the instance in the editor window, the user can specify the output or input to that instance in order to connect different instances together to create a pipeline. All the participants in the collaboration will see the same pipeline on their editor windows, as they subscribe to the same notification topic – the definition of the current pipeline. At this stage, we simply assume clients know notification
topics published by each visualization service, shows the PPC corresponding to the pipeline
but in our future work, each visualization displayed in Figure 4.
service will provide a method which returns a
description of notification topics published for 5.5 Simple illustration
the client to subscribe. We illustrate the use of the NoCoV system with
a very simple example. In Figure 6, one user
5.4 Parameter Control Client
has built a pipeline of two services to visualize a
The Parameter Control Client (PCC) provides a volumetric dataset (data – to read in the data;
user interface for the control parameters of the slice to display a cross-section).
individual components of the visualization
pipeline created by the Pipeline Editor Client.
The PCC is implemented using the user
interface component from the e-Viz system
(Riding et al, 2005), called the e-Viz Client.
This component takes an XML description of
the visualization pipeline and generates a user
interface for each component based on its
description. It also provides connectivity from
the user interface to the remote visualization
component to send and receive parameter values.
The original e-Viz Client used the gViz
library as its sole communications mechanism
to send/receive parameter changes. This tool has
Figure 6 – Slice Visualization
now been extended to allow alternative
A collaborator then joins the session, and
communications mechanisms to be used in
initially sees the same slice, but with ability to
place of the gViz library. This is managed by
view from a different angle. Figure 7 shows the
specifying in the RDF section of the XML
two screens, displayed side-by-side here to
description file what mechanism is to be used
show the effect.
for each pipeline component. In this case, a
Notification Services mechanism has been
specified to link the e-Viz Client with the
NoCoV system.
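The topic-based publish/subscribe pattern described above (every participant subscribes to the one topic carrying the current pipeline definition, so all editors converge on the same pipeline) can be sketched as follows. This is an illustrative, in-process stand-in for the WS-Notification machinery the system actually uses; the `Broker` and `EditorClient` names are hypothetical.

```python
# Illustrative in-process stand-in for WS-Notification topics: every editor
# subscribes to one topic whose messages carry the current pipeline
# definition, so all collaborators see the same pipeline. Names are
# hypothetical, not part of the NoCoV implementation.

class Broker:
    def __init__(self):
        self.topics = {}  # topic name -> list of subscriber callbacks

    def subscribe(self, topic, callback):
        self.topics.setdefault(topic, []).append(callback)

    def publish(self, topic, message):
        for callback in self.topics.get(topic, []):
            callback(message)


class EditorClient:
    """One collaborator's editor window; it mirrors the last published pipeline."""

    def __init__(self, broker, topic="current-pipeline"):
        self.pipeline = None
        broker.subscribe(topic, self.on_pipeline)

    def on_pipeline(self, definition):
        self.pipeline = definition


broker = Broker()
alice, bob = EditorClient(broker), EditorClient(broker)

# One user connects a data service to a slice service; everyone sees it.
broker.publish("current-pipeline", [("data", "slice")])
assert alice.pipeline == bob.pipeline == [("data", "slice")]
```

The design choice the paper makes, publishing the pipeline definition itself as the notification payload, is what lets a late-joining collaborator pick up the current state simply by subscribing.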
6. Conclusions and future work

We have presented the vision, design and prototype implementation of a next generation visualization system. It is built using service-oriented concepts, and envisages a worldwide collaboratory in which a research team can come together to visualize data – the collaboration can be at the same time, or at different times. A key feature is exploitation of Grid and Web services technologies, in particular notification services. The publish/subscribe pattern is used to link the visualization services in a pipeline.

A middle-layer service, the Pipeline Controller Service – acting as a proxy for distributed visualization services – is included to provide collaborative functions for different levels of end-users. Work from the gViz and e-Viz e-Science projects is exploited to provide a formal description of the visualization pipeline, and automatically created user interfaces, respectively. A prototype of the proposed system has been implemented as a proof of concept.

The following aspects need to be explored in the next stage to create a comprehensive collaborative visualization system.

Security is an important issue for all collaborative systems. As the prototype is implemented with GT4, the GT4 security mechanisms can be seamlessly applied to NoCoV. The security issues that need to be considered in future work include: setting different roles for users; setting different access permissions for each visualization instance in the pipeline; and dynamically changing the valid user list to control the joining and leaving of users in a collaborative session.

The discovery of available visualization services is another important strand. In the prototype, the client gets an available service list from an XML file which contains details of each visualization service – but it is possible to introduce a UDDI server into the system to provide this functionality.

Other issues include the standardization of the data format passed between different types of visualization service, and automatic updating of the Pipeline Controller when new services are brought into the NoCoV repository.

Acknowledgements

Thanks to Stuart Charters (Univ of Durham) for making available an early version of his thesis; and to John Hodrien (Univ of Leeds) for advice on GT4 matters. Jason Wood developed the GUI aspects as part of the EPSRC e-Viz project.

References

Charters, S.M., Holliman, N.S. and Munro, M. (2004) Visualization on the Grid: a Web Service Approach. Proceedings of the UK e-Science All Hands Meeting, pp 202-209.

Charters, S.M. (2006) Virtualising Visualisation. PhD thesis, University of Durham.

Duce, D.A. and Sagar, M. (2005) skML: a Markup Language for Distributed Collaborative Visualization. EG UK Theory and Practice of Computer Graphics, pp 171-178.

GT4 (2006) Globus Toolkit web site. http://globus.org/toolkit/

gViz project web site (2006) http://www.comp.leeds.ac.uk/vis/gviz/

Huang, Y. et al (2006) WS-Messenger: A Web Services-based Messaging System for Service-Oriented Grid Computing. CCGrid 2006, IEEE Computer Society, pp 166-173.

IRIS Explorer web site (2006) http://www.nag.co.uk/Welcome_IEC.asp

OASIS Web Services Notification TC (2006) http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=wsn

Publish-Subscribe Notification for Web services (2006) http://www.ibm.com/developerworks/library/ws-pubsub/WS-PubSub.pdf

realVNC web site (2006) http://www.realvnc.com

Riding, M. et al (2005) e-Viz: Towards an Integrated Framework for High Performance Visualization. Proceedings of the UK e-Science All Hands Meeting, pp 1026-1032. See also the e-Viz project web site: http://www.eviz.org

Sastry, M. and Craig, M. (2003) Scalable application visualization services toolkit for problem solving environments. Proceedings of the UK e-Science All Hands Meeting, pp 520-525.

vtk web site (2006) http://www.kitware.com

Walton, J. (2004) NAG's IRIS Explorer. In Visualization Handbook, pp 633-654, Academic Press. Available at: http://www.nag.co.uk/IndustryArticles/ch32.pdf

Wood, J., Wright, H. and Brodlie, K. (1997) Collaborative Visualization. Proceedings of the IEEE Visualization 1997 Conference, edited by R. Yagel and H. Hagen, pp 253-260, ACM Press.
Abstract
The advent of the Grid Computing paradigm and the link to Web Services provide fresh challenges and opportunities for collaborative visualization. Ontologies are a central building block of the Semantic Web. Ontologies define domain concepts and the relationships between them, and thus provide a domain language that is meaningful to both humans and machines. In this paper, the analysis and design of a prototype ontology for visualization are discussed; its purpose is to provide the semantics for the discovery of visualization services, to support collaborative work and curation, and to underpin visualization research and education. Relevant taxonomies are also reviewed and a formal framework is developed.
sharing of process models between visualization developers and users; curation and provenance management of visualization processes and data; and collaboration and interaction between distributed sites offering visualization services [2]. At the same time, Semantic Web Service technologies, such as the Ontology Web Language (OWL), are developing the means by which services can be given richer semantic specifications. Richer semantics can enable more flexible automation of service provision and use, and support the construction of more powerful tools and methodologies. This also makes it possible for users to define their requirements and subsequently connect to services that may be adopted in their work.

Ontologies are a central building block of the Semantic Web: they provide formal models of domain knowledge that can be exploited by intelligent agents. A visualization ontology would provide a domain language that is meaningful to humans and machines, describe the configuration of a visualization system, and support the discovery of available Web Services.

We show how a prototype ontology for visualization may be developed using Protégé. It is extendable, and as such, our initial efforts can be viewed as a starting point and catalyst for experts in visualization to create and agree a standard ontology. In Section 2, a summary of related work is provided. Section 3 introduces the Ontology Web Language and Section 4 describes a popular ontology editing tool, Protégé. The development of an ontology for visualization using Protégé is discussed in Section 5. Conclusions and future work are presented in Section 6.

2. Related Work

Two workshops have recently been held at the UK's National e-Science Centre (NeSC), one on "Visualization for eScience", held in January 2003 [4], and the other on "Visualization Ontology", held in April 2004, which identified the need to establish an ontology for visualization, and investigated the structure for such an ontology, giving an overall description of what such an ontology should contain [5]. However, there was agreement that these initial efforts were both tentative and incomplete, because the creation of an ontology is an iterative activity and the establishment of an ontology for visualization needs consensus within the visualization community itself.

Brodlie at the University of Leeds proposed the notion of a "scientific visualization taxonomy" using the concept of an "E-notation" [6][7], which was first developed at the AGOCG visualization meeting in 1992. This is seen as a useful classification and model of the underlying sampled data which may be used for subsequently visualizing this data; however, it failed to capture details about how such samples were distributed, or how the visualization was carried out – both fundamental issues in visualization.

Melanie Tory and Torsten Möller at Simon Fraser University have also undertaken significant work on "visualization taxonomies" [8]. Recently, they presented a novel high-level visualization taxonomy. The taxonomy classifies visualization algorithms rather than data. Algorithms are categorized based on the assumptions they make about the data being visualized. As the taxonomy is based on design models, it is more flexible and considers the user's conceptual model, emphasizing the human aspect of visualization [9].

Ed H. Chi at Xerox PARC put forward a new way to taxonomize information visualization techniques by using the Data State Model approach [10]. This research shows that the Data State Model not only helps researchers understand the design space of visualization algorithms, but also helps implementers understand how information visualization techniques can be applied more broadly [11].
synthesize some existing taxonomies, mainly based on Ken Brodlie's [6] and Melanie Tory and Torsten Möller's [8] taxonomies, and present these in an organized structure which highlights

concepts and their relations as possible to provide machine-readable formal specifications for the discovery of visualization services. Moreover, some of the decisions made in the

[Figure: fragment of the ontology, showing the Data_Model and Primitive_Set classes linked by has-component, chosen-for/by and transform-from/by/into properties.]
define a set of techniques and algorithms which can be used to transform or create visual representations of data using a data model. So techniques/algorithms are categorized based on the data model rather than on the data itself, which facilitates the transformation or visualization of the data model. The relationship between techniques/algorithms and data models is achieved by three properties: is-transformed-by, transform-from and transform-into-model. The property transform-from is an inverse property of is-transformed-by: the former means an algorithm can be used to visualize the corresponding data model, the latter the inverse. The range of the property transform-into-model is Visualization_Techniques, which means we can use an algorithm to transform a data model into a simpler form. It can therefore be seen that it is possible to match data for specific applications to the most appropriate visualization techniques/algorithms by means of the above three properties. A fragment of the subclasses of Visualization_Techniques is shown in Figure 5 (for example, AS2 denotes that its subclasses can be used to transform or visualize the data model ES2).

5.5 Data_Representation class

The reason that the Data_Representation class should be included is that some models of representation are specific to classes of data; for example, work in flow visualization and graph visualization has distinct categories of representations. So users should be allowed to make the appropriate representation choice based on their information need. Its subclasses contain multimodal attributes, i.e. visual, haptic, sonic, olfactory, physical attributes, etc. The visual attribute is further split into four subclasses: spatialization, timing, color and transparency [7].

6. Conclusion and future work

In this paper, we have described our experience of building an ontology for visualization, although it is accepted that this is both incomplete and tentative. Our VO is designed as an OWL-based prototype ontology whose purpose is to provide a vocabulary by which users and visualization systems can communicate, and it is being used in a matchmaking portal for the discovery of visualization services [21]. Several specific directions for future work include:
visualizations given knowledge of the data.

References

1. Brodlie, K. W., Duce, D. A., Gallop, J. R., Walton, J. P. R. B. and Wood, J. D. Distributed and Collaborative Visualization. Computer Graphics Forum 23 (2), 223-251, 2004.
2. D. J. Duke, K. W. Brodlie and D. A. Duce. Building an Ontology of Visualization.
3. Simone Ludwig, O. F. Rana, J. A. Padget and W. Naylor. Matchmaking framework for Mathematical Web Services. Journal of Grid Computing, 2006.
4. National e-Science Centre. Visualization ontologies. http://www.nesc.ac.uk/talks/393/vis ontology report.pdf
5. National e-Science Centre. Visualization for e-science. http://www.nesc.ac.uk/esi/events/130/workshop report.pdf
6. K. W. Brodlie. Visualization Techniques. In Scientific Visualization – Techniques and Applications, edited by K.W. Brodlie, L.A. Carpenter, R.A. Earnshaw, J.R. Gallop, R.J. Hubbold, A.M. Mumford, C.D. Osland and P. Quarendon, Chapter 3, pp 37-86, Springer-Verlag, 1992.
7. K. W. Brodlie. A Classification Scheme for Scientific Visualization. In Animation and Scientific Visualization, edited by R. A. Earnshaw and D. Watson, pp 125-140, Academic Press, 1993.
8. Melanie Tory and Torsten Möller. A model-based visualization taxonomy. Technical Report SFU-CMPT-TR2002-06, Computing Science Dept., Simon Fraser University, 2002.
9. Melanie Tory and Torsten Möller. Rethinking Visualization: A High-Level Taxonomy. IEEE Symposium on Information Visualization, pp 151-158, Oct. 2004.
10. Ed H. Chi and J. T. Riedl. An Operator Interaction Framework for Visualization Systems. Symposium on Information Visualization (InfoVis '98), Research Triangle Park, North Carolina, pp 63-70, 1998.
11. Ed H. Chi. A Taxonomy of Visualization Techniques using the Data State Reference Model. In Proceedings of the Symposium on Information Visualization (InfoVis '00), pp 69-75, IEEE Press, Salt Lake City, Utah, 2000.
12. S. K. Card and J. D. Mackinlay. The Structure of the Information Visualization Design Space. Proceedings of the IEEE Symposium on Information Visualization (InfoVis '97), Phoenix, Arizona, pp 92-99, Color Plate 125, 1997.
13. S. K. Card, J. D. Mackinlay and B. Shneiderman. Information Visualization: Using Vision to Think. Morgan-Kaufmann, San Francisco, California, 1999.
14. OLIVE: On-line Library of Information Visualization Environments. http://otal.umd.edu/Olive/, 1999.
15. OWL Web Ontology Language Overview. http://www.w3.org/TR/2004/REC-owl-features-20040210/
16. Holger Knublauch, Ray W. Fergerson, Natalya F. Noy and Mark A. Musen. The Protégé OWL Plugin: An Open Development Environment for Semantic Web Applications. Third International Semantic Web Conference (ISWC 2004), 2004.
17. MONET Consortium. MONET Home Page. Available from http://monet.nag.co.uk.
18. Holger Knublauch, Mark A. Musen and Alan L. Rector. Editing description logics ontologies with the Protégé OWL plugin. In International Workshop on Description Logics, Whistler, BC, Canada, 2004.
19. I. J. Grimstead, D. W. Walker and N. J. Avis. Resource Aware Visualization Environment – RAVE. Accepted for publication in Concurrency and Computation: Practice and Experience.
20. Ian J. Grimstead, David W. Walker and Nick J. Avis. Collaborative Visualization: A Review and Taxonomy. In Ninth IEEE International Symposium on Distributed Simulation and Real-Time Applications, 2005, pp 61-69.
21. Gao Shu, Omer F. Rana, Nick J. Avis and Chen Dingfang. Ontology-based Semantic Matchmaking Approach. Accepted for publication.
Meshing with Grids: Toward functional abstractions for grid-based visualization
Abstract
A challenge for grid computing is finding abstractions that separate concerns about what a grid application must achieve from the specific resources on which it might be deployed. One approach, taken by a range of toolkits and APIs, is to add further layers of abstraction above the raw machine and network, allowing explicit interaction with grid services. The developers' view, however, remains that of a stateful von Neumann machine, and the interleaving of grid services with domain computations adds complexity to systems.

An alternative is to replace this tower of abstraction with a new kind of programming structure, one that abstracts away from complexities of control and location while at the same time adapting to local resources. This paper reports early results from a project exploring the use of functional language technologies for computationally demanding problems in data visualization over the grid. Advances in functional language technology mean that pure languages such as Haskell are now feasible for scientific applications, bringing opportunities and challenges. The opportunities are the powerful types of 'glue' that higher-order programming and lazy evaluation make possible. The challenge is that structures that work well within stateful computation are not necessarily appropriate in the functional world. This paper reports initial results on developing functional approaches to an archetypal visualization task, isosurfacing, that are a preliminary step towards implementing such algorithms on a polytypic grid.
Figure 1: (a) Marching cubes applied to a 2D dataset (e.g. marching squares) and extracted isocontour; (b) Marching cubes applied to a 3D dataset and extracted isosurface.
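The classification step at the heart of marching cubes, which the paper later expresses in Haskell as toByte (map8 (>thresh) cell), compares the eight corner samples of a cell against the isosurface threshold and packs the results into a byte indexing a 256-entry case table. A language-neutral sketch (the function name and the corner-to-bit convention are illustrative, not the paper's):

```python
# Sketch of marching-cubes cell classification: pack 8 threshold comparisons
# into an index 0..255 for the case table. The corner ordering/bit assignment
# is a convention chosen here for illustration.

def cell_index(corners, thresh):
    """Map 8 corner samples to a case-table index."""
    index = 0
    for bit, sample in enumerate(corners):
        if sample > thresh:
            index |= 1 << bit  # set the bit for each corner above threshold
    return index

# A cell entirely below the threshold is case 0 (the surface misses it),
# entirely above is case 255; mixed cells select a non-trivial case.
assert cell_index([0] * 8, 10) == 0
assert cell_index([20] * 8, 10) == 255
assert cell_index([20, 0, 0, 0, 0, 0, 0, 0], 10) == 1
```

Only the mixed cases (1 to 254) produce triangles, which is why the algorithm can discard uniform cells immediately.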
HOpenGL binding, so the mapping phase is implemented slightly differently, via a function that is invoked as the GL display callback.

2 It bears similarities to Zermelo-Fraenkel (ZF) set comprehensions in mathematics.

      arr!(x+1,y+1,z), arr!(x,y+1,z)
    , arr!(x,y,z+1), arr!(x+1,y,z+1)
    , arr!(x+1,y+1,z+1), arr!(x,y+1,z+1)
    )

Finally, to the definition of mcube:

    mcube :: a -> (XYZ -> Cell a) -> XYZ
          -> [Triangle b]
    mcube thresh lookup (x,y,z) =
        group3 (map (interpolate thresh cell (x,y,z))
                    (mcCaseTable ! bools))
        where
        cell  = lookup (x,y,z)
        bools = toByte (map8 (>thresh) cell)

The cell of vertex sample values is found using the lookup function that has been passed in. We derive an 8-tuple of booleans by comparing each sample with the threshold (map8 is a higher-order function like map, only over a fixed-size tuple rather than an arbitrary sequence), then convert the 8 booleans to a byte (bools) to index into the classic case table (mcCaseTable). The result of indexing the table is the sequence of edges cut by the surface. Using map, we perform the interpolation calculation for every one of those edges, and finally group those interpolated points into triples as the vertices of triangles to be rendered. The linear interpolation is standard:

    interpolate :: Num a => a -> Cell a -> XYZ
                -> Edge -> TriangleVertex
    interpolate thresh cell (x,y,z) edge =
        case edge of
          0  -> (x+interp, y, z)
          1  -> (x+1, y+interp, z)
          ...
          11 -> (x, y+1, z+interp)
        where
        interp = (thresh - a) / (b - a)
        (a,b) = selectEdgeVertices edge cell

The ideal solution to the listed issues would intuitively be an implementation of the algorithm that takes care of all three kinds of problem simultaneously. However, ideal solutions are often tailored to specific computational resources, a policy not affordable in a multi-faceted environment like the grid. Experience shows that, rather than one holistic solution, an amenable suite of solutions is often preferable, capable of efficiently providing optimal trade-offs between results and available computational power: different implementations for different environments. However, the ability to swap between different versions of the same algorithm involves evincing crucial patterns of behaviour within the algorithm itself. Exploiting the abstraction and expressive power of Haskell in the context of the marching cubes algorithm, we have outlined several different implementations that in turn take care of memory issues, when the entire dataset cannot be loaded in memory, and of sharing of computation, when the same computation is carried out on the same data more than once.

2.2.1 Stream based algorithm

When dealing with large datasets the monolithic array data structure presented so far is not feasible; it simply may not fit in core memory. A solution is to separate traversal and processing of the data. By its inner structure, the marching cubes algorithm only ever needs, at any one moment, a small partition of the whole dataset: a single point and 7 of its neighbours suffice, making up a unit cube of sample values. If we compare this with a typical array or file storage format for regular grids (essentially a linear sequence of samples), then the unit cube is entirely contained within a "window" of the file, corresponding to exactly one plane + line + sample of the volume. The ideal solution is to slide this window over the file, constructing one unit cube on each iteration, and dropping the previous unit cube. Figure 2 illustrates the idea.
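The sliding-window idea can also be sketched outside Haskell, as a generator that traverses a flat, x-fastest sample stream once while holding only about one plane plus one line of samples in memory. This is a hedged illustration; unit_cubes and its corner ordering are inventions of this sketch, not the paper's implementation:

```python
# Sketch of the streaming window: a regular grid stored as a flat, x-fastest
# sequence of samples is traversed once, holding only about one plane + one
# line of samples at a time. Illustrative only; the paper's version is Haskell.

from collections import deque
from itertools import islice

def unit_cubes(samples, nx, ny, nz):
    """Yield the 8 corner samples of every unit cube in an nx*ny*nz volume."""
    wsize = nx * ny + nx + 2        # window covers offsets 0 .. nx*ny + nx + 1
    it = iter(samples)
    win = deque(islice(it, wsize))  # the sliding window; win[0] is sample (x,y,z)
    for i in range(nx * ny * nz):
        x, y, z = i % nx, (i // nx) % ny, i // (nx * ny)
        if x < nx - 1 and y < ny - 1 and z < nz - 1:
            corner = lambda dx, dy, dz: win[dx + dy * nx + dz * nx * ny]
            yield (corner(0, 0, 0), corner(1, 0, 0),
                   corner(1, 1, 0), corner(0, 1, 0),
                   corner(0, 0, 1), corner(1, 0, 1),
                   corner(1, 1, 1), corner(0, 1, 1))
        win.popleft()               # slide the window by one sample
        nxt = next(it, None)
        if nxt is not None:
            win.append(nxt)

# A 2x2x2 volume contains exactly one unit cube; a 3x3x3 volume contains 8.
assert len(list(unit_cubes(range(8), 2, 2, 2))) == 1
assert len(list(unit_cubes(range(27), 3, 3, 3))) == 8
```

The memory bound of such a traversal is set by the window size alone, which matches the paper's later observation that the streaming version's memory use is proportional to plane + line + 1, independent of the total dataset size.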
of code coming from different implementations but defined by logically equivalent signatures. In our testing phase, for example, we have noticed that the best performance on both machines was achieved by a marching cubes implementation which included both the streaming and the sharing of the threshold computation.

2.5 Time and Space Profiles

Performance numbers are given for all the presented versions of marching cubes written in Haskell, over a range of sizes of dataset (all taken from volvis.org). The relevant characteristics of the datasets are summarised in Table 1, where the streaming window size is calculated as one plane + line + 1. Tables 3 and 4 give the absolute time taken to compute an isosurface at threshold value 20 for every dataset, on two different platforms, a 3.0GHz Dell Xeon and a 2.3GHz Macintosh G5 respectively, compiled with the ghc compiler and -O2 optimization. Table 2 shows the peak live memory usage of each version of the algorithm, as determined by heap profiling.

Table 2: Memory Usage

                     memory (MB)
    dataset      array    stream.    VTK
    neghip       0.270    0.142      1.4
    hydrogen     2.10     0.550      3.0
    statueLeg    11.0     3.72       15.9
    aneurism     17.0     2.10       28.1
    skull        17.0     2.13       185.3
    stent8       46.0     8.35       119.1
    vertebra8    137.0    8.35       1,300.9

On the Intel platform the array-based version struggles to maintain acceptable performance as the size of the array gets larger. We suspect that the problem is memory management costs: (1) The data array itself is large: its plain unoptimised representation in Haskell uses pointers, and so it is already of the order of 5–9× larger than the original file alone, depending on machine architecture. (2) The garbage collection strategy used by the compiler's runtime system means that there is a switch from copying GC to compacting GC at larger sizes, which changes the cost model. (3) The compiler's memory allocator in general uses block sizes up to a maximum of a couple of megabytes. A single array value that spans multiple blocks will impose an extra administrative burden. In contrast, the time performance of the streaming version scales linearly with the size of the dataset, outperforming VTK (see Table 4). It can also be seen that the memory performance is exactly proportional to the size of the sliding window (plane + line + 1). The streaming version has memory overheads too, mainly in storing and retrieving intermediate (possibly unevaluated) structures. However, the ability to stream datasets in a natural fashion makes this approach much more scalable to large problem sets.

Table 4: Time Performance - PowerPC

                     time (s)
    dataset      array    stream.    VTK
    neghip       1.088    0.852      0.29
    hydrogen     8.638    6.694      0.51
    statueLeg    48.78    34.54      2.78
    aneurism     72.98    54.44      5.69
    skull        79.50    57.19      79.03
    stent8       287.5    154.9      33.17
    vertebra8    703.0    517.1      755.0

Comparing the Haskell implementations with implementations in other languages, something interesting appears. If we consider the figures obtained by running the algorithm on the Mac platform, the streaming Haskell version is actually faster than VTK for the larger surfaces generated by skull and vertebra8. Moving to a different platform, the picture changes, apparently in favour of the VTK implementation. Surprisingly, however, the Haskell implementation appears to be still competitive. The interesting aspect worth noting is that the speed-up gained by VTK is much bigger than that of the Haskell implementation, in proportion to the increase in computational power due to a faster processor. We suspect that such a difference resides in the compiler-generated code, which could easily be better optimized for an Intel architecture than for a PowerPC.

3 Related Work

The difficulties of working with large volumes of data have prompted a number of researchers to consider whether approaches based on lazy or eager evaluation strategies would be most appropriate. While call-by-need is implicit in lazy functional languages, several efforts have explored more ad hoc provision of lazy evaluation in imperative implementations of visualization systems, e.g. [11], [4]. In [15] Moran et al. use a coarse-grained form of lazy evaluation for working with large time-series datasets. The fine-grained approach inherent in Haskell not only delays evaluation until it is needed, but also evaluates objects piecewise. This behaviour is of particular interest in graphics and computational geometry, where order of declaration and computation may differ. Properties of our fine-grained streaming approach also match the requirements for data streaming from [12]; we refer to [5] for a more detailed discussion of this topic.

Figure 3: Functionally surfaced dataset coloured to show the age of triangles in the stream generated by the streaming marching cubes Haskell implementation over the neghip dataset.

4 Conclusion

Our purely functional reconstruction of the marching cubes algorithm makes two important contributions. First, it shows how functional abstractions and data representations can be used to factor algorithms in new ways, in this case by replacing monolithic arrays with a stream-based window, and composing the overall algorithm itself from a collection of (functional) pipelines. This is important in the context of grid computing, because a stream-based approach may be more suitable for distribution than one that relies on a monolithic structure. It is also important for visualization, as streaming occupies an important niche between fully in-core and fully out-of-core methods, and the functional approach is novel in that the flow of data is managed on a need-driven basis, without the programmer resorting to explicit control over buffering. Second, the functional reconstruction shows that elegance and abstraction need not be sacrificed to improve performance; the functional implementation is polymorphic in the kind of data defining the sample points, and several of the simple functions used to make the final application are eminently reusable in other visualization algorithms (for example mkStream). Our next step is to explore how type-based abstraction, e.g. polytypic programming, can be used to make the algorithm independent of the specific mesh organization; we would like the one expression of marching cubes to apply both to regular and irregular meshes.

Acknowledgement

The work reported in this paper was funded by the UK Engineering and Physical Sciences Research Council.

References

[1] G. Allen, T. Dramlitsch, I. Foster, N.T. Karonis, M. Ripeanu, E. Seidel, and B. Toonen. Supporting efficient execution in heterogeneous distributed computing environments with Cactus and Globus. In Supercomputing '01: Proc. of the 2001 ACM/IEEE Conference on Supercomputing, pages 52–52, 2001.
[2] E. Chernyaev. Marching cubes 33: Construction of topologically correct isosurfaces. Technical report, 1995.
[3] D. Clarke, R. Hinze, J. Jeuring, A. Löh, and J. de Wit. The Generic Haskell user's guide, 2001.
[4] M. Cox and D. Ellsworth. Application-controlled demand paging for out-of-core visualization. In Proceedings of Visualization '97, pages 235–ff. IEEE Computer Society Press, 1997.
[5] D. Duke, M. Wallace, R. Borgo, and C. Runciman. Fine-grained visualization pipelines and lazy functional languages. IEEE Transactions on Visualization and Computer Graphics, 12(5), 2006.
[6] R.B. Haber and D. McNabb. Visualization idioms: A conceptual model for scientific visualization systems. In Visualization in Scientific Computing. IEEE Computer Society Press, 1990.
[7] Haskell: A purely functional language. http://www.haskell.org, last visited 27-03-2006.
[8] J. Hughes. Why functional programming matters. Computer Journal, 32(2):98–107, 1989. See also http://www.cs.chalmers.se/~rjmh/Papers/whyfp.html.
[9] Johan Jeuring and Patrik Jansson. Polytypic programming. In J. Launchbury, E. Meijer, and T. Sheard, editors, Tutorial Text, 2nd Int. School on Advanced Functional Programming, Olympia, WA, USA, 26–30 Aug 1996, volume 1129, pages 68–114. Springer-Verlag, 1996.
[10] R. Lämmel and S. Peyton Jones. Scrap your boilerplate: a practical design pattern for generic programming. ACM SIGPLAN Notices, 38(3):26–37, 2003. Proceedings of the ACM SIGPLAN Workshop on Types in Language Design and Implementation (TLDI 2003).
[11] D.A. Lane. UFAT: a particle tracer for time-dependent flow fields. In Proceedings of Visualization '94, pages 257–264. IEEE Computer Society Press, 1994.
[12] C.C. Law, W.J. Schroeder, K.M. Martin, and J. Temkin. A multi-threaded streaming pipeline architecture for large structured data sets. In Proceedings of Visualization '99, pages 225–232. IEEE Computer Society Press, 1999.
[13] T. Lewiner, H. Lopes, A.W. Vieira, and G. Tavares. Efficient implementation of marching cubes' cases with topological guarantees. Journal of Graphics Tools, 8(2):1–15, 2003.
[14] W.E. Lorensen and H.E. Cline. Marching cubes: A high resolution 3D surface construction algorithm. In Proceedings of SIGGRAPH '87, pages 163–169. ACM Press, 1987.
[15] P.J. Moran and C. Henze. Large field visualization with demand-driven calculation. In Proceedings of Visualization '99, pages 27–33. IEEE Computer Society Press, 1999.
[16] W. Schroeder, K. Martin, and B. Lorensen. The Visualization Toolkit: An Object-Oriented Approach to 3D Graphics. Prentice Hall, second edition, 1998.
Abstract
Dependability is a key factor in any software system and has been made a core aim of the Globus
based China Research Environment over Wide-area Network (CROWN) middleware. Our past
research, based around our Fault Injection Technology (FIT) framework, has demonstrated that
Network Level Fault Injection can be a valuable tool in assessing the dependability of RPC oriented
SOAP based middleware. We present our initial results on applying our Grid-FIT method and tools
to Globus middleware and compare our results against those obtained in our previously published
work on Web Services. Finally, this paper outlines our future plans for applying Grid-FIT to
CROWN and thus providing dependability metrics for comparison against other middleware
products.
elements. Each wsdl:part defines a parameter (or return value in the case of a response message). A wsdl:part has an associated name that must be unique within the wsdl:message element and a type that defines the parameter type. There are a number of predefined types, and complex types can also be defined using a types element.

<wsdl:message
    name="unregisterServiceRequest">
  <wsdl:part
      name="context" type="xsd:string"/>
  <wsdl:part
      name="entry"
      type="impl:ServiceEntry"/>
</wsdl:message>

Table 1: Example WSDL Message

Once all request and response messages required to implement an RPC style interface have been defined, they can be used to define the calling interface for the Web Service. This is done by use of the wsdl:portType element (see Table 2). A wsdl:portType contains a number of wsdl:operation elements, with each wsdl:operation element corresponding to a method of the Web Service. Each wsdl:operation is made up of a wsdl:input element that will be the request part of the RPC and a wsdl:output element that will be the response message of the RPC.

<wsdl:portType name="Quote">
  <wsdl:operation name="getQuote"
      parameterOrder="symbol">
    <wsdl:input
        name="getQuoteRequest"
        message="impl:getQuoteRequest"/>
    <wsdl:output
        name="getQuoteResponse"
        message="impl:getQuoteResponse"/>
  </wsdl:operation>
  <wsdl:operation
      name="unregisterService"
      parameterOrder="context entry">
    <wsdl:input
        name="unregisterServiceRequest"
        message="impl:unregisterServiceRequest"/>
    <wsdl:output
        name="unregisterServiceResponse"
        message="impl:unregisterServiceResponse"/>
  </wsdl:operation>
</wsdl:portType>

Table 2: Example WSDL PortType

The above explanation briefly describes the use of WSDL to define a classic RPC style Web Service. WSDL can also be used to describe other styles of Web Service calling interface, such as message oriented calling, but this is outside the scope of this research.

2.2 SOAP

SOAP [9, 10] is a messaging protocol designed to allow the exchange of messages over a network. It is XML based to allow the exchange of messages between dissimilar machines.

Our method is primarily concerned with RPC mechanisms over SOAP. This is defined by the W3C in [11], which describes a general purpose RPC mechanism. The message types that are involved in an RPC exchange and the relevant features used by our method are briefly reviewed here.

A SOAP message utilizes the http://schemas.xmlsoap.org/soap/envelope/ schema, which defines the namespace soapenv; this namespace is set up in the soapenv:Envelope element. Consequently all elements that utilize this namespace must be enclosed by the soapenv:Envelope element. The soapenv namespace defines a semantic framework for SOAP messages.

The body of the SOAP message is enclosed by the soapenv:Body element. This element acts as a grouping for the body elements for different types of messages. Its primary function is to keep the body elements distinct from other groupings of elements, for instance a header block.

These two elements form the basis of a SOAP message. The soapenv:Body element is then populated with elements that make up the payload of a request, response or fault message.

A typical request message is given in Table 3. The request message name is defined in the wsdl:operation (see Table 2); by convention the name of the message equates to the service method name, but it can be defined as any valid name. In the example the message and method name are getQuote. The message element is therefore ns1:getQuote. The namespace ns1 is defined to be the urn of the service that implements the method. If this is combined with the address of the server hosting the service, it allows a specific method of a specific service on a specific server to be identified.

The ns1:getQuote element contains parameter elements that represent the RPC parameters. For instance, the getQuote method takes one string parameter called symbol, so ns1:getQuote contains one element with the element tag symbol, which contains the string data for that parameter. Parameters are defined in WSDL by wsdl:part elements in wsdl:message elements (see Table 1).
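As a small illustration of the request layout just described, a getQuote envelope can be assembled programmatically. The snippet below is our sketch, not part of the paper's tooling; it uses only the namespace URIs shown in the paper's examples.

```python
import xml.etree.ElementTree as ET

# Namespace URIs taken from the paper's example request message.
SOAPENV = "http://schemas.xmlsoap.org/soap/envelope/"
NS1 = "http://quote.stock.samples.dasbs.org"

def build_get_quote(symbol):
    """Build a minimal soapenv:Envelope carrying an ns1:getQuote call."""
    envelope = ET.Element(f"{{{SOAPENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAPENV}}}Body")
    call = ET.SubElement(body, f"{{{NS1}}}getQuote")
    param = ET.SubElement(call, "symbol")  # one wsdl:part -> one element
    param.text = symbol
    return envelope

env = build_get_quote("foo")
# The method element sits inside soapenv:Body, as described in the text.
print(ET.tostring(env, encoding="unicode"))
```

Serializing the tree yields the same Body/getQuote/symbol nesting as the typical request message, minus the encodingStyle and xsi:type attributes, which Axis adds for encoded RPC.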
A typical response message is given in Table 4. The response message is similar in structure to the request message, but by convention the response message name is post-fixed with the word Response although, again, it can be any valid name defined in the wsdl:operation element. In this example the response element name is ns1:getQuoteResponse.

<soapenv:Envelope
    xmlns:soapenv=
      "http://schemas.xmlsoap.org/soap/envelope/"
    xmlns:xsd=
      "http://www.w3.org/2001/XMLSchema"
    xmlns:xsi=
      "http://www.w3.org/2001/XMLSchema-instance">
  <soapenv:Body>
    <ns1:getQuote
        soapenv:encodingStyle=
          "http://schemas.xmlsoap.org/soap/encoding/"
        xmlns:ns1=
          "http://quote.stock.samples.dasbs.org">
      <symbol
          xsi:type="xsd:string">
        foo
      </symbol>
    </ns1:getQuote>
  </soapenv:Body>
</soapenv:Envelope>

Table 3: Typical Request Message

A response element contains elements that represent any method return value and any parameters that are marked to be marshalled in-out or out. Method return results follow the naming convention of the method name post-fixed by the word Return and are represented in the WSDL by wsdl:part elements. In-out and out parameters follow the same conventions as parameters in a request message.

The example response message in Table 4 also demonstrates the format in which objects and arrays are marshalled in a SOAP message. This utilizes the multiRef element. Each object or item in an array is created using a multiRef that has an id. The actual parameter/return value is then set to this reference id, and the complex data can be constructed within the multiRef element in the same way that individual parameters are constructed in a message. This technique applies to both request and response messages.

<soapenv:Envelope
    xmlns:soapenv=…
    xmlns:xsd=…
    xmlns:xsi=…>
  <soapenv:Body>
    <ns1:getQuoteResponse
        soapenv:encodingStyle=…
        xmlns:ns1=…>
      <getQuoteReturn href="#id0"/>
    </ns1:getQuoteResponse>
    <multiRef id="id0"
        soapenc:root="0"
        soapenv:encodingStyle=…
        xsi:type="ns2:QuoteValue"
        xmlns:soapenc=…
        xmlns:ns2=
          "http://quote.stock.samples.dasbs.org">
      <date
          xsi:type="xsd:dateTime">
        2004-10-30T10:54:18.511Z
      </date>
      <quote
          xsi:type="xsd:double">
        47.5
      </quote>
    </multiRef>
  </soapenv:Body>
</soapenv:Envelope>

Table 4: Typical Response Message

Table 5 shows a typical SOAP Fault Message. Fault Messages are used to return failure information from a server to a client. The soapenv:Fault element contains three elements: faultcode, faultstring and detail. These elements are used to convey failure information to the client, with the faultstring and detail elements being language specific. For instance, using Axis 1.1 for Java, if a piece of user code on a server throws a Java exception the faultcode is set to soapenv:Server.userException to indicate that the fault originates in server side user code, the faultstring is set to the text description of the exception, and the detail element is not used.

<soapenv:Envelope xmlns:soapenv=…
    xmlns:xsd=…
    xmlns:xsi=…>
  <soapenv:Body>
    <soapenv:Fault>
      <faultcode>
        soapenv:Server.userException
      </faultcode>
      <faultstring>
        java.rmi.RemoteException:
        can't get a stock price
      </faultstring>
      <detail/>
    </soapenv:Fault>
  </soapenv:Body>
</soapenv:Envelope>

Table 5: Typical Fault Message

3 Grid Middleware

Grid middleware allows the construction of Grid services. It typically provides a transport mechanism as well as security facilities that give a consistent set of operations over a heterogeneous platform base.

Globus Toolkit is the front-running reference implementation of OGSA. Version 4 of this product is based around a Java core composed
of Apache Tomcat and Apache Axis, which makes the Globus environment similar to our previous experimental environments.

The major difference is the way that Globus packages and exchanges data. Whilst Globus utilizes the standard Axis RPC mechanism, it packages parameters and return values into a complex data structure. This means that each message exchange has a schema associated with it (see Table 6) rather than each parameter being defined as a set of wsdl:part in the wsdl:message element.

<xsd:element name="getQuote">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element
          name="symbol"
          type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>

Table 6: Example Globus WSDL

This means that the WSDL for each request and response message has a more nested structure, but this has no major effect on the structure of the messages exchanged. This is because when there is only one object passed as a parameter it can be contained in the message body rather than as a sequence of multiRef elements (see Table 7).

<soapenv:Body>
  <getQuote xmlns="http://…">
    <symbol>foo</symbol>
  </getQuote>
</soapenv:Body>

Table 7: Example Globus SOAP Message

Whilst some modification to the code base is necessary to accommodate these differences in WSDL, they are fairly minor.

3.1 CROWN

To fully evaluate our system we require a complex application to access. In doing so we would gain useful knowledge on how to construct fault models and where to apply them in a system. With this in mind we have selected a test bed system which will provide us with a real-world test scenario.

CROWN is a Grid test bed to facilitate scientific activities in different disciplines. The CROWN middleware platform consists of three parts. Firstly, many resources (including computers, clusters and some storage devices) should be connected via a nation-wide network infrastructure to build the CROWN architecture. Secondly, a series of grid middleware and auxiliary tools should be provided to meet the common requirements of different scientific activities in all kinds of areas. Finally, a number of typical applications should be put in use over the fabric and middleware to demonstrate the feasibility and robustness of CROWN.

The CROWN NodeServer is the basic environment for Grid services to be deployed and executed in the CROWN system. It has five main features that are outlined here.

Grid service container: Based on the Globus Toolkit WS Core 4.0, CROWN NodeServer implements a fundamental Grid service hosting environment that follows the OGSA/WSRF specifications.

Remote and hot service deploying: The CROWN NodeServer allows service providers to deploy and configure Grid services remotely without having to restart the container. This is combined with a security and trust mechanism so that applications distributed through the CROWN system execute securely, which enhances the scalability of the whole system.

Resource monitor: CROWN NodeServer gathers both static and dynamic information about the underlying resource, and shares it via a Grid service interface. The information can be used to control workflow or monitor a system.

Legacy applications integration: Many applications are not developed in a service oriented way. To add them to the Grid environment, CROWN NodeServer can encapsulate these programs as Grid services.

Security and trust: The NodeServer follows the WS-Security specification and provides security functions of SOAP signature and encryption, authentication, authorization and identity mapping between different security infrastructures. By using automated trust negotiation, NodeServer can help to set up trust relationships among strangers.

By using NodeServer, we encapsulated heterogeneous resources into a single homogeneous view. All kinds of resources in CROWN are provided via the services running in NodeServer.

We have selected the NodeServer as our first real-world Grid-FIT test case because it gives us a complex system to test, it is written utilizing SOAP, and its assessment would benefit both parties involved in the collaboration: the middleware can be made more reliable, and Grid-FIT can use the data gathered to construct better fault models.
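The practical consequence of the two encodings can be shown with a small helper that extracts a value from either body layout. This sketch is ours, not Grid-FIT code: namespaces are omitted for brevity, and the wrapped-style response body is modeled on the request in Table 7.

```python
import xml.etree.ElementTree as ET

# An Axis-style encoded response: the return value is an href reference
# resolved through a sibling multiRef element.
AXIS_BODY = ("<Body><getQuoteResponse><getQuoteReturn href='#id0'/>"
             "</getQuoteResponse><multiRef id='id0'><quote>47.5</quote>"
             "</multiRef></Body>")

# A wrapped-style body: the single object travels inline in the message
# element (hypothetical response shape, modeled on Table 7's request).
GLOBUS_BODY = ("<Body><getQuoteResponse><quote>47.5</quote>"
               "</getQuoteResponse></Body>")

def extract_quote(body_xml):
    """Return the quote value from either encoding of the response body."""
    body = ET.fromstring(body_xml)
    ret = body.find("getQuoteResponse/getQuoteReturn")
    if ret is not None and ret.get("href"):        # multiRef encoding
        target = ret.get("href").lstrip("#")
        for node in body.iter("multiRef"):         # follow the reference
            if node.get("id") == target:
                return node.find("quote").text
    # wrapped style: the value is inline in the single message element
    return body.find(".//quote").text

print(extract_quote(AXIS_BODY))    # "47.5"
print(extract_quote(GLOBUS_BODY))  # "47.5"
```

Both calls recover the same value, which reflects the text's observation that the more nested WSDL has no major effect on the messages actually exchanged.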
We have implemented our stock quoting service to use a large repeatable dataset, stored in a backend database, to produce a time based real-time stock quote. Since the quote service is based around a database containing the simulated quote values, it is possible to replicate a test run exactly by resetting time etc. to a set of starting conditions.

Our trading service implements a simple automatic buying and selling mechanism. An upper and lower limit is set which triggers trading in shares. Shares are sold when the high limit is exceeded and shares are bought when the quoted price is less than the lower limit.

The buying and selling process involves transferring money using the bank service and multiple quotes (one to trigger the transaction and one to calculate the cost). Since these multiple transactions involve processing time and network transfer time, this constitutes a race condition, as our quoting service produces timed real-time quotes. Any such race condition leaves the potential for the system to lose money, since the initial quote price may be different from the final purchase price. This is intentional, to demonstrate that Grid-FIT does not introduce a significant overhead into the system and thus affect its operation.

This paper details three different series of data: 1) a baseline set of data with the system running normally; 2) a simulated faulty/malicious service; 3) a simulated heavily loaded server.

Our test system was implemented using Globus WS-Core Version 4.0.1 running on Apple Mac OS X using G4 1.5 GHz processors and 1 GB RAM.

Quote Service

averaged over a number of runs. The average RPC time was measured at 0.05s and appeared to allow the algorithms to function correctly.

The test case was iterated and the transactions from each were compared. Apart from minor timing variations, the analysis showed that the test case was repeatable when Web Service technology was used.

We repeated this experiment with our instrumented Globus system (see Table 8). This gave results similar to the Web Service based system, with an average match of 99.5% and an average RPC execution time of 0.05s.

               Test 1           Test 2           Test 3           Test 4
               Match  Mismatch  Match  Mismatch  Match  Mismatch  Match  Mismatch
Match %        100.00 0.00      99.00  1.00      100.00 0.00      99.00  1.00
Average Time   0.05 0.05 0.07 0.05 0.05 0.06
Std Dev        0.02             0.02             0.03             0.02
Average Match  99.50
Std Dev        0.57735

Table 8: Baseline Data For Globus System

By comparing the test runs there appears to be very little difference between the data sets obtained except for minor timing variations (see Figure 3).

Figure 3: Comparison of Baseline Test Data Runs

By comparing the previous Web Service based data with the new Grid-FIT data we observed no significant differences.
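The race the trading service deliberately exposes can be sketched in a few lines. This is a hypothetical illustration of ours, not Grid-FIT code: the price function and timings are invented stand-ins for the database-backed quote service.

```python
# The quote that triggers a sale and the quote used to cost it are taken at
# different times, so the executed price can differ from the trigger price.

def quote(t):
    # Deterministic time-based price, standing in for the database-backed
    # quote service; prices repeat so a test run can be replicated exactly.
    return 100.0 + 5.0 * (t % 3)

def trade(trigger_time, network_delay, high_limit):
    """Sell if the quoted price exceeds the high limit; cost the trade with
    a second quote taken after processing and network delay."""
    trigger_price = quote(trigger_time)
    if trigger_price > high_limit:
        cost_price = quote(trigger_time + network_delay)
        return trigger_price, cost_price   # the two prices may disagree
    return None

# At t=2 the price (110.0) exceeds the limit, but by the time the costing
# quote arrives the price has moved, so the trade can lose money.
print(trade(2, 1, 105.0))
```

Any added latency widens the gap between the two quotes, which is why comparing transaction logs with and without instrumentation makes a sensitive test for overhead introduced by Grid-FIT.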
               Test 1          Test 2          Test 3          Test 4
               Match Mismatch  Match Mismatch  Match Mismatch  Match Mismatch
Match %        0.00  100.00    0.00  100.00    0.00  100.00    0.00  100.00
Average Time   0.04            0.04            0.04            0.04
Std Dev        0.00            0.00            0.00            0.00

statically encoded distribution and minor differences in timings introduced by Globus.
case followed the same basic trend as the Web Service based tests.

This work provides us with a proof of concept. As a next step we intend to implement an enhanced fault and failure model to facilitate typical Grid system assessment based on the framework provided by FIT. To accomplish this we will need experimental data on a non-trivial Globus based system.

We intend to utilize the CROWN middleware to provide us with a test bed system. By applying Grid-FIT to the NodeServer element built into CROWN we will gain valuable experience and data on assessing Globus based systems. This collaboration will be of advantage to both sides, since dependability is a key aim of the CROWN middleware and, by analyzing the application of our Grid-FIT tool and method to CROWN, we will gain valuable data on how to enhance our fault models.

7 References

[1] I. Foster, "Globus Toolkit Version 4: Software for Service-Oriented Systems," presented at IFIP International Conference on Network and Parallel Computing, 2005.
[2] I. Foster, H. Kishimoto, A. Savva, D. Berry, A. Djaoui, A. Grimshaw, B. Horn, F. Maciel, F. Siebenlist, R. Subramaniam, J. Treadwell, and J. V. Reich, "The Open Grid Services Architecture, Version 1.0," Global Grid Forum (GGF), 2005.
[3] Apache, "Axis Architecture Guide," 1.1 ed: Apache Axis Group, 2003.
[4] "The Apache Jakarta Tomcat 5 Servlet/JSP Container," 2006.
[5] E. Marsden, J. Fabre, and J. Arlat, "Dependability of CORBA Systems: Service Characterization by Fault Injection," presented at Symposium on Reliable Distributed Systems, Osaka, Japan, 2002.
[6] N. Looker and J. Xu, "Assessing the Dependability of SOAP-RPC-Based Web Services by Fault Injection," 9th IEEE International Workshop on Object-oriented Real-time Dependable Systems, pp. 163-170, 2003.
[7] N. Looker, M. Munro, and J. Xu, "A Method for Dependability Analysis of Web Services," Information and Software Technology, Elsevier, 2006.
[8] E. Christensen, F. Curbera, G. Meredith, and S. Weerawarana, "Web Services Description Language (WSDL)," 1.1 ed: W3C, 2001.
[9] D. Box, D. Ehnebuske, G. Kakivaya, A. Layman, N. Mendelsohn, H. F. Nielsen, S. Thatte, and D. Winer, "Simple Object Access Protocol (SOAP) 1.1," 1.1 ed: W3C, 2000.
[10] F. Curbera, M. Duftler, R. Khalaf, W. Nagy, N. Mukhi, and S. Weerawarana, "Unraveling the Web Services Web: An Introduction to SOAP, WSDL, and UDDI," IEEE Internet Computing, vol. 6, pp. 86-93, 2002.
[11] M. Gudgin, M. Hadley, N. Mendelsohn, J.-J. Moreau, and H. F. Nielsen, "SOAP Version 1.2 Part 2: Adjuncts," W3C, 2003.
[12] N. Looker, M. Munro, and J. Xu, "WS-FIT: A Tool for Dependability Analysis of Web Services," presented at The 1st Workshop on Quality Assurance and Testing of Web-Based Applications, COMPSAC, Hong Kong, 2004.
[13] N. Looker, M. Munro, and J. Xu, "Simulating Errors in Web Services," International Journal of Simulation Systems, Science & Technology, vol. 5, 2004.
[14] N. Looker and J. Xu, "Assessing the Dependability of OGSA Middleware by Fault Injection," Proceedings of the Symposium on Reliable Distributed Systems, pp. 293-302, 2003.
[15] J. Voas, "Fault Injection for the Masses," Computer, vol. 30, pp. 129-130, 1997.
[16] N. Looker, M. Munro, and J. Xu, "A Comparison of Network Level Fault Injection with Code Insertion," presented at 29th Annual International Computer Software and Applications Conference (COMPSAC'05), Edinburgh, Scotland, 2005.
[17] N. Looker, B. Gwynne, J. Xu, and M. Munro, "An Ontology-Based Approach for Determining the Dependability of Service-Oriented Architectures," presented at 10th IEEE International Workshop on Object-oriented Real-time Dependable Systems, Sedona, USA, 2005.
Finally, since many users were using ssh based login, we need to migrate them off that: they must get certificates, the certificates must grant them access to the same resources that they could access via ssh, and in particular, their identity must be preserved; i.e., we need to map between the ssh public key and the certificate public key.

It is clear that identity management is an essential part of SSO. Moreover, identity management is quite complicated, particularly over longer timescales. We will touch upon the subject briefly in the next section and in the Architecture section below, but otherwise it is outside the scope of this paper.

1.3 Integrating SSO

As mentioned above, the terminal work integrates with other site SSO efforts. The user offices of the CCLRC facilities will manage the user accounts for both local and external users in CCLRC's staff database. The challenge for the facilities is to populate the database with consistent data, because a given user's details may not be up to date, or the user may be present more than once in the database; for example, the user may have registered with more than one experiment.

For our purpose, we need to add the Grid identity information (mainly the Distinguished Name (DN)) to this database. Populating it initially is already a challenge, because 200 personal certificates have been issued to CCLRC and Diamond, and need to be matched to the Active Directory id (or Kerberos name). Strictly speaking, each user must do it for themselves by proving both their identities to the database; or, more precisely, an agent running on behalf of the user, able to ask the user for proof of both identities (or pick up SSO versions of both), must contact a server which can pick up both tokens and update the database.

1.4 Other Work

In this section we briefly describe other efforts in the area of SSO. The simplest example of SSO is a site that is fully "Kerberised," where Kerberos authentication grants access to all resources. This doesn't immediately support roaming users, but it does provide SSO within the site. For Grid use such a site will have to implement credential conversion via MyProxy [5] or KCA (which provide credential conversion, generating short-lived certificates for users presenting Kerberos tickets). This is the approach taken by Fermilab. A KCA converts a Kerberos ticket to a short-lived X.509 certificate (similar to a Globus proxy), but these certificates cannot be used at many external sites because KCAs are not fully trusted compared to classic CAs [6] (i.e., CAs that issue long-lived end-entity certificates), although this is slowly improving. Another major disadvantage is that this approach does not scale: current Grid middleware needs all CAs in all hierarchies installed along with their signing policies and CRL endpoints. If each institution had its own CA, international collaborations would be overwhelmed by the sheer number of CAs. The same problem arises in the trust context: each CA has to negotiate trust with peer CAs and international resource providers, and they would become overwhelmed if there were more than one CA per country. One could argue that it should be sufficient to review the root's policy, but most resource providers need to know details of each individual site's credential conversion policy: who can get an account, how is the DN traced back to the identity, etc.

Another approach was taken by Brookhaven National Laboratory. Like CCLRC, they were not keen on having disparate ssh and Grid key management infrastructures in addition to their site Kerberos infrastructure. They integrated a One Time Password (OTP) system with parts of their site infrastructure [7]; they found it was simple to integrate, but the cost was relatively high since users had to have hardware OTP tokens.

2 Architecture

Figure 1 gives an overview of the architecture of the CCLRC SSO solution. In the left half of Figure 1 the site authentication system is employed, which is the only authentication infrastructure users are concerned with. Depending on the implementation of SSO, the site authentication token can be anything from a password to a Kerberos [4] ticket, and correspondingly the user may (in the case of passwords) or may not (in the case of a Kerberos ticket) have to be actively involved in supplying the token. From the user's point of view, the difference is whether they have to type their password only once per session (e.g., when logging on to a computer, case 1 from Section 1.2), or whether they have to type it again when accessing the user interface (case 3); but at least then it is the same username and password as the federal one. In either case, in the architecture, the token is passed on to the MyProxy [8] server by the user interface, which then checks its validity with the site authentication infrastructure.

Case 2 from Section 1.2 is subtly different. Users who already have eScience certificates must upload an unencrypted proxy to the MyProxy server, typically once every week. This requires authentication, so the MyProxy upload tool must be integrated with SSO.
Figure 1: User, User Interface, Head Node, Worker Nodes (WN); key: X509 proxy
They may have a passphrase for the browser's keystore and another one protecting the private key if the credentials are exported from the browser. These passphrases are independent of the SSO infrastructure; indeed, it is possible to leave the browser keystore or the exported key entirely unprotected (of course this violates the CA's policy).

In any case, users need to unlock their credentials, and to authenticate to the MyProxy server, using any method, to upload their proxy. Alternatively, the user interface can directly use the local X.509 credentials (either in the browser, or exported) for authentication of the user. In this case the user will have to type the passphrase that unlocks the private key.

Grid access always requires an X.509 credential. For users without UK eScience certificates, we solve the SSO to the Grid problem by generating short-lived, low-assurance (see below) credentials within the MyProxy server.

The right half of Figure 1 is within the Grid Security Infrastructure (GSI) domain and uses X.509 certificates [9] and proxies. Within the SSO system described here, the MyProxy server is the source of these credentials. When a user correctly authenticates, the MyProxy server will generate a proxy for the user, based on either the user's real UK eScience certificate (if it has been previously uploaded) or a short-term, low-assurance certificate generated by its internal CA. The user interface can then use this proxy to authenticate the user to the head node of a resource and can also further delegate the certificate to the head node to allow other resources to be used.

2.1 Assurance

Assurance, in this context, is a rough measure of the extent to which an authentication token can be trusted to represent the user's true, "real life" identity. The Grid community defines four assurance levels [10]:

Rudimentary: there is essentially no proof of the user's identity;

Basic: the applicant provides proof of identity to an agent of the issuing authority via a reasonably secure method, but may not have to appear in person;

Medium: the applicant must appear in person before the agent, presenting adequate photographic proof of identity;

High: essentially as medium, except that at the time of renewal the applicant must present photo id again, and there are greater requirements regarding the protection of the user's private key.

To be internationally approved, the eScience CA has to operate to the medium assurance level. The CA and its verifying agents form a well-defined entity that can account for the flow and subsequent use of personal data throughout, from the first application to the final certificate expiry. In the terminology of the Data Protection Act [11], the CA (more specifically, the CA manager) is the data controller: the person who defines which data is necessary and required, what it is used for, and how it should be stored and managed. Furthermore, in the case of a problem such as misuse of a resource, the resource admin can report the user's DN to the CA and the CA can prove, using its personal data, who the user is in "real life."

With a credential conversion service, managing the user's data is externalised: it is in the hands of some other authority, external to the issuing authority and its agent, not just at the time of identity verification but throughout the "lifetime" of the data. It
thus becomes harder for the service to even define, out, but Shibboleth is not yet integrated with the Grid
and much less guarantee, its assurance level. Some middleware.
sites can generate authentication tokens from their
payroll databases, but those are the minority. Most
sites’ databases have visitors and contractors, etc.
3 Implementation
Worse yet, in the misuse case mentioned above,
For Grid SSO in CCLRC we have leveraged the Ker-
suppose a resource admin reports the DN to the cre- beros/Active Directory [3] security system already
dential conversion service admin, but the latter can-
present to allow users to log in using either a Ker-
not provably tie the site authentication token to the
beros token or their site password. This addresses
user’s real life identity. Only people with appropri-two of the SSO use cases (one and three) mentioned
ate access to the site database (usually HR staff) can
in Section 1.2. The different schemes are employed
do that, and they probably will not, for data protec-based on whether or not the user has a valid Kerberos
tion reasons, because the resource is external to their
ticket present.
site. This aspect may make resource administrators
The SSO solution also must support users that
less inclined to trust the credential conversion ser-
have an X.509 certificate locally (e.g., as issued by
vice.
Since we cannot say, even for our own site, what is the assurance defined by the site staff database, we, somewhat arbitrarily, refer to certificates created by the SSO service (from the MyProxy CA) as "low-assurance". The reader may find it helpful to compare it to basic assurance: in some sense, MyProxy is an agent which receives and validates an authentication token from the applicant, and issues a certificate.

Another potential problem with the SSO architecture is that each site has its own credential conversion service. This means that rather than reviewing and installing a single point of trust, the resource admin must now review the policies of tens of services, and install them individually.

In theory, if all sites' services could agree to have:

• a common policy of adequate level (the policy will essentially be the "lowest common denominator," i.e., loose enough to accommodate all sites), and

• a common root certificate, with each site's service CA issued by that root,

then, because these services do not issue CRLs, it should be possible just to trust the common root. However, on the Grid, Globus still requires the signing policy files, to perform extra validation of the namespaces.

Thus, installing a credential conversion CA for each site may be possible in a single (small-ish) country, or in projects with few collaborating sites, but on a larger, or international, scale it becomes difficult.

For these reasons, and also to provide host and service certificates, an SSO infrastructure cannot remove the need for a national CA. Even for a project the size of NGS, work is required to make future SSO infrastructures compatible; some of this work will be done in the context of the Shibboleth roll-out.

… the e-Science CA). This is the second use case mentioned in Section 1.2, and is addressed by allowing users to access certificates stored in a variety of formats.

The e-Science CA is a web-based CA: when users download their personal certificate, it is stored in the browser, and usually exported from the browser in PKCS#12 format. For Grid work, users would typically have to convert this format into PEM format at the command line, using relatively complex OpenSSL commands. However, this conversion is not required in the SSO project: it is possible to use the certificate in both formats, PKCS#12 and PEM. It is even possible to use the certificate directly from the user's browser, thus making the exporting stage unnecessary. This once again follows the SSO philosophy, allowing users to authenticate to Grid resources with a minimum of effort.

The main components involved in the CCLRC SSO system are the MyProxy server and the Java GSI-SSHTerm, which forms the user interface component from Figure 1.

3.1 MyProxy Server

The standard MyProxy server distribution comes with both Kerberos (using SASL authentication) and CA support as compile-time options. Two MyProxy server instances on the same machine are employed for our SSO solution, sharing the same certificate store. Both also have the built-in CA support enabled, again issuing certificates from the same root certificate. The first is a non-Kerberos MyProxy server (using the normal MyProxy port of 7512); this is linked through a PAM module to the site authentication system for site-password access to the SSO facilities. A normal MyProxy server also allows users to upload real e-Science certificate proxies to the server using the standard tools. The second MyProxy server runs on port 7513 and uses Kerberos/SASL authentication.
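Section 3.1 goes on to describe a small call-out program that the MyProxy CA uses to map a site user id (looked up in Active Directory over LDAP) to a DN of the form /C=UK/O=eScienceSSO/OU=CCLRC/UID=uid/CN=firstName surname, which resource administrators then list in the resource's grid-mapfile. A minimal Python sketch of that mapping; the record field names and the example account are illustrative, not the real AD schema or call-out interface:

```python
def make_sso_dn(record):
    """Construct a DN in the layout issued by the SSO MyProxy CA:
    /C=UK/O=eScienceSSO/OU=CCLRC/UID=uid/CN=firstName surname.
    The field names ("uid", "firstName", "surname") are illustrative."""
    return ("/C=UK/O=eScienceSSO/OU=CCLRC"
            "/UID={uid}/CN={firstName} {surname}".format(**record))

def grid_mapfile_line(record, local_account):
    """A grid-mapfile entry pairs the quoted DN with a local account name."""
    return '"{0}" {1}'.format(make_sso_dn(record), local_account)

line = grid_mapfile_line(
    {"uid": "jb12", "firstName": "Joe", "surname": "Bloggs"}, "jb12")
# line == '"/C=UK/O=eScienceSSO/OU=CCLRC/UID=jb12/CN=Joe Bloggs" jb12'
```

Resource administrators who import the MyProxy CA root certificate would add lines of this shape for each legitimate SSO user.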
276
Proceedings of the UK e-Science All Hands Meeting 2006, © NeSC 2006, ISBN 0-9553988-0-0
When a low-assurance certificate is required, the MyProxy servers call out to a small program which is used to map a site user id to a DN. To do this it uses LDAP access to the site Active Directory and constructs the DN as follows: /C=UK /O=eScienceSSO /OU=CCLRC /UID=uid /CN=firstName surname. Resource administrators must explicitly trust the MyProxy CA by importing its root certificate to the resource and adding the SSO DNs of legitimate users to the resource's grid-mapfile.

3.2 Java GSI-SSHTerm

The Java GSI-SSHTerm (http://www.grid-support.ac.uk/content/view/81/62/) is used as the user interface in our SSO solution, providing GSI-SSH access to grid resources through a platform-independent Java SSH terminal. The original Java SSH terminal is an open source project hosted on Sourceforge. Jean-Claude Côte at NRC-CNRC then developed a GSI module for the SSH terminal, which we have extended and improved. The GSI module and our additions rely on the Globus Java Commodity Grid (CoG) libraries [12]. For the SSO project, the MyProxy parts of the CoG libraries were modified to support the Kerberos-based MyProxy servers through the addition of support for the SASL authentication method. The Java GSI-SSHTerm uses Kerberos authentication if the user has a Kerberos ticket, and otherwise allows users to supply their password to access the MyProxy servers. However, making use of the Kerberos token requires Java 1.6, and only Java 1.4 and 1.5 are widely available. We have successfully used a beta release of Java 1.6.

Although this new functionality is exposed through the Java GSI-SSHTerm, it could be used with any other Java-based grid user interfaces and tools, such as a proxy upload tool.

In addition, as previously mentioned in Section 3, the Java GSI-SSHTerm allows direct access to real e-Science certificates in a number of ways.

3.3 Renewal

When the terminal's proxy expires, the user is prompted to re-authenticate, either via MyProxy or via the user's local credentials. For longer-running sessions where the user stays logged in but is absent from the terminal, command output is not lost: the terminal prompts the user for authentication and continues to display the command output, but the user cannot type anything into the terminal until the authentication is successful. Of course, if the user authenticates with a local Kerberos ticket, it is sufficient that the terminal can pick that up. On Windows, for example, Windows renews the ticket automatically when necessary, and the terminal re-authenticates the user without any user intervention.

The proxy on the remote host, the Grid head node into which the user logs in via the terminal, can also expire. This proxy is created initially by the terminal, but is not renewed by the terminal in the current implementation. The user will have to re-authenticate to the MyProxy server to get a new proxy out of it.

3.4 File transfer

Grid access is more useful if the user can transfer data from and to the Grid head node within the session. The Java GSI-SSHTerm includes a graphical transfer interface that the user can use to upload and download files using the SFTP protocol (over GSI).

3.5 X forwarding and VNC client

The Java GSI-SSHTerm is capable of secure "X forwarding" to an X Window System server running on the same computer as the Java GSI-SSHTerm program. In other words, programs running on the head node can display their windows on the user's desktop machine.

Alternatively, users can use a Java VNC client built into the program to connect to a VNC server session running on the head node.

3.6 CCLRC Changes to the GSI-SSHTerm

To bring the SSHTerm and the GSI module into a form useful for Grid users, we have had both to add a number of features and to perform hardening and bug-fixing throughout the code.

Within the main body of the SSHTerm code, changes included new build scripts; large changes to the X forwarding code to allow correct authentication under UNIX and to allow UNIX domain sockets to be used; improvements to protocol conformance in the choice of key exchange, cypher, compression, etc., algorithms (taking the responsibility for this bewildering decision from the users); and a redesign of the open connection dialog to allow easy connection for users connecting to Grid resources using default settings similar to default SSH settings.

In the area of GSI support, the original authentication method implemented by Jean-Claude Côte could not authenticate users who did not know their username. To enable logon without a specified username we had to implement the "external-keyx" authentication algorithm. In addition, this authentication algorithm requires the "gss-group1-sha1-*" key exchange algorithm, which we also implemented. Other work in the area of GSI support included porting Jean-Claude Côte's code to a later version of
the CoG kit and implementing methods to obtain proxies from additional sources (MyProxy servers, PKCS#12 files and browsers), and integrating these methods into a consistent and unified user interface.

These changes to the GSI-SSHTerm are in addition to the changes to the CoG libraries and the GSI-SSHTerm to allow Kerberos authentication to MyProxy servers.

3.7 Deployment

As we write this, our SSO solution is in the process of being rolled out within CCLRC, and pilot users have been accessing the CCLRC SCARF cluster through the SSO solution using auto-generated low-assurance certificates. Further, through the Diamond-CCLRC SSO committee there are plans for our SSO solution to be used across the Diamond and ISIS facilities. Users can already use the SSO solution as a fast and convenient way of accessing any Grid resource (including the NGS) for which they are authorized, once they have uploaded a real UK e-Science proxy to the SSO MyProxy server.

4 Security Evaluation

From the point of view of the Resource Manager, our SSO solution is an implementation of the Grid Security Infrastructure specification as described by Von Welch et al. [13]. Trust is established upon presentation by the user of an X.509 identity proxy certificate [14, 2]. As with choosing to trust a classic CA, the Resource Manager must choose to trust the SSO CA or another classic CA to enable SSO access for a user by installing the root certificate of the CA. Once this is done, the user is granted access to the resource once their DN is registered in the normal way.

4.1 SSO MyProxy Server

Our SSO solution requires short-lived X.509 proxy certificates (of real UK e-Science certificates) to be stored on a MyProxy server unencrypted, so that authentication is performed transparently. The MyProxy server thus represents a potential security problem if it is not adequately secured. However, it must be noted that the lifetime of the proxy certificates stored on this server may be set to a short period of time. Ideally this should be no longer than the lifetime of the original authentication credential used to generate the proxy certificate, but this may not be practical or useful. In the case of users with certificates from a classic CA, this is the lifetime of their long-lived X.509 certificate. In the case of users without such a certificate, this lifetime is the lifetime of the user's Kerberos token (generated, in the case of CCLRC, by the Microsoft AD Key Distribution Centre). It should be noted that this is 10 hours by default; as such, using our SSO solution without long-lived X.509 certificates may not be beneficial to users who need to launch Grid jobs that are likely to last longer than 10 hours.

4.2 Communication Channels

All communication between the user, the MyProxy server or MyProxy CA, the site authentication structure and the Grid resource provider is performed over secured channels. This is done by making use of either Kerberos tokens belonging to the user or the MyProxy server, or X.509 certificates in the case of communication between the resource provider and the MyProxy server.

Using our SSO solution alleviates the requirement for users to authenticate themselves by entering a password if the user already possesses a Kerberos token, which could be transparently generated during login to the user's computer. This happens automatically with the Windows operating system if the computer is registered on the domain. In the case of Linux, this can be made possible using a solution such as Vintella. If this is not the case, a user can still gain access to the Grid resource provider, albeit not transparently, using our solution by providing their AD login details to the MyProxy CA server, which then retrieves a Kerberos token by calling out to the AD service solely for the purpose of generating a short-term proxy certificate.

4.3 Java GSI-SSHTerm

The Java GSI-SSHTerm user interface is distributed both as a trusted applet and as a standalone application. In both cases the program needs read, and possibly write, access to the local disk. If the user has their certificate stored locally on disk or in a browser, the program needs to read files and libraries locally, which could include the execution of native code. If run for the first time on a computer, the UK e-Science root certificate and signing policy are installed in the default location on that computer, if these files have not already been installed. This is to facilitate running other Grid or CoG kit software in the future.

When run as an applet, such disk access can only be performed if the applet has been given explicit permission to do so by the user. This is done by accepting an object signing certificate issued by the UK e-Science CA (the ca-operator certificate). This gives the user the assurance that the program has not been tampered with, e.g. no Trojan has been introduced to gain unauthorized access to the user's certificate. We plan to investigate the use of Java WebStart technology in the future to give a similar level of assurance to users when running the program as a standalone application.

4.4 Certificate Security

The existence of inadequately protected X.509 certificates has in the past been a cause of security concern. Ideally the user should have the minimal number of certificates necessary to work with the Grid, and have a copy backed up in a secure offline location. Our SSO solution alleviates the need to have long-term certificates issued by a classic CA to make use of local Grid facilities. However, if a user needs a long-term certificate then the GSI-SSHTerm provides the additional security option of having the user's only certificate in their browser.

Unlike generating a proxy certificate in the traditional way using the MyProxy client tool, proxy certificates generated by the GSI-SSHTerm or retrieved by a MyProxy server are not stored as a file but rather in resident memory. This minimizes the possibility of a proxy certificate being intercepted on the client machine.

We assert that if reasonable effort is spent safeguarding the deployment of this solution and educating users to use it in a responsible manner, our SSO solution using the Java GSI-SSHTerm is no less secure than existing Grid access mechanisms.

4.5 Password Security

An SSO infrastructure potentially improves the security of the authentication infrastructure:

• If users may remember fewer passwords – ideally, a single one – they are less likely to write them on notes and stick them to their monitor.

• If users are forced to remember more than one password, because systems are not integrated, they are very likely to reuse their passwords. This is particularly bad in the case of the federal (Active Directory or Kerberos) password, because site policy dictates that this password must be stored only in the central service (KDC). If users reuse this password, storing it in potentially less secure places, then site security may be compromised.

• The password is validated by a central service, which means that it can be checked for strength, according to site policy. This is not true for the password that protects the private key, or the browser's keystore, but at least those do not, hopefully, leave the user's desktop machine.

The security improvement is likely to be greater for security-novice or non-technical users who are not used to remembering complicated passwords.

5 Future Work

The work in SSO is progressing in a number of directions. As mentioned in Section 3.2, the CoG libraries, which have been modified to support SASL/Kerberos, can be used in many other contexts. It is key that the vast majority of site Grid resources support single sign-on, in a consistent way, if the wider single sign-on project at CCLRC is to be successful; therefore work is in progress to use this technology in other Grid-related user interfaces, for example with the Grid portals work at Daresbury Laboratory.

An additional area of future work is integrating Shibboleth into our SSO framework. JISC will be deploying a Shibboleth infrastructure [15] for UK academia. Meanwhile, we can build on work done in the various Shibboleth projects concurrent with our work, such as the ShibGrid project at Oxford/CCLRC and the Shebangs project at Manchester (see [16] for details of these and other Shibboleth/Grid projects). The ShibGrid project aims to provide Shibboleth access to the NGS, and its architecture is similar to that of Figure 1, with Shibboleth cookies instead of Kerberos tokens. Thus, at least in theory, it should be easy to integrate Shibboleth authentication into the terminal. One potential complication is that the current versions of Shibboleth are entirely web-based: they depend on the HTTP protocol and redirects, etc.

6 Acknowledgments

The authors wish to thank the GOSC (Grid Operations Support Centre) for funding this work. They also wish to express their gratitude to all early users within CCLRC, and in particular those of the CCLRC SCARF cluster, who, led by Dr Duncan Tooke, have already put the terminal into production.

This document is typeset with LaTeX2e.

7 Conclusion

The usual definition of Single Sign-on (SSO) is often too narrow to be useful. In this paper, we start with what we see as a more useful definition of SSO, based on requirements gathered from users who already use our Grid resources. Working to those definitions, we provide an implementation of a portable terminal to access Grid resources. We have taken an open
source project with GSI extensions, and integrated it with the site infrastructure using MyProxy. Our own software developments and additions to the terminal are of course also released under an open source licence. We have increased the ease of installation by deploying it both as a tarball and as an applet.

We have described the architecture and how it ties the Grid and the site authentication infrastructure together. We have also described how it fits into the larger picture, with a national CA and a Shibboleth infrastructure, and why we see a need for all three (a CA, the Shibboleth infrastructure, and site credential conversion) in the future.

References

[1] David Byard and Jens Jensen. Single sign-on to the grid. In Proceedings of the 2005 UK e-Science All Hands Meeting, September 2005.

[2] S. Tuecke, D. Engert, I. Foster, M. Thompson, L. Pearlman, and C. Kesselman. Internet X.509 public key infrastructure proxy certificate profile. Request for Comments (RFC) 3820, June 2004.

[3] Robbie Allen, Joe Richards, and Alistair G. Lowe-Norris. Active Directory. O'Reilly, Sebastopol, California, USA, 3rd edition, January 2006.

[4] Jennifer G. Steiner, Clifford Neuman, and Jeffrey I. Schiller. Kerberos: An authentication service for open network systems. In Proceedings of the USENIX Winter Conference, Dallas, Texas, USA, pages 191–202, February 1988.

[5] Jim Basney, Marty Humphrey, and Von Welch. The MyProxy online credential repository. Software: Practice and Experience, 35(9):801–816, July 2005.

[6] EU Grid Policy Management Authority. Classic authentication policy profile. Available at http://www.eugridpma.org/igtf/, Sep 2005. (Version 4.03.)

[9] S. Santesson and R. Housley. Internet X.509 public key infrastructure authority information access certificate revocation list (CRL) extension. Request for Comments (RFC) 4325, December 2005.

[10] Randy Butler and Tony Genovese. Global grid forum certificate policy model. Available at http://www.ggf.org/documents/GFD.16.pdf, Jun 2003.

[11] Data Protection Act 1998. Available as http://www.legislation.hmso.gov.uk/acts/acts1998/19980029.htm, March 2000.

[12] Gregor von Laszewski, Jarek Gawor, Peter Lane, Nell Rehn, Mike Russell, and Keith Jackson. Features of the Java commodity Grid kit. Concurrency and Computation: Practice and Experience, 14:1045–1055, 2002.

[13] Von Welch, Frank Siebenlist, Ian T. Foster, John Bresnahan, Karl Czajkowski, Jarek Gawor, Carl Kesselman, Sam Meder, Laura Pearlman, and Steven Tuecke. Security for Grid services. In 12th International Symposium on High-Performance Distributed Computing (HPDC-12), Seattle, WA, USA, pages 48–57, 2003.

[14] CCITT recommendation X.509: The directory - authentication framework. CCITT Blue Book, volume VIII, pages 48–81, 1988.

[15] Tom Scavo and Scott Cantor. Shibboleth architecture technical overview. Internet2 document: draft-mace-shibboleth-tech-overview-02, June 2005. Available at http://shibboleth.internet2.edu/docs/draft-mace-shibboleth-tech-overview-latest.pdf.

[16] Von Welch. Report for the GGF 16 BoF for grid developers and deployers leveraging shibboleth. Summary of BoF sessions at GGF 16, February 2006. Available at http://grid.ncsa.uiuc.edu/papers/GGF16-Shib-BOF-Report.pdf.
Mark Norman,
University of Oxford
1. Abstract
The findings in this paper represent some of the output of the ESP-GRID project following the
consultation of current grid users regarding the future nature of grid computing. The project
found that there was a clear purpose for Shibboleth in a future grid and that, for the majority of
users, this would be secure and improve their experience of grid computing. Client-based PKI
remains suitable and desirable for Power Users and we must be careful of the means by which
we mix these two access management technologies. PKI is currently used to define grid
identities but these are problematically conflated with authorisation. The grid community
should work harder to separate identity/authentication and authorisation. This paper also
questions whether we need identity to be asserted throughout grid transactions in every use case.
Currently, this is a solution to a security requirement: it should not be a requirement in itself.
We propose that the grid community should examine methods for suspension of a rogue user’s
activities, even without identity being explicitly stated to all parties. The project introduced the
concept of a Customer-Service Provider model of grid use and has produced demonstrators at
the University of Glasgow.
Therefore, most activities on a grid may pose a much lower threat to grid machines than the activities that dominate today. For the sake of this paper, let us assume that we have a mixed economy of users: some exerting relatively deep control over distant grid machines, and many users with little scope or interest in modifying the computing environment or how the jobs run on the grid.

2.3. Sections of this paper

Within this paper, we examine identity management and who is best placed to take on this task. This is followed by an examination of the perceived requirement for constant identity assertion throughout the grid, and the 'case for Shibboleth'.

3. On the grid, is it appropriate to devolve identity management?

Traditionally, user 'identities' have been managed in the higher education community on a per-institution (organisation) basis. There has been little drive to be very rigorous about checking real-world identity accurately when issuing identity credentials at such organisations for the first time, although it is likely that these procedures have been better than many believe. Those working in a stricter (usually PKI) culture may consider these procedures to be inferior. However, there are strengths and weaknesses to both the (usual) PKI approach and to the per-institution approaches.

3.1. In the UK, we are already trusting the old ID-establishment processes

At present, an applicant for a digital certificate only needs to present some form of photo ID (undefined in the UK e-Science Grid CA - Certificate Policy and Certification Practices Statement). Usually this is taken as a person's university card. This means that we are trusting the procedures for issuing the university card in the first place. It follows that the original choice of client-based PKI for grid security is somewhat flawed, as the strongest part – the greatest benefit – of PKI, the establishment of a long-term, highly trustworthy ID, is compromised. This is an argument for another place, however.

A further difficulty with mixing old university procedures and newer, very centralised, PKI procedures can be summed up in this scenario:

    Post-grad A. Newman begins work at Cotswolds University. He finds he needs a digital certificate for some of his grid-based research. He talks to his local registration personnel about this, who know nothing of the "grid", and then finds he has to travel to his local Registration Authority at Oxford University, after applying on-line. He travels to Oxford and presents his 'Cotswolds Card' and the RA grants his certificate request.¹ A little later, it turns out that Newman is a thief and a fraudster, and Cotswolds University revokes all of his university accounts, swipe cards etc. Unfortunately, the good registration folks at Cotswolds don't have anything to do with e-Science (they haven't been on the RA training course) and therefore Newman is allowed to keep his digital certificate for the rest of the year.

It is clearly better that the registration or personnel people closest to the user should look after the identity of that user. PKI is usually over-centralised and managed at a very remote, often national, level, as in the UK. This is highly problematic. This case …

¹ Assume that the "Cotswolds Card" is his university ID. However, a lovely twist to this story would be that he (theoretically) could have used a "Cotswolds Card" that was issued by his local swimming pool (with inadequate ID checks), but that still contained his photograph.
… better, and points out how good security may seem counter-intuitive until looked at in depth. With regard to the use of ID, he rightly suggests that if you make something easier (i.e. lower security) when ID is used, then the bad guys will just get ID. And the rest of us are left not paying enough attention to security, because the ID has given us a false sense of security.

• an investigation into the activities of the individual.

These requirements are expanded upon and tied to the Customer-Service Provider model (see 5.1 below).

5. The case for Shibboleth
… why long-term identity tokens could not be issued. This would then mean that authentication could take place in a variety of places.

• Authorisation attributes may also be held with virtual organisations, and a secondary query may be necessary.

Nevertheless, in the first exception cited above, it may still be most convenient for SEUs to be authenticated at their home organisation (IdP) for single sign-on reasons. Similarly, it may be convenient for the virtual organisation to allow authentication at the IdP or the SP before releasing the attribute information.

If we put the above two exceptions aside, then Shibboleth (http://shibboleth.internet2.edu/) is a good fit for devolving authentication and much of the management of authorisation attributes. Shibboleth would provide a useful single sign-on (like) experience for the user: he would only need to authenticate at his home organisation. This would benefit him in terms of having to learn only one sign-on interface, and would place the task of managing identities and attributes with the most appropriate organisation.

… grid can interoperate, but it avoids the issues of supporting Power Users. These issues may be unimportant unless the number of Power Users grows greatly. The need for something very easy to use for good uptake by researchers was discovered early on in the BRIDGES project, in particular by the developers at Glasgow, both in terms of access management (and the need to avoid client digital certificates) and in a clean, easy, "Google-like" interface (Sinnott, 2006).

5.3. Most users are not Power Users

Missing from Figure 1 are the other types of grid user (described and discussed in Norman, 2006). These include the most common type of user that exists today. Such users, who typically work at the command line, write and/or compile code and often wish to modify the environment at a remote grid node, we have termed 'Power Users' in the ESP-GRID project. Power Users are likely to be able to tolerate the difficulties of working with PKI, and any security advantage derived from the use of PKI is of benefit to the grid, as such users pose a greater threat to an individual grid node.
approaches (e.g. GridShib) mandate the use of a certificate for authentication, but then use a Shibboleth AA for authorisation purposes. Some of these projects are also …

… the C-SP model, and Power Users remain the dominant group, Shibboleth may be of limited benefit.

Figure 1: The C-SP model of access to the grid: the SEU is authenticated by the IdP (trusted by the SP) and the SP accesses the grid via a host certificate.
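The trust relationships in the Figure 1 caption can be sketched as follows: the SEU signs on once at an IdP that the SP trusts, and the SP then reaches the grid with its own host certificate, so the SEU never handles a client certificate. This is only an illustrative sketch of the model; all names are hypothetical, and no real Shibboleth or GSI API is used:

```python
# Illustrative names only: this sketches the C-SP trust flow, not a real API.
TRUSTED_IDPS = {"idp.cotswolds.example"}  # IdPs this SP has chosen to trust

def authenticate_at_idp(directory, username, password):
    """The SEU signs on once, at the home organisation (IdP)."""
    return directory.get(username) == password

def submit_via_sp(idp, directory, username, password, job):
    """The SP accepts the IdP's assertion (if it trusts that IdP) and then
    accesses the grid with its own host certificate on the user's behalf."""
    if idp not in TRUSTED_IDPS:
        return None  # assertion from an untrusted IdP is rejected
    if not authenticate_at_idp(directory, username, password):
        return None  # the IdP could not authenticate the SEU
    return {"job": job,
            "grid_credential": "SP host certificate",
            "on_behalf_of": username}

users = {"anewman": "correct horse"}
receipt = submit_via_sp("idp.cotswolds.example", users,
                        "anewman", "correct horse", "blast-search")
```

The point of the model is visible in the rejection path: replacing per-user client certificates with the SP's single, well-protected host credential means revoking a user at the IdP (or the SP) immediately cuts off grid access.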
Tom Goodale¹, Simone A. Ludwig¹, William Naylor², Julian Padget² and Omer F. Rana¹
¹ School of Computer Science/Welsh eScience Centre, Cardiff University
² Department of Computer Science, University of Bath
Abstract
The GENSS project has developed a flexible generic brokerage framework based on the use of plug-in components that are themselves web services. The focus in GENSS has been on mathematical web services, but the broker itself is domain independent and it is the plug-ins that act as sources of domain-specific knowledge. A range of plug-ins has been developed that offer a variety of matching technologies, including ontological reasoning, mathematical reasoning, reputation modelling, and textual analysis. The ranking mechanism, too, is a plug-in; thus we have a completely re-targetable matchmaking and brokerage shell, plus a selection of useful packaged behaviours.
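The signature-based core of this matchmaking (made precise as requirement 1 in Section 3.1: the task must supply at least the inputs a capability needs, and the capability must return at least the outputs the task seeks) can be sketched with an illustrative dict-of-sets service description. The field names and example services below are hypothetical stand-ins for the real mathematical service descriptions the plug-ins operate on:

```python
def signatures_match(task, capability):
    """Requirement 1 from Section 3.1: the task supplies at least the
    inputs the capability needs, and the capability produces at least
    the outputs the task is seeking. Descriptions here are plain dicts
    of type-name sets -- an illustrative stand-in only."""
    return (set(capability["inputs"]) <= set(task["inputs"])
            and set(task["outputs"]) <= set(capability["outputs"]))

# A numerical-integration task and a candidate quadrature service:
task = {"inputs": {"integrand", "lower", "upper"}, "outputs": {"value"}}
quad = {"inputs": {"integrand", "lower", "upper"},
        "outputs": {"value", "error_estimate"}}
# quad offers more outputs than the task asks for, so it still matches.
```

The plug-in matchers (ontological, mathematical, reputation, textual) can each be seen as relaxing or refining this strict test, e.g. by inferring a conventional lower limit of zero rather than requiring it as an explicit input.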
We can therefore observe that each service will have a functional interface (describing the inputs/outputs needed to interact with it and their types) and a non-functional interface (which identifies annotations related to the service made by other users, and performance data associated with the service). Being able to support selection on both of these two interfaces provides a useful basis to distinguish between services.

Even when a service (or a composition of a set of services) has been selected, it is quite likely that their interfaces are not entirely compatible. Hence, one service may have more parameters than another, making it difficult to undertake an exact comparison based just on their interfaces. Similarly, data types used within the interface of one service may not fully match those of another. In such instances, it would be necessary to identify a mapping between data types to determine a "degree" of match between the services. Although the selection or even the on-the-fly construction of shim services is something that could be addressed from the matchmaking perspective [6, 10], we do not discuss this issue further in this paper.

The remainder of the paper is laid out as follows: (i) in the next-but-one section (3) we describe the architecture in detail and the design decisions that led to it; (ii) this is followed by a description of the range of plug-ins that have been developed during the MONET and GENSS projects and that are now being integrated through the KNOOGLE project; (iii) the paper concludes with a survey of related work and a brief outline of developments foreseen over the next year.

2 eScience Relevance

The GENSS project has developed a flexible generic brokerage framework based on the use of plug-in components that are themselves web services. The focus in GENSS has been on mathematical web services, but the broker itself is domain independent and it is the plug-ins that act as sources of domain-specific knowledge. Thus, bespoke matchers/brokers can be created at will by clients, as we demonstrate later in conjunction with the Triana workflow system. Each broker takes into consideration the particular data model available within a domain, and undertakes matching between service requests and advertisements in relation to the data model. The complexity of such a data model can also vary, thereby constraining the types of matches that can be supported within a particular application. Within GENSS and its predecessor project MONET, a range of plug-ins were developed that offer a variety of matching technologies:

… of the service (can also be applied to service composition),

3. Reputation modelling: recommendations from the user's social network are aggregated to rank service recommendations,

4. Textual analysis: natural language descriptions of requests and services are analysed for related semantic information.

The purpose of the framework is to make brokerage technology readily deployable by the non-specialist and, more importantly, an effective but unobtrusive component for e-Science users. Specifically, this indicates the following modes of use:

1. As a pre-deployed broker in a work-flow, demonstrating pre-packaged functionality and utility;

2. As a bespoke broker using pre-defined plug-ins, demonstrating the construction of a new broker from existing services;

3. The authoring and/or packaging of plug-in services other than those described above, demonstrating the flexibility and interoperability of the architecture;

4. The exploration of meta-brokerage, where the web service descriptions of broker components are published and selected by a meta-broker to instantiate new special-purpose brokers.

3 Service-oriented Matchmaking

3.1 Matchmaking Requirements

We begin by reiterating the basic requirements for matchmaking:

1. Sufficient input information about the task is needed to satisfy the capability, while the outputs of the matched service should contain at least as much information as the task is seeking, and

2. The task pre-conditions should at least satisfy the capability pre-conditions, while the post-conditions of the capability should at least satisfy the post-conditions of the task.

These constraints reflect work in component-based software engineering and are, in fact, derived from [23]. They are also more restrictive than is necessary for our setting, by which we mean that some inputs required by a capability can readily be inferred from the task, such as the lower limit on a numerical integration where by convention this is zero, or the dependent variable in a symbolic integration of a uni-variate function. Conversely, a numerical integration routine might
1. Ontological reasoning: input/output constraints are only work from 0 to the upper limit, while the lower limit
translated to description logic and a DL reasoner is used of the problem is non-zero. A capability that matches the
to identify query to service matches, task can be synthesised from the composition of two invoca-
2. Mathematical reasoning: analyses the functional re- tions of the capability with the fixed lower limit of 0. Clearly
lationship expressed in the pre-conditions and effects the nature of the second solution is quite different from the
290
Proceedings of the UK e-Science All Hands Meeting 2006, © NeSC 2006, ISBN 0-9553988-0-0
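The subsumption reading of the two matchmaking requirements in Section 3.1 can be sketched in a few lines. This is an illustrative model only, not the GENSS API: interfaces and conditions are reduced to sets of atoms, so that "X satisfies Y" becomes a simple superset test.

```python
# Illustrative sketch of the Section 3.1 matching rule (not the GENSS API).
# Interfaces and conditions are modelled as sets of atoms, so "satisfies"
# reduces to a superset check.

from dataclasses import dataclass, field

@dataclass
class Description:
    inputs: set = field(default_factory=set)   # parameters required/supplied
    outputs: set = field(default_factory=set)  # results produced/sought
    pre: set = field(default_factory=set)      # pre-conditions
    post: set = field(default_factory=set)     # post-conditions (effects)

def matches(task: Description, capability: Description) -> bool:
    return (task.inputs >= capability.inputs        # task supplies enough input
            and capability.outputs >= task.outputs  # service returns enough output
            and task.pre >= capability.pre          # requirement 2, first half
            and capability.post >= task.post)       # requirement 2, second half

task = Description(inputs={"f", "a", "b"}, outputs={"integral"},
                   pre={"continuous(f)"}, post={"numeric(integral)"})
capability = Description(inputs={"f", "a", "b"},
                         outputs={"integral", "error_estimate"},
                         pre={"continuous(f)"}, post={"numeric(integral)"})
print(matches(task, capability))  # prints True: an extra output is harmless
```

In the real system a plug-in replaces the superset test with, for example, description-logic subsumption or mathematical reasoning over the pre-conditions and effects.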
Figure 1: Architecture. [The diagram shows an ontology server (OWL), Web Services hosting the matching algorithms, and a UDDI registry holding the mathematical service descriptions, all at Cardiff; an ontology server (OWL), an authorization server, and numeric and symbolic services at Bath; the components communicate over HTTP/SOAP.]
first, but both serve to illustrate the complexity of this domain. It is precisely this richness too that dictates the nature of the matchmaking architecture, because as these two simple examples show, very different reasoning capabilities are required to resolve the first and the second. Furthermore, we believe that given the nature of the problem, it is only very rarely that a task description will match a capability description exactly, and so a range of reasoning mechanisms must be applied to identify candidate matches. This results in:

Requirement 1: A plug-in architecture supporting the incorporation of an arbitrary number of matchers.

The second problem is a consequence of the above: there will potentially be several candidate matches and some means of indicating their suitability is desirable, rather than picking the first or choosing randomly. Thus:

Requirement 2: A ranking mechanism is required that takes into account purely technical aspects (as discussed above in terms of signatures and pre- and post-conditions) as well as quantitative and qualitative aspects—and even user preferences.

3.2 Matchmaking Architecture

Our matchmaking architecture is shown in Figure 1 and comprises the following:
1. The Authorization service and Portal: these constitute the client interface and are employed by users to specify their service request.
2. The Broker and Reasoner: these constitute the core of the architecture, communicating with the client via TCP and with the other components via SOAP.
3. The Matchmaker: this is made up in part of a reasoning engine and in part of the matching algorithms, which define the logic of the matching process.
4. Mathematical ontologies: databases of OWL-based ontologies, derived from OpenMath Content Dictionaries (CDs), GAMS (Guide to Available Mathematical Software) etc., developed during the MONET [15] project.
5. A Registry Service: which enables the storage of mathematical service descriptions, together with the corresponding endpoint for the service invocation.
6. Mathematical Web Services: available on third-party sites, accessible over the Web.

There are essentially two use cases:

Use Case 1: Matchmaking with client selection, which proceeds as follows:
1. The user contacts the matchmaker.
2. The matchmaker loads the matching algorithms specified by the user via a look-up in the UDDI registry. In the case of an ontological match a further step is necessary: the matchmaker contacts the reasoner, which in turn loads the corresponding ontology.
3. The registry is then queried, using the additional match values, to see whether it contains services which match the request.
4. Service details are returned to the user via the matchmaker.

The parameters stored in the registry (a database) are service name, URL, taxonomy, input and output signatures,
pre- and post-conditions. Using contact details of the service from the registry, the user can then call the Web Service and interact with it.

Use Case 2: Brokerage, where the client delegates service selection via a policy statement. This proceeds essentially as above except that the candidate set of services is then analysed according to the client-specified policy and one service is selected and invoked.

Details of the various components of the architecture are discussed in [13].

4 Workflow integration

Workflow-based tools are being actively used in the e-Science community, generally as a means to combine components that are co-located with the workflow tool. Recently, extensions to these tools which provide the ability to combine services that are geographically distributed have also been provided. To demonstrate the use of the matchmaker as a service, we have integrated our Broker with the Triana workflow engine.

4.1 Triana

Triana was initially developed by scientists in GEO 600 [7] to help in the flexible analysis of data sets, and therefore contains many of the core data analysis tools needed for one-dimensional data analysis, along with many other toolboxes that contain units for areas such as image processing and text processing. All in all, there are around 500 units within Triana covering a broad range of applications. Further, Triana is able to choreograph distributed resources, such as web services, to extend its range of functionality. Additional web service-based algorithms have also been added recently to Triana to support data mining [17]. Triana may be used by applications and end-users alike in a number of different ways [21]. For example, it can be used as: a graphical workflow composition system for Grid applications; a data analysis environment for image, signal or text processing; an application designer tool, creating stand-alone applications from a composition of components/units; and, through the use of its pluggable workflow representation architecture, a host for third-party tools and workflow representations such as WSDL and BPEL4WS.

The Triana user interface consists of a collection of toolboxes containing the current set of Triana components and a work surface where users graphically choreograph the required behaviour. The modules are late-bound to the services that they represent to create a highly dynamic programming environment. Triana has many of the key programming constructs such as looping (do, while, repeat until etc.) and logic (if, then etc.) units that can be used to graphically control the dataflow, just as a programmer would control the flow within a conventional program using specific instructions. Programming units (i.e. tools) include information about which data-type objects they can receive and which ones they output, and Triana uses this information to perform design-time type checking on requested connections to ensure data compatibility between components; this serves the same purpose as the compilation of a program for compatibility of function calls.

Triana has a modularized architecture that consists of a cooperating collection of interacting components. Briefly, the thin-client Triana GUI connects to a Triana engine (Triana Controlling Service, TCS) either locally or via the network. Under a typical usage scenario, clients may log into a TCS, remotely compose and run a Triana application and then visualize the result locally – even though the visualization unit itself is run remotely. Alternatively, clients may log off during an application run and periodically log back on to check the status of the application. In this mode, the Triana TCS and GUI act as a portal for running an application, whether distributed or in single execution mode.

4.2 Triana Brokering

To support matchmaking for numerical services in Triana, the broker has been included within Triana in two ways:
1. A service in the toolbox: as illustrated in figure 2, the broker is included as a "search" service within the Triana toolbox. In order to make use of the service, it is necessary for a user to instantiate the service within a workflow, including a "WSTypeGen" component before the service, and a "WSTypeViewer" after the service. These WSType components allow conversion of data types into a form that can be used by a web service. Double clicking on the WSTypeGen component generates the user interface also shown in figure 2, requiring a user to choose a match mode (a "structural" mode is selected in the figure), specify pre- and post-conditions, and the query using OpenMath syntax. Once this form has been completed, hitting the "OK" button causes the search service to be invoked, returning a URL to the location of a service that implements the match. If multiple matches are found, the user receives a list of services and must manually select between them (as shown in figure 3). If only a single service is found, the location of a WSDL file may be returned, which can then be used by a subsequent service in the workflow (if the matchmaker is being used as a broker).
2. A menu item: in this case, the search service is invoked from a menu item, requiring the user to complete a form, as shown in figure 3. The user specifies the query using the form also shown in the figure, and selects a matching mode (a "Structural match" is selected in the figure). The result is displayed in a separate window for the user to evaluate. In this instance, it is not necessary to match any data types before and after a match service, and some additional components will need to
be implemented by the user to achieve this. This mode of use is particularly useful if a user does not intend to use the matchmaker within a workflow, but prefers to investigate services of interest before a workflow is constructed graphically.

Using the matchmaker as a search service allows it to be used as a standard component in Triana. In this case, a user constructs a workflow with a search service as a component in this workflow. During enactment, the search service invokes the matchmaker and returns one or more results. Where a single result is returned (i.e. where the matchmaker is being used as a broker), the workflow continues to the next service in the graph without any requirement for user input. As Triana already allows for dynamic binding of components to service instances, the search service essentially provides an alternative form of dynamic binding.

5 Complexity issues

An important issue for brokerage systems is whether the system scales well with respect to the number of services registered with the system. That is, the number of services should only be a small factor in any expression calculating the complexity of the system. For the brokerage system described in this paper, the complexity costs will be dependent on the complexity costs of the matchmaker plug-ins to the service. Generally, the more powerful the plug-in, the higher the complexity cost. One class of matchmaker plug-ins that we utilise is ontology-based plug-ins; these generally involve some traversal of an ontology and, as such, the costs will depend more on the height than on the absolute size of the ontology. Some figures indicating the actual performance of our system appear in Appendix A.

6 Planned Future Work

A particular system which manages the allocation of jobs to a pool of processors is the GridSAM system [8]. The GridSAM system takes job descriptions given in JSDL [2] and the executables from clients. It then distributes the jobs over the processors which perform the processing, and GridSAM then returns the results to the clients. It also provides certain monitoring services. Grimoires [9] is a registry service for services, which allows attachments of semantic descriptions to the services. Grimoires is UDDIv2 [1] compliant and stores the semantic attachments as RDF triples; this gives scope for attachments with arbitrary schema, including those for describing mathematical services. A restriction that has been identified with the current GridSAM architecture is that it does not incorporate any brokerage capabilities. Furthermore, it appears that the Grimoires registry does not provide the resource allocation provided by GridSAM. A future project will look at integration of these two approaches in such a manner that it can be used in coordination with the architecture described in this paper.

7 Related Work

Matchmaking has quite a significant body of associated literature, so we do not attempt to be exhaustive, but survey just those systems that have been influential or that we believe are especially relevant to the issues raised here, namely architecture, flexibility and matching technologies.

Although generalizations are risky, broad categorizations of matchmaking and brokerage research seem possible using criteria such as domain, reasoning mechanisms and adaptability.

Much of the published literature has described generic brokerage mechanisms using syntactic or semantic techniques, or a combination of both. Some of the earliest systems, enabled by the development of KIF (Knowledge Interchange Format) [5] and KQML (Knowledge Query and Manipulation Language) [20], are SHADE [11], operating over logic-based and structured text languages, and the complementary COINS [11], which operates over free text using well-known term-first index-first information retrieval techniques. Subsequent developments such as InfoSleuth [14] applied reasoning technology to the advertised syntax and semantics of a service description, while the RETSINA system [19] had its own specialized language [18] influenced by DAML-S (the pre-cursor to OWL-S) and used a belief-weighted associative network representation of the relationships between ontological concepts as a central element of the matching process. While technically sophisticated, a particular problem with the latter was how to make the initial assignment of weights without biasing the system inappropriately. A distinguishing feature of all these systems is their monolithic architecture, in sharp contrast to GRAPPA [22] (Generic Request Architecture for Passive Provider Agents), which allows for the use of multiple matchmaking mechanisms. Otherwise GRAPPA essentially employs fairly conventional multi-attribute clustering technology to reduce attribute vectors to a single value. Finally, a notable contribution is the MathBroker architecture, which, like the domain-specific plug-ins of our brokerage scheme, works with semantic descriptions of mathematical services using the same MSDL language. However, current publications [3] seem to indicate that matching is limited to processing taxonomies and that the functional issues raised by pre- and post-conditions are not considered. The MONET broker [4], in conjunction with the RACER reasoner and the Instance Store, demonstrated one of the earliest uses of a description logic reasoner to identify services based on taxonomic descriptions, coming closest to the objective of the plug-ins developed for GENSS in attempting to provide functional matching of task and capability.

In contrast, matching and brokerage in the grid computing
domain has been relatively unsophisticated, primarily using syntactic techniques, such as in the ClassAds system [16] used in the Condor system, and RedLine [12], which extends ClassAds so that match criteria may be expressed as ranges and hence form a simple constraint language. In the Condor system, ClassAds are used to enable computational jobs to find suitable resources, generally using dynamic attributes such as available physical and virtual memory, CPU type and speed, and current load average, together with static attributes such as operating system type, job manager etc. A resource also has a simple policy associated with it, which identifies when it is willing to accept new job requests. The approach is therefore particularly focused on managing job execution on a Condor pool, and configured for such a system only. It would be difficult to deploy this approach (without significant changes) within another job execution system, or one that makes use of a different resource model. The RedLine system allows matching of job requests with resource capabilities based on constraints – in particular the ability to also search based on resource policy (i.e. when a resource is able to accept new jobs, in addition to job and resource attributes). The RedLine description language provides functions such as Forany and Forall to be able to find multiple items that match. The RedLine system is, however, still constrained by the type of match mechanisms that it supports—provided through its description language. Similar to Condor, it is also very difficult to modify it for a different resource model. Our approach is more general, and can allow plug-ins to be provided for both RedLine and Condor as part of the matchmaker configuration. In our model, therefore, as the resource model is late-bound, we can specify a specialist resource model and allow multiple such models to co-exist, each implemented in a different configuration of the matchmaker.

8 Conclusion

We have outlined the development of a brokerage architecture whose initial requirements were derived from the MONET broker, namely the discovery of mathematical services, but with the addition of the need to establish a functional relationship between the pre- and post-conditions of the task and the capability. As a consequence of the engineering approach taken in building the GENSS matchmaker/broker, the outcome has been a generic matchmaking shell that may be populated by a mixture of generic and domain-specific plug-ins. These plug-ins may also be composed and deployed with low overhead, especially with the help of workflow tools, to create bespoke matchmakers/brokers. The plug-ins may be implemented as web services and a mechanism has been provided to integrate them into the architecture. Likewise, the use of web services for the plug-ins imposes a low overhead on the production of new components, which may thus encourage wider authorship of new, shareable, generic and specific matching elements (as reported in the Appendix). This would also provide the basis for defining a policy about how results from multiple matching techniques may be combined.

9 Acknowledgements

The work reported here is partially supported by the Engineering and Physical Sciences Research Council under the Semantic Grids call of the e-Science program (GENSS grant reference GR/S44723/01) and partially supported through the Open Middleware Infrastructure Institute managed program (project KNOOGLE).

References

[1] A.E. Walsh. UDDI, SOAP, and WSDL: The Web Services Specification Reference Book. UDDIORG, 2002.

[2] Ali Anjomshoaa, Stephen McGough, and Darren Pulsipher. Job Submission Description Language WG (JSDL-WG), 2003. Available from https://forge.gridforum.org/projects/jsdl-wg/.

[3] Rebhi Baraka, Olga Caprotti, and Wolfgang Schreiner. A Web Registry for Publishing and Discovering Mathematical Services. In EEE, pages 190–193. IEEE Computer Society, 2005.

[4] Olga Caprotti, Mike Dewar, and Daniele Turi. Mathematical Service Matching Using Description Logic and OWL. In Andrea Asperti, Grzegorz Bancerek, and Andrzej Trybulec, editors, MKM, volume 3119 of Lecture Notes in Computer Science, pages 73–87. Springer, 2004.

[5] M. Genesereth and R. Fikes. Knowledge Interchange Format, Version 3.0 Reference Manual. Technical report, Computer Science Department, Stanford University, 1992. Available from http://www-ksl.stanford.edu/knowledge-sharing/papers/kif.ps.

[6] C. Goble. Putting Semantics into e-Science and Grids. In Proc. e-Science 2005, 1st IEEE Intl Conf on e-Science and Grid Technologies, Melbourne, Australia, 5-8 December 2005.

[7] GEO 600 Gravitational Wave Project. http://www.geo600.uni-hannover.de/.

[8] GridSAM - Grid Job Submission and Monitoring Web Service, 2006. Available from http://gridsam.sourceforge.net/2.0.0-SNAPSHOT/index.html.
[9] Grimoires: Grid RegIstry with Metadata Oriented Interface: Robustness, Efficiency, Security, 2004. Available from http://twiki.grimoires.org/bin/view/Grimoires/.

[10] Duncan Hull, Robert Stevens, Phillip Lord, Chris Wroe, and Carole Goble. Treating "shimantic web" syndrome with ontologies. In John Domingue, editor, First Advanced Knowledge Technologies workshop on Semantic Web Services (AKT-SWS04), volume 122. KMi, The Open University, Milton Keynes, UK, 2004. Workshop proceedings available from CEUR-WS.org, ISSN:1613-0073, http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-122/.

[11] D. Kuokka and L. Harada. Integrating information via matchmaking. Intelligent Information Systems, 6(2-3):261-279, 1996.

[12] Chuang Liu and Ian T. Foster. A Constraint Language Approach to Matchmaking. In RIDE, pages 7–14. IEEE Computer Society, 2004.

[13] Simone Ludwig, Omer Rana, William Naylor, and Julian Padget. Matchmaking Framework for Mathematical Web Services. Journal of Grid Computing, 4(1):33–48, March 2006. Available via http://dx.doi.org/10.1007/s10723-005-9019-z. ISSN: 1570-7873 (Paper) 1572-9814 (Online).

[14] M. Nodine, W. Bohrer, and A.H. Ngu. Semantic brokering over dynamic heterogeneous data sources in InfoSleuth. In Proceedings of the 15th International Conference on Data Engineering, pages 358-365, 1999.

[15] Mathematics on the Net - MONET. http://monet.nag.co.uk.

[16] Rajesh Raman, Miron Livny, and Marvin H. Solomon. Matchmaking: Distributed Resource Management for High Throughput Computing. In HPDC, pages 140–, 1998.

[17] O. F. Rana, Ali Shaikh Ali, and Ian J. Taylor. Web Services Composition for Distributed Data Mining. In D. Katz, editor, Proc. of ICPP, Workshop on Web and Grid Services for Scientific Data Analysis, Oslo, Norway, June 14, 2005.

[18] K. Sycara, S. Widoff, M. Klusch, and J. Lu. Larks: Dynamic matchmaking among heterogeneous software agents in cyberspace. Journal of Autonomous Agents and Multi Agent Systems, 5(2):173–203, June 2002.

[19] Katia P. Sycara, Massimo Paolucci, Martin Van Velsen, and Joseph A. Giampapa. The RETSINA MAS Infrastructure. Autonomous Agents and Multi-Agent Systems, 7(1-2):29–48, 2003.

[20] T. Finin, R. Fritzson, D. McKay, and R. McEntire. KQML as an agent communication language. In Proceedings of the 3rd International Conference on Information and Knowledge Management, pages 456-463, 1994.

[21] Ian Taylor, Matthew Shields, Ian Wang, and Roger Philp. Grid Enabling Applications using Triana. Workshop on Grid Applications and Programming Tools, in conjunction with GGF8. Organized by the GGF Applications and Testbeds Research Group (APPS-RG) and the GGF User Program Development Tools Research Group (UPDT-RG), 2003.

[22] D. Veit. Matchmaking in Electronic Markets, volume 2882 of LNCS Hot Topics. Springer, 2003.

[23] Amy Moormann Zaremski and Jeannette M. Wing. Specification Matching of Software Components. ACM Transactions on Software Engineering and Methodology, 6(4):333–369, October 1997.

A GENSS performance indicators

Performance indicators for the GENSS matchmaker are reported here. An "end-to-end" test, with a user in Bath and a registry in Cardiff, produced a wall-clock time of 8 seconds. A "soak" test to repeatedly run the same search (client/matching process – as a Web Service – on a machine in Bath with the registry in Cardiff) produced the following results:

Structural matcher

Iterations    real          user         sys
1             0m14.495s     0m4.362s     0m0.172s
10            1m2.898s      0m5.766s     0m0.318s
100           9m59.015s     0m12.173s    0m1.566s

This indicates that the amortized elapsed time per query is around 6 seconds using the structural matcher.

Ontological matcher

Iterations    real          user         sys
1             0m22.089s     0m4.786s     0m0.173s
10            1m49.031s     0m5.882s     0m0.299s
100           17m13.894s    0m12.682s    0m1.203s

This indicates that the amortized elapsed time per query is just over 10 seconds using the ontological matcher. Performing the same test as above, but where a command-line Java application was used to search the service registry, Linux system monitoring tools report that, on the machine carrying out the search, the program used approximately 25.4MB and 36.06% CPU.
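The amortized figures quoted in the appendix follow directly from the 100-iteration wall-clock ("real") times; as a quick check:

```python
# Re-deriving the amortized per-query times from the soak-test tables.

def to_seconds(minutes, seconds):
    return 60 * minutes + seconds

structural_100 = to_seconds(9, 59.015)    # 100 iterations, structural matcher
ontological_100 = to_seconds(17, 13.894)  # 100 iterations, ontological matcher

print(round(structural_100 / 100, 2))   # prints 5.99  (around 6 s per query)
print(round(ontological_100 / 100, 2))  # prints 10.34 (just over 10 s per query)
```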
Abstract
We provide a critical evaluation of the different Web Service styles and approaches to data binding and validation for use in 'data-centric' scientific applications, citing examples and recommendations based on our experiences. The current SOAP APIs for Java are examined, including the Java API for XML-based remote procedure calls (JAX-RPC) and Document-style messaging. We assess the advantages and disadvantages of 'loose' versus 'tight' data binding and outline some best practices for WSDL development. For the most part, we recommend the use of the document/wrapped style with a 100% XML Schema compliant data model that can be separated from the WSDL definitions. We found that this encouraged collaboration between the different partners involved in the data model design process and assured interoperability. It also leverages the advanced capabilities of XML Schema for precisely constraining complex scientific data when compared to RPC and SOAP encoding styles. We further recommend the use of external data binding and validation frameworks, which provide greater functionality than those built into a SOAP engine. By adhering to these best practices, we identified important variations in client and experimental data requirements across the different institutions involved with a typical e-Science project.
2.1.1 RPC (applies to encoded and literal)

• An RPC-style WSDL file contains multiple <part> tags per <message>, one for each request/response parameter (10b, 11b).
• Each message <part> tag defines type attributes, not element attributes (message parts are not wrapped by elements as in Document-style WSDL files) (10b, 11b).
• The type attribute in each <part> tag can either: a) reference a complex or simple type defined in the <wsdl:types> section, e.g. <part name="x" type="tns:myType">, or b) define a simple type directly, e.g. <part name="x" type="xsd:int"> (10b, 11b respectively).
• An RPC SOAP request wraps the message parameters in an element named after the invoked operation (2a, 12a). This 'operational wrapper element' is defined in the WSDL target namespace. An RPC SOAP response wraps the message parameters in an element named after the invoked operation with 'Response' appended to the element name.
• The difference between the RPC/encoded and RPC/literal styles relates to how the data in the SOAP message is serialised/formatted when sent 'over the wire'. The abstract parts of the WSDL files are similar (i.e. the <wsdl:types>, <wsdl:message> and <wsdl:portType> elements – refer to Section 6.0). The only significant difference relates to the definition of the <wsdl:binding> element. The binding element dictates how the SOAP message is formatted and how complex data types are represented in the SOAP message.

2.1.2 RPC/encoded

• An RPC/encoded WSDL file specifies an encodingStyle attribute nested within the <wsdl:binding>. Although different encoding styles are legal, the most common is SOAP encoding (http://schemas.xmlsoap.org/soap/encoding). This encoding style is used to serialise data and complex types in the SOAP message.
• The use attribute, which is nested within the <wsdl:binding>, has the value "encoded".
• An RPC/encoded SOAP message has type encoding information for each parameter element. This is overhead and degrades throughput performance (4a, 7a, 8a).
• Complex types are SOAP encoded and are referenced by "href" references using an identifier (3a). The hrefs refer to "multiref" elements positioned outside the operation wrapping element as direct children of the <soap:Body> (6a). This complicates the message, as the <soap:Body> may contain multiple "multiref" elements.
• RPC/encoded is not WS-I compliant [1]; WS-I recommends that the <soap:Body> should only contain a single nested sub-element.

2.1.3 RPC/literal

• The RPC/literal style improves upon the RPC/encoded style.
• An RPC/literal WSDL does not specify an encodingStyle attribute.
• The use attribute, which is nested within the <wsdl:binding>, has the value "literal".
• An RPC/literal SOAP message has only one nested child element in the <soap:Body> (12a). This is because all parameter elements become wrapped within the operation element that is defined in the WSDL namespace.
• The type encoding information for each nested parameter element is removed (14a, 15a).
• RPC/literal is WS-I compliant [1].

The main weakness of the RPC encoding style is its lack of support for the constraint of complex data, and its lack of support for data validation. The RPC/encoded style usually adopts the SOAP encoding specification to serialize complex objects, which is far less comprehensive and functional than standard XML Schema. Validation is also problematic in the RPC/literal style: when an RPC/literal SOAP message is constructed, only the operational wrapper element remains fully qualified with the target namespace of the WSDL file, and all other type encoding information is removed from nested sub-elements (this is shown in Table 1). This means all parts/elements in the SOAP message share the WSDL file namespace and lose their original Schema namespace definitions. Consequently, validation is only possible for limited scenarios, where the original schema elements have the same namespace as the WSDL file. For the most part, however, validation becomes difficult (if not impossible) since qualification of the operation name comes
298
Proceedings of the UK e-Science All Hands Meeting 2006, © NeSC 2006, ISBN 0-9553988-0-0
from the WSDL definitions, not from the individual schema elements defined in the <wsdl:types> section.

2.2 Document Encoding/ Binding Style

In contrast to RPC, Document style encoding provides greater functionality for the validation of data by using standard XML Schema as the encoding format for complex objects and data. The schema defined in the <wsdl:types> section can be embedded or imported (refer to Section 6.1). The following summary examines the key features of the Document WSDL binding style and the format of a resulting SOAP message. Table 1 illustrates these key points (schema examples are numbered and referred to in the text).

2.2.1 Document (applies to literal and wrapped)

• Document style Web services use standard XML Schema for the serialisation of XML instance documents and complex data.
• Document style messages do not have type encoding information for any element (23a, 24a), and each element in the SOAP message is fully qualified by a Schema namespace, either by direct declaration (22a) or by inheritance from an outer element (30a).
• Document style services leverage the full capability of XML Schema for data validation.

2.2.2 Document/ literal

• Document/ literal messages send request and response parameters to and from operations as direct children of the <soap:Body> (22a, 26a).
• The <soap:Body> can therefore contain many immediate child sub-elements (22a, 26a).
• A Document/ literal style WSDL file may therefore contain multiple <part> tags per <message> (19b, 20b).
• Each <part> tag in a message can specify either a type or an element attribute; however, for WS-I compliance, it is recommended that only element attributes be defined in <part> tags for Document style WSDL (19b, 20b).
• This means that every simple or complex type parameter should be wrapped as an element and be defined in the <wsdl:types> section (15b, 16b).
• The main disadvantages of the Document/ literal Web Service style include: a) the operation name is removed from the <soap:Body> request, which can cause interoperability problems (21a), and b) the <soap:Body> will contain multiple children (22a, 26a) if more than one message part is defined in a request/ response message (19b, 20b).
• Document/ literal is not fully WS-I compliant [1]; WS-I recommends that the <soap:Body> should only contain a single nested sub-element.

2.2.3 Document/ wrapped

• An improvement on the Document/ literal style is the Document/ wrapped style.
• When writing this style of WSDL, the request and response parameters of a Web Service operation (simple types, complex types and elements) should be 'wrapped' within single all-encompassing request and response elements defined in the <wsdl:types> section (24b – akin to the RPC/ literal style).
• These 'wrapping' elements need to be added to the <wsdl:types> section of the WSDL file (24b).
• The request wrapper element (24b) must have the same name as the Web Service operation to be invoked (this ensures the operation name is always specified in the <soap:Body> request as the first nested element).
• By specifying single elements to wrap all of the request and response parameters, there is only ever a single <part> tag per <message> tag (32b).
• A Document/ wrapped style WSDL file is fully WS-I compliant [1] because there is only a single nested element in the <soap:Body> (29a).
• Document/ wrapped style messages are therefore very similar to RPC/ literal style messages, since both styles produce a single nested element within a <soap:Body>. The only difference is that for the Document/ wrapped style, each element is fully qualified with a Schema namespace.

The main advantage of the Document style over the RPC style is the abstraction/ separation of the type system into a 100% XML Schema compliant data model. In doing this, several important advantages related to data binding and validation are further realised, which are discussed in the next section.
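The differences between the four binding styles show up directly in the shape of the <soap:Body>. The following self-contained sketch (operation, parameter and namespace names are hypothetical illustrations, not the paper's numbered schema examples) builds the four body variants as strings so their distinguishing features can be compared:

```java
// Illustrative SOAP bodies for the four WSDL binding styles.
// All element, operation and namespace names are hypothetical.
public class BindingStyles {

    // RPC/encoded: operation wrapper plus per-parameter type encoding.
    static String rpcEncoded() {
        return "<soap:Body><ns:getPrice xmlns:ns=\"urn:example\">"
             + "<symbol xsi:type=\"xsd:string\">ABC</symbol>"
             + "</ns:getPrice></soap:Body>";
    }

    // RPC/literal: same operation wrapper, but the per-parameter
    // xsi:type annotations are removed.
    static String rpcLiteral() {
        return "<soap:Body><ns:getPrice xmlns:ns=\"urn:example\">"
             + "<symbol>ABC</symbol>"
             + "</ns:getPrice></soap:Body>";
    }

    // Document/literal: message parts are direct children of the Body;
    // the operation name does not appear in the message.
    static String docLiteral() {
        return "<soap:Body>"
             + "<m:symbol xmlns:m=\"urn:example\">ABC</m:symbol>"
             + "<m:currency xmlns:m=\"urn:example\">GBP</m:currency>"
             + "</soap:Body>";
    }

    // Document/wrapped: a single schema-qualified wrapper element,
    // named after the operation, encloses all parameters.
    static String docWrapped() {
        return "<soap:Body><m:getPrice xmlns:m=\"urn:example\">"
             + "<m:symbol>ABC</m:symbol>"
             + "<m:currency>GBP</m:currency>"
             + "</m:getPrice></soap:Body>";
    }
}
```

Only the RPC/ encoded body carries xsi:type annotations; only the Document/ literal body omits the operation name; RPC/ literal and Document/ wrapped both yield a single child under <soap:Body>, differing only in the namespace qualification of the nested parameters.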
Table 1 - A Comparison of the Different WSDL Binding Styles and SOAP Encoding
Table 2 - Advantages and Disadvantages of Each WSDL Binding Style and SOAP Encoding
RPC/ Encoded
Advantages:
• Simple WSDL
• Operation name wraps parameters
Disadvantages:
• Complex types are sent as multipart references, meaning <soap:Body> can have multiple children
• Not WS-I compliant
• Not interoperable
• Type encoding information generated in SOAP message
• Messages can't be validated
• Child elements are not fully qualified

RPC/ Literal
Advantages:
• Simple WSDL
• Operation name wraps parameters
• <soap:Body> has only one element
• No type encoding information
• WS-I compliant
Disadvantages:
• Difficult to validate message
• Sub-elements of complex types are not fully qualified

Doc/ Literal
Advantages:
• No type encoding information
• Messages can be validated
• WS-I compliant, but with restrictions
• Data can be modelled in separate schema
Disadvantages:
• WSDL file is more complicated
• Operation name is missing in SOAP request, which can create interoperability problems
• <soap:Body> can have multiple children
• WS-I recommends only one child in <soap:Body>

Doc/ Wrapped
Advantages:
• No type encoding information
• Messages can be validated
• <soap:Body> has only one element
• Operation name wraps parameters
• WS-I compliant
Disadvantages:
• WSDL file is complicated – request and response wrapper elements may have to be added to the <wsdl:types> if the original schema element name is not suitable for the Web Service operation name
the type system to define the format of the messages; instead it uses generic data types to express communication messages. Loosely typed services are flexible, allowing Web Service components to be replaced in a 'plug-and-play' fashion. Conversely, a 'strongly typed' Web Service means the WSDL type system strictly dictates the format of the message data. Strongly typed Web services are less flexible, but more robust. Each style influences the chosen approach to data binding, and each has its own advantages and disadvantages, which are summarised in Table 3.

4.1 Loosely Typed Web services

A loosely typed WSDL interface specifies generic data types for an operation's input and output messages (either "String", "Base64-Encoded", "xsd:any", "xsd:anyType" or "Attachment" types). This approach requires extra negotiation between providers and consumers in order to agree on the format of the data that is expressed by the generic type. Consequently, an understanding of the WSDL alone is usually not sufficient to invoke the service.

• "String" loose Data Type
A String variable can be used to encode the actual content of a message's complex data. Consequently, the WSDL file may define simple String input/ output parameters for operations. The String input could be an XML fragment or multiple name-value pairs (similar to a query string). In doing this, the implementation has to parse and extract the data from the String before processing. An XML document formatted as a String requires extra coding and decoding to escape XML special characters in the SOAP message, which can drastically increase the message size.

• "any"/ "anyType" loose Data Type
The WSDL types section may define <xsd:any> or <xsd:anyType> elements. Consequently, any arbitrary XML can be embedded directly into the message, which maps to a standard "javax.xml.soap.SOAPElement". Partners receive the actual XML but no contract is specified regarding what the XML data describes. Extraction of information requires an understanding of raw XML manipulation, for example using the Java SAAJ API. Limited support for the "any"/ "anyType" data type in various SOAP frameworks may result in portability and interoperability issues.

• "Base64 encoding" loose Data Type
An XML document can be transmitted as a Base64 encoded string or as raw bytes in the body of a SOAP message. These are standard data types and thus every SOAP engine handles this data in a compatible fashion. Indeed, embedding Base64 encoded data and raw bytes in the SOAP body is WS-I compliant.

• "SOAP Attachment" loose Data Type
SOAP attachments can be used to send data of any format that cannot be embedded within the SOAP body, such as raw binary files. Sending data in an attachment is efficient because the size of the SOAP body is minimised, which enables faster processing (the SOAP message contains only a reference to the data and not the data itself). Additional advantages over other techniques include the ability to handle large documents; multiple attachments can be sent in a single Web Service invocation; and attachments can be compressed/ decompressed for efficient network transport.

4.2 Strongly typed Web services

A purely strongly typed WSDL interface defines a complete definition of an operation's input and output messages with XML Schema, with additional constraints on the actual permitted values (i.e. Document style with value constraints). It is important to understand, however, that Document style services do not have to be solely strongly typed, as they may combine strongly typed fields with loose/ generic types where necessary. Strong typing is especially relevant for scientific applications, which often require tight control on message values, such as the length of string values, the range of numerical values, and permitted sequences of values (e.g. organic compounds must have Carbon and Hydrogen in the chemical formula, and a rotation angle should be between 0 and 360 degrees).

From our experiences related to e-HTPX [6], the best approach involved mixing the different styles where necessary. For mature Web services, where the required data is established and stable, the use of strong data typing was preferable. For immature Web services, where the required data is subject to negotiation/ revision, loose typing was preferable. We often used the loose typing approach during initial developments and prototyping.
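The kind of value constraint mentioned above for rotation angles can be enforced directly with XML Schema. The following self-contained sketch (the schema and element name are illustrative, not taken from e-HTPX) validates instance documents against a 0–360 range using the standard JDK validation API:

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

// Strong typing via XML Schema value constraints: a hypothetical
// rotation-angle element restricted to the range 0-360 degrees.
public class AngleValidation {

    static final String XSD =
        "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
      + "  <xs:element name=\"rotationAngle\">"
      + "    <xs:simpleType>"
      + "      <xs:restriction base=\"xs:double\">"
      + "        <xs:minInclusive value=\"0\"/>"
      + "        <xs:maxInclusive value=\"360\"/>"
      + "      </xs:restriction>"
      + "    </xs:simpleType>"
      + "  </xs:element>"
      + "</xs:schema>";

    // Returns true when the instance satisfies the schema constraints.
    static boolean isValid(String xml) {
        try {
            Schema schema = SchemaFactory
                .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
            Validator v = schema.newValidator();
            v.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (Exception e) { // SAXException on a constraint violation
            return false;
        }
    }
}
```

An instance such as <rotationAngle>45.5</rotationAngle> passes, while an out-of-range value like 400 is rejected by the validator rather than by hand-written checking code.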
Table 3 – Advantages and Disadvantages of Loose versus Strong Data Typing in Web services
Loose Type
Advantages:
• Easy to develop
• Easy to implement
• Minimum changes in WSDL interface
• Stable WSDL interface
• Flexibility in implementation
• Single Web Service implementation may handle multiple types of message
• Can be used as a Gateway Service, routing to actual services based on contents of message
Disadvantages:
• Requires additional/ manual negotiation between client and service provider to establish the format of data wrapped or expressed in a generic type
• This may cause problems regarding maintaining consistent implementations and for client/ service relations
• No control on the messages
• Prone to message-related exceptions due to inconsistencies between the format of sent data and the accepted data format (requires Web Service code to be liberal in what it accepts – this adds extra coding complexity)
• Encoding of XML as a string increases the message size due to escaped characters

Strong Type
Advantages:
• Properly defined interfaces
• Tight control on the data with value constraints
• Message validation
• Different possibilities for data validation (pluggable data binding/ validation)
• Robust (only highly constrained data enters the Web Service)
• Minimised network overhead
• Benefits from richness of XML
Disadvantages:
• Difficult to develop (requires a working knowledge of XML and WSDL)
• Resistant to change in data model
Sofia Panagiotidi, Jeremy Cohen, John Darlington, Marko Krznarić and Eleftheria Katsiri
London e-Science Centre, Imperial College London, South Kensington Campus, London SW7 2AZ, UK
Email: {sp1003,jhc02,jd,marko,ek}@doc.ic.ac.uk
Abstract. We present work done within the Grid ENabled Integrated Earth system model (GENIE) project to take the original, complex, tightly-coupled Fortran earth modeling application that has been developed by the GENIE team and enable it for execution within a component-based execution environment. Further, we have aimed to show that by representing the application as a set of high-level Java Web Service components, execution and management of the application can be made much more flexible. We show how the application has been built into higher-level components and how these have been wrapped within the Java Web Service abstraction. We then look at how these components can be composed into workflows and executed within a Grid environment.
Keywords: component, (earth) module, instance, port, grid, wrapper, glue (code), abstract
interface, concrete interface
land, the sea-ice etc., referred to from now on as earth modules or just modules, are modelled independently, using specified meshes of data points to represent the boundary surfaces, and conform to the natural laws of physics that govern the exchange of energy and fluxes from one to another.

One of the architectural characteristics of GENIE is that it is designed to consist of more than one implementation (instance) of an earth element in the case of the atmosphere, ocean, land and sea-ice. Furthermore, scientists are able to choose to experiment with several simulation scenarios (configurations) comprising combinations of such implementations. This was implemented through a separate program, "genie.F", which is responsible for the control and execution of each module, the attribute passing between the chosen ones, and the appropriate interpolations of the data exchanged, through interpolation functions. In fact, all possible configurations are handled through "genie.F" with the use of flags, switched on/off to specify which implementation of a module is to be used.

As new modules are actively being researched and developed, it is desirable for the GENIE community to have the flexibility to easily add, modify and couple together GENIE modules and experiment with new configurations, without undue programming effort. Despite significant progress in the GENIE community, the desired result is far from reality.

2.2 Previous work

Several milestones towards the advance of the GENIE framework have been achieved. The first task was to separate the tightly coupled Fortran code into distinct pieces, each representing an environmental module (atmosphere, land, ocean, ocean-surface, chemistry, sea-ice, etc). The main piece of code handling and coupling the modules was disengaged and formed the so-called "genie.F" top program.

Efforts in gradually moving GENIE towards a grid environment have been made in the past, most importantly allowing the execution and monitoring of many ensemble experiments in parallel [4, 6], reducing the overall execution time. However, they all deal with a subset of the gridification issues and served the isolated, current at that time, needs of the scientific community.

Previous work also includes research into the way the ICENI [3] framework can be used to run parameter sweep experiments across multiple Grid resources [5].

Lastly, there have been efforts to wrap up each of the basic earth modules using the JNI [7] library, which led to unexpected complications and unjustified amounts of effort, leading us to look into a more efficient solution.

3 Component-based Frameworks

The basic reason why the design of GENIE has not achieved the desired level of modularity, which is the objective, is that it is fundamentally restricted by the use of Fortran as a scientifically efficient language [8].

The component programming model is the latest stage of a natural progression, starting from the monolithic application and moving towards applications built from increasingly more modularised pieces, with one addition: the coupling framework. Components are designed to be protected from changes in the software environment outside their boundaries.

Thus, a component can be described as a 'black box', fully specified by its function specification and the input/output type patterns of the data (ports). The fact that a component is interfaced with other modules/systems only through its input/output ports (and is otherwise shielded from the rest of the world) allows bigger applications to be designed, implemented, and tested independently of everything else.

The coupling framework within a component architecture is a system defining the bindings between the components in a clear and complete way and providing information about the run-time environment. It can include connectors which perform the direct "binding" of the ports of two components, or even more composite operations on the input/output/inout ports. Finally, a configuration of components may be (re)used as an individual component in another (sub) framework.

In this context, an application like GENIE may be composed (assembled) at run-time from components selected from a component pool. A component-based approach to GENIE would enable a user to fully exploit parallel execution of the system. Furthermore, by having separate layers of software development as such, the configuration and desired behaviour of the simulation model can be easily defined and modified by the user.

One of the desires of the GENIE community has been to experiment with the order of execution of the earth modules. With the current system, this is hard to achieve. Our component-based approach allows the user to easily modify the execution order and the starting order of the components, giving the user the possibility to further test the robustness of the simulation method.

An example of an ideal composition environment would be the Imperial College e-Science Networked Infrastructure (ICENI) [3]. ICENI is a pioneering Service Oriented Architecture (SOA) and component framework developed at the London e-Science Centre, Imperial College London. Within the ICENI model an abstract component can have several differing implementations. Associated with each component is metadata describing its characteristics. At deployment, this metadata is used to achieve the optimal selection of component implementation and execution resource in order to minimise overall execution time given the Grid resources currently available.

4 A Grid Model for GENIE

4.1 Abstract Component Model

Within the ICENI model, semantically equivalent components have the same interface. In fact, there need only be one semantically equivalent abstract component; concrete implementations inherit from this. In earth modelling systems such as GENIE, however, different implementations of a semantically equivalent earth entity module may have differing interfaces. For example, one ocean model may incorporate salinity while another does not. However, they are both ocean models and should be interchangeable. We therefore extend the ICENI philosophy to provide different mechanisms whereby similar models with differing interfaces can be used in an interchangeable fashion. We do this by extending the notion of an interface, wrapping the module with plugin adapters to accommodate the different interfaces. We describe this interface model and show how the implementation can be achieved through Babel [1].

4.2 Extended Abstract Component Model

To realise this model we define four key questions which we then look at in greater detail:

• How are abstract models linked/composed?
• How is a GENIE system comprising various models assembled?
• How is a GENIE system deployed in a distributed dynamic environment?
• How do GENIE models communicate with each other?

In order to compose abstract models in an interchangeable manner, it is necessary for the models to have common interfaces. In the case of GENIE, it is possible to encounter situations where semantically equivalent modules, that should in theory be interchangeable, cannot be substituted due to differing interfaces. We tackle this issue by defining a pluggable abstract interface that allows the knowledge of the module developer to be encapsulated within a 'plugin'.

To build a GENIE system from a given set of modules it is necessary to compose the modules in a semantically valid format. Using component metadata, an idea utilised extensively in ICENI components, we can annotate the GENIE components with information that determines their composability.

By wrapping GENIE components in a Java Web Service wrapper, it is possible to deploy the components in a service container, allowing them to communicate using standard Web Service protocols. The use of Web Service standards provides mobility and allows the location of component deployment to be determined at application run time. The deployment of GENIE in a distributed, dynamic environment is discussed in more detail in section 5.

The communication between GENIE models may be accomplished using a variety of methods dependent on where they are deployed. Components deployed on geographically distributed resources may communicate using Simple Object Access Protocol (SOAP) messages over an HTTP connection, although this may not be the most efficient. Components that are co-located within a cluster may use MPI or shared memory to communicate (figure 1).
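As a rough Java sketch of these ideas (all class and method names are hypothetical, not the actual ICENI or GENIE APIs), a common abstract interface exposes a model only through its ports, while a plugin adapter encapsulates the module developer's knowledge of a model whose native interface differs:

```java
// Common abstract interface: a model is visible only through its ports.
interface OceanModel {
    double[] step(double[] surfaceTemps);
}

// A native implementation with a differing interface: it also
// expects salinity data on every step.
class SalinityOcean {
    double[] advance(double[] temps, double[] salinity) {
        double[] out = new double[temps.length];
        for (int i = 0; i < temps.length; i++) {
            out[i] = temps[i] + 0.1 * salinity[i];
        }
        return out;
    }
}

// Plugin adapter: makes SalinityOcean substitutable for any other
// OceanModel by supplying a default salinity field, so semantically
// equivalent modules stay interchangeable despite differing interfaces.
class SalinityOceanPlugin implements OceanModel {
    private final SalinityOcean model = new SalinityOcean();

    public double[] step(double[] surfaceTemps) {
        double[] defaultSalinity = new double[surfaceTemps.length];
        java.util.Arrays.fill(defaultSalinity, 35.0); // assumed constant field
        return model.advance(surfaceTemps, defaultSalinity);
    }
}
```

A composition environment can then select either an adapter-wrapped model or a natively conforming one purely on the basis of the shared interface and component metadata.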
need to be part of the interface, the separation of these arguments into inputs, outputs and inouts is difficult. The reason for this is that Fortran passes all arguments to a routine by reference. This means that it is very difficult to extract any information about whether a parameter is strictly used as a read-only part of memory, or gets modified during the execution of the specific piece of code.

During our inspection and detailed analysis of the GENIE modules, a long and complex task, the nature and type of the module parameters were documented. This served not only our purposes, but has provided useful input to the GENIE project for the future needs of the scientists and users of the application. In several cases, a manual, exhaustive, in-depth analysis of the code of a routine, together with the subroutines called within it, needed to be done in order to determine whether an attribute is written to, is of a read-only nature, or possibly serves both read/modify purposes.

5.2 Linking Fortran to Java

In order to form high level components that can be launched and used as web services, we chose Java as the main language in which the interfaces of the earth modules would be exposed. Rewriting them would possibly introduce a number of inaccuracies and, most importantly, would cause a disruption in the development of such a complex scientific application. Therefore, we chose to wrap each earth module individually using an efficient and suitable mechanism.

For this purpose we picked Babel [1], an interoperability tool which addresses the language interoperability problem in software engineering. Babel uses a prototype SIDL (Scientific Interface Definition Language) interface description which describes a calling interface (but not the implementation) of a particular scientific module. Babel, as a parser tool, uses this interface description to generate glue code (server skeleton and client stub) that allows a software library implemented in one supported language to be called from any other supported language.

Below we note some of the advantages of using Babel, as against other mechanisms, to wrap up GENIE modules:

• Babel provides all three kinds of communication ports used by the GENIE subroutines – inputs, outputs, as well as inouts, the latter of which requires extra programming effort to be implemented via other mechanisms (e.g. JNI).
• It minimises the problems we identified in [8] when using JNI, such as arrays stored differently in memory (C/Fortran), floating point numbers, complex structure argument passing, manual creation of intermediate ".h" files, and programming effort.
• Possible future addition of extra components becomes easy, since it provides a standard mechanism for wrapping up components.
• Babel is compatible with most Fortran compilers used by scientists in GENIE.
• An SIDL file can include methods such as initialise, main and finalise, making it possible to expose all separately developed subroutines as parts of the same component, as semantically correct and desired, conforming to the component lifecycle of modern component architectures.
• An interoperability tool like Babel is particularly useful for making heterogeneous environments communicate, since our purpose is not only to make GENIE modules available over the web but, moreover, to provide a generic methodology independent of the component implementation language.

After testing and verifying its suitability for the project's needs, we have been using Babel to wrap a sample of components and make them accessible from a Java client which is designed to replace "genie.F". Throughout this procedure several complications were encountered but resolved.

Firstly, Babel uses its own defined data types (from here on referred to as sidl-types) in the SIDL interface specification that it accepts, producing code which handles similar data types in the language specified by the user. Thus, the code which is generated in Fortran, designed to access an earth module, needs some additions in order to serve its purpose. For this reason, several external routines "casting" the sidl-types of data from/to Fortran data types were developed. In this way, it was made possible to have a one-to-one correspondence of integers, floating point numbers, as well as one and two dimensional arrays, to and from the equivalent sidl-types. Having these casting routines, the glue code can access the top level module routine by passing the arguments and having them returned after the addition of a piece of "casting" code inside the glue. It should be mentioned that, in this way, there is absolutely no need to modify any of the existing module code in order to access it from the skeleton created by Babel. Therefore, just by adding an additional structure for each module in the GENIE code tree containing the SIDL descriptors and the generated glue code, the maintenance of the structure is guaranteed. Furthermore, this does not interfere with any attempt to run the GENIE application without using the wrapped modules and the componentised version, but executing it in the old-fashioned way where "genie.F" handles the modules.

Another feature of Babel is that it works with the use of shared libraries. The code of the module gets compiled together with the glue code to form a library, which can then be placed independently and accessed through a client from anywhere on the web. The GENIE application structure already contains the code of a module in an independent, shared library. Therefore, it is sufficient to compile the glue code against the module library to ensure the communication of the two, allowing a more distributed environment where the skeleton accesses the actual implementation remotely.

We end up with a module wrapping procedure (figure 4) that enables an earth module (and eventually a whole configuration) to be accessed by a Java client, in the following simple stages, which require minimal programming effort:

• describe the input/output/inout ports of the module's top level Fortran subroutine in a simple SIDL file
• run the Babel parser to create glue code
• connect the top level subroutine of an earth module to the skeleton by passing the ports to/from the Babel skeleton and to/from the subroutine

5.3 Executing GENIE in a Grid Environment

So far, we have managed to wrap up two simple GENIE components using Babel and tested the results returned when calling them from a simple Java client. Our current activities mainly involve a specific configuration (figure 5) comprising the GENIE basic ocean (Goldstein), a simple atmosphere (Embm) and a simple compatible sea-ice (Goldstein sea-ice). We have isolated this case as the most suitable to study and implement using Web Services. The three earth modules, together with their initialisation and finalisation routines, are being described and accessed through a Java interface, and at the same time a Java client is being developed in order to access and couple the components. Initially, the chosen configuration is being tested by direct access of the interfaces exposed by the wrapped modules, without the use of Web Services, for reasons of simplicity.
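The casting routines described above maintain a one-to-one correspondence between sidl-types and native types. As a language-neutral illustration (sketched in Java rather than the Fortran actually used, with a stand-in SidlDoubleArray class, since the real Babel-generated types differ), the round trip looks like this:

```java
// Stand-in for a Babel-generated sidl array type; the real classes differ.
class SidlDoubleArray {
    private final double[] data;
    SidlDoubleArray(double[] data) { this.data = data.clone(); }
    double[] contents() { return data.clone(); }
}

// Hypothetical casting helpers mirroring the routines described in the
// text: every element survives the round trip unchanged, so existing
// module code needs no modification to be called through the glue.
class SidlCasts {
    static SidlDoubleArray toSidl(double[] values) {
        return new SidlDoubleArray(values);
    }
    static double[] fromSidl(SidlDoubleArray sidl) {
        return sidl.contents();
    }
}
```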
It needs to be made clear that any com- 7. Advance the previous stages into a Grid
ponent composition environment requires every environment so as to incorporate Co-
entity to be in the form of a software compo- ordination Specification, Dynamic Abstract
nent. Thus, all the functionality of “genie.F” must be disguised in a component’s structure and become part of the data workflow. For example, all intermediate interpolation functions of data exchanged between modules, as well as the various functional and non-functional operators that are part of the component composition phase, should become part of the (GENIE) application specification phase.

Fig. 5. Configuration to be implemented in Web Services

Below, we summarise the various engineering steps which are planned to take place in order to achieve the grid enabling of the GENIE application using a Web Service-based workflow within the London e-Science Centre.

1. Wrap up a component in Java using Babel and test it individually to ensure proper results during execution.
2. Wrap up all modules for one existing configuration. Then create a Java client to synchronise the components and compare the results to those produced when executing “genie.F” over the same configuration.
3. Launch Web Services in place of the original components and modify the Java client to call these Web Services instead of the components.
4. Create all possible configurations of GENIE in a similar manner to that shown in the previous steps.
5. Create the Abstract Component Wrapper by choosing carefully what the concrete interface will be.
6. Use a workflow engine to orchestrate the Abstract Components in all configurations.

Code Generation, Resource Allocation and Optimisation.

Although no thorough experiments over the efficiency have taken place yet, we expect that the advantages of this approach far outweigh the overhead produced during the execution (conversion of SIDL to language-specific types and vice versa, data exchange between modules, data serialisation and transportation over the Grid).

6 Conclusions

We have taken the Fortran implementation of the GENIE earth simulation application and shown how this can be modified to take advantage of a service-based computational Grid execution model. Through analysis of the original, tightly-coupled implementation, the component interfaces were identified. It was then possible to apply an abstract component model to the various earth entity module implementations to provide substitutable components that can be composed into workflows for scheduling and execution on Grid resources. Our implementation takes the form of a Service Oriented Architecture utilising Web Services as the service model. We have shown that it is possible to take a complex legacy application and apply wrapping techniques for service-enabling without the extensive effort that would be required for a rewrite of the original code in a different language. The work provides simplified deployment of GENIE workflows across multiple resources in combination with the opportunity for improved runtimes that distributed Grid execution allows. The generic and component-based nature of this architecture allows it to be applied to similar legacy system applications. It provides a language-independent environment where newly developed earth components can be easily incorporated into the rest of the application. Our showcase also describes the generic and distinct stages (wrapping, deploying as a web service, abstractly interfacing, interaction specification) needed to advance a legacy application into a loosely coupled, component-based application. We intend to continue this work to service-enable further modules within the GENIE framework.
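The abstract component model and workflow orchestration described above can be sketched as follows. This is a minimal illustration in Python rather than the paper's Java/Babel tooling; the component names and state fields are hypothetical, not taken from the GENIE code.

```python
from abc import ABC, abstractmethod

class EarthComponent(ABC):
    """Abstract component interface behind which concrete (wrapped)
    module implementations become substitutable."""
    @abstractmethod
    def step(self, state: dict) -> dict: ...

class SlabOcean(EarthComponent):
    def step(self, state):
        # Toy stand-in for a wrapped Fortran ocean module.
        state = dict(state)
        state["ocean_temp"] = state.get("ocean_temp", 15.0) + 0.1
        return state

class Atmosphere(EarthComponent):
    def step(self, state):
        # Toy stand-in for a wrapped atmosphere module.
        state = dict(state)
        state["air_temp"] = state.get("air_temp", 10.0) + 0.2
        return state

def run_workflow(components, state, steps):
    """Run substitutable components in sequence, as a workflow engine
    would orchestrate the Abstract Components (step 6 above)."""
    for _ in range(steps):
        for component in components:
            state = component.step(state)
    return state

state = run_workflow([SlabOcean(), Atmosphere()], {}, steps=2)
```

Because every concrete module satisfies the same abstract interface, a workflow engine can swap one implementation for another (local object, Babel-wrapped code, or Web Service proxy) without changing the composition.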
Proceedings of the UK e-Science All Hands Meeting 2006, © NeSC 2006, ISBN 0-9553988-0-0
Abstract
This paper presents accounts of several real case applications using grid farms with data- and functional parallelism. For data parallelism, a job submission system (EasyGrid) provided an intermediate layer between the grid and user software. For functional parallelism, a PVM algorithm ran the user's software in parallel as a master/slave implementation. These approaches were applied to typical particle physics applications: hadronic tau decays, searching for anti-deuterons, and neutral pion discrimination using genetic programming algorithms. Performance studies show a considerable reduction in execution time using functional gridification.
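The master/slave pattern referred to in the abstract can be sketched as below. This is an illustrative Python thread-pool analogue, not the PVM implementation used in the paper; `analyse` stands in for the user's per-event analysis routine.

```python
from concurrent.futures import ThreadPoolExecutor

def analyse(event):
    # Stand-in for the user's per-event analysis routine.
    return event * event

def master(events, n_slaves=4):
    # The master farms events out to the slaves and collects the
    # results, preserving input order.
    with ThreadPoolExecutor(max_workers=n_slaves) as pool:
        return list(pool.map(analyse, events))
```

In the PVM version the slaves are processes on separate grid nodes, so the communication overheads discussed later in the paper apply on top of this basic pattern.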
determinate cuts to maximize event selection [17]-[19]. Genetic algorithms can also be associated with neural networks to implement discriminant functions [20] for the Higgs boson. Our approach is innovative because the mathematical model obtained with GP maps the variables hyperspace to a real value through the discrimination functions, an algebraic function of pion kinematic variables. Applying the discriminator to a given pair of gammas, if the discriminant value is bigger than zero, the pair of gammas is deemed to come from pion decay. Otherwise, the pair is deemed to come from another (background) source.

Two datasets were built, one for training with 57,992 records, and one for testing with 302,374 records. Events with one real neutral pion were selected and marked as 1. Events without real pions, but with an invariant mass reconstruction in the same region as real neutral pions, were also selected and marked 0.

Kinematic data from each gamma used in the reconstruction were written to the datasets: angles of the gamma ray, 3-vector momentum, total momentum, and energy in the calorimeter. To avoid unit problems, we use sine, cosine and tangent values for each angle measured in the genetic trees. All other attributes are measured in GeV (1,000 million electron-volts).

Table III shows the results for training and testing of 3 different runs. All results were in agreement and show high purity, which is fundamental for studying observable variables from neutral pion particles. High purity means there will be low contamination from non-neutral pions in the sample (less than 10%). An efficiency of 83% means that 17% of real neutral pions will be lost from the sample, with a decrease in numbers and an increase in statistical error.

Table IV shows the time expended running standalone and with several numbers of slaves, with good performance: 10 slaves should reduce the time in ideal conditions to 10%, and our implementation achieved 24% despite all necessary communication overheads.

Conclusion

In this paper, implementations of data and functional parallelism using the LCG/PVM grid environment are discussed and applied to several real case studies. A reliable job submission system (EasyGrid) manages all aspects of integration between the user's requirements and resources for the data grid. The functional gridification algorithm was implemented in a client-server architecture with good performance.

All software is available from the Internet [6], and is fully operational and easily adaptable for any application and experiment.

Discrimination functions can be used to discriminate neutral pions from background with 80% accuracy and 91% purity. This will allow us to compare experimental values with observables obtained from the theoretical Standard Model.

REFERENCES

[1] GridPP site: http://www.gridpp.ac.uk/
[2] The GridPP Collaboration: P J W Faulkner et al., "GridPP: development of the UK computing Grid for particle physics", 2006 J. Phys. G: Nucl. Part. Phys. 32 N1-N20, doi:10.1088/0954-3899/32/1/N01
[3] CERN site: http://public.web.cern.ch/Public/Welcome.html
[4] LHC site: http://lhc.web.cern.ch/lhc/
[5] LCG site: http://lcg.web.cern.ch/LCG/
[6] Werner, J.C.; "HEP analysis, Grid and EasyGrid Job Submission Prototype: Babar/CM2 showcase" at http://www.hep.man.ac.uk/u/jamwer/
[7] BaBar Collaboration; "The BaBar detector", Nuclear Instruments and Methods in Physics Research A479 (2002) 1-116.
[8] Tavera, M.; Private communication on eta reconstruction from TauUser data, PhD Thesis.
[9] Werner, J.C.; "Neutral Pion project", http://www.hep.man.ac.uk/u/jamwer/pi0alg5.html
[10] Werner, J.C.; "Search for anti-deuteron", http://www.hep.man.ac.uk/u/jamwer/deutdesc.html
[11] Parallel Virtual Machine site: http://www.csm.ornl.gov/pvm/pvm_home.html
[12] Holland, J.H. "Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control and artificial intelligence." Cambridge: Cambridge Press, 1992.
[13] Goldberg, D.E. "Genetic Algorithms in Search, Optimisation, and Machine Learning." Reading, Mass.: Addison-Wesley, 1989.
[14] Chambers, L.; "The practical handbook of Genetic Algorithms", Chapman & Hall/CRC, 2000.
[15] Koza, J.R. "Genetic programming: On the programming of computers by means of natural selection." Cambridge, Mass.: MIT Press, 1992.
[16] Werner, J.C.; "Active noise control in ducts using genetic algorithm", PhD Thesis, São Paulo University, São Paulo, Brazil, 1999.
[17] Cranmer, K.; Bowman, R.S.; "PhysicsGP: A genetic programming approach to event selection", Computer Physics Communications 167 (2005) 165-176.
[18] Focus Collaboration, "Application of genetic programming to high energy physics event selection", Nuclear Instruments and Methods in Physics Research A 551 (2005) 504-527.
[19] Focus Collaboration; "Search for L+c -> pK+p- and D+s -> K+K+p- using genetic programming event selection", Physics Letters B 624 (2005) 166-172.
[20] Mjahed, M.; "Search for Higgs boson at LHC by using genetic algorithms", to be published in Nuclear Instruments and Methods in Physics Research.
Fig. 1. CPU load and IO wait performance for the data transfer paradigms, in time frames of 15 seconds. (a) NFS access by application. (b) File copied locally and accessed later. (c) NFS access with jammed network.
TABLE I: DATA DISTRIBUTION ANALYSIS. Performance was defined as (100 * EPS/EPS_local), where EPS is events per second. EPS_local (1577) is the number of events per second using locally stored files, which takes 600 seconds.

# Jobs   c.NFS 1 Gbps      a.NFS             b.SE 100 Mbps
         EPS     Perf      EPS     Perf      EPS     Perf
1        855     54        1574    100       1193    76
3        481     31        1457    92        949     60
6        492     31        1388    88        725     46
12       412     26        1372    87        495     31
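The Perf column of Table I follows directly from the stated definition (100 * EPS / EPS_local with EPS_local = 1577); a quick check in Python:

```python
EPS_LOCAL = 1577  # events per second with locally stored files

def perf(eps):
    # Performance as defined in Table I: 100 * EPS / EPS_local, rounded.
    return round(100 * eps / EPS_LOCAL)

# EPS columns from Table I: (c.NFS 1 Gbps, a.NFS, b.SE 100 Mbps) per job count
eps_by_jobs = {1: (855, 1574, 1193), 3: (481, 1457, 949),
               6: (492, 1388, 725), 12: (412, 1372, 495)}
perf_by_jobs = {j: tuple(perf(e) for e in eps)
                for j, eps in eps_by_jobs.items()}
```

Every entry reproduces the tabulated Perf values, which also makes the degradation under concurrent NFS load easy to see programmatically.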
TABLE II: Average execution time of 500 jobs on 12 and 56 CPUs in parallel, with file transfer and remote access.

500 Jobs            File Transfer   Exec       Total
12 CPUs   Time      00:15:00        00:27:00   00:42:02
          Seconds   900             1620       2522
56 CPUs   Time      01:43:52        00:12:19   01:56:11
          Seconds   6232            739        6971
TABLE III: Training and test results from discriminant functions obtained using genetic programming with different datasets.

                  Case 1: Forecast    Case 2: Forecast    Case 3: Forecast
Training   Real   D>0      D<0       D>0      D<0        D>0      D<0
57992      1      23299    4819      23368    4750       22491    5627
records    0      3093     26781     3040     26834      2731     27143
Accuracy          86                 86                  85
Efficiency        82                 83                  80
Purity            89                 89                  90

Test       Real   D>0      D<0       D>0      D<0        D>0      D<0
302374     1      117268   41037     117215   41090      111999   46306
records    0      14153    129916    13870    130199     12543    131526
Accuracy          81                 81                  80
Efficiency        74                 74                  70
Purity            90                 90                  91
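The accuracy, efficiency and purity rows in Table III can be derived from the counts above. The definitions below are the standard confusion-matrix ones (accuracy over all records, efficiency as the fraction of real pions kept, purity as the fraction of selected candidates that are real); they are our assumption rather than a quotation from the paper, and they reproduce the tabulated values to within rounding.

```python
def metrics(tp, fn, fp, tn):
    """Confusion-matrix metrics, in percent.

    tp: real pions with D>0, fn: real pions with D<0,
    fp: background with D>0, tn: background with D<0.
    """
    accuracy = 100 * (tp + tn) / (tp + fn + fp + tn)
    efficiency = 100 * tp / (tp + fn)  # real pions kept in the sample
    purity = 100 * tp / (tp + fp)      # selected candidates that are real
    return accuracy, efficiency, purity

# Case 1 training sample from Table III:
acc, eff, pur = metrics(tp=23299, fn=4819, fp=3093, tn=26781)
```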
TABLE IV: Execution time for the same software with different numbers of slaves and nodes.

                 Standalone   1 node / 2 slaves   5 nodes / 10 slaves
Time (1,000 s)   80           47                  19
Improvement                   58%                 24%
Ian Grimstead, David Walker and Nick Avis; School of Computer Science, Cardiff University
Frederic Kleinermann; Department of Computer Science, Vrije Universiteit
John McClure; Directorate of Laboratory Medicine, Manchester Royal Infirmary
Abstract
We present a use of Grid technologies to deliver educational content and support collaborative
learning. “Potted” pathology specimens produce many problems with conservation and appropriate
access. The planned increase in medical students further exacerbates these issues, prompting the use
of new computer-based teaching methods. A solution to obtaining accurate organ models is to use
point-based laser scanning techniques, exemplified by the Arius3D colour scanner. Such large
datasets create visually compelling and accurate (topology and colour) virtual organs, but will often
overwhelm local resources, requiring the use of specialised facilities for display and interaction.
The Resource-Aware Visualization Environment (RAVE) has been used to collaboratively display
these models, supporting multiple simultaneous users on a range of resources. A variety of
deployment methods are presented, from web page applets to a direct AccessGrid video feed.
Figure 4: Plastinated human heart, data viewed from PointStream and corresponding RAVE client
tools. Whilst commodity graphics cards are increasing in their performance at more than Moore's Law rates, their ability to handle large complex datasets still presents barriers to the widespread use of this material. This, coupled with slow communications links between centralised resources and students who are often working remotely and in groups (using problem-based teaching methods), quickly erodes some of the advantages of digital assets over their physical counterparts.

To this end we have sought to harness grid middleware to remove the need for expensive and high-capability local resources – instead harnessing remote resources in a seamless and transparent manner to the end user to effect a compelling and responsive learning infrastructure.

We have presented methods to allow both the quick and relatively easy creation of 3D virtual organs and their visualization. We maintain that, taken together, these developments have the potential to change present anatomical teaching methods and to hopefully promote greater understanding of human anatomy and pathology.

4. References

[Avis00] N. J. Avis. Virtual Environment Technologies, Journal of Minimally Invasive Therapy and Allied Technologies, Vol 9(5), pp 333-340, 2000.
[Avis04] Nick J. Avis, Frederic Kleinermann and John McClure. Soft Tissue Surface-Scanning: A Comparison of Commercial 3D Object Scanners for Surgical Simulation Content Creation and Medical Education Applications, Medical Simulation, International Symposium, ISMS 2004, Cambridge, MA, USA, Springer Lecture Notes in Computer Science (3078), Stephane Cotin and Dimitris Metaxas (Eds.), pp 210-220, 2004.
[Bergman86] L. D. Bergman, H. Fuchs, E. Grant, and S. Spach. Image rendering by adaptive refinement. Computer Graphics (Proceedings of SIGGRAPH 86), 20(4):29-37, August 1986. Held in Dallas, Texas.
[Childers00] Lisa Childers, Terry Disz, Robert Olson, Michael E. Papka, Rick Stevens and Tushar Udeshi. Access Grid: Immersive Group-to-Group Collaborative Visualization, in Proceedings of the 4th International Immersive Projection Technology Workshop, Ames, Iowa, USA, 2000.
[Ellis02] H. Ellis. Medico-legal litigation and its links with surgical anatomy. Surgery 2002.
[Grims04] Ian J. Grimstead, Nick J. Avis, and David W. Walker. Automatic Distribution of Rendering Workloads in a Grid Enabled Collaborative Visualization Environment, in Proceedings of Supercomputing 2004, held 6th-12th November in Pittsburgh, USA, 2004.
[Grims05] Ian J. Grimstead. RAVE – The Resource-Aware Visualization Environment, presentation at SAND'05, Swansea Animation Days 2005, Taliesin Arts Centre, Swansea, Wales, UK, November 2005.
[Nava03] A. Nava, E. Mazza, F. Kleinermann, N. J. Avis and J. McClure. Determination of the mechanical properties of soft human tissues through aspiration experiments. In MICCAI 2003, R. E. Ellis and T. M. Peters (Eds.), LNCS (2878), pp 222-229, Springer-Verlag, 2003.
[Older04] J. Older. Anatomy: A must for teaching the next generation, Surg J R Coll Edinb Irel., 2 April 2004, pp 79-90.
[Vidal06] F. P. Vidal, F. Bello, K. W. Brodlie, D. A. Gould, N. W. John, R. Phillips and N. J. Avis. Principles and Applications of Computer Graphics in Medicine, Computer Graphics Forum, Vol 25, Issue 1, 2006, pp 113-137.

5. Acknowledgements

This study was in part supported by a grant from the Pathological Society of Great Britain and Ireland and the North Western Deanery for Postgraduate Medicine and Dentistry. All specific permissions were granted regarding the use of human material for this study.

We also acknowledge the support of Kestrel3D Ltd and the UK's DTI for funding the RAVE project. We would also like to thank Chris Cornish of Inition Ltd. for his support with dataset conversion.
Jason Crampton, Hoon Wei Lim, Kenneth G. Paterson, and Geraint Price
Information Security Group
Royal Holloway, University of London
Egham, Surrey TW20 0EX, UK
• Identity-based: The use of identity-based public keys in IBC allows any entity's public key to be generated and used on-the-fly;

• Certificate-free: IBC does not make use of certificates since public keys can be computed based on some public identifiers; and

• Small key sizes: Since identity-based cryptographic schemes use pairings which are, in turn, based on elliptic curves, they can have smaller key sizes than more conventional public key cryptosystems such as RSA.

By exploiting some properties from HIBC, IKIG facilitates the creation and usage of identity-based proxy credentials in a very natural way. These identity-based proxy credentials, in turn, are needed to support features that match those provided by the GSI.

In the IKIG setting, the role of a Certificate Authority (CA) in the current PKI-based GSI has been replaced by a Trusted Authority (TA). The TA's roles include acting as the Private Key Generator (PKG) and supporting other user-related administration. Figure 1 shows the hierarchical setting of HIBC that matches the hierarchical relationships of various entities within a grid environment, with the TA at level 0, the user at level 1 and the user proxy at level 2.

... at the user side, user authentication can be performed without any physical intervention from the users and without the need for certificates.

2. Mutual authentication and key agreement: IKIG also supports a certificate-free authenticated key agreement protocol based on the TLS handshake. Our protocol allows mutual authentication and session key establishment between two entities in a more lightweight manner than the traditional TLS, as it has small key sizes and requires no certificates.

3. Delegation: We propose a non-interactive delegation protocol which works in the same way as in the GSI, in the sense that the delegator signs a new public key of the delegation target. In addition, IKIG allows partial aggregation of signatures. This can be useful when the length of the delegation chain increases. This is a new feature not available in the GSI.

Users in the IKIG setting do not need to obtain short-term private keys from their respective PKGs. This is because the users themselves act as PKGs for their local proxy clients. Thus short-term private key distribution is not an issue in IKIG. This contrasts favourably with conventional applications of IBC, where private key distribution is a complicating factor.
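The three-level hierarchy (TA at level 0, user at level 1, user proxy at level 2), with each entity acting as the PKG for the level below, can be sketched as follows. This toy uses an HMAC chain purely to illustrate the key-extraction structure; real IKIG uses pairing-based HIBE/HIBS schemes, and the identifier strings here are hypothetical.

```python
import hashlib
import hmac

def extract(parent_key: bytes, child_identity: str) -> bytes:
    # Toy stand-in for hierarchical key extraction: a parent derives
    # the private key for a child identity from its own private key.
    return hmac.new(parent_key, child_identity.encode(),
                    hashlib.sha256).digest()

ta_master = b"ta-master-secret"                        # level 0: TA (root PKG)
user_key = extract(ta_master, "alice@grid.example")    # level 1: user
proxy_key = extract(user_key,
                    "alice@grid.example/proxy")        # level 2: user proxy
```

Because the user derives `proxy_key` locally, the TA is never involved in issuing short-term proxy keys, which is the property the text highlights.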
Table 1. Performance trade-offs in computation times (in milliseconds) between the GSI and the IKIG settings on a Pentium IV 2.4 GHz machine.

                          GSI                           IKIG
Operation                 RSA               Time        HIBE/HIBS   Time
Key generation
(a) Long-term             1 GEN             149.90      1 EXT       1.69
(b) Short-term            1 GEN             34.85       1 EXT       1.74
Delegation
(a) Delegator             1 512-bit SIG     1.86        1 SIG       3.35
                          + 1 512-bit VER
(b) Delegation target     1 GEN             35.63       1 EXT       1.74
                          + 1 512-bit SIG
(c) Verifier              3 512-bit VERs    0.84        1 VER       8.42
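As a rough reading of Table 1, summing the per-party delegation times gives an end-to-end comparison. The totals are our own arithmetic, not figures quoted by the paper:

```python
# Delegation times in milliseconds, taken from Table 1.
gsi = {"delegator": 1.86, "target": 35.63, "verifier": 0.84}
ikig = {"delegator": 3.35, "target": 1.74, "verifier": 8.42}

gsi_total = sum(gsi.values())    # dominated by RSA key generation at the target
ikig_total = sum(ikig.values())  # verification is the largest IKIG cost
```

On these figures the IKIG delegation round trip is cheaper overall (about 13.5 ms vs 38.3 ms), even though individual HIBS operations such as verification are slower than their RSA counterparts.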
this should give hope to faster HIBE and HIBS schemes in the near future. On the other hand, we remark that it is unlikely that significant algorithmic improvements for RSA computations will be forthcoming, since this field has been intensively researched for many years.

2.3 Identity-Based Secret Public Keys

In a password-based authentication protocol, a secret public key is a standard public key which can be generated by a user or an authentication server, and is known only to them but is kept secret from third parties. A secret public key, when encrypted with a user's password, should serve as an unverifiable text. This may significantly increase the difficulty of password guessing, even if the password is poorly chosen, as an attacker has no way to verify whether he has made the correct guess. However, it may not be easy to achieve unverifiability of text by simply performing naive symmetric encryption on public keys of standard types, such as RSA or ElGamal, which contain certain number-theoretic structure.

We have recently investigated the use of identity-based secret public keys [10] and designed a password-based TLS protocol which, in turn, seems to fit nicely into MyProxy. The identity-based approach of [10] shows that secret public keys can be constructed in a very natural way using arbitrary random strings, eliminating the structure found in, for example, RSA or ElGamal keys.

An identity-based secret public key protocol can be naturally converted into a password-based version of the standard TLS protocol. The resulting protocol allows passwords to be tied directly to the establishment of secure TLS channels (rather than transmitting plain passwords through secure TLS channels). Such a protocol can be constructed without radical modification to the existing TLS handshake messages. The identity/password-based TLS protocol is directly applicable to securing interactions between grid users and MyProxy in IKIG.

3 Future Work

Certificateless public key cryptography (CL-PKC) [1] offers an interesting combination of features from IBC and standard public key cryptography. In particular, each user selects a public/private key-pair (in addition to a TA-generated private key component that matches an identifier), thereby eliminating the problem of key escrow found
in IBC. Moreover, the public keys do not need to be supported by certificates, so as with IBC, many of the problems associated with certificate management in PKI are eliminated in CL-PKC.

As part of our future work, we plan to study the application of CL-PKC in grid environments. We will develop a security architecture for grid systems based on CL-PKC that parallels our existing work on IKIG. We will examine how CL-PKC can be used to support authentication and establishment of secure communications via a TLS-like protocol. We will also study the key management aspects of our architecture, such as key updating, key revocation and the use of credential storage systems, such as MyProxy, in our architecture.

Subsequently, we plan to conduct a performance analysis to compare our CL-PKC approach with the existing GSI and IKIG approaches.

References

[1] S.S. Al-Riyami and K.G. Paterson. Certificateless public key cryptography. In C.S. Laih, editor, Advances in Cryptology - Proceedings of ASIACRYPT 2003, pages 452-473. Springer-Verlag LNCS 2894, 2003.
[2] P.S.L.M. Barreto, S.D. Galbraith, C. Ó hÉigeartaigh, and M. Scott. Efficient Pairing Computation on Supersingular Abelian Varieties. Cryptology ePrint Archive, Report 2004/375, September 2005. Available at http://eprint.iacr.org/2004/375.
[3] J. Basney, M. Humphrey, and V. Welch. The MyProxy online credential repository. Journal of Software: Practice and Experience, 35(9):817-826, July 2005.
[4] D. Boneh and M. Franklin. Identity-based encryption from the Weil pairing. In J. Kilian, editor, Advances in Cryptology - Proceedings of CRYPTO 2001, pages 213-229. Springer-Verlag LNCS 2139, 2001.
[5] I. Foster and C. Kesselman. Globus: A metacomputing infrastructure toolkit. International Journal of Supercomputing Applications, 11(2):115-128, 1997.
[6] I. Foster, C. Kesselman, G. Tsudik, and S. Tuecke. A security architecture for computational Grids. In Proceedings of the 5th ACM Computer and Communications Security Conference, pages 83-92. ACM Press, 1998.
[8] P. Gutmann. PKI: It's not dead, just resting. IEEE Computer, 35(8):41-49, August 2002.
[9] H.W. Lim and K.G. Paterson. Identity-based cryptography for grid security. In H. Stockinger, R. Buyya, and R. Perrott, editors, Proceedings of the 1st IEEE International Conference on e-Science and Grid Computing (e-Science 2005), pages 395-404. IEEE Computer Society Press, 2005.
[10] H.W. Lim and K.G. Paterson. Secret public key protocols revisited. In Proceedings of the 14th International Workshop on Security Protocols 2006, to appear.
[11] G. Price. PKI challenges: An industry analysis. In J. Zhou, M-C. Kang, F. Bao, and H-H. Pang, editors, Proceedings of the 4th International Workshop for Applied PKI (IWAP 2005), pages 3-16. Volume 128 of FAIA, IOS Press, 2005.
[12] M. Scott. Computing the Tate pairing. In A. Menezes, editor, Proceedings of the RSA Conference: Topics in Cryptology - the Cryptographers' Track (CT-RSA 2005), pages 293-304. Springer-Verlag LNCS 3376, 2005.
[13] A. Shamir. Identity-based cryptosystems and signature schemes. In G.R. Blakley and D. Chaum, editors, Advances in Cryptology - Proceedings of CRYPTO'84, pages 47-53. Springer-Verlag LNCS 196, 1985.
[14] Shamus Software Ltd. MIRACL. Available at http://indigo.ie/∼mscott/, last accessed in April 2006.
[15] M.R. Thompson and K.R. Jackson. Security issues of grid resource management. In J. Weglarz, J. Nabrzyski, J. Schopf, and M. Stroinski, editors, Chapter 5 of Grid Resource Management: State of the Art and Future Trends, pages 53-69, Boston, 2003. Kluwer Academic.
[16] S. Tuecke, V. Welch, D. Engert, L. Pearman, and M. Thompson. Internet X.509 public key infrastructure proxy certificate profile. The Internet Engineering Task Force (IETF), RFC 3820, June 2004.
Abstract
In the existing Grid scheduling literature, the reported methods and strategies are
mostly related to high-level schedulers such as global schedulers, external schedulers,
data schedulers, and cluster schedulers. Although a number of these have previously
considered job scheduling, thus far only relatively simple queue-based policies such as
First In First Out (FIFO) have been considered for local job scheduling within Grid
contexts. Our initial research shows that it is worth investigating the potential impact
on the performance of the Grid when intelligent optimisation techniques are applied to
local scheduling policies. The research problem is defined, and a basic research
methodology with a detailed roadmap is presented. This paper forms a proposal with
the intention of exchanging ideas and seeking potential collaborators.
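To make the contrast in the abstract concrete: a FIFO local policy serves jobs in arrival order, whereas even a simple priority-driven policy (a minimal stand-in for the intelligent optimisation techniques proposed) reorders the queue. The sketch below is illustrative only; the job names and priority values are hypothetical.

```python
import heapq
from collections import deque

jobs = [("j1", 5), ("j2", 1), ("j3", 3)]  # (name, priority: lower = more urgent)

def fifo_order(jobs):
    # Queue-based local policy: serve jobs strictly in arrival order.
    queue = deque(jobs)
    return [queue.popleft()[0] for _ in range(len(jobs))]

def priority_order(jobs):
    # Priority-driven local policy: always serve the most urgent job next.
    heap = [(prio, name) for name, prio in jobs]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(jobs))]
```

A real intelligent scheduler would fold in the factors listed later in the paper (temporal constraints, storage limits, resource failure) rather than a single priority value, but the structural difference from FIFO is the same.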
state-of-the-art Grid literature: what happens if more intelligent techniques are adopted for local scheduling within a Grid environment? Furthermore, will these techniques help to improve the efficiency of job scheduling and resource management for the whole Grid? For example, due to the relatively slower data transfer rate and frequent data transfer latency between geographically distributed Grid services, local scheduling policies need to consider factors such as jobs' priority, temporal constraints, jobs' waiting policies, dataset limitations, finite capacity of storage for job queues and data, and resource failure. Resource trust expectations also need to be considered, and confidence levels can potentially become one of the criteria used to schedule and/or reschedule resources [11-12]. Under these circumstances, a queue-based scheduling policy is simply not good enough, and an intelligent method will be more efficient.

This research aims to investigate the impact on the performance of a Grid system when intelligent optimisation techniques are adopted by local schedulers within a generic Grid scheduling architecture. After the completion of the planned research, it will become clear which local scheduling policies, and which combinations of them, are efficient enough to improve the performance of the whole Grid.

The specific research objectives are:
1. To identify a variety of factors that local schedulers need to consider within a general Grid scheduling architecture;
2. To develop intelligent scheduling models and algorithms for computation and data scheduling within a Grid system;
3. To develop a simulator that can be used to evaluate the performance of a Grid system adopting one or more intelligent techniques for scheduling;
4. To demonstrate the efficiency of the developed models and methodologies through simulated Grid environments.

These research objectives are important to the future study of the Grid. The Grid paradigm originated as a new computing infrastructure for data-intensive computing problems and is now an established technology for the sharing of large-scale geographically distributed resources. Achieving these objectives will make resource allocation more efficient within a Grid.

4. Proposed Research Methodology

The project will follow a systematic approach as follows:

1. State-of-the-art review. Several factors are assumed to be relevant for Grid computing, e.g. jobs' priority, temporal constraints, jobs' waiting policies, and dataset and storage limitations. The first step of the project will be aimed at verifying these assumptions and matching them to the current state-of-the-art.

2. Pilot scenarios development. Key characteristics will be identified for Grid global and local scheduling, and a number of pilot scenarios will be developed for assessing the impacts of local policies on the performance of different strategies of the external scheduler (ES) and dataset scheduler (DS).

3. Grid system modelling. The Grid system will be modelled as a network of computing sites distributed geographically, each comprising a number of processors and a limited amount of storage. It is assumed that a number of users, each one associated with a particular site, submit jobs. Each job requires specific computation and dataset resources to be available, which may not reside on the site where the job is submitted. The general Grid scheduling architecture proposed in [2] will be adopted.

4. Modelling of scheduling problems. Novel constraint-based scheduling models for local schedulers within a Grid system will be proposed. The problem can be considered as a constraint satisfaction problem (CSP) where a set of jobs, a set of processors and a limited amount of data storage are given; each job requires a number of processors and an amount of storage for a certain time, satisfying a set of constraints such as jobs' priority, time-bound constraints etc., and the objective is to maximize the efficiency of a computing site, i.e. to schedule as many jobs as possible within a deadline.
5. Development of scheduling
330
Proceedings of the UK e-Science All Hands Meeting 2006, © NeSC 2006, ISBN 0-9553988-0-0
{opb,jcheney,cedward1,efountou}@inf.ed.ac.uk
Abstract
Scientific databases are continuously updated, either by adding new data or by deleting or modifying existing data. It is crucial to preserve historic versions of these databases so that researchers can access previous findings to verify results or to compare old data against new data and new theories. In this poster we present XMLArch, a tool for archiving scientific data stored in relational or XML databases. XMLArch builds upon and extends previous archiving techniques for hierarchical scientific data.
cal structure of the data and leverages the strong key structures which are common in the design of scientific datasets. We will also present the benefits of combining archiving as done in XMLArch with XML compression, using for the latter the state-of-the-art techniques surveyed in [5].

2 Archiving Scientific Data

In this section we give a high-level overview of the XMLArch archiver. As already mentioned, we chose the archiving approach presented in [2]. This work dealt with archiving scientific data that is stored in some hierarchical data format (such as XML). In the case of relational data we first need to extract the relational data into a hierarchical format, in this case XML. To do this we built a simple relational database extraction tool, the output of which is then passed into the archiving tool itself and optionally on to a compressor.

The idea for the initial relational database extractor was that it should not need to know anything at all about the semantics of the domain data. To facilitate this, the extractor uses a simple schema-agnostic XML format to represent the extracted relational data, making the implementation fairly portable across different databases and domains. The only input required is access to the relational database itself. The motivation for this simplistic approach is that often the person archiving a dataset has little or no prior experience with the data (e.g. a systems administrator).

This point-and-extract behaviour makes the extractor behave very much like conventional relational database backup tools. The eventual aim is that as much as possible of the relational information in the database will be preserved along with the XML snapshot. The theory is that as long as the data is preserved along with some form of structural description, possibly no more than a textual description of the relations, then someone in the future will be able to reconstruct the relations around the data. Obviously, if we can store as much of the schema as possible in a standard form, this reduces the work needed to recreate it.

One last aim was that the data extractor should not attempt to do too much. Tools already exist from virtually every relational database vendor to extract data into XML. The extractor component of this project should be seen as the simplest possible tool for getting the data out of the database in the absence of other more suitable options.

Once the relational data is extracted, it is archived by the XML archiver submodule of XMLArch using an approach based on the one introduced in [2]. The archiving techniques presented in that work stem from the requirements and properties of scientific databases: first, much scientific data is kept either in well-organized hierarchical data formats or has an inherent hierarchical structure; second, this hierarchically structured data usually has a key structure which is explicit in the design. The key structure provides a canonical identification for every part of the document, and it is exactly this structure that is the basis of the technique in [2].

In [2] the idea behind archiving is based on:

1. identifying the correspondence and changes between two given versions based on keys, and
2. merging the different versions using these keys into one archive.

Identifying correspondences between versions using keys is different from the diff-based approaches ([7]) used in most of the existing tools. Diff-based approaches rely on minimum edit distances between raw text lines and use no information about the structure of the underlying data. Our key-based approach, on the other hand, attempts to unify objects in different data versions based on their keyed identity. In this way, the archive can not only preserve the semantic continuity between the elements but also efficiently support queries on the history of those elements.

The merging of different versions into one archive is again different from the diff-based approaches, which store the deltas from version to version independently. Occurrences of elements are identified using the key structure and stored only once in the final archive. A sequence of numbers (a timestamp) is used to record the sequence of versions in which an element appears. This timestamp is conceptually stored with every element in the hierarchy. Taking advantage of the hierarchical nature of the data, the timestamp in an element is stored only when it is different from that of its parent. In this way, a fair amount of space overhead is avoided by inheriting timestamps.

The final archive is stored as an XML document. XML was chosen as the archive format for a number of reasons: for one thing, the hierarchical models used in many biological databases are very similar to the XML data model. In addition, XML is currently the most prevalent textual format for hierarchical data. Because of this there are a significant number of commercial and open-source tools readily available for manipulating XML. Given that one of the most important considerations when archiving is to make sure that the data is retrievable at a later date, using a widespread, human-readable, text-based format would seem to increase the likelihood that the data will be accessible in the future.

The final XML archive is then optionally passed into standard text or XML compression tools to further reduce the storage requirements of the archive. As the same archiving principles noted in [2] are used by the XML archiver, the same benefits in terms of storage overhead reduction noted in that work apply. As was shown in [2], the resulting output is often particularly well suited to XML-specific compressors such as XMill [9]. This is unsurprising given the hierarchical models underlying the source databases, and is supported by the findings of [4].
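The key-based merge and version timestamps described above can be sketched as follows; the flat key set and the merge function are illustrative assumptions rather than XMLArch's actual data structures:

```python
# Sketch of keyed merging: each element, identified by its key, is
# stored once and stamped with the set of versions in which it
# appears.  Representing a version as a flat set of keys is an
# assumption for brevity; the real archive is hierarchical, with
# timestamps inherited from parent elements where they agree.

def merge(archive, snapshot, version_no):
    """Merge one keyed snapshot into the archive, recording for each
    keyed element the set of versions in which it appears."""
    for key in snapshot:
        archive.setdefault(key, set()).add(version_no)
    return archive

archive = {}
merge(archive, {"dataset", "chapter:1274", "receptor:2179"}, 1)
merge(archive, {"dataset", "chapter:1274", "receptor:2181"}, 2)

# Elements common to both versions carry t = {1, 2}; the deleted and
# added receptors carry t = {1} and t = {2} respectively.
print(sorted(archive["dataset"]))        # [1, 2]
print(sorted(archive["receptor:2179"]))  # [1]
print(sorted(archive["receptor:2181"]))  # [2]
```

The keys used here mirror the IUPHAR example discussed later (chapter 1274, receptors 2179 and 2181).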
[Figure 1: The upper part shows two versions of the IUPHAR data as keyed trees: Version 1 contains the record with receptor_id 2179 and Version 2 the record with receptor_id 2181, while both share the chapters record with chapter_id 1274 (Adrenoceptors). The lower part shows the merged archive, in which shared elements such as the dataset carry the timestamp t={1,2}, the deleted record carries t={1}, and the added record carries t={2}.]
3 An Example: Archiving the IUPHAR database

In this section we present an example to demonstrate how XMLArch is used to archive IUPHAR relational data. To present our approach we use a simple example of IUPHAR data.

We consider tables chapters and receptors shown below (the primary keys for all tables are underlined). The first stores the receptor families, where each family has a chapter id (primary key) and a name. Table receptors stores the name and code of receptors. Attribute receptor id is the primary key for the table, and attribute chapter id indicates the receptor family to which the receptor belongs.

Source relational schema R:
chapters(chapter id, name)
receptors(receptor id, chapter id, name, code)

The XML DTD to which this data is published is shown below:

1. <!ELEMENT dataset (table*)>
2. <!ELEMENT table (record*)>
3. <!ATTLIST table name CDATA #REQUIRED>
4. <!ELEMENT record (field+)>
5. <!ELEMENT field (#PCDATA)>
6. <!ATTLIST field name CDATA #REQUIRED
7.           keyfield (true|false) 'false'>

Element table stores information about a relational table. Attribute name records the name of the table (line 3). Each table element has one or more record elements (line 2), where such an element is defined for each tuple of the corresponding relational table. A record element has one or more field subelements (line 4). A field element is defined for an attribute of the relational table, with XML attribute name recording the name of the relational attribute (line 6). Attribute keyfield records whether or not the attribute participates in the primary key of the relational table (line 7).

In addition to this XML data, we also need a way to identify the XML elements in the extracted XML document. This is done by using the notion of XML keys introduced in [1]. In this work an XML element is uniquely identified by the values of a subset of its descendant elements.

For example, in our case a table element is uniquely identified by its name attribute. A record element within the table element defined for the relational table chapters (i.e., the table element with value chapters for its name attribute) is uniquely identified by the value of its field subelement that corresponds to the relational attribute chapter id. In a similar manner, a record element within a table element that corresponds to the receptors relational table is uniquely identified by the value of its field subelement defined for the receptor id relational attribute. This key information is defined by means of XPath [6] expressions, as advocated in [1]. In XMLArch the annotator component, a subcomponent of the archiver, records with each keyed element its XML key and value whenever applicable (e.g. elements table and record). These key annotations are used during the merging of a new version of the database with the archived version to unify element identities.

We show in the upper part of Figure 1 two versions of the database. The difference between Version 1 and Version 2 is that the receptor with receptor id equal to 2179 has been deleted and the receptor with receptor id equal to 2181 has been added.

The archived XML document is also shown in Figure 1. One can observe that elements that exist in both versions are stored only once (e.g., the table elements and the dataset element). Notice that there is a timestamp (t = {1,2}) associated with the dataset element, indicating that this element (as well as all its descendants that do not have a timestamp) is present in both versions. Observe that the record
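A minimal extractor emitting the generic dataset/table/record/field format of the DTD above might look as follows; the use of sqlite3 and the sample row are assumptions for illustration, and XMLArch's actual extractor is not reproduced here:

```python
# Sketch: dump relational rows into the schema-agnostic
# dataset/table/record/field format described by the DTD above.
# sqlite3 and the sample chapters row are illustrative assumptions.
import sqlite3
import xml.etree.ElementTree as ET

def extract(conn, tables, keys):
    """Serialise the given tables; keys maps table name -> primary-key
    column names, recorded in the keyfield attribute."""
    dataset = ET.Element("dataset")
    for tname in tables:
        table = ET.SubElement(dataset, "table", name=tname)
        cur = conn.execute(f"SELECT * FROM {tname}")
        cols = [d[0] for d in cur.description]
        for row in cur:
            record = ET.SubElement(table, "record")
            for col, val in zip(cols, row):
                field = ET.SubElement(
                    record, "field", name=col,
                    keyfield=str(col in keys[tname]).lower())
                field.text = str(val)
    return dataset

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chapters (chapter_id INTEGER, name TEXT)")
conn.execute("INSERT INTO chapters VALUES (1274, 'Adrenoceptors')")
doc = extract(conn, ["chapters"], {"chapters": {"chapter_id"}})
print(ET.tostring(doc).decode())
```

Note that, as in the paper's design, the extractor needs no knowledge of the domain semantics: only database access and the primary-key columns.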
FEA of Slope Failures with a Stochastic Distribution of
Soil Properties Developed and Run on the Grid
William Spencer1, Joanna Leng2, Mike Pettipher2
1 School of Mechanical Aerospace and Civil Engineering, University of Manchester
2 Manchester Computing, University of Manchester
Abstract
This paper presents a case study of how one user developed and ran codes on a grid, in this case the National Grid Service (NGS). The user had access to the core components of the NGS, which consist of four clusters configured, unlike most other U.K. academic HPC services, to be used primarily through grid technologies, in this case Globus. This account covers how the code was parallelised, its performance, and the issues involved in selecting and using the NGS for the development and running of these codes. A general understanding of the application area, computational geotechnical engineering, and of the performance issues of these codes is required to make this clear.
A further limitation is placed on the mesh resolution in order to preserve accuracy; the maximum element size is 0.5 m cubed. 500 realisations were necessary to gain an accurate understanding of the reliability of the slope failures.

Figure 2: 3D random field

3. Strategy for Parallelisation

Two approaches were investigated: one that uses a serial solver but achieves parallelism by task-farming realisations to different processors, and one that uses a parallel solver together with task farming. Both codes were adapted from their original forms [4].

Once developed, both codes were analysed to discover which would be most appropriate for full-scale testing. A serial version, using a direct solver, was developed on a desktop platform. The second, a simple parallel version that uses an iterative element-by-element solver, was developed and optimised on a local HPC facility, an SGI Onyx 300 platform with 32 MIPS-based processors.

The limiting factor preventing further work on the Onyx was time to solution, with serial solutions taking six times longer than on the user's desktop. In order to allow larger analyses to be conducted, 100,000 CPU hours were applied for, and allocated, on the NGS. The NGS consists of four clusters of Pentium 3 machines, each processor running approximately 1.5 times faster than the user's desktop and 10 times faster than the Onyx processors.

same stress return method; however in this solver the factorisation and the loading are inherently linked. Other stress return methods are not available for the Tresca soil model used.

4.1 Comparison of direct and iterative solvers

It is the normal expectation that the iterative solver would outperform the serial direct solver in a parallel environment. Indeed, the iterative solver showed good speedup when run on increasing numbers of processors, as expected by Smith [4]. The plasticity analysis discussed here, however, does not fit this generalisation: consideration of the mathematical procedures adopted in the iterative and direct solvers leads to the observation that the direct solver takes advantage of the unchanging mesh.

Table I compares timings for the codes with both solvers, the iterative and the direct, for one realisation only. The code with the direct solver was run on 1 processor, while the code with the iterative solver was run on 8 processors. The ratio of wall times for the two solvers demonstrates the algorithmic efficiency of the direct solver for this particular analysis. When the use of task farming for multiple realisations is studied, the direct solver becomes even more competitive; assuming efficient task farming over 8 processors, this is demonstrated by the ratio of CPU times: the serial-solver strategy is an order of magnitude faster than the parallel solver.

The minimum desired mesh depth for this problem is 60 elements, with 500 realisations and 16 sets of parameters needed for a minimal analysis. The user found that feasibly only 32 processors were available for use on the NGS. Based on extrapolation of Table I, a rough estimate of wall-clock time can be calculated: task-farming with the direct solver requires 5.8 days, while task-farming with the iterative solver requires 57.9 days. Clearly the direct solver combined with running realisations in parallel is the only reasonable choice.
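The rough estimate above can be reproduced with simple arithmetic; the per-realisation times at 60 elements are assumed linear extrapolations from Table I, so the figures are indicative only:

```python
# Back-of-envelope reproduction of the wall-clock estimate above.
# The 60-element per-realisation times are assumed linear
# extrapolations from the Table I rows at 15 and 30 elements, so the
# results are only of the same order as the quoted 5.8 and 57.9 days.

realisations = 500 * 16          # 500 realisations x 16 parameter sets
processors   = 32                # feasibly available on the NGS

def extrapolate(t15, t30):
    """Linearly extend a Table I timing from 30 to 60 elements."""
    return t30 + 2 * (t30 - t15)

t_direct    = extrapolate(592.4, 1092.5)    # s per realisation, 1 proc
t_iterative = extrapolate(982.4, 1601.1)    # s per realisation, 8 procs

# Direct: 32 independent serial jobs; iterative: four 8-processor jobs.
days_direct    = t_direct * realisations / processors / 86400
days_iterative = t_iterative * realisations / (processors // 8) / 86400

print(f"direct:    {days_direct:.1f} days")
print(f"iterative: {days_iterative:.1f} days")
```

The order-of-magnitude gap between the two strategies is what drives the choice of the task-farmed direct solver.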
Number of Elements     Direct Solver on        Parallel Solver on      Wall Time    CPU Time
in y Direction         1 processor (A),        8 processors (B),       Ratio B/A    Ratio B/A
(length of slope)      Time (s)/Realisation    Time (s)/Realisation
 1                       27.7                    111.7                 4.0          32.3
 3                      110.9                    245.0                 2.2          17.7
 5                      200.9                    392.6                 2.0          15.6
15                      592.4                    982.4                 1.7          13.3
30                     1092.5                   1601.1                 1.5          11.7

Table I: Comparison of direct and iterative solver run times.
the number of degrees of freedom divided by the number of processors used.

Thus the direct solver is limited by the amount of memory locally available to the CPU, which on the two main NGS nodes is 1 GB. This gives an absolute limit to the number of elements that can be solved; in this case it is fortunate that the maximum size of mesh is large enough to give a meaningful solution.

The results in Table I show that the iterative solver is increasingly competitive the larger the problem gets. Combined with the memory limitation of the direct solver, a further increase in mesh size could only be achieved using the iterative solver.

5. Experimentation

The validity of the final code was proven by comparing its results with those of the previously studied 2D version [5] and with well-established deterministic analytical results. In both cases the results compared very favourably, providing a check on both the 3D FEA code and the 3D random field generator.

Further to the validation, full-scale analyses are currently being undertaken. Preliminary results [6] show that use of the more realistic 3D model has a significant effect on the reliability of the slope. These results have interesting implications for the design of safe and efficient slopes, showing that a 'safe' slope designed by traditional methods can be either very conservative or risky if the effects of soil variability are not considered.

6. Suitability and Use of the National Grid Service (NGS)

The use of commodity clusters is ideal for this code's final form, as its fundamental nature does not require the ultra-fast inter-processor communication or shared memory of some proprietary hardware. The very large memory requirements of the direct solver made the individual per-processor memory limit the constraining factor. For the alternate iterative solver version, high-speed interconnections would go some way towards speeding it up.

The NGS is configured to be used through Globus. In this work Globus was used for several purposes, from interactive login to file transfer and job submission.

The user is based at the School of Engineering at the University of Manchester, and it is worth noting that this school has a policy of only supporting Microsoft Windows on its local desktop machines; Globus is not part of the routine configuration. Globus was not installed locally; instead the user started a session on a local Onyx and started their Globus session from there.

Once the user had his code working on one of the NGS nodes, he transferred and compiled a copy on each of the other nodes. Job submission was performed using each of these executables. The code is designed to run in a particular environment with set directories and input and output files. The user developed a number of scripts to automate the set-up of this configuration, and these were run by hand before a job was run. With time and confidence the user could completely automate the process.

The jobs were submitted from the Onyx using scripts to write and execute PBS job scripts; these were configured to provide a wide range of job submission types and then reused or adapted ad hoc. The initial scripts were quite simple, with just a globus-job-run command; these saved the user typing out complex commands. More sophisticated scripts were developed as the user became confident. These scripts set environment variables and used the Resource Allocation Language file for job submission. The user monitored the load on all the nodes so that he could submit jobs to the node with the lowest load.

A small amount of work was needed to get the codes that had previously been running on the Onyx to run on the NGS machines. The compiler on the NGS nodes was stricter in its implementation of standards. The method adopted for editing the code was to do so on a
desktop Windows machine in a FORTRAN editor and then transfer the file via the local Onyx to the NGS node, where it was compiled. This did take some extra time in transferring files to and fro, but in general the reduction in execution time from running tests on the NGS more than made up for it.

Debugging was achieved by printing output to file, and by visualization when the code ran to completion but incorrectly. While TotalView is available on the nodes, the user was not aware of this tool's functionality or how to use it.

Overall, developing the code on the grid was no more painful than in any other environment, the only downside being the lack of processor availability when the service is busy, requiring the use of either a different NGS node or the local Onyx.

7. Discussion and Conclusions

It should be noted that prior to this project on the NGS this user had no experience or understanding of e-Science and the grid. He was not familiar with HPC and had little practice in using HPC services. Initially it was daunting to use a service like the NGS, where the policies and practices were different to those of the user's local HPC service. The user took some time and support to learn how to get the best out of the service, but in the end was happy with both the service and the computational results.

The main value of using the NGS for this user was to allow the execution of code requiring large amounts of CPU time. Large volumes of CPU time on in-house machines are often hard to come by because they are in high demand. At present the NGS is moderately loaded and has powerful computers, allowing such large analyses to be run in a timely fashion. Generally a run requiring 32 CPUs was started within 24 hours. It also (usually) allowed virtually on-demand access to run small debug or test programs, most useful in code development.

The NGS has been in full production since September 2004 and currently has over 300 active users. This user applied for resources near the beginning of the service, in autumn 2004. The loading of the service has increased steadily, and with this the monitoring of the use of resources has become more critical [3]. As the number of users increases further and the resources become scarcer, it is expected that the policy of the NGS will develop. Given the very large allocation of time given to this user (100,000 hours), this has not been of severe detriment, but may constrain other users with smaller allocations.

In its present form, the ideal solution to filling the desired number of CPU hours would be to harness the spare CPU time of the university's public clusters via a distributed grid application such as BOINC [7] or Condor [8]. Short of this, the use of the NGS provides a ready-to-run and largely trouble-free resource, with good support services.

The future of this particular application is no doubt in a fully parallel implementation, with the iterative solver. This is the only computational approach that will deal with the demands of the science, which will require the analysis of higher-resolution and longer slopes. Little optimisation or profiling was used on any of the codes; it is expected that improvements to efficiency and speed would be introduced, particularly to the iterative solver version. Further development of the tangent stiffness method, making it applicable to this problem, would dramatically improve the performance of the iterative solver, at the expense of considerable research time.

Acknowledgment

The authors would like to acknowledge the use of the UK National Grid Service in carrying out this work.

References

[1] K. Hyunki, "Spatial Variability in Soils: Stiffness and Strength", PhD thesis, Georgia Institute of Technology, August 2005, pp 15-16.
[2] G.A. Fenton and E.H. Vanmarcke, "Simulation of random fields via local average subdivision", J. of Engineering Mechanics, ASCE, 116(8), Aug 1990, pp 1733-1749.
[3] NGS (National Grid Service); http://www.ngs.ac.uk/, last accessed 21/2/06.
[4] I.M. Smith and D.V. Griffiths, "Programming the Finite Element Method", third edition, John Wiley & Sons, Nov 1999.
[5] M.A. Hicks and K. Samy, "Reliability-based characteristic values: a stochastic approach to Eurocode 7", Ground Engineering, Dec 2002, pp 30-34.
[6] W. Spencer and M.A. Hicks, "3D stochastic modelling of long soil slopes", 14th ACME Conference, Belfast, April 2006, pp 119-122.
[7] Berkeley Open Infrastructure for Network Computing; http://boinc.berkeley.edu/, last accessed 2/4/06.
[8] Condor for creating computational grids; http://www.cs.wisc.edu/pkilab/condor/, last accessed 4/7/06.
Reagan W. Moore, Arcot Rajasekar, Michael Wan, Wayne Schroeder, Richard Marciano
San Diego Supercomputer Center
Abstract
Trusted digital repository audit checklists are now being developed, based on assessments of organizational infrastructure, repository functions, community use, and technical infrastructure. These assessments can be expressed as rules, applied to state information, that define the criteria for trustworthiness. This paper maps the rules to the mechanisms that are needed in a trusted digital repository to minimize the risk of loss of both data and state information. The required mechanisms have been developed within the Storage Resource Broker data grid technology, and their use is illustrated on existing preservation repository projects.
choosing to replicate data across different types of storage media, across different vendor storage products, between geographically remote sites, and into deep archives, the technology failures can be addressed.

By supporting data virtualization, the ability to manage collection properties independently of the choice of storage system, most obsolescence issues can be addressed. Data virtualization enables the incorporation of new technology, including new access protocols, new media, and new storage systems. Format obsolescence is managed through the use of versions. Data virtualization also supports the evolution of the metadata schema, ensuring that new preservation authenticity attributes can be used over time. By choosing to federate independent preservation environments that use different sustainability models and different operational models, it is possible to address the organizational challenges.

2. Risk Mitigation Mechanisms

The Storage Resource Broker supports a wide variety of mechanisms that can be used to address the above risks. The mechanisms can be roughly divided into the following categories:
- Checksum validation
- Synchronization
- Replication, backups, and versions
- Federation
We examine the actual SRB Scommand utilities to illustrate how each of these types of integrity assurance mechanisms is used in support of the NARA Research Prototype Persistent Archive [4] and the NSF National Science Digital Library persistent archive [5].

2.1 Checksums

Checksums are used to validate the transfer of files, as well as the integrity of stored data. The SRB uses multiple types of checks:
- TCP/IP data transport checksum
- Simple check of file size
- Unix checksum (System5 sum command; note that it does not detect blocks that are out of order)
- MD5 checksum (robust checksum, implemented as a remote procedure)
The SRB uses TCP/IP to do all data transport. This implies that checksums are validated by the transfer protocol during all data movement. However, the TCP/IP checksum is susceptible to multiple-bit errors and may not detect corruption of large files. Also, the transport protocol does not guarantee end-to-end data integrity, from the submitting application to the final disk storage.

The preferred method is to checksum a file before transport, register the value of the checksum in the MCAT metadata catalog, and then verify the checksum after the transfer.

Sput –k localfilename SRBfilename
    Put a file into a SRB collection. The client computes a simple checksum (System5 sum command) of the local file and registers it with MCAT. No verification is done on the server side.
Sput –K localfilename SRBfilename
    Put a file into a SRB collection. After the transfer, the server computes the checksum by reading back the file that was just stored. This value is then compared with the source checksum value provided by the client. The verified checksum value is then registered with MCAT.
Sget –k SRBfilename localfilename
    Get a file from a SRB collection. Retrieves the simple checksum (System5 sum command result) from the MCAT and compares it with the checksum of the local file just downloaded.

The SRB provides mechanisms to list the sizes and checksums of files that were loaded into a SRB collection. The utilities identify discrepancies between the preservation state information and the files in the storage repositories.

Sls -V
    Verifies file size in a SRB vault against the file size registered in MCAT. Lists an error when a file does not exist or the sizes do not match.
Schksum -l SRBfilename
    Lists the checksum values stored in MCAT, including checksums of all replicas.
Schksum SRBfilename
    If a checksum exists, nothing is done. If the checksum does not exist, it is computed and stored in MCAT.
Schksum –f SRBfilename
    Forces the computation and registration of a checksum in MCAT.
Schksum –c SRBfilename
    Computes the checksum and verifies the value against the checksum registered in MCAT.
Spcommand –d <SRBfile> "command command-input"
    Discovers where SRBfile is stored, and executes command at the remote site. The MD5 checksum is implemented as a proxy command.

2.2 Synchronization

The SRB supports the synchronization of files from system buffers to storage, between SRB
collections, and between a SRB collection and a local directory. The synchronization commands are used to overcome errors such as the loss of a file, or a system halt during transfer of a file. In addition, the SRB server now incorporates mechanisms to validate the integrity of recently transferred files.

srbFileChk server
    The srbFileChk server checks the integrity of newly uploaded files. By default, it runs on the MCAT-enabled host. It can also be configured to run on other servers by changing the fileChkOnMes and fileChkOnThisServer parameters in the runsrb script.
    A fairly nasty loss of data integrity can occur in recently uploaded files if the UNIX OS of the resource server crashes during the transfer: data that were uploaded successfully may still be in a system buffer and not on disk. This problem is particularly bad because the SRB system has already registered the files in the MCAT and is not aware of the integrity problem until someone tries to retrieve the data. A typical symptom of this mode of corruption is that the size of the corrupted file is not the same as the one registered in MCAT. The srbFileChk server performs a file check operation for newly created SRB files and lists errors. By default, it wakes up once per day to perform the checking.
SphyMove –S targetResource SRBfilename
    Sysadmin command to move a user's files between storage resources. Since the original copy is removed after the transfer,

    List all the files that have different checksums between the source and destination SRB collections.
Synchronization commands are also used to ensure that the multiple copies of a file have the same bits.
Ssyncd SRBfilename
    Synchronizes all replicas of a file to be the same as the "dirty" (changed) version.
Ssyncd –a –S SRBresourcename SRBfilename
    Synchronizes all replicas of a file, ensuring that a copy exists on each of the physical storage systems represented by the logical storage resource name SRBresourcename.
Ssyncont SRBcontainer
    Synchronizes the permanent copy of a SRB container (residing on tape) with the cached copy residing on disk.
Ssyncont –z mcatZone SRBcontainer
    Synchronizes the SRB container residing in the data grid zone mcatZone with the cached copy residing on disk in data grid zone mcatZone.
Synchronization between the metadata catalog and the storage repository is also needed, either to delete state information for which no file exists, or to delete files for which no state information exists.
Schksum -s SRBfilename
    Does a quick check on data integrity using file size. Any discrepancies are listed as errors, including files not present on the storage system. This identifies bad state information.
vaultchk utility
    This is a sysadmin tool that identifies and
the UNIX fsync call is executed after each deletes orphan files in the UNIX vaults.
successful SphyMove to ensure the files do Orphan files (files in SRB vaults with no
not remain in a system buffer. entry in the MCAT) can exist in the SRB
Srsync –s Localfilename s:SRBfilename vault when the SRB server stops due to
Synchronize files from local repository to system crashes or server shutdown at certain
SRB collection checking for differences in critical points (such as during a bulk upload
file size. operation).
Srsync –s s:SRBfilename Localfilename
Synchronize files from SRB collection to 2.3 Replica, backups, and versions
local repository checking for differences in The synchronization commands rely on the
file size. existence of multiple copies of the files. We
Srsync –s s:SRBsourcefile s:SRBtargetfile differentiate between:
Synchronize files between two SRB - Replica, a copy of a file that is stored under
collections checking for differences in size. the same logical SRB file name.
Srsync Localsourcefile s:SRBtargetfile - Backup, a copy of a file that has a time
Synchronize files from local repository to stamp.
SRB collection using checksum registered - Version, a copy of a file that has an
in MCAT. associated version number.
Srsync –l s:SRBsourceCollection Replicas can be synchronized to ensure that
s:SRBtargetCollection they are bit for bit identical. A Backup enables
the storage of a snapshot of a file at a particular
time. Versions enable management of changes
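The Schksum checks of section 2.1 can be sketched in a few lines of Python. The in-memory mcat dictionary and the function names below are illustrative stand-ins for the real MCAT catalogue and S-commands, not the SRB API:

```python
import hashlib
import os

# Toy stand-in for the MCAT catalogue: logical name -> (size, MD5 digest).
mcat = {}

def register_checksum(path, logical_name):
    """Compute and register size and MD5 checksum (cf. 'Schksum -f')."""
    with open(path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    mcat[logical_name] = (os.path.getsize(path), digest)

def verify_checksum(path, logical_name):
    """Recompute the MD5 and compare it with the registered value (cf. 'Schksum -c')."""
    with open(path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    return mcat[logical_name][1] == digest

def quick_check(path, logical_name):
    """Size-only check (cf. 'Schksum -s'): cheap, and catches the truncated-upload
    corruption described above, where the stored size disagrees with MCAT."""
    return os.path.exists(path) and os.path.getsize(path) == mcat[logical_name][0]
```

A size mismatch after a resource-server crash makes quick_check fail even though the file is registered, which is exactly the symptom the srbFileChk server looks for.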
Proceedings of the UK e-Science All Hands Meeting 2006, © NeSC 2006, ISBN 0-9553988-0-0
Abstract: The Centre for Ecology & Hydrology (CEH) holds various databases collectively representing a
valuable environmental research resource. However, their use inside and outside CEH is constrained by lack
of data accessibility and interoperability. This project has focused on three test-bed datasets held at the
Lancaster Environment Centre which form a good example of the diversity of CEH terrestrial and freshwater
data. Metadata systems and access tools have been constructed in collaboration with the NERC Data Grid
(NDG) to provide users with Grid services linking data discovery to dataset delivery. This paper focuses on
the modelling of CEH’s data and metadata using the NDG models.
Where appropriate, CEH have implemented a two-tier data structure so that authorised users can access raw data whereas summarised data is available to everyone. For both ECN and Lakes, the data has been summarised temporally.

For CS, the raw data is accessible but the precise location of the survey is not disclosed. Thus, datasets are available at the CS square level but the location shown is just the government office region. Hence, only one authorisation role is required.

3. The Elements of MOLES

To gain an understanding of how the test data has been modelled and the issues involved, it is important to introduce the models themselves. There are four main elements within MOLES:
- Activity, e.g. a whole project or a field survey in a particular location.
- Data Production Tool. Used to make measurements and collect data, e.g. something as simple as a quadrat.
- Observation Station. A site of data production tools, e.g. a nature reserve.
- Data Entity. This contains metadata about the data itself and is linked to usage metadata (in a CSML file).

These elements are linked by IDs to create deployments (the use of a data production tool at an observation station to produce a data entity on behalf of an activity). Much of CEH's data is collected by scientists undertaking surveys in the field. MOLES handles this by classifying the surveyor as an observer within the deployment.

4. Introduction to CSML

There are several components within CSML.

Phenomenon. This is a property or variable that an activity sets out to measure. This can be referenced to a standard dictionary (such as Climate Forecast standard names [14]).

Unit. The unit of measure of a phenomenon. Units can be referenced to a standard dictionary such as Unidata's 'udunits' [15].

Feature Type. An abstract representation of the measurement of a phenomenon, e.g. a measurement taken at one point over a time series is a PointSeriesFeature.

Array. Sometimes, a temporal or spatial coordinate series is used repeatedly. It is then more efficient to declare an Array once and make references to it than explicitly write out the series every time.

Reference System. These provide efficiencies when there is a systematic pattern in (say) spatial coordinates or a time series. This means that every time or spatial coordinate need not be specified explicitly. Instead, we can specify a starting point, an interval and an end-point or total number of elements.

5. Modelling data using CSML

5.1 Water chemistry & meteorological data

CSML was developed by the Atmospheric Science and Oceanography communities and as such copes very well with water chemistry & meteorological data. Such data is characterised by chemical and physical measurements taken at single points in space over time series.

5.2 Species

Each CEH dataset may contain data for hundreds of different species. These species names could be rolled up into phenomenon names, e.g. 'count of bat species A', 'count of bat species B', etc. However, this would require huge numbers of different phenomena within a dataset that can not be matched to standard phenomenon dictionaries. The solution implemented is to create an EcoGrid dataset for each species.

5.3 Categorical Data

CEH has a significant amount of categorical and free-text type data (e.g. habitats), which should clearly be represented as phenomena. Unlike most phenomena, these do not possess units and so will not possess a 'unit' attribute.

5.4 Missing values

A common scenario is that measurements are made over a time series. However, for CEH, occasionally there are breaks in the series, perhaps due to equipment being shut down for maintenance, etc. Potentially, this could result in more times in the series than there are measurements recorded. CSML copes well with this by allowing us to explicitly state when a "missing value" occurs.
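The Reference System of section 4 and the missing-value handling of section 5.4 can be illustrated with a short sketch. This is plain Python rather than CSML syntax, and the function names and the -9999 marker are assumptions for illustration only:

```python
def expand_axis(start, interval, count):
    """Expand a (start, interval, count) reference system into explicit
    coordinates, so each time or spatial coordinate need not be written out."""
    return [start + i * interval for i in range(count)]

MISSING = -9999.0  # an explicitly declared "missing value" marker

def present_measurements(axis, values):
    """Pair coordinates with measurements, skipping declared missing values
    (e.g. gaps caused by equipment being shut down for maintenance)."""
    return [(t, v) for t, v in zip(axis, values) if v != MISSING]

# Weekly sampling expressed as start=0, interval=7, count=4: days 0, 7, 14, 21.
weekly = expand_axis(0, 7, 4)
```

Declaring the axis once and marking gaps explicitly keeps the series well-formed even when there are more times in the series than recorded measurements.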
6. Issues & Potential Solutions

6.1 Species

One of the main limitations of CSML is that it has no obvious way of modelling the biological concepts that make up much of CEH's data. Often the number of individuals of a species is counted, and this can in places be subdivided by gender, morph, stage of development, etc. There are two possible solutions.

Solution 1

The various combinations of these attributes could be rolled up into phenomenon names. This is conceptually simple but creates a few problems. Firstly, a very large number of phenomena would be required. Secondly, these phenomena would be so fine-grained that they can not be referenced to a standard dictionary, and other researchers would find it difficult to compare their results with CEH's. Finally, the phenomenon name used in the MOLES file would be a simple name (such as number of individuals) and as such would not match up to the phenomenon name in the corresponding CSML file.

Solution 2

Another solution is the use of composite phenomena [4], a concept inherited by CSML. This consists of another phenomenon (e.g. count of individuals of species A) to which a vector is applied. In this example, the vector could be gender (e.g. male and female). It is also possible to nest composite phenomena, so that a composite phenomenon consists of another composite phenomenon (and a vector) which in turn consists of a simple phenomenon (and another vector). So, extending the example could result in:

Composite phenomenon 1 (count of individuals by species by gender) = composite phenomenon 2 (count of individuals by species) * vector 1 (gender)
AND
Composite phenomenon 2 = Simple phenomenon (count of individuals) * vector 2 (species)

An advantage of this approach is that the root phenomenon is simple and standardised. However, a major problem is that the 'Composite Phenomenon' construct of OGC is under development [4], and CSML's associated tools do not currently support it. An additional problem is that it is not clear how the relationship between data values and vectors would be illustrated. For example, if a phenomenon applies across a range of locations then the string of values measured will be of the form "x1 x2 x3". If the phenomenon is complex and contains a vector (say gender) applied to a simple phenomenon then the string could be of the form "(x1 y1), (x2 y2), (x3 y3)". It is not then clear whether y1 is the phenomenon witnessed for gender 1 at location 2 or whether it is for gender 2 at location 1. The problem gets worse as the level of nesting increases.

For now, this issue has been circumnavigated by making data available only at the 'number of individuals' level. In the future, ontologies may be used to cope with such situations better.

6.2 Profile Series Features

CSML has a feature type called ProfileSeriesFeature to represent a measurement recorded at various points along a directed line in space over a time series.

CEH has lake data where measurements are taken at various water depths over a time series. Here, the ProfileSeriesFeature type would be ideal for modelling such data. However, since the measurement depths may not be constant from one time to the next, ProfileSeriesFeatures can not be used. This situation is encountered in other domains (e.g. observational oceanography) and a new feature type will be introduced to cope with such eventualities in a future revision of CSML.

6.3 Observations & Measurements paper

CSML inherits part of its structure from the schema detailed in the Open Geospatial Consortium Observations & Measurements paper [4]. This paper contains an example of how this model could be applied to an ecological survey, and has been reviewed to see whether it provides solutions to the issues faced when using CSML. However, the data modelled in the example is far simpler than what CEH has, so it has not provided a complete solution.

6.4 Transects

Some of CEH's data (e.g. butterfly surveys) is collected via transects. Transects are just routes
that the observer walks along. As the observer follows the transect, they make observation notes at various points. CSML has a Trajectory feature type that might be used to model transect data. However, data is repeatedly collected from a transect, forming a time series. Unfortunately, trajectories can not be used to model time series.

An alternative is the ProfileSeriesFeature type. However, a profile must be a straight line with a specified direction, whereas a transect may be a curved path. This problem is avoided by summing data at a higher level (e.g. site).

6.5 Sublocations

For some of its data, CEH uses several different subdivisions to express location more precisely, e.g. for spittle bugs there are sites comprising locations that comprise quadrats. This situation is not handled well by CSML. One option is to use reference systems, but this would be very complex to implement and understand. Again, the solution adopted has been to sum data at a higher level (e.g. site or location code).

6.6 Date type phenomena

In some cases, there is date-type data that does not reflect the time at which measurements were made but instead is one purpose of the survey itself. For example, one survey measures when frogs are first seen congregating, hatching and leaving the pond. This data must therefore be modelled as phenomena. It is possible to do this if we do not specify a related unit. However, this implies that the date is just a text string when in fact it carries more meaning than that.

If CSML allowed us to specify the corresponding unit as 'date' then a user would be able to make sensible comparisons between data, e.g. to see where frogs are hatching first.

6.7 Date Type Parameters

Typically, a measurement is made at a point in space over a time series. However, there are cases, particularly for water chemistry, where samples are taken from a river and then at some subsequent time/date, analysis is done. There can be several stages for the analysis and the time/date of each is recorded, e.g. pH measurement, filtration, and completion of analysis.

There is currently no suitable place for these within CSML. They are not phenomena because they are not something a scientist sets out to measure. On the other hand, it is possible to record one time/date as an input parameter but not several. Such data is omitted from CSML files.

6.8 Replicate measurements

For water chemistry measurements, CEH sometimes uses multiple test tubes at the same location at the same time to measure the same phenomena. There is no mechanism within CSML for modelling this. Composite phenomena could be used with a vector containing the replicate IDs. However, these are not currently supported in CSML. At this stage the only possible solution is to store mean or modal values of these replicates.

6.9 Protocol metadata

To study freshwater invertebrates, CEH uses nets to collect samples from rivers and lakes. The properties of a net are significant in determining which species it yields. Technically, the net is a data production tool and as such it would normally be described in the corresponding MOLES record. However, several different nets can be used within one dataset or study, so this information must go into the CSML record instead. However, there is currently no suitable place for such information within CSML.

7. Future

It would be desirable to extend EcoGrid to cover all CEH data and also to incorporate data held by the National Biodiversity Network, which focuses on Sites of Special Scientific Interest and thus is complementary to EcoGrid's data.

UK ecologists would like to collaborate with ecologists worldwide. The Knowledge Network for Biocomplexity [16] is trying to solve similar problems and has developed Ecological Metadata Language (EML) to describe ecological data. Creation of EML records should open up CEH's data to a much wider audience.

Environmental data always has a spatial component. The first version of NDG will allow spatial searching of datasets. However, nothing more complex is permitted. The second stage of NDG, which began in late 2005, may introduce improved spatial capabilities with consequent benefits for EcoGrid.
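The "sum data at a higher level" workaround of sections 6.4 and 6.5 amounts to simple aggregation over the finer spatial key. A sketch, where the data values and function name are purely illustrative:

```python
from collections import defaultdict

# (site, quadrat, count) tuples: quadrat-level sublocations of the kind
# that CSML cannot represent directly.
observations = [
    ("site_A", "q1", 3),
    ("site_A", "q2", 5),
    ("site_B", "q1", 2),
]

def sum_to_site(obs):
    """Aggregate counts to site level, discarding the sublocation key."""
    totals = defaultdict(int)
    for site, _quadrat, count in obs:
        totals[site] += count
    return dict(totals)
```

The site-level totals can then be published as an ordinary per-site phenomenon, at the cost of losing the finer subdivision.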
The GridSite Proxy Delegation Service
Andrew McNab, Shiv Kaushal
Department of Physics and Astronomy, University of Manchester
Abstract
X.509 Proxy Certificates may now be delegated from a client to a service using the GridSite/gLite Delegation Web Service. We have implemented both a client and the service itself, and we describe the delegation process in terms of the operations of the delegation portType. We also explain how proxies delegated to a service can be used by other Web Services written as CGI scripts or executables, and how other components of GridSite and Unix account permissions can be combined to control which services can access a delegated proxy credential on a shared server. This work has been done as part of the GridSite and EGEE projects, and the GridSite toolkit now includes functions which third-party developers can use to add support for a delegation portType to their own applications.
1. Introduction

X.509 Proxy Certificates [1] were originally introduced by the Globus Project [2] and have subsequently become central to the operation of international production Grids, such as the LHC Computing Grid [3] and the EGEE [4] project. A proxy certificate allows jobs or agents to prove that they are running on behalf of a particular user, and they are granted to jobs by some form of delegation procedure involving the creation and signing of a short-lived X.509 certificate. This paper describes an X.509 Proxy Certificate delegation web service developed jointly as part of the EGEE and GridSite [5] projects.

2. The delegation protocol

2.1 X.509 delegation

RFC 3820 [1] describes how the certificate requests and proxy certificates (PC) are processed by the service and client. The GridSite delegation service caters to the scenario of a remote web service which requires a PC, and a client which already possesses either a full X.509 end entity certificate (EEC) issued by a Certification Authority, or an X.509 PC previously created by the user.

The delegation procedure involves the following steps:

• The service is contacted by the client, which asks for an X.509 certificate request.
• The service then generates an RSA public and private key pair, stores the private key and uses the public key as the basis of an X.509 certificate request.
• The service may choose to prepopulate the request with values, such as the desired expiration time and the uses to which the proxy may be put.
• The service returns the certificate request to the client, which enforces any restrictions it places on values such as expiration time, and then signs the request using the client's own private key.
• The resulting X.509 PC is then sent to the service, which now possesses the corresponding private key, the PC itself and all of the X.509 PC and EEC which prove the chain of trust back to a trusted Certification Authority.
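The steps above can be sketched as a toy exchange. Real implementations create RSA key pairs and X.509 certificate requests (e.g. via OpenSSL); these are reduced here to placeholder strings, and all class, method and variable names are assumptions for illustration:

```python
import secrets

class ToyDelegationService:
    """Service side of the delegation steps, with placeholder credentials."""

    def __init__(self):
        self.store = {}  # delegation id -> {"key": ..., "proxy": ...}

    def make_cert_request(self):
        # Steps 2-3: generate a "key pair", keep the private half,
        # and hand back a "certificate request" built from the public half.
        delegation_id = secrets.token_hex(8)
        self.store[delegation_id] = {"key": f"rsa-private-{delegation_id}",
                                     "proxy": None}
        return delegation_id, f"cert-request-{delegation_id}"

    def receive_signed_pc(self, delegation_id, signed_pc):
        # Step 5: the service now holds both the private key and the signed PC.
        self.store[delegation_id]["proxy"] = signed_pc

def client_sign(cert_request, eec="CN=example user"):
    """Step 4: the client signs the request with its own credential."""
    return f"proxy-cert({cert_request}) signed-by({eec})"

# Steps 1-5 end to end:
svc = ToyDelegationService()
did, req = svc.make_cert_request()
svc.receive_signed_pc(did, client_sign(req))
```

The essential property mirrored here is that the private key never leaves the service, while the signature is applied by the client.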
Since an X.509 PC is very similar to a conventional EEC, it can be used within most client applications without modification. On the server side, for applications which use SSL/TLS [6] such as HTTPS [7], only the code which checks the client's certificate chain requires modification (to permit chains including PCs in addition to Certification Authority certificates and a final EEC). For Apache/mod_ssl, this modification is done dynamically by GridSite by intercepting the underlying OpenSSL [8] callbacks during the authentication phase. PC support was also added to OpenSSL itself in version 0.9.7g.

2.2 Web Service operations

To allow clients and services to perform this two-stage delegation procedure, a delegation web service specification has been developed within the EGEE Middleware Security Group [9] for use by the GridSite and gLite [10] middleware systems.

This paper describes version 1.1.0 of the service (http://www.gridsite.org/delegation-1.1.0.wsdl), which uses the namespace http://www.gridsite.org/namespaces/delegation-1.

The service consists of a single portType, Delegation, which may either be implemented as a standalone web service or as an additional portType of services which require delegation. In either mode, a Delegation ID string is used to identify the delegation session, which is necessary if the same client wishes to maintain multiple simultaneous delegations to the same server, with different expiration times or other options. This scenario is likely to occur if a user submits multiple jobs to a grid which are unaware of each other but arrive at the same site.

The delegation procedure is carried out using two operations within the portType, getNewProxyReq and putProxy. To initiate a delegation, the client sends an empty getNewProxyReq request, and receives a Delegation ID generated by the service and an X.509 certificate request in Base64-encoded PEM format. After populating and signing the request, the client then completes the procedure using the putProxy operation to send the signed X.509 PC back to the service, and includes the Delegation ID to allow the service to associate the PC with the private key generated in the first phase.

The renewProxyReq operation is also provided, which can be used in place of getNewProxyReq and allows the client to specify an existing Delegation ID and extend the lifetime of its delegation session. In return, the service responds with an X.509 certificate request, and putProxy is used as before.

Clients can enquire about the status of a delegation session with the getTerminationTime operation, which returns the expiration time of the proxy corresponding to the given Delegation ID; and the destroy operation can be used to remove a proxy, including the private key, from the service's store.

3. The GridSite implementation

As mentioned above, GridSite adds support to Apache [11] for the authentication of clients using an X.509 PC with an HTTPS connection. Web services can then be written as CGI scripts or executables, and obtain the authentication information, including the certificate chain used, from the CGI environment variables.

3.1 GridSite Delegation Service

The GridSite Delegation Service uses this environment and is written in C with the aid of standard toolkits. The gSOAP [12] web services toolkit is used to convert between SOAP messages and C structures, and OpenSSL is used to examine X.509 certificates and generate private keys and certificate requests. All the additional utility functions which were required, including the storage of proxies, were written as part of the GridSite library, where they can be used by third-party authors to add a delegation portType to their own web services.
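The session lifecycle driven by the portType operations of section 2.2 can be sketched as an in-memory table keyed by Delegation ID. This is a toy model, not the C/gSOAP implementation, and the method and attribute names are assumptions:

```python
import time

class DelegationSessions:
    """Tracks one proxy per Delegation ID, mirroring the portType operations."""

    def __init__(self):
        self.expiry = {}  # delegation id -> proxy expiration timestamp

    def put_proxy(self, delegation_id, lifetime_seconds):
        # Net effect of getNewProxyReq + putProxy: a stored proxy with a lifetime.
        self.expiry[delegation_id] = time.time() + lifetime_seconds

    def get_termination_time(self, delegation_id):
        # getTerminationTime: expiration time of the proxy for this session.
        return self.expiry[delegation_id]

    def renew(self, delegation_id, lifetime_seconds):
        # Net effect of renewProxyReq + putProxy: extend an existing session.
        if delegation_id not in self.expiry:
            raise KeyError("unknown Delegation ID")
        self.expiry[delegation_id] = time.time() + lifetime_seconds

    def destroy(self, delegation_id):
        # destroy: remove the proxy (and, in the real service, its private key).
        del self.expiry[delegation_id]
```

Keying everything on the Delegation ID is what lets one client hold several simultaneous delegations on the same server with different expiration times.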
3.2 htproxyput client

htproxyput is a command-line client which can be used with the GridSite Delegation Service or other implementations of the portType. The command syntax is similar to the htcp family of commands supplied with the main GridSite distribution, and its default options look for the user's X.509 EEC or X.509 PC in commonly used locations, including the X509_USER_PROXY environment variable, and the "/tmp/x509up_uN" file created by Globus grid-proxy-init or gLite voms-proxy-init.

As with the Delegation Service, most of the functionality is provided by calls to GridSite toolkit functions, supplemented by gSOAP and OpenSSL.

4. Use with CGI services

The implementation of the delegation service has been designed to allow the sharing of proxies with other web services hosted on the same GridSite-enabled server. GridSite allows Web Services for Grids to be written as CGI scripts or executables, with authentication and access control decisions performed within Apache by the mod_gridsite module, which also makes the values of X.509 PC and EEC available, plus any VOMS X.509 Attribute Certificates. GridSite also provides a setuid wrapper, based on Apache's suexec, which allows Unix accounts and file permissions to control what areas of the hosting server are accessible to each instance of a service.

4.1 Proxy Storage

The delegation service must store the private key when it is generated and the signed X.509 PC when it is returned by the client. Both filesystem and SQL database stores have been specified. GridSite implements file-based storage, which interoperates with GridSite's use of Unix file permissions to sandbox other services and sessions owned by different users.

All of the proxy files are stored in subdirectories of a proxycache directory, which is a sibling of the DocumentRoot directory specified in the Apache virtual server configuration; for example, "/var/www/proxycache" if "/var/www/html" is the DocumentRoot. This allows multiple virtual servers, with different DNS components in their URL, to be hosted on the same physical server, without clashes between Delegation IDs which are only unique within each DNS domain name.

Within the proxycache directory, the cache subdirectory contains private keys awaiting the upload of the corresponding X.509 PC, organised into directories named after the URL-encoded Distinguished Name of the X.509 EEC, with one subdirectory for each Delegation ID. For example:
"/var/www/proxycache/cache/%2FC%3DUK%2FO%3DeScience%2FOU%3DManchester%2FL%3DHEP%2FCN%3Dandrew%20mcnab/acd5ab55a6c9473e/"

Once the signed X.509 PC is returned by the client, the PEM-encoded proxy chain, including the EEC and PC and private key, is saved as userproxy.pem within a subdirectory named after the Delegation ID, which is itself a subdirectory of the user's URL-encoded EEC Distinguished Name. For example:
"/var/www/proxycache/%2FC%3DUK%2FO%3DeScience%2FOU%3DManchester%2FL%3DHEP%2FCN%3Dandrew%20mcnab/acd5ab55a6c9473e/"

4.2 Interaction with gsexec

GridSite's gsexec wrapper allows CGI scripts or executables to run as users other than the apache user, which typically owns the listening Apache processes and most of their files. The server can be configured to assign a pool account to each client X.509 EEC identity, or to each directory containing one or more services, or a mixture of both on a per-directory basis. These modes allow the sandboxing of instances of a service associated with different clients, or of sets of services owned by different application groups.

Since the delegation service stores the delegated proxy chain and private key in the Unix filesystem, ownership and permissions can be used to control access to the proxy in a transparent way.

Let us assume that Apache runs as user apache and group apache; that each pool user has both a user and its own group; that the apache user has secondary membership of each pool user's private group; and that files are created with owner and group having read/write permission.

This arrangement lets Apache write to files, if it receives an authorized PUT request, and allows CGI scripts to run as a given pool user and write files, but prevents them from writing to files owned by other pool users.

If the GridSite Delegation Service is then run as apache, it has the ability to create proxy files which are only readable by the pool user corresponding to the correct EEC identity, by virtue of each file's group-readable permission and the pool user's private group.

5. Summary

We have implemented both a client and a service for the GridSite/gLite Web Service for the delegation of X.509 Proxy Certificates. This work has been done as part of the GridSite project, and the GridSite toolkit now includes functions which third-party developers can use to add support for a delegation portType to their own applications.

Acknowledgements

This work has been funded by the UK's Particle Physics and Astronomy Research Council, through its e-Science Studentship programme and support for the GridPP collaboration.

The GridSite/gLite delegation web service is the result of discussions within the EGEE Middleware Security Group, and we would like to thank Akos Frohner, Joni Hahkala, Olle Mulmo and Ricardo Rocha for their work on finalising the specification.

References

2. The Globus Project, http://www.globus.org/
3. LHC Computing Grid, http://www.cern.ch/lcg
4. Enabling Grids for E-sciencE, http://www.eu-egee.org/
5. The GridSite Project, http://www.gridsite.org/
6. RFC 2246, "The TLS Protocol", T. Dierks et al., January 1999.
7. RFC 2818, "HTTP over TLS", E. Rescorla, May 2000.
8. The OpenSSL Project, http://www.openssl.org/
9. The EGEE Middleware Security Group, http://egee-jra3.web.cern.ch/egee-jra3/
10. gLite, http://glite.web.cern.ch/glite/
11. The Apache Webserver, http://httpd.apache.org/
12. gSOAP, http://gsoap2.sourceforge.net/
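The URL-encoded directory names of the GridSite proxy storage scheme (section 4.1 above) can be reproduced with standard percent-encoding. In this sketch the helper name is an assumption; the DN and Delegation ID are the paper's own example values:

```python
from urllib.parse import quote

def proxy_dir(proxycache, dn, delegation_id, pending=False):
    """Directory for a delegated credential: the EEC Distinguished Name is
    percent-encoded (including '/' and '='), then the Delegation ID is
    appended. Keys still awaiting their signed PC live under the extra
    'cache' level."""
    parts = [proxycache]
    if pending:
        parts.append("cache")
    parts += [quote(dn, safe=""), delegation_id]
    return "/".join(parts) + "/"

path = proxy_dir("/var/www/proxycache",
                 "/C=UK/O=eScience/OU=Manchester/L=HEP/CN=andrew mcnab",
                 "acd5ab55a6c9473e")
# -> "/var/www/proxycache/%2FC%3DUK%2FO%3DeScience%2FOU%3DManchester%2FL%3DHEP%2FCN%3Dandrew%20mcnab/acd5ab55a6c9473e/"
```

Encoding '/' and '=' as %2F and %3D flattens the whole DN into a single safe directory name, which is why the on-disk paths look the way they do.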
Abstract
Grid portals that can provide uniform access to underlying grid services and resources are emerging. Currently, there are a variety of technologies and toolkits that can be employed for grid portal development. In order to provide a guideline for grid portal developers choosing a suitable toolkit, in this paper we briefly classify grid portals into non-portlet-based and portlet-based, and survey the major tools and technologies that can be employed to facilitate the creation of grid portals.
cooperation with Grid technologies. CoG is often used to build Grid portals [11]. It includes Perl CoG, Python CoG and Java CoG.

3.2 GPDK

GPDK was a widely used toolkit for building non-portlet-based portals. GPDK itself was built on top of the Java CoG, based on a standard 3-tier architecture [2]. GPDK was a successful product: application-specific portals were created with it by many research groups, such as the GridGaussian computational chemistry portal, the MIMD Lattice Computation (MILC) portal, etc. [2]. However, GPDK tightly coupled presentation with logic in its design and did not conform to any portal/portlet specifications. This can result in poor extensibility and scalability. GPDK is no longer supported, and its major author has moved to GridSphere.

3.3 GridSphere

The development of GridSphere has combined the lessons learned in the development of the ASC portal and GPDK. The GridSphere Portal Framework [8] was developed as a key part of the European project GridLab [15]. It provides an open-source portlet-based Web portal, and enables developers to quickly develop and package third-party portlet web applications that can be run and administered within the GridSphere portlet container. One of the key elements of GridSphere is that it allows administrators and individual users to dynamically configure the content based on their requirements [6]. Another distinct feature is that GridSphere itself provides grid-specific portlets and APIs for grid-enabled portal development. The main disadvantage of the current version of GridSphere (i.e. GridSphere 2.1) is that it does not support WSRP.

3.4 GridPort

The GridPort Toolkit [9] enables the rapid development of highly functional grid portals. It comprises a set of portlet interfaces and services that provide access to a wide range of backend grid and information services. GridPort 4.0.1 was developed by employing GridSphere. GridPort 4.0.1 might be the last release, as the GridPort team has recently decided to shift its focus from developing a portal toolkit to developing production portals [9].

WSRP, Single Sign On (SSO). It is open-source, and 100% JSR portlet API and WSRP compliant. Liferay is suitable for enterprise portal development. Institutions and companies that adopted Liferay to create their portals include EducaMadrid, Goodwill, Jason's Deli, Oakwood, Walden Media, etc. [16].

3.6 eXo platform

The eXo platform [18] can be regarded as a portal and CMS [17]. The eXo platform 1 was more like a portal framework. The eXo platform 2 proposed a Product Line Strategy [19], as it was realised that end-user customers need ready-to-use packaged solutions instead of a monolithic product. The eXo platform 2 is now a core part on which an extensive product line can be built [19]. Notably, it adopts JavaServer Faces and the released Java Content Repository (JCR – JSR 170) specification. The eXo platform 2 adopts JSR-168 and supports WSRP.

3.7 Stringbeans

Stringbeans [20] is a platform for building enterprise information portals. The platform is composed of three components: (i) a portal container/server, (ii) a Web Services platform, and (iii) a process automation engine. At this time the portal server and Web Services platform have been released; the process automation engine will be released in the near future [20]. Stringbeans is JSR-168 compliant and supports WSRP. Stringbeans was used for the UK National Grid Service (NGS) portal [21].

3.8 uPortal

uPortal [22] is a framework for producing campus portals. It is now being developed by JA-SIG (the Java Special Interest Group). It is JSR-168 compliant and supports WSRP. uPortal is now widely used in creating university portals, for example the Bristol University staff portal.

3.9 OGCE (Open Grid Computing Environment)

The OGCE (Open Grid Computing Environment) project was established to foster collaborations and sharable components with portal developers [23]. OGCE consists of a core set of grid portlets that are JSR 168 compatible. Currently OGCE 2 supports the GridSphere and uPortal portlet containers.
Figure 3: Workflows with R integration for BAIR. ProprocessAffy, ttest and FDR are nodes built with the General-R
framework, while Expert-R is the integrated interactive R environment and ViewExpert-R displays the figures created.
Other nodes, such as Boxplot, Filter, Derive and Join, are Knowledge Discovery Network nodes.
Abstract
Cardiff University’s Condor pool aims to cater for the high throughput computing needs of a wide
range of users based in various schools across campus. The paper begins by discussing the
background of Cardiff University’s Condor pool. The paper then presents a selection of case studies
and outlines our strategy for Condorising applications using submit script generators. Finally the
paper presents the findings of a fEC exercise and outlines our policy for attributing directly
allocated costs to a particular research project.
2.4 Condorising Applications Using Submit Script Generators

The submit script generator allows us to rapidly Condorise applications concerned with data processing. We will use the Dammin and Gasbor applications to aid our discussion. Dammin and Gasbor can be used to build models of proteins using simulated annealing. The input files for Dammin and Gasbor are produced by an application called Gnom, which is used to filter the data captured by the experimental apparatus that bombards a sample of the protein under investigation with X-rays.

Dammin and Gasbor were originally designed as interactive applications, asking the user a number of questions before processing the output from Gnom and generating a model of the protein that can be visualized. Dammin and Gasbor can also operate in batch processing mode by providing an answer file containing the answers the user would otherwise have had to supply during interactive mode.

When running Dammin, Donna would typically accept all the default answers to the questions Dammin asked except the name of the Gnom file, the name of the log file (used to log any errors), and the name of the project identifier (used to name the output file).

Using the submit script generator, Donna does not have to answer a single question: the generator simply looks in the input directory for the Gnom file and generates twenty answer files containing the name of the Gnom file, the name of the log file in the format file0.log to file19.log, and the name of the project identifier in the format file0 to file19. The submit script generator then generates twenty batch files calling Dammin with one of the twenty answer files.

The submit script generator then builds the Condor submit script itself, which in turn transfers copies of the Dammin binary, the Gnom file, the answer file, and the batch file to each execute node and tells Condor where to transfer the output files to on the submit node. The script generator is also capable of processing multiple Gnom files in the input directory.

When running Gasbor, Donna would typically accept all the default answers to the questions Gasbor asked except the name of the Gnom file, the name of the log file, the project identifier, and the number of residues in the asymmetric part.

Using the submit script generator, Donna only has to answer one question, the number of residues in the asymmetric part. The Gasbor submit script generator works in the same way as the Dammin submit script generator.

3. Full Economic Costing

A full economic costing of Cardiff University's Condor pool was conducted in line with the various higher education funding councils' requirements for full economic costing of research projects.

3.1 Terminology

The full economic costing (fEC) model divides costs into two types, indirect costs and direct costs. The model further divides direct costs into two subtypes, directly incurred costs and directly allocated costs [7].

Indirect costs are those costs incurred that are not directly related to any one project but that are necessary to support a given project. Directly incurred costs are those costs incurred for equipment or services related to a single project. Directly allocated costs are those costs incurred for equipment or services shared by a number of projects.

3.2 Indirect Costs

The indirect cost of the university's Condor pool is the cost of equipment, power, and staff required to provide execute nodes, submit nodes, and networking, which are already provided by Information Services as part of the university's overall costs in providing an open access workstation service for students to support learning and teaching.

The cost of the open access workstation service includes: the initial purchase of machines with three-year warranties, updated on a four-year rolling cycle; the cost of power consumed; the cost of support staff required to maintain and update the machines; and an element of cost for networking and central data storage.

The indirect cost of the university's Condor pool is recovered via the indirect charge in £ per FTE per year added to the full economic costing of every research project.

3.3 Direct Costs

The direct cost of the university's Condor pool is the cost of equipment, power, and staff required to provide the central manager and Condor support services beyond those provided by Information Services as part of the open access workstation service.

The direct equipment cost is the cost of the central manager, which includes: the initial purchase of the machine with a three-year warranty
updated on a four-year rolling cycle, as well as an annual racking fee that includes: rack space, uninterruptible power supply, air conditioning, and network connection. The cost of the central manager is £1,560 pa.

The direct power cost is significantly less than the cost of running the pool at maximum capacity, which, based on current market prices, would be £49,640 pa.

The direct staff cost is currently £35,000 pa, which includes one member of full-time staff and an element of cost for administration.

Currently the direct cost of the university's Condor pool is met by Information Services and the Welsh e-Science Centre. In the future the direct cost of the university's Condor pool will become a directly allocated cost shared between the research projects using the Condor service. We briefly discuss how to attribute directly allocated costs to a particular research project in the next subsection.

3.4 Attributing Directly Allocated Costs to Particular Research Projects

Directly allocated costs can be attributed to a particular research project using the accounting information collected by the central manager. A value for directly allocated costs can be calculated by dividing the maximum directly allocated costs of the Condor pool by the maximum number of Condor hours available, which gives us a cost of 1.5 pence per Condor hour.

Calculating the directly allocated equipment and staff costs is trivial; however, calculating the power cost is a little more involved. To do this we measured the power consumed by a number of different open access workstations of various specifications.

We measured the power consumption of each sample machine in three different states for a total of fifteen minutes. A sample machine in the first state, IDLE, was simply turned on with nobody logged in. A sample machine in the second state, MAX CPU, was running a Condor job that called the CPUSoak program provided with the Condor toolkit. A sample machine in the third state, MAX DISK, was busy copying and deleting an ISO file over and over.

We then calculated the combined additional power consumption of each sample machine whilst running at MAX CPU and MAX DISK. We then used census information to calculate the cost of running the pool for an hour and divided that by the number of machines in the pool, giving us an average cost of 0.5 pence per Condor hour.

We recommend that future research projects that wish to use the university's Condor pool try to estimate, with our assistance, the number of Condor hours needed to satisfy their computational requirements, in order that directly allocated costs can be included and subsequently recovered via the fEC of their own research projects.

Our internal charging policy is to charge 1.5 pence per Condor hour, whereas our external charging policy is to charge 2.0 pence per Condor hour.

Acknowledgements

We are grateful to those users who provided case studies: Steffan Adams, Kevin Ashelford, Tim Bray, Mary Chin, Donna Lammie, Geraint Lewis, David Walker, and Tim Wess. We are also grateful to Information Services and the Welsh e-Science Centre for funding this research.

References

[1] M. Litzkow, M. Livny, and M. Mutka. Condor – A Hunter of Idle Workstations. In Proceedings of the 8th IEEE International Conference on Distributed Computing Systems (ICDCS8), pages 104-111, San Jose, California, USA, June 1988. IEEE.
[2] M. Chin, G. Lewis, and J. Giddy. Implementation of BEAMnrc Monte Carlo Simulations on the Grid. In Proceedings of the 14th International Conference on the Use of Computers in Radiation Therapy (ICCR2004), Seoul, Korea, May 2004.
[3] J. K. Pritchard, M. Stephens, and P. Donnelly. Inference of Population Structure Using Multilocus Genotype Data. Genetics, 155(2):945-959, 2000.
[4] G. Evanno, S. Regnaut, and J. Goudet. Detecting the Number of Clusters of Individuals Using the Software Structure: a Simulation Study. Molecular Ecology, 14:2611-2620, 2005.
[5] D. I. Svergun. Restoring Low Resolution Structure of Biological Macromolecules from Solution Scattering Using Simulated Annealing. Biophysical Journal, 76:2879-2886, 1999.
[6] D. I. Svergun, M. V. Petoukhov, and M. H. J. Koch. Determination of Domain Structure of Proteins from X-Ray Solution Scattering. Biophysical Journal, 80:2946-2953, 2001.
[7] The Joint Costing and Pricing Steering Group. http://www.jcpsg.ac.uk/, July 2006.
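The cost attribution described in Sections 3.3 and 3.4 is simple arithmetic and can be sketched as a check. The £1,560 pa equipment cost, £35,000 pa staff cost and 0.5p per hour power cost are the paper's figures; the available-Condor-hours value below is a hypothetical illustration (the paper does not state it), chosen only to show how a rate of about 1.5 pence per Condor hour can arise.

```python
def cost_per_condor_hour(equipment_gbp_pa: float,
                         staff_gbp_pa: float,
                         power_pence_per_hour: float,
                         condor_hours_pa: float) -> float:
    """Directly allocated cost in pence per Condor hour: annual
    equipment and staff costs spread over the available Condor
    hours, plus the measured average power cost per hour."""
    fixed_pence = (equipment_gbp_pa + staff_gbp_pa) * 100.0
    return fixed_pence / condor_hours_pa + power_pence_per_hour

# Illustrative only: £1,560 pa (central manager), £35,000 pa (staff)
# and 0.5p/hour (power) are from the paper; the 3,656,000 available
# Condor hours per year is a made-up value, not taken from the paper.
rate = cost_per_condor_hour(1560, 35000, 0.5, 3_656_000)  # 1.5 pence
```

Changing the assumed pool hours changes the rate proportionally for the fixed part, which is why the paper recommends estimating a project's Condor hours up front.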
Abstract
One of the key issues in supporting e-Science is managing data in distributed, flexible, scalable and autonomous ways.
This paper proposes an architecture based on Peer-to-Peer systems that can be used to facilitate large-scale distributed data
management. It identifies the important problems that need to be addressed and presents possible solutions, along with their
expected advantages. These ideas, we believe, are helpful in extending the discussion of alternative approaches of supporting
the multiple facets of e-Science.
terms at different peers (e.g., Peers 1 and i in Figure 2a). Finally, the same remote term may be served by multiple peers (e.g., for Peers 1 and n in Figure 2a). This gives us substantial flexibility in identifying relevant peers or even choosing between alternative peers serving related information.

The questions that arise have to do with forming and maintaining routing tables: (i) How much information is exchanged whenever a new peer joins the network? Alternatives include the new peer "downloading" the existing peer's entire routing table, or a part of it. (ii) How is the routing table updated as peers join and leave the network? One option is to "piggy-back" the routing table updates when accessing remote peers for the purposes of querying. Another option is to have a maintenance protocol executed periodically. (iii) How can the routing table be further utilised during query evaluation? In particular, can the maintenance protocol provide performance guarantees about the connections stored in the routing table (e.g., latency, probability of the connection being up to date, etc.)? Decentralised indexing allows for a great deal of research to be undertaken in the area.

3.2 Security

Data management means that, in addition to serving, peers manage access to the served data. The question that emerges is one of security: how can peers control which peers access their data? One solution is for each peer to request each other peer to register with it. However, this solution will not scale: it goes against the idea of complete decentralisation (as it means that each peer is aware of all peers in the system), while it also inhibits the maintenance protocols described earlier (as even the slightest local changes need to be globally transmitted). Additionally, we would like all forbidden requests to fail fast, i.e., to fail as soon as possible – ideally at the originating peer. This means that each peer not only keeps track of who has access to its data, it also maintains information about what data it has access to.

The solution we propose is based on XML security views [8]. Each peer exports different views of its data to different peers, depending on the peer it is communicating with. An example is shown in Figure 2b, where Peer k exports different views to Peers i and j. The system adheres to the fail fast principle: since neither Peer i nor Peer j are aware of the data they cannot access, they cannot request it.

3.3 Query Reformulation

Query reformulation can be thought of as the query compilation step in a decentralised system. During query reformulation the system executes a resource discovery protocol, which aids in: (i) identifying peers that may contain relevant terms; (ii) translating the query to terms that are understandable by other peers; and (iii) ensuring that each requesting peer has access to a particular term by consulting security policies.

The three steps mentioned above are iteratively executed. For instance, in Figure 2a, if a query about term X is received, the local peer knows that in addition to accessing its local data, it needs to forward an appropriate query to Peer 1. The query is formulated by translating term X in the query to term A for Peer 1 to be able to handle it. In addition, the same procedure can be undertaken once the reformulated query reaches Peer 1 and for term A. The outcome of this process is a path (P1, Q1)/(P2, Q2)/.../(Pn, Qn) with the semantics that at each peer Pi the corresponding reformulated query Qi should be evaluated.

The path may indeed contain duplicate peers, i.e., the same peer may have to be visited multiple times in processing a query. The most efficient way of accessing such peers is an issue of query optimisation and evaluation and is the purpose of the query router module.

3.4 Query Evaluation

An integral part of query evaluation is query optimisation. Already a hard problem in centralised environments, the situation is aggravated in decentralised ones, as the likelihood of optimisation-time assumptions holding during evaluation-time is even lower. We propose decomposing queries into a query algebra and reducing query evaluation to a routing problem. We shall use two basic routing/evaluation strategies, shown in Figures 2c and 2d. The first is parallel evaluation: data requests are forwarded to peers, which then upload their data to the requesting peer that locally evaluates the query. The alternative is serial evaluation: a route is established and all peers are visited serially until the complete result is produced.

The two approaches can be better explained through an example. Consider a relational query of the form R1.a1 = R2.a2 = ... = Rn.an posed at peer Pk, where each Ri resides on a different peer Pi of the system. The parallel strategy would send requests for each Ri to be transmitted to Pk and for Pk to locally evaluate the query. The serial strategy instructs that the query be sent to P1, the peer responsible for R1, which then rewrites the query¹ and forwards it to P2, the next peer in the sequence. The process is repeated until Pn is reached, which then sends the result to the originating peer Pk.

Note that the original query can be rewritten in many ways. In the previous example the original query can be decomposed into numerous blocks²; each block can be evaluated in parallel or serially. Each peer makes local routing, and, hence, optimisation decisions. Finally, note that a single peer may need to be visited multiple times. In such cases, care needs to be taken so that the number of times a single peer is visited is minimised.

To summarise, we envision query evaluation through query routing to be a continuous cycle of (i) mapping the query to a query algebra and forming query blocks, (ii) performing local optimisation to identify whether parallel or serial evaluation is more beneficial for a particular block, (iii) forwarding a query block to the peers storing data relevant to the block, and (iv) rewriting the query to

¹ A simple, but most likely inefficient, rewrite would be to substitute values in R1.a1 with constants.
² Σ_{i=1}^{n} C(n,i)·i! = Σ_{i=1}^{n} n!/(n−i)!, to be exact.
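The parallel and serial strategies of Section 3.4 can be contrasted with a toy sketch. Peers are modelled as plain Python sets of join-attribute values and the equi-join query R1.a1 = ... = Rn.an as set intersection; this illustrates only the routing idea, not the authors' query algebra or router module.

```python
# Toy model: each peer holds one relation as a set of join-attribute
# values; the query asks for values common to all relations
# (R1.a1 = R2.a2 = ... = Rn.an).

def parallel_eval(peers):
    """Parallel strategy: every peer ships its full relation to the
    requesting peer Pk, which evaluates the whole query locally."""
    shipped = [set(p) for p in peers]      # all data travels to Pk
    result = shipped[0]
    for relation in shipped[1:]:
        result &= relation
    return result

def serial_eval(peers):
    """Serial strategy: the partially evaluated query visits peers in
    sequence; each peer restricts the candidate set (a crude stand-in
    for "rewriting the query") before forwarding it, so only the
    shrinking intermediate result travels along the route."""
    candidates = set(peers[0])
    for peer in peers[1:]:
        candidates = {v for v in candidates if v in peer}
    return candidates

relations = [{1, 2, 3, 4}, {2, 3, 4}, {3, 4, 5}]
# Both strategies agree on the answer ({3, 4} here); they differ in
# what moves across the network.
```

The trade-off the text describes is visible in the sketch: the parallel route moves every relation to Pk, while the serial route moves only the shrinking candidate set from peer to peer.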
Abstract
The past few years have seen the Grid rapidly evolving towards a service-oriented computing
infrastructure. With the OGSA facilitating this evolution, it is expected that WSRF will act as
the main enabling technology to drive the Grid further. Resource monitoring plays a critical role
in managing a large-scale Grid system. This paper presents GREMO, a lightweight resource
monitor developed with Globus Toolkit 4 (GT4) for monitoring the CPU and memory of computing
nodes in Windows and Linux environments.
2. WSRF and WSN

WSRF is a set of specifications that specify how to make Web services stateful, amongst other aspects. The problem of where to store state led to the introduction of the concept of 'resources' (WS-Resource): a persistent memory for services which may reside in memory, on hard disk or even in a database. Each resource uses a unique address termed an 'endpoint reference' to isolate resources from services, enabling other services to use them directly without having to go through the parent service. Alongside this, a number of other useful functions were also introduced:

• WS-ResourceLifetime – manages resources by setting a lifetime.
• WS-ResourceProperties (RPs) – many elements to a resource, similar to an object.
• WS-ServiceGroup – enables grouping certain services to aid searching for them.
• WS-BaseFaults – returns error exceptions that may be produced by a WS-Resource.
• WS-Addressing – an actual address given to services and resources rather than a URL, enabling one to use resources or Web services independently.

An authority can then use their own (client) EndPoint Reference (EPR) to become a subscriber, and, given the qName from which to receive notifications, registration can be set, where a listener client will act on received notifications. Once a notification is received, the EPR and the RP's qName (from the sender), along with its new value, can be extracted from the message and then processed. The addition of this new functionality also means that the WSDL [9] documents need to show the description of an RP, straying away from the standard format. List 1 shows the additional code needed for the WSRF standard.

<portType name="RegPortType"
    wsdlpp:extends="wsrpw:GetResourceProperty
                    wsntw:NotificationProducer"
    wsrp:ResourceProperties="tns:RegResourceProperties">
  <operation name="cname">
    <input message="tns:CnameInputMessage"/>
    <output message="tns:CnameOutputMessage"/>
  </operation>
</portType>

List 1: WS-Resource property definition in WSDL.

Note that the wsrp:ResourceProperties attribute of the portType element specifies what the service's resource properties are. The resource properties must be declared as a type where the monitoring state information is kept. The 'wsdlpp:extends' attribute allows the use of predefined port types; in this example we have used both the 'get resource' and 'send notification' port types, with bindings automatically created by GT4, so there is no need to specify them in the WSDL code.

3. GREMO Architecture

Figure 1 shows the GREMO architecture; the following sections describe each GREMO component.

3.1 Client Service

This service runs locally; its main function is to provide RPs that represent the local resources being monitored, along with methods to modify them. The notification system is used to alert the subscriber of any changes in these RPs. In this case the service only has to send out notifications (requiring a web container). Each client having its own service means they have their own set of RPs. This approach means that there is not a single service (server) having to be constantly updated for each user, which would give rise to unnecessary instance and processing problems on the receiving side; rather, the client(s) send the notifications. The server simply processes notifications to reduce strain and
improve flow by not having to call and wait for a result/response, not to mention the delay incurred when a client gets disconnected abruptly.

3.2 Server Service

Similar to the client service, the Server Service provides RPs that represent the registration information of clients, with notifications sent when a new user registers. Every time the service is invoked a new set of RPs is created (service instances). Each client has its own set of RPs; differing from the Client Service, here one main service processes all the registration information. Since registration is less periodic than the actual monitoring, this seems the most economic way to create this registry service.

3.3 Client Side Monitoring

Client registration must take place first, given the EPR of the Server Service. The client then invokes the Server Service, modifying its RPs and allowing the server monitor to use this information to subscribe to the monitoring RPs managed by the client. In a similar fashion, the client monitor modifies its Client Service's RPs in accordance with the local monitored resources at given set intervals. Both Windows and Linux environments are accommodated to ensure cross-platform functionality. Since the information being monitored is CPU and memory, it cannot be directly accessed through a Java Virtual Machine (JVM); hence code native to the OS is needed. In the case of Windows, pre-compiled 'C' dll files were used along with the Java Native Interface (JNI) to access the monitoring data, whereas in the case of Linux such values have to be calculated using the /proc/ virtual file system.

their resources, using their IP addresses as an index for the buffers that are implemented using multiple vectors.

3.5 Storing Monitoring Data

A mySQL table is used as persistent storage for the monitoring data, as it would be inefficient to keep over 200 values (integers in memory) for each user when users can potentially run into the thousands. From keeping the most recent 100 values (each for CPU and memory) a relative history can be produced along with other useful information, but this is only temporary, since when a user logs off their entry is removed from the table.

Used as a buffer for the Server Monitor, information in this table uses the IP address as a primary key index. Essential in acting as a temporary buffer is the ability for data to continuously loop around the set 100 fields given for each monitored type. This means keeping a counter for each monitored resource, checking if the counter has reached 100 (and consequently wrapping back around to 1), before executing a mySQL update.

3.6 Web Interface

HTML Web pages are used for both registering and monitoring clients. In this case an HTML form is used as an interface to a Java Servlet, which in turn uses the Client registration class and Client monitoring class to start monitoring, with a DHTML page to update the client on the resources they were monitoring. Using similar code as in the server side monitor, an applet version using the mySQL connector was created, allowing a remote administrator to view the current usage of the monitoring service.
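The wrap-around buffering described for the monitoring data can be sketched in memory, leaving out the mySQL table. The fixed window of 100 fields and the counter that wraps back to 1 after reaching 100 come from the text; the class and method names are illustrative, not GREMO's actual code.

```python
class MonitorBuffer:
    """Keep only the most recent `size` samples per monitored resource,
    overwriting the oldest once the counter wraps (as in the paper,
    the counter wraps back to 1 after reaching the limit)."""

    def __init__(self, size: int = 100):
        self.size = size
        self.slots = {}     # resource name -> fixed list of `size` fields
        self.counter = {}   # resource name -> next 1-based slot to write

    def record(self, resource: str, value: int) -> None:
        slot = self.counter.get(resource, 1)
        fields = self.slots.setdefault(resource, [None] * self.size)
        fields[slot - 1] = value           # 1-based slot, 0-based list
        self.counter[resource] = 1 if slot == self.size else slot + 1

# With a window of 3, the fourth sample overwrites slot 1.
buf = MonitorBuffer(size=3)
for v in [10, 20, 30, 40]:
    buf.record("cpu", v)
# buf.slots["cpu"] is now [40, 20, 30]
```

In GREMO the same wrap-around would be realised as a mySQL UPDATE against the row keyed by the client's IP address rather than an in-memory list.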
4. GREMO Performance

GREMO is implemented with GT4 on both Windows and Linux platforms. Figure 2 shows a snapshot of GREMO. Ideally GREMO could handle many hundreds of subscribed users, but in practice there is usually a limiting factor: GREMO has to process a large number of SOAP notification messages. We performed a number of experimental tests using two Pentium 4 workstations running Windows XP, both with 512Mb of RAM and running the Globus 4.0.0 container, Apache Tomcat 5.0.28 and mySQL 4.1.15 (server monitor only). Tests were carried out with a client sending multiple notifications in bursts of 10-500, with the resulting average delays, lags and processing times recorded, shown in Figure 3 (y-axis: overhead in seconds).

Having defined a basic structure, the possible uses are widespread. As far as performance goes, we have been using the GT4 notification system, which uses Apache Axis. Since at any one time a maximum of two integers are sent, the overhead data-wise seems excessive; however, studies have shown that the actual conversion to and from ASCII data is very time consuming, leading to performance decreases when the amount of data is increased [11]. Keeping this to a minimal level is beneficial not only in processing time but also when network traffic is heavy. Time stamping of data to overcome any loss in accuracy of history building still has to be implemented.

Taking into account the results shown, the current implementation of GREMO would be limited to around 40 users. The results also show a steady increase in lag and delay whilst processing delay remains constant, suggesting that the GT4 container was processing these notifications at
3 Tools and Evaluation

In this section of the poster presentation, we introduce four tools that target similar issues of authentication and authorization. These tools are PERMIS, OASIS, Shibboleth and XACML. We will present these tools and their features individually in this section. In the subsequent section we will provide a detailed evaluation of these tools in conjunction with our requirements.

3.1 PERMIS – PrivilEge and Role Management Infrastructure Standards Validation

PERMIS was funded by the Information Society Initiative in Standardization and developed by Salford University. PERMIS implements a typical role-based access control (RBAC) model and aims to provide a solution for managing user credentials when accessing target resources. It is used in electronic transactions in governments, universities or businesses for solving the "authentication of the personal identity of the parties involved" and the "determination of the roles, status, privileges, or other socio-economic attributes" of the individual, through the use of X.509 attribute certificates (ACs) [2]. ACs are widely used to store users' roles and XML-based authorization policies for access control decisions in PERMIS. They are digitally signed by the issuers and stored in public repositories. Chadwick & Otenko [3] have summarized the basic features of PERMIS, including being a mechanism for identifying users, specifying policies to govern the actions users can perform, and making access control decisions based on policy checking.

In short, PERMIS is an RBAC system which bases all access control decisions on users' roles and policies. Roles and policies are respectively stored in X.509 ACs, which are then protected by digital signature and kept in the public repository. In the PERMIS architecture, a user makes an access request via an application gateway. In the application gateway, the Access Control Enforcement Function (AEF) unit authenticates the user and asks the Access Control Decision Function (ADF) unit if the user is permitted to perform the requested action on the target service provider. The ADF accesses LDAP (Lightweight Directory Access Protocol) directories to retrieve the policy and role ACs, and then makes a granted or denied decision. PERMIS defines its own policy grammar. The typical components that PERMIS XML

domains for the purpose of management. The aim of OASIS is to provide a standard mechanism for users and services in a distributed environment to interwork securely with access control policies enforced. The OASIS system is based on Role Membership Certificates (RMC) issued to the users and Credential Records (CR) stored on the servers. One focus of OASIS is dynamic role activation. For example, in order for a user to possess a role, there may be a set of role activation conditions that must be satisfied. These conditions may include requirements for prerequisite roles, and any other constraints. Roles can be activated or deactivated dynamically as situations arise. OASIS uses appointment certificates (ACs) for associating privileges persistently with membership of some roles and for handling delegation of rights between users. These certificates can be issued to users from some roles with particular functions, and they can serve as a form of credential to satisfy the role activation conditions for a user to activate one or more other roles. Some other key differences between OASIS and other typical RBAC schemes are summarized by Bacon et al. [4] (Sandhu et al. [7] proposed a number of RBAC models).

3.3 Shibboleth – Federated Identity

Shibboleth is a project developed at Internet2/MACE [9]. It is an identity management (user-attribute based) system designed to provide federated identity, and it aims to meet the needs of the higher education and research communities to share secured online services and access restricted digital content. It focuses on providing a way for a user using a web browser to be authorized to access a target site using information held at the user's security domain. It also aims to allow users to access controlled information securely from anywhere without the need for an additional authentication process. "Shibboleth is developing architectures, policy structures, practical technologies and an open source implementation to support inter-institutional sharing of web resources subject to access control" [8]. The design of Shibboleth was based on a few key concepts:

a. Federated Administration.
b. Access Control Based on Attributes.
c. Active Management of Privacy.
d. Standards Based.
e. A Framework for Multiple, Scalable Trust and Policy Sets (Federations).
f. Has defined a standard set of attributes.
policy comprises are defined by Bacon et al [4],
Chadwick & Otenko [5] [3].
Shibboleth system includes two main components:
Identity Provider (IdP) and Service Provider (SP)
3.2 OASIS - An Open, role-based, Access
[8][10][11]. IdP is associated with the origin site. SP
control architecture for Secure Interworking
is associated with the target site. These two
Services
components are usually deployed separately but they
OASIS is developed by Opera research group in work together to provide secure access to web based
Cambridge Computer Lab [6] and is a RBAC system resource. Shibboleth aims to exchange user attributes
for open, interworking services in a distributed securely across domains for authorization purpose and
environment, with services being grouped into PKI is the foundation to build the predetermined trust
374
Proceedings of the UK e-Science All Hands Meeting 2006, © NeSC 2006, ISBN 0-9553988-0-0
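The OASIS-style dynamic role activation described in section 3.2 can be sketched as follows. This is a minimal sketch, not the OASIS implementation: the role names, the prerequisite-only activation conditions, and the cascading deactivation policy are all illustrative assumptions.

```python
# Sketch of OASIS-style dynamic role activation (illustrative only).
# A role may carry activation conditions, e.g. prerequisite roles that
# the user must already hold; roles are activated and deactivated
# dynamically as conditions change.

class RoleManager:
    def __init__(self, activation_conditions):
        # activation_conditions: role -> set of prerequisite roles
        self.conditions = activation_conditions
        self.active = set()

    def activate(self, role):
        prereqs = self.conditions.get(role, set())
        if prereqs <= self.active:          # all prerequisites held?
            self.active.add(role)
            return True
        return False

    def deactivate(self, role):
        # Deactivating a role also drops any active role that depended on it.
        self.active.discard(role)
        for r in list(self.active):
            if role in self.conditions.get(r, set()):
                self.deactivate(r)

# Hypothetical roles: 'approver' requires the 'engineer' role first.
rm = RoleManager({"approver": {"engineer"}})
assert rm.activate("approver") is False     # prerequisite not yet held
assert rm.activate("engineer") is True
assert rm.activate("approver") is True
rm.deactivate("engineer")                   # cascades to 'approver'
assert rm.active == set()
```

The cascade on deactivation mirrors the paper's point that roles are revoked dynamically as situations arise, rather than persisting like PERMIS attribute certificates.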
Vullings et al [11] point out that the main assumption underlying Shibboleth is that the IdP and the SP trust each other within a federation. The main features of this federation-activated system are summarized by Morgan et al [10]. The Shibboleth architecture is described in detail, with diagrams, in the poster. Shibboleth uses the OpenSAML APIs to standardize message and assertion formats, and bases its protocol bindings on SAML.

3.4 XACML – eXtensible Access Control Mark-up Language

XACML is an XML-based Web service standard for evaluating security-sensitive requests against access control policies. It provides a standard XML schema for expressing policies, rules based on those policies, and conditions. It also specifies a request/response protocol for sending a request and having it approved.

The XACML specification also defines an architecture for handling the entire lifecycle of a request, from its creation through to its evaluation and response. Several components, such as the Policy Enforcement Point (PEP) and Policy Decision Point (PDP), transform requests into the standard format before they are evaluated using rule-combining algorithms [12]. The benefits of using XACML, as written in the specification, are summarized in SUNXACML [12]. XACML allows fine-grained access control and is based on the assumption that a user’s request to perform an action on a resource under certain conditions needs an “allow” or “deny” decision.

4 Discussion

The requirements section of this poster describes how virtual organization dynamics can affect decisions regarding the implementation of authentication and authorization mechanisms. In this section, we evaluate the tools against a few elements we draw from the requirements.

4.1 Authentication and SSO

The GOLD system supports SSO through SAML assertions and related protocols as specified by the Liberty Alliance [13], using the OpenSAML [14] APIs. PERMIS offers an RBAC authorization infrastructure; however, PERMIS has been built to be authentication agnostic [4]. When a user wishes to access an application controlled by PERMIS, the user must first be authenticated by the application-specific AEF. OASIS has no support for authentication: it is entirely an access control system. Shibboleth implements its SSO mechanism via SAML and the concept of federated identity, so users are only required to sign on once at their home organizations. In the GOLD authentication system, we support a variety of tokens, e.g. X.509 digital signature/encryption and SAML assertions. In PERMIS, when authenticating a user, the AEF can use digital signatures, Kerberos, or username/password pairs [4]. Shibboleth mainly uses the SAML standard to construct its tokens through the OpenSAML [14] APIs, also developed at Internet2 [9], while other tokens are supported as well. The standardization that SAML offers greatly facilitates the exchange of security information about users in trust domains, and it fits well with the other WS-* standards used in the GOLD system.

4.2 Privacy

Privacy is also an essential issue that a security system needs to address. In the GOLD system, a user is issued with a SAML assertion (or, in the prototype, a GOLD context id), and only this assertion or context id flows between the GOLD participants. A user is authenticated with his private information only once, without having to provide it any additional times. Privacy protection is not provided in PERMIS: PERMIS has chosen public repositories to store the attribute certificates, which compromises the user’s privacy [4]. In OASIS, the ACs are not publicly visible when being issued and presented; when ACs travel over communication channels they are encrypted under SSL, so privacy is ensured. In Shibboleth, fairly active management of privacy was designed in from the start, and users have full control over what personal information is released and to whom.

4.3 Federation and broker style trust

Federation is an indispensable part of the GOLD architecture and offers the GOLD participants the freedom to decide whether they want to trust the identity providers, based on a pre-determined trust relationship. PERMIS [2] and OASIS [15] have been developed for distributed environments, and federation is undoubtedly the paramount issue in the Shibboleth system. However, the GOLD system also offers participants a choice of finding alternative independent parties when parties disagree on the authorities they trust, without sacrificing any traceability or accountability of credentials. This broker-style trust is not currently supported in any of the other aforementioned tools.

4.4 Dynamic activation and deactivation of access rights

Central to the authorization requirements is the need for dynamic activation and deactivation of access rights. As addressed earlier, the GOLD framework supports this by enabling VOs to define conceptual boundaries around projects and tasks so that roles and permissions can be scoped. Ongoing decisions can be made as projects or tasks progress, in relation to the dynamic activation and deactivation of access rights, and a high level of granularity is provided in this context. These are all achievable by taking advantage of the standardization and flexibility that XACML offers.
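The XACML request/response flow outlined in section 3.4 (a PEP forwarding a standardized request to a PDP, which evaluates rules under a combining algorithm) can be sketched as follows. The policy content and attribute names are illustrative assumptions, not drawn from the GOLD policies.

```python
# Sketch of an XACML-style policy decision point (illustrative).
# Each rule matches on (subject, action, resource) attributes and yields
# Permit or Deny; a deny-overrides combining algorithm merges the results.

def evaluate(rules, request):
    decisions = [effect for match, effect in rules if match(request)]
    if "Deny" in decisions:        # deny-overrides combining algorithm
        return "Deny"
    if "Permit" in decisions:
        return "Permit"
    return "NotApplicable"

# Hypothetical rules for a project resource.
rules = [
    (lambda r: r["role"] == "chemist" and r["action"] == "read", "Permit"),
    (lambda r: r["resource"] == "audit-log", "Deny"),
]

assert evaluate(rules, {"role": "chemist", "action": "read",
                        "resource": "dataset-42"}) == "Permit"
assert evaluate(rules, {"role": "chemist", "action": "read",
                        "resource": "audit-log"}) == "Deny"
assert evaluate(rules, {"role": "guest", "action": "write",
                        "resource": "dataset-42"}) == "NotApplicable"
```

Deny-overrides is only one of the combining algorithms the XACML specification defines; the fine-grained "allow"/"deny" decision model is what the text above refers to.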
OASIS [4] provides the means for dynamic role activation, as discussed in the earlier section. In contrast, role and policy rights assignment in PERMIS is rather persistent, and the attribute certificates PERMIS uses are a static representation. According to the PERMIS developers, in the current release of PERMIS a role is revoked by explicitly deleting the role AC from the LDAP directory using an LDAP browser/admin tool. Shibboleth focuses on attribute-based authorization and does not address the dynamic use of policy rights.

4.5 Policy delegation

GOLD also supports the delegation of authority, where the source of authority can delegate roles/privileges/rights to subordinate authorities. The X.509 standard, on which PERMIS is based, specifies mechanisms to control the delegation of authority from the source of authority to subordinate attribute authorities. Delegation was not fully supported by the PERMIS implementation in previous releases; however, according to the developers, it is supported by the recently developed release. A role holder may delegate his/her role to another individual without needing permission to alter the privileges assigned to that individual. Furthermore, PERMIS supports role hierarchies. With role hierarchies, the privileges of subordinate roles can be inherited by superior roles, and a role holder can delegate just a subordinate role instead of the entire role [16]. OASIS uses appointment, as introduced earlier, as opposed to privilege delegation.

5 Conclusion

This poster submission looked at VO requirements in terms of authorization and authentication, taking into account the dynamics of a VO, the levels of distrust, and the need for federation as well as rights delegation. We have evaluated the four tools available under open source licensing which address similar issues. We give practical assessments of what these tools do and how they address the issues that concern us, in conjunction with the requirements we elicited from the project.

References

[1] Periorellis P., Townson C. and English P., 2004, CS-TR: 854, Structural Concepts for Trust, Contract and Security Management for a Virtual Chemical Engineering, School of Computing Science, University of Newcastle.
[2] PERMIS, 2001, Privilege and Role Management Infrastructure Standards Validation, http://www.permis.org/en/index.html
[3] Chadwick D., Otenko S., 2003, A comparison of the Akenti and Permis authorization infrastructures, Proceedings of the ITI First International Conference on Information and Communications Technology (ICICT 2003), Cairo University, pages 5-26.
[4] Bacon J. et al., 2003, Persistent versus Dynamic Role Membership, 17th IFIP WG3 Annual Working Conference on Data and Application Security, No. 17, pages 344-357.
[5] Chadwick D., Otenko A., 2003, The PERMIS X.509 role based privilege management infrastructure, Future Generation Computer Systems, Vol. 19, Issue 2, 277-289.
[6] OASIS, 2003, An Open, Role-based, Access Control Architecture for Secure Interworking Services, Cambridge University EPSRC project, http://www.cl.cam.ac.uk/Research/SRG/opera/projects/
[7] Sandhu R., Coyne E., Feinstein H. & Youman C., 1996, Role-Based Access Control Models, IEEE Computer, Volume 29, Number 2.
[8] Shibboleth, 2005, Shibboleth Project, Internet2/MACE, http://shibboleth.internet2.edu
[9] Internet2/MACE, 1996-2005, Internet2 Middleware Architecture Committee for Education, http://middleware.internet2.edu/MACE/
[10] Morgan R. L., Cantor S., Carmody S., Hoehn W. & Klingenstein K., 2004, Federated Security: The Shibboleth Approach, Educause Quarterly, Vol. 27, No. 4.
[11] Vullings E., Buchhorn M., Dalziel J., 2005, Secure Federated Access to GRID applications using SAML/XACML.
[12] SUNXACML, 2004, Sun’s XACML Implementation, http://sunxacml.sourceforge.net
[13] Liberty Alliance Project, 2003, Introduction to the Liberty Alliance Identity Architecture, Revision 1.0.
[14] OpenSAML, 2005, An Open Source Security Assertion Markup Language implementation, Internet2, http://www.opensaml.org/
[15] Hine J.H., Yao W., Bacon J., Moody K., 2000, An Architecture for Distributed OASIS Services, Proc. Middleware 2000, Lecture Notes in Computer Science, Vol. 1795, Springer-Verlag, Heidelberg and New York, 107-123.
[16] Chadwick D. & Otenko A., 2002, RBAC Policies in XML for X.509 Based Privilege Management, in Security in the Information Society: Visions and Perspectives: IFIP TC11 17th Int. Conf. on Information Security (SEC2002), Cairo, Egypt, ed. M. A. Ghonaimy, M. T. El-Hadidi, H. K. Aslan, Kluwer Academic Publishers, pp 39-53.
1 Dept. of Computer Science, University College London, Gower St, London WC1E 6BT, United Kingdom
2 Dept. of Earth Sciences, University of Cambridge, Downing Street, Cambridge CB2 3EQ, United Kingdom
3 Cambridge eScience Centre, Centre for Mathematical Sciences, University of Cambridge, Wilberforce Road, Cambridge CB3 0EW
Abstract
Scientists require means of exploiting large numbers of grid resources in a fully integrated manner
through the definition of computational processes specifying the sequence of tasks and services they
require. The deployment of such processes, however, can prove difficult in networked environments,
due to the presence of firewalls, software requirements and platform incompatibility. Using the Business
Process Execution Language (BPEL) standard, we propose here an architecture that builds on a
delegation model by which scientists may rely on middle-tier services to orchestrate subsets of the
processes on their behalf. We define a set of inter-related workflows that correspond to basic patterns
observed on the eMinerals minigrid. These will enable scientists to incorporate job submission and
monitoring, data storage and transfer management, and automated metadata harvesting in a single
unified process, which they may control from their desktops using the Simple Grid Access tool.
- The retrieval and staging of data from our storage resources onto the target resource
- The submission and execution of the job
- The storage of the produced data
- Finally, the automated harvesting of metadata from the produced output, and its storage.

As part of the process, these various stages require fault-handling capabilities and monitoring of state changes. This in itself constitutes a relatively complex but reusable workflow unit (illustrated in figure 1) that can be interleaved with desktop processes, such as retrieving user input for steering purposes, or input and output processing.
The complexity and resource requirements of such a process mean that its management and execution are best left to a third party with sufficient resources, for which we rely on the Business Process Execution Language (BPEL).

2.2 The Business Process Execution Language

BPEL aims to enable the orchestration of Web Services by providing means of composing a collection of Web Service invocations into an executable workflow, itself presented as an invocable Web Service.
Much work has been invested in the development of Web Service compliant grid middleware. For example, web service based job submission tools such as GridSAM have been developed, providing a standard job interface to underlying resource management systems such as Condor.
BPEL allows us to specify the process to execute upon request as an XML document. It provides support for control flow elements, including sequential or parallel execution, conditional execution, variables, fault handling, and XPath and XSLT queries.
Once a workflow has been specified, a BPEL engine is responsible for orchestrating the workflow on a user’s behalf, acting as a middleman between resources and client. Such an approach also provides us with the advantage that the BPEL workflow can be modified independently of the users, considerably easing administration and enabling the transparent addition of new resources and services. Web Services also provide better support for firewalls, by allowing session management over a single port.

3. Implementation

Our implementation is illustrated in figure 2. For job specification, we rely on the Job Submission Description Language (JSDL), an emerging XML-based GGF standard [9].

Orchestration Service: At the core of our system, we rely on one or more BPEL engines to provide and manage the execution of workflows as specified in section 2.1. We use for this purpose the open source ActiveBPEL engine to deploy workflows. These workflows, deployed as executable Web Services, present an interface for clients to submit JSDL documents and corresponding data requirements and, upon receipt of such a document, manage the invocations of the various required services.
The ActiveBPEL engine also provides graphical Web-based monitoring tools offering a detailed view of the execution process, which ensures that users can check on the progress of their jobs whilst keeping the client-side tools as light as possible.

Simple Grid Access (SGA) tool: The SGA tool is a self-contained client tool that we have implemented to allow users to specify their job requirements and generate the corresponding JSDL and data requirement specifications for submission to a BPEL engine.
We favor here a simple command line interface, which allows users to specify the various requirements of the job (input files, executables, etc.) as a sequence of arguments, as well as allowing additional attributes (proxy server descriptions etc.) to be obtained from a configuration file, in a predefined location, or from local environment settings.
The client also provides additional data staging capabilities. Because client-side inbound connections are rarely available, it enables users to upload required files to the SRB vaults through HTTP before job submission, if required (as we will explore below), as well as making files available through a variety of means (FTP, HTTP). Files can also be returned to the user’s desktop upon job completion. Finally, it also allows users to generate a certificate proxy, which is uploaded to a credential management service (e.g. myProxy [13]) of the user’s choice. Once data and proxy requirements have been handled, the client invokes the orchestration services to orchestrate the execution of the workflow.
The client has been implemented in Java, ensuring that it is compatible with most platforms, using the CoG kit [11], Apache Axis [12], Apache HttpClient and other libraries.
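The reusable workflow unit described in section 2.1 (data staging, job submission, storage of results, and metadata harvesting, each wrapped in fault handling) can be sketched in pseudo-orchestration form. The stage functions below are stand-ins for the Web Service invocations a BPEL engine would actually make; names and structure are illustrative, not the deployed workflow.

```python
# Sketch of the orchestrated workflow unit (illustrative; each stage
# stands in for a Web Service invocation the BPEL engine would perform).

def run_workflow(job, stages, on_fault):
    """Run stages in sequence; route any failure to the fault handler."""
    state = {"job": job, "log": []}
    for name, stage in stages:
        try:
            stage(state)
            state["log"].append((name, "completed"))
        except Exception as exc:            # BPEL-style fault handler
            state["log"].append((name, "faulted"))
            on_fault(state, name, exc)
            break
    return state

stages = [
    ("stage-data",       lambda s: None),   # retrieve and stage inputs
    ("submit-job",       lambda s: None),   # submit and execute the job
    ("store-results",    lambda s: None),   # store produced data
    ("harvest-metadata", lambda s: None),   # harvest and store metadata
]

state = run_workflow("job-001", stages, on_fault=lambda s, n, e: None)
assert state["log"] == [
    ("stage-data", "completed"), ("submit-job", "completed"),
    ("store-results", "completed"), ("harvest-metadata", "completed"),
]
```

In the real system the sequencing, fault handlers and monitoring hooks are expressed declaratively in the BPEL XML document rather than in client code, which is what lets the workflow be modified independently of the users.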
Figure 2: Implementation

Meta-scheduling: We rely here on previous work in which we created a plug-in for the GridSAM job submission service to support submissions to multiple Condor pools, or Globus-managed clusters, through Condor’s Web Service interface [8], effectively transforming GridSAM into a complete metascheduling service. GridSAM also provides support for JSDL and myProxy, facilitating its integration into the system.

Data Transfer and services: While Web Services are used to control the data transfers, the actual transfer of data is handled through HTTP, in order to avoid the overhead of SOAP-based invocations. For this purpose, we have written a front-end wrapper over the SRB tools, using CGI, that facilitates interaction with the SRB data vaults through HTTP.
Alongside this, we have created a Web Service that facilitates the remote coordination of transfers between third parties and can be deployed on target resources. The reasons for adopting HTTP are obvious: it facilitates the upload of files for clients in the presence of firewalls, and it also enables them to access files through their browser.

Metadata collection: The interface provided by the RCommands Web Service provides means of adding, editing and removing new data sets and studies associated with a particular file and its physical location (on the SRB).
In addition to basic submission details, detailed information about the data can be obtained by

4. Conclusion

We have begun the process of incorporating the SGA tool into existing workflows and tools used by our scientists.
In particular, we have incorporated SGA into GDIS (GTK Display Interface for Structures), an open source program that allows scientists to visualise molecules, crystals and crystal defects in three dimensions and at the atomic scale [14]. GDIS can act as an interface to create code input and run the simulations on the local machine. By enabling GDIS to invoke our SGA tool, we can now launch calculations onto our minigrid and take advantage of all our services, with little change to the original code base.

References

1. Dove, M. et al., Environment from the molecular level: an escience testbed project, in Proc. of the All Hands Meeting, Nottingham, 2003.
2. Blanshard, L. et al., Grid tool integration within the eMinerals project, in Proc. of the All Hands Meeting, Nottingham, 2004.
3. Andrews, T. et al., Business Process Execution Language for Web Services Version 1.1, OASIS, 2003. http://ifr.sap.com/bpel4ws
4. The GridSAM project. http://www.lesc.ic.ac.uk/gridsam/
5. The Globus Project. http://www.globus.org/
6. White, T. et al., eScience methods for the combinatorial chemistry problem of adsorption of pollutant organic molecules on mineral surfaces, in Proc. of the All Hands Meeting, Nottingham, 2005.
7. Murray-Rust, P. et al., Chemical Markup Language and XML Part I. Basic principles, J. Chem. Inf. Comp. Sci., 39, 928, 1999.
8. Chapman, C. et al., Condor Birdbath: web service interface to Condor, in Proc. of the All Hands Meeting, Nottingham, 2005.
9. Anjomshoaa, A. et al., Job Submission Description Language Specification v1.0, GFD-R-P.056, 2005.
10. Tyer, R. P. et al., Automatic metadata capture and grid computing, in Proc. of the All Hands Meeting, Nottingham, 2006.
11. Java CoG Kit. http://wiki.cogkit.org/
12. Apache Software Foundation. http://www.apache.org/
13. MyProxy. http://grid.ncsa.uiuc.edu/myproxy/
14. Fleming, S. and Rohl, A., GDIS: a visualization program for molecular and periodic systems, Z. Krist., vol. 220, pp. 580-584, 2005.
Abstract
Table 1. Examples of how the study / dataset / data object levels have been used to organise data.

Study:
  Example 1: Molecular dynamics simulation of silica under pressure
  Example 2: Ab initio study of dioxin molecules on clay surface
Data set:
  Example 1: Identified by the sample size and the interatomic potential model
  Example 2: Identified by the number of chlorine atoms in the molecule and the specific surface site
Data object:
  Example 1: Collection on the SRB containing all input and output files
  Example 2: Collection on the SRB containing all input and output files
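The three-level organisation in Table 1 can be sketched as a simple containment hierarchy. The class and field names below are illustrative, not the schema of the eMinerals metadata database.

```python
# Sketch of the study / data set / data object hierarchy (illustrative).
from dataclasses import dataclass, field

@dataclass
class DataObject:
    name: str
    uri: str                      # e.g. an SRB collection holding the files
    metadata: dict = field(default_factory=dict)

@dataclass
class DataSet:
    name: str
    objects: list = field(default_factory=list)

@dataclass
class Study:
    name: str
    datasets: list = field(default_factory=list)

study = Study("Molecular dynamics simulation of silica under pressure")
ds = DataSet("sample size and potential model")     # hypothetical label
ds.objects.append(DataObject("run-01", "srb://collection/run-01",
                             {"temperature": "300 K"}))
study.datasets.append(ds)

assert study.datasets[0].objects[0].uri.startswith("srb://")
```

Each data object points at an SRB collection, matching Table 1's convention that the lowest level is the collection containing all input and output files.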
Simulation metadata: information such as the user who performed the run, the date, the computer run on, etc.
Parameter metadata: values for the parameters that control the simulation run, such as temperature, pressure, potential energy functions, cut-offs, number of time steps etc.
Property metadata: values of various properties calculated in the simulation that would be useful for subsequent metadata searches, such as final or average energy or volume.
Code metadata: information to enable users to determine what produced the simulation data, such as code name, version and compilation options.
Arbitrary metadata: strings to enable later indexing into the data, such as whether a calculation is a test or production run.

The RCommand framework

To facilitate automatic metadata capture, we have developed a set of scriptable unix command-line tools that can upload metadata to the metadata database (Table 2). The RCommands are a standard three-tier application:
Client: a set of binary tools written in C using the gSOAP library. The motivation for using C was the requirement that the tools be as self-contained as possible so they can easily be executed server side via Globus.
Application Server: written in Java using Axis as the SOAP engine and JDBC for database access. The code essentially maps from procedural web service calls to SQL appropriate for the underlying database [4].
RDBMS: the backend database is served by an Oracle 10g RAC cluster. Although Oracle is used, there is no requirement for any Oracle-specific functionality.
One of the main reasons for using a three-tier model is that the backend databases are heavily firewalled and cannot be accessed directly from client machines.
The SOAP messages are sent to the application server via SSL-encrypted HTTP. The application server is authenticated using its certificate, while the client requests are authenticated using usernames and passwords. The use of web service technology allows the network-related code to be autogenerated. On the server, the Axis SOAP engine is configured to expose specified methods of certain classes as RPC web services. The client code is generated on the fly by gSOAP from the WSDL file produced by Axis.

Metadata Manager: the web interface to the metadata database

The RCommands were written primarily to provide tools that can be used in scripts, but nevertheless they give scientists a useful interface to the metadata database. However, there are cases where a web interface is better, particularly when a graphical overview is required that cannot be provided by a unix shell interface. Thus RPT has developed a web interface to the metadata database called the “Metadata Manager”. This gives an overview of the study level, from which the user can drill down into the various layers. The user can perform a number of the functions that are provided by the RCommands.
The MDM design uses the JSP Model 2 architecture, which is based on the Model-View-Controller (MVC) pattern. In addition, the Front Controller pattern is also used. The majority of the code in the Model layer is common to both the RCommands and the MDM. Hence, as with the RCommands, database connectivity is provided using the JDBC libraries.

Collecting metadata: the role of XML in output files

Much of the metadata we collect is harvested from output data files. To facilitate this, we have enabled our key simulation programs to write their main output files in XML, using the Chemical Markup Language [5]. CML specifies a number of XML element types for representing lists of data, including:
metadataList: contains general properties of the simulation, such as code version;
parameterList: contains parameters for the simulation;
propertyList: contains property values computed by the simulation.
Table 2. The RCommands and their actions.

Rinit: Starts an RCommand session by setting up session files.
Rpasswd: Changes the password for access to the metadata database.
Rcreate: Creates study, dataset and data object levels, associating the lower levels with the level immediately above, adding a name to each level, adding a metadata description and topic association in the case of creating a study, and associating a URI in the case of creating a data object.
Rannotate: Adds metadata. In the case of studies or datasets, this enables a metadata description, and in the case of datasets and data objects it also enables metadata name/value pairs. It also enables more topics to be associated with a study.
Rls: Lists entities within the metadata database. With no parameters, it lists all studies, and with parameters it lists the entries within a study or dataset level. It can also be used to list all possible collaborators or science topics.
Rget: Gives the metadata associated with a given study, dataset or data object. In the case of a study, it can also list associated collaborators and science topics.
Rrm: Removes entities or parameters from the metadata database.
Rchmod: Adds or removes co-investigators from a study.
Rsearch: Searches the metadata database, tuned to search within different levels and against descriptions, name/value pairs and parameters.
Rexit: Ends an RCommand session, cleaning up session files.
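A scripted session with the commands in Table 2 might be assembled as below. Only the parameter-free forms documented in the table are used (Rinit, Rls with no parameters, Rexit), and the invocations are merely collected rather than executed, since the exact argument syntax of the other commands is not given here.

```python
# Sketch: assembling an RCommand session as a list of invocations
# (illustrative; commands are collected, not run, because the full
# argument syntax is not specified in Table 2).

def build_session(body):
    # Per Table 2, a session starts with Rinit and ends with Rexit.
    return [["Rinit"]] + body + [["Rexit"]]

session = build_session([
    ["Rls"],          # with no parameters, lists all studies
])

assert session[0] == ["Rinit"]
assert session[-1] == ["Rexit"]
assert ["Rls"] in session
```

Wrapping the commands this way reflects their stated purpose: tools designed to be driven from scripts running server side via Globus.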
It is usual to have more than one of each list, particularly the propertyList. These lists correspond to the metadata types described above. An example is given in Figure 1.

Automatic metadata capture within a grid computing environment

As described in the introduction, the eMinerals scientists run their simulations using the MCS tool [3]. MCS deals with four types of metadata:
1. An arbitrary text string specified by the user.
2. Environment metadata automatically captured from the submission and execution environment.
3. Metadata extracted from the first metadataList and parameterList elements described in the previous section.
4. Additional metadata extracted from the XML documents. These specifications take the form of expressions with a syntax similar to XPath expressions. These are parsed by MCS and broken down into a single term used to provide the context of the metadata (such as ‘FinalEnergy’) and a series of calls to be made to the AgentX library [6].
MCS uses calls to the AgentX library to query documents for data with a specific context. For example, AgentX could be used to find the total energy of a system calculated during a simulation. The user-specified expression might have the form:

AgentX = FinalEnergy, output.xml:/PropertyList[title='rolling averages']/Property[dictRef='dl_poly:eng_tot']

The term providing the name of the metadata item is 'FinalEnergy' and the document to be queried is output.xml. The string following ‘output.xml:’ is parsed by MCS and converted to a series of AgentX library calls. In this example, AgentX is asked to locate all the data sets in output.xml that relate to the concept ‘property’ and which have the reference ‘dl_poly:eng_tot’. The value of this property is extracted and associated with the term FinalEnergy. The RCommands are then used to store this name/value pair in the metadata database.
AgentX works with a specification of ways to locate data in documents (such as a CML document) that have a well defined content model. There are two components to the AgentX framework:
1. An ontology that specifies terms relating to concepts of interest in the context of this work. These terms relate to classes of real world entities of interest and to their properties. The ontology is specified using OWL and serialised using RDF/XML.
<metadataList>
<metadata name="identifier" content="DL_POLY version 3.06 / March 2006"/>
</metadataList>
2. The mappings, which are used to relate terms in the ontology to document fragment identifiers. For XML documents, these fragment identifiers are XPointer expressions that may be evaluated to locate data sets and data elements in the documents. Each format is associated with its own set of mappings, serialised using RDF/XML.

AgentX is able to retrieve information from arbitrary XML documents, as long as mappings are provided. Mappings exist for an increasing number of simulation codes.

Post-processing using the RParse tool

Although XML output will capture all information associated with the simulation, it is inevitable that the automatic tools may miss some of the metadata; it is not always obvious at the start of a piece of work which properties are of most interest. We have used the AgentX libraries to develop the RParse tool to retrospectively extract metadata by scanning over the XML files contained within the data objects in a single dataset. RParse uses the SRB Scommands, the RCommands, and the AgentX library. The user specifies a collection in the SRB, the relevant output files, AgentX query expressions, and the dataset into which the metadata is to be inserted.

Examples of applications

We have used the metadata tools for the following applications:
‣ collaborative studies of adsorption of molecules onto mineral surfaces;
‣ parameterisation of computations of PCB molecules [7];
‣ study of silica glass under pressure [8];
‣ sharing literature search results, with the data object linking to the on-line publication URL.

We are grateful for funding from NERC (grant reference numbers NER/T/S/2001/00855, NE/C515698/1 and NE/C515704/1).

References

1. MT Dove et al. The eMinerals project: developing the concept of the virtual organisation to support collaborative work on molecular-scale environmental simulations. Proceedings of All Hands 2005, pp 1058–1065, 2005
2. M Calleja et al. Collaborative grid infrastructure for molecular simulations: The eMinerals minigrid as a prototype integrated compute and data grid. Mol. Simul. 31, 303–313, 2005
3. RP Bruin et al. Job submission to grid computing environments. Proceedings of All Hands 2006
4. M Doherty, K Kleese, S Sufi. Database Cluster for e-Science. Proceedings of UK e-Science All Hands Meeting 2003, pp 268–271, 2003
5. TOH White et al. Development and Use of CML in the eMinerals project. Proceedings of All Hands 2006
6. PA Couch et al. Towards Data Integration for Computational Chemistry. Proceedings of All Hands 2005, pp 426–432, 2005
7. KF Austen et al. Using e-science to calibrate our tools: parameterisation of quantum mechanical calculations with grid technologies. Proceedings of All Hands 2006
8. MT Dove et al. Anatomy of a grid-enabled molecular simulation study: the compressibility of amorphous silica. Proceedings of All Hands 2006
Abstract
A problem that frequently arises in the management and integration of scientific data is the lack of context and semantics that would link data encoded in disparate ways. To bridge the discrepancy, it often helps to mine scientific texts to aid the understanding of the database. Mining relevant text can be significantly aided by the availability of descriptive and semantic metadata. The Digital Curation Centre (DCC) has undertaken research to automate the extraction of metadata from documents in PDF [22]. Documents may include scientific journal papers, lab notes or even emails. We suggest genre classification as a first step toward automating metadata extraction. The classification method is built on looking at the documents from five directions: as an object of specific visual format, a layout of strings with characteristic grammar, an object with stylometric signatures, an object with meaning and purpose, and an object linked to previously classified objects and external sources. Some results of experiments in relation to the first two directions are described here; they are meant to be indicative of the promise underlying this multi-faceted approach.
([3]) and Kessler et al. ([17]). We follow Kessler, who refers to genre as "any widely recognised class of texts defined by some common communicative purpose or other functional traits, provided the function is connected to some formal cues or commonalities and that the class is extensible". For instance, a scientific research article is a theoretical argument or communication of results relating to a scientific subject, usually published in a journal and often starting with a title, followed by author, abstract, and body of text, and finally ending with a bibliography. One important aspect of genre classification is that it is distinct from subject classification, which can coincide over many genres (e.g., a mathematical paper on number theory versus a news article on the proof of Fermat's Last Theorem).

By beginning with genre classification it is possible to limit the scope of document forms from which to extract other metadata. By reducing the metadata search space, metadata such as author, keywords, identification numbers or references can be predicted to appear in a specific style and region within a single genre. Independent work exists on extraction of keywords, subject and summarisation within specific genres, which can be combined with genre classification for metadata extraction across domains (e.g. [2], [13], [14], [26]). The resources available for extracting further metadata vary by genre; for instance, research articles, unlike newspaper articles, come with a list of citations closely related to the original article, leading to better subject classification. Genre classification will facilitate automating the identification, selection, and acquisition of materials in keeping with local collecting policies.

We have opted to consider 60 genres and have discussed this elsewhere [initially in 18]. This list does not represent a complete spectrum of possible genres or necessarily an optimal genre classification; it provides a baseline from which to assess what is possible. The classification is extensible. We have also focused our attention on information from genres represented in PDF files. Limiting this research to one file type allowed us to bound the problem space further. We selected PDF because it is widely used, is portable, benefits from a variety of processing tools, is flexible enough to support the inclusion of different types of objects (e.g. images, links), and is used to present a diversity of genres.

In the experiments which follow we worked with a data set of 4000 documents collected via the internet using a randomised PDF-grabber. Currently 570 are labelled with one of the 60 genres, and manual labelling of the remainder is in progress. A significant amount of disagreement is apparent in labelling genre, even between human labellers; we intend to cross-check the labelled data later with the help of other labellers. However, the assumption is that an experiment on data labelled by a single labeller, as long as the criteria for the labelling process are consistent, is sufficient to show that a machine can be trained to label data according to a preferred schema, thereby warranting further refinement complying with well-designed classification standards.

2. Classifiers

For the experiments described here we implemented two classifiers. First, an Image classifier, which depends on features extracted from the PDF document when handled as an image. It converts the first page of the PDF file to a low-resolution image expressed as pixel values. This is then sectioned into ten regions, and the number of non-white pixels in each is assessed. Each region is rated as level 0, 1, 2 or 3, with a larger number indicating a higher density of non-white space. The result is statistically modelled using the Maximum Entropy principle with the MaxEnt toolkit developed by Zhang ([28]). Second, we implemented a Language model classifier, which depends on an N-gram model at the level of words. N-gram models estimate the probability of word w(N) coming after a string of words w(1), w(2), ..., w(N-1). A popular model is the case when N=3. This has been modelled with the BOW toolkit ([19]) using the default Naïve Bayes model without a stoplist.

3. Experiment Design

An assumption in the two experiments described here is that PDF documents belong to one of four categories: Business Report, Minutes, Product/Application Description, or Scientific Research Article. This is, of course, a false assumption, and limiting the scope in this way changes the meaning of the resulting statistics considerably. However, our contention is that high-level performance on a limited data set, combined with a suitable means of accurately narrowing down the candidates to be labelled, would achieve the end objective.
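The image classifier's feature extraction can be sketched as follows (an illustrative reconstruction from the description above, not the authors' code; the horizontal-band layout and the white threshold are our assumptions):

```python
def region_density_features(page, n_regions=10, white_threshold=250):
    """Rate each horizontal band of a grayscale page (a list of pixel rows,
    values 0-255) on a 0-3 scale by its density of non-white ("ink") pixels."""
    rows_per_band = max(1, len(page) // n_regions)
    levels = []
    for i in range(n_regions):
        band = page[i * rows_per_band:(i + 1) * rows_per_band]
        pixels = [px for row in band for px in row]
        frac = sum(px < white_threshold for px in pixels) / len(pixels)
        levels.append(min(3, int(frac * 4)))  # quantise to levels 0..3
    return levels

# A synthetic "page": 200 x 100 pixels, dense text in the top fifth.
page = [[0] * 100 for _ in range(40)] + [[255] * 100 for _ in range(160)]
features = region_density_features(page)
print(features)  # [3, 3, 0, 0, 0, 0, 0, 0, 0, 0]
```

The resulting ten-level vector is the kind of low-dimensional layout signature that a Maximum Entropy model can then be trained on.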
For the first experiment we took the 70 documents in our labelled data belonging to the above four genres, randomly selected a third of them as training data (27 documents) and the remaining documents as test data (43), trained both classifiers on the selected training data, and examined the result. In the second experiment we used the same training and test data. We allocated the genres to two groups, each containing two genres: Group I included business reports and minutes, while Group II encompassed scientific research articles and product descriptions. Then we trained the image classifier to differentiate between the two groups and used this to label the test data as documents of Group I or II. Concurrently we trained two language model classifiers: Classifier I, which distinguishes business reports from minutes, and Classifier II, which labels documents as scientific research articles or product descriptions. Subsequently we took the test documents labelled Group I and labelled them with Classifier I, and those labelled Group II and labelled them with Classifier II. We examined the result.

4. Results and Conclusions

The precision and recall for the two experiments are presented in Tables 1 and 2. Although the performance of the language model classifier shown in Table 1 is already high, this, to a great extent, reflects the four categories chosen. In fact, when the classifier was extended to include 40 genres, it performed only at an accuracy of about 10%. When a different set was employed which included Periodicals, Thesis, Minutes and Instruction/Guideline, the language model performs at an accuracy of 60.34%, and the image classifier on Group I (Periodicals) and Group II (Thesis, Minutes, Instruction/Guideline) performs at 91.37%. Note also that, since Thesis, Minutes and Instruction/Guidelines can be intuitively predicted to have distinct linguistic characteristics, the language classifier's performance on each group is also predicted to be highly accurate (results pending). It is clear from the two examples that such high performance cannot be expected for an arbitrary collection of genres. Judging from the results of the classifiers, the current situation seems to be a case of four categories which are similar under the image classifier but which differ linguistically. A secondary reason for involving images in the classification and information extraction process arises because some PDF files are textually inaccessible due to password protection, and even when text is extracted, text processing tools are quite strict in their requirements for input data. In this respect, images are much more stable and manageable. Combining a soft-decision image classifier with the language model both increases the overall accuracy and results in a notable increase in recall for most of the four genres (see Table 2).

Table 1. Result of Language Model Only (overall accuracy: 77%)

Genre              | Prec (%) | Rec (%)
Business Report    | 83       | 50
Sci. Res. Article  | 88       | 80
Minutes            | 64       | 100
Product Desc       | 90       | 90

Table 2. Result of Image & Language Model (overall accuracy: 87.5%)

Genre              | Prec (%) | Rec (%)
Business Report    | 83       | 50
Sci. Res. Article  | 75       | 90
Minutes            | 71       | 100
Product Desc       | 90       | 100

The results of the experiments indicate that the combined approach is a promising candidate for further experimentation with a wider range of genres. The experiments show that, although there is a lot of confusion, visually and linguistically, over all 60 genres, subgroups of the genres exhibit statistically well-behaved characteristics. This encourages the search for groups which are similar or different visually or linguistically. Further experiments are planned to enable us to refine this hypothesis.

Further improvement can be envisioned, including integrating more classifiers. An extended image classifier could examine pages other than just the first page (as done here), or examine the images of several pages in combination: different pages may have different levels of impact on genre classification, while processing several pages in combination may provide more information. A language model classifier on the level of POS and phrases would use an N-gram language model built on the part-of-speech tags (for instance, tags denoting words as a verb, noun or preposition) of the underlying text of the document, and also on partial chunks resulting from detection of phrases (e.g. noun, verb or prepositional phrases). A stylometric classifier taking its cue from positioning of text and image blocks, font
Solving Grid interoperability between 2nd and 3rd generation Grids by the
integrated P-GRADE/GEMLCA portal
Tamas Kiss(1), Peter Kacsuk(1,2), Gabor Terstyanszky(1), Thierry Delaitre(1), Gabor Kecskemeti(2), Stephen Winter(1)
(1) Centre for Parallel Computing, University of Westminster, 115 New Cavendish Street, London, W1W 6UW; e-mail: kisst@wmin.ac.uk
(2) MTA SZTAKI, 1111 Kende utca 13, Budapest, Hungary
Abstract
Grid interoperability has recently become a major issue at Grid forums. Most of the current ideas try to solve the problem at the middleware level, where unfortunately too many components (information system, broker, etc.) have to be made interoperable. As an alternative concept, the P-GRADE/GEMLCA portal is the first Grid portal that targets the problem at the level of workflows. Different components of a workflow can be executed simultaneously in several Grids, providing access to a larger set of resources and enabling the user to exploit more parallelism than inside one Grid. The paper describes how the P-GRADE Portal and GEMLCA enable the execution of workflows in multiple Grids based on different 2nd (GT2 and LCG-2) and 3rd (GT4 and g-Lite) generation Grid middleware.
In the current paper we show that interoperability can be solved at a much higher level, namely at the workflow level, which could be part of a Grid portal. Indeed, the P-GRADE Grid portal [8] is the first Grid portal that tries to solve

2.1 Connecting Second Generation Grids

Most of the current production Grid systems are based on second generation Grid technology. The basis of the underlying middleware is in most cases the Globus Toolkit 2; however, because of
substantial variations and modifications, these Grids are not naturally interoperable with each other. As the P-GRADE portal supports access to multiple Grids at the same time, and as it also supports both LCG and Globus-based Grids, the portal can be utilised to connect these Grid systems and map the execution of different workflow components to different resources in different Grids. As the portal is also integrated with the GEMLCA legacy code repository (GEMLCA-R) [18], users can not only submit their own executables to Grid resources but can also browse the repository and select executables published by other users.

[Figure: the P-GRADE GEMLCA portal submitting a user's executable, with a proxy certificate, to the UK NGS (GT2) and to the EGEE/VOCE LCG broker (Budapest)]

… sites. Any P-GRADE GEMLCA portal installation is capable of supporting the above functionalities. Portal administrators only have to define both Grids with their default resources, and users have to assign the appropriate certificate to each Grid before submitting workflows. As there are currently many users in Europe with access rights to both of these Grids, they can easily utilise the combined power of the resources.

2.2 Extending Second Generation Grids with Third Generation Resources

Most of the production Grids [10] [11] [13] [14] [15] are based on second generation middleware at the moment. However, most of them are considering the transition to service-oriented middleware, like GT4 or gLite. The P-GRADE portal, extended with GEMLCA (Grid Execution Management for Legacy Code Applications) [9], is capable of seamlessly assisting this transition. GT4 GEMLCA resources can be added to GT2- or LCG-based Grids without any modification of the underlying infrastructure. In this case the GT4 resource becomes an integral part of the original Grid. Users can map workflow components to these resources using the proxy certificate accepted by that particular Grid, without being aware of the differences of the underlying layers. Figure 2 illustrates that a GT4

Figure 2. Extending GT2 Grids with GT4 resources

2.3 Connecting Second and Third Generation Grids

As the P-GRADE portal is capable of connecting multiple Grids, and as through GEMLCA it also supports service-oriented Grid systems, it became possible to connect separate GT2- and GT4-based Grids from the same portal at
workflow-level. In order to demonstrate this concept the UK NGS was connected to a GT4 testbed, called the WestFocus Grid [12], comprising clusters at Westminster and Brunel Universities. The GT2 and GT4 Grids were defined as separate Grids in separate administrative domains that could potentially accept different user certificates. As illustrated in Figure 3, jobs within a workflow can be submitted to GT2 resources within the NGS (1 and 2), or could be services deployed in the WestFocus Grid (3 and 4).

[Figure 3: jobs and GT4 service invocations submitted from the P-GRADE GEMLCA portal, with different proxy certificates, to the UK NGS (GT2, Manchester and Oxford) and to the WestFocus Grid (GT4, Westminster and Brunel)]

… a legacy code submitted from the GEMLCA repository to the GT2 NGS site at Manchester. Finally, the comparator component was a GT2 job submitted directly to Oxford within the NGS. Besides the different underlying architectures, the Grids were also using different proxy certificates. The execution graph in Figure 5 illustrates that the workflow completed successfully, demonstrating the workflow-level interoperability of this very diverse infrastructure.

Figure 4. Traffic Simulation Workflow Running in 3 Different Grids
Future work, on one hand, aims to extend the existing EGEE P-GRADE portals with the GEMLCA service and to connect the P-GRADE/GEMLCA portal to some other large production Grids, namely the NorduGrid and the Open Science Grid.

Work is also carried out in order to extend the service support capabilities of the portal. The GEMLCA – P-GRADE portal is already capable of supporting GT4-based Grids. However, this support is limited to legacy codes that are presented as Grid services through GEMLCA. The aim is to incorporate native GT4 services and pure Web services into the portal with minimal modification of the current architecture. In order to achieve this, Web and GT4 Grid services are handled by modified GEMLCA resources. These GEMLCA resources are used to deploy the Web/Grid service by generating a GEMLCA-specific interface description file from the WSDL automatically, to browse the GEMLCA repository, select the required service and set its parameter values, and finally to invoke the service at workflow execution time.

References

[1] Michael Rambadt, Philipp Wieder: UNICORE - Globus: Interoperability of Grid Infrastructures, Conf. Proc. of the Cray User Group Summit 2002, Manchester, UK.
[2] Globus Team, Globus Toolkit 4.0 Release Manuals, http://www.globus.org/toolkit/docs/4.0/
[3] The gLite website, http://glite.web.cern.ch/glite/
[4] The UNICORE article in Wikipedia, http://en.wikipedia.org/wiki/UNICORE
[5] J. Novotny, M. Russell, O. Wehrens: GridSphere, "An Advanced Portal Framework", Conf. Proc. of the 30th EUROMICRO Conference, August 31st - September 3rd 2004, Rennes, France.
[6] Condor Team, University of Wisconsin-Madison: Condor Version 6.4.7 Manual, http://www.cs.wisc.edu/condor/manual/v6.4/
[7] The NorduGrid web page, http://www.nordugrid.org/
[8] P. Kacsuk and G. Sipos: Multi-Grid, Multi-User Workflows in the P-GRADE Grid Portal, Journal of Grid Computing Vol. 3. No. 3-4., 2005, Springer, 1570-7873, pp 221-238
[9] T. Delaittre, T. Kiss, A. Goyeneche, G. Terstyanszky, S. Winter, P. Kacsuk: GEMLCA: Running Legacy Code Applications as Grid Services, Journal of Grid Computing Vol. 3. No. 1-2., 2005, Springer, 1570-7873, pp 75-90
[10] The UK National Grid Service Website, http://www.ngs.ac.uk/
[11] The EGEE web page, http://public.eu-egee.org/
[12] The WestFocus Gridalliance web page, http://www.gridalliance.co.uk/
[13] The SEE-GRID web page, http://www.see-grid.org/
[14] The Open Science Grid Website, http://www.opensciencegrid.org/
[15] The TeraGrid Website, http://www.teragrid.org
[16] T. Delaitre, A. Goyeneche, T. Kiss, G.Z. Terstyanszky, N. Weingarten, P. Maselino, A. Gourgoulis, S.C. Winter: Traffic Simulation in P-Grade as a Grid Service, Conf. Proc. of the DAPSYS 2004 Conference, pp 129-136, ISBN 0-387-23094-7, September 19-22, 2004, Budapest, Hungary
[17] William Lee, A. Stephen McGough, and John Darlington: Performance Evaluation of the GridSAM Job Submission and Monitoring System, Conf. Proc. of the UK e-Science All Hands Meeting, 2005, http://www.allhands.org.uk/2005/proceedings/papers/541.pdf
[18] T. Kiss, G. Terstyanszky, G. Kecskemeti, Sz. Illes, T. Delaittre, S. Winter, P. Kacsuk, G. Sipos: Legacy Code Support for Production Grids, Conf. Proc. of Grid 2005 - 6th IEEE/ACM International Workshop on Grid Computing, November 13-14, 2005, Seattle, Washington, USA
Abstract

Grid computing impacts directly on the experimental scientific laboratory in the areas of monitoring and remote control of experiments, and the storage, processing and dissemination of the resulting data. We highlight some of the issues in extending the use of an MQ Telemetry Transport (MQTT) broker from facilitating the remote monitoring of an experiment and its environment to the remote control of an apparatus. To demonstrate these techniques, an Intel Play QX3 microscope has been "grid-enabled" using a combination of software to control the microscope imaging, and sample handling hardware built from LEGO Mindstorms. The whole system is controlled remotely by passing messages using an IBM WebSphere Message Broker.
some limits in terms of repeatability and tolerance of construction. For longer-term use, high-spec industrial components and indexed motion controllers would produce a system with a much higher level of reproducibility. LEGO also has a limited maximum load capacity (it is, after all, designed as a toy).

The use of MQTT as a remote control protocol brings some benefits and some unexpected features. By using existing libraries the software development can be extremely rapid, which is important when producing a demonstrator. MQTT also provides a solution to some of the traditional network problems, for example dealing with broken or unreliable connections, and making sure that the message always gets through. However, with remote control it is necessary to consider whether messages should always get through: if the end effector is unable to receive the control signal straight away, should the system then queue it for later delivery? It was found with this work that the reliability of the IR link between the PC and the RCX was somewhat variable; sometimes it was necessary to manually move the RCX to regain signal. When the link was regained, MQTT then ensured that waiting messages were delivered, often resulting in unpredictable behaviour. In the case of the LEGO model this was unexpected, but not dangerous. However, with another system the unpredictable behaviour from delayed messages could cause a dangerous situation, hence control hardware and software should be designed to avoid this, in particular by using the non-persistent messaging mode of MQTT.
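The hazard of queued control messages replaying stale commands after a link outage can be illustrated with a toy in-memory publish queue (a sketch only, not the MQTT protocol itself; real clients expose this choice through quality-of-service and persistence options):

```python
import time
from collections import deque

class TinyBroker:
    """Toy pub/sub queue: commands published while the device is offline are
    held and delivered late, all at once, when the link is regained."""
    def __init__(self):
        self.queue = deque()

    def publish(self, topic, payload, persistent=True):
        if persistent:
            self.queue.append((topic, payload, time.time()))
        # non-persistent messages to an offline subscriber are simply dropped

    def reconnect_and_drain(self, max_age_s=1.0):
        """On reconnect, deliver only commands younger than max_age_s,
        discarding stale ones that could cause unpredictable motion."""
        now = time.time()
        fresh = [(t, p) for t, p, ts in self.queue if now - ts <= max_age_s]
        self.queue.clear()
        return fresh

broker = TinyBroker()
broker.publish("microscope/stage", "forward")                  # queued while offline
broker.publish("microscope/stage", "left", persistent=False)   # dropped
delivered = broker.reconnect_and_drain()
print(delivered)  # [('microscope/stage', 'forward')]
```

Discarding commands older than a freshness window is one way a control layer can get non-persistent semantics even when the transport queues messages.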
4 Extensions and Future work

4.1 Sample Handling

As part of automating an apparatus in a laboratory, moving physical samples onto and off the apparatus needs addressing. Some existing apparatus are fitted with auto-samplers and sample-batching handlers which allow the operator to preload a number of samples, usually into a grid or carousel that then sequentially presents the samples to the measurement position. By combining sample selection hardware with a scheduling and notification system, the remote user could arrange a time on the apparatus during which their sample would be mounted, and they would be given control of the apparatus.

4.2 Inclusion of Remote Monitoring

Work on remote environment and apparatus monitoring has previously reported that MQTT can be used to monitor sensors in real time (1; 3; 9). When implementing automation in a real laboratory scenario, it is important to provide the operator with as much information about the apparatus, and personnel in the vicinity of the apparatus, as possible, both for reasons of safety and result reliability. By using common technologies, adding this data becomes a far easier task.

4.3 Authorisation over MQTT

MQTT deliberately has a very minimalist approach to security, enabling appropriate security to be layered on top of it as required for any given application. In this demonstrator, security was purposely overlooked as the equipment is relatively safe and data traffic was limited to the campus network. Before using this type of system to allow control over a hostile network (such as the internet) an appropriate security scheme would need to be implemented. This could include: encrypting the message payload (which could use PKI certificates to additionally provide signing for authentication and non-repudiation), challenge/response security (by sending the challenge/response flows as MQTT messages over pub/sub), or TCP/IP traffic protection (such as standard VPN or SSH tunnels).
References

[1] J. M. Robinson, J. G. Frey, A. J. Stanford-Clark, A. D. Reynolds, B. V. Bedi, Sensor networks and grid middleware for laboratory monitoring, in: First International Conference on e-Science and Grid Computing (e-Science'05), pp 562–569.
URL http://doi.ieeecomputersociety.org/10.1109/E-SCIENCE.2005.73
[2] A. Stanford-Clark, Integrating monitoring and telemetry devices as part of enterprise information resources, Tech. rep., IBM (2002).
URL http://www-306.ibm.com/software/integration/mqfamily/integrator/telemetry/pdfs/telemetry_integration_ws.pdf
[3] J. M. Robinson, J. G. Frey, A. J. Stanford-Clark, A. D. Reynolds, B. V. Bedi, Chemistry by mobile phone (or how to justify more time at the bar), in: Proceedings of the UK e-Science All Hands Meeting 2005.
URL http://www.allhands.org.uk/2005/proceedings/papers/481.pdf
[4] IBM, WebSphere MQ Telemetry Transport Specification.
URL http://publib.boulder.ibm.com/infocenter/wbihelp/index.jsp?topic=/com.ibm.etools.mft.doc/ac10840_.htm
[5] LeJOS: Java for the RCX, accessed 24 April 2006.
URL http://lejos.sourceforge.net/
[6] CPiA webcam driver for Linux, accessed 24 April 2006.
URL http://webcam.sourceforge.net/
[7] A. Nirendra, Exploring procfs, Linux Gazette 115.
URL http://linuxgazette.net/115/nirendra.html
[8] jtravis@p00p.org, Official camserv home page, accessed 28 April 2006.
URL http://cserv.sourceforge.net
[9] FloodNet: Pervasive computing in the environment, accessed 24 April 2006.
URL http://envisense.org/floodnet/floodnet.htm
IBM and WebSphere are trademarks of IBM Corporation in the United States, other countries or both.
Java is a trademark of Sun Microsystems Inc. in the United States, other countries or both.
LEGO and Mindstorms are trademarks of The LEGO Group in the United Kingdom, and other countries.
Intel is a trademark of Intel Corporation in the United States and other countries.
[Figure 2: Transitions of a stateful resource — the state sequence … St, St+1, St+2 …]
[Figure 3: Uniform interface interaction — requests and responses (Rq, Rp) mapped through UI between the uniform interface and the resource r, addressed by rid]
may amend the value of some property, p ∈ P (a The conciseness with which one can define the
change to some entity, e, shall be denoted e0 ). The mapping, UI, is dictated by the methods chosen to
transition from St+1 to St+2 , conversely, may change represent Rqui and Rpui . This mapping must be ex-
the properties composing the resource. The uniform ecuted by an engine of some sort, and the greater the
interface should only be concerned with the commu- semantics the engine is able to derive from some given
nication of, and resulting state from the interaction interaction syntax, the more efficient the engine, and
semantics, not their semantic consequence on the rep- consequently the uniform interface. Increased effi-
resented resource. To represent interactions with a ciency of a uniform interface has the subsidiary effects
stateful resource, the uniform interface at its most fundamental should supply a set of atomic methods semantically analogous to:

Mui = {GET, SET}

These methods enable complete control over the state of some resource, r; facilitating retrieval (GET) and modification (SET). Interaction with the resource necessitates the assignment of a unique identifier, rid, such that interactions may be directed at resources using this identifier. More specific functionality can be obtained by providing specialisations of these atomic methods, or by conveying increased semantic content to these methods. This yields a trade-off between the cardinality of the interaction method set, Mui, and the syntax required by some m ∈ Mui. The most efficient solution to the general problem must resolve this trade-off in an optimal manner.

The fundamental issue to be addressed in a solution is the mapping of the uniform interface interaction to the specific resource interaction, which we define formally as:

Rqres = UI(Rqui)    (1)

Rpui = UI(Rpres)    (2)

…of reducing the burden on both the communication protocol used to transport Rqui and Rpui between the source and destination resource, and the footprint of the UI function at the interaction endpoints.

2.1 Solution in REST Style

The architecture proposed applies the Representational State Transfer (REST) architectural style, put forward by Fielding [7]. REST declares a set of constraints which focus on achieving desirable properties, such as scalability and reliability, within a distributed system. Hypertext Transfer Protocol (HTTP) [7], the archetypal application-level protocol for distributed systems, can be used to enforce the constraints of the REST architecture, facilitating stateless, self-descriptive client-server interactions. HTTP, therefore, acts as the basis for the uniform interface of the proposed solution.

The concept of a resource is central to REST, and is formally defined as a 'temporally varying membership function Mr(t) which for time t maps to a set of entities, or values, which are equivalent' [7]. A resource is conceptual and defined by its mapping to representation at some time t.
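The minimal method set Mui = {GET, SET} and the mapping of equations (1) and (2) can be sketched in a few lines. This is our illustration, not the paper's implementation; names such as `ui`, `registry` and `rid` are hypothetical.

```python
# Minimal sketch of a uniform interface M_ui = {GET, SET} over stateful
# resources, keyed by a unique identifier r_id. Illustrative names only.

class Resource:
    def __init__(self, state):
        self.state = dict(state)  # resource state S_t as property/value pairs

registry = {}  # r_id -> Resource

def ui(r_id, method, payload=None):
    """The UI function: maps a uniform-interface request (Rq_ui) onto a
    resource-specific interaction (Rq_res), and the response back again."""
    r = registry.get(r_id)
    if r is None:
        raise KeyError(f"unknown resource: {r_id}")
    if method == "GET":            # retrieval of the current state
        return dict(r.state)
    if method == "SET":            # modification of (part of) the state
        r.state.update(payload)
        return dict(r.state)
    raise ValueError(f"method not in M_ui: {method}")

registry["r1"] = Resource({"p1": "v1"})
print(ui("r1", "GET"))                    # {'p1': 'v1'}
print(ui("r1", "SET", {"p1": "v1b"}))     # {'p1': 'v1b'}
```

More specific functionality would be layered on these two atoms, which is exactly the trade-off the text describes: a larger method set buys simpler per-method syntax, and vice versa.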
This definition pertains well to our definition of a stateful resource given above; at time t, Mr(t) will map the resource r to some representation, and resource r can be said to hold some state St.

Figure 4: Proposed interface structure

The notion of a conceptual resource allows us to encapsulate the varying representations (states) of a resource into one conceptual entity, drawing many parallels with the WS-Resource construct in WSRF. For the purpose of entity identification, and indeed for the communication of much of the interaction semantics, we use a Uniform Resource Identifier (URI) [18]. At time t, Mr(t) = St and resolve(URI1) = St; accordingly, the current state representation of the resource can be accessed by resolving the identifier, URI1. The state representation is composed of all resource properties, P, exposed by a resource at some URI, and these properties may be derived from any number of stateful components behind the uniform interface, which once again is comparable with WS-Resource functionality. We define the methods of a uniform interface using HTTP as follows:

…line of the request. For syntactic and semantic simplicity we retain all standard HTTP methods shown above.

Introspection on, and amendment to, the resource state can be performed through the utilisation of basic HTTP methods and the resource URI. The resulting state of a transition can be defined formally as: St+1 = (St ∪ {(pi, vi′)}) \ {(pi, vi)}, where pi ∈ Pr. State transition is where this uniform interface greatly simplifies previous solutions.

Any of the interaction methods, m ∈ Mhttp, may include metadata in a response, which is used for the communication of additional information regarding the state. For example, included in the metadata of a GET would be the URIs for access to, and manipulation of, component properties of the state, using the standard interaction methods. Supplementary metadata may include XML Schema [19] to define the structure of interaction data. Properties are therefore treated as conceptual resources, and we can partition state into components using URIs, simplifying interactions markedly. For instance (Figure 5), if some resource has properties {p1, p2}, a GET to URI1 will return St = {(p1, v1), (p2, v2)} ∪ {URI2, URI3}. The URIs are not part of the state of resource r; they are simply metadata to communicate how component properties may be changed. Interactions can then be directed at these URIs to access or manipulate the associated property.
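The partitioning of state into URI-addressable component properties can be sketched as follows. The base URI, property names and response shape are our assumptions for illustration, not the paper's implementation: a GET on the resource URI returns the state St together with metadata URIs for each property, and a write to a property URI performs the transition S_{t+1} = (S_t ∪ {(p_i, v_i′)}) \ {(p_i, v_i)}.

```python
# Sketch: resource state partitioned into component properties, each
# addressable at its own URI. Names and URIs are illustrative.

BASE = "http://example.org/resource"   # assumed resource URI (URI_1)
state = {"p1": "v1", "p2": "v2"}       # S_t as property/value pairs

def get_resource():
    """GET on URI_1: the state plus per-property URIs. The URIs are
    metadata, not part of the state of the resource itself."""
    return {"state": dict(state),
            "properties": {p: f"{BASE}/{p}" for p in state}}

def set_property(prop, new_value):
    """Transition S_{t+1} = (S_t with (prop, old) replaced by (prop, new))."""
    state[prop] = new_value

rep = get_resource()
print(rep["properties"]["p2"])   # URI at which p2 may be accessed or changed
set_property("p2", "v2b")
print(get_resource()["state"])   # {'p1': 'v1', 'p2': 'v2b'}
```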
…introduce a new managed resource instance and be returned the URI of the newly created resource for future interactions. The state representation of the management resource would be amended accordingly. This resource notion enables the modelling of any identifiable object, whether conceptual or concrete, giving us unbounded modelling flexibility.

Rqres = HTTP(Rqhttp)    (3)

Rphttp = HTTP(Rpres)    (4)

The mapping function, HTTP, simply inspects the destination URI and HTTP method of the incoming request. If such a resource exists, and this method of interaction is supported, the HTTP request is forwarded to the resource to execute the internal semantics and respond. If the URI does not exist, or the method of interaction is not supported, the conventional HTTP response codes are used. We have thus improved the conciseness of the mapping function markedly. All necessary information for interaction is packaged into the HTTP request and URI; no additional contextual information is held on the server relating to a given sequence of interactions. Any context concerning a given (sequence of) interaction(s) is incorporated into the URI and held on the client side. This adheres to the REST constraint of stateless interactions.

The abstraction provided by the concept of a resource in REST enables functionality analogous to the WS-Resource in WSRF. The low-level composition of state is again shifted outside of the scope of the uniform interface, relieving much of the semantic and syntactic burden and enabling increasingly flexible notions to be represented as resources. By exploiting the semantics of HTTP and URI, the beneficial effects of abstraction on the uniform interface have been further developed.

There have been other attempts to utilise HTTP for the expression of resource interaction semantics [16, 3], but none have fully exploited the benefits of HTTP and URI semantics. Concentration has been on manipulation of HTTP requests and utilisation of URIs to communicate the semantics of some underlying management standard, for instance SNMP [16]. Such solutions are ineffective as they must undergo two stages of processing: from HTTP and URI to the underlying management standard, and from this standard to the resource-specific interaction. The resulting architectures therefore demonstrate many of the restrictions of the underlying management standards, albeit while utilising a different communication medium. The REST architecture we propose, through exploitation of HTTP method and URI semantics, enables semantically complete interaction with stateful resources. The abstraction provided by the resource concept enables the reduction in cardinality of the interaction method set, and the utilisation of HTTP methods to convey interaction semantics enables syntactic brevity in interaction. Instead of placing a large amount of syntactic burden on the transport protocol and mapping function to convey the semantics of the interaction, we derive all required semantics from the URI, the HTTP method, and any content associated with the request (in some standard data format).

3 Optimal resource provisioning of application hosting in a changing environment

Recent developments in distributed and grid computing have facilitated the hosting of service provisioning systems on clusters of computers. Users do not have to specify the server on which their requests (or 'jobs') are going to be executed. Rather, jobs of different types are submitted to a central dispatcher, which sends them for execution to one of the available servers. Typically, the job streams are bursty, i.e. they consist of alternating 'on' and 'off' periods during which demands of the corresponding type do and do not arrive.

In such an environment it is important, both to the users and the service provider, to have an efficient policy for allocating servers to the various job types. One may consider a static policy whereby a fixed number of servers is assigned to each job type, regardless of queue sizes or phases of arrival streams.
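A static policy of this kind can be sketched in a few lines (our illustration; job types, server names and the round-robin choice within a type's pool are made up, not the paper's experimental setup):

```python
# Sketch of a central dispatcher under a static allocation: a fixed set of
# servers is assigned to each job type, regardless of queue sizes.

from collections import defaultdict

allocation = {"type1": ["s1", "s2", "s3"],   # servers assigned to each type
              "type2": ["s4", "s5"]}
next_server = defaultdict(int)               # round-robin position per type

def dispatch(job_type):
    """Send an arriving job to one of the servers allocated to its type."""
    servers = allocation.get(job_type)
    if not servers:
        raise ValueError(f"no servers allocated to {job_type}")
    i = next_server[job_type] % len(servers)
    next_server[job_type] += 1
    return servers[i]

print([dispatch("type1") for _ in range(4)])  # ['s1', 's2', 's3', 's1']
print(dispatch("type2"))                      # 's4'
```

A dynamic policy would additionally move entries between the lists in `allocation`, at the price of a reconfiguration period during which the moved server serves no jobs.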
Alternatively, the policy may be dynamic and allow servers to be reallocated from one type of service to another when the former becomes under-subscribed and the latter over-subscribed. However, each server reconfiguration takes time, and during it the server is not available to run jobs; hence, a dynamic policy must involve a careful calculation of possible gains and losses.

The purpose of this work is to (i) provide a computational procedure for determining the optimal static allocation policy and (ii) suggest acceptable heuristic policies for dynamic server reconfiguration. In order to achieve (i), an exact solution is obtained for an isolated queue with n parallel servers and an on/off source. The dynamic heuristics are evaluated by simulation.

The problem described here has not, to our knowledge, been addressed before. Much of the server allocation literature deals with polling systems, where a single server attends to several queues [4, 5, 8, 9, 10]. Even in those cases it has been found that the presence of non-zero switching times makes the optimal policy very difficult to characterise and necessitates the consideration of heuristics. The only general result for multiple servers concerns the case of Poisson arrivals and no switching times or costs: then the cµ-rule is optimal, i.e. the best policy is to give absolute preemptive priority to the job type for which the product of holding cost and service rate is largest (Buyukkoc et al [1]).

A model similar to ours was examined by Palmer and Mitrani [12]; however, there all arrival processes were assumed to be Poisson; also, the static allocation was not done in an optimal manner. The novelty of the present study lies in the inclusion of on/off sources, the computation of the optimal static policy and the introduction of new dynamic heuristics.

3.1 The model

The system contains N servers, each of which may be allocated to the service of any of M job types. There is a separate unbounded queue for each type. Jobs of type i arrive according to an independent interrupted Poisson process with on-periods distributed exponentially with mean 1/ξi, off-periods distributed exponentially with mean 1/ηi, and arrival rate during on-periods λi (i = 1, 2, ..., M). The required service times for type i are distributed exponentially with mean 1/µi. This model is illustrated in Figure 6.

Figure 6: Heterogeneous clusters with on/off sources

Any of queue i's servers may at any time be switched to queue j; the reconfiguration period, during which the server cannot serve jobs, is distributed exponentially with mean 1/ζi,j. If a service is preempted by the switch, it is eventually resumed from the point of interruption.

The cost of keeping a type i job in the system is ci per unit time (i = 1, 2, ..., M). These 'holding' costs reflect the relative importance, or willingness to wait, of the M job types. The system performance is measured by the total average cost, C, incurred per unit time:

C = Σ_{i=1}^{M} ci Li,    (5)

where Li is the steady-state average number of type i jobs present. Those quantities depend, of course, on the server allocation policy.
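The interrupted Poisson arrival stream and the cost of equation (5) can be sketched as follows. Parameter values here are illustrative, not those of the paper's experiments; `onoff_arrivals` is our name.

```python
import random

def onoff_arrivals(lam, xi, eta, horizon, seed=1):
    """Arrival times of an interrupted Poisson process: on-periods are
    exponential with mean 1/xi and carry Poisson(lam) arrivals; off-periods
    are exponential with mean 1/eta and carry none."""
    rng = random.Random(seed)
    t, on, arrivals = 0.0, True, []
    while t < horizon:
        period = rng.expovariate(xi if on else eta)
        if on:
            a = t + rng.expovariate(lam)
            while a < min(t + period, horizon):
                arrivals.append(a)
                a += rng.expovariate(lam)
        t += period
        on = not on
    return arrivals

# Illustrative parameters: mean on- and off-periods of 50, lambda = 15,
# so the source is on roughly half the time.
arr = onoff_arrivals(lam=15.0, xi=1 / 50, eta=1 / 50, horizon=1000.0)
print(len(arr))  # roughly (lam / 2) * horizon arrivals

# Total average cost of equation (5), C = sum_i c_i * L_i, with the paper's
# holding costs c1 = 1, c2 = 2 and made-up mean queue lengths L_i:
c, L = [1, 2], [3, 2]
C = sum(ci * Li for ci, Li in zip(c, L))
print(C)  # 7
```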
In principle, it is possible to compute the optimal dynamic switching policy by treating the model as a Markov decision process and solving the corresponding dynamic programming equations. However, such a computation is tractable only for very small systems. What makes the problem difficult is the size of the state space one has to deal with. The system state at any point in time is described by a quadruple, S = (j, n, u, m), where j is a vector whose ith … The computational complexity of that task grows very quickly with the number of queues, M, the number of servers, N, and the truncation level. For that reason, we have concentrated on determining the optimal static allocation policy (which does not involve switching) and comparing its performance with that of some dynamic heuristics.

3.2 Initial results

Figure 7: Total average cost, over a period of T = 10000, of a system with 2 job types and 20 servers

Figure 7 shows the total (i.e. for both job types combined) average (i.e. over time) cost, over a period of T = 10000, of a system with 2 job types and 20 servers. Both job types are on for half the time, but one of them has a cycle lasting (in mean) 50 and the other 100. The holding cost for the first job type is 1; for the second it is 2. In all other respects they are identical, and the rate at which jobs are completed per server is 1. The x-axis represents increased load, with the rate of job arrival increasing from 15 to 19.5. This means the system is only stable on average. The switching time and cost are set to be very small in this case. Heuristic 1 makes switching decisions based on a fluid cost approximation until the next time the queue empties, the assumption being that job types that are on remain on, and vice versa. Heuristic 2 is the same as above, but uses the average load to phase the on/off periods out of the model.

Because of time constraints, these are averages over a small number of simulations. This explains some big variations, especially in the static cases. It is probably not a coincidence that the dynamic simulations seem to fluctuate much less: their very nature makes them more capable of dealing with more extreme events. Note that the dynamic policies are a big improvement over the static ones studied previously [12].

4 Conclusions

In this paper we have presented some preliminary results concerning issues relating to dynamic deployment within commercial hosting environments. These results show the suitability of using the REST architectural style for developing an open software architecture for dynamic operating policies. In addition, we have shown some initial results from modelling experiments using stochastic modelling techniques to derive improved policy decisions. These results show that where the requests for a service fluctuate, taking this additional information into account when assigning servers can have a significant benefit.

A substantial amount of further work remains to be done within the DOPCHE project in developing optimal and heuristic policies and in developing the
architectural support to implement these policies in practice.

References

[1] C. Buyukkoc, P. Varaiya and J. Walrand, "The cµ-rule revisited", Advances in Applied Probability, 17, pp 237-238, 1985.

[2] K. Czajkowski, I. Foster, C. Kesselman, V. Sander and S. Tuecke, "SNAP: A Protocol for Negotiating Service Level Agreements and Coordinating Resource Management in Distributed Systems", 8th International Workshop on Job Scheduling Strategies for Parallel Processing, LNCS 2537, pp 153-183, Springer-Verlag, 2002.

[3] L. Deri, "Surfing Network Resources across the Web", in Proceedings of 2nd IEEE Workshop on Systems Management, 1996, pp 158-167.

[4] I. Duenyas and M.P. Van Oyen, "Heuristic Scheduling of Parallel Heterogeneous Queues with Set-Ups", Technical Report 92-60, Department of Industrial and Operations Engineering, University of Michigan, 1992.

[5] I. Duenyas and M.P. Van Oyen, "Stochastic Scheduling of Parallel Queues with Set-Up Costs", Queueing Systems Theory and Applications, 19, pp 421-444, 1995.

[6] I. Foster, C. Kesselman and S. Tuecke, "The Anatomy of the Grid: Enabling Scalable Virtual Organizations", International Journal of High Performance Computing Applications, 15(3), 2001, pp 200-222.

[7] R.T. Fielding, Architectural Styles and the Design of Network-based Software Architectures, PhD thesis, Department of Information and Computer Science, University of California, Irvine, 2000.

[8] G. Koole, "Assigning a Single Server to Inhomogeneous Queues with Switching Costs", Theoretical Computer Science, 182, pp 203-216, 1997.

[9] G. Koole, "Structural Results for the Control of Queueing Systems using Event-Based Dynamic Programming", Queueing Systems Theory and Applications, 30, pp 323-339, 1998.

[10] Z. Liu, P. Nain and D. Towsley, "On Optimal Polling Policies", Queueing Systems Theory and Applications, 11, pp 59-83, 1992.

[11] H. Ludwig, A. Keller, A. Dan and R. King, "A Service Level Agreement Language for Dynamic Electronic Services", 4th IEEE International Workshop on Advanced Issues of E-Commerce and Web-based Information Systems, IEEE Computer Society Press, 2002.

[12] J. Palmer and I. Mitrani, "Optimal Server Allocation in Reconfigurable Clusters with Multiple Job Types", Journal of Parallel and Distributed Computing, 65/10, pp 1204-1211, 2005.

[13] J. Schopf, M. Baker, F. Berman, J. Dongarra, I. Foster, B. Gropp and T. Hey, "Grid Performance Workshop Report", whitepaper from the International Grid Performance Workshop, May 12-13, UCL, London, 2004.

[14] E. de Souza e Silva and H.R. Gail, "The Uniformization Method in Performability Analysis", in Performability Modelling (eds B.R. Haverkort, R. Marie, G. Rubino and K. Trivedi), Wiley, 2001.

[15] H.C. Tijms, Stochastic Models, Wiley, New York, 1994.

[16] C. Tsai and R. Chang, "SNMP through WWW", International Journal of Network Management, 8(1), 1998, pp 104-119.

[17] A. van Moorsel, "Grid, Management and Self-Management", The Computer Journal, 48(3), 2005, pp 325-332.

[18] W3C, Uniform Resource Identifier (URI): Generic Syntax, ed. T. Berners-Lee, R.T. Fielding and L. Masinter, 1998.

[19] W3C, XML Schema, ed. H. Thompson, D. Beech et al., 2004.
is also available from each component of the lab automation equipment.

The robot components are controlled and co-ordinated by software. The whole system is continually supplied with descriptions of experiments that have been designed to test hypotheses created by AI software. The system should run continuously, 24 hours a day, 7 days a week.

Any failure of this system needs to be detected as early as possible, so that action can be taken and the equipment can be used to its maximum potential. A single fault can cause the entire system to be unusable, and all experiments currently running (which may have been running for days) may need to be abandoned. Faults can range from an obvious show-stopping breakdown to a slight variation of some condition, which could still have a disastrous effect on the results of experiments.

2.2 Existing e-Science labs and failure detection systems

Very little laboratory-based and experimental equipment is actually Grid-enabled yet. The Grid is still an evolving concept, and toolkits such as Globus have not yet matured enough for most industrial use.

Ko et al [4] describe a laboratory with a web-based interface, designed to be accessed remotely by students to run coupled tank control experiments. The system provides feedback to the students by video, audio and data, such as plots of response curves. The students can access the remote laboratory 24 hours a day. The potential of remote biology labs used for education is explored further by Che [1].

NEESgrid is a large grid-based system for earthquake simulation and experimentation. Laboratories are linked via grid infrastructure to compute resources, data resources and equipment (http://it.nees.org/). Various instruments and sensors can be monitored and viewed remotely, and NEESgrid provides teleobservation and telepresence via video streams, data streams and still images.

The DAME project is a major e-Science pilot project providing a distributed decision support system for aircraft maintenance. They proposed a Grid-based framework for fault diagnosis and an implementation for gas turbine engines [5]. Their work is perhaps the most similar to ours, though for a very different application, and used Globus Toolkit version 3 to provide web service interfaces. They note the benefits of using a loosely coupled service-oriented architecture to implement a variety of fault diagnostic services. An engine simulation service and case-based reasoning services are provided.

Otherwise, fault detection in Grid-based systems has so far been mostly targeted towards analysis of network faults, and detection of problems in compute clusters. Tools for monitoring and fault detection exist, but are currently mostly restricted to monitoring network traffic and general performance of Grid services (response time, availability, etc). Globus has a Monitoring and Discovery System (MDS) component that allows querying and subscription to published data. It is intended to be an interface to other monitoring systems, and to allow a standard interface to the data. Other general grid service monitoring tools include Gridmon1, Network Weather Service2, Netlogger3, Inca4 and Active Harmony5. An annual Grid Performance Workshop6 reflects current research in these systems.

Our laboratory automation system is complex and contains many potential sources of failure, including human error, hardware error and experimental error. In the next section we describe these and the methods of failure detection in more detail.

3 System requirements

3.1 Sources of information

The primary sources of information for use in detecting failure will be:

Equipment metadata logs: Each piece of equipment logs information such as event timings and internal settings and variation.

Experimental data/observations: The optical density readings of the yeast can show that something is wrong. For example, an oscillating pattern of readings for a microtitre plate led us to discover that the two incubators between which the plate was being cycled were set to agitate at different speeds.

Experimental expectations: The Robot Scientist work is perhaps unique in having the whole process automated, so we can make use of the fact that the expected result of every experiment is automatically available for comparison to the actual result. Of course, an unexpected result may also be due to the discovery of new scientific knowledge.

1 Gridmon: http://www.gridmon.dl.ac.uk/
2 NWS: http://nws.cs.ucsb.edu/
3 Netlogger: http://www-didc.lbl.gov/NetLogger/
4 Inca: http://inca.sdsc.edu/
5 Active Harmony: http://www.dyninst.org/harmony/
6 Grid Performance Workshop: http://www-unix.mcs.
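The oscillating optical-density pattern mentioned above suggests a simple check a monitoring agent could run. This sketch is our illustration, not the project's code; the function name and threshold are assumptions:

```python
def oscillating(readings, min_alternations=4):
    """Flag a reading series whose successive differences keep alternating
    in sign, e.g. optical-density readings from a plate cycled between two
    incubators agitating at different speeds."""
    diffs = [b - a for a, b in zip(readings, readings[1:])]
    signs = [1 if d > 0 else -1 for d in diffs if d != 0]
    alternations = sum(1 for s, t in zip(signs, signs[1:]) if s != t)
    return alternations >= min_alternations

print(oscillating([0.30, 0.42, 0.31, 0.44, 0.33, 0.45]))  # True
print(oscillating([0.30, 0.33, 0.37, 0.41, 0.46, 0.52]))  # False: steady growth
```

A real agent would of course combine such checks with the equipment metadata logs and expected-result comparisons listed above.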
4.3 Abstract Monitoring Agents

Example implementations of the abstract monitor are:

• The incubator monitor, which monitors temperature, humidity, O2 and CO2 levels of three incubators. This reports to the dispatch broker when thresholds are not met. The yeast must be grown in a stable, controlled and measurable environment. Only with a wide array of different environmental state monitors will it be possible to ensure all experiments were done in a similar environment.

• A host monitor, which simply monitors the main control PC running the robot. The host monitor's job is to ensure that all control PCs are up and running at all times and to report this as a high alert to the dispatch broker. When PCs fail this will waste experimental time and most likely ruin current experiments.

Monitoring Agents are separated from the business of making decisions about whom to report to, or how to report, by the dispatch broker. This allows a Monitoring Agent to focus just on the detail of monitoring. It also allows Monitoring Agents to be composed of other Monitoring Agents in a hierarchical manner. A more complex Monitor that relies on several aspects of the system can be built using the results of several low-level Monitoring Agents, hence avoiding time-consuming repetition of low-level tests.

Parameters that need monitoring come from many different systems, such as databases, file-based data logs, computational tasks or querying serial or USB interfaces to discover continuous parameters, and these may not be in close quarters; it is therefore essential to have a loose coupling for agents.

4.4 Dispatch Broker

The dispatch broker receives reports from any agent and notifies all outbound agents that a new report has been delivered. This ensures the loose coupling between detecting and reporting.

4.5 Selective Reporting

This allows users to select which reports they wish to receive.

This is to ensure that information passed on is relevant in the context of the user it is sent to. On such a large system as the Robot Scientist, researchers with completely different backgrounds will be interested in different reports. Biologists may not want to concern themselves with computer hardware faults, and computer science researchers may not want to concern themselves with warnings about low liquid levels. This uses a relational database where the relations between researchers and monitors are stored.

5 Discussion

Detecting and recording errors and failures is essential for any highly complex system of many different parts. Even fully automated biological experiments are prone to noise, and careful monitoring of conditions and experimental metadata is crucial for correct interpretation of results. We need to be able to detect problems as they occur, rather than days later when valuable experimental time and resources have been wasted.

The failure detection system that has been developed is general enough to be applicable to most laboratory automation environments. It uses Globus/Web Services to provide a loose coupling of components. It is in place on the Robot Scientist with the most immediately essential monitoring agents, and we now need to extend its capabilities with a wide variety of more complex monitoring agents. More intelligent agents will provide diagnosis as well as detection, and intelligent, user-friendly diagnosis of biological problems will be an interesting area of research in its own right.

Acknowledgements

The authors would like to thank Dr Andrew Sparkes for valuable discussions.

References

[1] A. Che. Remote biology labs. In E-ducation Without Borders, 2005.

[2] I. Foster. Globus toolkit version 4: Software for service-oriented systems. In IFIP International Conference on Network and Parallel Computing, pages 2-13. Springer-Verlag LNCS 3779, 2005.

[3] R. D. King, K. E. Whelan, F. M. Jones, P. G. K. Reiser, C. H. Bryant, S. Muggleton, D. B. Kell, and S. G. Oliver. Functional genomic hypothesis generation and experimentation by a robot scientist. Nature, 427(6971):247-252, 2004.

[4] C. C. Ko, B. M. Chen, J. Chen, Y. Zhuang, and K. C. Tan. Development of a web-based laboratory for control experiments on a coupled tank apparatus. IEEE Transactions on Education, 44(1), 2001.

[5] X. Ren, M. Ong, G. Allan, V. Kadirkamanathan, H. A. Thompson, and P. J. Fleming. Service oriented architecture on the Grid for FDI integration. In Proc 3rd UK e-Science All Hands Meeting (AHM 2004), 2004.
Abstract
We present a novel approach to describe and reason about stateless information processing
Semantic Web Services. It can be seen as an extension of standard descriptions which
makes explicit the relationship between inputs and outputs and takes into account OWL
ontologies to fix the meaning of the terms used in a service description. This allows us
to define a notion of matching between services that yields high precision and recall for
service location. We explain why matching of these kinds of services is decidable, and
provide examples of biomedical web services to illustrate the utility of our approach.
formatics, see Section 3—it was observed that the notion of conjunctive query can be adopted for these purposes. Before introducing this "services as queries" approach formally, let us illustrate it with a simple example from [6] (realistic examples from the bio-informatics domain are given in Section 3).

Let S1 and S2 be services both having an input of type GeoRegion and an output of type Wine, and suppose that S1 (resp., S2) returns the list of wines that are produced (resp., sold) in the region with which it was called. If the types of inputs and outputs are the only information available to match a request to a service, then no matching algorithm can distinguish between S1 and S2, and thus matching cannot be precise—see (1) above.

Next, assume that a user requests a service Q that takes a FrenchGeoRegion as input and returns the list of FrenchWines that are produced in this region. Even though the service S1 returns, in general, wines that may not be FrenchWines, it returns only FrenchWines when called with a FrenchGeoRegion, and thus should be matched to this request. A matching algorithm that does so has a high recall—see (2) above.

More formally, when run with a GeoRegion g, the service S1 returns all those wines w for which there exists some winegrower f who produces w and who is located in g. In our framework, this service can be described as follows:

INPUT g:GeoRegion
OUTPUT w:Wine
THERE IS SOME f [ WineGrower(f), LocatedIn(f,g), Produces(f,w) ]

The terms Wine, LocatedIn, etc., are defined in some ontology. In contrast, the service S2 returns all those wines w for which there exists some shop s which sells w and which is located in g, and can thus be described as follows:

INPUT g:GeoRegion
OUTPUT w:Wine
THERE IS SOME s [ Shop(s), LocatedIn(s,g), Sells(s,w) ]

In what follows we show that matching service descriptions of this kind is reducible to query containment w.r.t. an ontology—a task whose decidability and complexity is relatively well understood.

2.1 Describing services

We assume the reader to be familiar with OWL-DL and its semantics [11]. Throughout this paper, we borrow the term TBox for a class-level ontology (i.e., a finite set of OWL-DL axioms) and ABox for a factual ontology (i.e., a finite set of OWL-DL facts). The union of a TBox T and an ABox A is called a knowledge base and denoted by KB = ⟨T, A⟩. We use KB |= Ψ to denote the fact that KB implies Ψ, i.e., Ψ holds in every interpretation that satisfies KB.

Definition 1 (Service syntax). A service description S = ⟨~x : ~X; ~y : ~Y; Φ(~x, ~y, ~z)⟩ consists of

• a list ~x : ~X = ⟨x1 : X1, ..., xm : Xm⟩ of pairs of variables xi and classes Xi; this list enumerates input variables and their "types";

• a list ~y : ~Y = ⟨y1 : Y1, ..., yn : Yn⟩ of pairs of variables yj and classes Yj; this list enumerates output variables and their "types";

• a relationship specification Φ of the form term1(~x, ~y, ~z) ∧ ... ∧ termk(~x, ~y, ~z), where each termi(~x, ~y, ~z) is either an expression of the form w : C with C a class, or w R w′ with R a property, and w, w′ variables from ~x, ~y, ~z or individual names.

Definition 2 (Service semantics). A service s implements a service description S over a TBox T if, for any ABox A and any tuple of individuals ~a in ⟨T, A⟩, if T, A |= ~a : ~X, then

1. s accepts ~a as input, and

2. when run with ~a as input, it returns the set of all those tuples of individuals ~b from A such that T, A |= ~b : ~Y ∧ ∃~z Φ(~a, ~b, ~z).

Intuitively, this means that the service s must accept all tuples ~a that conform to the input type ~X declared in S, and return as its output the set of those instances ~b of the output type ~Y that bear the relationship Φ to the input tuple ~a.

2.2 Matching services

Matching is the problem of determining whether a given service description S conforms to another service description Q. Matching algorithms can be used for service discovery purposes, and we can think of S as being a service advertisement and of Q as being a service requested by a user. Let us first give a formal definition and then provide explanations. Here we use |~x| to denote the length of a vector ~x.

Definition 3. Given two service descriptions:

S = ⟨~x : ~X; ~y : ~Y; Φ(~x, ~y, ~u)⟩,
Q = ⟨~z : ~Z; ~w : ~W; Ψ(~z, ~w, ~v)⟩,
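The wine example can be rendered as data to show concretely why type-only matching is imprecise. This sketch is our illustration (the class names follow the example above; the data structure and matcher are not the authors' implementation):

```python
# Service descriptions as data: matching on declared input/output classes
# alone cannot distinguish S1 (produced-in) from S2 (sold-in), because both
# map a GeoRegion to Wines; only the relationship specification Phi differs.

from dataclasses import dataclass

@dataclass(frozen=True)
class ServiceDescription:
    inputs: tuple      # pairs (variable, class): the list  ~x : ~X
    outputs: tuple     # pairs (variable, class): the list  ~y : ~Y
    atoms: tuple       # conjuncts of the relationship specification Phi

S1 = ServiceDescription(
    inputs=(("g", "GeoRegion"),), outputs=(("w", "Wine"),),
    atoms=("WineGrower(f)", "LocatedIn(f,g)", "Produces(f,w)"))

S2 = ServiceDescription(
    inputs=(("g", "GeoRegion"),), outputs=(("w", "Wine"),),
    atoms=("Shop(s)", "LocatedIn(s,g)", "Sells(s,w)"))

def same_signature(a, b):
    """Type-only matching: compare the declared input/output classes."""
    types = lambda pairs: tuple(c for _, c in pairs)
    return types(a.inputs) == types(b.inputs) and \
           types(a.outputs) == types(b.outputs)

print(same_signature(S1, S2))   # True: signatures cannot separate them
print(S1.atoms == S2.atoms)     # False: the relationship Phi can
```

The matching defined in the paper goes further: it reduces the comparison of the Φ specifications to query containment with respect to the ontology, rather than comparing atoms syntactically as here.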
Abstract
A Discovery Net project that has been investigating the relationship between macro and micro-scale
earthquake deformational processes has successfully developed and tested a geoinformatics infrastructure
for linking computationally intensive earthquake monitoring and modelling. Measurement of lateral co-
seismic deformation is carried out with imageodesy algorithms running on servers at the London eScience
Centre. The resultant deformation field is used to initialise geo-mechanical simulations of the earthquake
deformation running on supercomputers based at the University of Oklahoma. The paper describes details of this project, initial results of testing the workflow, and follow-on research and development.
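The imageodesy measurement mentioned in this abstract is, at heart, a search for the image offset that maximises a normalised cross-correlation score between pre- and post-event image patches. A minimal NumPy sketch of that idea (integer shifts only; the project's sub-pixel FNCC algorithms and their FFT acceleration are not reproduced here):

```python
import numpy as np

def ncc_shift(pre_patch, post_window):
    """Find the integer (row, col) offset of pre_patch inside post_window
    that maximises the normalised cross-correlation coefficient."""
    th, tw = pre_patch.shape
    t = pre_patch - pre_patch.mean()
    t_norm = np.sqrt((t ** 2).sum())
    best_score, best_pos = -2.0, (0, 0)
    rows, cols = post_window.shape
    for y in range(rows - th + 1):
        for x in range(cols - tw + 1):
            w = post_window[y:y + th, x:x + tw]
            w = w - w.mean()
            denom = t_norm * np.sqrt((w ** 2).sum())
            if denom == 0.0:
                continue  # flat patch: correlation undefined
            score = float((t * w).sum() / denom)
            if score > best_score:
                best_score, best_pos = score, (y, x)
    return best_pos, best_score

# Demo: recover a known offset from synthetic data.
rng = np.random.default_rng(0)
post = rng.standard_normal((32, 32))
pre = post[5:15, 7:17].copy()  # the "pre-event" patch sits at row 5, col 7
offset, peak = ncc_shift(pre, post)  # offset == (5, 7), peak close to 1.0
```

Applied per pixel over full satellite scenes, searches of this kind are what make the measurement compute-intensive enough to need the HPC resources described below.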
measuring co-seismic ground movement (e.g. InSAR, GPS).
Finite Element Modelling (FEM) is an example of micro analysis conducted within earthquake studies. FEM is a well-established technique for determining stresses and displacements in objects and systems. In the field of earthquake engineering, FEM is used extensively by structural and geotechnical engineers to investigate the effects of earthquakes on infrastructure such as buildings, bridges, and dams. In addition, FEM-based simulations are used in the fields of structural geology and neo-tectonics to model transient stress and deformation due to tectonics. Applying FEM to model remote sensing imageodesy results is a promising approach to investigate the stress field and faulting mechanism of macro-scale earthquakes through micro-scale computing simulation.
Imageodesy and FEM techniques are both compute-intensive applications that require access to specialized software components executing on high performance computing facilities. The aim of the work presented here is to enable users to integrate the use of these techniques within the same earthquake study.
The work undertaken in this project is built around the analysis of data from the Ms 8.1 Kunlun earthquake, which occurred in northern Tibet in November 2001. This event was the largest earthquake in China for 50 years and produced a massive surface rupture zone of over 450 km, making it the longest rupture of an earthquake on land ever identified. The magnitude and nature of the earthquake, along with the large area that was affected, provide the perfect natural laboratory in which to investigate earthquake dynamics and test the novel geoinformatics approach that has been developed during this research project.

3. Implementation

The implementation of the geoinformatics workflow has been based on the Discovery Net informatics infrastructure [4]. Discovery Net is a grid-enabled software analytics platform developed at Imperial College London, funded through a UK e-Science Pilot Project and extended within this work. The platform is based on a service-based computing infrastructure for high throughput informatics that supports the integration and analysis of data collected from various high throughput devices.
Within Discovery Net, end-user applications are constructed within a visual authoring environment as workflows that represent the steps required for executing a particular distributed computation, and the flow of information between these tasks. Within each workflow, analysis components are treated as services or black boxes with known input and output interfaces described using web service protocols. These services can execute either on the user's own machine or make use of high performance computing resources through specialised implementations. Also, Discovery Net workflows can be published through web portals and Web/Grid services, allowing users to execute distributed computations interactively and to analyse the results of their execution using a variety of visualisation tools.
As part of the M2M project, the Discovery Net infrastructure has been extended to enable integration of the domain-specific measurement and modelling services based at Imperial College and the University of Oklahoma. Processing routines, developed within Matlab, have been integrated into this workflow to enable post-processing of imageodesy output and preparation of input data for the finite element analysis software. When packaged together, this workflow provides:
• Seamless integration between measurement and modelling services.
• One-click execution with minimal user input (e.g. to define material parameters).
• Packaging of the workflow as an open web-service that has the potential to be used by the earth science community.

Figure 1: Imageodesy refinement workflow

Figure 2: W-E shear strain (left) and N-S shear strain (right) during the Kunlun earthquake predicted from running finite element analysis on low-resolution imageodesy-derived co-seismic measurements using a simple 2D homogeneous elastic model. The predicted N-S shear strain field is particularly interesting as it highlights the predominantly left lateral faulting in the area.

The methodology that has been developed as part of the M2M project uses regional measurements of co-seismic deformation to directly run high-resolution geo-mechanical simulations of the earthquake deformation based on the Finite Element Method (FEM). Macro-scale measurements of surface deformation are retrieved using sub-pixel FNCC (Fast Normalised Cross-Correlation) imageodesy algorithms that have been developed [2] and refined [3] by the geo-application research team of Discovery Net. These algorithms measure the lateral co-seismic deformation (horizontal image feature shift along a fault line) associated with an earthquake from pre- and post-event satellite images. The algorithms are computationally intensive and are implemented on supercomputers based at the London eScience Centre. The output deformation field produced by this software provides the input into a Finite Element analysis system, which is running on servers based at the University of
Oklahoma. Discovery Net is used to orchestrate the domain-specific tasks (Figure 1).
The finite-element segment of the analysis was implemented on high-performance servers based at the University of Oklahoma. The TeraScale computational architecture [8] was used as the basis for construction of a finite-element application capable of ingesting macro-scale imageodesy measurements. The TeraScale framework was employed for this task as it provides the tools for building feature-rich finite-element applications that are easily interoperable and readily scaled to the available or required computing resources. The finite-element application built during this research project enables automatic mesh creation and the initialisation of model runs using the kinematic data as boundary conditions within a full-physics geo-mechanical simulation that includes the effects of momentum conservation and material characterisation. The resultant finite-element model can be used to estimate the residual stress distribution induced (or relieved) in the seismic region. In addition, macro-to-micro scale analysis is achieved by using the macro-scale kinematic measurements to define boundary conditions for higher resolution finite element models. In effect, this means that measured displacements define the boundary conditions at the edges of high-resolution finite element meshes.

4. Results

As part of the initial experiment, the Discovery Net geo-application research team has developed and applied imageodesy algorithms to cross-event Landsat 7 ETM+ satellite imagery for the Kunlun earthquake [2].
Due to the large size of the input datasets and the demanding nature of the algorithms, implementation was carried out on an MPI parallel computer within the London eScience Centre. The results have revealed stunning patterns of co-seismic left-lateral displacement along the Kunlun fault and presented the first 2-D measurement of regional deformation associated with this event [6]. In addition, interesting patterns of deformation south of the main fault have led to the discovery of previously unreported surface rupturing south of the main Kunlun fault [7].
We have further refined the imageodesy algorithms developed as part of this pilot project [2] to include data refinement and post-processing algorithms [3] that improve the quality of the imageodesy output by removing systematic noise. The images shown in Figure 3 show the full width of the central part of the original scene. In these images, blue to green indicates shift to the left and yellow to red indicates shift to the right. The red arrows in the filtering-refined image indicate the Kunlun fault zone along which the massive earthquake occurred. As a result, the terrain block south of the fault moved toward the east (right). The vertical noise is very visible after the horizontal shift has been removed, while in the result produced by FFT selective and adaptive filtering, the horizontal striping and the multiple-frequency wavy patterns of vertical stripes have also been successfully removed. This clearly reveals the phenomenon of regional co-seismic shift along the Kunlun fault line, as indicated by the three arrows.
The workflow and software we have developed have been successfully tested using low-resolution imageodesy measurements to run a simple homogeneous elastic model of the seismic region (Figure 2). The macro-scale results highlight the predominantly left-lateral deformation (which ranges from 1.5-8.1 m) during the Kunlun earthquake in the modelled N-S shear strain field. In addition, this result indicates the presence of unexpected but significant shear features in the northwest of the study area that are currently under further investigation.

5. Conclusions

The applications research conducted in this project has successfully developed and tested an analytical system for linking kinematic measurements of earthquake co-seismic deformation with geo-mechanical modelling to enable investigation of the dynamics of earthquake crustal deformation at macro-to-micro scales. This approach is both novel in its use of distributed high
performance computing and in its application of advanced analytical and modelling tools to real-world problems. With respect to the geoinformatics work involved, a number of discoveries have been made:
• Novel patterns of co-seismic left-lateral displacement
• The first 2D measurement of the Kunlun earthquake
• Novel rupturing south of the main fault
• Unexpected shear northwest of the main area.

Figure 3: Filtering image noise. The original X-shift image on top is badly contaminated by stripes revealing the compensation errors of two-way cross-track scanning. Using novel frequency selective and adaptive FFT filters this has been effectively removed without subduing the lateral shift information of co-seismic deformation.

One of the most important considerations when developing a valid geo-mechanical simulation is the ability to properly characterise the material properties of the region under investigation. Conventional approaches to the coupling of remote sensing measurement with physical models have been based on analytical methods that assume homogeneous geological properties for the region under investigation [1]. Our FEM approach is significantly more flexible than this and permits the construction of models containing a variety of different materials (i.e. rock types). This gives FEM the ability to construct a more realistic and heterogeneous material representation of the area under simulation. This ability is of particular importance when it comes to understanding the relationship between macro- and micro-scale deformational phenomena, where differences in geological properties can be critical to the partitioning of stress and strain.
Having successfully tested our analytical workflow, our research is now focused on improving the modelling resolution and complexity in order that macro-to-micro-scale analysis of the Kunlun earthquake can be undertaken. The approach that we have adopted will seek to incrementally increase the modelling resolution, improve how the models are parameterised (e.g. with realistic geological properties for the region), and strive to simulate the full range of physical processes involved with earthquake deformation.
Following on from our initial low-resolution, 2D model runs we intend to run much higher resolution 2D analyses before ultimately undertaking 3D simulation of the seismic region. Parameterisation of these models will be undertaken with a more realistic characterisation of the material properties of the region. Due to the flexibility and interoperability of our finite element application, the implementation of these increasingly complex simulations will be straightforward and will allow us to concentrate on interpreting the results.

References

[1] Okada, Y., 1985. Surface deformation due to shear and tensile faults in a half-space. Bulletin of the Seismological Society of America, v. 75, no. 4, p. 1135-1154.
[2] J. G. Liu and J. Ma. Imageodesy on MPI & grid for co-seismic shift study using satellite imagery. In Proceedings of the 3rd UK e-Science All-hands Conference AHM 2004, pages 232-240, Nottingham, UK, September 2004.
[3] J. G. Liu and G. L. K. Morgan, "FFT Selective and Adaptive Filtering for Removal of Systematic Noise in ETM+ Imageodesy Images", accepted by the IEEE Transactions on Geoscience and Remote Sensing, 2006.
[4] S. Al Sairafi, F. S. Emmanouil, M. Ghanem, N. Giannadakis, Y. Guo, D. Kalaitzopolous, M. Osmond, A. Rowe, J. Syed and P. Wendel. The Design of Discovery Net: Towards Open Grid Services for Knowledge Discovery. International Journal of High Performance Computing Applications, Vol. 17, Issue 3, 2003.
[5] Michel, R., and J. P. Avouac. Imaging Co-Seismic Fault Zone Deformation from Air Photos: The Kickapoo Stepover along the Surface Ruptures of the 1992 Landers Earthquake. J. Geophys. Res., 2006.
[6] Liu, J. G., Mason, P. J. and Ma, J., 2006. Measurement of the left-lateral displacement of Ms 8.1 Kunlun earthquake on 14th November 2001 using Landsat-7 ETM+ imagery. International Journal of Remote Sensing, 27, No. 10, 1875-1891.
[7] J. G. Liu, C. Haselwimmer, 2006. Co-seismic ruptures found up to 60 km south of the Kunlun fault after 14 Nov 2001, Ms 8.1, Kokoxili earthquake using Landsat-7 ETM+ imagery. International Journal of Remote Sensing, in press.
[8] Terascale Frameworks white paper, 2003. Available at http://nees.ou.edu/frameworks.pdf. Accessed on 4th May, 2006.
Abstract
access whilst also providing tools and applications for research. AstroGrid (AG) is now at a stage where it is fully developing its tools and applications that will be able to achieve useful scientific results. It is currently the only VO project to utilise a workflow system. The workflow works like a basic program where steps are executed in a logical manner. The workflow can be complex, with a number of applications involved, all configured into default runtime configuration parameters.

2. Science Outline

The motivation for this 'Spectral Energy Distribution (SED) Service' is a result of recent (i.e. the last ten years) studies of high redshift galaxies (e.g. Stanway et al. 2004, Bunker et al. 2004 and Eyles et al. 2005) discovered with the method pioneered by Steidel et al. (1995) used to identify galaxies above z > 2. The 'SED Service', simply put, is a matching technique that takes observational flux measurements of objects in different wave bands and matches them to model spectral energy distributions (SEDs) created from model codes. This process allows the user to derive physical parameters such as SFHs, stellar masses and ages from the models, which can be used for analysis of the sample of objects in that given field. The technique is fast becoming a useful way to study large samples of objects at these high redshifts. Thus, creating an automated technique to study these samples of objects is very useful, especially now that VO tools are in a position to support scientific analysis.

2.2. Comparison with models

The information obtained from object photometry is greatly increased by matching the observational spectral energy distributions (SEDs) to those of synthetically constructed galaxy spectra. As an example we show the use of the Bruzual and Charlot models of 2003 (Bruzual & Charlot 2003), although other models, including PEGASE (Fioc & Volmerange 1997) and Starburst99 (Leitherer et al. 1999), will shortly also be integrated. Figure 1 shows a SED match to one of the objects in the GOODS3 field. The fitting utilises the χ2 minimisation technique to find a best fit model to the data.

Figure 1: The best fit GALAXEV model for our example object, with an exponentially decaying SFH with the chosen e-folding timescale tau = 200 Myr and with an age of 800 Myr.
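The χ2 minimisation behind such SED fitting can be sketched as follows. This is an illustrative toy, not AstroGrid code: `fit_sed` and the demo band fluxes are invented names and numbers, assuming NumPy, and the analytic solution for the best-fit flux normalisation (the free scale that maps to a stellar mass) is the standard one:

```python
import numpy as np

def fit_sed(obs_flux, obs_err, model_fluxes):
    """Chi-squared matching of observed band fluxes against a grid of
    model SEDs, solving analytically for each model's best-fit flux
    normalisation a.  Returns (chi2, model index, normalisation)."""
    f = np.asarray(obs_flux, dtype=float)
    w = 1.0 / np.asarray(obs_err, dtype=float) ** 2
    best = (np.inf, -1, 0.0)
    for i, model in enumerate(model_fluxes):
        m = np.asarray(model, dtype=float)
        # Setting d(chi2)/da = 0 gives a = sum(w f m) / sum(w m^2).
        a = (w * f * m).sum() / (w * m * m).sum()
        chi2 = (w * (f - a * m) ** 2).sum()
        if chi2 < best[0]:
            best = (chi2, i, a)
    return best

# Demo with made-up fluxes: the observations are exactly twice model 1,
# so it wins with normalisation 2 and chi-squared of essentially zero.
models = [[1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 0.5, 0.25]]
chi2, idx, norm = fit_sed([4.0, 2.0, 1.0, 0.5], [0.1] * 4, models)
```

A real fit loops over a grid of model ages and star formation histories rather than two hand-written vectors, but the minimisation step is the same.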
v091.pdf
4 http://www.ivoa.net/Documents/Notes/CEA/CEADesignIVOANote-20050513.html
9 http://www.ivoa.net/Documents/WD/ADQL/ADQL-20050624.pdf
References

[1] Bertin E., Arnouts S., 1996, A&AS, 117, 393
[2] Bolzonella M., Miralles J.M., Pelló R., 2000, A&A, 363, 476
[3] Bruzual G., Charlot S., 2003, MNRAS, 344, 1000
[4] Bunker A.J. et al., 2004, MNRAS, 355, 374
[5] Eyles L.P. et al., 2005, MNRAS, 364, 443
[6] Fioc M., Volmerange B.R., 1997, A&A, 326, 950
[7] Guy Rixon, private communication, 2005
[8] Harrison et al., 2005, ADASS XIV, ASP Conference Series, 347, 291
[9] Leitherer C. et al., 1999, ApJS, 123, 3
[10] Padovani P., Allen M.G., Rosati P., Walton N.A., 2004, A&A, 424, 545
[11] Stanway E.R. et al., 2004, ApJ, 607, 704
[12] Steidel C.C., Pettini M., Hamilton D., 1995, AJ, 110, 2519
[13] York D.G. et al., 2000, AJ, 120, 1579

Figure 3: The components that are engaged for the SExtractor application. A similar process is invoked when running other applications.
{pt,jxu}@comp.leeds.ac.uk
Abstract
A promising means of attaining dependability improvements within service-oriented Grid environments is
through the set of methods comprising design-diversity software fault-tolerance. However, applying such
methods in a Grid environment requires the resolution of a new problem not faced in traditional systems,
namely the common service problem whereby multiple disparate but functionally-equivalent services may
share a common service as part of their respective workflows, thus decreasing their independence and
increasing the likelihood of a potentially disastrous common-mode failure occurring. By establishing
awareness of the topology of each channel/alternate within a design-diversity fault-tolerance scheme,
techniques can be employed to avoid the common-service problem and achieve finer-grained control within
the voting process. This paper explores a number of different techniques for obtaining topological
information, and identifies the derivation of topological data from provenance information to be a
particularly attractive technique. The paper concludes by detailing some very encouraging progress with
integrating both provenance and topology-aware fault-tolerance applications into the Chinese CROWN Grid
middleware system.
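The finer-grained voting control described in this abstract can be pictured with a toy weighted-majority voter that discounts channels whose workflow topologies overlap with an already-counted channel. The discount rule and weights below are illustrative assumptions, not a scheme taken from the paper:

```python
from collections import defaultdict

def topology_weighted_vote(results, topologies, discount=0.5):
    """Majority vote over channel results; a channel whose set of invoked
    services overlaps one already counted gets a reduced weight, since a
    shared service raises the risk of a common-mode failure."""
    counted = set()
    scores = defaultdict(float)
    for value, services in zip(results, topologies):
        weight = discount if counted & services else 1.0
        counted |= services
        scores[value] += weight
    return max(scores, key=scores.get)

# Channels 1 and 2 agree but share service "s"; channels 3 and 4 agree
# and are fully independent, so their agreement carries more weight.
winner = topology_weighted_vote([1, 1, 2, 2],
                                [{"s"}, {"s"}, {"a"}, {"b"}])
# winner == 2, despite the raw vote being tied 2-2
```

The point of the sketch is only that a voter with topological knowledge can break ties (or resolve disagreements) in favour of the more independent channels.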
However, the concept of a service being potentially composite takes on much greater importance when considering the common-service problem, due to the lack of knowledge about the workflow of alternates/channels within a design diversity scheme. This results in design diversity becoming less attractive in the context of SOAs, despite the great opportunities for cheap construction of such systems. It is therefore essential to investigate ways in which to reduce the impact of such common services between channel/alternate workflows. One such method is to exploit awareness of system topology, as through knowledge of system topology it may be possible to devise techniques for reducing the impact of the common service problem, such as through the use of weighted voting algorithms.
The topology of a software system is essentially the set of dependencies between the components that constitute that system. In the context of the common service problem, it is important that the dynamic dependencies that can occur due to the ultra-late binding capability of service-oriented Grids are properly represented within the topological view of the system. We are therefore particularly concerned with building up a view of system topology that represents the concrete workflows of every channel/alternate within a design diversity based system that are enacted for a given input, and, recursively, the concrete workflow of each component within those workflows.
We define topology as "the set of dependencies between all components of a system that are used to produce an outcome based on a given input", where a system consists of a number of components, which cooperate under the control of a design to service the demands of the system environment. In order to provide such topology information, there need to be mechanisms to extract it.

2.1 Methods of achieving topological awareness

At present, there are no widely known methods of gaining information about the topology of a given service within a Grid system, and very little research has been performed in this area, the notable exception being that performed by [PER05]. However, there are a number of technologies that can potentially be used to provide such information.
One such technology is SOAP messaging. SOAP is the commonly used XML-based protocol for passing messages between services within Grids. The body element of a SOAP message is concerned with the main information that the sender wishes to transmit to the receiver, whilst the header element (if present) contains additional information necessary for intermediate processing or added-value services. One possible method of recording topology information is to require the hosting environment of each service within a Grid to append topology information within the header of every outgoing SOAP message it sends. In this way, the topology of a given request can be reconstructed from the individual blocks within the SOAP header element, once the final SOAP message containing a result is received by the fault tolerance mechanism. This is similar to the approach used in [PER05], which attempts to gather topological information by prepending dependency information onto a CORBA request as it passes from stage to stage.
Another potential means of storing topological information is through extensions to WSDL documentation. A WSDL document describes the interface of a Web service and how to interact with that interface; however, an extension to WSDL could allow information about the workflow of a given service/operation to be expressed within the service description of a service. A service requestor could thus parse this information in order to ascertain the workflow of a given service operation.
A similar technique is to store information about the topology of a given service's workflow (or alternate negotiable topologies) in a metadata document (such as a WS-Policy document) related to that service; this data can then be retrieved via a technology such as WS-MetadataExchange and parsed accordingly.
The advantages and disadvantages of each of these techniques are summarised below:
• Appending topology information to SOAP headers. This method shows promise, as it allows for the capture of dynamic system topology; this is a particularly useful property in the context of service-oriented Grids, due to their ability to enact late-binding schemes. However, the method also requires that each hosting environment within a virtual organisation be adapted to perform this task in a syntactically and semantically equivalent way; as there is currently no existing technology that provides this functionality, this is a non-trivial task. Also, the approach could, depending on the size of a workflow (which may be recursive), result in very large SOAP headers that require increased time to parse and transmit, thus resulting in increased system overhead.
• WSDL. Embedding topological information within the WSDL document of a service has the advantage that topology can be derived before the service is invoked; this results in increased choice for any topology-aware fault tolerance mechanisms, as decisions can be made ahead of invocation. However, this method cannot express the topology of a service that exploits dynamic, ultra-late binding techniques, and requires any fault tolerance
mechanism that wishes to exploit topological data to always have access to the latest version of the service's WSDL document. Additionally, such a method would require extensions to the standard WSDL schema that currently do not exist, and these would have to be adhered to by each service within a workflow recursively.
• Metadata. The notion of embedding topological information within the metadata of a service has similar advantages to embedding the information within a WSDL document; namely, topology can be extracted in advance of the invocation stage. However, the approach shares a similar disadvantage in that dynamic, ultra-late binding services (in any part of the set of workflows enacted) cannot be modelled in this way. Unlike embedding topological information within a WSDL document, however, there are already standard ways, such as WS-Policy and WS-MetadataExchange, of expressing and extracting metadata, and so at least from a syntactic point of view the approach is more feasible.

2.2. Use of Provenance

In addition to the methods expressed above, we propose a new technique for deriving topological information about service workflows: analysis of the provenance information of a given task. The provenance of a piece of data is the documentation of the process that led to a given data element's creation. This concept is well established, although past research has referred to the same concept with a variety of different names (such as lineage [LAN91] and dataset dependence [ALO97]).
The recording of provenance information can be used for a number of goals, including verifying a process, reproducing a process, and providing context to a piece of result data. Provenance in relation to workflow enactment and SOAs is discussed in [SZO03]; in a workflow-based SOA interaction, provenance provides a record of the invocations of all the services that are used in a given workflow, including the input and output data of the various invoked services. Through an analysis of interaction provenance, patterns in workflow execution can thus be detected. Using actor provenance, further information about a service, such as the script a service was running, the service's configuration information, or the service's QoS metadata, can be obtained. Through an analysis of interaction provenance, system topology can thus be derived.
This approach to deriving topological information has a number of advantages over all three of the methods described in section 2.1. Methods of deriving topology from WSDL descriptions or service metadata are attractive in that the entire topology of a process can be known ahead of invocation (prospectively); however, these schemes assume a static topology and thus cannot be used with dynamic, ultra-late binding services. Conversely, recording topology within SOAP header elements can be used to record dynamic topology, but is strictly retrospective: the topology of a process can only be analysed upon the return of a result.
By deriving topology information from a provenance system, a neat compromise can be made between these properties: topology information can be derived at run-time during the course of an invocation (and subsequent workflow enactment) by querying data contained within the provenance store; this allows some analysis to be performed before a result is received, and also supports dynamic, ultra-late binding services.
Additionally, provenance technologies are already in existence and are being increasingly used in real-world hosting environments, so the infrastructure from which topology information can be extracted does not need creating, unlike in the case of appending information to SOAP messages.

Table 1. Methods for topology derivation.

Scheme       | Dynamic | Data Availability | Feasibility
SOAP headers | Yes     | Retrospective     | Requires implementation of middleware support
WSDL         | No      | Prospective       | Requires extensions to specification
Metadata     | No      | Prospective       | Technology already exists
Provenance   | Yes     | Run-time          | Technology already exists

A comparison of all four schemes is shown in Table 1. Due to the advantages it provides over other possible methods, the view of this paper is that provenance is a particularly attractive technique to consider when deriving topological information.

3. PreServ Integration with CROWN

The ideas of exploiting topological information to aid in the provision of design-diversity software fault-tolerance in Grid environments have been explored in the FT-Grid tool, developed at the University of Leeds and discussed in [TOW05b]. This tool derives its topological data from analysis of provenance records generated by the PreServ provenance recording tool [GRO04], developed at the University of Southampton. Initial experimentation has been
very positive, and has demonstrated a significant reduction in the number of common-mode failures encountered during a large number of automated tests [TOW05b].
One drawback of the approach is the assumption that every service within a Grid system has a compatible provenance recording facility; this is a rather strong assumption, and has been of questionable feasibility. However, recently, as part of the UK-Sino COLAB (Collaboration between Leeds and Beihang universities) project, the developers of the CROWN (Chinese Research environment over Wide-area Network, a middleware widely used in the Chinese e-Science programme) Grid middleware system have integrated the PreServ provenance system into the main CROWN middleware.
The intention of this is to allow provenance recording functionality to be transparently available on any service hosted on CROWN systems. From the perspective of using topological information to assist design-diversity fault-tolerance schemes, this is of great value, as the prospect of deriving complete topological information about channel/alternate workflows becomes much more feasible on systems developed within the CROWN Grid. Indeed, it is the intention of the COLAB project to integrate a generalised FT-Grid application into future releases of the CROWN middleware, to take advantage of this situation.
This generalised FT-Grid application is intended to offer mechanisms for topological analysis that can then be fed into other fault-tolerance schemes, as well as offering to enact user-defined design-diversity fault-tolerance applications itself (such as Distributed Recovery Blocks and Multi-Version Design).

4. Conclusions

There is a great need for methods for improving the dependability of applications within service-oriented Grid environments. A promising means of attaining such dependability improvements is through the set of methods comprising design-diversity software fault-tolerance. By establishing awareness of the topology of each channel/alternate within a design-diversity fault-tolerance scheme, techniques can be employed to avoid the common-service problem and achieve finer-grained control within the voting process.
This paper has explored a number of different techniques for obtaining topological information, and has identified the derivation of topological data from provenance information to be a particularly attractive technique. The paper concludes by detailing some very encouraging progress with integrating both provenance and topology-aware fault-tolerance applications into the Chinese CROWN Grid middleware system.

5. References

[ALO97] G. Alonso, C. Hagen, "Geo-opera: Workflow Concepts for Spatial Processes", in Proceedings of the 5th International Symposium on Spatial Databases, Berlin, Germany, June 1997.
[ALO04] G. Alonso, F. Casati, H. Kuno, V. Machiraju, "Web Services: Concepts, Architectures and Applications", Springer, 2004.
[AND90] T. Anderson and P. Lee, Fault Tolerance: Principles and Practice. New York: Springer-Verlag, 1990.
[GRO04] P. Groth, M. Luck, L. Moreau, "A protocol for recording provenance in service-oriented grids", in Proceedings of the 8th International Conference on Principles of Distributed Systems (OPODIS'04), Grenoble, France, December 2004.
[LAN91] D. Lanter, "Lineage in GIS: The Problem and a Solution", Technical Report 90-6, National Center for Geographic Information and Analysis, UCSB, Santa Barbara, CA, 1991.
[LIN03] Z. Lin, H. Zhao, and S. Ramanathan, "Pricing Web Services for Optimizing Resource Allocation - An Implementation Scheme", presented at 2nd Workshop on e-Business, Seattle, December 2003.
[PER05] S. Pertet, P. Narasimhan, "Handling Propagating Faults: The Case for Topology-Aware Fault-Recovery", in DSN Workshop on Hot Topics in System Dependability, Yokohama, Japan, June 2005.
[SZO03] M. Szomszor, L. Moreau, "Recording and reasoning over data provenance in web and grid services", Int. Conf. on Ontologies, Databases and Applications of Semantics, Vol. 2888 of Lecture Notes in Computer Science, pp. 603-620, Catania, Italy, November 2003.
fault-tolerance, such as Multi-Version Design and
Recovery Blocks. [TOW05a] P. Townend, P. Groth, J. Xu, "A Provenance-Aware
Weighted Fault Tolerance Scheme for Service-Based
However, applying such methods in a Grid Applications", in Proceedings of 8th IEEE International
environment requires the resolution of a new Symposium on Object-oriented Real-time distributed Computing,
problem not faced in traditional systems. This paper Seattle, May 2005
investigates the common service problem, whereby [TOW05b] P. Townend, P. Groth, N. Looker, J. Xu, "FT-Grid: A
multiple disparate but functionally-equivalent Fault-Tolerance System for e-Science", in Proceedings of 4th U.K.
services may share a common service as part of their e-Science All-Hands Meeting, 19th - 22nd Sept., 2005, ISBN 1-
904425-53-4.
respective workflows, thus decreasing their
independence and increasing the likelihood of a
potentially disastrous common-mode failure
occurring.
Proceedings of the UK e-Science All Hands Meeting 2006, © NeSC 2006, ISBN 0-9553988-0-0
This poster presents a selection of ontology editing tools that have been developed for Protégé-OWL. The tools were developed in the context of the CO-ODE project¹. Although Protégé-OWL² is arguably the most widely used tool for editing OWL ontologies, the design and implementation of the standard Protégé-OWL user interface has been focused around satisfying the needs of logicians and knowledge modellers. The tools that are presented in this poster are designed to be used by domain experts who are generally not logic savvy. It is hoped that these tools will encourage e-Scientists to take up OWL for use in their service architectures.
Ontologies and metadata are key to knowledge management and service architectures for e-Science. They are critical for annotating resources so that they can be discovered and used by semantic middleware in service oriented architectures and workflows. In particular, the Web Ontology Language, OWL, is on the verge of gaining acceptance from a wide range of e-Science groups and organisations.

Despite the fact that OWL is generally seen as a solution to a number of design and implementation areas in e-Science, the complexity of the language and of understanding its logical underpinnings can be a barrier to its correct use. To address this situation, various ontology editors such as Protégé-OWL [1] and SWOOP [2] have been developed. However, the default interfaces in both of these tools are somewhat geared towards the tastes of logicians. These kinds of users are typically seduced by the ability to take advantage of the full expressivity of OWL. In contrast, for many other types of users, it is frequently the case that a particular application or style dictates that only a subset of the full expressive power of a language such as OWL is required. This is particularly true when the aim is to extend a simple terminology or taxonomy. In addition, most users need an environment that makes ontology development a fast and pain-free process. They also require tool support for frequently used ontology design patterns, so that such patterns can be applied quickly and without error.

Using real requirements from groups, and exposure to difficulties experienced by beginners during tutorials, we have developed and are experimenting with several different user interfaces. These user interfaces attempt to address the following points:

1) Hiding the logical constructs and complex notation from the user.
2) Selecting appropriate expressiveness for a given group.
3) Abstracting away from the language by simulating some higher-level constructs.
¹ http://www.co-ode.org
² http://protege.stanford.edu/plugins/owl
Abstract
Many crystal structures of HIV-1 protease exist, but the number of clinically interesting drug resistant mutational patterns is far larger than the available crystal structures. Mutational protocols convert one protease sequence with an available crystal structure into another that diverges by a small number of mutations. It is important that such mutational algorithms are followed by suitable multi-step equilibration protocols, e.g. using chained molecular dynamics simulations, to ensure that the desired mutant structure is an accurate representation. Previously these have been difficult to perform on computational grids due to the need to keep track of large numbers of simulations. Here we present a simple way to construct a chained MD simulation using the Application Hosting Environment.
B = class 'A' except atoms of all amino acids within 5 Å of, and including, the N25D mutation
C = class 'A' except atoms of all amino acids within 5 Å of, and including, the I84V mutation
** NVT ensemble: temperature maintained using a Langevin thermostat with a coupling coefficient of 5 /ps
*** NPT ensemble: maintained using a Berendsen barostat [8] at 1 bar with a pressure coupling constant of 0.1 ps

Table 1: Equilibration protocol for molecular dynamics simulation of HIV-1 protease incorporating relaxation of mutated amino acid residues.
relaxation of the incorporated mutations within the framework of their surrounding protease structure. Furthermore, we show how use of the AHE both automates such a chained protocol and facilitates deployment of such simulations across distributed grid resources.

III The Application Hosting Environment

The Application Hosting Environment (AHE) is a lightweight, WSRF [5] compliant, web services based environment for hosting unmodified scientific applications on the grid. The AHE is designed to allow scientists to quickly and easily run unmodified, legacy applications on grid resources, manage the transfer of files to and from the grid resource, and monitor the status of the application. The philosophy of the AHE is based on the fact that very often a group of researchers will all want to access the same application, but not all of them will possess the skill or inclination to install the application on a remote grid resource. In the AHE, an expert user installs the application and configures the AHE server, so that all participating users can share the same application. For a discussion of the architecture and implementation of the AHE, see [4].

The AHE provides users with both GUI and command line clients to interact with a hosted application. In order to run an application using the AHE command line clients, the user first issues the ahe-listapps command to find the end point of the application factory of the application she wants to run. Next she issues the ahe-prepare command to create a new WS-Resource to manage the state of her application instance. Finally she issues the ahe-start command, which will parse her application configuration file to find any input files that need to be staged to the remote grid resource, stage the necessary files, and launch the application. The user can then use the ahe-monitor command to check on the progress of her application and, once complete, the ahe-getoutput command to stage the output files back to her local machine. By calling these simple commands from a shell or Perl script the user is able to create complex application workflows, starting one application execution using the output files from a previous run.

IV Implementation

The 1TSU crystal structure was used as the starting point for the molecular dynamics equilibration protocol. This structure contains inactive wildtype protease complexed to a substrate. VMD [16] was used for the initial preparation of the system prior to simulation. The coordinates of the substrate were removed from the structure, all missing hydrogen atoms were inserted, and the structure was solvated and neutralized. The N25D mutation was incorporated to restore catalytic activity to the protease, and the I84V as it is a primary drug resistant mutation for sev-
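This command sequence is what makes chained workflows scriptable (the paper reports a chained MD run in under forty lines of Perl). Below is a minimal Python sketch of the chaining logic only: the ahe-* client names come from the text, but the application name, stage list, file names, and the injected `run` callback are illustrative stand-ins; a real script would invoke the clients through the shell, and would first call ahe-listapps once to locate the application factory endpoint.

```python
from typing import Callable, List

def chain(app: str, stages: List[str],
          run: Callable[[List[str]], None]) -> List[str]:
    """Run the stages back to back; each step's staged-out coordinates
    feed the configuration of the next step (hypothetical file names)."""
    outputs: List[str] = []
    for stage in stages:
        run(["ahe-prepare", app])            # new WS-Resource for this step
        run(["ahe-start", stage + ".conf"])  # stage input files and launch
        run(["ahe-monitor"])                 # poll until the step completes
        run(["ahe-getoutput"])               # stage results back to local disk
        outputs.append(stage + ".coor")      # output consumed by the next step
    return outputs

# Stub runner: records command names instead of invoking the real clients.
calls: List[str] = []
outs = chain("namd", ["min", "nvt", "npt"], lambda cmd: calls.append(cmd[0]))
```

Swapping the stub for a `subprocess.run`-based callback would turn this into a working orchestration script without changing the chaining logic.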
The simulation has shown that in this case, the change in RMSD of mutated amino acids is similar to that of the protease backbone as a whole. Whilst this is indicative of a good initial mutational protocol, such as that used in VMD, differences in the RMSD of mutated residues during minimization and force relaxation show that an equilibration protocol that allows conformational change of mutated amino acids assists in the achievement of equilibration. Furthermore, as a significant degree of simulation using a multi-step protocol is necessary to achieve equilibration, the ability to automate implementation of such a protocol using the AHE is greatly beneficial when considering the need to do such equilibrations for a large number of protease mutations.

We have also shown that due to the flexible nature of the AHE, a complex workflow can be orchestrated by scripting the AHE command line clients; in this case we have conducted a chained molecular dynamics simulation using less than forty lines of Perl code.

References

[1] P. V. Coveney, editor. Scientific Grid Computing. Phil. Trans. R. Soc. A, 2005.

[2] I. Foster, C. Kesselman, and S. Tuecke. The anatomy of the grid: Enabling scalable virtual organizations. Intl J. Supercomputer Applications, 15:3–23, 2001.

[3] J. Chin and P. V. Coveney. Towards tractable toolkits for the grid: a plea for lightweight, useable middleware. Technical report, UK e-Science Technical Report UKeS-2004-01, 2004. http://nesc.ac.uk/technical papers/UKeS-2004-01.pdf.

[7] K. L. Meagher and H. A. Carlson. Solvation Influences Flap Collapse in HIV-1 Protease. Proteins: Struct. Funct. Bioinf., 58:119–125, 2005.

[8] H. J. C. Berendsen, J. P. M. Postma, W. F. van Gunsteren, A. DiNola, and J. R. Haak. Molecular dynamics with coupling to an external bath. J. Chem. Phys., 81:3684–3690, 1984.

[9] W. R. P. Scott and C. A. Schiffer. Curling of Flap Tips in HIV-1 Protease as a Mechanism for Substrate Entry and Tolerance of Drug Resistance. Structure, 8:1259–1265, 2000.

[10] A. L. Perryman, J. Lin, and J. A. McCammon. HIV-1 protease molecular dynamics of a wild-type and of the V82F/I84V mutant: Possible contributions to drug resistance and a potential new target site for drugs. Protein Sci., 13:1108–1123, 2004.

[11] A. Wlodawer and J. Vondrasek. Inhibitors of HIV-1 Protease: A Major Success of Structure-Assisted Drug Design. Annu. Rev. Biophys. Biomol. Struct., 27:249–284, 1998.

[12] V. A. Johnson, F. Brun-Vezinet, B. Clotet, B. Conway, D. R. Kuritzkes, D. Pillay, J. Schapiro, A. Telenti, and D. Richman. Update of the Drug Resistance Mutations in HIV-1: 2005. Int. AIDS Soc. - USA, 13:51–57, 2005.

[13] N. G. Hoffman, C. A. Schiffer, and R. Swanstrom. Covariation of amino acid positions in HIV-1 protease. Virology, 314:536–548, 2003.
Abstract
We present PRATA, a system that supports the following in a uniform framework: (a) XML publishing, i.e., converting
data from databases to an XML document, (b) XML integration, i.e., extracting data from multiple, distributed databases,
and integrating the data into a single XML document, and (c) incremental maintenance of published or integrated XML
data (view), i.e., in response to changes to the source databases, efficiently propagating the source changes to the XML
view by computing the corresponding XML changes. A salient feature of the system is that publishing, integration and view
maintenance are schema-directed: they are conducted strictly following a user-specified (possibly recursive and complex)
XML schema, and guarantee that the generated or modified XML document conforms to the predefined schema. We discuss
techniques underlying PRATA and report the current status of the system.
mentioning that while XML publishing is a special case of XML integration where there is a single source database, we treat it separately since it allows us to develop and leverage specific techniques and conduct the computation more efficiently than in the generic integration setting.

Figure 1. The System Architecture

2 XML Publishing

This module allows users to specify mappings from a relational database schema to a predefined XML schema, via a GUI and in a novel language, Attribute Translation Grammar (ATG), that we proposed in [2]. Given an ATG and a database instance of the relational schema, the system automatically generates an XML document (view) of that instance which is guaranteed to conform to the given DTD. This process is called schema-directed publishing.

We next demonstrate the ATG approach for publishing relational data in XML, by using a simplified example taken from the IUPHAR (International Union of Pharmacology) Receptor Database [8], which is a major on-line repository of characterisation data for receptors and drugs. We consider tables chapters, receptors, refs and cite shown in Fig. 2 (the primary keys for all tables are underlined). Table chapters stores the receptor families, where each family has a chapter_id (primary key) and a name. Table receptors stores the receptor_id, chapter_id, name and code of receptors. Table refs stores publications on receptor families. Each reference has a ref_id, a title and a year of publication. It is associated with a unique receptor family through the attribute (foreign key) chapter_id. Finally, table cite records many-to-many relationships between receptors and references, with ref_id and receptor_id (as foreign keys) pointing to references and receptors, respectively.

Source relational schema:
  chapters(chapter_id, name)
  receptors(receptor_id, chapter_id, name, code)
  refs(ref_id, title, year, chapter_id)
  cite(ref_id, receptor_id)

Figure 2. Relational schema for Receptors

Target DTD:
<!ELEMENT db (family*)>
<!ELEMENT family (name, receptors, references)>
<!ELEMENT references (reference*)>
<!ELEMENT reference (title, year)>
<!ELEMENT receptors (receptor*)>
<!ELEMENT receptor (name, receptors)>
/* #PCDATA is omitted here. */

Figure 3. DTD for publishing the Receptor data

One wants to represent the relational Receptor data as an XML document that conforms to the DTD shown in Fig. 3. The document consists of a list of receptor families, identified by chapter ids. Each family collects the list of receptors and the list of references in the family (chapter). Note that receptor (and thus the DTD) is recursively defined: the receptors child of a receptor element can itself take an arbitrary number of receptor elements as its children, namely the receptors related to it through references. To avoid redundant information, if a receptor appears more than once in the reference hierarchy, the document stores it only once, i.e., at its first appearance.

The ATG for publishing the Receptor data is given in Fig. 4; it extends the predefined DTD by associating semantic rules to assign values to the variables (e.g., $family). Given the Receptor Database, the ATG is evaluated top-down: starting at the root element type of the DTD, it evaluates the semantic rules associated with each element type encountered, and creates XML nodes following the DTD to construct the XML tree. The values of the variables are used to control the construction. Now we give part of the ATG evaluation process.

(1) At each receptors element, the target tree is further expanded as follows. As $receptors.tag is assigned "0" from the family element's evaluation, the first branch of the SQL query in the receptors rule is triggered. This branch finds the (receptor_id, name) tuples associated with the family in the receptors relation, using the variable $receptors.id (passed from $family.chapter_id in the last step) as a constant parameter. For each such tuple, a receptor child of the receptors element is created, carrying the tuple as the value of its variable.

(2) At each receptor element, a name element and a receptors element are created (note that "1" is now assigned to $receptors.tag, and the receptor's receptor_id is accumulated into $receptors.ids in this process).

(3) At each nested receptors element, the second branch of the SQL query is triggered because $receptors.tag is now "1". This branch finds, from the receptors and cite relations, the (receptor_id, name) tuples of receptors related to the current receptor that have not been processed in previous steps, using the variables $receptors.id and $receptors.ids as constant parameters. Note that the variable $receptors.ids is used to decide whether or not a receptor has been processed earlier, and thus to avoid redundant inclusion of the same receptor in the document. For each such tuple, a receptor child is created carrying the tuple as the value of its variable, and that receptor element is in turn expanded as described in (2).

Details on the above conceptual-level evaluation process and more effective techniques to generate efficient evaluation plans are shown in [2]. These include query partitioning and materializing intermediate results, in order to reduce communication cost between the publishing module and the underlying DBMS, based on estimates of the query cost and data size. We omit the details here due to the space constraint, but encourage the reader to consult [2].
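To make the top-down evaluation concrete, here is a toy Python sketch of schema-directed publishing over in-memory stand-ins for the chapters, receptors, refs and cite tables of Fig. 2. The table contents are invented for illustration, and a single per-family `seen` set approximates the $receptors.ids bookkeeping of the ATG; the real system instead pushes the two SQL branches down to the DBMS.

```python
import xml.etree.ElementTree as ET

# Illustrative stand-ins for the relations of Fig. 2 (invented data).
chapters  = [(1, "famA"), (2, "famB")]          # chapter_id, name
receptors = {10: "r10", 11: "r11", 12: "r12"}   # receptor_id -> name
chap_of   = {10: 1, 11: 1, 12: 2}               # receptor_id -> chapter_id
refs      = [(100, "t1", 1990, 1)]              # ref_id, title, year, chapter_id
cite      = [(100, 10), (100, 11), (100, 12)]   # ref_id cites receptor_id

def related(rid, seen):
    """Receptors sharing a reference with rid and not yet in the document
    (the role played by branch 1 of the SQL query in the receptors rule)."""
    out = []
    for ref, a in cite:
        if a == rid:
            out += [b for r, b in cite if r == ref and b != rid and b not in seen]
    return out

def build_receptor(rid, seen):
    """Create a receptor element and recursively expand its receptors child."""
    el = ET.Element("receptor")
    ET.SubElement(el, "name").text = receptors[rid]
    rs = ET.SubElement(el, "receptors")
    for child in related(rid, seen):
        if child in seen:          # store each receptor at its first appearance only
            continue
        seen.add(child)
        rs.append(build_receptor(child, seen))
    return el

db = ET.Element("db")
for cid, cname in chapters:
    fam = ET.SubElement(db, "family")
    ET.SubElement(fam, "name").text = cname
    rs = ET.SubElement(fam, "receptors")
    seen = {r for r, c in chap_of.items() if c == cid}   # branch 0 of the query
    for rid in sorted(seen):
        rs.append(build_receptor(rid, seen))
    fam_refs = ET.SubElement(fam, "references")
    for ref_id, title, year, c in refs:
        if c == cid:
            ref = ET.SubElement(fam_refs, "reference")
            ET.SubElement(ref, "title").text = title
            ET.SubElement(ref, "year").text = str(year)
```

With this data, receptor r12 belongs to a different family but shares reference 100 with r10, so within famA's subtree it appears nested under r10, and exactly once.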
Semantic Attributes: /*omitted*/
Semantic Rules:

db -> family*
  $family <- select chapter_id, name from chapters

family -> name, receptors, references
  $fname = ($family.name), $references = ($family.chapter_id),
  $receptors = (0, $family.chapter_id, ∅)

receptors -> receptor*
  $receptor <- case $receptors.tag of
    0: select receptor_id, name, $receptors.ids
       from receptors
       where chapter_id = $receptors.id
    1: select a.receptor_id, a.name, $receptors.ids
       from receptors a, cite b, cite c
       where b.receptor_id = $receptors.id and
             b.ref_id = c.ref_id and
             b.receptor_id <> c.receptor_id and
             a.receptor_id = c.receptor_id and
             a.receptor_id not in $receptors.ids

receptor -> name, receptors
  $rname = ($receptor.name),
  $receptors = (1, $receptor.receptor_id, $receptor.ids ∪ $receptor.receptor_id)

references -> reference*
  $reference <- select title, year
                from refs
                where chapter_id = $references.chapter_id

reference -> title, year
  $year = ($reference.year), $title = ($reference.title)

A -> S  /* A is one of name, title, year */
  $S = ($A.val)

Figure 4. An example ATG

3 XML Integration

Extending the support for ATGs, this module provides a GUI for users to specify mappings from multiple, distributed databases to XML documents of a given schema, in our language Attribute Integration Grammar (AIG) [1]. In addition, given an AIG and source databases, this module extracts data from the distributed sources and produces an XML document that is guaranteed to conform to the given DTD and satisfy predefined XML constraints.

As an example, suppose that the tables chapters and receptors in Fig. 2 are stored in a database DB1, while the tables refs and cite are stored in DB2. Now the IUPHAR department wants to generate a report about the receptors and corresponding publications. The report is required to conform to a fixed DTD, which extends the one of Fig. 3 as follows: elements fid and rec_id are added to the family production and receptor production as their first subelements respectively; and ref_id and fid are added to the reference production as its first two subelements (from table refs we can see that one reference can only relate itself to one family). In addition, for each family, the document is to collect all the references that are publications cited directly or indirectly by receptors in the family.

These introduce the following challenges. First, the integration requires multiple distributed data sources. As a result a single SQL query may access multiple data sources, referred to as multi-source queries. Second, receptors are defined recursively. Third, for each family, references can only be computed after the receptors computation is completed, since it is to collect all references in the receptors subtree, which is recursively defined and has an unbounded depth. In other words, this imposes a dependency between the references and the receptors subtrees. As a result the XML tree cannot simply be computed as for ATGs [1] by using a top-down method.

In addition, the XML report is also required to satisfy the following XML constraints:

φ1: references(reference.ref_id -> reference)
φ2: db(reference.fid ⊆ family.fid)

Here φ1 is a key constraint asserting that in each subtree rooted at a references node, ref_id is a key of reference elements; and φ2 is an inclusion constraint asserting that the families cited by references in the db must be present in the family subtree of the db.

As remarked in Section 1, no commercial system or research prototype can support schema-directed XML integration with respect to both DTDs and XML constraints. The only effective technique for doing this is our Attribute Integration Grammars (AIGs) reported in [1], which is the underlying technique for the integration module of PRATA. Based on multi-source query decomposition and constraint compilation, PRATA is capable of integrating data from multiple sources and generating an XML document that is guaranteed both to conform to a given DTD and to satisfy predefined XML constraints. PRATA also leverages a number of optimization techniques to generate efficient query evaluation plans [1]. In particular, it uses a cost-based scheduling algorithm, taking into account dependency relations on subtrees, to maximize parallelism among underlying relational engines and to reduce response time. Due to the lack of space we omit the details here (see [1]).

4 Incremental XML View Maintenance

This module maintains published XML views based on our incremental computation techniques developed in [3]. In response to changes ΔI to the source database I, this module computes the XML changes ΔT to the view T such that σ(I ⊕ ΔI) = T ⊕ ΔT, while minimizing unnecessary recomputation. The operator ⊕ denotes the application of these updates.

As remarked in Section 1, scientific databases keep being updated. Consider the XML view published by the ATG of Fig. 4 from the Receptor Database specified by Fig. 2. Suppose that the relational Receptor database is updated by insertions Δrefs and Δcite to the base relations refs and cite respectively. This entails that the new reference information must be added to the corresponding reference subtrees in the XML view. Moreover, the insertions also increase the related receptors for some receptor nodes in the XML view, and the augmentation may result in further expansion of the XML view. Given the recursive nature of the ATG,
this entails recursive computation and is obviously nontrivial.

The incremental algorithm in [3] is based on the notion of a ΔATG. A ΔATG is statically derived from an ATG by deducing and incrementalizing SQL queries for generating edges of XML views. The XML changes ΔT are computed by the ΔATG and represented by a pair of edge relations, denoting the insertions (buds) and deletions (cuts). The whole process is divided into three phases: (1) a bud-cut generation phase that determines the impact of ΔI on existing parent-child (edge) relations in the old XML view by evaluating a fixed number of incrementalized SQL queries; (2) a bud completion phase that iteratively computes newly inserted subtrees top-down by pushing SQL queries to the relational DBMS; and finally (3) a garbage collection phase that removes the deleted subtrees.

The rationale behind this is that the XML update ΔT is typically small and more efficient to compute than the entire new view. The key criterion for any incremental view maintenance algorithm is to precisely identify ΔT; in other words, to minimize recomputation of work already done when computing the old view. Our incremental maintenance techniques make maximal use of the XML subtrees computed for the old view and thus minimize unnecessary recomputation. It should be pointed out that the techniques presented in this section are also applicable to XML views generated by AIGs; we focused on ATG views just to simplify the discussion. Due to the limited space we omit the evaluation details here (see [3]).

5 PRATA: Features and Current Status

Taken together, PRATA has the following salient features, which are beyond what is offered by commercial tools or prototype systems developed thus far.

Schema conformance. PRATA is the first system that automatically guarantees that the published or integrated XML data conforms to predefined XML DTDs and schemas, even if the DTDs or schemas are complex and recursive.

Automatic validation of XML constraints. In a uniform framework for handling types specified by DTDs or schemas, PRATA also supports automatic checking of integrity constraints (keys, foreign keys) for XML.

Integration of multiple, distributed data sources. PRATA is capable of extracting data from multiple, distributed databases, and integrating the extracted data into a single XML document. A sophisticated scheduling algorithm allows PRATA to access the data sources efficiently in parallel.

Incremental updates. PRATA is the only system that supports incremental maintenance of recursively defined views, and is able to efficiently propagate changes from data sources to published/integrated XML data.

Novel evaluation and optimization techniques. Underlying PRATA are a variety of innovative techniques, including algorithms and indexing structures for query merging [2], constraint compilation, multi-source query rewriting, query scheduling [1], and bud-cut incremental computation [3], which are not only important for XML publishing and integration, but are also useful in other applications such as multi-query evaluation and database view maintenance.

Friendly GUIs. PRATA offers several tools to facilitate users in specifying publishing/integration mappings, browsing XML views, analyzing published or integrated XML data, and monitoring changes to XML views, among other things.

The current implementation of PRATA fully supports (a) schema-directed XML publishing and (b) incremental maintenance of XML views of a single source, based on the evaluation and optimization techniques discussed in previous sections. For XML schemas, it allows generic (possibly recursive and non-deterministic) DTDs, but has not yet implemented the support for XML constraints. A preliminary prototype of the system was demonstrated at a major database conference [4], and is deployed and being evaluated at Lucent Technologies and the European Bioinformatics Institute [4]. We are currently implementing (a) XML integration and (b) incremental maintenance of XML views of multiple sources. Pending the availability of resources, we expect to develop a full-fledged system in the near future.

References

[1] M. Benedikt, C. Y. Chan, W. Fan, J. Freire, and R. Rastogi. Capturing both types and constraints in data integration. In SIGMOD, 2003.
[2] M. Benedikt, C. Y. Chan, W. Fan, R. Rastogi, S. Zheng, and A. Zhou. DTD-directed publishing with attribute translation grammars. In VLDB, 2002.
[3] P. Bohannon, B. Choi, and W. Fan. Incremental evaluation of schema-directed XML publishing. In SIGMOD, 2004.
[4] B. Choi, W. Fan, X. Jia, and A. Kasprzyk. A uniform system for publishing and maintaining XML. In VLDB, 2004. Demo.
[5] M. F. Fernandez, A. Morishima, and D. Suciu. Efficient evaluation of XML middleware queries. In SIGMOD, 2001.
[6] GO Consortium. Gene Ontology. http://www.geneontology.org/.
[7] IBM. DB2 XML Extender. http://www-306.ibm.com/software/data/db2/extenders/xmlext/.
[8] IUPHAR. Receptor Database. http://www.iuphar-db.org.
[9] Microsoft. XML support in Microsoft SQL Server 2005, December 2005. http://msdn.microsoft.com/library/en-us/dnsql90/html/sql2k5xml.asp/.
[10] R. J. Miller, M. A. Hernández, L. M. Haas, L.-L. Yan, C. T. H. Ho, R. Fagin, and L. Popa. The Clio project: Managing heterogeneity. SIGMOD Record, 30(1):78–83, 2001.
[11] Oracle. Oracle Database 10g Release 2 XML DB Technical Whitepaper. http://www.oracle.com/technology/tech/xml/xmldb/index.html.
[12] J. Shanmugasundaram, E. J. Shekita, R. Barr, M. J. Carey, B. G. Lindsay, H. Pirahesh, and B. Reinwald. Efficiently publishing relational data as XML documents. VLDB Journal, 10(2-3):133–154, 2001.
relationship can be established, and, as a result, the outcome can be measured [8]. One example of this is to measure a new system's performance in relation to an existing system. This is known as a comparative experiment. Another option, named an absolute experiment, is to test a system in isolation [9]. The measurement base here could be the time to accomplish an exercise that needs to be undertaken within a programme. For example, there are two groups whose members can be argued to be similar to each other in certain characteristics. If one group learns to use a computer programme faster than the other, the experiment should allow the assumption that the programme used by one group is easier to learn. In this example the variables are defined to be similar (group and task) and the measurement base (time to learn the programme) is the difference. The cause of this difference is the intuitiveness of the programme interface.
There are critical views towards the use of this theory. For example, Faulkner [9] quotes a study of 'morphic resonance', where the outcomes of the experiments contradicted each other. Two researchers planned and conducted the experiment together, but started with opposing views of the outcome. The outcome of the experiment was then published in two contradicting papers. Faulkner [9] argued that the reason for this was that the two individuals "hung on" to their opinions rather than accept an interpretation that might destroy a view to which one may be personally attached [9].
Surveys typically use questionnaires or structured interviews for the collection of data. They have a common aim to scale up and model a larger population [7]. Therefore, a survey can be seen as a set of questions creating a controlled way of querying a set of individuals to describe themselves, their interests and preferences [10].
Questionnaires are the most structured way of asking questions of participants. This is because the designer of the questionnaire decides upon the options for the answers: for example, a Yes or No, or ticking one of the answers given. These multiple-choice answers make the analysis of the responses easy for the researcher [11]. Seen in more detail, however, this is not a very straightforward approach, because, as Kuniavsky [10] points out, the questions have to be phrased in a correct way to avoid misunderstandings or double meanings. He uses the example of a financial Web page whose owners want to query how often their users purchase new stock. This may be well understood by most of their users. However, some people may talk about goods in a shop when using "stock". Additionally, offering multiple-choice answers can be rather dull for the participants, because there are only pre-selectable answers [11]. This may be overcome by using multiple indicators, semantic differential scales or ranked order questions, which are often referred to as the Likert scale [7, 9].
There are also other tools like the semantic differential scale, in which the respondents are given the task of ranking their opinions according to their importance: for example, using opposing terms like clumsy and skilful to indicate how these are perceived by the respondents [9]. Ranked order questions are another option for a questionnaire. The researcher here gives the option of ranking, for example, the importance of a programme on a PC. Options could include a browser, word processor, image enhancer or messenger. The participant can then choose which option is the most or the least important to them.
Sample size
With surveys or structured interviews the sample size can be large. In many studies the sample size does not have a direct impact on the cost of the project, because the surveys can be sent via e-mail and the interviewer does not have to attend the filling in of the survey.
Bias
Bias is argued to be low when using quantitative methods. This is mainly reasoned with the absence of an interviewer who could influence the process. Also important is that the participant is guaranteed anonymity, which may prevent a bias in the response.
Quality of questioning
The questions have to be short and precise. They have to be phrased very carefully so as not to leave room for misinterpretation. Phrasing a question in a certain way can otherwise have a negative impact on the answers.
Data quality
The data quality is determined by the design of the questionnaire. When the questionnaire is designed, the answers are fixed and will be the ones counted. Therefore, there will be no room for further insight into causes or processes. Obviously, questionnaires can be poorly designed and may not gather useful information.
Negative bias
Bias may be introduced through the selection of participants that make up the sample group. Also the phrasing of questions and the time and location may have an impact on the answers given, and perhaps introduce negative bias. For example, Jakobs et al. [12] chose to use surveys
because their subjects were not accessible for direct interviews. Their study subjects were high-level members of staff in an international organization.

2.2 Qualitative Methods
Qualitative methods contrast greatly with quantitative ones. Qualitative research is defined as a methodology that allows the interpretation of data rather than the counting of predefined answers [6, 8, 13]. Qualitative methods may take longer and/or involve intensive contact with the situation of study. The situation of study can be a real-life situation, single or multiple people, organizations or populations [9]. The task of the researcher is to capture a situation in its full complexity and log the information found. The analysis then includes investigating the topics or subjects found and interpreting or concluding on their meaning from a personal as well as from a theoretical point of view [6]. It is, however, sometimes stated that qualitative methods are not straightforward and are often seen as more unclear or diffuse than quantitative results [13].
The main methods for the collection of data are participant observation, qualitative interviewing, focus groups and language-based methods [6, 8].
1. Participant observation is defined by an immersion of the researcher with the subject into the situation to be studied [8]. This allows the collection of data in an unobtrusive way by watching the behaviour and activities of the subject. Observation is often used in user interaction studies, looking for example at Human Computer Interfaces [13].
2. Qualitative interviewing can be defined as a dialogue, where one of the participants (the researcher) is looking for answers from a subject [14]. Qualitative interviewing is often split into unstructured and semi-structured interviewing. In the unstructured case the interviewer is only equipped with a number of topics that should, but do not have to be, discussed in the interview, whereas in the semi-structured case the interviewer has a series of questions which should be addressed, but whose sequence may be varied. Most importantly, compared to quantitative interviewing, both interview types allow additional questions on the basis of the response to a question [8].
3. Focus groups consist of a small number of people, chosen to participate in an informal discussion guided by a moderator. The group's task is to discuss defined subjects, opinions and knowledge [15-17]. The advantage of this kind of interaction is that the opinion of one individual can cause further dialogue from others. Moreover, a larger number of opinions, attitudes, feelings or perceptions towards a topic are voiced by the participants; this provides a broader overview of the discussed topic [11, 13].
4. Language-based methods, such as content and conversation analysis, work on the basis of a newspaper article or transcribed conversation. Their main aim is to uncover lower-level structures in conversations or texts [8]. This, however, is not considered here, mainly because no examples of this method were found.

Qualitative methods bring both positive and negative aspects to the research conducted using them. Accounts of the positive aspects are the depth and the quality of the information gathered [6, 15, 18, 19]. As Punch puts it, "qualitative interviews are the best way to understand other people". Therefore, perceptions, meanings, descriptions of situations and the construction of realities can best be recorded in a qualitative setting [9]. A negative aspect is the dependency on the moderator, who is seen as important in the process of asking the questions and interpreting the data [6, 16]. Holzinger [20], for example, used forty-nine participants for a study looking at the creation of an interface, using mainly qualitative methods to gain information from the users. The researchers described the methods as useful for identifying the users' needs. Rose et al. [17] reported on using focused interviews and focus groups to investigate the use of a medical decision support tool. The focused interviews were used to identify tasks and their goals, and the focus groups to investigate in which situations the tool would be used. Konio [21] concludes, from a comparison of three software projects based on using focus groups, that the outcome was positive. However, they point out that a focus group needs to be run by a trained person. Similar findings were reported by Wilson [22], whose team has been conducting focus groups to gain information about a VRE [22].

3. Conclusion
As outlined in this paper, there are a number of approaches for gaining information from the e-Research community. In the case of gathering information about requirements towards a VRE, aspects of bias, sampling and quality of the
information gathered have to be taken into account.
Furthermore, quantitative findings are not as dependent on the moderator as qualitative methods. However, the depth of information given by quantitative methods is not as elaborate as that from qualitative methods, because of the predefined questions and answers. Moreover, when using quantitative methods there is normally no option to elaborate on questions. Therefore, the developers would have to have thought about all the requirements needed by the user group beforehand, which can be strongly argued to be an unlikely case. Therefore, quantitative methods can restrict the elicitation of requirements from the e-Research community.
When looking at qualitative research methods, there are opportunities to ask groups of users or individuals. This provides the opportunity to understand and query in detail what functionality they want from their project. Therefore the outcomes will give more information about the background and reasoning of user requirements. This depth and quality of information would not be achievable using quantitative methods, due to the fixed questioning structure. When examining the difference between individual qualitative interviews and group inquiries, the discussion evolving from the group enables greater detail to be recorded, and individuals can clarify requirements by discussing them with peers rather than with an IT specialist [16, 18]. This may prevent the changing of ideas over time. Through the multiplication of focus groups, the importance of functionality can be verified and therefore arguments can be cross-referenced, as described by [8].
By the arguments above, qualitative methods seem to be the best path to pursue for the user needs gathering in the VRE programme. However, it is acknowledged that if the enquiry is done with a low level of motivation and attention to detail, the information gathered may be biased and therefore the tasks will not be well informed by the users.

4. References
1. Marmier, A. The eMinerals minigrid and the National Grid Service: a user's perspective. In The All Hands Meeting. 2005. Nottingham, UK.
2. Nielsen, J., Usability engineering. 1993, Boston; London: Academic Press. xiv, 358.
3. Mack, R.L. and J. Nielsen, Usability inspection methods. 1994, New York; Chichester: Wiley. xxiv, 413.
4. Lederer, A.L. and J. Prasad, Nine management guidelines for better cost estimating. Communications of the ACM, 1992. 35(2): p. 51-59.
5. Lancaster University. VRE Demonstrator. [Internet] 2006 [cited 2006 01 May 06]; http://e-science.lancs.ac.uk/vre-demonstrator/index.html.
6. Bryman, A., Social research methods. 2nd ed. 2004, Oxford: Oxford University Press. xiv, 592.
7. Creswell, J.W., Research design: qualitative, quantitative, and mixed method approaches. 2nd ed. 2003, Thousand Oaks, Calif.; London: Sage. xxvi, 246.
8. Punch, K., Introduction to social research: quantitative and qualitative approaches. 2nd ed. 2005, London: SAGE. xvi, 320.
9. Faulkner, X., Usability engineering. 2000, Basingstoke: Macmillan. xii, 244.
10. Kuniavsky, M., Observing the user experience: a practitioner's guide to user research. Morgan Kaufmann series in interactive technologies. 2003, San Francisco, Calif.; London: Morgan Kaufmann. xvi, 560.
11. Gillham, B., Developing a questionnaire. Real world research. 2000, London: Continuum. ix, 93.
12. Jakobs, K., R. Procter, and R. Williams. A study of user participation in standards setting. In Conference companion on human factors in computing systems: common ground. 1996. Vancouver, British Columbia, Canada: ACM Press.
13. Seaman, C.B., Qualitative methods in empirical studies of software engineering. IEEE Transactions on Software Engineering, 1999. 25(4): p. 557.
14. Gillham, B., The research interview. Real world research. 2000, London: Continuum. viii, 96.
15. Litosseliti, L., Using focus groups in research. Continuum research methods series. 2003, London: Continuum. vii, 104.
16. Nielsen, J., The use and misuse of focus groups. IEEE Software, 1997. 14(1): p. 94.
17. Rose, A.F., et al., Using qualitative studies to improve the usability of an EMR. Journal of Biomedical Informatics, 2005. 38(1): p. 51-60.
18. Morgan, D.L., Focus groups as qualitative research. 2nd ed. 1997, Thousand Oaks; London: Sage. viii, 80.
Abstract
Traditionally, registration and tracking within Augmented Reality (AR) applications have been built around limited bold markers, which allow their orientation to be estimated in real time. All attempts to implement AR without specific markers have increased the computational requirements, and some information about the environment is still needed. In this paper we describe a method that not only provides a generic platform for AR but also seamlessly deploys High Performance Computing (HPC) resources to deal with the additional computational load, as part of the distributed High Performance Visualization (HPV) pipeline used to render the virtual artifacts. Repeatable feature points are extracted from known views of a real object, and we then match the best stored view to the user's viewpoint, using the matched feature points to estimate the object's pose. We also show how our AR framework can be used in the real world by presenting a markerless AR interface for Transcranial Magnetic Stimulation (TMS).
In order to be successful, marker-less tracking is not only more computationally intensive, but also requires more information about the environment and the structure of any planes or real objects that are to be tracked. In this paper we present a generic solution that does not rely on the use of markers, but rather on feature points that are intelligently extracted from the user's view. We also provide a solution to the computationally intensive task of pose estimation and the rendering of complex artefacts by exploiting remote HPV resources through an advanced environment for enabling visual supercomputing – the e-Viz Project [8].

…requirements. It becomes more robust by changing the way intensity variation is calculated between each pixel and its neighbours, allowing for edges that are not at 45° or 90° to the point being evaluated. The Harris algorithm uses first-order derivatives to measure the local autocorrelation of each point. A threshold value is then used to set all of the weaker points to zero, leaving all of the non-zero points to be interpreted as feature points – see Figure 2.
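The derivative-then-threshold scheme just described can be sketched in a few lines. The following is an illustrative pure-Python version of the Harris response [13], not the authors' implementation; the 3x3 window and the constant k = 0.04 are conventional assumptions.

```python
def harris_response(img, k=0.04):
    """Harris corner response map for a 2-D grayscale image (list of rows)."""
    h, w = len(img), len(img[0])
    # First-order derivatives via central differences (zero at the border).
    Ix = [[0.0] * w for _ in range(h)]
    Iy = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            Ix[y][x] = (img[y][x + 1] - img[y][x - 1]) / 2.0
            Iy[y][x] = (img[y + 1][x] - img[y - 1][x]) / 2.0
    R = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Local autocorrelation matrix M = [[Sxx, Sxy], [Sxy, Syy]],
            # accumulated over a 3x3 window around (x, y).
            Sxx = Sxy = Syy = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    gx, gy = Ix[y + dy][x + dx], Iy[y + dy][x + dx]
                    Sxx += gx * gx
                    Sxy += gx * gy
                    Syy += gy * gy
            # Corner response: det(M) - k * trace(M)^2.
            R[y][x] = (Sxx * Syy - Sxy * Sxy) - k * (Sxx + Syy) ** 2
    return R


def feature_points(img, threshold):
    """Zero out weak responses; the remaining non-zero points are features."""
    R = harris_response(img)
    return [(x, y) for y, row in enumerate(R)
            for x, r in enumerate(row) if r > threshold]
```

On a synthetic image containing a single bright rectangle, edges produce negative or near-zero responses while the rectangle's corner survives the threshold, which is exactly the behaviour the text relies on for extracting repeatable feature points.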
• Rendering the graphics with e-Viz
The first implementation simply uses e-Viz to render the virtual artefacts present in our AR view. The user's viewpoint is captured by the local machine and the pose estimation is calculated locally. The pose transformation is used to steer the e-Viz visualization pipeline, which in the background sends the transformation information to an available visualization server. Our client then receives the rendered image and composites it locally into the user's view.
• Distributing the pose estimation module as part of the visualization pipeline
In order to take full advantage of the e-Viz resources, the second implementation moves the pose estimation module onto the e-Viz visualization pipeline. In this case the e-Viz visualization is steered directly by the video stream of the user's view. e-Viz distributes the pose estimation module to a suitable and available resource. The pose estimation module then steers the visualization pipeline and returns the final view back to the user after compositing the artificial rendering into the real scene.

…severe depression and other drug-resistant mental illnesses such as epilepsy [19, 20]. In previous work we developed an AR interface for TMS using an optical tracking system to calculate the pose of the subject's head relative to the user's viewpoint [21]. We are now developing a new AR interface that uses our generic framework – see Figure 3. By removing the need for expensive optical tracking equipment, our software will provide an inexpensive solution, making the procedure more accessible to training and further research.
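The second configuration described above — pose estimation, rendering and compositing all performed remotely, steered only by the client's video stream — can be sketched as follows. This is a structural illustration only; the class and method names are assumptions and do not reflect the actual e-Viz API.

```python
class RemotePipeline:
    """Stand-in for the remote e-Viz visualization pipeline (hypothetical API)."""

    def estimate_pose(self, frame):
        # Placeholder: a real system would match feature points in `frame`
        # against stored views of the object to recover its pose.
        return {"rotation": (0.0, 0.0, 0.0), "translation": (0.0, 0.0, 0.5)}

    def render(self, pose):
        # Placeholder for rendering the virtual artefact under `pose`.
        return {"artefact": True, "pose": pose}

    def composite(self, frame, rendering):
        # Overlay the rendered artefact on the user's view.
        return {"view": frame, "overlay": rendering}

    def process(self, frame):
        pose = self.estimate_pose(frame)          # pose steers the pipeline
        rendering = self.render(pose)             # artefact rendered remotely
        return self.composite(frame, rendering)   # final AR view sent back


def client_loop(video_frames, pipeline):
    """Client side: stream frames out, collect composited AR views."""
    return [pipeline.process(frame) for frame in video_frames]
```

The point of the design is visible in `client_loop`: the client does no computer vision or rendering at all, which is what allows the computationally intensive stages to be placed on whichever grid resource e-Viz selects.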
…reliable 15 FPS, where there is little latency. On congested networks e-Viz uses stricter compression algorithms, at a cost to the image quality, to try to maintain these usable frame rates.

6. Conclusions
In conclusion, we have found that our approach to producing a framework for AR has been very successful, provided that optimum conditions are available. Problems occur when trying to run the pose estimation locally: it is simply too computationally intensive and so cannot keep up with the real-time video. Distributing this calculation to a more powerful grid resource has solved this problem.
Future work will concentrate on improving the efficiency and reliability of the feature point detection algorithms, ensuring that we have more accurate pose estimation between frames. We also need to introduce heuristics that will help predict the position of the virtual artefact, even if we are unable to calculate the pose of the object, by building up a knowledge base of previous frames.

Acknowledgements
This research is supported by the EPSRC within the scope of the project "An Advanced Environment for Enabling Visual Supercomputing" (GR/S46567/01). Many thanks to the e-Viz team at Manchester, Leeds and Swansea for their help and support. We would like to thank Jason Lauder and Bob Rafal from the School of Psychology, University of Wales, Bangor, for allowing us access to their TMS laboratory and for supplying data from past experiments.

…"learning-based approach", in Proceedings of the IEEE and ACM International Symposium on Mixed and Augmented Reality, pp. 295-304, 2002.
[7] Striker, D., Kettenbach, T., "Real-time and marker-less vision-based tracking for outdoor augmented reality applications", in Proceedings of the IEEE and ACM International Symposium on Augmented Reality, pp. 189-190, 2001.
[8] Riding, M., Wood, J., Brodlie, K., Brooke, J., Chen, M., Chisnall, D., Hughes, C., John, N.W., Jones, M.W., Roard, N., "e-Viz: Towards an Integrated Framework for High Performance Visualization", Proceedings of the UK e-Science All Hands Meeting 2005, EPSRC, ISBN 1-904425-53-4, pp. 1026-1032.
[9] Zitova, B., Flusser, J., Kautsky, J., Peters, G., "Feature point detection in multiframe images", Czech Pattern Recognition Workshop, February 2-4, 2000.
[10] Hodges, K. I., "Feature-Point Detection Using Distance Transforms: Application to Tracking Tropical Convective Complexes", American Meteorological Society, March 1998.
[11] Moravec, H.P., "Towards Automatic Visual Obstacle Avoidance", Proc. 5th International Joint Conference on Artificial Intelligence, p. 584, 1977.
[12] Moravec, H.P., "Visual Mapping by a Robot Rover", International Joint Conference on Artificial Intelligence, pp. 598-600, 1979.
[13] Harris, C., Stephens, M., "A Combined Corner and Edge Detector", Proc. Alvey Vision Conf., Univ. Manchester, pp. 147-151, 1988.
[14] Trajkovic, M., Hedley, M., "Fast Corner Detection", Image and Vision Computing, Vol. 16(2), pp. 75-87, 1998.
Jeremy Cohen (a), Claire James (b), Shamim Rahman (b), Vasa Curcin (a), Brian Ball (b), Yike Guo (a), John Darlington (a)
(a) London e-Science Centre, Imperial College London, South Kensington Campus, London SW7 2AZ, UK
(b) AEA Technology Rail, Central House, Upper Woburn Place, London WC1H 0JN, UK
Email: jeremy.cohen@imperial.ac.uk

Abstract. The UK's railway network is extensive and utilised by many millions of passengers each day. Passenger and train movements around the network create large amounts of data: details of tickets sold, passengers entering and exiting stations, train movements, etc. Knowing how passengers want to use the network is critical in planning services that meet their requirements. However, understanding and managing the vast amounts of data generated daily is not easy using traditional methods. We show how, utilising e-Science methods, it is possible to make understanding this data easier and to help the various stakeholders within the rail industry to more accurately plan their operations and offer more efficient services that better meet the requirements of passengers.
e-Science and promote easier access to high performance computing resources for carrying out large-scale science.
The domain-specific knowledge of the required analysis is held by the project members from AEA Technology Rail. Our aim is to encapsulate this knowledge within the computational platform, in the form of workflows. We worked in pairs consisting of an e-Science and a rail data specialist in order to transfer the analysis description into Discovery Net workflows. The collaborative nature of the project, combined with time constraints, led us to see this as the most practical solution.
A workflow is a description of the processes and data/control flows required to carry out a task. The building blocks of workflows are components. In Discovery Net, a component can be a data source or a process applied to some data. A large toolbox of processes is available within the system, and custom components can be built. One way of developing a custom component is to use the scripting language Groovy [7]. Workflows are built in the InforSense KDE client software using an intuitive drag-and-drop interface to select components from the toolbox and add them into the workflow.

4.1 System Platform
A computational platform has been set up specifically for the project. The InforSense KDE software system has been deployed on a 48-processor Sun Sparc IV based server operated by the London e-Science Centre. This multiprocessor system allows for faster operation of the InforSense engine when executing workflows that exhibit parallel features. The InforSense software can connect to distributed data sources running on different servers in different locations. In our demonstration environment, data is split between the London e-Science Centre's Oracle 10g server and data that has been imported directly into the InforSense server.

4.2 Workflow execution
Figure 1 shows the InforSense KDE client interface displaying one of the workflows we have developed. Execution of the workflows can be carried out on a multiprocessor system, which offers significant speed benefits over a standard system. The Discovery Net architecture is designed to take advantage of parallel machines by determining elements of a workflow that can be executed in parallel and then taking advantage of multiple processors on a system by executing these components concurrently. To simulate a fully distributed e-Science architecture, we have used an external Oracle database containing warehoused data as one of the data sources within our system.

5 Results
This work has shown how rail datasets can be combined in a systematic way, using the latest e-Science technologies, in order to maximise the useful information derived, and that this combination can provide a vital contribution to tackling a range of standard rail industry questions/studies with increased speed and accuracy. The project outputs can be separated into the following elements:
1. Comprehensive and consistent documentation of all major sources of passenger journey data existing in the rail industry.
2. Demonstration of how these may be combined to maximise the useful information, for two particular examples:
(a) The ticket sales database (LENNON) and the London Area Travel Survey.
(b) The national rail timetable and on-train automatic passenger counts.
3. Confirmation that a wide range of rail industry questions/studies can be improved by better combination of passenger data sources.
4. Incorporation of an example combination of rail data sources into a system, in order to show how the wide range of different data sources may be combined systematically.
5. Illustrative prototype of the user interface, in the InforSense KDE software.
Presentation of the results of our workflow processing utilises InforSense KDE's Web-based portal, into which workflows can be published. A user logs into the system and is able to see in their Web browser the set of workflows to which they have been granted access privileges. The user can then change some attributes and execute the workflows available to them, but may not modify these workflows. Completed workflows, based on our rail industry use cases, provide a table of results and one or more visualisation options to allow industry decision makers to gain a simple visual view of the results of the analysis task.
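The component/workflow idea described in Section 4 can be illustrated with a small sketch. Discovery Net custom components are written in Groovy and assembled in the InforSense KDE client; the Python below is only a conceptual analogue, and the station data in the toy components is invented for illustration.

```python
from typing import Callable, List

# A component is anything that consumes a dataset and produces a dataset;
# a workflow is just a chain of components (data-flow composition).
Component = Callable[[list], list]


def make_workflow(components: List[Component]) -> Component:
    """Chain components so each consumes the previous component's output."""
    def run(data: list) -> list:
        for component in components:
            data = component(data)
        return data
    return run


# Toy components for a passenger-data analysis (illustrative data only).
def load_ticket_sales(_):
    return [{"origin": "KGX", "dest": "EDB", "sold": 120},
            {"origin": "EUS", "dest": "MAN", "sold": 80}]


def filter_busy(rows):
    return [r for r in rows if r["sold"] >= 100]


workflow = make_workflow([load_ticket_sales, filter_busy])
```

The appeal of this composition style, which the paper exploits, is that independent components can be identified automatically and scheduled concurrently on a multiprocessor system without changing the workflow description itself.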
Abstract
…ulation data and services within the framework of VO standards. In this paper, we describe initial work investigating the new protocols and the changes to existing standards required to enable the incorporation of simulated datasets into the VO.

2. Standards for Simulation Data
Simulations are frequently used today in all areas of astronomy: from the birth, evolution and death of stars and planetary systems, to the formation of the galaxies in which they reside and the formation of large-scale structures, the dark matter halos, in which galaxies themselves are thought to form. There is much variety in the processes being investigated and the underlying physics that governs them. Many different approaches have been chosen to tackle each problem, often employing very different algorithms to deal with the complex physics involved. There is clearly a huge amount of information that must be recorded in order to fully describe a simulation and its results, not all of which can be quantified in numerical terms. In order for simulated data to be included in the Virtual Observatory, we must first clearly identify all the different components that describe a simulated dataset. This is the purpose of defining an abstract data model for simulations (see Sec. 2.1).
At the IVOA interoperability meeting in Kyoto, the Theory Interest Group - charged with ensuring that the VO standards also meet the requirements of theoretical (or simulated) data - outlined a set of near-term targets with the aim of identifying where the existing IVOA standards and implementations must be updated to allow the discovery, exchange and analysis of simulated data. Computational cosmology is one example of an area of astronomy that should be a major beneficiary of an international virtual observatory. Many independent groups are working towards solving the problems of hierarchical structure formation, performing large-scale simulations as an integral part of their investigations.
However, the results of many of these simulations are not publicly available. Furthermore, of those that are, the data is stored in a wide variety of (mostly undocumented) formats and systems, often chosen for having been convenient at the time. Therefore the interchange and direct comparison of results by independent groups is uncommon, as much effort is often required in obtaining, understanding and translating data in order for it to be of any use. A primary goal of the IVOA's Theory Interest Group is thus to decide upon a standard file format for raw simulation data and a metadata language (based on the Universal Content Descriptors used for observational data [4]) with which to describe the contents. Based on this, a set of requirements of the standards being developed by the Data Access Layer (DAL) and Virtual Observatory Query Language (VOQL) working groups within the IVOA have been identified, so that simulation data can seamlessly be discovered and retrieved through VO portals.
In this section, we describe the key standards for data discovery, querying and retrieval that have been developed and approved by the IVOA, and discuss whether, in their current form, they fully support the incorporation of simulated data in the VO, as outlined above. Although the standards discussed here do not cover the full range of those being developed by the IVOA, they are those most relevant to the differences between observed and simulated data.

2.1. Data Modelling and Metadata
In order for simulated data to be included in the Virtual Observatory, we must first clearly identify all the different components that describe a simulated dataset. To this purpose, initial attempts have been made to construct an abstract data model for simulations. In Figure 1 we show the current iteration of the 'Simulation' data model. This model was developed using the corresponding data model defined for observational data, Observation [2], as our starting point, modifying it where necessary. It is hoped that this will ensure that Simulation has a similar overall structure to Observation, differing only in the detail. There are two purposes to this approach. Firstly, it is hoped that a similarity between data models will aid the process of comparing simulated and observed datasets. Secondly, it maintains the possibility of defining an overall data model for astronomical data, real or synthetic.
A simulation can essentially be broken down into three main categories: Observation Data, Characterisation and Provenance. Observation Data describes the units and dimensions of the data. It inherits from the Quantity data model (currently in development [3]), which assigns the units and metadata to either single values or arrays of values. Characterisation describes not only the ranges over which each measured quantity
Figure 1: The principal components of the Simulation data model (see text)
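The component hierarchy of the Simulation data model, as described in the surrounding text, can be sketched as nested records. The dataclasses below are an illustrative rendering only; the field types and the example values in the usage note are assumptions, not part of the IVOA model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Resolution:
    """Scales below which numerical effects dominate the results."""
    spatial: float
    mass: float


@dataclass
class Coverage:
    bounds: dict               # range of values occupied by the simulation data
    physical_parameters: dict  # constants defining 'the Universe' simulated


@dataclass
class Characterisation:
    coverage: Coverage
    resolution: Resolution


@dataclass
class Provenance:
    theory: str       # reference to a publication describing the physics
    computation: str  # hardware/software on which the simulation ran
    objective: str    # what the simulation was performed to investigate


@dataclass
class Simulation:
    observation_data: List[dict]  # quantities with units and metadata
    characterisation: Characterisation
    provenance: Provenance
```

A dataset description is then a single nested `Simulation` instance, mirroring how the model separates what was measured (Observation Data), where it is valid (Characterisation) and how it was produced (Provenance).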
is valid, but also how precise and how accurate they are. It is composed of Coverage and Resolution. These represent the different parameters constraining the data. Resolution describes the scales at which it is believed the simulation results begin to become significantly influenced by errors due to numerical effects, for example due to the finite size and mass of simulation grid-points or particles. Coverage describes the area of the Characterisation parameter space that the simulation occupies. It is itself composed of Bounds, which describes the range of values occupied by the simulation data, and PhysicalParameters, which consists of the set of physical constants that define ‘the Universe’ occupied by the simulation.

The third major component of Simulation is Provenance, which describes how the data was created. It consists of Theory, Computation and Objective. Theory represents a description of the underlying physical laws in the simulation. It is expected to consist of a reference to a publication or resource describing the simulation. Computation describes the technical aspects of the simulation and has three components: Algorithm, TechnicalParameters and Resources. Algorithm describes the sequence of numerical techniques used to evolve the simulation from one state to the next. It is expected that this too will contain a reference to a published paper or resource. TechnicalParameters are quantities representing the inputs to the algorithms, such as ‘number of particles’ and ‘box size’. Resources describe the specifications of the hardware on which the simulation was performed. Objective describes the overall purpose of the simulation: what was the purpose of the simulation? What were the phenomena that it was performed to investigate?

2.2. Metadata: UCDs for Simulations

In order to describe the quantities and concepts that are outlined in the data models and published in data archives, a restricted vocabulary – Universal Content Descriptors (UCDs [4]) – is being developed and controlled by the IVOA. UCDs are not designed to allocate units or names to quantities; they are meant to describe what the quantity is. The overall purpose of UCDs is to provide a standard means of describing astronomical quantities – whether it be the luminosity of a galaxy or the exposure time of the instrument with which it was observed – using a restricted vocabulary (to prevent the proliferation of words), whilst retaining the flexibility to enable precise, non-ambiguous descriptions of the vast range of quantities that occur in astronomical datasets. Their main goal is to ensure interoperability between heterogeneous datasets. If an astronomer is searching for catalogues containing a specific quantity, she can use the UCD for that quantity to locate all those that contain it, whether or not it was the main purpose of the observation.

UCDs consist of a string constructed by combining ‘words’, separated by semi-colons, from the controlled vocabulary. Individual words may be composed of several ‘atoms’ separated
by a period (.). The order of these atoms induces a hierarchy, in which each atom is a specific instance of its predecessor. Sequences of words, each representing a specific concept, are then combined to provide a description of what the actual quantity is. For example, stat.error;phot.mag;em.opt.V would refer to the error on the measurement of the photometric magnitude (brightness) of an object in the V-band of the optical region of the electromagnetic spectrum.

However, until now the UCD tree has been defined to deal specifically with observation-related quantities. Although there exists the flexibility to describe the physical quantities measured from simulations (as these are often the properties of the objects that were being simulated), it is not currently possible to describe the properties and parameters of the simulations themselves. This includes the input physical parameters (i.e. cosmological parameters) that define the theoretical context of the simulation, and the technical parameters that define its size, scope and resolution (e.g. number of particles, length of the simulation box side, time/redshift of a simulation output). We have therefore proposed a new branch of the UCD tree, ‘comp’, to encompass computational techniques in astronomy. This branch can be used to describe both astrophysical (and cosmological) simulations, and data reduction and post-processing algorithms for both simulated and observational data.

2.3. Simple Numerical Access Protocol

One of the key objectives of the Virtual Observatory is to provide uniform interfaces to the different forms of astronomical data. This is now being realised with the development of a standard means of retrieving astronomical images and spectra. We are now developing a prototype standard for retrieving raw simulation data from a variety of astronomical simulation repositories – the Simple Numerical Access Protocol (SNAP). SNAP is designed to enable uniform access to ‘raw’ simulation data. This includes the retrieval of all the particles or grid points within the simulation box at a particular timestep (known as a ‘snapshot’), a specified sub-volume of a simulation (all the particles/grid-points within a certain region), or data from a post-processed simulation. The latter typically includes catalogues of objects that have been identified within a larger simulation or within a suite of simulations.

Astrophysical and cosmological simulations clearly cover an enormous range of scales and processes. Consequently, there is no absolute spatial scale for simulations, in the way that there is for astronomical image data (position on the sky, distance from Earth). The most basic function that a SNAP service must provide is direct access to the raw particle data (or grid state) from a simulation output via an access URL.

However, due to rapidly improving hardware and fast, efficient parallel codes, the resolution, and therefore the file size, of a single snapshot can be extremely large. For example, many cosmological simulations (e.g. [5]) now contain 1024³ particles within the simulation box. The file containing their positions and velocities will then be at least 20 GB per timestep. It is therefore important to stress that, contrary to the procedure for retrieving observational images and data, simulation data cannot be retrieved via HTTP with some kind of encoding for the binaries (e.g. base64), since this is an extremely expensive operation when large datasets are being handled. FTP, or ideally GridFTP, must be used to retrieve the higher resolution simulations.

The SNAP standard is currently in the process of being defined and agreed through the IVOA. It is expected that v1.0 will be in place after the September 2006 Interoperability meeting.

Acknowledgements

LDS is supported by a PPARC e-Science studentship.

References

[1] Astrogrid: www.astrogrid.org
[2] http://www.ivoa.net/internal/IVOA/IvoaDataModel/obs.v0.2.pdf
[3] http://www.ivoa.net/twiki/bin/view/IVOA/IVOADMQuantityWP
[4] http://www.ivoa.net/twiki/bin/view/IVOA/IvoaUCD
[5] Bode, P. and Ostriker, J.P. 2003, ApJS, 145, 1
[6] http://www.ivoa.net/twiki/bin/view/IVOA/IvoaTheory
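The word/atom syntax of UCDs described in Section 2.2 lends itself to a simple illustration. The following sketch (not part of the IVOA work; no controlled-vocabulary checking is attempted) splits a UCD string into its words and atom hierarchies:

```python
# Illustrative parser for the UCD syntax described in Section 2.2:
# words are separated by semi-colons, atoms within a word by periods,
# and each atom is a more specific instance of its predecessor.

def parse_ucd(ucd: str) -> list:
    """Split a UCD string into words, and each word into its atom hierarchy."""
    words = [w.strip() for w in ucd.split(";") if w.strip()]
    return [word.split(".") for word in words]

# The example from the text: the error on the photometric magnitude
# of an object, measured in the optical V-band.
for atoms in parse_ucd("stat.error;phot.mag;em.opt.V"):
    # e.g. 'em.opt.V' reads: electromagnetic -> optical -> V-band
    print(" -> ".join(atoms))
```

Each printed line walks one word's hierarchy from the most general atom to the most specific.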
Abstract
The Common Instrument Middleware Architecture (CIMA) model is being adapted and extended for a remote instrument access system being developed as an Australian eScience project. Enhancements include the use of SRB federated Grid storage infrastructure, and the introduction of the Kepler workflow system to automate data management and provide facile extraction and generation of instrument and experimental metadata. The SRB infrastructure will in effect underpin a Grid-enabled network of X-ray diffraction instruments. The CIMA model is being further extended to include instrument control, and is being embedded as a component in a feature-rich GridSphere portal-based system for remote access.
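The abstract's pairing of automated data management with metadata extraction can be illustrated with a toy pipeline step. Everything below is hypothetical — the project itself uses Kepler workflows and SRB storage, not this code — but it shows the shape of the task: archive a newly collected image and generate experimental metadata alongside it.

```python
import hashlib
import json
import pathlib
import shutil

# Toy data-management step in the spirit described above. All names and
# metadata fields are illustrative assumptions, not CIMA's actual schema.

def archive_image(src: pathlib.Path, store: pathlib.Path, experiment: str):
    """Copy a collected frame into the store and write metadata beside it."""
    store.mkdir(parents=True, exist_ok=True)
    dest = store / src.name
    shutil.copy2(src, dest)                       # stand-in for an SRB put
    metadata = {
        "experiment": experiment,
        "file": dest.name,
        "size_bytes": dest.stat().st_size,
        "sha1": hashlib.sha1(dest.read_bytes()).hexdigest(),
    }
    (store / (src.stem + ".meta.json")).write_text(json.dumps(metadata))
    return metadata

# Usage: register a placeholder image frame into a local "store".
frame = pathlib.Path("frame_0001.img")
frame.write_bytes(b"\x00" * 1024)
meta = archive_image(frame, pathlib.Path("store"), experiment="xtal-17")
print(meta["size_bytes"])  # 1024
```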
middleware layer between the instrument and the network within which to standardize the representation of the instrument, to mediate access by downstream components, and to host extensions to the instrument's functionality. This middleware layer need not be co-located with the instrument, and could be elsewhere on the network. The CIMA model is intended to promote re-usability and hence reduce coding effort in changing instrument environments or eco-systems.

Storage Resource Broker technology is being utilised to link the participating instrument sites. At present the Data Manager for all sites is located at JCU with a production-scale SRB storage facility. It is intended that each site will run its own Data Manager and SRB storage facility, which is to be federated across the sites into a single shared data space. The security and access rights issues associated with this store are presently being considered.
VNC (5) (and its many variants), CITRIX (6), Tarantella (7) and NX (8), has the significant advantages of ease of set-up and familiarity. While convenient, these approaches are not ideal: they can afford remote instrument users excessive control over expensive and potentially dangerous instruments, and they have a relatively poor security model. A significant disincentive to building custom remote access systems is the high coding overhead, which may reproduce functionality already provided by the instrument manufacturer. A significant advantage of the custom-built interface approach, however, is that the actions of the remote instrument user can be tightly controlled, while at the same time services outside the desktop environment can be provided to offer a richer operating environment.

Accordingly, a GridSphere portal is being built that includes a portlet providing a purpose-built interface to a diffractometer instrument. The portlet tightly defines the actions of a remote user, and so minimises accidental damage and injury. It allows the user to enter data collection specifications and to execute those instructions. Live views of the instrument and crystal sample are provided, together with the most recent diffraction image (see screen shot in Fig. 4).

[Figure omitted: screen shot.]
Figure 5. Portlet Diffraction Image Browser

support for instrument control, and this involves including CIMA services as components in a modular portal architecture (see Fig. 6).

A modular SOA model using Web Services has been adopted to provide maximum flexibility, with Web Services providing language and location independence. Container communication is based on SOAP messages containing XML parcels. A data cache minimises the need to seek (get) updated status information from the instrument.
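The status-caching idea mentioned above can be sketched as a small time-to-live cache that answers repeated status queries without contacting the instrument each time. This is illustrative only; the instrument interface shown is hypothetical, not CIMA's actual API.

```python
import time

class StatusCache:
    """Cache instrument status for ttl seconds to avoid repeated gets."""

    def __init__(self, fetch, ttl=2.0):
        self.fetch = fetch           # callable that queries the instrument
        self.ttl = ttl
        self._value = None
        self._stamp = -float("inf")

    def get(self):
        now = time.monotonic()
        if now - self._stamp > self.ttl:
            self._value = self.fetch()   # only hit the instrument when stale
            self._stamp = now
        return self._value

# Hypothetical usage: count how often the instrument is actually queried.
calls = 0
def query_instrument():
    global calls
    calls += 1
    return {"shutter": "open", "temperature_K": 100}

cache = StatusCache(query_instrument, ttl=60.0)
for _ in range(1000):
    status = cache.get()
print(calls)  # 1: a thousand reads, one instrument query
```

The design choice matches the text: many portal clients can poll status freely while the instrument itself sees only one query per TTL window.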
References
(1) (a) Huffman, K.L., Gupta, N., Devadithya, T., Ma, Y., Chiu, K., Huffman, J.C., Bramley, R. and McMullen, D.F., Instrument Monitoring, Data Sharing and Archiving Using Common Instrument Middleware Architecture (CIMA): A new approach to modernize and standardize the data collection procedures in an X-ray Crystallography Laboratory. Journal of Chemical Information and Modeling, accepted February 27 2006. (b) CIMA: http://www.instrumentmiddleware.org
(2) Storage Resource Broker: http://www.sdsc.edu/srb
(3) Kepler workflow project: http://www.kepler-project.org
(4) Portable Grid Library: http://plone.jcu.edu.au/hpc/hpc-software/personal-grid-library/releases/0.3
(5) Virtual Network Computing: http://www.realvnc.com/ (also TightVNC, RealVNC, UltraVNC, and TridiaVNC)
(6) CITRIX: http://www.citrix.com
(7) Tarantella (now renamed Sun Secure Global Desktop): http://www.sun.com/software/products/sgd
(8) NoMachine (NX);
ShibVomGSite: A Framework for Providing Username and
Password Support to GridSite with Attribute-based
Authorization using Shibboleth and VOMS
Joseph Olufemi Dada & Andrew McNab
School of Physics and Astronomy, University of Manchester, Manchester UK
Abstract
GridSite relies on Grid credentials for user authentication and authorization. It does not support username/password pairs or attribute-based authorization. Since a Public Key Infrastructure (PKI) certificate cannot be installed on every device a user will use, access to GridSite-protected resources while attending conferences, at airport kiosks, working in an Internet café, on holiday, etc. is not possible without exposing the certificate's private key in multiple locations. Exposing the private key in multiple locations poses a security risk. This paper describes our work in progress, called ShibVomGSite: a framework that integrates Shibboleth, VOMS and GridSite to solve this problem. ShibVomGSite issues to users a username and a time-limited password bound to their Distinguished Name (DN) and Certificate Authority's DN in a database, which the users can later use to gain access to their attributes that are used for authorization.
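The abstract's central record — a username and a time-limited password bound to a user's DN and the CA's DN — might be sketched as follows. This is a toy in-memory illustration; the paper's MyIdentity service keeps these bindings in a MySQL database, and the identifier scheme here is an assumption.

```python
import hashlib
import os
import time

# Toy store binding (user DN, CA DN) -> (username, password hash, expiry).
store = {}

def issue_credentials(user_dn, ca_dn, lifetime_s=3600):
    """Issue a username and a time-limited password bound to the DNs."""
    username = hashlib.sha1(user_dn.encode()).hexdigest()[:12]
    password = os.urandom(12).hex()              # the time-limited password
    store[username] = {
        "user_dn": user_dn,
        "ca_dn": ca_dn,
        "pw_hash": hashlib.sha256(password.encode()).hexdigest(),
        "expires": time.time() + lifetime_s,
    }
    return username, password

def authenticate(username, password):
    """Return the bound DNs if the password is correct and not expired."""
    rec = store.get(username)
    if rec is None or time.time() > rec["expires"]:
        return None                              # unknown user or expired
    if hashlib.sha256(password.encode()).hexdigest() != rec["pw_hash"]:
        return None
    return rec["user_dn"], rec["ca_dn"]          # later used to fetch VOMS attributes

user, pw = issue_credentials("/C=UK/O=eScience/CN=A User", "/C=UK/O=eScience/CN=CA")
print(authenticate(user, pw) is not None)        # True while the password is valid
```

Because the password expires, a key left behind on a kiosk or café machine is only a short-lived risk, which is the motivation stated in the abstract.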
authorization process, VOMS and Shibboleth. The ShibVomGSite system is described in section 3, along with a description of how its components work together to achieve the objective of our work. A brief description of the prototypes of the MyIdentity service, the VASGS service and GAMAS is presented in section 4, and section 5 gives the conclusion and further work.

2 Background

In this section we present a brief description of authentication and authorization in GridSite and provide an overview of the two technologies that are relevant to our work: Shibboleth and VOMS. Detailed descriptions of Shibboleth, VOMS and GridSite can be found on the individual websites [1, 6-8].

2.1 GridSite Authentication and Authorization

Mutual authentication in GridSite is established based on Grid credentials that require the use of X.509 identity certificates [5]. A user needs to have a valid X.509 certificate, together with the corresponding private key, in order to prove his/her identity to the GridSite resources.

After the proof of identity, the user needs to be authorized to gain access to the GridSite resources based on the resource provider's access policy. The GridSite Apache module (mod_gridsite) implements authorization based on X.509, Grid Security Infrastructure (GSI) [9] and VOMS credentials. It uses GACL, the "Grid Access Control Language" [3] provided by the GridSite/GACL library. This allows access control to be specified in terms of attributes found in Grid credentials. The Access Control Lists (ACLs) consist of a list of one or more Entry blocks. When a user's credentials are compared to the ACL, the permissions given to the user by Allow blocks are recorded, along with those forbidden by Deny blocks. When all entries have been evaluated, any forbidden permissions are removed from those granted.

2.2 Shibboleth

Shibboleth is standards-based, open source middleware software which provides Web Single Sign-On (SSO) across or within organizational boundaries. It allows sites to make informed authorization decisions for individual access to protected online resources in a privacy-preserving manner [4]. Shibboleth consists of two major software components: the Identity Provider (IdP) and the Service Provider (SP). The two components are deployed separately but work together to provide secure access to Web-based resources. The operation of Shibboleth is based on the Security Assertion Markup Language (SAML) standard [10], published by OASIS [11]. The principle behind SAML's design, and Shibboleth's, is federated identity. Federated identity technology permits organizations with disparate authentication and authorization methods to interoperate, thereby extending the capability of each organization's existing services rather than replacing them.

Shibboleth exchanges attributes across domains for authorization purposes. Its architecture is dependent on PKI, which it uses to build trust relationships between the several Shibboleth components of the Federation members. Figure 1 shows the authentication and authorization process in Shibboleth. As shown in the figure, an IdP, normally located at the origin site, identifies the users, while the SP at the target site protects the resources. When a user accesses a shibbolized resource at the target site for the first time, the Shibboleth Indexical Reference Establisher (SHIRE) directs the user to the Where Are You From (WAYF) service to pick his/her domain (origin site). The user's browser is then redirected to the authentication server at the origin site for user authentication. After the user is authenticated, the browser is redirected back to the target site along with the user's handle and an authentication assertion. The authentication assertion is proof to the target site that the user has been successfully authenticated. The Shibboleth Attribute Requester (SHAR) at the target site uses the user's handle to request the attributes of the user from the Attribute Authority (AA) at the origin site. The attributes are then passed to the Shibboleth Authorization (ShibAuthZ) module, which will make an access control decision based on these attributes. GAMAS takes over the function of ShibAuthZ in the ShibVomGSite system.
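The GACL evaluation rule described in section 2.1 — record the permissions granted by Allow blocks and those forbidden by Deny blocks across all entries, then strip the forbidden ones — can be sketched as follows. This is an illustration of the evaluation order only; matching an entry against real Grid credentials is reduced here to a simple predicate.

```python
# Sketch of the GACL evaluation order described in the text. Forbidden
# permissions are removed only after *every* entry has been evaluated,
# so a Deny in any matching entry overrides an Allow in another.

def evaluate_acl(entries, credential):
    granted, forbidden = set(), set()
    for entry in entries:                    # evaluate all entries first
        if entry["matches"](credential):
            granted |= set(entry.get("allow", []))
            forbidden |= set(entry.get("deny", []))
    return granted - forbidden               # Deny always wins

acl = [
    {"matches": lambda c: c["vo"] == "gridpp", "allow": ["read", "list"]},
    {"matches": lambda c: c["dn"].endswith("CN=A User"), "deny": ["list"]},
]
perms = evaluate_acl(acl, {"vo": "gridpp", "dn": "/C=UK/CN=A User"})
print(sorted(perms))   # ['read'] — 'list' was allowed but then denied
```

Evaluating every entry before subtracting is what makes the ordering of Entry blocks irrelevant to the outcome.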
[Figure omitted: message-flow diagram. (1) the user's request arrives at SHIRE; (2) the WAYF asks for the user's domain; (3) the user authenticates at the origin-site authentication server; (4a) the handle is registered with the Handle Service and (4b) the user's handle and authentication assertion are returned to SHIRE; (5) the user's handle and AA information are passed on; (6) SHAR sends the handle and attribute request to the Attribute Authority; (7) the user's attributes are returned; (8) the authorization request plus the user's attributes are passed to ShibAuthZ. IdP components sit at the origin site, SP components at the target site.]
Figure 1: Architecture of Shibboleth
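The numbered flow of Figure 1 can be condensed into a toy end-to-end walk-through. All components below are in-process stand-ins for illustration, not the real Shibboleth software, and the attribute names are assumptions.

```python
import uuid

# Toy stand-ins for the Figure 1 components: the IdP authenticates the
# user and issues an opaque handle; the SP's SHAR then exchanges that
# handle for attributes at the Attribute Authority, and ShibAuthZ decides.

class IdP:
    def __init__(self, users):
        self.users = users                 # username -> attributes
        self.handles = {}                  # opaque handle -> username

    def authenticate(self, username):      # steps 3-4: authn, handle issued
        handle = uuid.uuid4().hex
        self.handles[handle] = username
        return handle

    def attribute_authority(self, handle): # steps 6-7: handle -> attributes
        return self.users[self.handles[handle]]

def shib_authz(attributes):                # step 8: access control decision
    return "staff" in attributes.get("affiliation", [])

idp = IdP({"alice": {"affiliation": ["staff"]}})
handle = idp.authenticate("alice")         # user arrived via SHIRE and WAYF
attrs = idp.attribute_authority(handle)    # SHAR requests the attributes
print(shib_authz(attrs))                   # True: access granted
```

Note that the SP never learns the username, only the opaque handle and the released attributes — the privacy-preserving property mentioned in section 2.2.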
[Figure omitted: message-flow diagram. (1) the user requests a resource at the target site and (2) is redirected for authentication; (3) the MyIdentity service at the origin site authenticates the user; (4) the SP requests attributes using the handle; (5) the IdP retrieves the user's DN and issuer's DN; (6-7) the VASGS service obtains and returns the user's attributes; (8-9) the attributes and authorization decision request are passed to GAMAS, which returns the decision; (10) the user is granted access to the resource.]
Figure 2: ShibVomGSite Architecture
[Figure omitted: component diagram of the MyIdentity Service. The user interacts with the MyIdentityCertAuthT, MyIdentityPassword and MyIdentityAuthT modules, which reach the MyIdentityDB through the MyIdentityDBInterface.]
Figure 3: User interaction with MyIdentity Service
The user has the opportunity of managing his/her information in the MyIdentityDB using the MyIdentityPassword module.

The Shibboleth IdP uses the MyIdentityAuthT Apache module for the authentication of users whenever they attempt to access GridSite-protected resources on the target site. The MyIdentityAuthT Apache module is based on the Apache module mod_auth_mysql [14]. The module is integrated with the MyIdentityDB, which contains the username/password and other user information. A full paper on the MyIdentity service is in preparation.

access GridSite resources; attributes are pulled directly from the VOMS. Our VASGS service therefore allows the IdP to use VOMS as an Attribute Repository.

[Figure omitted: the VASGS ConnectorPlugIn within the IdP.]
[Figure omitted: component diagram. The IdP/origin site hosts the MyIdentityAuthT module, Handle Service, Attribute Authority and the VASGS VOMAttributeService, reached via the VASGS ConnectorPlugIn; the SP/target site hosts SHIRE, SHAR, mod_shib and GAMAS (mod_shib_gridsite with the GridSite library), which replaces ShibAuthZ.]
Figure 5: Integration of GAMAS with Service Provider
4. After the decision, the granted/denied decision is returned to mod_shib_gridsite.

5. mod_shib_gridsite returns the decision to the Apache server. The user is then granted or denied access to the target resource according to the result of the decision.

4 Prototype Implementation

We have implemented prototypes of the MyIdentity service, the VASGS service and GAMAS. In this section, we briefly describe our implementation.

The MyIdentity service prototype is implemented in C. The database server used is MySQL. Users can log in with their certificates and obtain a username and password. The MyIdentityCertAuthT module is a Common Gateway Interface (CGI) script that connects to the MyIdentityDB using MyIdentityDBInterface. MyIdentityPassword, which allows users to manage their records, is also a CGI script that uses MyIdentityDBInterface to interact with the MyIdentityDB. The MyIdentityAuthT module is the authentication server for Shibboleth, and is based on the Apache module mod_auth_mysql [14].

The VASGS service is implemented in Java and is based on Web Service technology. It has two sub-components, as described earlier: the VASGS VOMAttributeService, which resides with the VOMS server, and the VASGS ConnectorPlugIn, which Shibboleth uses to invoke the VASGS VOMAttributeService on the VOMS server. VOMAttributeService runs inside a Tomcat web container, just like the other services provided by the VOMS server, while ConnectorPlugIn is a Java class that Shibboleth invokes to connect to the VOMAttributeService.

GAMAS (mod_shib_gridsite and the gridsite library) is implemented in C as an Apache module. It is an extension of mod_gridsite. As described earlier, it receives users' attributes from Shibboleth to make the authorization decision.

5 Conclusion and Further Work

We have presented the ShibVomGSite framework, which provides username/password support and attribute-based authorization to GridSite. This framework allows users access to GridSite-protected resources anywhere and anytime using a time-limited password.

The MyIdentity service binds the user's DN and CA's DN with the username and time-limited password in a database (MyIdentityDB). It also serves as an attribute repository for the IdP, and provides the DN and CA's DN used as parameters to obtain VOMS attributes for users with the help of the VASGS service. We also described GAMAS, which uses the VOMS attributes received from Shibboleth for authorization. GAMAS receives the attributes through the Shibboleth SP (mod_shib), carries out the authorization process and returns the decision result to the Apache server, which grants or denies the user's request depending on the result of the authorization decision.

The work presented in this paper is a work in progress. Efforts continue to develop the existing prototype into a full working system suitable for deployment. We are also working on integrating the ShibVomGSite system with the Flexible Access Middleware Extension (FAME) [15].

6 Acknowledgements

This work was funded by the Particle Physics and Astronomy Research Council through their GridPP programme. Our thanks also go to other members of the various EDG and EGEE security working groups for providing much of the wider environment for this work.

7 References

[1] Grid Security for the Web, Web Platform for Grid, http://www.gridsite.org/
[2] McNab, A., The GridSite Web/Grid Security System, Software: Practice and Experience, http://www.interscience.wiley.com, 35:827-834, 2005.
[3] McNab, A., "Grid-Based Access Control and User Management for Unix Environments, File systems, Web Sites and Virtual Organizations", in Proceedings of CHEP 2003, La Jolla, CA, 2003.
[4] UK Computing for Particle Physics, http://www.gridpp.ac.uk/
[5] Housley, R., Polk, W., Ford, W. and Solo, D., Internet X.509 Public Key Infrastructure Certificate and Certificate Revocation List (CRL) Profile. RFC 3280, IETF, 2002.
[6] Alfieri, R., Cecchini, R., Ciaschini, V., Spataro, F., dell'Agnello, L., Frohner, A. and Lörentey, K., From gridmap-file to VOMS: managing authorization in a Grid environment, http://www.cnaf.infn.it/~ferrari/seminari/griglie05/lezione02/vomsFGCS.pdf, 2004.
[7] EDG VOMS-ADMIN Developer Guide, http://edg-wp2.web.cern.ch/edg-wp2/security/voms/edgvomsadmindevguide.pdf
[8] Shibboleth Project, Internet2, http://shibboleth.internet2.edu/
[9] Welch, V., Siebenlist, F., Foster, I., Bresnahan, J., Czajkowski, K., Gawor, J., Kesselman, C., Meder, S., Pearlman, L. and Tuecke, S., Security for Grid Services. In International Symposium on High Performance Distributed Computing, 2003.
[10] Assertions and Protocol for the OASIS Security Assertion Markup Language (SAML) V1.1, http://www.oasis-open.org/committees/download.php/3406/oasis-sstc-saml-core-1.1.pdf
[11] OASIS, http://www.oasis-open.org/
[12] Ciaschini, V., A VOMS Attribute Certificate Profile for Authorization, http://infnforge.cnaf.infn.it/docman/view.php/7/58/ACRFC.pdf, 2004.
[13] Novotny, J., Tuecke, S. and Welch, V., "An Online Credential Repository for the Grid: MyProxy", http://www.globus.org/alliance/publications/papers/myproxy.pdf
[14] mod_auth_mysql: http://modauthmysql.sourceforge.net/
[15] FAME-PERMIS, Flexible Access Middleware Extension to PERMIS, http://www.fame-permis.org/
Abstract
By using Web services, people can generate flexible business processes whose activities are scattered
across different organizations, with the services carrying out the activities bound at run-time. We refer to
an execution of a Web service based automatic business process as a business session (multi-party
session). A business session consists of multiple Web service instances which are called session partners.
Here, we refer to a Web service instance as being a stateful execution of the Web service. In [8], we
investigate the security issues related to business sessions, and demonstrate that security mechanisms are
needed at the instance level to help session partners generate a reasonable trust relationship. To achieve
this objective, an instance-level authentication mechanism is proposed. Experimental systems are
integrated with both the GT4 and CROWN Grid infrastructures, and comprehensive experimentation is
conducted to evaluate our authentication mechanism. Additionally, we design a policy-based
authorization mechanism based on our instance-level authentication mechanism to further support
trustworthy and flexible collaboration among session partners involved in the same business session. This
mechanism allows an instance invoker to dynamically assign fine-grained access control policies to the newly invoked instance so as to grant other session partners the necessary permissions.
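The instance-level authentication idea evaluated in this paper — a Session Authority hands the same session secret to every instance joining a business session, and instances then prove membership to one another — can be sketched as follows. This is a simplification using HMACs over nonce-prefixed messages; the real protocol and message formats are those defined in the paper and in Hada and Maruyama's scheme, not this code.

```python
import hashlib
import hmac
import os

# Toy Session Authority: issues one shared secret per business session,
# which session partners use to MAC their messages.
class SessionAuthority:
    def __init__(self):
        self.secrets = {}

    def new_session(self, session_id):
        self.secrets[session_id] = os.urandom(32)

    def join(self, session_id):
        # In the real protocol the secret is forwarded to a newly
        # invoked instance over a protected channel.
        return self.secrets[session_id]

def mac(secret, message, nonce):
    # The nonce guards freshness; the MAC guards integrity and membership.
    return hmac.new(secret, nonce + message, hashlib.sha256).digest()

sa = SessionAuthority()
sa.new_session("order-42")
alice, bob = sa.join("order-42"), sa.join("order-42")

nonce, msg = os.urandom(16), b"invoke shipping service"
tag = mac(alice, msg, nonce)
print(hmac.compare_digest(tag, mac(bob, msg, nonce)))   # True: same session

sa.new_session("order-99")
eve = sa.join("order-99")
print(hmac.compare_digest(tag, mac(eve, msg, nonce)))   # False: other session
```

An instance from a different business session (or an outside attacker) holds a different secret, so its messages fail verification — the session-boundary property the paper builds on.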
system. Experimental results are also presented and analysed. In Section 4, we briefly discuss the limitations of existing authorization systems in business process management, and then introduce our instance-level authorization system. Finally, Section 5 concludes the paper.

2. Instance-Level Authentication

Authentication is always a fundamental concern when a security infrastructure for distributed systems is designed. In order to validate the identities of the principals in distributed systems, authentication systems are employed. Without loss of generality, we regard an automatic business process as an automatic distributed system, in which the principals working within a business session are Web service instances. In this section, we discuss the authentication issues for Web service instances within business sessions and propose an instance-level authentication mechanism.

2.1 Authentication Issues in Web Service Business Sessions

When a Web service receives an initial request from a business session, it normally generates a new instance to deal with this request. This instance is then involved within the business session and may also deal with other requests from its session partners. Consider that the session partners may be generated and bound at run-time, and cooperate in a peer-to-peer way [1]. Moreover, the specification of the communication amongst the session partners cannot, on some occasions, be precisely defined in advance. Thus, there are cases where a service instance has to deal with requests from other instances with which it has never communicated before [8]. When an instance receives such a request, it needs an authentication mechanism to help it verify the identity of the sender, in case the request was sent from an instance in another business session by mistake, or from a malicious attacker.

Essentially, there are two goals for an authentication system [7]. One is to establish the identities of the principals; the other is to distribute secret keys to the principals. Such secret keys can be used in further communication amongst the principals, or used to generate new short-term secrets. In conventional authentication mechanisms (e.g., PKI, Kerberos, etc.), the identities of the users and the secrets (e.g., passwords) used in the authentication processes are generated and deployed in advance. For example, in PKI, the certificate authority can generate a public key pair and sign a certificate for every user. The public key pair is forwarded to the user out-of-band, and the certificate is stored in a directory server and published over the network. Thus users can use the certificates and key pairs obtained in advance to prove their identities during communication with other users. In contrast to these classical authentication mechanisms, instance-level authentication systems have to generate and distribute both the identities of the instances and the secrets used to prove those identities at run-time, as the instances are dynamically generated.

Most current Web service systems have not yet carefully considered authentication issues in business sessions. In GT4, service instances can be identified by their resource keys, but the solutions to proving possession of the identifiers are not standardized. We can use user certificates to generate proxy credentials at run-time for every newly generated instance, and thus verify the identities of service instances. However, the proxy certificate solution cannot be directly used to enforce the security boundaries of business sessions, as there is no notion of a business session in this solution at all.

2.2 Instance-Level Authentication in [5]

Hada and Maruyama [5] propose a session authentication mechanism that can realize basic authentication functions for session partners. In their protocol a Session Authority (SA) component is introduced to take charge of distributing session authentication messages. The SA generates a session secret for every business session. When a new service instance is invoked and involved within a business session, the associated session secret will be forwarded to the instance, and thus the instance can use this secret to authenticate itself to its session partners. This authentication protocol can effectively authenticate the session partners working within the same business session and prevent uncontrolled message transport across the boundaries of business sessions. However, there are still many security issues in the area of business session management which need to be carefully considered. For instance, Hada and Maruyama's protocol can only distinguish the session partners in one business session from the service instances in other business sessions. However, in some scenarios fine-grained control over the session partners is required. For example, a Web service may have
466
Proceedings of the UK e-Science All Hands Meeting 2006, © NeSC 2006, ISBN 0-9553988-0-0
invoked instance and waits for the response (message 6). Compared with the original instance-invoking process in Grid, two additional messages (messages 3 and 4) are introduced by the session authority. Note that message 1 needs to be protected by additional security measures, because the new resource has not yet been generated and thus the MAC associated with the message cannot be calculated. After the resource has been generated, the integrity and freshness of messages 2~6 can be protected by using MACs and nonces.

We have implemented two experimental systems. The first experimental system (ES1 for short) consists of a SA service and three … We deploy the experimental systems on a single …

… messages. Secondly, if an experimental system is deployed on different computers, operations in the system may be executed in a concurrent fashion. Deploying the experimental system can help us to evaluate the performance of the systems in the worst case, where all the operations of the system are executed sequentially. In this sub-section all the experiments are deployed on a single computer unless mentioned otherwise.

[Figure: ES1 on a single computer (no additional security protocol); time in milliseconds against number of instances generated]

[Figure: time in milliseconds against number of instances generated]
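The MAC-and-nonce protection described above can be sketched with a keyed hash. The following is an illustrative reconstruction, not the authors' implementation: the use of HMAC-SHA256, the 16-byte nonces, and the `protect`/`verify` helpers are all our own assumptions.

```python
import hmac
import hashlib
import secrets

# Hypothetical session secret, as would be distributed by the
# session authority (SA) when an instance joins a business session.
session_secret = secrets.token_bytes(32)

seen_nonces = set()  # replay cache used for the freshness check


def protect(message: bytes) -> tuple[bytes, bytes, bytes]:
    """Attach a fresh nonce and a MAC computed over (nonce || message)."""
    nonce = secrets.token_bytes(16)
    mac = hmac.new(session_secret, nonce + message, hashlib.sha256).digest()
    return message, nonce, mac


def verify(message: bytes, nonce: bytes, mac: bytes) -> bool:
    """Accept only messages with a valid MAC and a previously unseen nonce."""
    if nonce in seen_nonces:
        return False  # replayed message: freshness violated
    expected = hmac.new(session_secret, nonce + message, hashlib.sha256).digest()
    if not hmac.compare_digest(mac, expected):
        return False  # sender does not hold the session secret
    seen_nonces.add(nonce)
    return True
```

A receiver that tracks the nonces it has already accepted rejects both replayed and tampered messages, which is the freshness-plus-integrity property claimed here for messages 2~6.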
When ES1 is incorporated with the secure conversation protocol, that is, when the secure conversation protocol is used to generate signatures for every message transported in the experiment, the time consumption of the experimental system starts increasing non-linearly after over 720 instances are generated. The same phenomenon occurs in ES2 after 1,040 instances are generated when ES2 is incorporated with the secure conversation protocol. As illustrated in Figures 4 and 5, the time consumption of ES1 is about twice that of ES2, while ES1 can realize fine-grained authentication at the instance level.

CROWN + Secure Message

We have successfully implemented our design in the large-scale CROWN (China Research and Development environment Over Wide-area Network) Grid. CROWN aims to promote the utilization of valuable resources and the cooperation of researchers nationwide and world-wide. In the experiments presented in … system, the scalability of the authentication system is still very good.

[Figure 6: Comparison between ES1 and ES2 on CROWN (single computer, no additional security protocol); time in milliseconds against number of instances generated]

[Figure: ES1 and ES2 on CROWN (single computer, secure message with signature); time in milliseconds against number of instances generated]

After evaluating the performance of the experimental systems deployed on a single computer, we deploy the experimental systems in a distributed way and further evaluate them in a more realistic environment. In the experiments introduced in this sub-section, the experimental systems are distributed unless mentioned otherwise. The scalability of the experimental systems deployed in a distributed way is much better than on a single computer. As illustrated in Figure 8, the time consumption of the distributed ES1 increases linearly until more than 70,000 new instances are generated. When incorporated with the secure conversation protocol (see Figure 9), ES1 executes stably until over 3,000 instances have been generated. In the experiments shown in Figures 10 and 11, the time consumption of ES1 is proportional to the number of the newly generated instances until over 300,000 instances are generated.

[Figure 8: ES1 deployed on GT4 (distributed, no additional security protocol); time in milliseconds against number of instances generated]

[Figure 9: ES1 deployed on GT4 + Secure Conversation (distributed, secure conversation with signature); time in milliseconds against number of instances generated]

[Figure: ES1 and ES2 on CROWN (distributed, no additional security protocol); time in milliseconds against number of instances generated]
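The figures above plot cumulative time in milliseconds against the number of instances generated. As a purely illustrative sketch of a benchmark with this shape (ES1, ES2, CROWN and GT4 are not reproduced here; `admit_instance` and its HMAC-based authentication step are our own stand-ins), one might write:

```python
import hashlib
import hmac
import secrets
import time

session_secret = secrets.token_bytes(32)


def admit_instance(instance_id: int) -> bytes:
    """Simulate admitting one instance to a business session:
    MAC an invocation message with the session secret, standing in
    for the per-message session authentication work."""
    msg = b"invoke instance %d" % instance_id
    return hmac.new(session_secret, msg, hashlib.sha256).digest()


def run(n_instances: int) -> float:
    """Return total elapsed seconds to admit n_instances sequentially."""
    start = time.perf_counter()
    for i in range(n_instances):
        admit_instance(i)
    return time.perf_counter() - start
```

Since each admission performs a fixed amount of MAC work, total time grows roughly linearly with the number of instances, which is the behaviour reported above for the distributed deployments without the secure conversation protocol.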
… control relationship between instances is more flexible. With the assistance of the instance-level authentication system, our authorization mechanism can define the business session constraints within the access control policies.

4.2 Design of Authorization Mechanism

As Web services are loosely coupled and can be bound at run-time, the number of the session partners involved within a business session may keep changing. Therefore, it is difficult for an instance to obtain real-time knowledge about its session partners when it is involved in a peer-to-peer business session. For instance, when an invoker sets the access control policies for a new invoked instance, it may not know the identifiers of the session partners who are going to access this new instance. On extreme occasions, the instances supposed to access the new instance have not even been spawned yet. Therefore, identity-based authorization systems are not suitable, and we propose an attribute-based authorization mechanism instead.

We adopt XACML (eXtensible Access Control Markup Language) to express fine-grained access control policies. As illustrated in Figure 12, our authorization system consists of two components, the authorization service (AuthzService for short) and the SA service.

… the AuthzService; the PDP then makes an authorization decision according to this authenticated Token. In practice, the AuthzService can be implemented in various ways. For example, the AuthzService can be implemented as a centralized service which deals with the requests from all the session partners within the business session. The AuthzServices can also be implemented in a decentralized way; that is, each AuthzService only manages the service instances within a local domain, and multiple AuthzServices are associated with a business session.

Figure 13: An example of the authorization process

Figure 13 presents an example of the authorization enforcement. In this example there are three instances in a business session: A, B and C. A invokes B, and specifies an access control policy encoded with XACML for the newly generated B. C also intends to access B. Assume that A has stored the policy into the authorization service of B (step 0 in Figure 13). The steps of the authorization enforcement in Figure 13 are presented as follows:
that B assigned in advance and returns the authorization result back.

During the processes of authentication and authorization, SAML (Security Assertion Markup Language) is used to describe the related security assertions in the messages. For example, the attributes (e.g., identifier) of an instance ought to be encapsulated according to the SAML specification.

5. Conclusion

Conventionally, business process supporting techniques (e.g., workflow) mainly deal with the issues of implementing business processes within an organization, and business processes supported by these techniques are normally implemented in the client/server model. A Web service business process may have to execute in a cross-organizational environment where the client/server model is not suitable. Web services have to achieve business objectives through peer-to-peer cooperative interactions [1, 3]. The structures of Web service business processes thus can be more flexible and complex than those of conventional business processes. Because session partners cooperate in a peer-to-peer way, a session partner may lack real-time information about its session partners. All these issues bring new challenges to security systems, and many security issues need to be carefully considered. In this paper, we present our efforts in instance-level authentication and authorization. An instance-level authentication system is proposed. Compared with traditional authentication systems, this system is able to generate the identities for session partners and authenticate them at run-time. Our instance-level authentication system can help a service instance to distinguish its session partners from the instances in other sessions, and thus generate a security boundary for business sessions. However, considering the potential competition among session partners and many other security requirements (e.g., the separation of duty), we propose an instance-level authorization system so as to further improve the security of business sessions. In most conventional authorization systems for business processes, the access control relation within a business session is statically predefined by the administrator of the business session. In Web service business sessions, session partners work in a peer-to-peer way and their relationship is more flexible and difficult to precisely predefine. Therefore, our authorization system enables session partners to dynamically assign access control policies for the new invoked instances and ensures the correctness and the security of the cooperation between the session partners.

We deployed our experimental systems on the GT4 and CROWN Grid infrastructures. These experimental systems are mainly used to evaluate the scalability and the performance of our instance-level authentication system. In some experiments, our authentication system can execute stably until over 260,000 instances are generated. Such experimental results demonstrate that the scalability of the instance-level authentication system is good. Additionally, compared with the performance of service-level security mechanisms, the overhead introduced by our instance-level authentication system is reasonable. We are now in the process of evaluating the performance of the instance-level authorization system and will present the results in future papers.

References

[1] G. Alonso, F. Casati, H. Kuno, and V. Machiraju, “Web Services,” Springer-Verlag, 2003.

[2] V. Atluri and W. K. Huang, “An Authorization Model for Workflows,” Proc. 5th European Symposium on Research in Computer Security, Lecture Notes in Computer Science, Vol. 1146, Springer-Verlag, pp. 44-64, 1996.

[3] G. B. Chafle, S. Chandra, V. Mann and M. G. Nanda, “Decentralized Orchestration of Composite Web Services,” Proc. 13th International World Wide Web Conference, Alternate Track Papers & Posters, pp. 134-143, May 2004.

[4] D. Domingos, A. Rito-Silva and P. Veiga, “Authorization and Access Control in Adaptive Workflows,” Proc. 8th European Symposium on Research in Computer Security, pp. 23-38, 2003.

[5] S. Hada and H. Maruyama, “Session Authentication Protocol for Web Services,” Proc. 2002 Symposium on Applications and the Internet, pp. 158-165, Jan. 2002.

[6] D. Hollingsworth, “Workflow Management Coalition: The Workflow Reference Model,” Technical Report WFMC-TC-1003, Workflow Management Coalition, Brussels, Belgium, 1994.

[7] T. Y. C. Woo and S. S. Lam, “A Semantic Model for Authentication Protocols,” Proc. IEEE Symposium on Research in Security and Privacy, pp. 178-194, Oakland, California, May 1993.

[8] D. Zhang and J. Xu, “Securing Instance-Level Interactions in Web Services,” Proc. 2005 IEEE International Symposium on Autonomous Decentralized Systems (ISADS), pp. 443-450, April 2005.
successfully fulfilled their requirements for that software. Depending on the context, the user in question may be the end-user of the software, the system administrator installing, maintaining or administering the software, etc. We consider software to be “user-friendly” if its usability is high and if the user is able to use it appropriately in such a manner that they are either unaware that they are using it or that its impact on their user experience is positive.

In the context of computational grid infrastructure, we use “heavyweight” to mean software that is some combination of the following:
• complex to understand, configure or use;
• has a system “footprint” (disk space used, system resources used, etc.) that is disproportionately large when considered as a function of the frequency and extent to which the user uses the software;
• has extensive dependencies, particularly where those dependencies are unlikely to already be satisfied in the operating environment, or where the dependencies are themselves “heavyweight”;
• difficult or resource-intensive to install or deploy;
• difficult or resource-intensive to administer; and/or
• not very scalable.

We use the term “lightweight” to refer to software that is not “heavyweight”, i.e. it possesses none of the features listed above, or only possesses a very small number of them to a very limited extent. Thus lightweight software would be characterised by being:
• not too complex for end-users, developers and administrators to use and understand in their normal work context,
• easy to deploy in a scalable manner, and
• easy to administer and maintain.

See [6] and [10] for a fuller discussion of lightweight and heavyweight grid middleware.

The problems we have encountered with the current grid security mechanisms and frameworks tend to fall into two main categories. There are those problems that seem to be common to most of the existing grid security solutions, e.g. excessive complexity, poor scalability, high deployment and maintenance overheads, etc. We observe that all these solutions have a common characteristic that we believe is responsible for these problems, namely that they are all very heavyweight solutions to the security problems they address. Then there are those problems that are specific to the particular aspect(s) of security (e.g. authentication, authorisation, etc.) with which the solution is concerned, e.g. credential management. First we shall outline those problems we have encountered that we believe are due to the heavyweight nature of the existing solutions, and then we shall discuss problems specific to particular aspects of computational grid security.

2.1 Heavyweight security solutions

Many of the problems inherent in heavyweight grid middleware have been discussed in detail elsewhere ([6], [2], [10]), and so we shall not discuss them here. Instead we outline the specific problems that the heavyweight nature of the existing grid security solutions causes in the context of the security of computational grid environments. (It is important to note, however, that, security considerations aside, heavyweight grid middleware is, by its very nature, a significant barrier to the wider uptake of grid technology ([6], [11], [2]).) The specific problems that such grid middleware causes in a security context include the following:

• Complexity:
Amongst security professionals it is well known that complexity is the enemy of security; indeed, some security professionals have described complexity as the worst enemy of security ([12]). The existing grid security solutions are extremely complex, and this makes them very difficult to configure or use, and, most crucially, to understand ([13], [8]). Thus it is very difficult even for experienced system administrators to correctly install and configure these security solutions ([14]).

• Extensive dependencies:
The security of a piece of software is – at a minimum – dependent on the security of its components and dependencies. It is also dependent on many other factors, but if the security of any of its components and dependencies is breached then the likelihood is high that the security of the software as a whole will also be violated (although there are techniques, such as privilege separation ([15]), which can help mitigate this). The software’s components and dependencies thus form a “chain of trust”, which is only as strong as its weakest link ([12]). It therefore follows that, all else being equal, software with extensive dependencies is both more likely to be vulnerable to security exploits, and also more difficult to secure in the first place, than software with few dependencies. It is also worth mentioning in this context that software which installs its own versions of system libraries (or is statically compiled with respect to those system libraries) is also likely to be more insecure. This is because its versions of these
libraries are unlikely to be as readily upgraded in response to security alerts as the system libraries, in part because system administrators may well not be aware that the software uses its own versions of these libraries.

• Scalability:
Many of the current grid security solutions are not particularly scalable ([16], [17]). Given that, at least in principle, computational grids can consist of thousands of individual users and machines and hundreds or thousands of administrative domains, it is clear that grid middleware whose scalability is poor will not be suitable for medium to large grids. One aspect of this lack of scalability that has serious security implications is the requirement for the security configuration of each node and/or client in a grid environment to be individually configured ([8], [14]). As one might expect, this leads to some nodes/clients being incorrectly configured, and in an environment with a large number of nodes this may remain unnoticed for a considerable period of time. In addition, where this configuration must be maintained on a regular basis – particularly if the software is designed to stop working if its configuration is not kept up-to-date – it is not unusual for users to have found ways of circumventing this. For example, where up-to-date certificate revocation lists (CRLs) are required for successful operation, often no CRLs will be installed, to avoid the software failing when the CRL is no longer current ([8]).

• Usability:
From the definition of “heavyweight software” given above it should be clear that it is inherently difficult for such software to have high usability or to be user-friendly ([10]). In addition, an examination of the software development process for most of the existing grid security solutions reveals very little evidence of usability considerations playing a significant role in their design. Given the wealth of usability-related problems ([8]) with the Grid Security Infrastructure (GSI) ([18]) – the principal authentication mechanism used in existing grid environments – it seems unlikely that its design was informed by many usability concerns. The authors are only aware of two grid security components/solutions in active use – MyProxy ([19]) and PERMIS ([20]) – where usability concerns have been explicitly considered at some point in their development/deployment ([19], [21]), and even with those products it is not clear whether usability was a core concern of their designers.

Unfortunately, since security mechanisms whose usability is poor in the environment in which they are deployed are inherently insecure ([22], [23]), it is likely that most of the existing grid security solutions will be deployed or used in an insecure manner in practice. Further, usability is not a “feature” that can be bolted on at the end of the development process; usable systems are only possible if usability is a central concern of the designers at all stages of the system’s design and development ([24], [25], [26]). It is thus unfortunately the case that most of the existing grid security solutions are unlikely to be usable or secure in practice.

2.2 Authentication

As mentioned above, the principal authentication mechanism in current computational grid environments is GSI, and there are a number of serious problems with this mechanism; these are discussed in some detail in [27], [8] and [14]. Perhaps the most significant point that has emerged from these examinations of the problems with GSI is that the digital certificates upon which it relies are “cognitively difficult objects” for its users ([8]). This is borne out by the fact that some potential users of grid computing environments have refused to use those environments if they required a digital certificate in order to do so ([9]). It would thus seem probable that any better authentication solution for grid environments must not rely on users making active use of digital certificates ([8]). Ideally, users should not have to have such certificates at all ([28], [8], [14]).

2.3 Authorisation

The current computational grid environments provide very limited authorisation mechanisms ([16], [17]); most of them rely on simple “allow” lists in manually maintained text files. Although more sophisticated mechanisms are available (e.g., CAS ([29]), VOMS ([30]), PERMIS ([20]), Shibboleth ([31])), few of these are currently in active use in the existing environments. Some of these mechanisms (e.g. CAS, VOMS) are based on GSI and thus inherit its usability and security problems. In addition, they are all, to a greater or lesser degree, heavyweight (in the sense defined above). This leaves system administrators with two choices, neither of which is satisfactory: they can either rely on an overly simple mechanism that does not scale well (“allow” lists in text files), or they can attempt to use one of the more sophisticated, heavyweight, solutions, with all of the associated problems inherent in such solutions.

2.4 Auditing

The auditing features of current computational grid environments are, at best, rudimentary. GSI, the
principal authentication mechanism, relies wholly on the integrity of users’ credentials being maintained – the only auditing information provided is the distinguished name (DN) of the digital certificate and the IP address from which the authentication request appears to originate. As discussed in [8], the likelihood of the integrity of the digital certificate being compromised is high. Also, in the general case, an entity’s apparent IP addresses cannot be relied upon to belong to it (due to IP address “spoofing”). Thus, the audit information provided is often of little value when attempting to detect or clean up after a security incident.

Very little work has currently been done in the area of the auditing requirements of computational grid environments, and so it is not clear what information needs to be collected, or how it should be stored, processed, etc. ([32]). However, it is clear that most of the existing security solutions do not address these problems in any detail. Indeed, many of them (e.g. GSI) do not follow the accepted best practice of storing audit data at a location remote from where that data was gathered. This means that, in the event of the security of a machine in the grid environment being breached, the attacker will be able to modify the audit data that might otherwise reveal how they obtained access to the machine.

3 Solution Requirements

From the discussion above, and other work in the areas of grid middleware design ([6], [10], [9]), grid security ([27], [28], [8], [14]), and usable security ([22], [23], [33], [34], [21]), we abstract some very general requirements that any grid security solution should possess. We then consider some of the specific security requirements in the areas of authentication, authorisation and auditing.

3.1 General requirements

As should be clear from the discussion in Section 2.1, heavyweight grid security solutions are unlikely to improve the current situation. Indeed, the seriousness of the problems inherent in heavyweight grid middleware is arguably one of the most significant factors in the relatively low use of current computational grid environments ([7], [10]). Thus a core requirement for any grid security solution is that it is lightweight (in the sense defined above).

As discussed above, security solutions whose usability is poor are inherently insecure, and usable security solutions require usability to be a central concern of the developers at all stages of the solution’s design and development. Thus grid security solutions must have high usability in their deployed environments, and usability must be a core concern of their designers. As discussed in [10], this requires continuous user involvement in the design and development processes from their earliest stages.

Given the large investment (in terms of resources allocated) in the existing computational grid environments, it is clear that, despite their relatively low usability and the high likelihood of their security being violated, they cannot simply be abandoned. Thus our solution also needs to be interoperable with the existing computational grid environments.

As security professional Bruce Schneier notes in [12], “In cryptography, security comes from following the crowd”, and this is true of secure software in general. From a security standpoint it is generally undesirable to re-invent security primitives (especially cryptographic primitives), since these will, of necessity, have received less testing and formal security verification than existing primitives that are widely used. We therefore intend to use appropriate existing cryptographic techniques rather than developing any new ones. In addition, we intend to use existing cryptographic toolkits and security components, where we can do so without negatively impacting on our other design goals (such as usability).

In summary, our solution must be:
(a) lightweight, i.e.
○ of low complexity,
○ easy to deploy,
○ easy to administer,
○ minimal in its external dependencies, and
○ scalable;
(b) highly usable by its intended users;
(c) developed in conjunction with its intended users;
(d) interoperable with the existing computational grid environments (or at least with the most common ones);
(e) implemented using existing cryptographic techniques; and
(f) maximal in its use of appropriate existing cryptographic toolkits and other security components.

See [22], [33], [35], [6], [10] and [21] for more complete discussions of appropriate design and development methodologies and guidelines for meeting requirements (a) to (c).

3.2 Authentication

As discussed in Section 2.2, it is vital that end-user interaction with digital certificates be minimised, ideally to the point where end-users have no interaction with digital certificates. A number of
methods have been proposed for doing this ([28], [8], [14]). We have chosen to adopt the method described in [28] as it is designed to be interoperable with the existing grid authentication mechanisms, is highly scalable and extensible, and represents a significant improvement in its auditing facilities over the existing mechanisms. Of equal importance, its development methodology fully embraces the need for usability to be a core concern of any security solution and for user involvement in the design and development processes.

3.3 Authorisation

It seems intuitively obvious that a truly general authorisation framework that would be applicable to any possible scenario in a computational grid environment is likely to be extremely complex, and its implementations would quite probably be very heavyweight. The general nature of authorisation frameworks such as PERMIS and Shibboleth is probably precisely what causes them to be such heavyweight security solutions, and so unsuitable for our purposes. As discussed in [10], it is probably impossible to develop usable grid middleware that can be used in all the problem domains that might be of interest. Instead, smaller domain-specific middleware is needed.

Thus we do not attempt to re-invent a general authorisation framework. Instead, we propose to concentrate on some specific problem areas of interest and develop a lightweight authorisation framework appropriate for those areas. A design goal for this framework is that it is extensible, so that others working in similar problem areas are likely to be able to make use of it. As noted in Section 3.1, interoperability is a crucial requirement, and so our initial investigation will be into wrapping the flat-file authorisation tables used in most existing computational grid environments, in a manner analogous to that proposed in [28] for the existing authentication mechanisms.

3.4 Auditing

As discussed in Section 2.4, the auditing requirements for computational grid environments are not yet well understood, and the existing facilities are quite basic. Further research is necessary before the requirements in this area can be definitively stated. However, it seems likely that conforming with accepted best practice for network auditing (e.g. storing audit data at a location remote from where it is collected, not relying on easily falsifiable identifiers such as IP addresses, etc.) would be a sensible place to start. The mechanism proposed in [28] attempts to improve the reliability of the data collected for audit purposes, and we propose to use this as a starting point for further development.

4 Development Methodology

Our development methodology is a combination of user-centred security methodologies (such as [22], [35]), drawing heavily on the user-centred design methodologies (e.g. [25], [36], [37]), in conjunction with formal methods drawn from security analysis (e.g. [38], [39]). There are two principal strands to our development strategy: the development and security verification of working software solutions, and research into, and formal analysis and modelling of, the security space (computational grid environments) in which that software is to be deployed.

Both usability and security requirements are situated requirements, and, as such, are crucially dependent on context. As mentioned above (and discussed in more detail in [6] and [10]), the attempts to produce truly general solutions have led to heavyweight grid middleware whose usability is low, and this has directly contributed to the low adoption of grid technology by the academic community. To avoid repeating these mistakes, we are situating our project in a specific context, namely the RealityGrid Project ([40]) and its chosen middleware, WEDS ([2]) and the AHE ([7]). A design goal of our software is thus tight integration with the middleware used by the RealityGrid Project.

As discussed in [41] and [10], a prerequisite for effective development of any software solution is a detailed knowledge of the requirements of the stakeholders for whom that solution is being developed. Thus the first stage in our software development plan is a requirements capture exercise using techniques from the user-centred design methodologies. The results of this exercise are fed into both the formal modelling and analysis of the security problem space under consideration, and the software design process. Low-fidelity prototyping (e.g. [42]) is used to ensure that the gathered requirements accurately reflect the users’ needs. Our development cycle thus looks something like this:
1) Requirements capture and analysis;
2) Formal modelling and analysis;
3) Software design;
4) Formal verification of security of design;
5) Usability testing using low-fidelity prototypes.

The above cycle is iterative (note that the different phases may partially overlap), with each phase dependent on the successful completion of the previous phase. If a phase cannot be completed successfully, earlier phases are repeated
Proceedings of the UK e-Science All Hands Meeting 2006, © NeSC 2006, ISBN 0-9553988-0-0
with the current design and suggested changes from the failed phase as input. This cycle iterates until a stable design is produced.

Note that no program code (even as a prototype) is written until a stable design has been produced, and this design has passed usability trials and security verification. This helps to ensure that software development is user-driven rather than programmer-driven, and that the software design is responsive to user concerns. After the stable design is produced, a software prototype of this design is then developed, which is integrated with our target middleware in the following cycle:

1) Development of software prototype;
2) Integration with target middleware;
3) Usability testing;
4) Formal verification of security of integrated prototype.

The above cycle is iterative in a similar manner to the previous cycle. This cycle iterates until a working software prototype has been integrated with the target middleware and passed its usability trials and security verification.

We have adopted a component-based approach to the authentication and authorisation mechanisms we are trying to develop, so these are developed as separate components, the development of each following the development cycle described above.

5 Security Improvements

In this section we discuss how our proposed solutions and our software development methodology will improve the security of computational grid environments. First we discuss the impact of our development methodology on the security of our solutions, and then we discuss how our solutions address some of the security issues present in the current computational grid environments.

5.1 Development Methodology

Our development methodology incorporates software development principles and practice (such as [22], [33], [35], [34]) proven to increase the security of systems by addressing usability concerns. In addition, we undertake a continuous programme of usability testing to ensure that our usability goals have actually been achieved. Thus not only will our software have been designed in accordance with proven usability design principles, but we will have tested the software in usability trials to ensure that it genuinely possesses a high degree of usability for its intended user community. Together, these measures guarantee that – at least in usability terms – our solutions should be major improvements over the current grid security solutions.

In addition, our development methodology incorporates formal security analysis and security verification of both our software design and its implementation. By using formal security methods we can ensure that our software’s security model is at least theoretically sound. Whilst this does not, of course, guarantee that the software will be used securely in practice, if any software deployment is to be secure, a prerequisite is that its security model is theoretically sound. We are unaware of any formal security analysis or verification of the existing grid security solutions. Once the final implementation of our software has undergone formal security analysis and verification, we believe it would be reasonable to have greater confidence in our software being secure than in the existing solutions.

5.2 Authentication

From the discussion above (and see [8] and [14] for further details) it is apparent that, in current computational grid environments, there is a significant risk of users’ credentials being obtained by unauthorised individuals, and this is principally due to usability issues with the current security mechanisms. Obtaining user credentials would allow attackers to successfully exploit these environments without needing to discover and exploit security vulnerabilities in the program code of the security mechanisms. We intend to address the usability issues with the current mechanisms (see Section 3.2), and, as discussed in Section 5.1, will undertake extensive usability testing to ensure we have addressed these issues. This means that our software should significantly reduce – or possibly even eliminate – the risk of attackers obtaining users’ credentials through the usability deficiencies of the environment’s security mechanisms.

5.3 Authorisation

Environments that currently use the existing heavyweight authentication solutions are likely to have inappropriate authorisation policies in place, i.e. the complexity of the solution is such that the actual authorisation policy implemented may not match the intended authorisation policy in the administrator’s mind. If the implemented policy is too lax, then unauthorised individuals may be able to access the environment, or users may be able to perform unauthorised actions. If the implemented policy is too strict, this is likely to be quickly discovered by users, who will complain about being unable to perform legitimate actions. This will lead to a relaxation of the policy, and may
well lead to the policy being relaxed to an inappropriately high degree. This may happen because of the dissonance between the administrator’s understanding of the authorisation mechanisms and the actual workings of these mechanisms (which led to the mistaken policy being implemented in the first place). Alternatively, fear of further angering the user community may lead the administrator to err on the side of “caution”, i.e. to implement an extremely permissive policy so as to be certain that the users are able to do what they should be allowed to do.

As described in Section 5.1, our development methodology actively addresses the usability issues that cause such dissonance. Consequently, our authorisation component will represent significant improvements for administrators (and so for users) over the current situation, i.e. it will be much more likely that administrators will be able to correctly set appropriate authorisation policies.

5.4 Auditing

As discussed in Section 2.4, the auditing facilities of the existing computational grid environments are not very sophisticated, and the auditing requirements are not very well understood. This means that it is likely to be difficult to diagnose attacks against these environments. We intend to actively investigate this area as a better understanding would allow us to implement better auditing facilities. However, as discussed in Section 3.4, merely by following acknowledged best practice for network auditing we will have improved the current situation significantly.

6 Preliminary Work

As described in [28] and [8], our proposed replacement authentication mechanism is nearing the end of its design phase – see [8] for details of usability concerns addressed in the design process. We anticipate soon having some software prototypes for integration with our target lightweight grid middleware ([2], [7]). Once this is done we will begin usability testing of these prototypes. As further development of the other security mechanisms proceeds, we will gradually incorporate prototypes of these mechanisms in accordance with the software development methodology outlined in Section 4.

7 Conclusion

The existing grid security solutions suffer from a number of deficiencies that are serious barriers to wider adoption of grid technologies by the academic community. Many of these deficiencies stem from the heavyweight nature of the solutions in question, and from their low usability. The low usability of these solutions is probably a consequence of the fact that usability considerations were not central to their design.

In this paper we have described how our “user-friendly” approach to grid security seeks to mitigate the existing deficiencies by closely involving the user in the design and development processes. By combining this with formal methods in computer security we aim to produce lightweight security solutions that are easy to use for our target user population and which are theoretically sound.

Acknowledgements

This work is supported by EPSRC through the User-Friendly Authentication and Authorisation for Grid Environments project (EP/D051754/1).

References

[1] Hurley, J. “The grid identity crisis: defining it and its reality”. GRIDtoday, Vol. 1, No. 10 (19 August 2002): http://www.gridtoday.com/02/0819/100248.html
[2] Coveney, P., Vicary, J., Chin, J. and Harvey, M. Introducing WEDS: a WSRF-based Environment for Distributed Simulation (2004). UK e-Science Technical Report UKeS-2004-07: http://www.nesc.ac.uk/technical_papers/UKeS-2004-07.pdf
[3] Evans, B. “Grid Computing: Too Big To Be Ignored”. InformationWeek, 10 November 2003: http://www.informationweek.com/story/showArticle.jhtml?articleID=16000717
[4] McBride, S. and Gedda, R. “Grid computing uptake slow in search for relevance”. Computerworld Today, 12 November 2004: http://www.computerworld.com.au/index.php?id=138181333
[5] Ricadela, A. “Slow Going On The Global Grid”. InformationWeek, 25 February 2005: http://www.informationweek.com/story/showArticle.jhtml?articleID=60402106
[6] Chin, J. and Coveney, P.V. Towards tractable toolkits for the Grid: a plea for lightweight, usable middleware (2004). UK e-Science Technical Report UKeS-2004-01: http://www.nesc.ac.uk/technical_papers/UKeS-2004-01.pdf
[7] Coveney, P.V., Harvey, M.J., Pedesseau, L., Mason, D., Sutton, A., McKeown, M. and Pickles, S. Development and deployment of an application hosting environment for grid based computational science (2005). Proceedings of the UK e-Science All Hands Meeting 2005, Nottingham, UK, 19-22 September 2005: http://www.allhands.org.uk/2005/proceedings/papers/366.pdf
[8] Beckles, B., Welch, V. and Basney, J. Mechanisms for increasing the usability of grid security (2005). Int. J. Human-Computer Studies 63 (1-2) (July 2005), pp. 74-101: http://dx.doi.org/10.1016/j.ijhcs.2005.04.017
[9] Sinnott, R. Development of Usable Grid Services for the Biomedical Community (2006). Designing for e-Science: Interrogating new scientific practice for usability, in the lab and beyond, Edinburgh, UK, 26-27 January 2006.
[10] Beckles, B. Re-factoring grid computing for usability (2005). Proceedings of the UK e-Science All Hands Meeting
2005, Nottingham, UK, 19-22 September 2005: http://www.allhands.org.uk/2005/proceedings/papers/565.pdf
[11] Pickles, S.M., Blake, R.J., Boghosian, B.M., Brooke, J.M., Chin, J., Clarke, P.E.L., Coveney, P.V., González-Segredo, N., Haines, R., Harting, J., Harvey, M., Jones, M.A.S., McKeown, M., Pinning, R.L., Porter, A.R., Roy, K. and Riding, M. The TeraGyroid Experiment (2004). Proceedings of GGF10, Berlin, Germany, 2004: http://www.cs.vu.nl/ggf/apps-rg/meetings/ggf10/TeraGyroid-Case-Study-GGF10-final.pdf
[12] Schneier, B. Secrets and Lies: Digital Security in a Networked World. John Wiley & Sons, 2000.
[13] Beckles, B., Brostoff, S. and Ballard, B. A first attempt: initial steps toward determining scientific users’ requirements and appropriate security paradigms for computational grids (2004). Proceedings of the Workshop on Requirements Capture for Collaboration in e-Science, Edinburgh, UK, 14-15 January 2004, pp. 17-43: http://www.escience.cam.ac.uk/papers/req_analysis/first_attempt.pdf
[14] Beckles, B. Kerberos: a usable grid authentication protocol (2006). Designing for e-Science: Interrogating new scientific practice for usability, in the lab and beyond, Edinburgh, UK, 26-27 January 2006.
[15] Provos, N., Friedl, M. and Honeyman, P. Preventing Privilege Escalation (2003). 12th USENIX Security Symposium, Washington, DC, USA, 2003: http://www.usenix.org/publications/library/proceedings/sec03/tech/provos_et_al.html
[16] Lock, R. and Sommerville, I. Grid Security and its use of X.509 Certificates. DIRC internal Conference submission 2002. Lancaster DIRC, Lancaster University, 2002: http://www.comp.lancs.ac.uk/computing/research/cseg/projects/dirc/papers/gridpaper.pdf
[17] Broadfoot, P.J. and Martin, A.P. A critical survey of grid security requirements and technologies (2003). Programming Research Group Research Report PRG-RR-03-15, Oxford University Computing Laboratory, August 2003: http://web.comlab.ox.ac.uk/oucl/publications/tr/rr-03-15.html
[18] Foster, I., Kesselman, C., Tsudik, G. and Tuecke, S. A security architecture for computational grids (1998). Proceedings of the 5th ACM conference on Computer and communications security, San Francisco, CA, 1998, pp. 83-92: http://portal.acm.org/citation.cfm?id=288111
[19] Basney, J., Humphrey, M. and Welch, V. The MyProxy online credential repository (2005). Software: Practice and Experience 35 (9):801-816, 25 July 2005: http://dx.doi.org/10.1002/spe.688
[20] Chadwick, D.W. and Otenko, O. The PERMIS X.509 role based privilege management infrastructure (2002). Proceedings of the 7th ACM symposium on Access control models and technologies, Monterey, CA, 2002, pp. 135-140: http://portal.acm.org/citation.cfm?id=507711.507732
[21] Brostoff, S., Sasse, M.A., Chadwick, D., Cunningham, J., Mbanaso, U. and Otenko, S. R-What? Development of a role-based access control policy-writing tool for e-Scientists (2005). Software: Practice and Experience 35 (9):835-856, 25 July 2005: http://dx.doi.org/10.1002/spe.691
[22] Zurko, M.E. and Simon, R.T. User-centered security (1996). Proceedings of the 1996 workshop on New security paradigms, Lake Arrowhead, CA, USA, 1996, pp. 27-33: http://portal.acm.org/citation.cfm?id=304859
[23] Adams, A. and Sasse, M.A. Users are not the enemy: Why users compromise security mechanisms and how to take remedial measures (1999). Communications of the ACM, Volume 42, Issue 12 (December 1999), pp. 40-46: http://portal.acm.org/citation.cfm?id=322806
[24] Gould, J.D. and Lewis, C. Designing for Usability: Key Principles and What Designers Think (1985). Communications of the ACM, Volume 28, Issue 3 (March 1985), pp. 300-311: http://portal.acm.org/citation.cfm?id=3170
[25] Beyer, H. and Holtzblatt, K. Contextual Design: Defining Customer-centered Systems. Morgan Kaufmann Publishers, 1998.
[26] Cooper, A. The Inmates Are Running the Asylum: Why High-tech Products Drive Us Crazy and How to Restore the Sanity. Sams, 1999.
[27] Snelling, D.F., van den Berghe, S. and Li, V.Q. Explicit Trust Delegation: Security for Dynamic Grids. FUJITSU Sci. Tech. J. 40 (2) (December 2004), pp. 282-294: http://www.unigrids.org/papers/explicittrust.pdf
[28] Beckles, B. Removing digital certificates from the end-user’s experience of grid environments (2004). Proceedings of the UK e-Science All Hands Meeting 2004, Nottingham, UK, 31 August - 3 September 2004: http://www.allhands.org.uk/2004/proceedings/papers/250.pdf
[29] Pearlman, L., Welch, V., Foster, I., Kesselman, C. and Tuecke, S. A Community Authorization Service for Group Collaboration (2002). Proceedings of the IEEE 3rd International Workshop on Policies for Distributed Systems and Networks, Monterey, CA, 2002, pp. 50-59: http://www.globus.org/alliance/publications/papers/CAS_2002_Revised.pdf
[30] Alfieri, R., Cecchini, R., Ciaschini, V., dell’Agnello, L., Frohner, A., Gianoli, A., Lorentey, K.L. and Spataro, F. VOMS, an Authorization System for Virtual Organizations (2003). 1st European Across Grids Conference, Santiago de Compostela, Spain, 13-14 February 2003: http://grid-auth.infn.it/docs/VOMS-Santiago.pdf
[31] Shibboleth Project: http://shibboleth.internet2.edu/
[32] Van, T. “Grid Stack: Security debrief”. Grid Stack, 17 May 2005: http://www-128.ibm.com/developerworks/grid/library/gr-gridstack1/index.html
[33] Yee, K-P. User interaction design for secure systems (2002). Proceedings of the 4th International Conference on Information and Communications Security, Singapore, 9-12 December 2002. Expanded version available at: http://zesty.ca/sid
[34] Balfanz, D., Durfee, G., Grinter, R.E. and Smetters, D.K. In search of usable security: Five lessons from the field (2004). IEEE Security and Privacy 2 (5) (September-October 2004), pp. 19-24: http://doi.ieeecomputersociety.org/10.1109/MSP.2004.71
[35] Fléchais, I., Sasse, M.A. and Hailes, S.M.V. Bringing Security Home: A process for developing secure and usable systems (2003). Proceedings of the 2003 workshop on New security paradigms, Switzerland, 2003, pp. 49-57: http://portal.acm.org/citation.cfm?id=986664
[36] Cooper, A. and Reimann, R. About Face 2.0: The Essentials of Interaction Design. John Wiley and Sons, New York, 2003.
[37] Sommerville, I. An Integrated Approach to Dependability Requirements Engineering (2003). Proceedings of the 11th Safety-Critical Systems Symposium, Bristol, UK, 2003.
[38] Ryan, P.Y.A., Schneider, S.A., Goldsmith, M.H., Lowe, G. and Roscoe, A.W. Modelling and Analysis of Security Protocols. Addison-Wesley Professional, 2000.
[39] Abdallah, A.E., Ryan, P.Y.A. and Schneider, S. (eds.). Formal Aspects of Security. Proceedings of the First International Conference, FASec 2002, London, December 2002. LNCS 2629, Springer-Verlag, 2003.
[40] RealityGrid: http://www.realitygrid.org/
[41] Beckles, B. User requirements for UK e-Science grid environments (2004). Proceedings of the UK e-Science All Hands Meeting 2004, Nottingham, UK, 31 August - 3 September 2004: http://www.allhands.org.uk/2004/proceedings/papers/251.pdf
[42] Snyder, C. Paper prototyping: The fast and easy way to design and refine user interfaces. Morgan Kaufmann Publishers, London, 2003.
Abstract
We have carried out a comprehensive computational study of the structures and properties of a series of iron-bearing minerals under various conditions using grid technologies developed within the eMinerals project. The work has been enabled by a close collaboration between computational scientists from different institutions across the UK, as well as the involvement of computer and grid scientists in a true team effort. We show here that our new approach to scientific research is only feasible with the use of the eMinerals minigrid. Prior to the eMinerals project, such an approach would have been almost impossible within the timescale available. The new approach allows us to achieve our goals in a much quicker, more comprehensive and detailed way. Preliminary scientific results of an investigation of the transport and immobilization of arsenic species in the environment are presented in the paper.
1. Introduction
The remediation of contaminated waters and land poses a grand challenge on a global scale. The contamination of drinking water by arsenic has been associated with various cancerous and non-cancerous health effects and is thus one of the most pressing environmental concerns [1]. It is clearly important to understand the mechanisms of any processes that reduce the amount of active arsenic species in groundwater, which can be achieved by selective adsorption. It is generally considered that iron-bearing minerals are promising adsorbents for a wide range of solvated impurities. There is a serious need for a detailed understanding of the capabilities of different iron-bearing minerals to immobilize contaminants under varying conditions, in order to guide strategies to reduce arsenic contamination. We have chosen to use a computational approach to study the capabilities of iron-bearing minerals to immobilize active arsenic species in water.

This large, integrated project requires input from personnel with diverse skills and rich experience, as well as techniques and infrastructure to support a large-scale computing effort. The complex calculations required include workflows, high throughput, and the ability to match different computational needs against available resources. Furthermore, data management tools are required to support the vast amount of data produced in such a study. This is clearly a real challenge for the e-science
community. The aim of the eMinerals project [2] is to incorporate grid technologies with computer simulations to tackle complex environmental problems. During the first phase of this project, the eMinerals minigrid has been developed and tested. In addition, a functional virtual organisation has been established, based on the sharing of resources and expertise, with supporting tools.

We can now fully exploit the established infrastructure to perform simulations of environmental processes, which means having to deal with very complicated simulation systems close to ‘real systems’ and under varying conditions. Here we report on a case study using the grid technologies developed in the eMinerals project to conduct a large integrated study of the transport and immobilization of arsenic in a series of iron-bearing minerals, in which the computational scientists work as a team to tackle one common problem. The whole team operates as a task-oriented virtual organization (VO) and each member of the VO uses their own expertise to deal with one or several aspects of the project. A close collaboration between all team members is essential in such a comprehensive study, which enables the scientific studies to be carried out at different levels, using different methodologies and investigating many different iron-bearing minerals. This new approach is aimed at expanding the horizons of simulation studies and meeting the need for a quick, comprehensive and detailed investigation into the mechanisms of arsenic adsorption by various adsorbents. An investigation at such a scale would be impossible without the collaborative grid technologies.

2. Grid technologies in computational simulations

All the simulations conducted in this study were facilitated by the eMinerals minigrid [3]. The eMinerals minigrid is composed of a collection of shared and dedicated resources. These are spread across six sites, and are depicted in figure 1. We also regularly complement these resources with external resources such as those made available by the National Grid Service. More details of building and using these tools, as well as our views and experience of them, are discussed in refs [3,4]. Here we give only a brief description.

2.1 Communication tools

As mentioned in the introduction, this project is a large collaboration in terms of the number of people involved and the levels of calculations
needed to be carried out. Good organization and communication is therefore the key to the success of the project. Because the members of the project are based at different institutions across the UK, it would be extremely difficult to keep the whole team in communication on a daily or weekly basis if we followed traditional methods.

Within the eMinerals project, we have developed a new approach, taking advantage of different communication tools: personal Access Grid (AG), a Wiki, and instant messaging (IM), in addition to more conventional tools (e.g. e-mail and telephone). Each tool is suited to a particular type of communication required by the collaboration.

AG meetings: AG offers a valuable alternative to the expense of travelling for project meetings. The AG meetings now act as a valuable management tool and, in the context of the project, enable us to identify the most suitable person to complete the selected simulations within a specified time. For instance, for the arsenic pollutant problem, through the AG meetings we decided that team members from Bath and Birkbeck focus on adsorption of arsenic species on iron oxides and hydroxides using interatomic potential based simulations, while those from the Royal Institution and Cambridge work on arsenic species adsorption on pyrite and dolomite using quantum mechanical calculations. The AG meetings also play an important role in members contributing their ideas to collaborative papers like this one.

The Wiki: the Wiki is a website which can be freely edited by the members of the VO. We have used the Wiki to exchange ideas, deposit news and information, and edit collaborative papers. For example, we have used Blogbooks for resource management, to notify other members before submitting large suites of calculations. The downside of this tool is that the Wiki does not support instant communication, which limits its usefulness as a collaborative tool.

Instant messaging (IM): IM gives much better instantaneous chat facilities than e-mail and the Wiki. It is particularly useful for members of the project developing new tools because it is easy to ask questions and reach agreement.

2.2 Grid environment support for workflow

All calculations conducted in this study have been organized by a meta-scheduling job submission tool called my_condor_submit (MCS) [5]. The MCS manages the simulation workflow using Condor’s DAGMan functionality and integrates with Storage Resource Brokers (SRB) for data management. The Condor interface has proved quite popular, and facilitates the submission and management of large numbers of jobs to Globus resources. In addition, it enables the use of additional Condor tools, in particular the DAGMan workflow manager. SRB has provided seamless access to distributed data resources. As shown in figure 1, there are 5 SRB data storage vaults spread across 4 sites within the project, amounting to a total capacity of roughly 3 Terabytes. A metadata annotation service (the RCommands service) [6], hosted by CCLRC with a database back end, was used to generate and search metadata for files stored in the SRB vaults.

The use of MCS together with SRB allows us to handle each simulation following a simple and automated three-step procedure. The input files are downloaded from the SRB, and MCS decides where to run the job within the eMinerals minigrid, depending on the available compute facilities. Finally, the output files are uploaded back to the SRB for subsequent analysis. This tool certainly facilitates the work of the scientists and avoids unnecessary delays in the calculations.

Our practice has shown that the SRB is of prime importance for data management in such a collaboration. It not only provides a facility to accommodate a large number of data sets, but also allows us to share information more conveniently, e.g. large files that are nigh impossible to transfer using traditional communication tools (e.g. e-mail) due to size limitations. SRB also permits us to access the data files wherever and whenever we wish.

Here we illustrate a couple of examples to show the benefits of using grid techniques to support workflow in a scientific study. In the case of the quantum mechanical study of the structures of goethite, pyrite and wüstite, we have to take into account that these minerals can have different magnetic structures, which requires a range of separate calculations. For example, ferric iron, which has five d-electrons, exists with magnetic moments equal to 5 (HS) and 1 (LS), while ferrous iron, which has six d-electrons, can have a magnetic moment of 4 (HS) or 0 (LS). This means that minerals with ferrous iron can have anti-ferromagnetic (AFM), ferromagnetic (FM) and non-magnetic (NM) structures, while minerals containing ferric iron show AFM or FM structures. For
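The three-step MCS/SRB procedure described in Section 2.2 can be sketched in miniature as follows. This is only an illustrative sketch: the function names (`stage_in`, `choose_resource`, `stage_out`) and the resource records are hypothetical, since the real my_condor_submit (MCS) tool drives Condor, Globus and the SRB rather than Python callables.

```python
# Illustrative sketch of the automated three-step job handling:
# 1) stage input files in from an SRB collection,
# 2) let the meta-scheduler pick a minigrid resource with free capacity,
# 3) stage output files back to the SRB for analysis.
# All names here are hypothetical, not the real MCS interface.

def stage_in(srb_collection, workdir):
    """Step 1: fetch the input files for a job from an SRB collection."""
    return {name: workdir + "/" + name
            for name in srb_collection if name.endswith(".in")}

def choose_resource(resources, cpus_needed):
    """Step 2: pick the resource with the most free CPUs that fits the job."""
    candidates = [r for r in resources if r["free_cpus"] >= cpus_needed]
    return max(candidates, key=lambda r: r["free_cpus"]) if candidates else None

def stage_out(outputs, srb_collection):
    """Step 3: upload the output files back to the SRB."""
    return srb_collection + outputs
```

In the real environment these steps are declared in an MCS job description and executed by Condor/DAGMan, so the scientist never performs the staging by hand.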
Table 1. Electronic and magnetic structures as predicted by the hybrid functionals reported in the text, as well as the PW91 (GGA) Hamiltonian and experiments. HS = high spin, LS = low spin, M = metal and I = insulator.
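The combinatorics that make these magnetic-structure calculations numerous can be made concrete with a short sketch. The helper names and the mineral-to-oxidation-state mapping below are illustrative only:

```python
# Sketch of the calculation matrix implied by the text: ferric iron
# (five d-electrons) has HS/LS moments of 5 and 1, while ferrous iron
# (six d-electrons) has 4 and 0, so ferrous minerals admit AFM, FM and
# NM orderings while ferric minerals admit only AFM and FM.

MOMENTS = {"ferric": {"HS": 5, "LS": 1},
           "ferrous": {"HS": 4, "LS": 0}}

def orderings(state):
    # a non-magnetic (NM) structure is only possible when the LS moment is 0
    return ["AFM", "FM"] + (["NM"] if MOMENTS[state]["LS"] == 0 else [])

def calculation_matrix(minerals):
    """One independent calculation per (mineral, magnetic ordering) pair."""
    return [(m, o) for m, state in minerals.items() for o in orderings(state)]
```

For goethite (ferric), pyrite (ferrous) and wüstite (ferrous) this already yields eight independent structure calculations, before any choice of functional is made.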
O-H bond distance and the hydrogen bond distance. Our calculations suggest that α-FeOOH is adequately described in DFT and there is no need for alternative techniques to study this material.

b) Pyrite
Pyrite contains ferrous iron, and from Table 1 we see that DFT fails to predict this mineral as an insulator. However, the magnetic structure for pyrite (LS) agrees with experiment. Pyrite has also been studied using plane waves to describe the electron density (e.g. ref. 8), while we are using localised Gaussian basis functions. In the previous study pyrite is correctly reported as an insulator. We also find that both the hybrid functionals and GGA overestimate the lattice parameters compared to the experimental value. We believe this is again due to the description of the electron density by the Gaussian basis functions, which can be improved. Hence, our conclusion from this study is that pyrite can be described by plane-wave DFT methods and does not need to be described by hybrid functionals.

c) Wüstite
FeO is a classical example where DFT is known to fail to predict the electronic structure, as shown in Table 1. If we concentrate on the structural properties of wüstite, on the other hand, we find that GGA underestimates, while the hybrid functional described by 40% Hartree-Fock exchange overestimates, the lattice parameters. This is also reflected in the internal geometries. Hence, our conclusion is that wüstite is better described by hybrid functionals, although some properties, like the geometrical properties, are well described with GGA. However, the magnetic and electronic structures require hybrid functionals to be correctly described.

Our study showed that minerals containing ferric iron are often better described within DFT than minerals containing ferrous iron. However, ferrous iron in LS configurations, such as in pyrite, is also well described within DFT, while minerals containing ferrous iron which show magnetism (i.e. the HS configuration) are better described by hybrid functionals. This result is a reflection of the known shortcomings of DFT and its representation of electron correlation; high-spin ionic states are always poorly represented within the DFT approximation, so we expect the structural features of HS ferric iron to be captured considerably worse, requiring the additional accuracy provided by hybrid functional approaches, while for LS ferrous iron, DFT alone gives a reasonable result.

3.2 Arsenic incorporation in pyrite FeS2

Pyrite (FeS2), the most abundant of all metal sulphides, plays an important role in the transport of arsenic. Under reducing conditions, pyrite can delay the migration of arsenic by adsorption on its surfaces, as well as by incorporation into the bulk lattice. Pyrite can host up to about 10 wt% of arsenic [19]. Under oxidizing conditions, pyrite dissolves, leading to the generation of acid drainage and releasing arsenic into the environment. Although it is key information, the location and speciation of arsenic in pyrite remain a matter of debate. Pyrite has a NaCl-like cubic structure with an alternation of Fe atoms and S2 groups. X-ray absorption spectroscopic studies showed that arsenic substitutes for sulphur, forming AsS di-anion groups rather than As2 groups [20]. On the other hand, a recent experimental study has proposed that arsenic can also act like a cation, substituting for iron. The different arsenic configurations have been investigated using first-principles calculations.

Calculations have been executed within the DFT framework. In order to work with arsenic concentrations comparable with natural observations, calculations were performed on 2x2x2 pyrite supercells containing up to two arsenic atoms (< 4 wt% As). The arsenic has been successively placed in iron and sulphur sites. In the latter case, we have investigated the three following substitution mechanisms: formation of AsS groups, formation of As2 groups, and substitution of one As atom for one S2 group. These different configurations have been compared by considering simple incorporation reactions under both oxidizing and reducing conditions. Solution energies suggest that, in conditions where pyrite is stable, arsenic will preferentially be incorporated in a sulphur site, forming AsS groups (Fig. 2). The incorporation of arsenic as a cation is energetically unfavourable in pure FeS2 pyrite. Previous studies have shown that above an arsenic concentration of about 6 wt%, the solid solution becomes metastable and segregated domains of arsenopyrite are formed in the pyrite lattice [21]. Even for the low concentrations modelled here, the substitution energies indicate that the arsenic tends to cluster. Our results also show that, in oxidising conditions, the presence of arsenic
Proceedings of the UK e-Science All Hands Meeting 2006, © NeSC 2006, ISBN 0-9553988-0-0
Table 2: Surface energies for iron (hydr)oxide minerals, including both dry and hydroxylated surfaces

Fe2O3
Surface                           001   012a  012b  012c  100   101   110   111a  111b
Dry surface energy (J m-2)        1.78  1.87  2.75  2.35  1.99  2.34  2.02  2.21  2.07
Hydrated surface energy (J m-2)   0.90  0.38  0.28  0.38  0.27  0.04  0.20  0.33  0.28

FeOOH
Surface                           010   100   110   001   011   101   111
Dry surface energy (J m-2)        1.92  1.68  1.26  0.67  1.18  1.72  1.33
Hydrated surface energy (J m-2)   0.51  1.17  0.68  0.52  0.34  1.32  1.12

Fe(OH)2
Surface                           001   010   011   101   110   111
Dry surface energy (J m-2)        0.04  0.38  0.35  0.35  0.64  0.60
may accelerate the dissolution of pyrite, with the environmental consequences that implies (acid rock drainage and release into solution of toxic metals).

Figure 2: Relaxed structure of the 2x2x2 supercell of pyrite containing one AsS group.

3.3 Iron (hydr)oxide mineral surfaces

Iron (hydr)oxides play an important role in many disciplines, including pure, environmental and industrial chemistry, soil science, and biology. Iron (hydr)oxide minerals are thought to be promising adsorbents to immobilize active arsenic and other toxic species in groundwater. There are eleven known crystalline iron (hydr)oxide compounds, and the focus of our work here is initially on three representative systems, namely the pure hydroxide Fe(OH)2, the mixed oxyhydroxide goethite FeOOH, and the iron oxide hematite Fe2O3.

In order to investigate the adsorption mechanisms of arsenic species on the iron (hydr)oxide minerals, we have first carried out a large number of interatomic potential-based simulations to examine the surface structures and stabilities of these minerals. The calculated surface energies for both dry and hydrated surfaces are listed in Table 2.

The calculated surface energies of the three minerals show that, in the case of dry Fe2O3, the iron-terminated {001} surface is the most stable whereas the {012b} surface is the least stable. Surface structure analysis indicated that the atoms on the {001} surface have relatively high coordination compared to other surfaces, which makes this surface less reactive. In general, the surface energies of the hydroxylated surfaces are lower than those of the corresponding dry surfaces. Upon hydration, the iron-terminated {001} surface becomes the least stable surface, although the reconstructed dipolar oxygen-terminated surface now becomes highly stable, due to the adsorbed water molecules filling oxygen vacancies at the surface.

Like hematite, the hydrated surfaces of goethite are more stable than the dry ones. In Fe(OH)2, the surface energies for all surfaces are lower than those of Fe2O3 and FeOOH, mainly because Fe(OH)2 is an open layered structure.

Overall, our calculated energies as well as structure analysis indicate that all the surfaces where the surface atoms are capped by OH groups have relatively low surface energies, which implies that the presence of hydroxyl groups helps to stabilise the surface.
3.4 Aqueous solutions in the vicinity of iron hydroxide surfaces

Although environmentally relevant immobilisation processes concern adsorption from solution, the exact structure of aqueous solutions in contact with surfaces is not yet completely elucidated. The distribution and local concentration of the various species are difficult to observe experimentally. Coarse-grained Surface Complexation Models do not include surface effects explicitly, while, conversely, ultra-precise ab initio calculations are unable to cope with the amount of water required. The intermediate solution, the use of classical atomistic modelling techniques, has also traditionally been hampered by the fact that realistic ionic concentrations require the treatment of many water molecules (a factor of 50 water molecules per ion at the already high concentration of 1 mol l-1).

However, developments in computer power have now made it possible to simulate both surface and solution at atomic resolution and in large enough quantity to produce statistically meaningful results. We have carried out many Molecular Dynamics simulations (DL_POLY code) of (Na+/Cl-)/goethite interfaces, in aqueous solution at different ionic strengths and surface charges. Our main observation is that the distribution of ions near the surface is not accurately described by the classical models of the electrical double layer.

Any surface is considered to have a structuring effect on the liquid it is in contact with, as can be observed in the density oscillations in figure 3. This density rippling in turn controls the salt concentrations. There is a direct correlation with the water density up to 10 Å from the surface but, rather unexpectedly, broad, long-range oscillations continue further away. The explanation lies in the electrostatic potential, as pictured in figure 4. The structuring effect of the surface on the electrostatic potential of pure water has a longer range than could be inferred from the simpler density curves. The addition of ions in solution only serves to reinforce this effect.

Real surfaces are likely to be charged, but the corresponding simulations show that the effect of the charged surface on the electrostatic potential in the solvent does not differ significantly from the neutral surfaces.

Additional calculations on surfaces of different minerals (calcite CaCO3 and hematite Fe2O3) confirm these findings and suggest that the long-range oscillatory behaviour of the salt concentration is a consequence of the structuring presence of a surface on the electrostatic potential of the water.

We conclude that although the traditional double (Stern)-layer models are correct in assuming that the ion distribution is controlled by the electrostatic potential, they fail to reproduce the distribution at medium/high salt concentration because the electrostatic contribution of the structured (layered) solvent is not taken into account.
In the near future, we will extend the study to investigate the immobilization of a number of arsenic species by various iron-bearing minerals. A large number of calculations will be carried out to investigate the adsorption mechanisms of arsenic species of different oxidation states (e.g. As(III) and As(V)) on different iron-bearing mineral surfaces, as well as to study the influence of varying conditions on the adsorption process. The outcome of the project will be a detailed atomic-level understanding of the chemical and physical processes involved in the immobilization of arsenic species by a variety of important iron minerals, which will benefit a wide range of communities, such as water treatment and environment agencies, academic and industrial surface scientists, mineralogists and geochemists.

Acknowledgements

This work was funded by NERC via grants NER/T/S/2001/00855, NE/C515698/1 and NE/C515704/1.

References

[1] D.J. Vaughan, "Arsenic", Elements, 2, 71 (2006).
[2] M.T. Dove and N.H. de Leeuw, "Grid computing and molecular simulations: the vision of the eMinerals project", Mol. Sim., 31, 297 (2005).
[3] M. Calleja et al., "Collaborative grid infrastructure for molecular simulations: The eMinerals minigrid as a prototype integrated computer and data grid", Mol. Sim., 31, 303 (2005).
[4] M.T. Dove et al., "The eMinerals collaboratory: tools and experience", Mol. Sim., 31, 329 (2005).
[5] R.P. Bruin et al., "Job submission to grid computing environments", All Hands Meeting (2006).
[6] R.P. Tyer et al., "Automatic metadata capture and grid computing", All Hands Meeting (2006).
[7] K.G. Stollenwerk, "Geochemical processes controlling transport of arsenic in groundwater: A review of adsorption". In: A.H. Welch and K.G. Stollenwerk (eds), Arsenic in Groundwater: Geochemistry and Occurrence, Kluwer Academic Publishers, Boston, pp. 67-100 (2003).
[8] M. Blanchard, M. Alfredsson, J.P. Brodholt, G.D. Price, K. Wright and C.R.A. Catlow, "Electronic structure study of the high-pressure vibrational spectrum of FeS2 pyrite", J. Phys. Chem. B, 109, 22067 (2005).
[9] N.C. Tombs and H.P. Rooksby, "Structure of monoxides of some transition elements at low temperatures", Nature, 165, 442 (1950).
[10] R.W.G. Wyckoff, Crystal Structures, 1, 346 (1963).
[11] K. Fujino, S. Sasaki, Y. Takeuchi and R. Sadanaga, "X-ray determination of electron distributions in forsterite, fayalite and tephroite", Acta Cryst., B37, 513 (1981).
[12] R.L. Blake, R.E. Hessevick, T. Zoltai and L.W. Finger, "Refinement of the hematite structure", Amer. Mineral., 51, 123 (1966).
[13] A.F. Gualtieri and P. Venturelli, "In situ study of the goethite-hematite phase transformation by real time synchrotron powder diffraction", Amer. Mineral., 84, 895 (1999).
[14] P.S. Bagus, C.R. Brundle, T.J. Chuang and K. Wandelt, "Width of d-level final-state structure observed in photoemission spectra of FexO", Phys. Rev. Lett., 83, 1229 (1977).
[15] I. Opahle, K. Koepernik and H. Eschrig, "Full-potential band-structure calculation of iron pyrite", Phys. Rev. B, 60, 14035 (1999).
[16] Q. Williams, E. Knittle, R. Reichlin, S. Martin and R. Jeanloz, "Structural and electronic properties of Fe2SiO4-fayalite at ultrahigh pressures: amorphization and gap closure", J. Geophys. Res. [Solid Earth], 95, 21549 (1990).
[17] A. Fujimori, M. Saeki, N. Kimizuka, M. Taniguchi and S. Suga, "Photoemission satellites and electronic structure of Fe2O3", Phys. Rev. B, 34, 7318 (1986).
[18] M. Ziese and M.J. Thornton (eds), Materials for Spin Electronics, Springer (2001).
[19] P.K. Abraitis, R.A.D. Pattrick and D.J. Vaughan, "Variations in the compositional, textural and electrical properties of natural pyrite: a review", Int. J. Miner. Process., 74, 41 (2004).
[20] K.S. Savage, T.N. Tingle, P.A. O'Day, G.A. Waychunas and D.K. Bird, "Arsenic speciation in pyrite and secondary weathering phases", Appl. Geochem., 15, 1219 (2000).
[21] M. Reich and U. Becker, "First-principles calculations of the thermodynamic mixing properties of arsenic incorporation into pyrite and marcasite", Chem. Geol., 225, 278 (2006).
The GOLD Project: Architecture, Development and
Deployment
Hugo Hiden1, Adrian Conlin1, Panayiotis Periorellis1, Nick Cook1, Rob Smith1, Allen Wright2
1 Department of Computing Science, 2 Department of Chemical Engineering and Advanced Materials
University of Newcastle upon Tyne
Abstract
This paper presents a description of the architecture, development and deployment of the
Middleware developed as part of the GOLD project. This Middleware has been derived from the
requirement to accelerate the chemical process development lifecycle through the enablement of
highly dynamic Virtual Organisations. The generic design of the Middleware will allow its
application to a wide variety of additional domains.
data, laboratory notes, experimental data and safety studies during the initial development stages, through to industrial plant design information if and when the projects reach the commercial exploitation phase.

Much of the information exchanged between participants is confidential and must be secured from unauthorised access. The security requirements are demanding because the access control must change over the course of the development process, allowing different levels of access to certain individuals as projects progress.

2. Chemical Development Scenario

The functional requirements of the GOLD middleware have been identified from a number of sources, including industrial consultation and the extensive CPD experiences of some of the GOLD team members. In addition, an actual CPD project has been undertaken in collaboration with a number of companies as part of the development of the GOLD demonstrator.

The aim of this ongoing CPD project is to convert an existing high-tonnage manufacturing process from batch to continuous operation. In order to accomplish this goal a range of specialist skills is required, some of which are outlined in Table 1.

Reaction Engineering Consultant: analyses chemical reactions and suitable operating conditions.
Pilot Plant Designers: can design, build and operate small-scale chemical plants.
Equipment Vendor: supplies off-the-shelf process equipment according to supplied specifications.
Equipment Manufacturer: supplies custom-built equipment not available from vendors.

Table 1: Participant Roles

A VO approach has been adopted because these skills were not available in either the initiating company or a single contractor. In addition, the VO approach has the potential to offer substantial cost savings.

The project consists of four basic phases: preliminary reaction engineering investigations, pilot plant design, pilot plant operations and commercial-scale plant development. Although superficially this is a straightforward project, each of the phases described above can be subject to a number of disturbances leading to significant changes in the subsequent tasks to be performed. For example, new operating conditions may unexpectedly affect the downstream recovery of the catalyst, so that a new separation method must be found. This could involve a modification to the VO structure through the introduction of a new specialist skill and/or the removal of an existing participant. This example is one of many unanticipated factors that could require radical changes to the project plan after the project has started. The software, therefore, must be able to coordinate activities between VO participants in the face of such disturbances.

During the course of the CPD project, significant quantities of information are exchanged between participants. This information usually takes the form of "dossiers", each of which can contain a set of individual documents. Dossiers can cover various aspects of the development lifecycle, but the scenario considered during the development of GOLD is focused on four main areas: Commercial, Technical, Manufacturing and Safety, Health and Environment (SHE). Access to the information contained within these dossiers by VO participants is controlled depending on the roles held by these individuals.

This project has provided a detailed case study outlining tasks performed by, and information exchanged between, VO participants. Whilst not exhaustive, it does provide a reasonably thorough test for the middleware developed by the project.
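The role-based control over dossier access described above can be sketched as a simple lookup from VO role to readable dossier categories. This is an illustrative sketch only; the class, method and policy names here are assumptions, not the GOLD API:

```java
import java.util.Map;
import java.util.Set;

// Illustrative sketch of role-based dossier access (not the GOLD API).
// The policy below is a made-up example mapping VO roles to the four
// dossier areas named in the text.
public class DossierAccess {
    // Which dossier categories each VO role may read (hypothetical policy).
    private static final Map<String, Set<String>> POLICY = Map.of(
        "ReactionEngineeringConsultant", Set.of("Technical", "SHE"),
        "PilotPlantDesigner", Set.of("Technical", "Manufacturing", "SHE"),
        "EquipmentVendor", Set.of("Technical"));

    public static boolean canRead(String role, String dossierCategory) {
        return POLICY.getOrDefault(role, Set.of()).contains(dossierCategory);
    }

    public static void main(String[] args) {
        System.out.println(canRead("EquipmentVendor", "Technical"));   // true
        System.out.println(canRead("EquipmentVendor", "Commercial"));  // false
    }
}
```

In the real middleware such decisions would be enforced server-side (GOLD configures policies via XACML-style roles), but the lookup shape is the same.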
Figure 2: Document Class Hierarchy

The basic Information Model contains two document types: XmlDocument, which contains a single XML file, and BinaryDocument, which contains arbitrary binary data (such as, for example, a Microsoft Word document). Documents specific to the CPD process are derived from one of these two base classes. In some cases, such as the representation of chemical structures using the Chemical Markup Language (CML, Murray-Rust et al, 1995), XML schema exist for chemical-specific information, which can easily be stored within the information repository as XmlDocument objects. In other cases information is available in a structured form, such as plant design data, but there is no standardised XML schema to represent it. The GOLD project is actively investigating existing schema to represent this type of data; however, most information of this type is currently stored as binary data and persisted as BinaryDocuments within the information repository.

createDocument: creates a new empty DocumentRecord in the information repository.
retrieveDocument: retrieves the document data associated with a specified DocumentRecord.
updateDocument: updates the stored document data associated with a specified DocumentRecord.

Table 3: Storage Web Service
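As a rough illustration, the two-type hierarchy could be sketched in plain Java as follows. The class and field names are assumptions for illustration; the actual GOLD Information Model classes are not listed in the paper:

```java
// Illustrative sketch of the Information Model document hierarchy:
// both concrete document types derive from a common DocumentRecord.
// Names and fields are assumptions, not the actual GOLD classes.
public class Documents {
    public static abstract class DocumentRecord {
        public final String id;              // repository identifier
        protected DocumentRecord(String id) { this.id = id; }
    }

    // Holds a single XML file, e.g. a CML chemical structure.
    public static class XmlDocument extends DocumentRecord {
        public final String xml;
        public XmlDocument(String id, String xml) { super(id); this.xml = xml; }
    }

    // Holds arbitrary binary data, e.g. a Microsoft Word document.
    public static class BinaryDocument extends DocumentRecord {
        public final byte[] data;
        public BinaryDocument(String id, byte[] data) { super(id); this.data = data; }
    }

    public static void main(String[] args) {
        DocumentRecord d = new XmlDocument("doc-1", "<cml/>");
        System.out.println(d instanceof XmlDocument); // true
    }
}
```

The storage web service operations in Table 3 would then take and return DocumentRecord instances, with the concrete subtype deciding how the payload is persisted.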
By capturing projects as lists of tasks with start and end times, management and monitoring can be carried out using familiar Gantt charts. Coordination functionality is provided to VO members via the Project Web Service, which allows access to projects, tasks and their constituent documents.

getTaskDocument: get the specified DocumentRecord associated with a specific project task.
saveTaskDocument: save a DocumentRecord to a specific project task.
getProject: get a specified Project object.
getTask: get a task associated with a specified project.

Table 6: The Project Web Service

In addition to the task-based approach for coordinating projects at a high level, there is a need to initiate and control interactions between individual VO participants. For example, when a project requires the production of a specific document, a mechanism is needed to communicate that requirement to the relevant parties. In order to accommodate this requirement, the GOLD Middleware uses a "Worklist" approach whereby each VO user has a list of tasks to be performed. Events such as the beginning or end of tasks, the arrival of new VO members or the need for the production of project documents can be brought to the attention of selected VO users by placing an appropriate message into their Worklists. This has been implemented as a Messaging web service that can be used to send messages to individual VO users or to all users that are members of a specific VO role.

sendUserMessage: sends a message to the Worklist of a specific VO user.
sendRoleMessage: sends a message to the Worklists of all users with a specified role.
getMessages: retrieves all of the messages for a specified VO user.

Table 7: Messaging Web Service

8. Regulation Services

The Regulation Services provided by the GOLD Middleware are responsible for monitoring the interactions between individual users and companies so that actions performed within the VO comply with agreed standards of operation. There are numerous ways to define these standards: it could be a requirement that responses to requests for documents and information are returned within a predefined time interval; it may also be desirable that certain requests and interactions are non-repudiable, such that no party can deny these interactions at a later date.

The GOLD Middleware provides two mechanisms for performing regulation. The first is a logging service that accepts log messages from all of the other components of the Middleware, thereby allowing a thorough audit (or possible replay) of the events and messages that were exchanged over the course of a VO project. The logging web service stores the log messages, contained in the class hierarchy of which a partial view is shown in Figure 3, in the Information Repository database.
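The Worklist mechanism described above amounts to per-user message queues with role-based fan-out. A minimal in-memory sketch follows; the method names mirror Table 7, but this is an illustration, not the GOLD implementation:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the Worklist idea: messages are queued per VO user, and a
// role message fans out to every user holding that role. Illustrative
// only; the real service is a web service backed by the repository.
public class Worklists {
    public final Map<String, List<String>> worklists = new HashMap<>();
    public final Map<String, List<String>> roleMembers = new HashMap<>();

    public void sendUserMessage(String user, String msg) {
        worklists.computeIfAbsent(user, u -> new ArrayList<>()).add(msg);
    }

    public void sendRoleMessage(String role, String msg) {
        for (String user : roleMembers.getOrDefault(role, List.of()))
            sendUserMessage(user, msg);
    }

    public List<String> getMessages(String user) {
        return worklists.getOrDefault(user, List.of());
    }

    public static void main(String[] args) {
        Worklists w = new Worklists();
        w.roleMembers.put("PilotPlantDesigner", List.of("alice", "bob"));
        w.sendRoleMessage("PilotPlantDesigner", "Task 3 started");
        System.out.println(w.getMessages("alice").size()); // 1
    }
}
```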
Figure 3: Logging Classes

The logging web service provides the following functionality to store and search for log messages (Table 8):

logMessage: sends a message to the logging service, which is stored in the Information Repository database.
searchLog: searches the Information Repository database for log messages of a certain type, logged within a certain time period, or pertaining to a specific resource or user.

Table 8: Logging Web Service

The second regulation mechanism provided by the GOLD Middleware uses the non-repudiable exchange tools developed by Cook et al, 2002, to ensure that communications between individuals for certain classes of interaction are impossible to deny at a later date. For the sake of integration, the non-repudiation tools use the logging service to save information regarding the state of any non-repudiable information exchange between participants.

Whilst the non-repudiation protocol as implemented by Cook et al, 2002, is complex, at a simple level the non-repudiation system acts as a set of Web Service handlers that intercept messages flowing between VO participants that need to be non-repudiable. The effective structure of this system is shown in Figure 4, which shows a message, msg, being transferred from participant A to participant B via a trusted Delivery Agent, DA.

Figure 4: Non-Repudiable Message Delivery Structure

9. Implementation Details

The current incarnation of the GOLD Middleware has been implemented within the Sun Microsystems Application Server v9. This has been selected partly based upon its open source status and partly because it provides, in conjunction with NetBeans 5.0, an easy-to-use environment for developing and deploying Web Services. The majority of GOLD Middleware functionality has been implemented as Enterprise Java Beans (EJBs), with Web Service wrappers provided to deliver the services described above. In order to reduce development effort, data is exchanged between web services in the form of XML documents, which are currently mapped automatically to Java classes using the XML Serialization provided by the Java runtime. Future work, however, will define formal XML schema for the representation of the class attributes within the Information Model and use Java XML Binding (JAXB) to exchange information in a more interoperable format. Storage of the data within the Information Repository has been implemented using the Hibernate object-to-relational database mapping library (http://www.hibernate.org), which automatically creates database schema and handles all serialisation / deserialisation and synchronisation issues. The database used for the demonstration system is the open source MySQL package, although use of the Hibernate library provides a high degree of database independence for a production-quality system.
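The automatic XML-to-class mapping mentioned in section 9 can be illustrated with the JDK's built-in java.beans.XMLEncoder/XMLDecoder, one form of XML serialization provided by the Java runtime. Whether GOLD uses exactly this mechanism is not stated in the paper, and the LogMessage bean below is illustrative:

```java
import java.beans.XMLDecoder;
import java.beans.XMLEncoder;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

// Round-trips a JavaBean through the JDK's built-in XML serialization.
// LogMessage is a hypothetical bean, not an actual GOLD model class.
public class XmlRoundTrip {
    public static class LogMessage {
        private String user = "";
        private String text = "";
        public String getUser() { return user; }
        public void setUser(String u) { user = u; }
        public String getText() { return text; }
        public void setText(String t) { text = t; }
    }

    public static void main(String[] args) {
        LogMessage m = new LogMessage();
        m.setUser("alice");
        m.setText("document updated");

        // Encode the bean to an XML document in memory...
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (XMLEncoder enc = new XMLEncoder(out)) { enc.writeObject(m); }

        // ...and decode it back to a Java object.
        try (XMLDecoder dec = new XMLDecoder(new ByteArrayInputStream(out.toByteArray()))) {
            LogMessage back = (LogMessage) dec.readObject();
            System.out.println(back.getUser()); // alice
        }
    }
}
```

JAXB, the future direction mentioned in the text, replaces this reflective bean encoding with schema-driven bindings, which is why it yields a more interoperable wire format.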
Configuration of the VO users, roles and policies is performed using a Java Server Pages (JSP) application, whilst the GUI presented to end users is based upon the GridSphere portal server, with additions to support authentication of users via the GOLD authentication web service. In order to demonstrate the potential for multiple views into an operating VO, the security policies are configured using a Swing GUI which interacts directly with the security EJBs using CORBA over SSL.

10. Conclusions

The Middleware developed as part of the GOLD project has been guided by a number of factors, ranging from the experience of individual team members operating actual CPD projects to a series of interviews performed with a significant number of CPD companies in conjunction with the Management School at Lancaster University. The current implementation, whilst able to support the CPD process, has been designed with flexibility in mind. By modifying the security roles (Section 6) and the document types supported (Section 5), collaborative development projects across a wide range of application domains can be supported. This generic approach is an important aspect of the GOLD project, as one of the key requirements is to support multiple classes of VO.

11. References

Conlin, A., Cook, N., Hiden, H., Periorellis, P. and Smith, R. 2005. RCSTR: 923 GOLD Architecture Document. School of Computing Science, University of Newcastle, July 2005.

Demchenko, Y. 2004. Virtual organisations in computer grids and identity management. Information Security Technical Report, 9, 1, 59-76.

Murray-Rust, P., Rzepa, H.S. and Leach, C. 1995. CML: Chemical Markup Language. Poster presentation, 210th ACS Meeting, Chicago, 21st August 1995.

OASIS. 2003. eXtensible Access Control Markup Language (XACML) Version 1.0. OASIS Standard, http://www.oasis-open.org/committees/xacml.

Sharman, N., Alpdemir, N., Ferris, J., Greenwood, M., Li, P. and Wroe, C. 2004. The myGrid Information Model. In Proceedings of the UK e-Science All Hands Meeting 2004, 1 September.

Skonnard, A. 2002. The XML Files: The birth of Web Services. MSDN Magazine, Volume 17, Number 10, October 2002.
Abstract
We describe our experience of the practical deployment of the gLite VOMS (Virtual Organisation Membership Service). The formation of groups of all sizes with similar research agendas can be viewed as the emergence of Virtual Organisations (VOs). The deployment of centralised VOMS servers at a national level, as part of a grid infrastructure, facilitates the creation of these groups and their subsequent registration with existing grid infrastructures. It also helps grid resources to engage user communities by dealing with them in a more scalable way. In the following we describe the use cases, some of the technical aspects, and the deployment and administration of such a VOMS server. The evaluation of the robustness of VOMS releases 1.4, 1.5 and 3.0, together with known problems, is also described.
services in a Grid environment, and exposing their functionality without putting resources and services in jeopardy. This enables resource providers and grid-enabled service providers to share and host their services and resources with full confidence, which will help the computational Grid to grow.

3. VO support in the UK

3.1 Examples of local use cases

• A small site group with no resources.
• Small national experiments or research projects, like MINOS, CEDAR and RealityGrid, with no resources.
• A local group of a bigger VO that owns local resources and does not want to share them, but wants to access them with the same set of tools. (Typical of this case are data servers for local users.)
• Grid application development test machines.
• Different grids might want to cooperate to support each other's resources.
• The local funding situation requires sharing resources with groups not belonging to any VO; however, they might be willing to access their portion of the resources through the grid.
• Distributed Tier 2 sites1 might want to unify certain categories of users in one VO.
• A site might want to give temporary access to one group's resources to another local group.

3.2 GridPP and NGS

In the UK, two different national grid organizations, GridPP and NGS (National Grid Service), have decided to jointly support gLite VOMS servers:

• common infrastructure to maintain the VOMS servers,
• common VO support,
• common distribution of information,
• enabling each other's VOs on each other's systems.

The pool consists of two front-end servers, one for NGS and one for GridPP, two backup servers and a test server. The servers are hosted in Manchester as part of the Tier 2 and NGS infrastructures. They have been running since January 2006 and they now host 9 national VOs and 2 local ones.

3.3 Enabling a VO

A formal request has to be made to the management. The following information about the VO has to be supplied in the request:

• A VO name that conforms to the LCG/EGEE guidelines. The latest format is a DNS-style format, to avoid names for different VOs clashing. For example, minos.gridpp.ac.uk is an acceptable name. However, short UNIX-like user names are still used for practical reasons.
• VO support contacts: specific people and mailing lists.
• Security contacts: two people who can respond quickly in the event of a security incident relating to a member of the VO, or to the VO as a whole.
• VO service end points, such as file catalogues.
• Hardware requirements: memory size, disk space etc.
• Software requirements: any software beyond the basic Linux tools/libraries, including things which are part of standard distributions, as they may not be installed by default.
• Typical usage pattern: expected job frequency and its variation over time, job length, data read and written per job etc.
• Glue schema fields used: this gives an idea of what is really used in the information system and needs to be properly set and maintained.
• General procedures, such as VO software installation.
• Roughly the number of users expected to use the resources, to give a guide to how many pool accounts to create.

The request has to be approved by the GridPP/NGS management. After approval, the VO is created on the VOMS server and the VO manager is enabled to add users. Sometimes the VO is too small and the VO administration is done by the VOMS manager on request. The information needed to enable the VO at sites is then downloadable from the GridPP/NGS web sites. VOs are considered responsible for the maintenance of this information, in their own interest.

1 A Tier 2 site provides regional compute power within EGEE and GridPP.
4. Technical Aspects of VOMS Services

4.1 gridmap-files to VOMS awareness: from individual authorization to virtual organisation

Production grid services providing simple access to data and compute resources have, to date, dealt with authorization on an individual-by-individual basis. There has been a trend to supply some level of delegation to this process. VOMS mechanisms have the potential to provide a level of delegation such that authorisation decisions may be taken based on membership of a virtual organization, removing the requirement for pre-registration of individuals. This section describes the evolution of these mechanisms.

4.2 gridmap-files

The purpose of a gridmap-file is to translate an incoming request, whose originator's identity is known as a string representation of an X509 namespace (their distinguished name, DN), to a local system identity: the username of a local account. It is a flat file, each line containing a DN in double quotes followed by a comma-separated list of local user identities. The service gateway, having obtained the user's DN through the context of the SSL negotiation, searches the gridmap file line by line until it finds a match. Both the account and the mapping must exist before the user is authorised to use the service.

4.3 Pool Accounts and LDAP directories

The prerequisite of a system account forces each prospective user to register with each system. In a grid comprising many resources and many users, this registration process and account generation is not scalable. The pool account mechanism goes some way towards addressing this: it simply leases system accounts to end grid users for a specified duration.

Having only partially addressed the registration issues (accounts are available but authorization to use them has still to be granted), a system of populating the gridmap file automatically is used. This system creates mappings to sets of (pools of) accounts. To date this has been achieved by regularly retrieving users' DNs, which have been published by various organisations in secure LDAP servers. With all of the above in place, access to resources can be granted to predefined communities, and membership of these communities can be maintained by the communities themselves. At the time of writing, this is where authorization policies are realised in production grids like the UK National Grid.

The authorization mechanisms as described above have a number of drawbacks:

• on-line dependence for maintenance of the gridmap-file,
• denial of service attacks change the behaviour of the communities' membership,
• no ability to deal with users who are members of multiple communities (group membership),
• no ability to select differing levels of access within a recognised community (roles and capabilities),
• data protection.

Out of the LCG and its sister project gLite have emerged more elegant authorization mechanisms. While the concept has been around for a while, stable implementations are still relatively new.

4.4 LCAS and LCMAPS

LCAS (Local Centre Authorization Service) and LCMAPS (Local Credential Mapping Service) replace the gridmap-file system with something a little more sophisticated. As the names suggest, they split the process into separate authorization and mapping mechanisms. LCAS is a plug-in (a service implementation of LCAS is also being developed) which is called from within a modified service (currently there exist modified versions of the Globus gatekeeper and an earlier version of the GridFTP server). LCAS makes authorization decisions based upon three distinct sets of policy data: users/groups allowed; users/groups disallowed (rather like the hosts.allow and hosts.deny files used by tcp wrappers); and service availability times. LCMAPS handles the assignment of local accounts and credentials; it is the enforcement point of the authorization decision made by LCAS.

4.5 VOMS Awareness

The VOMS provides trustable assertions relating group memberships, roles and capabilities to the owner of X509 certificates. It maintains a database of these relationships and a means to administer them. It also provides a web service through which these assertions can be obtained, either by an authorised resource or by the individual to whom the assertions belong. The first case, where the resource obtains these assertions and applies them, provides little more than today's pool account/LDAP system. The full advantage of the VOMS appears when the user is able to download assertions and hand them over to the requested service, thus providing a mechanism by which the user can assert their choice of group membership, role and capability from those granted to them. It is this ability that fully addresses the five drawbacks listed above.

5. VOMS Deployment

5.1 The Virtual Organization Membership Service: Overview

The Virtual Organization Membership Service (VOMS) has been developed in the framework of the EDG and DataTAG collaborations to solve the problem of granting users authorisation to access resources at the VO level, providing support for group membership, roles and capabilities. The VOMS system consists of the following parts:

• User server: receives requests from the client and returns information about the user.
• User client: contacts the server presenting a user's certificate and obtains a list of the user's groups, roles and capabilities.
• Administration client: used by the VO administrator to add users, create new groups and change roles.
• Administration server: accepts requests from the admin client and updates the database.

3.20GHz with 1GB RAM and 160GB HDD. One is the front-end, voms01.gridpp.ac.uk, and the second is the backup, voms02.gridpp.ac.uk. This naming scheme allows the service to be kept running by moving the main alias name to the backup server in case of any serious problems with the main one. The hosts are running under Linux SL3, kernel 2.4.21. Both hosts are now in the "production stage".

Currently there are 10 VOs defined on the GridPP server:

• gridpp – for the GridPP project, the high-energy physics community
• ltwo – teaching purposes within the London Tier 2
• t2k – Next Generation Long Baseline Neutrino Oscillation Experiment
• minos – Main Injector Neutrino Oscillation Search
• cedar – Combined e-Science Data Analysis Resource for high-energy physics
• gridcc – for the GRID project on remote control and monitoring of distributed instrumentation
• manmace – for the Manchester MACE engineers to run on the resources maintained at the Manchester EGEE Tier2 centre
• babar – for running BaBar experiment simulation and analysis on the EGEE/LCG grid
• pheno – dedicated to developing the phenomenological tools necessary to interpret the events produced by the LHC
• ralpp – VO for local tests in the RAL Particle Physics department
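The lookup logic described in Section 4.2 can be sketched in a few lines. The following Python fragment is an illustrative stand-in for the real (C-based) gridmap machinery; the DNs and account names are invented for the example.

```python
# Illustrative sketch of gridmap-file matching (Section 4.2); not the
# actual Globus implementation.  Each line holds a quoted DN followed
# by a comma-separated list of local accounts; the first account is
# used as the default mapping.

def parse_gridmap(text):
    """Parse gridmap-file text into a {DN: [accounts]} dictionary."""
    mappings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue                      # skip blanks and comments
        dn, _, accounts = line.rpartition('" ')
        mappings[dn.lstrip('"')] = accounts.split(',')
    return mappings

def map_user(mappings, dn):
    """Return the local account for a DN, or None if unauthorised."""
    accounts = mappings.get(dn)
    return accounts[0] if accounts else None

# Hypothetical entries for illustration only.
gridmap = parse_gridmap(
    '"/C=UK/O=eScience/OU=Example/CN=jane doe" janed,gridpp001\n'
    '"/C=UK/O=eScience/OU=Example/CN=john doe" johnd\n'
)
print(map_user(gridmap, '/C=UK/O=eScience/OU=Example/CN=jane doe'))  # janed
print(map_user(gridmap, '/C=UK/O=eScience/OU=Unknown/CN=eve'))       # None
```

Real deployments use further conventions on top of this (for example the pool-account entries of Section 4.3); the sketch handles only the plain one-to-one mapping described above.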
5.3 Known problems

There were difficulties with the development and support of VOMS versions 1.4/1.5. Bug samples:
6. Discussion

Our experience of deploying the gLite VOMS service shows that the current state of the middleware is adequate to serve our needs. An increasing number of VO creation requests are being submitted to satisfy the diverse needs of local and regional groups.
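To make Section 4.4's description concrete, the decision flow over the three LCAS policy sets can be sketched as follows. This is an illustrative Python fragment, not LCAS code: the real plug-in reads its allowed/disallowed lists and timeslots from policy files, and the DNs and hours below are invented.

```python
# Minimal sketch of an LCAS-style decision (Section 4.4): allow list,
# deny list, then service availability times.  Policy data here is
# invented; the real LCAS plug-in reads policy files.
from datetime import time

ALLOWED = {"/C=UK/O=eScience/CN=jane doe"}
DENIED = {"/C=UK/O=eScience/CN=mallory"}          # like hosts.deny
OPEN_HOURS = (time(8, 0), time(20, 0))            # service availability

def authorize(dn, now):
    """Return True only if all three policy checks pass."""
    if dn in DENIED:                  # the deny list always wins
        return False
    if dn not in ALLOWED:             # must appear in the allow list
        return False
    start, end = OPEN_HOURS
    return start <= now <= end        # inside the availability window

print(authorize("/C=UK/O=eScience/CN=jane doe", time(12, 0)))   # True
print(authorize("/C=UK/O=eScience/CN=mallory", time(12, 0)))    # False
print(authorize("/C=UK/O=eScience/CN=jane doe", time(23, 0)))   # False
```

The mapping step that follows a successful decision (LCMAPS assigning a local account) is the gridmap-style lookup of Section 4.2.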
7. Acknowledgements

This work was funded by the Particle Physics and Astronomy Research Council through their GridPP and e-Science programmes and by the NGS.
We would also like to thank other members of
the EGEE security working group for providing
much of the wider environment into which this
work fits.
References
Abstract
In this paper the authors present the RealityGrid PDA and Smartphone Clients: two
novel and highly effective handheld user interface applications, which have been created
purposefully to deliver (as a convenience tool) flexible and readily available ‘around the
clock’ user access to scientific applications running on the Grid. A discussion is
provided detailing the individual user interaction capabilities of these two handheld
Clients, which include real-time computational steering and interactive 3D visualisation.
The authors also identify, along with several lessons learned through the development of
these Clients, how such seemingly advanced user-oriented e-Science facilities (in
particular interactive visualisation) have been efficiently engineered irrespective of the
targeted thin-device deployment platform’s inherent hardware (and software) limitations.
Figure 1. “The central theme of RealityGrid is computational steering” [2] (left). The project has developed its own Grid-enabled computational steering toolkit [10], comprising lightweight Web Services middleware and a programmable application library (right).
Figure 2. Lightweight Visualisation (right) has provided a lightweight software framework to support
remote and ubiquitous visualisation-oriented user interaction via a wide range of inexpensive and readily
available commodity devices such as PDAs (left), mobile phones and also low-end laptop/desktop PCs.
Figure 3. The RealityGrid PDA and Smartphone Clients both provide a mobile facility to query a Registry
(left) and retrieve a list of currently deployed jobs (right). Once a job has been selected, each Client will
then attach (connect) remotely to the appropriate scientific application running either on or off the Grid.
Figure 4. The PDA and Smartphone Clients provide a mobile user interface for monitoring and tweaking
(steering) simulation parameters in real-time (left), additionally incorporating visual parameter value
trend plotting capabilities (right) to aid understanding of how experiments change and evolve as they run.
facility to monitor and tweak (steer) simulation parameters in real-time (either individually or as part of a batch), as the simulation itself runs and evolves concurrently on a remote computational resource. In RealityGrid, individual parameters are exposed from within a scientific application through the project’s open-source computational steering API. This level of exposure enables internal simulation/application parameter data to be accessed and manipulated (steered) remotely via the RealityGrid PDA and Smartphone Clients, through a representative middleware interface known as the Steering Web Service (SWS) (Refer to Figure 1) or, in the case of the earlier OGSI-based middleware, the Steering Grid Service (SGS). The RealityGrid PDA and Smartphone Clients, via remote interactions through the SWS/SGS, are both additionally able to provide mobile facilities for invoking lower-level simulation control procedures such as Pause/Resume and Stop/Restart.

Whilst attached to a simulation, the PDA and Smartphone Clients will each poll their representative SWS/SGS via SOAP message exchanges at regular one-second intervals, to request and subsequently display ‘up to the second’ simulation parameter state information; one integral parameter update (request and display) requires roughly two tenths of a second to fully complete. The information returned to the Clients from the RealityGrid middleware (in SOAP format) describes all of the parameters that have been exposed from a simulation using the steering toolkit. Each parameter description includes a unique identifier (or handle); current, minimum and maximum values; data types; and status flags denoting ‘steerable’, monitored-only or internal to the application-embedded steering library. In order to maximise usability during the steering process, both user interfaces have been developed to utilise a sophisticated handheld device multi-threading scheme. This has enabled each of the Clients to continuously poll the steering middleware (for updated parameter information) in an integral and dedicated background thread, whilst concurrently remaining active and responsive at all times to capture user input events (entering/tweaking parameter values, etc.) in a completely separate user interface thread. Prior to the implementation of this handheld device multi-threading scheme, both Clients would simply ‘lock up’ their user interfaces and temporarily block all user inputs throughout the duration of every regular update polling cycle.

3.5 Visual Graph Plotting

To enhance and extend the process of steering a simulation through either of the handheld RealityGrid Clients, both user interfaces also incorporate a visual parameter value trend plotting facility (Refer to Figure 4). Using a plotted graph to visually represent parameter value fluctuation trends was identified by the initial Human Factors investigation [3] as greatly improving the scientist’s perception and understanding of their experiments. The two handheld RealityGrid Clients will each allow the researcher to visually plot historical records for one user-specified parameter at a time. This visual plotting functionality has been custom-built (for both Clients) using the .NET Compact Framework’s standard Graphics Device Interface API (known as GDI+). As with computational steering, visual graph plotting for both Clients is performed in real-time, allowing the researcher to visually observe how their simulation changes and evolves concurrently, either naturally over a period of time or, more often, as a direct response to user-instigated computational steering activity.
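The split between the polling thread and the user interface thread described above can be sketched as follows. This is an illustrative Python fragment rather than the Clients' actual .NET Compact Framework code; the middleware call is a stub, the parameter is invented, and the one-second interval is shortened for the example.

```python
# Sketch of the Clients' separation of a polling thread from a UI
# thread (Section 3.4).  Python stand-in for the .NET Compact
# Framework implementation; poll_middleware() is a stub.
import threading
import queue
import time

updates = queue.Queue()          # polled parameter states for the UI
stop = threading.Event()

def poll_middleware():
    """Stub for the SOAP request to the SWS/SGS (invented parameter)."""
    return {"temperature": 42.0}

def poller(interval=0.01):       # the real Clients poll every second
    while not stop.is_set():
        updates.put(poll_middleware())   # only the background thread
        time.sleep(interval)             # ever blocks on the network

threading.Thread(target=poller, daemon=True).start()

# The 'UI thread' stays free to handle input, draining updates as
# they arrive instead of blocking on each request/response cycle.
collected = []
while len(collected) < 3:
    collected.append(updates.get())
stop.set()
print(len(collected))  # 3
```

The queue is the hand-off point: the background thread never touches the interface, and the interface never waits on the middleware, which is exactly what prevents the 'lock up' behaviour described above.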
Figure 5. The handheld RealityGrid Clients each offer the researcher an inclusive provision of advanced,
visualisation-oriented interaction facilities, which (up until this point) would only typically otherwise
have been accessible through conventional desktop (or laptop)-based user interface applications.
3.6 Handheld Interactive 3D Visualisation

The most advanced feature of the RealityGrid PDA and Smartphone Clients is their highly novel ability to each support user-interactive 3D visualisation (Refer to Figures 5 and 6). Both of the handheld RealityGrid Clients have been developed specifically for use with the authors’ Lightweight Visualisation framework, allowing them to each offer researchers an inclusive provision of advanced, visualisation-oriented interaction facilities, which (up until this point) would typically only have been accessible through conventional desktop (or laptop)-based user interface applications. Using Lightweight Visualisation’s middleware and its accompanying, application-embedded module (Refer to Figure 2), the handheld RealityGrid Clients are both able to communicate user-captured interaction commands (pan, zoom, rotate, etc.) directly into a remote visualisation application, and in response receive visual feedback in the form of pre-rendered, scaled, encoded and streamed image (frame) sequences, processed and displayed locally within the Clients (using GDI+) at an average frequency of 7-10 frames-per-second. Thus Lightweight Visualisation is able to achieve an advanced level of 3D visualisation interactivity within each of the Clients by shielding their respective resource-constrained handheld devices from having to perform any graphically-demanding operations (modelling, rendering, etc.) at the local user interface hardware level.

The process by which the handheld RealityGrid Clients connect and interact remotely with a visualisation (via Lightweight Visualisation) is slightly more involved than for steering a simulation through the SWS/SGS. Having found and subsequently selected a visualisation application/job from a Registry (Refer to Figure 3), the researcher’s PDA or Smartphone Client will then initiate a two-stage attaching sequence (Refer to Figure 2): firstly acquiring a remote host network endpoint from the Lightweight Visualisation Web Service; secondly establishing a direct socket channel (using the acquired addressing information) to the visualisation’s representative Lightweight Visualisation Server. The Server is responsible for managing simultaneous Client connections, communicating received interaction commands directly into the visualisation application and encoding/serving pre-rendered images back to the Client(s) for display. All communication between the Client and the Server (message passing/image serving) is performed using Transmission Control Protocol/Internet Protocol (TCP/IP) with optional SSL data protection. The use of TCP/IP as the base networking protocol within Lightweight Visualisation was deemed preferable to the higher-level HTTP, in order to eliminate undesirable transmission and processing latencies incurred through the marshalling of large quantities of binary image data within SOAP-formatted message packets. Once attached to a visualisation, user interactions through the handheld Clients adopt similar, standard conventions to those of the familiar desktop model: dragging the stylus across the screen (as opposed to the mouse) or pressing the directional keypad (as opposed to the keyboard cursor keys) in order to explore a visualised model, and selecting from a series of defined menu options and dialog windows in order to access any application-specific features of the remote visualisation software (Refer to Figure 6). As with steering, handheld device multi-threading has again been implemented within the Clients’ 3D visualisation front-ends, enabling them to both continuously receive and display served image streams in a dedicated background thread, whilst constantly remaining active and responsive to capture all user input activity in a separate and concurrently running user interface thread.

Figure 6. When using the RealityGrid PDA or Smartphone Clients (left), researchers who employ VMD [13] on a daily basis are able to access and interact (through their handheld device) with familiar, indigenous user interface facilities from their everyday desktop or workstation application (right).

3.7 Tailor-Made User Interaction

RealityGrid has delivered Grid-enabling support for a wide range of existing and heterogeneous scientific visualisation codes. These currently include commercial and open-source codes that have been successfully instrumented (adapted) for RealityGrid using the toolkit, as well as purely bespoke codes that have been written within the project to fulfil a specific scientific role, usually acting as an on-line visualisation counterpart to a bespoke simulation application. Each of RealityGrid’s numerous visualisation applications typically provides its own unique assemblage of dedicated user interaction services, which are also often geared specifically towards a particular, individual field of scientific research (fluid dynamics, molecular dynamics, etc.). At the heart of Lightweight Visualisation is its ability to accommodate this level of user-interactive diversity by ‘tapping into’ any given visualisation’s indigenous user facilities (through a small embedded code module or script) (Refer to Figure 2) and then exposing them (similarly to steerable parameters) so that they can be accessed and invoked remotely by a specially built (or Lightweight Visualisation-enabled) Client; in this instance the RealityGrid PDA and Smartphone Clients.

To properly demonstrate the benefits of this ‘tapping into’ approach, the RealityGrid PDA and Smartphone Clients have both initially been developed to provide tailor-made handheld front-ends for the Visual Molecular Dynamics (VMD) [13] application (Refer to Figures 5 and 6). VMD is an open-source visualisation code that has been employed extensively within RealityGrid projects such as STIMD [6] and SPICE [7]; it has also provided the initial test-bed application for developing the Lightweight Visualisation framework. When using the 3D visualisation user interfaces of the RealityGrid PDA or Smartphone Clients, researchers who employ the VMD code on a daily basis are able to access and interact (through their handheld device) with familiar, indigenous user interface facilities (in addition to the generic pan, zoom, rotate, etc.) from their everyday desktop or workstation application (Refer to Figure 6).

4. Conclusion and Future Directions

In this paper the authors have presented the RealityGrid PDA and Smartphone Clients, offering an insight into how these types of handheld user interface applications (despite their perceived inherent limitations, both in terms of hardware and software capability) can now be effectively engineered to deliver real, beneficial scientific usability (steering, 3D visualisation, security, etc.) and provide flexible ‘around the clock’ user access to the Grid by complementing and extending the more traditional desk-based methods of e-Science user interaction. Through the creation of these two handheld user interfaces, scientists within the RealityGrid project have been able to take advantage of a highly convenient means of readily accessing and interacting with their Grid-based research experiments. This has proved highly beneficial, particularly when having to deal with scientific jobs (often due to delayed scheduling allocations or lengthy compute-time periods) outside of working hours or whilst away from the conventional desk/office-based working environment: during a meeting or conference, at home in the evenings, away at weekends, etc.
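As a recap of Section 3.6, the two-stage attaching sequence reduces to an endpoint lookup followed by a direct socket exchange. The Python sketch below is illustrative only: the in-process 'server', the command and frame strings, and the lookup function are invented stand-ins for the Lightweight Visualisation Web Service and Server.

```python
# Illustrative sketch of the two-stage attach (Section 3.6): acquire a
# network endpoint, then open a direct TCP channel for interaction
# commands and pre-rendered frames.  All names and the in-process
# server are invented stand-ins for the real framework.
import socket
import threading

listener = socket.socket()
listener.bind(("127.0.0.1", 0))            # ephemeral port for the demo
server_port = listener.getsockname()[1]
listener.listen(1)

def lookup_endpoint(job_id):
    """Stand-in for the Lightweight Visualisation Web Service call."""
    return ("127.0.0.1", server_port)      # stage one: acquire endpoint

def fake_server():
    """Stand-in for the Lightweight Visualisation Server."""
    conn, _ = listener.accept()
    if conn.recv(64).startswith(b"ROTATE"):  # an interaction command
        conn.sendall(b"FRAME:0001")          # a pre-rendered 'frame'
    conn.close()

threading.Thread(target=fake_server, daemon=True).start()

host, port = lookup_endpoint("vmd-job-1")
client = socket.create_connection((host, port))  # stage two: direct socket
client.sendall(b"ROTATE 10 0 0")                 # command goes down the wire
frame = client.recv(64)                          # frame comes straight back
client.close()
print(frame.decode())
```

The point of the direct TCP channel, as discussed in Section 3.6, is that frames travel as raw bytes rather than being marshalled inside SOAP messages.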
Future avenues of development for the RealityGrid PDA and Smartphone Clients currently include creating additional tailor-made handheld front-ends for alternative visualisation codes (and for scientists who don’t use VMD). Extended development is also currently engaged with integrating a handheld user interface for RealityGrid’s recently released Application Hosting Environment (AHE) [14], which will introduce invaluable new resource selection, application launching and job management capabilities. Further endeavours (outside of RealityGrid) are also currently geared towards deploying this handheld technology as part of a trial with the University Hospitals of Leicester NHS Trust (UHL NHS Trust), in which it will be employed by surgeons and medical consultants to view and interact with Magnetic Resonance Imaging (MRI) and Computed Tomography (CT) scan data whilst away from the conventional office or desk-based working environment.

Acknowledgement

The authors wish to thank the dedicated teams of scientists within the RealityGrid project (University College London, University of Manchester, University of Oxford, Loughborough University, Edinburgh Parallel Computing Centre and Imperial College). Special thanks also go to Simon Nee (formerly of Loughborough University) and to Andrew Porter of Manchester Computing for their invaluable additional support and advice.

References

[1]. RealityGrid, http://www.realitygrid.org
[2]. J. Redfearn, Ed. (2005, Sept.). “RealityGrid – Real science on computational Grids.” e-Science 2005: Science on the Grid. [On-line]. pp. 12-13. Available: http://www.epsrc.ac.uk/CMSWeb/Downloads/Other/RealityGrid2005.pdf
[3]. R.S. Kalawsky and S.P. Nee. (2004, Feb.). “e-Science RealityGrid Human Factors Audit – Requirements and Context Analysis.” [On-line]. Available: http://www.avrrc.lboro.ac.uk/hfauditRealityGrid.pdf
[4]. D. Bradley. (2004, Sept.). “RealityGrid – Real Science on computational Grids.” e-Science 2004: The Working Grid. [On-line]. pp. 12-13. Available: http://www.epsrc.ac.uk/CMSWeb/Downloads/Other/RealityGrid2004.pdf
[5]. R.J. Blake, P.V. Coveney, P. Clarke and S.M. Pickles, “The TeraGyroid Experiment – Supercomputing 2003.” Scientific Programming, vol. 13, no. 1, pp. 1-17, 2005.
[6]. P.W. Fowler, S. Jha and P.V. Coveney. “Grid-based steered thermodynamic integration accelerates the calculation of binding free energies.” Philosophical Transactions of the Royal Society A, vol. 363, no. 1833, pp. 1999-2015, August 15, 2005.
[7]. S. Jha, P. Coveney and M. Harvey. (2006, June). “SPICE: Simulated Pore Interactive Computing Environment – Using Federated Grids for “Grand Challenge” Biomolecular Simulations,” presented at the 21st Int. Supercomputer Conf., Dresden, Germany, 2006. [On-line]. Available: http://www.realitygrid.org/publications/spice_isc06.pdf
[8]. TeraGrid, http://www.teragrid.org
[9]. M. Schneider. (2004, April). “Ketchup on the Grid with Joysticks.” Projects in Scientific Computing Annual Research Report, Pittsburgh Supercomputing Center. [On-line]. pp. 36-39. Available: http://www.psc.edu/science/2004/teragyroid/ketchup_on_the_grid_with_joysticks.pdf
[10]. S.M. Pickles, R. Haines, R.L. Pinning and A.R. Porter. “A practical toolkit for computational steering.” Philosophical Transactions of the Royal Society A, vol. 363, no. 1833, pp. 1843-1853, August 15, 2005.
[11]. WSRF::Lite/OGSI::Lite, http://www.sve.man.ac.uk/Research/AtoZ/ILCT
[12]. SecureBlackBox, http://www.eldos.com/sbb
[13]. W. Humphrey, A. Dalke and K. Schulten, “VMD – Visual Molecular Dynamics”, J. Molec. Graphics, vol. 14, pp. 33-38, 1996.
[14]. P.V. Coveney, S.K. Sadiq, R. Saksena, S.J. Zasada, M. Mc Keown and S. Pickles, “The Application Hosting Environment: Lightweight Middleware for Grid Based Computational Science,” presented at TeraGrid '06, Indianapolis, USA, 2006. [On-line]. Available: http://www.realitygrid.org/publications/tg2006_ahe_paper.pdf

Ian R. Holmes and Roy S. Kalawsky 2006
Abstract
Educational institutions, enterprises and industry are increasingly adopting portal frameworks, as
gateways and the means to interact with their customers. The dynamic components that make up a
portal, known as portlets, either use proprietary interfaces to interact with the container that controls
them, or more commonly today, the standardised JSR-168 interface. Currently, however, portal
frameworks impose a number of constraints and limitations on the integration of legacy
applications. This typically means that significant effort is needed to embed an application into a
portal, and results in the need to fork the source tree of the application for the purposes of
integration. This paper reports on an investigation that is studying the means to utilise existing Web
applications through portlets via bridging technologies. We discuss the capabilities of these
technologies and detail our work with the PHP-JavaBridge and PortletBridge-portlet, while
demonstrating how they can be used with a number of standard PHP applications.
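The proxy-and-rewrite idea at the heart of this investigation can be shown in miniature. The sketch below uses Python and an inline page purely for illustration (a real bridge such as the PortletBridge-portlet is a Java portlet fetching a downstream site over HTTP); the wiki URLs and the portal prefix are invented.

```python
# Miniature illustration of the proxy-and-rewrite idea behind the
# PortletBridge-portlet: take markup from a downstream site and
# rewrite its links so they resolve back through the portal.
from html.parser import HTMLParser

class LinkRewriter(HTMLParser):
    """Collect <a href> values rewritten to point back at the portal."""
    def __init__(self, portal_prefix):
        super().__init__()
        self.portal_prefix = portal_prefix
        self.rewritten = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    # route the downstream URL through the portal proxy
                    self.rewritten.append(self.portal_prefix + value)

# Hard-coded stand-in for a page fetched from a downstream Web site.
downstream_page = '<p><a href="/wiki/Main">Main</a> <a href="/wiki/Edit">Edit</a></p>'
rewriter = LinkRewriter("/portal/bridge?url=")
rewriter.feed(downstream_page)
print(rewriter.rewritten)
```

Rewriting is the delicate half of the job: every link, form action and resource reference in the proxied markup must be redirected through the portal, otherwise the user's next click escapes the portlet.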
Typically, scripting languages are used for Web development, which is a popular and well-established method of generating an application. Although portlets can use JavaServer Pages

While JSR-223 and the PHP-JavaBridge project provide a means for Java to access script-based languages, portal developers are also producing bridges. Apache Portals is developing Portals
Bridges [6], which will offer JSR-168 complaint JSR-223 introduces a new scripting API
portlet development using Web frameworks, javax.script that consists of interfaces and
such as Struts, JSF, PHP, Perl and Velocity. classes, which define the scripting engines and a
Bridges are being developed for each of the framework for their use in Java. JSR-223 also
supported Web frameworks. In Portals Struts provides a number of scripting engines for
Bridge [8] support has been added to a number different script based languages. Currently,
of portals, including JetSpeed, JBoss, under the reference implementation version 1.0
GridSphere [9], Vignette Application Portal and [11], there are servlet engines for Groovy,
Apache Cocoon. Portlet developers are also Rhino and PHP. The GroovyScriptEngine,
producing portlets, which are intended to allow RhinoScriptEngine and PHPScritpEngine
regular Web sites to be hosted as JSR-168 classes in the API represent the servlet engines,
compliant portlets. The PortletBridge-portlet respectively.
[10] aims to proxy and rewrite content from a
downstream Web site into a portlet. PHP uses a Server API (SAPI) module, which
defines interactions between a Web server and
The question is though, with these bridging PHP. PHP 4 includes Java capabilities provided
technologies becoming more usable, can either by a servlet SAPI extension module or by
developers use them efficiently and avoid re- integrating Java into PHP. The Java SAPI
implementing existing applications within module enables PHP to be run as a servlet while
compliant portlets? using Java Native Interface (JNI) to
communicate with Java. However, unlike PHP
This paper will discuss the various 4, PHP 5 does not include the Java integration
implementations and outline our own support in its current release, although it is
experiences with some of these bridging available under the development branch. The
technologies. Section 2 will give an insight to version of PHP 5 included with the reference
some of the bridging technologies introduced. implementation of JSR-223 does not use the
Section 3 discusses our configuration and set-up more traditional CGI/FastCGI as its server API.
of the bridging technologies. Section 4 presents Instead, it makes use of the new PHP Java
some results of our implementation. Section 5 scripting engine.
discusses some PHP Wiki applications tested
with the PHP-JavaBridge. While Section 6, JSR-223’s reference implementation is based on
provides an insight to our future work involving PHP 5.0.1. The current version of PHP is 5.1.2.
a Single Sign-On (SSO), which interfaces to Even though there is a mismatch, versions of
existing web applications from a portlet. PHP can be compiled for inclusion into JSR-
Finally, Section 7 concludes the paper. 223. Very few PHP extensions are provided by
JSR-223. Any additional PHP extensions
2. Bridging Technologies desired can be compiled against the version of
PHP packaged with JSR-223.
In this section we discuss and outline the
reference implementation of JSR-223, then we With respect to the installation of the JSR-223,
move on to detail both PHP-JavaBridge and the Apache Tomcat is configured by the installation
PortletBridge-portlet in terms of architecture, script to process .php files. When executing
functionality and features. PHP applications with JSR-223, the PHP servlet
engine calls the PHP interpreter and passes on
2.1 JSR-223
the script. Although JSR-223 contains a PHP
The JSR-223 specification is designed to allow executable binary, it is not used. Instead, the
scripting language pages to be used in Java JSR-223 servlet uses JNI for native library files,
server-side applications and vice versa. It namely php.so or php.dll under Windows.
basically allows scripting languages to access The advantage to this method is that the
Java classes. JSR-223 is currently a public draft performance of PHP under JSR-223 is similar to
specification and may therefore change. The that of a native Apache Web server with
material presented in this section is based on the mod_php.
draft specification. In order to assess JSR-223’s
reference implementation we examine it in JSR-223 has examples of session sharing
relationship with PHP as the scripting language between JSP and PHP, amongst others. Thus
and Apache Tomcat as the servlet container. JSP and PHP applications may exchange or
share information via sessions. Both types of
session may be attached to a single PHP
511
Proceedings of the UK e-Science All Hands Meeting 2006, © NeSC 2006, ISBN 0-9553988-0-0
application, through the use of the use_trans_sid option, resulting in the Web server's response containing both session IDs.

The PHP servlet engine provides additional objects as predefined variables to aid interoperability. JSR-223 requires that objects, such as an HTTP request, should be available from scripting languages. From a PHP script, these predefined variables reflect the same variables available from a JSP page, namely:
• $request
• $response
• $servlet
• $context

Although there can be communication between PHP and Tomcat sessions, there are limitations on what a session can hold. A Tomcat session may only hold scalar values, i.e. strings and numbers. It may not contain database connections, file handles or PHP objects, but may contain Java objects.

Other limitations of the reference implementation are most noticeable when migrating or using existing PHP applications. PHP applications usually make extensive use of predefined variables; however, $_SERVER['REQUEST_URI'] and $_SERVER['QUERY_STRING'] are not defined. Instead, the request variables can be found in $_SERVER['argv']. Another limitation is relative paths within PHP pages, which can be worked around in two ways: one is to set the include_path in the php.ini initialisation file; the other is to modify PHP applications to use absolute paths.

Although JSR-223 is still a reference implementation under review, it makes substantial progress towards the interoperability of Java and script-based languages. This brings forth a new dimension of Java interoperability, which can be useful for developers and end users alike. Despite JSR-223's limitations, it provides foundations for other implementations.

2.2 The PHP-JavaBridge

The PHP-JavaBridge is, as the name implies, a PHP-to-Java bridge. The PHP-JavaBridge uses JSR-223 concepts while adding important aspects, such as enabling communication with already existing PHP server installations. The PHP-JavaBridge is described as an XML-based network protocol designed to communicate with native scripting engines which have a Java or ECMA 335 [13] virtual machine.

The PHP-JavaBridge XML protocol is used to connect to PHP server instances. The PHP server instance can be a J2EE container, such as Apache Tomcat, or a PHP-enabled Web server, such as Apache or IIS. PHP version 5.1.2 is included with the PHP-JavaBridge as a CGI/FastCGI component, with native library files, such as php-cgi-i386-linux.so for Linux. The bridge is not limited to the included CGI environment: the operating system's native installation of a PHP CGI/FastCGI can also be used, or an alternative set of binaries can replace those included. Although PHP is the focus of the PHP-JavaBridge project, it includes other features, such as the ability to combine PHP scripts with JSF (since version 3.0 of the PHP-JavaBridge) and RMI/IIOP. Examples of these capabilities are included in the standard distribution.

Unlike the reference implementation of JSR-223, the PHP-JavaBridge does not use JNI. Instead, HTTP instances are allocated by the Web server or from the Java/J2EE backend. The PHP instances communicate using a "continuation passing style", which means that if a PHP instance were to crash it would not take down the Java server or servlet at the backend. Because the PHP-JavaBridge communicates via a protocol, it has a certain amount of additional functionality. There can be more than one HTTP server routed to a single PHP-JavaBridge, or each HTTP server can have its own PHP-JavaBridge. When the PHP-JavaBridge is running in a J2EE environment, it is also possible to share sessions between JSP and PHP. Although the PHP-JavaBridge can use Novell MONO or Microsoft .NET as a backend, this is beyond the focus of this section.

As mentioned, the PHP-JavaBridge is flexible with regard to the type of PHP server or communication method that it uses. The PHP-JavaBridge also has four operating modes, some of which implement practical solutions for production servers. The operating modes of the bridge are categorised as:
• Request
• Dynamic
• System
• J2EE

When the PHP-JavaBridge is operated in Request mode, the bridge is created and
destroyed on every request or response. While the bridge is running in Dynamic mode, the bridge starts and stops synchronously with a pre-defined HTTP server. System mode is when the bridge is run as a system service; this mode is installed as an RPM under Red Hat Enterprise Linux. Finally, the J2EE mode is when the bridge is installed in a J2EE server, such as Apache Tomcat. Other modes and configurations are possible, though not discussed here [5], including executing a standalone bridge and configurations for Security Enhanced Linux distributions.

When using a J2EE environment, PHP applications can be packaged with a PHP-JavaBridge into a .war archive. The archive will contain the configuration of the bridge, the PHP scripts and, if needed, a CGI environment. This makes Web applications for Java, based on PHP, easy to deploy.

The PHP-JavaBridge does not encounter the same issues with PHP extensions as the JSR-223 reference implementation. This is due to the configuration flexibility of the PHP environment. When utilising an existing PHP installation, any extension module can be installed in the normal manner, with little or no configuration required of the PHP-JavaBridge. However, the ease of integrating an existing PHP application into a J2EE environment can be hampered by the application itself. A PHP application, in some cases, is not fully aware of the J2EE environment, such as host names containing the additional port in the URL. Some predefined PHP variables are also not available in certain circumstances, or are not correct under the PHP-JavaBridge environment.

The performance figures for the PHP-JavaBridge [5] show that the overall execution time is similar to that of native scripting engines executed from the command line with jrunscript. The results also show that there is a performance overhead when using the PHP-JavaBridge communication protocol. However, the overhead introduced is not substantial and, considering that communication between different hosts is only 50% slower than communication on a single host, it is fair to accept the overhead in favour of flexibility.

Developers using either JSR-223 or the PHP-JavaBridge can produce hybrid Java and PHP applications. Session sharing between JSP and PHP is also available with both technologies. However, the PHP-JavaBridge offers greater flexibility and integration when communicating with existing PHP applications or servers.

2.3 The PortletBridge-Portlet

While JSR-223 and the PHP-JavaBridge concentrate on the interoperability of scripting languages and Java, the PortletBridge-portlet focuses on creating a generic method of allowing Web sites to be viewed via a JSR-168 compliant portlet. Typically, portals such as uPortal provide the means to view Web sites via Iframes. Yet the JSR-168 specification does not recommend their use, as there is no control over the Iframe for the portal. Typically, links within an Iframe are external to the portal environment, which can result in inconsistencies in a user's experience.

The PortletBridge-portlet is also known as a 'Web Clipping Portlet'. The PortletBridge-portlet proxies Web site content and renders the downstream content into a portlet. An interesting concept from the PortletBridge-portlet is the use of XSLT to control or transform the rendering of the downstream content. This is ideal for portlet environments, as portlets do not typically cover an entire page. Developers can select the parts they wish to render in a portlet, which is not possible with an Iframe. These differences are also highlighted by the ability to resize the PortletBridge-portlet, which is likewise not possible with an Iframe.

The PortletBridge-portlet also allows developers to control which links within the downstream content are proxied and which are not. Images or Flash animations from remote resources can also be proxied by the PortletBridge-portlet. A regular expression is used to define which links are proxied within the portlet and which links are external to the portlet. The rewriting of the downstream content is not limited to links: JavaScript and CSS can also be handled by the XSLT or with regular expressions. By default, only the body of a Web page is rendered by the PortletBridge into a portlet. This follows the JSR-168 specification, in which no header or footer tags are allowed in portlets. However, it is still possible to manipulate the content of a page's header/footer with XSLT and have it rendered into the portlet.

The PortletBridge-portlet uses the CyberNeko HTML parser [14], which allows HTML pages to be accessed via XML interfaces. It is built around the Xerces Native Interface (XNI), which has the ability to communicate "streaming" documents. The PortletBridge-portlet also
defines a "memento" that holds user state, such as cookies. The memento is shared between the PortletBridge portlet and the PortletBridge servlet. For this interaction to occur, the J2EE container, in this case Apache Tomcat, needs to have crossContext enabled.

Currently the PortletBridge-portlet lacks the features to support complex XHTML and is a relatively young open source project. Nevertheless, it presents a viable option for making Web sites or applications available through a portlet. In addition, due to the adherence of the PortletBridge-portlet to the JSR-168 specification, it is also possible to use this technology in various different portals, or even to consume the portlet with Web Services for Remote Portlets (WSRP) [15]. Because the PortletBridge-portlet is a generic solution that is confined to portals, it is also practical to combine the PortletBridge with other bridging technologies, such as the PHP-JavaBridge, to provide a fully featured application environment.

3. BibAdmin Portlet Configuration

Non-Java applications, in our case PHP, would otherwise need to be redesigned or re-implemented in Java for use as a portlet. Using the techniques described in this section, a developer can reduce the amount of work needed to make a PHP application available as a portlet under a J2EE environment.

Our configuration consisted of the following components:
• Debian Linux with PHP5-CGI binaries,
• A PHP application,
• MySQL server 4.1.x,
• Apache Tomcat 5.5.16,
• PHP-JavaBridge 3.0.8r3,
• PortletBridge-portlet CVS,
• GridSphere 2.1.x.

Our installation of PHP included the MySQL extension, as it is common for PHP applications to be part of a LAMP/WAMP stack. An arbitrary multi-user application was chosen, namely BibAdmin 0.5 [16]. BibAdmin is a BibTeX bibliography server. A multi-user application was needed to match the security framework of the portal and is required for the future work outlined in Section 5. GridSphere is a JSR-168 compliant portal with good user support, which we have used extensively in other projects.

The PHP-JavaBridge was installed in Apache Tomcat as a .war file. The php-servlet.jar and JavaBridge.jar were also placed in $CATALINA_HOME/shared/lib as per the installation instructions [5]. The PHP-CGI binary and the MySQL PHP extension were used from the Debian Linux installation, instead of those originally included with the PHP-JavaBridge. This provided a feature-rich PHP environment to applications running with the PHP-JavaBridge.

When using PHP applications under the PHP-JavaBridge, PHP's configuration file (php.ini) is not in a default location. Instead it is located under $CATALINA_HOME/webapps/JavaBridge/WEB-INF/cgi, referenced as php-cgi-i386-linux.ini. Any PHP configuration, such as the configuration of the MySQL extension, is placed in this file.

PHP 5 has some differences compared to PHP 4, which can lead to application incompatibilities. BibAdmin is a PHP 4 application, which was migrated to PHP 5 [17]. Certain PHP functions used in BibAdmin required modifications to reflect API changes in PHP 5, such as mysql_fetch_object(). Other changes involved some PHP predefined variables. The PHP variable $_SERVER['SERVER_NAME'] was changed to $_SERVER['HTTP_HOST']. This change was made to reflect the additional port number in the URLs needed for BibAdmin administrative tasks when creating a new user. Another useful PHP variable was $_SERVER['PHP_SELF'], which provides not only the PHP page but also the Web application directory, i.e. /BibAdmin-0.5/index.php. BibAdmin under a normal PHP 5 setup was compared to BibAdmin under the PHP-JavaBridge; no differences were found in terms of functionality or usage.

The PortletBridge-portlet is typically installed from a Web archive. However, with GridSphere, the PortletBridge-portlet was first expanded into a directory, due to a missing root directory in the PortletBridge-portlet package. The created directory is copied into Apache Tomcat's webapps directory. Some additional changes to the web.xml file were needed for the PortletBridge-portlet to operate correctly in GridSphere. In addition, the GridSphere UI tag library is required.

A simple XSL style sheet was used that defines what transformations are carried out on the
downstream content from a Web site. In the Edit mode of the portlet, the configuration of the portlet can be altered. The preferences include proxy server authentication, the initial URL to load and the XSL style sheets. The Edit mode of the PortletBridge-portlet is only used for development purposes; the end user would not generally use these preferences. The portlet's proxy configurations are useful for larger or more complex networks. Different proxy authentication methods can be used, such as Windows NT LAN Manager (NTLM).

BibAdmin was configured with a MySQL database connection setting and packaged as a .war file. Included in the Web archive's WEB-INF were the CGI components, web.xml and any PHP-JavaBridge dependencies.

Figure 1 shows the final layout of the configuration for reusing a Web application in a portlet under a J2EE environment.

Figure 1: Hosting server containing the PortletBridge-portlet and the existing application with its CGI/FastCGI environment.

Apache Web server, although this is only true after the minor changes needed to BibAdmin to make it aware of its different environment. Figure 2 shows the BibAdmin portlet login page. Figure 3 is another screenshot of BibAdmin performing as a portlet.

Some parts of the PortletBridge-portlet are not complete, such as XHTML support. This limits the number of sites that can be proxied by the PortletBridge. Currently PHP logins are also an issue; the PortletBridge developers have been contacted about this problem and it is under investigation. However, the login issue is not completely relevant to our needs, as it does not fit in with the portal's security framework: a user should automatically be logged into BibAdmin with the appropriate credentials, provided by the portal security framework. This area of research is discussed further in our future work under Section 6.
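The regular-expression link selection described in section 2.3 (which downstream links are routed back through the portlet's proxy and which are left external) can be sketched outside the portlet context. The following Python sketch is purely illustrative: the function name, the proxy URL and the pattern are invented, and this is not PortletBridge code.

```python
import re

# Illustrative sketch of web-clipping link rewriting: links whose URL
# matches proxy_pattern are routed back through the portlet's proxy
# URL; all other links are left external, as in section 2.3.
def rewrite_links(html, proxy_pattern, proxy_prefix):
    pattern = re.compile(proxy_pattern)

    def rewrite(match):
        url = match.group(1)
        if pattern.search(url):
            # Route matching links back through the (hypothetical) proxy.
            return 'href="%s?url=%s"' % (proxy_prefix, url)
        return match.group(0)  # leave non-matching links untouched

    return re.sub(r'href="([^"]+)"', rewrite, html)

page = ('<a href="http://intranet/wiki/Page1">wiki</a> '
        '<a href="http://other.example/x">external</a>')
out = rewrite_links(page, r'^http://intranet/', '/portlet/proxy')
```

In the real portlet the same decision is combined with XSLT transformation of the downstream markup; the sketch only shows the regular-expression half.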
Figure 4: PHPWiki with the PHP-JavaBridge

Portal developers can use the techniques presented not only to provide new and existing tools alongside one another within a common environment, but also to allow applications to be used as a portlet and a Web application simultaneously.
References
K. Harrison
Cavendish Laboratory, University of Cambridge, CB3 0HE, UK
C.L. Tan
School of Physics and Astronomy, University of Birmingham, B15 2TT, UK
D. Liko, A. Maier, J.T. Moscicki
CERN, CH-1211 Geneva 23, Switzerland
U. Egede
Department of Physics, Imperial College London, SW7 2AZ, UK
R.W.L. Jones
Department of Physics, University of Lancaster, LA1 4YB, UK
A. Soroko
Department of Physics, University of Oxford, OX1 3RH, UK
G.N. Patrick
Rutherford Appleton Laboratory, Chilton, Didcot, OX11 0QX, UK
Abstract
Details are presented of GANGA, the Grid user interface being developed to enable large-scale distributed data analysis within High Energy Physics. In contrast to the standard LCG Grid user interface, it makes most of the Grid technicalities transparent. GANGA can also be used as a frontend for smaller batch systems, thus providing a homogeneous environment for data analysis on inhomogeneous resources. As a clear software framework, GANGA offers easy possibilities for extension and customization. We report on current GANGA deployment, and show that by creating additional pluggable modules its functionality can be extended to fit the needs of other grid user communities.
Figure 2: Schematic representation of the GANGA architecture. The main functionality is divided
between Application Manager, Job Manager and Archivist, and is accessed by the Client through the
GANGA Public Interface (GPI). The client can run the Graphical User Interface (GUI), the Command-
Line Interface In Python (CLIP) or GPI scripts.
The local Job Repository is a lightweight database written entirely in Python, so that it can easily be run on any computer platform. It is mainly intended for individual users and provides the possibility of working with GANGA off-line, for example preparing jobs for submission until on-line resources become available. In contrast, the remote Job Repository is a service centrally maintained for the needs of many users. It gives the advantage of job access from any location where an internet connection is present. The current implementation of the remote Job Repository is based on the AMGA metadata interface [12]. It supports secure connection and user authentication and authorization based on Grid proxy certificates.

The performance tests of both the local and remote repositories show good scalability for up to 10 thousand jobs per user, with the average time of job creation being 0.2 and 0.4 seconds for the local and remote repository respectively.

3.4 Plugin modules

GANGA comes with a variety of plugin modules representing different types of applications and backends. These modules are controlled respectively by the Application Manager and the Job Manager. Such a modular design allows the functionality to be extended in an easy way to suit particular needs, as described in Section 6.

4. User view

The user interacts with GANGA through the Client and configuration files.

4.1 Configuration

GANGA has default parameters and behaviour that can be redefined at startup using one or more configuration files, which use the syntax understood by the standard Python [2] ConfigParser module. Configuration files can be introduced at the level of system, group or user, with each successive file processed able to override settings from preceding files. The configuration files allow selection of the Python packages that should be initialised when GANGA starts, and consequently of the modules, classes, objects and functions that are made available through the GPI. They also allow modification of the default values to be used when creating objects from plugin classes, and permit actions such as choosing the log level for messaging, specifying the location of the job repository, and changing certain visual aspects of GANGA.

4.2 Command Line Interface in Python

GANGA's Command Line Interface in Python (CLIP) provides for interactive job definition and submission from an enhanced Python shell, IPython [13], with many convenient features. A user needs to enter only a few commands to set application properties and submit a job to run the application on a chosen backend, and switching from one backend to another is trivial. CLIP includes possibilities for organising jobs in logical folders, for creating job templates, and for exporting jobs to GPI scripts. Exported jobs can be freely edited, shared with others, and/or loaded back into GANGA. The CLIP is especially useful for learning how GANGA works, for one-off job submissions, and -- particularly for developers -- for understanding problems if anything goes wrong. A realistic scenario of full submission of an analysis job in LHCb is illustrated in Fig. 3.

4.3 GPI scripts

GPI scripts allow sequences of commands to be executed in the GANGA environment, and are ideal for automating repetitive tasks. GANGA includes commands that can be used outside of the Python/IPython environment to create GPI scripts containing job definitions; to perform job submission based on these scripts, or on scripts exported from CLIP; to query job progress; and to kill jobs. Working with these commands is similar to working with the commands typically encountered when running jobs on a local batch system, and for users can have the appeal of being immediately familiar.

4.4 Graphical User Interface

The GUI (Fig. 4) aims to further simplify user interaction with GANGA. It is based on the PyQt [14] graphics toolkit and the GPI, and consists of a set of dockable windows and a central job monitoring panel. The main window includes:
• a toolbar, which simplifies access to all the GUI functionality;
• a logical folders organiser implemented in the form of a job tree;
• a job monitoring table, which shows the status of user jobs selected from the job tree;
• a job-details panel, which allows inspection of job definitions.
The Job Builder window has a set of standard tool buttons for job creation and modification. It also has a button for dynamic export of plugin methods, job attribute value entry widgets, and a job attribute tree view. The Scriptor window allows the user to execute arbitrary Python
# Define a dataset
dataLFN = LHCbDataset(files=[
    'LFN:/lhcb/production/DC04/v2/00980000/DST/Presel_00980000_00001212.dst',
    :
    'LFN:/lhcb/production/DC04/v2/00980000/DST/Presel_00980000_00001248.dst'])

Figure 3: An illustration of the set of CLIP commands for submission of a job to perform an analysis in LHCb. While the definition of the application includes elements specific to the LHCb experiment, the definition of the job follows the same structure everywhere.
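The working style that Fig. 3 illustrates (configure a job object, attach an application and a backend, submit, and switch backend by reassigning one attribute) can be imitated with a small stand-alone sketch. The classes below are illustrative stand-ins only, not GANGA's actual GPI:

```python
# Stand-in sketch of the CLIP working style: a job is an object whose
# application and backend are plugin-like attributes, and switching
# backend means reassigning a single attribute. Illustrative only.
class Backend:
    def __init__(self, name):
        self.name = name

    def submit(self, job):
        # A real backend would contact a batch system or the Grid here.
        return "submitted '%s' to %s" % (job.name, self.name)

class Job:
    def __init__(self, name, application=None, backend=None):
        self.name = name
        self.application = application
        self.backend = backend

    def submit(self):
        # Delegate submission to whichever backend is attached.
        return self.backend.submit(self)

j = Job("DaVinci analysis", application="DaVinci", backend=Backend("Local"))
local_result = j.submit()
j.backend = Backend("LCG")   # switching backend is a one-line change
grid_result = j.submit()
```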
scripts and GPI commands, maximising flexibility when working inside the GUI. Messages from GANGA are directed to a log panel.

By default, all windows are shown together in a single frame, but because they are dockable they can be resized and placed according to the tastes of the individual user. Job definition and submission are accomplished through mouse clicks and form completion, with overall functionality similar to CLIP.

5. Deployment

Although still in the development phase, GANGA already has functionality that makes it useful for physics studies. Tutorials for ATLAS and LHCb have been held in the UK and at CERN, and have led to GANGA being tried out by close to 100 people.

Within LHCb, GANGA has seen regular use for distributed analysis since the end of 2005. There are about 30 users of the system at present, and performance has been quite satisfactory. In one test, an analysis which took about 1 s per event ran over 5 million events. For the analysis on the Grid, the job was split into 500 subjobs. It was observed that 95% of the results had returned within less than 4 hours, while the remaining 5% initially failed due to a problem at a CE. These jobs were automatically restarted, and the full result was available within 10 hours. The fraction of the analysis completed as a function of time is illustrated in Fig. 5. Recently more than 30 million events have been processed, with the success rate exceeding 94%. Further details on the specific use of GANGA within LHCb can be found in [15].

In recent months GANGA has found new applications outside the physics community. In particular, it was used for job submission to the EGEE and associated computing grids within the activities towards avian flu drug discovery [16], and for obtaining optimum planning of the available frequency spectrum for the International Telecommunication Union [17].

6. Extensions

Despite the similarities between ATLAS and LHCb, they have some specific requirements and use cases for the Grid interface. In order to deal with these complications, GANGA was designed from the very beginning as a software framework with a pluggable component structure. Due to this architecture, GANGA readily allows extension to support other application types and
Figure 4: Screenshot of the GANGA GUI, showing: a main window (right), displaying job tree,
monitoring panel and job details; and an undocked scriptor window (left).
backends that can be found outside these two experiments. In particular, GANGA could be very useful for any computing task to be executed in a Grid environment that depends on sequential analysis of large datasets.

In practice, plugins associated with a given category of job building block inherit from a common interface class - one of IApplication, IBackend, IDataset, ISplitter and IMerger, as illustrated in Fig. 6. The interface class documents the required methods -- a backend plugin, for example, must have submit and kill methods -- and contains default (often empty) implementations.

The interface classes have a common base class, GangaObject, which provides a persistency mechanism and allows user default values for plugin properties to be set via a configuration
Figure 6: GANGA plugins. Plugins for different types of application, backend, dataset, splitter and
merger inherit from interface classes, which have a common base class. Schemas for the Executable
application and for the LCG backend are shown as examples.
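The inheritance scheme of Fig. 6 can be sketched in Python. The classes below are an illustrative reduction, not GANGA source: only the roles follow the text (a GangaObject-like base, an interface class documenting required methods with default implementations, and a concrete backend plugin declaring its schema), and the attribute names are invented.

```python
class GangaObject:
    # Common base class: in GANGA it provides persistency and
    # configurable defaults; here it only applies a dictionary of
    # user-visible attributes, standing in for the GANGA Schema.
    schema = {}

    def __init__(self, **kwargs):
        # Start from the schema defaults, then apply user overrides.
        for key, value in dict(self.schema, **kwargs).items():
            setattr(self, key, value)

class IBackend(GangaObject):
    # Interface class: documents the methods a backend plugin must
    # provide and supplies default (often empty) implementations.
    def submit(self, job):
        raise NotImplementedError

    def kill(self, job):
        pass  # default implementation: nothing to do

class LCGBackend(IBackend):
    # Concrete plugin: declares its user-visible properties via the
    # schema and implements the required submit method. The property
    # names here are invented for illustration.
    schema = {"CE": "", "requirements": ""}

    def submit(self, job):
        return "job '%s' sent to CE '%s'" % (job, self.CE)

backend = LCGBackend(CE="ce.example.org")
```

Because every plugin carries its schema on the class, a GUI or CLIP layer can enumerate the user-visible properties generically, which is the point the text makes about the GANGA Schema.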
file. Another function of the GangaObject class is to introduce a uniform description of all methods and data members visible to the user. Such a description, called the GANGA Schema, allows GUI and CLIP representations of the plugin classes to be built automatically. Thus a GANGA developer can add a new plugin without special knowledge of the GUI and CLIP frameworks.

7. Conclusions

GANGA is being developed as a Grid user interface for distributed data analysis within the ATLAS and LHCb experiments. In contrast to many existing web portals [18] providing access to the Grid, GANGA is a set of user tools running locally and can therefore offer enhanced functionality. For example, it simplifies configuration of applications based on the GAUDI/ATHENA framework used in ATLAS and LHCb; allows trivial switching between testing on a local batch system and running full-scale analyses on the Grid, hiding Grid technicalities; provides for job splitting and merging; and includes automated job monitoring and output retrieval. GANGA offers possibilities for working in an enhanced Python shell, with scripts, and through a graphical interface.

Although specifically addressing the needs of ATLAS and LHCb for running applications performing large-scale data processing on today's Grid systems, GANGA has a component architecture that can be easily adjusted for future Grid evolutions. Compared with similar Grid user interfaces [19] developed within other High Energy Physics experiments, GANGA has a clear application programming interface that can be exploited as an "engine" for implementing application-specific portals in different science areas. Being written in Python, GANGA can also benefit from simpler integration of new scientific applications compared with Java-based Grid user interfaces such as Espresso [20]. GANGA has been tried out by close to 100 people from ATLAS and LHCb, and a growing number of physicists use it routinely for running analyses on the Grid, with considerable success. Expansion of GANGA outside the High Energy Physics community has recently been reported.

Acknowledgements

We are pleased to acknowledge support for the work on GANGA from GridPP in the UK and from the ARDA group at CERN. GridPP is funded by the UK Particle Physics and Astronomy Research Council (PPARC). ARDA is part of the EGEE project, funded by the European Union under contract number INFSO-RI-508833.

References

[1] http://ganga.web.cern.ch/ganga/
[2] G. van Rossum and F.L. Drake, Jr. (eds.), Python Reference Manual, Release 2.4.3 (Python Software Foundation, 2006); http://www.python.org/
[3] ATLAS Collaboration, ATLAS - Technical Proposal, CERN/LHCC94-43 (1994); http://atlas.web.cern.ch/Atlas/
Simon Peters¹, Pascal Ekin², Anja LeBlanc², Ken Clark¹ and Stephen Pickles²
¹School of Social Sciences and ²Manchester Computing, University of Manchester
2006
Abstract
This article presents and discusses the motivation, methodology and implementation of an e-Social Science pilot
demonstrator project entitled: Grid Enabled Micro-econometric Data Analysis (GEMEDA). This used the National
Grid Service (NGS) to investigate a policy relevant Social Science issue: the welfare of ethnic minority groups in the
United Kingdom. The underlying problem is that of a statistical analysis that uses quantitative data from more than
one source. The application of grid technology to this problem allows one to integrate elements of the required
empirical modelling process: data extraction, data transfer, statistical computation and results presentation, in a
manner that is transparent to a casual user.
1. Introduction

Economic investigation of the experiences and prospects of Britain's ethnic groups paints a picture of multiple deprivation and disadvantage in areas such as earnings and employment. One consequence of this is that the welfare of such groups is of major policy concern. However, in order to evaluate minority welfare, a requirement for successful policy intervention, one needs to be able to produce appropriate statistical quantities, such as poverty measures. The full modelling process required for this goes beyond an analysis using a single data set, and has components that suggest an e-Research approach is appropriate. The motivation for two of these components, data and statistical modelling, is discussed in section 2, with details of the methodology reported in section 3. Section 3 also deals with Grid adoption issues, and the perspective taken is that of the economics discipline, where dedicated e-Science resources, such as equipment and specialised staff, are not readily available. This underlines one of the objectives of the demonstrator: we not only want to show that modern micro-econometric research can be implemented on the Grid, but that it can be done in a manner that builds on existing infrastructure investments (such as the National Grid Service) and develops them. Another component of our modelling process, results visualization, is briefly discussed in section 4 along with other details of the Grid implementation. Section 5 presents a summary of the substantive analysis, and section 6 concludes.

2. Motivation

Social science researchers in the UK have access to a wide range of survey data sets. These are collected by numerous different agencies, including the government, with different purposes in mind, and exhibit considerable variation along a number of dimensions. Only on rare occasions are the needs of social scientists uppermost in the minds of survey designers — hence the topics covered, the sample frames and sizes, the questionnaire formats, the data collection methodologies and the specific questions asked are extremely diverse. In order to obtain a more complete answer to their research questions, researchers frequently have to use more than one data set. The type of analysis possible, however, is often constrained by the fact that each data set possesses one desirable attribute while being deficient elsewhere.

2.1. A Grid Solution for the Data Problem?

The economic welfare of ethnic minority groups in the UK raises data issues that require such a multiple data set approach. The basic problem is that non-whites account for a small proportion of the population, and sample surveys typically yield minority samples that are too small for meaningful results to be obtained. To some extent the situation is improved when Census data are available. However, while these provide relatively large samples of minority individuals and households, they do not contain any direct measures of income. Other surveys do contain such information but have limited sample sizes when minorities are analysed, with
the problem being especially acute when reporting is required for small area geographies.

A major consequence of such data problems is that important questions about the welfare of minority groups have not been answered. For example, small sample sizes preclude useful measures of household welfare such as poverty rates or inequality measures at anything other than high levels of aggregation. Yet research suggests that disaggregation along two dimensions is crucially important when discussing the welfare of Britain's ethnic minority groups. First, it is clear that treating non-white minority groups as a homogeneous entity is not valid. There is considerable diversity between groups such as Caribbeans, Indians, Pakistanis, Bangladeshis and the Chinese (Leslie, 1998; Modood et al., 1997). This diversity is often quantitatively greater than the differences between non-whites, taken as a whole, and the majority white community. The second dimension where aggregation is important is geographical. Britain's ethnic minorities tend to live in co-ethnic clusters, or enclaves, and this clustering has important consequences for economic activity and unemployment (Clark and Drinkwater, 2002).

The Social Science micro-data sets that could be used to address the above problem are not large from an e-Science perspective. They do, however, tend to be messy and difficult to work with. This problem is compounded for repeated samples (variable definitions change), longitudinal samples (records require information from previous waves), and for the data combination approach considered in this article. Grid technology has the potential to integrate the tasks (data extraction, file transfer) associated with processing quantitative data of this type by hosting the information in an appropriate manner on a data grid.

2.2. A Grid Solution for the Modelling Problem?

The empirical analysis, a micro-econometric one that relies upon the combination of two or more data sources, belongs to the broader group of modelling techniques associated with linking data sets. Following the terminology of Chesher and Nesheim (2006), who provide a review of this area, as the data linkage is performed with no or unidentifiable common records between the data sets, it falls into the class known as statistical data fusion.

The essence of the approach is to estimate a statistical model on one data set, the so-called donor sample (a relatively small scale but detailed survey), and then apply elements of the fitted model (predicted responses for data imputation, residuals for simulation) to another data set, the so-called recipient sample (possibly larger-scale but less detailed), taking due account of the statistical issues surrounding both model assumptions and data matching as required, such as the potentially heterogeneous nature of the survey data.

Further, the underlying assumptions associated with standard statistical inference may well be violated in a combined analysis. As a consequence, it may be preferable to calculate statistical items such as poverty measure standard errors using re-sampling methods (variants of statistical bootstrapping). As noted by Doornik et al. (2004), such analyses have a component that allows for so-called embarrassingly parallelisable computations, and as such are well suited to implementation on the high performance computing (HPC) resources available on the NGS.

3. Methods

The initial plan was to extend an existing macro-econometric application, the SAMD project of Russell et al. (2003), to deal with multiple data sources and a different statistical analysis.

3.1. Grid Adoption Issues

A review of current technologies early in the project resulted in the decision not to re-use software from the SAMD project. Although the broad design principles still applied, technology had moved on significantly since the earlier project was conceived. The intervening years have seen several trends become well established. In particular, Web Service technologies have become widely accepted in the e-Science community, the UK e-Science programme has made a significant investment in OGSA-DAI, and there has been a steady shift towards Web portals to avoid difficulties in deploying complex middleware stacks on end-users' computers. Fortunately, sacrificing re-use of SAMD's redundant technology was offset in part by several factors: the advent of the National Grid Service; the advent of the OGSA-DAI software; and the completion of other projects which did provide components that we were able to re-use, such as the Athens authorisation software developed for ConvertGrid (Cole et al., 2006). After due consideration, the availability and increased maturity of the NGS suggested efforts should be concentrated on their systems, namely Oracle for the databases and MPI for parallelization.
3.2. The Data Sources

The project has grid-enabled two data sources, the British Household Panel Survey (the BHPS) and the 1991 Census Samples of Anonymised Records (the SARs). One can now combine data from the smaller-scale, detailed BHPS source with the larger sample sizes and geographical coverage of the Census SARs. This lets us provide poverty measures for ethnic minorities which are both broken down by particular ethnic group and geographically disaggregated.

The decision to use the 1991 data needs some comment. This decision was taken when the project was first proposed. The data needs to be readily available to accredited researchers to allow deployment on a data Grid, and it was felt at the time that, given the confidentiality restrictions envisaged, the so-called Licensed 2001 SARs (the public domain version of the 2001 SARs) would not contain variables suitable for the analysis required. The original release of the Licensed data was not scheduled to contain detailed information on ethnic minorities or on geographical details below regional level. There were also further restrictions (such as the grouping of age) on a variable's response categories. These restrictions are lifted for the Controlled Access (CAMS) version of the 2001 SARs. However, the access and confidentiality constraints imposed on the 2001 CAMS make them presently unusable from a data Grid perspective.

3.3. The Statistical Methodology

This follows an approach in the poverty mapping literature due to Elbers et al. (2003). The version of their methodology that is employed is presented here.

3.3.1. Calculations Using the Survey Data, the Donor Sample

The survey data is used to estimate a model of the economic variable of interest, y_ic, which is income in this study. The index i refers to an individual in a sample cluster, which is indexed by c. Each cluster contains n_c observations and there are N = Σ_{c=1}^{C} n_c observations over the C clusters. The economic variable of interest is specified as

log(y_ic) = β'x_ic + u_ic

where x_ic is a vector of suitably defined explanatory variables and u_ic is composed of cluster and individual idiosyncratic error terms, u_ic = η_c + ε_ic, where η_c and ε_ic are uncorrelated with x_ic, independent of each other, and IID(0, σ²_η) and ID(0, σ²_ic) respectively.¹

First step estimation uses ordinary least squares (OLS) to obtain the coefficient estimates β̂. The second step uses the fitted residuals, û_ic = log(y_ic) − β̂'x_ic, to estimate a model of the variance components. Set² û_ic = û_.c + ê_ic, where ê_ic = û_ic − û_.c, and proceed to model the idiosyncratic heteroscedastic component σ²_ic using a logistic-style transformed equation

ê²_ic / (A − ê²_ic) = α'z_ic + r_ic

where z_ic is a vector of appropriate explanatory variables, A is set to 1.05 max(ê²_ic), and r_ic is a suitable error term. Estimation of the α coefficients is done using OLS.

Once α̂ has been obtained the appropriate prediction for σ²_ic can be calculated. The remaining variance component, σ²_η, can be calculated as σ̂²_η = max(V̂(η_c), 0), where

V̂(η_c) = (1/(C−1)) Σ_c (û_.c − û_..)² − (1/C) Σ_c V(ê_.c)

and

V(ê_.c) = (1/((n_c − 1) n_c)) Σ_{i=1}^{n_c} ê²_ic.

3.3.2. Calculations Using the Census Data, the Recipient Sample

Once the above estimates have been obtained one can impute the economic variable of interest for any given set of comparable explanatory variables x_kc and z_kc. The index k indicates an individual in the Census data source. The Census-based predictions can then be calculated, under an assumption of Normality, as

ŷ_kc = exp(β̂'x_kc + (σ̂²_η + σ̂²_kc)/2).

The variance prediction σ̂²_kc is calculated using α̂'z_kc.

One can also calculate a wide variety of poverty measures. The demonstrator uses the parametric (expected) head count (PHC), simulated head count (SHC), and simulated poverty gap (SPG). The PHC measure requires an assumption of Normality and is calculated as

PHC(p) = (1/n) Σ_{k=1}^{n} Φ((log(p) − β̂'x_kc) / √(σ̂²_η + σ̂²_kc))

where Φ(.) is the standard Normal distribution function and p is a so-called poverty line. Summation is taken over the sub-sample of interest, ethnic group within geographic area.
If the parametric assumption of Normality were incorrect, this would cause misspecification problems that might affect the estimated measures and their associated standard errors. Simulation can be used to counter this possibility and is used for SHC and SPG. The SHC measure is obtained as the average of B simulated head count measures:

SHC(p) = (1/B) Σ_{b=1}^{B} SHC(p)_b

where

SHC(p)_b = (1/n) Σ_{k=1}^{n} I((β̂_b'x_kc + ũ_k,b + ẽ_k,b σ̂_kc,b) < log(p)).

Note that I(.) is an indicator function taking the value of 1 if the condition inside the parentheses is satisfied and zero otherwise. The standard error predictor, σ̂_kc,b, is calculated using α̂_b'z_kc. The coefficients, β̂_b and α̂_b, are obtained from the bth casewise resample of the survey data; the error terms, ũ_k,b and ẽ_k,b, are drawn with replacement from the error vectors ũ_b and ẽ_b.³

The SPG measure is obtained in a similar manner:

SPG(p) = (1/B) Σ_{b=1}^{B} SPG(p)_b

where

SPG(p)_b = (1/n) Σ_{k=1}^{n} I(ŷ_kc,b < p) × (1 − ŷ_kc,b/p).

Here ŷ_kc,b = exp(β̂_b'x_kc + ũ_k,b + ẽ_k,b σ̂_kc,b).

Standard errors are calculated in the usual fashion using the B simulated values of the chosen poverty measure, SPG(p)_b or SHC(p)_b. A simulated version of the PHC(p), PHC(p)_b, is used to calculate its standard error. This is based upon the PHC(p) equation above, but with σ̂²_η, σ̂²_kc, β̂ and α̂ replaced by the values obtained from the bth simulation. Elbers et al. (2003) suggest B can be set to 300.

4. The Grid Implementation

The demonstrator is designed for a researcher who wishes to investigate the welfare of ethnic minority groups in the UK. Specifically it allows the researcher to choose different ethnic groups, to specify a level of geography, and to pick from a limited set of poverty measures. The aim is to provide an easy-to-use (Web-based) interface which allows the user to make choices about the type of analysis to be performed and which then returns the results of that analysis to the user. The details of the actual analysis and associated data management are invisible to the user.

The demonstrator service presently produces poverty measures using individual level data. The demonstrator options are the standard headcount measure and the poverty gap measure. These can be calculated for two possible poverty lines, either 60% of the UK median income or 50% of the UK mean income. The poverty measures are then displayed on a GIS (Geographic Information System) style choropleth map display for UK regional and SARs area geographies. The display presents the poverty measure for the chosen ethnic minority group. The use of individual income data also allows calculation of poverty measures by gender, and, if this option has been chosen for the analysis, the resulting poverty measures can be displayed by ethnic group within gender. The display also produces a box-whisker style plot of the predicted income quantiles⁴ for all the ethnic groups associated with a region. This is available for the whole of the UK and at the level of geography displayed by the map (UK region or SARs area).

Other information, such as standard errors, and the results of the model fitted using the survey data, are available from the supporting files returned by the demonstrator. It is also possible to request classification based upon the pseudo-geography available for the 1991 SARs, although these results are not accessible via the visualization tool. The visualisation applet will not display information at a geographic level if the sample size for an ethnic minority group is deemed small. This mimics the type of confidentiality restrictions applied to the equivalent controlled access 2001 data.

4.1.1. The Web Service Client

The GEMEDA service stands at the heart of the architecture illustrated in Figure 1. It provides the services and generates the HTML sent to the light client (Web browser) through the handling of events (control). The Spring framework container enforces the MVC (Model View Control) design pattern, enforcing the separation of logic from content and greatly encouraging code re-use.

When a user first accesses GEMEDA he/she creates a user account. An Athens username/password combination is required, which is verified on the fly by calling an XML-RPC Athens security interface.

The service downloads a proxy credential automatically through a MyProxy server during each step of the workflow. The proxy credential
lasts 24 hours. The longevity of the proxy credential is verifiable at all times by clicking on the appropriate link. Connection to the front-end (Web client) is through an HTTPS connection.

The service generates separate OGSA-DAI SQL queries targeted at the SARs and BHPS datasets as a result of user input, and uploads the necessary executables to the user's selected HPC computation node if they are not already present. Converted data sources are then uploaded to the HPC node using secure GridFTP.

[Figure 1: The GEMEDA architecture: NGS Oracle hosted datasets (BHPS, SARs Household, SARs Individual) at Manchester, Oxford, Leeds, CCLRC-RAL…; OGSA-DAI; the GEMEDA grid service; the NGS MyProxy server; and NGS HPC computing nodes running the parallelized analysis code.]
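The credential and transfer steps described above (MyProxy retrieval, GridFTP upload) can be illustrated with the standard Globus-era command-line tools. This is a hedged sketch, not the project's workflow code: the host names, user name, and paths are invented placeholders.

```python
# Sketch of the credential and transfer steps using the standard Globus-era
# CLI tools (myproxy-logon, globus-url-copy). Hosts and paths are placeholders.

def myproxy_logon_cmd(server: str, username: str, lifetime_hours: int = 24):
    """Build the command that retrieves a short-lived proxy credential."""
    return ["myproxy-logon", "-s", server, "-l", username,
            "-t", str(lifetime_hours)]

def gridftp_upload_cmd(local_path: str, host: str, remote_path: str):
    """Build the command that uploads a converted data file over GridFTP."""
    return ["globus-url-copy",
            f"file://{local_path}",          # local_path must be absolute
            f"gsiftp://{host}{remote_path}"]

# Placeholder hosts and paths, for illustration only:
login = myproxy_logon_cmd("myproxy.example.ngs.ac.uk", "alice")
upload = gridftp_upload_cmd("/tmp/bhps_extract.dat",
                            "hpc.example.ngs.ac.uk", "/scratch/gemeda/in.dat")
# A real workflow would then run e.g. subprocess.run(login, check=True).
```

Building the argument lists separately from executing them keeps the Grid plumbing testable without live NGS hosts or certificates.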
Once the job is completed, the results are written to file and downloaded through GridFTP by the web service. These results are then converted to XML and spatially mapped by the GEMEDA service before being sent back to the user interface for viewing.

4.1.5. Visualisation

Once the results files have been returned to the GEMEDA service, they may be viewed in the raw or processed by the service's visualisation applet. A C utility converts the raw data into a form (dbf) appropriate for the applet. This is done for both the regional and SARs area geographies. The applet, which runs under Java 1.4 and above, uses this information along with special mapping data files (shp files) to produce choropleth maps at regional and SARs area geographies for the selected ethnic group and gender category. The shp files are obtained from http://edina.ac.uk/ukborders/ and the applet combines these to produce the map, along with the legend, linked plot, and buttons for gender/ethnic group/geography selection. The applet uses the GeoTools 2 Java library to aid the reading and display of the information provided. GeoTools is open source, and was also used by the Hydra I Grid project (Birkin et al., 2005). The data and maps remain on the GEMEDA service's server, and permission to use the mapping information is allowed if a user has Athens authentication.

[Figure 2: A Screenshot of the Visualization Applet.]

The poverty measure displayed using the map's colour scheme is the one chosen by the user at job initiation. The applet allows the user to switch between different linked box-whisker style plots of predicted income quantiles by using the map cursor to point to the geographic area of interest. The map has a zoom facility which is useful when the finer SARs area geographies are displayed. Figure 2 presents a screenshot of the SARs area map for Indian male simulated head count poverty measures using the half mean income poverty line. The linked plot displays predicted income quantiles for all the groups available from the SARs area last pointed to, which was Manchester in this case.

5. A Summary of the Substantive Analysis⁵

Using data from 1991, models of individual income were estimated using the BHPS data separately for males and females. The specification of the regression equation included all of the variables available via the demonstrator.⁶ The results supported splitting the sample by gender, as the signs of the parameters on some of the regressors were different for males and females (e.g. the variables indicating marital status and the presence of children in the household). Full regression results are not presented here; instead we note that the explanatory power of both prediction equations appeared reasonable, 53% and 40% for males and females respectively, though the functional form tests suggest there may be room for improvement in the female equation specification. Heteroscedasticity tests strongly rejected the null of homoscedasticity. This heterogeneity in the variance was modelled using the methodology of Elbers et al. (2003) described in section 3.3 above.

The breakdown of poverty measures by region and ethnic group for males shows considerable diversity across each of these dimensions. In general non-whites have higher poverty measures than Whites, and this conforms to what we know about the higher unemployment rates and lower earnings of ethnic groups in the UK. Some groups, particularly Black Africans, Pakistanis and Bangladeshis, do particularly badly, while the Indians and, to a lesser extent, the Chinese have poverty rates closer to those of Whites. This broad ranking is similar to that in Berthoud (1998). 'Southern' areas of the country generally have lower poverty headcounts than other regions, although it should be noted that we do not correct for regional price differentials here. Some regions have relatively high poverty rates for particular groups, for example Bangladeshis and Pakistanis in the North, Pakistanis in the East Midlands and Black Africans in Yorkshire and Humberside, the
techniques of Ahmad et al. (2005), although this is outside the domain of the authors.

As this was a small project, evaluation of our objectives and usability issues has proceeded in a somewhat ad hoc manner and is still ongoing. One area of concern is the level of security required to access the service. The steps required for obtaining and processing an e-certificate are quite involved, and this, combined with problems of access caused by institutional firewalls, can be off-putting for potential users in the Social Sciences.

Notwithstanding this, and other issues such as compute resource brokerage, we regard the establishment of a critical mass of compatible Grid-enabled datasets and tools as a necessary condition for the success of e-Social Science in the UK, and hope that the GEMEDA service is at least a useful stepping stone towards this goal.

Acknowledgements

Research supported by ESRC grant number RES-149-25-0009, "Grid Enabled Microeconometric Data Analysis". We benefited from discussion with members of the following projects: SAMD (Celia Russell, Mike Jones), ConvertGrid (Keith Cole), Hydra I Grid (Mark Birkin, Andrew Turner), and with locally based NGS staff (Matt Ford). Additional support was provided in the form of an allocation of resources on the NGS itself.

References

Ahmad, K., L. Gillam and D. Cheng (2005), Society Grids, Proceedings of the UK e-Science All Hands Meeting 2005, EPSRC, September 2005.

Berthoud, R. (1998), The Incomes of Ethnic Minorities, ISER Report 98-1, Colchester: University of Essex, Institute for Social and Economic Research.

Birkin, M., P. Dew, O. McFarland and J. Hodrien (2005), HYDRA: A Prototype Grid-enabled Decision Support System, Proceedings of the First International Conference on e-Social Science.

Chesher, A. and L. Nesheim (2006), Review of the Literature on the Statistical Properties of Linked Datasets, DTI Economics Papers, Occasional Paper No. 3.

Clark, K. and S. Drinkwater (2002), Enclaves, neighbourhood effects and economic outcomes: Ethnic minorities in England and Wales, Journal of Population Economics, 15, 5-29.

Cole, K., L. Mason, P. Ekin and J. Maclaren (2006), ConvertGrid, Proceedings of the Second International Conference on e-Social Science.

Doornik, J., N. Shephard and D.F. Hendry (2004), Parallel Computation in Econometrics: A Simplified Approach, Nuffield Economics Working Paper.

Elbers, C., J.O. Lanjouw and P. Lanjouw (2003), Micro-level Estimation of Poverty and Inequality, Econometrica, 355-364.

Leslie, D. (1998), An Investigation of Racial Disadvantage, Manchester University Press, Manchester.

Modood, T., R. Berthoud, J. Lakey, J. Nazroo, P. Smith, S. Virdee and S. Beishon (1997), Ethnic Minorities in Britain: Diversity and Disadvantage, Policy Studies Institute, London.

Russell, C., K. Cole, M.A.S. Jones, S.M. Pickles, M. Riding, K. Roy and M. Sensier (2003), Grid Technology for Social Science: The SAMD Project, IASSIST Quarterly, 27(4).

Endnotes

1. IID: identically and independently distributed. ID: independently distributed.
2. The "." subscript means that an average has been taken over that index.
3. ẽ_b needs to be appropriately standardised.
4. Minimum, 10% quantile, 25% quantile, median, 75% quantile, 90% quantile, maximum.
5. Space restrictions preclude presentation of the regression results, plots, and tables at the regional and SARs levels of geography.
6. Constant, Gender, Age, Age squared, Children present, Marital status, Labour force position, Housing tenure, High qualifications, Immigrant, Region.
7. Results are reported to whole percentages, and one decimal place for the standard error.
8. Enclave refers to GBprofile codes 5, 13, 18, 22, 29 and 33. Poor refers to GBprofile codes 1, 7, 17, 21, 23, 35, 36, 27, 43, 44, 45, 49. The Rest refers to the remaining GBprofile codes.
9. Black-other, Other-asian and Other-other omitted. SHC is the simulated head count. Standard errors are in parentheses.
10. As 9 above.
K.L.L. Tan1,2, V. Gayle1, P.S. Lambert1, R.O. Sinnott3, K.J. Turner2
1. Department of Applied Social Science, University of Stirling
2. Department of Computing Science and Mathematics, University of Stirling
3. National e-Science Centre, University of Glasgow
Abstract
The ESRC funded Grid Enabled Occupational Data Environment (GEODE) project is conceived to
facilitate and virtualise occupational data access through a grid environment. Through GEODE it is
planned that occupational data users from the social sciences can access curated datasets, share
micro datasets, and perform statistical analysis within a secure virtual community. The Michigan
Data Documentation Initiative (DDI) is used to annotate the datasets with social science specific
metadata to provide for better semantics and indexes. GEODE uses the Globus Toolkit and the
Open Grid Service Architecture – Data Access and Integration (OGSA-DAI) software as the Grid
middleware to provide data access and integration. Users access and use occupational data sets
through a GEODE web portal. This portal interfaces with the Grid infrastructure and provides user-
oriented data searches and services. The processing of CAMSIS (Cambridge Social Interaction and
Stratification) measures is used as an illustrative example of how GEODE provides services for
linking occupational information.
This paper provides an overview of the GEODE work and lessons learned in applying Grid
technologies in this domain.
2.4 Usability and accessibility

Many social science researchers are unfamiliar with advanced computer applications. Therefore it is desirable to develop GEODE as an application that can be used with minimal learning and configuration. Though a custom application has been considered, a web portal is much more appealing to the users.

GEODE has developed a web portal as the user interface by which occupational data researchers interact with the grid infrastructure. Through the portal, users can administer their data resources, search the data index, and make requests for statistical services. The portal is accessible via the Internet using standard web browsers. Application users are not bound to specific machines and software in order to perform tasks. This will greatly increase usability in the social science community. The portal was developed with the GridSphere Portal Framework [6], an open-source and widely used tool for portal development.

2.5 Services

The Grid-specific services are built with Globus Toolkit 4 [7]. This WSRF [8] implementation was recently accepted as an OASIS standard. At present, GEODE has developed specific services to make queries to the index service, and to link occupational data to CAMSIS scale scores [9]. As the scope of GEODE evolves, services can be readily implemented and deployed on the Grid and accessed via the portal, drawing on the service-oriented characteristic of Grid services.

2.6 Security

Security is a major concern, especially when sharing data in the virtual community. Globus Toolkit uses GSI [10] to establish security in a Grid environment. GSI offers authentication, authorization, credential delegation, and single sign-on, which GEODE leverages to administer resources, trust and portal service operations. OGSA-DAI makes it possible for resource security to be configured when deployed.

Users delegate their credentials (proxy certificates) to the GEODE services to allow operations to be carried out on their behalf and accounted for.

2.7 Framework Extensibility

Finally, it is highly desirable to make the infrastructure capable of incorporating social data about aspects other than occupation. This benefits social science researchers in other fields, whilst maintaining the same framework, whose scope can be extended to provide data and services for more users. Therefore a generic design of the GEODE infrastructure is required to adapt to non-occupational social data.

3. Architecture

3.1 Overview

The high level architectural design of GEODE is depicted in Figure 1. The dotted box represents the occupational data virtual community that comprises the index service, various data services (G1, G2, O4) and application services (in this case the CAMSIS linking service). Users interact with the grid indirectly through the GEODE portal as their web interface. The individual components and functions are elaborated in detail in this section and in section 3.2.

3.1.1 Index service

The GEODE Index Service is considered to be the main core of the virtual community, as services and resources are first discovered prior to performing the actual operations. It is deployed on the default index service supplied by GT4. This index aggregates registration information from both data and application services, where the respective service metadata is propagated by each registering service. This design is not restricted to a single index, though currently one instance is deemed sufficient.

3.1.2 Data services

Based on OGSA-DAI configurable data services and appropriate drivers, the GEODE data services provide the data abstraction on occupational data to overcome issues of heterogeneity of data sets, including relational tables and text files (comma separated). Data resources are deployed dynamically and register with the index service, where discovery can be made. O1, O2 and O3 are examples of the data resources abstracted by the respective data services. GEODE maintains two data services, G1 and G2 (as shown in Figure 1). G1 virtualises data that are curated locally at Stirling, and G2 is a collection of resources harnessed from a wider international community of social scientists. The resources of G1 and G2 feature the occupational information described in Table A1.

OGSA-DAI provides a framework for automatic derivation of data access metadata, with provisions made for custom metadata also.
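The registration-and-discovery pattern that the index service provides for these data and application services can be caricatured generically. This is an illustrative sketch, not the GT4 index service API: the class and criteria names are ours, and real registrations carry WSRF resource properties rather than plain dictionaries.

```python
# Generic sketch of the registration/discovery pattern: services push their
# metadata to a central index, and clients query the index before invoking
# any service. Names and metadata keys are invented for illustration.

class IndexService:
    def __init__(self):
        self._registry = {}                    # service name -> metadata

    def register(self, name, metadata):
        """Called by each data/application service on deployment."""
        self._registry[name] = dict(metadata)

    def discover(self, **criteria):
        """Return names of services whose metadata matches all criteria."""
        return [name for name, md in self._registry.items()
                if all(md.get(k) == v for k, v in criteria.items())]

index = IndexService()
index.register("G1", {"kind": "data", "curated_at": "Stirling"})
index.register("G2", {"kind": "data", "curated_at": "external"})
index.register("camsis-linker", {"kind": "application"})

data_services = index.discover(kind="data")
```

The same shape explains why the design "is not restricted to a single index": nothing in the pattern prevents services from registering with several index instances.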
The DDI schema is used to annotate the data resource with social science metadata, which upon registration is aggregated within the index service alongside the data access metadata.

The GEODE architecture is scalable. An arbitrary number of data services can be deployed (intuitively on different machines/sites) to handle the vast amount of virtualised datasets, both internal and external to GEODE. Users may choose to provide data using their own OGSA-DAI data services (depicted as O4 in Figure 1), considered as external services that register with the index to be made available to all. There can be an arbitrary number of external data services like O4 registering to the GEODE grid, but for illustration purposes only one (O4) is shown. Users do not interact with the data services directly, but rather do so via the application services, which in turn are invoked from the portal.

3.1.3 Application services

The application services are grid services developed using GT4 to provide specific data analysis functions on the data resources. The application services access the data resources and render specified results to the user via the portal (user interface).

Authorization will be verified prior to accessing the services. This is where the GSI is involved to set up a security context with the users and to perform operations on their behalf (via credential delegation).

At the initial stage, only the CAMSIS score linking service will be implemented. As the Grid services are based on Service-Oriented Architecture (SOA), services can be developed and deployed into the Grid with minimal effort.

3.2 Portal

The GEODE portal provides the web interface to users by which operations and functions are invoked by proxy on the Grid services. GridSphere is used to develop the components that make up the entire portal, namely the presentation view, presentation logic, and the application logic. The view is implemented with JSP and the logic with portlets that control the presentation flow.

The emphasis is on the application logic, developed as a portlet service, which interfaces with the Grid environment. The portlet service invokes the operations of the Grid services and returns results to the presentation logic. GEODE follows the Model-View-Controller design pattern, which the occupational data Grid (model), presentation (view) and portlet service (controller) represent.

3.3 Extensibility

The GEODE architecture is designed to be as generic as possible. One of the most promising benefits is to be able to apply or extend the structure towards other social science statistical
indexing service with the initial specified metadata. The metadata, when altered, is reflected into the indexing service. Presently the metadata that can be altered is the list of comma-separated value files that are accessible via HTTP. When this is changed, corresponding updates to the schema of the data resource that represents it occur.

The GEODE portal was developed with a user interface that interacts with the indexing service and G1. The views (portlets) accept the user's input, which the portlet services use to communicate with the indexing service and G1. Figures 2 to 5 show screenshots of the GEODE prototype portal. Figure 2 illustrates the portal querying the indexing service for all G1 services and displaying them as a list. The portal
is also able to list the data resources deployed on selected G1 services, as seen in Figure 3. It is now possible to deploy (see Figure 4) and undeploy (undeploy button in Figure 5) data resources, and to modify the metadata via the portal. In addition, the portal is able to query the indexing service for data resources registered with G1 and to display the metadata of selected data resources. Checks are put in place to guard against the issues listed in Section 3.4.

4.2 Project progress

DDI has been incorporated manually and tested successfully in registering with, and retrieving from, the indexing service. GEODE is currently developing the portal interface and data service activities to manage semantic definitions interactively. The requirements for linking data resources to CAMSIS scores will be examined in detail, and implementation will commence thereafter.

GSI and credential delegation will be assessed and implemented once the functional requirements of GEODE are finalised. To simplify the complexities of client-side GSI set-up, creating proxy certificates and delegating credentials, the Web Start version of the CoG (Commodity Grid) kit [12][13] will be investigated. Java Web Start [14] allows applications to be deployed and launched with a single click from a Web browser, thus omitting complicated and specific installations for GEODE users. This allows researchers to utilise GEODE on other machines instead of being constrained to their own machines.

Prospective users have been identified and will engage in an assessment of the GEODE prototype when it is ready. This will gather valuable feedback to help in the development of a useful GEODE portal.

5. Conclusion

5.1 Benefits of GEODE

The prototype has illustrated that the design and implementation provide a viable and scalable framework for GEODE and the community members. GEODE and its design can, in principle, be extended and applied to non-occupational social science data.

GEODE encourages good occupational data utilisation via the portal for occupational information exchange and services, and for having the datasets semantically annotated. Occupational data researchers have a common channel for dataset publication that improves data quality definitions, usability and publicity. The complexity of accessing CAMSIS stratification scale scores will be greatly reduced as a result of using the CAMSIS data linkage service. Likewise, other
occupational information linkage services can be applied to meet other user requirements.

5.2 Future work

The possibilities for extending GEODE are great. For example, researchers can create collaborations as resources which authorised members of the community can use, operate on, and share. Datasets in XML format can be virtualised readily with OGSA-DAI as the middleware, although this is not required in the scope of GEODE. Analytical, and often time-consuming, statistical services can be developed and deployed to the Grid. There is a substantial user community in the social sciences who would benefit from utilising the GEODE services. Additionally, GEODE can be extended to incorporate non-occupational data and services. Ideally, GEODE can be established as the portal for a wide range of social science data.

Appendix

Table A1: Selected Occupational Information Datasets used in GEODE

1. CAMSIS indexes, www.camsis.stir.ac.uk/versions.html
   Format: Index matrix, SPSS and plain text data files
   Units: Variety of national OUGs, plus gender, employment status
   Output: CAMSIS scale scores

2. CAMSIS occupational information value labels, www.camsis.stir.ac.uk/occunits/distribution.html
   Format: One-to-one translation, SPSS and plain text files
   Units: Variety of national OUGs
   Output: Text labels to numerical OUG codes

3. ISEI tools, home.fsw.vu.nl/~ganzeboom/pisa/
   Format: One-to-one translation, SPSS and plain text files
   Units: ISCO-88 and ISCO-68 international OUG schemes
   Output: ISEI and SIOPS occupational scale scores

4. E-SEC matrices, www.iser.essex.ac.uk/esec
   Format: Index matrix, MS-Excel and SPSS syntax
   Units: ISCO-88 OUG, and employment status (es) data
   Output: E-SEC class position for OUG-es combination

5. Hakim gender segregation codes (Hakim, C. 1998, Social Change and Innovation, Oxford University Press, pp. 266-90)
   Format: One-to-one translation, paper printout
   Units: ISCO-88 and UK SOC90 OUG schemes
   Output: Gender segregation information for OUG codes

References

[1] P.S. Lambert, "Handling Occupational Information", Building Research Capacity, pp. 4:9-12, Nov. 2002.
[2] I. Foster, C. Kesselman, S. Tuecke, "The Anatomy of the Grid: Enabling Scalable Virtual Organizations", International J. Supercomputer Applications, 15(3), 2001.
[3] P.S. Lambert, V. Gayle, K. Prandy, R.O. Sinnott, K.L.L. Tan, K.J. Turner, GEODE: Grid Enabled Occupational Data Environment, http://www.geode.stir.ac.uk, Oct. 2005.
[4] OGSA-DAI, Open Grid Service Architecture – Data Access and Integration, http://www.ogsadai.org.uk, Feb. 2006.
[5] DDI, Data Documentation Initiative, http://www.icpsr.umich.edu/DDI/, Feb. 2006.
[6] Gridsphere Portal Framework, http://www.gridsphere.org, Dec. 2005.
[7] I. Foster, "Globus Toolkit Version 4: Software for Service-Oriented Systems", IFIP International Conference on Network and Parallel Computing, Springer-Verlag LNCS 3779, pp. 2-13, 2005.
[8] Web Services Resource Framework (WSRF) v1.2 Specification.
[9] P.S. Lambert and K. Prandy, CAMSIS: Cambridge Social Interaction and Stratification scales, http://www.camsis.stir.ac.uk/, Aug. 2005.
[10] The Globus Alliance, Grid Security Infrastructure, http://www.globus.org/toolkit/docs/4.0/security/key-index.html, Apr. 2006.
[11] J.M. Schopf, "Monitoring and Discovery in a Web Services Framework: Functionality and Performance of the Globus Toolkit's MDS4", http://www-unix.mcs.anl.gov/~schopf/Pubs/mds-sc05.pdf, Apr. 2006.
[12] Java CoG Kit – Webstart Applications, http://www.cogkit.org/release/4_1_2/webstart/, Feb. 2006.
[13] Java CoG Kit, http://www.cogkit.org/, Feb. 2006.
[14] Java Web Start Overview White Paper, http://java.sun.com/developer/technicalArticles/WebServices/JWS_2/JWS_White_Paper.pdf, May 2005.
[15] P.S. Lambert, K.L.L. Tan, V. Gayle, K. Prandy, R.O. Sinnott, "Developing a Grid Enabled Occupational Data Environment", to appear in Proc. Second International Conference on e-Social Science, Manchester, UK, June 2006.
[16] The Globus Alliance, GT4.0 Credential Management: MyProxy, http://www.globus.org/toolkit/docs/4.0/security/myproxy/, Apr. 2006.
[17] The Globus Alliance, GT4.0 Security: Delegation Service, http://www.globus.org/toolkit/docs/4.0/security/delegation/, Apr. 2006.
Rob Procter1, Mike Batty2, Mark Birkin3, Rob Crouchley4, William H. Dutton5, Pete
Edwards6, Mike Fraser7, Peter Halfpenny8, Yuwei Lin1 and Tom Rodden9
1 National Centre for e-Social Science, University of Manchester
2 GeoVUE, Centre for Advanced Spatial Analysis, UCL
3 MoSeS, School of Geography, University of Leeds
4 CQeSS, Centre for e-Science, University of Lancaster
5 OeSS, Oxford Internet Institute, University of Oxford
6 PolicyGrid, Department of Computing Science, University of Aberdeen
7 MiMeG, Department of Computer Science, University of Bristol
8 National Centre for e-Social Science, University of Manchester
9 DReSS, Department of Computer Science, Nottingham University
Abstract
This paper outlines the work of the UK National Centre for e-Social Science and its
plans for facilitating the take-up of Grid infrastructure and tools within the social science
research community. It describes the kinds of social science research challenges to
which Grid technologies and tools have been applied, the contribution that NCeSS can
make to support the wider take-up of e-Science and, finally, its plans for future work.
potential relevance well beyond the social sciences, and we will return to this later.

2.1 Collaboratory for Quantitative e-Social Science (CQeSS)

The overall aim of CQeSS is to ensure the effective development and use of Grid-enabled quantitative methods. Part of the focus of CQeSS is on developing middleware that allows users to exploit Grid resources such as datasets while continuing to employ their favourite desktop analysis tools [6].

As e-Social Science develops, there will be a growing user base of social researchers who are keen to share resources and applications in order to tackle some of the large-scale research challenges that confront us. They will be aware of the potential of e-Science technology to provide collaborative tools and access to distributed computing resources and data. However, social scientists are not ideally catered for by the current Grid middleware, and often lack the extensive programming skills needed to use the current infrastructure to the full and to adapt their existing "heritage" applications. This problem prompted the Grid Technology Group at CCLRC Daresbury Laboratory to write the prototype GROWL (Grid Resources On Workstation Library) toolkit,¹ which provides an easy-to-use client-side interface. CQeSS is now seeking to apply the GROWL middleware to wrap the statistical modeling methods available in SABRE, as well as other computationally demanding models, and to make these developments available in the distributed environment as a componentised R library and as a Stata "plug-in".

2.2 New Forms of Digital Record for e-Social Science (DReSS)

DReSS seeks to understand how new forms of record may emerge from and for e-Social Science. Specifically, it seeks to explore the development of new tools for capturing, replaying, and analyzing social science records [3,4,5,12]. Social scientists work in close partnership with computer scientists on three Driver Projects to develop e-Social Science applications demonstrating the salience of new forms of digital record.

Development work within the Driver Projects focuses on the assembly of qualitative records, the structuring of assembled records, and the coupling of qualitative and quantitative records. The research is underpinned by three key themes – record, re-representation, and replay – to ensure the development of services that have some general purchase and utility. Work to date has seen the development of a dedicated ReplayTool² which supports the use of records containing multiple forms of data from multiple sources, allows data to be represented and re-represented in different ways to support different kinds of analysis, and enables researchers to replay digital records to support analysis. Current work seeks to implement semantic web ontologies to support structured forms of analysis, and is also exploring the potential of vision recognition and text mining techniques to support automatic coding of large datasets. The work of the Node is driven by an iterative user-centred prototyping approach to demonstrate the salience of new forms of digital record to future social science research.

2.3 Modelling and Simulation for e-Social Science (MoSeS)

MoSeS will provide a suite of modeling and simulation tools which will be thoroughly grounded in a series of well-defined policy scenarios. The scenarios will be validated by both social scientists and non-academic users [1]. MoSeS is particularly focused on policy applications within the domains of health care, transport planning and public finance. Social science problems of this type are characterized by a requirement for extensive data integration, the need for multiple iterations of computationally intensive scenarios, and a collaborative approach to problem-solving.

The key objective is to develop a representation of the entire UK population as individuals and households, at present and in the future, together with a package of modeling tools which allows specific research and policy questions to be addressed. The advances it will seek to achieve include: the creation of a dynamic, real-time, individually-based demographic forecasting model; for defined policy scenarios, facilitating the integration of data and reporting services, including Geographical Information Systems (GIS), with modeling, forecasting and optimisation tools, based on a secure Grid services architecture; using hybrid agent-based simulations to articulate the connections between individual-level and structural change in social systems; and providing a high-level capability for the articulation of unique evidence-based user scenarios for social research and policy analysis.

¹ http://www.growl.org.uk
² www.ncess.ac.uk/nodes/digitalrecord/replaytool/
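The dynamic, individually-based population model that MoSeS describes can be illustrated in miniature with a toy microsimulation. The update rules and all rates below are invented for illustration; they are not MoSeS code or MoSeS estimates.

```python
import random

random.seed(42)

def death_prob(age):
    """Illustrative mortality hazard, rising with age (assumed, not estimated)."""
    return min(1.0, 0.0005 * 1.09 ** age)

def birth_prob(person):
    """Illustrative fertility: women aged 20-39 (assumed rate)."""
    return 0.06 if person["sex"] == "F" and 20 <= person["age"] < 40 else 0.0

def step_year(population):
    """Advance the synthetic population by one simulated year."""
    survivors = []
    for p in population:
        if random.random() < death_prob(p["age"]):
            continue                              # death removes the individual
        p = {"sex": p["sex"], "age": p["age"] + 1}
        survivors.append(p)
        if random.random() < birth_prob(p):       # birth adds a newborn
            survivors.append({"sex": random.choice("MF"), "age": 0})
    return survivors

# Start from a synthetic baseline population and project ten years forward.
pop = [{"sex": random.choice("MF"), "age": random.randint(0, 80)}
       for _ in range(10_000)]
for year in range(10):
    pop = step_year(pop)
print(len(pop), "individuals after 10 years")
```

A real model of this kind would replace the assumed hazards with estimated age-specific rates and add household structure, migration and policy levers.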
2.4 Semantic Grid Tools for Rural Policy Development and Appraisal (PolicyGrid)

PolicyGrid brings together social scientists with interests in rural policy development and appraisal with computer scientists who have experience in Grid and Semantic Web technologies. The core objective is to explore how Semantic Grid tools [10] can support social scientists and policy makers who increasingly use mixed-method approaches combining qualitative and quantitative research techniques (e.g., surveys and interviews, ethnography, case studies, simulations).

Provision of metadata infrastructure in this context (to support annotation and sharing of resources) presents many challenges, including the dynamic and contested nature of many concepts within the social sciences, the need to align with existing thesauri (where those exist) and the need to support open, community-based efforts. The Node is thus exploring so-called "folktology" solutions which exploit both lightweight ontology and folksonomy (social tagging) based approaches [13]. Another activity is investigating the use of argumentation approaches [14] as a means of facilitating evidence-based policy making; argument structures provide a mechanism for linking qualitative analyses with other resources, including simulation experiments and queries over Grid-enabled data services. Enhanced support for social simulation on the Grid [15] is another focus, with a framework for characterising simulation models under development, as well as workflow support.

2.5 Mixed Media Grid (MiMeG)

MiMeG is developing tools to support distributed, collaborative video analysis [11]. The project arises from converging developments in contemporary social science. Digital video is becoming an invaluable tool for social and cognitive scientists to capture and analyse a wide range of social action and interactions; video-based research is increasingly undertaken by research teams distributed across institutions in the UK, Europe and worldwide; and there is little existing support for remote and real-time discussion and analysis of video data for social scientists. Therefore MiMeG has taken as a central concern how to design tools and technologies to support remote, collaborative and real-time analysis of video materials and related data. In developing these tools the project has the potential to support video-based 'collaboratories' amongst research teams distributed across institutions, disciplines and locations. We are undertaking studies of research and analytic practice to develop requirements for the design of these tools, including detailed qualitative studies of current practice in fields of video-based research in both social scientific and professional communities.

MiMeG has built prototype software that can connect collaborators, allowing them to see, discuss and annotate video together in real time and from remote locations. This software is now freely available under a GPL licence in PC and Mac versions and allows real-time analysis of video data between two or more remote sites. The software also enables each site to see the others' notes, transcripts, drawings etc. on the video they are watching, and to talk to each other at the same time. We are now beginning development of a second version of the software, which will begin to consider more innovative ways of capturing and representing communicative conduct in remote data sessions. MiMeG has also collected wide-ranging data on existing video-based research practice across the social and cognitive sciences, and on the work of video analysts working in other occupations. Now that the prototype tools are beginning to be adopted, we are undertaking a series of studies of their use by social scientists as part of their everyday research activities.

2.6 Geographic Virtual Urban Environments (GeoVUE)

GeoVUE will provide Grid-enabled virtual environments within which users are able to link spatial data to geographic information systems (GIS)-related software relevant to a variety of scientific and design problems. It will provide decision support for a range of users, from academics and professionals involved in furthering our geographic understanding of cities to planners and urban designers who require detailed socio-economic data in plan preparation. At the heart of GeoVUE lies the concept of a VUE, which represents a particular way of looking at spatial data with respect to the requirements of different kinds of user, thus defining a virtual organisation relevant to the problem in hand. GeoVUE is developing three demonstrators of increasing complexity and sophistication which will form part of a structured process of sequential development towards the ultimate aim of producing a generic framework.
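The "folktology" idea that PolicyGrid describes — free social tagging backed by a lightweight term hierarchy — can be sketched as follows. The vocabulary, resource names and functions here are illustrative assumptions, not PolicyGrid's actual design.

```python
from collections import defaultdict

# Tiny lightweight ontology: each term maps to its broader term.
# This hierarchy is invented for illustration.
BROADER = {
    "case-study": "qualitative-method",
    "interview": "qualitative-method",
    "survey": "quantitative-method",
    "simulation": "quantitative-method",
}

index = defaultdict(set)  # tag -> set of resource ids

def annotate(resource_id, tags):
    """Record free-text (folksonomy) tags; also index inferred broader terms."""
    for tag in tags:
        index[tag].add(resource_id)
        broader = BROADER.get(tag)
        while broader:                     # walk up the term hierarchy
            index[broader].add(resource_id)
            broader = BROADER.get(broader)

def search(tag):
    """Return resources annotated with the tag or any narrower term."""
    return sorted(index.get(tag, set()))

annotate("doc1", ["interview", "rural-policy"])   # free tags from a user
annotate("doc2", ["survey"])
print(search("qualitative-method"))               # finds doc1 via the ontology
```

The point of the hybrid is visible in the last line: a resource tagged only with the folksonomy term "interview" is still retrievable through the controlled term "qualitative-method".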
2.7 Oxford e-Social Science Project (OeSS)

OeSS focuses on the inter-related social, institutional, ethical, legal and other issues surrounding e-Science infrastructures and research practices. The design and use of advanced Internet and Grid technologies in the social, natural and computer sciences are likely to reconfigure not only how researchers obtain and provide data resources and other information, but also what they and the public can access and know; not only how they collaborate, but with whom they collaborate; not only what computer-based services they use, but from whom they obtain services [8]. This reconfiguring of access to a wide variety of resources raises numerous issues, including ethical concerns (e.g., confidentiality, anonymity, informed consent), legal uncertainties (e.g., privacy and data protection, liability), social issues (ownership, trust, credit), and institutional issues (IPR and risk in multi-institutional collaboration) [7,9,16]. OeSS assembled a multi-disciplinary team to analyze the e-Sciences in the UK and globally, focusing on a set of in-depth case studies to uncover the dynamics of the ethical, legal and institutional issues that facilitate or constrain the use of e-Science data, tools and other resources, and to shape initiatives to address them.

3. Social Shaping

Social shaping is defined very broadly within the NCeSS programme to include all social and economic aspects of the genesis, implementation, use, usability, immediate effects and longer-term impacts of the new technologies. Despite the very substantial current investment in the Grid, little is known about the nature and extent of take-up, about how, why and by whom these new technologies are being adopted, or about their likely effects on the character and conduct of future scientific research, including social scientific research.

While OeSS is the only NCeSS node which has social shaping research as its principal aim, a number of them address this as a subsidiary aim, and it is also a focus for a number of the Small Grant projects. The 'entangled data' project (Essex University), for example, has been conducting a comparative study to understand how groups of research scientists collaborate using shared digital data sources, developing insights into the likely use and non-use of e-Science technologies, and the social and technological innovations that may be required as e-Science expands from its early adopters [2].

There are also important questions to be answered about how Grid-based tools might be adapted to make them more usable. Members of NCeSS have been active in organising workshops on usability issues for e-Science.³ Finally, the Hub is working with the UK e-Science Institute to investigate how barriers to the wider take-up of Grid technologies can be tackled.

4. Developing e-Social Science

Much of what is presently understood about e-Social Science is an extrapolation of existing practice, and is focused around areas where it is believed the Grid can address known limitations of research methods.

Computer-based modeling and simulation is a well-established social science research tool. As models get more complex, they need more computing power, and there are similar benefits to be had from the Grid for statistics-based research generally.

The sharing and re-use of data is already well established in the social sciences, but heterogeneity in formats means that linking different datasets together can still prove very difficult. In partnership with UK data centres, NCeSS has begun pilot studies of 'Grid-enabling' selected datasets.

The vast amounts of data generated as people go about their daily activities are, as yet, barely exploited for research purposes. For example, use of public services is captured in administrative records; in the private sector, patterns of consumption of goods and services are captured in credit and debit card records; patterns of movement are logged by sensors such as traffic cameras, satellites and mobile phones; and the movement of goods is increasingly tracked by devices such as RFID tags. Exploiting these data sources to their full research potential requires new mechanisms for ensuring secure and confidential access to sensitive data, and new tools for integration, structuring, visualisation and analysis.

How e-Social Science will develop in the longer term will become clearer as researchers explore and experiment with the opportunities the Grid provides. To facilitate this process, NCeSS has been organizing a series of Agenda

³ See, for example, www.nesc.ac.uk/action/esi/contribution.cfm?Title=613 and www.ncess.ac.uk/support/chi/index.php/Main_Page
Setting Workshops (ASWs)⁴ to which social scientists are invited to hear about opportunities for using the Grid in research and to reflect on how these technologies might address the obstacles. Nine ASWs have been held over the past eighteen months and more are planned.

4.1 Widening Engagement

The ASWs have helped to identify new areas for the application of Grid technologies in the social sciences, and community needs. For example, a theme emerging from several of the ASWs is that Grid-enabled datasets, services and tools are key enablers for the wider take-up of e-Social Science. They have also enabled NCeSS to profile the social science research constituency in terms of its awareness of, and readiness to adopt, new technologies. We have used these findings to inform the NCeSS strategy for widening the community's engagement with e-Social Science.

Our social science researcher profile identifies three distinct communities: the 'early adopters', who are keen to push to the limit of what is possible; the 'interested', who will adopt new research tools and services if they believe these will provide simple ways of advancing their research; and the 'uncommitted', who have yet to appreciate the relevance of Grid technologies for their research.

Those early adopters who have not been recruited into the NCeSS programme will demand research tools they can apply. Tools are also needed to convert the interested into adopters, and demonstrators are needed to convince the uncommitted to join the ranks of the interested.

To address the needs of these communities, NCeSS has begun to build an e-Infrastructure for the social sciences. To engage with the uncommitted, NCeSS will deploy a selection of demonstrators from those being developed within the Nodes (e.g., GeoVUE visualisation tools), the NCeSS Small Grants, and demonstrators developed by the e-Social Science PDPs. More generally, NCeSS will provide a platform for disseminating the benefits of e-Social Science to the wider research community, leverage existing e-Social Science and e-Science investments, and ensure the usability and sustainability of middleware, services and tools.

4.2 Solving Real Social Science Problems

The diffusion of any innovation, including e-Social Science tools and practices, will be shaped by the degree to which it can address the problems of potential users. NCeSS is developing an increasingly concrete understanding of the applications of e-Science in addressing social science problems. The following examples illustrate how social scientists can combine its component parts – i.e., datasets, services and tools – in flexible yet powerful ways to overcome various kinds of problems they face in pursuing their research.

For example, solving complex statistical modeling problems often involves multiple steps and iterations where the output of one step is used as the input to the next: 1) select data or subsets, maybe from more than one source; 2) merge, harvest or fuse; 3) input to model and analyse; 4) repeat the previous cycle with new or different data; 5) repeat with different model assumptions, parameters and possibly data; 6) synthesise outputs from multiple models or world views; 7) load output into a different analysis or visualisation tool.

Managing these steps manually is potentially difficult, and performing the data integration and modeling using desktop PC tools may be very time consuming. For example, analysis of work histories requires many different sources of data which need to be reconciled and integrated in order to produce a coherent and contextualised life or work history, and takes one week on a desktop PC in Stata. To then simultaneously analyse the data could take months on a desktop PC running serial SABRE, and over ten years using Stata. This presents a major obstacle to research. Using Grid-enabled data, analysis and workflow tools, however, the researcher can compose complex sequences of steps using different computational and data resources, and execute them (semi-)automatically using powerful computers in a comparatively short time.

Schelling's model of segregation provides a way of exploring how the spatial distribution of social groups ('agents') responds to different assumptions about neighbour preferences. For example, a social scientist interested in racial segregation could use it to explore questions such as:
• the sensitivity of the model to the initial arrangement of agents;
• the relationship between the ratio of different 'races' and the final segregation ratio.

To do this, the researcher might need to run simulations for 50 different values of the initial race ratio and 100 different initial agent arrangements. Using a desktop PC, 5,000 simulations would take about one week, which may well make the study impractical. If the researcher

⁴ Funded by the ESRC/JISC ReDReSS project.
has access to Grid computational resources, however, the same number of simulations can be run in about five minutes, opening up new possibilities for the researcher.
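A single run of the kind of Schelling simulation described above, together with a small sweep over the initial group ratio, can be sketched as follows. This is a minimal toy implementation for illustration; the grid size, thresholds and sweep values are assumptions, and a real study would use the full 50 x 100 design on Grid resources.

```python
import random

random.seed(0)

def run_schelling(n=20, frac_red=0.5, empty=0.1, threshold=0.3, steps=50):
    """Minimal Schelling model on an n x n torus; returns mean segregation,
    measured as the average fraction of like-coloured neighbours."""
    cells = [None] * int(n * n * empty)          # empty sites
    occupied = n * n - len(cells)
    cells += ["R"] * int(occupied * frac_red)    # 'red' agents
    cells += ["B"] * (n * n - len(cells))        # 'blue' agents
    random.shuffle(cells)
    grid = [cells[i * n:(i + 1) * n] for i in range(n)]

    def neighbours(i, j):
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                if (di, dj) != (0, 0):
                    yield grid[(i + di) % n][(j + dj) % n]

    def unhappy(i, j):
        me = grid[i][j]
        if me is None:
            return False
        same = sum(nb == me for nb in neighbours(i, j))
        occ = sum(nb is not None for nb in neighbours(i, j))
        return occ > 0 and same / occ < threshold

    for _ in range(steps):
        movers = [(i, j) for i in range(n) for j in range(n) if unhappy(i, j)]
        empties = [(i, j) for i in range(n) for j in range(n) if grid[i][j] is None]
        random.shuffle(movers)
        for (i, j) in movers:                    # unhappy agents move to a random empty site
            if not empties:
                break
            ei, ej = empties.pop(random.randrange(len(empties)))
            grid[ei][ej], grid[i][j] = grid[i][j], None
            empties.append((i, j))

    scores = []
    for i in range(n):
        for j in range(n):
            if grid[i][j] is not None:
                occ = [nb for nb in neighbours(i, j) if nb is not None]
                if occ:
                    scores.append(sum(nb == grid[i][j] for nb in occ) / len(occ))
    return sum(scores) / len(scores)

# A small sweep over the initial red/blue ratio, standing in for the 50 x 100 design.
for frac in (0.3, 0.5, 0.7):
    print(frac, round(run_schelling(frac_red=frac), 2))
```

Each sweep point is independent, which is exactly why this workload parallelises so well on Grid resources: the 5,000 runs can be farmed out and the segregation scores gathered afterwards.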
[Figure: the NCeSS e-Infrastructure — user portals, existing desktop tools and web-based interfaces connected via the GROWL Web Service to a tools repository, a metadata registry, research tools and demonstrators, common tools and services, and Grid-enabled datasets, backed by data archives and the NGS.]
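The seven-step modelling cycle described in Section 4.2 can be sketched as a scripted workflow. The data sources, the "model" and all numbers below are illustrative assumptions; in practice each step would call a Grid-enabled data or analysis service.

```python
# Sketch of the seven-step modelling cycle as a small scripted workflow.
from statistics import mean

# Illustrative stand-ins for two distributed data sources.
source_a = [{"id": 1, "income": 21000}, {"id": 2, "income": 34000}]
source_b = [{"id": 1, "age": 41}, {"id": 2, "age": 29}]

def select(rows, keep):                       # 1) select data or subsets
    return [r for r in rows if keep(r)]

def merge(a, b, key="id"):                    # 2) merge / fuse on a shared key
    by_key = {r[key]: dict(r) for r in a}
    for r in b:
        by_key.setdefault(r[key], {}).update(r)
    return list(by_key.values())

def model(rows, scale=1.0):                   # 3) input to model and analyse
    return scale * mean(r["income"] for r in rows)

runs = []
data = merge(select(source_a, lambda r: r["income"] > 0), source_b)
for scale in (0.9, 1.0, 1.1):                 # 4-5) repeat with new data or
    runs.append(model(data, scale))           #      different model assumptions
summary = mean(runs)                          # 6) synthesise model outputs
print("mean modelled income:", summary)       # 7) hand off to analysis/visualisation
```

A workflow engine on the Grid would run steps like these from a declarative description, dispatching the independent model runs to remote computational resources rather than looping on the desktop.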
A range of research tools and demonstrators selected from those being developed within the NCeSS programme will be deployed, with selection criteria informed by ongoing consultations with the wider social science research community. Agenda Setting Workshops have already enabled us to identify simulation and modeling as a priority area. Building on the work of MoSeS and PolicyGrid, simulation and modeling tools will be deployed which will enable researchers to create their own workflows, run their own simulations, visualise and analyse results, and archive them for future comparison and re-use (see Figure 2). This will facilitate the development and sharing of social simulation resources within the UK social science community, encourage cooperation between model developers and researchers, and help foster the adoption of simulation as a research method in the social sciences.

A set of common tools and services (e.g., workflow, visualisation) will also be made available. Our approach to e-Infrastructure building does not preclude local, community-level deployments of tools and services. Indeed, the NCeSS programme is uncovering requirements for usability, control and trust in qualitative research that may be more appropriately served through localised provision of tools and data. For example, there may be issues with the distribution of sensitive video data to (and via) un-trusted third parties. Development of desktop tools and community-oriented service deployments is part of the ongoing NCeSS e-Infrastructure strategy, particularly in supporting qualitative research. MiMeG and DReSS are building new tools and have a remit within their existing work programmes to build capacity and deploy to interested communities.

A range of Grid-enabled datasets are being made available (quantitative and qualitative), including datasets already Grid-enabled by the PDPs or being Grid-enabled by UK data centres. Others will be selected after consultation: with the Nodes and Small Grants, to identify dataset usage within the NCeSS programme; with the major data centres, to identify patterns of dataset usage (quantitative and qualitative) within the wider social science community, to understand licensing issues and to ensure complementarity with JISC-funded Grid-enabling activities (current and planned); and with the social science community, to identify research drivers and ensure a fit with the ESRC's future data strategy plans.

6. Benefits for Social Sciences

The e-Infrastructure project will serve a number of important objectives which are relevant to the social science and wider research communities:
• enhance understanding of issues around resource discovery, data access, security and usability by providing a testbed for the development of metadata and service registries, tools for user authorisation and authentication, and user portals;
• lay foundations for an integrated strategy for the future development and support of e-Social Science infrastructure and services;
• leverage the infrastructure investment being made by the UK e-Science core programme and JISC for the benefit of the social sciences;
• promote synergies across NCeSS and other ESRC investments, co-ordinate activities, encourage mutual support and identify areas in which to promote the benefits of common policies and technology standards.
NCeSS is working closely with the social science community to ensure that the project is driven by research needs and, specifically, to identify the most research-relevant resources, tools and services to incorporate into the e-Infrastructure. It is also working with the UK e-Science core programme (NGS, OMII-UK consortium, DCC) and JISC to devise an e-Infrastructure development plan which will define technical standards and mechanisms to ensure long-term sustainability.

7. Summary and Future Work

In this paper we have presented an overview of how the NCeSS programme is working to develop applications of Grid technologies and tools within the social sciences, and to understand the factors which encourage or inhibit their wider diffusion and deployment. As such, NCeSS has lessons for the e-Science community as it begins to grapple with these same problems.

NCeSS will continue to develop its agenda for e-Social Science. Activities focusing on the usability of e-Science are expanding through participation in the JISC VRE and EPSRC usability and e-Science programmes. NCeSS is also beginning a series of fieldwork investigations of work practices in developing areas of e-Science, so as to better understand the impact of these new technologies, the issues they raise for usability, and the methodologies used for requirements capture, design and development.

Bibliography

1. Birkin, M., Clarke, M., Rees, P., Chen, H., Keen, J. and Xu, J. (2005). MoSeS: Modelling and Simulation for e-Social Science. In Proc 1st Int. Conf. on e-Social Science, Manchester.
2. Carlson, S. and Anderson, B. (2006). e-Nabling Data: Potential impacts on data, methods and expertise. In Proc 2nd Int. Conf. on e-Social Science, Manchester.
3. Crabtree, A. and Rouncefield, M. (2005). Working with text logs: some early experiences of e-SS in the field. In Proc 1st Int. Conf. on e-Social Science, Manchester.
4. Crabtree, A., French, A., Greenhalgh, C., Rodden, T. and Benford, S. (2006). Working with digital records: developing tool support. In Proc 2nd Int. Conf. on e-Social Science, Manchester.
5. Crabtree, A., French, A., Greenhalgh, C., Benford, S., Cheverst, K., Fitton, D., Rouncefield, M. and Graham, C. (2006). Developing digital records: early experiences of record and replay. To appear in Computer Supported Cooperative Work: The Journal of Collaborative Computing, Special Issue on e-Research.
6. Crouchley, R., van Ark, T., Pritchard, J., Grose, D., Kewley, J., Allan, R., Hayes, M. and Morris, L. (2005). Putting Social Science Applications on the Grid. In Proc 1st Int. Conf. on e-Social Science, Manchester.
7. David, P.A. and Spence, M. (2003). Towards institutional infrastructures for e-Science: the scope of the challenge. Research Report No. 2, Oxford Internet Institute.
8. Dutton, W. (2005). The Internet and Social Transformation: Reconfiguring Access. Pp. 375-97 in Dutton, W. H. et al. (eds), Transforming Enterprise. Cambridge, Massachusetts: MIT Press.
9. Dutton, W. D., Jirotka, M. and Schroeder, R. (2006). Ethical, Legal and Institutional Dynamics of the e-Sciences. OeSS Project Draft Working Paper, University of Oxford.
10. Edwards, P., Preece, A., Pignotti, E., Polhill, G. and Gotts, N. (2005). Lessons Learnt from Deployment of a Social Simulation Tool to the Semantic Grid. In Proc 1st Int. Conf. on e-Social Science, Manchester.
11. Fraser, M., Biegel, G., Best, K., Hindmarsh, J., Heath, C., Greenhalgh, C. and Reeves, S. (2005). Distributing Data Sessions: Supporting remote collaboration with video data. In Proc 1st Int. Conf. on e-Social Science, Manchester.
12. French, A., Greenhalgh, C., Crabtree, A., Wright, M., Hampshire, A. and Rodden, T. (2006). Software replay tools for time-based social science data.
13. Guy, M. and Tonkin, E. (2006). Folksonomies: Tidying up Tags? D-Lib Magazine 12(1).
14. Kirschner, P., Buckingham Shum, S. and Carr, C. (2003). Visualising Argumentation: Software Tools for Collaborative and Educational Sense-Making. Springer-Verlag: London.
15. Pignotti, E., Edwards, P., Preece, A., Gotts, N. and Polhill, G. (2005). Semantic Support for Computational Land-Use Modelling. In Proc 5th IEEE International Symposium on Cluster Computing and Grid, Cardiff.
16. Schroeder, R., Caldas, A., Mesch, G. and Dutton, W. (2005). In Proc 1st Int. Conf. on e-Social Science, Manchester.
Abstract
Workflows play a crucial role in e-Science systems. In many cases, e-Scientists need to make average-case estimates of the performance of workflows, using Quality of Service (QoS) properties for the evaluation. We have defined the Bell-Curve Calculus (BCC) to describe and calculate the selected QoS properties. This paper presents our motivation for using the BCC and the methodology used during its development, together with analysis and discussion of experimental results from the ongoing development.
The Bell-Curve Calculus is a good estimate of Quality of Service properties over a wide range of values.

2. Why a Bell-Curve Calculus

Although PEPA [3] and some other projects use the exponential distribution as their atomic distribution, we still have sufficient reasons to choose bell curves. Initial experimental evidence from DIGS¹ suggests that bell curves are a possible approximation to the probabilistic behaviour of a number of QoS properties used in e-Science workflows, including the reliability of services, considered as their mean time to failure; the accuracy of numerical calculations in workflows; and the run time of services. Moreover, the Central Limit Theorem (CLT) [4] also gives us some theoretical support by concluding that:

"The distribution of an average tends to be Normal, even when the distribution from which the average is computed is decidedly non-Normal."

Here in the CLT, 'Normal' refers to a Normal Distribution, i.e. a Bell Curve. Furthermore, from the mathematical description of bell curves, we can see that the biggest advantage of using a bell curve is that its probability density function (pdf)

p(x) = (1 / (√(2π)·σ)) · e^(−(x−µ)² / (2σ²))

has only two parameters: the mean value µ and the standard deviation σ, where µ decides the location of a bell curve and σ decides its shape. While evaluating the performance of a workflow, we need to gather all the possible data values of the QoS properties we analyse from all the input services, do calculations and analysis using that information, and pass the results through the whole workflow. It would be a big burden to transfer and calculate all the possible data values one by one. Using bell curves, which have only two parameters, the job becomes more efficient: all we need to do is store and propagate the two parameters through the workflow, and a bell curve can be constructed at any time from µ and σ.

We then ask whether we can calculate the QoS properties of a whole workflow from the corresponding properties of its component services, namely whether we can define inference rules to derive the QoS properties of composite services from the correlative properties of their components.

We consider four fundamental methods to combine Grid services (we use S1 and S2 to represent two arbitrary services):

Sequential: S2 is invoked after S1's invocation, and the input of S2 is the output of S1.

Parallel_All: S1 and S2 are invoked simultaneously and both outputs are passed to the next service.

Parallel_First: The output of whichever of S1 and S2 first succeeds is passed to the next service.

Conditional: S1 is invoked first. If it succeeds, its output is the output of the workflow; if it fails, S2 is invoked and the output of S2 is the output of the whole workflow.

In terms of the three QoS properties and the four combination methods, we have twelve fundamental combination functions (see Table 1). For instance, the combination function of run time in sequential services is the sum of the run times of the component services.

TABLE 1
THE TWELVE FUNDAMENTAL COMBINATION FUNCTIONS

              Seq    Para_All   Para_Fir   Cond
run time      sum    max        min        cond1
accuracy      mult   combine1   varies?    cond2
reliability   mult   combine2   varies?    cond3

The table shows the twelve fundamental combination functions in terms of the three QoS properties and the four basic combination methods. Sum, max, min and mult represent respectively taking the sum, maximum, minimum and product of the input bell curves. Cond1-3 are three different conditional functions whose calculation depends on which branch succeeds. The 'varies?' entries belong to parallel_first, where the output of the workflow is the output of the first service to succeed. Combine1-2 are probabilistic merges, of the form linear_combination_of_distribution_1 × probability_of_1_occurring + linear_combination_of_distribution_2 × probability_of_2_occurring + … + linear_combination_of_distribution_N × probability_of_N_occurring. Neither is a uniquely defined function; both depend on the particular use case of the workflow, which adds uncertainty to the calculus. In most workflows, however, only the basic combinations sum, max, min and mult are needed; what we do for combine1-2 is to combine these basic functions according to the workflow in question.

¹ DIGS (Dependability Infrastructure for Grid Services, http://digs.sourceforge.net/) is an EPSRC-funded project investigating fault tolerance and other quality-of-service issues in service-oriented architectures.

Here we convert the formula of the bell curve into a function in terms of µ and σ, obtaining the bell-curve function
bc(µ, σ) = λx. (1 / (√(2π)·σ)) · e^(−(x−µ)² / (2σ²)).

Our job is to define different instantiations of the combination functions applying to different QoS properties and different workflow structures.

3. Methodology

Suppose we have two bell curves corresponding to two services, written using the bell-curve function defined in Section 2 as bc(µ1, σ1) and bc(µ2, σ2). We need to describe µ0 and σ0 using µ1, µ2, σ1 and σ2; that is, µ0 = fµ(µ1, µ2, σ1, σ2) and σ0 = fσ(µ1, µ2, σ1, σ2). The combination function bc(µ0, σ0) is defined as bc(µ0, σ0) = F(bc(µ1, σ1), bc(µ2, σ2)) = bc(fµ(µ1, µ2, σ1, σ2), fσ(µ1, µ2, σ1, σ2)), which is actually a function of the four parameters µ1, µ2, σ1 and σ2.

We therefore have two main tasks:
(1) Can we find a satisfactory instantiation of F(bc(µ1, σ1), bc(µ2, σ2)) for every situation we are investigating?
(2) How good will our approximations be? 'Good' here means accurate and efficient.

For example, for the property run time in sequential services, we can use µ0 = µ1 + µ2 and σ0 = √(σ1² + σ2²), which has been proved true mathematically [5].

Our experiments are based on a system

… the error is very small. But does it always work like this? When we choose two closer bell curves as the inputs, the error becomes comparatively large (see Figure 4). This inconsistency identified one aspect worth investigating: through systematic experimentation using Agrajag, we needed to explore a wide range of data to find the error behaviour in different input situations.

FIGURE 2
THE SUM OF TWO BELL CURVES AND ITS APPROXIMATION

The graph shows the sum of two bell curves (red curve and green curve), which can be used to model the run time of sequential combinations. Here we use an exact, mathematically proved method, µ0 = µ1 + µ2 and σ0 = √(σ1² + σ2²), to estimate the piecewise-uniform curve (blue curve) produced by Agrajag. The mauve curve is our approximation curve, which almost coincides with the blue curve. There is still a tiny error, shown in the title of the graph, caused by the approximation using piecewise-uniform functions. In the ideal situation (as the resolution value, which divides a curve into locally constant and connected segments, tends to +∞), the error is zero.

FIGURE 3
THE MAXIMUM OF TWO BELL CURVES
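The parameter-propagation scheme of Sections 2 and 3 can be sketched in a few lines. This is our illustrative code, not the authors' Agrajag system: the `BellCurve` type and `seq_run_time` combinator are hypothetical names, and only the sequential run-time rule (the one the paper notes is exact) is implemented.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class BellCurve:
    """A QoS distribution summarised by its two bell-curve parameters."""
    mu: float     # mean: fixes the location of the curve
    sigma: float  # standard deviation: fixes the shape of the curve

    def pdf(self, x: float) -> float:
        """p(x) = exp(-(x - mu)^2 / (2 sigma^2)) / (sqrt(2 pi) sigma)."""
        return math.exp(-(x - self.mu) ** 2 / (2 * self.sigma ** 2)) / (
            math.sqrt(2 * math.pi) * self.sigma)

def seq_run_time(s1: BellCurve, s2: BellCurve) -> BellCurve:
    """Run time of the Sequential combination: the sum of two independent
    normals is exactly normal, with mu0 = mu1 + mu2 and
    sigma0 = sqrt(sigma1^2 + sigma2^2)."""
    return BellCurve(s1.mu + s2.mu, math.hypot(s1.sigma, s2.sigma))

s1 = BellCurve(mu=2.0, sigma=0.5)  # run time of service S1
s2 = BellCurve(mu=3.0, sigma=1.2)  # run time of service S2
total = seq_run_time(s1, s2)
print(total.mu, total.sigma)       # 5.0 1.3
```

Only the two parameters need to cross each workflow stage; the full curve can be rebuilt on demand from µ and σ via `pdf`.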
TABLE 2
THE COMBINATION METHODS FOR THE MAXIMUM OF TWO BELL CURVES

The table shows the situation when we use different combination methods to get the maximum of two bell curves at the same resolution value. We set µ0 = max(µ1, µ2) in all the methods; µ1 and σ1 both take values from 2⁻⁵ to 2⁵. Since we use the standard bell curve as one input and µ1 > 0, 'σ0 = σ1 or σ2 (with the bigger µ)' and 'σ0 = σ1 × σ2' are the same method in this case. Despite this, the first two combination methods are the best two methods we obtained. The combination methods we chose are rough hypotheses based on Figure 4: we estimate the output parameters according to the location and shape of the Agrajag approximation curve, and we calculated the average error of each method to compare how good these combination methods are.
FIGURE 7
THE COMPARISON OF COMBINATION METHOD 2 WITH DIFFERENT RESOLUTION VALUES

The red line is the real data of µp and the green line is our approximation using the linear function µp = 0.99 × µ1 + 0.24. In this case, σ1 is fixed to 0.5. In this figure, the linear approximation achieves a good result, especially in the upper part of the curve.

… of µ1, µ2, σ1 and σ2. The combination function we aim for then becomes bc(µp, σp) = bc(fµ(µ1, µ2, σ1, σ2), fσ(µ1, µ2, σ1, σ2)) = F(bc(µ1, σ1), bc(µ2, σ2)).

FIGURE 9
THE APPROXIMATION OF µp IN TERMS OF µ1 AND σ1

… of µ1 and σ1. We can see in this figure that the two surfaces do not coincide everywhere, which means we still need to make some adjustments during the linear approximation: we could change the parameters of the linear functions, or use non-linear functions for the approximation.
FIGURE 10
THE LINEAR APPROXIMATION TO σp IN TERMS OF σ1 WHEN µ1 IS FIXED

In this graph, the red surface is the real 'a' surface and the green surface is our approximation of 'a' in terms of µ1. There is a gap between the two surfaces in the area where µ1 is relatively small. We could use a compensation function to remove it, but in this case the absolute value of the error is acceptable, so we chose to keep the form of the approximation function simpler.
FIGURE 13
EVALUATION USE CASE

This is an example workflow for creating population-based "brain atlases", comprising procedures, shown as orange ovals, and data items flowing between them, shown as rectangles. More details of the workflow can be found at http://twiki.ipaw.info/bin/view/Challenge/FirstProvenanceChallenge.

FIGURE 14
ABSTRACTED WORKFLOW USE CASE

6. Future Work

In Sections 3-5 we gave the methodology and some experimental results of the BCC, and applied it to a particular use case. The next step is to complete the BCC by finishing the remaining fundamental combination functions. Then we will find real data with which to do more evaluation. After that, we will consider extending the BCC by importing the Log-Normal or other distributions to describe the values more precisely. We may also embed the extended BCC in other frameworks to enhance their prediction and evaluation capabilities.

References

[1] Foster, I. and Kesselman, C., The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufmann Publishers, 1999.
[2] Smarr, L., "Grids in Context", Chapter 1 of The Grid 2, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2003.
[3] Hillston, J., A Compositional Approach to Performance Modelling, 1995.
[4] http://www.statisticalengineering.com/central_limit_theorem.htm
[5] http://mathworld.wolfram.com/NormalSumDistribution.html
This paper is organised as follows: Section 2 introduces the need for Type Adaptors and presents a motivating example within a bioinformatics scenario. In Section 3, the use of WSDL for Type Adaptor description and discovery is shown before we present our implementation in Section 4. Evaluation of our implementation is given in Section 5, followed by an examination of related work in Section 6. Finally, we conclude and present further work in Section 7.

      <qualifiers name="product">capsid protein 2</qualifiers>
      <qualifiers name="protein_id">BAA19020.1</qualifiers>
      <qualifiers name="translation">MSDGAV...</qualifiers>
    </cds>
  </FEATURES>
  <SEQUENCE>atgagtgatggagcagt..</SEQUENCE>
</DDBJXML>

At a syntactic level, the output from the DDBJ-XML Service is not compatible with the input to the NCBI-Blast Service.

Figure 2. The output from the DDBJ-XML Service is not compatible for input to the NCBI-Blast Service.
2. Motivation and Use Case

A typical task within bioinformatics involves retrieving sequence data from a database and passing it to an alignment tool to check for similarities with other known sequences. Within myGrid, this interaction is modelled as a simple workflow, with each stage in the task being fulfilled by a Web Service, as illustrated in Figure 1. Many Web Services are available for retrieving ... sequence data to an alignment service, such as the BLAST service at NCBI². This service can consume a string of FASTA³-formatted sequence data.

Intuitively, a Bioinformatician will view the two sequence retrieval tasks as the same type of operation, expecting both to be compatible with the NCBI-Blast service. The semantic annotations attached through FETA affirm this, as the output types are assigned the same conceptual type, namely a Sequence Data Record concept. However, when plugging the two services together, we see that the output from either sequence data retrieval service is not directly compatible for input to the NCBI-Blast service (Figure 2). To harmonise the workflow, some intermediate processing is required on the data produced by the first service to make it compatible for input to the second service. We define this translation step as syntactic mediation, an additional workflow stage that is carried out by a particular class of programs we define as Type Adaptors.

Figure 1. A simple bioinformatics task: get sequence data from a database and perform a sequence alignment on it. [Accession ID → Get Sequence Data (XEMBL or DDBJ-XML) → Sequence Data → Sequence Alignment (NCBI-Blast) → Blast Results]
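As a concrete illustration of what such a Type Adaptor does, the sketch below converts a heavily simplified, hypothetical DDBJ-XML record into the FASTA string an alignment service expects. The element names are abbreviated from Figure 2 and the accession number is invented; the real DDBJ schema is much richer.

```python
import xml.etree.ElementTree as ET

# Hypothetical, cut-down DDBJ-XML record (accession invented for illustration).
DDBJ_RECORD = """\
<DDBJXML>
  <ACCESSION>AB000000</ACCESSION>
  <SEQUENCE>atgagtgatggagcagt</SEQUENCE>
</DDBJXML>"""

def ddbj_to_fasta(xml_text: str, width: int = 60) -> str:
    """A toy Type Adaptor: DDBJ-XML record -> FASTA-formatted string."""
    root = ET.fromstring(xml_text)
    accession = root.findtext("ACCESSION", default="unknown")
    seq = (root.findtext("SEQUENCE") or "").strip().upper()
    # FASTA: a '>' header line, then the sequence wrapped at a fixed width.
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + accession + "\n" + "\n".join(lines)

print(ddbj_to_fasta(DDBJ_RECORD))
# >AB000000
# ATGAGTGATGGAGCAGT
```

In the paper's terms this is the syntactic-mediation stage inserted between the retrieval and alignment services; the following sections describe how such adaptors are described and discovered automatically rather than hand-written per service pair.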
when introducing a new representation for information that is already present in other formats, Type Adaptors must be created to transform the new representation to all other formats.

By introducing an intermediate representation, to which all data formats are converted, the problem of scalability can be diminished. For n compatible interfaces, O(n) Type Adaptors are required, resulting in linear complexity as more services are added. When introducing new formats, only one Type Adaptor is required, to convert the new data format to and from the intermediate representation. Therefore, our architecture is based around the use of an intermediary representation. When considering the mechanisms necessary to support users in the use and agreement of a common representation, we see that existing work already tackles a similar problem within myGrid. FETA uses descriptions that supply users with conceptual definitions of what a service does, using domain-specific terminology. Part of this solution involved the construction of a large ontology for the bioinformatics domain covering the types of data shown in our use case. Therefore, we extend these existing OWL ontologies to capture the semantics and structure of the data representation.

In terms of sharing and reuse, it is common for Web Services to supply many operations that operate over the same data, or subsets of it. Therefore, the transformation for a given source type may come in the form of a Type Adaptor designed to cater for multiple types. Hence, the generation of a WSDL description for a given Type Adaptor should capture all the transformation capabilities of the adaptor. This can be achieved by placing multiple operation definitions in the WSDL, one for each of the possible type conversions.

4.2. Bindings

To describe the relationship between XML schema components and OWL concepts / properties, we devised a mapping language, FXML-M (Formalised XML mapping), described in previous work [18]. FXML-M is a composable mapping language in which mapping statements translate XML schema components from a source schema to a destination schema. FXML-T (Formalised XML Translator) is an interpreter for FXML-M which consumes an M-Binding (a collection of mappings) and a source XML document, and produces a destination document. Broadly, an M-Binding B contains a sequence of mappings m1, m2, ..., mn. In the style of XML schema, M-Bindings may also import other M-Bindings (e.g. B1 = {m1, m2} ∪ B2) to support composition.

4.3. Binding Publication and Discovery

Since our M-Binding language is used to specify XML-to-XML document conversion, and our intermediary language is an OWL ontology, conversion between XML instances and OWL concept instances is specified in terms of a canonical XML representation for OWL individuals. While the use of XML to represent OWL concepts is common, XML Schemas to describe these instances are not available. Hence, we automatically generate XML Schemas (OWL instance schemas) to describe valid concept instances for a given ontology.

With an intermediary schema in place (in the form of an OWL instance schema), the appropriate M-Bindings to translate data to and from XML format, and FXML-T (the FXML-M translator), Web Service harmonisation is supported. To fully automate the process, i.e. to discover the appropriate M-Bindings (themselves individual Type Adaptors) based on translation requirements, we require a registry to advertise M-Bindings. Since Type Adaptors can be described using WSDL, we use the Grimoires service registry (www.grimoires.org) to record, advertise and supply adaptors. Grimoires is an extended UDDI [1] registry that enables the publishing and querying of service interfaces while also supporting semantic annotations of service descriptions [14]. Figure 5 illustrates how Bindings are registered with Grimoires using our Binding Publisher Service. The Binding Publisher Service consumes an M-Binding, along with the source and destination XML schemas, and produces a WSDL document that describes the translation capability of the M-Binding. This generated WSDL is then publicly hosted and registered with Grimoires using the save_service operation. Afterwards, an M-Binding WSDL may be discovered using the standard Grimoires query interface (the findService operation) based on the input and output schema types desired.

Figure 5. The Binding Publisher Service consumes a source schema, a destination schema and an M-Binding (mappings m1: s1 → d1, ..., mn: sn → dn), produces a WSDL description with in/out message types for each mapping, and registers it with the Grimoires repository.
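The composition behaviour (e.g. B1 = {m1, m2} ∪ B2) can be pictured with a toy tag-renaming translator. This is our sketch only, far simpler than the real FXML-M semantics: here a mapping is just a source-tag to destination-tag pair, and bindings compose by set union.

```python
import xml.etree.ElementTree as ET

def translate(doc: ET.Element, binding: dict) -> ET.Element:
    """Copy the document, renaming every mapped tag; unmapped tags pass through."""
    out = ET.Element(binding.get(doc.tag, doc.tag))
    out.text = doc.text
    for child in doc:
        out.append(translate(child, binding))
    return out

# B1 = {m1, m2} | B2a: a binding built by importing another binding's mappings.
B2a = {"SEQUENCE": "sequence"}
B1 = {"DDBJXML": "Sequence_Data_Record", "ACCESSION": "accession"} | B2a

src = ET.fromstring("<DDBJXML><ACCESSION>X1</ACCESSION>"
                    "<SEQUENCE>atg</SEQUENCE></DDBJXML>")
print(ET.tostring(translate(src, B1), encoding="unicode"))
# <Sequence_Data_Record><accession>X1</accession><sequence>atg</sequence></Sequence_Data_Record>
```

Because a binding is just a collection of mappings, an adaptor covering several schemas can be assembled from smaller existing adaptors, which is the property evaluated in Section 5.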
Figure 6. Automated syntactic mediation for our use case, using Grimoires to store Binding descriptions and FXML-T for document translation.

4.4. Automated Mediation

To achieve syntactic mediation, two M-Bindings are necessary: a realisation M-Binding (to convert XML to the intermediate OWL representation), and a serialisation M-Binding (to convert OWL to XML). Figure 6 is an expansion of Figure 4 showing the use of Grimoires and FXML-T to automate syntactic mediation for our use case using an intermediate OWL representation. When registering the DDBJ and NCBI-Blast services with FETA, semantic annotations are used to specify the input and output semantic types by a reference to a concept in the Bioinformatics ontology - in this case the Sequence Data concept. At workflow composition time, the Bioinformatician may wish to feed the output from the DDBJ service into the NCBI-Blast service because they are deemed semantically compatible (i.e. they share the same semantic type). While this workflow can be specified, it is not executable because of the difference in data formats assumed by each service provider - a stage of syntactic mediation is required. Figure 6 shows the mediation phase, using Grimoires to discover M-Bindings and FXML-T to perform the translation from DDBJ format to FASTA format, with JENA used to hold the intermediate representation (in the form of an OWL concept instance).

5. Evaluation

To evaluate this approach, we have examined the performance of our Translation Engine (FXML-T) and the use of Grimoires as a repository for advertising M-Binding documents. The aims are threefold: (i) to test the scalability of our mapping language approach; (ii) to establish FXML-T as a scalable translation engine by examining the performance costs against increasing document sizes, increasing ...

Expanding document and schema size will increase the translation cost linearly or better.

We test the scalability of FXML-T in two ways: by increasing input document size (while maintaining uniform input XML schema size), and by increasing both input schema size and input document size. We test FXML-T against the following XML translation tools:

• XSLT, using Perl and the XML::XSLT module⁴.
• XSLT, using JAVA (1.5.0) and Xalan⁵ (v2.7.0).
• XSLT, using Python (v2.4) and the 4Suite module⁶ (v0.4).
• SXML, a SCHEME implementation for XML parsing and conversion (v3.0).

Since FXML-T is implemented using an interpreted language, and Perl is also interpreted, we would expect them to perform slowly in comparison to the JAVA and Python XSLT implementations, which are compiled⁷. Figure 7 shows the time taken to transform a source document to a structurally identical destination document for increasing document sizes.

The maximum document size tested is 1.2 MB, twice that of the Blast results obtained in our use case. From Figure 7 we see that FXML-T has a linear expansion in transformation time against increasing document size. Both the Python and JAVA implementations also scale linearly, with better performance than FXML-T due to their use of compiled code. Perl exhibits the worst performance in terms of time taken, but a linear expansion is still observed.

Our second performance test examines the translation cost with respect to increasing XML schema size. To perform this test, we generate structurally equivalent source and destination XML schemas and input XML documents which satisfy them. XML input document size is directly proportional to schema size; with 2047 schema elements, the input document is 176 KBytes, while with 4095 elements a source document is 378 KBytes. Figure 8 shows

⁴ http://xmlxslt.sourceforge.net/
⁵ http://xml.apache.org/xalan-j/
⁶ http://4suite.org/
⁷ Although Python is interpreted, the 4Suite library is statically linked to natively compiled code.
Figure 7. Transformation Performance against increasing XML Document Size (FXML-T, Perl, JAVA, Python and SXML, each with a fitted line; x-axis: document size in KBytes).

Figure 9. Transformation Performance against number of bindings (four source documents of 152, 411, 1085 and 2703 Bytes, each with a fitted line; x-axis: number of bindings).
translation time against the number of schema elements used. Python and JAVA perform the best - a linear expansion with respect to schema size that remains very low in comparison to FXML-T and Perl. FXML-T itself has a quadratic expansion; however, upon further examination, we find that the quadratic expansion emanates from the XML parsing subroutines used to read schemas and M-Bindings, whereas the translation itself has a cost linear in the size of its input. The SCHEME XML library used for XML parsing is common to FXML-T and SXML, hence the quadratic expansion for SXML also. Therefore, our translation approach is linear when implemented with a suitable XML parser.

Figure 8. Transformation Performance against increasing XML Schema Size (FXML-T, SXML, Java and Python, each with a fitted line; x-axis: number of schema elements).

… is the ability to compose M-Bindings at runtime. This can be achieved by creating an M-Binding that includes individual mappings from an external M-Binding, or imports all mappings from an external M-Binding. For service interfaces operating over multiple schemas, M-Bindings can be composed easily from existing Type Adaptors. Ideally, this composability should come with minimal cost. To examine M-Binding cost, we increased the number of M-Bindings imported and observed the time required to transform the document. To perform the translation, 10 mappings are required: m1, m2, ..., m10. M-Binding 1 contains all the required mapping statements: B1 = {m1, m2, ..., m10}. M-Binding 2 is a composition of two M-Bindings, where B2 = {m1, ..., m5} ∪ B2a and B2a = {m6, ..., m10}. To fully test the cost of composition, we increased the number of M-Bindings used and ran each test using 4 source documents with sizes 152 Bytes, 411 Bytes, 1085 Bytes and 2703 Bytes. While we aim for zero composability cost, we would expect a small increase in translation time as more M-Bindings are included. By increasing source document size, a larger proportion of the translation time will be spent on reading in the document and translating it; consequently, the relative cost of composing M-Bindings will be greater for smaller documents, and the increase in cost should therefore be greater. Figure 9 shows the time taken to transform the same four source documents against the same mappings distributed across an increasing number of M-Bindings. On the whole, a very subtle increase in performance cost is seen, with the exception of the file size
1085 bytes. We attribute this anomaly to the rate of error in time recording, which is accurate only to ±10 milliseconds.

5.3. Registry Cost

The cost of M-Binding discovery using Grimoires is not significant when compared to the cost of executing the target service.

Our final test examines the cost of using Grimoires to discover the required M-Bindings. While this feature facilitates automation, it will require additional Web Service invocations in a workflow.

  Activity                      Average (s)
  DDBJ Execution                2.50
  Realisation Discovery         0.22
  Realisation Transformation    0.47
  Jena Mediation                0.62
  Serialisation Discovery       0.23
  Serialisation Translation     0.27
  Total Mediation               1.81

The table above shows the average time taken (over 5 runs) for a type translation using OWL as an intermediary representation. Such a translation consists of five steps: (i) discovery of the realisation M-Binding, (ii) transformation of the result to an OWL instance, (iii) import of the OWL instance into the JENA KB, (iv) discovery of the serialisation M-Binding, and (v) transformation of the OWL instance to XML. The results show that the total mediation time is roughly 2 seconds, with the largest portion of the time taken importing the OWL instance into JENA. The discovery overhead (realisation and serialisation M-Bindings) is acceptable in comparison to the service execution and translation times.

6. Related Work

The Interoperability and Reusability of Internet Services (IRIS) project has also recognised the need to assist bioinformaticians in the design and execution of workflows. Radetzki et al [16] describe adaptor services using WSDL with additional profiles in the Mediator Profile Language (MPL, http://www.cs.uni-bonn.de/III/bio/iris/MediatorProfile.owl). Adaptor services are collated in a Mediator Pool and queried using a derivation from the input and output descriptions of the services connected by a dataflow. The query is a combination of syntactic information and semantic concepts from a domain ontology. A matchmaking algorithm presents a ranked list of adaptors, some of which may be composed from multiple adaptor instances. To aid the adaptor description stage, WSDL descriptions of services and their documentation are linguistically analysed to establish the sort of service. For example, the 'getNucSeq' method is decomposed into the terms 'get', 'nuc', and 'seq'. To this end, the matching algorithm is relaxed and is not intended to be used automatically; instead, users are aided during workflow composition.

Within the SEEK framework [4], each service has a number of ports which expose a given functionality. Each port advertises a structural type which defines the input and output data format by a reference to an XML schema type. If the output of one service port is used as input to another service port, the connection is structurally valid when the two types are equal. Each service port can also be allocated a semantic type, specified by a reference to a concept within an ontology. If two service ports are plugged together, they are semantically valid if the output from the first port is subsumed by the input to the second port. Structural types are linked to semantic types by a registration mapping using a custom mapping language based on XPATH. If the concatenation of two ports is semantically valid, but not structurally valid, an XQUERY transformation can be generated to integrate the two ports, making the link structurally feasible. The SEEK system provides data integration between different logical organisations of data using a common conceptual representation, the same technique that we adopt. However, their work is only applicable to services within the bespoke SEEK framework. The architecture we present is designed to work with arbitrary WSDL Web Services annotated using conventional semantic Web Service techniques.

Hull et al [8] dictate that conversion services, or shims, can be placed between services whenever some type of translation is required - exactly as in the current myGrid solution. They explicitly specify that a shim service is experimentally neutral in the sense that it has no side-effect on the result of the experiment. By enumerating the types of shims required in bioinformatics Grids and classifying all instances of shim services, it is hoped that the necessary translation components could be automatically inserted into a workflow. However, their focus is not on the translation between different data representations, but rather on the need to manipulate data sets: extracting information from records, finding alternative sources for data, and modifying workflow designs to cope with iterations over data sets.

Moreau et al [15] have investigated the same problem within the Grid Physics Network, GriPhyN (http://griphyn.org/). To provide a homogeneous access model to varying data sources, Moreau et al propose a separation between logical and physical file structures. This allows access to data sources to be expressed in terms of the logical structure of the information, rather than the way it is physically represented. The XML Data Type and Mapping for Specifying Datasets (XDTM) prototype provides an implementation which allows data sources to be navigated using XPATH. While this approach is useful when amalgamating data from different
physical representations (i.e. XML files, binary files and directory structures), it does not address the problem of data represented using different logical representations (i.e. different schemas with the same physical representation). Our service integration problem arises from the fact that different service providers use different logical representations of conceptually equivalent information.

7. Conclusions and Future Work

In this paper, we have described the motivation for a generic Type Adaptor description policy to support automated workflow harmonisation when syntactic incompatibilities are encountered. By using WSDL to describe Type Adaptor capabilities, and the Grimoires registry to advertise them, translation components can be discovered at runtime and placed into the running workflow. Using FXML-M as a Type Adaptor language and OWL ontologies as an intermediate representation, we show an implementation that supports a bioinformatics use case. Evaluation of our binding language approach, its implementation, and the use of Grimoires as a registry shows that our architecture is scalable, supports Type Adaptor composability with virtually no cost, and has a relatively low overhead for binding discovery.

While our architecture has been based on Type Adaptors taking the form of FXML-M binding documents, the approach will work with any Type Adaptor language, such as XSLT or JAVA, provided the appropriate code is put in place to automatically generate WSDL descriptions of their capabilities and the workflow engine is able to interpret their WSDL binding (i.e. execute an XSLT script, run a JAVA program, or invoke a particular SOAP service). Even though our architecture has been designed to fit within the existing myGrid application, the approach will apply to any Grid or Web Services architecture. In incorporating Semantic Web technology, namely the association of WSDL message parts with concepts from an ontology, we have followed existing practices such as those used by the FETA system, OWL-S [13], WSMO [17], and WSDL-S [2].

In future work, we aim to improve the FXML-T implementation in the area of XML parsing. This will make FXML-T translation cost scale at an acceptable rate when increasing both XML document size and XML schema size.

8. Acknowledgment

This research is funded in part by the EPSRC myGrid project (reference GR/R67743/01).

References

[1] UDDI technical white paper, September 2000.
[2] R. Akkiraju, J. Farrell, J. Miller, M. Nagarajan, M. Schmidt, A. Sheth, and K. Verma. Web service semantics - WSDL-S. Technical report, UGA-IBM, 2005.
[3] T. Berners-Lee, J. Hendler, and O. Lassila. The semantic web. Scientific American, pages 34-43, 2001.
[4] S. Bowers and B. Ludascher. An ontology-driven framework for data transformation in scientific workflows. In Intl. Workshop on Data Integration in the Life Sciences (DILS'04), 2004.
[5] E. Christensen, F. Curbera, G. Meredith, and S. Weerawarana. Web services description language (WSDL) 1.1, March 2001. W3C.
[6] J. Clark. XSL transformations (XSLT) version 1.0. Technical report, W3C, 1999.
[7] C. Goble, S. Pettifer, R. Stevens, and C. Greenhalgh. Knowledge Integration: In silico Experiments in Bioinformatics. In I. Foster and C. Kesselman, editors, The Grid: Blueprint for a New Computing Infrastructure, Second Edition. Morgan Kaufmann, November 2003.
[8] D. Hull, R. Stevens, and P. Lord. Describing web services for user-oriented retrieval. 2005.
[9] N. Kavantzas, D. Burdett, G. Ritzinger, and T. Fletcher. WS-CDL web services choreography description language version 1.0. Technical report, W3C, 2004.
[10] G. Klyne and J. J. Carroll. Resource description framework (RDF): Concepts and abstract syntax. Technical report, W3C, 2004.
[11] F. Leymann. Web services flow language (WSFL 1.0), May 2001.
[12] P. Lord, P. Alper, C. Wroe, and C. Goble. Feta: A lightweight architecture for user oriented semantic service discovery. In The Semantic Web: Research and Applications: Second European Semantic Web Conference, ESWC 2005, Heraklion, Crete, Greece, pages 17-31, Jan. 2005.
[13] D. Martin, M. Burstein, G. Denker, J. Hobbs, L. Kagal, O. Lassila, D. McDermott, S. McIlraith, M. Paolucci, B. Parsia, T. Payne, M. Sabou, E. Sirin, M. Solanki, N. Srinivasan, and K. Sycara. OWL-S: Semantic markup for web services. Technical report, The OWL Services Coalition, 2003.
[14] S. Miles, J. Papay, T. Payne, M. Luck, and L. Moreau. Towards a protocol for the attachment of metadata to service descriptions and its use in semantic discovery. Scientific Programming, pages 201-211, 2005.
[15] L. Moreau, Y. Zhao, I. Foster, J. Voeckler, and M. Wilde. XDTM: the XML Dataset Typing and Mapping for Specifying Datasets. In Proceedings of the 2005 European Grid Conference (EGC'05), Amsterdam, Netherlands, Feb. 2005.
[16] U. Radetzki, U. Leser, S. Schulze-Rauschenbach, J. Zimmermann, J. Lussem, T. Bode, and A. Cremers. Adapters, shims, and glue - service interoperability for in silico experiments. Bioinformatics, 22(9):1137-1143, 2006.
[17] D. Roman, H. Lausen, and U. Keller. D2v1.0: Web service modeling ontology (WSMO), September 2004. WSMO Working Draft.
[18] M. Szomszor, T. R. Payne, and L. Moreau. Automatic syntactic mediation for web service integration. In Proceedings of the International Conference on Web Services 2006 (ICWS 2006), Chicago, USA, Sept. 2006.
Abstract— The success of web services has influenced the way in which grid applications are written. Grid users seek to use combinations of web services to perform the overall task they need to achieve. In general this can be seen as a set of services together with a workflow document describing how these services should be combined. The user may also have certain constraints on the workflow operations, such as execution time or cost to the user, specified in the form of a Quality of Service (QoS) document. These workflows need to be mapped to a subset of the Grid services, taking the QoS and the state of the Grid - service availability and performance - into account. We propose in this paper an approach for generating constraint equations describing the workflow, the QoS requirements and the state of the Grid. This set of equations may be solved using Integer Linear Programming (ILP), which is the traditional method. We further develop a 2-stage stochastic ILP which is capable of dealing with the volatile nature of the Grid and of adapting the selection of services during the life of the workflow. We present experimental results comparing our approaches, showing that the 2-stage stochastic programming approach performs consistently better than traditional approaches. This work forms the workflow scheduling service within WOSE (Workflow Optimisation Services for e-Science Applications), a collaboration between Imperial College, Cardiff University and Daresbury Laboratory.

I. INTRODUCTION

Grid Computing has been evolving over recent years towards the use of service orientated architectures [1]. Functionality within the Grid exposes itself through a service interface which may be a standard web service endpoint. This functionality may expose computational power, storage, software capable of being deployed, access to instruments or sensors, or potentially a combination of the above.

Grid workflows that users write and submit may be abstract in nature, in which case the final selection of web services has not been finalised. We refer to such abstract descriptions of services as abstract services in this paper. Once the web services are discovered and selected, the workflow becomes concrete, meaning that web services matching the abstract descriptions have been selected.

The Grid is by nature volatile - services appear and disappear due to changes in owners' policies, equipment crashing or network partitioning. Submitting an abstract workflow thus allows late binding of the workflow to the web services currently available within the Grid. The workflow may also take advantage of new web services which were not available at the time of writing. Users who submit a workflow to the Grid will often have constraints on how they wish the workflow to perform. These may be described in the form of a QoS document which details the level of service they require from the Grid. This may include requirements on such things as the overall execution time for the workflow; the times by which certain parts of the workflow must be completed; and the cost of using services within the Grid to complete the workflow.

In order to determine whether these QoS constraints can be satisfied, it is necessary to store historic information and monitor the performance of the different web services within the Grid. Such information could be performance data related to execution, and periodic information such as queue length and availability. Existing Grid middleware for performance repositories may be used for the storage and retrieval of this data. If the whole of the workflow is made concrete at the outset, QoS violations may result. Therefore we have adopted an iterative approach. At each stage the workflow is divided into those abstract services which need to be deployed now and those that can be deployed later. Those abstract services which need to be deployed now are made concrete and deployed to the Grid. However, to maintain QoS constraints it is necessary to ensure that at each iteration the selected web services will still allow the whole workflow to achieve its QoS.

This paper presents results of the workflow scheduling service within WOSE (Workflow Optimisation Services for e-Science Applications). WOSE is an EPSRC-funded project jointly conducted by researchers at Imperial College, Cardiff University and Daresbury Laboratory. We discuss how our work relates to others in the field in Section II. Section III describes the process of workflow-aware performance-guided scheduling, followed by a description of the 2-stage stochastic programming approach and an algorithm for stochastic scheduling in Section IV. In Section V we illustrate how our approach performs through simulation, before concluding in Section VI.

II. RELATED WORK

Business Process Execution Language (BPEL) [2] is beginning to become a standard for composing web services, and many projects such as Triana [3] and WOSE [4] have adopted it as a means to realise service-based Grid workflow technology. These projects provide tools to specify abstract workflows and workflow engines to enact workflows. Buyya et al [5] propose a Grid Architecture for Computational Economy
(GRACE), considering a generic way to map economic models into a distributed system architecture. Their Grid resource broker, Nimrod-G, supports deadline- and budget-based scheduling of Grid resources; however, no QoS guarantee is provided by the broker. Zeng et al [6] investigate QoS-aware composition of web services using an integer programming method. Services are scheduled using local planning, global planning and integer programming approaches, and the execution time of a web service is predicted using an arithmetic mean of its historical invocations. However, Zeng et al assume that services provide up-to-date QoS and execution information, based on which the scheduler can obtain a service level agreement with the web service. Brandic et al [7] extend the approach of Zeng et al to consider application-specific performance models, but their approach fails to guarantee QoS over the entire lifetime of a workflow. They also assume that web services are QoS-aware and that a certain level of performance is therefore guaranteed; in an uncertain Grid environment, however, QoS may be violated. Brandic et al have no notion of global planning of a workflow, so there is a risk of QoS violation. Huang et al [8] have developed a framework for dynamic web service selection for the WOSE project; however, it is limited to best-service selection and considers no QoS issues. We see our work fitting well within the optimisation service of the WOSE architecture; a full description of the architecture can be found in [8].

Our approach not only dynamically selects the optimal web service but also makes sure that the overall QoS requirements of a workflow are satisfied with sufficiently high probability. The main contributions of our paper are the novel QoS support approach and an algorithm for stochastic scheduling of workflows in a volatile Grid.

III. WORKFLOW AWARE PERFORMANCE GUIDED SCHEDULING

We provide Table I as a quick reference to the parameters of the ILP.

TABLE I
SCHEDULING PARAMETERS

  a_i                 Abstract service i
  t_ij, c_ij, x_ij    Expected time, cost and selection variable associated with web service s_ij matching a_i
  D                   Maximum time in which the workflow should be executed
  d_i                 Deadline within which a_i is expected to complete
  n                   Number of abstract services
  m_i                 Number of web services matching a_i

A. Deterministic Integer Linear Program (ILP)

Before presenting our 2-stage stochastic integer linear program, we first present the deterministic ILP. The program is integer linear as it contains only integer variables (unknowns), and the constraints appearing in the program are all linear. The ILP consists of an objective, which we wish to minimise, along with several constraints which need to be satisfied. The objective here is to minimise the overall workflow cost C:

  min C    (1)

  C = \sum_{i=1}^{n} \sum_{j=1}^{m_i} c_{ij} x_{ij}    (2)

We have identified the following constraints.

- Selection constraint: equation (3) maps each a_i to one and only one web service. For each a_i, exactly one of the x_{ij} equals 1, while all the rest are 0.

  \sum_{j=1}^{m_i} x_{ij} = 1, for every i    (3)

  x_{ij} \in \{0, 1\}, for every i, j    (4)

- Deadline constraint: equation (5) ensures that each a_i finishes within its assigned deadline.

  \sum_{j=1}^{m_i} t_{ij} x_{ij} \le d_i, for every i    (5)

- Other workflow-specific constraints: these are generated from the nature of the workflow and other soft deadlines (execution constraints), and could be explicitly specified by the end-user - e.g. some abstract service, or a subset of abstract services, is required to complete within t seconds. They could also express other QoS parameters such as reliability and availability. A full list of constraints is beyond the scope of this paper.

IV. TWO-STAGE STOCHASTIC ILP WITH RECOURSE

Stochastic programming, as the name implies, is mathematical (i.e. linear, integer, mixed-integer, or nonlinear) programming with a stochastic element present in the data. In deterministic mathematical programming the data (coefficients) are known numbers, while in stochastic programming they are unknown; instead, we may have a probability distribution for them. Because the unknowns have a known distribution, they can be used to generate a finite number of deterministic programs through techniques such as Sample Average Approximation (SAA), and an ε-optimal solution to the true problem can be obtained. A full discussion of SAA is beyond the scope of this paper; interested readers may refer to [9].

Consider a set I of abstract services that can be scheduled currently and concurrently, and let |I| be the number of such services. Similarly, let U be the set of unscheduled abstract services and |U| its size. Equations (6) to (9) represent a 2-stage stochastic program with recourse, where stage 1 minimises current costs and stage 2 aims to minimise future costs. The recourse term is Q(x^1, ξ), which is the future cost. The term q^T z in the objective of the stage-2 program is the
penalty incurred for failing to compute a feasible schedule. The vector q has values such that the incurred penalty is clearly apparent in the objective value. The z variables are also present in the constraints of the stage-2 programs in order to keep the program feasible, as certain realisations of the random variables would otherwise make it infeasible. The vector z consists of continuous variables whose size depends on the number of constraints appearing in the program.

- Stage 1:

  min { C + E[Q(x^1, ξ)] }    (6)

  C = \sum_{i \in I} \sum_{j=1}^{m_i} c_{ij} x_{ij}    (7)

subject to the selection and scheduling constraints, along with other possible constraints.

- Stage 2: ξ is a vector of random variables for the runtimes and costs of services, and x^1 is the vector denoting the stage-1 solution. Q(x^1, ξ) is the optimal value of

  min { F + q^T z }    (8)

  F = \sum_{i \in U} \sum_{j=1}^{m_i} ξ_{ij} x_{ij}    (9)

subject to the selection and scheduling constraints, along with other possible constraints, where F is a realisation of the expected costs of using the services. The expectation E[Q(x^1, ξ)] is the expected objective value of stage 2, which is computed using the SAA problem listed in equation (10). The stage-2 solution can be used to recompute the stage-1 solution, which in turn leads to better stage-2 solutions.

  min { C + (1/N) \sum_{k=1}^{N} Q(x^1, ξ^k) }    (10)

  N \ge \frac{3 σ^2_{max}}{(ε - δ)^2} \log\frac{|F|}{α}    (11)

In equation (11), |F| is the number of elements in the feasible set, i.e. the set of possible mappings of abstract services to real Grid services; 1 − α is the desired probability of accuracy; δ the tolerance; ε the distance of the solution from the true solution; and σ²_max the maximum execution time variance of a particular service in the Grid. One could argue that it is not trivial to calculate either σ²_max or |F|. The maximum execution time variance of some Grid service is a good approximation for σ²_max, and |F| can be obtained with proper discretisation techniques. Equation (11) is derived in [10]. Our scheduling service provides a 95% guarantee, hence 1 − α is taken as 0.95; ε − δ is fixed for convenience, while log |F| turns out to be approximately equal to 4. In our case, to obtain the 95% confidence level, N turns out to be around 600. This means that nearly 600 deterministic ILP programs must be solved in stage 2 for each iteration of Algorithm 1; with only about 500 unknowns in the ILP, negligible time is spent solving these scenarios.

A. Algorithm for stochastic scheduling of workflows

Algorithm 1 obtains scheduling solutions for abstract workflow services by solving 2-stage stochastic programs, where stage 1 minimises current costs and stage 2 minimises future costs. The algorithm guarantees an ε-optimal solution (i.e., a solution with an absolute optimality gap of ε to the true solution) with the desired probability [9]. To achieve the desired accuracy, however, one needs to sample enough scenarios, and this number can grow quite large in a big utility grid; in a service-rich environment with continuous execution time distributions associated with Grid services, the number of scenarios is theoretically infinite. With proper discretisation techniques, though, the number of scenarios (the sample size) required for the desired accuracy is at most linear in the number of Grid services. This is evident from the value of N (equation (11)), since |F|, the size of the feasible set, is exponential in the number of Grid services. Finally, statistical confidence intervals are derived on the quality of the approximate solutions.

Algorithm 1 initially obtains scheduling solutions for the stage-1 abstract services I in the workflow. This stage-1 result places constraints on the stage-2 programs, which aim to find scheduling solutions for the rest of the unscheduled workflow. The sampling size (equation (11)) used in each iteration guarantees an ε-optimal solution to the true scheduling problem with the desired accuracy, 95% in our case. The scheduling operation is a success only if the optimality gap and the variance of the gap estimator are small; if not, the iteration is repeated as in Step 3.6 of the algorithm, which leads to computing a new schedule for the stage-1 abstract services with tighter QoS bounds. When the scheduled stage-1 abstract services finish execution, Algorithm 1 is used to schedule the abstract services that follow them in the workflow. Step 3.5 selects the stage-1 solution, which has a specified tolerance to the true problem with probability at least equal to the specified confidence level 1 − α.

  \bar{L} = (1/M) \sum_{m=1}^{M} v_N^m    (12)

  s_L^2 = \frac{1}{M(M-1)} \sum_{m=1}^{M} (v_N^m - \bar{L})^2    (13)

  \hat{U} = C(\hat{x}^1) + (1/N') \sum_{k=1}^{N'} Q(\hat{x}^1, ξ^k)    (14)

  s_U^2 = \frac{1}{N'(N'-1)} \sum_{k=1}^{N'} ( C(\hat{x}^1) + Q(\hat{x}^1, ξ^k) - \hat{U} )^2    (15)
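As a concrete illustration of the two-stage recipe, the sketch below enumerates stage-1 mappings and approximates the recourse term by averaging deterministic stage-2 solves over sampled scenarios, in the spirit of the SAA problem of equation (10). This is an illustrative toy, not the paper's CPLEX-based implementation: the brute-force `solve_ilp` stands in for an ILP solver, and the function names, the penalty value and the data are invented.

```python
import itertools

def solve_ilp(services, deadline):
    """Brute-force the deterministic ILP of Section III-A: pick one
    candidate per abstract service (the selection constraint),
    minimising total cost subject to the total expected time meeting
    the deadline. `services` holds one list of (time, cost) candidate
    tuples per abstract service; returns (cost, choice) or None."""
    best = None
    for choice in itertools.product(*services):
        total_time = sum(t for t, _ in choice)
        total_cost = sum(c for _, c in choice)
        if total_time <= deadline and (best is None or total_cost < best[0]):
            best = (total_cost, choice)
    return best

def saa_schedule(stage1, sample_scenario, deadline, n_scenarios, penalty=1000.0):
    """Sample Average Approximation of the 2-stage program: for each
    feasible stage-1 choice, add the average cost of deterministic
    stage-2 solves over N sampled scenarios; scenarios with no
    feasible stage-2 schedule incur `penalty` (the role played by the
    q.z recourse term)."""
    scenarios = [sample_scenario() for _ in range(n_scenarios)]
    best = None
    for choice in itertools.product(*stage1):
        t1 = sum(t for t, _ in choice)
        c1 = sum(c for _, c in choice)
        if t1 > deadline:
            continue  # stage 1 alone already misses the deadline
        recourse = 0.0
        for scenario in scenarios:
            sol = solve_ilp(scenario, deadline - t1)
            recourse += sol[0] if sol else penalty
        objective = c1 + recourse / n_scenarios
        if best is None or objective < best[0]:
            best = (objective, choice)
    return best
```

In the paper each stage-2 solve is itself an ILP handled by CPLEX, and roughly 600 scenarios are solved per iteration; the structure of the computation is the same.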
Algorithm 1: Stochastic scheduling of workflows

  Step 1: Choose sample sizes N and N', iteration count M, tolerance ε, and a rule to terminate the iterations
  Step 2: Check whether termination is required
  for m = 1, ..., M do
      Step 3.1: Generate N samples and solve the SAA problem; let the optimal objective be v_N^m for the corresponding iteration
  end for
  Step 3.2: Compute a lower bound estimate (eq. 12) on the objective, and its variance (eq. 13)
  Step 3.3: Generate N' samples, use one of the feasible stage-1 solutions, and solve the SAA problem to compute an upper bound estimate (eq. 14) on the objective, and its variance (eq. 15)
  Step 3.4: Estimate the optimality gap (\hat{U} - \bar{L}) and the variance of the gap estimator (s_U^2 + s_L^2)
  Step 3.5: If the gap and its variance are small, choose the stage-1 solution; stop
  Step 3.6: If the gap and its variance are large, tighten the stage-1 QoS bounds, increase N and/or N', and go to Step 2

V. EXPERIMENTAL EVALUATION

In this section we present experimental results for the ILP techniques described in this paper.

A. Setup

Table II summarises the experimental setup. We performed three simulations, and for each different setup of a simulation we performed 10 runs and averaged the results. Initially 500 jobs allow the system to reach steady state; the next 1000 jobs are used for calculating statistics such as mean execution time, mean cost, mean failures, mean partial executions and mean utilisation. The last 500 jobs mark the ending period of the simulation. The mean size of an abstract service is measured in millions of instructions (MI). In order to compute expected runtimes, we put no restriction on the nature of the execution time distribution and apply the Chebyshev inequality [11] to compute expected runtimes such that 95% of jobs would execute in a time under t_ij (equation (16)). It should be noted that such bounds or confidence intervals on the execution times can also be computed using other techniques, such as the Monte Carlo approach [12] and the Central Limit Theorem [11], or by performing finite integration if the underlying execution time PDFs (probability density functions) are available in analytical form. The waiting time is also computed in such a way that in 95% of cases the waiting time encountered will be less than the computed one. The value k appearing in the equations below comes from applying the Chebyshev inequality so as to cover 95% of the execution or waiting time distribution. In equation (16), μ_ij and σ_ij are the mean and standard deviation of the execution time distribution of a running software service, and t_ij is a simple product function of x_ij.

  t_{ij} = μ_{ij} + k σ_{ij} + waiting time    (16)

t_ij for the stage-2 programs (equation (17)) is calculated in a slightly different fashion:

  t_{ij} = ξ_e(μ_{ij}, σ_{ij}) + ξ_w(μ_{ij}, σ_{ij})    (17)

Here ξ_e is an execution time distribution sample of an abstract service on a Grid service, and ξ_w is the waiting time distribution sample associated with s_ij. We used the Monte Carlo technique [12] for sampling values out of the distributions; other sampling techniques, such as Latin Hypercube sampling, could be used instead. We provide an example of calculating initial deadlines, given by equation (18), for the first abstract service (generate matrix) of workflow type 1. The deadline calculation of an abstract service takes account of all possible paths in a workflow, and scaling is performed with reference to the longest execution path; equation (18) is scaled with reference to the remaining workflow time, time_rem. It should be noted that 'initially' means before performing the iterations of Algorithm 1; subsequent deadlines of abstract services in a workflow are initially calculated by scaling with reference to the remaining workflow deadline.

  deadline_i = (T_i / T_path) × time_rem    (18)

  T_i = μ_max (1 + k CV_max)    (19)

  T_path = \sum_{p \in path} μ_p (1 + k CV_p)    (20)

Initial deadline calculation is done in order to reach an optimal solution faster; we are currently investigating cut techniques which can help reach optimal solutions faster still. Here μ_max and CV_max are the mean and coefficient of variation of the Grid service that has the maximum expected runtime. If the optimality gap and its variance are large, bounds are tightened in such a way that they become smaller in the next iteration: e.g. the minimum coefficient for time (t_ij) could be set as the deadline, or the recourse term variable values (z) in the stage-2 programs could be used to tighten the deadline.

The workflows experimented with are shown in Figure 1. They are simulation counterparts of real-world workflows: their actual execution is a delay drawn from their execution time distribution, as specified in Table II. In the first simulation type 1 workflows are used, in the second type 2 workflows, and in the third the workload is made heterogeneous (HW). The type 1 workflow is quite simple compared to type 2, which is a real scientific workflow. All the workflows have different QoS requirements, as specified in Table II. The ILP solver used is CPLEX by ILOG [13], one of the best industrial-quality optimisation packages, and the simulation is developed on top of simjava 2 [14], a discrete event simulation package. The Grid size is kept small in order to obtain asymptotic behaviour.
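The Chebyshev-based runtime bound used in equation (16) can be sketched as follows. The helper name and example figures are illustrative, and since the exact constant is not legible in this copy, the standard two-sided Chebyshev factor k = sqrt(1 / (1 − coverage)) ≈ 4.47 for 95% coverage is assumed.

```python
import math

def chebyshev_bound(mean, std, coverage=0.95):
    """Distribution-free runtime bound: by the two-sided Chebyshev
    inequality P(|X - mean| >= k*std) <= 1/k**2, at least `coverage`
    of any distribution's mass lies below mean + k*std."""
    k = math.sqrt(1.0 / (1.0 - coverage))
    return mean + k * std
```

For a service with mean runtime 10 s and standard deviation 2 s, the bound is roughly 18.9 s; the same construction applies to waiting times.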
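The proportional scaling idea behind the initial deadline calculation of equations (18)-(20) - splitting the workflow deadline among abstract services in proportion to their Chebyshev-inflated expected runtimes along the execution path - can be sketched as below for a simple chain of services. The function name, the chain-shaped workflow and the data are illustrative; the paper additionally handles branching workflows by scaling against the longest execution path.

```python
def initial_deadlines(services, workflow_deadline, k=4.47):
    """Assign each abstract service in a chain an initial deadline
    proportional to its inflated expected runtime mu * (1 + k * cv),
    so the per-service deadlines sum to the workflow deadline.
    `services` is a list of (mean_runtime, coefficient_of_variation)
    pairs; k is the Chebyshev factor for 95% coverage."""
    inflated = [mu * (1.0 + k * cv) for mu, cv in services]
    total = sum(inflated)  # stands in for the longest-path total
    return [workflow_deadline * t / total for t in inflated]
```

With zero variability the split is simply proportional to the mean runtimes, e.g. `initial_deadlines([(10.0, 0.0), (30.0, 0.0)], 40.0)` yields deadlines of 10 and 30 seconds.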
TABLE II
SIMULATION PARAMETERS

  Parameter                   Sim. 1    Sim. 2    Sim. 3
  Services matching           24        12-24     60
  Service speed (kMIPS)       3-14      3-14      3-14
  Unit cost (per sec)         5-29      5-29      5-29
  Arrival rate λ (per sec)    1.5-10    0.1-2.0   1.5-3.6
  Mean μ (kMI)                7.5-35    10-30     7.5-35
  CV = σ/μ                    0.2-2.0   0.2-1.4   0.2-2.0
  Workflows                   Type 1    Type 2    HW
  Deadline (sec)              40-60     80-100    40-60

Fig. 1. Workflows: the type 1 workflow (create IM lifecycle command (7), execute command (8), a check-if-command branch with yes/no outcomes, XDAQ appliant (10) and monitor data acquisition (11), with throw-IM-exception paths (12, 13)) and the heterogeneous workload (HW).

Figs. 2 and 3. Failures (%) vs arrival rate (jobs/sec) for DSLC, DDLC and SDLC (Simulation 1).

Fig. 4. Average utilisation vs λ, CV = 0.2 (Simulation 1).

Fig. 5. Average utilisation vs λ, CV = 1.8 (Simulation 1).

B. Results

The workflows do not have any slack period, meaning they are scheduled without delay as soon as they are submitted. DDLC (dynamic, deterministic, least cost satisfying deadlines) and DSLC (dynamic, stochastic, least cost satisfying deadlines) schedule iteratively; in DDLC the deadlines do not change once they are calculated, while in DSLC the deadlines change iteratively owing to the iterative nature of Algorithm 1. In the case of SDLC, once the workflows are submitted, an ILP is solved and scheduling solutions for all abstract services within the workflows are obtained; these solutions then do not change during the entire lifetime of the workflows. The main comparison metrics here are mean cost, mean time, failures and mean utilisation as we increase λ and CV. However, we will keep our discussion limited to failures, as a workflow failure means a failure to satisfy the QoS requirements of the workflow.

C. Effect of arrival rate and workload

We see from figures 2 and 3 that, as λ increases, DSLC continues to outperform the other schemes. The trend continues, but with a reduced advantage. This can be explained as follows.
This can be explained as follows: the response time of services is an increasing function of the arrival rate; hence failures increase. Moreover, this behaviour not […] of SDLC. Referring to figures 4 and 5, it is apparent that when CV is low, utilisation in the case of DSLC and DDLC turns out to be the same. However, SDLC also registers reasonable […] high unpredictability, DDLC and SDLC register moderate utilisations. In the case of workflow type 2, for low and high […] than DSLC. This is because even if DSLC iteratively tightens deadlines, it does not help to get a better schedule due to the highly predictable environment, and as a result failures increase slightly as it tries to schedule workflows which would have […] and high CVs, DSLC performs significantly better than the other schemes. Referring to figures 10 and 11, we see that as CV is increased for low arrival rates, utilisation drops, which indicates that failures increase, which in turn indicates that the environment becomes more and more unpredictable. With high workloads, as CV is increased, utilisation drops, but this time DSLC registers the highest utilisation. SDLC and DDLC register lower utilisation as they fail to cope with the increasing uncertainty. However, for low CV, they all start off from about the same utilisation mark. When the workload is made heterogeneous, for both low and high CVs, DSLC outperforms the other schemes. For high CV, the environment becomes highly uncertain and hence SDLC registers spiky behaviour in utilisation; this is in agreement with its static nature of job assignment.

[Fig. 6. Failures vs λ, CV = 0.2 (Simulation 2).]
[Fig. 8. Avg Utilisation vs λ, CV = 0.2 (Simulation 2).]
[Fig. 9. Avg Utilisation vs λ, CV = 1.4 (Simulation 2).]
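The observation that failures grow with the arrival rate follows from response times being an increasing function of load. This can be illustrated with the simplest queueing model; the paper's Grid services are not necessarily M/M/1, so the sketch below is purely illustrative:

```python
def mm1_response_time(arrival_rate, service_rate):
    """Mean response time of an M/M/1 queue, W = 1 / (mu - lambda).

    Illustrates, for the simplest queue, that response time is an
    increasing function of the arrival rate: as lambda approaches mu,
    W grows without bound, so deadline misses (failures) rise.
    """
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: arrival rate >= service rate")
    return 1.0 / (service_rate - arrival_rate)

# Service rate mu = 10 jobs/sec; response time grows with lambda.
for lam in (2.0, 4.0, 6.0, 8.0):
    print(lam, round(mm1_response_time(lam, 10.0), 3))
```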
[Fig. 10. Avg Utilisation vs CV, λ = 0.1 (Simulation 2).]
[Fig. 11. Avg Utilisation vs CV, λ = 2.0 (Simulation 2).]
[Fig. 12. Failures vs λ, CV = 0.2 (Simulation 3).]
[Fig. 14. Avg Utilisation vs λ, CV = 0.2 (Simulation 3).]
[Fig. 15. Avg Utilisation vs λ, CV = 1.8 (Simulation 3).]

VI. CONCLUSION AND FUTURE WORK

We have developed a 2-stage stochastic programming approach to workflow scheduling using an ILP formulation of QoS constraints, workflow structure, performance models of Grid services and the state of the Grid. The approach gives a considerable improvement over other traditional schemes. This is because the SAA approach obtains ε-optimal solutions […]
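The SAA (Sample Average Approximation) idea behind this result can be sketched in a few lines: replace the expectation in the stochastic program by an average over sampled scenarios, then optimise the resulting deterministic problem. The toy capacity-reservation problem below is ours; the paper solves a much larger ILP with CPLEX:

```python
import random

def saa_minimise(candidates, cost, n_samples=1000, seed=0):
    """Sample Average Approximation sketch: estimate E[cost(x, xi)] for
    each candidate decision x by averaging over sampled scenarios xi,
    then pick the candidate with the smallest sample-average cost."""
    rng = random.Random(seed)
    # Scenario model is illustrative: xi ~ Normal(1.0, 0.3).
    scenarios = [rng.gauss(1.0, 0.3) for _ in range(n_samples)]

    def avg_cost(x):
        return sum(cost(x, xi) for xi in scenarios) / n_samples

    return min(candidates, key=avg_cost)

# Toy decision: reserve capacity x against random demand xi, paying for
# the capacity plus a 5x penalty per unit of unmet demand.
best = saa_minimise(
    candidates=[0.5, 1.0, 1.5, 2.0],
    cost=lambda x, xi: x + 5.0 * max(0.0, xi - x),
)
print(best)
```

Using the same sampled scenarios for every candidate (common random numbers) keeps the comparison between decisions low-variance, which is part of why SAA solutions converge to ε-optimal ones as the sample size grows.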
Abstract
The integration of computational grids and data grids into a common problem solving environment
enables collaboration between members of the GENIEfy project. In addition, state-of-the-art
optimisation algorithms complement the component framework to provide a comprehensive toolset
for Earth system modelling. In this paper we present, for the first time, the application of the non-dominated sorting genetic algorithm (NSGA-II) to perform a multiobjective tuning of a 3D
atmosphere model. We then demonstrate how scripted database workflows enable the collective
pooling of available resource at distributed client sites to collaboratively progress ensembles of
simulations through to completion on the computational grid. A systematic study of the oceanic
thermohaline circulation in a hierarchy of 3D atmosphere-ocean-sea-ice models is presented
providing evidence for bi-stability in the Atlantic Meridional Overturning Circulation.
Southern Ocean. The north flowing surface current and the south bound deep current form a major part of the global ocean “Conveyor Belt” and are often referred to collectively as the Atlantic Meridional Overturning Circulation (MOC). The vast currents that circulate the globe because of the MOC are responsible for transport of heat energy and salinity and play a major role in determining the Earth’s climate.

The strength of the Atlantic MOC is sensitive to the global hydrological cycle. In particular, variation in the atmospheric fresh water transport between the Atlantic and Pacific basins could lead to systematic changes in the THC. Since moisture transports by the atmosphere increase significantly under global warming scenarios, a key concern for climate change predictions is an understanding of how the THC may be affected under increased greenhouse gas forcing. It is possible that the system will react in a non-linear fashion to increased fresh water transports and the Atlantic MOC could collapse from its current “on” state into an “off” state, where no warm conveyor belt exists, and a colder, drier (more continental) winter climate may be expected in northern Europe. Studies using box models [3] and models of intermediate complexity (EMICs) such as GENIE-1 (C-GOLDSTEIN) [2], have found bi-stability in the properties of the THC; the “on” and “off” states can exist under virtually the same fresh water flux properties of the atmosphere depending on the initial conditions of the model (starting from an “on” or “off” state). However, the most comprehensive type of climate model, the coupled Atmosphere-Ocean General Circulation Models (AOGCMs), have yet to find any conclusive evidence for this bi-stability. In this paper, we extend the work performed with the GENIE-1 model (comprising a 3D ocean, 2D atmosphere and sea-ice) and present the first study of THC in the GENIE-2 model, a fully 3D ocean-atmosphere-sea-ice EMIC model from the GENIE framework. We thus take a step up the ladder of model complexity to study the behaviour of the same ocean model but now under a 3D dynamical atmosphere in place of the simple 2D energy moisture balance code (EMBM) of GENIE-1.

This study extends the work of Marsh [2] in two key areas. First, the new atmosphere model must be tuned to provide a reasonable climatology when coupled to the 3D ocean. We present, for the first time, the application of a multiobjective tuning algorithm to this problem. The second issue relates to computational complexity. Due to the need for shorter timesteps to handle atmospheric dynamics and the addition of a third (vertical) dimension in the atmosphere, the GENIE-2 model requires approximately two orders of magnitude more CPU time than GENIE-1 to simulate an equivalent time period. In Marsh [2] each GENIE-1 simulation required only a few hours of CPU time and the entire study was performed in about 3 days on a 200 node flocked Condor pool. However, for GENIE-2 models, individual simulations require ~5-10 days of continuous run time and a cycle stealing approach is no longer appropriate. Indeed, simply securing a sufficient number of CPUs over weekly timescales would be a significant challenge. In the remainder of this paper we show how the Grid computing software described in [4] is essential for the break down of large ensembles of lengthy simulations into manageable compute tasks. The GENIE data management system is used to mediate the execution of large ensemble studies enabling members of the project to pool their resource to perform these sub-tasks. Through a collaborative effort the more expensive study of GENIE-2 is enabled.

3 Collaborative PSE

GENIE has adopted software developed by the Geodise project [5] to build a distributed collaborative problem solving environment [4] for the study of new Earth system models. The Geodise Toolboxes integrate compute and data Grid functionality into the Matlab and Jython environments familiar to scientists and engineers. In particular, we have built upon the Geodise Database Toolbox to provide a shared Grid-enabled data management system, a central repository for output from GENIE Earth system models. An interface to the OPTIONS design search and optimisation package [6] is also exploited to provide the project with access to state-of-the-art optimisation methods which are used in model tuning.

3.1 Data Management System

The Geodise data model allows users to create descriptive metadata in Matlab / Jython data structures and associate that metadata with any file, variable or datagroup (logical aggregation of data) archived to the repository. The Geodise XML Toolbox is used to convert the metadata to a valid XML document which is then ingested and made accessible in the database. The GENIE database system augments this model by defining XML schemas that constrain the permissible metadata and improve
subsequent query performance. The interface to the database is exposed through web services and all access is secured and authenticated using X.509 certificates. Through a common client members of the project can access the database both programmatically and with a GUI interface.

Recent work on the Geodise database toolbox has further enhanced its functionality by supporting aggregated queries (e.g. max, count), datagroup sharing (i.e. users working on the same task can add data into a shared data group), and improved efficiency of data access control. These enhancements enable the scientist to compose more powerful and flexible scripted workflows which have the ability to discover, based on the content of the database, the status of an experiment and make further contributions to the study.

3.2 Task Farming

We exploit the GENIE database and the restart capabilities of the models to break down ensembles of long simulations into manageable compute tasks. By staging the restart files in the shared repository all members of the project can access the ensemble study, query against the metadata in the database to find new work units, submit that work to compute resources available to them and upload new results as they become available. Using the common client software project members can target local institutional clusters, Condor pools or remote resources available to them on the grid (Globus 2.4).

The task farming paradigm is ideally suited to Grid computing where multiple heterogeneous resources can be harnessed for the concurrent execution of work units. Examples of such systems are Nimrod/G [7] and GridSolve [8] which provide task farming capabilities in a grid environment. Nimrod/G allows users to specify a parametric study using a bespoke description language and upload the study to a central agent. The agent mediates the execution of the work on resources that are available to it. The GridSolve system works in a similar fashion, exploiting an agent to maintain details about available servers and then selecting resource on behalf of the user. In contrast we exploit the GENIE database system, providing scriptable logic on the client side to coordinate the task farming based on point-in-time information obtained from the central database. This allows more dynamic use of resource as the client is more easily installed within institutional boundaries. The study is mediated in persistent storage, as with Nimrod/G and GridSolve, allowing progress to be monitored and output data to be shared.

3.3 Collaborative study

The programmatic interface to the shared data repository allows the database to become an active part of an experiment workflow.

To define an ensemble in the database the coordinator of the study creates an experiment “datagroup”, a logical entity in the database describing the purpose of the ensemble and acting as parent to a set of simulation definitions. The simulation entities record the details for the specific implementation of the model including the total number of timesteps to be performed and the output data to generate. This data structure within the database captures all of the information that is required for the study and its creation amounts to ~100 lines of configurable script for the scientist to edit.

Once a study has been defined the coordinator typically circulates the unique identifier to members of the project. If a user wishes to contribute compute resource for the progression of the experiment then this is the only piece of information that they require. They simply contribute to the study by invoking a “worker” script, specifying the experiment identifier and providing details of the amount of work they would like to submit to a particular resource. A user session is shown in Figure 1.

[Figure 1: Typical user session contributing to an ensemble study.]

The first action is to create a time-limited proxy certificate (to authenticate all actions on the Grid). The user then retrieves a resource definition from the database and invokes the autonomous “worker” script. The “worker”
allows a user to specify the number of timesteps to progress the members of the ensemble by and the number of compute jobs to submit to the specified resource.

The user specifies a resource by creating a data structure in Matlab capturing the information needed by the system to exploit that resource. We provide a utility function that prompts the user for the necessary information and then uploads an appropriate data structure to the database. Once the resource definition is available in the database it is more common for the user to retrieve the definition and use it to submit work to the resource it describes. The worker script progresses through four stages of execution (see Figure 2):

[Figure 2: Scripted workflow of the autonomous "worker" interfacing to a resource managed by the Globus Toolkit (v2.4).]

Stage 1: The worker interrogates the database for details of any changes to the scripts that define the experiment. Any new or updated scripts that have been added to the experiment are downloaded and installed in the user’s local workspace. This provides the experiment coordinator the means to dynamically update the study and ensures that all participants are running the same study.

Stage 2: The worker invokes a post-processing script to ensure that any completed work units are archived to the database before new work units are requested. A list of active job handles is retrieved from the database and each handle is polled for its status. Any completed compute tasks are processed and the output from each is uploaded to the database. The files from which the model may be restarted are tagged as such. In the event of a model failure the post-processing will attempt to record the problem to allow future invocations of the worker to perform corrective action.

Stage 3: The worker invokes a selection script that queries the database for available compute tasks. Based on a point-in-time assessment of the progress of the study the system returns a list of work units that are available for progression. The list of jobs is prioritised so that the entire ensemble is progressed as evenly as possible; all inactive work units are returned first with those having least achieved timestep given highest priority.

Stage 4: The final action of the worker script is to submit the specified number of compute tasks to the resource that the user provided. Working through the list of compute tasks obtained in stage 3 the script retrieves the appropriate restart files from the database and transfers them, along with the model binary, to a unique directory on the remote compute resource. Once the model run is set up the script submits the job to the scheduler of the resource. The job handle returned by the scheduler is uploaded to the database. Subsequent invocations of a worker script on this experiment are then informed about all active compute tasks and can act accordingly.

It would be unreasonable to require a user to manually invoke the worker script on a regular basis and the most common mode of operation is to set up a scheduled task (Windows) or cron job (Linux) to automatically initiate the Matlab session. The user typically provides a script to run one or more workers that submit work to the resources available to the client. Through the regular invocation of the worker script the entire ensemble is progressed to completion without any additional input by the user. The progress of the study can be monitored at any point through a function call that queries the database for summary information about the experiment.

4 Results

We present the results of an end-to-end study of a small suite of GENIE-2 models. The first stage of the study performs a tuning of the IGCM atmosphere to provide a stable coupling to the GOLDSTEIN ocean component. Using an optimal parameter set we then define a number of ensembles in the database system for the IGCM coupled to the GOLDSTEIN model running on three computational meshes with varying horizontal resolution. Project members collaborate to progress the simulations through to completion on resource available to them both locally and nationally.

4.1 Multiobjective Tuning

An important aspect of Earth system model development is the tuning of free parameters so
that the simulated system produces a reasonable climatology. In Price [4] the OPTIONS design search and optimisation package [6] was used to apply a Genetic Algorithm to vary 30 free parameters in the IGCM and minimise a single objective measure of the mismatch between annually averaged model fields and observational data. This measure of model-data mismatch was reduced by ~36% and the resulting parameters are now the preferred set for further study using the IGCM. However, while this tuning produced a good improvement in the sea-surface temperature (SST), surface energy fluxes and horizontal wind stresses, there was little or no improvement in the precipitation and evaporation fields which comprised part of the tuning target. Analysis of the model state at this point (GAtuned) in parameter space also showed that seasonal properties of the model were not a good match for seasonally averaged observational data.

Recent studies have therefore adopted a multiobjective tuning strategy in order to provide targeted improvements in seasonally averaged fields of the model across physically significant groupings of state variables. In general, multiobjective optimisation methods seek to find a set of solutions in the design space that are superior to other points when all objectives are considered but may be inferior in a subset of those objectives. Such points lie on the Pareto front and are those for which no other point improves all objectives [9]. We have exploited an implementation of the non-dominated sorting genetic algorithm (NSGA-II) [10]. As described in [4] the GENIE model is exposed as a function that accepts as input the tuneable parameters and returns, after simulation, the multiple objective measures of model mismatch to observational data. The NSGA-II algorithm maintains a population of solutions like a GA but uses a different selection operator to create each successive generation. The algorithm ranks each member of the parent population according to its non-domination level and selects the members used to create the next population from the Pareto-optimal solutions that have maximum separation in objective space. The method seeks to reduce all objectives while maintaining a diverse set of solutions. The result of the optimisation is a set of points on the Pareto front from which a user can then apply domain knowledge to select the best candidates.

The IGCM atmosphere component was tuned using the NSGA-II algorithm applied over 50 generations using a population size of 100. 32 free parameters in the genie_ig_sl_sl model (IGCM atmosphere, slab ocean, slab sea-ice) were varied and 2 constraints were applied. Three objective functions were defined to provide improvements in the seasonal averages of the surface energy fluxes (OBJ1), the precipitation and evaporation properties (OBJ2) and the surface wind stresses (OBJ3). The model runs were performed on a ~1400 node Condor pool at the University of Southampton with each generation taking approximately three hours to complete. The algorithm generated a Pareto optimal set of solutions comprising 117 members. The results of the tuning exercise are summarised in Table 1, comparing the ‘best’ result (arbitrarily selected as the point with minimum sum of the three objectives) of the Pareto study with the objectives evaluated at the default and GAtuned points in parameter space.

We first point out that the precipitation and evaporation in the model (OBJ2) saw little improvement in the original GAtuned study over the default parameters even though these fields were part of that tuning study. The GApareto result has produced significant improvements in all three objective functions as desired and provided a better match to the observational data. The scale of these improvements is also greater than the original GAtuned study, but we note that this is primarily due to the seasonal rather than annual averages that have been used to comprise the tuning target. That is to say that we are comparing the seasonal measures of fitness at a point that was obtained when optimising annual averages – we should expect a greater improvement in the seasonal study. The improvements in the precipitation and evaporation fields are not as great as the improvements made in the other two objectives and we note that there are probably fundamental limits to the model’s ability to simulate precipitation and evaporation that tuning alone cannot overcome.

Name   Default   GAtuned   GApareto
OBJ1   4.47      3.68      3.23
OBJ2   3.87      3.84      3.41
OBJ3   3.11      2.32      2.08

Table 1: Values of the three objective functions evaluated at the default, GAtuned and GApareto points in parameter space.

4.2 Study of THC bi-stability

We have performed a number of ensemble studies of the GENIE-2 ig-go-sl model consisting of the 3D IGCM atmosphere code coupled to the 3D GOLDSTEIN ocean model
with a slab sea-ice component. This model typically requires approximately 5 days of continuous run time to perform a 1000 year simulation of the Earth system, which is long enough for the global THC to approach equilibrium. In the Atlantic Ocean, the THC is characterised by a “meridional overturning”, which is diagnosed in the model as a streamfunction in the vertical-meridional plane. The units of this meridional overturning streamfunction are 10^6 m^3 s^-1, or Sverdrups (Sv). The sign convention is positive for a clockwise circulation viewed from the east, corresponding to surface northward flow, northern sinking and southward deep flow. This circulation pattern corresponds to the Conveyor Belt mode of the THC, the strength of which is given by the maximum of the overturning streamfunction. In order to obtain a realistic Conveyor Belt in GENIE-2 (as in other similar models), it is necessary to include a degree of surface freshwater flux correction to ensure a sufficient Atlantic-Pacific surface salinity contrast. We have studied the Atlantic maximum overturning rate under a range of this freshwater flux correction in a three-level hierarchy of GENIE-2, across which we vary the resolution of the horizontal mesh in the ocean component.

Twelve ensemble studies were executed by three members of the project using database client installations at their local institutions. The unique identifier for each study was circulated amongst members of the project willing to contribute resource and regular invocations of the client “worker” script were scheduled at each site to submit model runs to a number of computational resources around the UK. These resources included five nodes of the UK National Grid Service (http://www.ngs.ac.uk), three Beowulf clusters available at the Universities of East Anglia and Southampton and a large Condor pool (>1,400 nodes) at the University of Southampton.

The ensemble studies of the GENIE-2 model presented here were carried out during a 14 week period from December 2005 until March 2006. In total five client installations were used to perform:

• 319 GENIE-2 simulations
• 3,736 compute tasks
• 46,992 CPU hrs (some timings estimated)
• 362,000 GENIE-2 model years

The daily breakdown of resource usage is plotted in Figure 3. The rate of progression of the studies was largely determined by the amount of available work in the system at any given time. The overheads in performing queries on the database and the upload and retrieval of model data files did not adversely impact the performance of the system.

[Figure 3: Resource usage for the twelve ensemble studies.]

The breakdown of client contributions is presented in Figure 4a. The distribution of submitted jobs roughly reflects the amount of resource available to the client in each case. E.g. the client responsible for ~50% of the work was the only submission node on the large condor pool and was also used to submit jobs to the National Grid Service. The distribution of jobs across the computational resources is presented in Figure 4b and illustrates that work was distributed evenly to the available resources. By mediating the study through a shared central repository the coordinating scientist has had the work performed over a collected pool of resource, much of which is behind institutional firewalls and probably not directly accessible to him/her. The system also enables us to introduce new resource as a study is performed and target the most appropriate platforms.

[Figure 4: Distributions of client submissions and resource usage.]

The data from these runs is being processed and a more detailed analysis of the results of these studies will be the subject of a separate paper [11]. We briefly present the initial findings of this experiment.

The effect of varying a fresh water flux correction parameter in the atmosphere component (IGCM) of a suite of GENIE-2 models with different ocean grid resolutions has been studied. Six of the ensembles are plotted in
Figure 5 showing the strength of the maximum 5 Discussion and Future Work
Atlantic overturning circulation for ensemble
members across three meshes (36x36 equal Programmatic access to a shared central
area, 72x72 equal area, 64x32 lat-lon) initialised database provides the means to mediate the
from states with default present day circulation collaborative study of Earth System models in
(r1) and collapsed circulation (r0). Maximum the GENIE project. In contrast to other grid-
overturning rates are obtained as averages over enabled task farming systems, such as
the last 100 years of the 1000-year simulations. Nimrod/G and GridSolve, we exploit client side
scripted database workflows in place of a
central agent to mediate the execution of our
studies. Our system allows resource to be
dynamically introduced to studies, but we
achieve this at the expense of execution control.
Systems with a central agent can build
execution plans, target jobs at the most
appropriate platform and provide guarantees for
completion times of experiments. However,
administrators of the agent system must ensure
that this central server can communicate with all
resource to be targeted. If available resource has
not been exposed to the grid then there is no
Figure 5: Maximum Atlantic overturning way of using it. By providing a rich client
circulation (MOC) across six ensembles, for within the institutional boundary we maximise
three different horizontal meshes of the ocean our ability to exploit resource. However, our
component. system is passive and relies upon the users to
contribute their resource and establish regular
Bi-stability in the Atlantic MOC is clearly invocations of their client system. While this
evident in all three model variants. Under the same atmospheric conditions there is a range of fresh water corrections for which the ocean THC is stable in both the "on" and "off" states, depending on initial conditions. The locations and widths of the hysteresis loops as a function of ocean mesh resolution are also an important feature of these results. The increase of grid resolution from 36x36 to 72x72 (effectively multiplying the number of grid cells by 8 by volume) shows a slight narrowing of the hysteresis loop and steeper transitions. The use of an ocean grid at the same resolution as the atmosphere (64x32) exhibits a much wider loop and is positioned at a different point in the parameter space. The differences in these results are likely attributable to two factors: a) the interpolation employed in coupling the ocean and atmosphere models, and b) the different resolutions of the ocean grids at the mid-latitudes, which mean important processes are better resolved at 64x32. In the 64x32 model, there are more grid points in the east-west direction, giving better resolution of zonal pressure gradients in the ocean, better THC structure, and improved northward transport of high-salinity water by the THC itself, reducing the need for surface freshwater flux correction. These results tentatively support the notion that more complex models can exhibit bi-stability in the states of the THC.

…provides a scalable and robust solution, it cannot provide any guarantees about completion time. A publish/subscribe approach would overcome this shortcoming but would require the development of an active message queue in the database system. We will investigate this possibility. The overheads in moving data through a client are avoided in the task farming systems, but the dynamic allocation of compute tasks through multiple clients offsets this issue.

The definition and upload of a model study involves some simple edits to a template script to specify the unique experiment definition in the database. Once created, the experiment coordinator circulates the unique identifier for the study, and members of the project can then choose whether to contribute their resource. The effort involved in contributing to a study is minimal because the "worker" script functions autonomously. A user typically schedules a cron job or Windows scheduled task to execute several times a day and then has no further interaction with the system. The scheduled task initiates the client and invokes the "worker" using the experiment identifier. As long as a valid proxy certificate exists, the system attempts to contribute work until the study is complete. A project member may stop their scheduled task at any time and withdraw their resource, introduce new resource, and drop in and out of the experiment as they see fit.
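The autonomous worker cycle described above (cron-triggered, proxy-gated, polling until the study completes) can be sketched as follows. This is an illustrative sketch only: the `fetch_task`, `run_task` and `proxy_valid` interfaces are hypothetical stand-ins for the real client, shared database and proxy-certificate check.

```python
def run_worker(experiment_id, fetch_task, run_task, proxy_valid):
    """One invocation of the autonomous "worker" (illustrative sketch).

    fetch_task(experiment_id) returns the next pending task for the study,
    or None once the study is complete; run_task executes it and uploads
    the results; proxy_valid() checks for a valid proxy certificate.
    All three are hypothetical stand-ins for the real client interfaces.
    """
    completed = 0
    while proxy_valid():
        task = fetch_task(experiment_id)
        if task is None:  # study complete: nothing left to contribute
            break
        run_task(task)
        completed += 1
    return completed

# Stub usage: a cron job or Windows scheduled task would call this
# several times a day with the circulated experiment identifier.
pending = ["run-001", "run-002", "run-003"]
done = run_worker("exp-42",
                  fetch_task=lambda eid: pending.pop(0) if pending else None,
                  run_task=lambda task: None,
                  proxy_valid=lambda: True)
print(done)  # 3
```

Stopping the scheduled task simply stops further invocations; the shared database is untouched, which is what lets members drop in and out of a study freely.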
Proceedings of the UK e-Science All Hands Meeting 2006, © NeSC 2006, ISBN 0-9553988-0-0
Abstract

GridPP is a £33m, 6-year project funded by PPARC that aims to establish a Grid for UK Particle Physics in time for the turn-on of the CERN Large Hadron Collider (LHC) in 2007. Over the last three years, a prototype Grid has been developed and put into production, with computational resources that have increased by a factor of 100. GridPP is now about halfway through its second phase; the move from prototype to production is well underway, though many challenges remain.
not all parts of the problem need the same services or quality of service, and that substantial benefits in cost and scale can also be gained by embracing an architecture where institutes, regions, or even countries, can plug-and-play. This, then, is the optimisation afforded by the Grid approach.

2. Overview of GridPP

Since September 2001, GridPP has striven to develop and deploy a highly functional Grid across the UK as part of the LHC Computing Grid (LCG) [1]. Working with the European EDG and latterly EGEE projects [2], GridPP helped develop middleware adopted by LCG. This, together with contributions from the US-based Globus [3] and Condor [4] projects, has formed the LCG releases which have been deployed throughout the UK on a Grid consisting presently of more than 4000 CPUs and 0.65 PB of storage. The UK HEP Grid is anchored by the Tier-1 [5] centre at the Rutherford Appleton Laboratory (RAL) and four distributed Tier-2 [6] centres known as ScotGrid, NorthGrid, SouthGrid and the London Tier-2. There are 16 UK sites which form an integral part of the joint LHC/EGEE computing Grid, with 40,000 CPUs and access to 10 PB of storage, stretching from the Far East to North America.

…information. At the basic level, the Site Functional Test (SFT), a small test job that runs many times a day at each site, determines the availability of the main Grid functions. Similarly, the Grid Status Monitor (GStat) retrieves information published by each site about its status. Figure-3 shows the average CPU availability by region for April 2006, derived from sites passing or failing the SFT. Although this particular data set is not complete (and an improved metric is being released in July), it can be seen that within Europe the UKI region (second entry from the left) made a significant contribution, with 90% of the total of just over 4000 CPUs being available on average.

Figure-3: Average CPU availability for April 2006. CPUs at a site are deemed available (Green) when the site passes the SFT or unavailable (Red) if the site fails.
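As a rough illustration of how such an availability figure can be derived from pass/fail samples (this is a sketch, not the actual SFT/GStat metric), consider a CPU-weighted average:

```python
def average_cpu_availability(site_samples):
    """CPU-weighted availability over a set of SFT results (sketch only).

    site_samples is a list of (cpus, passed) pairs, one per site test;
    a site's CPUs count as available only when the site passed the SFT.
    """
    total = sum(cpus for cpus, _ in site_samples)
    passed = sum(cpus for cpus, ok in site_samples if ok)
    return passed / total if total else 0.0

# A region with 4000 CPUs, of which 90% sit at sites passing the SFT:
print(average_cpu_availability([(3600, True), (400, False)]))  # 0.9
```

Weighting by CPU count rather than counting sites matters because site sizes vary by orders of magnitude across the Grid.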
Figure-4: Tier-1 CPU use for 2005 (% occupancy by month, raw and scaled).

Figure-7: Job Efficiency (CPU-time/Wall-time) for 2006, by Virtual Organisation.
The backbone of the campus grid is built using standard middleware and tools, but the value-added services have been provided by in-house software, including resource brokering, user management and accounting. Since the system first started in November 2005 we have attracted six significant users from five different departments. These use a mix of bespoke and licensed software and had run ~6300 jobs by the end of July 2006. Currently all users have access to ~1000 CPUs, including NGS and OSC resources. With approximately two new users a week approaching the e-Research centre, the current limitation on the rate of uptake is the amount of time that must be spent with each user to make their interactions as successful as possible.
The data system must be able to take the following factors into account:
• Steep increase in the volume of data as studies progress, including the new class of research that is generated by computational grid work.
• Metadata to describe the properties of the data stored, the volume of which will be directly proportional to its quality.
• As the class of data stored changes from final post-analysis research data towards the raw data on which many different studies can be done, the need for replication and guaranteed quality of storage will increase.
Therefore a system is needed which can allow a range of physical storage media to be added to a common core to present a uniform user interface. The actual storage must also be location independent, using as many different physical locations as possible.

3. The OxGrid System

The solution as finally designed is one where individual users interact with all connected resources through a central management system, as shown in Figure 1. This has been configured such that as the usage of the system increases each component can be upgraded, allowing organic growth of the available resources.

Figure 1: Schematic of the OxGrid system

3.1 Authorisation and Authentication

The projected users of the system can be split into two distinct groups: those that want to use only resources that exist within the university, and those that want access to external systems such as the National Grid Service. For users requiring external system access, we are required to use standard UK e-Science Digital Certificates. This is a restriction brought about by those system owners. For users that only access university resources, a Kerberos Certificate Authority [3] system connected to the central university authentication system has been used. This has been taken up by central computing services and will therefore become a university-wide service once there is a large enough user base.

3.2 Connected Systems

Each resource that is connected to the system has to have a minimum software stack installed. The middleware installed is the Virtual Data Toolkit [4]. This includes the Globus 2.4 middleware [5] with various patches that have been applied through the developments of the European DataGrid project [6].

3.3 Central Services

The central services of the campus grid may be broken down as follows:
• Information Server
• Resource Broker
• Virtual Organisation Management
• Data Vault
These will each be described separately.

3.3.1 Information Server

The Information Server forms the basis of the campus grid, with all information about resources registered to it. It uses the Globus MDS 2.x [7] system to provide details of the type of resource, including full system information. It also contains details of the installed scheduler and its associated queue system. Additional information is added using the GLUE schema [8].

3.3.2 Resource Broker

The Resource Broker (RB) is the core component of the system and the one with which the users of the system have the most interaction. Current designs for a resource broker are either very heavyweight, with added functionality which is unlikely to be needed by the users, or have a significant number of interdependencies which could make the system difficult to maintain.
The basic required functionality of a resource broker is actually very simple and is listed below:
• Submit tasks to all connected resources,
• Automatically decide the most appropriate resource to distribute a task to,
• Depending on the requirements of the task, submit only to those resources which fulfil them,
• Depending on the user's list of registered systems, distribute tasks appropriately.
Using the Condor-G [9] system as a basis, an additional layer was built to interface the Condor Matchmaking system with MDS. This involved interactively querying the Grid Index Information Service to retrieve the information stored in the system-wide LDAP database, as well as extracting the user registration information and installed software from the Virtual Organisation Management system. Additionally, other information, such as which system a particular user is allowed to submit tasks
is available from the VOM system. Therefore this must be queried whenever a task is submitted.

Figure 2: Resource Broker operation

3.3.2.1 The Generated Class Advertisement

The information passed into the resource class advertisement may be classified in three ways: that which would be included in a standard Condor Machine class-ad, additional items that are needed for Condor-G grid operation, and items which we have added to give extra functionality. This third set of information is described here:

    Requirements = (CurMatches < 20) && (TARGET.JobUniverse == 9)

It is important to ensure that no more than a maximum number of submitted jobs can be matched to the resource at any one time, and that the resource will only accept Globus universe jobs.

    CurMatches = 0

This is the number of currently matched jobs, as determined by the advertisement generator using MDS and Condor queue information every 5 minutes.

    OpSys = "LINUX"
    Arch = "INTEL"
    Memory = 501

Information on the type of resource, as determined from the MDS information. It is important to note that a current limitation is that this applies to the head node only, not the workers, so heterogeneous clusters cannot at present be used on the system.

    MPI = False
    INTEL_COMPILER = True
    GCC3 = True

Special capabilities of the resource need to be defined. In this case the cluster resource does not have MPI installed and so cannot accept parallel jobs. The list of installed software is also retrieved at this point from the Virtual Organisation Manager database, from when the resources were added. The generated job advertisements for each resource are input into the Condor Matchmaking [10] system once every 5 minutes.

3.3.2.2 Job Submission Script

As the standard user of the campus grid will normally only have submitted jobs to a simple scheduler, it is important that we can abstract users from the underlying grid system and resource broker. The Condor-G system uses non-standard job description mechanisms, so a simpler, more efficient method has been implemented. It was decided that most users were experienced with a command-line executable that required arguments, rather than designing a script-submission type system. It is intended, though, to alter this in version 2 so that users may also use the GGF JSDL [11] standard to describe their tasks.
The functionality of the system must be as follows:
• The user must be able to specify the name of the executable and whether this should be staged from the submit host or not,
• Any arguments that must be passed to this executable to run on the execution host,
• Any input files that are needed to run and so must be copied onto the execution host,
• Any output files that are generated and so must be copied back to the submission host,
• Any special requirements on the execution host, such as MPI, memory limits etc.,
• Optionally, so as to override the resource broker as necessary, the user should also be able to specify the gatekeeper URL of the resource he specifically wants to run on. This is useful for testing etc.
It is also important that when a job is submitted the user does not get his task allocated by the resource broker onto a system to which he doesn't have access. The job submission script accesses the VOM system to get the list of allowed systems for that user. This works by passing the DN into the VOM, retrieving a comma-separated list of systems, and reformatting it into the format accepted by the resource broker.

    job-submission-script -n 1 -e /usr/local/bin/rung03 -a test.com -i test.com -o test.log -r GAUSSIAN03 -g maxwalltime=10

This example runs a Gaussian job, which in the current configuration of the grid will, through the resource broker, only run on the Rutherford NGS node. This was a specific case developed to test capability matching.
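The mapping from such a command-line invocation onto a Condor-G submit description can be sketched as below. This is a simplified illustration, not the actual OxGrid script: the grid-universe syntax shown is the later HTCondor shorthand, and the gatekeeper default is a hypothetical placeholder (normally the resource broker selects it; the user can override it for testing).

```python
def build_condor_g_submit(executable, arguments="", inputs=(), outputs=(),
                          requirements="",
                          gatekeeper="ngs.example.ac.uk/jobmanager-pbs"):
    """Build a minimal Condor-G submit description (illustrative sketch).

    The gatekeeper default is a hypothetical placeholder standing in for
    the resource broker's choice of execution site.
    """
    lines = ["universe = grid",
             f"grid_resource = gt2 {gatekeeper}",
             f"executable = {executable}"]
    if arguments:
        lines.append(f"arguments = {arguments}")
    if inputs:
        lines.append("transfer_input_files = " + ",".join(inputs))
    if outputs:
        lines.append("transfer_output_files = " + ",".join(outputs))
    if requirements:
        lines.append(f"requirements = {requirements}")
    lines += ["should_transfer_files = YES",
              "when_to_transfer_output = ON_EXIT",
              "queue"]
    return "\n".join(lines)

# Mirrors the example invocation above:
print(build_condor_g_submit("/usr/local/bin/rung03", arguments="test.com",
                            inputs=["test.com"], outputs=["test.log"]))
```

Hiding this description behind simple `-e`/`-i`/`-o` style flags is exactly the abstraction the section argues for: the user never sees the Condor-G job language.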
3.3.3 Virtual Organisation Manager

This is another example of a new solution being designed in-house, because only over-complicated solutions are currently available. The functionality required is as follows:
• Add/Remove a system to a list of available systems,
• List available systems,
• Add/Remove users to a list of users who can access general systems,
• List users currently on the system,
• Add a user to the SRB system.
When removing users, though, it is important that their record is set as invalid rather than simply removing the entries, so that system statistics are not disrupted.
So that attached resources can retrieve the list of stored distinguished names, we have also created an LDAP database that can be queried via the standard EDG MakeGridmap scripts as distributed in VDT.

3.3.3.1 Storing the Data

The underlying functionality has been provided using a relational database. This has allowed the alteration of tables as and when additional functionality has become necessary. The decision was made to use the PostgreSQL relational database [12] due to its known performance and zero cost. It was also important at this stage to build in the extra functionality to support more than one virtual organisation. The design of the database is shown in Appendix 1.

3.3.3.2 Add, Remove and List Functions

Administration of the VOM system is through a secure web interface. This has each available management function as a separate web page which must be navigated to. The underlying mechanisms for the addition, removal and list functions are basically the same for both systems and users.
The information required for an individual user is:
• Name: The real name of the user.
• Distinguished Name (DN): This can either be a complete DN string as per a standard X.509 digital certificate or, if the Oxford Kerberos CA is used, just their Oxford username; the DN for this type of user is constructed automatically.
• Type: Within the VOM a user may be either an administrator or a user. This is used to define the level of control he has over the VOM system, i.e. whether he can alter the contents. This way the addition of new system administrators into the system is automated.
• Which registered systems the user can use and their local username on each; this can either be a pool account or a real username.
An example of the interface is shown in Figure 3.

Figure 3: The interface to add a new user to the VOM

When a user is added into the VOM system this gives them access to the main computational resources through insertion of their DN into the LDAP and relational databases. It is also necessary that each user is added into the Storage Resource Broker [13] system for storage of their large output datasets.

Figure 4: The interface for system registration into the VOM

To register a system with the VOM the following information is needed:
• Name: Fully Qualified Domain Name.
• Type: Either a Cluster or Central system; users will only be able to see Clusters.
• Administrator e-Mail: For support queries.
• Installed Scheduler: Such as PBS, LSF, SGE or Condor.
• Maximum number of submitted jobs, to give the resource broker the value for 'CurMatches'.
• Installed Software: List of the installed licensed software, to again pass on to the resource broker.
• Names of allowed users and their local usernames.
An example of the interface is shown in Figure 4.

3.4 Resource Usage Service

Within a distributed system where different components are owned by different organisations, it is becoming increasingly important that tasks run on the system are accounted for properly.

3.4.1 Information Presented

As a base set of information the following should be recorded:
• Start time
• End time
• Resource name the job ran on
• Local job ID
• Grid user identification
• Local user name
• Executable run
• Arguments passed to the executable
• Wall time
• CPU time
• Memory used
As well as these basic variables, an additional attribute has been set to account for the differing cost of a resource. This is particularly useful for systems that have special software or hardware installed, and can also form the basis of a charging model for the system as a whole:
• Local resource cost

3.4.2 Recording usage

There are two parts to the system: a client that can be installed onto each of the attached resources, and a server that records the information for presentation to system administrators and users. It was decided that it would be best to present the accounting information to the server on a task-by-task basis. This results in instantaneously correct statistics, should it be necessary to apply limits and quotas. The easiest way to achieve this is through a set of Perl library functions that attach to the standard job-managers distributed within VDT. These collate all of the information to be presented to the server and then call a 'cgi' script through a web interface on the server.
The server database is part of the VOM system described in section 3.3.3.1. An extra table has been added, with each attribute corresponding to a column and each recorded task forming a new row.

There are three different methods of displaying the information that is stored about tasks run on OxGrid. The overall number of tasks per month is shown in Figure 5.

Figure 5: Total number of run tasks per month on all connected resources owned by Oxford, i.e. NOT including remote NGS core nodes or partners.

This can then be split into total numbers per individual user and per individual connected system, as shown in Figure 6 and Figure 7.

Figure 6: Total number of submitted tasks per user
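The task-by-task reporting of section 3.4.2 can be sketched as follows. The server URL and field names here are hypothetical illustrations (the real system uses Perl hooks in the VDT job-managers calling a 'cgi' script); the sketch only shows the shape of one per-task report.

```python
from urllib.parse import urlencode

# Base attributes from section 3.4.1, plus the local resource cost.
BASE_FIELDS = ["start_time", "end_time", "resource", "local_job_id",
               "grid_user_dn", "local_user_name", "executable", "arguments",
               "wall_time", "cpu_time", "memory", "local_resource_cost"]

def usage_report_url(record, server="https://vom.example.ac.uk/cgi-bin/usage"):
    """Encode one task's accounting record as a CGI request (sketch).

    Reporting task by task keeps the server statistics instantaneously
    correct, which matters if limits or quotas are ever enforced.
    """
    missing = [f for f in BASE_FIELDS if f not in record]
    if missing:
        raise ValueError("incomplete usage record: " + ", ".join(missing))
    return server + "?" + urlencode(record)

record = dict.fromkeys(BASE_FIELDS, "")
record.update(executable="/usr/local/bin/rung03",
              wall_time="3600", cpu_time="3590", memory="501")
url = usage_report_url(record)
```

Rejecting incomplete records at the client keeps the server-side table (one column per attribute, one row per task) uniform.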
3.5 Data Vault

It was decided that the best method to create an interoperable data vault system would be to leverage work already undertaken within the UK e-Science community, and in particular by the NGS. The requirement is for a location-independent virtual filesystem. This can make use of the spare disk space that is inherent in modern systems. Initially, though, as a carrot to attract users we have added a 1 TB RAID system onto which users' data can be stored. The storage system uses the Storage Resource Broker (SRB) from SDSC. This has the added advantage of not only fulfilling the location-independence requirement but also adding metadata to annotate stored data for improved data-mining capability.
The SRB system is also able to interface not only to plain text files stored within normal attached filesystems but also to relational databases. This will allow large data users within the university to make use of the system, as well as install the interfaces necessary to attach their own databases with minimal additional work.

3.6 Attached Resources

When constructing the OxGrid system the greatest problem encountered was with the different resources that have been connected to the system. These fell into the separate classes described below.

3.6.1 Teaching Lab Condor Pools

The largest single donation of systems to OxGrid has come from the Computing Services teaching lab systems. These are used during the day as Windows systems and have a dual-boot setup with a minimal Linux 2.4.X installation. The systems are rebooted into Linux at 2100 each night and then run as part of the pool until 0700, when they are restarted back into Windows. Problems have been encountered with the system imaging and control software used by OUCS and its Linux support. This has led to a significant reduction in available capacity in this system until the problem is rectified. Since this system is also configured without a shared filesystem, several changes have been made to the standard Globus jobmanager scripts to ensure that Condor file transfer is used within the pool.
A second problem has been found recently with the discovery of a large number of systems within a department that use the Ubuntu Linux distribution, which is currently not supported by Condor. This has resulted in having to distribute by hand a set of base C libraries that solve issues with running the Condor installation.

3.6.2 Clustered Systems

Since there is significant experience within the OeRC with clustered systems, we have been asked on several occasions to assist with cluster upgrades before these resources can be added into the campus grid. This has illustrated the significant problems with the Beowulf approach to clustering, especially if very cheap hardware has been purchased. It has led on several occasions to the need to spend a significant amount of time installing operating systems, which in reality has little to do with the construction of a campus grid.

3.6.3 Sociological issues

The biggest issue when approaching resource owners is always the initial reluctance to allow anyone but themselves to use resources they own. This is a general problem with academics all over the world, and as such can only be rectified with careful consideration of their concerns and specific answers to the questions they have. This has resulted in a set of documentation that can be given to owners of clusters and Condor systems so that they can make informed choices on whether they want to donate resource or not.
The other issue that has had to be handled is communication with the staff responsible for departmental security. To counteract this we have produced a standard set of requirements for firewalls, which are generally well received, by ensuring that communication from departmental equipment is a single system for the submission of

3.7 User Tools

In this section various tools are described that have been implemented to make user interaction with the campus grid easier.

3.7.1 oxgrid_certificate_import

One of the key problems within the UK e-Science community has always been that users have complained about difficult interactions with their digital certificates, certainly when dealing with the format required by the GSI infrastructure. It was therefore decided to produce an automated script. This is used to translate the certificate as it is retrieved from the UK e-Science CA, automatically save it in the correct location for GSI, and set the permissions and the passphrase used to verify the creation of proxy identities. This has been found to reduce the number of support calls about certificates, as these can cause problems which the average new user would be unable to diagnose. To ensure that the operation has completed successfully, it also checks the creation of a proxy and prints its contents.

3.7.2 oxgrid_q and oxgrid_status

So that a user can check their individually submitted tasks, we developed a script that sits on top of the standard Condor commands for showing the job queue and registered systems. Both of these commands by default show only the jobs
the user has submitted or the systems they are allowed to run jobs on, though both commands also have global arguments to allow all jobs and registered systems to be viewed.

3.7.3 oxgrid_cleanup

When a user has submitted many jobs and discovered that they have made an error in configuration etc., it is important for a large distributed system such as this that a tool exists for the easy removal of not just the underlying submitted tasks but also the wrapper submitter script. Therefore this tool has been created to remove the mother process as well as all its submitted children.

4. Users and Statistics

Currently there are 25 registered users of the system. They range from a set of students who are using the submission capabilities to the National Grid Service, to users who only want to use the local Oxford resources. They have collectively run ~6300 tasks over the last six months using all of the available systems within the university (including the OSC), not including the rest of the NGS.
The user base is currently determined by those that have been registered with the OSC or similar organizations before. Through outreach efforts this is being extended into the more data-intensive Social Sciences as their e-Science projects move along. Several whole projects have also registered their developers to make use of the large data vault capability, including the Integrative Biology Virtual Research Environment.
We have had several instances where we have asked for user input, and a sample is presented here.

"My work is the simulation of the quantum dynamics of correlated electrons in a laser field. OxGrid made serious computational power easily available and was crucial for making the simulating algorithm work." Dr Dmitrii Shalashilin (Theoretical Chemistry)

"The work I have done on OxGrid is on molecular evolution of a large antigen gene family in African trypanosomes. OeRC/OxGrid has been key to my research and has allowed me to complete within a few weeks calculations which would have taken months to run on my desktop." Dr Jay Taylor (Statistics)

5. Costs

There are two ways that the campus grid can be valued: either in comparison with a similar-sized cluster, or in the increased throughput of the supercomputer system.
Consider the 250 systems of the teaching cluster as the basis for costs. The electricity costs for these systems would be £7000 if they were left turned on for 24 hours. These systems, though, do produce 1.25M CPU hours of processing power, so the produced value of the resource is very high.
Since the introduction of the campus grid it is considered that the utilisation of the supercomputing centre has increased for MPI tasks by ~10%.

6. Conclusion

Since November 2005 the campus grid has connected systems from four different departments: Physics, Chemistry, Biochemistry and University Computing Services. The resources located in the National Grid Service and OSC have also been connected for registered users. User interaction with these physically dislocated resources is seamless when using the resource broker, and for those under the control of the OeRC, accounting information has been saved for each task run. This has shown that ~6300 jobs have been run on the Oxford-based components of the system (i.e. not including the wider NGS core nodes or partners).
An added advantage is that significant numbers of serial users from the Oxford Supercomputing Centre have moved to the campus grid, so that it too has increased its performance.

7. Acknowledgments

I would like to thank Prof. Paul Jeffreys and Dr. Anne Trefethen for their continued support in the startup and construction of the campus grid. I would also like to thank Dr. Jon Wakelin for his assistance in the implementation and design of some aspects of Version 1 of the underlying software, when the author and he were both staff members at the Centre for e-Research Bristol.

8. References

1. National Grid Service, http://www.ngs.ac.uk
2. Oxford Supercomputer Centre, http://www.osc.ox.ac.uk
3. Kerberos CA and kx509, http://www.citi.umich.edu/projects/kerb_pki/
4. Virtual Data Toolkit, http://www.cs.wisc.edu/vdt
5. Globus Toolkit, The Globus Alliance, http://www.globus.org
6. European DataGrid, http://eu-datagrid.web.cern.ch/eu-datagrid/
7. Czajkowski K., Fitzgerald S., Foster I., and Kesselman C., Grid Information Services for Distributed Resource Sharing, Proceedings of the Tenth IEEE International Symposium on High-Performance Distributed Computing (HPDC-10), IEEE Press, August 2001.
8. GLUE Schema, "The Grid Laboratory Uniform Environment (GLUE)", http://www.hicb.org/glue/glue.htm
9. J. Frey, T. Tannenbaum, M. Livny, I. Foster, S. Tuecke, Condor-G: A Computation Management Agent for Multi-Institutional Grids, Cluster Computing, Volume 5, Issue 3, Jul 2002, Pages 237-246.
10. R. Raman, M. Livny, M. Solomon, "Matchmaking: Distributed Resource Management for High Throughput Computing," p. 140, Proceedings of the Seventh IEEE International Symposium on High Performance Distributed Computing (HPDC-7 '98), 1998.
11. Job Submission Description Language (JSDL) Specification, Version 1.0, GGF, https://forge.gridforum.org/projects/jsdl-wg/document/draft-ggf-jsdl-spec/en/21
12. PostgreSQL, http://www.postgresql.org
13. C. Baru, R. Moore, A. Rajasekar and M. Wan, The SDSC Storage Resource Broker, in Proceedings of the 1998 Conference of the Centre for Advanced Studies on Collaborative Research (Toronto, Ontario, Canada, November 30 - December 03, 1998).
14. LHC Computing Grid Project, http://lcg.web.cern.ch/LCG/
Appendix 1

VO_USER
  VO_ID
  USER_ID (ALSO UNIQUE)
  REAL_USER_NAME
  DN
  USER_TYPE
  VALID

VO_RESOURCE
  VO_ID
  RESOURCE_ID (ALSO UNIQUE)
  TYPE
  JOBMANAGER
  SUPPORT
  VALID
  MAXJOBS
  SOFTWARE

VO_USER_RESOURCE_STATUS
  VO_ID
  USER_ID
  RESOURCE_ID
  UNIX_USER_NAME

VO_USER_RESOURCE_USAGE
  VO_ID
  JOB_NUMBER
  USER_ID
  JOB_ID
  START_TIME
  END_TIME
  RESOURCE_ID
  JOBMANAGER
  EXECUTABLE
  NUMNODES
  CPU_TIME
  WALL_TIME
  MEMORY
  VMEMORY
  COST
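Expressed as DDL, the schema above might look like the following sketch. The column types are assumptions (the paper lists only column names), SQLite is used here for brevity where the production system uses PostgreSQL, and the VALID columns support the soft delete described in section 3.3.3, where records are marked invalid rather than removed.

```python
import sqlite3

# Column types below are assumptions; only the names come from Appendix 1.
SCHEMA = """
CREATE TABLE vo_user (
    vo_id INTEGER, user_id INTEGER UNIQUE, real_user_name TEXT,
    dn TEXT, user_type TEXT, valid INTEGER DEFAULT 1);
CREATE TABLE vo_resource (
    vo_id INTEGER, resource_id INTEGER UNIQUE, type TEXT,
    jobmanager TEXT, support TEXT, valid INTEGER DEFAULT 1,
    maxjobs INTEGER, software TEXT);
CREATE TABLE vo_user_resource_status (
    vo_id INTEGER, user_id INTEGER, resource_id INTEGER,
    unix_user_name TEXT);
CREATE TABLE vo_user_resource_usage (
    vo_id INTEGER, job_number INTEGER, user_id INTEGER, job_id TEXT,
    start_time TEXT, end_time TEXT, resource_id INTEGER, jobmanager TEXT,
    executable TEXT, numnodes INTEGER, cpu_time REAL, wall_time REAL,
    memory REAL, vmemory REAL, cost REAL);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
tables = sorted(row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'"))
print(tables)
```

Carrying VO_ID in every table is what lets the one database serve more than one virtual organisation, as section 3.3.3.1 requires.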
Olivier van der Aa(1), Stephen Burke(2), Jeremy Coles(2), David Colling(1), Yves Coppens(3), Greig Cowan(4), Alessandra Forti(5), Peter Gronbech(6), Jens Jensen(2), David Kant(2), Gidon Moont(1), Fraser Speirs(7), Graeme Stewart(7), Philippa Strange(2), Owen Synge(2), Matt Thorpe(2), Steve Traylen(2)

(1) Imperial College London, (2) CCLRC, (3) University of Birmingham, (4) University of Edinburgh, (5) University of Manchester, (6) University of Oxford, (7) University of Glasgow
Abstract
GridPP operates a Grid for the UK particle physics community, which is tightly integrated with the
Enabling Grids for E-sciencE (EGEE) and LHC Computing Grid (LCG) projects. These are now
reaching maturity: EGEE recently entered its second phase, and LCG is ramping up to the start of
full-scale production in 2007, so the Grid must now be operated as a full production service. GridPP
provides CPU and storage resources at 20 sites across the UK, runs the UK-Ireland Regional
Operations Centre for EGEE, provides Grid-wide configuration, monitoring and accounting
information via the Grid Operations Centre, and takes part in shift rotas for Grid operations and user
support. It also provides support directly for its own system managers and users. This paper
describes the structure and operations model for EGEE and LCG and the role of GridPP within that,
focusing on the problems encountered in operating a worldwide production Grid with 24/7
availability, and the solutions which are being developed to enable this to succeed.
currently come largely from GridPP and GridIreland [5], but we also anticipate some degree of convergence with the National Grid Service (NGS) [6], and we are currently seeing some interest from other e-science sites in the UK.
GridPP also co-ordinates the Grid Operations Centre (GOC) [7]. This maintains a database of information about all the sites in the Grid and provides a variety of monitoring and accounting tools.
In EGEE-1 a subset of the ROCs were classed as Core Infrastructure Centres (CICs). These were involved in running critical services and taking on a rotating responsibility to monitor the state of the Grid and take action to solve problems. In EGEE-II the CIC functionality is being absorbed into the ROCs as more regions gain the necessary expertise. GridPP has been involved with the CIC functions from the start.

1.2 LCG structure

LCG is dedicated to providing for the computing needs of the experiments at the LHC particle accelerator. These are expected to produce several PB of primary data per year, together with similar volumes of derived and simulated data, so data movement and storage is a major consideration.
LCG organises its sites into a hierarchical, tiered structure. Tier 0 is at CERN, the laboratory which hosts the LHC accelerator. Files are shipped from there to a number of Tier 1 sites, which have substantial CPU, storage and manpower resources and high-capacity network connections, can guarantee high availability, and are able to make a long-term commitment to supporting the experiments. Tier 1 sites will also need to transfer data between themselves. Each Tier 1 site supports a number of Tier 2 sites, which are allowed to have lower levels of resources and availability. Files are expected to be transferred in both directions between Tier 2 sites and their associated Tier 1, but generally not on a wider scale, although the detailed data movement patterns are still under discussion.
LCG is putting in place Service Level Agreements (SLAs) to define the expected performance of the Tier centres, and is also developing monitoring tools to measure actual performance against the SLAs. A set of Service Challenges are being run, principally to test the routine transfer of large data volumes but also to test the full computing models of the LHC experiments.
In the UK, GridPP has a Tier 1 site at the Rutherford Appleton Laboratory [8] and 19 Tier 2 sites at Universities around the country [9]. For administrative convenience these are grouped into four "virtual Tier 2s". From a technical point of view each site operates independently, but the Tier 2s each have an overall co-ordinator and are moving towards some sharing of operational cover.

1.3 Virtual Organisations

Users are grouped into Virtual Organisations (VOs), and sites can choose which VOs to support. EGEE is developing formal policies under which VOs can be approved, but at present there are effectively several different categories of VO:

• LCG-approved VOs support the four LHC experiments, and also some associated activities.
• EGEE-approved VOs have gone through a formal induction process in the EGEE NA4 (application support) Activity, and receive support from EGEE to help them migrate to the Grid. However, this is a relatively heavyweight process and so far rather few VOs have come in via this route.
• Global VOs have a worldwide scope, but have not been formally approved by EGEE and do not receive direct support from it. At present these are predominantly related to particle physics experiments.
• Regional VOs are sponsored, and largely supported, by one of the participating Grid projects. The UK-I ROC currently has a small number of VOs in this category, but the number is expected to increase. Regional VOs may also be supported at sites outside the region if they have some relationship with the VO.

Although the status of the VOs varies, as far as possible they are treated equivalently from a technical point of view in both the middleware and the infrastructure. Each VO has a standard set of configuration parameters, which makes it relatively easy for a site to add support for a new VO (unless there are special requirements).
As a particle physics project GridPP is mainly interested in supporting VOs related to the experiments in which the UK participates. However, it has a policy of providing resources at a low priority for all EGEE-approved VOs.

2. Middleware

The EGEE/LCG middleware is in a state of constant evolution as the technology develops, and managing this evolution while maintaining
a production service is a major component of Grid deployment activity. The current middleware is based on software from Globus [10] (but still using the Globus toolkit 2 release dating from 2002), Condor [11] and other projects, as collected into the Virtual Data Toolkit (VDT) [12]. A substantial amount of the higher-level middleware came from the European DataGrid (EDG) project [13], the predecessor of EGEE. Many other tools have been developed within LCG. The middleware Activity in EGEE (JRA1) has developed new software under the gLite brand name, and an integrated software distribution, which has adopted the gLite name, has recently been deployed on the production system [14].
The middleware can broadly be split into three categories: site services which are deployed at every site, core services which provide more centralised functions, and Grid operation services which relate to the operation of the Grid as a whole.

2.1 Site services

Every site needs to deploy services related to processing, data storage and information transport. There has also recently been some discussion about providing facilities for users to deploy their own site-level services. The individual services are as follows:

• Berkeley Database Information Index (BDII) – this is an LDAP server which allows sites to publish information according to the GLUE information schema [15].
• Relational Grid Monitoring Architecture (R-GMA) – another information system, presenting information using a relational model. This is also used to publish the GLUE schema information, together with a variety of monitoring and accounting information, and is also available for users to publish their own data.
• Computing Element (CE) – this provides an interface to computing resources, typically via submission to a local batch system. This has so far used the Globus gatekeeper, but the system is now in transition to a new CE interface based on Condor-C. A tool called APEL (Accounting Processor using Event Logs) is also run on the CE to collect accounting information, and transmit it via R-GMA to a repository at the GOC.
• Worker Nodes (WN) – these do not run any Grid services (beyond standard batch schedulers), but need to have access to the client tools and libraries for Grid services which may be needed by running jobs, and often to VO-specific client software.
• Storage Element (SE) – this provides an interface to bulk data storage. Historically the so-called "classic SE" was essentially just a Globus GridFTP server, but the system is now moving to a standard interface known as the Storage Resource Manager (SRM) [16].
• VOBOX – there have recently been requests for a mechanism to deploy persistent services on behalf of some VOs. This is currently under discussion due to concerns about scalability and security, but some such services are currently deployed on an experimental basis.

2.2 Core services

Some services are not needed at every site, but provide some general Grid-wide functionality. Ideally it should be possible to have multiple instances of these to avoid single points of failure, although this has not yet always been achieved in practice. The services are:

• Resource Broker (RB) – this accepts job submissions from users, matches them to suitable sites for execution, and manages the jobs throughout their lifetime. A separate Logging and Bookkeeping (LB) service is usually deployed alongside the broker.
• MyProxy [17] – a repository for long-term user credentials, which can be contacted by other Grid services to renew short-lived proxies before they expire.
• BDII – the information from the site-level BDII servers is collected into Grid-wide BDIIs which provide a view of the entire Grid. BDIIs are usually associated with RBs to provide information about the available resources to which jobs can be submitted.
• R-GMA Registry and Schema – these are the central components of the R-GMA system, providing information about producers and consumers of data and defining the schema for the relational tables. At present there is a single instance for the whole Grid.
• File Catalogue – this allows file replicas stored on SEs to be located using a Logical FileName (LFN). The system is currently in transition from the EDG-derived Replica Location Service (RLS) to the LCG File Catalogue (LFC), which has improved
operation over long periods. Unfortunately this is effectively confronted only on the production system, as test systems are too small to detect the problems.
It is often very difficult to port middleware to new operating systems (or even new versions of a supported operating system), compilers and batch systems. EGEE has an objective of supporting a wide range of platforms, but so far this has not been achieved in practice, and porting remains a painful exercise.
The middleware typically depends on a number of open-source tools and libraries, and the consequent package dependencies can be difficult to manage. This is particularly true when the dependency is on a specific version, which can lead to situations where different middleware components depend on different tool versions. This is partly a result of poor backward-compatibility in the underlying tools. These problems can be solved, but absorb a significant amount of time and effort.
In a large production Grid it is not practical to expect all sites to upgrade rapidly to a new release, so the middleware needs to be backward-compatible at least with the previous release, and ideally the one before that. Any new services need to be introduced in parallel with the existing ones. This has generally been achieved, but is sometimes compromised by relatively trivial non-compatible changes.
The most important issue is configuration. Much of the middleware is highly flexible and supports a large number of configuration options. However, for production use most of this flexibility needs to be hidden from the system managers who install it, and Grid deployment therefore involves design decisions about how the middleware needs to be configured. There is also a scalability aspect: for example, services should be discovered dynamically wherever possible rather than relying on static configuration. The middleware should also be robust in the sense that small changes to the configuration should not prevent it working, which is not always the case. In practice, finding a suitable configuration and producing a well-integrated release is a major task, which can take several months.

2.5 Operational issues

A production Grid requires continuous, reliable operation on a 24/7 basis. In a large system there are always faults somewhere, which may include hardware failures, overloaded machines, misconfigurations, network problems and host certificate expiry. Middleware is often written under the assumption that such errors are exceptional and must be rectified for the software to function. In a large Grid such "exceptions" are in fact the norm, and the software needs to be able to deal with this. In addition, error reporting is often regarded as having a low priority, and consequently it can be very hard to understand the reason for a failure. Ideally:

• Middleware should be fault tolerant; wherever possible it should retry or otherwise attempt a recovery if some operation fails.
• Services should be redundant, with automatic failover.
• Logging and error messages should be sufficient to trace the cause of any failure.
• It should be possible to diagnose problems remotely.

In practice these conditions are usually not fulfilled in the current generation of middleware, and it is therefore necessary to devise operational procedures to mitigate the effects of failures.

3. Grid deployment

Deployment relates to the packaging, installation and configuration of the Grid middleware.

3.1 Middleware installation and configuration

Various installation/configuration tools have been developed in a number of Grid projects which allow much of the middleware flexibility to be exposed in the configuration. However, as mentioned above, in practice system managers prefer to have the vast majority of configuration flexibility frozen out. LCG has therefore developed a tool called YAIM (Yet Another Installation Method) which uses shell scripts driven from a simple configuration file with a minimal number of parameters.

3.2 Middleware releases

Release management is a difficult area in a situation where the middleware continues to evolve rapidly. So far there have been major production releases about three times a year, with all the software being upgraded together. (This excludes minor patches, which are released on an ad-hoc basis.) However, this has caused some difficulties. On the one hand, it takes a long time for new functionality, which may be urgently needed by some users, to get into the
system. On the other hand, this is still seen as too rapid by many sites. Practical experience is that it takes many weeks for most sites to upgrade to a new release, with some sites taking several months.
Deployment is now moving towards incremental upgrades of individual services to try to alleviate this situation. However, this may bring problems of its own, as the backward-compatibility requirements may become much greater with a more complex mix of service versions at different sites. The introduction of a Pre-Production System (PPS) to try out pre-release software should however enable more problems to be found before new software goes into full production.
Support for release installation is provided principally on a regional basis; GridPP has a deployment team which provides advice and support to sites in the UK.

3.3 Documentation and training

Documentation is needed for both system managers and users. Unfortunately there is little or no dedicated effort to produce it, and the rapid rate of change means that documentation which does exist is often out of date. Documentation is also located on a wide range of web sites with little overall organisation. However, more effort is now being made to improve this area.
For system managers the central deployment team provides installation instructions and release notes. There is also a fairly extensive wiki [24] which allows system managers to share their experiences and knowledge.
The biggest problem with user documentation is a lack of co-ordination. The main User Guide is well-maintained, but in other areas it relies on the efforts of individual middleware developers. Many user-oriented web sites exist, but again with little co-ordination. GridPP has its own web site with a fairly extensive user area [25], which attempts to improve the situation rather than add to the confusion by providing structured pointers to documentation held elsewhere. EGEE has recently set up a User Information Group (UIG) with the aim of co-ordinating user documentation, based around a set of specific use-cases.
Training within EGEE is provided by the NA3 Activity, which is co-ordinated by NeSC [26]. The training material is also available on the web. However, again there is a problem with keeping up with the rate of change of the system, and also to some extent with reaching the highly diverse community of people who might require training.
One especially difficult area is the question of the induction of new VOs, to allow them to understand how to adapt their software and methods to make best use of the Grid. EGEE is intending to run periodic User Forum events at which users can share their experiences to improve the transfer of such information, with the first having been held in March this year [27]. In GridPP, so far most users have been from the large LCG VOs who have a fairly well-developed engagement with the Grid, but we will need to develop induction and training for internal VOs which may only have a small number of members.

3.4 VO support at sites

EGEE is intended to support a large number of VOs, and at present there are getting on for 200 VOs enabled somewhere on the infrastructure. This implies that it should be easy for a site to configure support for a new VO. However, there are currently some problems with this:

• The key configuration information is not always available. The operations portal has templates to allow VOs to describe themselves, but in most cases this is not yet in use.
• Supporting a new VO implies changes in various parts of the configuration, so this needs to be done in a parameterised way to make it as simple as possible - in the ideal case a site should simply have to add the VO name to a configuration file. YAIM now incorporates a web-based tool to generate the required configuration for those sites (the majority) which use it, which goes a long way towards this goal. Unfortunately, at present this also suffers from the fact that the configuration information for many VOs is not yet present.
• Some VOs may require non-standard configurations or extra services, and this decreases the chances that they will be supported by sites which do not have a strong tie to the VO. This is partly a question of user education; VOs need to realise that working within a Grid environment can accommodate less site-specific customisation than when working with a small number of sites. It may also imply the development of a standardised way to deploy non-standard services, the "VO box" concept mentioned above.
• Sites, and the bodies which fund them, may also need to adapt their attitude somewhat to the Grid model. Sites are often funded, and see themselves, as providing services for a small number of specific VOs, rather than a general service to the Grid community as a whole.

3.5 VO software installation

VOs often need to pre-install custom client tools and libraries which will be available to jobs running on WNs. The current method, which has been found to work reasonably well, is to provide an NFS-shared area identified by a standard environment variable, which is writeable by a restricted class of VO members. These people can also use GridFTP to write version tags to a special file on the CE, which are read and published into the information system, allowing jobs to be targeted at sites which advertise particular software versions. This tagging method can also be used to validate any other special features of the execution environment which are required by the VO.

4. Grid operations

As discussed above, failures at individual sites are a fact of life in such a large system, and the middleware currently does not deal with errors very effectively in most cases. The "raw" failure rate for jobs can therefore be rather high. In the early days it was also not unusual for sites to appear as "black holes": some failure modes can result in many jobs being sent to a site at which they either fail immediately, or are queued for a long period.
To get to a reasonable level of efficiency (generally considered to be a job failure rate below 5-10%) it has been necessary to develop operational procedures to mask the effects of the underlying failures. The key tool for this is the SFT system, which probes the critical functionality at each site on a regular basis. Failing sites can be removed from being matched for job submission, using criteria defined separately for each VO, using the FCR tool. In parallel with this an operations team submits tickets to the sites to ensure that problems are fixed as quickly as possible, as described below. All these systems and tools continue to be developed as we gain more experience. Performance is measured by a variety of tools, e.g. the RTM, which allow the current situation and trends over time to be monitored.
As mentioned earlier, LCG uses resources from other Grids, notably OSG in the US, and efforts are currently underway to integrate these into the EGEE/LCG operations framework as far as possible.

4.1 Site status

New sites are initially entered into the GOC DB as uncertified. The ROC to which they belong is expected to perform basic tests before marking the site as certified, at which point it becomes a candidate for regular job submission, and starts to be covered by the routine monitoring described below. Sites can also be marked as being in scheduled downtime, which temporarily removes them from the system.

4.2 Regular monitoring

An operations rota has been established to deal with site-level problems, with ROCs (formerly just CICs) being on duty for a week at a time. While on duty the various monitoring tools are scanned regularly (information is aggregated for this purpose on the operations portal). If problems are found tickets are opened against the relevant sites, and followed up with a specified escalation procedure, which in extreme cases can lead to sites being decertified. At present, with around 200 sites there are about 50 such tickets per week, which illustrates both that active monitoring is vital to maintaining the Grid in an operational state, and that there is never a time when the system is free of faults.
Tickets are initially opened in the GGUS system, but each ROC has its own internal system which it uses to track problems at the sites it serves, using a variety of technologies. An XML-based ticket exchange protocol has therefore been developed to allow tickets to be exchanged seamlessly between GGUS and the various regional systems.
Summaries of the operations activity are kept on the CIC portal, and issues are discussed at a weekly operations meeting.

4.3 Accounting

Accounting information is collected by the APEL tool and published to a database held at the GOC [28]. At present this only relates to the use of CPU time, but storage accounting is in development. Accounting information is currently only available for entire VOs (see Figure 3) due to concerns about privacy, but user-level information is likely to be required in the future once the legal issues are clarified. Accounting information is also available per site and with various degrees of aggregation above
that. So far accounting data is only used for informational purposes, but it is likely that it will be used in a more formal way in future, e.g. to monitor agreed levels of resource provision.

Figure 3: Accounting - integrated CPU time used by the atlas VO

4.4 User support

The question of user support initially received relatively little attention or manpower, but it has become clear that this is a very important area, and in the last year or so a substantial amount of effort has been put into developing a workable system. This is again based on the GGUS portal, where tickets can be entered directly or via email. People known as Ticket Processing Managers (TPMs) then classify the tickets, attempt to solve them in simple cases, or else assign them to the appropriate expert unit, and also monitor them to ensure that they progress in a timely way. TPMs are provided on a shift rota by the ROCs, and GridPP contributes to this, as well as providing some of the expert support units.
At present the system is working reasonably well, with typically a few tens of tickets per week. However, as usage of the system grows the number of tickets is also likely to grow, and it is not clear that the existing system can scale to meet this. The biggest problem is not technical, but the lack of manpower to provide user support, especially given that supporters need to be experts who generally have many other tasks. In the longer term we will probably need to employ dedicated support staff.

5. Summary

GridPP runs the UK component of the worldwide EGEE/LCG Grid. This is a very large system which is now expected to function as a stable production service. However, the underlying Grid middleware is still changing rapidly, and is not yet sufficiently robust. It has therefore been vital to develop Grid operations procedures which allow users to see only those parts of the system which are working well at any given time, and to manage software upgrades in a way which does not disrupt the service. This is still being developed but is now working well, and usage of the Grid has been rising steadily. Many challenges nevertheless remain, especially as the LHC starts to take data.

References

[1] http://www.eu-egee.org/
[2] http://lcg.web.cern.ch/LCG/
[3] http://www.opensciencegrid.org/
[4] http://www.gridpp.ac.uk/
[5] http://cagraidsvr06.cs.tcd.ie/index.html
[6] http://www.ngs.ac.uk/
[7] http://goc.grid-support.ac.uk/gridsite/gocmain/
[8] http://www.gridpp.ac.uk/tier1a/
[9] http://www.gridpp.ac.uk/tier2/
[10] http://www.globus.org/
[11] http://www.cs.wisc.edu/condor/
[12] http://vdt.cs.wisc.edu/
[13] http://www.edg.org/
[14] http://www.glite.org/
[15] http://infnforge.cnaf.infn.it/glueinfomodel/
[16] http://sdm.lbl.gov/srm-wg/
[17] http://grid.ncsa.uiuc.edu/myproxy/
[18] http://cic.in2p3.fr/
[19] http://www.ggus.org/
[20] https://lcg-sft.cern.ch/sft/lastreport.cgi
[21] http://goc.grid.sinica.edu.tw/gstat/
[22] http://gridportal.hep.ph.ic.ac.uk/rtm/
[23] http://gridice2.cnaf.infn.it:50080/gridice/site/site.php
[24] http://goc.grid.sinica.edu.tw/gocwiki/FrontPage
[25] http://www.gridpp.ac.uk/deployment/users/
[26] http://www.egee.nesc.ac.uk/
[27] http://indico.cern.ch/conferenceDisplay.py?confId=286
[28] http://goc.grid-support.ac.uk/gridsite/accounting/
Toby O. H. White1, Peter Murray-Rust2, Phil A. Couch3, Rik P. Tyer2 , Richard P. Bruin1,
Ilian T. Todorov3, Dan J. Wilson4 , Martin T. Dove1, Kat F. Austen1 , Steve C. Parker5
1Department of Earth Sciences, Downing Street, Cambridge. CB2 3EQ
2University Chemical Laboratory, Lensfield Road, Cambridge. CB2 1EW
3CCLRC Daresbury Laboratory, Daresbury, Warrington. WA4 4AD
4Institut für Mineralogie und Kristallographie. J. W. Goethe-Universität, Frankfurt.
5Department of Chemistry, University of Bath. BA2 7AY
Abstract
Within the eMinerals project we have been making increasing use of CML (the Chemical Markup
Language) for the representation of our scientific data. The original motivation was primarily to aid
in data interoperability between existing simulation codes, and successful results of CML-mediated
inter-code communication are shown. In addition, though, we have discovered several other areas
where XML technologies have been invaluable in developing an e-scientific virtual organization and
benefiting collaboration. These areas are explored, and we show a number of tools which we have
constructed. In particular, we demonstrate 1) a general library, FoX, for allowing Fortran programs
to interact with XML data, 2) a general CML viewing tool, ccViz, and 3) an XPath abstraction
layer, AgentX.
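The idea behind an XPath abstraction layer such as AgentX can be sketched briefly: calling code asks for a named concept, and a mapping table, rather than the caller, knows where that concept lives in the document tree. The sketch below uses Python's standard library purely for illustration; the element names, paths, and the mapping format are invented, not AgentX's actual vocabulary or the real CML schema:

```python
import xml.etree.ElementTree as ET

# Hypothetical concept-to-path mapping: callers never see the XPath.
PATHS = {
    "finalEnergy": ".//module[@role='finalization']/energy",
    "steps": ".//step",
}

def lookup(root, concept):
    """Resolve a concept name to document nodes via the mapping table."""
    return root.findall(PATHS[concept])

# An invented CML-like document for demonstration.
doc = ET.fromstring("""
<cml>
  <step n="1"><energy>-103.2</energy></step>
  <module role="finalization"><energy>-104.7</energy></module>
</cml>
""")

print(float(lookup(doc, "finalEnergy")[0].text))  # -104.7
print(len(lookup(doc, "steps")))                  # 1
```

If the document layout changes, only the mapping table needs updating; every tool built on `lookup` is insulated from the change, which is the point of such an abstraction layer.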
All of these codes accept and emit chemical, physical, and numerical data, but each uses its own input and output formats. This presents a number of challenges when scientists from different backgrounds, familiar with different codes, have to collaborate:
• when trying to exchange data, format translation and data conversion steps are necessary before different codes can understand the data.
• translation at the human level is necessary, since a scientist familiar with the look and feel of DL_POLY output may not understand SIESTA output, nor know where to look for equivalent data.
Both of these problems can be addressed using XML technologies, and we expand upon this below.
A problem, by no means unique to this project, is that all of our scientific simulation codes are written in Fortran, of varying ages and styles. There are a number of potential approaches to interfacing Fortran and XML; the approach we have adopted is to write an extensive library, in pure Fortran, which exposes XML interfaces in a Fortran idiom. Its design and implementation are briefly explained in section 3.
Having succeeded in making our Fortran codes speak XML, we have found three areas in particular where XML output has been useful. These are briefly explained below, and the tools and methods we have developed are explained in sections 4, 5, and 6.

2.1 Data transfer between codes

When considering the rôle of XML within a project involved in computational science, with multiple simulation codes in use, the temptation is first to think about its use in terms of a common file format which would allow easier data interchange between codes; and indeed that is the perspective from which the eMinerals project first approached XML.
The potential uses of the ability to easily share data between simulation codes are manifold. For example, as mentioned previously, we have multiple codes available, and they are capable of doing conceptually similar things, but using different techniques. We might wish to study the same system using both empirical potentials (with DL_POLY) and quantum mechanical DFT (with SIESTA). This would enable us to gain a better appreciation of the different approximations inherent in each method, and better understand the system.
This complementary use of two codes would be made much easier if we could use identical input files for both codes, rather than having to generate multiple representations of the same data. Without any such ability, extra work is required, and there is the potential for errors creeping in as we move between representations.
Furthermore, we might wish to use the output of one code as the input to another. We might wish to extract a small piece of the output of a low accuracy simulation, and study it in much greater depth with our more precise code. Conversely, we might want to take the output of a highly accurate initial calculation, and feed it into a low accuracy code to get more results.
In either of these cases, our workload would be greatly reduced if we could simply pass output directly from one code to another, without concerning ourselves with conversion of representations.
However, the route to this lofty though apparently simple goal of data interoperability is beset by a multitude of complicating factors. We have made much progress towards it, but due to its complications, we have described it last, in section 6. Along the way, however, we have discovered several other areas where XML has aided us in our role as an e-science testbed project, and these we shall describe first, and expand in sections 4 and 5.

2.2 Data visualization and analysis

One of the major features of XML is the ease with which it may be parsed. Writing an XML parser is by no means trivial, but for almost all languages and toolkits in current use, the work has already been done. Thus, to write a tool which takes input from an XML file, a developer need only interact with an in-memory representation of the XML.
When a computational scientific simulation code is executed, it will produce output in some format which has its origins in a more-or-less humanly readable textual form. Often, though, accretion of output will have rendered it less comprehensible - and in any case a long simulation may well result in hundreds of megabytes of output, which can certainly not be easily browsed by eye.
Output from calculations serves two purposes. First, it enables the scientist to follow the calculation's progress as the job is running, and, once it has finished, to ensure that it has done so in the proper manner, and without errors.
Secondly, it is rare that the scientist is only interested in the fact that the job has finished. Usually, they want to extract further, usually numerical, data from the output, and explore trends and correlations.
These two aims are rarely in accord, and this results in output formats having little sensible logical structure. Thus, for some purposes, the scientist will scan the output by eye, while for others, tools must be written to parse the format, and extract and perhaps graphically display relevant data.
XML formats are optimized for ease of computational processing at the expense of human readability. Thus if the simulation output is in XML, scanning by eye becomes infeasible, but processing and visualization tools may be written with much greater ease. If the XML format is common, these tools may be of wide applicability. We detail the construction of such tools below, in section 4.

2.3 Data management

In addition to efforts towards data interoperability and data visualization, there is a third area where we have found XML invaluable, that of data management.
607
Proceedings of the UK e-Science All Hands Meeting 2006, © NeSC 2006, ISBN 0-9553988-0-0
The use of grid computing has brought about an explosion in the volume of data generated by the individual researcher. eScience tools have enabled the generation of multiple jobs to be run, allocation of jobs to available computational resources, recovery of job output, and efficient distributed storage of the resultant data. We have addressed all of these to some extent within our project[9], and we are by no means unique in this.

However, given this large volume of data, categorization is extremely important to facilitate retrieval, both in the short term, when knowledge of the purpose and context of the data is still to hand, and in the long term, when it may not be. Of equal importance is that when collaborating with geographically disparate colleagues, the data must be comprehensible both by the original investigator, who will have some notion of the data's context, and by other investigators, who may not.

This is related, of course, to the perennial problem of metadata, to which there are many approaches. The solution we have come up with in the eMinerals project depends heavily on the fact that our data is largely held as XML, which makes it much easier to automatically parse the data to retrieve useful data and metadata by which to index it. The tools and techniques used are explained further in section 5.

3. Fortran and XML

… across the 9 or 10 compilers, and 7 or 8 hardware platforms, which are commonly available. There is no XML-aware language which is as portable, and cross-language compilation over that number of potential systems is a painful process. So, although writing the whole library from scratch in Fortran does involve more work than leveraging an existing solution, it achieves maximum portability, and means we are not restricted at all upon where our XML-aware codes may run.

The library we have written is called FoX (Fortran XML)[10]; an earlier version (named xmlf90) was described in [2]. In its current incarnation, it consists of four modules, which may be used together or independently: two of them provide APIs for input, and two for output.

On the input side, the modules provide APIs for event-driven processing, SAX 2[11], and for DOM Core level 2[12]. On the output side, there is a general-purpose XML writing library, built on top of which is a highly abstracted CML output layer. It is written entirely in standard Fortran 95.

The SAX and DOM portions of the library contain all calls provided by those APIs, modified slightly for the peculiarities of Fortran. An overview of their initial design and capabilities may be seen in [2], although the current version of FoX has significantly advanced, not least in now having full XML Namespaces[13] support. The two output modules, wxml and wcml, are described here.
… memory, but since one would only be able to serialize once the run is over and all the data complete, if the run were interrupted, all data would be lost. In addition, it is very important that one be able to keep track of the simulation as it progresses by observing its output, which requires progressive output, rather than all-in-one serialization.

Much of this could be avoided, of course, by occasional serialization of a changing DOM tree; or even by temporary conversion of the large Fortran arrays to a DOM and then step-by-step output; or, indeed, by the method we use, which is direct XML serialization of the Fortran data.

Thus, the FoX XML output layer (termed wxml here) is a collection of subroutines which generate XML output - tags, attributes and text - from Fortran data. The library is extensive, and allows the output of all structures described by the XML 1.1 & XML Namespaces specifications[13,14], although for most purposes only a few are necessary. XML output is obtained by successive calls to functions named xmlNewElement, xmlAddAttribute, and xmlEndElement. Furthermore, there are a series of additional function calls to directly translate native Fortran data structures to appropriate collections of XML structures. State is kept by the library to ensure well-formedness throughout the document.

Thus wxml enables rapid and easy production of well-formed XML documents, providing the user has a good grasp of the XML they want to produce.

3.2 wcml

However, the impetus for the creation of this library was the desire to graft XML output onto existing codes; specifically CML output. wxml is insufficient for this for a number of reasons:

• the developers of traditional Fortran simulation codes are, by and large, ignorant of XML, and entirely content to stay that way.
• when adapting a code to output XML, it is important that any changes made not obscure the natural flow of the code.
• it is desirable to write specialized functions to output CML elements, and common groupings thereof, directly (if for no other reason than to avoid misspellings of element and attribute names in source code).
• further, it is useful to maintain a "house style" for the CML.

Thus, for example, consider a code that wishes to output the coordinates of a molecular configuration. This is represented in CML by a 3-level hierarchy of various tags, with around 20 different optional attributes, and 4 different ways of specifying the coordinates themselves. The average simulation code developer does not wish to concern themselves with the minutiae of these details; they certainly do not wish to clutter up the code with hundreds of XML output statements.

In fact, were this output to be encoded in the source as a lengthy series of loops over wxml calls, it is almost certain that, even if it were written correctly initially, as the code continued to be developed by XML-unaware developers, it would eventually break. Therefore, FoX provides an interface where these details are hidden from view:

    call cmlAddMolecule(xf=xmlFile, coords=array_of_coords, elems=array_of_elements)

which outputs the following chunk of CML:

    <molecule>
      <atomArray>
        <atom xyz3="0.1 0.1 0.1"
              elementType="H"/>
        ...
      </atomArray>
    </molecule>

Similar interfaces are provided for all portions of CML commonly used by simulation codes. Such calls do not interfere with the flow of the code, and it is immediately obvious what their purpose is. Further, with such calls, a developer with only a rudimentary knowledge of XML/CML can easily add CML output to their code.

This wcml interface is now used in most of the Fortran simulation codes within eMinerals, including the current public releases of SIESTA 2.0 and DL_POLY 3, and in addition is in the version of CASTEP[15] being developed by MaterialsGrid[16]. FoX is freely available, and BSD licensed to allow its inclusion into any products - we encourage its use.

4. Data visualization

Although reformatting simulation output in CML allows for much richer handling of the data, it has the aforementioned disadvantage that the raw output is then much less readable to the human eye than raw textual output.

This led to the desire for a tool which would translate the CML into something more comprehensible. At its most basic level, this could simply strip away much of the angle-bracketed verbiage associated with XML. However, since a translation needs to be done anyway, it is then not much more trouble to ensure that the translated output is visually much richer.

Such tools have been built before on non-XML bases, with adapters for the output of different codes, but what we hoped to do here was, through the use of XML, avoid the necessity for the viewing tool to be adapted for new codes. In addition, it was strongly desired that the viewing, as far as possible, require no extra software installation from the user, in order that data could be viewed by colleagues and collaborators as easily as possible.

Since CML is an XML language, and translations between XML languages are easily done; and XHTML is an XML language; and furthermore, these days the web-browser is an application platform in its own right, which is present on everyone's desktop …
… particular further aspects of the transform deserve attention.
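The essence of such a transform - mechanically rewriting CML elements into a human-readable XHTML view - can be sketched in a few lines of Python (the real tool uses XSLT; the element and attribute names follow the CML fragment shown in section 3.2, and the two-atom input is invented for the example):

```python
from xml.etree import ElementTree as ET

cml = ET.fromstring(
    '<molecule><atomArray>'
    '<atom elementType="H" xyz3="0.1 0.1 0.1"/>'
    '<atom elementType="O" xyz3="0.2 0.2 0.2"/>'
    '</atomArray></molecule>'
)

# Each <atom> becomes an XHTML table row: the angle-bracketed verbiage is
# stripped away and only the chemically meaningful values are displayed.
rows = [
    "<tr><td>{}</td><td>{}</td></tr>".format(a.get("elementType"), a.get("xyz3"))
    for a in cml.iter("atom")
]
xhtml = "<table><tr><th>Element</th><th>Coordinates</th></tr>{}</table>".format(
    "".join(rows)
)
print(xhtml)
```

Because the traversal is driven by element names rather than by any one code's layout conventions, the same translation applies to output from any CML-emitting code.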
… implemented in browsers to make this doable in the fashion we managed for 2D graphs.

Fortunately, the well-established tool Jmol[17], which knows how to read CML files, is primarily designed to act as an applet embedded in web-pages. We may therefore use our XSLT transform to embed instances of Jmol into the viewable output. However, Jmol accepts input only as files or strings, so we needed to find a way to pass individual molecular configurations from our monolithic CML document to the different individual Jmol applets embedded in the transformed webpage. We overcame this problem by producing multiple-namespace XML documents: the output contains not only XHTML and SVG, but also chunks of CML, containing the different configurations of interest. Since the browser is unaware of CML, it makes no attempt to render this data. We then took advantage of Java/Javascript interaction to enable its use.

There is an API for controlling the browser's JS DOM from a Java applet, called LiveConnect. We adapted the Jmol source code so that, when passed a string which is the id of a CML tag in the embedding document, Jmol will traverse the in-memory browser DOM to find the relevant node, and extract data from the CML contained therein. This enables us to have a single XHTML document, directly containing all the CML data, which knows how to interpret itself. This addition to Jmol has been included in the main release, and is available from version 10.2. An example is illustrated in figure 3.

Figure 3: Interactive Jmol applet embedded in webpage.

4.3 Final document

Thus, we constructed an XSLT transform which, when given a CML document, will output an XHTML/SVG/CML document which is viewable in a web-browser; in which all relevant input & output data is marked up and displayed; all time-varying data is graphed; and all molecular configurations can be viewed in three dimensions and manipulated by the user.

Very conveniently, the XSLT transform knows nothing about what these quantities are, nor from what code they originate. So, we may add new quantities to the simulation output, and they are automatically picked up and displayed in the browser, and graphed if appropriate.

Furthermore, it is important to note that any other code which produces CML output is viewable in the same way. This is useful because one of the biggest barriers to be overcome in sharing data between scientists from different backgrounds is a lack of familiarity with each other's toolsets. However, with ccViz, we can share data between DL_POLY and SIESTA users and view them in exactly the same way.

This transferability applies not only to this visualization tool, but to any analysis tools built on CML output - a tool which analyses molecular configurations, or which performs statistical analyses on timestep-averaged data, for example, need be written only once, since the raw information it needs may be extracted from the output file in the same way for every CML-enabled code. Previously, such a tool would rely on the output format of one particular simulation code.

5. Data management

As mentioned above, working within a grid-enabled environment, it is easy to generate large quantities of data. The problem then arises of how to manage this data. Within the eMinerals project we store our data using the San Diego Supercomputing Centre's SRB[18], though there are many other solutions to achieve similar aims of distributed data storage.

However, the major problem encountered is not where to store the data, but how to index and retrieve it, for both the originator of the data and their collaborators. The eMinerals project faces a particular challenge in this regard, since we have a remarkably wide variety of data being generated. XML helps in this task in two ways. Firstly, having a common approach to output formats gives the advantages explained in the previous section. The second, larger, issue is the general problem of metadata handling, and approaches have been attempted with varying degrees of success by many people. The eMinerals approach is explained in great detail in [19]. However, we shall discuss it briefly here, with particular reference to the ways in which XML has helped us solve the problem.

There are three sorts of metadata which we have found it useful to associate with each simulation run:

• metadata associated with the compiled simulation code - its name, its version, any compiled-in options, etc. Typically there will be 10 or so such items.
• metadata associated with the set-up of the simulation; input parameters, in other words. These will vary according to the code - some codes may use the temperature and pressure of the model system, some may record what methods were used to run the model, etc. Typically there will be 50 or so such parameters.
• job-specific metadata. Since the purpose of metadata is to index the job for later, easy retrieval, sometimes it is appropriate to attach extracted output as metadata. For example, if a
number of jobs are being run, with the final model system energy being of particular interest, it is useful to attach the value of this quantity as metadata to the stored files in order that the jobs may be later examined, indexed by this value.

In each of these cases, we find our life made much easier due to the fact that our files are in an XML format.

For the first two types of metadata, we use the fact that CML offers <metadata> and <parameter> elements, which have a (simplified) structure like the following:

    <metadataList>
      <metadata name="Program"
                content="DL_POLY"/>
      ...
    </metadataList>

We may therefore extract all such elements and store them as metadata values in our metadata database.

The third type of metadata obviously requires specific instructions for the job at hand. However, because the output files are in XML format, we may easily point to locations within the file such that tools can automatically extract relevant data. This can be done easily using an XPath expression to point at the final energy, for example (although in fact we do not use XPath directly - we use AgentX[20], as detailed in the next section).

6. Data transfer

Finally, we have also succeeded in using XML to work towards code interoperability in the fashion originally foreseen.

The idea of a common data format is attractive, and as explained in section 2, would be of enormous value. It has, however, proved elusive so far, for a number of reasons, the discussion of which is beyond the scope of this paper. Nevertheless, by passing around small chunks of well-formatted, well-specified data, and agreeing on a common dialect with a very small vocabulary, we are able to gain much in interoperability.

Much of the progress we have made in this area is due to AgentX, an XML abstraction tool we have built to help us in this task.

6.1 AgentX

AgentX is a tool originating primarily from CCLRC Daresbury, but in the development of which eMinerals has played an important part. A previous version was described in [20].

At its most basic, it may be understood as, firstly, a method of abstracting complicated XPath expressions, and secondly as a method of abstracting access to common data which may be expressed in differing representations by various XML formats. For example, the three-dimensional location of an atom within a molecule is expressed in CML, as shown in section 3.2 above, with nested <molecule>, <atomArray>, and <atom> elements. AgentX provides an interface by which one can access that data in terms of the data represented rather than the details of that representation, as illustrated by the following, simplified, series of pseudocode API calls.

    axSelect("Molecule")
    numAtoms = axSelect("Atom")
    for i in range(numAtoms):
        axSelect("xCoordinate")
        x = axValue()
        axSelect("yCoordinate")
        y = axValue()
        axSelect("zCoordinate")
        z = axValue()

Internally, this is implemented by AgentX having access to three documents - the CML source, an OWL[21] ontology, and an RDF[22] mapping file. "Molecule" is a concept defined in the OWL ontology, and the RDF provides a mapping between that concept and an XPath/XPointer expression which evaluates to the location of a <molecule> tag. Furthermore, the ontology indicates that "Molecule"s may contain "Atom"s, which may have properties "xCoordinate", "yCoordinate", and "zCoordinate". The RDF provides mappings between each of these concepts and XPath expressions which point to locations in the CML.

Thus, presented with a CML file containing a <molecule>, it may be queried with AgentX to retrieve the atomic coordinates by a simple series of API calls, without the need to understand the syntax of the CML file.

More powerfully, though, it is possible to specify multiple mappings for one concept. That is, we may say that a "Molecule" may be found in multiple potential locations.

The normal CML format for specifying a molecule is quite verbose, albeit clear. For DL_POLY we needed to represent tens of thousands of atoms efficiently, so we used CML's <matrix> element to provide a compressed format, at the cost of losing the contextual information provided by the CML itself.

However, by simply providing a new set of mappings to AgentX, we could inform it that "Molecule"s could be found in this new location in a CML file, so all AgentX-enabled tools could immediately retrieve molecular and atomic data without any need for knowledge of the underlying format change.

AgentX is implemented as an application on top of libxml2. It is written primarily in C, with wrappers to provide APIs for Perl, Python and Fortran.

Concepts and mappings are provided for most of the data that are in common use throughout the project, but it is easy to add private concepts or further mappings where the existing ones are insufficient.
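The mapping idea can be sketched in Python: concept names are resolved through a table (standing in for the RDF mapping files) into paths evaluated against the document, so client code never mentions CML element names. The `concept_paths` table and the function names are invented for illustration, and ElementTree's restricted XPath stands in for the full XPath that AgentX uses.

```python
from xml.etree import ElementTree as ET

doc = ET.fromstring(
    '<molecule><atomArray>'
    '<atom x3="0.1" y3="0.2" z3="0.3"/>'
    '<atom x3="1.0" y3="1.1" z3="1.2"/>'
    '</atomArray></molecule>'
)

# Concept -> list of candidate paths. Supporting a new representation of
# "Atom" means appending another path here; the client code below is untouched.
concept_paths = {
    "Atom": [".//atomArray/atom"],
    "xCoordinate": ["x3"],
    "yCoordinate": ["y3"],
    "zCoordinate": ["z3"],
}

def select(tree, concept):
    """Return the nodes matching the first mapping that yields any results."""
    for path in concept_paths[concept]:
        found = tree.findall(path)
        if found:
            return found
    return []

def value(node, concept):
    """Read a property of a node via its concept name, not its CML attribute."""
    return float(node.get(concept_paths[concept][0]))

# Client code speaks only in concepts, never in CML syntax.
coords = [
    (value(a, "xCoordinate"), value(a, "yCoordinate"), value(a, "zCoordinate"))
    for a in select(doc, "Atom")
]
print(coords)  # [(0.1, 0.2, 0.3), (1.0, 1.1, 1.2)]
```

The fallback loop in `select` is what makes multiple mappings per concept work: a compressed `<matrix>` representation, say, could be supported by adding a second path without changing any consumer of the data.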
6.2 Data interchange between codes

We have incorporated AgentX into the CML-aware version of DL_POLY. This has enabled us to use CML-encoded atomic configurations as a primary storage format for grid-enabled studies. Details of studies performed with this CML-aware DL_POLY are in [23].

Furthermore, the output of any of our CML-emitting codes may now be directly used as input to DL_POLY, so we can perform direct comparisons of SIESTA and DL_POLY results.

AgentX has also been linked into a number of other codes, including AtomEye[24] and the CCP1 GUI[25]. It is also used as the metadata extraction tool in the scheme described in section 5.

Thus we are now successfully using CML as a real data interchange format between existing codes that have been adapted to work within an XML environment.

7. Summary

Within the eMinerals project, we have made wide, and increasing, use of XML technologies, especially with CML.

While working towards the goal of data interoperability between environmentally-relevant simulation codes, we have found several additional areas where XML has been of particular use, and have developed a number of tools which leverage the power of XML technologies to enable better collaborative research and eScience. These include:

• FoX, a pure Fortran library for general XML, and particularly CML, handling.
• Pelote, an XSLT library for generating SVG graphs.
• ccViz, an XHTML-based CML viewing tool.
• AgentX, an XML data abstraction layer.

All of our tools are liberally licensed, and are freely available from http://www.eminerals.org/tools

Finally, we have successfully developed methods for true interchange of XML data between simulation codes.

Acknowledgements

We are grateful for funding from NERC (grant reference numbers NER/T/S/2001/00855, NE/C515698/1 and NE/C515704/1).

References

[1] Murray-Rust, P. and Rzepa, H. S., "Chemical Markup Language and XML Part I. Basic principles", J. Chem. Inf. Comp. Sci., 39, 928 (1999); Murray-Rust, P. and Rzepa, H. S., "Chemical Markup, XML and the World Wide Web. Part II: Information Objects and the CMLDOM", J. Chem. Inf. Comp. Sci., 41, 1113 (2001).
[2] Garcia, A., Murray-Rust, P. and Wakelin, J., "The use of CML in Computational Chemistry and Physics Programs", All Hands Meeting, Nottingham, 1111 (2004).
[3] White, T.O.H. et al., "eScience methods for the combinatorial chemistry problem of adsorption of pollutant organic molecules on mineral surfaces", All Hands Meeting, Nottingham, 773 (2005).
[4] Soler, J. M. et al., "The Siesta method for ab initio order-N materials simulation", J. Phys.: Condens. Matter, 14, 2745 (2002).
[5] Todorov, I. and Smith, W., Phil. Trans. R. Soc. Lond. A, 362, 1835 (2004).
[6] Warren, M.C. et al., "Monte Carlo methods for the study of cation ordering in minerals", Mineralogical Magazine, 65 (2001); also http://www.esc.cam.ac.uk/ossia/
[7] Tucker, M. G., Dove, M. T. and Keen, D. A., J. Appl. Crystallogr., 34, 630 (2001).
[8] Watson, G.W. et al., "Atomistic simulation of dislocations, surfaces and interfaces in MgO", J. Chem. Soc. Faraday Trans., 92(3), 433 (1996); also http://www.bath.ac.uk/~chsscp/group/programs/programs.html
[9] Bruin, R.P. et al., "Job submission to grid computing environments", All Hands Meeting, Nottingham (2006), and references therein.
[10] This is not the only Fortran XML library in existence - see also http://nn-online.org/code/xml/, http://sourceforge.net/projects/xml-fortran/, http://sourceforge.net/projects/libxml2f90 - but it is the most fully featured. Its output support surpasses the others, and it is certainly the only one with CML support.
[11] http://www.saxproject.org
[12] http://www.w3.org/DOM/
[13] Bray, T. et al., "Namespaces in XML 1.1", W3C Recommendation, 4 February 2004.
[14] Bray, T. et al., "Extensible Markup Language (XML) 1.1", W3C Recommendation, 4 February 2004.
[15] Segall, M.D. et al., J. Phys.: Cond. Matt., 14(11), 2717-2743 (2002).
[16] http://www.materialsgrid.org
[17] http://jmol.sourceforge.net
[18] http://www.sdsc.edu/srb
[19] Tyer, R.P. et al., "Automatic metadata capture and grid computing", All Hands Meeting, Nottingham (2006) - in press.
[20] Couch, P.A. et al., "Towards Data Integration for Computational Chemistry", All Hands Meeting, Nottingham, 426 (2005).
[21] http://www.w3.org/TR/owl-features/
[22] http://www.w3.org/RDF/
[23] Dove, M.T. et al., "Anatomy of a grid-enabled molecular simulation study: the compressibility of amorphous silica", All Hands Meeting, Nottingham (2006).
[24] Li, J., "Atomeye: an efficient atomistic configuration viewer", Modelling Simul. Mater. Sci. Eng., 173 (2003).
[25] http://www.cse.clrc.ac.uk/qcg/ccp1gui/
Abstract

We describe the architecture for language processing adopted on the eScience project 'Extracting the Science from Scientific Publications' (nicknamed SciBorg). In this approach, papers from different sources are first processed to give a common XML format (SciXML). Language processing modules operate on the SciXML in an architecture that allows for (partially) parallel deep and shallow processing and for a flexible combination of domain-independent and domain-dependent techniques. Robust Minimal Recursion Semantics (RMRS) acts both as a language for representing the output of processing and as an integration language for combining different modules. Language processing produces RMRS markup represented as standoff annotation on the original SciXML. Information extraction (IE) of various types is defined as operating on RMRSs. Rhetorical analysis of the texts also partially depends on IE-like patterns and supports novel methods of information access.
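The standoff-annotation approach mentioned above - markup that points at character spans of the source document rather than modifying it - can be sketched as follows; the span format and the labels are illustrative stand-ins, not the actual RMRS-XML schema:

```python
# The source text is never modified; processing modules only add records
# that reference it by character offset, so multiple (even conflicting)
# analyses from parallel modules can coexist over the same original document.
text = "The mixture was stirred overnight."

annotations = [
    {"start": 4, "end": 11, "label": "noun"},
    {"start": 16, "end": 23, "label": "verb"},
]

def realise(text, annotations):
    """Resolve each standoff record back to the substring it annotates."""
    return [(a["label"], text[a["start"]:a["end"]]) for a in annotations]

print(realise(text, annotations))  # [('noun', 'mixture'), ('verb', 'stirred')]
```

Because annotations live outside the text, a deep parser and a POS tagger can each add their own layer without either needing to rewrite, or even see, the other's output.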
we mean systems which use very precise and detailed grammars of natural languages to analyse and generate, especially approaches based on current linguistic theory (see §5).

3. We are developing a semantically-based representation language, Robust Minimal Recursion Semantics (RMRS: Copestake (2003)), for integration of all levels of language processing and to provide a standardised output representation. RMRS is an application-independent representation which captures the information that comes from the syntax and morphology of natural language while having a sound logical basis compatible with Semantic Web standards. Unlike previous approaches to semantic representation, underspecified RMRSs can be built by shallow language processing, even part-of-speech (POS) taggers. See §4.

4. Integration with XML is built in to the architecture, as discussed in §3. We view language processing as providing standoff annotation expressed in RMRS-XML, with deeper language processing producing more detailed annotation.

5. As far as possible, all technology we develop is Open Source. Much of it will be developed in close collaboration with other groups.

Our overall aim is to produce an architecture that allows robust language processing even though it incorporates relatively non-robust methods, including deep parsing using hand-written grammars and lexicons. The architecture we have developed is not pipelined, since shallow processing can operate in parallel to deeper processing modules as well as providing input to them. We do not have space in this paper for detailed comparison with previous approaches, but note that this work draws on results from the Deep Thought project: Uszkoreit (2002) discusses the role of deep processing in detail.

In the next section, we outline the application tasks we intend to address in SciBorg. This is followed by a description of the overall architecture and the RMRS language. §5 and §6 provide a little more detail about the modules in the architecture, concentrating on architecture and integration details. §7 describes our approach to discourse analysis.

2 Application tasks

In this section, we outline three tasks that we intend to address in the project. As we will explain below, these tasks all depend on matching patterns specified in terms of RMRS. They can all be considered as forms of information extraction (IE), although we use that term very broadly. Most existing IE technology is based on relatively shallow processing of texts to directly instantiate domain-specific templates or databases. However, for each new type of information, a hand-crafted system or an extensive manually-created training corpus is required. In contrast, we propose a layered architecture using an approach to IE that takes the RMRS markup, rather than text, as a starting point. Again, this follows Deep Thought, but only a limited investigation of the approach was attempted there.

Chemistry IE  We are interested in extraction of specific types of chemistry knowledge from texts in order to build a database of key concepts fully automatically. For example, the following two sentences are typical of the part of an organic synthesis paper that describes experimental methods:

    The reaction mixture was warmed to rt, whereat it was stirred overnight. The resultant mixture was kept at 0C for 0.5 h and then allowed to warm to rt over 1 h.

Organic synthesis is a sufficiently constrained domain that we expect to be able to develop a formal language which can capture the steps in procedures. We will then extract 'recipes' from papers and represent the content in this language. A particular challenge is extraction of temporal information to support the representation of the synthesis steps.

Ontology construction  The second task involves semi-automatically extending ontologies of chemical concepts, expressed in OWL. Although some ontologies exist for chemistry, and systematic chemical names also give some types of ontological information (see §6.2), resources are still relatively limited, so automatic extraction of ontological information is important. Our proposed approach to this is along the lines pioneered by Hearst (1992) and refined by other authors. For instance, from:

    . . . the concise synthesis of naturally occurring alkaloids and other complex polycyclic azacycles.

we could derive an expression that conveyed the information 'alkaloid IS-A azacycle':

    <owl:Class rdf:ID="Alkaloid">
      <rdfs:subClassOf rdf:resource="#Azacycle"/>
    </owl:Class>

Ontologies considerably increase the flexibility of the chemistry IE, for instance by allowing for matches on generic terms for groups of chemicals (cf. the GATE project (gate.ac.uk) 'Ontology Based Information Extraction/OBIE').

Research markup  The third application is geared towards humans browsing papers: the aim is to help them quickly see the most salient points and the interconnections between papers. The theoretical background to this is discussed in more detail in
§7, but here we just illustrate its application to citation. Our aim is to provide a tool which will enable the reader to interpret the rhetorical status of a citation at a glance, as illustrated in Fig. 1. The example paper (Abonia et al., 2002) cites the other papers shown, but in different ways: Cowart et al. (1998) is a paper with which it contrasts, while Elguero et al. (2001) is cited in support of an observation. We also aim to extract and display the most important textual sentence about each citation, as illustrated in the figure.

The Chemistry Researchers' amanuensis: By the end of the project, we will combine these three types of IE task in a research prototype of a 'chemistry researchers' amanuensis' which will allow us to investigate the utility of the technology in improving the way that chemists extract information from the literature. Our publisher partners are supplying us with large numbers of papers which will form a database with which to experiment with searches. The searches could combine the types of IE discussed above: for instance, a search might be specified so that the search term had to match the goal of a paper.

3 Overall architecture

Figure 2 is intended to give an idea of the overall architecture we are adopting for language processing. The approach depends on the initial conversion of papers to a standard version of XML, SciXML, which is in use not just on SciBorg but on the FlySlip and CitRAZ projects as well (Rupp et al., 2006). For SciBorg, we have the XML source of papers, while for FlySlip and CitRAZ, the source is PDF. Following this conversion, we invoke the domain-specific and domain-independent language processing modules on the text. In all cases, the language processing adds standoff annotation to the SciXML base. Standoff annotation uses SAF (Waldron and Copestake, 2006; Waldron et al., 2006). SAF allows ambiguity to be represented using a lattice, thus allowing the efficient representation of multiple results from modules.

Language processing depends on the domain-dependent OSCAR-3 module (see §6) to recognise compound names and to mark up data regions in the texts. The three sentence-level parsing modules shown here (described in more detail in §5) are the RASP POS tagger, the RASP parser and the ERG/PET deep parser. The language processing modules are not simply pipelined, since we rely on parallel deep and shallow processing for robustness. Apart from the sentence splitter and tokenisers, all modules shown output RMRS. Output from shallower processing can be used by deeper processors, but might also be treated as contributing to the final output RMRS, with RMRSs for particular stretches of text being selected/combined by the RMRS merge module. The arrows in the figure indicate the data flows which we are currently using, but others will be added: in later versions of the system the ERG/PET processor will be invoked specifically on subparts of the text identified by the shallower processing. The deep parser will also be able to use shallow processing results for phrases within selected areas where there are too many out-of-vocabulary items for deep parsing to give good results.

Processing which applies 'above' the sentence level, such as anaphora resolution and (non-syntactically marked) word sense disambiguation (WSD), will operate on the merged RMRS, enriching it further. However, note that the SciXML markup is also accessible to these modules, allowing for sensitivity to paragraph breaks and section boundaries, for instance. The application tasks are all defined to operate on RMRSs.

This architecture is intended to allow the benefits of deep processing to be realised where possible, but with shallow processing outputs being used as necessary for robustness. The aim is to always have an RMRS for a processed piece of text. The intent is that application tasks be defined to operate on RMRSs in a way that is independent of the component that produced the RMRS. Other modules could be added as long as they can be modified to produce RMRSs: we may, for instance, investigate the use of a chunker. The architecture allows the incorporation of modules which have been independently developed without having to adjust the interfaces between individual pairs of modules.

The parallel architecture is a development of approaches investigated in Verbmobil (Ruland et al., 1998; Rupp et al., 2000) and Deep Thought (Callmeier et al., 2004). It is intended to be more flexible than either of these approaches, although it has resemblances to the Deep Thought Heart of Gold machinery and we are using some of the software developed for that. The QUETAL project (Frank et al., 2006) is also using RMRS. We have discovered that a strictly pipelined architecture is unworkable when attempting to combine modules that have been developed independently. For instance, different parsing technologies make very different assumptions about tokenisation, depending in particular on the treatment of punctuation. Furthermore, developers of resources change their assumptions over time, so attempting to make tokenisation consistent leads to a brittle system. Our non-pipelined, standoff annotation approach supports differences in tokenisation. It is generic enough to allow us to investigate a range of different strategies for combining deep and shallow processing.

4 RMRS

RMRS is an extension of the Minimal Recursion Semantics (MRS: Copestake et al. (2005)) approach which is well established in deep processing in NLP. MRS is compatible with RMRS, but RMRS can also be used with shallow processing techniques, such as part-of-speech tagging, noun phrase chunking and stochastic parsers which operate without detailed lexicons. Shallow processing has the advantage of being more robust and faster, but is less precise: RMRS output from the shallower systems is less fully specified than the output from the deeper systems, but in principle fully compatible. In circumstances where deep parsing can be successfully applied, a detailed RMRS can be produced, but when resource limitations (in processing capability or lexicon availability, for instance) preclude this, the system can back off to RMRSs produced by shallower analysers. Different analysers can be flexibly combined: for instance, shallow processing can be used as a preprocessor for deep analysis to provide structures for unknown words, to limit the search space or to identify regions of the text which are of particular interest. Conversely, RMRS structures from deep analysis can be further instantiated by anaphora resolution, word sense disambiguation and other techniques. Thus RMRS is used as the common integration language to enable flexible combinations of resources.

Example RMRS output for a deep parser and a POS tagger for the sentence 'the mixture was allowed to warm' is shown below:¹

    Deep processing            POS tagger
    h6:_the_q(x3)              h1:_the_q(x2)
    RSTR(h6,h8)
    BODY(h6,h7)
    h9:_mixture_n(x3)          h3:_mixture_n(x4)
    ARG1(h9,u10)
    h11:_allow_v_1(e2)         h5:_allow_v(e6)
    ARG1(h11,u12)
    ARG2(h11,x3)
    ARG3(h11,h13)

¹ For space reasons, we have shown a rendered format, rather than RMRS-XML, and have omitted much information including tense and number.
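The back-off behaviour described here can be sketched in a few lines of Python. This is purely illustrative: the `Pred` class, the name-prefix subsumption test and the `merge` function are our own simplified stand-ins, not the project's RMRS machinery, which aligns analyses by character span and merges full variable structures.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pred:
    """A drastically simplified elementary predication (illustrative only)."""
    label: str   # e.g. "h11"
    name: str    # e.g. "_allow_v_1"
    args: tuple  # e.g. ("e2",)

def subsumes(deep_name: str, shallow_name: str) -> bool:
    # A shallow predicate name underspecifies a deep one:
    # "_allow_v" (POS tagger) subsumes "_allow_v_1" (ERG/PET).
    return deep_name == shallow_name or deep_name.startswith(shallow_name + "_")

def merge(deep: list, shallow: list) -> list:
    """Prefer deep predications; back off to shallow predications that
    no deep predication subsumes (e.g. inside regions the deep parser
    could not analyse)."""
    merged = list(deep)
    for sp in shallow:
        if not any(subsumes(dp.name, sp.name) for dp in deep):
            merged.append(sp)
    return merged
```

With the example sentence above, a deep analysis containing `_allow_v_1` absorbs the tagger's underspecified `_allow_v`, while a predication the deep parser missed survives from the shallow output.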
properties of a sentence are insensitive to chemical composition. Schematically, the section should appear to the domain-independent parsers as:

    We have recently communicated that
    the condensation of [compound-1]
    with [compound-2] affords [compound-3]

However, the output of the language processing component must be relatable to the specific identification of the compound for searches.

It is also important that the modules know about text regions which should not be subject to general morphological processing, lexical look-up etc. In particular, in data sections of papers, standard techniques for sentence splitting result in very large chunks of text being treated as a single sentence. Given that language processing has to be bounded by resource consumption, the expense of attempting to parse such regions could prevent analysis of 'normal' text. In our current project, the domain-specific processing is handled by OSCAR-3 (Corbett and Murray-Rust, 2006).

6.2 Ontologies and other domain-specific resources

In chemistry, the need for explicit ontologies is reduced by the concept of molecular structure: structures and systematic names are (at least in principle) interconvertible, many classes of compounds (such as ketones) are defined by structural features, and structures may be related to each other using concepts such as isomerism and substructure-superstructure relationships that may be determined algorithmically. However, there are still many cases where an explicit ontology is required, such as in the mapping of trivial names to structures, and in the assignment of compounds to such categories as pesticides and natural products. The three ontologies for Chemistry that we are aware of are: Chemical Entities of Biological Interest (ChEBI: www.ebi.ac.uk/chebi/index.jsp); FIX (methods and properties: obo.sourceforge.net/cgi-bin/detail.cgi?fix) and REX (processes, e.g. chemical reactions: obo.sourceforge.net/cgi-bin/detail.cgi?rex). FIX and REX are currently development versions, while ChEBI has been fully released. ChEBI comes in two parts: a GO/OBO DAG-based ontology ('ChEBI ontology'), and a conventional structural database. This second half is used in OSCAR-3, providing a source of chemical names and structures.

The ChEBI ontology includes chemicals, classes of chemicals and parts of chemicals: these are organised according to structure, biological role and application. Unfortunately ChEBI does not explicitly distinguish between these types of entity. The application and biological role ontologies are currently relatively underdeveloped. However, the ChEBI ontology has the potential to interact with other ontologies in the GO family.

The most useful other resource which is generally available is PubChem (pubchem.ncbi.nlm.nih.gov), which is designed to provide information on the biological activities of small molecules, but also provides information on their structure and naming. There is also the IUPAC Gold Book, which is a compendium of chemical terminology with hyperlinks between entries. While not capable of directly supporting the IE needs mentioned in §2, these resources are potentially useful for lexicons and to support creation of ontology links.

7 Research markup and citation analysis

Searching in unfamiliar scientific literature is hard, even when a relevant paper has been identified as a starting point. One reason is that the status of a given paper with respect to similar papers is often not apparent from its abstract or the keywords it contains. For instance, a chemist might be more interested in papers containing direct experimental evidence rather than evidence by simulation, or might look for papers where some result is contradicted. Such subtle relationships between the core claims and evidence status of papers are currently not supported by search engines such as Google Scholar; if we were able to model them, this would add considerable value.

The best sources for this information are the papers themselves. Discourse analysis can help, via an analysis of the argumentation structure of the paper. For instance, authors typically follow the strategy of first pointing to gaps in the literature before describing the specific research goal – thereby adding important contrastive information in addition to the description of the research goal itself. An essential observation in this context is that conventional phrases are used to indicate the rhetorical status of different parts of a text. For instance, in Fig. 3 similar phrases are used to indicate the introduction of a goal, despite the fact that the papers come from different scientific domains.

Cardiology: The goal of this study was to elucidate the effect of LV hypertrophy and LV geometry on the presence of thallium perfusion defects.
Heupler et al. (1997): 'Increased Left Ventricular Cavity Size, Not Wall Thickness, Potentiates Myocardial Ischemia', Am Heart J, 133(6)

Crop agriculture: The aim of this study was to investigate possible relationships between type and extent of quality losses in wheat with the infestation level of S. mosellana.
Helenius et al. (1989): 'Quality losses in wheat caused by the orange wheat blossom midge Sitodiplosis mosellana', Annals of Applied Biology, 114: 409-417

Chemistry: The primary aims of the present study are (i) the synthesis of an amino acid derivative that can be incorporated into proteins via standard solid-phase synthesis methods, and (ii) a test of the ability of the derivative to function as a photoswitch in a biological environment.
Lougheed et al. (2004): 'Photomodulation of ionic current through hemithioindigo-modified gramicidin channels', Org. Biomol. Chem., Vol. 2, No. 19, 2798-2801

Natural language processing: In contrast, TextTiling has the goal of identifying major subtopic boundaries, attempting only a linear segmentation.
Hearst (1997): 'TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages', Computational Linguistics, 23(1)

Natural language processing: The goal of the work reported here is to develop a method that can automatically refine the Hidden Markov Models to produce a more accurate language model.
Kim et al. (1999): 'HMM Specialization with Selective Lexicalization', EMNLP-99

Figure 3: Similar phrases across the domains of chemistry and computational linguistics

Argumentative Zoning (AZ), a method introduced by Teufel (Teufel et al., 1999; Teufel and Moens, 2000), uses cues and other superficial markers to pick out important parts of scientific papers, and supervised machine learning to find zones of different argumentative status in the paper. The zones are: AIM (the specific research goal of the current paper); TEXTUAL (statements about section structure); OWN (neutral descriptions of own work presented in the current paper); BACKGROUND (generally accepted scientific background); CONTRAST (comparison with or contrast to other work); BASIS (statements of agreement with other work or continuation of other work); and OTHER (neutral descriptions of other researchers' work). AZ was originally developed for computational linguistics papers, but as a general method of analysis, AZ can be and has been applied to different text types (e.g., legal texts (Grover et al., 2003) and biological texts (Mizuta and Collier, 2004)) and languages (e.g., Portuguese (Feltrim et al., 2005)); we are now adapting it to the special language of chemistry papers and to the specific search tasks in eChemistry. Progress towards construction of citation maps is reported in Teufel (2005) and Teufel et al. (2006).

The zones used in the original computational linguistics domain concentrated on the phenomena of attribution of authorship to claims (is a given sentence an original claim of the author, or a statement of a well-known fact) and of citation sentiment (does the author criticise a certain reference or use it as part of their own work). For application of AZ to chemistry, changes need to be made which mirror the different writing and argumentation styles in chemistry, in comparison to computational linguistics. Argumentation patterns are generally similar across the disciplines (they are there to convince the reader that the work undertaken is sound and grounded in evidence rather than directly carrying scientific information), but several factors such as the use of citations, passive voice, or cue phrases vary across domains.

For Chemistry, we intend to exploit the RMRS technology discussed earlier to detect cues. RMRS encoding is advantageous because it allows more concise and flexible specification of cues than do string-based patterns, and because it allows identification of more complex cues. For instance, papers quite frequently explain a goal via a contrast using a phrase such as: our goal is not to X but to Y:

    our goal is not to verify P but to construct
    a test sequence from P²

Getting the contrast and scope of negation correct in such examples requires relatively deep processing. Processing of AZ cue phrases with ERG/PET should be feasible because their vocabulary and structure is relatively consistent.

8 Conclusion

The aim of this paper has been to describe how the separate strands of work on language processing within SciBorg fit together into a coherent architecture. There are many aspects of the project that we have not discussed in this paper because we have not yet begun serious investigation. This includes word sense disambiguation and anaphora resolution. The intention is to use existing algorithms, adapted as necessary to our architecture. We have also not discussed the application of Grid computing that will be necessary as we scale up to processing thousands of papers.

² Gargantini and Heitmeyer (1999), 'Using Model Checking to Generate Tests from Requirements Specifications', in Nierstrasz and Lemoine (eds), Software Engineering – ESEC/FSE'99, Springer
Abstract
We describe two key interfaces used in an architecture for applying a range of Language
Technology tools to a corpus of Chemistry research papers, in order to provide a basis of
robust linguistic analyses for Information Extraction tasks. This architecture is employed
in the context of the eScience project ‘Extracting the Science from Scientific Publications’
(a.k.a. SciBorg), as described in Copestake et al. (2006). The interfaces in question are
the common representation for the papers, delivered in a range of formats, and the coding
of various types of linguistic information as standoff annotation. While both of these in-
terfaces are coded in XML, their structure and usage are quite distinct. However, they are
employed at the main convergence points in the system architecture. What they share is
the ability to represent information from diverse origins in a uniform manner. We empha-
sise this degree of flexibility in our description of the interface structures and the design
decisions that led to these definitions.
processing modules only have to interface with one XML encoding. For this purpose we have adopted an XML schema that has been developed over the course of several projects for the precise purpose of representing the logical structure of scientific research papers. Nevertheless, some adaptations to the schema were required for the specific needs of a corpus collected in well-defined XML encodings of publishers' markup. We have named the result SciXML, intuitively XML for Science.

3 The Development of SciXML

SciXML originates in XML markup for the logical structure of scientific papers in Computational Linguistics (Teufel and Moens, 1997; Teufel, 1999). It has subsequently been applied to corpora from a variety of disciplines, including Cardiology (Teufel and Elhadad, 2002) and Genetics (Hollingsworth et al., 2005).

What is equally significant is that these corpora, while consistent in the function of their texts, were collected from a variety of different sources and in varying formats, so that conversions to SciXML have been defined from LaTeX, HTML and PDF (via OCR). The conversion from low-level formatting produced a cumulative effect on the immediate precursor of our SciXML schema. The more functional levels of the markup were impoverished, as only distinctions that affected the formatting could be retrieved. The handling of lists was rather simplified and tables excluded, because of the difficulty of processing local formatting conventions. Equally, the applications had no necessity to represent papers in full formatting, so information like the modification of font faces at the text level was excluded. While footnotes were preserved, these were collected at the end of the paper, effectively as endnotes, alongside the vestigial representations of tables and figures, chiefly their captions.

SciBorg has the advantage of access to publishers' markup which supports functional or semantic distinctions in structure and provides a detailed coding of the content and structure of lists and tables, rather than just their formatting on the page. However, we envisage IE applications which involve some rendering of the paper contents in a readable form, e.g. in authoring and proof-reading aids for Chemists. While not as detailed as the publishers' page formatting, we would require the paper content to be recognisable, exploiting HTML as a cheap solution to any more complex display problems. This implies retaining more of the explicit formatting information, particularly at the text level. In practice, this has meant the addition of inline markup for font face selections and some inclusion of LaTeX objects for formulae, as well as the preservation of the origin points for floats, as they are known to LaTeX (tables and figures). As a result SciXML retains a focus on the logical structure of the paper, but preserves as much formatting information as is required for effectively rendering the papers in an application.

4 The SciXML Schema

The resulting form of SciXML is defined as a Relax NG schema. Since this is an XML-based formalism, it is difficult to exhibit any substantive fragment in the space available here. Figure 1 shows just the overall structure of a paper and the first level of elements in the <REFERENCELIST> element. The most comprehensive description that is appropriate here is a catalogue of the types of construct in SciXML.

Paper Identifiers: We require enough information to identify papers both in processing and when tracing back references. While publishers' markup may contain an extensive log of publication and reviewing, a handful of elements suffice for our needs: <TITLE>, <AUTHOR>, <AUTHORS>, <FILENO>, <APPEARED>.

Sections: The hierarchical embedding of text segments is encoded by the <DIV> element, which is recursive. A DEPTH attribute indicates the depth of embedding of a division. Each division starts with a <HEADER> element.

Paragraphs are marked as element <P>. The paragraph structure has no embedding in SciXML, but paragraph breaks within <ABSTRACT>, <CAPTION> and <FOOTNOTE> elements and within list items (<LI>) are noted with a <SUBPAR> element.

Abstract, Examples and Equations are text sections with specific functions and formatting, and can be distinguished both in publishers' markup and in the process of recovering information from PDF. We have added a functionally determined section <THEOREM>, as we have encountered this type of construct in some of the more formal papers. Similarly, linguistic examples were distinguished early on, as Computational Linguistics was the first corpus to be treated.

Tables, figures and footnotes are collected in a list element at the end of the document: <TABLELIST>, <FIGURELIST>, <FOOTNOTELIST>, respectively. The textual position of floats is marked by an <XREF/> element. The reference point of a footnote is marked by a separate series of markers: <SUP>, also used for similar functions in associating authors and affiliations.
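The constructs catalogued above can be seen in miniature by assembling a SciXML-like skeleton with Python's standard library. The element names follow the description above; the textual content and attribute values are of course invented for the example.

```python
import xml.etree.ElementTree as ET

# Build a minimal SciXML-style paper skeleton.
paper = ET.Element("PAPER")
ET.SubElement(paper, "METADATA")
ET.SubElement(paper, "TITLE").text = "A hypothetical chemistry paper"

body = ET.SubElement(paper, "BODY")
div = ET.SubElement(body, "DIV", DEPTH="1")        # recursive text division
ET.SubElement(div, "HEADER").text = "Introduction"  # each division starts with a header
ET.SubElement(div, "P").text = "First paragraph."
sub = ET.SubElement(div, "DIV", DEPTH="2")          # nested division, depth recorded
ET.SubElement(sub, "HEADER").text = "Background"

# Bibliography list at the end of the document.
reflist = ET.SubElement(paper, "REFERENCELIST")
ref = ET.SubElement(reflist, "REFERENCE")
ET.SubElement(ref, "SURNAME").text = "Teufel"
ET.SubElement(ref, "YEAR").text = "1999"

xml_string = ET.tostring(paper, encoding="unicode")
```

The same tree can then be rendered to HTML by a stylesheet, as described above for the demonstration applications.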
<define name="PAPER.ELEMENT">
  <element name="PAPER">
    <ref name="METADATA.ELEMENT" />
    <optional>
      <ref name="PAGE.ELEMENT" />
    </optional>
    <ref name="TITLE.ELEMENT" />
    <optional>
      <ref name="AUTHORLIST.ELEMENT" />
    </optional>
    <optional>
      <ref name="ABSTRACT.ELEMENT" />
    </optional>
    <element name="BODY">
      <zeroOrMore>
        <ref name="DIV.ELEMENT" />
      </zeroOrMore>
    </element>
    <optional>
      <ref name="REFERENCELIST.ELEMENT" />
    </optional>
    <optional>
      <ref name="AUTHORNOTELIST.ELEMENT" />
    </optional>
    <optional>
      <ref name="FOOTNOTELIST.ELEMENT" />
    </optional>
    <optional>
      <ref name="FIGURELIST.ELEMENT" />
    </optional>
    <optional>
      <ref name="TABLELIST.ELEMENT" />
    </optional>
    <optional>
      <element name="ACKNOWLEDGMENTS">
        <zeroOrMore>
          <choice>
            <ref name="REF.ELEMENT" />
            <ref name="INLINE.ELEMENT" />
          </choice>
        </zeroOrMore>
      </element>
    </optional>
  </element>
</define>

<define name="REFERENCELIST.ELEMENT">
  <element name="REFERENCELIST">
    <zeroOrMore>
      <ref name="REFERENCE.ELEMENT" />
    </zeroOrMore>
  </element>
</define>

Figure 1: The overall structure of a paper and the first level of the <REFERENCELIST> element in the SciXML Relax NG schema
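Given the fragment above, a converted paper can be given a quick structural sanity check even without a full Relax NG validator. The sketch below is hand-rolled for illustration: the required/optional split mirrors our reading of the figure, and a production pipeline would instead validate with a proper Relax NG tool.

```python
import xml.etree.ElementTree as ET

REQUIRED = ["METADATA", "TITLE", "BODY"]  # non-optional in the schema fragment
OPTIONAL = ["PAGE", "AUTHORLIST", "ABSTRACT", "REFERENCELIST",
            "AUTHORNOTELIST", "FOOTNOTELIST", "FIGURELIST",
            "TABLELIST", "ACKNOWLEDGMENTS"]

def check_paper(xml_text: str) -> list:
    """Return a list of structural problems found in a PAPER document."""
    root = ET.fromstring(xml_text)
    problems = []
    if root.tag != "PAPER":
        problems.append("root element is not PAPER")
    for name in REQUIRED:
        if root.find(name) is None:
            problems.append(f"missing required <{name}>")
    for child in root:
        if child.tag not in REQUIRED + OPTIONAL:
            problems.append(f"unexpected element <{child.tag}>")
    return problems
```

A check like this is useful as a fast first pass when a new publisher conversion script is being developed.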
Lists: various types of lists are supported with bul- <xsl:template match="sec">
let points or enumeration, according to the TYPE <DIV DEPTH="{@level}">
attribute of the <LIST> element. Its contents will <xsl:apply-templates/>
be uniformly marked up as <LI> for list items. </DIV>
</xsl:template>
Cross referencing takes a number of forms in-
cluding the <SUP> and <XREF> mentioned above.
All research papers make use textual cross refer- Figure 2: An XSLT conversion template
ences to identify sections and figures. For Chem-
istry, reference to specific compounds was adopted,
from the publishers’ markup conventions. The from a publisher’s markup to a <DIV> in SciXML.
other crucial form of cross reference in research text The SciXML conventions show an affinity for the
is citations, linking to bibliographic information in structure of the formatted text which follows di-
the <REFERENCELIST>. rectly from usage with text recovered from PDF.
For our current purposes this is considered harm-
Bibliography list: The bibliography list at the end
less, as no information is lost. We assume that
is marked as <REFERENCELIST>. It consists of an application will render the content of a pa-
<REFERENCE> items, each referring to a formal
per, as required, e.g. in HTML. In fact, we al-
citation. Within these reference items, names of ready have demonstration applications that system-
authors are marked as <SURNAME> elements, and
atically render SciXML in HTML via an addi-
years as <YEAR>.
tional stylesheet. The one SciXML convention that
is somewhat problematic is the flattening of para-
5 The Conversion Process graph structure. While divisions can embed divi-
The conversion from the publisher’s XML markup sions, paragraphs may not embed paragraphs, not
to SciXML can be carried out using an XSLT even indirectly. To avoid information loss, here
stylesheet. In fact, most of the templates are quite an empty paragraph break marker is added to list
simple, mainly reorganising and/or renaming con- elements, abstracts and footnotes which may, in
tent. The few exceptions are collections of elements the general case, include paragraph divisions. Al-
such as footnotes and floats to list elements and though earlier versions of SciXML also encoded
replacing the occurrence in situ with a reference sentence boundaries, the sentence level of annota-
marker. Elements that explicitly encode the embed- tion has been transferred to the domain of linguistic
ding of text divisions are systematically mapped to standoff annotation, where it can be generated by
a non-hierarchical division with a DEPTH attribute automatic sentence splitters.
recording the level of embedding. Figure 2 shows a While the XSLT stylesheets that convert the pub-
template for converting a section (<sec>) element lisher XML to SciXML are relatively straightfor-
624
Proceedings of the UK e-Science All Hands Meeting 2006, © NeSC 2006, ISBN 0-9553988-0-0
ward, each publisher DTD or schema requires a vided that a partial identification of the word class
separate script. This places a manual overhead on is possible, e.g. by POS (part of speech) tagging.
the inclusion of papers from a new publisher, but
fortunately at a per publisher rather than per journal RASP provides a statistically trained parser that
level. In practice, we have found that the element does not require a full lexicon (Briscoe and Carroll,
definitions are still sufficiently similar to allow a 2002). The parser forms part of a sequence of anal-
fair amount of cut-and-paste programming, so that ysers including a sentence splitter, tokeniser and a
the overhead decreases as more existing conversion POS tagger.
templates exist to draw on. A recent extension of A key factor in combining results from multiple
SciXML conversion to the PLoS DTD presented re- parsers is that they present compatible results. Both
markably few unknown templates. of the parsers we are using are capable of producing
analyses and partial analyses in the same form of
6 Language Technology in SciBorg representation: Robust Minimal Recursion Seman-
tics (Copestake, 2003), henceforth RMRS. This is
The goals of the SciBorg project, as its full ti-
a form of underspecified semantics resulting from
tle suggests, concern Information Extraction, in
the tradition of underspecification in symbolic Ma-
fact a range of IE applications are planned all
chine Translation. In RMRS, the degree of under-
starting from a corpus of published chemistry re-
specification is extended so that all stages of deep
search. However, the common path to those goals
and shallow parsing can be represented in a uni-
and the real challenge of the project is the appli-
form manner. It is therefore feasible to combine the
cation of Language Technology, and, in particu-
results of the deep and shallow parsing processes.
lar, linguistically motivated analysis techniques. In
Our presentation of the parsers above should have
practice, this involves the use of so-called ‘deep’
suggested one other path through the system archi-
parser, based on detailed formal descriptions of
tecture, in that the RASP tagger can provide the
the language with extensive grammars and lexi-
information necessary to run the unknown word
cons, and shallower alternatives, typically employ-
mechanism in the PET parser, what this requires is
ing stochastic models and the results of machine
a mapping from the part of speech tag to an abstract
learning. This multi-engine approach has been em-
lexical type in the ERG grammar definition.
ployed in Verbmobil (Ruland et al., 1998; Rupp
et al., 2000) and Deep Thought (Callmeier et al., The combination of results from deep and shal-
2004). low parsers is only part of the story, as these are
general purpose parsers. We will also need com-
While there have been considerable advances in the
ponents specialised for research text and, in par-
efficiency of deep parsers in recent years, there are
ticular, for Chemistry. The specialised nature of
a variety of ways that performance can be enhanced
the text of Chemistry research is immediately ob-
by making use of shallower results, e.g. as prepro-
vious, both in specialised terms and in sections
cessors or as a means of selecting the most interest-
which are no longer linguistic text, as we know
ing sections of a text for deeper analysis. In fact,
it. A sophisticated set of tools for the analysis of
the variety of different paths through the analysis
Chemistry research is being developed within the
architecture is a major constraint on the design of
SciBorg project, on the basis of those described
our formalism for linguistic annotations, and the
in Townsend et al. (2005). These range from NER
one which precludes the use of the major existing
(Named Entity Recognition) for Chemical terms,
frameworks for employing Language Technology
through the recognition of data sections, to spe-
in IE, such as GATE (http://gate.ac.uk).
cialised Chemistry markup with links to external
and automatically generated information sources.
7 Multiple Analysis Components
The main deep and shallow parsing components that we use have been developed over a period of time and represent both the state of the art and the result of considerable collaboration.

PET/ERG: PET (Callmeier, 2002) is a highly optimised HPSG parser that makes use of the English Resource Grammar (ERG) (Copestake and Flickinger, 2000). The ERG provides a detailed grammar and lexicon for a range of text types. The coverage of the PET/ERG analysis engine can be extended by an unknown word mechanism.

For the general parsing task the recognition of specialised terms and markup of data sections are the most immediate contribution. This does, though, to some extent exacerbate the problems of ambiguity, as deep, shallow and Chemistry NER processes now have different results for the same text segment. The overlap between ordinary English and Chemical terms can be trivially demonstrated, in as much as the prepositions "in" and "as" in sentence-initial position can easily be confused with In and As, the chemical symbols for indium and arsenic, respectively. Fortunately, some English words that moonlight as chemical symbols, such as I, are less common in research text, but this is only one example of the kinds of additional ambiguity. Formally, the word "lead" would have at least three analyses, as verb, noun and element, but the latter two are not orthogonal.

Raw Text: "<p>Come <i>here</i>!</p>"

Unicode character points:
. < . p . > . C . o . m . e . ∆ . < . i . > . h . e . r . e .
0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
. < . / . i . > . ! . < . / . p . > .
   16  17  18  19  20  21  22  23  24

Figure 3: character pointers (points shown as '.'; ∆ marks the space character)

Proceedings of the UK e-Science All Hands Meeting 2006, © NeSC 2006, ISBN 0-9553988-0-0
8 Standoff Annotation

Standoff annotation is an increasingly common solution for pooling different types of information about a text. Essentially, this entails a separation of the original text and the annotations into two separate files. There are some practical advantages in this: you are less likely to obscure the original text under a mesh of annotations, and less likely to proliferate numerous partially annotated versions of the object text. However, the true motivation for standoff annotation is that different annotation schemes will impose different structures on the text. To some extent XML forces a standoff annotation scheme, by enforcing a strict tree structure, so that even one linguistic annotation of a formatted text above the word level risks becoming incompatible with the XML DOM tree. As a simple example, we exhibit here a fragment of formatted text from a data section of a Chemistry paper, alongside its SciXML markup and a simple linguistic markup for phrase boundaries.

Formatted text:  calculated for C11H18O3
SciXML markup:   <it> calculated for </it>C<sb>11</sb>H<sb>18</sb>O<sb>3</sb>
Phrasal markup:  <v>calculated</v> <pp>for <ne>C11H18O3</ne></pp>

Here, the assignment of a simple phrase structure conflicts with the formatting directives for font face selection. The XML elements marked in bold face cannot be combined in the same XML dominance tree.

With each additional type or level of annotation the risk of such clashes increases, so a clean separation between the text and its annotations is helpful; but where do you separate the markup, and how do you maintain the link between the annotations and the text they address? As we have a common XML markup schema, including logical structure and some formatting information, we have a clear choice as to our text basis. The link between text and annotations is usually maintained by indexing. Here, there are some options available.

8.1 Indexing

The indexing of annotations means that each element of the standoff file encodes references to the position in the text file where the text it relates to occurs; typically this encodes a span of the text. Some form of indexing is a prerequisite for standoff annotation. The simplest indexing is by byte offset, encoding the position of a text segment in terms of the number of bytes in the file preceding its start and finish positions. This is universal, in that it can cope with all types of file content, but is not stable under variations in character encoding. For an XML source text, indexing by character offset is more useful, particularly if character entities have been resolved. Extraneous binary formats such as graphics will be encoded externally as <NDATA> elements. In fact, the variant of character offset we adopt involves numbering the Unicode character points, as shown in Figure 3.

A simple annotation of linguistic tokens would then produce three elements:

<w from='3' to='7'>Come</w>
<w from='11' to='15'>here</w>
<w from='19' to='20'>!</w>

XML also offers an additional structure for navigating around the file contents. We can take the XML DOM trees as a primary way of locating points in the text, and character positions as secondary, so that a character position is relative to a branch in the XML tree. This XPoint notation is demonstrated in Figure 4. The XML tree positions will not be affected by changes within an XML element, e.g. additional attributes, but will not be stable against changes in the DOM structure itself.

We currently make use of both character offset indexing and XPoint indexing in different modules. This requires conversion via a table which relates XPoints to character positions for each file. The initial choice of indexing mode for each subprocess was influenced by the availability of different styles of XML parser¹, but also informed by the ease of converting character information from non-XML linguistic tools. In the long term our aim would be to eliminate the use of character offset indexing.

¹ Computing character offsets is easier with an event-based parser that has access to the input parsed for each event. XPoints can be tracked more easily in a DOM parser.
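The three token elements discussed above can be reproduced mechanically from the raw text of Figure 3. A minimal sketch in Python (an illustration, not the SciBorg implementation) that locates tokens in the character data of the markup and reports their spans as Unicode character points:

```python
import re

def token_standoffs(xml_text):
    """Report word tokens in an XML string as standoff <w> elements whose
    from/to values are Unicode character points, i.e. end-exclusive offsets
    into the full markup string (as in Figure 3)."""
    # Blank out tags with spaces of equal length so positions are preserved.
    no_tags = re.sub(r"<[^>]*>", lambda m: " " * len(m.group()), xml_text)
    return ["<w from='%d' to='%d'>%s</w>" % (m.start(), m.end(), m.group())
            for m in re.finditer(r"[^\s]+", no_tags)]

for w in token_standoffs("<p>Come <i>here</i>!</p>"):
    print(w)
# <w from='3' to='7'>Come</w>
# <w from='11' to='15'>here</w>
# <w from='19' to='20'>!</w>
```

On the running example this yields exactly the three elements above; a real tokeniser would of course segment the character data more carefully.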
[Figure 4: XPoint notation. The DOM tree has ROOT(/) with child p(/1) over the character points ". h . e . r . e ."; the xpoint at the marked position x is "/1/1.4".]
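The conversion table relating XPoints to character positions (Section 8.1) can be illustrated with a small scan of the markup. This is a hypothetical sketch under strong simplifying assumptions (well-formed tags without attributes, comments or self-closing elements), not a full XPointer implementation:

```python
import re

def build_xpoint_table(xml_text):
    """Map element tree positions ('/1' = first child of the root,
    '/1/1' = its first child, ...) to the character offset at which
    each element's content starts."""
    table, path, child_count = {}, [], [0]
    for m in re.finditer(r"<(/?)(\w+)>", xml_text):
        if m.group(1):                 # closing tag: leave this element
            path.pop()
            child_count.pop()
        else:                          # opening tag: next child at this depth
            child_count[-1] += 1
            path.append(child_count[-1])
            child_count.append(0)
            table["/" + "/".join(map(str, path))] = m.end()
    return table

def xpoint_to_offset(xpoint, table):
    """Resolve 'node.point' (e.g. '/1/1.4') to a character offset."""
    node, point = xpoint.rsplit(".", 1)
    return table[node] + int(point)

table = build_xpoint_table("<p>Come <i>here</i>!</p>")
print(table)                               # {'/1': 3, '/1/1': 11}
print(xpoint_to_offset("/1/1.4", table))   # 15: the point after 'here'
```

On the example of Figures 3 and 4, "/1/1.4" resolves to character point 15, the point following "here", consistent with the token annotations shown earlier.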
Using the RASP tagger, the tagset is based on CLAWS. The POS tagger also provides the option of an RMRS output.

9.4 Chemical Terms

We use the NER functionality in OSCAR-3 (Corbett and Murray-Rust, 2006) to provide an annotation for Chemical terms.

<annot type="oscar" id="o554"
       from="/1/5/6/27/51/2/83.1"
       to="/1/5/6/27/51/2/88/1.1" >
  <slot name="type">compound</slot>
  <slot name="surface">C11H18O3</slot>
  <slot name="provenance">formulaRegex</slot>
</annot>

This example differs from the annotations shown above in two obvious respects: the indexing is given in XPoints rather than character offsets, and the XML element has content rather than a value attribute. This representation provides information about the way that the named entity recognition was arrived at. For use in further stages of analysis, this content should be made compatible with an RMRS representation.

9.5 RMRS Annotations

For the sake of completeness we include an example of an RMRS annotation in Figure 5. This represents the result for a partial analysis. The content of this element is represented in the XML form of RMRS annotation. However, the mechanisms which make RMRS highly suitable for representing arbitrary levels of semantic underspecification in a systematic and extensible way, e.g. allowing monotonic extension to a more fully specified RMRS, up to the representation of a full logical formula, make this representation relatively complex. Copestake (2003) provides a detailed account of the RMRS formalism.

10 SAF as a Common Interface

A SAF file pools linguistic annotations from all levels of language processing and from distinct parsing strategies. In this role it provides a flexible interface in a heterarchical processing architecture that is not committed to a single pipelined path through a fixed succession of processes. While this is reminiscent of a blackboard architecture, or of more localised pool architectures, it is only a static representation of the linguistic annotations. It does not provide any specific mechanisms for accessing the information in the annotations, nor does it require any particular communications architecture in the processing modules. In fact, there may have been more efficient ways of representing a graph or lattice of linguistic annotations, were it not for the fact that the linguistic components we employ were defined in a range of different programming languages, so that an external XML interface structure has priority over any shared data structure. Given this fact, SAF is a highly appropriate choice for the common interface that is a key feature of any multi-engine analysis architecture.

11 Conclusions

We have presented two interface structures used in the architecture of the SciBorg project. Each of these occupies a crucial position in the architecture. SciXML provides a uniform XML encoding for research papers from various publishers, so that all subsequent language processing only has to interface with SciXML constructs. SAF provides a common representation format for the results of various Language Technology components. We have emphasised the flexibility of these interfaces. For SciXML, this consists in conversion from publishers' markup following a range of DTDs, as well as
other formats, including HTML, LaTeX and even PDF. For SAF, the primary mark of flexibility is in allowing partial results from multiple parsers to be combined at a number of different levels.

12 Acknowledgements

We are very grateful to the Royal Society of Chemistry, Nature Publishing Group and the International Union of Crystallography for supplying papers. This work was funded by EPSRC (EP/C010035/1) with additional support from Boeing.

References

Briscoe, Ted, and John Carroll. 2002. Robust accurate statistical annotation of general text. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC-2002).

Callmeier, U., A. Eisele, U. Schäfer, and M. Siegel. 2004. The DeepThought Core Architecture Framework. In Proceedings of LREC-2004, 1205–1208. Lisbon, Portugal.

Callmeier, Ulrich. 2002. Pre-processing and encoding techniques in PET. In Stephan Oepen, Daniel Flickinger, Jun'ichi Tsujii, and Hans Uszkoreit, eds., Collaborative Language Engineering: a case study in efficient grammar-based processing. Stanford: CSLI Publications.

Clement, L., and E. V. de la Clergerie. 2005. MAF: a morphosyntactic annotation framework. In Proceedings of the 2nd Language and Technology Conference. Poznan, Poland.

Copestake, Ann. 2003. Report on the design of RMRS. DeepThought project deliverable.

Copestake, Ann, Peter Corbett, Peter Murray-Rust, C. J. Rupp, Advaith Siddharthan, Simone Teufel, and Ben Waldron. 2006. An Architecture for Language Technology for Processing Scientific Texts. In Proceedings of the 4th UK E-Science All Hands Meeting. Nottingham, UK.

Copestake, Ann, and Dan Flickinger. 2000. An open-source grammar development environment and broad-coverage English grammar using HPSG. In Proceedings of the Second conference on Language Resources and Evaluation (LREC-2000), 591–600.

Corbett, Peter, and Peter Murray-Rust. 2006. High-throughput identification of chemistry in life science texts. In Proceedings of the 2nd International Symposium on Computational Life Science (CompLife '06). Cambridge, UK.

Hollingsworth, Bill, Ian Lewin, and Dan Tidhar. 2005. Retrieving Hierarchical Text Structure from Typeset Scientific Articles – a Prerequisite for E-Science Text Mining. In Proceedings of the 4th UK E-Science All Hands Meeting, 267–273. Nottingham, UK.

Ruland, Tobias, C. J. Rupp, Jörg Spilker, Hans Weber, and Karsten L. Worm. 1998. Making the Most of Multiplicity: A Multi-Parser Multi-Strategy Architecture for the Robust Processing of Spoken Language. In Proceedings of the 1998 International Conference on Spoken Language Processing (ICSLP 98), 1163–1166. Sydney, Australia.

Rupp, C. J., Jörg Spilker, Martin Klarner, and Karsten Worm. 2000. Combining Analyses from Various Parsers. In Wolfgang Wahlster, ed., Verbmobil: Foundations of Speech-to-Speech Translation, 311–320. Berlin: Springer-Verlag.

Teufel, S., and N. Elhadad. 2002. Collection and linguistic processing of a large-scale corpus of medical articles. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC-2002).

Teufel, Simone. 1999. Argumentative Zoning: Information Extraction from Scientific Text. Ph.D. thesis, School of Cognitive Science, University of Edinburgh, Edinburgh, UK.

Teufel, Simone, and Marc Moens. 1997. Sentence extraction as a classification task. In Inderjeet Mani and Mark T. Maybury, eds., Proceedings of the ACL/EACL-97 Workshop on Intelligent Scalable Text Summarization, 58–65.

Townsend, Joe, Ann Copestake, Peter Murray-Rust, Simone Teufel, and Chris Waudby. 2005. Language Technology for Processing Chemistry Publications. In Proceedings of the fourth UK e-Science All Hands Meeting (AHM-2005). Nottingham, UK.

Waldron, Benjamin, and Ann Copestake. 2006. A Standoff Annotation Interface between DELPH-IN Components. In The fifth workshop on NLP and XML: Multi-dimensional Markup in Natural Language Processing (NLPXML-2006).

Waldron, Benjamin, Ann Copestake, Ulrich Schäfer, and Bernd Kiefer. 2006. Preprocessing and Tokenisation Standards in DELPH-IN Tools. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC-2006). Genoa, Italy.
Abstract
Electronic patient records are typically optimised for delivering care to a single patient. They omit significant
information that the care team can infer whilst including much of only transient value. They are secondarily an
indelible legal record, comprised of heterogeneous documents reflecting local institutional processes as much as, or
more than, the course of the patient’s illness.
By contrast, the CLEF Chronicle is a unified, formal and parsimonious representation of how a patient’s illness
and treatments unfold through time. Its primary goal is efficient querying of aggregated patient data for clinical
research, but it also supports summarisation of individual patients and resolution of co-references amongst clinical
documents. It is implemented as a semantic network compliant with a generic temporal object model whose
specifics are derived from external sources of clinical knowledge organised around ontologies.
We describe the reconstruction of patient chronicles from clinical records and the subsequent definition and
execution of sophisticated clinical queries across populations of chronicles. We also outline how clinical
simulations are used to test and refine the chronicle representation. Finally, we discuss a range of engineering and
theoretical challenges raised by our work.
certain types of information rarely formalised in current records, e.g.
• the precise nature of any imaging investigation (e.g. which anatomical structure was imaged)
• the result of any investigation (what was found, what it signifies)
• why any investigation was ordered
• earlier provisional diagnoses
• what symptoms the patient experienced from the disease, or its treatment
• when treatment was changed and why
• abstractions such as 'relapse' or 'stage' of disease

Our goal is to define a representation supporting increasingly detailed information about patients and allowing us to evaluate its usefulness. We call this novel representation the CLEF Chronicle.

2. Properties of CLEF Chronicle

The CLEF Chronicle for an individual patient seeks to represent their clinical story entirely as a network of typed instances and their interrelations. Figure 1 illustrates the general flavour of what we are trying to represent: a patient detects a painful mass in their breast, as a result of which a clinic appointment occurs, where drug treatment for the pain is arranged and also a biopsy of the (same) mass in the (same) breast. The first clinic arranges a follow-up appointment, to review the (same) biopsy, which finds cancer in the (same) mass. This (same) finding is in turn the indication for radiotherapy to the (same) breast.

[Figure 1: Informal view of patient-history fragment (note: time-flow is roughly left to right). The network links instances such as Pain, Drug, Breast, Clinic, Biopsy, Radio, Mass and Cancer through relations such as indication, about, locus, plans and finding.]

In addition to the obvious structural difference between this representation and that of traditional electronic records, any clinical content represented as a CLEF Chronicle should have two central properties:

Parsimony – traditional patient records contain multiple discrete mentions of relevant instances in the real world (the tumour, the breast etc). A CLEF Chronicle should have only one occurrence of each.

Explicitness – traditional patient records may only imply clinically important information (e.g. the fact of relapse). This must be explicit within a CLEF Chronicle representation.

3. Functions of CLEF Chronicle

Firstly, and most importantly, the CLEF Chronicle is intended to support more detailed, and more expressive, querying of aggregations of patient stories than is currently possible, whilst at the same time improving the efficiency of complex queries. More detailed, because the Chronicle is more explicit: we can now ask, e.g., how many patients 'relapsed' within a set period of treatment. More expressive, because the typing information associated with each Chronicle instance is drawn from a rich clinical ontology, such that queries may be framed in terms of arbitrarily abstract concepts: we can ask how many cancers of the lower limb were recorded, and expect to retrieve those of all parts of the lower limb. This is more efficient, because the traditional organisation of patient records tends to require much serial or nested processing of records.

Secondly, an individual Chronicle can serve as an important knowledge resource during its own reconstruction from available electronic sources of traditional clinical records. In particular, a Chronicle can help resolve the frequent co-references and repeated references to real-world instances that characterise traditional records. For example, heuristic and other knowledge linked to an ontology of Chronicle data types can be used to reject any request to instantiate more than one identifier for a [Brain], or any attempt to merge two mentions of [Pregnancy] separated by more than 10 months.

Thirdly, the Chronicle is intended to serve as a knowledge resource from which summarising information or abstractions may be inferred. For example, deducing 'anaemia' from a run of discrete low blood counts obtained from the traditional record, or 'remission' from several years of clinical inactivity and 'relapse' when this is followed by a flurry of tests and a new course of chemotherapy. Similarly, where the record does not explicitly say why drug X was given, reasoners browsing a Chronicle may identify …
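The instance-level constraints described above (at most one [Brain] per patient; no merging of [Pregnancy] mentions more than about ten months apart) can be sketched as follows. Everything here is illustrative; the type names, the window and the class are hypothetical stand-ins, not CLEF's actual rules or code:

```python
from datetime import date, timedelta

# Illustrative ontology-linked constraints, keyed by Chronicle data type.
SINGLETON_TYPES = {"Brain"}                          # at most one instance per patient
MERGE_WINDOWS = {"Pregnancy": timedelta(days=304)}   # roughly ten months

class Chronicle:
    def __init__(self):
        self.instances = []                          # (type, identifier, date) triples

    def instantiate(self, type_, identifier, when):
        """Reject a second identifier for a singleton type such as [Brain]."""
        if type_ in SINGLETON_TYPES and any(t == type_ for t, _, _ in self.instances):
            raise ValueError("a second %s instance is not allowed" % type_)
        self.instances.append((type_, identifier, when))

    def mergeable(self, type_, date_a, date_b):
        """Refuse to merge two mentions separated by more than the type's window."""
        window = MERGE_WINDOWS.get(type_)
        return window is None or abs(date_a - date_b) <= window

c = Chronicle()
c.instantiate("Brain", "brain-1", date(2005, 1, 10))
print(c.mergeable("Pregnancy", date(2004, 1, 1), date(2005, 3, 1)))  # False: > 10 months
```

A second call to `instantiate("Brain", ...)` on the same chronicle raises an error, mirroring the heuristic rejection described in the text.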
…consumer in response to local or systemic symptomatology arising from either the tumour or its treatment, and the actions of the clinician as a provider of investigations of varying sensitivity and treatments of variable effectiveness. The modelling process involves dynamic event-driven interactions through simulated time of all three actors (disease, patient and clinician). The simulator generates simulated clinical stories of similar content and complexity to real life, represented both as chronicles and as note-form text. Whilst the simulator models were engineered to superficially approximate real populations of patients, they do not pretend to possess any predictive properties with respect to, for example, what effect any change in treatment efficacy, or clinical service reconfiguration, might have on real population outcomes. Rather, for our stated purpose of testing the Chronicle as a means to represent individual patient stories, it is sufficient to evaluate whether the complexity and content of individual generated clinical stories approximates that of real clinical stories: whether they have surface truth-likeness, or verisimilitude.

Judging the truth-likeness of generated stories is necessarily subjective and imprecise. Formal evaluation would be inherently problematic, not least because of the difficulty in obtaining sufficient time from specialist clinicians to serve as the judges. At the present time, therefore, we base our claim that the simulated stories approximate real clinical stories on the following evidence:
• The prototype simulator was written by a clinician (JR) and is based on a detailed model of tumour, patient and clinician behaviour and their mutual dynamic interactions.
• … parameters (Figure 3) did not identify any significantly implausible patients.
• Although not designed or required to produce simulated populations of patients with characteristics mimicking those of real populations, 10-year Kaplan-Meier curves (a standard tool for comparing cancer survival) for populations of simulated patients are similar to those for populations of real patients (Figure 4), including those curves representing subanalyses for tumour grade or disease stage at presentation.
• Two senior oncology physicians invited to comment on the generated narrative output of simulators suggested only minor changes (e.g. both AST and GGT should appear as liver damage markers).
• Queries similar to those forming the central research question in typical registered cancer trials may be constructed and executed against the simulated chronicle repository, as well as more detailed questions such as 'what effect on long term survival accrues from interrupting a course of chemotherapy in order to go on holiday'.

[Figure 4: Overall survival curves for real and simulated breast cancer patients]
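The Kaplan-Meier estimate mentioned above is simple enough to state compactly. A generic sketch (not the project's analysis code), where `events` flags deaths versus censored observations:

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate: at each distinct event time t,
    S is multiplied by (1 - d/n), where d is the number of deaths at t
    and n the number still at risk. 'events' is True for a death,
    False for a censored observation."""
    data = sorted(zip(times, events))
    n_total = len(data)
    s, curve, i = 1.0, [], 0
    while i < n_total:
        t = data[i][0]
        same_t = [e for tt, e in data if tt == t]
        deaths = sum(same_t)
        if deaths:
            s *= 1.0 - deaths / (n_total - i)   # n_total - i = number at risk
            curve.append((t, s))
        i += len(same_t)
    return curve

print(kaplan_meier([1, 2, 2, 3], [True, True, False, True]))
# [(1, 0.75), (2, 0.5), (3, 0.0)] up to floating-point rounding
```

In practice one would plot such curves per population (real vs. simulated) and per subanalysis, which is how the comparison in Figure 4 is made.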
…(and on what evidence) a patient changed status from being in remission to being in relapse.

Patient notices a problem aged 46 (wk 209)
** Patient presents aged 45.8 (wk 212) **
Ix:Xray chest: normal
Ix:Clinical examination showed: a new 4.06cm breast mass
Follow-up in 3 weeks with results to plan treatment
** Initial Treatment Consult: week 215 **
** Surgery **
4.26 cm lesion was incompletely excised
Ix:Histopathology report on excision biopsy
Grade 8 invasive ductal adenocarcinoma of breast.
Oestrogen receptor negative
21 nodes were positive
Six weeks of radiotherapy to axilla should follow surgery
Secondaries: commence 6 weeks of radiotherapy.
Secondaries: commence 6 weeks of systemic chemotherapy.
High grade tumour: for chemotherapy
Tumour is hormone insensitive. No hormone antagonist
Follow-up in 8 weeks
** Consult to assess response to treatment : week 223 **
Ix:Xray chest: normal
Ix:Clinical examination normal
No evidence of further tumour.
See in 8 weeks
Radiotherapy cycle given. 5 remaining
Radiotherapy cycle 5 deferred because patient on holiday
Radiotherapy cycle given. 4 remaining
<<< SNIP >>>
** Consult to assess response to treatment : week 428 **
Ix:Xray chest: abnormal.
No new lesions found
2 old lesions found
Largest measures 1cm, smallest is 0.7cm
High alkaline phosphatase: for bone scan
Ix:Bone Scan: abnormal.
No new lesions found
17 old lesions found
Largest measures 1.2cm, smallest is 0.7cm
Patient confused: for CTScan Brain
Ix:CTScan of brain: abnormal.
No new lesions found
6 old lesions found
Largest measures 1.4cm, smallest is 0.8cm
High grade unresponsive tumour. Palliative Care
See in 8 weeks
** Died aged 50 from cancer (week 475)
......................................
Grade 8 - 23 local mets, 70 distant metastasis
Seen in clinic 19 times and 4 courses of chemo prescribed
Total tumour mass 1374700 in 200 tumours of which 48 detected.
ESR=160 Creatinine=559
41 tumours were destroyed or removed
54 bony mets, total mass 740941 d=5.41cm AlkPhos=4811

Figure 5: Extract of simulator 'short note' output

Using the simulator output, we may cautiously estimate the richness and complexity that 'real' clinical stories – and aggregations of them – might take on if represented as Chronicles. An analysis of a simulated population of 986 patient chronicles revealed:
• Individual patient chronicles comprise an average of 385 object instances and 753 semantic relations
• The most complicated patient story required 2295 instances and 4636 relationships
• The aggregated repository, comprising 986 discrete networks, contained 382,103 instances and 742,595 semantic relations.

The prototype simulator and documentation are available for download from http://www.clinical-escience.org/simulator.html. Subsequent releases of the reimplementation may be made available through the same URL in future.

We have now developed a more generalised Java re-implementation of the simulator that is capable of creating richer patient records, in a format suitable for populating the Chronicle Object Model.

6. Future Challenges

Outstanding issues concerning the Chronicle Representation include:
• How to store very large numbers of Chronicle Object Model instances persistently, such that queries may be constructed, and efficiently executed, over aggregations of such instances.
• How to implement such a query mechanism involving both ontological and temporal reasoning.
• How to deal with data arising from the 'chroniclisation' of real records, which will almost certainly be both incomplete and 'fuzzy'.

Other issues concern the presentation of the chronicle to the clinician, including:
• How the clinical content of a reconstructed chronicle can be visualised in order to validate its content against the more traditional patient record data from which it has been derived.
• How complex queries over sets of patient chronicles can best be formulated by the ordinary clinician.

7. Discussion

The idea of representing clinical information as some form of semantic net, particularly focussing on why things were done, is not new: echoes of it can be found in Weed's work on the problem-oriented record [7]. Ceusters and Smith more recently advocated the …
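The note-form output in Figure 5 is regular enough that its timestamped event headers can be pulled out mechanically, which is one small step towards the 'chroniclisation' of such text. A purely illustrative sketch (not part of CLEF):

```python
import re

note = """** Patient presents aged 45.8 (wk 212) **
Ix:Xray chest: normal
** Initial Treatment Consult: week 215 **
** Consult to assess response to treatment : week 428 **
** Died aged 50 from cancer (week 475)"""

def parse_note_events(text):
    """Extract (week, description) pairs from the '** ... **' event
    headers of the simulator's note-form output."""
    pattern = re.compile(r"\*\* (.+?)[ :]*\(?(?:week|wk) (\d+)\)? ?\*?\*?$")
    events = []
    for line in text.splitlines():
        m = pattern.match(line.strip())
        if m:
            events.append((int(m.group(2)), m.group(1)))
    return events

print(parse_note_events(note))
```

On the fragment above this yields the four consult/outcome events with their simulated week numbers; lines without a week stamp (e.g. individual investigation results) are left for a finer-grained parser.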
8. References
1. Taweel A, Rector, AL, Rogers J, Ingram D, Kalra D,
Gaizauskas R, Hepple M, Milan J, Power R, Scott D,
Singleton P. (2004) CLEF – Joining up Healthcare
with Clinical and Post-Genomic Research. Current
Perspectives in Healthcare Computing:203-211
2. Harkema H, Roberts I, Gaizauskas R, Hepple M.
(2005) A web service for biomedical term look-up.
Comparative and Functional Genomics 6;1-2:86-93
3. Rector A, Nowlan W, Kay S. (1991) Foundations for
an Electronic Medical Record. Methods of
Information in Medicine;30:179-86.
4. Beale T. (2003) Archetypes and the EHR. Stud
Health Technol Inform. 2003;96:238-44
5. Grenon P, Smith B (2004) SNAP and SPAN:
Towards Dynamic Spatial Ontology. Spatial
Cognition and Computation 4;1:69-104
6. Baader F, Calvanese D, McGuinness D, Nardi D,
Patel-Schneider P. (2003) The Description Logic
Handbook. Cambridge University Press. ISBN:
0521781760
7. Weed LI (1969) Medical records medical education,
and patient care. The problem-oriented record as a
basic tool. Cleveland, OH: Case Western Reserve
University
8. Ceusters W, Smith B. (2005) Strategies for referent
tracking in Electronic Health Records. Journal of
Biomedical Informatics (in press).
9. Tsarkov D, Horrocks I. (2006) FaCT++ Description
Logic Reasoner: System Description. Proc of Int.
Joint Conf. on Automated Reasoning (IJCAR 2006)
(in press).
Abstract
The primary aim of this paper is to report the development of a new testbed, the BIOPATTERN
Grid, which aims to facilitate secure and seamless sharing of geographically distributed
bioprofile databases and to support analysis of biopatterns and bioprofiles to combat major
diseases such as brain diseases and cancer within a major EU project, BIOPATTERN
(www.biopattern.org). The main objectives of this paper are 1) to report the development of the
BIOPATTERN Grid prototype and implementation of BIOPATTERN grid services (e.g. data
query/update and data analysis for brain diseases); 2) to illustrate how the BIOPATTERN Grid
could be used for biopattern analysis and bioprofiling for early detection of dementia and for
assessment of brain injury on an individual basis. We highlight important issues that would arise
from the mobility of citizens in the EU and demonstrate how the Grid can play a role in personalised
healthcare by allowing the sharing of resources and expertise to improve the quality of care.
…data types, e.g. genomics and proteomic information, and biosignals such as the electroencephalogram (EEG) and Magnetic Resonance Imaging (MRI). A bioprofile is a personal 'fingerprint' that fuses together a person's current and past medical history, biopatterns and prognosis. It combines data, analysis, and predictions of possible susceptibility to diseases. It will drive individualisation of care.

…chosen to implement Grid middleware functions. Condor [6] is used for job queuing, job scheduling and to provide high-throughput computing. The grid resources layer contains computational resources, data resources, knowledge resources (e.g. algorithms pool) and networks.
3. Implementation of BIOPATTERN
Grid Services The query service is implemented in four levels
from the top customer GUI to the bottom grid
resources level as shown in Figure 4.
The BIOPATTERN Grid provides both high
level and low level services. The high level The customer GUI level provides interfaces to
services are implemented mainly for end users allow end-users to seamlessly access grid-
(e.g. clinicians or researchers), who have enabled services with graphic visions. The
permissions to use specified grid-enabled services (e.g. to access distributed resources or to compare results from existing algorithms) without any grid knowledge (e.g. underlying grid techniques, locations of resources and detailed access constraints to resources). The low level services are implemented and provided mainly for developers of grid services/applications, to access BIOPATTERN Grid services directly or to develop interfaces with the BIOPATTERN Grid (e.g. the current AmI-Grid project within BIOPATTERN is seeking the integration of AmI's wireless data acquisition system with the BIOPATTERN Grid).

3.1 High Level Services

Three high level services have been implemented in the BIOPATTERN Grid. As shown in Figure 3, the services are clinical information query, clinical information update and EEG analysis. The BIOPATTERN Grid Portal serves as the interface between an end user (e.g. a clinician or a researcher) and the BIOPATTERN Grid. It allows an end user to access these services from anywhere with Internet access (via a web browser).

Figure 3. High level services provided by the BIOPATTERN Grid (clinical info query service, clinical info update service, EEG analysis service)

3.1.1 Clinical Information Query Service

This service is designed around a scenario in which an end-user (e.g. a clinician) wants to query the information of a patient and to download the patient's EEG data for off-line analysis (the patient's information and EEG data are distributed across the BIOPATTERN Grid network). To achieve this, the end-user goes through certain steps via the BIOPATTERN Grid Portal, e.g. 1) query for a specified patient; 2) select EEG data files of the patient to download.

The query interface and file download interface are designed to enable users to access the required clinical information without any grid knowledge, and to allow users to simply follow the defined steps to use this service. The functionalities provided at this level include handling HTTPS requests (e.g. obtaining requests from end-users via e-forms) and forwarding such requests to the lower level. JSP and HTML are the languages used to implement these interfaces, which are held by Tomcat containers.

Figure 4. Implementation of clinical information query service (customer GUI level: query interface, file download interface; grid application level: query classes, file download classes; grid middleware level: OGSADAI-WSRF API, GT4 core API, GridFTP API; grid resource level: clinical info databases, metadata databases and other distributed data resources)

For the grid application level, two types of classes, query classes and file download classes, are responsible for translating the specified requests from the interfaces in the customer GUI level into detailed action requisitions and sending those requisitions on to grid servers for execution. The query classes provide concurrent access to distributed databases via OGSADAI-WSRF in order to obtain information such as descriptions of specified patients, EEG data locations, etc. The file download classes mainly offer file transfers from the grid nodes which contain the specified EEG data to grid nodes which can be accessed by end-users.

The grid middleware level contains various APIs, including the OGSADAI-WSRF API, GT4 core API, GridFTP API, etc. These APIs are responsible for handling action requisitions that come from the grid application level and realising them by accessing distributed data resources.

The grid resource level holds distributed data resources, which are exposed as low level services held by Globus containers. Data resources that support the query service include clinical information databases (e.g. databases for personal information of patients and EEG data) and metadata databases (e.g. databases for descriptions of EEG data).
Proceedings of the UK e-Science All Hands Meeting 2006, © NeSC 2006, ISBN 0-9553988-0-0
3.1.2 Clinical Information Update Service

This service is designed around a scenario in which an end-user (e.g. a clinician) needs to update an existing patient's clinical records (e.g. a new …).

3.1.3 EEG Analysis Service

This service is designed around a scenario in which an end-user (e.g. a clinician) wants to use the automated analysis services provided by the BIOPATTERN Grid to help with the detection/diagnosis of dementia or with the assessment of brain injury (two services are provided currently). To achieve this, end-users go through several steps via the portal, e.g. 1) select analysis either for dementia or for brain injury; 2) query for a specified patient; 3) select the patient's EEG data files for analysis and submit analysis jobs; 4) choose the way of viewing the analysis results (e.g. in canonograms and/or bargraphs for dementia).

The EEG analysis service is also implemented in four levels, as presented in Figure 6. [Figure 6: four-level implementation of the EEG analysis service; the customer GUI level comprises a query interface, an analysis interface and a result display interface.]

Unlike the other two high level services, the EEG analysis service holds both computational and data resources. The computational resources cover both High Throughput Computing (HTC) and High Performance Computing (HPC) resources. The Condor pool currently plays a key role in HTC-based EEG analysis, while the Globus-based computational resources support the HPC-based EEG analysis at present.
3.2 Low Level Services

Two types of low level services have been considered in the implementation: BIOPATTERN data services and computational services.

3.2.1 BIOPATTERN Data Services

The goal of providing the data services is to support certain features of clinical data access and integration over distributed heterogeneous data resources, such as database federation and data transformation. Thus far we have mainly investigated the area of database federation and implemented some basic data services, such as the generic query and update services, based on OGSADAI-WSRF, an OGSADAI distribution compliant with the WSRF specifications.

3.2.2 Computational Services

The computational services are designed to support computation-intensive applications, such as medical image analysis and protein sequence comparison, and are mainly implemented based on WS-GRAM. On any BIOPATTERN grid node, a grid user can use this type of service to submit jobs to any permitted HPC and/or HTC resources and then get the expected results back. The completion of a grid job may require several steps, i.e. 1) creation of the job; 2) stage in (e.g. transmission of input files); 3) execution of the job; 4) stage out (e.g. transmission of output files); 5) job cleanup.

Currently, all computational services are held by Globus containers located at certain grid nodes (e.g. nodes at UoP and TSI) in the BIOPATTERN Grid.
…are normally suggested for elderly people in order to diagnose dementia at an early stage [7].

Due to the mobility of citizens, a patient's serial EEG recordings can be located in different clinical centres, possibly across different countries. We envisage that grid computing can play a role in the early detection of dementia when dealing with such geographically distributed data resources.

To illustrate the concept of the BIOPATTERN Grid for early detection of dementia, a hypothetical patient pool consisting of 400 subjects, each with three EEG recordings, was created. These data are a hypothetical representation of recordings taken at three time instances, akin to longitudinal studies carried out in reality. The datasets are distributed at the TSI, TUT and UoP sites. The FD analysis algorithm is used to compute the FD of each dataset.

Through the Portal, an end-user (e.g. a clinician) can select a patient, e.g. Mike, and the algorithm is used to perform the analysis. Upon submission, Mike's information, including his serial EEG recordings (located at UoP, TSI and TUT, respectively), is retrieved and analyzed. Results can be shown in canonograms (see Figure 8), where changes in the EEG indicate Mike's condition. The canonograms (from left to right) show the FD values of Mike's EEG taken at time instances 1 (data at TSI), 2 (data at TUT) and 3 (data at UoP) respectively. The FD values for the left canonogram indicate that Mike is in a normal condition with high brain activity, whereas the FD values for the right canonogram indicate probable Alzheimer's disease with low brain activity. The middle one shows the stage in between. Changes in the FD values provide some indication of the disease progression. This can help clinicians to detect dementia at an early stage, and to monitor its progression and response to treatment.

4.2 BIOPATTERN Grid for assessment of brain injury

The second application of the BIOPATTERN Grid is to assess the severity of brain injury by analysing evoked potentials (EPs) buried in raw EEG recordings. The assessment process consists of two stages: 1) an Independent Component Analysis (ICA) [8] algorithm is used to extract single-trial evoked potential activity of clinical interest and discard irrelevant components such as background EEG and artefacts; 2) LORETA (LOw Resolution Electromagnetic TomogrAphy) [9] is used to localise the source of the brain activity. These methods are implemented in Matlab, compiled as executables (using the Matlab run-time library) and distributed over the Condor pool to speed up the time-consuming analysis. Figure 9 shows an example of the results of such an analysis via the portal, with topography maps (for the first two ICA components only) of one normal subject (top) and one patient (bottom). The head model shown is assumed to be an 8 cm sphere and is represented with 8 slices (lower slices to the left, higher to the right). Through this service, a clinician can easily use an advanced EEG analysis algorithm (e.g. ICA-LORETA) to analyze a patient's raw EEG data for assessing brain injury.

5. Conclusion

In this paper, we have presented a new testbed, the BIOPATTERN Grid, which aims to share geographically distributed computation, data and knowledge resources for combating major diseases (e.g. brain diseases). We illustrated the development of the BIOPATTERN Grid prototype, the implementation of the BIOPATTERN Grid services, and two pilot applications for early detection of dementia and for assessment of brain injury.

The BIOPATTERN Grid is an ongoing project and the results presented here are limited in scale. In the near future, the prototype will be extended to include more grid nodes (both on Globus and on Condor), more computing resources (e.g. connecting with HPC clusters at UNIPI), more grid applications (e.g. cancer diagnosis and prognosis) and more services (e.g. data transformation), as well as integration with other projects within BIOPATTERN, such as the AmI-Grid project [10], for connection with wireless data acquisition networks, and grid.it [11][12], to provide crawling services for information query. In the long term, we also seek the integration of the BIOPATTERN Grid with other regional or national Grid projects/networks (e.g. the EU's EGEE).

Due to the nature of healthcare, the BIOPATTERN Grid will need to address several issues, such as regulatory, ethical, legal, privacy, security and QoS concerns, before it can move from research prototype to actual clinical tool.
Acknowledgement

The authors would like to thank Dr. C. Bigan of EUB, Romania, for providing the original EEG data sets and Dr. G. Henderson for providing the algorithms for computing the FD index. We acknowledge the financial support of the European Commission (the BIOPATTERN Project, Contract No. 508803) for part of this work.

References

[1]. S. R. Amendolia, F. Estrella, C. D. Frate, J. Galvez, W. Hassan, T. Hauer, D. Manset, R. McClatchey, M. Odeh, D. Rogulin, T. Solomonides and R. Warren, "Development of a Grid-based Medical Imaging Application", Proceedings of Healthgrid 2005, From Grid to Healthgrid, 2005, pp. 59-69.

[2]. S. Lloyd, M. Jirotka, A. C. Simpson, R. P. Highnam, D. J. Gavaghan, D. Watson and J. M. Brady, "Digital mammography: a world without film?", Methods of Information in Medicine, Vol. 44, No. 2, pp. 168-169, 2005.

[3]. J. S. Grethe, C. Baru, A. Gupta, M. James, B. Ludaescher, M. E. Martone, P. M. Papadopoulos, S. T. Peltier, A. Rajasekar and S. Santini, "Biomedical Informatics Research Network: Building a National Collaboratory to Hasten the Derivation of New Understanding and Treatment of Disease", Proceedings of Healthgrid 2005, From Grid to Healthgrid, 2005, pp. 100-109.

[4]. K. Ichikawa, S. Date, Y. Mizuno-Matsumoto and S. Shimojo, "A Grid-enabled System for Analysis of Brain Function", Proceedings of CCGrid 2003 (3rd IEEE/ACM International Symposium on Cluster Computing and the Grid), May 2003.

[5]. I. Foster, "Globus Toolkit Version 4: Software for Service-Oriented Systems", Proceedings of the IFIP International Conference on Network and Parallel Computing, 2005, pp. 2-13.

[6]. http://www.cs.wisc.edu/condor/hawkeye

[7]. G. T. Henderson, E. C. Ifeachor, H. S. K. Wimalaratna, E. Allen and N. R. Hudson, "Prospects for routine detection of dementia using the fractal dimension of the human electroencephalogram", MEDSIP00, pp. 284-289, 2000.

[8]. T.-W. Lee, M. Girolami and T. J. Sejnowski, "Independent component analysis using an extended infomax algorithm for mixed sub-Gaussian and super-Gaussian sources", Neural Computation, 1999, 11(2): 606-633.

[9]. R. D. Pascual-Marqui, "Review of methods for solving the EEG inverse problem", International Journal of Bioelectromagnetism, 1999, 1: 75-86.

[10]. M. Lettere, D. Guerri and R. Fontanelli, "Prototypal Ambient Intelligence Framework for Assessment of Food Quality and Safety", 9th Int. Congress of the Italian Association for Artificial Intelligence (AI*IA 2005), Advances in Artificial Intelligence, pp. 442-453, Milan, Italy, Sep. 21-23, 2005.

[11]. Grid.it: "Enabling Platforms for High-Performance Computational Grids Oriented to Scalable Virtual Organizations", http://grid.it/.

[12]. K. Cerbioni, E. Palanca, A. Starita, F. Costa and P. Frasconi, "A Grid Focused Community Crawling Architecture for Medical Information Retrieval Services", 2nd Int. Conf. on Computational Intelligence in Medicine and Healthcare, CIMED'2005.
[Architecture figure: the BIOPATTERN Grid testbed. Globus grid nodes with Bioprofile databases at TSI, TUT, UoP, UNIPI and Synapsis, connected by LANs and the Internet; Condor pools at UoP and UNIPI; an algorithms pool and the web server (Grid Portal) at UoP; a wireless acquisition network at TSI; end-users reach the Grid Portal over HTTPS.]
Figure 8. Canonograms showing the distribution of FD values for EEG analysis for dementia
Figure 9. Topography maps of normal subjects and patients with brain injury
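The canonograms of Figure 8 summarise a per-recording fractal dimension (FD) index. The actual FD algorithm is that of Henderson et al. [7] and is not reproduced here; purely as a stand-in illustration of computing a fractal-dimension index from a single EEG trace, the sketch below implements Higuchi's estimator, a different but standard FD method.

```python
# Illustration only: Higuchi's fractal dimension estimator for a 1-D signal.
# This is NOT the FD algorithm of [7] used in the paper.
import math

def higuchi_fd(x, kmax=8):
    """Estimate the fractal dimension of a 1-D signal by Higuchi's method."""
    n = len(x)
    log_inv_k, log_length = [], []
    for k in range(1, kmax + 1):
        lengths = []
        for m in range(k):                       # one curve per starting offset
            steps = (n - 1 - m) // k
            if steps < 1:
                continue
            dist = sum(abs(x[m + i * k] - x[m + (i - 1) * k])
                       for i in range(1, steps + 1))
            # normalise the curve length for this offset and scale factor k
            lengths.append(dist * (n - 1) / (steps * k) / k)
        log_inv_k.append(math.log(1.0 / k))
        log_length.append(math.log(sum(lengths) / len(lengths)))
    # FD is the least-squares slope of log L(k) against log(1/k)
    mk = sum(log_inv_k) / len(log_inv_k)
    ml = sum(log_length) / len(log_length)
    return (sum((a - mk) * (b - ml) for a, b in zip(log_inv_k, log_length))
            / sum((a - mk) ** 2 for a in log_inv_k))

# Sanity check: a straight-line signal has fractal dimension 1.
fd_line = higuchi_fd(list(range(200)))
```

An irregular (e.g. noisy) trace yields a higher FD, approaching 2; it is changes in such an index over serial recordings that a canonogram visualises.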
Abstract
A report is presented on the use of eScience tools to parameterise a quantum mechanical model of an environmentally important organic molecule. The eScience tools are shown to enable better model parameterisation by facilitating broad parameter sweeps that would otherwise, were more conventional methods used, be prohibitive both in the time required to set up, submit and evaluate the calculations and in the volume of data storage required. In this case, the broad parameter sweeps performed highlighted the existence of a computational artefact that was not expected to affect this system to such an extent, and which would have been unlikely to be observed had fewer data points been taken. The better parameterisation of the model leads to more accurate results and to better identification of the applicability of aspects of the model to the system, so that greater confidence can be placed in the results of the research, which is of environmental importance.
…an investigation has been made of the applicability of the current model to quantify the energetics associated with the 360˚ rotation of the (Cl–)C–C–C–C(–Cl) torsion angle of 2,2'-dichloro biphenyl (hereafter 2-PCB). There has been work in the literature, both experimental [2] and computational [3], which has shown that previous calculations of the ground-state equilibrium angle have not been able to reproduce the 75˚ angle for the molecule [3].

The role that eScience played in this work was integral to the methodology. The many tools that have been developed within the eMinerals project have been indispensable in enabling the individual scientist to quickly begin work and to perform an initial, very detailed exploration of the system, in a way that would not otherwise have been possible.

2. Methodology

2.1 Simulation Details

All the calculations reported here were carried out at the density functional theory (DFT) level using the latest version of the SIESTA code [4], which includes CML output and z-matrix input. In the first instance, the calculations have been performed using the auto-generated double zeta polarized (DZP) basis sets within the code and the PBE functional [5].

The study has taken place in two parts. Initially, the box size surrounding a hexa-chlorinated PCB molecule was converged with respect to the total energy of the system. The aim of this was to determine the minimum box size necessary to surround the molecule without any self-interaction between periodic images. This is useful for two reasons: first, a smaller box size means less computational expense; second, the box size will determine the lower limit of the surface size necessary when the interaction of the molecules with the surface is investigated. The first part of the study was performed with the 3,4,5,3',4',5'-hexachloro biphenyl molecule (6-PCB), for reasons explained below.

Once the minimum box size was determined, a starting structure for the 2,2'-dichloro biphenyl molecule was generated and then described using the z-matrix format within the SIESTA code [4]. The z-matrix format allows the constraint of parameters within the molecule, such as, in this case, the torsion angle between the two phenyl rings (Figure 2). The torsion angle was varied over 360˚ at 5˚ intervals, requiring 72 calculations. Two basis sets were tested: the auto-generated DZP SIESTA basis set and a user-specified basis set, the latter of which was the more computationally expensive of the two.

Figure 2 2,2'-dichloro biphenyl showing the definition of the torsion angle (red), the angle between the two planes defined by the pairs of carbons highlighted on each phenyl ring

2.2 Obtaining Starting Configurations

Starting configurations were required for both of the PCB molecules, and these were obtained from experimental crystal structures stored in the Cambridge Crystallographic Database [6]. As the vacuum structure will differ from the crystal structure, further structural relaxation was required.

2.3 Calculating Box Size

Modelling of the molecule in vacuum was carried out using periodic boundary conditions. The molecule is enclosed in a virtual box, the dimensions of which were varied to find the optimal size. The PCB chosen for the box size calculation was, as mentioned above, the 3,4,5,3',4',5'-hexachloro biphenyl (6-PCB) molecule. The chlorine atoms are larger, and the C–Cl bond lengths longer, than is the case for hydrogen. The fully chlorinated PCB molecule, where all 8 hydrogen atoms are replaced by chlorine atoms, posed difficulties in calculating the fixed planar geometry due to unfavourable steric repulsions between chlorine atoms in the 2 and 6 positions on opposite rings. 6-PCB was chosen, therefore, because it has the largest space-filling contributions of all the possible PCBs that can easily be calculated in a planar configuration. The molecule has approximate dimensions of 10 Å in length and 5 Å across the chlorophenyl rings (Figure 1).

As such, an initial box size was taken to be 15 Å x 10 Å x 10 Å, and the molecule was aligned so that it lay lengthwise along the long axis of the cuboid. The two shorter box sides were kept equal to allow for free rotation around the C–C bond without interference between periodic images. The side lengths were sequentially increased up to 25 Å for the c parameter and 20 Å for a and b, at 0.5 Å intervals. This required the calculation of over 100 box sizes, each with different values of the lattice parameters. The generation and management of these calculations is detailed in Section 2.5.

The broad parameter sweeps, enabled by the use of the eMinerals submission scripts, allowed such a detailed sampling of the box dimensions that we were able to observe an unforeseen issue.
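The box-dimension grid of Section 2.3 can be enumerated with a few lines. This sketch is not the eMinerals scripts themselves; it simply takes the full cross product of a (= b) and c at 0.5 Å steps over the stated ranges (the actual suites may have sampled the space differently).

```python
# A minimal sketch of enumerating the box-dimension sweep described in
# Section 2.3: a (= b) stepped from 10-20 Angstrom and c from 15-25, at
# 0.5 Angstrom intervals. The cross-product sampling here is an assumption.
def axis_values(start, stop, step=0.5):
    """Inclusive list of lattice lengths on a fixed step (in Angstrom)."""
    n = int(round((stop - start) / step)) + 1
    return [start + i * step for i in range(n)]

def box_sweep():
    """All (a, b, c) boxes with a = b, covering the ranges in the text."""
    return [(a, a, c)
            for a in axis_values(10.0, 20.0)
            for c in axis_values(15.0, 25.0)]

boxes = box_sweep()
```

Even this simple enumeration makes the scale of the study concrete: well over the hundred box sizes mentioned in the text, each a separate single-point calculation.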
Figure 3 Energy v. box size (a = c) for calculations with a 100 Ry cutoff, showing periodic fluctuations in the energy with increasing box size, rather than the steady convergence expected (surface plot of E/eV against the a and c lattice parameters in Bohr)
It was found that the fineness of the grid over which the system's energy is calculated was insufficient for the PCB molecule, resulting in the periodic fluctuation in the energy, corresponding to the distance between grid points, that is shown in Figure 3. This is known as the 'eggbox' effect, and occurs when an inadequate grid fineness is used in calculations. While the mesh cutoff is always tested for convergence before starting production calculations, it was not expected that the eggbox effect would be observed so strongly in calculations of PCBs. The grid size was fine-tuned by running a number of suites of the box dimension parameter sweep calculations with different mesh cutoffs and searching for convergence.

For each of the box convergence runs it was only necessary to perform single point calculations on the system to determine the degree of interaction across the periodic boundary, resulting in the large number of short calculations for which high-throughput eScience is particularly useful.

2.4 Calculating Torsional Energy

The structure of ortho-2,2'-dichloro biphenyl (Figure 2) was described in z-matrix format, so that the torsion angle could be constrained and, therefore, varied over 360˚. In the first instance it was considered adequate to sample every 5˚ over the rotation around the central carbon bond. These calculations were autogenerated and autosubmitted as previously described. The only constraint on the system was the fixing of the torsion angle, so the positions of the other atoms were allowed to optimise around this fixed angle.

2.5 eScience Tools Used

The SIESTA input files for each suite of calculations were automatically generated using parameter sweep scripts which have been developed within the eMinerals project precisely to leverage the enhanced computing power made available through grid technology. These scripts are described in detail in [7], along with a detailed description of my_condor_submit (MCS), the submission script used in this work. However, a short description will be given here for clarity.

The parameter sweep scripts take as input a template SIESTA input file, which can be modified to create all of the required input files, and a very simple configuration file. The configuration file is used to specify the input parameters that should be varied within the input file, and the range and number of values over which they are to be varied. From this simple description of the problem space, running one command results in the creation of both a local and a logical Storage Resource Broker (SRB) directory structure, and the creation of the required SIESTA input files within this structure. The relevant MCS input files are also created in the local directory structure.
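The template-plus-configuration generation step can be sketched in a few lines. This is a toy version, not the actual eMinerals scripts: the template text, parameter name and file-naming scheme below are invented, and real runs used complete SIESTA input files.

```python
# A toy version of the Section 2.5 generation step: a template input file
# plus a one-parameter sweep description yields one input file per value.
# Template contents and file names here are illustrative assumptions.
import string

def generate_inputs(template, name, start, stop, step):
    """Return {filename: contents} for every value of the swept parameter."""
    n = int(round((stop - start) / step)) + 1
    files = {}
    for i in range(n):
        value = start + i * step
        text = string.Template(template).substitute({name: f"{value:g}"})
        files[f"{name}_{value:g}.fdf"] = text
    return files

# e.g. the 5-degree torsion sweep of Section 2.4: 72 generated inputs
torsion_inputs = generate_inputs("torsion_angle $angle\n", "angle", 0, 355, 5)
```

In the real workflow these generated files are laid out in matching local and SRB directory trees, with one MCS file per job alongside them.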
Figure 4 Energy v. box dimensions from calculations with a 300 Ry cutoff and fcc + box centre grid sampling (surface plot of E/eV against the a and c lattice parameters in Bohr)
The MCS files specify each of the jobs to be run, including the SRB location of the executable and input files, from where they should be downloaded and to where output files should be uploaded. In addition, the MCS files detail any relevant metadata to be collected. In the case of the box size convergence, the additional metadata requested were the lattice parameters, which were extracted from the XML SIESTA output file on completion of the calculation.

Once the directory structure has been created, jobs are submitted by running a second command, which walks the local directory structure and submits all of the input files using MCS; MCS takes care of all appropriate data management and the metascheduling over the available computing resources. This use of MCS as the underlying submission system means that we can maximise the use of available resources and minimise the latency between job submission and results retrieval.

The calculations are metascheduled across the eMinerals minigrid [8] using Globus to check machine availability and Condor-G to submit the jobs; MCS is wrapped around these to enable communication with the SRB and metadata collection and storage. The user has the option of specifying the machine on which the calculation is run, or whether to run on a Condor pool or a cluster.

Such a large number of calculations as encountered here quickly becomes unmanageable, and it becomes difficult to trace individual jobs or suites of calculations in the SRB file structure. Consequently, the use of metadata, and its automatic extraction from XML files using AgentX [9], a library for logically based XML data handling, is of paramount importance in such a combinatorial study as this one.

In general terms, metadata is stored in three tiers: the study, which is a self-contained piece of work; the dataset; and data objects. Each study can be labelled with various topics from a controlled taxonomy and can be annotated with a high level description of the work. The study level metadata can be searched at a later date, either via keywords within these annotations or via the topic labels. Searches can be performed using the RCommands, which can be run from the command line, or the Metadata Manager, a web interface to the metadata database. These tools are fully discussed in [10], but a description of the specifics of their use in this study follows.

In this case, prior to commencing the runs, a study for the PCB molecules was created in the metadata database, and within this study each suite of calculations constituted one dataset. Each data object within a dataset relates to one directory containing the files for one calculation. The data object is associated with the URI for the directory within the SRB. As with the study, it is possible to annotate the datasets and data objects with descriptions, which can later be searched for keywords. In addition to these free text annotations, datasets and data objects can have parameters associated with them. These are arbitrary name-value pairs, which can be used to index the data using key parameters of interest. If these parameters are numerical, it is possible to search for data objects or datasets that have a specified value for a certain parameter.
Figure 5 Energy (eV) v. torsion angle (degrees) with autogenerated basis set
In this work, each data object is associated with a number of parameters that are harvested on job completion by MCS, both from the MCS environment (including the name of the computer from which the jobs were submitted, the submission directory, and the remote machine upon which the job was run) and from any metadata available from the simulation code (e.g. the executable name and version). In addition, AgentX was used to harvest from the XML output file any metadata requested for extraction, as specified in the configuration file used to generate the individual MCS scripts.

For the box convergence study, numerical values of the lattice vectors were stored as metadata, allowing the metadata to be searched for this parameter and a specified value. So, on completion of each calculation, the results are uploaded to the SRB and the metadata is automatically updated. This removes the requirement, found in more traditional methods of working, of logging into each remote machine to retrieve results and making a record of each calculation in order to track the workflow. Obviously, with over one thousand jobs run in this study, such a method of working would be unmanageable, and prohibitive of this detailed type of parameterisation study.

With such a large number of files and suites as required for this study, and with the frequent updates of the simulation code used, it is of paramount importance to be able to trace the details of each calculation, both in order to process the results and to enable traceability of the workflow. This is important both for the individual scientist carrying out the calculations and for the facilitation of collaborations, which are integral to a virtual organisation as distributed as the eMinerals project.

The torsion angle calculations were constrained geometry optimisations. These calculations were generally well behaved; however, it was necessary to check the convergence of the calculations to ensure that the energy extracted from the XML output file would be the correct energy for the system. A number of Python scripts and shell commands were used to check for convergence. Thereafter, an XSLT file was used to parse each XML file and extract the lattice vectors and total energy of the system; the data could then be plotted and the images uploaded to the SRB for viewing by collaborators.
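The extraction step (done in the paper with XSLT, and with AgentX for metadata) can be sketched in Python. The element names in the sample below are invented stand-ins for the SIESTA CML output, so this shows only the idea, not a parser for the real files.

```python
# Sketch of extracting lattice vectors and total energy from an XML output
# file. The sample document and its element names are illustrative
# assumptions, not the real SIESTA CML schema.
import xml.etree.ElementTree as ET

SAMPLE = """<calculation>
  <lattice>
    <v>28.35 0.0 0.0</v><v>0.0 28.35 0.0</v><v>0.0 0.0 37.79</v>
  </lattice>
  <energy units="eV">-4361.08</energy>
</calculation>"""

def extract_results(xml_text):
    """Pull the lattice vectors and total energy out of one output file."""
    root = ET.fromstring(xml_text)
    vectors = [[float(x) for x in v.text.split()]
               for v in root.find("lattice")]
    energy = float(root.find("energy").text)
    return vectors, energy

vectors, energy = extract_results(SAMPLE)
```

Whether done via XSLT or a script like this, the output is a small (lattice, energy) record per job, ready to be plotted or stored as searchable metadata.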
Figure 6 Energy (eV) v. torsion angle (degrees) with user-specified basis set
…basis set in the SIESTA input file was all that was required in order to generate and submit all the required jobs. At present, two basis sets have been tested: the autogenerated SIESTA DZP basis set, and a set of basis sets taken from previous parameterisation calculations for the PCDDs [1].

Inspection of Figure 5 shows that the barrier to rotation is highest at 0º, and that there is a minimum in the energy at around 70º, as expected, although this is only slightly lower in energy than the broad minimum within which it sits, between 70º and 130º. The kinetic barrier to rotation is around 1.25 eV at 180º and 3.25 eV at 0º, although a more detailed sampling is needed at 0º in order to determine this value accurately.

An analogous plot is shown for the explicit basis sets in Figure 6. It can be seen that, once again, the highest barrier to rotation is at 0º, with another at 180º. Once again an energy minimum is found at 70º, but a second minimum is apparent at roughly 120º. The barrier to rotation in this case, however, can be seen to be considerably lower, at 2.5 eV. Further investigation is obviously necessary to ascertain the best description of the molecule, which will include the repetition of the suites of torsion angle calculations using different basis sets. Additionally, the eScience tools that have already been used can further be applied with different codes that are not limited to DFT calculations. Investigations into the applicability of DFT to the problem will be made by studying the system using higher-level quantum mechanical calculations.

4. Conclusions

Quantum mechanical calculations have been carried out using eScience tools developed during the eMinerals project. These tools change the way that computational chemists are able to carry out scientific research, for a number of reasons. First, the tools allow for the facile generation of many files, removing the necessity for lengthy set-up times. Secondly, the metascheduling aspect of the setup and the use of the SRB for data storage, along with the use of digital certificates for security, remove the need to monitor the machines upon which the calculations are running, or to log into them directly. Searchable metadata, through the use of the RCommands and the Metadata Manager, ensures that the vast numbers of calculations do not get confused and thereby remain manageable. The use of tools that test convergence of the jobs, along with the structured XML output, which allows for the easy extraction of relevant data, means that the processing of the large number of results does not pose a problem to the research scientist.

The consequence of using all of these tools is that it is much easier to properly parameterise the models used, as a comprehensive sweep of parameter space is neither lengthy nor computationally too expensive when using high-throughput computing. In this particular case, the parameterisation led to the identification of a phenomenon that was not expected to be observed in this system, and which might have gone unobserved using traditional methods.

5. Acknowledgments

We are grateful for funding from NERC (grant reference numbers NER/T/S/2001/00855, NE/C515698/1 and NE/C515704/1).

References:

1. TOH White, RP Bruin, J Wakelin, C Chapman, D Osborn, P Murray-Rust, E Artacho, MT Dove, M Calleja, eScience methods for the combinatorial chemistry problem of adsorption of pollutant organic molecules on mineral surfaces. Proceedings of the All Hands Meeting, Nottingham, 773-780 (2005).
2. C Romming, HM Seip and IM Aanesen, Structure of Gaseous and Crystalline 2,2'-Dichlorobiphenyl. Acta Chemica Scandinavica Series A: Physical and Inorganic Chemistry A 28 (5), 507-514 (1974).
3. R Zimmermann, C Weickhardt, U Boesl, EW Schlag, Influence of chlorine substituent positions on the molecular structure and the torsional potentials of dichlorinated biphenyls: R2PI spectra of the first singlet transition and AM1 calculations. Journal of Molecular Structure 327, 81-97 (1994).
4. JM Soler, E Artacho, JD Gale, A Garcia, J Junquera, P Ordejon, D Sanchez-Portal, The SIESTA method for ab initio order-N materials simulation. Journal of Physics: Condensed Matter 14 (11), 2745-2779 (2002).
5. JP Perdew, K Burke and M Ernzerhof, Generalized Gradient Approximation Made Simple. Physical Review Letters 77, 3865-3868 (1996).
6. FH Allen, The Cambridge Structural Database: a quarter of a million crystal structures and rising. Acta Crystallographica Section B: Structural Science 58, 380-388 (2002).
7. RP Bruin, TOH White, AM Walker, KF Austen, MT Dove, et al., Job submission to grid computing environments. Submitted to All Hands Meeting 2006 (2006).
Proceedings of the UK e-Science All Hands Meeting 2006, © NeSC 2006, ISBN 0-9553988-0-0
8. M Calleja, R Bruin, MG Tucker, MT Dove, R Tyer, L Blanshard, K Kleese van Dam, RJ Allan, C Chapman, W Emmerich, P Wilson, J Brodholt, A Thandavan and VN Alexandrov. Collaborative grid infrastructure for molecular simulations: The eMinerals minigrid as a prototype integrated compute and data grid. Molecular Simulation 31 (5), 303–313 (2005).
9. PA Couch, P Sherwood, S Sufi, IT Todorov, RJ Allan, PJ Knowles, RP Bruin, MT Dove and P Murray-Rust. Towards Data Integration for Computational Chemistry. Proceedings of the All Hands Meeting, Nottingham (2005).
10. RP Tyer, PA Couch, K Kleese van Dam, IT Todorov, RP Bruin, TOH White, AM Walker, KF Austen, MT Dove, MO Blanchard. Automatic metadata capture and grid computing. Submitted to All Hands Meeting 2006 (2006).
Abstract
We report a case study in grid computing with associated data and metadata management
in which we have used molecular dynamics to investigate the anomalous compressibility
maximum in amorphous silica. The primary advantage of grid computing is that it
enables such an investigation to be performed as a highly-detailed sweep through the
relevant parameter (pressure in this case); this is advantageous when looking for derived
quantities that show unusual behaviour. However, this brings with it certain data
management challenges. In this paper we discuss how we have used grid computing with
data and metadata management tools to obtain new insights into the behaviour of
amorphous silica under pressure.
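The kind of detailed parameter sweep described above can be sketched in a few lines of code. This is an illustrative sketch only: the directory layout, the `INPUT` file name and the `{pressure}` placeholder convention are assumptions for the example, not the actual eMinerals tools.

```python
import os
import tempfile

def make_sweep(root: str, template: str, name: str,
               start: float, end: float, nvalues: int) -> list:
    # Create one sub-directory per parameter value, each holding an input
    # file generated from a template (hypothetical layout for illustration).
    step = (end - start) / (nvalues - 1)
    dirs = []
    for i in range(nvalues):
        value = start + i * step
        d = os.path.join(root, f"{name}_{value:+.1f}")
        os.makedirs(d, exist_ok=True)
        with open(os.path.join(d, "INPUT"), "w") as f:
            f.write(template.replace("{" + name + "}", f"{value:.1f}"))
        dirs.append(d)
    return dirs

# Pressure from -5 to +5 GPa as 11 evenly spaced values.
root = tempfile.mkdtemp()
dirs = make_sweep(root, "pressure = {pressure} GPa\n", "pressure", -5.0, 5.0, 11)
print(len(dirs))  # 11
```

A real submission tool would additionally create the matching data collections on the SRB and one job-submission script per directory, as described in the body of the paper.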
The task we set ourselves was to demonstrate the reasonableness of this hypothesis, and the chosen tool is molecular dynamics simulation. We have two good sets of interatomic potential energy functions for silica, both based on quantum mechanical calculations of small clusters of silicon and oxygen atoms; these are described in references 11 and 12 respectively. For the present paper we will only present results using the model of reference 11, but will refer to the other set of results later. We use the DL_POLY_3 molecular dynamics simulation code [13], which has been adapted for working within a grid computing environment.

We ran simulations for pressures between ±5 GPa and at a temperature of 50 K. Our aim was to capture data for many pressures within this range, and to analyse the resultant configurations at the same time as running the simulations. At each pressure we ran one simulation with a constant-pressure and constant-temperature (NPT; N implies a constant number of atoms) algorithm in order to obtain data for the equilibrium volume, followed by a simulation using a constant-volume and constant-energy (NVE) algorithm for the more detailed analysis (see later in this paper). The analysis was performed using additional programs, and was incorporated within the workflow of each job carried out within the grid computing environment.

eScience methodology

Grid computing environment
The simulations were performed on the combination of the eMinerals minigrid [3,14], which primarily consists of Linux clusters running PBS, and CamGrid [15], which consists of flocked Condor pools. Although the eMinerals minigrid and CamGrid are independent grid environments, they have overlapping resources, and the tools developed to access the eMinerals minigrid [3,14] have been adapted to work on CamGrid and, incidentally, to also enable access to NGS resources in the same manner.

Access to the eMinerals minigrid is controlled by the use of escience digital certificates and the use of the Globus toolkit. We have developed a metascheduling job submission tool called my_condor_submit (MCS) to make the process easier for end users [2,3,15]. This uses the Condor-G interface to Globus to submit three separate jobs per simulation. The first job takes care of data staging from the SRB to the remote computing resource, whilst the last job uploads output data back to the SRB as well as capturing and storing metadata. The middle job takes care of running the actual simulation and corresponding analysis as part of a script job.

Data management
Data management within the eMinerals minigrid is focussed on the use of the San Diego Storage Resource Broker (SRB). The SRB provides a good solution to the problem of getting data into and out of the eMinerals minigrid. However, it also facilitates archiving a complete set of files associated with any simulation run on the minigrid. The way that the eMinerals project uses the SRB follows a simple workflow:
1. The data for a run, and the simulation executable, are placed within the SRB. It is not necessary for the files to be within the same SRB collection.
2. A submitted job, when it reaches the execute machine, first downloads the relevant files.
3. The job runs, generating new data files.
4. At the end of the job, the data files generated by the run are parsed for key metadata, which is ingested into the central metadata database, and the files are put into the SRB for inspection at a later time.
This simple workflow is managed by the MCS tool using Condor's DAGMan functionality. We have built into the workflow a degree of fault tolerance, repeating any tasks that fail due to unforeseen errors such as network interruption.

Combinatorial job preparation tools
One of the key tasks for escience is to provide the tools to enable scientists to access the potential benefits of grid computing. If scientists are to run large combinatorial studies routinely, they need tools to make setting up, submitting, managing and tracking many jobs nearly as easy as running a single job. To enable the scientists within the eMinerals project team to run detailed combinatorial studies, one of the authors (RPB) has developed a set of tools that generate and populate new data collections on the SRB for all points on the parameter sweep, and then generate and submit the set of MCS scripts [2].

All that is required from the user is to provide a template input file from which all
necessary unique input files will be created. The user specifies the name of the parameter whose value is to be varied, the start and end values, and the number of increments. For example, a user could specify that they wish to vary pressure between –5 and +5 GPa in 11 steps, which would result in input files being created for pressures of –5, –4, –3, ... +4, +5 GPa. The tools then create a simple directory structure, with each sub-directory containing the necessary input files. A similar directory structure will also be created on the SRB. Each directory on the submission machine also contains the relevant MCS input script, with appropriate SRB calls and metadata commands.

The subsequent stage is job submission. The user runs one command, which walks through the locally created directory structure and submits all of the jobs using MCS. It is MCS that takes care of deciding where to run the simulation within the eMinerals minigrid or CamGrid environments, using a metascheduling algorithm explained elsewhere [2]. This algorithm ensures that the user's simulations do not sit in machine queues for longer than is necessary, and that the task can take full advantage of all available resources. The tools also track the success of the job submission process, and any errors that occur as part of the submission are recognised. For example, if one of the Globus gatekeepers stops responding, it will cause Condor-G to fail in the submission. This will be noticed, and a user command will provide information on all failed jobs and will resubmit them as appropriate.

Workflow within the job script
Usually MCS is used to submit a binary executable, but it can also submit a script (eg shell or Perl) containing a set of program calls. In our application, a standards-compliant shell script is used to control the execution of a number of statically linked executables and a simple Perl script to run the analysis. In detail, the script runs the following codes in order on the remote host: first DL_POLY_3 is run with the NPT ensemble in order to generate a model with the appropriate density for the pressure of interest; then the output of this run is used as input for a second DL_POLY_3 run in the NVE ensemble in order to sample the vibrational behaviour of the system at this density. Following the second molecular dynamics simulation, pair distribution functions are extracted for the Si–Si, Si–O and O–O separations, and configurations are extracted for analysis of the atomic motions based on a comparison of all configurations. Finally a Perl script collates the results of this analysis for later plotting. This analysis will not be reported in this paper, but is mentioned here to make the extent of the workflow clear.

Use of XML
The eMinerals project has made a lot of use of the Chemical Markup Language (CML) to represent simulation output data [5]. We have written a number of libraries for using CML within Fortran codes (most of our simulation codes are written in Fortran), and most of our simulation codes now write CML. We typically write three blocks of CML: one for "standard" metadata (some Dublin Core items, some specific to the code version and compilation etc), one to mirror input parameters (such as the input temperature and pressure, and technical parameters such as cut-off limits), and one for output properties (including all computed properties step by step and averages over all steps). This is illustrated in Figure 2.

The XML files stored within the SRB are transformed to HTML files using the TobysSRB web interface to the SRB, with embedded SVG plots of the step-by-step output [16]. This enables the user to quickly inspect the output from runs to check issues such as convergence.

XML output also allows us to easily collect the desired metadata related to both input and output parameters using the AgentX library, developed by one of the authors (PAC) [17]. This has been integrated into the MCS workflow, enabling the user to easily specify the metadata they wish to collect as part of their job submission process without needing to know the ins and outs of XML file formats.

Results management
When many jobs are run as part of a single study, it is essential to have tools that collate the key results from each run. For example, one key output from our work will be a plot of volume against pressure, with each data point obtained from a single computation performed using grid computing. In this we exploit the use of XML in our output files, because the required averaged quantities can be accessed by retrieving the value of the relevant XML element as per Figure 2. This value can be retrieved for each of the individual files using a simple XSLT transform, combining all of the values together
<metadataList>
  <metadata name="dc:contributor" content="I.T.Todorov &amp; W.Smith"/>
  <metadata name="dc:source"
    content="cclrc/ccp5 program library package, daresbury laboratory molecular dynamics program for large systems"/>
  <metadata name="identifier" content="DL_POLY version 3.06 / March 2006"/>
  <metadata name="systemName" content="DL_POLY : Glass 512 tetrahedra"/>
</metadataList>
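A metadata block of this form, and the per-run averaged values, can be read back with a few lines of standard XML handling. This is only a sketch of the idea (the project itself used XSLT and the AgentX library for these steps), and the `averagedVolume` element name and the volume numbers below are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Read back the name/content pairs of a metadataList block.
metadata_xml = """<metadataList>
  <metadata name="identifier" content="DL_POLY version 3.06 / March 2006"/>
  <metadata name="systemName" content="DL_POLY : Glass 512 tetrahedra"/>
</metadataList>"""
root = ET.fromstring(metadata_xml)
metadata = {m.get("name"): m.get("content") for m in root.findall("metadata")}

# Collate one averaged quantity from many per-run output files into
# (pressure, volume) points ready for plotting; file contents are stand-ins.
runs = {
    -5.0: "<output><averagedVolume>9500.0</averagedVolume></output>",
    0.0: "<output><averagedVolume>10100.0</averagedVolume></output>",
    5.0: "<output><averagedVolume>9900.0</averagedVolume></output>",
}
points = sorted((p, float(ET.fromstring(x).findtext("averagedVolume")))
                for p, x in runs.items())
print(metadata["systemName"], points[0])
```

The sorted list of points is exactly what the paper's further XSLT transformation turns into an SVG graph.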
then results in a list of points. This can easily be plotted as a graph using a further XSLT transformation into an SVG file. These transformations can be done very quickly and, more importantly, they can be done automatically, which means that the user and his/her collaborators simply need to look at a graph in a web page to quickly analyse trends within the data, rather than having to open hundreds of files by hand to find the relevant data values to then copy into a graph plotting package. In our experience, this is the sort of thing that makes grid computing on this scale actually usable for the end user, and facilitates collaborations.

Metadata
With such quantities of data, it is essential that the data files are saved with appropriate metadata to enable files to be understood and data to be located using search tools. We have developed tools, called the RCommands, which
relatively-rigid SiO4 tetrahedra – the ratio of the O–O to Si–O mean distance is equal to (8/3)^1/2 – and it can be seen that their mean distances barely change with pressure. Thus we see that the SiO4
[Figure: plot residue – "Normalised length" axis (tick values 0.98 and 1.02), "Sample size"]
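For reference, the quoted ratio is the standard result for a regular tetrahedron: with the tetrahedral angle θ (cos θ = −1/3) between two Si–O bonds, the O–O edge length follows from the isosceles O–Si–O triangle:

```latex
\frac{d_{\mathrm{O-O}}}{d_{\mathrm{Si-O}}}
  = 2\sin\frac{\theta}{2}
  = \sqrt{2\,(1-\cos\theta)}
  = \sqrt{2\left(1+\tfrac{1}{3}\right)}
  = \left(\frac{8}{3}\right)^{1/2}
  \approx 1.633 .
```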
‣ This work was collaborative in nature, involving at times sharing of data between colleagues who were in physically different locations. Sharing of data was enabled using the above tools and methods.
‣ Finally, it is worth remarking that a large part of the simulation work reported in this study was carried out by a third-year undergraduate student (LAS) as her degree project. We were confident, as proved to be the case, that the escience tools developed to support this work could easily be picked up by someone with no previous exposure to escience or computational science and who was working under some considerable time pressure.

Acknowledgements

We are grateful for funding from NERC (grant reference numbers NER/T/S/2001/00855, NE/C515698/1 and NE/C515704/1).

References
1. OB Tsiok, VV Brazhkin, AG Lyapin, LG Khvostantsev. Logarithmic kinetics of the amorphous-amorphous transformations in SiO2 and GeO2 glasses under high pressure. Phys. Rev. Lett. 80, 999–1002 (1998)
2. RP Bruin, TOH White, AM Walker, KF Austen, MT Dove, RP Tyer, PA Couch, IT Todorov, MO Blanchard. Job submission to grid computing environments. Proceedings of All Hands 2006
3. M Calleja, R Bruin, MG Tucker, MT Dove, R Tyer, L Blanshard, K Kleese van Dam, RJ Allan, C Chapman, W Emmerich, P Wilson, J Brodholt, A Thandavan, VN Alexandrov. Collaborative grid infrastructure for molecular simulations: The eMinerals minigrid as a prototype integrated compute and data grid. Mol. Simul. 31, 303–313 (2005)
4. RW Moore and C Baru. Virtualization services for data grids. In Grid Computing: Making The Global Infrastructure a Reality (eds F Berman, AJG Hey and G Fox, John Wiley), Chapter 16 (2003)
5. TOH White, P Murray-Rust, PA Couch, RP Tyer, RP Bruin, MT Dove, IT Todorov, SC Parker. Development and Use of CML in the eMinerals project. Proceedings of All Hands 2006
6. RP Tyer, PA Couch, K Kleese van Dam, IT Todorov, RP Bruin, TOH White, AM Walker, KF Austen, MT Dove, MO Blanchard. Automatic metadata capture and grid computing. Proceedings of All Hands 2006
7. KO Trachenko, MT Dove, MJ Harris, V Heine. Dynamics of silica glass: two-level tunnelling states and low-energy floppy modes. J. Phys.: Condens. Matter 12, 8041–8064 (2000)
8. MG Tucker, DA Keen, MT Dove, K Trachenko. Refinement of the Si–O–Si bond angle distribution in vitreous silica. J. Phys.: Condens. Matter 17, S67–S75 (2005)
9. http://www.esc.cam.ac.uk/movies/trid4t.html
10. http://www.esc.cam.ac.uk/movies/glass1.html
11. S Tsuneyuki, M Tsukada, H Aoki, Y Matsui. First-principles interatomic potential of silica applied to molecular dynamics. Phys. Rev. Lett. 61, 869–872 (1988)
12. BWH van Beest, GJ Kramer, RA van Santen. Force fields for silicas and aluminophosphates based on ab initio calculations. Phys. Rev. Lett. 64, 1955–1958 (1990)
13. IT Todorov and W Smith. DL_POLY_3: the CCP5 national UK code for molecular-dynamics simulations. Phil. Trans. Roy. Soc. 362, 1835–1852 (2004)
14. M Calleja, L Blanshard, R Bruin, C Chapman, A Thandavan, R Tyer, P Wilson, V Alexandrov, RJ Allan, J Brodholt, MT Dove, W Emmerich, K Kleese van Dam. Grid tool integration within the eMinerals project. Proceedings of the UK e-Science All Hands Meeting 2004 (ISBN 1-904425-21-6), pp 812–817
15. M Calleja, B Beckles, M Keegan, MA Hayes, A Parker, MT Dove. CamGrid: Experiences in constructing a university-wide, Condor-based grid at the University of Cambridge. Proceedings of the UK e-Science All Hands Meeting 2004 (ISBN 1-904425-21-6), pp 173–178
16. TOH White, RP Tyer, RP Bruin, MT Dove, KF Austen. A lightweight, scriptable, web-based frontend to the SRB. Proceedings of All Hands 2006
17. PA Couch, P Sherwood, S Sufi, IT Todorov, RJ Allan, PJ Knowles, RP Bruin, MT Dove, P Murray-Rust. Towards data integration for computational chemistry. Proceedings of All Hands 2005 (ISBN 1-904425-53-4), pp 426–432
Abstract
This paper presents initial analysis and outcomes from the Integrated Steering of Multi-Site Experiments for the Engineering Body Scanner (ISME) Virtual Research Environment (VRE) project, which enables geographically remote teams of materials scientists to work together. Comparisons between different AccessGrid (AG) tools have been made, allowing us to identify their advantages and disadvantages regarding quality (visual and audio) over broadband networks, utility, ease of use, reliability, cost and support. These comparisons are particularly important when considering our intended use of AG in non-traditional, non-office-based settings, including using slow domestic broadband networks to run AG sessions from home to support experiments at unsociable hours. Our detailed analysis has informed the development of a suite of services on the web portal. Besides user interactions, a suite of services based on JSR-168 has been developed for evaluation. Each service has been evaluated by materials scientist users, thereby allowing qualitative user-requirement gathering. This feedback has resulted in a prioritised implementation list for these services within the prototype web portal, allowing the project to develop useful features for materials scientists and the wider VRE community.
on recently acquired and analysed real-time data could add true value to the experiments, especially if previous results can also be drawn down for comparison. In effect, all the collaborators will have opportunities to access and analyse the data and hence to offer their opinions on the particular measurement strategy. As a result the onus for experimentation and analysis can be shared more evenly amongst the group, and a wider, multidisciplinary view brought to bear on the experiment in a timely fashion. Hence the learning curve for the PhD students can be much steeper.

We aim to extend our present ontology to support the more complex workflows and interactions that need to take place in our multidisciplinary teams. It is envisaged that this will lead to a cultural change in the way experiments are undertaken. Most importantly, it will allow the experimenter to regain the initiative, moving away from pre-determined experimental sequences to interactive experiments in which developments are monitored during the experiment and strategies modified accordingly. This project looks to tailor these toolkits for multi-site engineering experiments. The project is investigating the use of various common web portals to allow the VRE project to deploy the Engineering Body Scanner (EBS) [2] toolkit at multiple sites.

3. Analysis and Outcome of Experiment Steering Trials

The culture of the meeting room has been altered to accommodate the practical requirements of the end-users for this project. The interactions via the multimedia resource (use of AG software) should not be at the meeting-room level but at the experimental level, whereby the whole team can 'meet' together, bringing their own data and modelling predictions, and discuss and develop an evolving experimental strategy with Grid middleware. This task is not one of just teleconferencing, but rather a means of involving extended multi-disciplinary groups in the experimental steering process. It allows out-of-office-hours steering using home broadband networks to link to the experimental site. We have analysed various AG software to identify the best options dependent on cost, ease of use and reliability regarding future use within the materials science community.

For the Experimental Steering Function we have trialled Access Grid, focusing primarily on how best to configure it to optimise HCI and usability. To this end we have established our own 'virtual venue'. Due to the 24-hour nature of our experiments and the cost and infrastructural implications, it was deemed more appropriate to use Access Grid predominantly on computers with good quality webcams rather than via specialist, traditional, dedicated and fixed Access Grid studios. This is because, firstly, for the experimenter, involvement must be seamless with the practical experimental function and, secondly, because academics may need to enter into dialogue at home during unsociable hours. Connectivity between our two functionalities is achieved through the use of a shared virtual screen ("Shared Workspace") on which data analysed in the Data Function can be viewed on the web portal through the use of a standard web browser.

The initial trial of AG steering was between the Manchester School of Materials, Daresbury Laboratory and ISIS, Oxford. Common AG meeting tools include:
1. inSORS [4]
2. AccessGrid Toolkit 2.4 (AGTk) [5]
3. Videoconferencing Tool & Robust Audio Tool (vic&rat) [6]
4. Virtual Rooms VideoConferencing System (VRVS) [7]

Each has its own advantages and disadvantages in terms of the features provided, quality and cost. Although vic&rat are two separate video and audio applications, together they can be considered as a basic AG tool. vic&rat have a command-line interface which requires users to know the multicast-unicast bridge server before use. The combination of a command-line interface and the need for a priori knowledge severely and negatively impacts on vic&rat as a general choice of AG meeting tool.

VRVS, on the other hand, is a fully web- and Applet-based AG tool using its own improved version of vic&rat incorporated with its own Java proxy applications. It allows almost anyone to join an AG meeting from behind a firewalled network without any ports having to be specially opened. Although VRVS can in many respects be considered an ideal tool, limited communication between VRVS venues and existing AG virtual venues has relegated its utility for our purposes.

The choice between inSORS and AGTk can be informed by investigating the following attributes and comparisons discussed in Table 1:
a. Audio & Video communication network
Focusing on support for AG meetings using the slow upload rate characteristic of home broadband networks.
b. Ease of Use and Available Features
Focusing on the ease of use and useful features provided by the different AG tools.
c. Reliability
Focusing on the stability of the "bridge server" supported by the AG tools.
d. Support of Hardware Tools
Focusing on the range of hardware supported, particularly webcams and sound cards.
e. Firewall Network
Focusing on support for networks behind firewall routers typical of large international experimental facilities.
f. Cost & Set-Up & Support
Focusing on the cost and after-sales support for installation of AG tools.
To expedite our trial, we have purchased a few copies of inSORS for communication between experimental and home broadband sites. The results of using the features in inSORS have been found satisfactory by materials scientists, particularly the "Shared Whiteboard" and "File Sharing" features. The ease of use of IGMeeting, in that one can set up a meeting with other inSORS users without the need for a pre-arranged meeting, has proven a great advantage over AGTk.

4. Analysis and Outcome of Data Management

4.1 Choice of Web Portal (JSR-168 and WSRP)
Whilst the web portal concept can act as an efficient medium for our "Shared Workspace" (discussion), to download/upload information (data archiving/restore) or even to retrieve previous experiment histories (playback of virtual experiments), it is imperative that our web portal conform to a standard to allow efficient ways to deploy, share and even reuse portlets from various other VRE projects. It came to our attention that JSR 168 is a Java Specification Request (JSR) that establishes a standard API for creating portlets.

We have investigated two main JSR-168 compliant web portals, the GridSphere [8] and uPortal [3] frameworks, and have come to the following summary conclusions:

GridSphere
There is a large community who use GridSphere as their main web portal service, but we found that the support and reliability of the software failed to live up to expectations given our desire to run and develop on Windows-based platforms. Nonetheless, GridSphere deployment on other Linux-based projects, such as GECEM at Cardiff University [9], was successfully achieved, albeit proving challenging at times. We further note that the latest released Java-based framework, GridSphere v2.1.4, is developed more towards Linux than Windows, making it unstable and failing even to get past the setup installation stage under Windows environments. Although the initial setup in Linux environments had been successful, portlet deployment only proved stable if a lower version of the Java library and Apache web server was installed on the host machine.

uPortal
This web portal has proven stable under our rapid development of web services under Windows and Linux environments. Its well-documented web site is a valuable resource. Furthermore, the developer's release v3.0 has been modified to use JSR-168 as a native module, rather than using the conventional Apache Portlet-to-Channel adapter. This has given us greater ease of use and reliability when using this web portal for our development.

Web Services for Remote Portlets (WSRP)
While XML-based web services have been used in different API platforms to transfer data between them, a newer concept, Web Services for Remote Portlets (WSRP), allows portlets to be exposed as web services [10]. The resulting web service will be user-facing and interactive among different web portals. Unlike traditional web services, WSRP will carry both application and presentation logic that can be displayed by a consuming portal. To the end user, remote portlets will look and interact just as local portlets would. Unfortunately, WSRP is not yet an agreed standard and therefore interoperation is not guaranteed. Furthermore, it is still at an early stage of development from certain software vendors; it is therefore difficult to judge and implement a truly pluggable WSRP portlet. Although various .NET 2 web portal libraries have the features to allow implementation of a web portal service, the use of WSRP has not been implemented to allow interoperability. We are, however, closely monitoring the development of WSRP, and our choice of a uPortal library that conforms to the current WSRP standard will enable us to exploit other WSRP modules when it is incorporated within a wider range of software vendors' standards.
4.2 Web Portlets for Materials Scientists
We have conducted a series of structured interviews with material scientists, following the completion of a questionnaire, in order to ascertain which web services are required by the material scientists. The interviews have been conducted with members of differing roles and levels of experience, namely a Professor, a project manager, a lecturer, an instrument scientist and three PhD students. The principal issues/requirements are summarised below:
o Communications concurrent with scene/data visualisation is required. This will help to identify experimental issues.
o The need for a shared desktop/workspace for better collaboration. This is especially necessary to communicate problems and share data.
o A data archive system is required, so that users are able to retrieve documents and data and have easy access to previous work, with the experiments systematically recorded.
o An electronic log book of the experiment would be useful: very simple and easy to use, including only pictures and text.
o A catalogue tool to organise the data transfer.
o A tool to ensure the latest version of the data.
o A tool to overcome the problem of sending large datasets (GBytes) back to the university site – maybe by using a very fast Internet connection.
o AG meetings should be made more user-friendly: easier to set up at any time, reliable, easy to use and portable, as a package, not requiring additional set-up time.
o Access to a powerful resource computer via fast internet to analyse results quickly.
o A project scheduling tool would also be useful, to plan the experiment and to keep a diary during experiments.
o The possibility to simulate the experiment in advance, like a virtual experiment, so that new students can get used to the procedure and the facilities.

Based on the above summary outcomes, we have developed a prototype ISME web portal, as shown in Figure 2, offering the required services:
o Error Handling tab: for debugging purposes during this prototyping phase.
o News tab: to display daily news from various news channels provided by uPortal.
o Data Related Services: Archive Data, File Manager, Catalogue Tool, Virtual Log-Book. These services relate to data archival and management tasks. The catalogue tool allows materials scientists to manage and save their experimental results in a set of discipline categories of database format. This will allow PhD supervisors or other colleagues to easily access and acquire information when needed. The Virtual Log-Book is more a personal service where experimenters can keep their notes and hourly experimental results digitally on the centralised server for retrieval when needed in future.
o Remote Processing Services: This service aims to provide material scientists with the ability to connect remotely to their local machine back at the home university site and to launch different useful applications installed there, such as
666
Proceedings of the UK e-Science All Hands Meeting 2006, © NeSC 2006, ISBN 0-9553988-0-0
ABAQUS, ImageJ, GSAS, OpenGini, Exact, 5. Future Work
FORTRAN, MS Excel, MS Word etc. This
service ensures that the user is properly We are closely collaborating with a wide-range of
authenticated into the remote host and gives material scientist users to determine and implement
them the opportunity to use the software our final phase of the ISME web portal. Feedbacks
applications they usually employ to visualise from the first phase had given us the information to
data, analyse results, make conclusions etc. prioratise and implement the services required by the
materials scientists. Technical lessons learnt from the
o Communication: Image, Video, Chat,
uPortal will also be documented as we believe this
Shared Desktop, Message Board,
maybe of interest to the VRE community.
AccessGrid.
This service allows materials scientists to
At the same time, we are collaborating with
communicate to each other, for example,
Manchester Computing to develop an AG Meeting
sharing images and video on the web portal.
Notifications tools (based on vic&rat) to allow
A common shared workspace with AG tools
meetings to be conducted instantly. This open
can be used to visualise results from the
source, licence-free and cost-free customisable AG
experiments via the multimedia resource,
Meeting Notification tools will allow greater use
whereby the whole team can ‘meet’ together
within VRE and materials scientist communities.
bringing their own data, modelling
We have a Pan-Tilt-Zoom (PTZ) camera installed in
predictions and discuss and develop an
the experimental hutch to allow remote users to
evolving experimental strategy.
control the static viewpoint. In addition we have also
o Miscellaneous: Calendar, Project developed a novel AG viewpoint: the Mobile-AG
Scheduling, Virtual Experiments, Email node. The Mobile -AG gives remote users a first
The calendar service is to support person viewpoint and improved insight regarding the
experimental scheduling and also allows mounting and positioning of a materials sample. It
PhD supervisors or co-workers to locate uses a mobile camera system attached to the
specific experiments of a materials scientist. experimentalist’s glasses so that their operations in
Virtual Experiments can be undertaken the experimental hutch can be broadcast to scientists
using a legacy application developed by the at remote sites. This gives remote team members an
Open University [11]. insight into the local environment in a more flexible
This service aims to provide the researchers manner than is available with static cameras. This
with a simulation tool of the experiments at allows us to share views not otherwise available and
the remote site, which they can use prior may help us to diagnose experimental problems with
their visit to the facility to learn the the equipment, carry out repairs under instruction or
experimental process or even on-site in case to allow remote users to offer advice about the
any problems arise. experimental set-up. At this stage of the project, we
are still investigating the full potential use of Mobile-
We have shown the above prototype web portal to AG used in the experiment hutch.
materials scientists for initial feedback. The materials
scientists perform different roles at the university and In summary, this project focuses on understanding
during the experiments. The analysis of the feedback how VRE tools can assist large multi-site teams
results showed that the wide range of web services undertake experiments at international facilities. Our
was appreciated and that no services were perceived aim is to exploit and customise existing tools in such
as lacking. a way that scientist can use them seamlessly and
There are other services which can be improved and effectively in undertaking experiments. So far we
the concepts of these services need to be clear so as have undertaken a small pilot study based around two
not to replicate any of the currently available experimental facilities and one extended research
applications. For example, team. This project will be expanded to a wider range
o Virtual Log Book would require time stamps of users with assessment criteria needed for evaluate
on the experiments as an automatic feature to the prototype portal and services. We will have an
provide an historical record of actions. opportunity in July 2006 to have a full test-drive of
o Archive Tools need to be merged with the our AG tools and Web portal on the Daresbury 16.3
Catalogue tool to reduce replication within the beamline using public access users, since this is to
web portal. become a public service beamline maintained partly
o The Calendar service must be linked to by the Manchester team. If this proves successful we
experiments allocated by the experiment al will examine the possibility of transferring the
facility and should made public to allow other approach to the new Diamond Light Source near
colleagues to view timetables. Oxford. This has created a unique opportunity to
Some overlap between services has been identified in embed and test our VRE tools with a large
this phase as well as some irrelevant services for the community of new users to assess how these tools
experimenters such as Chat, Email and Video. support and impact the learning curve of new users.
Proceedings of the UK e-Science All Hands Meeting 2006, © NeSC 2006, ISBN 0-9553988-0-0
6. References
1. K.T.W. Tan, D. Tsaneva, M. Daley, N.J. Avis and P.J. Withers, "Advanced Collaborative Tools for Engineering Body Scanners", UK e-Science Programme All Hands Meeting, 2005, Nottingham
2. K.T.W. Tan, N.J. Avis, G. Johnson and P.J. Withers, "Towards a grid enabled engineering body scanner", UK e-Science Programme All Hands Meeting, 2004, Nottingham
3. uPortal – http://www.uportal.org
4. inSORS – Multimedia Conferencing & Collaboration Software, http://www.insors.com
5. The Access Grid Project – a grid community, http://www.accessgrid.org
6. Vic & Rat – http://www-mice.cs.ucl.ac.uk/multimedia/software/
7. Virtual Room Videoconferencing System – http://www.vrvs.org
8. GridSphere Portal – http://www.gridsphere.org
9. M. Lin, D.W. Walker, Y. Chen and J.W. Jones, "A Web Service Architecture for GECEM", UK e-Science All Hands Meeting, 2004, Nottingham
10. Web Services for Remote Portlets – http://www.oasis-open.org/committees/wsrp
11. J.A. James, J.R. Santistiban, M.R. Daymond and L. Edwards, "Use of a Virtual Laboratory to plan, execute and analyse Neutron Strain Scanning experiments", NOBUGS 2002
organizations can be created in principle “on-the-fly” without detailed agreements.

2. PMI Technologies

A number of authorisation control mechanisms exist that can be applied to the Grid; examples include CAS [3], Akenti [4] and VOMS [5]. Each offers its own advantages and disadvantages [6]. PERMIS (Privilege and Role Management Infrastructure Standards Validation) [7] is a Role Based Access Control system [8] which uses X509 Attribute Certificates (ACs) [9] to issue privileges to users on a system. VOMS provides ACs to its users, but attributes are still handled from a central server. PERMIS is completely decentralised, and access control decisions are made locally at the resource side. These access control decisions are typically made through attribute certificates (ACs) signed by a party trusted by the local resource provider. This might be the local source of authority (SoA); however, a more scalable model is to delegate this responsibility in a strictly controlled manner to trusted personnel involved in the virtual organization (VO).

In a Grid context it is unrealistic to expect all information to be maintained centrally. VOs may well have many users and resources across multiple sites, and these users come and go throughout the course of the VO collaboration. Knowing, for example, that a given user is at Glasgow University is best answered by the Glasgow University authentication processes. However, whilst knowledge of a given user’s status may well be best answered by that user’s home authentication infrastructure, the roles and responsibilities needed to access remote resources specific to that VO may best be delegated to trusted personnel associated with that VO – it is this capability that the DIS service is to support. To achieve this requires that an authorization infrastructure exists that can firstly define appropriate policy enforcement points (PEPs) and policy decision points (PDPs), i.e. define and enforce the rules needed to grant/deny access requests based upon ACs.

PERMIS offers a generic API which can be applied to any resource, so our investigations could also be applied to non-Grid regimes.

The PERMIS decision function can be a standalone Java API, or it can be deployed in the same container as the Grid Service it is intended to protect. The GGF SAML AuthZ API [10][11] provides a method for Globus to bypass the generic GSI [12] access control and allow external services to make authorisation decisions. Once deployed, the PERMIS service requires an XML policy which describes in complete detail the targets, actions and roles which apply at the resource (or at the institution). This policy may be written using the Policy Editor GUI supplied with the PERMIS software, or may be edited by hand. Another important GUI supplied with PERMIS is the Privilege Allocator (or the slightly more user-friendly Attribute Certificate Manager (ACM)). This is responsible for allocating roles and signing ACs for users. The tool can also be used to browse LDAP directories for ACs, and can be useful in confirming that ACs have been loaded correctly.

In order to implement a dynamic PMI, extensions to this ACM tool need to be made. As it stands, the ACM can issue any certificate it wishes, irrespective of its validity within the PMI. Ideally, a method is needed of enforcing the infrastructure described in the XML policy, so that an administrator is only allowed to issue valid ACs. To keep user information at the home site, it would be necessary to have a mechanism that would allow a remote administrator to issue ACs to the home site LDAP only with roles relevant at their institution. There are two gains to this: the first is that the remote admin can only operate within a very restricted attribute set, which excludes any possibility of them issuing home site roles; the second is that, as requested, all important user data is still held at the home institution.

The Delegation Issuing Service (DIS) is intended to provide this functionality. The DyVOSE project is the first project to investigate this technology to any great extent, with the goal of providing a user and admin guide to the installation and operation of the DIS. In the next section we describe some of the technicalities of setting up and using the DIS, and then we outline how we have applied this technology for teaching purposes.

3. The Delegation Issuing Service

In the current implementation, the DIS software is a web service consisting of a Java library based on the Tomcat and AXIS SOAP servers. The web service is accessed by a DIS client written in PHP running on an Apache server, which acts as a proxy between the DIS service and the user. This client invokes the Java component through SOAP calls, and is presented to the user in a web browser after mandatory authentication to Apache using their own username and password. These components may be hosted on separate
670
Proceedings of the UK e-Science All Hands Meeting 2006, © NeSC 2006, ISBN 0-9553988-0-0
machines, although for the DyVOSE investigations they were situated on the same computer.

The DIS service assumes that a Tomcat application server of the recommended version 4.x is installed on the server side, along with an LDAP server for attribute storage and Apache authentication. A Java Runtime Environment of at least version 1.4 is required. On the client side, a functional Apache web server loaded with the SSL, PHP and LDAP modules is required. The Apache and Tomcat servers were hosted on the same server with no incompatibility issues encountered. The resource OS was chosen to be Fedora Core 4, as this distribution contained all the Apache functionality and extra modules as standard RPMs, which were loaded (and to some extent configured) automatically. Edinburgh has successfully (in terms of providing default functionality) migrated the DIS service to Fedora Core 5. In addition, the LDAP backend (Berkeley DB) was of a version advanced enough to allow the most recent version of OpenLDAP to be installed on the machine. For the purposes of the Grid Course assignment at Glasgow, this was abandoned in favour of a slightly older version of LDAP which was compatible with the GT3.3 PERMIS Authz service deployed on a separate machine, although the Edinburgh DIS was successfully integrated with the newest version of LDAP.

The DIS software itself ships as two gzip files containing the server and client side tools separately. These files include the necessary Java libraries, Web Service Deployment Descriptor file, LDAP core schema, and configuration files. A sample LDIF file containing the required DIS users and their certificates is provided for loading into the LDAP server to test the installation. Due to the complexity of the surrounding PKI, this file is essential for installation, as it is highly unlikely that a DIS-friendly certificate infrastructure could be deployed prior to confirming the success of the DIS install. One drawback is that using this file forces the DIS service to handle users with a fixed Distinguished Name (DN), which makes that particular setup quite non-portable.

The implementation of the DIS requires a consistent PKI comprising a total of around nine certificates and key pairs in order to realise the service and its proxy. In several cases, the certificates need to be loaded in three different formats (PEM, DER and p12) [13] in order to talk to the various components of the DIS service and the underlying PERMIS server that the DIS creates ACs for. These certificates are created from the command line using openssl, after creation of a site-specific configuration file which handles certificate extensions and populates the DN with a structure corresponding to the users present in LDAP. The two key users in the DIS infrastructure are the Source of Authority (SoA) and the DIS user. These are the only two users who require pre-loaded ACs, as everyone else in the LDAP server can be allocated ACs by the DIS. The AC is stored as an attribute labelled attributeCertificateAttribute, with an optional ;binary extension dependent on the local LDAP schema. The AC is created using the PERMIS Attribute Certificate Manager (ACM) GUI, which can also load the certificate into LDAP. These two pre-loaded ACs are essential to the operation of the PMI, and for the SoA and DIS user they contain an attribute 'permisRole' whose entries are a list of all assignable roles within the infrastructure. In the case of the DIS, the attribute list is an explicit statement of every role that the DIS can assign and delegate. In addition to the attribute list, the SoA requires another attribute called 'XML policy' which contains the XML file representing the site policy. This policy states the hierarchical relationships between roles, which targets and actions these roles apply to, the scope references, and which SoAs are to be trusted within the VO.

The SoA is the root of trust for the PKI, and signs every certificate (and, through the DIS, every Attribute Certificate) within the infrastructure. A PEM format root certificate was created using openssl, along with its corresponding encrypted private key. This certificate is required to be present in the Apache SSL configuration and is used to create user certificates compatible with the Globus Toolkit, allowing Grid users to interact with the PMI and be assigned meaningful privileges. The root certificate is required to be loaded into the SoA node of the LDAP server; in this case the PEM certificate needs to be converted to the DER format using openssl. In LDAP, this certificate is loaded under the "userCertificate" attribute, again with an optional ";binary" suffix depending on the version of LDAP. In addition, this DER format file is required by the DIS server for validation, and also needs to be loaded into the Tomcat server keystore and the Java JRE security CA keystore. Finally, the root certificate and its private key need to be converted to a PKCS#12 (p12) format, which is used by the Attribute Certificate Manager (ACM) to sign the SoA ACs.
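The PEM/DER conversions described above are performed with the openssl command line in the project itself. As a small illustration of the relationship between the two formats (PEM is simply base64-armoured DER), the Python standard library can round-trip the encodings; the byte string below is a dummy stand-in for a real certificate, not the actual DyVOSE SoA certificate:

```python
import ssl

# Dummy DER bytes standing in for a real SoA certificate; a real
# deployment would read the openssl-generated certificate file.
der_original = bytes(range(64))

# Wrap the DER bytes in PEM armour (base64 between BEGIN/END lines),
# the form required in, for example, the Apache SSL configuration.
pem = ssl.DER_cert_to_PEM_cert(der_original)
assert pem.startswith("-----BEGIN CERTIFICATE-----")

# Convert back to DER, the form loaded into the LDAP
# "userCertificate;binary" attribute and the Java keystores.
der_again = ssl.PEM_cert_to_DER_cert(pem)
assert der_again == der_original
```

The p12 (PKCS#12) bundling of a certificate with its private key has no standard-library equivalent and is left to openssl itself.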
Figure 2: Diagram of the interactions required for Edinburgh to issue an AC granting access
to its resources.
locations for testing the service functionality. This is undesirable as, in general, information on students should only really be present at the student’s home institution. The benefit of this approach is that the implementation of this static PMI is well understood and easily maintained through expertise gained in the previous year’s work.

To extend this static PMI to a model supporting dynamic delegation, a DIS service was created at both sites. Using the Glasgow DIS, the Edinburgh SoA could log in and grant Glasgow users the privilege to access the Edinburgh data store. This way Glasgow retains all its students’ details, yet through privilege delegation an Edinburgh user can grant these users a role recognised by Edinburgh, provided they have been granted this privilege by the Glasgow DIS.

A number of different approaches were considered, based on the current version of the PERMIS software. One method which was attempted was the use of LDAP referral, which would allow a single LDAP to be specified by the Edinburgh PERMIS Policy Decision Point (PDP) from which to retrieve its user attributes. With referral set up, a branch of the Edinburgh LDAP server would point to the Glasgow LDAP as if it were part of its own tree. This approach ran into several problems, the main one being that when the PERMIS PDP searched the LDAP tree and came across a referral, the LDAP server bounced the details back to the PDP for it to do the search itself on the remote LDAP. Since no functionality exists for this, the PDP crashed each time it encountered a referral. Attempts to make the referral transparent to the PDP, i.e. getting the LDAP server to do the retrieval and present remote attributes as if they came from the local LDAP, were not successful. The PERMIS team assured us that the PDP LDAP server parameter could take several values; however, despite numerous attempts this was never made to work. A solution was found in which multiple LDAP servers could be listed within the site policy itself under a “Repository Policy” subject tag. This method meant that referral was not necessary, nor did it compromise local security, since the entry was merely a location in which to look for attributes and not a statement of trust on the part of the local SoA.

Two aspects of dynamic delegation which at the time of writing were not implemented in the PERMIS software were role mapping and authority recognition. Role mapping allows separate sites with their own security policies to state which roles that apply at one site can match the roles at another site. Typically, an institution will define “External” roles that have lower privilege than the local ones, and these roles are typically used as equivalences. Since this functionality was not present in the implementation, it was forced upon the PMI by an agreement which stated that the external roles at both institutions would be given exactly the same name. Now the only difference between an external Glasgow role and an external Edinburgh role is which SoA (or, in this case, DIS) actually signed the user’s AC.

The second function which was not yet available was recognition of authority, i.e. how the VO formed between the two sites would recognise ACs signed by the other site. An easy solution is to add the external SoA to the “SoA Policy” tag within the XML policy. This way, any ACs extracted which have been signed by the remote host site can be verified. This method, although easy, means that a given site explicitly trusts all of the actions of the remote SoA. Without a DIS service protecting the assignment of ACs, the remote SoA could in principle assign any role it is aware of from the home institution to any of its users. The DIS service, since it only ever issues valid ACs within the constraints of its own site policy, can enforce more stringent rules on what the remote SoA can allocate. However, we suggest a different approach, which has been implemented in our dynamic delegation scenario.

Instead of trusting an external SoA to establish a chain of trust, we created an EdDIS user at the remote host who has been allocated a DIS certificate containing all the roles they may delegate, but which has been signed by the home institution. Therefore, when a Glasgow user presents an AC which has been signed by this EdDIS user (within the Glasgow DIS), this AC will already be trusted by Edinburgh.

the ability to delegate those roles, to this user: “GlaStudentTeamN” and “GlaStudentTeamP”. Now an extra user, EdDIS, appears in the Glasgow DIS. To demonstrate delegation, a new user called “testuser” was created at Glasgow, who was issued with an AC signed by EdDIS which allowed the user to delegate the external roles to other Glasgow users (in this case “User1”). The “testuser” can then log into the Glasgow DIS and create an AC for User1 containing the role “GlaStudentTeamN”.

User1 calls the Edinburgh Grid Service through their own Glasgow BLAST service. GSI passes User1’s DN to the Edinburgh PERMIS PDP. The PDP reads the Edinburgh policy and locates the LDAP server from which to extract the User1 attributes (contained in the Repository Policy list). The remote LDAP is queried, and the User1 AC is extracted. The PDP checks the signature on the AC and verifies that it has been signed by the EdDIS user who, although existing at Glasgow, is signed by the Edinburgh SoA. Since the chain of trust can be verified back to the Edinburgh SoA, which is the trusted SoA in their policy, the PDP establishes that this is a valid AC. The PDP then makes the decision, based on the user attribute presented, whether to release the Protein or Nucleotide data to the Glasgow Grid Service. Any ACs that are part of the Glasgow PMI, issued by the Glasgow SoA, will simply be ignored by the Edinburgh PDP, as they are not signed by any party that the Edinburgh SoA trusts. This allows two PMIs to co-exist in one LDAP tree.

The DIS can assign and revoke user ACs as many times as it wishes, without affecting any other user certificates in its infrastructure. Also, any users who have delegated their roles to other people will find that the roles they delegated remain valid even if they themselves cease to be members of that PMI. A screenshot of the DIS service window is shown in Figure 3.
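The chain-of-trust check described above can be sketched as follows. This is an illustrative toy model under our own naming (it is not the PERMIS API): an AC is accepted only if its delegation chain terminates at an SoA trusted by the local policy, and ACs rooted elsewhere are ignored.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AC:
    holder: str                       # user the AC is issued to
    role: str                         # e.g. "GlaStudentTeamN"
    issuer: str                       # who signed this AC
    issuer_ac: Optional["AC"] = None  # the issuer's own AC, if delegated

def root_signer(ac: AC) -> str:
    # Walk the delegation chain (EdDIS -> testuser -> User1 style)
    # back to the certificate signed directly by an SoA.
    while ac.issuer_ac is not None:
        ac = ac.issuer_ac
    return ac.issuer

def pdp_decision(ac: AC, trusted_soas: set, allowed_roles: set) -> str:
    # ACs whose chain does not terminate at a trusted SoA are ignored.
    if root_signer(ac) not in trusted_soas:
        return "deny"
    # Otherwise check the presented role against the site policy.
    return "grant" if ac.role in allowed_roles else "deny"

# Edinburgh's SoA signs the EdDIS AC; delegation then flows downward.
ed_dis_ac   = AC("EdDIS", "GlaStudentTeamN", "EdinburghSoA")
testuser_ac = AC("testuser", "GlaStudentTeamN", "EdDIS", ed_dis_ac)
user1_ac    = AC("User1", "GlaStudentTeamN", "testuser", testuser_ac)

# An AC signed only by the Glasgow SoA lies outside Edinburgh's trust.
glasgow_ac = AC("User1", "GlaStaff", "GlasgowSoA")

print(pdp_decision(user1_ac, {"EdinburghSoA"}, {"GlaStudentTeamN"}))    # grant
print(pdp_decision(glasgow_ac, {"EdinburghSoA"}, {"GlaStudentTeamN"}))  # deny
```

A fuller model would also verify that each link in the chain only delegates roles its own AC holds, which is the constraint the DIS enforces via the site policy.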
The DIS software shows great promise as a tool to enable dynamic VO establishment. We have successfully demonstrated a VO which allows an SoA at a remote site to securely assign and revoke privileges for home users, without user information being factored out to external databases. The adherence of the DIS service to the local policy means that only valid ACs can be issued, and the scenario described above allows the establishment of distributed trust without surrendering local security. Once installed, the service is intuitive, with a GUI interface that allows AC issuing in a few seconds.

The Grid Computing module is now completed. Of the 11 students that took the module this year, all but one managed to access and retrieve data from the PERMIS-protected service in Edinburgh, thus proving that the infrastructure works.

The work has not been without issues, however. The lack of availability of source code, due to commercial concerns, makes the tracking of errors and diagnosis of problems very problematic. This extended the development time of the project by an unacceptable amount, and left us reliant on the tireless help of the PERMIS development team, in particular Sassa Otenko, to whom we are grateful for his efforts. The underlying PKI needed to establish the DIS service is over-complicated: most certificates are required to be duplicated and converted many times through the system. This is probably due to the web-proxy-based approach of the GUI, which demands that many certificates and keystore entries be maintained. Some of our scenario definitions have had to change on several occasions due to undocumented features within the DIS and PERMIS.

In the absence of source code, some heavier documentation would be desirable, in particular with regard to setting up the PKI. A simpleCA approach that could generate the appropriate keys, in the appropriate formats, according to the domain structure of your institution would be invaluable. Once running, the software is easy to use and robust, but the implementation time required at this stage of the software’s development may be outside the remit of developers who are new to this technology.

Nevertheless, the proof of concept that dynamic delegation of authority works has been a major output of the project. We believe that this federated model of policy definition and management, which suits the needs of a multitude of VOs, has numerous potential application areas. One key area of focus is to support and extend the largely static nature of ACs used, for example, in Shibboleth identity and service provider interactions. Shibboleth access to resources has up to now been based largely upon an agreed set of attributes and their values, based for example around the eduPerson object class (www.eduperson.org). This model is not conducive to the more dynamic nature of short-lived Grid-based VOs which come together for a given time to solve a particular problem. In this case, dynamic creation and recognition of attributes based on a limited trust model is more apposite. As such, we plan to explore this technology in a range of other e-Science projects at the National e-Science Centre in Glasgow.

5.1 Acknowledgements

The DyVOSE project was funded by a grant from the Joint Information Systems Committee (JISC) as part of the Core Middleware Technology Development Programme. The authors would like to thank the programme manager Nicole Harris and collaborators in the project. In particular, special thanks are given to Professor David Chadwick and especially Dr Sassa Otenko for help in exploring the DIS and PERMIS technologies.

6. References

[1] R.O. Sinnott, A.J. Stell, J. Watt, "Experiences in Teaching Grid Computing to Advanced Level Students", Proceedings of CLAG+GridEdu Conference, May 2005, Cardiff, Wales
[2] BLAST (Basic Local Alignment Search Tool), http://www.ncbi.nih.gov/Education/BLASTinfo/information3.html
[3] L. Pearlman, et al., "A Community Authorization Service for Group Collaboration", in Proceedings of the IEEE 3rd International Workshop on Policies for Distributed Systems and Networks, 2002
[4] W. Johnston, S. Mudumbai, M. Thompson, "Authorization and Attribute Certificates for Widely Distributed Access Control", IEEE 7th International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprises, Stanford, CA, June 1998, pp. 340-345 (http://www-itg.lbl.gov/security/Akenti)
[5] VOMS Architecture, European Datagrid Authorization Working Group, 5th September 2002
[6] A.J. Stell, "Grid Security: An Evaluation of Authorisation Infrastructures for Grid Computing", MSc Dissertation, University of Glasgow, 2004
[7] Privilege and Role Management Infrastructure Standards Validation project (www.permis.org)
[8] D.W. Chadwick, O. Otenko, "The PERMIS X509 Role Based Privilege Management Infrastructure", Future Generation Computer Systems, 936 (2002) 1-13, December 2002, Elsevier Science BV
[9] D.W. Chadwick, O. Otenko, E. Ball, "Role Based Access Control with X.509 Attribute Certificates", IEEE Internet Computing, March-April 2003, pp. 62-69
[10] V. Welch, F. Siebenlist, D. Chadwick, S. Meder, L. Pearlman, "Use of SAML for OGSA Authorization", June 2004, https://forge.gridforum.org/projects/ogsa-authz
[11] OASIS, Assertions and Protocol for the OASIS Security Assertion Markup Language (SAML) v1.1, 2 September 2003, http://www.oasis-open.org/committees/security
[12] Globus Security Infrastructure (GSI), http://www.globus.org/security
[13] OpenSSL: http://www.flatmtn.com/computer/Linux-SSLCertificates.html
University of Kent
Abstract
Authorization infrastructures manage privileges and render access control decisions, allowing
applications to adjust their behavior according to the privileges allocated to users. This paper describes
the PERMIS role based authorization infrastructure along with its conceptual authorisation, access
control, and trust models. PERMIS has the novel concept of a credential validation service, which
verifies a user’s credentials prior to access control decision making and enables the distributed
management of credentials. Details of the design and the implementation of PERMIS are presented along
with details of its integration with Globus Toolkit, Shibboleth and GridShib. A comparison of PERMIS
with other authorization and access control implementations is given, along with our plans for the future.
Figure 1 shows our high level conceptual model for an authorization infrastructure. Step 0 is the initialization step for the infrastructure, when the policies are created and stored in the various components. Each subject may possess a set of credentials from many different Attribute Authorities (AAs), which may be pre-issued, long lived and stored in a repository, or short lived and issued on demand, according to their Credential Issuing Policies. The Subject Source of Authority (SOA) dictates which of these credentials can leave the subject domain for each target domain. When a subject issues an application request (step 1), the policy decision point (PDP) informs the application's policy enforcement point (PEP) which credentials to include with the user's request (steps 3-4). These are then collected from the Credential Issuing Service (CIS) or Attribute Repository by the PEP (steps 5-6). The user's request is transferred to the target site (step 7), where the target SOA has already initialized the Credential Validation Policy, which says which credentials from which issuing AAs are trusted by the target site, and the Access Control Policy, which says which privileges are given to which attributes. The user's credentials are first validated (steps 8-9) and then the validated attributes, along with any environmental information, such as the current date and time (step 10), are passed to the PDP for an access control decision (steps 11-12). If the decision is granted, the user's request is allowed by the PEP; otherwise it is rejected. In more sophisticated systems there may be a chain of PDPs called by the PEP, in which case each PDP may return granted, denied or don't know; the latter response allows the PEP to call the next PDP in the chain.

Figure 2: Example User-Role-Privilege Assignments

PERMIS uses the Role Based Access Control (RBAC) model [18]. Roles are used to model organizational roles, user groups, or any attribute of the user. Subjects are assigned attributes, or role memberships. A subject can be a member of zero, one or multiple roles at the same time; conversely, a role can have zero, one or more subject occupants at the same time.

Privileges are allocated directly to roles or assigned attributes. Thus each role (or assigned attribute) is associated with a set of privileges, representing the authorised rights the role/attribute has been given by the system administrator. These are rights to perform operations on target objects. Thus a subject is authorised to perform the operations corresponding to his role memberships (or attribute assignments). Changing the privileges allocated to a role/attribute will affect all subjects who are members of the role or who have the assigned attribute.

Figure 2 shows that UserA is a member of both RoleA and RoleB, and UserB is a member of RoleB. RoleA has been granted privileges P1 and P3, whilst RoleB has been granted privilege P2. With Role Based Access Controls this means that UserA is granted privileges P1, P2 and P3, whilst UserB only has privilege P2.

PERMIS supports hierarchical RBAC, in which roles (or attributes) are organized in a hierarchy, with some being superior to others. A superior role inherits all the privileges allocated to its subordinate roles. For example, if the role Staff is subordinate to Manager, the Manager role will inherit the privileges allocated to the Staff role: a member of the Manager role can perform operations explicitly authorized to Managers as well as operations authorised to Staff. The inheritance of privileges from subordinate roles is recursive: a role ro will inherit privileges from all its direct subordinate roles rs, and from the indirect subordinate roles that are themselves direct or indirect subordinates of rs.

2.2 The Trust Model

Credentials are the format used to securely transfer a subject's attributes/roles from the Attribute Authority to the recipient. They are also known as attribute assertions [20]. PERMIS only trusts valid credentials. A valid credential is one that has been issued by a trusted AA, or its delegate, in accordance with the current authorization policies (Issuing, Validation and Delegation policies).

It is important to recognize the difference between an authentic credential and a valid credential. An authentic credential is one that has been received exactly as it was originally issued by the AA. It has not been tampered with or modified; its digital signature, if present, is intact and validates as trustworthy by the underlying PKI, meaning that the AA's signing key has not been compromised, i.e. its public key (certificate) is still valid. A valid credential, on the other hand, is an authentic credential that has been issued according to the prevailing authorization policies. An example that clarifies the difference is the paper money issued by the makers of the game Monopoly. This money is authentic, since it has been issued by the makers of Monopoly. The money is also valid for buying houses on Mayfair in the game of Monopoly. However, the money is not valid if taken to the local supermarket, because their policy does not recognize the makers of Monopoly as a trusted AA for issuing money.

Recognition of trusted AAs is part of PERMIS's Credential Validation Policy. PERMIS checks that the AA is mentioned in this policy directly, or that the credential issuer has been delegated a privilege by a trusted issuer, recursively (i.e. a recursive chain of trusted issuers is established, controlled by the Delegation Policies of the Target SOA and the intermediate AAs in the chain). The PERMIS Credential Validation Policy contains rules that govern which attributes different AAs are trusted to issue, along with a Delegation Policy for each AA.
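The user-role-privilege assignments of Figure 2 and the hierarchical inheritance described above can be sketched as follows; this is a minimal illustration whose names and data structures are invented for the sketch, not part of the PERMIS API:

```python
# Minimal illustrative sketch of PERMIS-style hierarchical RBAC;
# all names here are invented for illustration, not the PERMIS API.
user_roles = {"UserA": {"RoleA", "RoleB"}, "UserB": {"RoleB"}}
role_privs = {"RoleA": {"P1", "P3"}, "RoleB": {"P2"},
              "Manager": {"ApproveBudget"}, "Staff": {"ReadReports"}}
subordinates = {"Manager": {"Staff"}}  # Staff is subordinate to Manager

def role_privileges(role):
    """A role's own privileges plus those inherited, recursively,
    from all its direct and indirect subordinate roles."""
    privs = set(role_privs.get(role, ()))
    for sub in subordinates.get(role, ()):
        privs |= role_privileges(sub)
    return privs

def user_privileges(user):
    """Union of the privileges of every role the user occupies."""
    privs = set()
    for role in user_roles.get(user, ()):
        privs |= role_privileges(role)
    return privs
```

With Figure 2's assignments, `user_privileges("UserA")` yields P1, P2 and P3, `user_privileges("UserB")` yields only P2, and a Manager inherits the Staff privileges.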
These Credential Validation Policy rules separate AAs into different groups and assign them different rights to issue different attributes to different sets of subjects. Further, each AA will have its own Credential Issuing Policy and Delegation Policy. PERMIS assumes that if a credential has been issued and signed by an AA, then it must be conformant to the AA's Issuing Policy, so this need not be checked any further. However, if the credential was subsequently delegated, this may or may not have conformed to the original AA's Delegation Policy. Therefore when PERMIS validates a credential it checks that it conforms to the AA's Delegation Policy as well as the Target SOA's Delegation Policy. PERMIS also makes sure that all credentials conform to the delegation paradigm that an issuer cannot delegate more privileges than he has himself, to ensure constrained propagation of privileges from issuers to subjects.

The net result of this trust model is that PERMIS can support multiple AAs issuing different sets of attributes to different groups of users, in which each AA can have different delegation policies, yet the target SOA can specify an overall Credential Validation Policy that constrains which of these (delegated) credentials are trusted to be used to access the resources under his control.

3. PERMIS: A Modular Authorization Infrastructure

The PERMIS authorization infrastructure is shown in Figure 3. It provides facilities for policy management, credential management, credential validation and access control decision making.

Figure 3: Architecture of the PERMIS Authorization Infrastructure

3.1 Policy Management

PERMIS policies are the rules and criteria that the decision making process uses to render decisions. A policy mainly contains two categories of rules: trust related rules (the Credential Validation Policy) and privilege related rules (the Access Control Policy). Trust related rules specify the system's trust in the distributed Attribute Authorities; only credentials issued by trusted AAs within their authority will be accepted. Privilege related rules specify the domains of targets, the role hierarchies, the privileges assigned to each role, and the conditions under which these privileges may be used, for example, the times of day or the maximum amount of a resource that may be requested.

PERMIS provides a policy composing tool, the Policy Editor [13], which users can use to compose and edit PERMIS policies. The GUI of the Policy Editor comprises: the subject (user) policy window, the trusted AA policy window, the role assignment policy window, the role hierarchy policy window, the target resource policy window, the action policy window and the role-privilege policy window. These windows provide forms for users to fill in, from which the tool generates the corresponding PERMIS policy. Policies can be saved in pure XML format, or the XML can be embedded in an X.509 Attribute Certificate (AC) [3] and signed with the policy author's private key.

The Policy Editor is capable of retrieving information from LDAP directories, such as subject names, and of writing policy ACs back to the author's entry in the LDAP directory. Authors can use the Policy Editor to browse the directories and select existing policies to update them.

3.2 Credential Management

The Credential Management system is responsible for issuing and revoking subject credentials. The Attribute Certificate Manager (ACM) tool is used by administrators to allocate attributes to users in the form of X.509 ACs. These bind the issued attributes to the subject's and issuer's identities in a tamper-proof manner. The ACM has a GUI that guides the manager through the process of AC creation, modification and revocation. The manager can search for a user in an attached LDAP directory, or enter the DN of the user directly. There is then a picking list of attribute types (e.g. role, affiliation etc.), to which the manager can add his own value (e.g. project manager, University of Kent). There is a pop-up calendar allowing the manager to select the dates between which the AC is valid, plus the option of adding appropriate times of day to these. The manager can then add a few standard extensions to the AC, to say whether the holder is allowed to delegate further or not, and if so, how long the delegation chain can be (the "basic attribute constraints" extension [3]), or whether the holder may assert the attributes or only delegate them to others (the "no assertion" extension [4]). Finally, the manager must add his digital signature to the AC, so the GUI prompts him for the PKCS#12 file holding his private key and for his password to unlock it. Once the AC is signed, the manager has the option of storing it in an LDAP directory or in a local filestore. Besides creating ACs, the ACM allows the manager to edit existing ACs and to revoke existing ACs by deleting them from their storage location. Note that at present revocation lists have not been implemented, because short validity times or deletion from storage have been sufficient to satisfy our current user requirements.

The Delegation Issuing Service (DIS) is a web service that dynamically issues X.509 ACs on demand.
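The constrained-delegation rule described above, namely that an issuer cannot delegate more privileges than he himself holds, can be sketched as follows; the chain representation is invented for illustration and is not the PERMIS credential format:

```python
# Illustrative sketch (not the PERMIS implementation) of the constrained
# delegation paradigm: an issuer may delegate at most the privileges it holds.
def valid_delegated_privileges(chain, trusted_aa_privs):
    """chain: list of (issuer, holder, privs) ordered from the trusted AA down.
    Returns the final holder's valid privileges, or None if any step
    delegates more than its issuer holds or the chain is broken."""
    if not chain:
        return None
    root = chain[0][0]
    if root not in trusted_aa_privs:
        return None                      # root issuer must be a trusted AA
    held = set(trusted_aa_privs[root])   # privileges the root AA may issue
    prev_holder = root
    for issuer, holder, privs in chain:
        # each issuer must be the previous holder, and may only pass on
        # a subset of what it currently holds
        if issuer != prev_holder or not set(privs) <= held:
            return None
        held, prev_holder = set(privs), holder
    return held
```

For example, if a trusted AA holding read, write and delete delegates read and write to Alice, Alice may further delegate read to Bob, but a chain in which Alice tries to delegate write and read after receiving only read is rejected.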
The DIS may be called directly by an application's PEP after a user has invoked the application, to issue short lived ACs for the duration of the user's task. Alternatively, there is an HTTP interface that lets users invoke it via their web browsers. This enables users to dynamically delegate their existing longer lived credentials to other users, to enable them to act on their behalf. This is especially powerful, as it empowers users to delegate (a subset of) their privileges to others without any administrative involvement. Because the DIS is controlled by its own PERMIS policy, written by the Subject SOA, an organization can tightly control who is allowed to delegate what to whom, and then leave its subjects to delegate as they see fit. More details of the DIS can be found in [2].

3.3 Authorization Decision Engine

The PERMIS Authorization Decision Engine is responsible for credential validation and access control decision making. Credential validation is the process that enforces the trust model of PERMIS described in Section 2.2. Access control decision making is the process that implements the Role Based Access Control model described in Section 2.1. The CVS extracts the subset of valid attributes from the set of available credentials, according to the Target SOA's Credential Validation Policy. The PDP makes access control decisions based on the Target SOA's Access Control Policy and the valid attributes passed from the CVS. The PERMIS authorization decision engine is superior to conventional PDPs in that it has the ability to validate credentials and delegation chains, which is not a common capability of conventional PDPs, e.g. the XACML PDP [15]. Note that the engine only renders the decision; it does not enforce it. This responsibility lies with the application dependent PEP.

Figure 4 depicts the overall architecture of the PERMIS Authorization Decision Engine. It comprises five main components: the PDP, the CVS, the Credential Retriever, the Credential Decoder, and the Policy Parser.

3.4 The PDP

The PDP component is responsible for making access control decisions based on the valid attributes of the user and the Target SOA's access control policy. As stated before, this is based on the Role Based Access Control (RBAC) model, with support for attribute/role hierarchies. At initialization time the Target SOA's access control policy is read in and parsed by the Policy Parser so that the PDP is ready to make decisions. Both plain XML policies and digitally signed and protected policies can be read in. The former are stored as files in a local directory, whilst the latter are stored as X.509 policy ACs in the LDAP entry of the Target SOA. The latter are tamper resistant and integrity protected, whereas the former have to be protected by the operating system.

Each time the user makes a request to the application to perform a task, the PEP passes this request to the PERMIS PDP along with the user's valid attributes and any required environmental attributes, such as the time of day. The PEP needs to know which environmental attributes are needed by the access control policy, and since the PEP is software, it is likely that the access control policies will be restricted to constraints based on the environmental attributes that the PEP is capable of passing to the PDP.
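The kind of policy-plus-environment evaluation described in this section can be sketched as follows; the real PERMIS policy is XML and far richer, and the rules, roles and attribute names below are invented for illustration:

```python
from datetime import time

# Invented toy policy: (role, action) -> condition over environmental attributes.
policy = {
    ("Staff", "print"):       lambda env: True,
    ("Staff", "sign_cheque"): lambda env: time(9) <= env["time"] <= time(17),
}

def pdp_decide(valid_attrs, action, env):
    """Grant iff some valid role is authorised for the requested action and
    the policy's environmental condition (e.g. time of day) is satisfied."""
    return any(policy.get((role, action), lambda e: False)(env)
               for role in valid_attrs)
```

A Staff member can print at any time, but can sign cheques only during office hours, and a role with no matching rule is always denied.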
In Figure 5 we can see the general flow of information and sequence of events. First of all the service is initialised by giving it the Credential Validation Policy; the policy parsing module is responsible for this (see Section 3.4). When the user activates the application, the target PEP requests the valid attributes of the subject (step 1). Between the request for attributes and the returning of them (in step 6), the following events may occur a number of times, as necessary, i.e. the CVS is capable of recursively calling itself as it determines the path in a delegation tree from a given credential to a trusted AA specified in the policy.

The Credential Validation Policy Enforcer requests credentials from the Credential Retriever (step 2). PERMIS can operate in either credential pull mode or credential push mode. In credential push mode the application passes the user's credentials along with his request to the target PEP (step 7 in Figure 3) and the PEP passes them to the CVS (step 8a in Figure 3). In credential pull mode, the credentials are dynamically pulled from one or more remote credential providers (these could be AA servers, LDAP repositories etc.) by the CVS (step 8b in Figure 3). The actual attribute request protocol (e.g. SAML or LDAP) is handled by the Credential Retriever module. When operating in credential push mode, the PEP stores the already obtained credentials in a local Credential Provider repository and pushes the repository to the CVS, so that the CVS can operate in logically the same way for both push and pull modes. After credential retrieval, the Credential Retriever module passes the credentials to the Credential Decoding module (step 3). From here they undergo the first stage of validation: credential authentication (step 4). Because only the Credential Decoder is aware of the actual format of the credentials, it has to be responsible for authenticating the credentials using an appropriate Credential Authenticator module. Consequently, both the Credential Decoder and Credential Authenticator modules are encoding specific. For example, if the credentials are digitally signed X.509 ACs, the Credential Authenticator uses the configured X.509 PKI to validate the signatures. If the credentials are XML signed SAML attribute assertions, then the Credential Authenticator uses the public key in the SAML assertion to validate the signature. The Credential Decoder subsequently discards all unauthentic credentials, i.e. those whose digital signatures are invalid. Authentic credentials are decoded and transformed into an implementation specific local format that the Policy Enforcer is able to handle (step 5).

The task of the Policy Enforcer is to decide whether each authentic credential is valid (i.e. trusted) or not. It does this by referring to its Credential Validation Policy to see if the credential has been issued by a trusted AA. If it has, it is valid. If it has not, the Policy Enforcer has to work its way up the delegation tree from the current credential to its issuer, and from there to its issuer, recursively, until a trusted AA is located, or no further issuers can be found (in which case the credential is not trusted and is discarded). Consequently, steps 2-5 are repeated recursively until closure is reached (which, in the case of a loop in the credential chain, will be when the same credential is encountered again). Remember that in the general case there are multiple trusted credential issuers, each of whom may have their own Delegation Policies, which must be enforced by the Policy Enforcer in the same way that it enforces the Target SOA's Delegation Policy.

The CVS can be customized by PERMIS implementers, either by enabling or disabling the credential services built in to the PERMIS Authorisation Decision Engine, or by implementing their own credential decoding services and plugging them into PERMIS. The latter enables implementers to adopt credential formats that are not implemented by PERMIS, such as local proprietary formats. PERMIS can theoretically be customized to support most application specific credential validation requirements.

4. Integrating PERMIS

4.1 Integration with GT4

Globus Toolkit (GT) is an implementation of Grid software, which has a number of tools that make development and deployment of Grid Services easier [9]. One of the key features of this toolkit is secure communications. However, Globus Toolkit has limited authorisation capabilities based on simple access control lists. To improve its authorization capabilities, a Security Assertion Markup Language (SAML) authorization callout has been added. SAML [20] is a standard designed by the Organization for the Advancement of Structured Information Standards (OASIS) to provide a universal mechanism for conveying security related information between the various parts of an access control system. The Global Grid Forum has produced a profile of SAML for use in Grid authorisation [19]. The important consequence of this is that it is now possible to deploy an authorisation service that GT will contact to make authorisation decisions about what methods can be executed by a given subject. A PERMIS Authorisation Service has been developed to provide authorisation decisions for the Globus Toolkit through the SAML callout [8].

4.2 Integration with Shibboleth

Shibboleth [21] is a cross-institutional authentication and authorisation architecture for single sign-on and access control of web resources. Shibboleth defines a protocol for carrying authentication information and user attributes from the user's home site to the resource site. The resource site can then use the user attributes to make access control decisions about the user's request. A user only needs to be authenticated once by the home site in order to visit other Shibboleth protected resource sites in the federation, as the resulting authentication token is recognized by any member of the federation. In addition, the user's privacy can be protected, since the user is able to restrict which attributes will be released to the resource providers by his/her home site. However, Shibboleth's built-in access control decision making, based on the user's attributes, is simplistic in its functionality.
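The loop-safe recursive walk up the delegation tree that the Policy Enforcer performs (steps 2-5 of Section 3 above) can be sketched as follows; the credential representation here is invented for illustration and is not a PERMIS data structure:

```python
# Illustrative sketch of the delegation-tree walk with loop detection.
# credentials: holder -> list of (issuer, attribute) credentials held.
def is_valid(credential, credentials, trusted_aas, seen=None):
    """A credential is valid if issued by a trusted AA, or if its issuer
    itself holds a valid credential for the same attribute (recursively).
    `seen` detects loops in the credential chain: closure is reached
    when the same credential is encountered again."""
    issuer, attribute = credential
    if issuer in trusted_aas:
        return True
    seen = seen or set()
    if credential in seen:               # loop in the chain: discard
        return False
    seen.add(credential)
    return any(is_valid(parent, credentials, trusted_aas, seen)
               for parent in credentials.get(issuer, [])
               if parent[1] == attribute)
```

For example, a credential issued by Alice is valid if Alice in turn holds the same attribute from a trusted AA, whereas two credentials that issue the attribute to each other in a loop are both discarded.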
In addition, the management of the access controls is performed together with web server administration at the resource site. Furthermore, distributed management of credentials and dynamic delegation of authority are not supported. To rectify these deficiencies, a Shibboleth-Apache Authorisation Module (SAAM) has been developed which integrates PERMIS with Shibboleth. SAAM plugs into Apache and replaces the Shibboleth authorization functionality with calls to the PERMIS authorization decision engine. A full description is provided in [5].

PERMIS extends the access control model used in Shibboleth by introducing hierarchies of roles, distributed management of attributes, and policy controlled decisions based on dynamically evaluated conditions. PERMIS supports the existing semantics of Shibboleth attributes, but also allows X.509 ACs to be used instead, where more secure credentials are needed.

4.3 Integration with GridShib

GridShib [10] provides interoperability between Globus Toolkit [9] and Shibboleth [21]. The GridShib Policy Information Point (PIP) retrieves a user's attributes from the Shibboleth Identity Provider (IdP). These attributes are parsed and passed to the GT4 PEP. The GT4 PEP then feeds these attributes to the corresponding PDP to request an authorisation decision. GridShib integrates Shibboleth's attribute management functionality with GT4's authorisation decision making for Grid jobs. However, like GT4, GridShib provides only limited PDP functionality, which is based on access control lists and is not capable of coping with dynamically changing conditions, as a policy based engine is.

Figure 6: GridShibPERMIS Integration Scheme

Figure 6 shows the GridShibPERMIS integration scheme. GridShibPERMIS provides a GridShibPERMIS Context Handler that can be integrated with GT4 as a callable PDP. The Context Handler is invoked by GT4 when an authorisation decision is to be made. The Context Handler is fed with the user's attributes that have been retrieved from the Shibboleth IdP. They are parsed and stored in a local Credential Provider Repository, ready to be accessed by the PERMIS CVS as described in Section 3.5. The Context Handler calls the PERMIS CVS followed by the PDP, which renders an access control decision based on the Target SOA's policy, and returns it to GT4. In the case of multiple PDPs being configured into GT4, the final authorisation decision is made by combining all the decisions returned by the different PDPs. The combining algorithm currently adopted by GT4 is "Deny-Override", which means that the user's request is authorised if and only if no PDP denies the request and at least one PDP grants it.

5. Related Work

Manandhar et al. [12] present an application infrastructure in which a data portal allows users to discover and access data over Grid systems. They propose an authorization framework that allows the data portal to act as a proxy and exercise the user's privileges. When a user authenticates to the data portal, a credential is generated stating that the data portal is authorized to exercise the user's privileges for a specific period. The credential is then used by the data portal to retrieve the user's authorization tokens from various providers. When requesting a service from a service provider, the data portal presents both the credential and the authorization tokens. The authorization decision is then made by the service provider. The proposed infrastructure mainly focuses on the interaction between different systems in the Grid environment, with no in-depth discussion of the access control model or the trust model. Credential verification is also missing from the discussion.

XACML [14] defines a standard for expressing access control policies, authorization requests, and authorization responses in XML format. The policy language allows users to define application specific data types, functions, and combining logic algorithms, for the purpose of constructing complex policies. Sun's open source XACML implementation [15] is a Java implementation of the XACML 2.0 standard and provides most of the features in the standard. The XACML policy language is richer than that of PERMIS's PDP policy, but XACML has not yet addressed the issue of credential validation and is only now working on dynamic delegation of authority [22].

The Community Authorisation Service (CAS) [11] was developed by the Globus team to improve the manageability of user authorisation. CAS allows a resource owner to grant access to a portion of his/her resource to a VO (or community, hence the name CAS), and then let the community determine who can use this allocation. The resource owner thus partially delegates the allocation of authorisation rights to the community. This is achieved by having a CAS server, which acts as a trusted intermediary between VO users and resources. Users first contact the CAS asking for permission to use a Grid resource. The CAS consults its policy (which specifies who has permission to do what on which resources) and, if access is granted, returns a digitally self-signed capability to the user, optionally containing policy details about what the user is allowed to do (as an opaque string). The user then contacts the resource and presents this capability. The resource checks that the capability is signed by a known and trusted CAS and, if so, maps the CAS's distinguished name into a local user account name via the Gridmap file.
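GT4's "Deny-Override" combining algorithm described above reduces to a one-line rule. In this sketch the decision labels are invented; GT4's actual interfaces are not shown:

```python
# Sketch of the "Deny-Override" combining algorithm: grant iff no PDP
# denies and at least one PDP grants. Each PDP is assumed to return one
# of "grant", "deny" or "dont_know" (labels invented for illustration).
def deny_override(decisions):
    return "deny" not in decisions and "grant" in decisions
```

So a single deny from any PDP rejects the request, and a set of PDPs that all answer "dont_know" also rejects it, since no PDP granted.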
Consequently, the Gridmap file now only needs to contain the names of the trusted CAS servers and not all the VO users. This substantially reduces the work of the resource administrator. Further, determining who should be granted capabilities by the CAS server is the task of other managers in the VO community, so this again relieves the burden on resource managers. For finer grained access control, the resource can additionally call a further routine, passing to it the opaque policy string from the capability, and using the returned value to refine the access rights of the user. Unfortunately this part of the CAS implementation (the policy definition and evaluation routine) was never fully explored and developed by the Globus team. This is precisely the functionality that PERMIS has addressed.

The main purpose of SPKI [16] is to provide public key infrastructures based on digital certificates without depending upon global naming authorities. SPKI binds local names and authorizations to public keys (or the hash values of public keys). Names are allocated locally by certificate issuers, and are only of meaning to them. SPKI allows authorizations to be bound directly to public keys, removing the process of mapping from authorizations to names and then to public keys. SPKI supports dynamic delegation of authorizations between key holders, and allocation of authorizations to groups. Though SPKI can convey authorization information, it does not cover authorization decision making or access control policy issues. One can thus regard SPKI as an alternative format to X.509 ACs or SAML attribute assertions for carrying credentials, and PERMIS could easily be enhanced to support this format of credential if it were required.

The EU DataGrid and DataTAG projects have developed the Virtual Organisation Membership Service (VOMS) [6] as a way of delegating the authorisation of users to managers in the VO. VOMS is a credential push system in which the VOMS server digitally signs a short lived X.509 role AC for the VO user to present to the resource. The AC contains role and group membership details, and the Local Centre Authorisation Service (LCAS) [7] makes its authorisation decision based upon the user's AC and the job specification, which is written in job description language (JDL) format. This design is similar in concept to the CAS, but differs in message format and syntax. However, what neither VOMS nor CAS nor LCAS provide is the ability for the resource administrator to set the policy for access to his/her resource and then let the authorisation infrastructure enforce this policy on his/her behalf. This is what systems such as PERMIS and KeyNote [17] provide. It would therefore be relatively easy to replace LCAS with the PERMIS decision engine, so that VOMS allocates role ACs and pushes them to the resource site, whilst PERMIS makes the policy controlled authorization decisions.

KeyNote [17] is a trust management system that provides a general-purpose mechanism for defining security policies and credentials, and for rendering authorization decisions based on them. KeyNote provides a language for defining both policies and assertions, where policies state the rules for security control, and assertions contain predicates that specify the granted privileges of users. KeyNote has been implemented and released as an open source toolkit. But KeyNote is not without its limitations. KeyNote's policies and credentials are in their own proprietary format. KeyNote credentials have no time limit, and KeyNote has no concept of revocation of credentials. Further, policies define the roots of trust, but the policies themselves are not signed and therefore have to be stored securely and are only locally trusted.

6. Conclusions

This paper presents our work on building a modular authorization infrastructure, PERMIS. We have explained the conceptual models of PERMIS by describing its authorization model, access control model, and trust model. The design and implementation of PERMIS have also been presented, with details of the architecture and an explanation of the facilities PERMIS provides to support policy management, attribute management, and decision making. Details of the decision making process and the credential validation service are also given, showing how PERMIS implements hierarchical RBAC decision making based on user credentials and authorization policies.

Finally, we have presented a comparison with related work, pointing out the relative advantages and disadvantages of each system as compared to PERMIS.

6.1 Future Work

In an application, coordination is sometimes needed between access control decisions. For example, in order to support mutually exclusive tasks (Separation of Duties), the PDP needs to know if the same user is trying to perform a second task in a set of mutually exclusive ones. Alternatively, if multiple resources are available but their use is to be restricted, for example to a maximum of 30GB of storage throughout a grid, then enforcing this becomes more difficult. The use of a stateful PDP allows coordination between successive access control decisions, whilst passing coordination data between PDPs allows coordination over the use of restricted multiple resources. In order to achieve the latter, the access control policies should state what coordination between decision making is needed, which coordination data is used to facilitate this, and how this coordination data is updated afterwards. We have already implemented Separation of Duties in a prototype stateful PDP and we are currently working on more sophisticated distributed coordination between PDPs.

Obligations are actions that are required to be fulfilled on the enforcement of access control decisions. Existing authorization infrastructures are mainly concerned with access control decision making, but this is not sufficient in scenarios where obligations are needed. XACML policies already support obligations and we are currently incorporating these into PERMIS. We are building an obligation engine that will evaluate the obligations that are embedded in a policy.
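A stateful Separation-of-Duties check of the kind described in the Future Work section might look like the following; this is a hypothetical sketch, not the prototype's actual interface:

```python
# Hypothetical sketch of a stateful PDP enforcing Separation of Duties:
# a user who has been granted one task in a mutually exclusive set is
# denied the others. State is kept across decisions, unlike a stateless PDP.
class StatefulPDP:
    def __init__(self, exclusive_sets):
        self.exclusive_sets = exclusive_sets   # e.g. [{"raise_po", "approve_po"}]
        self.history = {}                      # user -> tasks already granted

    def decide(self, user, task):
        done = self.history.setdefault(user, set())
        for tasks in self.exclusive_sets:
            # deny if a *different* task in the same exclusive set was granted
            if task in tasks and done & (tasks - {task}):
                return False
        done.add(task)
        return True
```

So a user who raised a purchase order cannot also approve it, although a different user can, and repeating the same task remains allowed.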
For example, the engine will add the amount requested to the amount already consumed, and will return the obligated action to the PEP for enforcement, since it is the PEP that ultimately has to interpret and fulfill the obligations.

Acknowledgements

We would like to thank the UK JISC for funding part of this work under the DyCOM, DyVOSE, SIPS and GridAPI projects, and the EC for funding part of this work under the TrustCoM project (FP6 project number 001945).

References

[1] D.W. Chadwick, A. Otenko. "The PERMIS X.509 Role Based Privilege Management Infrastructure". Future Generation Computer Systems, 936 (2002) 1-13, December 2002. Elsevier Science BV.
[2] D.W. Chadwick. "Delegation Issuing Service". NIST 4th Annual PKI Workshop, Gaithersburg, USA, April 19-21, 2005.
[3] ISO 9594-8/ITU-T Rec. X.509 (2001). "The Directory: Public-key and attribute certificate frameworks".
[4] ISO 9594-8/ITU-T Rec. X.509 (2005). "The Directory: Public-key and attribute certificate frameworks".
[5] Wensheng Xu, David Chadwick, Sassa Otenko. "Development of a Flexible PERMIS Authorisation Module for Shibboleth and Apache Server". Proceedings of 2nd EuroPKI Workshop, University of Kent, July 2005.
[6] R. Alfieri et al. "VOMS: an Authorization System for Virtual Organizations". 1st European Across Grids Conference, Santiago de Compostela, February 13-14, 2003.
[7] Martijn Steenbakkers. "Guide to LCAS v.1.1.16", Sept 2003. Available from http://www.dutchgrid.nl/DataGrid/wp4/lcas/edg-lcas-1.1
[8] David Chadwick, Sassa Otenko, and Von Welch.
[13] Sacha Brostoff, M. Angela Sasse, David Chadwick, James Cunningham, Uche Mbanaso, Sassa Otenko. ""R-What?" Development of a Role-Based Access Control (RBAC) Policy-Writing Tool for e-Scientists". Software: Practice and Experience, Volume 35, Issue 9, 25 July 2005, Pages 835-856.
[14] OASIS. "XACML 2.0 Core: eXtensible Access Control Markup Language (XACML) Version 2.0", Oct 2005.
[15] Sun's XACML Implementation. Available from http://sunxacml.sourceforge.net/
[16] C. Ellison, B. Frantz, B. Lampson, R. Rivest, B. Thomas, and T. Ylonen. "SPKI Certificate Theory". RFC 2693, September 1999.
[17] M. Blaze, J. Feigenbaum, J. Ioannidis, and A. Keromytis. "The KeyNote Trust Management System Version 2". RFC 2704, Sept 1999.
[18] David F. Ferraiolo, Ravi Sandhu, Serban Gavrila, D. Richard Kuhn, and Ramaswamy Chandramouli. "Proposed NIST standard for role-based access control". ACM Transactions on Information and System Security, Volume 4, Issue 3, August 2001.
[19] Von Welch, Rachana Ananthakrishnan, Frank Siebenlist, David Chadwick, Sam Meder, Laura Pearlman. "Use of SAML for OGSI Authorization", Aug 2005. Available from https://forge.gridforum.org/projects/ogsa-authz
[20] OASIS. "Security Assertion Markup Language (SAML) 2.0 Specification", November 2004.
[21] S. Cantor. "Shibboleth Architecture, Protocols and Profiles", Working Draft 02, 22 September 2004. See http://shibboleth.internet2.edu/
[22] XACML v3.0 Administration Policy, Working Draft 05, December 2005. http://www.oasis-open.org/committees/documents.php?wg_abbrev=xacml
[23] N. Zhang, L. Yao, A. Nenadic, J. Chin, C. Goble, A.
“Using SAML to Link the GLOBUS Toolkit to the
Rector, D. Chadwick, S. Otenko and Q. Shi; "Achieving
PERMIS Authorisation Infrastructure”. In Proceedings of
Fine-grained Access Control in Virtual Organisations", to
Eighth Annual IFIP TC-6 TC-11 Conference on
appear in Concurrency and Computation: Practice and
Communications and Multimedia Security, Windermere,
Experience, published by John Wiley and Sons publisher.
UK, September 2004.
[9] I. Foster. “Globus Toolkit Version 4: Software for
Service-Oriented Systems”. IFIP International Conference
on Network and Parallel Computing, Springer-Verlag
LNCS 3779, pp 2-13, 2005.
[10] Barton, T., Basney, J., Freeman, T., Scavo, T.,
Siebenlist, F., Welch, V., Ananthakrishnan, R., Baker, B.,
and Keahey, K. “Identity Federation and Attribute-based
Authorization through the Globus Toolkit, Shibboleth,
Gridshib, and MyProxy”, 5th Annual PKI R&D Workshop.
April 2006.
[11] Ian Foster, Carl Kesselman, Laura Pearlman, Steven
Tuecke, and Von Welch. “The Community Authorization
Service: Status and Future”. In Proceedings of Computing
in High Energy Physics 03 (CHEP '03), 2003.
[12] Ananta Manandhar, Glen Drinkwater, Richard Tyer,
Kerstin Kleese. “GRID Authorization Framework for
CCLRC Data Portal”, Second Earth Science Portal
Workshop: Web Portal Framework
Design/Implementation, September 2003.
Wenfei Fan, Irini Fundulaki, Floris Geerts, Xibei Jia, Anastasios Kementsietsidis
University of Edinburgh
{wenfei@inf, efountou@inf, fgeerts@inf, x.jia@sms, akements@inf}.ed.ac.uk
Abstract
As science communities and commercial organisations increasingly exploit XML as a means of exchanging and disseminating information, the selective exposure of information in XML has become an important issue. In this paper, we present a novel XML security framework developed to enforce a generic, flexible access-control mechanism for XML data management, which supports efficient and secure query access without revealing sensitive information to unauthorized users. Major components underlying the framework are discussed and the current status of a reference implementation is reported.
user query over the XML database returns only information in the database that the user is allowed to access; in other

a powerful specification language for the definition of rich access policies;

Supported in part by EPSRC GR/S63205/01, GR/T27433/01 and BBSRC BB/D006473/1. Wenfei Fan is also affiliated to Bell Laboratories, Murray Hill, USA. Floris Geerts is a postdoctoral researcher of the FWO Vlaanderen and is supported in part by EPSRC GR/S63205/01. He is also affiliated to Hasselt University and
Figure 1. Different access privileges on a medical records XML database: (c) patient Mary's view; (d) Mary's insurer's view; (e) a researcher's view
the ability to effectively derive and publish a number of different schemas (DTD or XML Schema) characterizing accessible data based on different access policies;

2 The Security Architecture

Views are commonly used in traditional (relational) databases to enforce access control, and therefore provide
interacts with the system. These modules provide forms for user input and are responsible for presenting the system output (e.g. the result of user queries). The core and indexer modules implement all the basic algorithms and functionality of the system. These latter modules are internal and thus users do not directly interact with them.

There are two types of users that interact with the system, namely, security administrators, who specify the access policies, and normal users in certain roles, who query the XML views to which they have been granted access through their roles.

Central to the architecture is a number of security views, which are automatically derived from the access policies. These policies are manually specified by annotating the schema of the XML documents.

Four languages are involved in the architecture: security specification and view specification languages, and view query and document query languages. Among them, the security specification language and the view query language are used by the security administrators and the users respectively. The view specification language and the document query language are used by the view derivation and query rewriting modules respectively, for representing the automatically generated view definitions and queries. Unlike the former two, they are internal and thus not visible to users. We have to note here that, in contrast to most of the state-of-the-art approaches for XML access control, which employ XPath as their security specification language [8], we use Regular XPath, a mild extension of XPath. We argue in Section 4 why we choose Regular XPath as the security view specification language over normal XPath.

We will use the example introduced in Section 1 to illustrate the security framework built on this architecture. First, security administrators specify the security specifications for the different roles (researchers, insurers, doctors) using the security specification module. The security specifications are then passed to the view derivation module, where a set of security view definitions is automatically derived for the roles of researchers, insurers, and doctors respectively. Next, a user in the role of, say, a researcher, poses a query using the query editor module over the virtual security view. This query is subsequently rewritten into a new query over the underlying document in the query rewriting module. The query is optimised in the query optimization module, with the optional index from the indexer module, and passed to the query evaluation module in order to be executed over the document. Finally, the results are shown to the user in the result viewer module.

The eight modules that comprise the architecture can also be classified in two categories according to their purpose: modules to enforce security constraints (security specification editor, view derivation); and modules to support efficient query answering (query editor, query rewriting, evaluation and optimization, indexer, result viewer). In the following sections, we use this view of the modules to present their specifics in more detail.
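As a rough illustration of the query-answering flow just described, the following sketch chains toy stand-ins for the view derivation, query rewriting and query evaluation modules. All names and data structures here are invented for illustration; SMOQE itself operates on XSDs and Regular XPath, not on the nested dicts and label paths used below.

```python
# Illustrative sketch of the query-answering pipeline described above.
# Invented data model: documents as nested dicts, queries as label paths.

def derive_view(doc, policy, role):
    """View derivation module: prune subtrees the role may not access."""
    if isinstance(doc, dict):
        return {tag: derive_view(child, policy, role)
                for tag, child in doc.items()
                if policy.get((role, tag), "yes") == "yes"}
    return doc  # text node

def rewrite(path, policy, role):
    """Query rewriting module: reject paths through inaccessible nodes."""
    if any(policy.get((role, tag)) == "no" for tag in path):
        return None  # the query falls outside the security view
    return path

def evaluate(doc, path):
    """Query evaluation module: follow a simple downward label path."""
    for tag in path:
        doc = doc[tag]
    return doc

def answer(doc, policy, role, path):
    """Pipeline: derive the view, rewrite the query, evaluate it."""
    view = derive_view(doc, policy, role)
    q = rewrite(path, policy, role)
    return evaluate(view, q) if q is not None else None

record = {"Record": {"Patient": {"Name": "Mary"}, "Diagnosis": "flu"}}
policy = {("insurer", "Diagnosis"): "no"}  # insurers cannot see diagnoses
print(answer(record, policy, "insurer", ["Record", "Patient", "Name"]))  # Mary
print(answer(record, policy, "insurer", ["Record", "Diagnosis"]))        # None
```

Here rewriting merely filters the query; in the actual module, a query over the virtual view is translated into a query over the underlying document.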
3 Modules to Enforce Security Constraints

To enforce the security constraints, a security framework needs the necessary tools to (i) support the specification of these constraints and (ii) enforce them when the users interact with the system. In the proposed framework, these functions are modularized into the security specification editor and the view derivation module.

3.1 Security Specification Editor

User Interaction: the security administrators associate the security constraints for each role with the elements in the XSD.
Output: security specification.

that the elements of type in an instantiation of are accessible, inaccessible, and conditionally accessible, respectively. The value of the attribute "accessible" for the root is "yes" for every role by default, unless otherwise specified. A fragment of the security specification for the XML database in Figure 1 is illustrated in the following example. It is defined by extending the XSD graph shown in Figure 3.

<xs:element name="Hospital">
  <xs:complexType><xs:sequence>
    ...
  </xs:sequence></xs:complexType>
</xs:element>
<xs:element name="Genetics" type="records">
  <as:role accessible="no">insurer researcher</as:role>
</xs:element>
and (c) conditionally accessible nodes, e.g., Record elements, which are accessible by a doctor if and only if the record is about one of his patients.

The security specification language supports the following salient features: (a) inheritance: if a node does not have an accessibility explicitly defined, then it inherits the accessibility of its parent (the root node is accessible by default for all roles); (b) overriding: on the other hand, an explicit accessibility definition at a node will override the accessibility inherited from its parent; (c) content-based access privileges: the accessibility of a node is specified with predicates/qualifiers (yes, no, or Regular XPath qualifiers); and (d) context-dependency: the accessibility of elements in a document is determined by the paths from the root to these elements in the document; e.g., the diagnosis child of a doctor is accessible only if the doctor is accessible.

The security specification editor provides the security administrators with two modes to fill in the access policy: a graph mode and a text mode. The graph mode presents the administrators with an XSD graph of the underlying XML document. The administrators associate a security annotation for a role with a node by clicking the node in the XSD graph. The text mode automatically generates code skeletons in the access specification language. The administrators can add more rules to override the default rules in the skeletons to (conditionally) grant access to any part of the XML document for some role.

3.2 View Derivation Module

Input: security specifications.
Output: i) security view specifications for each role; ii) a security view schema for each role.

The security specification is subsequently processed by the view derivation module, which automatically derives a set of security views expressed in an internally used view specification language. A view is an XSD annotated with Regular XPath queries, along the same lines as DAD (IBM DB2 XML Extender [10]), AXSD (Microsoft SQL Server [13]) and Oracle [17]. The XSD exposes only accessible data w.r.t. the policy, and is used by the users it authorizes to formulate their queries over the view. The Regular XPath annotations are not visible to authorized users, and are used internally by the system to extract accessible data from the XML document. The only structural information about the document that the users are aware of is the view schema, and no information beyond the view can be inferred from user queries. Thus, our security views support both access/inference control and schema availability.

Syntactically, a security view for a role is defined as a pair: the view XSD, together with Regular XPath query annotations used to extract the accessible data for each element type from an instance of the document. Specifically, for each element there is a data element which contains a Regular XPath query defined over instances of the document schema. If not explicitly defined, the default for an element is simply a trivial relative Regular XPath query that is the same as the element name.

An access control policy over a relational database can be simply enforced by defining a view via an SQL query. The view, referred to as a security view, consists of all and only the accessible information w.r.t. the policy. The view schema – flat relations – can be readily derived from the view definition and provided to the users for their query formulation. In contrast, security views for XML pose many new challenges. Consequently, complex algorithms are needed for the view derivation module. A dynamic-programming-based view derivation algorithm has been proposed in [5], which assumes DTDs and uses a subset of XPath to define specifications and security views. These algorithms have been extended in our context to account for Regular XPath.

<xs:element name="Hospital">
  <xs:complexType><xs:sequence>
    <xs:element name="Record" maxOccurs="unbounded">
      <vs:data>
        (Psychiatry ∪ Genetics)/Record[Patient/Name=$custName]
      </vs:data>
      <xs:complexType><xs:sequence>
        <xs:element name="Date" type="xs:date"/>
        <xs:element name="Bill" type="bType"/>
        <xs:element name="Patient" type="pType"/>
      </xs:sequence></xs:complexType>
    </xs:element>
  </xs:sequence></xs:complexType>
</xs:element>
...

Figure 5. Security view for role "insurer"

Example 2: Figure 5 shows the security view derived for the role "insurer" from the access specification in Example 1.

4 Modules to Support Efficient Query Answering

A security framework should not only be able to enforce security constraints but also support efficient querying in their presence. The latter goal is important because a security framework with a dramatic performance decrease for query answering cannot be useful to real-world users and applications.

Since security views are virtual, user queries must be rewritten into queries on the underlying XML document and then executed on the XML database to generate the result. In what follows, we present (through the modules) the whole query answering process, starting from the moment queries are composed in the query editor all the way to the point where query results are returned to the user.
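Features (a) inheritance and (b) overriding above amount to propagating annotations down the document tree. A minimal sketch follows, with an invented tree-and-annotation encoding (the paper attaches annotations to the XSD, not to dicts):

```python
# Sketch of inheritance and overriding of accessibility annotations:
# an element without an explicit annotation inherits its parent's
# accessibility; an explicit annotation overrides the inherited one.

def effective_accessibility(tree, annotations, inherited="yes", path=()):
    """Return {element path: "yes"/"no"}, propagating annotations downward."""
    result = {}
    for tag, children in tree.items():
        here = path + (tag,)
        access = annotations.get(here, inherited)  # override, else inherit
        result[here] = access
        result.update(
            effective_accessibility(children, annotations, access, here))
    return result

tree = {"Hospital": {"Psychiatry": {"Record": {}}, "Name": {}}}
annotations = {("Hospital", "Psychiatry"): "no"}  # hide the whole department
acc = effective_accessibility(tree, annotations)
print(acc[("Hospital", "Psychiatry", "Record")])  # no  (inherited)
print(acc[("Hospital", "Name")])                  # yes (root default)
```

Content-based access (feature (c)) would replace the "yes"/"no" values with Regular XPath qualifiers evaluated against the data, which this sketch omits.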
After the security views have been either derived by the view derivation module or manually specified by the security administrators, the system is ready to answer user queries. These are composed in the query editor module. The module provides the users in a role with their corresponding view XSD to guide them during query composition. This is an important feature because, without such a schema, the query formulation process easily falls into repeated failures: the users continually query information they are not allowed to access, as they do not have a clear picture of the accessible information. In addition, equipped with the security view schema, it becomes possible for the query editor module to provide more assistance to users, such as auto-completion (offering hints to the users about the next possible tokens in the query being composed, according to the query grammar and the view schema) or visual query composition (allowing users to compose queries graphically through drag and drop operations).

4.2 Query Rewriting Module

Once a query on the view is passed to the query rewriting module, it is rewritten to a new query on the underlying document. Nevertheless, this query rewriting is non-trivial. For example, XPath, the core of XQuery and XSLT, is not closed under rewriting, i.e., for an XPath query on a

Regular XPath is a mild extension of XPath that supports the general Kleene closure instead of the limited recursion '//' (descendant-or-self axis). Therefore, user queries already written in XPath can be used as-is and need not be re-defined, a necessity if a richer language like XQuery or XSLT were used. Second, and more to the point, Regular

In contrast to other frameworks supporting the specification of XML views, this design choice makes the implemented system capable of handling recursively defined schemas (and thus views).

Apart from the closure of the query language, another important concern is the size of the rewritten query. Indeed, [6] has shown that the size of the rewritten query, if directly represented in Regular XPath, may be exponential in the size of the input query. The query rewriting module overcomes this challenge by employing an automaton characterization of the rewritten query, denoted MFA (mixed finite state automata), which is linear in the size of the input query. An MFA is a finite state automaton (NFA, characterizing the data-selection path of the query) annotated with alternating automata (AFA, capturing the predicates of the query).

Example 3: Figure 6 depicts the MFA characterizing the Regular XPath query:

Hospital/*[not(Record/Doctor/Name='David')]/
  Record[Doctor/Diagnosis='Haemophilia'
  and Patient/Sex='Female']/Patient/Name

Its NFA characterizes the data-selection path Hospital/*/Record/Patient/Name; it is annotated with an AFA (linked to states via dotted arrows) capturing the predicates of the query (the parts enclosed in brackets).

4.3 Query Evaluation and Optimization Modules

Input: (a) XML document and (b) query over it.

...efficiency of the framework. Minimally, only the underlying XML document and the rewritten queries are needed for evaluating and optimising the queries. However, some more information is essential for efficient evaluation and optimisation. This information could be either at the schema level (the XSD) or at the instance level (e.g., an index).
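The automaton idea can be seen in miniature below: the data-selection path of the query in Example 3 is kept as an NFA and run over root-to-node label paths, rather than being expanded into a large path expression. The encoding is invented for illustration, and the AFA (predicate) half of the MFA is omitted.

```python
# Miniature of an MFA's NFA part: transitions for the data-selection path
# Hospital/*/Record/Patient/Name, where "*" matches any element label.
NFA = {
    0: {"Hospital": {1}},
    1: {"*": {2}},
    2: {"Record": {3}},
    3: {"Patient": {4}},
    4: {"Name": {5}},
}
ACCEPT = {5}

def selects(path):
    """Run the NFA over a root-to-node path of element labels."""
    states = {0}
    for label in path:
        nxt = set()
        for s in states:
            for symbol, targets in NFA.get(s, {}).items():
                if symbol == label or symbol == "*":
                    nxt |= targets
        states = nxt
    return bool(states & ACCEPT)

print(selects(["Hospital", "Psychiatry", "Record", "Patient", "Name"]))  # True
print(selects(["Hospital", "Record", "Patient", "Name"]))                # False
```

The automaton stays linear in the size of the input query even when an equivalent Regular XPath expression would blow up, which is precisely the point of the MFA representation.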
Figure 7. The user interface of SMOQE: (a) a visual tool for specifying views; (b) the MFA for the query
To get the benefit of the instance-level optimization, an optional indexer module is needed.

Our system [6] implements a novel algorithm for processing Regular XPath queries represented by MFA. The algorithm, referred to as HyPE (Hybrid Pass Evaluation), takes an MFA as input and evaluates it on an XML tree. A unique feature of HyPE is that it needs only a single top-down depth-first traversal of the XML tree, during which HyPE both evaluates the predicates of the input query (equivalently, the AFA of the MFA) and identifies potential answer nodes (by evaluating the NFA of the MFA). The potential answer nodes are collected and stored in an auxiliary structure, referred to as Cans (candidate answers), which is often much smaller than the XML document tree. After the traversal of the document tree, HyPE only needs a single pass over Cans to select the nodes that are in the answer of the input query. It is important to note that HyPE processes one document tree node at a time, and does not require that the whole document tree be loaded into memory from disk. Therefore, HyPE can be used even in scenarios where the document is not available locally but is instead streamed over the internet while the query is being evaluated.

To our knowledge, previous systems require traversing the XML document at least twice to evaluate XPath queries. For example, to evaluate an XPath query on an XML document, Arb [12] requires a bottom-up pass of the document to evaluate

4.4 Result Viewer Module

Input: an XML document and the answer to a query.
Output: a text view with XML syntax highlighting, an interactive tree, and a schema coloring representing the document.

The last step of query answering under the security framework is to show the answer of the query (i.e., the answer of the original query posed by the user) to the authorized user in the result viewer. The result produced in the query evaluation module is a new XML document. The function of the result viewer module is to present this document in a friendly way to the user. Specifically, the viewer offers three modes: (a) the text mode, which presents the document in a text editor with XML syntax highlighting; (b) the tree mode, which displays the answer of the query as a tree and provides an interactive interface such that users may click on a node to browse its subtree; and (c) the schema mode, which shows the schema graph, highlighting the parts of the graph for which the query returned an answer.

5 SMOQE: a Reference Implementation

We have developed a reference implementation [7], called SMOQE (Secure MOdular Query Engine), for the security framework we proposed in this paper. It is implemented in Java and supports all the functionality we defined
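The single-traversal idea behind HyPE can be sketched as follows: one depth-first pass both records predicate outcomes and collects candidate answers into a Cans list, and a final pass over Cans (not over the document) keeps the candidates whose predicate turned out to hold. The query below is hard-coded and the tuple encoding of nodes is invented; the real HyPE evaluates an MFA, not this toy predicate.

```python
# Sketch of HyPE's one-traversal strategy for the hard-coded query
# "Record/Patient/Name, where the Record also has a Diagnosis child".
# Nodes are invented (tag, children, text) tuples.

def hype(doc):
    cans = []                   # Cans: (candidate value, owning Record id)
    records_with_diag = set()   # predicate outcomes, filled in the same pass

    def visit(node, path, record_id):
        tag, children, text = node
        if tag == "Record":
            record_id = id(node)
        if tag == "Diagnosis" and record_id is not None:
            records_with_diag.add(record_id)        # predicate evaluation
        if path[-3:] == ["Record", "Patient", "Name"]:
            cans.append((text, record_id))          # potential answer node
        for child in children:
            visit(child, path + [child[0]], record_id)

    visit(doc, [doc[0]], None)  # the single top-down depth-first traversal
    # A predicate may resolve only after its candidate was collected, which
    # is why candidates are set aside. One pass over the (small) Cans:
    return [value for value, rid in cans if rid in records_with_diag]

doc = ("Hospital", [
    ("Record", [("Patient", [("Name", [], "Mary")], None),
                ("Diagnosis", [], "flu")], None),
    ("Record", [("Patient", [("Name", [], "John")], None)], None)], None)
print(hype(doc))  # ['Mary']
```

Note that the first candidate ("Mary") is collected before its record's Diagnosis child is even visited; deferring the selection to the pass over Cans is what lets HyPE get by with a single traversal of the document.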
More execution modes. SMOQE supports two modes: a DOM mode and a StAX [11] (Streaming API for XML, a new standard API for XML pull parsing, to be included in Java 6) mode. In the DOM mode, the whole document tree is loaded into memory, while in the StAX mode only one sequential scan of the document on disk is needed for the evaluation; the document does not need to be loaded into memory. Compared to main-memory XPath engines such as Xalan [19] and Saxon [18], which need to access the document randomly, the StAX mode enables SMOQE to process larger documents efficiently.

Index Management. In SMOQE, an optional index-based optimization is supported. In addition, an index management module, as shown in Fig. 8, is implemented, which provides a graphical management interface for maintaining the index.

Figure 8. TAX index

Monitoring Support. SMOQE is the front-end that not only provides a friendly user interface to the SMOQE engine, but also opens a window into the system to let the user monitor the internal processing of the engine. It consists of a graphical querying interface, a semi-automatic view definition tool, and query, automaton, index and result visualization tools.

6 Conclusion

A generic, flexible access control framework for protecting XML data is not only an important research direction, but also a much needed functionality for numerous XML-related applications. In this paper, we have presented a view-based access control framework for XML data and its reference implementation. Taking advantage of a rich security specification language based on XSD and Regular XPath, the framework is able to enforce fine-grained access policies according to the structure and values of the protected XML data. By automatically deriving view specifications, the work of security administrators is greatly reduced. By providing proper view schemas to different user roles, the protected XML data is no longer a black box, which makes it possible to guide users when they pose queries on the system. The automaton-based query rewriting, evaluation and optimization modules provide efficient query processing on the derived virtual security views, without any evident degradation in either performance or functionality. Our framework is fully modularized, which leaves space to plug in other view derivation, query rewriting and evaluation techniques.

Several extensions are targeted for future work. First, we are investigating allowing more flexible security specifications based on multiple XML data sources, as found in data integration. Second, we plan to extend our query rewriting and evaluation modules to handle other XML query languages such as XQuery and XSLT.

References

[1] E. Bertino and E. Ferrari. Secure and selective dissemination of XML documents. TISSEC, 5(3):290–331, 2002.
[2] S. Cho, S. Amer-Yahia, L. Lakshmanan, and D. Srivastava. Optimizing the secure evaluation of twig queries. In VLDB, 2002.
[3] E. Damiani, S. di Vimercati, S. Paraboschi, and P. Samarati. Securing XML documents. In EDBT, 2000.
[4] E. Damiani, S. di Vimercati, S. Paraboschi, and P. Samarati. A fine-grained access control system for XML documents. TISSEC, 5(2):169–202, 2002.
[5] W. Fan, C. Y. Chan, and M. Garofalakis. Secure XML querying with security views. In SIGMOD, 2004.
[6] W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Rewriting regular XPath queries on XML views. http://www.lfcs.inf.ed.ac.uk/research/database/rewriting.pdf.
[7] W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. SMOQE: A system for providing secure access to XML. In VLDB, 2006.
[8] I. Fundulaki and M. Marx. Specifying access control policies for XML documents with XPath. In Proc. of the 9th ACM Symposium on Access Control Models and Technologies (SACMAT), 2004.
[9] S. Hada and M. Kudo. XML access control language: Provisional authorization for XML documents. http://www.trl.ibm.com/projects/xml/xacl/xacl-spec.html.
[10] IBM. DB2 XML Extender. http://www-3.ibm.com/software/data/db2/extended/xmlext/.
[11] JSR 173. Streaming API for XML. http://www.jcp.org/en/jsr/detail?id=173.
[12] C. Koch. Efficient processing of expressive node-selecting queries on XML data in secondary storage: A tree automata-based approach. In VLDB, 2003.
[13] Microsoft. XML support in Microsoft SQL Server 2005, December 2005. http://msdn.microsoft.com/library/en-us/dnsql90/html/sql2k5xml.asp/.
[14] G. Miklau and D. Suciu. Controlling access to published data using cryptography. In VLDB, 2003.
[15] M. Murata, A. Tozawa, M. Kudo, and S. Hada. XML access control using static analysis. In CCS, 2003.
[16] OASIS. eXtensible Access Control Markup Language (XACML). http://www.oasis-open.org/committees/xacml.
[17] Oracle. Using XML in Oracle internet applications. http://technet.oracle.com/tech/xml/info/htdocs/otnwp/about_xml.htm.
[18] SAXON. The XSLT and XQuery processor. http://saxon.sourceforge.net.
[19] Xalan. http://xalan.apache.org.
Mike Surridge, Steve J. Taylor, E. Rowland Watkins, Thomas Leonard
IT Innovation, Southampton, UK
{ms,sjt,erw,tal}@it-innovation.soton.ac.uk

Terry Payne, Mariusz Jacyno, Ronald Ashri
University of Southampton, UK
{trp,mj04r,ra}@ecs.soton.ac.uk
Abstract
As the technical infrastructure to support Grid environments matures, attention must be focused on integrating
such technical infrastructure with technologies to support more dynamic access to services, and ensuring that
such access is appropriately monitored and secured. Current approaches for securing organisations through
conventional firewalls are insufficient; access is either enabled or disabled for a given port, whereas access to
Grid services may be conditional on dynamic factors. This paper reports on the Semantic Firewall (SFW) project,
which investigated a policy-based security mechanism responsible for mediating interactions with protected
services given a set of dynamic access policies, which define the conditions under which access may be granted to
services. The aims of the project are presented, and results and contributions described.
693
Proceedings of the UK e-Science All Hands Meeting 2006, © NeSC 2006, ISBN 0-9553988-0-0
694
Proceedings of the UK e-Science All Hands Meeting 2006, © NeSC 2006, ISBN 0-9553988-0-0
695
Proceedings of the UK e-Science All Hands Meeting 2006, © NeSC 2006, ISBN 0-9553988-0-0
Figure 3. Dynamic Policy Component
Figure 4. Process roles and contexts: delegation and subordination

of subject criteria are defined for each process role that relates to that interaction, normally based on the attributes of the user whose action caused the new process to start. Service actions can then add or remove subject criteria for this or other process roles (a procedure known as delegation in the SFW), or initiate new related interactions (a procedure known as subordination), as shown in Figure 3.

There are often relationships between processes (contexts) and the user rights associated with them. For example, when the project manager in Figure 1 tries to open an account, they are acting in the role of "account applicant", a right delegated to them by the "service provider" when the service was originally deployed. If the service grants the request, the account becomes a subordinate context in which the successful applicant takes the role of "account manager", and can in turn delegate charging rights (the "account user" process role) to their staff. The full hierarchy is shown in Figure 4.

The notions of process contexts and process roles, and the relationships between them, provide a way to present a high-level interface for generating security policies based on detailed state models of applications. They also provide a way to present a semantically tractable description of the overall policy, using semantic workflow languages such as OWL-S [3] to describe the externally visible processes allowed by the application state models, and RDF or OWL to describe the process roles of interacting actors and the delegation or subordination relationships between them.

3.3 Security and the Semantic Web

The final step in the Semantic Firewall project is to develop the semantic representations of these dynamic security policies, process roles, etc. It became clear in the final year of the project that the concepts needed arise in both the Semantic Web and the Agent research communities, but

Initial investigation suggested that Multi-Agent System mechanisms for supporting organisations, such as the Electronic Institutions [11] developed at IIIA-CSIC, Spain, could provide the best means of producing tractable descriptions of protected services with interaction policies. The parallels between process roles in the SFW and agent actors have been pursued through a collaboration, resulting in an analysis of the synergies and an initial integration between Electronic Institutions and the Semantic Firewall [5].

A deeper analysis showed that, while the parallels are striking, the notion of a process context was far more self-contained in the Electronic Institution interaction models. After some effort to apply these principles to capture process descriptions, it became clear that significant developments would be needed to handle the idea that process contexts can be related. It was also clear that agent models may be difficult to integrate into Grid client applications, which are normally based on workflow models (for example, the Taverna system [18] from myGrid, http://www.mygrid.org.uk/). Given this, the focus moved towards semantic workflow models, and after some investigation of other options such as WSDL-S (http://www.w3.org/Submission/WSDL-S/) and WSMO (http://www.w3.org/Submission/WSMO/), the simpler but more mature OWL-S specification [3] was selected.

The main challenge was to introduce the process contexts and roles into OWL-S, and to semantically encode the permissions associated with the process roles in a given process context. It was immediately clear that OWL-S can describe a generic "process", including the notion of a "process state" that affects which actions can be taken. However, the process model as defined in the W3C submission fails to support the notion that processes can be attributed to specific contexts that can be dynamically created, evolve, and be destroyed, and about which queries can be posed that should be resolvable using the generic description. As the focus of the Semantic Firewall project is to define and support dynamic security, the more general problem of such dynamism (including service discovery, provisioning, com-
Proceedings of the UK e-Science All Hands Meeting 2006, © NeSC 2006, ISBN 0-9553988-0-0
Where appropriate, the process-role pre-conditions were accompanied by a "qualification" action, through which the actor could be granted that process role. This was encoded as a goal that would have to be achieved by finding and executing another workflow in the same context, whose outcome would be to grant the required role. Note that this second workflow also had process role pre-conditions, and typically could not be executed by the original actor. However, the pre-conditions could then be used to look for another actor (e.g. a supervisor) who could execute the required workflow.

The end result was an enrichment of the basic OWL-S process descriptions, in which the security requirements to execute each process are described, along with mechanisms that could be used (if authorised) to acquire the necessary access rights. These mechanisms typically point to other users who have control of parent contexts (for example, a prospective service user needs to acquire access permission from the manager of the account to which the service bills).

4 Validation

To validate the Semantic Firewall approach, we needed a Grid (or Web Services) application in which policies can change dynamically, e.g. as a result of a negotiation procedure. As suggested by Figure 1, such applications are typically found in inter-enterprise business Grids, such as those originally created by the GRIA⁹ [20] and GEMSS¹⁰ [8] projects. These Grids are forced to use at least some dynamic policy elements to handle changing business relationships, and in some cases legal constraints over the movement of data [14].

⁹ GRIA is now an open source Grid middleware, currently at v4.3, obtainable via http://www.gria.org
¹⁰ http://www.gemss.de

The validation study was conducted using an alpha release of the forthcoming GRIA 5. This uses dynamic token services developed in the NextGRID project [1, 17] and exhibits a wider range of dynamic behaviour than the original GEMSS or GRIA systems. GRIA 5 will also use the dynamic policy decision component from the Semantic Firewall itself, which meant it was easy to integrate with a Semantic Firewall deployment for evaluation tests.

Because the GRIA 5 system is so dynamic, it was possible to test the Semantic Firewall principles even with a very simple application. The scenario involves a data centre curator transferring a large data file between two sites, e.g. as part of a resilient back-up strategy. The transfer is conducted using a GRIA 5 data service at each site, but these services will also bill the user for their actions, using account services located at each site. The relevant accounts are managed by other actors, e.g. the manager of the data centre where the curator works, and the manager of the network storage archive service providing the off-site backup facility. The services and actors are shown in Figure 5.

The focus of the validation exercise was on whether the SFW approach could accommodate semantically tractable service descriptions, and whether these could be used to determine if and how the curator could achieve their data transfer goal. Implementation of the application scenario was trivial using the GRIA 5 system, and registry services were added to store information about the available services (including their semantic descriptions), and information about users and their roles. Note that the user information available to each actor was incomplete: the registry
revealed user roles only to those actors for whom the user was willing to exercise those roles, i.e. the registry filtered according to the trust relationships between users.

To validate the use of semantic descriptions, the MindSwap OWL-S API and execution engine¹¹ were used to execute the required task, based on OWL-S workflows discovered from the service and workflow registry. Reasoning over the pre-conditions allowed classification of discovered services and workflows into three classes: those that could be executed by the user, those that could become executable by the user with help from other actors, and those that could never be executed. In the second case, the reasoning engine was able to work out how to negotiate the required access rights with the SFW infrastructure, with help from other actors (e.g. a supervisor).

¹¹ http://www.mindswap.org/2004/owl-s/api/

In the final experiments, software agents (services) were created to represent each actor, which would take in a workflow (or goal) from other users, and (where trustworthy) enact the workflow for that user. With these services in place, it was possible to automate the negotiation process, thus exploiting the parallels and synergies between Semantic Web, agents, and dynamic security.

5 Discussion

The investigation of the Semantic Firewall to date has demonstrated that security policies can be defined for semantically described process models, with little additional run-time (semantic reasoning) overhead [15]. The research has highlighted a number of challenges for supporting dynamic, context-specific service access. The use of policies introduces a degree of flexibility for both service providers and network administrators. By representing these policies as state models (provided by the service providers themselves), network administrators can provide their own extensions or modifications as necessary. To ensure that conflicts or violations do not occur between service and system policies, and to provide pragmatic tools to manage policy definitions, further investigation is still required.

An important consideration is that of managing policy violations, and mediating the type of response that can assist clients who are attempting to make legitimate service requests, whilst avoiding the exposure of policy details that would enable malicious users to bypass the security mechanisms. By exposing the defined policies, clients can currently reason about candidate workflows to determine whether they will fail, and thus avoid committing to workflows that would subsequently need revising. In addition, the current policy definitions allow the inference of other, necessary stages (and consequently actors) that would assist in establishing the appropriate access rights.

The decision to use a semantic representation for the SFW presented a challenge given that, to associate the policies defined by the Semantic Firewall with workflows, the service description language should be extended. The OWL-S ontologies provided the ideal platform for such extensions, and in [15] we present the set of OWL-S extensions proposed. In addition, the execution environment had to be extended to support the notion of process abstractions (through OWL-S simple processes) that could subsequently be provisioned at run time by consulting local service and user registries (typically within the same Virtual Organisation as the service client). For example, a client might realise that they need to acquire credentials from their local account manager before accessing those services with which such prior agreements had been made.

The research is ongoing, and future work will address extending the OWL-S extensions further, to support other notions of service. Akkermans et al. [2] introduced such notions as service bundling and the sharability and consumability of resources within the OBELIX project. Current semantic web service descriptions fail to make the distinction, for example, between resources that can be copied and shared by many users, and those that are indivisible (e.g. a security credential that can only be used by one user at a time). Likewise, there is no support for establishing relationships between service elements that support each other, but are not necessarily part of a service workflow (such as representing a billing service that supports another, primary service). Future investigation will consider how such factors augment the definition of Grid services and further support policy definitions.

6 Conclusions

In this paper, we have summarised research conducted as part of the EPSRC e-Science funded Semantic Firewall project. The problem of providing secure access to services was analysed within a Grid scenario, which identified several challenges not normally apparent when considering simple workflows. A semantically annotated policy model has been developed that supports the definition of permissible actions/workflows for different actors assuming different roles for given processes. An initial prototype implementation has been developed, which has been used to validate the model on real-world Grid case studies, and several areas for future work have been identified.

7 Acknowledgment

This research is funded by the Engineering and Physical Sciences Research Council (EPSRC) Semantic Firewall project (ref. GR/S45744/01). We would like to acknowledge contributions made by Grit Denker (SRI) for valuable
insights into the requirements for semantically annotated policies and interaction workflows. Thanks are also due to Jeff Bradshaw and his research group for valuable discussions on the use and possible applicability of the KAoS framework to Grid Services, and to Carles Sierra and his group on understanding how notions of Electronic Institutions could be exploited in defining and deploying a Semantic Firewall framework.

References

[1] M. Ahsant, M. Surridge, T. Leonard, A. Krishna, and O. Mulmo. Dynamic Trust Federation in Grids. In 4th International Conference on Trust Management (iTrust 2006), Pisa, Italy, 2006.
[2] H. Akkermans, Z. Baida, J. Gordijn, N. Pena, A. Altuna, and I. Laresgoiti. Value Webs: Using Ontologies to Bundle Real-World Services. IEEE Intelligent Systems, 19(4):57–66, 2004.
[3] A. Ankolekar, M. Burstein, J. Hobbs, O. Lassila, D. McDermott, D. Martin, S. McIlraith, S. Narayanan, M. Paolucci, T. Payne, and K. Sycara. DAML-S: Web Service Description for the Semantic Web. In First International Semantic Web Conference (ISWC) Proceedings, pages 348–363, 2002.
[4] R. Ashri, G. Denker, D. Marvin, M. Surridge, and T. R. Payne. Semantic Web Service Interaction Protocols: An Ontological Approach. In S. A. McIlraith, D. Plexousakis, and F. van Harmelen, editors, Int. Semantic Web Conference, volume 3298 of LNCS, pages 304–319. Springer, 2004.
[5] R. Ashri, T. Payne, M. Luck, M. Surridge, C. Sierra, J. A. R. Aguilar, and P. Noriega. Using Electronic Institutions to secure Grid environments. In 10th International Workshop on Cooperative Information Agents (Submitted), 2006.
[6] R. Ashri, T. Payne, D. Marvin, M. Surridge, and S. Taylor. Towards a Semantic Web Security Infrastructure. In Proceedings of Semantic Web Services 2004 Spring Symposium Series, 2004.
[7] T. Berners-Lee, J. Hendler, and O. Lassila. The semantic web. Scientific American, 284(5):35–43, 2001.
[8] G. Berti, S. Benkner, J. W. Fenner, J. Fingberg, G. Lonsdale, S. E. Middleton, and M. Surridge. Medical simulation services via the grid. In S. Nørager, J.-C. Healy, and Y. Paindaveine, editors, Proceedings of 1st European HealthGRID Conference, pages 248–259, Lyon, France, Jan. 16–17 2003. EU DG Information Society.
[9] K. Czajkowski, D. F. Ferguson, I. Foster, J. Frey, S. Graham, I. Sedukhin, D. Snelling, S. Tuecke, and W. Vambenepe. The WS-Resource Framework. Technical report, The Globus Alliance, 2004.
[10] G. Denker, L. Kagal, T. Finin, M. Paolucci, and K. Sycara. Security for DAML Services: Annotation and Matchmaking. In D. Fensel, K. Sycara, and J. Mylopoulos, editors, Proceedings of the 2nd International Semantic Web Conference, volume 2870 of LNCS, pages 335–350. Springer, 2003.
[11] M. Esteva. Electronic Institutions: from specification to development. PhD thesis, Technical University of Catalonia, 2003.
[12] I. Foster and C. Kesselman. The Grid 2: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, 2003.
[13] I. Foster, C. Kesselman, J. M. Nick, and S. Tuecke. Grid Services for Distributed System Integration. IEEE Computer, 35(6):37–46, June 2002.
[14] J. Herveg, F. Crazzolara, S. Middleton, D. Marvin, and Y. Poullet. GEMSS: Privacy and security for a Medical Grid. In Proceedings of HealthGRID 2004, Clermont-Ferrand, France, 2004.
[15] M. Jacyno, E. R. Watkins, S. Taylor, T. Payne, and M. Surridge. Mediating Semantic Web Service Access using the Semantic Firewall. In The 5th International Semantic Web Conference (Submitted), 2006.
[16] L. Kagal, T. Finin, and A. Joshi. A Policy Based Approach to Security for the Semantic Web. In D. Fensel, K. Sycara, and J. Mylopoulos, editors, 2nd Int. Semantic Web Conference, volume 2870 of LNCS, pages 402–418. Springer, 2003.
[17] T. Leonard, M. McArdle, M. Surridge, and M. Ahsant. Design and implementation of dynamic security components. NextGRID Project Output P5.4.5, 2006.
[18] T. Oinn, M. Addis, J. Ferris, D. Marvin, M. Senger, M. Greenwood, T. Carver, K. Glover, M. Pocock, A. Wipat, and P. Li. Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics, 20(17):3045–3054, 2004.
[19] M. Surridge and C. Upstill. Lessons for Peer-to-Peer Systems. In Proceedings of the 3rd IEEE Conference on P2P Computing, 2003.
[20] S. Taylor, M. Surridge, and D. Marvin. Grid Resources for Industrial Applications. In 2004 IEEE Int. Conf. on Web Services (ICWS'2004), 2004.
[21] S. Tuecke, K. Czajkowski, I. Foster, J. Frey, S. Graham, C. Kesselman, T. Maguire, T. Sandholm, D. Snelling, and P. Vanderbilt. Open Grid Services Infrastructure. Technical report, Global Grid Forum, 2003.
[22] A. Uszok, J. Bradshaw, R. Jeffers, N. Suri, P. J. Hayes, M. R. Breedy, L. Bunch, M. Johnson, S. Kulkarni, and J. Lott. KAoS Policy and Domain Services: Toward a Description-Logic Approach to Policy Representation, Deconfliction, and Enforcement. In 4th IEEE Int. Workshop on Policies for Distributed Systems and Networks, pages 93–98. IEEE Computer Society, 2003.
[23] M. J. Wooldridge. Introduction to Multiagent Systems. John Wiley & Sons, Inc., New York, NY, USA, 2001.
compare them with each other.

2 XACML and access control

XACML [13] is the OASIS standard for access control policies. It provides a language for describing access control policies, and a language for interrogating these policies, to ask the policy if a given action should be allowed.

A simplified description of the behaviour of an XACML policy is as follows. An XACML policy has an associated Policy Decision Point (PDP) and a Policy Enforcement Point (PEP) (see Figure 1). Any access requests by a user are intercepted by the PEP. Requests are formed in the XACML request language by the PEP and sent to the PDP. The PDP evaluates the request with respect to the access control policy. This response is then enforced by the PEP.

Figure 1: XACML overview.

When a request is made, the PDP will return exactly one of:

• Permit: if the subject is permitted to perform the action on the resource,

• Deny: if the subject is not permitted to perform the action on the resource, or

• NotApplicable: if the request cannot be answered by the service.

The full language also contains the response Indeterminate. This is triggered by an error in evaluating the conditional part of the rule. Since we assume rules to be environment independent, evaluating the rules will not require evaluation of any conditional statements. We therefore do not use Indeterminate.

A full XACML request includes a set of subjects (e.g. a user, the machine the user is on, and the applet the user is running could all be subjects with varying access rights) to perform a set of actions (e.g. read, write, copy) on a set of resources (e.g. a file or a disk) within an environment (e.g. during work hours or from a secure machine). It may also contain a condition on the environment, to be evaluated when the request is made. We will assume that requests (and later, rules) are environment independent, and omit the condition and environment components.

We will assume that a PDP contains a set of Policies, each of which contains a set of Rules¹. Rules in XACML contain a target (which further contains sets of resources, subjects and actions) and an effect (permit or deny). If the target of a rule matches a request, the rule will return its effect. If the target does not match the rule, NotApplicable is returned.

As well as a set of rules, a policy contains a rule combining algorithm. Like rules, policies also contain a target. All rules within a policy are evaluated by the PDP. The results are then combined using the rule combining algorithm, and a single effect is returned. The PDP evaluates each policy, and combines the results using a policy combining algorithm. The final effect is then returned to the PEP to be enforced. If Permit is returned then the PEP will permit access; any other response will cause access to be denied.

¹ Strictly, it contains a set of policy sets, each of which contains a set of policies. Policies then contain a set of rules. Extending our model to include this additional complexity would be straightforward.

3 VDM

VDM is a model-oriented formal method incorporating a modelling or specification language (VDM-SL) with formal semantics [1], a proof theory [2] and refinement rules [11].

A VDM-SL model is based on a set of data type definitions, which may include invariants (arbitrary predicates characterising properties shared by all members of the type). Functionality is described in terms of functions over the types, or operations which may have side effects on distinguished state variables. Functions and operations may be restricted by preconditions, and may be defined in an explicit algorithmic style or implicitly in terms of postconditions. The models presented in this paper use only explicitly-defined functions. We remain within a fully executable subset of the modelling language, allowing our models of XACML policies to be analysed using an interpreter.

VDM has strong tool support. The CSK VDMTools [3] include syntax and type checking, an interpreter for executable models, test scripting and coverage analysis facilities, program code generation and pretty-printing. These have the potential to form a platform for tools specifically tailored to the analysis of access control policies in an XACML framework.

An access control policy is essentially a complex data type, and the XACML standard is a description of the evaluation functions with respect to these data types. Thus VDM, with its focus on datatypes and functionality, is a suitable language to describe access control policies.

This section describes in detail the data types and functionality of the Policy Decision Point. The description is presented in VDM. We impose the simplifications mentioned in the previous section. In particular, we limit targets to sets of subjects, actions and resources, and exclude consideration of the environment. This means we only consider the effects Permit, Deny and NotApplicable.

Elements of the type PDP are pairs. The first field has label policies and contains a set of the elements of the type Policy. The second field contains one value of the type CombAlg, defined immediately below.

    PDP :: policies      : Policy-set
           policyCombAlg : CombAlg

    CombAlg = DenyOverrides | PermitOverrides

DenyOverrides and PermitOverrides are enumerated values. They will act as pointers to the appropriate algorithms, which are defined later. Other possible combining algorithms are given in [13] but for simplicity we will model only these two here.

A policy contains a target (the sets of subjects, resources and actions to which it applies), a set of rules, and a rule combining algorithm.

    Policy :: target      : Target
              rules       : Rule-set
              ruleCombAlg : CombAlg

    Target :: subjects  : Subject-set
              resources : Resource-set
              actions   : Action-set

Each rule has a target and an effect. If a request corresponds to the rule target then the rule effect is returned. The brackets [..] denote that the target component may be empty. In this case, the default value is the target of the enclosing policy.

    Rule :: target : [Target]
            effect : Effect

The effect of the rule can be Permit, Deny, or NotApplicable. These are modelled as enumerated values.

    Effect = Permit | Deny | NotApplicable

Requests are simply targets:

    Request :: target : Target

The functionality of a PDP is captured in the following set of functions. We begin with the function targetmatch, which takes two targets and returns true if the targets have some subject, resource and action in common.

    targetmatch : Target × Target → Bool
    targetmatch(target1, target2) ≜
        (target1.subjects ∩ target2.subjects) ≠ { } ∧
        (target1.resources ∩ target2.resources) ≠ { } ∧
        (target1.actions ∩ target2.actions) ≠ { }

A request is evaluated against a rule using evaluateRule. If the rule target is Null the targets are assumed to match, since the parent policy target must match. If the targets match, the effect of the rule is returned, otherwise NotApplicable is returned.

    evaluateRule : Request × Rule → Effect
    evaluateRule(req, rule) ≜
        if rule.target = Null
        then rule.effect
        else if targetmatch(req.target, rule.target)
             then rule.effect
             else NotApplicable
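The record types and the targetmatch/evaluateRule functions above can be mirrored outside VDMTools. The following is a minimal Python sketch (the names Target, Rule, target_match and evaluate_rule are ours, not part of the paper's model); effects are plain strings here rather than an enumerated type.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Target:
    # Mirrors the VDM Target record: three sets.
    subjects: frozenset
    resources: frozenset
    actions: frozenset

@dataclass(frozen=True)
class Rule:
    # None plays the role of the VDM Null target.
    target: Optional[Target]
    effect: str  # "Permit" | "Deny" | "NotApplicable"

def target_match(t1: Target, t2: Target) -> bool:
    """True iff the targets share some subject, resource and action."""
    return (bool(t1.subjects & t2.subjects)
            and bool(t1.resources & t2.resources)
            and bool(t1.actions & t2.actions))

def evaluate_rule(request: Target, rule: Rule) -> str:
    """A Null (None) rule target matches by default, as in evaluateRule."""
    if rule.target is None or target_match(request, rule.target):
        return rule.effect
    return "NotApplicable"
```

As in the VDM model, a request is just a target, so `evaluate_rule` takes the request's target triple directly.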
A policy is invoked if its target matches the request. It then evaluates all its rules with respect to a request, and combines the returned effects using its rule combining algorithm. In the following, DenyOverrides is abbreviated to D-Over, and PermitOverrides to P-Over.

    evalPol : Request × Policy → Effect
    evalPol(req, pol) ≜
        if targetmatch(req.target, pol.target)
        then if (pol.ruleCombAlg = D-Over)
             then evalRules-DO(req, pol.rules)
             else if (pol.ruleCombAlg = P-Over)
                  then evalRules-PO(req, pol.rules)
                  else NotApplicable
        else NotApplicable

The deny overrides algorithm is implemented as

    evalRules-DO : Request × Rule-set → Effect
    evalRules-DO(req, rs) ≜
        if ∃r ∈ rs · evaluateRule(req, r) = Deny
        then Deny
        else if ∃r ∈ rs · evaluateRule(req, r) = Permit
             then Permit
             else NotApplicable

If any rule in the policy evaluates to Deny, the policy will return Deny. Otherwise, if any rule in the policy evaluates to Permit, the policy will return Permit. If no rules evaluate to either Permit or Deny, the policy will return NotApplicable.

The permit overrides rule combining algorithm (omitted) is identical in structure, but a single Permit overrides any number of Denys.

The evaluation of the PDP and its policy combining algorithms has an equivalent structure to the policy evaluation functions already presented.

    evaluatePDP : Request × PDP → Effect
    evaluatePDP(req, pdp) ≜
        if (pdp.policyCombAlg = D-Over)
        then evaluatePDP-DO(req, pdp)
        else if (pdp.policyCombAlg = P-Over)
             then evaluatePDP-PO(req, pdp)
             else NotApplicable

    evaluatePDP-DO : Request × PDP → Effect
    evaluatePDP-DO(req, pdp) ≜
        if ∃p ∈ pdp.policies · evalPol(req, p) = Deny
        then Deny
        else if ∃p ∈ pdp.policies · evalPol(req, p) = Permit
             then Permit
             else NotApplicable

The above functions and data types are generic. Any XACML policy in VDM will use these functions. In the next section we show how to instantiate this generic framework with a particular policy.

4 An example policy

In this section we present the initial requirements on an example policy and instantiate our abstract framework with a policy aimed at implementing these requirements. In Section 5 we show how we can use the testing capabilities of VDMTools [3] to find errors in these policies.

This example is taken from [5], and describes the access control requirements of a university database which contains student grades. There are two types of resources (internal and external grades), three types of actions (assign, view and receive), and a number of subjects, who may hold the roles Faculty or Student. We therefore define

    Action = Assign | View | Receive

    Resource = Int | Ext

Subjects are enumerated values,

    Subject = Anne | Bob | Charlie | Dave

and we populate the Student and Faculty sets as

    Student : Subject-set = {Anne, Bob}
    Faculty : Subject-set = {Bob, Charlie}

so Bob is a member of both sets. In practice, an access control request is evaluated on the basis of certain attributes of the subject. What we are therefore saying here is that the system, if asked, can produce evidence of Anne and Bob being students, and of Bob and Charlie being faculty members.
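The two combining algorithms above have the same shape at rule level and policy level: under deny overrides a single Deny wins, and under permit overrides a single Permit wins. A minimal Python sketch of the two combinators over a collection of effects (function names are ours, not the paper's):

```python
# Hedged sketch of the combinators used by evalRules-DO / evaluatePDP-DO
# and their permit-overrides duals; effects are plain strings.

def deny_overrides(effects):
    # Any Deny wins; otherwise any Permit wins; otherwise NotApplicable.
    if "Deny" in effects:
        return "Deny"
    if "Permit" in effects:
        return "Permit"
    return "NotApplicable"

def permit_overrides(effects):
    # Dual: a single Permit overrides any number of Denys.
    if "Permit" in effects:
        return "Permit"
    if "Deny" in effects:
        return "Deny"
    return "NotApplicable"
```

Because the same combinator is reused at both levels, a full PDP evaluator reduces to two nested applications: combine the per-rule effects within each policy, then combine the per-policy effects across the PDP.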
Informally, we can argue that populating the student and faculty sets so sparsely is adequate for testing purposes. All rules we go on to define apply to roles, rather than to individuals, so we only need one representative subject holding each possible role combination.

The properties to be upheld by the policy are

1. No students can assign external grades,

2. All faculty members can assign both internal and external grades, and

3. No combinations of roles exist such that a user with those roles can both receive and assign external grades.

Our initial policy (following the example in [5]) is

    Requests for students to receive external grades, and for faculty to assign and view internal and external grades, will succeed.

Implementing this policy naïvely leads to the following two rules, which together will form our initial (flawed) policy. Students may receive external grades,²

    StudentRule : Rule =
        ((Student, Ext, Receive), Permit)

and faculty members may assign and view both internal and external grades.

    FacultyRule : Rule =
        ((Faculty, {Int, Ext}, {Assign, View}), Permit)

² The correct VDM description is
    StudentRule : Rule =
        mk_Rule(mk_Target(Student, {Ext}, {Receive}), Permit)
For ease of reading, we omit the mk_ constructs and the brackets around singleton sets.

The policy combines these two rules using the PermitOverrides algorithm. The target of the policy is all requests from students and faculty members.

    PolicyStuFac : Policy =
        ((Student ∪ Faculty, {Int, Ext},
          {Assign, View, Receive}),
         {StudentRule, FacultyRule}, P-Over)

In fact, this policy would have the same behaviour if the two rules were combined using the DenyOverrides algorithm. This is because both rules have the effect Permit.

The PDP is a collection of policies; in this case only one.

    PDPone : PDP = (PolicyStuFac, D-Over)

We use the deny overrides algorithm here, but in this case it has the same behaviour as permit overrides, because there is only one policy. In the next section we examine this and other specifications using VDMTools.

5 Testing the specification

The VDM toolset VDMTools described in [7] provides considerable support for testing VDM specifications, including syntax and type checking, testing and debugging. Individual tests can be run at the command line in the interpreter. The test arguments can also be read from pre-prepared files, and scripts are available to allow large batches of tests to be performed.

A systematic approach to testing requires that we have some form of oracle against which to judge the (in)correctness of test outcomes. This may be a manually generated list of outcomes (where we wish to assess correctness against expectations) or an executable specification (if we wish to assess correctness against the specification). In this section we use a list of expected results as our oracle. Section 5.1 uses one version of a VDM-SL specification as an oracle against another version.

Tests on PDPone are made by forming requests and evaluating the PDP with respect to these requests, using the function evaluatePDP from Section 3. Below we show four example requests and the results from PDPone.

    Request                  Result from PDPone
    (Anne, Ext, Assign)      NotApplicable
    (Bob, Ext, Assign)       Permit
    (Charlie, Ext, Assign)   Permit
    (Dave, Ext, Assign)      NotApplicable
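For readers who want to experiment outside VDMTools, the example can also be run end-to-end in a short Python sketch. This is a hypothetical mirror of the VDM model, not the paper's code; all names are ours, targets are (subjects, resources, actions) triples of sets, and it reproduces the four results in the table above.

```python
# PDPone from the paper, rebuilt in Python (an illustrative sketch).

Student = frozenset({"Anne", "Bob"})
Faculty = frozenset({"Bob", "Charlie"})

def match(t1, t2):
    # True iff the two (subjects, resources, actions) triples overlap
    # in every component, as in targetmatch.
    return all(a & b for a, b in zip(t1, t2))

def eval_rule(req, rule):
    target, effect = rule
    return effect if target is None or match(req, target) else "NotApplicable"

def eval_policy(req, policy):
    target, rules, comb = policy
    if not match(req, target):
        return "NotApplicable"
    effects = [eval_rule(req, r) for r in rules]
    order = ["Deny", "Permit"] if comb == "D-Over" else ["Permit", "Deny"]
    for e in order:          # first effect in priority order wins
        if e in effects:
            return e
    return "NotApplicable"

def eval_pdp(req, pdp):
    policies, comb = pdp
    effects = [eval_policy(req, p) for p in policies]
    order = ["Deny", "Permit"] if comb == "D-Over" else ["Permit", "Deny"]
    for e in order:
        if e in effects:
            return e
    return "NotApplicable"

StudentRule = ((Student, frozenset({"Ext"}), frozenset({"Receive"})), "Permit")
FacultyRule = ((Faculty, frozenset({"Int", "Ext"}),
                frozenset({"Assign", "View"})), "Permit")
PolicyStuFac = ((Student | Faculty, frozenset({"Int", "Ext"}),
                 frozenset({"Assign", "View", "Receive"})),
                [StudentRule, FacultyRule], "P-Over")
PDPone = ([PolicyStuFac], "D-Over")

for who in ["Anne", "Bob", "Charlie", "Dave"]:
    req = (frozenset({who}), frozenset({"Ext"}), frozenset({"Assign"}))
    print(who, eval_pdp(req, PDPone))
# Prints:
#   Anne NotApplicable
#   Bob Permit
#   Charlie Permit
#   Dave NotApplicable
```

The sketch exposes the same flaw the tests reveal: Bob, who holds both roles, is permitted to assign external grades via FacultyRule.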
In the first test, PDPone returns NotApplicable when user Anne (a student) asks to assign an external grade. This is because there is no rule which specifically covers this situation. The PEP denies any request which is not explicitly permitted, and so access is denied.

The second test points out an error, because Bob (who is both a faculty member and a student) is allowed to assign external grades, in violation of property one, which states that no student may assign external grades. This policy has been written with an implicit assumption that the sets Student and Faculty are disjoint. Constraining these sets to be disjoint when we populate them allows us to reflect this assumption. In practice this constraint would have to be enforced at the point where roles are assigned to individuals rather than within the PDP.

The third test is permitted, as expected, since Charlie is a member of faculty, and the fourth test returns NotApplicable, because Dave is not a student or a faculty member.

Multiple requests

is permitted. As pointed out in [5], this breaks the first property, because Anne (a student) is piggybacking an illegal request (assigning an external grade) on a legal one (receiving an external grade). In future, therefore, we make the assumption that the PEP only submits requests that contain singleton sets. Given this assumption, we can limit the test cases we need to consider to those containing only single subjects, actions and resources.

5.1 Comparing specifications of PDPs

We now suppose (following [5]) that teaching assistants (TAs) are to be employed to help with the internal marking. They are not, however, allowed to help with external marking. A careless implementation, which merely included the names of the TAs as faculty members, would overlook the fact that students are often employed as TAs.

A more robust implementation, which makes TAs a separate role and develops rules specific to them, is given below. Note that in TArule2, TAs are explicitly forbidden to assign or view external grades; their role is restricted to dealing with the internal grades.

    TArule1 : Rule =
        ((TA, {Int}, {Assign, View}), Permit)

    TArule2 : Rule =
        ((TA, {Ext}, {Assign, View}), Deny)

The rules are combined into a (TA-specific) policy

    PolicyTA : Policy =
        ((TA, {Int, Ext}, {Assign, View, Receive}),
         {TArule1, TArule2}, PermitOverrides)

This new PDP can of course be tested independently, but it can also be compared with the previous one. We do this with respect to a test suite. Because the policies are small, this test suite can be comprehensive. We now populate the roles as

    Student : Subject-set = {Anne, Bob}
    Faculty : Subject-set = {Charlie}
    TA : Subject-set = {Bob, Dave}

taking care that there is a person holding each possible combination of roles that we allow. Every request that each person can make is considered against each PDP. This is easily automated using a simple shell script. The observed changes are summarised below.
Request               PDPone   PDPtwo
(Bob, Int, Assign)    NotApp   Permit
(Bob, Int, View)      NotApp   Permit
(Bob, Ext, Assign)    NotApp   Deny
(Bob, Ext, View)      NotApp   Deny
(Dave, Int, Assign)   NotApp   Permit
(Dave, Int, View)     NotApp   Permit
(Dave, Ext, Assign)   NotApp   Deny
(Dave, Ext, View)     NotApp   Deny

As a TA, Bob's privileges now include assigning and viewing internal grades, as well as all the privileges he has as a student. Everything else is now explicitly denied.

All requests from Dave, who is now a TA but not a student, are judged NotApplicable (and consequently denied) by the first policy, but the second policy allows him to view and assign internal grades. It explicitly forbids him to assign or view external grades.

Internal consistency of a PDP

We consider a set of rules to be consistent if there is no request permitted by one of the rules which is denied by another in the set. A set of policies is consistent if there is no request permitted by one of the policies which is denied by another in the set. Rule consistency within a policy and policy consistency within a PDP can each be checked using the method outlined above, using the functions evaluateRule and evaluatePol from Section 3.

6 Related Work

An important related piece of work (and indeed a major source of inspiration for this work) is [5]. Here the authors present Margrave, a tool for analysing policies written in XACML. Our intentions are almost identical but there are some important differences. Margrave transforms XACML policies into Multi-Terminal Binary Decision Diagrams (MTBDDs) and users can query these representations and compare versions of policies. Our work allows the possibility of the user manipulating the VDM representation of a policy then translating the changed version back into XACML. We test policies against requests, whereas Margrave uses verification.

In [8, 15] the access control language RW is presented. It is based on propositional logic, and tools exist to translate RW programs to XACML, and to verify RW programs.

Alloy [10] has been used [9] to verify access control policies. XACML policies are translated into the Alloy language and partial orderings between these policies can be checked.

7 Conclusions and Further Work

We have presented a formal approach to modelling and analysing access control policies. We have used VDM, a well-established formal method, as our modelling notation. This has allowed us to use VDMTools to analyse the resultant formal models. We have shown that rigorous testing of these policies is possible within VDMTools, and further that policies may be checked for internal consistency.

Ongoing work seeks to represent rules that are dependent on context. This will require extending the VDM model with environmental variables, and allowing rules to query these variables. This will allow us to model a much broader range of policies, including context-based authorisation [4] and delegation.

With larger policies, testing all possible requests may become time consuming. Further work will look at techniques and tools for developing economical test suites for access control policies.

Using attributes of subjects instead of subject identities would allow a greater range of policies to be implemented, and this is something we are currently working on.

A full implementation of the translation from XACML to VDM and vice versa is under development. Delegation could then also be modelled. The delegator could alter a flag in the environment (perhaps by invoking a certain rule in the PDP) and the delegate is then only allowed access if the flag is set.

Following the RBAC profile of XACML [12], the
rules we have considered so far all contain a role as the subject-set in the target. However, in the core specification of XACML [13] the subject-set of a rule can be an arbitrary set of subjects. If this is the case, then in general every possible combination of subject, resource and action would need to be tested, rather than just a representative from each role combination.

8 Acknowledgments

Many thanks to Joey Coleman for guiding me through shell scripting, and John Fitzgerald for guiding me through VDM.

This work is part-funded by the UK EPSRC under e-Science pilot project GOLD, DIRC (the Interdisciplinary Research Collaboration in Dependability) and DSTL.

References

[1] D.J. Andrews, editor. Information technology – Programming languages, their environments and system software interfaces – Vienna Development Method – Specification Language – Part 1: Base language. International Organization for Standardization, December 1996. International Standard ISO/IEC 13817-1.

[2] J.C. Bicarregui, J.S. Fitzgerald, P.A. Lindsay, et al. Proof in VDM: A Practitioner's Guide. Springer-Verlag, 1994.

[3] CSK. VDMTools. Available from http://www.vdmbook.com.

[4] Sabrina De Capitani di Vimercati, Stefano Paraboschi, and Pierangela Samarati. Access control: principles and solutions. Software – Practice and Experience, 33:397–421, 2003.

[5] Kathi Fisler, Shriram Krishnamurthi, Leo A. Meyerovich, and Michael Carl Tschantz. Verification and change-impact analysis of access-control policies. In ICSE '05: Proceedings of the 27th International Conference on Software Engineering, pages 196–205, New York, NY, USA, 2005. ACM Press.

[6] J. Fitzgerald, I.J. Hayes, and A. Tarlecki, editors. International Symposium of Formal Methods Europe, Newcastle, UK, July 2005, volume 3582 of LNCS. Springer-Verlag, 2005.

[7] John Fitzgerald and Peter Gorm Larsen. Modelling Systems: Practical Tools and Techniques in Software Development. Cambridge University Press, 1998.

[8] D. Guelev, M. Ryan, and P. Schobbens. Model-checking Access Control Policies. In ISC'04: Proceedings of the Seventh International Security Conference, volume 3225 of LNCS, pages 219–230. Springer, 2004.

[9] Graham Hughes and Tevfik Bultan. Automated Verification of Access Control Policies. Technical Report 2004-22, University of California, Santa Barbara, 2004.

[10] D. Jackson. Micromodels of software: Modelling and analysis with Alloy. http://sdg.lcs.mit.edu/alloy/reference-manual.pdf.

[11] Cliff B. Jones. Systematic Software Development using VDM. International Series in Computer Science. Prentice-Hall, 1990.

[12] OASIS. Core and hierarchical role based access control (RBAC) profile of XACML v2.0. Technical report, OASIS, Feb 2005.

[13] OASIS. eXtensible Access Control Markup Language (XACML) version 2.0. Technical report, OASIS, Feb 2005.

[14] The GOLD project. http://gigamesh.ncl.ac.uk/.

[15] N. Zhang, M. Ryan, and D. Guelev. Evaluating access control policies through model checking. In J. Zhou, J. Lopez, R.H. Deng, and F. Bao, editors, Eighth Information Security Conference (ISC'05), volume 3650 of LNCS, pages 446–460. Springer-Verlag, 2005.
[18], which is built on four key concepts – ontologies, standard WSDL based Web services, goals, and mediators. WSMO stresses the role of a mediator in order to support interoperation between Web services. WSMO introduces mediators aiming to support the distinct ontologies employed by service requests and service advertisements.

The above-mentioned methods facilitate service discovery in some way. However, when matching service advertisements with service requests, these methods assume that service advertisements and service requests use consistent properties to describe relevant services. This is not always true for a large-scale Grid system, where service publishers and requestors may use their own pre-defined properties to describe services. Some properties used in a service advertisement may not be used by a service request when searching for services. Therefore, one challenge in service discovery is that service matchmaking should be able to deal with uncertainty of service properties when matching service advertisements with service requests.

In this paper, we present RSSM, a Rough Sets [19] based service matchmaking algorithm for service discovery that can deal with uncertainty of service properties. Experimental results show that the algorithm is more effective for service matchmaking than UDDI and OWL-S mechanisms.

The remainder of this paper is organised as follows. Section 2 presents in depth the design of the RSSM matchmaking algorithm. Section 3 compares RSSM with UDDI and OWL-S mechanisms in terms of effectiveness and overhead in service discovery. Section 4 concludes this paper and presents future work.

2. Algorithm Design

… dispensable properties, which will be marked in advertised domain services (step 7). Invoked by the Dependent Property Reduction component (step 8), the Service Matching and Ranking component accesses the advertised domain services for service matching and ranking (step 9), and finally it produces a list of matched services (step 10).

Figure 1. RSSM components.

In the following sections, we describe in depth the design of RSSM components. Firstly, we introduce Rough Sets for service discovery.

2.1 Service Discovery with Rough Sets

Rough sets theory is a mathematical tool for knowledge discovery in databases. It is based on the concept of an upper and a lower approximation of a set, as shown in Figure 2. For a given set X, the yellow grids (lighter shading) represent its upper approximation, and the green grids (darker shading) represent its lower approximation.
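Once the advertised services have been partitioned into indiscernibility classes (services whose values for the chosen properties are identical), the two approximations are straightforward to compute. A minimal Python sketch with invented services:

```python
def approximations(classes, X):
    """Lower/upper approximations of X, given the partition of the
    universe into indiscernibility classes [x]_PA."""
    lower = {x for c in classes for x in c if c <= X}   # class entirely in X
    upper = {x for c in classes for x in c if c & X}    # class overlaps X
    return lower, upper

# s2 and s3 are indistinguishable on the chosen properties (invented data).
classes = [{"s1"}, {"s2", "s3"}, {"s4"}]
X = {"s1", "s2"}               # services relevant to the request R

lower, upper = approximations(classes, X)
# lower == {"s1"}              : definitely relevant
# upper == {"s1", "s2", "s3"}  : possibly relevant; s4 is definitely not
```

The service names, classes and set X above are illustrative only; in RSSM the partition would come from comparing property values of advertised service records.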
Let

U be a set of N advertised domain services, U = {s1, s2, …, sN}, N ≥ 1.

P be a set of K properties used to describe the N advertised services of the set U, P = {p1, p2, …, pK}, K ≥ 2.

PA be a set of M advertised service properties relevant to a service request R in terms of the domain ontology Ω, PA = {pA1, pA2, …, pAM}, PA ⊆ P, M ≥ 1.

X be a set of advertised services relevant to the service request R, X ⊆ U.

X̲ be the lower approximation of the set X.

X̄ be the upper approximation of the set X.

According to the Rough Sets theory, we have

X̲ = {x ∈ U : [x]_PA ⊆ X}    (1)

X̄ = {x ∈ U : [x]_PA ∩ X ≠ ∅}    (2)

For a property p ∈ PA, we have

∀x ∈ X̲, x definitely has property p.
∀x ∈ X̄, x possibly has property p.
∀x ∈ U − X̄, x absolutely does not have property p.

The terms "definitely", "possibly" and "absolutely" are used to encode properties that cannot be specified in a more exact way. This is a significant addition to existing work, where discovery of services needs to be encoded in a precise way, making it difficult to find services which have an approximate match to a query.

2.2 Reducing Irrelevant Properties

When searching for a service, a service requestor may use some properties irrelevant to the properties used by a service publisher in terms of a domain ontology. These irrelevant properties used by advertised services should be removed before the service matchmaking process is carried out.

Let

pR be a property for a service request.
pA be a property for a service advertisement.

Following the work proposed by Paolucci et al. [14], we define the following relationships between pR and pA:

exact match: pR and pA are equivalent, or pR is a subclass of pA.
plug-in match: pA subsumes pR.
subsume match: pR subsumes pA.
nomatch: no subsumption relationship between pR and pA.

Algorithm 1. Reducing irrelevant properties from advertised services.
1: for each property pA used in advertised domain services
2:   for all properties used in a service request
3:     if pA is nomatch with any pR
4:     then pA is marked with nomatch
5:     end if
6:   end for
7: end for
8: remove all pA that are marked with nomatch

For each property of a service request, we use the reduction algorithm shown in Algorithm 1 to remove irrelevant properties from advertised services. Those properties in advertised services that have a nomatch result will be treated as irrelevant properties. Advertised services are organised as service records in a database. Properties are organised in such a way that each uses one column, to ensure the correctness of the following reduction of dependent properties. As a property used by one advertised service might not be used by another advertised service, some properties may have empty values. A property with an empty value in an advertised service record becomes an uncertain property in terms of a service request. If a property in an advertised service record is marked as nomatch, the column associated with the property will be marked as nomatch. As a result, all properties within the column, including uncertain properties (i.e. properties with empty values), will not be considered in service matchmaking.

2.3 Reducing Dependent Properties

Property reduction is an important concept in Rough Sets. Properties used by an advertised service may have dependencies. Dependent properties are indecisive properties that are dispensable in matching advertised services.
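In effect, Algorithm 1 keeps an advertised property only if the reasoner reports some subsumption relationship between it and at least one requested property. A Python sketch, with a toy lookup table standing in for the ontology reasoner (the property names and relationships below are invented):

```python
def relationship(p_req, p_adv, ontology):
    """Stand-in for the ontology reasoner: returns 'exact', 'plug-in',
    'subsume' or 'nomatch' for a (requested, advertised) pair."""
    return ontology.get((p_req, p_adv), "nomatch")

def reduce_irrelevant(advertised, requested, ontology):
    kept = []
    for p_adv in advertised:                      # line 1 of Algorithm 1
        if any(relationship(p_req, p_adv, ontology) != "nomatch"
               for p_req in requested):           # lines 2-6
            kept.append(p_adv)
    return kept                                   # line 8: nomatch columns dropped

# Toy ontology: Price matches exactly; Topping subsumes NutTopping,
# so (Topping, NutTopping) is a subsume match.
onto = {("Price", "Price"): "exact", ("Topping", "NutTopping"): "subsume"}
reduce_irrelevant(["Price", "NutTopping", "Colour"], ["Price", "Topping"], onto)
# -> ['Price', 'NutTopping']   (Colour is nomatch against every p_R)
```

In RSSM the same marking happens per database column, so uncertain (empty-valued) entries are dropped together with their nomatch column.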
m(pRi, pAj) be a match degree between a requested service property pRi and an advertised service property pAj in terms of Ω, pRi ∈ PR, 1 ≤ i ≤ M, pAj ∈ PAD, 1 ≤ j ≤ LD.

v(pAj) be a value of the property pAj, pAj ∈ PAD, 1 ≤ j ≤ LD.

S(R, s) be a similarity degree between an advertised service s and the service request R.

Algorithm 3 shows the rules for calculating a match degree between a requested service property and an advertised service property. A decisive property with an empty value has a match degree of 50% when matching all the properties used by a service request. An advertised service property will be given a match degree of 100% if it has an exact match relationship with a property used by a service request. An advertised service property will be given a match degree of 50% if it has a plug-in relationship with a service request property and the relationship is out of five generations. Similarly, an advertised service property will be given a match degree of 50% if it has a subsume relationship with a service request property and the relationship is out of three generations.

Each decisive property used for identifying advertised services has a maximum match degree when matching all the properties used in a service request. S(R, s) can be calculated using formula (5).

Figure 3: Pizza ontology structure.

In the following sections, we evaluate the RSSM algorithm in terms of its effectiveness and overhead in service discovery. We compare RSSM matchmaking with UDDI and the OWL-S matchmaking algorithm proposed by Paolucci et al. [14] respectively. RACER
[22] was used by the OWL-S algorithm to reason about the relationship between an advertised service property and a requested service property. We implemented a simple reasoning component in RSSM to reduce the high overhead incurred by RACER.

3.1 Evaluating Effectiveness in Service Discovery

We have performed four groups of tests to evaluate the effectiveness of RSSM in service discovery. Each group had some targeted services to be discovered. Properties such as Size, Price, NutTopping, VegetarianTopping, and FruitTopping were used by the advertised services. Table 1 shows the evaluation results.

Table 1. Correct matching rate and number of matched service records under UDDI keyword matchmaking, OWL-S matchmaking and RSSM matchmaking, for each group's service property certainty rate.

In the tests conducted for group one, both the OWL-S matchmaking and the RSSM matchmaking have a correct match rate of 100%. This is because all services in this group do not have uncertain properties (i.e. properties with empty values). UDDI discovers four services, but only two services are correct because of the use of keyword matching (in this case making use of a categoryBag).

In the tests of the last three groups, where advertised services have uncertain properties, the OWL-S matchmaking cannot discover any services, producing a matching rate of 0%. Although UDDI can still discover some services in the tests of the last three groups, the correct matching rate for each group is low. For example, in the tests of group three and group four, where advertised services have only 50% and 30% certain properties respectively, UDDI cannot discover any correct services. The RSSM matchmaking is more effective than UDDI and the OWL-S matchmaking in tolerating uncertain properties when matching services. For example, the RSSM matchmaking is still able to produce a correct match rate of 100% in the tests of the last three groups.

3.2 Evaluating Overhead in Service Discovery

We have registered 10,000 Pizza service records in a database for testing the overhead of the RSSM matchmaking. We compare the overhead of the RSSM matchmaking with that of UDDI and the OWL-S matchmaking respectively in service matching, as shown in Figure 4. We also compare their performance in accessing service records, as shown in Figure 5.

From Figure 4 we can see that UDDI has the least overhead in matching services. This is because UDDI only supports keyword matching. It does not support the inference of the relationships between requested service properties and advertised service properties, which is a time-consuming process.

Figure 4. Overhead evaluation in matching services.

Figure 5. Overhead evaluation in accessing service records.

We also observe that the RSSM matchmaking has a better performance than the OWL-S matchmaking when the number of advertised services is within 5500. This is because the RSSM matchmaking used a simpler reasoning component than RACER, which was used by
the OWL-S matchmaking. However, the overhead of the RSSM matchmaking increases when the number of services gets larger, due to the reduction of dependent properties. The major overhead of the OWL-S matchmaking is caused by RACER, which is sensitive to the number of properties used by advertised services instead of the number of services.

From Figure 5 we can see that the RSSM matchmaking has the best performance in accessing service records, due to its reduction of dependent properties. The OWL-S matchmaking has a similar performance to UDDI in this process.

4. Conclusions and Future Work

In this paper we have presented RSSM, a Rough Sets based algorithm for service matchmaking. By dynamically reducing irrelevant and dependent properties in terms of a service request, the RSSM algorithm can tolerate uncertain properties in service discovery. Furthermore, RSSM uses a lower approximation and an upper approximation of a set to dynamically determine the number of discovered services which may maximally satisfy user requests. Experimental results have shown that the RSSM algorithm is effective in service matchmaking.

As the reduction of dependent properties is a time-consuming process, future work will focus on the design of heuristics for reducing dependent properties to speed up the process.

References

[1] I. Foster and C. Kesselman. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann Publishers Inc., San Francisco, USA, 1998.
[2] M. Li and M.A. Baker. The Grid: Core Technologies. Wiley, 2005.
[3] Open Grid Services Architecture (OGSA), http://www.globus.org/ogsa/
[4] Web Services Resource Framework (WSRF), http://www.globus.org/wsrf/
[5] Universal Description, Discovery and Integration (UDDI), http://www.uddi.org/
[6] A. ShaikhAli, O.F. Rana, R.J. Al-Ali, and D.W. Walker. UDDIe: An Extended Registry for Web Services. Proceedings of SAINT Workshops, Orlando, Florida, USA, 2003.
[7] S. Miles, J. Papay, V. Dialani, M. Luck, K. Decker, T. Payne, and L. Moreau. Personalised Grid Service Discovery. IEE Proceedings Software: Special Issue on Performance Engineering, 150(4):252-256, August 2003.
[8] myGrid project, http://www.mygrid.org.uk
[9] S. Miles, J. Papay, T. Payne, K. Decker, and L. Moreau. Towards a protocol for the attachment of semantic descriptions to grid services. Proceedings of the Second European Across Grids Conference, volume 3165 of Lecture Notes in Computer Science, Nicosia, Cyprus, January 2004.
[10] L. Moreau. Grid RegIstry with Metadata Oriented Interface: Robustness, Efficiency, Security (GRIMOIRES). http://www.grimoires.org/, 2005.
[11] T. Berners-Lee, J. Hendler, and O. Lassila. The Semantic Web. Scientific American, Vol. 284(4), pages 34-43, 2001.
[12] OWL Services Coalition. OWL-S: Semantic Markup for Web Services. White paper, Nov. 2003; www.daml.org/services/owl-s/1.0.
[13] S. Bechhofer, F. van Harmelen, J. Hendler, I. Horrocks, D. McGuinness, P.F. Patel-Schneider, and L.A. Stein. OWL Web Ontology Language Reference. W3C Recommendation, Feb. 2004.
[14] M. Paolucci, T. Kawamura, T. Payne, and K. Sycara. Semantic matching of web service capabilities. Proceedings of the 1st International Semantic Web Conference (ISWC2002), Berlin, 2002.
[15] M.C. Jaeger, G. Rojec-Goldmann, G. Mühl, C. Liebetruth, and K. Geihs. Ranked Matching for Service Descriptions using OWL-S. Proceedings of KiVS 2005, Kaiserslautern, Germany, Feb. 2005.
[16] L. Li and I. Horrocks. A software framework for matchmaking based on semantic web technology. Int. J. of Electronic Commerce, 8(4):39-60, 2004.
[17] S. Majithia, A.S. Ali, O.F. Rana, and D.W. Walker. Reputation-Based Semantic Service Discovery. Proceedings of WETICE 2004, Italy.
[18] Web Service Modeling Ontology (WSMO), http://www.wsmo.org
[19] Z. Pawlak. Rough sets. International Journal of Computer and Information Science, 11(5):341-356, 1982.
[20] jUDDI, http://ws.apache.org/juddi/
[21] MySQL, http://www.mysql.com
[22] V. Haarslev and R. Möller. Description of the RACER System and its Applications. Proceedings of the International Workshop on Description Logics (DL-2001), Stanford, USA, 2001.
[23] Global Grid Forum (GGF), http://www.ggf.org
Abstract
All practical scientific research and development relies inherently on a well-understood framework of quantities and units. While at first sight it appears that potential issues should by now be well understood, closer investigation reveals that there are still many potential pitfalls. There are a number of well-publicised case studies where lack of attention to units by both humans and machines has resulted in significant problems. The same potential problems exist for all e-Science applications whenever data is exchanged between different systems; an unambiguous definition of the units is essential to data exchange. e-Science applications must be able to import and export scientific data accurately and without any necessity for human interaction. This paper discusses the progress we have made in establishing such a capability and demonstrates it with prototype software and a small but varied library of units.
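As a concrete illustration of the kind of machine-readable description argued for below, a measurement such as 0.02 mol dm−3 can be held as a value plus a list of (prefix, unit, exponent) terms, from which scale conversion and dimensional comparison fall out naturally. The sketch below is illustrative only; it is not the RDF vocabulary or prototype software the paper develops:

```python
from dataclasses import dataclass

PREFIX = {"": 1.0, "deci": 0.1, "milli": 1e-3, "kilo": 1e3}   # scale factors
DIMENSION = {"mol": "amount", "metre": "length", "second": "time"}

@dataclass
class Term:
    prefix: str
    unit: str
    exponent: int

@dataclass
class Measurement:
    value: float
    terms: list

    def si_value(self):
        """Value with every prefix folded into the number."""
        v = self.value
        for t in self.terms:
            v *= PREFIX[t.prefix] ** t.exponent
        return v

    def dimensions(self):
        """Net exponent per quantity; equal dicts mean comparable values."""
        dims = {}
        for t in self.terms:
            q = DIMENSION[t.unit]
            dims[q] = dims.get(q, 0) + t.exponent
        return {q: e for q, e in dims.items() if e}

# 0.02 mol dm^-3: value 0.02, units mol^1 and (deci-metre)^-3
strength = Measurement(0.02, [Term("", "mol", 1), Term("deci", "metre", -3)])
strength.si_value()    # ~20.0, i.e. 20 mol per metre cubed
strength.dimensions()  # {'amount': 1, 'length': -3}
```

Two measurements with equal `dimensions()` describe the same kind of quantity and can be compared after prefix folding; anything beyond that (bridges such as mmHg, or ratio context) needs the extra machinery discussed in Section 3.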
1 Introduction

All practical scientific research and development relies inherently on a well-understood framework of quantities and units. We use the term quantity here as used in the Green Book 1 to describe concepts of length, time and so on, rather than the more common "dimension". An essential part of all scientific training is directed to the study and understanding of this topic. While at first sight it appears that potential issues should by now be well understood, closer investigation reveals that there are still many potential pitfalls. There are a number of well-publicised case studies where human misinterpretation of unit information has resulted in significant problems.

Exactly the same potential problems exist in all e-Science applications whenever data is exchanged between different systems - an unambiguous definition of the units is essential. While significant progress has been made using XML 2 (and XML derivatives) to describe such data, these solutions only address the issue of markup, and not interoperability. Such an approach falls well short of the functionality necessary for a Semantic Grid 3,4. e-Science applications in a Semantic Grid must be able to import and export scientific data (and in fact any numerical data) accurately and without any necessity for human interaction. Without a system to decide whether it is receiving quantities with the correct units for its use, nothing but careful data entry prevents error or disaster.

This paper discusses the progress we have made in establishing such a capability. We conclude that RDF 5 provides a suitable and practical means to describe scientific units to enable their communication and inter-conversion, and thereby solve a major problem in reliable exchange of scientific data.

2 Existing unit systems

We might choose to store all numbers according to SI conventions 1 and therefore contain ourselves within a single system of units and quantities. This is a workable solution in many situations. Unfortunately many fields of science have units that predate SI and are conveniently scaled for analysis, and so storing these values in SI units would be inconvenient as well as requiring careful conversion. If on the other hand we decide to retain the original units we face the prospect of other people choosing to use different units. This is almost inevitable given that different purposes may lead us to use any of SI, British Imperial, US Imperial ("English units"), CGS, esu, emu, Gaussian and atomic systems or even more ancient and country-specific systems. It is quite common to have two measurements on different scales that might well be the same, and yet we have no way of comparing the two without the aid of a pocket calculator and appropriate numerical constants.

To understand fully the problems of describing units it is useful to examine what a measurement consists of: a numerical value with units. These two components must not become separated or the value loses all meaning. The following demonstrates how a measure of solution strength can be deconstructed:

Solution strength of 0.02 mol dm−3

0.02 - The value of the measurement.
mol - One unit, the mole, indicating one Avogadro's constant of molecules.
dm−3 - A second unit, "per decimeter cubed". This may be further decomposed into:
  deci - An SI prefix for 1/10th.
  meter - The SI unit for length.
  −3 - An exponent of the preceding unit, present if not equal to one.

Confining ourselves to the SI system for the moment, we note the possibility of multiple units for one measurement, and that each unit may be preceded by a scaling factor such as milli or mega. In addition to this each unit may be raised to a power, either positive or negative, typically with a magnitude of six or less.

3 Conversion between unit systems

The great breadth of scientific and engineering endeavour over many centuries has led to a wide variety of systems with a legacy of conveniently sized units selected for a particular purpose. Inevitably units from different systems are encountered in the same setting and conversion must take place so that values may be compared or combined. The ramifications of imperfect conversions are considerable, as NASA discovered to their cost when their Mars Climate Orbiter was destroyed in 1999 6 due in part to the incorrect units of force given to the software. In 1983 the "Gimli Glider" incident 7 involving a Boeing 767 running out of fuel occurred because of an invalid conversion of fuel weights. These two very high profile events are extreme examples of a problem that is encountered all the time. Lack of precision and rigour leads to costly mistakes in something that is not difficult but requires attention.

The majority of conversions are exchanges of one unit for another, such as celsius for kelvin, or yards for meters. Such conversions present no challenge, but not all conversions are so straightforward. Knowledge of the quantities (or dimensions) a unit relates to is vital to deciding what conversions are meaningful. For example, the knot is a recognised measure of speed in common use. It is a compound unit and must be separated into the quantities of length and time in order to compare it with other measures of speed. By the same token, the watt is an approved SI measure of power but it is also common for data sources to report the same quantity expressed in joules per second. How can computer software equate these two concepts?

Various efforts have been made in computer science to maintain and convert units alongside their values in software, with mixed results. To do so within Object Oriented systems requires a certain inventiveness in order to "treat a single entity as both a type and a value" 8. These coding complexities can be avoided by leaving the computational logic the same and separating the units handling into a separate software layer, which is what we provide for here.

A more complicated issue is exemplified by older measurements of pressure based on a column of mercury. A column height in millimeters of mercury has been a long-established method of monitoring atmospheric pressure, but of course this is not a pressure at all, rather a length. If treated as a length for conversion purposes we may only convert from millimeters or inches to some other length of a mercury column. While this is entirely reasonable, it is not particularly useful. Some form of "bridge" is required to make the transition from a length to a pressure but it is entirely dependent on the material. This is more an issue of physical science and not obviously within the scope of units description so we treat it as a secondary concern.

Something definitely within the scope of units description but equally contentious is ratios. Treatment of ratios has been hotly debated by standards committees over the years. With all ratios, such as LD50s (a lethal dose that kills 50% of a test group) and the concentration of drug in formulations, one must keep the division between unit and measurement clear. Although the units of a ratio cancel and it becomes a unitless value, the value has no meaning without context. LD50s are performed on all manner of small organisms, and it may be important to note that the dose was issued in grams per kilogram of body mass of the chosen organism. One cannot for example convert such a ratio of weights into other units if the units have been cancelled out. Some of this information is relevant to unit description, while the remainder falls into the domain
of experimental description and needs further consideration.

It is clear that any system to aid unit conversion must capture the units themselves, but also the quantities to which the units apply. Unfortunately the issue of quantities is also complicated by differences in unit systems. The esu and emu systems were created to simplify the mathematics of electromagnetism and operate in only three dimensions rather than the four required to handle the same information under SI. While it is possible for esu and emu based values to work in a four-dimensional system they are commonly applied in three such that there is no need for charge in esu, or current in emu. This "mathematical trickery" makes values from these systems dimensionally different to other systems and demands non-trivial transformations to switch between them. Most other unit systems are dimensionally consistent with SI and hence can be addressed all at once. Special consideration is needed for work involving electromagnetism.

4 Machine-readable unit descriptions

Having established the need for and complexities in describing units of measure, we come to the issue of the technology to use. XML and its derivatives address the problem of data exchange by making all data self-describing. By eschewing proprietary data formats, we make it easier for software authors to work with the data. Consequently parsing of data can be a much more robust process.

… work on ontologies expressed in the language of LISP such as the EngMath ontology 12. Indeed, ontologies of bewildering complexity exist to describe many things but there is little evidence of them being applied to real problems. As Vega et al. explained 13, knowledge sharing has a long way to go in order for these ontologies to be reused, let alone spread beyond their original setting of knowledge engineering and artificial intelligence.

The XML schemas developed thus far tend to propose that one should have a definition for mol dm−3 and miles per hour as single entities, and within that definition are descriptions of what units and powers it is composed of. This makes description straightforward, as shown by the GML fragments below.

<measurement>
  <value>10</value>
  <gml:unit#mph/>
</measurement>

<DerivedUnit gml:id="mph">
  <name>miles per hour</name>
  <quantityType>speed</quantityType>
  <catalogSymbol>mph</catalogSymbol>
  <derivationUnitTerm uom="#mile" exponent="1"/>
  <derivationUnitTerm uom="#h" exponent="-1"/>
</DerivedUnit>

One potential problem with this approach is in the vast numbers of permutations possible to accommodate the many different ways people use units in practice. There are perhaps ten different prefixes in common use in science, so at least in principle we may have ten versions of each unit, and with compound units it might be common to talk of moles, millimoles or micromoles per decimeter, centimeter or meter cubed. We would then have
Several organisations are in the process of around one hundred useful permutations and
creating systems to make units transferable, many more possibilities.
including NIST’s unitsML 9 , the units sections Clearly such a system is more useful if it
within GML 10 and SWEET 11 ontologies from considers non-SI units such as inches, pounds
the Open Geospatial Consortium and NASA and so on. Every combination of two units
respectively. None of these systems has yet together results in another definition leading
been finalised and may yet be improved upon. to a finite but practically endless list of defini-
At this time there are no complete comput- tions. This is exemplified by the units section
erised units definitions endorsed by any stan- of the SWEET ontology. SWEET presently
dards organisation, and this means that there addresses a relatively small set of units around
is no accepted procedure to express scientific the basic SI units, and already the list is many
units in electronic form. Wherever the prob- pages of definitions with a very precise syntax
lem arises, people have either created their required to invoke them. In a long list, hu-
own systems that can cope with their imme- mans will have difficulty locating the correct
diate needs or resorted to plain text, thereby entities and both processing and validating the
condemning their data to digital obscurity. schema becomes increasingly difficult. There
Unit conversion and having the concepts of is already considerable scope for typograph-
units within software are not particularly new ical errors when writing programs to use on-
ideas, as illustrated by the many conversion tologies, and the bigger and more complex the
tools presently available and a vast amount of ontology the greater the problem becomes.
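The permutation problem described above is easy to quantify. As a rough sketch (the prefix list and unit names here are illustrative only, not drawn from any of the schemas discussed), treating each prefixed numerator and denominator as a distinct single-entity definition multiplies the number of definitions required:

```python
from itertools import product

# Ten illustrative SI prefixes in common scientific use (including the
# bare, unprefixed unit).
prefixes = ["", "m", "u", "n", "p", "f", "c", "d", "k", "M"]

# If concentrations in the style of mol dm-3 are stored as single
# entities, every prefixed numerator pairs with every prefixed
# denominator, so the definitions multiply rather than add.
single_entity_defs = [f"{p1}mol {p2}m-3" for p1, p2 in product(prefixes, prefixes)]

# One compound quantity alone already needs a hundred definitions.
print(len(single_entity_defs))
```

Exploding units into separate prefix and unit terms, as discussed later in the paper, sidesteps this growth by composing prefixes at the point of use.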
A more tractable but neglected alternative to the above approach is to explode the units when they are invoked, as follows:

<measurement>
  <value>10</value>
  <unit>
    <unitname>mile</unitname>
    <power>1</power>
  </unit>
  <unit>
    <unitname>hour</unitname>
    <power>-1</power>
  </unit>
</measurement>

or in more condensed form:

<measurement>
  <value>10</value>
  <unit id="#mile" power="1"/>
  <unit id="#hour" power="-1"/>
</measurement>

siblings. This is not the case with RDF, which allows more complex networks to be formed. Much more powerful data description is possible without being limited to single-level relationships such as sibling or parent. We can also add concepts such as "similar to" and branch across trees of the data. This is valuable in the context of units owing to the complex web of relationships between units and quantities. The limitations on a schema are very much down to what is logically sensible and reasonable to program. It is possible to make non-hierarchical networks computationally intractable, so such flexibility should only be employed with care.

Figure 1: Units schema visualisation
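The exploded representation above lends itself to simple mechanical handling of powers and cancellation. A minimal sketch (our own data structures, mirroring the condensed XML rather than any published schema):

```python
# A measurement as a value plus unit terms, as in the condensed XML
# above: 10 mile^1 hour^-1.
measurement = {"value": 10.0, "units": {"mile": 1, "hour": -1}}

def multiply(units_a, units_b):
    """Combine two sets of unit terms by summing powers; zero powers cancel."""
    combined = dict(units_a)
    for name, power in units_b.items():
        combined[name] = combined.get(name, 0) + power
    return {name: power for name, power in combined.items() if power != 0}

# Multiplying by a duration of 1 hour cancels the hour term, leaving a
# distance in miles: the unit algebra falls out of the representation.
remaining = multiply(measurement["units"], {"hour": 1})
```

Because the unit terms are kept separate rather than baked into a single compound entity, new compounds need no new definitions.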
examples of instances of the unit class, and not a class in itself, as is commonly encouraged in ontology creation. This is a case of the frequently encountered class/instance dilemma. In principle there is only one true measure for a quantity in a given unit system. In this case we defer to SI to endorse one standard value for the magnitude of one meter, one second, etc. The difference between a US gallon and a UK gallon is just one example of many semantic collisions that require clear description and differentiation.

Each unit instance is linked to its corresponding quantity (length, time, mass). A derived quantity (such as volume or velocity) may be constructed from other quantities, in which case "derived-from" links imported from the ontology instruct the system what other quantities can be substituted for the derived quantity and to what order. Quantities all have a standard SI recommended unit, and where they are derived, they have a connection to the base quantities from which they are derived. This enables consistency checking between units by using their quantities regardless of whether the quantity is base or derived.

The conversions themselves are expressed as a series of computational operations, each consisting of both scalar values and units. The combination of units and values collected as one constant makes it possible to perform conversions without prior knowledge of the outcome. Although having units on the constants may seem unnecessary, it supplies additional rigour to the conversion process, as well as lucidity to the conversion itself. An instruction might be to multiply by 3600 seconds per hour, cancelling existing units and reminding us what that scalar transformation represents. It also allows support for conversions that connect different quantities by fundamental physical relationships, as discussed later. This arrangement makes it possible to infer the units that result from a conversion rather than having to specify them in the ontology, yielding great benefits in managing the units library and simplifying implementation.

We investigated the possibility of storing conversions in some form of equation, but this demanded either the complete development of an equation system or the use of an existing system such as MathML [14] or the less commonly known OpenMath [15]. This solution proved far more complicated than practical and was discarded in favour of a stepwise system that still retains the content and reversibility of an equation. As long as the reversibility is retained, conversions need not be described in both directions, thereby simplifying the library even further.

In order to maintain the integrity of the units library, a number of rules must be observed that may be enforced with an ontology.

• Every unit must relate to a base or derived quantity.

• Quantities that are not one of the 7 SI base quantities must be derived from a combination of those 7 quantities.

• All non-SI units must have a conversion to SI base units or combinations of units.

• Quantities may have conversions which alter the dimensionality of the system, using SI units for the conversion.

The above system is complex and creates an extensive network of units and quantities containing many cross-links. This is expected and cannot be simplified any further without compromising the function of the system. A part of the library is illustrated in figure 2, showing the conceptual separation of base quantities, derived quantities and the units that correspond to them.

Figure 2: Instances of units and quantities

The units are unscaled and without exponents in order to avoid the combinatorial issue discussed earlier. SI prefixes such as milli and mega are defined and invoked as separate entities. Supplementary information such as preferred abbreviations is also defined as required. Non-SI units have only one conversion, to the SI equivalent, and no others. This helps to minimise the number of conversions required and avoids issues of multiple redundant paths to the same result. If we wish to convert between two measurements in the Imperial system, a route is found via the SI unit, retaining any exponents. For example: Miles ⟹ Meters ⟹ Feet instead of Miles ⟹ Feet. The only caveat to this is computational precision, as floating point arithmetic in a binary
6 Physical Equivalence Conversions

Yet another set of conversions exists in science that are extremely useful but completely transform both the value and the quantity of a measurement. Any physicist will know that mass can be translated into energy with the correct fundamental constant. Likewise, spectroscopists regularly convert wavelengths (or inverse wavelengths, specifically the wavenumber in cm−1) into energies, typically in electron volts or joules. These transitions from one set of base units to another are made possible by equations which use fundamental constants incorporating units of their own. While far from obvious and not pure unit conversions, these scientific equivalences are useful, and having such conversions automated is even more helpful than just providing normal conversions. To that end we have used the same conversion description method for quantity-quantity conversions as well as unit-unit conversions. The mathematical processes implied by equations such as E = hν and E = mc² can be described in exactly the same way as the process that translates from miles to meters. The only differences are their attachment to quantities rather than units, and the logic required to decide when to use them. This is illustrated in figure 3, where the conversion stems from the length quantity. Unfortunately it is not obvious how to limit the use of this conversion to values that do not relate to electromagnetic radiation. This is an issue of context involving both purpose and meaning that goes beyond the scope of units and conversions. This wider context of describing the measurements themselves might require a separate ontology to embrace all of science and engineering.

7 Test implementation

4. The program reduces the input units to SI base units by expanding any derived SI quantities and performing all necessary conversions to SI base units. The same process is applied to the requested units, performing conversions in reverse. The simplified request and starting units are compared for quantity and unit consistency, i.e. that the request has asked for a reasonable operation such as length to length, and nothing like length to volume. At this point, an inconsistency in quantities leads to a systematic exploration of possible quantity-quantity conversions, such that useful equalities for frequencies and energies can be included amongst other physical equivalences. All combinations up to an arbitrary limit of three consecutive quantity-quantity conversions are considered, and the appropriate conversions performed if a quantity match can be achieved. If the quantities are deemed compatible and the conversion is a success, the result has already been computed and is written out in an RDF wrapper. Otherwise the request is rejected as meaningless or beyond the scope of the program.

Figure 4: Unit conversion process outline

The "uniterator" has been tested with the following successful conversions using a relatively limited ontology of units:

• 10.5 mJ −→ 0.0105 J

• 10 mg dm−3 −→ 4.55e-02 kg gallon−1

• 30 knots −→ 34.52 miles per hour

• 6 GHz −→ 3.98e-21 mJ

• 200 cm−1 −→ 3.97e-21 J

The whole process relies heavily on the commutative nature of the conversion processes. This may lead to problems with conversions involving a translation of origin, such as with the Celsius temperature scale, but this can be resolved with a more rigorous program. Since this program is a proof-of-concept script, it will not be developed further to ensure absolute reliability. At present it is capable of performing conversions involving simple temperatures on outmoded scales, but may fail with particular combinations of units.

It should be noted that it was necessary to use grams as the base SI unit instead of kilograms. Although incorrect as far as SI is concerned, it allows complete divorcing of prefix and unit. The outputs of this software can be refined to present data according to SI recommendations, so this is not a significant problem. Some additional care is needed in the encoding of conversions that normally rely in some way on the kilogram, such as measures of energy.

An RDF or XML request for conversion takes the following form:

<ch:Quantity>
  <ch:has-value>10</ch:has-value>
  <has-unit>
    <Unit rdf:type="#Fahrenheit">
      <power-of>1</power-of>
    </Unit>
  </has-unit>
  <has-desired-unit>
    <Unit rdf:type="#Celsius">
      <power-of>1</power-of>
    </Unit>
  </has-desired-unit>
</ch:Quantity>

8 Conclusions

In summary, previous attempts have been made to define units for computer software, but all of them have run into difficulties at various stages of their development. Successfully designing a system that solves all of the possible problems has proven to be very challenging because of the endless variations of unit application and the tendency for people to define their own units. The problem is simply too broad for any one person to have experience of all units, and this has led to systems which are unintentionally incapable of tackling some units satisfactorily. The boundaries between unit and measurement are somewhat blurred, and this clouds the issue further.

The units system outlined here has been developed with a heavy emphasis on facilitating implementation and usefulness. It provides a manageable way to make scientific units machine-parseable and interconvertible on the semantic web. RDF is used to create a network of units and quantities that can be effortlessly extended with new units and conversions without requiring any rewritten software. It provides several advantages over existing XML methods by controlling the ways in which units relate to each other, and by clearly addressing issues of dimensionality, convenience and functionality. A design decision has been made to keep the central dictionary and relationships as small and elegant as possible while retaining scope for even the most exotic of conversions between systems with the same number of dimensions. The specialised systems used for electromagnetism remain a problem to be addressed in future work. Although not addressed here, it is entirely reasonable to have a parallel ontology for the esu and emu systems with appropriate conversions. Such a process involves many intricacies and only applies to a relatively specific area of science, hence we have not yet attempted to resolve it. The definitions of quantities and units in our system will almost certainly make such a transformation manageable.

Another key factor that is not provided is any intelligence. The system is not rich enough to identify misuses of units, and cannot hope to address some of the finer points such as restrictions on when Hz may be used instead of s−1 and when a second relates to an angle. Attempting to address these more subtle distinctions could easily lead to an ontology far too complicated for useful implementation. It is perhaps more suitable for this issue to be addressed at the interface level rather than in the underlying data. The key to successful deployment is to design a system to be as universally understandable as possible. Once a proven and complete system is agreed upon, a more expressive ontology language such as OWL can be used to restrict and validate units and conversions more comprehensively.

An area neglected by this paper is that of uncertainty. Strictly speaking, no measurement is complete without a declaration of precision. This conspicuous absence is for a variety of reasons. Firstly, expressions of error and precision come in many forms, relative and absolute, all of which must be accounted for. Secondly, there are many ways to mark up precision, and we do not presume to force an approach on the reader. The units system presented here is intended to demonstrate what can be done and to raise awareness of the requirements for machine-readable units. The issue of measurement mark-up, including precision, units and domain relevance (to prevent spurious conversions), is deserving of a much lengthier discussion. There is no reason why these features cannot be added to our schema or software.

We have demonstrated that the semantic technology RDF can provide a practical method to describe and communicate the scientific units necessary for complete description of quantities, together with methods for comparison and conversion of those units. The system provides a basis for an ontology that will enable the automated validation of the nature of a quantity, its compatibility with the units and its comparability with other quantities to provide the necessary and appropriate conversions between unit systems.

References

[1] I. Mills, T. Cvitas, K. Homann, N. Kallay, and K. Kuchitsu, IUPAC Quantities, Units and Symbols in Physical Chemistry, Blackwell Science, 2nd ed., 1993.

[2] World Wide Web Consortium, Extensible Markup Language, http://www.w3.org/XML/, viewed 2005.

[3] D. De Roure, N. Jennings, and N. Shadbolt, Research Agenda for the Semantic Grid: A Future e-Science Infrastructure, Technical Report UKeS-2002-02, National e-Science Centre, December 2001.

[4] D. De Roure, N. Jennings, and N. Shadbolt, in Proceedings of the IEEE, pages 669–681, 2005.

[5] World Wide Web Consortium, Resource Description Framework, http://www.w3.org/rdf/, viewed 2005.

[6] NASA, Mars Climate Orbiter Believed To Be Lost, http://mars.jpl.nasa.gov/msp98/orbiter/, 1999.

[7] M. Williams, Flight Safety Australia, July–August 2003.

[8] E. E. Allen, D. Chase, V. Luchangco, J.-W. Maessen, and G. L. S. Jr., in J. M. Vlissides and D. C. Schmidt, Eds., OOPSLA, pages 384–403, ACM, 2004.

[9] National Institute of Standards and Technology, Units Markup Language, http://unitsml.nist.gov/, 2003.

[10] Open Geospatial Consortium, GML – the Geography Markup Language, http://www.opengis.net/gml/, viewed 2005.

[11] R. Raskin, M. J. Pan, I. Tkatcheva, and C. Mattmann, Semantic Web for Earth and Environmental Terminology, http://sweet.jpl.nasa.gov/index.html, 2004.

[12] T. R. Gruber and G. R. Olsen, in J. Doyle, P. Torasso, and E. Sandewall, Eds., Fourth International Conference on Principles of Knowledge Representation and Reasoning, 1994.

[13] J. C. A. Vega, A. Gomez-Perez, A. L. Tello, and H. S. A. N. P. Pinto, Lecture Notes in Computer Science, 1607, 725 (1999).

[14] W3C Math Working Group, MathML 2.0, http://www.w3.org/Math/, 2001.

[15] OpenMath Society, OpenMath, http://www.openmath.org/, 2001.
Abstract

This paper presents an ISIS facilities ontology based on keywords in the ISIS Metadata Catalogue (ICAT), and the OntoMaintainer, a web application to facilitate the collaborative development of ontologies. The ISIS facilities ontology aims to organise keywords in the ISIS Metadata Catalogue and make them more explicit, which will improve the search and navigation of data by category. The Ontology Maintainer allows users to view current versions of the ontology and send feedback on the concepts modelled back to the maintainers of the ontology. To ensure ease of use, this service has been made available to the user community via the World Wide Web.
are normally organised in taxonomies with a subsumption ("subclass") hierarchy. OWL classes can be specified as logical combinations (intersections, unions, or complements) of other classes, or as enumerations of specified objects. OWL can also declare properties, organise these properties into a "sub-property" hierarchy, and provide domains and ranges for these properties [10]. Properties allow general facts about the members of classes and specific facts about individuals to be asserted, and also represent relationships between two individuals. For example, the property hasSibling might link the individual Matthew to the individual Gemma. There are two main types of properties: object properties and datatype properties. Datatype properties are relations between instances of classes and RDF literals or XML Schema datatypes. Object properties are relations between instances of two classes. Properties may have a domain and a range specified. The domains of OWL properties are OWL classes, and ranges can be either classes or externally defined datatypes such as string or integer. Properties link individuals from the domain to individuals from the range. Individuals represent objects in our domain of interest [10]. For example, Italy, England, China and France are individuals in the class Country.

OWL provides three sublanguages, namely OWL Lite, OWL DL and OWL Full [10]. OWL Lite supports those users primarily needing a classification hierarchy and simple constraints. OWL DL supports those users who want the maximum expressiveness while retaining computational completeness for reasoning systems. OWL Full is meant for users who want maximum expressiveness and the syntactic freedom of RDF with no computational guarantees [10]. OWL's logical model allows the use of a reasoner, which can check whether or not all of the statements and definitions in the ontology are mutually consistent, and can also recognise which concepts fit under which definitions. The reasoner can therefore help to maintain the hierarchy correctly. This is particularly useful when dealing with cases where classes can have more than one parent [11].

2.4 Ontology Editing Environments

Ontology development or engineering tools include suites and environments that can be used to build a new ontology from scratch or by reusing existing ontologies [12]. Since the mid-nineties there has been an exponential increase in the development of technological platforms related to ontologies [7]. Protégé is one of the most widely used ontology building tools and has been developed by the Stanford Medical Informatics (SMI) Group at Stanford University [13]. Protégé instances, slots and classes roughly correspond to OWL individuals, properties and classes (concepts).
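The domain/range mechanics described above can be illustrated with a small sketch. This is plain Python, not OWL tooling; the class and property names reuse the examples from the text, with hasName added as a purely hypothetical datatype property:

```python
# Toy model of OWL-style properties: each property declares a domain
# class and a range (a class for object properties, a datatype for
# datatype properties), and links a subject individual to a value.
properties = {
    "hasSibling": {"domain": "Person", "range": "Person"},  # object property
    "hasName": {"domain": "Person", "range": "string"},     # datatype property
}
individuals = {"Matthew": "Person", "Gemma": "Person", "Italy": "Country"}

def assertion_is_consistent(prop, subject, value_class):
    """Check a single triple against the property's declared domain and range."""
    spec = properties[prop]
    return individuals[subject] == spec["domain"] and value_class == spec["range"]

# Matthew hasSibling Gemma fits the declared domain and range;
# Italy hasSibling Gemma does not, since Italy is a Country.
ok = assertion_is_consistent("hasSibling", "Matthew", individuals["Gemma"])
bad = assertion_is_consistent("hasSibling", "Italy", individuals["Gemma"])
```

A real reasoner does far more (subclass inference, multiple parents, consistency across the whole ontology), but the domain/range check above is the basic constraint being enforced.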
Subclasses representing experiments carried out at ISIS by different scientific groups (e.g. CrystallographyGroupExperiment) were added to the class ISISExperiment (Figure 4). This is to facilitate common instrument requirements related to different research groups and to provide a focus for user interaction.

ISISExperiment and individuals in the other five main classes. The properties hasDataFileName, hasInvestigationTitle, hasInvestigator, hasUsed and wasConductedIn were created and domain and range constraints set in Protégé. All five properties have ISISExperiment as their domain and DataFileName, InvestigationTitle, Investigator, Instrument, and Year as their range respectively (Figure 6).

4. Ontology Maintainer
major advantage of this tool is that it provides an easily accessible, user-friendly method of displaying the hierarchy of the ontology being built. In addition to visualisation, it enables users to submit feedback through a simple web form to the maintainers of the ontology.

The web page is divided into two panels, one displaying the tree hierarchy and the other showing a web form. The first panel shows a tree hierarchy with class names displayed alongside each of the tree nodes; users can expand the tree fully by clicking on nodes or classes in the tree. The web application is modular, and the look and feel of the tree can be changed easily. To send comments, users can click on the class they wish to comment on in the first panel; this will generate a path to that particular class through the ontology, which is added to the text area of the comment form. Next, users can enter their full name, email address, subject and comments in the second panel, as shown in Figure 7. Parameters entered in the form will be sent directly to the maintainer of the ontology.

Ontologies are constantly being changed due to technological advances, errors in earlier versions of the ontology, or the release of a novel method of modelling a domain. The process of updating an ontology is not an easy task, as there needs to be a general consensus amongst members of the community. The Ontology Maintainer will have a significant impact on this process, as it will allow individuals to easily access current versions of the ontology through the World Wide Web.

5. Future work

The current ISIS facilities ontology serves as a base for future ontologies, and in the near future it will be expanded with more concepts, properties and restrictions. An ISIS Online Proposal System has been developed to automate the capture of metadata [1]. This system enables users of the ISIS facility to electronically create, submit and collaborate on applications for beam-time [1]. This new electronic proposal system provides access to a rich source of metadata which can be fed directly into the experimental set. Creation of
Abstract

Grid computing has become an important new research field. The goal of the Grid computing infrastructure is to pervasively access the computational resources available on the Internet. Web service technology has the potential to achieve this goal because of its self-contained, self-describing, and loosely coupled features. Resources can be discovered and shared through the web service's process interface. In this paper we discuss how web services can be composed from a conceptual perspective, and propose a context-based semantic model to better describe web services in order to realise automatic service discovery, composition, and invocation.
The difficult steps to be realised in the process of service composition are locating suitable sub-services and formatting the sub-services into a service flow. Locating suitable services will be discussed in the next section, where we consider how semantic web technology can benefit service selection. This section will focus on what aspects need to be considered when formatting the selected sub-services into a service flow by linking relevant sub-services together.

When you link two services together, you are in fact linking the former service's outputs with the later service's inputs, i.e. the later service's inputs can take their values from the former service's outputs. This linking is based on the compatibility of the input and output data types, the satisfiability of pre- and post-conditions, and the similarity of the inputs' and outputs' semantics. These are the basic criteria that need to be considered before joining two services together. However, to achieve a real task satisfactorily, considering just the basic criteria is not enough; some other contextual information also needs to be considered [7], such as:

• Time: When two services are running in parallel, they have to make sure they return their outputs at the same time before those outputs go into the next service's inputs.

• Data Consistency: One service's output and another service's input having the same data type and semantic meaning does not mean that data consistency is satisfied. For example, one service's output is a weight in grams, but another's input needs a weight in kilograms. In this example both services' input and output data types are 'double' and both of their semantic meanings are 'weight', but the output and input of these two services cannot be directly linked together.

• Location: Sometimes the location of a service can also affect the service composition. For example, one service requires an address as an input and another service can return an address, but there is a case in which these two services cannot be linked together: when one service is in the UK and the other is in the US.

Once a composite service has been built, the service provider needs to publish its inputs and outputs as public interfaces for the service requester to invoke. The inputs and outputs of a composite service are generated from some of its internal sub-services' inputs and outputs. If we use a directed graph G(V, E) to represent all the services in the grid environment and their possible relationships¹, then a composite service can be represented as a sub-graph of G:

Sub_G1(V1, E1) ⊆ G

Sub_G1 together with its context can be represented as another sub-graph of G:

Sub_G2(V2, E2) ⊆ G,  Sub_G1 ⊆ Sub_G2

Then we can represent the context of the composite service in the grid environment, which is the difference between Sub_G2 and Sub_G1:

C(Vc, Ec) = Sub_G2 − Sub_G1

The set of arcs Ec in C represents the inputs and outputs of the composite service:

Ec = I ∪ O

where

I: the set of inputs of the composite service.
O: the set of outputs of the composite service.

Figure 3: Composite Service, Context and Grid

Figure 3 shows a graphical illustration of the conceptual representation of the relationship among a composite service, its context, and the grid environment. In the diagram, the red arrows represent the internal relationships between sub-services inside the composite service, the black arrows represent the inputs and outputs of the composite service, and the blue dashed arrows represent the relationships between the services in the grid environment.

¹ Note: the arcs in the graph G are uncertain due to different composition requirements; the graph here just captures one possible situation.

4. Semantic Service Description Aspects and Matchmaking

identify the meaning of a given object and therefore select the most suitable service or information object.

Therefore, we have developed our initial context-based semantic service description model, which integrates the context together with semantics to better describe a service [7]. This model addresses a service from six aspects:
735
Proceedings of the UK e-Science All Hands Meeting 2006, © NeSC 2006, ISBN 0-9553988-0-0
736
Proceedings of the UK e-Science All Hands Meeting 2006, © NeSC 2006, ISBN 0-9553988-0-0
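The context definition above (Ec = I ∪ O) can be sketched as a small data model. This is an illustrative sketch only: the class and attribute names (CompositeService, inputs, outputs) are our own, not part of the paper's formal definition.

```python
# Sketch: a composite service's context C, where the arc set Ec is the
# union of the composite service's inputs I and outputs O (Ec = I ∪ O).
# All names here are illustrative, not taken from the paper.

class CompositeService:
    def __init__(self, sub_services, inputs, outputs):
        self.sub_services = sub_services  # internal sub-service graph nodes
        self.inputs = set(inputs)         # I: inputs of the composite service
        self.outputs = set(outputs)       # O: outputs of the composite service

    def context_arcs(self):
        """Ec = I ∪ O: the arcs linking the composite service to its context."""
        return self.inputs | self.outputs

svc = CompositeService(
    sub_services=["convert_units", "compute_route"],
    inputs={"weight_kg", "address"},
    outputs={"shipping_cost"},
)
assert svc.context_arcs() == {"weight_kg", "address", "shipping_cost"}
```

The data-consistency and location constraints discussed above would then be checks applied to each arc before two services are linked.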
The values in the table indicate the candidate services' satisfaction rate for each aspect in the service requirement specification. From the above result, it is easy to see that Service1 has the highest overall satisfaction score, which represents the shortest semantic distance to the requirement, so Service1 will be the chosen service.

Fensel et al. [9] proposed a web service modelling framework (WSMF) to address how a service should be described. Roman et al. [12] proposed a web service modelling ontology (WSMO) based on the WSMF to semantically describe the aspects proposed in WSMF for describing services. Martin et al. [13] proposed a semantic mark-up language, OWL-S, to describe web services. This language describes a web service through three components: service profile, service process model, and service grounding. Medjahed et al. [14] provided a whole solution from semantic service description and a service composability model through to automatic composition of web services. However, as we discussed previously, these research efforts did not consider enough contextual aspects for describing a service, which is important in the process of service discovery and composition.

Defining a contextual based semantic model to analyze and describe the structure and characteristics of services is very important in the research areas of semantic web services, grid computing, and web services. In this paper, we discussed how web services can be composed from a conceptual perspective, proposed a contextual based semantic description model for automatic web service discovery and composition, and introduced a semantic matchmaking algorithm based on the model to accurately identify a service or a sequence of services. This is particularly significant in semantic based automatic service searches and matches as the number of services on the web is growing exponentially. Our next step is to develop an advanced quantitative analysis algorithm for semantic matching and to evaluate its performance. In order to achieve automatic service discovery and composition, we also need to refine the contextual based semantic model so that it describes services more fully.

References

1. Foster, I. and Kesselman, C., The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufmann, 1999. ISBN: 1-55860-475-8
2. Foster, I., Kesselman, C., and Tuecke, S., The Anatomy of the Grid: Enabling Scalable Virtual Organizations. International Journal of High Performance Computing Applications, 15 (3), 200-222, 2001
3. Foster, I., Kishimoto, H., Savva, A., Berry, D., Djaoui, A., Grimshaw, A., Horn, B., Maciel, F., Siebenlist, F., Subramaniam, R., Treadwell, J., and Von Reich, J., The Open Grid Services Architecture, Version 1.0, Global Grid Forum, 2005. http://www.gridforum.org/documents/GFD.30.pdf
4. Christensen, E., Curbera, F., Meredith, G., and Weerawarana, S., 2001. Web Services Description Language (WSDL) 1.1. http://www.w3.org/TR/2001/NOTE-wsdl-20010315
5. Song, W., 2003. Semantic Issues in the Grid Computing and Web Services (invited speech), in the Proceedings of the International Conference on Management of e-Commerce and e-Government, Nanchang, China
6. Song, W., Du, X., and Munro, M., A Contextual based Conceptual Model and Similarity Computation Method for Automation of Web Service Discovery and Composition, Department report, University of Durham, UK, 2006
7. Dey, A. K. and Abowd, G. D., Towards a Better Understanding of Context and Context Awareness. Presented at the CHI 2000 Workshop on the What, Who, Where, When, Why and How of Context-Awareness, April 1-6, 2000
8. Martin, D., Paolucci, M., McIlraith, S., Burstein, M., McDermott, D., McGuinness, D., Parsia, B., Payne, T., Sabou, M., Solanki, M., Srinivasan, N., and Sycara, K., 2004. Bringing Semantics to Web Services: The OWL-S Approach, presented at the First International Workshop on Semantic Web Services and Web Process Composition (SWSWPC), San Diego, California, USA
9. Fensel, D. and Bussler, C., The Web Service Modeling Framework WSMF. Electronic Commerce Research and Applications, Vol. 1, Issue 2, Elsevier Science B.V.
10. Lara, R., Lausen, H., Arroyo, S., Bruijn, J., and Fensel, D., Semantic Web Services: Description Requirements and Current Technologies, in the International Workshop on Electronic Commerce, Agents, and Semantic Web Services, September 2003
11. DAML-S: Semantic Markup for Web Services (The DAML Services Coalition), 2001. http://www.daml.org/services/daml-s/2001/10/daml-s.html
12. Roman, D., Keller, U., Lausen, H., Bruijn, J., Lara, R., Stollberg, M., Polleres, A., Feier, C., Bussler, C., and Fensel, D., Web Service Modelling Ontology (WSMO), WSMO Final Draft, April 13, 2005. http://www.w3.org/Submission/WSMO/
13. Martin, D., Burstein, M., Hobbs, J., Lassila, O., McDermott, D., McIlraith, S., Narayanan, S., Paolucci, M., Parsia, B., Payne, T., Sirin, E., Srinivasan, N., and Sycara, K., 2004. OWL-S: Semantic Mark-up for Web Services. http://www.daml.org/services/owl-s/1.1/overview/
14. Medjahed, B., Bouguettaya, A., and Elmagarmid, A. K., Composing Web Services on the Semantic Web. The VLDB Journal, 2003, Volume 12, pp. 333-351
15. Berners-Lee, T., Hendler, J., and Lassila, O., The Semantic Web. Scientific American, May 17, 2001, pp. 35-43
16. De Roure, D., Jennings, N. R., and Shadbolt, N. R., The Semantic Grid: Past, Present and Future. Proceedings of the IEEE, 2005, Volume 93, Issue 3, pp. 669-681
17. W3C Web Services Architecture. http://www.w3.org/TR/ws-arch/
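The selection step described above — picking the candidate with the highest overall satisfaction score, i.e. the shortest semantic distance to the requirement — can be sketched in a few lines. The aspect names and scores below are invented for illustration; the paper's own matchmaking algorithm is more elaborate.

```python
# Sketch: choose the candidate service whose per-aspect satisfaction
# rates give the highest overall score. Scores are illustrative only.

def overall_score(rates):
    """Average the per-aspect satisfaction rates (all aspects weighted equally)."""
    return sum(rates.values()) / len(rates)

candidates = {
    "Service1": {"input": 0.9, "output": 0.8, "context": 0.85},
    "Service2": {"input": 0.7, "output": 0.9, "context": 0.60},
}

best = max(candidates, key=lambda name: overall_score(candidates[name]))
assert best == "Service1"  # highest score = shortest semantic distance
```

A weighted variant would simply multiply each aspect's rate by an importance weight taken from the requirement specification before summing.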
digital publishers obtain some understanding of this process and the associated pitfalls and technological requirements.

1.1 A framework for annotation

In order to compare various types of annotation systems we suggest an informal framework that consists of the following basic components:

An annotation is some set of data elements that is added to an existing base or target that possesses structure. In order to create an annotation, some form of attachment point is used implicitly or explicitly. We shall use the term co-ordinate system for the mechanism for describing the attachment point. Some care is needed, during database design, in making sure that the co-ordinate system is durable. Moreover, one frequently finds several co-ordinate systems in simultaneous use. Understanding the mappings between the co-ordinate systems is seldom straightforward. Let us consider some examples (which are discussed in more detail later in this paper):

• Cartographic data. The use of multiple co-ordinate systems (such as longitude and latitude vs. a local grid) is commonplace in cartography, and mappings between such systems are well understood. The point of attachment to an image representation of a map is specified by such a co-ordinate system. However, recent map data now relies on some form of object-oriented representation of cartographic features, and attachment is, presumably, to some object identifier. Note that there is a subtlety about what is being annotated; moreover, the correspondence between the two co-ordinate systems is no longer a simple 1-1 mapping.

• Molecular biology. This has moved in the reverse direction. The original co-ordinate systems were the gene identifiers used in the various databases. Only recently have the linear (chromosome, offset) co-ordinates determined by genetic sequencing been discovered and, once again, the mapping is not 1-1.

The various aspects of annotation are illustrated in Figure 1. The genome column summarises two of the co-ordinate systems in use in that domain. The HBP column shows that while the co-ordinate system may be simple, the attachment process is not. AstroDAS is interesting in this context because its purpose is precisely to reconcile the co-ordinate systems in a variety of databases. It is a database in which the annotations of the objects are the co-ordinates (typically relational keys) of objects in other catalogues. The intention of this annotation is to support the more general forms of cross-database annotation, which we describe below.

In the remainder of the report, we present a series of examples of scientific annotation in Section 2, and refer to our basic framework to help compare them. In Section 3 we discuss some key concepts of database annotation and suggest them as topics for further research.

2 Examples of scientific annotation systems

2.1 UniProt database annotation

Perhaps the most well-known examples of database annotation are to be found in bioinformatics and in the design of information systems like UniProt, which consists of an assemblage of databases including Swiss-Prot, an annotation database produced by specialist curators, and TrEMBL, which provides automated annotations for proteins (http://www.ebi.uniprot.org/about/background.shtml). These databases were created to disseminate protein sequence data and associated analyses. They were designed specifically to receive annotation.

The curators of Swiss-Prot [BA00] are quite specific about which fields in the database they regard as annotation and which are "core data". In Figure 2, the boxed areas are those classified as annotation; the publication and taxonomic entries, perhaps because the Swiss-Prot organisation was not responsible for their creation, are regarded as core data. This database illustrates a number of interesting aspects of annotation which we discuss further in Section 3.1. Note that several of the fields have "pointers" to entries in other databases, which provide mappings between co-ordinate systems.
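The framework's basic components — an annotation, a structured target, and an attachment point expressed in some co-ordinate system — can be made concrete in a few lines. The representation below is our own minimal sketch, not a schema used by any of the systems surveyed, and the field names are invented.

```python
# Sketch: an annotation attached to a structured target via an explicit
# co-ordinate system (here, a field name plus optional segment offsets).
# Illustrative only; not a schema from any surveyed system.

from dataclasses import dataclass

@dataclass(frozen=True)
class Coordinate:
    field: str       # which part of the target, e.g. a line code such as "SQ"
    segment: tuple   # fine-grain (start, end) offsets within that field

@dataclass
class Annotation:
    target_id: str   # identifier of the annotated entry
    coord: Coordinate  # attachment point in the co-ordinate system
    body: str        # the added data elements

note = Annotation(
    target_id="P01234",
    coord=Coordinate(field="SQ", segment=(12, 40)),
    body="predicted signal peptide",
)
assert note.coord.segment == (12, 40)
```

The durability concern raised above shows up immediately in this model: if the target's `segment` offsets shift when the underlying data is edited, every stored `Coordinate` silently points at the wrong place.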
Figure 2. An entry from UniProt

2.3 Annotating biomedical images

Some systems are designed to create web-accessible collections of annotated biomedical images. Gertz et al. [GSG+02] develop a graph model of annotations for use in the Human Brain Project (HBP): annotation nodes serve to connect specific image region-of-interest nodes with concept nodes from a controlled vocabulary. Graph edges define the relationship between nodes; one such relationship is "annotation of". They also develop a framework for querying annotation graphs based on path expressions and predicates, which they test in a prototype system. Column (1) of Figure 1 refers to HBP image annotation.

The Edinburgh Mouse Atlas Project (EMAP) involves two types of annotations for images. EMAP provides annotations that make connections between both a standard anatomical nomenclature and the results of tissue-level gene expression experiments with regions of 3D mouse embryo tissue images and 2D tissue slices. The project provides a suite of tools, including an interactive website (http://genex.hgu.mrc.ac.uk/intro.html). The tools allow one to browse the text nomenclature and make queries about gene expressions that return sets of images or a list of genes expressed for a given embryo image. Another way to query for gene expressions is to interactively select an area of a 2D image. EMAP involves centralised editorial control and curation; an editorial review board decides whether to accept gene expression experiment results, and regions of images are manually coloured by an expert.

2.4 AstroDAS: Annotating astronomy catalogues

Over the past several decades, databases or catalogues of celestial object observations, recorded by disparate telescopes and other instruments over various time periods, have migrated online. Central to the astronomical community's concept of a global "Virtual Observatory" is the ability to identify records in these different catalogues as referring to the same celestial object. Because the recorded location of a celestial object may vary slightly from catalogue to catalogue due to unavoidable measurement error at the instrument level, the general catalogue matching problem cannot be solved by spatial proximity alone, and some researchers develop their own complex algorithms for matching celestial objects across different catalogues. To provide astronomers with the ability to share their assertions about matching celestial objects directly with their colleagues, we have created prototypes for AstroDAS, a distributed annotation system partly inspired by BioDAS [BMPR06]. AstroDAS features an annotation database with a web service interface to store and query annotation, and resolves queries on astronomy catalogues using mapping tables that are dynamically constructed from annotations of celestial object matches. The AstroDAS prototypes complement the existing OpenSkyQuery system for distributed catalogue queries.

The ultimate aim of AstroDAS is similar to the goal of the earlier BioDAS: to record and share scientific assertions with a wider community. Whereas biologists use annotation in BioDAS to interpret the DNA sequences in a genome, however, astronomers seek to share the mapping of entities derived from their research across established scientific databases. Specifically, astronomers want to be able to share their identification of matching celestial objects within the existing federation of disparate catalogues.

3 Concepts and research topics in database annotation

One of the most useful effects a report such as this could have would be to help the designers of a new database, schema or data format to prepare their data for annotation. Of course, some databases, especially those in bioinformatics, are designed to receive annotation. But we have seen many examples of the need to accommodate ad hoc annotation and the need for ad hoc annotation to migrate to a more systematic form of annotation, that is, to become part of the regular database structure, which we discuss further in the following sections. We also discuss annotation queries, research topics in relational database annotation, and annotating annotations.

3.1 Annotation and the evolution of database structure

We return to the Swiss-Prot example of annotation within databases: Figure 2 shows a single entry in Swiss-Prot. It is debatable what one should classify as data, metadata or annotation. However, from a database perspective, the entry illustrates several interesting points, including the evolution of structure. The structure of the entry is an old, purpose-built file format with a two-letter code giving the meaning of each line of text. Notice that the comment lines (CC) have become structured with entries of the form -!- FUNCTION: ... which provide a degree of machine-readability of the comment text. These entries were presumably not anticipated by the designers of the original format, and the alternative of specifying some further two-letter codes for these entries was presumably ruled out as it would confuse existing software designed to parse the format. There are now 26 such subfields, one of which has additional machine-readable internal structure. The important observation here is that annotation plays an important part in the evolution of both the form and content of data. What was once unknown or regarded as ad hoc annotation has become part of the database structure. It is almost certainly the case that the curators of Swiss-Prot now make extensive use of database technology and that what is exported in Figure 2 is a "rendering" or database view of the internal data. While database management systems provide some help with structural evolution, it is always problematic. In this respect, databases designed with conventional (relational or object-based) structuring tools offer better prospects for extensibility than XML structured with DTDs or XML-Schema, which are, at heart, designed to express the serialisation of data.

3.2 Location and attachment of annotations

The annotations in the CC fields in Figure 2 appear to refer to the entire Swiss-Prot entry. Reading down, one finds feature table (FT) lines that contain "fine-grain" annotation about different segments of the sequence data. There is a subtle difference between the two forms of annotation. The CC annotations are understood to refer to the whole entry because they occur inside that entry. The FT annotations are outside the structure being annotated and therefore require extra information, in this case a pair of numbers specifying a segment, to describe their attachment to the data. Notice that this assumes a stable co-ordinate system. If the sequence data were to be updated with deletions or insertions, attachment of annotations would be problematic.

Consider another, fanciful, example of a fine-grain attachment in which one wants to say something like "The third author of the first citation also publishes under the alias John Doe". One could imagine inserting this text in the text of the Reference Author (RA) line, but this is likely to interfere with any software that parses this line. Alternatively, one could place it externally in some other field of the entry. Once again, this assumes that the co-ordinate system is stable. For example, it assumes that the numbering of the citations does not change when the database is updated.

Another issue is the attachment of an annotation to several entries/objects in any of the
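The mapping-table idea behind AstroDAS — turning shared match assertions into a table that resolves cross-catalogue queries — can be sketched briefly. The catalogue names and record keys below are invented for illustration; the real prototypes work through a web service interface against an annotation database.

```python
# Sketch: build a mapping table from match annotations and use it to
# resolve a cross-catalogue lookup. All identifiers are invented.

# Each annotation asserts that two catalogue records denote the same object.
annotations = [
    {"cat_a": "SDSS", "key_a": "J1234+56", "cat_b": "2MASS", "key_b": "M998877"},
    {"cat_a": "SDSS", "key_a": "J2222+11", "cat_b": "2MASS", "key_b": "M112233"},
]

# Dynamically construct the mapping table (symmetric, both directions).
mapping = {}
for a in annotations:
    mapping[(a["cat_a"], a["key_a"])] = (a["cat_b"], a["key_b"])
    mapping[(a["cat_b"], a["key_b"])] = (a["cat_a"], a["key_a"])

# Resolve: which 2MASS record matches SDSS object J1234+56?
assert mapping[("SDSS", "J1234+56")] == ("2MASS", "M998877")
```

Because the table is rebuilt from the annotation store, a newly shared match assertion becomes visible to every subsequent cross-catalogue query without touching the underlying catalogues.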
Supporting annotation queries for case (1) is more likely to be straightforward than case (2). For the second case one needs to know where the annotation is attached to the base data and perhaps why it is attached. How this is captured in the database and expressed in the query is an open question.

3.4 Annotating relational databases: recent work

Relational databases have had an extraordinary history of commercial success and fertile research. It is not surprising, therefore, that database researchers would first attempt to understand annotation in the context of relational databases. One of the immediate challenges here is to understand how annotations should propagate through queries. If one thinks of annotation as some form of secondary mark-up on a table, how is that mark-up transferred to the result of a query? If, for example, an annotation calls into question the veracity of some value stored in the database, one would like this information to be available to anyone who sees the database through a query or view.

Equally important is the issue of backwards propagation of annotations. Consider, as a loose analogy, the BioDAS system, based on the DAS system discussed in Section 2.2. The users see and annotate the data using some GUI, which we can loosely identify with a database view. The annotation is transferred backwards from the GUI to an annotation on some underlying data source and is then propagated forwards to other users of the same data. Following this correspondence, the question is how an annotation propagates through a query both backwards and forwards.

It is easy to write down the obvious set of rules for the propagation of annotation through the operations of the relational algebra. However, because of the nature of relational algebra, inverting these rules is non-deterministic. An annotation seen in the output could have come from more than one place in the input. To take one example: suppose one places an annotation on some value in the output of a query Q. Of all the possible annotations on the source data (the tables on which Q operates), is there one which causes the desired annotation – and only that annotation – to appear in the output of Q? The complexity of this and several related annotation problems has been studied in [BKT02], which also shows the connection with the view deletion problem.

In [BCTV04] a practical approach is taken to annotation, in which an extension of SQL is developed that allows for explicit control over the propagation of annotations. Consider the following simple join query:

SELECT R.A, R.B, S.C
FROM R, S
WHERE R.B = S.B

Suppose the source is annotated. Presumably an annotation on a B value of R should propagate to the B field of the output, because R.B is given as the output. But should an annotation on a B field of S also be propagated to the B field of the output? The structure of the SQL indicates that it should not, but the query obtained by replacing the first line by

SELECT R.A, S.B, S.C

is equivalent, so maybe the answer should be yes. The idea in [BCTV04] is to allow the user to control the flow of annotation by adding some further propagation instructions to the SQL query. The paper shows how to compute the transfer of annotations for the extended version of SQL and demonstrates that, for a range of realistic queries, the computation can be carried out with reasonable overhead.

The work we have described so far has been limited to annotating individual values in a table. Recently Geerts et al. [GKM06] have taken a more sophisticated approach to annotating relational data. What they point out is that it is common to want to annotate associations between values in a tuple. For example, in the query above one might want to annotate the A and B fields in the output with information that they came from input table R, and the B and C fields with information that they came from table S. To this end they introduce the concept of a block – a set of fields in a tuple to which one attaches an annotation and a colour, which is essentially the content or some property of the annotation. They also investigate both the theoretical aspects and the overhead needed to implement the system. However, as we have indicated in Section 3.2, attachment may be even more complex, requiring associations between data and schema, for example.
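The propagation question raised by that join can be made concrete with a toy interpreter: values carry a set of annotations, and the default rule propagates an annotation only from the column actually named in the SELECT list. The relation contents and the annotation text are invented, and [BCTV04]'s pSQL language is far richer than this sketch.

```python
# Sketch: default annotation propagation through the join
#   SELECT R.A, R.B, S.C FROM R, S WHERE R.B = S.B
# Each value is a (value, annotation-set) pair; only the column named in
# the SELECT list contributes its annotations to the output. Illustrative only.

R = [{"A": ("a1", set()), "B": ("b1", {"suspect value"})}]   # R.B is annotated
S = [{"B": ("b1", set()), "C": ("c1", set())}]               # S.B is not

def join_select():
    out = []
    for r in R:
        for s in S:
            if r["B"][0] == s["B"][0]:          # WHERE R.B = S.B
                out.append({"A": r["A"],
                            "B": r["B"],        # propagate from R.B only
                            "C": s["C"]})
    return out

result = join_select()
assert result[0]["B"][1] == {"suspect value"}
# Writing SELECT R.A, S.B, S.C would, under this same default rule, drop
# the annotation -- even though the two queries return equal values.
```

This is exactly the ambiguity the propagation instructions of [BCTV04] let the query author resolve explicitly.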
3.5 Provenance and Annotation

The topic of data provenance is of growing interest and deserves separate treatment. However, there are close connections with annotation. One view of the connection is that provenance – information about the origins of a piece of data – is simply another form of annotation that should be placed on data. It is certainly true that there are many cases where provenance information is added after the creation of data. However, it would be much better if provenance were captured automatically, in such a way that it becomes an intrinsic part of the data.

A more interesting connection is to be found in [BCTV04] and related papers. Much data in scientific databases (e.g. the "core data" in Figure 2) has been extracted from other databases. If the data in the source database has been annotated, surely the annotations should be carried into the receiving database. If the receiving database is a simple view of the source data, then the mechanisms described in [BCTV04], or some generalisation of them, should describe both provenance and how annotations are to be copied. However, manually curated databases are more complex than views, and in this case understanding the movement of annotations is still an open problem.

4 Conclusions

Although we have not done enough work to substantiate this claim, we believe it likely that most of the 858 molecular biology databases listed in [Gal06] involve some form of annotation. Moreover, as we have tried to indicate, annotation is of growing importance in other areas of scientific research. The success of new databases will depend greatly on the degree to which they support annotation. In this respect, the following points are crucial both in database design and in systems architecture:

• the provision of a co-ordinate system to support the attachment of annotations,

• the linkage or mapping of that co-ordinate system to other, existing, co-ordinate systems, and

• the need for extensibility in databases that are designed to receive annotations.

In each of these areas, there is further research needed. Moreover, annotations often express complex relationships between schema and data. To bring this into a uniform framework is a challenge for both database and ontology research.

5 Acknowledgements

We would like to thank Mark Steedman, Robert Mann, Amos Storkey and Bonnie Webber, as well as the members of the Database Group in the School of Informatics. This survey has been supported in part by the Digital Curation Centre, which is funded by the EPSRC e-Science Core Programme and by the JISC.

References

[ABW+04] R. Apweiler, A. Bairoch, C.H. Wu, W.C. Barker, B. Boeckmann, S. Ferro, E. Gasteiger, H. Huang, R. Lopez, M. Magrane, M.J. Martin, D.A. Natale, C. O'Donovan, N. Redaschi, and L.S. Yeh. UniProt: the universal protein knowledgebase. Nucleic Acids Research, 32:D115–D119, 2004.

[BA00] A. Bairoch and R. Apweiler. The SWISS-PROT protein sequence database and its supplement TrEMBL. Nucleic Acids Research, 28:45–48, 2000.

[BCTV04] Deepavali Bhagwat, Laura Chiticariu, Wang-Chiew Tan, and Gaurav Vijayvargiya. An annotation management system for relational databases. In Proceedings of the Thirtieth International Conference on Very Large Data Bases, pages 912–923, Toronto, Canada, 2004. Morgan Kaufmann.

[BDF+02] Peter Buneman, Susan Davidson, Wenfei Fan, Carmem Hara, and Wang-Chiew Tan. Keys for XML. Computer Networks, 39(5):473–487, August 2002.

[BKT02] Peter Buneman, Sanjeev Khanna, and Wang-Chiew Tan. On the Propagation of Deletions and Annotations through Views. In Proceedings of the 21st ACM Symposium on Principles of Database Systems, Madison, Wisconsin, 2002.
Justin Bradley, Christopher Brown, Bryan Carpenter, Victor Chang, Jodi Crisp, Stephen
Crouch, David de Roure, Steven Newhouse, Gary Li, Juri Papay, Claire Walker, Aaron
Wookey
Abstract
This paper describes the work carried out at the Open Middleware Infrastructure Institute (OMII) and the
key elements of the OMII software distribution that have been developed within the community through
our support of the open source development process by commissioning software. The main objective of
the OMII is to preserve and consolidate the achievements of the UK e-Science Programme by collecting,
maintaining and improving the software modules that form the key components of a generic Grid
middleware. Recently, the activity at Southampton has been extended beyond 2009 through a new
project, OMII-UK, which forms a partnership that now includes the OGSA-DAI activities at Edinburgh
and the myGrid project at Manchester.
(Diagram: the software engineering process — a Repository of Sources subjected to Quality Testing against Use Cases; Review for Evaluation of Bugs and Enhancement Requests; Risk and Prioritisation via the Helpdesk and a Priority List; Development Teams that Fix Bugs; and Distribution to the Open Source Community, which reports User Issues.)
The software engineering process comprises three
interlinked loops.

Loop 1 is represented by the Testing loop. The
Testing loop evaluates the incoming software against
test cases generated from their functional
specifications and discovers bugs, defined as
deviations from the functional specification, in the
implementation or the test suite. This is the core
‘quality improvement’ cycle of the software
engineering process, involving detailed checking of
the source code, with a daily-build regression
testing process that uses a constantly evolving set
of regression tests. These tests are composed of
existing unit tests and integration tests of the
deployed system. This activity generates new bugs,
whose existence is recorded in the bug tracking
system for later review.

A key pre-requisite to building regression tests is
to have a clear functional specification describing
the code, the standards to be complied with and the
environment that the software has to function in
(which changes constantly), and associated
documentation describing the operation of the
software. The functional specification provides a
starting point for code review and the generation of
test cases (see Figure 2). As the code and
functional specification (and user documentation)
are verified in isolation, the OMII gains twice the
benefits. The code is independently checked to prove
it does what it is supposed to; likewise, test cases
derived from the functional specification
independently check that same functionality. These
two tasks are carried out by separate teams who
deliberately do not collaborate until later on in
the process.

[Figure 2 – The software enhancement process:
sources, together with the functional specification
(standards, environment, code, docs), feed internal
code cleaning and test-case generation; a testing
failure loops back, while a pass leads to release.]
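The daily-build regression cycle described above can be illustrated with a minimal sketch. The component and its specification clause are invented for this purpose (this is not OMII code): tests are derived from the functional specification rather than the implementation, run as a suite, and any failure would be logged as a bug, i.e. a deviation from the specification.

```python
import unittest

def stage_file(name):
    """Hypothetical component under test. Its (invented) functional
    specification says: names are returned with a 'staged/' prefix,
    and empty names must be rejected."""
    if not name:
        raise ValueError("file name must be non-empty")
    return "staged/" + name

class StageFileRegressionTests(unittest.TestCase):
    """Test cases derived from the functional specification,
    independently of the implementation."""

    def test_prefix_applied(self):
        self.assertEqual(stage_file("job.dat"), "staged/job.dat")

    def test_empty_name_rejected(self):
        with self.assertRaises(ValueError):
            stage_file("")

# Run the suite the way a daily build would, keeping the outcome so a
# failure could be recorded in a bug-tracking system.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(StageFileRegressionTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

In the OMII process these tests would be written by a team separate from the implementers, and only merged with the implementation later.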
For sources submitted to the OMII, some test cases
may already exist. If not, or if test coverage is
deemed inadequate, new test cases will be required
from the external development team or developed
internally. The types and quality of test cases are
also reviewed and new test cases developed as a
result of bug discovery. The high-level test cases
and testing tasks are added at an early stage to the
‘High-Level Test Plan’ to provide feedback into the
design process.

Testing forms a very large part of the OMII quality
function and is carried out on all code with the aim
of continuously improving the quality of the
product. Several types of testing already happen at
the OMII and the range and scope of tests (e.g.
usability, standards compliance, etc.) is always
being expanded.

Loop 2 characterises the development process. Within
this process the identified issues are assessed for
risk and prioritised for fixing in the next release
cycle by the many development teams working within
the OMII organisation. Bugs are discovered as a
result of internal testing, internal evaluation, the
work of the development teams and the external use
of the OMII distribution. In collaboration with the
appropriate development team, OMII will assess the
risk and impact of fixing a bug. Prioritisation
takes place at the start of the development cycle
and on a continuous basis within the OMII’s
bug-tracking system. As the resources required to
fix all bugs will inevitably exceed those available,
maximum benefit must be obtained from the available
resources in order to ship code of the very highest
quality within any given timeframe. When a fix has
been made, it will be reviewed, normally using a
peer-review approach. Once the fix is agreed, the
code will be checked into the code repository under
the specified bug id prior to re-testing.
Cross-fertilisation of developer skills is ensured
by encouraging developers to fix code in areas of
the code base that they are less familiar with. This
prevents a single developer being the only expert in
a particular section of code. The OMII development
team at Southampton performs weekly ‘bug-scrubs’
which prioritise the bugs for fixing, taking into
account the risks associated with each fix. The
activity of fixing bugs generates new versions of
the source modules, and is another opportunity to
check the source directly rather than just by
testing.

Loop 3 describes the public use of the software. The
contributed sources, having reached a satisfactory
quality, are integrated into the public distribution
and released to the community. Issues are identified
as bugs that require fixing, or feature requests
requiring enhancements to the functional
specification. Problems raised by external users are
initially entered in the OMII Helpdesk where,
depending on their nature, they may go on to become
a bug in the OMII bug-tracking system or be
considered as proposals for enhancements for the
next release cycle.

3 OMII Software Architecture

The key components of the OMII software architecture
are presented in Figure 3. Some of these components
are already part of the OMII distribution; others we
expect to integrate over the next 6–12 months.

The OMII software distribution is based around a
lightweight client and server distribution. Both
contain standalone tools and a web service based
infrastructure currently based upon the Apache Axis
toolkit. We expect to provide support for the
deployment of web portals through the provision of a
JSR 168 portlet compliant hosting environment. We
will continue to source application and
infrastructure services from the community by
commissioning open-source software development.

Much of the commissioned open-source software
development activity supports ongoing or emerging
standards activity – either in the W3C (World Wide
Web Consortium), OASIS (Organisation for the
Advancement of Structured Information Standards) or
the Open Grid Forum (previously the Global Grid
Forum). We see this as a vital part of our community
support. We expect to source and integrate
additional software components through partnerships
with other Grid middleware providers, such as gLite
and GT4.
[Figure 3 – The OMII Software Architecture:
management portlets and domain-specific portlets in
a JSR-168 portlet environment; application services
(workflow, jobs) and infrastructure services
(account, authorisation) secured with WS-Security,
alongside static webpages; all hosted on AXIS over
TOMCAT.]
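The OMII stack itself is Java-based (Axis services hosted on Tomcat, as in Figure 3). Purely to illustrate the XML request/response shape of such a web-service-based infrastructure, the toy sketch below stands up a local XML-over-HTTP "job service" in Python; the element names and the service are invented, and real OMII services would add WSDL, WS-Security and so on.

```python
import http.server
import threading
import urllib.request
import xml.etree.ElementTree as ET

class JobService(http.server.BaseHTTPRequestHandler):
    """Toy stand-in for a web-service endpoint (not OMII code):
    accepts a small XML request and returns an XML response."""
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        req = ET.fromstring(self.rfile.read(length))
        reply = ("<jobResponse><state>submitted</state><name>%s</name>"
                 "</jobResponse>") % req.findtext("name")
        body = reply.encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/xml")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

# Serve on an ephemeral port in a background thread.
server = http.server.HTTPServer(("127.0.0.1", 0), JobService)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: POST an XML job request and parse the XML reply.
url = "http://127.0.0.1:%d/" % server.server_address[1]
request = urllib.request.Request(
    url,
    data=b"<jobRequest><name>demo</name></jobRequest>",
    headers={"Content-Type": "text/xml"})
with urllib.request.urlopen(request) as resp:
    state = ET.fromstring(resp.read()).findtext("state")
server.shutdown()
```

The point of the sketch is only the contract: clients and services exchange self-describing XML over HTTP, which is what lets standalone tools, portlets and infrastructure services interoperate in the architecture above.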
Abstract
The problem of enabling scientist users to submit jobs to grid computing environments
will eventually limit the usability of grids. The eMinerals project has tackled this
problem by developing the “my_condor_submit” (MCS) tool, which provides a simple
scriptable interface to Globus, a flexible interaction with the Storage Resource Broker,
metascheduling with load balancing within a grid environment, and automatic metadata
harvesting. This paper provides an overview of MCS together with some use cases. We
also describe the use of MCS within parameter-sweep studies.
[Figure 1. The eMinerals minigrid, showing the range
of computational resources and the integrated data
management infrastructure based on the use of the
Storage Resource Broker. The resources shown include
the Lake@UCL 930-node Condor pool, Condor pools and
PBS clusters at Bath (Lake@Bath), Daresbury and
Reading, an IBM Load Leveller machine, Pond, and
Lagoon (a 4-node Xserve cluster), fronted by Globus
gatekeepers (GT 2.2, GT 2.4.3 and GT 3.2.1) with PBS
and Condor jobmanagers, together with external grid
resources (e.g. NGS, campus grids and national HPC
facilities). The simulation scientist reaches the
compute resources via the my_condor_submit and
gridsam job submission tools and the eMinerals
compute portal, and the SRB via the Scommands and
the MySRB and TobysSRB web tools; metadata
manipulation, searching and data discovery are
provided by the eMinerals Metadata Manager. The data
infrastructure comprises distributed SRB vaults, a
central SRB MCAT and metadata database (with Oracle
client), and the RCommand server.]
sweeps which use MCS for their submission to address
this need.

This paper describes how the eMinerals project
handles grid computing job submission in a way that
hides the underlying middleware from the users, and
makes grid computing genuinely usable for
scientists.

The eMinerals minigrid 2006

The current status of the eMinerals minigrid is
illustrated in Figure 1. At the heart are two sets
of components: a heterogeneous collection of compute
resources, including clusters and Condor pools, and
the data architecture based on the SRB, with a
central metadata catalogue installed onto a central
application server and distributed data vaults.
Subsequent to our previous descriptions of the
minigrid [4,5] are the following developments:
‣ Links to campus grids, e.g. CamGrid [6]. CamGrid
is a heterogeneous Condor pool made up of several
distributed and individually-maintained Condor
pools which are integrated into a single grid
through the use of Condor flocking. These
resources are linked to the eMinerals minigrid by
installing a Globus gatekeeper with a Condor
jobmanager to allow job submission in the same
manner as for other resources;
‣ Links to other grid resources such as the National
Grid Service (NGS) and NW-grid [7] with seamless
access, provided they have Globus gatekeepers and
the possibility to install SRB and metadata tools;
‣ Creation of a metadata database and tools [3].
These enable automated metadata capture and
storage so that simulation data can be archived in
a manner that allows for much simpler retrieval at
a later date. It also greatly facilitates
collaborative working within a distributed virtual
organisation such as eMinerals [8].

my_condor_submit 2006

MCS consists of a fairly complex Perl program and a
central database, and is used with a simple input
file with slight extensions over a standard Condor
submission script. The primary purpose of MCS is to
submit jobs to a grid infrastructure with data
archiving and management handled by the SRB. The use
of the SRB in our context serves two purposes.
First, it is a convenient way to enable transfer of
data between the user and a grid infrastructure,
bypassing some of the problems associated with
retrieval of multiple files whose names may not be
known beforehand using, for example, Condor and
gridftp methods. Second, the use of the SRB enables
users to archive all files associated with a study
in a way that facilitates good data management and
enables collaborative access.

We have recently completely rewritten MCS. The
original version [4,5] was written as a quick
development to jumpstart access to the
eMinerals minigrid for the project scientists. It
turned out that MCS was more successful than
anticipated, mainly because it matched users’
requirements and way of working. Because the
original version was written quickly, it was
increasingly difficult to upgrade and integrate new
developments. Thus a complete rewrite proved to be
essential to make future developments easy to
implement.

The current version of MCS (version 1.2 as of July
2006) has the following features that extend
versions previously described:
1. access to multiple collections (directories) on
the SRB, since typically executables, data files
and standard input files (such as pseudopotential
files for quantum mechanical calculations) will
be stored in different collections;
2. generalised support for job submission to
external grid infrastructures, including campus
grids and the NGS;
3. metascheduling with load balancing across all
minigrid and external resources;
4. new metadata and XML tools to automatically
obtain metadata from the output files and job
submission parameters.
In the rest of this paper we will describe some of
these in more detail.

MCS and metadata

We have been systematically enabling the simulation
codes used within the eMinerals project to write
output and configuration files in XML. Specifically,
we use the Chemical Markup Language (CML) [9]. One
author (TOHW) has developed a Fortran library,
‘FoX’, to facilitate writing CML files [9]. Another
of the authors (PAC) has developed a library, called
‘AgentX’, to facilitate general reading of XML files
into Fortran using logical mappings rather than a
concrete document structure [10].

Our output XML files broadly have the data separated
into three components: metadata, parameter and
property. The metadata component consists of general
information about the job, such as program name,
version number, etc. The parameter components are
mainly reflections of the input parameters that
control, and subsequently identify, the particular
simulation. Examples range from the interesting
parameters, such as temperature and pressure, to the
more mundane but nevertheless important control
parameters, such as various cut-offs. The property
components are the output data from the simulation.
These lists are vast, and include step-by-step data
as well as final averages or final values. Typically
for metadata collection we need to retrieve some
data from each of these three components. It is
worth remarking with regard to the property metadata
that we blur the distinction between data and
metadata. This is illustrated by a simple example.
In a study of the binding energy of a family of
molecules (such as the dioxin family, where members
of the family differ in the number and location of
chlorine or hydrogen atoms), we store the data on
final energy as metadata. This allows us to use this
energy for our metadata search tools; an example
would be to search through a study of all dioxin
molecules for the molecule with the lowest binding
energy.

In MCS, metadata is extracted from the XML data
documents using the AgentX library. AgentX
implements a simple API that can be used by other
applications to find data represented in a document
according to its context. The context is specified
through a series of queries based on terms specified
in an ontology. These terms relate to classes of
entities of interest (for example ‘Atom’, ‘Crystal’
and ‘Molecule’) and their properties (for example
‘zCoordinate’). This approach requires a set of
mappings relating these terms to fragment
identifiers. It is these identifiers that are
evaluated to locate parts of documents representing
data with a well defined context. In this way,
AgentX hides the details of a particular data format
from its users, so that the complexities of dealing
with XML and the necessity of understanding the
precise details of a particular data model are
removed. The user is not required to understand the
details of the ontology or mappings, but must have
an understanding of the terms used.

Once data have been located using AgentX, MCS
associates each data item with a term (such as
‘FinalEnergy’) which is used to provide the context
of the data. The data and term association are then
stored in the project metadata database, making use
of the RCommands [3].

We provide an example of extracting the final energy
from a quantum mechanical energy relaxation using
the SIESTA code [11]. In this work we use final
energy as a metadata item because it is a useful
quantity for data organisation and for search tools.
The call to AgentX in MCS follows an XPath-like
syntax:

AgentX = finalEnergy, chlorobenzene.xml:/Module[last]/PropertyList[title = 'Final Energy']/Property[dictRef = 'siesta:Etot']
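What such a directive resolves to can be approximated with ElementTree's limited XPath support. The XML below is an invented stand-in for a CML-style output document (real files are produced by SIESTA via FoX and are far richer), and this sketch is plain Python, not the AgentX library or its ontology mappings:

```python
import xml.etree.ElementTree as ET

# Invented stand-in for a CML-style output document.
doc = """
<cml>
  <module title="initialisation"/>
  <module title="finalisation">
    <propertyList title="Final Energy">
      <property dictRef="siesta:Etot">
        <scalar units="eV">-1234.56</scalar>
      </property>
    </propertyList>
  </module>
</cml>
"""

root = ET.fromstring(doc)
# The last module holds the final state of the system; within it,
# select the property whose dictRef is 'siesta:Etot' inside the
# propertyList titled 'Final Energy'.
last_module = root.findall("module")[-1]
scalar = last_module.find(
    "propertyList[@title='Final Energy']"
    "/property[@dictRef='siesta:Etot']/scalar")
final_energy = float(scalar.text)
```

The extracted value would then be stored, with its associated term, in the metadata database via the RCommands.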
This directive specifies that MCS is to extract the
metadata as a value from the file called
‘chlorobenzene.xml’, and it will associate this
value with the string called ‘finalEnergy’. The
AgentX call looks for this value within the last
module container, which by convention holds
properties representing the final state of the
system, then within a propertyList called
‘Final Energy’, and finally within a property value
defined by the dictionary reference ‘siesta:Etot’.

In addition to metadata harvested from output XML
documents, we also collect metadata related to the
user's submission environment. Examples include the
date of submission, the name of the machine
submitted from, and the user's username on the
submission machine. Metadata can also be collected
about the job's execution environment, including the
name of the machine on which the simulation was run,
the completion date of the run, and the user's
username on that machine. Users are also able to
store arbitrary strings of metadata using a simple
MCS command. All these types of metadata provide
useful adjuncts to the scientific metadata harvested
from the simulation output XML files.

MCS and metascheduling

One problem with the earlier versions of MCS was
that the user had to specify which compute resource
any simulation should be submitted to. This resulted
in many jobs being submitted to a few busy resources
within the eMinerals minigrid whilst other resources
were left idle. Because users are not allowed to log
in to resources, it was not possible for them to
check job queues; in any case, such an approach is
not a scalable solution to resource monitoring.

An intermediate solution to this problem was the
creation of a status monitoring web page¹, which
graphically shows the status of all available
minigrid resources. However, this solution still
requires user involvement to look at the page and
decide where to run. Moreover, the web page caches
the data for up to thirty minutes, meaning that it
would still be possible for users to submit to
resources that have suddenly become busy since the
last update.

¹ http://www.eminerals.org/gridStatus

To enable better load balancing across resources,
querying of resource utilisation must be built into
the submission mechanism, and must use up-to-date
rather than cached data. This type of
metascheduling, based on a round-robin algorithm,
has now been built into MCS. To use this
metascheduling facility, the user specifies the type
of machine they wish to use, using the keywords
‘performance’ to submit to clusters, or ‘throughput’
to submit to Condor pools. MCS then retrieves a list
of machines from a central database (which is
mirrored for reliability), and queries the machines
in turn checking for available processors. The
database contains a list of available resources
ranked in the order in which they were last
submitted to (the most recently used machine appears
last in the list) and other machine-specific
information, such as the path to the required queue
command and the machine's architecture.

Querying of the resource state is performed by
sending a Globus fork job to the resource that
executes the relevant local queue command (obtained
from the central database). Querying the status of a
PBS cluster will result in the use of the
‘pbsnodes’ command with suitable output parsing.
Querying one of the available Condor pools will use
a command written specifically for this purpose,
wrapping the standard ‘condor_status -totals’. The
need for a purpose-written command is that this
request must poll flocked Condor pools within a
campus grid, and the standard queue queries do not
return the required information in this case.

If all the resources are busy, MCS will inform the
user and queue the jobs using information in the
database to ensure an even balance in the various
resource queues.

As part of the metascheduling, MCS takes account of
the fact that not all binary executables will run on
all resources. Again, this information is extracted
from the database. Executables for various platforms
are held within the SRB, and MCS will automatically
select the appropriate SRB file download.

Example MCS files

Figures 2 and 3 show two examples of MCS input
files. For those familiar with Condor, the Condor-G
wrapping roots of MCS can be seen in the structure
of the file. Figure 2 is the simplest example. The
first three lines give information about the
executable (name and location within the SRB) and
the standard GlobusRSL command. The three lines with
names beginning with S provide the interaction with
the SRB. The Sdir line passes the name
[Figure 2 (excerpt): a simple MCS input file,
including the line
GlobusRSL = (stdin=andalusite.dat)(stdout=andalusite.out)
which specifies where the executable should get
stdin from and put stdout to.]
of the SRB collection containing the files, and the
“Sput *” and “Sget *” lines instruct MCS to download
and upload all files. The lines beginning with R
concern the interaction with the metadata database
through the RCommands. The identification number of
the relevant metadata dataset into which data
objects are to be stored is passed by the RDatasetID
parameter. The Rdesc command creates a data object
with the specified name. Its associated URL within
the SRB will be automatically created by MCS.

Figure 3 shows a more complex example, including the
components of the simpler script of Figure 2. This
script contains near the top parameters for the
metascheduling task, including a list of specified
resources to be used (preferredmachineList) and the
type of job (jobType). The script in Figure 3
involves creation of a metadata dataset. It also
contains commands to use AgentX to obtain metadata
from the XML file. In this case, the study concerns
an investigation of how the energy of a molecule
held over a mineral surface varies with its z
coordinate and the repeat distance in the z
direction (latticeVectorC).

Parameter sweep code

Many of the problems tackled within the eMinerals
project are combinatorial in nature. Thus a typical
study may involve running many similar simulations
concurrently (up to several hundreds), with each
simulation corresponding to a different parameter
value. Parameter sweep studies of this sort are well
suited to the resources available in the eMinerals
minigrid.

We have developed a set of script commands to make
the task of setting up and running many concurrent
jobs within a single study relatively easy. The user
supplies a template of the simulation input file and
information regarding the parameter values. The
script then creates a set of collections in the SRB,
one for each set of parameter values, containing the
completed input files, and a commensurate set of
directories on the user’s local computer containing
the generated MCS input files. The actual process of
submitting the jobs is performed by running another
script command, which then walks through each of the
locally stored MCS input files and submits them all
completely independently from each other.

Now having the tools for setting up, running, and
storing the resultant data files for large numbers
of jobs, the scientist then faces the problem of
extracting the key information from the resultant
deluge of data. Although the required information
will differ from one study to another, there are
certain commonalities in parameter sweep studies.
The task is made easier by our use of XML to
represent data. We have written a number of analysis
tools for gathering data from many XML files, using
AgentX, and generating XHTML files with
[Figure 3 (excerpt): a more complex MCS input file,
including the line
Transfer_Error = false
which leaves the code's stderr on the remote
machine, to be uploaded to the SRB at job end.]
embedded SVG plots of interesting data [8,12]. To
summarise, a full workflow can be constructed,
starting with the parameter sweep information,
creating multiple sets of input files, submitting
and executing all the jobs, returning the data, and
finally retrieving particular data of interest and
generating a single graph showing results from all
jobs.

Case examples

In this section we describe some of the eMinerals
science cases for which MCS has proved invaluable.
Some, but not all, of these examples use the
parameter sweep tools as well. All use the SRB,
metascheduling and metadata tools within MCS. The
key point from each example is that the process of
job submission, including parameter sweep studies,
is made a lot easier for the scientists.

Amorphous silica under pressure

Our work on modelling amorphous silica is described
in detail elsewhere [13]. This study is easily
broken down into many simulations, each with a
different pressure but otherwise identical
simulation parameters. The user provides a template
input file with all of the necessary input
parameters, and uses the parameter sweep tools to
configure the scripts to vary pressure between the
lower and upper values in the desired number of
steps (for example, values might vary from –5 GPa to
+5 GPa in steps of 0.1 GPa, producing 101
simulations). Once these values have been provided
to the system, the user must simply run the command
to set up the jobs and the data within the SRB, and
then to submit the simulations to the available
resources. MCS handles the complete job life cycle,
including resource scheduling.

On completion of the jobs, the relevant task was to
extract from all the resultant output XML files the
values of pressure and volume. We can also monitor
the performance and quality of the simulations by
extracting XML data step by step.

Quantum mechanical calculations of
molecular conformations

MCS was used extensively in the parameterisation of
jobs to study the molecular conformations of
polychlorinated biphenyl (PCB) molecules [14].
Preliminary calculations were required in order to
determine the minimum space required to enclose a
PCB molecule in vacuum, by sampling over different
repeat distances. The initial sweep involved a suite
of 121 SIESTA calculations, but it subsequently
transpired that more than one set of calculations
was required. MCS facilitated an efficient use of
the eMinerals compute resources, and led to the
speedy retrieval of results. MCS also harvested
metadata from the XML files, which was searchable by
collaborators on the project, facilitating data
sharing and results manipulation.

Pollutant arsenic ions within iron pyrite

In addition to the gains MCS offers to combinatorial
studies, its usefulness is not restricted to that
kind of study: it can be generalised to all studies
involving a large number of independent
calculations. We highlight here an example treated
in more detail elsewhere [15], namely the
environmental problem of the incorporation of
arsenic in FeS2 pyrite. This has been investigated
by considering four incorporation mechanisms in two
redox conditions through relatively simple chemical
reactions. This leads to eight reactions with four
to six components each. Therefore many independent
calculations were performed for testing and
obtaining the total energy of all reaction
components using quantum mechanics codes. For this
example, several submission procedures were needed,
with at least one for each reaction component.

Summary and discussion

The problem of accessing grid resources without
making either unreasonable demands of the scientist
user or forcing users to compromise what they can
achieve in their use of grid computing has been
solved by the MCS tool described in this paper. MCS
has undergone a complete reworking recently, and is
now capable of metascheduling and load balancing,
flexible interaction with the SRB, automatic
metadata harvesting, and interaction with any grid
infrastructure with a Globus gatekeeper (including
Condor pools as well as clusters).

The only disadvantage faced by users of MCS is the
need to have the Globus client tools and Condor
installed on their desktop computers. With recent
releases of Globus, this is less of a problem for
users of Linux and Mac OS X computers, but remains a
problem for users of the Windows operating system. A
coupled problem may arise from firewalls with
policies that restrict the opening of the necessary
ports. For users who face either of these problems,
our solution is to provide a group of submit
machines from which users can submit their jobs
using MCS. However, this is
less than ideal, so the next challenge we face is to
bring MCS closer to the desktop. We are working on
two solutions. One is the wrapping of MCS into a web
portal, and the other is to wrap MCS into a web
service.

Although MCS has been designed within the context of
the eMinerals project, efforts have been made to
enable more general usability. Thus, for example, it
enables access to other grid resources, such as the
NGS and NW-Grid [7], provided that they allow access
via a Globus gatekeeper. The key point about MCS is
that it provides integration with the SRB, and hence
the SRB Scommand client tools need to be installed.
To make use of the metadata tools (not a
requirement), the RCommands and AgentX tools also
need to be installed. These tools can be installed
within user filespaces rather than necessarily being
in public areas. More details on MCS, including the
manual, program download, database details and
parameter sweep tools, are available from reference
16.

Acknowledgements

We are grateful for funding from NERC (grant
reference numbers NER/T/S/2001/00855, NE/C515698/1
and NE/C515704/1).

References

1. MT Dove, TO White, RP Bruin, MG Tucker, M
Calleja, E Artacho, P Murray-Rust, RP Tyer, I
Todorov, RJ Allan, K Kleese van Dam, W Smith, C
Chapman, W Emmerich, A Marmier, SC Parker, GJ
Lewis, SM Hasan, A Thandavan, V Alexandrov, M
Blanchard, K Wright, CRA Catlow, Z Du, NH de
Leeuw, M Alfredsson, GD Price, J Brodholt.
eScience usability: the eMinerals experience.
Proceedings of All Hands 2005 (ISBN
1-904425-53-4), pp 30–37
2. RW Moore and C Baru. Virtualization services for
data grids. In Grid Computing: Making the Global
Infrastructure a Reality (eds F Berman, AJG Hey
and G Fox, John Wiley), Chapter 16 (2003)
3. RP Tyer, PA Couch, K Kleese van Dam, IT Todorov,
RP Bruin, TOH White, AM Walker, KF Austen, MT
Dove, MO Blanchard. Automatic metadata capture
and grid computing. Proceedings of All Hands 2006
4. M Calleja, L Blanshard, R Bruin, C Chapman, A
Thandavan, R Tyer, P Wilson, V Alexandrov, RJ
Allen, J Brodholt, MT Dove, W Emmerich, K Kleese
van Dam. Grid tool integration within the
eMinerals project. Proceedings of the UK e-Science
All Hands Meeting 2004 (ISBN 1-904425-21-6), pp
812–817
5. M Calleja, R Bruin, MG Tucker, MT Dove, R Tyer,
L Blanshard, K Kleese van Dam, RJ Allan, C
Chapman, W Emmerich, P Wilson, J Brodholt, A
Thandavan, VN Alexandrov. Collaborative grid
infrastructure for molecular simulations: The
eMinerals minigrid as a prototype integrated
compute and data grid. Mol. Simul. 31, 303–313,
2005
6. M Calleja, B Beckles, M Keegan, MA Hayes, A
Parker, MT Dove. CamGrid: Experiences in
constructing a university-wide, Condor-based grid
at the University of Cambridge. Proceedings of
the UK e-Science All Hands Meeting 2004 (ISBN
1-904425-21-6), pp 173–178
7. http://www.nw-grid.ac.uk/
8. MT Dove, E Artacho, TO White, RP Bruin, MG
Tucker, P Murray-Rust, RJ Allan, K Kleese van
Dam, W Smith, RP Tyer, I Todorov, W Emmerich, C
Chapman, SC Parker, A Marmier, V Alexandrov, GJ
Lewis, SM Hasan, A Thandavan, K Wright, CRA
Catlow, M Blanchard, NH de Leeuw, Z Du, GD Price,
J Brodholt, M Alfredsson. The eMinerals project:
developing the concept of the virtual organisation
to support collaborative work on molecular-scale
environmental simulations. Proceedings of All
Hands 2005 (ISBN 1-904425-53-4), pp 1058–1065
9. TOH White, P Murray-Rust, PA Couch, RP Tyer, RP
Bruin, MT Dove, IT Todorov, SC Parker.
Development and Use of CML in the eMinerals
project. Proceedings of All Hands 2006
10. PA Couch, P Sherwood, S Sufi, IT Todorov, RJ
Allan, PJ Knowles, RP Bruin, MT Dove, P
Murray-Rust. Towards data integration for
computational chemistry. Proceedings of All Hands
2005 (ISBN 1-904425-53-4), pp 426–432
11. JM Soler, E Artacho, JD Gale, A García, J
Junquera, P Ordejón, D Sánchez-Portal. The SIESTA
method for ab initio order-N materials simulation.
J. Phys.: Cond. Matter 14, 2745–2779, 2002
12. TOH White, RP Tyer, RP Bruin, MT Dove, KF
Austen. A lightweight, scriptable, web-based
frontend to the SRB. Proceedings of All Hands 2006
13. MT Dove, LA Sullivan, AM Walker, K Trachenko, RP
Bruin, TOH White, P Murray-Rust, RP Tyer, PA
Couch, IT Todorov, W Smith, K Kleese van Dam.
Anatomy of a grid-enabled molecular simulation
study: the compressibility of amorphous silica.
Proceedings of All Hands 2006
14. KF Austen, TOH White, RP Bruin, MT Dove, E
Artacho, RP Tyer. Using eScience to calibrate our
tools: parameterisation of quantum mechanical
calculations with grid technologies. Proceedings
of All Hands 2006
15. Z Du, VN Alexandrov, M Alfredsson, KF Austen, ND
Bennett, MO Blanchard, JP Brodholt, RP Bruin, CRA
Catlow, C Chapman, DJ Cooke, TG Cooper, MT Dove,
W Emmerich, SM Hasan, S Kerisit, NH de Leeuw, GJ
Lewis, A Marmier, SC Parker, GD Price, W Smith,
IT Todorov, R Tyer, AM Walker, TOH White, K
Wright. A virtual research organization enabled by
eMinerals minigrid: An integrated study of the
transport and immobilisation of arsenic species in
the environment. Proceedings of All Hands 2006
16. http://www.eminerals.org/tools/mcs.html
761
Proceedings of the UK e-Science All Hands Meeting 2006, © NeSC 2006, ISBN 0-9553988-0-0
Martyn Fletcher, Tom Jackson, Mark Jessop, Stefan Klinger, Bojian Liang, Jim Austin.
Advanced Computer Architectures Group,
Department of Computer Science,
University of York, YO10 5DD, UK.
Abstract
The organisation of much industrial and scientific work involves the geographically distributed utilisation of multiple tools, services and (increasingly) distributed data. A generic Distributed Tool, Service and Data Architecture is described, together with its application to the aero-engine domain through the BROADEN project. A central issue in the work is the ability to provide a flexible platform where data-intensive services may be added with little overhead from existing tool and service vendors. The paper explains the issues surrounding this and how the project is investigating the PMC method (developed in DAME) and the use of an Enterprise Service Bus to overcome the problems. The mapping of the generic architecture to the BROADEN application (visualisation tools and distributed data and services) is described, together with future work.
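The Enterprise Service Bus role sketched in the abstract, letting tool and service vendors plug in with little integration overhead, can be illustrated with a toy registry-and-route model. This is a sketch only; the class and method names are ours, not from the BROADEN implementation:

```python
class ServiceBus:
    """Toy Enterprise Service Bus: tools send requests to a named
    endpoint without knowing which service instance handles them."""

    def __init__(self):
        self.endpoints = {}

    def register(self, name, handler):
        # A service vendor plugs in with little overhead: it only
        # needs to register a handler under an endpoint name.
        self.endpoints[name] = handler

    def send(self, name, message):
        # The bus routes the message to whichever handler is registered.
        return self.endpoints[name](message)

bus = ServiceBus()
bus.register("search", lambda q: f"results for {q!r}")
print(bus.send("search", "vibration anomaly"))
```

The point of the indirection is that tools depend only on endpoint names, so a vendor can replace or relocate the service behind an endpoint without changing the tools.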
assuming other issues such as security, etc. are satisfied.

3. The Generic Distributed Tool, Service and Data Architecture

The scenario used for the DAME and BROADEN projects envisages geographically dispersed users using a workbench and tools which access distributed services and data. Figure 1 provides an overview of the generic architecture that has been developed. The elements shown are:
• The Graphics and Visualisation Suites, which contain a set of tools at a particular user location.
• The Enterprise Service Bus, which enables tool interactions.
• The distributed nodes, which encapsulate the data and high performance services (see Figure 2).
• The global workflow manager, which provides service orchestration on demand from individual users or as an automatic activity in response to a predetermined stimulus.

Figure 2 shows a generic overview of a distributed node:
• The Process Management Controller (PMC) is a distributed component which manages the distribution of service requests and the collection and aggregation of results for return to the requester (tool).
• The Processing Services act on local data and provide high performance search, signal extraction facilities, etc.
• Distributed Data Management provides access to local and distributed data through Storage Request Broker abstraction mechanisms.
• The Local Workflow Manager enables automatic and requested workflows to be enacted within a single distributed node. A local workflow may also be part of a global workflow controlled by the global workflow manager.
• The Local Resource Broker manages selection of local resources in keeping with specified Service Level Agreements (SLAs).
• The Data Loading Services populate the local data stores in each node. These services may also perform other tasks such as data integrity checking, error correction, etc. on input data.
• Databases / data stores are provided for each type of data resident in the node.
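The PMC's request distribution and result aggregation described above can be sketched as a simple scatter/gather loop. This is an illustrative sketch, not the BROADEN implementation; the class, method and service names are ours:

```python
from concurrent.futures import ThreadPoolExecutor

class ProcessManagementController:
    """Toy PMC: fan a request out to the local processing services,
    then collect and aggregate their partial results for return to
    the requesting tool."""

    def __init__(self, services):
        self.services = services  # callables acting on local data

    def handle_request(self, query):
        # Distribute the request to every processing service in parallel.
        with ThreadPoolExecutor() as pool:
            partials = list(pool.map(lambda svc: svc(query), self.services))
        # Aggregate the partial results for return to the requester.
        return sorted(hit for part in partials for hit in part)

# Hypothetical processing services, each searching its own local store.
svc1 = lambda q: [x for x in ("engine-17", "engine-42") if q in x]
svc2 = lambda q: [x for x in ("engine-17b",) if q in x]

pmc = ProcessManagementController([svc1, svc2])
print(pmc.handle_request("engine-17"))
```

In the real architecture the fan-out would cross node boundaries to other PMCs rather than local callables, but the gather-and-aggregate shape is the same.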
[Figure: generic overview of a distributed node, showing the Process Management Controller (PMC) with access to other PMC nodes; Processing Services 1..Y; Distributed Data Management with access to other Distributed Data Management nodes; Databases/Datastores 1..X; and Data Loading Services 1..Z, all under global data access and global process access.]

[Figure: mapping of the generic architecture to the BROADEN application, showing WorkBenches with tools (CBR Analysis, PCT, QUICK, SDE, Viewer, Visualiser, DataViewer); Distributed Nodes 0-2 containing a View Generator, PCT Extractor, CBR Analyser Engine, QUICK Feature Provider, Processing Service Controller (PMC), Data Orchestrator, Local Workflow Manager and Local Resource Broker; a centralised CBR KnowledgeBase; a remote XTO; the Storage Request Broker (SRB) master node 1 (with MCAT); and databases for PMS search metadata, Zmod data, event/feature data (Oracle) and other data (e.g. RDF, extractions, etc.).]
subsequent nodes can use this node as their seed. At later points in time, it may be desirable to use additional seed nodes. The initialisation process for both new and existing nodes will be synchronised and measurable. The PMC nodes will form a peer-to-peer network where the nodes are loosely coupled and only form connections in response to user requests.

As new nodes are added and register with all other nodes in the network, each node will maintain a persistent register of known nodes.

It may also be necessary to temporarily remove nodes from the network, e.g. for maintenance. In these cases the node will not be logically removed, so as to maintain the data coverage statistics for search operations. When these nodes are restarted, they will need to ensure that their node lists are correct and up to date. If a node is to be permanently removed from the PMC network, then it must use the de-register method on each PMC in its contact list.

could query their local SRB system and yet work with files that are hosted remotely. When data is requested from storage, it is delivered via parallel IP streams, using protocols provided by SRB, to maximize network throughput. In current implementations, a single MCAT is deployed inside a PostgreSQL database. Real-world implementations would use a production-standard database, such as Oracle or DB2. With this setup, each node's SRB installation is statically configured to use the known MCAT.

The use of a single MCAT is a single point of failure that would incapacitate the entire distributed system. To alleviate this, a federated MCAT would be used, allowing each node to use its own local MCAT. In large systems, this would provide a performance benefit, as all data requests would be made through local servers. However, this would require an efficient MCAT synchronization process.
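The federated-catalogue idea discussed above, where each node consults its own local MCAT first and only falls back to peer catalogues, can be sketched as a lookup with a local fast path. The function and data names are illustrative; this is not the SRB API:

```python
# Sketch of a federated metadata-catalogue (MCAT) lookup: prefer the
# local catalogue so most requests are served by local servers.
def locate(filename, local_mcat, peer_mcats):
    """Return the storage location recorded for a file."""
    if filename in local_mcat:
        return local_mcat[filename]      # served locally: fast path
    for peer in peer_mcats:              # fall back to federated peers
        if filename in peer:
            return peer[filename]
    raise FileNotFoundError(filename)

local = {"engine-17.dat": "node0:/store/engine-17.dat"}
peers = [{"engine-42.dat": "node1:/store/engine-42.dat"}]
print(locate("engine-42.dat", local, peers))
```

The synchronization cost mentioned in the text arises because every peer catalogue must eventually learn about entries registered elsewhere for this fallback to be complete.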
• The QUICK Feature Provider service, which provides features and events to a client such as the QUICK Data Viewer.
• The XTO (eXtract Tracked Order) service, which extracts specific time series from the raw spectral vibration data.
• The Data Orchestrator, which is the Data Loading Service and is responsible for "cleaning" and storing all raw engine data.

Also included in the BROADEN architecture is a centralised CBR Engine and Knowledge Base [9].

6. Future work

The generic architecture has been developed to be application neutral. It will be implemented for the BROADEN application during the course of the project. This will allow us to assess the strengths and weaknesses of PMC and ESB for the task. Future work will include:
• The testing and demonstration of the generic architecture in the BROADEN application domain.
• The testing and demonstration of the generic architecture in other application domains.
• The introduction of fault tolerance as appropriate (inter- and intra-node).
• The exploration of fault tolerance techniques within the ESB.
• Performance testing, which would be helpful to discover issues early in the process.

application domain requiring distributed tools and services and data repositories.

8. Acknowledgements

The work reported in this paper was developed and undertaken as part of the BROADEN (Business Resource Optimization for Aftermarket and Design on Engineering Networks) project, a follow-on project to the DAME project. The BROADEN project is part-funded via the UK Department of Trade and Industry under its Technology Innovation Programme, and the work described was undertaken by teams at the Universities of York, Leeds and Sheffield with industrial partners Rolls-Royce, EDS, Oxford BioSignals and Cybula Ltd.

9. References

[1] J. Austin et al., "Predictive maintenance: Distributed aircraft engine diagnostics," in The Grid, 2nd ed., I. Foster and C. Kesselman, Eds. San Mateo, CA: Morgan Kaufmann, 2003, Ch. 5.
[2] I. Foster and C. Kesselman, Eds., The Grid, 2nd ed. San Mateo, CA: Morgan Kaufmann, 2003.
[3] Data Systems and Solutions Core Control™ technology. http://www.ds-s.com/corecontrol.asp.
[4] A. Nairac, N. Townsend, R. Carr, S. King, P. Cowley, and L. Tarassenko, "A system for the analysis of jet engine vibration data," Integrated Computer-Aided Eng., vol. 6, pp. 53–65, 1999.
[5] D. O'Connel, "Pilots predict air chaos," The Sunday Times, p. 3.3, Mar. 14, 2004.
framework [7].

The gridMonSteer architecture consists of two parts: an application wrapper that runs in the same execution environment as the application, and an application controller that generally runs locally to the user. As all the communication in gridMonSteer is initiated by the application wrapper, all communication is outbound from the Grid resource. This situation is generally allowed by firewalls as it implies that the application wrapper has already complied with the security requirements of that resource.

The gridMonSteer application wrapper is executed as part of the application job submission. The arguments specified in this submission tell the wrapper what information to request from and notify to the gridMonSteer controller, and these can be tailored to a specific application scenario. This information could for example be immediate notification of output files created by the application, incremental requests for input to the application (e.g. via the standard input), requests for input to be forwarded to the application via a network port, or notification of the state of the application.

A gridMonSteer application controller is a service generally run locally by the user that receives information requests and notifications from the application wrapper. A controller exposes a simple interface that can be invoked by the application wrapper; in the current implementation this is a web service interface. As long as this interface is exposed, the actual behavior of the controller is undefined and can be tailored to individual or generic application requirements; for example:

• a generic controller could combine incremental input to the standard input with immediate output from the standard output to conduct an interactive command-line session with a Grid application.

• a visualization controller could receive immediate notification of image files to produce a live animation of output from a Grid application.

• a workflow controller could use immediate output file notification from an application to dynamically steer the execution of a Grid workflow, e.g. to launch an immediate response to a significant event.

The rest of this paper further describes the architecture, implementation and application of gridMonSteer. In Sections 3 and 4 respectively we explain the gridMonSteer architecture and our implementation of this architecture. We present a case study in Section 5 that uses gridMonSteer in combination with Triana [8][9] and Cactus [10][11] to dynamically steer a Grid workflow. We draw our conclusions from this work in Section 6. First however, in Section 2, we look at work related to the architecture presented in this paper.

2 Related Work

The number of scientific applications developed to operate in a Grid environment is tiny when compared with the vast number that can be considered to be legacy codes when run within such environments. Grid legacy applications are not implemented with hooks into the application monitoring systems that are being developed for Grid environments, such as Mercury [2] and OMIS [3] (see [1] for a survey), and therefore have no method for communicating their progress or intermediate results, or for receiving steering information.

One obvious solution to this problem is to "Grid-enable" these applications by allowing them to interact with a monitoring system. However, enabling an application to interact with existing monitoring systems requires modification of the application source code. In a legacy application scenario this is often not possible because the source code is unavailable. If the source code is available, the difficulty and cost of modifying unsupported code is likely to prove prohibitive, especially when standards for monitoring architectures in Grid environments are still being developed.

Without access to monitoring systems, when a legacy application is submitted to execute on the Grid via a resource manager, such as GRAM, GRMS or Condor/G, that resource manager becomes the principal point of contact regarding the progress of that application. However, the progress information available to a user is generally extremely limited, not normally stretching much beyond the status of the job and its id. Although a user could expect to have access to the execution directory of the legacy application, this access is pull based, so retrieving intermediate results in
a timely fashion would require constant polling of the remote resource. This is very inefficient compared to the result notification approach used in gridMonSteer and unlikely to prove a usable architecture for effective application controllers.

An alternative to using a resource manager to execute a legacy application is to expose the functionality of the application as a Web or Grid service. To achieve this requires an application wrapper that mediates between the service interface and invocations of the application. A couple of approaches, including SWIG [12] and JACAW [13], have adopted a fine-grained approach where the actual source code is wrapped, exposing the application interface. However, in addition to granularity and complexity considerations, this approach requires access to the application source code, a situation that is generally not applicable to legacy applications.

A coarse-grained approach, where the application is treated as a black box interfaced with using stdin/stdout or file input/output, is used in other projects, such as GEMLCA [14] and SOAPLab [15]. This approach however typically provides little additional benefit over using a resource manager with regards to application interaction. This is illustrated by the fact that, rather than wrapping the application process directly, GEMLCA actually wraps a Condor job submission.

One system that does employ similar components to those of gridMonSteer is Nimrod/G [16][17], a Grid-based parametric modeling system. In Nimrod/G an application wrapper is used repeatedly to run parametric modeling experiments with varying parameters and to pre/post stage results. A central database is queried by the application wrapper to determine the next experiment to execute. Such a scenario could easily be implemented using the dynamic scripting capacity of the current gridMonSteer implementation (see Section 4.1.1), with the central database acting as the application controller. Unlike gridMonSteer, Nimrod/G does not provide a generic legacy application monitoring or steering architecture outside dynamic scripting of parameter modeling experiments.

3 gridMonSteer Architecture

The gridMonSteer architecture (see Figure 3) consists of two parts: an application wrapper that is executed on a Grid resource and an application controller that is generally run locally to the user. As the gridMonSteer application wrapper is submitted to the Grid as a standard job, this architecture works with any Grid resource manager (e.g. GRAM, GRMS or Condor/G). Unlike other Grid monitoring approaches, the gridMonSteer application wrapper is non-intrusive and therefore does not require modifications to the application code.

The process of running an application under gridMonSteer is initialized when, rather than submitting an application to a Grid resource manager for execution, a gridMonSteer application wrapper job is submitted instead (Step 1 in Figure 3). The original application executable and arguments are supplied as arguments to the wrapper job. As the wrapper job is submitted to the resource manager as a standard job, all the features of the chosen resource manager's job description are available, as they would be if submitting the application directly.

As well as the application executable and arguments, the address of a gridMonSteer application controller service is also supplied as an argument to the wrapper job. An application controller is a service (e.g. a web service) that exposes a simple network interface (see Section 4.2). This interface is invoked by the application wrapper to request information (e.g. input files) from, and notify information (e.g. output files) to, the controller. Other arguments supplied to the wrapper job specify exactly which information is notified to and requested from the controller by the application wrapper, as will be discussed in Section 4.1.

The gridMonSteer application wrapper job is executed on a Grid resource as specified in the job description and determined by the resource manager (Step 2). The application wrapper is then responsible for launching the original application (Step 4); however, before this is done, any input files that are required to be pre-staged in the execution directory are requested from the controller by the application wrapper (Step 3).

After the application has been launched by the application wrapper, the application wrapper regularly monitors the execution directory for the creation/modification of output files (Step 5). If any changes occur to files required by the application controller, as specified in the wrapper job arguments (see Section 4.1), then the controller is notified via its network interface (Step 6).

[Figure: a legacy application running in its directory on a Grid resource (containing, e.g., output.dat and image.jpg) is monitored by the gridMonSteer wrapper, which requests input (requestInput, Step 3) from and notifies output (notifyOutput, Step 6) to a web service controller (e.g. at http://www.xyz.org/gms) launched using WSPeer and invoked using SOAP.]

Figure 1: gridMonSteer architecture using a web service controller launched using WSPeer.

In addition to handling input/output files, the application wrapper can perform other notification/steering tasks such as job state notification and dynamic script execution. We detail such functionality further in Section 4.

As long as the application controller exposes the predefined controller interface, the gridMonSteer architecture does not define how the controller handles the information requests/notifications that it receives from the wrapper.

3.1 Security Considerations

In a Grid environment most hosts are behind firewalls that typically create a one-way barrier for incoming traffic on specified ports, thereby disabling incoming connections from external controllers. Hosts can also employ address translation, such as Network Address Translation (NAT) systems, in which case they are not Internet facing and cannot be contacted directly. Once a job is started in such an environment, direct communication from an application controller becomes difficult or impossible.

In the gridMonSteer architecture all communication is initiated by the gridMonSteer application wrapper. As the application wrapper executes in the same environment as the Grid application, this means that communication is always outbound from Grid resources. This is a situation that is generally allowed by firewalls, as it implies the application wrapper has already complied with the security requirements of that resource. In the case of gridMonSteer, as the application wrapper is submitted as a standard Grid job, these security requirements are enforced by the resource manager used to execute the wrapper job. Outbound communication is also compatible with address translation systems such as NAT.

As long as a secure transport protocol such as HTTPS is used for communication between the application wrapper and controller, identity certificates can be used by the application wrapper to verify it is communicating with a trusted controller. Beyond that, it is up to the application controller implementation not to introduce security flaws. The application controller should operate under the same trust relationship that allowed the application wrapper job to be executed on the Grid resource initially.

4 gridMonSteer Implementation

The designs of the current gridMonSteer application wrapper and application controller implementations are described in Sections 4.1 and 4.2 respectively. At present the application wrapper is implemented in Java; however, a C version is under development. In the current implementation the controller exposes a web service interface, and can be implemented in any language so long as this
to application execution.

-append - Same as -in except incremental updates to the input files are repeatedly requested during application execution.

As with the equivalent output file arguments (see Section 4.1.1), the above commands take a file list detailing the files to be requested. This file list can include the keyword STDIN to indicate requesting the standard input.

4.2 Application Controller

A gridMonSteer application controller is a process that receives input requests and output notifications from the application wrapper via a network interface. An application controller is generally run locally to the user, allowing the output of an application to be visualized on the user's desktop, for example; however, it is equally valid for a remote application controller to be used. Also, an application controller can be used to control multiple jobs. These jobs can be independent applications or separate parts of a parallel application.

In the current implementation application controllers are exposed as web services; however, the gridMonSteer architecture could function equally effectively using an alternative communication framework. In this implementation, the controller is required to expose an interface containing six operations (runJob, getInput, notifyOutput, notifyState, registerOutput, selectOutput), and provided this interface is exposed, the actual behavior of the controller is undefined.

In the scenarios we are currently developing (see Section 5 for an example), we use WSPeer [18] to allow the application controller to dynamically expose itself as a web service for the duration of the jobs it is managing. Although this is beneficial for developing interactive application controllers, any web service implementing the gridMonSteer controller interface could be employed as an application controller.

5 Case Study: Using Cactus to Steer Triana Workflows

In this case study we illustrate how gridMonSteer can be used to steer complex Grid workflows. Here, we consider the case where the workflow itself is the Grid application, rather than the current perception that the legacy code being invoked remotely is the application. Although this example is simple, it could be extended to data-driven applications that involve complex decision-based workflow interactions across the resources.

The specific scenario we illustrate in this case study is a gridMonSteer wrapped Cactus [10][19] job executed within a Triana workflow [8][9], with Triana both submitting the Cactus job to the Grid resource manager and acting as gridMonSteer controller for that job. As controller, Triana receives timely notification of intermediate results via gridMonSteer that are then used to steer the overall workflow being executed.

The original idea for gridMonSteer came from an SC04 demonstration in which Triana was used to visualize two orbiting sources simulated by a Cactus job running in a Grid environment [20]. Accessing intermediate results is a typical requirement for Cactus users who want to monitor the progress of their application or derive scientific results. For this scenario a specific Cactus thorn was coded to notify the files created by Cactus to the Triana controller after each time step; however, this is a Cactus-specific solution. In this example we recreate this demonstration except that Cactus is treated as a true legacy application, and gridMonSteer provides non-intrusive monitoring and steering capabilities.

Cactus simulations present a number of interesting cases for gridMonSteer. The output from Cactus can be in a number of formats, including HDF5, JPEG, and ASCII formats suitable for visualizing with tools such as X-Graph and GNUPlot. In some instances, such as JPEG images, the output files are overwritten after each time step, whereas other output files, such as X-Graph files, are appended.

Grid job submission in Triana is conducted via the GridLab GAT [5], a high-level API for accessing Grid services. The GAT uses a pluggable adaptor architecture that allows different services to be used without modifying the application code. Current resource manager adaptors for the Java GAT include GRAM, GRMS and local execution. Although different adaptors can be used within the GAT, as gridMonSteer is generic to any resource manager, Triana can run gridMonSteer wrapped jobs via the GAT without having to modify the job submission to take account of different resource
Figure 2: An example Triana workflow for executing a gridMonSteer wrapped Cactus job.
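The six-operation controller interface described in Section 4.2 can be sketched as follows. The operation names (runJob, getInput, notifyOutput, notifyState, registerOutput, selectOutput) are from the paper; the bodies and state layout are illustrative only, since the architecture deliberately leaves controller behaviour undefined:

```python
class ExampleController:
    """Minimal sketch of a gridMonSteer application controller
    exposing the six required operations as plain methods."""

    def __init__(self):
        self.watched = {}   # job_id -> output patterns to be notified of
        self.outputs = {}   # job_id -> {filename: data}
        self.states = {}    # job_id -> last reported job state

    def runJob(self, job_id, executable, args):
        self.states[job_id] = "submitted"

    def getInput(self, job_id, filename):
        # Serve a pre-staged input file requested by the wrapper (Step 3).
        return b""

    def notifyOutput(self, job_id, filename, data):
        # Receive immediate notification of a created/modified output
        # file from the wrapper (Step 6).
        self.outputs.setdefault(job_id, {})[filename] = data

    def notifyState(self, job_id, state):
        self.states[job_id] = state

    def registerOutput(self, job_id, pattern):
        self.watched.setdefault(job_id, []).append(pattern)

    def selectOutput(self, job_id, filename):
        return self.outputs.get(job_id, {}).get(filename)
```

In the paper's implementation these operations are exposed as a web service (via WSPeer) rather than called directly; a visualization or workflow controller would act on `notifyOutput` to drive an animation or steer a workflow.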
wrapper can be tailored to individual or generic application requirements.

In this paper we presented a case study using gridMonSteer to monitor a legacy application running as a component within a Grid workflow. In this case study, the intermediate results from the legacy application were notified by gridMonSteer to the workflow controller and used to dynamically steer the workflow execution. This is an example of how gridMonSteer enables interactive legacy applications to be used within Grid environments in situations that were previously unavailable.

References

[1] S. Zanikolas and R. Sakellariou, "A taxonomy of grid monitoring systems," Future Generation Computer Systems, vol. 21, pp. 163–168, January 2005.

[2] Z. Balaton and G. Gombas, "Resource and job monitoring in the grid," in Proceedings of the Ninth International Euro-Par Conference, Vol. 2790 of Lecture Notes in Computer Science, Springer-Verlag, 2003.

[3] B. Balis, M. Bubak, T. Szepieniec, R. Wismuller, and M. Radecki, "Monitoring grid applications with grid-enabled OMIS monitor," in Proceedings of the First European Across Grids Conference, Vol. 2970 of Lecture Notes in Computer Science, pp. 230–239, Springer-Verlag.

[4] K. Czajkowski, I. Foster, N. Karonis, C. Kesselman, S. Martin, W. Smith, and S. Tuecke, "A Resource Management Architecture for Metacomputing Systems," in Proc. IPPS/SPDP '98 Workshop on Job Scheduling Strategies for Parallel Processing, pp. 62–82, IEEE Computer Society, 1998.

[5] "The GridLab Project." See web site at: http://www.gridlab.org.

[6] J. Frey, T. Tannenbaum, M. Livny, I. Foster, and S. Tuecke, "Condor-G: A Computation Management Agent for Multi-Institutional Grids," in Proceedings of the 10th IEEE International Symposium on High Performance Distributed Computing (HPDC '01), 2001.

[7] I. Foster, C. Kesselman, J. Nick, and S. Tuecke, "The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration," tech. rep., Open Grid Service Infrastructure WG, Global Grid Forum, 2002.

[8] I. Taylor, M. Shields, I. Wang, and O. Rana, "Triana Applications within Grid Computing and Peer to Peer Environments," Journal of Grid Computing, vol. 1, no. 2, pp. 199–217, 2003.

[9] "The Triana Project." See web site at: http://www.trianacode.org.

[10] T. Goodale, G. Allen, G. Lanfermann, J. Masso, T. Radke, E. Seidel, and J. Shalf, "The Cactus framework and toolkit: Design and applications," in Vector and Parallel Processing, VECPAR 2002, 5th International Conference, Lecture Notes in Computer Science, Springer, 2003.

[11] "The Cactus Computational Toolkit." See web site at: http://www.cactuscode.org.

[12] D. M. Beazley and P. S. Lomdahl, "Lightweight Computational Steering of Very Large Scale Molecular Dynamics Simulations," in Proceedings of Supercomputing '96, 1996.

[13] Y. Huang and D. W. Walker, "JACAW - A Java-C Automatic Wrapper Tool and its Benchmark," in Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS '02), 2002.

[14] T. Delaitre, A. Goyeneche, P. Kacsuk, T. Kiss, G. Terstyanszky, and S. Winter, "GEMLCA: Grid execution management for legacy code architecture design," in Proceedings of the 30th EUROMICRO Conference, Special Session on Advances in Web Computing, August 2004.

[15] M. Senger, P. Rice, and T. Oinn, "Soaplab - a unified Sesame door to analysis tools," in Proceedings of UK e-Science All Hands Meeting, pp. 509–513, September 2003.

[16] R. Buyya, D. Abramson, and J. Giddy, "Nimrod/G: An architecture of a resource management and scheduling system in a global computational grid," in HPC Asia 2000, pp. 283–289, May 2000.

[17] "Nimrod: Tools for distributed parametric modelling." See web site at: http://www.csse.monash.edu.au/~davida/nimrod/.

[18] A. Harrison and I. Taylor, "WSPeer - An Interface to Web Service Hosting and Invocation," in HIPS Joint Workshop on High-Performance Grid Computing and High-Level Parallel Programming Models, 2005.

[19] "The Cactus Project." See web site at: http://www.cactuscode.org.

[20] T. Goodale, I. Taylor, and I. Wang, "Integrating Cactus Simulations within Triana Workflows," in Proceedings of 13th Annual Mardi Gras Conference - Frontiers of Grid Applications and Technologies, pp. 47–53, Louisiana State University, February 2005.
Tim Folkes1, Elizabeth Auden2, Paul Lamb2, Matthew Whillock2, Jens Jensen1, Matthew Wild1
1 CCLRC – Rutherford Appleton Laboratory (RAL)
2 Mullard Space Science Laboratory (MSSL), University College London
Abstract
This paper describes how telemetry and science data from the Solar-B satellite will be
imported into the UK AstroGrid infrastructure for analysis. Data will be received from the
satellite by ISAS in Japan, and will then be transferred by MSSL to the UK where it is
cached in CCLRC’s Atlas tapestore, utilizing existing high speed network links between
RAL and Japan. The control of the data flow can be done from MSSL using their existing
network infrastructure and without the need for any extra local data storage. From the
cache in the UK, data can then be redistributed into AstroGrid’s MySpace for analysis,
using the metadata held at MSSL. The innovation in this work lies in tying together Grid
technologies from different Grids with existing storage and networking technologies to
provide an end-to-end solution for UK participants in an international scientific
collaboration. Solar-B is due for launch in late 2006; meanwhile the system is being
tested with data from another solar observatory.
from the Japanese site, ISAS, to RAL, without having to copy the data to MSSL over slower network links.

To set up this transfer, we needed to identify appropriately trusted Certification Authorities in the relevant countries to issue certificates to all the users and hosts. This was easy for the UK and Japan. Both countries have national Grid CAs: the UK e-Science CA [UKCA] and its Japanese counterpart [AIST]. Since the CAs know and trust each other, it was easy to get the CAs to establish trust on behalf of the users. The US is slightly trickier because there is no CA covering all of the US, but the Japan-UK link was the highest priority. Lockheed, one of the US partners, will initially transfer data to ISAS, and from there it will be replicated to the UK via the GridFTP link.

Using the existing Grid CAs is by far the best option because the CAs have already negotiated trust globally [IGTF].

To migrate the data to other storage hierarchies, policies are created that use various Media Specific Processes (MSPs) to move the data. At RAL we have configured the system to create two copies of all the data. If a file is smaller than 50 MB, the first copy of the data is written onto another set of disks; this allows for fast retrieval of small files. The second copy of the data goes onto tape. For all other file sizes, the data is written to two separate sets of tape. In both cases the second set of tapes is removed from the robot and stored in a fire safe.

Once the files have been migrated to an MSP, the file becomes "dual state", with a copy of the data on disk and at least one elsewhere. This is the first stage in the data migration process.
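The two-copy policy described above (small files to a second disk set plus tape, larger files to two tape sets) can be sketched as follows. This is an illustrative sketch only; the function and MSP names are hypothetical and are not actual DMF configuration:

```python
# Hypothetical sketch of the RAL two-copy migration policy described in
# the text; names are illustrative, not real DMF configuration syntax.

SMALL_FILE_LIMIT = 50 * 1024 * 1024  # 50 MB

def choose_copy_targets(size_bytes: int) -> list[str]:
    """Return the two media-specific processes (MSPs) for a file.

    Files under 50 MB get one copy on fast disk (for quick retrieval of
    small files) and one on tape; all other files get two separate tape
    sets, the second of which is removed to a fire safe.
    """
    if size_bytes < SMALL_FILE_LIMIT:
        return ["disk_msp", "tape_msp_1"]
    return ["tape_msp_1", "tape_msp_2"]
```

For example, a 10 MB telemetry file would be directed to the fast disk set plus one tape copy, while a 2 GB science product would go to both tape sets.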
• “free space target” specifies the percentage of free filesystem space DMF will try to achieve if free space falls below “free space minimum”.
• “migration target” specifies the percentage of filesystem capacity that is maintained as a reserve of space that is free or occupied by dual-state files. DMF attempts to maintain this reserve in the event that the filesystem free space reaches or falls below “free space minimum”.

Should a user access the file, the data will be restored from the MSP. Care has to be taken to make sure the backups, and the backups of the DMF database, are kept in step.

DMF also allows files that are accidentally deleted to be recovered. When a file is deleted, it is removed from disk, but an entry is still kept in the DMF database. The file becomes “soft deleted”. After a pre-determined period (30 days by default) the soft-deleted entries are hard deleted and the space in the MSP is freed up. After this there is no way to recover the file.
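The soft-delete behaviour described above can be sketched as a small catalogue that retains deleted entries for a retention period before purging them. This is a hedged illustration of the idea, not DMF's actual implementation:

```python
# Illustrative sketch of DMF-style soft deletion: a deleted file's
# database entry survives for a retention period (30 days by default)
# before it is hard deleted and its MSP space reclaimed. All names here
# are hypothetical.
from datetime import datetime, timedelta

RETENTION = timedelta(days=30)

class Catalogue:
    def __init__(self):
        self.entries = {}  # path -> {"soft_deleted_at": datetime or None}

    def add(self, path):
        self.entries[path] = {"soft_deleted_at": None}

    def delete(self, path):
        # Remove from disk (not shown); keep the database entry.
        self.entries[path]["soft_deleted_at"] = datetime.utcnow()

    def recover(self, path):
        # Undelete is possible only while the entry still exists.
        if path in self.entries:
            self.entries[path]["soft_deleted_at"] = None
            return True
        return False

    def purge(self, now=None):
        # Hard delete entries whose retention period has expired;
        # after this the file can no longer be recovered.
        now = now or datetime.utcnow()
        for path, e in list(self.entries.items()):
            t = e["soft_deleted_at"]
            if t is not None and now - t >= RETENTION:
                del self.entries[path]
```

The key design point, as in the text, is that deletion and space reclamation are decoupled, giving a recovery window at the cost of delayed space reuse.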
Conclusion
References
Abstract
from the AIA instrument. These algorithms include non-linear magnetic field extrapolation, coronal loop recognition, helicity computation, small event detection and recognition of coronal mass ejection (CME) dimming regions. The magnetic field extrapolation algorithm will particularly benefit from grid computing techniques. Wiegelmann’s “optimization” method of 3-D magnetic field extrapolation is used to calculate magnetic field strength over a user-defined solar volume [3]. Computations of volume integrals are distributed with the MPICH 1 or 2 parallel processing protocol; current investigations are exploring the execution of this code on National Grid Service (NGS) facilities through an AstroGrid application interface.

The universities of Birmingham and Sheffield are examining global and local helioseismology respectively, and these algorithms will process data from the HMI instrument. The Birmingham group is implementing mode frequency analysis and mode asymmetry analysis algorithms. The Sheffield group is developing subsurface flow analysis, perturbation map generation and computation of local helioseismology inversion algorithms. Helioseismology algorithms analyse low frequency solar oscillations with periodicities that range from 5 minutes for p mode (acoustic) waves to hours or days for g mode (gravity) waves [4]. These oscillations are studied over long periods of time, so helioseismology algorithms require correspondingly long data sequences as input. Because of the large input series, these algorithms should be hosted local to helioseismology data caches in the US or UK to ensure efficient completion of data processing.

2.2 Algorithm Distribution

eSDO algorithms will be designated by JSOC as optional user-invoked modules. UK SDO co-investigators and their teams will be able to access the JSOC pipeline directly through accounts at Stanford University. The coronal loop recognition algorithm will be the eSDO test case for JSOC pipeline integration.

For the wider UK solar community, access to eSDO algorithms will be available through AstroGrid or SolarSoft. Algorithms deployed as UK hosted AstroGrid CEA web services will be accessible through the AstroGrid workbench via the application launcher and parameterized workflows. Emerging VO access to facilities such as NGS will allow scientists to execute computationally intensive algorithms on powerful remote machines. Algorithms that are not computationally intensive will be wrapped in IDL for SolarSoft distribution through the MSSL gateway. CEA deployment will begin in late September 2006.

In addition to the algorithms under development by eSDO consortium institutions, the project will also work with UK coronal seismologists to deploy wave power analysis algorithms as JSOC pipeline modules and AstroGrid CEA applications.

3. Data Visualization

The huge volume of SDO data collected every day makes it imperative to search the data archive efficiently. Visualization techniques including streaming tools, catalogues, thumbnail extraction and movie generation will aid scientists in archive navigation. By enabling scientists to identify relevant datasets quickly, these tools will reduce network traffic and save research time.
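The distribution of the volume integrals in the magnetic field extrapolation described in section 2 is essentially a domain decomposition followed by a reduction. A serial Python sketch of that pattern, with a placeholder integrand standing in for the real field computation (the actual eSDO code uses MPICH, which this does not attempt to reproduce):

```python
# Sketch of domain decomposition + reduction for a volume integral.
# The integrand and all names are placeholders, not the eSDO code.
import math

def integrand(zs):
    # Placeholder field quantity summed over a block of z-slices.
    return sum(math.exp(-z / 64.0) for z in zs)

def blocks(nz, nworkers):
    # Contiguous blocks of z-slice indices, one per "rank".
    out, start = [], 0
    step, rem = divmod(nz, nworkers)
    for r in range(nworkers):
        end = start + step + (1 if r < rem else 0)
        out.append(range(start, end))
        start = end
    return out

def volume_integral(nz=64, nworkers=4):
    # Each rank integrates its own block; the reduction step sums
    # the partial integrals into the full volume integral.
    return sum(integrand(b) for b in blocks(nz, nworkers))
```

The property that makes the parallel version safe is that splitting the same volume across any number of workers yields the same integral as computing it on one.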
locate the datasets, extract images from FITS files, label the images with relevant metadata from the FITS headers, and then display an image gallery in the user’s web browser or create an MPEG movie that the user can view or download. Following advice from the eSDO Phase A review with PPARC in November 2005, the movie maker will be enhanced to evaluate a data sequence’s suitability for wavelet analysis. Prototypes of the thumbnail maker and image gallery tool are now available [5,6].

3.2 Catalogues

Two science catalogues will be generated from eSDO algorithms. One catalogue will store statistical information about small solar events and CME dimming regions. The catalogue will be generated in the UK and annotated continuously from the UK data centre’s cached AIA datasets. The second catalogue will provide a “GONG month” [7] of helioseismology information produced by the mode parameters analysis algorithm. This catalogue will be generated in the US using eSDO modules integrated with the JSOC pipeline. Both catalogues will be searchable through the virtual observatory with instances of the AstroGrid DataSet Access (DSA) software.

3.3 SDO Streaming Tool

The SDO streaming tool will display HMI and AIA science products in a web GUI. Scientists will be able to pan and zoom the images in both space and time. A user will launch the Java Web Start SDO streaming tool from a web browser and then specify a start time, stop time, cadence, and data product. Three types of SDO products will be available: AIA images from 10 channels, HMI continuum maps, and HMI line-of-sight magnetograms.

The user will be able to zoom in spatially from a full disk, low resolution image to a full resolution 512 by 512 pixel display of a solar area. Users will also be able to pan spatially zoomed images in eight directions. Once a user has selected a cadence (for instance, 1 image per hour), data products matching that cadence will be displayed on the screen; users will be able to "zoom" in time by increasing or decreasing this cadence, while rewind, fast forward and pause facilities will allow users to "pan" in time.

Collaboration with the UK coronal seismology community will enhance the streaming tool with a wave power analysis function.

A prototype of the eSDO streaming tool that can be used in conjunction with the AstroGrid Helioscope application is available now [8].

Figure 1: The SDO streaming tool prototype loading a coronal image from the SOHO LASCO instrument, zoomed in one spatial level.

4. UK Data Centre

The primary SDO data centre will be based in the US at Stanford University and Lockheed Martin, and users will be able to access the data stored there directly through the Virtual Solar Observatory (VSO). A second data centre will be established in the UK at the ATLAS datastore facility at the Rutherford Appleton Laboratory. Rather than providing a full data mirror, this data centre will store all SDO metadata as well as holding a cache of recent data products and popular data requests. The metadata storage will allow SDO data searches to be included in AstroGrid workflows, while the local data cache will provide UK scientists with access to SDO data when the US data centre experiences decreases in network speed during periods of high user demand.
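The time "zoom" and "pan" described for the streaming tool amount to re-sampling the available frames at the requested cadence. A minimal sketch, with illustrative names only (the real tool is a Java Web Start application):

```python
# Illustrative cadence-based frame selection; names are hypothetical.
from datetime import datetime, timedelta

def select_frames(frame_times, start, stop, cadence):
    """Pick the nearest available frame at each cadence step.

    "Zooming" in time means calling this again with a smaller cadence;
    "panning" shifts the start/stop window while keeping the cadence.
    """
    selected, t = [], start
    while t <= stop:
        nearest = min(frame_times, key=lambda f: abs(f - t))
        if nearest not in selected:
            selected.append(nearest)
        t += cadence
    return selected
```

For a one-hour cadence over a two-hour window, the tool would display three frames; halving the cadence doubles the temporal resolution without touching the spatial zoom.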
4.1 Metadata and VO Searches

JSOC will store SDO metadata in a distributed Data Resource Management System (DRMS), while actual data files will be stored in a distributed Storage Unit Management System (SUMS). Each instance of DRMS will hold metadata for all SDO files, regardless of the fraction of SDO data stored in the local SUMS. When new SDO high level data sets are created in the UK, the metadata for those datasets will be entered into a DRMS database and disseminated globally to all DRMS instances. Instances of DRMS operate in conjunction with Postgres and Oracle databases to store keyword value pairs that are used to generate SDO FITS headers on the fly.

JSOC has designated the Virtual Solar Observatory (VSO) as the official front end for user access to SDO data, and UK scientists will be able to download SDO data from the VSO web interface. However, VSO is not compliant with the International Virtual Observatory Alliance (IVOA) web services that AstroGrid uses; therefore AstroGrid workflows cannot access SDO data through VSO. Instead, the eSDO project is configuring an instance of the AstroGrid DSA software to interface with the UK DRMS. This will allow scientists to set up an AstroGrid workflow that will search through the SDO metadata, identify the relevant SDO datasets, and retrieve the datasets as URLs.
Figure 2: This figure demonstrates the flow of information when an AstroGrid user submits
a request for AIA or HMI data to the UK SDO data centre.
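Generating FITS headers on the fly from stored keyword-value pairs, as DRMS does, can be sketched as follows. A FITS header is a sequence of 80-character cards terminated by an END card; the keywords shown here are illustrative, not the SDO keyword set:

```python
# Sketch of building a FITS header from keyword-value pairs.
# Keywords and helper names are illustrative, not DRMS code.

def fits_card(keyword: str, value, comment: str = "") -> str:
    """Format one 80-character FITS header card."""
    if isinstance(value, str):
        val = f"'{value:<8}'"          # strings are quoted, min 8 chars
    elif isinstance(value, bool):
        val = "T" if value else "F"    # FITS logical
    else:
        val = str(value)
    card = f"{keyword.upper():<8}= {val:>20}"
    if comment:
        card += f" / {comment}"
    return card[:80].ljust(80)

def build_header(pairs: dict) -> str:
    # Concatenate cards and close with the mandatory END card.
    cards = [fits_card(k, v) for k, v in pairs.items()]
    cards.append("END".ljust(80))
    return "".join(cards)
```

The point of the on-the-fly approach is that only the compact keyword-value pairs need to be replicated between DRMS instances; the fixed-width header text is reconstructed at export time.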
4.2 ATLAS Data Cache

Although a UK instance of the DRMS server will provide a UK holding of all SDO metadata, the datasets themselves will be archived in the US. The eSDO project has proposed a 30 TB disk cache hosted at the ATLAS facility for holding a fraction of the SDO data. 15 TB will hold a rolling 60 day cache of recent AIA and HMI data, while the other 15 TB will serve as a “true cache” for older datasets requested by UK users through the AstroGrid system. When a UK user requests an HMI or AIA science product through the AstroGrid system, the request will be sent to the UK data centre. If the data is not available there, the request will be redirected through an AstroGrid / VSO interface, and the relevant data products will be returned to the UK in export format (FITS, JPEG, VOTable, etc).

In the event of a large flare or coronal mass ejection, UK scientists can download data from the rolling 60 day cache without impediment from the heavy network traffic of users accessing the JSOC archive. In addition, UK users of the SDO streaming tool will have more seamless access to images viewed at a high cadence. The true cache will reduce transatlantic data transfers for user groups such as the helioseismology community that process large blocks of data but are not concerned with recent events.
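The cache-then-redirect request flow described for the UK data centre is a read-through cache. A hypothetical sketch, with `fetch_from_vso` standing in for the AstroGrid / VSO interface:

```python
# Read-through cache sketch; names are illustrative, not eSDO code.

def fetch_product(request, uk_cache: dict, fetch_from_vso):
    """Serve an HMI/AIA product from the UK cache, falling back to VSO.

    `fetch_from_vso` stands in for the AstroGrid / VSO interface that
    returns products in export format (FITS, JPEG, VOTable, ...).
    """
    if request in uk_cache:
        return uk_cache[request]       # served from the ATLAS disk cache
    product = fetch_from_vso(request)  # redirect to the US archive
    uk_cache[request] = product        # populate the "true cache"
    return product
```

On a repeat request, the transatlantic fetch is skipped entirely, which is exactly the saving the text anticipates for groups reprocessing large blocks of older data.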
Figure 3: The proposed architecture for a UK SDO data centre demonstrates the interaction
between the full US SDO archive, the UK cache, and user requests.
5. Conclusions

Although eSDO funding will cease before the SDO mission is launched in August 2008, the project will deliver completed designs, code and documentation for each of the three major workpackages. Ten algorithms concentrating on solar feature recognition and helioseismology will be ready for deployment through the JSOC pipeline, AstroGrid and SolarSoft. The UK data centre will provide interfaces between the JSOC data resource management system, the AstroGrid data set access software and the ATLAS storage facility, allowing virtual observatory searches of all SDO metadata with rapid access to SDO data cached in the UK. Finally, scientists will be able to identify relevant datasets through visualization tools such as catalogues, the web-based image gallery and movie makers, and the SDO streaming tool.

6. References

[1] “Solar Dynamics Observatory Mission Factsheet”, 3 November 2005. http://sdo.gsfc.nasa.gov/sdo_factsheet.pdf

[2] Rixon, G. “Software Architecture of AstroGrid v1”, 27 September 2005.

[3] Wiegelmann, T. 2004, Solar Physics, 219, 87-108.

[4] Lou, Y.Q. 2001, Astrophysical Journal, 556, 2, L121-L125.

[5] Auden, E. eSDO Thumbnail Maker, 19 July 2006. http://msslxx.mssl.ucl.ac.uk:8080/esdo/visualization/jnlp/esdo_thumbnailmaker.jnlp

[6] Auden, E. eSDO Image Gallery, 18 July 2006. http://msslxx.mssl.ucl.ac.uk:8080/esdo/visualization/jnlp/esdo_imagegallery.jnlp

[7] Howe, R., Komm, R., Hill, F. 2000, Solar Physics, 192, 2, 427-435.
Abstract
Within bioinformatics, a considerable amount of time is spent dealing with three problems: heterogeneity, distribution and autonomy – concerns that are mirrored in this study and which e-Science technologies should help to address. We describe the design and architecture of a system
which makes predictions about the secreted proteins of bacteria, describing both the strengths and
some weaknesses of current e-Science technology.
The focus of this study was on the development and application of e-Science workflows and a
service oriented approach to the genomic scale detection and characterisation of secreted proteins
from Bacillus species. Members of the genus Bacillus show a diverse range of properties, from the
non-pathogenic B. subtilis to the causative agent of anthrax, B. anthracis. Protein predictions were
automatically integrated into a custom relational database with an associated Web portal to facilitate
expert curation, results browsing and querying. We describe the use of workflow technology to
arrange secreted proteins into families and the subsequent study of the relationships between and
within these families. The design of the workflows, the architecture and the reasoning behind our
approach are discussed.
produced by these bacteria and secreted across the cytoplasmic membrane.

This problem places different requirements on the workflow and the surrounding architecture than previously described in the bioinformatics workflow domain. Here, we describe in detail the background to the problem and the biological analysis that we wish to perform to address this problem. Finally, we describe the architecture that we have developed to support this analysis and discuss some preliminary results of its application.

2. The Secretome

One of the main mechanisms that bacteria use to interact with their environment is to synthesise proteins and export them from the cell into their external surroundings. These secreted proteins are often important in the adaptation of bacteria to a particular environment. The entire complement of secreted proteins is often referred to as the secretome. Characterising secreted proteins and the mechanisms of their secretion can reveal a great deal about the capabilities of an organism. For example, soil organisms secrete macromolecular hydrolases to recover nutrients from their immediate surroundings. During infection, many pathogenic bacteria secrete harmful enzymes and toxins into the extracellular environment. These secreted virulence proteins can subvert the host defence systems, for example by limiting the influence of the innate immune system, and facilitating the entry, movement and dissemination of bacteria within infected tissues [4]. Bacteria may also use secreted proteins to communicate with each other, allowing them to form complex communities and to adapt quickly to changing conditions [5].

The secretomes of pathogens are therefore of great interest. The comparison of virulent and non-virulent bacteria at the genomic and proteomic levels can aid our understanding of what makes a bacterium pathogenic and how it has evolved [6]. Furthermore, the characterisation of secretory virulence related proteins could ultimately lead to the identification of therapeutic targets.

The interest in protein secretion not only includes the secreted proteins themselves, but also those proteins which form the secretory machinery used to export the proteins across the cytoplasmic membrane and, in the case of Gram-negative bacteria, the outer membrane. Secretory proteins incorporate a short sequence called a signal peptide, which acts as a trafficking signal directing the protein to the secretory machinery. Not all proteins that are translocated from the site of synthesis are secreted into the surrounding medium: transmembrane proteins are localised to the cytoplasmic membrane itself; lipoproteins attach themselves to the outer surface of the membrane, protruding into the periplasmic space of Gram-negative bacteria or the membrane/wall interface of Gram-positive bacteria. Finally, proteins may also attach themselves to the cell wall by covalent or ionic interactions; the former may be distinguished by an LPXTG motif that is found in the C-terminal domains of mature secreted proteins [7].

Recently, our understanding of the secretome has greatly improved due to the rapidly increasing number of complete bacterial genome sequences, essentially molecular blueprints containing the information that allows the protein repertoire of an organism to be defined. Armed with this sequence information, we can begin to predict which of the proteins an organism is capable of producing are destined to be secreted, as well as the mechanisms of their secretion. As a result, a number of studies have been undertaken in which bioinformatics programs and resources play a vital role. These studies involve the repeated application of a number of different algorithms to all of the gene and protein sequences encoded on the genome. Many of these algorithms are computationally expensive and, given that an average bacterial genome can encode around 4,000 or more proteins, the process can become computationally bound. In addition, the results of the application of these algorithms need to be stored and integrated in order to make a prediction about the secretory status of the entire set of proteins encoded by a particular genome. Often, the results of the classification algorithms may be error prone, and mechanisms to permit expert human curation and results browsing also need to be established.

Biologists have already begun to apply conventional bioinformatics technology to the prediction and classification of secreted proteins. The ‘first, largely genome-based survey of a secretome’ was carried out using bioinformatics tools on the genome of the industrially important bacterium, Bacillus subtilis [8], using legacy tools called from custom scripts in combination with expert curation.

In this study we describe the development and application of e-Science workflows and a service-oriented approach to the genomic scale detection and characterisation of secreted
proteins from Bacillus species. Bacillus species are important not only for their industrial importance in the production of enzymes and pharmaceuticals, but also because of the diversity of characteristics shown by the members of this genus. The Bacillus genus includes species that are soil inhabitants, able to promote plant growth and produce antibiotics. The genus also includes harmful bacteria such as Bacillus anthracis, the causative agent of anthrax.

We utilised the system to make predictions about the secretomes of 12 Bacillus species for which complete genomic sequences are publicly available; this includes B. cereus (strain ZK/E33L), B. thuringiensis konkukian (strain 97-27), B. anthracis (Sterne), B. anthracis (Ames ancestor), B. anthracis (Ames, isolate Porton), B. cereus (ATCC 10987), B. cereus (strain ATCC 14579/DSM 31), B. subtilis (168), B. licheniformis (strain DSM 13/ATCC 14580, sub_strain Novozymes), B. licheniformis (DSM 13/ATCC 14580, sub_strain Goettingen), B. clausii (KSM-K16) and B. halodurans (C-125/JCM 9153). These predictions were automatically integrated into a custom relational database with an associated Web portal to facilitate expert curation, results browsing and querying.

3. Workflow Approach

In this study, the aim was to identify proteins that are likely to be secreted and classify them according to the putative mechanism of their secretion. Such proteins include those exported to the extracellular medium, as well as proteins that attach themselves to the outer surface of the membrane (lipoproteins) and cell wall binding proteins (sortase mediated proteins containing an LPXTG motif). Once secreted proteins had been classified, we then investigated the composition of the predicted secretomes in the twelve species under study.

Two workflows were designed and implemented: the classification workflow and the analysis workflow. The classification workflow is concerned with making predictions about the secretory characteristics of a particular protein from a given set of proteins. The analysis workflow processes the data from the first workflow in order to analyse the function of the secreted proteins that have been found.

A general feature of the workflows is their linear construction. In the classification workflow, for example, at each step the set is reduced in size, removing those proteins that do not require further classification. A conceptual diagram illustrating the functionality of the classification workflow is shown in Figure 1. The results of the classification process are stored in a remote relational database and the reduced set of proteins passed to the next service in the workflow. In the analysis workflow, data derived from the classification workflow is retrieved from the database and analysed using a further series of service enabled tools. The architectural constraints responsible for this choice of design are discussed further in section 4. The data flow for both workflows combined is summarised in Figure 2.

Figure 1. Basic representation of the functionality of the classification workflow. Shaded boxes indicate the set of secreted proteins. The number in brackets represents the total number of proteins to be classified at each level, across the 12 Bacillus species.

With respect to the workflow functionality, for the classification workflow the first objective was the prediction of lipoproteins. This was the responsibility of the first service in the workflow, which takes as input a set of all predicted proteins derived from the EMBL record for the complete genome sequence. This service employed LipoP [9] for the detection of lipoproteins. Following the prediction of lipoproteins, the putative ‘non-lipoprotein’ set of proteins was analysed for the presence of a signal peptide using a SignalP Web Service [10]; this was performed after the lipoprotein identification because of possible limitations in the efficiency of SignalP at detecting
lipoproteins. The use of SignalP at this point also removes most of the proteins from subsequent analysis; this has considerable advantages, as the downstream analyses are potentially computationally intensive.

Proteins with a putative N-terminal signal peptide as well as additional transmembrane domains are likely to be retained in the membrane. To identify these putative membrane proteins from among the proteins in the signal peptide dataset, we used a combination of three transmembrane prediction Web Services based on the tools TMHMM, MEMSAT and tmap, respectively [11]. A subsequent service in the workflow was responsible for integrating the results derived from these three tools, to make a final prediction about the presence of a putative transmembrane protein. Finally, the protein dataset corresponding to proteins with no predicted transmembrane domain was analysed for the cell wall binding amino acid motif LPXTG. The tool ps_scan (the standalone version of ScanProsite), which scans PROSITE for the LPXTG motif [12], was wrapped as a Web Service and called from the workflow. The classification workflow was validated in its predictive capability by applying it to the proteins of Bacillus subtilis, whose secretory status has been determined and, to a large extent, experimentally confirmed [8, 13].

For the analysis workflow, the secreted proteins dataset was analysed to provide information about the relationships between the secretomes of the 12 different organisms in the study. Putative secreted proteins were extracted from the database, clustered into families, and the structure, functional composition and relationships between these families were studied. The set of secreted proteins includes those predicted to be lipoproteins, cell wall binding, or extracellular. Transmembrane proteins and cytoplasmic proteins were disregarded. Analysis of the data was initiated by clustering the putative secretory proteins into protein families using the MCL graph clustering algorithm [14]. MCL provides a computationally efficient means of clustering large datasets. BLASTp data was used with the MCL algorithm to identify close relatives of the predicted secreted proteins. This approach follows that of [15], where the BLASTp algorithm provides a score for the putative similarity between proteins. The necessary BLASTp data was retrieved from the Microbase system, a Web Service enabled resource for the comparative genomics of microorganisms [16]. For each predicted secreted protein, similar proteins with a BLASTp expect value less than 1e-10 were used as input to MCL (inflation value 1.2).

Hierarchical clustering was performed to identify phylogenetic relations between the Bacillus species in the context of their contributions to the protein families. The R package was wrapped as a Web Service for this purpose. R is a package providing a statistical computing and graphics environment (R Project website, http://www.r-project.org/). A distance matrix was constructed using the Euclidean distance, and clustering was carried out using a complete linkage method.

Figure 2. Data flow through the classification and analysis workflows.

4. Architecture

The architectural view of the workflow components involved in the prediction and analysis of the secretomes of the Bacillus species is shown in Figure 3.

Figure 3. Architectural layout of the classification and analysis workflow components.

We sought to avoid implementing services on the client machine from which the workflows were enacted. This is particularly important for those services interacting with the database. In order to maximise performance, we endeavoured to locate the service as close to the database as possible. However, we also found continual problems with the reliability and availability of services provided by outside, autonomous individuals. Another problem was the restrictions placed on the usage and
dissemination of some tools. Running such tools may also edit and curate the database as
locally required licensing agreements to be appropriate. A screenshot of the database portal
signed, limiting the tool usage to the licensees. is shown in Figure 4.
As a result of these factors, many of the
required services were implemented in-house,
although still delivered using the Web Services
infrastructure. The services were orchestrated
using SCUFL workflows and constructed and
enacted using the Taverna workbench [3]. The workflows implement the bioinformatics processes outlined in Figure 2. Most of the services both use and return text-based information in standard bioinformatics formats. At all steps of the classification workflow, we have extracted the intermediate results into a custom-designed database. This was achieved using a set of bespoke data storage services which parse the raw textual results and store them in a structured form. It is these structured data which are used to feed later services in both the classification and analysis workflows. In our case, the custom database is hosted on a machine in close network proximity to the Web Services. This has the significant advantage of reducing the network costs involved in transferring data to and from the database.

After completion of the classification workflow, the custom database contains the data relating to each protein analysed, including the raw data as well as an integrated summary of the analysis. Tracking the provenance of the data is important in this context, because there are a number of different routes for the classification workflow to designate a protein as 'secretory'. The basic operational provenance provided by Taverna also aids in the identification of service failures [17]. This is particularly important while running the transmembrane domain prediction services, as these run concurrently; a failure in one, therefore, does not impact on the execution of the classification workflow, although it may return incomplete conclusions and thus needs to be recorded.

We have developed a Web portal to provide a user-friendly and familiar mechanism for accessing the secretome data in the database.¹ From this site, users can select the bacterial species in which they are most interested and view the corresponding results. Data are initially displayed as a table from where the users can navigate further to view the details of the classifications. The protein sequences may be viewed along with an overlay of predicted signal peptides and their cleavage sites. Users …

Figure 4. Screenshot of the Web portal summarising the characteristics of predicted secreted proteins from B. subtilis (168).

The analysis process is the most computationally intensive section, requiring a large number of BLASTp searches. BLAST is the most commonly performed task in bioinformatics [1], and as such there are many available services which could have been used. However, because of the computationally intensive nature of large BLAST searches, we retrieved pre-computed BLAST results from the Microbase database. Microbase is a Grid-based system for the automatic comparative analysis of microbial genomes [16]. Amongst other data, Microbase stores all-against-all BLASTp searches for over three hundred microbial genome sequences, and the data are accessible in a Web Service-based fashion.

The final visualisation steps in the analysis workflow were performed using R and MCL. Again, both services were implemented locally, although both were exposed using Web Services. Completion of the entire workflow took approximately 2-3 hours.

This architecture differs from others using Grid technology. Whilst the original requirements appeared to favour a highly distributed approach, we have found that the technological and licensing constraints have led to a hybrid approach: a combination of services and workflows, combined with databases for both results storage and pre-caching of analyses. The combination of all of these technologies does result in a fairly complex architecture, but provides a system that is fast and reliable enough for practical use.

¹ At http://bioinf.ncl.ac.uk:8081/BaSPP
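The bespoke data storage services described above parse raw textual tool output and load it into the structured database. A minimal sketch of this parse-and-store pattern, assuming a hypothetical tab-separated report format and an illustrative SQLite schema (the authors' actual services and schema are not published here):

```python
import sqlite3

def parse_report(raw):
    """Parse a hypothetical tab-separated report: protein, secreted?, cleavage site, score."""
    rows = []
    for line in raw.strip().splitlines():
        if line.startswith("#"):      # skip header/comment lines
            continue
        protein, secreted, cleavage, score = line.split("\t")
        rows.append((protein, 1 if secreted == "YES" else 0,
                     int(cleavage), float(score)))
    return rows

def store(conn, rows):
    """Store parsed rows in a structured table that downstream services can query."""
    conn.execute("""CREATE TABLE IF NOT EXISTS prediction (
                        protein TEXT, secreted INTEGER,
                        cleavage_site INTEGER, score REAL)""")
    conn.executemany("INSERT INTO prediction VALUES (?, ?, ?, ?)", rows)

raw = ("# protein\tsecreted\tcleavage\tscore\n"
       "BSU01234\tYES\t28\t0.97\n"
       "BSU09999\tNO\t0\t0.12")
conn = sqlite3.connect(":memory:")
store(conn, parse_report(raw))
secreted = [p for (p,) in conn.execute(
    "SELECT protein FROM prediction WHERE secreted = 1")]
print(secreted)  # ['BSU01234']
```

Later classification and analysis services can then read these structured rows rather than re-parsing raw text at each step.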
Proceedings of the UK e-Science All Hands Meeting 2006, © NeSC 2006, ISBN 0-9553988-0-0
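Retrieving pre-computed all-against-all BLASTp results instead of re-running BLAST is essentially a cache-lookup pattern. An illustrative sketch, with `query_microbase` standing in for the real Microbase Web Service call (hypothetical function and data, not the Microbase API):

```python
# In-memory stand-in for the remote Microbase store of pre-computed
# all-against-all BLASTp hits (hypothetical identifiers and E-values).
PRECOMPUTED = {
    "BSU01234": [("BC5678", 2e-30), ("BT9012", 1e-12)],
}

def query_microbase(protein_id):
    """Stand-in for a Web Service call returning cached BLASTp hits."""
    return PRECOMPUTED.get(protein_id, [])

def blast_hits(protein_id, evalue_cutoff=1e-10):
    """Fetch pre-computed hits and filter by E-value, avoiding a fresh BLAST run."""
    return [(subject, evalue)
            for subject, evalue in query_microbase(protein_id)
            if evalue <= evalue_cutoff]

hits = blast_hits("BSU01234")
print(hits)  # [('BC5678', 2e-30), ('BT9012', 1e-12)]
```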
An investigation into the functional distribution of the proteins comprising the secretome of each strain was carried out by classifying them into functionally related families based on their sequence similarity as defined by BLASTp. The member proteins of the 12 secretomes were arranged into 543 families of 2 or more members. Some 350 proteins showed no similarity to other proteins and hence did not fall into families. Core protein families that contain members from all 12 proteomes are of particular interest, since they may represent secreted proteins whose functions are indispensable for growth in all environments. 9% of protein families were found to be 'core' and the functions of these were investigated.

Figure 5. Functional classification of the 'core' secreted protein families. The graph shows the number of 'core' secreted proteins per 'SubtiList' category (categories include transport/binding proteins and lipoproteins; cell wall; protein secretion; sporulation; metabolism of carbohydrates and related molecules; specific pathways; transformation/competence; metabolism of lipids; metabolism of phosphate; transcription regulation; and metabolism of amino acids and related molecules).

The Gene Ontology terms [18] of the genes encoding the proteins in each cluster were examined and then summarised by classifying the terms according to the "SubtiList" classification codes [19]. Figure 5 shows a summary of the different functional classifications of the 'core' secreted protein families. Interestingly, a large number of core families had not been experimentally characterised and remain of unknown function. More predictably, many core proteins were grouped into families concerned with cell wall related functions, transporter proteins and proteins responsible for membrane biogenesis.

In addition to defining core protein families, we were also interested in gaining some insight into the functions of protein families that are specific to pathogens and those that are specific to non-pathogens. Canned queries over the database allowed these results to be easily and repeatedly extracted. Interestingly, only 14 of the 106 protein families that were unique to the secretomes of the potentially pathogenic bacteria (B. cereus (ZK / E33L), B. thuringiensis konkukian (97-27), B. anthracis (Sterne), B. anthracis (Ames ancestor), B. anthracis (Ames, isolate Porton), B. cereus (strain ATCC 10987) and B. cereus (ATCC 14579 / DSM 31)) showed similarity to proteins with known functions. Thus, the majority were found to be of unknown function and remain to be characterised.

The secretory protein families that were unique to the non-pathogens showed functions that were indicative of their habitats. Of the 11 …

Finally, we were interested to determine whether the secretomes of the pathogenic organisms were more closely related to each other in terms of their functional composition than to those of the non-pathogens. The phylogeny of the Bacillus strains was investigated in the context of the relationships between their secretomes. This was illustrated using a dendrogram, constructed using the R package, in which the relation between the different Bacillus strains is based on their contribution to the predicted secreted protein families (Figure 6). Essentially, the dendrogram highlights the level of similarity between the secretomes of the various strains.

Within the B. cereus group subcluster (B. anthracis, B. cereus and B. thuringiensis), two sub-clusters were formed by the well-established pathogens (CP000001 B. cereus, AE017355 B. thuringiensis konkukian, AE017225 B. anthracis, AE017334 B. anthracis, AE016879 B. anthracis), while the two members of questionable pathogenesis (AE017194 B. cereus, AE016877 B. cereus) formed a separate cluster. The environmental strains (B. subtilis, B. licheniformis, B. clausii, B. halodurans) formed a separate cluster from that of the B. cereus group organisms.

6. Discussion

From the biologist's perspective: From the perspective of a biologist, construction of a workflow that enables secretory protein prediction over bacterial genomes using multiple prediction tools, integrates the results into a database, and then performs analysis on the families, is a novel development. This approach utilises current bioinformatics programs in order to make predictions which would otherwise take several days if performed manually. In particular, the ease with which a workflow may be re-run as and when new genomes are sequenced is a distinct advantage, especially as the rate of complete genome sequencing continues to increase. The initial time spent in developing the application should also be considered, though this factor reduces in significance with re-execution of the workflow across a number of genomes. Although the details of the workflow are specific to the problem of predicting secretory … microbiologists, and this work provides data to prime further, biologically oriented investigations.

From the eScientist's perspective: We have shown here the effectiveness of an e-Science approach, generating biologically meaningful results. These results can be used to generate hypotheses that may be verified by experimental approaches.

This process of data analysis was simplified through the use of Microbase. The comparison of the bacterial proteins had already been done, therefore reducing the computing time required in analysing the results.

Figure 6. Dendrogram relating the Bacillus strains by the similarity of their predicted secretomes (leaves labelled by accession: CP000001, AE017355, AE017225, AE017334, AE016879, AE017194, AE016877, AL009126, CP000002, AE017333, AP006627, BA000004).
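The dendrogram step clusters strains by their contribution to the secreted protein families. The authors used R; a rough stdlib-Python sketch of the same idea, using Jaccard distance between hypothetical family-membership sets and a naive single-linkage agglomeration (illustrative only, not the published analysis):

```python
def jaccard_distance(a, b):
    """1 - |intersection| / |union| of two family-membership sets."""
    return 1 - len(a & b) / len(a | b)

def single_linkage(strains):
    """Naive agglomerative clustering; returns a nested-tuple dendrogram."""
    clusters = {name: (name, frozenset(fams)) for name, fams in strains.items()}
    while len(clusters) > 1:
        # find the closest pair of clusters under Jaccard distance
        pairs = [(jaccard_distance(fa, fb), na, nb)
                 for na, (ta, fa) in clusters.items()
                 for nb, (tb, fb) in clusters.items() if na < nb]
        _, na, nb = min(pairs)
        ta, fa = clusters.pop(na)
        tb, fb = clusters.pop(nb)
        clusters[na + "+" + nb] = ((ta, tb), fa | fb)
    (tree, _), = clusters.values()
    return tree

strains = {  # hypothetical strain -> secreted-protein-family memberships
    "B. anthracis": {"f1", "f2", "f3"},
    "B. cereus":    {"f1", "f2", "f4"},
    "B. subtilis":  {"f5", "f6"},
}
print(single_linkage(strains))  # (('B. anthracis', 'B. cereus'), 'B. subtilis')
```

The nested tuples mirror the dendrogram topology: strains sharing more families merge earlier, just as the B. cereus group strains subcluster together in Figure 6.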
… data would provide a significant advantage; in many cases the service executables are smaller than the data they operate over. It should be noted that licensing issues will prevent this in many cases.

Data heterogeneity has provided us with fewer problems than expected for a bioinformatics workflow. There are relatively few data types; most often, protein lists are passed between the services in the workflows. Data heterogeneity was dealt with through the use of a custom database and parsing code. In the second workflow, the use of the Microbase data warehouse removed many of these problems.

One major future enhancement is to use notification in conjunction with the workflow. This would provide a way of automatically analysing recently annotated genomes for secretory proteins. The need for users to interact with the workflow would therefore be removed, providing automatic updates of the database with new secretory protein data. We also intend to investigate the migration of workflows from the machine of their conception to a remote enclosed environment from which they are able to access services that cannot be exposed directly to third parties.

In conclusion, the investigation of predicted bacterial secretomes through the use of workflows and e-Science technology indicates the potential use of computing resources for maximising the information gained as multiple genomic sequences are generated. The knowledge gained from large-scale analysis of secretomes can be used to generate inferences about bacterial evolution and niche adaptation, to lead to hypotheses that inform future experimental studies, and possibly to identify proteins as candidate drug targets.

7. Acknowledgments

We gratefully acknowledge the support of the North East regional e-Science centre and the European Commission (LSHC-CT-2004-503468). We thank the EPSRC and Non-Linear Dynamics for supporting Tracy Craddock.

8. References

1. Stevens RD, Tipney HJ, Wroe CJ, Oinn TM, Senger M, Lord PW, Goble CA, Brass A, Tassabehji M, 2004. Bioinformatics, 20 Suppl. 1: I303-I310.
2. Li P, Hayward K, Jennings C, Owen K, Oinn T, Stevens R, Pearce S, Wipat A, 2004. Proceedings of the UK e-Science All Hands Meeting 2004, 31st Aug - 3rd Sept, Nottingham, UK.
3. Oinn T, Addis M, Ferris J, Marvin D, Senger M, Greenwood M, Carver T, Glover K, Pocock MR, Wipat A, Li P, 2004. Bioinformatics, 20(17): 3045-54.
4. Nomura K, He SY, 2005. PNAS, 102(10): 3527-3528.
5. Piazza F, Tortosa P, Dubnau D, 1999. J Bacteriol., 181(15): 4540-8.
6. Lee VT, Schneewind O, 2001. Genes Dev., 15(14): 1725-1752.
7. Boekhorst J, de Been MW, Kleerebezem M, Siezen RJ, 2005. J Bacteriol., 187(14): 4928-34.
8. Tjalsma H, Bolhuis A, Jongbloed JDH, Bron S, van Dijl JM, 2000. Microbiol. Mol. Biol. Rev., 64(3): 515-547.
9. Juncker AS, Willenbrock H, von Heijne G, Nielsen H, Brunak S, Krogh A, 2003. Protein Sci., 12(8): 1652-62.
10. Nielsen H, Krogh A, 1998. Proc Int Conf Intell Syst Mol Biol. (ISMB 6), 6: 122-130.
11. Moller S, Croning MDR, Apweiler R, 2001. Bioinformatics, 17(7): 646-653.
12. Gattiker A, Gasteiger E, Bairoch A, 2002. Appl Bioinformatics, 1(2): 107-108.
13. Tjalsma H, Antelmann H, Jongbloed JDH, Braun PG, Darmon E, Dorenbos R, Dubois JY, Westers H, Zanen G, Quax WJ, Kuipers OP, Bron S, Hecker M, van Dijl JM, 2004. Microbiol. Mol. Biol. Rev., 68: 207-233.
14. van Dongen S, 2000. A cluster algorithm for graphs. Technical Report INS-R0010, National Research Institute for Mathematics and Computer Science in the Netherlands, Amsterdam.
15. Enright AJ, Van Dongen S, Ouzounis CA, 2002. Nucleic Acids Res., 30(7): 1575-1584.
16. Sun Y, Wipat A, Pocock M, Lee PA, Watson P, Flanagan K, Worthington JT, 2005. The 5th IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2005), Cardiff, UK, May 9-12.
17. Zhao J, Stevens R, Wroe C, Greenwood M, Goble C, 2004. In Proceedings of the UK e-Science All Hands Meeting, Nottingham, UK, 31 Aug-3 Sept.
18. Gene Ontology Consortium, 2006. Nucleic Acids Res., 34 (Database issue): D322-6.
19. Moszer I, Jones LM, Moreira S, Fabry C, Danchin A, 2002. Nucleic Acids Res., 30(1): 62-5.
20. Oinn T, Pocock M, pers. comm.
a Cavendish Laboratory, University of Cambridge, CB3 0HE, UK
b Department of Physics, University of Lancaster, LA1 4YB, UK
c CERN, CH-1211 Geneva 23, Switzerland
d School of Physics and Astronomy, University of Birmingham, B15 2TT, UK
Abstract
The ATLAS experiment, based at the Large Hadron Collider at CERN, Geneva, is currently developing a grid-based distributed system capable of supporting its data analysis activities, which require the processing of data volumes of the order of petabytes per year. The distributed analysis system aims to bring the power of computation on a global scale to thousands of physicists by enabling them to easily tap into the vast computing resources of various grids such as LCG, gLite, OSG and NorduGrid for their analysis activities, whilst shielding them from the complexities of the grid environment. This paper outlines the ATLAS distributed analysis model, the ATLAS data management system, and the multi-pronged ATLAS strategy for distributed analysis in a heterogeneous grid environment. Various frontend clients and backend submission systems will be discussed before concluding with a status update of the system.
As with many large collaborations, different ATLAS physics groups have different work models, and the distributed analysis system needs to be flexible enough to support all current and newly emerging work models whilst remaining robust.

3. Distributed Data Management

With data distributed at computing facilities around the world, an efficient system to manage access to this data is crucially important for effective distributed analysis.

Users performing data analysis need a random access mechanism to allow rapid pre-filtering of data based on certain selection criteria so as to identify data of specific interest. This data then needs to be readily accessible by the processing system.

Analysis jobs produce large amounts of data. Users need to be able to store and gain access to their data in the grid environment. In the grid environment, where data is not centrally known, an automated management system that has the concept of file ownership and user quota management is essential.

To meet these requirements, the Distributed Data Management (DDM) system [11] has been developed. It provides a set of services to move data between grid-enabled computing facilities whilst maintaining a series of databases to track these data movements. The vast amount of data is also grouped into datasets based on various criteria (e.g. physics characteristics, production batch run, etc.) for more efficient query and retrieval. DDM consists of three components: a central dataset catalogue, a subscription service and a set of client tools for dataset lookup and replication.

The central dataset catalogue is in effect a collection of independent internal services and catalogues that collectively function as a single dataset bookkeeping system.

Subscription services enable data to be automatically pulled to a site. A user can ensure that he/she is working on the latest version of a dataset by simply subscribing to it. Any subsequent changes to this dataset (i.e. additional files, version changes, etc.) will trigger a fresh download of the updated version automatically.

Client tools provide users with the means to interact with the central dataset catalogue. Typical actions include listing, retrieving and inserting of datasets.

4. Distributed Analysis Strategy

ATLAS takes a multi-pronged approach to distributed analysis by exploiting its existing grid infrastructure directly via the various supported grid flavours and indirectly via the ATLAS Production System [12].

Figure 2: Distributed analysis strategy

4.1 Frontend Clients

Figure 2 shows various frontend clients enabling distributed analysis on existing grid infrastructure.

Pathena [13] is a Python script designed to enable access to OSG resources via the Panda job management system [14]. It is just short of becoming a drop-in replacement for the executable used in the ATLAS software framework. Users are able to exploit distributed resources for their analysis activities with very minimal inconvenience.
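The subscription model described above (subscribe once, and every new version is pulled automatically) can be sketched as follows. `DatasetCatalogue` and its methods are hypothetical illustrations of the behaviour, not the DDM API:

```python
class DatasetCatalogue:
    """Toy central catalogue: tracks dataset versions and site subscriptions."""
    def __init__(self):
        self.versions = {}      # dataset name -> latest version number
        self.subscribers = {}   # dataset name -> set of subscribed sites
        self.replicas = {}      # (site, dataset) -> version held at that site

    def subscribe(self, site, dataset):
        """A site subscribes once and immediately receives the current version."""
        self.subscribers.setdefault(dataset, set()).add(site)
        self._replicate(dataset)

    def publish(self, dataset):
        """Register a new dataset version; subscribed sites get it automatically."""
        self.versions[dataset] = self.versions.get(dataset, 0) + 1
        self._replicate(dataset)

    def _replicate(self, dataset):
        # pull the latest version to every subscribed site
        for site in self.subscribers.get(dataset, set()):
            self.replicas[(site, dataset)] = self.versions.get(dataset, 0)

cat = DatasetCatalogue()
cat.publish("some.dataset")           # version 1 appears in the catalogue
cat.subscribe("RAL", "some.dataset")  # site pulls the current version
cat.publish("some.dataset")           # version 2 triggers a fresh download
print(cat.replicas[("RAL", "some.dataset")])  # 2
```

The point of the pattern is that the user never re-requests the data: publishing a new version drives replication to all subscribers.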
Pathena makes the submission of analysis jobs to the Panda system a painless two-stage process involving an optional build step (where user code can be compiled) followed by an execution step (with built-in job splitting capabilities). A further merge step is in development, which will allow the resulting output datasets from split jobs to be consolidated.

ATCOM [15] is the dedicated graphical user interface frontend (see Figure 1) to the ATLAS production system, designed to be used by a handful of expert users involved in large-scale organised production of ATLAS data. It has potential to be used for end-user distributed analysis purposes.

GANGA [16] is a powerful yet user-friendly frontend tool for job definition and management, jointly developed by the ATLAS and LHCb [17] experiments. GANGA provides distributed analysis users with access to all grid infrastructure supported by ATLAS. It does so by interfacing to an increasing array of submission backend mechanisms. Submission to the production system and ARC is planned for the not so distant future.

GANGA currently provides two user interface clients: a Command Line Interface (CLI) and a Graphical User Interface (GUI) (see Figure 3). In addition, it can also be embedded in scripts for non-interactive/repetitive use. GANGA, due to its need to satisfy ATLAS and LHCb experiment requirements (unlike Pathena and ATCOM, which are specialised tools designed for specific ATLAS tasks), has been designed from the outset to be a highly extensible generic tool with a component plug-in architecture. This pluggable framework makes the addition of new applications and backends an easy task.

A synergy of GANGA and DIANE [18] (a job-distribution framework) has been adopted in several instances. In each instance, GANGA was used to submit various types of jobs to the Grid, including the search for drugs to combat Avian flu, regression testing of Geant 4 [19] to detect simulation result deviations, and the optimisation of the evolving plan for radio frequency sharing between 120 countries.

A few physics experiments (e.g. BaBar [20], NA48 [21]) have also used GANGA to varying degrees, while there are others (e.g. PhenoGrid [22], Compass [23]) in the preliminary stages of looking to exploit GANGA for their applications. GANGA is currently in active development with frequent software releases, and it has an increasing pool of active developers.

4.2 Production System

The ATLAS production system provides an interface layer on top of the various grid middleware used in ATLAS. There is increased robustness, as the distributed analysis system benefits from the production system's experience with the grid and its retry and fallback mechanisms for both data and workload management.

A rapidly maturing product, the production system provides various facilities that are useful for distributed analysis, e.g. user-configurable jobs, X509 certificate-based access control and a native graphical user interface, ATCOM.

4.3 LCG and gLite

LCG is the computing grid designed to cater to the needs of all the LHC experiments and is by default the main ATLAS distributed analysis target system. Access to the LHC grid resources is via the LCG Resource Broker (RB) or CondorG [24]. The LCG RB is a robust submission mechanism that is scalable, reliable and has a high job throughput. CondorG, although conceptually similar to the LCG RB, …
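GANGA's component plug-in architecture, in which a generic job object dispatches to interchangeable application and backend plug-ins, can be illustrated abstractly. The classes below are hypothetical sketches of the design, not GANGA's real API:

```python
class Backend:
    """Plug-in interface: each grid flavour implements its own submit()."""
    def submit(self, application):
        raise NotImplementedError

class LCGBackend(Backend):
    def submit(self, application):
        return f"submitted {application} via LCG Resource Broker"

class PandaBackend(Backend):
    def submit(self, application):
        return f"submitted {application} via Panda"

class Job:
    """Generic job: an application paired with an interchangeable backend."""
    def __init__(self, application, backend):
        self.application = application
        self.backend = backend

    def submit(self):
        # the job delegates submission to whichever backend is plugged in
        return self.backend.submit(self.application)

job = Job("athena-analysis", LCGBackend())
print(job.submit())          # submitted athena-analysis via LCG Resource Broker
job.backend = PandaBackend() # swap backends without touching the application
print(job.submit())          # submitted athena-analysis via Panda
```

Adding a new grid flavour then means writing one new `Backend` subclass, which is the extensibility property the text attributes to GANGA's pluggable framework.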
5. Conclusion

Distributed analysis in ATLAS is still in its infancy but is evolving rapidly. Many key components like the DDM system have only just come online. The multi-pronged approach to distributed analysis will encourage one submission system to learn from another and ultimately produce a more robust and feature-rich distributed analysis system. The distributed analysis system will be comprehensively tested and benchmarked as part of Service Challenge 4 [30] in the summer of 2006.

Acknowledgements

We are pleased to acknowledge support for the work on the ATLAS distributed analysis system from GridPP in the UK and from the ARDA group at CERN. GridPP is funded by the UK Particle Physics and Astronomy Research Council (PPARC). ARDA is part of the EGEE project, funded by the European Union under contract number INFSO-RI-508833.

References

[1] http://cern.ch
[2] ATLAS Collaboration, Atlas - Technical Proposal, CERN/LHCC94-43 (1994); http://atlas.web.cern.ch/Atlas/
[3] LHC Study Group, The LHC conceptual design report, CERN/AC/95-05 (1995); http://lhc.web.cern.ch/lhc/
[4] http://lcg.web.cern.ch/LCG/
[5] http://www.opensciencegrid.org/
[6] http://www.nordugrid.org/
[7] ATLAS Computing Group, ATLAS Computing Technical Design Report, CERN-LHCC-2005-022; http://atlas-proj-computing-tdr.web.cern.ch/atlas-proj-computing-tdr/PDF/Computing-TDR-final-July04.pdf
[8] D. Liko et al., The ATLAS strategy for Distributed Analysis in several Grid infrastructures, in: Proc. 2006 Conference for Computing in High Energy and Nuclear Physics (Mumbai, India, 2006); http://indico.cern.ch/contributionDisplay.py?contribId=263&sessionId=9&confId=048
[9] G. van Rossum and F.L. Drake, Jr. (eds.), Python Reference Manual, Release 2.4.3 (Python Software Foundation, 2006); http://www.python.org/
[10] http://cern.ch/atlas-proj-computing-tdr/Html/Computing-TDR-21.htm#pgfId-1019542
[11] ATLAS Database and Data Management Project; http://atlas.web.cern.ch/Atlas/GROUPS/DATABASE/project/ddm/
[12] ATLAS Production System; http://uimon.cern.ch/twiki/bin/view/Atlas/ProdSys
[13] T. Maeno, Distributed Analysis on Panda; http://uimon.cern.ch/twiki/bin/view/Atlas/DAonPanda
[14] T. Wenaus, K. De et al., Panda - Production and Distributed Analysis; http://twiki.cern.ch/twiki//bin/view/Atlas/Panda
[15] http://uimon.cern.ch/twiki/bin/view/Atlas/AtCom
[16] http://ganga.web.cern.ch/ganga/
[17] LHCb Collaboration, LHCb - Technical Proposal, CERN/LHCC98-4 (1998); http://lhcb.web.cern.ch/lhcb/
[18] http://it-proj-diane.web.cern.ch/it-proj-diane/
[19] http://geant4.web.cern.ch/geant4/
[20] http://www-public.slac.stanford.edu/babar/
[21] http://na48.web.cern.ch/NA48/
[22] http://www.phenogrid.dur.ac.uk/
[23] http://wwwcompass.cern.ch/
[24] http://www.cs.wisc.edu/condor/condorg/
[25] http://glite.web.cern.ch/glite/
[26] http://egee-intranet.web.cern.ch/egee-intranet/gateway.html
[27] http://www.nordugrid.org/middleware/
[28] http://www.globus.org/
[29] http://www.nordugrid.org/monitor/
[30] LHC Computing Grid Deployment Schedule 2006-08, CERN-LCG-PEB-2005-05; http://lcg.web.cern.ch/LCG/PEB/Planning/deployment/Grid%20Deployment%20Schedule.htm
Author Index
… Engineering Perspective
Gong, X | Workshop Paper | GOLD Infrastructure for Virtual Organisations | 11
Goodale, Tom | Regular Paper | Service-Oriented Matchmaking and Brokerage | 289
  | Regular Paper | gridMonSteer: Generic Architecture for Monitoring and Steering Legacy Applications in Grid Environments | 769
Goodman, Daniel | Regular Paper | Martlet: A Scientific Work-Flow Language for Abstracted Parallisation | 45
Greenwood, Phil | Regular Paper | An Intelligent and Adaptable Grid-based Flood Monitoring and Warning System | 53
Grimstead, Ian | Poster | Collaborative Visualization of 3D Point Based Anatomical Model within a Grid Enabled Environment | 320
Gronbech, Peter | Regular Paper | GridPP: Running a Production Grid | 598
Guerri, D | Regular Paper | The BIOPATTERN Grid - Implementation and Applications | 637
Guo, Yike | Regular Paper | Designing a Java-based Grid Scheduler using Commodity Services | 163
  | Poster | Integrating R into Discovery Net System | 357
  | Poster | Distributed, high-performance earthquake deformation analysis and modelling facilitated by Discovery Net | 413
  | Poster | Modelling Rail Passenger Movements through e-Science Methods | 445

H

Haines, Keith | Regular Paper | Building simple, easy-to-use Grids with Styx Grid Services and SSH | 225
Halfpenny, Peter | Regular Paper | The National Centre for e-Social Science | 542
Hallett, Catalina | Regular Paper | Summarisation and Visualisation of e-Health Data Repositories | 69
Hamadicharef, B | Regular Paper | The BIOPATTERN Grid - Implementation and Applications | 637
Hamilton, M D | Workshop Paper | GOLD Infrastructure for Virtual Organisations | 11
Handley, James | Regular Paper | Integrative Biology: Real Science through e-Science | 84
  | Regular Paper | Service-Oriented Approach to Collaborative Visualization | 241
Handley, James W | Regular Paper | Model Based Visualization of Cardiac Virtual Tissue | 233
… Repositories
Wang, Hai H | Poster | Alternative interfaces for OWL ontologies | 425
Wang, Haoxiang | Regular Paper | Service-Oriented Approach to Collaborative Visualization | 241
Wang, Ian | Regular Paper | gridMonSteer: Generic Architecture for Monitoring and Steering Legacy Applications in Grid Environments | 769
Watkins, E Rowland | Regular Paper | Semantic Security in Service Oriented Environments | 693
Watkins, John | Poster | Application of the NERC Data Grid Metadata and Data Models in the NERC Ecological Data Grid (EcoGrid) | 344
Watt, John | Regular Paper | DyVOSE Project: Experiences in Applying Privilege Management Infrastructures | 669
Wendel, Patrick | Regular Paper | Designing a Java-based Grid Scheduler using Commodity Services | 163
Werner, J C | Poster | Grid computing in High Energy Physics using LCG: the BaBar experience | 313
Whillock, Matthew | Regular Paper | SolarB Global DataGrid | 777
White, Toby O H | Regular Paper | A virtual research organization enabled by eMinerals minigrid: an integrated study of the transport and immobilisation of arsenic species in the environment | 481
  | Regular Paper | Using eScience to calibrate our tools: parameterisation of quantum mechanical calculations with grid technologies | 645
  | Regular Paper | Anatomy of a grid-enabled molecular simulation study: the compressibility of amorphous silica | 653
  | Regular Paper | Job submission to grid computing environments | 754
  | Regular Paper | A Lightweight, Scriptable, Web-based Frontend to the SRB | 209
  | Regular Paper | Application and Uses of CML within the eMinerals project | 606
  | Poster | Automatic metadata capture and grid computing | 381
Wild, Matthew | Regular Paper | SolarB Global DataGrid | 777
Williams, Gethin | Regular Paper | Collaborative study of GENIEfy Earth System Models using scripted database workflows in a Grid-enabled PSE | 574
Wilson, Dan J | Regular Paper | Application and Uses of CML within the eMinerals project | 606