XML and Related Technologies certification prep, Part 3: XML processing

Explore how to parse and validate XML documents plus how to use XQuery

Skill Level: Intermediate

Mark Lorenz (mlorenz@nc.rr.com)


Senior Application Architect
Hatteras Software, Inc.

26 Sep 2006

Parsing and validation represent the core of XML. Knowing how to use these
capabilities well is vital to the successful introduction of XML to your project. This
tutorial on XML processing teaches you how to parse and validate XML files as well
as use XQuery. It is the third tutorial in a series of five tutorials that you can use to
help prepare for the IBM certification Test 142, XML and Related Technologies.

Section 1. Before you start


In this section, you'll find out what to expect from this tutorial and how to get the
most out of it.

About this series


This series of five tutorials helps you prepare to take the IBM certification Test 142,
XML and Related Technologies, to attain the IBM Certified Solution Developer - XML
and Related Technologies certification. This certification identifies an
intermediate-level developer who designs and implements applications that make
use of XML and related technologies such as XML Schema, Extensible Stylesheet
Language Transformation (XSLT), and XPath. This developer has a strong
understanding of XML fundamentals; has knowledge of XML concepts and related
technologies; understands how data relates to XML, in particular with issues
associated with information modeling, XML processing, XML rendering, and Web
services; has a thorough knowledge of core XML-related World Wide Web
Consortium (W3C) recommendations; and is familiar with well-known best
practices.

XML processing
© Copyright IBM Corporation 1994, 2006. All rights reserved. Page 1 of 38
developerWorks® ibm.com/developerWorks

Anyone working in software development for the last few years is aware that XML
provides cross-platform capabilities for data, just as the Java® programming
language does for application logic. This series of tutorials is for anyone who wants
to go beyond the basics of using XML technologies.

About this tutorial


This tutorial is the third in the "XML and Related Technologies certification prep"
series that takes you through the key aspects of effectively using XML technologies
on Java projects. This third tutorial focuses on XML processing -- that is, how to
parse and validate XML documents. It lays the groundwork for Part 4, which focuses
on transformation, including the use of XSLT, XPath, and Cascading Style Sheets
(CSS).

This tutorial is written for Java programmers who have a basic understanding of
XML and whose skills and experience are at a beginning to intermediate level. You
should have a general familiarity with defining, validating, and reading XML
documents, as well as a working knowledge of the Java language.

Objectives
After completing this tutorial, you will know how to:

• Parse XML documents using the Simple API for XML 2 (SAX2) and
Document Object Model 2 (DOM2) parsers
• Validate XML documents against Document Type Definitions (DTDs) and
XML Schemas
• Access XML content from databases using XQuery

Prerequisites
This tutorial is written for developers who have a background in programming and
scripting and who have an understanding of basic computer-science models and
data structures. You should be familiar with the following XML-related,
computer-science concepts: tree traversal, recursion, and reuse of data. You should
be familiar with Internet standards and concepts, such as Web browser,
client-server, documenting, formatting, e-commerce, and Web applications.
Experience designing and implementing Java-based computer applications and
working with relational databases is also recommended.

System requirements
To run the examples in this tutorial, you need a Linux® or Microsoft® Windows® box
with at least 50MB of free disk space and administrative access to install software.
The tutorial uses, but does not require, the following software:

• Java software development kit (JDK) 1.4.2 or later
• Eclipse 3.1 or later
• XMLBuddy 2.0 or later (Note: Some portions of the series use capabilities
of XMLBuddy Pro, which is not free.)

See Resources for links to download the above software.

Section 2. Parsing XML documents


You can parse an XML document in multiple ways (see Part 1 of this series, which
focuses on architecture), but the SAX parser and the DOM parser constitute the
primary ways. Part 1 features a high-level comparison of the two (see Resources).

StAX
A new API, called Streaming API for XML (StAX), is to be released
in late 2006. It is a pull API, as opposed to SAX's push model, so it
keeps control with the application rather than the parser. You can
also use StAX to modify the document being parsed. Read more in
"An Introduction to StAX" (see Resources).
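
To make the pull model concrete, here is a minimal sketch. It assumes the javax.xml.stream package that ships with Java SE 6 and later; the StaxTitles class and its titles() helper are hypothetical names used only for illustration, and the catalog is an inline string rather than a file.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper: pulls the text of every title element out of a catalog. */
class StaxTitles {

    static List<String> titles(String xml) {
        List<String> result = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            // The application drives the loop and asks for each event (pull model)
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "title".equals(reader.getLocalName())) {
                    // Reads the text content and leaves the reader on the end tag
                    result.add(reader.getElementText());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Document is not well-formed", e);
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<catalog>"
                + "<dvd code=\"_1234567\"><title>Terminator 2</title></dvd>"
                + "<dvd code=\"_7654321\"><title>The Matrix</title></dvd>"
                + "</catalog>";
        System.out.println(titles(xml));
    }
}
```

Because the loop belongs to the application, it can stop reading as soon as it has what it needs, which a SAX handler cannot do without throwing an exception.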

XML instance document


This tutorial uses a store's catalog of available DVDs for purchase as the document
throughout. Conceptually, the catalog contains a collection of DVDs with information
about each DVD associated with it. The actual document is a short catalog with only
four DVDs in it, but it has enough complexity for you to learn about XML processing,
including validation. Listing 1 shows the file.

Listing 1. The XML instance document for the DVD catalog

<?xml version="1.0"?>
<!DOCTYPE catalog SYSTEM "dvd.dtd">
<!-- DVD inventory -->
<catalog>
  <dvd code="_1234567">
    <title>Terminator 2</title>
    <description>
      A shape-shifting cyborg is sent back from the future
      to kill the leader of the resistance.
    </description>
    <price>19.95</price>
    <year>1991</year>
  </dvd>
  <dvd code="_7654321">
    <title>The Matrix</title>
    <price>12.95</price>
    <year>1999</year>
  </dvd>
  <dvd code="_2255577" genre="Drama">
    <title>Life as a House</title>
    <description>
      When a man is diagnosed with terminal cancer,
      he takes custody of his misanthropic teenage son.
    </description>
    <price>15.95</price>
    <year>2001</year>
  </dvd>
  <dvd code="_7755522" genre="Action">
    <title>Raiders of the Lost Ark</title>
    <price>14.95</price>
    <year>1981</year>
  </dvd>
</catalog>

Using the SAX parser


As Part 1 of this series discussed, the SAX parser is an event-based parser. This
means that the parser sends events to callback methods as it parses a document
(see Figure 1). For simplicity, Figure 1 doesn't show all the events that would
actually occur.

Figure 1. SAX parser events

These events are pushed out to the application in real time, as the parser moves
across the document contents. One benefit of this processing model is that you can
handle large documents with relatively little memory. A downside is that you have
more work to do to handle all these events.

The org.xml.sax package contains a set of interfaces. One of these, XMLReader,
is your interface to the parser. You can set up for parsing like this:

try {
    XMLReader parser = XMLReaderFactory.createXMLReader();
    parser.parse( "myDocument.xml" ); //complete path
} catch ( SAXParseException e ) {
    //document is not well-formed
} catch ( SAXException e ) {
    //could not find an implementation of XMLReader
} catch ( IOException e ) {
    //problem reading document file
}

Apache Xerces2 parser


If you need a parser, you can download the open source Apache
Xerces2 parser from The Apache Software Foundation Web site
(see Resources).

Tip: Reuse the parser instance if possible. Creating a parser is expensive. If you
have multiple threads running, you can reuse parser instances from a resource pool.

This is all well and good so far, but how does your application get events from the
parser? I'm glad you asked.

Handling SAX events

To receive events from the parser, you implement the ContentHandler interface.
This interface has a number of methods that you can implement to process your
document. Alternatively, if you only want to handle one or two callbacks, you can
subclass DefaultHandler, which implements all the ContentHandler methods
(doing nothing), and override only the methods you need.

Either way, you write logic to do whatever processing you require upon receiving
startElement, characters, endDocument, and other callback methods invoked
by the SAX parser. You can see all the method calls from a document as they would
occur on pages 351-355 of XML in a Nutshell, Third Edition (see Resources).
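
As an illustration of the DefaultHandler approach, the following sketch overrides only three callbacks to collect DVD titles. The TitleCollector class and its collect() method are hypothetical names, the catalog is an inline string, and the parser is obtained through the JAXP SAXParserFactory (an assumption made here so that the sketch runs on a plain JDK without a separate Xerces download).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: remembers the text of each title element. */
class TitleCollector extends DefaultHandler {
    private final List<String> titles = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a title element

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attributes) {
        if ("title".equals(qName)) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may deliver one element's text in several chunks
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equals(qName)) {
            titles.add(current.toString());
            current = null;
        }
    }

    static List<String> collect(String xml) {
        TitleCollector handler = new TitleCollector();
        try {
            XMLReader parser =
                    SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            parser.setContentHandler(handler);
            parser.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
        return handler.titles;
    }

    public static void main(String[] args) {
        System.out.println(collect(
                "<catalog><dvd><title>Terminator 2</title></dvd>"
                + "<dvd><title>The Matrix</title></dvd></catalog>"));
    }
}
```

Every other ContentHandler callback falls through to the do-nothing implementation inherited from DefaultHandler.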

The callback events are the normal events from a document as it's being parsed.
You can also handle validity callbacks by implementing an ErrorHandler. I'll
discuss this topic after I go over validation, so stay tuned.

To learn more about parsing with SAX, check out Chapter 20 of XML in a Nutshell,
Third Edition or read "Serial Access with the Simple API for XML (SAX)" (see
Resources).

SAX parser exception handling

By default, the parser ignores errors. To take action upon an invalid or
non-well-formed document, you must implement an ErrorHandler (note that
DefaultHandler implements this as well as the ContentHandler interface) and
define an error() method:

public class SAXEcho extends DefaultHandler {
    ...
    //Handle validity errors
    public void error( SAXParseException e ) {
        echo( e.getMessage() );
        echo( "Line " + e.getLineNumber() +
              " Column " + e.getColumnNumber() );
    }
    ...
}

Then you must turn on the validation feature:

parser.setFeature( "http://xml.org/sax/features/validation", true );

Finally, call this code:

parser.setErrorHandler( saxEcho );

Remember, parser is an instance of XMLReader. The parser calls the error()
method if the document violates a schema (DTD or XML Schema) rule.

Other ErrorHandler methods


ErrorHandler also has warning and fatalError methods, for
nonviolations and well-formedness violations, respectively. You
don't normally need to do anything in these methods.
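
The three ErrorHandler methods can be sketched together as follows. This is an illustration under stated assumptions, not the tutorial's SAXEcho code: CollectingErrorHandler and validate() are hypothetical names, the DTD is an internal subset embedded in the test string, and DTD validation is switched on through the JAXP SAXParserFactory.setValidating() call rather than through the feature URI shown above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

/** Hypothetical handler: records warnings and validity errors, rethrows fatal errors. */
class CollectingErrorHandler implements ErrorHandler {
    final List<String> problems = new ArrayList<>();

    @Override
    public void warning(SAXParseException e) {
        problems.add("warning: " + e.getMessage());
    }

    @Override
    public void error(SAXParseException e) {
        // Validity violation: the document is well-formed but breaks a DTD rule
        problems.add("error at line " + e.getLineNumber() + ": " + e.getMessage());
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        // Well-formedness violation: parsing cannot continue
        throw e;
    }

    static List<String> validate(String xml) {
        CollectingErrorHandler handler = new CollectingErrorHandler();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true); // turn on DTD validation
            XMLReader parser = factory.newSAXParser().getXMLReader();
            parser.setErrorHandler(handler);
            parser.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
        return handler.problems;
    }

    public static void main(String[] args) {
        String dtd = "<!DOCTYPE catalog ["
                + "<!ELEMENT catalog (dvd*)>"
                + "<!ELEMENT dvd (title)>"
                + "<!ELEMENT title (#PCDATA)>"
                + "]>";
        // The second dvd has no title, which violates the content model
        String invalid = dtd
                + "<catalog><dvd><title>The Matrix</title></dvd><dvd/></catalog>";
        for (String problem : validate(invalid)) {
            System.out.println(problem);
        }
    }
}
```

Collecting the messages instead of printing them immediately lets the caller decide, after the parse, whether any validity problems should block further processing.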

Echoing SAX events

As an exercise for the SAX parser skills you've learned, use the SAXEcho.java code
in Listing 2 to output the parser events for the catalog.xml file.

Listing 2. Echoing SAX events

package com.xml.tutorial;

import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

/**
 * A handler for SAX parser events that outputs certain event
 * information to standard output.
 *
 * @author mlorenz
 */
public class SAXEcho extends DefaultHandler {
    public static final String XML_DOCUMENT_DTD = "catalogDTD.xml"; //validates via catalog.dtd
    public static final String XML_DOCUMENT_XSD = "catalogXSD.xml"; //validates via catalog.xsd
    public static final String NEW_LINE = System.getProperty("line.separator");
    protected static Writer writer;

    /**
     * Constructor
     */
    public SAXEcho() {
        super();
    }

    /**
     * @param args
     */
    public static void main(String[] args) {
        //-- Set up my instance to handle SAX events
        DefaultHandler eventHandler = new SAXEcho();
        //-- Echo to standard output
        writer = new OutputStreamWriter( System.out );
        try {
            //-- Create a SAX parser
            XMLReader parser = XMLReaderFactory.createXMLReader();
            parser.setContentHandler( eventHandler );
            parser.setErrorHandler( eventHandler );
            parser.setFeature(
                "http://xml.org/sax/features/validation", true );
            //-- Validation via DTD --
            echo( "=== Parsing " + XML_DOCUMENT_DTD + " ===" + NEW_LINE );
            //-- Parse my XML document, reporting DTD-related errors
            parser.parse( XML_DOCUMENT_DTD );
            //-- Validation via XSD --
            parser.setFeature(
                "http://apache.org/xml/features/validation/schema",
                true );
            echo( NEW_LINE + NEW_LINE + "=== Parsing " +
                  XML_DOCUMENT_XSD + " ===" + NEW_LINE );
            //-- Parse my XML document, reporting XSD-related errors
            parser.parse( XML_DOCUMENT_XSD );
        } catch (SAXException e) {
            System.out.println( "Parsing Exception occurred" );
            e.printStackTrace();
        } catch (IOException e) {
            System.out.println( "Could not read the file" );
            e.printStackTrace();
        }
        System.exit(0);
    }

    //--Implement SAX callback events of interest (default is do nothing) --

    /* (non-Javadoc)
     * @see org.xml.sax.helpers.DefaultHandler#startElement(java.lang.String,
     * java.lang.String, java.lang.String, org.xml.sax.Attributes)
     * @see org.xml.sax.ContentHandler interface
     * Element and its attributes
     */
    @Override
    public void startElement( String uri,
                              String localName,
                              String qName,
                              Attributes attributes)
            throws SAXException {
        if( localName.length() == 0 )
            echo( "<" + qName );
        else
            echo( "<" + localName );
        if( attributes != null ) {
            for( int i=0; i < attributes.getLength(); i++ ) {
                if( attributes.getLocalName(i).length() == 0 ) {
                    echo( " " + attributes.getQName(i) +
                          "=\"" + attributes.getValue(i) + "\"" );
                }
            }
        }
        echo( ">" );
    }

    /* (non-Javadoc)
     * @see org.xml.sax.helpers.DefaultHandler#endElement(java.lang.String,
     * java.lang.String, java.lang.String)
     * End tag
     */
    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        echo( "</" + qName + ">" );
    }

    /* (non-Javadoc)
     * @see org.xml.sax.helpers.DefaultHandler#characters(char[], int, int)
     * Character data inside an element
     */
    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        String s = new String(ch, start, length);
        echo(s);
    }

    //-- Add additional event echoing at your discretion --

    /**
     * Output aString to standard output
     * @param aString
     */
    protected static void echo( String aString ) {
        try {
            writer.write( aString );
            writer.flush();
        } catch (IOException e) {
            System.out.println( "I/O error during echo()" );
            e.printStackTrace();
        }
    }

    /* (non-Javadoc)
     * @see org.xml.sax.helpers.DefaultHandler#error(org.xml.sax.SAXParseException)
     * @see org.xml.sax.ErrorHandler interface
     */
    @Override
    public void error(SAXParseException e) throws SAXException {
        echo( NEW_LINE + "*** Failed validation ***" + NEW_LINE );
        super.error(e);
        echo( "* " + e.getMessage() + NEW_LINE +
              "* Line " + e.getLineNumber() +
              " Column " + e.getColumnNumber() + NEW_LINE +
              "*************************" + NEW_LINE );
        try {
            Thread.sleep( 10 );
        } catch (InterruptedException e1) {
            e1.printStackTrace();
        }
    }
}

You can use the code in SAXEcho.java to see how SAX parsing all comes together.
Note that this code does not handle all events, so not everything from the original
document will be echoed (see Listing 3). Take a look at the ContentHandler
interface to see what other messages you might get.

Listing 3. Output from SAXEcho execution

=== Parsing catalogDTD.xml ===


<catalog><dvd><title>Terminator 2</title><description>
A shape-shifting cyborg is sent back from the future
to kill the leader of the resistance.
</description><price>19.95</price><year>1991</year>
</dvd><dvd><title>The Matrix</title><price>10.95</price>
<year>1999</year></dvd><dvd><title>Life as a House</title><description>
When a man is diagnosed with terminal cancer,
he takes custody of his misanthropic teenage son.
</description><price>15.95</price><year>2001</year>
</dvd><dvd><title>Raiders of the Lost Ark</title><price>
14.95</price><year>1981</year></dvd></catalog>
=== Parsing catalogXSD.xml ===
<catalog>
<dvd>
<title>Terminator 2</title>
<description>
A shape-shifting cyborg is sent back from the future
to kill the leader of the resistance.
</description>
<price>19.95</price>
<year>1991</year>
</dvd>
<dvd>
<title>The Matrix</title>
<price>10.95</price>
<year>1999</year>
</dvd>
<dvd>
<title>Life as a House</title>
<description>
When a man is diagnosed with terminal cancer,
he takes custody of his misanthropic teenage son.
</description>
<price>15.95</price>
<year>2001</year>
</dvd>
<dvd>
<title>Raiders of the Lost Ark</title>
<price>14.95</price>
<year>1981</year>
</dvd>
</catalog>

Using the DOM parser


In contrast to the SAX parser, the DOM parser builds a tree structure based on the
XML document contents (see Figure 2). For simplicity, some parsing actions are not
shown.

Figure 2. DOM parser tree

DOM doesn't specify an interface for the XML parser, so different vendors have
different parser classes. I'll continue to use the Xerces parser, which has a
DOMParser class.

You set up a DOM parser like this:

DOMParser parser = new DOMParser();
try {
    parser.parse( "myDocument.xml" );
    Document document = parser.getDocument();
} catch (DOMException e) {
    // take validity action here
} catch (SAXException e) {
    // well-formedness action here
} catch (IOException e) {
    // take I/O action here
}

Traversing the DOM tree

DOM incurs an expense in time and memory to construct an entire document tree.
The payback comes from the many ways that you can traverse and manipulate the
document's content using the tree structure. Figure 3 shows a portion of the DVD
catalog document.

Figure 3. Traversing the DOM tree

The tree has a root, which you can access through the
Document.getDocumentElement() method. From any Node, you can use
Node.getChildNodes() to get a NodeList of children of the current Node. Note
that attributes are not considered children of the containing Node. You can create new
Nodes, append them, insert them, locate them by name, and remove them. These
are just a few of the available capabilities.

One of the more powerful methods is Document.getElementsByTagName(),
which returns a NodeList of the matching Nodes in the descendant elements. The
DOM tree is available on the client as well as the server.
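
The navigation calls just described can be tried with a short sketch. It assumes the JAXP DocumentBuilderFactory bundled with the JDK instead of the Xerces DOMParser class used elsewhere in this tutorial; DomTour, parse(), and childElementNames() are hypothetical names chosen for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical walk-through of basic DOM navigation. */
class DomTour {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    /** Names of the Element children of parent; attributes never appear here. */
    static List<String> childElementNames(Node parent) {
        List<String> names = new ArrayList<>();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                names.add(child.getNodeName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        Document doc = parse("<catalog><dvd code=\"_7654321\">"
                + "<title>The Matrix</title><price>10.95</price>"
                + "</dvd></catalog>");
        Element root = doc.getDocumentElement();       // the catalog element
        Element dvd = (Element) root.getFirstChild();  // no whitespace in this string
        System.out.println(root.getNodeName());
        System.out.println(childElementNames(dvd));    // children only
        System.out.println(dvd.getAttribute("code"));  // attributes are reached separately
        System.out.println(doc.getElementsByTagName("title").getLength());
    }
}
```

The code attribute is reachable through getAttribute() or getAttributes(), but it never shows up in the getChildNodes() list.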

Client traversal

You can traverse the DOM tree in the client, and you can validate actions on an
XHTML page through JavaScript from within the browser. For example, the client
might need to find out if a Node with a particular name already exists:

//-- make sure a new DVD's title is unique
//   (newTitle is the title Element of the DVD being added)
var titles = document.getElementsByTagName("title");
var newTitleValue = newTitle.firstChild.nodeValue; //text lives in the child Text Node
var nextTitle;
for( var i = 0; i < titles.length; i++ ) {
    nextTitle = titles.item(i); //NodeList access by index
    if( nextTitle.firstChild.nodeValue == newTitleValue ) {
        //take some action
    }
}

Server traversal

On the server, you will certainly need to manipulate the tree, such as to add a new
child to a Node:

//-- add a new DVD with aName and description
public void createNewDvd( String aName, String description ) {
    Element catalog = document.getDocumentElement(); //root
    Element newDvd = document.createElement( aName );
    //createTextNode() returns a Text Node, not an Element
    Text dvdDescription =
        document.createTextNode( description );
    newDvd.appendChild( dvdDescription );
    catalog.appendChild( newDvd ); //as last element
}

XHTML as an alternative
This tutorial works with a data document, but the document could
easily be an XHTML page, in which case you'd see Nodes such as
head, body, p, td, and li.

Caution: Make sure to use DOM interfaces, such as NodeList or NamedNodeMap,
to manipulate the tree. The DOM tree is dynamic, meaning it is updated immediately
based on changes you're making, so if you use local variables to cache values, they
might be wrong. For example, NodeList.getLength() returns a different value after a
call to removeChild().
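
The following sketch shows the effect the caution describes. It is an illustration only: LiveNodeList, parse(), and removeAll() are hypothetical names, and the document is built with the JDK's JAXP DocumentBuilderFactory from an inline string.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical demonstration that a NodeList is a live view of the tree. */
class LiveNodeList {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    /** Removes every element with the given tag name and returns how many were removed. */
    static int removeAll(Document doc, String tagName) {
        NodeList matches = doc.getElementsByTagName(tagName);
        int removed = 0;
        // Ask the list for its length on every pass; a cached length would be stale
        while (matches.getLength() > 0) {
            Node match = matches.item(0);
            match.getParentNode().removeChild(match);
            removed++;
        }
        return removed;
    }

    public static void main(String[] args) {
        Document doc = parse("<catalog><dvd/><dvd/><dvd/></catalog>");
        NodeList dvds = doc.getElementsByTagName("dvd");
        System.out.println("before: " + dvds.getLength());
        doc.getDocumentElement().removeChild(dvds.item(0));
        // The same NodeList object already reflects the removal
        System.out.println("after one removal: " + dvds.getLength());
        System.out.println("removed the rest: " + removeAll(doc, "dvd"));
    }
}
```

A loop written as for (i = 0; i < cachedLength; i++) over the same list would skip elements or run past the end once removals begin, which is why the sketch rereads getLength() each time.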

DOM parser exception handling

DOM3
DOM3 has added a DOMErrorHandler, which provides a callback
mechanism to use instead of DOMException. Here is some
example code:

DOMParser parser = new DOMParser();
DOMConfiguration domConfig = document.getDomConfig();
//handler is an instance of a class that implements DOMErrorHandler
domConfig.setParameter( "error-handler", handler );

The class that implements the DOMErrorHandler interface has a
handleError(DOMError error) method, which returns true to
continue processing or false to stop processing (fatal errors
always stop processing).

The DOM parser throws a DOMException if problems occur during parsing. This is
a RuntimeException, since some languages don't support checked exceptions,
but you should always catch it or throw it in your Java code.

To detect manipulation problems, use the code of a DOMException. These codes
tell you what is wrong, such as an attempted change that makes the document
invalid (DOMException.INVALID_MODIFICATION_ERR) or a target Node that
could not be found (DOMException.NOT_FOUND_ERR). The DOMException section
within Chapter 9 of Processing XML with Java: A Guide to SAX, DOM, JDOM,
JAXP, and TrAX offers a complete list of DOMException codes with explanations
(see Resources).
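
To see the codes in action, the sketch below provokes two manipulation errors on purpose and reports the code each one carries. DomExceptionCodes, newDocument(), and codeOf() are hypothetical names, and the example relies on the DOM implementation bundled with the JDK; other implementations should report the same codes, but that is an assumption rather than something tested here.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical demonstration of inspecting DOMException codes. */
class DomExceptionCodes {

    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM implementation available", e);
        }
    }

    /** Runs a tree manipulation and returns the DOMException code, or 0 if it succeeded. */
    static short codeOf(Runnable manipulation) {
        try {
            manipulation.run();
            return 0;
        } catch (DOMException e) {
            return e.code;
        }
    }

    public static void main(String[] args) {
        Document doc = newDocument();
        Element catalog = doc.createElement("catalog");
        doc.appendChild(catalog);
        Element dvd = doc.createElement("dvd");
        catalog.appendChild(dvd);
        Element stray = doc.createElement("dvd"); // created but never attached

        // Removing a Node that is not a child of catalog
        short notFound = codeOf(() -> catalog.removeChild(stray));
        System.out.println(notFound == DOMException.NOT_FOUND_ERR);

        // Appending an ancestor underneath its own descendant would create a cycle
        short hierarchy = codeOf(() -> dvd.appendChild(catalog));
        System.out.println(hierarchy == DOMException.HIERARCHY_REQUEST_ERR);
    }
}
```

Because DOMException is unchecked, the compiler will not remind you to handle it; wrapping manipulations in one place, as codeOf() does here, is one way to keep that handling consistent.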

Echoing the DOM tree

As an exercise for the DOM parser skills you've learned, use the DOMEcho.java
code in Listing 4 to output the contents of the DOM tree for the catalog.xml file. After
this code echoes the tree information, it then changes the tree and echoes the
updated tree.

Listing 4. Echoing a DOM tree

package com.xml.tutorial;

import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.TreeWalker;
import org.xml.sax.SAXException;
import com.sun.org.apache.xerces.internal.parsers.DOMParser;

/**
 * A handler to output certain information about a DOM tree
 * to standard output.
 *
 * @author lorenzm
 */
public class DOMEcho {
    public static final String XML_DOCUMENT_DTD =
        "catalogDTD.xml"; //validates via catalog.dtd
    public static final String NEW_LINE = System.getProperty("line.separator");
    protected static Writer writer;
    // Types of DOM nodes, indexed by nodeType value (e.g. Attr = 2)
    protected static final String[] nodeTypeNames = {
        "none",        //0
        "Element",     //1
        "Attr",        //2
        "Text",        //3
        "CDATA",       //4
        "EntityRef",   //5
        "Entity",      //6
        "ProcInstr",   //7
        "Comment",     //8
        "Document",    //9
        "DocType",     //10
        "DocFragment", //11
        "Notation",    //12
    };
    //-- DOMImplementation features (we only need one for now)
    protected static final String TRAVERSAL_FEATURE = "Traversal";
    //-- DOM versions (we're using DOM2)
    protected static final String DOM_2 = "2.0";

    /**
     * Constructor
     */
    public DOMEcho() {
        super();
    }

    /**
     * @param args
     */
    public static void main(String[] args) {
        //Echo to standard output
        writer = new OutputStreamWriter( System.out );
        //use the Xerces parser
        try {
            DOMParser parser = new DOMParser();
            parser.setFeature( "http://xml.org/sax/features/validation", true );
            parser.parse( XML_DOCUMENT_DTD ); //use DTD grammar for validation
            Document document = parser.getDocument();
            echoAll( document );
            //-- add description for Indiana Jones movie
            //---- find parent Node
            Element indianaJones = document.getElementById("_7755522");
            //---- insert a description before the price
            //     (anywhere else would be invalid)
            NodeList prices = indianaJones.getElementsByTagName("price");
            Node desc = document.createElement("description");
            desc.setTextContent(
                "Indiana Jones is hired to find the Ark of the Covenant");
            indianaJones.insertBefore( desc, prices.item(0) );
            //-- now, echo the document again to see the change
            echoAll( document );
        } catch (DOMException e) { //handle invalid manipulations
            short code = e.code;
            if( code == DOMException.INVALID_MODIFICATION_ERR ) {
                //take action when invalid manipulation attempted
            } else if( code == DOMException.NOT_FOUND_ERR ) {
                //take action when element or attribute not found
            } //add more checks here as desired
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Echo all the Nodes, in preorder traversal order, for aDocument
     * @param aDocument
     */
    protected static void echoAll(Document aDocument) {
        if( aDocument.getImplementation().hasFeature(
                TRAVERSAL_FEATURE,DOM_2) ) {
            echo( "=== Echoing " + XML_DOCUMENT_DTD + " ===" + NEW_LINE );
            Node root = (Node) aDocument.getDocumentElement();
            int whatToShow = NodeFilter.SHOW_ALL;
            NodeFilter filter = null;
            boolean expandRefs = false;
            //-- depth first, preorder traversal
            DocumentTraversal traversal = (DocumentTraversal)aDocument;
            TreeWalker walker = traversal.createTreeWalker(
                (org.w3c.dom.Node) root, //where to start
                                         //(cannot go "above" the root)
                whatToShow,              //what to include
                filter,                  //what to exclude
                expandRefs);             //include referenced entities or not
            for( Node nextNode = (Node) walker.nextNode(); nextNode != null;
                    nextNode = (Node) walker.nextNode() ) {
                echoNode( nextNode );
            }
        } else {
            echo( NEW_LINE + "*** " + TRAVERSAL_FEATURE +
                  " feature is not supported" + NEW_LINE );
        }
    }

    /**
     * Output aNode's name, type, and value to standard output.
     * @param aNode
     */
    protected static void echoNode( Node aNode ) {
        String type = nodeTypeNames[aNode.getNodeType()];
        String name = aNode.getNodeName();
        StringBuffer echoBuf = new StringBuffer();
        echoBuf.append(type);
        if( !name.startsWith("#") ) { //do not output duplicate names
            echoBuf.append(": ");
            echoBuf.append(name);
        }
        if( aNode.getNodeValue() != null ) {
            if( echoBuf.indexOf("ProcInst") == 0 )
                echoBuf.append( ", " );
            else
                echoBuf.append( ": " );
            //output only to first newline
            String trimmedValue = aNode.getNodeValue().trim();
            int nlIndex = trimmedValue.indexOf("\n");
            if( nlIndex >= 0 ) //found newline
                trimmedValue = trimmedValue.substring(0,nlIndex);
            echoBuf.append(trimmedValue);
        }
        echo( echoBuf.toString() + NEW_LINE );
        echoAttributes( aNode );
    }

    /**
     * Output aNode's attributes to standard output.
     * @param aNode
     */
    protected static void echoAttributes(Node aNode) {
        NamedNodeMap attr = aNode.getAttributes();
        if( attr != null ) {
            StringBuffer attrBuf = new StringBuffer();
            for( int i = 0; i < attr.getLength(); i++ ) {
                String type = nodeTypeNames[attr.item(i).getNodeType()];
                attrBuf.append(type);
                attrBuf.append( ": " + attr.item(i).getNodeName() + "=" );
                attrBuf.append( "\"" + attr.item(i).getNodeValue() + "\"" +
                                NEW_LINE );
            }
            echo( attrBuf.toString() );
        }
    }

    /**
     * Output aString to standard output
     * @param aString
     */
    protected static void echo( String aString ) {
        try {
            writer.write( aString );
            writer.flush();
        } catch (IOException e) {
            System.out.println( "I/O error during echo()" );
            e.printStackTrace();
        }
    }
}

Look at some portions of the logic:

protected static final String[] nodeTypeNames = {
    ...
};

This array maps each Node.getNodeType() int value to the name of the
corresponding type of Node that you can encounter:

if( aDocument.getImplementation().hasFeature(
TRAVERSAL_FEATURE,DOM_2) ) {

DOM1 versus DOM2

In DOM1, you traversed the document tree by hand, using Node
methods such as getFirstChild() and getNextSibling(). The DOM2
Traversal module added two helpers that both accept a NodeFilter:
NodeIterator, which presents the Nodes in a "linear" fashion with
previous and next Nodes, and TreeWalker, which adds the concept of
a current Node, with movement to parent, child, and sibling.

You can read about DOM2's NodeIterator, NodeFilter, and
TreeWalker in Chapter 12 of Processing XML
with Java: A Guide to SAX, DOM, JDOM, JAXP, and TrAX (see
Resources).

Bruno R. Preiss explains different tree traversals (see Resources).

DOMEcho takes advantage of the TreeWalker interface introduced in DOM2 (see
DOM1 versus DOM2). To be safe, check to make sure your parser supports this
feature. You can read about all the available features in the "DOM Modules" section
in Chapter 9 of Processing XML with Java: A Guide to SAX, DOM, JDOM, JAXP,
and TrAX (see Resources).
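
A smaller version of the same idea, shown as a sketch: check for the Traversal feature, then let a TreeWalker visit only Element Nodes. ElementWalker, parse(), and elementNames() are hypothetical names, and the document comes from the JDK's JAXP DocumentBuilderFactory, whose Document objects are assumed here to implement DocumentTraversal (the feature check guards that assumption).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.TreeWalker;
import org.xml.sax.InputSource;

/** Hypothetical helper: lists element names in document (preorder) order. */
class ElementWalker {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    static List<String> elementNames(Document doc) {
        if (!doc.getImplementation().hasFeature("Traversal", "2.0")) {
            throw new UnsupportedOperationException("Traversal feature is not supported");
        }
        DocumentTraversal traversal = (DocumentTraversal) doc;
        TreeWalker walker = traversal.createTreeWalker(
                doc.getDocumentElement(),  // where to start
                NodeFilter.SHOW_ELEMENT,   // skip Text, Comment, and other Node types
                null,                      // no custom NodeFilter
                false);                    // do not expand entity references
        List<String> names = new ArrayList<>();
        // The walker's current Node starts at the root; nextNode() moves in preorder
        for (Node node = walker.getCurrentNode(); node != null; node = walker.nextNode()) {
            names.add(node.getNodeName());
        }
        return names;
    }

    public static void main(String[] args) {
        Document doc = parse("<catalog>\n  <dvd><title>The Matrix</title></dvd>\n"
                + "  <dvd><title>Raiders of the Lost Ark</title></dvd>\n</catalog>");
        System.out.println(elementNames(doc));
    }
}
```

Passing SHOW_ELEMENT instead of SHOW_ALL is a design choice for this sketch: it hides the whitespace-only Text Nodes that make the DOMEcho output long.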

Basically, DOMEcho has an echoAll(Document aDoc) method, which uses the
TreeWalker with no filtering to get the Nodes in preorder traversal order (see DOM1
versus DOM2). echoNode(Node aNode) is then called for each. In turn,
echoNode calls echoAttributes(Node aNode) for its Node:

//---- find parent Node
Element indianaJones = document.getElementById("_7755522");
//---- insert a description before the price
//     (anywhere else would be invalid)
NodeList prices = indianaJones.getElementsByTagName("price");
Node desc = document.createElement("description");
desc.setTextContent(
    "Indiana Jones is hired to find the Ark of the Covenant");
indianaJones.insertBefore( desc, prices.item(0) );

This section of code is what changes the DOM tree. It adds a description in the
correct place so that the tree is still valid according to the document's schema.

Listing 5 shows the resulting output from DOMEcho.

Listing 5. Output from DOMEcho

=== Echoing catalogDTD.xml ===


Text:
Comment: DVD inventory
Text:
Element: dvd
Attr: code="_1234567"
Text:
Element: title
Text: Terminator 2
Text:
Element: description
Text: A shape-shifting cyborg is sent back from the future
to kill the leader of the resistance.
Text:
Element: price
Text: 19.95
Text:
Element: year
Text: 1991
Text:
Text:
Element: dvd
Attr: code="_7654321"
Text:
Element: title
Text: The Matrix
Text:
Element: price
Text: 10.95
Text:
Element: year
Text: 1999
Text:
Text:
Element: dvd
Attr: code="_2255577"
Attr: genre="Drama"
Text:
Element: title
Text: Life as a House
Text:
Element: description
Text: When a man is diagnosed with terminal cancer,
he takes custody of his misanthropic teenage son.
Text:
Element: price
Text: 15.95
Text:
Element: year
Text: 2001
Text:
Text:
Element: dvd
Attr: code="_7755522"
Attr: genre="Action"
Text:
Element: title
Text: Raiders of the Lost Ark
Text:
Element: price
Text: 14.95
Text:
Element: year
Text: 1981
Text:
Text:
=== Echoing catalogDTD.xml ===
Text:
Comment: DVD inventory
Text:
Element: dvd
Attr: code="_1234567"
Text:
Element: title
Text: Terminator 2
Text:
Element: description
Text: A shape-shifting cyborg is sent back from the future
to kill the leader of the resistance.
Text:
Element: price
Text: 19.95
Text:
Element: year
Text: 1991
Text:
Text:
Element: dvd
Attr: code="_7654321"
Text:
Element: title
Text: The Matrix
Text:
Element: price
Text: 10.95
Text:
Element: year
Text: 1999
Text:
Text:
Element: dvd
Attr: code="_2255577"
Attr: genre="Drama"
Text:
Element: title
Text: Life as a House
Text:
Element: description
Text: When a man is diagnosed with terminal cancer,
he takes custody of his misanthropic teenage son.
Text:
Element: price
Text: 15.95
Text:
Element: year
Text: 2001
Text:
Text:
Element: dvd
Attr: code="_7755522"
Attr: genre="Action"
Text:
Element: title
Text: Raiders of the Lost Ark
Text:
Element: description
Text: Indiana Jones is hired to find the Ark of the Covenant
Element: price
Text: 14.95
Text:
Element: year
Text: 1981
Text:
Text:

Whitespace


You'll notice a lot of Text Nodes in the DOMEcho output (Listing 6), many of them
with nothing apparent as content. Why would that be?

The parser reports whitespace (extra spaces, tabs, and carriage returns) that occurs
within the document's element contents.

Notice what's not reported: whitespace inside the tags themselves, such as the
spaces surrounding attributes. Not shown here, but also not reported, is whitespace
in the prolog. Note that there is a Text node for the description, but the
whitespace is normalized to strip out extra characters before and after the
nonwhitespace content.

The Text nodes that result from whitespace in element content are called
ignorable whitespace. Ignorable whitespace is not part of validation, as you're about
to see in Figure 4.

Figure 4. Whitespace processing
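
To see ignorable whitespace for yourself, here is a minimal sketch that uses the JAXP DocumentBuilderFactory bundled with the JDK rather than the Xerces classes used elsewhere in this tutorial. The class name WhitespaceDemo and the one-DVD document are assumptions made for illustration. It counts the children of the root element twice: once as parsed, and once after asking a validating parser to drop whitespace in element content.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class WhitespaceDemo {
    // Hypothetical one-DVD catalog with an inline DTD, so the parser can tell
    // which whitespace sits in element-only content.
    static final String XML =
          "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE catalog [\n"
        + "  <!ELEMENT catalog (dvd+)>\n"
        + "  <!ELEMENT dvd (title)>\n"
        + "  <!ELEMENT title (#PCDATA)>\n"
        + "]>\n"
        + "<catalog>\n"
        + "  <dvd>\n"
        + "    <title>The Matrix</title>\n"
        + "  </dvd>\n"
        + "</catalog>\n";

    // Returns the number of direct children of <catalog>.
    static int countRootChildren(boolean dropIgnorableWhitespace) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // The parser needs the DTD to know which whitespace is ignorable,
            // so validation is switched on together with the ignore flag.
            factory.setValidating(dropIgnorableWhitespace);
            factory.setIgnoringElementContentWhitespace(dropIgnorableWhitespace);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(XML)));
            return doc.getDocumentElement().getChildNodes().getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Whitespace Text nodes on either side of <dvd> are counted here.
        System.out.println("as parsed: " + countRootChildren(false));
        // Only the <dvd> element remains here.
        System.out.println("whitespace dropped: " + countRootChildren(true));
    }
}
```

The first count includes the two whitespace-only Text nodes that surround the dvd element; the second count leaves only the element itself.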

Section 3. Validating XML documents


Validation consists of ensuring the proper structure and content of XML documents
using a grammar. You can specify a grammar by using an XML schema, which can
take the form of a DTD or XML Schema file (see Schemas). This section of the
tutorial discusses DTD and XML Schema files.

Schemas
Technically speaking, DTDs, XML Schemas (capital S), and RELAX
NG are all types of XML schema (little s). XML Schemas (capital S)
are strictly called W3C XML Schemas. In this tutorial, whenever you
see XML Schema, realize that it's the W3C language and not the
generic schema document description.


Validating using a DTD


A DTD defines constraints to put on an XML instance document. These constraints
are not related to well-formedness. In fact, a document that is not well-formed is not
considered an XML document at all. Constraints relate to business rules about
content that must hold true for you to be able to use the document with an
application.

A DTD specifies the elements and attributes that an XML instance document must
contain to be considered valid. You can associate a document with a DTD by
including a DOCTYPE statement near the top of the document:

<!DOCTYPE catalog SYSTEM "catalog.dtd">

Now, go through the catalog.dtd file. To validate a document, you need to turn
validation on and use a validating parser. With this code, turn on validation for the
SAX parser:

saxParser.setFeature(
"http://xml.org/sax/features/validation", true );

With Xerces, the DOM parser recognizes the same feature URI, so this code turns on
validation for the DOM parser:

domParser.setFeature(
"http://xml.org/sax/features/validation", true );

Figure 5 shows the catalog.dtd file.

Figure 5. Catalog DTD


Go line by line through the DTD to see what is being specified:

<!ELEMENT catalog (dvd+)>

The dvd+ specifies that a <catalog> element has one or more <dvd>s. Makes
sense; otherwise, you aren't going to be selling too many DVDs!

<!ELEMENT dvd (title, description?, price, year)>

The title, ..., year is called a sequence. It means that the named elements
must appear in this order as children of a <dvd> element. The question mark after
description means that a <dvd> has zero or one description elements -- in other
words, it's optional but if it is specified, there can only be one (an asterisk means
zero or more, and a plus sign means one or more).

<!ATTLIST dvd code ID #REQUIRED>

An ID type attribute must have a value that is unique within the document. You'll
notice that in the catalog.xml file, the IDs begin with an underscore. An ID value
must be a valid XML name, and an XML name cannot start with a digit, but an
underscore (or a letter, or many other nondigit characters) is fine. An element can
have only one attribute of type ID. #REQUIRED, as you might have guessed, means that
a <dvd> must have a code.

<!ATTLIST dvd genre ( Drama | Comedy | SciFi | Action | Romance ) #IMPLIED>

This is an enumeration. Since it is IMPLIED, it is optional. However, if it does appear
in the document, it must be one of the enumerated values (read them as "Drama or
Comedy or ...").

<!ELEMENT title (#PCDATA)>
<!ELEMENT description (#PCDATA)>
<!ELEMENT price (#PCDATA)>
<!ELEMENT year (#PCDATA)>

These remaining lines all specify parsed character data. None of these elements
may have children.

Now try to change the instance document to make sure the rules work correctly.
First, add a <description>, but put it at the end of the <dvd>. As expected, you
get an error (see Figure 6).

Figure 6. Description error

Now, add a genre (see Figure 7).

Figure 7. Genre error


Why didn't that work?! Science fiction is in the list! D'oh -- XML is case-sensitive, as
you know, so "scifi" won't work. It needs to be "SciFi".

Now check to see if IDs really need to be unique. Copy the same code into another
<dvd> (see Figure 8).

Figure 8. ID error

Sure enough, you get an appropriate error. You get the idea. Feel free to use the
DTD and XML files to try out other changes (see Download for the source files).
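
If you'd rather run the three experiments above from code than from an editor, the following sketch repeats them with the JAXP DOM parser that ships with the JDK. It is an illustration under assumptions, not the tutorial's DOMEcho program: the class name DtdExperiments is hypothetical, and the DTD is inlined as an internal subset so that no catalog.dtd file is needed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class DtdExperiments {
    // The catalog DTD from this section, inlined as an internal subset.
    static final String DOCTYPE =
          "<!DOCTYPE catalog [\n"
        + "<!ELEMENT catalog (dvd+)>\n"
        + "<!ELEMENT dvd (title, description?, price, year)>\n"
        + "<!ATTLIST dvd code ID #REQUIRED>\n"
        + "<!ATTLIST dvd genre ( Drama | Comedy | SciFi | Action | Romance ) #IMPLIED>\n"
        + "<!ELEMENT title (#PCDATA)>\n"
        + "<!ELEMENT description (#PCDATA)>\n"
        + "<!ELEMENT price (#PCDATA)>\n"
        + "<!ELEMENT year (#PCDATA)>\n"
        + "]>\n";

    // Parses <catalog>body</catalog> with DTD validation on and returns the
    // validity errors the parser reported (empty when the document is valid).
    static List<String> validate(String catalogBody) {
        final List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { /* ignored in this sketch */ }
                @Override public void error(SAXParseException e) { errors.add(e.getMessage()); }
                @Override public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            String xml = DOCTYPE + "<catalog>" + catalogBody + "</catalog>";
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return errors;
    }

    public static void main(String[] args) {
        String dvd = "<title>The Matrix</title><price>10.95</price><year>1999</year>";
        // A description after the year breaks the declared sequence (Figure 6).
        System.out.println(validate("<dvd code='_1'>" + dvd + "<description>Neo</description></dvd>"));
        // Enumerated values are case-sensitive (Figure 7).
        System.out.println(validate("<dvd code='_1' genre='scifi'>" + dvd + "</dvd>"));
        // ID values must be unique within the document (Figure 8).
        System.out.println(validate("<dvd code='_1'>" + dvd + "</dvd><dvd code='_1'>" + dvd + "</dvd>"));
    }
}
```

Each call should print a non-empty list of messages; the exact wording depends on the parser version, so treat the text as informational rather than something to match against.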

DTD exception handling


To handle DTD validation errors, you must turn on validation. For Xerces, you set
the validation feature to true:

parser.setFeature(
"http://xml.org/sax/features/validation",
true );

You can read about the different Xerces parser features at The Apache Software
Foundation Web site (see Resources). To read more about validation with DTDs,
check out Chapter 3 of XML in a Nutshell, Third Edition (see Resources).
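
Turning validation on is only half of the job; you also need a handler that receives the errors. The sketch below shows one way to do that with the JAXP SAX parser in the JDK. It assumes setValidating(true) as the portable JAXP switch for DTD validation, and the class name ValidationReport is made up for this example. Validity errors are recoverable, so the handler records them and lets parsing continue, while well-formedness errors are fatal and are rethrown.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class ValidationReport extends DefaultHandler {
    final List<String> errors = new ArrayList<>();

    @Override
    public void error(SAXParseException e) {
        // A validity error: note where it happened and keep parsing.
        errors.add("line " + e.getLineNumber() + ": " + e.getMessage());
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXParseException {
        // The document is not well-formed, so there is nothing to recover.
        throw e;
    }

    // Parses the given document text and returns the recorded validity errors.
    static List<String> check(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true);
            SAXParser parser = factory.newSAXParser();
            ValidationReport report = new ValidationReport();
            parser.parse(new InputSource(new StringReader(xml)), report);
            return report.errors;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String dtd = "<!DOCTYPE dvd [<!ELEMENT dvd (title, price)>"
                   + "<!ELEMENT title (#PCDATA)><!ELEMENT price (#PCDATA)>]>";
        // The price is missing, so one validity error is expected.
        for (String message : check(dtd + "<dvd><title>Life as a House</title></dvd>")) {
            System.out.println(message);
        }
    }
}
```

Recording errors instead of throwing on the first one is a design choice: it lets you report every problem in a document in a single pass, which is what the SAXEcho output in the next section does.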

Validating with SAXEcho

Now, check out the validation. Comment out the price for the Life as a House dvd
in the XML document and see the results, using both DTD and XSD files for
validation. Listing 6 shows the output.

Listing 6. Output from SAXEcho execution

=== Parsing catalogDTD.xml ===


<catalog><dvd><title>Terminator 2</title><description>
A shape-shifting cyborg is sent back from the future
to kill the leader of the resistance.
</description><price>19.95</price><year>1991</year>
</dvd><dvd><title>The Matrix</title><price>10.95
</price><year>1999</year></dvd><dvd><title>Life as a House</title><description>
When a man is diagnosed with terminal cancer,
he takes custody of his misanthropic teenage son.
</description><year>2001</year>
*** Failed validation ***
* The content of element type "dvd" must match "(title,description?,price,year)".
*************************
</dvd><dvd><title>Raiders of the Lost Ark</title><price>14.95
</price><year>1981</year></dvd></catalog>
=== Parsing catalogXSD.xml ===
<catalog>
<dvd>
<title>Terminator 2</title>
<description>
A shape-shifting cyborg is sent back from the future
to kill the leader of the resistance.
</description>
<price>19.95</price>
<year>1991</year>
</dvd>
<dvd>
<title>The Matrix</title>
<price>10.95</price>
<year>1999</year>
</dvd>
<dvd>
<title>Life as a House</title>
<description>
When a man is diagnosed with terminal cancer,
he takes custody of his misanthropic teenage son.
</description>

*** Failed validation ***


* cvc-complex-type.2.4.a: Invalid content was found starting with
element 'year'. One of '{"":price}' is expected.
*************************


<year>2001</year>
</dvd>
<dvd>
<title>Raiders of the Lost Ark</title>
<price>14.95</price>
<year>1981</year>
</dvd>
</catalog>

Validating using an XML Schema


Perhaps you're wondering: If I have DTDs to make sure a document's structure and
content is valid, why do I need another way to validate documents? I'll give you a
few reasons:

• Granular control over element and attribute values: XML Schema
allows you to specify the format, length, and data type.
• Complex data types: XML Schema supports the creation of new data
types and specialization from existing types.
• Element occurrence: With XML Schema, granular control over how many
times an element can occur is possible.
• Namespaces: XML Schema works with namespaces, which become
important for organizations that deal with other organizations.
The XML Schema language is more powerful than the DTD language and thus is
also more complicated. One nice aspect is that XML Schemas are written in XML,
whereas DTDs are not.

XSD
XML Schema is also known as XML Schema Definition, thus the file
extension .xsd.

Let's validate the same XML instance document that you used for DTD validation in
Listing 1. Listing 7 shows the XML Schema:

Listing 7. Catalog XML Schema

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema elementFormDefault="qualified" xml:lang="EN"
    xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Our DVD catalog contains four or more DVDs -->
  <xs:element name="catalog">
    <xs:complexType>
      <xs:sequence minOccurs="4" maxOccurs="unbounded">
        <xs:element ref="dvd"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <!-- DVDs have a title, an optional description, a price, and a release year -->
  <xs:element name="dvd">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="title" type="xs:string"/>
        <xs:element name="description" type="descriptionString"
            minOccurs="0"/>
        <xs:element name="price" type="priceValue"/>
        <xs:element name="year" type="yearString"/>
      </xs:sequence>
      <xs:attribute name="code" type="xs:ID"/> <!-- requires a unique ID -->
      <xs:attribute name="genre"> <!-- default = optional -->
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="Drama"/>
            <xs:enumeration value="Comedy"/>
            <xs:enumeration value="SciFi"/>
            <xs:enumeration value="Action"/>
            <xs:enumeration value="Romance"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
    </xs:complexType>
  </xs:element>
  <!-- Descriptions must be between 10 and 120 characters long -->
  <xs:simpleType name="descriptionString">
    <xs:restriction base="xs:string">
      <xs:minLength value="10"/>
      <xs:maxLength value="120"/>
    </xs:restriction>
  </xs:simpleType>
  <!-- Price must be < 100.00 -->
  <xs:simpleType name="priceValue">
    <xs:restriction base="xs:decimal">
      <xs:totalDigits value="4"/>
      <xs:fractionDigits value="2"/>
      <xs:maxExclusive value="100.00"/>
    </xs:restriction>
  </xs:simpleType>
  <!-- Year must be 4 digits, between 1900 and 2099 -->
  <xs:simpleType name="yearString">
    <xs:restriction base="xs:string">
      <xs:pattern value="(19|20)\d\d"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>

Notice that the XML Schema is a lot more involved than the corresponding DTD. In
fact, even taking out the comments and spacing, this schema is more than 50 lines
long, as opposed to the DTD schema that is nine lines long. (Granted, this schema
does more detailed checking than the DTD does). So, along with more granular
control comes more complexity -- a lot more complexity. The message is: If your
validation needs don't require an XML Schema, use a DTD.
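
As a rough illustration of schema validation from code, the sketch below uses the javax.xml.validation API from the JDK (see The Java XML Validation API in Resources) instead of setting Xerces features directly. The class name SchemaCheck is hypothetical, and the inlined schema is a cut-down version of Listing 7: it keeps the dvd sequence and the yearString type but drops the four-DVD minimum and the other custom types to stay short.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class SchemaCheck {
    // Simplified stand-in for catalog.xsd (an assumption made for brevity).
    static final String XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema' elementFormDefault='qualified'>"
        + "<xs:element name='catalog'><xs:complexType><xs:sequence>"
        + "<xs:element ref='dvd' maxOccurs='unbounded'/>"
        + "</xs:sequence></xs:complexType></xs:element>"
        + "<xs:element name='dvd'><xs:complexType><xs:sequence>"
        + "<xs:element name='title' type='xs:string'/>"
        + "<xs:element name='description' type='xs:string' minOccurs='0'/>"
        + "<xs:element name='price' type='xs:decimal'/>"
        + "<xs:element name='year' type='yearString'/>"
        + "</xs:sequence><xs:attribute name='code' type='xs:ID'/></xs:complexType></xs:element>"
        + "<xs:simpleType name='yearString'><xs:restriction base='xs:string'>"
        + "<xs:pattern value='(19|20)\\d\\d'/>"
        + "</xs:restriction></xs:simpleType>"
        + "</xs:schema>";

    // Validates the document text and returns the validity errors found.
    static List<String> validate(String xml) {
        final List<String> errors = new ArrayList<>();
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();
            // DefaultHandler rethrows fatal errors; validity errors are collected.
            validator.setErrorHandler(new DefaultHandler() {
                @Override public void error(SAXParseException e) { errors.add(e.getMessage()); }
            });
            validator.validate(new StreamSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return errors;
    }

    public static void main(String[] args) {
        // The price is missing, as in the SAXEcho experiment.
        String xml = "<catalog><dvd code='_2255577'><title>Life as a House</title>"
                   + "<year>2001</year></dvd></catalog>";
        for (String message : validate(xml)) {
            System.out.println(message);
        }
    }
}
```

Because the schema is compiled separately from the instance document, the same Schema object can be reused to validate many documents.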

Review the added value list for XML Schemas to see how the DVD catalog
documents benefit, in addition to enforcing comparable constraints from the DTD
you used before:

• Granular control over element and attribute values: Unlike the DTD,
which allows any character values, the XSD constrains the values of
descriptions (10 to 120 characters), prices (less than 100.00), and years
(1900 to 2099).
• Complex data types: You created new data types that you can reuse
and extend even further: dvd, descriptionString, priceValue, and
yearString.
• Element occurrence: Since this tutorial has a small document, I set the
number of DVDs to be four or more so the document would be valid. In
reality, the minimum would probably be a larger number, but you can see
that these types of constraints are possible.
• Namespaces: You only used namespaces for XML Schema types, but
since XML Schemas are namespace-aware, you know that you can add
more namespaces to control name collisions.
Let's discuss some more points about the XML Schema to understand its contents:

• xs:complexType and xs:simpleType. A complexType describes an element
that contains other elements or attributes:

<xs:element name="dvd">
<xs:complexType>
<xs:sequence>
<xs:element name="title" type="xs:string"/>
...

A simpleType describes text-only content, such as the text of an element or
the value of an attribute:

<xs:simpleType name="yearString">
<xs:restriction base="xs:string">
<xs:pattern value="(19|20)\d\d"/>
</xs:restriction>
</xs:simpleType>

In this particular case, you define a new type called yearString that
must contain four digits and begin with either "19" or "20." You use the
xs:restriction element to derive a new, constrained type from an
existing (base) type. You use the xs:pattern facet element to compare
values to see if they match the specified expression (see Facets).

• xs:sequence. The child elements must appear in the exact order listed
(although minOccurs can make an element optional, as you saw):

<xs:sequence>
<xs:element name="title" type="xs:string"/>
<xs:element name="description" type="descriptionString" minOccurs="0"/>
<xs:element name="price" type="priceValue"/>
<xs:element name="year" type="yearString"/>
</xs:sequence>

The sequence declares that dvds in a valid document must have a
title, optionally followed by a description of between 10 and 120
characters, followed by a price of less than US$100 with at most two
decimal places (such as "nn.nn"), and finally a year.

Facets
Schemas support a set of possible aspects for values. These
aspects are called facets and are used with a restriction to constrain
the valid values. The following facet types are available:

• pattern
• enumeration
• minLength and maxLength
• minInclusive, maxInclusive, minExclusive, and maxExclusive
• totalDigits and fractionDigits
• whiteSpace

Note: In XMLBuddy, validation against XML Schemas requires XMLBuddy Pro.

Now make some edits and verify that your constraints are being enforced. Add a
genre of Adventure, enter a description more than 120 characters long, and
duplicate a dvd code (see Figure 9).

Figure 9. XSD errors

You can see that the genre, unique ID, and description length are all enforced.

XML Schema is capable of much more. Here are a few highlights:

• xs:choice: Exactly one of the listed child elements must appear.
• xs:all: Each of the child elements listed must appear once, but they
can appear in any order.
• xs:group: A set of elements of the group name can be defined and then
referenced (through ref="groupName").
• xs:attributeGroup: This is the corresponding indicator for attributes,
as xs:group is for elements.
• xs:date: This is a Gregorian calendar date as defined in ISO 8601,
formatted as YYYY-MM-DD.
• xs:time: The time is represented by hh:mm:ss, with or without "Z" for
UTC relative time.
• xs:duration: An amount of years, months, days, hours, minutes, and
seconds.
As you can see, a lot of built-in power is available when you write an XML Schema.
Can't find what you need? Create a new type.

Data types

A powerful feature of XML Schema is the capability to create new data types. You
saw new types used extensively in the catalog.xsd file, including the creation of the
yearString and priceValue types. In this case, these types are only used in the
dvd type, but you could use them anywhere that years or prices appear in the
document.

These types are derived by restriction from the existing decimal and string types:

<!-- Price must be < 100.00 -->
<xs:simpleType name="priceValue">
  <xs:restriction base="xs:decimal">
    <xs:totalDigits value="4"/>
    <xs:fractionDigits value="2"/>
    <xs:maxExclusive value="100.00"/>
  </xs:restriction>
</xs:simpleType>
<!-- Year must be 4 digits, between 1900 and 2099 -->
<xs:simpleType name="yearString">
  <xs:restriction base="xs:string">
    <xs:pattern value="(19|20)\d\d"/>
  </xs:restriction>
</xs:simpleType>

As I mentioned before, you can specialize an existing type using the restriction
element in combination with one or more facets. If more than one facet exists, you
can use them in combination to determine which values are valid and which are not.
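
To make the combination of facets concrete, here is a small sketch that checks individual values against the priceValue type. It assumes the javax.xml.validation API from the JDK, a hypothetical class name PriceFacets, and a one-element schema that wraps the type in a price element so that a value can be validated on its own.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class PriceFacets {
    // The priceValue type from catalog.xsd, attached to a lone <price> element.
    static final String XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='price' type='priceValue'/>"
        + "<xs:simpleType name='priceValue'>"
        + "<xs:restriction base='xs:decimal'>"
        + "<xs:totalDigits value='4'/>"
        + "<xs:fractionDigits value='2'/>"
        + "<xs:maxExclusive value='100.00'/>"
        + "</xs:restriction></xs:simpleType></xs:schema>";

    static Schema compile() {
        try {
            return SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("schema did not compile", e);
        }
    }

    // Returns true when the text satisfies every facet of priceValue.
    static boolean isValidPrice(String text) {
        try {
            // With no error handler set, the first validity error is thrown.
            compile().newValidator().validate(
                    new StreamSource(new StringReader("<price>" + text + "</price>")));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        for (String candidate : new String[] {"19.95", "100.00", "9.999", "-5.00"}) {
            System.out.println(candidate + " -> " + isValidPrice(candidate));
        }
    }
}
```

A value has to pass all three facets at once: 100.00 fails maxExclusive and 9.999 fails fractionDigits. Note that -5.00 passes, because the type sets no lower bound; adding a minInclusive facet would be one way to close that gap.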

Pattern matching

The pattern facet element supports a rich expression syntax that is similar to Perl.
You saw it used for the yearString, where you can read the pattern
"(19|20)\d\d" as "the string must start with either one-nine or two-zero and must
be followed by two decimal digits." Table 1 shows a few more patterns.


Table 1. XML Schema pattern-matching expressions


Pattern    Matches
(A|B)      A string that matches A or B
A?         Zero or one occurrence of a string that matches A
A*         Zero or more occurrences of a string that matches A
A+         One or more occurrences of a string that matches A
[abcd]     A character that matches one of the specified characters
[^abc]     A character other than those specified
\t         A tab character
\\         A backslash character
\c         An XML name character
\s         A space, tab, carriage-return, or line-feed character
.          Any character except a carriage return or line feed

To read more about the many possibilities for expressions, see pages 427-429 of
XML in a Nutshell, Third Edition or view Table 24-5 in Chapter 24 of XML Bible,
Second Edition online (see Resources).
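
You can get a feel for the yearString pattern without a schema processor by trying it in java.util.regex, as in the sketch below. This is an approximation, not the XML Schema regular-expression engine: the class name YearPattern is made up, and the two dialects differ in places (for example, \d in XML Schema covers every Unicode decimal digit, while Java's \d covers only 0 to 9 by default, and Java has no \c name-character escape).

```java
import java.util.regex.Pattern;

class YearPattern {
    // Same expression as the yearString pattern facet.
    static final Pattern YEAR = Pattern.compile("(19|20)\\d\\d");

    static boolean isValidYear(String text) {
        // matches() tests the whole string, which mirrors how a pattern
        // facet must match the entire value rather than a part of it.
        return YEAR.matcher(text).matches();
    }

    public static void main(String[] args) {
        for (String candidate : new String[] {"1999", "2099", "1899", "19991"}) {
            System.out.println(candidate + " -> " + isValidYear(candidate));
        }
    }
}
```

The values 1999 and 2099 are accepted, while 1899 (wrong prefix) and 19991 (one digit too many) are rejected.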

XSD exception handling

To handle XML Schema validation errors, you must turn on validation. For
Xerces, set the schema validation feature to true:

parser.setFeature(
"http://apache.org/xml/features/validation/schema",
true );

You can read about the different Xerces parser features on The Apache Software
Foundation Web site (see Resources).

I previously discussed DOMExceptions that can occur due to manipulation
problems. The DOMException's code indicates what type of problem has occurred.

DOMEcho revisited

Change the logic of DOMEcho.java to cause a DOMException. Here's the new
logic:

//---- find parent Node
Element indianaJones = document.getElementById("_7755522");
//---- insert a description before the price
// (anywhere else would be invalid)
NodeList prices = indianaJones.getElementsByTagName("price");
Node desc = document.createTextNode(
    "Indiana Jones is hired to find the Ark of the Covenant");
// The reference node passed here is indianaJones itself, which is not one
// of its own children, so this call throws a DOMException (NOT_FOUND_ERR).
indianaJones.insertBefore( desc, indianaJones );

This results in the following code being executed:

short code = e.code;
...
} else if( code == DOMException.NOT_FOUND_ERR ) {
    //take action when element or attribute not found
    echo( "*** Element not found" );
    System.exit(code);
}
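
The following self-contained sketch reproduces the same failure outside of DOMEcho and contrasts it with a call that succeeds. It is not the tutorial's source: the class name DomExceptionDemo and its two methods are hypothetical, and the tiny document is built in memory with the JDK's JAXP DOM implementation rather than parsed from catalog.xml.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class DomExceptionDemo {
    // Builds <catalog><dvd><price/></dvd></catalog> and returns the <dvd>.
    static Element newDvd() {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element catalog = document.createElement("catalog");
            document.appendChild(catalog);
            Element dvd = document.createElement("dvd");
            catalog.appendChild(dvd);
            dvd.appendChild(document.createElement("price"));
            return dvd;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Passes the parent itself as the reference child and returns the
    // DOMException code that results (0 if no exception was thrown).
    static short insertWithBadReference() {
        Element dvd = newDvd();
        try {
            dvd.insertBefore(dvd.getOwnerDocument().createTextNode("Indiana Jones"), dvd);
            return 0;
        } catch (DOMException e) {
            return e.code;
        }
    }

    // Inserts a <description> before the existing <price> child and returns
    // the name of the first child afterward.
    static String insertWithGoodReference() {
        Element dvd = newDvd();
        Element description = dvd.getOwnerDocument().createElement("description");
        description.setTextContent("Indiana Jones is hired to find the Ark of the Covenant");
        dvd.insertBefore(description, dvd.getFirstChild());
        return dvd.getFirstChild().getNodeName();
    }

    public static void main(String[] args) {
        System.out.println(insertWithBadReference() == DOMException.NOT_FOUND_ERR);
        System.out.println(insertWithGoodReference());
    }
}
```

The reference child passed to insertBefore must be an existing child of the node you call it on (or null to append); anything else raises NOT_FOUND_ERR.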

To read more about validation with XML Schemas, check out Chapter 17 of XML in
a Nutshell, Third Edition, W3Schools, or "Interactive XML tutorials" (see Resources).

Section 4. Using XQuery


XML Query (XQuery) is a language for writing expressions that return matching
results from XML data, often in a database. The functionality is like that provided by
SQL for non-XML content:

"Like SQL, XQuery contains functions for extracting, summarizing,
aggregating, and joining data from multiple datasets."
--"Java theory and practice: Screen-scraping with XQuery" by Brian
Goetz (see Resources)

XQuery expands upon XPath expressions, which the fourth part of this tutorial
series, on XML transformations, discusses in detail. An XPath expression is also a
valid XQuery expression. So, why do you need XQuery? The value-add for XQuery is
due to clauses that XQuery adds to its expressions, allowing for more complicated
expressions much like a SELECT statement does in SQL.

XQuery clauses
XQuery contains multiple clauses, represented by the acronym FLWOR: for, let,
where, order by, return. Table 2 shows these parts.

Table 2. FLWOR clauses


Clause     Description
for        You use this looping construct to assign values to variables used
           within the other clauses. You declare the variables with a dollar
           sign, as in $name, and get values assigned to them from the search
           results.
let        You use a let to assign a value to a variable outside of a for.
where      Much like in SQL, you use a where clause to filter the results
           based on some criteria.
order by   You use this clause to determine how to sort the result set
           (ascending or descending).
return     You use the return clause to determine the contents of the output
           of the query. The contents can include literals, XML document
           contents, HTML markup, or many other possibilities.

An XQuery expression can contain a condition that evaluates to true or false; this
condition makes up the search criteria within the FLWOR clauses. Look at some
examples. You can use the dvd.xml file shown in Listing 8 as the XML instance
document.

Listing 8. dvd.xml

<?xml version="1.0"?>
<!-- DVD inventory -->
<catalog>
<dvd code="1234567">
<title>Terminator 2</title>
<price>19.95</price>
<year>1991</year>
</dvd>
<dvd code="7654321">
<title>The Matrix</title>
<price>12.95</price>
<year>1999</year>
</dvd>
<dvd code="2255577">
<title>Life as a House</title>
<price>15.95</price>
<year>2001</year>
</dvd>
<dvd code="7755522">
<title>Raiders of the Lost Ark</title>
<price>14.95</price>
<year>1981</year>
</dvd>
</catalog>

Saxon
You can get the free Saxon tools at Saxonica if you want to try out
XQuery yourself (see Resources).

To try this out, I used the Saxon XQuery tools. All my files are in the directory I
unpacked Saxon into. To use XQuery to create an HTML page that lists all the DVD
titles in ascending order, I used the dvdTitles.xq file shown in Listing 9, which also
shows the output. I used the following command to execute this query:

java -cp saxon8.jar net.sf.saxon.Query -t dvdTitles.xq > dvdTitles.html


Listing 9. XQuery to list DVD titles in ascending order

dvdTitles.xq:
<html>
<body>
Available DVDs:
<br/>
<ol>
{
for $title in doc("dvd.xml")/catalog/dvd/title
order by $title
return <li>{data($title)}</li>
}
</ol>
</body>
</html>
dvdTitles.html:
<?xml version="1.0" encoding="UTF-8"?>
<html>
<body>
Available DVDs:
<br/>
<ol>
<li>Life as a House</li>
<li>Raiders of the Lost Ark</li>
<li>Terminator 2</li>
<li>The Matrix</li>
</ol>
</body>
</html>

In Listing 9, look at the XQuery logic in detail. First of all, the query expression
embedded in the HTML must be enclosed in curly braces ("{}"). You can see in this
example that three of the clauses are used (for, order by, and return). You use the
doc() function to
open an XML document. $title is a variable that is set to each of the search
results during each loop. In this case, it is set to each result of the
/catalog/dvd/title expression -- thus, its name. The data() function in the
return clause pulls out just the value from the XML without the tags. If you just put
$title, you would get "<title>value</title>," which you don't want in your
HTML output. Notice that the XQuery is surrounded with all the HTML needed to
complete the page.

Now, suppose you want to output the prices for DVDs that cost less than US$15 in
descending order. Listing 10 shows the XQuery and output files.

Listing 10. DVD prices < US$15 in descending order

dvdPriceThreshold.xq:
<html>
<body>
DVDs prices below $15.00:
<br/>
<ol>
{
for $price in doc("dvd.xml")/catalog/dvd/price
where $price < 15.00
order by $price descending
return <li>{data($price)}</li>
}
</ol>
</body>
</html>
dvdPrices.html:
<?xml version="1.0" encoding="UTF-8"?>
<html>
<body>
DVDs prices below $15.00:
<br/>
<ol>
<li>14.95</li>
<li>12.95</li>
</ol>
</body>
</html>

The main difference with this query is that you specified a where clause. Just for
fun, you also reversed the sort order.
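
If FLWOR is new to you, it can help to see the same steps spelled out in a language you already know. The sketch below is not XQuery and does not use Saxon; it approximates the price query with the javax.xml.xpath API from the JDK plus ordinary Java filtering and sorting. The class name FlworInJava is hypothetical, and the dvd.xml content is inlined as a string so the example runs on its own.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class FlworInJava {
    // Inlined copy of the dvd.xml data (titles omitted for brevity).
    static final String DVD_XML =
          "<catalog>"
        + "<dvd code='1234567'><price>19.95</price></dvd>"
        + "<dvd code='7654321'><price>12.95</price></dvd>"
        + "<dvd code='2255577'><price>15.95</price></dvd>"
        + "<dvd code='7755522'><price>14.95</price></dvd>"
        + "</catalog>";

    static List<String> pricesBelow(double limit) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // for $price in doc("dvd.xml")/catalog/dvd/price
            NodeList nodes = (NodeList) xpath.evaluate("/catalog/dvd/price",
                    new InputSource(new StringReader(DVD_XML)), XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // data($price): take the text value without the tags
                String price = nodes.item(i).getTextContent();
                // where $price < 15.00
                if (Double.parseDouble(price) < limit) {
                    result.add(price);
                }
            }
            // order by $price descending
            result.sort(Comparator.comparingDouble((String p) -> Double.parseDouble(p)).reversed());
            // return: the caller decides how to wrap each value, such as in <li> tags
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (String price : pricesBelow(15.00)) {
            System.out.println("<li>" + price + "</li>");
        }
    }
}
```

One difference is worth knowing: this sketch sorts the prices numerically, whereas the query as written orders untyped values, which XQuery treats as strings. The two agree for this data but could differ for prices with different numbers of digits.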

Obviously, you can do a lot more to learn the power of XQuery, but I've covered
enough to show you some of the possibilities. To learn more, check out "XQuery"
and "Five Practical XQuery Applications" (see Resources).

Section 5. Conclusion
The core of XML is parsing and validation. Knowing how to use these capabilities
well is vital to the successful introduction of XML to your project.

Summary
In this tutorial on XML processing, you've seen how to:

• Parse XML documents using the SAX2 and DOM2 parsers
• Validate XML documents against DTDs and XML Schemas
• Access XML content from databases using XQuery


Downloads
Description                 Name                          Size   Download method
Sample DTD and XML files    x-cert1423-code-samples.zip   16KB   HTTP


Resources
Learn
• XML and Related Technologies certification prep (developerWorks, August -
October, 2006): With this series of five tutorials, prepare to take the IBM
certification Test 142, XML and Related Technologies, to attain the IBM
Certified Solution Developer - XML and Related Technologies certification.
• XML: A Manager's Guide, Second Edition (Kevin Dick, Addison-Wesley
Professional, 2002): Read about uses of XML technologies in enterprise
applications.
• XML in a Nutshell, 3rd Edition (Elliotte Rusty Harold and W. Scott Means,
O'Reilly Media, 2004, ISBN: 0596007647): Check out this comprehensive XML
reference with everything from fundamental syntax rules, DTD and XML
Schema creation, XSLT transformations, processing APIs, XML 1.1, plus SAX2
and DOM Level 3.
• XQuery (Jim Keogh and Ken Davidson, McGraw-Hill/Osborne, 2005; ISBN:
0072262109): Learn to write XQuery expressions in this excerpt from chapter 9
of the book XML DeMYSTiFieD.
• Five Practical XQuery Applications (Tim Matthews and Srinivas Pandrangi, 9
May 2003): Add XQuery in your own apps to simplify difficult or tedious tasks.
• An Introduction to StAX (Elliotte Rusty Harold, O'Reilly Media, September 17,
2003): Read more about Streaming API for XML (StAX) in this article.
• Interactive XML tutorials: Explore a variety of XML topics including SVG, DTD,
Schema, XSLT, DOM, and SAX, complete with student problems and access to
online parsers that process your answers for immediate feedback.
• W3Schools online Web tutorials: Discover Web-building tutorials, from basic
HTML and XHTML to advanced XML, SQL, Database, Multimedia and WAP.
• Java theory and practice: Screen-scraping with XQuery (Brian Goetz,
developerWorks, 22 Mar 2005): See how effectively you can use XQuery as an
HTML screen-scraping engine.
• Power your mashups with XQuery (Ning Yan, developerWorks, July 2006):
Create a mashup application that uses XQuery to couple Web content with XML
data and Web services.
• The Java XML Validation API (Elliotte Rusty Harold, developerWorks, August
2006): Check your documents for conformance to schemas with this XML
validation API.
• Saxonica: XSLT and XQuery Processing: Learn about this collection of tools for
processing XML documents that includes XSLT 2.0, XPath 2.0, XQuery 1.0,
and XML Schema 1.0 processors.
• DOMException from Chapter 9 of Processing XML with Java: A Guide to SAX,
DOM, JDOM, JAXP, and TrAX (Elliotte Rusty Harold, Addison-Wesley
Professional, 2002): Read about DOMException, the generic runtime exception.
• DOM Modules section in Chapter 9 of Processing XML with Java: A Guide to
SAX, DOM, JDOM, JAXP, and TrAX (Elliotte Rusty Harold, Addison-Wesley
Professional, 2002): Read about the fourteen modules in eight different
packages of DOM2.
• Chapter 12, The DOM Traversal Module of Processing XML with Java: A Guide
to SAX, DOM, JDOM, JAXP, and TrAX (Elliotte Rusty Harold, Addison-Wesley
Professional, 2002): Delve into this collection of utility interfaces that perform
most of the logic to traverse a DOM tree for simpler programs.
• Setting Features: Read how to set and query features from The Apache
Software Foundation, 2005.
• Serial Access with the Simple API for XML (SAX): Discover SAX -- the
event-driven, serial-access mechanism for accessing XML documents.
• Tree traversals: Bruno R. Preiss explains different tree traversals.
• XML Bible, Second Edition (Elliotte Rusty Harold): View Table 24-5 in Chapter
24 for a grammar of regular expressions symbols for XML schema.
• IBM XML 1.1 certification: Become an IBM Certified Developer in XML 1.1 and
related technologies.
• XML: See developerWorks XML Zone for a wide range of technical articles and
tips, tutorials, standards, and IBM Redbooks.
• developerWorks technical events and webcasts: Stay current with technology in
these sessions.
Get products and technologies
• Apache Xerces2 parser: Download the open source for an XML-compliant parser
that includes the Xerces Native Interface (XNI) framework for building parser
components and configurations.
• Java software development kit (JDK) 1.4.2 or later: Download the JDK to build
standards-based, interoperable apps, applets, and Web services.
• Eclipse 3.1 or later: Download this open source, extensible development
platform and application frameworks for building software.
• XMLBuddy 2.0 or later: Download and start to work in XML-related technology,
including XML, DTD, XML Schema, RELAX NG, RELAX NG compact syntax
and XSLT. You can get XMLBuddy as an Eclipse plugin.
Discuss
• XML zone discussion forums: Participate in any of several XML-centered
forums.
• developerWorks blogs: Get involved in the developerWorks community.


About the author


Mark Lorenz
Mark Lorenz is the founder of Hatteras Software, an object-oriented consulting firm,
and the author of multiple books on software development. He is certified in
object-oriented analysis and design (OOAD), XML, RAD, and Java. He uses XHTML,
Web services, Ajax, JSF, Spring, BIRT, and related Eclipse-based tools to develop
Java enterprise applications. You can read Mark's blog on technology.

Trademarks
IBM, DB2, Lotus, Rational, Tivoli, and WebSphere are trademarks of IBM
Corporation in the United States, other countries, or both.
Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the
United States, other countries, or both.
Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft
Corporation in the United States, other countries, or both.
