Faster Smarter HTML & XML
Author Michael Morrison
Pages 352
Disk N/A
Level Beginner
Published 11/13/2002
ISBN 9780735618619
ISBN-10 0-7356-1861-5
Price(USD) $19.99
Chapter 16: Organizing Data with XML




As I'm sure you know, there's a great deal of information floating around out there on the Internet. Not surprisingly, little of this information is structured in a meaningful way, other than being formatted for easy viewing, which sometimes makes it difficult to find what you're looking for. Granted, search engines do a good job of dealing with the chaos of the Web's colossal information overload but, for most of us, an added degree of organization and structure would be welcome. Fortunately, the industry is rapidly adopting a technology that aims to provide this.

I'm referring to XML, which you learned in Chapter 1, "An Introduction to HTML, DHTML, and XML," stands for Extensible Markup Language. Like HTML, XML allows you to use tags to create Web pages and other documents. In addition, XML is designed to be completely open-ended; you can create your own tags to give pages unique meaning. This isn't possible with HTML. You will learn in this chapter that XML introduces a whole new way of thinking about the Web and electronic information in general. It is sure to challenge and broaden your perspective about Web pages.

Getting to Know XML

Because Chapter 1 dealt with XML at a conceptual level, I want to dive in with the XML language and show you how it works. If you recall, XML is a generic language used to describe other markup languages such as HTML. Knowing this, you'll find XML to be extremely general and open-ended. It's not until you begin working with specific XML vocabularies that the true power of XML comes into full view. So as you learn about XML, try to think about how it might affect the HTML markup you've used so far.

The first thing to understand about XML is that it makes a clear distinction between markup and content.

In the following example of HTML code, can you guess which parts are markup and which are content?

<p>Let's sing a lament, the world isn't round it's <i>twisted and bent</i>.</p> 

In this HTML code, the <p>, <i>, </i>, and </p> tags are all markup; the remaining sentence text is content. Here, the markup is used to describe the appearance of the content, which is typical of HTML code. XML markup is often more descriptive and doesn't necessarily have anything to do with the appearance of content. Here is an example of a hypothetical XML document:

<question answer="true">The world's termites outweigh the world's humans ten to one. True or False?</question>

In this code, a hypothetical question-and-answer XML vocabulary is used to mark up content for a True/False question. Notice that the question is marked up using the <question> tag; the answer is specified by the answer attribute. None of this markup has anything to do with the content's appearance; instead, it focuses on its meaning. However, the markup is instantly familiar because it's formatted in a manner similar to HTML.
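To see how a program might pull the markup and the content apart, here is a minimal sketch that reads the question markup with Python's standard xml.etree.ElementTree module. The choice of Python is an assumption made for illustration; nothing in this chapter depends on a particular programming language.

```python
import xml.etree.ElementTree as ET

# The hypothetical question-and-answer markup from the text.
source = (
    '<question answer="true">The world\'s termites outweigh '
    "the world's humans ten to one. True or False?</question>"
)

question = ET.fromstring(source)

print(question.tag)     # the element name
print(question.attrib)  # the attributes, as a dictionary
print(question.text)    # the character data (the content)
```

The element name and the answer attribute come back separately from the sentence itself, which is exactly the markup-versus-content split described above.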

Although I've talked in terms of tags when describing the creation of XML documents, the actual structure is determined by elements.

For example, in the question markup you just saw, there is an element named question that is marked in the document by using the <question> and </question> tags. It's helpful to think in terms of elements when you're analyzing an XML vocabulary instead of thinking in terms of tags. Another way to explain elements is to say that they describe the structure of XML documents.

An element can have both start and end tags, as in the question element, or a single empty tag. An HTML example of an element with both start and end tags is the p paragraph element, which has both the <p> and </p> tags. An HTML element with a single empty tag is the img image element, which has only the <img/> tag. Notice that I closed the empty <img/> tag with a forward slash (/) before the closing angle bracket (>). This is important in XML; all empty tags must have a closing slash. This is the first of several picky XML coding conventions that you need to get used to.

XML elements are capable of containing content, child elements, or both. Content in XML is often referred to as character data, to indicate that it consists of characters of text. When an element contains child elements, it means that they're nested within the element. Although this may sound tricky, you are already experienced with nested elements. In fact, you just saw an example of HTML code that included both nested child elements and character data. Here is the sentence I just showed you:

<p>Let's sing a lament, the world isn't round it's <i>twisted and bent</i>.</p>

In this code, the p element contains the i element as a child element, along with the character data for part of the sentence. The i element also contains the character data twisted and bent. When you look at examples such as this, it becomes apparent that XML is not as complicated as it sounds.
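The nesting just described can be inspected with a short Python sketch (again an illustrative assumption, using the standard xml.etree.ElementTree module). It shows the child element, the character data before it, and the character data after it.

```python
import xml.etree.ElementTree as ET

p = ET.fromstring(
    "<p>Let's sing a lament, the world isn't round it's "
    "<i>twisted and bent</i>.</p>"
)

i = p[0]  # the first (and only) child element of p

print(p.text)               # character data before the child element
print(i.tag, "->", i.text)  # the child element and its character data
print(repr(i.tail))         # character data that follows the child
```

ElementTree stores the text that follows a child in that child's tail attribute, which is a design choice of this library rather than part of XML itself.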

The XML language consists of several components that describe the makeup of different parts of a document. Here are the major XML components:

  • Element tags
  • Entity references
  • Comments
  • Processing instructions
  • Document type declarations

Don't worry if these components sound technical because you're about to see that they're actually easy to understand. Their significance is that they describe the fundamental structure of the XML language, dictating the makeup of all XML documents. With a solid understanding of these components, you'll be able to read and understand the overall structure of any XML document, not to mention gain a new perspective on HTML.

Because I'm the kind of guy who loves to get in over my head when I learn new things, I'm going to use a similar approach here by explaining the XML components in the context of a real XML document. Don't worry if the document doesn't make sense immediately, because it will soon enough. The document I'm talking about uses a special XML vocabulary to mark up an audio collection. You could use a document like this to catalog all of your CDs and tapes. Here is the audio collection XML document:

<?xml version="1.0"?>
<!DOCTYPE audiocollection SYSTEM "AudioCollection.dtd">
 
<audiocollection>
  <!--  This is the Rock section of the collection. -->
  <audio type="rock" review="5" year="1990">
    <title>Cake</title>
    <artist>The Trash Can Sinatras</artist>
    <track>Obscurity Knocks</track>
    <track>Maybe I Should Drive</track>
    <track>Thrupenny Tears</track>
    <track>Even the Odd</track>
    <track>The Best Man's Fall</track>
    <track>Circling the Circumference</track>
    <track>Funny</track>
    <track>Only Tongue Can Tell</track>
    <track>You Made Me Feel</track>
    <track>January's Little Joke</track>
    <comments>Brilliant first release from the most underrated
    band in existence.</comments>
  </audio>
 
  <!-- This is the Jazz section of the collection. -->
  <audio type="jazz" review="5" year="1993">
    <title>Criss-Cross</title>
    <artist>Thelonious Monk</artist>
    <track>Hackensack</track>
    <track>Tea for Two</track>
    <track>Criss-Cross</track>
    <track>Eronel</track>
    <track>Rhythm-A-Ning</track>
    <track>Don't Blame Me</track>
    <track>Think of One</track>
    <track>Crepuscule with Nellie</track>
    <track>Pannonica</track>
    <comments>Excellent collection of Monk across five
    different sessions.</comments>
  </audio>
</audiocollection>

Even though I haven't formally introduced you to the details of the XML language, you can probably study this document for a few moments and make out most of its meaning. This is because XML tags tend to be pretty descriptive. By the way, Internet Explorer allows you to view XML documents and interact with them to some degree.

XML documents don't necessarily include any information about how they are to be displayed, and Internet Explorer doesn't try to interpret the meaning of the audio collection document. Instead, it focuses on highlighting the different structural parts. If you look carefully, you'll see a hyphen (-) to the left of some of the elements. Clicking this hyphen allows you to close the element, thereby hiding the information contained within it. This could be helpful in large XML documents. When you close an element, the hyphen turns into a plus sign (+), which can be used to reopen the element. Figure 16-1 shows the audio collection document with both of the audio elements closed.


Figure 16-1  Closing elements in an XML document can help you to see the higher level of the document's structure.
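One more angle on the same document is to read it with a program instead of a browser. The following Python sketch is an assumption for illustration: it uses the standard xml.etree.ElementTree module and an abbreviated copy of the audio collection (fewer tracks, no DOCTYPE line) to build a one-line summary of each audio element.

```python
import xml.etree.ElementTree as ET

# An abbreviated copy of the audio collection document (fewer tracks,
# and no DOCTYPE line, to keep the sketch short).
collection_xml = """<audiocollection>
  <audio type="rock" review="5" year="1990">
    <title>Cake</title>
    <artist>The Trash Can Sinatras</artist>
    <track>Obscurity Knocks</track>
    <track>Maybe I Should Drive</track>
  </audio>
  <audio type="jazz" review="5" year="1993">
    <title>Criss-Cross</title>
    <artist>Thelonious Monk</artist>
    <track>Hackensack</track>
    <track>Tea for Two</track>
    <track>Criss-Cross</track>
  </audio>
</audiocollection>"""

root = ET.fromstring(collection_xml)

# Collect (title, artist, type, number of tracks) for each audio element.
summary = []
for audio in root.findall("audio"):
    summary.append(
        (
            audio.findtext("title"),
            audio.findtext("artist"),
            audio.get("type"),
            len(audio.findall("track")),
        )
    )

for title, artist, kind, tracks in summary:
    print(f"{title} by {artist} ({kind}, {tracks} tracks)")
```

Because the tags describe what the data means, the program can ask for titles, artists, and tracks by name without knowing anything about how the document looks on screen.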

Now that you've seen the audio collection document from different angles, let's use it to learn about the primary XML components.

Understanding Elements and Tags

As you may have guessed, tags form the basis of all XML documents and are used to mark up elements. The artist element in the audio collection example shows this: it's marked up using the <artist> and </artist> tags. The distinction between elements and tags is admittedly subtle; think of elements as logical pieces of markup, and tags as specific text strings used to represent elements in XML documents.

Earlier in the chapter, I mentioned that elements can have both start and end tags, in which case they contain character data. Or they can be empty. Empty elements must be closed with a forward slash (/). A good example of an empty element in HTML is the br element, which is used to create a line break on a page. The br element doesn't contain any character data and according to XML standards must be coded as <br/>. Many HTML developers code empty elements without the closing slash, but the future of HTML is slanted toward XML, so you should get in the habit of closing empty elements properly.

Note that the forward slash used in empty tags is a carryover of the forward slash used with end tags. For example, consider that all end tags begin with a forward slash: </html>, </body>, </p>, and so on. Using a forward slash at the end of an empty tag is like combining a pair of start and end tags into a single tag. As evidence that this is the motivation behind the forward slash, I must point out that it's possible in XML to code an empty tag as a pair of tags, like this:

<br></br>

In this example, a line break is coded as a pair of tags, instead of the empty <br/> tag. Although the two-tag variation is valid XML, the empty tag approach is preferable because it's more concise.
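As a quick check that the two forms really mean the same thing, here is a small Python sketch (an illustrative assumption, using the standard xml.etree.ElementTree module) that parses both and compares the results.

```python
import xml.etree.ElementTree as ET

short_form = ET.fromstring("<br/>")
long_form = ET.fromstring("<br></br>")

# Both forms produce an element with no character data and no children.
print(short_form.text, len(short_form))
print(long_form.text, len(long_form))

# Writing either element back out yields identical markup.
print(ET.tostring(short_form) == ET.tostring(long_form))
```

To the parser the two spellings are the same empty element, so the choice between them comes down to brevity.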

Referencing Entities

Because of the rigid structure of XML, some characters must be specially encoded before you can include them as content. The less-than character (<) and the ampersand (&) always signal markup, so they can never appear literally in content. The apostrophe (') and the double quote (") are used to delimit attribute values, so they need special coding when they appear inside an attribute value that is delimited by the same character. To understand, consider the following XML content:

Last summer we visited Pike's Peak.

As element content, this apostrophe is legal as written, which is why track titles such as The Best Man's Fall appear unchanged in the audio collection document. Inside an attribute value enclosed in apostrophes, however, the apostrophe in Pike's Peak would end the value early, so you must use a special technique to identify it as content.

An entity reference is basically a unique name that identifies a piece of XML data. You use an entity reference by enclosing the reference between an ampersand (&) and a semicolon (;). The standard entity reference for an apostrophe character is &apos;, which means the previous XML content can be coded like this:

Last summer we visited Pike&apos;s Peak.

I realize this code isn't easy to read, but the &apos; entity reference makes it unambiguous that the apostrophe is XML content, not markup, wherever the text appears. You must also use an entity reference for the ampersand character because the ampersand character is used to identify entity references. You must use the entity reference &amp; for an ampersand that is content.

Entity references are usually unique to the specific document in which they appear. However, there are several built-in entity references. Table 16-1 shows the built-in entity references that are available for use in all XML documents:

Table 16-1 Built-In Entity References in XML

Entity Reference    Description
&amp;               Ampersand character (&)
&quot;              Double-quote character (")
&apos;              Apostrophe character (')
&lt;                Less-than character (<)
&gt;                Greater-than character (>)
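You rarely need to type these references by hand when a program generates your XML. The following Python sketch is an assumption for illustration: it uses escape and unescape from the standard xml.sax.saxutils module, and the place element is a hypothetical tag invented for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape

raw = "Pike's Peak & <Friends>"

# escape() handles &, < and > by default; the extra mapping adds the
# apostrophe and double-quote references from Table 16-1.
extra = {"'": "&apos;", '"': "&quot;"}
encoded = escape(raw, extra)
print(encoded)

# A parser turns the references back into ordinary characters.
# The <place> element is hypothetical, used only for this sketch.
place = ET.fromstring(f"<place>{encoded}</place>")
print(place.text)

# unescape() reverses the substitution without a parser.
reverse = {"&apos;": "'", "&quot;": '"'}
print(unescape(encoded, reverse) == raw)
```

The round trip shows that entity references change only how the characters are written in the file, not the content a program sees after parsing.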

Using Comments

Like HTML, XML allows you to create comments that aren't interpreted as document markup or content. Comments are useful for adding notes that explain a certain part of a document, or maybe to mention an aspect of the document that you intend to improve upon later. Document markup and content is designed for interpretation by your computer; it's either processed or displayed, but comments are there solely for your benefit as the XML author. Put another way, comments are ignored when an XML document is processed and/or displayed.

You can place comments in a document anywhere content appears, which makes it possible to add comments throughout a document if you so desire. Comments are unique because they are enclosed by special symbols. More specifically, you start a comment with <!-- and end it with -->. Here is an example of a simple comment:

<!-- Copyright (c) 2003 Tailspin Toys -->

This code shows how you could add a copyright notice to your XML documents with a comment. Comments are used a few times in the audio collection document, as the following code reveals:

<!-- This is the Rock section of the collection. -->
<!-- This is the Jazz section of the collection. -->
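To see comments being ignored during processing, consider this Python sketch (an illustrative assumption, using the standard xml.etree.ElementTree module and a trimmed-down copy of the audio collection).

```python
import xml.etree.ElementTree as ET

doc = """<audiocollection>
  <!-- This is the Rock section of the collection. -->
  <audio type="rock"><title>Cake</title></audio>
  <!-- This is the Jazz section of the collection. -->
  <audio type="jazz"><title>Criss-Cross</title></audio>
</audiocollection>"""

root = ET.fromstring(doc)

# ElementTree's default parser skips comments, so only the two audio
# elements show up as children of the root.
children = [child.tag for child in root]
print(children)
```

The comments are still in the file for anyone who opens it in an editor; they simply never reach the program that consumes the data.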

Using Processing Instructions

Contrary to what I've led you to believe, XML documents don't consist of markup, content, and comments only. In fact, a couple of other pieces of information commonly show up in XML documents. One is processing instructions, special commands passed along to the programs that process or view the XML document. Processing instructions are easily distinguished from other XML components because they always start with <? and end with ?>. For example, here is a processing instruction that you will see in virtually every XML document:

<?xml version="1.0"?>

Notice that in this code the processing instruction begins with <? and ends with ?>. Inside the instruction, you may notice that the structure is similar to tags. This is because processing instructions typically include a name followed by an attribute/value pair. The previous example identifies an XML document as adhering to version 1.0 of the XML standard. (Strictly speaking, the XML specification calls this particular line the XML declaration, although it shares the syntax of a processing instruction.) It was used at the beginning of the audio collection document, and when it is present it must be the very first thing in the document.
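Here is a minimal Python sketch of how a program receives a processing instruction, using the standard xml.dom.minidom module. The xml-stylesheet instruction and the audio.css file name are assumptions added for illustration; they are not part of the audio collection document.

```python
from xml.dom import Node, minidom

# The xml-stylesheet instruction is an illustrative addition; it does
# not appear in the audio collection document.
doc = minidom.parseString(
    '<?xml version="1.0"?>'
    '<?xml-stylesheet type="text/css" href="audio.css"?>'
    "<audiocollection/>"
)

# Keep only the top-level nodes that are processing instructions.
instructions = [
    node for node in doc.childNodes
    if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
]

for pi in instructions:
    print(pi.target)  # the name that follows "<?"
    print(pi.data)    # everything between the name and "?>"
```

In this sketch the parser consumes the <?xml version="1.0"?> line itself, so only the second instruction is handed to the program as a node.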

Declaring the Document Type

The last XML component you need to understand is document type declarations, which are extremely important because they describe the structure of an XML document.

The document type declaration takes care of the following three primary tasks:

  • Specifies the document's root element (for example, html is the root element of HTML documents)
  • Defines elements, attributes, and entities specific to the document
  • Identifies an external DTD for the document

This document type declaration stuff is confusing, so let me clarify its purpose. One of the principal features of XML is the ability to validate documents based on whether they adhere to the strict XML rules. In addition to making sure that a document follows the fundamental language rules of XML, it's also important to see that it adheres to the specific language rules of the markup language it is based on. For example, it's possible for a Web page to completely adhere to XML language rules yet completely violate HTML rules. A simple example of how this is possible is if you use the <joke> tag I mentioned previously on a Web page. You can code the <joke> tag so that it's perfectly legal in XML, but HTML has no <joke> tag, so the tag violates HTML.

The point is that XML documents have two different levels of correctness. The first is determined by whether a document meets the strict language requirements of XML. If it does follow these rules, it's considered a well-formed document. The second level of correctness is determined by whether a document adheres to a DTD for a particular markup language such as HTML. If the document passes this test as well, then it's referred to as a valid document. It's considered an accolade of the highest order for an XML document to be valid. Ideally, all XML documents—and ultimately all Web pages—would be valid documents. It goes without saying that a valid document is also a well-formed document, but the reverse is not always true.
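The first level of correctness is easy to test with a program. The sketch below is an assumption for illustration: is_well_formed is a hypothetical helper built on Python's standard xml.etree.ElementTree module, and it checks only well-formedness, not validity against a DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Well-formed XML, even though HTML has no joke tag.
print(is_well_formed("<joke>Legal XML, but not an HTML tag.</joke>"))

# Not well-formed: the i element is closed after its parent.
print(is_well_formed("<p>An <i>improperly nested</p> element.</i>"))

# Not well-formed: the empty br tag lacks its closing slash.
print(is_well_formed("<p>A line break with no closing slash.<br></p>"))
```

Checking the second level, validity, requires a parser that reads the DTD, which Python's standard library parsers do not do; that is why this sketch stops at well-formedness.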

Let's now circle back to the document type declaration for a document, whose main purpose is to identify the root element of the document as well as the DTD, which is usually contained in an external file. The DTD is essential for creating valid documents, and the audio collection document includes its document type declaration on a single line of code:

<!DOCTYPE audiocollection SYSTEM "AudioCollection.dtd">

In this code, the root element of the document is identified as audiocollection, and the DTD is located in the external file AudioCollection.dtd. This DTD can be used to validate the document, which you will learn about in Chapter 18, "XHTML: XML Meets HTML." The main point to understand now is the structure of the document type declaration and how it identifies the root element and external DTD.

In case it isn't obvious, the root element of a document is the element that contains all other elements. In HTML documents, the root element is html. In the audio collection document, it's audiocollection.
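The two pieces of information carried by the document type declaration can be read back with a few lines of Python. This is a sketch under assumptions: it uses the standard xml.dom.minidom module and an empty audiocollection element in place of the full document, and it does not open or validate against AudioCollection.dtd.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<?xml version="1.0"?>'
    '<!DOCTYPE audiocollection SYSTEM "AudioCollection.dtd">'
    "<audiocollection/>"
)

print(doc.doctype.name)             # root element named in the declaration
print(doc.doctype.systemId)         # location of the external DTD
print(doc.documentElement.tagName)  # the actual root element
```

The name in the declaration and the name of the actual root element match, as they must for the document to be valid.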

Modeling XML Data

By now you're probably thoroughly confused by document type declarations and how they are used to describe XML documents. This section will help clarify the role of both document type declarations and DTDs so you can fully understand why they are an important part of XML.

In case it's not abundantly clear yet, XML is all about structuring information. Almost every facet of XML is directly aimed at accomplishing this so that people can better understand information. To structure information, it's necessary to establish a model for the data. An XML document model serves as a template that determines what kind of information can appear in the document, as well as how it's structured.

XML document models are also sometimes referred to as schemas, and are used to describe a class of data. For example, the information contained within the audio collection document you saw earlier in the chapter could be considered a class of data: audio data. Once you've established a class of data by using a schema, you can create highly structured documents that can be tested for validity. The benefit of having valid documents is that they can be accurately processed with automated programs such as search engines.

An XML schema describes the arrangement of markup and content within a valid XML document; the document must strictly adhere to a schema to be considered valid. Knowing this, you can think of a schema as an agreement between an XML document (perhaps a Web page) and the XML vocabulary (HTML) in which it's written.

Consider a simplified, real-world analogy. If you meet someone and he gives you his phone number, you expect the number to be in a certain format. If he gives you an 8-digit number, you immediately know something is wrong, because domestic phone numbers in the United States adhere to a 10-digit format. This format is the schema that you use to determine that the 8-digit number is invalid. Although this example is simplified, it nonetheless shows how we employ schemas in many areas of our lives other than Web development.

The specific role of a schema is to describe an XML vocabulary, naming every tag and attribute, as well as their relationships with each other. Of course, a document without a schema can use any custom tag or attribute. This is fine, but it precludes the document from being considered valid. And as you now know, validity is the ultimate goal of all XML documents. If documents without schemas can use any tags or attributes, then it's fair to say that schemas impose constraints on how documents of a certain type can be structured. More specifically, schemas constrain the structure of documents in two ways:

  1. They define the data model, which determines the specific order and nesting of elements.
  2. They establish the data types of document data.

The first function is the most important because it determines which elements can be used in a document and how they relate to one another. DTDs rely primarily on this approach for describing document structure, and are weak in establishing data types for document data. DTDs represent the standard approach for describing XML document structure, but are at risk of being replaced by an alternative called XML Schema.

XML Schema is a newer approach, promoted by Microsoft, that includes rich support for describing document data. DTDs do a great job detailing which tags and attributes can be used in a document, as well as how they can be nested. But XML Schema goes a step further: you can nail down the data types of document content. The next two sections introduce you to both DTDs and XML Schema and show you how to use each to establish the structure of the audio collection document.

Working with DTDs

DTDs serve as the standard schema approach for describing the structure of XML documents. Although DTDs represent the original schema approach for XML, they aren't without flaws. One complaint is that they use a specialized language for describing the structure of XML vocabularies. Although this language is simple, it's cryptic and seemingly unnecessary when you consider that XML could be used to describe document structure. The only upside to the special language used in DTDs is that it's compact, making most DTDs relatively small. The DTD language describes the structure of documents using individual characters such as question marks, asterisks, and plus signs; hence, its cryptic look.

Even so, DTDs are easy to follow once you understand what the different characters mean, and they benefit from being concise. For proof, take a look at the following DTD, which describes the structure of the audio collection XML document:

<!ELEMENT audiocollection (audio)+>
 
<!ELEMENT audio (title, artist+, track+, comments?)>
<!ATTLIST audio
  type (rock | pop | jazz | classical | country | soul |
  hiphop | comedy | other) "rock"
  review (1 | 2 | 3 | 4 | 5) "3"
  year CDATA #IMPLIED>
 
<!ELEMENT title (#PCDATA)>
 
<!ELEMENT artist (#PCDATA)>
 
<!ELEMENT track (#PCDATA)>
 
<!ELEMENT comments (#PCDATA)>

As you can see, there isn't a lot to this DTD. The main part to understand is the relationship between the elements. Notice that the root audiocollection element is listed first. The word audio in parentheses next to the audiocollection element indicates that the audiocollection element contains the audio element as a child element. The plus sign (+) next to audio indicates that the audio element can appear multiple times within the audiocollection element.

You'll also see that the audio element contains several child elements of its own: title, artist, track, and comments. The plus signs next to artist and track indicate that there can be multiple elements of each. The question mark (?) next to comments indicates that the comments element is optional but can be used only once. This is the cryptic mumbo-jumbo DTD language I mentioned previously. It's fairly easy to understand but not necessarily intuitive.

Back to the audio element. It has three attributes, as noted by the ATTLIST notation in the DTD: type, review, and year. The type and review attributes are interesting because they specify a list of possible values along with a default value for each ("rock" for type and "3" for review). This is an important part of the DTD because you must adhere to the attribute lists for the type and review attributes when you create audio collection documents. In other words, the values for these attributes must be one of the values appearing in the lists. The year attribute is a text attribute that can contain any kind of text data; the CDATA notation indicates that the attribute can contain Character DATA, and the #IMPLIED keyword means that the attribute is optional and has no default value.

You may notice that the remaining elements in the DTD contain #PCDATA, or Parsed Character DATA. This is a fancy way of saying that the elements contain text content.

There are certainly more subtleties to DTD design than I've mentioned in this brief explanation of the audio collection DTD, but I think I've given you an idea of how a DTD lays the ground rules for XML vocabularies. More important, DTDs provide the guidelines to which XML documents can be compared to determine validity.
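One way to demystify the DTD notation is to notice that a content model such as (title, artist+, track+, comments?) behaves like a regular expression over the names of the child elements. The Python sketch below leans on that idea. It is an assumption for illustration, not a real DTD validator: audio_matches_dtd is a hypothetical helper that hand-copies the rules for the audio element only.

```python
import re
import xml.etree.ElementTree as ET

# <!ELEMENT audio (title, artist+, track+, comments?)> rewritten as a
# regular expression over the space-separated names of the children.
AUDIO_MODEL = re.compile(r"title( artist)+( track)+( comments)?")

# Enumerated attribute values copied from the ATTLIST declaration.
ALLOWED = {
    "type": {"rock", "pop", "jazz", "classical", "country", "soul",
             "hiphop", "comedy", "other"},
    "review": {"1", "2", "3", "4", "5"},
}


def audio_matches_dtd(audio: ET.Element) -> bool:
    """Roughly check one audio element against the DTD's rules."""
    child_names = " ".join(child.tag for child in audio)
    if not AUDIO_MODEL.fullmatch(child_names):
        return False
    # A missing attribute is fine: the DTD supplies a default for it.
    return all(
        audio.get(name) is None or audio.get(name) in values
        for name, values in ALLOWED.items()
    )


good = ET.fromstring(
    '<audio type="jazz" review="5" year="1993"><title>Criss-Cross</title>'
    "<artist>Thelonious Monk</artist><track>Hackensack</track></audio>"
)
bad = ET.fromstring(
    '<audio type="polka"><title>No Artist</title>'
    "<track>Only Track</track></audio>"
)

print(audio_matches_dtd(good))
print(audio_matches_dtd(bad))
```

The second element fails on two counts: it has no artist child, and polka is not one of the values listed for the type attribute.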

Working with XML Schema

I mentioned previously that Microsoft offers a more powerful alternative to DTDs. This approach uses XML as the language describing document structure and also allows you to use specific data types. The Microsoft alternative to DTDs is referred to here as XML Schema, and it's quite interesting because it uses a custom XML vocabulary to describe XML documents. (The vocabulary used in this chapter is Microsoft's XML-Data Reduced format, identified by the urn:schemas-microsoft-com:xml-data namespace; the W3C XML Schema recommendation is a separate vocabulary with different tag names.) This might seem strange at first, but all it means is that you create an XML Schema as an XML document using tags and attributes, much like you create an HTML document. The only thing necessary for you to do differently is learn how to use the XML Schema tags and attributes.

Rather than spending time sifting through the details of the XML Schema vocabulary, let's look at an example. Here's the code for a schema developed using XML Schema that describes the structure of the audio collection document:

<?xml version="1.0"?>
 
<Schema name="AudioCollectionSchema"
  xmlns="urn:schemas-microsoft-com:xml-data"
  xmlns:dt="urn:schemas-microsoft-com:datatypes">
  <ElementType name="title" content="textOnly"/>
 
  <ElementType name="artist" content="textOnly"/>
 
  <ElementType name="track" content="textOnly"/>
 
  <ElementType name="comments" content="textOnly"/>
 
  <AttributeType name="type" dt:type="enumeration"
    dt:values="rock pop jazz classical country soul hiphop
    comedy other" default="rock"/>
 
  <AttributeType name="review" dt:type="enumeration"
    dt:values="1 2 3 4 5" default="3"/>
 
  <AttributeType name="year" dt:type="int"/>
 
  <ElementType name="audio" content="eltOnly">
    <element type="title" minOccurs="1" maxOccurs="1"/>
    <element type="artist" minOccurs="1" maxOccurs="*"/>
    <element type="track" minOccurs="1" maxOccurs="*"/>
    <element type="comments" minOccurs="0" maxOccurs="1"/>
    <attribute type="type"/>
    <attribute type="review"/>
    <attribute type="year"/>
  </ElementType>
 
  <ElementType name="audiocollection" content="eltOnly">
    <element type="audio" minOccurs="1" maxOccurs="*"/>
  </ElementType>
</Schema>

Although it isn't nearly as compact (or as cryptic), this schema is roughly the equivalent of the DTD for the audio collection document that you saw in the previous section. The schema is a little easier to understand than the DTD because it uses XML tags and attributes. For example, the minOccurs and maxOccurs attributes of the <element> tag are used to determine how many times an element may appear as a child, as opposed to the strange character codes used to carry out the same chore in the DTD.

In addition, the content of each element is specified clearly in the content attribute, set to eltOnly (elements only), textOnly (text only), mixed (elements and text), or empty. Perhaps the most significant improvement of XML Schema over DTDs is that it uses specific data types. For example, the year attribute is specified as type int, which means that it's an integer number.
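To make the occurrence limits and the int data type concrete, here is a minimal Python sketch that applies them by hand. It is an assumption for illustration: check_audio and the OCCURS table are hypothetical, the limits are copied from the schema listing, and the sketch does not check the order of child elements or read the schema file itself.

```python
import xml.etree.ElementTree as ET

# (minOccurs, maxOccurs) for each child of audio; None stands for "*".
OCCURS = {
    "title": (1, 1),
    "artist": (1, None),
    "track": (1, None),
    "comments": (0, 1),
}


def check_audio(audio: ET.Element) -> list:
    """Return a list of problems found; an empty list means none."""
    problems = []
    for name, (low, high) in OCCURS.items():
        count = len(audio.findall(name))
        if count < low or (high is not None and count > high):
            problems.append(f"{name}: found {count}")
    year = audio.get("year")
    if year is not None:
        try:
            int(year)  # dt:type="int" in the schema
        except ValueError:
            problems.append(f"year: {year!r} is not an integer")
    return problems


ok = ET.fromstring(
    '<audio year="1990"><title>Cake</title>'
    "<artist>The Trash Can Sinatras</artist><track>Funny</track></audio>"
)
bad = ET.fromstring(
    '<audio year="ninety"><title>Cake</title><title>Again</title>'
    "<artist>The Trash Can Sinatras</artist><track>Funny</track></audio>"
)

print(check_audio(ok))
print(check_audio(bad))
```

The year check is the part a DTD cannot express: CDATA would accept ninety, whereas a typed schema rejects it.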

Knowing how XML Schema improves on its DTD equivalent, you might think that all XML vocabularies use XML Schema to describe their structure. However, this isn't the case currently. The reality is that DTDs existed before XML and are widely used in large information management programs. This means that they aren't going anywhere in a hurry. The other compelling reason not to throw away DTDs is that they work. XML Schema might work better, but such changes take time to catch on. For now, you may find yourself creating both a DTD and a schema for any XML vocabularies that you dream up.

The Practical Side of XML

So far, this chapter has spent a great deal of time delving into the theory of XML and how it works as an organizational technology for information. What I haven't spent much time doing is assessing the role of XML in your life as a Web page creator. Sure, your new XML knowledge can be put to great use impressing your geeky friends the next time document structure becomes the hot topic at a lunch meeting. But how exactly does XML benefit you in terms of your Web pages?

The relevance of XML to Web pages is twofold. First and foremost, the Web is evolving toward a more structured information repository; that evolution is a topic you may want to explore at another time. For now, just take my word that XML is dictating the future of HTML. The second aspect of XML that relates heavily to Web pages is the use of specialized XML vocabularies to mark up special types of information. Custom XML vocabularies allow you to create documents containing information that can be tied into your Web pages.

The discussion of DTDs and XML Schema in this chapter assumed to some degree that you're interested in creating your own XML vocabulary. For example, the audio collection document uses a custom XML vocabulary with its own set of tags and attributes. Although this is a powerful use of XML—and quite liberating—you probably won't be creating your own XML vocabularies often, if ever. But you will use vocabularies that others created. A wide variety of XML vocabularies exist for marking up all kinds of interesting data in XML documents. Here are some examples, along with the types of data they model:

  • MathML: mathematical equations
  • 3DML: three-dimensional virtual worlds
  • VoxML: interactive speech
  • SMIL: multimedia integration
  • RELML: real estate listings
  • HRMML: human resource management
  • XMLNews: news articles
  • P3P: personal privacy

As you can see, XML vocabularies vary greatly in the kinds of data they model. If you include any of these kinds of data in your Web pages, consider using an XML vocabulary to mark up the data.

Key Points

  • Extensible Markup Language (XML) is a generic markup language used to describe other markup languages, such as HTML.
  • An XML vocabulary is a markup language designed using XML that applies to a specific type of content.
  • An element is a discrete piece of information within an XML document, typically corresponding to a tag or set of tags; elements are capable of containing content, child elements, or both.
  • In XML, empty elements must be closed with a forward slash (/).
  • Comments are useful for adding notes that explain a certain part of a document and are created by starting the comment with <!-- and ending it with -->.
  • A document type declaration—not to be confused with a document type definition (DTD)—appears near the top of an XML document just below the xml processing instruction, and identifies the document's root element and DTD.
  • The DTD is responsible for describing the tags and attributes capable of being used in the document, along with the relationships between them.
  • DTDs serve as the standard schema approach for describing the structure of XML documents.
  • XML Schema is a newer approach to XML data modeling (promoted by Microsoft) that improves upon DTDs by allowing you to describe document data precisely.



Last Updated: October 24, 2002