Chapter 16: Organizing Data with XML
As I'm sure you know, there's a great deal of information floating around out there on the Internet. Not surprisingly, little of this information is structured in a meaningful way, other than being formatted for easy viewing, which sometimes makes it difficult to find what you're looking for. Granted, search engines do a good job of dealing with the chaos of the Web's colossal information overload but, for most of us, an added degree of organization and structure would be welcome. Fortunately, the industry is rapidly adopting a technology that aims to provide this.
I'm referring to XML, which, as you learned in Chapter 1, "An Introduction to HTML, DHTML, and XML," stands for Extensible Markup Language. Like HTML, XML allows you to use tags to create Web pages and other documents. In addition, XML is designed to be completely open-ended; you can create your own tags to give pages unique meaning. This isn't possible with HTML. You will learn in this chapter that XML introduces a whole new way of thinking about the Web and electronic information in general. It is sure to challenge and broaden your perspective about Web pages.
Getting to Know XML
Because Chapter 1 dealt with XML at a conceptual level, I want to dive in with the XML language and show you how it works. If you recall, XML is a generic language used to describe other markup languages such as HTML. Knowing this, you'll find XML to be extremely general and open-ended. It's not until you begin working with specific XML vocabularies that the true power of XML comes into full view. So as you learn about XML, try to think about how it might affect the HTML markup you've used so far.
The first thing to understand about XML is that it makes a clear distinction between markup and content.
In the following example of HTML code, can you guess which parts are markup and which are content?
<p>Let's sing a lament, the world isn't round it's <i>twisted and bent</i>.</p>
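In this HTML code, the <p> and <i> tags are markup, and the text of the sentence is content. XML draws the same line between markup and content, as the following code shows: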
<question answer="true">The world's termites outweigh the world's humans ten to one. True or False?</question>
In this code, a hypothetical question-and-answer XML vocabulary is used to mark up content for a True/False question. Notice that the question is marked up using the <question> and </question> tags.
Although I've talked in terms of tags when describing the creation of XML documents, the actual structure is determined by elements.
For example, in the question markup you just saw, there is an element named question.
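To make the idea of an element a little more concrete, here is a minimal sketch that parses the question markup and pulls out the element's name, its attribute, and its content. Python and its standard xml.etree.ElementTree module are my own choice for illustration; nothing in this chapter depends on them.

```python
import xml.etree.ElementTree as ET

# The question markup from the text, stored as a Python string.
QUESTION_XML = (
    '<question answer="true">The world\'s termites outweigh the '
    "world's humans ten to one. True or False?</question>"
)

# Parsing turns the start tag, the content, and the end tag into one element.
question = ET.fromstring(QUESTION_XML)

element_name = question.tag          # the name of the element
answer = question.attrib["answer"]   # the value of the answer attribute
content = question.text              # the character data between the tags

print(element_name, answer)
print(content)
```

The parser hands back a single question element; the tags themselves are just the text that marked where the element starts and ends.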
An element can have both start and end tags, as in the question element, or it can be empty.
XML elements are capable of containing content, child elements, or both. Content in XML is often referred to as character data, to indicate that it consists of characters of text. When an element contains child elements, it means that they're nested within the element. Although this may sound tricky, you are already experienced with nested elements. In fact, you just saw an example of HTML code that included both nested child elements and character data. Here is the sentence I just showed you:
<p>Let's sing a lament, the world isn't round it's <i>twisted and bent</i>.</p>
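In this code, the <p> element contains both character data (the words of the sentence) and a nested child element, <i>, which in turn contains character data of its own.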
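If you'd like to see nesting from a program's point of view, the following sketch takes the same sentence apart. It assumes Python's standard xml.etree.ElementTree module, which is my choice for illustration rather than anything this chapter requires. The sentence happens to be well-formed, so an XML parser can read it.

```python
import xml.etree.ElementTree as ET

SENTENCE = (
    "<p>Let's sing a lament, the world isn't round it's "
    "<i>twisted and bent</i>.</p>"
)

paragraph = ET.fromstring(SENTENCE)

# Character data that appears before the first child element.
leading_text = paragraph.text

# Child elements are the elements nested directly inside <p>.
children = [child.tag for child in paragraph]

# The nested <i> element has character data of its own...
italic = paragraph[0]
italic_text = italic.text
# ...and ElementTree stores the text that follows it (the period) as its "tail".
trailing_text = italic.tail

print(children, repr(italic_text), repr(trailing_text))
```

The "tail" idea is specific to ElementTree; other parsers represent the trailing period as a separate text node.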
The XML language consists of several components that describe the makeup of different parts of a document. Here are the major XML components:
Elements and tags
Entity references
Comments
Processing instructions
Document type declarations
Don't worry if these components sound technical because you're about to see that they're actually easy to understand. Their significance is that they describe the fundamental structure of the XML language, dictating the makeup of all XML documents. With a solid understanding of these components, you'll be able to read and understand the overall structure of any XML document, not to mention gain a new perspective on HTML. Because I'm the kind of guy who loves to get in over my head when I learn new things, I'm going to use a similar approach here by explaining the XML components in the context of a real XML document. Don't worry if the document doesn't make sense immediately, because it will soon enough. The document I'm talking about uses a special XML vocabulary to mark up an audio collection. You could use a document like this to catalog all of your CDs and tapes. Here is the audio collection XML document:
<?xml version="1.0"?>
Even though I haven't formally introduced you to the details of the XML language, you can probably study this document for a few moments and make out most of its meaning. This is because XML tags tend to be pretty descriptive. By the way, Internet Explorer allows you to view XML documents and interact with them to some degree.
XML documents don't necessarily include any information about how they are to be displayed, and Internet Explorer doesn't try to interpret the meaning of the audio collection document. Instead, it focuses on highlighting the different structural parts. If you look carefully, you'll see a hyphen (-) next to elements that contain child elements.
Figure 16-1 Closing elements in an XML document can help you to see the higher level of the document's structure.
Now that you've seen the audio collection document from different angles, let's use it to learn about the primary XML components.
Understanding Elements and Tags
As you may have guessed, tags form the basis of all XML documents and are used to mark up elements. This is evident in the example of the audio collection by the artist element. It's marked up using the <artist> and </artist> tags. The distinction between elements and tags is admittedly subtle; think of elements as logical pieces of markup, and tags as specific text strings used to represent elements in XML documents.
Earlier in the chapter, I mentioned that elements can have both start and end tags, in which case they can contain character data, child elements, or both. Or they can be empty. Empty elements must be closed with a forward slash (/) just before the closing angle bracket, as in <br/>.
Note that the forward slash used in empty tags is a carryover of the forward slash used with end tags. For example, consider that all end tags begin with a forward slash:
<br></br>
In this example, a line break is coded as a pair of tags, instead of the empty <br/> tag.
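As a quick check that the two ways of coding an empty element really mean the same thing, here is a small sketch using Python's standard xml.etree.ElementTree module (my choice for illustration). The is_empty helper is a name I made up for this example.

```python
import xml.etree.ElementTree as ET

# The same empty element, coded as a single empty tag and as a pair of tags.
short_form = ET.fromstring("<br/>")
long_form = ET.fromstring("<br></br>")


def is_empty(element):
    """Return True when the element has no character data and no children."""
    return not element.text and len(element) == 0


both_empty = is_empty(short_form) and is_empty(long_form)

# Writing the elements back out produces identical markup for both forms.
same_output = ET.tostring(short_form) == ET.tostring(long_form)

print(both_empty, same_output)
```

To the parser the two forms are indistinguishable, so the choice between them is a matter of style and compactness.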
Referencing Entities
Because of the rigid structure of XML, there are some pieces of information that must be specially encoded in order to include them as content. For example, the apostrophe character (') can serve a special purpose in XML, where it may delimit attribute values, so XML provides a way to code it specially when you want it treated as part of document content. To understand, consider the following XML content:
Last summer we visited Pike's Peak.
Because an apostrophe can act as markup, you can use a special technique called an entity reference to identify the apostrophe in Pike's Peak as content. Strictly speaking, this is required only inside an attribute value that is itself delimited by apostrophes, but it's always allowed.
An entity reference is basically a unique name that identifies a piece of XML data. You use an entity reference by enclosing the reference between an ampersand (&) and a semicolon (;).
Last summer we visited Pike&apos;s Peak.
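I realize this code isn't easy to read, but the &apos; entity reference identifies the apostrophe as content.
Entity references are usually unique to the specific document in which they appear. However, there are several built-in entity references. Table 16-1 shows the built-in entity references that are available for use in all XML documents:
Table 16-1 Built-In Entity References in XML
Character            Entity Reference
Less than (<)        &lt;
Greater than (>)     &gt;
Ampersand (&)        &amp;
Apostrophe (')       &apos;
Quotation mark (")   &quot;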
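Here is a rough sketch of entity references at work in both directions. It assumes Python's standard library (xml.etree.ElementTree for parsing and xml.sax.saxutils.escape for encoding), and the note element is a made-up wrapper I added only so the parser has an element to read.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Reading: the parser replaces the entity reference with the character it names.
parsed = ET.fromstring("<note>Last summer we visited Pike&apos;s Peak.</note>")
decoded = parsed.text

# Writing: escape() always encodes &, <, and >. Other characters, such as the
# apostrophe and the quotation mark, are encoded only if you ask for them.
encoded = escape("Pike's Peak & <friends>", {"'": "&apos;", '"': "&quot;"})

print(decoded)
print(encoded)
```

In practice you rarely type entity references by hand; a function like escape() adds them for you whenever you generate XML from ordinary text.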
Using Comments
Like HTML, XML allows you to create comments that aren't interpreted as document markup or content. Comments are useful for adding notes that explain a certain part of a document, or maybe to mention an aspect of the document that you intend to improve upon later. Document markup and content are designed for interpretation by your computer; they're either processed or displayed, but comments are there solely for your benefit as the XML author. Put another way, comments are ignored when an XML document is processed and/or displayed.
You can place comments in a document anywhere content appears, which makes it possible to add comments throughout a document if you so desire. Comments are unique because they are enclosed by special symbols. More specifically, you start a comment with <!-- and end it with -->.
<!-- Copyright (c) 2003 Tailspin Toys -->
This code shows how you could add a copyright notice to your XML documents with a comment. Comments are used a few times in the audio collection document, as the following code reveals:
<!-- This is the Rock section of the collection. -->
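To see that a comment really is ignored during processing, consider this small sketch. It assumes Python's standard xml.etree.ElementTree module, and the empty audio element is a simplified stand-in for a real entry in the collection.

```python
import xml.etree.ElementTree as ET

COLLECTION = """<audiocollection>
  <!-- This is the Rock section of the collection. -->
  <audio/>
</audiocollection>"""

root = ET.fromstring(COLLECTION)

# The default parser skips comments, so only the real element shows up.
child_names = [child.tag for child in root]

print(child_names)
```

The comment is still in the file for a human reader, but the program processing the document never sees it.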
Using Processing Instructions
Contrary to what I've led you to believe, XML documents don't consist of markup, content, and comments only. In fact, a couple of other pieces of information commonly show up in XML documents. One is processing instructions, special commands passed along to the programs that process or view the XML document. Processing instructions are easily distinguished from other XML components because they always start with <? and end with ?>. For example, here is a processing instruction that you will see in virtually every XML document:
<?xml version="1.0"?>
Notice that in this code the processing instruction begins with <? and ends with ?>.
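The sketch below shows how a program might pick processing instructions out of a document. It assumes Python's standard xml.dom.minidom module, and the second instruction (xml-stylesheet, pointing at a made-up audio.css file) is something I added for illustration. The parser typically treats the xml line as the document's own declaration and handles it itself, so it's the second instruction that gets reported.

```python
from xml.dom import Node, minidom

DOCUMENT = (
    '<?xml version="1.0"?>'
    '<?xml-stylesheet type="text/css" href="audio.css"?>'
    "<audiocollection/>"
)

document = minidom.parseString(DOCUMENT)

# Collect the name (target) and the remaining text (data) of each instruction
# that appears at the top level of the document.
instructions = [
    (node.target, node.data)
    for node in document.childNodes
    if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
]

print(instructions)
```

Each instruction is simply handed to the program as a name plus a string; it's entirely up to the program to decide what, if anything, to do with it.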
Declaring the Document Type
The last XML component you need to understand is document type declarations, which are extremely important because they describe the structure of an XML document.
The document type declaration takes care of the following three primary tasks:
This document type declaration stuff is confusing, so let me clarify its purpose. One of the principal features of XML is the ability to validate documents based on whether they adhere to the strict XML rules. In addition to making sure that a document follows the fundamental language rules of XML, it's also important to see that it adheres to the specific language rules of the markup language it is based on. For example, it's possible for a Web page to completely adhere to XML language rules yet completely violate HTML rules. A simple example of how this is possible is if you use the
The point is that XML documents have two different levels of correctness. The first is determined by whether a document meets the strict language requirements of XML. If it does follow these rules, it's considered a well-formed document. The second level of correctness is determined by whether a document adheres to a DTD for a particular markup language such as HTML. If the document passes this test as well, then it's referred to as a valid document. It's considered an accolade of the highest order for an XML document to be valid. Ideally, all XML documents, and ultimately all Web pages, would be valid documents. It goes without saying that a valid document is also a well-formed document, but the reverse is not always true.
Let's now circle back to the document type declaration for a document, whose main purpose is to identify the root element of the document as well as the DTD, which is usually contained in an external file. The DTD is essential for creating valid documents, and the audio collection document includes its document type declaration on a single line of code:
<!DOCTYPE audiocollection SYSTEM "AudioCollection.dtd">
In this code, the root element of the document is identified as audiocollection, and the DTD is identified as the external file AudioCollection.dtd.
In case it isn't obvious, the root element of a document is the element that contains all other elements. In HTML documents, the root element is html.
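The first level of correctness, being well-formed, is easy to test with almost any XML parser. The following sketch assumes Python's standard library, which checks only whether a document is well-formed and doesn't validate it against a DTD; the is_well_formed helper and the two sample documents are my own inventions. The sketch also reads back the two pieces of information that the document type declaration supplies.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom


def is_well_formed(xml_text):
    """Return True if the text obeys the basic language rules of XML."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


GOOD = (
    '<!DOCTYPE audiocollection SYSTEM "AudioCollection.dtd">'
    "<audiocollection><audio/></audiocollection>"
)
BAD = "<audiocollection><audio></audiocollection>"  # audio is never closed

# The document type declaration names the root element and the DTD file.
doctype = minidom.parseString(GOOD).doctype
root_name = doctype.name
dtd_file = doctype.systemId

print(is_well_formed(GOOD), is_well_formed(BAD))
print(root_name, dtd_file)
```

Checking the second level, validity, takes a validating parser that actually reads AudioCollection.dtd and compares the document against it.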
Modeling XML Data
By now you're probably thoroughly confused by document type declarations and how they are used to describe XML documents. This section will help clarify the role of both document type declarations and DTDs so you can fully understand why they are an important part of XML.
In case it's not abundantly clear yet, XML is all about structuring information. Almost every facet of XML is directly aimed at accomplishing this so that people can better understand information. To structure information, it's necessary to establish a model for the data. An XML document model serves as a template that determines what kind of information can appear in the document, as well as how it's structured. XML document models are also sometimes referred to as schemas, and are used to describe a class of data. For example, the information contained within the audio collection document you saw earlier in the chapter could be considered a class of data: audio data.
Once you've established a class of data by using a schema, you can create highly structured documents that can be tested for validity. The benefit of having valid documents is that they can be accurately processed with automated programs such as search engines. An XML schema describes the arrangement of markup and content within a valid XML document; the document must strictly adhere to a schema to be considered valid. Knowing this, you can think of a schema as an agreement between an XML document (perhaps a Web page) and the XML vocabulary (perhaps HTML) in which it's written.
Consider a simplified, real-world analogy. If you meet someone and he gives you his phone number, you expect the number to be in a certain format. If he gives you an 8-digit number, you immediately know something is wrong. Domestic phone numbers adhere to a 10-digit format. This format is the schema that you use to determine that the 8-digit number is invalid.
Although this example is simplified, it nonetheless shows how we employ schemas in many areas of our lives other than Web development.
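The phone number analogy translates almost directly into code. Here is a minimal sketch in Python (my choice of language) in which a one-line pattern plays the role of the schema; the is_valid_phone helper and the sample numbers are made up for illustration.

```python
import re

# The "schema" for a domestic phone number: exactly ten digits.
PHONE_SCHEMA = re.compile(r"\d{10}")


def is_valid_phone(number):
    """Strip punctuation, then test what's left against the schema."""
    digits = re.sub(r"\D", "", number)
    return PHONE_SCHEMA.fullmatch(digits) is not None


print(is_valid_phone("615-555-0142"))  # ten digits
print(is_valid_phone("5550-1429"))     # only eight digits
```

An XML schema does the same job on a larger scale: it states the expected format once, so that any program can test a document against it.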
The specific role of a schema is to describe an XML vocabulary, naming every tag and attribute, as well as their relationships with each other. Of course, a document without a schema can use any custom tag or attribute. This is fine, but it precludes the document from being considered valid. And as you now know, validity is the ultimate goal of all XML documents. If documents without schemas can use any tags or attributes, then it's fair to say that schemas impose constraints on how documents of a certain type can be structured. More specifically, schemas constrain the structure of documents in two ways:
They establish which elements and attributes can be used in a document and how they relate to one another
They establish the data types of the data in a document
The first function is the most important because it determines which elements can be used in a document and how they relate to one another. DTDs rely primarily on this approach for describing document structure, and are weak in establishing data types for document data. DTDs represent the standard approach for describing XML document structure, but are at risk of being replaced by an alternative called XML Schema. XML Schema is a newer approach that Microsoft promotes that includes rich support for describing document data. DTDs do a great job detailing which tags and attributes can be used in a document, as well as how they can be nested. But XML Schema goes a step farther: you can nail down the data types of document content. The next two sections introduce you to both DTDs and XML Schema and show you how to use each to establish the structure of the audio collection document.
Working with DTDs
DTDs serve as the standard schema approach for describing the structure of XML documents. Although DTDs represent the original schema approach for XML, they aren't without flaws. One complaint is that they use a specialized language for describing the structure of XML vocabularies. Although this language is simple, it's cryptic and seemingly unnecessary when you consider that XML could be used to describe document structure. The only upside to the special language used in DTDs is that it's compact, making most DTDs relatively small. The DTD language describes the structure of documents using individual characters such as question marks, asterisks, and plus signs; hence, its cryptic look.
Even so, DTDs are easy to follow once you understand what the different characters mean, and they benefit from being concise. For proof, take a look at the following DTD, which describes the structure of the audio collection XML document:
<!ELEMENT audiocollection (audio)+>
As you can see, there isn't a lot to this DTD. The main part to understand is the relationship between the elements. Notice that the root audiocollection element is declared to contain (audio)+, which means one or more audio elements.
You'll also see that the
Back to the
You may notice that the remaining elements in the DTD contain
There are certainly more subtleties to DTD design than I've mentioned in this brief explanation of the audio collection DTD, but I think I've given you an idea of how a DTD lays the ground rules for XML vocabularies. More important, DTDs provide the guidelines to which XML documents can be compared to determine validity.
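To give you a feel for what comparing a document to a DTD involves, the following sketch checks one rule by hand: the (audio)+ declaration for the audiocollection element. This is only an illustration under my own assumptions. It uses Python's standard xml.etree.ElementTree module, which doesn't read DTDs, so the rule is written directly in code; it ignores the other rules in the DTD, and the follows_root_rule helper and the video element are made up.

```python
import xml.etree.ElementTree as ET


def follows_root_rule(xml_text):
    """Check the single rule <!ELEMENT audiocollection (audio)+> by hand."""
    root = ET.fromstring(xml_text)
    if root.tag != "audiocollection":
        return False
    children = [child.tag for child in root]
    # (audio)+ means one or more audio elements and no other elements.
    return len(children) >= 1 and all(name == "audio" for name in children)


print(follows_root_rule("<audiocollection><audio/><audio/></audiocollection>"))
print(follows_root_rule("<audiocollection/>"))
print(follows_root_rule("<audiocollection><video/></audiocollection>"))
```

A validating parser performs this kind of check automatically for every declaration in the DTD, which is why you only have to write the rules once.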
Working with XML Schema
I mentioned previously that Microsoft offers a more powerful alternative to DTDs. This approach uses XML as the language describing document structure and also allows you to use specific data types. The Microsoft alternative to DTDs is known as XML Schema, and it's quite interesting because it uses a custom XML vocabulary to describe XML documents. This might seem strange at first, but all it means is that you create an XML Schema as an XML document using tags and attributes, much like you create an HTML document. The only thing necessary for you to do differently is learn how to use the XML Schema tags and attributes.
Rather than spending time sifting through the details of the XML Schema vocabulary, let's look at an example. Here's the code for a schema developed using XML Schema that describes the structure of the audio collection document:
<?xml version="1.0"?>
Although it isn't nearly as compact (or as cryptic), this schema is roughly the equivalent of the DTD for the audio collection document that you saw in the previous section. The schema is a little easier to understand than the DTD because it uses XML tags and attributes. For example, the
In addition, the content of each element is specified clearly in the
Knowing how XML Schema improves on its DTD equivalent, you might think that all XML vocabularies use XML Schema to describe their structure. However, this isn't the case currently. The reality is that DTDs existed before XML and are widely used in large information management programs. This means that they aren't going anywhere in a hurry. The other compelling reason not to throw away DTDs is that they work. XML Schema might work better, but such changes take time to catch on. For now, you may find yourself creating both a DTD and a schema for any XML vocabularies that you dream up.
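The extra thing a schema with data types buys you is the ability to say what kind of data an element holds, not just where the element can appear. The sketch below imitates that idea by hand in Python (my choice of language). Be aware that the year element and the year_is_integer helper are purely hypothetical; I'm not claiming that the audio collection vocabulary includes a year element.

```python
import re
import xml.etree.ElementTree as ET


def year_is_integer(audio_xml):
    """Check that a (hypothetical) year child element holds a whole number."""
    audio = ET.fromstring(audio_xml)
    year = audio.findtext("year", default="")
    return re.fullmatch(r"[0-9]+", year.strip()) is not None


print(year_is_integer("<audio><year>1994</year></audio>"))
print(year_is_integer("<audio><year>ninety-four</year></audio>"))
```

A DTD can say only that the element contains text, whereas a typed schema can reject the second document because its content isn't a number.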
The Practical Side of XML
So far, this chapter has spent a great deal of time delving into the theory of XML and how it works as an organizational technology for information. What I haven't spent much time doing is assessing the role of XML in your life as a Web page creator. Sure, your new XML knowledge can be put to great use impressing your geeky friends the next time document structure becomes the hot topic at a lunch meeting. But how exactly does XML benefit you in terms of your Web pages?
The relevance of XML to Web pages is twofold. First and foremost, the Web is evolving toward a more structured information repository, and XML's part in that evolution is a topic you may want to explore at another time. For now, just take my word that XML is dictating the future of HTML. The second aspect of XML that relates heavily to Web pages is the use of specialized XML vocabularies to mark up special types of information. Custom XML vocabularies will allow you to create documents containing information that can be tied into your Web pages.
The discussion of DTDs and XML Schema in this chapter assumed to some degree that you're interested in creating your own XML vocabulary. For example, the audio collection document uses a custom XML vocabulary with its own set of tags and attributes. Although this is a powerful use of XML, and quite liberating, you probably won't be creating your own XML vocabularies often, if ever. But you will use vocabularies that others created. A wide variety of XML vocabularies exist for marking up all kinds of interesting data in XML documents. Here are some examples, along with the types of data they model:
As you can see, XML vocabularies vary greatly in the kinds of data they model. If you include any of these kinds of data in your Web pages, consider using an XML vocabulary to mark up the data.
Key Points