Efficient XML: Some Basics for the Windows Platform Developer

By Peter A. Bromberg, Ph.D.

I regard myself as a particularly fortunate "XML Dude": About a year ago, I determined that, regardless of the amount of time I had in the day, and regardless of the fact that the company I worked for at the time had virtually no vision as to what XML could do to help solve their problems, I was going to spend some time -- for ME -- each evening, studying this new technology and learning how to use it. I don't know about you, but I come home each evening tired from being paid to think all day long. Very tired, I might add. But, I kept my promise to myself and made the time to study XML and XSLT. I remember when I bought my first book on XML, a Wrox title that I still refer to today (although I now possess many more such books). My first reaction was "Oh, crap! This is not going to be fun at all."



To their credit, this same company which had little XML vision saw fit to pay to send me to both Tech-Ed and the Microsoft Professional Developer's Conference, both of which were conveniently held in sunny Orlando (where I live), last year. All I can say is, when I arrived home each evening from the conferences, I was so excited I couldn't sleep. I saw the vision! I felt the excitement of the Microsoft gurus, who were genuinely enthusiastic about what they were doing and what the possibilities were. And it was all about XML and DOT NET.

Now, I am even more fortunate, because my vision of learning XML (along with DOT NET and related technologies) has begun to pay off. I now work for a company whose XML vision is well-formed (to make a pun), and the project I work on is almost 100% XML / XSLT based. Every ASP page in our application serves only as the "glue" that receives and processes querystring or other parameters from an XML-based menu choice, telling it which Javascript and XSL include files to load. All the loading and transformations are dynamic, all plumbing is handled by global Javascript functions, and every single browser page in the application is the result of a dynamically generated XSL transform. Client-side XML data islands hold important dynamically updated information that is accessed through global Javascript functions, and 100% of all data access is handled through XMLHttp XML request / response documents via COM+ middleware components that we have authored; everything is sent, received back, and processed -- from the CLIENT SIDE -- over the wire, to and from the databases. And it's all 100% financial industry standards compliant. I feel proud to have been a member of the architectural team that went through a lot of pain to flesh out all this stuff and make the "proof of concept" become a reality that will provide real value to the customer (and make the company that I work for a ton of money). One of our biggest concerns continues to be performance tuning (see my "Performance Tuning Checklist" article for more on this).

Recently a friend I used to work with started using XML, and he sent some emails asking for comments and advice. Some of the assumptions he made in doing his "Beginning XML" exercise made me think: "Well, if he is making these mistakes (the same ones that I made, a lot of them), then I wonder how many other developers struggling with XML / XSLT are also doing this?" So I decided to write this article to try to summarize some of the most important things I've learned. If this helps you because you're just starting out with XML, that's great. And if you're already well under way as an XML developer and some of the things that I touch on here make you think - well, that's even better. I am by no means an XML "guru". But you know what? I intend to become one. When I read in magazines like "Smart Partner" that XML gurus are currently being billed out at $300 an hour, I feel gratified that my decision almost a year ago was the right one for me. This technology is not going to go away, folks. It's big time. Study XML! It's the best job security you can get since winning the lottery became popular.

XML PERFORMANCE VARIABLES

In working with XML data and documents, there are four major variables that can affect the performance of MSXML:

  • The kind of XML data
  • The ratio of tags to text
  • The ratio of attributes to elements
  • The amount of discarded white space

There are also four key performance "metrics" involved on the Win32 platform:

  • Working set: The peak amount of memory used by MSXML to process requests. Once the working set exceeds available RAM, performance usually declines sharply as the operating system starts paging memory out to disk.
  • Megabytes per second: The raw speed for a given operation, such as the document load method.
  • Requests per second: How many requests the XML parser can handle per second. An XML parser might have a high megabytes-per-second rate, but if it is expensive to set up and tear down that parser, it will still have a low throughput in requests per second. For example, if the clients hit the server at a peak rate of one request per second, and if the server can do 150 requests per second, the server can probably handle up to 150 clients.
  • Scaling: How well your server can process requests in parallel. If your server is processing 150 client requests in parallel, then it is doing a lot of multithreading. Processing 150 threads in parallel is a lot for one processor; it will spend a lot of time switching between threads. You could add more processors to the computer to share the load.
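
The requests-per-second capacity estimate above is just arithmetic; here is a minimal sketch in plain JavaScript (the numbers are the illustrative ones from the text, not a benchmark):

```javascript
// If each client issues requests at a steady peak rate, server throughput
// divided by the per-client rate bounds how many clients the server can
// keep up with.
function maxClients(serverRequestsPerSec, perClientRequestsPerSec) {
    return Math.floor(serverRequestsPerSec / perClientRequestsPerSec);
}

console.log(maxClients(150, 1)); // 150 clients at one request per second each
console.log(maxClients(150, 5)); // 30 clients hitting five times as often
```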

    The fastest way to load an XML Document

The fastest way to load an XML document is to use the default "rental" threading model (which means the DOM document can be used by only one thread at a time) with validateOnParse, resolveExternals, and preserveWhiteSpace all disabled, and with async set to false so the data is ready when load returns, like this in Javascript:

    var doc = new ActiveXObject("MSXML2.DOMDocument");
    doc.async = false;               // load synchronously
    doc.validateOnParse = false;     // skip DTD validation
    doc.resolveExternals = false;    // don't resolve external DTDs/entities
    doc.preserveWhiteSpace = false;  // discard insignificant white space
    doc.load("mystuff.xml");

If you have an element-heavy XML document that contains a lot of white space between elements and is stored in Unicode, it can actually be smaller in memory than on disk. Files that have a more balanced ratio of elements to text content end up at about 1.25 to 1.5 times the UCS-2 disk file size when in memory. Files that are very data-dense, such as an attribute-heavy XML-persisted ADO recordset, can end up at more than twice the disk-file size when loaded into memory.

        Attributes vs. Elements

You could conclude that attribute-heavy formats (such as an XML-persisted ADO recordset) deliver more data per second than element-heavy formats. But this should not be the only reason for you to switch everything to attributes. There are many other factors to consider in the decision to use attributes versus elements.
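
One reason for the density difference is easy to see without MSXML at all: the element form repeats every field name in a closing tag, while the attribute form does not. A quick plain-JavaScript illustration (the record and field names are made up):

```javascript
// The same record persisted two ways.
var attributeForm = '<row id="7" name="MPINet" city="Orlando"/>';
var elementForm = '<row><id>7</id><name>MPINet</name><city>Orlando</city></row>';

// The attribute form carries the same data in fewer bytes, so a parser
// with a fixed megabytes-per-second rate delivers more data per second.
console.log(attributeForm.length < elementForm.length); // true
```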

        Unique elements

My friend, in his honest but less-than-informed effort to create a useful XML document, made the mistake of attempting to use XML elements as if they were unique "database fields". For example, if you have an XML document that consists of survey questions, you might conceive that, in order to make each element "unique", you would give each tag a unique name. So your survey questions document might end up looking something like this:

<POLLQUESTIONS>
<Q10000002>Who is central Florida's best Internet Service Provider?</Q10000002>
<A10000009>MPINet</A10000009>
<A10000010>EarthLink</A10000010>
<A10000011>MindSpring</A10000011>
<A10000012>Access Orlando</A10000012>
<Q10000003>What is your favorite search engine?</Q10000003>
<A10000018>Yahoo</A10000018>
<A10000019>Altavista</A10000019>
<A10000020>Lycos</A10000020>
</POLLQUESTIONS>

Now what is wrong with the above document fragment? Two things, actually. First, it does not lend itself easily to XPath statements that let you walk the DOM and find isolated nodes and/or subnodes of elements. True, each element has a "unique" tag name, but that's not the point. XML is a treelike hierarchical structure. If you need to be able to find an element or a node by number, or to sort, search, or group, it's better either to use an attribute (<Question QNum="1">) to identify the unique "ID" of elements or nodes, or to include a sibling element (<QNum>1</QNum>) inside each Question tag. You can also use the position() function. The second thing that's "wrong" is that the answer tags simply follow closed Question tags here -- there is no closing Question element that encompasses both the question and its answers. A more productive version of the above might look like this:

<POLLQUESTIONS>
<Question QNum="1">
<QText>Who is central Florida's best Internet Service Provider?</QText>
<Answer>MPINet</Answer>
<Answer>EarthLink</Answer>
<Answer>MindSpring</Answer>
</Question>
<Question QNum="2">
<QText>What is your favorite search engine?</QText>
<Answer>Yahoo</Answer>
<Answer>Altavista</Answer>
<Answer>Lycos</Answer>
</Question>
</POLLQUESTIONS>

Separate Memory Structure for unique elements

With the second example, we can find any question by its number with an XPath expression like //Question[@QNum="2"]. We can sort, search, grab a question node along with its answers, and so on. And there is another very important but often-overlooked reason to arrange your XML documents so that all the major tags share the same names: when the XML parser loads and processes your document, it creates a separate memory structure for each unique element name. So conceivably the first example above, if it had 1000 questions, could occupy orders of magnitude more memory than the second example with 1000 questions, take a lot longer to parse, and possibly take a lot longer to search or sort as well.
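
You can see the unique-name explosion without MSXML at all, just by counting the distinct tag names each style produces. A rough plain-JavaScript sketch (the tiny regex-based counter is only for illustration, not a real XML parser):

```javascript
// Count distinct element names in a marked-up string. Closing tags like
// </Q1> are skipped because '/' is not a letter.
function distinctTagNames(xml) {
    var names = {};
    var matches = xml.match(/<([A-Za-z][A-Za-z0-9]*)/g) || [];
    for (var i = 0; i < matches.length; i++) {
        names[matches[i].substring(1)] = true;
    }
    return Object.keys(names).length;
}

// Style 1: every question and answer gets its own tag name.
var uniqueTags = '<POLLQUESTIONS>' +
    '<Q1>.</Q1><A1>.</A1><A2>.</A2>' +
    '<Q2>.</Q2><A3>.</A3><A4>.</A4>' +
    '</POLLQUESTIONS>';

// Style 2: shared tag names, identity carried in an attribute.
var sharedTags = '<POLLQUESTIONS>' +
    '<Question QNum="1"><QText>.</QText><Answer>.</Answer></Question>' +
    '<Question QNum="2"><QText>.</QText><Answer>.</Answer></Question>' +
    '</POLLQUESTIONS>';

console.log(distinctTagNames(uniqueTags)); // 7 -- grows with every question added
console.log(distinctTagNames(sharedTags)); // 4 -- constant however many questions
```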

Walking the DOM tree for the first time also has an impact on the working set metric because some nodes in the tree are created "on demand"; they are not automatically "there" after loading the document. Creating a DOM tree from scratch results in a higher peak working set than loading the same document from disk. Loading a document is roughly five times faster than creating the same document from scratch in memory. The reason is that the process of creating a document requires a lot of DOM calls, which slows things down.

        Walk Fast

The fastest way to walk the tree is to avoid the children collection and any kind of array access. Instead, use firstChild and nextSibling:

function WalkNodes(node)
{
    // Depth-first traversal using firstChild/nextSibling only --
    // no childNodes collection and no array indexing.
    var child = node.firstChild;
    while (child != null)
    {
        WalkNodes(child);
        child = child.nextSibling;
    }
}


However, if you are looking for something in the tree, the fastest way to find it is to use XPath via the selectSingleNode or selectNodes methods.

Free-Threaded Documents

The "free-threaded" DOM document exposes the same interface as the "rental" threaded document. This object can be safely shared across any thread in the same process. It can be safely stored in ASP Application state on IIS.

Free-threaded documents are generally slower than rental documents because of the extra thread-safety work they do. You use them when you want to share a document among multiple threads at the same time, avoiding the need for each of those threads to load its own copy. In some cases, this can result in a big performance gain.

For example, suppose you have a 12K XML file on your Web server, and you have a simple ASP page that loads that file, increments an attribute inside the file, and saves the file again. Such ASP code is likely to be completely tied up with disk I/O. However, you could put the file into shared-application state using a free-threaded DOM document:

<%@ LANGUAGE=JSCRIPT %>
<%
Response.Expires = -1;
var doc = Application("Stuff");
if (doc == null)
{
    doc = Server.CreateObject("Msxml2.FreeThreadedDOMDocument");
    doc.async = false;
    doc.load(Server.MapPath("stuff.xml"));
    Application("Stuff") = doc;
}
Application.Lock();
var c = parseInt(doc.documentElement.getAttribute("count"), 10) + 1;
doc.documentElement.setAttribute("count",c);
Application.UnLock();

%>
<%=c%>

This second approach, using the free-threaded DOM document, can easily be seven times faster than the disk-bound version.
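
The load-once pattern in the ASP code above is worth internalizing. Here is the same cache-on-first-use idea reduced to plain JavaScript, with a hypothetical stand-in for the expensive document load (and no locking, which the single-threaded sketch doesn't need):

```javascript
// "loadDocument" is a stand-in for the expensive DOMDocument load;
// a module-level object replaces ASP Application state.
var cache = {};
var loadCount = 0;

function loadDocument(path) {
    loadCount = loadCount + 1;          // track how often we pay the load cost
    return { path: path, count: 0 };
}

function getSharedDocument(path) {
    if (!cache[path]) {
        cache[path] = loadDocument(path); // load only on the first request
    }
    return cache[path];                   // every later request reuses it
}

getSharedDocument("stuff.xml").count++;
getSharedDocument("stuff.xml").count++;
console.log(loadCount); // 1 -- the expensive load ran only once
```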

       IDispatch

Late-bound scripting languages such as JScript and VBScript add a lot of overhead to each method call and property access in the DOM interface. The script engines invoke methods and properties indirectly through the IDispatch interface: they first call GetIDsOfNames or GetDispID, passing in a string name for the method or property and getting back a DISPID. Then the engines package all the arguments into an array and call Invoke with the DISPID.

This is slower than calling a virtual function in C++ or compiled Visual Basic. For this reason, when using a script-based application environment such as ASP, you may want to consider calling all your DOM functions from a compiled "wrapper" component that exposes the generic methods you need. In VB, you also want to avoid late-bound DOM object invocation like the following:

    Dim doc as Object
    set doc = CreateObject("Microsoft.XMLDOM")    

This will be as slow as VBScript or JScript. To speed this up, from the Project menu, select References and add a reference to the latest version of the "Microsoft XML" library. Then you can write the following early-bound code:

    Dim doc As New MSXML.DOMDocument
    

     Use XSL to get the data you need for speed

XSL can be a big performance win over using DOM code for generating "transformed" reports from an XML document. For example, suppose you wanted to show all the questions and answers matching a certain key word category element. You might use selectNodes to find all the questions matching the category, then use another selectNodes call to iterate through the answer elements of each of those questions. But you could also write an XSL stylesheet:

<xsl:template xmlns:xsl="http://www.w3.org/TR/WD-xsl">
<xsl:for-each select="//Question[CATEGORY='ASP']">
<xsl:for-each select="Answer">
<xsl:value-of/>
</xsl:for-each><hr/></xsl:for-each>
</xsl:template>

You could then create your output with a function like:

function report(doc)
{
    var xsl = new ActiveXObject("Microsoft.XMLDOM");
    xsl.async = false;
    xsl.load("pollquestions.xsl");
    return doc.transformNode(xsl);
}

This XSL transformation can be from 5 to 10 times faster than iterating through the DOM looking for your data!

         The "//" Operator

The "//" operator walks the entire subtree looking for matches. If you are like me, you use it more than you should, simply because it saves looking up and typing the full path. If you can, use the full path to get your data; it will typically give you up to a 15% performance boost. Editors such as XML Spy have a clipboard copy function that will give you the full XPath statement for any element.

Many of the items I've chosen to cover in this article, and others, along with real-world test results and detail, can be found in some excellent work by Chris Lovett of Microsoft. Also, author Kurt Cagle has some great work on using the IXSLTemplate interface with cached template processors to speed up repetitive transformations big time. You can find articles by both authors at MSDN Online by simply searching on their names.

Peter Bromberg is an independent consultant specializing in distributed .NET solutions in Orlando and a co-developer of the NullSkull.com developer website. He can be reached at info@eggheadcafe.com