I regard myself as a particularly fortunate
"XML Dude": About a year ago, I determined that, regardless
of the amount of time I had in the day, and regardless of the fact that
the company I worked for at the time had virtually no vision as to what
XML could do to help solve their problems, I was going to spend some time
-- for ME -- each evening, studying this new technology and learning how
to use it. I don't know about you, but I come home each evening tired
from being paid to think all day long. Very tired, I might add. But, I
kept my promise to myself and made the time to study XML and XSLT. I remember
when I bought my first book on XML, a Wrox title that I still refer to
today (although I now possess many more such books). My first reaction
was "Oh, crap! This is not going to be fun at all."
To their credit, this same company which
had little XML vision saw fit to pay to send me to both Tech-Ed and the
Microsoft Professional Developer's Conference, both of which were conveniently
held in sunny Orlando (where I live), last year. All I can say is, when
I arrived home each evening from the conferences, I was so excited I couldn't
sleep. I saw the vision! I felt the excitement of the Microsoft gurus,
who were genuinely enthusiastic about what they were doing and what the
possibilities were. And it was all about XML and DOT NET.
Now, I am even more fortunate, because
my vision of learning XML (along with DOT NET and related technologies)
has begun to pay off. I now work for a company whose XML vision is well-formed
(to make a pun) and the project I work on is almost 100% XML / XSLT based.
Every ASP page in our application serves only as the "glue"
to receive and process querystring or other parameters from an XML-based
menu choice, and to tell it which Javascript and XSL include files to load.
All the loading and transformations are dynamic, all plumbing is handled
by global Javascript functions, and every single browser page in the application
is the result of a dynamically generated XSL transform. Client-side XML
data islands hold important dynamically updated information that is accessed
through global Javascript functions. All data access is handled
through XMLHTTP XML request / response documents via COM+ middleware
components that we have authored, and it's all sent, received back and processed
-- from the CLIENT SIDE -- over the wire, to and from the databases.
And it's all 100% financial industry standards compliant. I
feel proud to have been a member of the architectural team that went through
a lot of pain to flesh out all this stuff and make the "proof of
concept" become a reality that will provide real value to the customer
(and make the company that I work for a ton of money). One of our biggest
concerns continues to be performance tuning (see my "Performance
Tuning Checklist" article for more on this).
Recently a friend I used to work with
started using XML, and he sent some emails asking for comments and advice.
Some of the assumptions he made in doing his "Beginning XML"
exercise made me think: "Well, if he is making these mistakes (the
same ones that I made, a lot of them), then I wonder how many other developers
struggling with XML / XSLT are also doing this?" So I decided to
write this article to try and summarize some of the most important things
I've learned. If this helps you because you're just starting out with
XML, that's great. And if you're already well under way as an XML Developer
and some of the things that I touch on here make you think - well, that's
even better. I am by no means an XML "guru". But you know what?
I intend to become one. When I read in magazines like "Smart
Partner" that XML gurus are currently being billed out at $300 an
hour, I feel gratified that my decision almost a year ago was the right
one for me. This technology is not going to go away, folks. It's big time.
Study XML! It's the best job security you can get since winning the lottery
became popular.
XML PERFORMANCE VARIABLES
In working with XML data and documents,
there are four major variables that can affect the performance of MSXML:
- The kind of XML data
- The ratio of tags to text
- The ratio of attributes to elements
- The amount of discarded white space
There are also four key performance "metrics"
involved on the Win32 platform:
The fastest way to load an XML document is to use the
default "rental" threading model (which means the DOM document can be
used by only one thread at a time) with validateOnParse, resolveExternals,
and preserveWhiteSpace all disabled, like this in Javascript:
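// "MSXML2.DOMDocument" creates the default rental-threaded document.
var doc = new ActiveXObject("MSXML2.DOMDocument");
doc.async = false;               // wait for load() to finish before continuing
doc.validateOnParse = false;     // skip DTD / schema validation
doc.resolveExternals = false;    // do not fetch external DTDs or entities
doc.preserveWhiteSpace = false;  // discard insignificant white space
doc.load("mystuff.xml");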
If you have an element-heavy XML document that contains
a lot of white space between elements and is stored in Unicode, it can actually
be smaller in memory than on disk. Files that have a more balanced ratio
of elements to text content end up at about 1.25 to 1.5 times the UCS-2 disk
file size when in memory. Files that are very data-dense, such as an
attribute-heavy, XML-persisted ADO recordset, can end up more than twice the
disk-file size when loaded into memory.
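To put numbers on the "balanced" case, here is a small arithmetic sketch. It only applies the 1.25 to 1.5 multipliers quoted above; the function name estimateBalancedFootprint is made up for illustration, and real results depend on your document.

```javascript
// Rough in-memory size range for a document with a balanced ratio of
// elements to text, using the 1.25x to 1.5x multipliers quoted above.
// The input is the size in bytes of the file as stored in UCS-2 on disk.
function estimateBalancedFootprint(ucs2DiskBytes) {
  return {
    low: ucs2DiskBytes * 1.25,
    high: ucs2DiskBytes * 1.5
  };
}

// Example: a 100 KB (102,400 byte) UCS-2 file.
var estimate = estimateBalancedFootprint(100 * 1024);
```

For that 100 KB file, the estimate works out to somewhere between 128,000 and 153,600 bytes in memory.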
Attributes vs. Elements
You could conclude that attribute-heavy formats (such
as an XML-persisted ADO recordset) deliver more data per second than
element-heavy formats. But that alone is not a reason to switch
everything to attributes. There are many other factors to consider
in the decision to use attributes versus elements.
Unique elements
My friend, in his honest but less than informed effort
to create a useful XML document, made the mistake of attempting to use
the XML elements as if they were unique "database fields". For
example if you have an XML Document that consists of survey questions,
you could conceive that in order to make each element "unique"
you would give the tag a unique name. So your survey questions document
might end up looking kind of like this:
<POLLQUESTIONS>
<Q10000002>Who is central Florida's best Internet Service Provider?</Q10000002>
<A10000009>MPINet</A10000009>
<A10000010>EarthLink</A10000010>
<A10000011>MindSpring</A10000011>
<A10000012>Access Orlando</A10000012>
<Q10000003>What is your favorite search engine?</Q10000003>
<A10000018>Yahoo</A10000018>
<A10000019>Altavista</A10000019>
<A10000020>Lycos</A10000020>
</POLLQUESTIONS>
Now what is wrong with the above document fragment? Two
things, actually. First, it does not lend itself easily to XPath statements
that allow you to walk the DOM and find isolated nodes and / or subnodes
of elements. True, each element has a "unique" tag name, but
that's not the point. XML is a treelike hierarchical structure. If you
need to be able to find an element or a node by number, or to sort, search
or group, it's better to use either an attribute (<Question QNum="1">)
to identify the unique "ID" of elements or nodes, or to include
a child element (<QNum>1</QNum>) inside each Question tag.
You can also use the position() function. The second thing that's
"wrong" is that the answer tags simply follow closed question
tags here -- there is no enclosing "Question" element that encompasses
both the question and its answers. A more productive version
of the above might look like this:
<POLLQUESTIONS>
  <Question QNum="1">
    <QText>Who is central Florida's best Internet Service Provider?</QText>
    <Answer>MPINet</Answer>
    <Answer>EarthLink</Answer>
    <Answer>MindSpring</Answer>
  </Question>
  <Question QNum="2">
    <QText>What is your favorite search engine?</QText>
    <Answer>Yahoo</Answer>
    <Answer>Altavista</Answer>
    <Answer>Lycos</Answer>
  </Question>
</POLLQUESTIONS>
Separate Memory Structure for unique elements
With the second example, we can find any question by
its number using XPath like: //Question[@QNum="2"].
We can sort, search, grab a question node along with its answers, and
so on. And there is another very important but often overlooked reason
to try and arrange your XML documents so that all the major tags have
the same names: when the XML parser loads and processes your document,
it creates a separate memory structure for each unique element
name. So conceivably the first example above, if it had 1000 questions,
could occupy orders of magnitude more memory than the second example
with 1000 questions, taking a lot longer to parse, and possibly a lot
longer to search or sort as well.
Walking the DOM tree for the first time also has an impact
on the working set metric because some nodes in the tree are created "on
demand"; they are not automatically "there" after loading
the document. Creating a DOM tree from scratch results in a higher peak
working set than loading the same document from disk. Loading a document
is also roughly five times faster than creating the same document from scratch
in memory, because building it node by node requires
a lot of DOM calls, which slows things down.
Walk Fast
The fastest way to walk the tree is to avoid the childNodes
collection and any kind of array access. Instead, use firstChild
and nextSibling:
function WalkNodes(node)
{
    var child = node.firstChild;
    while (child != null)
    {
        WalkNodes(child);
        child = child.nextSibling;
    }
}
However, if you are looking for something in the tree, the fastest way to find it is to use XPath via the selectSingleNode
or selectNodes methods.
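If you want to experiment with the traversal pattern outside of MSXML, here is a minimal sketch. It uses plain Javascript objects as stand-ins for DOM nodes; makeNode and countNodes are hypothetical helper names of my own, not part of any XML library. The walk itself has the same shape as WalkNodes above, and it counts each node it visits.

```javascript
// Hypothetical stand-in for a DOM node: only the fields the walk needs.
function makeNode(name, children) {
  var node = { nodeName: name, firstChild: null, nextSibling: null };
  var kids = children || [];
  for (var i = 0; i < kids.length; i++) {
    if (i === 0) {
      node.firstChild = kids[i];          // first child hangs off the parent
    } else {
      kids[i - 1].nextSibling = kids[i];  // the rest are chained as siblings
    }
  }
  return node;
}

// Same shape as WalkNodes: no collection, no indexing, just two pointers.
function countNodes(node) {
  var count = 1; // the node itself
  var child = node.firstChild;
  while (child != null) {
    count += countNodes(child);
    child = child.nextSibling;
  }
  return count;
}

var tree = makeNode("POLLQUESTIONS", [
  makeNode("Question", [makeNode("QText"), makeNode("Answer"), makeNode("Answer")]),
  makeNode("Question", [makeNode("QText"), makeNode("Answer")])
]);
var total = countNodes(tree);
```

Here total comes out to 8: the root, two Question nodes, and five children between them.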
Free-Threaded Documents
The "free-threaded" DOM document exposes the same interface
as the "rental" threaded document. This object can be safely shared across
any thread in the same process. It can be safely stored in ASP Application
state on IIS.
Free-threaded documents are generally slower than rental
documents because of the extra thread safety work they do. You use them
when you want to share a document among multiple threads at the same time,
avoiding the need for each of those threads to load its own copy. In
some cases, this can result in a big performance gain.
For example, suppose you have a 12K XML file on your
Web server, and you have a simple ASP page that loads that file, increments
an attribute inside the file, and saves the file again. Such ASP code
is likely to be completely tied up with disk I/O. However, you could put
the file into shared-application state using a free-threaded DOM document:
<%@ LANGUAGE=JSCRIPT %>
<%
    Response.Expires = -1;
    // Reuse the shared document if an earlier request already loaded it.
    var doc = Application("Stuff");
    if (doc == null)
    {
        doc = Server.CreateObject("Msxml2.FreeThreadedDOMDocument");
        doc.async = false;
        doc.load(Server.MapPath("stuff.xml"));
        Application("Stuff") = doc;
    }
    // Serialize the read-modify-write of the counter across requests.
    Application.Lock();
    var c = parseInt(doc.documentElement.getAttribute("count"), 10) + 1;
    doc.documentElement.setAttribute("count", c);
    Application.UnLock();
%>
<%=c%>
This second approach using the free-threaded DOM document
can easily be seven times faster than loading and saving the file on every request.
IDispatch
Late-bound scripting languages such as JScript and VBScript
add a lot of overhead to each method call and property access in the DOM
interface. The script engines invoke the methods and properties indirectly
through the IDispatch interface. They first call GetIDsOfNames or
GetDispID, passing in the string name of the method or property
and getting back a DISPID. Then the engines package all the arguments into
an array and call Invoke with the DISPID.
This is slower than calling a virtual function in C++
or compiled Visual Basic. For this reason, when you are using a script-based
application environment such as ASP, you may want to consider calling
all your DOM functions from a compiled "wrapper" component that
has the generic methods you need to do what you want. With VB, you also want to avoid
late-bound DOM object invocation calls like the following:
Dim doc As Object
Set doc = CreateObject("Microsoft.XMLDOM")
This will be as slow as VBScript or JScript. To speed
this up, from the Project menu, select References and add
a reference to the latest version of the "Microsoft XML" library. Then
you can write the following early-bound code:
Dim doc As New MSXML.DOMDocument
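To make the late-bound call sequence described above a little more concrete, here is a toy model in plain Javascript. It is not how COM is implemented; makeDispatchObject, getIdOfName and invoke are names I made up to mirror the two steps (look up an ID from a string name, then invoke by ID with the arguments packed into an array). The point is only that a late-bound caller pays for the name lookup on every single call.

```javascript
// Toy model of late binding: every call is a name lookup plus an invoke.
function makeDispatchObject(methods) {
  var names = [];
  var impls = [];
  for (var key in methods) {
    if (Object.prototype.hasOwnProperty.call(methods, key)) {
      names.push(key.toLowerCase());
      impls.push(methods[key]);
    }
  }
  return {
    lookups: 0,
    // Plays the role of GetIDsOfNames: string name in, numeric ID out.
    getIdOfName: function (name) {
      this.lookups++;
      return names.indexOf(name.toLowerCase());
    },
    // Plays the role of Invoke: call by ID, arguments packed in an array.
    invoke: function (id, args) {
      if (id < 0) { throw new Error("unknown member"); }
      return impls[id].apply(null, args);
    }
  };
}

var obj = makeDispatchObject({ add: function (a, b) { return a + b; } });

// Late-bound style: the name is looked up again on each iteration.
var sum = 0;
for (var i = 0; i < 3; i++) {
  sum += obj.invoke(obj.getIdOfName("Add"), [i, 1]);
}
```

After the loop, obj.lookups is 3 for 3 calls. An early-bound caller resolves the method once at compile time and skips that step entirely.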
Use XSL to get the data you need for speed
XSL can be a big performance win over using DOM code
for generating "transformed" reports from an XML document. For example,
suppose you wanted to show all the questions and answers matching a certain
key word category element. You might use selectNodes to find all
the questions matching the category, then use another selectNodes
call to iterate through the answer elements of each of those questions.
But you could also write an XSL stylesheet:
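<!-- Assumes each Question has a CATEGORY child element (not shown in the sample above). -->
<xsl:template xmlns:xsl="http://www.w3.org/TR/WD-xsl">
  <xsl:for-each select="/POLLQUESTIONS/Question[CATEGORY='ASP']">
    <xsl:for-each select="Answer">
      <xsl:value-of/>
    </xsl:for-each>
    <hr/>
  </xsl:for-each>
</xsl:template>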
You could then create your output with a function like:
function report(doc) {
    // Load the stylesheet synchronously, then apply it to the document.
    var xsl = new ActiveXObject("Microsoft.XMLDOM");
    xsl.async = false;
    xsl.load("pollquestions.xsl");
    return doc.transformNode(xsl);
}
This XSL transformation could be from 5 to 10 times faster than iterating through
the DOM looking for your data!
The "//" Operator
The "//" operator walks the entire subtree looking for
matches. If you are lazy like me, you use it more than you should because
you don't want to look up and type in the full path. If you can, use
the full path to get your data; it will typically give you up to a 15%
performance boost. Editors such as XML Spy have a clipboard copy
function that gives you the XPath statement for any element.
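Here is a rough sketch of why the full path does less work. It uses plain Javascript objects rather than a real XPath engine, so the helper names (el, findDescendants, findByPath) and the node shape are my own assumptions, and the visit counts only illustrate the idea; they are not MSXML measurements.

```javascript
// Hypothetical node shape: { name, children }. Not an MSXML API.
function el(name, children) {
  return { name: name, children: children || [] };
}

var visited = 0; // how many nodes each search had to look at

// Like "//name": examine every node in the subtree.
function findDescendants(node, name, out) {
  visited++;
  if (node.name === name) { out.push(node); }
  for (var i = 0; i < node.children.length; i++) {
    findDescendants(node.children[i], name, out);
  }
  return out;
}

// Like "/a/b": at each step, look only at children of the current matches.
function findByPath(root, path) {
  visited++;
  if (root.name !== path[0]) { return []; }
  var current = [root];
  for (var step = 1; step < path.length; step++) {
    var next = [];
    for (var i = 0; i < current.length; i++) {
      var kids = current[i].children;
      for (var j = 0; j < kids.length; j++) {
        visited++;
        if (kids[j].name === path[step]) { next.push(kids[j]); }
      }
    }
    current = next;
  }
  return current;
}

var doc = el("POLLQUESTIONS", [
  el("Question", [el("QText"), el("Answer"), el("Answer"), el("Answer")]),
  el("Question", [el("QText"), el("Answer"), el("Answer"), el("Answer")])
]);

visited = 0;
var viaDescendant = findDescendants(doc, "Question", []);
var descendantVisits = visited;

visited = 0;
var viaPath = findByPath(doc, ["POLLQUESTIONS", "Question"]);
var pathVisits = visited;
```

Both searches find the same two Question nodes, but the descendant search looks at all 11 nodes while the path search looks at only 3. The gap grows with the depth and size of the document.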
Many of the items I've chosen to cover in this article,
and others, along with real-world test results and detail, can be found
in some excellent work by Chris Lovett of Microsoft. Also, author Kurt
Cagle has some great work on using the IXSLTemplate interface to cache
template processors, which speeds up repetitive transformations big
time. You can find articles by both authors at MSDN
online, by simply searching on their names.
Peter Bromberg is an independent consultant specializing in distributed .NET solutions
in Orlando and a co-developer of the EggheadCafe.com
developer website. He can be reached at pbromberg@yahoo.com