HTML to XHTML Conversion with SGMLReader
By Peter A. Bromberg, Ph.D.

This is a web-based implementation of HTML to well-formed XHTML conversion, built on the excellent SGMLReader by Chris Lovett of Microsoft. Chris's code has a command-line interface; however, I needed an in-memory implementation for some work we're experimenting with that takes well-formed XHTML and converts it to RTF for display in a RichTextBox control. There are many other uses for XHTML-compliant HTML, not the least of which is that an XHTML page is a legitimate, well-formed XML document, which opens up a whole new range of possibilities for HTML processing.

To make this work as a class library for use on the web or in memory in an application, I needed to write a small helper class. I also needed to change the way Lovett's SgmlReader class reports errors, exposing them as a string property: the existing code was designed to write errors to an optional log file with a TextWriter, whereas I needed to return the concatenated error string to the web page for display. My helper class code appears below:

using System;
using Sgml;
using System.IO;
using System.Xml;
using System.Text;
using System.Web;

namespace SgmlReaderDll
{
    /// <summary>
    /// Helper class to allow string processing using SGMLReader/Parser
    /// </summary>
    public class SGMLReaderHelper
    {
        private string _errors;

        // Concatenated error text from the most recent ProcessString call.
        public string Errors
        {
            get { return _errors; }
            set { _errors = value; }
        }

        public SGMLReaderHelper()
        {
        }

        public string ProcessString(string strInputHtml)
        {
            // Parse the input as HTML, reading from an in-memory string.
            SgmlReader reader = new SgmlReader();
            reader.DocType = "HTML";
            StringReader sr = new System.IO.StringReader(strInputHtml);
            reader.InputStream = sr;

            // Write the parsed nodes back out as XML into a string buffer.
            StringWriter sw = new StringWriter();
            XmlTextWriter w = new XmlTextWriter(sw);
            reader.Read();
            while (!reader.EOF)
            {
                w.WriteNode(reader, true);
            }
            w.Flush();
            w.Close();

            // ErrorLog is the string property added to the modified SgmlReader.
            this.Errors = reader.ErrorLog;
            return sw.ToString();
        }
    }
}
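The error-handling change described earlier comes down to collecting messages in memory rather than in a log file. Here is a minimal sketch of that pattern using only framework types. Note that DemoParser and its Log property are hypothetical stand-ins invented for illustration; this is not SgmlReader's code, nor the exact modification made in the download.

```csharp
using System;
using System.IO;

// Hypothetical stand-in for a parser that reports problems through a TextWriter.
class DemoParser
{
    private TextWriter _log;

    public TextWriter Log
    {
        get { return _log; }
        set { _log = value; }
    }

    public void Parse(string input)
    {
        // Crude illustrative check: the input contains '=' but never '="'.
        bool hasUnquoted = input.IndexOf('=') >= 0 && input.IndexOf("=\"") < 0;
        if (_log != null && hasUnquoted)
        {
            _log.WriteLine("Warning: attribute value is not quoted");
        }
    }
}

class Program
{
    static void Main()
    {
        // A StringWriter is a TextWriter, so it can replace a file-based log.
        StringWriter errors = new StringWriter();
        DemoParser parser = new DemoParser();
        parser.Log = errors;
        parser.Parse("<a href=page.html>link</a>");

        // The accumulated messages are now an ordinary string for display.
        string errorText = errors.ToString();
        Console.Write(errorText);
    }
}
```

Because the messages end up in a plain string, a web page can show them in a label or text area without touching the file system.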

There are a lot of interesting uses for this type of utility. One that I use again and again is to take an HTML web page that is not XHTML compliant, run it through this utility, and get back a well-formed XML document in which unquoted attribute values are quoted, HTML tags that need closing are self-closed, and script blocks are automatically wrapped in CDATA sections. The result can be saved with an XSL extension, and you are on your way to creating the XSL stylesheet for an XML transformation that produces dynamic web pages!
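To illustrate the stylesheet idea, here is a rough sketch of well-formed XHTML markup used as the body of an XSL template and filled in from an XML data document. It assumes .NET 2.0 or later for XslCompiledTransform. The XhtmlTemplateDemo class, the stylesheet text, and the page/title data format are all invented for this example; the markup is hand-written rather than actual converter output.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

class XhtmlTemplateDemo
{
    // Hand-written XHTML wrapped in a template; xsl:value-of marks the dynamic part.
    public const string Stylesheet =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">" +
        "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>" +
        "<xsl:template match=\"/\">" +
        "<html><body><h1><xsl:value-of select=\"/page/title\"/></h1></body></html>" +
        "</xsl:template>" +
        "</xsl:stylesheet>";

    // Applies the stylesheet to the data document and returns the result as a string.
    public static string Transform(string stylesheet, string dataXml)
    {
        XslCompiledTransform xslt = new XslCompiledTransform();
        using (XmlReader xslReader = XmlReader.Create(new StringReader(stylesheet)))
        {
            xslt.Load(xslReader);
        }

        StringWriter output = new StringWriter();
        using (XmlReader dataReader = XmlReader.Create(new StringReader(dataXml)))
        {
            xslt.Transform(dataReader, null, output);
        }
        return output.ToString();
    }

    static void Main()
    {
        // Hypothetical data document supplying the page title.
        string data = "<page><title>Hello</title></page>";
        Console.WriteLine(Transform(Stylesheet, data));
    }
}
```

The stylesheet only loads because the embedded markup is well-formed XML. Ordinary HTML with unclosed tags or unquoted attributes would make the Load call throw, which is why the conversion step comes first.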

And now for the fun part. Click the link below to go to the ASP.NET web page, where you can paste your HTML and receive back XHTML, along with a report of any errors from Chris's creation:

Try the HTML to XHTML web page

As always, the full solution may be downloaded from the link below. Thanks to Chris Lovett for some really useful code.

Download the code that accompanies this article


 


Peter Bromberg is a C# MVP, MCP, and .NET consultant who has worked in the banking and financial industry for 20 years. He has architected and developed web-based corporate distributed application solutions since 1995, and focuses exclusively on the .NET Platform.