Screenscraping RealTime Stock Quotes with Regular Expressions and XML
By Peter A. Bromberg, Ph.D.

As an ex-stockbroker, I've always been fascinated with quote streamers and other web-based financial tickers. One of the most interesting recent offerings is Yahoo's real-time stock quote service. In this article, I'll show how you can combine Regular Expressions and the Matches collection with the XmlDocument class to "scrape" the important parts of the web page from Yahoo, reformat them as XML, and show the result as a moving Marquee in Internet Explorer.

We will use the WebRequest class to make our call to the Yahoo URL. Then we will scan the web page that we received, pulling the values we want out of each row of the HTML table by using the Regex MatchCollection and named capture groups embedded in our match string, like this: (?<symbol>[^<]+)
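To make the named-group idea concrete, here is a minimal sketch. The class name, the ExtractSymbol helper, and the sample anchor tag are made up for illustration and are not part of the article's project; only the (?<symbol>[^<]+) group comes from the match string used later.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // Hypothetical helper: returns the link text of the first <a> element,
    // or an empty string when nothing matches.
    public static string ExtractSymbol(string html)
    {
        // (?<symbol>[^<]+) captures one or more characters that are not '<'
        // and stores them under the group name "symbol".
        Match match = Regex.Match(html, "<a href=\"[^\"]+\">(?<symbol>[^<]+)</a>");
        return match.Success ? match.Groups["symbol"].Value : String.Empty;
    }
}
```

Calling NamedGroupDemo.ExtractSymbol with a fragment such as <a href="/q?s=MSFT">MSFT</a> reads the captured value back by name through match.Groups["symbol"], which is the same lookup the main method performs for each field.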

Finally, we will build an XmlDocument from the match results, one Quote element per match, using the XmlDocument class and its methods, and return it to the caller. The XmlDocument can then be used to populate a DataSet, drive an XSL Transform, or serve any other general purpose.
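As one example of consuming the result, the sketch below loads an XmlDocument into a DataSet. The class and method names are hypothetical. It assumes the document has the StockQuotes/Quote shape built later in this article and relies on the DataSet's normal schema inference.

```csharp
using System.Data;
using System.Xml;

public static class QuoteDataSetDemo
{
    // Hypothetical helper: copies the in-memory XML into a new DataSet.
    public static DataSet ToDataSet(XmlDocument xmlDocument)
    {
        DataSet dataSet = new DataSet();
        // XmlNodeReader lets ReadXml consume the XmlDocument directly,
        // without first serializing it to a string or a file.
        dataSet.ReadXml(new XmlNodeReader(xmlDocument));
        return dataSet;
    }
}
```

With a document of that shape, schema inference should produce a table named Quote whose columns correspond to the child elements (Symbol, Price, Change, Time), ready for binding to a DataGrid.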



First, let's take a look at the HTML (Aspx) UI portion of the page:

<%@ Page language="c#" Codebehind="WebForm1.aspx.cs" AutoEventWireup="false" Inherits="RegexYahooXml.WebForm1" %>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" >
<HTML>
  <HEAD>
    <title>Screenscraping With Regex and Xml.</title>
    <meta name="GENERATOR" Content="Microsoft Visual Studio .NET 7.1">
    <meta name="CODE_LANGUAGE" Content="C#">
    <meta name="vs_defaultClientScript" content="JavaScript">
    <meta name="vs_targetSchema" content="http://schemas.microsoft.com/intellisense/ie5">
  </HEAD>
  <body MS_POSITIONING="GridLayout">
    <form id="Form1" method="post" runat="server">
      <marquee id="Ticker" style="Z-INDEX: 101; LEFT: 72px; POSITION: absolute; TOP: 80px" runat="server" width="600" onmouseover="this.stop();" onmouseout="this.start();"></marquee>
      <asp:TextBox id="TextBox1" style="Z-INDEX: 102; LEFT: 168px; POSITION: absolute; TOP: 144px" runat="server" Width="256px" Height="24px"></asp:TextBox>
      <asp:Button id="Button1" style="Z-INDEX: 103; LEFT: 464px; POSITION: absolute; TOP: 144px" runat="server" Width="104px" Text="Get Stocks"></asp:Button>
      <asp:Label id="Label1" style="Z-INDEX: 104; LEFT: 176px; POSITION: absolute; TOP: 112px" runat="server" Width="376px" Font-Names="Verdana">Enter Symbols, separated by spaces.</asp:Label>
      <asp:Label id="Label2" style="Z-INDEX: 105; LEFT: 64px; POSITION: absolute; TOP: 8px" runat="server" Width="575px" Height="32px" Font-Names="Verdana">Yahoo Realtime Stock Quotes with Regular Expressions and Xml.</asp:Label>
    </form>
  </body>
</HTML>

You can see that I've added a Marquee control, set it to runat="server", and added client-side event handlers to start and stop the scroll when your mouse hovers over an item.

The "engine" of the process looks like this:

// Requires: using System; using System.Collections; using System.IO; using System.Net;
//           using System.Text; using System.Text.RegularExpressions; using System.Xml;
public XmlDocument GetXmlYahoo(string symbolList)
{
    string url = "http://finance.yahoo.com/q?s=";
    url += symbolList;
    url += "&d=e";
    WebRequest webRequest = WebRequest.Create(url);

    string beginStr = "";
    try
    {
        WebResponse webResponse = webRequest.GetResponse();
        beginStr = new StreamReader(webResponse.GetResponseStream(),
            Encoding.Default).ReadToEnd();
        webResponse.Close();
        // clean up some YHOO finance "junk" first so Regex matches won't fail
        beginStr = beginStr.Replace("\n", "");
        beginStr = beginStr.Substring(beginStr.IndexOf("Order Books"));
        beginStr = beginStr.Replace("<font color=ff0020>", "");
        beginStr = beginStr.Replace("</font></font>", "</font>");
    }
    catch (Exception)
    {
        // any failure (network error, "Order Books" marker not found)
        // leaves us with an empty string and therefore no matches
        beginStr = "";
    }

    XmlDocument xmlDocument = new XmlDocument();
    XmlElement elemQuotes = xmlDocument.CreateElement("StockQuotes");
    xmlDocument.AppendChild(elemQuotes);

    // match string for our Regex Matches collection; the newlines were stripped
    // from the page above, so the pieces are joined without line breaks
    string mainStr =
        "<td nowrap align=\"left\"><font face=arial size=-1><a href=\"(?<href>[^\"]+)\">(?<symbol>[^<]+)</a>" +
        "</font></td><td nowrap align=\"center\"><font face=arial size=-1><i>(?<time>[^<]+)</i>" +
        "</font></td><td nowrap><font face=arial size=-1><b><i>(?<price>[^<]+)</i></b>" +
        "</font></td><td nowrap><font face=arial size=-1><i>(?<change>[^<]+)</i></font></td>";
    Regex regex = new Regex(mainStr, RegexOptions.Compiled);
    IEnumerator iEnumerator = regex.Matches(beginStr).GetEnumerator();
    //Response.Write("<textarea rows=100 cols=120>" + beginStr + "</textarea>");

    while (iEnumerator.MoveNext())
    {
        Match match = (Match)iEnumerator.Current;
        XmlElement elemQuote = xmlDocument.CreateElement("Quote");
        XmlElement elemSymbol = xmlDocument.CreateElement("Symbol");
        XmlElement elemTime = xmlDocument.CreateElement("Time");
        XmlElement elemPrice = xmlDocument.CreateElement("Price");
        XmlElement elemChange = xmlDocument.CreateElement("Change");

        elemSymbol.InnerText = match.Groups["symbol"].Value;
        elemPrice.InnerText = match.Groups["price"].Value.Replace(",", ".");
        elemTime.InnerText = match.Groups["time"].Value.Replace(",", ".");
        elemChange.InnerText = match.Groups["change"].Value.Replace(",", ".");

        // child order matters: FormatXML reads these elements by position
        elemQuote.AppendChild(elemSymbol);
        elemQuote.AppendChild(elemPrice);
        elemQuote.AppendChild(elemChange);
        elemQuote.AppendChild(elemTime);
        xmlDocument.DocumentElement.AppendChild(elemQuote);
    }
    return xmlDocument;
}

The method above works as follows:

1) Accept the space-delimited list of stock symbols, and append it to the URL
2) Make the WebRequest to the Yahoo finance URL and read the response text into "beginStr"
3) Chop off everything before "Order Books" in order to simplify processing
4) Remove all instances of "<font color=ff0020>" in order to be able to handle both positive and negative price changes without writing a lot of extra Regex code
5) Create the main Regex match string that will isolate every row that has stock information. Note the named group embedded in the string for each item, e.g. (?<time>[^<]+). The pattern is wrapped here for readability only; it has to match page text from which the newlines have already been removed:

<td nowrap align=\"left\"><font face=arial size=-1><a href=\"(?<href>[^\"]+)\">(?<symbol>[^<]+)</a>
</font></td><td nowrap align=\"center\"><font face=arial size=-1><i>(?<time>[^<]+)</i>
</font></td><td nowrap><font face=arial size=-1><b><i>(?<price>[^<]+)</i></b>
</font></td><td nowrap><font face=arial size=-1><i>(?<change>[^<]+)</i></font></td>

6) Get the Enumerator for the Matches object, and loop through the collection
7) Build an XmlDocument from the values returned, and return the XmlDocument

Note that this is the format the Yahoo service returns DURING MARKET HOURS ONLY.
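The match-and-build steps can be tried without a network call. The following is a rough sketch that runs a simplified pattern over a string you supply. The class name, the helper methods, and the row layout are all hypothetical (the layout is deliberately much simpler than Yahoo's real markup); only the named groups and the StockQuotes/Quote element names follow the method above.

```csharp
using System.Text.RegularExpressions;
using System.Xml;

public static class QuoteRowDemo
{
    // Simplified, hypothetical row layout (not Yahoo's real markup).
    private const string RowPattern =
        "<td><a href=\"[^\"]+\">(?<symbol>[^<]+)</a></td>" +
        "<td><i>(?<time>[^<]+)</i></td>" +
        "<td><b>(?<price>[^<]+)</b></td>";

    public static XmlDocument Parse(string html)
    {
        XmlDocument doc = new XmlDocument();
        XmlElement root = doc.CreateElement("StockQuotes");
        doc.AppendChild(root);

        // One <Quote> element per matched table row.
        foreach (Match match in Regex.Matches(html, RowPattern))
        {
            XmlElement quote = doc.CreateElement("Quote");
            AppendValue(doc, quote, "Symbol", match.Groups["symbol"].Value);
            AppendValue(doc, quote, "Time", match.Groups["time"].Value);
            AppendValue(doc, quote, "Price", match.Groups["price"].Value);
            root.AppendChild(quote);
        }
        return doc;
    }

    // Creates <name>value</name> and attaches it to the parent element.
    private static void AppendValue(XmlDocument doc, XmlElement parent,
                                    string name, string value)
    {
        XmlElement child = doc.CreateElement(name);
        child.InnerText = value;
        parent.AppendChild(child);
    }

    // True when at least one row matched the pattern.
    public static bool HasQuotes(XmlDocument doc)
    {
        return doc.DocumentElement.HasChildNodes;
    }
}
```

When the input does not fit the pattern, as would be expected when the page comes back in a different layout, Parse still returns a document, but its StockQuotes root has no children. A check like HasQuotes gives the caller a way to tell "no quotes found" apart from a populated result.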

Now that we have our XmlDocument, we will pass it to the "FormatXML" method, a utility method that simply converts the XML element values to <span> elements suitable for assigning to the InnerHtml property of our Marquee control:

private string FormatXML(XmlDocument xmlDoc)
{
    string strResult = String.Empty;
    string strBegin = " <SPAN STYLE='COLOR:blue'>";
    IEnumerator iEnumerator = xmlDoc.DocumentElement.ChildNodes.GetEnumerator();
    while (iEnumerator.MoveNext())
    {
        // each Quote holds its children in the order Symbol, Price, Change, Time
        XmlNode xmlNode = (XmlNode)iEnumerator.Current;
        string strUri = "http://finance.yahoo.com/q?s=";
        string strQuotes = xmlNode.ChildNodes[0].InnerText;
        string strEndQry = "&d=e";
        string strFullUri = strUri + strQuotes + strEndQry;
        string[] strs = new string[]{strBegin,
            "<a href='" + strFullUri + "' target='_blank'>"
                + xmlNode.ChildNodes[0].InnerText + "</a>: ",
            xmlNode.ChildNodes[1].InnerText + " ",
            "[" + xmlNode.ChildNodes[2].InnerText + "] ",
            xmlNode.ChildNodes[3].InnerText + "&nbsp;&nbsp;</SPAN>   "};
        strResult += String.Concat(strs);
    }
    return strResult;
}
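FormatXML reads each value by position, so it depends on the order in which GetXmlYahoo appends the child elements. As a possible variation (a sketch, not the code in the download), the same markup can be built by looking the children up by name and collecting the output in a StringBuilder. The class and method names here are hypothetical.

```csharp
using System.Text;
using System.Xml;

public static class TickerMarkupDemo
{
    // Hypothetical variant of FormatXML that does not depend on child order.
    public static string Format(XmlDocument xmlDoc)
    {
        StringBuilder result = new StringBuilder();
        foreach (XmlNode quote in xmlDoc.DocumentElement.ChildNodes)
        {
            // The XmlNode string indexer returns the first child element with that name.
            string symbol = quote["Symbol"].InnerText;
            result.Append("<SPAN STYLE='COLOR:blue'>");
            result.Append("<a href='http://finance.yahoo.com/q?s=" + symbol
                + "&d=e' target='_blank'>" + symbol + "</a>: ");
            result.Append(quote["Price"].InnerText + " ");
            result.Append("[" + quote["Change"].InnerText + "] ");
            result.Append(quote["Time"].InnerText + "&nbsp;&nbsp;</SPAN>");
        }
        return result.ToString();
    }
}
```

The by-name lookup assumes every Quote carries all four elements; a missing element would surface as a NullReferenceException, so production code would want a null check there.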

All this is kicked off and controlled by the Button Click event handler:

private void Button1_Click(object sender, System.EventArgs e)
{
    XmlDocument xmlDoc = GetXmlYahoo(TextBox1.Text);
    string strHTML = FormatXML(xmlDoc);
    Ticker.InnerHtml = strHTML;
}

And there you have it: a utility method for extracting realtime quotes from Yahoo and returning them in a generic XmlDocument. This could easily form the basis of a ServerControl (all you would need are a few public properties and a Designer class) or for a webservice or any of a number of other uses.

Download the Visual Studio .NET 2003 solution below. If you don't have that version, just start a new blank WebForms project and add the files from mine.

Download the code that accompanies this article


Peter Bromberg is a C# MVP, MCP, and .NET consultant who has worked in the banking and financial industry for 20 years. He has architected and developed web-based corporate distributed application solutions since 1995, and focuses exclusively on the .NET Platform.