ScreenScraping with ServerXMLHttp

By Peter A. Bromberg, Ph.D.

ScreenScraping, the process of grabbing content from another site, stripping out only what you want, and then displaying it in your own site, is a highly frowned-upon but often used practice. I don't want to get into the ethical and copyright issues involved here; that's something each individual will have to take into consideration.

With the arrival of MSXML3, Microsoft has provided the web developer with a server-safe, high-load tool called the ServerXMLHttp object. Its use is relatively simple. In this short article I'll demonstrate how to grab weather information for a specific zip code from a public weather site, strip out a small portion of it (eliminating all the ads and other extraneous content), and then display that portion in your own page.



Basically, ServerXMLHTTP is an HTTP GET/POST component with one major advantage over XMLHTTP: it does not rely on WinInet for HTTP. Instead, ServerXMLHTTP uses a new HTTP client stack. Designed for server applications, this "server-safe" subset of WinInet offers the following advantages:

Reliability
The HTTP client stack offers longer uptimes. WinInet features that are not critical for server applications, such as URL caching, auto-discovery of proxy servers, HTTP/1.1 chunking, offline support, and support for the Gopher and FTP (File Transfer Protocol) protocols, are not included in the new HTTP subset.

Security
The HTTP client stack does not allow user-specific state to be shared with another user's session. Note that ServerXMLHTTP does not provide support for certificates.

The maximum number of instances that can exist simultaneously within a single process is 5,460. A similar limitation applies to the XMLHTTP component. However, other factors, such as available memory, CPU processing capacity, or available socket connections, can further limit the number of instances that can be active simultaneously. Developers can partition the server application into multiple processes if this limit becomes a bottleneck.

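The open method initializes the request (the HTTP method, the URL, and whether the call is asynchronous), and the send method sends the request. For a synchronous call, send does not return until the response has arrived.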

You can read the response using one of four properties.

responseBody
responseStream
responseText
responseXML


With ServerXMLHTTP, the usual sequence is to call open, set any custom header information through setRequestHeader, send, and then check one of the four response properties.
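
The sequence just described might look like the following in an ASP page. This is a minimal sketch rather than a definitive implementation: the URL and the Accept header value are placeholders chosen for illustration, and the last argument to open (False) requests a synchronous call so that the response is available as soon as send returns.

```vbscript
<%
' Minimal sketch of the open / setRequestHeader / send / read sequence.
' The URL and the header value are placeholders, not taken from the article.
Dim http, body
Set http = Server.CreateObject("MSXML2.ServerXMLHTTP.3.0")

http.open "GET", "http://www.example.com/", False   ' False = synchronous
http.setRequestHeader "Accept", "text/html"         ' optional custom header
http.send

If http.status = 200 Then
    body = http.responseText        ' response decoded as a string
    ' http.responseXML    - response parsed into an XML DOM document
    ' http.responseBody   - raw response as an array of unsigned bytes
    ' http.responseStream - raw response as an IStream
    Response.Write Server.HTMLEncode(Left(body, 200))
Else
    Response.Write "Request failed: " & http.status & " " & http.statusText
End If

Set http = Nothing
%>
```

For scraping HTML, responseText is usually the property to reach for, since most pages are not well-formed XML and responseXML would not parse them.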

Let's say that you have a web application that is customized to the user, and one of the items you retrieve, either by reading a client cookie or by looking up user information in your database, is the customer's zip code. You'll store the zip code in a variable, "zip", for use in the page. Here is how you would grab zip-code-specific weather information and display it somewhere in one of your pages:

<%
If Request.Form("SEND") = "" Then
%>

<form action="weather.asp" method="post">
<input type="text" name="zip"> Enter your Zipcode to see local weather<br>
<input type="submit" name="SEND" value="GET IT!">
</form>
<%
Else
    Dim zip, srvXmlHttp, result, URL
    Dim beginpos, endpos

    If Request.Form("zip") = "" Then
        zip = "32801"      ' if it's blank, we just show them somebody else's weather..
    Else
        zip = Request.Form("zip")
    End If

    ' If this doesn't work (because you don't have MSXML3 installed / configured)
    ' you can revert to the commented line:
    'Set srvXmlHttp = Server.CreateObject("MICROSOFT.XMLHTTP")
    Set srvXmlHttp = Server.CreateObject("MSXML2.ServerXMLHTTP.3.0")

    ' This site is easy to strip weather info from ...
    ' URL-encode the user-supplied value before placing it in the query string.
    URL = "http://www.wunderground.com/cgi-bin/findweather/getForecast?query=" & Server.URLEncode(zip)
    srvXmlHttp.open "GET", URL, False   ' False = synchronous call
    srvXmlHttp.send
    'On Error Resume Next
    If srvXmlHttp.status = 200 Then
        result = srvXmlHttp.responseText
        ' Keep only the text from the opening marker through the first </form> after it.
        beginpos = InStr(result, "<form name=""airport"">")
        If beginpos > 0 Then
            result = Mid(result, beginpos)
            endpos = InStr(result, "</form>")
            If endpos > 0 Then
                ' "</form>" is 7 characters long, so the tag ends at endpos + 6.
                result = Left(result, endpos + 6)
                Response.Write "<BASEFONT FACE=Verdana>"
                Response.Write "<CENTER><h1>ScreenScraping 101</h1><BR>"
                Response.Write result
                Response.Write "</CENTER>"
            End If
        End If
    End If
    Response.Write "<A HREF=""" & URL & """>Your Weather</a>"
    Set srvXmlHttp = Nothing
End If
%>

(Please be aware: the markup the code above searches for is what was on their site at the time this article was written. It's very likely to have changed since then!)
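
Because the markers will need updating whenever the remote site changes its markup, it can help to keep the extraction logic in one place. The helper below is a hypothetical sketch (the name ExtractBetween and its parameters are illustrative and not part of the original listing). It returns the text from a start marker through the first end marker after it, or an empty string when either marker is missing. The comparison ignores case, on the assumption that tag case on the remote page may vary.

```vbscript
' Hypothetical helper: returns the portion of source that begins at
' startMarker and ends with the first endMarker found after it.
' Returns "" if either marker is missing. Matching ignores case.
Function ExtractBetween(source, startMarker, endMarker)
    Dim beginPos, endPos
    ExtractBetween = ""

    beginPos = InStr(1, source, startMarker, vbTextCompare)
    If beginPos = 0 Then Exit Function

    ' Search for the end marker only after the start marker.
    endPos = InStr(beginPos + Len(startMarker), source, endMarker, vbTextCompare)
    If endPos = 0 Then Exit Function

    ' Include the end marker itself in the result.
    ExtractBetween = Mid(source, beginPos, endPos + Len(endMarker) - beginPos)
End Function
```

Placed inside the same <% ... %> block as the listing above, the InStr/Mid steps could then be replaced by a single call such as result = ExtractBetween(srvXmlHttp.responseText, "<form name=""airport"">", "</form>"), followed by a check for an empty string before writing the output.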


One more note: I've read a number of posts, and even articles by professional developers, claiming that ServerXMLHttp can't be made to work. Microsoft has a proxycfg.exe tool that you can download separately from the MSXML3 release distribution. Just run "proxycfg -d" to set up the registry entries that make ServerXMLHttp connect directly to URLs, and your problems should disappear. You must use this utility even if your server does not use proxy connections.

Peter Bromberg is an independent consultant in Orlando specializing in distributed application development and .NET solutions, and a co-developer of the NullSkull.com developer website. He can be reached at info@eggheadcafe.com.