ASP.NET - Using HtmlAgilityPack to Collect Google Realtime Search Results

Google has some search operators that many people are not aware of. If you append one of these to a Google search URL, you can restrict results to those that were indexed as little as one minute ago.

&tbs=rltm:1 [real time results]
&tbs=qdr:s [past second]
&tbs=qdr:n [past minute]
&tbs=qdr:h [past hour]
&tbs=qdr:d [past 24 hours (day)]
&tbs=qdr:w [past week]
&tbs=qdr:m [past month]
&tbs=qdr:y [past year]

Currently the first one (rltm:1) does not work, but it used to, and Google will probably resurrect it since they have already implemented realtime results in Google+ search. The minute operator also accepts a count: replace qdr:n with qdr:n10 for the last ten minutes, qdr:n30 for the last thirty minutes, etc.
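As a rough illustration of how such a URL can be put together, here is a minimal sketch. The RecencyUrl class and its base URL are assumptions made for this example; they are not taken from the downloadable solution.

```csharp
using System;

public static class RecencyUrl
{
    // Hypothetical base URL, assumed for illustration only.
    private const string BaseUrl = "https://www.google.com/search?q=";

    // Builds a search URL restricted to results from the last `minutes` minutes
    // by appending the tbs=qdr:nN operator described above.
    public static string Build(string searchTerm, int minutes)
    {
        if (minutes < 1)
            throw new ArgumentOutOfRangeException("minutes");

        // Escape the term so spaces, '&' and '#' cannot break the query string.
        return BaseUrl + Uri.EscapeDataString(searchTerm) + "&tbs=qdr:n" + minutes;
    }
}
```

For example, RecencyUrl.Build("asp.net mvc", 30) yields a URL ending in q=asp.net%20mvc&tbs=qdr:n30.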

I built a search facility back in 2008 using HtmlAgilityPack to scrape these results and, surprisingly, the original code still works perfectly. This is not something you would want to deploy on a public website: when there are too many search requests from a single IP address, Google will start throwing up CAPTCHA challenges to make you prove you're not a "bot". However, it can be very useful for research (finding out what's hot in a particular subject), or you could execute a search, say, every 5 minutes and cache the result for 5 minutes.
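The "search every 5 minutes and cache the result" idea could look something like the sketch below. TimedCache is a hypothetical helper written for this example, not part of the original code; the clock parameter exists only so the expiry logic can be exercised without waiting.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: remembers one value per key for a fixed lifetime.
public class TimedCache<T>
{
    private readonly TimeSpan _lifetime;
    private readonly Func<DateTime> _clock;
    private readonly object _sync = new object();
    private readonly Dictionary<string, KeyValuePair<DateTime, T>> _entries =
        new Dictionary<string, KeyValuePair<DateTime, T>>();

    public TimedCache(TimeSpan lifetime, Func<DateTime> clock = null)
    {
        _lifetime = lifetime;
        _clock = clock ?? (() => DateTime.UtcNow);
    }

    // Returns the stored value if it is younger than the lifetime;
    // otherwise calls fetch, stores the new value and returns it.
    public T GetOrAdd(string key, Func<string, T> fetch)
    {
        // The lock is held while fetching, so concurrent callers are serialized.
        lock (_sync)
        {
            DateTime now = _clock();
            KeyValuePair<DateTime, T> entry;
            if (_entries.TryGetValue(key, out entry) && now - entry.Key < _lifetime)
                return entry.Value;

            T value = fetch(key);
            _entries[key] = new KeyValuePair<DateTime, T>(now, value);
            return value;
        }
    }
}
```

A caller would create one TimedCache with TimeSpan.FromMinutes(5) and pass the real search method as the fetch delegate, so repeated requests for the same term within five minutes never reach Google.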

But mostly I wrote it as an exercise in screen-scraping with HtmlAgilityPack, and so I'm sharing it here. What I wanted was a search facility that would accept multiple keywords, execute a separate search on each, and aggregate and return the results.

The HTML of a typical "search result" in a Google page of search results looks like this:


<li class=g><div class=vsc pved=0CEwQkgowAA sig=1dC><h3 class="r"><a href="http://forums.asp.net/p/1754776/4758868.aspx/1?Newby+Question+Parser+Error" class=l onmousedown="return rwt(this,'','','','1','AFQjCNGW5y16_55IZr6JT_cfKjO8lC5RAQ','DNgfz1Glaz5mNPNi27l5Xw','0CEoQFjAA')">Newby Question - Parser Error : The Official Microsoft <em>ASP</em>.<em>NET</em> Forums</a></h3><div class="s"><div class="f kv"><cite>forums.<b>asp</b>.<b>net</b>/p/1754776/4758868.<b>asp</b>x/1?Newby+Question...</cite><span class=vshid></span><button class="gbil esw eswd" onclick="window.gbar&&gbar.pw&&gbar.pw.clk(this)" onmouseover="window.gbar&&gbar.pw&&gbar.pw.hvr(this,google.time())" g:entity="http://forums.asp.net/p/1754776/4758868.aspx/1?Newby+Question+Parser+Error" g:undo="poS0" title="Recommend this page" g:pingback="/gen_204?atyp=i&ct=plusone&cad=S0"></button></div><div class="esc slp" id="poS0" style="display:none">You +1'd this publicly. <a href="#" class=fl>Undo</a></div><div class="f slp">1 post - 1 author - Last post: 10 minutes ago</div><span class=st>Microsoft · Feedback on <em>ASP</em>.<em>NET</em>|; File Bugs · <em>ASP</em>.<em>net</em>. Microsoft is conducting an online survey to understand your opinion of the <em>ASP</em>.<em>NET</em> Web site. <b>...</b><br></span></div>

If you are not familiar with HtmlAgilityPack, it is a C# library originally written by Simon Mourier that turns an HTML page into an XPath-compatible DOM, much like an XML document. So with a little XPath knowledge, you can scrape pretty much anything you want out of a retrieved page of content.
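To give a feel for the XPath side, here is a small sketch that pulls the title and link out of each result. It assumes the HtmlAgilityPack assembly is referenced; the XPathSketch class and the XPath expression are illustrative and differ from the navigation used in the article's own code.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class XPathSketch
{
    // Returns (title, href) pairs for every result link found in the HTML.
    public static List<KeyValuePair<string, string>> ExtractLinks(string html)
    {
        var found = new List<KeyValuePair<string, string>>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes can return null rather than an empty collection
        // when nothing matches, so check before iterating.
        HtmlNodeCollection anchors =
            doc.DocumentNode.SelectNodes("//li[@class='g']//h3/a");
        if (anchors == null) return found;

        foreach (HtmlNode a in anchors)
        {
            // InnerText drops nested tags such as <em>; DeEntitize decodes &amp; etc.
            string title = HtmlEntity.DeEntitize(a.InnerText).Trim();
            string href = a.GetAttributeValue("href", "");
            found.Add(new KeyValuePair<string, string>(title, href));
        }
        return found;
    }
}
```

Searching from the anchor's position in the tree (//h3/a under each li) is a little more tolerant of wrapper elements than walking FirstChild links, at the cost of a slightly broader match.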

Here is a nugget of code that illustrates how I do this:

// baseUrl1 and baseUrl2 are class-level string fields (not shown here).
// baseUrl1 contains "tbs=qdr:n10"; baseUrl2 is used for the 120-minute option.
public List<SearchResult> GetResults(string searchTerm, object state)
{
    StateObject stateObject = (StateObject)state;

    // Build the URL in a local variable. This method runs on several
    // thread-pool threads at once, and overwriting the shared baseUrl1 field
    // would also remove the "tbs=qdr:n10" placeholder for every later call.
    string baseUrl = stateObject.Minutes == 120
        ? baseUrl2
        : baseUrl1.Replace("tbs=qdr:n10", "tbs=qdr:n" + stateObject.Minutes);

    searchTerm = searchTerm.Replace(".", "");
    string fullUrl = baseUrl + searchTerm;
    List<SearchResult> results = new List<SearchResult>();

    // Download the results page; "using" disposes the client even on failure.
    string s;
    using (WebClient wc = new WebClient())
    {
        s = wc.DownloadString(fullUrl);
    }

    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(s);

    // Each search result is an <li class="g"> element.
    HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//li[@class='g']");
    if (links == null || links.Count == 0) return null;

    foreach (var node in links)
    {
        try
        {
            // The first grandchild of each result is expected to be the title link.
            SearchResult sr = new SearchResult();
            sr.Title = node.FirstChild.FirstChild.InnerHtml;
            sr.Title = HtmlHelper.HtmlStripTags(sr.Title, true, true);
            sr.Link = node.FirstChild.FirstChild.Attributes["href"].Value;
            string desc = node.SelectSingleNode("div").InnerText.Trim();
            sr.Description = HtmlHelper.HtmlStripTags(desc, true, true);
            results.Add(sr);
        }
        catch (Exception ex)
        {
            // Skip results that do not match the expected layout.
            Debug.Write(ex.ToString());
        }
    }
    return results;
}

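private readonly ManualResetEvent mre = new ManualResetEvent(false);
private readonly object _sync = new object();
private int _numItems = 0;
private int _ctr = 0;
private readonly List<SearchResult> AllResults = new List<SearchResult>();

public List<SearchResult> MultiSearch(Dictionary<string, int> searchTerms, int minutes)
{
    _numItems = searchTerms.Count;
    if (_numItems == 0) return new List<SearchResult>();

    // Queue one search per term on the thread pool.
    foreach (string srch in searchTerms.Keys)
    {
        string srchr = srch.Replace(".", "");
        ThreadPool.QueueUserWorkItem(SearchCallback, new StateObject(srchr, minutes, searchTerms[srch]));
    }

    // Wait until every search has finished, but give up after five seconds.
    mre.WaitOne(5000);

    // Return a copy so a late-finishing search cannot modify the caller's list.
    lock (_sync)
    {
        return new List<SearchResult>(AllResults);
    }
}

private void SearchCallback(object state)
{
    StateObject stateObject = (StateObject)state;
    try
    {
        List<SearchResult> result = GetResults(stateObject.SearchTerm, stateObject);
        if (result != null)
        {
            // List<T> is not thread-safe, so guard the shared list.
            lock (_sync) { AllResults.AddRange(result); }
        }
    }
    catch (Exception ex)
    {
        // A failed download must not take down the thread-pool thread.
        Debug.Write(ex.ToString());
    }
    finally
    {
        // Count this search as done atomically; the last one releases the waiter.
        if (Interlocked.Increment(ref _ctr) == _numItems)
            mre.Set();
    }
}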

What this does is:

1) Create the correct search URL based on the number of minutes the user has input, along with one or more search terms.
2) Create a StateObject, which is a simple class to hold the search term and the number of minutes back to search.
3) Download the search results page and load it into an HtmlAgilityPack HtmlDocument object.
4) Execute a series of XPath queries designed to get the search result title, description and link, and populate a SearchResult instance.
5) Use a Regex-based helper class to strip unwanted HTML tags out of the title and description content.
6) Add the SearchResult to a List<SearchResult>.
7) Perform this action as many times as there are search terms from the user. Use a ManualResetEvent to make the code wait until all searches are done, or until five seconds have passed.
8) Return the List<SearchResult> to the caller.
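The HtmlHelper.HtmlStripTags method used in step 5 is not listed in this article. As a rough idea of what a Regex-based tag stripper can look like, here is a minimal sketch; the TagStripper class is hypothetical and is not the helper from the download, and a regular expression like this is only suitable for simple snippets, not arbitrary HTML.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Matches anything that looks like a tag: '<', one or more non-'>' characters, '>'.
    private static readonly Regex Tags = new Regex("<[^>]+>", RegexOptions.Compiled);
    private static readonly Regex Whitespace = new Regex(@"\s+", RegexOptions.Compiled);

    // Removes tags, decodes HTML entities and collapses runs of whitespace.
    public static string Strip(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        string text = Tags.Replace(html, string.Empty);
        text = WebUtility.HtmlDecode(text);
        return Whitespace.Replace(text, " ").Trim();
    }
}
```

Applied to a title such as "Newby Question : <em>ASP</em>.<em>NET</em> Forums", this returns "Newby Question : ASP.NET Forums".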

Here are the StateObject and the SearchResult classes:

public class SearchResult
{
    public string Title { get; set; }
    public string Link { get; set; }
    public string Description { get; set; }
}

public class StateObject
{
    public string SearchTerm { get; set; }
    public int Minutes { get; set; }

    // topicId is accepted for the caller's convenience but is not stored.
    public StateObject(string search, int minutes, int topicId)
    {
        this.SearchTerm = search;
        this.Minutes = minutes;
    }
}
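For comparison, on .NET 4.5 or later the same "one search per term, wait with a timeout, merge the results" pattern can be written with tasks instead of the thread pool and a ManualResetEvent. This is a different technique from the one in the article's code, shown only as a sketch; FanOut and SearchAll are hypothetical names, and the search delegate stands in for a method like GetResults.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class FanOut
{
    // Runs `search` once per term in parallel and merges the results of
    // every search that completed successfully within the timeout.
    public static List<TResult> SearchAll<TResult>(
        IEnumerable<string> terms,
        Func<string, List<TResult>> search,
        TimeSpan timeout)
    {
        Task<List<TResult>>[] tasks = terms
            .Select(term => Task.Run(() => search(term)))
            .ToArray();

        try
        {
            Task.WaitAll(tasks, timeout);
        }
        catch (AggregateException)
        {
            // One or more searches failed; they are skipped below.
        }

        var merged = new List<TResult>();
        foreach (Task<List<TResult>> task in tasks)
        {
            // Only read Result from tasks that finished without error.
            if (task.Status == TaskStatus.RanToCompletion && task.Result != null)
                merged.AddRange(task.Result);
        }
        return merged;
    }
}
```

Because each task owns its own result list and the merge happens on the calling thread, no lock or shared counter is needed.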

I have a simple one-page web application that has a DropDownList for selecting the time (minutes), a TextBox for entering one or more search terms, and a button to kick off the above process. The List<SearchResult> that comes back is used to bind a DataList.

You can download the complete Visual Studio 2010 solution here.

By Peter Bromberg