ASP.NET - Using the Yahoo YQL Query Language With ContentAnalysis Table to Generate Valid Keywords from Text or a Url

Until recently, you could use the Yahoo Term Extraction web service to generate a list of valid key phrases from entered text or from a web page. However, Yahoo has announced that this service is deprecated and will only be available until March 2012. All Yahoo APIs have now been folded into the new YQL model and share a single endpoint URL, http://query.yahooapis.com/v1/public/yql.

YQL statements are written in a SQL-like query language that is highly customizable. In fact, there is a vibrant community that maintains a GitHub repository of custom tables here: https://github.com/yql/yql-tables. There is also a "helper" site here: http://www.datatables.org/.

These "Open Data Tables" give developers a single, uniform way to use any web service or data source, such as Amazon, iTunes, or Twitter. The YQL (Yahoo! Query Language) platform enables developers to query, filter, and combine data across the web through a single interface. It exposes a SQL-like syntax that is both familiar to developers and expressive enough for getting the right data.

Open Data Tables are XML files that can be "plugged" into the Yahoo! Query Language open platform (YQL). These files describe how the SQL-like YQL language can be mapped onto any web service or data source on the Internet. Once mapped, these data sources can be used by developers in many ways in YQL. You can even extract elements from the HTML of a cross-domain URL with queries like the following:

select * from html where url="http://finance.yahoo.com/q?s=yhoo" and xpath='//div[@id="yfi_headlines"]/div[2]/ul/li/a'
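A statement like this has to be URL-encoded before it can be sent to the endpoint as the q parameter. Here is a minimal sketch of how that might look; the YqlUrlBuilder class and its Build method are hypothetical names of my own and are not part of the sample solution.

```csharp
using System;

public static class YqlUrlBuilder
{
    // Public YQL endpoint named earlier in the article.
    private const string Endpoint = "http://query.yahooapis.com/v1/public/yql";

    // Hypothetical helper: turns a raw YQL statement into a GET request URL.
    public static string Build(string yql)
    {
        if (string.IsNullOrWhiteSpace(yql))
            throw new ArgumentException("A YQL statement is required.", "yql");

        // EscapeDataString percent-encodes spaces, double quotes, '=', '?' and '/',
        // so the statement survives intact as the value of the q parameter.
        return Endpoint + "?q=" + Uri.EscapeDataString(yql);
    }
}
```

Passing the html query above to YqlUrlBuilder.Build should give a URL you can hand to WebClient.DownloadString or paste into a browser.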

In this sample, we'll use the contentanalysis.analyze table from Yahoo! to perform "term extraction" (getting valid keywords and phrases) from either a block of text, or even directly from the URL of a blog post or article. These keywords can be used for search engine optimization (SEO). A typical query result set looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<query xmlns:yahoo="http://www.yahooapis.com/v1/base.rng"
    yahoo:count="1" yahoo:created="2012-01-15T13:57:50Z" yahoo:lang="en-US">
    <diagnostics>
        <publiclyCallable>true</publiclyCallable>
        <user-time>140</user-time>
        <service-time>114</service-time>
        <build-version>24402</build-version>
    </diagnostics>
    <results>
        <entities xmlns="urn:yahoo:cap">
            <entity score="0.784327">
                 <text end="16" endchar="16" start="0" startchar="0">Italian sculptors</text>
            </entity>
            <entity score="0.764539">
                 <text end="72" endchar="72" start="58" startchar="58">the Virgin Mary</text>
                 <wiki_url>http://en.wikipedia.com/wiki/Mary_%28mother_of_Jesus%29</wiki_url>
                 <related_entities>
                     <wikipedia>
                         <wiki_url>http://en.wikipedia.com/wiki/Mary_MacKillop</wiki_url>
                         <wiki_url>http://en.wikipedia.com/wiki/S%c3%bcmela_Monastery</wiki_url>
                         <wiki_url>http://en.wikipedia.com/wiki/Canonization</wiki_url>
                         <wiki_url>http://en.wikipedia.com/wiki/Lourdes</wiki_url>
                         <wiki_url>http://en.wikipedia.com/wiki/Naval_warfare_of_World_War_I</wiki_url>
                     </wikipedia>
                 </related_entities>
            </entity>
            <entity score="0.509566">
                 <text end="29" endchar="29" start="22" startchar="22">painters</text>
            </entity>
        </entities>
    </results>
</query>

You can see above that the XML returned contains Wikipedia entries ("related_entities") as well as keywords ("entity/text"). In this example, I only use the keywords. Other queries will also return Yahoo!'s Categories for the entered content.
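If you also want the score that comes with each keyword, the same LINQ to XML approach can read the entity elements instead of only their text children. This is a rough sketch under the assumption that the response always has the shape shown above; the EntityParser class is a hypothetical name and is not in the downloadable solution.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class EntityParser
{
    // The entity elements live in this namespace (see the sample response).
    private static readonly XNamespace Cap = "urn:yahoo:cap";

    // Returns (keyword, score) pairs, highest score first.
    public static List<KeyValuePair<string, double>> Parse(string xml)
    {
        XDocument doc = XDocument.Parse(xml);
        return doc.Descendants(Cap + "entity")
            // Skip any entity that has no text child.
            .Where(e => e.Element(Cap + "text") != null)
            .Select(e => new KeyValuePair<string, double>(
                (string)e.Element(Cap + "text"),
                // The score attribute is unqualified; treat a missing one as 0.
                (double?)e.Attribute("score") ?? 0.0))
            .OrderByDescending(p => p.Value)
            .ToList();
    }
}
```

Sorting by score lets you keep only the top few phrases when a long article returns many entities.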

Here is my method to return keywords from a specified URL:

        // Requires: using System; using System.Collections.Generic; using System.Linq;
        //           using System.Net; using System.Xml.Linq;

        // Get search terms from an entered URL.
        public static List<string> GetSearchTermsUrl(string url)
        {
            // Percent-encode any single quote so it cannot terminate the YQL string literal.
            string yql = "select * from contentanalysis.analyze where url='" + url.Replace("'", "%27") + "'";
            // URL-encode the whole statement so spaces and the target URL's own
            // query string survive as the value of the q parameter.
            string query = "http://query.yahooapis.com/v1/public/yql?q=" + Uri.EscapeDataString(yql);
            string s;
            using (WebClient wc = new WebClient())
            {
                s = wc.DownloadString(query);
            }
            XDocument doc = XDocument.Parse(s);
            // The keyword elements live in the urn:yahoo:cap namespace.
            XNamespace x = "urn:yahoo:cap";
            return doc.Descendants(x + "text")
                      .Select(itm => itm.Value)
                      .Distinct()
                      .ToList();
        }

And here is my method to return the same type of results from a block of entered text (e.g. that you copied from a page):

        // Requires, in addition to the usings above:
        //           using System.Collections.Specialized; using System.IO;

        // Get search terms from a block of entered text.
        public static List<string> YqlPost(string content)
        {
            // Clean out various characters (e.g. from code samples) that could mess up the YQL select statement.
            content = content.Replace(";", " ").Replace("{", " ").Replace("}", " ").Replace("@", " ").Replace("=", " ").Replace("'", " ");
            string query = "SELECT * FROM contentanalysis.analyze WHERE text='" + content + "'";
            List<string> items = new List<string>();
            NameValueCollection nvc = new NameValueCollection();
            nvc.Add("q", query);
            try
            {
                using (WebClient wc = new WebClient())
                {
                    wc.Headers.Add(HttpRequestHeader.ContentType, "application/x-www-form-urlencoded");
                    // UploadValues form-encodes the q field and sends it as a POST body,
                    // which avoids URL length limits for long blocks of text.
                    byte[] b = wc.UploadValues("http://query.yahooapis.com/v1/public/yql", nvc);
                    using (MemoryStream ms = new MemoryStream(b))
                    {
                        XDocument doc = XDocument.Load(ms);
                        XNamespace x = "urn:yahoo:cap";
                        items = doc.Descendants(x + "text")
                                   .Select(itm => itm.Value)
                                   .Distinct()
                                   .ToList();
                    }
                }
            }
            catch (Exception ex)
            {
                // Best effort: log the failure and fall through to return an empty list.
                System.Diagnostics.Debug.WriteLine(ex.ToString());
            }
            return items;
        }

You'll notice that the YqlPost method has a line that replaces certain characters with spaces, as these are known to break the YQL query, just as they would produce "bad SQL". There may be additional ones to add, as I have only done limited testing with this. Don't expect your YQL queries to come back with keywords every time. For example, the URL http://msdn.microsoft.com/en-us/library/system.text.aspx doesn't return anything for me. I guess Yahoo just isn't interested in .NET namespaces.
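Since the list of troublesome characters is likely to grow, it may be easier to keep it in one place. The following is only a sketch: the YqlText class and its Sanitize method are hypothetical, and collapsing runs of whitespace is my own assumption about what is harmless for term extraction.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class YqlText
{
    // The characters that YqlPost already replaces with spaces.
    private static readonly char[] DefaultUnsafe = { ';', '{', '}', '@', '=', '\'' };

    // Replaces unsafe characters with spaces; callers can pass extra characters
    // as they discover more that break the query.
    public static string Sanitize(string content, params char[] extraUnsafe)
    {
        if (string.IsNullOrEmpty(content)) return string.Empty;

        char[] blocked = DefaultUnsafe.Concat(extraUnsafe ?? new char[0]).ToArray();
        string cleaned = new string(
            content.Select(c => blocked.Contains(c) ? ' ' : c).ToArray());

        // Assumption: collapsing whitespace runs does not affect the extracted terms.
        return Regex.Replace(cleaned, @"\s+", " ").Trim();
    }
}
```

For example, YqlText.Sanitize(content, '"') would also strip double quotes if those turn out to cause trouble.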

   You can view the complete documentation for the contentanalysis API here: http://developer.yahoo.com/search/content/V2/contentAnalysis.html

   The downloadable Visual Studio 2010 solution has a class library containing the two static methods above, along with a test harness ASP.NET web application providing a form with a textarea for entered text, a textbox for an entered URL, and two buttons - one to get the results and one to clear the controls. You can only have either text in the textarea, or a url in the URL textbox - not both.
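The either/or rule for the two inputs could be enforced with a small check before calling YqlPost or GetSearchTermsUrl. This is a sketch only; the InputRouter class and the KeywordSource enum are hypothetical names and may not match how the test harness in the download actually does it.

```csharp
using System;

// Which of the two form inputs the request should be built from.
public enum KeywordSource { Text, Url }

public static class InputRouter
{
    // Decides which input to use, rejecting the "both" and "neither" cases.
    public static KeywordSource Choose(string text, string url)
    {
        bool hasText = !string.IsNullOrWhiteSpace(text);
        bool hasUrl = !string.IsNullOrWhiteSpace(url);

        // Both flags equal means both inputs were filled in or both were left empty.
        if (hasText == hasUrl)
            throw new ArgumentException("Supply either text or a URL, but not both.");

        return hasText ? KeywordSource.Text : KeywordSource.Url;
    }
}
```

A button click handler would then call YqlPost when the result is KeywordSource.Text and GetSearchTermsUrl when it is KeywordSource.Url.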

These are the keywords that Yahoo! returned from the text in this article:

YQL
query language
string query
MemoryStream
Yahoo!


   Results are listed in the textarea for easy copying.

   You can download the sample solution here.

By Peter Bromberg