Build a Google Language Translation Service Web Scraper Library

by Peter A. Bromberg, Ph.D.


Recently a buddy and I were discussing various language translation services and applications, and that caused me to think and revisit the Google Language Tools page. It's something I try not to do too much of (think, that is, not visit Google), but after years of practice, I just can't stop! It's interesting how Americans are primarily vertical thinkers, and that only through careful study can one begin to break this unproductive habit and become a lateral thinker. Lateral thinking is far more productive for things like programming and problem solving. If you don't know what it is, I suggest you start with Dr. Edward de Bono's works.

I whipped out my Google developer key and looked around, but (at least on the first pass) I could not find anything about a published Web API for this service. So, I visited the page and performed (GASP!) the most famous technique that professional guru-level web developers ever learn: VIEW SOURCE!



Turns out that the portion of the page that accepts a TEXTAREA full of text, accepts a chosen Language Pair (e.g. "From|To"), and submits the form is elegantly simple. And the HTML that comes back is also relatively simple. I cannot understand why there is no web API for this, but ...

So I said to myself, "Self, you haven't done any web scraping for a while, how about we hammer out this little puppy?" So, I did. I looked around on my hard drives for some nasty web-scrapin' code and couldn't find anything good. I remembered I used to use stuff like WebZinc and the freeware HtmlAgilityPack (which I highly recommend), and I decided I'd just look for something new and possibly simpler, just for a "change of tune".

I found "WebWagon" by Jon Vote: www.idioma-software.com. It's great! Not that sophisticated, but for simple scraping jobs, not bad at all. If you've followed any of my stuff here, you probably know that one of my credos is "don't reinvent the wheel". By looking at other people's code, learning how it works, and then adding to or improving it, we not only get the job done faster, we also become better developers. That's why I'm big on writing articles that share code, and my samples at GotdotNet.com have been downloaded over 34,000 times as of this writing. It's just my way of giving back to the developer community that has been so generous and selfless in helping me to grow as a developer.

Now! After I had picked myself up off the floor from giggling about the funny name, I set to work. The only "extra" I really needed was some code to parse the string value out of the correct returned <textarea> element.

Not only that, but I also discovered that even though the Google official form makes a POST, it is perfectly happy with a GET - which makes everything that much simpler. The only issue is it reduces the amount of text you can translate to about 2000 characters or so. Not a big deal for me, since it's only for quick translations of short phrases.
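To make the GET approach a bit more concrete, here is a minimal sketch of building the request URL with the text properly escaped and the length checked up front. The class name TranslateUrlSketch, the BuildUrl helper, and the hard 2000-character cutoff are all assumptions made for illustration (the real limit depends on the servers and proxies in between), and Uri.EscapeDataString needs a newer framework than the VS.NET 2003 one used for the download.

```csharp
using System;

public static class TranslateUrlSketch
{
    // Assumed ceiling, taken from the "about 2000 characters" figure in the text.
    public const int MaxTextLength = 2000;

    // Endpoint used elsewhere in this article.
    private const string BaseUrl = "http://translate.google.com/translate_t";

    // Builds the GET URL for a language pair written as "from|to", e.g. "en|de".
    public static string BuildUrl(string text, string langPair)
    {
        if (text == null) throw new ArgumentNullException("text");
        if (langPair == null) throw new ArgumentNullException("langPair");
        if (text.Length > MaxTextLength)
            throw new ArgumentException("Text is too long for a GET request.", "text");

        // Escaping can roughly triple the length, so check again afterwards.
        string encodedText = Uri.EscapeDataString(text);
        if (encodedText.Length > MaxTextLength)
            throw new ArgumentException("Escaped text is too long for a GET request.", "text");

        // EscapeDataString turns ' ' into %20 and '|' into %7C.
        return BaseUrl + "?text=" + encodedText
            + "&langpair=" + Uri.EscapeDataString(langPair);
    }
}
```

Calling TranslateUrlSketch.BuildUrl("Hello world", "en|de") yields a URL whose query string is text=Hello%20world&langpair=en%7Cde, which is the same %7C form the hand-written language pair strings use.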

So now, let's take a looky at the code I cobbled together. I added my utility code right into the WebWagon ("Moo! Moo!") assembly:

using System;
using System.Text.RegularExpressions;
using System.Data;
using System.Collections;

namespace HttpUtils
{
 /// <summary>
 /// Google Translation Utility Class (c) Peter A. Bromberg 2005 - Public Domain
 /// </summary>
 public enum LangPair
 {
  EnglishToGerman,
  EnglishToSpanish,
  EnglishToFrench,
  EnglishToItalian,
  EnglishToPortuguese,
  EnglishToJapanese,
  EnglishToKorean,
  EnglishToChineseSimplified,
  GermanToEnglish,
  GermanToFrench,
  SpanishToEnglish,
  FrenchToEnglish,
  FrenchToGerman,
  ItalianToEnglish,
  PortugueseToEnglish,
  JapaneseToEnglish,
  KoreanToEnglish,
  ChineseSimplifiedToEnglish
 }

 public class TranslateUtil
 {
  private TranslateUtil()
  {
  }

  // Returns the enum names, preceded by a placeholder entry, for databinding.
  public static ArrayList GetLangPairs()
  {
   ArrayList al = new ArrayList();
   Array vals = Enum.GetValues(typeof(LangPair));
   al.Add("Please Select");
   foreach (object o in vals)
    al.Add(o.ToString());
   return al;
  }
  
  public static string GetTranslatedText(string textToTranslate , LangPair langPair)
  {  
   string strLangPair=String.Empty ;

   switch(langPair)
   {
    case (LangPair.ChineseSimplifiedToEnglish):
     strLangPair = "zh-CN%7Cen";
     break;
    case (LangPair.EnglishToChineseSimplified):
     strLangPair = "en%7Czh-CN";
     break;
    case (LangPair.EnglishToFrench):
     strLangPair = "en%7Cfr";
     break;
    case (LangPair.EnglishToGerman):
     strLangPair = "en%7Cde";
     break;
    case (LangPair.EnglishToItalian):
     strLangPair = "en%7Cit";
     break;
    case (LangPair.EnglishToJapanese):
     strLangPair = "en%7Cja";
     break;
    case (LangPair.EnglishToKorean):
     strLangPair = "en%7Cko";
     break;
    case (LangPair.EnglishToPortuguese):
     strLangPair = "en%7Cpt";
     break;
    case (LangPair.EnglishToSpanish):
     strLangPair = "en%7Ces";
     break;
    case (LangPair.FrenchToEnglish):
     strLangPair = "fr%7Cen";
     break;
    case (LangPair.FrenchToGerman):
     strLangPair = "fr%7Cde";
     break;
    case (LangPair.GermanToEnglish):
     strLangPair = "de%7Cen";
     break;
    case (LangPair.GermanToFrench):
     strLangPair = "de%7Cfr";
     break;
    case (LangPair.ItalianToEnglish):
     strLangPair = "it%7Cen";
     break;
    case (LangPair.JapaneseToEnglish):
     strLangPair = "ja%7Cen";
     break;
    case (LangPair.KoreanToEnglish):
     strLangPair = "ko%7Cen";
     break;
    case (LangPair.PortugueseToEnglish):
     strLangPair = "pt%7Cen";
     break;
    case (LangPair.SpanishToEnglish):
     strLangPair = "es%7Cen";
     break;
    default:
     strLangPair = "en%7Cde";
     break;
   }

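   // Build the GET request. The text is URL-encoded so spaces, '&' and '#'
   // survive the trip (HttpUtility needs a reference to System.Web.dll).
   WebWagon.HTMLPage ww = new WebWagon.HTMLPage();
   ww.LoadSource("http://translate.google.com/translate_t?text="
    + System.Web.HttpUtility.UrlEncode(textToTranslate) + "&langpair=" + strLangPair);

   // The first <textarea> on the result page holds the translation.
   string[] stuff = ww.GetTagsByName("textarea");
   if (stuff == null || stuff.Length == 0)
    return String.Empty;

   // Capture the text between the opening and closing tags.
   Regex findData = new Regex(@"<(?<tag>.*).*>(?<text>.*)</\k<tag>>");
   Match foundData = findData.Match(stuff[0]);
   return foundData.Groups["text"].Value;
  }
 }
}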

You can see above that I decided to put my Google language pairs in as an Enum, and added a static method

public static ArrayList GetLangPairs()

to return these as an ArrayList (along with "Please Select" as the first item), which is perfect for databinding to a dropdownlist, either in a Windows Form or a Web Form.
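Because the list holds plain strings, the selected item has to be turned back into a LangPair before calling GetTranslatedText. The following is only a sketch of one way to do that: the LangPairSelection class and its TryGetLangPair helper are hypothetical names that are not part of the download, and the enum is a trimmed copy so the snippet compiles on its own.

```csharp
using System;

// Trimmed copy of the article's enum, repeated so the sketch is self-contained.
public enum LangPair { EnglishToGerman, EnglishToSpanish, GermanToEnglish }

public static class LangPairSelection
{
    // Converts a dropdown item back to a LangPair.
    // Returns false for the "Please Select" placeholder or any unknown text.
    public static bool TryGetLangPair(string selectedItem, out LangPair pair)
    {
        pair = LangPair.EnglishToGerman; // placeholder value; only meaningful when true is returned
        if (selectedItem == null || !Enum.IsDefined(typeof(LangPair), selectedItem))
            return false;

        pair = (LangPair)Enum.Parse(typeof(LangPair), selectedItem);
        return true;
    }
}
```

In a form's button handler, the idea would be to pass the dropdown's selected text to TryGetLangPair and only call the translation method when it returns true, so the placeholder entry never reaches the switch statement.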

Then I have the

public static string GetTranslatedText(string textToTranslate , LangPair langPair)

method, which accepts the text to translate and the required language pair enum, and then creates the correct string for the web call to Google in a switch statement. Finally, we do the WebWagon ("Moo! Moo!") call and use its GetTagsByName method to get the first TEXTAREA (because that's the one with the result) and I apply a final regex,

Regex findData = new Regex(@"<(?<tag>.*).*>(?<text>.*)</\k<tag>>");

to extract the text value of this element, and return it as the return value of the method. DONE! And it looks like so:
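Stepping back to the regex for a moment, it can be tried on its own, away from the network call. The sketch below runs the same pattern against a hand-written sample element; the sample markup and the TextAreaRegexDemo and ExtractText names are made up for illustration and are not real Google output or part of the download.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TextAreaRegexDemo
{
    // Same pattern as in the article: "tag" captures the element name,
    // and \k<tag> insists that the closing tag repeats it.
    private static readonly Regex FindData =
        new Regex(@"<(?<tag>.*).*>(?<text>.*)</\k<tag>>");

    // Returns the text between the opening and closing tags,
    // or an empty string when the pattern does not match.
    public static string ExtractText(string element)
    {
        Match m = FindData.Match(element);
        return m.Success ? m.Groups["text"].Value : String.Empty;
    }
}
```

For the sample element <textarea name=q rows=5 cols=45 wrap=PHYSICAL>Hallo Welt</textarea>, ExtractText returns Hallo Welt. One thing to keep in mind is that '.' does not match a line break by default, so a result that spans several lines comes back empty with this pattern; RegexOptions.Singleline looks like the natural thing to try in that case, though that is not something tested against the real page here.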

Both an ASP.NET and a Windows Forms app are included with the solution, which also includes Jon Vote's excellent WebWagon ("Mooooo!") and my Utility class. Enjoy!


Download the complete VS.NET 2003 solution below

 

 


Peter Bromberg is a C# MVP, MCP, and .NET consultant who has worked in the banking and financial industry for 20 years. He has architected and developed web-based corporate distributed application solutions since 1995, and focuses exclusively on the .NET Platform.
