A Lexical Analysis of Obama's State of the Union Speech

Each year I like to play with the text of the president's State of the Union speech to get the word frequencies. You can glean some pretty good information from how many times certain words are used. This year I did it with LINQ as a fun programming exercise.

For this exercise, I put together an extension method on the String class called GetWordFrequency. This lets us call the method directly on the string of text that makes up the president's speech.

Here is the method:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

namespace SOTU
{
    public static class CustomExtensions
    {

         public static string[] stopwords = {
                                               "a",
                                               "about",
                                               "above",
                                               "across",
                                               "after",
                                               "again",
                                               "against",
                                               "all",
                                               "almost",
                                               "alone",
                                               "along",
                                               "already",
                                               "also",
                                               "although",
                                               "always"
                                                    // rest of list abbreviated - full list in code sample in the download
                                           };
        /// <summary>
        /// Analyzes word frequency for a given string.
        /// </summary>
        public static Dictionary<string, int> GetWordFrequency(this string input)
        {
            return input
                // split the text into individual tokens on spaces
                .Split(new char[] { ' ' })
                // drop empty tokens and tokens with no word characters
                .Where(i => i.Trim() != String.Empty && Regex.IsMatch(i, @"\w"))
                // strip trailing punctuation and normalize to lowercase
                .Select(i => Regex.Replace(i, @"[^A-Za-z0-9]+$", "").ToLower())
                // discard anything on the stopword list
                .Where(x => !stopwords.Contains(x))
                // group identical words, order by frequency, and count them
                .GroupBy(w => w)
                .OrderByDescending(group => group.Count())
                .ToDictionary(group => group.Key, group => group.Count());
        }
    }
}

We start out with a string array of stopwords. These are common words like "a", "and", "the" and so on, which we're not really interested in counting.

Then I construct a LINQ query that splits the string into words, discards whitespace and tokens with no word characters, strips trailing punctuation, converts everything to lowercase, throws away anything that's in the stopwords list, and finally groups the remaining words to count how often each one appears.
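
To see what the query produces, here's a quick sanity check on a made-up sentence. The sample text and the expected output in the comments are mine, for illustration only, and these lines would go inside a console app's Main with the usual usings; they aren't part of the download:

// Illustrative only: a short sample sentence instead of the full speech text.
string sample = "Energy, energy again: clean energy creates jobs, jobs across America.";
Dictionary<string, int> counts = sample.GetWordFrequency();

foreach (var pair in counts)
    Console.WriteLine("{0},{1}", pair.Key, pair.Value);

// Prints something like the following (assuming "again" and "across"
// are the only stopwords hit by this sample):
// energy,3
// jobs,2
// clean,1
// creates,1
// america,1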

The result is returned to the caller, where it can be displayed or, as in this case, saved to a file for further analysis.
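
A minimal console driver along those lines might look like the sketch below. The file names are placeholders of my own choosing; the downloadable solution has its own entry point and file layout:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

namespace SOTU
{
    class Program
    {
        static void Main(string[] args)
        {
            // Placeholder file names - not necessarily those used in the download.
            string speech = File.ReadAllText("sotu2012.txt");
            Dictionary<string, int> frequencies = speech.GetWordFrequency();

            // Save every word,count pair for further analysis...
            File.WriteAllLines("wordcounts.csv",
                frequencies.Select(pair => pair.Key + "," + pair.Value));

            // ...and echo the 40 most used words to the console.
            foreach (var pair in frequencies.Take(40))
                Console.WriteLine("{0},{1}", pair.Key, pair.Value);
        }
    }
}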

Here are the first 40 "most used" words from ObamaSpeak:

american,33
jobs,28
america,27
energy,23
tax,23
people,20
americans,18
country,17
congress,15
world,14
help,14
businesses,13
don't,12
economy,12
built,12
you're,12
million,11
tonight,11
i'm,11
workers,11
business,11
companies,11
pay,11
financial,10
oil,10
home,9
rules,9
debt,9
industry,9
job,9
gas,9
clean,9
own,8
stop,8
nearly,8
let's,8
taxes,8
education,8
power,8
government,8

My original method had a secondary loop, but thanks to a suggestion by fellow MVP Chris Eargle, the above code is even more efficient.
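
For comparison, here's a rough sketch of what a loop-based version might look like if you added it to the same CustomExtensions class. This is purely illustrative (the method name GetWordFrequencyWithLoop is mine); it is not my original code or Chris's suggestion:

// Sketch of a loop-based alternative, for comparison only.
public static Dictionary<string, int> GetWordFrequencyWithLoop(this string input)
{
    var counts = new Dictionary<string, int>();

    var words = input
        .Split(new char[] { ' ' })
        .Where(i => i.Trim() != String.Empty && Regex.IsMatch(i, @"\w"))
        .Select(i => Regex.Replace(i, @"[^A-Za-z0-9]+$", "").ToLower())
        .Where(x => !stopwords.Contains(x));

    // Secondary loop: tally each word by hand instead of using GroupBy.
    foreach (string word in words)
    {
        if (counts.ContainsKey(word))
            counts[word]++;
        else
            counts[word] = 1;
    }

    return counts
        .OrderByDescending(pair => pair.Value)
        .ToDictionary(pair => pair.Key, pair => pair.Value);
}

The single GroupBy query reads more cleanly and avoids maintaining the intermediate dictionary by hand.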

You can download the sample solution, which includes the text of the speech, here.

By Peter Bromberg