For this exercise, I put together an extension method on the String class called
GetWordFrequency. This allows us to call the method directly on the string
of text that comprises the president's speech.
Here is the method:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

namespace SOTU
{
    public static class CustomExtensions
    {
        public static string[] stopwords = {
            "a",
            "about",
            "above",
            "across",
            "after",
            "again",
            "against",
            "all",
            "almost",
            "alone",
            "along",
            "already",
            "also",
            "although",
            "always"
            // rest of list abbreviated - full list in code sample in the download
        };

        /// <summary>
        /// Analyze word frequency for a given string.
        /// </summary>
        public static Dictionary<string, int> GetWordFrequency(this string input)
        {
            return input
                // Split the text into tokens on single spaces.
                .Split(new char[] { ' ' })
                // Keep only tokens that contain at least one word character.
                .Where(i => i.Trim() != String.Empty && Regex.IsMatch(i, @"\w"))
                // Strip trailing punctuation and convert to lowercase.
                .Select(i => Regex.Replace(i, @"[^A-Za-z0-9]+$", "").ToLower())
                // Throw away stopwords.
                .Where(x => !stopwords.Contains(x))
                // Group identical words and sort the groups by descending count.
                .GroupBy(w => w)
                .OrderByDescending(group => group.Count())
                .ToDictionary(group => group.Key, group => group.Count());
        }
    }
}
We start out with a string array of stopwords. These are common words like "a",
"and", "the" and so on, which we're not really interested in.
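Then I construct a LINQ query that splits the string into words on spaces, discards
empty tokens and tokens with no word characters, strips trailing punctuation,
converts to lowercase and finally throws away anything that's in the stopwords list.
The remaining words are grouped and counted, and the groups are sorted by descending count.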
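To make the stages of the query easier to follow, here is a rough sketch that breaks the same pipeline into named steps. The class name PipelineWalkthrough, its method names and the shortened SampleStopwords list are made up for this illustration and are not part of the download.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class PipelineWalkthrough
{
    // Hypothetical, shortened stopword list used only for this illustration.
    private static readonly string[] SampleStopwords = { "a", "and", "the", "we", "will" };

    // Step 1: split the text into tokens on single spaces.
    public static string[] Tokenize(string input) =>
        input.Split(new[] { ' ' });

    // Step 2: drop empty tokens and tokens with no word characters (e.g. "--").
    public static IEnumerable<string> KeepWords(IEnumerable<string> tokens) =>
        tokens.Where(t => t.Trim() != String.Empty && Regex.IsMatch(t, @"\w"));

    // Step 3: strip trailing non-alphanumeric characters and convert to lowercase.
    public static IEnumerable<string> Normalize(IEnumerable<string> tokens) =>
        tokens.Select(t => Regex.Replace(t, @"[^A-Za-z0-9]+$", "").ToLower());

    // Step 4: throw away anything in the stopword list.
    public static IEnumerable<string> RemoveStopwords(IEnumerable<string> words) =>
        words.Where(w => !SampleStopwords.Contains(w));

    // Step 5: group identical words and count each group, largest group first.
    public static Dictionary<string, int> Count(IEnumerable<string> words) =>
        words.GroupBy(w => w)
             .OrderByDescending(g => g.Count())
             .ToDictionary(g => g.Key, g => g.Count());

    // Runs all five steps in order.
    public static Dictionary<string, int> Run(string input) =>
        Count(RemoveStopwords(Normalize(KeepWords(Tokenize(input)))));
}
```

For the input "We will create jobs, jobs and  energy -- the American  way." the empty tokens and the "--" are dropped in step 2, "jobs," and "way." lose their trailing punctuation in step 3, and the result contains five words with "jobs" counted twice. Note that only trailing punctuation is stripped, so an opening quote in front of a word would stay attached to it.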
The result is returned to the caller, which can then display it or, as in this case,
also save it to a file for further analysis.
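As a minimal sketch of that last step, something like the following could turn the returned dictionary into "word,count" lines and write them to disk. The WordFrequencyOutput class and its method names are hypothetical and may differ from what the sample solution actually does. The helper sorts by count again before writing, because Dictionary does not promise any particular enumeration order.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper for displaying or saving word counts.
public static class WordFrequencyOutput
{
    // Formats the 'top' most frequent entries as "word,count", highest count first.
    public static IEnumerable<string> ToCsvLines(IDictionary<string, int> frequencies, int top)
    {
        return frequencies
            .OrderByDescending(pair => pair.Value)
            .Take(top)
            .Select(pair => pair.Key + "," + pair.Value);
    }

    // Writes the formatted lines to the given file, one entry per line.
    public static void Save(IDictionary<string, int> frequencies, string path, int top)
    {
        File.WriteAllLines(path, ToCsvLines(frequencies, top));
    }
}
```

Assuming the speech text is held in a string variable named speechText, a call such as WordFrequencyOutput.Save(speechText.GetWordFrequency(), "wordfrequency.csv", 40) would write the 40 most used words to a file; both the variable name and the file name here are placeholders.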
Here are the first 40 "most used" words from ObamaSpeak:
american,33
jobs,28
america,27
energy,23
tax,23
people,20
americans,18
country,17
congress,15
world,14
help,14
businesses,13
don't,12
economy,12
built,12
you're,12
million,11
tonight,11
i'm,11
workers,11
business,11
companies,11
pay,11
financial,10
oil,10
home,9
rules,9
debt,9
industry,9
job,9
gas,9
clean,9
own,8
stop,8
nearly,8
let's,8
taxes,8
education,8
power,8
government,8
My original method had a secondary loop, but thanks to a suggestion by fellow MVP Chris Eargle, the above code is even more efficient.
You can download the sample solution, which includes the text of the speech, here.