Generic Feed Parsers Redux

Generic canonicalized feed parsing with a .NET library

Some time ago I published code that uses an XmlTextReader to “blast” through any kind of XML feed document (Atom, RSS, etc.). A switch statement canonicalizes the items in the feed into a “GenericFeedItem” class, and the resulting List<GenericFeedItem> is returned to the caller. The code looks something like this:

foreach (Dictionary<string, string> d in items)
{
    GenericFeedItem itm = new GenericFeedItem();
    // Switch on each key of the item's Dictionary<string, string>
    foreach (string k in d.Keys)
    {
        switch (k)
        {
            case "title":
                itm.Title = d[k];
                break;
            case "link":
                itm.Link = d[k];
                break;
            case "published":   // Atom 1.0
            case "pubDate":     // RSS
            case "issued":      // Atom 0.3
                DateTime dt;
                bool ok = Rfc822DateTime.TryParse(d[k], out dt);
                // Fall back to the current time if the date cannot be parsed
                itm.PubDate = ok ? dt : DateTime.Now;
                break;
            case "content":       // Atom
            case "description":   // RSS
                itm.Description = d[k];
                break;
            default:
                // Ignore any element we do not map
                break;
        }
    }
    // Add the created item to our List
    itemList.Add(itm);
}
return itemList;
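
The fragment above does not show how items gets populated. As a rough illustration only (this is not the originally published code), a single forward-only reader pass along the following lines could collect each RSS item or Atom entry into a dictionary. The FeedItemReader class and its ReadItems method are hypothetical names, XmlReader.Create stands in for XmlTextReader, nested elements are simply flattened by local name, and feeds that carry a DTD are not handled:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper; not part of the original code or of any library.
public static class FeedItemReader
{
    public static List<Dictionary<string, string>> ReadItems(string xml)
    {
        var items = new List<Dictionary<string, string>>();
        Dictionary<string, string> current = null; // item/entry being read
        string currentName = null;                 // element whose text we expect

        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        if (reader.LocalName == "item" || reader.LocalName == "entry")
                        {
                            // RSS <item> or Atom <entry> starts a new dictionary
                            current = new Dictionary<string, string>();
                            currentName = null;
                        }
                        else if (current != null)
                        {
                            currentName = reader.LocalName;
                            // Atom keeps the link URL in an href attribute
                            string href = reader.GetAttribute("href");
                            if (currentName == "link" && href != null)
                                current["link"] = href;
                            if (reader.IsEmptyElement)
                                currentName = null; // no text will follow
                        }
                        break;

                    case XmlNodeType.Text:
                    case XmlNodeType.CDATA:
                        if (current != null && currentName != null)
                            current[currentName] = reader.Value;
                        break;

                    case XmlNodeType.EndElement:
                        if (reader.LocalName == "item" || reader.LocalName == "entry")
                        {
                            if (current != null) items.Add(current);
                            current = null;
                        }
                        currentName = null;
                        break;
                }
            }
        }
        return items;
    }
}
```

Because the keys are element local names, the same switch statement works whether the dictionaries came from an RSS or an Atom document.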

Recently I found the “Public Domain” project on CodePlex. While this project is no longer maintained, it contains a great deal of usable “common scenario” code, including a very nice implementation of various feed parsers and generators (including OPML) that offer a “DistilledFeed” option very similar to my original concept. Whether you have pointed your FeedParser at an Atom or an RSS feed, you can convert the result to a DistilledFeed, which contains normalized, canonical element names (e.g., title, link, description, publicationDate).

So I pulled from the much larger PublicDomain project only those classes necessary to build these FeedParsers, which results in a much smaller assembly. Using it to grab the DistilledFeed items from a list of feed URLs is simple:

string s = Parser.ReadUriStream((string) url, 2000); // 2 second timeout
var ms = new MemoryStream(Encoding.UTF8.GetBytes(s));
var p = new FeedParser();
var dfeed = (DistilledFeed) p.CreateFeed<Feed>(ms).Distill();

Finally, I created a “Tester” console program that carries a subset of my Google Reader feed list, exported as OPML, as an embedded resource. The OPML is read into a DataSet, where the table containing the "xmlUrl" field is Tables[3]. The program then iterates over the feed URLs, parses each feed, and shows the number of “DistilledFeed” items returned for it. Feed titles are checked for duplicates, and any exceptions are logged to the console. All of this runs on the ThreadPool, so it is pretty fast.
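
For readers who want the general shape of that loop without opening the download, here is a minimal sketch. It is not the Tester program's actual code: FeedRunner, Run, and the parse delegate are hypothetical names, and the delegate (assumed to return a feed's title and item count) stands in for the real FeedParser call so that the sketch stays self-contained:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical helper; the parse delegate is a stand-in for the real feed parsing.
public static class FeedRunner
{
    // Returns item counts keyed by feed title, skipping duplicate titles.
    public static Dictionary<string, int> Run(
        IEnumerable<string> urls,
        Func<string, KeyValuePair<string, int>> parse)
    {
        var results = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        var gate = new object();
        var urlList = new List<string>(urls);
        if (urlList.Count == 0) return results;

        int pending = urlList.Count;
        using (var done = new ManualResetEvent(false))
        {
            foreach (string url in urlList)
            {
                string current = url; // local copy so each work item sees its own URL
                ThreadPool.QueueUserWorkItem(_ =>
                {
                    try
                    {
                        KeyValuePair<string, int> r = parse(current);
                        lock (gate)
                        {
                            // Duplicate check on the feed title
                            if (!results.ContainsKey(r.Key))
                            {
                                results.Add(r.Key, r.Value);
                                Console.WriteLine("{0}: {1} items", r.Key, r.Value);
                            }
                        }
                    }
                    catch (Exception ex)
                    {
                        // Log and carry on; one bad feed should not stop the rest
                        Console.WriteLine("{0}: {1}", current, ex.Message);
                    }
                    finally
                    {
                        // Signal once the last work item has finished
                        if (Interlocked.Decrement(ref pending) == 0) done.Set();
                    }
                });
            }
            done.WaitOne();
        }
        return results;
    }
}
```

The lock keeps the duplicate check and the insert atomic, and the countdown with a ManualResetEvent lets the caller block until every queued feed has either been parsed or logged as failed.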

Looking back, I probably should have used LINQ to XML to get the feed URLs out of the embedded OPML resource, but DataSet.ReadXml is so darned convenient! Old habits die hard, don't they...
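
For comparison, a LINQ to XML version might look roughly like the following. This is only a sketch: OpmlReader and GetFeedUrls are hypothetical names, and the OPML is assumed to have already been read from the embedded resource into a string:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper; not part of the downloadable sample.
public static class OpmlReader
{
    // Returns the distinct, non-empty xmlUrl values of all <outline> elements.
    public static List<string> GetFeedUrls(string opmlXml)
    {
        XDocument doc = XDocument.Parse(opmlXml);
        return doc.Descendants("outline")
                  // The cast yields null when the attribute is missing (folder outlines)
                  .Select(o => (string)o.Attribute("xmlUrl"))
                  .Where(u => !string.IsNullOrEmpty(u))
                  .Distinct(StringComparer.OrdinalIgnoreCase)
                  .ToList();
    }
}
```

This avoids depending on a table index such as Tables[3], since the query picks up every outline that carries an xmlUrl attribute no matter how deeply the folders are nested.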

If you need to gather “generic” feed items from a list of feed URLs regardless of the type of feed, then this code is for you. You can download and play with the sample implementation solution here.

By Peter Bromberg