Fast KeyWord String searching Algorithms

This is a question that apparently comes up frequently: What is the fastest way to search a string of text for a bunch of keywords (e.g., a "badwords" list for cleaning or disallowing forum spam, etc.). To the average developer, two approaches come to mind: First, the "brute force" String.IndexOf method. Second, using REGEX to find matches based on an OR (x|y|z..) expression.

Peter Duniho, a frequent poster at the MS C# language newsgroup, answered this with some information about a "State machine" method that piqued my interest. You can look at the original thread here:
http://groups.google.com/group/microsoft.public.dotnet.languages.csharp/browse_thread/thread/462037d91c08e693/530e07a059045f2c?hl=en&lnk=gst&q=Peter+Duniho#530e07a059045f2c

After doing a bit more research, I came upon Tomas Petricek's implementation of the Aho-Corasick algorithm here:

http://tomasp.net/articles/ahocorasick.aspx

An interesting observation Petricek made was that for less than 70 keywords it is better to use String.IndexOf; Regular expressions are almost always slower than other algorithms, even if compiled. As a point of reference, with 200 keywords searched in a text of 10Kb length, Aho took 6.17ms, String.IndexOf took 12.6795ms, and Regex took 89.1643ms.

The Aho-Corasick algorithm is a string searching algorithm created by Alfred V. Aho and Margaret J. Corasick. It is a kind of dictionary-matching algorithm that locates elements of a finite set of strings (the "dictionary") within an input text. It matches all patterns "at once", so the complexity of the algorithm is linear in the length of the patterns plus the length of the searched text plus the number of output matches. Because all matches are found, there can be a quadratic number of matches if every substring matches (e.g. dictionary = a, aa, aaa, aaaa and input string is aaaa).

Informally, the algorithm constructs a trie with suffix tree-like set of links from each node representing a string (e.g. abc) to the node corresponding to the longest proper suffix (e.g. bc if it exists, else c if that exists, else the root). It also contains links from each node to the longest suffix node that corresponds to a dictionary entry; thus all of the matches may be enumerated by following the resulting linked list. It then uses the trie at runtime, moving along the input and keeping the longest match, using the suffix links to make sure that computation is linear. For every node that is in the dictionary and every link along the dictionary suffix linked list, an output is generated.

When the pattern dictionary is known in advance (e.g. a computer virus database), the construction of the automaton can be performed once off-line and the compiled automaton stored for later use. In this case, its run time is linear in the length of the input plus the number of matched entries.

The Aho-Corasick algorithm formed the basis of the original Unix command fgrep.

So in the case of a list of keywords, since the list is unlikely to change and the main overhead here is the generation of the original tree, I modified Tomas' code slightly to store the generated tree in the AppDomain cache and check for its presence there first. This cuts way down on the time required for repeated searches.

For a test, I grabbed a DataTable of 999 common "stopwords" (e.g. "a", "and", "the", etc.) and stored them as an XmlDocument in the project. This then can be easily loaded at runtime and the required string array of keywords generated. 

Then, I created a Windows Forms test harness that offers the ability to test all three of Petricek's methods:

  • FindAll, which returns an array of StringSearchResult objects that can be easily databound;
  • ContainsAny, which returns a boolean once any of the keywords in the list is found in the target string, and
  • FindFirst, which returns a single instance of StringSearchResult on the first match.

Each StringSearchResult struct contains both the found keyword and it's index position in the target document.

Here is the modified BuildTree method that is the heart of the algorithm:

private void BuildTree()
        {
          //  System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch();
           // sw.Start();
            // use AppDomain Cache to avoid overhead of regeneration of tree every time.
            if (AppDomain.CurrentDomain.GetData("root") != null)
                this._root = (TreeNode)AppDomain.CurrentDomain.GetData("root");
            else  // no tree in Cache, generate it and store in Cache
            {
                this._root = new TreeNode(null, ' ');
                TreeNode node1 = null;
                foreach (string str in this._keywords)
                {
                    node1 = this._root;
                    foreach (char ch in str)
                    {
                        TreeNode node2 = null;
                        foreach (TreeNode node3 in node1.Transitions)
                        {
                            if (node3.Char == ch)
                            {
                                node2 = node3;
                                break;
                            }
                        }
                        if (node2 == null)
                        {
                            node2 = new TreeNode(node1, ch);
                            node1.AddTransition(node2);
                        }
                        node1 = node2;
                    }
                    node1.AddResult(str);
                }
                ArrayList list = new ArrayList();
                foreach (TreeNode node in this._root.Transitions)
                {
                    node.Failure = this._root;
                    foreach (TreeNode node3 in node.Transitions)
                    {
                        list.Add(node3);
                    }
                }
                while (list.Count != 0)
                {
                    ArrayList list2 = new ArrayList();
                    Char ch;
                    foreach (TreeNode node in list)
                    {
                        TreeNode failure = node.Parent.Failure;
                        ch = node.Char;
                        while ((failure != null) && !failure.ContainsTransition(ch))
                        {
                            failure = failure.Failure;
                        }
                        if (failure == null)
                        {
                            node.Failure = this._root;
                        }
                        else
                        {
                            node.Failure = failure.GetTransition(ch);
                            foreach (string str2 in node.Failure.Results)
                            {
                                node.AddResult(str2);
                            }
                        }
                        foreach (TreeNode node5 in node.Transitions)
                        {
                            list2.Add(node5);
                        }
                    }
                    list = list2;
                }
                this._root.Failure = this._root;
                System.AppDomain.CurrentDomain.SetData("root", this._root);
            }
           // sw.Stop();
           // System.Diagnostics.Debug.WriteLine(sw.Elapsed.ToString() );
        }

You can enter any HTTP url in the TextBox, and the app will fetch the "Page" and apply the selected method to this as it's target document.

I've got some stopwatch code that will give you timings. Caching the generated Tree definitely cuts down on the timing after the first run.

There are other algorithms that were discussed for multiple keyword matching, but so far it appears that the Aho-Corasick algorithm is the fastest for this particular type of need.

Download the Visual Studio 2008 Solution here.

By Peter Bromberg   Popularity  (3966 Views)