Iterators, iterator blocks, data pipelines and LINQ

LINQ to Objects is based on iterators. These have always been part of .NET, and indeed are the basis of the foreach statement - but with LINQ to Objects they're more important than ever before. In this article Jon explores what iterators are about, how to implement them in C# 2 (or 3), and then the fundamental aspects of LINQ to Objects. Jon's new book, "C# In Depth" will be awarded to our top two Contest winners in March 2008.

"It is the mind that makes the body rich"  -- Shakespeare

What's an iterator?

The iterator pattern is a common one, and in .NET it's encapsulated in the IEnumerable and IEnumerator interfaces, and their generic counterparts. (Most of the time I'll use the nongeneric form for descriptions and the generic form in sample code. Obviously the generic form is strongly typed where the nongeneric form isn't, but the only other difference is that the generic form extends IDisposable.)

The basic idea is that as a data consumer, you can ask an IEnumerable for an IEnumerator with the GetEnumerator() call, and then iterate over the contents of the IEnumerator (using the MoveNext() method and Current property) until either you no longer need any more data, or the iterator runs out of information to return.

For example, a simple foreach loop over IEnumerable<string> looks something like this:

IEnumerator<string> iterator = dataSource.GetEnumerator(); 
while (iterator.MoveNext()) 
{ 
    string data = iterator.Current; 
    // Use data within the body of the loop 
}
The above code doesn't call Dispose on the iterator, for simplicity: the code generated for a foreach loop uses a try/finally construct to make sure that the iterator is disposed when we've finished with it. This is important when the iterator is implemented by magic of yield return statements which we'll see later, but it's not crucial to the basic pattern itself.

Why are there two interfaces involved?

You may well be asking yourself why IEnumerable doesn't have MoveNext() and Current members itself - why do we need an extra interface? Well, imagine you had two different loops iterating over the same list. You don't want the loops to interfere with each other, so you need two independent representations of the idea of "how much of the list I've looked at so far". This concept is precisely what IEnumerator is for. Typically it keeps a separate piece of information (such as the current index in the list) as well as referring to the collection itself. Although iterators are typically used for in-memory collections, they certainly don't have to be, as we'll see later. The key is the "ask for data, process it, ask for more data" cycle.

Iterator blocks (C# 2)

In C# 1, implementing iterators was a pain. You had to make your collection type implement IEnumerable, then create a separate type with a reference to an instance of its "parent", usually adding some more state for how far you'd already gone. It was very, very easy to end up with subtle off-by-one errors, and generally developers didn't do it unless they really had to.

In C# 2, the compiler does all the hard work for you when you use iterator blocks to implement either IEnumerable or IEnumerator (or the generic forms). It builds a state machine for you, and the iterator obtained effectively executes the code within the iterator block, yielding values as it goes. It's easier to demonstrate than to explain - here's a complete program:

using System;
using System.Collections.Generic;

class IteratorBlockDemo
{
    static IEnumerable<string> GetDemoEnumerable()
    {
        yield return "start";
        
        for (int i=0; i < 5; i++)
        {
            yield return i.ToString();
        }
        
        yield return "end";
    }
    
    static void Main()
    {
        foreach (string x in GetDemoEnumerable())
        {
            Console.WriteLine(x);
        }
    }
}
The output of the program is simple:

start
0
1
2
3
4
end

I won't go into all the details of how iterator blocks work (see chapter 6 of C# in Depth for more information, or the language specification for a lot of detail) but it's important to understand the flow:

  1. Main calls GetDemoEnumerable()
  2. GetDemoEnumerable() creates a new instance of the extra type generated by the compiler. None of the source code we've written executes yet.
  3. Main calls MoveNext()
  4. The iterator executes code until it reaches a yield statement. In our case, this happens immediately. The iterator remembers that the current item should be "start" and returns true to indicate that there is data available.
  5. Main uses the Current property to retrieve the data, then prints it out.
  6. Main calls MoveNext() again
  7. The iterator continues execution from the point it had previously reached - in other words, it goes to the line after the first yield return. As before, it executes code (initialising the i variable) until it reaches the next yield statement.
  8. ... The pattern repeats, until there's a call to MoveNext() which reaches the end of the method - at which point the call returns false to indicate that there's no more data available.

There are three very important (related) points here: first, that none of our original source code is executed until the first call to MoveNext(). Second, that subsequent calls to MoveNext() effectively jump back into the source code at the point they left off before. Yielding a value is sort of like "pausing" the method. Third, the compiler makes sure that the state of the method (in terms of local variables) is preserved - even though the i variable is local inside the iterator block, its value is kept inside the iterator so that next time MoveNext() is called, it can still be used.

These features of iterator blocks make them completely different from other methods - it can be quite tricky to get your head round how they work, until it suddenly clicks. It can be quite useful to debug into an iterator block to really see how the flow works for yourself. Now we know about iterators and how they can be easily implemented in C# 2, let's think about how we can chain them together.

Data pipelines and composability

LINQ to Objects is built on the concept of a data pipeline. We start with a data source which implements IEnumerable, and chain together lots of operations, each one acting on the result of the previous one and returning another IEnumerable - although it could be one with a different type of result. This act of chaining small operations together to form one large operation is called composition. The emphasis in composition is of making each of the small operations genuinely simple. Flexibility is provided in LINQ to Objects by the use of delegates which can act on the input elements - for example, to act as a predicate in a filter, indicating whether or not a particular value should pass the filter; or as a projection to map one value to another.

As a silly demonstration of this, suppose we have a sequence of numbers, in some random order. We want a result which is a sequence of strings - we want to filter out any negative numbers, sort the rest, and then call ToString to get the numbers in hexadecimal format. In C# 3 and .NET 3.5, we can do this with the following query expression:

from number in numbers
where number >= 0
orderby number
select number.ToString("x")

The parser translates this into a non-query expression before the compiler deals with it any further:

numbers.Where (number => number >= 0)
       .OrderBy (number => number)
       .Select (number => number.ToString("x"))

The => syntax is for lambda expressions - a simple way of creating delegates (and also expression trees). The Where, OrderBy and Select methods are extension methods provided by LINQ to Objects to do filtering, ordering and projections respectively. As you can tell, there are lots of different features interacting here, and I don't intend to go into the details of most of them - we're just concentrating on the behaviour of iterators in this article.

Deferred execution vs immediate execution

Just as when we created our own iterator using iterator blocks, the expression above doesn't execute any significant code when it's executed - it sets up the pipeline, but won't actually ask for any data. It only starts doing any actual filtering/ordering/projecting when the return value of Select is first asked for some data. This is called deferred execution. Many LINQ to Objects query operators use deferred execution, especially when they're conceptually building more of a pipeline. Other query operators such as Count(), Max() and ToList() use immediate execution - which means they perform their work immediately, and give you a result with all the appropriate data.

It's important to be aware of what's going on here so you know when operations will actually take place. For instance, if you specify a projection which might throw the exception, you need to know that the exception will only occur when the data is being requested: just because the call to Select doesn't throw an exception doesn't mean everything is okay!

Streaming vs buffering

There's another concept which is similar to deferred/immediate execution, but subtly different - and that's whether an operator streams the data or buffers it. An operator which streams data basically only asks for as much data (from the previous part of the pipeline) as it it needs to give the next result. Select() is the simplest example of this - it retrieves the next item, applies the specified delegate to project it to the result, and then yields that result. Where() is slightly more complicated - it will keep asking for more data until either there is no more data, or until it finds an element which passes its filter. When it finds a passing element, however, it will yield that value without asking for any more data.

Buffering operators act differently. When they're asked for data, they buffer up all the data from the earlier piece of the pipeline, and then work out all the results. Reverse() is the simplest example of this - in order to return all of the data in the reverse order, you've got to get to the last element before you can yield anything. Sorting and grouping also buffer data in LINQ to Objects.

Generators

LINQ to Objects contains a few generator operators - ones which produce a sequence of data without starting with one. DefaultIfEmpty is listed as a generator, but in some ways it isn't - it acts on an existing sequence, albeit one which may be empty. We won't look at that here.

There are three other standard generator operators: Empty, Repeat and Range. We'll look at possible implementations of each of them, and then work out how we could implement all of them slightly differently using composition. Of course, in reality all of these are implemented for us already, but it gives nice practice for understanding iterator blocks and composition.

Empty

Empty does exactly what you'd expect it to - it returns an empty sequence, i.e. one where the first call to MoveNext() returns false. Strangely enough, implementing it isn't quite as straightforward as we might expect. We don't need to return any values, so we can just leave the method body empty, right? Well, no... we actually need code like this:

public static IEnumerable<TResult> Empty<TResult> ()
{
    yield break;
}
The yield break; statement is an "early out" in iterator blocks - it acts like the end of the method, making MoveNext() return false. Why do we need it at all? Well, with an empty method body there's nothing to tell the compiler that we want an iterator block in the first place! The compiler only treats a method as being implemented with an iterator block if it encounters yield return or yield break within the method body. This is the only case I can think of where a yield break directly before the end of a method actually has an effect.

Repeat

Repeat takes two parameters - an element to return, and the number of times to return it. It will yield the same element however many times you've specified, and then stop. It's trivial to implement (including checking that the count isn't negative):

public static IEnumerable<TResult> Repeat<TResult>(TResult element, int count)
{
    if (count < 0)
    {
        throw new ArgumentOutOfRangeException("count");
    }
    for (int i=0; i < count; i++)
    {
        yield return element;
    }
}
A note on error checking: there's a problem here. Ideally, we would like our Range method to throw an exception immediately we call it, if count is less than 0. Currently, we'll only throw the exception when some code tries to iterate through the result. That's generally a bad thing, and it would be nice if the language gave us more support here. For the moment though, we'll leave our slightly broken implementation as it is.

Range

Range is also straightforward. Unlike the previous two operators, it's not generic: it always returns an IEnumerable<int>. You specify two parameters: the starting point, and how many values to return. The returned sequence starts at the given value, and counts up - so Range(10, 3) will yield 10, 11, 12. Here's an implementation, omitting error checking:

public static IEnumerable<int> Range(int start, int count)
{
    for (int i=start; i < start+count; i++)
    {
        yield return i;
    }
}

One generator to rule them all

The thing about the above generators is that they're all a bit specialized. In particular, they take no delegates, unlike almost everything else in LINQ to Objects! Let's introduce a new generator. It's so general, I reckon it deserves the name Generate. It will produce an infinite sequence by taking a starting point, yielding that, and then applying a delegate to the current value, yielding the current value, and repeating ad infinitum. Here's the code (with no error checking - we'd normally check that step wasn't null):

public static IEnumerable<TSource> Generate<TSource>(TSource start,
                                                     Func<TSource,TSource> step)
{
    TSource current = start;
    while (true)
    {
        yield return current;
        current = step(current);
    }
}

The code is as simple as we'd hope - but of course, we'd never want to actually iterate over the result of Generate directly - it would keep going forever. Indeed, Generate is almost useless without composition. However, by the cunning use of just a single other operator, we can easily produce the three generators we've already looked at. Rather than show any infinite sequences here, I'll wait for a moment before actually using Generate.

Take

Take is what I think of as a "pipeline" operator - it takes a sequence and returns a sequence, just like Select and Where do. As well as the "source" sequence, you need to specify how many elements of the source you want to be returned. Take simply yields elements from the source until either the source runs out of data, or the requested number of elements have been returned. So, the implementation (minus error checking) could look like this:

public static IEnumerable<TSource> Take<TSource>(this IEnumerable<TSource> source,
                                                 int count)
{
    foreach (TSource element in source)
    {
        if (count==0)
        {
            yield break;
        }
        count--;
        yield return element;
    }
}
Note the use of yield break here to indicate that we've finished when the counter reaches 0. (We could have used a simple break in this particular case, as there's no more code after the loop, but I thought it would be worth demonstrating a slightly more normal use of yield break than the one in our Empty implementation.) The use of this in the declaration of the source parameter indicates that Take is an extension method.

With Take and Generate in our toolkit, we can now reimplement Empty, Repeat and Range using composition:

public static IEnumerable<TResult> Repeat<TResult>(TResult element, int count)
{
    return Generate(element, x => x).Take(count);
}
    
public static IEnumerable<TResult> Empty<TResult>()
{
    return Repeat(default(TResult), 0);
}
    
public static IEnumerable<int> Range(int start, int count)
{
    return Generate(start, x => x+1).Take(count);
}
Repeat is just implemented by first creating an infinite sequence which begins with the specified element and then applies an identity function to it at each step. We then cut that infinite sequence to the required length by composing it with Take. An Empty sequence is simply the default value of the required type, repeated zero times. Finally Range creates another infinite sequence - starting with the specified value, it just adds one repeatedly - and again we Take the required number of elements.

It's not particularly useful to actually implement the three generator functions in this manner - but it does give an interesting insight into composition. The fact that Generate always creates an infinite sequence forces you to think about the fact that the values are only generated as they're requested - if Generate were to try to create all the values first, and then return them in one go, you'd clearly run out of memory quite quickly. I have a somewhat more complicated example, taken from a blog post which contains the full source. It's a query expression which is used to generate a visualisation of the Mandelbrot set:

var query = from row in Enumerable.Range(0, ImageHeight)
            from col in Enumerable.Range(0, ImageWidth)
            // Work out the initial complex value from the row and column
            let c = new Complex((col * SampleWidth) / ImageWidth + OffsetX,
                                (row * SampleHeight) / ImageHeight + OffsetY)
            // Work out the number of iterations
            select Generate(c, x => x * x + c).TakeWhile(x => x.SquareLength < 4)
                                              .Take(MaxIterations)
                                              .Count() into count
            // Map that to an appropriate byte value
            select (byte)(count == MaxIterations ? 0 : (count % 255) + 1);
The use of Generate here is within the nested query used to find the colour to use for a particular point. We generate an infinite path of values which jump around the field of complex numbers, stopping when either the point has a modulus of 2 or more, or we've performed as many iterations as we're interested in. The full code is downloadable from the blog post, but I hope this gives some indication of how powerful the composition technique is.

Conclusion

LINQ to Objects is full of sequences, and a bunch of simple operators which combine to form myriad possibilities. In C# in Depth, I recommend that when you're faced with a complicated query expression, you look at the sequences from which it's composed. Understanding how iterators work, streaming/buffering and deferred/immediate execution gives important foundation material to being able to work with LINQ effectively. Composing queries is fun, and the iterator blocks which first appeared in C# 2 make iterators reasonably easy to work with. In this article I've given just a few examples, but I'm sure you can come up with your own. Go ahead, create your own operators - and above all, experiment.

For many of us, LINQ isn't just a new technology, but a new way of approaching problems when writing in C#. It always takes a while to really get to grips with fresh ideas, so I'd heartily recommend playing with LINQ before you need to use it professionally. While the other LINQ providers are perhaps "sexier" than LINQ to Objects, there's a purity, simplicity and immediacy involved when you can just define a collection in memory and manipulate it to your heart's content.

Likewise iterator blocks are a mystery to many developers. Unless you've already been thinking in sequences, you may well have never written an iterator block. Hopefully this article has shown you how simple they can be, along with a couple of warnings where you may have expected slightly different behaviour. Iterator blocks were already useful in C# 2, but when combined with LINQ they can be truly spectacular, all within tiny pieces of code. Again - experiment. It'll all be useful in the long run, and if you're anything like me, I promise you'll have a lot of fun along the way.

By Jon Skeet   Popularity  (3589 Views)
Biography - Jon Skeet
Jon Skeet has worked with C# since 2002, and has been a Microsoft C# MVP since October 2003. He has spent a great amount of time in the C# community answering questions in newsgroups as well as writing articles on the most misunderstood aspects of C# and .NET. After having read tens of thousands of questions over the years, Jon has developed a deep insight into the areas that developers have trouble with, as well as what they’re trying to achieve. A keen reader of specifications, Jon aims to understand the language at the deepest level, which enables him to provide a detailed exposition of C#, including a few dark corners which can trip up the unwary developer. Jon's new book is C# In Depth