In-Memory Data Caching for Performance
By Peter A. Bromberg, Ph.D.

Recently, while architecting and implementing a set of classes to handle form-post-to-XML conversions, I discovered that you can do a lot of very cool stuff with DataSets in memory, especially when you have a repetitive process, such as a complicated transform or translation, that needs to operate on a large number of sets of data in some sort of iterative loop. If a lengthy, repetitive operation needs to check other data stores for metadata or rules that tell it how to process each particular set of data, an in-memory cached DataSet can be far more performant. I had already experimented extensively with techniques like getting a DataSet for one WebForm page, displaying it in a DataGrid, caching the DataSet, and having template columns in the grid consisting of hyperlinks that would take you to a second edit page. That page simply got the same DataSet out of cache, set a RowFilter on the DataView, and let you edit the specific record you had clicked on, without having to make a second trip to the database.



In this particular case, the issue was automated form posts of corporate leads that could come from a number of different sources all over the world, and some pretty sophisticated rules processing needed to be implemented. Some posts might, for example, come from Japan, where data is typically encoded in UTF-7 (7-bit encoding), a charset which is not at all fun to decode into US-ASCII and which consequently requires a specific set of rules. Or perhaps a particular post is already in XML format, and what you receive is a form post with a single name/value pair, the value being a complete string representation of an XML document that mostly needs to be validated and mapped, either through a set of rules containing XPath mapping queries, or with an XSL transform, or both.

In addition, each form post might internally contain metadata that needed to be extracted, dictating that in its particular case a special set of rules must be used instead of the default rule set for that type of document. Hundreds, even thousands, of these leads could come in on a daily basis, and you need a process that can efficiently perform the conversions on chunks of, say, 100 or 200 records at a time: process them, convert each one into a "master" XML format that is mapped, based on sets of rules, to be consistent regardless of the type of post received, save each mapped record, write some metadata and status information to some tables, and then go to sleep until the process is woken up and told to get the next chunk of records. Consequently, a large set of processing rules containing Regex match expressions and/or XPath queries needed to be stored in a special "rules" table, and I needed to be able to access and manipulate a specific subset of those rules "on the fly" based on the metadata for each record in the processing queue.

Avoid Repetitive "Chatty" Database Calls

When you have a process like this, you want to avoid going back to the database every time you need to look up some metadata about how to process a particular record. Whether the process is an executable fired off by a scheduler component such as SQL Agent or the Windows Task Scheduler, or a Service driven by a timer or other controller process, you want to get all the data you need for your entire operation in one shot and cache it somewhere reliable. Then you can do your processing very fast, without the delays of repeated trips back to the data store, because you have already assembled all the information you need in memory.

Using the AppDomain Cache

For an executable, you do not have the ASP.NET Cache, Session or Application classes to work with. Instead, I use the AppDomain cache for this type of operation. The advantage of this cache, which is little-known to most .NET developers, is that any of your related components or classes in any satellite DLL to your process have access to this same cache, since they are all running under the same AppDomain.

You can store or retrieve virtually any object in the AppDomain cache with code like the following:

System.AppDomain.CurrentDomain.SetData("mystuff", myObject);

MyType myObject = (MyType)System.AppDomain.CurrentDomain.GetData("mystuff");

Note that when you get it back out, you may need to do an explicit cast. VB.NET developers should consider using DirectCast, not "CType". There is just one caveat, which can be kind of embarrassing if you have to discover it yourself, as I did over two years ago -- data in the App Domain cache is stored by reference. So for example, if you pull an XmlDocument instance out of your "mystuff" cache item, and you start filling in elements with data, and then you pull out another instance from the cached item (you think!) to start filling in data for another XmlDocument, guess what? Your second XmlDocument is still going to have all the data you put into the first one! What we need to do in this situation is use the Clone method to rapidly create an exact copy of what we got out of the cache item. In that case we will be working with a completely cache-disconnected new instance.
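Here is a minimal sketch of the by-reference pitfall and the Clone fix described above. The cache key "mystuff" and the XmlDocument content are invented for illustration:

```csharp
using System;
using System.Xml;

class CacheCloneDemo
{
    static void Main()
    {
        // Seed the AppDomain cache with a template document.
        XmlDocument template = new XmlDocument();
        template.LoadXml("<lead/>");
        AppDomain.CurrentDomain.SetData("mystuff", template);

        // WRONG: this is the SAME instance that lives in the cache.
        XmlDocument first = (XmlDocument)AppDomain.CurrentDomain.GetData("mystuff");
        first.DocumentElement.SetAttribute("name", "Smith");

        // A second Get returns the same reference -- complete with Smith's data.
        XmlDocument second = (XmlDocument)AppDomain.CurrentDomain.GetData("mystuff");
        Console.WriteLine(second.DocumentElement.GetAttribute("name")); // "Smith"

        // RIGHT: Clone gives a completely cache-disconnected copy to work on.
        XmlDocument safeCopy = (XmlDocument)template.Clone();
        safeCopy.DocumentElement.SetAttribute("name", "Jones");
        Console.WriteLine(template.DocumentElement.GetAttribute("name")); // still "Smith"
    }
}
```

Note that XmlDocument.Clone() returns an XmlNode, so the cast back to XmlDocument is required.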

And if you're concerned about speed, don't worry. This cache is humongously faster than making a SQL call to your database to keep fetching the same or similar data. You can use the various DataView Sort and Filter operations to query your DataSet from the cache much faster than you could get the specific data out of your database, especially under load. We'll look at some of these next.

Speed Counts - Meet the DataView: Your In-Memory Stored Procedure

The key to effective use of cached data in DataSets is the DataView and its related classes, DataTable, DataRow, and DataRelation, which allow you to do a surprisingly sophisticated amount of SQL query-like filtering and sorting on in-memory data.

A DataSet is a collection of DataTables and metadata. A DataTable is a collection of DataRows and other data. A DataRow is a collection of fields and other data. None of these have built-in capabilities for filtering or sorting the data they contain - that's the domain of the DataView and DataViewManager.
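The containment hierarchy can be sketched in a few lines (the table and column names here are invented for illustration):

```csharp
using System;
using System.Data;

class HierarchyDemo
{
    static void Main()
    {
        DataSet ds = new DataSet("Leads");          // a DataSet holds DataTables
        DataTable rules = ds.Tables.Add("Rules");   // a DataTable holds DataRows
        rules.Columns.Add("RuleID", typeof(int));
        rules.Columns.Add("Pattern", typeof(string));

        DataRow row = rules.NewRow();               // a DataRow holds fields
        row["RuleID"] = 1;
        row["Pattern"] = "^UTF-7";
        rules.Rows.Add(row);

        // Filtering and sorting live in the DataView, not in these classes:
        DataView dv = new DataView(rules);
        dv.RowFilter = "RuleID = 1";
        Console.WriteLine(dv.Count); // 1
    }
}
```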

Customized Data View Filters

The DataView class is simply a view of a DataTable. At any point you can have different views of the same underlying data, and you can manage each view as an independent object with its own set of properties, methods, and events. DataView objects retain a reference to the underlying table and allow for update operations. A DataView is easy to create:

DataView dv;
dv = new DataView(theDataSet.Tables["Employees"]);

The DefaultView property of a DataTable returns a DataView object on that table. The DefaultView object is cached internally, so any changes you make to it are preserved between retrievals:

DataView dv = dt.DefaultView;

With your DataView, you have two tools: the RowFilter, which lets you specify the criteria that rows must match in order to be visible through the view, and the Sort, which takes an expression and sorts the rows by it. You can use the Find method to locate a row by a value in the sorted column, and you can employ any combination of the two.

RowFilter is a read/write property that gets and sets the expression used to filter the table. You assign it an expression that can be formed by any valid combination of column names, logical and numeric operators, and literals. Here are a few examples:

dv.RowFilter = "Country = 'USA'";
dv.RowFilter = "EmployeeID > 5 AND Birthdate < #1/1/68#";
dv.RowFilter = "LastName LIKE 'Dard*'";

NOTE: You can also use the Select Method on a DataTable. The Select method is very fast, even though the DataTable is not indexed, because the records are all in-memory, so there is no disk I/O overhead as with a database.

However, though ADO.NET does not expose indexes, it does create and use them, both for Primary Keys and for DataViews. You can take advantage of the indexes created for Primary Keys or DataViews as follows:

If you are searching on a Primary Key field, or if you can make the field(s) you are searching on a Primary Key, then instead of DataTable.Select(), you can use DataTable.Rows.Find(). Find() will look up a primary key value using the built index. Here is some code that illustrates the DataTable Select method:

[VB.NET]
Dim customerTable As DataTable
customerTable = myDataSet.Tables(0)

' Add a new ID column
customerTable.Columns.Add("id", GetType(Integer))
' Set the PrimaryKey
customerTable.Columns("id").Unique = True
customerTable.PrimaryKey = New DataColumn() {customerTable.Columns("id")}
customerTable.AcceptChanges()

Dim strExpr As String
Dim strSort As String
strExpr = "id > 5"
' Sort descending by the name column.
strSort = "name DESC"
' Use the Select method to find all rows matching the filter.
Dim foundRows As DataRow() = _
    customerTable.Select(strExpr, strSort, DataViewRowState.CurrentRows)

Your DataTable also sports a Rows.Find method:

[C#]
private void FindInPrimaryKeyColumn(DataTable myTable, long n){
// Find the number n in the primary key column of the table.
DataRow foundRow = myTable.Rows.Find(n);
// Print the value of column 1 of the found row.
if(foundRow != null)
Console.WriteLine(foundRow[1]);
}

If you are doing repeated searches on a column or set of columns, you can create a DataView sorted on those columns, and then use DataView.FindRows() to search for a value within the columns. DataViews create an index for the sorted columns, and FindRows() uses this. Also, when using a DataView, it is more efficient to use the constructor that accepts the table, filter, sort, and rowversions together. Each time you set the filter, sort, or rowversion properties your index is recreated, so creating a DataView and setting these properties individually can actually create the index several times. Here is an example:

DataView dv =
new DataView(ds.Tables["Suppliers"], "Country = 'US'", "CompanyName", DataViewRowState.CurrentRows);
dv.AllowEdit = true;
dv.AllowNew = true;
dv.AllowDelete = true;

More on DataViews

With a filter string you use AND, OR, NOT to tie together smaller expressions. You can also use parentheses to group clauses and force precedence.

Any clause in a filter contains a column name that is normally compared against a literal, a number, a date, or another column name. You can use both relational and arithmetic operators such as >=, <, >, +, *, %. If the rows to select don't follow a clear pattern that can be easily rendered through arithmetic or logical operators, you can use the IN operator, just as you would with Transact-SQL:

dv.RowFilter = "employeeID IN (12,144,54)"

Wildcard characters, * and %, are supported and are used with the LIKE operator. Wildcards are allowed only at the beginning and/or end of a filter string. Strings must be enclosed within single quotes, and dates require pound symbols (#). Numeric values can utilize decimals and scientific notation.

The RowFilter also supports aggregates such as SUM, COUNT, MIN, MAX, and AVG. If a table has no rows, the aggregate functions return a null reference.
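Aggregates can also be evaluated directly with the DataTable's Compute method, which takes an aggregate expression and a filter. A small sketch (the table and column names are invented for illustration):

```csharp
using System;
using System.Data;

class AggregateDemo
{
    static void Main()
    {
        DataTable orders = new DataTable("Orders");
        orders.Columns.Add("Country", typeof(string));
        orders.Columns.Add("Amount", typeof(decimal));
        orders.Rows.Add("USA", 100m);
        orders.Rows.Add("USA", 250m);
        orders.Rows.Add("Japan", 75m);

        // SUM of Amount over only the rows that match the filter.
        decimal usaTotal = (decimal)orders.Compute("SUM(Amount)", "Country = 'USA'");
        Console.WriteLine(usaTotal); // 350
    }
}
```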

Len() returns the length of a specified expression, which can be a column name or any other valid expression. Substring() returns a sub-string of a specified length, starting at a specified point in the given expression. IIF() resolves to one of two values depending on the result of the given logical expression, and works just like the Visual Basic IIF(expression, value_if_true, value_if_false) function.
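These functions can be combined in a computed column's expression. A quick sketch, with invented column names and data:

```csharp
using System;
using System.Data;

class ExpressionDemo
{
    static void Main()
    {
        DataTable leads = new DataTable("Leads");
        leads.Columns.Add("Name", typeof(string));
        leads.Columns.Add("Score", typeof(int));
        // Computed column combining IIF and SUBSTRING in one expression.
        leads.Columns.Add("Tag", typeof(string),
            "IIF(Score > 50, 'hot-' + SUBSTRING(Name, 1, 3), 'cold')");
        leads.Rows.Add("Dardanelli", 80);
        leads.Rows.Add("Smith", 10);

        Console.WriteLine(leads.Rows[0]["Tag"]); // hot-Dar
        Console.WriteLine(leads.Rows[1]["Tag"]); // cold
    }
}
```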

Using combinations of the above tools you can perform very sophisticated queries on your in-memory data without ever having to make a call over the wire to your database.

Sort your way to the Data

A DataView exposes a Sort property that you can use to sort the content of the view. Sort takes a comma-separated list of column expressions and sorts the view by that list. By adding a trailing ASC or DESC qualifier to any of the column expressions, you can sort that field in the specified ascending or descending order. If no direction qualifier is specified, then the sorting algorithm defaults to ASC. What's also useful with a Sort is that you have a Find method in the DataView:

private void FindValueInDataView(DataTable tbl)
{
    DataView dv = new DataView(tbl);
    dv.Sort = "LastName";
    // Find the customer named "Smith" in the LastName column.
    int i = dv.Find("Smith");
    if (i > -1)
        Response.Write(dv[i]["LastName"]);
}


DataView also exposes the RowStateFilter property, which filters a DataTable based on the state of its rows:

CurrentRows: Current rows, including unchanged, new, and modified rows.
Deleted: All rows that have been deleted since the last AcceptChanges call.
ModifiedCurrent: All rows that have been modified since the last AcceptChanges call.
ModifiedOriginal: The original version of rows that have been modified since the last AcceptChanges call.
New: All new rows that have been added since the last AcceptChanges call.
OriginalRows: The original set of rows, including unchanged and deleted rows.
Unchanged: All rows not affected by changes.
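A short sketch of the RowStateFilter in action (table and data invented; note that the enum member for new rows is named DataViewRowState.Added):

```csharp
using System;
using System.Data;

class RowStateDemo
{
    static void Main()
    {
        DataTable t = new DataTable("Emps");
        t.Columns.Add("Name", typeof(string));
        t.Rows.Add("Smith");
        t.AcceptChanges();      // baseline: everything is now "Unchanged"

        t.Rows.Add("Jones");    // one new row since the last AcceptChanges

        DataView added = new DataView(t);
        added.RowStateFilter = DataViewRowState.Added;
        Console.WriteLine(added.Count); // 1 -- only Jones
    }
}
```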

Any changes you make become effective after you call AcceptChanges on the DataTable. Changes on a single row can be made effective by calling the DataRow's AcceptChanges. Changes can be canceled by calling RejectChanges, either on the DataTable class (for all the rows) or on a single DataRow object.

The DataViewManager is the container for all view settings for each table in a DataSet. The key property of a DataViewManager is DataViewSettings, which is a collection of DataViewSetting objects:


Dim dvm As DataViewManager = myDataSet.DefaultViewManager
Dim dvs As DataViewSetting
dvs = dvm.DataViewSettings("Customers")
dvs.Sort = "Company"

DataRelations and Constraints enforce referential integrity "on the fly"

Finally, you can enforce Primary Key and other constraints between the tables in your DataSets dynamically once your data has been assembled into the in-memory DataSet. For example:

UniqueConstraint pkey;
pkey = new UniqueConstraint("PK_Emplist", ds.Tables["empList"].Columns["EmpID"], true);
ds.Tables["empList"].Constraints.Add(pkey);

With code similar to the above, plus some Rule property adjustments, we can enforce cascading deletes, inserts and updates just as we would in SQL Server. Being able to define a ForeignKeyConstraint on an in-memory DataSet can greatly amplify our programming power. DataRelation objects are the way we bind the columns of tables together, so the business logic works the same way as it does in the database they came from.
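A sketch of a DataRelation enforcing referential integrity in memory (the tables, columns and relation name are invented; adding a relation via Relations.Add creates the ForeignKeyConstraint for you):

```csharp
using System;
using System.Data;

class RelationDemo
{
    static void Main()
    {
        DataSet ds = new DataSet();
        DataTable depts = ds.Tables.Add("Depts");
        depts.Columns.Add("DeptID", typeof(int));
        depts.PrimaryKey = new DataColumn[] { depts.Columns["DeptID"] };

        DataTable emps = ds.Tables.Add("Emps");
        emps.Columns.Add("EmpID", typeof(int));
        emps.Columns.Add("DeptID", typeof(int));

        // Binds the columns together and creates a ForeignKeyConstraint.
        ds.Relations.Add("FK_Emp_Dept",
            depts.Columns["DeptID"], emps.Columns["DeptID"]);

        depts.Rows.Add(1);
        emps.Rows.Add(100, 1);        // OK: parent row exists
        try
        {
            emps.Rows.Add(101, 2);    // no Dept 2: integrity is enforced
        }
        catch (InvalidConstraintException)
        {
            Console.WriteLine("orphan row rejected");
        }
    }
}
```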

Cache as Cache Can

If you are processing large amounts of cached data in an iterative manner, these tools can speed up your processing potentially a hundredfold and keep network traffic to your tired SQL Server to a minimum. You can iterate through an in-memory table of data row-by-row, use values in each row to set a filter or sort query on another DataTable in memory, then use the data in that query row to pull data out of a third table that contains, for example, rules on how to process the data in the first table's row. You can then update fields in the row in the first table and continue processing through all the rows, without ever needing to hit your database. Even if you are running a Console Application or WinForms app, you can retrieve the data for all these tables into a DataSet that you store in the AppDomain cache one time, and never need to return to the database for this data as long as the app is running.
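A compressed sketch of that pattern: iterate one cached table, use each row's values to look up a matching rule in a second cached table through a sorted DataView, and write the result back. All names and data are invented for illustration:

```csharp
using System;
using System.Data;

class RuleLookupDemo
{
    static void Main()
    {
        // Cached "queue" and "rules" tables, normally loaded once from SQL.
        DataTable queue = new DataTable("Queue");
        queue.Columns.Add("LeadID", typeof(int));
        queue.Columns.Add("PostType", typeof(string));
        queue.Columns.Add("Status", typeof(string));
        queue.Rows.Add(1, "XML", null);
        queue.Rows.Add(2, "UTF7", null);

        DataTable rules = new DataTable("Rules");
        rules.Columns.Add("PostType", typeof(string));
        rules.Columns.Add("Action", typeof(string));
        rules.Rows.Add("XML", "validate+xslt");
        rules.Rows.Add("UTF7", "decode");

        // Sorted DataView: builds an index once, reused by every FindRows call.
        DataView ruleView = new DataView(rules, "", "PostType",
            DataViewRowState.CurrentRows);

        foreach (DataRow lead in queue.Rows)
        {
            // In-memory lookup instead of a round trip to the database.
            DataRowView[] match = ruleView.FindRows(lead["PostType"]);
            lead["Status"] = match.Length > 0
                ? (string)match[0]["Action"] : "no-rule";
        }
        Console.WriteLine(queue.Rows[1]["Status"]); // decode
    }
}
```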

When you are finished, you can even use a DataAdapter and its properly configured Update command to send all the updates back to your SQL Server in a single call. By using this technique, I was able to do some heavy-duty in-memory processing on a set of 233 records in about 9 seconds, an operation which would easily have taken 15 to 30 seconds or more, and generated considerable network traffic, had I chosen to repetitively query the database for my data each time I processed a new row. Proper use of caching mechanisms like these in .NET can result in up to a 1000% improvement in processing efficiency over legacy COM-based systems that don't employ caching.
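A sketch of the single-call update, assuming a SQL Server connection string and using a SqlCommandBuilder to generate the update command (the table and column names are invented, and this obviously needs a live database to run):

```csharp
using System.Data;
using System.Data.SqlClient;

class BatchUpdateSketch
{
    // Push every pending change in the cached table back in one Update call.
    static void SaveChanges(DataSet ds, string connString)
    {
        using (SqlConnection conn = new SqlConnection(connString))
        {
            SqlDataAdapter da = new SqlDataAdapter(
                "SELECT LeadID, Status FROM LeadQueue", conn);
            // Generates the INSERT/UPDATE/DELETE commands from the SELECT.
            SqlCommandBuilder cb = new SqlCommandBuilder(da);
            da.Update(ds, "LeadQueue"); // sends all pending changes
        }
    }
}
```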

 


Peter Bromberg is a C# MVP, MCP, and .NET consultant who has worked in the banking and financial industry for 20 years. He has architected and developed web-based corporate distributed application solutions since 1995, and focuses exclusively on the .NET platform.