Native DataSet Compression in Visual Studio.NET 2005
By Peter A. Bromberg, Ph.D.

Among other enhancements in Visual Studio.NET 2005, we now have the System.IO.Compression namespace, which offers industry-standard Deflate ("zip") and GZip compression codecs. I've been somewhat involved in this area, particularly with the binary serialization and compression of DataSets for remoting over the wire, so I thought it would be worthwhile to look at two things. The first is the new SerializationFormat.Binary value for the DataSet's RemotingFormat property, which lets a DataSet describe itself to the BinaryFormatter serialization infrastructure in a way that is truly "binary" serializable, instead of handing a big glop of textual XML to the formatter's Serialize method. The second is how well the resulting byte array can be compressed using the new native compression classes.

The first thing I decided to do was build a convenient "helper" class, which I've appropriately named "DataSetCodec". This is simply a wrapper class that lets you call the static methods DataSetCodec.CompressDataSet and DataSetCodec.DecompressDataSet.



After all, I reasoned - if this turned out to be really useful I might as well make it easy to use for DataSets. So the first code snippet I present for your viewing pleasure is the DataSetCodec class:

#region Using directives

using System;
using System.Collections.Generic;
using System.Text;
using System.Data;
using System.Runtime.Serialization.Formatters.Binary;
using System.IO;
using System.IO.Compression;

#endregion

namespace DSCompression
{
    public class DataSetCodec
    {
        // private ctor; all members are static
        private DataSetCodec()
        {
        }

        public static byte[] CompressDataSet(DataSet ds)
        {
            // Serialize the DataSet in true binary form rather than as embedded XML.
            ds.RemotingFormat = SerializationFormat.Binary;
            BinaryFormatter bf = new BinaryFormatter();
            MemoryStream ms = new MemoryStream();
            bf.Serialize(ms, ds);
            byte[] inbyt = ms.ToArray();

            // Deflate the serialized bytes into a second MemoryStream.
            MemoryStream objStream = new MemoryStream();
            DeflateStream objZS = new DeflateStream(objStream, CompressionMode.Compress);
            objZS.Write(inbyt, 0, inbyt.Length);
            objZS.Flush();
            objZS.Close(); // closing the stream writes out the final compressed block
            return objStream.ToArray();
        }

        public static DataSet DecompressDataSet(byte[] bytDs, out int len)
        {
            System.Diagnostics.Debug.Write(bytDs.Length.ToString());
            DataSet outDs = new DataSet();
            MemoryStream inMs = new MemoryStream(bytDs);
            inMs.Seek(0, 0);

            // Inflate the whole payload into a byte array.
            DeflateStream zipStream = new DeflateStream(inMs, CompressionMode.Decompress, true);
            byte[] outByt = ReadFullStream(zipStream);
            zipStream.Flush();
            zipStream.Close();

            MemoryStream outMs = new MemoryStream(outByt);
            outMs.Seek(0, 0);
            outDs.RemotingFormat = SerializationFormat.Binary;
            BinaryFormatter bf = new BinaryFormatter();

            // len reports the decompressed size, before deserialization.
            len = (int)outMs.Length;
            outDs = (DataSet)bf.Deserialize(outMs, null);
            return outDs;
        }

        // Reads a stream to its end in 32 KB chunks and returns everything read.
        public static byte[] ReadFullStream(Stream stream)
        {
            byte[] buffer = new byte[32768];
            using (MemoryStream ms = new MemoryStream())
            {
                while (true)
                {
                    int read = stream.Read(buffer, 0, buffer.Length);
                    if (read <= 0)
                        return ms.ToArray();
                    ms.Write(buffer, 0, read);
                }
            }
        }
    }
}

Now that we have an easy way to send in a DataSet and get back a zipped byte array, or send in a zipped byte array and get back a DataSet, we can proceed to some testing scenarios to see what we are really dealing with. Besides wanting to see that this actually worked, I was also interested in what features the compression codecs offered and what kind of compression ratios were achievable. Notice that in the DecompressDataSet method above, I've added an out parameter that returns the integer size of the decompressed MemoryStream prior to deserialization, so that it can be displayed if desired.
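The helper above uses DeflateStream, but the same namespace also provides GZipStream, which wraps the same Deflate data in a small header and checksum trailer. The following is only a minimal sketch of a byte-level round trip with either codec, not part of the downloadable solution; the names CodecSketch, Compress, Decompress and the useGZip flag are made up for illustration, and the sample bytes are artificial.

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Hypothetical helper, not part of the article's DataSetCodec class.
public static class CodecSketch
{
    // Compresses with GZipStream when useGZip is true, otherwise with DeflateStream.
    public static byte[] Compress(byte[] input, bool useGZip)
    {
        using (MemoryStream output = new MemoryStream())
        {
            Stream codec = useGZip
                ? (Stream)new GZipStream(output, CompressionMode.Compress, true)
                : new DeflateStream(output, CompressionMode.Compress, true);
            using (codec)
            {
                codec.Write(input, 0, input.Length);
            } // disposing the codec flushes the final compressed block
            return output.ToArray();
        }
    }

    // Reverses Compress; the same useGZip value must be passed.
    public static byte[] Decompress(byte[] input, bool useGZip)
    {
        using (MemoryStream source = new MemoryStream(input))
        using (MemoryStream output = new MemoryStream())
        {
            Stream codec = useGZip
                ? (Stream)new GZipStream(source, CompressionMode.Decompress)
                : new DeflateStream(source, CompressionMode.Decompress);
            using (codec)
            {
                byte[] buffer = new byte[32768];
                int read;
                while ((read = codec.Read(buffer, 0, buffer.Length)) > 0)
                    output.Write(buffer, 0, read);
            }
            return output.ToArray();
        }
    }

    public static void Main()
    {
        // Artificial, highly repetitive sample data (an assumption for the demo).
        byte[] original = new byte[100000];
        for (int i = 0; i < original.Length; i++)
            original[i] = (byte)(i % 50);

        byte[] deflated = Compress(original, false);
        byte[] gzipped = Compress(original, true);
        Console.WriteLine("Original: {0} bytes", original.Length);
        Console.WriteLine("Deflate:  {0} bytes", deflated.Length);
        Console.WriteLine("GZip:     {0} bytes", gzipped.Length);
        Console.WriteLine("Round trip length: {0}", Decompress(gzipped, true).Length);
    }
}
```

Passing true as the third constructor argument leaves the underlying MemoryStream open, so ToArray can be called after the codec has been disposed and has written its last block.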

The next step was to add the SharpZipLib compression library as a comparison test. I have worked with Mike Krueger's C# Zip library since its inception, and he and his group have done a fine job, so I knew that it would make a good comparison. The one thing I could not do is use Angelo Scotto's CompactFormatter with .NET Framework 2.0, as it simply isn't ready for that. However, I've left the CompactFormatter code, along with my DataSetSurrogate and Wrapper class, in the download in case anyone is enterprising enough to play with it. Users should note that this is the most recent version that Scotto sent me, and he has given me permission to put it out. His boss has given him "a lot of work" and I guess he just hasn't had the time to continue work as he would like to.

So the final result is that we have a nice Windows Forms test harness with a button to load a DataSet off the file system and display the first table in a DataGridView. We then have two sets of Compress and Decompress buttons: one set to use the new 2005 native compression classes, and the other set to use SharpZipLib.

As can be seen in the screen cap below, the compressed length of the native 2005 compressed DataSet is 262952 bytes. That's down from its original uncompressed size of 675005 bytes, or about 39 percent of the original (compressed size compared to original size). However, the SharpZipLib method with a compression level of 9 (maximum compression) came in at about 32 percent of the original size. Unfortunately, the native compression classes don't offer all the features of SharpZipLib, such as a selectable compression level, zip passwords, a full set of zip file handling classes, and other methods such as BZip2, which, although somewhat slower than Zip, can provide up to a 15% tighter lossless compression result. You can also see I wasn't too concerned about my spelling errors. (Maybe it was because I was still working on my rendition of the Swedish Chef codec for our December Programming Contest.)
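The ratio quoted above is just the compressed size divided by the original size. As a quick check of that arithmetic, here is a small sketch using the two byte counts reported for the native codec; the class and method names are hypothetical.

```csharp
using System;
using System.Globalization;

// Hypothetical helper for checking the figures quoted in the article.
public static class RatioSketch
{
    // Compressed size expressed as a percentage of the original size.
    public static double PercentOfOriginal(long compressedBytes, long originalBytes)
    {
        return 100.0 * compressedBytes / originalBytes;
    }

    public static void Main()
    {
        // Byte counts reported for the native Deflate test.
        double pct = PercentOfOriginal(262952, 675005);

        // Prints 39.0 (percent of the original size).
        Console.WriteLine(pct.ToString("F1", CultureInfo.InvariantCulture));

        // Prints 61.0 (percent saved, the complementary way of quoting the result).
        Console.WriteLine((100.0 - pct).ToString("F1", CultureInfo.InvariantCulture));
    }
}
```

Note that "percent of original" and "percent saved" are complementary figures, so it is worth checking which convention a quoted number uses before comparing results.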

 

In addition to these results, I know that the CompactFormatter applied to .NET Framework 1.1 DataSets, along with Ravinder Vuppula's DataSetSurrogate class, offers (before compression) not only a more compact byte stream than the BinaryFormatter, but a slight increase in speed as well. So, my opinion is that while the new binary serialization ability of the DataSet and the native System.IO.Compression classes are a welcome addition to version 2.0, they still leave something to be desired, and enterprising developers may wish to investigate further.

The DataSet is a very complex animal, even more so with all its new features in version 2.0, but it is still basically nothing more than a sophisticated container for tables of data. When developers need to remote such containers over the wire, they want two things: first, a high compression ratio to keep bandwidth consumption down, and second, speed of compression and decompression. In some commercial applications that are already using my CompressDataSet classes, the main objective is simply bandwidth savings, and they are reporting savings of up to 92 percent by using my classes. Perhaps, if time permits, I'll do more work in this area, given the information I've discovered and presented above. In the meantime, you are welcome to download the Visual Studio.NET 2005 solution zip file below and play with it. I'd be interested in any feedback you have to offer.
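The two goals named above, payload size and speed, can be measured with a small harness. What follows is only a rough sketch, assuming the .NET Framework (where BinaryFormatter is available); the sample table, its column names, and the class RemotingFormatSketch are invented for illustration and are not the data or code used in the tests reported here. It serializes the same DataSet with both RemotingFormat settings, deflates each result, and prints sizes and elapsed time.

```csharp
using System;
using System.Data;
using System.Diagnostics;
using System.IO;
using System.IO.Compression;
using System.Runtime.Serialization.Formatters.Binary;

// Hypothetical measurement harness; assumes the .NET Framework's BinaryFormatter.
public static class RemotingFormatSketch
{
    // Builds made-up sample data; the article's test DataSet is not reproduced.
    public static DataSet BuildSample(int rows)
    {
        DataSet ds = new DataSet("Sample");
        DataTable table = ds.Tables.Add("Orders");
        table.Columns.Add("Id", typeof(int));
        table.Columns.Add("Customer", typeof(string));
        table.Columns.Add("Amount", typeof(decimal));
        for (int i = 0; i < rows; i++)
            table.Rows.Add(i, "Customer " + (i % 100), i * 1.25m);
        return ds;
    }

    // Serializes the DataSet with the requested RemotingFormat.
    public static byte[] Serialize(DataSet ds, SerializationFormat format)
    {
        ds.RemotingFormat = format;
        BinaryFormatter bf = new BinaryFormatter();
        using (MemoryStream ms = new MemoryStream())
        {
            bf.Serialize(ms, ds);
            return ms.ToArray();
        }
    }

    // Compresses a byte array with the native DeflateStream.
    public static byte[] Deflate(byte[] input)
    {
        using (MemoryStream output = new MemoryStream())
        {
            using (DeflateStream z = new DeflateStream(output, CompressionMode.Compress, true))
            {
                z.Write(input, 0, input.Length);
            }
            return output.ToArray();
        }
    }

    public static void Main()
    {
        DataSet ds = BuildSample(10000);
        SerializationFormat[] formats = { SerializationFormat.Xml, SerializationFormat.Binary };
        foreach (SerializationFormat format in formats)
        {
            Stopwatch sw = Stopwatch.StartNew();
            byte[] raw = Serialize(ds, format);
            byte[] packed = Deflate(raw);
            sw.Stop();
            Console.WriteLine("{0}: {1} bytes serialized, {2} bytes deflated, {3} ms",
                format, raw.Length, packed.Length, sw.ElapsedMilliseconds);
        }
    }
}
```

The numbers depend heavily on the shape of the data, so a harness like this is best run against a DataSet that resembles what your application actually sends over the wire.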

Download the VS.NET 2005 Solution that accompanies this article


Peter Bromberg is a C# MVP, MCP, and .NET consultant who has worked in the banking and financial industry for 20 years. He has architected and developed web-based corporate distributed application solutions since 1995, and focuses exclusively on the .NET Platform.