I was doing some work the last couple of days to re-generate lots of large transaction files from archive tables to stress test a system.

This required extracting in excess of 400,000 rows of data per file (pretty small really), performing a date shift on transaction dates and then outputting to a file ready to be re-processed for stress testing the data load processes.
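The date shift itself is trivial in C#; a minimal sketch, where the dates and the 600-day offset are made up for illustration:

```csharp
// Hypothetical date shift: slide an archived transaction date forward by a
// fixed offset so the data looks current to the load process.
// The sample date and the 600-day offset are made up for illustration.
DateTime archivedDate = new DateTime(2008, 6, 15);
DateTime shiftedDate = archivedDate.AddDays(600); // 2010-02-05
```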

The first suggestion by my colleague was to do all the work in T-SQL and output to a file using xp_cmdshell. Whilst not a bad suggestion, it was not my first choice due to some of the parsing required; my gut told me it would be easier to do in C# as a console application. Still, I wrote the T-SQL version and it worked, but the performance was not ideal.

Now, considering we needed to process 100+ files and each one would take between 2 and 5 minutes to process, I went back to the drawing board.

So I opted for the console application and LINQ to SQL classes to process the data, with a dash of LINQ thrown in to filter the data. The results were excellent: what took 2 minutes in T-SQL now took 6 seconds. A huge improvement.
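The filtering itself was just a LINQ query over the data context. The real filter isn't reproduced here, but here is a rough sketch of the kind of thing I mean, where the data context, table and column names are all made up for illustration:

```csharp
// Hypothetical filter: the data context, table and column names below are
// assumptions for illustration, not the real schema.
TransactionsDataContext ctx = new TransactionsDataContext();

DateTime fromDate = new DateTime(2009, 1, 1);
DateTime toDate = new DateTime(2010, 1, 1);

// Keep only the transactions inside the date range
var filtered = from t in ctx.Transactions
               where t.TransactionDate >= fromDate && t.TransactionDate < toDate
               select t;
```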

I decided to take this a little further when I got home tonight and, armed with my trusty test database, I set out on the journey once again: this time to extract all of my 2,000,000 members, order them and spit them out to a file. The Member class/table structure is shown below:

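(Reconstructed here from the fields used in the output code further down; the column types are assumptions.)

```csharp
public class Member
{
    // Types below are assumptions based on how each field is used on output
    public int MemberID { get; set; }
    public string MemberHash { get; set; }
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public DateTime DateOfBirth { get; set; } // written out as yyyy-MM-dd
    public string Gender { get; set; }
    public Guid RowID { get; set; }
}
```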

Four different tests were performed: the first had no ordering and allowed LINQ to buffer the results; the second was again unsorted, but this time pre-fetched all the results; the third ordered in memory by DateOfBirth using LINQ; and the fourth let SQL Server do the sorting. No index exists on the DateOfBirth column. A cut-down version of the code and the results are shown below, and all source code is included at the end of this article:

NOTE: For each test, SQL Server was re-started prior to the first run.

| Run | Rows | Unsorted Buffered | Unsorted Not Buffered | LINQ Sort | SQL Server Sort |
|-----|-----------|------------|------------|------------|------------|
| 1 | 2,000,000 | 15 seconds | 13 seconds | 15 seconds | 20 seconds |
| 2 | 2,000,000 | 14 seconds | 12 seconds | 14 seconds | 15 seconds |
| 3 | 2,000,000 | 14 seconds | 11 seconds | 14 seconds | 15 seconds |
```csharp
// Create record format string for output
string recordFormat = "{0}|{1}|{2}|{3}|{4}|{5}|{6}";

// Create textwriter to output data to
using (TextWriter tx = new StreamWriter(@"C:\Members.txt"))
{
    // instantiate the data context from our LINQ to SQL class
    MembersDataContext ctx = new MembersDataContext();

    // Extract all the members into memory and order by Gender
    var sorted = from m in ctx.GetAllMembers()
                 orderby m.Gender
                 select m;

    // iterate over the rows using a forward only fast reader
    foreach (GetAllMembersResult item in sorted)
    {
        // write out the records to our file
        tx.WriteLine(string.Format(recordFormat
            , item.MemberID
            , item.MemberHash
            , item.FirstName
            , item.LastName
            , item.DateOfBirth.ToString("yyyy-MM-dd")
            , item.Gender
            , item.RowID));
    }

    tx.Close();
}
```

So why the differences in times?

The first run is always slower in all cases due to the initial query against SQL Server and the loading of the data into memory.

In the unsorted case, the data is being read from the table via a relatively small buffer over TDS (tabular data stream), and the rows are being processed as they arrive. It would appear that the buffering is what is actually slowing down the process. If we fetch all the data first (as in the second test) and then write to file, as follows, we can reduce the time down to 12 seconds:

```csharp
MembersDataContext ctx = new MembersDataContext();
var rows = ctx.GetAllMembers().ToList();
```

In the case of the sorted data, all the data is being extracted and then sorted in memory by LINQ. This appears to be extremely efficient: the LINQ sort runs are only about 2 seconds slower than the unsorted, non-buffered runs, which would indicate that the sort itself takes around 2 seconds.

The final example, which offloads the sorting to SQL Server, is a little surprising and seems to indicate that SQL Server is not quite as good at the sorting as LINQ is (bear in mind there is no index on the DateOfBirth column to help it).

Now, all these tests were performed on the same machine. As soon as we bring a network into the equation, and if we are dealing with large volumes of data, it will probably be preferable to combine the techniques used above.

For a very large volume of data, deferring the sorting to SQL Server and then using the buffered technique, as shown below, would probably be the best bet.

```csharp
// Create record format string for output
string recordFormat = "{0}|{1}|{2}|{3}|{4}|{5}|{6}";
int counter = 0;

// Create textwriter to output data to
using (TextWriter tx = new StreamWriter(@"C:\Members.txt"))
{
    // instantiate the data context from our LINQ to SQL class
    MembersDataContext ctx = new MembersDataContext();

    // iterate over the rows using a forward only fast reader,
    // letting SQL Server do the sorting
    foreach (Member item in ctx.GetAllMembersSortedByDOB())
    {
        // write out the records to our file
        tx.WriteLine(string.Format(recordFormat
            , item.MemberID
            , item.MemberHash
            , item.FirstName
            , item.LastName
            , item.DateOfBirth.ToString("yyyy-MM-dd")
            , item.Gender
            , item.RowID));

        counter++;
    }
}
```

BondiGeek

Source Code [Download Here](http://www.bondigeek.com/blog/wp-content/uploads/2010/03/Parser1.zip)