I recently had a situation where we were working with a SQL database that we were unable to modify, where we had a requirement to detect changed rows. While there are out-of-the-box mechanisms for this, such as change tracking, timestamps, or using triggers to manually populate a table of changes, all of these require changes to the database. One possible solution that the team came up with was to calculate hashes of the query results, so that a different hash from the last calculated one shows that the data had changed. Beyond this, we wanted to keep out code footprint to a minimum, and re-use the queries that returned full result sets wherever possible. I did some investigation and, by modifying and combining the examples I found, came up with a solution that works pretty well for this purpose.
The database server in question is MS SQL Server 2008, and so the latest and greatest tools aren’t available. However in SQL 2005, a HASHBYTES() function was added. There are a few different hashing algorithms available, which can be swapped out as needed, including MD5 which is quick and suits our needs. Its usage is as so:
This function can easily be used to hash individual fields, but we were after a whole row result. For this, using SQL’s XML capabilities would be a fairly easy way to give us a single string that is built from all the columns in a result set and could be hashed:
AS x FOR
However I immediately ran into an issue with this that took a while to get to the bottom of; when I combined these two functions the query would execute successfully, but with a message showing an error about data being truncated:
Msg 8152, Level 16, State 10, Line 1, String or binary data would be truncated
I eventually found was that while the HASHBYTES() function took in a string, its limit was 8000 characters, and the error was because the XML result was being truncated when being passed to the hashing function. After a bit of searching, I found that the native sys.fn_repl_hash_binary function worked with a byte array, with no size limit. Putting all this together, the result looks like:
Q.Id AS [Id],
c.Id AS Id,
FROM Customers c
The nice thing about this is that the middle query is completely arbitrary, and can be swapped for any query however complex, so long as an Id column is exposed. There’s a couple of things that could be done with this to go one step further; one would be to insert the records into a table on the same database server, so that comparing the hashes with previous values could be done as a database operation, to generate a set of Ids to re-join with the main query to return full results.