Data Discovery and Classification with Data Preview

Francesco Mantovani
3 min readNov 23, 2021

PROBLEM

Dirty Database

Make Marie Kondo explode at 3… 2… 1…
I’m currently working on a project that is a sort of spaghetti monster. On all sides.

The databases are 20–30 years old and have probably passed in the hands of multiple teams, they have been attached to multiple technologies; permissions were probably setup buy drunk cow-boys and as result every database developer passing by was free to create a table, drop a column and piss in the corner.

Do you know those TV realities were a cameraman visits people buried alive in a messy home. Think about these databases as something like that.

Because these databases are obviously full of sensible data I was assigned to the tedious task of apply Data Discovery and Classification to it.

For those who are not familiar with it, Data Discovery and Classification is a nice feature that was introduced on SSMS a few years ago. Microsoft for once made a quite nice and easy-to-use tool. You can fire it by doing right click on the database > Tasks > Data Discovery and Classification > Classify Data…

On AdventureWorks2019 this is the result:

Data-classification-wizard

That is impressive! How could have SSMS figured out what columns need to be classified?

By the column name.

If you fire up SQL Server Profiler and run the tool again you can catch the query that generated that classification and if you have a look at it you will see that the words it’s seeking are hardcoded:

Data Classification Match Columns

You can download a full copy of the query here: MatchingColumns

As you can see Microsoft was wise enough to add multiple languages but don’t think not even for a minute that this tool will make the work at your place. You have to work!

You have to keep in mind that this tool might have caught 50% of the columns but you still need to:

— double check if the classification is correct

— find the other 50%.

In my case I had even more problems:

— The database columns are in multiple languages and they don’t really represent the data they contain

— Each database has more than 900 tables and because there is no data preview I had to run a SELECT * nearly a thousand time

I’ll rather drink bleach while poking my eye balls with a rusty fork than do that.

SOLUTION

This query will help you:

— List Schemas, Tables and Columns along with a preview of the data.

— Search for a specific Schemas, Table or Column name, or for a specific word mentioned inside the data.



DECLARE @WordMatch AS NVARCHAR(50)

SET @WordMatch = ‘last’

DROP TABLE IF EXISTS #ColumnsToDisplay

SELECT ROW_NUMBER () OVER (ORDER BY tab.name) AS Iteration,

s.name AS SchemaName,

tab.name AS table_name,

col.column_id,

col.name AS column_name,

t.name AS data_type,

col.max_length,

col.precision AS PrecisionNumber,

CAST(NULL AS VARCHAR(MAX)) AS DataSample

INTO #ColumnsToDisplay

FROM sys.tables AS tab

JOIN sys.schemas AS s

ON s.schema_id = tab.schema_id

JOIN sys.columns AS col

ON col.object_id = tab.object_id

LEFT JOIN sys.types AS t

ON col.user_type_id = t.user_type_id

DECLARE @Iterations INT = 0,

@CurrentIteration INT = 1;

SELECT @Iterations = MAX (Iteration)

FROM #ColumnsToDisplay

WHILE @CurrentIteration

--

--

Francesco Mantovani

Lorem Ipsum “Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit…” “There is no one who loves pain itself, who seeks