Data Quality Services overview
Data Quality Services (DQS) is a new knowledge-driven solution in SQL Server 2012 to improve data quality by cleansing, matching, de-duplicating and profiling data.
To improve a business’ data, the first requirement is to know the data. Data Quality Services centers around a Data Quality Knowledge Base (DQKB) that stores three types of knowledge: out-of-the-box knowledge provided by Microsoft (such as country names and regular expressions for validating e-mail addresses), knowledge generated by Data Quality Services by sampling the business’ data and knowledge generated by business users.
The Data Quality Knowledge Base is then used to improve data quality, either interactively by business users through the Data Quality Client application, or automatically in SQL Server Integration Services (SSIS) packages with the new DQS Cleansing component.
Issues that Data Quality Services address
Good quality data is complete, conformed, consistent, accurate, valid and unique. Data Quality Services can address issues with any of these.
Completeness – Is all required data available?
Sample data problem: 50% of all zip codes are listed as 0000.
Conformity – Does data conform to the specified format?
Sample data problem: Dates are listed as YYYY-MM-DD in one system, and DD-MM-YYYY in another system.
Consistency – Do values represent the same meaning?
Sample data problem: Sales amounts are presented in Euros in one system, and as Norwegian Kroner in another system.
Accuracy – Does data accurately represent reality or a verifiable source?
Sample data problem: Countries are spelled incorrectly.
Validity – Do values fall within acceptable ranges?
Sample data problem: E-mail addresses are listed as name@gmail instead of firstname.lastname@example.org.
Uniqueness – Is data unique?
Sample data problem: One person is registered as two customers: Cathrine Wilhelmsen and C. Wilhelmsen.
Data Quality Services components
DQS consists of two main components: Data Quality Server and Data Quality Client. It also provides a DQS Cleansing component in SQL Server Integration Services (SSIS) for cleansing data automatically in packages, as well as Master Data Services (MDS) data quality functionality.
Data Quality Server is implemented as three SQL Server catalogs: DQS_MAIN, DQS_PROJECTS and DQS_STAGING_DATA. The two first catalogs contain data necessary to run Data Quality Services, while the third catalog is the intermediate staging database for data that will be processed in Data Quality Services operations.
Data Quality Client is a standalone application used by business users and IT professionals to administer and manage knowledge bases and data quality projects. The client can be installed and run on the same computer as Data Quality Server or remotely on separate computers.
When to use Data Quality Services
A saying that most people in the data warehousing world have heard is “shit in, shit out”. Data Quality Services makes it possible to change that. But when is it appropriate to use Data Quality Services? In what situation, for what data?
One of our clients have a strict policy not to manipulate any data in their data warehouse, but rather expose poor data quality to the source systems and correct it there. They are not interested in using Data Quality Services to actually improve the data, but are very interested in using Data Quality Services as a method to expose poor quality data. By using the new DQS Cleansing component it’s possible to log poor quality data and enable automatic reporting to the source systems, instead of using resources on checking the data quality manually.
Data Quality Services looks very promising with Big Data, unstructured data, the demand for faster response times and ad-hoc analysis – when the users just want to analyze data and not build a comprehensive data warehouse. The possibilities are there, and it will be exciting to see how we use and what we think of Data Quality Services in a year or two or five.