Taynor McNally
What is data quality?
Data quality is made up of six dimensions: completeness, timeliness, validity, consistency, accuracy and uniqueness. These dimensions can be measured to determine the level of data quality of data sets within an organisation. Data quality can be measured continuously throughout a data set's lifecycle, or at specific intervals and stages within it. That lifecycle can contain multiple phases, including the raw, cleaned, transformed, joined and curated forms of the data.
Achieving high data quality can be a balancing act, especially when competing data quality dimensions disrupt one another. For example, improving the accuracy of a data set may require more processing time to move the data from its raw to its curated form. This in turn can affect the timeliness dimension, delaying when the data is finally ready to be consumed. To manage this balancing act, prioritising which data quality dimensions to apply to a data set should take into consideration elements such as its shape and content at each stage of its lifecycle.
Measuring data quality
There are dimensions of data quality that can be measured using repeatable methods and tooling that can be automated. These include completeness (counting the number of NULL or empty values), uniqueness (counting the number of duplicate values), and timeliness (when the data set was last refreshed/processed). These dimensions can be expressed as deterministic, generic validation rules that can be applied to most data sets depending upon their shape and content, and managing such rules is typically picked up by a data engineering function. Accuracy, consistency and validity are more complex to measure: their rules may need to include business logic, standards and relationships to determine the quality of the data. Implementing these data quality rules requires collaboration between the business and data engineering functions to define and build them.
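As a minimal sketch of the deterministic checks described above, the example below uses pandas to score completeness, uniqueness and timeliness. The column names, key columns and freshness threshold are hypothetical assumptions for illustration, not part of any specific framework.

```python
import pandas as pd
from datetime import datetime, timezone

def completeness(df: pd.DataFrame) -> pd.Series:
    """Share of non-NULL values per column (1.0 = fully complete)."""
    return df.notna().mean()

def uniqueness(df: pd.DataFrame, key_columns: list[str]) -> float:
    """Share of rows that are not duplicated on the given key columns."""
    duplicates = df.duplicated(subset=key_columns).sum()
    return 1 - duplicates / len(df)

def timeliness(last_refreshed: datetime, max_age_hours: float) -> bool:
    """True if the data set was refreshed within the allowed window."""
    age = datetime.now(timezone.utc) - last_refreshed
    return age.total_seconds() / 3600 <= max_age_hours

# Hypothetical customer data set with one missing email and one duplicate id.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@example.com", None, "c@example.com", "d@example.com"],
})

print(completeness(df))                 # email column scores 0.75
print(uniqueness(df, ["customer_id"]))  # 0.75 because of the duplicate id
print(timeliness(datetime(2024, 1, 1, tzinfo=timezone.utc), max_age_hours=24))
```

Because these checks need no business context, they can be applied as a generic first pass across most data sets before the more complex accuracy, consistency and validity rules are layered on.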
The impact of data quality
The impact of data quality may not be noticeable until after a data-driven decision has been made. Depending on the type of decision, the impact is most demonstrable where the timeliness, completeness and accuracy of the data set are high requirements. Some organisations depend upon feeds of near real-time transactional data to make reactive commercial decisions that directly affect the bottom line. Other data-driven decisions require prescriptive or predictive insights, whether for short-term tactical decisions or medium/long-term strategic planning and goal setting; these can have a negative or positive impact on an organisation's reputation and long-term bottom-line goals, depending on the level of data quality of its data sets. In both cases, data quality affects the level of confidence and trust in an organisation's data sets.
How can data quality be addressed?
There should be a focus on fostering an organisational data culture that is adopted by all functions of a business, not just those who process and deliver the data sets. A starting point is to set a baseline and continuously monitor data sets, so that the level of data quality, and how it has improved or declined over time, is understood. This can include regular auditing, data profiling and classification to report against the current level of data quality, addressing areas of decline, tuning for enhancements and building a reusable framework to implement data quality at scale, as sketched below.
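The sketch below illustrates one way such a reusable framework and baseline could look, assuming pandas and a simple in-memory result table. The Rule and run_rules names, the example rules and the column names are all assumptions made for illustration, not a prescribed implementation.

```python
import pandas as pd
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable

# A minimal, reusable rule framework: each rule returns a score between 0 and 1,
# and results are collected per run so a baseline and trend can be reported on.
@dataclass
class Rule:
    name: str
    dimension: str  # e.g. "completeness", "uniqueness"
    check: Callable[[pd.DataFrame], float]

def run_rules(dataset_name: str, df: pd.DataFrame, rules: list[Rule]) -> pd.DataFrame:
    """Apply every rule to the data set and return one scored row per rule."""
    results = [
        {
            "dataset": dataset_name,
            "rule": rule.name,
            "dimension": rule.dimension,
            "score": rule.check(df),
            "run_at": datetime.now(timezone.utc),
        }
        for rule in rules
    ]
    return pd.DataFrame(results)

# Hypothetical rules for a customer data set; column names are assumptions.
rules = [
    Rule("email_completeness", "completeness",
         lambda df: df["email"].notna().mean()),
    Rule("customer_id_uniqueness", "uniqueness",
         lambda df: 1 - df.duplicated(subset=["customer_id"]).mean()),
]

df = pd.DataFrame({"customer_id": [1, 2, 2],
                   "email": ["a@example.com", None, "c@example.com"]})
scores = run_rules("customers", df, rules)
print(scores[["rule", "dimension", "score"]])
```

Persisting each run's scores to a history table is what turns these point-in-time checks into continuous monitoring, allowing improvement or decline to be tracked against the baseline over time.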
A collaborative effort between the business and data engineering/analytics functions to define data quality rules will improve how effectively those rules are designed and implemented. This can include setting up a data governance model with data stewards, data owners and a data governance board, using data domains to manage the organisational structure of data and who is accountable for managing its level of quality.