Overview of Data Governance
What is Data Governance?
Data Governance is the systematic oversight of an organization’s data assets to ensure data quality, security, privacy, compliance, and usability while aligning data practices with business objectives.
Key Concepts
- Data Stewards: Individuals responsible for overseeing data quality, security, accessibility, and compliance with organizational standards and policies.
- General Data Protection Regulation (GDPR): EU legislation protecting personal data, providing individuals rights to access, rectify, erase, and transfer their data.
- Personally Identifiable Information (PPI): Data directly identifying an individual (e.g., names, addresses, emails).
- Personal Health Information (PHI): Health-related information that can directly or indirectly identify an individual, requiring strict privacy and security measures.
- Payment Card Industry Information (PCI): Standards for secure handling and processing of credit and debit card payments.
- Master Data Management (MDM): Processes and tools ensuring uniformity, accuracy, and accountability of core data entities within an organization.
Common Challenges in Data Governance
-
Data Silos:
- Problems: Fragmented data causing inconsistency, duplication, inefficiencies.
- Solutions: Centralized data repositories, unified data platforms, cross-department communication.
-
Implementation Complexity:
- Problems: Integration challenges due to resource constraints and lack of expertise.
- Solutions: Phased implementations focused on critical datasets.
-
Lack of Standardization:
- Problems: Inconsistent data handling processes.
- Solutions: Standardized policies, regular training, routine audits.
-
Resistance to Change:
- Problems: Organizational pushback from unclear benefits or workflow disruptions.
- Solutions: Clear communication, stakeholder involvement, ongoing training, and support.
Data Release Approval Workflow
- Request for Data Access: Submission detailing purpose and scope.
- Approval Workflow: Review involving stakeholders or data stewards.
- Data Release Decision: Evaluation based on policies, sensitivity, compliance.
- Access Provision and Notification: Granting access with defined conditions.
- Audit Record Creation: Documenting requests, decisions, and access for auditing.
Data Use Agreement (DUA)
A formal contract detailing terms for data sharing and usage between parties.
Typical DUA Clauses:
- Parties Clause: Defines entities, roles, responsibilities.
- Purpose Clause: Specifies permitted data uses.
- Confidentiality Clause: Data protection obligations.
- Security Clause: Security measures (encryption, access controls).
- Termination/Renewal Clause: Terms for terminating/renewing agreements, data handling post-agreement.
Data Breach
A data breach occurs when sensitive or protected information is accessed, disclosed, or stolen by unauthorized parties.
Causes
- Phishing: Deceptive attempts to obtain sensitive information through emails or messages pretending to be trustworthy sources.
- Insider Threats: Data breaches caused by employees or contractors misusing their authorized access.
- Physical Theft: Theft of physical hardware containing sensitive information.
- Unintentional Exposure: Accidental data exposure due to errors like misconfigured databases or email mistakes.
- Social Engineering: Manipulation techniques to gain unauthorized access.
Data Breach Protocol
- Identification: Recognize and confirm the breach.
- Containment: Limit further unauthorized access.
- Eradication: Remove the cause of the breach.
- Recovery: Restore normal operations securely.
- Notification: Inform affected parties and regulatory bodies.
Post-incident Review
- Detailed analysis of the breach.
- Lessons learned and improvements identified.
- Update security policies and response plans.
Data Quality
Data quality refers to the condition of data based on how well it serves its intended purpose. High-quality data is accurate, complete, consistent, reliable, and timely.
Key Dimensions of Data Quality
-
Accuracy
Definition: The degree to which data correctly represents real-world values.
Example: A customer’s recorded email is valid and belongs to the customer. -
Completeness
Definition: The extent to which all required data is present.
Example: A customer record includes first name, last name, email, and phone number. -
Consistency
Definition: The data is uniform across different systems.
Example: Expected phone number formats and date formats is the same across all systems -
Reliability
Definition: Data remains stable and trustworthy over time.
Example: Monthly reports reflect revenue totals that align with actual transactions. -
Timeliness
Definition: Data is up-to-date and available when needed.
Example: A logistics system shows live delivery updates.
Common Data Quality Issues
-
Missing or Incomplete Records
Records lack required information (e.g., no phone number provided). -
Duplicate or Redundant Records
Multiple entries for the same entity (e.g., duplicate customer records). -
Inconsistent Data Formats
Different formats across systems (e.g., inconsistent use of date formats). -
Outdated or Stale Data
Data is no longer accurate (e.g., obsolete email addresses).
Data Quality Metrics
Data quality metrics are quantifiable measures used to assess the condition of data. These metrics are typically calculated as percentages and can help identify quality issues.
-
Rows Passed vs. Rows Failed
Compares data entries that meet quality rules vs. those that fail.
Formula:((Total Rows – Rows Failed) / Total Rows) * 100 -
Completeness Percentage
Measures the proportion of fields that are filled.
Formula:(Populated Fields / Expected Fields) * 100 -
Accuracy Score
Percentage of records that align with reference data or verified sources. -
Duplicate Rate
The percentage of records identified as duplicates.
Formula:(Duplicate Records / Total Records) * 100 -
Timeliness Ratio
Percentage of data updated within a given timeframe (e.g., updated in the last 7 days, prompt user if this info correct).
Validation Methods
Techniques used to assess and ensure data quality:
-
Cross-validation
Compare data across multiple systems to find discrepancies. -
Sampling / Spot Checking
Review a subset of the data, especially useful for large datasets. -
Reasonable Expectations
Use predefined rules or benchmarks to detect anomalies.
Example: Expecting a 90%+ pass rate for all validation checks. -
Data Profiling
Analyze incoming data to detect quality issues in real-time. Often integrated into APIs or databases. -
Data Auditing
A formal and thorough review of data quality. Includes findings, recommendations, and can be scheduled or event-driven (e.g., following a breach).
Master Data
Master Data refers to core, non-transactional data that is essential to operations and shared across systems and processes.
Master Data Management (MDM)
A set of processes and tools that ensure consistency, accuracy, and a single source of truth for key business entities (e.g., customers, products).
Example: Customer data consolidated in a Data Warehouse from multiple sales platforms.
Scenarios for MDM
-
Mergers and Acquisitions
MDM ensures data from both entities is harmonized and integrated efficiently. -
Compliance
Helps meet regulatory requirements by maintaining accurate and consistent data. Supports standards like GDPR, HIPAA, SOX, and others. -
Streamlining Data Access
Provides a single source of truth and reduces redundancy, enabling faster and more reliable decision-making.
SQL Information Type and Sensitivity Levels
SQL Server provides built-in capabilities for classifying and labeling sensitive data. These features support data governance and compliance by allowing organizations to:
- Classify columns using predefined information types and sensitivity labels
- Track and audit the use and access of sensitive data
- Integrate with Microsoft Purview, Azure Defender, and other compliance tools
- Enable data discovery and alerting for regulatory compliance such as GDPR, HIPAA, and PCI DSS
Classification metadata is stored within the database and can be queried or exported for reporting. It helps data professionals identify sensitive data, apply protection mechanisms, and demonstrate compliance during audits.
SQL Information Type
Information types are used in classification to describe the category or nature of data stored in SQL systems. These types help define how the data should be treated from a sensitivity and privacy standpoint.
- Networking – IP addresses, domain information, MAC addresses
- Contact Info – Email addresses, physical addresses, and phone numbers
- Credentials – Usernames, passwords, and authentication tokens
- Credit Card – Payment card numbers and security codes
- Banking – Bank account numbers, sort codes
- Financial – General financial records such as income, assets, liabilities
- Other – Any additional custom-defined categories
- Name – First name, last name, or full name
- National ID – Government-issued IDs (e.g., driver’s license, passport number)
- SSN – Social Security Numbers (U.S.) or equivalent
- Health – Medical records, diagnoses, and treatment data
- Date of Birth – Full or partial birth dates used for identification
- [n/a] – Not applicable or unclassified
SQL Sensitivity Label
Sensitivity labels in SQL help define the required level of protection for data. These classifications are used to guide encryption, access control, auditing, and compliance practices.
- Public – Data intended for public consumption and does not require access controls.
- General – Low-sensitivity data for internal use, with minimal restrictions.
- Confidential – Sensitive data that should be protected and only accessible to specific roles or groups.
- Confidential - GDPR – Confidential data that falls under the GDPR, such as EU citizen PII.
- Highly Confidential – Data requiring strong controls and restricted access due to potential risk or impact.
- Highly Confidential - GDPR – Highly sensitive personal data under GDPR protections (e.g., biometric, health records).
- [n/a] – Data that has not been classified or is not applicable.
Example Data Column Classifications
| Column Name | SQL Information Type | SQL Sensitivity Label |
|---|---|---|
EmailAddress |
Contact Info | Confidential - GDPR |
DateOfBirth |
Date of Birth | Highly Confidential - GDPR |
FullName |
Name | Confidential |
CreditCardNo |
Credit Card | Highly Confidential |
IPAddress |
Networking | General |
UserPassword |
Credentials | Highly Confidential |
BankAccount |
Banking | Highly Confidential |
ProductCategory |
Other | Public |
CurrentBalance |
Financial | Confidential |
Trade |
Financial | Confidential |
TradeProfit |
Financial | Confidential |
These examples illustrate how SQL classification can be applied to different data columns, helping ensure appropriate data handling and compliance with regulatory requirements.