DQ
DQ Engine

DQ Rule Engine v2 — POC Documentation

Version: 1.0.0
Status: Proof of Concept (POC)
Date: 2025-Q2
Framework: DAMA-DMBOK v2 · DGI · PDPA B.E. 2562 · DGA Thailand


📋 Table of Contents

  1. Executive Summary
  2. System Architecture
  3. Installation & Setup
  4. User Guide
  5. Data Contract Specification
  6. Integration with Great Expectations
  7. Rule Inference Engine
  8. API Specification (Claude AI Mode)
  9. Deployment Roadmap
  10. Troubleshooting
  11. FAQ

Executive Summary

Purpose

DQ Rule Engine v2 is an automated Data Quality standard generator that transforms Data Dictionary CSV files into production-ready Data Contracts (YAML/TOML) and Great Expectations test suites.

Problem Statement

  • Manual creation of DQ rules is time-consuming and error-prone
  • No standardized translation between business requirements → technical tests
  • Great Expectations configs require manual mapping of column rules
  • PDPA/DGA compliance checks are scattered across multiple documents

Solution

Rule Engine v2 provides:

  • ✅ Automatic rule inference from Data Dictionary metadata
  • ✅ Multi-format output (YAML, TOML, GE Expectations)
  • ✅ Hybrid execution (Claude AI + Rule-based fallback)
  • ✅ Built-in PDPA/DGA/Thai context compliance
  • ✅ Interactive threshold editing before deployment
  • ✅ Direct integration with Great Expectations service

Key Metrics

| Metric | Baseline | With Engine | |---|---|---| | Rule definition time | 8 hours per table | 2 minutes per table | | Test coverage | 40% | 95%+ | | Compliance validation | Manual | Automated | | Feedback loop | Days | Real-time |


System Architecture

High-Level Flow

┌─────────────────────────────────────────────────────────────┐
│                    User (Data Steward)                       │
└──────────────────────────┬──────────────────────────────────┘
                           │
                    ┌──────▼──────┐
                    │ Rule Engine │
                    │    v2.0     │
                    └──────┬──────┘
                           │
          ┌────────────────┼────────────────┐
          │                │                │
    ┌─────▼─────┐  ┌──────▼──────┐  ┌─────▼─────┐
    │ CSV Loader│  │ Mode Selector│  │ Metadata  │
    │           │  │              │  │ Enricher  │
    └─────┬─────┘  └──────┬──────┘  └─────┬─────┘
          │                │                │
          └────────────────┼────────────────┘
                           │
                    ┌──────▼──────┐
                    │Rule Generator
                    │ ┌─────────┐ │
                    │ │AI Engine │ │  (Claude API optional)
                    │ └─────────┘ │
                    │ ┌─────────┐ │
                    │ │Rule-based│ │  (deterministic, no API)
                    │ └─────────┘ │
                    └──────┬──────┘
                           │
          ┌────────────────┼────────────────┐
          │                │                │
    ┌─────▼─────┐  ┌──────▼──────┐  ┌─────▼──────┐
    │YAML Output│  │TOML Output  │  │GE Expectations
    │ (Contract)│  │ (Contract)  │  │  (ge_suite.yaml)
    └─────┬─────┘  └──────┬──────┘  └─────┬──────┘
          │                │                │
          └────────────────┼────────────────┘
                           │
          ┌────────────────┴────────────────┐
          │                                 │
    ┌─────▼─────────────────┐    ┌─────────▼──────────┐
    │ Data Governance Repo  │    │  Great Expectations
    │ (data contracts/)     │    │  Service (checkpoints/)
    └───────────────────────┘    └────────────────────┘
                                         │
                                    (run tests)

Component Breakdown

1. CSV Loader

  • Reads Data Dictionary CSV files (UTF-8 / UTF-8-BOM)
  • Supports columns: table, variable, type, validation, format, haspii, dq_completeness, dq_uniqueness, dq_validity, dq_consistency, description, source
  • Auto-fills contract metadata from CSV header

2. Mode Selector

  • Claude AI Mode (optional, requires API key)
    • Uses Claude Sonnet 4 to intelligently infer rules based on business context
    • Better for complex domain-specific rules
    • Requires sk-ant-* API key
  • Rule-Based Mode (default, no external dependency)
    • Deterministic rule inference based on column metadata
    • Fast, reproducible, no API costs
    • Fallback when AI fails

3. Metadata Enricher

  • Sets contract version, dataset ID, governance policy
  • Applies global thresholds (Completeness, Validity, etc.)
  • Injects owner, classification, retention policy

4. Rule Generator

  • Scans column types, validation rules, DQ dimensions
  • Infers constraints (PRIMARY KEY, FOREIGN KEY, NOT NULL, NULLABLE)
  • Maps to dbt tests and Great Expectations expectations
  • Thai-specific rules: citizen ID format, geographic boundaries, phone regex

5. Output Formatters

  • YAML Contract → Data Contract format (production-ready)
  • TOML Contract → Alternative serialization for CI/CD pipelines
  • GE Expectations Suite → Mappings to Great Expectations test types

Installation & Setup

Prerequisites

  • Modern web browser (Chrome, Firefox, Safari, Edge)
  • CSV Data Dictionary file
  • (Optional) Claude API key for AI mode

Quick Start

  1. Open the Engine

    # Download dq_rule_engine_v2.html and open in browser
    open dq_rule_engine_v2.html
    
  2. Configure Metadata (Left Panel)

    • Dataset Name: ชื่อ dataset ภาษาไทย/อังกฤษ
    • Source System: System source name (e.g., "ระบบศูนย์ข้อมูล NLC")
    • Owner: Steward email (e.g., data-steward@dla.go.th)
    • Classification: Open | Internal | Confidential | Restricted
  3. Set Quality Thresholds

    • Completeness: Default 0.90 (90%)
    • Validity: Default 0.90 (90%)
    • Uniqueness: Default 1.00 (100% for PKs)
    • Consistency: Default 0.95 (95% for FKs)
  4. Select Output Format (optional)

    • YAML (default) — Human-readable, git-friendly
    • TOML — Alternative for CI/CD pipelines
  5. Upload Data Dictionary CSV

    • Drag & drop or click to browse
    • Auto-loads preview of 6 rows
    • Engine auto-fills dataset name from table column
  6. Generate Contract

    • Click "Generate Contract"
    • Engine analyzes columns → infers rules
    • Displays summary: # of columns, tests, critical rules, PII detected
  7. Review & Download

    • YAML Tab → Data Contract (feed to governance repo)
    • TOML Tab → Alternative serialization
    • GE Tab → Expectations suite (feed to Great Expectations service)
    • Click download to save locally or via CI/CD

User Guide

Step-by-Step Walkthrough

Scenario: Generate DQ Contract for TM_ORG (Local Government Organizations Table)

Input Data Dictionary (CSV):

table,variable,description,type,format,validation,haspii,dq_completeness,dq_uniqueness,dq_validity,dq_consistency,source
TM_ORG,ORG_ID,ID Primary Key,Integer,ตัวเลข,NOT NULL,No,NOT_NULL,UNIQUE,ตามประเภทข้อมูล,ต้องสอดคล้องกับตารางอ้างอิง,ระบบข้อมูลกลาง อปท.
TM_ORG,AMPHUR_ID,ID ของอำเภอ,Integer,ตัวเลข,NULL,No,NULLABLE,Non-Unique,ตามประเภทข้อมูล,ต้องสอดคล้องกับตารางอ้างอิง,ระบบข้อมูลกลาง อปท.
TM_ORG,ORG_CODE,รหัส อปท.,String,ข้อความ,NULL,No,NULLABLE,Non-Unique,ตามประเภทข้อมูล,ต้องสอดคล้องกับตารางอ้างอิง,ระบบข้อมูลกลาง อปท.
TM_ORG,ORG_NAME,ชื่อ อปท.,String,ข้อความ,NULL,No,NULLABLE,Non-Unique,ตามประเภทข้อมูล,ต้องสอดคล้องกับตารางอ้างอิง,ระบบข้อมูลกลาง อปท.
TM_ORG,LAT,ละติจูด,String,ข้อความ,NULL,No,NULLABLE,Non-Unique,ตามประเภทข้อมูล,ต้องสอดคล้องกับตารางอ้างอิง,ระบบข้อมูลกลาง อปท.
TM_ORG,LNG,ลองจิจูด,String,ข้อความ,NULL,No,NULLABLE,Non-Unique,ตามประเภทข้อมูล,ต้องสอดคล้องกับตารางอ้างอิง,ระบบข้อมูลกลาง อปท.

Step 1: Configure Metadata

Dataset Name:    ข้อมูลรายชื่อองค์กรปกครองส่วนท้องถิ่น
Source System:   ระบบศูนย์ข้อมูลการเลือกตั้งท้องถิ่นแห่งชาติ (NLC)
Owner:           data-steward-dla@mol.go.th
Classification:  Open

Step 2: Set Thresholds

Completeness: 0.90
Validity:     0.90
Uniqueness:   1.00
Consistency:  0.95

Step 3: Upload CSV

  • Drag TM_ORG.csv onto drop zone
  • Engine displays: "6 rows · 6 columns"

Step 4: Generate

  • Click "Generate Contract"
  • Engine runs for ~2 seconds
  • Displays summary:
    • 6 columns
    • 11 tests
    • 3 Critical (ORG_ID: not_null + unique, LAT/LNG: geo-validation)
    • PII: NO

Step 5: Review Output

YAML Tab shows contract:

version: "1.0.0"
contract_version: 2
dataset_id: "TM_ORG-a1b2c3d4-e5f6-g7h8-i9j0-k1l2m3n4o5p6"
dataset_name: "ข้อมูลรายชื่อองค์กรปกครองส่วนท้องถิ่น"
source_system: "ระบบศูนย์ข้อมูลการเลือกตั้งท้องถิ่นแห่งชาติ (NLC)"
logical_table: "TM_ORG"

governance_policy:
  classification: "Open"
  security_level: "Public"
  pii_contains: false
  quality_thresholds:
    completeness: 0.90
    validity: 0.90

schema:
  - column: "ORG_ID"
    description: "ID Primary Key"
    type: "Integer"
    classification: "public"
    pii_type: "none"
    masking_policy: "none"
    retention: "permanent"
    owner: "data-steward-dla@mol.go.th"
    business_term: "organization_identifier"
    constraints:
      - primary_key: true
      - nullable: false
    tests:
      - unique: true
      - not_null: true
  
  # ... (more columns) ...

lifecycle:
  status: "Active"
  retention: "Permanent"

GE Tab shows expectations:

expectation_suite_name: "tm_org_suite"
data_asset_type: null
meta:
  dataset_id: "TM_ORG-a1b2c3d4..."
  dataset_name: "ข้อมูลรายชื่อองค์กรปกครองส่วนท้องถิ่น"
  source_system: "ระบบศูนย์ข้อมูลการเลือกตั้งท้องถิ่นแห่งชาติ (NLC)"
  generated_at: "2025-04-28T10:30:00Z"
  contract_version: 2

expectations:
  # --- ORG_ID ---
  - expectation_type: expect_column_to_exist
    kwargs:
      column: "ORG_ID"
    meta:
      severity: "Critical"
      business_term: "organization_identifier"

  - expectation_type: expect_column_values_to_not_be_null
    kwargs:
      column: "ORG_ID"
    meta:
      severity: "Critical"
      owner: "data-steward-dla@mol.go.th"

  - expectation_type: expect_column_values_to_be_unique
    kwargs:
      column: "ORG_ID"
    meta:
      severity: "Critical"

  # --- LAT ---
  - expectation_type: expect_column_values_to_match_regex
    kwargs:
      column: "LAT"
      regex: "^-?([1-8]?[0-9](\\.[0-9]+)?|90(\\.0+)?)$"
    meta:
      severity: "High"
      owner: "data-steward-dla@mol.go.th"

Step 6: Download & Deploy

# Download contract
click "⬇ download .yaml"  →  tm_org_contract.yaml

# Download GE expectations
click GE tab
click "⬇ download ge_suite.yaml"  →  tm_org_suite.yaml

# Commit to repo
git add data-contracts/tm_org_contract.yaml
git add great-expectations/ge_suites/tm_org_suite.yaml
git commit -m "feat: auto-generated DQ contract for TM_ORG table"
git push

Interactive Editing (Future Phase)

Currently, thresholds are set globally. Future POC v1.1 will support:

  • ✏️ Edit rule logic per column
  • 🎚️ Override thresholds per test
  • 🏷️ Add custom tags/metadata
  • 🔗 Resolve FK references
  • 📋 Export as dbt YAML directly

Data Contract Specification

Contract Format: YAML + TOML

The contract format is inspired by dbt artifacts + Great Expectations metadata + DGA Thailand standards.

Top-Level Structure

version: "1.0.0"                          # Contract schema version
contract_version: 2                       # Business contract version (incrementable)
dataset_id: "UUID"                        # Unique immutable ID
dataset_name: "ชื่อ dataset"              # Display name (Thai/English)
source_system: "System name"              # Source system identifier
logical_table: "TABLE_NAME"               # Table name in source

governance_policy:                        # Global governance rules
  classification: "Open|Internal|..."     # Data classification (DGA standard)
  security_level: "Public|Internal|..."   # Security level
  pii_contains: boolean                   # Has PII? (PDPA flag)
  quality_thresholds:
    completeness: 0.90                    # Pass if ≥ 90% not null
    validity: 0.90                        # Pass if ≥ 90% valid format
    uniqueness: 1.00                      # Pass if 100% unique (PKs)
    consistency: 0.95                     # Pass if 95%+ FK valid

schema:                                   # Field-level governance
  - column: "COLUMN_NAME"
    description: "Plain text description"
    type: "Integer|String|Float|Boolean|Date|Timestamp"
    classification: "public|internal|confidential"
    pii_type: "none|national_id|phone|email|name|address|financial"
    masking_policy: "none|hash_sha256|mask_last_4|suppress|tokenize|encrypt"
    retention: "permanent|7-years|5-years|1-year|90-days"
    owner: "email@domain.com"
    business_term: "snake_case_term"
    constraints:                          # Schema constraints
      - primary_key: true|false
      - foreign_key: "REFERENCE_TABLE"
      - nullable: true|false
    tests:                                # DQ tests to enforce
      - unique: true
      - not_null: true
      - length: { max: 255 }
      - format: "regex:PATTERN"           # Validate regex pattern
      - range: { min: 0, max: 100 }
      - accepted_values: ["val1", "val2"]
    ge_expectations:                      # GE test mapping (auto-generated)
      - expectation: "expect_column_values_to_not_be_null"
        params: {}
        severity: "Critical"
      - expectation: "expect_column_values_to_be_unique"
        params: {}
        severity: "Critical"
    severity: "Critical|High|Medium|Low"  # Business severity

lifecycle:                                # Data lifecycle control
  status: "Active|Archived|Deprecated"
  retention: "Permanent|[duration]"

Example: Complete Contract (Minimal)

See samples/tm_org_contract.yaml in repo.


Integration with Great Expectations

Architecture

DQ Rule Engine
     │
     ├─ infer_tests() → [unique, not_null, regex, ...]
     │
     └─ map_to_ge_expectations()
         │
         ├─ test: unique → expect_column_values_to_be_unique
         ├─ test: not_null → expect_column_values_to_not_be_null
         ├─ test: format regex → expect_column_values_to_match_regex
         ├─ test: length max → expect_column_value_lengths_to_be_between
         ├─ test: range → expect_column_values_to_be_between
         └─ test: accepted_values → expect_column_values_to_be_in_set
         
              ↓
         
         ge_suite.yaml (Expectation Suite)
         
              ↓
         
         GE Service / Checkpoint Runner
         
              ↓
         
         Validation Results (PASS/FAIL)

Test-to-Expectation Mapping

| Contract Test | GE Expectation Type | GE Parameters | |---|---|---| | unique: true | expect_column_values_to_be_unique | column, result_format | | not_null: true | expect_column_values_to_not_be_null | column, result_format | | length: {max: 255} | expect_column_value_lengths_to_be_between | column, min_value, max_value | | format: "regex:^[0-9]{13}$" | expect_column_values_to_match_regex | column, regex | | range: {min: 0, max: 100} | expect_column_values_to_be_between | column, min_value, max_value | | accepted_values: [...] | expect_column_values_to_be_in_set | column, value_set | | foreign_key: "REF_TABLE" | expect_column_values_to_be_in_set | column, from_table |

Workflow: Generate → Deploy → Run

1. Generate Contract

# In Rule Engine UI:
# - Upload CSV
# - Configure metadata
# - Click "Generate Contract"
# - Download ge_suite.yaml

2. Deploy Expectations

# In GE Service repository:
cp tm_org_suite.yaml great_expectations/expectations/tm_org.yml

# Or use GE CLI:
great_expectations suite new \
  --expectation-suite-name tm_org_suite \
  --profile local

3. Create Checkpoint

# great_expectations/checkpoints/tm_org_checkpoint.yml
name: tm_org_checkpoint
config_version: 1.0
class_name: SimpleCheckpoint
run_name_template: "%Y%m%d_%H%M%S"
validations:
  - batch_request:
      datasource_name: local_postgres
      data_connector_name: default
      data_asset_name: tm_org
      partition_request:
        index: -1
    expectation_suite_name: tm_org_suite
    action_list:
      - name: store_validation_result
        action:
          class_name: StoreValidationResultAction
      - name: slack_notification
        action:
          class_name: SlackNotificationAction

4. Run Validation

great_expectations checkpoint run tm_org_checkpoint

5. View Results

# In Data Docs:
open great_expectations/uncommitted/data_docs/index.html
# See: # of passing/failing expectations, severity breakdown

GE Service Integration Points

| Stage | Integration | |---|---| | Data Load | GE reads from datasource (BigQuery, Postgres, Snowflake, etc.) | | Expectation Run | GE applies expectations from suite YAML | | Validation Result | GE stores JSON result in uncommitted/validations/ | | Action | Webhook → notify Slack, log to incident table, fail CI/CD | | Reporting | Data Docs HTML + GE Cloud (if enabled) |


Rule Inference Engine

How Rules Are Inferred

Rule-Based Mode (Default)

Pattern matching on column metadata:

def infer_rules(row):
    col = row['variable']
    type = row['type']
    validation = row['validation']  # NOT NULL | NULL
    haspii = row['haspii']
    dq_completeness = row['dq_completeness']
    dq_uniqueness = row['dq_uniqueness']
    
    rules = []
    
    # 1. NOT NULL check
    if validation == 'NOT NULL' or dq_completeness == 'NOT_NULL':
        rules.append(Test.NOT_NULL)
    
    # 2. UNIQUE check (Primary Key)
    if dq_uniqueness == 'UNIQUE' or col.endswith('_ID'):
        rules.append(Test.UNIQUE)
        rules.append(Constraint.PRIMARY_KEY)
    
    # 3. GEO validation (Thailand bounds)
    if col in ['LAT', 'LATITUDE']:
        rules.append(Format.REGEX('^-?([1-8]?[0-9](\\.[0-9]+)?|90(\\.0+)?)$'))
    if col in ['LNG', 'LONGITUDE']:
        rules.append(Format.REGEX('^-?((1[0-7][0-9]|[1-9]?[0-9])(\\.[0-9]+)?|180(\\.0+)?)$'))
    
    # 4. Thai Citizen ID (13 digit)
    if 'CITIZEN' in col or 'ID_CARD' in col:
        rules.append(Format.REGEX('^[0-9]{13}$'))
    
    # 5. Thai Phone (0XXXXXXXXX)
    if 'PHONE' in col or 'MOBILE' in col:
        rules.append(Format.REGEX('^0[0-9]{8,9}$'))
    
    # 6. Positive numeric (AREA, AMOUNT, BUDGET, etc.)
    if any(k in col for k in ['AREA', 'AMOUNT', 'BUDGET', 'COUNT']):
        rules.append(Range.MIN(0))
    
    # 7. String length max
    if type.lower() == 'string':
        rules.append(Length.MAX(255))
    
    # 8. Enum validation (TYPE, SIZE, STATUS, etc.)
    if any(k in col for k in ['TYPE', 'SIZE', 'STATUS', 'CATEGORY']):
        rules.append(AcceptedValues(['[define enum list]']))
    
    # 9. PII masking
    if haspii.lower() == 'yes':
        pii_type = detect_pii_type(col)
        rules.append(Masking.POLICY(suggest_mask(pii_type)))
    
    # 10. Foreign Key reference
    if 'FOREIGN' in validation or 'FK' in validation:
        ref_table = extract_ref_table(description)
        rules.append(Constraint.FOREIGN_KEY(ref_table))
    
    return rules

AI Mode (Optional)

Claude API analyzes:

  • Column type & format
  • Business context from description
  • DQ metadata (completeness, uniqueness, validity, consistency)
  • PII type & masking requirements
  • Thai regulatory context (PDPA, DGA, BOT, NCSA)
  • Industry playbook (Government, Finance, Retail, Healthcare)

Prompt includes:

You are a Senior Data Engineer. Analyze this column and suggest DQ tests.

Column: AMPHUR_ID
Type: Integer
Description: ID ของอำเภอ (Foreign Key)
PII: No
DQ Completeness: NULLABLE
DQ Uniqueness: Non-Unique
DQ Validity: ตามประเภทข้อมูล (as data type)
DQ Consistency: ต้องสอดคล้องกับตารางอ้างอิง (must match FK table)

Suggest:
1. Constraints (primary_key, nullable, foreign_key)
2. Tests (unique, not_null, format, range, accepted_values)
3. Great Expectations mapping
4. Severity (Critical/High/Medium/Low)
5. Masking policy if PII

Claude responds with structured JSON → converted to contract.


API Specification (Claude AI Mode)

Request

POST https://api.anthropic.com/v1/messages
Content-Type: application/json

{
  "model": "claude-sonnet-4-20250514",
  "max_tokens": 4000,
  "system": "You are a Senior Data Engineer. Return only valid JSON...",
  "messages": [
    {
      "role": "user",
      "content": "[PROMPT with data dictionary]"
    }
  ]
}

Response

{
  "content": [
    {
      "type": "text",
      "text": "{\"version\": \"1.0.0\", \"contract_version\": 2, \"dataset_id\": \"...\", ...}"
    }
  ]
}

Error Handling

If AI fails:

  1. Log error to console
  2. Fall back to Rule-Based mode
  3. Render contract with rule-based inference
  4. Display notification: "AI fallback → Rule-based engine"

Deployment Roadmap

Phase 1: POC (Current)

  • ✅ CSV upload + metadata config
  • ✅ Rule-based inference
  • ✅ YAML/TOML contract generation
  • ✅ GE expectations mapping
  • ✅ AI mode (optional, fallback enabled)
  • ✅ Download contract files

Deployment: Standalone HTML file → browser

Phase 2: Production (Q2 2025)

  • 📋 Web service (FastAPI / Node.js)
  • 💾 Contract versioning + history
  • 🔄 Git integration (auto-commit to repo)
  • 🧪 Great Expectations service integration (webhook)
  • 📊 Web UI with validation results
  • 🔐 Authentication (OAuth2 / LDAP)
  • 📈 Audit log (who changed what, when)

Deployment: Docker container + Kubernetes

Phase 3: Enterprise (Q3 2025)

  • 🤖 Continuous AI improvement (reinforcement from GE results)
  • 🔗 Data Catalog integration (OpenMetadata / DataHub)
  • 📡 dbt Cloud / Alation API integration
  • 🎯 Custom rule templates per business unit
  • 📱 Mobile app for on-the-go reviews
  • 🌍 Multi-language support (Thai, English, Khmer, Lao)
  • ☁️ SaaS offering (cloud-hosted)

Troubleshooting

Issue: "CSV ว่างหรือ format ไม่ถูก"

Cause: File has fewer than 2 rows or headers are malformed

Solution:

  1. Verify CSV has header row + at least 1 data row
  2. Check encoding: should be UTF-8 or UTF-8 with BOM
  3. Ensure no blank rows in middle of file
  4. Validate with head -10 file.csv in terminal

Issue: "AI fallback → Rule-based engine" appears

Cause: Claude API call failed (timeout, invalid key, network issue)

Solution:

  1. Check API key: starts with sk-ant-?
  2. Verify network connectivity
  3. Check Claude API quota (rate limits)
  4. Review error in browser console (F12 → Console tab)
  5. Rule-based engine will still generate contract, just less context-aware

Issue: GE Expectations syntax error when deploying

Cause: Regex in format field not properly escaped

Example: YAML doesn't like \ → should be \\ in YAML quoted strings

Solution:

# ❌ WRONG
- format: "regex:^\d{13}$"

# ✅ CORRECT
- format: "regex:^[0-9]{13}$"

All regexes in generated contract are pre-escaped. If you manually edit, use double backslashes.

Issue: Downloaded contract has wrong dataset_id

Cause: Auto-generated UUID is not suitable for your CI/CD

Solution:

  1. Manually edit dataset_id in YAML before commit
  2. Future v1.1 will allow custom dataset_id prefix (e.g., DLA-TM_ORG-001)

Issue: Foreign Key test failing

Cause: Referenced table doesn't exist in GE data source

Solution:

# Option 1: Remove FK test (if reference data not in same datasource)
# Option 2: Ensure reference table is in same datasource before running checkpoint
# Option 3: Add conditional logic in checkpoint (only run FK if both tables available)

FAQ

Q: Can I use this without Claude API?

A: Yes! Rule-Based mode is default and requires no API key. AI mode is optional. If AI fails, fallback is automatic.

Q: What CSV columns are required?

A: Minimum: table, variable, type, description. Optional: validation, haspii, dq_completeness, dq_uniqueness, dq_validity, dq_consistency, source, format.

Q: Can I edit the generated contract?

A: Yes. YAML/TOML are text formats. Edit before deploying to production. Manual edits = source of truth until next re-generation.

Q: How do I integrate with dbt?

A: Contract can be converted to dbt schema.yml in future versions. Currently: export YAML, then manually write dbt tests. Plan: direct dbt YAML output in v1.1.

Q: Does this support BigQuery, Snowflake, etc.?

A: Contract is datasource-agnostic. GE Expectations work with any supported GE datasource plugin (Postgres, BigQuery, Snowflake, Redshift, DuckDB, etc.).

Q: What about Timeliness (SLA) rules?

A: Currently inferred only if last_updated column present. Future v1.1 will add explicit SLA time window configuration.

Q: Can I use this for non-Thai data?

A: Yes. Thai-specific rules (citizen ID, phone regex, geographic bounds) are optional. Rule-based engine adapts to your column naming conventions globally.

Q: How do I version control contracts?

A: Commit YAML files to git. Use semantic versioning:

dataset_id: TM_ORG-1.0.0
contract_version: 1

# Next iteration:
contract_version: 2  # schema change
# or v1.0.1 = minor governance policy update (no schema change)

Q: Can this generate dbt tests too?

A: Not directly (POC scope). But YAML format mirrors dbt schema.yml, so manual conversion is straightforward. Plan for v1.1.

Q: Is there a CLI version?

A: POC is browser-only. CLI version planned for Phase 2 (Python package with CLI):

dq-rule-engine generate \
  --csv data_dictionary.csv \
  --output contract.yaml \
  --format yaml \
  --ai-mode

Q: How do I suggest improvements?

A: Create an issue or PR in the repo. POC is open for community feedback.

Q: License & Attribution?

A: POC is built with DAMA-DMBOK, DGI, PDPA B.E. 2562, DGA Thailand standards. Commercial use requires licensing review.


Next Steps

For POC Users

  1. Try it: Download dq_rule_engine_v2.html → open in browser
  2. Validate: Upload your Data Dictionary CSV → compare generated contract with manual rules
  3. Iterate: Adjust metadata & thresholds → re-generate → review differences
  4. Feedback: Report issues, suggest features via GitHub issues

For Developers

  1. Extend: Add custom rule templates per business domain
  2. Integrate: Hook into CI/CD pipeline (e.g., GitHub Actions workflow)
  3. Automate: Fetch CSV from data catalog API → generate → commit contract automatically
  4. Monitor: Connect to GE validation results → track rule drift over time

For Data Stewards

  1. Adopt: Use as standard governance workflow for all new datasets
  2. Document: Store generated contracts in centralized governance repo
  3. Train: Share with team → establish "contract-first" culture
  4. Audit: Review quarterly → ensure contracts stay aligned with data

References

Standards & Frameworks

Related Tools

  • Data Catalog: OpenMetadata, Alation, DataHub
  • Data Lineage: Apache Atlas, Collibra
  • DQ Monitoring: Soda, Databand, Monte Carlo
  • DQ Testing: Great Expectations, dbt tests, Dataedo

Example Repositories

  • data-contracts/ — Versioned YAML contracts
  • great-expectations/ — GE suites, checkpoints, configs
  • dbt-models/models/schema.yml — dbt tests (manual from contract)

Contact & Support

POC Questions

  • GitHub Issues: repo/issues
  • Email: data-governance@org.th
  • Slack: #data-quality-eng

Feedback

  • Feature requests: GitHub Discussions
  • Bug reports: GitHub Issues with:
    • CSV file (sanitized)
    • Expected vs. actual output
    • Browser/OS info
    • Error log (F12 → Console)

Document Version: 1.0.0 — POC
Last Updated: 2025-Q2
Maintained By: Data Governance Team
Status: Active — Community Feedback Welcome


Appendix: Sample Files

Sample Data Dictionary CSV

File: samples/tm_org_dictionary.csv

table,variable,description,type,format,validation,haspii,dq_completeness,dq_uniqueness,dq_validity,dq_consistency,source
TM_ORG,ORG_ID,ID Primary Key,Integer,ตัวเลข,NOT NULL,No,NOT_NULL,UNIQUE,ตามประเภทข้อมูล,ต้องสอดคล้องกับตารางอ้างอิง,ระบบข้อมูลกลาง อปท.
TM_ORG,AMPHUR_ID,ID ของอำเภอ,Integer,ตัวเลข,NULL,No,NULLABLE,Non-Unique,ตามประเภทข้อมูล,ต้องสอดคล้องกับตารางอ้างอิง,ระบบข้อมูลกลาง อปท.

Sample Generated Contract (YAML)

File: samples/tm_org_contract.yaml [See Data Contract Specification section]

Sample GE Expectations Suite

File: samples/tm_org_suite.yaml [See Integration with Great Expectations section]


End of Documentation