DQ Rule Engine v2 — POC Documentation
Version: 1.0.0
Status: Proof of Concept (POC)
Date: 2025-Q2
Framework: DAMA-DMBOK v2 · DGI · PDPA B.E. 2562 · DGA Thailand
📋 Table of Contents
- Executive Summary
- System Architecture
- Installation & Setup
- User Guide
- Data Contract Specification
- Integration with Great Expectations
- Rule Inference Engine
- API Specification (Claude AI Mode)
- Deployment Roadmap
- Troubleshooting
- FAQ
Executive Summary
Purpose
DQ Rule Engine v2 is an automated Data Quality standard generator that transforms Data Dictionary CSV files into production-ready Data Contracts (YAML/TOML) and Great Expectations test suites.
Problem Statement
- Manual creation of DQ rules is time-consuming and error-prone
- No standardized translation between business requirements → technical tests
- Great Expectations configs require manual mapping of column rules
- PDPA/DGA compliance checks are scattered across multiple documents
Solution
Rule Engine v2 provides:
- ✅ Automatic rule inference from Data Dictionary metadata
- ✅ Multi-format output (YAML, TOML, GE Expectations)
- ✅ Hybrid execution (Claude AI + Rule-based fallback)
- ✅ Built-in PDPA/DGA/Thai context compliance
- ✅ Interactive threshold editing before deployment
- ✅ Direct integration with Great Expectations service
Key Metrics
| Metric | Baseline | With Engine | |---|---|---| | Rule definition time | 8 hours per table | 2 minutes per table | | Test coverage | 40% | 95%+ | | Compliance validation | Manual | Automated | | Feedback loop | Days | Real-time |
System Architecture
High-Level Flow
┌─────────────────────────────────────────────────────────────┐
│ User (Data Steward) │
└──────────────────────────┬──────────────────────────────────┘
│
┌──────▼──────┐
│ Rule Engine │
│ v2.0 │
└──────┬──────┘
│
┌────────────────┼────────────────┐
│ │ │
┌─────▼─────┐ ┌──────▼──────┐ ┌─────▼─────┐
│ CSV Loader│ │ Mode Selector│ │ Metadata │
│ │ │ │ │ Enricher │
└─────┬─────┘ └──────┬──────┘ └─────┬─────┘
│ │ │
└────────────────┼────────────────┘
│
┌──────▼──────┐
│Rule Generator
│ ┌─────────┐ │
│ │AI Engine │ │ (Claude API optional)
│ └─────────┘ │
│ ┌─────────┐ │
│ │Rule-based│ │ (deterministic, no API)
│ └─────────┘ │
└──────┬──────┘
│
┌────────────────┼────────────────┐
│ │ │
┌─────▼─────┐ ┌──────▼──────┐ ┌─────▼──────┐
│YAML Output│ │TOML Output │ │GE Expectations
│ (Contract)│ │ (Contract) │ │ (ge_suite.yaml)
└─────┬─────┘ └──────┬──────┘ └─────┬──────┘
│ │ │
└────────────────┼────────────────┘
│
┌────────────────┴────────────────┐
│ │
┌─────▼─────────────────┐ ┌─────────▼──────────┐
│ Data Governance Repo │ │ Great Expectations
│ (data contracts/) │ │ Service (checkpoints/)
└───────────────────────┘ └────────────────────┘
│
(run tests)
Component Breakdown
1. CSV Loader
- Reads Data Dictionary CSV files (UTF-8 / UTF-8-BOM)
- Supports columns:
table, variable, type, validation, format, haspii, dq_completeness, dq_uniqueness, dq_validity, dq_consistency, description, source - Auto-fills contract metadata from CSV header
2. Mode Selector
- Claude AI Mode (optional, requires API key)
- Uses Claude Sonnet 4 to intelligently infer rules based on business context
- Better for complex domain-specific rules
- Requires
sk-ant-*API key
- Rule-Based Mode (default, no external dependency)
- Deterministic rule inference based on column metadata
- Fast, reproducible, no API costs
- Fallback when AI fails
3. Metadata Enricher
- Sets contract version, dataset ID, governance policy
- Applies global thresholds (Completeness, Validity, etc.)
- Injects owner, classification, retention policy
4. Rule Generator
- Scans column types, validation rules, DQ dimensions
- Infers constraints (PRIMARY KEY, FOREIGN KEY, NOT NULL, NULLABLE)
- Maps to dbt tests and Great Expectations expectations
- Thai-specific rules: citizen ID format, geographic boundaries, phone regex
5. Output Formatters
- YAML Contract → Data Contract format (production-ready)
- TOML Contract → Alternative serialization for CI/CD pipelines
- GE Expectations Suite → Mappings to Great Expectations test types
Installation & Setup
Prerequisites
- Modern web browser (Chrome, Firefox, Safari, Edge)
- CSV Data Dictionary file
- (Optional) Claude API key for AI mode
Quick Start
-
Open the Engine
# Download dq_rule_engine_v2.html and open in browser open dq_rule_engine_v2.html -
Configure Metadata (Left Panel)
- Dataset Name:
ชื่อ dataset ภาษาไทย/อังกฤษ - Source System: System source name (e.g., "ระบบศูนย์ข้อมูล NLC")
- Owner: Steward email (e.g.,
data-steward@dla.go.th) - Classification:
Open | Internal | Confidential | Restricted
- Dataset Name:
-
Set Quality Thresholds
- Completeness: Default
0.90(90%) - Validity: Default
0.90(90%) - Uniqueness: Default
1.00(100% for PKs) - Consistency: Default
0.95(95% for FKs)
- Completeness: Default
-
Select Output Format (optional)
- YAML (default) — Human-readable, git-friendly
- TOML — Alternative for CI/CD pipelines
-
Upload Data Dictionary CSV
- Drag & drop or click to browse
- Auto-loads preview of 6 rows
- Engine auto-fills dataset name from
tablecolumn
-
Generate Contract
- Click "Generate Contract"
- Engine analyzes columns → infers rules
- Displays summary: # of columns, tests, critical rules, PII detected
-
Review & Download
- YAML Tab → Data Contract (feed to governance repo)
- TOML Tab → Alternative serialization
- GE Tab → Expectations suite (feed to Great Expectations service)
- Click download to save locally or via CI/CD
User Guide
Step-by-Step Walkthrough
Scenario: Generate DQ Contract for TM_ORG (Local Government Organizations Table)
Input Data Dictionary (CSV):
table,variable,description,type,format,validation,haspii,dq_completeness,dq_uniqueness,dq_validity,dq_consistency,source
TM_ORG,ORG_ID,ID Primary Key,Integer,ตัวเลข,NOT NULL,No,NOT_NULL,UNIQUE,ตามประเภทข้อมูล,ต้องสอดคล้องกับตารางอ้างอิง,ระบบข้อมูลกลาง อปท.
TM_ORG,AMPHUR_ID,ID ของอำเภอ,Integer,ตัวเลข,NULL,No,NULLABLE,Non-Unique,ตามประเภทข้อมูล,ต้องสอดคล้องกับตารางอ้างอิง,ระบบข้อมูลกลาง อปท.
TM_ORG,ORG_CODE,รหัส อปท.,String,ข้อความ,NULL,No,NULLABLE,Non-Unique,ตามประเภทข้อมูล,ต้องสอดคล้องกับตารางอ้างอิง,ระบบข้อมูลกลาง อปท.
TM_ORG,ORG_NAME,ชื่อ อปท.,String,ข้อความ,NULL,No,NULLABLE,Non-Unique,ตามประเภทข้อมูล,ต้องสอดคล้องกับตารางอ้างอิง,ระบบข้อมูลกลาง อปท.
TM_ORG,LAT,ละติจูด,String,ข้อความ,NULL,No,NULLABLE,Non-Unique,ตามประเภทข้อมูล,ต้องสอดคล้องกับตารางอ้างอิง,ระบบข้อมูลกลาง อปท.
TM_ORG,LNG,ลองจิจูด,String,ข้อความ,NULL,No,NULLABLE,Non-Unique,ตามประเภทข้อมูล,ต้องสอดคล้องกับตารางอ้างอิง,ระบบข้อมูลกลาง อปท.
Step 1: Configure Metadata
Dataset Name: ข้อมูลรายชื่อองค์กรปกครองส่วนท้องถิ่น
Source System: ระบบศูนย์ข้อมูลการเลือกตั้งท้องถิ่นแห่งชาติ (NLC)
Owner: data-steward-dla@mol.go.th
Classification: Open
Step 2: Set Thresholds
Completeness: 0.90
Validity: 0.90
Uniqueness: 1.00
Consistency: 0.95
Step 3: Upload CSV
- Drag TM_ORG.csv onto drop zone
- Engine displays: "6 rows · 6 columns"
Step 4: Generate
- Click "Generate Contract"
- Engine runs for ~2 seconds
- Displays summary:
- 6 columns
- 11 tests
- 3 Critical (ORG_ID: not_null + unique, LAT/LNG: geo-validation)
- PII: NO
Step 5: Review Output
YAML Tab shows contract:
version: "1.0.0"
contract_version: 2
dataset_id: "TM_ORG-a1b2c3d4-e5f6-g7h8-i9j0-k1l2m3n4o5p6"
dataset_name: "ข้อมูลรายชื่อองค์กรปกครองส่วนท้องถิ่น"
source_system: "ระบบศูนย์ข้อมูลการเลือกตั้งท้องถิ่นแห่งชาติ (NLC)"
logical_table: "TM_ORG"
governance_policy:
classification: "Open"
security_level: "Public"
pii_contains: false
quality_thresholds:
completeness: 0.90
validity: 0.90
schema:
- column: "ORG_ID"
description: "ID Primary Key"
type: "Integer"
classification: "public"
pii_type: "none"
masking_policy: "none"
retention: "permanent"
owner: "data-steward-dla@mol.go.th"
business_term: "organization_identifier"
constraints:
- primary_key: true
- nullable: false
tests:
- unique: true
- not_null: true
# ... (more columns) ...
lifecycle:
status: "Active"
retention: "Permanent"
GE Tab shows expectations:
expectation_suite_name: "tm_org_suite"
data_asset_type: null
meta:
dataset_id: "TM_ORG-a1b2c3d4..."
dataset_name: "ข้อมูลรายชื่อองค์กรปกครองส่วนท้องถิ่น"
source_system: "ระบบศูนย์ข้อมูลการเลือกตั้งท้องถิ่นแห่งชาติ (NLC)"
generated_at: "2025-04-28T10:30:00Z"
contract_version: 2
expectations:
# --- ORG_ID ---
- expectation_type: expect_column_to_exist
kwargs:
column: "ORG_ID"
meta:
severity: "Critical"
business_term: "organization_identifier"
- expectation_type: expect_column_values_to_not_be_null
kwargs:
column: "ORG_ID"
meta:
severity: "Critical"
owner: "data-steward-dla@mol.go.th"
- expectation_type: expect_column_values_to_be_unique
kwargs:
column: "ORG_ID"
meta:
severity: "Critical"
# --- LAT ---
- expectation_type: expect_column_values_to_match_regex
kwargs:
column: "LAT"
regex: "^-?([1-8]?[0-9](\\.[0-9]+)?|90(\\.0+)?)$"
meta:
severity: "High"
owner: "data-steward-dla@mol.go.th"
Step 6: Download & Deploy
# Download contract
click "⬇ download .yaml" → tm_org_contract.yaml
# Download GE expectations
click GE tab
click "⬇ download ge_suite.yaml" → tm_org_suite.yaml
# Commit to repo
git add data-contracts/tm_org_contract.yaml
git add great-expectations/ge_suites/tm_org_suite.yaml
git commit -m "feat: auto-generated DQ contract for TM_ORG table"
git push
Interactive Editing (Future Phase)
Currently, thresholds are set globally. Future POC v1.1 will support:
- ✏️ Edit rule logic per column
- 🎚️ Override thresholds per test
- 🏷️ Add custom tags/metadata
- 🔗 Resolve FK references
- 📋 Export as dbt YAML directly
Data Contract Specification
Contract Format: YAML + TOML
The contract format is inspired by dbt artifacts + Great Expectations metadata + DGA Thailand standards.
Top-Level Structure
version: "1.0.0" # Contract schema version
contract_version: 2 # Business contract version (incrementable)
dataset_id: "UUID" # Unique immutable ID
dataset_name: "ชื่อ dataset" # Display name (Thai/English)
source_system: "System name" # Source system identifier
logical_table: "TABLE_NAME" # Table name in source
governance_policy: # Global governance rules
classification: "Open|Internal|..." # Data classification (DGA standard)
security_level: "Public|Internal|..." # Security level
pii_contains: boolean # Has PII? (PDPA flag)
quality_thresholds:
completeness: 0.90 # Pass if ≥ 90% not null
validity: 0.90 # Pass if ≥ 90% valid format
uniqueness: 1.00 # Pass if 100% unique (PKs)
consistency: 0.95 # Pass if 95%+ FK valid
schema: # Field-level governance
- column: "COLUMN_NAME"
description: "Plain text description"
type: "Integer|String|Float|Boolean|Date|Timestamp"
classification: "public|internal|confidential"
pii_type: "none|national_id|phone|email|name|address|financial"
masking_policy: "none|hash_sha256|mask_last_4|suppress|tokenize|encrypt"
retention: "permanent|7-years|5-years|1-year|90-days"
owner: "email@domain.com"
business_term: "snake_case_term"
constraints: # Schema constraints
- primary_key: true|false
- foreign_key: "REFERENCE_TABLE"
- nullable: true|false
tests: # DQ tests to enforce
- unique: true
- not_null: true
- length: { max: 255 }
- format: "regex:PATTERN" # Validate regex pattern
- range: { min: 0, max: 100 }
- accepted_values: ["val1", "val2"]
ge_expectations: # GE test mapping (auto-generated)
- expectation: "expect_column_values_to_not_be_null"
params: {}
severity: "Critical"
- expectation: "expect_column_values_to_be_unique"
params: {}
severity: "Critical"
severity: "Critical|High|Medium|Low" # Business severity
lifecycle: # Data lifecycle control
status: "Active|Archived|Deprecated"
retention: "Permanent|[duration]"
Example: Complete Contract (Minimal)
See samples/tm_org_contract.yaml in repo.
Integration with Great Expectations
Architecture
DQ Rule Engine
│
├─ infer_tests() → [unique, not_null, regex, ...]
│
└─ map_to_ge_expectations()
│
├─ test: unique → expect_column_values_to_be_unique
├─ test: not_null → expect_column_values_to_not_be_null
├─ test: format regex → expect_column_values_to_match_regex
├─ test: length max → expect_column_value_lengths_to_be_between
├─ test: range → expect_column_values_to_be_between
└─ test: accepted_values → expect_column_values_to_be_in_set
↓
ge_suite.yaml (Expectation Suite)
↓
GE Service / Checkpoint Runner
↓
Validation Results (PASS/FAIL)
Test-to-Expectation Mapping
| Contract Test | GE Expectation Type | GE Parameters |
|---|---|---|
| unique: true | expect_column_values_to_be_unique | column, result_format |
| not_null: true | expect_column_values_to_not_be_null | column, result_format |
| length: {max: 255} | expect_column_value_lengths_to_be_between | column, min_value, max_value |
| format: "regex:^[0-9]{13}$" | expect_column_values_to_match_regex | column, regex |
| range: {min: 0, max: 100} | expect_column_values_to_be_between | column, min_value, max_value |
| accepted_values: [...] | expect_column_values_to_be_in_set | column, value_set |
| foreign_key: "REF_TABLE" | expect_column_values_to_be_in_set | column, from_table |
Workflow: Generate → Deploy → Run
1. Generate Contract
# In Rule Engine UI:
# - Upload CSV
# - Configure metadata
# - Click "Generate Contract"
# - Download ge_suite.yaml
2. Deploy Expectations
# In GE Service repository:
cp tm_org_suite.yaml great_expectations/expectations/tm_org.yml
# Or use GE CLI:
great_expectations suite new \
--expectation-suite-name tm_org_suite \
--profile local
3. Create Checkpoint
# great_expectations/checkpoints/tm_org_checkpoint.yml
name: tm_org_checkpoint
config_version: 1.0
class_name: SimpleCheckpoint
run_name_template: "%Y%m%d_%H%M%S"
validations:
- batch_request:
datasource_name: local_postgres
data_connector_name: default
data_asset_name: tm_org
partition_request:
index: -1
expectation_suite_name: tm_org_suite
action_list:
- name: store_validation_result
action:
class_name: StoreValidationResultAction
- name: slack_notification
action:
class_name: SlackNotificationAction
4. Run Validation
great_expectations checkpoint run tm_org_checkpoint
5. View Results
# In Data Docs:
open great_expectations/uncommitted/data_docs/index.html
# See: # of passing/failing expectations, severity breakdown
GE Service Integration Points
| Stage | Integration |
|---|---|
| Data Load | GE reads from datasource (BigQuery, Postgres, Snowflake, etc.) |
| Expectation Run | GE applies expectations from suite YAML |
| Validation Result | GE stores JSON result in uncommitted/validations/ |
| Action | Webhook → notify Slack, log to incident table, fail CI/CD |
| Reporting | Data Docs HTML + GE Cloud (if enabled) |
Rule Inference Engine
How Rules Are Inferred
Rule-Based Mode (Default)
Pattern matching on column metadata:
def infer_rules(row):
col = row['variable']
type = row['type']
validation = row['validation'] # NOT NULL | NULL
haspii = row['haspii']
dq_completeness = row['dq_completeness']
dq_uniqueness = row['dq_uniqueness']
rules = []
# 1. NOT NULL check
if validation == 'NOT NULL' or dq_completeness == 'NOT_NULL':
rules.append(Test.NOT_NULL)
# 2. UNIQUE check (Primary Key)
if dq_uniqueness == 'UNIQUE' or col.endswith('_ID'):
rules.append(Test.UNIQUE)
rules.append(Constraint.PRIMARY_KEY)
# 3. GEO validation (Thailand bounds)
if col in ['LAT', 'LATITUDE']:
rules.append(Format.REGEX('^-?([1-8]?[0-9](\\.[0-9]+)?|90(\\.0+)?)$'))
if col in ['LNG', 'LONGITUDE']:
rules.append(Format.REGEX('^-?((1[0-7][0-9]|[1-9]?[0-9])(\\.[0-9]+)?|180(\\.0+)?)$'))
# 4. Thai Citizen ID (13 digit)
if 'CITIZEN' in col or 'ID_CARD' in col:
rules.append(Format.REGEX('^[0-9]{13}$'))
# 5. Thai Phone (0XXXXXXXXX)
if 'PHONE' in col or 'MOBILE' in col:
rules.append(Format.REGEX('^0[0-9]{8,9}$'))
# 6. Positive numeric (AREA, AMOUNT, BUDGET, etc.)
if any(k in col for k in ['AREA', 'AMOUNT', 'BUDGET', 'COUNT']):
rules.append(Range.MIN(0))
# 7. String length max
if type.lower() == 'string':
rules.append(Length.MAX(255))
# 8. Enum validation (TYPE, SIZE, STATUS, etc.)
if any(k in col for k in ['TYPE', 'SIZE', 'STATUS', 'CATEGORY']):
rules.append(AcceptedValues(['[define enum list]']))
# 9. PII masking
if haspii.lower() == 'yes':
pii_type = detect_pii_type(col)
rules.append(Masking.POLICY(suggest_mask(pii_type)))
# 10. Foreign Key reference
if 'FOREIGN' in validation or 'FK' in validation:
ref_table = extract_ref_table(description)
rules.append(Constraint.FOREIGN_KEY(ref_table))
return rules
AI Mode (Optional)
Claude API analyzes:
- Column type & format
- Business context from description
- DQ metadata (completeness, uniqueness, validity, consistency)
- PII type & masking requirements
- Thai regulatory context (PDPA, DGA, BOT, NCSA)
- Industry playbook (Government, Finance, Retail, Healthcare)
Prompt includes:
You are a Senior Data Engineer. Analyze this column and suggest DQ tests.
Column: AMPHUR_ID
Type: Integer
Description: ID ของอำเภอ (Foreign Key)
PII: No
DQ Completeness: NULLABLE
DQ Uniqueness: Non-Unique
DQ Validity: ตามประเภทข้อมูล (as data type)
DQ Consistency: ต้องสอดคล้องกับตารางอ้างอิง (must match FK table)
Suggest:
1. Constraints (primary_key, nullable, foreign_key)
2. Tests (unique, not_null, format, range, accepted_values)
3. Great Expectations mapping
4. Severity (Critical/High/Medium/Low)
5. Masking policy if PII
Claude responds with structured JSON → converted to contract.
API Specification (Claude AI Mode)
Request
POST https://api.anthropic.com/v1/messages
Content-Type: application/json
{
"model": "claude-sonnet-4-20250514",
"max_tokens": 4000,
"system": "You are a Senior Data Engineer. Return only valid JSON...",
"messages": [
{
"role": "user",
"content": "[PROMPT with data dictionary]"
}
]
}
Response
{
"content": [
{
"type": "text",
"text": "{\"version\": \"1.0.0\", \"contract_version\": 2, \"dataset_id\": \"...\", ...}"
}
]
}
Error Handling
If AI fails:
- Log error to console
- Fall back to Rule-Based mode
- Render contract with rule-based inference
- Display notification: "AI fallback → Rule-based engine"
Deployment Roadmap
Phase 1: POC (Current)
- ✅ CSV upload + metadata config
- ✅ Rule-based inference
- ✅ YAML/TOML contract generation
- ✅ GE expectations mapping
- ✅ AI mode (optional, fallback enabled)
- ✅ Download contract files
Deployment: Standalone HTML file → browser
Phase 2: Production (Q2 2025)
- 📋 Web service (FastAPI / Node.js)
- 💾 Contract versioning + history
- 🔄 Git integration (auto-commit to repo)
- 🧪 Great Expectations service integration (webhook)
- 📊 Web UI with validation results
- 🔐 Authentication (OAuth2 / LDAP)
- 📈 Audit log (who changed what, when)
Deployment: Docker container + Kubernetes
Phase 3: Enterprise (Q3 2025)
- 🤖 Continuous AI improvement (reinforcement from GE results)
- 🔗 Data Catalog integration (OpenMetadata / DataHub)
- 📡 dbt Cloud / Alation API integration
- 🎯 Custom rule templates per business unit
- 📱 Mobile app for on-the-go reviews
- 🌍 Multi-language support (Thai, English, Khmer, Lao)
- ☁️ SaaS offering (cloud-hosted)
Troubleshooting
Issue: "CSV ว่างหรือ format ไม่ถูก"
Cause: File has fewer than 2 rows or headers are malformed
Solution:
- Verify CSV has header row + at least 1 data row
- Check encoding: should be UTF-8 or UTF-8 with BOM
- Ensure no blank rows in middle of file
- Validate with
head -10 file.csvin terminal
Issue: "AI fallback → Rule-based engine" appears
Cause: Claude API call failed (timeout, invalid key, network issue)
Solution:
- Check API key: starts with
sk-ant-? - Verify network connectivity
- Check Claude API quota (rate limits)
- Review error in browser console (F12 → Console tab)
- Rule-based engine will still generate contract, just less context-aware
Issue: GE Expectations syntax error when deploying
Cause: Regex in format field not properly escaped
Example: YAML doesn't like \ → should be \\ in YAML quoted strings
Solution:
# ❌ WRONG
- format: "regex:^\d{13}$"
# ✅ CORRECT
- format: "regex:^[0-9]{13}$"
All regexes in generated contract are pre-escaped. If you manually edit, use double backslashes.
Issue: Downloaded contract has wrong dataset_id
Cause: Auto-generated UUID is not suitable for your CI/CD
Solution:
- Manually edit
dataset_idin YAML before commit - Future v1.1 will allow custom dataset_id prefix (e.g.,
DLA-TM_ORG-001)
Issue: Foreign Key test failing
Cause: Referenced table doesn't exist in GE data source
Solution:
# Option 1: Remove FK test (if reference data not in same datasource)
# Option 2: Ensure reference table is in same datasource before running checkpoint
# Option 3: Add conditional logic in checkpoint (only run FK if both tables available)
FAQ
Q: Can I use this without Claude API?
A: Yes! Rule-Based mode is default and requires no API key. AI mode is optional. If AI fails, fallback is automatic.
Q: What CSV columns are required?
A: Minimum: table, variable, type, description. Optional: validation, haspii, dq_completeness, dq_uniqueness, dq_validity, dq_consistency, source, format.
Q: Can I edit the generated contract?
A: Yes. YAML/TOML are text formats. Edit before deploying to production. Manual edits = source of truth until next re-generation.
Q: How do I integrate with dbt?
A: Contract can be converted to dbt schema.yml in future versions. Currently: export YAML, then manually write dbt tests. Plan: direct dbt YAML output in v1.1.
Q: Does this support BigQuery, Snowflake, etc.?
A: Contract is datasource-agnostic. GE Expectations work with any supported GE datasource plugin (Postgres, BigQuery, Snowflake, Redshift, DuckDB, etc.).
Q: What about Timeliness (SLA) rules?
A: Currently inferred only if last_updated column present. Future v1.1 will add explicit SLA time window configuration.
Q: Can I use this for non-Thai data?
A: Yes. Thai-specific rules (citizen ID, phone regex, geographic bounds) are optional. Rule-based engine adapts to your column naming conventions globally.
Q: How do I version control contracts?
A: Commit YAML files to git. Use semantic versioning:
dataset_id: TM_ORG-1.0.0
contract_version: 1
# Next iteration:
contract_version: 2 # schema change
# or v1.0.1 = minor governance policy update (no schema change)
Q: Can this generate dbt tests too?
A: Not directly (POC scope). But YAML format mirrors dbt schema.yml, so manual conversion is straightforward. Plan for v1.1.
Q: Is there a CLI version?
A: POC is browser-only. CLI version planned for Phase 2 (Python package with CLI):
dq-rule-engine generate \
--csv data_dictionary.csv \
--output contract.yaml \
--format yaml \
--ai-mode
Q: How do I suggest improvements?
A: Create an issue or PR in the repo. POC is open for community feedback.
Q: License & Attribution?
A: POC is built with DAMA-DMBOK, DGI, PDPA B.E. 2562, DGA Thailand standards. Commercial use requires licensing review.
Next Steps
For POC Users
- Try it: Download
dq_rule_engine_v2.html→ open in browser - Validate: Upload your Data Dictionary CSV → compare generated contract with manual rules
- Iterate: Adjust metadata & thresholds → re-generate → review differences
- Feedback: Report issues, suggest features via GitHub issues
For Developers
- Extend: Add custom rule templates per business domain
- Integrate: Hook into CI/CD pipeline (e.g., GitHub Actions workflow)
- Automate: Fetch CSV from data catalog API → generate → commit contract automatically
- Monitor: Connect to GE validation results → track rule drift over time
For Data Stewards
- Adopt: Use as standard governance workflow for all new datasets
- Document: Store generated contracts in centralized governance repo
- Train: Share with team → establish "contract-first" culture
- Audit: Review quarterly → ensure contracts stay aligned with data
References
Standards & Frameworks
- DAMA-DMBOK v2 — Data Management Body of Knowledge
- DGI Framework — Thai Data Governance Initiative
- PDPA B.E. 2562 — Personal Data Protection Act
- Great Expectations — Data Quality Testing Framework
- dbt — Data Transformation
Related Tools
- Data Catalog: OpenMetadata, Alation, DataHub
- Data Lineage: Apache Atlas, Collibra
- DQ Monitoring: Soda, Databand, Monte Carlo
- DQ Testing: Great Expectations, dbt tests, Dataedo
Example Repositories
data-contracts/— Versioned YAML contractsgreat-expectations/— GE suites, checkpoints, configsdbt-models/→models/schema.yml— dbt tests (manual from contract)
Contact & Support
POC Questions
- GitHub Issues: repo/issues
- Email: data-governance@org.th
- Slack: #data-quality-eng
Feedback
- Feature requests: GitHub Discussions
- Bug reports: GitHub Issues with:
- CSV file (sanitized)
- Expected vs. actual output
- Browser/OS info
- Error log (F12 → Console)
Document Version: 1.0.0 — POC
Last Updated: 2025-Q2
Maintained By: Data Governance Team
Status: Active — Community Feedback Welcome
Appendix: Sample Files
Sample Data Dictionary CSV
File: samples/tm_org_dictionary.csv
table,variable,description,type,format,validation,haspii,dq_completeness,dq_uniqueness,dq_validity,dq_consistency,source
TM_ORG,ORG_ID,ID Primary Key,Integer,ตัวเลข,NOT NULL,No,NOT_NULL,UNIQUE,ตามประเภทข้อมูล,ต้องสอดคล้องกับตารางอ้างอิง,ระบบข้อมูลกลาง อปท.
TM_ORG,AMPHUR_ID,ID ของอำเภอ,Integer,ตัวเลข,NULL,No,NULLABLE,Non-Unique,ตามประเภทข้อมูล,ต้องสอดคล้องกับตารางอ้างอิง,ระบบข้อมูลกลาง อปท.
Sample Generated Contract (YAML)
File: samples/tm_org_contract.yaml
[See Data Contract Specification section]
Sample GE Expectations Suite
File: samples/tm_org_suite.yaml
[See Integration with Great Expectations section]
End of Documentation