DQ Engine — Documentation

DQ Engine converts your Data Dictionary into production-ready Data Contracts, CI/CD configs, and automated test suites in under 2 minutes. Built for Thai government data stewards and enterprise data teams.

System Architecture

[Architecture diagram. Entry: the User (Data Steward) uploads a Data Dictionary through File Upload (.csv / .xlsx), Paste Text (CSV / TSV), or Google Sheets (public URL). Process: the parsed rows go to the Metadata Enricher (Dataset, Owner, DGA Classification, Thresholds) and the selected Engine Mode (AI (Claude) or Rule-based), which produces a ContractResult. Rule modules: PK / FK (constraint rules), Thai Rules (ID, Phone, Geo), PII Masking (PDPA policy), DQ Thresholds (4 dimensions). Outputs: YAML (Data Contract), TOML (CI/CD Config), GE Suite (ge_suite.yaml). Destinations: Data Governance Repo (data_contracts/), Great Expectations (checkpoints/ → run tests), Recent Contracts (browser localStorage). Legend: Entry / Output, Process, Rule module.]

What is DQ Engine?

A browser-based tool that reads your column-level metadata (Data Dictionary) and automatically generates:

Data Contract YAML — governance-ready contract for your data repository
Data Contract TOML — config for CI/CD pipelines
Great Expectations Suite — ready-to-run automated data quality tests

No installation. In rule-based mode everything runs in your browser; AI mode sends your column metadata to the Anthropic API.

Input: Data Dictionary

DQ Engine accepts your Data Dictionary in three ways:

1. File Upload (CSV or Excel)

Upload a .csv or .xlsx file. For Excel files with multiple sheets, choose which sheet to process.

2. Paste Text

Paste CSV or TSV data copied directly from Excel or Google Sheets.
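As an illustration of how pasted text might be parsed, here is a minimal Python sketch. The function name `parse_pasted_text` and the delimiter heuristic (a tab in the first line means TSV) are assumptions for illustration, not DQ Engine's documented behavior.

```python
import csv
import io


def parse_pasted_text(text: str) -> list[dict[str, str]]:
    """Parse pasted CSV or TSV text into a list of row dictionaries."""
    first_line = text.splitlines()[0] if text.strip() else ""
    # Assumption: a tab in the header line means the text was copied
    # from a spreadsheet (TSV); otherwise treat it as CSV.
    delimiter = "\t" if "\t" in first_line else ","
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [dict(row) for row in reader]


rows = parse_pasted_text("table\tvariable\ttype\nTM_ORG\tORG_ID\tString\n")
print(rows[0]["variable"])
```

Text copied from Excel or Google Sheets is normally tab-separated, which is why the sketch checks for a tab before falling back to a comma.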

3. Google Sheets URL

Paste a public Google Sheets URL and click Load Sheet to import directly.
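One common way to read a public sheet is through Google Sheets' CSV export URL. The sketch below rewrites a sharing URL into that form. Whether DQ Engine uses this exact approach is not stated here, and the function name `sheets_csv_export_url` is hypothetical.

```python
import re
from urllib.parse import parse_qs, urlparse


def sheets_csv_export_url(sheet_url: str) -> str:
    """Turn a Google Sheets sharing URL into a CSV export URL."""
    parsed = urlparse(sheet_url)
    match = re.search(r"/spreadsheets/d/([A-Za-z0-9_-]+)", parsed.path)
    if match is None:
        raise ValueError("not a Google Sheets URL")
    # The tab id (gid) may sit in the fragment (#gid=42) or the query (?gid=42).
    # Assumption: default to the first tab (gid=0) when none is given.
    gid_values = (
        parse_qs(parsed.fragment).get("gid")
        or parse_qs(parsed.query).get("gid")
        or ["0"]
    )
    sheet_id = match.group(1)
    return f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv&gid={gid_values[0]}"


print(sheets_csv_export_url("https://docs.google.com/spreadsheets/d/abc123/edit#gid=42"))
```

The export URL only returns data when the sheet is shared publicly, which matches the requirement above.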

Required Columns

| Column | Required | Description |
|--------|----------|-------------|
| table | ✅ | Logical table name |
| variable | ✅ | Column / field name |
| type | ✅ | Data type (String, Integer, Date, etc.) |
| description | ✅ | Business description of the column |
| format | optional | Format string or pattern |
| validation | optional | NOT NULL / NULL constraint |
| haspii | optional | yes / no — marks PII columns |
| dq_completeness | optional | Completeness rule notes |
| dq_uniqueness | optional | UNIQUE / Non-Unique |
| dq_validity | optional | Validity rule notes |
| dq_consistency | optional | FK or consistency rule notes |
| source | optional | Source system reference |
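Before uploading, you can check a Data Dictionary for the four required columns yourself. The sketch below is one possible pre-flight check. The function name and the case-insensitive header matching are assumptions, not a description of DQ Engine's own parser.

```python
import csv
import io

# Column names taken from the "Required Columns" table.
REQUIRED_COLUMNS = ("table", "variable", "type", "description")


def missing_required_columns(dictionary_csv: str) -> list[str]:
    """Return the required columns that are absent from the CSV header row."""
    reader = csv.reader(io.StringIO(dictionary_csv))
    header = next(reader, [])
    # Assumption: header matching ignores case and surrounding whitespace.
    present = {name.strip().lower() for name in header}
    return [col for col in REQUIRED_COLUMNS if col not in present]


sample = "table,variable,type,haspii\nTM_ORG,CITIZEN_ID,String,yes\n"
print(missing_required_columns(sample))  # ['description']
```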

Step-by-Step Workflow

Step 1 — Choose Engine Mode

| Mode | Description |
|------|-------------|
| Rule-based | Instant. Deterministic inference from column names and metadata. No API key needed. |
| AI | Uses Claude Sonnet for smarter semantic inference. Requires API key. Falls back to rule-based if unavailable. |

Step 2 — Fill Contract Metadata

| Field | Description |
|-------|-------------|
| Dataset Name | Name of the dataset or system (e.g. TM_ORG) |
| Source System | Origin system (e.g. Oracle, PostgreSQL, SAP) |
| Owner Email | Data owner contact |
| Data Category | DGA category (see below) |
| Data Classification | DGA sensitivity level (see below) |

Step 3 — Set Quality Thresholds

Minimum acceptable score for each DQ dimension (0.0 – 1.0):

| Threshold | Default | Meaning |
|-----------|---------|---------|
| Completeness | 0.90 | 90% of rows must have this column non-null |
| Validity | 0.90 | 90% of values must pass format/range rules |
| Uniqueness | 1.00 | 100% unique (use for primary keys) |
| Consistency | 0.95 | 95% must match referential constraints |
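To make the thresholds concrete, the sketch below scores a column on two of the dimensions and compares the scores with the defaults above. The scoring formulas (share of non-empty values, share of distinct non-null values) are common definitions assumed here for illustration. The document does not specify how scores are computed downstream.

```python
# Default thresholds from the table above.
DEFAULT_THRESHOLDS = {
    "completeness": 0.90,
    "validity": 0.90,
    "uniqueness": 1.00,
    "consistency": 0.95,
}


def completeness(values: list) -> float:
    """Share of values that are neither None nor an empty string."""
    if not values:
        return 1.0
    filled = sum(1 for v in values if v is not None and str(v).strip() != "")
    return filled / len(values)


def uniqueness(values: list) -> float:
    """Share of distinct values among the non-null values."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 1.0
    return len(set(non_null)) / len(non_null)


def failing_dimensions(scores: dict, thresholds: dict = DEFAULT_THRESHOLDS) -> list[str]:
    """List the dimensions whose score is below the minimum threshold."""
    return [dim for dim, minimum in thresholds.items()
            if dim in scores and scores[dim] < minimum]


column = ["A1", "A2", None, "A2"]
scores = {"completeness": completeness(column), "uniqueness": uniqueness(column)}
print(failing_dimensions(scores))
```

A score equal to the threshold passes, since the threshold is the minimum acceptable score.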

Step 4 — Upload Data Dictionary

Upload your file, paste text, or load from Google Sheets.

Generate

Click Generate Contract, or press Cmd+Enter (Mac) / Ctrl+Enter (Windows/Linux).

DGA Classification System

DQ Engine follows the data governance standard of Thailand's Digital Government Development Agency (DGA).

Data Category

| Category | Description |
|----------|-------------|
| Open Data | Data that can be publicly disclosed without restriction |
| Internal Data | Data for internal organizational use only |
| Personal Data | Data relating to an identified or identifiable person (PDPA) |
| Confidential Data | Data requiring access controls and need-to-know |
| Security Data | Classified data with national security implications |

Data Classification

| Level | Description |
|-------|-------------|
| Open | No access restrictions |
| Private | Internal use; not for public disclosure |
| Confidential / Sensitive | Restricted distribution; business impact if disclosed |
| Secret / Medium Sensitive | Significant harm if disclosed; controlled distribution |
| Top Secret / Highly Sensitive | Severe harm if disclosed; strictest access controls |

Rule Inference Engine

The rule-based engine automatically infers data quality rules from column metadata:

Constraint Detection

| Signal | Inferred Rule |
|--------|--------------|
| validation = NOT NULL | not_null constraint |
| dq_uniqueness = UNIQUE or _ID suffix | unique + primary_key |
| dq_consistency contains FK keywords | foreign_key constraint |
| Column type = String | length.max: 255 |
| AREA, AMOUNT, BUDGET, COUNT | range.min: 0 |
| TYPE, STATUS, CATEGORY | accepted_values placeholder |
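The signals in the table can be read as a chain of simple checks on each Data Dictionary row. The sketch below shows one way such inference might look. The function name, the output shape, the FK keyword list, and the use of substring matching on column names are all assumptions, not DQ Engine's actual implementation.

```python
# Hypothetical keyword lists; the document only says "FK keywords".
FK_KEYWORDS = ("FK", "FOREIGN KEY", "REFERENCES")
NON_NEGATIVE_TOKENS = ("AREA", "AMOUNT", "BUDGET", "COUNT")
ENUM_TOKENS = ("TYPE", "STATUS", "CATEGORY")


def infer_constraints(row: dict) -> list[dict]:
    """Infer constraint rules from one Data Dictionary row."""
    name = row.get("variable", "").upper()
    rules: list[dict] = []
    if row.get("validation", "").strip().upper() == "NOT NULL":
        rules.append({"not_null": True})
    if row.get("dq_uniqueness", "").strip().upper() == "UNIQUE" or name.endswith("_ID"):
        rules.append({"unique": True})
        rules.append({"primary_key": True})
    consistency = row.get("dq_consistency", "").upper()
    if any(keyword in consistency for keyword in FK_KEYWORDS):
        rules.append({"foreign_key": True})
    if row.get("type", "").strip().lower() == "string":
        rules.append({"length": {"max": 255}})
    # Assumption: tokens match anywhere in the column name (substring match).
    if any(token in name for token in NON_NEGATIVE_TOKENS):
        rules.append({"range": {"min": 0}})
    if any(token in name for token in ENUM_TOKENS):
        # Placeholder list, to be filled in by the data steward.
        rules.append({"accepted_values": []})
    return rules


print(infer_constraints({"variable": "ORG_ID", "type": "String", "validation": "NOT NULL"}))
```

Substring matching is simple but can over-match (for example, COUNT inside ACCOUNT), so inferred rules are worth reviewing before the contract is committed.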

Thai-Specific Validators

| Column Pattern | Rule Applied |
|----------------|-------------|
| LAT, LATITUDE, LAT_*, *_LAT | Thai latitude range 5.5°N – 20.5°N |
| LNG, LONGITUDE, LON, LONG | Thai longitude range 97.5°E – 105.6°E |
| CITIZEN_ID, ID_CARD, CITIZEN_NO | 13-digit Thai national ID regex |
| PHONE, MOBILE, TEL | Thai phone format ^0[0-9]{8,9}$ |
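The validators above reduce to two regular expressions and two numeric ranges. The following sketch applies them in Python, using the patterns and bounds from the table. The function names are illustrative only.

```python
import re

# Patterns and bounds as listed in the table above.
CITIZEN_ID_RE = re.compile(r"^[0-9]{13}$")
PHONE_RE = re.compile(r"^0[0-9]{8,9}$")
LAT_RANGE = (5.5, 20.5)     # degrees north
LNG_RANGE = (97.5, 105.6)   # degrees east


def is_valid_citizen_id(value: str) -> bool:
    """True when the value is exactly 13 digits."""
    return CITIZEN_ID_RE.fullmatch(value) is not None


def is_valid_phone(value: str) -> bool:
    """True when the value is a leading 0 followed by 8 or 9 digits."""
    return PHONE_RE.fullmatch(value) is not None


def is_within_thailand(lat: float, lng: float) -> bool:
    """True when both coordinates fall inside the ranges above."""
    return LAT_RANGE[0] <= lat <= LAT_RANGE[1] and LNG_RANGE[0] <= lng <= LNG_RANGE[1]


print(is_valid_citizen_id("1234567890123"))    # True
print(is_valid_phone("0812345678"))            # True
print(is_within_thailand(13.7563, 100.5018))   # True
```

These checks cover format and range only; they do not confirm that an ID or phone number actually exists.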

PII Detection & Masking

When haspii = yes, the engine detects PII type from column name and assigns a masking policy:

| PII Type | Masking Policy |
|----------|---------------|
| national_id | hash_sha256 |
| phone | mask_last_4 |
| email | hash_sha256 |
| name | suppress |
| address | suppress |
| financial | encrypt |
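The contract only names a masking policy; applying it is left to your pipeline. The sketch below shows how three of the policies might be applied downstream. The reading of mask_last_4 (replace the last four characters with X) and of suppress (drop the value) are assumptions, and encrypt is left unimplemented because no cipher or key handling is specified.

```python
import hashlib
from typing import Optional

# PII type to masking policy, as listed in the table above.
PII_POLICY = {
    "national_id": "hash_sha256",
    "phone": "mask_last_4",
    "email": "hash_sha256",
    "name": "suppress",
    "address": "suppress",
    "financial": "encrypt",
}


def apply_masking(value: str, policy: str) -> Optional[str]:
    """Apply one masking policy to a single value."""
    if policy == "hash_sha256":
        return hashlib.sha256(value.encode("utf-8")).hexdigest()
    if policy == "mask_last_4":
        # Assumption: the last four characters are replaced with "X".
        if len(value) <= 4:
            return "X" * len(value)
        return value[:-4] + "XXXX"
    if policy == "suppress":
        # Assumption: suppression removes the value entirely.
        return None
    if policy == "encrypt":
        raise NotImplementedError("cipher and key management are not specified")
    raise ValueError(f"unknown masking policy: {policy}")


print(apply_masking("0812345678", PII_POLICY["phone"]))  # 081234XXXX
```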

Output Formats

1. YAML — Data Contract

Production-ready governance contract. Example structure:

version: "1.0.0"
contract_version: 2
dataset_id: "TM_ORG-abc123"
dataset_name: "TM_ORG"
source_system: "Oracle"
logical_table: "TM_ORG"
generated_at: "2026-05-27T10:00:00Z"

governance_policy:
  data_category: "Personal Data"
  classification: "Confidential/Sensitive"
  security_level: "Confidential"
  pii_contains: true
  quality_thresholds:
    completeness: 0.9
    validity: 0.9
    uniqueness: 1.0
    consistency: 0.95

schema:
  - column: "CITIZEN_ID"
    description: "Thai national ID number"
    type: "String"
    classification: "confidential"
    pii_type: "national_id"
    masking_policy: "hash_sha256"
    retention: "7 years"
    severity: "Critical"
    constraints:
      - not_null: true
      - regex: "^[0-9]{13}$"
    tests:
      - expect_column_values_to_match_regex: "^[0-9]{13}$"

2. TOML — CI/CD Config

Same contract in TOML format for use in CI/CD pipelines and configuration files:

[contract]
version = "1.0.0"
dataset_name = "TM_ORG"
logical_table = "TM_ORG"

[governance_policy]
data_category = "Personal Data"
classification = "Confidential/Sensitive"
pii_contains = true

[[schema]]
column = "CITIZEN_ID"
type = "String"
classification = "confidential"
pii_type = "national_id"
masking_policy = "hash_sha256"

3. Great Expectations Suite

Ready-to-run expectation suite (YAML) for automated data quality validation with Great Expectations:

expectation_suite_name: "TM_ORG_suite"
expectations:
  - expectation_type: expect_column_values_to_not_be_null
    kwargs:
      column: CITIZEN_ID
    meta:
      severity: Critical

  - expectation_type: expect_column_values_to_match_regex
    kwargs:
      column: CITIZEN_ID
      regex: "^[0-9]{13}$"
    meta:
      severity: Critical
      rule: thai_citizen_id
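The suite entries mirror the constraints in the contract's schema section. The sketch below shows how one schema entry might be translated into expectation entries of the shape shown above. The function name and the mapping logic are assumptions; only the two expectation types that appear in the example are covered.

```python
def expectations_for(schema_entry: dict) -> list[dict]:
    """Translate one contract schema entry into expectation entries."""
    column = schema_entry["column"]
    # Carry severity into meta only when the contract provides it.
    meta = {"severity": schema_entry["severity"]} if "severity" in schema_entry else {}
    expectations = []
    for constraint in schema_entry.get("constraints", []):
        if constraint.get("not_null"):
            expectations.append({
                "expectation_type": "expect_column_values_to_not_be_null",
                "kwargs": {"column": column},
                "meta": dict(meta),
            })
        if "regex" in constraint:
            expectations.append({
                "expectation_type": "expect_column_values_to_match_regex",
                "kwargs": {"column": column, "regex": constraint["regex"]},
                "meta": dict(meta),
            })
    return expectations


entry = {
    "column": "CITIZEN_ID",
    "severity": "Critical",
    "constraints": [{"not_null": True}, {"regex": "^[0-9]{13}$"}],
}
for expectation in expectations_for(entry):
    print(expectation["expectation_type"])
```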

Features

Shareable Config URL

Click 🔗 Copy shareable link to copy a URL that encodes your current metadata and thresholds. Share with teammates — they open it and get the same configuration pre-filled.
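A shareable link of this kind can be built by serializing the configuration into a URL parameter. The sketch below shows one possible encoding (JSON, then URL-safe Base64, in a `config` query parameter). The parameter name, the encoding, and the base URL are assumptions; the actual link format used by DQ Engine is not documented here.

```python
import base64
import json
from urllib.parse import parse_qs, urlencode, urlparse


def encode_config_url(base_url: str, config: dict) -> str:
    """Encode metadata and thresholds into a single query parameter."""
    raw = json.dumps(config, sort_keys=True).encode("utf-8")
    payload = base64.urlsafe_b64encode(raw).decode("ascii")
    return f"{base_url}?{urlencode({'config': payload})}"


def decode_config_url(url: str) -> dict:
    """Recover the configuration from a link built by encode_config_url."""
    payload = parse_qs(urlparse(url).query)["config"][0]
    return json.loads(base64.urlsafe_b64decode(payload))


config = {
    "dataset_name": "TM_ORG",
    "source_system": "Oracle",
    "thresholds": {"completeness": 0.9, "validity": 0.9, "uniqueness": 1.0, "consistency": 0.95},
}
link = encode_config_url("https://example.com/", config)  # placeholder base URL
print(decode_config_url(link) == config)  # True
```

Anything placed in such a link is visible to whoever receives it, so it suits metadata and thresholds rather than the Data Dictionary itself.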

Recent Contracts

The 🕐 Recent menu in the topbar shows your last 3 generated contracts. Click any to re-download the YAML. Stored in your browser locally.

Client Branding

Click ⚙️ in the topbar to set a client or company name. Shown as a badge in the topbar — useful for consultant workflows across multiple clients.

Keyboard Shortcut

Press Cmd+Enter (Mac) or Ctrl+Enter (Windows/Linux) to generate without clicking.

Standards & Compliance

| Standard | Coverage |
|----------|----------|
| PDPA B.E. 2562 | PII detection, masking policy, data classification |
| DGA Thailand | Data Category and Data Classification taxonomy |
| DAMA-DMBOK v2 | Four DQ dimensions: completeness, validity, uniqueness, consistency |
| Great Expectations | Test suite output compatible with GE v0.18+ |

AI Mode

When set to AI mode, DQ Engine calls Claude Sonnet via the Anthropic API to produce smarter contracts:

Semantic understanding of column descriptions (not just name patterns)
Better inferred business terms and retention policies
Richer constraint and expectation generation

Rate limit: 10 AI requests per hour per IP (enforced server-side).

Fallback: If AI is unavailable or rate-limited, the engine automatically falls back to rule-based and shows a warning. Output is never blocked.
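The fallback behavior follows a familiar pattern: try the AI path, and on failure return the rule-based result together with a warning. The sketch below illustrates that pattern. The exception class, function names, and warning text are hypothetical, not DQ Engine's code.

```python
from typing import Callable, Optional


class AIUnavailableError(Exception):
    """Hypothetical error for an unreachable or rate-limited AI backend."""


def generate_contract(
    rows: list,
    mode: str,
    ai_generate: Callable[[list], dict],
    rule_generate: Callable[[list], dict],
) -> tuple[dict, Optional[str]]:
    """Return (contract, warning); warning is None when no fallback happened."""
    if mode == "ai":
        try:
            return ai_generate(rows), None
        except AIUnavailableError as exc:
            warning = f"AI unavailable ({exc}); used rule-based engine instead."
            return rule_generate(rows), warning
    return rule_generate(rows), None


def rate_limited_ai(rows: list) -> dict:
    # Stand-in for an AI call that hits the hourly rate limit.
    raise AIUnavailableError("rate limit reached")


contract, warning = generate_contract(
    [], "ai", rate_limited_ai, lambda rows: {"engine": "rule-based"}
)
print(contract, warning)
```

Because the rule-based engine is deterministic and needs no network access, it is always available as the last step, which is why output is never blocked.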

FAQ

Q: Is my data sent anywhere? In rule-based mode, all processing happens in your browser — no data leaves your machine. In AI mode, your column metadata is sent to the Anthropic API for contract generation.

Q: What file size can I upload? The practical limit is about 10,000 column definitions (one Data Dictionary row per column). Files up to about 5 MB work without issues.

Q: Can I use it without the AI API key? Yes. Rule-based mode works fully offline with no API key required.

Q: How do I use the Great Expectations output? Save the .yaml output to your great_expectations/expectations/ directory, then run great_expectations checkpoint run CHECKPOINT_NAME, where CHECKPOINT_NAME is a checkpoint that references the suite.

DQ Engine is built by datashane.com