Write me SQL data-quality checks for this table. I want to catch the kind of silent breakage that shows up 3 months later in a dashboard.
DATABASE: {postgres / mysql / bigquery / snowflake / sqlserver / sqlite}
TABLE NAME: {schema.table}
ROW COUNT (approx): {number}
COLUMNS (name + type, paste all): {paste}
USED FOR: {dashboard / api / pipeline / reporting}
KNOWN UPSTREAM SOURCES: {where_the_data_comes_from}
WHAT BAD DATA LOOKS LIKE in this table (past incidents): {real_examples_or_'unknown'}
KEYS / UNIQUENESS RULES: {primary_key + business_keys}
DELIVER:
1. **Schema-level checks** — row count vs expected range, freshness (last row added < X), duplicate primary keys.
2. **Column-level checks** — for each important column: null rate, distinct-value count, type validity, allowed values.
3. **Referential checks** — for FKs, orphans + dangling references.
4. **Business-logic checks** — pairs of fields that must agree (e.g. order_status = 'shipped' implies shipped_at IS NOT NULL).
5. **Distribution checks** — column means / counts against expected baselines (Z-score deviation).
6. **A single 'all checks in one query' version** using UNION ALL, where each check returns: check_name, status, fail_count, sample_bad_rows. Sorted so 'fail' comes first.
7. **How to schedule it** — Airflow / dbt tests / cron, with 'fail loud' guidance.0 copies·0 saves·1 views
201 words·1,346 chars
Sample output
DB: Postgres. Table: analytics.orders. Rows: ~12M. Columns: order_id PK, user_id FK, status varchar, created_at timestamptz, shipped_at timestamptz NULL, amount_cents int, currency char(3), country char(2), updated_at timestamptz. Used for: revenue dashboard + monthly close. Past incident: shipped orders without shipped_at because a backfill missed a column.
1. Schema-level checks:
-- Row count in expected range (12M ± 10%)
SELECT 'row_count' AS check_name,
CASE WHEN COUNT(*) BETWEEN 10800000 AND 13200000 THEN 'pass' ELSE 'fail' END AS status,
COUNT(*) AS observed,
12000000 AS expected
FROM analytics.orders;
-- Freshness (latest row within last 2 hours during business hours)
SELECT 'freshness' AS check_name,
CASE WHEN MAX(created_at) > NOW() - INTERVAL '2 hours' THEN 'pass' ELSE 'fail' END AS status,
MAX(created_at) AS latest_row,
NOW() AS current_time
FROM analytics.orders;
-- Primary key uniqueness
SELECT 'pk_uniqueness' AS check_name,
CASE WHEN COUNT(*) = COUNT(DISTINCT order_id) THEN 'pass' ELSE 'fail' END AS status,
COUNT(*) - COUNT(DISTINCT order_id) AS duplicate_count
FROM analytics.orders;
Parameters
Text generation parameters
FAQ
Add the past-incident check FIRST in your test suite. If you don't have a check that would have caught your last data quality nightmare, you're not finished yet. The prompt targets ChatGPT (GPT-4) and lives in the Coding & Development category on mycopyprompt.
This prompt is written for ChatGPT (GPT-4). It's a text/chat prompt — paste it into ChatGPT (GPT-4) (or compatible LLMs like Claude or GPT-4) to get the expected output.
1. Click the Copy button on this page to copy the full prompt. 2. Open ChatGPT (GPT-4). 3. Paste the prompt into a new conversation. 4. Replace any {placeholders} with your specifics, then send. Most prompts produce the right output on the first try; complex ones may need 1-2 iterations.
Yes — every prompt on mycopyprompt is free forever. No paywall, no signup wall for browsing or copying. You can use it for personal or commercial work, just don't redistribute the entire mycopyprompt library.
Absolutely — most prompts are templates. Look for {placeholders} (curly braces) and swap them with your own values. You can also reword sections, add constraints, or chain it with other prompts.
See the "Sample output" panel above — that's a real example of what ChatGPT (GPT-4) returns when this prompt runs. Your output will vary in wording but should follow the same structure and depth.
Similar prompts
Matched against every prompt in the catalog by title.
2. Column-level checks:
-- Null rates on critical columns
SELECT 'null_rate_user_id' AS check_name,
CASE WHEN COUNT(*) FILTER (WHERE user_id IS NULL) = 0 THEN 'pass' ELSE 'fail' END,
COUNT(*) FILTER (WHERE user_id IS NULL) AS null_count
FROM analytics.orders;
-- status in allowed values
SELECT 'status_allowed' AS check_name,
CASE WHEN COUNT(*) FILTER (WHERE status NOT IN ('pending','confirmed','shipped','cancelled','refunded')) = 0
THEN 'pass' ELSE 'fail' END,
COUNT(*) FILTER (WHERE status NOT IN ('pending','confirmed','shipped','cancelled','refunded')) AS bad_count
FROM analytics.orders;
-- Amount sanity (positive integer cents, max $50K)
SELECT 'amount_range' AS check_name,
CASE WHEN COUNT(*) FILTER (WHERE amount_cents < 1 OR amount_cents > 5000000) = 0
THEN 'pass' ELSE 'fail' END,
COUNT(*) FILTER (WHERE amount_cents < 1 OR amount_cents > 5000000) AS bad_count
FROM analytics.orders;
-- Currency code (3-letter, uppercase, ISO 4217 subset)
SELECT 'currency_format' AS check_name,
CASE WHEN COUNT(*) FILTER (WHERE currency !~ '^[A-Z]{3}$') = 0
THEN 'pass' ELSE 'fail' END,
COUNT(*) FILTER (WHERE currency !~ '^[A-Z]{3}$') AS bad_count
FROM analytics.orders;
3. Referential checks:
-- Orphaned user_id (FK pointing nowhere)
SELECT 'orphan_user_id' AS check_name,
CASE WHEN COUNT(*) = 0 THEN 'pass' ELSE 'fail' END,
COUNT(*) AS orphan_count
FROM analytics.orders o
LEFT JOIN analytics.users u ON u.user_id = o.user_id
WHERE u.user_id IS NULL;
4. Business-logic checks:
-- Shipped orders must have shipped_at (the actual past incident)
SELECT 'shipped_status_has_shipped_at' AS check_name,
CASE WHEN COUNT(*) = 0 THEN 'pass' ELSE 'fail' END,
COUNT(*) AS bad_count,
array_agg(order_id) FILTER (WHERE FALSE OR true) AS sample_ids
FROM (
SELECT order_id
FROM analytics.orders
WHERE status = 'shipped' AND shipped_at IS NULL
LIMIT 5
) s;
-- shipped_at must be >= created_at when set
SELECT 'shipped_after_created' AS check_name,
CASE WHEN COUNT(*) = 0 THEN 'pass' ELSE 'fail' END,
COUNT(*) AS bad_count
FROM analytics.orders
WHERE shipped_at IS NOT NULL AND shipped_at < created_at;
-- Cancelled orders should not have shipped_at
SELECT 'cancelled_not_shipped' AS check_name,
CASE WHEN COUNT(*) = 0 THEN 'pass' ELSE 'fail' END,
COUNT(*) AS bad_count
FROM analytics.orders
WHERE status = 'cancelled' AND shipped_at IS NOT NULL;
5. Distribution checks:
-- Daily revenue within 3 std deviations of trailing 30-day mean
WITH daily AS (
SELECT created_at::date AS d, SUM(amount_cents)/100.0 AS revenue
FROM analytics.orders
WHERE created_at > NOW() - INTERVAL '31 days'
GROUP BY 1
),
stats AS (
SELECT AVG(revenue) AS mean_rev, STDDEV(revenue) AS std_rev
FROM daily WHERE d < CURRENT_DATE
),
yesterday AS (
SELECT revenue FROM daily WHERE d = CURRENT_DATE - 1
)
SELECT 'revenue_zscore' AS check_name,
CASE WHEN ABS((yesterday.revenue - stats.mean_rev) / NULLIF(stats.std_rev, 0)) <= 3
THEN 'pass' ELSE 'fail' END,
yesterday.revenue AS yesterday_revenue,
stats.mean_rev AS trailing_mean,
stats.std_rev AS trailing_std
FROM yesterday, stats;
6. Single 'all checks' UNION query:
Use the structure above with consistent columns (check_name, status, fail_count, observed). Sort with ORDER BY status DESC, fail_count DESC (fails first).
7. Scheduling:
tests/ directory file with severity: error for the must-not-break checks, warn for distribution checks.Don't run all 15 checks on the dashboard every page load. Run them on a schedule, store the result, surface the latest snapshot.