Dataset Validation
Dataset validation allows you to define constraints that apply across entire collections, ensuring aggregate properties are met.
Validate Block
Add a validate block to a dataset:
schema Invoice {
amount: decimal in 100..10000,
status: "pending" | "paid"
}
dataset TestData {
invoices: 100 of Invoice,
validate {
sum(invoices.amount) >= 100000,
sum(invoices.amount) <= 500000
}
}
Vague will regenerate the dataset until all validation rules pass (or the retry limit is reached).
Aggregate Constraints
Sum Constraints
dataset Financial {
transactions: 100 of Transaction,
validate {
sum(transactions.amount) >= 50000,
sum(transactions.credits) == sum(transactions.debits)
}
}
Count Constraints
dataset Orders {
orders: 100 of Order,
payments: 50 of Payment,
validate {
count(payments) <= count(orders)
}
}
Collection Predicates
all()
Every item must satisfy the condition:
dataset Inventory {
products: 100 of Product,
validate {
all(products, .stock >= 0),
all(products, .price > 0)
}
}
some()
At least one item must satisfy the condition:
dataset Sales {
invoices: 100 of Invoice,
validate {
some(invoices, .status == "paid"),
some(invoices, .amount > 1000)
}
}
none()
No items should satisfy the condition:
dataset Clean {
records: 100 of Record,
validate {
none(records, .deleted == true),
none(records, .amount < 0)
}
}
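Conceptually, these three predicates map onto Python's built-in `all()` and `any()`. The sketch below is illustrative, not part of Vague; it evaluates the same conditions against a hand-written list of generated records.

```python
# Conceptual Python equivalents of Vague's collection predicates,
# evaluated against a list of generated records (plain dicts).
products = [
    {"stock": 5, "price": 12.0},
    {"stock": 0, "price": 3.5},
    {"stock": 42, "price": 99.0},
]

# all(products, .stock >= 0) -- every item satisfies the condition
assert all(p["stock"] >= 0 for p in products)

# some(products, .stock > 10) -- at least one item satisfies it
assert any(p["stock"] > 10 for p in products)

# none(products, .price < 0) -- no item satisfies it
assert not any(p["price"] < 0 for p in products)
```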
Practical Examples
Financial Reconciliation
schema Transaction {
type: "credit" | "debit",
amount: decimal in 10..1000
}
dataset Ledger {
transactions: 200 of Transaction,
validate {
// Credits should roughly balance debits
sum(transactions where .type == "credit", .amount) >=
sum(transactions where .type == "debit", .amount) * 0.9,
// At least some of each type
some(transactions, .type == "credit"),
some(transactions, .type == "debit")
}
}
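The filtered-sum rule above (`sum(collection where condition, expression)`) behaves like summing over a filtered subset. A rough Python equivalent, using hand-written transactions rather than generated ones:

```python
# Illustrative check of the filtered-sum rule above, assuming
# hand-written transactions instead of Vague-generated ones.
transactions = [
    {"type": "credit", "amount": 500.0},
    {"type": "credit", "amount": 450.0},
    {"type": "debit", "amount": 400.0},
    {"type": "debit", "amount": 600.0},
]

credits = sum(t["amount"] for t in transactions if t["type"] == "credit")
debits = sum(t["amount"] for t in transactions if t["type"] == "debit")

# sum(transactions where .type == "credit", .amount) >=
#   sum(transactions where .type == "debit", .amount) * 0.9
assert credits >= debits * 0.9  # 950 >= 900
```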
E-commerce Orders
schema Order {
id: uuid(),
total: decimal in 10..500,
status: "pending" | "shipped" | "delivered"
}
schema Shipment {
order: any of orders where .status == "shipped" or .status == "delivered",
shipped_at: datetime(2024, 2024)
}
dataset Store {
orders: 100 of Order,
shipments: 50 of Shipment,
validate {
// Can't ship more than ordered
count(shipments) <= count(orders),
// At least 30% fulfilled
count(shipments) >= count(orders) * 0.3,
// All shipped orders have positive total
all(orders where .status == "shipped", .total > 0)
}
}
User Activity
schema User {
id: uuid(),
created_at: datetime(2023, 2024),
post_count: int in 0..100,
is_active: boolean
}
dataset Community {
users: 1000 of User,
validate {
// More than 60% are active
count(users where .is_active == true) > count(users) * 0.6,
// No active user has zero posts
none(users, .is_active == true and .post_count == 0),
// Some power users
some(users, .post_count > 50)
}
}
Inventory Management
schema Product {
sku: unique regex("[A-Z]{3}-[0-9]{4}"),
stock: int in 0..1000,
reorder_point: int in 10..100,
price: decimal in 9.99..999.99
}
dataset Warehouse {
products: 200 of Product,
validate {
// Total inventory value
sum(products, .stock * .price) >= 100000,
sum(products, .stock * .price) <= 1000000,
// Some products need reordering
some(products, .stock < .reorder_point),
// But not all products are out of stock
some(products, .stock > 0)
}
}
Validation Failure
If validation can't be satisfied after max retries, generation fails with an error. To avoid this:
- Use achievable constraints — Ensure validation rules are statistically likely to pass
- Widen ranges — Give enough room for aggregate targets
- Reduce interdependencies — Avoid circular validation logic
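As a concrete example of an unachievable constraint: given the `Invoice` schema above, 100 invoices with amounts in 100..10000 can sum to at most 100 × 10000 = 1,000,000, so a rule demanding more than that will exhaust every retry. (This dataset is illustrative, not from the examples above.)

```
dataset Impossible {
  invoices: 100 of Invoice,
  validate {
    // Maximum possible sum is 100 * 10000 = 1,000,000,
    // so this rule can never pass
    sum(invoices.amount) >= 2000000
  }
}
```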
How It Works
- Vague generates all collections in the dataset
- Validation rules are evaluated
- If any rule fails, the entire dataset is regenerated
- The process repeats until validation succeeds or the maximum number of retries is reached
This is "dataset-level rejection sampling" — more expensive than field-level constraints but powerful for aggregate properties.
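The loop above can be sketched in Python. This is a minimal illustration of dataset-level rejection sampling, not Vague's actual implementation; the generator functions, retry limit, and validation rule are all assumptions chosen to mirror the first example in this page.

```python
import random

MAX_RETRIES = 1000  # assumed retry limit, for illustration only

def generate_invoice():
    # Stand-in for Vague's field-level generation:
    # amount: decimal in 100..10000
    return {"amount": round(random.uniform(100, 10000), 2)}

def generate_dataset():
    # invoices: 100 of Invoice
    return {"invoices": [generate_invoice() for _ in range(100)]}

def validate(data):
    # sum(invoices.amount) >= 100000, sum(invoices.amount) <= 500000
    total = sum(inv["amount"] for inv in data["invoices"])
    return 100_000 <= total <= 500_000

def generate_with_validation():
    # Dataset-level rejection sampling: regenerate the whole
    # dataset until every validation rule passes.
    for _ in range(MAX_RETRIES):
        data = generate_dataset()
        if validate(data):
            return data
    raise RuntimeError("validation not satisfied after max retries")
```

Because the whole dataset is thrown away on any failure, tightly constrained rules multiply generation cost quickly, which is why the best practices below emphasize keeping rules statistically likely to pass.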
Best Practices
- Keep rules achievable — Statistically likely to pass
- Use percentages — More flexible than exact counts
- Test incrementally — Add one validation rule at a time
- Monitor retries — High retry counts indicate tight constraints
See Also
- Constraints for record-level validation
- Negative Testing for constraint violation