Schema Inference

Vague can reverse-engineer schemas from existing JSON or CSV data, detecting types, ranges, patterns, and relationships.

Basic Usage

# Infer from JSON
vague --infer data.json -o schema.vague

# Infer from CSV
vague --infer data.csv --collection-name employees -o schema.vague

What Gets Detected

Types

Data Pattern	Inferred Type
`123`	`int`
`12.34`	`decimal`
`"text"`	`string`
`true/false`	`boolean`
`"2024-01-15"`	`date`
`"2024-01-15T10:30:00Z"`	`datetime()`
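
The mapping in the table can be pictured as a per-value classifier. The sketch below is illustrative only and is not Vague's actual inference code: `classifyValue`, `DATE_RE`, and `DATETIME_RE` are hypothetical names, and the regular expressions assume ISO 8601 style strings like the ones in the table.

```typescript
// Hypothetical helper: classify a single parsed JSON value.
// Assumption: dates and datetimes arrive as ISO 8601 style strings.
type InferredType = "int" | "decimal" | "string" | "boolean" | "date" | "datetime" | "unknown";

const DATE_RE = /^\d{4}-\d{2}-\d{2}$/;
const DATETIME_RE = /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})$/;

function classifyValue(value: unknown): InferredType {
  if (typeof value === "boolean") return "boolean";
  if (typeof value === "number") {
    // JSON has a single number type, so 12.0 parses to 12 and counts as an int here.
    return Number.isInteger(value) ? "int" : "decimal";
  }
  if (typeof value === "string") {
    // Check the more specific datetime pattern before the plain date pattern.
    if (DATETIME_RE.test(value)) return "datetime";
    if (DATE_RE.test(value)) return "date";
    return "string";
  }
  return "unknown"; // null, arrays, and objects are out of scope for this sketch
}
```

A field's type would then be whatever classification all of its sampled values agree on.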

Formats

Pattern	Inferred Generator
UUID	`uuid()`
Email	`email()`
URL	`faker.internet.url()`
Phone	`phone()`
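
One plausible way to detect formats is to require that every sampled string in a field matches the same pattern. The sketch below assumes that approach; `FORMAT_PATTERNS` and `detectFormat` are hypothetical names, and the regular expressions are deliberately loose approximations rather than the rules Vague applies.

```typescript
// Hypothetical pattern table: generator name paired with a loose matching regex.
const FORMAT_PATTERNS: Array<[string, RegExp]> = [
  ["uuid()", /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i],
  ["email()", /^[^\s@]+@[^\s@]+\.[^\s@]+$/],
  ["faker.internet.url()", /^https?:\/\/\S+$/],
  // Very loose: date-like strings also pass, so a real detector would rule out dates first.
  ["phone()", /^\+?[\d\s().-]{7,}$/],
];

// Return the first generator whose pattern matches every sample, or null if none does.
function detectFormat(samples: string[]): string | null {
  if (samples.length === 0) return null;
  for (const [generator, pattern] of FORMAT_PATTERNS) {
    if (samples.every((s) => pattern.test(s))) return generator;
  }
  return null;
}
```

Requiring all samples to match keeps a single stray value from being misread as a format, at the cost of missing fields with a few malformed entries.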

Ranges

// Input
[
  { "age": 25 },
  { "age": 42 },
  { "age": 31 }
]

// Inferred
schema Record {
  age: int in 25..42
}
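
The inferred range in this example is simply the smallest and largest observed value. A minimal sketch of that idea, where `inferIntRange` is a hypothetical name and not part of Vague's API:

```typescript
// Hypothetical helper: build a Vague-style range expression from observed integers.
// Assumes a non-empty sample; an empty array would yield Infinity/-Infinity bounds.
function inferIntRange(values: number[]): string {
  const min = Math.min(...values);
  const max = Math.max(...values);
  return `int in ${min}..${max}`;
}

console.log(inferIntRange([25, 42, 31])); // prints "int in 25..42"
```

Because the bounds come straight from the sample, they are worth widening by hand when the sample is known to be narrow.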

Enums

// Input
[
  { "status": "active" },
  { "status": "active" },
  { "status": "pending" }
]

// Inferred with weights
schema Record {
  status: 0.67: "active" | 0.33: "pending"
}
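
The weights in this example match the relative frequency of each value, rounded to two decimals (2 of 3 records are `"active"`, hence 0.67). A sketch of that calculation, with `inferWeightedEnum` as a hypothetical name:

```typescript
// Hypothetical helper: count occurrences and emit "weight: value" alternatives.
function inferWeightedEnum(values: string[]): string {
  const counts = new Map<string, number>();
  for (const v of values) counts.set(v, (counts.get(v) ?? 0) + 1);
  // Map iteration follows insertion order, so values appear in first-seen order.
  return Array.from(counts.entries())
    .map(([value, count]) => `${(count / values.length).toFixed(2)}: ${JSON.stringify(value)}`)
    .join(" | ");
}

console.log(inferWeightedEnum(["active", "active", "pending"]));
// prints: 0.67: "active" | 0.33: "pending"
```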

Nullable Fields

// Input
[
  { "name": "John", "nickname": "Johnny" },
  { "name": "Jane", "nickname": null }
]

// Inferred
schema Record {
  name: string,
  nickname: string?
}
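
A field can be marked nullable as soon as one record carries `null` for it. The sketch below assumes that rule and additionally treats a missing key as nullable, which is an assumption of this example rather than documented Vague behavior; `nullableFields` is a hypothetical name.

```typescript
// Hypothetical helper: list the fields that are null (or absent) in at least one record.
function nullableFields(rows: Array<Record<string, unknown>>): string[] {
  // Collect every field name seen in any record.
  const fields = new Set<string>();
  for (const row of rows) {
    for (const key of Object.keys(row)) fields.add(key);
  }
  return Array.from(fields).filter((field) =>
    rows.some((row) => row[field] === null || row[field] === undefined)
  );
}

console.log(nullableFields([
  { name: "John", nickname: "Johnny" },
  { name: "Jane", nickname: null },
])); // prints [ 'nickname' ]
```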

Unique Fields

// Input
[
  { "id": 1, "code": "ABC" },
  { "id": 2, "code": "DEF" },
  { "id": 3, "code": "GHI" }
]

// Inferred
schema Record {
  id: unique int in 1..3,
  code: unique "ABC" | "DEF" | "GHI"
}
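
Uniqueness can be checked by looking for repeated values within a field. In the sketch below, `isUniqueField` is a hypothetical name, and refusing to call a field unique when there is only one record is a conservative choice made for this example.

```typescript
// Hypothetical helper: true when no value of `field` repeats across the records.
function isUniqueField(rows: Array<Record<string, unknown>>, field: string): boolean {
  const seen = new Set<unknown>();
  for (const row of rows) {
    const value = row[field];
    if (seen.has(value)) return false; // duplicate found
    seen.add(value);
  }
  // A single record is trivially duplicate-free, so require at least two records.
  return rows.length > 1;
}
```

With the three records above, both `id` and `code` pass the check. A small sample can make a field look unique by coincidence, so the inferred `unique` modifier deserves a manual review.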

Advanced Detection

Derived Fields

Detects computed relationships:

[
  { "qty": 2, "price": 10, "total": 20 },
  { "qty": 3, "price": 15, "total": 45 }
]

schema Record {
  qty: int in 2..3,
  price: int in 10..15,
  total: qty * price  // Detected multiplication
}
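
One way to find a relationship like this is to test, for each numeric field, whether it equals the product of two other fields in every record. The brute-force sketch below illustrates the idea; `findProductFields` is a hypothetical name, and the exact equality check assumes integer data (floating-point values would need a tolerance).

```typescript
// Hypothetical helper: report fields that equal the product of two other fields
// in every record. Field names are taken from the first record.
function findProductFields(rows: Array<Record<string, number>>): string[] {
  if (rows.length === 0) return [];
  const fields = Object.keys(rows[0]);
  const found: string[] = [];
  for (const target of fields) {
    for (let i = 0; i < fields.length; i++) {
      for (let j = i + 1; j < fields.length; j++) {
        const a = fields[i];
        const b = fields[j];
        if (a === target || b === target) continue;
        if (rows.every((r) => r[target] === r[a] * r[b])) {
          found.push(`${target}: ${a} * ${b}`);
        }
      }
    }
  }
  return found;
}

console.log(findProductFields([
  { qty: 2, price: 10, total: 20 },
  { qty: 3, price: 15, total: 45 },
])); // prints [ 'total: qty * price' ]
```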

Ordering Constraints

Detects field ordering:

[
  { "start": "2024-01-01", "end": "2024-01-15" },
  { "start": "2024-02-01", "end": "2024-03-01" }
]

schema Record {
  start: date,
  end: date,
  assume end >= start
}
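
An ordering constraint holds when the comparison is true for every record. The sketch below assumes ISO 8601 date strings, which compare correctly as plain strings; `holdsOrdering` is a hypothetical name.

```typescript
// Hypothetical helper: true when `later` >= `earlier` in every record.
// ISO 8601 dates (YYYY-MM-DD) sort lexicographically, so string comparison is enough.
function holdsOrdering(
  rows: Array<Record<string, string>>,
  earlier: string,
  later: string
): boolean {
  return rows.length > 0 && rows.every((row) => row[later] >= row[earlier]);
}
```

For the two records above, `holdsOrdering(rows, "start", "end")` is true and the reverse direction is false, which is the evidence behind `assume end >= start`.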

Conditional Constraints

Detects conditional patterns:

[
  { "type": "premium", "discount": 20 },
  { "type": "premium", "discount": 25 },
  { "type": "basic", "discount": 0 },
  { "type": "basic", "discount": 0 }
]

schema Record {
  type: "premium" | "basic",
  discount: int in 0..25,
  assume if type == "basic" { discount == 0 }
}
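
A conditional pattern like this can be found by grouping records on one field and checking whether another field stays constant within a group. In the sketch below, `constantWhen` is a hypothetical name, and the two-record minimum is an arbitrary threshold chosen for this example.

```typescript
// Hypothetical helper: if `targetField` has the same value in every record where
// `condField` equals `condValue`, return that value; otherwise return undefined.
function constantWhen(
  rows: Array<Record<string, string | number>>,
  condField: string,
  condValue: string | number,
  targetField: string
): string | number | undefined {
  const matching = rows.filter((row) => row[condField] === condValue);
  if (matching.length < 2) return undefined; // too little evidence for a constraint
  const first = matching[0][targetField];
  return matching.every((row) => row[targetField] === first) ? first : undefined;
}
```

With the four records above, the `"basic"` group yields the constant `0`, while the `"premium"` group (20 and 25) yields nothing, so only the `basic` condition becomes an `assume` clause.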

CSV Inference

Basic CSV

vague --infer employees.csv --collection-name employees

CSV Options

# Custom delimiter
vague --infer data.csv --infer-delimiter ";" --collection-name records

# Custom dataset name
vague --infer data.csv --collection-name users --dataset-name TestData

Programmatic API

import { inferSchema } from 'vague-lang';

const data = [
  { name: 'John', age: 30 },
  { name: 'Jane', age: 25 }
];

const schema = inferSchema(data, {
  collectionName: 'users',
  datasetName: 'Inferred'
});

console.log(schema);

Practical Examples

Migration Workflow

# 1. Export existing data
psql -At -c "SELECT json_agg(u) FROM users u" > users.json

# 2. Infer schema
vague --infer users.json -o users.vague

# 3. Review and adjust
# Edit users.vague to add constraints, relationships

# 4. Generate new test data
vague users.vague -o test-users.json

API Contract Discovery

# 1. Capture API responses
curl https://api.example.com/products > products.json

# 2. Infer schema
vague --infer products.json -o products.vague

# 3. Generate mock data
vague products.vague -o mock-products.json -s 42

Database Seeding

# 1. Export sample data
mongoexport --collection=orders --jsonArray --out=orders.json

# 2. Infer schema
vague --infer orders.json -o orders.vague

# 3. Generate scaled dataset
# Edit orders.vague to increase counts
vague orders.vague -o seed-data.json

TypeScript Generation

Generate TypeScript types alongside schemas:

# Schema + TypeScript
vague --infer data.json --typescript -o schema.vague

# TypeScript only
vague --infer data.json --ts-only

Output:

// schema.d.ts
export interface User {
  id: string;
  name: string;
  age: number;
  email: string;
  status: 'active' | 'pending' | 'inactive';
}

Limitations

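Sample size matters: larger, more varied samples yield more accurate ranges, enums, and constraints.
Edge cases: values that are rare in the sample may be missing from the inferred ranges and enums.
Complex relationships: cross-record references are not detected automatically.
Nested objects: deeply nested structures may need manual adjustment.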

Best Practices

Use representative data: include edge cases in the samples.
Review inferred schemas: adjust ranges and constraints that are tighter or looser than intended.
Add relationships: manually add `any of` references.
Test generation: verify that the generated output matches expectations.
