MokingBird DataGen · ai.mokingbird.xyz/mbdatagen

Turn Any Document Into
Production Training Data

GPRO-Hybrid RL. 5-stage automated validation. 100% local inference.
Designed for 8GB VRAM.

Get DataGen — Free

The 5-Phase Pipeline

From raw document to validated, deployment-ready training data — fully automated, fully local.

Phase 01
📄
Extract
Parse 7 document formats. Async concurrent extraction. Chunk, clean, and structure raw content.
Phase 02
Enrich
Contextual tagging, difficulty scoring, layer classification, and semantic chunking for richer inputs.
Phase 03
Generate
GPRO-Hybrid RL generator. K=4 candidates per question, per-field rewards, group-relative normalization.
Phase 04
🛡️
Validate
5-stage automated chain: Schema → Distribution → Dedupe → Grounding → Novelty. HMAC-signed provenance.
Phase 05
🚀
Deploy
Export Jogg or Jogg-Mini format. Ready for fine-tuning, RAG augmentation, or direct model training.

Three Pillars of Quality at Scale

Every stage of DataGen is engineered for precision. No shortcuts, no cloud dependencies.

📥
Pillar 1 · Ingestion

Extract

7 document formats, async concurrent extraction. Feed DataGen anything — it handles the complexity so your pipeline doesn't have to.

  • PDF, DOCX, HTML/Web, Images (OCR)
  • Python, JavaScript, TypeScript source code
  • LaTeX documents and academic papers
  • Async batch pipeline — process thousands in parallel
  • Automatic chunking, cleaning, and deduplication
Pillar 2 · Generation

Generate

GPRO-Hybrid RL: K=4 candidates per question. Per-field rewards across every component — question, options, correct_answer, explanation, difficulty.

  • K=4 candidates per question — best candidate wins
  • Per-field rewards: question, options, correct_answer, explanation, difficulty
  • Group-relative normalization across candidate pool
  • Token-level advantages — stable gradient flow
  • 8 difficulty levels, 8 semantic layers (0–7)
🛡️
Pillar 3 · Quality Gate

Validate

5-stage automated chain. Every record is graded and HMAC-signed before it enters your training set — no silent failures.

  • Stage 1 · Schema: structural and type validation
  • Stage 2 · Distribution: difficulty + layer balance
  • Stage 3 · Dedupe: semantic near-duplicate detection
  • Stage 4 · Grounding: answer-to-source verification
  • Stage 5 · Novelty: information density scoring

The GPRO-Hybrid Advantage

Reinforcement learning that rewards quality at the field level — not just the final output.

Reward Function
Total Reward = α × Process Reward + γ × Outcome Reward
             = 0.7 × Σ(field scores) + 0.3 × overall quality    (α = 0.7, γ = 0.3)
Field-Level Rewards
  • question: clarity, specificity, answerability score
  • 📋 options: plausibility balance, distractor quality
  • correct_answer: grounding and source alignment
  • 💡 explanation: depth, accuracy, educational value
  • 📊 difficulty: calibration against document complexity
How Candidates Compete
  • 🔢 K=4 candidates generated per question simultaneously
  • ⚖️ Group-relative normalization: candidates compete against each other
  • 🏆 Highest total reward candidate is selected and stored
  • 📈 Token-level advantages: stable, efficient gradient flow
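The reward formula and the candidate competition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the assumption that every score lies in [0, 1], and the example numbers are hypothetical, and the normalization shown (subtract the group mean, divide by the group standard deviation) is the usual group-relative form rather than a confirmed detail of DataGen. Token-level advantages are not covered here.

```python
from statistics import mean, pstdev

ALPHA = 0.7  # process (per-field) weight, taken from the formula above
GAMMA = 0.3  # outcome (overall quality) weight, taken from the formula above
FIELDS = ("question", "options", "correct_answer", "explanation", "difficulty")


def total_reward(field_scores, overall_quality):
    """Hybrid reward: weighted sum of the per-field scores plus the overall score."""
    process = sum(field_scores[name] for name in FIELDS)
    return ALPHA * process + GAMMA * overall_quality


def group_relative_advantages(rewards, eps=1e-8):
    """Score each candidate relative to the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def select_best(rewards):
    """Index of the candidate with the highest total reward."""
    return max(range(len(rewards)), key=lambda i: rewards[i])


# Illustrative numbers only: K=4 candidates, each given as
# (uniform per-field score, overall quality score).
candidates = [(0.6, 0.5), (0.9, 0.8), (0.7, 0.9), (0.8, 0.6)]
rewards = [total_reward({name: f for name in FIELDS}, o) for f, o in candidates]
advantages = group_relative_advantages(rewards)
best = select_best(rewards)
```

Because the advantages are centered on the group mean, a candidate is rewarded only for being better than its siblings, not for scoring well in absolute terms.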
Head-to-Head Comparison
DataGen vs. generic LLM prompting for dataset creation
Capability MokingBird DataGen Generic LLM Prompting
Reproducibility
Quality control (5-stage)
RL-trained generator
Local / private inference
Per-field reward scoring
HMAC provenance on every record
Deterministic schema output
No API keys or internet required

5-Stage Validator Chain

Every record passes through five sequential quality gates before it enters your dataset.

Stage 1
Schema
Structural integrity and type validation
Stage 2
Distribution
Difficulty + layer balance across dataset
Stage 3
Dedupe
Semantic near-duplicate detection
Stage 4
Grounding
Answer-to-source document verification
Stage 5
Novelty
Information density and uniqueness scoring
Gate Decisions
AUTO_APPROVE
≥ 90%
Record passes all five stages and is written directly to the output dataset.
REQUIRE_REVIEW
70 – 89%
Flagged for optional human review before inclusion. Stored in a separate review queue.
AUTO_REJECT
< 70%
Record is rejected and logged with a full failure report. Never enters the output dataset.
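The three gate decisions amount to a threshold check on the composite validator score. A minimal sketch, assuming the score is expressed as a fraction between 0 and 1 (as in the provenance example) and that any score from 0.70 up to, but not including, 0.90 lands in review. The function name is hypothetical.

```python
def gate_decision(validator_score: float) -> str:
    """Map a composite validator score in [0, 1] to a gate decision."""
    if not 0.0 <= validator_score <= 1.0:
        raise ValueError("validator_score must be between 0 and 1")
    if validator_score >= 0.90:
        return "AUTO_APPROVE"
    if validator_score >= 0.70:
        return "REQUIRE_REVIEW"
    return "AUTO_REJECT"
```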
Provenance — Every Record Is HMAC-Signed
// __provenance__ field, appended to every approved record
{
  "__provenance__": {
    "run_id": "dg_20260329_a4f7e2b1",
    "source_hash": "sha256:3c9f2a...",               // source document fingerprint
    "validator_score": 0.94,                         // composite quality score
    "stage_scores": [1.0, 0.96, 1.0, 0.91, 0.88],    // per-stage
    "gate_decision": "AUTO_APPROVE",
    "hmac_sig": "sha256-hmac:7d3a91f..."             // RunManifest signature
  }
}
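For readers unfamiliar with HMAC signing, the sketch below shows how a record could be signed and later verified using only the Python standard library. It is not DataGen's implementation: the canonical JSON serialization, the function names, and the demo key are assumptions made for illustration.

```python
import hashlib
import hmac
import json


def sign_record(record: dict, key: bytes) -> str:
    """Return a hex HMAC-SHA256 signature over a canonical JSON form of the record."""
    # Sorted keys and fixed separators make the serialization deterministic,
    # so the same record always produces the same signature.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_record(record: dict, signature: str, key: bytes) -> bool:
    """Recompute the signature and compare it in constant time."""
    return hmac.compare_digest(sign_record(record, key), signature)


key = b"demo-key-not-for-production"  # hypothetical key, for illustration only
record = {"run_id": "dg_20260329_a4f7e2b1", "validator_score": 0.94}
signature = sign_record(record, key)

tampered = dict(record, validator_score=0.99)
```

Verifying `record` against `signature` succeeds, while verifying `tampered` fails, which is what makes after-the-fact edits detectable to anyone holding the key.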

7 Document Formats Supported

Feed DataGen any document type. The async batch pipeline handles the rest.

📄
PDF
Text-based and scanned PDFs, multi-page, mixed content
📝
DOCX
Microsoft Word documents, tables, headers, footnotes
🌐
HTML / Web
Web pages, articles, documentation sites with boilerplate removal
🖼️
Images (OCR)
PNG, JPG, TIFF — full OCR pipeline with layout preservation
💻
Python / JS / TS
Source code with syntax-aware chunking and docstring extraction
🔬
LaTeX
Academic papers, equations, theorems, citations preserved
⚙️
Async Batch Pipeline
Process hundreds of documents concurrently. Queue management, progress tracking, error recovery — all built in. Mix formats in a single run.
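A concurrent batch run with bounded parallelism and per-file error recovery might look like the following sketch. It uses `asyncio` from the standard library. The extractor here is a placeholder that only reads text files, and `extract_one`, `extract_all`, and `max_concurrency` are hypothetical names, not part of DataGen's API.

```python
import asyncio
from pathlib import Path


async def extract_one(path: Path) -> dict:
    """Placeholder extractor: reads the file as text in a worker thread."""
    text = await asyncio.to_thread(path.read_text, encoding="utf-8", errors="replace")
    return {"source": str(path), "text": text}


async def extract_all(paths, max_concurrency: int = 8) -> list:
    """Extract many documents concurrently, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(path: Path) -> dict:
        async with semaphore:
            try:
                return await extract_one(path)
            except OSError as exc:
                # Record the failure and keep going instead of aborting the batch.
                return {"source": str(path), "error": str(exc)}

    # gather preserves input order, so results line up with paths.
    return await asyncio.gather(*(bounded(Path(p)) for p in paths))
```

A call such as `asyncio.run(extract_all(paths))` returns one result per input path, with failed files reported as error entries rather than raised exceptions.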

Output Schemas

Two schema variants optimized for different deployment needs — full-fidelity or lightweight.

Jogg
Full MCQ Schema
"question":      "What is...",
"options":       ["A", "B", "C", "D"],
"correct_answer":"B",
"explanation":   "Because...",
"difficulty":    3, // 8 levels (1–8)
"layer":         2, // semantic layer (0–7)
"tags":          ["topic", "subtopic"],
"__provenance__": { ... HMAC-signed RunManifest }
Jogg-Mini
Lightweight — ~40% Smaller
"q":  "What is...",
"o":  ["A", "B", "C", "D"],
"a":  "B",
"d":  3,       // difficulty
"l":  2,       // layer
"t":  ["topic"],
"p":  "hmac:..." // provenance hash

// ~40% smaller payload vs Jogg full
// Ideal for high-volume training pipelines
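The relationship between the two schemas can be illustrated with a small conversion function. This is a sketch based only on the field names shown above. It assumes that `p` carries the record's HMAC signature and that Jogg-Mini drops the explanation, since no short key for it appears in the listing. The function name is hypothetical.

```python
def to_jogg_mini(record: dict) -> dict:
    """Convert a full Jogg record to the compact Jogg-Mini field names."""
    provenance = record.get("__provenance__", {})
    return {
        "q": record["question"],
        "o": record["options"],
        "a": record["correct_answer"],
        "d": record["difficulty"],
        "l": record["layer"],
        "t": record["tags"],
        # Assumption: the compact provenance value is the record's HMAC signature.
        "p": provenance.get("hmac_sig", ""),
    }


full = {
    "question": "What is...",
    "options": ["A", "B", "C", "D"],
    "correct_answer": "B",
    "explanation": "Because...",
    "difficulty": 3,
    "layer": 2,
    "tags": ["topic", "subtopic"],
    "__provenance__": {"hmac_sig": "sha256-hmac:7d3a91f..."},
}
mini = to_jogg_mini(full)
```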

Three Ways to Use DataGen

Python API, CLI, or Desktop App — same engine under the hood, different surfaces for different workflows.

Python API
Full programmatic control. Integrate DataGen directly into your training pipeline.
# MokingBird DataGen — Python API
from mokingbird.datagen import DataGenAPI

# Initialize with your local model
api = DataGenAPI(
    model="qwen2.5-7b-instruct",
    device="cuda",
    schema="jogg",         # or "jogg-mini"
    k_candidates=4,
)

# Generate training data from documents
dataset = api.generate(
    sources=["./docs/manual.pdf", "./src/"],
    n_questions=500,
    difficulty_range=(1, 8),
    validate=True,           # run 5-stage validator
    output_path="./output/dataset.jogg",
)

# Fine-tune directly on generated data
result = api.train(
    dataset=dataset,
    base_model="qwen2.5-7b",
    method="lora",
    epochs=3,
    output_dir="./models/fine-tuned/",
)
CLI
Scriptable, automatable, shell-friendly. Integrate DataGen into any workflow with a single command.
# Generate a dataset from a directory of documents
$ mokingbird generate --source ./docs/ --schema jogg --questions 500 --model qwen2.5-7b-instruct --output ./output/dataset.jogg

# Fine-tune on generated dataset
$ mokingbird train --dataset ./output/dataset.jogg --base-model qwen2.5-7b --method lora --epochs 3 --output ./models/fine-tuned/

# Run the 5-stage validator on an existing dataset
$ mokingbird validate --dataset ./output/dataset.jogg --report ./output/validation_report.json
Desktop App
A native PySide6 desktop application. Two focused interfaces: one for training data generation, one for model fine-tuning.

Training Tab (DDI — Document-Driven Ingestion)

Drop in your documents, configure extraction and generation settings, and launch the full 5-phase pipeline. Real-time progress tracking, per-stage quality scores, and live dataset preview as records are generated.

Generate Tab (PDI — Prompt-Driven Interface)

Direct prompt-driven question generation. Specify topics, difficulty ranges, and output format interactively. Review and approve individual records before they enter your final dataset. Export to Jogg or Jogg-Mini with one click.

Desktop Application (PySide6)

A native desktop application purpose-built for DataGen. Two dedicated tabs keep your workflow focused — one for ingesting and generating from documents, one for direct prompt-driven generation.

Training Tab · DDI

Document-Driven Ingestion. Full 5-phase pipeline, progress metrics, per-stage scoring, live dataset preview.

Generate Tab · PDI

Prompt-Driven Interface. Interactive generation, difficulty control, per-record review, one-click export.

This is a design preview. Download and install DataGen to start using it.

MokingBird DataGen — Training Tab
Source: ./docs/training_manual.pdf
Schema: Jogg (full)
Model: qwen2.5-7b-instruct
K candidates: 4

Pipeline stage: Phase 04 · Validating
Approved: 437 records
In review: 28 records
Rejected: 9 records

Validator score: 0.937 avg
VRAM usage: 6.8 GB / 8 GB

Technical Requirements

Designed to run on consumer hardware. No enterprise GPU required.

Specification       Minimum        Recommended
GPU VRAM            6 GB           8 GB (RTX 4070 or equivalent)
System RAM          16 GB          32 GB
Operating System    Windows 10/11 · Linux (Ubuntu 20.04+) · macOS 12+
Python Version      3.10+          3.11 (recommended)
CUDA / ROCm         CUDA 11.8      CUDA 12.1+ / ROCm 5.6+
Storage             20 GB free     50 GB+ (for models + datasets)
Internet Required   Never (100% offline after first model download)

You Own It. All of It.

100% local inference. No data leaves your machine. No API keys required.

🔒

Fully Offline

Every computation — extraction, generation, validation — runs on your hardware. Your documents never touch a server.

🛡️

Zero Telemetry

No usage data, no crash reports, no model outputs are collected. What runs on your machine stays on your machine.

🗝️

Your Data, Your Models

Generated datasets and fine-tuned models belong entirely to you. Export anywhere, use with any inference engine.

Start Generating
Training Data Today.

DataGen downloads are launching soon. Join the notify list and we will email you when it is available.


Windows · Linux · macOS  ·  Python 3.10+  ·  No account required

Blog

mbDataGen — Articles

Technical deep-dives on synthetic dataset generation with mbDataGen
Synthetic Data That Actually Works: The Science Behind mbDataGen

April 13, 2026 · MokingBird Team · Tags: mbDataGen, synthetic data, GPRO, RL, training data

Ask any ML practitioner what the hardest part of fine-tuning is, and you'll get the same answer: data. High-quality, domain-specific training data is time-consuming to create manually, expensive to annotate professionally, and hard to find in public datasets. mbDataGen was built to solve this by generating synthetic training data from your own source documents.

Why Synthetic Data?

Synthetic data has a reputation problem it doesn't entirely deserve. The concern is circular generation: generate data with a model, then fine-tune that same model on the generated data, and you get drift and quality degradation. This concern is valid for naive approaches. The reason synthetic data fails is usually not the generation — it's the validation. mbDataGen treats validation as the core of the product.

The 5-Phase Pipeline

  1. Extract — Load source documents (17 formats: PDF, DOCX, Excel, CSV, JSON, Markdown, PowerPoint, Email, images with OCR, web content, and more)
  2. Enrich — Add contextual metadata: source attribution, entity extraction, topic classification
  3. Generate — Produce K=4 candidates using GPRO-Hybrid RL and score them
  4. Validate — Run every candidate through the 5-stage validator
  5. Deploy — Export with HMAC-signed RunManifest for complete provenance

GPRO-Hybrid RL: The Generation Engine

GPRO-Hybrid RL generates K=4 candidate outputs per data point and scores all of them using a hybrid reward function:

Total Reward = 0.7 × Field/Process Reward + 0.3 × Outcome/Overall Reward

Field/Process Reward (70%): Evaluates each field independently. Catches micro-quality issues that overall quality scores miss.

Outcome/Overall Reward (30%): Evaluates the data point holistically. Is this a useful training example? Is there diversity relative to other examples?

The 5-Stage Validator

  1. Schema Validation — Are all required fields present? Are types correct?
  2. Distribution Validation — Does the dataset match realistic distributions?
  3. Deduplication — Identifies semantic near-duplicates, not just exact matches
  4. Grounding Validation — Can every generated claim be traced back to a source passage?
  5. Novelty Validation — Does this data point add value over what already exists?
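To make stage 3 concrete, here is a deliberately simple near-duplicate filter. It is a lexical stand-in, not the semantic method the validator uses: it compares word-trigram sets with Jaccard similarity, whereas a semantic check would typically compare embeddings. The function names and the 0.6 threshold are assumptions chosen for the example.

```python
import re


def shingles(text: str, n: int = 3) -> set:
    """Set of lowercase word n-grams; punctuation is ignored."""
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(questions: list, threshold: float = 0.6) -> list:
    """Keep a question only if it is not too similar to any question already kept."""
    kept, kept_shingles = [], []
    for question in questions:
        current = shingles(question)
        if all(jaccard(current, other) < threshold for other in kept_shingles):
            kept.append(question)
            kept_shingles.append(current)
    return kept


questions = [
    "What is the default port used by the server?",
    "What is the default port used by the server",
    "How do you rotate the signing key?",
]
unique = drop_near_duplicates(questions)
```

The second question differs from the first only in punctuation, so an exact-match filter would keep it while this filter drops it.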

Scoring thresholds: ≥90% AUTO_APPROVE, 70–89% REQUIRE_REVIEW, <70% AUTO_REJECT.

HMAC-Signed RunManifest

Every dataset exported by mbDataGen includes an HMAC-signed RunManifest: a cryptographically verifiable provenance record documenting the source documents, generation parameters, per-stage validation scores, and a timestamp. Because the signature is computed with a secret key, any change made to the manifest after export causes verification to fail.

Output Schemas

  • Instruction-following pairs: {"instruction": "...", "input": "...", "output": "..."}
  • MCQ with rationale: Full quiz format with 4 options, correct answer, explanation
  • Preference pairs: {"prompt": "...", "chosen": "...", "rejected": "..."} for DPO/ORPO training
  • Any custom JSON schema you define
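As an illustration of how these schemas relate, an MCQ record can be reshaped into an instruction-following pair. The sketch below assumes the MCQ field names from the Jogg schema (`question`, `options`, `correct_answer`, `explanation`). The function name and the instruction wording are hypothetical.

```python
from string import ascii_uppercase


def mcq_to_instruction_pair(record: dict) -> dict:
    """Reshape an MCQ record into an {"instruction", "input", "output"} pair."""
    # Label the options A, B, C, ... in the order they appear.
    option_lines = [
        f"{letter}. {option}"
        for letter, option in zip(ascii_uppercase, record["options"])
    ]
    return {
        "instruction": "Answer the multiple-choice question and explain your choice.",
        "input": "\n".join([record["question"], *option_lines]),
        "output": f"{record['correct_answer']}. {record['explanation']}",
    }


mcq = {
    "question": "Which validator stage checks answers against the source?",
    "options": ["Schema", "Grounding", "Dedupe", "Novelty"],
    "correct_answer": "B",
    "explanation": "Grounding verifies each answer against the source document.",
}
pair = mcq_to_instruction_pair(mcq)
```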

Fully Local, Your Data Stays Yours

mbDataGen generates from your documents on your hardware. Source documents never leave your machine. Generated datasets are written to local files. If you have compliance requirements around training data, mbDataGen's local-first architecture addresses those requirements by design.

Download Free

mbDataGen is available as part of MokingBird AI — free to download. Full pipeline features are available in the Premium tier. Download at ai.mokingbird.xyz.