MokingBird DataGen · ai.mokingbird.xyz/mbdatagen

Turn Any Document Into
Production Training Data

GPRO-Hybrid RL. 5-stage automated validation. 100% local inference.
Designed for 8GB VRAM.

The 5-Phase Pipeline

From raw document to validated, deployment-ready training data — fully automated, fully local.

📄 Phase 01 · Extract: Parse 7 document formats. Async concurrent extraction. Chunk, clean, and structure raw content.
✨ Phase 02 · Enrich: Contextual tagging, difficulty scoring, layer classification, and semantic chunking for richer inputs.
⚡ Phase 03 · Generate: GPRO-Hybrid RL generator. K=4 candidates per question, per-field rewards, group-relative normalization.
🛡️ Phase 04 · Validate: 5-stage automated chain: Schema → Distribution → Dedupe → Grounding → Novelty. HMAC-signed provenance.
🚀 Phase 05 · Deploy: Export Jogg or Jogg-Mini format. Ready for fine-tuning, RAG augmentation, or direct model training.

Three Pillars of Quality at Scale

Every stage of DataGen is engineered for precision. No shortcuts, no cloud dependencies.

📥

Pillar 1 · Ingestion

Extract

7 document formats, async concurrent extraction. Feed DataGen anything — it handles the complexity so your pipeline doesn't have to.

PDF, DOCX, HTML/Web, Images (OCR)
Python, JavaScript, TypeScript source code
LaTeX documents and academic papers
Async batch pipeline — process thousands in parallel
Automatic chunking, cleaning, and deduplication

⚡

Pillar 2 · Generation

Generate

GPRO-Hybrid RL: K=4 candidates per question. Per-field rewards across every component — question, options, correct_answer, explanation, difficulty.

K=4 candidates per question — best candidate wins
Per-field rewards: question, options, correct_answer, explanation, difficulty
Group-relative normalization across candidate pool
Token-level advantages — stable gradient flow
8 difficulty levels, 8 semantic layers (0–7)

🛡️

Pillar 3 · Quality Gate

Validate

5-stage automated chain. Every record is graded and HMAC-signed before it enters your training set — no silent failures.

Stage 1 · Schema: structural and type validation
Stage 2 · Distribution: difficulty + layer balance
Stage 3 · Dedupe: semantic near-duplicate detection
Stage 4 · Grounding: answer-to-source verification
Stage 5 · Novelty: information density scoring

The GPRO-Hybrid Advantage

Reinforcement learning that rewards quality at the field level — not just the final output.

Reward Function

Total Reward = α × Process Reward + γ × Outcome Reward, with α = 0.7 and γ = 0.3
Process Reward = Σ(field scores) · Outcome Reward = overall quality

Field-Level Rewards

❓ question: clarity, specificity, answerability score
📋 options: plausibility balance, distractor quality
✅ correct_answer: grounding and source alignment
💡 explanation: depth, accuracy, educational value
📊 difficulty: calibration against document complexity

How Candidates Compete

🔢 K=4 candidates generated per question simultaneously
⚖️ Group-relative normalization: candidates compete against each other
🏆 Highest total reward candidate is selected and stored
📈 Token-level advantages: stable, efficient gradient flow
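The page does not spell out how group-relative normalization is computed. A common reading, assumed here, is to standardize each candidate's total reward against the mean and standard deviation of its own group of K=4 candidates, then share that value across the candidate's tokens. The function names and the sample rewards below are hypothetical; this is a sketch of the idea, not DataGen's implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of K candidates (assumed formula)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when all candidates score the same.
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_advantages(advantage, n_tokens):
    """Assumption: every token of a candidate inherits the candidate's advantage."""
    return [advantage] * n_tokens


def best_candidate_index(rewards):
    """Index of the candidate with the highest total reward."""
    return max(range(len(rewards)), key=rewards.__getitem__)


# Hypothetical total rewards for K=4 candidates of one question.
rewards = [0.62, 0.81, 0.74, 0.55]
advantages = group_relative_advantages(rewards)
winner = best_candidate_index(rewards)
```

Candidates above the group mean receive positive advantages and those below receive negative ones, so the signal depends on how a candidate compares with its siblings rather than on the absolute reward scale.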

Head-to-Head Comparison

DataGen vs. generic LLM prompting for dataset creation

Capability	MokingBird DataGen	Generic LLM Prompting
Reproducibility	✅	❌
Quality control (5-stage)	✅	❌
RL-trained generator	✅	❌
Local / private inference	✅	❌
Per-field reward scoring	✅	❌
HMAC provenance on every record	✅	❌
Deterministic schema output	✅	❌
No API keys or internet required	✅	❌

5-Stage Validator Chain

Every record passes through five sequential quality gates before it enters your dataset.

Stage 1 · Schema: Structural integrity and type validation
Stage 2 · Distribution: Difficulty + layer balance across dataset
Stage 3 · Dedupe: Semantic near-duplicate detection
Stage 4 · Grounding: Answer-to-source document verification
Stage 5 · Novelty: Information density and uniqueness scoring

Gate Decisions

AUTO_APPROVE (≥ 90%): Record passes all five stages and is written directly to the output dataset.

REQUIRE_REVIEW (70 – 89%): Flagged for optional human review before inclusion. Stored in a separate review queue.

AUTO_REJECT (< 70%): Record is rejected and logged with a full failure report. Never enters the output dataset.
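The three gate decisions amount to a threshold lookup on the composite validator score. The sketch below assumes the score is a fraction in [0, 1] (as in the `validator_score` provenance field) and that anything from 0.70 up to, but not including, 0.90 goes to review; the function name is hypothetical.

```python
def gate_decision(validator_score):
    """Map a composite score in [0, 1] to one of the three gate decisions."""
    if not 0.0 <= validator_score <= 1.0:
        raise ValueError("validator_score must be between 0 and 1")
    if validator_score >= 0.90:
        return "AUTO_APPROVE"
    if validator_score >= 0.70:
        # Assumption: scores between 0.89 and 0.90 also land in review.
        return "REQUIRE_REVIEW"
    return "AUTO_REJECT"


decisions = [gate_decision(s) for s in (0.94, 0.895, 0.70, 0.69)]
```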

Provenance — Every Record Is HMAC-Signed

// __provenance__ field, appended to every approved record
{
  "__provenance__": {
    "run_id": "dg_20260329_a4f7e2b1",
    "source_hash": "sha256:3c9f2a...",               // source document fingerprint
    "validator_score": 0.94,                         // composite quality score
    "stage_scores": [1.0, 0.96, 1.0, 0.91, 0.88],    // per-stage
    "gate_decision": "AUTO_APPROVE",
    "hmac_sig": "sha256-hmac:7d3a91f..."             // RunManifest signature
  }
}
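To make the signature field concrete, here is a minimal sketch of signing and verifying a provenance record with Python's standard `hmac` module. How DataGen serializes records and manages its key is not documented here, so the canonical JSON encoding, the `SECRET_KEY` value, and the function names are all assumptions; only the `sha256-hmac:` prefix comes from the example above.

```python
import hashlib
import hmac
import json

# Hypothetical key; a real deployment would load it from protected storage.
SECRET_KEY = b"demo-key-not-for-production"


def canonical_bytes(record):
    """Serialize deterministically so the same record always signs the same."""
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_record(record, key):
    digest = hmac.new(key, canonical_bytes(record), hashlib.sha256).hexdigest()
    return "sha256-hmac:" + digest


def verify_record(record, signature, key):
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(sign_record(record, key), signature)


record = {
    "run_id": "dg_20260329_a4f7e2b1",
    "validator_score": 0.94,
    "gate_decision": "AUTO_APPROVE",
}
signature = sign_record(record, SECRET_KEY)
```

Changing any signed field, for example raising `validator_score`, makes `verify_record` return `False` for anyone who holds the key. An HMAC detects modification; it does not prevent it.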

7 Document Formats Supported

Feed DataGen any document type. The async batch pipeline handles the rest.

📄

PDF

Text-based and scanned PDFs, multi-page, mixed content

📝

DOCX

Microsoft Word documents, tables, headers, footnotes

🌐

HTML / Web

Web pages, articles, documentation sites with boilerplate removal

🖼️

Images (OCR)

PNG, JPG, TIFF — full OCR pipeline with layout preservation

💻

Python / JS / TS

Source code with syntax-aware chunking and docstring extraction

🔬

LaTeX

Academic papers, equations, theorems, citations preserved

⚙️

Async Batch Pipeline

Process hundreds of documents concurrently. Queue management, progress tracking, error recovery — all built in. Mix formats in a single run.
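The concurrency pattern behind an async batch pipeline like the one described above can be sketched with `asyncio`: a semaphore bounds how many documents are processed at once, and failures are collected instead of aborting the run. `extract_one` is a hypothetical stand-in for real parsing, and the `.bad` suffix only simulates an unsupported file.

```python
import asyncio


async def extract_one(path):
    """Hypothetical extractor: yields to the event loop, then returns chunks."""
    await asyncio.sleep(0)
    if path.endswith(".bad"):
        raise ValueError(f"unsupported file: {path}")
    return {"source": path, "chunks": [f"text from {path}"]}


async def extract_batch(paths, max_concurrency=8):
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(path):
        # At most max_concurrency extractions run at the same time.
        async with semaphore:
            return await extract_one(path)

    # return_exceptions=True keeps one failure from cancelling the batch.
    results = await asyncio.gather(*(guarded(p) for p in paths), return_exceptions=True)
    succeeded = [r for r in results if not isinstance(r, BaseException)]
    failed = [(p, r) for p, r in zip(paths, results) if isinstance(r, BaseException)]
    return succeeded, failed


succeeded, failed = asyncio.run(extract_batch(["a.pdf", "b.docx", "c.bad"]))
```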

Output Schemas

Two schema variants optimized for different deployment needs — full-fidelity or lightweight.

Jogg

Full MCQ Schema

"question":      "What is...",
"options":       ["A", "B", "C", "D"],
"correct_answer":"B",
"explanation":   "Because...",
"difficulty":    3, // 8 levels (1–8)
"layer":         2, // semantic layer (0–7)
"tags":          ["topic", "subtopic"],
"__provenance__": { ... HMAC-signed RunManifest }

Jogg-Mini

Lightweight — ~40% Smaller

"q":  "What is...",
"o":  ["A", "B", "C", "D"],
"a":  "B",
"d":  3,       // difficulty
"l":  2,       // layer
"t":  ["topic"],
"p":  "hmac:..." // provenance hash

// ~40% smaller payload vs Jogg full
// Ideal for high-volume training pipelines
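Comparing the two listings, Jogg-Mini looks like a key-shortened Jogg record without the `explanation` field and with provenance reduced to a single string. The conversion below is a sketch under that reading; the mapping table, the function name, and the way `p` is derived from `hmac_sig` are assumptions, not a documented format.

```python
# Assumed mapping from full Jogg keys to Jogg-Mini keys.
JOGG_TO_MINI = {
    "question": "q",
    "options": "o",
    "correct_answer": "a",
    "difficulty": "d",
    "layer": "l",
    "tags": "t",
}


def to_jogg_mini(record):
    mini = {short: record[full] for full, short in JOGG_TO_MINI.items()}
    # Assumption: "p" carries the signature digest with an "hmac:" prefix.
    digest = record["__provenance__"]["hmac_sig"].split(":", 1)[-1]
    mini["p"] = "hmac:" + digest
    return mini


full_record = {
    "question": "What is...",
    "options": ["A", "B", "C", "D"],
    "correct_answer": "B",
    "explanation": "Because...",
    "difficulty": 3,
    "layer": 2,
    "tags": ["topic", "subtopic"],
    "__provenance__": {"hmac_sig": "sha256-hmac:7d3a91f"},
}
mini_record = to_jogg_mini(full_record)
```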

Three Ways to Use DataGen

Python API, CLI, or Desktop App — same engine under the hood, different surfaces for different workflows.

Python API

Full programmatic control. Integrate DataGen directly into your training pipeline.

# MokingBird DataGen — Python API
from mokingbird.datagen import DataGenAPI

# Initialize with your local model
api = DataGenAPI(
    model="qwen2.5-7b-instruct",
    device="cuda",
    schema="jogg",         # or "jogg-mini"
    k_candidates=4,
)

# Generate training data from documents
dataset = api.generate(
    sources=["./docs/manual.pdf", "./src/"],
    n_questions=500,
    difficulty_range=(1, 8),
    validate=True,           # run 5-stage validator
    output_path="./output/dataset.jogg",
)

# Fine-tune directly on generated data
result = api.train(
    dataset=dataset,
    base_model="qwen2.5-7b",
    method="lora",
    epochs=3,
    output_dir="./models/fine-tuned/",
)

CLI

Scriptable, automatable, shell-friendly. Integrate DataGen into any workflow with a single command.

# Generate a dataset from a directory of documents

$ mokingbird generate --source ./docs/ --schema jogg --questions 500 --model qwen2.5-7b-instruct --output ./output/dataset.jogg

# Fine-tune on generated dataset

$ mokingbird train --dataset ./output/dataset.jogg --base-model qwen2.5-7b --method lora --epochs 3 --output ./models/fine-tuned/

# Run the 5-stage validator on an existing dataset

$ mokingbird validate --dataset ./output/dataset.jogg --report ./output/validation_report.json

Desktop App

A native PySide6 desktop application. Two focused interfaces: one for training data generation, one for model fine-tuning.

Training Tab (DDI — Document-Driven Ingestion)

Drop in your documents, configure extraction and generation settings, and launch the full 5-phase pipeline. Real-time progress tracking, per-stage quality scores, and live dataset preview as records are generated.

Generate Tab (PDI — Prompt-Driven Interface)

Direct prompt-driven question generation. Specify topics, difficulty ranges, and output format interactively. Review and approve individual records before they enter your final dataset. Export to Jogg or Jogg-Mini with one click.

Desktop Application (PySide6)

A native desktop application purpose-built for DataGen. Two dedicated tabs keep your workflow focused — one for ingesting and generating from documents, one for direct prompt-driven generation.

Training Tab · DDI

Document-Driven Ingestion. Full 5-phase pipeline, progress metrics, per-stage scoring, live dataset preview.

Generate Tab · PDI

Prompt-Driven Interface. Interactive generation, difficulty control, per-record review, one-click export.

This is a design preview. Download and install DataGen to start using it.

MokingBird DataGen — Training Tab

Source: ./docs/training_manual.pdf

Schema: Jogg (full)

Model: qwen2.5-7b-instruct

K candidates: 4

Pipeline stage: Phase 04 · Validating

Approved: 437 records

In review: 28 records

Rejected: 9 records

Validator score: 0.937 avg

VRAM usage: 6.8 GB / 8 GB

Technical Requirements

Designed to run on consumer hardware. No enterprise GPU required.

Specification	Minimum	Recommended
GPU VRAM	6 GB	8 GB (RTX 4070 or equivalent)
System RAM	16 GB	32 GB
Operating System	Windows 10/11 · Linux (Ubuntu 20.04+) · macOS 12+
Python Version	3.10+	3.11 (recommended)
CUDA / ROCm	CUDA 11.8	CUDA 12.1+ / ROCm 5.6+
Storage	20 GB free	50 GB+ (for models + datasets)
Internet Required	Only for the first model download; 100% offline afterwards

You Own It. All of It.

100% local inference. No data leaves your machine. No API keys required.

🔒

Fully Offline

Every computation — extraction, generation, validation — runs on your hardware. Your documents never touch a server.

🛡️

Zero Telemetry

No usage data, no crash reports, no model outputs are collected. What runs on your machine stays on your machine.

🗝️

Your Data, Your Models

Generated datasets and fine-tuned models belong entirely to you. Export anywhere, use with any inference engine.

Start Generating
Training Data Today.

DataGen downloads are launching soon. Join the notify list and we will email you when it is available.

Windows · Linux · macOS · Python 3.10+ · No account required

Synthetic Data That Actually Works: The Science Behind mbDataGen

April 13, 2026 · MokingBird Team · Tags: mbDataGen, synthetic data, GPRO, RL, training data

Ask any ML practitioner what the hardest part of fine-tuning is, and you'll get the same answer: data. High-quality, domain-specific training data is time-consuming to create manually, expensive to annotate professionally, and hard to find in public datasets. mbDataGen was built to solve this by generating synthetic training data from your own source documents.

Why Synthetic Data?

Synthetic data has a reputation problem it doesn't entirely deserve. The concern is circular generation: generate data with a model, then fine-tune that same model on the generated data, and you get drift and quality degradation. This concern is valid for naive approaches. The reason synthetic data fails is usually not the generation — it's the validation. mbDataGen treats validation as the core of the product.

The 5-Phase Pipeline

Extract — Load source documents (17 formats: PDF, DOCX, Excel, CSV, JSON, Markdown, PowerPoint, Email, images with OCR, web content, and more)
Enrich — Add contextual metadata: source attribution, entity extraction, topic classification
Generate — Produce K=4 candidates using GPRO-Hybrid RL and score them
Validate — Run every candidate through the 5-stage validator
Deploy — Export with HMAC-signed RunManifest for complete provenance

GPRO-Hybrid RL: The Generation Engine

GPRO-Hybrid RL generates K=4 candidate outputs per data point and scores all of them using a hybrid reward function:

Total Reward = 0.7 × Field/Process Reward + 0.3 × Outcome/Overall Reward

Field/Process Reward (70%): Evaluates each field independently. Catches micro-quality issues that overall quality scores miss.

Outcome/Overall Reward (30%): Evaluates the data point holistically. Is this a useful training example? Is there diversity relative to other examples?

The 5-Stage Validator

Schema Validation — Are all required fields present? Are types correct?
Distribution Validation — Does the dataset match realistic distributions?
Deduplication — Identifies semantic near-duplicates, not just exact matches
Grounding Validation — Can every generated claim be traced back to a source passage?
Novelty Validation — Does this data point add value over what already exists?

Scoring thresholds: ≥90% AUTO_APPROVE, 70–89% REQUIRE_REVIEW, <70% AUTO_REJECT.

HMAC-Signed RunManifest

Every dataset exported by mbDataGen includes an HMAC-signed RunManifest: a cryptographically verifiable provenance record documenting source documents, generation parameters, validation scores per stage, and timestamp. The HMAC signature lets anyone holding the signing key detect whether the manifest was modified after the fact.

Output Schemas

Instruction-following pairs: {"instruction": "...", "input": "...", "output": "..."}
MCQ with rationale: Full quiz format with 4 options, correct answer, explanation
Preference pairs: {"prompt": "...", "chosen": "...", "rejected": "..."} for DPO/ORPO training
Any custom JSON schema you define

Fully Local, Your Data Stays Yours

mbDataGen generates from your documents on your hardware. Source documents never leave your machine. Generated datasets are written to local files. If you have compliance requirements around training data, mbDataGen's local-first architecture addresses those requirements by design.

Download Free

mbDataGen is available as part of MokingBird AI — free to download. Full pipeline features are available in the Premium tier. Download at ai.mokingbird.xyz.

---
title: "Synthetic Data That Actually Works: The Science Behind mbDataGen"
date: "2026-04-13"
author: "MokingBird Team"
tags: ["mbDataGen", "synthetic data", "GPRO", "RL", "training data", "fine-tuning", "local AI"]
---

# Synthetic Data That Actually Works: The Science Behind mbDataGen

Ask any ML practitioner what the hardest part of fine-tuning is, and you'll get the same answer: data. Not the model architecture. Not the training loop. The data.

High-quality, domain-specific training data is time-consuming to create manually, expensive to annotate professionally, and hard to find in public datasets — especially for specialized domains like legal, medical, or scientific applications. The data that's available is often noisy, misaligned with your task, or insufficient in volume.

mbDataGen was built to solve this. It generates synthetic training data from your own source documents — clean, validated, and grounded in your actual knowledge base.

---

## Why Synthetic Data?

Synthetic data has a reputation problem it doesn't entirely deserve. The concern is circular generation: if you generate data with a model and then fine-tune that same model on the generated data, you get drift, hallucination amplification, and quality degradation.

This concern is valid for naive approaches — generating thousands of random examples with no validation and feeding them directly into a training loop. It's not inherent to synthetic data as a concept.

The reason synthetic data fails is usually not the generation — it's the validation. Most pipelines skip validation or treat it as an afterthought. mbDataGen treats validation as the core of the product.

---

## The 5-Phase Pipeline

mbDataGen organizes the entire process into five phases:

### Phase 1: Extract
Load your source documents — the knowledge base that generated data will be grounded in. mbDataGen supports 17 document formats: PDF (with multi-engine parsing), DOCX, Excel, CSV, JSON, Markdown, PowerPoint, Email, images with OCR, web content, and more.

During extraction, the system parses structure: sections, headings, tables, code blocks, and relationships between document elements. This structure informs later phases — generated data that understands document structure is more useful than data that treats all text as a flat blob.

### Phase 2: Enrich
Add contextual metadata to extracted content: source attribution, document type, section relationships, entity extraction, topic classification. This enrichment is what allows the provenance system to work — every generated data point can be traced back to specific source passages.

### Phase 3: Generate
Produce candidate data using the GPRO-Hybrid RL approach (described in detail below). For each target data point, generate K=4 candidates and score them. Candidates are structured according to your output schema — instruction-following pairs, MCQ questions with rationales, preference pairs, structured extraction examples, or any custom schema you define.

### Phase 4: Validate
Run every candidate through the 5-stage validator. This is where mbDataGen distinguishes itself — the validation pipeline is not a simple heuristic filter but a multi-stage quality gate that evaluates different dimensions of data quality independently.

### Phase 5: Deploy
Export approved records to your training format. Every output includes an HMAC-signed RunManifest, a cryptographically verifiable record of how each data point was produced. When you need to audit your training data or certify its provenance, the RunManifest provides the chain of custody.

---

## GPRO-Hybrid RL: The Generation Engine

The core of mbDataGen's generation step is **GPRO-Hybrid RL** — an original reward learning approach developed by MokingBird.

Standard data generation with an LLM produces one output per prompt. The quality of that output depends entirely on prompt engineering. There's no mechanism for the system to distinguish a good output from a mediocre one.

GPRO-Hybrid RL changes this by generating **K=4 candidate outputs** for each data point and scoring all of them using a hybrid reward function:

```
Total Reward = 0.7 × Field/Process Reward + 0.3 × Outcome/Overall Reward
```

**Field/Process Reward (70% weight):**
Evaluates each field of the generated output independently. For an MCQ question, this means scoring: Is the question grammatically correct? Is the question answerable from the source? Is the correct answer actually correct? Are the distractors plausible but clearly wrong? Are all required fields present and properly formatted?

Field-level scoring catches micro-quality issues that overall quality scores miss. A data point can look good at a high level while containing a subtly wrong distractor answer or a malformed JSON field.

**Outcome/Overall Reward (30% weight):**
Evaluates the data point holistically. Is this a useful training example? Does it test the right concepts? Is there diversity relative to other generated examples? Would a model that learned from this example be better at the target task?

The K=4 candidates are compared, and the highest-scoring candidate is selected for validation. Generating several candidates and keeping the best is best-of-K selection, a close relative of rejection sampling, with learned scoring. It tends to produce higher-quality output than single-shot generation because one weak sample no longer decides the result.
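As a worked illustration of the hybrid reward, the sketch below scores two hypothetical candidates and keeps the better one. It assumes the field/process reward is the mean of per-field scores, so that both terms stay in [0, 1]; the field names follow the MCQ schema, and every score is made up for the example.

```python
FIELDS = ("question", "options", "correct_answer", "explanation", "difficulty")
PROCESS_WEIGHT = 0.7  # field/process reward weight
OUTCOME_WEIGHT = 0.3  # outcome/overall reward weight


def total_reward(field_scores, outcome_score):
    # Assumption: process reward is the mean of per-field scores in [0, 1].
    process = sum(field_scores[name] for name in FIELDS) / len(FIELDS)
    return PROCESS_WEIGHT * process + OUTCOME_WEIGHT * outcome_score


def select_best(candidates):
    """Best-of-K selection: keep the candidate with the highest total reward."""
    return max(candidates, key=lambda c: total_reward(c["field_scores"], c["outcome"]))


candidates = [
    {"id": "c0", "field_scores": dict.fromkeys(FIELDS, 0.8), "outcome": 0.6},
    {"id": "c1", "field_scores": dict.fromkeys(FIELDS, 0.9), "outcome": 0.5},
]
best = select_best(candidates)
```

Here `c0` scores 0.7 × 0.8 + 0.3 × 0.6 = 0.74 and `c1` scores 0.7 × 0.9 + 0.3 × 0.5 = 0.78, so `c1` wins even though its overall score is lower: the 70% weight lets field-level quality dominate.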

---

## The 5-Stage Validator

After generation selects the best candidate, it enters the validation pipeline. Five stages, each assessing a different quality dimension:

### Stage 1: Schema Validation
Does the output conform to the required schema? Are all required fields present? Are field types correct? Are values within expected ranges?

This catches structural failures — malformed JSON, missing fields, type errors — before they enter the training dataset.
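A schema check of this kind can be sketched in plain Python. The field list and ranges follow the full Jogg MCQ listing (difficulty 1 to 8, layer 0 to 7, four options); the rule that `correct_answer` must be one of the options, along with the function and constant names, is an assumption for illustration.

```python
REQUIRED_FIELDS = {
    "question": str,
    "options": list,
    "correct_answer": str,
    "explanation": str,
    "difficulty": int,
    "layer": int,
    "tags": list,
}


def validate_schema(record):
    """Return a list of error strings; an empty list means the record passed."""
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in record:
            errors.append(f"missing field: {name}")
        elif type(record[name]) is not expected:  # exact match, so bool is not int
            errors.append(f"wrong type for {name}: expected {expected.__name__}")
    if errors:
        return errors  # range checks need well-typed fields
    if not 1 <= record["difficulty"] <= 8:
        errors.append("difficulty out of range 1-8")
    if not 0 <= record["layer"] <= 7:
        errors.append("layer out of range 0-7")
    if len(record["options"]) != 4:
        errors.append("expected exactly 4 options")
    if record["correct_answer"] not in record["options"]:
        errors.append("correct_answer is not one of the options")
    return errors


good = {
    "question": "What is...", "options": ["A", "B", "C", "D"],
    "correct_answer": "B", "explanation": "Because...",
    "difficulty": 3, "layer": 2, "tags": ["topic"],
}
```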

### Stage 2: Distribution Validation
Does the generated dataset, taken as a whole, match realistic distributions? For classification tasks: are label proportions reasonable? For question generation: is there appropriate coverage across difficulty levels, question types, and topic areas? For instruction-following: is there variety in instruction types and response styles?

Distribution validation catches a subtle failure mode: a dataset that passes all per-example quality checks but is heavily skewed — 90% easy questions, all from one document section — and would produce a model with systematic blind spots.
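The skew described above is a dataset-level property, so it has to be measured over all records at once. A minimal sketch, assuming records carry an integer `difficulty` from 1 to 8 and using a hypothetical `max_share` cutoff of 50% per level:

```python
from collections import Counter


def difficulty_report(records, levels=range(1, 9), max_share=0.5):
    """Flag difficulty levels that dominate the dataset or never appear."""
    if not records:
        raise ValueError("cannot assess an empty dataset")
    counts = Counter(r["difficulty"] for r in records)
    shares = {level: counts.get(level, 0) / len(records) for level in levels}
    return {
        "shares": shares,
        "dominant": [lvl for lvl, share in shares.items() if share > max_share],
        "missing": [lvl for lvl, share in shares.items() if share == 0],
    }


# Nine easy questions and a single level-2 question: every record could pass
# per-example checks while the dataset as a whole is badly skewed.
skewed = [{"difficulty": 1}] * 9 + [{"difficulty": 2}]
report = difficulty_report(skewed)
```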

### Stage 3: Deduplication
Near-duplicate examples in training data waste compute and can cause overfitting. mbDataGen identifies semantic near-duplicates (not just exact matches) and flags them for review or removal.
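Semantic deduplication normally compares embedding vectors, which requires a model. To show the flagging logic with the standard library only, the sketch below swaps in Jaccard overlap of word sets, a lexical stand-in that catches rewordings with shared vocabulary but not true paraphrases. The 0.8 threshold and the function names are hypothetical.

```python
import re


def token_set(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Share of distinct words the two texts have in common."""
    tokens_a, tokens_b = token_set(a), token_set(b)
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def find_near_duplicates(questions, threshold=0.8):
    """Return index pairs whose similarity meets the threshold (O(n^2) scan)."""
    pairs = []
    for i in range(len(questions)):
        for j in range(i + 1, len(questions)):
            if jaccard(questions[i], questions[j]) >= threshold:
                pairs.append((i, j))
    return pairs


questions = [
    "What is the boiling point of water at sea level?",
    "what is the boiling point of water at sea level",
    "Who wrote the user manual?",
]
duplicate_pairs = find_near_duplicates(questions)
```

The pairwise scan is quadratic; for large datasets an approximate index over embeddings or MinHash signatures would be the usual replacement.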

### Stage 4: Grounding Validation
Can each generated claim be traced back to a source passage? For factual content, can the generated answer be verified against the source document?

This stage is critical for preventing hallucination propagation. A generated example that contains a plausible-but-false claim — if it passes all structural and distribution checks — can introduce false information into a fine-tuned model's behavior. Grounding validation checks generation against source.

### Stage 5: Novelty Validation
Does this data point add value over what already exists in the dataset? If a very similar example already passed validation, is the marginal utility of this one sufficient to include it?

Novelty validation maximizes information density per training example.

**Scoring thresholds:**
- Score ≥ 90%: AUTO_APPROVE
- Score 70–89%: REQUIRE_REVIEW (human review queue)
- Score < 70%: AUTO_REJECT

---

## The RunManifest: Data Provenance

Every dataset exported by mbDataGen includes an **HMAC-signed RunManifest**, a structured metadata document that records:

- Source documents used (with hashes for integrity verification)
- Generation parameters (model, temperature, prompt version, K value)
- Validation scores per stage for each data point
- Timestamp and hardware fingerprint of the generation run
- Selection rationale for each accepted record

The HMAC signature lets anyone holding the signing key detect whether the manifest was modified after the fact. This matters when:

- Your organization audits AI training data for compliance
- You need to demonstrate that training data was grounded in authorized sources
- You want to reproduce or extend a dataset months later
- You are submitting a model for certification and need to document its training data provenance
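The "hashes for integrity verification" item can be illustrated with SHA-256 fingerprints of the source documents. The manifest layout, key names, and functions below are hypothetical and do not reproduce the actual RunManifest format; they only show how recorded hashes reveal a source that changed after the run.

```python
import hashlib
from datetime import datetime, timezone


def fingerprint(data):
    """SHA-256 fingerprint of a document's raw bytes."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def build_manifest(sources, params):
    """sources maps a document name to its bytes; params holds generation settings."""
    return {
        "created_at": datetime.now(timezone.utc).isoformat(),
        "generation": dict(params),
        "sources": {name: fingerprint(data) for name, data in sources.items()},
    }


def changed_sources(manifest, sources):
    """Names of documents whose current content no longer matches the manifest."""
    return sorted(
        name for name, data in sources.items()
        if manifest["sources"].get(name) != fingerprint(data)
    )


sources = {"manual.pdf": b"original manual bytes", "notes.md": b"# Notes"}
manifest = build_manifest(sources, {"model": "qwen2.5-7b-instruct", "k_candidates": 4})
edited = dict(sources, **{"manual.pdf": b"edited manual bytes"})
```

Calling `changed_sources(manifest, edited)` reports `manual.pdf`, which is the check an auditor would run before trusting that a dataset still matches its recorded sources.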

---

## Hardware Requirements and Output Schema

- **Minimum hardware:** 6GB VRAM, 16GB RAM
- **Recommended:** 8GB+ VRAM, 32GB RAM
- **Storage:** Depends on model size and dataset volume

mbDataGen supports any output schema you define. Built-in schemas include:

- **Instruction-following pairs** — `{"instruction": "...", "input": "...", "output": "..."}`
- **MCQ with rationale** — Full Jogg quiz format with 4 options, correct answer, explanation
- **Preference pairs** — `{"prompt": "...", "chosen": "...", "rejected": "..."}` for DPO/ORPO training
- **Structured extraction** — Any custom JSON schema you define

If your target task needs a different format, you define the schema and mbDataGen generates to it.

---

## Use Cases

**Fine-tuning a domain-specific Q&A model.** You have a corpus of 10,000 internal technical documents. Using mbDataGen, generate 50,000 instruction-following pairs grounded in those documents. Fine-tune a base model on the result. The trained model answers questions about your internal systems with accuracy that general models cannot achieve.

**Building an educational AI quiz system.** Define an MCQ schema with question, four options, correct answer, and rationale. Feed in curriculum documents. mbDataGen generates a question bank that covers the curriculum with appropriate difficulty distribution and is validated for factual accuracy against the source.

**Creating preference data for alignment.** Generate instruction-response pairs, then use mbDataGen's comparative generation to create a chosen/rejected pair for each instruction, scoring which response is higher quality. Use the resulting preference dataset for DPO or ORPO fine-tuning.

**Augmenting sparse datasets.** You have 200 real labeled examples in a specialized domain — enough to establish quality but not enough to fine-tune reliably. Use those 200 examples as grounding signals to generate 5,000 validated synthetic examples with the same quality characteristics.
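For the preference-data use case above, one way to turn scored responses into a DPO/ORPO record is to pair the highest-scoring response with the lowest-scoring one and skip prompts where the two are too close to call. This is a sketch of that idea only; the function name and the `min_margin` cutoff are hypothetical, and the output keys follow the built-in preference-pair schema.

```python
def build_preference_pair(prompt, scored_responses, min_margin=0.1):
    """scored_responses is a list of (text, score); returns a pair dict or None."""
    if len(scored_responses) < 2:
        return None
    ranked = sorted(scored_responses, key=lambda item: item[1], reverse=True)
    chosen, top_score = ranked[0]
    rejected, bottom_score = ranked[-1]
    # Skip pairs whose scores are too close to give a clear preference signal.
    if top_score - bottom_score < min_margin:
        return None
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


pair = build_preference_pair(
    "Explain what the validator does.",
    [("It runs five checks on every record.", 0.91), ("It validates stuff.", 0.48)],
)
```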

---

## Fully Local, Your Data Stays Yours

mbDataGen generates from your documents on your hardware. Source documents never leave your machine. Generated datasets are written to local files. The RunManifest is a local file.

If you have compliance requirements around training data — where it comes from, what it contains, who can access it — mbDataGen's local-first architecture and provenance system address those requirements by design.

---

## Where DataGen Fits in the Ecosystem

mbDataGen is not isolated tooling. It is the middle layer in a coherent end-to-end flow:

1. **mbRAG** — Retrieve and contextualize information from your source documents
2. **mbDataGen** — Generate structured training data grounded in that retrieved knowledge
3. **mbFT** — Fine-tune a model on the generated dataset to adapt it to your domain

This pipeline reduces the handoff friction between knowledge, data, and model. Instead of three disconnected tools with incompatible formats and separate configuration approaches, the MokingBird Node coordinates all three in one workspace.

---

## Download Free

mbDataGen is available as part of MokingBird AI — free to download.

Full pipeline features (all validation stages, HMAC provenance, all output schemas) are available in the Premium tier.

Download at [ai.mokingbird.xyz](https://ai.mokingbird.xyz).
