Synthetic Data That Actually Works: The Science Behind mbDataGen
April 2026 · MokingBird Team
April 13, 2026 · MokingBird Team · Tags: mbDataGen, synthetic data, GPRO, RL, training data
Ask any ML practitioner what the hardest part of fine-tuning is, and you'll get the same answer: data. High-quality, domain-specific training data is time-consuming to create manually, expensive to annotate professionally, and hard to find in public datasets. mbDataGen was built to solve this by generating synthetic training data from your own source documents.
Why Synthetic Data?
Synthetic data has a reputation problem it doesn't entirely deserve. The concern is circular generation: generate data with a model, then fine-tune that same model on the generated data, and you get drift and quality degradation. This concern is valid for naive approaches. The reason synthetic data fails is usually not the generation — it's the validation. mbDataGen treats validation as the core of the product.
The 5-Phase Pipeline
- Extract — Load source documents (17 formats: PDF, DOCX, Excel, CSV, JSON, Markdown, PowerPoint, Email, images with OCR, web content, and more)
- Enrich — Add contextual metadata: source attribution, entity extraction, topic classification
- Generate — Produce K=4 candidates using GPRO-Hybrid RL and score them
- Validate — Run every candidate through the 5-stage validator
- Deploy — Export with an HMAC-signed RunManifest for complete provenance
GPRO-Hybrid RL: The Generation Engine
GPRO-Hybrid RL generates K=4 candidate outputs per data point and scores all of them using a hybrid reward function:
Total Reward = 0.7 × Field/Process Reward + 0.3 × Outcome/Overall Reward
Field/Process Reward (70%): Evaluates each field independently. Catches micro-quality issues that overall quality scores miss.
Outcome/Overall Reward (30%): Evaluates the data point holistically. Is this a useful training example? Is there diversity relative to other examples?
The 5-Stage Validator
- Schema Validation — Are all required fields present? Are types correct?
- Distribution Validation — Does the dataset match realistic distributions?
- Deduplication — Identifies semantic near-duplicates, not just exact matches
- Grounding Validation — Can every generated claim be traced back to a source passage?
- Novelty Validation — Does this data point add value over what already exists?
Scoring thresholds: ≥90% AUTO_APPROVE, 70% to <90% REQUIRE_REVIEW, <70% AUTO_REJECT.
HMAC-Signed RunManifest
Every dataset exported by mbDataGen includes an HMAC-signed RunManifest: a cryptographically verifiable provenance record documenting source documents, generation parameters, validation scores per stage, and timestamp. The HMAC signature makes any later change to the manifest detectable by anyone who holds the signing key.
Output Schemas
- Instruction-following pairs: {"instruction": "...", "input": "...", "output": "..."}
- MCQ with rationale: Full quiz format with 4 options, correct answer, explanation
- Preference pairs: {"prompt": "...", "chosen": "...", "rejected": "..."} for DPO/ORPO training
- Any custom JSON schema you define
Fully Local, Your Data Stays Yours
mbDataGen generates from your documents on your hardware. Source documents never leave your machine. Generated datasets are written to local files. If you have compliance requirements around training data, mbDataGen's local-first architecture addresses those requirements by design.
Download Free
mbDataGen is available as part of MokingBird AI — free to download. Full pipeline features are available in the Premium tier. Download at ai.mokingbird.xyz.
---
title: "Synthetic Data That Actually Works: The Science Behind mbDataGen"
date: "2026-04-13"
author: "MokingBird Team"
tags: ["mbDataGen", "synthetic data", "GPRO", "RL", "training data", "fine-tuning", "local AI"]
---
# Synthetic Data That Actually Works: The Science Behind mbDataGen
Ask any ML practitioner what the hardest part of fine-tuning is, and you'll get the same answer: data. Not the model architecture. Not the training loop. The data.
High-quality, domain-specific training data is time-consuming to create manually, expensive to annotate professionally, and hard to find in public datasets — especially for specialized domains like legal, medical, or scientific applications. The data that's available is often noisy, misaligned with your task, or insufficient in volume.
mbDataGen was built to solve this. It generates synthetic training data from your own source documents — clean, validated, and grounded in your actual knowledge base.
---
## Why Synthetic Data?
Synthetic data has a reputation problem it doesn't entirely deserve. The concern is circular generation: if you generate data with a model and then fine-tune that same model on the generated data, you get drift, hallucination amplification, and quality degradation.
This concern is valid for naive approaches — generating thousands of random examples with no validation and feeding them directly into a training loop. It's not inherent to synthetic data as a concept.
The reason synthetic data fails is usually not the generation — it's the validation. Most pipelines skip validation or treat it as an afterthought. mbDataGen treats validation as the core of the product.
---
## The 5-Phase Pipeline
mbDataGen organizes the entire process into five phases:
### Phase 1: Extract
Load your source documents — the knowledge base that generated data will be grounded in. mbDataGen supports 17 document formats: PDF (with multi-engine parsing), DOCX, Excel, CSV, JSON, Markdown, PowerPoint, Email, images with OCR, web content, and more.
During extraction, the system parses structure: sections, headings, tables, code blocks, and relationships between document elements. This structure informs later phases — generated data that understands document structure is more useful than data that treats all text as a flat blob.
### Phase 2: Enrich
Add contextual metadata to extracted content: source attribution, document type, section relationships, entity extraction, topic classification. This enrichment is what allows the provenance system to work — every generated data point can be traced back to specific source passages.
### Phase 3: Generate
Produce candidate data using the GPRO-Hybrid RL approach (described in detail below). For each target data point, generate K=4 candidates and score them. Candidates are structured according to your output schema — instruction-following pairs, MCQ questions with rationales, preference pairs, structured extraction examples, or any custom schema you define.
### Phase 4: Validate
Run every candidate through the 5-stage validator. This is where mbDataGen distinguishes itself — the validation pipeline is not a simple heuristic filter but a multi-stage quality gate that evaluates different dimensions of data quality independently.
### Phase 5: Deploy
Export approved records to your training format. Every output includes an HMAC-signed RunManifest, a cryptographically verifiable record of how each data point was produced. When you need to audit your training data or certify its provenance, the RunManifest provides the chain of custody.
---
## GPRO-Hybrid RL: The Generation Engine
The core of mbDataGen's generation step is **GPRO-Hybrid RL** — an original reward learning approach developed by MokingBird.
Standard data generation with an LLM produces one output per prompt. The quality of that output depends entirely on prompt engineering. There's no mechanism for the system to distinguish a good output from a mediocre one.
GPRO-Hybrid RL changes this by generating **K=4 candidate outputs** for each data point and scoring all of them using a hybrid reward function:
```
Total Reward = 0.7 × Field/Process Reward + 0.3 × Outcome/Overall Reward
```
**Field/Process Reward (70% weight):**
Evaluates each field of the generated output independently. For an MCQ question, this means scoring: Is the question grammatically correct? Is the question answerable from the source? Is the correct answer actually correct? Are the distractors plausible but clearly wrong? Are all required fields present and properly formatted?
Field-level scoring catches micro-quality issues that overall quality scores miss. A data point can look good at a high level while containing a subtly wrong distractor answer or a malformed JSON field.
**Outcome/Overall Reward (30% weight):**
Evaluates the data point holistically. Is this a useful training example? Does it test the right concepts? Is there diversity relative to other generated examples? Would a model that learned from this example be better at the target task?
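The K=4 candidates are compared, and the highest-scoring candidate is selected for validation. Generating multiple candidates and keeping the best is a form of best-of-K sampling (closely related to rejection sampling) with learned scoring. The selected candidate scores at least as high under the reward function as any single draw from the same four, so output quality improves over single-shot generation to the extent that the reward tracks real quality.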
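
The selection step can be sketched in a few lines of Python. This is a minimal illustration, not mbDataGen's implementation: the `Candidate` class, the `total_reward` and `select_best` helpers, and the example scores are hypothetical. Only the 0.7/0.3 weights and K=4 come from the description above, and the sketch assumes both rewards are already normalized to the range 0 to 1.

```python
from dataclasses import dataclass

# Weights taken from the reward formula above.
FIELD_WEIGHT = 0.7
OUTCOME_WEIGHT = 0.3


@dataclass
class Candidate:
    """Hypothetical container for one generated candidate and its two rewards."""
    text: str
    field_reward: float    # assumed to be in [0, 1]
    outcome_reward: float  # assumed to be in [0, 1]


def total_reward(c: Candidate) -> float:
    """Weighted sum of the field-level and holistic rewards."""
    return FIELD_WEIGHT * c.field_reward + OUTCOME_WEIGHT * c.outcome_reward


def select_best(candidates: list[Candidate]) -> Candidate:
    """Return the candidate with the highest total reward (best-of-K)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=total_reward)


# K=4 candidates with made-up scores, for illustration only.
candidates = [
    Candidate("A", field_reward=0.9, outcome_reward=0.5),
    Candidate("B", field_reward=0.8, outcome_reward=0.9),
    Candidate("C", field_reward=0.6, outcome_reward=1.0),
    Candidate("D", field_reward=0.95, outcome_reward=0.2),
]
best = select_best(candidates)
```

In this example candidate B wins with 0.7 × 0.8 + 0.3 × 0.9 = 0.83, even though candidate D has the highest field-level score. The holistic term can change the ranking, but the field-level term dominates it.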
---
## The 5-Stage Validator
After generation selects the best candidate, it enters the validation pipeline. Five stages, each assessing a different quality dimension:
### Stage 1: Schema Validation
Does the output conform to the required schema? Are all required fields present? Are field types correct? Are values within expected ranges?
This catches structural failures — malformed JSON, missing fields, type errors — before they enter the training dataset.
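
As a rough sketch of what a schema check involves, the function below validates a raw JSON string against the instruction-following schema. The `schema_errors` helper and the representation of a schema as a dict of field names to Python types are assumptions made for illustration; the source does not describe how mbDataGen represents schemas internally.

```python
import json

# Required fields and expected types for the instruction-following schema.
INSTRUCTION_SCHEMA = {"instruction": str, "input": str, "output": str}


def schema_errors(raw: str, schema: dict[str, type]) -> list[str]:
    """Return a list of schema problems; an empty list means the record passes."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"malformed JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["top-level value is not a JSON object"]
    errors = []
    for field, expected in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            errors.append(f"wrong type for {field}: expected {expected.__name__}")
    return errors


good = '{"instruction": "Summarize.", "input": "Some text", "output": "A summary"}'
bad = '{"instruction": "Summarize.", "output": 42}'
```

Calling `schema_errors(bad, INSTRUCTION_SCHEMA)` reports both the missing `input` field and the non-string `output`, while the well-formed record returns an empty list.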
### Stage 2: Distribution Validation
Does the generated dataset, taken as a whole, match realistic distributions? For classification tasks: are label proportions reasonable? For question generation: is there appropriate coverage across difficulty levels, question types, and topic areas? For instruction-following: is there variety in instruction types and response styles?
Distribution validation catches a subtle failure mode: a dataset that passes all per-example quality checks but is heavily skewed — 90% easy questions, all from one document section — and would produce a model with systematic blind spots.
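
One simple way to detect this kind of skew is to compute each label's share of the dataset and flag any label above a ceiling. The sketch below does that for difficulty labels. The function names and the 0.6 ceiling are hypothetical choices for illustration, not values taken from mbDataGen.

```python
from collections import Counter


def label_shares(labels: list[str]) -> dict[str, float]:
    """Fraction of the dataset carried by each label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}


def skewed_labels(labels: list[str], max_share: float = 0.6) -> list[str]:
    """Labels whose share exceeds max_share (a hypothetical ceiling)."""
    shares = label_shares(labels)
    return sorted(label for label, share in shares.items() if share > max_share)


# 90% easy questions: the skew described above.
difficulties = ["easy"] * 9 + ["hard"] * 1
flagged = skewed_labels(difficulties)
```

The same check applies to any categorical attribute of a record, such as question type or source document section.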
### Stage 3: Deduplication
Near-duplicate examples in training data waste compute and can cause overfitting. mbDataGen identifies semantic near-duplicates (not just exact matches) and flags them for review or removal.
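
The sketch below shows the general shape of near-duplicate detection: score every pair of examples for similarity and flag pairs above a threshold. Note the substitution: it uses Jaccard similarity over word tokens, a purely lexical measure, as a stand-in for the semantic comparison described above, which would more plausibly rely on text embeddings. The function names and the 0.8 threshold are illustrative assumptions.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased word tokens; a crude lexical stand-in for a semantic embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two token sets, in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def near_duplicates(examples: list[str], threshold: float = 0.8) -> list[tuple[int, int]]:
    """Index pairs (i, j) with i < j whose similarity meets the threshold."""
    return [
        (i, j)
        for i in range(len(examples))
        for j in range(i + 1, len(examples))
        if jaccard(examples[i], examples[j]) >= threshold
    ]


examples = [
    "What is the boiling point of water at sea level?",
    "what is the boiling point of water at sea level",  # case and punctuation differ
    "Which enzyme unwinds DNA during replication?",
]
pairs = near_duplicates(examples)
```

The pairwise loop is quadratic in the number of examples, which is fine for a sketch but would need an index or blocking strategy at dataset scale.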
### Stage 4: Grounding Validation
Can each generated claim be traced back to a source passage? For factual content, can the generated answer be verified against the source document?
This stage is critical for preventing hallucination propagation. A generated example that contains a plausible-but-false claim — if it passes all structural and distribution checks — can introduce false information into a fine-tuned model's behavior. Grounding validation checks generation against source.
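
To make the idea of tracing a claim to a source passage concrete, here is a deliberately simple sketch. It measures what fraction of a claim's content words appear in each source passage and returns the index of the best-supporting passage, or `None` if no passage reaches a minimum level of support. The lexical-overlap heuristic, the function names, the stopword list, and the 0.8 cutoff are all assumptions for illustration; the source does not say how mbDataGen performs grounding.

```python
import re
from typing import Optional

# Small illustrative stopword list; a real check would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "in", "at", "to", "and"}


def content_tokens(text: str) -> set[str]:
    """Lowercased word tokens with stopwords removed."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def support(claim: str, passage: str) -> float:
    """Fraction of the claim's content tokens that appear in the passage."""
    claim_tokens = content_tokens(claim)
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & content_tokens(passage)) / len(claim_tokens)


def best_source(claim: str, passages: list[str], min_support: float = 0.8) -> Optional[int]:
    """Index of the best-supporting passage, or None if support is too low."""
    scores = [support(claim, p) for p in passages]
    if not scores:
        return None
    best = max(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] >= min_support else None


passages = [
    "The pump operates at 40 psi under normal load.",
    "Filters must be replaced every 90 days.",
]
grounded = best_source("The pump operates at 40 psi.", passages)
ungrounded = best_source("The pump requires synthetic oil.", passages)
```

Word overlap is a weak proxy. A claim that changes a single number, such as "every 30 days" in place of "every 90 days", would still overlap heavily with its source, so a production grounding check needs something stronger than this heuristic.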
### Stage 5: Novelty Validation
Does this data point add value over what already exists in the dataset? If a very similar example already passed validation, is the marginal utility of this one sufficient to include it?
Novelty validation maximizes information density per training example.
**Scoring thresholds:**
- Score ≥ 90%: AUTO_APPROVE
- Score ≥ 70% and < 90%: REQUIRE_REVIEW (human review queue)
- Score < 70%: AUTO_REJECT
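
The thresholds translate into a small decision function. In this sketch the function name `review_decision` is hypothetical, and it assumes scores are percentages from 0 to 100. It also assumes the review band covers every score from 70 up to, but not including, 90, so a fractional score such as 89.5 is routed to review.

```python
def review_decision(score: float) -> str:
    """Map a validation score in [0, 100] to one of the three outcomes."""
    if not 0.0 <= score <= 100.0:
        raise ValueError(f"score out of range: {score}")
    if score >= 90.0:
        return "AUTO_APPROVE"
    if score >= 70.0:
        return "REQUIRE_REVIEW"
    return "AUTO_REJECT"


decisions = [review_decision(s) for s in (95.0, 89.5, 70.0, 69.9)]
```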
---
## The RunManifest: Data Provenance
Every dataset exported by mbDataGen includes an **HMAC-signed RunManifest**, a structured metadata document that records:
- Source documents used (with hashes for integrity verification)
- Generation parameters (model, temperature, prompt version, K value)
- Validation scores per stage for each data point
- Timestamp and hardware fingerprint of the generation run
- Selection rationale for each accepted record
The HMAC signature makes any later modification of the manifest detectable by anyone who holds the signing key. This matters when:
- Your organization audits AI training data for compliance
- You need to demonstrate that training data was grounded in authorized sources
- You want to reproduce or extend a dataset months later
- You are submitting a model for certification and need to document its training data provenance
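
For readers unfamiliar with HMAC signing, the sketch below shows the general mechanism using Python's standard `hmac`, `hashlib`, and `json` modules. It is not the RunManifest format: the manifest fields, the choice of SHA-256, the canonical JSON serialization, and the key are all assumptions made for illustration.

```python
import hashlib
import hmac
import json


def canonical_bytes(manifest: dict) -> bytes:
    """Serialize deterministically so equal manifests yield equal bytes."""
    return json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_manifest(manifest: dict, key: bytes) -> str:
    """Hex-encoded HMAC-SHA256 tag over the canonical manifest bytes."""
    return hmac.new(key, canonical_bytes(manifest), hashlib.sha256).hexdigest()


def verify_manifest(manifest: dict, signature: str, key: bytes) -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(sign_manifest(manifest, key), signature)


key = b"example-key-not-for-production"  # hypothetical signing key
manifest = {  # hypothetical fields, loosely following the list above
    "sources": [
        {"name": "handbook.pdf", "sha256": hashlib.sha256(b"example bytes").hexdigest()}
    ],
    "generation": {"k": 4, "temperature": 0.7},
    "created": "2026-04-13T00:00:00Z",
}
signature = sign_manifest(manifest, key)

# Any edit to the manifest changes the expected tag, so verification fails.
tampered = {**manifest, "generation": {"k": 1, "temperature": 0.7}}
```

Verification succeeds for the original manifest and fails for the tampered copy. Because HMAC uses a shared secret, the guarantee holds only as long as the key stays private: anyone who holds the key can both verify and re-sign.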
---
## Hardware Requirements and Output Schema
- **Minimum hardware:** 6 GB VRAM, 16 GB RAM
- **Recommended:** 8 GB+ VRAM, 32 GB RAM
- **Storage:** Depends on model size and dataset volume

mbDataGen supports any output schema you define. Built-in schemas include:
- **Instruction-following pairs** — `{"instruction": "...", "input": "...", "output": "..."}`
- **MCQ with rationale** — Full Jogg quiz format with 4 options, correct answer, explanation
- **Preference pairs** — `{"prompt": "...", "chosen": "...", "rejected": "..."}` for DPO/ORPO training
- **Structured extraction** — Any custom JSON schema you define
If your target task needs a different format, you define the schema and mbDataGen generates to it.
---
## Use Cases
**Fine-tuning a domain-specific Q&A model.** You have a corpus of 10,000 internal technical documents. Using mbDataGen, generate 50,000 instruction-following pairs grounded in those documents. Fine-tune a base model on the result. The goal is a model that answers questions about your internal systems more accurately than a general-purpose model that has never seen those documents.
**Building an educational AI quiz system.** Define an MCQ schema with question, four options, correct answer, and rationale. Feed in curriculum documents. mbDataGen generates a question bank that covers the curriculum with appropriate difficulty distribution and is validated for factual accuracy against the source.
**Creating preference data for alignment.** Generate instruction-response pairs, then use mbDataGen's comparative generation to create a chosen/rejected pair for each instruction, scoring which response is higher quality. Use the resulting preference dataset for DPO or ORPO fine-tuning.
**Augmenting sparse datasets.** You have 200 real labeled examples in a specialized domain — enough to establish quality but not enough to fine-tune reliably. Use those 200 examples as grounding signals to generate 5,000 validated synthetic examples with the same quality characteristics.
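
For the preference-data use case above, the step from scored responses to a chosen/rejected record can be sketched as follows. The `preference_pair` helper and the example scores are hypothetical; only the `prompt`/`chosen`/`rejected` record shape comes from the built-in schema list. The sketch assumes each response already carries a quality score and refuses to build a pair when the scores are tied.

```python
import json


def preference_pair(prompt: str, scored: list[tuple[str, float]]) -> dict[str, str]:
    """Build one preference record from (response, score) tuples.

    The highest-scoring response becomes "chosen", the lowest "rejected".
    """
    if len(scored) < 2:
        raise ValueError("need at least two scored responses")
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    if ranked[0][1] == ranked[-1][1]:
        raise ValueError("scores are tied; no preference signal")
    return {"prompt": prompt, "chosen": ranked[0][0], "rejected": ranked[-1][0]}


pair = preference_pair(
    "How do I reset the controller?",
    [
        ("Just restart it.", 0.42),
        ("Hold the reset button for five seconds.", 0.91),
    ],
)
line = json.dumps(pair)  # one record, ready to append to a JSON Lines file
```

Pairing the best response with the worst gives the widest quality gap per record. Pairing the top two instead would yield harder comparisons, at the cost of a noisier preference signal.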
---
## Fully Local, Your Data Stays Yours
mbDataGen generates from your documents on your hardware. Source documents never leave your machine. Generated datasets are written to local files. The RunManifest is a local file.
If you have compliance requirements around training data — where it comes from, what it contains, who can access it — mbDataGen's local-first architecture and provenance system address those requirements by design.
---
## Where mbDataGen Fits in the Ecosystem
mbDataGen is not isolated tooling. It is the middle layer in a coherent end-to-end flow:
1. **mbRAG** — Retrieve and contextualize information from your source documents
2. **mbDataGen** — Generate structured training data grounded in that retrieved knowledge
3. **mbFT** — Fine-tune a model on the generated dataset to adapt it to your domain
This pipeline reduces the handoff friction between knowledge, data, and model. Instead of three disconnected tools with incompatible formats and separate configuration approaches, the MokingBird Node coordinates all three in one workspace.
---
## Download Free
mbDataGen is available as part of MokingBird AI — free to download.
Full pipeline features (all validation stages, HMAC provenance, all output schemas) are available in the Premium tier.
Download at [ai.mokingbird.xyz](https://ai.mokingbird.xyz).