01 January 2026 | Thursday | News
Tether Data has announced the release of QVAC Genesis II, a significant expansion of its open synthetic educational dataset for artificial intelligence. The new release adds 107 billion tokens to the original 41-billion-token Genesis dataset, bringing the total to 148 billion tokens, according to Tether Data’s AI research division, QVAC.
With this expansion, the combined Genesis dataset becomes what QVAC describes as the largest publicly available synthetic educational dataset designed specifically for AI pre-training. Covering 19 academic disciplines, the dataset is built to strengthen how AI models learn reasoning, explanation, and decision-making, moving beyond surface-level pattern recognition. The release underscores a broader push toward transparency and openness in AI development, at a time when many high-quality training datasets remain proprietary.
Building on the Genesis Foundation
QVAC Genesis II builds on the initial Genesis I release, which introduced a validated, education-focused synthetic dataset centered on core STEM subjects. Genesis I established a structured approach to generating training questions aimed at improving model reasoning accuracy rather than rote memorization.
The new version expands the dataset into ten additional academic domains, including chemistry, computer science, statistics, machine learning, astronomy, geography, econometrics, and electrical engineering. It also revisits college-level physics, regenerating content using an updated methodology designed to enhance conceptual clarity and instructional depth.
Together, Genesis I and Genesis II form what QVAC describes as the most extensive synthetic educational dataset made available to the public. The combined resource is intended for use in pre-training large language models and other AI systems that rely on structured academic content.
A New Approach to Synthetic Data Generation
Central to Genesis II is a new data generation technique known as Option-Level Reasoning. Unlike many synthetic data methods that focus primarily on identifying incorrect answers, this approach examines every possible answer in a multiple-choice question.
Correct answers are broken down to reinforce why they are correct, while incorrect options are analyzed to address common misconceptions. This design allows AI models to learn causal reasoning and decision logic, rather than simply associating questions with final outcomes.
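As a rough illustration of the idea described above, the following Python sketch shows how a multiple-choice item might be turned into a training passage that discusses every option rather than only the final answer. It is not QVAC's pipeline or data format: the `Option` class, the `render_training_text` function, the output layout, and the sample physics question are all hypothetical, chosen only to make the concept concrete.

```python
from dataclasses import dataclass


@dataclass
class Option:
    """One answer choice (hypothetical schema, not QVAC's format)."""
    label: str        # e.g. "A"
    text: str         # the answer choice as shown in the question
    is_correct: bool
    rationale: str    # why it is right, or which misconception it reflects


def render_training_text(question: str, options: list[Option]) -> str:
    """Build one passage that analyzes every option, not just the answer."""
    lines = [f"Question: {question}"]
    for opt in options:
        lines.append(f"{opt.label}. {opt.text}")
    lines.append("Analysis:")
    for opt in options:
        verdict = "Correct" if opt.is_correct else "Incorrect"
        lines.append(f"{opt.label}: {verdict}. {opt.rationale}")
    # Assumes exactly one option is marked correct.
    answer = next(opt.label for opt in options if opt.is_correct)
    lines.append(f"Answer: {answer}")
    return "\n".join(lines)


# Illustrative item; the question and rationales are made up for this sketch.
example = render_training_text(
    "A ball is thrown straight up. What is its acceleration at the highest point?",
    [
        Option("A", "Zero", False,
               "Confuses zero velocity with zero acceleration."),
        Option("B", "About 9.8 m/s^2 downward", True,
               "Gravity acts on the ball throughout the flight."),
        Option("C", "About 9.8 m/s^2 upward", False,
               "Reverses the direction of the gravitational force."),
    ],
)
print(example)
```

In this sketch, each incorrect option carries the misconception it represents, so the resulting text explains why the wrong choices are wrong alongside why the right one is right.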
Option-Level Reasoning complements the Failure Analysis method introduced in Genesis I, which focused on extracting instructional value from model errors. Together, these techniques create a pipeline in which each generated question is designed to deliver meaningful educational insight.
According to independent evaluations cited by QVAC, models trained on Genesis II data demonstrate higher reasoning accuracy and produce clearer, more structured explanations than models trained on earlier synthetic datasets.
Fintech Business Asia, a business of FinTech Business Review