AI has the potential to generate over 100 trillion yen (approximately $650 billion) in economic value for Japan. Realizing that potential, however, depends on something many AI projects lack: usable training data.
This challenge is particularly acute for developers aiming to create AI systems that understand the nuances of the Japanese language and culture. While English-language datasets are abundant, Japanese developers face a persistent shortage of culturally relevant data necessary for building high-performance models. The time and cost associated with collecting, cleaning, and labeling new samples make it difficult to keep pace with the rapid development cycles of AI.
A New Path Forward
NTT DATA has demonstrated how synthetic data can remove this barrier. Starting from a small amount of proprietary data, the company can generate large, production-scale datasets without compromising privacy or model performance. Using the NVIDIA Nemotron-Personas-Japan dataset, which comprises 6 million personas based on Japan’s demographics, geography, and culture, NTT DATA improved model accuracy on legal Q&A tasks from 15.3% to 79.3%. That is a gain of 64 percentage points, achieved without exposing sensitive data in the training pipeline.
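The size of that improvement can be checked directly from the two reported accuracy figures. The short Python sketch below uses only those two numbers; the variable names are illustrative.

```python
# Accuracy figures as reported for the legal Q&A task (in percent).
baseline_acc = 15.3
tuned_acc = 79.3

# Absolute gain in percentage points, and the gain relative to the baseline.
point_gain = round(tuned_acc - baseline_acc, 1)
relative_gain = round(tuned_acc / baseline_acc, 2)

print(f"absolute gain: {point_gain} percentage points")
print(f"relative gain: {relative_gain}x the baseline accuracy")
```

Reporting both views is useful here: the absolute gain shows how many more questions are answered correctly, while the ratio shows how far the model moved from a weak starting point.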
Experimental Validation
To validate this approach, NTT DATA conducted a controlled evaluation using synthetic legal documents. The training configuration was:
Base Model: tsuzumi 2 (NTT’s proprietary LLM)
Data Augmentation Model: GPT-OSS-120b
Seed Data: Nemotron-Personas-Japan
Judge Model: GPT-5 (LLM-as-a-judge method)
Using 500 personas extracted from Nemotron-Personas-Japan to expand just 450 raw seed samples, the team generated over 138,000 training data points, roughly 300 times the number of seed samples. This synthetic data not only improved accuracy but also eliminated the hallucinations that plagued the baseline model, enabling it to extract accurate legal terms without introducing noise.
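The article does not describe how personas and seed samples were actually paired, so the following Python sketch is only one plausible reading of persona-based expansion: sample distinct (persona, seed) pairs and render one rewrite prompt per pair for an augmentation model. The `Persona` and `Seed` fields, the prompt wording, and the `build_prompts` helper are all hypothetical, and the sample questions are placeholders.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    # Hypothetical fields; the real dataset schema may differ.
    occupation: str
    region: str


@dataclass(frozen=True)
class Seed:
    question: str


def build_prompts(personas, seeds, n_prompts, rng):
    """Sample distinct (persona, seed) pairs and render one prompt per pair."""
    all_pairs = [(p, s) for p in personas for s in seeds]
    if n_prompts > len(all_pairs):
        raise ValueError("more prompts requested than distinct pairs available")
    chosen = rng.sample(all_pairs, n_prompts)
    return [
        f"You are a {p.occupation} living in {p.region}. "
        f"Rephrase this legal question in your own words, keeping its meaning: "
        f"{s.question}"
        for p, s in chosen
    ]


personas = [
    Persona("nurse", "Osaka"),
    Persona("farmer", "Hokkaido"),
    Persona("shop owner", "Fukuoka"),
]
seeds = [
    Seed("Placeholder question about a lease contract."),
    Seed("Placeholder question about an employment contract."),
]

prompts = build_prompts(personas, seeds, n_prompts=4, rng=random.Random(0))
for prompt in prompts:
    print(prompt)

# Scale check using the reported figures: 500 personas x 450 seeds allows
# 225,000 distinct pairs, and 138,000 / 450 is roughly 307 outputs per seed.
max_pairs = 500 * 450
outputs_per_seed = 138_000 / 450
print(max_pairs, round(outputs_per_seed))
```

Each prompt would then be sent to the augmentation model, and the responses filtered before training. Sampling pairs without replacement keeps every prompt distinct, which is one simple way to avoid duplicate training examples.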
Privacy by Design
While the accuracy improvements are compelling, they raise deeper questions about data usage. More than 90% of valuable corporate data remains untapped due to privacy regulations and security risks. In Japan, frameworks such as the Act on the Protection of Personal Information (APPI) and the AI Governance Guidelines emphasize responsible data handling.
Synthetic data offers a way out of this dilemma: it produces training data that reflects the patterns of real data without including personally identifiable information (PII). Companies can start from a small amount of proprietary data and scale up with synthetic data, achieving both data minimization and better model performance.
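One practical safeguard, not described in the article, is to screen generated records for obvious PII patterns before they enter a training set. The Python sketch below is a minimal illustration under that assumption: the two regular expressions (email addresses and hyphenated Japanese-style phone numbers) and the `flag_pii` helper are hypothetical, and a real pipeline would need far broader coverage.

```python
import re

# Illustrative patterns only; they catch common formats, not all PII.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\b0\d{1,4}-\d{1,4}-\d{4}\b")


def flag_pii(records):
    """Return (index, kind) for every record matching a PII pattern."""
    hits = []
    for i, text in enumerate(records):
        if EMAIL.search(text):
            hits.append((i, "email"))
        if PHONE.search(text):
            hits.append((i, "phone"))
    return hits


sample = [
    "The tenant asked about the notice period for ending a lease.",
    "Contact me at taro@example.com for the contract draft.",
    "Call 03-1234-5678 to confirm the appointment.",
]

flagged = flag_pii(sample)
print(flagged)
```

Flagged records can be dropped or regenerated. Pattern matching of this kind only reduces risk; it does not prove that a dataset is free of personal information.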
Building a Sovereign Data Space
For Japanese companies constructing sovereign AI, data sovereignty is essential. However, sovereignty alone is insufficient; models must be informed by region-specific norms and constraints. The Nemotron-Personas-Japan dataset serves as foundational data for creating AI grounded in this reality, covering over 1,500 occupational classifications and regional distributions.
NTT DATA and other leading firms are actively developing a “data space” where government and businesses can collaboratively exchange synthetic data for AI training under shared governance and privacy guarantees. Techniques like federated learning enable this decentralized approach, allowing organizations to safely provide synthetic data reflecting their data trends without revealing sensitive information.
As NTT DATA’s research illustrates, the tools to overcome the data barrier are now open and accessible. Synthetic data is no longer a distant prospect; it is a tangible solution that developers can implement immediately to build AI systems rooted in Japanese culture while maintaining data sovereignty.
This article was produced by NeonPulse.today using human and AI-assisted editorial processes, based on publicly available information. Content may be edited for clarity and style.








