As AI technology continues to evolve, the need for documents that are easily interpretable by AI systems has become increasingly apparent. The LF AI & Data Foundation, part of the Linux Foundation, has launched a working group to develop DocLang, a new document format designed specifically for AI compatibility.
Purpose and Development
DocLang was initiated by a coalition of companies including IBM, NVIDIA, Red Hat, ABBYY, HumanSignal, and Forgis. The group argues that traditional formats such as PDF, Markdown, HTML, and LaTeX are not optimized for AI document parsing. These existing formats often lose critical semantic information and structural relationships when processed by AI models.
Technical Features
DocLang builds upon IBM’s open-source toolkit, Docling, which facilitates the conversion of various file formats into structured data suitable for AI. The new format aims to create a standardized method for exchanging structured outputs across different systems. According to Maxime Vermeir, VP of AI Strategy at ABBYY, “DocLang is designed to solve one of the foundational problems in enterprise AI: documents were built for humans, not machines.”
DocLang is optimized for large language model (LLM) tokenizers, using a markup system in which each DocLang element corresponds to exactly one LLM token. The approach is intended to preserve structure and content during conversion, and the format supports common graphical elements such as tables and charts.
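To make the one-to-one idea concrete, here is a minimal sketch of a toy tokenizer in which every structural tag costs exactly one token. The tag names, the `tokenize` function, and the word-level handling of content are all assumptions made for illustration; they are not DocLang's actual vocabulary or a real LLM tokenizer.

```python
import re

# Hypothetical structural tags, invented for this sketch.
# They are NOT taken from the DocLang specification.
STRUCTURAL_TAGS = ["<table>", "</table>", "<row>", "</row>", "<cell>", "</cell>"]

# Each structural tag is assigned exactly one reserved token ID.
TAG_TO_ID = {tag: i for i, tag in enumerate(STRUCTURAL_TAGS)}

# The capturing group makes re.split keep the matched tags in its output.
_TAG_PATTERN = re.compile("(" + "|".join(re.escape(t) for t in STRUCTURAL_TAGS) + ")")


def tokenize(text):
    """Return one token per structural tag and one token per content word."""
    tokens = []
    for piece in _TAG_PATTERN.split(text):
        if piece in TAG_TO_ID:
            tokens.append(("TAG", TAG_TO_ID[piece]))
        else:
            # Simplification: real tokenizers split text into subwords.
            tokens.extend(("TEXT", word) for word in piece.split())
    return tokens


sample = "<row><cell>Revenue</cell><cell>42</cell></row>"
tokens = tokenize(sample)
print(len(tokens))  # 8: six tags plus two content words
```

In a general-purpose tokenizer, a tag such as `</cell>` is typically broken into several subword tokens, so reserving one token per element is where a format of this kind could save on token count.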
Cost Efficiency and Performance
The introduction of DocLang could lead to significant cost savings for enterprises. Current methods of processing documents, such as using OCR on PDFs, can be token-intensive, leading to higher operational costs. Jon Knisley from ABBYY notes that “every time a PDF enters an AI pipeline, structure, meaning, and layout get lost,” which can bottleneck model accuracy.
Initial benchmarks suggest that DocLang can reduce token consumption by a factor of 4 to more than 30, depending on the complexity of the documents and the models used. In one cited example, a PDF version of IBM’s 2025 annual report requires 8,421 input tokens, compared with 5,310 tokens for the DocLang version.
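The savings implied by the annual report figures can be checked with simple arithmetic. The sketch below uses only the two token counts quoted above; the function names are illustrative. Note that these two counts work out to a reduction of roughly 1.6 times, which is below the 4 to 30 range, so the range and the example may rest on different measurements.

```python
# Token counts quoted in the article for IBM's 2025 annual report.
pdf_tokens = 8_421
doclang_tokens = 5_310


def reduction_factor(before, after):
    """How many times fewer tokens the second format needs."""
    return before / after


def percent_saved(before, after):
    """Share of the original token count that is eliminated."""
    return 100 * (before - after) / before


factor = reduction_factor(pdf_tokens, doclang_tokens)
saved = percent_saved(pdf_tokens, doclang_tokens)
print(f"{factor:.2f}x fewer tokens, {saved:.1f}% saved")
# prints "1.59x fewer tokens, 36.9% saved"
```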
Future Prospects
Adoption of DocLang is still in its early stages, but as an open standard it invites participation from additional technology providers and enterprises. The initial response has been positive, which may signal a shift toward more AI-friendly document formats.
This article was produced by NeonPulse.today using human and AI-assisted editorial processes, based on publicly available information. Content may be edited for clarity and style.








