Yam Peleg Declares Non-JSON Formats "Dead on Arrival" for LLM Reliability


AI researcher Yam Peleg recently stirred discussion within the AI community with a strong assertion about data formats for Large Language Models (LLMs). Peleg stated on social media: "All non-JSON formats are dead on arrival. All LLMs are specifically trained on JSONs, the most common structure on the internet. If you invent a new format the model wasn't trained on, it will forever be less reliable. You might save some tokens but you're guaranteed to pay." His comments highlight an ongoing debate about the optimal data structures for LLM training and output.

The prevalence of JSON (JavaScript Object Notation) in LLM applications stems from its widespread use as a data interchange format, making it a natural fit for models trained on vast amounts of internet data. For tasks requiring structured output, such as interacting with APIs or databases, JSON's interoperability and clear schema definition are invaluable. JSON and JSONL (JSON Lines) are now standard in modern LLM pipelines, especially for fine-tuning models on structured input/output pairs.
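The fine-tuning workflow described above typically stores one example per line in JSONL. A minimal sketch (the field names `prompt` and `completion` are illustrative, not tied to any particular provider's API):

```python
import json

# Hypothetical fine-tuning examples: structured input/output pairs.
examples = [
    {"prompt": "Extract the city from: 'I live in Paris.'",
     "completion": {"city": "Paris"}},
    {"prompt": "Extract the city from: 'Berlin is home.'",
     "completion": {"city": "Berlin"}},
]

# JSONL: one JSON object per line, easy to stream and append.
jsonl = "\n".join(json.dumps(ex) for ex in examples)

# Reading it back is just line-by-line parsing.
parsed = [json.loads(line) for line in jsonl.splitlines()]
assert parsed == examples
```

Because each line is an independent JSON object, JSONL files can be processed in a streaming fashion without loading the whole dataset into memory.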

However, the claim that "all LLMs are specifically trained on JSONs" requires nuance. While JSON is crucial for fine-tuning and ensuring structured responses, the initial pre-training of LLMs often involves massive datasets of raw text from diverse sources like websites, books, and code repositories. Formats such as plain .txt files for unstructured data and Parquet for efficiency at scale are also extensively used in the earlier stages of LLM development.

Despite the dominance of JSON for structured output, researchers are continually exploring methods to enhance LLM adherence to specific formats. Recent advancements include reinforcement learning techniques that use schema validation as a reward signal, improving JSON output accuracy to over 98% without extensive additional training data. Other approaches, like constrained decoding, also help enforce structured output, even for formats like YAML, which some practitioners find effective for LLM interactions due to its human readability and focus on content over syntax.
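The idea of using schema validation as a reward signal can be illustrated with a toy function: parse the model's output and score it by whether it is valid JSON with the expected keys. Real systems use full JSON Schema validators and richer reward shaping; this minimal sketch uses only the standard library, and `json_reward` is a hypothetical name:

```python
import json

def json_reward(model_output: str, required_keys: set) -> float:
    """Toy reward signal: 1.0 if the output parses as a JSON object
    containing all required keys, else 0.0. Production systems would
    validate against a full JSON Schema instead."""
    try:
        obj = json.loads(model_output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(obj, dict):
        return 0.0
    return 1.0 if required_keys <= obj.keys() else 0.0

print(json_reward('{"city": "Paris"}', {"city"}))  # 1.0
print(json_reward('city: Paris', {"city"}))        # 0.0 (YAML-like, not JSON)
```

A constrained-decoding approach attacks the same problem from the other direction: instead of scoring outputs after the fact, it restricts the tokens the model may emit so that only schema-conforming strings are ever produced.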

Peleg's warning about the "cost" of inventing new formats underscores the practical challenges developers face. Deviating from formats LLMs are inherently familiar with can indeed lead to less reliable outputs, even if it offers token savings. The ongoing evolution of LLM training and prompt engineering continues to emphasize the need for clear, consistent data structures to achieve predictable and high-quality results.