Introduction
If you've been working with Large Language Models, you've likely encountered a frustrating reality: every byte of data you send to an LLM API costs money. Whether you're using OpenAI's GPT-4, Anthropic's Claude, or any other major LLM provider for web scraping, token consumption directly translates to API bills. This is where Toonify enters the picture—a clever new data serialization format that can reduce your token usage by 30-60% while maintaining human readability and data structure.
Toonify is a Python implementation of TOON (Token-Oriented Object Notation), a compact serialization format specifically designed to minimize token consumption when passing structured data to language models. Think of it as the compressed middle ground between CSV and JSON—simple enough for humans to read, structured enough for machines to understand, and lean enough to significantly lower your LLM costs.
The Problem It Solves
JSON has been the standard for structured data exchange for decades. It's flexible, universal, and relatively human-readable. However, when you're sending data to LLMs, JSON's verbosity becomes a liability. Consider this example:
{
"products": [
{"id": 101, "name": "Laptop Pro", "price": 1299},
{"id": 102, "name": "Magic Mouse", "price": 79},
{"id": 103, "name": "USB-C Cable", "price": 19}
]
}
This modest JSON payload consumes 247 bytes. When sent to an LLM, it gets tokenized into roughly 60-70 tokens. For many applications, you might be sending dozens of these payloads, quickly adding up to a meaningful portion of your API bill.
The same data in TOON format:
products[3]{id,name,price}:
101,Laptop Pro,1299
102,Magic Mouse,79
103,USB-C Cable,19
This representation is only 98 bytes—a 60% reduction—and tokenizes to roughly 25-30 tokens. That's where the magic happens. For large-scale data processing applications, this kind of reduction can translate to substantial cost savings.
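You can verify the savings on your own payloads by comparing sizes directly. The snippet below measures byte counts in plain Python using the product example above (for per-model token counts, a tokenizer library such as tiktoken would give the exact figure):

```python
import json

# The product list from the example above, in both representations.
data = {
    "products": [
        {"id": 101, "name": "Laptop Pro", "price": 1299},
        {"id": 102, "name": "Magic Mouse", "price": 79},
        {"id": 103, "name": "USB-C Cable", "price": 19},
    ]
}

json_str = json.dumps(data)
toon_str = (
    "products[3]{id,name,price}:\n"
    "  101,Laptop Pro,1299\n"
    "  102,Magic Mouse,79\n"
    "  103,USB-C Cable,19"
)

savings = 1 - len(toon_str) / len(json_str)
print(f"JSON: {len(json_str)} bytes, TOON: {len(toon_str)} bytes, {savings:.0%} saved")
```

Because tokenizers roughly track character counts for this kind of data, the byte savings is a good first-order proxy for the token savings.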
What Makes Toonify Special
Toonify isn't just about shrinking file sizes. It's thoughtfully designed with several key principles in mind:
Compact Without Sacrificing Clarity: The format uses indentation-based syntax and intelligent defaults that make it readable even with minimal metadata. When you see products[3]{id,name,price}: you immediately understand what's coming next.
Type-Aware: Unlike CSV, TOON preserves data types. Numbers remain numbers, booleans stay booleans, and null values are explicit. This matters because LLMs can better understand and process structured, typed data.
Flexible Delimiters: Depending on your data characteristics, you can choose between comma, tab, and pipe delimiters. If your dataset contains commas within fields, switching to pipes prevents parsing issues while maintaining the same compact representation.
Automatic Optimization: Toonify intelligently switches between different representations based on data structure. Uniform arrays become compact tabular formats, while heterogeneous data falls back to more explicit notation. You don't need to manually optimize—the library handles it.
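As a rough illustration of how that decision could be made (a simplified heuristic for exposition, not Toonify's actual internals): an array qualifies for the compact tabular form when every element is a flat object with the same keys.

```python
def is_uniform(rows):
    """Simplified check (not Toonify's actual logic): can `rows`
    be emitted as a compact table?"""
    if not rows or not all(isinstance(r, dict) for r in rows):
        return False
    keys = list(rows[0].keys())
    return all(
        list(r.keys()) == keys
        # Tabular rows must hold scalars only, no nested structures.
        and all(not isinstance(v, (dict, list)) for v in r.values())
        for r in rows
    )

print(is_uniform([{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]))  # True
print(is_uniform([{"id": 1}, {"name": "b"}]))                        # False
```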
Key Folding for Deep Nesting: When dealing with deeply nested objects, Toonify can "fold" single-key chains into dotted notation, further reducing overhead:
api.response.product.title: Wireless Keyboard
Instead of creating multiple nested objects, a single line handles the entire chain.
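The folding idea can be sketched in a few lines of plain Python (an illustration of the concept, not the library's implementation):

```python
def fold_keys(obj, prefix=""):
    """Collapse single-key object chains into a dotted path (illustrative only)."""
    if isinstance(obj, dict) and len(obj) == 1:
        key, value = next(iter(obj.items()))
        return fold_keys(value, f"{prefix}.{key}" if prefix else key)
    # Reached a leaf or a multi-key object: stop folding here.
    return {prefix: obj} if prefix else obj

nested = {"api": {"response": {"product": {"title": "Wireless Keyboard"}}}}
print(fold_keys(nested))  # {'api.response.product.title': 'Wireless Keyboard'}
```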
How to Use Toonify
Getting started with Toonify is straightforward. Install it via pip:
pip install toonify
Then, in your Python code:
from toon import encode, decode
# Your data
data = {
'products': [
{'sku': 'LAP-001', 'name': 'Gaming Laptop', 'price': 1299.99},
{'sku': 'MOU-042', 'name': 'Wireless Mouse', 'price': 29.99}
]
}
# Convert to TOON
toon_string = encode(data)
print(toon_string)
# Convert back to Python dict
result = decode(toon_string)
assert result == data
There's also a command-line interface for batch conversions:
# Convert JSON to TOON
toon input.json -o output.toon
# Convert TOON back to JSON
toon input.toon -o output.json
# Analyze token usage
toon data.json --stats
You can even pipe data through the converter:
cat data.json | toon -e > data.toon
Real-World Applications
Toonify shines in several scenarios:
Batch Processing with LLMs: When you're sending large datasets to an LLM for analysis or transformation, Toonify reduces prompt size and token costs simultaneously. Feed your data to Claude or GPT-4 in TOON format, and you'll process more data per dollar spent. This optimization is particularly valuable when evaluating build vs. buy decisions for web scraping solutions.
Context Window Optimization: Every LLM has a context window limit. By using Toonify, you fit more relevant data into that window, enabling more comprehensive analysis and reasoning.
Data Extraction Pipelines: At ScrapeGraphAI, we know about data extraction at scale. When extracting structured data from web pages and preparing it for LLM analysis, Toonify reduces transmission overhead while keeping the data human-readable for debugging.
Prompt Engineering: When crafting prompts with examples, using TOON format means more space for actual instruction and reasoning rather than verbose data representations.
Financial Reporting: Any application dealing with tabular financial or analytical data benefits from the compact tabular representation TOON provides.
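As a minimal sketch of the prompt-engineering pattern (the prompt wording here is only an example), embedding a TOON table in a prompt looks like this:

```python
# A TOON table reused from the earlier example.
toon_table = (
    "products[3]{id,name,price}:\n"
    "  101,Laptop Pro,1299\n"
    "  102,Magic Mouse,79\n"
    "  103,USB-C Cable,19"
)

# Briefly explaining the format in the prompt helps the model read it.
prompt = (
    "Below is a product catalog in TOON format "
    "(the header lists the columns; each line after it is one record).\n\n"
    f"{toon_table}\n\n"
    "List the names of all products priced under $100."
)
print(prompt)
```

The compact table leaves more of the context window for instructions, examples, and the model's reasoning.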
Technical Specifications
Toonify supports all essential data types:
- Strings: Quoted only when necessary (containing delimiters or special characters, leading/trailing whitespace, or values that would otherwise parse as literals)
- Numbers: Integers and floats preserved exactly
- Booleans: true and false values preserved
- Null: Explicit null representation
- Arrays: Both uniform (tabular) and heterogeneous (list) formats
- Objects: Nested structures with optional key folding
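For instance, a quoting rule along these lines keeps plain strings bare while protecting ambiguous ones (an assumed heuristic for illustration; the library's exact rule may differ):

```python
import re

def needs_quoting(value: str, delimiter: str = ",") -> bool:
    """Assumed quoting heuristic, not Toonify's exact rule."""
    if value != value.strip():          # leading/trailing whitespace
        return True
    if delimiter in value or any(c in value for c in ':"\n'):
        return True
    if value in ("true", "false", "null"):  # would parse as literals
        return True
    return bool(re.fullmatch(r"-?\d+(\.\d+)?", value))  # looks numeric

print(needs_quoting("Laptop Pro"))   # False: internal spaces are fine
print(needs_quoting("USB,C Cable"))  # True: contains the delimiter
print(needs_quoting("true"))         # True: would parse as a boolean
```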
The encoding function accepts several options:
toon = encode(data, {
'delimiter': 'tab', # comma (default), tab, or pipe
'indent': 4, # spaces per indentation level
'key_folding': 'safe', # off (default) or safe
'flatten_depth': 2 # max depth for key folding
})
The decoding function similarly supports options:
data = decode(toon_string, {
'strict': True, # strict validation (default)
'expand_paths': 'safe', # off (default) or safe
'default_delimiter': ',' # fallback delimiter
})
Performance Characteristics
Toonify maintains excellent performance while delivering significant space savings:
- Size Reduction: 30-60% smaller than JSON for structured data, up to 70% for tabular data
- Token Reduction: 40-70% fewer tokens for typical datasets
- Processing Speed: Encoding and decoding typically complete in under 1ms for standard payloads
- No External Dependencies: The library is lightweight and self-contained
When to Use Toonify vs. JSON
Use Toonify when:
- You're sending data to LLM APIs and paying per token
- Working with uniform, tabular-like data structures
- Context window space is constrained
- Human readability and debuggability matter
Stick with JSON when:
- Maximum compatibility with existing tools is critical
- Your data is highly irregular or deeply nested in non-uniform ways
- Working within established JSON-only infrastructure
Getting Started with the Community
Toonify is open source under the MIT license, available on GitHub at ScrapeGraphAI/toonify and on PyPI. For comprehensive documentation, examples, and API reference, visit the official Toonify documentation. The project welcomes contributions, and the development setup is straightforward:
git clone https://github.com/ScrapeGraphAI/toonify.git
cd toonify
pip install -e ".[dev]"
pytest
The format itself is actually a community effort—the Toonify Python implementation is inspired by and compatible with the TypeScript TOON library maintained at toon-format/toon.
Conclusion
Toonify represents a practical solution to a real problem in modern AI applications. As LLM usage scales and costs become increasingly important, optimizing every aspect of data transmission matters. Whether you're building production systems that process thousands of API requests daily or engineering prompts with complex data structures, Toonify provides immediate, measurable benefits without requiring architectural changes.
By reducing token consumption by 30-60% while maintaining data integrity and human readability, Toonify lets you do more with less—which in the world of LLM APIs translates directly to cost savings, faster inference, and better performance. If you're serious about optimizing your LLM workflows and looking for the best AI tools for data extraction, Toonify deserves a place in your toolkit.
Start with a small experiment: convert one of your JSON datasets to TOON format and measure the difference in token consumption. We think you'll be pleasantly surprised.
