Today we're releasing ScrapeGraphAI 100k — a dataset of ~100,000 real-world structured extraction examples derived from 9 million PostHog events collected from the ScrapeGraphAI open-source library during Q2-Q3 2025.
The dataset is on Hugging Face 🤗
TL;DR + What's in Each Row
So our open-source library uses LLMs to extract structured data from web pages. In a nutshell: webpage markdown + user-defined JSON schema in -> JSON out from the LLM.
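To make that concrete, here's a made-up example of the contract (these values are illustrative, not an actual row from the dataset):

```python
# Hypothetical input/output pair, just to illustrate the flow.
user_prompt = "Extract the product name and price from this page"

json_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
    },
    "required": ["name", "price"],
}

# Given the page markdown plus the prompt and schema above, the LLM is expected
# to return JSON that conforms to the schema, e.g.:
llm_output = {"name": "Acme Anvil", "price": 19.99}
```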
We've been collecting anonymized telemetry (with user consent) to understand how people use the library. Learn more about how we use PostHog for analytics. We had around 9 million events.
This isn't synthetic data. These are real prompts, real schemas, real web content, and real LLM responses from real usage.
From 9M Events to 100k Examples
We started from 9M events collected from PostHog, then cleaned them in two steps:
Step 1: Extract and Compute Metrics
Each PostHog event has the prompt, schema, response, source content, model used, and execution time nested inside other values. We pulled these fields out and computed some extra metrics for each schema:
- Schema depth: How deep the nesting goes
- Schema keys: How many fields to extract
- Schema elements: Total structural pieces
- Cyclomatic complexity: Branching paths from oneOf, anyOf, etc.
- Complexity score: Weighted combo of all the above
These metrics should help us understand how "hard" a schema is; they come from SLOT: Structuring the Output of Large Language Models (Wang et al., 2025). For example, the more nested a schema is, the harder it should be, at least in theory. For more on working with structured output and schemas, check out our guide. Spoiler: we'll find out later that all these metrics are highly correlated and mostly useless; we could have just counted the number of keys, tbh.
We also validated each response against its schema using jsonschema-rs to get `response_is_valid`. This tells you when the LLM messed up; average validity is 93%. Almost 90% of the time, people used gpt-4o-mini.
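For illustration, here's roughly what that check looks like. I'm using the pure-Python jsonschema package here as a stand-in for jsonschema-rs, but the idea is the same:

```python
import json

from jsonschema import Draft202012Validator


def is_response_valid(schema: dict, response_text: str) -> bool:
    """True if the LLM response parses as JSON and conforms to the schema."""
    try:
        instance = json.loads(response_text)
    except json.JSONDecodeError:
        return False
    return Draft202012Validator(schema).is_valid(instance)
```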
Step 2: Balance by Schema
Some schemas showed up hundreds of thousands of times, while others appeared only once, making the dataset heavily unbalanced. For example, two Chinese schemas accounted for 5M of the 9M events we had 😅.
The fix was simple: for each unique schema hash, keep max 5 randomly selected examples from different source URLs. This makes the dataset more balanced across different extraction tasks.
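A minimal sketch of that balancing step (simplified, not the exact pipeline code; the field names match the dataset columns):

```python
import random
from collections import defaultdict

MAX_PER_SCHEMA = 5


def balance(examples: list[dict], seed: int = 42) -> list[dict]:
    """Keep at most MAX_PER_SCHEMA examples per schema hash, each from a different source URL."""
    rng = random.Random(seed)

    # Group examples by their schema hash.
    by_schema = defaultdict(list)
    for ex in examples:
        by_schema[ex["schema_hash"]].append(ex)

    kept = []
    for group in by_schema.values():
        rng.shuffle(group)
        seen_sources, picked = set(), []
        for ex in group:
            if ex["source"] in seen_sources:
                continue  # prefer distinct source URLs
            seen_sources.add(ex["source"])
            picked.append(ex)
            if len(picked) == MAX_PER_SCHEMA:
                break
        kept.extend(picked)
    return kept
```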
Result: from 9M down to ~100k balanced examples across the full spectrum of extraction tasks we had, which is exactly what we wanted.
Dataset Statistics
We computed some stats to understand what is going on inside.
Schema Complexity Distribution

depth is computed as the maximum nesting level of the schema; keys is the total number of fields the schema declares; schema size is the easy one, `size = len(json.dumps(schema).encode("utf-8"))`; and complexity is the weighted combination of all of the above. A minimal sketch of the depth and keys computation is shown below.
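Here's an illustrative version of the depth and keys metrics (a sketch of the definitions above, not necessarily the exact code behind the dataset; branching keywords like oneOf/anyOf are ignored for brevity):

```python
import json


def schema_depth(node, level: int = 1) -> int:
    """Maximum nesting level of a JSON schema (nested objects/arrays increase depth)."""
    if not isinstance(node, dict):
        return level
    levels = [level]
    for prop in node.get("properties", {}).values():
        levels.append(schema_depth(prop, level + 1))
    if "items" in node:
        levels.append(schema_depth(node["items"], level + 1))
    return max(levels)


def schema_keys(node) -> int:
    """Total number of declared properties across the whole schema."""
    if not isinstance(node, dict):
        return 0
    count = len(node.get("properties", {}))
    for prop in node.get("properties", {}).values():
        count += schema_keys(prop)
    if "items" in node:
        count += schema_keys(node["items"])
    return count


def schema_size(schema: dict) -> int:
    """Serialized schema size in bytes."""
    return len(json.dumps(schema).encode("utf-8"))
```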
Most schemas are simple: 2-4 levels deep, 5-15 keys. But there's a long tail of gnarly schemas with deep nesting and dozens of fields. Surprisingly enough, most of them cluster together, which creates the weird-looking shape in the plot.
Response Size Distribution

Response sizes are all over the place. Simple price extraction? 50 bytes. Full article with metadata? 50KB. This diversity reflects the wide range of use cases, from price monitoring to real estate scraping to market research.
Percentile Analysis



90% of schemas have <20 keys and depth <5. The other 10% is where LLMs start sweating.
Validation Rate vs. Schema Complexity
The interesting part. Does schema complexity affect whether the LLM produces valid output?




The drop-off isn't linear: there are thresholds where LLMs start failing hard. To be honest, we can't really tell from this alone. My gut feeling is that validity also depends on the difficulty of the source content; we don't have a metric for that one (yet).
Correlation Matrix

Schema metrics correlate with each other (complex in one dimension = complex in others). Execution time correlates with response size. No surprises.
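If you want to poke at these correlations yourself, something like this should work (the numeric column names here are my guess; check the dataset card for the exact schema_* columns):

```python
from datasets import load_dataset

ds = load_dataset("scrapegraphai/scrapegraphai-100k", split="train")
df = ds.to_pandas()  # requires pandas

# Assumed column names for the numeric fields; adjust to the actual ones.
numeric_cols = ["execution_time", "response_size", "schema_depth", "schema_keys"]
print(df[numeric_cols].corr())
```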
What's In Each Row
| Field | Description |
|---|---|
| `prompt` | Full prompt sent to the LLM |
| `schema` | JSON schema defining expected output |
| `schema_hash` | SHA256 for deduplication |
| `response` | What the LLM actually returned |
| `content` | Source web content |
| `llm_model` | Which model was used |
| `source` | Source URL |
| `execution_time` | How long the extraction took |
| `response_size` | Response size in bytes |
| `schema_*` | Complexity metrics |
| `response_is_valid` | Did the response match the schema? |
response_is_valid is clutch. Filter for successes or failures depending on what you need.
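For example, with the 🤗 datasets library (assuming the default train split):

```python
from datasets import load_dataset

ds = load_dataset("scrapegraphai/scrapegraphai-100k", split="train")

# Keep only rows where the LLM output actually conformed to the schema.
valid_only = ds.filter(lambda row: row["response_is_valid"])

# Or keep the failures, if you want to study where models break.
failures = ds.filter(lambda row: not row["response_is_valid"])
```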
Limitations
Real-world data = messy data:
- Some responses are truncated or malformed
- Web content may reference external resources not included
- Validation is syntactic only — semantically wrong but valid JSON passes
- This reflects ScrapeGraphAI usage patterns, not general LLM usage
Get the Dataset
The dataset is on Hugging Face 🤗
```python
from datasets import load_dataset

dataset = load_dataset("scrapegraphai/scrapegraphai-100k")
```

Citation
```bibtex
@dataset{scrapegraphai_2025_100k,
  title  = {ScrapeGraphAI 100k Structured Output Dataset},
  author = {Francesco Zuppichini},
  year   = {2025},
  email  = {francesco@scrapegraphai.com},
  url    = {https://huggingface.co/datasets/scrapegraphai/scrapegraphai-100k}
}
```

Big thanks to William Brach for the help and feedback on this blog post 🫶
Built by the ScrapeGraphAI team. Questions? francesco@scrapegraphai.com
Related Resources
Want to learn more about structured data extraction and AI-powered web scraping?
- Prompt Engineering Guide - Master the art of writing effective prompts for data extraction
- Structured Output Guide - Learn how to use schemas for consistent data extraction
- Mastering ScrapeGraphAI - Complete guide to using ScrapeGraphAI endpoints
- AI Agent Web Scraping - Discover how AI agents revolutionize data collection
- PostHog Analytics - Learn how we use PostHog for product analytics
- Production Web Scraping Best Practices - Scale your scraping operations
