Definition
Structured data is information that conforms to a predefined schema or organizational model. Unlike unstructured data (free-form text, images, raw HTML), structured data has consistent fields, types, and relationships that make it directly processable by software. Common structured data formats include JSON, CSV, XML, and relational database tables.
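As a small illustration of "consistent fields and types", the sketch below reads the same hypothetical product record from JSON and from CSV using only the Python standard library. The record itself is made up for the example.

```python
import csv
import io
import json

# The same hypothetical product record in two structured formats.
json_text = '{"name": "Wireless Headphones", "price": 79.99}'
csv_text = "name,price\nWireless Headphones,79.99\n"

from_json = json.loads(json_text)
from_csv = next(csv.DictReader(io.StringIO(csv_text)))

# JSON carries basic types; CSV values arrive as strings and need converting.
price_from_json = from_json["price"]
price_from_csv = float(from_csv["price"])

print(from_json["name"], price_from_json)
print(from_csv["name"], price_from_csv)
```

In both cases the fields are addressed by name rather than inferred from layout, which is what makes the data directly processable.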
Structured Data on the Web
Many websites embed structured data using standardized vocabularies to help search engines understand their content. The most common formats are:
JSON-LD
Google's recommended format, JSON-LD (JavaScript Object Notation for Linked Data), embeds structured data in a <script type="application/ld+json"> tag, separate from the visible markup. A product page might include:
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Wireless Headphones",
  "offers": {
    "@type": "Offer",
    "price": "79.99",
    "availability": "https://schema.org/InStock"
  }
}
Microdata
Microdata uses HTML attributes (itemscope, itemtype, itemprop) to annotate existing markup with semantic meaning. Because the annotations sit on the visible elements themselves, it is more tightly coupled to the page's HTML than JSON-LD.
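To make the attribute-based approach concrete, here is a minimal sketch that collects itemprop values with Python's built-in html.parser. The sample markup and the MicrodataParser class are hypothetical, and the parser deliberately flattens nested items into one dictionary, which a real Microdata processor would not do.

```python
from html.parser import HTMLParser


class MicrodataParser(HTMLParser):
    """Collects itemprop values from a content attribute or element text."""

    def __init__(self):
        super().__init__()
        self.props = {}
        self._current = None  # itemprop name still waiting for its text

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("itemprop")
        # Elements that open a nested item (itemscope) have no scalar value.
        if name is None or "itemscope" in attrs:
            return
        if "content" in attrs:
            self.props[name] = attrs["content"]
        else:
            self._current = name

    def handle_data(self, data):
        if self._current and data.strip():
            self.props[self._current] = data.strip()
            self._current = None


# Hypothetical product markup annotated with Microdata.
html = """
<div itemscope itemtype="https://schema.org/Product">
  <span itemprop="name">Wireless Headphones</span>
  <div itemprop="offers" itemscope itemtype="https://schema.org/Offer">
    <meta itemprop="price" content="79.99">
  </div>
</div>
"""

parser = MicrodataParser()
parser.feed(html)
props = parser.props
print(props)
```

Note how the annotations ride on the same elements that render the page, so a layout change can also change or remove the structured data.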
RDFa
RDFa serves a similar purpose to Microdata but uses a different set of HTML attributes (vocab, typeof, property). It is less common in modern web development.
Why Structured Data Matters for Extraction
When structured data is present, extraction becomes dramatically simpler. Instead of parsing visual layouts and inferring meaning from context, you can directly access machine-readable fields with defined types and relationships.
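A minimal sketch of that direct access, assuming the page embeds JSON-LD in a script tag of type application/ld+json. It uses only the standard library; the JsonLdParser class and the sample page are hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collects the parsed contents of every JSON-LD script block."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_jsonld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks rather than fail the whole page


# Hypothetical page containing one JSON-LD block.
html = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product",
 "name": "Wireless Headphones",
 "offers": {"@type": "Offer", "price": "79.99"}}
</script>
</head><body><h1>Wireless Headphones</h1></body></html>
"""

parser = JsonLdParser()
parser.feed(html)
product = parser.items[0]
price = float(product["offers"]["price"])
print(product["name"], price)
```

No layout parsing is involved: the name and price come straight from named fields, regardless of how the page is styled.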
However, structured data coverage is inconsistent. Not every site implements it, and where it is present it may be incomplete or out of date relative to the visible page content. Relying solely on embedded structured data means missing everything on pages that do not provide it.
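One way to act on this is to check embedded data for gaps before trusting it. The sketch below is illustrative only: the REQUIRED field list and the missing_fields helper are assumptions, not part of any standard.

```python
def missing_fields(item, required):
    """Return the required field names that are absent or empty in item."""
    return [f for f in required if item.get(f) in (None, "", [], {})]


# Hypothetical fields an extraction job needs.
REQUIRED = ["name", "price", "availability"]

# Hypothetical embedded data that omits availability.
embedded = {"name": "Wireless Headphones", "price": "79.99"}

gaps = missing_fields(embedded, REQUIRED)
needs_fallback = bool(gaps)  # True means: also extract from the page content
print(gaps)  # ['availability']
```

A check like this only detects missing fields. It cannot tell whether a present value is stale compared with what the page displays.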
From Unstructured to Structured
The core challenge in data extraction is converting unstructured web content (HTML, rendered text) into structured output. This transformation traditionally required custom parsers for each site layout.
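The per-site parser problem can be sketched as follows. Both page snippets and both parser functions are hypothetical, and regular expressions are used here only to keep the example short; they are a fragile way to read HTML.

```python
import re

# Two hypothetical sites presenting the same facts with different markup.
SITE_A = '<h1 class="title">Wireless Headphones</h1><span class="cost">$79.99</span>'
SITE_B = '<div id="product-name">Wireless Headphones</div><p>Price: 79.99 USD</p>'


def parse_site_a(html):
    """Parser tied to site A's class names and price format."""
    name = re.search(r'<h1 class="title">(.*?)</h1>', html).group(1)
    price = re.search(r'<span class="cost">\$([\d.]+)</span>', html).group(1)
    return {"name": name, "price": float(price)}


def parse_site_b(html):
    """Parser tied to site B's ids and price wording."""
    name = re.search(r'<div id="product-name">(.*?)</div>', html).group(1)
    price = re.search(r"Price: ([\d.]+) USD", html).group(1)
    return {"name": name, "price": float(price)}


record_a = parse_site_a(SITE_A)
record_b = parse_site_b(SITE_B)
print(record_a == record_b)
```

Both parsers yield the same structured record, but neither works on the other site's markup, and either breaks as soon as its site changes a class name or price format.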
Structured Data and ScrapeGraphAI
ScrapeGraphAI is designed to produce structured output from web pages. Its AI-powered extraction transforms unstructured page content into JSON that conforms to a schema you specify, whether or not the source page has embedded structured data. You define the output structure you need, and the system maps page content to it.
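To show what "conforming to your specified schema" can mean in practice, here is a standard-library sketch that defines a target structure and coerces a raw result into it. It does not call ScrapeGraphAI or reproduce its API; the Product schema, the to_product helper, and the raw_output dictionary are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Product:
    """Hypothetical output schema chosen by the user."""
    name: str
    price: float
    in_stock: bool


def to_product(raw):
    """Coerce a raw dict into Product; raises KeyError if a field is missing."""
    return Product(
        name=str(raw["name"]),
        price=float(raw["price"]),
        in_stock=bool(raw["in_stock"]),
    )


# Hypothetical raw result, standing in for whatever an extractor returns.
raw_output = {"name": "Wireless Headphones", "price": "79.99", "in_stock": True}

product = to_product(raw_output)
print(product)
```

Declaring the schema up front means downstream code can rely on field names and types no matter which page the data came from.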