What is Multimodal Extraction?

Definition

Multimodal extraction uses AI models capable of processing multiple types of input — text, images, visual layout, and document structure — to extract data from web pages and documents. Rather than working solely with HTML text, multimodal systems "see" pages as a human would, understanding the spatial relationships between elements, reading text within images, and interpreting visual cues like charts and diagrams.

Why Multimodal Matters

Text in Images

Product images often contain critical information: size specifications, ingredient lists, care instructions, or promotional badges like "20% OFF". Text-only extraction misses all of this.

Visual Layout Semantics

The visual position and styling of elements carries meaning. A large, bold price is the selling price; a smaller, struck-through number is the original price. This distinction lives in the visual presentation, not the HTML semantics.

Charts and Infographics

Data presented in charts, graphs, and infographics cannot be extracted through text parsing. Multimodal models can interpret these visual representations and convert them to structured data.

Non-HTML Documents

PDFs, scanned documents, and images of documents require visual understanding to extract their content. Multimodal extraction handles these without separate OCR pipelines.

How Multimodal Extraction Works

Modern multimodal LLMs accept both text and image inputs. For web extraction, the system can:

Render the page as a screenshot alongside its HTML
Feed both modalities to the model simultaneously
Leverage visual context to disambiguate text-based extraction

This dual-input approach resolves ambiguities that pure text extraction cannot — distinguishing between visually prominent and supplementary content, understanding tabular layouts rendered with CSS, and reading text embedded in images.

Multimodal Extraction in ScrapeGraphAI

ScrapeGraphAI supports multimodal extraction capabilities, combining text and visual understanding to extract data from complex pages. This is particularly valuable for pages where important information exists in images, visual layouts, or non-standard HTML structures.