Definition
Multimodal extraction uses AI models capable of processing multiple types of input — text, images, visual layout, and document structure — to extract data from web pages and documents. Rather than working solely with HTML text, multimodal systems "see" pages as a human would, understanding the spatial relationships between elements, reading text within images, and interpreting visual cues like charts and diagrams.
Why Multimodal Matters
Text in Images
Product images often contain critical information: size specifications, ingredient lists, care instructions, or promotional badges like "20% OFF". Text-only extraction misses all of this.
Visual Layout Semantics
The visual position and styling of elements carries meaning. A large, bold price is the selling price; a smaller, struck-through number is the original price. This distinction lives in the visual presentation, not the HTML semantics.
Charts and Infographics
Data presented in charts, graphs, and infographics cannot be extracted through text parsing. Multimodal models can interpret these visual representations and convert them to structured data.
Non-HTML Documents
PDFs, scanned documents, and images of documents require visual understanding to extract their content. Multimodal extraction handles these without separate OCR pipelines.
How Multimodal Extraction Works
Modern multimodal LLMs accept both text and image inputs. For web extraction, the system can:
- Render the page as a screenshot alongside its HTML
- Feed both modalities to the model simultaneously
- Leverage visual context to disambiguate text-based extraction
This dual-input approach resolves ambiguities that pure text extraction cannot — distinguishing between visually prominent and supplementary content, understanding tabular layouts rendered with CSS, and reading text embedded in images.
Multimodal Extraction in ScrapeGraphAI
ScrapeGraphAI supports multimodal extraction capabilities, combining text and visual understanding to extract data from complex pages. This is particularly valuable for pages where important information exists in images, visual layouts, or non-standard HTML structures.