Definition
Table extraction is the process of identifying tabular data on web pages or in documents and converting it into structured formats like JSON, CSV, or database rows. Tables are one of the most information-dense elements on the web, containing pricing data, specifications, statistics, schedules, and comparisons in a compact, organized layout.
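As a rough illustration of the "structured formats" step, the sketch below assumes the header and rows have already been pulled out of a table and serializes them as JSON and CSV using only the standard library. The sample values and variable names are made up for the example.

```python
import csv
import io
import json

# Hypothetical output of an earlier extraction step.
header = ["Plan", "Price"]
rows = [["Basic", "$10"], ["Pro", "$25"]]

# JSON: one object per row, keyed by column name.
records = [dict(zip(header, row)) for row in rows]
as_json = json.dumps(records, indent=2)

# CSV: header line followed by one line per row.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(header)
writer.writerows(rows)
as_csv = buffer.getvalue()
```

The same list of records could be passed to a database driver's bulk-insert call instead of being serialized.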
Table Formats on the Web
HTML Tables
Traditional HTML tables use <table> elements with <thead>, <tbody>, <tr>, <th>, and <td> tags. These are the most straightforward to parse because the structure is explicitly defined in the markup.
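A minimal sketch of parsing such markup with Python's built-in html.parser module follows. The class name TableParser and the sample HTML are illustrative assumptions, and the sketch ignores merged cells and nested tables.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, in document order
        self._row = None    # row being built, or None outside a <tr>
        self._cell = None   # text fragments of the open cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


SAMPLE_HTML = """
<table>
  <thead><tr><th>Plan</th><th>Price</th></tr></thead>
  <tbody>
    <tr><td>Basic</td><td>$10</td></tr>
    <tr><td>Pro</td><td>$25</td></tr>
  </tbody>
</table>
"""

parser = TableParser()
parser.feed(SAMPLE_HTML)
header, *body = parser.rows
```

Here the first row is treated as the header because the sample uses <th> cells; real pages may need a more careful check.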
CSS Grid/Flexbox Tables
Many modern sites use <div> elements styled with CSS Grid or Flexbox to create table-like layouts. These lack semantic table markup, making extraction harder — the structure exists visually but not in the HTML.
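When a div-based layout at least marks its rows and cells with recognizable class names, a parser can key on those classes instead of on table tags. The sketch below assumes hypothetical class names row and cell; real sites use their own names, and some Grid layouts have no row wrapper at all, in which case cells would need to be chunked by a known column count.

```python
from html.parser import HTMLParser


class DivGridParser(HTMLParser):
    """Rebuilds rows from <div> layouts tagged with row/cell classes.

    The class names are assumptions for this sketch, not a standard.
    """

    def __init__(self, row_class="row", cell_class="cell"):
        super().__init__()
        self.row_class = row_class
        self.cell_class = cell_class
        self.rows = []
        self._stack = []    # role of each open <div>: "row", "cell", or None
        self._cell = None   # text fragments of the open cell, or None

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        role = None
        if self.row_class in classes:
            role = "row"
            self.rows.append([])
        elif self.cell_class in classes and "row" in self._stack:
            role = "cell"
            self._cell = []
        self._stack.append(role)

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag != "div" or not self._stack:
            return
        # Closing a cell <div> finalizes its text into the current row.
        if self._stack.pop() == "cell":
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None


GRID_HTML = """
<div class="grid">
  <div class="row header"><div class="cell">Model</div><div class="cell">RAM</div></div>
  <div class="row"><div class="cell">A1</div><div class="cell"><span>8</span> GB</div></div>
</div>
"""

grid_parser = DivGridParser()
grid_parser.feed(GRID_HTML)
```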
Rendered Tables in Images or PDFs
Tables in PDFs, screenshots, or embedded images require computer vision or OCR to detect row/column boundaries and extract cell content.
Extraction Challenges
Merged Cells
Cells spanning multiple rows (rowspan) or columns (colspan) complicate the mapping from HTML structure to rectangular data.
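One common way to handle spans is to copy a merged cell's value into every grid position it covers. The sketch below assumes each cell has already been read as a (text, rowspan, colspan) tuple, which is a hypothetical intermediate format chosen for this example.

```python
def expand_spans(rows):
    """Turn rows of (text, rowspan, colspan) cells into a rectangular grid.

    A spanned value is repeated in every position it covers; positions
    that no cell covers are filled with an empty string.
    """
    grid = {}  # (row, col) -> text
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip positions already claimed by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


# "Region" spans two rows; "Sales" spans two columns.
spanned = [
    [("Region", 2, 1), ("Sales", 1, 2)],
    [("Q1", 1, 1), ("Q2", 1, 1)],
    [("EU", 1, 1), ("10", 1, 1), ("12", 1, 1)],
]
flat = expand_spans(spanned)
```

Repeating the value keeps every row the same length, which is usually what CSV or DataFrame consumers expect; leaving the covered positions empty is an equally valid choice when duplicates would be misleading.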
Nested Tables
Tables within tables create hierarchical structures that need careful handling to produce flat output.
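One possible approach, sketched below, is to store every table separately and leave a placeholder such as [table 1] in the parent cell, so each table stays flat while the relationship is preserved. The class name NestedTableParser and the placeholder format are assumptions made for this example.

```python
from html.parser import HTMLParser


class NestedTableParser(HTMLParser):
    """Stores each table as its own list of rows, even when nested."""

    def __init__(self):
        super().__init__()
        self.tables = []   # every table found, in document order
        self._stack = []   # indices of open tables, outermost first
        self._cells = []   # per open table: fragments of its open cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self._stack and self._cells[-1] is not None:
                # Leave a reference to the inner table in the parent cell.
                self._cells[-1].append(f"[table {len(self.tables)}]")
            self.tables.append([])
            self._stack.append(len(self.tables) - 1)
            self._cells.append(None)
        elif not self._stack:
            return
        elif tag == "tr":
            self.tables[self._stack[-1]].append([])
        elif tag in ("td", "th"):
            self._cells[-1] = []

    def handle_data(self, data):
        if self._stack and self._cells[-1] is not None:
            self._cells[-1].append(data)

    def handle_endtag(self, tag):
        if not self._stack:
            return
        if tag in ("td", "th") and self._cells[-1] is not None:
            rows = self.tables[self._stack[-1]]
            if rows:
                # Collapse runs of whitespace left by the markup's indentation.
                rows[-1].append(" ".join("".join(self._cells[-1]).split()))
            self._cells[-1] = None
        elif tag == "table":
            self._stack.pop()
            self._cells.pop()


NESTED_HTML = """
<table>
  <tr><td>Laptop</td><td>Specs:
    <table>
      <tr><td>CPU</td><td>8 cores</td></tr>
      <tr><td>RAM</td><td>16 GB</td></tr>
    </table>
  </td></tr>
</table>
"""

nested = NestedTableParser()
nested.feed(NESTED_HTML)
```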
Missing Headers
Some tables lack explicit header rows, requiring inference about what each column represents.
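A simple heuristic for this inference, sketched below, treats the first row as a header only when all of its cells are non-numeric text and the remaining rows contain at least one number. Otherwise it falls back to placeholder names. The function names and the column_N naming are assumptions for this example, and the rule will misjudge tables that hold only text.

```python
def looks_numeric(text):
    """True for values such as '12', '1,200', '$10', or '45%'."""
    cleaned = text.replace(",", "").strip().lstrip("$€£").rstrip("%")
    try:
        float(cleaned)
        return True
    except ValueError:
        return False


def split_header(rows):
    """Return (header, body), inventing column names when none are found."""
    if not rows:
        return [], []
    first, rest = rows[0], rows[1:]
    first_is_text = all(cell and not looks_numeric(cell) for cell in first)
    body_has_numbers = any(looks_numeric(cell) for row in rest for cell in row)
    if first_is_text and body_has_numbers:
        return list(first), rest
    # No convincing header row: keep every row as data.
    placeholders = [f"column_{i + 1}" for i in range(len(first))]
    return placeholders, rows


with_header = split_header([["Plan", "Price"], ["Basic", "$10"]])
without_header = split_header([["Basic", "$10"], ["Pro", "$25"]])
```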
Responsive Tables
Tables that transform their layout at different screen widths may present differently to a headless browser depending on viewport configuration.
Table Extraction Approaches
- DOM parsing: traverse <table> elements and extract cell content by row and column position
- Pandas read_html: automatically finds and parses HTML tables into DataFrames
- AI-based detection: use models to identify table boundaries in non-semantic layouts or documents
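For the pandas route, a short sketch is shown below. It assumes pandas is installed together with a parser backend such as lxml (or BeautifulSoup4 with html5lib); the sample HTML is made up for the example.

```python
from io import StringIO

import pandas as pd

HTML = """
<table>
  <thead><tr><th>Plan</th><th>Seats</th></tr></thead>
  <tbody>
    <tr><td>Basic</td><td>1</td></tr>
    <tr><td>Pro</td><td>5</td></tr>
  </tbody>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(StringIO(HTML))
df = tables[0]

# A DataFrame converts directly to the structured formats mentioned earlier.
as_records = df.to_dict(orient="records")
```

Because read_html only looks for <table> markup, it does not help with div-based layouts or tables embedded in images.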
Table Extraction with ScrapeGraphAI
ScrapeGraphAI handles table extraction through its AI-powered understanding of page content. It identifies tabular data regardless of whether it uses semantic HTML table tags or CSS-based layouts, extracting the content into structured JSON that maps cleanly to your defined schema.