Definition
Metadata extraction is the process of collecting descriptive information about a web page from its HTML metadata elements. This includes title tags, meta descriptions, Open Graph tags, Twitter Card data, canonical URLs, author information, publication dates, and other structured annotations that describe the page's content without being part of the visible body text.
Types of Web Page Metadata
HTML Meta Tags
Standard meta tags in the <head> section provide basic page information:
<title>: the page title displayed in browser tabs and search results
<meta name="description">: a summary used by search engines
<meta name="author">: the content author
<meta name="keywords">: topic keywords (largely deprecated for SEO)
<link rel="canonical">: the preferred URL for the page
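As a rough illustration of how these tags can be read, the sketch below uses Python's standard-library html.parser module. The class name HeadMetadataParser and the function extract_head_metadata are hypothetical names chosen for this example, and the sketch keeps only the first occurrence of each tag.

```python
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Collects <title>, selected <meta name=...> tags, and the canonical link."""

    WANTED_META = {"description", "author", "keywords"}

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            content = attrs.get("content")
            if name in self.WANTED_META and content is not None:
                # Keep the first occurrence only (an assumption of this sketch).
                self.metadata.setdefault(name, content.strip())
        elif tag == "link":
            rel_values = (attrs.get("rel") or "").lower().split()
            if "canonical" in rel_values and attrs.get("href"):
                self.metadata.setdefault("canonical", attrs["href"])

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self.metadata.setdefault("title", "".join(self._title_parts).strip())

    def handle_data(self, data):
        # Title text may arrive in several chunks, so buffer it.
        if self._in_title:
            self._title_parts.append(data)


def extract_head_metadata(html):
    """Return a dict with whichever of the standard tags were found."""
    parser = HeadMetadataParser()
    parser.feed(html)
    parser.close()
    return parser.metadata
```

Calling extract_head_metadata on a page's HTML string returns a dictionary with keys such as "title", "description", and "canonical" for the tags that were present; missing tags are simply absent from the result.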
Open Graph Protocol
Facebook's Open Graph tags control how pages appear when shared on social platforms:
og:title: the shared title
og:description: the shared description
og:image: the preview image URL
og:type: content type (article, product, video)
og:url: the canonical sharing URL
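Open Graph tags are written as meta elements with a property attribute, so they can be collected in much the same way as standard meta tags. The following is a minimal sketch using only the standard library; OpenGraphParser and extract_open_graph are hypothetical names, and storing each value in a list is a design choice made here because a property such as og:image may appear more than once.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collects every <meta property="og:..."> tag found in the document."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        # Open Graph uses "property"; some pages use "name" instead,
        # so this sketch accepts either.
        key = (attrs.get("property") or attrs.get("name") or "").lower()
        content = attrs.get("content")
        if key.startswith("og:") and content:
            self.tags.setdefault(key, []).append(content)


def extract_open_graph(html):
    """Return a dict mapping each og: property to a list of its values."""
    parser = OpenGraphParser()
    parser.feed(html)
    parser.close()
    return parser.tags
```

A link preview generator would typically take the first value of og:title, og:description, and og:image from the returned dictionary.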
Twitter Cards
Twitter Cards work like Open Graph but are specific to Twitter/X. The twitter:card, twitter:title, twitter:description, and twitter:image tags control how links render in tweets.
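Pages often define Open Graph tags without repeating them as Twitter Card tags, so a consumer of this metadata usually needs a fallback rule. The sketch below assumes the tags have already been collected into a flat dictionary of lowercase keys and prefers the twitter: value when present, falling back to the og: equivalent otherwise. The function name resolve_twitter_card and the input shape are assumptions made for illustration.

```python
def resolve_twitter_card(meta):
    """Build card fields from a flat dict of meta tags.

    `meta` maps lowercase tag names (e.g. "twitter:title", "og:title")
    to their content strings. This input shape is an assumption of the sketch.
    """
    card = {}
    if meta.get("twitter:card"):
        card["card"] = meta["twitter:card"]
    for field in ("title", "description", "image"):
        # Prefer the Twitter-specific tag, then fall back to Open Graph.
        value = meta.get(f"twitter:{field}") or meta.get(f"og:{field}")
        if value:
            card[field] = value
    return card
```

Fields with neither a twitter: nor an og: value are left out of the result, which lets the caller decide how to handle an incomplete card.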
Schema.org Metadata
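Schema.org metadata is commonly embedded as JSON-LD blocks, although Microdata and RDFa are also used. These annotations provide rich structured data about the page content: article publication dates, product prices, business locations, event details, and more.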
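JSON-LD blocks live inside script elements with the type application/ld+json, so extracting them amounts to finding those elements and decoding their contents as JSON. The following is a minimal sketch with the standard library; JsonLdParser and extract_json_ld are hypothetical names, and silently skipping malformed blocks is a choice made for this example rather than a required behavior.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collects the decoded contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._buffer = None  # None means "not inside a JSON-LD script"

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            raw = "".join(self._buffer)
            self._buffer = None
            try:
                parsed = json.loads(raw)
            except json.JSONDecodeError:
                return  # Skip malformed blocks (a choice made for this sketch).
            # A block may hold a single object or a list of objects.
            self.items.extend(parsed if isinstance(parsed, list) else [parsed])


def extract_json_ld(html):
    """Return a list of decoded JSON-LD objects found in the page."""
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    return parser.items
```

Each returned item is a plain dictionary, so a field such as a publication date can be read with item.get("datePublished") once the item's "@type" has been checked.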
Why Metadata Matters
Metadata provides a quick, structured summary of page content without parsing the full body. For applications like search indexing, content aggregation, link previews, and competitive analysis, metadata extraction delivers high-value data with minimal processing.
Metadata Extraction in ScrapeGraphAI
ScrapeGraphAI can extract page metadata as part of its scraping output. Whether you need Open Graph tags for link preview generation, publication dates for content tracking, or canonical URLs for deduplication, the platform pulls metadata alongside body content in a single extraction request.
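To make the deduplication use case concrete, the sketch below groups already-extracted pages by their canonical URL. It does not use or represent ScrapeGraphAI's API: the record shape (dictionaries with "url" and an optional "canonical" key) and the function names are assumptions made for illustration, and the normalization is deliberately simple.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def canonical_key(page_url, canonical=None):
    """Return a normalized URL to use as a deduplication key."""
    # Resolve a relative canonical against the page URL; fall back to the page URL.
    target = urljoin(page_url, canonical) if canonical else page_url
    parts = urlsplit(target)
    # Lowercase scheme and host, drop the fragment, keep path and query.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def deduplicate(pages):
    """Keep the first page seen for each canonical key.

    `pages` is an iterable of dicts with a "url" key and an optional
    "canonical" key (a hypothetical record shape for this sketch).
    """
    seen = {}
    for page in pages:
        key = canonical_key(page["url"], page.get("canonical"))
        seen.setdefault(key, page)
    return list(seen.values())
```

With this approach, two URLs that differ only in tracking parameters or fragments collapse into one entry as long as both pages declare the same canonical URL.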