Clinical Data Extraction: Build Production Pipelines for Trial Registries, FAERS, and PubMed
Pharma companies spend 40-60% of their preclinical research time on clinical data extraction — pulling trial outcomes, adverse event signals, and competitive intelligence out of public registries that were never designed for bulk consumption. ClinicalTrials.gov alone holds over 500,000 study records. The FDA Adverse Event Reporting System (FAERS) ingests around 2.3 million reports per year. PubMed indexes 1.5 million new citations annually.
If you're a data engineer at a biotech, a clinical informatician building a research data warehouse, or a pharmacovigilance analyst trying to automate safety signal detection, this guide walks through the architecture, code, and compliance patterns you need.
Where Clinical Data Actually Lives
Clinical data is scattered across dozens of sources, each with different access patterns, update frequencies, and data quality characteristics.
ClinicalTrials.gov — The 800-pound gorilla. The v2 API at clinicaltrials.gov/api/v2/studies returns JSON and supports field-level queries. ~500,000+ registered studies, 3,000-4,000 new registrations per month, 10 requests/second without auth. Several fields visible on the rendered study page aren't fully exposed through the API — this is where web-based extraction fills the gap.
FDA FAERS — The backbone of post-market drug safety surveillance. Contains individual safety reports with drug names, adverse reactions coded in MedDRA, patient demographics, and outcome codes. The openFDA API at api.fda.gov/drug/event.json provides JSON access, though query complexity limits apply.
PubMed — Over 36 million biomedical citations via NCBI E-utilities (ESearch, EFetch, ELink). Rate limits: 3 requests/second without an API key, 10 with one.
| Source | Data Type | Access | Update Cadence |
|---|---|---|---|
| WHO ICTRP | Global trial registry aggregator | Public web | Weekly |
| EMA Clinical Data | EU marketing authorization data | Public web + API | Per-submission |
| DailyMed | FDA-approved drug labeling | API + web | Daily |
| AACT Database | ClinicalTrials.gov mirror in PostgreSQL | Direct download | Daily |
The AACT database from CTTI deserves special mention — it's a fully relational PostgreSQL dump of ClinicalTrials.gov data, updated nightly. If you just need structured trial metadata without real-time freshness, AACT saves you from building your own extraction pipeline entirely.
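To make that concrete, here is a sketch of querying AACT directly (a free account from aact.ctti-clinicaltrials.org supplies the credentials; table and column names follow AACT's published schema, but verify them against the current data dictionary before relying on this):

```python
AACT_DSN = (
    "host=aact-db.ctti-clinicaltrials.org port=5432 dbname=aact "
    "user=YOUR_AACT_USER password=YOUR_AACT_PASSWORD"  # free account required
)

# Table/column names follow AACT's published schema; the status literal
# may be 'Recruiting' or 'RECRUITING' depending on the snapshot vintage,
# so check the data dictionary for yours.
RECRUITING_BY_SPONSOR = """
SELECT s.nct_id, s.brief_title, s.overall_status, sp.name AS lead_sponsor
FROM studies s
JOIN sponsors sp
  ON sp.nct_id = s.nct_id
 AND sp.lead_or_collaborator = 'lead'
WHERE s.overall_status = 'Recruiting'
LIMIT 10;
"""

def fetch_recruiting(dsn: str = AACT_DSN) -> list:
    # Lazy import so the module loads without the driver installed.
    import psycopg2  # pip install psycopg2-binary
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(RECRUITING_BY_SPONSOR)
        return cur.fetchall()
```

Because AACT is plain PostgreSQL, anything that speaks SQL (dbt, Metabase, pandas `read_sql`) works against it with no extraction code at all.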
Extracting Clinical Trial Data from ClinicalTrials.gov
Single Trial Extraction with ScrapeGraphAI
The ClinicalTrials.gov v2 API handles many use cases, but when you need fields only on the rendered page, or when you want to extract data in a specific schema without writing a custom parser, ScrapeGraphAI handles it with a natural language prompt.
```python
from scrapegraph_py import Client

client = Client(api_key="your-scrapegraph-api-key")

response = client.smartscraper(
    website_url="https://clinicaltrials.gov/study/NCT06000696",
    user_prompt="""
    Extract the following from this clinical trial page:
    - NCT number
    - Official study title
    - Brief summary (first 2 sentences only)
    - Study phase
    - Recruitment status
    - Estimated enrollment
    - Study type (interventional or observational)
    - Primary outcome measures with timeframes
    - Key inclusion criteria (up to 5 most important)
    - Key exclusion criteria (up to 5 most important)
    - Sponsor organization
    - Collaborating organizations (if any)
    - Study start date
    - Estimated primary completion date
    - Estimated study completion date
    - Intervention names and types
    - Condition(s) being studied
    Return as structured JSON with snake_case keys.
    """
)

trial = response.get("result")
```

This returns clean JSON you can load directly into a database or feed into a downstream analysis pipeline. The LLM-based extraction handles semantic understanding — it knows what "primary outcome measure" means in context, even if the page layout shifts.
Batch Extraction Across Multiple Trials
Real clinical data mining involves hundreds or thousands of trials:
```python
from scrapegraph_py import Client
import time
import json

client = Client(api_key="your-scrapegraph-api-key")

nct_ids = [
    "NCT06000696",
    "NCT05924516",
    "NCT06172738",
    "NCT05564897",
    "NCT04280705",
    "NCT03857542",
    "NCT05963958",
    "NCT04381936",
]

extraction_prompt = """
Extract: NCT number, official title, phase, recruitment status,
enrollment count, sponsor, conditions, interventions (name + type),
primary outcomes with timeframes, study start date,
estimated completion date. Return as JSON with snake_case keys.
"""

results = []
failed = []

for nct_id in nct_ids:
    try:
        url = f"https://clinicaltrials.gov/study/{nct_id}"
        response = client.smartscraper(
            website_url=url,
            user_prompt=extraction_prompt
        )
        results.append({
            "nct_id": nct_id,
            "data": response.get("result"),
            "extracted_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
        })
    except Exception as e:
        failed.append({"nct_id": nct_id, "error": str(e)})
    time.sleep(0.5)

with open("clinical_trials_extract.json", "w") as f:
    json.dump(results, f, indent=2)

if failed:
    print(f"Failed extractions: {len(failed)}")
    for f_item in failed:
        print(f"  {f_item['nct_id']}: {f_item['error']}")
```

Using the ClinicalTrials.gov API Directly
For fields the API covers well, direct API calls are more efficient. Combine API-based and scraping-based extraction:
```python
import requests

base_url = "https://clinicaltrials.gov/api/v2/studies"
params = {
    "query.cond": "non-small cell lung cancer",
    "query.intr": "pembrolizumab",
    "filter.overallStatus": "RECRUITING",
    "fields": "NCTId,BriefTitle,Phase,OverallStatus,EnrollmentCount,LeadSponsorName,StartDate,CompletionDate",
    "pageSize": 50,
    "format": "json"
}

response = requests.get(base_url, params=params)
studies = response.json().get("studies", [])

for study in studies:
    protocol = study.get("protocolSection", {})
    identification = protocol.get("identificationModule", {})
    status = protocol.get("statusModule", {})
    design = protocol.get("designModule", {})
    sponsor = protocol.get("sponsorCollaboratorsModule", {})

    nct_id = identification.get("nctId")
    title = identification.get("briefTitle")
    phase_list = design.get("phases", [])
    overall_status = status.get("overallStatus")
    lead_sponsor = sponsor.get("leadSponsor", {}).get("name")

    print(f"{nct_id}: {title}")
    print(f"  Phase: {', '.join(phase_list)} | Status: {overall_status}")
    print(f"  Sponsor: {lead_sponsor}")
```

Pass the NCT IDs to ScrapeGraphAI batch extraction for fields the API doesn't cover — eligibility details, tabular results data, or protocol amendment history.
Extracting Adverse Event Data from FDA FAERS
Pharmacovigilance teams live and die by FAERS data. The openFDA API is the fastest path to adverse event data for specific drugs:
```python
import requests

drug_name = "ozempic"
start_date = "20230101"
end_date = "20251231"

url = "https://api.fda.gov/drug/event.json"
params = {
    "search": f'patient.drug.openfda.brand_name:"{drug_name}" AND receivedate:[{start_date} TO {end_date}]',
    "count": "patient.reaction.reactionmeddrapt.exact",
    "limit": 25
}

response = requests.get(url, params=params)
data = response.json()

print(f"Top adverse reactions for {drug_name}:")
for result in data.get("results", []):
    print(f"  {result['term']}: {result['count']} reports")
```

This gives you the top adverse reactions by count — useful for quick signal detection. For deep pharmacovigilance, you'll need to paginate through the event endpoint to get individual case safety reports (ICSRs) with full drug characterization and outcome data.
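That pagination runs on openFDA's `skip` and `limit` parameters. A minimal helper sketch, assuming the documented caps of `limit` 1,000 and `skip` 25,000 (openFDA typically answers a query with no matching results with HTTP 404, so the loop treats that as end-of-data):

```python
import requests

def page_params(search: str, page: int, page_size: int = 100) -> dict:
    # openFDA rejects skip > 25,000 (and limit > 1,000), which is where
    # the 26,000-result ceiling comes from; fail fast instead of 400ing.
    skip = page * page_size
    if skip > 25_000:
        raise ValueError("Beyond openFDA's paging window; use quarterly files")
    return {"search": search, "limit": page_size, "skip": skip}

def fetch_reports(search: str, max_pages: int = 5) -> list:
    # Walk pages of individual safety reports for one search expression.
    out = []
    for page in range(max_pages):
        resp = requests.get(
            "https://api.fda.gov/drug/event.json",
            params=page_params(search, page),
            timeout=30,
        )
        if resp.status_code == 404:  # no (more) matching reports
            break
        resp.raise_for_status()
        batch = resp.json().get("results", [])
        out.extend(batch)
        if len(batch) < 100:  # short page means we've hit the end
            break
    return out
```

Anything beyond that window belongs in the quarterly-file pipeline rather than in deeper paging.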
The openFDA API caps at 26,000 results per query. For comprehensive pharmacovigilance on high-volume drugs, you need the raw FAERS quarterly ASCII files (available at fda.gov/drugs/questions-and-answers-fdas-adverse-event-reporting-system-faers). These come as zipped packages with $-delimited files for demographics (DEMO), drugs (DRUG), reactions (REAC), outcomes (OUTC), report sources (RPSR), therapy dates (THER), and indications (INDI). Watch for the $ delimiter — it's not CSV, not TSV — and encoding issues in older quarters (Latin-1 with accented characters in reporter names).
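Before pointing a loader at a real quarterly drop, it's worth sanity-checking the delimiter and encoding handling on a small sample. The rows below are fabricated and show only a column subset; real DEMO files inside the zip carry many more fields:

```python
import io
import pandas as pd

# Fabricated fragment mimicking a FAERS DEMO file's layout.
sample_demo = (
    "primaryid$caseid$age$age_cod$sex$occr_country\n"
    "100000011$10000001$54$YR$F$US\n"
    "100000012$10000002$67$YR$M$FR\n"
)

demo = pd.read_csv(
    io.StringIO(sample_demo),
    sep="$",             # FAERS uses $ as delimiter, not comma or tab
    dtype={"primaryid": str, "caseid": str},  # keep IDs as strings
    encoding="latin-1",  # older quarters contain Latin-1 accented characters
)
```

Swap the `StringIO` for the path of the extracted quarterly file and the same call handles the real thing; keeping `primaryid`/`caseid` as strings avoids losing leading zeros on join keys.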
PubMed Literature Extraction
Clinical data extraction from published literature supports systematic reviews, meta-analyses, and competitive intelligence. Here's how to extract structured data from PubMed articles using ScrapeGraphAI:
```python
from scrapegraph_py import Client

client = Client(api_key="your-scrapegraph-api-key")

pmid = "39142855"
url = f"https://pubmed.ncbi.nlm.nih.gov/{pmid}/"

response = client.smartscraper(
    website_url=url,
    user_prompt="""
    Extract the following from this PubMed article page:
    - PMID
    - Article title
    - Journal name
    - Publication date
    - Authors (list of names)
    - Study type (RCT, cohort, case-control, meta-analysis, etc.)
    - Sample size / number of participants
    - Intervention(s) studied
    - Comparator(s)
    - Primary endpoint(s)
    - Key results (effect sizes, p-values, confidence intervals if available)
    - Conclusion (1-2 sentences)
    - MeSH terms listed
    - Funding source if mentioned
    Return as structured JSON.
    """
)

article_data = response.get("result")
```

For bulk retrieval, use the NCBI E-utilities API (ESearch to find PMIDs, EFetch for XML records) with a registered API key for 10 req/sec rate limits, then pass individual articles through ScrapeGraphAI for structured extraction from the rendered pages.
Production Pipeline Architecture
A production clinical data extraction pipeline in pharma or biotech follows four layers:
Extraction Layer — Pull from ClinicalTrials.gov API + ScrapeGraphAI, openFDA API, PubMed E-utilities, and DailyMed. Stage all raw JSON into immutable, timestamped files in S3/GCS. Never overwrite a previous extraction. When you need to reprocess, replay from raw staging.
Validation Layer — Schema validation with Pydantic, PHI detection with Microsoft Presidio, and fuzzy-match deduplication (especially critical for FAERS, which is notorious for duplicate reports — the same adverse event gets reported by the manufacturer, the physician, and sometimes the patient). Validation is a separate stage from extraction — this lets you update validation rules without re-extracting.
Transformation Layer — Drug name normalization via RxNorm CUIs (the single hardest integration challenge across clinical data sources), MedDRA coding standardization, and cross-source linking. Without RxNorm, you cannot reliably join drug data across FAERS (free-text names like HUMIRA, humira, Humira Pen, adalimumab), ClinicalTrials.gov (mixed brand/generic), and PubMed (whatever the authors wrote). Use the NLM's RxNorm API at rxnav.nlm.nih.gov/REST to map all variants to canonical CUIs.
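A minimal sketch against that RxNav endpoint (`search=2` requests normalized matching, which absorbs casing and minor formatting variants; the parsing helper assumes the documented `idGroup.rxnormId` response shape):

```python
RXNAV = "https://rxnav.nlm.nih.gov/REST"

def extract_rxcui(payload: dict):
    # Pull the first RxCUI out of an /rxcui.json response body.
    ids = payload.get("idGroup", {}).get("rxnormId", [])
    return ids[0] if ids else None

def name_to_rxcui(raw_name: str):
    # Map a free-text drug name (HUMIRA, humira, Humira Pen, ...) to a
    # canonical RxNorm CUI; returns None when RxNav finds no match.
    import requests
    resp = requests.get(
        f"{RXNAV}/rxcui.json",
        params={"name": raw_name, "search": 2},  # 2 = normalized match
        timeout=10,
    )
    resp.raise_for_status()
    return extract_rxcui(resp.json())
```

Cache the results aggressively: the name space in FAERS is large but highly repetitive, so a simple (raw name, CUI) lookup table eliminates most calls.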
Consumption Layer — Dashboards (Metabase/Looker), disproportionality-based signal detection, and competitive intelligence reports.
Scheduling considerations: ClinicalTrials.gov should be extracted daily (status changes happen continuously). FAERS quarterly files drop 4-6 weeks after quarter end. openFDA API weekly for ongoing signal detection. PubMed daily or weekly depending on monitoring needs. Use Airflow, Prefect, or Dagster for orchestration — the non-negotiable requirement is idempotent DAGs.
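What "idempotent" means concretely: a task re-run for the same logical date must land in the same place under the same key. A minimal sketch of a deterministic staging path (the layout is an assumption, adapt it to your bucket conventions):

```python
from datetime import date

def staging_key(source: str, run_date: date) -> str:
    # Deterministic key per (source, logical date): re-running a DAG task
    # for the same date rewrites its own partition instead of creating a
    # new object, so replays never multiply downstream data.
    return f"raw/{source}/{run_date:%Y/%m/%d}/extract.json"
```

Pair this with upsert-style warehouse loads keyed on `nct_id` or `safety_report_id`, and a full pipeline re-run becomes safe at every layer.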
Data Quality and Validation with Pydantic
Define strict schemas for every data source and validate every record before it enters the warehouse:
```python
from pydantic import BaseModel, field_validator, Field
from typing import Optional
from datetime import date
import re

class ClinicalTrial(BaseModel):
    nct_id: str
    title: str
    phase: Optional[str] = None
    overall_status: str
    enrollment_count: Optional[int] = None
    lead_sponsor: str
    start_date: Optional[date] = None
    primary_completion_date: Optional[date] = None
    conditions: list[str] = Field(default_factory=list)
    interventions: list[str] = Field(default_factory=list)

    @field_validator("nct_id")
    @classmethod
    def validate_nct_id(cls, v):
        if not re.match(r"^NCT\d{8}$", v):
            raise ValueError(f"Invalid NCT ID format: {v}")
        return v

    @field_validator("phase")
    @classmethod
    def validate_phase(cls, v):
        valid_phases = [
            "EARLY_PHASE1", "PHASE1", "PHASE2", "PHASE3", "PHASE4",
            "Phase 1", "Phase 2", "Phase 3", "Phase 4",
            "Phase 1/Phase 2", "Phase 2/Phase 3", "NA", None
        ]
        if v not in valid_phases:
            raise ValueError(f"Unexpected phase value: {v}")
        return v

    @field_validator("enrollment_count")
    @classmethod
    def validate_enrollment(cls, v):
        if v is not None and (v < 0 or v > 1_000_000):
            raise ValueError(f"Suspicious enrollment count: {v}")
        return v

class AdverseEvent(BaseModel):
    safety_report_id: str
    receive_date: date
    is_serious: Optional[bool] = None
    patient_age: Optional[float] = None
    patient_sex: Optional[str] = None
    drugs: list[dict]
    reactions: list[dict]

    @field_validator("patient_sex")
    @classmethod
    def validate_sex(cls, v):
        if v is not None and v not in ("1", "2", "0", "M", "F", "UNK"):
            raise ValueError(f"Invalid patient sex code: {v}")
        return v

    @field_validator("patient_age")
    @classmethod
    def validate_age(cls, v):
        if v is not None and (v < 0 or v > 120):
            raise ValueError(f"Suspicious patient age: {v}")
        return v
```

FAERS deduplication deserves special attention. The standard approach: group by a composite key of (drug name normalized, reaction term, patient age bucket, patient sex, event date +/- 30 days) and take the most recent report version. Without deduplication, your signal detection will overcount heavily.
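A minimal sketch of that composite-key approach (field names are illustrative, and bucketing event dates into fixed 30-day bins only approximates the +/- 30-day window, since near-boundary reports can land in adjacent bins):

```python
from datetime import date

def dedup_key(report: dict) -> tuple:
    # Composite key: normalized drug, normalized reaction, decade age
    # bucket, sex, and a 30-day event-date bin.
    age = report.get("patient_age")
    return (
        report["drug"].strip().lower(),
        report["reaction"].strip().lower(),
        None if age is None else int(age // 10),
        report.get("patient_sex"),
        report["event_date"].toordinal() // 30,
    )

def deduplicate(reports: list) -> list:
    # Keep only the most recent version of each duplicated case.
    latest = {}
    for r in sorted(reports, key=lambda r: r["receive_date"]):
        latest[dedup_key(r)] = r
    return list(latest.values())
```

Production systems often go further with fuzzy matching on narratives, but even this crude key removes the bulk of manufacturer/physician double-reports.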
When schema-drift alerts fire (null rates spiking on previously reliable fields), it usually means a source changed its page layout or API schema — update your extraction prompts accordingly. For warehouse schema design, the key tables are trials, adverse_events, and publications, linked through a drug_name_map table backed by RxNorm CUIs for cross-source joins.
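The drug_name_map linkage can start as a single table keyed on raw name and source. An illustrative DDL sketch (names and columns are assumptions, smoke-checked here against in-memory SQLite):

```python
import sqlite3

# Illustrative schema; adapt types and names to your warehouse.
DRUG_NAME_MAP_DDL = """
CREATE TABLE IF NOT EXISTS drug_name_map (
    raw_name   TEXT NOT NULL,  -- name as seen in FAERS / CT.gov / PubMed
    source     TEXT NOT NULL,  -- 'faers' | 'ctgov' | 'pubmed'
    rxcui      TEXT NOT NULL,  -- canonical RxNorm CUI from the mapping step
    ingredient TEXT,           -- normalized ingredient name, if resolved
    mapped_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (raw_name, source)
);
"""

conn = sqlite3.connect(":memory:")
conn.execute(DRUG_NAME_MAP_DDL)
conn.execute(
    "INSERT INTO drug_name_map (raw_name, source, rxcui, ingredient) "
    "VALUES (?, ?, ?, ?)",
    ("Humira Pen", "faers", "0000000", "adalimumab"),  # placeholder CUI
)
```

Joins from trials, adverse_events, and publications then all go through `rxcui`, never through raw strings.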
HIPAA and GDPR Compliance Checklist
Not legal advice. But this is what pharma data engineering teams actually check before launching clinical data extraction pipelines.
HIPAA (US)
Does HIPAA apply?
- Does the data originate from a covered entity (hospital, insurer, pharmacy)?
- Does the data contain any of the 18 HIPAA identifiers?
- ClinicalTrials.gov, FAERS, and PubMed public data: HIPAA generally does not apply to de-identified public records, but verify with counsel
Technical safeguards (if applicable):
- Encryption at rest (AES-256) and in transit (TLS 1.2+)
- Role-based access controls with audit logging
- Automatic PHI detection in pipeline (Microsoft Presidio or AWS Comprehend Medical)
- Data retention policy with automated expiration
- BAA with cloud provider
- Incident response plan
Administrative: Data Use Agreements, IRB approval for human subjects research, Privacy Impact Assessment, staff training on PHI handling.
GDPR (EU Data)
- Identify lawful basis for processing (legitimate interest for public data research is common)
- Document purpose limitation and enforce data minimization
- Set and enforce retention periods
- If processing EU resident personal data: conduct DPIA
- Cross-border transfer safeguards (SCCs or adequacy decision)
Practical tip: For most teams working with public registries, the main compliance risk isn't the source data — it's the pipeline accidentally capturing PHI from FAERS narratives. Run Presidio on every text field. The cost is negligible.
How Pharma Companies Use Clinical Data Extraction
Drug Repurposing — Extract all trials for a given mechanism of action, identify secondary endpoints that showed unexpected significance, cross-reference with adverse event profiles suggesting activity in other pathways. Companies like Recursion Pharmaceuticals run these analyses continuously.
Competitive Intelligence — Before investing $50M+ in Phase 2, teams extract weekly from ClinicalTrials.gov: competing trials for the same indication, endpoint choices, inclusion/exclusion criteria (overly narrow criteria can signal safety concerns from earlier phases), and enrollment timelines.
Pharmacovigilance Signal Detection — FAERS extraction feeds disproportionality analysis (reporting odds ratios). When the lower bound of the 95% CI exceeds 1.0 with sufficient case count (N >= 3), that's a signal worth investigating. Automate across every drug-event pair for continuous surveillance.
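The reporting odds ratio comes from the standard 2x2 contingency table for a drug-event pair. A small sketch of the calculation (applying a Haldane-Anscombe 0.5 correction to every cell is one common convention; some teams apply it only when a cell is zero):

```python
import math

def reporting_odds_ratio(a: int, b: int, c: int, d: int):
    """ROR with a 95% CI from the 2x2 table:
    a = target drug & target event,  b = target drug & other events,
    c = other drugs & target event,  d = other drugs & other events.
    """
    # Haldane-Anscombe correction: +0.5 per cell guards against zero counts.
    a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    ror = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    ci_low = math.exp(math.log(ror) - 1.96 * se)
    ci_high = math.exp(math.log(ror) + 1.96 * se)
    return ror, ci_low, ci_high

# 30 co-reports of the pair against a large background of other reports.
ror, ci_low, ci_high = reporting_odds_ratio(30, 970, 100, 98900)
is_signal = ci_low > 1.0 and 30 >= 3  # the screening rule described above
```

Run this over every drug-event pair in the deduplicated FAERS extract and rank by lower CI bound to get a continuously refreshed signal queue.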
ScrapeGraphAI vs. Alternative Approaches
Manual API Calls — Full control and free, but massive maintenance burden. Every API schema change breaks your parsers. Best for teams with dedicated engineering staff and stable sources.
Custom Scrapers (Scrapy, Playwright) — High performance for bulk extraction, but brittle to layout changes. ClinicalTrials.gov redesigned their UI in 2023 and broke every custom scraper in existence.
ScrapeGraphAI — Semantic understanding survives layout changes. Natural language prompts instead of CSS selectors. Returns structured JSON directly. Per-request cost, slightly higher latency.
Recommended: Hybrid Approach — Use the ClinicalTrials.gov API for structured fields, ScrapeGraphAI for fields the API misses and for PubMed extraction, the openFDA API for FAERS queries, and raw quarterly files for comprehensive pharmacovigilance datasets.
The core insight is that clinical data extraction isn't a one-time project. It's infrastructure. The databases update continuously, the regulatory landscape shifts, and your research questions evolve. Build for change, not for a snapshot.
For more on building extraction pipelines with ScrapeGraphAI, see our guide on AI web scraping.
FAQ
Is it legal to scrape ClinicalTrials.gov?
Yes. ClinicalTrials.gov is a public database maintained by the U.S. National Library of Medicine, funded by taxpayers, and explicitly intended for public access. Respect the published rate limits (10 req/sec for the API) and identify your scraper with a descriptive user agent string. The NLM is not known to have taken enforcement action against good-faith programmatic access.
How does clinical data extraction differ from general web scraping?
Three ways. First, domain-specific terminology (MedDRA codes, ICD-10, drug nomenclature) that general-purpose parsers struggle with. Second, complex relational structure — trial arms, endpoints, adverse event causality assessments — requiring domain-aware schema design. Third, compliance requirements (HIPAA, GDPR, 21 CFR Part 11) impose audit trail and data provenance obligations that don't exist in general scraping.
Can ScrapeGraphAI extract medical terminology accurately?
Yes. ScrapeGraphAI's underlying LLMs are trained on broad corpora including biomedical literature, drug databases, and clinical documentation. Standard medical terminology in prompts — drug names, ICD codes, MeSH terms, clinical endpoints, laboratory values — is identified and extracted reliably in practice, though coded values should still be validated against the relevant controlled vocabulary before they enter your warehouse.
