Data Extraction

From GISAXS

Data Scraping

LLM Data Scrapers list

Web Scraping

Firecrawl: API to turn websites into LLM-ready markdown or structured data (can be self-hosted)
Crawl4AI: Crawl Smarter, Faster, Freely. For AI.
ScrapeGraphAI: You Only Scrape Once: web scraping Python library that uses LLMs and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, Markdown, etc.)
pipet: A Swiss-army tool for scraping and extracting data from online assets
Scrapling: Undetectable, Lightning-Fast, and Adaptive Web Scraping for Python
sitefetch: Fetch entire site into text file (to be used with AIs)
LLM Scraper: Turn webpage into structured data using LLMs
Trafilatura: Discover and Extract Text Data on the Web

llms.txt Generator

files-to-prompt: Concatenate directory of files into a single prompt
llms.txt Generator

llms.txt Generator (Online)

Firecrawl LLMs.txt generator (online tool, API)
- Prepend llmstxt.new/ to any URL; e.g. http://llmstxt.new/http://yager-research.ca
Jina AI Reader: Convert any URL to an LLM-friendly input with a prefix: https://r.jina.ai/
arXiv-txt: Convert any arXiv paper to llms.txt version
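The prefix-based services listed above work by prepending a fixed address to the target URL. A minimal sketch of that string construction follows; the function names are hypothetical, no network request is made, and the prefixes are taken directly from the entries above.

```python
# Prefixes as given in the list entries above.
LLMSTXT_NEW_PREFIX = "http://llmstxt.new/"
JINA_READER_PREFIX = "https://r.jina.ai/"


def llmstxt_new_url(target_url: str) -> str:
    """Return the llmstxt.new address for target_url (hypothetical helper)."""
    return LLMSTXT_NEW_PREFIX + target_url


def jina_reader_url(target_url: str) -> str:
    """Return the Jina AI Reader address for target_url (hypothetical helper)."""
    return JINA_READER_PREFIX + target_url


if __name__ == "__main__":
    # Reproduces the example given in the list above.
    print(llmstxt_new_url("http://yager-research.ca"))
    print(jina_reader_url("http://yager-research.ca"))
```

The resulting address can then be opened in a browser or fetched with any HTTP client.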

Headless Browser (scrape & automate)

Lightpanda Browser

Code & Github

GitIngest: Turn any GitHub repository into a prompt-friendly text file, for inclusion in an LLM's context. Available at: gitingest.com
repomix: Packs your entire repository into a single, AI-friendly file
github.gg: For analyzing GitHub repositories and providing valuable insights about code quality, dependencies, and more
Flatty: Codebase-to-Text for LLMs
CodeWeaver: Generate a Markdown Document of Your Codebase Structure and Content
RepoToTextForLLMs
uithub: Change the 'g' to 'u' in a github.com URL (github.com becomes uithub.com) to get an LLM-ready text version of the repository
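The uithub hostname swap described above can be done programmatically. The following is a rough sketch using only the Python standard library; the function name and the owner/repo path are hypothetical placeholders, and no request is sent.

```python
from urllib.parse import urlsplit, urlunsplit


def to_uithub(github_url: str) -> str:
    """Swap the github.com host for uithub.com, keeping path and query intact.

    Hypothetical helper: it only rewrites the URL and does not fetch anything.
    """
    parts = urlsplit(github_url)
    if parts.netloc.lower() not in ("github.com", "www.github.com"):
        raise ValueError(f"not a github.com URL: {github_url}")
    return urlunsplit(parts._replace(netloc="uithub.com"))


if __name__ == "__main__":
    # "owner/repo" is a placeholder, not a real repository.
    print(to_uithub("https://github.com/owner/repo"))
```

Parsing the URL rather than using a plain string replace avoids touching a "github" that happens to appear elsewhere in the path.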

Related

gitdiagram: Repository to diagram

Media Files

Cobalt
You-Get: Web video downloader tool

Document Parsing

Intelligent Document Processing Leaderboard: A unified leaderboard for OCR, KIE, classification, QA, table extraction, and confidence score evaluation

Systems

Docling: converts multiple formats (PDF, DOCX, PPTX, Images, HTML) into Markdown and JSON
Microsoft Markitdown: converts various formats (PDF, Word, Excel, PPT) to Markdown (available via web interface on replit)
e2m: Everything to Markdown (doc, docx, epub, html, htm, url, pdf, ppt, pptx, mp3, and m4a)
Nvidia NV-ingest (code): scalable, performance-oriented document content and metadata extraction microservice
MegaParse: Your Parser for every type of document (pdf, powerpoint, word)
Rowfill: Open-source document processing; extract, analyze, and process data from complex documents, images, PDFs and more with AI
Zerox: PDF to markdown vision model (OCR)
LlamaParse (example use for multimodal parsing)
Marker: PDFs and images to markdown
SmolDocling: An ultra-compact (256M-parameter) vision-language model for end-to-end multi-modal document conversion (hf)
TWIX: Automatically Reconstructing Structured Data from Templatized Documents
ContextGem: LLM-based extraction from documents

PDF Conversion

Grobid (use online)
Chunkr (code)
PDF-Extract-Kit
Gemini 2.0
marker: converts PDFs and images to markdown, JSON, and HTML
olmOCR (online demo, code, hf)
Mistral OCR
MarkPDFDown

PDF Language Translation

PDFMathTranslate (online demo)

Structured Data Extraction

Unstract: Intelligent Document Processing (IDP): No-code LLM Platform to structure unstructured documents

Screenshot

Microsoft OmniParser: Screen Parsing tool for Pure Vision Based GUI Agent

Optical character recognition (OCR)

General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model (project, code, demo)
Swift OCR: LLM Powered Fast OCR
Ollama OCR
surya
Mistral OCR
RolmOCR (hf)
HunyuanOCR
2026-03: Multimodal OCR: Parse Anything from Documents (hf)

Information Extraction

2025-07: Google LangExtract
