Latest revision as of 13:52, 4 December 2025

Data Scraping

LLM Data Scrapers list

Web Scraping

Firecrawl: API to turn websites into LLM-ready markdown or structured data (can be self-hosted)
Crawl4AI: Crawl Smarter, Faster, Freely. For AI.
ScrapeGraphAI: You Only Scrape Once: web scraping Python library that uses LLMs and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, Markdown, etc.)
pipet: A swiss-army tool for scraping and extracting data from online assets
Scrapling: Undetectable, Lightning-Fast, and Adaptive Web Scraping for Python
sitefetch: Fetch entire site into text file (to be used with AIs)
LLM Scraper: Turn webpage into structured data using LLMs
Trafilatura: Discover and Extract Text Data on the Web

llms.txt Generator

files-to-prompt: Concatenate directory of files into a single prompt
llms.txt Generator

llms.txt Generator (Online)

Firecrawl LLMs.txt generator (online tool, API)
- Just add llmstxt.new in front of any URL; e.g. http://llmstxt.new/http://yager-research.ca
Jina AI Reader: Convert any URL to an LLM-friendly input with a prefix: https://r.jina.ai/
arXiv-txt: Convert any arXiv paper to llms.txt version
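
The prefix-style tools above (llmstxt.new and Jina AI Reader) are used by placing the service address in front of the target URL. A minimal Python sketch of that string construction follows. The helper name `with_prefix` is hypothetical, the code only builds the address and does not fetch it, and whether a service accepts a particular URL is not checked here.

```python
# Prefixes taken from the entries above.
LLMSTXT_NEW = "http://llmstxt.new/"
JINA_READER = "https://r.jina.ai/"


def with_prefix(prefix: str, url: str) -> str:
    """Place a service prefix in front of a full target URL (hypothetical helper)."""
    # Normalise to exactly one slash between the prefix and the target URL.
    return prefix.rstrip("/") + "/" + url


if __name__ == "__main__":
    # Reproduces the example address given in the list above.
    print(with_prefix(LLMSTXT_NEW, "http://yager-research.ca"))
    print(with_prefix(JINA_READER, "http://yager-research.ca"))
```

The resulting string can then be opened in a browser or passed to any HTTP client.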

Headless Browser (scrape & automate)

Lightpanda Browser

Code & Github

GitIngest: Turn any GitHub repository into a prompt-friendly text file, for inclusion in an LLM's context. Available at: gitingest.com
repomix: Packs your entire repository into a single, AI-friendly file
github.gg: For analyzing GitHub repositories and providing valuable insights about code quality, dependencies, and more
Flatty - Codebase-to-Text for LLMs
CodeWeaver: Generate a Markdown Document of Your Codebase Structure and Content
RepoToTextForLLMs
uithub: Change 'g' to 'u' in github to get llm.txt
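
The uithub entry above amounts to swapping the host name in a repository URL. A minimal Python sketch of that rewrite, using only the standard library, is shown here. The function name `github_to_uithub` is hypothetical, and the sketch assumes the rest of the URL (path, query) is carried over unchanged, which the entry does not spell out.

```python
from urllib.parse import urlsplit, urlunsplit


def github_to_uithub(url: str) -> str:
    """Rewrite a github.com URL to the same path on uithub.com (hypothetical helper)."""
    parts = urlsplit(url)
    # Only rewrite addresses that actually point at GitHub.
    if parts.netloc.lower() not in ("github.com", "www.github.com"):
        raise ValueError(f"not a github.com URL: {url!r}")
    # Assumption: path, query, and fragment are kept as they are.
    return urlunsplit(parts._replace(netloc="uithub.com"))


if __name__ == "__main__":
    print(github_to_uithub("https://github.com/tesserato/CodeWeaver"))
```

Rejecting non-GitHub hosts is a design choice in this sketch so that an unrelated URL is not silently rewritten.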

Media Files

Cobalt
You-Get: Web video downloader tool

Document Parsing

Intelligent Document Processing Leaderboard: A unified leaderboard for OCR, KIE, classification, QA, table extraction, and confidence score evaluation

Systems

Docling: converts multiple formats (PDF, DOCX, PPTX, Images, HTML) into Markdown and JSON
Microsoft Markitdown: converts various formats (PDF, Word, Excel, PPT) to Markdown (available via web interface on replit)
e2m: Everything to Markdown (doc, docx, epub, html, htm, url, pdf, ppt, pptx, mp3, and m4a)
Nvidia NV-ingest (code): scalable, performance-oriented document content and metadata extraction microservice
MegaParse: Your Parser for every type of document (PDF, PowerPoint, Word)
Rowfill: Open-source document processing; extract, analyze, and process data from complex documents, images, PDFs and more with AI
Zerox: PDF to markdown vision model (OCR)
LlamaParse (example use for multimodal parsing)
Marker: PDFs and images to markdown
SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion 256M (hf)
TWIX: Automatically Reconstructing Structured Data from Templatized Documents
ContextGem: LLM-based extraction from documents

PDF Conversion

Grobid (use online)
Chunkr (code)
PDF-Extract-Kit
Gemini 2.0
marker: converts PDFs and images to markdown, JSON, and HTML
olmOCR (online demo, code, hf)
Mistral OCR
MarkPDFDown

PDF Language Translation

PDFMathTranslate (online demo)

Structured Data Extraction

Unstract: Intelligent Document Processing (IDP): No-code LLM Platform to structure unstructured documents

@@ Line 19: / Line 19: @@
 ==llms.txt Generator (Online)==
 * Firecrawl [https://llmstxt.firecrawl.dev/ LLMs.txt generator] (online tool, [https://docs.firecrawl.dev/features/alpha/llmstxt API])
+** Just add llmstxt.new in front of any URL; e.g. [http://llmstxt.new/http://yager-research.ca http://llmstxt.new/http://yager-research.ca]
 * Jina AI [https://github.com/jina-ai/reader Reader]: Convert any URL to an LLM-friendly input with a prefix: https://r.jina.ai/
+* [https://www.arxiv-txt.org/ arXiv-txt]: Convert any arXiv paper to llms.txt version
 ==Headless Browser (scrape & automate)==
@@ Line 31: / Line 33: @@
 * [https://github.com/tesserato/CodeWeaver CodeWeaver]: Generate a Markdown Document of Your Codebase Structure and Content
 * [https://github.com/Doriandarko/RepoToTextForLLMs RepoToTextForLLMs]
+* [https://uithub.com/ uithub]: Change 'g' to 'u' in github to get llm.txt
+===Related===
+* [https://gitdiagram.com/ gitdiagram]: Repository to diagram
 ==Media Files==
@@ Line 37: / Line 43: @@
 =Document Parsing=
+* [https://idp-leaderboard.org/ Intelligent Document Processing Leaderboard] A unified leaderboard for OCR, KIE, classification, QA, table extraction, and confidence score evaluation
+==Systems==
 * [https://github.com/DS4SD/docling Docling]: converts multiple formats (PDF, DOCX, PPTX, Images, HTML) into Markdown and JSON
 * [https://github.com/microsoft/markitdown Microsoft Markitdown]: converts various formats (PDF, Word, Excel, PPT) to Markdown (available via [https://msftmd.replit.app/ web interface on replit])
@@ Line 46: / Line 55: @@
 * [https://docs.llamaindex.ai/en/stable/llama_cloud/llama_parse/ LlamaParse] ([https://github.com/run-llama/llama_cloud_services/blob/main/examples/parse/multimodal/gemini2_flash.ipynb example use for multimodal parsing])
 * [https://github.com/VikParuchuri/marker Marker]: PDFs and images to markdown
+* [https://arxiv.org/abs/2503.11576 SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion] 256M ([https://huggingface.co/ds4sd/SmolDocling-256M-preview hf])
+* [https://arxiv.org/abs/2501.06659 TWIX: Automatically Reconstructing Structured Data from Templatized Documents]
+* [https://github.com/shcherbak-ai/contextgem ContextGem]: LLM-based extraction from documents
 ==PDF Conversion==
@@ Line 55: / Line 67: @@
 * [https://olmocr.allenai.org/blog olmOCR] ([https://olmocr.allenai.org/ online demo], [https://github.com/allenai/olmocr code], [https://huggingface.co/collections/allenai/olmocr-67af8630b0062a25bf1b54a1 hf])
 * [https://mistral.ai/news/mistral-ocr Mistral OCR]
+* [https://github.com/MarkPDFdown/markpdfdown MarkPDFDown]
 ==PDF Language Translation==
@@ Line 71: / Line 84: @@
 * [https://github.com/VikParuchuri/surya surya]
 * [https://mistral.ai/news/mistral-ocr Mistral OCR]
+* [https://reducto.ai/blog/introducing-rolmocr-open-source-ocr-model RolmOCR] ([https://huggingface.co/reducto/RolmOCR hf])
+* [https://hunyuan.tencent.com/vision/zh?tabIndex=0 HunyuanOCR]
+=Information Extraction=
+* 2025-07: Google [https://github.com/google/langextract LangExtract]

Difference between revisions of "Data Extraction"

Contents

Data Scraping

Web Scraping

llms.txt Generator

llms.txt Generator (Online)

Headless Browser (scrape & automate)

Code & Github

Related

Media Files

Document Parsing

Systems

PDF Conversion

PDF Language Translation

Structured Data Extraction

Screenshot

Optical character recognition (OCR)

Information Extraction
