Revision as of 15:49, 6 March 2025

Data Scraping

LLM Data Scrapers list

Web Scraping

Firecrawl: API to turn websites into LLM-ready markdown or structured data (can be self-hosted)
Crawl4AI: Crawl Smarter, Faster, Freely. For AI.
ScrapeGraphAI: You Only Scrape Once: a web scraping Python library that uses LLMs and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, Markdown, etc.)
pipet: A swiss-army tool for scraping and extracting data from online assets
Scrapling: Undetectable, Lightning-Fast, and Adaptive Web Scraping for Python
sitefetch: Fetch entire site into text file (to be used with AIs)
LLM Scraper: Turn webpage into structured data using LLMs
Trafilatura: Discover and Extract Text Data on the Web

llms.txt Generator

files-to-prompt: Concatenate directory of files into a single prompt
llms.txt Generator
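
The idea behind files-to-prompt (concatenating a directory of files into a single prompt) can be sketched in a few lines of Python. This is an illustrative sketch only, not the tool's own code or CLI: the function name `concat_files`, the `--- path ---` separator, and the default suffix filter are all assumptions made for the example.

```python
from pathlib import Path


def concat_files(root: str, suffixes=(".py", ".md", ".txt")) -> str:
    """Join every matching file under `root` into one prompt string.

    Hypothetical helper: the header format and suffix filter are
    illustrative choices, not the behaviour of any listed tool.
    """
    parts = []
    # Sort for a stable, reproducible ordering of files in the prompt.
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            rel = path.relative_to(root).as_posix()
            text = path.read_text(encoding="utf-8", errors="replace")
            # Prefix each file with its relative path so the model can
            # tell where one file ends and the next begins.
            parts.append(f"--- {rel} ---\n{text}")
    return "\n\n".join(parts)
```

Calling `concat_files("my_project")` would return one string that can be pasted into a prompt; binary or very large files would likely need extra filtering in practice.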

llms.txt Generator (Online)

Firecrawl LLMs.txt generator (online tool)
Jina AI Reader: Convert any URL to an LLM-friendly input with a prefix: https://r.jina.ai/
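
The Jina AI Reader entry above describes a prefix scheme: put https://r.jina.ai/ in front of the target URL. A minimal sketch of that, using only the Python standard library, might look like the following. The helper names `reader_url` and `fetch_llm_text`, the timeout, and the User-Agent string are assumptions for illustration; the exact response format, rate limits, and any authentication requirements are not stated here and should be checked against the service's own documentation.

```python
from urllib.request import Request, urlopen

# Prefix taken from the list entry above.
READER_PREFIX = "https://r.jina.ai/"


def reader_url(page_url: str) -> str:
    """Build the Reader URL by prepending the prefix to the target URL."""
    return READER_PREFIX + page_url


def fetch_llm_text(page_url: str, timeout: float = 30.0) -> str:
    """Fetch the converted page as text (hypothetical helper; needs network access)."""
    req = Request(reader_url(page_url), headers={"User-Agent": "reader-example"})
    with urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

For example, `reader_url("https://example.com/docs")` yields `https://r.jina.ai/https://example.com/docs`, which can also be opened directly in a browser.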

Headless Browser (scrape & automate)

Lightpanda Browser

Code & GitHub

GitIngest: Turn any GitHub repository into a prompt-friendly text file, for inclusion in LLM's context. Available at: gitingest.com
repomix: Packs your entire repository into a single, AI-friendly file
github.gg: For analyzing GitHub repositories and providing valuable insights about code quality, dependencies, and more
Flatty: Codebase-to-Text for LLMs
CodeWeaver: Generate a Markdown Document of Your Codebase Structure and Content
RepoToTextForLLMs

Media Files

Cobalt
You-Get: Web video downloader tool

Document Parsing

Docling: converts multiple formats (PDF, DOCX, PPTX, Images, HTML) into Markdown and JSON
Microsoft Markitdown: converts various formats (PDF, Word, Excel, PPT) to Markdown (available via a web interface on Replit)
e2m: Everything to Markdown (doc, docx, epub, html, htm, url, pdf, ppt, pptx, mp3, and m4a)
Nvidia NV-ingest (code): scalable, performance-oriented document content and metadata extraction microservice
MegaParse: parser for every type of document (PDF, PowerPoint, Word)
Rowfill: Open-source document processing; extract, analyze, and process data from complex documents, images, PDFs and more with AI
Zerox: PDF to markdown vision model (OCR)
LlamaParse (example use for multimodal parsing)
Marker: PDFs and images to markdown

PDF Conversion

Grobid (use online)
Chunkr (code)
PDF-Extract-Kit
Gemini 2.0
marker: converts PDFs and images to markdown, JSON, and HTML
olmOCR (online demo, code, hf)
Mistral OCR

PDF Language Translation

PDFMathTranslate (online demo)

Structured Data Extraction

Unstract: Intelligent Document Processing (IDP): No-code LLM Platform to structure unstructured documents

Screenshot

Microsoft OmniParser: Screen Parsing tool for Pure Vision Based GUI Agent

@@ Line 53: / Line 53: @@
 * [https://www.philschmid.de/gemini-pdf-to-data Gemini 2.0]
 * [https://github.com/VikParuchuri/marker marker]: converts PDFs and images to markdown, JSON, and HTML
+* [https://olmocr.allenai.org/blog olmOCR] ([https://olmocr.allenai.org/ online demo], [https://github.com/allenai/olmocr code], [https://huggingface.co/collections/allenai/olmocr-67af8630b0062a25bf1b54a1 hf])
+* [https://mistral.ai/news/mistral-ocr Mistral OCR]
 ==PDF Language Translation==
@@ Line 68: / Line 70: @@
 * [https://github.com/imanoop7/Ollama-OCR Ollama OCR]
 * [https://github.com/VikParuchuri/surya surya]
+* [https://mistral.ai/news/mistral-ocr Mistral OCR]

Difference between revisions of "Data Extraction"

