Revision as of 09:18, 17 March 2025

Data Scraping

LLM Data Scrapers list

Web Scraping

Firecrawl: API to turn websites into LLM-ready markdown or structured data (can be self-hosted)
Crawl4AI: Crawl Smarter, Faster, Freely. For AI.
ScrapeGraphAI: You Only Scrape Once. A web scraping Python library that uses LLMs and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, Markdown, etc.)
pipet: A Swiss-army tool for scraping and extracting data from online assets
Scrapling: Undetectable, Lightning-Fast, and Adaptive Web Scraping for Python
sitefetch: Fetch an entire site into a text file (to be used with AIs)
LLM Scraper: Turn webpage into structured data using LLMs
Trafilatura: Discover and Extract Text Data on the Web

llms.txt Generator

files-to-prompt: Concatenate directory of files into a single prompt
llms.txt Generator
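
The idea behind tools such as files-to-prompt (concatenating a directory of files into a single prompt) can be illustrated with a minimal sketch. This is not the tool's own implementation or output format: the function name `concat_files`, the `---` separators, and the default suffix filter are all assumptions made for illustration.

```python
from pathlib import Path


def concat_files(root, suffixes=(".py", ".md", ".txt")):
    """Join text files under `root` into one string.

    Hypothetical helper: each file is preceded by its path relative
    to `root` and wrapped in '---' separator lines (an assumed format).
    """
    root = Path(root)
    parts = []
    # Walk the tree in sorted order so the output is deterministic.
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="replace")
            rel = path.relative_to(root).as_posix()
            parts.append(f"{rel}\n---\n{text}\n---")
    return "\n\n".join(parts)
```

Calling `concat_files("my_project")` would return one string that can be pasted into a prompt; files whose suffix is not in `suffixes` are skipped.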

llms.txt Generator (Online)

Firecrawl LLMs.txt generator (online tool, API)
Jina AI Reader: Convert any URL to an LLM-friendly input with a prefix: https://r.jina.ai/
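
The prefix convention described for Jina AI Reader can be sketched as follows. Only the prefix `https://r.jina.ai/` comes from the entry above; the helper names `reader_url` and `fetch_llm_text`, the timeout, and the User-Agent string are assumptions, and the shape of the response (and any rate limits or API keys) is not described here.

```python
from urllib.request import Request, urlopen

# Prefix taken from the list entry above.
READER_PREFIX = "https://r.jina.ai/"


def reader_url(url):
    """Return the Reader URL for `url` by prepending the prefix."""
    return READER_PREFIX + url


def fetch_llm_text(url, timeout=30):
    """Fetch the Reader's rendering of `url` as text (hypothetical helper).

    Performs a plain GET request; requires network access.
    """
    req = Request(reader_url(url), headers={"User-Agent": "example-script"})
    with urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

For example, `reader_url("https://example.com/page")` yields `https://r.jina.ai/https://example.com/page`, which can also be opened directly in a browser.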

Headless Browser (scrape & automate)

Lightpanda Browser

Code & GitHub

GitIngest: Turn any GitHub repository into a prompt-friendly text file for inclusion in an LLM's context. Available at: gitingest.com
repomix: Packs your entire repository into a single, AI-friendly file
github.gg: For analyzing GitHub repositories and providing valuable insights about code quality, dependencies, and more
Flatty - Codebase-to-Text for LLMs
CodeWeaver: Generate a Markdown Document of Your Codebase Structure and Content
RepoToTextForLLMs

Media Files

Cobalt
You-Get: Web video downloader tool

Document Parsing

Docling: converts multiple formats (PDF, DOCX, PPTX, Images, HTML) into Markdown and JSON
Microsoft MarkItDown: converts various formats (PDF, Word, Excel, PPT) to Markdown (available via web interface on Replit)
e2m: Everything to Markdown (doc, docx, epub, html, htm, url, pdf, ppt, pptx, mp3, and m4a)
NVIDIA NV-Ingest (code): scalable, performance-oriented document content and metadata extraction microservice
MegaParse: Your parser for every type of document (PDF, PowerPoint, Word)
Rowfill: Open-source document processing; extract, analyze, and process data from complex documents, images, PDFs and more with AI
Zerox: PDF to markdown vision model (OCR)
LlamaParse (example use for multimodal parsing)
Marker: PDFs and images to markdown
SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion 256M (hf)

PDF Conversion

Grobid (use online)
Chunkr (code)
PDF-Extract-Kit
Gemini 2.0
marker: converts PDFs and images to markdown, JSON, and HTML
olmOCR (online demo, code, hf)
Mistral OCR

PDF Language Translation

PDFMathTranslate (online demo)

Structured Data Extraction

Unstract: Intelligent Document Processing (IDP): No-code LLM Platform to structure unstructured documents

Screenshot

Microsoft OmniParser: Screen Parsing tool for Pure Vision Based GUI Agent

@@ Line 46: / Line 46: @@
 * [https://docs.llamaindex.ai/en/stable/llama_cloud/llama_parse/ LlamaParse] ([https://github.com/run-llama/llama_cloud_services/blob/main/examples/parse/multimodal/gemini2_flash.ipynb example use for multimodal parsing])
 * [https://github.com/VikParuchuri/marker Marker]: PDFs and images to markdown
+* [https://arxiv.org/abs/2503.11576 SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion] 256M ([https://huggingface.co/ds4sd/SmolDocling-256M-preview hf])
 ==PDF Conversion==

Difference between revisions of "Data Extraction"
