Data Extraction
Edited by KevinYager
Latest revision as of 09:06, 12 March 2025
Data Scraping
- LLM Data Scrapers list
Web Scraping
- Firecrawl: API to turn websites into LLM-ready markdown or structured data (can be self-hosted)
- Crawl4AI: Crawl Smarter, Faster, Freely. For AI.
- ScrapeGraphAI: You Only Scrape Once. A web scraping Python library that uses LLMs and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, Markdown, etc.)
- pipet: A Swiss Army tool for scraping and extracting data from online assets
- Scrapling: Undetectable, Lightning-Fast, and Adaptive Web Scraping for Python
- sitefetch: Fetch an entire site into a text file (for use with AIs)
- LLM Scraper: Turn webpage into structured data using LLMs
- Trafilatura: Discover and Extract Text Data on the Web
llms.txt Generator
- files-to-prompt: Concatenate a directory of files into a single prompt
- llms.txt Generator
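
The idea behind concatenating a directory into a single prompt can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of any tool listed above: the function name `concatenate_files`, the default suffix filter, and the `---` separators are all assumptions made for the example.

```python
from pathlib import Path


def concatenate_files(root: str, suffixes: tuple[str, ...] = (".py", ".md", ".txt")) -> str:
    """Join the text of matching files under `root` into one string.

    Hypothetical helper: each file's content is preceded by its path
    relative to `root`, so the reader (or an LLM) can tell files apart.
    """
    base = Path(root)
    sections = []
    # Sort paths so the output order is stable from run to run.
    for path in sorted(base.rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            # Replace undecodable bytes rather than failing on odd encodings.
            text = path.read_text(encoding="utf-8", errors="replace")
            sections.append(f"{path.relative_to(base).as_posix()}\n---\n{text}\n---")
    return "\n\n".join(sections)
```

Calling `concatenate_files("my_project")` would return one string that can be pasted into a prompt. Real tools typically add ignore rules and output-format options on top of this.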
llms.txt Generator (Online)
- Firecrawl LLMs.txt generator (online tool, API)
- Jina AI Reader: Convert any URL to an LLM-friendly input with a prefix: https://r.jina.ai/
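
The prefix approach described for Jina AI Reader amounts to plain string concatenation. A minimal sketch, assuming only the prefix given above; the helper name `reader_url` and the URL validation are illustrative additions, and no request is actually sent:

```python
from urllib.parse import urlsplit

# Prefix quoted in the list entry above.
READER_PREFIX = "https://r.jina.ai/"


def reader_url(target_url: str) -> str:
    """Return the Reader URL for `target_url` by prepending the prefix.

    Hypothetical helper: it only builds the URL and performs no fetch.
    """
    parts = urlsplit(target_url)
    # Reject relative or non-HTTP URLs early; this check is an assumption
    # of the sketch, not a documented requirement of the service.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"expected an absolute http(s) URL, got {target_url!r}")
    return READER_PREFIX + target_url


print(reader_url("https://example.com/docs/page"))
# prints: https://r.jina.ai/https://example.com/docs/page
```

The resulting URL can then be opened in a browser or fetched with any HTTP client.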
Headless Browser (scrape & automate)
Code & GitHub
- GitIngest: Turn any GitHub repository into a prompt-friendly text file, for inclusion in an LLM's context. Available at: gitingest.com
- repomix: Packs your entire repository into a single, AI-friendly file
- github.gg: For analyzing GitHub repositories and providing valuable insights about code quality, dependencies, and more
- Flatty: Codebase-to-Text for LLMs
- CodeWeaver: Generate a Markdown Document of Your Codebase Structure and Content
- RepoToTextForLLMs
Media Files
Document Parsing
- Docling: converts multiple formats (PDF, DOCX, PPTX, Images, HTML) into Markdown and JSON
- Microsoft Markitdown: converts various formats (PDF, Word, Excel, PPT) to Markdown (available via web interface on Replit)
- e2m: Everything to Markdown (doc, docx, epub, html, htm, url, pdf, ppt, pptx, mp3, and m4a)
- NVIDIA NV-Ingest (code): scalable, performance-oriented document content and metadata extraction microservice
- MegaParse: Your parser for every type of document (PDF, PowerPoint, Word)
- Rowfill: Open-source document processing; extract, analyze, and process data from complex documents, images, PDFs and more with AI
- Zerox: PDF to markdown vision model (OCR)
- LlamaParse (example use for multimodal parsing)
- Marker: PDFs and images to markdown
PDF Conversion
- Grobid (use online)
- Chunkr (code)
- PDF-Extract-Kit
- Gemini 2.0
- marker: converts PDFs and images to markdown, JSON, and HTML
- olmOCR (online demo, code, hf)
- Mistral OCR
PDF Language Translation
Structured Data Extraction
- Unstract: Intelligent Document Processing (IDP): No-code LLM Platform to structure unstructured documents
Screenshot
- Microsoft OmniParser: Screen Parsing tool for Pure Vision Based GUI Agent