Data Extraction
Latest revision as of 13:34, 18 April 2025
Data Scraping
- LLM Data Scrapers list
Web Scraping
- Firecrawl: API to turn websites into LLM-ready markdown or structured data (can be self-hosted)
- Crawl4AI: Crawl Smarter, Faster, Freely. For AI.
- ScrapeGraphAI: You Only Scrape Once: Python web scraping library that uses LLMs and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, Markdown, etc.)
- pipet: A swiss-army tool for scraping and extracting data from online assets
- Scrapling: Undetectable, Lightning-Fast, and Adaptive Web Scraping for Python
- sitefetch: Fetch entire site into text file (to be used with AIs)
- LLM Scraper: Turn webpage into structured data using LLMs
- Trafilatura: Discover and Extract Text Data on the Web
llms.txt Generator
- files-to-prompt: Concatenate directory of files into a single prompt
- llms.txt Generator
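
The idea behind tools such as files-to-prompt (concatenating a directory of files into a single prompt) can be sketched in a few lines of Python. This is a minimal illustration of the technique, not the tool's actual implementation; the function name `concat_files`, the `--- path ---` separator, and the default suffix list are all assumptions made for this example.

```python
from pathlib import Path


def concat_files(root: str, suffixes=(".py", ".md", ".txt")) -> str:
    """Join every matching file under `root` into one prompt string.

    Hypothetical helper: the header format and suffix filter are
    illustrative choices, not taken from any of the listed tools.
    """
    parts = []
    # Sort the paths so the output order is deterministic.
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            rel = path.relative_to(root).as_posix()
            text = path.read_text(encoding="utf-8", errors="replace")
            # Prefix each file with its relative path so the model
            # can tell where one file ends and the next begins.
            parts.append(f"--- {rel} ---\n{text}")
    return "\n\n".join(parts)
```

Calling `concat_files("my_project")` returns a single string that can be pasted into an LLM context window; the directory name here is a placeholder.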
llms.txt Generator (Online)
- Firecrawl LLMs.txt generator (online tool, API)
- Just add llmstxt.new in front of any URL; e.g. http://llmstxt.new/http://yager-research.ca
- Jina AI Reader: Convert any URL to an LLM-friendly input with a prefix: https://r.jina.ai/
- arXiv-txt: Convert any arXiv paper to llms.txt version
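
The prefix-style services in the list above (llmstxt.new and Jina AI Reader) work by prepending a fixed string to the target URL. A minimal sketch of that URL construction follows; the helper names are hypothetical, and the code only builds the URLs. It does not fetch them or make any claim about what the services return.

```python
# Prefixes as given in the list above.
LLMSTXT_NEW_PREFIX = "http://llmstxt.new/"
JINA_READER_PREFIX = "https://r.jina.ai/"


def llmstxt_new_url(url: str) -> str:
    """Build the llmstxt.new form of `url` (hypothetical helper)."""
    return LLMSTXT_NEW_PREFIX + url


def jina_reader_url(url: str) -> str:
    """Build the Jina AI Reader form of `url` (hypothetical helper)."""
    return JINA_READER_PREFIX + url


if __name__ == "__main__":
    target = "http://yager-research.ca"
    print(llmstxt_new_url(target))
    print(jina_reader_url(target))
```

The resulting strings can be opened in a browser or passed to any HTTP client.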
Headless Browser (scrape & automate)
Code & Github
- GitIngest: Turn any GitHub repository into a prompt-friendly text file, for inclusion in an LLM's context. Available at: gitingest.com
- repomix: Packs your entire repository into a single, AI-friendly file
- github.gg: For analyzing GitHub repositories and providing valuable insights about code quality, dependencies, and more
- Flatty: Codebase-to-Text for LLMs
- CodeWeaver: Generate a Markdown Document of Your Codebase Structure and Content
- RepoToTextForLLMs
- uithub: Change the 'g' to 'u' in a github.com URL to get an llms.txt version of the repository
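
The uithub host swap described in the list above can be sketched as a small URL rewrite. The function name `to_uithub` is hypothetical, and the sketch assumes the input is a plain `github.com` URL; other hosts are rejected rather than guessed at.

```python
from urllib.parse import urlsplit, urlunsplit


def to_uithub(url: str) -> str:
    """Rewrite a github.com URL to its uithub.com equivalent.

    Hypothetical helper: only the host is changed; the path,
    query, and fragment are passed through untouched.
    """
    parts = urlsplit(url)
    if parts.netloc.lower() != "github.com":
        raise ValueError(f"not a github.com URL: {url!r}")
    return urlunsplit(parts._replace(netloc="uithub.com"))


if __name__ == "__main__":
    print(to_uithub("https://github.com/jina-ai/reader"))
```

Parsing the URL instead of doing a bare string replace avoids accidentally rewriting a "github" that appears in a repository or file name.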
Related
- gitdiagram: Repository to diagram
Media Files
Document Parsing
- Docling: converts multiple formats (PDF, DOCX, PPTX, Images, HTML) into Markdown and JSON
- Microsoft Markitdown: converts various formats (PDF, Word, Excel, PPT) to Markdown (available via a web interface on Replit)
- e2m: Everything to Markdown (doc, docx, epub, html, htm, url, pdf, ppt, pptx, mp3, and m4a)
- Nvidia NV-ingest (code): scalable, performance-oriented document content and metadata extraction microservice
- MegaParse: Your Parser for every type of documents (pdf, powerpoint, word)
- Rowfill: Open-source document processing; extract, analyze, and process data from complex documents, images, PDFs and more with AI
- Zerox: PDF to markdown vision model (OCR)
- LlamaParse (example use for multimodal parsing)
- Marker: PDFs and images to markdown
- SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion (256M; hf)
PDF Conversion
- Grobid (use online)
- Chunkr (code)
- PDF-Extract-Kit
- Gemini 2.0
- marker: converts PDFs and images to markdown, JSON, and HTML
- olmOCR (online demo, code, hf)
- surya
- Mistral OCR
- RolmOCR (hf)
PDF Language Translation
Structured Data Extraction
- Unstract: Intelligent Document Processing (IDP): No-code LLM Platform to structure unstructured documents
Screenshot
- Microsoft OmniParser: Screen Parsing tool for Pure Vision Based GUI Agent