Difference between revisions of "Data Extraction"

From GISAXS
(llms.txt Generator (Online))
(Code & Github)
 
(6 intermediate revisions by the same user not shown)
 
==llms.txt Generator (Online)==
 
 
* Firecrawl [https://llmstxt.firecrawl.dev/ LLMs.txt generator] (online tool, [https://docs.firecrawl.dev/features/alpha/llmstxt API])
 
** Prepend llmstxt.new/ to any URL; e.g. [http://llmstxt.new/http://yager-research.ca http://llmstxt.new/http://yager-research.ca]
 
* Jina AI [https://github.com/jina-ai/reader Reader]: Convert any URL to an LLM-friendly input with a prefix: https://r.jina.ai/
 
* [https://www.arxiv-txt.org/ arXiv-txt]: Convert any arXiv paper to an llms.txt version
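
The prefix-style tools listed above (llmstxt.new and Jina AI Reader) only need the target URL appended to a fixed prefix. A minimal sketch of that URL construction follows; the function names are hypothetical helpers, not part of either service, and the code only builds the URL string without fetching it.

```python
# Prefixes taken from the list above.
LLMSTXT_NEW_PREFIX = "http://llmstxt.new/"
JINA_READER_PREFIX = "https://r.jina.ai/"


def llmstxt_new_url(url: str) -> str:
    """Build the llmstxt.new address for a page (hypothetical helper)."""
    return LLMSTXT_NEW_PREFIX + url


def jina_reader_url(url: str) -> str:
    """Build the Jina AI Reader address for a page (hypothetical helper)."""
    return JINA_READER_PREFIX + url


if __name__ == "__main__":
    # Reproduces the example link given in the list above.
    print(llmstxt_new_url("http://yager-research.ca"))
    # -> http://llmstxt.new/http://yager-research.ca
    print(jina_reader_url("http://yager-research.ca"))
```

The resulting address can then be opened in a browser or passed to any HTTP client.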
  
 
==Headless Browser (scrape & automate)==
 
 
* [https://github.com/tesserato/CodeWeaver CodeWeaver]: Generate a Markdown Document of Your Codebase Structure and Content
 
 
* [https://github.com/Doriandarko/RepoToTextForLLMs RepoToTextForLLMs]
 
* [https://uithub.com/ uithub]: Change the 'g' in github.com to 'u' (uithub.com) in a repository URL to get an LLM-ready text version
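
The uithub entry above amounts to swapping the host name in a repository URL. A minimal sketch, assuming only the host changes and the rest of the URL is kept as-is; `to_uithub` is a hypothetical helper name, not part of uithub itself.

```python
from urllib.parse import urlsplit, urlunsplit


def to_uithub(url: str) -> str:
    """Rewrite a github.com URL to its uithub.com counterpart (hypothetical helper)."""
    parts = urlsplit(url)
    # Only rewrite genuine GitHub hosts; anything else is rejected.
    if parts.netloc.lower() not in ("github.com", "www.github.com"):
        raise ValueError(f"not a github.com URL: {url}")
    # Replace the host and leave scheme, path, query, and fragment untouched.
    return urlunsplit(parts._replace(netloc="uithub.com"))


if __name__ == "__main__":
    print(to_uithub("https://github.com/tesserato/CodeWeaver"))
    # -> https://uithub.com/tesserato/CodeWeaver
```

Parsing the URL rather than doing a plain string replacement avoids touching the word "github" if it also appears in the repository path.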
===Related===
* [https://gitdiagram.com/ gitdiagram]: Repository to diagram
  
 
==Media Files==
 
 
* [https://docs.llamaindex.ai/en/stable/llama_cloud/llama_parse/ LlamaParse] ([https://github.com/run-llama/llama_cloud_services/blob/main/examples/parse/multimodal/gemini2_flash.ipynb example use for multimodal parsing])
 
 
* [https://github.com/VikParuchuri/marker Marker]: PDFs and images to markdown
 
* [https://arxiv.org/abs/2503.11576 SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion] 256M ([https://huggingface.co/ds4sd/SmolDocling-256M-preview hf])
  
 
==PDF Conversion==
 
 
* [https://github.com/VikParuchuri/surya surya]
 
 
* [https://mistral.ai/news/mistral-ocr Mistral OCR]
 
* [https://reducto.ai/blog/introducing-rolmocr-open-source-ocr-model RolmOCR] ([https://huggingface.co/reducto/RolmOCR hf])

Latest revision as of 13:34, 18 April 2025

Data Scraping

Web Scraping

  • Firecrawl: API to turn websites into LLM-ready markdown or structured data (can be self-hosted)
  • Crawl4AI: Crawl Smarter, Faster, Freely. For AI.
  • ScrapeGraphAI: You Only Scrape Once: web scraping python library that uses LLM and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, Markdown, etc.)
  • pipet: A swiss-army tool for scraping and extracting data from online assets
  • Scrapling: Undetectable, Lightning-Fast, and Adaptive Web Scraping for Python
  • sitefetch: Fetch entire site into text file (to be used with AIs)
  • LLM Scraper: Turn webpage into structured data using LLMs
  • Trafilatura: Discover and Extract Text Data on the Web

llms.txt Generator

llms.txt Generator (Online)

Headless Browser (scrape & automate)

Code & Github

Related

Media Files

Document Parsing

PDF Conversion

PDF Language Translation

Structured Data Extraction

Screenshot

  • Microsoft OmniParser: Screen Parsing tool for Pure Vision Based GUI Agent

Optical character recognition (OCR)