Difference between revisions of "Data Extraction"

From GISAXS
(Created page with "=Data Scraping= * [https://github.com/patrickloeber/llm-data-scrapers LLM Data Scrapers] list ==Web Scraping== * [https://github.com/mendableai/firecrawl Firecrawl]: API to t...")
 
(PDF Conversion)
Line 53: Line 53:
* [https://www.philschmid.de/gemini-pdf-to-data Gemini 2.0]
* [https://github.com/VikParuchuri/marker marker]: converts PDFs and images to markdown, JSON, and HTML
+ * [https://olmocr.allenai.org/blog olmOCR] ([https://olmocr.allenai.org/ online demo], [https://github.com/allenai/olmocr code], [https://huggingface.co/collections/allenai/olmocr-67af8630b0062a25bf1b54a1 hf])
+ * [https://mistral.ai/news/mistral-ocr Mistral OCR]

==PDF Language Translation==
Line 68: Line 70:
* [https://github.com/imanoop7/Ollama-OCR Ollama OCR]
* [https://github.com/VikParuchuri/surya surya]
+ * [https://mistral.ai/news/mistral-ocr Mistral OCR]

Revision as of 15:49, 6 March 2025

Data Scraping

Web Scraping

  • Firecrawl: API to turn websites into LLM-ready markdown or structured data (can be self-hosted)
  • Crawl4AI: Crawl Smarter, Faster, Freely. For AI.
  • ScrapeGraphAI: You Only Scrape Once: web scraping python library that uses LLM and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, Markdown, etc.)
  • pipet: A swiss-army tool for scraping and extracting data from online assets
  • Scrapling: Undetectable, Lightning-Fast, and Adaptive Web Scraping for Python
  • sitefetch: Fetch entire site into text file (to be used with AIs)
  • LLM Scraper: Turn webpage into structured data using LLMs
  • Trafilatura: Discover and Extract Text Data on the Web
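
Several of the tools above turn a web page into LLM-ready text or markdown. As a rough illustration of that idea only, here is a minimal sketch using nothing but the Python standard library. It is not how Firecrawl, Trafilatura, or any other listed tool works internally; the names `TextExtractor` and `html_to_text` are hypothetical, and the handling of nested elements is deliberately simplistic.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, emitting simple markdown for headings and list items.

    Hypothetical helper for illustration; real scrapers handle far more cases.
    """

    SKIP = {"script", "style", "noscript"}            # content never shown to readers
    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}
    BLOCKS = {"p", "div", "li"}

    def __init__(self):
        super().__init__()
        self.lines = []        # finished output lines
        self._prefix = ""      # markdown prefix for the block being collected
        self._buffer = []      # text fragments of the block being collected
        self._skip_depth = 0   # > 0 while inside a skipped element

    def _flush(self):
        # Join fragments, collapse whitespace, and emit the block if non-empty.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.lines.append(self._prefix + text)
        self._buffer = []
        self._prefix = ""

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.HEADINGS:
            self._flush()
            self._prefix = self.HEADINGS[tag]
        elif tag == "li":
            self._flush()
            self._prefix = "- "
        elif tag in ("p", "div", "br"):
            self._flush()

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in self.HEADINGS or tag in self.BLOCKS:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buffer.append(data)


def html_to_text(html: str) -> str:
    """Return a plain-text/markdown rendering of the visible text in `html`."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any trailing text not closed by a block tag
    return "\n".join(parser.lines)


if __name__ == "__main__":
    sample = (
        "<html><head><style>p{}</style><script>var x=1;</script></head>"
        "<body><h1>Title</h1><p>Hello <b>world</b>.</p>"
        "<ul><li>one</li><li>two</li></ul></body></html>"
    )
    print(html_to_text(sample))
```

The dedicated libraries listed above additionally deal with fetching, JavaScript rendering, boilerplate removal, and structured output, which this sketch does not attempt.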

llms.txt Generator

llms.txt Generator (Online)

Headless Browser (scrape & automate)

Code & GitHub

Media Files

Document Parsing

PDF Conversion

PDF Language Translation

Structured Data Extraction

Screenshot

  • Microsoft OmniParser: Screen Parsing tool for Pure Vision Based GUI Agent

Optical character recognition (OCR)