Difference between revisions of "Data Extraction"

From GISAXS
(Document Parsing)
Line 43:
  =Document Parsing=
+ * [https://idp-leaderboard.org/ Intelligent Document Processing Leaderboard] A unified leaderboard for OCR, KIE, classification, QA, table extraction, and confidence score evaluation
+
+ ==Systems==
  * [https://github.com/DS4SD/docling Docling]: converts multiple formats (PDF, DOCX, PPTX, Images, HTML) into Markdown and JSON
  * [https://github.com/microsoft/markitdown Microsoft Markitdown]: converts various formats (PDF, Word, Excel, PPT) to Markdown (available via [https://msftmd.replit.app/ web interface on replit])

Latest revision as of 08:18, 12 May 2025

Data Scraping

Web Scraping

  • Firecrawl: API to turn websites into LLM-ready markdown or structured data (can be self-hosted)
  • Crawl4AI: Crawl Smarter, Faster, Freely. For AI.
  • ScrapeGraphAI: You Only Scrape Once. A Python web-scraping library that uses LLMs and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, Markdown, etc.)
  • pipet: A Swiss-army tool for scraping and extracting data from online assets
  • Scrapling: Undetectable, Lightning-Fast, and Adaptive Web Scraping for Python
  • sitefetch: Fetch an entire site into a single text file (for use with AI models)
  • LLM Scraper: Turn webpage into structured data using LLMs
  • Trafilatura: Discover and Extract Text Data on the Web
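Most of the tools above share one core step: reducing a page's HTML to plain text that an LLM can consume. The sketch below illustrates only that step, using Python's standard library. It is not how any of the listed tools is implemented, and the names `TextExtractor` and `html_to_text` are made up for this example. Real scrapers also handle fetching, JavaScript rendering, boilerplate removal, and Markdown formatting.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []      # text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML string, one fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


sample = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Title</h1><p>Hello <b>world</b></p>"
    "<script>var x=1;</script></body></html>"
)
print(html_to_text(sample))
```

For the sample string, the style rule and the script body are dropped and only the heading and paragraph text remain. Each text fragment lands on its own line, so inline markup such as `<b>` splits a sentence; a dedicated library from the list above would handle that more gracefully.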

llms.txt Generator

llms.txt Generator (Online)

Headless Browser (scrape & automate)

Code & GitHub

Related

Media Files

Document Parsing

Systems

PDF Conversion

PDF Language Translation

Structured Data Extraction

Screenshot

  • Microsoft OmniParser: Screen Parsing tool for Pure Vision Based GUI Agent

Optical character recognition (OCR)