Data Extraction
From GISAXS
Data Scraping
Web Scraping
- Firecrawl: API to turn websites into LLM-ready markdown or structured data (can be self-hosted)
- Crawl4AI: open-source, LLM-friendly web crawler and scraper
- ScrapeGraphAI ("You Only Scrape Once"): Python web scraping library that uses LLMs and direct graph logic to build scraping pipelines for websites and local documents (XML, HTML, JSON, Markdown, etc.)
- pipet: a Swiss Army knife tool for scraping and extracting data from online assets
- Scrapling: Undetectable, Lightning-Fast, and Adaptive Web Scraping for Python
- sitefetch: fetches an entire site into a single text file (for use with AI models)
- LLM Scraper: turns a webpage into structured data using LLMs
- Trafilatura: Discover and Extract Text Data on the Web
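
Most of the tools above share one basic step: reducing a page's HTML to the text a reader would actually see, so that it can be passed to an LLM. The following is a minimal sketch of that step using only Python's standard library. It does not use the API of any tool listed here, and the names `TextExtractor` and `html_to_text` are hypothetical, chosen for illustration. Real tools also handle boilerplate removal, JavaScript rendering, and Markdown formatting, which this sketch ignores.

```python
from html.parser import HTMLParser

# Tags whose contents are never reader-visible text.
SKIP_TAGS = {"script", "style", "noscript", "template"}
# Tags that close a block of text; a line break is emitted after them.
BLOCK_TAGS = {"title", "p", "div", "li", "tr", "section", "article",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Collects visible text, skipping script/style-like elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML string, one block per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    raw = "".join(parser.parts)
    # Collapse runs of whitespace within each line and drop empty lines.
    lines = (" ".join(line.split()) for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    sample = "<h1>Hello</h1><script>var x = 1;</script><p>World &amp; more</p>"
    print(html_to_text(sample))
```

In practice the HTML string would come from an HTTP fetch or a headless browser; the sketch takes a string so that it stays independent of any network library.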
llms.txt Generator
llms.txt Generator (Online)
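
An llms.txt file is a Markdown document that gives LLMs a short, curated map of a site: an H1 title, a blockquote summary, and H2 sections containing link lists. The following is a minimal sketch of what a generator produces, assuming that layout. The names `Link` and `build_llms_txt` and the example URL are hypothetical and are not taken from the generators linked here, which also crawl the site to discover pages.

```python
from dataclasses import dataclass


@dataclass
class Link:
    title: str
    url: str
    note: str = ""  # optional short description shown after the link


def build_llms_txt(name: str, summary: str, sections: dict[str, list[Link]]) -> str:
    """Render an llms.txt-style Markdown document as a string."""
    lines = [f"# {name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for link in links:
            entry = f"- [{link.title}]({link.url})"
            if link.note:
                entry += f": {link.note}"
            lines.append(entry)
        lines.append("")
    # End the document with exactly one trailing newline.
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    doc = build_llms_txt(
        "Example",
        "Demo site.",
        {"Docs": [Link("Guide", "https://example.com/guide.md", "Getting started")]},
    )
    print(doc, end="")
```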
Headless Browser (scrape & automate)
Code & GitHub
Media Files
Document Parsing
- Docling: converts multiple formats (PDF, DOCX, PPTX, Images, HTML) into Markdown and JSON
- Microsoft MarkItDown: converts various formats (PDF, Word, Excel, PowerPoint) to Markdown (available via web interface on Replit)
- e2m: Everything to Markdown (doc, docx, epub, html, htm, url, pdf, ppt, pptx, mp3, and m4a)
- NVIDIA NV-Ingest (code): scalable, performance-oriented microservice for extracting document content and metadata
- MegaParse: parser for every type of document (PDF, PowerPoint, Word)
- Rowfill: Open-source document processing; extract, analyze, and process data from complex documents, images, PDFs and more with AI
- Zerox: OCR that converts PDFs to Markdown using vision models
- LlamaParse (example use for multimodal parsing)
- Marker: converts PDFs and images to Markdown
PDF Conversion
PDF Language Translation
Screenshot
- Microsoft OmniParser: screen parsing tool for pure-vision-based GUI agents
Optical Character Recognition (OCR)