Commit Graph

3 Commits

Author SHA1 Message Date
Tomas Dvorak ed61d8ab8e feat(scraper): implement CloakBrowser support and enhance request stealth
Integrate CloakBrowser to improve success rates against Cloudflare
challenges and implement more robust request handling in the Go backend.

- Add CloakBrowser integration to Dockerfile and requirements
- Implement domain-specific request semaphores in Go to prevent rate-limiting
- Add shared HTTP client with cookie jar and header preservation for
  better session management
- Enhance request headers in Go to include modern client hints (Sec-Ch-Ua)
- Add benchmarking scripts to compare fetch methods (urllib vs Scrapling
  vs CloakBrowser)
- Update docker-compose to support CloakBrowser environment variables
- Optimize Docker image by pre-downloading patched Chromium binaries
2026-05-17 17:52:52 +02:00
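
The domain-specific request semaphores mentioned above live in the Go backend, and their code is not part of this log. Purely to illustrate the idea, here is a minimal sketch in Python, the project's other language. `DomainLimiter`, `slot`, and the limit of 2 concurrent requests per host are hypothetical names and values, not taken from the repository.

```python
import threading
from contextlib import contextmanager
from urllib.parse import urlsplit


class DomainLimiter:
    """Caps concurrent requests per hostname (illustrative, not the Go code)."""

    def __init__(self, per_domain_limit=2):
        # Assumed default; the commit does not state the real limit.
        self._limit = per_domain_limit
        self._lock = threading.Lock()
        self._semaphores = {}

    def _semaphore_for(self, url):
        # urlsplit() lower-cases the hostname, so Example.com == example.com.
        host = urlsplit(url).hostname or ""
        with self._lock:
            sem = self._semaphores.get(host)
            if sem is None:
                sem = threading.BoundedSemaphore(self._limit)
                self._semaphores[host] = sem
            return sem

    @contextmanager
    def slot(self, url):
        """Block until a request slot for the URL's host is free."""
        sem = self._semaphore_for(url)
        sem.acquire()
        try:
            yield
        finally:
            sem.release()
```

A caller would wrap each outgoing request in `with limiter.slot(url): ...`, so requests to one host queue up while requests to other hosts proceed independently.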
Tomas Dvorak aa47f4309f refactor: optimize docker image and implement lightweight fetching
This commit improves the overall efficiency and reliability of the scraper by:

- Optimizing the Dockerfile by reducing layers, using `--no-install-recommends`, and consolidating Playwright installation.
- Adding resource limits (CPU/Memory) to the docker-compose configuration.
- Refactoring `main.go` to remove unused Cloudflare client structures and to increase the cache TTL.
- Implementing a `lightweight_fetch` mechanism in `scrapling_fetch.py` using `urllib` to attempt fast requests before falling back to the heavier Scrapling/Playwright engine.
- Adding Cloudflare challenge detection to the lightweight fetcher.
2026-05-11 19:50:59 +02:00
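
The fast-path-then-fallback flow described in this commit can be sketched roughly as follows. This is not the code from `scrapling_fetch.py`, which is not shown in this log. The challenge markers, the `fetch` wrapper, and the `heavy_fetch` callable (standing in for the Scrapling/Playwright engine) are assumptions made for illustration.

```python
import urllib.error
import urllib.request

# Assumed markers; the commit does not list what the real detector checks.
CHALLENGE_MARKERS = ("just a moment...", "/cdn-cgi/challenge-platform/")


def looks_like_cloudflare_challenge(status, headers, body):
    """Heuristic check for a Cloudflare challenge or block page."""
    lowered = {k.lower(): str(v).lower() for k, v in headers.items()}
    if lowered.get("cf-mitigated") == "challenge":
        return True
    if status in (403, 503) and "cloudflare" in lowered.get("server", ""):
        return True
    text = body.lower()
    return any(marker in text for marker in CHALLENGE_MARKERS)


def lightweight_fetch(url, timeout=10.0):
    """Return the page HTML, or None if the fast path failed or was challenged."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            status = response.status
            headers = dict(response.headers.items())
            body = response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # HTTPError still carries a status, headers, and a body to inspect.
        status = err.code
        headers = dict(err.headers.items())
        body = err.read().decode("utf-8", errors="replace")
    except OSError:
        # Covers URLError, timeouts, and connection failures.
        return None
    if status >= 400 or looks_like_cloudflare_challenge(status, headers, body):
        return None
    return body


def fetch(url, heavy_fetch, light_fetch=lightweight_fetch):
    """Try the cheap urllib request first; fall back to the heavy engine."""
    html = light_fetch(url)
    if html is not None:
        return html
    return heavy_fetch(url)
```

Passing the heavy engine in as a callable keeps the sketch independent of any particular browser library. In practice `heavy_fetch` would be whatever function drives the Scrapling/Playwright session.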
Tomas Dvorak 455bf61302 upload 2026-03-12 19:11:08 +01:00