Integrate CloakBrowser to improve success rates against Cloudflare
challenges and implement more robust request handling in the Go backend.
- Add CloakBrowser integration to Dockerfile and requirements
- Implement domain-specific request semaphores in Go to prevent rate-limiting
- Add shared HTTP client with cookie jar and header preservation for
better session management
- Enhance request headers in Go to include modern client hints (Sec-Ch-Ua)
- Add benchmarking scripts to compare fetch methods (urllib vs Scrapling
vs CloakBrowser)
- Update docker-compose to support CloakBrowser environment variables
- Optimize Docker image by pre-downloading patched Chromium binaries
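
For reviewers, a minimal sketch of what a domain-specific request semaphore can look like in Go. The names `domainLimiter`, `newDomainLimiter`, and `acquire`, and the per-host limit of 2, are illustrative assumptions and are not taken from `main.go`.

```go
package main

import (
	"fmt"
	"net/url"
	"sync"
)

// domainLimiter caps the number of in-flight requests per host.
// Hypothetical type for illustration; the real code may differ.
type domainLimiter struct {
	mu    sync.Mutex
	limit int
	sems  map[string]chan struct{}
}

func newDomainLimiter(limit int) *domainLimiter {
	return &domainLimiter{limit: limit, sems: make(map[string]chan struct{})}
}

// acquire blocks until a slot for the URL's host is free and
// returns a function that releases the slot.
func (d *domainLimiter) acquire(rawURL string) (func(), error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return nil, err
	}
	host := u.Hostname()
	if host == "" {
		return nil, fmt.Errorf("no host in %q", rawURL)
	}

	// Look up (or lazily create) the buffered channel used as a semaphore.
	d.mu.Lock()
	sem, ok := d.sems[host]
	if !ok {
		sem = make(chan struct{}, d.limit)
		d.sems[host] = sem
	}
	d.mu.Unlock()

	sem <- struct{}{} // blocks while the host is at its limit
	return func() { <-sem }, nil
}

func main() {
	lim := newDomainLimiter(2) // assumed limit: 2 concurrent requests per host
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			release, err := lim.acquire("https://example.com/page")
			if err != nil {
				fmt.Println("skip:", err)
				return
			}
			defer release()
			fmt.Println("fetching", n) // the HTTP request would go here
		}(i)
	}
	wg.Wait()
}
```

Keying the semaphore on the hostname means a slow or throttled domain only delays requests to that domain; requests to other hosts proceed independently.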

This commit improves the overall efficiency and reliability of the scraper by:
- Optimizing the Dockerfile by reducing layers, using `--no-install-recommends`, and consolidating Playwright installation.
- Adding resource limits (CPU/Memory) to the docker-compose configuration.
- Refactoring `main.go`: removing unused Cloudflare client structures and increasing the cache TTL.
- Implementing a `lightweight_fetch` mechanism in `scrapling_fetch.py` that tries a fast `urllib` request first and falls back to the heavier Scrapling/Playwright engine only when that fails.
- Adding Cloudflare challenge detection to the lightweight fetcher.