Geddit
Built an automated AI-driven scraping pipeline that extracts trending topics and community data from Reddit in near real-time.
Overview
Architected a system that combines LLM-based tool calling with professional web-scraping APIs to find, track, and analyze trending Reddit content for research purposes.
Problem
Reddit data is volatile and difficult to scrape reliably at scale. Traditional methods often face rate-limiting and lack the intelligence to automatically identify which subreddits are most relevant to specific niche keywords without manual intervention.
Constraints
- Requires high-reliability scraping to bypass anti-bot measures
- Must handle long-running background tasks for data snapshots
- Needs to support structured AI output for automated database entry (see the sketch after this list)
- Infrastructure must support local development (Jupyter) and production (Django)
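A minimal sketch of the structured-output constraint, assuming LangChain's `with_structured_output` with Gemini; the `SubredditPick` schema is an illustrative stand-in for the project's actual models:

```python
# Hedged sketch of constraint 3: forcing the LLM to return schema-valid output that
# can be written straight into the database. Assumes the langchain-google-genai
# package; the SubredditPick schema is illustrative, not the project's real model.
from pydantic import BaseModel, Field
from langchain_google_genai import ChatGoogleGenerativeAI


class SubredditPick(BaseModel):
    name: str = Field(description="Subreddit name, e.g. r/django")
    relevance: float = Field(description="Relevance to the research keyword, 0-1")
    reason: str = Field(description="One-sentence justification")


llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash")
structured_llm = llm.with_structured_output(SubredditPick)

pick = structured_llm.invoke(
    "Which subreddit best matches the keyword 'self-hosted analytics'? "
    "Candidates: r/selfhosted, r/analytics, r/datahoarder."
)
print(pick.name, pick.relevance)
```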
Approach
Developed an event-driven architecture using Django and Celery. The system uses LangChain and LangGraph to coordinate 'agents' that first search for relevant communities via Bright Data's SERP API, then trigger deep crawls of those communities, processing the results through a webhook-based pipeline.
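To make the tool-calling idea concrete, here is a hedged sketch of binding a search tool to Gemini with LangChain; the `search_subreddits` tool body is a placeholder rather than the project's real Bright Data SERP integration:

```python
# Illustrative only: an LLM given a search tool it can decide to call. The tool body
# is a stub; the production agent routes this lookup through Bright Data's SERP API.
from langchain_core.tools import tool
from langchain_google_genai import ChatGoogleGenerativeAI


@tool
def search_subreddits(query: str) -> list[str]:
    """Return candidate subreddits for a research keyword."""
    return ["r/selfhosted", "r/analytics"]  # placeholder for the SERP API call


llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash").bind_tools([search_subreddits])
response = llm.invoke("Find communities discussing self-hosted analytics tools.")
print(response.tool_calls)  # the model asks to call search_subreddits with its query
```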
Key Decisions
Use LangGraph for scraping orchestration
Tool calling with standard LLM chains was too brittle for the multi-step process of searching, validating, and then scraping. LangGraph allows for stateful, cyclic graphs that can retry or branch based on the quality of search results (see the sketch after the alternatives below).
Alternatives considered:
- Linear LangChain Sequential Chains
- Hard-coded Python logic for API orchestration
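A minimal LangGraph sketch of that stateful search/validate/crawl loop. Node bodies are stubbed and the retry limit of three is an assumption; the real graph wires in Bright Data and Gemini:

```python
# Stubbed LangGraph sketch of the search -> validate -> crawl loop described above.
from typing import TypedDict

from langgraph.graph import StateGraph, START, END


class ScrapeState(TypedDict):
    keyword: str
    candidates: list[str]
    attempts: int


def search(state: ScrapeState) -> dict:
    # Placeholder for the SERP API call that finds candidate subreddits.
    return {"candidates": [f"r/{state['keyword']}"], "attempts": state["attempts"] + 1}


def validate(state: ScrapeState) -> dict:
    # Placeholder for the LLM relevance check that filters the candidates.
    return {}


def crawl(state: ScrapeState) -> dict:
    # Placeholder for triggering the asynchronous Bright Data crawl.
    return {}


def route(state: ScrapeState) -> str:
    # Branch on result quality: retry the search a few times, then proceed to crawl.
    if not state["candidates"] and state["attempts"] < 3:
        return "search"
    return "crawl"


graph = StateGraph(ScrapeState)
graph.add_node("search", search)
graph.add_node("validate", validate)
graph.add_node("crawl", crawl)
graph.add_edge(START, "search")
graph.add_edge("search", "validate")
graph.add_conditional_edges("validate", route, {"search": "search", "crawl": "crawl"})
graph.add_edge("crawl", END)

app = graph.compile()
app.invoke({"keyword": "django", "candidates": [], "attempts": 0})
```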
Implement Webhook-based Snapshot Retrieval
Crawl APIs are asynchronous and can take minutes. Using webhooks via Cloudflare Tunnels and QStash prevents blocking worker threads and ensures the system only processes data once it is fully ready (a sketch of the handler follows the alternatives below).
Alternatives considered:
- Long-polling via Celery tasks (resource intensive)
- Synchronous API requests (prone to timeouts)
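A hedged sketch of the chosen pattern: a Django view accepts the "snapshot ready" POST and hands the payload to a Celery task, so no worker thread blocks while the crawl runs. The view and task names, and the assumption that the snapshot arrives as a JSON array of post records, are illustrative:

```python
# views.py / tasks.py sketch: accept the webhook POST, enqueue the ingestion work,
# and return immediately.
import json

from celery import shared_task
from django.http import HttpResponse
from django.views.decorators.csrf import csrf_exempt
from django.views.decorators.http import require_POST


@shared_task
def ingest_snapshot(payload: list[dict]) -> None:
    # Parse the post records and write rows to Postgres (fleshed out further below).
    ...


@csrf_exempt  # webhook endpoint; not a browser form submission
@require_POST
def brightdata_webhook(request):
    payload = json.loads(request.body)
    ingest_snapshot.delay(payload)  # no blocking: a Celery worker picks it up
    return HttpResponse(status=202)
```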
Tech Stack
- Python
- Django
- LangChain / LangGraph
- Google Gemini
- Bright Data (Crawl & SERP APIs)
- PostgreSQL
- Redis
- Celery
- Cloudflare Tunnels
Result & Impact
- Data Pipeline: fully automated (search to DB)
- System Latency: asynchronous webhook processing
- Intelligence: AI-driven topic extraction
The agent effectively replaces hours of manual market research by automatically identifying relevant sub-communities and surfacing trending discussions. The integration of Django with Jupyter allowed for rapid prototyping of AI prompts that could be instantly promoted to production management commands.
Learnings
- Fuzzy query matching via LLMs significantly increases the relevance of scraped communities compared to keyword-only matching.
- Separating the scraping logic into a service-layer function makes it easier to trigger via both webhooks and CLI commands (sketched after this list).
- Pre-commit hooks to strip Jupyter notebook outputs are essential when building AI projects to avoid leaking sensitive API keys or model outputs.
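As an illustration of the service-layer learning above, a sketch with assumed module paths (`scraper/services.py` and a `discover` management command) showing how the same function can back both the CLI and the webhook path:

```python
# scraper/services.py (illustrative path): the scraping logic is one plain function
# that both the webhook handler and the CLI can import and call.
def run_discovery(keyword: str) -> None:
    """Find relevant subreddits for `keyword` and queue crawls for each one."""
    ...  # LangGraph agent + Bright Data calls live here


# scraper/management/commands/discover.py (illustrative path): a thin CLI wrapper.
from django.core.management.base import BaseCommand

from scraper.services import run_discovery


class Command(BaseCommand):
    help = "Run subreddit discovery for a keyword"

    def add_arguments(self, parser):
        parser.add_argument("keyword")

    def handle(self, *args, **options):
        run_discovery(options["keyword"])
```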
Additional Context
This project serves as a blueprint for Agentic Content Research. By bridging the gap between a robust web framework (Django) and modern AI orchestration (LangChain), the system transforms raw internet data into structured insights.
The most significant technical hurdle was managing the asynchronous nature of the Bright Data Crawl API. Instead of keeping a connection open, I implemented a robust webhook handler within Django. When a scrape event completes, Bright Data sends a POST request to a Cloudflare Tunnel URL, which triggers a Celery task to ingest the JSON snapshot into our Postgres models.
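A hedged sketch of that ingestion step; the `RedditPost` model and the snapshot field names are assumptions rather than the project's actual schema:

```python
# Illustrative Celery ingestion task: upsert the delivered JSON snapshot into
# Postgres, so re-delivered webhooks do not create duplicate rows.
from celery import shared_task

from scraper.models import RedditPost  # hypothetical app and model


@shared_task
def ingest_snapshot(snapshot: list[dict]) -> int:
    created = 0
    for item in snapshot:
        _, was_created = RedditPost.objects.update_or_create(
            reddit_id=item["id"],
            defaults={
                "subreddit": item.get("subreddit", ""),
                "title": item.get("title", ""),
                "score": item.get("score", 0),
            },
        )
        created += int(was_created)
    return created
```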
This architecture ensures that the Reddit AI Agent can scale to track hundreds of topics simultaneously without crashing the main application server, providing a seamless “set it and forget it” experience for content researchers.