Geddit
Built an automated AI-driven scraping pipeline that extracts trending topics and community data from Reddit in near real-time.
Overview
Architected a system that combines LLM-based tool calling with professional web-scraping APIs to find, track, and analyze trending Reddit content for research purposes.
Problem
Reddit data is volatile and difficult to scrape reliably at scale. Traditional methods often face rate-limiting and lack the intelligence to automatically identify which subreddits are most relevant to specific niche keywords without manual intervention.
Constraints
- Requires high-reliability scraping to bypass anti-bot measures
- Must handle long-running background tasks for data snapshots
- Needs to support structured AI output for automated database entry (see the sketch after this list)
- Infrastructure must support local development (Jupyter) and production (Django)
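A minimal sketch of the structured-output constraint, assuming LangChain's `with_structured_output` with Gemini; the `SubredditPick` schema is an illustrative stand-in for the project's actual models:

```python
# Hedged sketch of constraint 3: forcing the LLM to return schema-valid output that
# can be written straight into the database. Assumes the langchain-google-genai
# package; the SubredditPick schema is illustrative, not the project's real model.
from pydantic import BaseModel, Field
from langchain_google_genai import ChatGoogleGenerativeAI


class SubredditPick(BaseModel):
    name: str = Field(description="Subreddit name, e.g. r/django")
    relevance: float = Field(description="Relevance to the research keyword, 0-1")
    reason: str = Field(description="One-sentence justification")


llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash")
structured_llm = llm.with_structured_output(SubredditPick)

pick = structured_llm.invoke(
    "Which subreddit best matches the keyword 'self-hosted analytics'? "
    "Candidates: r/selfhosted, r/analytics, r/datahoarder."
)
print(pick.name, pick.relevance)
```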
Approach
Developed an event-driven architecture using Django and Celery. The system uses LangChain and LangGraph to coordinate 'agents' that first search for relevant communities via Bright Data's SERP API, then trigger deep crawls of those communities, processing the results through a webhook-based pipeline.
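To make the tool-calling idea concrete, here is a hedged sketch of binding a search tool to Gemini with LangChain; the `search_subreddits` tool body is a placeholder rather than the project's real Bright Data SERP integration:

```python
# Illustrative only: an LLM given a search tool it can decide to call. The tool body
# is a stub; the production agent routes this lookup through Bright Data's SERP API.
from langchain_core.tools import tool
from langchain_google_genai import ChatGoogleGenerativeAI


@tool
def search_subreddits(query: str) -> list[str]:
    """Return candidate subreddits for a research keyword."""
    return ["r/selfhosted", "r/analytics"]  # placeholder for the SERP API call


llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash").bind_tools([search_subreddits])
response = llm.invoke("Find communities discussing self-hosted analytics tools.")
print(response.tool_calls)  # the model asks to call search_subreddits with its query
```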
Key Decisions
Use LangGraph for scraping orchestration
Tool calling with standard LLM chains was too brittle for the multi-step process of searching, validating, and then scraping. LangGraph allows for stateful, cyclic graphs that can retry or branch based on the quality of search results (see the sketch after the alternatives below).
Alternatives considered:
- Linear LangChain Sequential Chains
- Hard-coded Python logic for API orchestration
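A minimal LangGraph sketch of that stateful search/validate/crawl loop. Node bodies are stubbed and the retry limit of three is an assumption; the real graph wires in Bright Data and Gemini:

```python
# Stubbed LangGraph sketch of the search -> validate -> crawl loop described above.
from typing import TypedDict

from langgraph.graph import StateGraph, START, END


class ScrapeState(TypedDict):
    keyword: str
    candidates: list[str]
    attempts: int


def search(state: ScrapeState) -> dict:
    # Placeholder for the SERP API call that finds candidate subreddits.
    return {"candidates": [f"r/{state['keyword']}"], "attempts": state["attempts"] + 1}


def validate(state: ScrapeState) -> dict:
    # Placeholder for the LLM relevance check that filters the candidates.
    return {}


def crawl(state: ScrapeState) -> dict:
    # Placeholder for triggering the asynchronous Bright Data crawl.
    return {}


def route(state: ScrapeState) -> str:
    # Branch on result quality: retry the search a few times, then proceed to crawl.
    if not state["candidates"] and state["attempts"] < 3:
        return "search"
    return "crawl"


graph = StateGraph(ScrapeState)
graph.add_node("search", search)
graph.add_node("validate", validate)
graph.add_node("crawl", crawl)
graph.add_edge(START, "search")
graph.add_edge("search", "validate")
graph.add_conditional_edges("validate", route, {"search": "search", "crawl": "crawl"})
graph.add_edge("crawl", END)

app = graph.compile()
app.invoke({"keyword": "django", "candidates": [], "attempts": 0})
```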
Implement Webhook-based Snapshot Retrieval
Crawl APIs are asynchronous and can take minutes. Using webhooks via Cloudflare Tunnels and QStash prevents blocking worker threads and ensures the system only processes data once it is fully ready (a sketch of the handler follows the alternatives below).
Alternatives considered:
- Long-polling via Celery tasks (resource intensive)
- Synchronous API requests (prone to timeouts)
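A hedged sketch of the chosen pattern: a Django view accepts the "snapshot ready" POST and hands the payload to a Celery task, so no worker thread blocks while the crawl runs. The view and task names, and the assumption that the snapshot arrives as a JSON array of post records, are illustrative:

```python
# views.py / tasks.py sketch: accept the webhook POST, enqueue the ingestion work,
# and return immediately.
import json

from celery import shared_task
from django.http import HttpResponse
from django.views.decorators.csrf import csrf_exempt
from django.views.decorators.http import require_POST


@shared_task
def ingest_snapshot(payload: list[dict]) -> None:
    # Parse the post records and write rows to Postgres (fleshed out further below).
    ...


@csrf_exempt  # webhook endpoint; not a browser form submission
@require_POST
def brightdata_webhook(request):
    payload = json.loads(request.body)
    ingest_snapshot.delay(payload)  # no blocking: a Celery worker picks it up
    return HttpResponse(status=202)
```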
Tech Stack
- Python
- Django
- LangChain / LangGraph
- Google Gemini
- Bright Data (Crawl & SERP APIs)
- PostgreSQL
- Redis
- Celery
- Cloudflare Tunnels
Result & Impact
- Data Pipeline: fully automated (search to DB)
- System Latency: asynchronous webhook processing
- Intelligence: AI-driven topic extraction
The agent effectively replaces hours of manual market research by automatically identifying relevant sub-communities and surfacing trending discussions. The integration of Django with Jupyter allowed for rapid prototyping of AI prompts that could be instantly promoted to production management commands.
Learnings
- Fuzzy query matching via LLMs significantly increases the relevance of scraped communities compared to keyword-only matching.
- Separating the scraping logic into a service-layer function makes it easier to trigger via both webhooks and CLI commands (sketched after this list).
- Pre-commit hooks to strip Jupyter notebook outputs are essential when building AI projects to avoid leaking sensitive API keys or model outputs.
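As an illustration of the service-layer learning above, a sketch with assumed module paths (`scraper/services.py` and a `discover` management command) showing how the same function can back both the CLI and the webhook path:

```python
# scraper/services.py (illustrative path): the scraping logic is one plain function
# that both the webhook handler and the CLI can import and call.
def run_discovery(keyword: str) -> None:
    """Find relevant subreddits for `keyword` and queue crawls for each one."""
    ...  # LangGraph agent + Bright Data calls live here


# scraper/management/commands/discover.py (illustrative path): a thin CLI wrapper.
from django.core.management.base import BaseCommand

from scraper.services import run_discovery


class Command(BaseCommand):
    help = "Run subreddit discovery for a keyword"

    def add_arguments(self, parser):
        parser.add_argument("keyword")

    def handle(self, *args, **options):
        run_discovery(options["keyword"])
```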
Additional Context
This project serves as a blueprint for Agentic Content Research. By bridging the gap between a robust web framework (Django) and modern AI orchestration (LangChain), the system transforms raw internet data into structured insights.
The most significant technical hurdle was managing the asynchronous nature of the Bright Data Crawl API. Instead of keeping a connection open, I implemented a robust webhook handler within Django. When a scrape event completes, Bright Data sends a POST request to a Cloudflare Tunnel URL, which triggers a Celery task to ingest the JSON snapshot into our Postgres models.
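A hedged sketch of that ingestion step; the `RedditPost` model and the snapshot field names are assumptions rather than the project's actual schema:

```python
# Illustrative Celery ingestion task: upsert the delivered JSON snapshot into
# Postgres, so re-delivered webhooks do not create duplicate rows.
from celery import shared_task

from scraper.models import RedditPost  # hypothetical app and model


@shared_task
def ingest_snapshot(snapshot: list[dict]) -> int:
    created = 0
    for item in snapshot:
        _, was_created = RedditPost.objects.update_or_create(
            reddit_id=item["id"],
            defaults={
                "subreddit": item.get("subreddit", ""),
                "title": item.get("title", ""),
                "score": item.get("score", 0),
            },
        )
        created += int(was_created)
    return created
```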
This architecture ensures that the Reddit AI Agent can scale to track hundreds of topics simultaneously without crashing the main application server, providing a seamless “set it and forget it” experience for content researchers.