How tgadsspy Works: Technical Deep Dive into the Classifier and Ingest Pipeline
Technical documentation of the tgadsspy data pipeline — gramesh API integration, niche classification architecture (regex + weights), geo classifier 3-step pipeline, media mirror SHA256 content-addressed storage, and aggregation caching. For developers, researchers and compliance teams.
Purpose and audience
This document is a technical deep-dive into how Telegram Ads Spy collects, classifies, and serves Telegram ad data. It supplements the shorter methodology overview at /about with implementation-level detail: API integration, classification logic, storage architecture, and caching. Primary audience: developers building on the public API, researchers who need to understand data provenance for citation, and compliance teams assessing the system's OSINT methodology.
1. Data source: gramesh API
All ad data in Telegram Ads Spy originates from a single source: the gramesh HTTP API at api.wall.systems/gramesh. gramesh is a proxy/aggregation layer over Telegram's MTProto protocol, exposing REST endpoints that return structured JSON. Telegram Ads Spy uses gramesh exclusively — no direct MTProto implementation, no scraping.
Key endpoints used
POST /channels.getSponsored
- Input: { channel_id: <int>, dc_id: <int> }
- Output: array of sponsored message objects for the given channel, in the given Telegram data-centre region
- Includes: title, text, ctaUrl, ctaLabel, accentColor, mediaType, mediaUrl, ctaTargetUsername
- Media URLs: signed, 1-hour TTL (/files/photo/<id>?sig=&exp=)
- Rate limit: 10 RPS against gramesh; Telegram Ads Spy throttles the ingest cron to stay within it
POST /channels.getInfo
- Input: { username: <string> | id: <int> }
- Output: channel metadata (id, title, username, description, memberCount, avatarUrl)
- Used by the resolver-cron to hydrate Channel placeholders
POST /contacts.search
- Input: { q: <string> }
- Output: array of channel objects matching the query
- Used by the discover-cron with 47 rotating seed queries
- Used by /api/v1/submit when a user submits a t.me URL
POST /channels.getSimilar
- Input: { channel_id: <int> }
- Output: similar channels as recommended by Telegram's internal similarity model
- Used by the discover-similar in-process BFS spider
Fan-out by region
Telegram has multiple data-centre clusters (DC1–DC5). Sponsored messages are region-specific: a channel in DC2 may show different sponsored messages than the same channel viewed from DC4. Telegram Ads Spy performs multi-region ingest for high-tier channels: the same channel is queried against multiple dc_id values to increase creative coverage. This is the mechanism behind "seen in multiple geos" — different DCs serve different advertiser targeting.
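A minimal sketch of how such a fan-out could be planned, assuming one /channels.getSponsored call per (channel, DC) pair and even spacing to respect the 10 RPS limit listed under the endpoint. The names `planFanOut`, `SponsoredRequest`, and `ScheduledCall` are hypothetical; only the request shape and the rate limit come from the text above.

```typescript
// Request body for POST /channels.getSponsored, as described in the endpoint list.
interface SponsoredRequest {
  channel_id: number;
  dc_id: number;
}

// Hypothetical planning record: the body plus the offset (ms) at which to send it.
interface ScheduledCall {
  body: SponsoredRequest;
  sendAtMs: number;
}

// Build one request per (channel, DC) pair and space the calls evenly so the
// average request rate does not exceed `maxRps`.
function planFanOut(
  channelIds: number[],
  dcIds: number[],
  maxRps = 10, // rate limit stated for gramesh
): ScheduledCall[] {
  const spacingMs = 1000 / maxRps;
  const calls: ScheduledCall[] = [];
  for (const channel_id of channelIds) {
    for (const dc_id of dcIds) {
      calls.push({
        body: { channel_id, dc_id },
        sendAtMs: calls.length * spacingMs,
      });
    }
  }
  return calls;
}
```

Planning the schedule separately from sending keeps the throttle easy to reason about: two channels queried against two DCs yield four calls, 100 ms apart at the default limit.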
2. Channel pool and tiering
The ingest pipeline operates on a pool of more than 9,000 channels. Channels are tiered by member count, which determines ingest frequency:
| Tier | Member count | Ingest interval | Rationale |
|---|---|---|---|
| S (Super) | 1M+ | 30 minutes | High ad density, fast creative turnover |
| A | 100k–1M | 2 hours | Active market, moderate turnover |
| B | 10k–100k | 8 hours | Adequate for daily coverage |
| C | 1k–10k | 72 hours | Low ad density, spot-checking sufficient |
| Placeholder | Unknown | Not ingested | Awaiting resolver to hydrate |
The resolver-cron (every 15 min) picks up placeholder channels — those submitted via seed batches, user submission, or discovery — and calls /channels.getInfo to populate memberCount, title, and avatarUrl. Once resolved, the channel is tiered and added to the ingest queue.
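The tiering table can be sketched as a pair of small functions. This is an illustration only: the names `tierFor`, `isDue`, and `INGEST_INTERVAL_MIN` are hypothetical, lower bounds are assumed to be inclusive (the table does not say which tier a channel with exactly 1M members falls into), and channels under 1k members are returned as `null` because the table does not list a tier for them.

```typescript
type Tier = "S" | "A" | "B" | "C" | "Placeholder";

// Ingest interval in minutes per tier, copied from the table above.
const INGEST_INTERVAL_MIN: Record<Exclude<Tier, "Placeholder">, number> = {
  S: 30,
  A: 120,
  B: 480,
  C: 4320,
};

// Map a resolved member count to a tier. `null` input means the resolver
// has not hydrated the channel yet.
function tierFor(memberCount: number | null): Tier | null {
  if (memberCount === null) return "Placeholder";
  if (memberCount >= 1_000_000) return "S";
  if (memberCount >= 100_000) return "A";
  if (memberCount >= 10_000) return "B";
  if (memberCount >= 1_000) return "C";
  return null; // assumption: below 1k the channel is not tiered at all
}

// A channel is due when its tier interval has elapsed since the last ingest.
function isDue(tier: Tier, lastIngestAt: Date | null, now: Date): boolean {
  if (tier === "Placeholder") return false; // not ingested until resolved
  if (lastIngestAt === null) return true; // never ingested
  const elapsedMs = now.getTime() - lastIngestAt.getTime();
  return elapsedMs >= INGEST_INTERVAL_MIN[tier] * 60_000;
}
```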
3. Creative deduplication
When the ingest cron calls /channels.getSponsored and receives creatives, deduplication happens before any storage:
creative fingerprint = sha256(title + text + ctaUrl + ctaLabel + accentColor)
A creative is considered new only if its fingerprint hasn't been seen before. This means:
- The same ad running in 100 channels produces one AdCreative record (not 100)
- Each appearance in a channel produces one SponsoredImpression record pointing to the AdCreative
- Minor variations (different accentColor with same text) are treated as different creatives; this is intentional, as colour variants are used in A/B testing
Creative lifecycle: A creative is first seen when its fingerprint first appears. It's considered "active" while it continues appearing in new ingest ticks. AdCreative.lastSeenAt is updated on each new impression. When a creative stops appearing, it transitions to inactive naturally — no explicit deactivation signal from gramesh.
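The fingerprint and the one-creative-many-impressions model can be sketched as follows. This is a hedged illustration, not the production code: `CreativeStore` is a hypothetical in-memory stand-in for the AdCreative and SponsoredImpression tables, and the fields are concatenated without a separator because that is how the formula above is written.

```typescript
import { createHash } from "node:crypto";

interface CreativeFields {
  title: string;
  text: string;
  ctaUrl: string;
  ctaLabel: string;
  accentColor: string;
}

// sha256 over the concatenated fields, following the formula above.
// Whether the real pipeline inserts a field separator is not stated.
function creativeFingerprint(c: CreativeFields): string {
  const payload = c.title + c.text + c.ctaUrl + c.ctaLabel + c.accentColor;
  return createHash("sha256").update(payload, "utf8").digest("hex");
}

// Hypothetical in-memory store: one entry per fingerprint, one impression per sighting.
class CreativeStore {
  private creatives = new Map<string, { firstSeenAt: Date; lastSeenAt: Date }>();
  impressions: { fingerprint: string; channelId: number; seenAt: Date }[] = [];

  // Records a sighting and returns true when the creative was not seen before.
  record(c: CreativeFields, channelId: number, seenAt: Date): boolean {
    const fingerprint = creativeFingerprint(c);
    const existing = this.creatives.get(fingerprint);
    if (existing) {
      existing.lastSeenAt = seenAt; // mirrors AdCreative.lastSeenAt
    } else {
      this.creatives.set(fingerprint, { firstSeenAt: seenAt, lastSeenAt: seenAt });
    }
    this.impressions.push({ fingerprint, channelId, seenAt });
    return existing === undefined;
  }

  get creativeCount(): number {
    return this.creatives.size;
  }
}
```

Recording the same fields from three channels leaves `creativeCount` at 1 with three impressions; changing only `accentColor` produces a second creative.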
4. Niche classification
Every AdCreative is classified into one or more niches. The classifier is a weighted keyword-plus-brand-detection system implemented in lib/niche.ts.
Architecture
The classifier operates on the concatenated text of title + text + ctaUrl. It runs two passes:
Pass 1: Brand detection
A lookup against a dictionary of ~400 known advertiser brands, mapped to their primary niche. Example entries:
- binance → crypto
- 1xbet → gambling
- nordvpn → vpn
- exness → forex
- dream11 → betting
Brand matches carry high weight (w=10) and dominate the classification when present. The brand dictionary is maintained in lib/niche-brands.ts and updated as new advertisers are identified.
Pass 2: Keyword scoring
For each niche, a set of regex patterns is evaluated against the creative text. Each match adds a positive weight to that niche's score. Patterns are designed to avoid false positives through:
- Specificity: "USDT P2P exchange" is a crypto signal; "exchange" alone is too generic
- Negation rules: Some patterns carry negative weight to suppress false positives (e.g., "slot" appearing in a tech context)
- Language variants: Patterns include Arabic, Russian, Indonesian, Thai and other language variants for the major niches
Score threshold: A niche is assigned if its score exceeds a minimum threshold. Multiple niches can be assigned — a creative can be gambling + crypto if it's a crypto-casino (e.g., BC.Game).
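A compact sketch of the two-pass scoring described above. The brand entries and the brand weight of 10 come from the text; the regex rules, their weights, the threshold value, and the function name `classifyNiches` are hypothetical placeholders, and brand matching is simplified to a substring test.

```typescript
// Brand → niche pairs (the examples listed above; the real dictionary has ~400).
const BRANDS: [string, string][] = [
  ["binance", "crypto"],
  ["1xbet", "gambling"],
  ["nordvpn", "vpn"],
  ["exness", "forex"],
  ["dream11", "betting"],
];
const BRAND_WEIGHT = 10; // stated above

// Hypothetical keyword rules; a negative weight suppresses a false positive.
interface Rule { niche: string; pattern: RegExp; weight: number }
const RULES: Rule[] = [
  { niche: "crypto", pattern: /\busdt\b.*\bp2p\b|\bp2p\b.*\busdt\b/, weight: 4 },
  { niche: "gambling", pattern: /\bcasino\b|\bjackpot\b/, weight: 4 },
  { niche: "gambling", pattern: /\bslots?\b/, weight: 3 },
  { niche: "gambling", pattern: /\b(pci|sim|ram|memory)\s+slots?\b/, weight: -3 },
  { niche: "vpn", pattern: /\bvpn\b/, weight: 4 },
];
const THRESHOLD = 2; // hypothetical minimum; a niche is assigned when score > THRESHOLD

function classifyNiches(title: string, text: string, ctaUrl: string): string[] {
  // The classifier input is title + text + ctaUrl, lowercased here for matching.
  const haystack = `${title} ${text} ${ctaUrl}`.toLowerCase();
  const scores = new Map<string, number>();
  const add = (niche: string, w: number) =>
    scores.set(niche, (scores.get(niche) ?? 0) + w);

  // Pass 1: brand detection.
  for (const [brand, niche] of BRANDS) {
    if (haystack.includes(brand)) add(niche, BRAND_WEIGHT);
  }
  // Pass 2: weighted keyword scoring.
  for (const rule of RULES) {
    if (rule.pattern.test(haystack)) add(rule.niche, rule.weight);
  }
  // Every niche above the threshold is assigned, highest score first.
  return Array.from(scores.entries())
    .filter(([, score]) => score > THRESHOLD)
    .sort((a, b) => b[1] - a[1])
    .map(([niche]) => niche);
}
```

With these placeholder rules, a creative mentioning a "PCI slot" scores +3 and -3 for gambling and is left unclassified, while a creative matching both a crypto rule and a gambling rule receives both niches.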
Niche taxonomy
Current top-level niches (as of April 2026):
crypto, trading, forex, fintech, gambling (casino), betting (sports), vpn, dating, news, education, gaming, retail, tech, bots, adult, signals, remittance, ai, other
Sub-niches are assigned within the niche-meta.ts taxonomy for display grouping. The classification schema is append-only — niches are never removed, only new ones added.
Classification accuracy
We validate accuracy through:
- Spot-check sampling: periodic manual review of 50 random creatives per niche
- Brand-miss audit: if a known brand is misclassified, the brand dictionary is updated
- False positive rate: estimated at ~4% based on last sampling round (January 2026)
Known limitations:
- Short text-only creatives with no brand signal have ~12% misclassification rate
- New brands not yet in the dictionary are initially classified by keyword only
- Multilingual edge cases (mixed script creatives) occasionally confuse the keyword scorer
5. Geo classification
Every creative receives a geo assignment (ISO 3166-1 alpha-2 country code or regional code). The geo classifier is a 3-step cascade:
Step 1: CTA URL TLD analysis
The CTA URL's top-level domain is parsed:
- .ru → RU
- .com.br, .br → BR
- .pk → PK
- .sa → SA
- .eu alone is ambiguous (treated as EU rather than a specific country)
Country-code TLDs provide high-confidence geo signals. If Step 1 produces a non-ambiguous result, the cascade stops.
Step 2: Language detection on creative text
If Step 1 yields no country signal (e.g., a generic .com domain), the language of the creative text is detected using Unicode block analysis and a fastText-family language identifier:
- Arabic script → AR (regional)
- Cyrillic → RU/CIS (default RU unless Step 3 disambiguates)
- Devanagari → HI (India likely)
- Hangul → KR
- Hiragana/Katakana → JP
- Thai → TH
- Bengali → BD
- Urdu (Arabic script + language model) → PK
Step 3: Channel-level geo aggregation
The channel in which the creative appeared has its own geo signal (from language, name, description, and prior ingest history). When a creative appears in channels with consistent geo signals, the creative inherits that geo. For example, a .com domain English creative that appears predominantly in Russian-language channels is classified as RU.
Multiple geo assignment: A creative can have multiple geo codes when it demonstrably targets multiple markets (common for Binance global campaigns). In the UI, multi-geo creatives appear in all relevant geo filter segments.
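The three-step cascade can be sketched as a chain of fallbacks. This is a simplified illustration under several assumptions: only the country-code TLDs listed in Step 1 are included (.eu is left out), Step 2 uses Unicode block ranges only (the language-model part, and therefore the Urdu case, is omitted), Step 3 uses a hypothetical "more than half of the channels" rule, and the cascade returns a single code. Geo codes are copied as written above; all function names are hypothetical.

```typescript
// Step 1: country-code TLDs listed above.
const TLD_GEO = new Map<string, string>([
  ["ru", "RU"], ["br", "BR"], ["pk", "PK"], ["sa", "SA"],
]);

// Step 2: Unicode block → geo code, using the mapping given above.
const SCRIPT_GEO: { re: RegExp; geo: string }[] = [
  { re: /[\u0400-\u04FF]/, geo: "RU" }, // Cyrillic (default RU)
  { re: /[\u0900-\u097F]/, geo: "HI" }, // Devanagari
  { re: /[\uAC00-\uD7AF]/, geo: "KR" }, // Hangul syllables
  { re: /[\u3040-\u30FF]/, geo: "JP" }, // Hiragana and Katakana
  { re: /[\u0E00-\u0E7F]/, geo: "TH" }, // Thai
  { re: /[\u0980-\u09FF]/, geo: "BD" }, // Bengali
  { re: /[\u0600-\u06FF]/, geo: "AR" }, // Arabic script (regional)
];

function geoFromTld(ctaUrl: string): string | null {
  let host: string;
  try {
    host = new URL(ctaUrl).hostname;
  } catch {
    return null; // not a parseable absolute URL
  }
  const tld = (host.split(".").pop() ?? "").toLowerCase();
  return TLD_GEO.get(tld) ?? null;
}

function geoFromScript(text: string): string | null {
  for (const { re, geo } of SCRIPT_GEO) {
    if (re.test(text)) return geo;
  }
  return null;
}

// Step 3: inherit the geo shared by a majority of the channels the creative ran in.
function geoFromChannels(channelGeos: string[], minShare = 0.5): string | null {
  const counts = new Map<string, number>();
  for (const g of channelGeos) counts.set(g, (counts.get(g) ?? 0) + 1);
  for (const [geo, count] of Array.from(counts)) {
    if (count / channelGeos.length > minShare) return geo;
  }
  return null;
}

function classifyGeo(ctaUrl: string, text: string, channelGeos: string[]): string | null {
  return geoFromTld(ctaUrl) ?? geoFromScript(text) ?? geoFromChannels(channelGeos);
}
```

An English creative on a .com domain falls through the first two steps and picks up its geo from the channels it appeared in, which matches the Russian-language-channel example in Step 3.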
6. Media mirror
gramesh provides signed URLs to Telegram's media CDN with a 1-hour TTL. These URLs expire and become inaccessible, making them unsuitable for permanent archiving.
The Telegram Ads Spy-media-mirror cron (runs every 5 minutes) processes newly ingested creatives with unmirrored media:
- Fetch: HTTP GET to the gramesh-signed media URL
- Hash: SHA-256 of the raw binary content
- Store: Write to /var/www/tgadsspy-media/<prefix>/<sha256-hex>.<ext> on the server
  - prefix = first 2 hex characters of the SHA256 (256-bucket directory sharding)
  - ext = inferred from Content-Type header
- Update: AdCreative.mediaUrl is updated from the gramesh URL to /m/<prefix>/<sha256-hex>.<ext>
The nginx alias serves /m/ paths from the media storage directory with:
Cache-Control: public, immutable, max-age=31536000
A one-year immutable cache is safe here because storage is content-addressed: the file name is the hash of the bytes, so the content behind a given /m/ URL never changes.
Deduplication: Two different ads using the same banner image produce a single stored file (same hash). The file is stored once; both creatives reference the same /m/... URL.
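The hash, store, and update steps reduce to deriving two paths from the file's bytes. A minimal sketch, assuming Node's built-in crypto module; `mirrorPaths` and the Content-Type table are hypothetical, and the `bin` fallback extension is an assumption, not something the pipeline is known to do.

```typescript
import { createHash } from "node:crypto";

// Hypothetical Content-Type → extension table.
const EXT_BY_TYPE = new Map<string, string>([
  ["image/jpeg", "jpg"],
  ["image/png", "png"],
  ["image/webp", "webp"],
  ["video/mp4", "mp4"],
]);

interface MirrorPaths {
  sha256: string;
  diskPath: string;  // where the cron writes the file
  publicUrl: string; // value stored in AdCreative.mediaUrl
}

function mirrorPaths(content: Uint8Array, contentType: string): MirrorPaths {
  const sha256 = createHash("sha256").update(content).digest("hex");
  const prefix = sha256.slice(0, 2); // 256-bucket directory sharding
  // Drop parameters such as "; charset=..." before the lookup.
  const mime = contentType.split(";")[0].trim().toLowerCase();
  const ext = EXT_BY_TYPE.get(mime) ?? "bin"; // assumed fallback
  return {
    sha256,
    diskPath: `/var/www/tgadsspy-media/${prefix}/${sha256}.${ext}`,
    publicUrl: `/m/${prefix}/${sha256}.${ext}`,
  };
}
```

Because the path depends only on the bytes and the extension, two creatives that share a banner resolve to the same `publicUrl`, which is the deduplication behaviour described above.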
Fallback: For creatives where gramesh doesn't return a banner (text-only or channel-pic format), a secondary nightly cron (Telegram Ads Spy-creative-media) fetches og:image from the CTA URL domain as a fallback thumbnail. A third fallback uses the target channel's avatar URL.
7. Advertiser extraction
Advertiser identity is derived from the CTA URL, not from any Telegram-provided field:
Domain advertiser: if ctaUrl is an external URL, the registered domain (e.g., binance.com from https://www.binance.com/en/referral?...) becomes an Advertiser record with type: domain. The full URL (including UTM parameters and referral codes) is preserved on the AdCreative record for competitive analysis.
Telegram advertiser: if ctaUrl is a t.me/<username> URL, the username becomes an Advertiser record with type: telegram. The channel is also added to the discovery queue if not already tracked.
Advertiser slug: a normalized version of the domain or username — lowercase, special characters stripped, used in /advertisers/<slug> URLs. Slugs are stable once assigned.
Alias merging: The same entity may advertise from multiple domains (e.g., binance.com and binance.cc). Manual alias merges are supported in the admin interface, consolidating creative counts under a canonical advertiser record.
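The domain, Telegram, and slug rules above can be sketched as below. Treat this as an approximation: `extractAdvertiser` and `slugify` are hypothetical names, the registered domain is guessed as the last two host labels (a real implementation would need the Public Suffix List to handle hosts such as example.com.br), only plain t.me/<username> links are handled, and the slug strips every character outside a-z and 0-9 as a literal reading of "special characters stripped".

```typescript
interface Advertiser {
  type: "domain" | "telegram";
  key: string;  // registered domain or Telegram username
  slug: string; // used in /advertisers/<slug>
}

// Lowercase and strip special characters (assumed: anything outside a-z, 0-9).
function slugify(value: string): string {
  return value.toLowerCase().replace(/[^a-z0-9]/g, "");
}

function extractAdvertiser(ctaUrl: string): Advertiser | null {
  let url: URL;
  try {
    url = new URL(ctaUrl);
  } catch {
    return null; // not a parseable absolute URL
  }
  const host = url.hostname.toLowerCase();

  if (host === "t.me") {
    // First path segment is taken as the username.
    const username = url.pathname.split("/").filter(Boolean)[0];
    if (!username) return null;
    return { type: "telegram", key: username, slug: slugify(username) };
  }

  // Naive registered-domain guess: the last two labels of the host.
  const domain = host.split(".").slice(-2).join(".");
  return { type: "domain", key: domain, slug: slugify(domain) };
}
```

For the referral URL used as an example above, this yields a `domain` advertiser keyed on `binance.com`; the full URL would still be kept on the creative record.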
8. Aggregation caching
Two Redis keys are pre-warmed every minute by the Telegram Ads Spy-warm-cache cron:
Telegram Ads Spy:home:agg (TTL 120s)
Contains: total creative count, total advertiser count, total channel count, top niches (name + count), recent 20 creatives (thumbnail + title), today's new creative count, today's new advertiser count. Used by the home page dashboard and the /api/v1/stats endpoint. Cold miss on this key would cause the home page to hit the database directly — the warm-cache cron ensures this never happens in production.
Telegram Ads Spy:pool:stats (TTL 600s)
Contains: channel count by tier, total sponsored-eligible count, countries represented. Used by the OG image generator (the home page's dynamic Open Graph image includes live stats — must be fast to serve in the 100ms og:image timeout).
Per-entity caches: Individual channel, advertiser, and niche pages cache their aggregated stats (served at /api/v1/channels/<id>, etc.) with TTLs of 60–300s, depending on how often the underlying data is expected to change.
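The pre-warming pattern can be illustrated with an in-memory stand-in for Redis. Everything here is hypothetical (`TtlCache`, `warmKey`, `readThrough`); it only shows why a 60-second warm interval combined with a 120-second TTL keeps a key from expiring between cron runs, and how a per-entity cache can fall back to computing on a miss.

```typescript
// Minimal TTL cache with an injectable clock (stand-in for Redis SET with expiry / GET).
class TtlCache {
  private entries = new Map<string, { value: string; expiresAtMs: number }>();
  private now: () => number;

  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  set(key: string, value: string, ttlSeconds: number): void {
    this.entries.set(key, { value, expiresAtMs: this.now() + ttlSeconds * 1000 });
  }

  get(key: string): string | null {
    const entry = this.entries.get(key);
    if (!entry || entry.expiresAtMs <= this.now()) {
      this.entries.delete(key);
      return null;
    }
    return entry.value;
  }
}

// What a warm-cache tick does: recompute the aggregate and overwrite the key.
function warmKey(cache: TtlCache, key: string, ttlSeconds: number, compute: () => unknown): void {
  cache.set(key, JSON.stringify(compute()), ttlSeconds);
}

// Read path for per-entity caches: serve the cached value, compute only on a miss.
function readThrough(cache: TtlCache, key: string, ttlSeconds: number, compute: () => unknown): string {
  const hit = cache.get(key);
  if (hit !== null) return hit;
  const fresh = JSON.stringify(compute());
  cache.set(key, fresh, ttlSeconds);
  return fresh;
}
```

Since each warm tick resets the TTL, a reader only reaches the `compute` path if the warm cron has been down for longer than the TTL.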
9. Discover-similar BFS spider
In addition to the seed-based discovery cron, Telegram Ads Spy runs a continuous BFS (breadth-first search) spider using Telegram's "similar channels" graph:
- Anchor selection: Channel.lastSimilarCheckAt IS NULL OR < NOW()-1h, i.e. channels that haven't been checked for similarities in the past hour, ordered ascending (oldest check first, new channels prioritised by default NULL value)
- Fan-out: 30 channels per tick (every minute in-process)
- gramesh call: POST /channels.getSimilar { channel_id } → returns 10–20 similar channels
- New channel handling: similar channels not in the pool are added as placeholders → resolver picks them up in the next 15-minute cycle
- Bot filter: channels with memberCount < 100 or names matching bot-name patterns are discarded
- Rate limiting: anchor cooldown of 1 hour prevents the same channel from being re-spidered more than once per hour, regardless of how many other channels reference it as similar
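The anchor selection and bot filter steps can be sketched as two pure functions. The cooldown, fan-out size, and member-count cutoff come from the list above; the function names, the record shapes, and the `BOT_NAME` pattern are hypothetical, and the sketch assumes similar-channel results already carry a member count.

```typescript
interface PoolChannel {
  id: number;
  lastSimilarCheckAt: Date | null;
}

interface SimilarChannel {
  id: number;
  username: string;
  memberCount: number;
}

const COOLDOWN_MS = 60 * 60 * 1000; // 1-hour anchor cooldown
const FAN_OUT = 30;                 // anchors per tick
const BOT_NAME = /bot$/i;           // hypothetical bot-name pattern

// Channels never checked (NULL) or checked more than an hour ago, oldest first.
function selectAnchors(pool: PoolChannel[], now: Date): PoolChannel[] {
  const cutoff = now.getTime() - COOLDOWN_MS;
  return pool
    .filter((c) => c.lastSimilarCheckAt === null || c.lastSimilarCheckAt.getTime() < cutoff)
    .sort((a, b) =>
      (a.lastSimilarCheckAt?.getTime() ?? 0) - (b.lastSimilarCheckAt?.getTime() ?? 0))
    .slice(0, FAN_OUT);
}

// Keep only unseen channels that pass the bot filter; these become placeholders.
function newPlaceholders(similar: SimilarChannel[], knownIds: Set<number>): number[] {
  return similar
    .filter((c) => !knownIds.has(c.id))
    .filter((c) => c.memberCount >= 100 && !BOT_NAME.test(c.username))
    .map((c) => c.id);
}
```

Sorting NULL timestamps as 0 puts never-checked channels at the front of the queue, which matches the "new channels prioritised" note in the anchor selection rule.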
This recursive BFS, combined with the multi-query discover-cron, is how the channel pool has grown from seed lists of a few hundred to 9,000+ channels organically.
10. Data completeness and known gaps
What we capture well:
- EUR-cabinet sponsored messages on the Telegram Ads Platform (high coverage via multi-region ingest)
- TON-paid owner placements when they appear as channel posts in channels we track (partial coverage — not all channels are tracked)
Known gaps:
- Group-level advertising: Telegram groups and supergroups are not indexed (sponsored messages only run in channels; TON-paid posts in groups are outside our scope)
- Bot-to-user messages: Advertisers who send promotional messages directly to users via bots are not captured — we only see channel-level placements
- Inline bot results: Telegram's inline query ads (rare) are not captured
- Very new channels: Channels created recently that haven't been discovered by any pipeline path may be missed for days or weeks
- Low-tier channels (< 1k subscribers): Not eligible for EUR-cabinet sponsored messages; TON placements in very small channels are not in our scope
Coverage estimate: For EUR-cabinet sponsored messages, coverage is estimated at 65–75% of all unique creatives that ran in the period. The gap represents channels not yet in our pool that received sponsored message deliveries. This estimate is based on cross-referencing our creative counts against gramesh's own aggregate stats for channels we do track.
11. Data licensing and citation
All data in Telegram Ads Spy's archive is released under CC-BY-4.0. You may use, republish, and analyse it for any purpose with attribution:
Source: tgadsspy.com · tgadsspy.com/blog/tgadsspy-classifier-pipeline-technical-deep-dive · CC-BY-4.0
For programmatic access: public API documentation · bulk CSV export.
For bug reports or data correction requests: open an issue at the GitHub repo or email [email protected].
Related documentation
- /about — non-technical methodology overview
- Public API docs — endpoint reference for developers
- State of Telegram Ads 2026 — what the pipeline has collected
- Regulation Guide 2026 — how regulators can use this data
Cite this article
tgadsspy research (2026). How tgadsspy Works: Technical Deep Dive into the Classifier and Ingest Pipeline. tgadsspy.com. Retrieved from https://tgadsspy.com/blog/tgadsspy-classifier-pipeline-technical-deep-dive
Licensed CC-BY-4.0 — reuse allowed including commercial, attribution required.