NFT Platform — Implementation Plan¶
Companion to: Architecture v5 · 4 phases · 49 tasks
How to Read¶
| Field | Meaning |
|---|---|
| Effort | S = 1–3d · M = 3–7d · L = 1–3w · XL = 3w+ |
| Risk | High / Medium / Low |
| Depends | Task IDs that must be complete first |
⚠ Do not start Phase 2 until all Phase 1 exit criteria pass. A broken dedup or idempotency mechanism produces phantom balances that are hard to detect and expensive to clean up.
Phase Summary¶
| Phase | Goal | Est. |
|---|---|---|
| 1 | Working ingestion for 2 networks (Polygon + TON): dedup, idempotency, ownership, projections, API | 8–12w |
| 1.5 | Standalone admin-panel service for internal operations | 3–5d |
| 2 | Multi-network, metadata, scoring, ClickHouse, semantic search | 8–12w |
| 3 | Kafka, full reorg suite, personalization, canary SLA | 6–10w |
| 4 | Horizontal scaling — only where measured | Open |
Phase 1 — Foundation¶
Infrastructure · Schema · Polygon + TON adapters · Normalizer · State updater · Projections · API · Reorg handler
Infrastructure & Schema¶
P1-01 — Postgres schema baseline migration M Risk: Medium
7 core schemas: ref, ingest, ledger, catalog, market, projection, system. All constraints and indexes from arch §9. Additive migrations only. user_content, social, scoring introduced in Phase 2.
Depends: —
P1-02 — ref tables seed data S Risk: Low
Populate chains, networks, token_standards, marketplaces for target network.
Depends: P1-01
P1-03 — Redis + MinIO + Docker Compose S Risk: Low
Local dev + prod compose. Redis Streams is the Phase 1 event bus; Redis is also used for cache/locks.
Depends: —
P1-04 — Event envelope schema & validation library S Risk: Medium
Canonical RawEvent as Pydantic model + JSON Schema. Frozen interface between adapters and normalizer. Include schema_version. Must be stable from day one.
Depends: —
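The envelope contract in P1-04 can be sketched as follows. The real model is a Pydantic class with a JSON Schema; this stdlib-only sketch uses a frozen dataclass, and the field set shown is illustrative, not the canonical one:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RawEvent:
    """Sketch of the canonical envelope (P1-04). Field names are
    illustrative; the real model is Pydantic + JSON Schema."""
    schema_version: str   # frozen interface: evolve additively only
    network: str          # e.g. "polygon", "ton"
    block_number: int
    tx_hash: str
    log_index: int
    payload: dict         # raw, chain-specific event body

    @property
    def source_event_id(self) -> str:
        # Must be stable: the same on-chain event yields the same id
        # across re-fetches and reorg re-inclusions.
        return f"{self.network}:{self.tx_hash}:{self.log_index}"
```

Note that `source_event_id` deliberately excludes `block_number`: a reorg can re-include the same transaction at a new height, and the id must not change when that happens.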
Chain Adapter¶
P1-05 — ChainAdapter base class & contract tests M Risk: Medium
Abstract interface: fetch_blocks, parse_raw_events. Contract tests (arch §16.5): schema validation, sub_index completeness, delta conservation, source_event_id stability, empty deltas for non-transfers. Every concrete adapter must pass all 5.
Depends: P1-04
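A minimal sketch of the adapter contract and one of the five contract checks (source_event_id stability). Method signatures and dict shapes here are assumptions; the authoritative interface is defined in P1-05 and arch §16.5:

```python
from abc import ABC, abstractmethod

class ChainAdapter(ABC):
    """Abstract adapter contract (P1-05). Signatures are illustrative."""

    @abstractmethod
    def fetch_blocks(self, start: int, end: int) -> list[dict]:
        """Return raw block data for the inclusive range [start, end]."""

    @abstractmethod
    def parse_raw_events(self, block: dict) -> list[dict]:
        """Return envelope dicts; each must carry a stable source_event_id."""

def check_source_event_id_stability(adapter: ChainAdapter, block: dict) -> bool:
    """Contract check: parsing the same block twice must yield identical
    source_event_ids in the same order."""
    first = [e["source_event_id"] for e in adapter.parse_raw_events(block)]
    second = [e["source_event_id"] for e in adapter.parse_raw_events(block)]
    return first == second
```

Each concrete adapter runs the same check functions against recorded mainnet fixtures, so a new network cannot reach the normalizer with a divergent contract.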
P1-06 — Polygon + TON adapters XL Risk: High
Implement both adapters in Phase 1: Polygon (ERC-721/1155 + marketplace events) and TON (TEP-62 + GetGems). Cover: Transfer, TransferBatch (or TON equivalent batch semantics), Mint (from=zero), Burn (to=zero), Listing, Sale, Cancel. Record real mainnet tx fixtures. Both adapters must pass all contract tests. Address canonicalization per arch §3 — no universal lower().
Depends: P1-05
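The "no universal lower()" rule can be sketched as a per-family dispatch. The family names and validation here are assumptions for illustration; the authoritative rules live in arch §3:

```python
def canonical_address(network_family: str, raw: str) -> str:
    """Per-family canonicalization sketch (arch §3). EVM hex addresses
    are case-insensitive, so lowercasing is safe; TON and Solana
    addresses are case-sensitive and must be kept verbatim."""
    if network_family == "evm":
        addr = raw.lower()
        if not (addr.startswith("0x") and len(addr) == 42):
            raise ValueError(f"malformed EVM address: {raw!r}")
        return addr
    # Case-sensitive families: never lower() — it corrupts the address.
    return raw
```

Fixture tests per network family (risk #3) pin this behavior down before any case-sensitive network is onboarded.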
P1-07 — Block fetcher & ingest cursor M Risk: Medium
fetch_blocks with retry, rate-limit, circuit breaker. Write cursor to system.sync_cursors after each successful batch. Never advance cursor past a failed block.
Depends: P1-06
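The cursor-advance rule can be sketched as below. `fetch` and `cursor_store` are injected stand-ins (the real cursor lives in system.sync_cursors), and the circuit breaker is omitted for brevity:

```python
import time

def ingest_batch(fetch, blocks, cursor_store, network_id,
                 max_retries=3, backoff_s=0.5):
    """Process blocks in order with per-block retry and exponential
    backoff. Halts at the first block that still fails, so the cursor
    never advances past a failed block."""
    for block_no in blocks:
        for attempt in range(max_retries):
            try:
                fetch(block_no)
                break
            except Exception:
                if attempt == max_retries - 1:
                    # Give up on this batch; cursor stays behind the failure.
                    return cursor_store.get(network_id)
                time.sleep(backoff_s * (2 ** attempt))
        cursor_store[network_id] = block_no  # advance only after success
    return cursor_store.get(network_id)
```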
Normalizer¶
P1-08 — Normalizer worker L Risk: High
Consume chain.raw_events. UPSERT ledger.normalized_event_keys on (source_event_id, sub_index) — handles reorg re-inclusion, not just dedup. Write normalized_events + normalized_event_deltas. Create catalog stubs via uuidv5. Publish to ledger.normalized.
Critical: same event twice → second write updates chain metadata fields, does not create a second row, does not change normalized_event_id.
Depends: P1-04, P1-01
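Catalog stub creation via uuidv5 can be sketched with the stdlib. The namespace string here is hypothetical; the real one is fixed once in the codebase and never changed:

```python
import uuid

# Hypothetical namespace — the real value is fixed once, project-wide.
ASSET_NS = uuid.uuid5(uuid.NAMESPACE_URL, "nft-platform/catalog/assets")

def stub_asset_id(network: str, contract: str, token_id: str) -> uuid.UUID:
    """Deterministic surrogate id: every writer derives the same UUID
    from the natural key, so stub creation cannot race on FK inserts."""
    return uuid.uuid5(ASSET_NS, f"{network}:{contract}:{token_id}")
```

This is why the decision log picks uuidv5 over auto-increment: two workers creating the same stub concurrently converge on one id instead of racing for a sequence value.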
State Updater¶
P1-09 — applied_events idempotency pattern M Risk: High
Reusable library. INSERT INTO applied_events ON CONFLICT DO NOTHING before every state mutation. Write all 3 idempotency tests from arch §16.3 before marking done: single delivery, concurrent delivery, partial failure recovery.
Depends: P1-01
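A runnable sketch of the pattern, using SQLite's `INSERT OR IGNORE` as a stand-in for Postgres's `INSERT ... ON CONFLICT DO NOTHING`. Table and column names are illustrative:

```python
import sqlite3

def apply_once(conn, consumer, event_id, mutate):
    """Claim (consumer, event_id) before mutating state. rowcount == 0
    means the event was already applied, so the mutation is skipped.
    Claim and mutation share one transaction: if mutate() raises, the
    claim rolls back too, which is the partial-failure-recovery case."""
    with conn:
        cur = conn.execute(
            "INSERT OR IGNORE INTO applied_events(consumer, event_id) "
            "VALUES (?, ?)",
            (consumer, event_id))
        if cur.rowcount == 0:
            return False
        mutate(conn)
        return True
```

The three tests from arch §16.3 map directly onto this: single delivery (True then state changed once), concurrent delivery (exactly one claimant wins), and partial failure (a crashed mutation leaves no claim behind).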
P1-10 — State updater — ownership M Risk: High
Consume ledger.normalized. Fetch all normalized_event_deltas per event. UPSERT ownership_current qty += delta. DELETE where qty <= 0. Update finality_status on finalization. Publish ledger.ownership_changed.
Depends: P1-08, P1-09
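The qty-accumulation step can be sketched with SQLite, whose UPSERT syntax mirrors the Postgres statement. Schema and column names are illustrative:

```python
import sqlite3

def apply_ownership_deltas(conn, deltas):
    """deltas: (asset_id, owner, qty_delta) rows, as fetched from
    normalized_event_deltas for one event."""
    for asset_id, owner, qty_delta in deltas:
        conn.execute(
            """INSERT INTO ownership_current(asset_id, owner, qty)
               VALUES (?, ?, ?)
               ON CONFLICT(asset_id, owner)
               DO UPDATE SET qty = qty + excluded.qty""",
            (asset_id, owner, qty_delta))
    # Zero or negative balances are removed, not stored.
    conn.execute("DELETE FROM ownership_current WHERE qty <= 0")
```

A transfer is just two deltas that conserve quantity (-1 sender, +1 receiver), which is exactly what the delta-conservation contract test in P1-05 enforces upstream.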
P1-11 — State updater — listings & sales M Risk: Medium
Handle listing, sale, cancel event_kinds. Update market.listings_current. Insert sales_history. Publish market.listing_changed and market.sale_recorded.
Depends: P1-10
Projection Pipeline¶
P1-12 — Projection pipeline — ownership_view M Risk: Medium
Upsert projection.ownership_view with finality_status (pending/confirmed/finalized). Use applied_events. Template for all subsequent projections — write it correctly once.
Depends: P1-10
P1-13 — Projection pipeline — asset_cards & collection_stats M Risk: Medium
projection.asset_cards and collection_stats (floor, volume, owner count). Handle is_stub=true gracefully — show stub data, not errors.
Depends: P1-12
P1-14 — Projection pipeline — portfolio_assets & listing_cards M Risk: Low
projection.portfolio_assets and listing_cards. Remaining projections for minimal functional API.
Depends: P1-13
Core API¶
P1-15 — Core API skeleton & auth module M Risk: Low
FastAPI modular structure. Auth: API keys only (JWT deferred to Phase 2). Enforce from day one: on-chain modules read only from projection.*. user_content and social read their own tables.
Depends: P1-01
P1-16 — API endpoints: ownership, catalog, marketplace M Risk: Low
GET /assets/{id}, /assets/{id}/owners, /collections/{id}, /collections/{id}/listings. All responses include finality_status. Stubs return is_stub: true. Non-existent collections return 404 with X-Data-Status: pending.
Depends: P1-14, P1-15
Reliability & Observability¶
P1-17 — Outbox publisher M Risk: Medium
Poll system.outbox_events WHERE published_at IS NULL ORDER BY created_at. Publish to Redis Streams. Mark published_at on success. After 5 failures → system.dlq.
Depends: P1-01
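One poll cycle can be sketched as below. SQLite stands in for Postgres, and the `dead` flag is an illustration detail for keeping DLQ'd rows out of the poll (the real schema may mark this differently):

```python
import sqlite3

def publish_pending(conn, publish, max_failures=5):
    """Publish unpublished outbox rows oldest-first; stamp published_at
    on success; after max_failures move the row to dlq."""
    rows = conn.execute(
        "SELECT id, payload, failures FROM outbox_events "
        "WHERE published_at IS NULL AND dead = 0 "
        "ORDER BY created_at").fetchall()
    for row_id, payload, failures in rows:
        try:
            publish(payload)
            conn.execute(
                "UPDATE outbox_events "
                "SET published_at = strftime('%s','now') WHERE id = ?",
                (row_id,))
        except Exception:
            if failures + 1 >= max_failures:
                conn.execute(
                    "INSERT INTO dlq(outbox_id, payload) VALUES (?, ?)",
                    (row_id, payload))
                conn.execute(
                    "UPDATE outbox_events SET dead = 1 WHERE id = ?",
                    (row_id,))
            else:
                conn.execute(
                    "UPDATE outbox_events SET failures = failures + 1 "
                    "WHERE id = ?", (row_id,))
    conn.commit()
```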
P1-18 — Metrics & structured logging M Risk: Medium
All metrics from arch §15 with alert thresholds. Structured log contract in all workers. Minimum of 4 dashboards. Alert rules for dlq_depth and outbox_unpublished_count at any non-zero value.
Depends: P1-07, P1-10
P1-19 — Reorg handler L Risk: High
Detect reorg in adapter. Emit ledger.reorg_detected. Mark normalized_event_keys.is_reverted=true scoped to affected asset_ids. Recompute ownership_current for affected asset_ids only. Write all 5 reorg scenarios from arch §16.2.
Depends: P1-10
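The scoped-recompute logic can be sketched in memory. The event/delta shapes are illustrative; in production this reads normalized_event_keys and normalized_event_deltas:

```python
def recompute_affected(events, reverted_ids):
    """Reorg sketch: flip is_reverted on orphaned events, then rebuild
    ownership only for the assets those events touched. Returns
    {(asset_id, owner): qty} for positive balances of affected assets."""
    affected = set()
    for ev in events:
        if ev["event_id"] in reverted_ids:
            ev["is_reverted"] = True
            affected.update(d["asset_id"] for d in ev["deltas"])
    balances = {}
    for ev in events:
        if ev["is_reverted"]:
            continue
        for d in ev["deltas"]:
            if d["asset_id"] not in affected:
                continue
            key = (d["asset_id"], d["owner"])
            balances[key] = balances.get(key, 0) + d["qty"]
    return {k: v for k, v in balances.items() if v > 0}
```

The scoping to affected asset_ids is the point: a reorg never triggers a full-table recompute, only a replay of the surviving deltas for the assets the orphaned events touched.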
Phase 1 Exit Criteria¶
- [ ] Contract tests pass for both adapters — all 5 types from arch §16.5 (10 tests total)
- [ ] Idempotency tests pass: single delivery, concurrent delivery, partial failure recovery
- [ ] Reorg scenarios 1–5 from arch §16.2 pass against real Postgres (no mocks)
- [ ] End-to-end: chain event → API response with correct finality_status within SLA
- [ ] ownership_view reconciliation: 0 mismatches for sample of 100 assets
- [ ] Observability: pipeline lag dashboard visible, dlq_depth alert tested and firing
- [ ] Code review: no direct SQL joins between on-chain domain modules in the API
Phase 2 — Multi-Network & Intelligence¶
Additional adapters · Metadata pipeline · Scoring · ClickHouse · Semantic search
P2-01 — Additional chain adapters (third+) L per adapter Risk: High
Each must pass all contract tests before connecting to normalizer.
Depends: P1 exit ✓
P2-02 — Metadata pipeline M Risk: Medium
Consume asset_created. Fetch URI, compute content_hash, create version only on hash change. Handle: IPFS timeouts (30s / 3 retries), malformed JSON (log+skip+retry), oversized >5MB (store ref only).
Depends: P1 exit ✓
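The hash-gated versioning step can be sketched as below, with an in-memory dict standing in for the metadata versions table; the return values are illustrative status labels, not a real API:

```python
import hashlib
import json

def ingest_metadata(versions, asset_id, raw_json):
    """Create a new metadata version only when content_hash changes.
    versions: {asset_id: [(content_hash, metadata), ...]}."""
    try:
        metadata = json.loads(raw_json)
    except json.JSONDecodeError:
        return "skipped"  # malformed JSON: log + skip + retry later
    content_hash = hashlib.sha256(raw_json.encode()).hexdigest()
    history = versions.setdefault(asset_id, [])
    if history and history[-1][0] == content_hash:
        return "unchanged"  # re-fetch found identical content: no new row
    history.append((content_hash, metadata))
    return "versioned"
```

Hashing the raw bytes rather than the parsed object means cosmetic re-serialization by the origin server does create a version, which is a deliberate trade-off for auditability.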
P2-03 — Trait extraction & rarity scoring M Risk: Low
Parse traits. Compute rarity_rank and rarity_pct per trait value within collection. Update projection.asset_cards.
Depends: P2-02
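The per-trait-value frequency step behind rarity_pct can be sketched as follows (the full rank/score formula lives in the arch doc; this shows only the frequency computation):

```python
from collections import Counter

def rarity_pct(assets):
    """assets: {asset_id: {trait_type: value}}. For each asset, return
    the share of the collection carrying each of its trait values
    (lower = rarer)."""
    n = len(assets)
    counts = Counter(
        (t, v) for traits in assets.values() for t, v in traits.items())
    return {
        aid: {t: counts[(t, v)] / n for t, v in traits.items()}
        for aid, traits in assets.items()
    }
```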
P2-04 — ClickHouse setup & CDC pipeline L Risk: Medium
Deploy ClickHouse. CDC (Debezium) or export job for price_snapshots + sales_history. Postgres retains 30-day hot partition. Workers never write to ClickHouse directly.
Depends: P1 exit ✓
P2-05 — Scoring pipeline — asset scores M Risk: Medium
composite_score = rarity + liquidity + demand. Versioned via scoring_runs. Use applied_events. Update asset_cards.composite_score and trending_assets.
Depends: P2-03
P2-06 — Scoring pipeline — collector scores M Risk: Low
portfolio_value_usd, diversity_score, activity_score per account. Update portfolio_assets.estimated_value_usd.
Depends: P2-05
P2-07 — Text embeddings & pgvector search L Risk: Medium
Deploy semantic-encoder. HNSW index on embedding WHERE is_current=true. Hybrid search in search_docs. Expose /search endpoint.
Depends: P2-03
P2-08 — API: scores, trending, search M Risk: Low
GET /assets/{id}/score, /collections/{id}/trending, /search?q=. Volume stats from ClickHouse.
Depends: P2-05, P2-07
P2-09 — Reconciliation job M Risk: Low
Hourly (staging) / daily (prod). Sample 1000 assets, compare projection vs ground-truth. Alert if mismatch_rate > 0.1%.
Depends: P1 exit ✓
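The comparison core can be sketched as a pure function; sampling, ground-truth recomputation, and alert delivery are the job's responsibility and are elided here:

```python
def reconcile(sample_ids, projection, ground_truth, threshold=0.001):
    """Compare projected rows against recomputed ground truth for a
    sample of assets. Returns (mismatch_rate, should_alert) with the
    plan's 0.1% threshold as the default."""
    mismatches = sum(
        1 for aid in sample_ids
        if projection.get(aid) != ground_truth.get(aid))
    rate = mismatches / len(sample_ids)
    return rate, rate > threshold
```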
P2-10 — Metadata re-fetch scheduler M Risk: Low
Periodic re-fetch. After N failures → mark metadata_unreachable, stop retrying, surface in API.
Depends: P2-02
P2-11 — Visual embeddings M Risk: Low (parallelizable)
Image embeddings for visual similarity. Separate model_version from text embeddings.
Depends: P2-07
P2-12 — user_content module M Risk: Low
Wallet linking, folder CRUD, compose layouts. Reads user_content.* directly.
Depends: P1-15
P2-13 — social module M Risk: Low
Comments (threaded), reactions, notifications.
Depends: P2-12
P2-14 — Price oracle integration S Risk: Low
Native token → USD rates. All price_usd fields include the rate timestamp. Alert if a rate is more than 5 min stale.
Depends: P1 exit ✓
Phase 2 Exit Criteria¶
- [ ] All contract tests pass for each new adapter
- [ ] stub_assets_total trends to zero within 10min for test batch of 1000 assets
- [ ] ClickHouse: 90-day floor price history < 100ms for collection with 50k sales
- [ ] Search: precision@10 > 0.7 on test query set
- [ ] Reconciliation: 0 mismatches for 3 consecutive daily runs across all networks
- [ ] Per-network lag metrics visible for all active networks
Phase 3 — Production Hardening¶
Kafka · Full reorg suite · Personalization · Canary SLA · Backfill
P3-01 — Kafka / Redpanda migration L Risk: Medium
Envelope schema frozen for this — no app logic changes. Dual-write for 72h (matching the exit criterion). Validate message counts match before cutover.
Depends: P2 exit ✓
P3-02 — Full reorg test suite L Risk: High
All 5 scenarios from arch §16.2 for every supported network. Automated simulation in staging.
Depends: P2 exit ✓
P3-03 — Personalization engine L Risk: Medium
User preference vectors from ownership + viewing history + folders. Blend with rarity + liquidity in trending and search.
Depends: P2-07, P2-06
P3-04 — Qdrant migration M Risk: Low (conditional)
Only if p99 semantic search > 200ms. Backend swap via abstraction layer. Validate precision@10 unchanged.
Depends: P3-01
P3-05 — Advanced collection analytics L Risk: Medium
ClickHouse: wash trading detection, price manipulation signals, whale concentration.
Depends: P2-04
P3-06 — Achievements system M Risk: Low
achievements_catalog + user_achievements, triggered by events.
Depends: P2-13
P3-07 — API rate limiting & quota M Risk: Low
Per-key rate limiting via Redis. Graceful degradation: serve stale projection data under DB pressure.
Depends: P1-15
P3-08 — SLA monitoring & canary M Risk: Medium
Automate freshness SLA checks from arch §17.2. Canary transaction every 5 min. Alert if it misses the SLA window.
Depends: P2 exit ✓
P3-09 — Backfill tooling L Risk: Medium
CLI to backfill from configured start block. Uses same normalizer/state-updater pipeline. Idempotent.
Depends: P2 exit ✓
P3-10 — Staging environment parity M Risk: Low
Staging mirrors prod schema + pipeline exactly. Nightly reconciliation. All reorg tests run in staging.
Depends: P3-01
Phase 3 Exit Criteria¶
- [ ] Kafka: 0 message loss during 72h dual-write. Consumer lag < 1000 msgs/partition steady state
- [ ] Full reorg test suite passes for all supported networks
- [ ] Canary SLA: 99.5% visible in API within window over 7 days
- [ ] p99 API latency < 200ms under 2x peak load
Phase 4 — Scale¶
Only where measured. No speculative scaling.
⚠ Each task requires a profiler trace identifying the specific bottleneck.
P4-01 — Ingest horizontal scaling by network M
Shard adapter if ingest_lag fires consistently. Normalizer handles out-of-order via UPSERT — no changes needed there.
Trigger: measured ingest lag
P4-02 — Projection pipeline parallelization M
Partition by asset_id hash if projection_lag fires consistently. applied_events prevents cross-partition conflicts.
Trigger: measured projection lag
P4-03 — normalized_events partition count increase L Risk: Medium
Double partition count (8 → 16) if single partition exceeds ~100M rows. Requires data migration + maintenance window.
Trigger: measured query degradation
P4-04 — Hot-path service extraction XL Risk: Medium
Last resort. Profile first, extract second.
Trigger: measured resource contention on a specific module
P4-05 — Read replica routing M Risk: Medium
Route projection.* reads to replicas. Writes stay on primary. Monitor replication lag.
Trigger: primary DB CPU > 70% sustained
P4-06 — applied_events archival S
Archive rows > 90d to cold table if applied_events > 100M rows and INSERT latency degrades.
Trigger: measured INSERT latency degradation
Risk Register¶
| # | Risk | Phase | Impact | Mitigation |
|---|---|---|---|---|
| 1 | Dedup bug silently drops events | P1 | Critical | Contract test: same event twice → exactly 1 row. Must pass before production data flows. |
| 2 | applied_events missing → double balance on replay | P1 | Critical | Code review gate on every ledger-mutating function. Consider linter rule. |
| 3 | Solana addresses corrupted by lower() | P2 | Critical | Canonicalization unit tests per network family. Fixture tests with case-sensitive Solana addresses. |
| 4 | ERC-1155 TransferBatch: sub_index omitted → batch collapse | P1 | Critical | Contract test: 1, 10, 100-item batches. Assert contiguous sub_index from "0". |
| 5 | Reorg re-inclusion: is_reverted stays true | P1 | High | Reorg scenario 2 must pass before Phase 1 exit. UPSERT on normalized_event_keys updates is_reverted. |
| 6 | Projection diverges from ledger silently | P1+ | High | Reconciliation job (P2-09) + canary SLA monitor (P3-08). |
| 7 | ClickHouse lag causes stale analytics | P2 | Medium | Monitor CDC lag. Surface data_as_of timestamp in analytics responses. |
| 8 | Broken metadata URI → stuck stubs | P2 | Medium | After N failures: mark metadata_unreachable, stop retrying, surface in API. |
| 9 | Kafka migration loses messages | P3 | High | Dual-write 72h. Validate counts on both sides before cutover. |
| 10 | Premature Phase 4 scaling | P4 | Medium | Hard rule: no task without a profiler trace. |
Decision Log¶
| Decision | Chosen | Rejected | Rationale |
|---|---|---|---|
| API architecture | Modular monolith | Microservices | Domains tightly coupled. Velocity > isolation at this stage. |
| Event bus — Phase 1 | Redis Streams | Kafka | Lower ops overhead. Envelope frozen for future migration. |
| Event bus — Phase 3 | Kafka / Redpanda | Stay on Redis | At-least-once guarantees, consumer groups, replay. |
| normalized_events partitioning | HASH(network_id) | Time partition | Time partition breaks global unique PK in Postgres. |
| Dedup strategy | Unpartitioned keys table | UNIQUE on partitioned | Postgres can't enforce cross-partition UNIQUE without partition key. |
| Worker idempotency | applied_events (DB) | In-memory / Redis | DB transaction guarantees. Survives restart. Auditable. |
| Surrogate IDs | uuidv5 from natural key | Auto-increment | Deterministic. Eliminates FK race conditions. |
| Address canonicalization | Canonical string at ingestion | lower() at query time | lower() corrupts Solana addresses. |
| Embedding storage | pgvector | Qdrant | Simpler ops. Abstraction allows migration if p99 > 200ms. |
| ClickHouse timing | Phase 2 | Phase 3 | sales_history and price_snapshots outgrow Postgres faster. |