
03 — AI Pipeline Quality Improvements

Ring: 2 (Retention) — foundational pieces are built in Ring 1
Dependency: R1-2 (Cost Control) — extending the pipeline without tracking is dangerous
Handbook: Ch. 11-26 (pipeline), Ch. 60-64 (market intelligence), Ch. 170-188 (prompt architecture)

Problem

  • Current pipeline has 3 steps: AI Call → JSON Parse → DB Upsert.
  • Handbook defines a 10-stage pipeline; we target a 12-stage advanced version.
  • No product analysis — user’s text goes directly to AI.
  • Single-language search — queries are not generated in the target country’s language.
  • No market context — blind search with no country knowledge.
  • Feedback buttons collect data but DO NOT AFFECT ranking (dead feature).
  • Dedup is domain-based only — no fuzzy name matching.
  • Contact discovery is a separate operation — not part of the pipeline.

Decisions

D1: 5-Phase, 12-Stage Pipeline

━━━ PHASE 1: UNDERSTAND ━━━━━━━━━━━━━━━━━━━━━━━━━━━
 1. Product Analysis
    Input: "textile stain remover spray"
    Output: { industry, category, synonyms[], distribution_channels[], hs_code_hint }
    → Feeds all subsequent steps

 2. Market Context
    Input: country + industry (from Phase 1)
    Output: { import_volume, key_sectors, trade_fairs[], regulatory_notes, competitive_landscape }
    Source: DB cache (90 day TTL) || AI call
    → Makes discovery informed, NO blind search

━━━ PHASE 2: DISCOVER ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 3. Query Generation (multi-language, multi-angle)
    → EN: general search query
    → Target country language: localized query
    → Sector-specific: trade associations, directories
    → Internal DB: check for previously found companies

 4. Multi-Source Search
    → Web search (Perplexity/Gemini grounding)
    → Internal DB cache (match from export_ai_companies — 0 cost)
    → Shared companies in future: ALL orgs' data

 5. Entity Extraction
    → Raw AI output → clean company entities
    → Normalize: name, domain, address cleanup

━━━ PHASE 3: ENRICH ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 6. Company Enrichment
    → Website analysis: products_sold, target_customers, geographic_scope, estimated_size
    → This data feeds FitScore

 7. Classification
    → company_type: distributor / reseller / end_user / manufacturer
    → segment: per org's DB-driven segment definitions
    → confidence: 0-1 (separate for each field)

 8. Deduplication
    → Domain match (primary)
    → Fuzzy name match (secondary)
    → Internal DB cross-ref (found before?)
    → Merge: combine if same company came from multiple sources

━━━ PHASE 4: EVALUATE ━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 9. Confidence Scoring [DETERMINISTIC — no AI call]
    → Data source reliability (website > directory > AI inference)
    → Cross-source verification (2+ sources = more reliable)

 10. FitScore & Ranking [DETERMINISTIC — no AI call]
    → Existing 6-factor FitScore v2
    → + Market Context factor (country fit)
    → + Enrichment factors (company size, scope)
    → AI does not score; AI collects data, FORMULA scores (Handbook Golden Rule)

 11. Feedback Integration [DETERMINISTIC — no AI call]
    → Check feedback from previous searches
    → Prompt injection: "These types of companies were marked relevant"
    → Add feedback score as additional factor to FitScore

━━━ PHASE 5: ACTIVATE ━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 12. Contact Discovery [OPTIONAL — configurable]
    → On/off from org settings (default: off)
    → If on: auto-headhunt for Top N companies
    → Purchasing manager, import manager, procurement director, CEO
    → confidence: 0-1 per contact
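
The Phase 1 outputs above can be sketched as TypeScript interfaces. Names follow the types listed later in this document (IProductAnalysis, IMarketContext); the exact field types are illustrative assumptions, not the final shapes:

```typescript
// Illustrative shapes for the Phase 1 stage outputs (field types assumed).
interface IProductAnalysis {
  industry: string;
  category: string;
  synonyms: string[];
  distribution_channels: string[];
  hs_code_hint?: string; // a hint only, not a verified HS code
}

interface IMarketContext {
  import_volume: string;
  key_sectors: string[];
  trade_fairs: string[];
  regulatory_notes: string;
  competitive_landscape: string;
}

// Example Phase 1 output for the sample input "textile stain remover spray".
const analysis: IProductAnalysis = {
  industry: "chemicals",
  category: "textile care",
  synonyms: ["fabric stain remover", "textile cleaning spray"],
  distribution_channels: ["distributors", "wholesalers"],
  hs_code_hint: "3402",
};
```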

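Note: the stage numbering above (1-12) is the pipeline order; the AI-call grouping in D2 is a separate mapping.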
D2: AI Call Grouping (Cost Optimization)

The 12 stages do NOT each require a separate AI call. Logical groupings:
CALL 1: Product Analysis + Market Context + Query Generation
  → "Analyze this product, what do you know about this country, generate search queries"
  → Phase 1 + Phase 2 start
  → If market context is cached: only Product Analysis + Query Gen

CALL 2: Multi-Source Search
  → Web search (Perplexity/Gemini grounding)
  → + Internal DB lookup (Supabase — 0 cost)

CALL 3: Entity Extraction + Classification + Enrichment
  → "Extract company entities from results, classify, enrich"
  → Heaviest prompt but single call

CALL 4 (optional): Contact Discovery
  → Only if configurable setting is enabled
  → Only for Top N companies

DETERMINISTIC (no AI call):
  → Dedup: domain + fuzzy match (code)
  → Confidence: source reliability (code)
  → FitScore: formula (code)
  → Feedback: DB lookup + score (code)
  → Ranking: sort (code)

TOTAL: 3-4 AI calls (current: 1, full atomic: 12)
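
The call count depends on two runtime conditions (cached market context, auto-headhunt setting). A minimal sketch of that decision, with the call labels taken from this document and the function name hypothetical:

```typescript
// Derive the AI call plan for one discovery run (sketch, names illustrative).
interface CallPlan {
  calls: string[];
}

function planCalls(marketContextCached: boolean, autoHeadhunt: boolean): CallPlan {
  const calls = [
    // If market context is fresh in the DB cache, CALL 1 shrinks (D3).
    marketContextCached
      ? "CALL 1: Product Analysis + Query Generation"
      : "CALL 1: Product Analysis + Market Context + Query Generation",
    "CALL 2: Multi-Source Search",
    "CALL 3: Entity Extraction + Classification + Enrichment",
  ];
  // CALL 4 only fires when the org-level setting is on (D6).
  if (autoHeadhunt) calls.push("CALL 4: Contact Discovery");
  return { calls };
}
```

Dedup, confidence, FitScore, feedback, and ranking never appear in the plan: they run as deterministic code between calls.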

D3: Market Context — Hybrid Cache

User searches: "Germany + textile chemicals"
  → Does market context exist in DB? (export_ai_market_context)
    → YES and <90 days old → from cache (0 cost, instant)
    → NO or >90 days old → include in CALL 1, save result to DB
  → Background: monthly batch job to refresh popular combinations
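
The cache decision above is a pure TTL check plus a fallback. A sketch assuming the 90-day TTL from this document; `lookup` and `fetchFromAI` stand in for the Supabase query and the CALL 1 path and are hypothetical:

```typescript
// Market-context cache decision (D3): fresh row wins, otherwise fall back to AI.
const TTL_DAYS = 90;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

interface CacheRow {
  context_data: unknown;
  created_at: Date;
}

function isFresh(row: CacheRow, now: Date = new Date()): boolean {
  return now.getTime() - row.created_at.getTime() < TTL_DAYS * MS_PER_DAY;
}

async function getMarketContext(
  country: string,
  industry: string,
  lookup: (c: string, i: string) => Promise<CacheRow | null>,
  fetchFromAI: (c: string, i: string) => Promise<unknown>,
): Promise<unknown> {
  const row = await lookup(country, industry);
  if (row && isFresh(row)) return row.context_data; // cache hit: 0 cost, instant
  return fetchFromAI(country, industry); // miss or stale: include in CALL 1, save to DB
}
```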

D4: Feedback Loop — Prompt Injection

At discovery start:
  → Fetch the same org's previous feedback
  → Common traits of "relevant" marked companies → inject into prompt
    Example: "User previously marked chemical distributors with >50 employees as relevant"
  → "not_relevant" marked companies → -5 point penalty in FitScore
  → Deterministic: no ML model, prompt enrichment + score adjustment
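
Because the loop is deterministic, both halves fit in a few lines. A sketch using the -5 point penalty stated above; function names and the trait format are illustrative:

```typescript
// Feedback loop (D4): prompt enrichment + score adjustment, no ML model.
type Feedback = "relevant" | "not_relevant";

// Build the sentence injected into the discovery prompt from prior feedback.
function buildFeedbackHint(relevantTraits: string[]): string {
  if (relevantTraits.length === 0) return "";
  return `These types of companies were marked relevant: ${relevantTraits.join(", ")}`;
}

// Apply the deterministic FitScore penalty for previously rejected companies.
function adjustFitScore(baseScore: number, priorFeedback?: Feedback): number {
  if (priorFeedback === "not_relevant") return baseScore - 5;
  return baseScore;
}
```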

D5: Multi-Language Query Generation

Input: product="textile stain remover" country="Germany"
Output queries:
  → EN: "textile chemical distributors Germany"
  → DE: "Textilchemikalien Großhändler Deutschland"
  → Sector: "TEGEWA members" (German textile chemicals association)
  → Generic: "chemical import companies Germany"

How: AI is asked to generate multi-language queries within CALL 1.
Language detection: country → language mapping (static table, no AI needed).
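
The static mapping can be a plain object; the entries below are illustrative, a real table would cover every supported market:

```typescript
// Country → language mapping (D5): static lookup, no AI call needed.
const COUNTRY_LANGUAGE: Record<string, string> = {
  Germany: "de",
  France: "fr",
  Netherlands: "nl",
  Turkey: "tr",
};

function targetLanguage(country: string): string {
  return COUNTRY_LANGUAGE[country] ?? "en"; // unknown country: English only
}
```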

D6: Contact Discovery — Configurable

export_ai_organizations.settings JSONB:
  {
    "auto_headhunt": false,        // default off
    "auto_headhunt_top_n": 10,     // if on, how many companies
    "auto_headhunt_min_score": 70  // minimum FitScore
  }

→ When discovery completes: check settings
→ If on: trigger headhunt for Top N (FitScore >= min)
→ If off: return company list only, let user choose
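
The post-discovery check reduces to filter, sort, slice. A sketch using the setting keys from the JSONB example above; the Company shape and function name are hypothetical:

```typescript
// D6: select companies for auto-headhunt after discovery completes.
interface HeadhuntSettings {
  auto_headhunt: boolean;       // default off
  auto_headhunt_top_n: number;  // if on, how many companies
  auto_headhunt_min_score: number; // minimum FitScore
}

interface Company {
  name: string;
  fitScore: number;
}

function selectForHeadhunt(companies: Company[], s: HeadhuntSettings): Company[] {
  if (!s.auto_headhunt) return []; // off: return company list only, user chooses
  return companies
    .filter((c) => c.fitScore >= s.auto_headhunt_min_score)
    .sort((a, b) => b.fitScore - a.fitScore) // best fit first
    .slice(0, s.auto_headhunt_top_n);
}
```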

Data Model

New Table: Market Context Cache

CREATE TABLE export_ai_market_context (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  country TEXT NOT NULL,
  industry TEXT NOT NULL,
  context_data JSONB NOT NULL,
  -- { import_volume, key_sectors, trade_fairs, regulatory_notes, competitive_landscape }
  source TEXT DEFAULT 'ai',           -- 'ai' | 'manual' | 'batch'
  expires_at TIMESTAMPTZ NOT NULL,    -- created_at + 90 days
  created_at TIMESTAMPTZ DEFAULT now(),
  updated_at TIMESTAMPTZ DEFAULT now(),
  UNIQUE(country, industry)
);

-- Org-independent (global cache — all orgs benefit)
-- RLS: authenticated users can read, only super_admin/batch can write

Existing Table Update

-- For fuzzy dedup on companies
CREATE INDEX idx_companies_name_lower
  ON export_ai_companies(lower(name));

-- For faster feedback lookup
CREATE INDEX idx_feedback_org_search
  ON export_ai_search_feedback(organization_id, search_id);
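
The lower(name) index above supports the fuzzy name match in stage 8. One possible normalization for the code-side comparison; the legal-suffix list is an illustrative assumption:

```typescript
// Fuzzy name dedup (stage 8, secondary to domain match): compare normalized names.
const LEGAL_SUFFIXES = /\b(gmbh|ag|ltd|llc|inc|bv|sa|srl)\.?$/i;

function normalizeCompanyName(name: string): string {
  return name
    .toLowerCase()
    .replace(LEGAL_SUFFIXES, "")     // drop trailing legal form
    .replace(/[^a-z0-9]+/g, " ")     // collapse punctuation and whitespace
    .trim();
}

function isFuzzyMatch(a: string, b: string): boolean {
  return normalizeCompanyName(a) === normalizeCompanyName(b);
}
```

Exact equality after normalization is the simplest variant; a distance metric (e.g. trigram similarity) could replace it later without changing the pipeline shape.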

Current Code Impact

To Be Rewritten (Large)

| File | Reason |
|------|--------|
| lib/discovery/run-discovery.ts | 9-step orchestrator → 12-step pipeline. Core logic changes. |
| lib/discovery/types.ts | New interfaces: IProductAnalysis, IMarketContext, IQuerySet, IPipelineResult |
| lib/prompts.ts (buildDiscoveryPrompt()) | 3 separate prompt builders: CALL 1, CALL 2 config, CALL 3 |
| app/api/discover/route.ts | Delegates to pipeline; only handles request/response itself |

New Files

| File | Content |
|------|---------|
| lib/discovery/stages/product-analysis.ts | Phase 1: product analysis + market context |
| lib/discovery/stages/search.ts | Phase 2: query generation + multi-source search |
| lib/discovery/stages/enrich.ts | Phase 3: entity extraction + classification + enrichment |
| lib/discovery/stages/evaluate.ts | Phase 4: dedup + confidence + FitScore + feedback (deterministic) |
| lib/discovery/stages/activate.ts | Phase 5: contact discovery (optional) |
| lib/discovery/pipeline.ts | Orchestrator: runs 5 phases sequentially, passes each phase's output to the next |
| lib/discovery/feedback.ts | Feedback lookup + prompt injection + FitScore adjustment |
| lib/discovery/dedup.ts | Domain match + fuzzy name match + DB cross-ref |
| lib/discovery/market-context.ts | Cache lookup + AI call + DB save |
| lib/prompts/discovery-prompts.ts | 3-group prompt builder (CALL 1, 2, 3) |

To Change (Medium)

| File | Change |
|------|--------|
| lib/scoring/fitScore.ts | New factors: market context score, feedback score |
| app/api/headhunt/route.ts | Can be called from Pipeline Phase 5 (auto-headhunt) |
| lib/db/save-companies.ts | Extended upsert for enrichment data |

Pipeline Performance Targets

| Metric | Target | Handbook Ref |
|--------|--------|--------------|
| Total pipeline duration | <20 seconds (3-4 AI calls) | Ch. 25: <15s (we add +5s for market context) |
| CALL 1 (Product + Market + Query) | <4 seconds | |
| CALL 2 (Search) | <8 seconds | |
| CALL 3 (Extract + Classify + Enrich) | <6 seconds | |
| CALL 4 (Contact, optional) | <5 seconds | |
| Deterministic stages | <2 seconds total | |
| Accuracy (relevant company ratio) | >70% | Ch. 24 |
| Noise (irrelevant company ratio) | <15% | Ch. 24 |

These targets will be tracked via the ai_job_runs table (integrated with 02-api-cost-control).

Handbook Alignment

| Handbook Item | Status |
|---------------|--------|
| Ch. 11: 10-stage pipeline | ✅ 12 stages (2 additions: Market Context, Multi-source) |
| Ch. 17: Segment classification | ✅ Existing + confidence added |
| Ch. 18: Deduplication | ✅ Domain + fuzzy name + DB cross-ref |
| Ch. 19-20: FitScore | ✅ v2 + market context + feedback factors |
| Ch. 23: Feedback loop | ✅ Prompt injection + FitScore adjustment |
| Ch. 24: Accuracy framework | ✅ Targets defined, tracked via ai_job_runs |
| Ch. 25: Latency target | ✅ <20s (handbook <15s, +5s market context) |
| Ch. 60: Market Discovery | ⏳ Market context in Phase 1. Full discovery (country suggestion) in Ring 3 |
| Ch. 62: Country Playbooks | ⏳ Market context collects base data. Full playbooks in Ring 3 |
| Ch. 170-188: Prompt architecture | ✅ 3-group prompts, structured JSON, confidence, anti-hallucination |

Future Decisions (not now, but not forgotten)

FD-1: Adaptive Pipeline (Ring 3+)

Simple searches use a short pipeline (CALL 1+2+deterministic), complex searches use the full pipeline (4 calls). Decision: automatic based on product analysis output.

FD-2: ML-Based Feedback (Ring 4+)

ML model training instead of prompt injection. When feedback data accumulates (1000+ ratings), automatic ranking improvement.

FD-3: Full Market Discovery (Ring 3)

Ch. 60: Without specifying a country, suggest the best markets based on product alone. Market context table supports this.

FD-4: Country Playbooks (Ring 3)

Ch. 62: Structured country guides. Extended version of the market context table.

FD-5: Import Data Integration (Ring 3+)

Ch. 61: Real import statistics integration. Requires an external data source.

FD-6: Vector DB (Ring 4)

Embedding-based search for company similarity. Enhances dedup and “similar companies” features.

Atomic Tasks

| # | Task | Ring | Size |
|---|------|------|------|
| PIPE-1 | Extend lib/discovery/types.ts — IProductAnalysis, IMarketContext, IQuerySet, IPipelineConfig | R2 | Small |
| PIPE-2 | export_ai_market_context table + RLS + index | R2 | Migration |
| PIPE-3 | lib/discovery/market-context.ts — cache lookup + AI call + DB save | R2 | Medium |
| PIPE-4 | lib/prompts/discovery-prompts.ts — 3-group prompt builder | R2 | Large |
| PIPE-5 | lib/discovery/stages/product-analysis.ts — CALL 1 | R2 | Medium |
| PIPE-6 | lib/discovery/stages/search.ts — CALL 2 (multi-source + multi-lang) | R2 | Large |
| PIPE-7 | lib/discovery/stages/enrich.ts — CALL 3 (extract + classify + enrich) | R2 | Large |
| PIPE-8 | lib/discovery/dedup.ts — domain + fuzzy name + DB cross-ref | R2 | Medium |
| PIPE-9 | lib/discovery/stages/evaluate.ts — confidence + FitScore + ranking | R2 | Medium |
| PIPE-10 | lib/discovery/feedback.ts — feedback lookup + prompt inject + score | R2 | Medium |
| PIPE-11 | lib/discovery/stages/activate.ts — configurable auto-headhunt | R2 | Medium |
| PIPE-12 | lib/discovery/pipeline.ts — 5-phase orchestrator | R2 | Large |
| PIPE-13 | Update lib/scoring/fitScore.ts — market context + feedback factors | R2 | Medium |
| PIPE-14 | Update app/api/discover/route.ts — delegate to new pipeline | R2 | Small |
| PIPE-15 | Pipeline latency + accuracy tracking (ai_job_runs integration) | R2 | Medium |
| PIPE-16 | companies.lower(name) index + feedback index | R2 | Migration |