Research Pipeline
How Zorora’s 6-phase research pipeline works.
Overview
Zorora’s deep research workflow executes a 6-phase pipeline that searches across academic databases, web sources, and newsroom articles, then synthesizes findings with credibility scoring and citation graphs.
Pipeline Phases
Phase 1: Parallel Source Aggregation
What happens:
- Searches academic databases (7 sources) in parallel:
  - Google Scholar
  - PubMed
  - CORE
  - arXiv
  - bioRxiv
  - medRxiv
  - PubMed Central (PMC)
- Searches web sources:
  - Brave Search API (primary)
  - DuckDuckGo (fallback)
- Fetches newsroom articles (Asoba API)
- All searches happen simultaneously for speed
Implementation:
# Parallel execution: all three source categories run concurrently
academic_sources, web_sources, newsroom_sources = await asyncio.gather(
    academic_search(query),    # 7 academic databases in parallel
    web_search(query),         # Brave Search, DuckDuckGo fallback
    newsroom_search(query),    # Asoba newsroom API
)
Output: Raw sources from all three categories
Performance: ~8 seconds (parallel execution)
Phase 2: Citation Following
What happens:
- Explores cited papers from initial sources
- Configurable depth (1-3 hops)
- Builds citation graph
- Follows most relevant citations
Depth Levels:
- Quick (depth=1): Skips citation following (~25-35s total)
- Balanced (depth=2): 1-hop citation following (~35-50s total) - Coming soon
- Thorough (depth=3): Multi-hop citations (~50-70s total) - Coming soon
Implementation:
if depth > 1:
    cited_papers = extract_citations(initial_sources)
    cited_sources = fetch_cited_papers(cited_papers, depth=depth - 1)
    sources.extend(cited_sources)
Output: Extended source set with citation relationships
Performance: Adds ~10-20 seconds per depth level
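For the multi-hop depths, the expansion can be pictured as a loop over citation frontiers. The sketch below is illustrative only: it reuses the extract_citations and fetch_cited_papers helpers from the snippet above and assumes depth=1 means no expansion.
def follow_citations(initial_sources, depth):
    """Illustrative hop-by-hop citation expansion; depth=1 returns the inputs unchanged."""
    sources = list(initial_sources)
    frontier = list(initial_sources)
    for _ in range(depth - 1):                                 # one pass per extra hop
        next_frontier = []
        for paper in frontier:
            cited_ids = extract_citations([paper])             # identifiers this paper cites
            for cited in fetch_cited_papers(cited_ids, depth=1):
                if cited not in sources:                       # avoid revisiting known papers
                    sources.append(cited)
                    next_frontier.append(cited)
        frontier = next_frontier
    return sources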
Phase 3: Cross-Referencing
What happens:
- Groups claims by similarity
- Counts agreement across sources
- Identifies conflicting claims
- Highlights consensus
Implementation:
claims = extract_claims(sources)
grouped_claims = group_by_similarity(claims)
agreement_counts = count_agreement(grouped_claims)
Output: Grouped claims with agreement counts
Performance: ~2 seconds
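The grouping step depends on a claim-similarity measure that this page does not specify. As an illustration only, the sketch below clusters claims greedily using token-overlap (Jaccard) similarity and counts how many distinct sources back each group; the real implementation may use a different similarity function.
def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two claim strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0

def group_by_similarity(claims, threshold=0.6):
    """Greedy grouping: each (claim_text, source_id) pair joins the first group it resembles."""
    groups = []
    for claim, source_id in claims:
        for group in groups:
            if jaccard(claim, group[0][0]) >= threshold:
                group.append((claim, source_id))
                break
        else:
            groups.append([(claim, source_id)])
    return groups

def count_agreement(groups):
    """Number of distinct sources backing each claim group."""
    return [len({source_id for _, source_id in group}) for group in groups]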
Phase 4: Credibility Scoring
What happens:
- Rules-based scoring of source authority
- Factors considered:
  - Domain reputation (Nature=0.85, arXiv=0.50, etc.)
  - Citation count
  - Cross-reference agreement
  - Publisher type (academic journals vs predatory publishers)
  - Retraction status
Scoring Rules:
- High (0.7-1.0): Peer-reviewed journals, reputable sources
- Medium (0.4-0.7): Preprints, reputable websites
- Low (0.0-0.4): Unverified sources, low-citation papers
Implementation:
for source in sources:
    score = calculate_credibility(source)
    source.credibility_score = score
    source.credibility_category = categorize(score)
Output: Sources with credibility scores and categories
Performance: ~2 seconds
Phase 5: Citation Graph Building
What happens:
- Constructs directed graph showing source relationships
- Maps citation connections
- Visualizes research network
- Identifies key papers
Implementation:
graph = build_citation_graph(sources)
key_papers = identify_key_papers(graph)
Output: Citation graph structure
Performance: ~1 second
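Assuming citation relationships are available as (citing, cited) pairs from Phase 2, a minimal sketch of this step might use a standard graph library such as networkx (an assumption; the page does not name the library). Key papers are approximated here by in-degree, i.e. how often a paper is cited within the collected set.
import networkx as nx

def build_citation_graph(citation_edges):
    """Directed graph where an edge A -> B means paper A cites paper B."""
    graph = nx.DiGraph()
    graph.add_edges_from(citation_edges)
    return graph

def identify_key_papers(graph, top_n=5):
    """Rank papers by how often they are cited within the collected source set."""
    ranked = sorted(graph.nodes, key=graph.in_degree, reverse=True)
    return ranked[:top_n]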
Phase 6: Synthesis
What happens:
- Uses reasoning model to synthesize findings
- Generates comprehensive answer
- Includes citations with confidence levels
- Highlights key findings
- Notes areas of consensus and disagreement
Implementation:
prompt = f"""
SOURCES:
[Academic]: {academic_content}
[Web]: {web_content}
[Newsroom]: {newsroom_content}
QUESTION: {query}
Synthesize findings from ALL sources above.
Cite sources using [Academic], [Web], or [Newsroom] tags.
"""
synthesis = reasoning_model.generate(prompt)
Output: Final synthesis with citations
Performance: ~15-25 seconds (local reasoning model)
Pipeline Execution
Complete Flow
Query
↓
Phase 1: Parallel Source Aggregation (~8s)
├─► Academic (7 sources)
├─► Web (Brave + DDG)
└─► Newsroom
↓
Phase 2: Citation Following (~10-20s, if depth > 1)
↓
Phase 3: Cross-Referencing (~2s)
↓
Phase 4: Credibility Scoring (~2s)
↓
Phase 5: Citation Graph Building (~1s)
↓
Phase 6: Synthesis (~15-25s)
↓
Result (with citations and confidence levels)
Total Time by Depth
- Quick (depth=1): ~25-35 seconds
  - Phase 1: ~8s
  - Phase 3: ~2s
  - Phase 4: ~2s
  - Phase 5: ~1s
  - Phase 6: ~15-25s
- Balanced (depth=2): ~35-50 seconds - Coming soon
  - Adds Phase 2: ~10-15s
- Thorough (depth=3): ~50-70 seconds - Coming soon
  - Adds Phase 2: ~20-30s
Data Flow
Source Aggregation
Query
↓
┌─────────────────────────────────────┐
│ Parallel Source Aggregation │
├─────────────────────────────────────┤
│ Academic Search (7 sources) │
│ Web Search (Brave + DDG) │
│ Newsroom Search (Asoba API) │
└─────────────────────────────────────┘
↓
Raw Sources (academic, web, newsroom)
Processing Pipeline
Raw Sources
↓
Citation Following (if depth > 1)
↓
Extended Sources
↓
Cross-Referencing
↓
Grouped Claims
↓
Credibility Scoring
↓
Scored Sources
↓
Citation Graph Building
↓
Research Graph
↓
Synthesis
↓
Final Result
Implementation Details
Academic Search
Sources:
- Google Scholar
- PubMed
- CORE
- arXiv
- bioRxiv
- medRxiv
- PubMed Central (PMC)
Parallel Execution:
results = await asyncio.gather(
    scholar_search(query),
    pubmed_search(query),
    core_search(query),
    arxiv_search(query),
    biorxiv_search(query),
    medrxiv_search(query),
    pmc_search(query),
)
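One practical refinement worth noting (an assumption about the implementation, not documented behavior): passing return_exceptions=True lets a single failing database degrade gracefully instead of aborting all seven searches.
results = await asyncio.gather(
    scholar_search(query),
    pubmed_search(query),
    core_search(query),
    arxiv_search(query),
    biorxiv_search(query),
    medrxiv_search(query),
    pmc_search(query),
    return_exceptions=True,                  # failed sources come back as exception objects
)
academic_sources = [r for r in results if not isinstance(r, Exception)]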
Web Search
Primary: Brave Search API
- High quality results
- Requires API key
- Free tier: 2000 queries/month
Fallback: DuckDuckGo
- No API key required
- Lower quality results
- Used if Brave unavailable
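A minimal sketch of the primary/fallback behavior described above. brave_search and duckduckgo_search are placeholder helpers, and the exact conditions that trigger the fallback (missing key, request error) are assumptions.
import os

async def web_search(query):
    """Prefer Brave Search; fall back to DuckDuckGo if Brave is unavailable or errors out."""
    if os.environ.get("BRAVE_API_KEY"):
        try:
            return await brave_search(query)     # placeholder: Brave Search API call
        except Exception:
            pass                                 # e.g. quota exhausted or network error
    return await duckduckgo_search(query)        # placeholder: keyless fallback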
Newsroom Search
Source: Asoba API
- Energy industry newsroom
- 90 days back
- Max 25 relevant articles
- Filtered by relevance
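The sketch below illustrates the documented constraints (90-day lookback, relevance filtering, 25-article cap). fetch_newsroom_articles and the published_at / relevance_score fields are hypothetical names, since the Asoba API surface is not documented on this page.
from datetime import datetime, timedelta, timezone

async def newsroom_search(query, max_articles=25, lookback_days=90):
    """Fetch, filter to the last 90 days, rank by relevance, and cap the result set."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=lookback_days)
    articles = await fetch_newsroom_articles(query)                   # hypothetical Asoba API wrapper
    recent = [a for a in articles if a["published_at"] >= cutoff]     # hypothetical field names
    recent.sort(key=lambda a: a["relevance_score"], reverse=True)
    return recent[:max_articles]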
Credibility Scoring
Domain-Based Scoring:
- Nature, Science: 0.85
- Peer-reviewed journals: 0.70-0.85
- arXiv, bioRxiv, medRxiv: 0.50-0.70
- Reputable websites: 0.40-0.70
- Unverified sources: 0.00-0.40
Citation Modifiers:
- High citations: +0.10
- Medium citations: +0.05
- Low citations: 0.00
Cross-Reference Agreement:
- High agreement: +0.10
- Medium agreement: +0.05
- Low agreement: 0.00
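Putting the rules above together, calculate_credibility might look roughly like the following sketch. The domain table is abbreviated, and the default score for unknown domains, the retraction handling, and the numeric thresholds for "high"/"medium" citations and agreement are all assumptions.
DOMAIN_SCORES = {
    "nature.com": 0.85,
    "science.org": 0.85,
    "arxiv.org": 0.50,
    "biorxiv.org": 0.50,
    "medrxiv.org": 0.50,
}

def calculate_credibility(source):
    if getattr(source, "is_retracted", False):      # retracted papers score lowest
        return 0.0
    score = DOMAIN_SCORES.get(source.domain, 0.40)  # assumed default for other domains
    if source.citation_count >= 100:                # assumed threshold for "high" citations
        score += 0.10
    elif source.citation_count >= 10:               # assumed threshold for "medium" citations
        score += 0.05
    if source.agreement_count >= 3:                 # assumed threshold for "high" agreement
        score += 0.10
    elif source.agreement_count >= 2:
        score += 0.05
    return min(score, 1.0)

def categorize(score):
    if score >= 0.7:
        return "high"
    if score >= 0.4:
        return "medium"
    return "low"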
Synthesis
Model: Reasoning model (qwen2.5:32b or configured alternative)
Prompt Structure:
SOURCES:
[Academic]: {academic_content}
[Web]: {web_content}
[Newsroom]: {newsroom_content}
QUESTION: {query}
Synthesize findings from ALL sources above.
Cite sources using [Academic], [Web], or [Newsroom] tags.
Highlight key findings and areas of consensus/disagreement.
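Assuming the reasoning model is served locally through Ollama (the model name matches an Ollama tag, but the serving layer itself is an assumption), the synthesis call could be sketched as:
import ollama

def synthesize(query, academic_content, web_content, newsroom_content, model="qwen2.5:32b"):
    prompt = f"""
SOURCES:
[Academic]: {academic_content}
[Web]: {web_content}
[Newsroom]: {newsroom_content}

QUESTION: {query}

Synthesize findings from ALL sources above.
Cite sources using [Academic], [Web], or [Newsroom] tags.
Highlight key findings and areas of consensus/disagreement.
"""
    response = ollama.generate(model=model, prompt=prompt)
    return response["response"]                      # generated synthesis text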
Performance Optimization
Parallel Execution
All source searches happen in parallel:
- Academic (7 sources): Parallel
- Web (Brave + DDG): Parallel
- Newsroom: Parallel
- Total: ~8 seconds (vs ~30+ seconds sequential)
Caching
Research Results:
- Cached for 1 hour (general queries)
- Cached for 24 hours (stable topics)
- Cache key: query hash
Source Data:
- Academic sources: Cached per query
- Web sources: Cached per query
- Newsroom: Cached per query
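A minimal sketch of the query-hash cache with the TTLs listed above. The in-memory dict backend and the stable_topic flag are assumptions; the real store and the way "stable topics" are detected may differ.
import hashlib
import time

_cache = {}                                          # cache_key -> (expires_at, result)

def cache_key(query: str) -> str:
    return hashlib.sha256(query.strip().lower().encode()).hexdigest()

def get_cached(query):
    entry = _cache.get(cache_key(query))
    if entry and entry[0] > time.time():
        return entry[1]                              # still fresh
    return None

def cache_result(query, result, stable_topic=False):
    ttl = 24 * 3600 if stable_topic else 3600        # 24 h for stable topics, 1 h otherwise
    _cache[cache_key(query)] = (time.time() + ttl, result)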
See Also
- Architecture - Overall architecture
- Storage - Storage architecture
- Research Workflow - Usage guide
- API Reference - API documentation