Lessons from UR² Implementation: What We Learned Building Unified Retrieval and Reasoning

Six months ago, we decided to implement the UR² (Unified Retrieval and Reasoning) framework for AIMatrix. The promise was compelling: intelligent agents that could retrieve information and reason about it in a unified pipeline. The reality turned out to be more nuanced than the papers suggested.

Here’s what we learned building UR² in production, including the parts that worked, the parts that didn’t, and the surprises we encountered along the way.

Why UR² Made Sense for Us

Traditional RAG (Retrieval-Augmented Generation) systems retrieve documents and then reason about them separately. This works fine for simple Q&A, but our agents needed something more sophisticated.

Our customer service agents, for example, often needed to:

  1. Retrieve multiple policy documents
  2. Cross-reference information across them
  3. Reason about edge cases not explicitly covered
  4. Provide coherent, contextual responses

The sequential retrieve-then-reason approach produced inconsistent responses and missed nuanced edge cases.
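For contrast, the pipeline we were replacing boiled down to a single retrieval pass followed by a single reasoning pass. A simplified sketch (component names here are illustrative, not our actual classes):

# Simplified sketch of the sequential retrieve-then-reason pipeline we replaced.
# `retriever` and `reasoner` are illustrative placeholders.
async def sequential_rag(query, retriever, reasoner, top_k=5):
    # One retrieval pass up front, driven only by the raw query
    docs = await retriever.fetch(query, top_k=top_k)
    # One reasoning pass over whatever came back, relevant or not
    return await reasoner.generate_response(query, docs)

Everything downstream depends on that single up-front retrieval, which is exactly where the inconsistencies crept in.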

The UR² Promise vs. Reality

The Promise: UR² would elegantly unify retrieval and reasoning, making decisions about what information to fetch based on reasoning progress, and adapting retrieval strategies based on intermediate reasoning steps.

The Reality: It works, but with more complexity and edge cases than we anticipated.

Here’s our production UR² implementation:

class UR2Agent:
    def __init__(self, retriever, reasoner, difficulty_assessor):
        self.retriever = retriever
        self.reasoner = reasoner
        self.difficulty_assessor = difficulty_assessor
        self.max_iterations = 5
        
    async def process_query(self, query):
        # Initial difficulty assessment
        difficulty = await self.difficulty_assessor.assess(query)
        
        if difficulty.is_simple:
            # Simple queries get standard RAG treatment
            return await self._simple_rag_response(query)
        
        # Complex queries get full UR2 treatment
        return await self._unified_retrieval_reasoning(query, difficulty)
    
    async def _unified_retrieval_reasoning(self, query, difficulty):
        context = ReasoningContext(query=query, difficulty=difficulty)
        
        for iteration in range(self.max_iterations):
            # Determine what information we need
            retrieval_plan = await self.reasoner.plan_retrieval(context)
            
            if not retrieval_plan.needs_more_info:
                break
                
            # Fetch information based on reasoning state
            retrieved_docs = await self.retriever.fetch(
                retrieval_plan.queries,
                context.current_understanding
            )
            
            # Update reasoning context
            context = await self.reasoner.update_context(
                context, retrieved_docs
            )
            
            # Check if we can answer now
            if context.confidence > difficulty.confidence_threshold:
                break
        
        return await self.reasoner.generate_response(context)
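Wiring the agent together looks roughly like this; the retriever and reasoner constructors below are placeholders for whatever components you already have, and the model path is made up:

import asyncio

async def main():
    # Placeholder components and paths; substitute your own retriever/reasoner
    agent = UR2Agent(
        retriever=VectorRetriever(index="policy_docs"),
        reasoner=LLMReasoner(model_name="your-reasoning-model"),
        difficulty_assessor=LearnedDifficultyAssessor("models/difficulty_v2.pt"),
    )
    response = await agent.process_query(
        "Does our refund policy cover orders cancelled after partial shipment?"
    )
    print(response)

asyncio.run(main())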

Difficulty Assessment: Harder Than Expected

The difficulty assessment component turned out to be crucial but tricky to get right. Initially, we tried a simple heuristic approach:

# Our first, naive difficulty assessment
def assess_difficulty_v1(query):
    if len(query.split()) < 10:
        return DifficultyLevel.SIMPLE
    elif any(keyword in query.lower() for keyword in ['complex', 'multiple', 'compare']):
        return DifficultyLevel.COMPLEX
    else:
        return DifficultyLevel.MEDIUM

This failed spectacularly. Short queries could be incredibly complex (“What if clause 3.2 contradicts section A?”), while long queries might be straightforward.

We evolved to a learned difficulty assessor:

class LearnedDifficultyAssessor:
    def __init__(self, model_path):
        self.model = load_model(model_path)
        self.feature_extractor = QueryFeatureExtractor()
        
    async def assess(self, query):
        features = self.feature_extractor.extract(query)
        
        # Features we found actually matter:
        # - Semantic complexity (not just word count)
        # - Number of potential retrieval paths
        # - Historical query complexity for similar patterns
        # - Required reasoning depth
        
        prediction = await self.model.predict(features)
        
        return DifficultyAssessment(
            level=prediction.level,
            confidence_threshold=prediction.threshold,
            expected_iterations=prediction.iterations,
            retrieval_strategy=prediction.strategy
        )
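Most of the signal comes from the QueryFeatureExtractor. Here is a trimmed-down sketch of the kinds of features it produces; the embedder, entity extractor, and query log are placeholders for whatever you already run, and mean_difficulty_near is a hypothetical lookup against past queries:

class QueryFeatureExtractor:
    # Illustrative sketch; the production extractor computes more features
    def __init__(self, embedder, entity_extractor, query_log):
        self.embedder = embedder                  # sentence-embedding model
        self.entity_extractor = entity_extractor  # e.g. an NER pipeline
        self.query_log = query_log                # past queries with observed difficulty

    def extract(self, query):
        embedding = self.embedder.encode(query)
        entities = self.entity_extractor(query)
        lowered = query.lower()

        return {
            # Semantic complexity: conditional/comparative cues, not raw length
            "conditional_terms": sum(w in lowered for w in ("if", "unless", "except", "compare", "versus")),
            # Proxy for the number of potential retrieval paths
            "entity_count": len(entities),
            # Historical difficulty of semantically similar past queries (hypothetical lookup)
            "similar_query_difficulty": self.query_log.mean_difficulty_near(embedding),
            # Crude proxy for required reasoning depth
            "clause_count": lowered.count(",") + lowered.count(" and ") + lowered.count(" or "),
        }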

Selective Retrieval: The Good and The Frustrating

Selective retrieval (fetching information based on reasoning progress) was the most promising aspect of UR².

What Worked:

  • Agents stopped retrieving irrelevant documents
  • Multi-hop reasoning improved significantly
  • Context windows were used more efficiently

What Was Frustrating:

  • The retrieval planner sometimes got stuck in loops
  • Deciding when the agent had “enough information” proved harder than expected
  • Computational overhead increased substantially

Here’s our retrieval planner that evolved through multiple iterations:

class AdaptiveRetrievalPlanner:
    def __init__(self, query_generator, relevance_scorer):
        self.query_generator = query_generator
        self.relevance_scorer = relevance_scorer
        self.loop_detector = LoopDetector()
        
    async def plan_retrieval(self, context):
        # Check if we're in a retrieval loop
        if self.loop_detector.is_looping(context.retrieval_history):
            return RetrievalPlan(needs_more_info=False, reason="loop_detected")
        
        # Identify knowledge gaps
        gaps = await self._identify_knowledge_gaps(context)
        
        if not gaps:
            return RetrievalPlan(needs_more_info=False)
            
        # Generate targeted queries for gaps
        targeted_queries = []
        for gap in gaps:
            queries = await self.query_generator.generate_for_gap(
                gap, context.current_understanding
            )
            targeted_queries.extend(queries)
        
        # Score and rank potential queries
        scored_queries = await self.relevance_scorer.score_queries(
            targeted_queries, context
        )
        
        # Select top queries within token budget
        selected_queries = self._select_within_budget(
            scored_queries, context.remaining_budget
        )
        
        return RetrievalPlan(
            needs_more_info=True,
            queries=selected_queries,
            expected_docs=len(selected_queries) * 3  # rough estimate
        )
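The budgeting step matters more than it looks: without it, the planner happily queues up more retrieval than the context window can hold. One way to implement _select_within_budget is a simple greedy pick; the sketch below assumes the scorer returns objects with query and score attributes, and the per-query cost numbers are assumptions rather than measured figures:

# Possible _select_within_budget: greedily take the highest-scoring queries
# until the estimated token cost exhausts the remaining budget.
def _select_within_budget(self, scored_queries, remaining_budget,
                          avg_doc_tokens=400, docs_per_query=3):
    selected = []
    budget = remaining_budget

    for scored in sorted(scored_queries, key=lambda s: s.score, reverse=True):
        # Estimated cost: the query itself plus the documents it is expected to pull in
        estimated_cost = len(scored.query.split()) + avg_doc_tokens * docs_per_query
        if estimated_cost > budget:
            continue
        selected.append(scored.query)
        budget -= estimated_cost

    return selected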

The Loop Problem

One of our biggest challenges was preventing retrieval loops. Agents would sometimes get stuck asking for the same information repeatedly, especially when dealing with ambiguous queries.

Our solution involved multiple strategies:

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

class LoopDetector:
    def __init__(self):
        self.similarity_threshold = 0.85
        self.max_similar_queries = 2
        # Any sentence-embedding model works; this one is just an example
        self.embedder = SentenceTransformer("all-MiniLM-L6-v2")

    def is_looping(self, retrieval_history):
        if len(retrieval_history) < 3:
            return False

        recent_queries = retrieval_history[-5:]  # Look at the last 5 queries

        for i, query1 in enumerate(recent_queries):
            similar_count = 0
            for j, query2 in enumerate(recent_queries):
                if i != j and self._queries_similar(query1, query2):
                    similar_count += 1

            if similar_count >= self.max_similar_queries:
                return True

        return False

    def embed_query(self, query):
        return self.embedder.encode(query)

    def _queries_similar(self, q1, q2):
        # Use semantic similarity, not just string matching
        embedding1 = self.embed_query(q1)
        embedding2 = self.embed_query(q2)
        similarity = cosine_similarity([embedding1], [embedding2])[0][0]
        return similarity > self.similarity_threshold

Performance Challenges

UR² is computationally expensive. Each reasoning step requires model inference, and the adaptive nature means you can’t predict exactly how many steps you’ll need.

Optimization strategies we implemented:

  1. Caching intermediate results:
import hashlib
import json
from cachetools import LRUCache

class UR2Cache:
    def __init__(self):
        self.reasoning_cache = LRUCache(1000)
        self.retrieval_cache = LRUCache(5000)

    def cache_reasoning_step(self, context_hash, result):
        self.reasoning_cache[context_hash] = result

    def get_cached_reasoning(self, context):
        context_hash = self._hash_context(context)
        return self.reasoning_cache.get(context_hash)

    def _hash_context(self, context):
        # Hash only the fields that determine the next reasoning step
        key = json.dumps({"query": context.query, "understanding": context.current_understanding},
                         sort_keys=True, default=str)
        return hashlib.sha256(key.encode()).hexdigest()
  2. Parallel retrieval when possible:
async def parallel_retrieval(self, queries, context):
    # Group queries that can be executed in parallel
    parallel_groups = self._group_independent_queries(queries)
    
    results = []
    for group in parallel_groups:
        batch_results = await asyncio.gather(*[
            self.retriever.fetch(query, context) for query in group
        ])
        results.extend(batch_results)
        
        # Update context after each batch
        context = await self.reasoner.update_context(context, batch_results)
    
    return results
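How much parallelism you actually get depends on the grouping step. A simple version of _group_independent_queries treats queries with no keyword overlap as independent; a production version could use entity or embedding overlap instead. This is a sketch, not our exact grouping logic:

# Simple _group_independent_queries: queries that share no significant keywords
# are considered independent and can be fetched concurrently within one group.
def _group_independent_queries(self, queries, max_group_size=4):
    def keywords(q):
        return {w for w in q.lower().split() if len(w) > 3}

    groups, group_keywords = [], []
    for query in queries:
        kw = keywords(query)
        placed = False
        for group, seen in zip(groups, group_keywords):
            # Only join a group if the query overlaps with none of its members
            if len(group) < max_group_size and not (kw & seen):
                group.append(query)
                seen |= kw
                placed = True
                break
        if not placed:
            groups.append([query])
            group_keywords.append(set(kw))
    return groups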

When UR² Works Best

After six months in production, we’ve learned that UR² shines in specific scenarios:

Great for:

  • Multi-document reasoning tasks
  • Queries requiring iterative information gathering
  • Complex policy or regulatory questions
  • Scenarios where context builds progressively

Not worth the overhead for:

  • Simple factual questions
  • Single-document queries
  • Time-sensitive responses (the iterations add latency)
  • Highly structured data queries

Current Challenges and Next Steps

What we’re still working on:

  1. Cost management: UR² uses significantly more tokens than traditional RAG. We’re experimenting with smaller models for intermediate steps.

  2. Latency optimization: The iterative process can be slow. We’re exploring speculative execution and better caching.

  3. Quality evaluation: Traditional RAG metrics don’t capture UR²’s benefits well. We’re developing new evaluation frameworks.

  4. Failure recovery: When UR² gets confused, it can fail spectacularly. Better error handling is ongoing work; a simple stopgap pattern is sketched below.
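The crudest mitigation is simply bounding the damage: if the unified loop throws or blows past a latency budget, fall back to the plain RAG path instead of surfacing a half-finished chain of reasoning. A minimal sketch, assuming the UR2Agent interface shown earlier; the 30-second budget is illustrative:

import asyncio

# Minimal fallback sketch, not our full error handling: if the unified loop
# raises or exceeds the latency budget, degrade to the simple RAG path.
async def process_query_with_fallback(agent, query, timeout_s=30):
    try:
        return await asyncio.wait_for(agent.process_query(query), timeout=timeout_s)
    except Exception:
        # Includes asyncio.TimeoutError; a confused multi-step answer is usually
        # worse than a plain single-pass one
        return await agent._simple_rag_response(query)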

Should You Implement UR²?

If you’re dealing with complex, multi-step reasoning tasks and can afford the computational overhead, UR² can provide significant quality improvements. But it’s not a drop-in replacement for traditional RAG.

Start with traditional RAG, identify where it fails, and then selectively apply UR² to those complex cases. That’s what we’re doing now, and it’s working much better than trying to use UR² for everything.

The framework is promising, but like most AI research implementations, the production reality is messier and more nuanced than the papers suggest. Build with that expectation, and you’ll be better prepared for the challenges ahead.