Testing ADK Agents with State Validation: An Alternative to Full Session Replay
Google's ADK provides `adk eval` for agent testing. A full conversation session can be saved via the web UI, exported as an `evalset.json` file, then replayed to verify the agent produces the expected responses. This is great for regression testing, but heavyweight for active development.
This week I'm trying a different approach: state-based unit testing that verifies specific agent behaviors by inspecting session state after targeted interactions. While adk eval tests conversation flows, state validation tests implementation correctness. They complement each other rather than compete.
The Scenario: Testing Agent Decision Logic
Consider an agent that handles file uploads and searches. The agent should:
- Save uploaded files before searching
- Convert PDFs to images for visual search if appropriate
- Trigger fallback search when initial results are insufficient
These are testable behaviors. While adk eval with conversation evalsets works well for end-to-end flows, a more programmatic approach can be useful during active development: write targeted tests that send specific messages, wait for processing, then inspect the session state to verify the agent made correct decisions.
What is State-Based Testing?
State-based testing treats the agent as a stateful system. Instead of verifying every response text matches expected output, the test verifies the agent updated its internal state correctly.
Here's the pattern:
```python
import asyncio
import aiohttp

async def test_agent_behavior():
    # 1. Create session
    session_id = await create_session()

    # 2. Send targeted message
    await send_message(session_id, "specific test input")

    # 3. Wait for processing
    await asyncio.sleep(2)

    # 4. Inspect state
    state = await get_session_state(session_id)

    # 5. Assert on state properties
    assert state['expected_flag'] is True
    assert 'expected_data' in state
```
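The `create_session`, `send_message`, and `get_session_state` helpers are assumed throughout. A minimal sketch against the ADK API server's session and `/run` endpoints (the same URLs used in the tests below) might look like this; the base URL, app name, and user id are placeholders:

```python
import asyncio
import aiohttp

# Assumed server details — adjust to your deployment
BASE_URL = "http://localhost:8000"
APP_NAME = "my_agent"
USER_ID = "test_user"

def build_run_payload(session_id: str, text: str) -> dict:
    """Build the JSON body the /run endpoint expects for a plain-text turn."""
    return {
        "appName": APP_NAME,
        "userId": USER_ID,
        "sessionId": session_id,
        "newMessage": {"parts": [{"text": text}], "role": "user"},
    }

async def create_session() -> str:
    """Create an ADK session and return its id."""
    url = f"{BASE_URL}/apps/{APP_NAME}/users/{USER_ID}/sessions"
    async with aiohttp.ClientSession() as http:
        async with http.post(url) as resp:
            result = await resp.json()
            # The key name may vary by ADK version ('id' vs 'session_id')
            return result.get("id") or result.get("session_id")

async def send_message(session_id: str, text: str) -> None:
    """Post one user message through the /run endpoint."""
    async with aiohttp.ClientSession() as http:
        payload = build_run_payload(session_id, text)
        async with http.post(f"{BASE_URL}/run", json=payload) as resp:
            await resp.json()

async def get_session_state(session_id: str) -> dict:
    """Fetch the session resource and return its state dict."""
    url = f"{BASE_URL}/apps/{APP_NAME}/users/{USER_ID}/sessions/{session_id}"
    async with aiohttp.ClientSession() as http:
        async with http.get(url) as resp:
            body = await resp.json()
            return body.get("state", {})
```

Factoring the payload builder out as a pure function keeps it unit-testable without a running server.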
No full conversation history needed. No brittle text matching. Just verify the agent's internal state reflects correct decision-making.
Test 1: File Upload Processing Order
The first test verifies that when a user uploads a PDF with a search query, the agent saves the file before searching. This prevents the search from running without the data.
Here's the test structure:
```python
async def test_pdf_processing_before_search():
    """
    Test that uploading a file forces the agent to process it
    before attempting a search.
    """
    base_url = "http://localhost:8000"
    app_name = "my_agent"
    user_id = "test_user"

    async with aiohttp.ClientSession() as session:
        # Create session
        url = f"{base_url}/apps/{app_name}/users/{user_id}/sessions"
        async with session.post(url) as response:
            result = await response.json()
            session_id = result.get('session_id')

        # Send message with file attachment
        run_payload = {
            "appName": app_name,
            "userId": user_id,
            "sessionId": session_id,
            "newMessage": {
                "parts": [
                    {"text": "Find me this product"},
                    {
                        "inline_data": {
                            "mime_type": "application/pdf",
                            "data": base64_encoded_pdf  # prepared beforehand
                        }
                    }
                ],
                "role": "user"
            }
        }
        async with session.post(f"{base_url}/run", json=run_payload) as response:
            await response.json()

        # Inspect state
        session_url = f"{base_url}/apps/{app_name}/users/{user_id}/sessions/{session_id}"
        async with session.get(session_url) as response:
            history = await response.json()
            state = history.get('state', {})

    # Verify processing order via state
    assert 'artifacts' in state, "File not saved"
    assert 'search_results' in state, "Search not triggered"

    # Check conversion happened
    artifacts = state.get('artifacts', [])
    image_artifacts = [a for a in artifacts if a.get('type') == 'image']
    assert len(image_artifacts) > 0, "PDF not converted to images"
```
Key architectural decisions:
- Inline file upload: Using ADK's `inline_data` format eliminates the need for multipart form data or external file storage. The file is embedded directly in the message.
- State inspection: Instead of checking whether the response says "I've converted your PDF," we verify that `artifacts` exist and contain images. This is agnostic to conversation wording and turn count.
- Async sleep for processing time: ADK agents run asynchronously. The test waits 2-5 seconds to ensure tools execute before inspecting state.
- Tool execution verification: Tool calls aren't directly visible in ADK's session API, but the test can verify that required tools ran by checking for their state artifacts. The presence of both `artifacts` and `search_results` confirms the agent called both tools.
Test 2: Cascading Search Fallback
The second test verifies automatic fallback behavior. When an initial search returns insufficient results (< 10 items), the agent should automatically trigger a broader search without user intervention.
Here's the test pattern:
```python
async def test_search_cascade():
    """
    Test that insufficient search results trigger automatic fallback
    to a broader search source.
    """
    # Create session and send initial query
    session_id = await create_session()
    await send_message(session_id, "I need product x")

    # Give the agent time to run the search cascade
    await asyncio.sleep(2)

    # Inspect state
    state = await get_session_state(session_id)

    # Verify cascade flags
    assert state.get('initial_search_exhausted') is True
    assert state.get('fallback_triggered') is True

    # Verify search source changed
    search_source = state.get('search_source', '')
    assert 'fallback' in search_source.lower(), \
        f"Expected fallback source, got: {search_source}"

    # Verify results were merged
    results = state.get('search_results', [])
    assert len(results) > 10, "Insufficient results after cascade"
```
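For this test to have something to observe, the agent-side tool has to write its decisions into session state. A sketch of such a cascade tool, with the search backends modeled as hypothetical stub functions and the state as a plain dict (in a real ADK tool this would be `tool_context.state`):

```python
MIN_RESULTS = 10  # cascade threshold from the rule above

def primary_search(query: str) -> list:
    """Hypothetical narrow search — returns few results for the demo."""
    return [f"{query}-hit-{i}" for i in range(3)]

def fallback_search(query: str) -> list:
    """Hypothetical broader search source."""
    return [f"{query}-broad-{i}" for i in range(12)]

def cascade_search(query: str, state: dict) -> list:
    """Search the primary source; fall back to a broader one when thin."""
    results = primary_search(query)
    state["search_source"] = "primary"
    if len(results) < MIN_RESULTS:
        # Record each decision so tests can verify it from session state
        state["initial_search_exhausted"] = True
        state["fallback_triggered"] = True
        state["search_source"] = "fallback_catalog"
        results = results + fallback_search(query)
    state["search_results"] = results
    return results
```

The point is that every branch taken leaves a state fingerprint; the test above asserts on exactly those fingerprints.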
Why this matters:
- No conversation history needed: The test doesn't verify what the agent said in response. It only verifies that the agent set `initial_search_exhausted=True` and changed `search_source`.
- Tests decision logic, not text: Refactoring prompts or changing response formatting won't break this test as long as the state transitions remain correct.
- Fast iteration: No need to open `adk web`, conduct a conversation, save to an evalset, then run `adk eval`. Simply run `python test_cascade.py`.
The Core Pattern: Verify State, Not Responses
The insight is that consistent agent behavior comes from the right state transitions, not the right text output. An agent can say "Sure, I'll search for that!" and never actually search. State-based testing catches that.
Here's the assertion pattern I use:
```python
# Get state
state = history.get('state', {})

# A. Verify tools executed (inferred from state artifacts)
assert 'expected_state_key' in state, "Tool didn't execute"

# B. Verify flags set correctly
assert state.get('decision_flag') is True, "Decision logic failed"

# C. Verify data transformations
data = state.get('processed_data', [])
assert len(data) > 0, "No data produced"
assert data[0].get('required_field'), "Data missing required field"
```
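When a test makes many such assertions, collecting all failures before reporting is friendlier than stopping at the first one. A small helper along those lines (my own convention, not an ADK API):

```python
def check_state(state: dict, expectations: dict) -> list:
    """Compare state against expected key/value pairs; return error strings.

    A value of ... (Ellipsis) means "the key must merely exist".
    """
    errors = []
    for key, expected in expectations.items():
        if key not in state:
            errors.append(f"missing state key: {key}")
        elif expected is not ... and state[key] != expected:
            errors.append(f"{key}: expected {expected!r}, got {state[key]!r}")
    return errors
```

A test can then assert `not check_state(state, {...})` and print the full error list on failure instead of failing on the first mismatch.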
State-Based vs ADK Eval vs Tracing: Choosing the Right Tool
ADK provides multiple approaches for verifying agent behavior. How do they compare?
| Approach | Use Case | Pros | Cons |
|---|---|---|---|
| State-based testing | Unit testing during development | Fast, robust to prompt changes, tests logic, automatable | Limited conversation flow coverage, no execution trace visibility |
| ADK eval | Regression testing, end-to-end flows | Full conversation replay, UI for creating tests | Can be slow for iteration |
| ADK web tracing | Interactive debugging, exploration | Full execution visibility (tool calls, reasoning, state), real-time inspection | Manual only, not automatable |
| All three | Production systems | Comprehensive coverage | More tooling to maintain |
Key differences:
- State-based testing verifies outcomes (final state values) programmatically
- ADK eval verifies conversation quality (response text matching) through replay
- ADK web tracing provides execution visibility (tool calls, reasoning flow) for debugging
Recommended pattern for production:
- Use state-based tests for rapid development and logic validation (automated CI/CD)
- Use ADK eval for critical user flows and regression testing
- Use ADK web tracing for interactive debugging when tests fail or behavior is unexpected
They complement each other. State tests catch broken logic immediately; eval tests catch broken user experiences over time; tracing helps diagnose why failures occurred.
In my workflow, I write 10-15 state-based tests, covering edge cases and decision paths. Then I save 2-3 full conversation sessions to evalsets for the happy path and critical error scenarios. The state tests usually run much faster than the full eval tests, making them suitable for automated CI/CD.
Debug-Friendly Output
One advantage of writing tests as Python scripts (rather than JSON evalsets) is you can add rich debug output:
```python
# Print state keys for debugging
print(f"📋 Session State Keys: {list(state.keys())}")

# Print detailed cascade information
print("📊 CASCADE VERIFICATION")
print(f"   - Initial results count: {initial_count}")
print(f"   - Exhausted flag: {state.get('exhausted')}")
print(f"   - Fallback triggered: {state.get('fallback_triggered')}")
print(f"   - Search source: {state.get('search_source')}")

# Conditional debug output
if not test_passed:
    print("❌ TEST FAILED - Issues found:\n")
    for i, error in enumerate(errors, 1):
        print(f"  {i}. {error}")
```
When a test fails, you get immediate context on expected vs. actual state. That beats `adk eval`'s output, which is verbose with granular logs yet often light on the error details that matter.
Testing Multi-Turn Flows
State-based testing works great for multi-turn conversations too. The test simply sends messages sequentially and verifies state after each turn:
```python
async def test_multi_turn_flow():
    session_id = await create_session()

    # Turn 1: Initial query
    await send_message(session_id, "I need product X")
    await asyncio.sleep(2)
    state = await get_session_state(session_id)
    assert state.get('query_understood') is True

    # Turn 2: Provide additional context
    await send_message(session_id, "I need it in blue")
    await asyncio.sleep(2)
    state = await get_session_state(session_id)
    assert state.get('color_preference') == 'blue'
    assert state.get('ready_to_search') is True

    # Turn 3: Trigger search
    await send_message(session_id, "Search now")
    await asyncio.sleep(5)
    state = await get_session_state(session_id)
    assert 'search_results' in state
    assert len(state['search_results']) > 0
```
Each turn builds on the previous state. This lets the test verify the agent's memory and context management, not just individual responses.
Running the Tests
Since these are just Python scripts with asyncio, they can be run directly:
```bash
# Start ADK server
adk web

# In another terminal, run the tests
cd test_evals
python test_cascade.py
python test_pdf_flow.py

# Or use pytest for better output
pytest test_evals/ -v
```
No special ADK eval commands needed. No JSON evalset files. Just standard Python testing workflows.
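To run these under pytest, the async tests need an async-aware plugin. A sketch assuming `pytest-asyncio` is installed (`pip install pytest-asyncio`) and the helper functions from the earlier tests are importable:

```python
# test_cascade.py — assumes the pytest-asyncio plugin is installed
import asyncio
import pytest

@pytest.mark.asyncio
async def test_cascade_flags():
    session_id = await create_session()           # helper from the tests above
    await send_message(session_id, "I need product x")
    await asyncio.sleep(2)                        # let the cascade run
    state = await get_session_state(session_id)
    assert state.get("fallback_triggered") is True
```

With `asyncio_mode = auto` in `pytest.ini`, the marker can even be dropped entirely.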
Limitations and When to Use ADK Eval
State-based testing has limitations:
What it can't test:
- ❌ Exact response wording (testing this is too brittle anyway)
- ❌ Response formatting for end users (this is UX, not logic)
- ❌ Complex conversation branches where state alone doesn't tell the story
When to use ADK eval instead:
- ✅ Testing complete user journeys with multiple conversation turns
- ✅ Verifying response quality for critical flows (customer-facing demos)
- ✅ Regression testing after major prompt refactors
- ✅ Testing conversation branching (user says A vs B → different paths)
State-based tests are better for low-level unit testing.
Best Practices
From my experience building these tests, here's what works:
Keep tests focused. Each test should verify one decision path or behavior. When a test fails, it should be immediately obvious what broke.
Use descriptive state keys. Instead of generic flags like done or result, use specific keys like pdf_converted, fallback_triggered, or search_exhausted. This makes assertions self-documenting.
Add debug output. Detailed state dumps are invaluable. Print the full state dictionary, flag values, and any intermediate calculations.
Wait for async operations properly. Use asyncio.sleep() with generous timeouts (2-5 seconds) or poll until expected state keys appear.
TL;DR: Key Takeaways
State-based testing provides a fast, robust way to verify agent decision logic during active development. By inspecting session state instead of matching response text, tests can be faster and agnostic to prompt changes.
Key insights:
- Verify state transitions, not response text
- Infer tool execution from state artifacts
- Use asyncio.sleep() or polling to wait for async agent processing
- Complements ADK eval and ADK web tracing
- Debug output is more informative with Python scripts (vs JSON evalsets)
When to use this pattern:
- ✅ Unit testing agent decision logic
- ✅ Verifying tool execution and data transformations
- ✅ Testing edge cases and error paths
- ✅ Rapid iteration during development
- ❌ Not for end-to-end conversation quality testing (use ADK eval)
This pattern transformed my testing workflow from saving sessions in the UI and running full eval replays to writing focused tests that run in seconds and catch logic bugs immediately. Highly recommended for production ADK agents.
This is part 7 of my series on building production AI agents with Google ADK and AWS. Check out the other posts with the Google ADK tags.
