A Production-Grade Architecture for Agentic AI Systems on Microsoft Azure
A deep-dive into designing, deploying, and operating multi-agent AI systems on Microsoft Azure — covering orchestration, memory, tool integration, observability, and cost controls.
Building a chatbot that answers questions is straightforward. Building an agentic AI system that plans multi-step tasks, calls external tools, maintains memory across sessions, and stays within budget while running reliably in production — that's an engineering challenge of a completely different order.
This article walks through the end-to-end architecture I use to ship production-grade Agentic AI systems on Microsoft Azure.
This architecture assumes Azure OpenAI as the LLM provider, but the patterns apply broadly to any cloud + LLM combination.
## What Is an Agentic AI System?
An agentic system is one where an LLM doesn't just respond to a single prompt — it reasons, plans, and takes actions across multiple steps. The model decides which tools to call, in what order, and when the task is complete.
A minimal agent loop receives a request, lets the model choose an action, executes any requested tool, feeds the observation back into the context, and repeats until the model signals that the task is complete.
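That loop can be sketched in a few lines. Everything here is illustrative: `call_llm` is a stub standing in for a real Azure OpenAI call, and `TOOLS` is a hypothetical tool registry.

```python
# Minimal reason -> act -> observe loop. call_llm and TOOLS are stand-ins,
# not a real model or API.

def call_llm(messages: list[dict]) -> dict:
    """Stub model: returns a canned decision. A real system calls Azure OpenAI here."""
    if any(m["role"] == "tool" for m in messages):
        return {"action": "finish", "answer": "done"}
    return {"action": "tool", "tool": "search", "args": {"query": "azure agents"}}

TOOLS = {"search": lambda query: f"results for {query!r}"}

def agent_loop(user_request: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": user_request}]
    for _ in range(max_steps):                 # hard step limit prevents runaway loops
        decision = call_llm(messages)
        if decision["action"] == "finish":     # the model decides when the task is done
            return decision["answer"]
        observation = TOOLS[decision["tool"]](**decision["args"])
        messages.append({"role": "tool", "content": observation})
    raise RuntimeError("step budget exhausted")
```

The step limit matters as much as the loop itself: without it, a confused model can cycle indefinitely.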
## Core Architectural Layers

### 1. Orchestration Layer
The orchestration layer is the brain of the system. It receives the user request, maintains the conversation context, decides which agents or tools to invoke, and assembles the final response.
On Azure, I use Azure AI Foundry with the Semantic Kernel SDK as the primary orchestration framework.
```python
import asyncio
from semantic_kernel import Kernel
from semantic_kernel.connectors.ai.open_ai import AzureChatCompletion
from semantic_kernel.agents import ChatCompletionAgent

kernel = Kernel()
kernel.add_service(
    AzureChatCompletion(
        deployment_name="gpt-4o",
        endpoint="https://your-resource.openai.azure.com",
        api_key="YOUR_API_KEY",  # in production, load this from Azure Key Vault
    )
)

agent = ChatCompletionAgent(
    service_id="default",
    kernel=kernel,
    name="OrchestratorAgent",
    instructions="""
    You are an orchestrator agent. Analyze the user's request,
    break it into sub-tasks, and delegate to the appropriate tools or sub-agents.
    Always verify your outputs before returning to the user.
    """,
)
```

### 2. Memory Architecture
Agentic systems need multiple types of memory:
| Memory Type | Storage | Use Case |
|---|---|---|
| Working memory | In-context (LLM) | Current task state |
| Episodic memory | Azure Cosmos DB | Conversation history |
| Semantic memory | Azure AI Search (vector) | Domain knowledge, docs |
| Procedural memory | Prompt templates | How to do things |
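The episodic layer from this table can be sketched independently of the backing store. Here a plain dict stands in for Cosmos DB, a pluggable retriever stands in for Azure AI Search, and the class name is illustrative, not an Azure SDK API.

```python
# Sketch of the memory split: episodic history keyed by session, semantic
# lookups delegated to a retriever. In-memory storage stands in for
# Cosmos DB / Azure AI Search.
from collections import defaultdict

class MemoryManager:
    def __init__(self, retriever=None):
        self.episodic = defaultdict(list)  # session_id -> ordered turns (Cosmos DB in prod)
        self.retriever = retriever         # vector search backend (Azure AI Search in prod)

    def remember(self, session_id: str, role: str, content: str) -> None:
        self.episodic[session_id].append({"role": role, "content": content})

    def recall(self, session_id: str, last_n: int = 10) -> list[dict]:
        # Only the most recent turns go back into the prompt, keeping the
        # working-memory (in-context) footprint bounded.
        return self.episodic[session_id][-last_n:]
```

Keeping this behind a single interface makes it cheap to swap the dict for a real Cosmos DB container later.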
For vector search, I use Azure AI Search with hybrid retrieval:
```python
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

# search_client: a SearchClient configured with your endpoint, index name, and credential

def hybrid_search(query: str, embedding: list[float], top_k: int = 5):
    # embedding: the query vector, e.g. from an Azure OpenAI embeddings deployment
    vector_query = VectorizedQuery(
        vector=embedding,
        k_nearest_neighbors=top_k,
        fields="content_vector",
    )
    results = search_client.search(
        search_text=query,              # BM25 keyword search
        vector_queries=[vector_query],  # vector similarity search
        query_type="semantic",          # semantic reranking over both result sets
        semantic_configuration_name="my-semantic-config",
        top=top_k,
        select=["id", "content", "source", "title"],
    )
    return list(results)
```

### 3. Tool Integration Layer
Agents need tools to act on the world. I structure tools as Azure Functions for serverless scalability:
```python
# Tool definitions exposed to the agent
from semantic_kernel.functions import kernel_function

class WebSearchPlugin:
    @kernel_function(
        name="search_web",
        description="Search the web for recent information on a given topic",
    )
    async def search(self, query: str) -> str:
        """Execute a web search and return formatted results."""
        # Call the Bing Search API
        results = await bing_search_client.search(query)
        return format_search_results(results)

class DatabasePlugin:
    @kernel_function(
        name="query_database",
        description="Query the product database for inventory or pricing information",
    )
    async def query(self, sql: str) -> str:
        """Execute a read-only SQL query."""
        # Naive guard: reject anything that is not a SELECT. In production, pair
        # this with a read-only connection; a prefix check alone won't stop
        # every abuse.
        if not sql.strip().upper().startswith("SELECT"):
            raise ValueError("Only SELECT queries are permitted")
        results = await db.execute(sql)
        return results.to_json()
```

### 4. Multi-Agent Coordination
For complex tasks, I use a hub-and-spoke multi-agent pattern: a central orchestrator (the hub) decomposes the task, delegates sub-tasks to specialist agents (the spokes), and merges their results into a single response.
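The pattern can be sketched with plain async callables standing in for Semantic Kernel agents; the agent names and dispatch table here are illustrative.

```python
# Hub-and-spoke sketch: the hub routes sub-tasks to specialist agents and
# collects their results. Specialists are plain coroutines here, standing in
# for real Semantic Kernel agents.
import asyncio

async def research_agent(task: str) -> str:
    return f"research: {task}"

async def writer_agent(task: str) -> str:
    return f"draft: {task}"

SPOKES = {"research": research_agent, "write": writer_agent}

async def hub(plan: list[tuple[str, str]]) -> list[str]:
    # Dispatch each (agent, task) pair; independent sub-tasks run concurrently.
    return await asyncio.gather(*(SPOKES[name](task) for name, task in plan))
```

Concurrent dispatch is the payoff of the pattern: independent spokes overlap their latency instead of stacking it.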
### 5. Observability & Guardrails
Production agentic systems need comprehensive observability. I use Azure Monitor + Application Insights with structured logging:
```python
from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import trace

configure_azure_monitor(connection_string="YOUR_CONNECTION_STRING")
tracer = trace.get_tracer(__name__)

async def agent_step(step_name: str, input_data: dict):
    with tracer.start_as_current_span(f"agent.{step_name}") as span:
        span.set_attribute("agent.step", step_name)
        span.set_attribute("agent.input_tokens", count_tokens(input_data))
        result = await execute_step(input_data)
        span.set_attribute("agent.output_tokens", count_tokens(result))
        span.set_attribute("agent.success", True)
        return result
```

Key guardrails to implement:
- Token budget enforcement — hard stop at 80% of context window
- Tool call limits — max 15 tool calls per task to prevent loops
- Content safety — Azure AI Content Safety on all inputs/outputs
- Rate limiting — per-user and per-agent call limits
- Timeout handling — async timeouts on all external calls
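Two of these guardrails (the tool-call cap and per-call timeouts) can be sketched together. `ToolCallBudget` and the constants are illustrative names; `asyncio.wait_for` supplies the timeout.

```python
# Sketch: cap tool calls per task and bound every external call with a timeout.
import asyncio

MAX_TOOL_CALLS = 15
TOOL_TIMEOUT_S = 30.0

class ToolCallBudget:
    def __init__(self, limit: int = MAX_TOOL_CALLS):
        self.limit = limit
        self.calls = 0

    async def call(self, tool, *args, timeout: float = TOOL_TIMEOUT_S):
        self.calls += 1
        if self.calls > self.limit:
            # Hard stop: a looping agent fails fast instead of burning budget.
            raise RuntimeError(f"tool-call limit of {self.limit} exceeded")
        # asyncio.wait_for raises TimeoutError if the tool hangs.
        return await asyncio.wait_for(tool(*args), timeout=timeout)
```

Routing every tool invocation through one choke point like this also gives you a single place to hang tracing and rate limiting.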
## Cost Management
Agentic systems can burn tokens at alarming rates. Here's how I keep costs predictable:
```python
class TokenBudget:
    def __init__(self, max_tokens: int = 50_000):
        self.max_tokens = max_tokens
        self.used = 0

    def can_proceed(self, estimated_tokens: int) -> bool:
        return (self.used + estimated_tokens) <= self.max_tokens

    def consume(self, tokens: int):
        self.used += tokens
        if self.used > self.max_tokens * 0.9:
            logger.warning(f"Token budget at {self.used}/{self.max_tokens}")
```

The orchestrator checks `can_proceed` before each model call and calls `consume` after, so a runaway task halts instead of silently accumulating spend.

## Deployment Architecture
```
Azure Container Apps (agents)
├── Orchestrator Service
├── Tool Router Service
└── Memory Manager Service

Azure AI Foundry
└── GPT-4o deployment

Azure AI Search
└── Knowledge base index

Azure Cosmos DB
└── Conversation history

Azure Service Bus
└── Async tool execution queue

Azure Key Vault
└── API keys + secrets
```
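Provisioning the skeleton of this layout is scriptable. A hedged Azure CLI sketch, with placeholder resource names (flags can vary across CLI versions, so treat this as a starting point, not a runbook):

```shell
# Illustrative only: resource names and region are placeholders.
az group create --name agentic-rg --location eastus2

az containerapp env create --name agentic-env --resource-group agentic-rg \
  --location eastus2

az containerapp create --name orchestrator --resource-group agentic-rg \
  --environment agentic-env --image myregistry.azurecr.io/orchestrator:latest

az keyvault create --name agentic-kv --resource-group agentic-rg
az keyvault secret set --vault-name agentic-kv --name openai-api-key \
  --value "<secret>"
```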
## Key Takeaways
- Separate orchestration from execution — keep your orchestrator thin and your tools stateless
- Design for failure — every tool call can fail; build retry logic and graceful degradation
- Observe everything — you can't debug what you can't see; log every agent step
- Budget aggressively — set hard token and API call limits from day one
- Test with adversarial inputs — agents are especially vulnerable to prompt injection
The shift from LLM-powered features to full agentic systems requires a shift in engineering mindset. Think less about "prompting" and more about distributed systems with an LLM at the center.
Written by Niteen Badgujar
AI Engineer specializing in Agentic AI, LLMs, and production-grade machine learning systems on Azure. Writing to make complex AI concepts accessible and actionable.