Polyglot AI Systems in Practice: Python, Go, and Rust in the Same Stack

A pattern is solidifying in production AI systems: Python handles ML and LLM logic, Go manages API orchestration and governance, and Rust runs untrusted code in WebAssembly sandboxes. This isn't language preference — it reflects where each language's ecosystem and runtime characteristics are genuinely better suited to the problem.

Senior Rust engineers reportedly command a $25K–$30K premium over their Go counterparts, and it isn't for aesthetics. Teams reach for Rust because few alternatives offer comparable guarantees when executing untrusted, LLM-generated code in production.

Why Single-Language AI Systems Break Down

Each layer of an AI system has different requirements that tend to conflict when forced into one runtime.

The ML layer needs PyTorch, TensorFlow, Hugging Face, and the Anthropic SDK. These are Python-first, and replicating that ecosystem in another language means rewriting tooling that already works.

The API layer needs to handle thousands of concurrent connections with predictable latency and low memory overhead. CPython's GIL keeps threads from executing Python bytecode in parallel, and asyncio helps with I/O-bound work but adds complexity. Go's goroutine model handles this cleanly.

The execution layer (when users or LLMs submit code to run) needs isolation guarantees that can't be patched on top of a general-purpose runtime. A WebAssembly sandbox confines the guest code, and Rust's ownership model keeps the host that embeds it memory-safe, with those checks made at compile time.

         ┌─────────────────────────────────────┐
         │         Clients / Frontend           │
         └──────────────┬──────────────────────┘
                        │
         ┌──────────────▼──────────────────────┐
         │          Go API Gateway              │
         │  Auth · Rate Limiting · Routing      │
         │  Fan-out to ML services              │
         └──────┬───────────────────┬──────────┘
                │                   │
   ┌────────────▼──────┐  ┌─────────▼────────────┐
   │  Python ML Layer  │  │  Rust WASM Sandbox   │
   │  RAG · LLM Calls  │  │  Code Execution      │
   │  Embeddings       │  │  Untrusted Input     │
   └───────────────────┘  └──────────────────────┘

Python: Keep It Focused on ML

The Python service should do one thing: ML logic. Don't let it become a monolith that also handles auth, rate limiting, and routing. FastAPI with async handlers gives you good throughput for I/O-bound LLM calls.

import asyncio

import numpy as np
from anthropic import AsyncAnthropic
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer

app = FastAPI()
client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment
embedder = SentenceTransformer("all-MiniLM-L6-v2")


class RagRequest(BaseModel):
    # Declared as a JSON body so the gateway can POST
    # {"query": "...", "documents": ["...", "..."]}
    query: str
    documents: list[str]


@app.post("/rag/query")
async def rag_query(req: RagRequest) -> dict:
    if not req.documents:
        raise HTTPException(status_code=400, detail="No documents provided")

    # encode() is blocking and CPU-bound; run it in a worker thread so it
    # does not stall the event loop for other requests
    query_emb = await asyncio.to_thread(embedder.encode, req.query)
    doc_embs = await asyncio.to_thread(embedder.encode, req.documents)

    # Select top-3 by cosine similarity (the epsilon guards against zero norms)
    scores = np.dot(doc_embs, query_emb) / (
        np.linalg.norm(doc_embs, axis=1) * np.linalg.norm(query_emb) + 1e-10
    )
    top_k = np.argsort(scores)[-3:][::-1]
    context = "\n\n".join(req.documents[i] for i in top_k)

    response = await client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": f"Context:\n{context}\n\nQ: {req.query}"}],
    )
    return {"answer": response.content[0].text, "sources": top_k.tolist()}

Deploy this as an internal service. It should not be directly accessible from the internet — the Go gateway is the only entry point.
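
To make the ranking step in the handler concrete, here is a small dependency-free sketch of the same top-k cosine selection on hand-written 2-D vectors. The helper names (`cosine`, `top_k_indices`) and the toy vectors are assumptions made for illustration; they are not part of the service.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; the epsilon mirrors the handler's zero-norm guard."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-10)


def top_k_indices(query: list[float], docs: list[list[float]], k: int = 3) -> list[int]:
    """Indices of the k most similar documents, best match first."""
    scores = [cosine(doc, query) for doc in docs]
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy embeddings: doc 1 points the same way as the query, doc 2 is orthogonal.
query_vec = [1.0, 0.0]
doc_vecs = [[0.7, 0.7], [2.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
best = top_k_indices(query_vec, doc_vecs, k=3)
```

Vector length does not affect the ranking: doc 1 is twice as long as the query but still ranks first because only direction matters, and the orthogonal doc 2 is the one left out of the top three.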

Go: Gateway and Orchestration

Go coordinates requests between clients and the Python layer. The goroutine model makes fan-out patterns (querying multiple ML services in parallel) straightforward without callback complexity.

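package main

import (
    "context"
    "encoding/json"
    "fmt"
    "log"
    "net/http"
    "sync"
    "time"
)

// MLResult is one replica's outcome. Error is a string because a value of
// the error interface usually marshals to an empty JSON object.
type MLResult struct {
    ServiceID string `json:"service_id"`
    Answer    string `json:"answer,omitempty"`
    Error     string `json:"error,omitempty"`
}

// Fan out a single request to multiple ML service replicas.
// callMLService, authenticate, extractQuery, and extractDocs are assumed
// to be defined elsewhere in the package.
func queryMLServices(ctx context.Context, query string, docs []string) []MLResult {
    endpoints := []string{
        "http://ml-service-1:8000/rag/query",
        "http://ml-service-2:8000/rag/query",
    }

    results := make([]MLResult, len(endpoints))
    var wg sync.WaitGroup

    for i, ep := range endpoints {
        wg.Add(1)
        go func(idx int, url string) {
            defer wg.Done()

            // Per-call deadline derived from the incoming request context
            reqCtx, cancel := context.WithTimeout(ctx, 15*time.Second)
            defer cancel()

            res := MLResult{ServiceID: fmt.Sprintf("svc-%d", idx)}
            answer, err := callMLService(reqCtx, url, query, docs)
            if err != nil {
                res.Error = err.Error()
            } else {
                res.Answer = answer
            }
            // Each goroutine writes only its own index, so no mutex is needed
            results[idx] = res
        }(i, ep)
    }

    wg.Wait()
    return results
}

func handleQuery(w http.ResponseWriter, r *http.Request) {
    // Auth and rate limiting belong here, before forwarding to Python
    // (only the auth check is shown)
    if !authenticate(r) {
        http.Error(w, "unauthorized", http.StatusUnauthorized)
        return
    }

    results := queryMLServices(r.Context(), extractQuery(r), extractDocs(r))
    w.Header().Set("Content-Type", "application/json")
    if err := json.NewEncoder(w).Encode(results); err != nil {
        log.Printf("encode response: %v", err)
    }
}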

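Context cancellation propagates wherever the context is passed along and honored. Because each goroutine derives its context from the HTTP request context, a client disconnect or gateway timeout cancels the downstream ML calls too, provided callMLService attaches the context to its outgoing request (for example with http.NewRequestWithContext).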

Rust: Sandboxed Execution

If your system executes LLM-generated code, or any user-provided code, you need strict isolation. Wasmtime, embedded from Rust, runs that code as a WebAssembly module that can touch only its own allocated memory region. It gets no access to the host filesystem or network unless the host explicitly provides it through imports such as WASI.

use anyhow::Result;
use wasmtime::{Config, Engine, Linker, Module, Store};
 
pub struct Sandbox {
    engine: Engine,
}
 
impl Sandbox {
    pub fn new() -> Result<Self> {
        let mut config = Config::new();
        // Enable fuel metering; each run's budget is set on its Store
        config.consume_fuel(true);
        Ok(Sandbox {
            engine: Engine::new(&config)?,
        })
    }
 
    pub fn run(&self, wasm_bytes: &[u8], fuel_limit: u64) -> Result<String> {
        let module = Module::new(&self.engine, wasm_bytes)?;
        let mut store = Store::new(&self.engine, ());
 
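        // Fuel caps how much guest code may execute, so an infinite loop traps.
        // Recent Wasmtime releases use set_fuel/get_fuel; older releases
        // exposed add_fuel and fuel_consumed instead.
        store.set_fuel(fuel_limit)?;

        // An empty Linker exposes no host functions (no WASI, files, or sockets)
        let linker = Linker::new(&self.engine);
        let instance = linker.instantiate(&mut store, &module)?;

        let run = instance.get_typed_func::<(), ()>(&mut store, "run")?;
        run.call(&mut store, ())?;

        // get_fuel returns the fuel remaining, so consumed = limit - remaining
        let fuel_consumed = fuel_limit.saturating_sub(store.get_fuel()?);
        Ok(format!("completed, fuel consumed: {}", fuel_consumed))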
    }
}

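The fuel mechanism acts as a computational budget. An infinite loop exhausts its fuel and traps, so the call returns an error instead of hanging the service. Fuel bounds execution only; memory growth needs its own cap, which Wasmtime handles separately through store-level resource limits.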
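
The budget idea can be illustrated outside Wasmtime with a toy interpreter that charges one unit of fuel per instruction. This is a conceptual sketch only: the instruction set, `run_with_fuel`, and `OutOfFuel` are invented for the example, and Wasmtime's real metering operates on compiled WebAssembly rather than on a Python loop.

```python
class OutOfFuel(Exception):
    """Raised when a program exceeds its execution budget."""


def run_with_fuel(program: list[tuple], fuel: int) -> tuple[int, int]:
    """Run a tiny program and return (accumulator, fuel_consumed).

    Hypothetical instructions: ("add", n) adds n to the accumulator,
    ("jmp", target) jumps to an instruction index, ("halt",) stops.
    """
    acc, pc, remaining = 0, 0, fuel
    while pc < len(program):
        if remaining == 0:
            raise OutOfFuel(f"budget of {fuel} steps exhausted")
        remaining -= 1  # every instruction costs one unit of fuel
        op = program[pc]
        if op[0] == "add":
            acc += op[1]
            pc += 1
        elif op[0] == "jmp":
            pc = op[1]
        elif op[0] == "halt":
            break
        else:
            raise ValueError(f"unknown instruction: {op[0]}")
    return acc, fuel - remaining


finite = [("add", 2), ("add", 3), ("halt",)]
infinite = [("add", 1), ("jmp", 0)]  # never halts on its own
```

Running `finite` with a budget of 10 finishes after three steps, while `infinite` raises `OutOfFuel` once the budget is spent. The caller gets an error it can handle instead of a worker that never returns.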

Service Boundaries

The key to making this work is treating each language layer as a separate service with a well-defined interface — not as internal modules of one application.

Boundary	Protocol	Reason
Go → Python	HTTP/JSON (internal)	Simple, debuggable, Python-native
Go → Rust	HTTP or stdin/stdout	WASM output is text/binary
External → Go	HTTPS + Auth	Only entry point
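
One way to pin down the Go → Python boundary is to write the JSON contract as explicit types on the Python side. The field names below (`query`, `documents`, `answer`, `sources`) follow the handler shown earlier, but the dataclasses and the `parse_request` helper are hypothetical, offered as a sketch of validating at the boundary rather than as the project's actual schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RagRequest:
    query: str
    documents: list[str]


@dataclass(frozen=True)
class RagResponse:
    answer: str
    sources: list[int]


def parse_request(raw: bytes) -> RagRequest:
    """Decode and validate a request body, rejecting malformed payloads early."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("body must be a JSON object")
    query = data.get("query")
    documents = data.get("documents")
    if not isinstance(query, str) or not query:
        raise ValueError("'query' must be a non-empty string")
    if not isinstance(documents, list) or not all(isinstance(d, str) for d in documents):
        raise ValueError("'documents' must be a list of strings")
    return RagRequest(query=query, documents=documents)


# Round trip: what the gateway sends, and what the ML service returns.
req = parse_request(b'{"query": "what is fuel?", "documents": ["doc a", "doc b"]}')
body = json.dumps(asdict(RagResponse(answer="a budget", sources=[1, 0])))
```

Keeping the contract this small is deliberate: the gateway only needs to know these four fields, so either side can change its internals without coordinating a release.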

Message queues (Kafka, RabbitMQ) fit well between Go and Python when ML tasks are long-running. Instead of holding a client request open for 30 seconds while an LLM responds, the gateway enqueues the task and receives the result via webhook when the Python service completes it.
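
The enqueue-then-notify flow can be sketched in-process, with `asyncio.Queue` standing in for the broker and a plain callback standing in for the webhook. Everything here is an assumption made for illustration (the `Task` shape, `fake_llm_call`, the callback signature); a real deployment would use a Kafka or RabbitMQ client and an HTTP POST back to the gateway.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass
class Task:
    task_id: str
    query: str


async def fake_llm_call(query: str) -> str:
    """Placeholder for a slow model call."""
    await asyncio.sleep(0.01)
    return f"answer to: {query}"


async def worker(queue: asyncio.Queue, notify: Callable[[str, str], Awaitable[None]]) -> None:
    """Consume tasks until a None sentinel arrives, notifying on completion."""
    while True:
        task = await queue.get()
        if task is None:
            break
        result = await fake_llm_call(task.query)
        await notify(task.task_id, result)  # stands in for the webhook POST


async def demo() -> dict[str, str]:
    queue: asyncio.Queue = asyncio.Queue()
    delivered: dict[str, str] = {}

    async def notify(task_id: str, result: str) -> None:
        delivered[task_id] = result

    consumer = asyncio.create_task(worker(queue, notify))
    # The producer only enqueues; it never waits on the model itself.
    await queue.put(Task("t1", "what is RAG?"))
    await queue.put(Task("t2", "what is WASM?"))
    await queue.put(None)  # sentinel: no more work
    await consumer
    return delivered


results = asyncio.run(demo())
```

The task ID is what ties a notification back to the original request, so the gateway needs to store it when enqueueing.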

Our Take

Not every AI project needs this architecture. A single FastAPI service is the right starting point for most teams. The polyglot split makes sense when you hit specific problems: Python concurrency limits under high load, a need to execute untrusted code safely, or separate deployment cadences for ML models versus API logic.

The value isn't in using three languages — it's in recognizing that these three concerns have different characteristics and designing for that explicitly. When you do need to split, drawing the boundaries early makes the eventual transition straightforward instead of a rewrite.