Google released Gemini 3.1 Flash-Lite and Gemini 3.1 Flash Live on April 7, 2026. Flash-Lite is positioned as the fastest and most cost-efficient model in the Gemini 3.1 family, built for workloads where response latency and per-request cost matter more than maximum reasoning depth. Flash Live targets real-time audio conversations and is now available in more than 200 countries via Search Live and Gemini Live.
This post focuses on Flash-Lite from a developer integration perspective: where it fits, how to call it, and how to decide when Flash-Lite is the right choice versus a heavier model tier.
The Gemini 3.1 Model Tiers
Before going into Flash-Lite specifics, it helps to understand where it sits in the 3.1 lineup:
| Model | Best For | Relative Cost |
|---|---|---|
| Gemini 3.1 Pro | Complex reasoning, multi-step tasks | High |
| Gemini 3.1 Flash | Balanced performance/cost | Medium |
| Gemini 3.1 Flash-Lite | High-volume, latency-sensitive workloads | Low |
| Gemini 3.1 Flash Live | Real-time audio/conversation | Medium |
Flash-Lite is not a stripped-down model — it's an efficiency-optimized variant tuned specifically for tasks that don't require the full reasoning capacity of Pro or standard Flash. Classification, entity extraction, short-form generation, and structured output from templated prompts are its sweet spots.
Setting Up the Gemini API
```bash
npm install @google/generative-ai
```

```typescript
import { GoogleGenerativeAI } from "@google/generative-ai";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const model = genAI.getGenerativeModel({ model: "gemini-3.1-flash-lite" });
```

Practical Use Cases for Flash-Lite
1. Document Classification at Scale
Flash-Lite handles high-volume classification tasks where you're processing hundreds or thousands of documents per hour:
```typescript
async function classifyDocument(text: string): Promise<string> {
  const prompt = `Classify the following document into exactly one category:

Categories: invoice, contract, report, correspondence, other

Document:
${text.slice(0, 2000)}

Respond with only the category name.`;

  const result = await model.generateContent(prompt);
  return result.response.text().trim().toLowerCase();
}
```

Because Flash-Lite is optimized for throughput, you can run this classification pipeline at scale without the per-token costs of Pro-tier models becoming prohibitive.
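To actually reach that throughput without tripping API rate limits, calls can be fanned out with a bounded worker pool. A minimal sketch; `mapWithConcurrency` is a hypothetical helper, not part of the SDK:

```typescript
// Run a list of async tasks with a bounded number of concurrent workers,
// so high-volume classification stays inside API rate limits.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each worker repeatedly claims the next unprocessed index. There is no
  // await between reading and incrementing `next`, so claims never collide.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }

  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, worker)
  );
  return results;
}
```

With this in place, `await mapWithConcurrency(docs, 8, classifyDocument)` keeps at most eight classification requests in flight at once.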
2. Structured Data Extraction
Extracting structured fields from unstructured text is another strong fit — the task is well-defined, the output schema is fixed, and reasoning depth is rarely the bottleneck:
```typescript
import { SchemaType } from "@google/generative-ai";

const extractionModel = genAI.getGenerativeModel({
  model: "gemini-3.1-flash-lite",
  generationConfig: {
    responseMimeType: "application/json",
    responseSchema: {
      type: SchemaType.OBJECT,
      properties: {
        company_name: { type: SchemaType.STRING },
        invoice_date: { type: SchemaType.STRING },
        total_amount: { type: SchemaType.NUMBER },
        currency: { type: SchemaType.STRING },
      },
      required: ["company_name", "invoice_date", "total_amount"],
    },
  },
});

const result = await extractionModel.generateContent(invoiceText);
const data = JSON.parse(result.response.text());
```

3. Short-Form Content Generation
FAQ answers, product description variants, and templated notification copy are all cases where Flash-Lite's speed and cost profile make it attractive:
```typescript
async function generateFaqAnswer(question: string, context: string): Promise<string> {
  const prompt = `Based on the following context, write a concise FAQ answer (2-3 sentences max) for the question.

Context: ${context}

Question: ${question}

Answer:`;

  const result = await model.generateContent(prompt);
  return result.response.text();
}
```

When to Use Flash-Lite vs Other Tiers
Use Flash-Lite when:
- Processing volume is high (hundreds to thousands of requests per day)
- Tasks are well-defined with clear input/output structure
- Response latency is user-facing (sub-second expectations)
- Cost is a primary constraint
Use standard Flash when:
- Tasks require moderate reasoning (multi-step logic, comparison tasks)
- You need a balance of capability and cost without the full Pro overhead
Use Pro when:
- Complex reasoning chains are required
- Accuracy on ambiguous or nuanced tasks is critical
- You're building an agentic workflow where errors cascade
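The tier choice doesn't have to be static, either. One pattern worth sketching: call Flash-Lite first, validate the answer, and pay for a stronger tier only when validation fails. The helper below is hypothetical and SDK-agnostic, taking the two model callers as plain async functions:

```typescript
// Escalation pattern: try the cheap tier first, fall back to a stronger
// tier when the cheap answer fails a validation check.
async function withEscalation(
  cheap: () => Promise<string>,
  strong: () => Promise<string>,
  isAcceptable: (answer: string) => boolean
): Promise<{ answer: string; escalated: boolean }> {
  const first = await cheap();
  if (isAcceptable(first)) {
    return { answer: first, escalated: false };
  }
  // Cheap tier produced an unusable answer; retry on the stronger tier.
  return { answer: await strong(), escalated: true };
}
```

For the classification example earlier, `isAcceptable` could simply check that the returned label is one of the five allowed categories; tracking the `escalated` flag also tells you what fraction of traffic actually needs the expensive tier.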
Flash Live: Real-Time Audio
Gemini 3.1 Flash Live is a separate model targeting real-time conversation — think customer service bots, voice interfaces, and interactive audio applications. It's now available via the Gemini API alongside Flash-Lite:
```typescript
// Flash Live uses the Live API (bidirectional streaming)
const liveModel = genAI.getGenerativeModel({ model: "gemini-3.1-flash-live" });
```

Flash Live is Google's competitive response to real-time audio offerings such as OpenAI's Realtime API. For applications requiring low-latency conversational AI with voice input/output, it's now the Gemini-native option.
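One practical detail when wiring up voice input: browsers hand you Float32 audio samples, while real-time speech APIs generally want raw 16-bit PCM on the wire. A generic conversion sketch (the exact sample format Flash Live expects isn't confirmed here; 16-bit PCM is an assumption based on comparable real-time speech APIs):

```typescript
// Convert Float32 audio samples (range -1..1, as produced by the Web Audio
// API) into 16-bit PCM, the wire format commonly expected by real-time
// speech APIs. Values outside the valid range are clamped.
function floatTo16BitPcm(samples: Float32Array): Int16Array {
  const pcm = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    // Negative values scale to -32768, positive values to 32767.
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return pcm;
}
```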
Lyria 3 for Music Generation
Also released in preview: Lyria 3 via Gemini API and Google AI Studio. Lyria 3 is Google's music generation model. For developers building creative tools or media applications, this opens up music generation as a first-class API capability — similar to how image generation has become routine in content pipelines.
```typescript
// Lyria 3 access via Gemini API (preview)
const lyriaModel = genAI.getGenerativeModel({ model: "lyria-3" });

const musicResult = await lyriaModel.generateContent(
  "Generate a 30-second upbeat background track for a product demo video"
);
```

Cost Optimization Strategy
For production workloads, a tiered routing approach tends to work well: route simple, high-volume requests to Flash-Lite, and escalate to Flash or Pro only when the task complexity warrants it. This can reduce AI infrastructure costs significantly without sacrificing output quality on tasks that actually need deeper reasoning.
```typescript
function selectModel(taskType: "classification" | "extraction" | "reasoning" | "generation") {
  switch (taskType) {
    case "classification":
    case "extraction":
      return "gemini-3.1-flash-lite";
    case "generation":
      return "gemini-3.1-flash";
    case "reasoning":
      return "gemini-3.1-pro";
  }
}
```

Takeaway
Gemini 3.1 Flash-Lite is a practical addition to the model tier lineup — useful for teams running AI at scale who want to manage per-request costs without routing everything through a heavy model. The real value is in the tiered approach: match the model to the task complexity, and reserve Pro-tier inference for the work that actually needs it.