AI | Computing Arts

Getting Structured Data from LLMs: JSON Mode and Beyond

The biggest challenge with LLMs in production isn’t getting good responses; it’s getting parseable responses. When you need JSON for your pipeline, “Here’s the data you requested:” followed by markdown-wrapped output breaks everything. Here’s how to reliably extract structured data.

The Problem

```python
# Assumes the official OpenAI Python SDK (v1+) with OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Extract the person's name and age from: 'John Smith is 34 years old'"}]
)

print(response.choices[0].message.content)
# "The person's name is John Smith and their age is 34."
# ... not what we needed
```

You wanted {"name": "John Smith", "age": 34}. You got prose. ...
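One defensive option for the markdown-wrapped case is to strip the fence and parse whatever JSON object remains. The sketch below is a minimal illustration, not the post's own method: `extract_json` is a hypothetical helper name, and it assumes the reply contains at most one top-level JSON object.

```python
import json
import re

# Build the three-backtick fence marker programmatically so this snippet
# can itself live inside a fenced block.
FENCE = "`" * 3
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)


def extract_json(text: str) -> dict:
    """Hypothetical helper: pull a single JSON object out of an LLM reply."""
    match = FENCE_RE.search(text)
    # Prefer the fenced payload when there is one; otherwise use the raw text.
    candidate = match.group(1) if match else text
    # Trim any prose around the object by keeping the outermost braces.
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in response")
    return json.loads(candidate[start : end + 1])


reply = (
    "Here's the data you requested:\n"
    + FENCE + "json\n"
    + '{"name": "John Smith", "age": 34}\n'
    + FENCE
)
print(extract_json(reply))  # {'name': 'John Smith', 'age': 34}
```

A pure-prose reply such as the one in the example above raises `ValueError`, which gives the caller a clear signal to retry or fall back rather than passing bad data downstream.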

Container Orchestration Patterns for AI Workloads

Running AI workloads in containers presents unique challenges that traditional web application patterns don’t address. GPU scheduling, model caching, and bursty inference traffic all require thoughtful architecture. Here’s what actually works in production.

The GPU Scheduling Problem

Standard Kubernetes scheduling assumes CPU and memory are your primary constraints. When you add GPUs to the mix, everything changes.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: inference-server
spec:
  containers:
    - name: model
      image: my-registry/llm-server:v1.2
      resources:
        limits:
          nvidia.com/gpu: 1
          memory: "32Gi"
        requests:
          nvidia.com/gpu: 1
          memory: "24Gi"
```

The naive approach of one GPU per pod works until you realize GPUs cost $2-4/hour and sit idle between requests. MIG (Multi-Instance GPU) and time-slicing help, but they introduce complexity: ...
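To put a number on the idle-GPU point, the effective price of an hour of useful GPU work is the hourly rate divided by utilization. The sketch below uses the $2-4/hour range from the text; the 25% utilization figure is purely an illustrative assumption, and `cost_per_busy_hour` is a hypothetical helper name.

```python
def cost_per_busy_hour(hourly_rate: float, utilization: float) -> float:
    """Dollars paid per hour of actual GPU work at a given utilization (0-1]."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_rate / utilization


# Assumed utilization: the GPU is busy 25% of the time (illustrative only).
ASSUMED_UTILIZATION = 0.25

for rate in (2.0, 4.0):
    effective = cost_per_busy_hour(rate, ASSUMED_UTILIZATION)
    print(f"${rate:.2f}/hour at {ASSUMED_UTILIZATION:.0%} busy -> ${effective:.2f} per busy hour")
```

Under that assumption, a dedicated GPU costs four times its list price per hour of real work, which is the gap that sharing schemes such as MIG and time-slicing aim to close.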

Working with LLM APIs: A Practical Guide

How to integrate large language models into your applications — from basic API calls to production-ready patterns.

AI Coding Assistants: From Skeptic to True Believer

How AI coding assistants transformed my workflow — and why the skeptics are missing out.