[{"data":1,"prerenderedAt":1261},["ShallowReactive",2],{"\u002Fblog\u002F2026\u002F04\u002FThe-Inference-Engine-Bringing-Models-to-Life":3,"\u002Fblog\u002F2026\u002F04\u002FThe-Inference-Engine-Bringing-Models-to-Life-surround":1257},{"id":4,"title":5,"body":6,"categories":1245,"date":1246,"description":1247,"extension":1248,"image":1249,"meta":1250,"navigation":1252,"path":1253,"seo":1254,"stem":1255,"twitter":1245,"__hash__":1256},"posts\u002F2026-04-15-The-Inference-Engine-Bringing-Models-to-Life.md","The Inference Engine - Bringing Models to Life",{"type":7,"value":8,"toc":1231},"minimark",[9,34,41,46,49,54,57,164,168,171,309,324,328,331,335,342,364,368,371,375,378,383,386,391,397,411,426,430,496,500,503,518,521,524,527,530,550,553,619,623,626,629,673,676,686,965,987,1103,1110,1113,1116,1119,1122,1136,1139,1171,1175,1178,1193,1196,1227],[10,11,12,13,18,19,23,24,28,29,33],"p",{},"What we learned from \"",[14,15,17],"a",{"href":16},"\u002Fblog\u002F2026\u002F03\u002FAI-ML-Models-Are-Not-Libraries","AI\u002FML Models Are Not Libraries","\" is that models are essentially collections of numbers ",[20,21,22],"em",{},"(weights)"," and, optionally, mathematical formulas. The \"optionally\" part is key, as we saw in \"",[14,25,27],{"href":26},"\u002Fblog\u002F2026\u002F03\u002FA-Trip-in-the-AI-ML-Model-Formats-Jungle","A Trip in the AI\u002FML Model Formats Jungle","\" that not all model files store the formulas themselves. Some formats are \"Mostly Self-Confined,\" while others are \"Weights-Only,\" expecting the application using them to \"know\" the underlying math. In \"",[14,30,32],{"href":31},"\u002Fblog\u002F2026\u002F03\u002FAnatomy-of-a-Model-the-Developer-Perspective","Anatomy of a Model - the Developer Perspective","\", we explored different architectures and their inputs and outputs. 
With that groundwork laid, we can now consider the inference process as a whole.",[10,35,36,37,40],{},"By now, it should be clear that producing a meaningful ",[20,38,39],{},"(to the caller)"," output during inference is a combined effort between the model, the inference runtime, and the inference endpoint that provides access to it.",[42,43,45],"h2",{"id":44},"inference-components","Inference Components",[10,47,48],{},"As with everything in software and IT, different people may interpret a term slightly differently when they use it. Here's what I mean by the terms you'll see later in this post.",[50,51,53],"h3",{"id":52},"compute-backend","Compute Backend",[10,55,56],{},"Those large and complex computation graphs often demand massive processing power and substantial amounts of fast memory. This naturally leads to the need for specialized hardware. A compute backend is the engine that executes the graph on a specific device (e.g., GPU, TPU, CPU). It implicitly includes the device’s own memory (GPU VRAM, TPU memory, etc.), as that’s where the tensors reside during execution. 
Below are some examples of popular compute backends.",[58,59,60,82],"table",{},[61,62,63],"thead",{},[64,65,66,72,77],"tr",{},[67,68,69],"th",{},[70,71,53],"strong",{},[67,73,74],{},[70,75,76],{},"Target Hardware",[67,78,79],{},[70,80,81],{},"Primary Software Layer",[83,84,85,99,112,125,138,151],"tbody",{},[64,86,87,93,96],{},[88,89,90],"td",{},[70,91,92],{},"CPU",[88,94,95],{},"x86 \u002F ARM Processors",[88,97,98],{},"Standard System RAM + OS Scheduler",[64,100,101,106,109],{},[88,102,103],{},[70,104,105],{},"CUDA",[88,107,108],{},"NVIDIA GPUs",[88,110,111],{},"CUDA Cores + cuDNN + VRAM",[64,113,114,119,122],{},[88,115,116],{},[70,117,118],{},"MLX",[88,120,121],{},"Apple Silicon (M-series)",[88,123,124],{},"Unified Memory Architecture + Metal",[64,126,127,132,135],{},[88,128,129],{},[70,130,131],{},"OpenVINO",[88,133,134],{},"Intel CPUs\u002FIGPUs\u002FVPUs",[88,136,137],{},"OneDNN \u002F OpenCL \u002F Plugin Architecture",[64,139,140,145,148],{},[88,141,142],{},[70,143,144],{},"ROCm",[88,146,147],{},"AMD GPUs",[88,149,150],{},"HIP \u002F ROCm Kernel Drivers",[64,152,153,158,161],{},[88,154,155],{},[70,156,157],{},"TPU (XLA)",[88,159,160],{},"Google TPU",[88,162,163],{},"XLA Compiler + libtpu + PJRT",[50,165,167],{"id":166},"inference-runtime-engine","Inference Runtime \u002F Engine",[10,169,170],{},"It's a specialized execution engine designed to run models in a production environment. It serves as the bridge between the high-level mathematical abstractions of a neural network and the underlying compute backend. Its responsibility is to load the model's weights into the backend's memory and then perform computations using that backend on new input. 
Here are some popular runtimes.",[58,172,173,197],{},[61,174,175],{},[64,176,177,182,187,192],{},[67,178,179],{},[70,180,181],{},"Inference Runtime",[67,183,184],{},[70,185,186],{},"Primary Target",[67,188,189],{},[70,190,191],{},"Compatible Compute Backends",[67,193,194],{},[70,195,196],{},"Programming Languages",[83,198,199,215,230,245,261,277,293],{},[64,200,201,206,209,212],{},[88,202,203],{},[70,204,205],{},"ONNX Runtime",[88,207,208],{},"Cross-platform \u002F Generic",[88,210,211],{},"CPU, CUDA, TensorRT, OpenVINO, CoreML, DirectML, ROCm",[88,213,214],{},"Python, C++, C#, Java, JS\u002FNode, Rust, Go",[64,216,217,222,224,227],{},[88,218,219],{},[70,220,221],{},"TensorRT",[88,223,108],{},[88,225,226],{},"CUDA, DLA (Deep Learning Accelerator)",[88,228,229],{},"Python, C++",[64,231,232,236,239,242],{},[88,233,234],{},[70,235,131],{},[88,237,238],{},"Intel Hardware",[88,240,241],{},"Intel CPU, iGPU, NPU, FPGA",[88,243,244],{},"Python, C++, C",[64,246,247,252,255,258],{},[88,248,249],{},[70,250,251],{},"ExecuTorch",[88,253,254],{},"Mobile & Edge",[88,256,257],{},"CPU (XNNPACK), CoreML (iOS), MPS (Mac), Vulkan, NPU",[88,259,260],{},"Python (Export), C++ (Runtime)",[64,262,263,268,271,274],{},[88,264,265],{},[70,266,267],{},"LiteRT (TFLite)",[88,269,270],{},"Mobile & Web",[88,272,273],{},"CPU, GPU (OpenCL\u002FMetal), TPU (Edge), WebGPU",[88,275,276],{},"Python, Java, Swift, C++, JS\u002FTS",[64,278,279,284,287,290],{},[88,280,281],{},[70,282,283],{},"vLLM",[88,285,286],{},"Data Center LLMs",[88,288,289],{},"CUDA (NVIDIA), ROCm (AMD), TPU, OpenVINO (Intel)",[88,291,292],{},"Python (Primary), C++",[64,294,295,300,303,306],{},[88,296,297],{},[70,298,299],{},"llama.cpp",[88,301,302],{},"Zero Dependencies. 
Run anywhere with C++.",[88,304,305],{},"CPU (AVX\u002FAMX), CUDA, Metal, Vulkan, SYCL, ROCm\u002FHIP, RPC",[88,307,308],{},"C++, Python (via bindings), Go, Rust, Node.js",[310,311,313],"note",{"variant":312},"soft",[10,314,315,316,319,320,323],{},"Not every inference runtime can run every model. For example, ",[317,318,299],"code",{}," is designed specifically for ",[317,321,322],{},"GGUF","-formatted models.",[50,325,327],{"id":326},"inference-service","Inference Service",[10,329,330],{},"Inference runtimes work with numeric tensors. An inference service bridges the gap between those tensors and the client's data. It's where the preparation for ingestion and the postprocessing of the model's result happen. Such services can be modules within larger applications, libraries, standalone applications, web services, and so on.",[50,332,334],{"id":333},"inference-provider","Inference Provider",[10,336,337,338,341],{},"Not really part of what this post is about, but oftentimes these are also called \"Inference Runtimes\" ",[20,339,340],{},"(heck, I've used that mental shortcut with my clients)",". A provider is all of the above delivered as a managed service. Often called \"Inference-as-a-Service\".",[343,344,345,352,358],"ul",{},[346,347,348,351],"li",{},[70,349,350],{},"The Cloud Giants"," like Google Vertex AI, AWS Bedrock, and Azure AI provide the full menu: endpoints, enterprise security, and access to specialized backends (like TPUs or Inferentia).",[346,353,354,357],{},[70,355,356],{},"Specialized API Providers"," like Together AI, Fireworks.ai, DeepInfra focus on \"Software-Defined Inference.\" They often write their own custom kernels and scheduling logic to be faster than the generic cloud providers.",[346,359,360,363],{},[70,361,362],{},"Hardware-First Providers"," like Groq, Cerebras, SambaNova are the ones that blur the lines most. 
They often market their entire cloud as a \"Language Runtime\" to emphasize that the hardware and software are one single, optimized unit.",[42,365,367],{"id":366},"inference-loops","Inference Loops",[10,369,370],{},"Most generative models (including LLMs) are designed to be called in a loop. It's the inference service that controls when the generation ends.",[50,372,374],{"id":373},"llm-inference","LLM Inference",[10,376,377],{},"A sequence diagram is probably the best way to visualize how all the inference components work together. Here is a conceptual LLM inference flow.",[379,380],"mermaid",{":config":381,"code":382},"config","---%0Aconfig%3A%0A%20%20layout%3A%20dagre%0A---%0A%0AsequenceDiagram%0A%20%20%20%20autonumber%0A%20%20%20%20participant%20C%20as%20Client%0A%20%20%20%20participant%20S%20as%20Service%0A%0A%20%20%20%20Note%20over%20S%3A%20Initialization%20Phase%0A%20%20%20%20S-%3E%3ES%3A%20init%0A%20create%20participant%20T%20as%20Tokenizer%0A%20S-%3E%3ET%3A%20Init%20Tokenizer%20(Vocabulary)%0A%20%20%20%20create%20participant%20R%20as%20Runtime%0A%20%20%20%20S-%3E%3ER%3A%20Start%20runtime%20(Model%2C%20Preferred%20Backend)%0A%0A%20%20%20%20participant%20B%20as%20Backend%20(GPU%2FVRAM)%0A%0A%20R-%3E%3EB%3A%20Allocate%20VRAM%20%2F%20Set%20Kernels%0A%20activate%20S%0A%20%20%20%20S-%3E%3ES%3A%20accept%20requests%0A%0A%20%20%20%20Note%20over%20C%2C%20B%3A%20Inference%20Request%22%0A%20activate%20C%0A%20%20%20%20C-%3E%3ES%3A%20%22Who%20are%20you%3F%22%0A%20%20%20%20S-%3E%3ES%3A%20prepare%20message%0A%20S-%3E%3ET%3A%20encode(text)%0A%20%20%20%20T--%3E%3ES%3A%20token%20IDs%0A%0A%20%20%20%20loop%20%22%5Buntil%20%3C%7CEOS%7C%3E%20or%20Limit%5D%22%0A%20%20%20%20%20%20%20%20S-%3E%3ER%3A%20run(token%20IDs)%0A%20%20%20%20%20%20%20%20R-%3E%3EB%3A%20compute(tensors)%0A%20%20%20%20%20%20%20%20B--%3E%3ER%3A%20logits%0A%20%20%20%20%20%20%20%20R--%3E%3ES%3A%20logits%0A%20%20%20%20%20%20%20%20S-%3E%3ES%3A%20sample(logits)%20-%3E%20token%20ID%0A%20%20%20%20%20%20%20%20S-%3E%3ET%
3A%20decode(token%20ID)%0A%20%20%20%20%20%20%20%20T--%3E%3ES%3A%20token%0A%20%20%20%20%20%20%20%20S-%3E%3EC%3A%20stream(token)%0A%20%20%20%20%20%20%20%20S-%3E%3ES%3A%20tokens%20%2B%3D%2042%0A%20%20%20%20end%0A%20deactivate%20C%0A%20deactivate%20S",[10,384,385],{},"Keep in mind this describes a conceptual flow. It completely ignores aspects like performance, scalability, security, deployment architectures, and so on. In actual production systems, especially those under heavy load, we can't ignore those concerns, and the diagram would look somewhat different. But these simplifications help illustrate the process.",[387,388,390],"h4",{"id":389},"startup-initialization-phase","Startup \u002F Initialization Phase",[392,393,394],"ol",{},[346,395,396],{},"The service typically starts by examining the configuration and the environment. It needs to determine:",[343,398,399,402,405,408],{},[346,400,401],{},"Which model(s) to use and how to access the artifact(s)? The model files could be bundled or downloaded on demand.",[346,403,404],{},"What inference runtime(s) can load the model(s)?",[346,406,407],{},"Which compute backend is best suited for the available hardware?",[346,409,410],{},"Which tokenizer does the model use? Generally, this information should be provided by the model creators, either in a configuration file or apparent from the distribution artifact.",[392,412,414,417,420,423],{"start":413},2,[346,415,416],{},"The service loads the tokenizer. Internally, LLMs use token IDs, and the tokenizer parses the input text and converts it to those IDs. The tokenizer's vocabulary is typically provided with the model, often in the form of a JSON file mapping each known word (or word fragment) to a number. 
Most inference runtimes include libraries that allow instantiating a tokenizer from these files.",[346,418,419],{},"The service starts the inference runtime, providing the model artifact or its location, along with the preferred compute backend(s).",[346,421,422],{},"The inference runtime instantiates a computation graph as defined by the model and loads the model's weights into the selected compute backend's memory. It then waits for the service to initiate a computation process.",[346,424,425],{},"The service exposes a UI or API to receive inference requests from clients.",[387,427,429],{"id":428},"inference-request-processing","Inference Request Processing",[392,431,433,436,443,446,449,452,455,463,466,480,483,486,489],{"start":432},6,[346,434,435],{},"The service receives a request. The payload is a string.",[346,437,438,439,442],{},"The service performs standard checks to ensure the request should be processed (authentication, rate-limit, quotas, etc.). It may also enhance\u002Fchange the message according to some policy ",[20,440,441],{},"(spellcheck, anonymization, ...)",".",[346,444,445],{},"The service calls the tokenizer to convert the content into token IDs. 
It's crucial that the service uses the exact same tokenizer and vocabulary as the one the model used during training.",[346,447,448],{},"The service gets a vector of token IDs from the tokenizer.",[346,450,451],{},"The service requests to start a computation session on the currently loaded inference runtime, passing the vector of token IDs.",[346,453,454],{},"The inference runtime executes the computation graph loaded from the model on the specialized hardware through the compute backend.",[346,456,457,458,462],{},"LLMs are typically \"",[14,459,461],{"href":460},"\u002Fblog\u002F2026\u002F03\u002FAnatomy-of-a-Model-the-Developer-Perspective#decoder-only","Decoder-Only Transformers","\" and their head produces logits.",[346,464,465],{},"The inference runtime returns the logits to the service.",[346,467,468,469,472,473,475,476,479],{},"The service selects the next token based on the logits returned. Typically, it first applies a ",[317,470,471],{},"softmax"," function to convert them to probabilities (decimal values between 0 and 1). Then it reduces the list to just a few token IDs using \"top-p\" (the smallest set of tokens whose cumulative probability is at least ",[317,474,10],{},") or \"top-k\" (the ",[317,477,478],{},"k"," tokens with the highest probability). It then randomly draws a token from the reduced list.",[346,481,482],{},"Assuming it is a streaming service, it calls the tokenizer to de-tokenize the ID.",[346,484,485],{},"The service gets the actual word\u002Ffragment from the tokenizer.",[346,487,488],{},"The service sends the actual word\u002Ffragment back to the client.",[346,490,491,492,495],{},"If the selected token is the model’s ",[317,493,494],{},"\u003C|EOS|>"," (End-Of-Sequence) token, then the service completes the session with the client. 
Otherwise, it appends the newly obtained token ID to the current vector of token IDs and repeats the process from step 10.",[50,497,499],{"id":498},"latent-diffusion-inference","Latent Diffusion Inference",[10,501,502],{},"Another example of models relying on loops for generation is diffusion models, frequently used for image generation.",[10,504,505,506,509,510,513,514,517],{},"Conceptually, these are not a single model but a combination of models. Typically, a ",[317,507,508],{},"CLIP"," model is used for understanding the textual input, a ",[317,511,512],{},"U-Net"," one for calculating the noise reduction, and a ",[317,515,516],{},"VAE"," for decoding the tensor into pixels.",[10,519,520],{},"Starting with 100% noise, the service needs to invoke the inference runtime in a loop until a final result is achieved. This might be after a predefined number of iterations, or based on an algorithm that checks if the noise level falls below a certain threshold. This results in a rather complex 
flow:",[379,522],{":config":381,"code":523},"---%0Aconfig%3A%0A%20%20layout%3A%20dagre%0A%20%20theme%3A%20neural%0A---%0AsequenceDiagram%0A%20%20%20%20autonumber%0A%20%20%20%20participant%20C%20as%20Client%0A%20%20%20%20participant%20S%20as%20Service%0A%0A%20%20%20%20Note%20over%20S%3A%20Initialization%20Phase%0A%20%20%20%20S-%3E%3ES%3A%20init%0A%20%20%20%20create%20participant%20RC%20as%20Runtime%20(CLIP)%0A%20%20%20%20S-%3E%3ERC%3A%20Load%20Text%20Encoder%0A%20%20%20%20create%20participant%20RU%20as%20Runtime%20(U-Net)%0A%20%20%20%20S-%3E%3ERU%3A%20Load%20Noise%20Predictor%0A%20%20%20%20create%20participant%20RV%20as%20Runtime%20(VAE)%0A%20%20%20%20S-%3E%3ERV%3A%20Load%20Decoder%0A%0A%20%20%20%20participant%20B%20as%20Backend%20(GPU%2FVRAM)%0A%0A%20%20%20%20RU-%3E%3EB%3A%20Allocate%20VRAM%20%2F%20Set%20Kernels%0A%20%20%20%20activate%20S%0A%20%20%20%20S-%3E%3ES%3A%20accept%20requests%0A%0A%20%20%20%20Note%20over%20C%2C%20B%3A%20Inference%20Request%0A%20%20%20%20activate%20C%0A%20%20%20%20C-%3E%3ES%3A%20%22Generate%20image%20of...%22%0A%0A%20%20%20%20S-%3E%3ERC%3A%20run_encoder(text)%0A%20%20%20%20RC-%3E%3EB%3A%20compute(tensors)%0A%20%20%20%20B--%3E%3ERC%3A%20embeddings%0A%20%20%20%20RC--%3E%3ES%3A%20Concept%20Vector%0A%0A%20%20%20%20S-%3E%3ES%3A%20Init%20Latent%20Noise%20(z)%0A%0A%20%20%20%20loop%20%22%5BScheduler%20Timesteps%5D%22%0A%20%20%20%20%20%20%20%20S-%3E%3ERU%3A%20predict_noise(z%2C%20timestep%2C%20concept)%0A%20%20%20%20%20%20%20%20RU-%3E%3EB%3A%20compute(tensors)%0A%20%20%20%20%20%20%20%20B--%3E%3ERU%3A%20noise_tensors%0A%20%20%20%20%20%20%20%20RU--%3E%3ES%3A%20Predicted%20Pattern%20(%24%5Cepsilon%24)%0A%0A%20%20%20%20%20%20%20%20Note%20over%20S%3A%20Scheduler%20Logic%3A%20%3Cbr%2F%3E%20Clean%20z%20using%20Predicted%20Pattern%0A%20%20%20%20%20%20%20%20S-%3E%3ES%3A%20z%20%3D%20scheduler.step(z%2C%20%24%5Cepsilon%24)%0A%0A%20%20%20%20%20%20%20%20opt%20%22Optional%20Preview%22%0A%20%20%20%20%20%20%20%20%20%20%20%20S-%3E%3ERV%3A%20decode(z)%0A%20%20%20%20%2
0%20%20%20%20%20%20%20RV-%3E%3EB%3A%20compute(tensors)%0A%20%20%20%20%20%20%20%20%20%20%20%20B--%3E%3ERV%3A%20pixels%0A%20%20%20%20%20%20%20%20%20%20%20%20RV--%3E%3ES%3A%20image_data%0A%20%20%20%20%20%20%20%20%20%20%20%20S-%3E%3EC%3A%20stream(frame)%0A%20%20%20%20%20%20%20%20end%0A%20%20%20%20end%0A%0A%20%20%20%20S-%3E%3ERV%3A%20decode(final_z)%0A%20%20%20%20RV-%3E%3EB%3A%20compute(tensors)%0A%20%20%20%20B--%3E%3ERV%3A%20pixels%0A%20%20%20%20RV--%3E%3ES%3A%20Final%20Pixels%0A%20%20%20%20S-%3E%3EC%3A%20deliver(image)%0A%0A%20%20%20%20deactivate%20C%0A%20%20%20%20deactivate%20S",[10,525,526],{},"Again, this is a conceptual flow. In production environments, the flow would be heavily optimized and thus look different. Still, fundamentally, this is what happens behind the scenes:",[387,528,390],{"id":529},"startup-initialization-phase-1",[392,531,532,535,538,541,544,547],{},[346,533,534],{},"The service initializes by identifying the specific diffusion model configuration, the required runtimes, and the optimal hardware backends available.",[346,536,537],{},"The service instantiates the first inference runtime to load the text encoder (CLIP), which is responsible for understanding the semantic meaning of the user's prompt.",[346,539,540],{},"The service instantiates a second runtime for the noise predictor (U-Net), the \"brain\" of the diffusion process that identifies patterns within random noise.",[346,542,543],{},"The service instantiates a third runtime for the decoder (VAE), which is used to translate mathematical representations (latents) into actual pixel maps.",[346,545,546],{},"The runtimes coordinate with the compute backend to allocate VRAM and prepare the specialized kernels needed for high-speed tensor math.",[346,548,549],{},"With all models loaded and the hardware prepared, the service opens its API or UI to begin accepting image generation requests from 
clients.",[387,551,429],{"id":552},"inference-request-processing-1",[392,554,556,559,562,565,568,571,574,577,580,583,586,589,592,595,598,601,604,607,610,613,616],{"start":555},7,[346,557,558],{},"The service receives a natural language prompt from the client describing the image to be generated.",[346,560,561],{},"The service sends the prompt to the CLIP runtime to translate the string into a high-dimensional numerical representation (embeddings).",[346,563,564],{},"The CLIP runtime utilizes the backend to process the text, resulting in a \"Concept Vector\" that the other models can understand.",[346,566,567],{},"The result is a vector, which now acts as the permanent semantic anchor for the entire generation process.",[346,569,570],{},"The service receives this vector from the runtime.",[346,572,573],{},"The service generates a tensor of completely random Gaussian noise (latents) at a smaller scale than the final image to serve as the \"starting canvas.\"",[346,575,576],{},"The service starts the loop by passing the current noisy latents, the concept vector, and the current timestep to the U-Net runtime.",[346,578,579],{},"The U-Net runtime executes its graph on the backend to identify which parts of the current noise look like the requested concepts.",[346,581,582],{},"The execution results in a \"pattern map\" (predicted noise) representing the elements the model suggests should be removed to reveal the image.",[346,584,585],{},"The runtime passes this prediction back to the service for the next orchestration step.",[346,587,588],{},"The service uses a scheduler library to mathematically subtract a portion of the predicted noise from the current latents, resulting in a slightly \"cleaner\" version of the image.",[346,590,591],{},"If configured for streaming, the service sends the current intermediate latents to the VAE runtime for decoding.",[346,593,594],{},"The VAE runtime processes the mathematical latent on the backend to reconstruct a human-readable pixel 
map.",[346,596,597],{},"The runtime returns the raw image data (RGB pixels) to the service.",[346,599,600],{},"The service receives the frame and formats it for transmission.",[346,602,603],{},"The service pushes the low-quality preview frame to the client so the user can watch the image \"emerge\" from the noise.",[346,605,606],{},"Once the loop reaches the noise threshold or step limit, the service sends the final refined latent to the VAE runtime for high-quality reconstruction.",[346,608,609],{},"The VAE performs a final pass on the backend.",[346,611,612],{},"The runtime execution results in the final pixel map.",[346,614,615],{},"The service receives the final generated asset from the runtime.",[346,617,618],{},"The service performs any final post-processing (like PNG encoding) and delivers the completed image to the client, closing the session.",[42,620,622],{"id":621},"single-shot-inference","Single-Shot Inference",[10,624,625],{},"While the above flows rely on inference loops, many smaller models can get their work done with single-shot inference. That means we don't need the predict-next loop described above; we get the results we need by calling the inference runtime just once.",[10,627,628],{},"Consider the following categories of models:",[343,630,631,637,643,649,655,661,667],{},[346,632,633,636],{},[70,634,635],{},"Classification"," - “which class does this belong to?”",[346,638,639,642],{},[70,640,641],{},"Regression"," - “what is the numerical value?”",[346,644,645,648],{},[70,646,647],{},"Ranking\u002FRecommendation"," - “rank these items from most to least relevant.”",[346,650,651,654],{},[70,652,653],{},"Similarity \u002F Retrieval"," - “which items are most similar?”",[346,656,657,660],{},[70,658,659],{},"Detection \u002F Segmentation"," - “where are the objects? 
\u002F what is the mask?”",[346,662,663,666],{},[70,664,665],{},"Forecasting"," - “what will the next value be?”",[346,668,669,672],{},[70,670,671],{},"Anomaly"," - “is this point an outlier?”",[10,674,675],{},"The steps to use any of those from our code are almost identical. At the initialization phase, we still need to pick a model, an inference runtime that can load it, potentially a compatible tokenizer, and a compute backend. At request processing time, we still need to preprocess the input, execute the computation, and postprocess the result. As not all models work with text and word tokenizers, let's see how other examples follow the same process.",[10,677,678,679,682,683,685],{},"Say we have a ",[317,680,681],{},"json"," with some credit card transactions and want to check for possible fraud. Our input could be a ",[317,684,681],{}," like the one below.",[687,688,692],"pre",{"className":689,"code":690,"language":681,"meta":691,"style":691},"language-json shiki shiki-themes material-theme-lighter github-light github-dark monokai","[\n  {\n    \"account_id\": \"ACC_STEADY_COFFEE\",\n    \"history\": [\n      {\"month\": 11, \"day\": 1, \"dow\": 1, \"hour\": 8, \"min\": 15, \"amount\": 4.50, ...},\n      {\"month\": 11, \"day\": 1, \"dow\": 1, \"hour\": 9, \"min\": 30, \"amount\": 12.00, ... 
},\n      ...\n ]\n  },\n  ...\n]\n","",[317,693,694,703,708,738,753,852,936,941,947,953,959],{"__ignoreMap":691},[695,696,699],"span",{"class":697,"line":698},"line",1,[695,700,702],{"class":701},"swvn1","[\n",[695,704,705],{"class":697,"line":413},[695,706,707],{"class":701},"  {\n",[695,709,711,715,719,722,725,729,733,735],{"class":697,"line":710},3,[695,712,714],{"class":713},"saDeg","    \"",[695,716,718],{"class":717},"sEff5","account_id",[695,720,721],{"class":713},"\"",[695,723,724],{"class":701},":",[695,726,728],{"class":727},"sh1VR"," \"",[695,730,732],{"class":731},"sINAO","ACC_STEADY_COFFEE",[695,734,721],{"class":727},[695,736,737],{"class":701},",\n",[695,739,741,743,746,748,750],{"class":697,"line":740},4,[695,742,714],{"class":713},[695,744,745],{"class":717},"history",[695,747,721],{"class":713},[695,749,724],{"class":701},[695,751,752],{"class":701}," [\n",[695,754,756,759,761,765,767,769,773,776,778,781,783,785,788,790,792,795,797,799,801,803,805,808,810,812,815,817,819,822,824,826,829,831,833,836,838,840,843,845,849],{"class":697,"line":755},5,[695,757,758],{"class":701},"      {",[695,760,721],{"class":713},[695,762,764],{"class":763},"s_MOj","month",[695,766,721],{"class":713},[695,768,724],{"class":701},[695,770,772],{"class":771},"sYThS"," 11",[695,774,775],{"class":701},",",[695,777,728],{"class":713},[695,779,780],{"class":763},"day",[695,782,721],{"class":713},[695,784,724],{"class":701},[695,786,787],{"class":771}," 1",[695,789,775],{"class":701},[695,791,728],{"class":713},[695,793,794],{"class":763},"dow",[695,796,721],{"class":713},[695,798,724],{"class":701},[695,800,787],{"class":771},[695,802,775],{"class":701},[695,804,728],{"class":713},[695,806,807],{"class":763},"hour",[695,809,721],{"class":713},[695,811,724],{"class":701},[695,813,814],{"class":771}," 
8",[695,816,775],{"class":701},[695,818,728],{"class":713},[695,820,821],{"class":763},"min",[695,823,721],{"class":713},[695,825,724],{"class":701},[695,827,828],{"class":771}," 15",[695,830,775],{"class":701},[695,832,728],{"class":713},[695,834,835],{"class":763},"amount",[695,837,721],{"class":713},[695,839,724],{"class":701},[695,841,842],{"class":771}," 4.50",[695,844,775],{"class":701},[695,846,848],{"class":847},"s4fT8"," ...",[695,850,851],{"class":701},"},\n",[695,853,854,856,858,860,862,864,866,868,870,872,874,876,878,880,882,884,886,888,890,892,894,896,898,900,903,905,907,909,911,913,916,918,920,922,924,926,929,931,933],{"class":697,"line":432},[695,855,758],{"class":701},[695,857,721],{"class":713},[695,859,764],{"class":763},[695,861,721],{"class":713},[695,863,724],{"class":701},[695,865,772],{"class":771},[695,867,775],{"class":701},[695,869,728],{"class":713},[695,871,780],{"class":763},[695,873,721],{"class":713},[695,875,724],{"class":701},[695,877,787],{"class":771},[695,879,775],{"class":701},[695,881,728],{"class":713},[695,883,794],{"class":763},[695,885,721],{"class":713},[695,887,724],{"class":701},[695,889,787],{"class":771},[695,891,775],{"class":701},[695,893,728],{"class":713},[695,895,807],{"class":763},[695,897,721],{"class":713},[695,899,724],{"class":701},[695,901,902],{"class":771}," 9",[695,904,775],{"class":701},[695,906,728],{"class":713},[695,908,821],{"class":763},[695,910,721],{"class":713},[695,912,724],{"class":701},[695,914,915],{"class":771}," 30",[695,917,775],{"class":701},[695,919,728],{"class":713},[695,921,835],{"class":763},[695,923,721],{"class":713},[695,925,724],{"class":701},[695,927,928],{"class":771}," 12.00",[695,930,775],{"class":701},[695,932,848],{"class":847},[695,934,935],{"class":701}," },\n",[695,937,938],{"class":697,"line":555},[695,939,940],{"class":847},"      ...\n",[695,942,944],{"class":697,"line":943},8,[695,945,946],{"class":701}," 
]\n",[695,948,950],{"class":697,"line":949},9,[695,951,952],{"class":701},"  },\n",[695,954,956],{"class":697,"line":955},10,[695,957,958],{"class":847},"  ...\n",[695,960,962],{"class":697,"line":961},11,[695,963,964],{"class":701},"]\n",[10,966,967,968,982,983,986],{},"If we were to run a fraud detection model like ",[14,969,973,974,977,978,981],{"href":970,"rel":971},"https:\u002F\u002Fgithub.com\u002FIBM\u002Fai-on-z-fraud-detection",[972],"nofollow","IBM's ",[317,975,976],{},"GRU"," or ",[317,979,980],{},"LSTM"," models",", we need to convert our data to a feature tensor during the preprocessing. The input shape of the model is ",[317,984,985],{},"[7, 16, 220]",", meaning it is designed to process 7 batches of data simultaneously, where each batch contains a sequence of 16 transactions, and each transaction is represented by 220 features. So that's what the service needs to produce.",[687,988,990],{"className":689,"code":989,"language":681,"meta":691,"style":691},"[\n [ \u002F\u002F batch 1\n  [f1, f2, ..., f220], \u002F\u002F transaction 1\n  ...\n  [f1, f2, ..., f220], \u002F\u002F transaction 16\n ],\n ...\n [ \u002F\u002F batch 7\n  ...\n ]\n]\n",[317,991,992,996,1005,1041,1045,1074,1079,1084,1091,1095,1099],{"__ignoreMap":691},[695,993,994],{"class":697,"line":698},[695,995,702],{"class":701},[695,997,998,1001],{"class":697,"line":413},[695,999,1000],{"class":701}," [",[695,1002,1004],{"class":1003},"ss7Ak"," \u002F\u002F batch 1\n",[695,1006,1007,1010,1013,1016,1018,1021,1024,1026,1028,1030,1032,1035,1038],{"class":697,"line":710},[695,1008,1009],{"class":701},"  [",[695,1011,1012],{"class":847},"f",[695,1014,1015],{"class":771},"1",[695,1017,775],{"class":701},[695,1019,1020],{"class":847}," f",[695,1022,1023],{"class":771},"2",[695,1025,775],{"class":701},[695,1027,848],{"class":847},[695,1029,775],{"class":701},[695,1031,1020],{"class":847},[695,1033,1034],{"class":771},"220",[695,1036,1037],{"class":701},"],",[695,1039,1040],{"class":1003}," 
\u002F\u002F transaction 1\n",[695,1042,1043],{"class":697,"line":740},[695,1044,958],{"class":847},[695,1046,1047,1049,1051,1053,1055,1057,1059,1061,1063,1065,1067,1069,1071],{"class":697,"line":755},[695,1048,1009],{"class":701},[695,1050,1012],{"class":847},[695,1052,1015],{"class":771},[695,1054,775],{"class":701},[695,1056,1020],{"class":847},[695,1058,1023],{"class":771},[695,1060,775],{"class":701},[695,1062,848],{"class":847},[695,1064,775],{"class":701},[695,1066,1020],{"class":847},[695,1068,1034],{"class":771},[695,1070,1037],{"class":701},[695,1072,1073],{"class":1003}," \u002F\u002F transaction 16\n",[695,1075,1076],{"class":697,"line":432},[695,1077,1078],{"class":701}," ],\n",[695,1080,1081],{"class":697,"line":555},[695,1082,1083],{"class":847}," ...\n",[695,1085,1086,1088],{"class":697,"line":943},[695,1087,1000],{"class":701},[695,1089,1090],{"class":1003}," \u002F\u002F batch 7\n",[695,1092,1093],{"class":697,"line":949},[695,1094,958],{"class":847},[695,1096,1097],{"class":697,"line":955},[695,1098,946],{"class":701},[695,1100,1101],{"class":697,"line":961},[695,1102,964],{"class":701},[10,1104,1105,1106,1109],{},"Then the service can execute the model just once and get the scores. The output shape of ",[317,1107,1108],{},"[7, 16, 1]"," means the model generates results for 7 batches simultaneously, where each batch contains a single fraud score for each of the 16 transactions in the sequence. 
During post-processing, the service applies thresholds to convert those raw scores into meaningful labels",[10,1111,1112],{},"Here is how the inference flow looks:",[379,1114],{":config":381,"code":1115},"---%0Aconfig%3A%0A%20%20layout%3A%20dagre%0A%20%20theme%3A%20neural%0A---%0AsequenceDiagram%0A%20%20%20%20autonumber%0A%20%20%20%20participant%20C%20as%20Client%0A%20%20%20%20participant%20S%20as%20Service%0A%0A%20%20%20%20Note%20over%20S%3A%20Initialization%20Phase%0A%20%20%20%20S-%3E%3ES%3A%20init%0A%20%20%20%20create%20participant%20R%20as%20Runtime%0A%20%20%20%20S-%3E%3ER%3A%20Start%20runtime%20(Model%2C%20Preferred%20Backend)%0A%0A%20%20%20%20participant%20B%20as%20Backend%20(Compute%2FVRAM)%0A%0A%20%20%20%20R-%3E%3EB%3A%20Load%20Graph%20%26%20Allocate%20Memory%0A%20%20%20%20activate%20S%0A%20%20%20%20S-%3E%3ES%3A%20accept%20requests%0A%0A%20%20%20%20Note%20over%20C%2C%20B%3A%20Inference%20Request%0A%20%20%20%20activate%20C%0A%20%20%20%20C-%3E%3ES%3A%20Raw%20Data%20(e.g.%2C%20transactions.json)%0A%0A%20%20%20%20S-%3E%3ES%3A%20Featurize%20(Scale%20numbers%2C%20encode%20categories)%0A%0A%20%20%20%20S-%3E%3ER%3A%20run_inference(input_tensors)%0A%20%20%20%20R-%3E%3EB%3A%20compute(graph)%0A%0A%20%20%20%20B--%3E%3ER%3A%20raw_results%0A%20%20%20%20R--%3E%3ES%3A%20scores_matrix%0A%0A%20%20%20%20S-%3E%3ES%3A%20Post-process%20(Apply%20thresholds%2Flabels)%0A%20%20%20%20S-%3E%3EC%3A%20Analysis%20Report%0A%0A%20%20%20%20deactivate%20C%0A%20%20%20%20deactivate%20S",[10,1117,1118],{},"Hopefully, the flow is simple enough and self-explanatory, but for the sake of consistency with the previous ones:",[387,1120,390],{"id":1121},"startup-initialization-phase-2",[392,1123,1124,1127,1130,1133],{},[346,1125,1126],{},"The service prepares the environment and determines which model and backend are required for the task.",[346,1128,1129],{},"The service instantiates the inference runtime and provides the model artifact.",[346,1131,1132],{},"The runtime communicates with the compute backend to 
load the model's computation graph into memory and prepare for execution.",[346,1134,1135],{},"The service begins listening for data payloads from clients.",[387,1137,429],{"id":1138},"inference-request-processing-2",[392,1140,1141,1144,1150,1153,1156,1162,1165,1168],{"start":755},[346,1142,1143],{},"The client sends a dataset, such as a collection of transaction histories.",[346,1145,1146,1147,1149],{},"The service performs the \"data preparation\" step, constructing a ",[317,1148,985],{}," tensor.",[346,1151,1152],{},"The service passes the prepared tensors to the runtime.",[346,1154,1155],{},"The runtime executes the graph on the backend.",[346,1157,1158,1159,1161],{},"The backend produces the ",[317,1160,1108],{}," tensor with the scores.",[346,1163,1164],{},"The runtime sends the scores tensor to the service.",[346,1166,1167],{},"The service applies the business logic to the raw scores, such as labeling a high-probability score as \"FRAUD\" or a medium one as \"SUSPECT.\"",[346,1169,1170],{},"The service returns the final report or categorized data to the client.",[42,1172,1174],{"id":1173},"summary","Summary",[10,1176,1177],{},"While AI models are often viewed as black boxes, their execution in production relies on a precise orchestration between specialized hardware, execution runtimes, and the services that wrap them. In the academic Python world, these boundaries are often blurred; a script that produces a correct result may not offer a clear path for decomposition or scaling.",[10,1179,1180,1181,1184,1185,1188,1189,1192],{},"It was only after reproducing these behaviors in language stacks like Java and TypeScript (using unified runtimes like ONNX Runtime) that I was able to establish a clear mental model of how these pieces fit together. 
This post deconstructs that connection, illustrating the interplay between ",[70,1182,1183],{},"compute backends",", ",[70,1186,1187],{},"inference runtimes",", and ",[70,1190,1191],{},"services"," to help you architect systems that are both performant and truly scalable.",[1194,1195],"hr",{},[310,1197,1200,1208],{"color":312,"icon":1198,"title":1199},"mdi-light-book-multiple","AI for Application Developers Series",[10,1201,1202,1203,1207],{},"The post is part of the ",[14,1204,1206],{"href":1205},"\u002Fblog\u002F2026\u002F03\u002FAI-for-Application-Developers","AI for Application Developers"," series - my personal notes on various AI topics converted to blog posts.",[10,1209,1210,1211,1214,1215,1218,1219,1222,1223,1226],{},"Please do not hesitate to ",[70,1212,1213],{},"correct"," me if I got something wrong, ",[70,1216,1217],{},"contribute"," if something is missing, ",[70,1220,1221],{},"ask"," me to clarify or simply ",[70,1224,1225],{},"share"," your experience and views.",[1228,1229,1230],"style",{},"html pre.shiki code .swvn1, html code.shiki .swvn1{--shiki-light:#39ADB5;--shiki-default:#24292E;--shiki-dark:#E1E4E8;--shiki-sepia:#F8F8F2}html pre.shiki code .saDeg, html code.shiki .saDeg{--shiki-light:#39ADB5;--shiki-light-font-style:inherit;--shiki-default:#005CC5;--shiki-default-font-style:inherit;--shiki-dark:#79B8FF;--shiki-dark-font-style:inherit;--shiki-sepia:#66D9EF;--shiki-sepia-font-style:italic}html pre.shiki code .sEff5, html code.shiki .sEff5{--shiki-light:#9C3EDA;--shiki-light-font-style:inherit;--shiki-default:#005CC5;--shiki-default-font-style:inherit;--shiki-dark:#79B8FF;--shiki-dark-font-style:inherit;--shiki-sepia:#66D9EF;--shiki-sepia-font-style:italic}html pre.shiki code .sh1VR, html code.shiki .sh1VR{--shiki-light:#39ADB5;--shiki-default:#032F62;--shiki-dark:#9ECBFF;--shiki-sepia:#CFCFC2}html pre.shiki code .sINAO, html code.shiki .sINAO{--shiki-light:#91B859;--shiki-default:#032F62;--shiki-dark:#9ECBFF;--shiki-sepia:#CFCFC2}html 
pre.shiki code .s_MOj, html code.shiki .s_MOj{--shiki-light:#E2931D;--shiki-light-font-style:inherit;--shiki-default:#005CC5;--shiki-default-font-style:inherit;--shiki-dark:#79B8FF;--shiki-dark-font-style:inherit;--shiki-sepia:#66D9EF;--shiki-sepia-font-style:italic}html pre.shiki code .sYThS, html code.shiki .sYThS{--shiki-light:#F76D47;--shiki-default:#005CC5;--shiki-dark:#79B8FF;--shiki-sepia:#AE81FF}html pre.shiki code .s4fT8, html code.shiki .s4fT8{--shiki-light:#90A4AE;--shiki-light-font-style:inherit;--shiki-default:#B31D28;--shiki-default-font-style:italic;--shiki-dark:#FDAEB7;--shiki-dark-font-style:italic;--shiki-sepia:#F44747;--shiki-sepia-font-style:inherit}html .light .shiki span {color: var(--shiki-light);background: var(--shiki-light-bg);font-style: var(--shiki-light-font-style);font-weight: var(--shiki-light-font-weight);text-decoration: var(--shiki-light-text-decoration);}html.light .shiki span {color: var(--shiki-light);background: var(--shiki-light-bg);font-style: var(--shiki-light-font-style);font-weight: var(--shiki-light-font-weight);text-decoration: var(--shiki-light-text-decoration);}html .default .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .dark .shiki span {color: var(--shiki-dark);background: var(--shiki-dark-bg);font-style: var(--shiki-dark-font-style);font-weight: var(--shiki-dark-font-weight);text-decoration: var(--shiki-dark-text-decoration);}html.dark .shiki span {color: var(--shiki-dark);background: var(--shiki-dark-bg);font-style: var(--shiki-dark-font-style);font-weight: var(--shiki-dark-font-weight);text-decoration: 
var(--shiki-dark-text-decoration);}html .sepia .shiki span {color: var(--shiki-sepia);background: var(--shiki-sepia-bg);font-style: var(--shiki-sepia-font-style);font-weight: var(--shiki-sepia-font-weight);text-decoration: var(--shiki-sepia-text-decoration);}html.sepia .shiki span {color: var(--shiki-sepia);background: var(--shiki-sepia-bg);font-style: var(--shiki-sepia-font-style);font-weight: var(--shiki-sepia-font-weight);text-decoration: var(--shiki-sepia-text-decoration);}html pre.shiki code .ss7Ak, html code.shiki .ss7Ak{--shiki-light:#90A4AE;--shiki-light-font-style:italic;--shiki-default:#6A737D;--shiki-default-font-style:inherit;--shiki-dark:#6A737D;--shiki-dark-font-style:inherit;--shiki-sepia:#88846F;--shiki-sepia-font-style:inherit}",{"title":691,"searchDepth":413,"depth":413,"links":1232},[1233,1239,1243,1244],{"id":44,"depth":413,"text":45,"children":1234},[1235,1236,1237,1238],{"id":52,"depth":710,"text":53},{"id":166,"depth":710,"text":167},{"id":326,"depth":710,"text":327},{"id":333,"depth":710,"text":334},{"id":366,"depth":413,"text":367,"children":1240},[1241,1242],{"id":373,"depth":710,"text":374},{"id":498,"depth":710,"text":499},{"id":621,"depth":413,"text":622},{"id":1173,"depth":413,"text":1174},null,"2026-04-15","What we learned from \"AI\u002FML Models Are Not Libraries\" is that models are essentially collections of numbers (weights) and, optionally, mathematical formulas. The \"optionally\" part is key, as we saw in \"A Trip in the AI\u002FML Model Formats Jungle\" that not all model files store the formulas themselves. Some formats are \"Mostly Self-Confined,\" while others are \"Weights-Only,\" expecting the application using them to \"know\" the underlying math. In \"Anatomy of a Model - the Developer Perspective\", we explored different architectures and their inputs and outputs. 
With that groundwork laid, we can now consider the inference process as a whole.","md","\u002Fassets\u002Finference_stack.png",{"layout":1251},"new_post",true,"\u002Fblog\u002F2026\u002F04\u002FThe-Inference-Engine-Bringing-Models-to-Life",{"title":5,"description":1247},"2026-04-15-The-Inference-Engine-Bringing-Models-to-Life","bthrMhHrJ1R0my5VzBdSDBX-vdZMOz6BxeQ3_FrvCX4",[1258,1245],{"title":32,"path":31,"stem":1259,"description":1260,"children":-1},"2026-03-23-Anatomy-of-a-Model-the-Developer-Perspective","In the old days, every IT organization had a dedicated, almost sacred role: the DBA (Database Administrator), often informally known as the \"Gatekeeper of the Schema.\" These individuals ensured that the schema adhered to 3NF (Third Normal Form), that the appropriate fields were indexed, that no foreign keys were missing, that data was accurately partitioned, and that the overall structure resembled a proper \"Star\" or \"Snowflake\" schema. Most of this was a mystery to software developers who simply wanted to store and retrieve data via SQL, but it was crucial from a resource efficiency perspective.",1776326101387]