A Deep Dive into TensorRT & Triton for Production Inference
Introduction
So you’ve trained your model. Validation loss looks great, accuracy is solid, and you’re feeling good. Then someone asks: “Can we deploy this to production?”
And suddenly, latency matters. Throughput matters. GPU bills matter.
This is where TensorRT and Triton Inference Server become essential. Together, they form NVIDIA’s complementary toolkit for moving a trained model from a local prototype to a production-ready system capable of handling thousands of requests per second. In this post, we will take a closer look at how that process works.
The Big Picture: Train → Optimize → Serve
[PyTorch / TensorFlow]     [TensorRT]        [Triton]
Train a model          →   Optimize it   →   Serve it at scale
(Accuracy focus)           (Speed focus)     (Production focus)
Step 1: Export the Trained Model (The Starting Point)
Most of us are familiar with training models using frameworks like PyTorch or TensorFlow, which leverage CUDA and cuDNN under the hood for GPU acceleration. Once training is done, we export the model, typically to ONNX format, which acts as a universal bridge between deep learning frameworks.
# Export your trained PyTorch model to ONNX
import torch

# `model` is your trained torch.nn.Module; eval() switches dropout and
# BatchNorm to inference behavior before tracing
model = model.cuda().eval()
dummy_input = torch.randn(1, 3, 224, 224).cuda()
torch.onnx.export(
    model,                    # Your trained model
    dummy_input,              # Example input for tracing
    "model.onnx",             # Output file
    opset_version=17,         # ONNX opset version
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={            # Allow variable batch sizes
        "input": {0: "batch_size"},
        "output": {0: "batch_size"},
    },
)
Why ONNX? It decouples your model from the training framework. TensorRT can then ingest it regardless of whether you trained in PyTorch, TensorFlow, or JAX.
That ONNX file is your handoff point. Now the real fun begins.
Step 2: Optimization with TensorRT
At its core, TensorRT takes your ONNX model and rebuilds it as a highly optimized inference engine specifically tuned for your GPU. Here’s what it actually does under the hood:
Layer Fusion
Our model, as exported, runs operations sequentially. A typical ResNet block might look like:
Conv2D → BatchNorm → ReLU → Conv2D → BatchNorm → ReLU
   ↓         ↓        ↓       ↓          ↓        ↓
kernel1   kernel2  kernel3 kernel4    kernel5  kernel6
Each operation is a separate GPU kernel call, which means separate memory reads and writes plus kernel launch overhead for every step. TensorRT fuses these into fewer, larger kernels:
[Conv2D + BatchNorm + ReLU] → [Conv2D + BatchNorm + ReLU]
             ↓                             ↓
          kernel1                       kernel2
Fewer kernel launches, fewer memory round-trips, significantly faster execution.
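To make the idea concrete, here is a minimal sketch of one piece of that fusion: folding BatchNorm into the preceding convolution's weights. It uses a single-channel 1x1 "convolution" (a scalar multiply-add) so the algebra stays visible. The function names are made up for illustration, and TensorRT's real fusion passes are internal and cover far more than this.

```python
import math

def conv_bn_relu_unfused(x, w, b, gamma, beta, mean, var, eps=1e-5):
    """Three separate steps, the way three separate kernels would compute them."""
    y = w * x + b                                          # 1x1 conv on one channel
    y = gamma * (y - mean) / math.sqrt(var + eps) + beta   # BatchNorm (inference mode)
    return max(y, 0.0)                                     # ReLU

def fold_bn_into_conv(w, b, gamma, beta, mean, var, eps=1e-5):
    """Precompute conv weights that already include the BatchNorm transform."""
    scale = gamma / math.sqrt(var + eps)
    return w * scale, (b - mean) * scale + beta

def conv_bn_relu_fused(x, w_folded, b_folded):
    """One multiply-add plus ReLU: the work of a single fused kernel."""
    return max(w_folded * x + b_folded, 0.0)

# Illustrative parameter values (not taken from any real model)
w, b = 0.8, 0.1
gamma, beta, mean, var = 1.5, -0.2, 0.3, 4.0
w_f, b_f = fold_bn_into_conv(w, b, gamma, beta, mean, var)

for x in (-2.0, 0.0, 1.0, 3.5):
    unfused = conv_bn_relu_unfused(x, w, b, gamma, beta, mean, var)
    fused = conv_bn_relu_fused(x, w_f, b_f)
    assert math.isclose(unfused, fused, abs_tol=1e-9)
```

Both paths produce the same outputs, but the fused one touches memory once instead of three times.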
Precision Calibration (INT8 & FP16)
By default, models run in FP32. With the right builder flags, TensorRT can run layers in FP16 or even INT8, usually with minimal accuracy loss:
- FP16: Often close to a ~2x speedup on GPUs with Tensor Cores, usually with negligible accuracy impact
- INT8: Up to ~4x speedup over FP32, but requires a calibration dataset to determine scaling factors
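# `builder` is the trt.Builder created in the full build script ("Building the Engine")
config = builder.create_builder_config()

# FP16 - easy win, almost always safe
config.set_flag(trt.BuilderFlag.FP16)

# INT8 - needs calibration
config.set_flag(trt.BuilderFlag.INT8)
# MyCalibrator is a class you write yourself (for example, a subclass of
# trt.IInt8EntropyCalibrator2) that feeds batches of calibration data
config.int8_calibrator = MyCalibrator(calibration_dataset)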
TensorRT’s INT8 calibration works by running a small representative dataset through the model and computing quantization ranges (scaling factors) for the activations automatically. We don’t need to manually quantize anything.
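As a rough intuition for what a "scaling factor" is, the sketch below implements the simplest possible scheme: symmetric max-abs calibration. This is an assumption chosen for clarity, not TensorRT's algorithm. Its calibrators (the entropy-based ones, for instance) choose ranges more carefully. The function names and sample values are hypothetical.

```python
def calibrate_scale(samples):
    """Max-abs calibration: map the largest observed magnitude to 127."""
    max_abs = max(abs(v) for v in samples)
    if max_abs == 0:
        raise ValueError("calibration data is all zeros")
    return max_abs / 127.0

def quantize(x, scale):
    """FP32 value -> INT8 code, clipped to the symmetric range [-127, 127]."""
    return max(-127, min(127, round(x / scale)))

def dequantize(q, scale):
    """INT8 code -> approximate FP32 value."""
    return q * scale

# Pretend these are activations observed while running the calibration set
calibration_activations = [-1.9, -0.4, 0.0, 0.7, 2.54]
scale = calibrate_scale(calibration_activations)

for value in calibration_activations:
    restored = dequantize(quantize(value, scale), scale)
    # In-range values come back within half a quantization step
    assert abs(restored - value) <= scale / 2 + 1e-12

# Values outside the calibrated range saturate, which is why the
# calibration data has to be representative of production inputs
assert quantize(10.0, scale) == 127
```

The saturation at the end is the main practical risk with INT8: anything the calibration set never showed the model gets clipped.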
Kernel Auto-Tuning
For any given operation, there are often dozens of possible CUDA kernel implementations. TensorRT benchmarks them all at engine build time and picks the fastest one for your specific GPU and input shape. This is why building a TensorRT engine takes a few minutes. It’s doing real benchmarking work upfront so inference is as fast as possible at runtime.
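The same "measure, then commit" idea can be sketched in a few lines of plain Python. The three candidate functions below are stand-ins for alternative kernel implementations, and `pick_fastest` is a hypothetical helper. The real tuner times CUDA kernels on the GPU, not Python functions.

```python
import time

def sum_squares_loop(values):
    total = 0
    for v in values:
        total += v * v
    return total

def sum_squares_generator(values):
    return sum(v * v for v in values)

def sum_squares_map(values):
    return sum(map(lambda v: v * v, values))

def pick_fastest(candidates, sample_input, repeats=5):
    """Time every candidate on a sample input and return the quickest one."""
    timings = {}
    for name, fn in candidates.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            fn(sample_input)
            best = min(best, time.perf_counter() - start)
        timings[name] = best
    winner = min(timings, key=timings.get)
    return winner, timings

candidates = {
    "loop": sum_squares_loop,
    "generator": sum_squares_generator,
    "map": sum_squares_map,
}
sample = list(range(10_000))
winner, timings = pick_fastest(candidates, sample)
print(f"fastest implementation on this machine: {winner}")
```

The winner can differ from machine to machine, which mirrors why a tuned engine is tied to the hardware it was built on.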
Memory Planning
TensorRT pre-analyzes the entire computation graph and plans optimal memory reuse across layers. Tensors that are no longer needed get their memory reclaimed immediately for subsequent layers. This reduces peak memory usage significantly, which is important when we are running multiple models on the same GPU.
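A simplified way to picture this is a greedy planner that reuses a buffer as soon as the tensor that owned it is dead. The sketch below is an illustration under assumed tensor sizes and lifetimes, not TensorRT's planner. `Tensor` and `plan_memory` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Tensor:
    name: str
    size: int        # bytes (arbitrary units here)
    first_use: int   # step at which the tensor is produced
    last_use: int    # last step at which the tensor is read

def plan_memory(tensors):
    """Greedy planner: reuse the smallest free buffer that is large enough."""
    buffers = []      # buffers[i] = size of buffer i
    free = []         # ids of buffers whose tensor is no longer needed
    active = []       # (last_use, buffer_id) for live tensors
    assignment = {}
    for t in sorted(tensors, key=lambda t: t.first_use):
        # Release buffers of tensors that died before this step
        still_active = []
        for last_use, buf in active:
            if last_use < t.first_use:
                free.append(buf)
            else:
                still_active.append((last_use, buf))
        active = still_active

        fitting = [b for b in free if buffers[b] >= t.size]
        if fitting:
            buf = min(fitting, key=lambda b: buffers[b])
            free.remove(buf)
        else:
            buffers.append(t.size)
            buf = len(buffers) - 1
        assignment[t.name] = buf
        active.append((t.last_use, buf))
    return assignment, sum(buffers)

# A four-layer chain: each activation is only read by the next layer
chain = [
    Tensor("t0", 100, first_use=0, last_use=1),
    Tensor("t1", 100, first_use=1, last_use=2),
    Tensor("t2", 50, first_use=2, last_use=3),
    Tensor("t3", 100, first_use=3, last_use=4),
]
assignment, planned = plan_memory(chain)
naive = sum(t.size for t in chain)
print(f"naive: {naive}, planned: {planned}, assignment: {assignment}")
```

For this chain, two buffers are enough for four tensors because at most two activations are alive at any step.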
Building the Engine
Putting it all together:
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

# Parse ONNX model and surface parser errors instead of failing silently
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse model.onnx")

# Configure optimization
config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 4 << 30)  # 4GB
config.set_flag(trt.BuilderFlag.FP16)

# Set dynamic shape profile
profile = builder.create_optimization_profile()
profile.set_shape(
    "input",
    min=(1, 3, 224, 224),    # Minimum batch size
    opt=(16, 3, 224, 224),   # Optimal batch size
    max=(64, 3, 224, 224),   # Maximum batch size
)
config.add_optimization_profile(profile)

# Build & save engine (this takes a few minutes)
serialized_engine = builder.build_serialized_network(network, config)
if serialized_engine is None:
    raise RuntimeError("Engine build failed; check the TensorRT log output")
with open("model.plan", "wb") as f:
    f.write(serialized_engine)
print("TensorRT engine built and saved!")
⚠️ Important: TensorRT engines are GPU-specific. By default, an engine built on an A100 won’t run on a T4, so always build the engine on the same GPU class you use in production.
Step 3: Serving with Triton (The Production Layer)
We now have a blazing-fast TensorRT engine. But a .plan file sitting on disk doesn’t serve HTTP requests, handle concurrent users, or scale across multiple GPUs. That’s Triton’s job.
Setting Up the Model Repository
Triton uses a simple directory structure:
model_repository/
└── resnet50/
├── config.pbtxt ← Model configuration
└── 1/
└── model.plan ← TensorRT engine
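If we would rather script this layout than create it by hand, a small helper is enough. The sketch below is one possible way to do it. `create_model_repository` is a hypothetical helper, and the demo writes placeholder bytes instead of a real engine so it can run anywhere.

```python
import shutil
import tempfile
from pathlib import Path

def create_model_repository(root, model_name, engine_path, config_text, version=1):
    """Lay out <root>/<model_name>/config.pbtxt and <root>/<model_name>/<version>/model.plan."""
    model_dir = Path(root) / model_name
    version_dir = model_dir / str(version)
    version_dir.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(engine_path, version_dir / "model.plan")
    (model_dir / "config.pbtxt").write_text(config_text)
    return version_dir

# Demo in a temporary directory with a placeholder engine file
with tempfile.TemporaryDirectory() as tmp:
    engine = Path(tmp) / "model.plan"
    engine.write_bytes(b"placeholder engine bytes")
    repo = Path(tmp) / "model_repository"
    create_model_repository(repo, "resnet50", engine, 'name: "resnet50"\n')
    for path in sorted(repo.rglob("*")):
        print(path.relative_to(repo))
```

The numbered subdirectory is the model version, so a new engine can be added as `2/model.plan` without touching version 1.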
Configuring Triton (config.pbtxt)
name: "resnet50"
backend: "tensorrt"
max_batch_size: 64
input [
  {
    name: "input"
    # FP32: the FP16 builder flag changes internal layer precision,
    # while the engine's input/output tensors keep the ONNX model's FP32 type
    data_type: TYPE_FP32
    dims: [3, 224, 224]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [1000]
  }
]
dynamic_batching {
  preferred_batch_size: [8, 16, 32, 64]
  max_queue_delay_microseconds: 500
}
instance_group [
  {
    count: 2        # Run 2 model instances concurrently
    kind: KIND_GPU
    gpus: [0]       # On GPU 0
  }
]
Throughput Multiplexing with Dynamic Batching
This is one of Triton’s killer features. Instead of processing requests one by one (wasting GPU parallelism) or waiting for a fixed batch (adding latency), Triton dynamically groups incoming requests into batches at runtime. The timeline below is illustrative and assumes a 0.5 ms queue delay and a preferred batch size of 4:
t=0ms Request A arrives → queue: [A]
t=0.2ms Request B arrives → queue: [A, B]
t=0.3ms Request C arrives → queue: [A, B, C]
t=0.5ms Timer expires → Batch [A, B, C] → GPU 🚀
t=0.6ms Requests D,E,F,G arrive → preferred size (4) hit!
→ Batch [D, E, F, G] → GPU immediately 🚀
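The timeline can be replayed with a toy simulator. This is a deliberately simplified model of the batcher, assuming a model instance is always free and using the illustrative preferred size of 4 from the timeline. It is not Triton's scheduler, and `simulate_dynamic_batching` is a hypothetical name. Times are in microseconds to keep the arithmetic exact.

```python
def simulate_dynamic_batching(arrivals, preferred_sizes, max_delay_us):
    """Replay (arrival_time_us, request_id) pairs through a toy dynamic batcher.

    Rules in this simplified model:
      * dispatch immediately when the queue reaches a preferred batch size
      * otherwise dispatch once the oldest queued request has waited max_delay_us
    """
    batches = []   # (dispatch_time_us, [request ids])
    queue = []     # (arrival_time_us, request_id), oldest first
    for time_us, request_id in arrivals:
        # The oldest request's timer may have expired before this arrival
        if queue and queue[0][0] + max_delay_us <= time_us:
            batches.append((queue[0][0] + max_delay_us, [r for _, r in queue]))
            queue = []
        queue.append((time_us, request_id))
        if len(queue) in preferred_sizes:
            batches.append((time_us, [r for _, r in queue]))
            queue = []
    if queue:  # flush whatever is left when its timer expires
        batches.append((queue[0][0] + max_delay_us, [r for _, r in queue]))
    return batches

arrivals = [
    (0, "A"), (200, "B"), (300, "C"),
    (600, "D"), (600, "E"), (600, "F"), (600, "G"),
]
batches = simulate_dynamic_batching(arrivals, preferred_sizes={4}, max_delay_us=500)
for dispatch_time, batch in batches:
    print(f"t={dispatch_time / 1000:.1f}ms -> batch {batch}")
```

Raising `max_delay_us` in this model produces larger batches at the cost of extra waiting time for the first request in each batch, which is the trade-off the real setting controls.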
Launching Triton
# Ports: 8000 = HTTP/REST, 8001 = gRPC (lower overhead, preferred for high throughput), 8002 = Prometheus metrics
# (comments cannot follow a trailing backslash, so they live up here)
docker run --gpus all --rm \
  -p 8000:8000 \
  -p 8001:8001 \
  -p 8002:8002 \
  -v "$(pwd)/model_repository:/models" \
  nvcr.io/nvidia/tritonserver:24.01-py3 \
  tritonserver --model-repository=/models
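The container takes a moment to load models, so a deployment script usually waits before sending traffic. Here is a minimal sketch that polls Triton's HTTP readiness endpoint (`/v2/health/ready`) using only the standard library. `wait_until_ready` and its default timeouts are assumptions for illustration.

```python
import time
import urllib.error
import urllib.request

def wait_until_ready(base_url, timeout_s=60.0, interval_s=1.0):
    """Poll <base_url>/v2/health/ready until it returns 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(
                f"{base_url}/v2/health/ready", timeout=interval_s
            ) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet (connection refused, timeout, non-200 status)
        time.sleep(interval_s)
    return False
```

With the port mapping above, `wait_until_ready("http://localhost:8000")` should return `True` once the server and its models are ready, and `False` if nothing answers within the timeout.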
Sending Inference Requests
import tritonclient.http as httpclient
import numpy as np
client = httpclient.InferenceServerClient("localhost:8000")
# Check server health
assert client.is_server_live(), "Triton server is not running!"
# Prepare input
# FP32 matches the engine's I/O type (the FP16 builder flag only changes internal precision)
input_data = np.random.rand(1, 3, 224, 224).astype(np.float32)
inputs = [httpclient.InferInput("input", input_data.shape, "FP32")]
inputs[0].set_data_from_numpy(input_data)
outputs = [httpclient.InferRequestedOutput("output")]
# Run inference
response = client.infer("resnet50", inputs, outputs=outputs)
predictions = response.as_numpy("output")
print(f"Output shape: {predictions.shape}") # (1, 1000)
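Under the hood, the HTTP client speaks the KServe v2 inference protocol, so it can help to see what a request body looks like. The sketch below only builds the JSON payload and does not send it. `build_infer_request` is a hypothetical helper, an FP32 input tensor is assumed, and a tiny shape is used so the payload stays readable.

```python
import json

def build_infer_request(input_name, shape, values, output_name, datatype="FP32"):
    """Build a KServe v2 style JSON body with the tensor data flattened row-major."""
    expected = 1
    for dim in shape:
        expected *= dim
    if len(values) != expected:
        raise ValueError(f"shape {tuple(shape)} needs {expected} values, got {len(values)}")
    return {
        "inputs": [
            {
                "name": input_name,
                "shape": list(shape),
                "datatype": datatype,
                "data": list(values),
            }
        ],
        "outputs": [{"name": output_name}],
    }

# Tiny illustrative tensor; the model in this post expects (1, 3, 224, 224)
shape = (1, 3, 2, 2)
payload = build_infer_request("input", shape, [0.0] * 12, "output")
body = json.dumps(payload)
# This body would be POSTed to http://localhost:8000/v2/models/resnet50/infer
print(body)
```

For large tensors the client libraries send binary data instead of JSON numbers, which is one reason to prefer `tritonclient` over hand-built requests in production code.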
Key Takeaways
If you take nothing else from this post, remember these three things:
- TensorRT is GPU-specific: always build your engine on the same GPU class you’ll deploy on. It’s not portable, but that’s exactly why it’s so fast.
- FP16 is almost always a free win. Enable it by default. Only reach for INT8 if you need that extra push and are willing to validate accuracy carefully.
- Triton’s dynamic batching is a throughput multiplier: tune preferred_batch_size and max_queue_delay_microseconds based on your latency SLA and expected traffic patterns.