Maximize Your AI Model Performance with GenAI-Perf: A Comprehensive Guide

Jul 10, 2025

Introduction to GenAI-Perf

GenAI-Perf is a command line tool designed to measure the throughput and latency of generative AI models served through an inference server. This tool is particularly useful for benchmarking large language models (LLMs) and provides essential metrics such as output token throughput, time to first token, inter-token latency, and request throughput.

In this blog post, we will explore the purpose, features, installation, usage, and community aspects of GenAI-Perf, ensuring you have all the information needed to leverage this powerful tool effectively.

Key Features of GenAI-Perf

  • Comprehensive Metrics: Measure various performance metrics including output token throughput, time to first token, inter-token latency, and request throughput.
  • Flexible Input Options: Specify model names, inference server URLs, and input types (synthetic or dataset).
  • Load Generation: Generate load with customizable parameters such as concurrent requests and request rates (a combined example follows this list).
  • Result Logging: Log results in CSV and JSON formats for further analysis and visualization.
  • Visualization Support: Generate plots to visualize performance metrics and compare multiple runs.
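
To see how these options fit together, here is a sketch combining load generation with result logging and plotting. The flag names (--concurrency, --profile-export-file, --generate-plots) are taken from recent releases and may differ in yours; run genai-perf --help to confirm what your version supports:

# Generate load at a fixed concurrency and export results (flags assumed from recent releases)
genai-perf profile \
  -m gpt2 \
  --service-kind triton \
  --backend tensorrtllm \
  --concurrency 4 \
  --num-prompts 100 \
  --profile-export-file my_profile.json \
  --generate-plots \
  --url localhost:8001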

Technical Architecture and Implementation

GenAI-Perf interfaces with the Triton Inference Server and is built to support a range of model types, including:

  • Large Language Models
  • Vision Language Models
  • Embedding Models
  • Ranking Models
  • Multiple LoRA Adapters

The tool is designed to be extensible and is currently in early release, meaning that features and command line options may evolve as development continues.
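
As a sketch of the non-LLM support, recent releases can also profile embedding models through an OpenAI-compatible endpoint. The model name below is a placeholder, and the --service-kind openai and --endpoint-type embeddings flags are assumptions based on recent releases; check genai-perf --help for your version:

# Hypothetical embeddings benchmark against an OpenAI-compatible endpoint
genai-perf profile \
  -m my-embedding-model \
  --service-kind openai \
  --endpoint-type embeddings \
  --url localhost:8000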

Installation Process

The easiest way to install GenAI-Perf is through the Triton Server SDK container. Here’s how you can do it:

export RELEASE="yy.mm" # e.g. export RELEASE="24.06"

docker run -it --net=host --gpus=all nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk

# Check that the genai-perf command is available inside the container:
genai-perf --help

Alternatively, you can install from source. Ensure you have CUDA 12 installed and follow these steps:

pip install tritonclient
apt update && apt install -y --no-install-recommends libb64-0d libcurl4

# Clone the repository and install GenAI-Perf
git clone https://github.com/triton-inference-server/perf_analyzer.git && cd perf_analyzer
pip install -e genai-perf
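
After a source install, you can confirm the entry point works the same way as in the container:

# Verify the installation
genai-perf --help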

Usage Examples and API Overview

To run performance benchmarks using GenAI-Perf, you can follow this quick start guide:

export RELEASE="yy.mm" # e.g. export RELEASE="24.06"

docker run -it --net=host --rm --gpus=all nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk

# Run GenAI-Perf in the container:
genai-perf profile \
  -m gpt2 \
  --service-kind triton \
  --backend tensorrtllm \
  --num-prompts 100 \
  --random-seed 123 \
  --synthetic-input-tokens-mean 200 \
  --streaming \
  --output-tokens-mean 100 \
  --url localhost:8001

With --streaming enabled, GenAI-Perf reports token-level latency statistics alongside request-level metrics. Example output:

LLM Metrics
 ┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
 ┃                Statistic ┃    avg ┃    min ┃    max ┃    p99 ┃    p90 ┃    p75 ┃
 ┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
 │ Time to first token (ms) │  11.70 │   9.88 │  17.21 │  14.35 │  12.01 │  11.87 │
 │ Inter token latency (ms) │   1.46 │   1.08 │   1.89 │   1.87 │   1.62 │   1.52 │
 │     Request latency (ms) │ 161.24 │ 153.45 │ 200.74 │ 200.66 │ 179.43 │ 162.23 │
 │   Output sequence length │ 103.39 │  95.00 │ 134.00 │ 120.08 │ 107.30 │ 105.00 │
 │    Input sequence length │ 200.01 │ 200.00 │ 201.00 │ 200.13 │ 200.00 │ 200.00 │
 └──────────────────────────┴────────┴────────┴────────┴────────┴────────┴────────┘
 Output token throughput (per sec): 635.61
 Request throughput (per sec): 6.15
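
The same statistics are written as CSV and JSON to the artifact directory for logging and later visualization. The paths and JSON field names below are assumptions based on the default layout of recent releases; adjust them to match your run:

# Inspect the CSV export (default artifact layout assumed)
cat artifacts/gpt2-triton-tensorrtllm-concurrency1/profile_export_genai_perf.csv

# Pull a single statistic from the JSON export with jq
# (field names assumed from the metric names above)
jq '.time_to_first_token.avg' \
  artifacts/gpt2-triton-tensorrtllm-concurrency1/profile_export_genai_perf.json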

Community and Contribution Aspects

GenAI-Perf is an open-source project, and contributions are welcome. If you wish to contribute, please follow the guidelines outlined in the repository. Contributions that fix documentation errors or make small changes can be submitted directly, while significant enhancements should be discussed through GitHub issues.

For more details on contributing, refer to the official contribution guidelines.

License and Legal Considerations

GenAI-Perf is released under an NVIDIA Corporation license that permits redistribution and use in source and binary forms, with or without modification, provided certain conditions are met, including retention of copyright notices and disclaimers.

For more information on licensing, please refer to the license documentation.

Conclusion

GenAI-Perf is a powerful tool for benchmarking generative AI models, providing developers with essential metrics to optimize performance. By following the installation and usage guidelines outlined in this post, you can effectively leverage GenAI-Perf to enhance your AI model’s performance.

For more information, visit the GenAI-Perf GitHub repository.

FAQ Section

What is GenAI-Perf?

GenAI-Perf is a command line tool for measuring the performance of generative AI models served through an inference server, providing metrics like throughput and latency.

How do I install GenAI-Perf?

You can install GenAI-Perf via the Triton Server SDK container or from source by following the installation instructions in the documentation.

What metrics does GenAI-Perf provide?

GenAI-Perf provides metrics such as output token throughput, time to first token, inter-token latency, and request throughput, among others.

Can I contribute to GenAI-Perf?

Yes, contributions are welcome! You can submit small changes directly or discuss larger enhancements through GitHub issues.