

Model Serving

Serve your trained models for production inference.

Chat Interface

The simplest way to test and interact with models:
aitraining chat
Opens a web interface at http://localhost:7860/inference. The Chat UI allows you to load any local or Hub model for interactive testing.

Custom Port

aitraining chat --port 3000

Custom Host

aitraining chat --host 0.0.0.0

API Server

The API server is a training runner, not an inference server. It exposes minimal endpoints for health checks while running training jobs.

Start API Server

aitraining api
Starts the training API on http://127.0.0.1:7860 by default.

Parameters

Parameter | Description | Default
--port | Port to run the API on | 7860
--host | Host to bind to | 127.0.0.1
--task | Task to run (optional) | None

Custom Port/Host

aitraining api --port 8000 --host 0.0.0.0

Environment Variables

The API server reads configuration from environment variables:
Variable | Description
HF_TOKEN | Hugging Face token for authentication
AUTOTRAIN_USERNAME | Username for training
PROJECT_NAME | Name of the project
TASK_ID | Task identifier
PARAMS | Training parameters (JSON)
DATA_PATH | Path to training data
MODEL | Model to use
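
As a rough sketch (variable names from the table above; the values and the exact JSON shape of PARAMS are placeholders that depend on your task), the API can be launched with its configuration injected through the environment:

import json
import os
import subprocess

# Sketch: start the training API with configuration passed via
# environment variables. All values below are placeholders.
env = dict(os.environ)
env.update({
    "HF_TOKEN": "hf_xxx",                    # Hugging Face token
    "AUTOTRAIN_USERNAME": "my-username",
    "PROJECT_NAME": "my-project",
    "DATA_PATH": "./data",
    "MODEL": "meta-llama/Llama-3.2-1B",
    "PARAMS": json.dumps({"epochs": 1}),     # assumed JSON payload shape
})

subprocess.run(["aitraining", "api", "--port", "7860"], env=env)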

Endpoints

Endpoint | Description
GET / | Returns training status message
GET /health | Health check (returns "OK")

The API server automatically shuts down when no training jobs are active. For production inference, use vLLM or TGI instead.
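
For example, a simple readiness check against the two endpoints above (host and port assumed to be the defaults) might look like:

import requests

BASE_URL = "http://127.0.0.1:7860"  # default host and port

# Health check: returns "OK" while the server is up.
health = requests.get(f"{BASE_URL}/health", timeout=5)
print(health.status_code, health.text)

# Root endpoint: returns the current training status message.
status = requests.get(f"{BASE_URL}/", timeout=5)
print(status.text)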

Production Deployment

Using vLLM

For production-grade serving with high throughput:
pip install vllm

python -m vllm.entrypoints.openai.api_server \
  --model ./my-trained-model \
  --port 8000
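
Recent vLLM releases also provide a `vllm serve` entry point that wraps the same OpenAI-compatible server; check the documentation for your installed version for the exact invocation.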

Using Text Generation Inference (TGI)

docker run --gpus all -p 8080:80 \
  -v $(pwd)/my-model:/model \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id /model
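
As a quick sanity check against the container above (port 8080 from the mapping; the request fields follow TGI's native generate API), you can send a prompt over plain HTTP:

import requests

# TGI's native generate endpoint, mapped to port 8080 above.
resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Hello, how are you?",
        "parameters": {"max_new_tokens": 50},
    },
    timeout=60,
)
print(resp.json()["generated_text"])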

OpenAI-Compatible API

Both vLLM and TGI provide OpenAI-compatible endpoints:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"  # Any placeholder works; local servers don't validate it
)

response = client.chat.completions.create(
    model="my-model",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)
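
The generated text is available at response.choices[0].message.content. Note that the model field must match the name the server registered; for vLLM this defaults to the --model value unless overridden with --served-model-name.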

Docker Deployment

Dockerfile Example

FROM python:3.10-slim

WORKDIR /app

# Install dependencies
RUN pip install aitraining torch

# Expose port
EXPOSE 7860

# Run chat server
CMD ["aitraining", "chat", "--host", "0.0.0.0", "--port", "7860"]
Build and run:
docker build -t my-model-server .
docker run -p 7860:7860 my-model-server

With GPU

docker run --gpus all -p 7860:7860 my-model-server
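
The --gpus flag requires the NVIDIA Container Toolkit to be installed on the host.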

Load Testing

Using hey

Adjust the endpoint and payload to match how your server exposes inference; the example below assumes a generic /generate route:
hey -n 100 -c 10 \
  -m POST \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello", "max_tokens": 50}' \
  http://localhost:8000/generate

Using locust

# locustfile.py
from locust import HttpUser, task

class ModelUser(HttpUser):
    @task
    def generate(self):
        self.client.post("/generate", json={
            "prompt": "Hello, how are you?",
            "max_tokens": 50
        })
Run the load test against your server:
locust -f locustfile.py --host http://localhost:8000

Monitoring

Prometheus Metrics

If using vLLM or TGI, metrics are available at /metrics.
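
For a quick look at the raw metrics (assuming a server on port 8000, as in the vLLM example above):

import requests

# Fetch Prometheus-format metrics exposed by vLLM or TGI.
metrics = requests.get("http://localhost:8000/metrics", timeout=5).text
for line in metrics.splitlines():
    if not line.startswith("#"):  # skip HELP/TYPE comment lines
        print(line)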

Logging

aitraining api --port 8000 2>&1 | tee server.log

Next Steps

Benchmarking

Measure model performance

Chat Interface

Interactive testing