

Model Serving

Serve your trained models for production inference.

Chat Interface

The simplest way to test and interact with models:
aitraining chat
Opens a web interface at http://localhost:7860/inference. The Chat UI allows you to load any local or Hub model for interactive testing.

Custom Port

aitraining chat --port 3000

Custom Host

aitraining chat --host 0.0.0.0

API Server

The API server is a training runner, not an inference server. It exposes minimal endpoints for health checks while running training jobs.

Start API Server

aitraining api
Starts the training API on http://127.0.0.1:7860 by default.

Parameters

Parameter | Description | Default
--port | Port to run the API on | 7860
--host | Host to bind to | 127.0.0.1
--task | Task to run (optional) | None

Custom Port/Host

aitraining api --port 8000 --host 0.0.0.0

Environment Variables

The API server reads configuration from environment variables:
Variable | Description
HF_TOKEN | Hugging Face token for authentication
AUTOTRAIN_USERNAME | Username for training
PROJECT_NAME | Name of the project
TASK_ID | Task identifier
PARAMS | Training parameters (JSON)
DATA_PATH | Path to training data
MODEL | Model to use
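
As a rough sketch (variable names from the table above; the values and the exact JSON shape of PARAMS are placeholders that depend on your task), the API can be launched with its configuration injected through the environment:

import json
import os
import subprocess

# Sketch: start the training API with configuration passed via
# environment variables. All values below are placeholders.
env = dict(os.environ)
env.update({
    "HF_TOKEN": "hf_xxx",                    # Hugging Face token
    "AUTOTRAIN_USERNAME": "my-username",
    "PROJECT_NAME": "my-project",
    "DATA_PATH": "./data",
    "MODEL": "meta-llama/Llama-3.2-1B",
    "PARAMS": json.dumps({"epochs": 1}),     # assumed JSON payload shape
})

subprocess.run(["aitraining", "api", "--port", "7860"], env=env)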

Endpoints

Endpoint | Description
GET / | Returns training status message
GET /health | Health check (returns "OK")

The API server automatically shuts down when no training jobs are active. For production inference, use vLLM or TGI instead.
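
For example, a simple readiness check against the two endpoints above (host and port assumed to be the defaults) might look like:

import requests

BASE_URL = "http://127.0.0.1:7860"  # default host and port

# Health check: returns "OK" while the server is up.
health = requests.get(f"{BASE_URL}/health", timeout=5)
print(health.status_code, health.text)

# Root endpoint: returns the current training status message.
status = requests.get(f"{BASE_URL}/", timeout=5)
print(status.text)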

Production Deployment

Using vLLM

For production-grade serving with high throughput:
pip install vllm

python -m vllm.entrypoints.openai.api_server \
  --model ./my-trained-model \
  --port 8000
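
Recent vLLM releases also provide a `vllm serve` entry point that wraps the same OpenAI-compatible server; check the documentation for your installed version for the exact invocation.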

Using Text Generation Inference (TGI)

docker run --gpus all -p 8080:80 \
  -v $(pwd)/my-model:/model \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id /model
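
As a quick sanity check against the container above (port 8080 from the mapping; the request fields follow TGI's native generate API), you can send a prompt over plain HTTP:

import requests

# TGI's native generate endpoint, mapped to port 8080 above.
resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Hello, how are you?",
        "parameters": {"max_new_tokens": 50},
    },
    timeout=60,
)
print(resp.json()["generated_text"])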

OpenAI-Compatible API

Both vLLM and TGI provide OpenAI-compatible endpoints:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"  # Any placeholder works; local servers don't validate it
)

response = client.chat.completions.create(
    model="my-model",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)
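
The generated text is available at response.choices[0].message.content. Note that the model field must match the name the server registered; for vLLM this defaults to the --model value unless overridden with --served-model-name.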

Docker Deployment

Dockerfile Example

FROM python:3.10-slim

WORKDIR /app

# Install dependencies
RUN pip install aitraining torch

# Expose port
EXPOSE 7860

# Run chat server
CMD ["aitraining", "chat", "--host", "0.0.0.0", "--port", "7860"]
Build and run:
docker build -t my-model-server .
docker run -p 7860:7860 my-model-server

With GPU

docker run --gpus all -p 7860:7860 my-model-server
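
The --gpus flag requires the NVIDIA Container Toolkit to be installed on the host.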

Load Testing

Using hey

Adjust the endpoint and payload to match how your server exposes inference; the example below assumes a generic /generate route:
hey -n 100 -c 10 \
  -m POST \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello", "max_tokens": 50}' \
  http://localhost:8000/generate

Using locust

# locustfile.py
from locust import HttpUser, task

class ModelUser(HttpUser):
    @task
    def generate(self):
        self.client.post("/generate", json={
            "prompt": "Hello, how are you?",
            "max_tokens": 50
        })
Run the load test against your server:
locust -f locustfile.py --host http://localhost:8000

Monitoring

Prometheus Metrics

If using vLLM or TGI, metrics are available at /metrics.
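
For a quick look at the raw metrics (assuming a server on port 8000, as in the vLLM example above):

import requests

# Fetch Prometheus-format metrics exposed by vLLM or TGI.
metrics = requests.get("http://localhost:8000/metrics", timeout=5).text
for line in metrics.splitlines():
    if not line.startswith("#"):  # skip HELP/TYPE comment lines
        print(line)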

Logging

aitraining api --port 8000 2>&1 | tee server.log

Next Steps

Benchmarking

Measure model performance

Chat Interface

Interactive testing