# Documentation Index

Fetch the complete documentation index at https://docs.monostate.ai/llms.txt. Use this file to discover all available pages before exploring further.
# Model Serving

Serve your trained models for production inference.

## Chat Interface
The simplest way to test and interact with models is the Chat UI, available at http://localhost:7860/inference once the app is running. It allows you to load any local or Hub model for interactive testing.
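A hypothetical launch command is shown below; the `autotrain`-style entry point is inferred from the environment variables later on this page and may differ in your install:

```bash
# Assumed CLI entry point; substitute your package's command if it differs.
autotrain app --port 7860 --host 127.0.0.1
# Then open http://localhost:7860/inference in a browser.
```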
### Custom Port
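For example (same assumed CLI as above):

```bash
autotrain app --port 8080
```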
### Custom Host
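Binding to all interfaces makes the UI reachable from other machines (same assumed CLI):

```bash
autotrain app --host 0.0.0.0
```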
## API Server
The API server is a training runner, not an inference server. It exposes minimal endpoints for health checks while running training jobs.

### Start API Server
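A hypothetical start command, matching the defaults in the parameter table below (CLI name assumed as above):

```bash
autotrain api --port 7860 --host 127.0.0.1
```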
The server listens on http://127.0.0.1:7860 by default.
### Parameters
| Parameter | Description | Default |
|---|---|---|
| `--port` | Port to run the API on | `7860` |
| `--host` | Host to bind to | `127.0.0.1` |
| `--task` | Task to run (optional) | `None` |
### Custom Port/Host
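For example (assumed CLI as above):

```bash
autotrain api --port 8000 --host 0.0.0.0
```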
### Environment Variables
The API server reads configuration from environment variables:

| Variable | Description |
|---|---|
| `HF_TOKEN` | Hugging Face token for authentication |
| `AUTOTRAIN_USERNAME` | Username for training |
| `PROJECT_NAME` | Name of the project |
| `TASK_ID` | Task identifier |
| `PARAMS` | Training parameters (JSON) |
| `DATA_PATH` | Path to training data |
| `MODEL` | Model to use |
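A hypothetical shell setup with placeholder values; the variable names come from the table above, but the `PARAMS` keys are illustrative:

```bash
export HF_TOKEN="<your_hf_token>"
export AUTOTRAIN_USERNAME="<hub_username>"
export PROJECT_NAME="my-project"
export TASK_ID="1"
export PARAMS='{"lr": 2e-5, "epochs": 3}'  # illustrative JSON keys
export DATA_PATH="/path/to/data"
export MODEL="<model_id_or_path>"
```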
### Endpoints
| Endpoint | Description |
|---|---|
| `GET /` | Returns training status message |
| `GET /health` | Health check (returns "OK") |
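For example:

```bash
curl http://127.0.0.1:7860/        # training status message
curl http://127.0.0.1:7860/health  # "OK"
```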
The API server automatically shuts down when no training jobs are active. For production inference, use vLLM or TGI instead.
## Production Deployment
### Using vLLM
For production-grade serving with high throughput, serve the model with vLLM.
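A minimal sketch using vLLM's OpenAI-compatible server; the model ID is a placeholder for any local path or Hub model:

```bash
# Serve a model with vLLM's OpenAI-compatible server.
vllm serve <model_id_or_path> --host 0.0.0.0 --port 8000
```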
### Using Text Generation Inference (TGI)

Text Generation Inference (TGI) is Hugging Face's production inference server and an alternative to vLLM.
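A sketch using the official TGI container; the model ID is a placeholder:

```bash
# Run TGI via the official container, mounting the local HF cache.
docker run --gpus all -p 8080:80 \
  -v $HOME/.cache/huggingface:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id <model_id_or_path>
```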
### OpenAI-Compatible API

Both vLLM and TGI provide OpenAI-compatible endpoints, so standard OpenAI clients and tooling work against either server.
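For example, a chat completion request with curl (port and model ID are placeholders matching the vLLM example above):

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<model_id_or_path>",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```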
## Docker Deployment

### Dockerfile Example
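A hypothetical Dockerfile building on the vLLM base image; the image tag and model ID are placeholders, not values from this page:

```dockerfile
# Hypothetical example; pin an image tag that matches your CUDA version.
FROM vllm/vllm-openai:latest
EXPOSE 8000
# The base image's entrypoint launches the OpenAI-compatible server,
# so only the arguments are supplied here.
CMD ["--model", "<model_id_or_path>", "--host", "0.0.0.0", "--port", "8000"]
```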
### With GPU
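Assuming the image above is tagged `my-model-server` (a hypothetical name), build and run it with GPU access; this requires the NVIDIA Container Toolkit on the host:

```bash
docker build -t my-model-server .
docker run --gpus all -p 8000:8000 my-model-server
```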
## Load Testing
### Using hey
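A sketch against the OpenAI-compatible endpoint; request count, concurrency, and payload are illustrative:

```bash
# 1,000 POST requests at 50 concurrent workers.
hey -n 1000 -c 50 -m POST \
  -H "Content-Type: application/json" \
  -d '{"model": "<model_id_or_path>", "messages": [{"role": "user", "content": "Hi"}]}' \
  http://localhost:8000/v1/chat/completions
```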
### Using locust
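A minimal locustfile sketch; the endpoint and payload assume the OpenAI-compatible API described above:

```python
# locustfile.py
from locust import HttpUser, task, between

class InferenceUser(HttpUser):
    wait_time = between(1, 3)  # pause 1-3 s between requests per user

    @task
    def chat_completion(self):
        # Model ID is a placeholder; match it to your deployment.
        self.client.post(
            "/v1/chat/completions",
            json={
                "model": "<model_id_or_path>",
                "messages": [{"role": "user", "content": "Hello!"}],
            },
        )
```

Run it with `locust -f locustfile.py --host http://localhost:8000` and open the Locust web UI at http://localhost:8089 to drive the test.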
## Monitoring
### Prometheus Metrics
If using vLLM or TGI, Prometheus metrics are available at `/metrics`.
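For example:

```bash
# Inspect a sample of the exposed Prometheus metrics.
curl -s http://localhost:8000/metrics | head -n 20
```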
### Logging
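Both vLLM and TGI log to stdout/stderr, so standard container tooling applies; a sketch assuming the Docker deployment above:

```bash
# Follow the server logs of a running container (name is hypothetical).
docker logs -f my-model-server
```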
## Next Steps
- **Benchmarking**: Measure model performance
- **Chat Interface**: Interactive testing