Inference Services
Learn about the inference services available in Atoma Node
Overview
Atoma Node integrates several leading open-source inference engines:
- vLLM: A high-throughput and memory-efficient inference engine optimized for LLMs. Features state-of-the-art serving throughput with PagedAttention memory management and continuous batching.
- mistral.rs: A blazingly fast Rust-based inference engine with support for various model architectures, quantization methods, and hardware acceleration options.
- Text Embeddings Inference (TEI): A high-performance solution specifically designed for text embedding models, offering both REST and gRPC APIs with support for various embedding model architectures.
Chat Completions
| Backend | Architecture/Platform | Docker Compose Profile |
|---|---|---|
| vLLM | CUDA | chat_completions_vllm |
| vLLM | x86_64 | chat_completions_vllm_cpu |
| vLLM | ROCm | chat_completions_vllm_rocm |
| mistral.rs | x86_64, aarch64 | chat_completions_mistralrs_cpu |
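Once one of these profiles is up, the vLLM backend serves an OpenAI-compatible chat completions API. The sketch below assumes the service is reachable on localhost port 8000 (vLLM's default) and uses a placeholder model identifier; both depend on your compose configuration and the model you deploy:

```bash
# Sketch of a chat completions request against the vLLM backend's
# OpenAI-compatible API. Port 8000 and the model name are assumptions;
# adjust both to match your compose configuration and deployed model.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello, what can you do?"}],
    "max_tokens": 128
  }'
```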
Embeddings
| Backend | Architecture/Platform | Docker Compose Profile |
|---|---|---|
| Text Embeddings Inference | CUDA | embeddings_tei |
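With the TEI profile running, embeddings can be requested over its REST API. A minimal sketch, assuming the container's port is published on localhost:8080 (the actual mapping is defined in the compose file):

```bash
# Sketch of an embedding request against TEI's REST API.
# The published port (8080 here) is an assumption; check the compose file
# for the actual mapping in your deployment.
curl http://localhost:8080/embed \
  -H "Content-Type: application/json" \
  -d '{"inputs": "What is deep learning?"}'
```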
Image Generations
| Backend | Architecture/Platform | Docker Compose Profile |
|---|---|---|
| mistral.rs | CUDA | image_generations_mistralrs |
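To bring up the image generation backend, select its profile when starting the stack. A minimal sketch (additional flags such as --build are optional):

```bash
# Start the services associated with the image generation profile.
docker compose --profile image_generations_mistralrs up -d
```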
To run the node in confidential compute mode, you can use a command along the following lines, substituting the profile for your chosen backend (the confidential compose override file name below is an assumption and may differ in your checkout):
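```bash
# Confidential compute mode: layer the confidential compose override on top of
# the base file. The override file name (docker-compose.confidential.yaml) is an
# assumption; the profile shown is an example taken from the tables above.
docker compose \
  -f docker-compose.yaml \
  -f docker-compose.confidential.yaml \
  --profile chat_completions_vllm \
  up -d
```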
Otherwise, you can run the node in non-confidential mode with a command like the following, again substituting your chosen profile:
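```bash
# Non-confidential mode: start the base compose file with the profile for the
# backend you want to serve (chat_completions_vllm is used here as an example).
docker compose --profile chat_completions_vllm up -d
```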