Engages third-party hardware partners for capacity expansion.
Supports TEEs (Trusted Execution Environments) via confidential-computing-capable chips.
Ensures deployment on HIPAA- and SOC 2-compliant infrastructure with TEE support.
Scalability
Features:
Currently handles 1,000 RPS (Requests Per Second).
Targets 1M RPS as a future goal.
How:
We will host the most frequently routed LLM and ML models on Cerebras clusters, which provide fast inference and sufficient compute and memory capacity to handle 1K-1M concurrent calls and processes. This requires collaboration and support from the Cerebras software team.
Less frequently routed models can be hosted on io.net Ray clusters for slower-inference, higher-latency use cases, balancing hosting costs across our hardware infrastructure.
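The two-tier hosting strategy above can be sketched as a demand-based router: frequently requested models are served from the Cerebras cluster, while long-tail models fall back to io.net Ray clusters. This is a minimal illustrative sketch, not the actual implementation; the class name, backend labels, and threshold are all assumptions.

```python
# Hypothetical model router for the two-tier hosting strategy.
# Hot (frequently routed) models -> Cerebras cluster (fast inference);
# cold (rarely routed) models -> io.net Ray clusters (cost-efficient,
# latency-tolerant). Names and thresholds are illustrative assumptions.
from collections import Counter


class ModelRouter:
    def __init__(self, threshold: int = 100):
        # threshold: requests per observation window above which a
        # model is considered "hot" (assumed cutoff, not a real figure).
        self.threshold = threshold
        self.request_counts: Counter = Counter()

    def record_request(self, model_id: str) -> None:
        # Track per-model demand within the current window.
        self.request_counts[model_id] += 1

    def backend_for(self, model_id: str) -> str:
        # Route by observed demand: hot models get low-latency hardware,
        # cold models get the cheaper, higher-latency tier.
        if self.request_counts[model_id] >= self.threshold:
            return "cerebras"
        return "io_net_ray"


router = ModelRouter(threshold=3)
for _ in range(3):
    router.record_request("popular-llm")
router.record_request("rare-model")
print(router.backend_for("popular-llm"))  # → cerebras
print(router.backend_for("rare-model"))   # → io_net_ray
```

In practice the demand counter would be windowed (e.g. a rolling count) and models would be migrated between tiers asynchronously, but the routing decision itself reduces to this kind of threshold check.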