O.RI INFRA

Hardware Infrastructure Hosting the ORI & models

  • Inference Endpoints Registry (a registry-entry sketch follows this outline)
      • Aims for 100K+ models in the registry (mirroring HuggingFace open-source models).
      • Deployed on ATLAS.O (Cerebras WSE-3 cluster).
      • Deployed on IO.NET (Ray cluster with H100 and A100 GPUs).
      • Deployed on HIPAA/SOC2/TEE-compliant infrastructure.
      • Connects to 200+ model inference-as-a-service providers for closed-source models (AIML API, Fireworks AI, Together AI).
  • Routing Optimization
      • Utilizes chips that are 10x faster for routing operations, e.g. the Cerebras Wafer-Scale Engine.
      • Utilizes compiler-level optimizers for structured LLM tasks, e.g. Rysana Inversion.
  • ORI Core Deployment
      • Exclusively runs on ATLAS.O (Cerebras WSE-3 cluster).
  • Inference Endpoint Distribution
      • ATLAS.O (Cerebras WSE-3 cluster).
      • IO.NET Ray clusters (H100 and A100 GPUs).
  • Expansion Strategy
      • Engages third-party hardware partners for capacity expansion.
      • Supports TEE (Trusted Execution Environment) via confidential-computing-capable chips.
      • Ensures deployment on HIPAA/SOC2/TEE-compliant infrastructure.
  • Scalability
      • Features:
          • Currently handles 1,000 RPS (requests per second).
          • Targets 1M RPS as a future goal.
      • How:
          • The most frequently routed/used LLM and ML models will be hosted on Cerebras clusters, which provide fast inference and sufficient compute/memory capacity to handle 1K-1M concurrent calls and processes. This requires collaboration with and support from the Cerebras software team.
          • Less frequently routed models can be hosted on io.net Ray clusters for slower-inference, higher-latency use cases, balancing hosting costs across our hardware infrastructure (see the tier-selection sketch after this outline).
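
Below is a minimal sketch of what an entry in the Inference Endpoints Registry could look like. It is illustrative only: the schema, field names (model_id, deployment_target, gpu_types, provider), and example model IDs are assumptions rather than the actual ORI data model; the sketch simply reflects the deployment targets and provider connections listed above.

```python
from dataclasses import dataclass, field
from enum import Enum


class DeploymentTarget(Enum):
    """Where an endpoint is served (illustrative values only)."""
    ATLAS_O_CEREBRAS = "atlas.o-cerebras-wse3"
    IONET_RAY = "io.net-ray"
    THIRD_PARTY_API = "third-party-api"   # e.g. AIML API, Fireworks AI, Together AI


@dataclass
class InferenceEndpoint:
    """One entry in the Inference Endpoints Registry (hypothetical schema)."""
    model_id: str                           # e.g. a HuggingFace repo id
    deployment_target: DeploymentTarget
    gpu_types: list[str] = field(default_factory=list)   # e.g. ["H100", "A100"] on Ray clusters
    hipaa_soc2_tee_compliant: bool = False  # compliance flag for regulated workloads
    provider: str | None = None             # set only for closed-source, provider-hosted models


# Example entries mirroring the deployments described above.
registry = [
    InferenceEndpoint("meta-llama/Llama-3.1-70B-Instruct",
                      DeploymentTarget.ATLAS_O_CEREBRAS,
                      hipaa_soc2_tee_compliant=True),
    InferenceEndpoint("mistralai/Mistral-7B-Instruct-v0.3",
                      DeploymentTarget.IONET_RAY,
                      gpu_types=["H100", "A100"]),
    InferenceEndpoint("gpt-4o",
                      DeploymentTarget.THIRD_PARTY_API,
                      provider="AIML API"),
]
```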
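The "How" items above describe a usage-based split: frequently routed models on Cerebras clusters for fast inference, rarely routed models on io.net Ray clusters to balance hosting cost. The sketch below shows one way such a tier decision could be expressed; the threshold value, function names, and backend labels are illustrative assumptions, not the production routing logic.

```python
from collections import Counter

# Rolling count of how often each model is routed (hypothetical bookkeeping).
routing_counts: Counter[str] = Counter()

# Illustrative threshold: models routed more often than this are treated as "hot".
HOT_MODEL_THRESHOLD = 10_000   # calls per rolling window (assumed value)


def record_route(model_id: str) -> None:
    """Track every routed call so tiering can follow real usage."""
    routing_counts[model_id] += 1


def select_backend(model_id: str) -> str:
    """Pick a hosting tier for a model, per the strategy described above.

    Hot (frequently routed) models -> Cerebras cluster (ATLAS.O) for fast inference.
    Cold (rarely routed) models    -> io.net Ray cluster, accepting higher latency
                                      to keep hosting costs balanced.
    """
    if routing_counts[model_id] >= HOT_MODEL_THRESHOLD:
        return "atlas.o-cerebras-wse3"
    return "io.net-ray-cluster"


# Usage: record traffic, then route.
record_route("meta-llama/Llama-3.1-70B-Instruct")
print(select_backend("meta-llama/Llama-3.1-70B-Instruct"))  # "io.net-ray-cluster" until the model becomes hot
```

In practice the threshold would likely be driven by cost and capacity data rather than a fixed constant, but the shape of the decision, usage statistics feeding a two-tier backend choice, matches the strategy described in the "How" items.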