O.RI IA [ INFERENCE AWARENESS ]



Inference Latency Awareness

Overview

The O.RI Inference Awareness framework offers a sophisticated approach to managing inference latency, cost, and security in AI model deployment. It optimizes inference latency through advanced routing: the IOG Advanced Graph Path Algorithm computes routes, and the geographically closest model endpoint is preferred. The system dynamically chooses between stronger and weaker models based on query complexity and domain specificity, while also weighing computational cost. To improve routing quality, the router model is trained on human preference data and on additional data produced by augmentation techniques.

A key feature is the security-aware routing mechanism, which matches the security level required by the end user to an appropriate inference endpoint. The framework offers a tiered security structure, ranging from open-source compute options to highly secure, confidential-computing environments. Advanced security features include support for Confidential Computing, hardware-based Trusted Execution Environments (TEEs), secure boot, remote attestation, and encrypted communication; together these ensure data confidentiality and integrity throughout the inference process. This comprehensive approach allows flexible, efficient, and secure AI model deployment tailored to diverse user needs and security requirements.
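As an illustration only, the sketch below shows how geographic proximity, cost, and a security floor might be combined into a single routing decision. The `InferenceEndpoint` fields and the `route_query` helper are hypothetical assumptions, not the framework's actual API, and the production system uses the IOG Advanced Graph Path Algorithm rather than this single scoring pass:

```python
from dataclasses import dataclass

@dataclass
class InferenceEndpoint:
    name: str
    distance_km: float         # proxy for geographic/network latency
    cost_per_1k_tokens: float
    security_level: int        # 1 = open source ... 4 = TEE/confidential

def route_query(endpoints, required_security, latency_weight=1.0, cost_weight=1.0):
    """Pick the closest/cheapest endpoint that satisfies the security floor."""
    eligible = [e for e in endpoints if e.security_level >= required_security]
    if not eligible:
        raise RuntimeError("no endpoint meets the required security level")
    return min(
        eligible,
        key=lambda e: latency_weight * e.distance_km + cost_weight * e.cost_per_1k_tokens,
    )

endpoints = [
    InferenceEndpoint("eu-open", 120.0, 0.10, 1),
    InferenceEndpoint("us-tee", 6200.0, 0.90, 4),
    InferenceEndpoint("eu-enterprise", 300.0, 0.40, 2),
]
print(route_query(endpoints, required_security=2).name)  # eu-enterprise
```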

Cost and Model Management

  • Cost Awareness: Assesses the cost of using each LLM/model, considering both computational resources and access costs.
  • Model Selection: Dynamically selects between a stronger and a weaker LLM/model based on the query's complexity or domain specificity, as sketched below.
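
A minimal sketch of complexity-based selection, under stated assumptions: the word-length heuristic, the threshold, and the model names "strong-llm"/"weak-llm" are all illustrative; a production router would use a trained classifier instead.

```python
def estimate_complexity(query: str) -> float:
    """Toy complexity score: longer queries with rarer terms score higher."""
    words = query.split()
    long_words = sum(1 for w in words if len(w) > 8)
    return len(words) / 50.0 + long_words / 10.0

def select_model(query: str, threshold: float = 0.5) -> str:
    """Send complex or domain-specific queries to the stronger, costlier model."""
    return "strong-llm" if estimate_complexity(query) >= threshold else "weak-llm"

print(select_model("What is 2 + 2?"))                           # weak-llm
print(select_model("Derive the electromagnetic stress-energy "
                   "tensor from the Lagrangian formulation."))  # strong-llm
```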

Enhancing Router Model Performance

    1. Preference Data: Leverages human preference data to train the router model, enabling informed model selection.
    2. Data Augmentation: Employs techniques like paraphrasing or text noising to generate additional training data, improving generalization to new queries (see the sketch after this list).
    3. Cost Evaluation: Assesses the cost of using each LLM/model, considering computational resources and access costs.
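
A minimal sketch of the text-noising idea applied to preference data. The `noise_text` and `augment_preferences` helpers and the (query, preferred_model) pair format are assumptions for illustration; LLM-based paraphrasing would be a heavier-weight alternative to word dropout.

```python
import random

def noise_text(query: str, drop_prob: float = 0.15, seed: int = 0) -> str:
    """Text noising: randomly drop words so the router learns to
    generalize beyond exact phrasings."""
    rng = random.Random(seed)
    kept = [w for w in query.split() if rng.random() > drop_prob]
    return " ".join(kept) if kept else query

def augment_preferences(preference_pairs, copies: int = 3):
    """Expand (query, preferred_model) pairs with noised variants."""
    augmented = list(preference_pairs)
    for query, preferred in preference_pairs:
        for i in range(copies):
            augmented.append((noise_text(query, seed=i), preferred))
    return augmented

pairs = [("summarize this legal contract", "strong-llm")]
for q, m in augment_preferences(pairs):
    print(f"{q!r} -> {m}")
```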

Security Levels in Inference Endpoints

  • Security Levels of Inference Endpoints' Deployed Hardware: Matches the security level required by the end user to the hardware on which the endpoint runs.
  • Routes only to model endpoints that meet the required security level.
  • Can route to more secure but more expensive endpoints, or to cheaper but less secure ones, depending on the caller's requirements (see the sketch below).
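
A minimal sketch of the eligibility filter, assuming endpoints are plain dicts with a numeric `security_level`; the `allow_downgrade` flag is a hypothetical way to model a caller explicitly accepting a cheaper, less secure option.

```python
def eligible_endpoints(endpoints, required_level, allow_downgrade=False):
    """Filter endpoints by security level.

    endpoints: dicts such as {"name": ..., "security_level": 1..4, "cost": ...}.
    allow_downgrade lets a caller explicitly accept endpoints one tier below
    the requirement (cheaper, but less secure) when the use case permits.
    """
    floor = required_level - 1 if allow_downgrade else required_level
    return [e for e in endpoints if e["security_level"] >= floor]

fleet = [
    {"name": "open-gpu", "security_level": 1, "cost": 0.1},
    {"name": "soc2-dc",  "security_level": 3, "cost": 0.5},
    {"name": "tee-dc",   "security_level": 4, "cost": 0.9},
]
print([e["name"] for e in eligible_endpoints(fleet, required_level=4)])
# ['tee-dc']
print([e["name"] for e in eligible_endpoints(fleet, 4, allow_downgrade=True)])
# ['soc2-dc', 'tee-dc']
```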

Security Cost Tiers

    1. Open Source Compute ($)
        a. Consumer-environment GPU: “high-risk” / low price
    2. Enterprise Compute ($)
        a. Enterprise-environment GPU, datacenter hosted
    3. Enterprise+ Compute ($$)
        a. Datacenter-hosted GPUs with SOC 2/ISO compliance
    4. Enterprise TEE / Confidential Computing Only Compute ($$$)
        a. Datacenter-hosted GPUs with SOC 2/ISO compliance, running only on TEE hardware
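Because the tiers are strictly ordered, they map naturally to an ordered enumeration. The `SecurityTier` names below are illustrative assumptions, not identifiers from the framework:

```python
from enum import IntEnum

class SecurityTier(IntEnum):
    """Ordered tiers; higher values imply stronger guarantees and higher cost."""
    OPEN_SOURCE = 1      # $    consumer-environment GPU, "high-risk"
    ENTERPRISE = 2       # $    enterprise/datacenter-hosted GPU
    ENTERPRISE_PLUS = 3  # $$   datacenter GPUs with SOC 2 / ISO compliance
    ENTERPRISE_TEE = 4   # $$$  SOC 2 / ISO datacenter GPUs, TEE hardware only

# IntEnum is ordered, so a security-floor check is a plain comparison:
endpoint_tier = SecurityTier.ENTERPRISE_PLUS
required = SecurityTier.ENTERPRISE
print(endpoint_tier >= required)  # True
```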

The security level of an inference endpoint's deployed hardware is crucial for maintaining data confidentiality and integrity. The framework therefore incorporates a security-aware routing mechanism that matches the end user's security requirements to an appropriately provisioned endpoint.

Framework Security Features

  • Confidential Computing Support: The framework checks whether the inference endpoint's deployed hardware supports Confidential Computing.
  • Hardware-Based Security Solutions: The framework evaluates the presence and effectiveness of hardware-based security solutions (see the capability-check sketch after this list), including:
      • Hardware-based TEE: The H100's TEE provides a secure environment for sensitive data and applications, protecting them from unauthorized access.
      • Secure Boot: The H100's secure boot process ensures that the chip's firmware and software are authentic and have not been tampered with.
      • Remote Attestation: Remote attestation allows verification of the chip's identity and the integrity of its firmware and software.
      • Encrypted Communication: Data transmitted between the H100 and other devices is encrypted, protecting it from eavesdropping.
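
A minimal sketch of how a router might gate endpoints on these capabilities before admitting them to the confidential tier. The feature names and the `is_confidential_ready` helper are assumptions for illustration; in practice the attestation evidence would be cryptographically verified rather than self-reported.

```python
REQUIRED_CC_FEATURES = {"tee", "secure_boot", "remote_attestation",
                        "encrypted_channel"}

def is_confidential_ready(advertised: set) -> bool:
    """Admit an endpoint to the confidential tier only if it advertises
    every required hardware feature. A real deployment would verify a
    signed attestation report rather than trust a self-reported set."""
    return REQUIRED_CC_FEATURES <= advertised

print(is_confidential_ready({"tee", "secure_boot"}))                  # False
print(is_confidential_ready({"tee", "secure_boot", "remote_attestation",
                             "encrypted_channel"}))                   # True
```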