The O.RI Inference Awareness framework offers a sophisticated approach to managing inference latency, cost, and security in AI model deployment. It optimizes latency through advanced routing algorithms, using the IOG Advanced Graph Path Algorithm and by selecting the geographically closest model endpoint. The system dynamically chooses between stronger and weaker models based on query complexity and domain specificity, while also weighing computational cost. To improve routing decisions, the router model is trained on human preference data, expanded through data augmentation techniques. A key feature is the security-aware routing mechanism, which matches the security level required by the end user to an appropriate inference endpoint. The framework offers a tiered security structure, ranging from low-cost consumer compute options to highly secure, confidential computing environments. Advanced security features include support for Confidential Computing, hardware-based Trusted Execution Environments (TEEs), secure boot, remote attestation, and encrypted communication, ensuring data confidentiality and integrity throughout the inference process. Together, these capabilities allow flexible, efficient, and secure AI model deployment tailored to diverse user needs and security requirements.
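To illustrate the latency dimension, below is a minimal sketch of latency-aware endpoint selection. The internals of the IOG Advanced Graph Path Algorithm are not specified here, so the sketch simply picks the endpoint with the lowest measured round-trip time; the `Endpoint` structure and `pick_lowest_latency` helper are hypothetical names, not part of the framework's API.

```python
from dataclasses import dataclass

@dataclass
class Endpoint:
    """A candidate inference endpoint (hypothetical structure)."""
    name: str
    region: str
    rtt_ms: float  # measured round-trip latency from the client

def pick_lowest_latency(endpoints: list[Endpoint]) -> Endpoint:
    """Return the endpoint with the lowest measured round-trip latency.

    Stands in for the graph-path routing described above, which would
    additionally account for the network topology between hops.
    """
    return min(endpoints, key=lambda e: e.rtt_ms)

endpoints = [
    Endpoint("us-east-llm", "us-east-1", rtt_ms=18.0),
    Endpoint("eu-west-llm", "eu-west-1", rtt_ms=92.0),
]
print(pick_lowest_latency(endpoints).name)  # -> us-east-llm
```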
- Dynamic Model Selection: Dynamically selects between a stronger and a weaker LLM based on the query's complexity or domain specificity (a minimal sketch follows this list).
  1. Leverages human preference data to train the router model, enabling informed model selection.
  2. Employs augmentation techniques such as paraphrasing and text noising to generate additional training data, improving generalization to new queries.
  3. Assesses the cost of using each LLM, considering computational resources and access costs.
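As a rough illustration of how selection and cost assessment could combine, the sketch below routes a query to the cheaper, weaker model unless a complexity score crosses a threshold. The `complexity_score` function is a crude length-based placeholder for the trained router model described above, and all model names and prices are made up.

```python
from dataclasses import dataclass

@dataclass
class ModelOption:
    """A routable model with its (illustrative) usage cost."""
    name: str
    cost_per_1k_tokens: float

STRONG = ModelOption("strong-llm", cost_per_1k_tokens=0.060)
WEAK = ModelOption("weak-llm", cost_per_1k_tokens=0.002)

def complexity_score(query: str) -> float:
    """Placeholder for the trained router's score in [0, 1].

    In the framework this model is trained on human preference data and
    augmented with paraphrased/noised queries; here a crude length-based
    proxy is used purely for illustration.
    """
    return min(len(query.split()) / 100.0, 1.0)

def route(query: str, threshold: float = 0.5) -> ModelOption:
    """Send complex queries to the strong model, the rest to the cheap one."""
    return STRONG if complexity_score(query) >= threshold else WEAK

print(route("What is 2 + 2?").name)                         # -> weak-llm
print(route("Prove the following theorem ... " * 20).name)  # -> strong-llm
```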
- Security-Aware Routing: Ensures that the security level required by the end user is met.
  - Routes only to model endpoints that provide the necessary security level.
  - Depending on requirements, a query may be routed to a more secure but more expensive endpoint, or to a cheaper, less secure one (see the sketch after the tier list below).
  Security tiers:
  1. Consumer environment GPU ("high-risk" / low price)
  2. Enterprise environment GPU, datacenter hosted
  3. Datacenter-hosted GPUs with SOC 2/ISO certification
  4. Datacenter-hosted GPUs with SOC 2/ISO certification, running only on TEE hardware
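A minimal sketch of the tier-matching rule, assuming a hypothetical `SecurityTier` enum that mirrors the four levels above: an endpoint is eligible only if its tier meets or exceeds the tier the user requires, and ties are broken by price.

```python
from dataclasses import dataclass
from enum import IntEnum

class SecurityTier(IntEnum):
    """The four tiers above, ordered from least to most secure."""
    CONSUMER_GPU = 1          # "high-risk" / low price
    ENTERPRISE_GPU = 2        # enterprise GPU / datacenter hosted
    CERTIFIED_DATACENTER = 3  # datacenter with SOC 2 / ISO certification
    CERTIFIED_TEE = 4         # SOC 2 / ISO certified, TEE hardware only

@dataclass
class SecureEndpoint:
    name: str
    tier: SecurityTier
    price_per_hour: float  # illustrative units

def eligible_endpoints(endpoints: list[SecureEndpoint],
                       required: SecurityTier) -> list[SecureEndpoint]:
    """Keep only endpoints that meet or exceed the required tier,
    cheapest first, so the router never downgrades security."""
    ok = [e for e in endpoints if e.tier >= required]
    return sorted(ok, key=lambda e: e.price_per_hour)

fleet = [
    SecureEndpoint("community-rig", SecurityTier.CONSUMER_GPU, 0.40),
    SecureEndpoint("enterprise-a100", SecurityTier.ENTERPRISE_GPU, 2.10),
    SecureEndpoint("tee-h100", SecurityTier.CERTIFIED_TEE, 5.80),
]
# A user requiring tier 3 can only be served by the TEE endpoint here.
print([e.name for e in eligible_endpoints(fleet, SecurityTier.CERTIFIED_DATACENTER)])
# -> ['tee-h100']
```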
The security level of an inference endpoint's deployed hardware is crucial for maintaining data confidentiality and integrity. The framework's security-aware routing mechanism matches each query to an endpoint that satisfies the end user's security requirements.
- Confidential Computing Support: The framework checks whether the inference endpoint's deployed hardware supports Confidential Computing.
- Hardware-Based Security Evaluation: The framework evaluates the presence and effectiveness of hardware-based security solutions (a sketch of the attestation gate follows this list), including:
  - Trusted Execution Environment (TEE): The H100 chip's TEE provides a secure environment for sensitive data and applications, protecting them from unauthorized access.
  - Secure Boot: The H100 chip's secure boot process ensures that its firmware and software are authentic and have not been tampered with.
  - Remote Attestation: The H100 chip's remote attestation feature allows verification of the chip's identity and the integrity of its firmware and software.
  - Encrypted Communication: The H100 chip's encrypted communication feature ensures that data transmitted between the chip and other devices is protected from eavesdropping.
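To make the attestation step concrete, here is a hedged sketch of gating dispatch on a remote-attestation check. The `verify_attestation` function is a placeholder: real H100 attestation involves cryptographically verifying a signed report against NVIDIA's root of trust, which is beyond the scope of this sketch, and all names and values below are illustrative.

```python
from dataclasses import dataclass

# Policy: firmware hashes the operator has approved (dummy value for illustration).
TRUSTED_FIRMWARE_HASHES = {"known-good-firmware-hash"}

@dataclass
class AttestationReport:
    """Simplified stand-in for a hardware attestation report."""
    device_id: str
    firmware_hash: str
    signature: bytes

def verify_attestation(report: AttestationReport) -> bool:
    """Placeholder check. A real implementation must cryptographically
    verify the report's signature against the hardware vendor's root of
    trust before applying the firmware policy below."""
    signature_present = bool(report.signature)  # stand-in for signature verification
    firmware_trusted = report.firmware_hash in TRUSTED_FIRMWARE_HASHES
    return signature_present and firmware_trusted

def dispatch_to_tee(report: AttestationReport, payload: bytes) -> None:
    """Refuse to send data unless the endpoint passed attestation."""
    if not verify_attestation(report):
        raise PermissionError("endpoint failed attestation; refusing to send data")
    # A real router would now establish an encrypted channel bound to the
    # attested TEE before transmitting the payload.
    print(f"dispatching {len(payload)} bytes to attested device {report.device_id}")

report = AttestationReport("gpu-0", "known-good-firmware-hash", b"sig")
dispatch_to_tee(report, b"prompt tokens")
```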