O.RI BME (Benchmark Matching Engine) aims to revolutionize the way we select and utilize AI models for answering user queries. By employing advanced techniques such as knowledge graphs, vector embeddings, and classification models, it can intelligently determine the most suitable benchmark for each specific question. This approach not only ensures high-quality responses but also optimizes resource usage by leveraging faster and more cost-effective models when appropriate. The system maintains an extensive database of benchmark tests and model performance results, which is continuously updated through a dynamic data pipeline in real time, as the open LLM benchmark leaderboard changes every hour. This real-time updating mechanism ensures that the most current and top-performing models are always available in the inference endpoint registry. With an impressive 96% accuracy rate in initial testing, this innovative solution will enhance the efficiency and effectiveness of AI-powered question-answering systems, making them more accessible and reliable for users across various domains.

- Main features:
  - Determines the optimal benchmark for assessing which models can best answer the user query.
  - Efficiently leverages faster and more cost-effective models without compromising quality.
- How we will build it:
  - We will utilize knowledge graphs, vector embeddings, or classification models to identify the best model and its inference endpoints from the registry for each atomic query (note: this is in progress; we have built a classification model that has 96% accuracy in our initial test, see the technical sections below for the methodologies we experimented with).
  - We will maintain a comprehensive list of benchmark tests (IFEval, BBH, MATH, GPQA, MuSR, MMLU, BFCL) and the results of the models we host (note: this is partly done).
  - We will build a data pipeline to dynamically update and host/deploy the top-ranked benchmark model collections in the inference endpoint registry in real time (note: this is partly done).

The approach selects the best LLM model for a new prompt by computing cosine similarity between the prompt's embedding and a dataset of pre-embedded prompts, choosing the model associated with the most similar match.
How it works:
1. A dataset is prepared where each prompt is associated with a specific LLM model, and the prompts have been pre-embedded using a sentence transformer model. This data is stored in a CSV file (`filtered_data_with_embeddings.csv`), which includes columns for both the prompts and their embeddings.
2. The CSV file containing the prompts, their associated models, and embeddings is loaded into a pandas DataFrame. The embeddings are extracted into a separate NumPy array for easier processing.
3. When a new prompt is given, it is embedded using a pre-trained sentence transformer model (`all-MiniLM-L6-v2`).
4. The cosine similarity between the new prompt's embedding and each embedding in the dataset is computed. This measures the similarity between the new prompt and the prompts in the dataset.
5. The index of the highest cosine similarity value is identified, which corresponds to the most similar prompt in the dataset. The model associated with this prompt is selected as the best model to handle the new prompt.
6. The name of the best model is returned and displayed, providing a recommendation for which LLM to use for the new prompt.
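A minimal sketch of this flow, assuming `filtered_data_with_embeddings.csv` stores each embedding as a list-like string in an `embedding` column alongside `prompt` and `model` columns (the actual column names and serialization in the project data may differ):

```python
# Minimal sketch of the embedding-similarity router described above.
import ast

import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load the pre-embedded prompt/model pairs (column names are assumptions).
df = pd.read_csv("filtered_data_with_embeddings.csv")
embeddings = np.array([ast.literal_eval(e) for e in df["embedding"]])

# Embed new prompts with the same sentence transformer used to build the dataset.
encoder = SentenceTransformer("all-MiniLM-L6-v2")


def select_best_model(prompt: str) -> str:
    """Return the model name attached to the most similar stored prompt."""
    query_vec = encoder.encode([prompt])                 # shape (1, dim)
    sims = cosine_similarity(query_vec, embeddings)[0]   # one score per stored prompt
    best_idx = int(np.argmax(sims))
    return df.iloc[best_idx]["model"]


print(select_best_model("Write a Python function that parses a CSV file."))
```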
Drawbacks:
1. The accuracy of the model selection heavily depends on the quality and diversity of the dataset. If the dataset lacks prompts similar to the new ones, the routing might not work well. The embeddings used in the dataset must also be representative of the types of prompts the models will encounter; if they are not, the routing decision might be suboptimal.
2. The model selection is based purely on the similarity of embeddings, without considering other factors such as the complexity of the prompt, specific model strengths, or the required response style. If a model is particularly good at certain tasks but those tasks are underrepresented in the dataset, the approach may not select the best model for them.
3. As the dataset grows, computing cosine similarities between the new prompt's embedding and all dataset embeddings might become computationally expensive, leading to slow response times, especially in real-time applications. The entire dataset must also be loaded into memory, which could be a problem for very large datasets.
4. The method selects only the single most similar prompt, so it does not consider other potentially relevant prompts that might suggest different models. It could be beneficial to consider a weighted average of similarities or the top-N similar prompts to make a more informed decision (a weighted top-N variant is sketched below).
5. The approach might overfit to the prompts present in the dataset, making it less effective for generalizing to new, unseen types of prompts. If the prompt types in the dataset are too narrow, the router might perform poorly on more diverse inputs.
6. The current approach does not adapt to contextual changes or dynamic updates in the dataset or models. If new models are introduced or existing ones are fine-tuned, the dataset needs to be updated manually, which is not optimal for environments with frequent changes.
7. The method does not provide a confidence score or threshold for when the selected model might not be reliable, which could be critical in applications requiring high precision.
These drawbacks suggest that while the approach is functional and straightforward, it could benefit from enhancements such as incorporating more contextual data, improving scalability, and adding mechanisms to handle uncertainty or poor matches.
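As one possible enhancement along the lines of drawbacks 4 and 7, here is a hedged sketch of a similarity-weighted top-k vote with a crude confidence flag, under the same column assumptions as the sketch above; the k value and threshold are illustrative:

```python
# Hedged sketch: aggregate similarity-weighted votes over the k nearest stored
# prompts instead of trusting a single match, and flag low-confidence routings.
import ast
from collections import defaultdict

import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

df = pd.read_csv("filtered_data_with_embeddings.csv")   # assumed columns as before
embeddings = np.array([ast.literal_eval(e) for e in df["embedding"]])
encoder = SentenceTransformer("all-MiniLM-L6-v2")


def select_model_top_k(prompt: str, k: int = 5, min_similarity: float = 0.5):
    sims = cosine_similarity(encoder.encode([prompt]), embeddings)[0]
    top_idx = np.argsort(sims)[-k:]                      # indices of the k closest prompts

    votes = defaultdict(float)
    for i in top_idx:
        votes[df.iloc[i]["model"]] += sims[i]            # similarity-weighted voting

    best_model = max(votes, key=votes.get)
    confident = bool(sims[top_idx].max() >= min_similarity)  # crude reliability flag
    return best_model, confident


print(select_model_top_k("Summarize the attached research paper in three bullet points."))
```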
The TF-IDF + XGBoost approach predicts the best LLM model by vectorizing prompts using TF-IDF and classifying them with XGBoost, but it struggles due to high dimensionality, inadequate feature representation, and poor handling of class imbalance, resulting in low accuracy and ineffective predictions.
Methodology:
- TF-IDF was used to vectorize prompts, transforming text data into numerical representations.
- An XGBoost classifier was trained to predict the best model based on the TF-IDF vectors.
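A hedged sketch of this baseline, reusing the same routing dataset and assuming `prompt` and `model` columns; the vocabulary size and XGBoost settings are illustrative rather than the values used in the experiment:

```python
# Minimal sketch of the TF-IDF + XGBoost baseline.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

df = pd.read_csv("filtered_data_with_embeddings.csv")   # assumed prompt/model columns

# Turn prompts into sparse TF-IDF vectors.
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(df["prompt"])

# Encode target model names as integer class labels.
label_enc = LabelEncoder()
y = label_enc.fit_transform(df["model"])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = XGBClassifier(objective="multi:softprob", eval_metric="mlogloss")
clf.fit(X_train, y_train)

print(classification_report(
    y_test, clf.predict(X_test),
    labels=list(range(len(label_enc.classes_))),
    target_names=label_enc.classes_,
    zero_division=0,
))
```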
Results:
- Accuracy: only 8% overall, which suggests the model is not performing well in distinguishing between classes (models).
- Per-class precision, recall, and F1-score: most classes have near-zero values, indicating the model is failing to make correct predictions across nearly all classes.
- Macro and weighted averages: both are low, reinforcing the poor general performance across all classes.
Why it underperforms:
1. TF-IDF often results in high-dimensional and sparse vectors, which may not capture the semantic relationships between prompts effectively, leading to poor performance in downstream tasks like classification.
2. TF-IDF captures word frequency but not the contextual or semantic meaning of the text, which is crucial when routing prompts to LLMs. It fails to account for nuanced differences that embeddings handle better.
3. XGBoost is a powerful model but can struggle with high-dimensional sparse data, especially when the target classes are numerous and imbalanced, leading to ineffective learning.
4. The data appears highly imbalanced, with some classes having very few samples. XGBoost, without proper handling of class imbalance, might focus more on the majority classes, leading to poor generalization.
5. Unlike embeddings that consider the entire context of the text, TF-IDF ignores word order and context, making it less suitable for understanding complex prompts.
This approach, using TF-IDF with XGBoost, is not suitable for tasks requiring nuanced understanding and classification of prompts for model selection. Moving towards embedding-based methods or more context-aware models (e.g., neural networks, transformer models) would likely provide a significant improvement in performance.
Fine-Tuning BERT and DistilBERT for Accurate LLM Routing: A Multi-Class Classification Approach
The BERT and DistilBERT classifiers attempt to route prompts to LLM models based on text classification but face challenges with class imbalance, fluctuating validation accuracy, and overall low prediction accuracy, demonstrating a need for improved data handling and training strategies.
- Model choice: the approach utilizes transformer-based models, specifically BERT and DistilBERT, known for their strong contextual understanding of language, to classify prompts into categories that correspond to different LLM models.
- Dataset: prompts are labeled with their corresponding target LLM models, creating a multi-class classification problem. This dataset is used to train the classifiers.
- Training:
  - The models are fine-tuned on the labeled dataset to learn patterns and distinctions between different prompt types associated with each LLM model.
  - The training involves minimizing the loss function over several epochs, adjusting the model weights to improve classification accuracy.
- Evaluation:
  - The models are evaluated using metrics such as accuracy, precision, recall, and F1-score, with a focus on how well each class (LLM model) is identified by the classifier.
  - Validation accuracy and training loss are monitored over epochs to assess the model's performance and convergence.
- Prediction: for a new prompt, the classifier predicts the most likely LLM model by analyzing the contextual and semantic features of the text, aiming to match it to the most suitable class based on training.
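A hedged sketch of the fine-tuning setup using the Hugging Face `Trainer`, again assuming a DataFrame with `prompt` and `model` columns; the DistilBERT checkpoint and hyperparameters are illustrative rather than the exact training configuration used here:

```python
# Hedged sketch: fine-tune DistilBERT as a prompt -> target-LLM classifier.
import pandas as pd
from datasets import Dataset
from sklearn.preprocessing import LabelEncoder
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

df = pd.read_csv("filtered_data_with_embeddings.csv")    # assumed prompt/model columns
label_enc = LabelEncoder()
df["label"] = label_enc.fit_transform(df["model"])

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=len(label_enc.classes_)
)

# Tokenize prompts and split into train/validation sets.
dataset = Dataset.from_pandas(df[["prompt", "label"]])
dataset = dataset.map(
    lambda batch: tokenizer(batch["prompt"], truncation=True,
                            padding="max_length", max_length=128),
    batched=True,
)
splits = dataset.train_test_split(test_size=0.2, seed=42)

args = TrainingArguments(
    output_dir="llm-router-distilbert",
    num_train_epochs=3,
    per_device_train_batch_size=16,
)
trainer = Trainer(model=model, args=args,
                  train_dataset=splits["train"], eval_dataset=splits["test"])
trainer.train()
```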
Findings:
1. The model achieved an accuracy of only 10%, indicating that it struggles significantly with correctly classifying the prompts across multiple classes.
2. Some classes, like "RWKV-4-Raven-14B" and "oasst-pythia-12b", perform well, with relatively high precision and recall, while many other classes have near-zero values, indicating a failure to learn meaningful distinctions.
3. The validation accuracy graph shows fluctuations across epochs, suggesting potential instability in the model's training process or overfitting issues.
4. Several classes have a very low number of samples, leading to poor model performance for these underrepresented classes, which is evident in the near-zero scores for many categories.
5. Despite using transformer-based models, the complexity and variability of LLM-model-specific prompts seem to pose challenges beyond simple contextual comprehension, highlighting the need for more sophisticated fine-tuning or data augmentation strategies.
Approach 2.4
The approach predicts the MMLU score needed to solve a prompt by training a DistilBERT-based regression model on labeled MMLU scores, optimizing for prompt-specific performance.
1. A DistilBERT-based regressor is fine-tuned to predict MMLU scores, which indicate the level of difficulty required to solve a given prompt. The model consists of a DistilBERT encoder, a dropout layer, and a linear regressor to output the score.
2. The data includes prompts and corresponding MMLU scores, normalized for consistency. Prompts are tokenized using the DistilBERT tokenizer and split into training and validation sets.
3. A custom dataset class and dataloaders are defined for efficient batching. The training involves optimizing the model using AdamW, with loss monitored using MSELoss. Learning rate adjustments are handled by a scheduler that reduces the learning rate when validation performance plateaus.
4. The model's performance is tracked using RMSE on the validation set, with logs recorded in TensorBoard. Training loss and validation RMSE are plotted over epochs to assess model convergence and generalization.
5. A prediction function is implemented to estimate the MMLU score for new prompts, using the trained model to determine the complexity level required to solve the prompt.
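A hedged sketch of the regressor architecture and a single training step, matching the components named above (DistilBERT encoder, dropout, linear head, MSELoss, AdamW, plateau scheduler); dataset handling, batching, and TensorBoard logging are omitted, and the target value shown is a placeholder:

```python
# Hedged sketch of the DistilBERT-based MMLU-score regressor.
import torch
import torch.nn as nn
from transformers import DistilBertModel, DistilBertTokenizerFast


class MMLURegressor(nn.Module):
    def __init__(self, dropout: float = 0.2):
        super().__init__()
        self.encoder = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.dropout = nn.Dropout(dropout)
        self.regressor = nn.Linear(self.encoder.config.dim, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        cls = self.dropout(hidden[:, 0])          # use the [CLS] token representation
        return self.regressor(cls).squeeze(-1)    # one normalized MMLU score per prompt


tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = MMLURegressor()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", patience=2)
loss_fn = nn.MSELoss()

# One illustrative training step on a dummy batch.
batch = tokenizer(["Explain quantum entanglement."], return_tensors="pt",
                  truncation=True, padding=True)
target = torch.tensor([0.8])                      # placeholder normalized MMLU score
optimizer.zero_grad()
loss = loss_fn(model(batch["input_ids"], batch["attention_mask"]), target)
loss.backward()
optimizer.step()
scheduler.step(loss.item())                       # in practice, step on validation RMSE
```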
Drawbacks:
1. The accuracy of predicted MMLU scores depends heavily on the quality and representativeness of the training data. Inconsistent or biased data can lead to poor predictions.
2. Fine-tuning transformer models like DistilBERT is computationally expensive, especially when scaling to large datasets or deploying in real-time environments.
3. The model might overfit to specific prompt types or patterns seen during training, which can reduce its generalizability to unseen or diverse prompts.
4. The current approach may struggle to scale effectively across significantly larger datasets or higher-dimensional prompts without further optimization or model adjustments.
5. Regression models based on transformer outputs can be challenging to interpret, making it difficult to understand why certain MMLU scores are predicted, especially when compared to simpler heuristic methods.
MMLU Score breakdown
Normalized MMLU scores were effectively predicted using separate XGBoost and DistilBERT models.
1. MMLU scores were normalized to ensure consistent scaling, enhancing the performance of both models.
2. Two regressors were trained:
   - XGBoost: a tree-based model trained on tabular data with normalized scores, known for its speed and performance in structured-data regression tasks.
   - DistilBERT: a transformer-based regression model fine-tuned on the prompts to predict normalized MMLU scores, leveraging deep contextual understanding.
3. Both models were trained separately and evaluated on their ability to predict the normalized MMLU scores, with DistilBERT capturing text semantics and XGBoost handling numerical features efficiently.
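For the XGBoost side, a hedged sketch of the regression setup; the file name, column names, and the TF-IDF features standing in for the tabular inputs are assumptions, since the exact feature set is not specified here:

```python
# Hedged sketch of the XGBoost regressor for normalized MMLU scores.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

df = pd.read_csv("prompts_with_mmlu_scores.csv")        # hypothetical file/column names
X = TfidfVectorizer(max_features=2000).fit_transform(df["prompt"])
y = df["mmlu_score_normalized"].values                  # scores already scaled

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

reg = XGBRegressor(n_estimators=300, max_depth=6, learning_rate=0.1)
reg.fit(X_train, y_train)

rmse = np.sqrt(mean_squared_error(y_val, reg.predict(X_val)))
print(f"Validation RMSE: {rmse:.4f}")
```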
Drawbacks:
1. Each model has strengths depending on the type of input data (DistilBERT on text, XGBoost on numerical/tabular data), limiting their independent versatility.
2. Managing two distinct models requires separate optimization, tuning, and maintenance efforts.
3. Fine-tuning DistilBERT can be resource-intensive, and XGBoost may require extensive hyperparameter tuning for optimal performance.
Zero-shot classification using BART predicts task categories and selects the most suitable model based on the task description.
1. The task data is organized into a structured format, where each category contains tasks with associated benchmark scores and model types, helping to guide model selection.
2. The `facebook/bart-large-mnli` model is employed to classify task descriptions into predefined categories without specific task training, using a zero-shot classification pipeline.
3. For each task description, the model predicts the category with the highest confidence score. Within the predicted category, the approach attempts to identify the specific task, matching it to its benchmark score and the optimal model type designated for that task.
4. The identified task details guide the selection of the best model type for the job, effectively routing tasks to the most suitable model based on the description and benchmark data.
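A minimal sketch of this zero-shot routing step with `facebook/bart-large-mnli`; the category names and the benchmark/model-type table are illustrative stand-ins for the project's predefined task data:

```python
# Minimal sketch of zero-shot category prediction followed by model-type lookup.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Hypothetical task data: category -> (benchmark, preferred model type).
task_data = {
    "code generation": {"benchmark": "BFCL", "model_type": "code-specialized LLM"},
    "mathematical reasoning": {"benchmark": "MATH", "model_type": "reasoning-tuned LLM"},
    "general knowledge QA": {"benchmark": "MMLU", "model_type": "general-purpose LLM"},
}


def route(task_description: str):
    result = classifier(task_description, candidate_labels=list(task_data.keys()))
    best_category = result["labels"][0]               # highest-confidence category
    return best_category, task_data[best_category]


print(route("Write a SQL query that joins two tables and filters by date."))
```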
This approach leverages zero-shot classification to not only categorize tasks but also directly select the best model for each task, streamlining the assignment process and enhancing task-model alignment, although it depends heavily on the predefined categories and task descriptions for accuracy.
The approach extends zero-shot classification by evaluating multiple models, including MiniLM, ALBERT, DeBERTa, XLM-RoBERTa, and ELECTRA, to select the best-performing model for each task.
1. Task data is maintained in a dictionary with categories containing tasks, benchmark scores, and suggested model types.
2. Using the `facebook/bart-large-mnli` model, task descriptions are classified into predefined categories, and the most relevant category is predicted based on confidence scores.
3. For each task, the predicted category is used to find the most specific matching task and its associated details. The approach identifies the best model type for the task from predefined data, streamlining model assignment.
4. The approach is expanded to include multiple classifiers: MiniLM, ALBERT, DeBERTa, XLM-RoBERTa, and ELECTRA. Each classifier is tested on the task description, and their performances are compared to determine which model best suits the task. The best classifier is selected based on accuracy, confidence scores, and alignment with the task's benchmark requirements.
5. The task is routed to the model with the highest performance, ensuring that the task description aligns with the strengths of the selected classifier. This dynamic selection process allows the system to choose the most appropriate model for each job rather than sticking to a single static model.
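A hedged sketch of comparing several zero-shot classifiers on the same description and keeping the most confident one. Only `facebook/bart-large-mnli` is named above; the other entries are placeholders for whichever NLI-finetuned MiniLM/ALBERT/DeBERTa/XLM-RoBERTa/ELECTRA checkpoints are hosted, and a full selection would also weigh accuracy and benchmark alignment as described in step 4:

```python
# Hedged sketch: run several zero-shot classifiers and keep the most confident one.
from transformers import pipeline

candidate_checkpoints = [
    "facebook/bart-large-mnli",
    # "<MiniLM NLI checkpoint>",        # placeholders: replace with hosted checkpoints
    # "<DeBERTa NLI checkpoint>",
    # "<XLM-RoBERTa NLI checkpoint>",
]


def best_classifier(task_description: str, categories: list[str]):
    scores = {}
    for checkpoint in candidate_checkpoints:
        clf = pipeline("zero-shot-classification", model=checkpoint)
        result = clf(task_description, candidate_labels=categories)
        scores[checkpoint] = (result["labels"][0], result["scores"][0])
    # Pick the checkpoint whose top prediction has the highest confidence.
    winner = max(scores, key=lambda ckpt: scores[ckpt][1])
    return winner, scores[winner]


print(best_classifier(
    "Summarize this legal contract in plain language.",
    ["summarization", "code generation", "mathematical reasoning"],
))
```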
By extending the zero-shot classification approach to include multiple models, this method allows for a more nuanced selection of the optimal model for each task, improving task-to-model alignment. The flexible architecture adapts to task complexities, though it requires thorough benchmarking of each classifier's performance across various tasks for reliable model assignment.
- The BERT model was trained on a synthetic dataset to classify various NLP tasks, including `Emotion_Recognition_in_Text`, `Code_Generation`, `Creative_Writing`, and more.
- The training involved fine-tuning the BERT model on the task-specific data, with training and validation losses monitored throughout.
- The evaluation was performed using metrics such as precision, recall, and F1-score, along with a confusion matrix to identify class-specific performance.
- BERT performed exceptionally well on tasks like `Code_Generation`, `Creative_Writing`, and `Fact-Checking`, with high precision and recall scores. This indicates BERT's strong contextual understanding and ability to handle tasks that require deep language comprehension.
- The model struggled with `Emotion_Recognition_in_Text`, achieving poor scores in this class, likely due to insufficient data or the complexity of recognizing nuanced emotional content. Additionally, the slight underperformance in `Fact-Based_QA` suggests a need for more focused fine-tuning or data augmentation in specific categories.
- BERT achieved an overall accuracy of 94%, demonstrating its capacity for handling diverse NLP tasks but highlighting areas that could benefit from further refinement.
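A hedged sketch of the per-class evaluation described above, assuming lists of true and predicted task labels from a held-out split; the label subset and example predictions are illustrative only:

```python
# Hedged sketch: per-class precision/recall/F1 and confusion matrix for the task classifier.
from sklearn.metrics import classification_report, confusion_matrix

task_labels = [
    "Emotion_Recognition_in_Text", "Code_Generation", "Creative_Writing",
    "Fact-Checking", "Fact-Based_QA",                  # subset of classes, for illustration
]

# y_true / y_pred would come from the held-out split of the synthetic dataset.
y_true = ["Code_Generation", "Creative_Writing", "Fact-Checking"]
y_pred = ["Code_Generation", "Creative_Writing", "Emotion_Recognition_in_Text"]

print(classification_report(y_true, y_pred, labels=task_labels, zero_division=0))
print(confusion_matrix(y_true, y_pred, labels=task_labels))
```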
- The XGBoost model was trained using the same synthetic dataset to classify the same set of NLP tasks. XGBoost, being a gradient-boosting decision tree model, relies on feature-based representations rather than contextual language understanding.
- The evaluation focused on precision, recall, and F1-score, supported by a confusion matrix to highlight class-specific performance.
- XGBoost showed high performance across most classes, achieving perfect scores in categories like `Code_Generation`, `Dialogue_Generation`, and `Fact-Checking`. The model's structured approach allowed it to efficiently handle tasks with clear patterns or features.
- Similar to BERT, XGBoost struggled with `Emotion_Recognition_in_Text`, indicating this class's inherent complexity and the possible need for additional data or feature engineering. The model also showed slight misclassifications in `Creative_Writing`, suggesting limitations in capturing creative or nuanced content compared to BERT.
- XGBoost achieved a slightly higher accuracy of 96%, indicating robust performance, particularly in classes that align well with structured feature patterns.
BERT results:
- Accuracy: 0.94
- Macro Avg: Precision 0.88, Recall 0.87, F1-Score 0.87
- Weighted Avg: Precision 0.94, Recall 0.94, F1-Score 0.94

XGBoost results:
- Accuracy: 0.96
- Macro Avg: Precision 0.89, Recall 0.90, F1-Score 0.89
- Weighted Avg: Precision 0.95, Recall 0.96, F1-Score 0.95

(Per-class breakdowns for `Emotion_Recognition_in_Text`, `Named_Entity_Recognition_(NER)`, and `Paraphrasing_and_Rewriting`, along with the training and validation graphs for both models, were shown here.)
Both models demonstrated strong performance with specific strengths and some areas for improvement, particularly in classes with limited data or nuanced understanding needs.