Supplementary code for the *Build a Large Language Model From Scratch* book by Sebastian Raschka. Code repository: https://github.com/rasbt/LLMs-from-scratch
Score Correlation Analysis#
This notebook analyzes the correlation between the scores produced by the different evaluation methods (GPT-4 and Llama 3 8B as judges).
import json

# Load the scores assigned by the GPT-4 judge to model 1's responses
with open("gpt4-model-1-response.json", "r") as file:
    gpt4_model_1 = json.load(file)

# Load the scores assigned by the Llama 3 8B judge to model 1's responses
with open("llama3-8b-model-1-response.json", "r") as file:
    llama3_8b_model_1 = json.load(file)
GPT-4 vs Llama 3 8B#
import numpy as np
import matplotlib.pyplot as plt
list1, list2 = gpt4_model_1, llama3_8b_model_1

# Scatter plot of the two judges' scores plus a least-squares trend line
plt.scatter(list1, list2)
plt.plot(
    np.unique(list1),
    np.poly1d(np.polyfit(list1, list2, 1))(np.unique(list1))
)
plt.xlabel("GPT-4")
plt.ylabel("Llama3 8B")
plt.show()
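As an aside, the slope and intercept of the fitted trend line can also be inspected directly; the sketch below reuses the same first-degree `np.polyfit` call as above:

```python
# np.polyfit with degree 1 returns the coefficients [slope, intercept] of the least-squares line
slope, intercept = np.polyfit(list1, list2, 1)
print(f"Trend line: Llama3 8B score ≈ {slope:.3f} * GPT-4 score + {intercept:.3f}")
```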
Correlation Coefficients#
import pandas as pd
from scipy.stats import spearmanr, kendalltau

# Pearson (linear) correlation plus the rank-based Spearman and Kendall tau coefficients
pearson_correlation = np.corrcoef(list1, list2)[0, 1]
spearman_correlation, _ = spearmanr(list1, list2)
kendall_tau_correlation, _ = kendalltau(list1, list2)

correlation_table = pd.DataFrame({
    "Pearson": [pearson_correlation],
    "Spearman": [spearman_correlation],
    "Kendall Tau": [kendall_tau_correlation]
}, index=['Results'])

correlation_table
| | Pearson | Spearman | Kendall Tau |
|---|---|---|---|
| Results | 0.80489 | 0.698406 | 0.57292 |
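As a quick cross-check, the same three coefficients can also be computed with pandas' built-in `DataFrame.corr`, which supports the `'pearson'`, `'spearman'`, and `'kendall'` methods:

```python
df = pd.DataFrame({"GPT-4": list1, "Llama3 8B": list2})

for method in ("pearson", "spearman", "kendall"):
    # .corr returns the full 2x2 correlation matrix; the off-diagonal entry is the pairwise coefficient
    print(f"{method}: {df.corr(method=method).iloc[0, 1]:.5f}")
```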
For comparison, the tables below list the correlation coefficients reported in the Prometheus 2 paper by Kim et al., 2024 (https://arxiv.org/abs/2405.01535), which are all in the same ballpark as the ones obtained for Llama 3 8B above.
Note that Prometheus 2 is a model specifically finetuned for LLM rating and evaluation.
Pearson#
| Evaluator LM | VICUNA Bench | VICUNA Bench | MT Bench | MT Bench | FLASK | FLASK | FLASK | Feedback Bench |
|---|---|---|---|---|---|---|---|---|
| | GPT-4-1106 | Claude-3-Opus | GPT-4-1106 | Claude-3-Opus | GPT-4-1106 | Claude-3-Opus | Humans | GPT-4-0613 |
| LLAMA2-CHAT 7B | 0.205 | 0.243 | 0.036 | 0.055 | 0.317 | 0.256 | 0.299 | 0.523 |
| LLAMA2-CHAT 13B | 0.185 | 0.141 | -0.042 | -0.002 | 0.239 | 0.247 | 0.263 | 0.545 |
| LLAMA2-CHAT 70B | 0.350 | 0.463 | 0.178 | 0.228 | 0.388 | 0.402 | 0.317 | 0.592 |
| MISTRAL-INSTRUCT-7B | 0.486 | 0.561 | 0.284 | 0.396 | 0.448 | 0.437 | 0.377 | 0.586 |
| MIXTRAL-INSTRUCT-8X7B | 0.566 | 0.579 | 0.551 | 0.539 | 0.483 | 0.495 | 0.420 | 0.673 |
| PROMETHEUS-7B | 0.484 | 0.528 | 0.378 | 0.382 | 0.352 | 0.331 | 0.348 | 0.847 |
| PROMETHEUS-13B | 0.492 | 0.534 | 0.404 | 0.477 | 0.462 | 0.470 | 0.449 | 0.860 |
| AUTO-J (13B) | 0.351 | 0.262 | 0.432 | 0.375 | 0.430 | 0.370 | 0.473 | 0.637 |
| PROMETHEUS-2-7B | 0.642 | 0.610 | 0.543 | 0.554 | 0.645 | 0.578 | 0.544 | 0.878 |
| PROMETHEUS-2-8X7B | 0.685 | 0.635 | 0.665 | 0.614 | 0.659 | 0.626 | 0.555 | 0.898 |
| GPT-3.5-TURBO-0613 | 0.335 | 0.349 | 0.183 | 0.194 | 0.437 | 0.396 | 0.450 | 0.594 |
| GPT-4-1106 | / | 0.694 | / | 0.717 | / | 0.736 | 0.679 | 0.753 |
| CLAUDE-3-OPUS | 0.694 | / | 0.717 | / | 0.736 | / | 0.573 | 0.788 |
Spearman#
| Evaluator LM | VICUNA Bench | VICUNA Bench | MT Bench | MT Bench | FLASK | FLASK | FLASK | Feedback Bench |
|---|---|---|---|---|---|---|---|---|
| | GPT-4-1106 | Claude-3-Opus | GPT-4-1106 | Claude-3-Opus | GPT-4-1106 | Claude-3-Opus | Humans | GPT-4-0613 |
| LLAMA2-CHAT 7B | 0.236 | 0.255 | 0.084 | 0.089 | 0.301 | 0.244 | 0.279 | 0.511 |
| LLAMA2-CHAT 13B | 0.178 | 0.179 | -0.025 | 0.044 | 0.206 | 0.222 | 0.224 | 0.543 |
| LLAMA2-CHAT 70B | 0.348 | 0.466 | 0.197 | 0.252 | 0.391 | 0.389 | 0.298 | 0.585 |
| MISTRAL-INSTRUCT-7B | 0.389 | 0.480 | 0.266 | 0.358 | 0.499 | 0.478 | 0.374 | 0.563 |
| MIXTRAL-INSTRUCT-8X7B | 0.476 | 0.556 | 0.545 | 0.517 | 0.505 | 0.500 | 0.386 | 0.659 |
| PROMETHEUS-7B | 0.508 | 0.528 | 0.385 | 0.349 | 0.367 | 0.326 | 0.317 | 0.876 |
| PROMETHEUS-13B | 0.492 | 0.534 | 0.401 | 0.470 | 0.474 | 0.454 | 0.398 | 0.893 |
| AUTO-J (13B) | 0.337 | 0.297 | 0.408 | 0.365 | 0.402 | 0.358 | 0.408 | 0.623 |
| PROMETHEUS-2-7B | 0.643 | 0.584 | 0.550 | 0.524 | 0.626 | 0.569 | 0.490 | 0.909 |
| PROMETHEUS-2-8X7B | 0.660 | 0.615 | 0.669 | 0.605 | 0.642 | 0.618 | 0.496 | 0.912 |
| GPT-3.5-TURBO-0613 | 0.319 | 0.354 | 0.192 | 0.198 | 0.446 | 0.390 | 0.374 | 0.565 |
| GPT-4-1106 | / | 0.659 | / | 0.721 | / | 0.729 | 0.650 | 0.753 |
| CLAUDE-3-OPUS | 0.659 | / | 0.721 | / | 0.729 | / | 0.567 | 0.784 |
Kendall-Tau#
| Evaluator LM | VICUNA Bench | VICUNA Bench | MT Bench | MT Bench | FLASK | FLASK | FLASK | Feedback Bench |
|---|---|---|---|---|---|---|---|---|
| | GPT-4-1106 | Claude-3-Opus | GPT-4-1106 | Claude-3-Opus | GPT-4-1106 | Claude-3-Opus | Humans | GPT-4-0613 |
| LLAMA2-CHAT 7B | 0.183 | 0.203 | 0.065 | 0.070 | 0.229 | 0.186 | 0.211 | 0.419 |
| LLAMA2-CHAT 13B | 0.145 | 0.146 | -0.019 | 0.037 | 0.160 | 0.174 | 0.174 | 0.453 |
| LLAMA2-CHAT 70B | 0.282 | 0.382 | 0.150 | 0.196 | 0.310 | 0.310 | 0.221 | 0.487 |
| MISTRAL-INSTRUCT-7B | 0.314 | 0.391 | 0.208 | 0.281 | 0.395 | 0.384 | 0.287 | 0.454 |
| MIXTRAL-INSTRUCT-8X7B | 0.395 | 0.468 | 0.433 | 0.419 | 0.410 | 0.408 | 0.304 | 0.551 |
| PROMETHEUS-7B | 0.405 | 0.425 | 0.290 | 0.263 | 0.282 | 0.251 | 0.236 | 0.770 |
| PROMETHEUS-13B | 0.397 | 0.434 | 0.299 | 0.352 | 0.365 | 0.352 | 0.299 | 0.793 |
| AUTO-J (13B) | 0.282 | 0.242 | 0.303 | 0.272 | 0.312 | 0.282 | 0.312 | 0.515 |
| PROMETHEUS-2-7B | 0.515 | 0.478 | 0.458 | 0.421 | 0.500 | 0.454 | 0.376 | 0.773 |
| PROMETHEUS-2-8X7B | 0.559 | 0.515 | 0.535 | 0.483 | 0.526 | 0.507 | 0.388 | 0.800 |
| GPT-3.5-TURBO-0613 | 0.255 | 0.287 | 0.148 | 0.157 | 0.360 | 0.315 | 0.298 | 0.489 |
| GPT-4-1106 | / | 0.553 | / | 0.590 | / | 0.609 | 0.517 | 0.662 |
| CLAUDE-3-OPUS | 0.553 | / | 0.590 | / | 0.609 | / | 0.453 | 0.693 |