Supplementary code for the Build a Large Language Model From Scratch book by Sebastian Raschka

Code repository: https://github.com/rasbt/LLMs-from-scratch

Score Correlation Analysis#

  • This notebook analyzes the correlation between the scores produced by the different evaluation methods

import json

with open("gpt4-model-1-response.json", "r") as file:
    gpt4_model_1 = json.load(file)

with open("llama3-8b-model-1-response.json", "r") as file:
    llama3_8b_model_1 = json.load(file)
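
  • Each JSON file is assumed to contain a plain list with one numeric score per model response, in matching order; the quick sanity check below (a minimal sketch under that assumption) makes this explicit before the scores are correlated

# Sanity check (a minimal sketch): both files are assumed to hold one numeric
# score per response, in the same order
assert len(gpt4_model_1) == len(llama3_8b_model_1), "Score lists differ in length"
print("Number of scored responses:", len(gpt4_model_1))
print("First few scores:", gpt4_model_1[:3], llama3_8b_model_1[:3])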

GPT-4 vs Llama 3 8B#

import numpy as np
import matplotlib.pyplot as plt


list1, list2 = gpt4_model_1, llama3_8b_model_1

# Scatter plot of the two judges' scores with a least-squares regression line
plt.scatter(list1, list2)
plt.plot(
    np.unique(list1),
    np.poly1d(np.polyfit(list1, list2, 1))(np.unique(list1))
)
plt.xlabel("GPT-4")
plt.ylabel("Llama3 8B")
plt.show()
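
  • The plt.plot call overlays a least-squares regression line: np.polyfit fits a degree-1 polynomial to the score pairs, and np.poly1d evaluates it at the unique GPT-4 scores; the sketch below (reusing list1 and list2 from the cell above) prints the fitted slope and intercept explicitly

# Inspect the least-squares line shown in the plot (a minimal sketch)
slope, intercept = np.polyfit(list1, list2, 1)
print(f"Fitted line: Llama3 8B score ~= {slope:.3f} * GPT-4 score + {intercept:.3f}")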

Correlation Coefficients#

import pandas as pd
from scipy.stats import spearmanr, kendalltau

# Pearson (linear) and the rank-based Spearman and Kendall tau coefficients
pearson_correlation = np.corrcoef(list1, list2)[0, 1]
spearman_correlation, _ = spearmanr(list1, list2)
kendall_tau_correlation, _ = kendalltau(list1, list2)

# Collect the three coefficients in a one-row DataFrame for display
correlation_table = pd.DataFrame({
    "Pearson": [pearson_correlation],
    "Spearman": [spearman_correlation],
    "Kendall Tau": [kendall_tau_correlation]
}, index=['Results'])

correlation_table
|         | Pearson | Spearman | Kendall Tau |
|---------|---------|----------|-------------|
| Results | 0.80489 | 0.698406 | 0.57292     |
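
  • Pearson measures linear correlation on the raw scores, whereas Spearman and Kendall tau only use the rank order of the scores, which makes them less sensitive to outliers and to scale differences between the two judges; the toy sketch below (with made-up numbers, not data from this notebook) illustrates the difference on a monotonic but non-linear relationship

# Toy illustration with made-up numbers: a monotonic but non-linear relationship
# gives perfect rank correlations but a Pearson coefficient below 1
toy_a = [1, 2, 3, 4, 5]
toy_b = [1, 4, 9, 16, 50]

print("Pearson:    ", np.corrcoef(toy_a, toy_b)[0, 1])
print("Spearman:   ", spearmanr(toy_a, toy_b)[0])
print("Kendall tau:", kendalltau(toy_a, toy_b)[0])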
  • For comparison, below are the correlation coefficients from the Prometheus 2 paper by Kim et al. 2024 (https://arxiv.org/abs/2405.01535), which are all in the same ballpark as the ones reported for Llama 3 above

  • Note that Prometheus 2 is a model specifically finetuned for LLM rating and evaluation

Pearson#

| Evaluator LM | VICUNA Bench (GPT-4-1106) | VICUNA Bench (Claude-3-Opus) | MT Bench (GPT-4-1106) | MT Bench (Claude-3-Opus) | FLASK (GPT-4-1106) | FLASK (Claude-3-Opus) | FLASK (Humans) | Feedback Bench (GPT-4-0613) |
|---|---|---|---|---|---|---|---|---|
| LLAMA2-CHAT 7B | 0.205 | 0.243 | 0.036 | 0.055 | 0.317 | 0.256 | 0.299 | 0.523 |
| LLAMA2-CHAT 13B | 0.185 | 0.141 | -0.042 | -0.002 | 0.239 | 0.247 | 0.263 | 0.545 |
| LLAMA2-CHAT 70B | 0.350 | 0.463 | 0.178 | 0.228 | 0.388 | 0.402 | 0.317 | 0.592 |
| MISTRAL-INSTRUCT-7B | 0.486 | 0.561 | 0.284 | 0.396 | 0.448 | 0.437 | 0.377 | 0.586 |
| MIXTRAL-INSTRUCT-8X7B | 0.566 | 0.579 | 0.551 | 0.539 | 0.483 | 0.495 | 0.420 | 0.673 |
| PROMETHEUS-7B | 0.484 | 0.528 | 0.378 | 0.382 | 0.352 | 0.331 | 0.348 | 0.847 |
| PROMETHEUS-13B | 0.492 | 0.534 | 0.404 | 0.477 | 0.462 | 0.470 | 0.449 | 0.860 |
| AUTO-J (13B) | 0.351 | 0.262 | 0.432 | 0.375 | 0.430 | 0.370 | 0.473 | 0.637 |
| PROMETHEUS-2-7B | 0.642 | 0.610 | 0.543 | 0.554 | 0.645 | 0.578 | 0.544 | 0.878 |
| PROMETHEUS-2-8X7B | 0.685 | 0.635 | 0.665 | 0.614 | 0.659 | 0.626 | 0.555 | 0.898 |
| GPT-3.5-TURBO-0613 | 0.335 | 0.349 | 0.183 | 0.194 | 0.437 | 0.396 | 0.450 | 0.594 |
| GPT-4-1106 | / | 0.694 | / | 0.717 | / | 0.736 | 0.679 | 0.753 |
| CLAUDE-3-OPUS | 0.694 | / | 0.717 | / | 0.736 | / | 0.573 | 0.788 |

Spearman#

| Evaluator LM | VICUNA Bench (GPT-4-1106) | VICUNA Bench (Claude-3-Opus) | MT Bench (GPT-4-1106) | MT Bench (Claude-3-Opus) | FLASK (GPT-4-1106) | FLASK (Claude-3-Opus) | FLASK (Humans) | Feedback Bench (GPT-4-0613) |
|---|---|---|---|---|---|---|---|---|
| LLAMA2-CHAT 7B | 0.236 | 0.255 | 0.084 | 0.089 | 0.301 | 0.244 | 0.279 | 0.511 |
| LLAMA2-CHAT 13B | 0.178 | 0.179 | -0.025 | 0.044 | 0.206 | 0.222 | 0.224 | 0.543 |
| LLAMA2-CHAT 70B | 0.348 | 0.466 | 0.197 | 0.252 | 0.391 | 0.389 | 0.298 | 0.585 |
| MISTRAL-INSTRUCT-7B | 0.389 | 0.480 | 0.266 | 0.358 | 0.499 | 0.478 | 0.374 | 0.563 |
| MIXTRAL-INSTRUCT-8X7B | 0.476 | 0.556 | 0.545 | 0.517 | 0.505 | 0.500 | 0.386 | 0.659 |
| PROMETHEUS-7B | 0.508 | 0.528 | 0.385 | 0.349 | 0.367 | 0.326 | 0.317 | 0.876 |
| PROMETHEUS-13B | 0.492 | 0.534 | 0.401 | 0.470 | 0.474 | 0.454 | 0.398 | 0.893 |
| AUTO-J (13B) | 0.337 | 0.297 | 0.408 | 0.365 | 0.402 | 0.358 | 0.408 | 0.623 |
| PROMETHEUS-2-7B | 0.643 | 0.584 | 0.550 | 0.524 | 0.626 | 0.569 | 0.490 | 0.909 |
| PROMETHEUS-2-8X7B | 0.660 | 0.615 | 0.669 | 0.605 | 0.642 | 0.618 | 0.496 | 0.912 |
| GPT-3.5-TURBO-0613 | 0.319 | 0.354 | 0.192 | 0.198 | 0.446 | 0.390 | 0.374 | 0.565 |
| GPT-4-1106 | / | 0.659 | / | 0.721 | / | 0.729 | 0.650 | 0.753 |
| CLAUDE-3-OPUS | 0.659 | / | 0.721 | / | 0.729 | / | 0.567 | 0.784 |

Kendall-Tau#

| Evaluator LM | VICUNA Bench (GPT-4-1106) | VICUNA Bench (Claude-3-Opus) | MT Bench (GPT-4-1106) | MT Bench (Claude-3-Opus) | FLASK (GPT-4-1106) | FLASK (Claude-3-Opus) | FLASK (Humans) | Feedback Bench (GPT-4-0613) |
|---|---|---|---|---|---|---|---|---|
| LLAMA2-CHAT 7B | 0.183 | 0.203 | 0.065 | 0.070 | 0.229 | 0.186 | 0.211 | 0.419 |
| LLAMA2-CHAT 13B | 0.145 | 0.146 | -0.019 | 0.037 | 0.160 | 0.174 | 0.174 | 0.453 |
| LLAMA2-CHAT 70B | 0.282 | 0.382 | 0.150 | 0.196 | 0.310 | 0.310 | 0.221 | 0.487 |
| MISTRAL-INSTRUCT-7B | 0.314 | 0.391 | 0.208 | 0.281 | 0.395 | 0.384 | 0.287 | 0.454 |
| MIXTRAL-INSTRUCT-8X7B | 0.395 | 0.468 | 0.433 | 0.419 | 0.410 | 0.408 | 0.304 | 0.551 |
| PROMETHEUS-7B | 0.405 | 0.425 | 0.290 | 0.263 | 0.282 | 0.251 | 0.236 | 0.770 |
| PROMETHEUS-13B | 0.397 | 0.434 | 0.299 | 0.352 | 0.365 | 0.352 | 0.299 | 0.793 |
| AUTO-J (13B) | 0.282 | 0.242 | 0.303 | 0.272 | 0.312 | 0.282 | 0.312 | 0.515 |
| PROMETHEUS-2-7B | 0.515 | 0.478 | 0.458 | 0.421 | 0.500 | 0.454 | 0.376 | 0.773 |
| PROMETHEUS-2-8X7B | 0.559 | 0.515 | 0.535 | 0.483 | 0.526 | 0.507 | 0.388 | 0.800 |
| GPT-3.5-TURBO-0613 | 0.255 | 0.287 | 0.148 | 0.157 | 0.360 | 0.315 | 0.298 | 0.489 |
| GPT-4-1106 | / | 0.553 | / | 0.590 | / | 0.609 | 0.517 | 0.662 |
| CLAUDE-3-OPUS | 0.553 | / | 0.590 | / | 0.609 | / | 0.453 | 0.693 |