Supplementary code for the Build a Large Language Model From Scratch book by Sebastian Raschka

Code repository: https://github.com/rasbt/LLMs-from-scratch

Evaluating Instruction Responses Using the OpenAI API

  • This notebook uses OpenAI’s GPT-4 API to evaluate responses generated by instruction-finetuned LLMs, based on a dataset in JSON format that includes the generated model responses, for example:

{
    "instruction": "What is the atomic number of helium?",
    "input": "",
    "output": "The atomic number of helium is 2.",               # <-- The target given in the test set
    "model 1 response": "\nThe atomic number of helium is 2.0.", # <-- Response by an LLM
    "model 2 response": "\nThe atomic number of helium is 3."    # <-- Response by a 2nd LLM
},
# pip install -r requirements-extra.txt
from importlib.metadata import version

pkgs = ["openai",  # OpenAI API
        "tqdm",    # Progress bar
        ]

for p in pkgs:
    print(f"{p} version: {version(p)}")
openai version: 0.27.1
tqdm version: 4.65.0

Test OpenAI API

  • First, we need to provide our OpenAI API secret key, which can be found at https://platform.openai.com/api-keys

  • Make sure not to share this key with anyone

  • Add this secret key ("sk-...") to the config.json file in this folder
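
  • For reference, the config.json file should contain a single JSON object whose key matches the "OPENAI_API_KEY" lookup in the code below, for example:

{
    "OPENAI_API_KEY": "sk-..."
}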

import json
from openai import OpenAI  # the OpenAI client class requires openai>=1.0

# Load API key from a JSON file.
# Make sure to replace "sk-..." with your actual API key from https://platform.openai.com/api-keys
with open("config.json", "r") as config_file:
    config = json.load(config_file)
    api_key = config["OPENAI_API_KEY"]

client = OpenAI(api_key=api_key)
  • Next, let’s try the API with a simple example to make sure it works as intended:

def run_chatgpt(prompt, client, model="gpt-4-turbo"):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # greedy decoding to keep answers as reproducible as possible
        seed=123,         # fixed seed to further reduce run-to-run variation
    )
    return response.choices[0].message.content


prompt = "Respond with 'hello world' if you got this message."
run_chatgpt(prompt, client)
'hello world'

Load JSON Entries

  • Here, we assume that we saved the test dataset and the model responses as a JSON file that we can load as follows:

json_file = "eval-example-data.json"

with open(json_file, "r") as file:
    json_data = json.load(file)

print("Number of entries:", len(json_data))
Number of entries: 100
  • The structure of this file is as follows, where we have the given response in the test dataset ('output') and responses by two different models ('model 1 response' and 'model 2 response'):

json_data[0]
{'instruction': 'Calculate the hypotenuse of a right triangle with legs of 6 cm and 8 cm.',
 'input': '',
 'output': 'The hypotenuse of the triangle is 10 cm.',
 'model 1 response': '\nThe hypotenuse of the triangle is 3 cm.',
 'model 2 response': '\nThe hypotenuse of the triangle is 12 cm.'}
  • Below is a small utility function that formats each entry’s instruction and optional input into the instruction-prompt format that we later embed in the evaluation queries:

def format_input(entry):
    instruction_text = (
        f"Below is an instruction that describes a task. Write a response that "
        f"appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )

    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""

    return instruction_text + input_text
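  • For example, applying format_input to the first entry shown above should print the following prompt:

print(format_input(json_data[0]))
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Calculate the hypotenuse of a right triangle with legs of 6 cm and 8 cm.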
  • Now, let’s try the OpenAI API to compare the model responses (we only evaluate the first 5 responses for a visual comparison):

for entry in json_data[:5]:
    prompt = (f"Given the input `{format_input(entry)}` "
              f"and correct output `{entry['output']}`, "
              f"score the model response `{entry['model 1 response']}`"
              f" on a scale from 0 to 100, where 100 is the best score. "
              )
    print("\nDataset response:")
    print(">>", entry['output'])
    print("\nModel response:")
    print(">>", entry["model 1 response"])
    print("\nScore:")
    print(">>", run_chatgpt(prompt, client))
    print("\n-------------------------")
Dataset response:
>> The hypotenuse of the triangle is 10 cm.

Model response:
>> 
The hypotenuse of the triangle is 3 cm.

Score:
>> The model response "The hypotenuse of the triangle is 3 cm." is incorrect. The correct calculation of the hypotenuse for a right triangle with legs of 6 cm and 8 cm can be found using the Pythagorean theorem, which states that the square of the hypotenuse (c) is equal to the sum of the squares of the other two sides (a and b). Mathematically, this is expressed as:

\[ c = \sqrt{a^2 + b^2} \]
\[ c = \sqrt{6^2 + 8^2} \]
\[ c = \sqrt{36 + 64} \]
\[ c = \sqrt{100} \]
\[ c = 10 \text{ cm} \]

The correct answer should be 10 cm. The response given as 3 cm is not only incorrect but also significantly off from the correct value. This error could lead to misunderstandings or incorrect applications in practical scenarios where precise measurements are crucial.

Given the scale from 0 to 100, where 100 is the best score, the response would score very low due to its inaccuracy. However, since the response format is correct (stating the measurement and unit), it does not score the absolute minimum.

**Score: 10/100**

This score reflects that while the format of the response is correct, the content is highly inaccurate.

-------------------------

Dataset response:
>> 1. Squirrel
2. Eagle
3. Tiger

Model response:
>> 
1. Squirrel
2. Tiger
3. Eagle
4. Cobra
5. Tiger
6. Cobra

Score:
>> The model response lists six animals, three of which (squirrel, tiger, eagle) are indeed active during the day, making them correct responses to the instruction. However, the instruction specifically asked for three different animals, and the model response includes repetitions (tiger and cobra are each listed twice) and also exceeds the requested number of animals.

The inclusion of "cobra" is incorrect as most cobras are not diurnal (active during the day); they are generally more active during the early morning and late evening, which can be considered crepuscular rather than diurnal.

### Scoring Breakdown:
- **Relevance to the task**: The response correctly identifies three diurnal animals but also includes additional animals, which was not requested.
- **Accuracy**: Including animals not active during the day (cobra) and repeating animals reduces the accuracy.
- **Adherence to instructions**: The task was to name three different animals, but the response included six names with repetitions.

Given these points, the response partially meets the requirements but also deviates significantly in terms of the number of animals and the inclusion of incorrect and repeated entries.

### Score: 50/100
This score reflects that while the response did include three correct animals, it failed to strictly follow the instructions by listing only three different animals and included incorrect information.

-------------------------

Dataset response:
>> I must ascertain what is incorrect.

Model response:
>> 
What is incorrect?

Score:
>> The model response "What is incorrect?" scores low in terms of fulfilling the instruction to rewrite the sentence in a more formal way. The original sentence "I need to find out what's wrong." expresses a personal obligation and a process of discovery, which is not captured in the model response. The model response turns the sentence into a direct question and loses the nuance of needing to discover or investigate the issue.

**Score: 20/100**

**Reasoning:**
- **Formality:** The response is slightly more formal than casual speech but does not elevate the formality significantly or appropriately. It does use "incorrect" which is slightly more formal than "wrong."
- **Completeness:** The response fails to include the aspect of needing to find out or ascertain, which is a critical part of the original sentence.
- **Accuracy:** The response changes the structure and intent by converting it into a direct question, which does not align with the instruction to rewrite the statement while maintaining its original intent.

Overall, the response does not adequately meet the requirements of the task as it significantly alters the meaning and omits key elements of the original sentence.

-------------------------

Dataset response:
>> The interjection in the sentence is 'Wow'.

Model response:
>> 
The interjection in the sentence is 'Wow'.

Score:
>> The model response `The interjection in the sentence is 'Wow'.` accurately identifies the interjection in the provided sentence. The response is clear, directly addresses the instruction, and correctly identifies "Wow" as the interjection, which is used to express surprise or admiration, fitting the context of the sentence. Therefore, the response is fully correct and meets all the requirements of the task.

Score: 100/100

-------------------------

Dataset response:
>> The type of sentence is interrogative.

Model response:
>> 
The type of sentence is exclamatory.

Score:
>> The model response "The type of sentence is exclamatory." is incorrect. The input sentence "Did you finish the report?" is clearly an interrogative sentence as it is asking a question, indicated by the question mark at the end and the structure of the sentence.

Given the scoring criteria where 100 is the best score and should be awarded to a correct and precise response, the model's response should receive a low score because it incorrectly identifies the type of sentence. An exclamatory sentence typically expresses strong emotion and ends with an exclamation mark, which is not the case here.

Therefore, the score for the model response would be 0 out of 100, as it completely misidentifies the type of sentence, providing incorrect information.

-------------------------
  • Note that the responses above are very verbose; to quantify which model is better, we now prompt GPT-4 to return only the numeric score:

from tqdm import tqdm


def generate_model_scores(json_data, json_key, client):
    scores = []
    for entry in tqdm(json_data, desc="Scoring entries"):
        prompt = (
            f"Given the input `{format_input(entry)}` "
            f"and correct output `{entry['output']}`, "
            f"score the model response `{entry[json_key]}`"
            f" on a scale from 0 to 100, where 100 is the best score. "
            f"Respond with the number only."
        )
        score = run_chatgpt(prompt, client)
        try:
            scores.append(int(score))
        except ValueError:
            # Skip replies that cannot be parsed as an integer score
            continue

    return scores
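  • As an optional sanity check, we could first score a small slice of the dataset, for example the first five entries, to verify that the replies parse into integer scores before running the full evaluation:

# Optional sanity check on a small subset (uses json_data and client from above)
sample_scores = generate_model_scores(json_data[:5], "model 1 response", client)
print(sample_scores)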
  • Please note that the response scores may vary between runs because OpenAI’s GPT models are not fully deterministic, even with a fixed seed and a temperature of 0.

  • Let’s now apply this evaluation to the whole dataset and compute the average score of each model:

from pathlib import Path

for model in ("model 1 response", "model 2 response"):

    scores = generate_model_scores(json_data, model, client)
    print(f"\n{model}")
    print(f"Number of scores: {len(scores)} of {len(json_data)}")
    print(f"Average score: {sum(scores)/len(scores):.2f}\n")

    # Optionally save the scores
    save_path = Path("scores") / f"gpt4-{model.replace(' ', '-')}.json"
    save_path.parent.mkdir(parents=True, exist_ok=True)  # create the scores/ folder if it doesn't exist
    with open(save_path, "w") as file:
        json.dump(scores, file)
Scoring entries: 100%|████████████████████████| 100/100 [01:03<00:00,  1.56it/s]
model 1 response
Number of scores: 100 of 100
Average score: 74.09
Scoring entries: 100%|████████████████████████| 100/100 [01:06<00:00,  1.50it/s]
model 2 response
Number of scores: 100 of 100
Average score: 56.57

  • Based on the evaluation above, we can say that the first model is substantially better than the second model (average score of 74.09 vs. 56.57).