Improving Instruction-Data Via Reflection-Tuning Using GPT-4

This notebook uses OpenAI’s GPT-4 API to implement the dataset refinement process from the paper “Reflection-Tuning: Data Recycling Improves LLM Instruction-Tuning”

In the original paper, the researchers refined the Alpaca and WizardLM instruction-finetuning datasets; in this notebook, we refine the instruction-dataset used in chapter 7 (however, since it has the same format as Alpaca, the same code works with the Alpaca dataset as well)
The expected dataset format is as follows:

    {
        "instruction": "Edit the following sentence for grammar.",
        "input": "He go to the park every day.",
        "output": "He goes to the park every day."
    },
    {
        "instruction": "Convert 45 kilometers to meters.",
        "input": "",
        "output": "45 kilometers is 45000 meters."
    },
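
Before sending entries to the API, it can help to confirm that they match this format. The following is a minimal sketch, not part of the original notebook; the helper name `find_malformed_entries` is hypothetical, and it assumes every entry should carry exactly the three string fields shown above.

```python
REQUIRED_KEYS = {"instruction", "input", "output"}  # keys from the format shown above


def find_malformed_entries(entries):
    """Return the indices of entries that are not dicts with exactly the required string fields."""
    bad = []
    for idx, entry in enumerate(entries):
        is_ok = (
            isinstance(entry, dict)
            and set(entry) == REQUIRED_KEYS
            and all(isinstance(entry[k], str) for k in REQUIRED_KEYS)
        )
        if not is_ok:
            bad.append(idx)
    return bad


sample = [
    {"instruction": "Edit the following sentence for grammar.",
     "input": "He go to the park every day.",
     "output": "He goes to the park every day."},
    {"instruction": "Convert 45 kilometers to meters.",
     "input": "",
     "output": "45 kilometers is 45000 meters."},
    {"instruction": "Missing fields"},  # deliberately malformed for illustration
]

print(find_malformed_entries(sample))  # [2]
```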

Please note that this notebook reproduces the approach from the paper, in which the authors used the GPT API to enhance existing datasets. However, it’s important to be aware that GPT API-generated data may not be used to develop models that compete with OpenAI, as specified in the OpenAI Terms of Use: “What you cannot do… Use Output to develop models that compete with OpenAI.” You can find a relevant discussion here.

# pip install -r requirements-extra.txt

from importlib.metadata import version

pkgs = [
    "openai",  # OpenAI API
    "tqdm",    # Progress bar
]

for p in pkgs:
    print(f"{p} version: {version(p)}")

openai version: 0.27.1
tqdm version: 4.65.0

Test OpenAI API

First, let’s test if the OpenAI API is correctly set up
If you don’t have an account yet, you need to create one at https://platform.openai.com/
Note that you will also have to transfer some funds to your account as the GPT-4 API is not free (see https://platform.openai.com/settings/organization/billing/overview)
Running the code exactly as it appears in this notebook costs about $0.03 (3 cents) with GPT-4o-mini as of this writing
Applying the two refinement methodologies covered later in this notebook (instruction reflection and response reflection) to all 1100 entries in the chapter 7 instruction dataset costs about $0.60 (60 cents)
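
As a back-of-the-envelope check, the quoted $0.60 for 1100 entries works out to roughly $0.0005 per entry. The sketch below scales that figure linearly to other dataset sizes; the function name `estimate_cost_usd` is hypothetical, and linear scaling is only an assumption, since the actual cost depends on prompt and response lengths and on current pricing.

```python
# Figures quoted in this notebook (GPT-4o-mini, at the time of writing)
FULL_DATASET_COST_USD = 0.60   # both refinement passes over the full dataset
FULL_DATASET_ENTRIES = 1100


def estimate_cost_usd(num_entries):
    """Scale the quoted full-dataset cost linearly to num_entries (rough estimate only)."""
    return FULL_DATASET_COST_USD * num_entries / FULL_DATASET_ENTRIES


print(f"{estimate_cost_usd(100):.3f}")  # 0.055
```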

First, we need to provide our OpenAI API secret key, which can be found at https://platform.openai.com/api-keys
Make sure not to share this key with anyone
Add this secret key ("sk-...") to the config.json file in this folder

import json
from openai import OpenAI

# Load API key from a JSON file.
# Make sure to replace "sk-..." with your actual API key from https://platform.openai.com/api-keys
with open("config.json", "r") as config_file:
    config = json.load(config_file)
    api_key = config["OPENAI_API_KEY"]

client = OpenAI(api_key=api_key)

Note: if the import above fails with "ImportError: cannot import name 'OpenAI' from 'openai'", the installed openai package predates version 1.0 (for example, the 0.27.1 listed earlier), which does not provide the OpenAI client class; upgrading the package (pip install --upgrade openai) resolves this
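
The config.json file is read as a JSON object with an "OPENAI_API_KEY" entry. A slightly more defensive loader might look like the sketch below; the helper name `load_api_key` is hypothetical, and the demo writes a throwaway file with a made-up key rather than touching the real config.json.

```python
import json
import tempfile
from pathlib import Path


def load_api_key(config_path):
    """Read OPENAI_API_KEY from a JSON config file and reject missing or placeholder values."""
    with open(config_path, "r") as config_file:
        config = json.load(config_file)

    api_key = config.get("OPENAI_API_KEY", "")
    if not api_key or api_key == "sk-...":
        raise ValueError(f"Set OPENAI_API_KEY in {config_path} to your actual secret key")
    return api_key


# Demo with a temporary file and a made-up key (not a real credential)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "config.json"
    demo_path.write_text(json.dumps({"OPENAI_API_KEY": "sk-demo-not-a-real-key"}))
    print(load_api_key(demo_path))  # sk-demo-not-a-real-key
```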

Next, let’s try the API with a simple example to make sure it works as intended:

def run_chatgpt(prompt, client, model="gpt-4o-mini", system_prompt=None):
    # Define the system message if a system_prompt is provided
    messages = []
    
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    
    # Add the user prompt to the messages
    messages.append({"role": "user", "content": prompt})

    # Call the API
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0.0,
        seed=123,
    )
    
    # Return the model's response
    return response.choices[0].message.content


prompt = "Respond with 'hello world' if you got this message."
run_chatgpt(prompt, client)

'hello world'
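
To exercise the rest of the pipeline without spending API credits, one option is a stand-in client. The sketch below is an assumption-laden illustration, not part of the OpenAI library: `StubCompletions`, `make_stub_client`, and `build_messages` are hypothetical names, and the stub only mimics the two things run_chatgpt relies on, namely `client.chat.completions.create(...)` and `response.choices[0].message.content`.

```python
from types import SimpleNamespace


class StubCompletions:
    """Hypothetical stand-in for client.chat.completions that records the request."""

    def __init__(self, canned_reply):
        self.canned_reply = canned_reply
        self.last_request = None

    def create(self, **kwargs):
        # Remember the arguments and return an object shaped like a chat completion
        self.last_request = kwargs
        message = SimpleNamespace(content=self.canned_reply)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])


def make_stub_client(canned_reply):
    completions = StubCompletions(canned_reply)
    return SimpleNamespace(chat=SimpleNamespace(completions=completions))


def build_messages(prompt, system_prompt=None):
    """Same message layout as run_chatgpt: optional system message, then the user prompt."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": prompt})
    return messages


stub_client = make_stub_client("hello world")
response = stub_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=build_messages("Respond with 'hello world' if you got this message.", "Be brief."),
    temperature=0.0,
    seed=123,
)
print(response.choices[0].message.content)  # hello world
print([m["role"] for m in stub_client.chat.completions.last_request["messages"]])  # ['system', 'user']
```

Passing such a stub as the client argument of run_chatgpt should let the prompt-building and extraction code run offline, although it says nothing about how the real model would respond.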

Load JSON Entries

Next, let’s load and process the instruction dataset
Here, we assume that the instruction dataset from chapter 7 is saved as a JSON file that we can load as follows:

from pathlib import Path

json_file = Path("..") / "01_main-chapter-code" / "instruction-data.json"

with open(json_file, "r") as file:
    json_data = json.load(file)

print("Number of entries:", len(json_data))

Number of entries: 1100

Let’s print one of the dataset entries to see its structure:

from pprint import pp as pprint

pprint(json_data[0])

{'instruction': 'Evaluate the following phrase by transforming it into the '
                'spelling given.',
 'input': 'freind --> friend',
 'output': 'The spelling of the given phrase "freind" is incorrect, the '
           'correct spelling is "friend".'}

Improve Instructions

The Reflection-Tuning authors shared two approaches: (1) improving the instructions and (2) improving the responses
Let’s begin by improving the instructions in a given dataset
Below is a small utility function from the Reflection-Tuning repository to format the inputs to the GPT-4 model for this dataset refinement

def instr_prompt_no_input(ins, outp):

    sys_prompt = "You are a helpful, precise but picky assistant for checking the quality of a given instruction."
    prompt_template = "[Instruction]\n{ins}\n\n[The Start of Answer]\n{outp}\n\n[The End of Answer]\n\n[System]\n{criteria}\n\n"
    criteria = "We would like you to answer several questions related to the quality of a given instruction. \n" + \
                "1. Why this instruction is not good? First analyse the instruction based on Complexity of the Topic, Level of Detail Required, Knowledge Required, Ambiguity of the Instruction and Logical Reasoning or Problem-Solving Involved. \n" + \
                "Then analyse why this answer is not good for the given instruction? Analyse based on the Helpfulness, Relevance, Accuracy and Level of Details. \n" + \
                "Finally analyse why this bad instruction lead to a bad answer. " +\
                "2. Based on the reason you provided, generate a new and complete instruction which is complex and difficult to answer directly. " + \
                "Make sure the new instruction is relevent but independent to the original instruction, which can be answered without knowing the original instruction, put the new instruction in the format of [New Instruction] your instruction [End]" +\
                "3. Answer the newly generated instruction as detailed as possible, in the format of [New Answer] your answer [End] \n"
    prompt = prompt_template.format(
        ins=ins, outp=outp, criteria=criteria
    )
    return sys_prompt, prompt

To see how it works, consider the dataset entry, json_data[2]

pprint(json_data[2])

{'instruction': 'Convert 45 kilometers to meters.',
 'input': '',
 'output': '45 kilometers is 45000 meters.'}

We can refine the instruction as follows, using the instr_prompt_no_input function defined above:

entry = json_data[2]

system_prompt, prompt = instr_prompt_no_input(ins=entry["instruction"], outp=entry["output"])
output = run_chatgpt(prompt=prompt, client=client, system_prompt=system_prompt)

print(output)

1. **Analysis of the Instruction:**

   - **Complexity of the Topic:** The topic of converting kilometers to meters is relatively simple and straightforward, as it involves basic unit conversion.
   - **Level of Detail Required:** The instruction does not require much detail; it simply asks for a conversion without any additional context or explanation.
   - **Knowledge Required:** Basic knowledge of metric units and their conversions is required, which is common knowledge.
   - **Ambiguity of the Instruction:** The instruction is clear and unambiguous; it specifies exactly what needs to be converted.
   - **Logical Reasoning or Problem-Solving Involved:** There is minimal logical reasoning involved, as the conversion factor (1 kilometer = 1000 meters) is a standard fact.

   **Analysis of the Answer:**

   - **Helpfulness:** The answer is helpful in that it provides the correct conversion.
   - **Relevance:** The answer is relevant to the instruction, as it directly addresses the conversion requested.
   - **Accuracy:** The answer is accurate; 45 kilometers does indeed equal 45,000 meters.
   - **Level of Details:** The answer lacks detail. It does not explain the conversion process or provide any context, which could be beneficial for someone unfamiliar with metric conversions.

   **Why the Bad Instruction Leads to a Bad Answer:** While the instruction itself is not bad, the simplicity of the task may lead to a lack of depth in the answer. The answer could have been improved by including an explanation of the conversion process, which would enhance understanding.

2. **New Instruction:**
   [New Instruction] Explain the significance of the metric system in global trade and provide examples of how unit conversions can impact international business transactions. [End]

3. **New Answer:**
   [New Answer] The metric system, also known as the International System of Units (SI), is a decimal-based system of measurement that is used globally. Its significance in global trade lies in its standardization, which facilitates international communication and commerce. 

   One of the primary advantages of the metric system is that it is universally recognized, which reduces confusion and errors in measurement. For example, when a company in the United States imports goods from Europe, the specifications for those goods are often provided in metric units. If the U.S. company is accustomed to using imperial units (like inches or pounds), they must convert these measurements to ensure compatibility. 

   Unit conversions can significantly impact international business transactions. For instance, if a manufacturer orders 100 kilograms of a product but mistakenly interprets it as 100 pounds, they will receive a much smaller quantity than intended, leading to production delays and financial losses. 

   Additionally, in industries such as pharmaceuticals, precise measurements are critical. A dosage specified in milligrams must be accurately converted to ensure patient safety. 

   In summary, the metric system's role in global trade is crucial for maintaining consistency and accuracy in measurements, which ultimately supports efficient and effective international business operations. [End]

The response is very verbose, which is useful for analysis purposes; also, it helps the GPT-4 model to make improvements via the chain-of-thought prompting approach
However, to construct the improved dataset, we are actually only interested in new instructions and outputs, not the analyses
We can use the following utility code from the Reflection-Tuning repository to extract the model’s improved instructions and outputs

import re

def extract_ins(text, no_input=True):
    if '[New Instruction]' in text:
        pattern = r'(\[New Instruction\])(.*?)(\[End\]|\[New Answer\]|New Answer:)'
    else:
        pattern = r'(New Instruction:)(.*?)(\[End\]|\[New Answer\]|New Answer:)'
    segments = re.findall(pattern, text, re.DOTALL)
    if len(segments) == 0:
        seg_ins = ''
    else:
        seg_ins = segments[0][1].strip()
    if seg_ins.endswith("\n\n3."):
        seg_ins = seg_ins[:-4]
    return seg_ins


def extract_oup(text, no_input=True):
    if '[New Answer]' in text:
        pattern = r'(\[New Answer\])(.*?)(\[End\]|$)'
    else:
        pattern = r'(New Answer:)(.*?)(\[End\]|$)'
        # pattern = r'(\[New Answer\]|New Answer:)(.*?)(\[End\]|$)'
    segments = re.findall(pattern, text, re.DOTALL)
    if len(segments) == 0:
        seg_oup = ''
    else:
        seg_oup = segments[0][1].strip()
    return seg_oup


def extract_instruction(text):
    if text == '':
        return []
    seg_ins = extract_ins(text, no_input=True)
    seg_oup = extract_oup(text, no_input=True)
    return [seg_ins, seg_oup]
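
To see the extraction logic in isolation, the sketch below applies the same instruction pattern as extract_ins to a short synthetic reply (the reply text is made up for illustration). It also shows why re.DOTALL matters: without it, "." does not match the line break inside the instruction, so nothing is found.

```python
import re

# Synthetic model reply, abbreviated for illustration
reply = (
    "1. Analysis...\n\n"
    "2. [New Instruction] Explain why unit\nconversions matter. [End]\n\n"
    "3. [New Answer] They keep measurements\nconsistent. [End]"
)

# Same instruction pattern as in extract_ins
pattern = r'(\[New Instruction\])(.*?)(\[End\]|\[New Answer\]|New Answer:)'

# The lazy group (.*?) stops at the first terminator; DOTALL lets it span line breaks
segments = re.findall(pattern, reply, re.DOTALL)
print(repr(segments[0][1].strip()))  # 'Explain why unit\nconversions matter.'

# Without re.DOTALL, the multi-line instruction is not matched at all
print(re.findall(pattern, reply))  # []
```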

Let’s use these utility functions to extract the improved instruction and response from the lengthy GPT-4 output generated earlier:

new_instr, new_outp = extract_instruction(output)

print(new_instr)

Explain the significance of the metric system in global trade and provide examples of how unit conversions can impact international business transactions.

print(new_outp)

The metric system, also known as the International System of Units (SI), is a decimal-based system of measurement that is used globally. Its significance in global trade lies in its standardization, which facilitates international communication and commerce. 

   One of the primary advantages of the metric system is that it is universally recognized, which reduces confusion and errors in measurement. For example, when a company in the United States imports goods from Europe, the specifications for those goods are often provided in metric units. If the U.S. company is accustomed to using imperial units (like inches or pounds), they must convert these measurements to ensure compatibility. 

   Unit conversions can significantly impact international business transactions. For instance, if a manufacturer orders 100 kilograms of a product but mistakenly interprets it as 100 pounds, they will receive a much smaller quantity than intended, leading to production delays and financial losses. 

   Additionally, in industries such as pharmaceuticals, precise measurements are critical. A dosage specified in milligrams must be accurately converted to ensure patient safety. 

   In summary, the metric system's role in global trade is crucial for maintaining consistency and accuracy in measurements, which ultimately supports efficient and effective international business operations.

Note that the instruction-refinement is currently only implemented for dataset entries that don’t have an "input" field
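
To gauge how many entries the instruction refinement would actually touch, one could partition the dataset by whether the "input" field is empty. This is a small sketch with a hypothetical helper name, `split_by_input`, shown on three entries that appear in this notebook.

```python
def split_by_input(entries):
    """Partition entries into those with an empty "input" field and those with a non-empty one."""
    without_input = [e for e in entries if not e["input"]]
    with_input = [e for e in entries if e["input"]]
    return without_input, with_input


sample = [
    {"instruction": "Evaluate the following phrase by transforming it into the spelling given.",
     "input": "freind --> friend",
     "output": 'The spelling of the given phrase "freind" is incorrect, the correct spelling is "friend".'},
    {"instruction": "Edit the following sentence for grammar.",
     "input": "He go to the park every day.",
     "output": "He goes to the park every day."},
    {"instruction": "Convert 45 kilometers to meters.",
     "input": "",
     "output": "45 kilometers is 45000 meters."},
]

without_input, with_input = split_by_input(sample)
print(len(without_input), len(with_input))  # 1 2
```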

Improve Responses

In a similar fashion, we can also apply the Reflection-Tuning refinement process specifically to the dataset responses (i.e., “output” fields)
Below are two small utility functions from the Reflection-Tuning repository to format the inputs to the GPT-4 model for dataset refinement

def res_gen_prompt_no_input(ins, outp):

    sys_prompt = "You are a helpful, precise but picky assistant for checking the quality of the answer to a given instruction."
    prompt_template = "[Instruction]\n{ins}\n\n[The Start of Answer]\n{outp}\n\n[The End of Answer]\n\n[System]\n{criteria}\n\n"
    criteria = "We would like you to answer several questions related to the quality of the answer to the given instruction. \n" + \
                "1. Why this answer is not good for the given instruction? Analyse based on the Helpfulness, Relevance, Accuracy and Level of Details. \n" + \
                "2. Based on the reason you provided, generate a better answer, new and complete, as detailed as possible, in the format of [Better Answer] your answer [End] \n" 
    prompt = prompt_template.format(
        ins=ins, outp=outp, criteria=criteria
    )
    return sys_prompt, prompt


def res_gen_prompt_input(ins, inp, outp):

    sys_prompt = "You are a helpful and precise assistant for checking the quality of the answer to a given instruction and its input."
    prompt_template = "[Instruction]\n{ins}\n\n[The Start of Input]\n{inp}\n\n[The End of Input]\n\n[The Start of Answer]\n{outp}\n\n[The End of Answer]\n\n[System]\n{criteria}\n\n"
    criteria = "We would like you to answer several questions related to the quality of the answer to the given instruction and corresponding input. \n" + \
                "1. Why this answer is not good for the given instruction and corresponding input? Analyse based on the Helpfulness, Relevance, Accuracy and Level of Details. \n" + \
                "2. Based on the reason you provided, generate a better answer, new and complete, as detailed as possible, in the format of [Better Answer] your answer [End] \n" 
    prompt = prompt_template.format(
        ins=ins, inp=inp, outp=outp, criteria=criteria
    )
    return sys_prompt, prompt

Again, let’s apply it to one of the dataset entries to see how it works, generating the improved response:

entry = json_data[2]

system_prompt, prompt = res_gen_prompt_no_input(ins=entry["instruction"], outp=entry["output"])
output = run_chatgpt(prompt=prompt, client=client, system_prompt=system_prompt)

print(output)

1. The answer provided is not good for the given instruction for several reasons:

- **Helpfulness**: While the answer does provide the correct conversion, it lacks any explanation or context. A more helpful answer would include a brief explanation of the conversion process, which would aid understanding.

- **Relevance**: The answer is relevant in that it addresses the instruction to convert kilometers to meters, but it could be more relevant by including the conversion factor used (1 kilometer = 1000 meters).

- **Accuracy**: The answer is accurate in terms of the numerical conversion (45 kilometers = 45000 meters). However, it could be misleading if the reader does not understand how the conversion was derived.

- **Level of Details**: The answer is very brief and lacks detail. A more detailed response would include the conversion factor and a step-by-step explanation of how the conversion is performed.

2. [Better Answer] To convert kilometers to meters, you can use the conversion factor that 1 kilometer is equal to 1000 meters. Therefore, to convert 45 kilometers to meters, you multiply 45 by 1000. 

So, 45 kilometers × 1000 meters/kilometer = 45000 meters. 

Thus, 45 kilometers is equal to 45000 meters. [End]

As we can see above, the response includes an analysis of the original response; we can extract the new response using the following utility function from the Reflection-Tuning repository

def extract_response(text):
    if text.count('[Better Answer]') >= 2:
        pattern = r'\[(Better Answer)\](.*?)(\[End\]|\[Better Answer\]|$)'
        segments = re.findall(pattern, text, re.DOTALL)
    else:
        # pattern = r'\[(Better Answer)\](.*?)\[End\]'
        pattern = r'\[(Better Answer)\](.*?)(\[End\]|End|$)'
        segments = re.findall(pattern, text, re.DOTALL)
    return [segment[1].strip() for segment in segments]

response = extract_response(output)[0]
print(response)

To convert kilometers to meters, you can use the conversion factor that 1 kilometer is equal to 1000 meters. Therefore, to convert 45 kilometers to meters, you multiply 45 by 1000. 

So, 45 kilometers × 1000 meters/kilometer = 45000 meters. 

Thus, 45 kilometers is equal to 45000 meters.
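
Note that extract_response returns an empty list when the model reply contains no [Better Answer] tag, so indexing with [0] would raise an IndexError in that case. The sketch below shows one possible way to fall back to the original output instead; `extract_better_answer` is a hypothetical helper with a simplified pattern, not the repository's function.

```python
import re


def extract_better_answer(text, fallback):
    """Return the first [Better Answer] segment, or fallback if no tag is found."""
    pattern = r'\[Better Answer\](.*?)(\[End\]|$)'
    segments = re.findall(pattern, text, re.DOTALL)
    if not segments:
        return fallback
    # Each match is a (answer_text, terminator) tuple; keep only the answer text
    return segments[0][0].strip()


original_output = "45 kilometers is 45000 meters."

tagged_reply = "1. Analysis...\n\n2. [Better Answer] 45 km x 1000 m/km = 45000 m. [End]"
print(extract_better_answer(tagged_reply, original_output))  # 45 km x 1000 m/km = 45000 m.

untagged_reply = "The answer is already fine."
print(extract_better_answer(untagged_reply, original_output))  # 45 kilometers is 45000 meters.
```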

Improving the Dataset

Now, let’s apply the instruction-reflection and response-reflection techniques to the actual dataset
Note: we only apply it to a small data subset here for demo purposes; to apply it to the whole dataset, change

data_to_process = json_data[:3]

to

data_to_process = json_data

Reflect instructions

The following code applies the Reflection-Tuning methodology for dataset refinement to the instructions in the original dataset

data_to_process = json_data[:3]

from tqdm import tqdm


def reflect_instructions(json_data, client):
    new_json_data = [] 
    
    for entry in tqdm(json_data):
        
        if not entry["input"]:
            system_prompt, prompt = instr_prompt_no_input(ins=entry["instruction"], outp=entry["output"])
            output = run_chatgpt(prompt=prompt, client=client, system_prompt=system_prompt)
            new_instr, new_outp = extract_instruction(output)
            new_entry = {"instruction": new_instr, "input": "", "output": new_outp}
            new_json_data.append(new_entry)
        else:
            new_json_data.append(entry)

    return new_json_data

data_to_process = json_data[:3]

new_json_data = reflect_instructions(data_to_process, client)

100%|█████████████████████████████████████████████| 3/3 [00:06<00:00,  2.17s/it]

for i in new_json_data[:3]:
    pprint(i)
    print("\n\n")

{'instruction': 'Evaluate the following phrase by transforming it into the '
                'spelling given.',
 'input': 'freind --> friend',
 'output': 'The spelling of the given phrase "freind" is incorrect, the '
           'correct spelling is "friend".'}



{'instruction': 'Edit the following sentence for grammar.',
 'input': 'He go to the park every day.',
 'output': 'He goes to the park every day.'}



{'instruction': 'Explain the significance of understanding metric conversions '
                'in scientific research, and provide an example of how a '
                'miscalculation in unit conversion could impact experimental '
                'results.',
 'input': '',
 'output': 'Understanding metric conversions is crucial in scientific research '
           'because accurate measurements are fundamental to the validity of '
           'experimental results. The metric system is widely used in '
           'scientific disciplines due to its ease of use and universal '
           'acceptance, allowing researchers from different countries to '
           'communicate their findings effectively.\n'
           '\n'
           '   For example, consider a scenario in a chemistry experiment '
           'where a researcher needs to prepare a solution with a specific '
           'concentration. If the researcher intends to prepare a 1 molar (1 '
           'M) solution of sodium chloride (NaCl) in 1 liter of water, they '
           'must accurately measure the mass of NaCl required. The molar mass '
           'of NaCl is approximately 58.44 grams per mole. Therefore, to '
           'prepare 1 liter of a 1 M solution, the researcher needs to '
           'dissolve 58.44 grams of NaCl in water.\n'
           '\n'
           '   However, if the researcher mistakenly converts the volume from '
           'liters to milliliters and uses 1 mL instead of 1 L, they would '
           'only need 0.05844 grams of NaCl. This significant error in unit '
           'conversion would lead to a solution that is 1,000 times more '
           'concentrated than intended. Such a miscalculation could result in '
           'erroneous experimental outcomes, potentially leading to incorrect '
           'conclusions about the behavior of the solution in reactions or '
           'biological systems. This example highlights the importance of '
           'precise unit conversions in ensuring the accuracy and reliability '
           'of scientific research.'}
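
One caveat: extract_ins and extract_oup return an empty string when the expected tags are missing from the model reply, so the loop above could store an entry with an empty instruction or output. A possible guard is sketched below; `choose_entry` is a hypothetical helper, and keeping the original entry on failure is just one reasonable policy.

```python
def choose_entry(original_entry, new_instr, new_outp):
    """Use the refined pair only if both parts were extracted; otherwise keep the original entry."""
    if new_instr and new_outp:
        return {"instruction": new_instr, "input": "", "output": new_outp}
    return original_entry


original_entry = {
    "instruction": "Convert 45 kilometers to meters.",
    "input": "",
    "output": "45 kilometers is 45000 meters.",
}

# Extraction succeeded: the refined instruction and answer replace the original
refined = choose_entry(original_entry, "Explain metric conversions.", "1 km equals 1000 m.")
print(refined["instruction"])  # Explain metric conversions.

# Extraction failed (empty strings): the original entry is kept unchanged
kept = choose_entry(original_entry, "", "")
print(kept["instruction"])  # Convert 45 kilometers to meters.
```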

Let’s save the new dataset:

with open("instruction-reflected.json", "w") as file:
    json.dump(new_json_data, file, indent=4)

Reflect responses

Let’s now do the same for the response-reflection:

data_to_process = json_data[:3]

def reflect_responses(json_data, client):
    new_json_data = [] 
    
    for entry in tqdm(json_data):
        
        if not entry["input"]:
            system_prompt, prompt = res_gen_prompt_no_input(ins=entry["instruction"], outp=entry["output"])
            output = run_chatgpt(prompt=prompt, client=client, system_prompt=system_prompt)
            new_response = extract_response(output)

            if not len(new_response):
                # Fall back to the original output; wrap it in a list so [0] below returns the full string
                new_response = [entry["output"]]
                      
            new_entry = {"instruction": entry["instruction"], "input": "", "output": new_response[0]}
            new_json_data.append(new_entry)

        else:
            system_prompt, prompt = res_gen_prompt_input(ins=entry["instruction"], inp=entry["input"], outp=entry["output"])
            output = run_chatgpt(prompt=prompt, client=client, system_prompt=system_prompt)
            new_response = extract_response(output)

            if not len(new_response):
                # Fall back to the original output; wrap it in a list so [0] below returns the full string
                new_response = [entry["output"]]

            new_entry = {"instruction": entry["instruction"], "input": entry["input"], "output": new_response[0]}
            new_json_data.append(new_entry)

    return new_json_data

new_json_data = reflect_responses(data_to_process, client)

100%|█████████████████████████████████████████████| 3/3 [00:07<00:00,  2.40s/it]

for i in new_json_data[:3]:
    pprint(i)
    print("\n\n")

{'instruction': 'Evaluate the following phrase by transforming it into the '
                'spelling given.',
 'input': 'freind --> friend',
 'output': 'The input phrase "freind" contains a spelling error. The correct '
           'transformation of the word is as follows: "freind" should be '
           'corrected to "friend." Therefore, the correct spelling is '
           '"friend."'}



{'instruction': 'Edit the following sentence for grammar.',
 'input': 'He go to the park every day.',
 'output': 'The original sentence "He go to the park every day" contains a '
           'grammatical error in the verb form. The correct form should be "He '
           'goes to the park every day." This is because the subject "He" is '
           'third person singular, and in English, the verb "to go" changes to '
           '"goes" when used with third person singular subjects. Therefore, '
           'the corrected sentence is grammatically accurate and maintains the '
           'original meaning.'}



{'instruction': 'Convert 45 kilometers to meters.',
 'input': '',
 'output': 'To convert kilometers to meters, you can use the conversion factor '
           'that 1 kilometer is equal to 1,000 meters. Therefore, to convert '
           '45 kilometers to meters, you multiply 45 by 1,000. \n'
           '\n'
           'So, 45 kilometers is equal to 45,000 meters (45 km × 1,000 m/km = '
           '45,000 m). \n'
           '\n'
           'This conversion is useful in various contexts, such as distance '
           'measurement in travel or scientific calculations.'}

Let’s save the new dataset:

with open("response-reflected.json", "w") as file:
    json.dump(new_json_data, file, indent=4)
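
After saving, a quick sanity check is to reload the file and compare it with the original data. The sketch below does this on a one-entry example in a temporary directory, so it does not overwrite response-reflected.json; the file name and the `mean_output_length` helper are hypothetical, and the refined output is shortened from the example earlier in this notebook.

```python
import json
import tempfile
from pathlib import Path

original = [
    {"instruction": "Convert 45 kilometers to meters.",
     "input": "",
     "output": "45 kilometers is 45000 meters."},
]
reflected = [
    {"instruction": "Convert 45 kilometers to meters.",
     "input": "",
     "output": "To convert kilometers to meters, you can use the conversion factor "
               "that 1 kilometer is equal to 1000 meters."},
]


def mean_output_length(entries):
    """Average number of characters in the "output" field."""
    return sum(len(e["output"]) for e in entries) / len(entries)


with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "response-reflected-demo.json"
    with open(path, "w") as file:
        json.dump(reflected, file, indent=4)
    with open(path, "r") as file:
        reloaded = json.load(file)

print(reloaded == reflected)  # True
print(mean_output_length(original) < mean_output_length(reloaded))  # True
```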

Creating Improved Instruction Data

Applying the two methodologies above to all 1100 entries in the chapter 7 instruction dataset costs about $0.60 (60 cents)
To avoid bloating the GitHub repository with dataset files, the resulting dataset files are available from Google Drive:
- instruction-reflected.json
- response-reflected.json
