Create “Passive Voice” Entries for an Instruction Dataset

Supplementary code for the Build a Large Language Model From Scratch book by Sebastian Raschka

Code repository: https://github.com/rasbt/LLMs-from-scratch

  • This notebook uses OpenAI’s GPT-4 to create “passive voice” entries for an instruction dataset, as shown in the example below

{  
   'instruction': 'Identify the verb in the following sentence',
   'input': 'The cat sleeps on the couch.',
   'output': 'The verb in the sentence is "sleeps."',
   'output_2': 'The sentence is "sleeps."'   #  <---- Newly created entry
}  
# pip install -r requirements-extra.txt
from importlib.metadata import version

pkgs = ["openai",  # OpenAI API
        "tqdm",    # Progress bar
       ]

for p in pkgs:
    print(f"{p} version: {version(p)}")
openai version: 0.27.1
tqdm version: 4.65.0

Test OpenAI API

  • First, we need to provide our OpenAI API secret key, which can be found at https://platform.openai.com/api-keys

  • Make sure not to share this key with anyone

  • Add this secret key ("sk-...") to the config.json file in this folder, following the format sketched below
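
  • For reference, config.json should contain a single JSON object with the key name read by the loading code below; a minimal sketch (with "sk-..." replaced by your actual key):

{
    "OPENAI_API_KEY": "sk-..."
}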

import json
from openai import OpenAI

# Load API key from a JSON file. 
# Make sure to replace "sk-..." with your actual API key from https://platform.openai.com/api-keys
with open("config.json", "r") as config_file:
    config = json.load(config_file)
    api_key = config["OPENAI_API_KEY"]

client = OpenAI(api_key=api_key)
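
  • As a side note, the config file is only one way to provide the key: when no api_key argument is passed, the OpenAI client falls back to the OPENAI_API_KEY environment variable, so an equivalent (optional) setup would be the following sketch:

import os

# Optional alternative to the config.json approach above: expose the key as an
# environment variable, which OpenAI() reads automatically when api_key is omitted
os.environ["OPENAI_API_KEY"] = api_key
client = OpenAI()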
  • First, let’s try the API with a simple example to make sure it works as intended:

def run_chatgpt(prompt, client, model="gpt-4-turbo"):
    # Send a single user message to the chat completions endpoint and return
    # the model's text reply; temperature=0.0 keeps the output (nearly) deterministic
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return response.choices[0].message.content


# Prepare input
sentence = "I ate breakfast"
prompt = f"Convert the following sentence to passive voice: '{sentence}'"
run_chatgpt(prompt, client)
'Breakfast was eaten by me.'

Create JSON Entries

  • Next, we load the file we want to modify:

import json

json_file = "instruction-examples.json"

with open(json_file, "r") as file:
    json_data = json.load(file)
    
print("Number of entries:", len(json_data))
Number of entries: 200
  • And we try the OpenAI chat API on a small sample first to ensure that it works correctly:

for entry in json_data[:5]:
    text = entry["output"]
    prompt = f"Without adding any response or explanation, convert the following text to passive voice: {text}"
    
    print("\nInput:")
    print(">>", text)
    print("\nOutput:")
    print(">>", run_chatgpt(prompt, client))
    print("\n-------------------------")
Input:
>> The verb in the sentence is "sleeps."

Output:
>> The sentence is "sleeps."

-------------------------

Input:
>> The plural form of "goose" is "geese."

Output:
>> The plural form of "goose" is referred to as "geese."

-------------------------

Input:
>> The three primary colors are red, blue, and yellow.

Output:
>> Red, blue, and yellow are considered the three primary colors.

-------------------------

Input:
>> They had finished the game.

Output:
>> The game had been finished by them.

-------------------------

Input:
>> The abbreviation for "Doctor of Philosophy" is Ph.D.

Output:
>> The abbreviation "Ph.D." is used for "Doctor of Philosophy".

-------------------------
  • Let’s now extend the code to write the generated entries back into json_data and to add a progress bar:

from tqdm import tqdm  # a progress bar tool


for i, entry in tqdm(enumerate(json_data[:5]), total=len(json_data[:5])):
    text = entry["output"]
    prompt = f"Without adding any response or explanation, convert the following text to passive voice: {text}"
    json_data[i]["output_2"] = run_chatgpt(prompt, client)
100%|██████████████████████████████████████████████████████████████████████| 5/5 [00:04<00:00,  1.23it/s]
  • One more time, let’s make sure that the new entries ("output_2") look OK

json_data[0]
{'instruction': 'Identify the verb in the following sentence: The cat sleeps on the couch.',
 'input': '',
 'output': 'The verb in the sentence is "sleeps."',
 'output_2': 'The sentence is "sleeps."'}
  • Finally, if everything above looks OK, let’s run the conversion to passive voice on our entire json dataset (this takes about 3 minutes):

for i, entry in tqdm(enumerate(json_data), total=len(json_data)):
    text = entry["output"]
    prompt = f"Without adding any response or explanation, convert the following text to passive voice: {text}"
    json_data[i]["output_2"] = run_chatgpt(prompt, client)
100%|██████████████████████████████████████████████████████████████████| 200/200 [03:43<00:00,  1.12s/it]
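
  • Note that the loop above stops on the first API error; for long runs, one option (not part of the original code) is a small retry wrapper around run_chatgpt, sketched below with a hypothetical max_retries parameter:

import time


def run_chatgpt_with_retries(prompt, client, max_retries=3):
    # Retry a few times with a short pause to ride out transient API errors
    # (e.g., rate limits or connection hiccups); re-raise on the final attempt
    for attempt in range(max_retries):
        try:
            return run_chatgpt(prompt, client)
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(5)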
  • After the conversion is completed, we save the file:

new_json_file = json_file.replace(".json", "-modified.json")


with open(new_json_file, "w") as file:
    json.dump(json_data, file, indent=4)  # "indent" for pretty-printing
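
  • As an optional sanity check, we can reload the saved file and confirm that every entry now contains an "output_2" field:

with open(new_json_file, "r") as file:
    saved_data = json.load(file)

print("Entries with 'output_2':", sum("output_2" in entry for entry in saved_data))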