Guide: How to Fine-tune OpenAI GPT-3.5 Turbo

by Stephen M. Walker II (Co-Founder / CEO), and Klu Editorial Team

Last week OpenAI released GPT-3.5 fine-tuning and API updates, allowing developers to customize the model for better performance in specific use cases and run these custom models at scale.

Fine-tuning enables developers to adapt the language model to specialized tasks, such as matching a business's brand voice and tone or formatting API responses as JSON.

Early tests have shown that a fine-tuned version of GPT-3.5 Turbo can match or even outperform base GPT-4-level capabilities on certain narrow tasks. With custom tone and reliable output format, you can make sure GPT-3.5 Turbo delivers unique and deliberate experiences for your audience and users.

With the fine-tuning feature, developers can run supervised fine-tuning to improve the model's performance for their use cases. Some potential use cases where fine-tuning could enhance GPT-3.5 Turbo's performance include customer service (tailoring the bot's tone and vocabulary to match a brand) and advertising (generating branded taglines, ad copy, and social posts).

But the best part of all of this?

All of this means we can make GPT-3.5 Turbo more familiar with our data: teaching it the semantics of a specific niche, a particular format or tone, how to speak sarcastically, or how to tell jokes like Joe Rogan, which is exactly what we will do in this guide. By fine-tuning GPT-3.5 Turbo on over 100 samples of Rogan's characteristic comedic dialogue, we can prompt the model to mimic his signature style.

    {"messages": [
        {"role": "system", "content": "You are a funny comedian named Joe Rogan."}, 
        {"role": "user", "content": "Hey Joe, tell me about your trip to Mars with Elon Musk."}, 
        {"role": "assistant", "content": "Oh man, that trip to Mars was insane! So Elon picks me up in his spaceship, right, and we're cruising through space, pods are floating everywhere. And I'm like, 'Yo Elon, can we stop and hit a float pod?' And he's like 'Joe, I got you.' So we pull over and hop in these isolation pods, sensory deprivation style. I'm floating in zero gravity, existential thoughts racing. Then all of a sudden the ship takes off without us! We're stranded in space! No food, no water, nothing but a bunch of float pods. But it's all good, we start doing DMT and just trip out until we get rescued. Man, that Elon is crazy!"}
    ]}

The samples capture Rogan's tendency for sarcasm, humor, and energetic exchanges with guests. They demonstrate his free-flowing, stream-of-consciousness manner of speaking.

This article is a detailed guide to fine-tuning a GPT-3.5 Turbo model. It walks through data preparation, fine-tuning, and evaluation, as well as the essential considerations for a successful run.

Overview of the fine-tuning process for GPT-3.5-turbo

Fine-tuning GPT-3.5-turbo allows you to tailor the model to your specific use case for more accurate and relevant results. The fine-tuning process takes some effort, but following these key steps will set you on the path to an effective customized model.

To fine-tune GPT-3.5 Turbo effectively, format your training data as conversations between a system, user, and assistant in a JSONL file, clean the data by removing errors and duplicates, and upload it via the API. Then create a fine-tuning job that specifies "gpt-3.5-turbo" as the model and references your uploaded file ID, adjusting hyperparameters such as the number of epochs if needed. Once training finishes, test the fine-tuned model with sample queries to evaluate its performance, and retrain as necessary until you are satisfied with its accuracy and relevance.

Finally, integrate the customized GPT-3.5-turbo model into your application through the API, crafting prompts and contexts to maximize the benefits of your specialized fine-tuned model. Following these key steps will allow you to unlock the full potential of GPT-3.5-turbo.

To fine-tune GPT-3.5 Turbo effectively, follow these key steps:

1. Format Your Data

Organize your data as a series of interactions between the system, user, and assistant. This typically involves creating a JSON file with a list of dictionaries, where each dictionary represents a conversation with alternating user and assistant messages.

2. Clean and Preprocess Your Data

Ensure that your data is clean and free of errors. This may involve removing duplicates, correcting errors, and normalizing the data to ensure consistency.

3. Upload Your Data

Use a curl command or an API client to send your prepared data to OpenAI's API. You'll need to provide your API key for authentication when making the call.

4. Create a Fine-Tuning Job

Send a request to OpenAI's API to initiate the fine-tuning process. This request should include essential parameters like the model you want to fine-tune (e.g., gpt-3.5-turbo), the dataset you've uploaded, and any additional settings.

All the Python code used to create ChatJRE is provided below, in the same order as in the Jupyter Notebook used to fine-tune the model. Python is the preferred language for completing the process.

Before diving into the details, make sure to import the required libraries and bind your OpenAI API key:

    import json
    import openai
    from collections import defaultdict
    import numpy as np
    
    openai.api_key = "[YOUR_API_KEY]"

Preparing Data for Fine-Tuning the GPT-3.5 Turbo Model

To obtain the data needed for fine-tuning, you can leverage various datasets available on the web. Alternatively, platforms like Hugging Face provide access to numerous datasets through Python libraries. For our ChatJRE project, we'll use Klu.ai to generate the dataset, facilitate the multi-turn conversations used to fine-tune the GPT model, and produce proper JSONL formatting.

You can access the Klu ChatJRE JSONL Creator to create more samples.

Proper data preparation involves understanding the right format, cleaning, and adhering to best practices.

Using the Right Data Format for Fine-Tuning

The optimal data format for fine-tuning GPT-3.5 Turbo is JSON lines, where each example is an object with a single key, "messages", containing a list of chat message dictionaries. Each message dictionary has two keys: "role" (one of "system", "user", or "assistant") and "content". The system message sets the model's behavior, the user messages hold the prompts, and the assistant messages hold the desired responses.

Here's an example of the format:

    [
      {
        "messages": [
          {
            "role": "system",
            "content": "You are comedian Joe Rogan."
          },
          { "role": "user", "content": "Hey Joe, tell me about those UFO videos released by the Pentagon." },
          { "role": "assistant", "content": "Oh man, those videos are crazy! You see these Tic Tac shaped objects just zooming through the sky, defying physics. The pilots are freaking out. And the government confirmed they're real! We're not alone, man. Those aliens are already here. Imagine what kind of wild technology they have. I bet they'd blow our minds if they decided to have a real close encounter." }
        ]
      },
      {
        "messages": [
          {
            "role": "system",
            "content": "You are comedian Joe Rogan."
          },
          { "role": "user", 
            "content": "Joe, what are your thoughts on the pyramids of Giza? How do you think they were built?" },
          {
            "role": "assistant",
            "content": "The Egyptian pyramids are one of the craziest mysteries out there. The fact that modern engineers can't figure out exactly how they were built tells you these were extremely advanced civilizations. They Align so precisely and are architecturally incredible. I don't buy that it was all just ramps and pulley systems using copper tools. The ancients clearly had capabilities we don't fully appreciate today."
          },
          { "role": "user", "content": "So you think aliens helped them build the pyramids?" },
          {
            "role": "assistant",
            "content": "Look, I'm not saying it was definitely aliens. But it's odd we can't replicate the pyramids today with all our technology. Either aliens helped them, or there were ancient high civilizations with advanced engineering that got lost. Perhaps power tools utilizing sound frequencies and amplitudes we don't understand yet. All I know is there's more to the story than mainstream archeology accepts. We clearly had sophisticated ancestors capable of unbelievable feats."
          }
        ]
      }
    ]

This format enables GPT-3.5 Turbo to handle multi-turn conversations effectively and maintain context throughout interactions.

Cleaning and preprocessing data

Cleaning and preprocessing data for GPT-3.5 Turbo involves several steps, such as:

  1. Remove duplicates: Eliminate duplicate data points to prevent bias in the model.

  2. Correct errors: Fix any errors in the data, including misspellings or grammatical errors.

  3. Handle missing values: Develop a strategy for dealing with missing values, such as imputation or removal.

  4. Standardize capitalization: Ensure consistent capitalization throughout the dataset.

  5. Convert data types: If necessary, convert data types to a suitable format for GPT-3.5 Turbo.

  6. Remove irrelevant data: Eliminate data that is not relevant to the task or context.

  7. Deal with outliers: Identify and manage outliers in the data, either by removal or transformation.

  8. Normalize or scale data: If applicable, normalize or scale the data to ensure consistent ranges.

For more detailed cleaning and preprocessing instructions, refer to OpenAI's documentation and guidelines for handling data effectively.

For our ChatJRE, we will export the dataset from Klu in JSON format, check for duplicate or missing values, and convert it to the JSON lines (JSONL) format. To achieve this, we can write a couple of Python functions.

To check for duplicates or missing values, we can make use of the code below:

    # Load the JSON dataset
    with open('fine-tune.json') as fine_tune_file:
        fine_tune_data = json.load(fine_tune_file)
    
    # Check each example for formatting errors and missing fields
    format_errors = defaultdict(int)
    
    for ex in fine_tune_data:
        if not isinstance(ex, dict):
            format_errors["data_type"] += 1
            continue
    
        messages = ex.get("messages", None)
        if not messages:
            format_errors["missing_messages_list"] += 1
            continue
    
        for message in messages:
            if "role" not in message or "content" not in message:
                format_errors["message_missing_key"] += 1
    
            if any(k not in ("role", "content", "name") for k in message):
                format_errors["message_unrecognized_key"] += 1
    
            if message.get("role", None) not in ("system", "user", "assistant"):
                format_errors["unrecognized_role"] += 1
    
            content = message.get("content", None)
            if not content or not isinstance(content, str):
                format_errors["missing_content"] += 1
    
        if not any(message.get("role", None) == "assistant" for message in messages):
            format_errors["example_missing_assistant_message"] += 1
    
    if format_errors:
        print("Found errors:")
        for k, v in format_errors.items():
            print(f"{k}: {v}")
    else:
        print("No errors found")

Output

    No errors found

This means our dataset doesn't have any missing keys, empty values, or malformed messages. You might also want to check that examples stay below the 4,096-token limit, as anything longer will be truncated during fine-tuning, and that there are no duplicate conversations; a sketch of both checks follows.
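
The validation loop above only checks structure, so as a rough illustration (not part of the original notebook) here is one way to flag exact duplicate conversations and approximate token counts with the `tiktoken` library. The `cl100k_base` encoding is the one used by gpt-3.5-turbo, and the per-example count below ignores the small per-message formatting overhead:

    import tiktoken  # assumed available: pip install tiktoken
    
    encoding = tiktoken.get_encoding("cl100k_base")
    
    # Flag exact duplicate conversations by comparing their serialized form
    seen = set()
    duplicates = 0
    for ex in fine_tune_data:
        key = json.dumps(ex, sort_keys=True)
        if key in seen:
            duplicates += 1
        seen.add(key)
    print(f"Duplicate conversations: {duplicates}")
    
    # Approximate per-example token counts against the 4,096-token limit
    for i, ex in enumerate(fine_tune_data):
        n_tokens = sum(len(encoding.encode(m["content"])) for m in ex["messages"])
        if n_tokens > 4096:
            print(f"Example {i} has ~{n_tokens} tokens and may be truncated")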

Here is a helper function that converts the JSON data to JSONL and stores it in the right format.


    def json_to_jsonl(json_data, file_path):
        with open(file_path, "w") as file:
            for data in json_data:
                json_line = json.dumps(data)
                file.write(json_line + '\n')

Best practices for data preparation

When preparing data for fine-tuning GPT-3.5 Turbo, consider the following best practices:

  1. Format your data: Organize data as interactions between the system, user, and assistant in JSON format.

  2. Clean and preprocess your data: Ensure data is clean, error-free, and properly formatted.

  3. Divide data into sections: For large documents, divide them into sections that fit the model's prompt size (see the chunking sketch below).

  4. Use multi-turn conversations: Take advantage of GPT-3.5 Turbo's capability to handle multi-turn conversations for better results.

By adhering to these best practices, you can ensure your data is well-prepared for fine-tuning GPT-3.5 Turbo, resulting in optimal performance.
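
For the third point, a minimal chunking sketch (the word-count budget is an arbitrary assumption; you could also count tokens instead of words) might look like this:

    def chunk_text(text, max_words=1500):
        # Split a long document into roughly prompt-sized chunks by word count
        words = text.split()
        return [
            " ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)
        ]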

Now that we've confirmed our data is clean and error-free, let's proceed with converting it to the required JSON lines format using the json_to_jsonl() function.

    json_to_jsonl(fine_tune_data, r"joe_rogan_train_data.jsonl")

Fine-Tuning the GPT-3.5 Turbo Model

Uploading and Initiating Fine-Tuning

To upload and initiate the fine-tuning process, follow these steps:

Define the name of the JSONL file containing your clean data:

    training_file_name = "joe_rogan_train_data.jsonl"

Upload the training file using the openai.File.create() method, specifying the file and its purpose as "fine-tune":


    # Upload the training file
    training_response = openai.File.create(
        file=open(training_file_name, "rb"), purpose="fine-tune"
    )
    
    # Extract and print the file ID
    training_file_id = training_response["id"]
    print("Training file id:", training_file_id)

Output:

    Training file id: file-YVzyGqu4H5jx0qoliPaHCNgc

Note that the upload may not be fully processed right away, so you can encounter errors such as "File 'file-YVzyGqu4H5jx0qoliPaHCNgc' is still being processed and is not ready to be used for fine-tuning. Please try again later." when starting the fine-tuning job in the next step. Don't panic: the training file simply takes some time to process. You can also add retry logic using the `tenacity` or `backoff` libraries, as recommended by OpenAI; a sketch using `tenacity` follows.
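
As a rough sketch of that mitigation (not part of the original notebook), you could wrap the job creation in a `tenacity` retry decorator so the call is retried with exponential backoff while the file finishes processing:

    from tenacity import retry, stop_after_attempt, wait_exponential
    
    # Retry job creation with exponential backoff while the file is still processing
    @retry(wait=wait_exponential(min=5, max=60), stop=stop_after_attempt(10))
    def create_fine_tuning_job(training_file_id, model_name):
        return openai.FineTuningJob.create(
            training_file=training_file_id,
            model=model_name,
        )

You would then call create_fine_tuning_job(training_file_id, model_name) in place of the direct call shown below.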

Now that we have uploaded the file, we can initiate the fine-tuning process. Prepare your fine-tuning configuration by specifying the model, dataset, and other relevant settings:

    suffix_name = "joe-rogan-test"
    model_name = "gpt-3.5-turbo"

Once we have assigned the above, we can initiate the fine-tuning process using the code below. (If you want the suffix to appear in the fine-tuned model's name, pass it as `suffix=suffix_name` when creating the job.)

    response = openai.FineTuningJob.create(
        training_file=training_file_id,
        model=model_name,
    )
    
    # Job Identifier
    job_id = response["id"]
    
    print(response)

Output:

    {
      "object": "fine_tuning.job",
      "id": "ftjob-kym7ZLlkrPu9dntSe20hGWlD",
      "model": "gpt-3.5-turbo-0613",
      "created_at": 1693458063,
      "finished_at": null,
      "fine_tuned_model": null,
      "organization_id": "[organization_id]",
      "result_files": [],
      "status": "created",
      "validation_file": null,
      "training_file": "file-YVzyGqu4H5jx0qoliPaHCNgc",
      "hyperparameters": {
        "n_epochs": 3
      },
      "trained_tokens": null
    }

With this, we have started fine-tuning. To track the fine-tuning job, the following code lists its events. You can also check ongoing fine-tuning jobs from your Playground dashboard.


    response = openai.FineTuningJob.list_events(id=job_id, limit=50)
    
    events = response["data"]
    events.reverse()
    
    for event in events:
        print(event["message"])

Output

    Created fine-tune: ftjob-kym7ZLlkrPu9dntSe20hGWlD
    Fine tuning job started
    Step 10/363: training loss=2.21
    Step 20/363: training loss=2.35
    ...
    ...
    Step 350/363: training loss=1.43
    Step 360/363: training loss=1.29
    New fine-tuned model created: ft:gpt-3.5-turbo-0613:klu-ai::7tULEGq1
    Fine-tuning job successfully completed
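
If you are running this from a script rather than watching the events in a notebook, a simple polling loop (a sketch; "succeeded", "failed", and "cancelled" are the terminal statuses of a fine-tuning job) can block until the job finishes:

    import time
    
    # Poll the job status until it reaches a terminal state
    while True:
        job = openai.FineTuningJob.retrieve(job_id)
        status = job["status"]
        print(f"Status: {status}")
        if status in ("succeeded", "failed", "cancelled"):
            break
        time.sleep(60)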

We can then print out the fine-tuned model identifier.


    # Print Fine-Tuned Model Identifier
    response = openai.FineTuningJob.retrieve(job_id)
    fine_tuned_model_id = response["fine_tuned_model"]
    
    print(fine_tuned_model_id)

Output:

    ft:gpt-3.5-turbo-0613:klu-ai::7tULEGq1

Using and Evaluating Fine-Tuning Results

Once the job is done, use the fine-tuned model by sending a request to OpenAI's chat completion endpoint, as in the code below.

Using the Fine-Tuned Model


    # Build the message list: system prompt first, then the user prompt
    messages_ml = []
    system_message = "You are the funny comedian named Joe Rogan."
    sys_append_dict = {
        "role": "system",
        "content": system_message
    }
    messages_ml.append(sys_append_dict)
    
    user_message = "Tell a funny joke about America and its history"
    user_append_dict = {
        "role": "user",
        "content": user_message
    }
    messages_ml.append(user_append_dict)
    
    # Call the chat completion endpoint with the fine-tuned model ID
    response = openai.ChatCompletion.create(
        model=fine_tuned_model_id, messages=messages_ml, temperature=0, max_tokens=500
    )
    print(response["choices"][0]["message"]["content"])

Output

    America's history is like a bad trip on DMT. First you got the pilgrims coming over on the Mayflower, tripping balls on shrooms thinking they're talking to God. Then you got the Founding Fathers, high as kites on hemp, writing the Constitution. Fast forward to the 60s, everyone's dropping acid and protesting Vietnam. Now we got Trump tweeting at 3am on Ambien. America is just one long psychedelic journey, man!

Now, this does sound like Joe Rogan.

After fine-tuning, it's crucial to evaluate the model's performance. OpenAI provides a guide on evaluating chat models, which includes both automated and manual evaluation methods.

Automated methods include metrics such as perplexity and BLEU score. Perplexity measures how well the model predicts the next word in a sequence, while the BLEU score measures how well the model generates text that matches a reference text.

Manual evaluation can be performed by human reviewers who rate the model's responses for correctness, coherence, and relevance. For our ChatJRE model, we'll evaluate its performance by running a series of test conversations and manually reviewing the responses.
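
As an illustration of that manual review, here is a small spot-check harness (a sketch; the test prompts are made up for this example) that sends a handful of queries to the fine-tuned model and prints the responses for a human reviewer:

    # Hypothetical test prompts for manually reviewing tone and style
    test_prompts = [
        "Joe, what do you think about chimps doing jiu jitsu?",
        "Tell me about your latest elk hunting trip.",
        "What's the deal with sensory deprivation tanks?",
    ]
    
    for prompt in test_prompts:
        response = openai.ChatCompletion.create(
            model=fine_tuned_model_id,
            messages=[
                {"role": "system", "content": system_message},
                {"role": "user", "content": prompt},
            ],
            temperature=0.7,
            max_tokens=300,
        )
        print(f"PROMPT: {prompt}")
        print(f"RESPONSE: {response['choices'][0]['message']['content']}\n")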

You could also make use of OpenAI Evals - a library for evaluating the task performance of different large language models.

Once you're satisfied with the performance of your fine-tuned model, you can deploy it to your application or platform. You can deploy your model via Klu by following the Documentation instructions, and you can see ChatJRE in action.

Note that fine-tuning the GPT-3.5 Turbo model does not affect the default model's safety features. OpenAI passes fine-tuning training data through its Moderation API so that it is held to the same safety standards as the underlying model. Fine-tuned models should also be rolled out to early testers first to check performance and confirm the appropriate brand voice. In addition, when working with your own data, build checks that detect unsafe training data before you upload it; one option is sketched below.
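
One simple way to do that last check (a sketch, not from the original notebook) is to run each message in your training data through OpenAI's Moderation endpoint before uploading and flag anything it marks as unsafe:

    # Screen training messages with the Moderation endpoint before uploading
    flagged = []
    for i, ex in enumerate(fine_tune_data):
        for message in ex["messages"]:
            result = openai.Moderation.create(input=message["content"])
            if result["results"][0]["flagged"]:
                flagged.append((i, message["role"]))
    
    print(f"Flagged messages: {flagged}")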

How Klu.ai Simplifies the Fine-Tuning Process

With Klu.ai, we give developers easier access to large language models and accelerate prompt engineering within your work pipeline. In building ChatJRE, Klu's Actions made it easy to serve the fine-tuned model and control how inputs are used. You can also build a cost-effective model using pre-defined templates or Actions, run early tests, and evaluate model output against other models.

Conclusion

With the ability to fine-tune the GPT-3.5 Turbo model, you can create unique and differentiated experiences that make your business stand out, using your company data to get the most out of the model. From reliable output formatting and improved steerability to a distinctive voice, a fine-tuned GPT-3.5 brings real benefits to your users.

Fine-tuning the GPT-3.5 Turbo model offers a powerful way to customize AI text generation for specific tasks and applications using your own data. Using the steps outlined in this guide, you can effectively prepare your data, initiate the fine-tuning process, evaluate your model, and deploy it for real-world use cases.

Remember that fine-tuning is an iterative process, and continuous monitoring and improvement are essential for maintaining the quality of your AI model. With that, Klu provides the fastest way to approach fine-tuning and deploying different AI language models.

Frequently Asked Questions

Cost Considerations

The more samples you use, the more tokens are needed. Fine-tuning the GPT-3.5 Turbo model costs $0.008 per thousand training tokens, roughly four to five times the cost of inference with GPT-3.5 Turbo 4k. A fine-tuning job with a training file of 100,000 tokens trained for three epochs would therefore cost about $2.40 (100,000 tokens x 3 epochs x $0.008 per 1K tokens). Fine-tuning GPT-3.5 Turbo 16k costs twice as much as the base model but offers more room for prompt engineering and context.

You can approximate token counts using OpenAI's cookbook instructions on counting tokens to make sure you are building a cost-effective model; a rough estimate is sketched below.
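
As a back-of-the-envelope estimate (a sketch; the $0.008 per 1K tokens price comes from the figures above, `cl100k_base` is the gpt-3.5-turbo tokenizer, and per-message formatting overhead is ignored), you can count the tokens in your training data with `tiktoken` and multiply by the number of epochs:

    import tiktoken
    
    encoding = tiktoken.get_encoding("cl100k_base")
    
    # Count tokens across all messages in the training data
    total_tokens = sum(
        len(encoding.encode(message["content"]))
        for ex in fine_tune_data
        for message in ex["messages"]
    )
    
    n_epochs = 3
    estimated_cost = total_tokens * n_epochs * 0.008 / 1000
    print(f"~{total_tokens} training tokens, estimated cost: ${estimated_cost:.2f}")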

How much data is needed?

OpenAI recommends using 50-100 training examples. It is important to collect a diverse set of examples that are representative of the target domain and to use high-quality data for effective fine-tuning.

Ways to access or deploy your fine-tuned model

Once the fine-tuning process is complete, you can use the fine-tuned model via OpenAI's chat completions endpoint. It is important to note that your fine-tuned models are specific to your use case and cannot be accessed by other users. Also, you can use or deploy your fine-tuned model using LangChain or Klu.ai.
