Text Classification (Approach 2)
1. Objective
This guide provides step-by-step instructions on finetuning a model for Text Classification tasks on Emissary using our novel classification approach. In this approach, we add a classification head on top of the base LLMs that returns probabilities for a given set of labels. We recommend using Llama3.1-8B-instruct for this task.
2. Dataset Preparation
Prepare your dataset in the appropriate format for the classification task.
Classification Data Format
Each entry should contain:
- Prompt: The input text for classification.
- Completion: A binary value (0 or 1) indicating the classification output. Only 1 or 0 should be used to denote the classification result.
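Before uploading, it can help to sanity-check that every entry follows this format. The sketch below is our own helper (not part of the Emissary platform) that flags entries whose prompt is empty or whose completion is anything other than a strict 0 or 1:

```python
def validate_binary_entry(entry: dict) -> list:
    """Return a list of problems found in one binary-classification entry."""
    problems = []
    prompt = entry.get("prompt")
    if not isinstance(prompt, str) or not prompt.strip():
        problems.append("prompt must be a non-empty string")
    completion = entry.get("completion")
    # Reject booleans explicitly: in Python, True == 1 would otherwise pass.
    if isinstance(completion, bool) or completion not in (0, 1):
        problems.append("completion must be exactly 0 or 1")
    return problems

# Example usage
assert validate_binary_entry({"prompt": "Some input text", "completion": 1}) == []
print(validate_binary_entry({"prompt": "Some input text", "completion": "yes"}))
```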
For Binary Classification Data (Single Label)
JSONL Format
{"prompt": "This is a sample text for binary classification", "completion": 1}
CSV Format
input,expected_output
"This is a sample text for binary classification",1
"This is another sample text for binary classification",0
For Multi-Label Binary Classification
In cases where multiple labels are evaluated independently, the completion should be a dictionary where each label is mapped to a binary value (0 or 1) to indicate its relevance.
JSONL Format
{"prompt": "This is a sample text for classification", "completion": {"Label_1": 1, "Label_2": 0, "Label_3": 1, "Label_N": 0}}
CSV Format
input,expected_output
"This is a sample text for binary classification", "{"Label_1": 1, "Label_2": 0, "Label_3": 1, "Label_N": 0}"
3. Finetuning Preparation
Please refer to the in-depth guide on Finetuning on Emissary here - Quickstart Guide.
Create Model Service
Navigate to the Dashboard, which opens to Model Services, the default page on the Emissary platform.
- Click + NEW SERVICE in the dashboard.
- In the pop-up, enter a new model service name, and click CREATE.
Uploading Datasets
A tile is created for your task. Click MANAGE to enter the task workspace.
- Click MANAGE in the Datasets Available tile.
- Click + UPLOAD DATASET and select training and test datasets.
- Name datasets clearly to distinguish between training and test data (e.g., train_classification_data.csv, test_classification_data.csv).
4. Model Finetuning
Now, go back one panel by clicking OVERVIEW and then click MANAGE in the Training Jobs tile.
Here, we’ll kick off finetuning. The shortest path to finetuning a model is to click + NEW TRAINING JOB, name the output model, pick a backbone (base model), select the training dataset uploaded in the previous step, and click START NEW TRAINING JOB.
Selecting Classification Option
When creating a new training job, you need to specify that you are performing a classification task to utilize the novel classification approach.
In the Training Job Creation page, locate the Task Type option. Select Classification from the given options.
This selection ensures that a classification head is added on top of the base LLM, enabling the model to return probabilities for the specified labels.
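Conceptually, a classification head is a small projection applied to the LLM's final hidden state, with a sigmoid per label so each label gets an independent probability. The sketch below is purely illustrative (Emissary's actual head architecture is not documented here), using NumPy in place of a deep-learning framework:

```python
import numpy as np

def classification_head(hidden_state, W, b):
    """Map the LLM's final hidden state to independent per-label probabilities."""
    logits = hidden_state @ W + b          # shape: (n_labels,)
    return 1.0 / (1.0 + np.exp(-logits))   # sigmoid: one probability per label

# Toy dimensions for illustration; a real LLM hidden state is much larger.
rng = np.random.default_rng(0)
hidden_dim, n_labels = 16, 4
hidden = rng.standard_normal(hidden_dim)
W = rng.standard_normal((hidden_dim, n_labels)) * 0.1
b = np.zeros(n_labels)

probs = classification_head(hidden, W, b)
print(probs)  # four values in (0, 1), one per label
```

During finetuning, the weights `W` and `b` (and optionally the backbone) are trained against the 0/1 completions with a binary cross-entropy style objective.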
A custom function is provided that calculates a matching score between the expected and predicted outputs, based on how well the binary expected labels match the predicted probabilities. Uncomment "sample_match_binary_labels_metric" to use it.
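To make the idea concrete, here is one plausible sketch of such a matching score. This is our own illustration, not the platform's "sample_match_binary_labels_metric": it binarizes each predicted probability at a threshold and reports the fraction of labels that agree with the expected 0/1 values.

```python
def match_binary_labels(expected: dict, predicted: dict, threshold: float = 0.5) -> float:
    """Fraction of labels whose thresholded probability equals the expected 0/1 value."""
    matches = sum(
        1 for label, want in expected.items()
        if (predicted.get(label, 0.0) >= threshold) == bool(want)
    )
    return matches / len(expected)

# Example usage
expected = {"Label_1": 1, "Label_2": 0, "Label_3": 1}
predicted = {"Label_1": 0.91, "Label_2": 0.12, "Label_3": 0.33}
print(match_binary_labels(expected, predicted))  # 2 of 3 labels match -> 0.666...
```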
Training Parameter Configuration
Please refer to the in-depth guide on configuring training parameters here - Finetuning Parameter Guide.
5. Model Monitoring & Evaluation
Using Test Datasets
Including a test dataset allows you to evaluate the model's performance during training.
- Per Epoch Evaluation: The platform evaluates the model at each epoch using the test dataset.
- Metrics and Outputs: View evaluation metrics and generated outputs for test samples.
- After training completes, check scores under Training Job --> Artifacts.
6. Deployment
Refer to the in-depth walkthrough on deploying a model on Emissary here - Deployment Guide.
Deploying your models allows you to serve them and integrate them into your applications.
Finetuned Model Deployment
- Navigate to the Training Jobs Page. From the list of finetuning jobs, select the one you want to deploy.
- Go to the ARTIFACTS tab.
- Select a Checkpoint to Deploy.
7. Best Practices
- Start Small: Begin with a smaller dataset to validate your setup.
- Monitor Training: Keep an eye on training logs and metrics.
- Iterative Testing: Use the test dataset to iteratively improve your model.
- Data Format: Use the recommended data formats for your chosen model to ensure compatibility and optimal performance.