Dataset File Format
Your training dataset should be ajsonl file, meaning that each line should contain a json object representing a chat. A chat json objects should contain a messages field, which looks like this:
role and a content field. The content field represents - as the name suggests - the content of a message and is a string. A role can be one of the following:
system: The system prompt indicating general model behaviour. The system prompt only exists once in a chat and has to be the first message.user: The user role represents the person interacting with your LLM.assistant: The assistant role represents the LLM. Its messages are the ones that the model will be fine-tuned on.
- The first message has to be a
systemprompt. - The next message has to be a
usermessage, and the following messages should alternate betweenassistantanduser. - The last message has to be an
assistantmessage.
Dataset Size & Context Lengths
We recommend having at least 100 training chats to obtain good training results. We currently support the following maximum context lengths:- 7B Models: 2700 tokens
- 13B Models: 2100 tokens