Kolosal Plane is an open-source data augmentation pipeline that enables developers to generate synthetic conversational data using powerful large language models (LLMs) and use it to fine-tune smaller language models (SLMs). It simulates multi-turn Q&A conversations by leveraging models like GPT-4 or Anthropic Claude (v3.5/v3.7) as the conversation generator, and produces high-quality dialogue datasets for training compact models such as LLaMA (1B/3B), Gemma (1.1B/2B/4B), Mistral, etc. The goal is to bridge the performance gap between massive LLMs and smaller, more efficient models by augmenting training data with LLM-generated conversations.
- Synthetic Q&A Generation: Simulate dialogues by iteratively generating questions and answers. You can input a topic or context document, and Kolosal Plane will produce a realistic multi-turn conversation around it.
- Modular & Extensible: Plug in your own LLM as the teacher; the async interface supports the OpenAI API, local Hugging Face models, and more (a conceptual sketch follows this list).
- Example Notebook & UI: Includes an `example.ipynb` notebook for a quick start and a Streamlit web UI (in progress) to interactively configure augmentation settings and preview results.
- Community-Driven: Join our Discord community for support, ideas, and collaboration. We welcome feedback and bug reports (please file an issue on GitHub or Discord)!
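To make the "plug in your own LLM via an async interface" idea concrete, here is a minimal, hypothetical sketch of what such an async teacher adapter could look like. The class, the method names, and the choice of the OpenAI backend are illustrative assumptions for this sketch, not Kolosal Plane's actual plugin API; it assumes the official `openai` Python client (v1+) and an `OPENAI_API_KEY` in your environment.

```python
# Hypothetical sketch of an async "teacher LLM" adapter -- illustrative only,
# not Kolosal Plane's actual plugin interface.
import asyncio

from openai import AsyncOpenAI


class OpenAITeacher:
    """Wraps the OpenAI chat API behind a simple async generate() call."""

    def __init__(self, model: str = "gpt-4o"):
        self.client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
        self.model = model

    async def generate(self, messages: list[dict]) -> str:
        response = await self.client.chat.completions.create(
            model=self.model,
            messages=messages,
        )
        return response.choices[0].message.content


async def main() -> None:
    teacher = OpenAITeacher()
    answer = await teacher.generate(
        [{"role": "user", "content": "Explain LoRA fine-tuning in one sentence."}]
    )
    print(answer)


if __name__ == "__main__":
    asyncio.run(main())
```

A local Hugging Face model (or any other backend) could expose the same `generate()` coroutine, which is what makes the teacher swappable.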
Good news! You can get started with Kolosal Plane in just a few steps:
1. Install Kolosal Plane: Clone the repository:

   ```bash
   git clone https://github.com/Genta-Technology/Kolosal-Plane.git
   ```

2. Install Dependencies: Navigate to the project directory and install the required packages using pip:

   ```bash
   cd Kolosal-Plane
   pip install -r requirements.txt
   ```

3. Run it by using the command line:

   ```bash
   python main.py
   ```
Data augmentation in the context of language models refers to expanding or enhancing the training data by algorithmic means – here, by using one model to generate new training examples for another. Instead of relying solely on human-written dialogues or scarce domain-specific data, Kolosal Plane uses a powerful LLM to synthesize new question-answer pairs and conversations. These synthetic examples can then be added to your training corpus to improve a smaller model’s performance.
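As a concrete illustration (the JSONL chat format and file name below are common conventions assumed for this sketch, not something Kolosal Plane prescribes), augmentation boils down to appending LLM-generated conversations to the corpus the smaller model will later be fine-tuned on:

```python
# Illustrative only: the record schema and file name are assumptions for this
# sketch, not a format prescribed by Kolosal Plane.
import json

synthetic_examples = [
    {
        "messages": [
            {"role": "user", "content": "What does the warranty cover?"},
            {"role": "assistant", "content": "The warranty covers manufacturing defects for 24 months..."},
        ]
    },
    # ... more LLM-generated conversations ...
]

# Append the synthetic conversations (one JSON object per line) to the
# fine-tuning corpus, alongside any real data you already have.
with open("train_augmented.jsonl", "a", encoding="utf-8") as f:
    for example in synthetic_examples:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")
```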
Recent research has shown that synthetic data generated by large models can significantly boost the performance of smaller models when used for fine-tuning [1]. By generating diverse and high-quality examples, we can often exceed the diversity and quality of manual annotations, leading to better generalization. Stanford Alpaca's use of the Self-Instruct technique is a famous example: by prompting OpenAI's text-davinci-003 (GPT-3.5) to produce 52,000 instruction-response examples, the team fine-tuned a 7B LLaMA model that behaves similarly to the much larger teacher model [2]. This approach cost only ~$600 and yielded Alpaca 7B, an instruction-following model that is qualitatively on par with text-davinci-003 on many tasks.
Other projects have also demonstrated the power of LLM-generated data for training compact models:
- Vicuna-13B (2023) was fine-tuned on user-shared ChatGPT conversations and achieved about 90% of ChatGPT’s quality despite being only 13B parameters – showing a smaller model can emulate a larger one with the right training data [4].
- Unnatural Instructions and WizardLM used GPT-3.5/GPT-4 to generate complex instruction-following data, producing models that outperform many baseline LLMs without such fine-tuning [1].
- TinyStories, from Microsoft Research, is a synthetic dataset of millions of short children’s stories generated with GPT-3.5/GPT-4; small language models (SLMs) trained entirely on it exhibit surprisingly coherent language abilities normally associated with much larger models (see “TinyStories Is A Synthetic DataSet Created With GPT-4 & Used To Train Phi-3”, Cobus Greyling, Medium).
In all these cases, data augmentation via synthetic generation is the key. It’s faster and cheaper to obtain task-specific data from an LLM than to manually curate it, and the augmented data often improves the target model’s performance beyond what real data alone could do.
Fine-tuning a smaller model on high-quality data can yield outsized gains. Research and community experiments have demonstrated that a compact model (with LoRA or QLoRA fine-tuning) can, on specific tasks or domains, perform as well as or even better than a much larger general model that isn’t specialized [3]. This means you can deploy an efficient model that matches or surpasses a base GPT-4 or other state-of-the-art LLM in your application, given the model has been tailored with the right data.
For example, Predibase showed that a fine-tuned 8B LLaMA (via LoRA) outperformed GPT-4 on a common-sense reasoning task when only 10 real examples were available, by generating additional synthetic training data [3]. In their experiments, as the number of training examples grows (via synthetic augmentation), the fine-tuned 8B model continues to improve and eventually overtakes GPT-4’s few-shot performance on benchmarks like HellaSwag. The QLoRA research (Dettmers et al.) similarly produced Guanaco, a family of models (7B–65B) fine-tuned in 4-bit precision on a small, high-quality instruction dataset, that achieved 99% of ChatGPT’s performance on the Vicuna benchmark [4]. Guanaco-65B is in fact second only to GPT-4 on several chat leaderboards while being trained on a single GPU in 24 hours.
The takeaway: Fine-tuning works. With efficient methods like LoRA (which adds only a few million trainable parameters) and QLoRA (which fine-tunes in 4-bit precision), even multi-billion-parameter models become practical to adapt. Kolosal Plane embraces this, using LLM-generated data to give your small model a fighting chance against the giants in your domain [3]. In many cases, a domain-specialized 2–13B model can rival a 100B+ general model at a fraction of the inference cost.
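If you want to see what that looks like in code, below is a minimal sketch of attaching LoRA adapters with the Hugging Face `peft` library. The model name, target modules, and hyperparameters are illustrative assumptions, and the tooling shown is standard Hugging Face rather than part of Kolosal Plane.

```python
# A minimal sketch of attaching LoRA adapters to a small model before
# fine-tuning it on the augmented conversations. The model name, target
# modules, and hyperparameters are illustrative assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "meta-llama/Llama-3.2-1B"  # any small causal LM works here
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# LoRA freezes the base weights and trains only a few million adapter parameters.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters

# From here, train on the synthetic conversations with your usual tooling
# (e.g. transformers.Trainer or trl's SFTTrainer), then merge or ship the adapter.
```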
This is an overview of how the synthetic conversation pipeline works; a simplified code sketch follows the steps below. Kolosal Plane’s data generation loop simulates a back-and-forth dialogue, optionally grounded in provided documents or context. The process can be summarized as:
1. Conversation Start (Question Generation): For each given topic or document, the pipeline first generates an initial user question. This is often guided by a conversation-starter instruction you provide (e.g. “Ask a question about the document’s content” or a general topic prompt). Under the hood, Kolosal Plane uses a prompt template (inspired by Self-Instruct) to have the LLM come up with a relevant question or instruction based on the document/context.

2. LLM Answer Generation: The large model (LLM) is then invoked to answer the question. Before calling the LLM, Kolosal Plane can insert the relevant context (the document text, or any system/persona prompt you defined) into the conversation history so that the LLM’s answer is knowledgeable and coherent.

3. Follow-up Question Generation: After an answer, the pipeline generates a follow-up user question to continue the conversation.

4. Conversation Continuation (Loop): The new question is added to the chat history, and we loop back to step 2. The LLM now answers the follow-up question, producing the next answer. This question/answer pair is appended to the growing conversation. The cycle of (User question → LLM answer → next User question → ...) repeats until a specified length (`max_conversations`) is reached.
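To tie the steps together, here is a simplified, hypothetical sketch of the loop in plain Python. The prompts, function names, and use of the OpenAI client are illustrative assumptions; the real implementation and its configuration options live in the repository (see `example.ipynb` below).

```python
# A simplified, hypothetical sketch of the generation loop described above.
# Function names, prompts, and the use of the OpenAI client are illustrative;
# see example.ipynb for how the actual pipeline is driven.
from openai import OpenAI

client = OpenAI()          # assumes OPENAI_API_KEY is set in the environment
TEACHER_MODEL = "gpt-4o"   # the large "teacher" LLM


def ask_llm(system: str, user: str) -> str:
    """Single chat completion call to the teacher model."""
    resp = client.chat.completions.create(
        model=TEACHER_MODEL,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    return resp.choices[0].message.content


def build_conversation(document: str, starter_instruction: str,
                       max_conversations: int) -> list[dict]:
    """Simulate a multi-turn user/assistant dialogue grounded in a document."""
    history: list[dict] = []

    # Step 1: initial user question, guided by the conversation starter.
    question = ask_llm(
        "You write realistic user questions about the given document.",
        f"Document:\n{document}\n\nInstruction: {starter_instruction}",
    )

    for turn in range(max_conversations):
        if turn > 0:
            # Steps 3-4: generate a follow-up question from the history and loop.
            question = ask_llm(
                "Given the conversation so far, write the user's next follow-up question.",
                "\n".join(f"{m['role']}: {m['content']}" for m in history),
            )

        # Step 2: the teacher LLM answers, with the document inserted as context.
        answer = ask_llm(
            f"Answer the user's question using this document as context:\n{document}",
            question,
        )
        history.append({"role": "user", "content": question})
        history.append({"role": "assistant", "content": answer})

    return history
```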
In the repository you’ll find `example.ipynb`, which demonstrates programmatic usage.
Kolosal Plane is an early-stage project – we’d love your help in making it better! If you have questions or want to connect with other developers using the tool, join our Discord: Kolosal AI. It’s a great place to share ideas, get support, and discuss data augmentation and fine-tuning techniques.
Found a bug or have a feature request? Please open an issue on the GitHub repo. We welcome contributions – whether it’s reporting an issue, improving documentation, or adding new features.
This project is licensed under the Apache 2.0 License, which means you are free to use, modify, and distribute it in both academic and commercial projects. See the LICENSE file for details.
[1] How to Generate and Use Synthetic Data for Finetuning
[2] Stanford CRFM – Alpaca: A Strong, Replicable Instruction-Following Model
[3] How to Generate Synthetic Data and Fine-tune a SLM that Beats GPT-4o – Predibase
[4] QLoRA: Efficient Finetuning of Quantized LLMs (arXiv:2305.14314)