Kolosal Plane is an open-source data augmentation pipeline that enables developers to generate synthetic conversational data using powerful large language models (LLMs) and use it to fine-tune smaller language models (SLMs). It simulates multi-turn Q&A conversations by leveraging models like GPT-4 or Anthropic Claude (v3.5/v3.7) as the conversation generator, and produces high-quality dialogue datasets for training compact models such as LLaMA (1B/3B), Gemma (1.1B/2B/4B), Mistral, etc. The goal is to bridge the performance gap between massive LLMs and smaller, more efficient models by augmenting training data with LLM-generated conversations.
- Synthetic Q&A Generation: Simulate dialogues by iteratively generating questions and answers. You can input a topic or context document, and Kolosal Plane will produce a realistic multi-turn conversation around it.
- Modular & Extensible: Plug in your own LLM as the teacher; the async interface supports the OpenAI API, local Hugging Face models, and more (a conceptual sketch follows this list).
- Example Notebook & UI: Includes an `example.ipynb` notebook for a quick start and a Streamlit web UI (in progress) to interactively configure augmentation settings and preview results.
- Community-Driven: Join our Discord community for support, ideas, and collaboration. We welcome feedback and bug reports (please file an issue on GitHub or Discord)!
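To make the "plug in your own LLM via an async interface" idea concrete, here is a minimal, hypothetical sketch of what such an async teacher adapter could look like. The class, the method names, and the choice of the OpenAI backend are illustrative assumptions for this sketch, not Kolosal Plane's actual plugin API; it assumes the official `openai` Python client (v1+) and an `OPENAI_API_KEY` in your environment.

```python
# Hypothetical sketch of an async "teacher LLM" adapter -- illustrative only,
# not Kolosal Plane's actual plugin interface.
import asyncio

from openai import AsyncOpenAI


class OpenAITeacher:
    """Wraps the OpenAI chat API behind a simple async generate() call."""

    def __init__(self, model: str = "gpt-4o"):
        self.client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
        self.model = model

    async def generate(self, messages: list[dict]) -> str:
        response = await self.client.chat.completions.create(
            model=self.model,
            messages=messages,
        )
        return response.choices[0].message.content


async def main() -> None:
    teacher = OpenAITeacher()
    answer = await teacher.generate(
        [{"role": "user", "content": "Explain LoRA fine-tuning in one sentence."}]
    )
    print(answer)


if __name__ == "__main__":
    asyncio.run(main())
```

A local Hugging Face model (or any other backend) could expose the same `generate()` coroutine, which is what makes the teacher swappable.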
Good news! You can get started with Kolosal Plane in just a few steps:
1. Install Kolosal Plane: Clone the repository:

   ```bash
   git clone https://github.com/Genta-Technology/Kolosal-Plane.git
   ```

2. Install Dependencies: Navigate to the project directory and install the required packages using pip:

   ```bash
   cd Kolosal-Plane
   pip install -r requirements.txt
   ```

3. Run it by using the command line:

   ```bash
   python main.py
   ```
Data augmentation in the context of language models refers to expanding or enhancing the training data by algorithmic means – here, by using one model to generate new training examples for another. Instead of relying solely on human-written dialogues or scarce domain-specific data, Kolosal Plane uses a powerful LLM to synthesize new question-answer pairs and conversations. These synthetic examples can then be added to your training corpus to improve a smaller model’s performance.
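As a concrete illustration (the JSONL chat format and file name below are common conventions assumed for this sketch, not something Kolosal Plane prescribes), augmentation boils down to appending LLM-generated conversations to the corpus the smaller model will later be fine-tuned on:

```python
# Illustrative only: the record schema and file name are assumptions for this
# sketch, not a format prescribed by Kolosal Plane.
import json

synthetic_examples = [
    {
        "messages": [
            {"role": "user", "content": "What does the warranty cover?"},
            {"role": "assistant", "content": "The warranty covers manufacturing defects for 24 months..."},
        ]
    },
    # ... more LLM-generated conversations ...
]

# Append the synthetic conversations (one JSON object per line) to the
# fine-tuning corpus, alongside any real data you already have.
with open("train_augmented.jsonl", "a", encoding="utf-8") as f:
    for example in synthetic_examples:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")
```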
Recent research has shown that synthetic data generated by large models can significantly boost the performance of smaller models when used for fine-tuning [1]. By generating diverse and high-quality examples, we can often exceed the diversity and quality of manual annotations, leading to better generalization. Stanford Alpaca's use of the Self-Instruct technique is a famous example: by prompting OpenAI's text-davinci-003 (GPT-3.5) to produce 52,000 instruction-response examples, the team fine-tuned a 7B LLaMA model that behaves similarly to the much larger teacher model [2]. This approach cost only ~$600 and yielded Alpaca 7B, an instruction-following model that is qualitatively on par with text-davinci-003 on many tasks.
Other projects have also demonstrated the power of LLM-generated data for training compact models:
- Vicuna-13B (2023) was fine-tuned on user-shared ChatGPT conversations and achieved about 90% of ChatGPT’s quality despite being only 13B parameters – showing a smaller model can emulate a larger one with the right training data [4].
- Unnatural Instructions and WizardLM used GPT-3.5/GPT-4 to generate complex instruction-following data, producing models that outperform many baseline LLMs without such fine-tuning [1].
- TinyStories, from Microsoft Research, is a synthetic dataset of millions of short children’s stories generated with GPT-3.5/GPT-4; small language models (SLMs) trained entirely on it exhibit surprisingly coherent language abilities normally associated with much larger models (see “TinyStories Is A Synthetic DataSet Created With GPT-4 & Used To Train Phi-3”, Cobus Greyling, Medium).
In all these cases, data augmentation via synthetic generation is the key. It’s faster and cheaper to obtain task-specific data from an LLM than to manually curate it, and the augmented data often improves the target model’s performance beyond what real data alone could do.
Fine-tuning a smaller model on high-quality data can yield outsized gains. Research and community experiments have demonstrated that a compact model (with LoRA or QLoRA fine-tuning) can, on specific tasks or domains, perform as well as or even better than a much larger general model that isn’t specialized [3]. This means you can deploy an efficient model that matches or surpasses a base GPT-4 or other state-of-the-art LLM in your application, given the model has been tailored with the right data.
For example, Predibase showed that a fine-tuned 8B LLaMA (via LoRA) outperformed GPT-4 on a common-sense reasoning task when only 10 real examples were available, by generating additional synthetic training data [3]. In their experiments, as the number of training examples grows (via synthetic augmentation), the fine-tuned 8B model continues to improve and eventually overtakes GPT-4’s few-shot performance on benchmarks like HellaSwag. The QLoRA research (Dettmers et al.) similarly produced Guanaco, a family of models (7B–65B) fine-tuned in 4-bit precision on a small, high-quality instruction dataset, that achieved 99% of ChatGPT’s performance on the Vicuna benchmark [4]. Guanaco-65B is in fact second only to GPT-4 on several chat leaderboards while being trained on a single GPU in 24 hours.
The takeaway: Fine-tuning works. With efficient methods like LoRA (which adds only a few million trainable parameters) and QLoRA (which fine-tunes in 4-bit precision), even multi-billion-parameter models become practical to adapt. Kolosal Plane embraces this, using LLM-generated data to give your small model a fighting chance against the giants in your domain [3]. In many cases, a domain-specialized 2–13B model can rival a 100B+ general model at a fraction of the inference cost.
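If you want to see what that looks like in code, below is a minimal sketch of attaching LoRA adapters with the Hugging Face `peft` library. The model name, target modules, and hyperparameters are illustrative assumptions, and the tooling shown is standard Hugging Face rather than part of Kolosal Plane.

```python
# A minimal sketch of attaching LoRA adapters to a small model before
# fine-tuning it on the augmented conversations. The model name, target
# modules, and hyperparameters are illustrative assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "meta-llama/Llama-3.2-1B"  # any small causal LM works here
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# LoRA freezes the base weights and trains only a few million adapter parameters.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters

# From here, train on the synthetic conversations with your usual tooling
# (e.g. transformers.Trainer or trl's SFTTrainer), then merge or ship the adapter.
```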
This is an overview of how the synthetic conversation pipeline works; a simplified code sketch follows the steps below. Kolosal Plane’s data generation loop simulates a back-and-forth dialogue, optionally grounded in provided documents or context. The process can be summarized as:
1. Conversation Start (Question Generation): For each given topic or document, the pipeline first generates an initial user question. This is often guided by a conversation-starter instruction you provide (e.g. “Ask a question about the document’s content” or a general topic prompt). Under the hood, Kolosal Plane uses a prompt template (inspired by Self-Instruct) to have the LLM come up with a relevant question or instruction based on the document/context.

2. LLM Answer Generation: The large model (LLM) is then invoked to answer the question. Before calling the LLM, Kolosal Plane can insert the relevant context (the document text, or any system/persona prompt you defined) into the conversation history so that the LLM’s answer is knowledgeable and coherent.

3. Follow-up Question Generation: After an answer, the pipeline generates a follow-up user question to continue the conversation.

4. Conversation Continuation (Loop): The new question is added to the chat history, and we loop back to step 2. The LLM now answers the follow-up question, producing the next answer. This question/answer pair is appended to the growing conversation. The cycle of (User question → LLM answer → next User question → ...) repeats until a specified length (`max_conversations`) is reached.
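To tie the steps together, here is a simplified, hypothetical sketch of the loop in plain Python. The prompts, function names, and use of the OpenAI client are illustrative assumptions; the real implementation and its configuration options live in the repository (see `example.ipynb` below).

```python
# A simplified, hypothetical sketch of the generation loop described above.
# Function names, prompts, and the use of the OpenAI client are illustrative;
# see example.ipynb for how the actual pipeline is driven.
from openai import OpenAI

client = OpenAI()          # assumes OPENAI_API_KEY is set in the environment
TEACHER_MODEL = "gpt-4o"   # the large "teacher" LLM


def ask_llm(system: str, user: str) -> str:
    """Single chat completion call to the teacher model."""
    resp = client.chat.completions.create(
        model=TEACHER_MODEL,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    return resp.choices[0].message.content


def build_conversation(document: str, starter_instruction: str,
                       max_conversations: int) -> list[dict]:
    """Simulate a multi-turn user/assistant dialogue grounded in a document."""
    history: list[dict] = []

    # Step 1: initial user question, guided by the conversation starter.
    question = ask_llm(
        "You write realistic user questions about the given document.",
        f"Document:\n{document}\n\nInstruction: {starter_instruction}",
    )

    for turn in range(max_conversations):
        if turn > 0:
            # Steps 3-4: generate a follow-up question from the history and loop.
            question = ask_llm(
                "Given the conversation so far, write the user's next follow-up question.",
                "\n".join(f"{m['role']}: {m['content']}" for m in history),
            )

        # Step 2: the teacher LLM answers, with the document inserted as context.
        answer = ask_llm(
            f"Answer the user's question using this document as context:\n{document}",
            question,
        )
        history.append({"role": "user", "content": question})
        history.append({"role": "assistant", "content": answer})

    return history
```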
In the repository you’ll find `example.ipynb`, which demonstrates programmatic usage.
Kolosal Plane is an early-stage project – we’d love your help in making it better! If you have questions or want to connect with other developers using the tool, join our Discord: Kolosal AI. It’s a great place to share ideas, get support, and discuss data augmentation and fine-tuning techniques.
Found a bug or have a feature request? Please open an issue on the GitHub repo. We welcome contributions – whether it’s reporting an issue, improving documentation, or adding new features.
This project is licensed under the Apache 2.0 License, which means you are free to use, modify, and distribute it in both academic and commercial projects. See the LICENSE file for details.
[1] How to Generate and Use Synthetic Data for Finetuning
[2] Stanford CRFM – Alpaca: A Strong, Replicable Instruction-Following Model
[3] How to Generate Synthetic Data and Fine-tune a SLM that Beats GPT-4o – Predibase
[4] QLoRA: Efficient Finetuning of Quantized LLMs (arXiv:2305.14314)