
Kolosal Plane

One platform to generate synthetic data for LLMs and embedding models

Kolosal Plane is an open-source data augmentation pipeline that enables developers to generate synthetic conversational data using powerful large language models (LLMs) and use it to fine-tune smaller language models (SLMs). It simulates multi-turn Q&A conversations by leveraging models like GPT-4 or Anthropic Claude (v3.5/v3.7) as the conversation generator, and produces high-quality dialogue datasets for training compact models such as LLaMA (1B/3B), Gemma (1.1B/2B/4B), Mistral, etc. The goal is to bridge the performance gap between massive LLMs and smaller, more efficient models by augmenting training data with LLM-generated conversations.

Key Features

  1. Synthetic Q&A Generation: Simulate dialogues by iteratively generating questions and answers. You can input a topic or context document, and Kolosal Plane will produce a realistic multi-turn conversation around it.
  2. Modular & Extensible: Plug in your own LLM as the teacher model; supports the OpenAI API, local Hugging Face models, and more via an async interface.
  3. Example Notebook & UI: Includes an example.ipynb for a quick start and a Streamlit web UI (in progress) to interactively configure augmentation settings and preview results.
  4. Community-Driven: Join our Discord community for support, ideas, and collaboration. We welcome feedback and bug reports (please file an issue on GitHub or Discord)!

TL;DR: How to Use Kolosal Plane

Good news! You can get started with Kolosal Plane in just a few steps:

  1. Install Kolosal Plane: Clone the repository

    git clone https://github.com/Genta-Technology/Kolosal-Plane.git
  2. Install Dependencies: Navigate to the project directory and install the required packages using pip:

    cd Kolosal-Plane
    pip install -r requirements.txt
  3. Run it from the command line:

    python main.py

What is Data Augmentation for LLMs?

Data augmentation in the context of language models refers to expanding or enhancing the training data by algorithmic means – here, by using one model to generate new training examples for another. Instead of relying solely on human-written dialogues or scarce domain-specific data, Kolosal Plane uses a powerful LLM to synthesize new question-answer pairs and conversations. These synthetic examples can then be added to your training corpus to improve a smaller model’s performance.
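
To make the idea concrete, below is what a single synthetic training example might look like once a generated conversation is serialized. The chat-message schema shown is a common convention for supervised fine-tuning data; it is an illustration, not necessarily the exact format Kolosal Plane writes out.

    # Illustrative only: one synthetic training example in the chat-message format
    # commonly used for supervised fine-tuning. The exact schema Kolosal Plane emits
    # may differ, and the conversation content here is made up for the example.
    import json

    example = {
        "messages": [
            {"role": "user", "content": "What does the quarterly report say about churn?"},
            {"role": "assistant", "content": "Churn fell from 4.1% to 3.2%, driven by ..."},
            {"role": "user", "content": "Which customer segment improved the most?"},
            {"role": "assistant", "content": "Mid-market accounts, where ..."},
        ]
    }

    # Appending many such records to a JSONL file builds a fine-tuning corpus.
    with open("synthetic_corpus.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(example) + "\n")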

Recent research has shown that synthetic data generated by large models can significantly boost the performance of smaller models when used for fine-tuning [1]. By generating diverse and high-quality examples, we can often exceed the diversity and quality of manual annotations, leading to better generalization. For instance, the Self-Instruct technique from Stanford Alpaca is a famous example: by using OpenAI’s text-davinci-003 (GPT-3.5) to produce 52,000 instruction-response examples, the team fine-tuned a 7B LLaMA model that behaves similarly to the much larger GPT-3.5 model [2]. This approach cost only ~$600 and yielded Alpaca 7B, an instruct-following model that is qualitatively on par with OpenAI’s text-davinci-003 on many tasks.

Other projects have also demonstrated the power of LLM-generated data for training compact models.

In all these cases, data augmentation via synthetic generation is the key. It’s faster and cheaper to obtain task-specific data from an LLM than to manually curate it, and the augmented data often improves the target model’s performance beyond what real data alone could do.

Why Fine-Tune Smaller Models? Can they compete with GPT-4?

Fine-tuning a smaller model on high-quality data can yield outsized gains. Research and community experiments have demonstrated that a compact model (with LoRA or QLoRA fine-tuning) can, on specific tasks or domains, perform as well as or even better than a much larger general model that isn’t specialized [3]. This means you can deploy an efficient model that matches or surpasses a base GPT-4 or other state-of-the-art LLM in your application, given the model has been tailored with the right data.

For example, Predibase showed that a fine-tuned 8B LLaMA (via LoRA) outperformed GPT-4 in a common-sense reasoning task when only 10 real examples were available, by generating additional synthetic training data. In their experiments, as the number of training examples grows (via synthetic augmentation), the fine-tuned 8B model continues to improve and eventually overtakes GPT-4’s few-shot performance on benchmarks like HellaSwag. The Hugging Face QLoRA research similarly produced Guanaco, a family of models (7B–65B) fine-tuned on synthetically generated instruction data, that achieved 99% of ChatGPT’s performance on the Vicuna benchmark [4]. Guanaco-65B is in fact second only to GPT-4 on several chat leaderboards while being trained on a single GPU in 24 hours.

The takeaway: Fine-tuning works. With efficient methods like LoRA (which adds only a few million trainable parameters) and QLoRA (which fine-tunes in 4-bit precision), even billions-scale models become practical to adapt. Kolosal Plane embraces this, using LLM-generated data to give your small model a fighting chance against the giants in your domain [3]. In many cases, a domain-specialized 2–13B model can rival a 100B+ general model, at a fraction of the inference cost.
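
As a rough illustration of how lightweight this kind of adaptation is, the sketch below wraps a small causal language model with LoRA adapters using Hugging Face peft. The base model name, target modules, and hyperparameters are placeholder assumptions, not a recipe shipped with Kolosal Plane.

    # Illustrative LoRA setup with Hugging Face peft; not a script from this repository.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    base = "meta-llama/Llama-3.2-1B"   # placeholder; any small causal LM works
    tokenizer = AutoTokenizer.from_pretrained(base)
    model = AutoModelForCausalLM.from_pretrained(base)
    # For QLoRA, load the base model in 4-bit instead, e.g. by passing
    # transformers.BitsAndBytesConfig(load_in_4bit=True) as quantization_config.

    # LoRA freezes the base weights and trains small low-rank adapter matrices,
    # so only a few million parameters are updated during fine-tuning.
    lora = LoraConfig(
        r=8,
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # attention projections in LLaMA-style models
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()  # typically well under 1% of total weights

    # From here, train on the synthetic conversations produced by Kolosal Plane with
    # your usual training loop or a library trainer such as transformers.Trainer.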

How the Synthetic Conversation Pipeline Works

Kolosal Plane’s data generation loop simulates a back-and-forth dialogue, optionally grounded in provided documents or context. The process can be summarized as follows (a rough sketch of the loop appears after the list):

  1. Conversation Start (Question Generation): For each given topic or document, the pipeline first generates an initial user question. This is often guided by a conversation starter instruction you provide (e.g. “Ask a question about the document’s content” or a general topic prompt). Under the hood, Kolosal Plane uses a prompt template (inspired by Self-Instruct) to have the LLM come up with a relevant question or instruction based on the document/context.

  2. LLM Answer Generation: The large model (LLM) is then invoked to answer the question. Before calling the LLM, Kolosal Plane can insert the relevant context (the document text, or any system/persona prompt you defined) into the conversation history so that the LLM’s answer is knowledgeable and coherent.

  3. Follow-up Question Generation: After an answer, the pipeline generates a follow-up user question to continue the conversation.

  4. Conversation Continuation (Loop): The new question is added to the chat history, and we loop back to step 2. The LLM now answers the follow-up question, producing the next answer. This question/answer pair is appended to the growing conversation. The cycle of (User question → LLM answer → next User question → ...) repeats until a specified length (max_conversations) is reached.
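
The sketch below mirrors this loop in plain Python against the OpenAI chat API. It is an illustration of the idea rather than Kolosal Plane’s actual implementation; the prompts, model name, and helper names are assumptions.

    # Illustrative sketch of the loop above; NOT Kolosal Plane's actual code.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    MODEL = "gpt-4o-mini"  # assumed teacher model

    def chat(messages):
        """Single LLM call; returns the assistant's text."""
        response = client.chat.completions.create(model=MODEL, messages=messages)
        return response.choices[0].message.content

    def generate_conversation(document, starter_instruction, max_conversations=3):
        """Simulate a multi-turn user/assistant dialogue grounded in `document`."""
        history = []

        # Step 1: initial user question, guided by the conversation-starter instruction.
        question = chat([
            {"role": "system", "content": starter_instruction},
            {"role": "user", "content": f"Document:\n{document}\n\n"
                                        "Write one question a user might ask about it."},
        ])

        for turn in range(max_conversations):
            history.append({"role": "user", "content": question})

            # Step 2: the teacher LLM answers, with the document inserted as context.
            answer = chat(
                [{"role": "system", "content": f"Answer using this context:\n{document}"}]
                + history
            )
            history.append({"role": "assistant", "content": answer})

            if turn + 1 == max_conversations:
                break  # reached the requested conversation length

            # Step 3: generate a follow-up user question from the running conversation.
            question = chat(history + [{
                "role": "user",
                "content": "Write the user's next follow-up question for this conversation. "
                           "Reply with the question only.",
            }])
            # Step 4: loop back to step 2 with the new question.

        return history

Each returned history is one synthetic dialogue; running the loop over many topics or documents and serializing the results yields a fine-tuning corpus like the JSONL example shown earlier.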

Example: Using Kolosal Plane for Synthetic Data Generation

In the repository you’ll find example.ipynb, which demonstrates programmatic usage.

Community and Support

Kolosal Plane is an early-stage project – we’d love your help in making it better! If you have questions or want to connect with other developers using the tool, join our Discord: Kolosal AI. It’s a great place to share ideas, get support, and discuss data augmentation and fine-tuning techniques.

Found a bug or have a feature request? Please open an issue on the GitHub repo. We welcome contributions – whether it’s reporting an issue, improving documentation, or adding new features.

License

This project is licensed under the Apache 2.0 License, which means you are free to use, modify, and distribute it in both academic and commercial projects. See the LICENSE file for details.

References

[1] How to Generate and Use Synthetic Data for Finetuning

[2] Stanford CRFM, Alpaca: A Strong, Replicable Instruction-Following Model

[3] How to Generate Synthetic Data and Fine-tune a SLM that Beats GPT-4o, Predibase

[4] QLoRA: Efficient Finetuning of Quantized LLMs
