etl-pipeline

Here are 2,056 public repositories matching this topic...

Zipstack / unstract

No-code LLM Platform to launch APIs and ETL Pipelines to structure unstructured documents

unstructured-data etl-pipeline llm-platform

Updated May 27, 2025
Python

orchest / orchest

Build data pipelines, the easy way 🛠️

python docker kubernetes data-science machine-learning airflow cloud deployment jupyter etl ide pipelines self-hosted jupyterlab notebooks data-pipelines dag etl-pipeline orchest

Updated Jun 6, 2023
TypeScript

apache / streampark

Make stream processing easier! Easy-to-use streaming application development framework and operation platform.

streaming apache easy-to-use etl-pipeline development-framework streampark operation-platform

Updated May 28, 2025
Java

apache / hamilton

Apache Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows that encode lineage/tracing and metadata. Runs and scales everywhere Python does.

python data-science machine-learning etl pandas orchestration data-engineering data-analysis software-engineering feature-engineering dataframe hacktoberfest dag lineage etl-framework etl-pipeline rag mlops llmops

Updated May 30, 2025
Jupyter Notebook

AlexIoannides / pyspark-example-project

Implementing best practices for PySpark ETL jobs and applications.

python data-science spark etl pyspark data-engineering etl-pipeline etl-job

Updated Jan 1, 2023
Python

san089 / Udacity-Data-Engineering-Projects

A few projects related to Data Engineering, including Data Modeling, Infrastructure setup on cloud, Data Warehousing and Data Lake development.

Updated Aug 26, 2022
Python

san089 / goodreads_etl_pipeline

An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.

Updated Mar 9, 2020
Python

JSv4 / OpenContracts

Enterprise-grade and API-first LLM workspace for unstructured documents, including data extraction, redaction, rights management, prompt playground, and more!

agent etl unstructured-data etl-pipeline vector-database llm prompt-engineering agentic-ai

Updated Jun 1, 2025
TypeScript

stitchfix / hamilton

A scalable general-purpose micro-framework for defining dataflows. THIS REPOSITORY HAS BEEN MOVED TO www.github.com/dagworks-inc/hamilton

python data-science machine-learning etl numpy pandas data-engineering data-platform software-engineering feature-engineering dataframe dag hamiltonian etl-framework hamilton featurization etl-pipeline stitch-fix

Updated Jul 3, 2023
Python

techascent / tech.ml.dataset

A high-performance data processing system for Clojure

java machine-learning clojure csv xlsx datascience dataset dataframe etl-pipeline

Updated May 13, 2025
Clojure

SorellaLabs / brontes

A blazingly fast general-purpose blockchain analytics engine specialized in systematic MEV detection

rust ethereum evm etl-pipeline mev

Updated Apr 29, 2025
Rust

Pravko-Solutions / FlashLearn

Integrate LLMs in any pipeline: fit/predict pattern, JSON-driven flows, and built-in concurrency support.

python ai concurrency ai-agents etl-pipeline llm llm-agent ai-agents-framework agentic-ai-development

Updated Mar 10, 2025
Python

YotpoLtd / metorikku

A simplified, lightweight ETL Framework based on Apache Spark

scala sql big-data spark etl distributed-computing etl-framework etl-pipeline

Updated Jan 24, 2024
Scala

unbody-io / unbody

The Supabase of the AI era. A modular, open-source backend for building AI-native software, designed for knowledge, not static data.

backend chatbot developer-tools knowledge-base data-ingestion etl-pipeline rag data-enhancement vector-database llm ai-native generative-ai agentic-ai supabase-alternative

Updated May 30, 2025
TypeScript

ebonnal / streamable

Fluent interface for (async) iterables

Updated Jun 1, 2025
Python

airscholar / e2e-data-engineering

An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All components are containerized with Docker for easy deployment and scalability.