Commit 770aa41 ("init", 0 parents)


77 files changed: 18785 additions, 0 deletions

.github/workflows/ci.yml

Lines changed: 126 additions & 0 deletions

```yaml
name: CI Pipeline

on:
  push:
    branches: [ main ]
  pull_request:
  schedule:
    - cron: '0 0 * * *'

jobs:
  fmt:
    name: Format Check
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python 3.12
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install formatters
        run: |
          python -m pip install --upgrade pip
          pip install black

      - name: Check code formatting
        run: |
          black --check .

      # - name: Check typos
      #   uses: crate-ci/typos@v1.29.10

  test:
    name: Test (${{ matrix.os }}, Python ${{ matrix.python-version }})
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [self-hosted]  # macos-latest
        python-version: ["3.9", "3.10", "3.11", "3.12"]
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -e .[dev]

      - name: Install gensort on Linux
        if: matrix.os == 'self-hosted'
        run: |
          wget https://www.ordinal.com/try.cgi/gensort-linux-1.5.tar.gz
          tar -xzf gensort-linux-1.5.tar.gz
          chmod +x 64/gensort 64/valsort
          export PATH=$PATH:$(pwd)/64

      - name: Run tests
        timeout-minutes: 20
        run: |
          pytest -n 4 -v -x --durations=50 --timeout=600 \
            --junitxml=pytest.xml \
            --cov-report term --cov-report xml:coverage.xml \
            --cov=smallpond --cov=examples --cov=benchmarks --cov=tests \
            tests/test_*.py

      - name: Archive test results
        uses: actions/upload-artifact@v4
        with:
          name: test-results-${{ matrix.os }}-py${{ matrix.python-version }}
          path: |
            pytest.xml
            coverage.xml

  build-docs:
    name: Build Documentation
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python 3.8
        uses: actions/setup-python@v5
        with:
          python-version: "3.8"

      - name: Install dependencies
        run: pip install -e .[docs]

      - name: Build HTML docs
        run: |
          cd docs
          make html

      - name: Archive documentation
        uses: actions/upload-artifact@v4
        with:
          name: documentation
          path: docs/build/html

  deploy-docs:
    name: Deploy Documentation
    runs-on: self-hosted
    needs: [build-docs]
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4

      - name: Download artifact
        uses: actions/download-artifact@v4
        with:
          name: documentation
          path: docs/build/html

      - name: Deploy to docs branch
        uses: peaceiris/actions-gh-pages@v4
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: docs/build/html
          destination_dir: ./
          keep_files: true
          branch: docs
          force_orphan: true
```
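As an aside, the `test` job's strategy matrix expands into one job per os/Python-version pair. A quick illustrative sketch (values copied from the workflow; the job-name format mirrors the `name:` field) of how many jobs that yields:

```python
# Illustration only: expand the CI test matrix defined in the workflow above.
from itertools import product

os_list = ["self-hosted"]  # macos-latest is commented out in the workflow
python_versions = ["3.9", "3.10", "3.11", "3.12"]

jobs = [f"Test ({os_name}, Python {py})" for os_name, py in product(os_list, python_versions)]
for job in jobs:
    print(job)
print(len(jobs), "jobs")  # 1 os x 4 versions = 4 jobs
```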

.gitignore

Lines changed: 18 additions & 0 deletions

```
__pycache__
.ipynb_checkpoints
.tmp/
dist/
build/
*.egg-info/
tests/data/
tests/runtime/
*.log
*.pyc
*.xml
.tmp/
.idea
.coverage
.vscode/
.hypothesis/
docs/*/generated/
.venv*/
```

LICENSE

Lines changed: 7 additions & 0 deletions

```
Copyright 2025 DeepSeek

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
```

MANIFEST.in

Lines changed: 1 addition & 0 deletions

```
exclude tests/data/** tests/runtime/**
```

README.md

Lines changed: 81 additions & 0 deletions

````markdown
# smallpond

[![CI](https://github.com/deepseek-ai/smallpond/actions/workflows/ci.yml/badge.svg?branch=main)](https://github.com/deepseek-ai/smallpond/actions/workflows/ci.yml)
[![PyPI](https://img.shields.io/pypi/v/smallpond)](https://pypi.org/project/smallpond/)
[![Docs](https://img.shields.io/badge/docs-latest-brightgreen.svg)](https://deepseek-ai.github.io/smallpond/)
[![License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)

A lightweight data processing framework built on DuckDB and [3FS].

## Features

- 🚀 High-performance data processing powered by DuckDB
- 🌍 Scalable to handle PB-scale datasets
- 🛠️ Easy operations with no long-running services

## Installation

Python 3.8 to 3.12 is supported.

```bash
pip install smallpond
```

## Quick Start

```bash
# Download example data
wget https://duckdb.org/data/prices.parquet
```

```python
import smallpond

# Initialize session
sp = smallpond.init()

# Load data
df = sp.read_parquet("prices.parquet")

# Process data
df = df.repartition(3, hash_by="ticker")
df = sp.partial_sql("SELECT ticker, min(price), max(price) FROM {0} GROUP BY ticker", df)

# Save results
df.write_parquet("output/")
# Show results
print(df.to_pandas())
```

## Documentation

For detailed guides and API reference:
- [Getting Started](docs/source/getstarted.rst)
- [API Reference](docs/source/api.rst)

## Performance

We executed the [Gray Sort benchmark] using [smallpond] on a cluster comprising 50 compute nodes and 25 storage nodes running [3FS]. The benchmark sorted 110.5TiB of data in 30 minutes and 14 seconds, achieving an average throughput of 3.66TiB/min.

[3FS]: https://github.com/deepseek-ai/3fs
[Gray Sort benchmark]: https://sortbenchmark.org/
[smallpond]: benchmarks/gray_sort_benchmark.py

## Development

```bash
pip install .[dev]

# run unit tests
pytest -v tests/test*.py

# build documentation
pip install .[docs]
cd docs
make html
python -m http.server --directory build/html
```

## License

This project is licensed under the [MIT License](LICENSE).
````
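The Gray Sort throughput figure quoted in the README above is easy to sanity-check with back-of-the-envelope arithmetic (numbers taken from the README text):

```python
# Sanity-check the Gray Sort throughput quoted in the README above.
data_tib = 110.5              # data sorted, in TiB
elapsed_s = 30 * 60 + 14      # 30 minutes 14 seconds, in seconds
throughput_tib_per_min = data_tib / elapsed_s * 60
print(f"{throughput_tib_per_min:.2f} TiB/min")  # close to the quoted 3.66 TiB/min
```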

benchmarks/file_io_benchmark.py

Lines changed: 84 additions & 0 deletions

```python
from smallpond.common import DEFAULT_BATCH_SIZE, DEFAULT_ROW_GROUP_SIZE, GB
from smallpond.contrib.copy_table import CopyArrowTable, StreamCopy
from smallpond.execution.driver import Driver
from smallpond.logical.dataset import ParquetDataSet
from smallpond.logical.node import (
    Context,
    DataSetPartitionNode,
    DataSourceNode,
    LogicalPlan,
    SqlEngineNode,
)


def file_io_benchmark(
    input_paths,
    npartitions,
    io_engine="duckdb",
    batch_size=DEFAULT_BATCH_SIZE,
    row_group_size=DEFAULT_ROW_GROUP_SIZE,
    output_name="data",
    **kwargs,
) -> LogicalPlan:
    ctx = Context()
    dataset = ParquetDataSet(input_paths)
    data_files = DataSourceNode(ctx, dataset)
    data_partitions = DataSetPartitionNode(ctx, (data_files,), npartitions=npartitions)

    if io_engine == "duckdb":
        data_copy = SqlEngineNode(
            ctx,
            (data_partitions,),
            r"select * from {0}",
            parquet_row_group_size=row_group_size,
            per_thread_output=False,
            output_name=output_name,
            cpu_limit=1,
            memory_limit=10 * GB,
        )
    elif io_engine == "arrow":
        data_copy = CopyArrowTable(
            ctx,
            (data_partitions,),
            parquet_row_group_size=row_group_size,
            output_name=output_name,
            cpu_limit=1,
            memory_limit=10 * GB,
        )
    elif io_engine == "stream":
        data_copy = StreamCopy(
            ctx,
            (data_partitions,),
            streaming_batch_size=batch_size,
            parquet_row_group_size=row_group_size,
            output_name=output_name,
            cpu_limit=1,
            memory_limit=10 * GB,
        )

    plan = LogicalPlan(ctx, data_copy)
    return plan


def main():
    driver = Driver()
    driver.add_argument("-i", "--input_paths", nargs="+")
    driver.add_argument("-n", "--npartitions", type=int, default=None)
    driver.add_argument(
        "-e", "--io_engine", default="duckdb", choices=("duckdb", "arrow", "stream")
    )
    driver.add_argument("-b", "--batch_size", type=int, default=1024 * 1024)
    driver.add_argument("-s", "--row_group_size", type=int, default=1024 * 1024)
    driver.add_argument("-o", "--output_name", default="data")
    driver.add_argument("-NC", "--cpus_per_node", type=int, default=128)

    user_args, driver_args = driver.parse_arguments()
    total_num_cpus = driver_args.num_executors * user_args.cpus_per_node
    user_args.npartitions = user_args.npartitions or total_num_cpus

    plan = file_io_benchmark(**driver.get_arguments())
    driver.run(plan)


if __name__ == "__main__":
    main()
```
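Note how `main()` in the benchmark above defaults `npartitions` to the total CPU count across all executors. A standalone sketch of just that defaulting rule (the `128` comes from the `--cpus_per_node` default; `default_npartitions` is a hypothetical helper, not part of the file):

```python
# Sketch of the partition-count default used in main() above:
# without -n/--npartitions, use one partition per CPU across all executors.
def default_npartitions(npartitions, num_executors, cpus_per_node=128):
    total_num_cpus = num_executors * cpus_per_node
    return npartitions or total_num_cpus

print(default_npartitions(None, num_executors=4))  # 4 * 128 = 512
print(default_npartitions(64, num_executors=4))    # explicit value wins: 64
```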

0 commit comments
