
Commit 74cf549

Completes scalable databasing lesson
1 parent e207903 commit 74cf549

File tree

1 file changed

+234
-9
lines changed


Python-ML/Databasing-for-ML.ipynb

Lines changed: 234 additions & 9 deletions
Original file line number | Diff line number | Diff line change
@@ -12,6 +12,18 @@
1212
"## Iris dataset in Pandas: recap"
1313
]
1414
},
15+
{
16+
"cell_type": "code",
17+
"execution_count": null,
18+
"id": "8ce260a9",
19+
"metadata": {},
20+
"outputs": [],
21+
"source": [
22+
"# define path prefix for work directory as it might differ based on the environment\n",
23+
"from pathlib import Path\n",
24+
"WORK = Path(\"work/Python-ML\")"
25+
]
26+
},
1527
{
1628
"cell_type": "code",
1729
"execution_count": null,
@@ -526,7 +538,7 @@
526538
"metadata": {},
527539
"outputs": [],
528540
"source": [
529-
"duckdb.from_df(iris).to_parquet('iris.parquet')"
541+
"duckdb.from_df(iris).to_parquet(f'{WORK}/iris.parquet')"
530542
]
531543
},
532544
{
@@ -544,7 +556,7 @@
544556
"metadata": {},
545557
"outputs": [],
546558
"source": [
547-
"iris.to_parquet('iris-pd.parquet', index=False)"
559+
"iris.to_parquet(WORK / 'iris-pd.parquet', index=False)"
548560
]
549561
},
550562
{
@@ -570,7 +582,7 @@
570582
"import pyarrow.parquet as pq\n",
571583
"\n",
572584
"table = pa.Table.from_pandas(iris)\n",
573-
"pq.write_to_dataset(table, root_path='iris', partition_cols=['target'],\n",
585+
"pq.write_to_dataset(table, root_path=WORK / 'iris', partition_cols=['target'],\n",
574586
" existing_data_behavior='delete_matching')"
575587
]
576588
},
@@ -591,7 +603,7 @@
591603
"metadata": {},
592604
"outputs": [],
593605
"source": [
594-
"pd.read_parquet(path='iris', partitioning='hive')"
606+
"pd.read_parquet(path=WORK / 'iris', partitioning='hive')"
595607
]
596608
},
597609
{
@@ -601,7 +613,7 @@
601613
"metadata": {},
602614
"outputs": [],
603615
"source": [
604-
"duckdb.read_parquet('iris/*/*.parquet', hive_partitioning=True)"
616+
"duckdb.read_parquet(f'{WORK}/iris/*/*.parquet', hive_partitioning=True)"
605617
]
606618
},
607619
{
@@ -676,7 +688,8 @@
676688
"outputs": [],
677689
"source": [
678690
"iris_large = iris.sample(n=100_000, replace=True)\n",
679-
"iris_large.to_parquet('iris_large.parquet')"
691+
"iris_large.to_parquet('work/Python-ML/iris_large.parquet')\n",
692+
"iris_large.to_csv('work/Python-ML/iris_large.csv', index=False)"
680693
]
681694
},
682695
{
@@ -695,7 +708,7 @@
695708
{
696709
"cell_type": "code",
697710
"execution_count": null,
698-
"id": "f4fbc5c2",
711+
"id": "27cfe96a",
699712
"metadata": {},
700713
"outputs": [],
701714
"source": [
@@ -707,15 +720,227 @@
707720
{
708721
"cell_type": "code",
709722
"execution_count": null,
710-
"id": "0cd05a51",
723+
"id": "a5514211",
724+
"metadata": {},
725+
"outputs": [],
726+
"source": [
727+
"%%time\n",
728+
"dbfile = f'{WORK}/iris_large.parquet'\n",
729+
"for i in range(30_000):\n",
730+
" res = duckdb.sql(f\"select target, count(*) from read_parquet('{dbfile}') group by target\")\n"
731+
]
732+
},
733+
{
734+
"cell_type": "code",
735+
"execution_count": null,
736+
"id": "1935ef00",
737+
"metadata": {},
738+
"outputs": [],
739+
"source": [
740+
"%%time\n",
741+
"for i in range(30_000):\n",
742+
" res = iris_large.groupby('target', observed=True).count()"
743+
]
744+
},
745+
{
746+
"cell_type": "markdown",
747+
"id": "7321c04d",
748+
"metadata": {},
749+
"source": [
750+
"There are also different algorithms available for compression (e.g. Gzip, Brotli, Zstd) and encoding (e.g. Delta, RLE, PLAIN, DICT). These could further optimize query performance and storage efficiency for specific use-cases."
751+
]
752+
},
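As a quick illustration of the codec options mentioned in the cell above, here is a minimal sketch (not part of this commit) of passing compression settings when writing Parquet; the output file names under WORK are made up for the example:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from pathlib import Path

WORK = Path("work/Python-ML")  # same work directory the notebook sets up

iris = pd.read_parquet(WORK / "iris-pd.parquet")  # reuse the file written earlier

# pandas/pyarrow: choose a compression codec per file (snappy is the default)
iris.to_parquet(WORK / "iris-zstd.parquet", compression="zstd", index=False)

# pyarrow gives finer control, e.g. dictionary-encoding only selected columns
pq.write_table(
    pa.Table.from_pandas(iris, preserve_index=False),
    WORK / "iris-gzip.parquet",
    compression="gzip",
    use_dictionary=["target"],  # dictionary-encode just the label column
)

# compare the resulting file sizes
for f in ["iris-pd.parquet", "iris-zstd.parquet", "iris-gzip.parquet"]:
    print(f, (WORK / f).stat().st_size, "bytes")
```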
753+
{
754+
"cell_type": "markdown",
755+
"id": "b8ee8098",
756+
"metadata": {},
757+
"source": [
758+
"#### Appending to a Parquet dataset\n",
759+
"\n",
760+
"In the OLAP notion, Parquet datasets are not designed for mutability. Hence, rows can't be updated or deleted. \n",
761+
"\n",
762+
"However, although rows can't simply be appended to a Parquet _file_, new rows _can_ be appended to a Parquet _dataset_, which can consist of multiple files:\n",
763+
"\n",
764+
"- We can simply write new rows to a new Parquet file and make sure it is included in the file glob passed to the `read_parquet()` function. (Usually this means it should be in the same directory as the existing Parquet files.)\n",
765+
"- We can also add new rows to an existing partitioned Parquet dataset. To allow existing files but prevent overwriting them, we need to a combination of basename template and existing data behavior.\n",
766+
"\n",
767+
"Let's say we have 3 batches of data but we receive them not all at once but in 3 separate batches:"
768+
]
769+
},
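The first bullet in the cell above (appending by simply writing another Parquet file that the read glob picks up) is not demonstrated in the notebook, so here is a minimal sketch under the same WORK setup; the batch file name and the 10-row sample are illustrative:

```python
import duckdb
import pandas as pd
from pathlib import Path

WORK = Path("work/Python-ML")  # same work directory the notebook defines

# pretend a new batch of rows arrives later and gets its own file
new_rows = pd.read_parquet(WORK / "iris-pd.parquet").sample(n=10)
new_rows.to_parquet(WORK / "iris-pd-batch2.parquet", index=False)

# the glob now matches both the original file and the appended one
duckdb.sql(f"select count(*) from read_parquet('{WORK}/iris-pd*.parquet')").show()
```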
770+
{
771+
"cell_type": "code",
772+
"execution_count": null,
773+
"id": "c63f1593",
774+
"metadata": {},
775+
"outputs": [],
776+
"source": [
777+
"iris_shuffled = iris.sample(frac=1, replace=False)\n",
778+
"iris_b1 = iris_shuffled.iloc[:50]\n",
779+
"iris_b2 = iris_shuffled.iloc[50:100]\n",
780+
"iris_b3 = iris_shuffled.iloc[100:]"
781+
]
782+
},
783+
{
784+
"cell_type": "markdown",
785+
"id": "526b72a6",
786+
"metadata": {},
787+
"source": [
788+
"Write the first batch (assuming we don't have the others yet)"
789+
]
790+
},
791+
{
792+
"cell_type": "code",
793+
"execution_count": null,
794+
"id": "2ed33d79",
795+
"metadata": {},
796+
"outputs": [],
797+
"source": [
798+
"pq.write_to_dataset(pa.Table.from_pandas(iris_b1, preserve_index=False),\n",
799+
" root_path=f'{WORK}/iris-b', partition_cols=['target'],\n",
800+
" basename_template='b1-{i}.parquet',\n",
801+
" existing_data_behavior='overwrite_or_ignore')"
802+
]
803+
},
804+
{
805+
"cell_type": "markdown",
806+
"id": "0db850ec",
807+
"metadata": {},
808+
"source": [
809+
"Then we can keep appending batches as we get them (or write all batches at once if we have them):"
810+
]
811+
},
812+
{
813+
"cell_type": "code",
814+
"execution_count": null,
815+
"id": "87d256c9",
816+
"metadata": {},
817+
"outputs": [],
818+
"source": [
819+
"for b, batch in enumerate([iris_b2, iris_b3], start=2):\n",
820+
" pq.write_to_dataset(pa.Table.from_pandas(batch, preserve_index=False),\n",
821+
" root_path=f'{WORK}/iris-b', partition_cols=['target'],\n",
822+
" basename_template=f'b{b}'+'-{i}.parquet',\n",
823+
" existing_data_behavior='overwrite_or_ignore')"
824+
]
825+
},
826+
{
827+
"cell_type": "markdown",
828+
"id": "ca832844",
829+
"metadata": {},
830+
"source": [
831+
"How we query the dataset remains the same, regardless of whether the data was written all at once or in batches:"
832+
]
833+
},
834+
{
835+
"cell_type": "code",
836+
"execution_count": null,
837+
"id": "c8582c2c",
838+
"metadata": {},
839+
"outputs": [],
840+
"source": [
841+
"duckdb.sql(f\"select * from read_parquet('{WORK}/iris-b/*/*.parquet', hive_partitioning=True)\")"
842+
]
843+
},
844+
{
845+
"cell_type": "markdown",
846+
"id": "ba9266c5",
847+
"metadata": {},
848+
"source": [
849+
"Changing the schema of a Parquet dataset (such as by adding columns) is difficult. The best way to approach this is to create separate Parquet files for each consistent data batch, then use `duckdb.sql()` to combine them (presumably using some kind of OUTER JOIN), and writing the result back to a Parquet dataset.\n",
850+
"\n",
851+
"Say, our first batch doesn't have the petal measurements:"
852+
]
853+
},
854+
{
855+
"cell_type": "code",
856+
"execution_count": null,
857+
"id": "7162071f",
858+
"metadata": {},
859+
"outputs": [],
860+
"source": [
861+
"iris_b1.filter(regex=\"(sepal|target).*\").to_parquet(f'{WORK}/iris-c1.parquet', index=False)"
862+
]
863+
},
864+
{
865+
"cell_type": "markdown",
866+
"id": "2b9f0d5a",
867+
"metadata": {},
868+
"source": [
869+
"Then combine with the other two datasets"
870+
]
871+
},
872+
{
873+
"cell_type": "code",
874+
"execution_count": null,
875+
"id": "2c5aea8a",
876+
"metadata": {},
877+
"outputs": [],
878+
"source": [
879+
"rel = duckdb.sql(\"select * from iris_b2 UNION ALL \"\n",
880+
" \"select * from iris_b3 UNION ALL \"\n",
881+
" \"select sepal_length, sepal_width, null, null, target \" +\n",
882+
" f\"from read_parquet('{WORK}/iris-c1.parquet')\")\n",
883+
"rel"
884+
]
885+
},
886+
{
887+
"cell_type": "code",
888+
"execution_count": null,
889+
"id": "50452dcd",
890+
"metadata": {},
891+
"outputs": [],
892+
"source": [
893+
"rel.to_parquet(f'{WORK}/iris-c.parquet')\n",
894+
"# or alternatively as a partitioned dataset:\n",
895+
"pq.write_to_dataset(rel.arrow(), root_path=f'{WORK}/iris-c', partition_cols=['target'],\n",
896+
" existing_data_behavior='delete_matching')"
897+
]
898+
},
899+
{
900+
"cell_type": "code",
901+
"execution_count": null,
902+
"id": "bfebccbf",
903+
"metadata": {},
904+
"outputs": [],
905+
"source": [
906+
"duckdb.sql(f\"select * from read_parquet('{WORK}/iris-c.parquet')\")"
907+
]
908+
},
909+
{
910+
"cell_type": "code",
911+
"execution_count": null,
912+
"id": "149cad80",
913+
"metadata": {},
914+
"outputs": [],
915+
"source": [
916+
"duckdb.sql(f\"select target, \"\n",
917+
" \"count(*) as num_rows, count(petal_length) as num_petal_measures \"\n",
918+
" f\"from read_parquet('{WORK}/iris-c.parquet')\" +\n",
919+
" \"group by target\")"
920+
]
921+
},
922+
{
923+
"cell_type": "code",
924+
"execution_count": null,
925+
"id": "57ab8b1b",
926+
"metadata": {},
927+
"outputs": [],
928+
"source": [
929+
"duckdb.sql(f\"select * from read_parquet('{WORK}/iris-c/*/*.parquet', hive_partitioning=True)\")"
930+
]
931+
},
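The markdown cell above mentions an OUTER JOIN as one way to combine batches whose columns differ, while the notebook uses UNION ALL with NULL placeholders. Below is a hedged sketch of the join-based variant, assuming the batches share a key column; `sample_id`, the toy data, and the output file name are hypothetical, since the iris table has no natural key:

```python
import duckdb
import pandas as pd
from pathlib import Path

WORK = Path("work/Python-ML")  # same work directory the notebook defines

# batch 1 has only sepal measurements, batch 2 only petal measurements
b1 = pd.DataFrame({"sample_id": [1, 2, 3],
                   "sepal_length": [5.1, 4.9, 4.7], "sepal_width": [3.5, 3.0, 3.2]})
b2 = pd.DataFrame({"sample_id": [2, 3, 4],
                   "petal_length": [1.4, 1.3, 1.5], "petal_width": [0.2, 0.2, 0.2]})

# a FULL OUTER JOIN keeps rows from either batch; columns missing on one side become NULL
rel = duckdb.sql("""
    select coalesce(b1.sample_id, b2.sample_id) as sample_id,
           sepal_length, sepal_width, petal_length, petal_width
    from b1 full outer join b2 on b1.sample_id = b2.sample_id
    order by sample_id
""")
rel.to_parquet(f'{WORK}/iris-joined.parquet')  # write the merged schema back out
```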
932+
{
933+
"cell_type": "code",
934+
"execution_count": null,
935+
"id": "d25f1509",
711936
"metadata": {},
712937
"outputs": [],
713938
"source": []
714939
}
715940
],
716941
"metadata": {
717942
"kernelspec": {
718-
"display_name": "Python 3",
943+
"display_name": "Python 3 (ipykernel)",
719944
"language": "python",
720945
"name": "python3"
721946
},
