# A unique identifier for the head node and workers of this cluster.
cluster_name: example-cluster

# The maximum number of worker nodes to launch in addition to the head
# node.
max_workers: 10

# The autoscaler scales up the cluster faster with a higher upscaling speed.
# E.g., if the workload requires more nodes, the autoscaler gradually scales
# up the cluster in increments of upscaling_speed * currently_running_nodes.
# This number should be > 0.
upscaling_speed: 10
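# For example, with upscaling_speed: 10 and 2 currently running nodes, the
# autoscaler may add up to 10 * 2 = 20 nodes in a single scaling round
# (still capped by max_workers above).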

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 1

setup_commands:
  - pip3 install --upgrade pip
  - pip3 install codeflare --no-cache-dir

# Kubernetes resources that need to be configured for the autoscaler to be
# able to manage the Ray cluster. If any of the provided resources don't
# exist, the autoscaler will attempt to create them. If this fails, you may
# not have the required permissions and will have to ask your cluster
# administrator to create them.
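# If you are unsure whether you have these permissions, one way to check is
# with kubectl (substituting your real namespace for NAMESPACE), e.g.:
#   kubectl auth can-i create pods -n NAMESPACE
#   kubectl auth can-i create services -n NAMESPACE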
provider:
  type: kubernetes

  # Exposing external IP addresses for ray pods isn't currently supported.
  use_internal_ips: true

  # Namespace to use for all resources created.
  namespace: NAMESPACE

  services:
    # Service that maps to the head node of the Ray cluster.
    - apiVersion: v1
      kind: Service
      metadata:
        # NOTE: If you're running multiple Ray clusters with services
        # on one Kubernetes cluster, they must have unique service
        # names.
        name: example-cluster-ray-head
      spec:
        # This selector must match the head node pod's labels below.
        selector:
          component: example-cluster-ray-head
        ports:
          - name: client
            protocol: TCP
            port: 10001
            targetPort: 10001
          - name: dashboard
            protocol: TCP
            port: 8265
            targetPort: 8265
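  # With this Service in place, one way to reach the head node from outside
  # the cluster is kubectl port-forwarding (assuming kubectl access to the
  # namespace above), e.g.:
  #   kubectl -n NAMESPACE port-forward service/example-cluster-ray-head 8265:8265
  # after which the Ray dashboard is served at http://localhost:8265.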

# Specify the pod type for the ray head node (as configured below).
head_node_type: head_node
# Specify the allowed pod types for this ray cluster and the resources they provide.
available_node_types:
  worker_node:
    # Minimum number of Ray workers of this Pod type.
    min_workers: 0
    # Maximum number of Ray workers of this Pod type. Takes precedence over min_workers.
    max_workers: 10
    # User-specified custom resources for use by Ray can be declared here as an
    # object with string keys and integer values. (Ray detects CPU and GPU from
    # the pod spec's resource requests and limits, so there is no need to list
    # those here.)
    node_config:
      apiVersion: v1
      kind: Pod
      metadata:
        # Automatically generates a name for the pod with this prefix.
        generateName: example-cluster-ray-worker-
      spec:
        restartPolicy: Never
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
        containers:
          - name: ray-node
            imagePullPolicy: Always
            image: rayproject/ray-ml
            command: ["/bin/bash", "-c", "--"]
            args: ["trap : TERM INT; sleep infinity & wait;"]
            # This volume allocates shared memory for Ray to use for its plasma
            # object store. If you do not provide this, Ray will fall back to
            # /tmp, which causes slowdowns if it is not a shared memory volume.
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
            resources:
              requests:
                cpu: 1
                memory: 2G
              limits:
                # The maximum memory that this pod is allowed to use. The
                # limit will be detected by Ray and split to use 10% for
                # redis, 30% for the shared memory object store, and the
                # rest for application memory. If this limit is not set and
                # the object store size is not set manually, Ray will
                # allocate a very large object store in each pod, which may
                # cause problems for other pods.
                cpu: 1
                memory: 2G
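                # As a worked example of the split described above, a 2G limit
                # gives roughly 200M (10%) to redis, 600M (30%) to the object
                # store, and the remaining ~1.2G to application memory.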

  head_node:
    # The minimum number of worker nodes of this type to launch.
    # This number should be >= 0.
    min_workers: 0
    # The maximum number of worker nodes of this type to launch.
    # This takes precedence over min_workers.
    max_workers: 0
    node_config:
      apiVersion: v1
      kind: Pod
      metadata:
        # Automatically generates a name for the pod with this prefix.
        generateName: example-cluster-ray-head-
        # Must match the head node service selector above if a head node
        # service is required.
        labels:
          component: example-cluster-ray-head
      spec:
        # Change this if you use a different service account for the
        # autoscaler or want to provide your own.
        serviceAccountName: NAMESPACE-writer

        restartPolicy: Never

        # This volume allocates shared memory for Ray to use for its plasma
        # object store. If you do not provide this, Ray will fall back to
        # /tmp, which causes slowdowns if it is not a shared memory volume.
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
        containers:
          - name: ray-node
            imagePullPolicy: Always
            image: rayproject/ray-ml
            # Do not change this command - it keeps the pod alive until it is
            # explicitly killed.
            command: ["/bin/bash", "-c", "--"]
            args: ["trap : TERM INT; sleep infinity & wait;"]
            ports:
              - containerPort: 6379  # Redis port
              - containerPort: 10001  # Used by Ray Client
              - containerPort: 8265  # Used by Ray Dashboard

            # This volume allocates shared memory for Ray to use for its plasma
            # object store. If you do not provide this, Ray will fall back to
            # /tmp, which causes slowdowns if it is not a shared memory volume.
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
            resources:
              requests:
                cpu: 1
                memory: 2G
              limits:
                # The maximum memory that this pod is allowed to use. The
                # limit will be detected by Ray and split to use 10% for
                # redis, 30% for the shared memory object store, and the
                # rest for application memory. If this limit is not set and
                # the object store size is not set manually, Ray will
                # allocate a very large object store in each pod, which may
                # cause problems for other pods.
                cpu: 1
                memory: 2G

# Command to start ray on the head node. You don't need to change this.
# Note that --dashboard-host is set to 0.0.0.0 so that Kubernetes can
# port-forward to the dashboard.
head_start_ray_commands:
  - ray stop
  - ulimit -n 65536; ray start --head --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host 0.0.0.0

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
  - ray stop
  - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379
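
# Once the cluster is up and the client port is forwarded locally, e.g. with
#   kubectl -n NAMESPACE port-forward service/example-cluster-ray-head 10001:10001
# one way to connect from Python is via Ray Client:
#   import ray
#   ray.init("ray://127.0.0.1:10001")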