
Commit 6750dc6

Add alternate deploy yaml and brief explanation
1 parent a2b290a commit 6750dc6

2 files changed: +297 −0 lines

Lines changed: 113 additions & 0 deletions
@@ -0,0 +1,113 @@
<!--
{% comment %}
Copyright 2021 IBM

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
-->

# Alternate CodeFlare Deploy Configuration for IBM Cloud Code Engine

`example-ray-ml-cluster.yaml.template` starts from the `rayproject/ray-ml` Docker image and installs CodeFlare through the cluster's `setup_commands` when the nodes are brought up.

## Install prerequisites

Install the prerequisites for Code Engine and Ray:

1. Get ready with IBM Code Engine:

   - [Set up your Code Engine CLI](https://cloud.ibm.com/docs/codeengine?topic=codeengine-install-cli)
   - [optional] [Create your first Code Engine Project using the CLI](https://cloud.ibm.com/docs/codeengine?topic=codeengine-manage-project)

2. [Set up the Kubernetes CLI](https://kubernetes.io/docs/tasks/tools/install-kubectl/)

3. Install the Kubernetes Python client, which `ray up` uses to talk to your cluster (an optional check follows this list):

   ```shell
   pip install kubernetes
   ```

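As an optional check, a short Python snippet can confirm that the client is installed and can reach your cluster through the same kubeconfig that `kubectl` uses. This is only a sketch and assumes your kubeconfig already points at a Code Engine project (set up in Step 1 below):

```python
# Sketch: verify the Kubernetes Python client can talk to the cluster.
# Assumes the kubeconfig already points at your Code Engine project.
from kubernetes import client, config

config.load_kube_config()             # same context that kubectl uses
v1 = client.CoreV1Api()
for ns in v1.list_namespace().items:  # mirrors `kubectl get namespace`
    print(ns.metadata.name)
```
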
## Step 1 - Define your Ray Cluster on Code Engine

If not already done in the previous step, create a Code Engine project:

```shell
ibmcloud ce project create -n <your project name>
```

Select the Code Engine project and switch the `kubectl` context to the project:

```shell
ibmcloud ce project select -n <your project name> -k
```

Extract the Kubernetes namespace assigned to your project. The namespace appears in the `NAME` column of the output of this command:

```shell
kubectl get namespace
```

Export the namespace:

```shell
export NAMESPACE=<namespace from above>
```

A reference Ray cluster definition can be customized for your namespace with the following commands:

```shell
cd ./deploy/ibm_cloud_code_engine/
sed "s/NAMESPACE/$NAMESPACE/" ./example-ray-ml-cluster.yaml.template > ./example-cluster.yaml
```

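The same customization can be scripted in Python if you prefer. The sketch below assumes the kubeconfig context created by `ibmcloud ce project select -k` records the project namespace:

```python
# Sketch: render the cluster template without sed.
# Assumes `ibmcloud ce project select -k` stored the project namespace
# in the active kubeconfig context.
from pathlib import Path
from kubernetes import config

_, active = config.list_kube_config_contexts()
namespace = active["context"]["namespace"]

template = Path("example-ray-ml-cluster.yaml.template").read_text()
Path("example-cluster.yaml").write_text(template.replace("NAMESPACE", namespace))
```
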
This reference deployment file will create a Ray cluster with the following characteristics:

- a cluster named `example-cluster` with up to 10 workers, using the Kubernetes provider pointing to your Code Engine project
- a head node type with 1 vCPU and 2 GB of memory using the `rayproject/ray-ml` image, with CodeFlare installed by the setup commands
- a worker node type with 1 vCPU and 2 GB of memory using the same `rayproject/ray-ml` image
- the default startup commands to start Ray within the container and listen on the proper ports
- an autoscaler upscaling speed of 10 for quick upscaling in this short and simple demo

## Step 2 - Start Ray cluster

You can now start the Ray cluster by running:

```shell
ray up example-cluster.yaml
```

This command creates the Ray head node as a Kubernetes pod in your Code Engine project. When you create the Ray cluster for the first time, it can take a few minutes until the Ray image is downloaded from the container registry.

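To watch the head pod come up, you can poll it by the `component=example-cluster-ray-head` label that the deployment file attaches to it. The following is a rough sketch with the Kubernetes Python client; `kubectl -n $NAMESPACE get pods` gives the same information:

```python
# Sketch: wait for the Ray head pod to reach the Running phase.
# Uses the component=example-cluster-ray-head label from example-cluster.yaml.
import os
import time

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
namespace = os.environ["NAMESPACE"]  # exported in Step 1

while True:
    pods = v1.list_namespaced_pod(
        namespace, label_selector="component=example-cluster-ray-head"
    ).items
    if pods and pods[0].status.phase == "Running":
        print("head pod ready:", pods[0].metadata.name)
        break
    print("waiting for head pod...")
    time.sleep(10)
```
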
## Step 3 - Run sample Pipeline with Jupyter

A Jupyter server will be running automatically on the head node. To access it from your local machine, forward port 8888 from the head node pod (its name starts with `example-cluster-ray-head-`; you can find it with `kubectl -n $NAMESPACE get pods`):

```shell
kubectl -n $NAMESPACE port-forward <head-pod-name> 8888:8888
```

You can now access the Jupyter server by pointing your browser to the following URL:

```shell
http://127.0.0.1:8888/lab
```

Once in the Jupyter environment, examples are found in the `codeflare/notebooks` directory in the container image. Documentation for reference use cases can be found in [Examples](https://codeflare.readthedocs.io/en/latest/).

To execute any of the notebooks with the Ray cluster running on Code Engine, edit the `ray.init()` line to use the following parameters:

```python
ray.init(address='auto', _redis_password='5241590000000000')
```

This change allows pipelines to be auto-scaled on the underlying Ray cluster running on Code Engine (up to 10 workers with the reference deployment). The number of workers and scaling parameters can be adjusted in the `yaml` file.
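
For a quick check from a notebook, a cell along these lines confirms that work is being dispatched to the Ray cluster; the `square` task is only a placeholder, and the CodeFlare pipelines in the example notebooks are what exercise the autoscaler with heavier workloads:

```python
# Sketch of a notebook cell: connect to the cluster started by `ray up`
# and fan out a few trivial tasks. The task itself is a stand-in; any
# CodeFlare pipeline run after this ray.init() call uses the same cluster.
import ray

ray.init(address='auto', _redis_password='5241590000000000')

@ray.remote
def square(x):
    return x * x

print(ray.get([square.remote(i) for i in range(100)]))
```
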
Lines changed: 184 additions & 0 deletions
@@ -0,0 +1,184 @@
# A unique identifier for the head node and workers of this cluster.
cluster_name: example-cluster

# The maximum number of worker nodes to launch in addition to the head
# node.
max_workers: 10

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 10

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 1

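# Commands the autoscaler runs to set up each node after its container starts;
# here they install CodeFlare on top of the rayproject/ray-ml image.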
setup_commands:
  - pip3 install --upgrade pip
  - pip3 install codeflare --no-cache-dir

# Kubernetes resources that need to be configured for the autoscaler to be
# able to manage the Ray cluster. If any of the provided resources don't
# exist, the autoscaler will attempt to create them. If this fails, you may
# not have the required permissions and will have to request them to be
# created by your cluster administrator.
provider:
    type: kubernetes

    # Exposing external IP addresses for ray pods isn't currently supported.
    use_internal_ips: true

    # Namespace to use for all resources created.
    namespace: NAMESPACE

    services:
      # Service that maps to the head node of the Ray cluster.
      - apiVersion: v1
        kind: Service
        metadata:
          # NOTE: If you're running multiple Ray clusters with services
          # on one Kubernetes cluster, they must have unique service
          # names.
          name: example-cluster-ray-head
        spec:
          # This selector must match the head node pod's selector below.
          selector:
            component: example-cluster-ray-head
          ports:
            - name: client
              protocol: TCP
              port: 10001
              targetPort: 10001
            - name: dashboard
              protocol: TCP
              port: 8265
              targetPort: 8265

# Specify the pod type for the ray head node (as configured below).
head_node_type: head_node
# Specify the allowed pod types for this ray cluster and the resources they provide.
available_node_types:
  worker_node:
    # Minimum number of Ray workers of this Pod type.
    min_workers: 0
    # Maximum number of Ray workers of this Pod type. Takes precedence over min_workers.
    max_workers: 10
    # User-specified custom resources for use by Ray. Object with string keys and integer values.
    # (Ray detects CPU and GPU from pod spec resource requests and limits, so no need to fill those here.)
    node_config:
      apiVersion: v1
      kind: Pod
      metadata:
        # Automatically generates a name for the pod with this prefix.
        generateName: example-cluster-ray-worker-
      spec:
        restartPolicy: Never
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
        containers:
        - name: ray-node
          imagePullPolicy: Always
          image: rayproject/ray-ml
          command: ["/bin/bash", "-c", "--"]
          args: ["trap : TERM INT; sleep infinity & wait;"]
          # This volume allocates shared memory for Ray to use for its plasma
          # object store. If you do not provide this, Ray will fall back to
          # /tmp, which causes slowdowns if it is not a shared memory volume.
          volumeMounts:
          - mountPath: /dev/shm
            name: dshm
          resources:
            requests:
              cpu: 1
              memory: 2G
            limits:
              # The maximum memory that this pod is allowed to use. The
              # limit will be detected by ray and split to use 10% for
              # redis, 30% for the shared memory object store, and the
              # rest for application memory. If this limit is not set and
              # the object store size is not set manually, ray will
              # allocate a very large object store in each pod that may
              # cause problems for other pods.
              cpu: 1
              memory: 2G

  head_node:
    # The minimum number of worker nodes of this type to launch.
    # This number should be >= 0.
    min_workers: 0
    # The maximum number of worker nodes of this type to launch.
    # This takes precedence over min_workers.
    max_workers: 0
    node_config:
      apiVersion: v1
      kind: Pod
      metadata:
        # Automatically generates a name for the pod with this prefix.
        generateName: example-cluster-ray-head-
        # Must match the head node service selector above if a head node
        # service is required.
        labels:
          component: example-cluster-ray-head
      spec:
        # Change this if you altered the autoscaler_service_account above
        # or want to provide your own.
        serviceAccountName: NAMESPACE-writer

        restartPolicy: Never

        # This volume allocates shared memory for Ray to use for its plasma
        # object store. If you do not provide this, Ray will fall back to
        # /tmp, which causes slowdowns if it is not a shared memory volume.
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
        containers:
        - name: ray-node
          imagePullPolicy: Always
          image: rayproject/ray-ml
          # Do not change this command - it keeps the pod alive until it is
          # explicitly killed.
          command: ["/bin/bash", "-c", "--"]
          args: ['trap : TERM INT; sleep infinity & wait;']
          ports:
          - containerPort: 6379 # Redis port
          - containerPort: 10001 # Used by Ray Client
          - containerPort: 8265 # Used by Ray Dashboard

          # This volume allocates shared memory for Ray to use for its plasma
          # object store. If you do not provide this, Ray will fall back to
          # /tmp, which causes slowdowns if it is not a shared memory volume.
          volumeMounts:
          - mountPath: /dev/shm
            name: dshm
          resources:
            requests:
              cpu: 1
              memory: 2G
            limits:
              # The maximum memory that this pod is allowed to use. The
              # limit will be detected by ray and split to use 10% for
              # redis, 30% for the shared memory object store, and the
              # rest for application memory. If this limit is not set and
              # the object store size is not set manually, ray will
              # allocate a very large object store in each pod that may
              # cause problems for other pods.
              cpu: 1
              memory: 2G

# Command to start ray on the head node. You don't need to change this.
# Note dashboard-host is set to 0.0.0.0 so that kubernetes can port forward.
head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host 0.0.0.0

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379
