Commit 5c563b0

amitosaurus, Amit Bhatnagar, and KeitaW authored
utility to dump details of all nodes in a cluster, into a csv file (#652)
* utility to dump details of all nodes in a cluster, into a csv file
* Rename list_cluster_nodes.py to dump_cluster_nodes_info.py
  updated script name to better reflect its functionality
* README.md for proper usage of the "tools" folder
  Adding README.md that provides guidelines for usage of utility script(s) in the "tools" folder
* fixing the file format
* Update 1.architectures/5.sagemaker-hyperpod/tools/README.md
  Co-authored-by: Keita Watanabe <keitaw09@gmail.com>
* Update README.md

---------

Co-authored-by: Amit Bhatnagar <theamit@amazon.com>
Co-authored-by: Keita Watanabe <keitaw09@gmail.com>
1 parent 77fdf44 commit 5c563b0

2 files changed: +81 -0 lines changed
@@ -0,0 +1,12 @@
# SMHP Tools <!-- omit from toc -->

The `tools` directory contains utility scripts for common debugging and troubleshooting tasks.
Each script is described below, along with its usage and expected output.

### [`dump_cluster_nodes_info.py`](./dump_cluster_nodes_info.py)

Utility to dump details of all nodes in a cluster into a CSV file.

**Usage:** `python dump_cluster_nodes_info.py --cluster-name <name-of-cluster-whose-node-details-are-needed>`

**Output:** A `nodes.csv` file in the current directory, containing details of all nodes in the cluster.
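
As a rough illustration of how the output might be consumed, the sketch below reads `nodes.csv` back and counts nodes per status. It assumes the column names written by `dump_cluster_nodes_info.py` (in particular `status`); the helper name `count_nodes_by_status` is hypothetical and not part of the tool.

```python
import csv
from collections import Counter


def count_nodes_by_status(csv_path="nodes.csv"):
    """Count rows in the dumped CSV, grouped by the 'status' column.

    Hypothetical helper: assumes the header row written by
    dump_cluster_nodes_info.py, which includes a 'status' column.
    """
    with open(csv_path, newline="") as fd:
        reader = csv.DictReader(fd)
        return Counter(row["status"] for row in reader)


# Example usage, after running the dump script in the same directory:
#   print(count_nodes_by_status())
```

A quick summary like this can help spot nodes that are not in the expected state before inspecting individual rows.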
@@ -0,0 +1,69 @@
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0

import argparse
import csv

import boto3


def list_cluster_nodes_all(sagemaker_client, cluster_name):
    """Return the summaries of all nodes in the cluster, following pagination."""

    nodes = []
    next_token = None

    while True:

        params = {
            "ClusterName": cluster_name
        }
        if next_token:
            params["NextToken"] = next_token

        response = sagemaker_client.list_cluster_nodes(**params)

        nodes += response["ClusterNodeSummaries"]

        # Keep fetching pages for as long as the API returns a continuation token
        next_token = response.get("NextToken")
        if not next_token:
            break

    return nodes


def dump_nodes(cluster_name):
    """Write the details of every node in the cluster to nodes.csv."""

    sagemaker_client = boto3.client("sagemaker")

    nodes = list_cluster_nodes_all(sagemaker_client, cluster_name)

    # newline="" lets the csv module control line endings (avoids blank rows on Windows)
    with open("nodes.csv", "w", newline="") as fd:
        csv_writer = csv.writer(fd)
        csv_writer.writerow(["instance-id", "ip-address", "status", "hostname", "instance-group", "launch-time"])

        for node in nodes:
            # For each node, we need to call the 'describe_cluster_node' API
            # Ref: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/describe_cluster_node.html
            instance_id = node['InstanceId']
            node_details = sagemaker_client.describe_cluster_node(ClusterName=cluster_name, NodeId=instance_id)['NodeDetails']
            # ...and write necessary data in the CSV...
            csv_writer.writerow([node_details['InstanceId'],
                                 node_details['PrivatePrimaryIp'],
                                 node_details['InstanceStatus']['Status'],
                                 node_details['PrivateDnsHostname'],
                                 node_details['InstanceGroupName'],
                                 node_details['LaunchTime']])

    print(f"Details of all nodes in cluster '{cluster_name}' have been saved in nodes.csv")


if __name__ == "__main__":

    argparser = argparse.ArgumentParser(description="Dump all HyperPod cluster nodes and their details in a CSV")
    argparser.add_argument("--cluster-name", action="store", required=True, help="Name of cluster to dump")
    args = argparser.parse_args()

    dump_nodes(args.cluster_name)
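
The `NextToken` loop in `list_cluster_nodes_all` could also be written as a generator, which avoids building the full list up front. The following is a minimal sketch of that idea, not the committed implementation: `iter_cluster_nodes` and `FakeSageMakerClient` are hypothetical names, and the fake client exists only so the pagination logic can be exercised without AWS credentials. It relies on the same `list_cluster_nodes` call and the `ClusterNodeSummaries` / `NextToken` response keys used in the script.

```python
def iter_cluster_nodes(sagemaker_client, cluster_name):
    """Yield node summaries page by page, following NextToken until exhausted."""
    params = {"ClusterName": cluster_name}
    while True:
        response = sagemaker_client.list_cluster_nodes(**params)
        yield from response["ClusterNodeSummaries"]
        next_token = response.get("NextToken")
        if not next_token:
            return
        params["NextToken"] = next_token


class FakeSageMakerClient:
    """Hypothetical stand-in for boto3.client("sagemaker"), for offline testing only."""

    def __init__(self, pages):
        # pages: a non-empty list of lists, each inner list being one page of summaries
        self._pages = pages

    def list_cluster_nodes(self, ClusterName, NextToken=None):
        # The token here is simply the index of the next page, encoded as a string
        index = int(NextToken) if NextToken else 0
        response = {"ClusterNodeSummaries": self._pages[index]}
        if index + 1 < len(self._pages):
            response["NextToken"] = str(index + 1)
        return response
```

With a real client, `list(iter_cluster_nodes(boto3.client("sagemaker"), cluster_name))` would be expected to produce the same list that `list_cluster_nodes_all` returns.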