utility to dump details of all nodes in a cluster, into a csv file #652

Merged
merged 6 commits into from
May 10, 2025
Changes from 4 commits
9 changes: 9 additions & 0 deletions 1.architectures/5.sagemaker-hyperpod/tools/README.md
@@ -0,0 +1,9 @@
# SMHP Tools <!-- omit from toc -->

The `tools` directory contains utility scripts for debugging and troubleshooting common issues.
The table below lists each script with its description, usage, and expected output.

| Utility Script | Description | Usage | Output |
| -------------------------------------------------------------- | ------------------------------------------------------------------- | ------------------------- | ---------------------------------------------------------------------------------------- |
| [`dump_cluster_nodes_info.py`](./dump_cluster_nodes_info.py) | Dumps details of all nodes in a cluster to a CSV file | `python dump_cluster_nodes_info.py --cluster-name <cluster-name>` | `nodes.csv` in the current directory, containing the details of every node in the cluster |
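
Once `nodes.csv` exists, it can be inspected with Python's standard `csv` module. The following is a minimal sketch, assuming the column names from the header row the script writes (`instance-id`, `ip-address`, `status`, `hostname`, `instance-group`, `launch-time`). The helper name `count_nodes_by_status` and the sample rows are hypothetical and exist only for illustration.

```python
import csv
import io
from collections import Counter


def count_nodes_by_status(csv_file):
    """Count rows per value of the 'status' column in a nodes.csv-style file."""
    reader = csv.DictReader(csv_file)
    return Counter(row["status"] for row in reader)


# Hypothetical sample data standing in for a real nodes.csv.
sample = io.StringIO(
    "instance-id,ip-address,status,hostname,instance-group,launch-time\n"
    "i-aaa,10.0.0.1,Running,ip-10-0-0-1,workers,2025-01-01 00:00:00\n"
    "i-bbb,10.0.0.2,Running,ip-10-0-0-2,workers,2025-01-01 00:00:00\n"
    "i-ccc,10.0.0.3,Pending,ip-10-0-0-3,workers,2025-01-01 00:00:00\n"
)
status_counts = count_nodes_by_status(sample)
print(dict(status_counts))
```

To run the same helper against a real dump, pass it a file opened with `open("nodes.csv", newline="")` instead of the in-memory sample.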

@@ -0,0 +1,69 @@
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0

import argparse
import csv

import boto3


def list_cluster_nodes_all(sagemaker_client, cluster_name):

    nodes = []
    next_token = None

    while True:

        params = {
            "ClusterName": cluster_name
        }
        if next_token:
            params["NextToken"] = next_token

        response = sagemaker_client.list_cluster_nodes(**params)

        nodes += response["ClusterNodeSummaries"]

        if "NextToken" in response and response["NextToken"]:
            next_token = response["NextToken"]
            continue

        break

    return nodes


def dump_nodes(cluster_name):

    sagemaker_client = boto3.client("sagemaker")

    nodes = list_cluster_nodes_all(sagemaker_client, cluster_name)

    # newline="" lets the csv module control line endings, as its documentation recommends
    with open("nodes.csv", "w", newline="") as fd:
        csv_writer = csv.writer(fd)
        csv_writer.writerow(["instance-id", "ip-address", "status", "hostname", "instance-group", "launch-time"])

        for node in nodes:
            # For each node, we need to call the 'describe_cluster_node' API
            # Ref: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/describe_cluster_node.html
            instance_id = node['InstanceId']
            node_details = sagemaker_client.describe_cluster_node(ClusterName=cluster_name, NodeId=instance_id)['NodeDetails']
            # ...and write necessary data in the CSV...
            csv_writer.writerow([node_details['InstanceId'],
                                 node_details['PrivatePrimaryIp'],
                                 node_details['InstanceStatus']['Status'],
                                 node_details['PrivateDnsHostname'],
                                 node_details['InstanceGroupName'],
                                 node_details['LaunchTime']])

    print(f"Details of all nodes in cluster '{cluster_name}' have been saved in nodes.csv")


if __name__ == "__main__":

    argparser = argparse.ArgumentParser(description="Dump all HyperPod cluster nodes and their details in a CSV")
    argparser.add_argument("--cluster-name", action="store", required=True, help="Name of cluster to dump")
    args = argparser.parse_args()

    dump_nodes(args.cluster_name)
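
The `NextToken` loop in `list_cluster_nodes_all` can be exercised without AWS credentials by passing in a stub client. The sketch below is one possible way to do that, not part of this change: `FakeSageMakerClient` is a hypothetical stand-in that only mimics the two response keys the script reads (`ClusterNodeSummaries` and `NextToken`), and its integer-like tokens are an assumption for illustration, since real tokens are opaque strings.

```python
class FakeSageMakerClient:
    """Hypothetical stand-in for the boto3 SageMaker client; serves canned pages."""

    def __init__(self, pages):
        self._pages = pages
        self.calls = []  # records the parameters of every call, for inspection

    def list_cluster_nodes(self, **params):
        self.calls.append(params)
        # The token is just the index of the page to serve (an assumption of this stub).
        index = int(params.get("NextToken", 0))
        response = {"ClusterNodeSummaries": self._pages[index]}
        if index + 1 < len(self._pages):
            response["NextToken"] = str(index + 1)
        return response


def list_cluster_nodes_all(sagemaker_client, cluster_name):
    """Collect node summaries across all pages by following NextToken."""
    nodes = []
    next_token = None
    while True:
        params = {"ClusterName": cluster_name}
        if next_token:
            params["NextToken"] = next_token
        response = sagemaker_client.list_cluster_nodes(**params)
        nodes += response["ClusterNodeSummaries"]
        next_token = response.get("NextToken")
        if not next_token:
            break
    return nodes


# Two canned pages: the loop should make two calls and return three nodes.
fake_client = FakeSageMakerClient([
    [{"InstanceId": "i-aaa"}, {"InstanceId": "i-bbb"}],
    [{"InstanceId": "i-ccc"}],
])
all_nodes = list_cluster_nodes_all(fake_client, "demo-cluster")
print([node["InstanceId"] for node in all_nodes])
```

Because the function takes the client as a parameter, the same stub approach would work for checking that the first request omits `NextToken` and that later requests pass the token returned by the previous page.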