
Latest change to Thick Image deployment.yaml breaks Multus on Talos Cluster #1422

Open
SaberSHO opened this issue Apr 18, 2025 · 1 comment

@SaberSHO

What happened:
Multus stopped working on my Talos cluster. The cluster uses Flux and the thick daemonset from https://raw.githubusercontent.com/k8snetworkplumbingwg/multus-cni/master/deployments/multus-daemonset-thick.yml

Additional configuration to copy the CNI binary to the host: https://www.talos.dev/v1.9/kubernetes-guides/network/multus/

Investigating the issue, I found that the install-multus-binary init container is failing:

install-multus-binary:
    Container ID:  containerd://2ce3622c540260979f5b26d206a3731baaf2b97f037caaafd238853859187b23
    Image:         ghcr.io/k8snetworkplumbingwg/multus-cni:snapshot-thick
    Image ID:      ghcr.io/k8snetworkplumbingwg/multus-cni@sha256:cad1ed05d89b25199697ed09723cf1260bb670ee45d8161316ea1af999fe2712
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      cp /usr/src/multus-cni/bin/multus-shim /host/opt/cni/bin/multus-shim && cp /usr/src/multus-cni/bin/passthru /host/opt/cni/bin/passthru
    State:       Waiting
      Reason:    CrashLoopBackOff
    Last State:  Terminated
      Reason:    Error
      Message:   cp: cannot stat '/usr/src/multus-cni/bin/passthru': No such file or directory

      Exit Code:    1
      Started:      Fri, 18 Apr 2025 09:49:03 -0400
      Finished:     Fri, 18 Apr 2025 09:49:03 -0400

Changing the init container command to remove the passthru copy step results in Multus running properly again.
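
A sketch of applying that workaround ad hoc (assuming the default kube-multus-ds daemonset in kube-system with install-multus-binary as its first init container; with Flux this would instead be a kustomize patch on the manifest). The new value is the original command with the passthru copy dropped:

kubectl -n kube-system patch daemonset kube-multus-ds --type=json \
  -p='[{"op":"replace","path":"/spec/template/spec/initContainers/0/command/2","value":"cp /usr/src/multus-cni/bin/multus-shim /host/opt/cni/bin/multus-shim"}]'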

This change seems to come from #1419, specifically this command change:

- "cp /usr/src/multus-cni/bin/multus-shim /host/opt/cni/bin/multus-shim && cp /usr/src/multus-cni/bin/passthru /host/opt/cni/bin/passthru"

What you expected to happen:
The init container succeeds and the Multus binary is copied.

How to reproduce it (as minimally and precisely as possible):
Follow instructions here to install Multus on a Talos cluster: https://www.talos.dev/v1.9/kubernetes-guides/network/multus/

Anything else we need to know?:
It is possible that the install-cni.sh script run as part of the siderolabs install-cni init container needs to be updated to work with these new paths, but I wanted to start the issue here to track it and hopefully get guidance.
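
If the eventual fix is to make the copy step tolerant of images that do not ship the passthru binary (whether in the quick-start daemonset command or in a downstream script such as install-cni.sh), a minimal sketch using the same paths as the failing command above might be:

# Copy multus-shim unconditionally; copy passthru only when the image provides it.
cp /usr/src/multus-cni/bin/multus-shim /host/opt/cni/bin/multus-shim
if [ -f /usr/src/multus-cni/bin/passthru ]; then
  cp /usr/src/multus-cni/bin/passthru /host/opt/cni/bin/passthru
fi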

Environment:

  • Multus version: Tried 4.20 and 4.1.4
    Image path and image ID (from 'docker images'):
    Image: ghcr.io/k8snetworkplumbingwg/multus-cni:snapshot-thick
    Image ID: ghcr.io/k8snetworkplumbingwg/multus-cni@sha256:cad1ed05d89b25199697ed09723cf1260bb670ee45d8161316ea1af999fe2712
    (also tried with stable-thick and v4.1.4-thick)

  • Kubernetes version (use kubectl version): v1.29.2

  • Primary CNI for Kubernetes cluster: Cilium

  • OS (e.g. from /etc/os-release): Talos Linux

  • File of '/etc/cni/net.d/': Not able to view because Talos does not provide direct node access (see the debug-pod sketch at the end of this report)

  • File of '/etc/cni/multus/net.d': Not able to view because Talos does not provide direct node access

  • NetworkAttachment info (use kubectl get net-attach-def -o yaml)

Name:         lan-network
Namespace:    cams
Labels:       kustomize.toolkit.fluxcd.io/name=apps
              kustomize.toolkit.fluxcd.io/namespace=flux-system
Annotations:  <none>
API Version:  k8s.cni.cncf.io/v1
Kind:         NetworkAttachmentDefinition
Metadata:
  Creation Timestamp:  2024-10-20T16:37:06Z
  Generation:          1
  Resource Version:    86578419
  UID:                 c2ca1a7f-443a-472e-b4b0-76c800082e63
Spec:
  Config:  { "cniVersion": "0.3.1", "name": "lan-network", "type": "macvlan", "mode": "bridge", "master": "enp1s0", "ipam": { "type": "host-local", "subnet": "192.168.5.0/24", "rangeStart": "192.168.5.235", "rangeEnd": "192.168.5.239", "routes": [ { "dst": "192.168.5.0/24" } ], "gateway": "192.168.5.1" } }
Events:    <none>
  • Target pod yaml info (with annotation, use kubectl get pod <podname> -o yaml)
Name:             scrypted-0
Namespace:        cams
Priority:         0
Service Account:  default
Node:             node1/192.168.5.211
Start Time:       Fri, 18 Apr 2025 09:26:38 -0400
Labels:           app=scrypted
                  apps.kubernetes.io/pod-index=0
                  controller-revision-hash=scrypted-64757cbdfb
                  statefulset.kubernetes.io/pod-name=scrypted-0
Annotations:      k8s.v1.cni.cncf.io/networks: [ { "name" : "lan-network", "interface": "eth1" } ]
                  kubectl.kubernetes.io/restartedAt: 2025-04-18T09:15:31-04:00
Status:           Running
IP:               10.244.0.250
IPs:
  IP:           10.244.0.250
Controlled By:  StatefulSet/scrypted
Containers:
  scrypted:
    Container ID:   containerd://21458d21defe540de61813cd67d9b71806aa3fac384c25b760ed98436606a585
    Image:          ghcr.io/koush/scrypted:latest
    Image ID:       ghcr.io/koush/scrypted@sha256:602d001ee8c1e31a22f4addb700e24d8133a8d7efef3493d6249a2e241f22b04
    Ports:          11080/TCP, 10443/TCP
    Host Ports:     0/TCP, 0/TCP
    State:          Running
      Started:      Fri, 18 Apr 2025 09:32:35 -0400
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /server/volume from app (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-56xp2 (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       True 
  ContainersReady             True 
  PodScheduled                True 
Volumes:
  app:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  app-scrypted-0
    ReadOnly:   false
  kube-api-access-56xp2:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age                From                     Message
  ----     ------                  ----               ----                     -------
  Normal   Scheduled               37m                default-scheduler        Successfully assigned cams/scrypted-0 to node1
  Warning  FailedAttachVolume      37m                attachdetach-controller  Multi-Attach error for volume "pvc-a018f41f-d7a9-4b41-9af0-ad98e18570ca" Volume is already exclusively attached to one node and can't be attached to another
  Normal   SuccessfulAttachVolume  37m                attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-a018f41f-d7a9-4b41-9af0-ad98e18570ca"
  Warning  FailedCreatePodSandBox  33m                kubelet                  Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  FailedCreatePodSandBox  33m (x2 over 33m)  kubelet                  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to reserve sandbox name "scrypted-0_cams_fd6119b0-6881-45ab-ba79-4c86ec19bc6a_0": name "scrypted-0_cams_fd6119b0-6881-45ab-ba79-4c86ec19bc6a_0" is reserved for "e701fb85fadf8c9616e511a36ab6b89ce98fe4c9770f7e596941310dbe3e5df4"
  Warning  FailedCreatePodSandBox  32m                kubelet                  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "b705a8ebc2e40cf8eda454f5294bd60a845fe6323d65ae9dc20f4e10992f4d02": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): failed to send CNI request: Post "http://dummy/cni": EOF: StdinData: {"clusterNetwork":"/host/etc/cni/net.d/05-cilium.conflist","cniVersion":"0.3.1","logLevel":"verbose","logToStderr":true,"name":"multus-cni-network","type":"multus-shim"}
  Normal   Pulling                 32m                kubelet                  Pulling image "ghcr.io/koush/scrypted:latest"
  Normal   Pulled                  32m                kubelet                  Successfully pulled image "ghcr.io/koush/scrypted:latest" in 146ms (146ms including waiting)
  Normal   Created                 32m                kubelet                  Created container scrypted
  Normal   Started                 32m                kubelet                  Started container scrypted
  • Other log outputs (if you use multus logging)
cp: cannot stat '/usr/src/multus-cni/bin/passthru': No such file or directory
Stream closed EOF for kube-system/kube-multus-ds-whxhg (install-multus-binary)
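
For the two /etc/cni entries above that could not be viewed: a possible way to inspect them without node SSH is a node debug pod, which mounts the node's filesystem under /host (a sketch, assuming the cluster's pod security settings permit it; node1 is the node from the pod description above):

kubectl debug node/node1 -it --image=busybox
# inside the debug pod:
ls -la /host/etc/cni/net.d/ /host/etc/cni/multus/net.d/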
@dougbtv (Member) commented Apr 24, 2025

Thanks for the detailed report! Appreciate it.

Are you sure you have the latest image? It seems to me you might have an outdated image, as if it didn't pull. The quick-start daemonset doesn't explicitly set an imagePullPolicy.
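
One way to check (a sketch, assuming the app=multus pod label from the quick-start manifest and that skopeo is available) is to compare the Image ID the kubelet actually ran against the current digest of the snapshot-thick tag in the registry:

# digest of the image running on the nodes
kubectl -n kube-system describe pod -l app=multus | grep -i "image id"
# current digest of the tag in the registry
skopeo inspect docker://ghcr.io/k8snetworkplumbingwg/multus-cni:snapshot-thick | grep -i digest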

I can see that we have tests passing today which show the passthru CNI binary gets installed: https://github.com/k8snetworkplumbingwg/multus-cni/actions/runs/14643179178/job/41090417937#step:15:43

And the tests complete.

I'm also not able to reproduce it locally:

[fedora@kubecon-demo multus-cni]$ kubectl apply -f deployments/multus-daemonset-thick.yml 
customresourcedefinition.apiextensions.k8s.io/network-attachment-definitions.k8s.cni.cncf.io created
clusterrole.rbac.authorization.k8s.io/multus created
clusterrolebinding.rbac.authorization.k8s.io/multus created
serviceaccount/multus created
configmap/multus-daemon-config created
daemonset.apps/kube-multus-ds created
[fedora@kubecon-demo multus-cni]$ watch -n1 kubectl get pods -o wide -A
[fedora@kubecon-demo multus-cni]$ docker ps | head -n 2
CONTAINER ID   IMAGE                  COMMAND                  CREATED        STATUS        PORTS                       NAMES
57a5a6d6a498   kindest/node:v1.32.2   "/usr/local/bin/entr…"   24 hours ago   Up 24 hours                               kind-worker
[fedora@kubecon-demo multus-cni]$ docker exec -it 57a5a6d6a498 ls /opt/cni/bin/passthru -lathr
-rwxr-xr-x. 1 root root 3.5M Apr 24 16:53 /opt/cni/bin/passthru
[fedora@kubecon-demo multus-cni]$ kubectl get pods -n kube-system | grep -i multus
kube-multus-ds-fctnr                         1/1     Running   0          64s
kube-multus-ds-p54d7                         1/1     Running   0          64s
kube-multus-ds-w9trd                         1/1     Running   0          64s
[fedora@kubecon-demo multus-cni]$ kubectl describe pod kube-multus-ds-fctnr -n kube-system | grep -i image
    Image:         ghcr.io/k8snetworkplumbingwg/multus-cni:snapshot-thick
    Image ID:      ghcr.io/k8snetworkplumbingwg/multus-cni@sha256:e69de2db70047d8fff8ddd290647da0814baa21b324de2a4906b4964d9599060
    Image:         ghcr.io/k8snetworkplumbingwg/multus-cni:snapshot-thick
    Image ID:      ghcr.io/k8snetworkplumbingwg/multus-cni@sha256:e69de2db70047d8fff8ddd290647da0814baa21b324de2a4906b4964d9599060
  Normal  Pulled     85s   kubelet            Container image "ghcr.io/k8snetworkplumbingwg/multus-cni:snapshot-thick" already present on machine
  Normal  Pulled     83s   kubelet            Container image "ghcr.io/k8snetworkplumbingwg/multus-cni:snapshot-thick" already present on machine
[fedora@kubecon-demo multus-cni]$ grep "passthru" deployments/multus-daemonset-thick.yml
            - "cp /usr/src/multus-cni/bin/multus-shim /host/opt/cni/bin/multus-shim && cp /usr/src/multus-cni/bin/passthru /host/opt/cni/bin/passthru"
