Cannot make a curl to pod's IP probably due to interface mismatch #10233

Open · TuanTranBPK opened this issue Apr 15, 2025 · 8 comments

@TuanTranBPK commented Apr 15, 2025
I installed a K8s cluster using kubeadm:

sudo kubeadm init --pod-network-cidr=10.42.0.0/16

kubectl taint nodes --all node-role.kubernetes.io/control-plane-

kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.27.2/manifests/tigera-operator.yaml

kubectl apply -f calico-config.yaml

calico-config.txt
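For reference, a minimal Installation resource matching this setup would look roughly like the sketch below; the attached calico-config.txt is what was actually applied, so treat this only as an illustration of the pod CIDR and VXLAN pool settings.

apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  calicoNetwork:
    ipPools:
      - cidr: 10.42.0.0/16        # must match --pod-network-cidr above
        encapsulation: VXLAN
        natOutgoing: Enabled
        nodeSelector: all()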

Then I installed an nginx pod.
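(The exact pod manifest isn't attached; it was created along these lines, and the pod IP referenced below is 10.42.53.71.)

kubectl create deployment nginx --image=nginx
kubectl get pods -o wide    # the nginx pod IP here is 10.42.53.71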

Expected Behavior

From the node, I should be able to curl the nginx pod's IP.

Current Behavior

The curl fails:
[trant@eam32 calico]$ curl -v 10.42.53.71

*   Trying 10.42.53.71:80...
* connect to 10.42.53.71 port 80 failed: No route to host
* Failed to connect to 10.42.53.71 port 80: No route to host
* Closing connection 0
curl: (7) Failed to connect to 10.42.53.71 port 80: No route to host

The confusing part: when I run tcpdump on the cali7072c88a915 interface, I see ARP requests coming from a different IP address/interface (172.30.2.1) than the IP address shown for the node (10.12.178.104).

tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on cali7072c88a915, link-type EN10MB (Ethernet), snapshot length 262144 bytes
22:55:37.324031 ARP, Request who-has 10.42.53.71 tell 172.30.2.1, length 28
22:55:38.368903 ARP, Request who-has 10.42.53.71 tell 172.30.2.1, length 28
22:55:39.392902 ARP, Request who-has 10.42.53.71 tell 172.30.2.1, length 28
22:55:40.417003 ARP, Request who-has 10.42.53.71 tell 172.30.2.1, length 28
22:55:41.440908 ARP, Request who-has 10.42.53.71 tell 172.30.2.1, length 28
22:55:42.464908 ARP, Request who-has 10.42.53.71 tell 172.30.2.1, length 28

kubectl get nodes -o wide
NAME    STATUS   ROLES           AGE   VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                      KERNEL-VERSION                 CONTAINER-RUNTIME
eam32   Ready    control-plane   28m   v1.30.5   10.12.178.104   <none>        Rocky Linux 9.5 (Blue Onyx)   5.14.0-503.19.1.el9_5.x86_64   cri-o://1.22.5

Possible Solution

Don't know

Steps to Reproduce (for bugs)

See above.

Context

Your Environment

The network/interface configuration is in the attachment

ip_a.txt

Calico log:

calico-node.log

  • Calico version: v3.27.2
  • Calico dataplane (iptables, windows etc.):
  • Orchestrator version (e.g. kubernetes, mesos, rkt): Kubernetes v1.30.5
  • Operating System and version: Rocky Linux 9.5
  • Link to your project (optional):
@tomastigera (Contributor) commented

You have a whole bunch of interfaces on your node. Kubernetes may have picked one of them as the main interface and used its IP as the node's internal IP, while Calico may have auto-detected a different one. Check out https://docs.tigera.io/calico/latest/networking/ipam/ip-autodetection
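For example (a sketch only; which method and value is right depends on your interfaces), you can make Calico use the same address Kubernetes reports as the node's internal IP by setting the autodetection method in the Installation resource:

apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  calicoNetwork:
    nodeAddressAutodetectionV4:
      kubernetes: NodeInternalIP    # or interface:/canReach:/cidrs:, see the docs above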

Could you provide ip route output?

@TuanTranBPK (Author) commented

I tried the IP autodetection modes (can-reach and interface) but it doesn't work. My ip route output is below; 10.42.53.71 is the nginx pod IP.
[trant@eam32 ~]$ ip r
default via 10.12.178.254 dev ens3f0np0.99 proto dhcp src 10.12.178.104 metric 401
10.12.178.0/23 dev ens3f0np0.99 proto kernel scope link src 10.12.178.104 metric 401
blackhole 10.42.53.64/26 proto 80
10.42.53.65 dev calica8461fef58 scope link
10.42.53.66 dev calib250ffe937d scope link
10.42.53.67 dev cali2f0666e3385 scope link
10.42.53.68 dev cali457d2a7ae14 scope link
10.42.53.69 dev calibcd5f644fa1 scope link
10.42.53.70 dev cali20b202672c4 scope link
10.42.53.71 dev cali7072c88a915 scope link
10.76.77.0/24 dev ens3f0np0.700 proto kernel scope link src 10.76.77.18 metric 402
172.16.0.0/12 proto static src 172.31.0.1
nexthop via 172.30.1.250 dev ens3f0np0.401 weight 1
nexthop via 172.30.2.250 dev enp196s0f1.402 weight 1
172.30.1.0/24 via 172.30.1.1 dev ens3f0np0.401 proto static metric 400
172.30.1.0/24 dev ens3f0np0.401 proto kernel scope link src 172.30.1.1 metric 403
172.30.2.0/24 dev enp196s0f1.402 proto kernel scope link src 172.30.2.1 metric 400
192.9.0.0/16 dev ens3f0np0.3 proto kernel scope link src 192.9.110.18 metric 404

@TuanTranBPK (Author) commented

Just adding more information to narrow down the problem. The same installation procedure as above works on a K8s cluster consisting of a single VM with several interfaces. The problem described in this issue occurs on a physical machine with multiple network cards.

ip-a-single-vm.txt

@TuanTranBPK (Author) commented

@tomastigera Is there anything else you need for your investigation? It seems to me that this is a bug.

Thanks,
Tuan

@tomastigera (Contributor) commented

From the logs above, your node picked: startup/autodetection_methods.go 103: Using autodetected IPv4 address on interface ens3f0np0.3: 192.9.110.18/16

The ip r above is from the node where the nginx server runs, which does not really say much about whether there is a route from the client. However, there is no route from this node to a similar CIDR. That is weird given that the IPPool is 10.42.0.0/16: Created default IPv4 pool (10.42.0.0/16) with NAT outgoing true. IPIP mode: Never, VXLAN mode: Always, DisableBGPExport: false

It also says VXLAN mode Always, but I do not see any vxlan device on this node. I do see a vxlan device in the device list of the single-VM setup.

However, I do see the device being created in the logs: 2025-04-15 20:34:45.475 [INFO][54] felix/vxlan_mgr.go 742: Assigning address to VXLAN device address=10.42.53.64/32 ipVersion=0x4

So I think the device is missing, and the routes to pods on other nodes via the vxlan device are missing too. That is the problem, but I cannot tell why they are missing 🤔
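A couple of quick checks on the node would confirm that (assuming the default device name vxlan.calico that Felix creates):

ip -d link show vxlan.calico     # should exist; -d also shows its parent interface
ip addr show vxlan.calico        # should carry 10.42.53.64/32 per the log line above
ip route show dev vxlan.calico   # routes to pod blocks on other nodes would appear here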

@tomastigera (Contributor) commented

Have you tried a newer version than v3.27.2?

@TuanTranBPK (Author) commented

I tried with the latest version, v3.30 (https://docs.tigera.io/calico/latest/getting-started/kubernetes/self-managed-onprem/onpremises), but the problem is still there.

kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.30.0/manifests/operator-crds.yaml

kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.30.0/manifests/tigera-operator.yaml

I modified only the CIDR field in the example file https://raw.githubusercontent.com/projectcalico/calico/v3.30.0/manifests/custom-resources.yaml and applied it:
kubectl apply -f custom-resources.yaml
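Concretely, the only change relative to the upstream file was the pool CIDR, roughly as in this sketch (everything else was left as shipped):

apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  calicoNetwork:
    ipPools:
      - cidr: 10.42.0.0/16    # changed from the upstream default 192.168.0.0/16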

calico-node.log
tigera-operator.log

ip r
default via 10.12.178.254 dev ens3f0np0.99 proto dhcp src 10.12.178.104 metric 401
10.12.178.0/23 dev ens3f0np0.99 proto kernel scope link src 10.12.178.104 metric 401
blackhole 10.42.53.64/26 proto 80
10.42.53.65 dev cali6028fd2d0bb scope link
10.42.53.66 dev cali3d5af4bc40d scope link
10.42.53.67 dev cali1f4489ffc48 scope link
10.42.53.68 dev cali56aa98abf8e scope link
10.42.53.69 dev calie084588d789 scope link
10.42.53.70 dev cali98709610c8f scope link
10.42.53.71 dev cali62b602db861 scope link
10.42.53.72 dev cali02c87c8e7c0 scope link
10.42.53.73 dev cali034e5efd019 scope link
10.76.77.0/24 dev ens3f0np0.700 proto kernel scope link src 10.76.77.18 metric 402
172.16.0.0/12 proto static src 172.31.0.1
nexthop via 172.30.1.250 dev ens3f0np0.401 weight 1
nexthop via 172.30.2.250 dev enp196s0f1.402 weight 1
172.30.1.0/24 via 172.30.1.1 dev ens3f0np0.401 proto static metric 400
172.30.1.0/24 dev ens3f0np0.401 proto kernel scope link src 172.30.1.1 metric 403
172.30.2.0/24 dev enp196s0f1.402 proto kernel scope link src 172.30.2.1 metric 400
192.9.0.0/16 dev ens3f0np0.3 proto kernel scope link src 192.9.110.18 metric 404

Is there anything else you want to look at?

@coutinhop (Member) commented

@TuanTranBPK I see this line in your calico-node log:

2025-05-07 08:29:05.444 [INFO][9] startup/autodetection_methods.go 103: Using autodetected IPv4 address on interface ens3f0np0.3: 192.9.110.18/16

And it seems like the VXLAN device does get that interface ens3f0np0.3 for its parent:

2025-05-07 08:29:07.019 [INFO][83] felix/vxlan_mgr.go 822: Assigning address to VXLAN device address=10.42.53.64/32 ipVersion=0x4
2025-05-07 08:29:07.037 [INFO][83] felix/vxlan_mgr.go 627: VXLAN device parent changed from "" to "ens3f0np0.3" ipVersion=0x4

Do you know if that interface choice is "wrong"? (not the one connected to the rest of the cluster or something to that effect?)

You mentioned this:

I tried the IP autodetection modes (can-reach and interface) but it doesn't work.

But what were the results when trying those modes? Did you also try using CIDR(s) (https://docs.tigera.io/calico/latest/networking/ipam/ip-autodetection#change-the-autodetection-method)?
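For the CIDR method, the Installation spec would look something like this sketch, using the subnet that holds the node's internal IP (adjust if a different subnet is the right one for cluster traffic):

apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  calicoNetwork:
    nodeAddressAutodetectionV4:
      cidrs:
        - "10.12.178.0/23"    # subnet of the node's INTERNAL-IP 10.12.178.104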
