Why are my packets refusing to be routed?

I’m setting up a rather unusual network configuration meant to afford some additional protection to certain containers running on a host. There are some external requirements that are beyond the scope of this request that make it so that the physical server needs to have two IP addresses plumbed to the same physical interface. There’s a “green” IP for normal data and a “red” IP for the secure stuff coming from the container.

What I’m setting up looks like this:

[Diagram: process in the "red" namespace → vethred (169.254.0.1) ↔ vethhost (169.254.0.2) → policy routing on the host → redvlan ("red" IP), alongside the default "green" interface]

So the secure process runs inside of a special network namespace, and when it needs to talk to the outside world, it sends packets to its local interface (vethred, 169.254.0.1) which is half of a veth pair, the other side being vethhost, 169.254.0.2. The packet is then routed out of the host via the redvlan interface (not the default “green” interface used by all the other processes).
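For reference, a namespace-plus-veth setup like this can be sketched with plain `ip` commands. The names and addresses below match my configuration, but treat this as a generic sketch rather than my exact provisioning script:

```shell
# Create the red namespace and a veth pair; move one end into it
ip netns add red
ip link add vethred type veth peer name vethhost
ip link set vethred netns red

# Host side of the pair
ip addr add 169.254.0.2/30 dev vethhost
ip link set vethhost up

# Namespace side, plus a default route pointing at the host end
ip netns exec red ip addr add 169.254.0.1/30 dev vethred
ip netns exec red ip link set lo up
ip netns exec red ip link set vethred up
ip netns exec red ip route add default via 169.254.0.2
```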

I have succeeded in making this work on one machine, using a relatively small set of configuration settings. However, when I try to replicate this configuration on another host, it does not work, and the packet emerges from vethhost and then gets dropped before it is routed. I have reverse path filtering disabled on all interfaces, so it’s not that.

To describe how I have this set up:

Inside the red network namespace, the process has a typical view of the world:

$ ip netns exec red ip -4 addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
11: vethred@if10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000 link-netnsid 0
    inet 169.254.0.1/30 scope global vethred
       valid_lft forever preferred_lft forever
$ ip netns exec red ip -4 route
default via 169.254.0.2 dev vethred 
169.254.0.0/30 dev vethred proto kernel scope link src 169.254.0.1 

So when a process in the red namespace generates a packet destined outside the network namespace, it matches the default route, leaves via vethred, and arrives in the host namespace on vethhost. There, iptables rules control the behavior of the packet:

# Mark packets arriving on vethhost (i.e. coming from the red namespace)
iptables -t mangle -A PREROUTING -i vethhost -j MARK --set-mark 2
# Use a secondary routing table for packets marked with 0x2
ip rule add fwmark 2 table 2
# Create the second routing table
ip route add default via ${RED_GATEWAY} table 2

# Packets are now routed to the redvlan interface, but bear an internal IP
# as the source address, so we need to perform an SNAT.
iptables -t nat -A POSTROUTING -o redvlan ! -s 10.20.0.10 -j SNAT --to-source 10.20.0.10
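After applying those rules, the resulting state can be inspected like so (a sketch; output will vary by host):

```shell
# Policy-routing rule: should list "from all fwmark 0x2 lookup 2"
ip rule show
# The secondary routing table itself
ip route show table 2
# Packet/byte counters on the mark and SNAT rules, to confirm they match
iptables -t mangle -L PREROUTING -n -v
iptables -t nat -L POSTROUTING -n -v
```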

There are other rules that route reply packets arriving on the redvlan interface back to the vethhost device and into the red network namespace, but I’ll leave those out, because the packets aren’t even getting as far as leaving the host.

With the host configured as noted above, I’m able to send packets out the redvlan interface from a normal process:

$ ping -c3 10.20.0.11
PING 10.20.0.11 (10.20.0.11) 56(84) bytes of data.
64 bytes from 10.20.0.11 (10.20.0.11): icmp_seq=1 ttl=64 time=1.28 ms
64 bytes from 10.20.0.11 (10.20.0.11): icmp_seq=2 ttl=64 time=0.825 ms
64 bytes from 10.20.0.11 (10.20.0.11): icmp_seq=3 ttl=64 time=0.938 ms

--- 10.20.0.11 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 0.825/1.016/1.285/0.195 ms

But when I do the same from the red namespace:

$ ip netns exec red ping -c3 10.20.0.11
PING 10.20.0.11 (10.20.0.11) 56(84) bytes of data.

--- 10.20.0.11 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2054ms

And if I look at the network trace (irrelevant fields truncated for brevity):

[187669.474043] TRACE: raw:PREROUTING:policy:2 IN=vethhost OUT= SRC=169.254.0.1 DST=10.20.0.11 PROTO=ICMP ID=31570 SEQ=1 
[187669.474199] TRACE: mangle:PREROUTING:rule:1 IN=vethhost OUT= SRC=169.254.0.1 DST=10.20.0.11 PROTO=ICMP ID=31570 SEQ=1 
[187669.474352] TRACE: mangle:PREROUTING:policy:3 IN=vethhost OUT= SRC=169.254.0.1 DST=10.20.0.11 PROTO=ICMP ID=31570 SEQ=1 MARK=0x2 
[187669.474507] TRACE: nat:PREROUTING:policy:2 IN=vethhost OUT= SRC=169.254.0.1 DST=10.20.0.11  PROTO=ICMP ID=31570 SEQ=1 MARK=0x2 

The packet is getting marked and exits the PREROUTING chain but never gets routed! It should be traversing the FORWARD and POSTROUTING chains after PREROUTING, but it doesn’t, which means the kernel dropped the packet while making the routing decision. And it absolutely should match a routing rule — here’s the routing table:

$ ip route show table 2
default via 10.20.0.1 dev redvlan 
10.20.0.0/24 dev redvlan proto kernel scope link src 10.20.0.10 
169.254.0.0/30 dev vethhost proto kernel scope link src 169.254.0.2 

The packet should have matched either the default route or the link-local route. Even if the marking bits weren’t working, the packet should at least have been routed to the green interface to exit the host (the default route in the default routing table). But instead it was simply dropped.
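(For anyone wanting to reproduce a trace like the one above: it can be captured with the TRACE target in the raw table, roughly as follows. The ICMP match just keeps the log volume down; the nf_log_ipv4 module name is what legacy iptables needs on my kernel, and may differ on yours.)

```shell
# Load the logging backend for IPv4 netfilter traces
modprobe nf_log_ipv4
# Trace every chain traversed by ICMP packets arriving on vethhost
iptables -t raw -A PREROUTING -i vethhost -p icmp -j TRACE
# Watch the TRACE lines appear in the kernel log
dmesg -w | grep TRACE
# Remove the rule when done
iptables -t raw -D PREROUTING -i vethhost -p icmp -j TRACE
```

On hosts using iptables-nft, the trace output goes through `xtables-monitor --trace` instead of the kernel log.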

Everything I’ve read about this kind of issue suggests that reverse path filtering can cause this, since at this stage of the routing process, the source address is a non-routable IP. But as noted above, rp_filter is disabled:

$ sysctl -a | grep \.rp_filter
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.eth0.rp_filter = 0
net.ipv4.conf.default.rp_filter = 0
net.ipv4.conf.lo.rp_filter = 0
net.ipv4.conf.redvlan.rp_filter = 0
net.ipv4.conf.vethhost.rp_filter = 0

So it’s not reverse path filtering nuking the packet. I’ve even enabled logging of martians and there are no log messages saying that martians were dropped.
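(Martian logging was enabled like this; a minimal sketch, using the global toggle rather than per-interface settings:)

```shell
# Log packets with impossible source addresses to the kernel log
sysctl -w net.ipv4.conf.all.log_martians=1
# Any dropped martians would show up as "martian source" messages
dmesg | grep -i martian
```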

Thank you to @axus for pointing me to this thread: https://unix.stackexchange.com/questions/292801/routing-between-linux-namespaces which did not directly address the issue, but did contain one bit of helpful prodding:

The kernel is going to treat the namespaces as if they were separate hosts. Meaning you have to configure the kernel to act as a router.

And indeed, I had neglected to set net.ipv4.ip_forward=1 via sysctl. The packet wasn’t being dropped because of a mismatch in the routing table; it was being dropped because the kernel was not configured to act as a router.
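For completeness, here is the fix, including persisting it across reboots (the drop-in filename is my own choice; any file under /etc/sysctl.d/ works):

```shell
# Enable IPv4 forwarding immediately
sysctl -w net.ipv4.ip_forward=1
# Persist the setting across reboots
echo 'net.ipv4.ip_forward = 1' > /etc/sysctl.d/99-ip-forward.conf
sysctl --system   # reload all sysctl configuration files
```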
