Message-ID: <CABG=zsBEh-P4NXk23eBJw7eajB5YJeRS7oPXnTAzs=yob4EMoQ@mail.gmail.com>
Date: Wed, 31 Aug 2022 09:37:41 -0700
From: Aditi Ghag <aditivghag@...il.com>
To: netdev@...r.kernel.org, bpf@...r.kernel.org
Cc: Daniel Borkmann <daniel@...earbox.net>
Subject: [RFC] Socket termination for policy enforcement and load-balancing

This is an RFC about intentionally terminating sockets. We have two
prominent use cases in Cilium [1] where we need a way to identify a
set of sockets and forcefully terminate them so that the applications
can reconnect.
Cilium uses eBPF cgroup hooks for load-balancing: it translates a
service VIP to one of the service backend IP addresses at socket
connect time for TCP and connected UDP. Client applications are
typically unaware that the remote containers they are connected to
have been deleted, and are left hanging when the remotes go away
(particularly long-running UDP applications). For the policy
enforcement use case, users may want to enforce policies on the fly
that redirect all client application traffic, including established
connections, to a subset of destinations.
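For illustration, below is a minimal sketch of such a connect-time
hook in libbpf style. The map layout and names (svc_key, services,
and friends) are simplified placeholders, not Cilium's actual
implementation:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct svc_key {
	__be32 vip;
	__be16 port;
	__u16 pad;	/* keep hash key padding zeroed */
};

struct svc_backend {
	__be32 addr;
	__be16 port;
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1024);
	__type(key, struct svc_key);
	__type(value, struct svc_backend);
} services SEC(".maps");

SEC("cgroup/connect4")
int svc_connect4(struct bpf_sock_addr *ctx)
{
	struct svc_key key = {
		.vip  = ctx->user_ip4,
		.port = (__be16)ctx->user_port,
	};
	struct svc_backend *be;

	be = bpf_map_lookup_elem(&services, &key);
	if (!be)
		return 1;	/* not a service VIP, connect as-is */

	/* Rewrite the destination before the connect proceeds. */
	ctx->user_ip4  = be->addr;
	ctx->user_port = be->port;
	return 1;
}

char _license[] SEC("license") = "GPL";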
We evaluated the following ways to identify and forcefully terminate sockets:
- The sock_destroy API added for similar Android use cases is
effective in tearing down sockets. The API is behind the
CONFIG_INET_DIAG_DESTROY option, which is disabled by default, and is
currently exposed to userspace via the SOCK_DIAG netlink
infrastructure. The destroy handlers for the TCP and UDP protocols
set the ECONNABORTED error on the affected sockets, following the
abort semantics described in RFC 793.
- Add unreachable routes for deleted backends. I experimented with
this approach with my colleague Nikolay Aleksandrov. We found that
TCP and connected UDP sockets in the established state simply ignore
the ICMP error messages and continue to send data in the presence of
such routes. My read is that applications ignore the ICMP errors
reported on their sockets [2].
- Use the BPF (sockets) iterator to identify sockets connected to a
deleted backend. The BPF (sockets) iterator is network namespace
aware, so we would either need to enter every possible container
network namespace to identify the affected connections, or adapt the
iterator to drop its netns checks [3]. This was discussed with my
colleague Daniel Borkmann based on the feedback he shared from the
LSFMMBPF conference discussions.
- Use the INET_DIAG infrastructure to filter and destroy sockets
connected to stale backends. This approach involves first querying
for sockets connected to a given destination IP address/port using
netlink messages of type SOCK_DIAG_BY_FAMILY, and then using the
query results to send SOCK_DESTROY messages that actually destroy the
sockets (a minimal userspace sketch of this flow follows the list).
The SOCK_DIAG infrastructure, like the BPF iterators, is network
namespace aware.
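For reference, here is a minimal userspace sketch of the SOCK_DESTROY
step in that last flow. It assumes the full socket identity (the
4-tuple plus cookie in struct inet_diag_sockid) was already collected
from a SOCK_DIAG_BY_FAMILY dump, and it omits error handling and ACK
parsing:

#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/netlink.h>
#include <linux/sock_diag.h>
#include <linux/inet_diag.h>

/* Destroy one TCP socket previously found via a SOCK_DIAG_BY_FAMILY
 * dump. Needs CAP_NET_ADMIN and CONFIG_INET_DIAG_DESTROY=y. */
static int destroy_one_sock(const struct inet_diag_sockid *id)
{
	struct {
		struct nlmsghdr nlh;
		struct inet_diag_req_v2 req;
	} msg = {
		.nlh = {
			.nlmsg_len   = sizeof(msg),
			.nlmsg_type  = SOCK_DESTROY,
			.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK,
		},
		.req = {
			.sdiag_family   = AF_INET,
			.sdiag_protocol = IPPROTO_TCP,
			.idiag_states   = ~0U,	/* match any TCP state */
			.id             = *id,	/* 4-tuple + cookie */
		},
	};
	int fd, ret;

	fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_SOCK_DIAG);
	if (fd < 0)
		return -1;
	ret = send(fd, &msg, sizeof(msg), 0);
	close(fd);
	return ret < 0 ? -1 : 0;
}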
We are currently leaning towards invoking the sock_destroy API
directly from BPF programs. This gives us an effective mechanism
without having to enter every possible container network namespace on
a node, or to rely on the CONFIG_INET_DIAG_DESTROY option and the
associated netlink permissions. BPF programs attached to cgroup hooks
can store client sockets connected to a backend, and invoke the
destroy APIs when backends are deleted.
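As a rough sketch of how this could look from the BPF side, the
snippet below walks TCP sockets with a BPF iterator and destroys the
ones still connected to a deleted backend. The bpf_sock_destroy name
and signature are only illustrative at this stage (the helper does
not exist yet), and matching on a single address/port pair is a
simplification:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

/* Proposed helper; name and signature are illustrative only. */
extern int bpf_sock_destroy(struct sock_common *sk) __ksym;

/* Deleted backend, filled in by userspace before attaching. */
const volatile __be32 stale_daddr;
const volatile __be16 stale_dport;

SEC("iter/tcp")
int destroy_stale_socks(struct bpf_iter__tcp *ctx)
{
	struct sock_common *skc = ctx->sk_common;

	if (!skc)
		return 0;

	/* Tear down sockets still connected to the deleted backend. */
	if (skc->skc_daddr == stale_daddr &&
	    skc->skc_dport == stale_dport)
		bpf_sock_destroy(skc);

	return 0;
}

char _license[] SEC("license") = "GPL";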
To that end, I'm in the process of adding a new BPF helper for the
sock_destroy kernel function similar to the sock_diag_destroy function
[4], and am soliciting early feedback on the evaluated and selected
approaches. Happy to share more context.
[1] https://github.com/cilium/cilium
[2] https://github.com/torvalds/linux/blob/master/net/ipv4/tcp_ipv4.c#L464
[3] https://github.com/torvalds/linux/blob/master/net/ipv4/udp.c#L3011
[4] https://github.com/torvalds/linux/blob/master/net/core/sock_diag.c#L298