[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAEf4BzbZiZwfae+B2gu4WkWVRoVkLJUYhFf0rorx0jVCf_kiQw@mail.gmail.com>
Date: Fri, 27 Mar 2020 17:32:18 -0700
From: Andrii Nakryiko <andrii.nakryiko@...il.com>
To: Daniel Borkmann <daniel@...earbox.net>
Cc: Alexei Starovoitov <alexei.starovoitov@...il.com>,
Martynas Pumputis <m@...bda.lt>,
Joe Stringer <joe@...d.net.nz>, bpf <bpf@...r.kernel.org>,
Networking <netdev@...r.kernel.org>
Subject: Re: [PATCH bpf-next 3/7] bpf: add netns cookie and enable it for bpf
cgroup hooks
On Fri, Mar 27, 2020 at 8:59 AM Daniel Borkmann <daniel@...earbox.net> wrote:
>
> In Cilium we're mainly using BPF cgroup hooks today in order to implement
> kube-proxy free Kubernetes service translation for ClusterIP, NodePort (*),
> ExternalIP, and LoadBalancer as well as HostPort mapping [0] for all traffic
> between Cilium managed nodes. While this works in its current shape and avoids
> packet-level NAT for inter Cilium managed node traffic, there is one major
> limitation we're facing today, that is, lack of netns awareness.
>
> In Kubernetes, the concept of Pods (which hold one or multiple containers)
> has been built around network namespaces, so while we can use the global scope
> of attaching to root BPF cgroup hooks also to our advantage (e.g. for exposing
> NodePort ports on loopback addresses), we also have the need to differentiate
> between initial network namespaces and non-initial one. For example, ExternalIP
> services mandate that non-local service IPs are not to be translated from the
> host (initial) network namespace as one example. Right now, we have an ugly
> work-around in place where non-local service IPs for ExternalIP services are
> not xlated from connect() and friends BPF hooks but instead via less efficient
> packet-level NAT on the veth tc ingress hook for Pod traffic.
>
> On top of determining whether we're in initial or non-initial network namespace
> we also have a need for a socket-cookie like mechanism for network namespaces
> scope. Socket cookies have the nice property that they can be combined as part
> of the key structure e.g. for BPF LRU maps without having to worry that the
> cookie could be recycled. We are planning to use this for our sessionAffinity
> implementation for services. Therefore, add a new bpf_get_netns_cookie() helper
> which would resolve both use cases at once: bpf_get_netns_cookie(NULL) would
> provide the cookie for the initial network namespace while passing the context
> instead of NULL would provide the cookie from the application's network namespace.
> We're using a hole, so no size increase; the assignment happens only once.
> Therefore this allows for a comparison on initial namespace as well as regular
> cookie usage as we have today with socket cookies. We could later on enable
> this helper for other program types as well as we would see need.
>
> (*) Both externalTrafficPolicy={Local|Cluster} types
> [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c
>
> Signed-off-by: Daniel Borkmann <daniel@...earbox.net>
> ---
> include/linux/bpf.h | 1 +
> include/net/net_namespace.h | 10 +++++++++
> include/uapi/linux/bpf.h | 16 ++++++++++++++-
> kernel/bpf/verifier.c | 16 +++++++++------
> net/core/filter.c | 37 ++++++++++++++++++++++++++++++++++
> net/core/net_namespace.c | 15 ++++++++++++++
> tools/include/uapi/linux/bpf.h | 16 ++++++++++++++-
> 7 files changed, 103 insertions(+), 8 deletions(-)
>
[...]
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 745f3cfdf3b2..cb30b34d1466 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -3461,13 +3461,17 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
> expected_type = CONST_PTR_TO_MAP;
> if (type != expected_type)
> goto err_type;
> - } else if (arg_type == ARG_PTR_TO_CTX) {
> + } else if (arg_type == ARG_PTR_TO_CTX ||
> + arg_type == ARG_PTR_TO_CTX_OR_NULL) {
> expected_type = PTR_TO_CTX;
> - if (type != expected_type)
> - goto err_type;
> - err = check_ctx_reg(env, reg, regno);
> - if (err < 0)
> - return err;
> + if (!(register_is_null(reg) &&
> + arg_type == ARG_PTR_TO_CTX_OR_NULL)) {
Other parts of check_func_arg() have different pattern that doesn't
negate this complicated condition:
if (register_is_null(reg) && arg_type == ARG_PTR_TO_CTX_OR_NULL)
;
else {
...
}
It's marginally easier to follow, though still increases nestedness :(
> + if (type != expected_type)
> + goto err_type;
> + err = check_ctx_reg(env, reg, regno);
> + if (err < 0)
> + return err;
> + }
> } else if (arg_type == ARG_PTR_TO_SOCK_COMMON) {
> expected_type = PTR_TO_SOCK_COMMON;
> /* Any sk pointer can be ARG_PTR_TO_SOCK_COMMON */
[...]
> +static const struct bpf_func_proto bpf_get_netns_cookie_sock_addr_proto = {
> + .func = bpf_get_netns_cookie_sock_addr,
> + .gpl_only = false,
> + .ret_type = RET_INTEGER,
> + .arg1_type = ARG_PTR_TO_CTX_OR_NULL,
Just for completeness sake, have you considered using two helpers -
one accepting context and other accepting nothing instead of adding
ARG_PTR_TO_CTX_OR_NULL? Would it be too bad?
> +};
> +
[...]
> +static atomic64_t cookie_gen;
> +
> +u64 net_gen_cookie(struct net *net)
> +{
> + while (1) {
> + u64 res = atomic64_read(&net->net_cookie);
> +
> + if (res)
> + return res;
> + res = atomic64_inc_return(&cookie_gen);
> + atomic64_cmpxchg(&net->net_cookie, 0, res);
you'll always do extra atomic64_read, even if you succeed on the first
try. Why not:
while (1) {
u64 res = atomic64_read(&net->net_cookie);
if (res)
return res;
res = atomic64_inc_return(&cookie_gen);
if (atomic64_cmpxchg(&net->net_cookie, 0, res) == 0)
return res;
}
> + }
> +}
> +
[...]
Powered by blists - more mailing lists