Message-ID: <CAADnVQKFdpiQFxgF253V5XmtjnrVXcZ14sxT_Q3vOQ97WxScMQ@mail.gmail.com>
Date: Fri, 23 Sep 2022 08:26:55 -0700
From: Alexei Starovoitov <alexei.starovoitov@...il.com>
To: Martin KaFai Lau <martin.lau@...ux.dev>
Cc: bpf <bpf@...r.kernel.org>, Alexei Starovoitov <ast@...nel.org>,
Andrii Nakryiko <andrii@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>,
Kernel Team <kernel-team@...com>,
Network Development <netdev@...r.kernel.org>
Subject: Re: [PATCH bpf-next 4/5] bpf: Stop bpf_setsockopt(TCP_CONGESTION) in
init ops to recur itself
On Thu, Sep 22, 2022 at 6:11 PM Martin KaFai Lau <martin.lau@...ux.dev> wrote:
>
> On 9/22/22 5:12 PM, Alexei Starovoitov wrote:
> > On Thu, Sep 22, 2022 at 3:56 PM Martin KaFai Lau <kafai@...com> wrote:
> >>
> >> From: Martin KaFai Lau <martin.lau@...nel.org>
> >>
> >> When a bad bpf prog's '.init' calls
> >> bpf_setsockopt(TCP_CONGESTION, "itself"), it will trigger this loop:
> >>
> >> .init => bpf_setsockopt(tcp_cc) => .init => bpf_setsockopt(tcp_cc) ...
> >> ... => .init => bpf_setsockopt(tcp_cc).
> >>
> >> This was previously prevented by the prog->active counter, but the
> >> prog->active detection cannot be used in struct_ops, as explained in
> >> an earlier patch of this set.
> >>
> >> In this patch, the second bpf_setsockopt(tcp_cc) is not allowed,
> >> in order to break the loop. This is done by checking whether the
> >> previous bpf_run_ctx has saved the same sk pointer in its
> >> bpf_cookie.
> >>
> >> Note that this essentially means only the first '.init' can
> >> call bpf_setsockopt(TCP_CONGESTION) to pick a fallback cc (eg. the
> >> peer does not support ECN), and the second '.init' cannot fall back
> >> to another cc. This applies even if the second
> >> bpf_setsockopt(TCP_CONGESTION) would not cause a loop.
> >>
> >> Signed-off-by: Martin KaFai Lau <martin.lau@...nel.org>
> >> ---
> >> include/linux/filter.h |  3 +++
> >> net/core/filter.c      |  4 ++--
> >> net/ipv4/bpf_tcp_ca.c  | 54 ++++++++++++++++++++++++++++++++++++++++++
> >> 3 files changed, 59 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/include/linux/filter.h b/include/linux/filter.h
> >> index 98e28126c24b..9942ecc68a45 100644
> >> --- a/include/linux/filter.h
> >> +++ b/include/linux/filter.h
> >> @@ -911,6 +911,9 @@ int sk_get_filter(struct sock *sk, sockptr_t optval, unsigned int len);
> >> bool sk_filter_charge(struct sock *sk, struct sk_filter *fp);
> >> void sk_filter_uncharge(struct sock *sk, struct sk_filter *fp);
> >>
> >> +int _bpf_setsockopt(struct sock *sk, int level, int optname,
> >> + char *optval, int optlen);
> >> +
> >> u64 __bpf_call_base(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
> >> #define __bpf_call_base_args \
> >> ((u64 (*)(u64, u64, u64, u64, u64, const struct bpf_insn *)) \
> >> diff --git a/net/core/filter.c b/net/core/filter.c
> >> index f4cea3ff994a..e56a1ebcf1bc 100644
> >> --- a/net/core/filter.c
> >> +++ b/net/core/filter.c
> >> @@ -5244,8 +5244,8 @@ static int __bpf_setsockopt(struct sock *sk, int level, int optname,
> >> return -EINVAL;
> >> }
> >>
> >> -static int _bpf_setsockopt(struct sock *sk, int level, int optname,
> >> - char *optval, int optlen)
> >> +int _bpf_setsockopt(struct sock *sk, int level, int optname,
> >> + char *optval, int optlen)
> >> {
> >> if (sk_fullsock(sk))
> >> sock_owned_by_me(sk);
> >> diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c
> >> index 6da16ae6a962..a9f2cab5ffbc 100644
> >> --- a/net/ipv4/bpf_tcp_ca.c
> >> +++ b/net/ipv4/bpf_tcp_ca.c
> >> @@ -144,6 +144,57 @@ static const struct bpf_func_proto bpf_tcp_send_ack_proto = {
> >> .arg2_type = ARG_ANYTHING,
> >> };
> >>
> >> +BPF_CALL_5(bpf_init_ops_setsockopt, struct sock *, sk, int, level,
> >> +	   int, optname, char *, optval, int, optlen)
> >> +{
> >> +	struct bpf_tramp_run_ctx *run_ctx, *saved_run_ctx;
> >> +	int ret;
> >> +
> >> +	if (optname != TCP_CONGESTION)
> >> +		return _bpf_setsockopt(sk, level, optname, optval, optlen);
> >> +
> >> +	run_ctx = (struct bpf_tramp_run_ctx *)current->bpf_ctx;
> >> +	if (unlikely(run_ctx->saved_run_ctx &&
> >> +		     run_ctx->saved_run_ctx->type == BPF_RUN_CTX_TYPE_STRUCT_OPS)) {
> >> +		saved_run_ctx = (struct bpf_tramp_run_ctx *)run_ctx->saved_run_ctx;
> >> +		/* This stops the following loop:
> >> +		 *
> >> +		 * .init => bpf_setsockopt(tcp_cc) => .init =>
> >> +		 * bpf_setsockopt(tcp_cc) => .init => ....
> >> +		 *
> >> +		 * The second bpf_setsockopt(tcp_cc) is not allowed
> >> +		 * in order to break the loop when both .init
> >> +		 * are the same bpf prog.
> >> +		 *
> >> +		 * This applies even if the second bpf_setsockopt(tcp_cc)
> >> +		 * does not cause a loop.  It limits only the first
> >> +		 * '.init' to calling bpf_setsockopt(TCP_CONGESTION) to
> >> +		 * pick a fallback cc (eg. the peer does not support ECN);
> >> +		 * the second '.init' cannot fall back to another cc.
> >> +		 */
> >> +		if (saved_run_ctx->bpf_cookie == (uintptr_t)sk)
> >> +			return -EBUSY;
> >> +	}
> >> +
> >> +	run_ctx->bpf_cookie = (uintptr_t)sk;
> >> +	ret = _bpf_setsockopt(sk, level, optname, optval, optlen);
> >> +	run_ctx->bpf_cookie = 0;
> >
> > Instead of adding 4 bytes for the enum in patch 3
> > (which will be 8 bytes due to alignment)
> > and abusing bpf_cookie here
> > (which a struct_ops bpf prog might eventually read and be surprised
> > to find an sk pointer in there),
> > how about adding 'struct task_struct *saved_current' as another
> > member of bpf_tramp_run_ctx?
> > Always store the current task in there in prog_entry_struct_ops
> > and then compare it here in this specialized bpf_init_ops_setsockopt?
> >
> > Or maybe always check in enter_prog_struct_ops:
> > if (container_of(current->bpf_ctx, struct bpf_tramp_run_ctx,
> > run_ctx)->saved_current == current) // goto out since recursion?
> > it will prevent issues in cases we don't know about and will
> > address the good recursion case as explained in patch 1?
> > I'm assuming 2nd ssthresh runs in a different task..
> > Or is it actually the same task?
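
A minimal sketch of the saved_current alternative, for illustration only
(the new member, its placement in bpf_tramp_run_ctx, and the exact check
are assumptions rather than code from a posted patch):

struct bpf_tramp_run_ctx {
	struct bpf_run_ctx run_ctx;
	u64 bpf_cookie;
	struct bpf_run_ctx *saved_run_ctx;
	struct task_struct *saved_current;	/* proposed new member */
};

u64 notrace __bpf_prog_enter_struct_ops(struct bpf_prog *prog,
					struct bpf_tramp_run_ctx *run_ctx)
	__acquires(RCU)
{
	rcu_read_lock();
	migrate_disable();

	/* remember which task entered the struct_ops prog */
	run_ctx->saved_current = current;
	run_ctx->saved_run_ctx = bpf_set_run_ctx(&run_ctx->run_ctx);

	return bpf_prog_start_time();
}

/* ... and in bpf_init_ops_setsockopt(), peek at the previous run_ctx.
 * This assumes the saved run_ctx is a bpf_tramp_run_ctx, which is the
 * part questioned in the reply below.
 */
	run_ctx = (struct bpf_tramp_run_ctx *)current->bpf_ctx;
	saved_run_ctx = (struct bpf_tramp_run_ctx *)run_ctx->saved_run_ctx;
	if (saved_run_ctx && saved_run_ctx->saved_current == current)
		return -EBUSY;	/* same task already inside a struct_ops prog */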
>
> The 2nd ssthresh() should run in the same task but on a different sk.
> The first ssthresh(sk[1]) was run in in_task() context and then got
> interrupted. The softirq then handles the rcv path, which just happens
> to also call ssthresh(sk[2]) in the unlikely pkt-loss case. It is like
> ssthresh(sk[1]) => softirq => ssthresh(sk[2]).
>
> The tcp-cc ops can recur but cannot recur on the same sk because it
> requires holding the sk lock, so the patch remembers the previous sk
> to ensure it does not recur on the same sk. Then it needs to peek
> into the previous run ctx, which may not always be a
> bpf_tramp_run_ctx. eg. a cg bpf prog (with bpf_cg_run_ctx) can call
> bpf_setsockopt(TCP_CONGESTION, "a_bpf_tcp_cc"), which will then call
> a_bpf_tcp_cc->init(). It needs a bpf_run_ctx_type so it can safely
> peek at the previous bpf_run_ctx.
got it.
>
> Since struct_ops is the only one that needs to peek into the previous
> run_ctx (through tramp_run_ctx->saved_run_ctx), instead of adding 4
> bytes to the bpf_run_ctx, one idea that just came to my mind is to use
> one bit in the tramp_run_ctx->saved_run_ctx pointer itself. Something
> like this if it reuses the bpf_cookie (with the int/ptr casts spelled
> out):
>
> #define BPF_RUN_CTX_STRUCT_OPS_BIT 1UL
>
> u64 notrace __bpf_prog_enter_struct_ops(struct bpf_prog *prog,
> 					struct bpf_tramp_run_ctx *run_ctx)
> 	__acquires(RCU)
> {
> 	rcu_read_lock();
> 	migrate_disable();
>
> 	run_ctx->saved_run_ctx =
> 		bpf_set_run_ctx((struct bpf_run_ctx *)
> 				((uintptr_t)&run_ctx->run_ctx |
> 				 BPF_RUN_CTX_STRUCT_OPS_BIT));
>
> 	return bpf_prog_start_time();
> }
>
> BPF_CALL_5(bpf_init_ops_setsockopt, struct sock *, sk, int, level,
> 	   int, optname, char *, optval, int, optlen)
> {
> 	/* ... */
> 	if (unlikely(((uintptr_t)run_ctx->saved_run_ctx &
> 		      BPF_RUN_CTX_STRUCT_OPS_BIT) && ...)) {
> 		/* ... */
> 		if (bpf_cookie == (uintptr_t)sk)
> 			return -EBUSY;
> 	}
>
> }
That should work, but don't you need to loop through all previous
run_ctx and check every one with the BPF_RUN_CTX_STRUCT_OPS_BIT type?
Since run_ctx is saved in the task and we have preemptible
progs, there could be a tracing prog in the chain:
struct_ops_run_ctx->tracing_run_ctx->struct_ops_run_ctx
where the 1st and last have the same 'sk'.
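
A sketch of the chain walk that would be needed (the helper name is
hypothetical, and it assumes every tagged link is a bpf_tramp_run_ctx;
an untagged link, e.g. a tracing prog's run_ctx without the bit, has to
stop the walk, which is exactly where an earlier struct_ops run_ctx
holding the same 'sk' would be missed):

static bool sk_seen_in_saved_run_ctx(struct bpf_run_ctx *saved, struct sock *sk)
{
	/* Follow the saved_run_ctx chain, but only through links whose
	 * type is known from the tag bit.  An untagged link has an
	 * unknown layout, so the walk must stop there and any tagged
	 * run_ctx behind it that stored the same sk is not seen.
	 */
	while ((uintptr_t)saved & BPF_RUN_CTX_STRUCT_OPS_BIT) {
		struct bpf_tramp_run_ctx *tramp;

		tramp = container_of((struct bpf_run_ctx *)
				     ((uintptr_t)saved & ~BPF_RUN_CTX_STRUCT_OPS_BIT),
				     struct bpf_tramp_run_ctx, run_ctx);
		if (tramp->bpf_cookie == (uintptr_t)sk)
			return true;
		saved = tramp->saved_run_ctx;
	}
	return false;
}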