[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20200820231620.dorvpjme6lqqe4iq@kafai-mbp.dhcp.thefacebook.com>
Date: Thu, 20 Aug 2020 16:16:20 -0700
From: Martin KaFai Lau <kafai@...com>
To: Mat Martineau <mathew.j.martineau@...ux.intel.com>
CC: <bpf@...r.kernel.org>, Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>,
Eric Dumazet <edumazet@...gle.com>, <kernel-team@...com>,
Lawrence Brakmo <brakmo@...com>,
Neal Cardwell <ncardwell@...gle.com>, <netdev@...r.kernel.org>,
Yuchung Cheng <ycheng@...gle.com>
Subject: Re: [PATCH v5 bpf-next 07/12] bpf: tcp: Add bpf_skops_hdr_opt_len()
and bpf_skops_write_hdr_opt()
On Thu, Aug 20, 2020 at 03:39:21PM -0700, Mat Martineau wrote:
>
> On Thu, 20 Aug 2020, Martin KaFai Lau wrote:
>
> > The bpf prog needs to parse the SYN header to learn what options have
> > been sent by the peer's bpf-prog before writing its options into SYNACK.
> > This patch adds a "syn_skb" arg to tcp_make_synack() and send_synack().
> > This syn_skb will eventually be made available (as read-only) to the
> > bpf prog. This will be the only SYN packet available to the bpf
> > prog during syncookie. For other regular cases, the bpf prog can
> > also use the saved_syn.
> >
> > When writing options, the bpf prog will first be called to tell the
> > kernel its required number of bytes. It is done by the new
> > bpf_skops_hdr_opt_len(). The bpf prog will only be called when the new
> > BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG is set in tp->bpf_sock_ops_cb_flags.
> > When the bpf prog returns, the kernel will know how many bytes are needed
> > and then update the "*remaining" arg accordingly. 4 byte alignment will
> > be included in the "*remaining" before this function returns. The 4 byte
> > aligned number of bytes will also be stored into the opts->bpf_opt_len.
> > "bpf_opt_len" is a newly added member to the struct tcp_out_options.
> >
> > Then the new bpf_skops_write_hdr_opt() will call the bpf prog to write the
> > header options. The bpf prog is only called if it has reserved spaces
> > before (opts->bpf_opt_len > 0).
> >
> > The bpf prog is the last one getting a chance to reserve header space
> > and writing the header option.
> >
> > These two functions are half implemented to highlight the changes in
> > TCP stack. The actual codes preparing the bpf running context and
> > invoking the bpf prog will be added in the later patch with other
> > necessary bpf pieces.
> >
> > Reviewed-by: Eric Dumazet <edumazet@...gle.com>
> > Signed-off-by: Martin KaFai Lau <kafai@...com>
> > ---
> > include/net/tcp.h | 6 +-
> > include/uapi/linux/bpf.h | 3 +-
> > net/ipv4/tcp_input.c | 5 +-
> > net/ipv4/tcp_ipv4.c | 5 +-
> > net/ipv4/tcp_output.c | 105 +++++++++++++++++++++++++++++----
> > net/ipv6/tcp_ipv6.c | 5 +-
> > tools/include/uapi/linux/bpf.h | 3 +-
> > 7 files changed, 109 insertions(+), 23 deletions(-)
> >
>
> ...
>
> > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > index 5084333b5ab6..631a5ee0dd4e 100644
> > --- a/net/ipv4/tcp_ipv4.c
> > +++ b/net/ipv4/tcp_ipv4.c
>
> ...
>
> > @@ -826,6 +886,15 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
> > opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
> > }
> >
> > + if (unlikely(BPF_SOCK_OPS_TEST_FLAG(tp,
> > + BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG))) {
> > + unsigned int remaining = MAX_TCP_OPTION_SPACE - size;
> > +
> > + bpf_skops_hdr_opt_len(sk, skb, NULL, NULL, 0, opts, &remaining);
> > +
> > + size = MAX_TCP_OPTION_SPACE - remaining;
> > + }
> > +
> > return size;
> > }
> >
>
> Since bpf_skops_hdr_opt_len() is called after the SACK code tries to use up
> all remaining option space for SACK blocks, it's less likely that there will
> be sufficient space remaining. Did you consider moving this hunk before the
> SACK option space is allocated to give the bpf prog higher priority?
No. Not at this point.
It is unlike a well defined option (e.g. MPTCP) that may have a
requirement on being presence in the header. For a generic usage in bpf,
it is hard to judge if a bpf program is writing a higher priority option.
Also considering SACK is a transient behavior, the bpf prog will eventually
get its chance to write. The bpf prog should always assume there may not have
enough space and ready to retry later.
Powered by blists - more mailing lists