[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <5ddd7266c36aa_671a2b0b882605c04a@john-XPS-13-9370.notmuch>
Date: Tue, 26 Nov 2019 10:43:50 -0800
From: John Fastabend <john.fastabend@...il.com>
To: Martin Lau <kafai@...com>, Jakub Sitnicki <jakub@...udflare.com>
Cc: "bpf@...r.kernel.org" <bpf@...r.kernel.org>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
"kernel-team@...udflare.com" <kernel-team@...udflare.com>,
John Fastabend <john.fastabend@...il.com>
Subject: Re: [PATCH bpf-next 4/8] bpf, sockmap: Don't let child socket inherit
psock or its ops on copy
Martin Lau wrote:
> On Tue, Nov 26, 2019 at 04:54:33PM +0100, Jakub Sitnicki wrote:
> > On Mon, Nov 25, 2019 at 11:38 PM CET, Martin Lau wrote:
> > > On Sat, Nov 23, 2019 at 12:07:47PM +0100, Jakub Sitnicki wrote:
> > > [ ... ]
> > >
> > >> @@ -370,6 +378,11 @@ static inline void sk_psock_restore_proto(struct sock *sk,
> > >> sk->sk_prot = psock->sk_proto;
> > >> psock->sk_proto = NULL;
> > >> }
> > >> +
> > >> + if (psock->icsk_af_ops) {
> > >> + icsk->icsk_af_ops = psock->icsk_af_ops;
> > >> + psock->icsk_af_ops = NULL;
> > >> + }
> > >> }
> > >
> > > [ ... ]
> > >
> > >> +static struct sock *tcp_bpf_syn_recv_sock(const struct sock *sk,
> > >> + struct sk_buff *skb,
> > >> + struct request_sock *req,
> > >> + struct dst_entry *dst,
> > >> + struct request_sock *req_unhash,
> > >> + bool *own_req)
> > >> +{
> > >> + const struct inet_connection_sock_af_ops *ops;
> > >> + void (*write_space)(struct sock *sk);
> > >> + struct sk_psock *psock;
> > >> + struct proto *proto;
> > >> + struct sock *child;
> > >> +
> > >> + rcu_read_lock();
> > >> + psock = sk_psock(sk);
> > >> + if (likely(psock)) {
> > >> + proto = psock->sk_proto;
> > >> + write_space = psock->saved_write_space;
> > >> + ops = psock->icsk_af_ops;
> > > It is not immediately clear to me what ensure
> > > ops is not NULL here.
> > >
> > > It is likely I missed something. A short comment would
> > > be very useful here.
> >
> > I can see the readability problem. Looking at it now, perhaps it should
> > be rewritten, to the same effect, as:
> >
> > static struct sock *tcp_bpf_syn_recv_sock(...)
> > {
> > const struct inet_connection_sock_af_ops *ops = NULL;
> > ...
> >
> > rcu_read_lock();
> > psock = sk_psock(sk);
> > if (likely(psock)) {
> > proto = psock->sk_proto;
> > write_space = psock->saved_write_space;
> > ops = psock->icsk_af_ops;
> > }
> > rcu_read_unlock();
> >
> > if (!ops)
> > ops = inet_csk(sk)->icsk_af_ops;
> > child = ops->syn_recv_sock(sk, skb, req, dst, req_unhash, own_req);
> >
> > If psock->icsk_af_ops were NULL, it would mean we haven't initialized it
> > properly. To double check what happens here:
> I did not mean the init path. The init path is fine since it init
> eveything on psock before publishing the sk to the sock_map.
>
> I was thinking the delete path (e.g. sock_map_delete_elem). It is not clear
> to me what prevent the earlier pasted sk_psock_restore_proto() which sets
> psock->icsk_af_ops to NULL from running in parallel with
> tcp_bpf_syn_recv_sock()? An explanation would be useful.
>
I'll answer. Updates are protected via sk_callback_lock so we don't have
parrallel updates in-flight causing write_space and sk_proto to be out
of sync. However access should be OK because its a pointer write we
never update the pointer in place, e.g.
static inline void sk_psock_restore_proto(struct sock *sk,
struct sk_psock *psock)
{
+ struct inet_connection_sock *icsk = inet_csk(sk);
+
sk->sk_write_space = psock->saved_write_space;
if (psock->sk_proto) {
struct inet_connection_sock *icsk = inet_csk(sk);
bool has_ulp = !!icsk->icsk_ulp_data;
if (has_ulp)
tcp_update_ulp(sk, psock->sk_proto);
else
sk->sk_prot = psock->sk_proto;
psock->sk_proto = NULL;
}
+
+ if (psock->icsk_af_ops) {
+ icsk->icsk_af_ops = psock->icsk_af_ops;
+ psock->icsk_af_ops = NULL;
+ }
}
In restore case either psock->icsk_af_ops is null or not. If its
null below code catches it. If its not null (from init path) then
we have a valid pointer.
rcu_read_lock();
psock = sk_psock(sk);
if (likely(psock)) {
proto = psock->sk_proto;
write_space = psock->saved_write_space;
ops = psock->icsk_af_ops;
}
rcu_read_unlock();
if (!ops)
ops = inet_csk(sk)->icsk_af_ops;
child = ops->syn_recv_sock(sk, skb, req, dst, req_unhash, own_req);
We should do this with proper READ_ONCE/WRITE_ONCE to make it clear
what is going on and to stop compiler from breaking these assumptions. I
was going to generate that patch after this series but can do it before
as well. I didn't mention it here because it seems a bit out of scope
for this series because its mostly a fix to older code.
Also I started to think that write_space might be out of sync with ops but
it seems we never actually remove psock_write_space until after
rcu grace period so that should be OK as well and always point to the
previous write_space.
Finally I wondered if we could remove the ops and then add it back
quickly which seems at least in theory possible, but that would get
hit with a grace period because we can't have conflicting psock
definitions on the same sock. So expanding the rcu block to include
the ops = inet_csk(sk)->icsk_af_ops would fix that case.
So in summary I think we should expand the rcu lock here to include the
ops = inet_csk(sk)->icsk_af_ops to ensure we dont race with tear
down and create. I'll push the necessary update with WRITE_ONCE and
READ_ONCE to fix that up. Seeing we have to wait until the merge
window opens most likely anyways I'll send those out sooner rather
then later and this series can add the proper annotations as well.
Powered by blists - more mailing lists