[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAL+tcoDq5G_KU3jJ2=kedHz9OvmLRD5sKf_KLrw3mg-yKrhtkw@mail.gmail.com>
Date: Mon, 27 May 2024 16:06:31 +0800
From: Jason Xing <kerneljasonxing@...il.com>
To: Neal Cardwell <ncardwell@...gle.com>
Cc: Eric Dumazet <edumazet@...gle.com>, "David S . Miller" <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>, netdev@...r.kernel.org,
eric.dumazet@...il.com
Subject: Re: [PATCH net 1/4] tcp: add tcp_done_with_error() helper
Hi Neal,
On Sat, May 25, 2024 at 10:14 PM Neal Cardwell <ncardwell@...gle.com> wrote:
>
> On Fri, May 24, 2024 at 3:36 PM Eric Dumazet <edumazet@...gle.com> wrote:
> >
> > tcp_reset() ends with a sequence that is carefuly ordered.
> >
> > We need to fix [e]poll bugs in the following patches,
> > it makes sense to use a common helper.
> >
> > Suggested-by: Neal Cardwell <ncardwell@...gle.com>
> > Signed-off-by: Eric Dumazet <edumazet@...gle.com>
> > ---
> > include/net/tcp.h | 1 +
> > net/ipv4/tcp.c | 2 +-
> > net/ipv4/tcp_input.c | 25 +++++++++++++++++--------
> > 3 files changed, 19 insertions(+), 9 deletions(-)
> >
> > diff --git a/include/net/tcp.h b/include/net/tcp.h
> > index 060e95b331a286ad7c355be11dc03250d2944920..2e7150f6755a5f5bf7b45454da0b33c5fac78183 100644
> > --- a/include/net/tcp.h
> > +++ b/include/net/tcp.h
> > @@ -677,6 +677,7 @@ void tcp_skb_collapse_tstamp(struct sk_buff *skb,
> > /* tcp_input.c */
> > void tcp_rearm_rto(struct sock *sk);
> > void tcp_synack_rtt_meas(struct sock *sk, struct request_sock *req);
> > +void tcp_done_with_error(struct sock *sk);
> > void tcp_reset(struct sock *sk, struct sk_buff *skb);
> > void tcp_fin(struct sock *sk);
> > void tcp_check_space(struct sock *sk);
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index 681b54e1f3a64387787738ab6495531b8abe1771..2a8f8d8676ff1d30ea9f8cd47ccf9236940eb299 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -598,7 +598,7 @@ __poll_t tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
> > */
> > mask |= EPOLLOUT | EPOLLWRNORM;
> > }
> > - /* This barrier is coupled with smp_wmb() in tcp_reset() */
> > + /* This barrier is coupled with smp_wmb() in tcp_done_with_error() */
> > smp_rmb();
> > if (READ_ONCE(sk->sk_err) ||
> > !skb_queue_empty_lockless(&sk->sk_error_queue))
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index 9c04a9c8be9dfaa0ec2437b3748284e57588b216..5af716f1bc74e095d22f64d605624decfe27cefe 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -4436,6 +4436,22 @@ static enum skb_drop_reason tcp_sequence(const struct tcp_sock *tp,
> > return SKB_NOT_DROPPED_YET;
> > }
> >
> > +
> > +void tcp_done_with_error(struct sock *sk)
> > +{
> > + /* Our caller wrote a value into sk->sk_err.
> > + * This barrier is coupled with smp_rmb() in tcp_poll()
> > + */
> > + smp_wmb();
> > +
> > + tcp_write_queue_purge(sk);
> > + tcp_done(sk);
> > +
> > + if (!sock_flag(sk, SOCK_DEAD))
> > + sk_error_report(sk);
> > +}
> > +EXPORT_SYMBOL(tcp_done_with_error);
> > +
> > /* When we get a reset we do this. */
> > void tcp_reset(struct sock *sk, struct sk_buff *skb)
> > {
> > @@ -4460,14 +4476,7 @@ void tcp_reset(struct sock *sk, struct sk_buff *skb)
> > default:
> > WRITE_ONCE(sk->sk_err, ECONNRESET);
> > }
> > - /* This barrier is coupled with smp_rmb() in tcp_poll() */
> > - smp_wmb();
> > -
> > - tcp_write_queue_purge(sk);
> > - tcp_done(sk);
> > -
> > - if (!sock_flag(sk, SOCK_DEAD))
> > - sk_error_report(sk);
> > + tcp_done_with_error(sk);
> > }
> >
> > /*
> > --
>
> Thanks, Eric!
>
> Thinking about this more, I wonder if there is another aspect to this issue.
>
> I am thinking about this part of tcp_done():
>
> void tcp_done(struct sock *sk)
> {
> ...
> sk->sk_shutdown = SHUTDOWN_MASK;
>
> if (!sock_flag(sk, SOCK_DEAD))
> sk->sk_state_change(sk);
>
> The tcp_poll() code reads sk->sk_shutdown to decide whether to set
> EPOLLHUP and other bits. However, sk->sk_shutdown is not set until
> here in tcp_done(). And in the tcp_done() code there is no smp_wmb()
> to ensure that the sk->sk_shutdown is visible to other CPUs before
> tcp_done() calls sk->sk_state_change() to wake up threads sleeping on
> sk->sk_wq.
>
> So AFAICT we could have cases where this sk->sk_state_change() (or the
> later sk_error_report()?) wakes a thread doing a tcp_poll() on another
> CPU, and the tcp_poll() code may correctly see the sk->sk_err because
> it was updated before the smp_wmb() in tcp_done_with_error(), but may
> fail to see the "sk->sk_shutdown = SHUTDOWN_MASK" write because that
> happened after the smp_wmb() in tcp_done_with_error().
I agree. Accessing sk_shutdown with a pair of smp operations makes
sure that another cpu can see the consistency of both sk_shutdown and
sk_err in tcp_poll().
>
> So AFAICT maybe we need two changes?
>
> (1) AFAICT the call to smp_wmb() should actually instead be inside
> tcp_done(), after we set sk->sk_shutdown?
>
> void tcp_done(struct sock *sk)
> {
> ...
> sk->sk_shutdown = SHUTDOWN_MASK;
>
> /* Ensure previous writes to sk->sk_err, sk->sk_state,
> * sk->sk_shutdown are visible to others.
> * This barrier is coupled with smp_rmb() in tcp_poll()
> */
> smp_wmb();
I wonder if it would affect those callers who have no interest in
pairing smp operations, like tcp_v4_syn_recv_sock()? For those
callers, WRITE_ONCE/READ_ONCE() is enough to protect itself only.
Thanks,
Jason
>
> if (!sock_flag(sk, SOCK_DEAD))
> sk->sk_state_change(sk);
>
> (2) Correspondingly, AFAICT the tcp_poll() call to smp_rmb() should be
> before tcp_poll() first reads sk->sk_shutdown, rather than right
> before it reads sk->sk_err?
>
> thanks,
> neal
>
Powered by blists - more mailing lists