[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CADVnQynXDAL=6V-a+L34bP3KEDWAWwk1kdL2ym0AttL_t54miA@mail.gmail.com>
Date: Thu, 3 Nov 2022 09:54:59 -0400
From: Neal Cardwell <ncardwell@...gle.com>
To: "luwei (O)" <luwei32@...wei.com>
Cc: edumazet@...gle.com, davem@...emloft.net, yoshfuji@...ux-ipv6.org,
dsahern@...nel.org, kuba@...nel.org, pabeni@...hat.com,
xemul@...allels.com, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [patch net v3] tcp: prohibit TCP_REPAIR_OPTIONS if data was
already sent
On Wed, Nov 2, 2022 at 10:11 PM luwei (O) <luwei32@...wei.com> wrote:
>
>
> 在 2022/11/2 10:46 PM, Neal Cardwell 写道:
> > On Wed, Nov 2, 2022 at 8:23 AM Lu Wei <luwei32@...wei.com> wrote:
> >> If setsockopt with option name of TCP_REPAIR_OPTIONS and opt_code
> >> of TCPOPT_SACK_PERM is called to enable sack after data is sent
> >> and before data is acked, ...
> > This "before data is acked" phrase does not quite seem to match the
> > sequence below, AFAICT?
> >
> > How about something like:
> >
> > If setsockopt TCP_REPAIR_OPTIONS with opt_code TCPOPT_SACK_PERM
> > is called to enable SACK after data is sent and the data sender receives a
> > dupack, ...
> yes, thanks for suggestion
> >
> >
> >> ... it will trigger a warning in function
> >> tcp_verify_left_out() as follows:
> >>
> >> ============================================
> >> WARNING: CPU: 8 PID: 0 at net/ipv4/tcp_input.c:2132
> >> tcp_timeout_mark_lost+0x154/0x160
> >> tcp_enter_loss+0x2b/0x290
> >> tcp_retransmit_timer+0x50b/0x640
> >> tcp_write_timer_handler+0x1c8/0x340
> >> tcp_write_timer+0xe5/0x140
> >> call_timer_fn+0x3a/0x1b0
> >> __run_timers.part.0+0x1bf/0x2d0
> >> run_timer_softirq+0x43/0xb0
> >> __do_softirq+0xfd/0x373
> >> __irq_exit_rcu+0xf6/0x140
> >>
> >> The warning is caused in the following steps:
> >> 1. a socket named socketA is created
> >> 2. socketA enters repair mode without build a connection
> >> 3. socketA calls connect() and its state is changed to TCP_ESTABLISHED
> >> directly
> >> 4. socketA leaves repair mode
> >> 5. socketA calls sendmsg() to send data, packets_out and sack_outs(dup
> >> ack receives) increase
> >> 6. socketA enters repair mode again
> >> 7. socketA calls setsockopt with TCPOPT_SACK_PERM to enable sack
> >> 8. retransmit timer expires, it calls tcp_timeout_mark_lost(), lost_out
> >> increases
> >> 9. sack_outs + lost_out > packets_out triggers since lost_out and
> >> sack_outs increase repeatly
> >>
> >> In function tcp_timeout_mark_lost(), tp->sacked_out will be cleared if
> >> Step7 not happen and the warning will not be triggered. As suggested by
> >> Denis and Eric, TCP_REPAIR_OPTIONS should be prohibited if data was
> >> already sent. So this patch checks tp->segs_out, only TCP_REPAIR_OPTIONS
> >> can be set only if tp->segs_out is 0.
> >>
> >> socket-tcp tests in CRIU has been tested as follows:
> >> $ sudo ./test/zdtm.py run -t zdtm/static/socket-tcp* --keep-going \
> >> --ignore-taint
> >>
> >> socket-tcp* represent all socket-tcp tests in test/zdtm/static/.
> >>
> >> Fixes: b139ba4e90dc ("tcp: Repair connection-time negotiated parameters")
> >> Signed-off-by: Lu Wei <luwei32@...wei.com>
> >> ---
> >> net/ipv4/tcp.c | 2 +-
> >> 1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> >> index ef14efa1fb70..1f5cc32cf0cc 100644
> >> --- a/net/ipv4/tcp.c
> >> +++ b/net/ipv4/tcp.c
> >> @@ -3647,7 +3647,7 @@ int do_tcp_setsockopt(struct sock *sk, int level, int optname,
> >> case TCP_REPAIR_OPTIONS:
> >> if (!tp->repair)
> >> err = -EINVAL;
> >> - else if (sk->sk_state == TCP_ESTABLISHED)
> >> + else if (sk->sk_state == TCP_ESTABLISHED && !tp->segs_out)
> > The tp->segs_out field is only 32 bits wide. By my math, at 200
> > Gbit/sec with 1500 byte MTU it can wrap roughly every 260 secs. So a
> > caller could get unlucky or carefully sequence its call to
> > TCP_REPAIR_OPTIONS (based on packets sent so far) to mess up the
> > accounting and trigger the kernel warning.
> >
> > How about using some other method to determine if this is safe?
> > Perhaps using tp->bytes_sent, which is a 64-bit field, which by my
> > math would take 23 years to wrap at 200 Gbit/sec?
> >
> > If we're more paranoid about wrapping we could also check
> > tp->packets_out, and refuse to allow TCP_REPAIR_OPTIONS if either
> > tp->bytes_sent or tp->packets_out are non-zero. (Or if we're even more
> > paranoid I suppose we could have a special new bit to track whether
> > we've ever sent something, but that probably seems like overkill?)
> >
> > neal
> > .
>
> I didn't notice that u32 will be easily wrapped in huge network throughput,
> thank you neal.
>
> But tcp->packets_out shoud not be used because tp->packets_out can decrease
> when expected ack is received, so it can decrease to 0 and this is the common
> condition.
To say tp->packets_out should not be used is a bit strong. :-)
Obviously packets_out decreases when packets are ACKed. The point of
checking both tp->bytes_sent and tp->packets_out would be if we are
paranoid enough that we want to prevent this warning in the case where
tp->bytes_sent wraps and becomes zero. If tp->bytes_sent wraps and is
zero, we will be saved from hitting this warning if we deny the
request to set TCP_REPAIR_OPTIONS if tp->packets_out is non-zero.
(Because we can only hit this warning if tp->sacked_out is non-zero,
and tp->sacked_out should only be non-zero if tp->packets_out is
non-zero.)
neal
Powered by blists - more mailing lists