lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <661954fce5f33_38e2532949f@willemb.c.googlers.com.notmuch>
Date: Fri, 12 Apr 2024 11:36:28 -0400
From: Willem de Bruijn <willemdebruijn.kernel@...il.com>
To: Zijian Zhang <zijianzhang@...edance.com>, 
 Willem de Bruijn <willemdebruijn.kernel@...il.com>, 
 Eric Dumazet <edumazet@...gle.com>
Cc: netdev@...r.kernel.org, 
 davem@...emloft.net, 
 kuba@...nel.org, 
 cong.wang@...edance.com, 
 xiaochun.lu@...edance.com
Subject: Re: [External] Re: [PATCH net-next 2/3] selftests: fix OOM problem in
 msg_zerocopy selftest

Zijian Zhang wrote:
> On 4/10/24 4:06 PM, Willem de Bruijn wrote:
> >>>>> In this case, for some reason, notifications do not
> >>>>> come in order now. We introduce "cfg_notification_order_check" to
> >>>>> possibly ignore the checking for order.
> >>>>
> >>>> Were you testing UDP?
> >>>>
> >>>> I don't think this is needed. I wonder what you were doing to see
> >>>> enough of these events to want to suppress the log output.
> >>
> >> I tested again on both TCP and UDP just now, and it happened to both of
> >> them. For tcp test, too many printfs will delay the sending and thus
> >> affect the throughput.
> >>
> >> ipv4 tcp -z -t 1
> >> gap: 277..277 does not append to 276
> > 
> > There is something wrong here. 277 clearly appends to 276
> > 
> 
> ```
> if (lo != next_completion)
>      fprintf(stderr, "gap: %u..%u does not append to %u\n",
>          lo, hi, next_completion);
> ```
> 
> According to the code, it expects the lo to be 276, but it's 277.

Ack. I should have phrased that message better.
 
> > If you ran this on a kernel with a variety of changes, please repeat
> > this on a clean kernel with no other changes besides the
> > skb_orphan_frags_rx loopback change.
> > 
> > It this is a real issue, I don't mind moving this behind cfg_verbose.
> > And prefer that approach over adding a new flag.
> > 
> > But I have never seen this before, and this kind of reordering is rare
> > with UDP and should not happen with TCP except for really edge cases:
> > the uarg is released only when both the skb was delivered and the ACK
> > response was received to free the clone on the retransmit queue.
> 
> I found the set up where I encountered the OOM problem in msg_zerocopy
> selftest. I did it on a clean kernel vm booted by qemu, 
> dfa2f0483360("tcp: get rid of sysctl_tcp_adv_win_scale") with only
> skb_orphan_frags_rx change.
> 
> Then, I do `make olddefconfig` and turn on some configurations for
> virtualization like VIRTIO_FS, VIRTIO_NET and some confs like 9P_FS
> to share folders. Let's call it config, here was the result I got,
> ```
> ./msg_zerocopy.sh
> ipv4 tcp -z -t 1
> ./msg_zerocopy: send: No buffer space available
> rx=564 (70 MB)
> ```
> 
> Since the TCP socket is always writable, the do_poll always return True.
> There is no any chance for `do_recv_completions` to run.
> ```
> while (!do_poll(fd, POLLOUT)) {
>      if (cfg_zerocopy)
>          do_recv_completions(fd, domain);
>      }
> ```
> Finally, the size of sendmsg zerocopy notification skbs exceeds the 
> opt_mem limit. I got "No buffer space available".
> 
> 
> However, if I change the config by
> ```
>   DEBUG_IRQFLAGS n -> y
>   DEBUG_LOCK_ALLOC n -> y
>   DEBUG_MUTEXES n -> y
>   DEBUG_RT_MUTEXES n -> y
>   DEBUG_RWSEMS n -> y
>   DEBUG_SPINLOCK n -> y
>   DEBUG_WW_MUTEX_SLOWPATH n -> y
>   PROVE_LOCKING n -> y
> +DEBUG_LOCKDEP y
> +LOCKDEP y
> +LOCKDEP_BITS 15
> +LOCKDEP_CHAINS_BITS 16
> +LOCKDEP_CIRCULAR_QUEUE_BITS 12
> +LOCKDEP_STACK_TRACE_BITS 19
> +LOCKDEP_STACK_TRACE_HASH_BITS 14
> +PREEMPTIRQ_TRACEPOINTS y
> +PROVE_RAW_LOCK_NESTING n
> +PROVE_RCU y
> +TRACE_IRQFLAGS y
> +TRACE_IRQFLAGS_NMI y
> ```
> 
> Let's call it config-debug, the selftest works well with reordered
> notifications.
> ```
> ipv4 tcp -z -t 1
> gap: 2117..2117 does not append to 2115
> gap: 2115..2116 does not append to 2118
> gap: 2118..3144 does not append to 2117
> gap: 3146..3146 does not append to 3145
> gap: 3145..3145 does not append to 3147
> gap: 3147..3768 does not append to 3146
> ...
> gap: 34935..34935 does not append to 34937
> gap: 34938..36409 does not append to 34936
> 
> rx=36097 (2272 MB)
> missing notifications: 36410 < 36412
> tx=36412 (2272 MB) txc=36410 zc=y
> ```
> For exact config to compile the kernel, please see
> https://github.com/Sm0ckingBird/config

Thanks for sharing the system configs. I'm quite surprised at these
reorderings *over loopback* with these debug settings, and no weird
qdiscs that would explain it. Can you see whether you see drops and
retransmits?

> 
> I also did selftest on 63c8778d9149("Merge branch 
> 'net-mana-fix-doorbell-access-for-receive-queues'"), the parent of 
> dfa2f0483360("tcp: get rid of sysctl_tcp_adv_win_scale")
> 
> with config, selftest works well.
> ```
> ipv4 tcp -z -t 1
> missing notifications: 223181 < 223188
> tx=223188 (13927 MB) txc=223181 zc=y
> rx=111592 (13927 MB)
> ```
> 
> with config-debug, selftest works well with reordered notifications
> ```
> ipv4 tcp -z -t 1
> ...
> gap: 30397..30404 does not append to 30389
> gap: 30435..30442 does not append to 30427
> gap: 30427..30434 does not append to 30443
> gap: 30443..30450 does not append to 30435
> gap: 30473..30480 does not append to 30465
> gap: 30465..30472 does not append to 30481
> gap: 30481..30488 does not append to 30473
> tx=30491 (1902 MB) txc=30491 zc=y
> rx=15245 (1902 MB)
> ```
> 
> Not sure about the exact reason for this OOM problem, and why
> turning on DEBUG_LOCKDEP and PROVE_RAW_LOCK_NESTING can solve
> the problem with reordered notifications...

The debug config causes the reordering notifications, right? But
solves the OOM.

> If you have any thoughts or
> comments, please feel free to share them with us.
> 
> If the problem does exist, I guess we can force `do_recv_completions`
> after some number of sendmsgs and move "gap: ..." after cfg_verbose?

I do want to understand the issue better. But not sure when I'll find
the time.

Both sound reasonable to me, yes.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ