lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <0ac5752d-0b36-436a-9c37-13e262334dce@bytedance.com>
Date: Thu, 11 Apr 2024 17:26:06 -0700
From: Zijian Zhang <zijianzhang@...edance.com>
To: Willem de Bruijn <willemdebruijn.kernel@...il.com>,
 Eric Dumazet <edumazet@...gle.com>
Cc: netdev@...r.kernel.org, davem@...emloft.net, kuba@...nel.org,
 cong.wang@...edance.com, xiaochun.lu@...edance.com
Subject: Re: [External] Re: [PATCH net-next 2/3] selftests: fix OOM problem in
 msg_zerocopy selftest

On 4/10/24 4:06 PM, Willem de Bruijn wrote:
>>>>> In this case, for some reason, notifications do not
>>>>> come in order now. We introduce "cfg_notification_order_check" to
>>>>> possibly ignore the checking for order.
>>>>
>>>> Were you testing UDP?
>>>>
>>>> I don't think this is needed. I wonder what you were doing to see
>>>> enough of these events to want to suppress the log output.
>>
>> I tested again on both TCP and UDP just now, and it happened to both of
>> them. For tcp test, too many printfs will delay the sending and thus
>> affect the throughput.
>>
>> ipv4 tcp -z -t 1
>> gap: 277..277 does not append to 276
> 
> There is something wrong here. 277 clearly appends to 276
> 

```
if (lo != next_completion)
     fprintf(stderr, "gap: %u..%u does not append to %u\n",
         lo, hi, next_completion);
```

According to the code, it expects the lo to be 276, but it's 277.

>> gap: 276..276 does not append to 278
> 
> This would be an actual reordering. But the above line already
> indicates that 276 is the next expected value.
> 
>> gap: 278..1112 does not append to 277
>> gap: 1114..1114 does not append to 1113
>> gap: 1113..1113 does not append to 1115
>> gap: 1115..2330 does not append to 1114
>> gap: 2332..2332 does not append to 2331
>> gap: 2331..2331 does not append to 2333
>> gap: 2333..2559 does not append to 2332
>> gap: 2562..2562 does not append to 2560
>> ...
>> gap: 25841..25841 does not append to 25843
>> gap: 25843..25997 does not append to 25842
>>
>> ...
>>
>> ipv6 udp -z -t 1
>> gap: 11632..11687 does not append to 11625
>> gap: 11625..11631 does not append to 11688
>> gap: 11688..54662 does not append to 11632
> 
> If you ran this on a kernel with a variety of changes, please repeat
> this on a clean kernel with no other changes besides the
> skb_orphan_frags_rx loopback change.
> 
> It this is a real issue, I don't mind moving this behind cfg_verbose.
> And prefer that approach over adding a new flag.
> 
> But I have never seen this before, and this kind of reordering is rare
> with UDP and should not happen with TCP except for really edge cases:
> the uarg is released only when both the skb was delivered and the ACK
> response was received to free the clone on the retransmit queue.

I found the set up where I encountered the OOM problem in msg_zerocopy
selftest. I did it on a clean kernel vm booted by qemu, 
dfa2f0483360("tcp: get rid of sysctl_tcp_adv_win_scale") with only
skb_orphan_frags_rx change.

Then, I do `make olddefconfig` and turn on some configurations for
virtualization like VIRTIO_FS, VIRTIO_NET and some confs like 9P_FS
to share folders. Let's call it config, here was the result I got,
```
./msg_zerocopy.sh
ipv4 tcp -z -t 1
./msg_zerocopy: send: No buffer space available
rx=564 (70 MB)
```

Since the TCP socket is always writable, the do_poll always return True.
There is no any chance for `do_recv_completions` to run.
```
while (!do_poll(fd, POLLOUT)) {
     if (cfg_zerocopy)
         do_recv_completions(fd, domain);
     }
```
Finally, the size of sendmsg zerocopy notification skbs exceeds the 
opt_mem limit. I got "No buffer space available".


However, if I change the config by
```
  DEBUG_IRQFLAGS n -> y
  DEBUG_LOCK_ALLOC n -> y
  DEBUG_MUTEXES n -> y
  DEBUG_RT_MUTEXES n -> y
  DEBUG_RWSEMS n -> y
  DEBUG_SPINLOCK n -> y
  DEBUG_WW_MUTEX_SLOWPATH n -> y
  PROVE_LOCKING n -> y
+DEBUG_LOCKDEP y
+LOCKDEP y
+LOCKDEP_BITS 15
+LOCKDEP_CHAINS_BITS 16
+LOCKDEP_CIRCULAR_QUEUE_BITS 12
+LOCKDEP_STACK_TRACE_BITS 19
+LOCKDEP_STACK_TRACE_HASH_BITS 14
+PREEMPTIRQ_TRACEPOINTS y
+PROVE_RAW_LOCK_NESTING n
+PROVE_RCU y
+TRACE_IRQFLAGS y
+TRACE_IRQFLAGS_NMI y
```

Let's call it config-debug, the selftest works well with reordered
notifications.
```
ipv4 tcp -z -t 1
gap: 2117..2117 does not append to 2115
gap: 2115..2116 does not append to 2118
gap: 2118..3144 does not append to 2117
gap: 3146..3146 does not append to 3145
gap: 3145..3145 does not append to 3147
gap: 3147..3768 does not append to 3146
...
gap: 34935..34935 does not append to 34937
gap: 34938..36409 does not append to 34936

rx=36097 (2272 MB)
missing notifications: 36410 < 36412
tx=36412 (2272 MB) txc=36410 zc=y
```
For exact config to compile the kernel, please see
https://github.com/Sm0ckingBird/config


I also did selftest on 63c8778d9149("Merge branch 
'net-mana-fix-doorbell-access-for-receive-queues'"), the parent of 
dfa2f0483360("tcp: get rid of sysctl_tcp_adv_win_scale")

with config, selftest works well.
```
ipv4 tcp -z -t 1
missing notifications: 223181 < 223188
tx=223188 (13927 MB) txc=223181 zc=y
rx=111592 (13927 MB)
```

with config-debug, selftest works well with reordered notifications
```
ipv4 tcp -z -t 1
...
gap: 30397..30404 does not append to 30389
gap: 30435..30442 does not append to 30427
gap: 30427..30434 does not append to 30443
gap: 30443..30450 does not append to 30435
gap: 30473..30480 does not append to 30465
gap: 30465..30472 does not append to 30481
gap: 30481..30488 does not append to 30473
tx=30491 (1902 MB) txc=30491 zc=y
rx=15245 (1902 MB)
```

Not sure about the exact reason for this OOM problem, and why
turning on DEBUG_LOCKDEP and PROVE_RAW_LOCK_NESTING can solve
the problem with reordered notifications... If you have any thoughts or
comments, please feel free to share them with us.

If the problem does exist, I guess we can force `do_recv_completions`
after some number of sendmsgs and move "gap: ..." after cfg_verbose?

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ