lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <13cdc078-770f-4083-826f-89d13b13140d@bytedance.com>
Date: Fri, 12 Apr 2024 11:58:23 -0700
From: Zijian Zhang <zijianzhang@...edance.com>
To: Willem de Bruijn <willemdebruijn.kernel@...il.com>,
 Eric Dumazet <edumazet@...gle.com>
Cc: netdev@...r.kernel.org, davem@...emloft.net, kuba@...nel.org,
 cong.wang@...edance.com, xiaochun.lu@...edance.com
Subject: Re: [External] Re: [PATCH net-next 2/3] selftests: fix OOM problem in
 msg_zerocopy selftest

On 4/12/24 8:36 AM, Willem de Bruijn wrote:
> Zijian Zhang wrote:
>> On 4/10/24 4:06 PM, Willem de Bruijn wrote:
>>>>>>> In this case, for some reason, notifications do not
>>>>>>> come in order now. We introduce "cfg_notification_order_check" to
>>>>>>> possibly ignore the checking for order.
>>>>>>
>>>>>> Were you testing UDP?
>>>>>>
>>>>>> I don't think this is needed. I wonder what you were doing to see
>>>>>> enough of these events to want to suppress the log output.
>>>>
>>>> I tested again on both TCP and UDP just now, and it happened to both of
>>>> them. For tcp test, too many printfs will delay the sending and thus
>>>> affect the throughput.
>>>>
>>>> ipv4 tcp -z -t 1
>>>> gap: 277..277 does not append to 276
>>>
>>> There is something wrong here. 277 clearly appends to 276
>>>
>>
>> ```
>> if (lo != next_completion)
>>       fprintf(stderr, "gap: %u..%u does not append to %u\n",
>>           lo, hi, next_completion);
>> ```
>>
>> According to the code, it expects the lo to be 276, but it's 277.
> 
> Ack. I should have phrased that message better.
>   
>>> If you ran this on a kernel with a variety of changes, please repeat
>>> this on a clean kernel with no other changes besides the
>>> skb_orphan_frags_rx loopback change.
>>>
>>> It this is a real issue, I don't mind moving this behind cfg_verbose.
>>> And prefer that approach over adding a new flag.
>>>
>>> But I have never seen this before, and this kind of reordering is rare
>>> with UDP and should not happen with TCP except for really edge cases:
>>> the uarg is released only when both the skb was delivered and the ACK
>>> response was received to free the clone on the retransmit queue.
>>
>> I found the set up where I encountered the OOM problem in msg_zerocopy
>> selftest. I did it on a clean kernel vm booted by qemu,
>> dfa2f0483360("tcp: get rid of sysctl_tcp_adv_win_scale") with only
>> skb_orphan_frags_rx change.
>>
>> Then, I do `make olddefconfig` and turn on some configurations for
>> virtualization like VIRTIO_FS, VIRTIO_NET and some confs like 9P_FS
>> to share folders. Let's call it config, here was the result I got,
>> ```
>> ./msg_zerocopy.sh
>> ipv4 tcp -z -t 1
>> ./msg_zerocopy: send: No buffer space available
>> rx=564 (70 MB)
>> ```
>>
>> Since the TCP socket is always writable, the do_poll always return True.
>> There is no any chance for `do_recv_completions` to run.
>> ```
>> while (!do_poll(fd, POLLOUT)) {
>>       if (cfg_zerocopy)
>>           do_recv_completions(fd, domain);
>>       }
>> ```
>> Finally, the size of sendmsg zerocopy notification skbs exceeds the
>> opt_mem limit. I got "No buffer space available".
>>
>>
>> However, if I change the config by
>> ```
>>    DEBUG_IRQFLAGS n -> y
>>    DEBUG_LOCK_ALLOC n -> y
>>    DEBUG_MUTEXES n -> y
>>    DEBUG_RT_MUTEXES n -> y
>>    DEBUG_RWSEMS n -> y
>>    DEBUG_SPINLOCK n -> y
>>    DEBUG_WW_MUTEX_SLOWPATH n -> y
>>    PROVE_LOCKING n -> y
>> +DEBUG_LOCKDEP y
>> +LOCKDEP y
>> +LOCKDEP_BITS 15
>> +LOCKDEP_CHAINS_BITS 16
>> +LOCKDEP_CIRCULAR_QUEUE_BITS 12
>> +LOCKDEP_STACK_TRACE_BITS 19
>> +LOCKDEP_STACK_TRACE_HASH_BITS 14
>> +PREEMPTIRQ_TRACEPOINTS y
>> +PROVE_RAW_LOCK_NESTING n
>> +PROVE_RCU y
>> +TRACE_IRQFLAGS y
>> +TRACE_IRQFLAGS_NMI y
>> ```
>>
>> Let's call it config-debug, the selftest works well with reordered
>> notifications.
>> ```
>> ipv4 tcp -z -t 1
>> gap: 2117..2117 does not append to 2115
>> gap: 2115..2116 does not append to 2118
>> gap: 2118..3144 does not append to 2117
>> gap: 3146..3146 does not append to 3145
>> gap: 3145..3145 does not append to 3147
>> gap: 3147..3768 does not append to 3146
>> ...
>> gap: 34935..34935 does not append to 34937
>> gap: 34938..36409 does not append to 34936
>>
>> rx=36097 (2272 MB)
>> missing notifications: 36410 < 36412
>> tx=36412 (2272 MB) txc=36410 zc=y
>> ```
>> For exact config to compile the kernel, please see
>> https://github.com/Sm0ckingBird/config
> 
> Thanks for sharing the system configs. I'm quite surprised at these
> reorderings *over loopback* with these debug settings, and no weird
> qdiscs that would explain it. Can you see whether you see drops and
> retransmits?
> 

No drops and retransmits are observed.

```
ip netns exec ns-djROUw1 netstat -s

Tcp:
     1 active connection openings
     0 passive connection openings
     0 failed connection attempts
     1 connection resets received
     0 connections established
     16158 segments received
     32311 segments sent out
     0 segments retransmitted
     0 bad segments received
     0 resets sent

ip netns exec ns-djROUw1 ip -s link show veth0

2: veth0@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65535 qdisc noqueue 
state UP mode DEFAULT group default qlen 1000
     link/ether 02:02:02:02:02:02 brd ff:ff:ff:ff:ff:ff link-netns 
ns-djROUw2
     RX: bytes  packets  errors  dropped overrun mcast
     1067646    16173    0       0       0       0
     TX: bytes  packets  errors  dropped carrier collsns
     2116207634 32325    0       0       0       0
```

>>
>> I also did selftest on 63c8778d9149("Merge branch
>> 'net-mana-fix-doorbell-access-for-receive-queues'"), the parent of
>> dfa2f0483360("tcp: get rid of sysctl_tcp_adv_win_scale")
>>
>> with config, selftest works well.
>> ```
>> ipv4 tcp -z -t 1
>> missing notifications: 223181 < 223188
>> tx=223188 (13927 MB) txc=223181 zc=y
>> rx=111592 (13927 MB)
>> ```
>>
>> with config-debug, selftest works well with reordered notifications
>> ```
>> ipv4 tcp -z -t 1
>> ...
>> gap: 30397..30404 does not append to 30389
>> gap: 30435..30442 does not append to 30427
>> gap: 30427..30434 does not append to 30443
>> gap: 30443..30450 does not append to 30435
>> gap: 30473..30480 does not append to 30465
>> gap: 30465..30472 does not append to 30481
>> gap: 30481..30488 does not append to 30473
>> tx=30491 (1902 MB) txc=30491 zc=y
>> rx=15245 (1902 MB)
>> ```
>>
>> Not sure about the exact reason for this OOM problem, and why
>> turning on DEBUG_LOCKDEP and PROVE_RAW_LOCK_NESTING can solve
>> the problem with reordered notifications...
> 
> The debug config causes the reordering notifications, right? But
> solves the OOM.
> 

With config, OOM is introduced by dfa2f0483360("tcp: get rid of
sysctl_tcp_adv_win_scale"). And, it can be solved by config-debug,
but reordering notifications are observed.

With config, there is no OOM in 63c8778d9149("Merge branch
'net-mana-fix-doorbell-access-for-receive-queues'"), everything works
well. But with config-debug, reordering notifications are observed.

>> If you have any thoughts or
>> comments, please feel free to share them with us.
>>
>> If the problem does exist, I guess we can force `do_recv_completions`
>> after some number of sendmsgs and move "gap: ..." after cfg_verbose?
> 
> I do want to understand the issue better. But not sure when I'll find
> the time.
> 

I also want to understand this, I will look into it when I find time :)

> Both sound reasonable to me, yes.
Thanks for the review and suggestions, I will update in the next iteration.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ