Date:   Sun, 12 Feb 2023 13:50:29 +0200
From:   Tariq Toukan <ttoukan.linux@...il.com>
To:     Vincent Guittot <vincent.guittot@...aro.org>,
        Tariq Toukan <tariqt@...dia.com>
Cc:     David Chen <david.chen@...anix.com>,
        Zhang Qiao <zhangqiao22@...wei.com>,
        "Peter Zijlstra (Intel)" <peterz@...radead.org>,
        Willem de Bruijn <willemdebruijn.kernel@...il.com>,
        Ingo Molnar <mingo@...hat.com>,
        Juri Lelli <juri.lelli@...hat.com>,
        Valentin Schneider <vschneid@...hat.com>,
        linux-kernel@...r.kernel.org,
        "David S. Miller" <davem@...emloft.net>,
        Eric Dumazet <edumazet@...gle.com>,
        Jakub Kicinski <kuba@...nel.org>,
        Paolo Abeni <pabeni@...hat.com>,
        Saeed Mahameed <saeedm@...dia.com>,
        Network Development <netdev@...r.kernel.org>,
        Gal Pressman <gal@...dia.com>, Malek Imam <mimam@...dia.com>,
        Hideaki YOSHIFUJI <yoshfuji@...ux-ipv6.org>,
        David Ahern <dsahern@...nel.org>,
        Talat Batheesh <talatb@...dia.com>
Subject: Re: Bug report: UDP ~20% degradation



On 08/02/2023 16:12, Vincent Guittot wrote:
> Hi Tariq,
> 
> On Wed, 8 Feb 2023 at 12:09, Tariq Toukan <tariqt@...dia.com> wrote:
>>
>> Hi all,
>>
>> Our performance verification team spotted a degradation of up to ~20% in
>> UDP performance for a specific combination of parameters.
>>
>> Our matrix covers several parameter values, such as:
>> IP version: 4/6
>> MTU: 1500/9000
>> Msg size: 64/1452/8952 (only where applicable, without causing IP
>> fragmentation).
>> Num of streams: 1/8/16/24.
>> Num of directions: unidir/bidir.
>>
>> Surprisingly, the issue exists only with this specific combination:
>> 8 streams,
>> MTU 9000,
>> Msg size 8952,
>> both ipv4/6,
>> bidir.
>> (in unidir it reproduces only with IPv4)
>>
>> The reproduction is consistent on all the different setups we tested with.
>>
>> Bisect [2] was done between these two points, v5.19 (good) and v6.0-rc1
>> (bad), with a ConnectX-6DX NIC.
>>
>> c82a69629c53eda5233f13fc11c3c01585ef48a2 is the first bad commit [1].
>>
>> We couldn't come up with a good explanation of how this patch causes the
>> issue. We also looked for related changes in the networking/UDP stack,
>> but nothing looked suspicious.
>>
>> Maybe someone here can help with this.
>> We can provide more details or run further tests/experiments to make
>> progress with the debugging.
> 
> Could you share more details about your system and the CPU topology?
> 

Output of 'lscpu' (a 24-vCPU KVM guest: 1 thread per core, 1 core per 
socket, 24 sockets, single NUMA node):

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   40 bits physical, 57 bits virtual
Byte Order:                      Little Endian
CPU(s):                          24
On-line CPU(s) list:             0-23
Vendor ID:                       GenuineIntel
BIOS Vendor ID:                  QEMU
Model name:                      Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
BIOS Model name:                 pc-q35-5.0
CPU family:                      6
Model:                           106
Thread(s) per core:              1
Core(s) per socket:              1
Socket(s):                       24
Stepping:                        6
BogoMIPS:                        4589.21
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic 
sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx 
pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology 
cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 
sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand 
hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd 
ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid 
ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f 
avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni 
avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd arat 
avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni 
avx512_bitalg avx512_vpopcntdq rdpid md_clear arch_capabilities
Virtualization:                  VT-x
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       768 KiB (24 instances)
L1i cache:                       768 KiB (24 instances)
L2 cache:                        96 MiB (24 instances)
L3 cache:                        384 MiB (24 instances)
NUMA node(s):                    1
NUMA node0 CPU(s):               0-23
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Vulnerable: eIBRS with unprivileged eBPF
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

> The commit c82a69629c53 migrates a task to an idle cpu when the task
> is the only one running on the local cpu but the time spent by this
> local cpu under interrupt or RT context becomes significant (10%-17%).
> I can imagine that 16/24 streams overload your system, so load_balance
> doesn't end up in this case and the cpus stay busy with several
> threads. On the other hand, 1 stream is small enough to keep your
> system lightly loaded, but 8 streams load your system enough to
> trigger the reduced-capacity case while still not overloading it.
> 

I see. Makes sense.
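If I follow, the condition you're describing is roughly the check below (my 
reading of check_cpu_capacity() in kernel/sched/fair.c, quoted from memory, 
so the exact form in the bisected tree may differ). With a typical 
imbalance_pct of ~117 it fires once interrupt/RT pressure eats more than 
roughly 15% of the CPU's capacity, which matches the range you mention:

static inline int
check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
{
	/*
	 * cpu_capacity is the original capacity minus the time stolen by
	 * IRQ/RT/DL; the CPU is considered to have reduced capacity when
	 * what is left drops below capacity_orig * 100 / imbalance_pct.
	 */
	return ((rq->cpu_capacity * sd->imbalance_pct) <
				(rq->cpu_capacity_orig * 100));
}
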
1. How would you check this theory? Any suggested tests/experiments?
2. How do you suggest we fix this degradation?
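
For reference, the traffic in the failing combination is essentially 
back-to-back UDP datagrams with an 8952-byte payload (one 9000-MTU frame, 
no IP fragmentation), 8 such streams in each direction. Below is a minimal 
per-stream sender sketch; it is only an illustration, the reported numbers 
come from our internal multi-stream bidirectional harness:

/*
 * udp_blast.c: send 8952-byte UDP datagrams as fast as possible.
 * Illustration of a single stream only; run one instance per stream
 * (8 in the degraded case) plus the same in the reverse direction
 * for bidir.
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

#define MSG_SIZE 8952	/* fits one 9000-MTU frame, no IP fragmentation */

int main(int argc, char **argv)
{
	struct sockaddr_in dst = { .sin_family = AF_INET };
	static char buf[MSG_SIZE];
	int fd;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <dest-ipv4> <port>\n", argv[0]);
		return 1;
	}
	dst.sin_port = htons((unsigned short)atoi(argv[2]));
	if (inet_pton(AF_INET, argv[1], &dst.sin_addr) != 1) {
		fprintf(stderr, "bad address: %s\n", argv[1]);
		return 1;
	}

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0 || connect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0) {
		perror("socket/connect");
		return 1;
	}

	for (;;) {	/* blast until interrupted */
		if (send(fd, buf, sizeof(buf), 0) < 0) {
			perror("send");
			break;
		}
	}
	close(fd);
	return 0;
}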

Thanks,
Tariq
