lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <b9573c0e-3cdb-4444-b8f2-579aa699b2e1@alu.unizg.hr>
Date: Sun, 5 Nov 2023 20:27:44 +0100
From: Mirsad Todorovac <mirsad.todorovac@....unizg.hr>
To: Andrew Lunn <andrew@...n.ch>
Cc: Heiner Kallweit <hkallweit1@...il.com>, linux-kernel@...r.kernel.org,
 "David S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
 Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
 netdev@...r.kernel.org, nic_swsd@...ltek.com
Subject: Re: [PATCH net-next v6 0/5] Coalesce mac ocp write/modify calls to
 reduce spinlock contention

On 11/5/23 16:33, Andrew Lunn wrote:

>>>> With about 130 of these sequential calls to r8168_mac_ocp_write() this looks like
>>>> a lock storm that will stall all of the cores and CPUs on the same memory controller
>>>> for certain time I/O takes to finish.

> Please provide benchmark data to show this is a real issue, and the
> patch fixes it.

Certainly, Sir, but I have to figure out what to measure.

To better think of it, I actually do not have a system with a physical RTL8411b, this patch
was made by the finding in the visual inspection of the code.

FWIW, separation of code and data is the design principle that is strongly endorsed lately
and it seems like a good practice that prevents a range of security breaches and attacks:

[1] https://blog.klipse.tech/databook/2020/10/02/separate-code-data.html

[2] Reduce system complexity by separating Code from Data,
     https://livebook.manning.com/book/data-oriented-programming/chapter-2/v-2/

In this case, data is hard-coded.

The resulting code would be smaller in size and execution time, and IMHO more readable,
(in a table), but I will not enter much discussion for you have made your mind already.

>> Additionally, I would like to "inline" many functions, as I think that call/return
>> sequences with stack frame generation /destruction are more expensive than inlining the
>> small one liners.
> 
> Please provide benchmarks to show the compiler is getting this wrong,
> and inline really is needed.

I think I am by now technologically up to that:

"drivers/net/ethernet/realtek/r8169_main.s" 302034 lines
-------------------------------------------------------------------------------------
   7500 r8168_mac_ocp_write:
   7501 .LVL488:
   7502 .LFB6322:
   7503         .loc 1 900 1 is_stmt 1 view -0
   7504         .cfi_startproc
   7505 1:      call    __fentry__
   7506         .section __mcount_loc, "a",@progbits
   7507         .quad 1b
   7508         .previous
   7509         .loc 1 901 2 view .LVU1955
   7510         .loc 1 903 2 view .LVU1956
   7511         .loc 1 903 7 view .LVU1957
   7512         .loc 1 903 10 view .LVU1958
   7513         .loc 1 903 33 view .LVU1959
   7514         .loc 1 903 57 view .LVU1960
   7515         .loc 1 903 88 view .LVU1961
   7516         .loc 1 903 95 view .LVU1962
   7517         .loc 1 900 1 is_stmt 0 view .LVU1963
   7518         pushq   %rbp
   7519         .cfi_def_cfa_offset 16
   7520         .cfi_offset 6, -16
   7521         movq    %rsp, %rbp
   7522         .cfi_def_cfa_register 6
   7523         pushq   %r15
   7524         .cfi_offset 15, -24
   7525         .loc 1 903 103 view .LVU1964
   7526         leaq    6692(%rdi), %r15
   7527         .loc 1 900 1 view .LVU1965
   7528         pushq   %r14
   7529         .cfi_offset 14, -32
   7530         movl    %edx, %r14d
   7531         pushq   %r13
   7532         pushq   %r12
   7533         .cfi_offset 13, -40
   7534         .cfi_offset 12, -48
   7535         movq    %rdi, %r12
   7536         .loc 1 903 103 view .LVU1966
   7537         movq    %r15, %rdi
   7538 .LVL489:
   7539         .loc 1 900 1 view .LVU1967
   7540         pushq   %rbx
   7541         .cfi_offset 3, -56
   7542         .loc 1 900 1 view .LVU1968
   7543         movl    %esi, %ebx
   7544         .loc 1 903 103 view .LVU1969
   7545         call    _raw_spin_lock_irqsave
   7546 .LVL490:
   7547 .LBB3557:
   7548 .LBB3558:
   7549         .loc 1 893 6 view .LVU1970
   7550         movl    %ebx, %edi
   7551 .LBE3558:
   7552 .LBE3557:
   7553         .loc 1 903 103 view .LVU1971
   7554         movq    %rax, %r13
   7555 .LVL491:
   7556         .loc 1 903 5 is_stmt 1 view .LVU1972
   7557         .loc 1 904 2 view .LVU1973
   7558 .LBB3564:
   7559 .LBI3557:
   7560         .loc 1 891 13 view .LVU1974
   7561 .LBB3563:
   7562         .loc 1 893 2 view .LVU1975
   7563         .loc 1 893 6 is_stmt 0 view .LVU1976
   7564         call    rtl_ocp_reg_failure
   7565 .LVL492:
   7566         .loc 1 893 5 view .LVU1977
   7567         testb   %al, %al
   7568         jne     .L375
   7569 .LVL493:
   7570 .LBB3559:
   7571 .LBI3559:
   7572         .loc 1 891 13 is_stmt 1 view .LVU1978
   7573 .LBB3560:
   7574         .loc 1 896 2 view .LVU1979
   7575         .loc 1 896 28 is_stmt 0 view .LVU1980
   7576         sall    $15, %ebx
   7577 .LVL494:
   7578         .loc 1 896 58 view .LVU1981
   7579         movq    (%r12), %rax
   7580         .loc 1 896 35 view .LVU1982
   7581         orl     %r14d, %ebx
   7582         .loc 1 896 2 view .LVU1983
   7583         orl     $-2147483648, %ebx
   7584 .LVL495:
   7585 .LBB3561:
   7586 .LBI3561:
   7587         .loc 2 66 120 is_stmt 1 view .LVU1984
   7588 .LBB3562:
   7589         .loc 2 66 168 view .LVU1985
   7590 #APP
   7591 # 66 "./arch/x86/include/asm/io.h" 1
   7592         movl %ebx,176(%rax)
   7593 # 0 "" 2
   7594 .LVL496:
   7595 #NO_APP
   7596 .L375:
   7597         .loc 2 66 168 is_stmt 0 view .LVU1986
   7598 .LBE3562:
   7599 .LBE3561:
   7600 .LBE3560:
   7601 .LBE3559:
   7602 .LBE3563:
   7603 .LBE3564:
   7604         .loc 1 905 2 is_stmt 1 view .LVU1987
   7605         .loc 1 905 7 view .LVU1988
   7606         .loc 1 905 10 view .LVU1989
   7607         .loc 1 905 33 view .LVU1990
   7608         .loc 1 905 57 view .LVU1991
   7609         .loc 1 905 88 view .LVU1992
   7610         .loc 1 905 95 view .LVU1993
   7611         movq    %r13, %rsi
   7612         movq    %r15, %rdi
   7613         call    _raw_spin_unlock_irqrestore
   7614 .LVL497:
   7615         .loc 1 905 5 view .LVU1994
   7616         .loc 1 906 1 is_stmt 0 view .LVU1995
   7617         popq    %rbx
   7618         .cfi_restore 3
   7619         popq    %r12
   7620         .cfi_restore 12
   7621 .LVL498:
   7622         .loc 1 906 1 view .LVU1996
   7623         popq    %r13
   7624         .cfi_restore 13
   7625 .LVL499:
   7626         .loc 1 906 1 view .LVU1997
   7627         popq    %r14
   7628         .cfi_restore 14
   7629 .LVL500:
   7630         .loc 1 906 1 view .LVU1998
   7631         popq    %r15
   7632         .cfi_restore 15
   7633 .LVL501:
   7634         .loc 1 906 1 view .LVU1999
   7635         popq    %rbp
   7636         .cfi_restore 6
   7637         .cfi_def_cfa 7, 8
   7638         xorl    %eax, %eax
   7639         xorl    %edx, %edx
   7640         xorl    %esi, %esi
   7641         xorl    %edi, %edi
   7642         jmp     __x86_return_thunk
   7643         .cfi_endproc
   7644 .LFE6322:
   7645         .size   r8168_mac_ocp_write, .-r8168_mac_ocp_write
   7646         .p2align 4
   7647         .section        __patchable_function_entries,"awo",@progbits,rtl_eriar_cond_check
   7648         .align 8
   7649         .quad   .LPFE44
   7650         .text
   7651 .LPFE44:
   7652         nop
   7653         nop
   7654         nop
   7655         nop
-------------------------------------------------------------------------------------

The call of the function is the actual call:

-------------------------------------------------------------------------------------
  39334 .LBE11119:
  39335         .loc 1 3112 2 is_stmt 1 view .LVU10399
  39336         xorl    %edx, %edx
  39337         movl    $64556, %esi
  39338         movq    %r13, %rdi
  39339         call    r8168_mac_ocp_write
-------------------------------------------------------------------------------------

The command used for generating the assembly was taken from .o.cmd file and
added -save-temps as the only change:

$ gcc -Wp,-MMD,drivers/net/ethernet/realtek/.r8169_main.o.d -save-temps -nostdinc \
-I./arch/x86/include -I./arch/x86/include/generated  -I./include -I./arch/x86/include/uapi \
-I./arch/x86/include/generated/uapi -I./include/uapi -I./include/generated/uapi \
-include ./include/linux/compiler-version.h -include ./include/linux/kconfig.h \
-include ./include/linux/compiler_types.h -D__KERNEL__ -fmacro-prefix-map=./= \
-std=gnu11 -fshort-wchar -funsigned-char -fno-common -fno-PIE -fno-strict-aliasing \
-mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx -fcf-protection=none -m64 -falign-jumps=1 \
-falign-loops=1 -mno-80387 -mno-fp-ret-in-387 -mpreferred-stack-boundary=3 -mskip-rax-setup \
-mtune=generic -mno-red-zone -mcmodel=kernel -Wno-sign-compare -fno-asynchronous-unwind-tables \
-mindirect-branch=thunk-extern -mindirect-branch-register -mindirect-branch-cs-prefix \
-mfunction-return=thunk-extern -fno-jump-tables -mharden-sls=all -fpatchable-function-entry=16,16 \
-fno-delete-null-pointer-checks -O2 -fno-allow-store-data-races -fstack-protector-strong \
-fno-omit-frame-pointer -fno-optimize-sibling-calls -fno-stack-clash-protection \
-fzero-call-used-regs=used-gpr -pg -mrecord-mcount -mfentry -DCC_USING_FENTRY -falign-functions=16 \
-fno-strict-overflow -fno-stack-check -fconserve-stack -Wall -Wundef -Werror=implicit-function-declaration \
-Werror=implicit-int -Werror=return-type -Werror=strict-prototypes -Wno-format-security -Wno-trigraphs \
-Wno-frame-address -Wno-address-of-packed-member -Wframe-larger-than=1024 -Wno-main \
-Wno-unused-but-set-variable -Wno-unused-const-variable -Wvla -Wno-pointer-sign -Wcast-function-type \
-Wno-array-bounds -Wno-alloc-size-larger-than -Wimplicit-fallthrough=5 -Werror=date-time \
-Werror=incompatible-pointer-types -Werror=designated-init -Wenum-conversion -Wno-unused-but-set-variable \
-Wno-unused-const-variable -Wno-restrict -Wno-packed-not-aligned -Wno-format-overflow -Wno-format-truncation \
-Wno-stringop-overflow -Wno-stringop-truncation -Wno-missing-field-initializers -Wno-type-limits \
-Wno-shift-negative-value -Wno-maybe-uninitialized -Wno-sign-compare -g -gdwarf-5  -fsanitize=bounds-strict \
-fsanitize=shift -fsanitize=bool -fsanitize=enum  -DMODULE  -DKBUILD_BASENAME='"r8169_main"' \
-DKBUILD_MODNAME='"r8169"' -D__KBUILD_MODNAME=kmod_r8169 -c -o drivers/net/ethernet/realtek/r8169_main.o \
drivers/net/ethernet/realtek/r8169_main.c   ; ./tools/objtool/objtool --hacks=jump_label --hacks=noinstr \
--hacks=skylake --retpoline --rethunk --sls --stackval --static-call --uaccess --prefix=16 \
   --module drivers/net/ethernet/realtek/r8169_main.o

This is a build against net-next. Please find the attached config.

RTL_(R|W)(8|16|32) family are obviously macros:

#define RTL_W32(tp, reg, val32)	writel((val32), tp->mmio_addr + (reg))

This is the writel():

   7590 #APP
   7591 # 66 "./arch/x86/include/asm/io.h" 1
   7592         movl %ebx,176(%rax)
   7593 # 0 "" 2
   7594 .LVL496:
   7595 #NO_APP

writel() looks optimal.

Hope this helps.

> Until there are benchmarks: NACK.

That means I've got a Reviewed-by: Jacob Keller and two NACKs.

I am voted out. :-/

I suppose one NACK from a maintainer is sufficient to halt the patch.

Going back to the documentation, the drawing board, and of course, the Source. :-)

Best regards,
Mirsad Todorovac

Download attachment "net-next-config.xz" of type "application/x-xz" (58464 bytes)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ