lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <814b5598-5284-9558-8f56-12a6f7a67187@linux.dev>
Date: Mon, 9 Oct 2023 11:07:47 +0800
From: Yajun Deng <yajun.deng@...ux.dev>
To: Eric Dumazet <edumazet@...gle.com>, rostedt@...dmis.org,
 mhiramat@...nel.org, dennis@...nel.org, tj@...nel.org, cl@...ux.com,
 mark.rutland@....com
Cc: davem@...emloft.net, kuba@...nel.org, pabeni@...hat.com,
 netdev@...r.kernel.org, linux-kernel@...r.kernel.org,
 Alexander Lobakin <aleksander.lobakin@...el.com>,
 linux-trace-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH net-next v7] net/core: Introduce netdev_core_stats_inc()


On 2023/10/8 17:12, Yajun Deng wrote:
>
> On 2023/10/8 16:53, Eric Dumazet wrote:
>> On Sun, Oct 8, 2023 at 10:44 AM Yajun Deng <yajun.deng@...ux.dev> wrote:
>>>
>>> On 2023/10/8 15:18, Eric Dumazet wrote:
>>>> On Sun, Oct 8, 2023 at 9:00 AM Yajun Deng <yajun.deng@...ux.dev> 
>>>> wrote:
>>>>> On 2023/10/8 14:45, Eric Dumazet wrote:
>>>>>> On Sat, Oct 7, 2023 at 8:34 AM Yajun Deng <yajun.deng@...ux.dev> 
>>>>>> wrote:
>>>>>>> On 2023/10/7 13:29, Eric Dumazet wrote:
>>>>>>>> On Sat, Oct 7, 2023 at 7:06 AM Yajun Deng 
>>>>>>>> <yajun.deng@...ux.dev> wrote:
>>>>>>>>> Although there is a kfree_skb_reason() helper function that 
>>>>>>>>> can be used to
>>>>>>>>> find the reason why this skb is dropped, but most callers 
>>>>>>>>> didn't increase
>>>>>>>>> one of rx_dropped, tx_dropped, rx_nohandler and 
>>>>>>>>> rx_otherhost_dropped.
>>>>>>>>>
>>>>>>>> ...
>>>>>>>>
>>>>>>>>> +
>>>>>>>>> +void netdev_core_stats_inc(struct net_device *dev, u32 offset)
>>>>>>>>> +{
>>>>>>>>> +       /* This READ_ONCE() pairs with the write in 
>>>>>>>>> netdev_core_stats_alloc() */
>>>>>>>>> +       struct net_device_core_stats __percpu *p = 
>>>>>>>>> READ_ONCE(dev->core_stats);
>>>>>>>>> +       unsigned long *field;
>>>>>>>>> +
>>>>>>>>> +       if (unlikely(!p))
>>>>>>>>> +               p = netdev_core_stats_alloc(dev);
>>>>>>>>> +
>>>>>>>>> +       if (p) {
>>>>>>>>> +               field = (unsigned long *)((void 
>>>>>>>>> *)this_cpu_ptr(p) + offset);
>>>>>>>>> +               WRITE_ONCE(*field, READ_ONCE(*field) + 1);
>>>>>>>> This is broken...
>>>>>>>>
>>>>>>>> As I explained earlier, dev_core_stats_xxxx(dev) can be called 
>>>>>>>> from
>>>>>>>> many different contexts:
>>>>>>>>
>>>>>>>> 1) process contexts, where preemption and migration are allowed.
>>>>>>>> 2) interrupt contexts.
>>>>>>>>
>>>>>>>> Adding WRITE_ONCE()/READ_ONCE() is not solving potential races.
>>>>>>>>
>>>>>>>> I _think_ I already gave you how to deal with this ?
>>>>>>> Yes, I replied in v6.
>>>>>>>
>>>>>>> https://lore.kernel.org/all/e25b5f3c-bd97-56f0-de86-b93a3172870d@linux.dev/ 
>>>>>>>
>>>>>>>
>>>>>>>> Please try instead:
>>>>>>>>
>>>>>>>> +void netdev_core_stats_inc(struct net_device *dev, u32 offset)
>>>>>>>> +{
>>>>>>>> +       /* This READ_ONCE() pairs with the write in 
>>>>>>>> netdev_core_stats_alloc() */
>>>>>>>> +       struct net_device_core_stats __percpu *p = 
>>>>>>>> READ_ONCE(dev->core_stats);
>>>>>>>> +       unsigned long __percpu *field;
>>>>>>>> +
>>>>>>>> +       if (unlikely(!p)) {
>>>>>>>> +               p = netdev_core_stats_alloc(dev);
>>>>>>>> +               if (!p)
>>>>>>>> +                       return;
>>>>>>>> +       }
>>>>>>>> +       field = (__force unsigned long __percpu *)((__force 
>>>>>>>> void *)p + offset);
>>>>>>>> +       this_cpu_inc(*field);
>>>>>>>> +}
>>>>>>> This wouldn't trace anything even the rx_dropped is in 
>>>>>>> increasing. It
>>>>>>> needs to add an extra operation, such as:
>>>>>> I honestly do not know what you are talking about.
>>>>>>
>>>>>> Have you even tried to change your patch to use
>>>>>>
>>>>>> field = (__force unsigned long __percpu *)((__force void *)p + 
>>>>>> offset);
>>>>>> this_cpu_inc(*field);
>>>>> Yes, I tested this code. But the following couldn't show anything 
>>>>> even
>>>>> if the rx_dropped is increasing.
>>>>>
>>>>> 'sudo python3 /usr/share/bcc/tools/trace netdev_core_stats_inc'
>>>> Well, I am not sure about this, "bpftrace" worked for me.
>>>>
>>>> Make sure your toolchain generates something that looks like what I 
>>>> got:
>>>>
>>>> 000000000000ef20 <netdev_core_stats_inc>:
>>>>       ef20: f3 0f 1e fa          endbr64
>>>>       ef24: e8 00 00 00 00        call   ef29 
>>>> <netdev_core_stats_inc+0x9>
>>>> ef25: R_X86_64_PLT32 __fentry__-0x4
>>>>       ef29: 55                    push   %rbp
>>>>       ef2a: 48 89 e5              mov    %rsp,%rbp
>>>>       ef2d: 53                    push   %rbx
>>>>       ef2e: 89 f3                mov    %esi,%ebx
>>>>       ef30: 48 8b 87 f0 01 00 00 mov    0x1f0(%rdi),%rax
>>>>       ef37: 48 85 c0              test   %rax,%rax
>>>>       ef3a: 74 0b                je     ef47 
>>>> <netdev_core_stats_inc+0x27>
>>>>       ef3c: 89 d9                mov    %ebx,%ecx
>>>>       ef3e: 65 48 ff 04 08        incq   %gs:(%rax,%rcx,1)
>>>>       ef43: 5b                    pop    %rbx
>>>>       ef44: 5d                    pop    %rbp
>>>>       ef45: c3                    ret
>>>>       ef46: cc                    int3
>>>>       ef47: e8 00 00 00 00        call   ef4c 
>>>> <netdev_core_stats_inc+0x2c>
>>>> ef48: R_X86_64_PLT32 .text.unlikely.+0x13c
>>>>       ef4c: 48 85 c0              test   %rax,%rax
>>>>       ef4f: 75 eb                jne    ef3c 
>>>> <netdev_core_stats_inc+0x1c>
>>>>       ef51: eb f0                jmp    ef43 
>>>> <netdev_core_stats_inc+0x23>
>>>>       ef53: 66 66 66 66 2e 0f 1f data16 data16 data16 cs nopw 
>>>> 0x0(%rax,%rax,1)
>>>>       ef5a: 84 00 00 00 00 00
>>>
>>> I'll share some I can see it.
>>>
>>> 1.
>>>
>>> objdump -D vmlinux
>>>
>>> ffffffff81b2f170 <netdev_core_stats_inc>:
>>> ffffffff81b2f170:    e8 8b ea 55 ff           callq ffffffff8108dc00
>>> <__fentry__>
>>> ffffffff81b2f175:    55                       push   %rbp
>>> ffffffff81b2f176:    48 89 e5                 mov    %rsp,%rbp
>>> ffffffff81b2f179:    48 83 ec 08              sub    $0x8,%rsp
>>> ffffffff81b2f17d:    48 8b 87 e8 01 00 00     mov 0x1e8(%rdi),%rax
>>> ffffffff81b2f184:    48 85 c0                 test   %rax,%rax
>>> ffffffff81b2f187:    74 0d                    je ffffffff81b2f196
>>> <netdev_core_stats_inc+0x26>
>>> ffffffff81b2f189:    89 f6                    mov    %esi,%esi
>>> ffffffff81b2f18b:    65 48 ff 04 30           incq %gs:(%rax,%rsi,1)
>>> ffffffff81b2f190:    c9                       leaveq
>>> ffffffff81b2f191:    e9 aa 31 6d 00           jmpq ffffffff82202340
>>> <__x86_return_thunk>
>>> ffffffff81b2f196:    89 75 fc                 mov %esi,-0x4(%rbp)
>>> ffffffff81b2f199:    e8 82 ff ff ff           callq ffffffff81b2f120
>>> <netdev_core_stats_alloc>
>>> ffffffff81b2f19e:    8b 75 fc                 mov -0x4(%rbp),%esi
>>> ffffffff81b2f1a1:    48 85 c0                 test   %rax,%rax
>>> ffffffff81b2f1a4:    75 e3                    jne ffffffff81b2f189
>>> <netdev_core_stats_inc+0x19>
>>> ffffffff81b2f1a6:    c9                       leaveq
>>> ffffffff81b2f1a7:    e9 94 31 6d 00           jmpq ffffffff82202340
>>> <__x86_return_thunk>
>>> ffffffff81b2f1ac:    0f 1f 40 00              nopl   0x0(%rax)
>>>
>>>
>>> 2.
>>>
>>> sudo cat /proc/kallsyms | grep netdev_core_stats_inc
>>>
>>> ffffffff9c72f120 T netdev_core_stats_inc
>>> ffffffff9ca2676c t netdev_core_stats_inc.cold
>>> ffffffff9d5235e0 r __ksymtab_netdev_core_stats_inc
>>>
>>>
>>> 3.
>>>
>>> ➜  ~ ifconfig enp34s0f0
>>> enp34s0f0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
>>>           inet 10.10.30.88  netmask 255.255.255.0  broadcast 
>>> 10.10.30.255
>>>           inet6 fe80::6037:806c:14b6:f1ca  prefixlen 64 scopeid 
>>> 0x20<link>
>>>           ether 04:d4:c4:5c:81:42  txqueuelen 1000  (Ethernet)
>>>           RX packets 29024  bytes 3118278 (3.1 MB)
>>>           RX errors 0  dropped 794  overruns 0  frame 0
>>>           TX packets 16961  bytes 2662290 (2.6 MB)
>>>           TX errors 0  dropped 0 overruns 0  carrier 0 collisions 0
>>>           device interrupt 29  memory 0x39fff4000000-39fff47fffff
>>>
>>> ➜  ~ ifconfig enp34s0f0
>>> enp34s0f0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
>>>           inet 10.10.30.88  netmask 255.255.255.0  broadcast 
>>> 10.10.30.255
>>>           inet6 fe80::6037:806c:14b6:f1ca  prefixlen 64 scopeid 
>>> 0x20<link>
>>>           ether 04:d4:c4:5c:81:42  txqueuelen 1000  (Ethernet)
>>>           RX packets 29272  bytes 3148997 (3.1 MB)
>>>           RX errors 0  dropped 798  overruns 0  frame 0
>>>           TX packets 17098  bytes 2683547 (2.6 MB)
>>>           TX errors 0  dropped 0 overruns 0  carrier 0 collisions 0
>>>           device interrupt 29  memory 0x39fff4000000-39fff47fffff
>>>
>>>
>>> The rx_dropped is increasing.
>>>
>>>
>>> 4.
>>>
>>> sudo python3 /usr/share/bcc/tools/trace netdev_core_stats_inc
>>>
>>> TIME     PID     TID     COMM            FUNC
>>>
>>> (Empty, I didn't see anything.)
>>>
>>>
>>> 5.
>>>
>>> sudo trace-cmd record -p function -l netdev_core_stats_inc
>>>
>>> sudo trace-cmd report
>>>
>>> (Empty, I didn't see anything.)
>>>
>>>
>>> If I add a 'pr_info("\n");'   like:
>>>
>>> +      pr_info("\n");
>>>           field = (__force unsigned long __percpu *)((__force void 
>>> *)p +
>>> offset);
>>>           this_cpu_inc(*field);
>>>
>>>
>>> Everything is OK. The 'pr_info("\n");' can be changed to anything else,
>>> but not
>>>
>>> without it.
>> This seems to be a bug that has nothing to do with the patch.
>>
>> Try getting help from Steven maybe.
>
>
> Hi Steven,
>
> Need your help.
>
> 1. The following code wouldn't trace anything by the command 'sudo 
> python3 /usr/share/bcc/tools/trace netdev_core_stats_inc'
>
> +void netdev_core_stats_inc(struct net_device *dev, u32 offset)
> +{
> +       /* This READ_ONCE() pairs with the write in 
> netdev_core_stats_alloc() */
> +       struct net_device_core_stats __percpu *p = 
> READ_ONCE(dev->core_stats);
> +       unsigned long __percpu *field;
> +
> +       if (unlikely(!p)) {
> +               p = netdev_core_stats_alloc(dev);
> +               if (!p)
> +                       return;
> +       }
> +       field = (__force unsigned long __percpu *)((__force void *)p + 
> offset);
> +       this_cpu_inc(*field);
> +}
> +EXPORT_SYMBOL_GPL(netdev_core_stats_inc);
>
> 2. If I add a 'pr_info("\n");', it would be fine. The 'pr_info("\n");' 
> can be changed to others.
>
> +void netdev_core_stats_inc(struct net_device *dev, u32 offset)
> +{
> +       /* This READ_ONCE() pairs with the write in 
> netdev_core_stats_alloc() */
> +       struct net_device_core_stats __percpu *p = 
> READ_ONCE(dev->core_stats);
> +       unsigned long __percpu *field;
> +
> +       if (unlikely(!p)) {
> +               p = netdev_core_stats_alloc(dev);
> +               if (!p)
> +                       return;
> +       }
> +    pr_info("\n");
> +       field = (__force unsigned long __percpu *)((__force void *)p + 
> offset);
> +       this_cpu_inc(*field);
> +}
> +EXPORT_SYMBOL_GPL(netdev_core_stats_inc);
>
> I don't know why we need to add something in netdev_core_stats_inc, 
> the trace will be fine.
>

I think I found something different in the assembly code.

this_cpu_read:
ffffffff81b2f120 <netdev_core_stats_inc>:
ffffffff81b2f120:       e8 db ea 55 ff          callq ffffffff8108dc00 
<__fentry__>
ffffffff81b2f125:       55                      push   %rbp
ffffffff81b2f126:       48 89 e5                mov    %rsp,%rbp
ffffffff81b2f129:       48 83 ec 08             sub    $0x8,%rsp
ffffffff81b2f12d:       48 8b 87 e8 01 00 00    mov 0x1e8(%rdi),%rax
ffffffff81b2f134:       48 85 c0                test   %rax,%rax
ffffffff81b2f137:       0f 84 2f 76 2f 00       je ffffffff81e2676c 
<netdev_core_stats_inc.cold>
ffffffff81b2f13d:       89 f6                   mov    %esi,%esi
ffffffff81b2f13f:       65 48 8b 04 30          mov %gs:(%rax,%rsi,1),%rax
ffffffff81b2f144:       c9                      leaveq
ffffffff81b2f145:       e9 f6 31 6d 00          jmpq ffffffff82202340 
<__x86_return_thunk>
ffffffff81b2f14a:       66 0f 1f 44 00 00       nopw 0x0(%rax,%rax,1)


this_cpu_write:
ffffffff81b2f120 <netdev_core_stats_inc>:
ffffffff81b2f120:       e8 db ea 55 ff          callq ffffffff8108dc00 
<__fentry__>
ffffffff81b2f125:       55                      push   %rbp
ffffffff81b2f126:       48 89 e5                mov    %rsp,%rbp
ffffffff81b2f129:       48 83 ec 08             sub    $0x8,%rsp
ffffffff81b2f12d:       48 8b 87 e8 01 00 00    mov 0x1e8(%rdi),%rax
ffffffff81b2f134:       48 85 c0                test   %rax,%rax
ffffffff81b2f137:       0f 84 2f 76 2f 00       je ffffffff81e2676c 
<netdev_core_stats_inc.cold>
ffffffff81b2f13d:       89 f6                   mov    %esi,%esi
ffffffff81b2f13f:       65 48 c7 04 30 01 00    movq $0x1,%gs:(%rax,%rsi,1)
ffffffff81b2f146:       00 00
ffffffff81b2f148:       c9                      leaveq
ffffffff81b2f149:       e9 f2 31 6d 00          jmpq ffffffff82202340 
<__x86_return_thunk>
ffffffff81b2f14e:       66 90                   xchg   %ax,%ax


this_cpu_read + this_cpu_write:
ffffffff81b2f0e0 <netdev_core_stats_inc>:
ffffffff81b2f0e0:       e8 1b eb 55 ff          callq ffffffff8108dc00 
<__fentry__>
ffffffff81b2f0e5:       55                      push   %rbp
ffffffff81b2f0e6:       48 89 e5                mov    %rsp,%rbp
ffffffff81b2f0e9:       41 55                   push   %r13
ffffffff81b2f0eb:       41 54                   push   %r12
ffffffff81b2f0ed:       41 89 f4                mov    %esi,%r12d
ffffffff81b2f0f0:       53                      push   %rbx
ffffffff81b2f0f1:       48 8b 9f e8 01 00 00    mov 0x1e8(%rdi),%rbx
ffffffff81b2f0f8:       48 85 db                test   %rbx,%rbx
ffffffff81b2f0fb:       0f 84 1f 76 2f 00       je ffffffff81e26720 
<netdev_core_stats_inc.cold>
ffffffff81b2f101:       44 89 e6                mov    %r12d,%esi
ffffffff81b2f104:       48 01 de                add    %rbx,%rsi
ffffffff81b2f107:       65 48 8b 06             mov %gs:(%rsi),%rax
ffffffff81b2f10b:       65 48 c7 06 01 00 00    movq $0x1,%gs:(%rsi)
ffffffff81b2f112:       00
ffffffff81b2f113:       5b                      pop    %rbx
ffffffff81b2f114:       41 5c                   pop    %r12
ffffffff81b2f116:       41 5d                   pop    %r13
ffffffff81b2f118:       5d                      pop    %rbp
ffffffff81b2f119:       e9 22 32 6d 00          jmpq ffffffff82202340 
<__x86_return_thunk>
ffffffff81b2f11e:       66 90                   xchg   %ax,%ax


this_cpu_inc:
ffffffff81b2f120 <netdev_core_stats_inc>:
ffffffff81b2f120:       e8 db ea 55 ff          callq ffffffff8108dc00 
<__fentry__>
ffffffff81b2f125:       55                      push   %rbp
ffffffff81b2f126:       48 89 e5                mov    %rsp,%rbp
ffffffff81b2f129:       48 83 ec 08             sub    $0x8,%rsp
ffffffff81b2f12d:       48 8b 87 e8 01 00 00    mov 0x1e8(%rdi),%rax
ffffffff81b2f134:       48 85 c0                test   %rax,%rax
ffffffff81b2f137:       0f 84 2f 76 2f 00       je ffffffff81e2676c 
<netdev_core_stats_inc.cold>
ffffffff81b2f13d:       89 f6                   mov    %esi,%esi
ffffffff81b2f13f:       65 48 ff 04 30          incq %gs:(%rax,%rsi,1)
ffffffff81b2f144:       c9                      leaveq
ffffffff81b2f145:       e9 f6 31 6d 00          jmpq ffffffff82202340 
<__x86_return_thunk>
ffffffff81b2f14a:       66 0f 1f 44 00 00       nopw 0x0(%rax,%rax,1)


pr_info + this_cpu_inc:
ffffffff81e26720 <netdev_core_stats_inc>:
ffffffff81e26720:       e8 db 74 26 ff          callq ffffffff8108dc00 
<__fentry__>
ffffffff81e26725:       55                      push   %rbp
ffffffff81e26726:       48 89 e5                mov    %rsp,%rbp
ffffffff81e26729:       41 55                   push   %r13
ffffffff81e2672b:       41 89 f5                mov    %esi,%r13d
ffffffff81e2672e:       41 54                   push   %r12
ffffffff81e26730:       53                      push   %rbx
ffffffff81e26731:       48 8b 9f e8 01 00 00    mov 0x1e8(%rdi),%rbx
ffffffff81e26738:       48 85 db                test   %rbx,%rbx
ffffffff81e2673b:       75 43                   jne ffffffff81e26780 
<netdev_core_stats_inc+0x60>
ffffffff81e2673d:       49 89 fc                mov    %rdi,%r12
ffffffff81e26740:       ba 20 2a 00 00          mov $0x2a20,%edx
ffffffff81e26745:       bf 20 00 00 00          mov    $0x20,%edi
ffffffff81e2674a:       be 20 00 00 00          mov    $0x20,%esi
ffffffff81e2674f:       e8 0c 36 4f ff          callq ffffffff81319d60 
<__alloc_percpu_gfp>
ffffffff81e26754:       48 89 c7                mov    %rax,%rdi
ffffffff81e26757:       48 85 c0                test   %rax,%rax
ffffffff81e2675a:       74 17                   je ffffffff81e26773 
<netdev_core_stats_inc+0x53>
ffffffff81e2675c:       48 89 d8                mov    %rbx,%rax
ffffffff81e2675f:       f0 49 0f b1 bc 24 e8    lock cmpxchg 
%rdi,0x1e8(%r12)
ffffffff81e26766:       01 00 00
ffffffff81e26769:       48 85 c0                test   %rax,%rax
ffffffff81e2676c:       74 05                   je ffffffff81e26773 
<netdev_core_stats_inc+0x53>
ffffffff81e2676e:       e8 1d 25 4f ff          callq ffffffff81318c90 
<free_percpu>
ffffffff81e26773:       49 8b 9c 24 e8 01 00    mov 0x1e8(%r12),%rbx
ffffffff81e2677a:       00
ffffffff81e2677b:       48 85 db                test   %rbx,%rbx
ffffffff81e2677e:       74 11                   je ffffffff81e26791 
<netdev_core_stats_inc+0x71>
ffffffff81e26780:       48 c7 c7 80 cd 77 82    mov $0xffffffff8277cd80,%rdi
ffffffff81e26787:       e8 6e 94 f6 ff          callq ffffffff81d8fbfa 
<_printk>
ffffffff81e2678c:       65 4a ff 04 2b          incq %gs:(%rbx,%r13,1)
ffffffff81e26791:       5b                      pop    %rbx
ffffffff81e26792:       41 5c                   pop    %r12
ffffffff81e26794:       41 5d                   pop    %r13
ffffffff81e26796:       5d                      pop    %rbp
ffffffff81e26797:       e9 a4 bb 3d 00          jmpq ffffffff82202340 
<__x86_return_thunk>


'this_cpu_read + this_cpu_write' and 'pr_info + this_cpu_inc' will make 
the trace work well.

They all have 'pop' instructions in them. This may be the key to making 
the trace work well.

Hi all,

I need your help on percpu and ftrace.


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ