Message-ID: <CANn89iLPYPPHCSLiXmwP9G+Gfa8J=w3aD-HtPVzhjDeQvO_Z9g@mail.gmail.com>
Date: Wed, 24 Jul 2024 10:54:08 +0200
From: Eric Dumazet <edumazet@...gle.com>
To: Jason Xing <kerneljasonxing@...il.com>
Cc: davem@...emloft.net, kuba@...nel.org, pabeni@...hat.com, horms@...nel.org,
netdev@...r.kernel.org, Jason Xing <kernelxing@...cent.com>
Subject: Re: [RFC PATCH net-next] net: add an entry for CONFIG_NET_RX_BUSY_POLL
On Wed, Jul 24, 2024 at 9:33 AM Jason Xing <kerneljasonxing@...il.com> wrote:
>
> On Wed, Jul 24, 2024 at 8:38 AM Jason Xing <kerneljasonxing@...il.com> wrote:
> >
> > On Wed, Jul 24, 2024 at 12:28 AM Eric Dumazet <edumazet@...gle.com> wrote:
> > >
> > > On Tue, Jul 23, 2024 at 6:01 PM Jason Xing <kerneljasonxing@...il.com> wrote:
> > > >
> > > > On Tue, Jul 23, 2024 at 11:26 PM Eric Dumazet <edumazet@...gle.com> wrote:
> > > > >
> > > > > On Tue, Jul 23, 2024 at 5:13 PM Jason Xing <kerneljasonxing@...il.com> wrote:
> > > > > >
> > > > > > On Tue, Jul 23, 2024 at 11:09 PM Jason Xing <kerneljasonxing@...il.com> wrote:
> > > > > > >
> > > > > > > On Tue, Jul 23, 2024 at 10:57 PM Eric Dumazet <edumazet@...gle.com> wrote:
> > > > > > > >
> > > > > > > > On Tue, Jul 23, 2024 at 3:57 PM Jason Xing <kerneljasonxing@...il.com> wrote:
> > > > > > > > >
> > > > > > > > > From: Jason Xing <kernelxing@...cent.com>
> > > > > > > > >
> > > > > > > > > When I was doing performance tests on unix_poll(), I found out that
> > > > > > > > > accessing sk->sk_ll_usec when calling sock_poll()->sk_can_busy_loop()
> > > > > > > > > takes too much time, which causes around 16% degradation. So I
> > > > > > > > > decided to turn off this config, which apparently could not be done
> > > > > > > > > before this patch.
> > > > > > > >
> > > > > > > > Too many CONFIG_ options, distros will enable it anyway.
> > > > > > > >
> > > > > > > > In my builds, offset of sk_ll_usec is 0xe8.
> > > > > > > >
> > > > > > > > Are you using some debug options or an old tree ?
> > > > > >
> > > > > > I forgot to say: I'm running the latest kernel, which I pulled around
> > > > > > two hours ago. Whatever configs I use, with or without debug options,
> > > > > > I can still reproduce it.
> > > > >
> > > > > Ok, please post :
> > > > >
> > > > > pahole --hex -C sock vmlinux
> > > >
> > > > 1) Enable the config:
> > > > $ pahole --hex -C sock vmlinux
> > > > struct sock {
> > > >         struct sock_common     __sk_common;              /*     0  0x88 */
> > > >         /* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
> > > >         __u8                   __cacheline_group_begin__sock_write_rx[0]; /*  0x88     0 */
> > > >         atomic_t               sk_drops;                 /*  0x88   0x4 */
> > > >         __s32                  sk_peek_off;              /*  0x8c   0x4 */
> > > >         struct sk_buff_head    sk_error_queue;           /*  0x90  0x18 */
> > > >         struct sk_buff_head    sk_receive_queue;         /*  0xa8  0x18 */
> > > >         /* --- cacheline 3 boundary (192 bytes) --- */
> > > >         struct {
> > > >                 atomic_t       rmem_alloc;               /*  0xc0   0x4 */
> > > >                 int            len;                      /*  0xc4   0x4 */
> > > >                 struct sk_buff * head;                   /*  0xc8   0x8 */
> > > >                 struct sk_buff * tail;                   /*  0xd0   0x8 */
> > > >         } sk_backlog;                                    /*  0xc0  0x18 */
> > > >         __u8                   __cacheline_group_end__sock_write_rx[0];   /*  0xd8     0 */
> > > >         __u8                   __cacheline_group_begin__sock_read_rx[0];  /*  0xd8     0 */
> > > >         struct dst_entry *     sk_rx_dst;                /*  0xd8   0x8 */
> > > >         int                    sk_rx_dst_ifindex;        /*  0xe0   0x4 */
> > > >         u32                    sk_rx_dst_cookie;         /*  0xe4   0x4 */
> > > >         unsigned int           sk_ll_usec;               /*  0xe8   0x4 */
> > >
> > > See here ? offset of sk_ll_usec is 0xe8, not 0x104 as you posted.
> >
> > Oh, sorry, my fault. I only remembered that the perf record was taken
> > on an old tree, before you optimised the layout of struct sock. I then
> > found that if I disable the config on the latest tree running in my
> > virtual machine, the result is better. Let me find a physical server to
> > run the latest kernel on, and I will come back here with more accurate
> > 'perf record' information.
>
> Now I'm back. The perf output when running the latest kernel on the
> virtual server looks like this:
>
>        │
>        │     static inline bool sk_can_busy_loop(const struct sock *sk)
>        │     {
>        │             return READ_ONCE(sk->sk_ll_usec) && !signal_pending(current);
>        │       mov     0xe8(%rdx),%ebp
>  55.71 │       test    %ebp,%ebp
>        │     ↓ jne     62
>        │     sock_poll():
>
> The command I used:
> perf record -g -e cycles:k -F 999 -o tk5_select10.data -- \
>     ./bin-x86_64/select -E -C 200 -L -S -W -M -N "select_10" -n 100 -B 500
>
> When running on the physical server, the perf output looks like this:
>
>        │     ↓ je      e1
>        │       mov     0x18(%r13),%rdx
>   0.03 │       mov     %rsi,%rbx
>   0.00 │       mov     %rdi,%r12
>        │       mov     0xe8(%rdx),%r14d
>  26.48 │       test    %r14d,%r14d
>
> An interesting thing I found is that the delta on the physical server is
> smaller than on the virtual server (time with the original kernel vs. with
> the access to sk_ll_usec removed):
> physical: 2.26 vs 2.08 (delta is 8.4%)
> virtual:  2.45 vs 2.05 (delta is ~16%)
>
> I'm still confused about how reading sk_ll_usec can cause such a
> performance degradation.
>
> Eric, may I ask if you have more ideas/suggestions about this one?
>
We do not micro-optimize based on 'perf' reports, because of artifacts.

Please run a full workload, sending/receiving 1,000,000 messages, and report
the time difference for the whole workload, not for one precise function.

Again, I am guessing there will be no difference, because the cache
line is needed anyway.

Please make sure to run the latest kernels; this will avoid you
discovering issues that have already been fixed.
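
[Editor's note: below is a minimal sketch of the kind of whole-workload
measurement suggested above: timing 1,000,000 ping-pong messages over an
AF_UNIX socketpair, with a poll() before each read so the sock_poll()
path is exercised on every iteration. The message size, the use of
poll() rather than select(), and the file/variable names are
illustrative assumptions, not the benchmark used in this thread.]

/* unix_poll_bench.c - hypothetical whole-workload test, see note above. */
#include <poll.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define MSGS     1000000   /* message count suggested in the thread */
#define MSG_SIZE 64        /* arbitrary small payload (assumption) */

int main(void)
{
	int sv[2];
	char buf[MSG_SIZE] = { 0 };
	struct timespec t0, t1;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
		perror("socketpair");
		return 1;
	}

	pid_t pid = fork();
	if (pid < 0) {
		perror("fork");
		return 1;
	}

	if (pid == 0) {
		/* Child: echo every message back to the parent. */
		close(sv[0]);
		for (long i = 0; i < MSGS; i++) {
			ssize_t n = read(sv[1], buf, MSG_SIZE);
			if (n <= 0 || write(sv[1], buf, n) != n)
				exit(1);
		}
		exit(0);
	}

	close(sv[1]);
	clock_gettime(CLOCK_MONOTONIC, &t0);

	for (long i = 0; i < MSGS; i++) {
		struct pollfd pfd = { .fd = sv[0], .events = POLLIN };

		if (write(sv[0], buf, MSG_SIZE) != MSG_SIZE)
			return 1;
		/* poll() enters sock_poll(), which calls sk_can_busy_loop(). */
		if (poll(&pfd, 1, -1) < 0)
			return 1;
		if (read(sv[0], buf, MSG_SIZE) <= 0)
			return 1;
	}

	clock_gettime(CLOCK_MONOTONIC, &t1);
	wait(NULL);

	double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("%d round trips in %.3f s (%.0f msgs/s)\n", MSGS, secs, MSGS / secs);
	return 0;
}

[Built with e.g. "gcc -O2 unix_poll_bench.c -o unix_poll_bench", the
printed wall-clock time can be compared across kernel builds with and
without the sk_ll_usec access, which is the whole-workload number
requested above.]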