Message-ID: <CANn89iLPYPPHCSLiXmwP9G+Gfa8J=w3aD-HtPVzhjDeQvO_Z9g@mail.gmail.com>
Date: Wed, 24 Jul 2024 10:54:08 +0200
From: Eric Dumazet <edumazet@...gle.com>
To: Jason Xing <kerneljasonxing@...il.com>
Cc: davem@...emloft.net, kuba@...nel.org, pabeni@...hat.com, horms@...nel.org,
netdev@...r.kernel.org, Jason Xing <kernelxing@...cent.com>
Subject: Re: [RFC PATCH net-next] net: add an entry for CONFIG_NET_RX_BUSY_POLL
On Wed, Jul 24, 2024 at 9:33 AM Jason Xing <kerneljasonxing@...il.com> wrote:
>
> On Wed, Jul 24, 2024 at 8:38 AM Jason Xing <kerneljasonxing@...il.com> wrote:
> >
> > On Wed, Jul 24, 2024 at 12:28 AM Eric Dumazet <edumazet@...gle.com> wrote:
> > >
> > > On Tue, Jul 23, 2024 at 6:01 PM Jason Xing <kerneljasonxing@...il.com> wrote:
> > > >
> > > > On Tue, Jul 23, 2024 at 11:26 PM Eric Dumazet <edumazet@...gle.com> wrote:
> > > > >
> > > > > On Tue, Jul 23, 2024 at 5:13 PM Jason Xing <kerneljasonxing@...il.com> wrote:
> > > > > >
> > > > > > On Tue, Jul 23, 2024 at 11:09 PM Jason Xing <kerneljasonxing@...il.com> wrote:
> > > > > > >
> > > > > > > On Tue, Jul 23, 2024 at 10:57 PM Eric Dumazet <edumazet@...gle.com> wrote:
> > > > > > > >
> > > > > > > > On Tue, Jul 23, 2024 at 3:57 PM Jason Xing <kerneljasonxing@...il.com> wrote:
> > > > > > > > >
> > > > > > > > > From: Jason Xing <kernelxing@...cent.com>
> > > > > > > > >
> > > > > > > > > When I was doing performance tests on unix_poll(), I found out that
> > > > > > > > > accessing sk->sk_ll_usec when calling sock_poll()->sk_can_busy_loop()
> > > > > > > > > takes too much time, which causes around 16% degradation. So I
> > > > > > > > > decided to turn off this config, which apparently could not be done
> > > > > > > > > before this patch.
> > > > > > > >
> > > > > > > > Too many CONFIG_ options, distros will enable it anyway.
> > > > > > > >
> > > > > > > > In my builds, offset of sk_ll_usec is 0xe8.
> > > > > > > >
> > > > > > > > Are you using some debug options or an old tree ?
> > > > > >
> > > > > > I forgot to say: I'm running the latest kernel, which I pulled around
> > > > > > two hours ago. Whatever configs I use, with or without debug options,
> > > > > > I can still reproduce it.
> > > > >
> > > > > Ok, please post :
> > > > >
> > > > > pahole --hex -C sock vmlinux
> > > >
> > > > 1) Enable the config:
> > > > $ pahole --hex -C sock vmlinux
> > > > struct sock {
> > > >         struct sock_common     __sk_common;              /*     0  0x88 */
> > > >         /* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
> > > >         __u8                   __cacheline_group_begin__sock_write_rx[0]; /*  0x88     0 */
> > > >         atomic_t               sk_drops;                 /*  0x88   0x4 */
> > > >         __s32                  sk_peek_off;              /*  0x8c   0x4 */
> > > >         struct sk_buff_head    sk_error_queue;           /*  0x90  0x18 */
> > > >         struct sk_buff_head    sk_receive_queue;         /*  0xa8  0x18 */
> > > >         /* --- cacheline 3 boundary (192 bytes) --- */
> > > >         struct {
> > > >                 atomic_t       rmem_alloc;               /*  0xc0   0x4 */
> > > >                 int            len;                      /*  0xc4   0x4 */
> > > >                 struct sk_buff * head;                   /*  0xc8   0x8 */
> > > >                 struct sk_buff * tail;                   /*  0xd0   0x8 */
> > > >         } sk_backlog;                                    /*  0xc0  0x18 */
> > > >         __u8                   __cacheline_group_end__sock_write_rx[0];   /*  0xd8     0 */
> > > >         __u8                   __cacheline_group_begin__sock_read_rx[0];  /*  0xd8     0 */
> > > >         struct dst_entry *     sk_rx_dst;                /*  0xd8   0x8 */
> > > >         int                    sk_rx_dst_ifindex;        /*  0xe0   0x4 */
> > > >         u32                    sk_rx_dst_cookie;         /*  0xe4   0x4 */
> > > >         unsigned int           sk_ll_usec;               /*  0xe8   0x4 */
> > >
> > > See here ? offset of sk_ll_usec is 0xe8, not 0x104 as you posted.
> >
> > Oh, sorry, my fault. I only remembered that the perf record was taken
> > on an old tree, before you optimised the layout of struct sock. I then
> > found that if I disable the config on the latest tree running in my
> > virtual machine, the result is better. Let me find a physical server to
> > run the latest kernel on, and I will come back here with more accurate
> > 'perf record' information.
>
> Now I'm back. The perf output when running the latest kernel on the
> virtual server looks like this:
>
>        │
>        │     static inline bool sk_can_busy_loop(const struct sock *sk)
>        │     {
>        │             return READ_ONCE(sk->sk_ll_usec) && !signal_pending(current);
>        │       mov     0xe8(%rdx),%ebp
>  55.71 │       test    %ebp,%ebp
>        │     ↓ jne     62
>        │     sock_poll():
>
> The command I used:
> perf record -g -e cycles:k -F 999 -o tk5_select10.data -- \
>     ./bin-x86_64/select -E -C 200 -L -S -W -M -N "select_10" -n 100 -B 500
>
> When running on the physical server, the perf output looks like this:
>
>        │     ↓ je      e1
>        │       mov     0x18(%r13),%rdx
>   0.03 │       mov     %rsi,%rbx
>   0.00 │       mov     %rdi,%r12
>        │       mov     0xe8(%rdx),%r14d
>  26.48 │       test    %r14d,%r14d
>
> An interesting thing I found is that the delta on the physical server is
> smaller than on the virtual server (time with the original kernel vs. with
> the access to sk_ll_usec removed):
> physical: 2.26 vs 2.08 (delta is 8.4%)
> virtual:  2.45 vs 2.05 (delta is ~16%)
>
> I'm still confused about how reading sk_ll_usec can cause such a
> performance degradation.
>
> Eric, may I ask if you have more ideas/suggestions about this one?
>
We do not micro-optimize based on 'perf' reports, because of artifacts.

Please run a full workload, sending/receiving 1,000,000 messages, and report
the time difference for the whole workload, not for one precise function.

Again, I am guessing there will be no difference, because the cache
line is needed anyway.

Please make sure to run the latest kernels; this will avoid you
discovering issues that have already been fixed.
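
[Editor's note: below is a minimal sketch of the kind of whole-workload
measurement suggested above: timing 1,000,000 ping-pong messages over an
AF_UNIX socketpair, with a poll() before each read so the sock_poll()
path is exercised on every iteration. The message size, the use of
poll() rather than select(), and the file/variable names are
illustrative assumptions, not the benchmark used in this thread.]

/* unix_poll_bench.c - hypothetical whole-workload test, see note above. */
#include <poll.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define MSGS     1000000   /* message count suggested in the thread */
#define MSG_SIZE 64        /* arbitrary small payload (assumption) */

int main(void)
{
	int sv[2];
	char buf[MSG_SIZE] = { 0 };
	struct timespec t0, t1;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
		perror("socketpair");
		return 1;
	}

	pid_t pid = fork();
	if (pid < 0) {
		perror("fork");
		return 1;
	}

	if (pid == 0) {
		/* Child: echo every message back to the parent. */
		close(sv[0]);
		for (long i = 0; i < MSGS; i++) {
			ssize_t n = read(sv[1], buf, MSG_SIZE);
			if (n <= 0 || write(sv[1], buf, n) != n)
				exit(1);
		}
		exit(0);
	}

	close(sv[1]);
	clock_gettime(CLOCK_MONOTONIC, &t0);

	for (long i = 0; i < MSGS; i++) {
		struct pollfd pfd = { .fd = sv[0], .events = POLLIN };

		if (write(sv[0], buf, MSG_SIZE) != MSG_SIZE)
			return 1;
		/* poll() enters sock_poll(), which calls sk_can_busy_loop(). */
		if (poll(&pfd, 1, -1) < 0)
			return 1;
		if (read(sv[0], buf, MSG_SIZE) <= 0)
			return 1;
	}

	clock_gettime(CLOCK_MONOTONIC, &t1);
	wait(NULL);

	double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("%d round trips in %.3f s (%.0f msgs/s)\n", MSGS, secs, MSGS / secs);
	return 0;
}

[Built with e.g. "gcc -O2 unix_poll_bench.c -o unix_poll_bench", the
printed wall-clock time can be compared across kernel builds with and
without the sk_ll_usec access, which is the whole-workload number
requested above.]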