Message-ID: <IA0PR11MB73557DEAB912737FD61D2873FC749@IA0PR11MB7355.namprd11.prod.outlook.com>
Date: Thu, 11 May 2023 09:26:46 +0000
From: "Zhang, Cathy" <cathy.zhang@...el.com>
To: Eric Dumazet <edumazet@...gle.com>
CC: Shakeel Butt <shakeelb@...gle.com>, Linux MM <linux-mm@...ck.org>, Cgroups
<cgroups@...r.kernel.org>, Paolo Abeni <pabeni@...hat.com>,
"davem@...emloft.net" <davem@...emloft.net>, "kuba@...nel.org"
<kuba@...nel.org>, "Brandeburg, Jesse" <jesse.brandeburg@...el.com>,
"Srinivas, Suresh" <suresh.srinivas@...el.com>, "Chen, Tim C"
<tim.c.chen@...el.com>, "You, Lizhen" <lizhen.you@...el.com>,
"eric.dumazet@...il.com" <eric.dumazet@...il.com>, "netdev@...r.kernel.org"
<netdev@...r.kernel.org>
Subject: RE: [PATCH net-next 1/2] net: Keep sk->sk_forward_alloc as a proper
size
> -----Original Message-----
> From: Eric Dumazet <edumazet@...gle.com>
> Sent: Thursday, May 11, 2023 3:51 PM
> To: Zhang, Cathy <cathy.zhang@...el.com>
> Cc: Shakeel Butt <shakeelb@...gle.com>; Linux MM <linux-mm@...ck.org>;
> Cgroups <cgroups@...r.kernel.org>; Paolo Abeni <pabeni@...hat.com>;
> davem@...emloft.net; kuba@...nel.org; Brandeburg, Jesse
> <jesse.brandeburg@...el.com>; Srinivas, Suresh
> <suresh.srinivas@...el.com>; Chen, Tim C <tim.c.chen@...el.com>; You,
> Lizhen <lizhen.you@...el.com>; eric.dumazet@...il.com;
> netdev@...r.kernel.org
> Subject: Re: [PATCH net-next 1/2] net: Keep sk->sk_forward_alloc as a proper
> size
>
> On Thu, May 11, 2023 at 9:00 AM Zhang, Cathy <cathy.zhang@...el.com>
> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: Zhang, Cathy
> > > Sent: Thursday, May 11, 2023 8:53 AM
> > > To: Shakeel Butt <shakeelb@...gle.com>
> > > Cc: Eric Dumazet <edumazet@...gle.com>; Linux MM <linux-
> > > mm@...ck.org>; Cgroups <cgroups@...r.kernel.org>; Paolo Abeni
> > > <pabeni@...hat.com>; davem@...emloft.net; kuba@...nel.org;
> > > Brandeburg, Jesse <jesse.brandeburg@...el.com>; Srinivas, Suresh
> > > <suresh.srinivas@...el.com>; Chen, Tim C <tim.c.chen@...el.com>;
> > > You, Lizhen <Lizhen.You@...el.com>; eric.dumazet@...il.com;
> > > netdev@...r.kernel.org
> > > Subject: RE: [PATCH net-next 1/2] net: Keep sk->sk_forward_alloc as
> > > a proper size
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Shakeel Butt <shakeelb@...gle.com>
> > > > Sent: Thursday, May 11, 2023 3:00 AM
> > > > To: Zhang, Cathy <cathy.zhang@...el.com>
> > > > Cc: Eric Dumazet <edumazet@...gle.com>; Linux MM <linux-
> > > > mm@...ck.org>; Cgroups <cgroups@...r.kernel.org>; Paolo Abeni
> > > > <pabeni@...hat.com>; davem@...emloft.net; kuba@...nel.org;
> > > Brandeburg,
> > > > Jesse <jesse.brandeburg@...el.com>; Srinivas, Suresh
> > > > <suresh.srinivas@...el.com>; Chen, Tim C <tim.c.chen@...el.com>;
> > > > You, Lizhen <lizhen.you@...el.com>; eric.dumazet@...il.com;
> > > > netdev@...r.kernel.org
> > > > Subject: Re: [PATCH net-next 1/2] net: Keep sk->sk_forward_alloc
> > > > as a proper size
> > > >
> > > > On Wed, May 10, 2023 at 9:09 AM Zhang, Cathy
> > > > <cathy.zhang@...el.com>
> > > > wrote:
> > > > >
> > > > >
> > > > [...]
> > > > > > > >
> > > > > > > > Have you tried to increase batch sizes ?
> > > > > > >
> > > > > > > I just picked up 256 and 1024 for a try, but no help; the
> > > > > > > overhead still exists.
> > > > > >
> > > > > > This makes no sense at all.
> > > > >
> > > > > Eric,
> > > > >
> > > > > I added a pr_info in try_charge_memcg() to print nr_pages
> > > > > whenever nr_pages >= MEMCG_CHARGE_BATCH. Apart from printing 64
> > > > > while the instances initialize, there is no other output during
> > > > > the run. That means nr_pages never exceeds 64, which I guess is
> > > > > why increasing MEMCG_CHARGE_BATCH doesn't affect this case.
> > > > >
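> > > > > (A minimal sketch of that debug print; the exact placement inside
> > > > > try_charge_memcg() is paraphrased, not quoted:)
> > > > >
> > > > >         if (nr_pages >= MEMCG_CHARGE_BATCH)
> > > > >                 pr_info("try_charge_memcg: nr_pages=%u\n", nr_pages);
> > > > >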
> > > >
> > > > I am assuming you increased MEMCG_CHARGE_BATCH to 256 and 1024
> > > but
> > > > that did not help. To me that just means there is a different
> > > > bottleneck in the memcg charging codepath. Can you please share
> > > > the perf profile? Please note that memcg charging does a lot of
> > > > other things as well like updating memcg stats and checking (and
> > > > enforcing) memory.high even if you have not set memory.high.
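> > > >
> > > > (A hedged paraphrase of the relevant tail of try_charge_memcg(), not
> > > > a verbatim quote: even with no limit configured, every charge walks
> > > > the memcg hierarchy and reads memory.high.)
> > > >
> > > >         bool mem_high = false;
> > > >
> > > >         /* runs on every charge, even when memory.high is left at "max" */
> > > >         do {
> > > >                 mem_high |= page_counter_read(&memcg->memory) >
> > > >                             READ_ONCE(memcg->memory.high);
> > > >         } while ((memcg = parent_mem_cgroup(memcg)));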
> > >
> > > Thanks Shakeel! I will look further into the details you mentioned.
> > > We use "sudo perf top -p $(docker inspect -f '{{.State.Pid}}'
> > > memcached_2)" to monitor one of the instances, and also use "sudo
> > > perf top" to check the overhead system-wide.
> >
> > Here is the annotate output of perf top for the three memcg hot paths:
> >
> > Showing cycles for page_counter_try_charge
> > Events Pcnt (>=5%)
> > Percent | Source code & Disassembly of elf for cycles (543288 samples, percent: local period)
> > ---------------------------------------------------------------------------------------------------
> > 0.00 : ffffffff8141388d: mov %r12,%rax
> > 76.82 : ffffffff81413890: lock xadd %rax,(%rbx)
> > 22.10 : ffffffff81413895: lea (%r12,%rax,1),%r15
> >
> >
> > Showing cycles for page_counter_cancel
> > Events Pcnt (>=5%)
> > Percent | Source code & Disassembly of elf for cycles (1004744 samples, percent: local period)
> > ----------------------------------------------------------------------------------------------------
> > : 160 return i + xadd(&v->counter, i);
> > 77.42 : ffffffff81413759: lock xadd %rax,(%rdi)
> > 22.34 : ffffffff8141375e: sub %rsi,%rax
> >
> >
> > Showing cycles for try_charge_memcg
> > Events Pcnt (>=5%)
> > Percent | Source code & Disassembly of elf for cycles (256531 samples, percent: local period)
> > ---------------------------------------------------------------------------------------------------
> > : 22 return __READ_ONCE((v)->counter);
> > 77.53 : ffffffff8141df86: mov 0x100(%r13),%rdx
> > : 2826 READ_ONCE(memcg->memory.high);
> > 19.45 : ffffffff8141df8d: mov 0x190(%r13),%rcx
>
> This is rephrasing the info you gave earlier ?
Yep, I just wanted to show some of the details.
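
For context, the contended "lock xadd" above is the per-level usage update
in page_counter_try_charge(); roughly (paraphrased from mm/page_counter.c,
not a verbatim quote):

        /* one atomic RMW per level of the page_counter hierarchy */
        for (c = counter; c; c = c->parent) {
                long new;

                new = atomic_long_add_return(nr_pages, &c->usage);
                if (new > c->max)
                        goto failed;    /* undone with atomic_long_sub() */
        }

Every charge or uncharge that misses the per-CPU stock ends up doing these
atomic updates on counters shared by all CPUs, which is where the cycles go.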
>
> 16.77% [kernel] [k] page_counter_try_charge
> 16.56% [kernel] [k] page_counter_cancel
> 15.65% [kernel] [k] try_charge_memcg
>
> What matters here is a call graph.
Thanks for the explanation. I've re-collected it.
>
> perf record -a -g sleep 5   # While the test is running
> perf report --no-children --stdio
Here is the output of the commands you pasted. It was collected
system-wide; I only show the memcached-related pieces, and it's a
callee -> caller stack trace:
9.02% mc-worker [kernel.vmlinux] [k] page_counter_try_charge
|
--9.00%--page_counter_try_charge
|
--9.00%--try_charge_memcg
mem_cgroup_charge_skmem
|
--9.00%--__sk_mem_raise_allocated
__sk_mem_schedule
|
|--5.32%--tcp_try_rmem_schedule
| tcp_data_queue
| tcp_rcv_established
| tcp_v4_do_rcv
| tcp_v4_rcv
| ip_protocol_deliver_rcu
| ip_local_deliver_finish
| ip_local_deliver
| ip_rcv
| __netif_receive_skb_one_core
| __netif_receive_skb
| process_backlog
| __napi_poll
| net_rx_action
| __do_softirq
| |
| --5.32%--do_softirq.part.0
| __local_bh_enable_ip
| __dev_queue_xmit
| ip_finish_output2
| __ip_finish_output
| ip_finish_output
| ip_output
| ip_local_out
| __ip_queue_xmit
| ip_queue_xmit
| __tcp_transmit_skb
| tcp_write_xmit
| __tcp_push_pending_frames
| tcp_push
| tcp_sendmsg_locked
| tcp_sendmsg
| inet_sendmsg
| sock_sendmsg
| ____sys_sendmsg
8.98% mc-worker [kernel.vmlinux] [k] page_counter_cancel
|
--8.97%--page_counter_cancel
|
--8.97%--page_counter_uncharge
drain_stock
__refill_stock
refill_stock
|
--8.91%--try_charge_memcg
mem_cgroup_charge_skmem
|
--8.91%--__sk_mem_raise_allocated
__sk_mem_schedule
|
|--5.41%--tcp_try_rmem_schedule
| tcp_data_queue
| tcp_rcv_established
| tcp_v4_do_rcv
| tcp_v4_rcv
| ip_protocol_deliver_rcu
| ip_local_deliver_finish
| ip_local_deliver
| ip_rcv
| __netif_receive_skb_one_core
| __netif_receive_skb
| process_backlog
| __napi_poll
| net_rx_action
| __do_softirq
| do_softirq.part.0
| __local_bh_enable_ip
| __dev_queue_xmit
| ip_finish_output2
| __ip_finish_output
| ip_finish_output
| ip_output
| ip_local_out
| __ip_queue_xmit
| ip_queue_xmit
| __tcp_transmit_skb
| tcp_write_xmit
| __tcp_push_pending_frames
| tcp_push
| tcp_sendmsg_locked
| tcp_sendmsg
| inet_sendmsg
8.78% mc-worker [kernel.vmlinux] [k] try_charge_memcg
|
--8.77%--try_charge_memcg
|
--8.76%--mem_cgroup_charge_skmem
|
--8.76%--__sk_mem_raise_allocated
__sk_mem_schedule
|
|--5.21%--tcp_try_rmem_schedule
| tcp_data_queue
| tcp_rcv_established
| tcp_v4_do_rcv
| |
| --5.21%--tcp_v4_rcv
| ip_protocol_deliver_rcu
| ip_local_deliver_finish
| ip_local_deliver
| ip_rcv
| __netif_receive_skb_one_core
| __netif_receive_skb
| process_backlog
| __napi_poll
| net_rx_action
| __do_softirq
| |
| --5.21%--do_softirq.part.0
| __local_bh_enable_ip
| __dev_queue_xmit
| ip_finish_output2
| __ip_finish_output
| ip_finish_output
| ip_output
| ip_local_out
| __ip_queue_xmit
| ip_queue_xmit
| __tcp_transmit_skb
| tcp_write_xmit
| __tcp_push_pending_frames
| tcp_push
| tcp_sendmsg_locked
| tcp_sendmsg
| inet_sendmsg
| sock_sendmsg
| ____sys_sendmsg
| ___sys_sendmsg
| __sys_sendmsg
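
From the call graph, nearly all of the page_counter cost is driven by
mem_cgroup_charge_skmem() called from __sk_mem_raise_allocated() on both
the receive and transmit paths. As a hypothetical sketch of the direction
this series takes (not the actual patch; SK_CHARGE_BATCH is a made-up
name for illustration):

        static bool charge_skmem_batched(struct sock *sk, int amt)
        {
                /* charge the memcg in larger chunks so the shared
                 * page_counter atomics are hit less often; the surplus
                 * pages stay with the socket as sk_forward_alloc.
                 */
                int batch = max(amt, SK_CHARGE_BATCH);

                if (!mem_cgroup_charge_skmem(sk->sk_memcg, batch,
                                             gfp_memcg_charge()))
                        return false;

                sk->sk_forward_alloc += (batch - amt) << PAGE_SHIFT;
                return true;
        }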
>
> What precise kernel are you using btw ?
The above data was collected on the 'net-next/main' git branch:
base-commit: ed23734c23d2 ("Merge tag 'net-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")