Message-ID: <CANn89iKoB2hn8QKBw+8faL4MWZ1ByDW8T9UHyS9G-8c11mWdOw@mail.gmail.com>
Date: Thu, 11 May 2023 18:35:03 +0200
From: Eric Dumazet <edumazet@...gle.com>
To: Shakeel Butt <shakeelb@...gle.com>
Cc: "Zhang, Cathy" <cathy.zhang@...el.com>, Linux MM <linux-mm@...ck.org>,
Cgroups <cgroups@...r.kernel.org>, Paolo Abeni <pabeni@...hat.com>,
"davem@...emloft.net" <davem@...emloft.net>, "kuba@...nel.org" <kuba@...nel.org>,
"Brandeburg, Jesse" <jesse.brandeburg@...el.com>, "Srinivas, Suresh" <suresh.srinivas@...el.com>,
"Chen, Tim C" <tim.c.chen@...el.com>, "You, Lizhen" <lizhen.you@...el.com>,
"eric.dumazet@...il.com" <eric.dumazet@...il.com>, "netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: [PATCH net-next 1/2] net: Keep sk->sk_forward_alloc as a proper size
On Thu, May 11, 2023 at 6:24 PM Shakeel Butt <shakeelb@...gle.com> wrote:
>
> On Thu, May 11, 2023 at 2:27 AM Zhang, Cathy <cathy.zhang@...el.com> wrote:
> >
> [...]
> >
> > Here is the output from the command you pasted. It is system-wide;
> > I only show pieces of the memcached records, and it seems to be a
> > callee -> caller stack trace:
> >
> > 9.02% mc-worker [kernel.vmlinux] [k] page_counter_try_charge
> > |
> > --9.00%--page_counter_try_charge
> > |
> > --9.00%--try_charge_memcg
> > mem_cgroup_charge_skmem
> > |
> > --9.00%--__sk_mem_raise_allocated
> > __sk_mem_schedule
> > |
> > |--5.32%--tcp_try_rmem_schedule
> > | tcp_data_queue
> > | tcp_rcv_established
> > | tcp_v4_do_rcv
> > | tcp_v4_rcv
> > | ip_protocol_deliver_rcu
> > | ip_local_deliver_finish
> > | ip_local_deliver
> > | ip_rcv
> > | __netif_receive_skb_one_core
> > | __netif_receive_skb
> > | process_backlog
> > | __napi_poll
> > | net_rx_action
> > | __do_softirq
> > | |
> > | --5.32%--do_softirq.part.0
> > | __local_bh_enable_ip
> > | __dev_queue_xmit
> > | ip_finish_output2
> > | __ip_finish_output
> > | ip_finish_output
> > | ip_output
> > | ip_local_out
> > | __ip_queue_xmit
> > | ip_queue_xmit
> > | __tcp_transmit_skb
> > | tcp_write_xmit
> > | __tcp_push_pending_frames
> > | tcp_push
> > | tcp_sendmsg_locked
> > | tcp_sendmsg
> > | inet_sendmsg
> > | sock_sendmsg
> > | ____sys_sendmsg
> >
> > 8.98% mc-worker [kernel.vmlinux] [k] page_counter_cancel
> > |
> > --8.97%--page_counter_cancel
> > |
> > --8.97%--page_counter_uncharge
> > drain_stock
> > __refill_stock
> > refill_stock
> > |
> > --8.91%--try_charge_memcg
> > mem_cgroup_charge_skmem
> > |
> > --8.91%--__sk_mem_raise_allocated
> > __sk_mem_schedule
> > |
> > |--5.41%--tcp_try_rmem_schedule
> > | tcp_data_queue
> > | tcp_rcv_established
> > | tcp_v4_do_rcv
> > | tcp_v4_rcv
> > | ip_protocol_deliver_rcu
> > | ip_local_deliver_finish
> > | ip_local_deliver
> > | ip_rcv
> > | __netif_receive_skb_one_core
> > | __netif_receive_skb
> > | process_backlog
> > | __napi_poll
> > | net_rx_action
> > | __do_softirq
> > | do_softirq.part.0
> > | __local_bh_enable_ip
> > | __dev_queue_xmit
> > | ip_finish_output2
> > | __ip_finish_output
> > | ip_finish_output
> > | ip_output
> > | ip_local_out
> > | __ip_queue_xmit
> > | ip_queue_xmit
> > | __tcp_transmit_skb
> > | tcp_write_xmit
> > | __tcp_push_pending_frames
> > | tcp_push
> > | tcp_sendmsg_locked
> > | tcp_sendmsg
> > | inet_sendmsg
> >
> > 8.78% mc-worker [kernel.vmlinux] [k] try_charge_memcg
> > |
> > --8.77%--try_charge_memcg
> > |
> > --8.76%--mem_cgroup_charge_skmem
> > |
> > --8.76%--__sk_mem_raise_allocated
> > __sk_mem_schedule
> > |
> > |--5.21%--tcp_try_rmem_schedule
> > | tcp_data_queue
> > | tcp_rcv_established
> > | tcp_v4_do_rcv
> > | |
> > | --5.21%--tcp_v4_rcv
> > | ip_protocol_deliver_rcu
> > | ip_local_deliver_finish
> > | ip_local_deliver
> > | ip_rcv
> > | __netif_receive_skb_one_core
> > | __netif_receive_skb
> > | process_backlog
> > | __napi_poll
> > | net_rx_action
> > | __do_softirq
> > | |
> > | --5.21%--do_softirq.part.0
> > | __local_bh_enable_ip
> > | __dev_queue_xmit
> > | ip_finish_output2
> > | __ip_finish_output
> > | ip_finish_output
> > | ip_output
> > | ip_local_out
> > | __ip_queue_xmit
> > | ip_queue_xmit
> > | __tcp_transmit_skb
> > | tcp_write_xmit
> > | __tcp_push_pending_frames
> > | tcp_push
> > | tcp_sendmsg_locked
> > | tcp_sendmsg
> > | inet_sendmsg
> > | sock_sendmsg
> > | ____sys_sendmsg
> > | ___sys_sendmsg
> > | __sys_sendmsg
>
> I am suspecting we are doing a lot of charging for a specific memcg on
> one CPU (or a set of CPUs) and a lot of uncharging on the different
> CPU (or a different set of CPUs) and thus both of these code paths are
> hitting the slow path a lot.
>
> Eric, I remember we have an optimization in the networking stack that
> tries to free the memory on the same CPU where the allocation
> happened. Is that optimization enabled for this code path? Or maybe we
> should do something similar in memcg code (with the assumption that my
> suspicion is correct).
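The optimization you are thinking of is likely the deferred skb
freeing (skb_attempt_defer_free() in net/core/skbuff.c): each skb
records the cpu it was allocated on, and the freeing cpu hands it
back to that cpu instead of touching remote caches itself. A rough
sketch of the idea, eliding the list-length limit and the IPI kick
(names match the kernel, details abbreviated):

void skb_attempt_defer_free(struct sk_buff *skb)
{
	int cpu = skb->alloc_cpu;	/* recorded at allocation time */
	struct softnet_data *sd;

	if (cpu == raw_smp_processor_id()) {
		__kfree_skb(skb);	/* same cpu: free directly */
		return;
	}

	/* queue the skb on the allocating cpu's per-cpu defer list;
	 * that cpu drains the list from its own softirq context */
	sd = &per_cpu(softnet_data, cpu);
	spin_lock_bh(&sd->defer_lock);
	skb->next = sd->defer_list;
	sd->defer_list = skb;
	sd->defer_count++;
	spin_unlock_bh(&sd->defer_lock);
}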
The suspect part is really:
> 8.98% mc-worker [kernel.vmlinux] [k] page_counter_cancel
> |
> --8.97%--page_counter_cancel
> |
> --8.97%--page_counter_uncharge
> drain_stock
> __refill_stock
> refill_stock
> |
> --8.91%--try_charge_memcg
> mem_cgroup_charge_skmem
> |
> --8.91%--__sk_mem_raise_allocated
> __sk_mem_schedule
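drain_stock() in this path is memcg's per-cpu stock being evicted:
each cpu caches pre-charged pages for a single memcg at a time, so
charge/uncharge traffic that does not match the cached memcg falls
through to the shared page counters. A minimal sketch of that logic,
simplified from mm/memcontrol.c (locking, irq handling and batch
sizing omitted):

struct memcg_stock_pcp {
	struct mem_cgroup *cached;	/* the ONE memcg this cpu serves */
	unsigned int nr_pages;		/* pre-charged pages */
};
static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);

/* fast path of try_charge_memcg(): no page_counter touched */
static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
{
	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);

	if (memcg == stock->cached && stock->nr_pages >= nr_pages) {
		stock->nr_pages -= nr_pages;
		return true;
	}
	return false;	/* slow path: page_counter_try_charge() */
}

/* refilling for a different memcg first evicts the old stock,
 * hitting page_counter_uncharge() -> page_counter_cancel() */
static void __refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
{
	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);

	if (stock->cached != memcg) {
		if (stock->cached && stock->nr_pages)	/* drain_stock() */
			page_counter_uncharge(&stock->cached->memory,
					      stock->nr_pages);
		stock->cached = memcg;
		stock->nr_pages = 0;
	}
	stock->nr_pages += nr_pages;
}

(try_charge_memcg() charges a full MEMCG_CHARGE_BATCH and parks the
excess in the stock, which is why refill_stock() shows up under the
charge path in the profile above.)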
Shakeel, networking has a per-cpu cache of +/- 1MB.
Even with asymmetric alloc/free, this would mean that a 100Gbit NIC
(~12.5 GB/s, flushed in 1MB batches on both the charge and the
uncharge side) would require something like 25,000 operations on the
shared cache line per second. Hardly an issue, I think.
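That cache is the per_cpu_fw_alloc reserve; lightly simplified from
include/net/sock.h, the shared atomic is only touched once a cpu has
accumulated a full reserve:

/* each proto, e.g. TCP, has a per-cpu forward-alloc reserve */
#define SK_MEMORY_PCPU_RESERVE (1 << (20 - PAGE_SHIFT))	/* 1MB of pages */

static inline void sk_memory_allocated_add(struct sock *sk, int amt)
{
	int local_reserve;

	preempt_disable();
	local_reserve = __this_cpu_add_return(*sk->sk_prot->per_cpu_fw_alloc, amt);
	if (local_reserve >= SK_MEMORY_PCPU_RESERVE) {
		/* only now touch the shared atomic_long_t */
		__this_cpu_sub(*sk->sk_prot->per_cpu_fw_alloc, local_reserve);
		atomic_long_add(local_reserve, sk->sk_prot->memory_allocated);
	}
	preempt_enable();
}

/* sk_memory_allocated_sub() is symmetric, flushing once the local
 * delta reaches -SK_MEMORY_PCPU_RESERVE */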
memcg does not seem to have an equivalent strategy?