Message-ID: <CANn89iKoB2hn8QKBw+8faL4MWZ1ByDW8T9UHyS9G-8c11mWdOw@mail.gmail.com>
Date: Thu, 11 May 2023 18:35:03 +0200
From: Eric Dumazet <edumazet@...gle.com>
To: Shakeel Butt <shakeelb@...gle.com>
Cc: "Zhang, Cathy" <cathy.zhang@...el.com>, Linux MM <linux-mm@...ck.org>,
	Cgroups <cgroups@...r.kernel.org>, Paolo Abeni <pabeni@...hat.com>,
	"davem@...emloft.net" <davem@...emloft.net>,
	"kuba@...nel.org" <kuba@...nel.org>,
	"Brandeburg, Jesse" <jesse.brandeburg@...el.com>,
	"Srinivas, Suresh" <suresh.srinivas@...el.com>,
	"Chen, Tim C" <tim.c.chen@...el.com>,
	"You, Lizhen" <lizhen.you@...el.com>,
	"eric.dumazet@...il.com" <eric.dumazet@...il.com>,
	"netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: [PATCH net-next 1/2] net: Keep sk->sk_forward_alloc as a proper size

On Thu, May 11, 2023 at 6:24 PM Shakeel Butt <shakeelb@...gle.com> wrote:
>
> On Thu, May 11, 2023 at 2:27 AM Zhang, Cathy <cathy.zhang@...el.com> wrote:
> >
> [...]
> >
> > Here is the output with the command you pasted. It's from a
> > system-wide collection; I only show pieces of the memcached records,
> > and it seems to be a callee -> caller stack trace:
> >
> >     9.02%  mc-worker  [kernel.vmlinux]  [k] page_counter_try_charge
> >             |
> >             --9.00%--page_counter_try_charge
> >                       |
> >                       --9.00%--try_charge_memcg
> >                                mem_cgroup_charge_skmem
> >                                |
> >                                --9.00%--__sk_mem_raise_allocated
> >                                         __sk_mem_schedule
> >                                         |
> >                                         |--5.32%--tcp_try_rmem_schedule
> >                                         |         tcp_data_queue
> >                                         |         tcp_rcv_established
> >                                         |         tcp_v4_do_rcv
> >                                         |         tcp_v4_rcv
> >                                         |         ip_protocol_deliver_rcu
> >                                         |         ip_local_deliver_finish
> >                                         |         ip_local_deliver
> >                                         |         ip_rcv
> >                                         |         __netif_receive_skb_one_core
> >                                         |         __netif_receive_skb
> >                                         |         process_backlog
> >                                         |         __napi_poll
> >                                         |         net_rx_action
> >                                         |         __do_softirq
> >                                         |         |
> >                                         |         --5.32%--do_softirq.part.0
> >                                         |                  __local_bh_enable_ip
> >                                         |                  __dev_queue_xmit
> >                                         |                  ip_finish_output2
> >                                         |                  __ip_finish_output
> >                                         |                  ip_finish_output
> >                                         |                  ip_output
> >                                         |                  ip_local_out
> >                                         |                  __ip_queue_xmit
> >                                         |                  ip_queue_xmit
> >                                         |                  __tcp_transmit_skb
> >                                         |                  tcp_write_xmit
> >                                         |                  __tcp_push_pending_frames
> >                                         |                  tcp_push
> >                                         |                  tcp_sendmsg_locked
> >                                         |                  tcp_sendmsg
> >                                         |                  inet_sendmsg
> >                                         |                  sock_sendmsg
> >                                         |                  ____sys_sendmsg
> >
> >     8.98%  mc-worker  [kernel.vmlinux]  [k] page_counter_cancel
> >             |
> >             --8.97%--page_counter_cancel
> >                       |
> >                       --8.97%--page_counter_uncharge
> >                                drain_stock
> >                                __refill_stock
> >                                refill_stock
> >                                |
> >                                --8.91%--try_charge_memcg
> >                                         mem_cgroup_charge_skmem
> >                                         |
> >                                         --8.91%--__sk_mem_raise_allocated
> >                                                  __sk_mem_schedule
> >                                                  |
> >                                                  |--5.41%--tcp_try_rmem_schedule
> >                                                  |         tcp_data_queue
> >                                                  |         tcp_rcv_established
> >                                                  |         tcp_v4_do_rcv
> >                                                  |         tcp_v4_rcv
> >                                                  |         ip_protocol_deliver_rcu
> >                                                  |         ip_local_deliver_finish
> >                                                  |         ip_local_deliver
> >                                                  |         ip_rcv
> >                                                  |         __netif_receive_skb_one_core
> >                                                  |         __netif_receive_skb
> >                                                  |         process_backlog
> >                                                  |         __napi_poll
> >                                                  |         net_rx_action
> >                                                  |         __do_softirq
> >                                                  |         do_softirq.part.0
> >                                                  |         __local_bh_enable_ip
> >                                                  |         __dev_queue_xmit
> >                                                  |         ip_finish_output2
> >                                                  |         __ip_finish_output
> >                                                  |         ip_finish_output
> >                                                  |         ip_output
> >                                                  |         ip_local_out
> >                                                  |         __ip_queue_xmit
> >                                                  |         ip_queue_xmit
> >                                                  |         __tcp_transmit_skb
> >                                                  |         tcp_write_xmit
> >                                                  |         __tcp_push_pending_frames
> >                                                  |         tcp_push
> >                                                  |         tcp_sendmsg_locked
> >                                                  |         tcp_sendmsg
> >                                                  |         inet_sendmsg
> >
> >     8.78%  mc-worker  [kernel.vmlinux]  [k] try_charge_memcg
> >             |
> >             --8.77%--try_charge_memcg
> >                       |
> >                       --8.76%--mem_cgroup_charge_skmem
> >                                |
> >                                --8.76%--__sk_mem_raise_allocated
> >                                         __sk_mem_schedule
> >                                         |
> >                                         |--5.21%--tcp_try_rmem_schedule
> >                                         |         tcp_data_queue
> >                                         |         tcp_rcv_established
> >                                         |         tcp_v4_do_rcv
> >                                         |         |
> >                                         |         --5.21%--tcp_v4_rcv
> >                                         |                  ip_protocol_deliver_rcu
> >                                         |                  ip_local_deliver_finish
> >                                         |                  ip_local_deliver
> >                                         |                  ip_rcv
> >                                         |                  __netif_receive_skb_one_core
> >                                         |                  __netif_receive_skb
> >                                         |                  process_backlog
> >                                         |                  __napi_poll
> >                                         |                  net_rx_action
> >                                         |                  __do_softirq
> >                                         |                  |
> >                                         |                  --5.21%--do_softirq.part.0
> >                                         |                           __local_bh_enable_ip
> >                                         |                           __dev_queue_xmit
> >                                         |                           ip_finish_output2
> >                                         |                           __ip_finish_output
> >                                         |                           ip_finish_output
> >                                         |                           ip_output
> >                                         |                           ip_local_out
> >                                         |                           __ip_queue_xmit
> >                                         |                           ip_queue_xmit
> >                                         |                           __tcp_transmit_skb
> >                                         |                           tcp_write_xmit
> >                                         |                           __tcp_push_pending_frames
> >                                         |                           tcp_push
> >                                         |                           tcp_sendmsg_locked
> >                                         |                           tcp_sendmsg
> >                                         |                           inet_sendmsg
> >                                         |                           sock_sendmsg
> >                                         |                           ____sys_sendmsg
> >                                         |                           ___sys_sendmsg
> >                                         |                           __sys_sendmsg
> >
>
> I suspect we are doing a lot of charging for a specific memcg on one
> CPU (or a set of CPUs) and a lot of uncharging on a different CPU (or
> a different set of CPUs), and thus both of these code paths are
> hitting the slow path a lot.
>
> Eric, I remember we have an optimization in the networking stack that
> tries to free the memory on the same CPU where the allocation
> happened. Is that optimization enabled for this code path? Or maybe
> we should do something similar in memcg code (with the assumption
> that my suspicion is correct).

The suspect part is really:

    8.98%  mc-worker  [kernel.vmlinux]  [k] page_counter_cancel
            |
            --8.97%--page_counter_cancel
                      |
                      --8.97%--page_counter_uncharge
                               drain_stock
                               __refill_stock
                               refill_stock
                               |
                               --8.91%--try_charge_memcg
                                        mem_cgroup_charge_skmem
                                        |
                                        --8.91%--__sk_mem_raise_allocated
                                                 __sk_mem_schedule

Shakeel, networking has a per-cpu cache of +/- 1MB. Even with
asymmetric alloc/free, this would mean that a 100Gbit NIC would
require something like 25,000 operations on the shared cache line
per second.

Hardly an issue, I think.

memcg does not seem to have an equivalent strategy?
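
(To make the arithmetic explicit: 100 Gbit/s is about 12.5 GB/s, so with
roughly +/- 1 MB cached per CPU the shared cache line is touched about once
per megabyte in each direction, i.e. ~12,500 charge plus ~12,500 uncharge
operations per second, which is Eric's ~25,000 figure.)

The per-CPU batching pattern both sides are describing -- the same idea
behind the drain_stock/__refill_stock frames in the profile above -- can be
sketched as follows. This is a minimal userland sketch, not kernel code:
shared_charge, cpu_stock, STOCK_BATCH, charge() and uncharge() are all
hypothetical names, and _Thread_local stands in for true per-CPU data.

#include <stdatomic.h>
#include <stdbool.h>

enum { STOCK_BATCH = 64 * 1024 };          /* hypothetical refill granularity */

static const long shared_limit = 1L << 30; /* hypothetical 1 GiB limit        */
static _Atomic long shared_charge;         /* the contended shared counter    */
static _Thread_local long cpu_stock;       /* pre-charged local reserve       */

/* Slow path: one atomic RMW on the shared cache line. */
static bool charge_slow(long bytes)
{
	long old = atomic_fetch_add(&shared_charge, bytes);

	if (old + bytes > shared_limit) {
		atomic_fetch_sub(&shared_charge, bytes); /* over limit: undo */
		return false;
	}
	return true;
}

/* Fast path: serve small charges from the local stock. */
static bool charge(long bytes)
{
	if (cpu_stock >= bytes) {         /* common case: no shared access */
		cpu_stock -= bytes;
		return true;
	}
	/* Stock ran dry: charge the request plus a whole batch for later. */
	if (charge_slow(bytes + STOCK_BATCH)) {
		cpu_stock += STOCK_BATCH;
		return true;
	}
	return charge_slow(bytes);        /* near the limit: exact size only */
}

/* Uncharge refills the local stock, draining oversize reserves in batches. */
static void uncharge(long bytes)
{
	cpu_stock += bytes;
	if (cpu_stock > 2 * STOCK_BATCH) {  /* cap the reserve, like +/- 1MB */
		atomic_fetch_sub(&shared_charge, cpu_stock - STOCK_BATCH);
		cpu_stock = STOCK_BATCH;
	}
}

Under the traffic pattern Shakeel suspects, charge() runs mostly on one set
of CPUs and uncharge() on another: the chargers' stocks keep running dry
while the unchargers' keep overflowing, so both sides fall through to the
shared atomic on nearly every batch. That is consistent with
page_counter_try_charge (9.02%) and page_counter_cancel (8.98%) costing
about the same in the profile.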