netdev - Re: [net] 4890b686f4: netperf.Throughput


Message-ID: <CANn89iLZYGOjTFThvq-J678BtPtqsywwxFekt9XTcPxk_La8eg@mail.gmail.com>
Date:   Fri, 24 Jun 2022 06:22:42 +0200
From:   Eric Dumazet <edumazet@...gle.com>
To:     Jakub Kicinski <kuba@...nel.org>
Cc:     Xin Long <lucien.xin@...il.com>,
        Marcelo Ricardo Leitner <marcelo.leitner@...il.com>,
        kernel test robot <oliver.sang@...el.com>,
        Shakeel Butt <shakeelb@...gle.com>,
        Soheil Hassas Yeganeh <soheil@...gle.com>,
        LKML <linux-kernel@...r.kernel.org>,
        Linux Memory Management List <linux-mm@...ck.org>,
        network dev <netdev@...r.kernel.org>,
        linux-s390@...r.kernel.org, MPTCP Upstream <mptcp@...ts.linux.dev>,
        "linux-sctp @ vger . kernel . org" <linux-sctp@...r.kernel.org>,
        lkp@...ts.01.org, kbuild test robot <lkp@...el.com>,
        Huang Ying <ying.huang@...el.com>,
        "Tang, Feng" <feng.tang@...el.com>, zhengjun.xing@...ux.intel.com,
        fengwei.yin@...el.com, Ying Xu <yinxu@...hat.com>
Subject: Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression

On Fri, Jun 24, 2022 at 6:13 AM Eric Dumazet <edumazet@...gle.com> wrote:
>
> On Fri, Jun 24, 2022 at 3:57 AM Jakub Kicinski <kuba@...nel.org> wrote:
> >
> > On Thu, 23 Jun 2022 18:50:07 -0400 Xin Long wrote:
> > > From the perf data, we can see that __sk_mem_reduce_allocated() is the
> > > function whose CPU usage grew the most, and mem_cgroup APIs are also
> > > called in this function. It means the mem cgroup must be enabled in
> > > the test env, which may explain why I couldn't reproduce it.
> > >
> > > Commit 4890b686f4 ("net: keep sk->sk_forward_alloc as small as
> > > possible") uses sk_mem_reclaim() (checking reclaimable >= PAGE_SIZE) to
> > > reclaim the memory, which calls __sk_mem_reduce_allocated() *more
> > > frequently* than before (checking reclaimable >=
> > > SK_RECLAIM_THRESHOLD). It might be cheap when
> > > mem_cgroup_sockets_enabled is false, but I'm not sure if it's still
> > > cheap when mem_cgroup_sockets_enabled is true.
> > >
> > > I think SCTP netperf could trigger this, as the CPU is the bottleneck
> > > for SCTP netperf testing, which is more sensitive to the extra
> > > function calls than TCP.
> > >
> > > Can we re-run this test without mem cgroup enabled?
> >
> > FWIW I defer to Eric, thanks a lot for double checking the report
> > and digging in!
>
> I did tests with TCP + memcg and noticed a very small additional cost
> in memcg functions, because of suboptimal layout:
>
> Extract of an internal Google bug, update from June 9th:
>
> --------------------------------
> I have noticed minor false sharing when fetching (struct
> mem_cgroup)->css.parent, at offset 0xc0, because it shares the cache
> line containing struct mem_cgroup.memory, at offset 0xd0.
>
> Ideally, memcg->socket_pressure and memcg->parent should sit in a
> read-mostly cache line.
> -----------------------
>
> But nothing that could explain a "-69.4% regression"
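
The false-sharing remark quoted above can be illustrated with a small user-space sketch. It assumes 64-byte cache lines; the toy_* structs are hypothetical stand-ins, not the real struct mem_cgroup layout. It only shows why offsets 0xc0 and 0xd0 land in the same line, and how aligning a hot field moves it away from read-mostly ones.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_CACHE_LINE 64 /* assumption: 64-byte cache lines */

/* Returns 1 when two byte offsets fall in the same cache line. */
static int same_cache_line(size_t off_a, size_t off_b)
{
    return off_a / TOY_CACHE_LINE == off_b / TOY_CACHE_LINE;
}

/* Hypothetical layout: read-mostly fields sit next to a hot counter. */
struct toy_shared {
    void *parent;
    unsigned long socket_pressure;
    long usage; /* stands in for a field written on every charge/uncharge */
};

/* Hypothetical layout: the hot counter is pushed to its own cache line. */
struct toy_split {
    void *parent;
    unsigned long socket_pressure;
    _Alignas(TOY_CACHE_LINE) long usage;
};

/* Checks that only the first layout mixes read-mostly and hot fields. */
static void toy_check_layouts(void)
{
    assert(same_cache_line(offsetof(struct toy_shared, parent),
                           offsetof(struct toy_shared, usage)));
    assert(!same_cache_line(offsetof(struct toy_split, parent),
                            offsetof(struct toy_split, usage)));
}
```

Calling same_cache_line(0xc0, 0xd0) returns 1 under the 64-byte assumption, which matches the sharing described in the quoted bug extract.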

I guess the test now hits memcg limits more often, forcing expensive reclaim,
and the memcg limits need some adjustments.

Overall, tests enabling memcg probably need fine tuning; I will
defer to Intel folks.
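
To make the reclaim-frequency point from Xin Long's message concrete, here is a toy user-space model, not kernel code. It assumes 4 KiB pages and takes the old SK_RECLAIM_THRESHOLD to be 2 MiB (an assumption on my side). It also simplifies reclaim to "return all whole pages", and all toy_* names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE     4096u      /* assumption: 4 KiB pages */
#define TOY_OLD_THRESHOLD (1u << 21) /* assumed old SK_RECLAIM_THRESHOLD */

/*
 * Models one socket that repeatedly charges and then uncharges a write of
 * write_size bytes, and counts how often the reclaim path would run for a
 * given reclaim threshold.
 */
static unsigned long toy_count_reclaims(size_t threshold, size_t write_size,
                                        unsigned long iterations)
{
    size_t forward_alloc = 0; /* bytes reserved by the socket but unused */
    unsigned long reclaims = 0;

    assert(write_size > 0 && threshold >= TOY_PAGE_SIZE);

    for (unsigned long i = 0; i < iterations; i++) {
        /* Charge: top up the reserve in whole pages when it is short. */
        if (forward_alloc < write_size) {
            size_t missing = write_size - forward_alloc;
            size_t pages = (missing + TOY_PAGE_SIZE - 1) / TOY_PAGE_SIZE;
            forward_alloc += pages * TOY_PAGE_SIZE;
        }
        forward_alloc -= write_size;

        /* Uncharge: the bytes come back to the per-socket reserve. */
        forward_alloc += write_size;

        /* Reclaim: give whole pages back once the threshold is reached. */
        if (forward_alloc >= threshold) {
            forward_alloc %= TOY_PAGE_SIZE;
            reclaims++;
        }
    }
    return reclaims;
}
```

In this model, 10K writes with a PAGE_SIZE threshold take the reclaim path on every iteration, while the 2 MiB threshold never does. The model says nothing about how expensive each reclaim is once memcg is involved.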


>
> memcg has a very similar strategy of per-cpu reserves, with
> MEMCG_CHARGE_BATCH being 32 pages per cpu.
>
> It is not clear why SCTP with 10K writes would overflow this reserve constantly.
>
> Presumably memcg experts will have to rework structure alignments to
> make sure they can cope better with more charge/uncharge operations,
> because we are not going back to gigantic per-socket reserves; this
> simply does not scale.
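
The per-cpu reserve mentioned in the quoted text can be sketched as follows. This is a simplified single-CPU model under my own assumptions: a batch of 32 pages as stated above, a drain when the cached amount exceeds the batch, and hypothetical toy_* names. It is not the memcg implementation.

```c
#include <assert.h>

#define TOY_CHARGE_BATCH 32u /* pages, mirroring MEMCG_CHARGE_BATCH above */

struct toy_stock {
    unsigned int cached;        /* pages held in the per-cpu reserve */
    unsigned long slow_charges; /* times the shared counter was touched */
    unsigned long drains;       /* times the reserve overflowed */
};

/* Charge nr_pages, using the cached reserve when it is large enough. */
static void toy_charge(struct toy_stock *s, unsigned int nr_pages)
{
    unsigned int batch;

    assert(nr_pages > 0);
    if (s->cached >= nr_pages) {
        s->cached -= nr_pages; /* fast path: no shared counter access */
        return;
    }
    batch = nr_pages > TOY_CHARGE_BATCH ? nr_pages : TOY_CHARGE_BATCH;
    s->slow_charges++;             /* slow path: charge a whole batch */
    s->cached += batch - nr_pages; /* keep the surplus for later */
}

/* Uncharge nr_pages back into the reserve, draining it on overflow. */
static void toy_uncharge(struct toy_stock *s, unsigned int nr_pages)
{
    s->cached += nr_pages;
    if (s->cached > TOY_CHARGE_BATCH) {
        s->cached = 0;
        s->drains++;
    }
}

/* Alternates one charge and one uncharge of nr_pages, iters times. */
static struct toy_stock toy_alternate(unsigned long iters,
                                      unsigned int nr_pages)
{
    struct toy_stock s = { 0, 0, 0 };

    for (unsigned long i = 0; i < iters; i++) {
        toy_charge(&s, nr_pages);
        toy_uncharge(&s, nr_pages);
    }
    return s;
}
```

With 3-page charges (roughly one 10K write with 4 KiB pages), strictly alternating charge and uncharge touches the shared counter once and never drains in this model. That is consistent with the remark that a constant overflow of the reserve is not obviously expected. Burstier patterns, or charges and uncharges landing on different CPUs, would behave differently.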