linux-kernel - Re: [mm] f3344adf38: fxmark.hdd_btrfs_DWAL_63

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAHk-=wi1KcOZo4JeRDgJCU0gB5v16q6GJsFtF365MSfM8hH3zA@mail.gmail.com>
Date:   Mon, 15 Mar 2021 13:42:50 -0700
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     kernel test robot <oliver.sang@...el.com>
Cc:     Muchun Song <songmuchun@...edance.com>,
        Shakeel Butt <shakeelb@...gle.com>,
        Johannes Weiner <hannes@...xchg.org>,
        Michal Hocko <mhocko@...nel.org>,
        Vladimir Davydov <vdavydov.dev@...il.com>,
        Roman Gushchin <guro@...com>,
        Stephen Rothwell <sfr@...b.auug.org.au>,
        Chris Down <chris@...isdown.name>,
        Yafang Shao <laoar.shao@...il.com>,
        Wei Yang <richard.weiyang@...il.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        LKML <linux-kernel@...r.kernel.org>, lkp@...ts.01.org,
        kernel test robot <lkp@...el.com>,
        "Huang, Ying" <ying.huang@...el.com>,
        Feng Tang <feng.tang@...el.com>, zhengjun.xing@...el.com
Subject: Re: [mm] f3344adf38: fxmark.hdd_btrfs_DWAL_63_bufferedio.works/sec
 -52.4% regression

On Sun, Mar 14, 2021 at 11:30 PM kernel test robot
<oliver.sang@...el.com> wrote:
>
> FYI, we noticed a -52.4% regression of fxmark.hdd_btrfs_DWAL_63_bufferedio.works/sec

That's quite the huge regression.

But:

> due to commit: f3344adf38bd ("mm: memcontrol: optimize per-lruvec stats counter memory usage")

That's _literally_ just changing a dynamically allocated per-cpu array
of "long[]" to an array of "s32[]" and in the process shrinking it
from 304 bytes to 152 bytes.

> in testcase: fxmark
> on test machine: 288 threads Intel(R) Xeon Phi(TM) CPU 7295 @ 1.50GHz with 80G memory

I think this must be some really random memory layout issue that
causes some false sharing or similar.

And it's not even that some fundamental data structure gets a
different layout, it literally is just either:

 (a) the (now smaller) array is allocated from a differently chunk,
and that then causes random cache effects with something else

 (b) the (old, and bigger) array was more spread out, and as a result
had different fields in different cachelines and less false sharing

Normally I'd say that (b) is the obvious case, except for the fact
that this is a __percpu array.

So in the common case there shouldn't be any cache contention _within_
the array itself. Any cache contention should be with something else
very hot that the change now randomly makes be in the same cache way
or whatever.

Afaik, only the flushing of the vmstats batches does access the percpu
arrays from other CPUs, so (b) is not _entirely_ impossible if
memcg_flush_percpu_vmstats() were to be very very very hot.

But the profile data doesn't show anything remotely like that.

In fact, the actual report seems to show things improving, ie things
like elapsed time going down:

>       1361            -9.5%       1232        fxmark.time.elapsed_time
>       1361            -9.5%       1232        fxmark.time.elapsed_time.max
>      25.67            +9.1%      28.00        fxmark.time.percent_of_cpu_this_job_got
>     343.68            -2.0%     336.76        fxmark.time.system_time
>      23.66            +2.5       26.12        mpstat.cpu.all.sys%

but I don't know what the benchmark actually does, so maybe it just
repeats things until it hits a certain confidence interval, and thus
"elapsed time" is immaterial.

Honestly, normally if I were to get a report about "52% regression"
for a commit that is supposed to optimize something, I'd just revert
the commit as a case of "ok, that optimization clearly didn't work".

But there is absolutely no sign that this commit is actually the
culprit, or explanation for why that should be, and what could be
going on.

So I'm going to treat this as a "bisection failure, possibly due to
random code or data layout changes". Particularly since this seems to
be a 4-socket Xeon Phi machine, which I think is likely a very very
fragile system if there is some odd cache layout issue.

If somebody can actually figure out what happened there, that would be
good, but for now it goes into my "archived as a random outlier"
folder.

                 Linus