Message-Id: <494C73E0-452D-4503-8ED6-DAE11A8471E5@redhat.com>
Date:   Mon, 26 Oct 2020 20:09:43 +0100
From:   David Hildenbrand <david@...hat.com>
To:     Axel Rasmussen <axelrasmussen@...gle.com>
Cc:     David Hildenbrand <david@...hat.com>,
        David Rientjes <rientjes@...gle.com>,
        kernel test robot <rong.a.chen@...el.com>,
        Kevin Ko <kevko@...gle.com>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Vlastimil Babka <vbabka@...e.cz>,
        Oscar Salvador <osalvador@...e.de>,
        Wei Yang <richard.weiyang@...ux.alibaba.com>,
        Pankaj Gupta <pankaj.gupta.linux@...il.com>,
        Michal Hocko <mhocko@...e.com>,
        Alexander Duyck <alexander.h.duyck@...ux.intel.com>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Dave Hansen <dave.hansen@...el.com>,
        Mike Rapoport <rppt@...nel.org>,
        "K. Y. Srinivasan" <kys@...rosoft.com>,
        Haiyang Zhang <haiyangz@...rosoft.com>,
        Stephen Hemminger <sthemmin@...rosoft.com>,
        Wei Liu <wei.liu@...nel.org>,
        Matthew Wilcox <willy@...radead.org>,
        Michael Ellerman <mpe@...erman.id.au>,
        Michal Hocko <mhocko@...nel.org>,
        Scott Cheloha <cheloha@...ux.ibm.com>,
        LKML <linux-kernel@...r.kernel.org>, lkp@...ts.01.org,
        lkp@...el.com, ying.huang@...el.com, feng.tang@...el.com,
        zhengjun.xing@...el.com
Subject: Re: [mm/page_alloc] 7fef431be9: vm-scalability.throughput 87.8% improvement


> On 26.10.2020 at 19:11, Axel Rasmussen <axelrasmussen@...gle.com> wrote:
> 
> On Mon, Oct 26, 2020 at 1:31 AM David Hildenbrand <david@...hat.com> wrote:
>> 
>>> On 23.10.20 21:44, Axel Rasmussen wrote:
>>> On Fri, Oct 23, 2020 at 12:29 PM David Rientjes <rientjes@...gle.com> wrote:
>>>> 
>>>> On Wed, 21 Oct 2020, kernel test robot wrote:
>>>> 
>>>>> Greeting,
>>>>> 
>>>>> FYI, we noticed an 87.8% improvement of vm-scalability.throughput due to commit:
>>>>> 
>>>>> 
>>>>> commit: 7fef431be9c9ac255838a9578331567b9dba4477 ("mm/page_alloc: place pages to tail in __free_pages_core()")
>>>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>>>> 
>>>>> 
>>>>> in testcase: vm-scalability
>>>>> on test machine: 192 threads Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz with 192G memory
>>>>> with following parameters:
>>>>> 
>>>>>      runtime: 300s
>>>>>      size: 512G
>>>>>      test: anon-wx-rand-mt
>>>>>      cpufreq_governor: performance
>>>>>      ucode: 0x5002f01
>>>>> 
>>>>> test-description: The motivation behind this suite is to exercise functions and regions of the mm/ of the Linux kernel which are of interest to us.
>>>>> test-url: https://git.kernel.org/cgit/linux/kernel/git/wfg/vm-scalability.git/
>>>>> 
>>>> 
>>>> I'm curious why we are not able to reproduce this improvement on Skylake
>>>> and actually see a slight performance degradation, at least for
>>>> 300s_128G_truncate_throughput.
>>>> 
>>>> Axel Rasmussen <axelrasmussen@...gle.com> can provide more details on our
>>>> results.
>>> 
>>> Right, our results show a slight regression on a Skylake machine [1],
>>> and a slight performance increase on a Rome machine [2]. For these
>>> tests, I used Linus' v5.9 tag as a baseline, and then applied this
>>> patchset onto that tag as a test kernel (the patches applied cleanly
>>> aside from one comment; I didn't have to do any code fixups). This is
>>> running the same anon-wx-rand-mt test defined in the upstream
>>> lkp-tests job file:
>>> https://github.com/intel/lkp-tests/blob/master/jobs/vm-scalability.yaml
>> 
>> Hi,
>> 
>> looking at the yaml, am I right that each test is run after a fresh boot?
> 
> Yes-ish. For the results I posted, the larger context would have been
> something like:
> 
> - Kernel installed, machine freshly rebooted.
> - Various machine management daemons start by default; some are
> stopped so as not to interfere with the test.
> - Some packages are installed on the machine (the thing which
> orchestrates the testing in particular).
> - The test is run.
> 
> So, the machine is somewhat fresh in the sense that it hasn't been
> e.g. serving production traffic just before running the test, but it's
> also not as clean as it could be. It seems plausible that this setup
> difference explains the discrepancy in results (I'm not too familiar
> with how the Intel kernel test robot is implemented).

Ah, okay. So most memory in the system is indeed untouched. Thanks!


> 
>> 
>> As I already replied to David, this patch merely changes the initial
>> order of the freelists. The net effect is that, within a zone, lower
>> memory addresses get allocated before higher memory addresses the first
>> time that memory is allocated. Before, it was the other way around. Once
>> a system has run for some time, the freelists are effectively randomized.
>> 
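A minimal user-space C sketch of that ordering effect, assuming only a toy
singly linked free list; free_to_tail()/free_to_head() are illustrative
names, not kernel APIs, and none of this is actual page allocator code:

#include <stdio.h>

#define NRANGES 4

struct range {
        unsigned long start;
        struct range *next;
};

static struct range pool[NRANGES];
static struct range *freelist;

/* New (FPI_TO_TAIL-like) behaviour: append the freed range to the tail. */
static void free_to_tail(struct range *r)
{
        struct range **p = &freelist;

        while (*p)
                p = &(*p)->next;
        r->next = NULL;
        *p = r;
}

/* Old behaviour: push the freed range to the head. */
static void free_to_head(struct range *r)
{
        r->next = freelist;
        freelist = r;
}

/* Allocations always take from the head of the list. */
static struct range *alloc_one(void)
{
        struct range *r = freelist;

        if (r)
                freelist = r->next;
        return r;
}

int main(void)
{
        int i;

        /* "Boot-time" frees happen lowest address first. */
        for (i = 0; i < NRANGES; i++)
                pool[i].start = i * 0x1000UL;

        /* Tail placement: first-touch allocations come back lowest first. */
        for (i = 0; i < NRANGES; i++)
                free_to_tail(&pool[i]);
        printf("tail placement, first allocation: 0x%lx\n", alloc_one()->start);

        /* Drain, then head placement: highest addresses come back first. */
        while (alloc_one())
                ;
        for (i = 0; i < NRANGES; i++)
                free_to_head(&pool[i]);
        printf("head placement, first allocation: 0x%lx\n", alloc_one()->start);

        return 0;
}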
>> There might be benchmarks/systems for which this new initial system
>> state is better suited - or worse. That doesn't really tell you that
>> core-mm is behaving better/worse now - it merely means that the initial
>> system state under which the benchmark was started affected the result.
>> 
>> Looks like so far there is one benchmark+system where it's really
>> beneficial, there is one benchmark+system where it's slightly
>> beneficial, and one benchmark+system where there is a slight regression.
>> 
>> 
>> Something like the following would revert to the previous behavior:
>> 
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 23f5066bd4a5..fac82420cc3d 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -1553,7 +1553,9 @@ void __free_pages_core(struct page *page, unsigned int order)
>>         * Bypass PCP and place fresh pages right to the tail, primarily
>>         * relevant for memory onlining.
>>         */
>> -       __free_pages_ok(page, order, FPI_TO_TAIL);
>> +       __free_pages_ok(page, order,
>> +                       system_state < SYSTEM_RUNNING ? FPI_NONE :
>> +                                                       FPI_TO_TAIL);
>> }
>> 
>> #ifdef CONFIG_NEED_MULTIPLE_NODES
>> 
>> 
>> (Or better, passing the expected behavior via MEMINIT_EARLY/... to
>> __free_pages_core().)
>> 
>> 
>> But then, I am not convinced we should perform that change: a clean
>> initial state might hold for these benchmarks, but it's far from
>> reality. The change in numbers doesn't show you that core-mm is
>> operating better/worse, just that the baseline for your tests changed
>> due to a changed initial system state.
> 
> Not to put words in David's mouth :) but at least from my perspective,
> our original interest was "wow, an 87% improvement! maybe we should
> deploy this patch to production!", and I'm mostly sharing my results
> just to say "it actually doesn't seem to be a huge *general*
> improvement", rather than to advocate for further changes / fixes.

Ah, yes, I saw the +87% and thought "that can't be right".

> IIUC the original motivation for this patch was to fix somewhat of an
> edge case, not to make a very general improvement, so this seems fine.
> 

Exactly.
