Message-ID: <87pm20p9ra.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date:   Sat, 30 Sep 2023 12:26:01 +0800
From:   "Huang, Ying" <ying.huang@...el.com>
To:     Johannes Weiner <hannes@...xchg.org>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Vlastimil Babka <vbabka@...e.cz>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Miaohe Lin <linmiaohe@...wei.com>,
        Kefeng Wang <wangkefeng.wang@...wei.com>,
        Zi Yan <ziy@...dia.com>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/6] mm: page_alloc: remove pcppage migratetype caching

Johannes Weiner <hannes@...xchg.org> writes:

> On Wed, Sep 27, 2023 at 01:42:25PM +0800, Huang, Ying wrote:
>> Johannes Weiner <hannes@...xchg.org> writes:
>> 
>> > The idea behind the cache is to save get_pageblock_migratetype()
>> > lookups during bulk freeing. A microbenchmark suggests this isn't
>> > helping, though. The pcp migratetype can get stale, which means that
>> > bulk freeing has an extra branch to check if the pageblock was
>> > isolated while on the pcp.
>> >
>> > While the variance overlaps, the cache write and the branch seem to
>> > make this a net negative. The following test allocates and frees
>> > batches of 10,000 pages (~3x the pcp high marks to trigger flushing):
>> >
>> > Before:
>> >           8,668.48 msec task-clock                       #   99.735 CPUs utilized               ( +-  2.90% )
>> >                 19      context-switches                 #    4.341 /sec                        ( +-  3.24% )
>> >                  0      cpu-migrations                   #    0.000 /sec
>> >             17,440      page-faults                      #    3.984 K/sec                       ( +-  2.90% )
>> >     41,758,692,473      cycles                           #    9.541 GHz                         ( +-  2.90% )
>> >    126,201,294,231      instructions                     #    5.98  insn per cycle              ( +-  2.90% )
>> >     25,348,098,335      branches                         #    5.791 G/sec                       ( +-  2.90% )
>> >         33,436,921      branch-misses                    #    0.26% of all branches             ( +-  2.90% )
>> >
>> >          0.0869148 +- 0.0000302 seconds time elapsed  ( +-  0.03% )
>> >
>> > After:
>> >           8,444.81 msec task-clock                       #   99.726 CPUs utilized               ( +-  2.90% )
>> >                 22      context-switches                 #    5.160 /sec                        ( +-  3.23% )
>> >                  0      cpu-migrations                   #    0.000 /sec
>> >             17,443      page-faults                      #    4.091 K/sec                       ( +-  2.90% )
>> >     40,616,738,355      cycles                           #    9.527 GHz                         ( +-  2.90% )
>> >    126,383,351,792      instructions                     #    6.16  insn per cycle              ( +-  2.90% )
>> >     25,224,985,153      branches                         #    5.917 G/sec                       ( +-  2.90% )
>> >         32,236,793      branch-misses                    #    0.25% of all branches             ( +-  2.90% )
>> >
>> >          0.0846799 +- 0.0000412 seconds time elapsed  ( +-  0.05% )
>> >
>> > A side effect is that this also ensures that pages whose pageblock
>> > gets stolen while on the pcplist end up on the right freelist and we
>> > don't perform potentially type-incompatible buddy merges (or skip
>> > merges when we shouldn't), which is likely beneficial to long-term
>> > fragmentation management, although the effects would be harder to
>> > measure. Settle for simpler and faster code as justification here.
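
To make the trade-off concrete, here is a toy sketch of the two free
paths described above (everything below is invented for illustration,
not the real mm/page_alloc.c code):

/*
 * Toy model of the trade-off only -- the names and types here are
 * stand-ins, not the actual kernel implementation.
 */
struct toy_page {
	unsigned long pfn;
	int cached_mt;		/* "before" only: type cached at pcp-add time */
};

/* Stand-in for the pageblock bitmap lookup (get_pageblock_migratetype()). */
static int toy_pageblock_mt[1024];
static int pageblock_mt(unsigned long pfn)
{
	return toy_pageblock_mt[(pfn >> 9) % 1024];	/* say, 512 pages per block */
}

/* Stand-in for merging the page into the buddy freelist of type mt. */
static void buddy_free(struct toy_page *page, int mt)
{
	(void)page;
	(void)mt;
}

/* Before: a lookup plus a cache write when the page enters the pcp list... */
void pcp_add_before(struct toy_page *page)
{
	page->cached_mt = pageblock_mt(page->pfn);
	/* ...the page then sits on the pcp list; its block may get isolated. */
}

/* ...and an extra branch at bulk-free time because the cache can be stale. */
void pcp_flush_before(struct toy_page *page, int zone_has_isolated_blocks)
{
	int mt = page->cached_mt;

	if (zone_has_isolated_blocks)		/* the extra branch */
		mt = pageblock_mt(page->pfn);	/* cached value may be stale */

	buddy_free(page, mt);
}

/* After: no cache; a single, always-current lookup at flush time. */
void pcp_flush_after(struct toy_page *page)
{
	buddy_free(page, pageblock_mt(page->pfn));
}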
>> 
>> I suspected that the PCP allocating/freeing path might be affected
>> (that is, the case where the allocating/freeing batch is smaller than
>> the PCP high mark).  So I tested single-process
>> will-it-scale/page_fault1 with sysctl percpu_pagelist_high_fraction=8,
>> so that pages are only allocated from and freed to the PCP.  The test
>> results are as follows:
>> 
>> Before:
>> will-it-scale.1.processes                        618364.3      (+-  0.075%)
>> perf-profile.children.get_pfnblock_flags_mask         0.13     (+-  9.350%)
>> 
>> After:
>> will-it-scale.1.processes                        616512.0      (+-  0.057%)
>> perf-profile.children.get_pfnblock_flags_mask         0.41     (+- 22.44%)
>> 
>> The change isn't large: -0.3%.  Perf profiling shows the cycles% of
>> get_pfnblock_flags_mask() increases.
>
> Ah, this is going through the free_unref_page_list() path that
> Vlastimil had pointed out as well. I made another change on top that
> eliminates the second lookup. After that, both pcp fast paths have the
> same number of lookups as before: 1. This fixes the regression for me.
>
> Would you mind confirming this as well?
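
If I follow, the list fast path after that follow-up change looks
roughly like the toy sketch below: one migratetype lookup per page,
handed straight to the pcp commit step so it is not repeated there
(again invented names, reusing the toy helpers from the sketch above,
not the actual patch):

/* Toy model again; reuses struct toy_page and pageblock_mt() from above. */
static void pcp_commit(struct toy_page *page, int mt)
{
	(void)page;
	(void)mt;	/* would add the page to the pcp list for type mt */
}

void pcp_free_list_after(struct toy_page *pages, int count)
{
	int i;

	for (i = 0; i < count; i++) {
		int mt = pageblock_mt(pages[i].pfn);	/* the only lookup */

		pcp_commit(&pages[i], mt);		/* no repeat inside */
	}
}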

I have done more tests for the series and the add-on patches.  The
test results are as follows:

base
perf-profile.children.get_pfnblock_flags_mask	     0.15	(+- 32.62%)
will-it-scale.1.processes			618621.7	(+-  0.18%)

mm: page_alloc: remove pcppage migratetype caching
perf-profile.children.get_pfnblock_flags_mask	     0.40	(+- 21.55%)
will-it-scale.1.processes			616350.3	(+-  0.27%)

mm: page_alloc: fix up block types when merging compatible blocks
perf-profile.children.get_pfnblock_flags_mask	     0.36	(+-  8.36%)
will-it-scale.1.processes			617121.0	(+-  0.17%)

mm: page_alloc: move free pages when converting block during isolation
perf-profile.children.get_pfnblock_flags_mask	     0.36	(+- 15.10%)
will-it-scale.1.processes			615578.0	(+-  0.18%)

mm: page_alloc: fix move_freepages_block() range error
perf-profile.children.get_pfnblock_flags_mask	     0.36	(+- 12.78%)
will-it-scale.1.processes			615364.7	(+-  0.27%)

mm: page_alloc: fix freelist movement during block conversion
perf-profile.children.get_pfnblock_flags_mask	     0.36	(+- 10.52%)
will-it-scale.1.processes			617834.8	(+-  0.52%)

mm: page_alloc: consolidate free page accounting
perf-profile.children.get_pfnblock_flags_mask	     0.39	(+-  8.27%)
will-it-scale.1.processes			621000.0	(+-  0.13%)

mm: page_alloc: close migratetype race between freeing and stealing
perf-profile.children.get_pfnblock_flags_mask	     0.37	(+-  5.87%)
will-it-scale.1.processes			618378.8	(+-  0.17%)

mm: page_alloc: optimize free_unref_page_list()
perf-profile.children.get_pfnblock_flags_mask	     0.20	(+- 14.96%)
will-it-scale.1.processes			618136.3	(+-  0.16%)

It seems that the will-it-scale score is influenced by some other
factors too.  In any case, the series plus the add-on patches restores
the will-it-scale score, and the cycles% of get_pfnblock_flags_mask()
is almost fully restored by the final patch (mm: page_alloc: optimize
free_unref_page_list()).

Feel free to add my "Tested-by" for these patches.

--
Best Regards,
Huang, Ying
