linux-kernel - Re: [RFC PATCH V1 0/6] sched/numa: Enhance disjoint VMA scanning

Message-ID: <6d9d7ad8-58ba-1da8-a046-466b1ebfcf8e@amd.com>
Date:   Wed, 20 Sep 2023 16:12:45 +0530
From:   Raghavendra K T <raghavendra.kt@....com>
To:     Mel Gorman <mgorman@...e.de>, Peter Zijlstra <peterz@...radead.org>
Cc:     linux-kernel@...r.kernel.org, linux-mm@...ck.org,
        Ingo Molnar <mingo@...hat.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        David Hildenbrand <david@...hat.com>, rppt@...nel.org,
        Juri Lelli <juri.lelli@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Bharata B Rao <bharata@....com>,
        Aithal Srikanth <sraithal@....com>,
        kernel test robot <oliver.sang@...el.com>,
        Sapkal Swapnil <Swapnil.Sapkal@....com>,
        K Prateek Nayak <kprateek.nayak@....com>
Subject: Re: [RFC PATCH V1 0/6] sched/numa: Enhance disjoint VMA scanning

On 9/19/2023 9:52 PM, Mel Gorman wrote:
> On Tue, Sep 19, 2023 at 11:28:30AM +0200, Peter Zijlstra wrote:
>> On Tue, Aug 29, 2023 at 11:36:08AM +0530, Raghavendra K T wrote:
>>
>>> Peter Zijlstra (1):
>>>    sched/numa: Increase tasks' access history
>>>
>>> Raghavendra K T (5):
>>>    sched/numa: Move up the access pid reset logic
>>>    sched/numa: Add disjoint vma unconditional scan logic
>>>    sched/numa: Remove unconditional scan logic using mm numa_scan_seq
>>>    sched/numa: Allow recently accessed VMAs to be scanned
>>>    sched/numa: Allow scanning of shared VMAs
>>>
>>>   include/linux/mm.h       |  12 +++--
>>>   include/linux/mm_types.h |   5 +-
>>>   kernel/sched/fair.c      | 109 ++++++++++++++++++++++++++++++++-------
>>>   3 files changed, 102 insertions(+), 24 deletions(-)
>>
>> So I don't immediately see anything horrible with this. Mel, do you have
>> a few cycles to go over this as well?
> 
> I've been trying my best to find the necessary time and it's still on my
> radar for this week. 

Hello Mel,
Thank you very much for your time, for the detailed look, and for your
patches.

In summary, I will start with your patchset:
Link: https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git/ sched-numabselective-v1r5
and see whether my patches (3-6) bring any cumulative benefit on top of
it.

Below are some details in answer to your questions. Please skip them if
this is too long.

> Preliminary results don't look great for the first part
> of the series up to the patch "sched/numa: Add disjoint vma unconditional
> scan logic" even though other reports indicate the performance may be
> fixed up later in the series. For example
> 
> autonumabench
>                                     6.5.0-rc6              6.5.0-rc6
>                           sched-pidclear-v1r5   sched-forcescan-v1r5
> Min       syst-NUMA02        1.94 (   0.00%)        1.38 (  28.87%)
> Min       elsp-NUMA02       12.67 (   0.00%)       21.02 ( -65.90%)
> Amean     syst-NUMA02        2.35 (   0.00%)        1.86 (  21.13%)
> Amean     elsp-NUMA02       12.93 (   0.00%)       21.69 * -67.76%*
> Stddev    syst-NUMA02        0.54 (   0.00%)        0.90 ( -67.67%)
> Stddev    elsp-NUMA02        0.18 (   0.00%)        0.44 (-144.19%)
> CoeffVar  syst-NUMA02       22.82 (   0.00%)       48.50 (-112.58%)
> CoeffVar  elsp-NUMA02        1.38 (   0.00%)        2.01 ( -45.56%)
> Max       syst-NUMA02        3.15 (   0.00%)        3.89 ( -23.49%)
> Max       elsp-NUMA02       13.16 (   0.00%)       22.36 ( -69.91%)
> BAmean-50 syst-NUMA02        2.01 (   0.00%)        1.45 (  27.69%)
> BAmean-50 elsp-NUMA02       12.77 (   0.00%)       21.34 ( -67.04%)
> BAmean-95 syst-NUMA02        2.22 (   0.00%)        1.52 (  31.68%)
> BAmean-95 elsp-NUMA02       12.89 (   0.00%)       21.58 ( -67.39%)
> BAmean-99 syst-NUMA02        2.22 (   0.00%)        1.52 (  31.68%)
> BAmean-99 elsp-NUMA02       12.89 (   0.00%)       21.58 ( -67.39%)
> 
>                                 6.5.0-rc6             6.5.0-rc6
>                       sched-pidclear-v1r5  sched-forcescan-v1r5
> Duration User                     5702.00              10264.25
> Duration System                     17.02                 13.59
> Duration Elapsed                    92.57                156.30
> 
> Similar results seen across multiple machines. It's not universally bad
> but the NUMA02 tests appear to suffer quite badly and while not realistic,
> they are somewhat relevant because numa02 is likely an "adverse workload"
> for the logic that skips VMAs based on PID accesses.
> 
> For the rest of the series, the changelogs lacked detail on why those
> changes helped. Patch 4's changelog lacks detail and patch 6 stating
> "VMAs being accessed by more than two tasks are critical" is not helpful
> either -- e.g. why are they critical?

Agreed. For patches 5 and 6 (scanning recently accessed VMAs and shared
VMAs) there was a brief rationale in the cover letter, but perhaps it was
not enough.

More background:
I had used trace prints to understand VMA sizes, the PID hash, the success
percentage of is_vma_accessed(), and how many tasks typically access a
VMA, for some of the workloads.
(vma_size here is in KB.)

E.g.,
<...>-1451602 [116] ...1. 39195.488591: vma_fault: vma=ffff8bcab42ad7b8 pid=1451602 hash=40, success=1
<...>-1451481 [210] ..... 39196.948390: sched_numascan: comm=numa01 pid=1451481 vma = ffff8bc9228637b8 access_hist=4200000cfe66727 hashval = 26 bitmap_wt = 22, vma_size = 3153924 success = 1
<...>-1451570 [052] ...1. 39196.948725: vma_fault: vma=ffff8bc9228637b8 pid=1451570 hash=25, success=1
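
To make the hash and bitmap fields in the trace above easier to read, here
is a minimal userspace sketch. It assumes the slot is computed the way
mainline vma_is_accessed() does it, i.e. hash_32(pid, ilog2(BITS_PER_LONG))
on a 64-bit machine. The helper names pid_slot(), pid_seen() and
bitmap_weight64() are made up for illustration and are not from the patches.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same multiplier as the kernel's GOLDEN_RATIO_32 in <linux/hash.h>. */
#define GOLDEN_RATIO_32 0x61C88647u

/* Userspace copy of hash_32(): keep the top `bits` bits of the product. */
static inline uint32_t hash_32(uint32_t val, unsigned int bits)
{
	return (uint32_t)(val * GOLDEN_RATIO_32) >> (32 - bits);
}

/* Slot (0..63) that a PID occupies in a 64-bit access bitmap (assumed). */
static inline unsigned int pid_slot(uint32_t pid)
{
	return hash_32(pid, 6); /* ilog2(64) == 6 */
}

/* True if the slot for `pid` is set in the access history word. */
static inline bool pid_seen(uint64_t access_hist, uint32_t pid)
{
	return (access_hist >> pid_slot(pid)) & 1u;
}

/* Number of set slots, i.e. what the trace prints as bitmap_wt. */
static inline unsigned int bitmap_weight64(uint64_t word)
{
	unsigned int n = 0;

	for (; word; word &= word - 1) /* clear the lowest set bit */
		n++;
	return n;
}
```

With these helpers, the three PIDs in the trace map to slots 40, 26 and 25,
and access_hist=0x4200000cfe66727 has 22 bits set with bit 26 among them,
which is consistent with the printed hashval, bitmap_wt and success fields.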

1) For very large VMAs we may incur a delay in scanning the whole VMA,
because we scan only in 256MB chunks and also filter out tasks that have
not touched the VMA. So the idea was to speed up the scanning.

2) A similar rationale applies to recently accessed VMAs, i.e., not to
delay scanning of very recently accessed (hot) VMAs.

[ I did not explore using young page info, an mm walk, etc., as I thought
it may be expensive ].
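
As a rough illustration of point 1, the sketch below counts how many scan
passes a single VMA would need. It assumes each pass covers at most one
256MB chunk of that VMA (the default numa_balancing scan size), that no
other VMA competes for the pass, and that no pass is skipped by the PID
filter. passes_to_cover() is a hypothetical helper, not kernel code.

```c
#include <stdint.h>

/* Assumed default scan size per pass, in MB. */
#define SCAN_SIZE_MB 256u

/* Scan passes needed to cover a VMA of vma_kb kilobytes, rounding up. */
static inline uint64_t passes_to_cover(uint64_t vma_kb, uint64_t scan_mb)
{
	uint64_t chunk_kb = scan_mb * 1024;

	return (vma_kb + chunk_kb - 1) / chunk_kb;
}
```

For the VMA in the trace (vma_size = 3153924 KB, about 3GB),
passes_to_cover(3153924, SCAN_SIZE_MB) gives 13 passes in this best case.
Every pass skipped by the PID filter adds further delay on top of that.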

> They are obviously shared VMAs and
> therefore it may be the case that they need to be identified and interleaved
> quickly

Yes, that was mostly the idea, as mentioned above.

> but maybe not. Is the shared VMA that is critical a large malloc'd
> area split into per-thread sections or something that is MAP_SHARED? The
> changelog doesn't say so I have to guess. There are also a bunch of
> magic variables with limited explanation (e.g. why NR_ACCESS_PID_HIST==4
> and SHARED_VMA_THRESH=3?),

Those thresholds were the result of multiple experiments I did
(SHARED_VMA_THRESH = 3, 4, ...; NR_ACCESS_PID_HIST = 3, 4, etc.).

One thing I did not look into is whether I should reduce the PID_RESET
interval (because we are maintaining more history now).
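
To show how the two constants interact, here is a minimal sketch of a
rotating access history. Everything in it is an assumption made for
illustration: struct access_hist, record_fault(), rotate() and
looks_shared() are hypothetical names, and the ">= SHARED_VMA_THRESH"
comparison is only my reading of "accessed by more than two tasks". It is
not the code from the patches.

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_ACCESS_PID_HIST 4 /* history windows kept per VMA */
#define SHARED_VMA_THRESH  3 /* distinct slots that mark a VMA as shared */

/* Hypothetical per-VMA state; not the kernel's struct vma_numab_state. */
struct access_hist {
	uint64_t window[NR_ACCESS_PID_HIST];
	unsigned int cur; /* index of the window being filled */
};

/* 6-bit multiplicative hash of the PID, giving one of 64 slots. */
static inline unsigned int pid_slot(uint32_t pid)
{
	return (uint32_t)(pid * 0x61C88647u) >> 26;
}

/* Mark the faulting PID in the current window. */
static inline void record_fault(struct access_hist *h, uint32_t pid)
{
	h->window[h->cur] |= UINT64_C(1) << pid_slot(pid);
}

/* At each reset interval: advance, then clear the oldest window. */
static inline void rotate(struct access_hist *h)
{
	h->cur = (h->cur + 1) % NR_ACCESS_PID_HIST;
	h->window[h->cur] = 0;
}

/* Count distinct slots seen across all retained windows. */
static inline unsigned int distinct_slots(const struct access_hist *h)
{
	uint64_t all = 0;
	unsigned int n = 0;

	for (int i = 0; i < NR_ACCESS_PID_HIST; i++)
		all |= h->window[i];
	for (; all; all &= all - 1)
		n++;
	return n;
}

static inline bool looks_shared(const struct access_hist *h)
{
	return distinct_slots(h) >= SHARED_VMA_THRESH;
}
```

In this model a PID stays visible for up to NR_ACCESS_PID_HIST reset
intervals, which is why a longer history and the reset interval probably
need to be tuned together.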

> the numab fields are not documented

Agreed, I should have documented them earlier.

> and the
> changelogs lack supporting data. I suspect that patches 3-6 may be dealing
> with regressions introduced by patch 2, particularly for NUMA02, but I'm

TBH, I did not set out to worsen numa02 and then improve it later.
This is the data I had for the full patchset.

autonumabench
                              base                   patched
Min       syst-NUMA02        0.99 (   0.00%)        0.99 (   0.00%)
Min       elsp-NUMA02        3.04 (   0.00%)        3.04 (   0.00%)
Amean     syst-NUMA02        1.06 (   0.00%)        1.05 *   1.08%*
Amean     elsp-NUMA02        3.80 (   0.00%)        3.39 *  10.68%*
Stddev    syst-NUMA02        0.10 (   0.00%)        0.07 (  24.57%)
Stddev    elsp-NUMA02        0.73 (   0.00%)        0.34 (  52.86%)
CoeffVar  syst-NUMA02        9.04 (   0.00%)        6.89 (  23.75%)
CoeffVar  elsp-NUMA02       19.25 (   0.00%)       10.16 (  47.22%)
Max       syst-NUMA02        1.27 (   0.00%)        1.21 (   4.72%)
Max       elsp-NUMA02        4.91 (   0.00%)        4.04 (  17.72%)
BAmean-50 syst-NUMA02        1.00 (   0.00%)        1.01 (  -0.66%)
BAmean-50 elsp-NUMA02        3.21 (   0.00%)        3.12 (   2.60%)
BAmean-95 syst-NUMA02        1.03 (   0.00%)        1.02 (   0.32%)
BAmean-95 elsp-NUMA02        3.61 (   0.00%)        3.28 (   9.09%)
BAmean-99 syst-NUMA02        1.03 (   0.00%)        1.02 (   0.32%)
BAmean-99 elsp-NUMA02        3.61 (   0.00%)        3.28 (   9.09%)

Duration User        1555.24     1377.57
Duration System         8.10        7.99
Duration Elapsed       30.86       26.49
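
For reading the Duration rows, which carry no percentage column, a small
sketch of the usual relative-gain arithmetic for lower-is-better metrics
follows. pct_gain() is a hypothetical helper. The inputs are the rounded
values shown in this mail, so results can differ slightly from percentages
that were presumably computed from unrounded data.

```c
/* Relative gain in percent for a lower-is-better metric.
 * Positive means the patched kernel was faster. */
static inline double pct_gain(double base, double patched)
{
	return (base - patched) / base * 100.0;
}
```

Applied to the elapsed durations, pct_gain(30.86, 26.49) is about 14.2 for
the full patchset in my runs, while pct_gain(92.57, 156.30) is about -68.8
for the partial series in Mel's runs.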

But then I saw the results from the kernel test robot, which compared
individual patches:

commit:
   2f88c8e802 ("(tip/sched/core) sched/eevdf/doc: Modify the documented knob to base_slice_ns as well")
   2a806eab1c ("sched/numa: Move up the access pid reset logic")
   1ef5cbb92b ("sched/numa: Add disjoint vma unconditional scan logic")
   68cfe9439a ("sched/numa: Allow scanning of shared VMAs")


2f88c8e802c8b128 2a806eab1c2e1c9f0ae39dc0307 1ef5cbb92bdb320c5eb9fdee1a8 68cfe9439a1baa642e05883fa64
---------------- --------------------------- --------------------------- ---------------------------
         %stddev     %change         %stddev     %change         %stddev     %change         %stddev
             \          |                \          |                \          |                \
    271.01            +0.8%     273.24            -0.7%     269.00           -26.4%     199.49 ±  3%  autonuma-benchmark.numa01.seconds
     76.28            +0.2%      76.44           -11.7%      67.36 ±  6%     -46.9%      40.49 ±  5%  autonuma-benchmark.numa01_THREAD_ALLOC.seconds
      8.11            -0.9%       8.04            -0.7%       8.05            -0.1%       8.10        autonuma-benchmark.numa02.seconds
      1425            +0.7%       1434            -3.1%       1381           -30.1%     996.02 ±  2%  autonuma-benchmark.time.elapsed_time
I do see some negligible overhead from the first patch, but the second
patch still gave some improvement.

My observation with the patchset was an increase in system time because
of the additional scanning we re-introduced, but this was still 2x better
than where we started, without the numascan enhancements.

> not certain as I didn't dedicate the necessary test time to prove that
> and it's the type of information that should be in the changelog. While
> there is nothing wrong with that as such, it's very hard to imagine how
> patches 3-6 work in every case and be certain that the various parameters
> make sense. That could cause difficulties later in terms of maintenance.
>

Agreed regarding maintenance.

> My initial thinking was "There should be a standalone series that deals
> *only* with scanning VMAs that had no fault activity and skipped due to
> PID hashing". These are important because there may be no fault activity
> because there is no scan activity which is due to no fault activity. The
> series is incomplete and without changelogs but I pushed it anyway to
> 

Agreed.

> https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git/ sched-numabselective-v1r5
> 

Thanks. The patches are simple to start with (1-4), with a force scan in
patch 5. I will experiment with these.

> The first two patches simply improve the documentation on what is going
> on, patch 3 adds a tracepoint for figuring out why VMAs were skipped or
> not skipped. Patch 4 handles a corner case to complete the scan of a VMA
> once it has started regardless of what task is doing the scanning. The
> last patch scans VMAs that have seen no fault activity once active VMAs
> have been scanned.
>
> It has its weaknesses because it may be overly simplistic and it forces
> all VMAs to be scanned on every sequence which is wasteful. It also hurts
> NUMA02 performance, although not as badly as "sched/numa: Add disjoint
> vma unconditional scan logic". On the plus side, it is easier to reason
> about, it solves only one problem in the series and any patch on top or
> modification should justify each change individually.
> 
Is there anything else you have in mind that I should look into, apart
from the above (rebasing onto your patches and experimenting with my
patches 3-6 for any cumulative improvements)?

Thanks and Regards
- Raghu