linux-kernel - Re: [RFC PATCH V3 0/1] sched/numa: Fix disjoint set vma scan regression

Message-ID: <53f3872a-4cbf-563a-2658-9222586680da@amd.com>
Date:   Wed, 7 Jun 2023 17:10:53 +0530
From:   Sapkal Swapnil <Swapnil.Sapkal@....com>
To:     Raghavendra K T <raghavendra.kt@....com>,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org
Cc:     Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Mel Gorman <mgorman@...e.de>,
        Andrew Morton <akpm@...ux-foundation.org>,
        David Hildenbrand <david@...hat.com>, rppt@...nel.org,
        Juri Lelli <juri.lelli@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Bharata B Rao <bharata@....com>,
        Aithal Srikanth <sraithal@....com>,
        kernel test robot <oliver.sang@...el.com>
Subject: Re: [RFC PATCH V3 0/1] sched/numa: Fix disjoint set vma scan
 regression

Hello Raghavendra,

On 5/31/2023 9:55 AM, Raghavendra K T wrote:
> With the numa scan enhancements [1], only threads that have previously
> accessed a vma are allowed to scan it.
> 
> While this significantly reduced system time overhead, there are corner
> cases that genuinely need some relaxation. One example is the concern
> raised by PeterZ: unfairness among threads belonging to disjoint sets of
> vmas can amplify the side effects, leaving vma regions that belong to
> some of the tasks unscanned.
> 
> [1] handled that issue by allowing the first two scans at mm level
> (mm->numa_scan_seq) unconditionally, but that was not enough.
> 
> One test that exercises a similar side effect is numa01_THREAD_ALLOC, where
> the main thread performs the allocation and the memory is divided into 24MB
> chunks that are continuously bzeroed (for 128 threads on my machine).
> 
> This was found in an internal LKP run and was also reported in [4].
> 
> While RFC V1 [2] tried to address this issue, its logic relied more on
> heuristics. RFC V2 [3] was rewritten based on vma_size.
> 
> The current implementation drops some of the additional logic for
> long-running tasks and revisits some uses of READ_ONCE()/WRITE_ONCE().
> 
> The current patch addresses the same issue in a more accurate way as
> follows:
> 
> (1) A task that tries to scan a disjoint vma it is not associated with is
> now allowed to induce prot_none faults. The total number of such
> unconditional scans allowed per vma is derived from the exact vma size
> as follows:
> 
> total scans allowed = 1/2 * vma_size / scan_size.
> 
> (2) The number of scans already done is tracked with a per-vma scan counter.
> 
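
The scan budget in (1) and the counter in (2) can be sketched in plain
userspace C as follows. This is only an illustration, not the code from
the patch: the helper names max_unconditional_scans() and
unconditional_scan_allowed() are hypothetical, sizes are assumed to be in
bytes, and the integer rounding (divide first, then halve) is an
assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of unconditional scans allowed for one vma,
 * following "total scans allowed = 1/2 * vma_size / scan_size".
 * Both sizes are assumed to be in bytes; integer division rounds down.
 */
static uint64_t max_unconditional_scans(uint64_t vma_size, uint64_t scan_size)
{
	if (scan_size == 0)
		return 0;	/* avoid dividing by zero; no budget in that case */

	return (vma_size / scan_size) / 2;
}

/*
 * Hypothetical helper: a scan is allowed unconditionally while the
 * per-vma counter of scans already done is still below the budget.
 */
static bool unconditional_scan_allowed(uint64_t scans_done,
				       uint64_t vma_size, uint64_t scan_size)
{
	return scans_done < max_unconditional_scans(vma_size, scan_size);
}
```

For instance, if the 128 chunks of 24MB from numa01_THREAD_ALLOC were to
sit in a single 3GB vma and the scan size were 256MB (both assumptions),
the budget would be (3072 / 256) / 2 = 6 unconditional scans.
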
> With the above patch, the reported numa01_THREAD_ALLOC regression is
> resolved. Please note, however, that [1] brought a drastic decrease in
> system time for mmtests numa01, and this patch adds back some of that
> system time.
> 
> Summary: the numa scan enhancement patch [1] together with the current
> patchset improves overall system time by filtering out unnecessary numa
> scans while still retaining the necessary scanning in some corner cases
> that involve disjoint sets of vmas.
> 
> Your comments/Ideas are welcome.
> 
> Changes since:
> RFC V2:
> 1) Drop reset of scan counter that tried to take care of long running workloads
> 2) Correct usage of READ_ONCE/WRITE_ONCE (Bharata)
> 3) Base is 6.4.0-rc2
> 
> RFC V1:
> 1) Rewrite entire logic based on actual vma size rather than heuristics
> 2) Added Reported-by kernel test robot and internal LKP test
> 3) Rebased to 6.4.0-rc1 (ba0ad6ed89)
> 
> Result:
> SUT: Milan w/ 2 numa nodes 256 cpus
> 
> Run of numa01_THREAD_ALLOC on 6.4.0-rc2 (which has the numascan enhancement)
>                  	base-numascan	base		base+fix
> real    		1m1.507s	1m23.259s	1m2.632s
> user    		213m51.336s	251m46.363s	220m35.528s
> sys     		3m3.397s	0m12.492s	2m41.393s
> 
> numa_hit 		5615517		4560123		4963875
> numa_local 		5615505		4560024		4963700
> numa_other 		12		99		175
> numa_pte_updates 	1822797		493		1559111
> numa_hint_faults 	1307113		523		1469031
> numa_hint_faults_local 	612617		488		884829
> numa_pages_migrated 	694370		35		584202
> 
> We can see that the real-time regression in base is recovered, but with
> some additional system time overhead.
> 
> Below is the mmtests autonumabench performance:
> 
> autonumabench
> ===========
> (base 6.4.0-rc2 that has numascan enhancement)
> 					base-numascan		base			base+fix
> Amean     syst-NUMA01                  300.46 (   0.00%)       23.97 *  92.02%*       67.18 *  77.64%*
> Amean     syst-NUMA01_THREADLOCAL        0.20 (   0.00%)        0.22 *  -9.15%*        0.22 *  -9.15%*
> Amean     syst-NUMA02                    0.70 (   0.00%)        0.71 *  -0.61%*        0.70 *   0.41%*
> Amean     syst-NUMA02_SMT                0.58 (   0.00%)        0.62 *  -5.38%*        0.61 *  -3.67%*
> Amean     elsp-NUMA01                  320.92 (   0.00%)      276.13 *  13.96%*      324.11 *  -0.99%*
> Amean     elsp-NUMA01_THREADLOCAL        1.02 (   0.00%)        1.03 *  -1.83%*        1.03 *  -1.83%*
> Amean     elsp-NUMA02                    3.16 (   0.00%)        3.93 * -24.20%*        3.14 *   0.81%*
> Amean     elsp-NUMA02_SMT                3.82 (   0.00%)        3.87 *  -1.27%*        3.44 *   9.90%*
> 
> Duration User      403532.43   279173.53   359098.23
> Duration System      2114.31      179.20      481.54
> Duration Elapsed     2312.20     2004.48     2335.84
> 
> Ops NUMA alloc hit                  55795455.00    45452739.00    45500387.00
> Ops NUMA alloc local                55794177.00    45435858.00    45500070.00
> Ops NUMA base-page range updates   147858285.00       18601.00    42043107.00
> Ops NUMA PTE updates               147858285.00       18601.00    42043107.00
> Ops NUMA hint faults               150531983.00       18254.00    42450080.00
> Ops NUMA hint local faults %       125691825.00       11964.00    32993313.00
> Ops NUMA hint local percent               83.50          65.54          77.72
> Ops NUMA pages migrated             13535786.00        2207.00     4654628.00
> Ops AutoNUMA cost                     753952.10          91.44      212633.14
> 
> Please note that some system time overhead is added for numa01, but we
> still have a very good improvement w.r.t. base without numascan.
> 

I tested the patch with lkp autonuma benchmark on a dual socket 4th 
Generation EPYC server (2 X 96C/192T) running in NPS1 mode. Below are 
the results:

commit:
   6.4.0-rc2
   6.4.0-rc2+patch

       6.4.0-rc2            6.4.0-rc2+patch
---------------- ---------------------------
          %stddev     %change         %stddev
              \          |                \
     501.84           -12.5%     439.14       numa01.seconds
     228.66            -1.8%     224.44       numa01_THREAD_ALLOC.seconds
       0.51           +21.6%       0.62       numa02.seconds
     107.17            +0.0%     107.17       numa02_SMT.seconds
       2936            -9.1%       2669       elapsed_time
     794910            +3.7%     824178       system_time
     474520           -17.5%     391331       user_time
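
The %change column above appears to be the relative difference between
the two kernels, which can be checked with a small helper. A minimal
sketch, assuming %change = (patched - base) / base * 100; the function
name percent_change() is hypothetical and not part of lkp.

```c
/*
 * Hypothetical helper: relative change of the patched value against the
 * base value, in percent. Assumes base is non-zero.
 */
static double percent_change(double base, double patched)
{
	return (patched - base) / base * 100.0;
}
```

For numa01.seconds, percent_change(501.84, 439.14) comes to about -12.5,
matching the table; for the *.seconds rows a negative value means the
patched kernel finished sooner.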

Tested-by: Swapnil Sapkal <swapnil.sapkal@....com>

> [1] Link: https://lore.kernel.org/lkml/cover.1677672277.git.raghavendra.kt@amd.com/T/#t
> [2] Link: https://lore.kernel.org/lkml/cover.1683033105.git.raghavendra.kt@amd.com/
> [3] Link: https://lore.kernel.org/lkml/cover.1684228065.git.raghavendra.kt@amd.com/T/
> [4] Link: https://lore.kernel.org/lkml/db995c11-08ba-9abf-812f-01407f70a5d4@amd.com/T/
> 
> Raghavendra K T (1):
>    sched/numa: Fix disjoint set vma scan regression
> 
>   include/linux/mm_types.h |  1 +
>   kernel/sched/fair.c      | 31 ++++++++++++++++++++++++-------
>   2 files changed, 25 insertions(+), 7 deletions(-)
> 
--
Thanks and regards,
Swapnil