Message-ID: <53f3872a-4cbf-563a-2658-9222586680da@amd.com>
Date: Wed, 7 Jun 2023 17:10:53 +0530
From: Sapkal Swapnil <Swapnil.Sapkal@....com>
To: Raghavendra K T <raghavendra.kt@....com>,
linux-kernel@...r.kernel.org, linux-mm@...ck.org
Cc: Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Mel Gorman <mgorman@...e.de>,
Andrew Morton <akpm@...ux-foundation.org>,
David Hildenbrand <david@...hat.com>, rppt@...nel.org,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Bharata B Rao <bharata@....com>,
Aithal Srikanth <sraithal@....com>,
kernel test robot <oliver.sang@...el.com>
Subject: Re: [RFC PATCH V3 0/1] sched/numa: Fix disjoint set vma scan regression

Hello Raghavendra,

On 5/31/2023 9:55 AM, Raghavendra K T wrote:
> With the numa scan enhancements [1], only the threads which had previously
> accessed a vma are allowed to scan it.
>
> While this significantly reduced system time overhead, there were corner
> cases that genuinely need some relaxation. For example, PeterZ raised a
> concern about unfairness among threads belonging to disjoint sets of vmas,
> which can amplify the side effects and leave vma regions belonging to some
> of the tasks unscanned.
>
> [1] had handled that issue by allowing the first two scans at mm level
> (mm->numa_scan_seq) unconditionally. But that was not enough.
>
> One of the tests that exercises a similar side effect is numa01_THREAD_ALLOC,
> where allocation is done by the main thread and the memory is divided into
> 24MB chunks (for 128 threads on my machine) that are continuously bzeroed.
>
> This was found in an internal LKP run and was also reported in [4].
>
> While RFC V1 [2] tried to address this issue, its logic relied more on
> heuristics. RFC V2 [3] was rewritten based on vma_size.
>
> The current implementation drops some of the additional logic for
> long-running tasks and revisits some of the READ_ONCE()/WRITE_ONCE() usage.
>
> The current patch addresses the same issue in a more accurate way as
> follows:
>
> (1) A task that tries to scan a disjoint vma, i.e. one not associated with
> that task, is now allowed to induce prot_none faults. The total number of
> such unconditional scans allowed per vma is derived from the exact vma size
> as follows:
>
> total scans allowed = 1/2 * vma_size / scan_size.
>
> (2) Total scans already done is maintained using a per vma scan counter.
>
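To make the arithmetic in (1) and (2) concrete, here is a minimal userspace
sketch. It is not the code from the patch: struct vma_model, scans_allowed()
and allow_unconditional_scan() are hypothetical names used only for
illustration, and the sizes in the note below are assumed example values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model of a vma; not the kernel's vm_area_struct. */
struct vma_model {
	uint64_t size_bytes;	/* vma_size from the formula in (1) */
	uint64_t scans_done;	/* per-vma scan counter from (2) */
};

/* total scans allowed = 1/2 * vma_size / scan_size, using integer division */
static uint64_t scans_allowed(uint64_t vma_size, uint64_t scan_size)
{
	assert(scan_size != 0);
	return vma_size / (2 * scan_size);
}

/*
 * Returns true and bumps the counter while the vma still has unconditional
 * scans left; returns false once the budget is used up.
 */
static bool allow_unconditional_scan(struct vma_model *vma, uint64_t scan_size)
{
	if (vma->scans_done >= scans_allowed(vma->size_bytes, scan_size))
		return false;
	vma->scans_done++;
	return true;
}
```

With an assumed 1024MB vma and an assumed 256MB scan_size, this model allows
1024 / (2 * 256) = 2 unconditional scans and refuses the third.
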
> With the above patch, the reported numa01_THREAD_ALLOC regression is
> resolved. Please note, however, that [1] brought a drastic decrease in
> system time for mmtest numa01, and this patch adds back some of that
> system time.
>
> Summary: numa scan enhancement patch [1] together with the current patchset
> improves overall system time by filtering unnecessary numa scans
> while still retaining necessary scanning in some corner cases which
> involve disjoint sets of vmas.
>
> Your comments/Ideas are welcome.
>
> Changes since:
> RFC V2:
> 1) Drop reset of scan counter that tried to take care of long running workloads
> 2) Correct usage of READ_ONCE/WRITE_ONCE (Bharata)
> 3) Base is 6.4.0-rc2
>
> RFC V1:
> 1) Rewrite entire logic based on actual vma size rather than heuristics
> 2) Added Reported-by kernel test robot and internal LKP test
> 3) Rebased to 6.4-rc1 (ba0ad6ed89)
>
> Result:
> SUT: Milan w/ 2 numa nodes 256 cpus
>
> Run of numa01_THREAD_ALLOC on 6.4.0-rc2 (which has the numascan enhancement)
>                         base-numascan  base         base+fix
> real                    1m1.507s       1m23.259s    1m2.632s
> user                    213m51.336s    251m46.363s  220m35.528s
> sys                     3m3.397s       0m12.492s    2m41.393s
>
> numa_hit                5615517        4560123      4963875
> numa_local              5615505        4560024      4963700
> numa_other              12             99           175
> numa_pte_updates        1822797        493          1559111
> numa_hint_faults        1307113        523          1469031
> numa_hint_faults_local  612617         488          884829
> numa_pages_migrated     694370         35           584202
>
> We can see that the real-time regression in base is recovered, but with
> some additional system time overhead.
>
> Below is the mmtest autonuma performance:
>
> autonumabench
> =============
> (base 6.4.0-rc2 that has numascan enhancement)
>                                base-numascan      base               base+fix
> Amean syst-NUMA01              300.46 ( 0.00%)     23.97 * 92.02%*    67.18 * 77.64%*
> Amean syst-NUMA01_THREADLOCAL    0.20 ( 0.00%)      0.22 * -9.15%*     0.22 * -9.15%*
> Amean syst-NUMA02                0.70 ( 0.00%)      0.71 * -0.61%*     0.70 * 0.41%*
> Amean syst-NUMA02_SMT            0.58 ( 0.00%)      0.62 * -5.38%*     0.61 * -3.67%*
> Amean elsp-NUMA01              320.92 ( 0.00%)    276.13 * 13.96%*   324.11 * -0.99%*
> Amean elsp-NUMA01_THREADLOCAL    1.02 ( 0.00%)      1.03 * -1.83%*     1.03 * -1.83%*
> Amean elsp-NUMA02                3.16 ( 0.00%)      3.93 * -24.20%*    3.14 * 0.81%*
> Amean elsp-NUMA02_SMT            3.82 ( 0.00%)      3.87 * -1.27%*     3.44 * 9.90%*
>
>                                   base-numascan          base      base+fix
> Duration User                         403532.43     279173.53     359098.23
> Duration System                         2114.31        179.20        481.54
> Duration Elapsed                        2312.20       2004.48       2335.84
>
> Ops NUMA alloc hit                  55795455.00   45452739.00   45500387.00
> Ops NUMA alloc local                55794177.00   45435858.00   45500070.00
> Ops NUMA base-page range updates   147858285.00      18601.00   42043107.00
> Ops NUMA PTE updates               147858285.00      18601.00   42043107.00
> Ops NUMA hint faults               150531983.00      18254.00   42450080.00
> Ops NUMA hint local faults %       125691825.00      11964.00   32993313.00
> Ops NUMA hint local percent               83.50         65.54         77.72
> Ops NUMA pages migrated             13535786.00       2207.00    4654628.00
> Ops AutoNUMA cost                     753952.10         91.44     212633.14
>
> Please note there is a system time overhead added for numa01 but we still have very
> good improvement w.r.t base without numascan.
>
I tested the patch with the lkp autonuma benchmark on a dual-socket 4th
Generation EPYC server (2 x 96C/192T) running in NPS1 mode. Below are
the results:

commit:
  6.4.0-rc2
  6.4.0-rc2+patch

       6.4.0-rc2             6.4.0-rc2+patch
---------------- ---------------------------
         %stddev     %change         %stddev
             \          |                \
    501.84           -12.5%     439.14        numa01.seconds
    228.66            -1.8%     224.44        numa01_THREAD_ALLOC.seconds
      0.51           +21.6%       0.62        numa02.seconds
    107.17            +0.0%     107.17        numa02_SMT.seconds
      2936            -9.1%       2669        elapsed_time
    794910            +3.7%     824178        system_time
    474520           -17.5%     391331        user_time

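For reading the %change column: the values above are consistent with the
relative change of the patched kernel against the base, in percent. A small
sketch of that arithmetic follows. percent_change() is a hypothetical helper
written for illustration and is not part of lkp.

```c
#include <assert.h>

/*
 * Relative change of the patched value against the base value, in percent.
 * For time-based metrics a negative result means the patched kernel was faster.
 */
static double percent_change(double base, double patched)
{
	assert(base != 0.0);
	return (patched - base) / base * 100.0;
}
```

For numa01.seconds this gives (439.14 - 501.84) / 501.84 * 100, which rounds
to the -12.5% shown in the table.
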
Tested-by: Swapnil Sapkal <swapnil.sapkal@....com>
> [1] Link: https://lore.kernel.org/lkml/cover.1677672277.git.raghavendra.kt@amd.com/T/#t
> [2] Link: https://lore.kernel.org/lkml/cover.1683033105.git.raghavendra.kt@amd.com/
> [3] Link: https://lore.kernel.org/lkml/cover.1684228065.git.raghavendra.kt@amd.com/T/
> [4] Link: https://lore.kernel.org/lkml/db995c11-08ba-9abf-812f-01407f70a5d4@amd.com/T/
>
> Raghavendra K T (1):
> sched/numa: Fix disjoint set vma scan regression
>
> include/linux/mm_types.h | 1 +
> kernel/sched/fair.c | 31 ++++++++++++++++++++++++-------
> 2 files changed, 25 insertions(+), 7 deletions(-)
>
--
Thanks and regards,
Swapnil