Message-ID: <6ef38c6e-e47d-66c2-216a-76ab4a59feb1@amd.com>
Date: Mon, 23 Oct 2023 10:55:55 +0530
From: Raghavendra K T <raghavendra.kt@....com>
To: linux-kernel@...r.kernel.org, linux-mm@...ck.org
Cc: Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Mel Gorman <mgorman@...e.de>,
Andrew Morton <akpm@...ux-foundation.org>,
David Hildenbrand <david@...hat.com>, rppt@...nel.org,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Bharata B Rao <bharata@....com>,
Aithal Srikanth <sraithal@....com>,
kernel test robot <oliver.sang@...el.com>,
Sapkal Swapnil <Swapnil.Sapkal@....com>,
K Prateek Nayak <kprateek.nayak@....com>
Subject: Re: [PATCH V1 0/1] sched/numa: Fix mm numa_scan_seq based
unconditional scan
On 10/20/2023 9:27 PM, Raghavendra K T wrote:
> NUMA balancing code that updates PTEs by allowing unconditional scan
> based on the value of processes' mm numa_scan_seq is not perfect.
>
> More description is in patch1.
>
> I used the patch below to identify the corner case.
>
> Detailed results (only part of these is included in patch1, to keep
> the commit log short):
>
> SUT: AMD EPYC Milan with 2 NUMA nodes, 256 CPUs.
>
> Base kernel: upstream 6.6-rc6 (dd72f9c7e512) with Mel's patch series
> from tip/sched/core [1] applied.
>
> Summary: Some benchmarks improve. System time increases due to the
> additional scanning, but elapsed time shows a gain.
>
> However, some overhead is also seen for benchmarks like NUMA01.
>
> kernbench
> =========
>                                       base                 patched
> Amean user-128                    13799.58 (   0.00%)     13789.86 *   0.07%*
> Amean syst-128                     3280.80 (   0.00%)      3249.67 *   0.95%*
> Amean elsp-128                      165.09 (   0.00%)       164.78 *   0.19%*
>
> Duration User                     41404.28                41375.08
> Duration System                    9862.22                 9768.48
> Duration Elapsed                    519.87                  518.72
>
> Ops NUMA PTE updates            1041416.00               831536.00
> Ops NUMA hint faults             263296.00               220966.00
> Ops NUMA pages migrated          258021.00               212769.00
> Ops AutoNUMA cost                  1328.67                 1114.69
>
> autonumabench
>
> NUMA01_THREADLOCAL
> ==================
> Amean syst-NUMA01_THREADLOCAL        10.65 (   0.00%)        26.47 *-148.59%*
> Amean elsp-NUMA01_THREADLOCAL        81.79 (   0.00%)        67.74 *  17.18%*
>
> Duration User                     54832.73                47379.67
> Duration System                      75.00                  185.75
> Duration Elapsed                    576.72                  476.09
>
> Ops NUMA PTE updates             394429.00             11121044.00
> Ops NUMA hint faults               1001.00              8906404.00
> Ops NUMA pages migrated             288.00              2998694.00
> Ops AutoNUMA cost                     7.77                44666.84
>
> NUMA01
> ======
> Amean syst-NUMA01                    31.97 (   0.00%)        52.95 * -65.62%*
> Amean elsp-NUMA01                   143.16 (   0.00%)       150.81 *  -5.34%*
>
> Duration User                     84839.49                91342.19
> Duration System                     224.26                  371.12
> Duration Elapsed                   1005.64                 1059.01
>
> Ops NUMA PTE updates           33929508.00             50116313.00
> Ops NUMA hint faults           34993820.00             52895783.00
> Ops NUMA pages migrated         5456115.00              7441228.00
> Ops AutoNUMA cost                175310.27               264971.11
>
> NUMA02
> ======
> Amean syst-NUMA02                     0.86 (   0.00%)         0.86 *  -0.50%*
> Amean elsp-NUMA02                     3.99 (   0.00%)         3.82 *   4.40%*
>
> Duration User                      1186.06                 1092.07
> Duration System                       6.44                    6.47
> Duration Elapsed                     31.28                   30.30
>
> Ops NUMA PTE updates                776.00                  731.00
> Ops NUMA hint faults                527.00                  490.00
> Ops NUMA pages migrated             183.00                  153.00
> Ops AutoNUMA cost                     2.64                    2.46
>
> [1] Link: https://lore.kernel.org/linux-mm/ZSXF3AFZgIld1meX@gmail.com/T/
>
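
The percentages in the Amean rows above appear to follow the usual
lower-is-better convention, (base - patched) / base, so a positive value
is a gain. A minimal C sketch of that arithmetic, assuming this
convention; gain_percent() is an illustrative name, not part of mmtests
or of the patch:

```c
#include <assert.h>

/*
 * Relative gain of `patched` over `base`, in percent, for metrics where
 * lower is better (times). Positive means the patched kernel is faster.
 * Assumption: this mirrors how the Amean percentage column is derived.
 */
static double gain_percent(double base, double patched)
{
	assert(base > 0.0);	/* a zero baseline has no meaningful ratio */
	return (base - patched) / base * 100.0;
}
```

For example, gain_percent(81.79, 67.74) gives about 17.18, in line with
the elsp-NUMA01_THREADLOCAL row. A few rows (such as
syst-NUMA01_THREADLOCAL) differ in the second decimal, presumably
because the table prints rounded means.
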
I forgot to add the skip_vma_count trace results:

autonumabench NUMA01_THREADLOCAL, 3 iterations:
base:
  inaccessible:  13133
  pid_inactive:  15807
  scan_delay:      471
  seq_completed:    50
  shared_ro:      6983
  unsuitable:     3917

patched:
  inaccessible:   4727
  pid_inactive:   5119
  scan_delay:      455
  seq_completed:     7
  shared_ro:      2551
  unsuitable:     5402
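
To compare the two runs at a glance, the counts above can be totalled
per kernel and differenced per skip reason. A small C sketch; the struct
and function names are illustrative only, and the numbers are copied
verbatim from the lists above:

```c
#include <assert.h>
#include <stddef.h>

/* One skip reason with its count on the base and the patched kernel. */
struct skip_count {
	const char *reason;
	long base;
	long patched;
};

/* Counts from the skip_vma_count trace summary (3 iterations). */
static const struct skip_count skip_counts[] = {
	{ "inaccessible",  13133, 4727 },
	{ "pid_inactive",  15807, 5119 },
	{ "scan_delay",      471,  455 },
	{ "seq_completed",    50,    7 },
	{ "shared_ro",      6983, 2551 },
	{ "unsuitable",     3917, 5402 },
};

#define NR_SKIP_REASONS (sizeof(skip_counts) / sizeof(skip_counts[0]))

/* Sum of all skip counts; pass non-zero to total the patched column. */
static long total_skips(int patched)
{
	long sum = 0;

	for (size_t i = 0; i < NR_SKIP_REASONS; i++)
		sum += patched ? skip_counts[i].patched : skip_counts[i].base;
	return sum;
}

/* patched - base for reason i; negative means fewer skips when patched. */
static long skip_delta(size_t i)
{
	assert(i < NR_SKIP_REASONS);
	return skip_counts[i].patched - skip_counts[i].base;
}
```

With these inputs the totals come to 40361 skips on base and 18261 on
patched; seq_completed drops by 43, while unsuitable rises by 1485.
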
> Raghavendra K T (1):
> sched/numa: Fix mm numa_scan_seq based unconditional scan
>
> include/linux/mm_types.h | 3 +++
> kernel/sched/fair.c | 4 +++-
> 2 files changed, 6 insertions(+), 1 deletion(-)
>
> ---8<---
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index 010ba1b7cb0e..a4870b01c8a1 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -10,6 +10,30 @@
> #include <linux/tracepoint.h>
> #include <linux/binfmts.h>
>
> +TRACE_EVENT(sched_vma_start_seq,
> +
> +	TP_PROTO(struct task_struct *t, struct vm_area_struct *vma, int start_seq),
> +
> +	TP_ARGS(t, vma, start_seq),
> +
> +	TP_STRUCT__entry(
> +		__array( char, comm, TASK_COMM_LEN )
> +		__field( pid_t, pid )
> +		__field( void *, vma )
> +		__field( int, start_seq )
> +	),
> +
> +	TP_fast_assign(
> +		memcpy(__entry->comm, t->comm, TASK_COMM_LEN);
> +		__entry->pid = t->pid;
> +		__entry->vma = vma;
> +		__entry->start_seq = start_seq;
> +	),
> +
> +	TP_printk("comm=%s pid=%d vma = %px start_seq=%d", __entry->comm, __entry->pid, __entry->vma,
> +		__entry->start_seq)
> +);
> +
> /*
> * Tracepoint for calling kthread_stop, performed to end a kthread:
> */
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c8af3a7ccba7..e0c16ea8470b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3335,6 +3335,7 @@ static void task_numa_work(struct callback_head *work)
> continue;
>
> vma->numab_state->start_scan_seq = mm->numa_scan_seq;
> + trace_sched_vma_start_seq(p, vma, mm->numa_scan_seq);
>
> vma->numab_state->next_scan = now +
> msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
>
>
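
For post-processing the output of the debug tracepoint above, the
message part of a trace line can be parsed with sscanf() using the same
layout as the TP_printk() format string. A hedged C sketch: the struct
and function names are hypothetical, it assumes %px prints a plain hex
value, and a comm containing whitespace would be cut at the first space:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define COMM_LEN 16	/* same size as TASK_COMM_LEN in the kernel */

/* Fields carried by one sched_vma_start_seq event. */
struct vma_start_seq_event {
	char comm[COMM_LEN];
	int pid;
	unsigned long long vma;	/* VMA address as printed by %px */
	int start_seq;
};

/* Returns 0 on success, -1 if the line does not match the format. */
static int parse_vma_start_seq(const char *line,
			       struct vma_start_seq_event *ev)
{
	const char *msg;

	assert(line && ev);

	/* Skip any prefix (task, CPU, timestamp) before the message. */
	msg = strstr(line, "comm=");
	if (!msg)
		return -1;

	/* %15s bounds the copy to COMM_LEN - 1 characters plus the NUL. */
	if (sscanf(msg, "comm=%15s pid=%d vma = %llx start_seq=%d",
		   ev->comm, &ev->pid, &ev->vma, &ev->start_seq) != 4)
		return -1;
	return 0;
}
```

Grouping the parsed events by the vma field then shows which
numa_scan_seq value each VMA recorded as its start_scan_seq.
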