linux-kernel - Re: [v3 2/3] mm: Defer TLB flush by keeping both src and dst folios at migration


Message-ID: <a8337371-50ed-4618-b48e-78b96d18810f@redhat.com>
Date:   Mon, 30 Oct 2023 09:00:56 +0100
From:   David Hildenbrand <david@...hat.com>
To:     Byungchul Park <byungchul@...com>, linux-kernel@...r.kernel.org,
        linux-mm@...ck.org
Cc:     kernel_team@...ynix.com, akpm@...ux-foundation.org,
        ying.huang@...el.com, namit@...are.com, xhao@...ux.alibaba.com,
        mgorman@...hsingularity.net, hughd@...gle.com, willy@...radead.org,
        peterz@...radead.org, luto@...nel.org, tglx@...utronix.de,
        mingo@...hat.com, bp@...en8.de, dave.hansen@...ux.intel.com
Subject: Re: [v3 2/3] mm: Defer TLB flush by keeping both src and dst folios
 at migration

On 30.10.23 08:25, Byungchul Park wrote:
> Implementation of CONFIG_MIGRC that stands for 'Migration Read Copy'.
> We always face the migration overhead at either promotion or demotion,
> while working with tiered memory e.g. CXL memory and found out TLB
> shootdown is a quite big one that is needed to get rid of if possible.
> 
> Fortunately, TLB flush can be deferred or even skipped if both source and
> destination of folios during migration are kept until all TLB flushes
> required will have been done, of course, only if the target PTE entries
> have read only permission, more precisely speaking, don't have write
> permission. Otherwise, no doubt the folio might get messed up.
> 
> To achieve that:
> 
>     1. For the folios that map only to non-writable TLB entries, prevent
>        TLB flush at migration by keeping both source and destination
>        folios, which will be handled later at a better time.
> 
>     2. When any non-writable TLB entry changes to writable e.g. through
>        fault handler, give up CONFIG_MIGRC mechanism so as to perform
>        TLB flush required right away.
> 
>     3. Temporarily stop migrc from working when the system is in very
>        high memory pressure e.g. direct reclaim needed.
> 
> The measurement result:
> 
>     Architecture - x86_64
>     QEMU - kvm enabled, host cpu
>     Numa - 2 nodes (16 CPUs 1GB, no CPUs 8GB)
>     Linux Kernel - v6.6-rc5, numa balancing tiering on, demotion enabled
>     Benchmark - XSBench -p 50000000 (-p option makes the runtime longer)
> 
>     run 'perf stat' using events:
>        1) itlb.itlb_flush
>        2) tlb_flush.dtlb_thread
>        3) tlb_flush.stlb_any
>        4) dTLB-load-misses
>        5) dTLB-store-misses
>        6) iTLB-load-misses
> 
>     run 'cat /proc/vmstat' and pick:
>        1) numa_pages_migrated
>        2) pgmigrate_success
>        3) nr_tlb_remote_flush
>        4) nr_tlb_remote_flush_received
>        5) nr_tlb_local_flush_all
>        6) nr_tlb_local_flush_one
> 
>     BEFORE - mainline v6.6-rc5
>     ------------------------------------------
>     $ perf stat -a \
> 	   -e itlb.itlb_flush \
> 	   -e tlb_flush.dtlb_thread \
> 	   -e tlb_flush.stlb_any \
> 	   -e dTLB-load-misses \
> 	   -e dTLB-store-misses \
> 	   -e iTLB-load-misses \
> 	   ./XSBench -p 50000000
> 
>     Performance counter stats for 'system wide':
> 
>        20953405     itlb.itlb_flush
>        114886593    tlb_flush.dtlb_thread
>        88267015     tlb_flush.stlb_any
>        115304095543 dTLB-load-misses
>        163904743    dTLB-store-misses
>        608486259    iTLB-load-misses
> 
>     556.787113849 seconds time elapsed
> 
>     $ cat /proc/vmstat
> 
>     ...
>     numa_pages_migrated 3378748
>     pgmigrate_success 7720310
>     nr_tlb_remote_flush 751464
>     nr_tlb_remote_flush_received 10742115
>     nr_tlb_local_flush_all 21899
>     nr_tlb_local_flush_one 740157
>     ...
> 
>     AFTER - mainline v6.6-rc5 + CONFIG_MIGRC
>     ------------------------------------------
>     $ perf stat -a \
> 	   -e itlb.itlb_flush \
> 	   -e tlb_flush.dtlb_thread \
> 	   -e tlb_flush.stlb_any \
> 	   -e dTLB-load-misses \
> 	   -e dTLB-store-misses \
> 	   -e iTLB-load-misses \
> 	   ./XSBench -p 50000000
> 
>     Performance counter stats for 'system wide':
> 
>        4353555      itlb.itlb_flush
>        72482780     tlb_flush.dtlb_thread
>        68226458     tlb_flush.stlb_any
>        114331610808 dTLB-load-misses
>        116084771    dTLB-store-misses
>        377180518    iTLB-load-misses
> 
>     552.667718220 seconds time elapsed
> 
>     $ cat /proc/vmstat
> 

So, an improvement of 0.74%? How stable are the results? Serious
question: is it worth the churn?

Or did I get the numbers wrong?
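
For reference, the 0.74% figure can be rechecked from the two "seconds time elapsed" values in the quoted output. The following is only a sketch of that arithmetic: the function and variable names are made up for illustration, and the numbers are copied verbatim from the BEFORE and AFTER runs quoted above.

```python
# Elapsed times copied from the quoted 'perf stat' output (seconds).
BEFORE_SECONDS = 556.787113849  # mainline v6.6-rc5
AFTER_SECONDS = 552.667718220   # mainline v6.6-rc5 + CONFIG_MIGRC

# TLB flush counters copied from the same two runs: (before, after).
# The dict name is an assumption for this sketch, not part of the patch.
FLUSH_COUNTERS = {
    "itlb.itlb_flush": (20953405, 4353555),
    "tlb_flush.dtlb_thread": (114886593, 72482780),
    "tlb_flush.stlb_any": (88267015, 68226458),
}


def percent_reduction(before: float, after: float) -> float:
    """Return how much smaller 'after' is than 'before', in percent."""
    return (before - after) / before * 100.0


runtime_gain = percent_reduction(BEFORE_SECONDS, AFTER_SECONDS)
print(f"runtime improvement: {runtime_gain:.2f}%")  # prints 0.74%

# Relative drop of each flush counter, for comparison with the runtime.
for name, (before, after) in FLUSH_COUNTERS.items():
    print(f"{name}: {percent_reduction(before, after):.1f}% fewer events")
```

The flush counters fall by a much larger fraction than the runtime does, which is the gap the question is about. A single pair of runs cannot say how stable either number is.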

>   #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 5c02720c53a5..1ca2ac91aa14 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -135,6 +135,9 @@ enum pageflags {
>   #ifdef CONFIG_ARCH_USES_PG_ARCH_X
>   	PG_arch_2,
>   	PG_arch_3,
> +#endif
> +#ifdef CONFIG_MIGRC
> +	PG_migrc,		/* Page has its copy under migrc's control */
>   #endif
>   	__NR_PAGEFLAGS,
>   
> @@ -589,6 +592,10 @@ TESTCLEARFLAG(Young, young, PF_ANY)
>   PAGEFLAG(Idle, idle, PF_ANY)
>   #endif
>   
> +#ifdef CONFIG_MIGRC
> +PAGEFLAG(Migrc, migrc, PF_ANY)
> +#endif

I assume you know this: new pageflags are frowned upon.

-- 
Cheers,

David / dhildenb