Message-ID: <069d686ab958d973563cfad52373ec6c8aad72ca.camel@surriel.com>
Date: Sat, 30 Nov 2024 12:28:42 -0500
From: Rik van Riel <riel@...riel.com>
To: kernel test robot <oliver.sang@...el.com>
Cc: oe-lkp@...ts.linux.dev, lkp@...el.com, linux-kernel@...r.kernel.org,
 Ingo Molnar <mingo@...nel.org>, Andy Lutomirski <luto@...nel.org>,
 Peter Zijlstra <peterz@...radead.org>,
 Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: [linus:master] [x86/mm/tlb]  7e33001b8b:
 will-it-scale.per_thread_ops 20.7% improvement

On Sat, 2024-11-30 at 16:07 +0800, kernel test robot wrote:
> 
> 
> Hello,
> 
> in this test, we don't have CONFIG_DEBUG_VM.
> # CONFIG_DEBUG_VM is not set
> 
> below report is just FYI.
> 
> 
> kernel test robot noticed a 20.7% improvement of
> will-it-scale.per_thread_ops on:
> 
> 
> commit: 7e33001b8b9a78062679e0fdf5b0842a49063135 ("x86/mm/tlb: Put
> cpumask_test_cpu() check in switch_mm_irqs_off() under
> CONFIG_DEBUG_VM")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

It's good to get this confirmation that the mm_cpumask
really is that expensive.

I guess we could experiment with something like the following:

1) Stop using the mm_cpumask altogether on x86
2) Instead, at context switch time just update
   per_cpu variables like cpu_tlbstate.loaded_mm
   and friends
3) At (much rarer) TLB flush time:
   - Iterate over all CPUs
   - Use cpu_tlbstate.loaded_mm and .is_lazy to build a
     (per-CPU?) cpumask.
   - Pass that cpumask to functions like flush_tlb_multi
     and on_each_cpu_mask
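
To make steps 2 and 3 concrete, here is a rough userspace model in C.
It is only a sketch of the idea, not kernel code: struct mm,
struct tlb_state, switch_mm_model() and build_flush_mask() are
hypothetical stand-ins, NR_CPUS is fixed at 8, a plain uint64_t
plays the role of the cpumask, and all of the locking and memory
ordering that a real implementation would need is left out.
Skipping lazy CPUs is also an assumption; a flush that frees page
tables would presumably still have to reach them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NR_CPUS 8                 /* hypothetical small machine */

struct mm { int id; };            /* stand-in for struct mm_struct */

/* Stand-in for the per-CPU cpu_tlbstate fields named in the mail. */
struct tlb_state {
    const struct mm *loaded_mm;   /* mm currently loaded on this CPU */
    bool is_lazy;                 /* CPU is in lazy TLB mode */
};

static struct tlb_state cpu_tlbstate[NR_CPUS];

/* Step 2: a context switch only writes this CPU's own state. */
static void switch_mm_model(unsigned int cpu, const struct mm *next, bool lazy)
{
    assert(cpu < NR_CPUS);
    cpu_tlbstate[cpu].loaded_mm = next;
    cpu_tlbstate[cpu].is_lazy = lazy;
}

/* Step 3: at flush time, scan all CPUs and collect the ones to interrupt. */
static uint64_t build_flush_mask(const struct mm *mm)
{
    uint64_t mask = 0;

    for (unsigned int cpu = 0; cpu < NR_CPUS; cpu++) {
        const struct tlb_state *st = &cpu_tlbstate[cpu];

        /* Assumption: lazy CPUs are skipped for an ordinary flush. */
        if (st->loaded_mm == mm && !st->is_lazy)
            mask |= UINT64_C(1) << cpu;
    }
    return mask;
}
```

The trade-off this models is that the context switch path no longer
writes to a shared bitmap, while the flush path pays for an
O(NR_CPUS) scan of per-CPU state before sending any IPIs.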

Does that make sense as something we could try to
further reduce context switch overhead, and the
TLB flush thundering herd on the mm_cpumask triggered
by the main loop in will-it-scale's tlb_flush2 test?

https://github.com/antonblanchard/will-it-scale/blob/master/tests/tlb_flush2.c


-- 
All Rights Reversed.