Date:   Thu, 20 Apr 2023 10:35:15 -0400
From:   Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To:     Aaron Lu <aaron.lu@...el.com>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        linux-kernel@...r.kernel.org, Olivier Dion <odion@...icios.com>,
        michael.christie@...cle.com
Subject: Re: [RFC PATCH v9 2/2] sched: Fix performance regression introduced by mm_cid

On 2023-04-20 09:54, Mathieu Desnoyers wrote:
> On 2023-04-20 09:35, Aaron Lu wrote:
> [...]
>>>>>
>>>>> Then we clearly have another member of mm_struct on the same cache
>>>>> line as pcpu_cid which is bouncing all over the place and causing
>>>>> false-sharing. Any idea which field(s) are causing this?
>>>>
>>>> That's my first reaction too but as I said in an earlier reply:
>>>> https://lore.kernel.org/lkml/20230419080606.GA4247@ziqianlu-desk2/
>>>> I've tried placing pcpu_cid in a dedicated cacheline, with no other
>>>> fields of mm_struct sharing a cacheline with it, but it didn't help...
>>>
>>> I see two possible culprits there:
>>>
>>> 1) The mm_struct pcpu_cid field is suffering from false-sharing. I
>>>    would be interested to look at your attempt to move it to a
>>>    separate cache line to try to figure out what is going on.
>>
>> My mistake: I only made sure the fields *following* pcpu_cid did not
>> share its cacheline, but forgot to exclude the fields *preceding* it,
>> and it turned out that one (or some) of those preceding fields caused
>> the false sharing. When I did:
>>
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 5eab61156f0e..a6f9d815991c 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -606,6 +606,7 @@ struct mm_struct {
>>                   */
>>                  atomic_t mm_count;
>>   #ifdef CONFIG_SCHED_MM_CID
>> +               CACHELINE_PADDING(_pad1_);
>>                  /**
>>                   * @pcpu_cid: Per-cpu current cid.
>>                   *
>> mm_cid_get() dropped to 0.0x% when running hackbench :-)
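Side note: the lesson generalizes. To truly isolate a field, padding is
needed on both sides of it, not just after it. A minimal userspace sketch,
assuming 64-byte cache lines (illustrative only, not kernel code):

#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumed line size */

struct padded_both_sides {
        atomic_int hot_before;          /* written from other CPUs */
        char _pad1[CACHE_LINE];         /* excludes the *preceding* fields */
        atomic_int isolated;            /* the field under test */
        char _pad2[CACHE_LINE];         /* excludes the *following* fields */
        atomic_int hot_after;
};

/* Neither hot field shares a cache line with "isolated". */
static_assert(offsetof(struct padded_both_sides, isolated) / CACHE_LINE !=
              offsetof(struct padded_both_sides, hot_before) / CACHE_LINE,
              "isolated from preceding fields");
static_assert(offsetof(struct padded_both_sides, hot_after) / CACHE_LINE !=
              offsetof(struct padded_both_sides, isolated) / CACHE_LINE,
              "isolated from following fields");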
> 
> Now we are talking! :)
> 
>>
>> sched_mm_cid_migrate_to() is about 4% with most cycles spent on
>> accessing mm->mm_users:
>>
>>         │     dst_cid = READ_ONCE(dst_pcpu_cid->cid);
>>    0.03 │       mov     0x8(%r12),%r15d
>>         │     if (!mm_cid_is_unset(dst_cid) &&
>>    0.07 │       cmp     $0xffffffff,%r15d
>>         │     ↓ je      87
>>         │     arch_atomic_read():
>>         │     {
>>         │     /*
>>         │     * Note for KASAN: we deliberately don't use READ_ONCE_NOCHECK() here,
>>         │     * it's non-inlined function that increases binary size and stack usage.
>>         │     */
>>         │     return __READ_ONCE((v)->counter);
>>   76.13 │       mov     0x54(%r13),%eax
>>         │     sched_mm_cid_migrate_to():
>>         │       cmp     %eax,0x410(%rdx)
>>   21.71 │     ↓ jle     1d8
>>         │     atomic_read(&mm->mm_users) >= t->nr_cpus_allowed)
>>
>> With this info, it should be mm_users that caused the false sharing
>> for pcpu_cid previously. Looks like mm_users is bouncing.
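A minimal userspace sketch of this kind of bouncing, for anyone who wants
to reproduce it outside the kernel: a writer thread hammering an atomic
counter slows down a reader of a neighboring read-mostly field unless the
counter is padded onto its own line. Assumes 64-byte cache lines; build
with "cc -O2 -pthread"; all names here are made up for illustration:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define PADDED 1        /* set to 0 to reintroduce the false sharing */

struct mm_like {
        atomic_long counter;                    /* write-hot, like mm_count */
#if PADDED
        char _pad[64 - sizeof(atomic_long)];    /* push the neighbor off the line */
#endif
        long read_mostly;                       /* like pgd or membarrier_state */
};

static _Alignas(64) struct mm_like mm = { .read_mostly = 42 };
static atomic_int stop;

static void *writer(void *arg)
{
        while (!atomic_load_explicit(&stop, memory_order_relaxed))
                atomic_fetch_add_explicit(&mm.counter, 1, memory_order_relaxed);
        return arg;
}

int main(void)
{
        pthread_t t;
        struct timespec a, b;
        const long reads = 100 * 1000 * 1000;
        long sink = 0;

        pthread_create(&t, NULL, writer, NULL);
        clock_gettime(CLOCK_MONOTONIC, &a);
        for (long i = 0; i < reads; i++)
                sink += *(volatile long *)&mm.read_mostly;  /* read-mostly access */
        clock_gettime(CLOCK_MONOTONIC, &b);
        atomic_store(&stop, 1);
        pthread_join(t, NULL);

        printf("%.2f ns/read (sink=%ld)\n",
               ((b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec)) / reads,
               sink);
        return 0;
}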
> 
> I suspect that the culprit here is mm_count rather than mm_users. 
> mm_users just happens to share the same cache line as mm_count.
> 
> mm_count is incremented/decremented with mmgrab()/mmdrop() during
> context switch.
> 
> This is likely causing other issues. For instance, the membarrier_state
> field is AFAIR read-mostly, used by
> membarrier_mm_sync_core_before_usermode() to issue a core sync before
> every return to usermode if needed.
> 
> Other things like the mm_struct pgd pointer appear likewise to be
> read-mostly variables.
> 
> I suspect it's mm_count which should be moved to its own cache line
> to eliminate false-sharing with all the other read-mostly fields
> of mm_struct.
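For context, the refcounting fast path in question is tiny but runs on
every context switch on every CPU, so each atomic RMW drags the cache line
holding mm_count around. Roughly (paraphrased and simplified from
include/linux/sched/mm.h):

static inline void mmgrab(struct mm_struct *mm)
{
        atomic_inc(&mm->mm_count);      /* take a reference */
}

static inline void mmdrop(struct mm_struct *mm)
{
        /* atomic_dec_and_test() implies a full barrier; the last
         * reference frees the mm_struct. */
        if (unlikely(atomic_dec_and_test(&mm->mm_count)))
                __mmdrop(mm);
}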

I have prepared a patch which moves the mm_count field into its own
cache line, but after a quick discussion with Peter Zijlstra, it appears
that the work on lazy-tlb refcounting currently in

https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=x86/lazy

will take care of this by entirely removing the reference counting for lazy TLB.

So with this, I suspect we are as good as we can be in terms of near-zero
footprint for the mm_cid feature, right?

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index a57e6ae78e65..f740fa447df1 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -553,6 +553,21 @@ struct vm_area_struct {
  struct kioctx_table;
  struct mm_struct {
  	struct {
+		/*
+		 * Fields which are often written to are placed in a separate
+		 * cache line.
+		 */
+		struct {
+			/**
+			 * @mm_count: The number of references to &struct mm_struct
+			 * (@mm_users count as 1).
+			 *
+			 * Use mmgrab()/mmdrop() to modify. When this drops to 0, the
+			 * &struct mm_struct is freed.
+			 */
+			atomic_t mm_count;
+		} ____cacheline_aligned_in_smp;
+
  		struct maple_tree mm_mt;
  #ifdef CONFIG_MMU
  		unsigned long (*get_unmapped_area) (struct file *filp,
@@ -590,14 +605,6 @@ struct mm_struct {
  		 */
  		atomic_t mm_users;
  
-		/**
-		 * @mm_count: The number of references to &struct mm_struct
-		 * (@mm_users count as 1).
-		 *
-		 * Use mmgrab()/mmdrop() to modify. When this drops to 0, the
-		 * &struct mm_struct is freed.
-		 */
-		atomic_t mm_count;
  #ifdef CONFIG_SCHED_MM_CID
  		/**
  		 * @cid_lock: Protect cid bitmap updates vs lookups.
  
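For readers unfamiliar with the idiom: ____cacheline_aligned_in_smp on the
anonymous struct both aligns mm_count to a cache line boundary and rounds
the anonymous struct's size up to a whole line, so the next field starts
on a fresh line. A userspace approximation, assuming 64-byte lines:

#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64   /* stand-in for SMP_CACHE_BYTES */

struct mm_like {
        /* Aligned anonymous struct: starts on a line boundary, and its
         * size rounds up to a whole line, so "first_neighbor" below
         * begins on a fresh cache line. */
        struct {
                int mm_count;
        } __attribute__((aligned(CACHE_LINE)));

        long first_neighbor;    /* first read-mostly field after it */
};

static_assert(offsetof(struct mm_like, first_neighbor) == CACHE_LINE,
              "mm_count occupies a cache line of its own");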



-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
