[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <d5c394f8-5833-6e5c-2155-6160415cfd92@efficios.com>
Date: Thu, 20 Apr 2023 10:41:07 -0400
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Aaron Lu <aaron.lu@...el.com>
Cc: Peter Zijlstra <peterz@...radead.org>,
linux-kernel@...r.kernel.org, Olivier Dion <odion@...icios.com>,
michael.christie@...cle.com
Subject: Re: [RFC PATCH v9 2/2] sched: Fix performance regression introduced
by mm_cid
On 2023-04-20 10:39, Aaron Lu wrote:
> On Thu, Apr 20, 2023 at 10:18:52PM +0800, Aaron Lu wrote:
>> On Thu, Apr 20, 2023 at 09:54:29AM -0400, Mathieu Desnoyers wrote:
>>> On 2023-04-20 09:35, Aaron Lu wrote:
>>> [...]
>>>>>>>
>>>>>>> Then we clearly have another member of mm_struct on the same cache line as
>>>>>>> pcpu_cid which is bouncing all over the place and causing false-sharing. Any
>>>>>>> idea which field(s) are causing this ?
>>>>>>
>>>>>> That's my first reaction too but as I said in an earlier reply:
>>>>>> https://lore.kernel.org/lkml/20230419080606.GA4247@ziqianlu-desk2/
>>>>>> I've tried to place pcpu_cid into a dedicate cacheline with no other
>>>>>> fields sharing a cacheline with it in mm_struct but it didn't help...
>>>>>
>>>>> I see two possible culprits there:
>>>>>
>>>>> 1) The mm_struct pcpu_cid field is suffering from false-sharing. I would be
>>>>> interested to look at your attempt to move it to a separate cache line to
>>>>> try to figure out what is going on.
>>>>
>>>> Brain damaged...my mistake, I only made sure its following fields not
>>>> share the same cacheline but forgot to exclude its preceding fields and
>>>> turned out it's one(some?) of the preceeding fields that caused false
>>>> sharing. When I did:
>>>>
>>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>>>> index 5eab61156f0e..a6f9d815991c 100644
>>>> --- a/include/linux/mm_types.h
>>>> +++ b/include/linux/mm_types.h
>>>> @@ -606,6 +606,7 @@ struct mm_struct {
>>>> */
>>>> atomic_t mm_count;
>>>> #ifdef CONFIG_SCHED_MM_CID
>>>> + CACHELINE_PADDING(_pad1_);
>>>> /**
>>>> * @pcpu_cid: Per-cpu current cid.
>>>> *
>>>> mm_cid_get() dropped to 0.0x% when running hackbench :-)
>>>
>>> Now we are talking! :)
>>>
>>>>
>>>> sched_mm_cid_migrate_to() is about 4% with most cycles spent on
>>>> accessing mm->mm_users:
>>>>
>>>> │ dst_cid = READ_ONCE(dst_pcpu_cid->cid);
>>>> 0.03 │ mov 0x8(%r12),%r15d
>>>> │ if (!mm_cid_is_unset(dst_cid) &&
>>>> 0.07 │ cmp $0xffffffff,%r15d
>>>> │ ↓ je 87
>>>> │ arch_atomic_read():
>>>> │ {
>>>> │ /*
>>>> │ * Note for KASAN: we deliberately don't use READ_ONCE_NOCHECK() here,
>>>> │ * it's non-inlined function that increases binary size and stack usage.
>>>> │ */
>>>> │ return __READ_ONCE((v)->counter);
>>>> 76.13 │ mov 0x54(%r13),%eax
>>>> │ sched_mm_cid_migrate_to():
>>>> │ cmp %eax,0x410(%rdx)
>>>> 21.71 │ ↓ jle 1d8
>>>> │ atomic_read(&mm->mm_users) >= t->nr_cpus_allowed)
>>>>
>>>> With this info, it should be mm_users that caused false sharing for
>>>> pcpu_cid previously. Looks like mm_users is bouncing.
>>>
>>> I suspect that the culprit here is mm_count rather than mm_users. mm_users
>>> just happens to share the same cache line as mm_count.
>>>
>>> mm_count is incremented/decremented with mmgrab()/mmdrop() during
>>> context switch.
>>>
>>> This is likely causing other issues, for instance, the
>>> membarrier_state field is AFAIR read-mostly, used for
>>> membarrier_mm_sync_core_before_usermode() to issue core
>>> sync before every return to usermode if needed.
>>>
>>> Other things like mm_struct pgd pointer appear to be likely
>>> read-mostly variables.
>>>
>>> I suspect it's mm_count which should be moved to its own cache line
>>> to eliminate false-sharing with all the other read-mostly fields
>>> of mm_struct.
>>>
>>> Thoughts ?
>>
>> Makes sesne, I was wondering where the write side of mm_user is. Let me
>> see how that goes by placing mm_count aside from other read mostly fields.
>
> With the following naive padding for mm_count:
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 5eab61156f0e..866696e2d83e 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -604,7 +604,9 @@ struct mm_struct {
> * Use mmgrab()/mmdrop() to modify. When this drops to 0, the
> * &struct mm_struct is freed.
> */
> + CACHELINE_PADDING(_pad1_);
> atomic_t mm_count;
> + CACHELINE_PADDING(_pad2_);
> #ifdef CONFIG_SCHED_MM_CID
> /**
> * @pcpu_cid: Per-cpu current cid.
>
> mm_cid_get() is about 0.1% and sched_mm_cid_migrate_to() is about 0.2%
> for hackbench on SPR :-)
Allright, then our work is done here. I'm popping the champagne.
I'm doing some more testing on v10 which includes a fix for the
time-snapshot wrong runqueue, and I'll post it today.
Thanks a lot for your help ! :-)
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Powered by blists - more mailing lists