Date:   Thu, 20 Apr 2023 09:10:35 -0400
From:   Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To:     Aaron Lu <aaron.lu@...el.com>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        linux-kernel@...r.kernel.org, Olivier Dion <odion@...icios.com>,
        michael.christie@...cle.com
Subject: Re: [RFC PATCH v9 2/2] sched: Fix performance regression introduced
 by mm_cid

On 2023-04-20 08:50, Aaron Lu wrote:
> On Thu, Apr 20, 2023 at 08:41:05AM -0400, Mathieu Desnoyers wrote:
>> On 2023-04-20 05:56, Aaron Lu wrote:
>>> On Wed, Apr 19, 2023 at 11:50:12AM -0400, Mathieu Desnoyers wrote:
>>>> Introduce per-mm/cpu current concurrency id (mm_cid) to fix a PostgreSQL
>>>> sysbench regression reported by Aaron Lu.
>>>
>>> mm_cid_get() dropped to 5.x% after I disabled CONFIG_DEBUG_PREEMPT; using
>>> __this_cpu_X() doesn't help, I suppose that is because __this_cpu_X()
>>> still needs to fetch mm->pcpu_cid.
>>>
>>> Annotate mm_cid_get():
>>>
>>>          │     static inline int mm_cid_get(struct mm_struct *mm)
>>>          │     {
>>>     0.05 │       push   %rbp
>>>     0.02 │       mov    %rsp,%rbp
>>>          │       push   %r15
>>>          │       push   %r14
>>>          │       push   %r13
>>>          │       push   %r12
>>>          │       push   %rbx
>>>     0.02 │       sub    $0x10,%rsp
>>>          │     struct mm_cid __percpu *pcpu_cid = mm->pcpu_cid;
>>>    71.30 │       mov    0x60(%rdi),%r12
>>>          │     struct cpumask *cpumask;
>>>          │     int cid;
>>>          │
>>>          │     lockdep_assert_rq_held(rq);
>>>          │     cpumask = mm_cidmask(mm);
>>>          │     cid = __this_cpu_read(pcpu_cid->cid);
>>>    28.44 │       mov    %gs:0x8(%r12),%edx
>>>          │     if (mm_cid_is_valid(cid)) {
>>>
>>>
>>> sched_mm_cid_migrate_to() is 4.x% and its annotation :
>>>
>>>          │     dst_pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu_of(dst_rq));
>>>          │       mov     -0x30(%rbp),%rax
>>>    54.53 │       mov     0x60(%r13),%rbx
>>>    19.61 │       movslq  0xaf0(%rax),%r15
>>>
>>> The reason why accessing mm->pcpu_cid is so costly is still a mystery to
>>> me...
>>
>> Then we clearly have another member of mm_struct on the same cache line as
>> pcpu_cid which is bouncing all over the place and causing false-sharing. Any
>> idea which field(s) are causing this ?
> 
> That's my first reaction too but as I said in an earlier reply:
> https://lore.kernel.org/lkml/20230419080606.GA4247@ziqianlu-desk2/
> I've tried to place pcpu_cid into a dedicated cacheline with no other
> fields sharing a cacheline with it in mm_struct but it didn't help...

I see two possible culprits there:

1) The mm_struct pcpu_cid field is suffering from false-sharing. I would be
    interested to look at your attempt to move it to a separate cache line to
    try to figure out what is going on.

2) (Maybe?) The task_struct mm field is suffering from false-sharing and stalling
    the next instruction which needs to use its value to fetch the mm->pcpu_cid
    field. We could try moving the task_struct mm field into its own cache line to
    see if it helps.

Thanks,

Mathieu

> 
> Thanks,
> Aaron

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
