linux-kernel - Re: [RFC PATCH v8] sched: Fix performance regression introduced by mm

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <60a8b86d-de59-b34e-71ba-3dfd2cce7f02@efficios.com>
Date:   Wed, 19 Apr 2023 09:10:52 -0400
From:   Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To:     Aaron Lu <aaron.lu@...el.com>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        linux-kernel@...r.kernel.org, Olivier Dion <odion@...icios.com>,
        michael.christie@...cle.com
Subject: Re: [RFC PATCH v8] sched: Fix performance regression introduced by
 mm_cid

On 2023-04-19 04:06, Aaron Lu wrote:
> On Tue, Apr 18, 2023 at 12:01:09PM -0400, Mathieu Desnoyers wrote:
>> On 2023-04-18 07:21, Aaron Lu wrote:
>>> On Mon, Apr 17, 2023 at 11:08:31AM -0400, Mathieu Desnoyers wrote:
>>>> Introduce per-mm/cpu current concurrency id (mm_cid) to fix a PostgreSQL
>>>> sysbench regression reported by Aaron Lu.
>>>
>>> For postgres_sysbench on SPR:
>>> sched_mm_cid_migrate_to() is in the range of 0.1x% - 0.4x%, mm_cid_get()
>>> is in the range of 0.1x% - 0.3x%. Other cid functions are pretty minor.
>>>
>>> For hackbench on SPR:
>>> ched_mm_cid_migrate_to() is about 3%-4%, mm_cid_get() is about 7%-8%,
>>> other cid functions are pretty minor.
>>
>> It's a bit higher than I would have expected for hackbench.
>>
>> Can you run with the attached schedstats patch applied and boot
>> with schedstats=enable ? Let me know how the counters behave please,
>> because I cannot reproduce anything unexpected on my machine.
> 
> Only event that stood out is mm_cid_get_cached, other events are very
> few in a 5s window. I think these numbers are expected and indicate the
> optimization worked well. Please see attachment for details, the number
> are got by doing the following after 20s the test was started:
> 	touch schedstat
> 	cat /proc/schedstat >> schedstat
> 	sleep 5
> 	cat /proc/schedstat >> schedstat
> 
> However, mm_cid_get() still shows as having 7.x% in perf cycles.
> Annotate shows the cycles are mostly spent on accessing mm->pcpu_cid:
> 
>         │
>         │     lockdep_assert_rq_held(rq);
>         │     cpumask = mm_cidmask(mm);
>         │     pcpu_cid = this_cpu_ptr(mm->pcpu_cid);
>   64.50 │       mov   0x60(%r14),%r12
>         │     struct rq *rq = this_rq();
>    9.80 │       mov   %rbx,-0x30(%rbp)
>         │     pcpu_cid = this_cpu_ptr(mm->pcpu_cid);
>    0.11 │     → call  debug_smp_processor_id

^ this call to debug_smp_processor_id is odd. Do you happen to have CONFIG_DEBUG_PREEMPT=y
in your kernel config ? This likely adds unwelcome overhead in those benchmarks.

>         │       mov   %eax,%ebx
>         │       cmp   $0x1fff,%eax
> 
> 
>         │     raw_spin_unlock(&cid_lock);
>         │       mov   $0xffffffff8474ca84,%rdi
>         │     WRITE_ONCE(use_cid_lock, 0);
>         │       movl  $0x0,use_cid_lock
>         │     raw_spin_unlock(&cid_lock);
>         │     → call  _raw_spin_unlock
>         │     ↑ jmp   fe
>         │     arch_static_branch():
>    0.02 │236:   xchg  %ax,%ax
>         │     mm_cid_snapshot_time():
>         │     struct rq *rq = this_rq();
>         │238: → call  debug_smp_processor_id
>         │       mov   $0x33340,%r13
>         │       mov   %eax,%ebx
>         │       cmp   $0x1fff,%eax
>         │     ↓ ja    2b3
>         │24d:   add   -0x7d5ce500(,%rbx,8),%r13
>         │     pcpu_cid = this_cpu_ptr(mm->pcpu_cid);
>   21.23 │       mov   0x60(%r14),%rbx
>    4.11 │     → call  debug_smp_processor_id
>         │       mov   %eax,%r12d
>         │       cmp   $0x1fff,%eax
>         │     ↓ ja    2c4
>         │     WRITE_ONCE(pcpu_cid->time, rq->clock);
> 
> Initially I thought it might be due to some false sharing between
> pcpu_cid field and other fields of mm_struct, but after I placed pcpu_cid
> into a separate cacheline, it still didn't make a difference.
> 
> For sched_mm_cid_migrate_to(), its cycle percent is 2.97% and most cycles
> are also spent on accessing mm->pcpu_cid according to perf annotate:
> 
>         │     dst_pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu_of(dst_rq));
>   64.22 │       mov     0x60(%r15),%r13
>    9.53 │       movslq  0xaf0(%rdi),%r14
> 
> No idea why accesing mm->pcpu_cid is so expensive, surely after
> initial allocation, nobody would change its value? Without a write
> side, I don't see how this can happen...

I'm also considering using __this_cpu* operations, such as this on mm_cid_get():


static inline int mm_cid_get(struct mm_struct *mm)
{
         struct mm_cid __percpu *pcpu_cid = mm->pcpu_cid;
         struct rq *rq = this_rq();
         struct cpumask *cpumask;
         int cid;

         lockdep_assert_rq_held(rq);
         cpumask = mm_cidmask(mm);
         cid = __this_cpu_read(pcpu_cid->cid);
         if (mm_cid_is_valid(cid)) {
                 schedstat_inc(rq->mm_cid_get_cached);
                 mm_cid_snapshot_time(mm);
                 return cid;
         }
         if (mm_cid_is_lazy_put(cid)) {
                 if (try_cmpxchg(&this_cpu_ptr(pcpu_cid)->cid, &cid, MM_CID_UNSET)) {
                         schedstat_inc(rq->mm_cid_get_put_lazy);
                         __mm_cid_put(mm, mm_cid_clear_lazy_put(cid));
                 }
         }
         cid = __mm_cid_get(mm);
         schedstat_inc(rq->mm_cid_get_alloc);
         __this_cpu_write(pcpu_cid->cid, cid);
         return cid;
}

Would this help ?

Thanks,

Mathieu





-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com