Message-ID: <20230420133519.GA154479@ziqianlu-desk2>
Date:   Thu, 20 Apr 2023 21:35:19 +0800
From:   Aaron Lu <aaron.lu@...el.com>
To:     Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
CC:     Peter Zijlstra <peterz@...radead.org>,
        <linux-kernel@...r.kernel.org>, Olivier Dion <odion@...icios.com>,
        <michael.christie@...cle.com>
Subject: Re: [RFC PATCH v9 2/2] sched: Fix performance regression introduced
 by mm_cid

On Thu, Apr 20, 2023 at 09:10:35AM -0400, Mathieu Desnoyers wrote:
> On 2023-04-20 08:50, Aaron Lu wrote:
> > On Thu, Apr 20, 2023 at 08:41:05AM -0400, Mathieu Desnoyers wrote:
> > > On 2023-04-20 05:56, Aaron Lu wrote:
> > > > On Wed, Apr 19, 2023 at 11:50:12AM -0400, Mathieu Desnoyers wrote:
> > > > > Introduce per-mm/cpu current concurrency id (mm_cid) to fix a PostgreSQL
> > > > > sysbench regression reported by Aaron Lu.
> > > > 
> > > > mm_cid_get() dropped to 5.x% after I disabled CONFIG_DEBUG_PREEMPT;
> > > > using __this_cpu_X() doesn't help, I suppose because __this_cpu_X()
> > > > still needs to fetch mm->pcpu_cid.
> > > > 
> > > > Annotation of mm_cid_get():
> > > > 
> > > >          │     static inline int mm_cid_get(struct mm_struct *mm)
> > > >          │     {
> > > >     0.05 │       push   %rbp
> > > >     0.02 │       mov    %rsp,%rbp
> > > >          │       push   %r15
> > > >          │       push   %r14
> > > >          │       push   %r13
> > > >          │       push   %r12
> > > >          │       push   %rbx
> > > >     0.02 │       sub    $0x10,%rsp
> > > >          │     struct mm_cid __percpu *pcpu_cid = mm->pcpu_cid;
> > > >    71.30 │       mov    0x60(%rdi),%r12
> > > >          │     struct cpumask *cpumask;
> > > >          │     int cid;
> > > >          │
> > > >          │     lockdep_assert_rq_held(rq);
> > > >          │     cpumask = mm_cidmask(mm);
> > > >          │     cid = __this_cpu_read(pcpu_cid->cid);
> > > >    28.44 │       mov    %gs:0x8(%r12),%edx
> > > >          │     if (mm_cid_is_valid(cid)) {
> > > > 
> > > > 
> > > > sched_mm_cid_migrate_to() is at 4.x%; its annotation:
> > > > 
> > > >          │     dst_pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu_of(dst_rq));
> > > >          │       mov     -0x30(%rbp),%rax
> > > >    54.53 │       mov     0x60(%r13),%rbx
> > > >    19.61 │       movslq  0xaf0(%rax),%r15
> > > > 
> > > > The reason why accessing mm->pcpu_cid is so costly is still a
> > > > mystery to me...
> > > 
> > > Then we clearly have another member of mm_struct on the same cache line as
> > > pcpu_cid which is bouncing all over the place and causing false-sharing. Any
> > > idea which field(s) are causing this ?
> > 
> > That was my first reaction too, but as I said in an earlier reply:
> > https://lore.kernel.org/lkml/20230419080606.GA4247@ziqianlu-desk2/
> > I had tried placing pcpu_cid in a dedicated cacheline, with no other
> > mm_struct fields sharing it, but that didn't help...
> 
> I see two possible culprits there:
> 
> 1) The mm_struct pcpu_cid field is suffering from false-sharing. I would be
>    interested to look at your attempt to move it to a separate cache line to
>    try to figure out what is going on.

My mistake: I only made sure the fields following pcpu_cid did not
share its cacheline, but forgot about the fields preceding it, and it
turns out one (or some) of those preceding fields caused the false
sharing. When I did:

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5eab61156f0e..a6f9d815991c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -606,6 +606,7 @@ struct mm_struct {
                 */
                atomic_t mm_count;
 #ifdef CONFIG_SCHED_MM_CID
+               CACHELINE_PADDING(_pad1_);
                /**
                 * @pcpu_cid: Per-cpu current cid.
                 *
mm_cid_get() dropped to 0.0x% when running hackbench :-)
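
(Illustration only, not part of the patch: the effect can be reproduced
with a small stand-alone user-space test. Everything below, the struct,
the field names and the 64-byte cacheline size, is an assumption for the
demo; the writer thread stands in for the tasks updating mm_users and
the reader for mm_cid_get()'s load of mm->pcpu_cid. Build with
"gcc -O2 -pthread", once with and once without -DPADDED, and compare
wall clock time.)

/* false_sharing_demo.c - illustrative only, not kernel code */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

struct fake_mm {
	atomic_int users;		/* write-hot, like mm_users */
#ifdef PADDED
	char pad[64];			/* crude stand-in for CACHELINE_PADDING() */
#endif
	void * volatile pcpu_cid;	/* read-mostly, like mm->pcpu_cid */
};

static struct fake_mm mm;
static atomic_int stop;

/* Stands in for the clone()/exit() churn that updates mm_users. */
static void *writer(void *arg)
{
	while (!atomic_load_explicit(&stop, memory_order_relaxed))
		atomic_fetch_add_explicit(&mm.users, 1, memory_order_relaxed);
	return NULL;
}

/* Stands in for mm_cid_get()'s repeated load of mm->pcpu_cid. */
static void *reader(void *arg)
{
	unsigned long i, sum = 0;

	for (i = 0; i < 200000000UL; i++)
		sum += (unsigned long)mm.pcpu_cid;
	atomic_store_explicit(&stop, 1, memory_order_relaxed);
	return (void *)sum;
}

int main(void)
{
	pthread_t w, r;

	mm.pcpu_cid = &mm;	/* any non-NULL value will do */
	pthread_create(&w, NULL, writer, NULL);
	pthread_create(&r, NULL, reader, NULL);
	pthread_join(r, NULL);
	pthread_join(w, NULL);
	printf("users ended at %d\n", atomic_load(&mm.users));
	return 0;
}

Without -DPADDED the reader's load and the writer's increments hit the
same cacheline and the run takes noticeably longer; with -DPADDED they
land on different lines, which is what the CACHELINE_PADDING(_pad1_)
insertion above achieves for pcpu_cid.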

sched_mm_cid_migrate_to() is at about 4%, with most cycles spent
accessing mm->mm_users:

       │     dst_cid = READ_ONCE(dst_pcpu_cid->cid);
  0.03 │       mov     0x8(%r12),%r15d
       │     if (!mm_cid_is_unset(dst_cid) &&
  0.07 │       cmp     $0xffffffff,%r15d
       │     ↓ je      87
       │     arch_atomic_read():
       │     {
       │     /*
       │     * Note for KASAN: we deliberately don't use READ_ONCE_NOCHECK() here,
       │     * it's non-inlined function that increases binary size and stack usage.
       │     */
       │     return __READ_ONCE((v)->counter);
 76.13 │       mov     0x54(%r13),%eax
       │     sched_mm_cid_migrate_to():
       │       cmp     %eax,0x410(%rdx)
 21.71 │     ↓ jle     1d8
       │     atomic_read(&mm->mm_users) >= t->nr_cpus_allowed)

With this info, it should be mm_users that caused the false sharing
with pcpu_cid previously. It looks like mm_users is bouncing.
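
(As an aside, and purely illustrative: if it turns out to be worth
keeping such a write-hot counter and a read-mostly pointer apart
permanently, the placement can be sanity-checked at build time. The
struct, the field names and CL_SIZE below are stand-ins, not mm_struct;
the kernel has no such check for pcpu_cid today.)

/* Illustrative build-time check only; names are stand-ins. */
#include <assert.h>
#include <stddef.h>

#define CL_SIZE 64		/* assumed cacheline size */

struct demo {
	int hot_counter;	/* write-hot, like mm_users */
	char pad[CL_SIZE];	/* keep the next field on another line */
	void *read_mostly_ptr;	/* like mm->pcpu_cid */
};

/* Only meaningful if instances are allocated cacheline-aligned. */
static_assert(offsetof(struct demo, read_mostly_ptr) / CL_SIZE !=
	      offsetof(struct demo, hot_counter) / CL_SIZE,
	      "hot_counter and read_mostly_ptr share a cacheline");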

Thanks,
Aaron
