Message-ID: <ZH1okIz83ELfjy4o@BLR-5CG11610CF.amd.com>
Date:   Mon, 5 Jun 2023 10:16:08 +0530
From:   "Gautham R. Shenoy" <gautham.shenoy@....com>
To:     Vincent Guittot <vincent.guittot@...aro.org>
Cc:     Peter Zijlstra <peterz@...radead.org>, Tejun Heo <tj@...nel.org>,
        jiangshanlai@...il.com, torvalds@...ux-foundation.org,
        linux-kernel@...r.kernel.org, kernel-team@...a.com,
        joshdon@...gle.com, brho@...gle.com, briannorris@...omium.org,
        nhuck@...gle.com, agk@...hat.com, snitzer@...nel.org,
        void@...ifault.com, libo.chen@...cle.com, srikar@...ux.vnet.ibm.com
Subject: Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue
 execution locality

Hello Vincent,

On Tue, May 23, 2023 at 06:12:45PM +0200, Vincent Guittot wrote:
> On Tue, 23 May 2023 at 13:18, Peter Zijlstra <peterz@...radead.org> wrote:
> >
> > So wakeup based placement is mostly all about LLC, and given this thing
> > has dinky small LLCs it will pile up on the one LLC you target and leave
> > the others idle until the regular idle balancer decides to make an
> > appearance and move some around.
> >
> > But if these are fairly short running tasks, I can well see that not
> > going to help much.
> >
> >
> > Much of this was tuned back when there was 1 L3 per Node; something
> > which is still more or less true for Intel but clearly not for these
> > things.
> >
> >
> > The below is a bit crude and completely untested, but it might help. The
> > flip side of that coin is of course that people are going to complain
> > about how selecting a CPU is more expensive now and how this hurts their
> > performance :/
> >
> > Basically it will try and iterate all L3s in a node; wakeup will still
> > refuse to cross node boundaries.
> 
> That reminds me of a discussion about systems with a fast on-die
> interconnect where we would like to run wider than the LLC at wakeup
> (i.e. at the DIE level), something like the CLUSTER level but on the
> other side of MC.
>

Adding Libo Chen who was a part of this discussion. IIRC, the problem
was that there was no MC domain on that system, which would have made
the SMT domain the sd_llc. But since the cores are single threaded,
the SMT domain gets degenerated, leaving no domain with the
SD_SHARE_PKG_RESOURCES flag.
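
For reference, sd_llc comes from the highest domain carrying
SD_SHARE_PKG_RESOURCES; roughly (a paraphrase of
update_top_cache_domain() in kernel/sched/topology.c, not the exact
upstream code):

	sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
	if (sd) {
		id = cpumask_first(sched_domain_span(sd));
		size = cpumask_weight(sched_domain_span(sd));
		sds = sd->shared;
	}
	rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);

	/*
	 * With no MC domain and a degenerated SMT domain, sd is NULL
	 * here, so sd_llc ends up NULL for that CPU.
	 */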

If I understand correctly, Peter's patch won't help in such a
situation.
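
A rough illustration of why (a sketch of the relevant part of
select_idle_sibling(), not the exact upstream code): the LLC domain
is looked up before the proposed SIS_NODE fallback would run, so with
a NULL sd_llc we return early and never get there.

	sd = rcu_dereference(per_cpu(sd_llc, target));
	if (!sd)
		return target;	/* NULL sd_llc: we bail out here... */

	i = select_idle_cpu(p, sd, has_idle_core, target);
	if ((unsigned)i < nr_cpumask_bits)
		return i;

	/* ...and never reach the proposed SIS_NODE scan. */
	if (sched_feat(SIS_NODE)) {
		i = select_idle_node(p, sd, target);
		if ((unsigned)i < nr_cpumask_bits)
			return i;
	}

	return target;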

However, it should help POWER10, which has the SMT domain as the LLC;
previously it was observed that moving the wakeup search to the
parent domain was helpful (Ref:
https://lore.kernel.org/lkml/1617341874-1205-1-git-send-email-ego@linux.vnet.ibm.com/)


--
Thanks and Regards
gautham.


> Another possibility to investigate would be that each wakeup of a
> worker is mostly unrelated to the previous one and only the waker
> matters, so we should use -1 for the prev_cpu.
> 
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 48b6f0ca13ac..ddb7f16a07a9 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7027,6 +7027,33 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> >         return idle_cpu;
> >  }
> >
> > +static int
> > +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
> > +{
> > +       struct sched_domain *sd_node = rcu_dereference(per_cpu(sd_node, target));
> > +       struct sched_group *sg;
> > +
> > +       if (!sd_node || sd_node == sd)
> > +               return -1;
> > +
> > +       sg = sd_node->groups;
> > +       do {
> > +               int cpu = cpumask_first(sched_group_span(sg));
> > +               struct sched_domain *sd_child;
> > +
> > +               sd_child = per_cpu(sd_llc, cpu);
> > +               if (sd_child != sd) {
> > +                       int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);
> > +                       if ((unsigned int)i < nr_cpumask_bits)
> > +                               return i;
> > +               }
> > +
> > +               sg = sg->next;
> > +       } while (sg != sd_node->groups);
> > +
> > +       return -1;
> > +}
> > +
> >  /*
> >   * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
> >   * the task fits. If no CPU is big enough, but there are idle ones, try to
> > @@ -7199,6 +7226,12 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> >         if ((unsigned)i < nr_cpumask_bits)
> >                 return i;
> >
> > +       if (sched_feat(SIS_NODE)) {
> > +               i = select_idle_node(p, sd, target);
> > +               if ((unsigned)i < nr_cpumask_bits)
> > +                       return i;
> > +       }
> > +
> >         return target;
> >  }
> >
> > diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> > index ee7f23c76bd3..f965cd4a981e 100644
> > --- a/kernel/sched/features.h
> > +++ b/kernel/sched/features.h
> > @@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
> >   */
> >  SCHED_FEAT(SIS_PROP, false)
> >  SCHED_FEAT(SIS_UTIL, true)
> > +SCHED_FEAT(SIS_NODE, true)
> >
> >  /*
> >   * Issue a WARN when we do multiple update_rq_clock() calls
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 678446251c35..d2e0e2e496a6 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -1826,6 +1826,7 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
> >  DECLARE_PER_CPU(int, sd_llc_size);
> >  DECLARE_PER_CPU(int, sd_llc_id);
> >  DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> > +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_node);
> >  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> >  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> >  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index ca4472281c28..d94cbc2164ca 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -667,6 +667,7 @@ DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
> >  DEFINE_PER_CPU(int, sd_llc_size);
> >  DEFINE_PER_CPU(int, sd_llc_id);
> >  DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> > +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_node);
> >  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> >  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> >  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
> > @@ -691,6 +692,18 @@ static void update_top_cache_domain(int cpu)
> >         per_cpu(sd_llc_id, cpu) = id;
> >         rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
> >
> > +       while (sd && sd->parent) {
> > +               /*
> > +                * SD_NUMA is the first domain spanning nodes, therefore @sd
> > +                * must be the domain that spans a single node.
> > +                */
> > +               if (sd->parent->flags & SD_NUMA)
> > +                       break;
> > +
> > +               sd = sd->parent;
> > +       }
> > +       rcu_assign_pointer(per_cpu(sd_node, cpu), sd);
> > +
> >         sd = lowest_flag_domain(cpu, SD_NUMA);
> >         rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
> >
