linux-kernel - Re: PSI idle-shutoff

Message-ID: <CAJuCfpG=bADguJWdxV4ZhTze9fmKP2bhsbf0xzbd06Fhr4_U5w@mail.gmail.com>
Date:   Tue, 11 Oct 2022 10:11:58 -0700
From:   Suren Baghdasaryan <surenb@...gle.com>
To:     Hillf Danton <hdanton@...a.com>
Cc:     Pavan Kondeti <quic_pkondeti@...cinc.com>,
        Johannes Weiner <hannes@...xchg.org>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, quic_charante@...cinc.com
Subject: Re: PSI idle-shutoff

On Tue, Oct 11, 2022 at 4:38 AM Hillf Danton <hdanton@...a.com> wrote:
>
> On 10 Oct 2022 14:16:26 -0700 Suren Baghdasaryan <surenb@...gle.com>
> > On Mon, Oct 10, 2022 at 3:57 AM Hillf Danton <hdanton@...a.com> wrote:
> > > On 13 Sep 2022 19:38:17 +0530 Pavan Kondeti <quic_pkondeti@...cinc.com>
> > > > Hi
> > > >
> > > > Since psi_avgs_work()->collect_percpu_times()->get_recent_times()
> > > > runs from a kworker thread, the PSI_NONIDLE condition is always
> > > > observed, as there is a RUNNING task. So we always end up re-arming
> > > > the work.
> > > >
> > > > If the work is re-armed from psi_avgs_work() itself, the backing-off
> > > > logic in psi_task_change() (will be moved to psi_task_switch soon) can't
> > > > help. The work is already scheduled, so we don't do anything there.
> > > >
> > > > Probably I am missing something here. Can you please clarify how we
> > > > shut off re-arming the psi avg work?
> > >
> > > Instead of open coding schedule_delayed_work() in a bid to check whether
> > > the timer hits the idle task (see delayed_work_timer_fn()), the idle task
> > > is tracked in psi_task_switch() and checked by the kworker to see if it
> > > preempted the idle task.
> > >
> > > Only for thoughts now.
> > >
> > > Hillf
> > >
> > > +++ b/kernel/sched/psi.c
> > > @@ -412,6 +412,8 @@ static u64 update_averages(struct psi_gr
> > >         return avg_next_update;
> > >  }
> > >
> > > +static DEFINE_PER_CPU(int, prev_task_is_idle);
> > > +
> > >  static void psi_avgs_work(struct work_struct *work)
> > >  {
> > >         struct delayed_work *dwork;
> > > @@ -439,7 +441,7 @@ static void psi_avgs_work(struct work_st
> > >         if (now >= group->avg_next_update)
> > >                 group->avg_next_update = update_averages(group, now);
> > >
> > > -       if (nonidle) {
> > > +       if (nonidle && 0 == per_cpu(prev_task_is_idle, raw_smp_processor_id())) {
> >
> > This condition would be incorrect if nonidle was set by a cpu other
> > than raw_smp_processor_id() and
> > prev_task_is_idle[raw_smp_processor_id()] == 0.
>
> Thanks for taking a look.

Thanks for the suggestion!

>
> > IOW, if some activity happens on a non-current cpu, we would fail to
> > reschedule psi_avgs_work for it.
>
> Given activities on remote CPUs, can you specify what prevents psi_avgs_work
> from being scheduled on remote CPUs if for example the local CPU has been
> idle for a second?

I'm not a scheduler expert, but I can imagine some work that finished
running on a big core A and generated some activity since the last
time psi_avgs_work executed. With no other activity, the next
psi_avgs_work could be scheduled on a small core B to conserve power.
There might be other cases involving cpuset limitation changes or cpu
offlining, but I didn't think too hard about these. The bottom line: I
don't think we should be designing mechanisms which rely on
assumptions about how tasks will be scheduled. Even if these
assumptions are correct today, they might change in the future and
things will break in unexpected places.
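
The big core / small core case can be modelled in userspace. This is a
minimal sketch, not kernel code: struct cpu_state, NR_CPUS, rearm_original()
and rearm_patched() are hypothetical names introduced for illustration, and
it assumes prev_task_is_idle is set on a CPU when the worker preempted the
idle task there. rearm_original() mirrors the existing "if (nonidle)" check;
rearm_patched() mirrors the condition from the proposed diff.

```c
#include <stdbool.h>

#define NR_CPUS 4 /* hypothetical CPU count for this model */

/* Hypothetical per-CPU state; a userspace stand-in, not a kernel structure. */
struct cpu_state {
	bool nonidle;           /* CPU recorded PSI activity since the last run */
	bool prev_task_is_idle; /* assumed: worker preempted the idle task here */
};

/* Aggregate nonidle across all CPUs, as the averaging work does. */
static bool any_nonidle(const struct cpu_state *cpus)
{
	for (int i = 0; i < NR_CPUS; i++)
		if (cpus[i].nonidle)
			return true;
	return false;
}

/* Existing check: re-arm whenever any CPU saw activity. */
static bool rearm_original(const struct cpu_state *cpus)
{
	return any_nonidle(cpus);
}

/* Proposed check: also consult prev_task_is_idle, but only on the worker's CPU. */
static bool rearm_patched(const struct cpu_state *cpus, int worker_cpu)
{
	return any_nonidle(cpus) && !cpus[worker_cpu].prev_task_is_idle;
}
```

With activity recorded on CPU 0 (core A) and the worker running on CPU 1
(core B), where it preempted the idle task, rearm_original() returns true
while rearm_patched(cpus, 1) returns false, so in this model the activity on
core A would not cause the work to be rescheduled.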

>
> > This can be fixed in collect_percpu_times() by
> > considering prev_task_is_idle for all other CPUs as well. However
> > Chengming's approach seems simpler to me TBH and does not require an
> > additional per-cpu variable.
>
> Good ideas are always welcome.

No question about that. Thanks!