linux-kernel - Re: [PATCH] ACPI: Remove msleep() bloat from acpi_os

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJZ5v0jvYitb7DLyLkqTRv0TT=6yBHDvEvb8tJLzAOVKa3hqnQ@mail.gmail.com>
Date: Wed, 28 Aug 2024 12:58:34 +0200
From: "Rafael J. Wysocki" <rafael@...nel.org>
To: Len Brown <lenb@...nel.org>
Cc: "Rafael J. Wysocki" <rafael@...nel.org>, rjw@...ysocki.net, linux-acpi@...r.kernel.org, 
	linux-pm@...r.kernel.org, linux-kernel@...r.kernel.org, 
	Len Brown <len.brown@...el.com>, Arjan van de Ven <arjan@...ux.intel.com>, 
	Todd Brandt <todd.e.brandt@...el.com>, Thomas Gleixner <tglx@...utronix.de>, 
	Anna-Maria Behnsen <anna-maria@...utronix.de>, Frederic Weisbecker <frederic@...nel.org>, 
	Peter Zijlstra <peterz@...radead.org>
Subject: Re: [PATCH] ACPI: Remove msleep() bloat from acpi_os_sleep()

On Wed, Aug 28, 2024 at 6:12 AM Len Brown <lenb@...nel.org> wrote:
>
> On Tue, Aug 27, 2024 at 7:29 AM Rafael J. Wysocki <rafael@...nel.org> wrote:
> >
> > First, let me add a few people who know more about timers than I do.
> >
> > On Tue, Aug 27, 2024 at 5:42 AM Len Brown <lenb@...nel.org> wrote:
> > >
> > > From: Len Brown <len.brown@...el.com>
> > >
> > > Optimize acpi_os_sleep(ms) using usleep_range(floor, ceiling).
> > > The floor of the range is the exact requested ms,
> > > with an additional 1ms of slack for sleeps above 20ms.
> > >
> > > This reduces  the kernel resume time of the Dell 9300
> > > to 1,124 ms from 2,471 ms.
> > >
> > > The ACPI AML Sleep(ms) method calls acpi_os_sleep(ms),
> > > which has invoked msleep(ms) since 2013.
> > >
> > > But msleep(ms) is based on jiffies, and the rounding-up
> > > logic to convert to jiffies on a HZ=250 system causes
> > > msleep(5) to bloat to a minimum of a 12ms delay.
> > > msleep(5) typically takes over 15ms!
> > >
> > > As a result, AML delay loops with small Sleep() inside
> > > magnify the entire loop.  A particularly painful example
> > > is ACPI support for powering-on ICL and TGL
> > > thunderbolt/pcie_ports during system resume.
> > >
> > > Regarding jiffy-based msleep() being inexpensive
> > > and hrtimer-based usleep_range() being expensive.
> > > ACPI AML timer invocations are rare, and so it
> > > is unlikely the hrtimer cost will be noticible,
> > > or even measurable.  At the same time, the msleep()
> > > timer duration bloat is significant enough to
> > > be noticed by end users.
> >
> > I'm not sure why you are refusing to follow the implementation of
> > fsleep() and Documentation/timers/timers-howto.rst and still use
> > msleep() for sleep durations longer than 20 ms.
>
> timers_howto.rst could use an update to reflect reality.

Maybe it could.

> It doesn't disclose how toxic msleep actually is for small values.
> msleep(1) takes 11ms on a HZ=250 system.
>
> Linux/ACPI has to support any random AML Sleep(ms) call,
> and sometimes we see timeout loops implemented around
> an inner Sleep(1ms).  If we use msleep those loops explode
> by 11x and aggregate durations that are noticed by end users.

So this is all about short sleeps.

You don't need to argue with me about short sleeps, I'm convinced.

Thus I'm going to skip all arguments regarding short sleeps going forward.

> fsleep does three things -- and none of them are a good fit
> for acpi_os_sleep:
>
> static inline void fsleep(unsigned long usecs)
> {
>         if (usecs <= 10)
>                 udelay(usecs);
>         else if (usecs <= 20000)
>                 usleep_range(usecs, 2 * usecs);
>         else
>                 msleep(DIV_ROUND_UP(usecs, 1000));
> }
>
> > udelay(usecs);
> will never execute in the ACPI case, as the minimum delay is 1000 usec.
>
> > usleep_range(usecs, 2 * usecs);
>
> timers-howto.rst says this:
>
>                         "With the introduction of a range, the scheduler is
>                         free to coalesce your wakeup with any other wakeup
>                         that may have happened for other reasons, or at the
>                         worst case, fire an interrupt for your upper bound."
>
> But the way usleep_range works is it sets the timer for the
> upper bound, and opportunistically wakes/cancels if another timer
> fires after the lower bound and before the upper bound.

Yes, that's how it works.

> It calls it a "worst case" that the timer fires at the upper bound.
>
> But when ACPI suspend/resume flows are running the only other
> timer is the tick, and so the "worst case" happens effectively ALWAYS.

Arguably though, ACPI suspend/resume flows are not the only case in
which AML skeeps.

> So when fsleep does a usleep_range(usecs, 2 * usecs), it is effectively
> DOUBLING the duration of all timers 20 ms and smaller.
> There may be scenarios where doing this makes sense,
> but acpi_os_sleep() is not one of them.

It would be good to actually try this.

However, I'm not arguing that usleep_range(usecs, 2 * usecs) is
suitable for ACPI sleeps.  I'm saying that it is friendly with respect
to the timekeeping subsystem to give it a nonzero range if it is
affordable.

And I do think that it is affordable in the ACPI sleep case.

> > msleep(DIV_ROUND_UP(usecs, 1000));
>
> msleep(50) takes 59.8ms -- a 20% penalty.

I guess you mean on a HZ=250 system system and I guess this is a
measured number.

> We have loops with AML Sleep(50) in the middle,
> and this  code would bloat them by 20%, while
> the user is waiting -- for no benefit. o

If this happens during system suspend/resume and if this is the only
thing that is waited for (ie. nothing runs in parallel with it).

So your resume time is 20% longer than it can be in theory if every
Sleep() causes an extra timer interrupt to be programmed.  Is this
extra overhead justified?

> Again, there may be scenarios where doing this makes sense,
> but acpi_os_sleep() is not one of them.

I'm not convinced.

> > Why should usleep_range() be used for 100 ms sleeps, for instance?
> > This goes against the recommendation in the above document, so is
> > there a particular reason?
>
> The document doesn't say *why* msleep is recommended.
> One would assume that when there are many timers, msleep
> is efficient because it consolidates them to jiffy boundaries,
> and if they are long duration timers, perhaps the assumption is
> that they don't care so much about the additional delays?
>
> Again, there are certainly scenarios where that makes sense,
> but at the end of the day, msleep(100) takes 106 msec.

And is this a big deal really?  Is a user really going to notice the
6% difference and even if they do, would they care?  For one, I
wouldn't.

Arguably, it all depends on the intention of the developer who uses
the msleep(), os Sleep() in the AML case.

Do they mean "exactly 100 ms" or do they mean "at least 100 ms"?  If
the latter is the case, the whole argument is moot if not misguided.

As a side note, I generally think that if a developer puts Sleep(n) in
their code they need to assume that the sleep will be slightly longer
than n for reasons related to scheduling etc. so I don't believe that
they ever mean "exactly n".  However, they probably don't think that
it may take 3 times longer, so it is better to make that not happen.

> ACPI is not a heavy timer user, so the msleep efficiency justification
> for making the user wait longer holds no weight.
>
> Now that we realize that end-users notice timer bloat in acp_os_sleep,
> it is clear that msleep is simply a poor choice for acpi_os_sleep.

For short sleeps, yes.

For long sleeps, using an hrtimer generally ends up in programming an
extra timer interrupt unless the range is really wide, but then
msleep() gets better again.  So is the extra timer interrupt really
justified every time AML evaluates Sleep()?  And again, system
suspend/resume are not the only cases when that happens.

Honestly, I don't think so and that's why I prefer msleep() to be used
for longer sleeps even though it is "slacky".

> > > Regarding usleep_range() timer coalescing.
> > > It virtually never works during ACPI flows, which
> > > commonly run when there are few coalescing
> > > opportunities. As a result, the timers almost
> > > always expire at the maximum end of their specified range.
> >
> > I don't think that's the main point of using a nonzero range in
> > usleep_range().  AFAICS, this is about letting the timekeeping
> > subsystem know how much you care about timer precision so it can
> > arrange things to meet everyone's needs.
>
> The range in usleep_range is to enable timer coalescing,
> which is to reduce the number of timers firing on the system.

Exactly.  Which reduces the number of times a timer interrupts needs
to be programmed.  Which generally reduces overhead.

> If it has another purpose, neither the code nor the API documentation
> are forthcoming.
>
> > > It was tempting to use usleep_range(us, us)
> > > for all values of us.  But 1 ms is added to the
> > > range for timers over 20ms on the reasoning that
> > > the AML Sleep interface has a granularity of 1ms,
> > > most costly loops use duration under 20ms inside,
> > > and singular long sleeps are unlitly to notice an
> > > additiona 1ms, so why not allow some coalescing...
> >
> > So again, why not use msleep() for sleeps longer than 20 ms?
>
> per above.  Too slow.

per above, I'm not convinced.

> > > Closes: https://bugzilla.kernel.org/show_bug.cgi?id=216263
> > > Signed-off-by: Len Brown <len.brown@...el.com>
> > > Suggested-by: Arjan van de Ven <arjan@...ux.intel.com>
> > > Tested-by: Todd Brandt <todd.e.brandt@...el.com>
> > > ---
> > >  drivers/acpi/osl.c | 8 +++++++-
> > >  1 file changed, 7 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
> > > index 70af3fbbebe5..c4c76f86cd7a 100644
> > > --- a/drivers/acpi/osl.c
> > > +++ b/drivers/acpi/osl.c
> > > @@ -607,7 +607,13 @@ acpi_status acpi_os_remove_interrupt_handler(u32 gsi, acpi_osd_handler handler)
> > >
> > >  void acpi_os_sleep(u64 ms)
> > >  {
> > > -       msleep(ms);
> > > +       u64 us = ms * 1000;
> > > +
> > > +       if (us <= 20000)
> > > +               usleep_range(us, us);
> > > +       else
> > > +               usleep_range(us, us + 1000);
> > > +
> > >  }
> > >
> > >  void acpi_os_stall(u32 us)
> > > --
> >
> > While I agree with using usleep_range() for sleeps up to 20 ms in
> > acpi_os_sleep(), I disagree with the patch as is.
>
> The measurement results do not support any form of acpi_os_sleep that
> invokes any form of msleep().

That's a bold claim and it seems a bit unfounded.

Have you tested anything other than system suspend/resume?

Have you tested anything other than x86 laptops?

> Honestly, I think because of when and how acpi_os_sleep is called, we should
> consider making it yet simpler:
>
> void acpi_os_sleep(u64 ms)
> {
>         u64 us = ms * 1000;
>
>         usleep_range(us, us);
> }

Which I obviously disagree with.

In the meantime I have checked how many users of usleep_range() use
the same value as both arguments of it and I have found 2 out of over
4000.

So now, is everyone clueless or is ACPI so special that it has to do
something different than everyone else?

And because I would actually like to see what the difference in terms
of system suspend/resume time is between the $subject patch and
something alternative that uses msleep() for longer sleeps and
generally uses nonzero ranges with usleep_range(), attached is a patch
to compare.

Thanks!

View attachment "acpi-os-sleep.patch" of type "text/x-patch" (459 bytes)