[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJZ5v0gTfhTQ4AMZ+ukuJZEG=RRo-wbPsf7NPbWA0snDeA5ivQ@mail.gmail.com>
Date: Wed, 28 Aug 2024 22:29:27 +0200
From: "Rafael J. Wysocki" <rafael@...nel.org>
To: Anna-Maria Behnsen <anna-maria@...utronix.de>
Cc: Len Brown <lenb@...nel.org>, "Rafael J. Wysocki" <rafael@...nel.org>,
Thomas Gleixner <tglx@...utronix.de>, linux-acpi@...r.kernel.org, linux-pm@...r.kernel.org,
linux-kernel@...r.kernel.org, Frederic Weisbecker <frederic@...nel.org>,
Peter Zijlstra <peterz@...radead.org>, Jonathan Corbet <corbet@....net>,
Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [PATCH] ACPI: Remove msleep() bloat from acpi_os_sleep()
Hi,
On Wed, Aug 28, 2024 at 9:44 PM Anna-Maria Behnsen
<anna-maria@...utronix.de> wrote:
>
> Hi,
>
> I try to give some input from the timer perspective and maybe it helps
> to clarify which should be the proper way to go for acpi_os_sleep(). It
> also identifies some problems with the current documentation and
> implementation of msleep/fsleep.
>
> Len Brown <lenb@...nel.org> writes:
>
> > Attempting to grab all the loose ends from this discussion
> > and put them into a list of 4 things:
> >
> > 1. Is it time to re-visit Jon's proposal to fix msleep, especially for
> > small sleep values?
>
> Lets have a deeper look to msleep() internals: msleep() uses timer list
> timers. Because of the design of the timer wheel (granularity of buckets
> increases with the levels) and because of the granularity of jiffies the
> sleep values will be longer as specified. Let's assume we are executing
> a msleep(1) on a HZ=250 system:
>
> First msecs are mapped on jiffies, so this results in a 4ms timeout
> value, as there is nothing shorter than 1 jiffie. Then the jiffie value
> is handed over to timer code and msleep() adds another jiffie to the
> timeout. The timeout is then 2 jiffies or 8ms. With this timeout a timer
> list timer is queued. To make sure that timers will not fire early or
> race with a concurrent incrementation of jiffie, timers are queued
> always into the next bucket. As the timer will end up in the first level
> of the timer wheel the granularity of the buckets is 1 jiffies. This
> means that the timeout would be 3 jiffies in worst case.
>
> The additional jiffie in msleep() is the historical prevention that the
> sleep time is at least the specified time. This is handled by timer
> wheel core code itself, so this extra jiffie could be removed. I will
> provide a patch for it.
>
> But even with this change, the worst case sleep length will be 8ms
> instead of 1ms.
>
> For comparison, see here a table with the values for all the steps
> explained above taking some different msleep values. I already dropped
> the addition of the extra jiffie. The granularity of the timer wheel
> levels can be found at the first long comment in kernel/time/timer.c.
>
> This is still a HZ=250 system:
>
> msleep() | msecs_to_jiffies() | worst case delay after | timer wheel
> | | enqueue (next bucket) | level
> --------------------------------------------------------------------------
> 1 ms | 1 jiffie | 2 jiffies, 8 ms | 0
> 256 ms | 64 jiffies | 72 jiffies, 288 ms | 1
> 257 ms | 65 jiffies | 72 jiffies, 288 ms | 1
> 2048 ms | 513 jiffies | 576 jiffies, 2304 ms | 2
> 2300 ms | 575 jiffies | 576 jiffies, 2304 ms | 2
> 4096 ms | 1024 jiffies | 1088 jiffies, 4352 ms | 2
>
> The same values on a HZ=1000 system:
>
> msleep() | msecs_to_jiffies() | worst case delay after | timer wheel
> | | enqueue (next bucket) | level
> --------------------------------------------------------------------------
> 1 ms | 1 jiffie | 2 jiffies, 2 ms | 0
> 256 ms | 256 jiffies | 264 jiffies, 264 ms | 1
> 257 ms | 257 jiffies | 264 jiffies, 264 ms | 1
> 2048 ms | 2048 jiffies | 2112 jiffies, 2112 ms | 2
> 2300 ms | 2300 jiffies | 2304 jiffies, 2304 ms | 2
> 4096 ms | 4096 jiffies | 4608 jiffies, 4608 ms | 3
>
>
> The documentation states, that msleep is not valid for short sleep
> values. But as demonstrated with the two tables, this is not enirely
> true. So the descision whether msleep() is suitable for the usecase is
> not as simple as documented. It depends on the HZ configuration in
> combination with the timeout value, and on the request how precise the
> timeout has to be. And another important point is: hrtimers are more
> expensive then timer list timers...
>
> The documentation was originally written back in 2010 where the non
> cascading timer wheel wasn't in place yet. So is has to be updated.
>
> > 2. We agree that usleep_range() is appropriate for short acpi_os_sleep()
> > due to ASL loops with small Sleep() values.
> > But if we do something different for large and small values,
> > where is the line between small and large?
>
> As pointed out above - this depends on HZ and what the requirements of
> the callsite are.
>
> > fsleep anointed 20ms, but didn't document why.
> > (and it made both short sleeps *and* long sleeps too slow to be useful
> > here, IMO)
>
> fsleep() just implements what the outdated documentation states. And it
> adds a magic max value for usleep_range(). It seems to me, that fsleep()
> somehow accidentially found the way into the kernel in 2020. As it is
> now there this needs to be fixed and should take the above things into
> account in some generic way.
>
> > 3. Is usleep_range(min, max) with min= max bad?
> > If it is good, why is virtually nobody else using min=max?
>
> It's not bad to use it. It depends on your use case. If you really need
> the precise sleep length, then it should be valid to use min = max.
>
> I hope the timer information will help to find the proper solution for
> acpi_os_sleep().
Yes, it is useful, thank you!
The main difficulty is that acpi_os_sleep() really works on behalf of
AML code (it is called when the Sleep() operator is evaluated in AML)
and it is hard to say what behavior is intended there.
We know that the msleep() precision is not sufficient at least in some
cases (for example, a high-count loop in AML that sleeps for 5 ms in
every iteration), but we don't really know what precision is needed.
IMV it generally is reasonable to assume that firmware developers
would not expect the exact sleep time (and the ACPI specification is
not very precise regarding this either), but they wouldn't also expect
sleep times much longer (in relative terms) than requested. So
roughly speaking they probably assume something between t and t + t/4,
where t is the requested sleep time in milliseconds, regardless of the
HZ value.
Also, you said above that hrtimers were more expensive than the timer
wheel timers, which we all agree with, but it is not clear to me what
exactly the difference is. For example, if there is a loop of 1000
iterations in AML that each sleep for 5 ms, how much more overhead
from using hrtimers would be there relative to using timer-wheel
timers?
Generally speaking, Len's idea of using usleep_range() for all sleep
lengths in acpi_os_sleep() is compelling because it leads to simple
code, but it is not clear to me how much more it would cost relative
to msleep() (or what issues it may provoke for that matter) and what
delta value to use in there. One idea is to derive the delta from the
sleep length (say, take 1/4 of it like I did above), but the
counter-argument is that this would cause the loops in AML in question
to take measurably more time and there may be no real reason for it.
Thanks!
Powered by blists - more mailing lists