linux-kernel - Re: [linux-pm] Re: Hibernation considerations

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Pine.LNX.4.64.0707212000380.6747@asgard.lang.hm>
Date:	Sat, 21 Jul 2007 20:43:24 -0700 (PDT)
From:	david@...g.hm
To:	Alan Stern <stern@...land.harvard.edu>
cc:	Jeremy Maitin-Shepard <jbms@....edu>,
	Milton Miller <miltonm@....com>,
	"Rafael J. Wysocki" <rjw@...k.pl>,
	Ying Huang <ying.huang@...el.com>,
	LKML <linux-kernel@...r.kernel.org>,
	linux-pm <linux-pm@...ts.linux-foundation.org>
Subject: Re: [linux-pm] Re: Hibernation considerations

On Sat, 21 Jul 2007, Alan Stern wrote:

> On Fri, 20 Jul 2007 david@...g.hm wrote:
>
>>> How would you prevent tasks from being scheduled?  How would you
>>> prevent drivers from deadlocking because in order to put their device
>>> in a low-power state they need to acquire a lock which is held by a
>>> user task?
>>
>> you give up on the suspend becouse you have no way of getting the user
>> task to give up the lock.
>
> Once the deadlock has occurred it's too late.  You can't give up; in
> fact you can't do anything at all.  The system has hung.
>
>> however, kernel locks should not be held by user tasks, user tasks are not
>> expected to behave in rational ways, allowing them to compete with kernel
>> tasks for locks is a sure way to get a deadlock or indefinate stall.
>
> What on Earth are you talking about?  "Kernel locks should not be held
> by user tasks"?  Then who _should_ hold them?  You are aware, I hope,
> that down() and mutex_lock() can be called only in process context?
>
>> what locks are accessed this way?
>
> Lots of them.  For example, most drivers won't want a suspend to occur
> right in the middle of an I/O transfer.  To prevent this, the driver
> might use a mutex.  The task doing the I/O (which will be a user task)
> acquires the mutex during a transfer and the suspend routine acquires
> the mutex while quiescing the device.

wait a min her, it's possible we are misunderstanding each other.

as I see it.

if userspace can aquire locks that prevent the kernel from shutting off 
(or doing anything else in particular) then it's possible for misbehaving 
userspace code to stop the kernel by simply choosing to never release the 
lock.

this would be a trivial DOS from userspace.

now, if you are talking instead about the fact that when userspace makes a 
system call, the execution of that system call involves aquiring locks 
that are released before the system call completes you have a very 
different situation.

if you have locks that are held across system calls then you should 
already have problems. becouse you can't count on userspace ever taking 
whatever action is appropriate to release the lock.

what am I missing that concerns you so much?

>>>> Does it really (fundamentally) require scheduling tasks, particularly in
>>>> the case that the devices have already been put in the "quiesced" state?
>>>
>>> I can't say for sure.  That's the way we have been doing it.  It
>>> wouldn't be easy to change, because the driver would have to busy-wait
>>> during delays -- which would mean it would need to use different code
>>> for system-wide suspend and runtime suspend.
>>
>> please define terms so that we are all on the same page
>
> Please read Documentation/power/devices.txt.

I have done so.

>> what do you mean by
>> system-wide suspend
>
> That's what you would call standby, suspend-to-RAM, or hibernate.  The
> entire system goes to sleep.
>
>> runtime suspend
>
> That's when an individual device is placed in a low-power state to
> save energy while it isn't being used.  The system as a whole remains
> awake and the device will be resumed the next time it is needed for
> anything.

thanks for the defintitions.

having read through Documentation/power/devices.txt I remain convinced 
that you are making a fundamental mistake.

you are designing a system that will only work if everything (every 
driver, every state transition) participates fully in the process at all 
times. You started with the facts 'this is the info that ACPI provides and 
this is how it is designed to be used' and worked from there instead of 
looking to see what the kernel really needed and figuring how to provide a 
good interface for that that happens to be implemented (today) with ACPI. 
(a proper power management framework shouldn't care if you have ACPI, APM, 
or some other method of controlling the devices)

this leads to resume functions that can only work if the proper suspend 
function was called rather then makeing 'resume' just mean 'go to full 
operation', which is the same thing that gets called when the device is 
first initialized. internally it can examine the hardware and follow 
different paths depending on what it finds the current state of the 
hardware is, but the outside world (including the rest of the kernel) 
should not care. the fact that the rest of the kernel needs to know if it 
should call 'resume' or 'initialize' is a failure in the abstraction.

in fact, a better abstraction would be something like

report_power_modes
   which would return a series of modes (sorted only by modeID)
   modeID, %power_used_in_this_mode, %capability_in_this_mode
   (I would make mode 0 always be complete power off, and mode 1 always be 
full capacity)

report_power_mode_speed
   which would return a matrix giving how long it takes to transition from 
any mode to any other mode. this should be a relative number, not an 
absolute number since it will be different at different clock speeds.

set_operational_mode(modeID)
   which would take you from whatever mode you are in now to the requested 
mode.

most devices would report the simple list of modes

0,0,0
1,100,100

with a mode_speed matrix of
   0 1
   ---
0|0 1
1|1 0

it may be that there is more info needed for the powr management engine to 
decide what modes it wants to put things into, if so identify what type of 
info you need and add another column to the modes list.
for example:
   you may want to add a flag for 'does this mode allow downstream devices 
to operate?'
   you may want to make a mode for 'this mode doesn't allow any new 
requests, but continues to process pending requests' and have a flag that 
indicates this

currently it looks like there's no way to find out what modes are 
available, and you have to know what mode something is in currently before 
you can request it change to a different mode. both of these prevent 
effective power management without encoding intimate knowledge of the 
capability of the particular hardware in your management tool.

some of this may be discoverable via the ACPI interface (it's not talked 
about much in the devices.txt file), but the mode setting is still wrong.

note that in the example above it's accpetable for a driver to cache what 
mode it thinks the device is in, but it needs to properly set the new 
mode even if it's cached data is incorrect.

this approach would allow the transition of ALL drivers to the new mode of 
operation in one fell swoop, and then adding additional power management 
features is just adding to the existing list rather then implementing new 
functions.

David Lang
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/