Message-Id: <200707101207.16711.nigel@nigel.suspend2.net>
Date: Tue, 10 Jul 2007 12:07:15 +1000
From: Nigel Cunningham <nigel@...el.suspend2.net>
To: Kyle Moffett <mrmacman_g4@....com>
Cc: Benjamin Herrenschmidt <benh@...nel.crashing.org>,
Pavel Machek <pavel@....cz>, "Rafael J. Wysocki" <rjw@...k.pl>,
Matthew Garrett <mjg59@...f.ucam.org>,
linux-kernel@...r.kernel.org, linux-pm@...ts.linux-foundation.org,
Alan Stern <stern@...land.harvard.edu>
Subject: Re: [PATCH] Remove process freezer from suspend to RAM pathway
Hi.
Sorry for the long delay. Busy weekend and my motivation for working on
programming is almost zero at the moment...
On Friday 06 July 2007 15:01:48 Kyle Moffett wrote:
> On Jul 06, 2007, at 00:03:15, Nigel Cunningham wrote:
> > On Friday 06 July 2007 13:54:15 Benjamin Herrenschmidt wrote:
> >> On Fri, 2007-07-06 at 09:35 +1000, Nigel Cunningham wrote:
> >>>
> >>> Nice try :) Okay then, you remove the freezer, try hibernating,
> >>> then get back to me after you've fixed your filesystem because
> >>> some process that wasn't frozen started writing things after the
> >>> atomic copy (making the on disk filesystem inconsistent with the
> >>> snapshot).
> >>>
> >>> As Pavel rightly said, you can get rid of the freezer, but you're
> >>> only going to have to implement another one that does
> >>> essentially the same thing, even if it is at some other level.
> >>
> >> I was mostly talking about STR... Regarding STD, we have a
> >> different problem and we all know it. The freezer is one somewhat
> >> horrible way to get it working for now, I would prefer something
> >> more along the way that blocks the page cache from writing out new
> >> dirty pages though, except those specifically flagged by the
> >> snapshot.
> >>
> >> That is, some kind of proper snapshotting facility, as Linus was
> >> describing some time ago.
> >
> > The kind of thing Linus was talking about would limit you (as
> > swsusp and uswsusp do now) to only half the amount of memory.
>
> How so? Suppose hibernate is implemented like this:
>
> (1) Userspace program calls sys_freeze_processes()
> (a) Pokes all CPUs with IPMIs and tells them to finish the
> currently running timeslot then stop
> (b) Atomically sends SIGSTOP to all userspace processes in a non-
> trappable way, except the calling process and any process which is
> ptracing it.
> (c) Returns to the calling process.
Ok. First, I'll ignore the specification that userspace does this - I don't
think it matters whether it's userspace or the kernel that does the suspending,
and I have yet to see a good reason for it to be [required to be] done from
userspace.
In this first step, you've reinvented the first part of the current freezer
implementation. The reason we don't use a real signal is precisely so we can
have an untrappable SIGSTOP. In this regard, I particularly remember Win4Lin
from a few years ago. It would die if you sent it a real signal, so we had to
do it this way. No doubt there are other instances I'm not aware of.
> (2) Userspace process sends SIGCONT to only those processes which are
> necessary for sync and a device-mapper snapshot.
How do you determine which ones are needed? Why stop them in the first place?
> (3) Userspace calls sys_snapshot_kernel(snapshot_overhead_pages)
> (a) Kernel starts freeing memory and swapping stuff out to make
> room for a copy of *kernel* memory (not pagecache, not process RAM).
> It does the same for at least snapshot_overhead_pages extra (used by
> userspace later). It then allocates this memory to keep it from
> going away. Since most processes are stopped we won't have much else
> competing with us for the RAM.
Ok. So now you also need processes running that are needed for swapping,
because freeing that memory might involve swapping. Fully agree with the
logic though (not really surprising - this is what I do in
Suspend2^wTuxOnIce).
> (a) Kernel uses the device-mapper up-call-into-filesystem
> machinery to get all mounted filesystems synced and ready for a DM
> snapshot. This may include sending data via the userspace processes
> resumed in (2). Any deadlocks here are userspace's fault (see (2)).
> Will need some modification to handle doing multiple blockdevs at a
> time. Anything using FUSE is basically perma-synced anyways (no dep-
> handling needed), and anything using loop should already be handled
> by DM. This includes allocating memory for the basic snapshot
> datastructures.
> (b) At this point all blockdev operations should be halted and
> disk caches flushed; that's all we care about.
> (c) Go through the device tree and quiesce DMA and shut off
> interrupts. Since all the disks are synced this is easy.
> (d) Use IPMIs again to get all the CPUs together, which should be
> easy as most processes are sleeping in IO or SIGSTOPed, and we're
> getting no interrupts.
> (e) One CPU turns off all interrupts on itself and takes an atomic
> snapshot of kernel memory into the previously allocated storage.
> Once again, does not include pagecache. The kernel also records a
> list of what pages *are* included in the pagecache. It then marks
> all userspace pages as copy-on-write.
Hotplugging cpus (when all those locking issues are taken care of) is simpler.
Prior to cpu hotplugging, I used IPIs to put secondary cpus into a tight
loop, so I know it's possible to do it this way too. That way, though, you
have less flexibility. What if a cpu really is plugged in between hibernate
and resume? With cpu hotplugging, it's handled properly and transparently.
Without cpu hotplugging, you could be using uninitialised data after the
atomic restore.
Marking userspace as COW makes things more complicated, too. You then have to
add code to the COW handling to update the list of pages that need to be
saved, and you reduce the reliability of the whole process. You can't predict
beforehand how many of these COW pages are going to be needed, and therefore
can't know how much memory to free earlier on in the process. If you run out
of memory, what will be the effect?
> (f) That CPU finalizes the modified DM snapshot using the
> previously-allocated memory.
> (g) That CPU frees up the snapshot_overhead_pages memory allocated
> during step (a) for userspace to use.
> (h) The CPU does the equivalent of a "swapoff -a" without
> overwriting any data already on any swap device(s).
You still need to remember what swap you're going to use to write the image.
You'll probably want to get this information (and allocate the swap) sooner
rather than later so that you're not racing against the memory freeing
earlier, and don't run into issues with bmapping the pages or having enough
memory to record the bdevs & sector numbers (not usually an issue, but if
swap is highly fragmented...).
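To illustrate why fragmentation matters for that bookkeeping, here is a rough sketch (not code from any hibernation implementation). It assumes the image locations are recorded as runs of consecutive sectors per bdev; the function name to_extents and the device name "sda2" are made up, and one sector number per page is a simplification.

```python
def to_extents(locations):
    """Collapse (bdev, sector) pairs into (bdev, first, last) runs.

    Hypothetical helper: consecutive sectors on the same bdev are merged
    into a single extent, so the memory needed for the record grows with
    the number of runs rather than the number of pages.
    """
    extents = []
    for bdev, sector in locations:
        if extents and extents[-1][0] == bdev and extents[-1][2] + 1 == sector:
            # Extends the current run: just move its end forward.
            extents[-1] = (bdev, extents[-1][1], sector)
        else:
            # A gap or a different bdev starts a new run.
            extents.append((bdev, sector, sector))
    return extents


# Eight pages in one contiguous run versus eight pages with a gap after each.
contiguous = [("sda2", s) for s in range(100, 108)]
fragmented = [("sda2", s) for s in range(100, 116, 2)]

print(len(to_extents(contiguous)))  # one extent
print(len(to_extents(fragmented)))  # one extent per page
```

With contiguous swap the record stays tiny no matter how large the image is; with badly fragmented swap it approaches one entry per page, which is the case where the memory for the record itself has to be budgeted.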
> (i) The CPU then IPMI-signals the other CPUs to wake them up
> (j) The kernel returns a FD-reference to the snapshot and the read-
> only halves of the CoW pagecache to the process which called
> sys_snapshot_kernel().
Readonly halves? I don't get that, sorry.
> (4) The userspace process now has a reference to the copy of the
> kernel pages and the unmodified pagecache pages. Since 99% of the
> processes aren't running, we aren't going to be having to CoW many of
> the pagecache pages.
Mmm, but you still don't know how many.
> (5) The userspace process uses read() or other syscalls to get data
> out of the kernel-snapshot FD in small chunks, within its
> snapshot_overhead_pages limit. It compresses these and writes them
> out to the snapshot-storage blockdev (must not be mounted during
> snapshot), or to any network server.
>
> (6) The userspace process syncs the disks and halts the system. Any
> changed filesystem pages after the pseudo-DM-snapshot should have
> been stored in semi-volatile storage somewhere and will be discarded
> on the next reboot.
Are you thinking the changed filesystem pages are caught by COW? (AFAIUI,
kernel writes aren't). If (as I expect), you're thinking about filesystem
writes to DM-based storage, what about non-DM-based filesystem pages?
> So basically your hibernate-overhead would consist of:
> (1) The pages necessary for the atomic snapshot of kernel memory
> and the list of pagecache pages at that time
> (2) A little memory necessary for the kernel non-persistent DM
> snapshot datastructures.
> (3) The snapshot_overhead_pages needed by userspace.
>
> If you're using swap devices then you can save 99% of the state of
> the running kernel with an initial swapout overhead of virtually
> nothing beyond the size of the unswappable kernel memory.
FWIW, let me note an important variation from how Suspend2 works; it might
provide food for thought. In Suspend2, we treat the processes that remain
stopped throughout the whole process specially. We write their data to disk
before the atomic copy (usually 70 or 80% of memory), and then use the memory
they occupy for the destination of the atomic copy. This further reduces the
amount of memory that has to be freed, almost always to zero.
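To put rough numbers on that, here is a back-of-the-envelope sketch. It is a simplified model only: the function names and page counts are invented for illustration, and it ignores image metadata, I/O buffers and anything else that needs memory during the cycle. The first function models a plain atomic copy, where every page in the image needs a free page to be copied into; the second models writing part of memory out first and reusing it as the destination.

```python
def pages_to_free_plain(total_pages, in_use_pages):
    """Plain atomic copy: each in-use page needs a free destination page,
    so in-use pages must shrink to at most half of memory."""
    shortfall = 2 * in_use_pages - total_pages
    return max(0, -(-shortfall // 2))  # ceiling division by 2


def pages_to_free_two_stage(total_pages, in_use_pages, written_first_percent):
    """Two-stage: pages written to disk before the atomic copy are reused
    as destination space for the atomic copy of the remainder."""
    written_first = in_use_pages * written_first_percent // 100
    to_copy = in_use_pages - written_first
    destination = (total_pages - in_use_pages) + written_first
    shortfall = to_copy - destination
    return max(0, -(-shortfall // 2))


# Hypothetical machine: 1000 pages, 900 in use, 75% written before the copy.
print(pages_to_free_plain(1000, 900))          # 400
print(pages_to_free_two_stage(1000, 900, 75))  # 0
```

In this model the plain approach has to free or swap out 400 pages before it can snapshot, while the two-stage approach needs to free nothing as long as the part written first is at least as large as the part copied atomically.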
Regards,
Nigel
--
See http://www.tuxonice.net for Howtos, FAQs, mailing
lists, wiki and bugzilla info.