Message-ID: <20101106051811.GB11535@count0.beaverton.ibm.com>
Date: Fri, 5 Nov 2010 22:18:11 -0700
From: Matt Helsley <matthltc@...ibm.com>
To: Oren Laadan <orenl@...columbia.edu>
Cc: Matt Helsley <matthltc@...ibm.com>,
Gene Cooperman <gene@....neu.edu>,
"Luck, Tony" <tony.luck@...el.com>, Kapil Arya <kapil@....neu.edu>,
"ksummit-2010-discuss@...ts.linux-foundation.org"
<ksummit-2010-discuss@...ts.linux-foundation.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
On Sat, Nov 06, 2010 at 12:06:09AM -0400, Oren Laadan wrote:
> On 11/05/2010 09:16 PM, Matt Helsley wrote:
> > On Fri, Nov 05, 2010 at 01:17:03PM -0400, Gene Cooperman wrote:
> >> On Fri, Nov 05, 2010 at 04:57:33AM -0700, Luck, Tony wrote:
> >>>> Oren noted that sometimes it's important to stop the process only
> >>>> for a few milliseconds while one checkpoints. In DMTCP, we do that
> >>>> by configuring with --enable-forked-checkpointing. This causes us
> >>>> to fork a child process taking advantage of copy-on-write and then
> >>>> checkpoint the memory pages of the child while the parent continues
> >>>> to execute.
> >>>
> >>> Interesting ... but while the process is only stopped for the duration
> >>> of the fork, it may be taking COW faults on almost every page it
> >>> touches. I think this will not work well for large HPC applications
> >>> that allocate most of physical memory as anonymous pages for the
> >>> application. It may even result in an OOM kill if you don't complete
> >>> the checkpoint of the child and have it exit in a timely manner.
<snip>
> > The current linux-cr approach to handling [dirty] pages doesn't use COW.
> > The tasks are frozen using the cgroup freezer and thus unable to modify
> > the pages. So we don't have to mess with page tables nor do we pay
> > any extra overhead for page faults.
>
> The current linux-cr patchset leaves out any optimizations
> for simplicity of reviewing - first get it working and reviewed.
> We experimented with optimizations in previous systems.
>
> > If we ever implement thawed checkpointing -- checkpointing while
> > the task isn't frozen -- then we'd probably use COW and see
> > the same faults. The difference then would be that in-kernel we
> > wouldn't have one extra task per mm being checkpointed.
>
> Thawed checkpointing can be done without any COW tax, by leveraging
> the native hardware dirty bit in page tables. There is no need to
> trigger additional checkpoints. Tracking modified pages using the
s/checkpoints/faults/
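
For anyone following the thread, the forked checkpointing quoted earlier (fork a child, let copy-on-write preserve the state as of the fork, keep the parent running) can be sketched roughly as below. This is only a user-space illustration in Python, not DMTCP's or linux-cr's code; the names forked_checkpoint and demo are hypothetical, and a real checkpointer would save far more than one buffer.

```python
import os
import tempfile


def forked_checkpoint(state: bytearray, path: str) -> int:
    """Fork a child that writes its view of `state` to `path`.

    Hypothetical helper for illustration. Returns the child's pid
    so the caller can reap it later.
    """
    pid = os.fork()
    if pid == 0:
        # Child: its address space is a copy-on-write snapshot taken
        # at fork time, so later writes by the parent are not visible.
        try:
            with open(path, "wb") as f:
                f.write(state)
            os._exit(0)
        except BaseException:
            os._exit(1)
    return pid


def demo():
    state = bytearray(b"A" * 4096)
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        pid = forked_checkpoint(state, path)
        # Parent keeps running and dirties the buffer; the kernel
        # handles the resulting COW faults on the touched pages.
        state[:] = b"B" * 4096
        _, status = os.waitpid(pid, 0)
        with open(path, "rb") as f:
            snapshot = f.read()
    finally:
        os.unlink(path)
    return snapshot, bytes(state), os.WEXITSTATUS(status)


if __name__ == "__main__":
    snapshot, current, code = demo()
    print(code, snapshot[:4], current[:4])
```

The parent only pauses for the fork itself, which is the appeal; the cost is what Tony describes above, namely a COW fault on each page the parent touches while the child is still alive, plus the extra memory held until the child exits.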
Cheers,
-Matt Helsley