Message-ID: <20101106051811.GB11535@count0.beaverton.ibm.com>
Date:	Fri, 5 Nov 2010 22:18:11 -0700
From:	Matt Helsley <matthltc@...ibm.com>
To:	Oren Laadan <orenl@...columbia.edu>
Cc:	Matt Helsley <matthltc@...ibm.com>,
	Gene Cooperman <gene@....neu.edu>,
	"Luck, Tony" <tony.luck@...el.com>, Kapil Arya <kapil@....neu.edu>,
	"ksummit-2010-discuss@...ts.linux-foundation.org" 
	<ksummit-2010-discuss@...ts.linux-foundation.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch

On Sat, Nov 06, 2010 at 12:06:09AM -0400, Oren Laadan wrote:
> On 11/05/2010 09:16 PM, Matt Helsley wrote:
> > On Fri, Nov 05, 2010 at 01:17:03PM -0400, Gene Cooperman wrote:
> >> On Fri, Nov 05, 2010 at 04:57:33AM -0700, Luck, Tony wrote:
> >>>> Oren noted that sometimes it's important to stop the process only
> >>>> for a few milliseconds while one checkpoints. In DMTCP, we do that
> >>>> by configuring with --enable-forked-checkpointing. This causes us
> >>>> to fork a child process taking advantage of copy-on-write and then
> >>>> checkpoint the memory pages of the child while the parent continues
> >>>> to execute.
> >>>
> >>> Interesting ... but while the process is only stopped for the duration
> >>> of the fork, it may be taking COW faults on almost every page it
> >>> touches.  I think this will not work well for large HPC applications
> >>> that allocate most of physical memory as anonymous pages for the
> >>> application. It may even result in an OOM kill if you don't complete
> >>> the checkpoint of the child and have it exit in a timely manner.
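
For readers following along, the forked-checkpointing idea discussed above can be sketched roughly as follows. This is only an illustration of the fork-then-save pattern, not DMTCP's implementation: the names forked_checkpoint and run_demo, the pickle format, and the temporary file are all assumptions made for the example. Note also that CPython's reference counting touches pages on its own, so this shows the snapshot semantics rather than realistic COW fault behaviour.

```python
import os
import pickle
import tempfile


def forked_checkpoint(state, path):
    """Fork a child that saves `state` as it was at fork time.

    Returns the child's pid; the caller resumes right after the fork.
    (Hypothetical helper, for illustration only.)
    """
    pid = os.fork()
    if pid == 0:
        # Child: works on a copy-on-write snapshot of the parent's memory,
        # so later writes in the parent are not visible here.
        status = 1
        try:
            with open(path, "wb") as f:
                pickle.dump(state, f)
            status = 0
        finally:
            # Exit without running the parent's cleanup handlers.
            os._exit(status)
    return pid


def run_demo():
    state = {"step": 1, "data": list(range(5))}
    fd, path = tempfile.mkstemp(suffix=".ckpt")
    os.close(fd)
    try:
        pid = forked_checkpoint(state, path)
        # Parent keeps executing and dirties its own copy of the pages.
        state["step"] = 2
        state["data"].append(99)
        # Reap the child promptly; a lingering child keeps the
        # snapshot pages alive, which is the memory concern raised above.
        _, wait_status = os.waitpid(pid, 0)
        with open(path, "rb") as f:
            snapshot = pickle.load(f)
    finally:
        os.unlink(path)
    return state, snapshot, wait_status


if __name__ == "__main__":
    live, saved, _ = run_demo()
    print("live state:   ", live)
    print("checkpointed: ", saved)
```

The parent is blocked only for the fork itself; every page it writes afterwards is copied, which is where the COW faults and the extra memory pressure come from.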

<snip>

> > The current linux-cr approach to handling [dirty] pages doesn't use COW.
> > The tasks are frozen using the cgroup freezer and thus unable to modify
> > the pages. So we don't have to mess with page tables nor do we pay
> > any extra overhead for page faults.
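
As a rough sketch of the freezing step mentioned above, assuming the cgroup v1 freezer interface (a freezer.state file that accepts FROZEN and THAWED and may report FREEZING while the transition is in progress): the helper name set_freezer_state, the timeout values, and the example cgroup path are assumptions for illustration, and this is not code from the linux-cr patchset.

```python
import os
import time


def set_freezer_state(cgroup_dir, target, timeout=5.0, poll=0.05):
    """Write `target` ("FROZEN" or "THAWED") to freezer.state and wait.

    Returns True once the file reads back as `target`, False on timeout.
    (Hypothetical helper; `cgroup_dir` is whatever freezer cgroup
    directory the tasks to be checkpointed were placed in.)
    """
    state_file = os.path.join(cgroup_dir, "freezer.state")
    with open(state_file, "w") as f:
        f.write(target)

    deadline = time.monotonic() + timeout
    while True:
        # The kernel may report an intermediate FREEZING state, so poll
        # until the requested state is actually reached.
        with open(state_file) as f:
            current = f.read().strip()
        if current == target:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

A caller would freeze before saving memory and thaw afterwards, for example set_freezer_state("/sys/fs/cgroup/freezer/ckpt", "FROZEN"), where that path is an assumed mount point and group name. While the group is frozen the tasks cannot modify their pages, so no page-table tricks are needed.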
> 
> The current linux-cr patchset leaves out any optimizations
> for simplicity of reviewing - first get it working and reviewed.
> We experimented with optimizations in previous systems.
> 
> > If we ever implement thawed checkpointing -- checkpointing while
> > the task isn't frozen -- then we'd probably use COW and see
> > the same faults. The difference then would be that in-kernel we
> > wouldn't have one extra task per mm being checkpointed.
> 
> Thawed checkpointing can be done without any COW tax, by leveraging
> the native hardware dirty bit in page tables. There is no need to
> trigger additional checkpoints. Tracking modified pages using the

s/checkpoints/faults/

Cheers,
	-Matt Helsley