linux-kernel - Re: [PATCH 00/17] RFC: userfault v2

Message-ID: <54549E56.5050106@huawei.com>
Date:	Sat, 1 Nov 2014 16:48:22 +0800
From:	zhanghailiang <zhang.zhanghailiang@...wei.com>
To:	Peter Feiner <pfeiner@...gle.com>
CC:	Andrea Arcangeli <aarcange@...hat.com>, <qemu-devel@...gnu.org>,
	<kvm@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
	Andres Lagar-Cavilla <andreslc@...gle.com>,
	Dave Hansen <dave@...1.net>,
	Paolo Bonzini <pbonzini@...hat.com>,
	Rik van Riel <riel@...hat.com>, Mel Gorman <mgorman@...e.de>,
	Andy Lutomirski <luto@...capital.net>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Sasha Levin <sasha.levin@...cle.com>,
	"Hugh Dickins" <hughd@...gle.com>,
	"Dr. David Alan Gilbert" <dgilbert@...hat.com>,
	Christopher Covington <cov@...eaurora.org>,
	Johannes Weiner <hannes@...xchg.org>,
	Android Kernel Team <kernel-team@...roid.com>,
	"Robert Love" <rlove@...gle.com>,
	Dmitry Adamushko <dmitry.adamushko@...il.com>,
	"Neil Brown" <neilb@...e.de>, Mike Hommey <mh@...ndium.org>,
	Taras Glek <tglek@...illa.com>, Jan Kara <jack@...e.cz>,
	KOSAKI Motohiro <kosaki.motohiro@...il.com>,
	Michel Lespinasse <walken@...gle.com>,
	"Minchan Kim" <minchan@...nel.org>,
	Keith Packard <keithp@...thp.com>,
	"Huangpeng (Peter)" <peter.huangpeng@...wei.com>,
	Isaku Yamahata <yamahata@...inux.co.jp>,
	Anthony Liguori <anthony@...emonkey.ws>,
	"Stefan Hajnoczi" <stefanha@...il.com>,
	Wenchao Xia <wenchaoqemu@...il.com>,
	"Andrew Jones" <drjones@...hat.com>,
	Juan Quintela <quintela@...hat.com>
Subject: Re: [PATCH 00/17] RFC: userfault v2

On 2014/11/1 3:39, Peter Feiner wrote:
> On Fri, Oct 31, 2014 at 11:29:49AM +0800, zhanghailiang wrote:
>> Agreed, but for a live memory snapshot (the VM keeps running while the snapshot is taken),
>> we have to do this (block the write), because we have to save the page before it
>> is dirtied by the write. This is the difference compared to pre-copy migration.
>
> Ah ha, I understand the difference now. I suppose that you have considered
> doing a traditional pre-copy migration (that is, passes over memory saving
> dirty pages, followed by a pause and a final dump of remaining dirty pages) to
> a file. Your approach has the advantage of having the VM pause time bounded by
> the time it takes to handle the userfault and do the write, as opposed to
> pre-copy migration which has a pause time bounded by the time it takes to do
> the final dump of dirty pages, which, in the worst case, is the time it takes
> to dump all of the guest memory!
>

Right! Strictly speaking, migrating a VM's state into a file (fd) is not a snapshot,
because its point in time is not fixed (it depends on when the migration finishes).
A VM snapshot's point in time should be fixed: it should be the moment when I issue
the snapshot command.
A snapshot is very much like taking a photo: it captures the VM's state at that instant ;)

> You could use the old fork & dump trick. Given that the guest's memory is
> backed by a private VMA (as of a year ago when I last looked, this is always the case
> for QEMU), you can have the kernel do the write protection for you.
> Essentially, you fork Qemu and, in the child process, dump the guest memory
> then exit. If the parent (including the guest) writes to guest memory, then it
> will fault and the kernel will copy the page.
>

It is difficult to fork in the QEMU process, which is multi-threaded and holds
all kinds of locks. Actually, this scheme was discussed in the community a long time
ago. It was not accepted.
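
For reference, the fork & dump trick quoted above can be sketched as below. This is a toy, single-threaded Python illustration under stated assumptions, not QEMU code: `fork_and_dump` is a hypothetical helper, and a bytearray stands in for guest memory. It deliberately ignores the multi-threading and lock problems raised here, which are what make the approach hard in a real QEMU process.

```python
import os


def fork_and_dump(guest_mem: bytearray, path: str) -> int:
    """Fork, and let the child write guest_mem as it was at fork time."""
    pid = os.fork()
    if pid == 0:
        # Child: its view of guest_mem is frozen by copy-on-write, so
        # later writes made by the parent are not visible here.
        status = 1
        try:
            with open(path, "wb") as out:
                out.write(guest_mem)
            status = 0
        finally:
            os._exit(status)  # never return into the parent's code path
    return pid  # parent: keeps running and may dirty guest_mem freely
```

The parent would call `fork_and_dump(mem, "snapshot.bin")`, keep running, and later reap the child with `os.waitpid(pid, 0)`. The file then holds the memory contents as of the fork, regardless of what the parent wrote afterwards.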

> The fork & dump approach will give you the best performance w.r.t. guest pause
> times (i.e., just pausing for the COW fault handler), but it does have the
> distinct disadvantage of potentially using 2x the guest memory (i.e., if the

Agreed! This is the second reason why the community did not accept it.

> parent process races ahead and writes to all of the pages before you finish the
> dump). To mitigate memory copying, you could madvise MADV_DONTNEED the child
> memory as you copy it.
>
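
The madvise suggestion quoted above could look roughly like the sketch below: the dumping process writes memory out chunk by chunk and drops each chunk as soon as it is saved, so that it stops holding on to pages the parent may still write to. `dump_and_release` and `chunk_pages` are hypothetical names, the mapping is assumed to be private and anonymous with a length that is a multiple of the page size, and `mmap.madvise` needs Python 3.8 or later on Linux.

```python
import mmap


def dump_and_release(mem: mmap.mmap, path: str, chunk_pages: int = 16) -> None:
    """Write mem to path chunk by chunk, dropping each chunk once written."""
    chunk = chunk_pages * mmap.PAGESIZE
    with open(path, "wb") as out:
        for off in range(0, len(mem), chunk):
            length = min(chunk, len(mem) - off)
            out.write(mem[off:off + length])
            # Release this process's copy of the pages just saved. For a
            # private anonymous mapping, a later read of this range in
            # this process returns zero-filled pages.
            mem.madvise(mmap.MADV_DONTNEED, off, length)
```

In the fork & dump setting this loop would run in the child only, since after the call the dumped range reads back as zeros in the process that issued the madvise.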

IMHO, the scheme I mentioned in the previous email may be the simplest and the
most efficient way, if userfault could support tracking only wrprotect faults.
We can also do some optimizations to reduce the impact on the VM while taking the snapshot,
such as caching the requested pages in a memory buffer, etc.

>> Great! Do you plan to submit your patches to the community? I mean, is your work based on
>> QEMU, or is it an independent tool (CRIU migration?) for live migration?
>> Maybe I could fix the migration problem for ivshmem in QEMU now,
>> based on the softdirty mechanism.
>
> I absolutely plan on releasing these patches :-) CRIU was the first open-source
> userland I had planned on integrating with. At Google, I'm working with our
> home-grown Qemu replacement. However, I'd be happy to help with an effort to
> get softdirty integrated in Qemu in the future.
>

Great;)

>>> Documentation/vm/soft-dirty.txt and pagemap.txt in case you aren't familiar. To
>>
>> I have read them cursorily; they are indeed useful for pre-copy. But it seems that
>> they cannot meet my need for snapshots.
>
>>> make softdirty usable for live migration, I've added an API to atomically
>>> test-and-clear the bit and write protect the page.
>>
>> How can I find the API? Has it been merged into the kernel's master branch already?
>
> Negative. I'll be sure to CC you when I start sending this stuff upstream.
>
>

OK, I look forward to it:)
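
Until such an API is available, the existing soft-dirty interface described in Documentation/vm/soft-dirty.txt and pagemap.txt can be driven from userspace roughly as sketched below. This is not the atomic test-and-clear API mentioned above, which is not upstream; clearing and reading are separate steps here. The helper names are hypothetical, and the kernel is assumed to be built with soft-dirty support.

```python
import struct
from typing import List, NamedTuple

PAGEMAP_ENTRY_SIZE = 8      # one 64-bit entry per virtual page
SOFT_DIRTY_BIT = 1 << 55    # page was written since the last clear
PRESENT_BIT = 1 << 63       # page is present in RAM


class PageState(NamedTuple):
    present: bool
    soft_dirty: bool


def decode_pagemap_entry(entry: int) -> PageState:
    """Extract the present and soft-dirty flags from one pagemap entry."""
    return PageState(bool(entry & PRESENT_BIT), bool(entry & SOFT_DIRTY_BIT))


def clear_soft_dirty(pid: str = "self") -> None:
    """Writing "4" to clear_refs clears the soft-dirty bits of the task."""
    with open(f"/proc/{pid}/clear_refs", "w") as f:
        f.write("4")


def read_page_states(vaddr: int, npages: int, page_size: int,
                     pid: str = "self") -> List[PageState]:
    """Read the pagemap entries for npages pages starting at vaddr."""
    with open(f"/proc/{pid}/pagemap", "rb") as f:
        f.seek((vaddr // page_size) * PAGEMAP_ENTRY_SIZE)
        raw = f.read(npages * PAGEMAP_ENTRY_SIZE)
    # Entries are native-endian unsigned 64-bit integers.
    return [decode_pagemap_entry(e) for (e,) in struct.iter_unpack("=Q", raw)]
```

A pre-copy pass would call `clear_soft_dirty()`, let the guest run, then use `read_page_states()` to find the pages to resend. Because a write can land between the read and the next clear, this two-step sequence is not atomic.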

