linux-kernel - Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

Message-ID: <5215E4E7.1020808@parallels.com>
Date:	Thu, 22 Aug 2013 14:16:07 +0400
From:	Pavel Emelyanov <xemul@...allels.com>
To:	David Vrabel <david.vrabel@...rix.com>
CC:	Linus Torvalds <torvalds@...ux-foundation.org>,
	Cyrill Gorcunov <gorcunov@...il.com>,
	"H. Peter Anvin" <hpa@...or.com>, Jan Beulich <JBeulich@...e.com>,
	Andy Lutomirski <luto@...capital.net>,
	Andrew Morton <akpm@...ux-foundation.org>,
	<Xen-devel@...ts.xen.org>,
	Boris Ostrovsky <boris.ostrovsky@...cle.com>,
	Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>,
	Ingo Molnar <mingo@...hat.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with
 existing use

On 08/22/2013 01:32 PM, David Vrabel wrote:
> On 22/08/13 00:04, Linus Torvalds wrote:
>> On Wed, Aug 21, 2013 at 12:03 PM, Cyrill Gorcunov <gorcunov@...il.com> wrote:
>>>
>>> I personally don't see bug here because
>>>
>>>  - this swapped page soft dirty bit is set for non-present entries only,
>>>    never for present ones, just at moment we form swap pte entry
>>>
>>>  - i don't find any code which would test for this bit directly without
>>>    is_swap_pte call
>>
>> Ok, having gone through the places that use swp_*soft_dirty(), I have
>> to agree. Afaik, it's only ever used on a swap-entry that has (by
>> definition) the P bit clear. So with or without Xen, I don't see how
>> it can make any difference.
>>
>> David/Konrad - did you actually see any issues, or was this just from
>> (mis)reading the code?
> 
> There are no Xen related bugs in the code, we were misreading it.
> 
> It was my call to raise this as a regression without a repro and clearly
> this was the wrong decision.
> 
> However, having looked at the soft dirty implementation and specifically
> the userspace ABI I think that it is far too closely coupled to the
> current implementation.  I think this will constrain future development
> of the feature should userspace require a more efficient ABI than
> scanning all of /proc/<pid>/pagemap.
> 
> Minimal downtime during 'live' checkpointing of a running task needs the
> checkpointer to find and write out dirty pages faster than the task can
> dirty them.

Absolutely, but in "find and write" the "write" component is likely to take
the majority of the time -- we can scan the PTEs of a mapping MUCH faster than
we can transmit the pages over even a 10Gbit link.

We actually see this IRL -- in CRIU there's an atomic test that checks that
mappings get dumped and restored properly. One of its sub-tests uses a single
512MB mapping. With it, the total dump time _minus_ the memory dump time (the
remainder covers not only the pagemap file scan, but also files, registers,
process tree, sessions, etc.) is a fraction of a second, while the memory dump
part alone takes several seconds.
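
As a rough back-of-the-envelope illustration of that gap (not a measurement
from CRIU), the sketch below compares how much pagemap data has to be read for
such a mapping with how long the page contents would occupy the link. It
assumes 4 KiB pages, a worst case where every page is dirty, and a 10Gbit link
with no protocol overhead; the function name is made up for this example.

```python
PAGE_SIZE = 4096           # assumption: 4 KiB pages
PAGEMAP_ENTRY_SIZE = 8     # each /proc/<pid>/pagemap entry is one 64-bit word
LINK_BITS_PER_SEC = 10e9   # assumption: 10Gbit link, protocol overhead ignored


def scan_vs_transfer(mapping_bytes):
    """Return (pages, pagemap bytes to read, seconds on the wire)."""
    pages = mapping_bytes // PAGE_SIZE
    pagemap_bytes = pages * PAGEMAP_ENTRY_SIZE
    wire_seconds = mapping_bytes * 8 / LINK_BITS_PER_SEC
    return pages, pagemap_bytes, wire_seconds


pages, pagemap_bytes, wire_seconds = scan_vs_transfer(512 * 1024 * 1024)
print(pages, pagemap_bytes, round(wire_seconds, 2))
# prints: 131072 1048576 0.43
```

So the scan reads about 1 MiB of pagemap entries, while the write side has to
move 512 times as much data.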

That said, a super-fast API for getting "what has changed" is less tempting
to have than a faster network/disk.

What is _more_ time consuming in iterative migration in our case is the need to
re-scan the whole /proc tree to find which processes have died or appeared, to
dig through /proc/pid/fd to find out which files were (re-)opened/closed/changed,
to talk to the sock_diag subsystem for socket information, and the like. We
haven't yet done a careful analysis of what the slowest part is, but pagemap
scanning is definitely not it.
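
To illustrate the first of those steps, here is a minimal sketch of finding
which processes died or appeared between two iterations by diffing snapshots
of the numeric entries in /proc. This is not how CRIU does it; the helper
names are hypothetical, and the sketch cannot detect a PID that was reused by
a new process between two snapshots.

```python
import os


def list_pids(proc_root="/proc"):
    """Return the set of numeric directory names (PIDs) under proc_root."""
    return {int(name) for name in os.listdir(proc_root) if name.isdigit()}


def diff_pids(before, after):
    """Return (died, appeared) between two PID snapshots."""
    return before - after, after - before
```

Each iteration would call list_pids() again and pass the previous and the new
snapshot to diff_pids(); the whole /proc directory still has to be listed
every time, which is the cost described above.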

> This seems less likely to be possible if every iteration
> all PTEs have to be scanned by the checkpointer instead of (e.g.,)
> accessing a separate list of dirtied pages.

But we don't scan the PTEs of the whole x86-64 virtual address space; instead we
first analyze /proc/pid/maps and scan only the PTEs sitting in private mappings.
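
A minimal sketch of that two-step approach, for illustration only (it is not
CRIU's code): pick the private ranges out of the maps text, then look at the
soft-dirty bit (bit 55) of the 64-bit pagemap entries for those ranges. The
4 KiB page size and all helper names are assumptions of this example.

```python
import struct

PAGE_SIZE = 4096        # assumption: 4 KiB pages
ENTRY_SIZE = 8          # one 64-bit pagemap entry per virtual page
SOFT_DIRTY = 1 << 55    # soft-dirty bit of a pagemap entry
PRESENT = 1 << 63       # page present in RAM


def private_ranges(maps_text):
    """Return (start, end) for every mapping whose perms end in 'p'."""
    ranges = []
    for line in maps_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        perms = fields[1]
        if len(perms) == 4 and perms[3] == "p":
            start, end = (int(x, 16) for x in fields[0].split("-"))
            ranges.append((start, end))
    return ranges


def pagemap_offset(vaddr):
    """File offset in /proc/<pid>/pagemap of the entry describing vaddr."""
    return (vaddr // PAGE_SIZE) * ENTRY_SIZE


def soft_dirty_pages(raw, start):
    """raw holds the entries of consecutive pages beginning at address start."""
    dirty = []
    for i in range(len(raw) // ENTRY_SIZE):
        (entry,) = struct.unpack_from("=Q", raw, i * ENTRY_SIZE)
        if entry & SOFT_DIRTY:
            dirty.append(start + i * PAGE_SIZE)
    return dirty
```

For each (start, end) range one would seek to pagemap_offset(start) in
/proc/<pid>/pagemap, read (end - start) // PAGE_SIZE * ENTRY_SIZE bytes and
pass them to soft_dirty_pages(). Reading that file for another task normally
requires sufficient privileges.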

> David
> .
> 

Thanks,
Pavel