linux-kernel - Re: sharing page cache pages between multiple mappings

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Date:	Fri, 20 May 2016 12:37:37 +0200
From:	Miklos Szeredi <miklos@...redi.hu>
To:	Dave Chinner <david@...morbit.com>
Cc:	Michal Hocko <mhocko@...nel.org>, linux-mm@...ck.org,
	linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
	linux-btrfs@...r.kernel.org,
	"Darrick J. Wong" <darrick.wong@...cle.com>
Subject: Re: sharing page cache pages between multiple mappings

On Fri, May 20, 2016 at 1:48 AM, Dave Chinner <david@...morbit.com> wrote:
> On Thu, May 19, 2016 at 12:17:14PM +0200, Miklos Szeredi wrote:
>> On Thu, May 19, 2016 at 11:05 AM, Michal Hocko <mhocko@...nel.org> wrote:
>> > On Thu 19-05-16 10:20:13, Miklos Szeredi wrote:
>> >> Has anyone thought about sharing pages between multiple files?
>> >>
>> >> The obvious application is for COW filesytems where there are
>> >> logically distinct files that physically share data and could easily
>> >> share the cache as well if there was infrastructure for it.
>> >
>> > FYI this has been discussed at LSFMM this year[1]. I wasn't at the
>> > session so cannot tell you any details but the LWN article covers it at
>> > least briefly.
>>
>> Cool, so it's not such a crazy idea.
>
> Oh, it most certainly is crazy. :P
>
>> Darrick, would you mind briefly sharing your ideas regarding this?
>
> The current line of though is that we'll only attempt this in XFS on
> inodes that are known to share underlying physical extents. i.e.
> files that have blocks that have been reflinked or deduped.  That
> way we can overload the breaking of reflink blocks (via copy on
> write) with unsharing the pages in the page cache for that inode.
> i.e. shared pages can propagate upwards in overlay if it uses
> reflink for copy-up and writes will then break the sharing with the
> underlying source without overlay having to do anything special.
>
> Right now I'm not sure what mechanism we will use - we want to
> support files that have a mix of private and shared pages, so that
> implies we are not going to be sharing mappings but sharing pages
> instead.  However, we've been looking at this as being completely
> encapsulated within the filesystem because it's tightly linked to
> changes in the physical layout of the filesystem, not as general
> "share this mapping between two unrelated inodes" infrastructure.
> That may change as we dig deeper into it...
>
>> The use case I have is fixing overlayfs weird behavior. The following
>> may result in "buf" not matching "data":
>>
>>     int fr = open("foo", O_RDONLY);
>>     int fw = open("foo", O_RDWR);
>>     write(fw, data, sizeof(data));
>>     read(fr, buf, sizeof(data));
>>
>> The reason is that "foo" is on a read-only layer, and opening it for
>> read-write triggers copy-up into a read-write layer.  However the old,
>> read-only open still refers to the unmodified file.
>>
>> Fixing this properly requires that when opening a file, we don't
>> delegate operations fully to the underlying file, but rather allow
>> sharing of pages from underlying file until the file is copied up.  At
>> that point we switch to sharing pages with the read-write copy.
>
> Unless I'm missing something here (quite possible!), I'm not sure
> we can fix that problem with page cache sharing or reflink. It
> implies we are sharing pages in a downwards direction - private
> overlay pages/mappings from multiple inodes would need to be shared
> with a single underlying shared read-only inode, and I lack the
> imagination to see how that works...

Indeed, reflink doesn't make this work.

We could reflink-up on any open (or on lookup), not just on write,
it's a trivial change in overlayfs.   Drawback is slower first
open/lookup and space used by duplicate trees even without
modification on the overlay.  Not sure if that's a problem in
practice.

I'll think about the generic downwards sharing.  For overlayfs it
doesn't need to be per-page, so that might make it somewhat simpler
problem.

Thanks,
Miklos