linux-kernel - Re: page fault scalability (ext3, ext4, xfs)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CALCETrUm0C+3etmsve7kQwspaycXnj4ZWNTWi+C5r4r-pahqUw@mail.gmail.com>
Date:	Thu, 15 Aug 2013 17:21:27 -0700
From:	Andy Lutomirski <luto@...capital.net>
To:	Dave Chinner <david@...morbit.com>
Cc:	"Theodore Ts'o" <tytso@....edu>,
	Dave Hansen <dave.hansen@...el.com>,
	Dave Hansen <dave.hansen@...ux.intel.com>,
	Linux FS Devel <linux-fsdevel@...r.kernel.org>,
	xfs@....sgi.com,
	"linux-ext4@...r.kernel.org" <linux-ext4@...r.kernel.org>,
	Jan Kara <jack@...e.cz>, LKML <linux-kernel@...r.kernel.org>,
	Tim Chen <tim.c.chen@...ux.intel.com>,
	Andi Kleen <ak@...ux.intel.com>
Subject: Re: page fault scalability (ext3, ext4, xfs)

On Thu, Aug 15, 2013 at 5:14 PM, Dave Chinner <david@...morbit.com> wrote:
> On Thu, Aug 15, 2013 at 03:26:09PM -0700, Andy Lutomirski wrote:
>> On Thu, Aug 15, 2013 at 3:18 PM, Dave Chinner <david@...morbit.com> wrote:
>> > On Thu, Aug 15, 2013 at 02:43:09PM -0700, Andy Lutomirski wrote:
>> >> On Thu, Aug 15, 2013 at 2:37 PM, Dave Chinner
>> >> <david@...morbit.com> wrote:
>> >> > On Thu, Aug 15, 2013 at 08:17:18AM -0700, Andy Lutomirski wrote:
>
>> >> In current kernels, this chain of events won't work:
>> >>
>> >>  - Server goes down
>> >>  - Server comes up
>> >>  - Userspace on server calls mmap and writes something
>> >>  - Client reconnects and invalidates its cache
>> >>  - Userspace on server writes something else *to the same page*
>> >>
>> >> The client will never notice the second write, because it won't update
>> >> any inode state.
>> >
>> > That's wrong. The server wrote the dirty page before the client
>> > reconnected, therefore it got marked clean.
>>
>> Why would it write the dirty page?
>
> Terminology mismatch - you said it "writes something", not "dirties
> the page". So, it's easy to take that as "does writeback" as opposed
> to "dirties memory".

When I say "writes something" I mean literally performs a store to
memory.  That is:

ptr[offset] = value;

In my example, the client will *never* catch up.

>
>> > The second write to the
>> > server page marks it dirty again, causing page_mkwrite to be
>> > called, thereby updating the timestamp/i_version field. So, the NFS
>> > client will notice the second change on the server, and it will
>> > notice it immediately after the second access has occurred, not some
>> > time later when:
>> >
>> >> With my patches, the client will as soon as the
>> >> server starts writeback.
>> >
>> > Your patches introduce a 30+ second window where a file can be dirty
>> > on the server but the NFS server doesn't know about it and can't
>> > tell the clients about it because i_version doesn't get bumped until
>> > writeback.....
>>
>> I claim that there's an infinite window right now, and that 30 seconds
>> is therefore an improvement.
>
> You're talking about after the second change is made. I'm talking
> about the difference in behaviour after the *initial change* is
> made. Your changes will result in the client not doing an
> invalidation because timestamps don't get changed for 30s with your
> patches.  That's the problem - the first change of a file needs to
> bump the i_version immediately, not in 30s time.
>
> That's why delaying timestamp updates doesn't fix the scalability
> problem that was reported. It might fix a different problem, but it
> doesn't void the *requirment* that filesystems need to do
> transactional updates during page faults....
>

And this is why I'm unconvinced that your requirement is sensible.
It's attempting to make sure that every mmaped write results in a some
kind of FS update, but it actually only results in an FS update
*before* the *first* mmapped write after writeback.  It's racy as
hell.

My approach is slow but not racy.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/