Message-ID: <CALCETrUm0C+3etmsve7kQwspaycXnj4ZWNTWi+C5r4r-pahqUw@mail.gmail.com>
Date: Thu, 15 Aug 2013 17:21:27 -0700
From: Andy Lutomirski <luto@...capital.net>
To: Dave Chinner <david@...morbit.com>
Cc: "Theodore Ts'o" <tytso@....edu>,
Dave Hansen <dave.hansen@...el.com>,
Dave Hansen <dave.hansen@...ux.intel.com>,
Linux FS Devel <linux-fsdevel@...r.kernel.org>,
xfs@....sgi.com,
"linux-ext4@...r.kernel.org" <linux-ext4@...r.kernel.org>,
Jan Kara <jack@...e.cz>, LKML <linux-kernel@...r.kernel.org>,
Tim Chen <tim.c.chen@...ux.intel.com>,
Andi Kleen <ak@...ux.intel.com>
Subject: Re: page fault scalability (ext3, ext4, xfs)
On Thu, Aug 15, 2013 at 5:14 PM, Dave Chinner <david@...morbit.com> wrote:
> On Thu, Aug 15, 2013 at 03:26:09PM -0700, Andy Lutomirski wrote:
>> On Thu, Aug 15, 2013 at 3:18 PM, Dave Chinner <david@...morbit.com> wrote:
>> > On Thu, Aug 15, 2013 at 02:43:09PM -0700, Andy Lutomirski wrote:
>> >> On Thu, Aug 15, 2013 at 2:37 PM, Dave Chinner
>> >> <david@...morbit.com> wrote:
>> >> > On Thu, Aug 15, 2013 at 08:17:18AM -0700, Andy Lutomirski wrote:
>
>> >> In current kernels, this chain of events won't work:
>> >>
>> >> - Server goes down
>> >> - Server comes up
>> >> - Userspace on server calls mmap and writes something
>> >> - Client reconnects and invalidates its cache
>> >> - Userspace on server writes something else *to the same page*
>> >>
>> >> The client will never notice the second write, because it won't update
>> >> any inode state.
>> >
>> > That's wrong. The server wrote the dirty page before the client
>> > reconnected, therefore it got marked clean.
>>
>> Why would it write the dirty page?
>
> Terminology mismatch - you said it "writes something", not "dirties
> the page". So, it's easy to take that as "does writeback" as opposed
> to "dirties memory".

When I say "writes something" I mean literally performs a store to
memory.  That is:

    ptr[offset] = value;

In my example, the client will *never* catch up.
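
To make the chain of events concrete, here is a toy userspace model in C. It is a sketch, not kernel code: `toy_file`, `toy_page_mkwrite`, `toy_store`, `toy_writeback` and `toy_client_notices_second_store` are all hypothetical names, and the model assumes the version is bumped only on the first write fault after the page was last cleaned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096

/* Toy stand-in for one mapped page plus the inode field a client compares. */
struct toy_file {
    unsigned char page[TOY_PAGE_SIZE];
    bool writable;            /* mapping is writable: stores do not fault */
    unsigned long i_version;  /* what the NFS client revalidates against */
};

/* Models the write-fault path, the only place the version is bumped. */
static void toy_page_mkwrite(struct toy_file *f)
{
    f->i_version++;
    f->writable = true;
}

/* Models "ptr[offset] = value;" on a shared mapping. */
static void toy_store(struct toy_file *f, size_t offset, unsigned char value)
{
    assert(offset < TOY_PAGE_SIZE);
    if (!f->writable)
        toy_page_mkwrite(f);  /* first store after writeback faults */
    f->page[offset] = value;  /* later stores go straight to memory */
}

/* Models writeback: the page is cleaned and write-protected again. */
static void toy_writeback(struct toy_file *f)
{
    f->writable = false;
}

/*
 * Runs the chain of events. Returns true if the client, comparing
 * i_version against the value it cached on reconnect, notices the
 * second store.
 */
bool toy_client_notices_second_store(bool writeback_between)
{
    struct toy_file f = { .writable = false, .i_version = 0 };
    unsigned long cached;

    toy_store(&f, 0, 1);       /* server userspace writes something */
    if (writeback_between)
        toy_writeback(&f);     /* the case Dave describes */
    cached = f.i_version;      /* client reconnects and revalidates */
    toy_store(&f, 1, 2);       /* second store to the same page */
    return f.i_version != cached;
}
```

Calling `toy_client_notices_second_store(false)` models my example (no writeback before the client reconnects); passing `true` models the case where the page was written back and cleaned first.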
>
>> > The second write to the
>> > server page marks it dirty again, causing page_mkwrite to be
>> > called, thereby updating the timestamp/i_version field. So, the NFS
>> > client will notice the second change on the server, and it will
>> > notice it immediately after the second access has occurred, not some
>> > time later when:
>> >
>> >> With my patches, the client will notice as soon as the
>> >> server starts writeback.
>> >
>> > Your patches introduce a 30+ second window where a file can be dirty
>> > on the server but the NFS server doesn't know about it and can't
>> > tell the clients about it because i_version doesn't get bumped until
>> > writeback.....
>>
>> I claim that there's an infinite window right now, and that 30 seconds
>> is therefore an improvement.
>
> You're talking about after the second change is made. I'm talking
> about the difference in behaviour after the *initial change* is
> made. Your changes will result in the client not doing an
> invalidation because timestamps don't get changed for 30s with your
> patches. That's the problem - the first change of a file needs to
> bump the i_version immediately, not in 30s time.
>
> That's why delaying timestamp updates doesn't fix the scalability
> problem that was reported. It might fix a different problem, but it
> doesn't void the *requirement* that filesystems need to do
> transactional updates during page faults....
>

And this is why I'm unconvinced that your requirement is sensible.
It's attempting to make sure that every mmapped write results in some
kind of FS update, but it actually only results in an FS update
*before* the *first* mmapped write after writeback.  It's racy as
hell.

My approach is slow but not racy.
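
As a rough illustration of "slow but not racy", here is a toy userspace model in C of the policy discussed in this thread, where the version is bumped at writeback rather than at fault time. It is only a sketch under that assumption and not the actual patches; `deferred_file`, `deferred_store`, `deferred_writeback` and `deferred_scenario` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEFERRED_PAGE_SIZE 4096

/* Toy stand-in for one mapped page plus the inode field a client compares. */
struct deferred_file {
    unsigned char page[DEFERRED_PAGE_SIZE];
    bool dirty;
    unsigned long i_version;
};

struct deferred_result {
    bool seen_before_writeback;
    bool seen_after_writeback;
};

/* A store only records that the page is dirty; no inode update here. */
static void deferred_store(struct deferred_file *f, size_t offset,
                           unsigned char value)
{
    assert(offset < DEFERRED_PAGE_SIZE);
    f->page[offset] = value;
    f->dirty = true;
}

/* Writeback bumps the version once if anything was stored since last time. */
static void deferred_writeback(struct deferred_file *f)
{
    if (f->dirty) {
        f->i_version++;
        f->dirty = false;
    }
}

/* Same chain of events: store, client reconnects, second store, writeback. */
struct deferred_result deferred_scenario(void)
{
    struct deferred_file f = { .dirty = false, .i_version = 0 };
    struct deferred_result r;
    unsigned long cached;

    deferred_store(&f, 0, 1);  /* server userspace writes something */
    cached = f.i_version;      /* client reconnects and revalidates */
    deferred_store(&f, 1, 2);  /* second store to the same page */
    r.seen_before_writeback = (f.i_version != cached);
    deferred_writeback(&f);    /* periodic writeback eventually runs */
    r.seen_after_writeback = (f.i_version != cached);
    return r;
}
```

In this model the client sees nothing until writeback runs (the delay Dave objects to), but no store can be lost permanently, because every store leaves the page dirty and every writeback of a dirty page bumps the version.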
--Andy