linux-ext4 - Re: page fault scalability (ext3, ext4, xfs)

Message-ID: <CALCETrU=Ag1bvmc=9Wo8K66gOSYtCyncveYEycYdTd_1T9z-JA@mail.gmail.com>
Date:	Thu, 15 Aug 2013 14:31:14 -0700
From:	Andy Lutomirski <luto@...capital.net>
To:	Dave Chinner <david@...morbit.com>
Cc:	Jan Kara <jack@...e.cz>, "Theodore Ts'o" <tytso@....edu>,
	Dave Hansen <dave.hansen@...el.com>,
	Dave Hansen <dave.hansen@...ux.intel.com>,
	Linux FS Devel <linux-fsdevel@...r.kernel.org>,
	xfs@....sgi.com,
	"linux-ext4@...r.kernel.org" <linux-ext4@...r.kernel.org>,
	LKML <linux-kernel@...r.kernel.org>,
	Tim Chen <tim.c.chen@...ux.intel.com>,
	Andi Kleen <ak@...ux.intel.com>
Subject: Re: page fault scalability (ext3, ext4, xfs)

On Thu, Aug 15, 2013 at 2:28 PM, Dave Chinner <david@...morbit.com> wrote:
> On Thu, Aug 15, 2013 at 09:45:31AM +0200, Jan Kara wrote:
>> On Thu 15-08-13 17:11:42, Dave Chinner wrote:
>> > On Wed, Aug 14, 2013 at 11:14:37PM -0700, Andy Lutomirski wrote:
>> > > On Wed, Aug 14, 2013 at 11:01 PM, Dave Chinner <david@...morbit.com> wrote:
>> > > > On Wed, Aug 14, 2013 at 09:32:13PM -0700, Andy Lutomirski wrote:
>> > > >> On Wed, Aug 14, 2013 at 7:10 PM, Dave Chinner <david@...morbit.com> wrote:
>> > > >> > On Wed, Aug 14, 2013 at 09:11:01PM -0400, Theodore Ts'o wrote:
>> > > >> >> On Wed, Aug 14, 2013 at 04:38:12PM -0700, Andy Lutomirski wrote:
>> > > >> >> > > It would be better to write zeros to it, so we aren't measuring the
>> > > >> >> > > cost of the unwritten->written conversion.
>> > > >> >> >
>> > > >> >> > At the risk of beating a dead horse, how hard would it be to defer
>> > > >> >> > this part until writeback?
>> > > >> >>
>> > > >> >> Part of the work has to be done at write time because we need to
>> > > >> >> update allocation statistics (i.e., so that we don't have ENOSPC
>> > > >> >> problems).  The unwritten->written conversion does happen at writeback
>> > > >> >> (as does the actual block allocation if we are doing delayed
>> > > >> >> allocation).
>> > > >> >>
>> > > >> >> The point is that if the goal is to measure page fault scalability, we
>> > > >> >> shouldn't have this other stuff happening at the same time as the page
>> > > >> >> fault workload.
>> > > >> >
>> > > >> > Sure, but the real problem is not the block mapping or allocation
>> > > >> > path - even if the test is changed to take that out of the picture,
>> > > >> > we still have timestamp updates being done on every single page
>> > > >> > fault. ext4, XFS and btrfs all do transactional timestamp updates
>> > > >> > and have nanosecond granularity, so every page fault is resulting in
>> > > >> > a transaction to update the timestamp of the file being modified.
>> > > >>
>> > > >> I have (unmergeable) patches to fix this:
>> > > >>
>> > > >> http://comments.gmane.org/gmane.linux.kernel.mm/92476
>> > > >
>> > > > The big problem with this approach is that not doing the
>> > > > timestamp update on page faults is going to break the inode change
>> > > > version counting because for ext4, btrfs and XFS it takes a
>> > > > transaction to bump that counter. NFS needs to know the moment a
>> > > > file is changed in memory, not when it is written to disk. Also, NFS
>> > > > requires the change to the counter to be persistent over server
>> > > > failures, so it needs to be changed as part of a transaction....
>> > >
>> > > I've been running a kernel that has the file_update_time call
>> > > commented out for over a year now, and the only problem I've seen is
>> > > that the timestamp doesn't get updated :)
>> > >
>> > > I think I must be misunderstanding you (or vice versa).  I'm currently
>> >
>> > Yup, you are.
>> >
>> > > redoing the patches, and this time I'll do it for just the mm core and
>> > > ext4.  The only change I'm proposing to ext4's page_mkwrite is to
>> > > remove the file_update_time call.
>> >
>> > Right. Where does that end up? All the way down in
>> > ext4_mark_iloc_dirty(), and that does:
>> >
>> >         if (IS_I_VERSION(inode))
>> >             inode_inc_iversion(inode);
>> >
>> > The XFS transaction code is the same - deep inside it where an inode
>> > is marked as dirty in the transaction, it bumps the same counter and
>> > adds it to the transaction.
>>   Yeah, I'd just add that ext4 maintains i_version only if it has been
>> mounted with i_version mount option. But then NFS server would depend on
>> c/mtime update so it won't help you much - you still should update at least
>> one of i_version, ctime, mtime on page fault. OTOH if the filesystem isn't
>> exported, you could avoid this relatively expensive dance and defer things
>> as Andy suggests.
>
> The problem with "not exported, don't update" is that files can be
> modified on server startup (e.g. after a crash) or in short
> maintenance periods when the NFS service is down. When the server is
> started back up, the change number needs to indicate the file has
> been modified so that clients reconnecting to the server see the
> change.
>
> IOWs, even if the NFS server is not up or the filesystem not
> exported we still need to update change counts whenever a file
> changes if we are going to tell the NFS server that we keep them...

This will keep working as long as the clients are willing to wait for
writeback (or msync, munmap, or exit) on the server.

--Andy
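
The timestamp behaviour discussed in this thread can be observed from user
space. What follows is a minimal sketch, not code from any of the patches
mentioned above. It assumes a POSIX system where struct stat exposes st_mtim;
the helper name mtime_across_mmap_write is hypothetical. It dirties one page
of a file through a shared mapping, calls msync(MS_SYNC), and reports the
file's mtime before and after. POSIX only requires the timestamp to be marked
for update somewhere between the first store and the next msync, so the
sketch cannot tell whether a given kernel updates it at page-fault time or
defers it.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the thread): write one byte to `path`
 * through a MAP_SHARED mapping, msync it, and return the file's mtime
 * as seen before the store and after the msync.
 * Returns 0 on success, -1 on any error.
 */
int mtime_across_mmap_write(const char *path,
                            struct timespec *before,
                            struct timespec *after)
{
    struct stat st;
    char *map = MAP_FAILED;
    long page;
    int ret = -1;
    int fd;

    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    /* Make the file exactly one page long so the mapping is backed. */
    page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || ftruncate(fd, (off_t)page) != 0)
        goto out;

    /* Sample mtime after ftruncate, which may itself update it. */
    if (fstat(fd, &st) != 0)
        goto out;
    *before = st.st_mtim;

    map = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
               MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        goto out;

    /* The first store to the page takes the write fault. */
    map[0] = 'x';

    /* After msync returns, POSIX requires mtime to be marked for update. */
    if (msync(map, (size_t)page, MS_SYNC) != 0)
        goto out;

    if (fstat(fd, &st) != 0)
        goto out;
    *after = st.st_mtim;
    ret = 0;

out:
    if (map != MAP_FAILED)
        munmap(map, (size_t)page);
    close(fd);
    return ret;
}
```

Calling the helper on a scratch file and comparing the two timespec values
shows whether mtime moved across the store and msync. To see when the update
happens relative to the fault, one could add a further fstat between the
store and the msync; that result depends on the filesystem and kernel.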