linux-ext4 - Re: page fault scalability (ext3, ext4, xfs)

Message-ID: <20130815213725.GT6023@dastard>
Date:	Fri, 16 Aug 2013 07:37:25 +1000
From:	Dave Chinner <david@...morbit.com>
To:	Andy Lutomirski <luto@...capital.net>
Cc:	Theodore Ts'o <tytso@....edu>, Dave Hansen <dave.hansen@...el.com>,
	Dave Hansen <dave.hansen@...ux.intel.com>,
	Linux FS Devel <linux-fsdevel@...r.kernel.org>,
	xfs@....sgi.com,
	"linux-ext4@...r.kernel.org" <linux-ext4@...r.kernel.org>,
	Jan Kara <jack@...e.cz>, LKML <linux-kernel@...r.kernel.org>,
	Tim Chen <tim.c.chen@...ux.intel.com>,
	Andi Kleen <ak@...ux.intel.com>
Subject: Re: page fault scalability (ext3, ext4, xfs)

On Thu, Aug 15, 2013 at 08:17:18AM -0700, Andy Lutomirski wrote:
> On Thu, Aug 15, 2013 at 12:11 AM, Dave Chinner <david@...morbit.com> wrote:
> > On Wed, Aug 14, 2013 at 11:14:37PM -0700, Andy Lutomirski wrote:
> >> On Wed, Aug 14, 2013 at 11:01 PM, Dave Chinner <david@...morbit.com> wrote:
> >> > On Wed, Aug 14, 2013 at 09:32:13PM -0700, Andy Lutomirski wrote:
> >> >> On Wed, Aug 14, 2013 at 7:10 PM, Dave Chinner <david@...morbit.com> wrote:
> >> >> > On Wed, Aug 14, 2013 at 09:11:01PM -0400, Theodore Ts'o wrote:
> >> >> >> On Wed, Aug 14, 2013 at 04:38:12PM -0700, Andy Lutomirski wrote:
> >> >> >> > > It would be better to write zeros to it, so we aren't measuring the
> >> >> >> > > cost of the unwritten->written conversion.
> >> >> >> >
> >> >> >> > At the risk of beating a dead horse, how hard would it be to defer
> >> >> >> > this part until writeback?
> >> >> >>
> >> >> >> Part of the work has to be done at write time because we need to
> >> >> >> update allocation statistics (i.e., so that we don't have ENOSPC
> >> >> >> problems).  The unwritten->written conversion does happen at writeback
> >> >> >> (as does the actual block allocation if we are doing delayed
> >> >> >> allocation).
> >> >> >>
> >> >> >> The point is that if the goal is to measure page fault scalability, we
> >> >> >> shouldn't have this other stuff happening at the same time as the page
> >> >> >> fault workload.
> >> >> >
> >> >> > Sure, but the real problem is not the block mapping or allocation
> >> >> > path - even if the test is changed to take that out of the picture,
> >> >> > we still have timestamp updates being done on every single page
> >> >> > fault. ext4, XFS and btrfs all do transactional timestamp updates
> >> >> > and have nanosecond granularity, so every page fault is resulting in
> >> >> > a transaction to update the timestamp of the file being modified.
> >> >>
> >> >> I have (unmergeable) patches to fix this:
> >> >>
> >> >> http://comments.gmane.org/gmane.linux.kernel.mm/92476
> >> >
> >> > The big problem with this approach is that not doing the
> >> > timestamp update on page faults is going to break the inode change
> >> > version counting because for ext4, btrfs and XFS it takes a
> >> > transaction to bump that counter. NFS needs to know the moment a
> >> > file is changed in memory, not when it is written to disk. Also, NFS
> >> > requires the change to the counter to be persistent over server
> >> > failures, so it needs to be changed as part of a transaction....
> >>
> >> I've been running a kernel that has the file_update_time call
> >> commented out for over a year now, and the only problem I've seen is
> >> that the timestamp doesn't get updated :)
> >>
> 
> [...]
> 
> > If a filesystem is providing an i_version value, then NFS uses it to
> > determine whether client side caches are still consistent with the
> > server state. If the filesystem does not provide an i_version, then
> > NFS falls back to checking c/mtime for changes. If files on the
> > server are being modified without either the timestamps or i_version
> > changing, then it's likely that there will be problems with client
> > side cache consistency....
> 
> I didn't think of that at all.
> 
> If userspace does:
> 
> ptr = mmap(...);
> ptr[0] = 1;
> sleep(1);
> ptr[0] = 2;
> sleep(1);
> munmap();
> 
> Then current kernels will mark the inode changed on (only) the ptr[0]
> = 1 line.  My patches will instead mark the inode changed when munmap
> is called (or after ptr[0] = 2 if writepages gets called for any
> reason).
> 
> I'm not sure which is better.  POSIX actually requires my behavior
> (which is mostly irrelevant).

Not by my reading of it. POSIX states that c/mtime needs to be
updated between the first access and the next msync() call. We
update mtime on the first access, and so therefore we conform to the
posix requirement....
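
A minimal sketch of how this could be checked from userspace, assuming
Linux and a filesystem that supports shared writable mappings. The helper
name mmap_write_mtime and the 4096-byte mapping length are hypothetical
choices for illustration. It records st_mtim before a store through a
MAP_SHARED mapping and again after msync(MS_SYNC). It cannot tell whether
the update happened at fault time or at msync time, only that it had
happened by the time msync returned.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

#define MAP_LEN 4096 /* illustrative mapping length, assumed for this sketch */

/*
 * Hypothetical helper: create/truncate `path`, store one byte through a
 * shared mapping, msync it, and report the file's mtime before the store
 * and after the msync. Returns 0 on success, -1 on any error.
 */
int mmap_write_mtime(const char *path, struct timespec *before,
                     struct timespec *after)
{
    struct stat st;
    char *ptr;
    int rc = -1;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    if (ftruncate(fd, MAP_LEN) != 0)
        goto out_close;
    if (fstat(fd, &st) != 0)
        goto out_close;
    *before = st.st_mtim;          /* mtime before any mmapped store */

    ptr = mmap(NULL, MAP_LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (ptr == MAP_FAILED)
        goto out_close;

    ptr[0] = 1;                    /* first store: takes the write fault */

    if (msync(ptr, MAP_LEN, MS_SYNC) != 0)
        goto out_unmap;
    if (fstat(fd, &st) != 0)
        goto out_unmap;
    *after = st.st_mtim;           /* mtime observed after msync */
    rc = 0;

out_unmap:
    munmap(ptr, MAP_LEN);
out_close:
    close(fd);
    return rc;
}
```

Calling it on a scratch file and comparing the two timespec values shows
whether mtime had moved forward by the msync; timestamp granularity may
make the two values equal if everything happens within one clock tick.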

> My behavior also means that, if an NFS
> client reads and caches the file between the two writes, then it will
> eventually find out that the data is stale.

"eventually" is very different behaviour to the current behaviour.

My understanding is that NFS v4 delegations require the underlying
filesystem to bump the version count on *any* modification made to
the file so that delegations can be recalled appropriately. So not
informing the filesystem that the file data has been changed is
going to cause problems.

> The current behavior, on
> the other hand, means that a single pass of mmapped writes through the
> file will update the times much faster.
> 
> I could arrange for the first page fault to *also* update times when
> the FS is exported or if a particular mount option is set.  (The ext4
> change to request the new behavior is all of four lines, and it's easy
> to adjust.)

What does "first page fault" mean?

Cheers,

Dave
-- 
Dave Chinner
david@...morbit.com