Message-ID: <20171213232537.GC4094@dastard>
Date:   Thu, 14 Dec 2017 10:25:37 +1100
From:   Dave Chinner <david@...morbit.com>
To:     Jeff Layton <jlayton@...nel.org>
Cc:     linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
        hch@....de, neilb@...e.de, bfields@...ldses.org,
        amir73il@...il.com, jack@...e.de, viro@...iv.linux.org.uk
Subject: Re: [PATCH 14/19] xfs: convert to new i_version API


So now I've looked at the last patch .....

On Thu, Dec 14, 2017 at 09:48:37AM +1100, Dave Chinner wrote:
> On Wed, Dec 13, 2017 at 09:20:12AM -0500, Jeff Layton wrote:
> > From: Jeff Layton <jlayton@...hat.com>
> > 
> > Signed-off-by: Jeff Layton <jlayton@...hat.com>
> > ---
> >  fs/xfs/libxfs/xfs_inode_buf.c | 5 +++--
> >  fs/xfs/xfs_icache.c           | 4 ++--
> >  fs/xfs/xfs_inode.c            | 2 +-
> >  fs/xfs/xfs_inode_item.c       | 2 +-
> >  fs/xfs/xfs_trans_inode.c      | 2 +-
> >  5 files changed, 8 insertions(+), 7 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> > index 6b7989038d75..6b47de201391 100644
> > --- a/fs/xfs/libxfs/xfs_inode_buf.c
> > +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> > @@ -264,7 +264,8 @@ xfs_inode_from_disk(
> >  	to->di_flags	= be16_to_cpu(from->di_flags);
> >  
> >  	if (to->di_version == 3) {
> > -		inode->i_version = be64_to_cpu(from->di_changecount);
> > +		inode_set_iversion_queried(inode,
> > +					   be64_to_cpu(from->di_changecount));
> 
> So we use the "kernel managed" (really not sure what that means)
> set function here to read it off disk, but...

This stores the value from disk in the incore inode as "val << 1",
then sets the lowest bit to indicate that it has been "queried"
so that it will be incremented on the first modification.
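
To make that encoding concrete, here is a minimal userspace sketch of
the layout as described above. The model_* names and MODEL_* macros are
hypothetical and exist only for illustration; they are not the kernel
helpers, and the sketch assumes nothing beyond "val << 1, lowest bit is
the queried flag".

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names: a userspace model, not the kernel helpers. */
#define MODEL_SHIFT   1     /* the counter lives above the flag bit */
#define MODEL_QUERIED 1ULL  /* lowest bit: value has been "queried" */

/* Store a counter value and mark it as queried. */
static uint64_t model_set_queried(uint64_t val)
{
	return (val << MODEL_SHIFT) | MODEL_QUERIED;
}

/* Recover the counter, dropping the flag bit. */
static uint64_t model_peek(uint64_t raw)
{
	return raw >> MODEL_SHIFT;
}

/* Report whether the queried flag is set. */
static int model_is_queried(uint64_t raw)
{
	return (raw & MODEL_QUERIED) != 0;
}

/* Self-check: an on-disk value of 5 becomes raw 11 in memory. */
void model_encoding_demo(void)
{
	uint64_t raw = model_set_queried(5);

	assert(raw == 11);
	assert(model_peek(raw) == 5);
	assert(model_is_queried(raw));
}
```

The point of the sketch is only that the raw in-memory value and the
counter are different numbers, so the accessor used on each side of a
load/store pair matters.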

Why do we initialise values read from disk as "queried"? This means
the i_version will change once every time it's brought into memory
and modified, regardless of whether anyone is looking at it. What
purpose does this serve?

> >  		to->di_crtime.t_sec = be32_to_cpu(from->di_crtime.t_sec);
> >  		to->di_crtime.t_nsec = be32_to_cpu(from->di_crtime.t_nsec);
> >  		to->di_flags2 = be64_to_cpu(from->di_flags2);
> > @@ -314,7 +315,7 @@ xfs_inode_to_disk(
> >  	to->di_flags = cpu_to_be16(from->di_flags);
> >  
> >  	if (from->di_version == 3) {
> > -		to->di_changecount = cpu_to_be64(inode->i_version);
> > +		to->di_changecount = cpu_to_be64(inode_peek_iversion_raw(inode));
> 
> ... use the raw access mode to put it back on disk.

This writes the current inode->i_version value directly to disk,
including the "queried" flag.

Hence every time this inode cycles through memory and is modified,
we essentially shift the on-disk i_version value upwards by 1 slot
(i.e. double its value) when we read it back in from disk.

Seems like a bug - this is not a monotonically increasing counter
anymore - after ~60 modification cycles through memory it's going to
have a practically random value when pulled in off disk, not a
slowly increasing value.
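
A rough userspace model of that read/modify/write cycle, under stated
assumptions: the model_* names are hypothetical, and I am assuming that
a modification bumps the counter by one and clears the queried flag.
The raw write-back mirrors what the hunk above does; the "peek" variant
is the alternative of stripping the flag before writing.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names: a userspace model, not the kernel helpers. */
#define MODEL_SHIFT   1
#define MODEL_QUERIED 1ULL

/* Read from disk: shift the counter up and mark it queried. */
static uint64_t model_from_disk(uint64_t disk)
{
	return (disk << MODEL_SHIFT) | MODEL_QUERIED;
}

/* Modify: bump the counter only if queried, then clear the flag
 * (clearing the flag is an assumption of this model). */
static uint64_t model_modify(uint64_t raw)
{
	if (raw & MODEL_QUERIED)
		raw = (raw & ~MODEL_QUERIED) + (1ULL << MODEL_SHIFT);
	return raw;
}

/* Run a number of load/modify/store cycles and return the disk value.
 * use_raw != 0 stores the raw in-memory value, flag bit included;
 * use_raw == 0 strips the flag bit before storing. */
static uint64_t model_cycle(uint64_t disk, int use_raw, unsigned int cycles)
{
	while (cycles--) {
		uint64_t raw = model_modify(model_from_disk(disk));

		disk = use_raw ? raw : raw >> MODEL_SHIFT;
	}
	return disk;
}

/* Self-check: starting from 1, three cycles give 22 (raw) vs 4 (peek). */
void model_cycle_demo(void)
{
	assert(model_cycle(1, 1, 3) == 22);
	assert(model_cycle(1, 0, 3) == 4);
}
```

In this model the raw write-back takes the disk value from v to
2 * (v + 1) per cycle (1, 4, 10, 22, ...), so a 64-bit value wraps
after roughly 60 cycles, while the stripped write-back goes up by one
per cycle.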

> >  		to->di_crtime.t_sec = cpu_to_be32(from->di_crtime.t_sec);
> >  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
> >  		to->di_flags2 = cpu_to_be64(from->di_flags2);
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index 43005fbe8b1e..4838462616fd 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -293,14 +293,14 @@ xfs_reinit_inode(
> >  	int		error;
> >  	uint32_t	nlink = inode->i_nlink;
> >  	uint32_t	generation = inode->i_generation;
> > -	uint64_t	version = inode->i_version;
> > +	uint64_t	version = inode_peek_iversion_raw(inode);
> >  	umode_t		mode = inode->i_mode;
> >  
> >  	error = inode_init_always(mp->m_super, inode);
> >  
> >  	set_nlink(inode, nlink);
> >  	inode->i_generation = generation;
> > -	inode->i_version = version;
> > +	inode_set_iversion_queried(inode, version);
> 
> Again - raw mode to read, kernel managed to set.

This, again, will double the i_version value. Shouldn't all the XFS
code just be using inode_peek_iversion(), not the _raw variant?
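
The same effect in a sketch of the save/restore pattern in this hunk.
Again the model_* names are hypothetical stand-ins, and the non-raw
peek is assumed to return the counter with the flag bit shifted out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names: a userspace model, not the kernel helpers. */
#define MODEL_SHIFT   1
#define MODEL_QUERIED 1ULL

/* Store a counter value and mark it as queried. */
static uint64_t model_set_queried(uint64_t val)
{
	return (val << MODEL_SHIFT) | MODEL_QUERIED;
}

/* Assumed non-raw peek: the counter with the flag bit shifted out. */
static uint64_t model_peek(uint64_t raw)
{
	return raw >> MODEL_SHIFT;
}

/* Save the raw value, restore through set_queried (as in the hunk). */
static uint64_t model_reinit_raw(uint64_t raw)
{
	uint64_t version = raw;

	return model_set_queried(version);
}

/* Save the counter only, restore through set_queried. */
static uint64_t model_reinit_peek(uint64_t raw)
{
	uint64_t version = model_peek(raw);

	return model_set_queried(version);
}

/* Self-check: counter 5 survives the peek variant, not the raw one. */
void model_reinit_demo(void)
{
	uint64_t raw = model_set_queried(5);

	assert(model_peek(model_reinit_raw(raw)) == 11);
	assert(model_peek(model_reinit_peek(raw)) == 5);
}
```

With the raw save, a counter of 5 comes back as 11 after one reinit;
with the non-raw save it stays at 5.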

> 
> >  	inode->i_mode = mode;
> >  	return error;
> >  }
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index 801274126648..be6d87980dd5 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -833,7 +833,7 @@ xfs_ialloc(
> >  	ip->i_d.di_flags = 0;
> >  
> >  	if (ip->i_d.di_version == 3) {
> > -		inode->i_version = 1;
> > +		inode_set_iversion(inode, 1);
> 
> But here you are using the "filesystem managed" mode to set the
> new value. Why? How is this any different from reading the value
> off disk and setting it?

Still don't understand why this is different to reading the inode
from disk....

Cheers,

Dave.
-- 
Dave Chinner
david@...morbit.com