[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20110915202500.GD12086@tux1.beaverton.ibm.com>
Date: Thu, 15 Sep 2011 13:25:00 -0700
From: "Darrick J. Wong" <djwong@...ibm.com>
To: "Ted Ts'o" <tytso@....edu>
Cc: Andreas Dilger <adilger.kernel@...ger.ca>,
Sunil Mushran <sunil.mushran@...cle.com>,
Amir Goldstein <amir73il@...il.com>,
Andi Kleen <andi@...stfloor.org>,
Mingming Cao <cmm@...ibm.com>,
Joel Becker <jlbec@...lplan.org>, linux-ext4@...r.kernel.org,
Coly Li <colyli@...il.com>
Subject: Re: [PATCH 01/37] e2fsprogs: Read and write full-sized inodes
On Wed, Sep 14, 2011 at 12:39:01PM -0400, Ted Ts'o wrote:
> On Wed, Aug 31, 2011 at 05:35:17PM -0700, Darrick J. Wong wrote:
> > As part of adding inode checksums, it is necessary for all e2fsprogs to read
> > and write full inodes so that checksums may be calculated correctly. Since
> > struct ext2_inode_large is a superset of struct ext2_inode, replace the smaller
> > one with the larger one.
>
> OK, so I need to explain how the large inode support was designed.
>
> The original assumption behind it was that most of the time, most user
> progams outside of programs shipped natively with e2fsprogs (which are
> special) don't actually need to access large inodes directly. They
> are much more likely to use ext2fs_file_{open,read}(). To the extent
> that they need to access inodes at all, they will tend to do so
> read-only to fetch the size or modtime or user/group ownership fields.
>
> And when we write an inode, we need to do a read/modify/write cycle
> with the inode table block anyway. So the idea was to provide full
> ABI compatibility with the struct ext4_inode structure, and the
> functions that read and write them. So ext4_inode is a fixed size,
> 128 byte structure, and functions that currently take ext4_inode must
> only read or write the first 128 bytes.
>
> This also implies that probably the better place to put the checksum
> calculation code is in ext4_write_inode()/ext4_write_inode_large()
> functions. If we need a new function ext4_write_inode_raw() which has
> the same function signature as ext4_write_inode_large(), but which
> skips the checksum calculation, so be it.
Okay, so here's my new game plan ... internally, libext2fs needs to read/write
whatever sb->inode_size is. However, to avoid ABI breakage, it will use an
internal buffer (stack variable, memory block, whatever) for the IO operation
and the checksum operations. As for whatever buffer+size the caller passes in,
it will memcpy this (potentially smaller) amount so that external programs that
still use the 128-byte structure won't get corrupted. The other tools within
e2fsprogs can of course allocate the largest inode size they care about
(ext2_inode_large) if they so choose.
So I think this means that I can rip out both superpatches. The inode read and
scan functions will need to allocate a buffer that is whatever size the
superblock says inodes are so that it can read, and checksum the inode. After
that, the functions will copy however many bytes the caller tells us to copy
into the caller's buffer. The write function will allocate a buffer that is
whatever size the superblock says inodes are, read that size off the disk, copy
however many bytes the caller gave us into this temporary buffer, then
calculate the checksum and write the buffer out to disk. e2fsprogs that care
about extended inode fields can be modified to use ext2_inode_large, and
everything else can keep using ext2_inode. This also keeps it so that client
programs don't have to know or care how big inodes really are.
Sorry I wasn't aware that there are non-e2fsprogs users of libext2fs.
--D
> Similarly, if we put the checksum verfication in
> ext4_read_inode{_large}(), with a new error code if the checksum is
> incorrect, all existing callers will get checksum verification for
> free.
>
> Now, about how ext4_{read,write}_inode_full(). This function is designed so
> that struct ext4_inode_large can grow in newer versions of the
> library. To that end, the callers must call it as follows:
>
> struct ext2_inode_large inode;
>
> retval = ext2fs_read_inode_full(fs, ino, &inode, sizeof(inode));
>
> If they happen to compile on a system where struct ext2_inode_large is
> 140 bytes, then they will pass in a bufsize of 140 bytes --- and it is
> up to struct ext2fs_read_inode_full() to only pass back 140 bytes, and
> for struct ext2fs_write_inode_full() to only read 140 bytes from the
> passed-in pointer. You can think of the bufsize parameter as being a
> built-in versioning mechanism.
>
> It is for a similar reason that gethostbyname() takes a socklen_t
> parameter. That way it doesn't have to worry about stack smashing an
> legacy ipv4-only program.
>
> So you can't just blanket replace struct ext2_inode with struct
> ext2_inode_large, and have ext2fs_read_inode() blindly copy out
> however many bytes happen to be in sizeof(struct ext2_inode_large)
> today. That way lies all sorts of wierd and hard-to-debug versioning
> skew problems....
>
> Does this make sense?
>
> - Ted
>
> P.S. The ext4 wiki is down right now, due to the kernel.org security
> recovery efforts, but when it's back up, someone please remind me to
> get this text up on the wiki. :-)
>
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Powered by blists - more mailing lists