linux-ext4 - Re: strange ext{3,4}


Message-id: <20080315230544.GV3542@webber.adilger.int>
Date:	Sun, 16 Mar 2008 07:05:44 +0800
From:	Andreas Dilger <adilger@....com>
To:	Dmitri Monakhov <dmonakhov@...nvz.org>
Cc:	linux-ext4@...r.kernel.org
Subject: Re: strange ext{3,4}_settattr logic

On Mar 15, 2008  19:07 +0300, Dmitri Monakhov wrote:
> I've found what ext3_setattr() code has some strange logic. I'm talking
> about truncate path. 
> 
> int ext3_setattr(struct dentry *dentry, struct iattr *attr)
> {
> ...
> 	if (S_ISREG(inode->i_mode) &&
>             attr->ia_valid & ATTR_SIZE && attr->ia_size < inode->i_size) {
>                 handle_t *handle;
> <<< This is shrinking case, and according to function comments:
> <<< "In particular, we want to make sure that when the VFS
> <<< * shrinks i_size, we put the inode on the orphan list and modify
> <<< * i_disksize immediately"
> <<< we about to write i_disksize. But WHY do we have to do it explicitly?
> <<< Later inode_setattr() will call ext3_truncate() which will do it
> <<< this work for us.

i_disksize is written here immediately, and the journal handle is then
stopped, so that the new size and the orphan list entry are recorded before
the truncate itself begins.  Once that is done, in case of a crash the orphan
recovery code will detect the unfinished truncate and complete it before
mounting the filesystem.

Without this, it is possible to get a partial truncate after a crash, because
the truncate may span several transactions due to the potentially large
number of blocks that need to be modified.  What is important with ext3
is that, because e2fsck is not run on each boot, whatever is on disk needs
to be consistent after a crash.
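
To illustrate why the target size has to be recorded before any blocks are
freed, here is a minimal userspace sketch.  It is not kernel code: the
struct and the toy_* functions are hypothetical names made up for this
example, "disksize" only stands in for i_disksize, and sizes are counted
in whole blocks to keep it short.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one file; not the kernel's struct inode. */
struct toy_file {
    size_t nblocks;     /* blocks currently allocated on "disk" */
    size_t disksize;    /* recorded size in blocks (stands in for i_disksize) */
    int on_orphan_list; /* nonzero while a truncate is still pending */
};

/* First transaction: record the target size and the pending marker together. */
static void toy_begin_truncate(struct toy_file *f, size_t new_size)
{
    assert(new_size <= f->nblocks);
    f->disksize = new_size;
    f->on_orphan_list = 1;
}

/* One later transaction: free at most `budget` blocks.  Returns 1 when done. */
static int toy_truncate_step(struct toy_file *f, size_t budget)
{
    while (budget > 0 && f->nblocks > f->disksize) {
        f->nblocks--;
        budget--;
    }
    if (f->nblocks == f->disksize) {
        f->on_orphan_list = 0;
        return 1;
    }
    return 0;
}

/* Crash recovery: finish whatever truncate was left pending. */
static void toy_recover(struct toy_file *f)
{
    if (!f->on_orphan_list)
        return;
    f->nblocks = f->disksize;   /* free everything past the recorded size */
    f->on_orphan_list = 0;
}
```

If a "crash" happens after toy_begin_truncate() and only some calls to
toy_truncate_step(), toy_recover() still knows both that a truncate was
in progress and where it has to end.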

If a file is being truncated or unlinked, that operation needs to be
completed after a crash or the blocks will be leaked.  To ensure this
happens, there is a singly-linked list of inodes on the disk called the
"orphan list" that keeps track of all inodes currently undergoing truncate
or unlink.  After a crash the kernel or e2fsck will walk this list and
finish the truncate or unlink of the inode, freeing the blocks.
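
The list walk described above could be sketched roughly as follows.  This
is a simplified userspace model, not the ext3 on-disk format: the structs,
field names and toy_* functions are all hypothetical, and the sketch assumes
each inode is added to the list at most once.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NINODES 8   /* inode numbers 1..7 are usable; 0 means "none" */

/* Hypothetical on-disk inode; field names are illustrative only. */
struct toy_inode {
    unsigned nlink;        /* 0 means the file was unlinked */
    size_t nblocks;        /* blocks currently allocated */
    size_t disksize;       /* target size in blocks */
    unsigned next_orphan;  /* next inode number on the list, 0 = end */
};

struct toy_fs {
    unsigned orphan_head;  /* list head, kept in the superblock */
    struct toy_inode inodes[TOY_NINODES];
};

/* Push an inode onto the head of the orphan list (at most once per inode). */
static void toy_orphan_add(struct toy_fs *fs, unsigned ino)
{
    assert(ino > 0 && ino < TOY_NINODES);
    fs->inodes[ino].next_orphan = fs->orphan_head;
    fs->orphan_head = ino;
}

/* Walk the list after a crash and finish each operation.
 * Returns the number of blocks freed. */
static size_t toy_orphan_cleanup(struct toy_fs *fs)
{
    size_t freed = 0;

    while (fs->orphan_head != 0) {
        struct toy_inode *in = &fs->inodes[fs->orphan_head];
        /* Unlinked inodes lose everything; others shrink to disksize. */
        size_t keep = in->nlink ? in->disksize : 0;

        if (in->nblocks > keep) {
            freed += in->nblocks - keep;
            in->nblocks = keep;
        }
        fs->orphan_head = in->next_orphan;
        in->next_orphan = 0;
    }
    return freed;
}
```

The point of the model is only that the list head and the per-inode link
are both persistent, so the walk needs nothing that was held in memory
before the crash.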

>         rc = inode_setattr(inode, attr);
> <<< Now the most interesting question. What we have to do now in 
> <<< case of error? We are in tricky situation. Truncate not happened,
> <<< and blocks visible to the user, but i_disksize was already written,
> <<< so later memory reclaiming/ read_inode will result in unexpected
> <<< updating i_size.

The only ways inode_setattr() can fail are:
- expanding vmtruncate hits EFBIG, but we checked that above
- shrinking vmtruncate on a swapfile returns ETXTBSY.  This was added
  after the ext3_setattr() code was written.
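
As a rough illustration of those two failure cases, here is a simplified
userspace model of the size check.  The struct, the toy_vmtruncate() name
and the max_size parameter are assumptions made for this sketch; the real
vmtruncate() does considerably more than this.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the inode state the size check looks at. */
struct toy_vfs_inode {
    long long size;    /* current i_size */
    int is_swapfile;   /* nonzero if the file is an active swapfile */
};

/* Returns 0 on success or a negative errno, as kernel code does. */
static int toy_vmtruncate(struct toy_vfs_inode *inode, long long new_size,
                          long long max_size)
{
    assert(new_size >= 0);

    if (new_size > inode->size) {
        /* Expanding: the only failure is exceeding the size limit. */
        if (new_size > max_size)
            return -EFBIG;
    } else if (inode->is_swapfile) {
        /* Shrinking an in-use swapfile is refused. */
        return -ETXTBSY;
    }
    inode->size = new_size;
    return 0;
}
```

In both error cases the size is left untouched, which is why a caller that
has already recorded a smaller on-disk size has to think about cleanup.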

If the ext3_truncate() or mark_inode_dirty() call fails, it does not
return an error code.  For ext3 the only way this can fail is if the
journal is aborted, which means the filesystem is already in read-only
mode and nothing can be done to clean up the truncate until the next
mount, at which point the orphan recovery code discussed above will
finish the operation.

>         /* If inode_setattr's call to ext3_truncate failed to get a
>          * transaction handle at all, we need to clean up the in-core
>          * orphan list manually. */
> <<< Following code will remove inode only from in memory(because handle = NULL)
> <<< orphan list. Please someone explain me what this lines suppose to do
> <<< actually.
>         if (inode->i_nlink)
>                 ext3_orphan_del(NULL, inode);

This will only be important in the case of a failed operation above.
The ext3_truncate() code will normally have already removed the inode
from the orphan list when it is finished, but we aren't sure whether
that code was called so we need to do it again here (it is safe to call
even if the inode is not on the list) to ensure we don't hit a J_ASSERT()
that the orphan list is empty in the unmount code (ext3_put_super()).
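
A minimal sketch of that "safe to call twice" property, using a
hypothetical in-memory list in userspace C (the toy_* names are made up
for this example, and a plain assert() stands in for the J_ASSERT()):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical in-core inode; not the kernel's ext3_inode_info. */
struct toy_mem_inode {
    struct toy_mem_inode *next;
    int on_list;                     /* nonzero while on the orphan list */
};

struct toy_sb {
    struct toy_mem_inode *orphans;   /* head of the in-core orphan list */
};

static void toy_mem_orphan_add(struct toy_sb *sb, struct toy_mem_inode *in)
{
    if (in->on_list)
        return;
    in->next = sb->orphans;
    sb->orphans = in;
    in->on_list = 1;
}

/* Removing an inode that is not on the list is a harmless no-op. */
static void toy_mem_orphan_del(struct toy_sb *sb, struct toy_mem_inode *in)
{
    struct toy_mem_inode **pp;

    if (!in->on_list)
        return;
    for (pp = &sb->orphans; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == in) {
            *pp = in->next;   /* unlink the inode from the chain */
            break;
        }
    }
    in->next = NULL;
    in->on_list = 0;
}

/* Unmount-time check: nothing may be left on the in-core list. */
static void toy_put_super(struct toy_sb *sb)
{
    assert(sb->orphans == NULL);
}
```

Because the delete is idempotent, the caller can invoke it unconditionally
after the truncate attempt without knowing whether the truncate path
already removed the inode.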

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
