linux-ext4 - Re: Introducing Next3 - built-in snapshots support for Ext3

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Sun, 9 May 2010 13:56:54 +0200
From:	"Amir G." <amir73il@...rs.sourceforge.net>
To:	Theodore Tso <tytso@....edu>
Cc:	Ric Wheeler <rwheeler@...hat.com>, Andi Kleen <andi@...obates.de>,
	linux-ext4@...r.kernel.org
Subject: Re: Introducing Next3 - built-in snapshots support for Ext3

On Sun, May 9, 2010 at 5:25 AM, Theodore Tso  wrote:
>
> For example, if you have flags that only are used in-memory, they shouldn't be allocated out of i_flags, but instead using the i_state field in the ext4_inode_info structure, referenced via EXT4_I(inode).i_state.   That's what it is there fore.
>

I have 4 non-persistent flags. I will move them to i_state.
I've kept them in i_flags out of laziness, since I use lsattr -X
to read non-persistent snapshot flags along with persistent snapshot flags.

>>
>> Next3 uses 1 assigned field (i_version), but it does not "abuse" it.
>> You see, Next3 only tampers with i_version of snapshot files.
>> And by tamper I mean: set it to next snapshot inode number on snapshot take.
>
> What do you mean by "next snapshot number"?   How is it used?  Why do you need it?   Given that all snapshot inodes must be stored out of a snapshot directory (why?) can't the snapshot directory name be used to establish some kind of ordering?   Is ordering significant, or do you just need it to find them all?   If it's just to find them all, why not just used a linked list which is stored in memory; does it really need to be in the on-disk structure at all?
>

i_version is used to chain the snapshot list on-disk, similar to orphan list.
I used i_dtime in the past, but I was concerned that a bug would
result in cleanup of all snapshots,
so I started using i_version instead.
I can revert back to using i_dtime (snapshot files are non-truncatable
non-unlinkable) instead of i_version.

the snapshot file directory entry name is arbitrary and may be used by
a "snapshot management system" as it wishes,
to organize and display snapshots.
As far as the snapshot sub-system is concerned, the on-disk snapshot
list is the only reference to the snapshot files.

>
> For example: i_snapshot_blocks_count.  Is that really necessary?   Why can not be computed by looking at i_size of the snapshot inode?   Or by checking to see if the superblock has be COW'ed?  If it hasn't then the s_blocks_count in the fs superblock must be what would have been in i_snapshot_blocks_count.  If the sb has been COW'ed, s_blocks_count in the COW'ed sb must be that value.  Why allocate and waste a full 32-bit field out of _every_ inode in the file system if it's possible to get that value via other places.
>

very well, I can read snapshot_blocks_count from COWed superblock (it
is always COWed on snapshot take) and release i_snapshot_blocks_count.

> I have similar question about bg_cow_bitmap.  Is that really necessary?   The COW bitmap is just a copy of the base file system's block allocation bitmap, right?  (The wiki documentation and the design PDF isn't completely clear on that point, but that seems to be what it is.)   So why do you need to allocate a field out of the bg discriptor field for it?
>
> It's not clear why you need an exclude inode, if you are also storing the address of the exclude bitmap blocks in the bg descriptor.  One or the other, but not both...
>

bg_cow_bitmap/bg_exculde_bitmap are used by Next3 as non-persistent
cache for the address of a bitmap blocks,
which can be read from active_snapshot/exclude_inode.
in other words, instead of allocating per group in-memory structure, I
used the 2 unused fields in the in-memory group descriptor.
the only side effect for the ext* on-disk format is that those fields
are no longer 0 after mounting a volume with Next3.
is that a problem? can the CSUM feature resolve that problem?

in e2fsprogs, I only reference those fields for debugging purpose
(dumpe2fs displays them).
also create_exclude_inode and resize2fs set the bg_exclude_bitmap, but
they don't have to,
because Next3 re-reads all exclude_bitmap block addresses from exclude
inode on mount time.
so please feel free to reject those field assignments. I can include
them in a seperate debug only patch.


> If s_snapshot_inum and s_snapshot_id refer to the "active" snapshot, why do they need to be in the on-disk structure.   Why not just have the first item of the linked list whose head is s_last_snapshot be the "active" snapshot (if this needs to be in the on-disk state at all); wouldn't the active snapshot be the most recent one anyway?

good question. again, there use to be only 1 field s_last_snapshot,
but I split it into 2 field to recover from crash in the middle
of snapshot take.
a half taken snapshot is set as s_last_snapshot, but only a ready
snapshot is set as s_snapshot_inum.
Next3 will cleanup a half taken snapshot on mount time.
tune2fs -O ^has_snapshot will cleanup (discard) all snapshot files,
including the half taken snapshot.

> Also, as far as i_next_snapshot is concerned, why not just use d_time for the linked list.  That's what we do with the orphaned inode list, so we have code that maintains a linked list using d_time already.   So that way you don't need _any_ new inode fields.
>

I don't know if you noticed, but I reused the code of
add/del_orphan_list() to manipulate the snapshot list...
And as I said a few lines back, I can revert to using i_dtime instead
of i_next_snapshot.

> I'm not convinced by all of the fs feature compatibility flags you've defined, either.   In general, if you suspect one part of the file system, you need to check everything.  So do you really need separate ro_compat bits for "fix_snapshot" and "fix_exclude"?

no, not really. these are informational only. I didn't even use
fix_snapshot yet.

> And why do you need "IS_SNAPSHOT"?
>

I "fix" the COWed superblock to make it look like ext2 superblock and
set the is_snapshot feature, so fsck would know it is checking a Next3
snapshot image (and not report wrong block counts).

> As far as COMPAT_BIG_JOURNAL, we have feature flags in the journal, and that probably is better placed there.

I will look into that.

>And I assume COMPAT_EXCLUDE_INODE is really "COMPAT_HAS_SNAPSHOTS"?

logically, it means that the exclude inode/bitmap is allocated.
currently, the only feature that uses the exclude bitmap is the
snapshot feature,
so I don't mind bundling them together.

but I do recommend to mke2fs -o exclude_inode if you intend to switch
from Ext3 to Next3 at some point.
it will guarenty that exclude bitmap blocks are allocated their
corresponding block bitmap.

>
> BTW, if you are free at 11:00 US/Eastern on Monday, maybe you could join the ext4 weekly conference call, and we could try to hash some of this out on the telephone call?  I'm not sure where you are geographically based, so I don't know if this would be a hard time for you to make or not.
>

I would be happy to join the weekly call. I am located in Israel. How
does this work exactly?
Should I call in? to which number? need credentials?

Thanks,
Amir.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html