lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20150624052156.GA15410@birch.djwong.org>
Date:	Tue, 23 Jun 2015 22:21:56 -0700
From:	"Darrick J. Wong" <darrick.wong@...cle.com>
To:	Mingming Cao <mingming.cao@...cle.com>
Cc:	linux-ext4@...r.kernel.org
Subject: Re: [RFC 1/2] btree header

On Tue, Jun 23, 2015 at 10:14:19PM -0600, Mingming Cao wrote:
> On 06/23/2015 01:33 PM, Darrick J. Wong wrote:
> > On Mon, Jun 22, 2015 at 08:24:37PM -0700, mingming cao wrote:
> >> ---
> >>  ext4_btree.h | 146 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 146 insertions(+)
> >>  create mode 100644 ext4_btree.h
> >>
> >> diff --git a/ext4_btree.h b/ext4_btree.h
> >> new file mode 100644
> >> index 0000000..efd5ce3
> >> --- /dev/null
> >> +++ b/ext4_btree.h
> >> @@ -0,0 +1,146 @@
> >> +/*
> >> + * copyright (C) 2015 Oracle.  All rights reserved.
> >> + *
> >> + * This program is free software; you can redistribute it and/or
> >> + * modify it under the terms of the GNU General Public
> >> + * License v2 as published by the Free Software Foundation.
> >> + *
> >> + * This program is distributed in the hope that it will be useful,
> >> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> >> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> >> + * General Public License for more details.
> >> + *
> >> + * You should have received a copy of the GNU General Public
> >> + * License along with this program; if not, write to the
> >> + * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
> >> + * Boston, MA 021110-1307, USA.
> >> + */
> >> +
> >> +#ifndef EXT4_BTREE_H
> >> +#define EXT4_BTREE_H
> >> +
> >> +/*
> >> + * This btree geometry structure is used to defines how the btree block
> >> + * layout for different type of btrees. This is used to implement a generic
> >> + * btree implementation. The record length, value lenth, etc, are variable
> >> + * based on different btree type, like reflink btree, and other btree use cases.
> >> + */
> >> +struct ext4_btree_geo {
> >> +	__le16 node_len;
> >> +	__le16 header_len;
> >> +	__le16 key_len;
> >> +	__le16 val_len;
> >> +	__le16 rec_len;
> > (I wonder if these four _len values could be u8, but whatever...)
> 
> okay...
> 
> >
> >> +	__le16 max_pairs;
> >> +	__le16 max_records;
> >> +};
> > Does this geometry structure live on disk?  The __le prefix suggests that it
> > does, but the only definition of one of these is the in-core ref_btree_geo,
> > which means that u16 is sufficient.
> 
> At the beginning since the prototype is a in memory btree, the geometry
> structure was not on disk, and declared globally. So I had those values to be
> u8 before... but I am planning to store those layout info on disk... that's
> why here I made change to __le, but hasnt really update the rest of code to
> reflect the geo is stored on disk yet. One question I have is whethere to
> store the geometry structure into the btree block header. Maybe only for
> index node...

Hm.  XFS declares a struct xfs_btree_ops which holds the size of the key and
record fields (the pointer is assumed to be a AG block pointer), and function
pointers to various tree-specific helpers.  The max_{pairs,records} values
are kept in the XFS context structure.  None of it ever recorded on disk,
though (obviously) the values are used for sanity-checks of whatever's being
read in...

...or written out.  Hm, I sort of wonder (random aside here) if we could
abuse the jbd2 triggers to do spot-checks of what's being written out to disk
while we're also computing block checksums. </ramble>

(Nuts, 10pm, uh... I'll keep going in the morning.)

--D

> >
> > Given that the ext4_btree_geo seems to contain all the incore info about the
> > btree you're working with, it ought to know about the magic number so that it
> > can compare with whatever it reads off disk.
>
> Good point, it should know the magic number to compare with the header, if we
> decide to keep the geometry incore only.... I started to wonder if we really
> need to store geometry info on disk...
> 
> > (Thinking ahead to when we have multiple btrees with different record
> > definitions and magic numbers.)
> Agree...
> 
> >> +
> >> +/*
> >> + * Every btree block starts from a header structure, including the index node
> >> + * and the leaf node.
> >> + * The index btree node started from this header structure, followed by 
> >> + * (key,value) pairs
> >> + * Index node:
> >> + * ---------------------------------------------------------------------------
> >> + * |header| key | key | key | key|...........| value | value | value | value |
> >> + * |--------------------------------------------------------------------------
> >> + *
> >> + * And the leaf node begins with ext4_btree_header, and followed by records.
> >> + * leaf node
> >> + * * ------------------------------------------------------------------------
> >> + * |header| rec | rec | rec | rec |...........| rec | rec | rec | rec | rec |
> >> + * |-------------------------------------------------------------------------
> >> + */
> >> +
> >> +#define REF_BTREE_MAGIC  0x52454642 /* "REFB" magic number for refcount btree */
> >> +
> >> +struct ext4_btree_header {
> >> +	__le32	btree_magic;	/* type of btree */
> >> +	__le32	btree_generation; /* generation for this type of btree */
> >> +	__le32	btree_level;	/* the level of this node in the tree */
> >> +	__le32	btree_numkeys;	/* # of records for leaf node*/
> >> +	__le32	btree_padding1;
> > I think this padding field results in a hidden u32 gap here?
> 
> Sorry that was a mistake. I added generation but forget to remove this padding.
> >> +	__le64	btree_blockno;	/* remember the blk# of this node */
> > (Unfortunate that you have to burn all 64 bits for all btrees, but I remembered
> > that ext4 lets you set flexbg size to 2^31.  Oh well.)
> 
> true... though this block no is
> >> +	__le64	btree_leftsib;	/* used for fast search sibling nodes */
> >> +	__le64	btree_rightsib; 
> >> +	__le64	btree_crc;	/* this node checksums */
> > Future proofing, I guess?  Since we don't use 64-bit crcs right now.
> 
> yes... that was the intention... I guess that is a little overkill for now.
> > If you decide that we only need 32 bits for a crc, you could make btree_level a
> > u16 (since max levels is 8), numkeys a u16, and store the crc32 after that.
> > Then this on-disk header would only be 40 bytes, instead of 64 bytes.
> 
> Sounds good to me .. make it more packed...
> > (Hmm.  Maybe you were trying to make things align to a 64B cacheline?)
> >
> >> +	__le64	btree_padding2;
> >> +};
> >> +
> >> +# define EXT4_BLOCK_SIZE	4096
> > Shouldn't this be the FS blocksize?  It'll waste a lot of space on a 64k block
> > filesystem (reduced fanout opportunities as well as 60K of empty space in each
> > tree node) and I'm not sure what happens with 1k blocks.
> 
> Again, sorry this is me being lazy in the prototype here. I will fix it properly with filesystem blocksize ...
> 
> >> +#define EXT4_BTREE_HEADER_SIZE	sizeof(struct ext4_btree_header)
> >> +#define EXT4_BTREE_NODESIZE	EXT4_BLOCK_SIZE	
> >> +
> >> +struct ext4_btree_key {
> >> +	__le64		blockno;
> >> +};
> >> +#define EXT4_BTREE_KEY_SIZE	sizeof(struct ext4_btree_key)
> >> +
> >> +struct ext4_btree_val {
> >> +	__le64		blockno;
> >> +};
> >> +#define EXT4_BTREE_VAL_SIZE	sizeof(struct ext4_btree_val)
> >> +
> >> +struct ext4_btree_rec {
> >> +	struct ext4_btree_key	key;
> >> +	__le32			len;
> >> +	__le32			ref_count;
> >> +};
> > Hm.  Looking forward to having btrees to track things other than refcount, I
> > imagine it'll become necessary to parameterize the record definition for each
> > type of btree?
> >
> totally agree. Will need to make it more generic by adding interface to define different type of btree, with own record_data structure and let the btree user to pass those info into the generic code..
> 
> > (Though, I guess this record format works well enough for the inode and block
> > bitmaps, and maybe that's as far as you want to go...)
> 
> for reflink, block allocation, yes. But I like to make the code more generic if possible
> >> +#define EXT4_BTREE_REC_SIZE	sizeof(struct ext4_btree_rec)
> >> +
> >> +#define EXT4_BTREE_MAX_KEY_VAL_PAIRS \
> >> +	((EXT4_BTREE_NODESIZE - EXT4_BTREE_HEADER_SIZE) / \
> >> +	 (EXT4_BTREE_KEY_SIZE + EXT4_BTREE_VAL_SIZE))
> >> +
> >> +#define EXT4_BTREE_MIN_KEY_VAL_PAIRS \
> >> +        (EXT4_BTREE_MAX_KEY_VAL_PAIRS / 2)
> >> +
> >> +#define EXT4_BTREE_MAX_RECS \
> >> +	((EXT4_BTREE_NODESIZE - EXT4_BTREE_HEADER_SIZE) / \
> >> +	  EXT4_BTREE_REC_SIZE)
> >> +
> >> +#define EXT4_BTREE_MIN_RECS \
> >> +	(EXT4_BTREE_MAX_RECS / 2)
> >> +
> >> +#define EXT4_BTREE_MAX_LEVELS		8
> >> +
> >> +
> >> +/* Index node */
> >> +struct ext4_btree_index_node {
> >> +	struct ext4_btree_header	node_header;
> >> +	struct ext4_btree_key		key[EXT4_BTREE_MAX_KEY_VAL_PAIRS];
> >> +	struct ext4_btree_val		val[EXT4_BTREE_MAX_KEY_VAL_PAIRS];
> > (Changing EXT4_BTREE_NODESIZE to depend on the FS blocksize makes it impossible
> > to use arrays here, oh well...)
> 
> Hmm.. okay, probably a pointer here.
> >> +};
> >> +
> >> +/* Record Node */
> >> +struct ext4_btree_leaf_node {
> >> +	struct ext4_btree_header	node_header;
> >> +	struct ext4_btree_rec		rec[EXT4_BTREE_MAX_RECS];
> >> +};
> >> +
> >> +struct ext4_btree_node {
> >> +	struct ext4_btree_header	node_header;
> >> +};
> >> +
> >> +struct ext4_btree_root {
> >> +	struct ext4_btree_node *node;
> >> +	struct ext4_btree_geo	geo;
> >> +};
> >> +
> >> +#define ext4_btree_trace printf
> > printk?
> >
> > --D
> 
> :) this was done in the user-space first. When the code finally all move to kernel will be something like printk
> 
> Thanks a lot , for your review and comments!:)
> 
> Mingming
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ