lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20150623193333.GA10037@birch.djwong.org>
Date:	Tue, 23 Jun 2015 12:33:33 -0700
From:	"Darrick J. Wong" <darrick.wong@...cle.com>
To:	mingming cao <mingming.cao@...cle.com>
Cc:	linux-ext4@...r.kernel.org
Subject: Re: [RFC 1/2] btree header

On Mon, Jun 22, 2015 at 08:24:37PM -0700, mingming cao wrote:
> ---
>  ext4_btree.h | 146 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 146 insertions(+)
>  create mode 100644 ext4_btree.h
> 
> diff --git a/ext4_btree.h b/ext4_btree.h
> new file mode 100644
> index 0000000..efd5ce3
> --- /dev/null
> +++ b/ext4_btree.h
> @@ -0,0 +1,146 @@
> +/*
> + * copyright (C) 2015 Oracle.  All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public
> + * License v2 as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public
> + * License along with this program; if not, write to the
> + * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
> + * Boston, MA 021110-1307, USA.
> + */
> +
> +#ifndef EXT4_BTREE_H
> +#define EXT4_BTREE_H
> +
> +/*
> + * This btree geometry structure is used to defines how the btree block
> + * layout for different type of btrees. This is used to implement a generic
> + * btree implementation. The record length, value lenth, etc, are variable
> + * based on different btree type, like reflink btree, and other btree use cases.
> + */
> +struct ext4_btree_geo {
> +	__le16 node_len;
> +	__le16 header_len;
> +	__le16 key_len;
> +	__le16 val_len;
> +	__le16 rec_len;

(I wonder if these four _len values could be u8, but whatever...)

> +	__le16 max_pairs;
> +	__le16 max_records;
> +};

Does this geometry structure live on disk?  The __le prefix suggests that it
does, but the only definition of one of these is the in-core ref_btree_geo,
which means that u16 is sufficient.

Given that the ext4_btree_geo seems to contain all the incore info about the
btree you're working with, it ought to know about the magic number so that it
can compare with whatever it reads off disk.

(Thinking ahead to when we have multiple btrees with different record
definitions and magic numbers.)

> +
> +/*
> + * Every btree block starts from a header structure, including the index node
> + * and the leaf node.
> + * The index btree node started from this header structure, followed by 
> + * (key,value) pairs
> + * Index node:
> + * ---------------------------------------------------------------------------
> + * |header| key | key | key | key|...........| value | value | value | value |
> + * |--------------------------------------------------------------------------
> + *
> + * And the leaf node begins with ext4_btree_header, and followed by records.
> + * leaf node
> + * * ------------------------------------------------------------------------
> + * |header| rec | rec | rec | rec |...........| rec | rec | rec | rec | rec |
> + * |-------------------------------------------------------------------------
> + */
> +
> +#define REF_BTREE_MAGIC  0x52454642 /* "REFB" magic number for refcount btree */
> +
> +struct ext4_btree_header {
> +	__le32	btree_magic;	/* type of btree */
> +	__le32	btree_generation; /* generation for this type of btree */
> +	__le32	btree_level;	/* the level of this node in the tree */
> +	__le32	btree_numkeys;	/* # of records for leaf node*/
> +	__le32	btree_padding1;

I think this padding field results in a hidden u32 gap here?

> +	__le64	btree_blockno;	/* remember the blk# of this node */

(Unfortunate that you have to burn all 64 bits for all btrees, but I remembered
that ext4 lets you set flexbg size to 2^31.  Oh well.)

> +	__le64	btree_leftsib;	/* used for fast search sibling nodes */
> +	__le64	btree_rightsib; 
> +	__le64	btree_crc;	/* this node checksums */

Future proofing, I guess?  Since we don't use 64-bit crcs right now.

If you decide that we only need 32 bits for a crc, you could make btree_level a
u16 (since max levels is 8), numkeys a u16, and store the crc32 after that.
Then this on-disk header would only be 40 bytes, instead of 64 bytes.

(Hmm.  Maybe you were trying to make things align to a 64B cacheline?)

> +	__le64	btree_padding2;
> +};
> +
> +# define EXT4_BLOCK_SIZE	4096

Shouldn't this be the FS blocksize?  It'll waste a lot of space on a 64k block
filesystem (reduced fanout opportunities as well as 60K of empty space in each
tree node) and I'm not sure what happens with 1k blocks.

> +#define EXT4_BTREE_HEADER_SIZE	sizeof(struct ext4_btree_header)
> +#define EXT4_BTREE_NODESIZE	EXT4_BLOCK_SIZE	
> +
> +struct ext4_btree_key {
> +	__le64		blockno;
> +};
> +#define EXT4_BTREE_KEY_SIZE	sizeof(struct ext4_btree_key)
> +
> +struct ext4_btree_val {
> +	__le64		blockno;
> +};
> +#define EXT4_BTREE_VAL_SIZE	sizeof(struct ext4_btree_val)
> +
> +struct ext4_btree_rec {
> +	struct ext4_btree_key	key;
> +	__le32			len;
> +	__le32			ref_count;
> +};

Hm.  Looking forward to having btrees to track things other than refcount, I
imagine it'll become necessary to parameterize the record definition for each
type of btree?

(Though, I guess this record format works well enough for the inode and block
bitmaps, and maybe that's as far as you want to go...)

> +#define EXT4_BTREE_REC_SIZE	sizeof(struct ext4_btree_rec)
> +
> +#define EXT4_BTREE_MAX_KEY_VAL_PAIRS \
> +	((EXT4_BTREE_NODESIZE - EXT4_BTREE_HEADER_SIZE) / \
> +	 (EXT4_BTREE_KEY_SIZE + EXT4_BTREE_VAL_SIZE))
> +
> +#define EXT4_BTREE_MIN_KEY_VAL_PAIRS \
> +        (EXT4_BTREE_MAX_KEY_VAL_PAIRS / 2)
> +
> +#define EXT4_BTREE_MAX_RECS \
> +	((EXT4_BTREE_NODESIZE - EXT4_BTREE_HEADER_SIZE) / \
> +	  EXT4_BTREE_REC_SIZE)
> +
> +#define EXT4_BTREE_MIN_RECS \
> +	(EXT4_BTREE_MAX_RECS / 2)
> +
> +#define EXT4_BTREE_MAX_LEVELS		8
> +
> +
> +/* Index node */
> +struct ext4_btree_index_node {
> +	struct ext4_btree_header	node_header;
> +	struct ext4_btree_key		key[EXT4_BTREE_MAX_KEY_VAL_PAIRS];
> +	struct ext4_btree_val		val[EXT4_BTREE_MAX_KEY_VAL_PAIRS];

(Changing EXT4_BTREE_NODESIZE to depend on the FS blocksize makes it impossible
to use arrays here, oh well...)

> +};
> +
> +/* Record Node */
> +struct ext4_btree_leaf_node {
> +	struct ext4_btree_header	node_header;
> +	struct ext4_btree_rec		rec[EXT4_BTREE_MAX_RECS];
> +};
> +
> +struct ext4_btree_node {
> +	struct ext4_btree_header	node_header;
> +};
> +
> +struct ext4_btree_root {
> +	struct ext4_btree_node *node;
> +	struct ext4_btree_geo	geo;
> +};
> +
> +#define ext4_btree_trace printf

printk?

--D

> +
> +/* Ref B+Tree interface */
> +struct ext4_btree_node * ext4_btree_node_alloc();
> +void ext4_btree_node_free( struct ext4_btree_node *node);
> +struct ext4_btree_rec * ext4_btree_lookup(struct ext4_btree_root * root, 
> +					  struct ext4_btree_key * key);
> +int ext4_btree_insert(struct ext4_btree_root *root, 
> +		      struct ext4_btree_rec *rec);
> +int ext4_btree_delete(struct ext4_btree_root *root, 
> +		  struct ext4_btree_key *key);
> +void ext4_btree_print(struct ext4_btree_root * root);
> +int ext4_btree_valid_check(struct ext4_btree_root *root);
> +void ext4_ref_btree_init(struct ext4_btree_root * root);
> +
> +
> +#endif
> -- 
> 1.9.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ