lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <558A2E9B.6080808@oracle.com>
Date:	Tue, 23 Jun 2015 22:14:19 -0600
From:	Mingming Cao <mingming.cao@...cle.com>
To:	"Darrick J. Wong" <darrick.wong@...cle.com>
CC:	linux-ext4@...r.kernel.org
Subject: Re: [RFC 1/2] btree header

On 06/23/2015 01:33 PM, Darrick J. Wong wrote:
> On Mon, Jun 22, 2015 at 08:24:37PM -0700, mingming cao wrote:
>> ---
>>  ext4_btree.h | 146 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 146 insertions(+)
>>  create mode 100644 ext4_btree.h
>>
>> diff --git a/ext4_btree.h b/ext4_btree.h
>> new file mode 100644
>> index 0000000..efd5ce3
>> --- /dev/null
>> +++ b/ext4_btree.h
>> @@ -0,0 +1,146 @@
>> +/*
>> + * copyright (C) 2015 Oracle.  All rights reserved.
>> + *
>> + * This program is free software; you can redistribute it and/or
>> + * modify it under the terms of the GNU General Public
>> + * License v2 as published by the Free Software Foundation.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> + * General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public
>> + * License along with this program; if not, write to the
>> + * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
>> + * Boston, MA 021110-1307, USA.
>> + */
>> +
>> +#ifndef EXT4_BTREE_H
>> +#define EXT4_BTREE_H
>> +
>> +/*
>> + * This btree geometry structure is used to defines how the btree block
>> + * layout for different type of btrees. This is used to implement a generic
>> + * btree implementation. The record length, value lenth, etc, are variable
>> + * based on different btree type, like reflink btree, and other btree use cases.
>> + */
>> +struct ext4_btree_geo {
>> +	__le16 node_len;
>> +	__le16 header_len;
>> +	__le16 key_len;
>> +	__le16 val_len;
>> +	__le16 rec_len;
> (I wonder if these four _len values could be u8, but whatever...)

okay...

>
>> +	__le16 max_pairs;
>> +	__le16 max_records;
>> +};
> Does this geometry structure live on disk?  The __le prefix suggests that it
> does, but the only definition of one of these is the in-core ref_btree_geo,
> which means that u16 is sufficient.

At the beginning since the prototype is a in memory btree, the geometry structure was not on disk, and declared globally. So I had those values to be u8 before... but I am planning to store those layout info on disk... that's why here I made change to __le, but hasnt really update the rest of code to reflect the geo is stored on disk yet. One question I have is whethere to store the geometry structure into the btree block header. Maybe only for index node...
>
> Given that the ext4_btree_geo seems to contain all the incore info about the
> btree you're working with, it ought to know about the magic number so that it
> can compare with whatever it reads off disk.
Good point, it should know the magic number to compare with the header, if we decide to keep the geometry incore only.... I started to wonder if we really need to store geometry info on disk...

> (Thinking ahead to when we have multiple btrees with different record
> definitions and magic numbers.)
Agree...

>> +
>> +/*
>> + * Every btree block starts from a header structure, including the index node
>> + * and the leaf node.
>> + * The index btree node started from this header structure, followed by 
>> + * (key,value) pairs
>> + * Index node:
>> + * ---------------------------------------------------------------------------
>> + * |header| key | key | key | key|...........| value | value | value | value |
>> + * |--------------------------------------------------------------------------
>> + *
>> + * And the leaf node begins with ext4_btree_header, and followed by records.
>> + * leaf node
>> + * * ------------------------------------------------------------------------
>> + * |header| rec | rec | rec | rec |...........| rec | rec | rec | rec | rec |
>> + * |-------------------------------------------------------------------------
>> + */
>> +
>> +#define REF_BTREE_MAGIC  0x52454642 /* "REFB" magic number for refcount btree */
>> +
>> +struct ext4_btree_header {
>> +	__le32	btree_magic;	/* type of btree */
>> +	__le32	btree_generation; /* generation for this type of btree */
>> +	__le32	btree_level;	/* the level of this node in the tree */
>> +	__le32	btree_numkeys;	/* # of records for leaf node*/
>> +	__le32	btree_padding1;
> I think this padding field results in a hidden u32 gap here?

Sorry that was a mistake. I added generation but forget to remove this padding.
>> +	__le64	btree_blockno;	/* remember the blk# of this node */
> (Unfortunate that you have to burn all 64 bits for all btrees, but I remembered
> that ext4 lets you set flexbg size to 2^31.  Oh well.)

true... though this block no is
>> +	__le64	btree_leftsib;	/* used for fast search sibling nodes */
>> +	__le64	btree_rightsib; 
>> +	__le64	btree_crc;	/* this node checksums */
> Future proofing, I guess?  Since we don't use 64-bit crcs right now.

yes... that was the intention... I guess that is a little overkill for now.
> If you decide that we only need 32 bits for a crc, you could make btree_level a
> u16 (since max levels is 8), numkeys a u16, and store the crc32 after that.
> Then this on-disk header would only be 40 bytes, instead of 64 bytes.

Sounds good to me .. make it more packed...
> (Hmm.  Maybe you were trying to make things align to a 64B cacheline?)
>
>> +	__le64	btree_padding2;
>> +};
>> +
>> +# define EXT4_BLOCK_SIZE	4096
> Shouldn't this be the FS blocksize?  It'll waste a lot of space on a 64k block
> filesystem (reduced fanout opportunities as well as 60K of empty space in each
> tree node) and I'm not sure what happens with 1k blocks.

Again, sorry this is me being lazy in the prototype here. I will fix it properly with filesystem blocksize ...

>> +#define EXT4_BTREE_HEADER_SIZE	sizeof(struct ext4_btree_header)
>> +#define EXT4_BTREE_NODESIZE	EXT4_BLOCK_SIZE	
>> +
>> +struct ext4_btree_key {
>> +	__le64		blockno;
>> +};
>> +#define EXT4_BTREE_KEY_SIZE	sizeof(struct ext4_btree_key)
>> +
>> +struct ext4_btree_val {
>> +	__le64		blockno;
>> +};
>> +#define EXT4_BTREE_VAL_SIZE	sizeof(struct ext4_btree_val)
>> +
>> +struct ext4_btree_rec {
>> +	struct ext4_btree_key	key;
>> +	__le32			len;
>> +	__le32			ref_count;
>> +};
> Hm.  Looking forward to having btrees to track things other than refcount, I
> imagine it'll become necessary to parameterize the record definition for each
> type of btree?
>
totally agree. Will need to make it more generic by adding interface to define different type of btree, with own record_data structure and let the btree user to pass those info into the generic code..

> (Though, I guess this record format works well enough for the inode and block
> bitmaps, and maybe that's as far as you want to go...)

for reflink, block allocation, yes. But I like to make the code more generic if possible
>> +#define EXT4_BTREE_REC_SIZE	sizeof(struct ext4_btree_rec)
>> +
>> +#define EXT4_BTREE_MAX_KEY_VAL_PAIRS \
>> +	((EXT4_BTREE_NODESIZE - EXT4_BTREE_HEADER_SIZE) / \
>> +	 (EXT4_BTREE_KEY_SIZE + EXT4_BTREE_VAL_SIZE))
>> +
>> +#define EXT4_BTREE_MIN_KEY_VAL_PAIRS \
>> +        (EXT4_BTREE_MAX_KEY_VAL_PAIRS / 2)
>> +
>> +#define EXT4_BTREE_MAX_RECS \
>> +	((EXT4_BTREE_NODESIZE - EXT4_BTREE_HEADER_SIZE) / \
>> +	  EXT4_BTREE_REC_SIZE)
>> +
>> +#define EXT4_BTREE_MIN_RECS \
>> +	(EXT4_BTREE_MAX_RECS / 2)
>> +
>> +#define EXT4_BTREE_MAX_LEVELS		8
>> +
>> +
>> +/* Index node */
>> +struct ext4_btree_index_node {
>> +	struct ext4_btree_header	node_header;
>> +	struct ext4_btree_key		key[EXT4_BTREE_MAX_KEY_VAL_PAIRS];
>> +	struct ext4_btree_val		val[EXT4_BTREE_MAX_KEY_VAL_PAIRS];
> (Changing EXT4_BTREE_NODESIZE to depend on the FS blocksize makes it impossible
> to use arrays here, oh well...)

Hmm.. okay, probably a pointer here.
>> +};
>> +
>> +/* Record Node */
>> +struct ext4_btree_leaf_node {
>> +	struct ext4_btree_header	node_header;
>> +	struct ext4_btree_rec		rec[EXT4_BTREE_MAX_RECS];
>> +};
>> +
>> +struct ext4_btree_node {
>> +	struct ext4_btree_header	node_header;
>> +};
>> +
>> +struct ext4_btree_root {
>> +	struct ext4_btree_node *node;
>> +	struct ext4_btree_geo	geo;
>> +};
>> +
>> +#define ext4_btree_trace printf
> printk?
>
> --D

:) this was done in the user-space first. When the code finally all move to kernel will be something like printk

Thanks a lot , for your review and comments!:)

Mingming
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ