lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Thu, 14 May 2009 11:30:15 +0530
From:	"Aneesh Kumar K.V" <aneesh.kumar@...ux.vnet.ibm.com>
To:	Theodore Tso <tytso@....edu>
Cc:	Eric Sandeen <sandeen@...hat.com>, cmm@...ibm.com,
	linux-ext4@...r.kernel.org
Subject: Re: [PATCH 2/3] ext4: Clear the unwritten buffer_head flag properly

On Wed, May 13, 2009 at 06:28:47PM -0400, Theodore Tso wrote:
> On Wed, May 13, 2009 at 01:56:56PM -0500, Eric Sandeen wrote:
> > The comment:
> > 
> >  	/*
> > +	 * The above get_blocks can cause the buffer to be
> > +	 * marked unwritten. So clear the same.
> > +	 */
> > +	clear_buffer_unwritten(bh);
> > 
> > is imho not helpful; to me it says "we -just- set this, so clear it!"
> > It leaves me scratching my head.
> 
> I've updated it the comment to say this instead.
> 
> 	/*
> 	 * When we call get_blocks without the create flag, the
> 	 * BH_Unwritten flag could have gotten set if the blocks
> 	 * requested were part of a uninitialized extent.  We need to
> 	 * clear this flag now that we are committed to convert all or
> 	 * part of the uninitialized extent to be an initialized
> 	 * extent.  This is because we need to avoid the combination
> 	 * of BH_Unwritten and BH_Mapped flags being simultaneously
> 	 * set on the buffer_head.
> 	 */

The last line in the above comment is not a problem with the latest
kernel with all the patches in the patch queue. The patch that does that
is  "ext4: Mark the unwritten buffer_head as mapped during write_begin"

The unwritten and mapped state together was a problem with the code path
we had before in ext4_da_get_block_prep
we had:

ret = ext4_get_blocks(NULL, inode, iblock, 1,  bh_result, 0);
.....
..
} else if (ret > 0) {
	bh_result->b_size = (ret << inode->i_blkbits);
	if (buffer_unwritten(bh_result)) {
		bh_result.b_blocknr = ~0;
		....
	}
}

The above can result in 

1) We do a multi block delayed alloc to prealloc space. That would get
us multiple buffer_heads marked with BH_Unwritten. (say 10, 11, 12)
2) pdflush attempt to write some pages (say mapping block 10) which
cause
a get_block call with create = 1. That would attempt to convert
uninitialized extent to initialized one. This can cause multiple blocks
to be marked initialized. ( say 10, 11 , 12)
3) We do an overwrite of block 11. That would mean we call
ext4_da_get_block_prep, which would again do a get_block for block 11
with create = 0. But remember we already have buffer_head marked with
BH_Unwritten flag. But the buffer was unmapped because it is unwritten
( We are fixing this mess in the patch for 2.6.31).
4) The get_block call will find the buffer mapped due to step b. And
mark the buffer_head mapped. There we go . We end up with buffer_head
mapped and unwritten
5) later in ext4_da_get_block_prep we check whether the buffer_head in
marked
BH_Unwritten if so we set the block number to ~0. This is introduced by
[PATCH -V4 1/2] Fix sub-block zeroing for buffered writes into unwritten
extents
6) So now we have a buffer_head that is mapped, unwritten, with
b_blocknr = ~0. That would result in the I/O error.


Now that we have the actual block number in bh_result.b_blocknr I guess
we should be ok to have the unwritten flag set. But then i guess it is
against the VFS assumption of buffer_head state. The state unwritten
indicate the blocks are not yet written. So if we don't do a
clear_buffer_unwritten as in the patch we end up with a block that is
written ( done via create = 1 get_block call) and still marked
unwritten. (see the 6 step example above)

Now to explain what "ext4: Mark the unwritten buffer_head as mapped
during write_begin" does.

It marks the buffer_head as mapped in the write_begin phase.
And that helps in making sure we don't end up calling get_block
multiple times.  So with delayed allocation we now have between
the write_begin and writepage phase a buffer_head which is mapped
and unwritten for blocks in the fallocate space. Once we do
writepage for the block we will have buffer_head which is mapped and
unwritten flag cleared. The unwritten get cleared when we do a get_block
call with create = 1.


To achieve the above we need to make sure writepage code path looks at
the unwritten flag and does a get_blocks call with create = 1. With
mainline we have in writepage code path

mpage_da_map_blocks(...)

if ((mpd->b_state  & (1 << BH_Mapped)) &&
    !(mpd->b_state & (1 << BH_Delay)))
	return 0;
	...
	..
	ext4_da_get_block_write()


If we start marking buffer_head mapped for fallocate blocks in the
write_begin phase, then the above code will return without doing
any get_block(create = 1) call. That means we don't convert the
uninitialized extent to initialized one. So along with marking
buffer_head mapped and unwritten in the write_begin phase we also
need changes to writepage code path which does something like below

mpage_da_map_blocks(...)

if ((mpd->b_state  & (1 << BH_Mapped)) &&
	!(mpd->b_state & (1 << BH_Delay)) &&
	!(mpd->b_state & (1 << BH_Unwritten)))
	    return 0;
	...
	..
	ext4_da_get_block_write()

this is done in patch "ext4: Mark the unwritten buffer_head as mapped
during write_begin".

-aneesh
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ