Message-id: <20080307235210.GM1881@webber.adilger.int>
Date:	Fri, 07 Mar 2008 16:52:10 -0700
From:	Andreas Dilger <adilger@....com>
To:	Jan Kara <jack@...e.cz>
Cc:	linux-ext4@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [RFC] JBD ordered mode rewrite

On Mar 06, 2008  18:42 +0100, Jan Kara wrote:
>   Below is my rewrite of ordered mode in JBD. Now we don't have a list of
> data buffers that need syncing on transaction commit but a list of inodes
> that need writeout during commit. This brings all sorts of advantages such
> as possibility to get rid of journal heads and buffer heads for data
> buffers in ordered mode, better ordering of writes on transaction commit,
> simplification of some JBD code, no more anonymous pages when truncate of
> data being committed happens. The patch has survived some light testing
> but it still has some potential of eating your data so beware :) I've run
> dbench to see whether we didn't decrease performance by different handling
> of truncate and the throughput I'm getting on my machine is the same (OK,
> is lower by 0.5%) if I disable the code in truncate waiting for commit to
> finish... Also the throughput of dbench is about 2% better with my patch
> than with current JBD.
>   Any comments or testing most welcome.

Looks like a very good patch - thanks for your effort in moving this
beyond the "hand-waving" stage that it's been in for the past few years.

I'm looking at what implications this has for delayed allocation in ext4,
because the vast majority of file data will be unmapped in that case
and a journal commit in ordered mode will no longer cause the data to
be flushed to disk.

I _think_ this is OK, because pdflush will now be totally in charge of
flushing the dirty pages to disk, instead of this previously being done
by ordered mode in the journal.  I know there have been some bugs in this
area in the past, but I guess it isn't much different than running in
writeback mode.  That said, I don't know how many users run in writeback
mode unless they are running a database, and the database is doing a lot
of explicit fsync of file data so there may still be bugs lurking...

Some care is still needed here because with ext4 delayed allocation the
common case will be unmapped buffers, while the ext3 implementation will
only have this in very rare cases (unmapped mmap writes), but it looks
like the right mechanism is already in place for both.

I'll just put a bunch of comments inline, not necessarily problems, just
walking through the code to ensure I understand what is going on...

> @@ -1675,7 +1675,8 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
>  			 */
>  			clear_buffer_dirty(bh);
>  			set_buffer_uptodate(bh);
> -		} else if (!buffer_mapped(bh) && buffer_dirty(bh)) {
> +		} else if (!buffer_mapped(bh) && buffer_dirty(bh)
> +			   && !wbc->skip_unmapped) {

(comment) This is skipping writeout of unmapped pages during journal commit -
good, otherwise we would deadlock trying to add block mappings...

> @@ -183,6 +184,8 @@ void ext3_delete_inode (struct inode * inode)
>  {
>  	handle_t *handle;
>  
> +	if (ext3_should_order_data(inode))
> +		ext3_begin_ordered_truncate(inode, 0);

(comment) This is flushing out pages allocated in the previous transaction
to keep consistent semantics in case a crash loses the delete...  There
is some possibility for optimization here - comments below at
journal_begin_ordered_truncate().

> @@ -1487,46 +1462,49 @@ static int ext3_ordered_writepage(struct page *page,
> +	if (!wbc->skip_unmapped) {
> +		handle = ext3_journal_start(inode,
> +				ext3_writepage_trans_blocks(inode));
> +		if (IS_ERR(handle)) {
> +			ret = PTR_ERR(handle);
> +			goto out_fail;
> +		}
>  	}

I guess the common case here for delalloc is that if someone cares to fsync
the file then that inode would be added to the journal list and the pages
would be mapped here in the process context and flushed with the journal
commit.  This is ideal because it avoids syncing out all of the dirty pages
for inodes that were NOT fsync'd and fixes the "one process doing fsync
kills performance for streaming writes" problem in ordered mode that was
recently discussed on the list.

We in fact win twice because fsync_page_range() will potentially only
map and flush the range of pages that the application cares about and
does not necessarily have to write out all pages (though that will still
happen in ordered mode if the pages had previously been mapped).

> +	else if (!PageMappedToDisk(page))
> +		goto out_fail;

(style) } else if (...) {
		goto out_fail;
	}
 
> +	/* This can go as soon as someone cares about that ;) */
>  	if (!page_has_buffers(page)) {
>  		create_empty_buffers(page, inode->i_sb->s_blocksize,
>  				(1 << BH_Dirty)|(1 << BH_Uptodate));
>  	}

Is there any reason to keep this in the non-data-journal case?  Adding
buffer heads to every page is a non-trivial amount of RAM usage.

> @@ -2989,7 +2973,14 @@ int ext3_write_inode(struct inode *inode, int wait)
>   * be freed, so we have a strong guarantee that no future commit will
>   * leave these blocks visible to the user.)
>   *
> - * Called with inode->sem down.
> + * Another thing we have to asure is that if we are in ordered mode

s/asure/assure/

> + * and inode is still attached to the committing transaction, we must
> + * we start writeout of all the dirty buffers which are being truncated.

s/buffers/pages/

> @@ -3032,6 +3023,13 @@ int ext3_setattr(struct dentry *dentry, struct iattr *attr)
>  	    attr->ia_valid & ATTR_SIZE && attr->ia_size < inode->i_size) {
>  		handle_t *handle;
>  
> +		if (ext3_should_order_data(inode)) {
> +			error = ext3_begin_ordered_truncate(inode,
> +					attr->ia_size);

(style) align "attr->ia_size" with previous '('.

> @@ -520,6 +520,8 @@ static void ext3_clear_inode(struct inode *inode)
>  	EXT3_I(inode)->i_block_alloc_info = NULL;
>  	if (unlikely(rsv))
>  		kfree(rsv);
> +	journal_release_jbd_inode(EXT3_SB(inode->i_sb)->s_journal,
> +		&EXT3_I(inode)->jinode);

(style) align with '(' on previous line.

>  /*
> - * When an ext3-ordered file is truncated, it is possible that many pages are
> - * not sucessfully freed, because they are attached to a committing transaction.
> + * When an ext3 file is truncated, it is possible that some pages are not
> + * sucessfully freed, because they are attached to a committing transaction.

s/sucessfully/successfully/

> -static int inverted_lock(journal_t *journal, struct buffer_head *bh)
> -{
> -	if (!jbd_trylock_bh_state(bh)) {
> -		spin_unlock(&journal->j_list_lock);
> -		schedule();
> -		return 0;
> -	}
> -	return 1;
> -}

Always nice to see unpleasant code like this be removed.

> +static int journal_submit_data_buffers(journal_t *journal,
> +		transaction_t *commit_transaction)
>  {
> +	/*
> +	 * We are in a committing transaction. Therefore no new inode
> +	 * can be added to our inode list. We use JI_COMMIT_RUNNING
> +	 * flag to protect inode we currently operate on from being
> +	 * released while we write out pages.
> +	 */

Should we J_ASSERT() this is true?  Probably better to put this comment
before the function instead of inside it.

> +	spin_lock(&journal->j_list_lock);
> +	list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
> +		mapping = jinode->i_vfs_inode->i_mapping;
> +		if (!mapping_cap_writeback_dirty(mapping))
> +			continue;
> +		wbc.nr_to_write = mapping->nrpages * 2;
> +		jinode->i_flags |= JI_COMMIT_RUNNING;
> +		spin_unlock(&journal->j_list_lock);
> +		err = do_writepages(jinode->i_vfs_inode->i_mapping, &wbc);
> +		if (!ret)
> +			ret = err;
> +		spin_lock(&journal->j_list_lock);
> +		jinode->i_flags &= ~JI_COMMIT_RUNNING;
> +		wake_up_bit(&jinode->i_flags, __JI_COMMIT_RUNNING);
>  	}
> +	spin_unlock(&journal->j_list_lock);

Hmm, this is one area I'm a bit worried about.  In the old code the number
of buffers to sync for a transaction are bounded because they can only be
added to the transaction while it is open.  With this new inode-list code
there could be a process busily adding new dirty + mapped pages to the
inode(s) in the running transaction, while we are blocked here writing
out the pages in kjournald trying to commit the previous transaction.

At some point we will eventually no longer be able to add new pages to the
running transaction (because it is full or was closed due to timeout) and
we won't be able to start a new transaction because the committing transaction
has not yet committed.  At that point we would be able to finish the commit
of the previous transaction, but not until we had reached the point of
blocking all operations in the filesystem waiting for the new transaction
to open.  This would essentially lead to livelock of the system, with
stop-start-stop cycles for the transactions and IO.

What would be needed is some kind of mechanism to separate the dirty pages
on the inode into "dirty from current transaction" and "dirty from the
previous transaction".  That seems non-trivial, however.  I guess we could
also force any process doing writing to flush out all of the dirty and
mapped pages on the inode if it detects the inode is in two transactions.

This would at least parallelize the IO submission and also throttle the
addition of new pages until the dirty ones on the committing transaction
had been flushed, but it may essentially mean synchronous IO performance
for streaming writes on large files without delalloc (with delalloc the
new pages would not normally be mapped, so they would not be affected).
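The second idea might look roughly like this in the write path (pure sketch, placement and locking hand-waved):

```c
/* Hypothetical throttle: the inode is attached to both the committing
 * and the running transaction, so make the writer flush its own dirty
 * pages instead of piling more work onto kjournald. */
if (jinode->i_transaction && jinode->i_next_transaction)
	filemap_fdatawrite(jinode->i_vfs_inode->i_mapping);
```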

> +static int journal_finish_data_buffers(journal_t *journal,
> +		transaction_t *commit_transaction)
>  {
> +	/* Now refile inode to proper lists */
> +	list_for_each_entry_safe(jinode, next_i,
> +			&commit_transaction->t_inode_list, i_list) {

(style) align with '(' on previous line

> +		list_del(&jinode->i_list);
> +		if (jinode->i_next_transaction) {
> +			jinode->i_transaction = jinode->i_next_transaction;
> +			jinode->i_next_transaction = NULL;
> +			list_add(&jinode->i_list,
> +				&jinode->i_transaction->t_inode_list);
>  		}
> +		else
> +			jinode->i_transaction = NULL;

(style)		} else {
			jinode->i_transaction = NULL;
		}

> @@ -655,7 +565,14 @@ start_journal_io:
>  	   so we incur less scheduling load.
>  	*/
>  
> -	jbd_debug(3, "JBD: commit phase 4\n");
> +	jbd_debug(3, "JBD: commit phase 3\n");

Rather than numbering these phases, which has little meaning and often
means they just get renumbered like here, it is probably more useful to
give them descriptive names like "JBD: commit wait for data buffers", etc.
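E.g. something like:

```c
jbd_debug(3, "JBD: commit phase: submit data buffers\n");
jbd_debug(3, "JBD: commit phase: wait for data buffers\n");
```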

> @@ -1504,10 +1327,10 @@ __blist_del_buffer(struct journal_head **list, struct journal_head *jh)
>   * Remove a buffer from the appropriate transaction list.
>   *
>   * Note that this function can *change* the value of
> - * bh->b_transaction->t_sync_datalist, t_buffers, t_forget,
> - * t_iobuf_list, t_shadow_list, t_log_list or t_reserved_list.  If the caller
> - * is holding onto a copy of one of thee pointers, it could go bad.
> - * Generally the caller needs to re-read the pointer from the transaction_t.
> + * bh->b_transaction->t_buffers, t_forget, t_iobuf_list, t_shadow_list,
> + * t_log_list or t_reserved_list.  If the caller is holding onto a copy of one
> + * of thee pointers, it could go bad.  Generally the caller needs to re-read

s/thee/these/

> +/*
> + * File inode in the inode list of the handle's transaction
> + */
> +int journal_file_inode(handle_t *handle, struct jbd_inode *jinode)
> +{
> +	transaction_t *transaction = handle->h_transaction;
> +	journal_t *journal = transaction->t_journal;
> +
> +	if (is_handle_aborted(handle))
> +		return 0;

Should we be returning an error back to the caller at this point?
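i.e. something like (sketch):

```c
if (is_handle_aborted(handle))
	return -EROFS;
```

though the callers of journal_file_inode() would then all need to check
the return value.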

> +	jbd_debug(4, "Adding inode %lu, tid:%d\n", jinode->i_vfs_inode->i_ino,
> +			transaction->t_tid);
> +
> +	spin_lock(&journal->j_list_lock);
> +
> +	if (jinode->i_transaction == transaction ||
> +	    jinode->i_next_transaction == transaction)
> +		goto done;

Is it possible/safe to do the above check outside of the j_list_lock
as an optimization?  We will be calling this function for every page
that is dirtied and it will likely be quite highly contended.  It used
to be that we HAD to get this lock because we were almost certainly
adding the new buffers to the transaction list, but in this updated
code the common case is that the inode will already be on the list...
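Something along these lines, perhaps (sketch only; the unlocked check has to be redone under the lock before touching the lists, and the lockless read would need some thought w.r.t. memory ordering):

```c
/* Unlocked fast path - the common case once the inode is filed. */
if (jinode->i_transaction == transaction ||
    jinode->i_next_transaction == transaction)
	return 0;

spin_lock(&journal->j_list_lock);
/* Re-check under the lock before modifying the transaction lists. */
if (jinode->i_transaction == transaction ||
    jinode->i_next_transaction == transaction)
	goto done;
```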

> +/*
> + * This function must be called when inode is journaled in ordered mode
> + * before truncation happens. It starts writeout of truncated part in
> + * case it is in the committing transaction so that we stand to ordered
> + * mode consistency guarantees.
> + */
> +int journal_begin_ordered_truncate(struct jbd_inode *inode, loff_t new_size)
> +{
> +	journal_t *journal;
> +	transaction_t *commit_trans;
> +	int ret = 0;
> +
> +	if (!inode->i_transaction && !inode->i_next_transaction)
> +		goto out;
> +	journal = inode->i_transaction->t_journal;
> +	spin_lock(&journal->j_state_lock);
> +	commit_trans = journal->j_committing_transaction;
> +	spin_unlock(&journal->j_state_lock);
> +	if (inode->i_transaction == commit_trans) {
> +		ret = __filemap_fdatawrite_range(inode->i_vfs_inode->i_mapping,
> +			new_size, LLONG_MAX, WB_SYNC_ALL);
> +		if (ret)
> +			journal_abort(journal, ret);
> +	}

Is there any way here to entirely avoid writing blocks which were just
allocated during the current transaction (e.g. temp files during compile
or whatever)?  One possibility is to store the transaction number when
the inode was created (or first dirtied?) in ext3_inode_info and if that is
equal to the committing transaction then we don't need to write anything
at all?
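Sketch, with a hypothetical i_dirtied_tid field in ext3_inode_info recorded when the inode is first dirtied (the check would presumably live in ext3 before calling into JBD):

```c
/* Hypothetical: all of this inode's blocks were allocated in the
 * committing transaction itself, so there is nothing from an earlier
 * commit that ordered mode needs to protect. */
if (EXT3_I(inode)->i_dirtied_tid == commit_trans->t_tid)
	goto out;
```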

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
