linux-ext4 - Re: RT/ext4/jbd2 circular dependency

Message-ID: <544E7144.4080809@windriver.com>
Date:	Mon, 27 Oct 2014 10:22:28 -0600
From:	Chris Friesen <chris.friesen@...driver.com>
To:	Thomas Gleixner <tglx@...utronix.de>
CC:	Austin Schuh <austin@...oton-tech.com>, <pavel@...linux.ru>,
	"J. Bruce Fields" <bfields@...ldses.org>,
	<linux-ext4@...r.kernel.org>, <tytso@....edu>,
	<adilger.kernel@...ger.ca>,
	rt-users <linux-rt-users@...r.kernel.org>
Subject: Re: RT/ext4/jbd2 circular dependency

On 10/26/2014 08:25 AM, Thomas Gleixner wrote:
> On Thu, 23 Oct 2014, Chris Friesen wrote:
>> On 10/17/2014 12:55 PM, Austin Schuh wrote:
>>> Use the 121 patch.  This sounds very similar to the issue that I helped
>>> debug with XFS.  There ended up being a deadlock due to a bug in the
>>> kernel work queues.  You can search the RT archives for more info.
>>
>> I can confirm that the problem still shows up with the rt121 patch. (And
>> also with Paul Gortmaker's proposed 3.4.103-rt127 patch.)
>
>> We added some instrumentation and it looks like we've tracked down the problem.
>> Figuring out how to fix it is proving to be tricky.
>>
>> Basically it looks like we have a circular dependency involving the
>> inode->i_data_sem rt_mutex, the PG_writeback bit, and the BJ_Shadow list.  It
>> goes something like this:
>>
>> jbd2_journal_commit_transaction:
>> 1) set page for writeback (set PG_writeback bit)
>> 2) put jbd2 journal head on BJ_Shadow list
>> 3) sleep on PG_writeback bit waiting for page writeback complete
>>
>> ext4_da_writepages:
>> 1) ext4_map_blocks() acquires inode->i_data_sem for writing
>> 2) do_get_write_access() sleeps waiting for jbd2 journal head to come off
>> the BJ_Shadow list
>>
>> At this point the flush code can't run because it can't acquire
>> inode->i_data_sem for reading, so the page will never get written out.
>> Deadlock.
>
> Sorry, I really cannot map that sparse description to any code
> flow. Proper callchains for the involved parts might help to actually
> understand what you are looking for.

There are details (stack traces, etc.) in the first message in the thread:
http://www.spinics.net/lists/linux-rt-users/msg12261.html


Originally we had thought that nfsd might have been implicated somehow, 
but it seems like it was just a trigger (possibly by increasing the rate 
of sync I/O).

In the interest of full disclosure I should point out that we're using a 
modified kernel so there is a chance that we have introduced the problem 
ourselves.  That said, we have not made significant changes to either 
ext4 or jbd2.  (Just a couple of minor cherry-picked bugfixes.)


The relevant code paths are:

Journal commit.  The important thing here is that we set the
PG_writeback bit on a page, put the jbd2 journal head on the BJ_Shadow
list, then sleep waiting for page writeback to complete.  If the page
writeback never completes, then the journal head never comes off the
BJ_Shadow list.


jbd2_journal_commit_transaction
     journal_submit_data_buffers
         journal_submit_inode_data_buffers
             generic_writepages
                 set_page_writeback(page) [PG_writeback]
     jbd2_journal_write_metadata_buffer
         __jbd2_journal_file_buffer(jh_in, transaction, BJ_Shadow);

     journal_finish_inode_data_buffers
         filemap_fdatawait
             filemap_fdatawait_range
                 wait_on_page_writeback(page)
                     wait_on_page_bit(page, PG_writeback) <--stuck here
     jbd2_journal_unfile_buffer(journal, jh) [delete from BJ_Shadow list]



We can get to the code path below a couple of different ways (see
further down).  The important stuff here is:
1) There is a code path that takes i_data_sem for writing and then goes
to sleep waiting for the jbd2 journal head to be removed from the
BJ_Shadow list.  If the journal head never comes off the list, the sema
will never be released.
2) ext4_map_blocks() always takes a read lock on i_data_sem.  If the
sema is held by someone waiting for the journal head to come off the
list, it will block.

ext4_da_writepages
     write_cache_pages_da
         mpage_da_map_and_submit
             ext4_map_blocks
                 down_read((&EXT4_I(inode)->i_data_sem))
                 up_read((&EXT4_I(inode)->i_data_sem))
                 down_write((&EXT4_I(inode)->i_data_sem))
                 ext4_ext_map_blocks
                     ext4_mb_new_blocks
                         ext4_mb_mark_diskspace_used
                             __ext4_journal_get_write_access
                                 jbd2_journal_get_write_access
                                     do_get_write_access
                                         wait on BJ_Shadow list
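
To make points 1) and 2) concrete, here is a minimal userspace sketch of
the reader/writer behaviour being relied on.  It is a toy model only:
struct toy_rwsem and the toy_* helpers are hypothetical names, not the
kernel's rw_semaphore API and not the rt_mutex-based -rt implementation.
"Would block" is modelled as a false return instead of actually sleeping.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a reader/writer semaphore (hypothetical, userspace only). */
struct toy_rwsem {
    int readers;   /* number of current read holders */
    bool writer;   /* true while a write holder exists */
};

/* Returns true if the read lock was taken, false if the caller would block. */
bool toy_try_down_read(struct toy_rwsem *s)
{
    if (s->writer)
        return false;
    s->readers++;
    return true;
}

/* Returns true if the write lock was taken, false if the caller would block. */
bool toy_try_down_write(struct toy_rwsem *s)
{
    if (s->writer || s->readers > 0)
        return false;
    s->writer = true;
    return true;
}

void toy_up_read(struct toy_rwsem *s)
{
    assert(s->readers > 0);
    s->readers--;
}

void toy_up_write(struct toy_rwsem *s)
{
    assert(s->writer);
    s->writer = false;
}

/*
 * Replays the sequence from the call chain above: one task does the
 * read-locked pass, drops it, takes the write lock and then (in the real
 * code) sleeps in do_get_write_access() while still holding it.  A second
 * caller of ext4_map_blocks() then tries the read lock.
 * Returns true if that second caller would block.
 */
bool flusher_would_block(void)
{
    struct toy_rwsem i_data_sem = { 0, false };

    if (!toy_try_down_read(&i_data_sem))   /* down_read() */
        return false;
    toy_up_read(&i_data_sem);              /* up_read() */
    if (!toy_try_down_write(&i_data_sem))  /* down_write() */
        return false;
    /* write holder is now parked waiting on the BJ_Shadow list */

    return !toy_try_down_read(&i_data_sem); /* second caller's down_read() */
}
```

In this model flusher_would_block() returns true: as long as the write
holder never reaches toy_up_write(), no reader can get in.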



One of the ways we end up at ext4_da_writepages() is via the page 
writeback thread.  If i_data_sem is already held by someone that is 
sleeping, this can result in pages not getting written out.

bdi_writeback_thread
     wb_do_writeback
         wb_check_old_data_flush
             wb_writeback
                 __writeback_inodes_wb
                     writeback_sb_inodes
                         writeback_single_inode
                             do_writepages
                                 ext4_da_writepages


Another way to end up at ext4_da_writepages() is via sync writev() 
calls.  In the traces from my original report this ended up taking the 
sema and then going to sleep waiting for the journal head to get removed 
from the BJ_Shadow list.

sys_writev
     vfs_writev
         do_readv_writev
             do_sync_readv_writev
                 ext4_file_write
                     generic_file_aio_write
                         generic_write_sync
                             ext4_sync_file
                                 filemap_write_and_wait_range
                                      __filemap_fdatawrite_range
                                          do_writepages
                                              ext4_da_writepages
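
Putting the pieces together, the dependency can be written down as a
small wait-for graph and checked for a cycle.  The sketch below is only
an illustration of the reasoning in this mail: the node names, the
reported_waits table and wait_cycle_length() are made up for the example,
and the edges encode my reading of the traces, not anything the kernel
computes.

```c
#include <assert.h>
#include <stdbool.h>

#define NO_WAIT (-1)   /* node is not waiting on anything */

enum node {
    JBD2_COMMIT,     /* jbd2_journal_commit_transaction() */
    PAGE_WRITEBACK,  /* PG_writeback bit getting cleared on the page */
    FLUSHER,         /* ext4_da_writepages() from the writeback thread */
    I_DATA_SEM,      /* inode->i_data_sem */
    SYNC_WRITER,     /* sys_writev() path holding i_data_sem for writing */
    BJ_SHADOW,       /* journal head coming off the BJ_Shadow list */
    NODE_COUNT
};

/* reported_waits[x] is the one thing x needs before it can make progress. */
const int reported_waits[NODE_COUNT] = {
    [JBD2_COMMIT]    = PAGE_WRITEBACK, /* wait_on_page_writeback() */
    [PAGE_WRITEBACK] = FLUSHER,        /* page must actually be written out */
    [FLUSHER]        = I_DATA_SEM,     /* down_read() in ext4_map_blocks() */
    [I_DATA_SEM]     = SYNC_WRITER,    /* held for writing by this task */
    [SYNC_WRITER]    = BJ_SHADOW,      /* do_get_write_access() sleeps */
    [BJ_SHADOW]      = JBD2_COMMIT,    /* jbd2_journal_unfile_buffer() */
};

/*
 * Follows the single outgoing edge from each node, starting at `start`.
 * Returns the length of the cycle through `start`, or 0 if following the
 * edges never leads back to `start`.
 */
int wait_cycle_length(const int *waits_on, int count, int start)
{
    int cur = start;

    assert(start >= 0 && start < count);
    for (int steps = 1; steps <= count; steps++) {
        cur = waits_on[cur];
        if (cur == NO_WAIT)
            return 0;       /* chain ends: someone can make progress */
        if (cur == start)
            return steps;   /* came back around: circular wait */
    }
    return 0;               /* ran into a cycle that excludes `start` */
}
```

With the table above, wait_cycle_length(reported_waits, NODE_COUNT,
JBD2_COMMIT) walks all six nodes and comes back to the commit thread.
Setting any one entry to NO_WAIT (for example, if the writer did not have
to sleep on BJ_Shadow while holding the sema) breaks the cycle and the
function returns 0.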


Chris