lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Date:	Thu, 23 Oct 2014 11:54:55 -0600
From:	Chris Friesen <chris.friesen@...driver.com>
To:	Austin Schuh <austin@...oton-tech.com>, <pavel@...linux.ru>
CC:	"J. Bruce Fields" <bfields@...ldses.org>,
	<linux-ext4@...r.kernel.org>, <tytso@....edu>,
	<adilger.kernel@...ger.ca>,
	rt-users <linux-rt-users@...r.kernel.org>
Subject: RT/ext4/jbd2 circular dependency (was: Re: Hang writing to nfs-mounted
 filesystem from client)

On 10/17/2014 12:55 PM, Austin Schuh wrote:
> Use the 121 patch.  This sounds very similar to the issue that I helped
> debug with XFS.  There ended up being a deadlock due to a bug in the
> kernel work queues.  You can search the RT archives for more info.

I can confirm that the problem still shows up with the rt121 patch. (And
also with Paul Gortmaker's proposed 3.4.103-rt127 patch.)

We added some instrumentation and it looks like we've tracked down the problem.
Figuring out how to fix it is proving to be tricky.

Basically it looks like we have a circular dependency involving the
inode->i_data_sem rt_mutex, the PG_writeback bit, and the BJ_Shadow list.  It
goes something like this:

jbd2_journal_commit_transaction:
1) set page for writeback (set PG_writeback bit)
2) put jbd2 journal head on BJ_Shadow list
3) sleep on PG_writeback bit waiting for page writeback complete

ext4_da_writepages:
1) ext4_map_blocks() acquires inode->i_data_sem for writing
2) do_get_write_access() sleeps waiting for jbd2 journal head to come off
the BJ_Shadow list

At this point the flush code can't run because it can't acquire
inode->i_data_sem for reading, so the page will never get written out.
Deadlock.


The following is a more detailed timeline with information from added trace
events:

 nfsd-2012  [003] ....1..  8612.903541: ext4_map_blocks_down_write: dev 147,3 ino 2097161
 nfsd-2012  [003] ....1..  8612.903546: ext4_map_blocks_up_write: dev 147,3 ino 2097161
 nfsd-2012  [003] ....1..  8612.903559: ext4_map_blocks_down_write: dev 147,3 ino 2097161
 nfsd-2012  [003] ....1..  8612.903565: ext4_map_blocks_up_write: dev 147,3 ino 2097161
 nfsd-2012  [003] ....1..  8612.903611: ext4_map_blocks_down_write: dev 147,3 ino 2097161
 nfsd-2012  [003] ....1..  8612.903616: ext4_map_blocks_up_write: dev 147,3 ino 2097161
<...>-5960  [004] ....1..  8612.903628: jbd2_submit_inode_data: dev 147,3 ino 2097160
<...>-5960  [004] ....111  8612.903651: jbd2_list_add_bjshadow: adding jh ffff880415350000 to transaction ffff880415391180 BJ_Shadow list
<...>-5960  [004] ....111  8612.903653: jbd2_list_add_bjshadow: adding jh ffff8803eb08dbd0 to transaction ffff880415391180 BJ_Shadow list
<...>-5960  [004] ....111  8612.903655: jbd2_list_add_bjshadow: adding jh ffff8803eb08d150 to transaction ffff880415391180 BJ_Shadow list
<...>-5960  [004] ....111  8612.903656: jbd2_list_add_bjshadow: adding jh ffff8803eb08d0e0 to transaction ffff880415391180 BJ_Shadow list
<...>-5960  [004] ....111  8612.903657: jbd2_list_add_bjshadow: adding jh ffff88031c9449a0 to transaction ffff880415391180 BJ_Shadow list
 nfsd-2012  [003] ....1..  8612.903658: ext4_map_blocks_down_write: dev 147,3 ino 2097161
<...>-5960  [004] ....111  8612.903658: jbd2_list_add_bjshadow: adding jh ffff88031c944310 to transaction ffff880415391180 BJ_Shadow list
 nfsd-2012  [003] ....1..  8612.903665: ext4_map_blocks_up_write: dev 147,3 ino 2097161
<...>-5960  [004] ....1..  8612.903696: jbd2_finish_inode_data: dev 147,3 ino 2097161
 nfsd-2012  [003] ....1..  8612.903706: ext4_map_blocks_down_write: dev 147,3 ino 2097161
 nfsd-2012  [003] ....1..  8612.903714: ext4_map_blocks_up_write: dev 147,3 ino 2097161
 nfsd-2012  [003] ....1..  8612.903802: ext4_map_blocks_down_write: dev 147,3 ino 2097161
 nfsd-2012  [003] ....1..  8612.903814: ext4_map_blocks_up_write: dev 147,3 ino 2097161
 nfsd-2012  [003] ....1..  8612.903960: ext4_map_blocks_down_write: dev 147,3 ino 2097161
 nfsd-2012  [003] ....1..  8612.903983: ext4_map_blocks_up_write: dev 147,3 ino 2097161
 nfsd-2012  [003] ....1..  8612.904311: ext4_map_blocks_down_write: dev 147,3 ino 2097161
 nfsd-2012  [003] ....1..  8612.904318: ext4_map_blocks_up_write: dev 147,3 ino 2097161
 nfsd-2012  [003] ....1..  8612.904331: ext4_map_blocks_down_write: dev 147,3 ino 2097161
 nfsd-2012  [003] ....1..  8612.904337: ext4_map_blocks_up_write: dev 147,3 ino 2097161
 nfsd-2012  [003] ....1..  8612.904399: ext4_map_blocks_down_write: dev 147,3 ino 2097161
 nfsd-2012  [003] ....1..  8612.904408: ext4_map_blocks_up_write: dev 147,3 ino 2097161
 nfsd-2012  [003] ....1..  8612.904678: ext4_map_blocks_down_write: dev 147,3 ino 2097161
 nfsd-2012  [003] ....1..  8612.904772: ext4_map_blocks_up_write: dev 147,3 ino 2097161
<...>-2015  [007] ....1..  8612.934515: ext4_map_blocks_down_write: dev 147,3 ino 2097161
<...>-2015  [007] ....1..  8612.934525: jbd2_list_sleep_bjshadow: waiting for jh ffff8803eb08dbd0 from transaction ffff880415391180 to be removed from BJ_Shadow list



Timeline:

pid 5960 is [jbd2/drbd3-8]
pid 2015 is [nfsd]
pid 2012 is [nfsd]

pid 5960:
8612.903628: jbd2_submit_inode_data for inode 2097160.
	     This is right before calling journal_submit_inode_data_buffers(), which
             ends up calling set_page_writeback().
8612.903653: Add jh ffff8803eb08dbd0 to BJ_Shadow list.
	     This is in bd2_journal_write_metadata_buffer() right before calling
             __jbd2_journal_file_buffer().
8612.903696: This is in journal_finish_inode_data_buffers(), right before calling
             filemap_fdatawait() which ends up calling wait_on_page_bit(page,
             PG_writeback)
<we see no more logs for pid 5960 after this>


pid 2015:
8612.934515: takes write lock on inode->i_data_sem for inode 2097161
8612.934525: goes to sleep waiting for jh ffff8803eb08dbd0 to be removed from
             BJ_Shadow list
<we see no more logs for pid 2015 after this>

pid 2012:
8617.963896: hits 5-sec retry limit and stops the trace.  This means it blocked
             trying to get a read lock on inode->i_data_sem for inode 2097161
             at time 8612.963.


Chris
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ