Message-ID: <4B754E5E.603@ge.com>
Date: Fri, 12 Feb 2010 13:49:34 +0100
From: Enrik Berkhan <Enrik.Berkhan@...com>
To: linux-ext4@...r.kernel.org
Subject: possible ext4 related deadlock

Hi,

currently we're experiencing some process hangs that seem to be
ext4-related. (Kernel 2.6.28.10-Blackfin, i.e. with Analog Devices
patches including some memory management changes for NOMMU.)

The situation is as follows:

We have two threads writing to an ext4 filesystem. After several hours,
and across about 20 systems, one hang occurs where (reconstructed from
Alt-SysRq-W output):

1. pdflush waits in start_this_handle
2. kjournald2 waits in jbd2_journal_commit_transaction
3. thread 1 waits in start_this_handle
4. thread 2 waits in
     ext4_da_write_begin
       (start_this_handle succeeded)
       grab_cache_page_write_begin
         __alloc_pages_internal
           try_to_free_pages
             do_try_to_free_pages
               congestion_wait

Actually, thread 2 shouldn't be completely blocked, because
congestion_wait has a timeout, if I understand the code correctly.
Unfortunately, I pressed Alt-SysRq-W only once when I had a chance to
reproduce the problem on a test system with console access.

When the system is in this state, some external event, like a telnet
login or killing a monitoring process in an older telnet session by
pressing Ctrl-C, makes it continue to work normally. I suspect that this
triggers some memory freeing, which allows thread 2 in the example above
to get some pages and continue running.

I had a look at all the recent ext4/jbd2 changes since about 2.6.28 but
couldn't identify anything that would solve this problem. But maybe I
just couldn't identify the right thing.

What I have noticed is that the order of start_this_handle and
grab_cache_page_write_begin has changed between ext3 and ext4:

ext3_write_begin:
        ...
        page = grab_cache_page_write_begin(mapping, index, flags);
        if (!page)
                return -ENOMEM;
        *pagep = page;

        handle = ext3_journal_start(inode, needed_blocks);
        ...

ext4_{da_}write_begin:
        ...
        handle = ext4_journal_start(inode, needed_blocks);
        if (IS_ERR(handle)) {
                ret = PTR_ERR(handle);
                goto out;
        }
        /* We cannot recurse into the filesystem as the transaction is already
         * started */
        flags |= AOP_FLAG_NOFS;

        page = grab_cache_page_write_begin(mapping, index, flags);
        ...

As I understand it, the change of order requires AOP_FLAG_NOFS in the
ext4 code. Might this be the reason for the deadlock? Would it be worth
trying to change the order back, or is there a very good reason for the
change between ext3 and ext4? Or am I looking in completely the wrong
place?

Any help would be appreciated.

Enrik
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
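
The ordering question raised in the message can be illustrated with a
small user-space toy model in C. This is only a sketch of one possible
reading of the hang, not kernel code and not a confirmed diagnosis: every
name in it (toy_state, toy_alloc_page, toy_write_begin_page_first, and so
on) is hypothetical, and it assumes that a commit has to wait for all open
handles and that an allocation made with AOP_FLAG_NOFS set cannot free
memory by writing back filesystem pages.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy user-space model; every name below is hypothetical, not a kernel API. */
struct toy_state {
    int free_pages;      /* pages available without reclaim */
    int dirty_fs_pages;  /* pages that only filesystem writeback can free */
    int open_handles;    /* journal handles currently held by writers */
};

enum toy_result {
    TOY_OK,       /* page obtained and handle started */
    TOY_NOMEM,    /* no page, but no handle is held either */
    TOY_STALLED   /* no page while a handle is held open */
};

/* Assumption: a commit can only proceed once every handle is closed. */
static bool toy_commit_can_proceed(const struct toy_state *s)
{
    return s->open_handles == 0;
}

static void toy_journal_start(struct toy_state *s)
{
    s->open_handles++;
}

static void toy_journal_stop(struct toy_state *s)
{
    assert(s->open_handles > 0);
    s->open_handles--;
}

/* Reclaim may write back filesystem pages only if the caller allows it. */
static bool toy_alloc_page(struct toy_state *s, bool may_enter_fs)
{
    if (s->free_pages == 0 && may_enter_fs && s->dirty_fs_pages > 0) {
        s->dirty_fs_pages--;
        s->free_pages++;
    }
    if (s->free_pages == 0)
        return false;   /* a real allocator would wait and retry here */
    s->free_pages--;
    return true;
}

/* ext3-like order: page first (reclaim may enter the fs), then the handle. */
static enum toy_result toy_write_begin_page_first(struct toy_state *s)
{
    if (!toy_alloc_page(s, true))
        return TOY_NOMEM;
    toy_journal_start(s);
    return TOY_OK;
}

/* ext4-like order: handle first, then a page with fs reclaim forbidden. */
static enum toy_result toy_write_begin_handle_first(struct toy_state *s)
{
    toy_journal_start(s);
    if (!toy_alloc_page(s, false))   /* models AOP_FLAG_NOFS */
        return TOY_STALLED;          /* the handle stays open while waiting */
    return TOY_OK;
}
```

Starting from free_pages = 0 and dirty_fs_pages = 1, the page-first order
returns TOY_OK, while the handle-first order returns TOY_STALLED with a
handle still open, so toy_commit_can_proceed() stays false until something
else frees a page. That would be consistent with an unrelated event
unblocking the system, but the model says nothing about whether this is
what actually happens in the kernel.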