Date:	Tue, 28 Sep 2010 00:01:42 -0400
From:	Ted Ts'o <tytso@....edu>
To:	Lukas Czerner <lczerner@...hat.com>
Cc:	linux-ext4@...r.kernel.org, rwheeler@...hat.com,
	sandeen@...hat.com, adilger@...ger.ca, snitzer@...il.com
Subject: Re: [PATCH 0/6 v4] Lazy itable initialization for Ext4

On Thu, Sep 16, 2010 at 02:47:25PM +0200, Lukas Czerner wrote:
> 
> as Mike suggested I have rebased the patch #1 against Jens'
> linux-2.6-block.git 'for-next' branch and changed sb_issue_zeroout()
> to cope with the new blkdev_issue_zeroout(), and changed
> sb_issue_zeroout() to the new syntax everywhere I am using it.
> Also, some typos were fixed.
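
For context on the quoted rebase: sb_issue_zeroout() is a thin
superblock-level wrapper around blkdev_issue_zeroout().  A minimal
sketch of such a wrapper is shown below, assuming the flag-less
blkdev_issue_zeroout() signature in Jens' for-next branch at the time;
this is an illustration, not the actual patch hunk:

	/* Sketch only -- would live in include/linux/blkdev.h.  Converts
	 * filesystem blocks to 512-byte sectors and forwards the zeroing
	 * request to the block layer. */
	static inline int sb_issue_zeroout(struct super_block *sb,
					   sector_t block, sector_t nr_blocks,
					   gfp_t gfp_mask)
	{
		return blkdev_issue_zeroout(sb->s_bdev,
				block << (sb->s_blocksize_bits - 9),
				nr_blocks << (sb->s_blocksize_bits - 9),
				gfp_mask);
	}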

We may have a problem with the lazy_itable patches.  I've tried
running xfstests three times now.  This was with a system where
mke2fs was set up (via /etc/mke2fs.conf) to always format the file
system using lazy_itable_init.  This meant that any of the xfstests
which reformatted the scratch partition and then started a stress
test would stress the newly added itable initialization code.
Unfortunately, the results weren't good.
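
For reference, the mke2fs.conf setup described above boils down to
enabling the lazy_itable_init boolean for ext4.  A rough sketch of such
a configuration follows; the feature list is illustrative and assumed,
not copied from the actual config file used for these runs:

	[fs_types]
		ext4 = {
			features = has_journal,extent,huge_file,flex_bg,uninit_bg,dir_nlink,extra_isize
			lazy_itable_init = 1
		}

The same effect can be had for a single format with
"mke2fs -t ext4 -E lazy_itable_init=1 <device>".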

The first time, I got the following hung task warning:

[ 2520.528745] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2520.531445]  ef2b8e44 00000046 00000007 e29c1500 e29c1500 e29c1760 e29c175c c0b55500
[ 2520.534983]  c0b55500 e29c175c c0b55500 c0b55500 c0b55500 32423426 00000224 00000000
[ 2520.538270]  00000224 e29c1500 00000001 ef205000 00000005 ef2b8e74 ef2b8e80 c026eb2c
[ 2520.541743] Call Trace:
[ 2520.542742]  [<c026eb2c>] jbd2_log_wait_commit+0x103/0x14f
[ 2520.544291]  [<c01711dc>] ? autoremove_wake_function+0x0/0x34
[ 2520.545816]  [<c026bf95>] jbd2_log_do_checkpoint+0x1a8/0x458
[ 2520.547431]  [<c026f4ed>] jbd2_journal_destroy+0x107/0x1d3
[ 2520.549602]  [<c01711dc>] ? autoremove_wake_function+0x0/0x34
[ 2520.551100]  [<c0252bef>] ext4_put_super+0x78/0x2f7
[ 2520.552798]  [<c01f3c3c>] generic_shutdown_super+0x47/0xb8
[ 2520.554692]  [<c01f3ccf>] kill_block_super+0x22/0x36
[ 2520.556470]  [<c01f3816>] deactivate_locked_super+0x22/0x3e
[ 2520.558372]  [<c01f3bf1>] deactivate_super+0x3d/0x41
[ 2520.560138]  [<c02057a9>] mntput_no_expire+0xb5/0xd8
[ 2520.561880]  [<c0206609>] sys_umount+0x273/0x298
[ 2520.563358]  [<c0206640>] sys_oldumount+0x12/0x14
[ 2520.564952]  [<c0646715>] syscall_call+0x7/0xb
[ 2520.566596] 3 locks held by umount/15126:
[ 2520.568121]  #0:  (&type->s_umount_key#20){++++..}, at: [<c01f3bea>] deactivate_super+0x36/0x41
[ 2520.571819]  #1:  (&type->s_lock_key#2){+.+...}, at: [<c01f3096>] lock_super+0x20/0x22
[ 2520.574788]  #2:  (&journal->j_checkpoint_mutex){+.+...}, at: [<c026f4e6>] jbd2_journal_destroy+0x100/0x1d3

In addition, there were these mysterious error messages:

[ 2542.026996] ata1: lost interrupt (Status 0x50)
[ 2542.029750] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 2542.032656] ata1.00: failed command: WRITE DMA
[ 2542.034312] ata1.00: cmd ca/00:10:00:00:00/00:00:00:00:00/e0 tag 0 dma 8192 out
[ 2542.034313]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 2542.039892] ata1.00: status: { DRDY }

Why are they strange?  Because this was running under KVM, and there
were no underlying hardware problems in the host OS.

The other two times I got a hard hang at xfstests 219 and 83, and the
system was caught in such a tight loop that magic-sysrq wasn't working
correctly.

I've run xfstests in this setup before applying these patches, and
things worked fine.  I'm currently rolling back the patches and trying
another xfstests run just to make sure the problem wasn't introduced
by some other patch, but for now, it looks like there might be a
problem somewhere.  And unfortunately, since it's not happening at a
consistent location or test, and the system is so badly locked up that
sysrq doesn't work, finding it may be interesting....

					- Ted