linux-kernel - Re: next-20090310: ext4 hangs

Message-ID: <20090325194316.GQ23439@duck.suse.cz>
Date:	Wed, 25 Mar 2009 20:43:17 +0100
From:	Jan Kara <jack@...e.cz>
To:	Alexander Beregalov <a.beregalov@...il.com>
Cc:	Theodore Tso <tytso@....edu>,
	"linux-next@...r.kernel.org" <linux-next@...r.kernel.org>,
	linux-ext4@...r.kernel.org, LKML <linux-kernel@...r.kernel.org>,
	sparclinux@...r.kernel.org
Subject: Re: next-20090310: ext4 hangs

On Wed 25-03-09 20:07:46, Alexander Beregalov wrote:
> 2009/3/25 Jan Kara <jack@...e.cz>:
> > On Wed 25-03-09 18:29:10, Alexander Beregalov wrote:
> >> 2009/3/25 Jan Kara <jack@...e.cz>:
> >> > On Wed 25-03-09 18:18:43, Alexander Beregalov wrote:
> >> >> 2009/3/25 Jan Kara <jack@...e.cz>:
> >> >> >> > So, I think I need to try it on 2.6.29-rc7 again.
> >> >> >>   I've looked into this. Obviously, what's happening is that we delete
> >> >> >> an inode and jbd2_journal_release_jbd_inode() finds the inode is currently
> >> >> >> under writeout in transaction commit and thus it waits. But it never gets
> >> >> >> woken up and because it has a handle from the transaction, everyone
> >> >> >> eventually blocks waiting for a transaction to finish.
> >> >> >>   But I don't really see how that can happen. The code is really
> >> >> >> straightforward and everything happens under j_list_lock... Strange.
> >> >> >  BTW: Is the system SMP?
> >> >> No, it is UP system.
> >> >  Even stranger. And do you have CONFIG_PREEMPT set?
> >> >
> >> >> The bug exists even in 2.6.29, I posted it with a new topic.
> >> >  OK, I sort of expected this.
> >>
> >> CONFIG_PREEMPT_RCU=y
> >> CONFIG_PREEMPT_RCU_TRACE=y
> >> # CONFIG_PREEMPT_NONE is not set
> >> # CONFIG_PREEMPT_VOLUNTARY is not set
> >> CONFIG_PREEMPT=y
> >> CONFIG_DEBUG_PREEMPT=y
> >> # CONFIG_PREEMPT_TRACER is not set
> >>
> >> config is attached.
> >  Thanks for the data. I still don't see how the wakeup can get lost. The
> > process cannot even be preempted while we are in the section protected by
> > j_list_lock... Can you send me a disassembly of functions
> > jbd2_journal_release_jbd_inode() and journal_submit_data_buffers() so that
> > I can see whether the compiler has reordered something unexpectedly?
  Thanks for the disassembly...

> By default gcc inlines journal_submit_data_buffers()
> Here is -fno-inline version. Default version is in attach.
> ====
> 
> static int journal_submit_data_buffers(journal_t *journal,
>                 transaction_t *commit_transaction)
> {
>       9c:       9d e3 bf 40     save  %sp, -192, %sp
>       a0:       11 00 00 00     sethi  %hi(0), %o0
>         struct jbd2_inode *jinode;
>         int err, ret = 0;
>         struct address_space *mapping;
> 
>         spin_lock(&journal->j_list_lock);
>       a4:       a4 06 25 70     add  %i0, 0x570, %l2
>  * our inode list. We use JI_COMMIT_RUNNING flag to protect inode we currently
>  * operate on from being released while we write out pages.
>  */
> static int journal_submit_data_buffers(journal_t *journal,
>                 transaction_t *commit_transaction)
> {
>       a8:       90 12 20 00     mov  %o0, %o0
>       ac:       40 00 00 00     call  ac <journal_submit_data_buffers+0x10>
>       b0:       b0 10 20 00     clr  %i0
>         struct jbd2_inode *jinode;
>         int err, ret = 0;
>         struct address_space *mapping;
> 
>         spin_lock(&journal->j_list_lock);
>         list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
>       b4:       a6 06 60 60     add  %i1, 0x60, %l3
> {
>         struct jbd2_inode *jinode;
>         int err, ret = 0;
>         struct address_space *mapping;
> 
>         spin_lock(&journal->j_list_lock);
>       b8:       40 00 00 00     call  b8 <journal_submit_data_buffers+0x1c>
>       bc:       90 10 00 12     mov  %l2, %o0
>         list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
>       c0:       10 68 00 1d     b  %xcc, 134 <journal_submit_data_buffers+0x98>
>       c4:       c2 5e 60 60     ldx  [ %i1 + 0x60 ], %g1
>                 mapping = jinode->i_vfs_inode->i_mapping;
>                 jinode->i_flags |= JI_COMMIT_RUNNING;
>                 spin_unlock(&journal->j_list_lock);
>       c8:       90 10 00 12     mov  %l2, %o0
>         struct address_space *mapping;
> 
>         spin_lock(&journal->j_list_lock);
>         list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
>                 mapping = jinode->i_vfs_inode->i_mapping;
>                 jinode->i_flags |= JI_COMMIT_RUNNING;
>       cc:       c2 04 60 28     ld  [ %l1 + 0x28 ], %g1
  Here we load jbd2_inode->i_flags into %g1.

>         int err, ret = 0;
>         struct address_space *mapping;
> 
>         spin_lock(&journal->j_list_lock);
>         list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
>                 mapping = jinode->i_vfs_inode->i_mapping;
>       d0:       e0 58 a1 e0     ldx  [ %g2 + 0x1e0 ], %l0
>                 jinode->i_flags |= JI_COMMIT_RUNNING;
>       d4:       82 10 60 01     or  %g1, 1, %g1
  Here we set JI_COMMIT_RUNNING.

>                 spin_unlock(&journal->j_list_lock);
>       d8:       40 00 00 00     call  d8 <journal_submit_data_buffers+0x3c>
  Here we seem to call preempt_disable() (it would be useful if we could
confirm that - the easiest option I know is compiling JBD2 into the kernel, but
some object file trickery should be able to find it out as well...)

>       dc:       c2 24 60 28     st  %g1, [ %l1 + 0x28 ]
  And here we store the register back to memory - but we could already have
been preempted by this point, which could cause bugs...
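
  To make the concern concrete, here is a minimal user-space sketch in C
(not kernel code) that replays the load / or / unlock / store sequence one
step at a time. struct fake_jinode, observer_sees_running() and
flag_visible_in_window() are hypothetical names made up for illustration;
the value 1 for JI_COMMIT_RUNNING is taken from the "or %g1, 1, %g1"
above. It assumes that another task really can run between the unlock and
the store, which is exactly the part that still needs confirming.

```c
#define JI_COMMIT_RUNNING 1u	/* bit 0, as in "or %g1, 1, %g1" */

/* Hypothetical stand-in for struct jbd2_inode; only the flag word matters. */
struct fake_jinode {
	unsigned int i_flags;
};

/* What another task would see if it looked at the flags right now. */
static int observer_sees_running(const struct fake_jinode *j)
{
	return (j->i_flags & JI_COMMIT_RUNNING) != 0;
}

/*
 * Replays "ld / or / <unlock> / st" and returns what an observer running
 * in the window between the unlock and the store would see.
 */
static int flag_visible_in_window(struct fake_jinode *j)
{
	unsigned int reg = j->i_flags;	/* ld  [i_flags], %g1 */
	int seen;

	reg |= JI_COMMIT_RUNNING;	/* or  %g1, 1, %g1 */
	/* spin_unlock() would sit here: preemption becomes possible. */
	seen = observer_sees_running(j);
	j->i_flags = reg;		/* st  %g1, [i_flags] */
	return seen;
}
```

  Called on a zeroed struct fake_jinode, flag_visible_in_window() returns 0
(the observer does not see the flag yet), while observer_sees_running()
returns 1 once the store has landed.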

>                  * submit the inode data buffers. We use writepage
>                  * instead of writepages. Because writepages can do
>                  * block allocation  with delalloc. We need to write
>                  * only allocated blocks here.
>                  */
>                 err = journal_submit_inode_data_buffers(mapping);
>       e0:       7f ff ff d3     call  2c <journal_submit_inode_data_buffers>
>       e4:       90 10 00 10     mov  %l0, %o0
>                 if (!ret)
>       e8:       80 a6 20 00     cmp  %i0, 0
>       ec:       b1 64 40 08     move  %icc, %o0, %i0
>                         ret = err;
>                 spin_lock(&journal->j_list_lock);
>       f0:       40 00 00 00     call  f0 <journal_submit_data_buffers+0x54>
>       f4:       90 10 00 12     mov  %l2, %o0
>                 J_ASSERT(jinode->i_transaction == commit_transaction);
>       f8:       c2 5c 40 00     ldx  [ %l1 ], %g1
>       fc:       80 a0 40 19     cmp  %g1, %i1
>      100:       22 68 00 07     be,a   %xcc, 11c <journal_submit_data_buffers+0x80>
>      104:       c2 04 60 28     ld  [ %l1 + 0x28 ], %g1
  Again, here we load jinode->i_flags.

>      108:       11 00 00 00     sethi  %hi(0), %o0
>      10c:       92 10 21 04     mov  0x104, %o1
>      110:       40 00 00 00     call  110 <journal_submit_data_buffers+0x74>
>      114:       90 12 20 00     mov  %o0, %o0
>      118:       91 d0 20 05     ta  5
>                 jinode->i_flags &= ~JI_COMMIT_RUNNING;
>                 wake_up_bit(&jinode->i_flags, __JI_COMMIT_RUNNING);
>      11c:       90 04 60 28     add  %l1, 0x28, %o0
>      120:       92 10 20 00     clr  %o1
>                 err = journal_submit_inode_data_buffers(mapping);
>                 if (!ret)
>                         ret = err;
>                 spin_lock(&journal->j_list_lock);
>                 J_ASSERT(jinode->i_transaction == commit_transaction);
>                 jinode->i_flags &= ~JI_COMMIT_RUNNING;
>      124:       82 08 7f fe     and  %g1, -2, %g1
  Here we do &= ~JI_COMMIT_RUNNING.

>                 wake_up_bit(&jinode->i_flags, __JI_COMMIT_RUNNING);
>      128:       40 00 00 00     call  128 <journal_submit_data_buffers+0x8c>
>      12c:       c2 24 60 28     st  %g1, [ %l1 + 0x28 ]
  And only here do we store it back to memory...
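
  A small user-space sketch in C of why the ordering of this store and the
wakeup would matter. It is a model under stated assumptions, not the
kernel's implementation: struct fake_waiter, fake_wake_up_bit() and
waiter_stuck() are hypothetical, and the waiter is assumed to recheck the
bit after every wakeup and go back to sleep if it is still set. Whether the
store really lands after the wakeup is itself an assumption here: on SPARC
the instruction following a call sits in the call's delay slot and executes
before control reaches the call target, so this needs checking against the
resolved disassembly.

```c
#include <stdbool.h>

#define JI_COMMIT_RUNNING 1u	/* bit 0, cleared by "and %g1, -2, %g1" */

/* Hypothetical waiter sleeping until JI_COMMIT_RUNNING is clear. */
struct fake_waiter {
	bool asleep;
};

/*
 * Models a wakeup on the bit: the waiter rechecks the flag word and only
 * stays awake if the bit is already clear in memory.
 */
static void fake_wake_up_bit(const unsigned int *flags, struct fake_waiter *w)
{
	if (w->asleep && !(*flags & JI_COMMIT_RUNNING))
		w->asleep = false;
	/* Otherwise the waiter sees the bit still set and sleeps again. */
}

/* Returns true if the waiter is still asleep once both steps are done. */
static bool waiter_stuck(bool store_before_wake)
{
	unsigned int i_flags = JI_COMMIT_RUNNING;
	struct fake_waiter w = { .asleep = true };
	unsigned int reg = i_flags & ~JI_COMMIT_RUNNING; /* value in %g1 */

	if (store_before_wake) {
		i_flags = reg;			/* st, then wakeup */
		fake_wake_up_bit(&i_flags, &w);
	} else {
		fake_wake_up_bit(&i_flags, &w);	/* wakeup, then st */
		i_flags = reg;
	}
	return w.asleep;
}
```

  In this model waiter_stuck(true) returns false and waiter_stuck(false)
returns true: if the wakeup is issued while the bit is still set in memory,
nobody wakes the waiter again after the store, which would match the
observed hang.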

>         struct jbd2_inode *jinode;
>         int err, ret = 0;
>         struct address_space *mapping;
> 
>         spin_lock(&journal->j_list_lock);
>         list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
>      130:       c2 5c 60 10     ldx  [ %l1 + 0x10 ], %g1
>      134:       a2 00 7f f0     add  %g1, -16, %l1
>          * prefetches into the prefetch-cache which only is accessible
>          * by floating point operations in UltraSPARC-III and later.
>          * By contrast, "#one_write" prefetches into the L2 cache
>          * in shared state.
>          */
>         __asm__ __volatile__("prefetch [%0], #one_write"
>      138:       c2 5c 60 10     ldx  [ %l1 + 0x10 ], %g1
>      13c:       c7 68 40 00     prefetch  [ %g1 ], #one_write
>      140:       82 04 60 10     add  %l1, 0x10, %g1
>      144:       80 a4 c0 01     cmp  %l3, %g1
>      148:       32 6f ff e0     bne,a   %xcc, c8 <journal_submit_data_buffers+0x2c>
>      14c:       c4 5c 60 20     ldx  [ %l1 + 0x20 ], %g2
>                 spin_lock(&journal->j_list_lock);
>                 J_ASSERT(jinode->i_transaction == commit_transaction);
>                 wake_up_bit(&jinode->i_flags, __JI_COMMIT_RUNNING);
>         }
>         spin_unlock(&journal->j_list_lock);
>      150:       90 10 00 12     mov  %l2, %o0
>      154:       40 00 00 00     call  154 <journal_submit_data_buffers+0xb8>
>      158:       b1 3e 20 00     sra  %i0, 0, %i0
>         return ret;
> }
>      15c:       81 cf e0 08     rett  %i7 + 8
>      160:       01 00 00 00     nop
  So the compiled code looks a bit suspicious to me. Having the disassembly
with symbols properly resolved would help confirm it. I'm adding the sparc
list to CC just in case someone sees the problem...

									Honza
-- 
Jan Kara <jack@...e.cz>
SUSE Labs, CR