[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <a4423d670903251538p3a7c180br1bcfe08c4f044c15@mail.gmail.com>
Date: Thu, 26 Mar 2009 01:38:32 +0300
From: Alexander Beregalov <a.beregalov@...il.com>
To: Jan Kara <jack@...e.cz>
Cc: Theodore Tso <tytso@....edu>,
"linux-next@...r.kernel.org" <linux-next@...r.kernel.org>,
linux-ext4@...r.kernel.org, LKML <linux-kernel@...r.kernel.org>,
sparclinux@...r.kernel.org
Subject: Re: next-20090310: ext4 hangs
2009/3/25 Jan Kara <jack@...e.cz>:
> On Wed 25-03-09 20:07:46, Alexander Beregalov wrote:
>> 2009/3/25 Jan Kara <jack@...e.cz>:
>> > On Wed 25-03-09 18:29:10, Alexander Beregalov wrote:
>> >> 2009/3/25 Jan Kara <jack@...e.cz>:
>> >> > On Wed 25-03-09 18:18:43, Alexander Beregalov wrote:
>> >> >> 2009/3/25 Jan Kara <jack@...e.cz>:
>> >> >> >> > So, I think I need to try it on 2.6.29-rc7 again.
>> >> >> >> I've looked into this. Obviously, what's happenning is that we delete
>> >> >> >> an inode and jbd2_journal_release_jbd_inode() finds inode is just under
>> >> >> >> writeout in transaction commit and thus it waits. But it gets never woken
>> >> >> >> up and because it has a handle from the transaction, every one eventually
>> >> >> >> blocks on waiting for a transaction to finish.
>> >> >> >> But I don't really see how that can happen. The code is really
>> >> >> >> straightforward and everything happens under j_list_lock... Strange.
>> >> >> > BTW: Is the system SMP?
>> >> >> No, it is UP system.
>> >> > Even stranger. And do you have CONFIG_PREEMPT set?
>> >> >
>> >> >> The bug exists even in 2.6.29, I posted it with a new topic.
>> >> > OK, I've sort-of expected this.
>> >>
>> >> CONFIG_PREEMPT_RCU=y
>> >> CONFIG_PREEMPT_RCU_TRACE=y
>> >> # CONFIG_PREEMPT_NONE is not set
>> >> # CONFIG_PREEMPT_VOLUNTARY is not set
>> >> CONFIG_PREEMPT=y
>> >> CONFIG_DEBUG_PREEMPT=y
>> >> # CONFIG_PREEMPT_TRACER is not set
>> >>
>> >> config is attached.
>> > Thanks for the data. I still don't see how the wakeup can get lost. The
>> > process even cannot be preempted when we are in the section protected by
>> > j_list_lock... Can you send me a disassembly of functions
>> > jbd2_journal_release_jbd_inode() and journal_submit_data_buffers() so that
>> > I can see whether the compiler has not reordered something unexpectedly?
> Thanks for the disassembly...
>
>> By default gcc inlines journal_submit_data_buffers()
>> Here is -fno-inline version. Default version is in attach.
>> ====
>>
>> static int journal_submit_data_buffers(journal_t *journal,
>> transaction_t *commit_transaction)
>> {
>> 9c: 9d e3 bf 40 save %sp, -192, %sp
>> a0: 11 00 00 00 sethi %hi(0), %o0
>> struct jbd2_inode *jinode;
>> int err, ret = 0;
>> struct address_space *mapping;
>>
>> spin_lock(&journal->j_list_lock);
>> a4: a4 06 25 70 add %i0, 0x570, %l2
>> * our inode list. We use JI_COMMIT_RUNNING flag to protect inode we currently
>> * operate on from being released while we write out pages.
>> */
>> static int journal_submit_data_buffers(journal_t *journal,
>> transaction_t *commit_transaction)
>> {
>> a8: 90 12 20 00 mov %o0, %o0
>> ac: 40 00 00 00 call ac <journal_submit_data_buffers+0x10>
>> b0: b0 10 20 00 clr %i0
>> struct jbd2_inode *jinode;
>> int err, ret = 0;
>> struct address_space *mapping;
>>
>> spin_lock(&journal->j_list_lock);
>> list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
>> b4: a6 06 60 60 add %i1, 0x60, %l3
>> {
>> struct jbd2_inode *jinode;
>> int err, ret = 0;
>> struct address_space *mapping;
>>
>> spin_lock(&journal->j_list_lock);
>> b8: 40 00 00 00 call b8 <journal_submit_data_buffers+0x1c>
>> bc: 90 10 00 12 mov %l2, %o0
>> list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
>> c0: 10 68 00 1d b %xcc, 134 <journal_submit_data_buffers+0x98>
>> c4: c2 5e 60 60 ldx [ %i1 + 0x60 ], %g1
>> mapping = jinode->i_vfs_inode->i_mapping;
>> jinode->i_flags |= JI_COMMIT_RUNNING;
>> spin_unlock(&journal->j_list_lock);
>> c8: 90 10 00 12 mov %l2, %o0
>> struct address_space *mapping;
>>
>> spin_lock(&journal->j_list_lock);
>> list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
>> mapping = jinode->i_vfs_inode->i_mapping;
>> jinode->i_flags |= JI_COMMIT_RUNNING;
>> cc: c2 04 60 28 ld [ %l1 + 0x28 ], %g1
> Here we load jbd2_inode->i_flags into %g1.
>
>> int err, ret = 0;
>> struct address_space *mapping;
>>
>> spin_lock(&journal->j_list_lock);
>> list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
>> mapping = jinode->i_vfs_inode->i_mapping;
>> d0: e0 58 a1 e0 ldx [ %g2 + 0x1e0 ], %l0
>> jinode->i_flags |= JI_COMMIT_RUNNING;
>> d4: 82 10 60 01 or %g1, 1, %g1
> Here we set JI_COMMIT_RUNNING.
>
>> spin_unlock(&journal->j_list_lock);
>> d8: 40 00 00 00 call d8 <journal_submit_data_buffers+0x3c>
> Here we seem to call preempt_disable() (it would be useful if we could
> confirm that - easiest option I know is compiling JBD2 into a kernel but
> some object file trickery should be able to find it out as well...)
55bab0: 82 10 60 01 or %g1, 1, %g1
spin_unlock(&journal->j_list_lock);
55bab4: 40 06 4b 20 call 6ee734 <_spin_unlock>
55bab8: c2 24 e0 28 st %g1, [ %l3 + 0x28 ]
<..>
void __lockfunc _spin_unlock(spinlock_t *lock)
{
6ee734: 9d e3 bf 40 save %sp, -192, %sp
6ee738: 11 00 1f f9 sethi %hi(0x7fe400), %o0
6ee73c: 7f fb 36 59 call 5bc0a0 <_mcount>
6ee740: 90 12 22 40 or %o0, 0x240, %o0 ! 7fe640
<rt_trace_on+0xc8>
spin_release(&lock->dep_map, 1, _RET_IP_);
6ee744: 94 10 00 1f mov %i7, %o2
6ee748: 92 10 20 01 mov 1, %o1
6ee74c: 7f f6 21 18 call 476bac <lock_release>
6ee750: 90 06 20 18 add %i0, 0x18, %o0
_raw_spin_unlock(lock);
6ee754: 7f fb 7f 91 call 5ce598 <_raw_spin_unlock>
6ee758: 90 10 00 18 mov %i0, %o0
preempt_enable();
6ee75c: 40 00 05 fd call 6eff50 <sub_preempt_count>
6ee760: 90 10 20 01 mov 1, %o0
6ee764: c2 59 a0 08 ldx [ %g6 + 8 ], %g1
6ee768: 82 08 60 08 and %g1, 8, %g1
6ee76c: 02 c8 40 04 brz %g1, 6ee77c <_spin_unlock+0x48>
6ee770: 01 00 00 00 nop
6ee774: 7f ff f4 f1 call 6ebb38 <preempt_schedule>
6ee778: 01 00 00 00 nop
6ee77c: 81 cf e0 08 rett %i7 + 8
6ee780: 01 00 00 00 nop
}
>
>> dc: c2 24 60 28 st %g1, [ %l1 + 0x28 ]
> And here we store the register back to memory - but we could be already
> preempted here which could cause bugs...
>
>> * submit the inode data buffers. We use writepage
>> * instead of writepages. Because writepages can do
>> * block allocation with delalloc. We need to write
>> * only allocated blocks here.
>> */
>> err = journal_submit_inode_data_buffers(mapping);
>> e0: 7f ff ff d3 call 2c <journal_submit_inode_data_buffers>
>> e4: 90 10 00 10 mov %l0, %o0
>> if (!ret)
>> e8: 80 a6 20 00 cmp %i0, 0
>> ec: b1 64 40 08 move %icc, %o0, %i0
>> ret = err;
>> spin_lock(&journal->j_list_lock);
>> f0: 40 00 00 00 call f0 <journal_submit_data_buffers+0x54>
>> f4: 90 10 00 12 mov %l2, %o0
>> J_ASSERT(jinode->i_transaction == commit_transaction);
>> f8: c2 5c 40 00 ldx [ %l1 ], %g1
>> fc: 80 a0 40 19 cmp %g1, %i1
>> 100: 22 68 00 07 be,a %xcc, 11c
>> <journal_submit_data_buffers+0x80>
>> 104: c2 04 60 28 ld [ %l1 + 0x28 ], %g1
> Again, here we load jinode->i_flags.
>
>> 108: 11 00 00 00 sethi %hi(0), %o0
>> 10c: 92 10 21 04 mov 0x104, %o1
>> 110: 40 00 00 00 call 110 <journal_submit_data_buffers+0x74>
>> 114: 90 12 20 00 mov %o0, %o0
>> 118: 91 d0 20 05 ta 5
>> jinode->i_flags &= ~JI_COMMIT_RUNNING;
>> wake_up_bit(&jinode->i_flags, __JI_COMMIT_RUNNING);
>> 11c: 90 04 60 28 add %l1, 0x28, %o0
>> 120: 92 10 20 00 clr %o1
>> err = journal_submit_inode_data_buffers(mapping);
>> if (!ret)
>> ret = err;
>> spin_lock(&journal->j_list_lock);
>> J_ASSERT(jinode->i_transaction == commit_transaction);
>> jinode->i_flags &= ~JI_COMMIT_RUNNING;
>> 124: 82 08 7f fe and %g1, -2, %g1
> Here we go &= ~JI_COMMIT_RUNNING
>
>> wake_up_bit(&jinode->i_flags, __JI_COMMIT_RUNNING);
>> 128: 40 00 00 00 call 128 <journal_submit_data_buffers+0x8c>
>> 12c: c2 24 60 28 st %g1, [ %l1 + 0x28 ]
> And only here we store it back to memory...
spin_lock(&journal->j_list_lock);
55c104: 40 06 48 3c call 6ee1f4 <_spin_lock>
55c108: 90 10 00 12 mov %l2, %o0
jinode->i_flags &= ~JI_COMMIT_RUNNING;
55c10c: c2 04 20 28 ld [ %l0 + 0x28 ], %g1
wake_up_bit(&jinode->i_flags, __JI_COMMIT_RUNNING);
55c110: 90 04 20 28 add %l0, 0x28, %o0
55c114: 92 10 20 00 clr %o1
if (!ret)
ret = err;
}
spin_lock(&journal->j_list_lock);
jinode->i_flags &= ~JI_COMMIT_RUNNING;
55c118: 82 08 7f fe and %g1, -2, %g1
wake_up_bit(&jinode->i_flags, __JI_COMMIT_RUNNING);
55c11c: 7f fc 27 6e call 465ed4 <wake_up_bit>
55c120: c2 24 20 28 st %g1, [ %l0 + 0x28 ]
<..>
void __lockfunc _spin_lock(spinlock_t *lock)
{
6ee1f4: 9d e3 bf 30 save %sp, -208, %sp
6ee1f8: 11 00 1f f9 sethi %hi(0x7fe400), %o0
6ee1fc: 7f fb 37 a9 call 5bc0a0 <_mcount>
6ee200: 90 12 21 c8 or %o0, 0x1c8, %o0 ! 7fe5c8
<rt_trace_on+0x50>
preempt_disable();
6ee204: 40 00 07 83 call 6f0010 <add_preempt_count>
6ee208: 90 10 20 01 mov 1, %o0
spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
6ee20c: 92 10 20 00 clr %o1
6ee210: 90 06 20 18 add %i0, 0x18, %o0
6ee214: fe 73 a8 af stx %i7, [ %sp + 0x8af ]
6ee218: 94 10 20 00 clr %o2
6ee21c: 96 10 20 00 clr %o3
6ee220: 98 10 20 02 mov 2, %o4
6ee224: 7f f6 21 a5 call 4768b8 <lock_acquire>
6ee228: 9a 10 20 00 clr %o5
LOCK_CONTENDED(lock, _raw_spin_trylock, _raw_spin_lock);
6ee22c: 7f fb 81 33 call 5ce6f8 <_raw_spin_lock>
6ee230: 90 10 00 18 mov %i0, %o0
}
6ee234: 81 cf e0 08 rett %i7 + 8
6ee238: 01 00 00 00 nop
<..>
void wake_up_bit(void *word, int bit)
{
465ed4: 9d e3 bf 40 save %sp, -192, %sp
465ed8: 11 00 1f f2 sethi %hi(0x7fc800), %o0
465edc: 40 05 58 71 call 5bc0a0 <_mcount>
465ee0: 90 12 23 c8 or %o0, 0x3c8, %o0 ! 7fcbc8
<kthread_stop_lock+0x88>
__wake_up_bit(bit_waitqueue(word, bit), word, bit);
465ee4: 92 10 00 19 mov %i1, %o1
465ee8: 7f ff ff c5 call 465dfc <bit_waitqueue>
465eec: 90 10 00 18 mov %i0, %o0
465ef0: 92 10 00 18 mov %i0, %o1
465ef4: 7f ff ff e7 call 465e90 <__wake_up_bit>
465ef8: 94 10 00 19 mov %i1, %o2
}
465efc: 81 cf e0 08 rett %i7 + 8
465f00: 01 00 00 00 nop
>
>> struct jbd2_inode *jinode;
>> int err, ret = 0;
>> struct address_space *mapping;
>>
>> spin_lock(&journal->j_list_lock);
>> list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
>> 130: c2 5c 60 10 ldx [ %l1 + 0x10 ], %g1
>> 134: a2 00 7f f0 add %g1, -16, %l1
>> * prefetches into the prefetch-cache which only is accessible
>> * by floating point operations in UltraSPARC-III and later.
>> * By contrast, "#one_write" prefetches into the L2 cache
>> * in shared state.
>> */
>> __asm__ __volatile__("prefetch [%0], #one_write"
>> 138: c2 5c 60 10 ldx [ %l1 + 0x10 ], %g1
>> 13c: c7 68 40 00 prefetch [ %g1 ], #one_write
>> 140: 82 04 60 10 add %l1, 0x10, %g1
>> 144: 80 a4 c0 01 cmp %l3, %g1
>> 148: 32 6f ff e0 bne,a %xcc, c8
>> <journal_submit_data_buffers+0x2c>
>> 14c: c4 5c 60 20 ldx [ %l1 + 0x20 ], %g2
>> spin_lock(&journal->j_list_lock);
>> J_ASSERT(jinode->i_transaction == commit_transaction);
>> wake_up_bit(&jinode->i_flags, __JI_COMMIT_RUNNING);
>> }
>> spin_unlock(&journal->j_list_lock);
>> 150: 90 10 00 12 mov %l2, %o0
>> 154: 40 00 00 00 call 154 <journal_submit_data_buffers+0xb8>
>> 158: b1 3e 20 00 sra %i0, 0, %i0
>> return ret;
>> }
>> 15c: 81 cf e0 08 rett %i7 + 8
>> 160: 01 00 00 00 nop
> So the compiled code looks a bit suspitious to me. Having the disassembly
> with symbols properly resolved would help confirm it. I'm adding sparc list
> to CC just in case someone sees the problem...
>
> Honza
> --
> Jan Kara <jack@...e.cz>
> SUSE Labs, CR
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists