lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [day] [month] [year] [list]
Message-Id: <1158689399.32184.17.camel@localhost>
Date:	Tue, 19 Sep 2006 15:09:59 -0300
From:	Rudá Moura <ruda.moura@...ra.com.br>
To:	linux-kernel <linux-kernel@...r.kernel.org>
Cc:	Alexander Viro <viro@...celfarce.linux.theplanet.co.uk>
Subject: Processes stuck in D trying to acquire lock at vfs_rename_dir

We are huge mail provider with a pool of many mx, pop and imap machines
and we are facing a rather strange situation when we recently
upgraded one of our applications, and after a week running, several
processes hang in "D" (disk wait) state forever. That way, we cannot
strace, gdb, pstack them to know what they were doing or where they
were.

On the top of those machines are running RHEL version 4 with
kernel 2.6.9-42.ELsmp. Some are running (stock) kernel 2.6.17.11 
with KDB patch applied. The hardware is described as follow:

- Dell PowerEdge 6850, QUAD 3,2GHZ HT , 8GB RAM,
- 5 SCSI disks 146 GB 15k RPM.

We use NFS to keep mailboxes in maildir format. They are shared to many
machines and are provided by a NFS server storage.

In order to gain more understanding of the problem
we decide to enable mutex/futex debug (CONFIG_DEBUG_MUTEXES=y),
because the bug is hard to reproduce and happens in an occasional manner
 for a week or more.
 
The process locked was "ttrlmtp_tcp", they stuck in D while are trying
to rename() as the log message states:

lmtplog.6.gz:Sep 18 19:00:02 mangoro trrlmtpd_tcp[12499]:
1158616802.840213: TrrMailMdirMoveMBToIdPermMB: trying to rename
[/nfs/mail3d03/m/a/r/i/a/4/8/2/6/maria4826/Maildir] to
[/nfs/mail3d03/m/a/r/i/a/4/8/2/6/18538100#perm!terra.maria4826] ...

and this is what Mutex/Futex Debug gave to us:

Sep 18 19:00:02 mangoro kernel:
==========================================
Sep 18 19:00:02 mangoro kernel: [ BUG: lock recursion deadlock detected!
|
Sep 18 19:00:02 mangoro kernel:
------------------------------------------
Sep 18 19:00:15 mangoro kernel: 
Sep 18 19:00:15 mangoro kernel: trrlmtpd_tcp/12499 is trying to acquire
this lock:
Sep 18 19:00:15 mangoro kernel:  [ed260830] {inode_init_once}
Sep 18 19:00:15 mangoro kernel: .. held by:      trrlmtpd_tcp:12499
[d94cc560, 115]
Sep 18 19:00:15 mangoro kernel: ... acquired at:
lock_rename+0x36/0x94
Sep 18 19:00:15 mangoro kernel: ... trying at:
vfs_rename_dir+0x84/0x100
Sep 18 19:00:15 mangoro kernel: ------------------------------
Sep 18 19:00:15 mangoro kernel: | showing all locks held by: |
(trrlmtpd_tcp/12499 [d94cc560, 115]):
Sep 18 19:00:15 mangoro kernel: ------------------------------
Sep 18 19:00:15 mangoro kernel: 
Sep 18 19:00:15 mangoro kernel: #001:             [f6e839f4]
{alloc_super}
Sep 18 19:00:15 mangoro kernel: ... acquired at:
lock_rename+0x4f/0x94
Sep 18 19:00:15 mangoro kernel: 
Sep 18 19:00:15 mangoro kernel: #002:             [ed260b50]
{inode_init_once}
Sep 18 19:00:15 mangoro kernel: ... acquired at:
lock_rename+0x2b/0x94
Sep 18 19:00:15 mangoro kernel: 
Sep 18 19:00:15 mangoro kernel: #003:             [ed260830]
{inode_init_once}
Sep 18 19:00:15 mangoro kernel: ... acquired at:
lock_rename+0x36/0x94
Sep 18 19:00:15 mangoro kernel: 
Sep 18 19:00:15 mangoro kernel: 
Sep 18 19:00:15 mangoro kernel: trrlmtpd_tcp/12499's [current]
stackdump:
Sep 18 19:00:15 mangoro kernel: 
Sep 18 19:00:15 mangoro kernel:  <c0104205> show_trace+0xd/0xf
<c01042ce> dump_stack+0x15/0x17
Sep 18 19:00:15 mangoro kernel:  <c0131205> report_deadlock+0xea/0x102
<c0131364> check_deadlock+0x147/0x151
Sep 18 19:00:15 mangoro kernel:  <c0131908> debug_mutex_add_waiter
+0x80/0x93  <c02fcf02> __mutex_lock_slowpath+0x117/0x37a
Sep 18 19:00:15 mangoro kernel:  <c02fcddb> mutex_lock+0x24/0x27
<c0168888> vfs_rename_dir+0x84/0x100
Sep 18 19:00:15 mangoro kernel:  <c0168b2a> vfs_rename+0x135/0x273
<c0168d9f> do_rename+0x137/0x176
Sep 18 19:00:15 mangoro kernel:  <c0168e17> sys_renameat+0x39/0x54
<c0168e44> sys_rename+0x12/0x14
Sep 18 19:00:15 mangoro kernel:  <c01032ef> sysenter_past_esp+0x54/0x75 


Unfortunately we couldn't use KDB to debug this process because we had
to reboot the machine.
We are waiting for another process lock.

Any help in the subject will be very appreciate and we are ready to
provide more information about this problem if required.

Thanks in Advance.

-- 
Rudá Moura <ruda.moura@...ra.com.br>

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ