Date:	Fri, 1 Aug 2014 07:50:53 +1000
From:	NeilBrown <neilb@...e.de>
To:	Ben Greear <greearb@...delatech.com>,
	Andrew Morton <akpm@...ux-foundation.org>
Cc:	"linux-nfs@...r.kernel.org" <linux-nfs@...r.kernel.org>,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org,
	linux-fsdevel@...r.kernel.org
Subject: Re: Killing process in D state on mount to dead NFS server. (when
 process is in fsync)

On Thu, 31 Jul 2014 14:20:07 -0700 Ben Greear <greearb@...delatech.com> wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> On 07/31/2014 01:42 PM, NeilBrown wrote:
> > On Thu, 31 Jul 2014 11:00:35 -0700 Ben Greear <greearb@...delatech.com> wrote:
> > 
> >> So, this has been asked all over the interweb for years and years, but the best answer I can find is to reboot the system or create a fake NFS server
> >> somewhere with the same IP as the gone-away NFS server.
> >> 
> >> The problem is:
> >> 
> >> I have some mounts to an NFS server that no longer exists (crashed/powered down).
> >> 
> >> I have some processes stuck trying to write to files open on these mounts.
> >> 
> >> I want to kill the process and unmount.
> >> 
> >> umount -l will make the mount go away, sort of, but the process is still hung. umount -f complains:
> >> umount2: Device or resource busy
> >> umount.nfs: /mnt/foo: device is busy
> >> 
> >> kill -9 does not work on the process.
> > 
> > Kill -1 should work (since about 2.6.25 or so).
> 
> That is -[ONE], right?  Assuming so, it did not work for me.

No, I meant "-9"... sorry, I really shouldn't be let out without my
proofreader.

However, the stack trace is sufficient to see what is going on.

The problem is that the process is blocked inside the "VM" (in
filemap_fdatawait_range(), waiting for page writeback), well away from NFS,
and there is no way for NFS to say "give up and go home".
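A task stuck like this shows up in uninterruptible sleep, state "D" in the third field of /proc/<pid>/stat. The following is a minimal sketch, assuming a Linux /proc layout, for listing such tasks so their /proc/<pid>/stack can then be read as root. The helper names (stat_state, proc_state, d_state_pids) are hypothetical and not part of any existing tool.

```python
import os
from typing import List, Optional


def stat_state(stat_line: str) -> Optional[str]:
    """Return the one-letter state field from a /proc/<pid>/stat line.

    The comm field is wrapped in parentheses and may itself contain
    spaces or ')', so split after the *last* ')' instead of on whitespace.
    """
    _, sep, rest = stat_line.rpartition(")")
    fields = rest.split()
    if not sep or not fields:
        return None  # not a parseable stat line
    return fields[0]


def proc_state(pid: int) -> Optional[str]:
    """Read /proc/<pid>/stat; None if the task has exited or /proc is absent."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            return stat_state(f.read())
    except OSError:
        return None


def d_state_pids() -> List[int]:
    """PIDs currently in uninterruptible sleep ('D'), e.g. blocked in fsync."""
    if not os.path.isdir("/proc"):
        return []
    pids = []
    for name in os.listdir("/proc"):
        # Numeric entries are processes; tasks may exit while we scan,
        # which proc_state() reports as None.
        if name.isdigit() and proc_state(int(name)) == "D":
            pids.append(int(name))
    return sorted(pids)


if __name__ == "__main__":
    for pid in d_state_pids():
        # Next step (as root): cat /proc/<pid>/stack
        print(pid)
```

A "D" task is only a candidate for this bug; it may also just be waiting on slow I/O, so the stack is what tells you where it is blocked.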

I'd suggest that is a bug. I cannot see any justification for fsync not
being killable.
It wouldn't be too hard to create a patch to make it so.
It would be a little harder to examine all call paths and create a
convincing case that the patch was safe.
It might be a herculean task to convince others that it was the right thing
to do... so let's start with that one.

Hi Linux-mm and fs-devel people. What do people think of making "fsync" and
its variants "KILLABLE"?

I probably only need a little bit of encouragement to write a patch...

Thanks,
NeilBrown

> 
> Kernel is 3.14.4+, with some extra patches, but probably nothing that
> influences this particular behaviour.
> 
> [root@...005-14010010 ~]# cat /proc/3805/stack
> [<ffffffff811371ba>] sleep_on_page+0x9/0xd
> [<ffffffff8113738e>] wait_on_page_bit+0x71/0x78
> [<ffffffff8113769a>] filemap_fdatawait_range+0xa2/0x16d
> [<ffffffff8113780e>] filemap_write_and_wait_range+0x3b/0x77
> [<ffffffffa0f04734>] nfs_file_fsync+0x37/0x83 [nfs]
> [<ffffffff811a8d32>] vfs_fsync_range+0x19/0x1b
> [<ffffffff811a8d4b>] vfs_fsync+0x17/0x19
> [<ffffffffa0f05305>] nfs_file_flush+0x6b/0x6f [nfs]
> [<ffffffff81183e46>] filp_close+0x3f/0x71
> [<ffffffff8119c8ae>] __close_fd+0x80/0x98
> [<ffffffff81183de5>] SyS_close+0x1c/0x3e
> [<ffffffff815c55f9>] system_call_fastpath+0x16/0x1b
> [<ffffffffffffffff>] 0xffffffffffffffff
> [root@...005-14010010 ~]# kill -1 3805
> [root@...005-14010010 ~]# cat /proc/3805/stack
> [<ffffffff811371ba>] sleep_on_page+0x9/0xd
> [<ffffffff8113738e>] wait_on_page_bit+0x71/0x78
> [<ffffffff8113769a>] filemap_fdatawait_range+0xa2/0x16d
> [<ffffffff8113780e>] filemap_write_and_wait_range+0x3b/0x77
> [<ffffffffa0f04734>] nfs_file_fsync+0x37/0x83 [nfs]
> [<ffffffff811a8d32>] vfs_fsync_range+0x19/0x1b
> [<ffffffff811a8d4b>] vfs_fsync+0x17/0x19
> [<ffffffffa0f05305>] nfs_file_flush+0x6b/0x6f [nfs]
> [<ffffffff81183e46>] filp_close+0x3f/0x71
> [<ffffffff8119c8ae>] __close_fd+0x80/0x98
> [<ffffffff81183de5>] SyS_close+0x1c/0x3e
> [<ffffffff815c55f9>] system_call_fastpath+0x16/0x1b
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> Thanks,
> Ben
> 
> > If it doesn't, please report the kernel version and the output of
> > cat /proc/$PID/stack for some processes that cannot be killed.
> > 
> > NeilBrown
> > 
> >> 
> >> 
> >> Aside from bringing a fake NFS server back up on the same IP, is there any other way to get these mounts unmounted and the processes killed without 
> >> rebooting?
> >> 
> >> Thanks, Ben
> >> 
> > 
> 
> 
> - -- 
> Ben Greear <greearb@...delatech.com>
> Candela Technologies Inc  http://www.candelatech.com
> 
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.13 (GNU/Linux)
> Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/
> 
> iQEcBAEBAgAGBQJT2rLiAAoJELbHqkYeJT4OqPgH/0taKW6Be90c1mETZf9yeqZF
> YMLZk8XC2wloEd9nVz//mXREmiu18Hc+5p7Upd4Os21J2P4PBMGV6P/9DMxxehwH
> YX1HKha0EoAsbO5ILQhbLf83cRXAPEpvJPgYHrq6xjlKB8Q8OxxND37rY7kl19Zz
> sdAw6GiqHICF3Hq1ATa/jvixMluDnhER9Dln3wOdAGzmmuFYqpTsV4EwzbKKqInJ
> 6C15q+cq/9aYh6usN6z2qJhbHgqM9EWcPL6jOrCwX4PbC1XjKHekpFN0t9oKQClx
> qSPuweMQ7fP4IBd2Ke8L/QlyOVblAKSE7t+NdrjfzLmYPzyHTyfLABR/BI053to=
> =/9FJ
> -----END PGP SIGNATURE-----

