lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Date:	Thu, 6 Dec 2007 18:49:20 -0700
From:	Matthew Wilcox <matthew@....cx>
To:	Bob Bell <b_lkml@...bellsplace.com>
Cc:	Andrew Morton <akpm@...l.org>, Linus Torvalds <torvalds@...l.org>,
	trond@...app.com, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] TASK_KILLABLE version 2

On Mon, Sep 24, 2007 at 04:16:48PM -0400, Bob Bell wrote:
> I've been testing this patch on my systems.  It's working for me when
> I read() a file.  Asynchronous write()s seem okay, too.  However,
> synchronous writes (caused by either calling fsync() or fcntl() to
> release a lock) prevent the process from being killed when the NFS
> server goes down.
> 
> When the process is sent SIGKILL, it's waiting with the following call
> tree:
>    do_fsync
>    nfs_fsync
>    nfs_wb_all
>    nfs_sync_mapping_wait
>    nfs_wait_on_requests_locks (I believe)
>    nfs_wait_on_request
>    out_of_line_wait_on_bit
>    __wait_on_bit
>    nfs_wait_bit_interruptible
>    schedule
> 
> When the process is later viewed after being deemed "stuck", it's
> waiting with the following call tree:
>    do_fsync
>    filemap_fdatawait
>    wait_on_page_writeback_range
>    wait_on_page_writeback
>    wait_on_page_bit
>    __wait_on_bit
>    sync_page
>    io_schedule
>    schedule
> 
> If I hazard a guess as to what might be wrong here, I believe that when
> the processes catches SIGKILL, nfs_wait_bit_interruptible is returning
> -ERESTARTSYS.  That error bubbles back up to nfs_fsync.  However,
> nfs_fsync returns ctx->error, not -ERESTARTSYS, and ctx->error is 0.
> do_fsync proceeds to call filemap_fdatawait.  I question whether 
> nfs_sync should return an error, and if do_fsync should skip 
> filemap_fdatawait if the fsync op returned an error.
> 
> I did try replacing the call to sync_page in __wait_on_bit with 
> sync_page_killable and replacing TASK_UNINTERRUPTIBLE with 
> TASK_KILLABLE.  That seemed to work once, but then really screwed things 
> up on subsequent attempts.

Hi Bob,

The patch I posted earlier today (a somewhat cleaned-up version of a
patch originally from Liam Howlett) converts NFS to use TASK_KILLABLE.
Could you have another shot at breaking it?  You can find it here:

http://marc.info/?l=linux-fsdevel&m=119698206729969&w=2

(It has some prerequisites that are in the git tree, so probably the
easiest thing is just to pull that git tree).

Liam notes that there's still problems with programs that call *stat*(),
but I haven't looked into that issue yet.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ