Message-ID: <CAPKjjnpNV9Eq1bxFc3XLjLxFWqJ5_5a_vqC+aUK_t4dGBE7cxw@mail.gmail.com>
Date: Wed, 28 Feb 2024 11:37:49 +0800
From: Zhitao Li <zhitao.li@...rtx.com>
To: NeilBrown <neilb@...e.de>
Cc: Jeff Layton <jlayton@...nel.org>, Trond Myklebust <trondmy@...merspace.com>, 
	"chuck.lever@...cle.com" <chuck.lever@...cle.com>, "tom@...pey.com" <tom@...pey.com>, 
	"anna@...nel.org" <anna@...nel.org>, "Dai.Ngo@...cle.com" <Dai.Ngo@...cle.com>, 
	"kolga@...app.com" <kolga@...app.com>, "huangping@...rtx.com" <huangping@...rtx.com>, 
	"linux-nfs@...r.kernel.org" <linux-nfs@...r.kernel.org>, 
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: PROBLEM: NFS client IO fails with ERESTARTSYS when another mount
 point with the same export is unmounted with force [NFS] [SUNRPC]

Thanks for your help.

The fix does work in my cases. "dd" fails with EIO when its mount
point or another mount point with the same export is unmounted with
force.

1. Another mount point is unmounted with force:
[root@v6 zhitaoli]#   dd if=/dev/urandom of=/mnt/test/1G bs=1M
count=1024 oflag=direct
dd: error writing '/mnt/test/1G': Input/output error
100+0 records in
99+0 records out
103809024 bytes (104 MB, 99 MiB) copied, 5.05276 s, 20.5 MB/s

2. The mount point is unmounted with force:
[root@v6 zhitaoli]#   dd if=/dev/urandom of=/mnt/test/1G bs=1M
count=1024 oflag=direct
dd: error writing '/mnt/test/1G': Input/output error
60+0 records in
59+0 records out
61865984 bytes (62 MB, 59 MiB) copied, 2.13265 s, 29.0 MB/s

Best regards
Zhitao Li, at SmartX

On Wed, Feb 28, 2024 at 7:55 AM NeilBrown <neilb@...e.de> wrote:
>
> On Fri, 23 Feb 2024, Zhitao Li wrote:
> > Thanks for Jeff's reply.
> >
> > I did see ERESTARTSYS in userland. As described in the above
> > "Reproduction" chapter, "dd" fails with "dd: error writing
> > '/mnt/test1/1G': Unknown error 512".
> >
> > After strace "dd", it turns out that syscall WRITE fails with:
> > write(1, "4\303\31\211\316\237\333\r-\275g\370\233\374X\277\374Tb\202\24\365\220\320\16\27o3\331q\344\364"...,
> > 1048576) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
> >
> > In fact, other syscalls related to file systems can also fail with
> > ERESTARTSYS in our cases, for example: mount, open, read, write and so
> > on.
> >
> > Maybe the reason is that on forced unmount, rpc_killall_tasks() in
> > net/sunrpc/clnt.c will set all inflight IO with ERESTARTSYS, while no
> > signal gets involved. So ERESTARTSYS is not handled before entering
> > userspace.
> >
> > Best regards,
> > Zhitao Li at SmartX.
> >
> > On Thu, Feb 22, 2024 at 7:05 PM Jeff Layton <jlayton@...nel.org> wrote:
> > >
> > > On Wed, 2024-02-21 at 13:48 +0000, Trond Myklebust wrote:
> > > > On Wed, 2024-02-21 at 16:20 +0800, Zhitao Li wrote:
> > > > >
> > > > > Hi, everyone,
> > > > >
> > > > > - Facts:
> > > > > I have a remote NFS export and I mount the same export on two
> > > > > different directories in my OS with the same options. There is an
> > > > > inflight IO under one mounted directory. And then I unmount another
> > > > > mounted directory with force. The inflight IO ends up with "Unknown
> > > > > error 512", which is ERESTARTSYS.
> > > > >
> > > >
> > > > All of the above is well known. That's because forced umount affects
> > > > the entire filesystem. Why are you using it here in the first place? It
> > > > is not intended for casual use.
> > > >
> > >
> > > While I agree with Trond's statement above, the kernel is not supposed to
> > > leak error codes that high into userland. Are you seeing ERESTARTSYS
> > > being returned to system calls? If so, which ones?
> > > --
> > > Jeff Layton <jlayton@...nel.org>
> >
>
> I think this bug was introduced by
> Commit ae67bd3821bb ("SUNRPC: Fix up task signalling")
> in Linux v5.2.
>
> Prior to that commit, rpc_killall_tasks set the error to -EIO.
> After that commit it calls rpc_signal_task which always uses
> -ERESTARTSYS.
>
> This might be an appropriate fix.  Can you please test and confirm?
>
> diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h
> index 2d61987b3545..ed3a116efd5d 100644
> --- a/include/linux/sunrpc/sched.h
> +++ b/include/linux/sunrpc/sched.h
> @@ -222,7 +222,7 @@ void                rpc_put_task(struct rpc_task *);
>  void           rpc_put_task_async(struct rpc_task *);
>  bool           rpc_task_set_rpc_status(struct rpc_task *task, int rpc_status);
>  void           rpc_task_try_cancel(struct rpc_task *task, int error);
> -void           rpc_signal_task(struct rpc_task *);
> +void           rpc_signal_task(struct rpc_task *, int);
>  void           rpc_exit_task(struct rpc_task *);
>  void           rpc_exit(struct rpc_task *, int);
>  void           rpc_release_calldata(const struct rpc_call_ops *, void *);
> diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
> index cda0935a68c9..cdbdfae13030 100644
> --- a/net/sunrpc/clnt.c
> +++ b/net/sunrpc/clnt.c
> @@ -895,7 +895,7 @@ void rpc_killall_tasks(struct rpc_clnt *clnt)
>         trace_rpc_clnt_killall(clnt);
>         spin_lock(&clnt->cl_lock);
>         list_for_each_entry(rovr, &clnt->cl_tasks, tk_task)
> -               rpc_signal_task(rovr);
> +               rpc_signal_task(rovr, -EIO);
>         spin_unlock(&clnt->cl_lock);
>  }
>  EXPORT_SYMBOL_GPL(rpc_killall_tasks);
> diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
> index 6debf4fd42d4..e88621881036 100644
> --- a/net/sunrpc/sched.c
> +++ b/net/sunrpc/sched.c
> @@ -852,14 +852,14 @@ void rpc_exit_task(struct rpc_task *task)
>         }
>  }
>
> -void rpc_signal_task(struct rpc_task *task)
> +void rpc_signal_task(struct rpc_task *task, int sig)
>  {
>         struct rpc_wait_queue *queue;
>
>         if (!RPC_IS_ACTIVATED(task))
>                 return;
>
> -       if (!rpc_task_set_rpc_status(task, -ERESTARTSYS))
> +       if (!rpc_task_set_rpc_status(task, sig))
>                 return;
>         trace_rpc_task_signalled(task, task->tk_action);
>         set_bit(RPC_TASK_SIGNALLED, &task->tk_runstate);
> @@ -992,7 +992,7 @@ static void __rpc_execute(struct rpc_task *task)
>                          * clean up after sleeping on some queue, we don't
>                          * break the loop here, but go around once more.
>                          */
> -                       rpc_signal_task(task);
> +                       rpc_signal_task(task, -ERESTARTSYS);
>                 }
>                 trace_rpc_task_sync_wake(task, task->tk_action);
>         }
