linux-kernel - [RFC 0/2] fuse: introduce fuse server recovery mechanism

Message-Id: <20240524064030.4944-1-jefflexu@linux.alibaba.com>
Date: Fri, 24 May 2024 14:40:28 +0800
From: Jingbo Xu <jefflexu@...ux.alibaba.com>
To: miklos@...redi.hu,
	linux-fsdevel@...r.kernel.org
Cc: linux-kernel@...r.kernel.org,
	winters.zc@...group.com
Subject: [RFC 0/2] fuse: introduce fuse server recovery mechanism

Background
==========
The fd of '/dev/fuse' serves as a message transmission channel between
the FUSE filesystem (kernel space) and the fuse server (user space).
Once the fd gets closed (intentionally or unintentionally), the FUSE
filesystem gets aborted, and any attempt to access the filesystem fails
with -ECONNABORTED until the FUSE filesystem is finally unmounted.

Providing uninterruptible filesystem service is one of the requirements
in production environments.  The most straightforward way, and maybe the
most widely used one, is to have another dedicated user daemon (similar
to systemd fdstore) keep the device fd open.  When the fuse daemon
recovers from a crash, it can retrieve the device fd from the fdstore
daemon through the socket takeover (Unix domain socket) method [1] or
the pidfd_getfd() syscall [2].  In this way, as long as the fdstore
daemon doesn't exit, the FUSE filesystem won't get aborted when the fuse
daemon crashes, though the filesystem service may hang for a while until
the restarted fuse daemon has completely recovered.
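
For readers unfamiliar with the socket takeover method in [1], here is a
minimal sketch of passing an fd over a connected Unix domain socket with
SCM_RIGHTS.  The helper names send_fd() and recv_fd() are made up for
this illustration; a real fdstore daemon would also need connection
handling and authentication, which are left out.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Send file descriptor 'fd' over the connected Unix domain socket 'sock'.
 * Returns 0 on success, -1 on error.
 */
int send_fd(int sock, int fd)
{
	char dummy = 'F';	/* one byte of real data travels with the fd */
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;	/* forces correct alignment of buf */
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/*
 * Receive one file descriptor from the connected Unix domain socket 'sock'.
 * Returns the new fd (a duplicate installed in this process), or -1.
 */
int recv_fd(int sock)
{
	char dummy;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

In the fdstore scheme, the fuse daemon would call send_fd() with its
/dev/fuse fd right after mounting, and the restarted daemon would get
the fd back from the fdstore daemon in the same way.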

This scheme works and has been deployed in our internal production
environment, but we have run into the following issues:

1. The fdstore daemon may be killed by mistake, in which case the FUSE
filesystem gets aborted and becomes unrecoverable.

2. In containerized deployments, the fuse daemon is deployed in a
container POD, and a dedicated fdstore daemon needs to be deployed for
each fuse daemon.  The fdstore daemon could consume a certain amount of
resources (e.g. memory footprint), which is not conducive to dense
container deployment.

3. Each fuse daemon implementation needs to implement its own fdstore
daemon.  If we implement the fuse recovery mechanism on the kernel side,
all fuse daemon implementations could reuse this mechanism.


What we do
==========

Basic Recovery Mechanism
------------------------
We introduce a recovery mechanism for fuse server on the kernel side.

To do this:
1. Introduce a new "tag=" mount option, with which users could identify
a fuse connection with a unique name.
2. Introduce a new FUSE_DEV_IOC_ATTACH ioctl, with which the fuse server
could reconnect to the fuse connection corresponding to the given tag.
3. Introduce a new FUSE_HAS_RECOVERY init flag.  The fuse server should
advertise this feature if it supports server recovery.


With the above recovery mechanism, the whole time sequence looks like:
- At the initial mount, the fuse filesystem is mounted with the "tag="
  option
- The fuse server advertises the FUSE_HAS_RECOVERY flag when replying
  to FUSE_INIT
- When the fuse server crashes and the (/dev/fuse) device fd is closed,
  the fuse connection won't be aborted
- The requests submitted after the server crash stay in the iqueue;
  the processes submitting the requests hang there
- The fuse server gets restarted and recovers its state from before
  the crash (including the negotiation results of the last FUSE_INIT)
- The fuse server opens /dev/fuse to get a new device fd, and then
  runs the FUSE_DEV_IOC_ATTACH ioctl on the new device fd to retrieve
  the fuse connection with the tag previously used to mount the fuse
  filesystem
- The fuse server issues a FUSE_NOTIFY_RESEND notification to request
  the kernel to resend those inflight requests that were sent to the
  fuse server before the crash but have not been replied to yet
- The fuse server starts to process requests normally (those queued in
  the iqueue and those resent by FUSE_NOTIFY_RESEND)

In summary, the requests submitted after the server crash stay in the
iqueue and get serviced once the fuse server recovers from the crash
and retrieves the previous fuse connection.  As for the inflight
requests that were sent to the fuse server before the crash but have
not been replied to yet, the fuse server can ask the kernel to resend
them through the FUSE_NOTIFY_RESEND notification type.
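
To make the request flow concrete, here is a small userspace toy model
of the two request sets described above.  It is not kernel code: the
toy_* names, the fixed-size arrays, and the choice to put resent
requests at the front of the pending queue are all assumptions of this
sketch.

```c
#include <stddef.h>
#include <string.h>

#define MAX_REQS 16

struct toy_conn {
	int pending[MAX_REQS];	/* models the iqueue: not yet read by the server */
	size_t npending;
	int inflight[MAX_REQS];	/* read by the server, not replied to yet */
	size_t ninflight;
	int attached;		/* 1 while a server holds the device fd */
};

void toy_init(struct toy_conn *c)
{
	memset(c, 0, sizeof(*c));
	c->attached = 1;
}

/* A process submits a request; it is queued even if no server is attached. */
int toy_submit(struct toy_conn *c, int id)
{
	if (c->npending == MAX_REQS)
		return -1;
	c->pending[c->npending++] = id;
	return 0;
}

/* The server reads the oldest pending request; returns its id or -1. */
int toy_server_read(struct toy_conn *c)
{
	int id;

	if (!c->attached || c->npending == 0 || c->ninflight == MAX_REQS)
		return -1;
	id = c->pending[0];
	memmove(&c->pending[0], &c->pending[1],
		(c->npending - 1) * sizeof(int));
	c->npending--;
	c->inflight[c->ninflight++] = id;
	return id;
}

/* The server replies to an inflight request; returns 0 if it was found. */
int toy_server_reply(struct toy_conn *c, int id)
{
	size_t i;

	if (!c->attached)
		return -1;
	for (i = 0; i < c->ninflight; i++) {
		if (c->inflight[i] != id)
			continue;
		memmove(&c->inflight[i], &c->inflight[i + 1],
			(c->ninflight - i - 1) * sizeof(int));
		c->ninflight--;
		return 0;
	}
	return -1;
}

/* Server crash: the device fd goes away, but both request sets are kept. */
void toy_server_crash(struct toy_conn *c)
{
	c->attached = 0;
}

/* The restarted server re-attaches to the same connection. */
void toy_server_attach(struct toy_conn *c)
{
	c->attached = 1;
}

/*
 * Models the resend notification: move every inflight request back to the
 * front of the pending queue.  Returns the number of requests requeued.
 */
int toy_resend(struct toy_conn *c)
{
	size_t n = c->ninflight;

	if (!c->attached || c->npending + n > MAX_REQS)
		return -1;
	memmove(&c->pending[n], &c->pending[0], c->npending * sizeof(int));
	memcpy(&c->pending[0], &c->inflight[0], n * sizeof(int));
	c->npending += n;
	c->ninflight = 0;
	return (int)n;
}
```

In this model, a request read before toy_server_crash() is handed out
again after toy_server_attach() and toy_resend(), followed by the
requests that were submitted while no server was attached.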


Security Enhancement
--------------------
In addition, we offer a uid-based security enhancement for the fuse
server recovery mechanism.  Without it, any malicious attacker could
kill the fuse server and take over the filesystem service through the
recovery mechanism.

To implement this, we introduce a new "rescue_uid=" mount option
specifying the expected uid of the legitimate process running the fuse
server.  Only a process with the matching uid is then permitted to
retrieve the fuse connection through the server recovery mechanism.
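
As an illustration of how the two new mount options might be passed,
the sketch below builds the option string handed to mount(2) as its
data argument.  "fd=", "rootmode=", "user_id=" and "group_id=" are the
existing FUSE mount options; the value formats assumed here for "tag="
(a plain string without commas) and "rescue_uid=" (a decimal uid), as
well as the helper name build_fuse_opts(), are assumptions of this
sketch.

```c
#include <stdio.h>
#include <sys/types.h>

/*
 * Format the FUSE mount option string into 'buf'.
 * Returns 0 on success, -1 if the result did not fit or formatting failed.
 */
int build_fuse_opts(char *buf, size_t len, int dev_fd, unsigned int rootmode,
		    uid_t user_id, gid_t group_id,
		    const char *tag, uid_t rescue_uid)
{
	/* rootmode is given in octal, as the existing option expects */
	int n = snprintf(buf, len,
			 "fd=%d,rootmode=%o,user_id=%u,group_id=%u,"
			 "tag=%s,rescue_uid=%u",
			 dev_fd, rootmode,
			 (unsigned int)user_id, (unsigned int)group_id,
			 tag, (unsigned int)rescue_uid);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

For example, with device fd 3, a directory root (mode 040000) and uid
and gid 1000, this would produce
"fd=3,rootmode=40000,user_id=1000,group_id=1000,tag=myfs,rescue_uid=1000".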


Limitations
===========
1. The current mechanism won't send a new FUSE_INIT request to the fuse
server and start a new negotiation when the fuse server re-attaches to
the fuse connection through the FUSE_DEV_IOC_ATTACH ioctl.  Thus the
fuse server needs to recover its state from before the crash (including
the negotiation results of the last FUSE_INIT) by itself.

PS. Because of this, I had to hack the libfuse passthrough_ll daemon
when testing the recovery feature.

2. With the current recovery mechanism, the fuse filesystem won't get
aborted when the fuse server crashes, so a subsequent umount will hang.
The call stack shows that the hung task is waiting for FUSE_GETATTR on
the mount point:

[<0>] request_wait_answer+0xe1/0x200
[<0>] fuse_simple_request+0x18e/0x2a0
[<0>] fuse_do_getattr+0xc9/0x180
[<0>] vfs_statx+0x92/0x170
[<0>] vfs_fstatat+0x7c/0xb0
[<0>] __do_sys_newstat+0x1d/0x40
[<0>] do_syscall_64+0x60/0x170
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e

It's not fixed yet in this RFC version.

3. I don't know whether a kernel-based recovery mechanism is welcome in
the community.  Any comments are welcome.  Thanks!


[1] https://copyconstruct.medium.com/file-descriptor-transfer-over-unix-domain-sockets-dcbbf5b3b6ec
[2] https://copyconstruct.medium.com/seamless-file-descriptor-transfer-between-processes-with-pidfd-and-pidfd-getfd-816afcd19ed4


Jingbo Xu (2):
  fuse: introduce recovery mechanism for fuse server
  fuse: uid-based security enhancement for the recovery mechanism

 fs/fuse/dev.c             | 55 ++++++++++++++++++++++++++++++++++++++-
 fs/fuse/fuse_i.h          | 15 +++++++++++
 fs/fuse/inode.c           | 46 +++++++++++++++++++++++++++++++-
 include/uapi/linux/fuse.h |  7 +++++
 4 files changed, 121 insertions(+), 2 deletions(-)

-- 
2.19.1.6.gb485710b