linux-kernel - Re: [HELP] FUSE writeback performance bottleneck

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <67771830-977f-4fca-9d0b-0126abf120a5@fastmail.fm>
Date: Mon, 3 Jun 2024 16:43:17 +0200
From: Bernd Schubert <bernd.schubert@...tmail.fm>
To: Jingbo Xu <jefflexu@...ux.alibaba.com>, Miklos Szeredi
 <miklos@...redi.hu>,
 "linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>
Cc: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
 lege.wang@...uarmicro.com
Subject: Re: [HELP] FUSE writeback performance bottleneck



On 6/3/24 08:17, Jingbo Xu wrote:
> Hi, Miklos,
> 
> We spotted a performance bottleneck for FUSE writeback in which the
> writeback kworker has consumed nearly 100% CPU, among which 40% CPU is
> used for copy_page().
> 
> fuse_writepages_fill
>   alloc tmp_page
>   copy_highpage
> 
> This is because of FUSE writeback design (see commit 3be5a52b30aa
> ("fuse: support writable mmap")), which newly allocates a temp page for
> each dirty page to be written back, copy content of dirty page to temp
> page, and then write back the temp page instead.  This special design is
> intentional to avoid potential deadlocked due to buggy or even malicious
> fuse user daemon.

I also noticed that and I admin that I don't understand it yet. The commit says

<quote>
    The basic problem is that there can be no guarantee about the time in which
    the userspace filesystem will complete a write.  It may be buggy or even
    malicious, and fail to complete WRITE requests.  We don't want unrelated parts
    of the system to grind to a halt in such cases.
</quote>


Timing - NFS/cifs/etc have the same issue? Even a local file system has no guarantees
how fast storage is?

Buggy - hmm yeah, then it is splice related only? But I think splice feature was
not introduced yet when fuse got mmap and writeback in 2008? 
Without splice the pages are just copied into a userspace buffer? So what can
userspace do wrong with its copy?

Failure - why can't it do what nfs_mapping_set_error() does?

I guess I miss something, but so far I don't understand what that is.


> 
> There was a proposal of removing this constraint for virtiofs [1], which
> is reasonable as users of virtiofs and virtiofs daemon don't run on the
> same OS, and virtiofs daemon is usually offered by cloud vendors that
> shall not be malicious.  While for the normal /dev/fuse interface, I
> don't think removing the constraint is acceptable.
> 
> 
> Come back to the writeback performance bottleneck.  Another important
> factor is that, (IIUC) only one kworker at the same time is allowed for
> writeback for each filesystem instance (if cgroup writeback is not
> enabled).  The kworker is scheduled upon sb->s_bdi->wb.dwork, and the
> workqueue infrastructure guarantees that at most one *running* worker is
> allowed for one specific work (sb->s_bdi->wb.dwork) at any time.  Thus
> the writeback is constraint to one CPU for each filesystem instance.
> 
> I'm not sure if offloading the page copying and then FUSE requests
> sending to another worker (if a bunch of dirty pages have been
> collected) is a good idea or not, e.g.
> 
> ```
> fuse_writepages_fill
>   if fuse_writepage_need_send:
>     # schedule a work
> 
> # the worker
> for each dirty page in ap->pages[]:
>     copy_page
> fuse_writepages_send
> ```
> 
> Any suggestion?
> 
> 
> 
> This issue can be reproduced by:
> 
> 1 ./libfuse/build/example/passthrough_ll -o cache=always -o writeback -o
> source=/run/ /mnt
> ("/run/" is a tmpfs mount)
> 
> 2 fio --name=write_test --ioengine=psync --iodepth=1 --rw=write --bs=1M
> --direct=0 --size=1G --numjobs=2 --group_reporting --directory=/mnt
> (at least two threads are needed; fio shows ~1800MiB/s buffer write
> bandwidth)


That should quickly run out of tmpfs memory. I need to find time to improve
this a bit, but this should give you an easier test:

https://github.com/libfuse/libfuse/pull/807

> 
> 
> [1]
> https://lore.kernel.org/all/20231228123528.705-1-lege.wang@jaguarmicro.com/
> 
> 


Thanks,
Bernd