linux-kernel - Re: [PATCH] fuse: make foffset alignment opt-in for optimum backend performance

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <8fd07d19-b7eb-4ae6-becc-08e6e1502fc8@linux.alibaba.com>
Date: Fri, 5 Jul 2024 20:00:19 +0800
From: Jingbo Xu <jefflexu@...ux.alibaba.com>
To: Bernd Schubert <bernd.schubert@...tmail.fm>, miklos@...redi.hu,
 linux-fsdevel@...r.kernel.org
Cc: linux-kernel@...r.kernel.org
Subject: Re: [PATCH] fuse: make foffset alignment opt-in for optimum backend
 performance

Hi Bernd,

Thanks for the comment.


On 7/5/24 7:50 PM, Bernd Schubert wrote:
> 
> 
> On 7/5/24 12:04, Jingbo Xu wrote:
>> Sometimes the file offset alignment needs to be opt-in to achieve the
>> optimum performance at the backend store.
>>
>> For example when ErasureCode [1] is used at the backend store, the
>> optimum write performance is achieved when the WRITE request is aligned
>> with the stripe size of ErasureCode.  Otherwise a non-aligned WRITE
>> request needs to be split at the stripe size boundary.  It is quite
>> costly to handle these split partial requests, as firstly the whole
>> stripe to which the split partial request belongs needs to be read out,
>> then overwrite the read stripe buffer with the request, and finally write
>> the whole stripe back to the persistent storage.
>>
>> Thus the backend store can suffer severe performance degradation when
>> WRITE requests can not fit into one stripe exactly.  The write performance
>> can be 10x slower when the request is 256KB in size given 4MB stripe size.
>> Also there can be 50% performance degradation in theory if the request
>> is not stripe boundary aligned.
>>
>> Besides, the conveyed test indicates that, the non-alignment issue
>> becomes more severe when decreasing fuse's max_ratio, maybe partly
>> because the background writeback now is more likely to run parallelly
>> with the dirtier.
>>
>> fuse's max_ratio	ratio of aligned WRITE requests
>> ----------------	-------------------------------
>> 70			99.9%
>> 40			74%
>> 20			45%
>> 10			20%
>>
>> With the patched version, which makes the alignment constraint opt-in
>> when constructing WRITE requests, the ratio of aligned WRITE requests
>> increases to 98% (previously 20%) when fuse's max_ratio is 10.
>>
>> [1] https://lore.kernel.org/linux-fsdevel/20240124070512.52207-1-jefflexu@linux.alibaba.com/T/#m9bce469998ea6e4f911555c6f7be1e077ce3d8b4
>> Signed-off-by: Jingbo Xu <jefflexu@...ux.alibaba.com>
>>
>> Signed-off-by: Jingbo Xu <jefflexu@...ux.alibaba.com>
>> ---
>>  fs/fuse/file.c            | 4 ++++
>>  fs/fuse/fuse_i.h          | 6 ++++++
>>  fs/fuse/inode.c           | 9 +++++++++
>>  include/uapi/linux/fuse.h | 9 ++++++++-
>>  4 files changed, 27 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>> index f39456c65ed7..f9b477016c2e 100644
>> --- a/fs/fuse/file.c
>> +++ b/fs/fuse/file.c
>> @@ -2246,6 +2246,10 @@ static bool fuse_writepage_need_send(struct fuse_conn *fc, struct page *page,
>>  	if (ap->num_pages == data->max_pages && !fuse_pages_realloc(data))
>>  		return true;
>>  
>> +	/* Reached alignment */
>> +	if (fc->opt_alignment && !(page->index % fc->opt_alignment_pages))
>> +		return true;
> 
> I link the idea, but couldn't it just do that with 
> fc->max_pages? I'm asking because fc->opt_alignment_pages
> takes another uint32_t in fuse_init_out and there is not so much
> space left anymore.

I'm okay with resuing max_pages as the alignment constraint.  They are
the same in our internal scenarios.  But I'm not sure if it is the case
in other scenarios.



>>  /*
>> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
>> index 99e44ea7d875..9266b22cce8e 100644
>> --- a/fs/fuse/inode.c
>> +++ b/fs/fuse/inode.c
>> @@ -1331,6 +1331,15 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
>>  			}
>>  			if (flags & FUSE_NO_EXPORT_SUPPORT)
>>  				fm->sb->s_export_op = &fuse_export_fid_operations;
>> +
>> +			/* fallback to default if opt_alignment <= PAGE_SHIFT */
>> +			if (flags & FUSE_OPT_ALIGNMENT) {
>> +				if (arg->opt_alignment > PAGE_SHIFT) {
>> +					fc->opt_alignment = 1;
>> +					fc->opt_alignment_pages = 1 <<
>> +						(arg->opt_alignment - PAGE_SHIFT);
> 
> opt_alignment is the number of bits required for alignment? Not 
> very user friendly, from my point of view would be better to have
> this in a byte or kb unit.
> 

Actually I referred to fuse_init_out.map_alignment, which is also
log2(byte alignment).  Anyway I'm okay making it a more understandable name.

-- 
Thanks,
Jingbo