Message-ID: <f69544.2e70.191e419e656.Coremail.00107082@163.com>
Date: Thu, 12 Sep 2024 10:39:48 +0800 (CST)
From: "David Wang" <00107082@....com>
To: "Kent Overstreet" <kent.overstreet@...ux.dev>
Cc: linux-bcachefs@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [BUG?] bcachefs performance: read is way too slow when a file
has no overwrite.
Hi,
At 2024-09-09 21:37:35, "Kent Overstreet" <kent.overstreet@...ux.dev> wrote:
>On Sat, Sep 07, 2024 at 06:34:37PM GMT, David Wang wrote:
>>
>> Based on the result:
>> 1. The row with prepare-write size 4K stands out here.
>> When files were prepared with a write size of 4K, the subsequent
>> read performance is worse. (I did double-check the result,
>> but it is possible that I missed some relevant factors.)
>
>On small blocksize tests you should be looking at IOPS, not MB/s.
>
>Prepare-write size is the column?
Each row is for a specific prepare-write size, indicated by the first column.
>
>Another factor is that we do merge extents (including checksums); so if
>the prepare-write is done sequentially we won't actually end up
>with extents of the same size as what we wrote.
>
>I believe there's a knob somewhere to turn off extent merging (module
>parameter? it's intended for debugging).
I did some debugging: when performance is bad, the conditions
bvec_iter_sectors(iter) != pick.crc.uncompressed_size and
bvec_iter_sectors(iter) != pick.crc.live_size are almost always both true,
while when performance is good (after a "thorough" write), they are true
only for a small fraction of reads (~350 out of 1000000).
And when those conditions are true, "bounce" gets set and the code seems to
take a time-consuming path.
I suspect a plain read can never change those conditions, but a write can?
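To double-check my reading of that condition, I put together a tiny userspace
model of the decision (just my simplified paraphrase for illustration; only
bvec_iter_sectors(iter), pick.crc.uncompressed_size, pick.crc.live_size and
"bounce" come from the kernel read path, everything else here is made up):

#include <stdbool.h>
#include <stdio.h>

/*
 * Simplified model of the check I am looking at: a read covering
 * `read_sectors` (bvec_iter_sectors(iter) in the kernel) against an
 * extent whose checksum covers `crc_uncompressed_size` sectors, of
 * which `crc_live_size` are live.  When the sizes do not match, the
 * read path sets "bounce" and has to read/checksum more than the
 * requested range.
 */
static bool needs_bounce(unsigned read_sectors,
                         unsigned crc_uncompressed_size,
                         unsigned crc_live_size)
{
        return read_sectors != crc_uncompressed_size ||
               read_sectors != crc_live_size;
}

int main(void)
{
        /* 4K read (8 sectors of 512 bytes) from a file whose 4K
         * prepare-writes got merged into one 128K extent (256 sectors):
         * mismatch, so bounce. */
        printf("merged extent:   %d\n", needs_bounce(8, 256, 256));

        /* 4K read from an extent that is itself exactly 4K: no bounce. */
        printf("matching extent: %d\n", needs_bounce(8, 8, 8));
        return 0;
}

If the 4K prepare-writes were merged into larger extents as you describe,
a 4K read would hit the mismatch on almost every call, which would match
what I see above.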
>
>> 2. Without O_DIRECT, read performance seems correlated with the difference
>> between the read size and the prepare-write size, but with O_DIRECT, the correlation is not obvious.
>
>So the O_DIRECT and buffered IO paths are very different (in every
>filesystem) - you're looking at very different things. They are both
>subject to the checksum granularity issue, but in buffered mode we round
>reads up to the extent size when filling the page cache.
>
>Big standard deviation (high tail latency?) is something we'd want to
>track down. There's a bunch of time_stats in sysfs, but they're mostly
>for the write paths. If you're trying to identify where the latencies
>are coming from, we can look at adding some new time stats to isolate.
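For the tail latency question, a minimal userspace sketch like the following
could collect per-read latency percentiles around pread() (this is not the
exact benchmark used for the table above, and it assumes the target file is
at least 4MB); the numbers might help correlate with any new time stats:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define NR_READS  100000
#define BUF_SIZE  4096
#define NR_BLOCKS 1024          /* sample offsets within the first 4MB */

static int cmp_u64(const void *a, const void *b)
{
        unsigned long long x = *(const unsigned long long *)a;
        unsigned long long y = *(const unsigned long long *)b;

        return x < y ? -1 : x > y;
}

int main(int argc, char **argv)
{
        static unsigned long long lat[NR_READS];
        char buf[BUF_SIZE];
        struct timespec t0, t1;
        int fd;

        if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
                return 1;

        for (int i = 0; i < NR_READS; i++) {
                off_t off = (off_t)(rand() % NR_BLOCKS) * BUF_SIZE;

                clock_gettime(CLOCK_MONOTONIC, &t0);
                if (pread(fd, buf, BUF_SIZE, off) != BUF_SIZE)
                        return 1;
                clock_gettime(CLOCK_MONOTONIC, &t1);

                lat[i] = (t1.tv_sec - t0.tv_sec) * 1000000000ULL +
                         (t1.tv_nsec - t0.tv_nsec);
        }

        /* Sort and report the percentiles behind the big standard
         * deviation seen earlier. */
        qsort(lat, NR_READS, sizeof(lat[0]), cmp_u64);
        printf("p50 %llu ns  p99 %llu ns  max %llu ns\n",
               lat[NR_READS / 2], lat[NR_READS * 99 / 100],
               lat[NR_READS - 1]);

        close(fd);
        return 0;
}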