Message-Id: <20240922083148.10070-1-00107082@163.com>
Date: Sun, 22 Sep 2024 16:31:48 +0800
From: David Wang <00107082@....com>
To: 00107082@....com,
	kent.overstreet@...ux.dev
Cc: linux-bcachefs@...r.kernel.org,
	linux-kernel@...r.kernel.org
Subject: Re: [BUG?] bcachefs performance: read is way too slow when a file has no overwrite

>Hi, 
>
>At 2024-09-22 00:12:01, "Kent Overstreet" <kent.overstreet@...ux.dev> wrote:
>>On Sun, Sep 22, 2024 at 12:02:07AM GMT, David Wang wrote:
>>> Hi, 
>>> 
>>> At 2024-09-09 21:37:35, "Kent Overstreet" <kent.overstreet@...ux.dev> wrote:
>>> >On Sat, Sep 07, 2024 at 06:34:37PM GMT, David Wang wrote:
>>> 
>>> >
>>> >Big standard deviation (high tail latency?) is something we'd want to
>>> >track down. There's a bunch of time_stats in sysfs, but they're mostly
>>> >for the write paths. If you're trying to identify where the latencies
>>> >are coming from, we can look at adding some new time stats to isolate.
>>> 
>>> About performance, I have a theory based on some observations I made recently:
>>> When a user-space app makes a 4K (8-sector) direct write,
>>> bcachefs initiates a write request of ~11 sectors, including the checksum data, right?
>>> That may not be a good offset+size pattern for block-layer performance.
>>> (I did get very bad performance on ext4 when writing with a 5K size.)
>>
>>The checksum isn't inline with the data, it's stored with the pointer -
>>so if you're seeing 11 sector writes, something really odd is going
>>on...
>>
>
>.... This really contradicts my observations:
>1. fio reports an average of 50K IOPS for a 400-second random direct-write test.
>2. From /proc/diskstats, the average "Field 5 -- # of writes completed" per second is also 50K.
>(From this I conclude the performance issue is not caused by extra IOPS for checksums.)
>3. From "Field 10 -- # of milliseconds spent doing I/Os", the average disk "busy" time per second is about ~0.9 seconds, similar to the result of the ext4 test.
>(From this I conclude the performance issue is not caused by under-driving the disk.)
>4. delta(Field 7 -- # of sectors written) / delta(Field 5 -- # of writes completed) over a 5-minute interval is ~11 sectors/write.
>(This is why I formed the theory that the checksum is stored with the raw data......I thought it was reasonable...)
>
>I will make some debug code to collect sector number patterns.
>
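For reference, the sectors-per-write figure in item 4 above can be computed from two /proc/diskstats snapshots with a short sketch like this (the snapshot lines are made-up sample values, not my real measurements; field numbering follows Documentation/admin-guide/iostats.rst):

```python
# Sketch: sectors-per-write from two /proc/diskstats snapshots.
# After the major/minor/name tokens, Field N sits at token index N + 2,
# so Field 5 (writes completed) is index 7 and Field 7 (sectors
# written) is index 9.

def sectors_per_write(before: str, after: str) -> float:
    b, a = before.split(), after.split()
    d_writes = int(a[7]) - int(b[7])    # Field 5: writes completed
    d_sectors = int(a[9]) - int(b[9])   # Field 7: sectors written
    return d_sectors / d_writes

# Made-up sample snapshots for illustration only:
snap0 = "259 1 nvme0n1p1 100 0 800 50 1000000 0 11000000 900 0 400 950"
snap1 = "259 1 nvme0n1p1 100 0 800 50 16000000 0 176000000 900 0 400 950"
print(sectors_per_write(snap0, snap1))  # 11.0 sectors/write on this sample
```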

I collected sector counts at the beginning of submit_bio in block/blk-core.c.
It turns out my guess was totally wrong: the user data is a clean 8 sectors, and the ~11 sectors
I observed was just the average sectors per write. Sorry, I assumed too much; I thought each user write
would be accompanied by a checksum write.....
During a stress direct-4K-write test, the top-20 write sector-count pattern is:
	+---------+------------+
	| sectors | percentage |
	+---------+------------+
	|    8    |  97.637%   |
	|    1    |   0.813%   |   
	|   510   |   0.315%   |  <== large <--journal_write_submit
	|    4    |   0.123%   |
	|    3    |   0.118%   |
	|    2    |   0.117%   |
	|   508   |   0.113%   |  <==
	|   509   |   0.094%   |  <==
	|    5    |   0.075%   |
	|    6    |   0.037%   |
	|   507   |   0.032%   |  <==
	|    14   |   0.024%   |
	|    13   |   0.020%   |
	|    11   |   0.020%   |
	|    15   |   0.020%   |
	|    10   |   0.020%   |
	|    16   |   0.018%   |
	|    12   |   0.018%   |
	|    7    |   0.017%   |
	|    20   |   0.017%   |
	+---------+------------+
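As a sanity check (a quick sketch using only the top-20 percentages above; the remaining ~0.35% of writes are ignored), the weighted average of this distribution already lands near the ~11 sectors/write seen in /proc/diskstats -- the few large journal writes pull the mean well above 8:

```python
# Weighted average sector count of the top-20 write pattern above.
pattern = {
    8: 97.637, 1: 0.813, 510: 0.315, 4: 0.123, 3: 0.118,
    2: 0.117, 508: 0.113, 509: 0.094, 5: 0.075, 6: 0.037,
    507: 0.032, 14: 0.024, 13: 0.020, 11: 0.020, 15: 0.020,
    10: 0.020, 16: 0.018, 12: 0.018, 7: 0.017, 20: 0.017,
}
avg = sum(s * p for s, p in pattern.items()) / sum(pattern.values())
print(round(avg, 1))  # ~10.7 sectors/write
```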

The btree_io write pattern, collected from btree_node_write_endio,
is distributed rather uniformly/flatly, not on block-friendly size
boundaries (I think):
	+---------+------------+
	| sectors | percentage |
	+---------+------------+
	|    1    |   9.021%   |
	|    3    |   1.440%   |
	|    4    |   1.249%   |
	|    2    |   1.157%   |
	|    5    |   0.804%   |
	|    6    |   0.409%   |
	|    14   |   0.259%   |
	|    15   |   0.253%   |
	|    16   |   0.228%   |
	|    7    |   0.226%   |
	|    11   |   0.223%   |
	|    10   |   0.223%   |
	|    13   |   0.222%   |
	|    9    |   0.213%   |
	|    12   |   0.202%   |
	|    41   |   0.194%   |
	|    17   |   0.183%   |
	|    8    |   0.182%   |
	|    18   |   0.167%   |
	|    20   |   0.167%   |
	|    19   |   0.163%   |
	|    21   |   0.160%   |
	|   205   |   0.158%   |
	|    22   |   0.145%   |
	|    23   |   0.117%   |
	|    24   |   0.093%   |
	|    51   |   0.089%   |
	|    25   |   0.080%   |
	|   204   |   0.079%   |
	+---------+------------+


Now, it seems that journal_io's big chunks of IO and btree_io's
irregular IO sizes are the main factors halving direct-4K-write
user-IO bandwidth compared with ext4.


Maybe btree_io's irregular IO sizes could be regularized?
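Purely as an illustration of what "regularized" could mean (a hypothetical sketch, not bcachefs code -- whether padding btree writes is actually feasible or worthwhile is a separate question), rounding each write up to the next 4K boundary would look like:

```python
# Hypothetical sketch: round a write up to the next 8-sector (4K)
# boundary, so every btree write lands on a block-friendly size.
def pad_to_block(sectors: int, block_sectors: int = 8) -> int:
    # Ceiling division, then scale back up to whole blocks.
    return -(-sectors // block_sectors) * block_sectors

print(pad_to_block(5))   # 8
print(pad_to_block(11))  # 16
print(pad_to_block(16))  # 16
```

The cost would be some write amplification on small btree updates, so this is only a sketch of the idea, not a recommendation.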

> 
>
>
>>I would suggest doing some testing with data checksums off first, to
>>isolate the issue; then it sounds like that IO pattern needs to be
>>looked at.
>
>I will try it. 

I formatted the partition with
`sudo bcachefs format --metadata_checksum=none --data_checksum=none /dev/nvme0n1p1`

It doesn't help write performance significantly:
"IOPS=53.3k, BW=208MiB/s" --> "IOPS=55.3k, BW=216MiB/s",
and the btree writes' irregular IO-size pattern still shows up.

But it does improve direct-4K-read performance significantly. I guess that is to be expected,
since no extra data needs to be fetched for each read.

> 
>>
>>Check the extents btree in debugfs as well, to make sure the extents are
>>getting written out as you think they are.


Thanks
David

