Message-ID: <51f5b96e-0a7e-4a88-9ba2-2d67c7477dfb@oracle.com>
Date: Fri, 13 Dec 2024 17:15:55 +0000
From: John Garry <john.g.garry@...cle.com>
To: Christoph Hellwig <hch@....de>
Cc: brauner@...nel.org, djwong@...nel.org, cem@...nel.org, dchinner@...hat.com,
        ritesh.list@...il.com, linux-xfs@...r.kernel.org,
        linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
        martin.petersen@...cle.com
Subject: Re: [PATCH v2 0/7] large atomic writes for xfs

On 13/12/2024 14:38, Christoph Hellwig wrote:
> On Tue, Dec 10, 2024 at 12:57:30PM +0000, John Garry wrote:
>> Currently the atomic write unit min and max is fixed at the FS blocksize
>> for xfs and ext4.
>>
>> This series expands support to allow multiple FS blocks to be written
>> atomically.
> 
> Can you explain the workload you're interested in a bit more?

Sure, so some background is that we are using atomic writes for 
MySQL/InnoDB so that we can stop relying on the double-write buffer 
for crash protection. MySQL uses an internal 16K page size (so we 
want 16K atomic writes).
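
For reference, a rough userspace sketch of how an application could 
check whether 16K atomic writes are available on a given file might 
look like the below. It assumes Linux 6.11+ uapi headers (which 
define STATX_WRITE_ATOMIC and the stx_atomic_write_unit_* fields); 
the error handling is just illustrative:

#define _GNU_SOURCE		/* AT_EMPTY_PATH, statx() */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
	struct statx stx;
	int fd;

	if (argc < 2)
		return 1;

	fd = open(argv[1], O_RDONLY);
	if (fd < 0)
		return 1;

	/* STATX_WRITE_ATOMIC asks for the atomic write geometry;
	 * the fields come back as 0 if the FS/device has no support. */
	if (statx(fd, "", AT_EMPTY_PATH, STATX_WRITE_ATOMIC, &stx))
		return 1;

	printf("atomic write unit min=%u max=%u segments max=%u\n",
	       stx.stx_atomic_write_unit_min,
	       stx.stx_atomic_write_unit_max,
	       stx.stx_atomic_write_segments_max);
	return 0;
}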

MySQL has what is known as a REDO log - see 
https://dev.mysql.com/doc/dev/mysql-server/9.0.1/PAGE_INNODB_REDO_LOG.html

Essentially, for any data page we write, we first do a buffered 512B 
log update, followed by a periodic fsync. I think such a pattern is 
common to many applications.
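
As a rough illustration of that write pattern (the record size and 
flush interval here are made up for the sketch, not what InnoDB 
actually does internally):

#include <unistd.h>

#define LOG_REC_SIZE	512	/* one small log update */
#define FSYNC_INTERVAL	64	/* flush every N records (illustrative) */

/* Buffered small append to the log, with a periodic flush to stable
 * storage - the traffic pattern described above. */
static int append_log_record(int log_fd, const void *rec,
			     unsigned long *nr_records)
{
	if (write(log_fd, rec, LOG_REC_SIZE) != LOG_REC_SIZE)
		return -1;

	if (++(*nr_records) % FSYNC_INTERVAL == 0)
		return fsync(log_fd);

	return 0;
}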

> 
> I'm still very scared of expanding use of the large allocation sizes.

Yes

> 
> IIRC you showed some numbers where increasing the FSB size to something
> larger did not look good in your benchmarks, but I'd like to understand
> why.  Do you have a link to these numbers just to refresh everyones minds
> why that wasn't a good idea. 

I don't think that I can share numbers, but I will summarize the findings.

When we tried just using a 16K FS block size, we found that 
performance was poor at low thread counts - even worse than the 
baseline of a 4K FS block size with the double-write buffer. We put 
this down to high write latency for the REDO log. As you can imagine, 
writing a full 16K block for only a 512B update is not efficient in 
terms of traffic generated and adds latency (versus a 4K FS block 
size). At higher thread counts, performance was better. We put that 
down to larger portions of log data being written to the REDO log per 
FS block write.

For the 4K FS block size and 16K atomic writes configs - supported 
via forcealign or an RT volume - performance was generally good 
across the board; forcealign was a bit better.
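
For completeness, the data-page write that replaces the double-write 
buffer in that setup would be something like the below - a sketch 
only, assuming a kernel with RWF_ATOMIC (6.11+), a file opened with 
O_DIRECT, and a 16K buffer/offset aligned per the advertised atomic 
write unit:

#define _GNU_SOURCE		/* pwritev2() */
#include <sys/uio.h>

#ifndef RWF_ATOMIC
#define RWF_ATOMIC	0x00000040	/* Linux 6.11+ uapi value, for older headers */
#endif

#define DATA_PAGE_SIZE	16384

/* One untorn 16K data-page write: the whole page reaches the disk or
 * none of it does, so no separate double-write step is needed. */
static int write_data_page_atomic(int fd, const void *page, off_t offset)
{
	struct iovec iov = {
		.iov_base = (void *)page,
		.iov_len  = DATA_PAGE_SIZE,
	};

	if (pwritev2(fd, &iov, 1, offset, RWF_ATOMIC) != DATA_PAGE_SIZE)
		return -1;

	return 0;
}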

We also tried a hybrid solution with two partitions - one with a 16K 
FS block size for data and one with a 4K FS block size for the REDO 
log. Performance here was also good. Unfortunately, though, this 
config is not fit for production, because we have a requirement to 
take FS snapshots and that is not possible across two FS instances. 
We also considered block device snapshots, but there is reluctance to 
try that as well.

> Did that also include supporting atomic
> writes in the sector size <= write size <= FS block size range, which
> aren't currently supported, but very useful?

I have no use for that so far.

Thanks,
John

