linux-kernel - Re: [PATCH v2 0/7] large atomic writes for xfs

Message-ID: <37cab50b-5791-4840-b7b7-c67d3878fced@oracle.com>
Date: Mon, 16 Dec 2024 08:40:34 +0000
From: John Garry <john.g.garry@...cle.com>
To: "Darrick J. Wong" <djwong@...nel.org>, Christoph Hellwig <hch@....de>
Cc: brauner@...nel.org, cem@...nel.org, dchinner@...hat.com,
        ritesh.list@...il.com, linux-xfs@...r.kernel.org,
        linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
        martin.petersen@...cle.com
Subject: Re: [PATCH v2 0/7] large atomic writes for xfs


>>
>> Yeah, at the low end, it may make sense to do the 512B write via DIO. But
>> OTOH sync'ing many redo log FS blocks at once at the high end can be more
>> efficient.
>>
>> From what I have heard, this was attempted before (using DIO) by some
>> vendor, but did not come to much.
>>
>> So it seems that we are stuck with this redo log limitation.
>>
>> Let me know if you have any other ideas to avoid large atomic writes...
> 
> From the description it sounds like the redo log consists of 512b blocks
> that describe small changes to the 16k table file pages.  If they're
> issuing 16k atomic writes to get each of those 512b redo log records to
> disk it's no wonder that cranks up the overhead substantially. 

They are not issuing the redo log atomically. They do 512B buffered 
writes and then periodically fsync.
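
For illustration only, the pattern described above (small buffered writes, then a periodic fsync) could be sketched roughly as follows. This is not MySQL code: the helper name append_redo_records(), the sync_every parameter, and the record contents are all hypothetical, and the only detail taken from the thread is the 512B record size.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define REDO_RECORD_SIZE 512	/* record size mentioned in the thread */

/*
 * Hypothetical helper: append `count` 512B records using plain buffered
 * write() calls, and call fsync() after every `sync_every` records plus
 * once at the end if any records are still unsynced.
 * Returns the number of fsync() calls made, or -1 on error.
 */
int append_redo_records(const char *path, int count, int sync_every)
{
	char record[REDO_RECORD_SIZE];
	int syncs = 0;
	int pending = 0;
	int fd;

	if (count < 0 || sync_every <= 0)
		return -1;

	fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
	if (fd < 0)
		return -1;

	for (int i = 0; i < count; i++) {
		/* Dummy payload; a real redo record would describe a change. */
		memset(record, 'A' + (i % 26), sizeof(record));

		/* Buffered write: lands in the page cache, not yet durable.
		 * A short write is simply treated as an error in this sketch. */
		if (write(fd, record, sizeof(record)) != (ssize_t)sizeof(record)) {
			close(fd);
			return -1;
		}

		if (++pending == sync_every) {
			/* Periodic fsync makes the batch durable in one go. */
			if (fsync(fd) != 0) {
				close(fd);
				return -1;
			}
			syncs++;
			pending = 0;
		}
	}

	if (pending > 0) {
		if (fsync(fd) != 0) {
			close(fd);
			return -1;
		}
		syncs++;
	}

	close(fd);
	return syncs;
}
```

None of these writes uses RWF_ATOMIC; durability comes only from the fsync calls, which is why batching many records per fsync can pay off at the high end.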

> Also,
> replaying those tiny updates through the pagecache beats issuing a bunch
> of tiny nonlocalized writes.
> 
> For the first case I don't know why they need atomic writes -- 512b redo
> log records can't be torn because they're single-sector writes.  The
> second case might be better done with exchange-range.
> 

As for exchange-range, that would very much pre-date any MySQL port. 
Furthermore, I can't imagine that exchange-range support is portable to 
other FSes, which is probably quite important. Anyway, they are not 
issuing the redo log atomically, so I don't know if mentioning 
exchange-range is relevant.

Regardless of what MySQL is specifically doing here, there are going to 
be other users/applications which want to keep a 4K FS blocksize and do 
larger atomic writes.
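
As a rough sketch of what "larger atomic writes" means for such an application, the check below mirrors the documented rules for RWF_ATOMIC writes: the length must be a power of two within the range reported by statx() (stx_atomic_write_unit_min and stx_atomic_write_unit_max), and the offset must be naturally aligned to that length. The function name atomic_write_ok() is hypothetical, and the limits used in the usage note are illustrative values, not what any particular filesystem reports.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: return true if a write of `len` bytes at `offset`
 * satisfies the size and alignment rules for an untorn (RWF_ATOMIC) write,
 * given the unit_min/unit_max limits an application would read via statx().
 */
bool atomic_write_ok(uint64_t offset, uint64_t len,
		     uint64_t unit_min, uint64_t unit_max)
{
	/* Length must be a non-zero power of two. */
	if (len == 0 || (len & (len - 1)) != 0)
		return false;

	/* Length must fall within the advertised atomic write unit range. */
	if (len < unit_min || len > unit_max)
		return false;

	/* Offset must be naturally aligned to the write length. */
	return (offset % len) == 0;
}
```

With an assumed unit_min of 4096 (the FS block size) and an assumed unit_max of 65536, a 16K write at offset 16384 would qualify, while the same 16K write at offset 4096 would not, because it is not aligned to its own length.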

Thanks,
John