Message-ID: <94d9e935-c8a4-896a-13ac-263831a78dd5@suse.de>
Date: Thu, 22 Jun 2023 08:50:06 +0200
From: Hannes Reinecke <hare@...e.de>
To: Dave Chinner <david@...morbit.com>
Cc: Pankaj Raghav <p.raghav@...sung.com>, willy@...radead.org,
gost.dev@...sung.com, mcgrof@...nel.org, hch@....de,
jwong@...nel.org, linux-fsdevel@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [RFC 0/4] minimum folio order support in filemap
On 6/22/23 07:51, Hannes Reinecke wrote:
> On 6/22/23 00:07, Dave Chinner wrote:
>> On Wed, Jun 21, 2023 at 11:00:24AM +0200, Hannes Reinecke wrote:
>>> On 6/21/23 10:38, Pankaj Raghav wrote:
>>>> There has been a lot of discussion recently about supporting devices
>>>> and filesystems with bs > ps. One of the main pieces of plumbing
>>>> needed to support buffered IO is a minimum order when allocating
>>>> folios in the page cache.
>>>>
>>>> Hannes recently sent a series[1] in which he deduces the minimum folio
>>>> order from i_blkbits in struct inode. Based on the discussion in that
>>>> thread, this series takes a different approach, where the minimum and
>>>> maximum folio order can be set individually per inode.
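
For illustration only, a minimal sketch of that idea: derive the minimum
order from the inode's block size and record it on the mapping. The helper
mapping_set_folio_min_order() is assumed here purely for illustration; the
interface the series actually adds may differ.

#include <linux/fs.h>
#include <linux/pagemap.h>

/*
 * Sketch only, not the actual patch: derive the minimum folio order from
 * the inode's block size and record it on the mapping, so that the page
 * cache never allocates a folio smaller than one filesystem block.
 */
static void example_set_min_folio_order(struct inode *inode)
{
	unsigned int min_order = 0;

	if (inode->i_blkbits > PAGE_SHIFT)
		min_order = inode->i_blkbits - PAGE_SHIFT;

	/* e.g. 16k block size with 4k pages yields order-2 folios at minimum */
	/* hypothetical helper recording the minimum order on the mapping */
	mapping_set_folio_min_order(inode->i_mapping, min_order);
}
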
>>>>
>>>> This series is based on top of Christoph's patches to have iomap aops
>>>> for the block cache[2]. I rebased his remaining patches to
>>>> next-20230621. The whole tree can be found here[3].
>>>>
>>>> Compiling the tree with CONFIG_BUFFER_HEAD=n, I am able to do buffered
>>>> IO on an NVMe drive with bs>ps in QEMU without any issues:
>>>>
>>>> [root@...hlinux ~]# cat /sys/block/nvme0n2/queue/logical_block_size
>>>> 16384
>>>> [root@...hlinux ~]# fio -bs=16k -iodepth=8 -rw=write
>>>> -ioengine=io_uring -size=500M
>>>> -name=io_uring_1 -filename=/dev/nvme0n2 -verify=md5
>>>> io_uring_1: (g=0): rw=write, bs=(R) 16.0KiB-16.0KiB, (W)
>>>> 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=io_uring, iodepth=8
>>>> fio-3.34
>>>> Starting 1 process
>>>> Jobs: 1 (f=1): [V(1)][100.0%][r=336MiB/s][r=21.5k IOPS][eta 00m:00s]
>>>> io_uring_1: (groupid=0, jobs=1): err= 0: pid=285: Wed Jun 21
>>>> 07:58:29 2023
>>>> read: IOPS=27.3k, BW=426MiB/s (447MB/s)(500MiB/1174msec)
>>>> <snip>
>>>> Run status group 0 (all jobs):
>>>> READ: bw=426MiB/s (447MB/s), 426MiB/s-426MiB/s
>>>> (447MB/s-447MB/s), io=500MiB (524MB), run=1174-1174msec
>>>> WRITE: bw=198MiB/s (207MB/s), 198MiB/s-198MiB/s
>>>> (207MB/s-207MB/s), io=500MiB (524MB), run=2527-2527msec
>>>>
>>>> Disk stats (read/write):
>>>> nvme0n2: ios=35614/4297, merge=0/0, ticks=11283/1441,
>>>> in_queue=12725, util=96.27%
>>>>
>>>> One of the main dependencies for working on a block device with bs>ps
>>>> is Christoph's work on converting the block device aops to use iomap.
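
Roughly, that conversion boils down to giving the block device an iomap
->iomap_begin() that maps file offsets 1:1 onto the device. The sketch
below is simplified and only illustrative; the names and details in
Christoph's actual patches [2] may differ.

#include <linux/fs.h>
#include <linux/blkdev.h>
#include <linux/iomap.h>

/*
 * Simplified sketch of a block-device ->iomap_begin(); write paths and
 * most error handling are omitted.
 */
static int example_bdev_iomap_begin(struct inode *inode, loff_t offset,
		loff_t length, unsigned int flags, struct iomap *iomap,
		struct iomap *srcmap)
{
	struct block_device *bdev = I_BDEV(inode);
	loff_t isize = i_size_read(inode);

	/* the block device maps file offsets 1:1 onto the device itself */
	iomap->bdev = bdev;
	iomap->offset = ALIGN_DOWN(offset, bdev_logical_block_size(bdev));
	if (iomap->offset >= isize)
		return -EIO;
	iomap->type = IOMAP_MAPPED;
	iomap->addr = iomap->offset;
	iomap->length = isize - iomap->offset;
	return 0;
}
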
>>>>
>>>> [1] https://lwn.net/Articles/934651/
>>>> [2] https://lwn.net/ml/linux-kernel/20230424054926.26927-1-hch@lst.de/
>>>> [3] https://github.com/Panky-codes/linux/tree/next-20230523-filemap-order-generic-v1
>>>>
>>>> Luis Chamberlain (1):
>>>> block: set mapping order for the block cache in set_init_blocksize
>>>>
>>>> Matthew Wilcox (Oracle) (1):
>>>> fs: Allow fine-grained control of folio sizes
>>>>
>>>> Pankaj Raghav (2):
>>>> filemap: use minimum order while allocating folios
>>>> nvme: enable logical block size > PAGE_SIZE
>>>>
>>>>   block/bdev.c             |  9 ++++++++
>>>>   drivers/nvme/host/core.c |  2 +-
>>>>   include/linux/pagemap.h  | 46 ++++++++++++++++++++++++++++++++++++----
>>>>   mm/filemap.c             |  9 +++++---
>>>>   mm/readahead.c           | 34 ++++++++++++++++++++---------
>>>>   5 files changed, 82 insertions(+), 18 deletions(-)
>>>>
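
Purely as a sketch of the filemap change listed above (not the patch
itself), the allocation path would clamp the requested order so it never
drops below the mapping's minimum. mapping_min_folio_order() is an assumed
accessor used only for illustration.

#include <linux/pagemap.h>

static struct folio *example_filemap_alloc(struct address_space *mapping,
		gfp_t gfp, unsigned int order)
{
	/* hypothetical accessor for the minimum order recorded on the mapping */
	unsigned int min_order = mapping_min_folio_order(mapping);

	/* never hand out a folio smaller than one filesystem block */
	if (order < min_order)
		order = min_order;

	return filemap_alloc_folio(gfp, order);
}
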
>>>
>>> Hmm. Most unfortunate; I've just finished my own patchset (duplicating
>>> much of this work) to get 'brd' running with large folios.
>>> And it even works this time; 'fsx' from the xfstests suite runs happily
>>> on that.
>>
>> So you've converted a filesystem to use bs > ps, too? Or is the
>> filesystem that fsx is running on just using normal 4kB block size?
>> If the latter, then fsx is not actually testing the large folio page
>> cache support, it's mostly just doing 4kB aligned IO to brd....
>>
> I have been running fsx on an XFS filesystem with bs=16k, and it worked
> like a charm.
> I'll try to run the xfstests suite once I'm finished with merging
> Pankaj's patches into my patchset.
Well, would've been too easy.
'fsx' bails out at test 27 (collapse), with:
XFS (ram0): Corruption detected. Unmount and run xfs_repair
XFS (ram0): Internal error isnullstartblock(got.br_startblock) at line 5787 of file fs/xfs/libxfs/xfs_bmap.c. Caller xfs_bmap_collapse_extents+0x2d9/0x320 [xfs]
Guess some more work needs to be done here.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@...e.de +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew Myers,
Andrew McDonald, Martje Boudien Moerman