linux-kernel - Re: [PATCH 19/26] netfs: New writeback implementation

Message-ID: <08dd01e3-c45e-47d9-bcde-55f7d1edc480@linux.dev>
Date: Fri, 29 Mar 2024 18:06:09 -0700
From: Vadim Fedorenko <vadim.fedorenko@...ux.dev>
To: Naveen Mamindlapalli <naveenm@...vell.com>,
 David Howells <dhowells@...hat.com>, Christian Brauner
 <christian@...uner.io>, Jeff Layton <jlayton@...nel.org>,
 Gao Xiang <hsiangkao@...ux.alibaba.com>,
 Dominique Martinet <asmadeus@...ewreck.org>
Cc: Matthew Wilcox <willy@...radead.org>, Steve French <smfrench@...il.com>,
 Marc Dionne <marc.dionne@...istor.com>, Paulo Alcantara <pc@...guebit.com>,
 Shyam Prasad N <sprasad@...rosoft.com>, Tom Talpey <tom@...pey.com>,
 Eric Van Hensbergen <ericvh@...nel.org>, Ilya Dryomov <idryomov@...il.com>,
 "netfs@...ts.linux.dev" <netfs@...ts.linux.dev>,
 "linux-cachefs@...hat.com" <linux-cachefs@...hat.com>,
 "linux-afs@...ts.infradead.org" <linux-afs@...ts.infradead.org>,
 "linux-cifs@...r.kernel.org" <linux-cifs@...r.kernel.org>,
 "linux-nfs@...r.kernel.org" <linux-nfs@...r.kernel.org>,
 "ceph-devel@...r.kernel.org" <ceph-devel@...r.kernel.org>,
 "v9fs@...ts.linux.dev" <v9fs@...ts.linux.dev>,
 "linux-erofs@...ts.ozlabs.org" <linux-erofs@...ts.ozlabs.org>,
 "linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
 "linux-mm@...ck.org" <linux-mm@...ck.org>,
 "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
 "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
 Latchesar Ionkov <lucho@...kov.net>,
 Christian Schoenebeck <linux_oss@...debyte.com>
Subject: Re: [PATCH 19/26] netfs: New writeback implementation

On 29/03/2024 10:34, Naveen Mamindlapalli wrote:
>> -----Original Message-----
>> From: David Howells <dhowells@...hat.com>
>> Sent: Thursday, March 28, 2024 10:04 PM
>> To: Christian Brauner <christian@...uner.io>; Jeff Layton <jlayton@...nel.org>;
>> Gao Xiang <hsiangkao@...ux.alibaba.com>; Dominique Martinet
>> <asmadeus@...ewreck.org>
>> Cc: David Howells <dhowells@...hat.com>; Matthew Wilcox
>> <willy@...radead.org>; Steve French <smfrench@...il.com>; Marc Dionne
>> <marc.dionne@...istor.com>; Paulo Alcantara <pc@...guebit.com>; Shyam
>> Prasad N <sprasad@...rosoft.com>; Tom Talpey <tom@...pey.com>; Eric Van
>> Hensbergen <ericvh@...nel.org>; Ilya Dryomov <idryomov@...il.com>;
>> netfs@...ts.linux.dev; linux-cachefs@...hat.com; linux-afs@...ts.infradead.org;
>> linux-cifs@...r.kernel.org; linux-nfs@...r.kernel.org;
>> ceph-devel@...r.kernel.org; v9fs@...ts.linux.dev; linux-erofs@...ts.ozlabs.org;
>> linux-fsdevel@...r.kernel.org; linux-mm@...ck.org; netdev@...r.kernel.org;
>> linux-kernel@...r.kernel.org; Latchesar Ionkov <lucho@...kov.net>;
>> Christian Schoenebeck <linux_oss@...debyte.com>
>> Subject: [PATCH 19/26] netfs: New writeback implementation
>>
>> The current netfslib writeback implementation creates writeback requests of
>> contiguous folio data and then separately tiles subrequests over the space
>> twice, once for the server and once for the cache.  This creates a few
>> issues:
>>
>>   (1) Every time there's a discontiguity or a change between writing to only
>>       one destination or writing to both, it must create a new request.
>>       This makes it harder to do vectored writes.
>>
>>   (2) The folios don't have the writeback mark removed until the end of the
>>       request - and a request could be hundreds of megabytes.
>>
>>   (3) In future, I want to support a larger cache granularity, which will
>>       require aggregation of some folios that contain unmodified data (which
>>       only need to go to the cache) and some which contain modifications
>>       (which need to be uploaded and stored to the cache) - but, currently,
>>       these are treated as discontiguous.
>>
>> There's also a move to get everyone to use writeback_iter() to extract
>> writable folios from the pagecache.  That said, currently writeback_iter()
>> has some issues that make it less than ideal:
>>
>>   (1) there's no way to cancel the iteration, even if you find a "temporary"
>>       error that means the current folio and all subsequent folios are going
>>       to fail;
>>
>>   (2) there's no way to filter the folios being written back - something
>>       that will impact Ceph with its ordered snap system;
>>
>>   (3) and if you get a folio you can't immediately deal with (say you need
>>       to flush the preceding writes), you are left with a folio hanging in
>>       the locked state for the duration, when really we should unlock it and
>>       relock it later.
>>
>> In this new implementation, I use writeback_iter() to pump folios,
>> progressively creating two parallel, but separate streams and cleaning up
>> the finished folios as the subrequests complete.  Either or both streams
>> can contain gaps, and the subrequests in each stream can be of variable
>> size, don't need to align with each other and don't need to align with the
>> folios.
>>
>> Indeed, subrequests can cross folio boundaries, may cover several folios or
>> a folio may be spanned by multiple subrequests, e.g.:
>>
>>           +---+---+-----+-----+---+----------+
>> Folios:  |   |   |     |     |   |          |
>>           +---+---+-----+-----+---+----------+
>>
>>             +------+------+     +----+----+
>> Upload:    |      |      |.....|    |    |
>>             +------+------+     +----+----+
>>
>>           +------+------+------+------+------+
>> Cache:   |      |      |      |      |      |
>>           +------+------+------+------+------+
>>
>> The progressive subrequest construction permits the algorithm to be
>> preparing both the next upload to the server and the next write to the
>> cache whilst the previous ones are already in progress.  Throttling can be
>> applied to control the rate of production of subrequests - and, in any
>> case, we probably want to write them to the server in ascending order,
>> particularly if the file will be extended.
>>
>> Content crypto can also be prepared at the same time as the subrequests and
>> run asynchronously, with the prepped requests being stalled until the
>> crypto catches up with them.  This might also be useful for transport
>> crypto, but that happens at a lower layer, so probably would be harder to
>> pull off.
>>
>> The algorithm is split into three parts:
>>
>>   (1) The issuer.  This walks through the data, packaging it up, encrypting
>>       it and creating subrequests.  The part of this that generates
>>       subrequests only deals with file positions and spans and so is usable
>>       for DIO/unbuffered writes as well as buffered writes.
>>
>>   (2) The collector. This asynchronously collects completed subrequests,
>>       unlocks folios, frees crypto buffers and performs any retries.  This
>>       runs in a work queue so that the issuer can return to the caller for
>>       writeback (so that the VM can have its kswapd thread back) or async
>>       writes.
>>
>>   (3) The retryer.  This pauses the issuer, waits for all outstanding
>>       subrequests to complete and then goes through the failed subrequests
>>       to reissue them.  This may involve reprepping them (with cifs, the
>>       credits must be renegotiated, and a subrequest may need splitting),
>>       and doing RMW for content crypto if there's a conflicting change on
>>       the server.
>>
>> [!] Note that some of the functions are prefixed with "new_" to avoid
>> clashes with existing functions.  These will be renamed in a later patch
>> that cuts over to the new algorithm.
>>
>> Signed-off-by: David Howells <dhowells@...hat.com>
>> cc: Jeff Layton <jlayton@...nel.org>
>> cc: Eric Van Hensbergen <ericvh@...nel.org>
>> cc: Latchesar Ionkov <lucho@...kov.net>
>> cc: Dominique Martinet <asmadeus@...ewreck.org>
>> cc: Christian Schoenebeck <linux_oss@...debyte.com>
>> cc: Marc Dionne <marc.dionne@...istor.com>
>> cc: v9fs@...ts.linux.dev
>> cc: linux-afs@...ts.infradead.org
>> cc: netfs@...ts.linux.dev
>> cc: linux-fsdevel@...r.kernel.org
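
As a reading aid for the folio/subrequest diagram in the commit message above, here is a minimal userspace sketch of how byte ranges in one stream can overlap folios without aligning to them. It is not netfs code: `struct span`, `spans_overlap()` and `subreqs_touching()` are hypothetical names, and the sketch assumes every range has a non-zero length.

```c
#include <assert.h>
#include <stddef.h>

/* A half-open byte range [start, start + len) in the file; len > 0. */
struct span {
	unsigned long long start;
	unsigned long long len;
};

/* Return 1 if the two ranges share at least one byte, 0 otherwise. */
static int spans_overlap(struct span a, struct span b)
{
	return a.start < b.start + b.len && b.start < a.start + a.len;
}

/*
 * Count how many subrequests in one stream touch the given folio.
 * A result of 0 means the folio falls in a gap in that stream; a result
 * greater than 1 means the folio is spanned by several subrequests.
 */
static size_t subreqs_touching(struct span folio,
			       const struct span *subreqs, size_t n)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++)
		if (spans_overlap(folio, subreqs[i]))
			count++;
	return count;
}
```

With 4096-byte folios and two subrequests covering bytes 0-5999 and 6000-11999, the second folio is touched by both subrequests, while the fourth folio is touched by neither. A collector along these lines would presumably have to wait for every touching subrequest before cleaning a folio, but that part is not modelled here.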

[..snip..]

>> +/*
>> + * Begin a write operation for writing through the pagecache.
>> + */
>> +struct netfs_io_request *new_netfs_begin_writethrough(struct kiocb *iocb, size_t len)
>> +{
>> +	struct netfs_io_request *wreq = NULL;
>> +	struct netfs_inode *ictx = netfs_inode(file_inode(iocb->ki_filp));
>> +
>> +	mutex_lock(&ictx->wb_lock);
>> +
>> +	wreq = netfs_create_write_req(iocb->ki_filp->f_mapping, iocb->ki_filp,
>> +				      iocb->ki_pos, NETFS_WRITETHROUGH);
>> +	if (IS_ERR(wreq))
>> +		mutex_unlock(&ictx->wb_lock);
>> +
>> +	wreq->io_streams[0].avail = true;
>> +	trace_netfs_write(wreq, netfs_write_trace_writethrough);
> 
> Missing mutex_unlock() before return.
> 

mutex_unlock() happens in new_netfs_end_writethrough()
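
To illustrate the pattern, here is a minimal userspace sketch using pthreads in which a "begin" function returns with the lock held and a matching "end" function releases it. It is not the netfs implementation: `struct ctx`, `struct req`, `begin_op()` and `end_op()` are hypothetical names, and the `fail` parameter only simulates request creation failing. The sketch returns immediately on the failure path after dropping the lock; the quoted hunk as shown does not include a return there.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-inode context that owns the lock. */
struct ctx {
	pthread_mutex_t lock;
	bool in_progress;	/* only changed while the lock is held */
};

/* Hypothetical stand-in for the write request. */
struct req {
	struct ctx *owner;
};

/*
 * Begin an operation.  On success the lock stays held and responsibility
 * for releasing it passes to the caller, who must call end_op().  On
 * failure the lock is dropped here and NULL is returned.
 */
static struct req *begin_op(struct ctx *c, struct req *storage, bool fail)
{
	pthread_mutex_lock(&c->lock);
	if (fail) {
		pthread_mutex_unlock(&c->lock);
		return NULL;	/* do not fall through and touch the request */
	}
	storage->owner = c;
	c->in_progress = true;
	return storage;
}

/* End the operation and release the lock taken in begin_op(). */
static void end_op(struct req *r)
{
	r->owner->in_progress = false;
	pthread_mutex_unlock(&r->owner->lock);
}
```

A caller pairs the two: `r = begin_op(&c, &storage, false);`, then the work, then `end_op(r);`. The cost of this design is that the unlock is not visible in the function that takes the lock, which is what prompted the review comment.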

> Thanks,
> Naveen
>