Message-ID: <dc019764-5128-526e-d8ea-effa78e37b39@redhat.com>
Date: Thu, 12 Jun 2025 18:57:26 +0200 (CEST)
From: Mikulas Patocka <mpatocka@...hat.com>
To: Dongsheng Yang <dongsheng.yang@...ux.dev>
cc: agk@...hat.com, snitzer@...nel.org, axboe@...nel.dk, hch@....de,
dan.j.williams@...el.com, Jonathan.Cameron@...wei.com,
linux-block@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-cxl@...r.kernel.org, nvdimm@...ts.linux.dev,
dm-devel@...ts.linux.dev
Subject: Re: [RFC v2 00/11] dm-pcache – persistent-memory cache for block devices
Hi
On Thu, 5 Jun 2025, Dongsheng Yang wrote:
> Hi Mikulas and all,
>
> This is *RFC v2* of the *pcache* series, a persistent-memory backed cache.
> Compared with *RFC v1*
> <https://lore.kernel.org/lkml/20250414014505.20477-1-dongsheng.yang@linux.dev/>
> the most important change is that the whole cache has been *ported to
> the Device-Mapper framework* and is now exposed as a regular DM target.
>
> Code:
> https://github.com/DataTravelGuide/linux/tree/dm-pcache
>
> Full RFC v2 test results:
> https://datatravelguide.github.io/dtg-blog/pcache/pcache_rfc_v2_result/results.html
>
> All 962 xfstests cases passed successfully under four different
> pcache configurations.
>
> One of the detailed xfstests run:
> https://datatravelguide.github.io/dtg-blog/pcache/pcache_rfc_v2_result/test-results/02-._pcache.py_PcacheTest.test_run-crc-enable-gc-gc0-test_script-xfstests-a515/debug.log
>
> Below is a quick tour through the three layers of the implementation,
> followed by an example invocation.
>
> ----------------------------------------------------------------------
> 1. pmem access layer
> ----------------------------------------------------------------------
>
> * All reads use *copy_mc_to_kernel()* so that uncorrectable media
> errors are detected and reported.
> * All writes go through *memcpy_flushcache()* to guarantee durability
> on real persistent memory.
You could also try using a normal write plus clflushopt for big writes - I
found that it is faster for larger regions - see the function
memcpy_flushcache_optimized in dm-writecache. Test which way is better.
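A userspace sketch of the size dispatch I mean (the 768-byte threshold
mirrors dm-writecache; dm-pcache would have to benchmark its own crossover
point, and the actual flush instructions are elided - only the dispatch is
shown):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sketch of the dispatch in dm-writecache's memcpy_flushcache_optimized():
 * small writes go through the flushcache path (memcpy_flushcache in the
 * kernel), large ones use an ordinary memcpy followed by clflushopt on
 * each 64-byte cache line.
 */
enum copy_path { PATH_FLUSHCACHE, PATH_CLFLUSHOPT };

static enum copy_path copy_durable(void *dest, const void *src, size_t size)
{
	if (size >= 768) {
		const char *s = src;
		char *d = dest;

		while (size >= 64) {
			memcpy(d, s, 64);
			/* clflushopt(d) would go here in kernel code */
			d += 64; s += 64; size -= 64;
		}
		if (size)
			memcpy(d, s, size);	/* tail < 64 bytes */
		return PATH_CLFLUSHOPT;
	}
	/* memcpy_flushcache(dest, src, size) in kernel code */
	memcpy(dest, src, size);
	return PATH_FLUSHCACHE;
}
```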
> ----------------------------------------------------------------------
> 2. cache-logic layer (segments / keys / workers)
> ----------------------------------------------------------------------
>
> Main features
> - 16 MiB pmem segments, log-structured allocation.
> - Multi-subtree RB-tree index for high parallelism.
> - Optional per-entry *CRC32* on cached data.
Would it be better to use crc32c because it has hardware support in the
SSE4.2 instruction set?
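For reference, crc32c uses the Castagnoli polynomial (reflected form
0x82F63B78), which is what the SSE4.2 crc32 instruction computes in
hardware; the kernel already exposes it as crc32c(). A bit-by-bit
reference form, only to show the difference from the classic CRC-32
polynomial (0xEDB88320):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
 * A real target would call the kernel's crc32c() helper, which is
 * hardware-accelerated on CPUs with SSE4.2.
 */
static uint32_t crc32c_sw(uint32_t crc, const void *data, size_t len)
{
	const uint8_t *p = data;
	int k;

	crc = ~crc;
	while (len--) {
		crc ^= *p++;
		for (k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78 & -(crc & 1));
	}
	return ~crc;
}
```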
> - Background *write-back* worker and watermark-driven *GC*.
> - Crash-safe replay: key-sets are scanned from *key_tail* on start-up.
>
> Current limitations
> - Only *write-back* mode implemented.
> - Only FIFO cache invalidation; others (LRU, ARC...) planned.
>
> ----------------------------------------------------------------------
> 3. dm-pcache target integration
> ----------------------------------------------------------------------
>
> * Table line
> `pcache <pmem_dev> <origin_dev> writeback <true|false>`
> * Features advertised to DM:
> - `ti->flush_supported = true`, so *PREFLUSH* and *FUA* are honoured
> (they force all open key-sets to close and data to be durable).
> * Not yet supported:
> - Discard / TRIM.
> - dynamic `dmsetup reload`.
If you don't support it, you should at least try to detect that the user
did a reload and return an error - so that there won't be data corruption
in this case.
But it would be better to support table reload. You can support it by a
similar mechanism to "__handover_exceptions" in the dm-snap.c driver.
> Runtime controls
> - `dmsetup message <dev> 0 gc_percent <0-90>` adjusts the GC trigger.
>
> Status line reports super-block flags, segment counts, GC threshold and
> the three tail/head pointers (see the RST document for details).
Perhaps these are not real bugs (I didn't analyze it thoroughly), but
there are some GFP_NOWAIT and GFP_KERNEL allocations.
GFP_NOWAIT can fail at any time (for example, if the machine receives too
many network packets), so you must handle the error gracefully.
GFP_KERNEL allocations may recurse back into the I/O path through swapping
or file writeback, so they may cause deadlocks. You can use GFP_KERNEL in
the target constructor or destructor, because no I/O is being processed at
that time, but it shouldn't be used in the I/O processing path.
I see that when you get ENOMEM, you retry the request after 100ms -
putting arbitrary waits in the code is generally bad practice - this won't
work if the user is swapping to the dm-pcache device. It may be that no
memory is free, so retrying won't help and it will deadlock.
I suggest using mempools to guarantee forward progress in out-of-memory
situations. A mempool_alloc(GFP_NOIO) will never return NULL, it will just
wait until some other process frees some entry into the mempool.
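A userspace illustration of that guarantee, with a pthread condvar
standing in for the mempool's internal wait queue (in the kernel this is
mempool_create()/mempool_alloc()/mempool_free()):

```c
#include <assert.h>
#include <pthread.h>

/* The pool is pre-populated; alloc never returns NULL, it blocks until
 * another thread returns an element.  That is the forward-progress
 * guarantee a retry-on-ENOMEM loop cannot give.
 */
#define POOL_SIZE 2

static void *pool[POOL_SIZE];
static int pool_count;
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t pool_nonempty = PTHREAD_COND_INITIALIZER;

static void *pool_alloc(void)
{
	void *p;

	pthread_mutex_lock(&pool_lock);
	while (pool_count == 0)			/* never fail: sleep instead */
		pthread_cond_wait(&pool_nonempty, &pool_lock);
	p = pool[--pool_count];
	pthread_mutex_unlock(&pool_lock);
	return p;
}

static void pool_free(void *p)
{
	pthread_mutex_lock(&pool_lock);
	pool[pool_count++] = p;
	pthread_cond_signal(&pool_nonempty);	/* wake one waiter */
	pthread_mutex_unlock(&pool_lock);
}
```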
Generally, the convention among device mapper targets is that they have a
few fixed parameters first, then a count of optional parameters, and then
the optional parameters themselves (either in "parameter:123" or
"parameter 123" format). You should follow this convention, so that the
table line can be easily extended with new parameters later.
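For illustration only - the option names and device/sector values below
are made up, not part of the current dm-pcache syntax:

```shell
# current RFC v2 table line:
#   pcache <pmem_dev> <origin_dev> writeback <true|false>
#
# following the DM convention (fixed params, then a count of optional
# words, then the options), a hypothetical extended line could be:
dmsetup create pc --table \
  "0 204800 pcache /dev/pmem0 /dev/sdb 4 cache_mode writeback data_crc true"
# here "4" announces that four optional words follow, so new options
# can be appended later without breaking existing tables.
```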
The __packed attribute causes performance degradation on RISC machines
without hardware support for unaligned accesses - the compiler will
generate byte-by-byte accesses - I suggest not using it and instead making
sure that the members of the structures are naturally aligned (inserting
explicit padding if needed).
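What I mean, on a made-up structure (the field names are for the example
only, not taken from dm-pcache): order the members largest-first and pad
explicitly, so every field is naturally aligned and the on-pmem size stays
fixed without __packed:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct kset_entry_packed {		/* what __packed produces */
	uint8_t  flags;
	uint64_t seq;			/* misaligned: offset 1 */
	uint32_t crc;
} __attribute__((packed));		/* 13 bytes */

struct kset_entry {			/* naturally aligned layout */
	uint64_t seq;			/* offset 0, 8-byte aligned */
	uint32_t crc;			/* offset 8 */
	uint8_t  flags;			/* offset 12 */
	uint8_t  pad[3];		/* explicit padding to 16 bytes */
};

/* compile-time checks pin the layout down, like __packed would */
_Static_assert(sizeof(struct kset_entry_packed) == 13, "packed size");
_Static_assert(sizeof(struct kset_entry) == 16, "aligned size");
_Static_assert(offsetof(struct kset_entry, seq) == 0, "seq aligned");
_Static_assert(offsetof(struct kset_entry, crc) == 8, "crc aligned");
```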
The function "memcpy_flushcache" in arch/x86/include/asm/string_64.h is
optimized for 4, 8 and 16-byte accesses (because that's what dm-writecache
uses) - I suggest adding more optimizations to it for constant sizes that
fit the usage pattern of dm-pcache.
I see that you are using "queue_delayed_work(cache_get_wq(cache),
&cache->writeback_work, 0);" and "queue_delayed_work(cache_get_wq(cache),
&cache->writeback_work, delay);" - the problem here is that if the work
item is already queued with a delay and you then attempt to queue it with
zero delay, the new queue attempt will be ignored (mod_delayed_work()
would update the pending timer instead) - I'm not sure if this is intended
behavior or not.
req_complete_fn: this will never run with interrupts disabled, so you can
replace spin_lock_irqsave/spin_unlock_irqrestore with
spin_lock_irq/spin_unlock_irq.
backing_dev_bio_end: there's a bug in this function - it may be called
both with interrupts disabled and with interrupts enabled, so you should
not use spin_lock/spin_unlock. You should use
spin_lock_irqsave/spin_unlock_irqrestore.
queue_work(BACKING_DEV_TO_PCACHE - I would move it inside the spinlock -
see the commit 829451beaed6165eb11d7a9fb4e28eb17f489980 for a similar
problem.
bio_map - bio vectors can hold arbitrarily long entries - if the "base"
variable is not from vmalloc, you can just add it as one bvec entry.
"backing_req->kmem.bvecs = kcalloc" - you can use kmalloc_array instead of
kcalloc; there's no need to zero the memory.
> + if (++wait_count >= PCACHE_WAIT_NEW_CACHE_COUNT)
> + return NULL;
> +
> + udelay(PCACHE_WAIT_NEW_CACHE_INTERVAL);
> + goto again;
It is not good practice to insert arbitrary waits (here the wait burns
CPU time, which makes it even worse). You should add the process to a
wait queue and wake up the queue when a segment becomes available.
See the functions writecache_wait_on_freelist and writecache_free_entry
for an example of how to wait correctly.
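The shape of the fix, as a userspace sketch with a pthread condvar
standing in for the kernel's wait_event()/wake_up() pair (the pattern
writecache_wait_on_freelist/writecache_free_entry implement):

```c
#include <assert.h>
#include <pthread.h>

/* Replacement for the udelay()+retry loop above: the waiter sleeps
 * instead of burning CPU, and the path that frees a cache segment
 * wakes it - no retry count, no arbitrary interval.
 */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t segment_freed = PTHREAD_COND_INITIALIZER;
static int free_segments;

static void wait_for_segment(void)
{
	pthread_mutex_lock(&lock);
	while (free_segments == 0)		/* sleep, don't poll */
		pthread_cond_wait(&segment_freed, &lock);
	free_segments--;			/* claim the segment */
	pthread_mutex_unlock(&lock);
}

static void release_segment(void)
{
	pthread_mutex_lock(&lock);
	free_segments++;
	pthread_cond_signal(&segment_freed);	/* wake one waiter */
	pthread_mutex_unlock(&lock);
}
```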
> +static int dm_pcache_map_bio(struct dm_target *ti, struct bio *bio)
> +{
> + struct pcache_request *pcache_req = dm_per_bio_data(bio, sizeof(struct pcache_request));
> + struct dm_pcache *pcache = ti->private;
> + int ret;
> +
> + pcache_req->pcache = pcache;
> + kref_init(&pcache_req->ref);
> + pcache_req->ret = 0;
> + pcache_req->bio = bio;
> + pcache_req->off = (u64)bio->bi_iter.bi_sector << SECTOR_SHIFT;
> + pcache_req->data_len = bio->bi_iter.bi_size;
> + INIT_LIST_HEAD(&pcache_req->list_node);
> + bio->bi_iter.bi_sector = dm_target_offset(ti, bio->bi_iter.bi_sector);
This looks suspicious because you store the original bi_sector in
pcache_req->off and only afterwards remap bi_sector with dm_target_offset.
Shouldn't "bio->bi_iter.bi_sector = dm_target_offset(ti,
bio->bi_iter.bi_sector);" come before "pcache_req->off =
(u64)bio->bi_iter.bi_sector << SECTOR_SHIFT;"?
Generally, the code doesn't seem bad. After reworking the out-of-memory
handling and replacing arbitrary waits with wait queues, I can merge it.
Mikulas