Message-ID: <dc019764-5128-526e-d8ea-effa78e37b39@redhat.com>
Date: Thu, 12 Jun 2025 18:57:26 +0200 (CEST)
From: Mikulas Patocka <mpatocka@...hat.com>
To: Dongsheng Yang <dongsheng.yang@...ux.dev>
cc: agk@...hat.com, snitzer@...nel.org, axboe@...nel.dk, hch@....de,
dan.j.williams@...el.com, Jonathan.Cameron@...wei.com,
linux-block@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-cxl@...r.kernel.org, nvdimm@...ts.linux.dev,
dm-devel@...ts.linux.dev
Subject: Re: [RFC v2 00/11] dm-pcache – persistent-memory cache for block devices
Hi
On Thu, 5 Jun 2025, Dongsheng Yang wrote:
> Hi Mikulas and all,
>
> This is *RFC v2* of the *pcache* series, a persistent-memory backed cache.
> Compared with *RFC v1*
> <https://lore.kernel.org/lkml/20250414014505.20477-1-dongsheng.yang@linux.dev/>
> the most important change is that the whole cache has been *ported to
> the Device-Mapper framework* and is now exposed as a regular DM target.
>
> Code:
> https://github.com/DataTravelGuide/linux/tree/dm-pcache
>
> Full RFC v2 test results:
> https://datatravelguide.github.io/dtg-blog/pcache/pcache_rfc_v2_result/results.html
>
> All 962 xfstests cases passed successfully under four different
> pcache configurations.
>
> One of the detailed xfstests run:
> https://datatravelguide.github.io/dtg-blog/pcache/pcache_rfc_v2_result/test-results/02-._pcache.py_PcacheTest.test_run-crc-enable-gc-gc0-test_script-xfstests-a515/debug.log
>
> Below is a quick tour through the three layers of the implementation,
> followed by an example invocation.
>
> ----------------------------------------------------------------------
> 1. pmem access layer
> ----------------------------------------------------------------------
>
> * All reads use *copy_mc_to_kernel()* so that uncorrectable media
> errors are detected and reported.
> * All writes go through *memcpy_flushcache()* to guarantee durability
> on real persistent memory.
You could also try using a normal write plus clflushopt for big writes - I
found that it is faster for larger regions - see the function
memcpy_flushcache_optimized in dm-writecache. Test which way is better.
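A userspace sketch of the size dispatch I mean (the 768-byte threshold
mirrors dm-writecache; dm-pcache would have to benchmark its own crossover
point, and the actual flush instructions are elided - only the dispatch is
shown):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sketch of the dispatch in dm-writecache's memcpy_flushcache_optimized():
 * small writes go through the flushcache path (memcpy_flushcache in the
 * kernel), large ones use an ordinary memcpy followed by clflushopt on
 * each 64-byte cache line.
 */
enum copy_path { PATH_FLUSHCACHE, PATH_CLFLUSHOPT };

static enum copy_path copy_durable(void *dest, const void *src, size_t size)
{
	if (size >= 768) {
		const char *s = src;
		char *d = dest;

		while (size >= 64) {
			memcpy(d, s, 64);
			/* clflushopt(d) would go here in kernel code */
			d += 64; s += 64; size -= 64;
		}
		if (size)
			memcpy(d, s, size);	/* tail < 64 bytes */
		return PATH_CLFLUSHOPT;
	}
	/* memcpy_flushcache(dest, src, size) in kernel code */
	memcpy(dest, src, size);
	return PATH_FLUSHCACHE;
}
```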
> ----------------------------------------------------------------------
> 2. cache-logic layer (segments / keys / workers)
> ----------------------------------------------------------------------
>
> Main features
> - 16 MiB pmem segments, log-structured allocation.
> - Multi-subtree RB-tree index for high parallelism.
> - Optional per-entry *CRC32* on cached data.
Would it be better to use crc32c because it has hardware support in the
SSE4.2 instruction set?
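For reference, crc32c uses the Castagnoli polynomial (reflected form
0x82F63B78), which is what the SSE4.2 crc32 instruction computes in
hardware; the kernel already exposes it as crc32c(). A bit-by-bit
reference form, only to show the difference from the classic CRC-32
polynomial (0xEDB88320):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
 * A real target would call the kernel's crc32c() helper, which is
 * hardware-accelerated on CPUs with SSE4.2.
 */
static uint32_t crc32c_sw(uint32_t crc, const void *data, size_t len)
{
	const uint8_t *p = data;
	int k;

	crc = ~crc;
	while (len--) {
		crc ^= *p++;
		for (k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78 & -(crc & 1));
	}
	return ~crc;
}
```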
> - Background *write-back* worker and watermark-driven *GC*.
> - Crash-safe replay: key-sets are scanned from *key_tail* on start-up.
>
> Current limitations
> - Only *write-back* mode implemented.
> - Only FIFO cache invalidation; others (LRU, ARC...) planned.
>
> ----------------------------------------------------------------------
> 3. dm-pcache target integration
> ----------------------------------------------------------------------
>
> * Table line
> `pcache <pmem_dev> <origin_dev> writeback <true|false>`
> * Features advertised to DM:
> - `ti->flush_supported = true`, so *PREFLUSH* and *FUA* are honoured
> (they force all open key-sets to close and data to be durable).
> * Not yet supported:
> - Discard / TRIM.
> - dynamic `dmsetup reload`.
If you don't support it, you should at least try to detect that the user
did a reload and return an error - so that there won't be data corruption
in this case.
But it would be better to support table reload. You can support it by a
similar mechanism to "__handover_exceptions" in the dm-snap.c driver.
> Runtime controls
> - `dmsetup message <dev> 0 gc_percent <0-90>` adjusts the GC trigger.
>
> Status line reports super-block flags, segment counts, GC threshold and
> the three tail/head pointers (see the RST document for details).
Perhaps these are not real bugs (I didn't analyze it thoroughly), but
there are some GFP_NOWAIT and GFP_KERNEL allocations.
GFP_NOWAIT can fail at any time (for example, if the machine receives too
many network packets), so you must handle the error gracefully.
GFP_KERNEL allocations may recurse back into the I/O path through swapping
or file writeback, so they may cause deadlocks. You can use GFP_KERNEL in
the target constructor or destructor, because no I/O is being processed at
that time, but it shouldn't be used in the I/O processing path.
I see that when you get ENOMEM, you retry the request after 100ms -
putting arbitrary waits in the code is generally bad practice - this won't
work if the user is swapping to the dm-pcache device. It may be that no
memory is free, so retrying won't help and it will deadlock.
I suggest using mempools to guarantee forward progress in out-of-memory
situations. A mempool_alloc(GFP_NOIO) will never return NULL, it will just
wait until some other process frees some entry into the mempool.
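A userspace illustration of that guarantee, with a pthread condvar
standing in for the mempool's internal wait queue (in the kernel this is
mempool_create()/mempool_alloc()/mempool_free()):

```c
#include <assert.h>
#include <pthread.h>

/* The pool is pre-populated; alloc never returns NULL, it blocks until
 * another thread returns an element.  That is the forward-progress
 * guarantee a retry-on-ENOMEM loop cannot give.
 */
#define POOL_SIZE 2

static void *pool[POOL_SIZE];
static int pool_count;
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t pool_nonempty = PTHREAD_COND_INITIALIZER;

static void *pool_alloc(void)
{
	void *p;

	pthread_mutex_lock(&pool_lock);
	while (pool_count == 0)			/* never fail: sleep instead */
		pthread_cond_wait(&pool_nonempty, &pool_lock);
	p = pool[--pool_count];
	pthread_mutex_unlock(&pool_lock);
	return p;
}

static void pool_free(void *p)
{
	pthread_mutex_lock(&pool_lock);
	pool[pool_count++] = p;
	pthread_cond_signal(&pool_nonempty);	/* wake one waiter */
	pthread_mutex_unlock(&pool_lock);
}
```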
Generally, the convention among device mapper targets is that they have a
few fixed parameters first, then a count of optional parameters, and then
the optional parameters themselves (either in "parameter:123" or
"parameter 123" format). You should follow this convention, so that the
table line can be easily extended with new parameters later.
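For illustration only - the option names and device/sector values below
are made up, not part of the current dm-pcache syntax:

```shell
# current RFC v2 table line:
#   pcache <pmem_dev> <origin_dev> writeback <true|false>
#
# following the DM convention (fixed params, then a count of optional
# words, then the options), a hypothetical extended line could be:
dmsetup create pc --table \
  "0 204800 pcache /dev/pmem0 /dev/sdb 4 cache_mode writeback data_crc true"
# here "4" announces that four optional words follow, so new options
# can be appended later without breaking existing tables.
```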
The __packed attribute causes performance degradation on RISC machines
without hardware support for unaligned accesses - the compiler will
generate byte-by-byte accesses - I suggest not using it and instead making
sure that the members of the structures are naturally aligned (inserting
explicit padding if needed).
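What I mean, on a made-up structure (the field names are for the example
only, not taken from dm-pcache): order the members largest-first and pad
explicitly, so every field is naturally aligned and the on-pmem size stays
fixed without __packed:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct kset_entry_packed {		/* what __packed produces */
	uint8_t  flags;
	uint64_t seq;			/* misaligned: offset 1 */
	uint32_t crc;
} __attribute__((packed));		/* 13 bytes */

struct kset_entry {			/* naturally aligned layout */
	uint64_t seq;			/* offset 0, 8-byte aligned */
	uint32_t crc;			/* offset 8 */
	uint8_t  flags;			/* offset 12 */
	uint8_t  pad[3];		/* explicit padding to 16 bytes */
};

/* compile-time checks pin the layout down, like __packed would */
_Static_assert(sizeof(struct kset_entry_packed) == 13, "packed size");
_Static_assert(sizeof(struct kset_entry) == 16, "aligned size");
_Static_assert(offsetof(struct kset_entry, seq) == 0, "seq aligned");
_Static_assert(offsetof(struct kset_entry, crc) == 8, "crc aligned");
```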
The function "memcpy_flushcache" in arch/x86/include/asm/string_64.h is
optimized for 4, 8 and 16-byte accesses (because that's what dm-writecache
uses) - I suggest adding more optimizations to it for constant sizes that
fit the usage pattern of dm-pcache.
I see that you are using "queue_delayed_work(cache_get_wq(cache),
&cache->writeback_work, 0);" and "queue_delayed_work(cache_get_wq(cache),
&cache->writeback_work, delay);" - the problem here is that if the work
item is already queued with a delay and you then attempt to queue it with
zero delay, the new queue attempt will be ignored (mod_delayed_work()
would update the pending timer instead) - I'm not sure if this is intended
behavior or not.
req_complete_fn: this will never run with interrupts disabled, so you can
replace spin_lock_irqsave/spin_unlock_irqrestore with
spin_lock_irq/spin_unlock_irq.
backing_dev_bio_end: there's a bug in this function - it may be called
both with interrupts disabled and with interrupts enabled, so you should
not use spin_lock/spin_unlock. You should use
spin_lock_irqsave/spin_unlock_irqrestore.
queue_work(BACKING_DEV_TO_PCACHE - I would move it inside the spinlock -
see the commit 829451beaed6165eb11d7a9fb4e28eb17f489980 for a similar
problem.
bio_map - bio vectors can hold arbitrarily long entries - if the "base"
variable is not from vmalloc, you can just add it as one bvec entry.
"backing_req->kmem.bvecs = kcalloc" - you can use kmalloc_array instead of
kcalloc; there's no need to zero the memory.
> + if (++wait_count >= PCACHE_WAIT_NEW_CACHE_COUNT)
> + return NULL;
> +
> + udelay(PCACHE_WAIT_NEW_CACHE_INTERVAL);
> + goto again;
It is not good practice to insert arbitrary waits (here the wait burns
CPU time, which makes it even worse). You should add the process to a
wait queue and wake up the queue when a segment becomes available.
See the functions writecache_wait_on_freelist and writecache_free_entry
for an example of how to wait correctly.
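The shape of the fix, as a userspace sketch with a pthread condvar
standing in for the kernel's wait_event()/wake_up() pair (the pattern
writecache_wait_on_freelist/writecache_free_entry implement):

```c
#include <assert.h>
#include <pthread.h>

/* Replacement for the udelay()+retry loop above: the waiter sleeps
 * instead of burning CPU, and the path that frees a cache segment
 * wakes it - no retry count, no arbitrary interval.
 */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t segment_freed = PTHREAD_COND_INITIALIZER;
static int free_segments;

static void wait_for_segment(void)
{
	pthread_mutex_lock(&lock);
	while (free_segments == 0)		/* sleep, don't poll */
		pthread_cond_wait(&segment_freed, &lock);
	free_segments--;			/* claim the segment */
	pthread_mutex_unlock(&lock);
}

static void release_segment(void)
{
	pthread_mutex_lock(&lock);
	free_segments++;
	pthread_cond_signal(&segment_freed);	/* wake one waiter */
	pthread_mutex_unlock(&lock);
}
```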
> +static int dm_pcache_map_bio(struct dm_target *ti, struct bio *bio)
> +{
> + struct pcache_request *pcache_req = dm_per_bio_data(bio, sizeof(struct pcache_request));
> + struct dm_pcache *pcache = ti->private;
> + int ret;
> +
> + pcache_req->pcache = pcache;
> + kref_init(&pcache_req->ref);
> + pcache_req->ret = 0;
> + pcache_req->bio = bio;
> + pcache_req->off = (u64)bio->bi_iter.bi_sector << SECTOR_SHIFT;
> + pcache_req->data_len = bio->bi_iter.bi_size;
> + INIT_LIST_HEAD(&pcache_req->list_node);
> + bio->bi_iter.bi_sector = dm_target_offset(ti, bio->bi_iter.bi_sector);
This looks suspicious because you store the original bi_sector in
pcache_req->off and only afterwards remap bi_sector with dm_target_offset.
Shouldn't "bio->bi_iter.bi_sector = dm_target_offset(ti,
bio->bi_iter.bi_sector);" come before "pcache_req->off =
(u64)bio->bi_iter.bi_sector << SECTOR_SHIFT;"?
Generally, the code doesn't seem bad. After reworking the out-of-memory
handling and replacing arbitrary waits with wait queues, I can merge it.
Mikulas