Message-ID: <44ae1d7c-8de7-47ce-a53c-c4075c39dc2a@linux.alibaba.com>
Date: Tue, 27 Jan 2026 08:55:17 +0800
From: Gao Xiang <hsiangkao@...ux.alibaba.com>
To: Cong Wang <cwang@...tikernel.io>, Matthew Wilcox <willy@...radead.org>
Cc: linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
 Cong Wang <xiyou.wangcong@...il.com>, multikernel@...ts.linux.dev
Subject: Re: [ANNOUNCE] DAXFS: A zero-copy, dmabuf-friendly filesystem for
 shared memory



On 2026/1/27 08:02, Cong Wang wrote:
> On Mon, Jan 26, 2026 at 12:40 PM Matthew Wilcox <willy@...radead.org> wrote:
>>
>> On Mon, Jan 26, 2026 at 11:48:20AM -0800, Cong Wang wrote:
>>> Specifically for this scenario, struct inode is not compatible. This
>>> could rule out a lot of existing filesystems, except read-only ones.
>>
>> I don't think you understand that there's a difference between *on disk*
>> inode and *in core* inode.  Compare and contrast struct ext2_inode and
>> struct inode.
>>
>>> Now back to EROFS, it is still based on a block device, which
>>> itself can't be shared among different kernels. ramdax is actually
>>> a perfect example here, its label_area can't be shared among
>>> different kernels.
>>>
>>> Let's take one step back: even if we really could share a device
>>> with multiple kernels, it still could not share the memory footprint,
>>> with DAX + EROFS, we would still get:
>>> 1) Each kernel creates its own DAX mappings
>>> 2) And faults pages independently
>>>
>>> There is no cross-kernel page sharing accounting.
>>>
>>> I hope this makes sense.
>>
>> No, it doesn't.  I'm not suggesting that you use erofs unchanged, I'm
>> suggesting that you modify erofs to support your needs.
> 
> I just tried:
> https://github.com/multikernel/linux/commit/a6dc3351e78fc2028e4ca0ea02e781ca0bfefea3
> 
> Unfortunately, the multi-kernel derivation is still there and probably
> hard to eliminate without re-architecturing EROFS, here is why:
> 
>    DAXFS Inode (line 202-216):
> 
>    struct daxfs_base_inode {
>        __le32 ino;
>        __le32 mode;
>        ...
>        __le64 size;
>        __le64 data_offset;    /* ← INTRINSIC: stored directly in inode
> */
>        ...
>    };
> 
>   DAXFS Read Path:
>    // Pseudocode - what DAXFS does
>    void *data = base + inode->data_offset + file_offset;
>    copy_to_iter(data, len, to);
>    // DONE. No metadata parsing, no derivation.

Then how do you handle memory-mapped cases? Your
inode->data_offset still needs to be PAGE_SIZE-aligned, no?

What happens with an image whose data offsets are unaligned?

And why bother with copy_to_iter() in your filesystem itself
rather than using the upstream DAX infrastructure?
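
For reference, here is a rough sketch of what I mean by
reusing the upstream DAX infrastructure (all daxfs_* names
are made up, and it assumes the image sits behind a
dax_device and data extents are page-aligned):

    static int daxfs_iomap_begin(struct inode *inode, loff_t pos,
                                 loff_t length, unsigned int flags,
                                 struct iomap *iomap, struct iomap *srcmap)
    {
            struct daxfs_inode *di = DAXFS_I(inode);   /* hypothetical */

            /* report one contiguous mapped extent inside the image */
            iomap->type = IOMAP_MAPPED;
            iomap->addr = di->data_offset + pos;
            iomap->offset = pos;
            iomap->length = min_t(loff_t, length,
                                  i_size_read(inode) - pos);
            iomap->dax_dev = DAXFS_SB(inode->i_sb)->dax_dev;
            return 0;
    }

    static const struct iomap_ops daxfs_iomap_ops = {
            .iomap_begin = daxfs_iomap_begin,
    };

    static ssize_t daxfs_file_read_iter(struct kiocb *iocb,
                                        struct iov_iter *to)
    {
            /* fs/dax.c does the copy, no open-coded copy_to_iter() */
            return dax_iomap_rw(iocb, to, &daxfs_iomap_ops);
    }

mmap would then just be dax_iomap_fault() behind a ->fault
handler, which is exactly where the PAGE_SIZE alignment of
data_offset matters.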

Also, where do you handle a malicious `child_ino` if
sub-directories can form a loop (from your on-disk
design)? How does it deal with hardlinks, btw?

> 
>   EROFS Read Path:
>    // What EROFS does (even in memory mode)
>    struct erofs_map_blocks map = { .m_la = pos };
>    erofs_map_blocks(inode, &map);  // ← DERIVES physical address
>        // Inside erofs_map_blocks():
>        //   - Check inode layout type (compact? extended?
> chunk-indexed?)
>        //   - For chunk-indexed: walk chunk table
>        //   - For plain: compute from inode
>        //   - Handle inline data, holes, compression...
>    src = base + map.m_pa;
> 
> Please let me know if I miss anything here.

Your description above is very vague, so I don't know how
to respond to it.

Basically, I would like to say that your use case just
needs plain EROFS inodes (both the compact and the extended
on-disk core inode have a raw_blkaddr, and
raw_blkaddr * PAGE_SIZE is what you called `inode->data_offset`).
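
In other words, something like this is all the mapping
logic your flat read path needs on top of a plain EROFS
inode (a rough sketch, assuming an uncompressed,
non-chunk-indexed inode and blocksize == PAGE_SIZE; the
helper name is made up):

    /* dic points at the on-disk compact inode; the extended
     * on-disk inode carries the same i_u.raw_blkaddr field */
    static u64 plain_data_offset(struct erofs_inode_compact *dic)
    {
            return (u64)le32_to_cpu(dic->i_u.raw_blkaddr) << PAGE_SHIFT;
    }

No chunk-table walk or compression handling is involved for
such inodes.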

You could just ignore the EROFS compressed layout, since
it needs to use the page cache for those inodes even with
EROFS FSDAX, and your "DAXFS" doesn't deal with
compression.

Also, the text above seems to be partially generated
by AI, while I have to write more carefully reasoned words
myself; that seems unfair to me in this thread.

> 
> Also, the speculative branching support is also harder for EROFS,
> please see my updated README here:
> https://github.com/multikernel/daxfs/blob/main/README.md
> (Skip to the Branching section.)

I would also like to discuss new use cases like a
"shared-memory DAX filesystem for AI agents", but my
proposal is to redirect all write traffic into another
filesystem (either a tmpfs or a real disk fs) and, when
agents need to snapshot, generate a new read-only layer
for memory sharing.  The reason is that I really would
like to keep the core EROFS format straightforward even
for untrusted remote image usage.

Also, from a second quick glance at your CoW approach, it
just doesn't make sense to a real filesystem developer.
Anyway, it's not up to me to prove to people that your use
cases cannot be implemented with an existing fs plus
enhancements.

If upstreaming is your interest, file an LSFMMBPF topic to
present your use cases for discussion, and I would like
to join it.  If your interest is not upstreaming, please
ignore all my replies.

Thanks,
Gao Xiang

> 
> Thanks.
> Cong Wang

