Message-ID: <44ae1d7c-8de7-47ce-a53c-c4075c39dc2a@linux.alibaba.com>
Date: Tue, 27 Jan 2026 08:55:17 +0800
From: Gao Xiang <hsiangkao@...ux.alibaba.com>
To: Cong Wang <cwang@...tikernel.io>, Matthew Wilcox <willy@...radead.org>
Cc: linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
 Cong Wang <xiyou.wangcong@...il.com>, multikernel@...ts.linux.dev
Subject: Re: [ANNOUNCE] DAXFS: A zero-copy, dmabuf-friendly filesystem for
 shared memory



On 2026/1/27 08:02, Cong Wang wrote:
> On Mon, Jan 26, 2026 at 12:40 PM Matthew Wilcox <willy@...radead.org> wrote:
>>
>> On Mon, Jan 26, 2026 at 11:48:20AM -0800, Cong Wang wrote:
>>> Specifically for this scenario, struct inode is not compatible. This
>>> could rule out a lot of existing filesystems, except read-only ones.
>>
>> I don't think you understand that there's a difference between *on disk*
>> inode and *in core* inode.  Compare and contrast struct ext2_inode and
>> struct inode.
>>
>>> Now back to EROFS, it is still based on a block device, which
>>> itself can't be shared among different kernels. ramdax is actually
>>> a perfect example here, its label_area can't be shared among
>>> different kernels.
>>>
>>> Let's take one step back: even if we really could share a device
>>> with multiple kernels, it still could not share the memory footprint,
>>> with DAX + EROFS, we would still get:
>>> 1) Each kernel creates its own DAX mappings
>>> 2) And faults pages independently
>>>
>>> There is no cross-kernel page sharing accounting.
>>>
>>> I hope this makes sense.
>>
>> No, it doesn't.  I'm not suggesting that you use erofs unchanged, I'm
>> suggesting that you modify erofs to support your needs.
> 
> I just tried:
> https://github.com/multikernel/linux/commit/a6dc3351e78fc2028e4ca0ea02e781ca0bfefea3
> 
> Unfortunately, the multi-kernel derivation is still there and probably
> hard to eliminate without re-architecturing EROFS, here is why:
> 
>    DAXFS Inode (line 202-216):
> 
>    struct daxfs_base_inode {
>        __le32 ino;
>        __le32 mode;
>        ...
>        __le64 size;
>        __le64 data_offset;    /* ← INTRINSIC: stored directly in inode
> */
>        ...
>    };
> 
>   DAXFS Read Path:
>    // Pseudocode - what DAXFS does
>    void *data = base + inode->data_offset + file_offset;
>    copy_to_iter(data, len, to);
>    // DONE. No metadata parsing, no derivation.

Then how do you handle memory-mapped cases? Your
inode->data_offset still needs to be PAGE_SIZE-aligned, no?

What happens with an image whose data offsets are unaligned?

And why bother with copy_to_iter() in your filesystem itself
rather than using the upstream DAX infrastructure?
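
For reference, here is a rough sketch of what I mean by
reusing the upstream DAX infrastructure (all daxfs_* names
are made up, and it assumes the image sits behind a
dax_device and data extents are page-aligned):

    static int daxfs_iomap_begin(struct inode *inode, loff_t pos,
                                 loff_t length, unsigned int flags,
                                 struct iomap *iomap, struct iomap *srcmap)
    {
            struct daxfs_inode *di = DAXFS_I(inode);   /* hypothetical */

            /* report one contiguous mapped extent inside the image */
            iomap->type = IOMAP_MAPPED;
            iomap->addr = di->data_offset + pos;
            iomap->offset = pos;
            iomap->length = min_t(loff_t, length,
                                  i_size_read(inode) - pos);
            iomap->dax_dev = DAXFS_SB(inode->i_sb)->dax_dev;
            return 0;
    }

    static const struct iomap_ops daxfs_iomap_ops = {
            .iomap_begin = daxfs_iomap_begin,
    };

    static ssize_t daxfs_file_read_iter(struct kiocb *iocb,
                                        struct iov_iter *to)
    {
            /* fs/dax.c does the copy, no open-coded copy_to_iter() */
            return dax_iomap_rw(iocb, to, &daxfs_iomap_ops);
    }

mmap would then just be dax_iomap_fault() behind a ->fault
handler, which is exactly where the PAGE_SIZE alignment of
data_offset matters.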

Also, where do you handle a malicious `child_ino` if
sub-directories can form a loop (from your on-disk
design)? How does it deal with hardlinks, btw?

> 
>   EROFS Read Path:
>    // What EROFS does (even in memory mode)
>    struct erofs_map_blocks map = { .m_la = pos };
>    erofs_map_blocks(inode, &map);  // ← DERIVES physical address
>        // Inside erofs_map_blocks():
>        //   - Check inode layout type (compact? extended?
> chunk-indexed?)
>        //   - For chunk-indexed: walk chunk table
>        //   - For plain: compute from inode
>        //   - Handle inline data, holes, compression...
>    src = base + map.m_pa;
> 
> Please let me know if I miss anything here.

Your description above is very vague, so I don't know how
to respond to it.

Basically, I would like to say that your use case just
needs plain EROFS inodes (both the compact and the extended
on-disk core inode have a raw_blkaddr, and
raw_blkaddr * PAGE_SIZE is what you called `inode->data_offset`).
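
In other words, something like this is all the mapping
logic your flat read path needs on top of a plain EROFS
inode (a rough sketch, assuming an uncompressed,
non-chunk-indexed inode and blocksize == PAGE_SIZE; the
helper name is made up):

    /* dic points at the on-disk compact inode; the extended
     * on-disk inode carries the same i_u.raw_blkaddr field */
    static u64 plain_data_offset(struct erofs_inode_compact *dic)
    {
            return (u64)le32_to_cpu(dic->i_u.raw_blkaddr) << PAGE_SHIFT;
    }

No chunk-table walk or compression handling is involved for
such inodes.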

You could just ignore the EROFS compressed layout, since
it needs to use the page cache for those inodes even with
EROFS FSDAX, and your "DAXFS" doesn't deal with
compression.

Also, the text above seems to be partially generated
by AI, while I have to write more carefully reasoned words
myself; that seems unfair to me in this thread.

> 
> Also, the speculative branching support is also harder for EROFS,
> please see my updated README here:
> https://github.com/multikernel/daxfs/blob/main/README.md
> (Skip to the Branching section.)

I would also like to discuss new use cases like a
"shared-memory DAX filesystem for AI agents", but my
proposal is to redirect all write traffic into another
filesystem (either a tmpfs or a real disk fs) and, when
agents need to snapshot, generate a new read-only layer
for memory sharing.  The reason is that I really would
like to keep the core EROFS format straightforward even
for untrusted remote image usage.

Also, from a second quick glance at your CoW approach, it
just doesn't make sense to a real filesystem developer.
Anyway, it's not up to me to prove to people that your use
cases cannot be implemented with an existing fs plus
enhancements.

If upstreaming is your interest, file an LSFMMBPF topic to
present your use cases for discussion, and I would like
to join it.  If your interest is not upstreaming, please
ignore all my replies.

Thanks,
Gao Xiang

> 
> Thanks.
> Cong Wang

