linux-kernel - Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <665a9402445ee_166872941d@dwillia2-mobl3.amr.corp.intel.com.notmuch>
Date: Fri, 31 May 2024 20:22:42 -0700
From: Dan Williams <dan.j.williams@...el.com>
To: Jonathan Cameron <Jonathan.Cameron@...wei.com>, Dongsheng Yang
	<dongsheng.yang@...ystack.cn>
CC: Gregory Price <gregory.price@...verge.com>, Dan Williams
	<dan.j.williams@...el.com>, John Groves <John@...ves.net>, <axboe@...nel.dk>,
	<linux-block@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
	<linux-cxl@...r.kernel.org>, <nvdimm@...ts.linux.dev>, <james.morse@....com>,
	Mark Rutland <mark.rutland@....com>
Subject: Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)

Jonathan Cameron wrote:
> On Thu, 30 May 2024 14:59:38 +0800
> Dongsheng Yang <dongsheng.yang@...ystack.cn> wrote:
> 
> > 在 2024/5/29 星期三 下午 11:25, Gregory Price 写道:
> > > On Wed, May 22, 2024 at 02:17:38PM +0800, Dongsheng Yang wrote:  
> > >>
> > >>
> > >> 在 2024/5/22 星期三 上午 2:41, Dan Williams 写道:  
> > >>> Dongsheng Yang wrote:
> > >>>
> > >>> What guarantees this property? How does the reader know that its local
> > >>> cache invalidation is sufficient for reading data that has only reached
> > >>> global visibility on the remote peer? As far as I can see, there is
> > >>> nothing that guarantees that local global visibility translates to
> > >>> remote visibility. In fact, the GPF feature is counter-evidence of the
> > >>> fact that writes can be pending in buffers that are only flushed on a
> > >>> GPF event.  
> > >>
> > >> Sounds correct. From what I learned from GPF, ADR, and eADR, there would
> > >> still be data in WPQ even though we perform a CPU cache line flush in the
> > >> OS.
> > >>
> > >> This means we don't have a explicit method to make data puncture all caches
> > >> and land in the media after writing. also it seems there isn't a explicit
> > >> method to invalidate all caches along the entire path.
> > >>  
> > >>>
> > >>> I remain skeptical that a software managed inter-host cache-coherency
> > >>> scheme can be made reliable with current CXL defined mechanisms.  
> > >>
> > >>
> > >> I got your point now, acorrding current CXL Spec, it seems software managed
> > >> cache-coherency for inter-host shared memory is not working. Will the next
> > >> version of CXL spec consider it?  
> > >>>  
> > > 
> > > Sorry for missing the conversation, have been out of office for a bit.
> > > 
> > > It's not just a CXL spec issue, though that is part of it. I think the
> > > CXL spec would have to expose some form of puncturing flush, and this
> > > makes the assumption that such a flush doesn't cause some kind of
> > > race/deadlock issue.  Certainly this needs to be discussed.
> > > 
> > > However, consider that the upstream processor actually has to generate
> > > this flush.  This means adding the flush to existing coherence protocols,
> > > or at the very least a new instruction to generate the flush explicitly.
> > > The latter seems more likely than the former.
> > > 
> > > This flush would need to ensure the data is forced out of the local WPQ
> > > AND all WPQs south of the PCIE complex - because what you really want to
> > > know is that the data has actually made it back to a place where remote
> > > viewers are capable of percieving the change.
> > > 
> > > So this means:
> > > 1) Spec revision with puncturing flush
> > > 2) Buy-in from CPU vendors to generate such a flush
> > > 3) A new instruction added to the architecture.
> > > 
> > > Call me in a decade or so.
> > > 
> > > 
> > > But really, I think it likely we see hardware-coherence well before this.
> > > For this reason, I have become skeptical of all but a few memory sharing
> > > use cases that depend on software-controlled cache-coherency.  
> > 
> > Hi Gregory,
> > 
> > 	From my understanding, we actually has the same idea here. What I am 
> > saying is that we need SPEC to consider this issue, meaning we need to 
> > describe how the entire software-coherency mechanism operates, which 
> > includes the necessary hardware support. Additionally, I agree that if 
> > software-coherency also requires hardware support, it seems that 
> > hardware-coherency is the better path.
> > > 
> > > There are some (FAMFS, for example). The coherence state of these
> > > systems tend to be less volatile (e.g. mappings are read-only), or
> > > they have inherent design limitations (cacheline-sized message passing
> > > via write-ahead logging only).  
> > 
> > Can you explain more about this? I understand that if the reader in the 
> > writer-reader model is using a readonly mapping, the interaction will be 
> > much simpler. However, after the writer writes data, if we don't have a 
> > mechanism to flush and invalidate puncturing all caches, how can the 
> > readonly reader access the new data?
> 
> There is a mechanism for doing coarse grained flushing that is known to
> work on some architectures. Look at cpu_cache_invalidate_memregion().
> On intel/x86 it's wbinvd_on_all_cpu_cpus()

There is no guarantee on x86 that after cpu_cache_invalidate_memregion()
that a remote shared memory consumer can be assured to see the writes
from that event.

> on arm64 it's a PSCI firmware call CLEAN_INV_MEMREGION (there is a
> public alpha specification for PSCI 1.3 with that defined but we
> don't yet have kernel code.)

That punches visibility through CXL shared memory devices?

> These are very big hammers and so unsuited for anything fine grained.
> In the extreme end of possible implementations they briefly stop all
> CPUs and clean and invalidate all caches of all types.  So not suited
> to anything fine grained, but may be acceptable for a rare setup event,
> particularly if the main job of the writing host is to fill that memory
> for lots of other hosts to use.
> 
> At least the ARM one takes a range so allows for a less painful
> implementation.  I'm assuming we'll see new architecture over time
> but this is a different (and potentially easier) problem space
> to what you need.

cpu_cache_invalidate_memregion() is only about making sure local CPU
sees new contents after an DPA:HPA remap event. I hope CPUs are able to
get away from that responsibility long term when / if future memory
expanders just issue back-invalidate automatically when the HDM decoder
configuration changes.