Date:   Wed, 1 Mar 2023 17:42:25 -0800
From:   "Paul E. McKenney" <paulmck@...nel.org>
To:     Jonas Oberhauser <jonas.oberhauser@...weicloud.com>
Cc:     Andrea Parri <parri.andrea@...il.com>,
        Alan Stern <stern@...land.harvard.edu>,
        Jonas Oberhauser <jonas.oberhauser@...wei.com>,
        will@...nel.org, peterz@...radead.org, boqun.feng@...il.com,
        npiggin@...il.com, dhowells@...hat.com, j.alglave@....ac.uk,
        luc.maranget@...ia.fr, akiyks@...il.com, dlustig@...dia.com,
        joel@...lfernandes.org, urezki@...il.com, quic_neeraju@...cinc.com,
        frederic@...nel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3] tools/memory-model: Make ppo a subrelation of po

On Wed, Mar 01, 2023 at 11:52:09AM +0100, Jonas Oberhauser wrote:
> 
> 
> On 2/28/2023 4:40 PM, Paul E. McKenney wrote:
> > On Tue, Feb 28, 2023 at 09:49:07AM +0100, Jonas Oberhauser wrote:
> > > 
> > > On 2/27/2023 11:21 PM, Paul E. McKenney wrote:
> > > > On Mon, Feb 27, 2023 at 09:13:01PM +0100, Jonas Oberhauser wrote:
> > > > > On 2/27/2023 8:40 PM, Andrea Parri wrote:
> > > > > > > The LKMM doesn't believe that a control or data dependency orders a
> > > > > > > plain write after a marked read.  Hence in this test it thinks that P1's
> > > > > > > store to u0 can happen before the load of x1.  I don't remember why we
> > > > > > > did it this way -- probably we just wanted to minimize the restrictions
> > > > > > > on when plain accesses can execute.  (I do remember the reason for
> > > > > > > making address dependencies induce order; it was so RCU would work.)
> > > > > > > 
> > > > > > > The patch below will change what the LKMM believes.  It eliminates the
> > > > > > > positive outcome of the litmus test and the data race.  Should it be
> > > > > > > adopted into the memory model?
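For concreteness, the shape being discussed is roughly the following minimal
sketch (made-up names, not the original u0/x1 test): a plain write that is
control-dependent on a marked read.

C sketch-marked-read-plain-write

{}

P0(int *x, int *u)
{
	int r0;

	r0 = READ_ONCE(*u);
	smp_store_release(x, 1);
}

P1(int *x, int *u)
{
	int r1;

	r1 = READ_ONCE(*x);	/* marked read */
	if (r1)
		*u = 1;		/* plain write, control-dependent on r1 */
}

exists (0:r0=1 /\ 1:r1=1)

Because the current model does not let that control dependency order P1's
plain write after its marked read, the write to u can be seen early and can
end up racing with P0's read; the patch being discussed is what would make
the dependency count.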
> > > > > > (Unpopular opinion, I know,) it should drop dependency ordering, not
> > > > > > add/promote it.
> > > > > > 
> > > > > >      Andrea
> > > > > Maybe not as unpopular as you think... :)
> > > > > But either way IMHO it should be consistent; either take all the
> > > > > dependencies that are true and add them, or drop them all.
> > > > > In the latter case, RCU should change to an acquire barrier. (Also, one
> > > > > would then have to deal with OOTA in some other way.)
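For what it's worth, the read-side difference being weighed looks roughly
like this (a sketch only; the struct, field, and function names are invented
for illustration):

#include <linux/rcupdate.h>	/* rcu_read_lock(), rcu_dereference() */
#include <asm/barrier.h>	/* smp_load_acquire() */

struct foo {
	int a;
};

static struct foo *gp;	/* assume a writer publishes via rcu_assign_pointer() */

/* Status quo: the address dependency from the pointer load to the
 * dereference is what orders the reader's accesses. */
static int reader_dep(void)
{
	struct foo *p;
	int val = -1;

	rcu_read_lock();
	p = rcu_dereference(gp);
	if (p)
		val = READ_ONCE(p->a);
	rcu_read_unlock();
	return val;
}

/* If dependency ordering were dropped from the model, the pointer load
 * would become an acquire; everything else stays the same. */
static int reader_acq(void)
{
	struct foo *p;
	int val = -1;

	rcu_read_lock();
	p = smp_load_acquire(&gp);
	if (p)
		val = READ_ONCE(p->a);
	rcu_read_unlock();
	return val;
}

Only the pointer load differs between the two; on architectures without a
native acquire load (the ARMv7/PowerPC case mentioned below), the acquire
variant costs an extra fence per load.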
> > > > > 
> > > > > Generally my position is that unless there's a real-world benchmark with
> > > > > proven performance benefits of relying on dependency ordering, one should
> > > > > use an acquire barrier. I haven't yet met such a case, but maybe one of you
> > > > > has...
> > > > https://www.msully.net/thesis/thesis.pdf page 128 (PDF page 141).
> > > > 
> > > > Though this is admittedly for ARMv7 and PowerPC.
> > > > 
> > > Thanks for the link.
> > > 
> > > It's true that on architectures that don't have an acquire load (and have to
> > > use a fence), the penalty will be bigger.
> > > 
> > > But the more obvious discussion would be what constitutes a real-world
> > > benchmark : )
> > > In my experience you can get a lot of performance benefits out of optimizing
> > > barriers in code if all you execute is that code.
> > > But once you embed that into a real-world application, often 90%-99% of the
> > > time is spent in the business logic, not in the data structure.
> > > 
> > > And then the benefits suddenly disappear.
> > > Note also that many barriers are much cheaper when there's no
> > > contention.
> > > 
> > > Because of that, making optimization decisions based on microbenchmarks can
> > > sometimes lead to a very poor "time invested" vs "total product improvement"
> > > ratio.
> > All true, though that 2x and 4x should be worth something.
> 
> I think the most egregious case we had (not barrier related, but cache
> related) was something like ~100x in specific benchmarks and then I think
> somewhere around 1% system-wide. I think the issue was that in the larger
> system, we couldn't keep the cache hot, so our greatly improved data
> locality was being diluted.
> But of course it always depends on how much that component actually
> contributes to the overall system performance.
> Which IMHO should be one of the measurements taken before starting to invest
> heavily into optimizations.

Agreed, it is all too easy for profound local optimizations to have
near-zero global effect.
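
Back-of-the-envelope, and assuming for the sake of the arithmetic that the
optimized component accounts for about 1% of total time (my assumption, not
a measured number), Amdahl's law caps the global win:

	overall speedup = 1 / ((1 - f) + f/s)
	                = 1 / (0.99 + 0.01/100)
	                ≈ 1.01

So even a ~100x local improvement tops out at roughly 1% system-wide, which
lines up with the numbers above.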

> Surprisingly, many people don't want to do that. I think it's something
> about the misleading calculus of "2 months implementing the optimization + 2
> weeks building robust benchmarks & profiling infrastructure > 2 months
> implementing the optimization" instead of "2 weeks building robust
> benchmarks & profiling infrastructure + 0 months implementing a useless
> optimization < 2 months implementing the optimization", which seems to be
> the more common case.

Or, for that matter, having the investigation locate a five-minute change
that produces large benefits.

> > The real-world examples I know of involved garbage collectors, and the
> > improvement was said to be a few percent system-wide.  But that was a
> > verbal exchange, so I don't have a citation for you.
> 
> Thanks for the example; it sounds very reasonable (at least for platforms
> like PowerPC).
> GC has all the hallmarks of a good optimization target: measurable impact on
> system-wide throughput and latency, and pointer chasing (=dependency
> ordering) being a hot part inside the algorithm.

I believe that the target was ARM.  It was definitely not PowerPC.

But yes, GCs are indeed an intense exercise in pointer chasing.
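
A rough sketch of the hot loop, using LKMM-style markings and invented
types rather than any particular collector:

struct gc_obj {
	struct gc_obj *next;
	unsigned long mark;
};

/* Each iteration's accesses to *p depend, via the loaded address, on the
 * pointer fetched at the end of the previous iteration; that address
 * dependency is the "free" ordering.  Using smp_load_acquire(&p->next)
 * instead would add a fence per object on architectures that lack a
 * native acquire load. */
static void mark_list(struct gc_obj *head)
{
	struct gc_obj *p;

	for (p = head; p; p = READ_ONCE(p->next))
		WRITE_ONCE(p->mark, 1);
}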

							Thanx, Paul
