Message-ID: <90270e25-e272-d3f7-0c1b-04e74d2af6f9@huaweicloud.com>
Date: Wed, 1 Mar 2023 11:52:09 +0100
From: Jonas Oberhauser <jonas.oberhauser@...weicloud.com>
To: paulmck@...nel.org
Cc: Andrea Parri <parri.andrea@...il.com>,
Alan Stern <stern@...land.harvard.edu>,
Jonas Oberhauser <jonas.oberhauser@...wei.com>,
will@...nel.org, peterz@...radead.org, boqun.feng@...il.com,
npiggin@...il.com, dhowells@...hat.com, j.alglave@....ac.uk,
luc.maranget@...ia.fr, akiyks@...il.com, dlustig@...dia.com,
joel@...lfernandes.org, urezki@...il.com, quic_neeraju@...cinc.com,
frederic@...nel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3] tools/memory-model: Make ppo a subrelation of po
On 2/28/2023 4:40 PM, Paul E. McKenney wrote:
> On Tue, Feb 28, 2023 at 09:49:07AM +0100, Jonas Oberhauser wrote:
>>
>> On 2/27/2023 11:21 PM, Paul E. McKenney wrote:
>>> On Mon, Feb 27, 2023 at 09:13:01PM +0100, Jonas Oberhauser wrote:
>>>> On 2/27/2023 8:40 PM, Andrea Parri wrote:
>>>>>> The LKMM doesn't believe that a control or data dependency orders a
>>>>>> plain write after a marked read. Hence in this test it thinks that P1's
>>>>>> store to u0 can happen before the load of x1. I don't remember why we
>>>>>> did it this way -- probably we just wanted to minimize the restrictions
>>>>>> on when plain accesses can execute. (I do remember the reason for
>>>>>> making address dependencies induce order; it was so RCU would work.)
>>>>>>
>>>>>> The patch below will change what the LKMM believes. It eliminates the
>>>>>> positive outcome of the litmus test and the data race. Should it be
>>>>>> adopted into the memory model?
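For reference, a minimal LB-shaped sketch of the pattern described above
(names x1/u0 taken from the description; this is only illustrative and not
necessarily the exact litmus test from earlier in the thread):

C lb-marked-read-plain-write-sketch

{}

P0(int *x1, int *u0)
{
	int r0;

	r0 = smp_load_acquire(u0);
	WRITE_ONCE(*x1, 1);
}

P1(int *x1, int *u0)
{
	int r1;

	r1 = READ_ONCE(*x1);	/* marked read */
	if (r1)
		*u0 = 1;	/* plain write, only a control dependency */
}

exists (0:r0=1 /\ 1:r1=1)

Because the current LKMM does not let the control dependency order P1's plain
write after its marked read, it allows the "exists" outcome, and the plain
store can also race with P0's marked load of *u0; crediting the dependency
would order the store and eliminate both.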
>>>>> (Unpopular opinion, I know:) it should drop dependency ordering, not
>>>>> add/promote it.
>>>>>
>>>>> Andrea
>>>> Maybe not as unpopular as you think... :)
>>>> But either way IMHO it should be consistent; either take all the
>>>> dependencies that are true and add them, or drop them all.
>>>> In the latter case, RCU should change to an acquire barrier. (Also, one
>>>> would then have to deal with OOTA in some other way.)
>>>>
>>>> Generally my position is that unless there's a real-world benchmark with
>>>> proven performance benefits of relying on dependency ordering, one should
>>>> use an acquire barrier. I haven't yet met such a case, but maybe one of you
>>>> has...
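To make that trade-off concrete, a kernel-style sketch of the two options for
the classic pointer-publication case (hypothetical names gp / struct foo /
->val; RCU read-side locking omitted for brevity):

struct foo { int val; };
struct foo *gp;

int reader_dep(void)
{
	/* Marked read carrying an address dependency: the dependent access
	 * to p->val is ordered "for free" on all supported architectures. */
	struct foo *p = rcu_dereference(gp);

	return p ? READ_ONCE(p->val) : -1;
}

int reader_acq(void)
{
	/* The "just use acquire" alternative: the same guarantee for the
	 * reader, but it needs a fence (or a heavier load) on architectures
	 * without a native acquire load, e.g. PowerPC and ARMv7. */
	struct foo *p = smp_load_acquire(&gp);

	return p ? READ_ONCE(p->val) : -1;
}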
>>> https://www.msully.net/thesis/thesis.pdf page 128 (PDF page 141).
>>>
>>> Though this is admittedly for ARMv7 and PowerPC.
>>>
>> Thanks for the link.
>>
>> It's true that on architectures that don't have an acquire load (and have to
>> use a fence), the penalty will be bigger.
>>
>> But the more obvious discussion would be what constitutes a real-world
>> benchmark : )
>> In my experience you can get a lot of performance benefits out of optimizing
>> barriers in code if all you execute is that code.
>> But once you embed that into a real-world application, often 90%-99% of the
>> time will be spent in the business logic, not in the data structure.
>>
>> And then the benefits suddenly disappear.
>> Note that a lot of barriers are a lot cheaper as well when there's no
>> contention.
>>
>> Because of that, making optimization decisions based on microbenchmarks can
>> sometimes lead to a very poor "time invested" vs "total product improvement"
>> ratio.
> All true, though that 2x and 4x should be worth something.
I think the most egregious case we had (not barrier related, but cache
related) was something like ~100x in specific benchmarks, but only around
1% system-wide. The issue was that in the larger system we couldn't keep
the cache hot, so our greatly improved data locality was diluted.
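For intuition on how a ~100x local win can shrink to ~1%, a back-of-the-envelope
Amdahl's law check (assuming, purely for illustration, that the component
accounted for about p = 1% of total run time and kept its s = 100x speedup):

	overall speedup = 1 / ((1 - p) + p/s) = 1 / (0.99 + 0.01/100) ~= 1.01

i.e. about 1% system-wide, independent of any cache-dilution effects.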
But of course it always depends on how much that component actually
contributes to overall system performance, which IMHO is one of the
measurements that should be taken before investing heavily in optimizations.
Surprisingly, many people don't want to do that. I think it comes down to
the misleading calculus of
  "2 months implementing the optimization + 2 weeks building robust
   benchmarks & profiling infrastructure > 2 months just implementing the
   optimization",
when the comparison that actually seems to apply more often is
  "2 weeks building robust benchmarks & profiling infrastructure + 0 months
   implementing a useless optimization < 2 months implementing the
   optimization".
> The real-world examples I know of involved garbage collectors, and the
> improvement was said to be a few percent system-wide. But that was a
> verbal exchange, so I don't have a citation for you.
Thanks for the example, it sounds very reasonable (at least for
platforms like PowerPC).
GC has all the hallmarks of a good optimization target: a measurable
impact on system-wide throughput and latency, and pointer chasing
(= dependency ordering) being a hot part of the algorithm.
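As an illustration of why pointer chasing maps onto dependency ordering, a
hypothetical mark-phase sketch (kernel-style macros for consistency with this
thread; a userspace GC would use the equivalent C11 atomics):

struct gc_obj {
	struct gc_obj *next;	/* published concurrently by mutators */
	int marked;
};

static void mark_chain(struct gc_obj *head)
{
	struct gc_obj *p = READ_ONCE(head->next);

	while (p) {
		WRITE_ONCE(p->marked, 1);
		/* The address of the next load depends on the value just
		 * loaded, so ordering comes from the address dependency;
		 * a load-acquire here would instead cost a fence per object
		 * visited on e.g. PowerPC. */
		p = READ_ONCE(p->next);
	}
}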
Best wishes,
jonas