Message-ID: <d7e59f22-12d3-42e5-88d2-e96beafb78c5@huaweicloud.com>
Date: Fri, 10 Jan 2025 13:21:32 +0100
From: Jonas Oberhauser <jonas.oberhauser@...weicloud.com>
To: Alan Stern <stern@...land.harvard.edu>
Cc: paulmck@...nel.org, parri.andrea@...il.com, will@...nel.org,
peterz@...radead.org, boqun.feng@...il.com, npiggin@...il.com,
dhowells@...hat.com, j.alglave@....ac.uk, luc.maranget@...ia.fr,
akiyks@...il.com, dlustig@...dia.com, joel@...lfernandes.org,
urezki@...il.com, quic_neeraju@...cinc.com, frederic@...nel.org,
linux-kernel@...r.kernel.org, lkmm@...ts.linux.dev,
hernan.poncedeleon@...weicloud.com
Subject: Re: [RFC] tools/memory-model: Rule out OOTA
On 1/10/2025 4:12 AM, Alan Stern wrote:
> On Thu, Jan 09, 2025 at 09:09:00PM +0100, Jonas Oberhauser wrote:
>
>>>> So for example, in
>>>>
>>>> int a = 1;
>>>> barrier();
>>>> a = 2;
>>>> //...
>>>>
>>>> the compiler does not know how the code inside barrier() accesses memory,
>>>> including volatile memory.
>>>
>>> I would say rather that the compiler does not know that the values
>>> stored in memory are the same before and after the barrier().
>>>
>>> Even the
>>> values of local variables whose addresses have not been exported.
>>
>> No, this is not true. I used to think so too until a short while ago.
>>
>> But if you look at the output of gcc -O3 you will see that it does happily
>> remove `a` in this function.
>
> Isn't that consistent with what I said?
Ok, after careful reading, I think there are two assumptions you are
making that I believe are not true; my example, however, only
contradicts one of them, not both:
1) the barrier only tells the compiler that it may *change* the value of
memory locations. I believe it also tells the compiler that it may
*read* the value of memory locations.
2) the barrier also talks about the values of local variables whose
addresses have not been exported. I do not believe this is the case.
The second example I gave (where a=1 is still emitted) shows that your
assumption 1) is inconsistent with what gcc currently does.
As for what gcc guarantees, the manual says: "add memory to the list of
clobbered registers. This will cause GCC to not keep memory values
cached in registers across the assembler instruction", i.e., it needs to
flush the value from the register to actual memory.
I believe this too is inconsistent with your assumption 1): if the
barrier just modified memory and did not read it, there would be no
need to flush the value to memory. It would suffice to ensure that the
value is not assumed to be the same after the barrier.
With your assumption 1) discharged, the fact that `a=1` can still be
removed from before the barrier should show that this guarantee does not
hold for all memory locations, only for those that could legally be
accessed inside the barrier, i.e., those whose address has been exported.
>>> Question: Can the compiler assume that no other threads access a between
>>> the two stores, on the grounds that this would be a data race? I'd
>>> guess that it can't make that assumption, but it would be nice to know
>>> for sure.
>>
>> It can not make the assumption if &a has escaped. In that case, barrier()
>> could be so:
>>
>> barrier(){
>> store_release(OTHER_THREAD_PLEASE_MODIFY,&a);
>>
>> while (! load_acquire(OTHER_THREAD_IS_DONE));
>> }
>>
>> with another thread doing
>>
>> while (! load_acquire(OTHER_THREAD_PLEASE_MODIFY)) yield();
>> *OTHER_THREAD_PLEASE_MODIFY ++;
>> store_release(OTHER_THREAD_IS_DONE, 1);
>
> Bear in mind that there's a difference between what a compiler _can do_
> and what gcc _currently does_.
Of course. But I am not sure if your comment addresses my comment here,
or was intended for another section.
>
>>> I think you're giving the compiler too much credit. The one thing the
>>> compiler is allowed to assume is that the code, as written, does not
>>> contain a data race or other undefined behavior.
>>
>> Apologies, the way I used "assume" is misleading.
>> I should have said that the compiler has to ensure that even if the code of
>> foo() and barrier() were so, that the behavior of the code it generates is
>> the same (w.r.t. observable side effects) as if the program were executed by
>> the abstract machine. Or I should have said that it can *not* assume that
>> the functions are *not* as defined above.
>>
>> Which means that TURN_THE_BREAKS_ON would need to be assigned 1.
>> The only way the compiler can achieve that guarantee (while treating barrier
>> as a black box) is to make sure that the value of `a` before barrier() is 1.
>
> Who says the compiler has to treat barrier() as a black box? As far as
> I know, gcc makes no such guarantee.
Note that if it didn't, then barrier() would not work: since the asm
instruction is empty, the compiler could easily figure this out and
simply delete it.
But maybe I should also be more precise about what I mean by black box,
namely, that
1) the asm block has significant side effects,
2) which include reading and writing arbitrary (legally accessible)
memory locations, in arbitrary order and under arbitrary control flow.
Both of these imply that the compiler cannot assume that the barrier
does not execute some logic equivalent to what I put above.
>> If herd7 would generate edges for semantic dependencies instead of for its
>> version of syntactic dependencies, then rwdep is the answer.
>
> That statement is meaningless (or at least, impossible to implement)
> because there is no widely agreed-upon formal definition for semantic
> dependency.
Yes, which also means that a 100% correct end-to-end solution (herd +
cat + ... ?) is currently not implementable.
But we can still break the problem into two halves: one that is 100%
solved inside the cat file, and one that is the responsibility of herd7
and is currently not solved (nor 100% satisfactorily solvable).
The advantage is that if we read the cat file as a mathematical
definition, we can at least argue on paper, with full rigor, about code
in the cases where we can either work out on paper what the semantic
dependencies are, or at least say "relative to current compilers, we
know which dependencies are semantically preserved", even if herd7 or
other tools lag behind in one or both respects.
After all, herd7 is just a (useful) automation tool for reasoning about
LKMM, which has its limitations (scalability, a specific definition of
dependencies, limited C subset...).
>> Given that we can not define dep inside the cat model, one may as well
>> define it as rwdep;rfe with the intended meaning of the dependencies being
>> the semantic ones; then it is an inaccuracy of herd7 that it does not
>> provide the proper dependencies.
>
> Perhaps so. LKMM does include other features which the compiler can
> defeat if the programmer isn't sufficiently careful.
How many of these are due to herd7's limitations vs. in the cat file?
>>> In the paper, I speculated that if a physical execution of a program
>>> matches an abstract execution containing such a non-observed OOTA cycle,
>>> then it also matches another abstract execution in which the cycle
>>> doesn't exist. I don't know how to prove this conjecture, though.
>>
>> Yes, that also makes sense.
>>
>> Note that this speculation does not hold in the current LKMM though. In the
>> Litmus test I shared in the opening e-mail, where the outcome 0:r1=1 /\
>> 1:r1=1 is only possible with an OOTA (even though the values from the OOTA
>> are never used anywhere).
>
> If the fact that the outcome 0:r1=1 /\ 1:r1=1 has occurred is proof that
> there was OOTA, then the OOTA cycle _is_ observed, albeit indirectly --
> at least, in the sense that I intended. (The situation mentioned in the
> paper is better described as an execution where the compiler has elided
> all the accesses in the OOTA cycle.)
I'm not sure that sense makes a lot of sense to me.
But it does make the proof of your claim totally trivial: if there is no
other OOTA-free execution with the same observable behavior, then the
observed behavior is proof that the OOTA happened, so the OOTA was
observed.
So by contraposition, any non-observed OOTA has an OOTA-free execution
with the same observable behavior.
The sense in which I would define "observed" is more along the lines of
"there is an observable side effect (such as a store to a volatile
location) which has a semantic dependency on a load that reads from one
of the stores in the OOTA cycle".
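For concreteness, the textbook OOTA shape (the standard example, not
the exact litmus test from the opening mail) looks like this in litmus
form, with plain accesses so that no dependency ordering is guaranteed;
under the definition above, the cycle would only count as observed if
some volatile side effect semantically depended on r1 or r2:

```
C oota-sketch

{}

P0(int *x, int *y)
{
	int r1 = *x;   /* plain accesses: dependencies not preserved */
	*y = r1;
}

P1(int *x, int *y)
{
	int r2 = *y;
	*x = r2;
}

exists (0:r1=42 /\ 1:r2=42)
```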
>
>> With C++'s non-local model I wouldn't be totally surprised if there were
>> similar examples in C++, but given that its ordering definition is a lot
>> more straightforward than LKMM in that it doesn't have all these cases of
>> different barriers like wmb and rmb and corner cases like Noreturn etc., my
>> intuition says that there aren't any.
>
> I'll have to give this some thought.
If you do want to prove the claim for the stricter definition of
observable OOTA, I would think about looking at the first read r in each
thread that reads from an OOTA store w, and see if at least one of them
could read from another store.
If that is not possible, then my intuition would be that there would be
some happens-before relation blocking you from reading from an earlier
store than w, in particular, w ->hb r.
If it is not possible on any thread, then you get a bunch of these hb
edges, roughly in parallel to the OOTA cycle.
Perhaps you can turn that into an hb cycle.
This last step doesn't work in LKMM because the hb may be caused by
rmb/wmb, which does not extend over the plain accesses in the OOTA cycle
to the bounding store to make a bigger hb cycle.
But in C++, if you have w ->hb r on each OOTA rfe edge, then you also
have r ->hb w' along the OOTA dep edge, and you get an hb cycle.
Best wishes,
jonas