Message-ID: <20210601063342.GB10026@leoy-ThinkPad-X240s>
Date:   Tue, 1 Jun 2021 14:33:42 +0800
From:   Leo Yan <leo.yan@...aro.org>
To:     Adrian Hunter <adrian.hunter@...el.com>
Cc:     Arnaldo Carvalho de Melo <acme@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>,
        Mark Rutland <mark.rutland@....com>,
        Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
        Namhyung Kim <namhyung@...nel.org>,
        Andi Kleen <ak@...ux.intel.com>,
        linux-perf-users@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v1 2/2] perf auxtrace: Optimize barriers with
 load-acquire and store-release

On Mon, May 31, 2021 at 10:03:33PM +0300, Adrian Hunter wrote:
> On 31/05/21 6:10 pm, Leo Yan wrote:
> > Hi Peter, Adrian,
> > 
> > On Wed, May 19, 2021 at 10:03:19PM +0800, Leo Yan wrote:
> >> Load-acquire and store-release are one-way permeable barriers, which can
> >> be used to guarantee the memory ordering between accessing the buffer
> >> data and the buffer's head / tail.
> >>
> >> This patch optimizes the memory ordering with the load-acquire and
> >> store-release barriers.
> > 
> > Is this patch okay for you?
> > 
> > Besides this patch, I have an extra question.  As you can see, for
> > accessing the AUX buffer's head and tail, perf also supports using
> > the compiler built-in functions for atomic access:
> > 
> >   __sync_val_compare_and_swap()
> >   __sync_bool_compare_and_swap()
> > 
> > Now that we have READ_ONCE()/WRITE_ONCE(), do you think we still need
> > to support the __sync_xxx_compare_and_swap() atomics?
> 
> I don't remember, but it seems to me atomicity is needed only
> for a 32-bit perf running with a 64-bit kernel.

32-bit perf wants to access the 64-bit value atomically; I think this is
to avoid a torn read in the following scenario:

        CPU0 (64-bit kernel)           CPU1 (32-bit user)

                                         read head_lo
        WRITE_ONCE(head)
                                         read head_hi
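
To make the torn read concrete, here is a small standalone C sketch
(the names below are made up for illustration; this is not the perf
code): a plain 64-bit load on Arm32 may be compiled as two 32-bit
loads, while a compare-and-swap with old == new == 0 reads the value
atomically without ever modifying it:

  #include <stdint.h>

  /* Toy stand-in for the 64-bit AUX head shared with the kernel. */
  static volatile uint64_t head64;

  /*
   * Plain read: the 32-bit compiler may split this into two 32-bit
   * loads (or emit ldrd, which is single-copy atomic only on
   * Armv7 + LPAE), so the low and high halves can come from two
   * different kernel writes -- a torn read.
   */
  static uint64_t read_head_plain(void)
  {
  	return head64;
  }

  /*
   * Atomic read via compare-and-swap: "swapping" 0 for 0 is a no-op
   * and the compare fails for any non-zero value, so the value is
   * never changed, but the ldrexd/strexd loop it compiles to on Arm32
   * returns one consistent 64-bit snapshot.
   */
  static uint64_t read_head_cas(void)
  {
  	return __sync_val_compare_and_swap(&head64, 0, 0);
  }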


I dumped the disassembly of the 64-bit read for an Arm32 perf build and
got the results below:

  perf Arm32 for READ_ONCE():

	case 8: *(__u64_alias_t *) res = *(volatile __u64_alias_t *) p; break;
     84a:	68fb      	ldr	r3, [r7, #12]
     84c:	e9d3 2300 	ldrd	r2, r3, [r3]
     850:	6939      	ldr	r1, [r7, #16]
     852:	e9c1 2300 	strd	r2, r3, [r1]
     856:	e007      	b.n	868 <auxtrace_mmap__read_head+0xb0>

It uses the ldrd instruction, "Load Register Dual (register)", but that
does not mean the access is atomic.  Based on the comment in the kernel
header include/asm-generic/rwonce.h, I think ldrd/strd are "atomic in
some cases (namely Armv7 + LPAE), but for others we rely on the access
being split into 2x32-bit accesses".


  perf Arm32 for __sync_val_compare_and_swap():

	u64 head = __sync_val_compare_and_swap(&pc->aux_head, 0, 0);
     7d6:	68fb      	ldr	r3, [r7, #12]
     7d8:	f503 6484 	add.w	r4, r3, #1056	; 0x420
     7dc:	f04f 0000 	mov.w	r0, #0
     7e0:	f04f 0100 	mov.w	r1, #0
     7e4:	f3bf 8f5b 	dmb	ish
     7e8:	e8d4 237f 	ldrexd	r2, r3, [r4]
     7ec:	ea52 0c03 	orrs.w	ip, r2, r3
     7f0:	d106      	bne.n	800 <auxtrace_mmap__read_head+0x48>
     7f2:	e8c4 017c 	strexd	ip, r0, r1, [r4]
     7f6:	f1bc 0f00 	cmp.w	ip, #0
     7fa:	f1bc 0f00 	cmp.w	ip, #0
     7fe:	d1f3      	bne.n	7e8 <auxtrace_mmap__read_head+0x30>
     800:	f3bf 8f5b 	dmb	ish
     804:	e9c7 2304 	strd	r2, r3, [r7, #16]

For __sync_val_compare_and_swap(), the compiler emits the ldrexd/strexd
instructions; this pair relies on the exclusive monitor when accessing
the 64-bit value, so it seems to me this is the more reliable way to
access a 64-bit value in the CPU's 32-bit mode.
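
Keeping both paths would look roughly like the current shape of
auxtrace_mmap__read_head() in tools/perf/util/auxtrace.h; below is a
sketch from memory (the guard macro names may not match exactly), with
READ_ONCE() for 64-bit builds and the compare-and-swap fallback for
32-bit builds that have the __sync built-ins:

  static inline u64 auxtrace_mmap__read_head(struct auxtrace_mmap *mm)
  {
  	struct perf_event_mmap_page *pc = mm->userpg;
  #if BITS_PER_LONG == 64 || !defined(HAVE_SYNC_COMPARE_AND_SWAP_SUPPORT)
  	u64 head = READ_ONCE(pc->aux_head);
  #else
  	/* 32-bit build: CAS with (0, 0) gives an atomic 64-bit load. */
  	u64 head = __sync_val_compare_and_swap(&pc->aux_head, 0, 0);
  #endif
  	/* Ensure all reads are done after we read the head. */
  	smp_rmb();
  	return head;
  }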

Conclusion: it seems to me __sync_xxx_compare_and_swap() should be kept
for this case rather than switching to READ_ONCE() for 32-bit builds.
Or do you have any other suggestions?  Thanks!

Leo
