lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <617b6c79-2d66-467f-89a0-79d2d2efb714@helsinkinet.fi>
Date: Mon, 1 Sep 2025 18:12:53 +0300
From: Eero Tamminen <oak@...sinkinet.fi>
To: Geert Uytterhoeven <geert@...ux-m68k.org>
Cc: Finn Thain <fthain@...ux-m68k.org>,
 Andrew Morton <akpm@...ux-foundation.org>, Lance Yang
 <lance.yang@...ux.dev>, Masami Hiramatsu <mhiramat@...nel.org>,
 Peter Zijlstra <peterz@...radead.org>, Will Deacon <will@...nel.org>,
 stable@...r.kernel.org, linux-kernel@...r.kernel.org,
 linux-m68k@...ts.linux-m68k.org
Subject: Re: [PATCH] atomic: Specify natural alignment for atomic_t

Hi Geert,

On 1.9.2025 11.51, Geert Uytterhoeven wrote:
>> On 23.8.2025 10.49, Lance Yang wrote:
>>   > Anyway, I've prepared two patches for discussion, either of which should
>>   > fix the alignment issue :)
>>   >
>>   > Patch A[1] adjusts the runtime checks to handle unaligned pointers.
>>   > Patch B[2] enforces 4-byte alignment on the core lock structures.
>>   >
>>   > Both tested on x86-64.
>>   >
>>   > [1]
>> https://lore.kernel.org/lkml/20250823050036.7748-1-lance.yang@linux.dev
>>   > [2] https://lore.kernel.org/lkml/20250823074048.92498-1-
>>   > lance.yang@...ux.dev
>>
>> Same goes for both of these, except that removing warnings makes minimal
>> kernel boot 1-2% faster than 4-aligning the whole struct.

Note that above result was from (emulated) 68030 Falcon, i.e. something 
that has really small caches (256-byte i-/d-cache), *and* a kernel 
config using CONFIG_CC_OPTIMIZE_FOR_SIZE=y (with GCC 12.2).


> That is an interesting outcome! So the gain of naturally-aligning the
> lock is more than offset by the increased cache pressure due to wasting
> (a bit?) more memory.

Another reason could be those extra inlined warning checks in:
-----------------------------------------------------
$ git grep -e hung_task_set_blocker -e hung_task_clear_blocker kernel/
kernel/locking/mutex.c: hung_task_set_blocker(lock, BLOCKER_TYPE_MUTEX);
kernel/locking/mutex.c: hung_task_clear_blocker();
kernel/locking/rwsem.c:         hung_task_set_blocker(sem, 
BLOCKER_TYPE_RWSEM_READER);
kernel/locking/rwsem.c:         hung_task_clear_blocker();
kernel/locking/rwsem.c:         hung_task_set_blocker(sem, 
BLOCKER_TYPE_RWSEM_WRITER);
kernel/locking/rwsem.c:         hung_task_clear_blocker();
kernel/locking/semaphore.c:     hung_task_set_blocker(sem, 
BLOCKER_TYPE_SEM);
kernel/locking/semaphore.c:     hung_task_clear_blocker();
-----------------------------------------------------


> Do you know what was the impact on total kernel size?

As expected, kernel code size is smaller with the static inlined warn 
checks removed:
-----------------------------------------------------
$ size vmlinux-m68k-6.16-fix1 vmlinux-m68k-6.16-fix2
    text	   data	    bss	    dec	    hex	filename
3088520	 953532	  84224	4126276	 3ef644	vmlinux-m68k-6.16-fix1  [1]
3088730	 953564	  84192	4126486	 3ef716	vmlinux-m68k-6.16-fix2  [2]
-----------------------------------------------------

But could aligning of structs have caused 32 bytes moving from BSS to 
DATA section?


	- Eero

PS. I profiled these 3 kernels on emulated Falcon. According to (Hatari) 
profiler, main difference in the kernel with the warnings removed, is it 
doing less than half of the calls to  NCR5380_read() / 
atari_scsi_reg_read(), compared to the other 2 versions.

These additional 2x calls in the other two versions, seem to mostly come 
through chain originating from process_scheduled_works(), 
NCR5380_poll_politely*() functions and bus probing.

After quick look at the WARN_ON_ONCE()s and SCSI code, I have no idea 
how having those checks being inlined to locking functions, or not, 
would cause a difference like that.  I've tried patching & building 
kernels again, and repeating profiling, but result is same.

While Hatari call (graph) tracking might have some issue (due to kernel 
stack return address manipulation), I don't see how there could be a 
problem with the profiler instruction counts.  Kernel code at given 
address does not change during boot in monolithic kernel, (emulator) 
profiler tracks _every_ executed instruction/address, and it's clearly 
correct function:
------------------------------------
# disassembly with profile data: <instructions percentage>% (<sum of 
instructions>, <sum of cycles>, <sum of i-cache misses>, <sum of d-cache 
hits>)
...
atari_scsi_falcon_reg_read:
$001dd826  link.w    a6,#$0      0.43% (414942, 1578432, 44701, 0)
$001dd82a  move.w    sr,d1       0.43% (414942, 224, 8, 0)
$001dd82c  ori.w     #$700,sr    0.43% (414942, 414368, 44705, 0)
$001dd830  move.l    $8(a6),d0   0.43% (414942, 357922, 44705, 414911)
$001dd834  addi.l    #$88,d0     0.43% (414942, 1014804, 133917, 0)
$001dd83a  move.w    d0,$8606.w  0.43% (414942, 3618352, 89169, 0)
$001dd83e  move.w    $8604.w,d0  0.43% (414942, 3620646, 89162, 0)
$001dd842  move.w    d1,sr       0.43% (414942, 2148, 142, 0)
$001dd844  unlk      a6          0.43% (414942, 436, 0, 414893)
$001dd846  rts                   0.43% (414942, 1073934, 134123, 414942)
atari_scsi_falcon_reg_write:
$001dd848  link.w    a6,#$0      0.00% (81, 484, 29, 0)
$001dd84c  move.l    $c(a6),d0   0.00% (81, 326, 29, 73)
...
------------------------------------

Maybe those WARN_ON_ONCE() checks just happen to slow down something 
marginally so that things get interrupted & re-started more for the SCSI 
code?

PPS. emulated machine has no SCSI drives, only one IDE drive (with 4MB 
Busybox partition):
----------------------------------------------------
scsi host0: Atari native SCSI, irq 15, io_port 0x0, base 0x0, can_queue 
1, cmd_per_lun 2, sg_tablesize 1, this_id 7, flags { }
atari-falcon-ide atari-falcon-ide: Atari Falcon and Q40/Q60 PATA controller
scsi host1: pata_falcon
ata1: PATA max PIO4 cmd fff00000 ctl fff00038 data fff00000 no IRQ, 
using PIO polling
...
ata1: found unknown device (class 0)
ata1.00: ATA-7: Hatari  IDE disk 4M, 1.0, max UDMA/100
ata1.00: 8192 sectors, multi 16: LBA48
ata1.00: configured for PIO
...
scsi 1:0:0:0: Direct-Access     ATA      Hatari  IDE disk 1.0  PQ: 0 ANSI: 5
sd 1:0:0:0: [sda] 8192 512-byte logical blocks: (4.19 MB/4.00 MiB)
sd 1:0:0:0: [sda] Write Protect is off
sd 1:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 1:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't 
support DPO or FUA
sd 1:0:0:0: [sda] Preferred minimum I/O size 512 bytes
sd 1:0:0:0: [sda] Attached SCSI disk
VFS: Mounted root (ext2 filesystem) readonly on device 8:0.
---------------------------------------------------

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ