Date:   Sat, 20 May 2023 12:13:27 +0200
From:   "Arnd Bergmann" <arnd@...db.de>
To:     guoren <guoren@...nel.org>
Cc:     "Palmer Dabbelt" <palmer@...osinc.com>,
        "Thomas Gleixner" <tglx@...utronix.de>,
        "Peter Zijlstra" <peterz@...radead.org>,
        "Andy Lutomirski" <luto@...nel.org>,
        "Conor.Dooley" <conor.dooley@...rochip.com>,
        Heiko Stübner <heiko@...ech.de>,
        "Jisheng Zhang" <jszhang@...nel.org>,
        "Huacai Chen" <chenhuacai@...nel.org>,
        "Anup Patel" <apatel@...tanamicro.com>,
        "Atish Patra" <atishp@...shpatra.org>,
        "Mark Rutland" <mark.rutland@....com>,
        Björn Töpel <bjorn@...nel.org>,
        "Paul Walmsley" <paul.walmsley@...ive.com>,
        "Catalin Marinas" <catalin.marinas@....com>,
        "Will Deacon" <will@...nel.org>, "Mike Rapoport" <rppt@...nel.org>,
        "Anup Patel" <anup@...infault.org>, shihua@...as.ac.cn,
        jiawei@...as.ac.cn, liweiwei@...as.ac.cn, luxufan@...as.ac.cn,
        chunyu@...as.ac.cn, tsu.yubo@...il.com, wefu@...hat.com,
        wangjunqiang@...as.ac.cn, kito.cheng@...ive.com,
        "Andy Chiu" <andy.chiu@...ive.com>,
        "Vincent Chen" <vincent.chen@...ive.com>,
        "Greentime Hu" <greentime.hu@...ive.com>,
        "Jonathan Corbet" <corbet@....net>, wuwei2016@...as.ac.cn,
        "Jessica Clarke" <jrtc27@...c27.com>,
        Linux-Arch <linux-arch@...r.kernel.org>,
        linux-kernel@...r.kernel.org, linux-riscv@...ts.infradead.org,
        "Guo Ren" <guoren@...ux.alibaba.com>
Subject: Re: [RFC PATCH 00/22] riscv: s64ilp32: Running 32-bit Linux kernel on 64-bit
 supervisor mode

On Sat, May 20, 2023, at 04:53, Guo Ren wrote:
> On Sat, May 20, 2023 at 4:20 AM Arnd Bergmann <arnd@...db.de> wrote:
>> On Thu, May 18, 2023, at 15:09, guoren@...nel.org wrote:
>>
>> I've tried to run the same numbers for the debate about running
>> 32-bit vs 64-bit arm kernels in the past, though focused mostly on
>> slightly larger systems: I looked mainly at the 512MB case,
>> as that is the most cost-efficient DDR3 memory configuration
>> and fairly common.
> 512MB is extravagant, in my opinion. In the IPC market, 32/64MB is for
> 480P/720P/1080p, 128/256MB is for 1080p/2k, and 512/1024MB is for 4K.
> 512MB chips are less than 5% of the total (I guess). Even in 512MB
> chips, the additional memory is for the frame buffer, not the Linux
> system.

This depends a lot on the target application of course. For
a phone or NAS box, 512MB is probably the lower limit.

What I observe in arch/arm/ devicetree submissions, on board-db.org,
and on industrial Arm board vendor websites is that
512MB is the most common configuration, and I think 1GB is still
more common than 256MB even for 32-bit machines. There is of course
a difference between number of individual products, and number of
machines shipped in a given configuration, and I guess you have
a good point that the cheapest ones are also the ones that ship
in the highest volume.

>> What I'd like to understand better in your example is where
>> the 14MB of memory went. I assume this is for 128MB of total
>> RAM, so we know that 1MB went into additional 'struct page'
>> objects (32 bytes * 32768 pages). It would be good to know
>> where the dynamic allocations went and if they are reclaimable
>> (e.g. inodes) or non-reclaimable (e.g. kmalloc-128).
>>
>> For the vmlinux size, is this already a minimal config
>> that one would run on a board with 128MB of RAM, or a
>> defconfig that includes a lot of stuff that is only relevant
>> for other platforms but also grows on 64-bit?
> It's not a minimal config, it's defconfig. So I'd say it's a rough
> measurement :)
>
> I admit I wanted to exaggerate it a little bit, but that's the
> starting point for cutting down memory usage for most people, right?
> During the past year, we have been trying to convince our customers to
> use s64lp64 + u32ilp32, but they can't tolerate even a 1% additional
> memory cost in 64MB/128MB scenarios, so they chose cortex-a7/a35
> instead, which can run 32-bit Linux. I think it's too early to talk
> about throwing 32-bit Linux into the garbage, not only because of the
> memory footprint but also because of people's ingrained opinions.
> Changing their minds takes a long time.
>
>>
>> What do you see in /proc/slabinfo, /proc/meminfo/, and
>> 'size vmlinux' for the s64ilp32 and s64lp64 kernels here?
> Both s64ilp32 & s64lp64 use the same u32ilp32_rootfs.ext2 binary and
> the same opensbi binary.
> Both use the same opensbi(2MB) + Linux(126MB) memory layout.
>
> Here is the result:
>
> s64ilp32:
> [    0.000000] Virtual kernel memory layout:
> [    0.000000]       fixmap : 0x9ce00000 - 0x9d000000   (2048 kB)
> [    0.000000]       pci io : 0x9d000000 - 0x9e000000   (  16 MB)
> [    0.000000]      vmemmap : 0x9e000000 - 0xa0000000   (  32 MB)
> [    0.000000]      vmalloc : 0xa0000000 - 0xc0000000   ( 512 MB)
> [    0.000000]       lowmem : 0xc0000000 - 0xc7e00000   ( 126 MB)
> [    0.000000] Memory: 97748K/129024K available (8699K kernel code,
> 8867K rwdata, 4096K rodata, 4204K init, 361K bss, 31276K reserved, 0K
> cma-reserved)

Ok, so it saves only a little bit on .text/.init/.bss/.rodata, but
there is a 4MB difference in rwdata, and a total of 10.4MB difference
in "reserved" size, which I think includes all of the above plus
the mem_map[] array.

For comparison, the corresponding s64lp64 line:

89380K/131072K available (8638K kernel code, 4979K rwdata, 4096K rodata, 2191K init, 477K bss, 41692K reserved, 0K cma-reserved)
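
As a quick back-of-the-envelope check (just my arithmetic, using the
"32 bytes * 32768 pages" figure quoted above):

$ echo $(( 41692 - 31276 ))K           # delta in "reserved"
10416K
$ echo $(( 32768 * 32 / 1024 ))K       # 32 extra bytes * 32768 pages
1024K

so the bigger mem_map[] accounts for only about 1MB of the 10.4MB
difference.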

Oddly, I don't see anywhere close to 8MB of rwdata in a riscv64
defconfig build (linux-next, gcc-13), so I don't know where that
number comes from:

$ size -A build/tmp/vmlinux | sort -k2 -nr | head
Total                   13518684
.text                    8896058   18446744071562076160
.rodata                  2219008   18446744071576748032
.data                     933760   18446744071583039488
.bss                      476080   18446744071584092160
.init.text                264718   18446744071572553728
__ksymtab_strings         183986   18446744071579214312
__ksymtab_gpl             122928   18446744071579091384
__ksymtab                 109080   18446744071578982304
__bug_table                98352   18446744071583973248
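
If I read mem_init_print_info() right, the rwdata number in those
boot lines is basically _edata - _sdata, so something like this
(a rough sketch; the exact set of output sections in that range
depends on the arch linker script) should give a comparable number
from the same build:

$ size -A build/tmp/vmlinux | \
      awk '$1 ~ /^\.(s?data|got)$/ { sum += $2 }
           END { printf "%dK\n", sum / 1024 }'

which still only comes to roughly 1MB here, nowhere near the rwdata
numbers above.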



> KReclaimable:        644 kB
> Slab:               4536 kB
> SReclaimable:        644 kB
> SUnreclaim:         3892 kB
> KernelStack:         344 kB

These look like the only notable differences in meminfo; for
comparison, the corresponding s64lp64 values:

  KReclaimable:       1092 kB                                        
  Slab:               6900 kB                                        
  SReclaimable:       1092 kB                                        
  SUnreclaim:         5808 kB                                        
  KernelStack:         688 kB                                        

The largest chunk here is 2MB in non-reclaimable slab allocations,
or a 50% growth of those.

The kernel stacks are doubled as expected, but that's only 344KB,
similarly for reclaimable slabs.
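
(For what it's worth, the stack numbers are consistent with ~43
tasks in both cases, assuming the usual 16KB stacks on the 64-bit
kernel and 8KB on the 32-bit one:

$ echo $(( 688 / 16 )) $(( 344 / 8 ))
43 43

so it really is just the per-task stack size doubling, not more
tasks.)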

> # cat /proc/slabinfo
>
> slabinfo - version: 2.1
> # name            <active_objs> <num_objs> <objsize> <objperslab>
> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> :
> slabdata <active_slabs> <num_slabs> <sharedavail>
> ext4_groupinfo_1k     28     28    144   28    1 : tunables    0    0
>   0 : slabdata      1      1      0
> p9_req_t               0      0    104   39    1 : tunables    0    0

Did you perhaps miss a few lines while pasting these? It seems
odd that some caches only show up in the ilp32 case (proc_dir_entry,
jbd2_journal_handle, buffer_head, biovec-max, anon_vma_chain, ...) and
some others are only in the lp64 case (UNIX, ext4_prealloc_space,
files_cache, filp, ip_fib_alias, task_struct, uid_cache, ...).

Looking at the ones that are in both and have the largest size
increase, I see

# lp64  (total kB, cache, objects, objsize)
1788 kernfs_node_cache 14304 128
 590 shmem_inode_cache 646 936
 272 inode_cache 360 776
 153 ext4_inode_cache 105 1496
 250 dentry 1188 216
 192 names_cache 48 4096
 199 radix_tree_node 350 584
 307 kmalloc-64 4912 64
  60 kmalloc-128 480 128
  47 kmalloc-192 252 192
 204 kmalloc-256 816 256
  72 kmalloc-512 144 512
 840 kmalloc-1k 840 1024

# ilp32 (total kB, cache, objects, objsize)
1197 kernfs_node_cache 13938 88
 373 shmem_inode_cache 637 600
 174 inode_cache 360 496
  84 ext4_inode_cache 88 984
 177 dentry 1196 152
  32 names_cache 8 4096
 100 radix_tree_node 338 304
 331 kmalloc-64 5302 64
 132 kmalloc-128 1056 128
  23 kmalloc-192 126 192
  16 kmalloc-256 64 256
 428 kmalloc-512 856 512
  88 kmalloc-1k 88 1024
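
(In case you want to reproduce these per-cache totals, something
along these lines over /proc/slabinfo, run as root, should print
the same four columns -- just a sketch:

$ awk 'NR > 2 { print int($3 * $4 / 1024), $1, $3, $4 }' \
      /proc/slabinfo | sort -nr | head -20

i.e. total kB computed from <num_objs> * <objsize>.)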

So sysfs (kernfs_node_cache) accounts for the largest chunk of the
2MB of non-reclaimable slab, having grown 50% from 1.2MB to 1.8MB.
In some cases this could be avoided entirely by turning
off sysfs, but most users can't do that.
shmem_inode_cache is probably mostly devtmpfs; the
other inode caches are smaller and likely reclaimable.
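
(Concretely that would be something like

$ ./scripts/config --enable EXPERT --disable SYSFS
$ make olddefconfig

since CONFIG_SYSFS is hidden behind CONFIG_EXPERT if I remember
correctly, but udev and most modern userspace won't run without
sysfs, so that's really only an option for very constrained
appliance-style systems.)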

It's interesting how the largest of the kmalloc-* caches ends up
being the kmalloc-1k cache (840 1K objects) on lp64,
but the kmalloc-512 cache (856 512B objects) on ilp32.
My guess is that the majority of this is from a single
callsite with an allocation that grows just beyond 512B on lp64.
This alone seems significant enough to need further
investigation; I would hope we can completely avoid
these by adding a custom slab cache. I don't see this
effect on an arm64 boot though: for me, the 512B allocations
are much higher than the 1K ones.

Maybe you can identify the culprit using the boot-time traces
as listed in https://elinux.org/Kernel_dynamic_memory_analysis#Dynamic
That might help everyone running a 64-bit kernel on
low-memory configurations, though it would of course slightly
weaken your argument for an ilp32 kernel ;-)
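
For the 512B-vs-1K case specifically, the kmem:kmalloc tracepoint
plus a bit of post-processing should point straight at the call
site; roughly something like this (a sketch, assuming the usual
call_site=/bytes_req= fields in the trace output, and you'd want
trace_event=kmem:kmalloc on the kernel command line to also catch
the boot-time allocations):

$ echo 1 > /sys/kernel/tracing/events/kmem/kmalloc/enable
$ awk 'match($0, /call_site=[^ ]+/) {
           site = substr($0, RSTART, RLENGTH)
           if (match($0, /bytes_req=[0-9]+/)) {
               req = substr($0, RSTART + 10, RLENGTH - 10) + 0
               if (req > 512 && req <= 1024)
                   print site
           }
       }' /sys/kernel/tracing/trace | sort | uniq -c | sort -nr | head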

     Arnd
