Message-ID: <ZnCDvpFveS6X0a1g@google.com>
Date: Mon, 17 Jun 2024 18:43:10 +0000
From: Carlos Llamas <cmllamas@...gle.com>
To: Lei Liu <liulei.rjpt@...o.com>
Cc: Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
Arve Hjønnevåg <arve@...roid.com>,
Todd Kjos <tkjos@...roid.com>, Martijn Coenen <maco@...roid.com>,
Joel Fernandes <joel@...lfernandes.org>,
Christian Brauner <brauner@...nel.org>,
Suren Baghdasaryan <surenb@...gle.com>,
linux-kernel@...r.kernel.org, opensource.kernel@...o.com
Subject: Re: [PATCH v3] binder_alloc: Replace kcalloc with kvcalloc to
mitigate OOM issues
On Mon, Jun 17, 2024 at 12:01:26PM +0800, Lei Liu wrote:
> On 6/15/2024 at 2:38, Carlos Llamas wrote:
> > My understanding is that kvcalloc() == kcalloc() if there is enough
> > contiguous memory no?
> >
> > I would expect the performance to be the same at best.
>
> 1. The main reason is memory fragmentation: we are unable to allocate
> contiguous order-3 memory. Additionally, using the GFP_KERNEL
> allocation flag in the kernel's __alloc_pages_slowpath() results in
> multiple retry attempts, and if direct reclaim and memory compaction
> are unsuccessful, an OOM occurs.
>
> 2. When fragmentation is severe, we observed that kvmalloc is faster
> than kmalloc, as it avoids the repeated retry attempts needed to
> allocate order-3 memory. In such cases, falling back to order-0 pages
> may result in higher allocation efficiency.
>
> 3. Another crucial point is that in the kernel, allocations larger than
> PAGE_ALLOC_COSTLY_ORDER (order 3) are treated as costly. This leads to
> a reduced number of retry attempts in __alloc_pages_slowpath(), which
> explains the increased time for order-3 allocations in fragmented
> scenarios.
>
> In summary, under high memory pressure the system is prone to
> fragmentation. Instead of waiting for an order-3 allocation to
> succeed, it is more efficient to let kvmalloc fall back from a
> contiguous order-3 allocation to order-0 pages (vmalloc), reducing
> wait times in high memory pressure scenarios. This is also why
> kvmalloc can improve throughput.
Yes, all of this makes sense. What I don't understand is how the
"performance of kvcalloc is better". That is not supposed to be the case.
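For reference, the fallback behaviour we are talking about is roughly the
following. This is only a simplified sketch paraphrasing the logic of
kvmalloc() in mm/util.c; the real kvmalloc_node() handles more GFP flag
cases (__GFP_RETRY_MAYFAIL, __GFP_NOFAIL, etc.):

/*
 * Simplified sketch of the kvmalloc() fallback path (paraphrasing
 * mm/util.c; not the exact upstream code).
 */
static void *kvmalloc_sketch(size_t size, gfp_t flags)
{
        gfp_t kmalloc_flags = flags;
        void *ret;

        /*
         * For multi-page sizes, don't let the slab path retry hard or
         * trigger the OOM killer; fail fast so we can fall back instead.
         */
        if (size > PAGE_SIZE)
                kmalloc_flags |= __GFP_NOWARN | __GFP_NORETRY;

        /* First try a physically contiguous (possibly order-3) allocation. */
        ret = kmalloc(size, kmalloc_flags);
        if (ret || size <= PAGE_SIZE)
                return ret;

        /* Fall back to vmalloc(): order-0 pages, only virtually contiguous. */
        return __vmalloc(size, flags);
}

When enough contiguous memory is available, the first kmalloc() succeeds
with the same flags, so there is nothing left for kvcalloc() to do better.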
> > I'm not so sure about the results and performance improvements that are
> > claimed here. However, the switch to kvcalloc() itself seems reasonable
> > to me.
> >
> > I'll run these tests myself as the results might have some noise. I'll
> > get back with the results.
> >
> > Thanks,
> > Carlos Llamas
>
> Okay, thank you for the suggestion. I look forward to receiving your
> test results and continuing our discussion.
>
I ran several iterations of the benchmark test on a Pixel device and, as
expected, I didn't see any significant differences. This is a good thing,
but either we need to understand how you obtained better performance
from kvcalloc(), or it would be better to drop this claim from the
commit log.
Below are two individual samples, one of each form. However, if we
average the output and get rid of the noise, the numbers look pretty
much the same.
Sample with kcalloc():
------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------
BM_sendVec_binder/4 19983 ns 9832 ns 60255
BM_sendVec_binder/8 19766 ns 9690 ns 71699
BM_sendVec_binder/16 19785 ns 9722 ns 72086
BM_sendVec_binder/32 20067 ns 9864 ns 71535
BM_sendVec_binder/64 20077 ns 9941 ns 69141
BM_sendVec_binder/128 20147 ns 9944 ns 71016
BM_sendVec_binder/256 20424 ns 10044 ns 69451
BM_sendVec_binder/512 20518 ns 10064 ns 69179
BM_sendVec_binder/1024 21073 ns 10319 ns 67599
BM_sendVec_binder/2048 21482 ns 10502 ns 66767
BM_sendVec_binder/4096 22308 ns 10809 ns 63841
BM_sendVec_binder/8192 24022 ns 11649 ns 60795
BM_sendVec_binder/16384 27172 ns 13426 ns 51940
BM_sendVec_binder/32768 32853 ns 16345 ns 42211
BM_sendVec_binder/65536 80177 ns 39787 ns 17557
Sample with kvcalloc():
------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------
BM_sendVec_binder/4 19900 ns 9711 ns 68626
BM_sendVec_binder/8 19903 ns 9756 ns 71524
BM_sendVec_binder/16 19601 ns 9541 ns 71069
BM_sendVec_binder/32 19514 ns 9530 ns 72469
BM_sendVec_binder/64 20042 ns 10006 ns 69753
BM_sendVec_binder/128 20142 ns 9965 ns 70392
BM_sendVec_binder/256 20274 ns 9958 ns 70173
BM_sendVec_binder/512 20305 ns 9966 ns 70347
BM_sendVec_binder/1024 20883 ns 10250 ns 67813
BM_sendVec_binder/2048 21364 ns 10455 ns 67366
BM_sendVec_binder/4096 22350 ns 10888 ns 65689
BM_sendVec_binder/8192 24113 ns 11707 ns 58149
BM_sendVec_binder/16384 27122 ns 13346 ns 52515
BM_sendVec_binder/32768 32158 ns 15901 ns 44139
BM_sendVec_binder/65536 87594 ns 43627 ns 16040
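For reference, the change itself is essentially of this shape. This is a
sketch based on the alloc->pages allocation in binder_alloc_mmap_handler();
the exact context in binder_alloc.c may differ slightly:

        /*
         * Before: the page-tracking array is one physically contiguous
         * slab allocation. For the maximum 4MB mapping this can be a
         * multi-page (order-3) request, which is what stalls or fails
         * when memory is fragmented.
         */
        alloc->pages = kcalloc(alloc->buffer_size / PAGE_SIZE,
                               sizeof(alloc->pages[0]),
                               GFP_KERNEL);

        /*
         * After: same request, but kvcalloc() may fall back to vmalloc()
         * (order-0 pages) when a contiguous chunk is not available.
         */
        alloc->pages = kvcalloc(alloc->buffer_size / PAGE_SIZE,
                                sizeof(alloc->pages[0]),
                                GFP_KERNEL);

with the matching kfree(alloc->pages) becoming kvfree(alloc->pages) so
both backing types are freed correctly.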
To reiterate, the switch to kvcalloc() sounds good to me. Let's just fix
the commit log and also address Greg's suggestions.
Thanks,
Carlos Llamas