Message-ID: <ZnCDvpFveS6X0a1g@google.com>
Date: Mon, 17 Jun 2024 18:43:10 +0000
From: Carlos Llamas <cmllamas@...gle.com>
To: Lei Liu <liulei.rjpt@...o.com>
Cc: Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
Arve Hjønnevåg <arve@...roid.com>,
Todd Kjos <tkjos@...roid.com>, Martijn Coenen <maco@...roid.com>,
Joel Fernandes <joel@...lfernandes.org>,
Christian Brauner <brauner@...nel.org>,
Suren Baghdasaryan <surenb@...gle.com>,
linux-kernel@...r.kernel.org, opensource.kernel@...o.com
Subject: Re: [PATCH v3] binder_alloc: Replace kcalloc with kvcalloc to
mitigate OOM issues
On Mon, Jun 17, 2024 at 12:01:26PM +0800, Lei Liu wrote:
> On 6/15/2024 at 2:38, Carlos Llamas wrote:
> > My understanding is that kvcalloc() == kcalloc() if there is enough
> > contiguous memory no?
> >
> > I would expect the performance to be the same at best.
>
> 1. The main reason is memory fragmentation: we are unable to allocate
> contiguous order-3 memory. Additionally, using the GFP_KERNEL
> allocation flag in the kernel's __alloc_pages_slowpath() results in
> multiple retry attempts, and if direct reclaim and memory compaction
> are unsuccessful, an OOM occurs.
>
> 2. When fragmentation is severe, we observed that kvmalloc is faster
> than kmalloc, as it avoids the repeated retry attempts needed to
> allocate order-3 memory. In such cases, falling back to order-0 pages
> may result in higher allocation efficiency.
>
> 3. Another crucial point is that in the kernel, allocations larger than
> PAGE_ALLOC_COSTLY_ORDER (order 3) are treated as costly. This leads to
> a reduced number of retry attempts in __alloc_pages_slowpath(), which
> explains the increased time for order-3 allocations in fragmented
> scenarios.
>
> In summary, under high memory pressure the system is prone to
> fragmentation. Instead of waiting for an order-3 allocation to
> succeed, it is more efficient to let kvmalloc fall back from a
> contiguous order-3 allocation to order-0 pages (vmalloc), reducing
> wait times in high memory pressure scenarios. This is also why
> kvmalloc can improve throughput.
Yes, all of this makes sense. What I don't understand is how the
"performance of kvcalloc is better". That is not supposed to be the case.
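For reference, the fallback behaviour we are talking about is roughly the
following. This is only a simplified sketch paraphrasing the logic of
kvmalloc() in mm/util.c; the real kvmalloc_node() handles more GFP flag
cases (__GFP_RETRY_MAYFAIL, __GFP_NOFAIL, etc.):

/*
 * Simplified sketch of the kvmalloc() fallback path (paraphrasing
 * mm/util.c; not the exact upstream code).
 */
static void *kvmalloc_sketch(size_t size, gfp_t flags)
{
        gfp_t kmalloc_flags = flags;
        void *ret;

        /*
         * For multi-page sizes, don't let the slab path retry hard or
         * trigger the OOM killer; fail fast so we can fall back instead.
         */
        if (size > PAGE_SIZE)
                kmalloc_flags |= __GFP_NOWARN | __GFP_NORETRY;

        /* First try a physically contiguous (possibly order-3) allocation. */
        ret = kmalloc(size, kmalloc_flags);
        if (ret || size <= PAGE_SIZE)
                return ret;

        /* Fall back to vmalloc(): order-0 pages, only virtually contiguous. */
        return __vmalloc(size, flags);
}

When enough contiguous memory is available, the first kmalloc() succeeds
with the same flags, so there is nothing left for kvcalloc() to do better.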
> > I'm not so sure about the results and performance improvements that are
> > claimed here. However, the switch to kvcalloc() itself seems reasonable
> > to me.
> >
> > I'll run these tests myself as the results might have some noise. I'll
> > get back with the results.
> >
> > Thanks,
> > Carlos Llamas
>
> Okay, thank you for the suggestion. I look forward to receiving your
> test results and continuing our discussion.
>
I ran several iterations of the benchmark test on a Pixel device and, as
expected, I didn't see any significant differences. This is a good thing,
but either we need to understand how you obtained better performance
from kvcalloc(), or it would be better to drop this claim from the
commit log.
Below are two individual samples, one of each form. However, if we
average the output and get rid of the noise, the numbers look pretty
much the same.
Sample with kcalloc():
------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------
BM_sendVec_binder/4 19983 ns 9832 ns 60255
BM_sendVec_binder/8 19766 ns 9690 ns 71699
BM_sendVec_binder/16 19785 ns 9722 ns 72086
BM_sendVec_binder/32 20067 ns 9864 ns 71535
BM_sendVec_binder/64 20077 ns 9941 ns 69141
BM_sendVec_binder/128 20147 ns 9944 ns 71016
BM_sendVec_binder/256 20424 ns 10044 ns 69451
BM_sendVec_binder/512 20518 ns 10064 ns 69179
BM_sendVec_binder/1024 21073 ns 10319 ns 67599
BM_sendVec_binder/2048 21482 ns 10502 ns 66767
BM_sendVec_binder/4096 22308 ns 10809 ns 63841
BM_sendVec_binder/8192 24022 ns 11649 ns 60795
BM_sendVec_binder/16384 27172 ns 13426 ns 51940
BM_sendVec_binder/32768 32853 ns 16345 ns 42211
BM_sendVec_binder/65536 80177 ns 39787 ns 17557
Sample with kvcalloc():
------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------
BM_sendVec_binder/4 19900 ns 9711 ns 68626
BM_sendVec_binder/8 19903 ns 9756 ns 71524
BM_sendVec_binder/16 19601 ns 9541 ns 71069
BM_sendVec_binder/32 19514 ns 9530 ns 72469
BM_sendVec_binder/64 20042 ns 10006 ns 69753
BM_sendVec_binder/128 20142 ns 9965 ns 70392
BM_sendVec_binder/256 20274 ns 9958 ns 70173
BM_sendVec_binder/512 20305 ns 9966 ns 70347
BM_sendVec_binder/1024 20883 ns 10250 ns 67813
BM_sendVec_binder/2048 21364 ns 10455 ns 67366
BM_sendVec_binder/4096 22350 ns 10888 ns 65689
BM_sendVec_binder/8192 24113 ns 11707 ns 58149
BM_sendVec_binder/16384 27122 ns 13346 ns 52515
BM_sendVec_binder/32768 32158 ns 15901 ns 44139
BM_sendVec_binder/65536 87594 ns 43627 ns 16040
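For reference, the change itself is essentially of this shape. This is a
sketch based on the alloc->pages allocation in binder_alloc_mmap_handler();
the exact context in binder_alloc.c may differ slightly:

        /*
         * Before: the page-tracking array is one physically contiguous
         * slab allocation. For the maximum 4MB mapping this can be a
         * multi-page (order-3) request, which is what stalls or fails
         * when memory is fragmented.
         */
        alloc->pages = kcalloc(alloc->buffer_size / PAGE_SIZE,
                               sizeof(alloc->pages[0]),
                               GFP_KERNEL);

        /*
         * After: same request, but kvcalloc() may fall back to vmalloc()
         * (order-0 pages) when a contiguous chunk is not available.
         */
        alloc->pages = kvcalloc(alloc->buffer_size / PAGE_SIZE,
                                sizeof(alloc->pages[0]),
                                GFP_KERNEL);

with the matching kfree(alloc->pages) becoming kvfree(alloc->pages) so
both backing types are freed correctly.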
To reiterate, the switch to kvcalloc() sounds good to me. Let's just fix
the commit log and also address Greg's suggestions.
Thanks,
Carlos Llamas