linux-kernel - Re: Regarding your thread on LKML - drm_radeon spamming alloc_contig_range [WAS: Re: PROBLEM-PERSISTS: dmesg spam: alloc_contig


Message-ID: <20170629174705.GN23586@orbis-terrarum.net>
Date:   Thu, 29 Jun 2017 17:47:05 +0000
From:   "Robin H. Johnson" <robbat2@...too.org>
To:     Kumar Abhishek <kumar.abhishek.kakkar@...il.com>
Cc:     robbat2@...is-terrarum.net, Michal Hocko <mhocko@...nel.org>,
        linux-kernel@...r.kernel.org, robbat2@...too.org,
        linux-mm@...ck.org, mina86@...a86.com
Subject: Re: Regarding your thread on LKML - drm_radeon spamming
 alloc_contig_range [WAS: Re: PROBLEM-PERSISTS: dmesg spam:
 alloc_contig_range: [XX, YY) PFNs busy]

CC'd back to LKML.

On Thu, Jun 29, 2017 at 06:11:00PM +0530, Kumar Abhishek wrote:
> Hi Robin,
> 
> I am an independent developer who stumbled upon your thread on the LKML
> after facing a similar issue - my kernel log being spammed by
> alloc_contig_range messages. I am running Linux on an ARM system
> (specifically the BeagleBoard-X15) and am on kernel version 4.9.33 with TI
> patches on top of it.
> 
> I am running Debian Stretch (9.0) on the system.
> 
> Here's what my stack trace looks like:
..
> 
> It's somewhat similar to your stack trace, but this here happens on an
> etnaviv GPU (Vivante GCxx).
> 
> In my case if I do 'sudo service lightdm stop', these messages stop too.
> This seems to suggest that the problem may be in the X server rather than
> the kernel? I seem to think this because I replicated this on an entirely
> different set of hardware than yours.
> 
> I just wanted to bring this to your notice, and also ask you if you managed
> to solve it for yourself.
> 
> One solution could be to demote the pr_info in alloc_contig_range to
> pr_debug or to do away with the message altogether, but this would be
> suppressing the issue instead of really knowing what it is about.
> 
> Let me know how I could further investigate this.
The problem, as far as I got it diagnosed on LKML, is that some GPU
drivers issue many non-fatal contiguous memory allocation requests: they
have a meaningful fallback path when the allocation fails, so 'PFNs busy'
is a false alarm in their case.

However, if there were another consumer that does NOT have a fallback,
the output would still be crucially useful.

Attached is the patch that I unsuccessfully proposed on LKML to
rate-limit the messages, with the last revision to only dump_stack() if
CONFIG_CMA_DEBUG was set.
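
To illustrate the idea of rate-limiting the message and gating the stack
dump on a debug option, here is a minimal user-space sketch. It is NOT the
attached patch and NOT kernel code: struct ratelimit, ratelimit_allow(),
report_busy() and the REPORT_* flags are hypothetical names invented for
this example, the debug_enabled argument merely stands in for
CONFIG_CMA_DEBUG, and time is passed in explicitly so the behaviour is
easy to follow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flags describing what a caller should emit. */
enum { REPORT_NONE = 0, REPORT_MESSAGE = 1, REPORT_STACK = 2 };

/* Hypothetical interval/burst limiter state (a model, not the kernel's). */
struct ratelimit {
    long interval; /* window length, in arbitrary time units */
    int burst;     /* messages allowed per window */
    long begin;    /* start of the current window */
    int printed;   /* messages emitted in the current window */
    int missed;    /* messages suppressed in the current window */
};

/* Return true if a message may be emitted at time 'now'. */
static bool ratelimit_allow(struct ratelimit *rs, long now)
{
    assert(rs != NULL);

    /* Start a fresh window once the interval has elapsed. */
    if (now - rs->begin >= rs->interval) {
        rs->begin = now;
        rs->printed = 0;
        rs->missed = 0;
    }

    if (rs->printed < rs->burst) {
        rs->printed++;
        return true;
    }

    rs->missed++;
    return false;
}

/*
 * Decide what to report for one "PFNs busy" event: nothing when
 * rate-limited, otherwise the message, plus a stack dump only when
 * the debug option (standing in for CONFIG_CMA_DEBUG) is enabled.
 */
static int report_busy(struct ratelimit *rs, long now, bool debug_enabled)
{
    if (!ratelimit_allow(rs, now))
        return REPORT_NONE;

    return debug_enabled ? (REPORT_MESSAGE | REPORT_STACK) : REPORT_MESSAGE;
}
```

With interval = 5 and burst = 2, the first two events in a window are
reported, further events in that window are only counted as missed, and
reporting resumes once the window has elapsed.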

The path that LKML wanted was to add a new parameter to suppress or at
least demote the failure message, and to update all of the callers; but
that means many of the indirect callers need the added parameter as
well.

mm/cma.c: cma_alloc() can suppress the error; you can see it retry.
mm/hugetlb.c: these callers should get the error message.
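
The shape of that approach can be sketched in user-space C as follows.
This is only a model under stated assumptions: alloc_range_model(),
alloc_with_fallback(), alloc_without_fallback(), struct alloc_log and the
quiet flag are hypothetical names for illustration, not the real
signatures of alloc_contig_range() or its callers.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical log sink: counts how many failure messages were emitted. */
struct alloc_log {
    int warnings;
};

/*
 * Model of a contiguous-range allocator. 'range_busy' simulates the
 * "PFNs busy" condition; 'quiet' is the extra parameter a caller passes
 * when it has its own fallback and does not want the message.
 */
static int alloc_range_model(struct alloc_log *log, bool range_busy, bool quiet)
{
    assert(log != NULL);

    if (!range_busy)
        return 0;

    if (!quiet)
        log->warnings++; /* stands in for printing the failure message */

    return -EBUSY;
}

/* Caller with a fallback path: asks for a quiet failure, then falls back. */
static int alloc_with_fallback(struct alloc_log *log, bool range_busy)
{
    if (alloc_range_model(log, range_busy, true) == 0)
        return 0; /* contiguous allocation succeeded */

    return 1;     /* fallback path used, nothing logged */
}

/* Caller without a fallback: the failure message should still appear. */
static int alloc_without_fallback(struct alloc_log *log, bool range_busy)
{
    return alloc_range_model(log, range_busy, false);
}
```

The cost noted above shows up even in this toy: every wrapper between the
original caller and the allocator has to carry the extra argument.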

The error message DOES still have a good general use in notifying you
that something is going wrong. There was a noticeable performance
slowdown in my case when it was trying hard to allocate.

-- 
Robin Hugh Johnson
E-Mail     : robbat2@...is-terrarum.net
Home Page  : http://www.orbis-terrarum.net/?l=people.robbat2
ICQ#       : 30269588 or 41961639
GnuPG FP   : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85

View attachment "000-despam-pfn-busy.patch" of type "text/x-diff" (1016 bytes)