Message-ID: <20181116191910.GJ1706@sasha-vm>
Date: Fri, 16 Nov 2018 14:19:10 -0500
From: Sasha Levin <sashal@...nel.org>
To: Michal Hocko <mhocko@...nel.org>
Cc: Andrew Morton <akpm@...ux-foundation.org>, stable@...r.kernel.org,
linux-kernel@...r.kernel.org, Jann Horn <jannh@...gle.com>,
Davidlohr Bueso <dave@...olabs.net>,
Oleg Nesterov <oleg@...hat.com>,
Christoph Lameter <clameter@....com>,
Kemi Wang <kemi.wang@...el.com>,
Andy Lutomirski <luto@...nel.org>,
Ingo Molnar <mingo@...nel.org>,
Linus Torvalds <torvalds@...ux-foundation.org>,
linux-mm@...ck.org
Subject: Re: [PATCH AUTOSEL 3.18 8/9] mm/vmstat.c: assert that vmstat_text is
in sync with stat_items_size
On Fri, Nov 16, 2018 at 07:44:57PM +0100, Michal Hocko wrote:
>On Fri 16-11-18 13:19:04, Sasha Levin wrote:
>> On Fri, Nov 16, 2018 at 09:55:25AM +0100, Michal Hocko wrote:
>[...]
>> > > Race condition with memory hotplug due to missing locks:
>> > >
>> > > https://marc.info/?l=linux-mm&m=154211934011188&w=2
>> >
>> > Memory hotplug locking is dubious at best and this patch doesn't really
>> > fix it. It fixes a theoretical problem; I am not aware of anybody
>> > hitting it in practice. We need to rework the locking quite extensively.
>>
>> The word "theoretical" used in the stable rules file does not mean
>> that we need to have actual reports of users hitting bugs before we
>> start backporting the relevant patch, it simply suggests that there
>> needs to be a reasonable explanation of how this issue can be hit.
>>
>> For this memory hotplug patch in particular, I use the hv_balloon driver
>> at this very moment (running a Linux guest on Windows, with "dynamic
>> memory" enabled). Should I wait for it to crash before I can fix it?
>>
>> Is the upstream code perfect? No, but that doesn't mean it's not
>> working at all, and the users it does have expect to see fixes going
>> in, not sitting idle while they wait for a big rewrite that may land
>> in a few years.
>>
>> Memory hotplug fixes are not something you think should go to stable?
>> Andrew sent a few of them to stable, so that can't be the case.
>
>I am not arguing about hotplug fixes in general. I am arguing that this
>particular one is theoretical and that hotplug locking is quite subtle.
>E.g. 381eab4a6ee mm/memory_hotplug: fix online/offline_pages called w.o. mem_hotplug_lock
>http://lkml.kernel.org/r/20181114070909.GB2653@MiWiFi-R3L-srv
>So in general, unless the issue is really triggered easily, I am rather
>conservative.
We have millions of machines running Linux; everything is triggered
"easily" at that scale.
>> > > Raising an OOM event that causes issues in userspace when no OOM has
>> > > actually occurred:
>> > >
>> > > https://marc.info/?l=linux-mm&m=154211939811582&w=2
>> >
>> > The patch makes sense; I just do not think it is stable material. The
>> > semantics of the event were and still are suboptimal.
>>
>> I really fail to understand your reasoning about -stable here. This
>> patch is something people actually hit in the field, spent time
>> triaging and analysing, and then fixed in a way that looks reasonably
>> straightforward.
>>
>> That fix was acked by quite a few folks (including yourself) and merged
>> in. And as far as we can tell, it actually fixed the problem.
>>
>> Why is it not stable material?
>
>Because the semantics of the OOM event are themselves quite tricky. We have
>discussed this patch and concluded that the updated one is more
>sensible. But it is not yet clear whether this is actually what other
>users expect as well. That to me does sound quite risky for a stable
>kernel.
So there's another patch following this one that fixes it? Sure - can I
take both?
Users expect their containers not to die randomly. If you're saying that
you're still working on a fix for that, then that is a different story
from saying "we fixed it, but it should not go to stable".

And let's also draw a line there: users will not wait for the OOM event
logic to be perfect before they can expect their workloads to run
without issues.
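
For anyone not steeped in memcg details: the "OOM event" here is the
cgroup-v1 notification that a container manager arms via eventfd on
memory.oom_control. A minimal sketch of such a listener follows; the
cgroup path and group name are made up, and error handling is dropped
for brevity:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/eventfd.h>

    int main(void)
    {
            /* "mygroup" is illustrative; use the real memcg directory. */
            int efd  = eventfd(0, 0);
            int ofd  = open("/sys/fs/cgroup/memory/mygroup/memory.oom_control",
                            O_RDONLY);
            int ecfd = open("/sys/fs/cgroup/memory/mygroup/cgroup.event_control",
                            O_WRONLY);
            char buf[32];
            uint64_t count;

            /* Arm the notification: "<eventfd> <memory.oom_control fd>". */
            snprintf(buf, sizeof(buf), "%d %d", efd, ofd);
            write(ecfd, buf, strlen(buf));

            /* Blocks until the kernel posts an OOM event for the group. */
            read(efd, &count, sizeof(count));
            printf("oom events: %llu\n", (unsigned long long)count);
            return 0;
    }

A manager blocked in that read() reacts to a spurious event exactly as
it would to a real OOM, which is how containers end up dying "randomly".
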
>> My understanding is that you're concerned with the patch itself being
>> "suboptimal", but in that case - why did you ack it?
>>
>> > > I think that all 3 cases represent a "real" bug users can hit, and I
>> > > honestly don't know why they were not tagged for stable.
>> >
>> > It would be much better to ask in the respective email thread rather
>> > than spamming mailing lists with AUTOSEL patches, which rarely get any
>> > attention.
>>
>> I actually tried that, but the feedback I got was that it gets in the
>> way and people preferred something they could filter.
>
>which means that AUTOSEL just goes to /dev/null...
Or it just doesn't get mixed into the regular process? For some people
it's easier to keep track of AUTOSEL mails the way things work now than
if they suddenly showed up as a continuation of a weeks-old thread.
>> > We have been through this discussion several times already, and I
>> > thought we had agreed that subsystems which seriously consider stable
>> > are opted out of the AUTOSEL automagic. Has anything changed in that
>> > regard?
>>
>> I checked in with Andrew to get his input on this; he suggested that
>> these patches be sent to linux-mm, where he'll give them a close look.
>
>If Andrew is happy to get AUTOSEL patches then I will not object, of
>course, but let's not merge these patches without an explicit OK.
This is fair. I think the process has caused some unnecessary friction:
we all want the same result but disagree on the means :)

I won't merge any mm/ AUTOSEL patches until this gets clearer.
--
Thanks,
Sasha