Date:   Tue, 03 Aug 2021 16:14:38 +0800
From:   "Huang, Ying" <ying.huang@...el.com>
To:     Matthew Wilcox <willy@...radead.org>
Cc:     Hugh Dickins <hughd@...gle.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        David Hildenbrand <david@...hat.com>,
        Yang Shi <shy828301@...il.com>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, Miaohe Lin <linmiaohe@...wei.com>,
        Johannes Weiner <hannes@...xchg.org>,
        Michal Hocko <mhocko@...e.com>,
        Joonsoo Kim <iamjoonsoo.kim@....com>,
        Minchan Kim <minchan@...nel.org>
Subject: Re: [PATCH] mm,shmem: Fix a typo in shmem_swapin_page()

Matthew Wilcox <willy@...radead.org> writes:

> On Fri, Jul 23, 2021 at 01:23:07PM -0700, Hugh Dickins wrote:
>> I was wary because, if the (never observed) race to be fixed is in
>> swap_cluster_readahead(), why was shmem_swapin_page() being patched?
>> Not explained in its commit message, probably a misunderstanding of
>> how mm/shmem.c already manages races (and prefers not to be involved
>> in swap_info_struct stuff).
>> 
>> But why do I now say it's bad?  Because even if you correct the EINVAL
>> to -EINVAL, that's an unexpected error: -EEXIST is common, -ENOMEM is
>> not surprising, -ENOSPC can need consideration, but -EIO and anything
>> else just end up as SIGBUS when faulting (or as error from syscall).
>> So, 2efa33fc7f6e converts a race with swapoff to SIGBUS: not good,
>> and I think much more likely than the race to be fixed (since
>> swapoff's percpu_ref_kill() rightly comes before synchronize_rcu()).
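
The reason -EIO, -EINVAL and the rest surface as SIGBUS is the errno-to-fault-code conversion on the fault path; roughly, it is a vmf_error()-style mapping (paraphrased below, with an illustrative helper name, not quoted from the tree):

/* Paraphrase of the fault-path conversion: only -ENOMEM is
 * special-cased, so any other errno (-EIO, -EINVAL, ...) is
 * reported to the faulting task as SIGBUS.
 */
static vm_fault_t errno_to_fault(int err)
{
	if (err == -ENOMEM)
		return VM_FAULT_OOM;
	return VM_FAULT_SIGBUS;
}
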
>
> Yes, I think a lot more thought was needed here.  And I would have
> preferred to start with a reproducer instead of "hey, this could
> happen".  Maybe something like booting a 1GB VM, adding two 2GB swap
> partitions, swapon(partition A); run a 2GB memhog and then
>
> loop:
> 	swapon(part B);
> 	swapoff(part A);
> 	swapon(part A);
> 	swapoff(part B);
>
> to make this happen.
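
A rough user-space sketch of that stress loop (the device paths are placeholders and the 2GB memhog runs as a separate process; this only illustrates the suggested swapon/swapoff churn, it is not a tested reproducer):

/* Churn two swap partitions while a memhog elsewhere keeps the 1GB VM
 * under memory pressure, so shmem pages keep moving between devices.
 * The partition paths below are placeholders.
 */
#include <stdio.h>
#include <sys/swap.h>

int main(void)
{
	const char *part_a = "/dev/vdb1";	/* placeholder */
	const char *part_b = "/dev/vdb2";	/* placeholder */

	for (int i = 0; i < 100000; i++) {
		if (swapon(part_b, 0))
			perror("swapon B");
		if (swapoff(part_a))
			perror("swapoff A");
		if (swapon(part_a, 0))
			perror("swapon A");
		if (swapoff(part_b))
			perror("swapoff B");
	}
	return 0;
}
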
>
> but if it does happen, why would returning EINVAL be the right thing
> to do?  We've swapped it out.  It must be on swap somewhere, or we've
> really messed up.  So I could see there being a race where we get
> preempted between looking up the swap entry and calling get_swap_device().
> But if that does happen, then the page gets brought in, and potentially
> reswapped to the other swap device.
>
> So returning -EEXIST here would actually work.  That forces a re-lookup
> in the page cache, so we'll get the new swap entry that tells us which
> swap device the page is now on.

Yes.  -EEXIST is the right error code.  We use it in
shmem_swapin_page() to deal with this kind of race condition.
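
For readers following along, the suggested handling looks schematically like this (a sketch only, with a made-up helper name, not the actual mm/shmem.c hunk): pin the swap device, and if swapoff already won the race, return -EEXIST so the caller retries the page-cache lookup.

/* Sketch: guard the swap-in against a concurrent swapoff.  If the
 * device is already gone, -EEXIST makes the caller re-look-up the
 * page cache entry (which has changed) instead of raising SIGBUS.
 */
static int swapin_guarded(swp_entry_t swap)
{
	struct swap_info_struct *si;

	si = get_swap_device(swap);
	if (!si)
		return -EEXIST;		/* raced with swapoff: retry */

	/* ... do the actual swap-in with the device pinned ... */

	put_swap_device(si);
	return 0;
}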

> But I REALLY REALLY REALLY want a reproducer.  Right now, I have a hard
> time believing this, or any of the other races can really happen.

I think the race is only theoretical too.  Firstly, swapoff is a rare
operation in practice; secondly, the race window is really small.

Best Regards,
Huang, Ying

>> 2efa33fc7f6e was intending to fix a race introduced by two-year-old
>> 8fd2e0b505d1 ("mm: swap: check if swap backing device is congested
>> or not"), which added a call to inode_read_congested().  Certainly
>> relying on si->swap_file->f_mapping->host there was new territory:
>> whether actually racy I'm not sure offhand - I've forgotten whether
>> synchronize_rcu() waits for preempted tasks or not.
>> 
>> But if it is racy, then I wonder if the right fix might be to revert
>> 8fd2e0b505d1 too. Convincing numbers were offered for it, but I'm
>> puzzled: because Matthew has in the past noted that the block layer
>> broke and further broke bdi congestion tracking (I don't know the
>> relevant release numbers), so I don't understand how checking
>> inode_read_congested() is actually useful there nowadays.
>
> It might be useful for NFS?  I don't think congestion is broken there
> (except how does the NFS client have any idea whether the server is
> congested or not?)
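
For reference, the check Hugh is questioning is, schematically, a bail-out of this shape in the readahead path (a paraphrase of the 8fd2e0b505d1 idea, not the verbatim hunk):

/* Paraphrase: if the swap backing inode's bdi reports read
 * congestion, skip speculative readahead and bring in only the
 * page that was actually requested.
 */
struct inode *inode = si->swap_file->f_mapping->host;

if (inode_read_congested(inode))
	goto skip;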
