[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <87h7g54mbi.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Wed, 04 Aug 2021 16:59:45 +0800
From: "Huang, Ying" <ying.huang@...el.com>
To: Hugh Dickins <hughd@...gle.com>
Cc: huang ying <huang.ying.caritas@...il.com>,
Andrew Morton <akpm@...ux-foundation.org>,
David Hildenbrand <david@...hat.com>,
Yang Shi <shy828301@...il.com>, Linux-MM <linux-mm@...ck.org>,
LKML <linux-kernel@...r.kernel.org>,
Miaohe Lin <linmiaohe@...wei.com>,
Johannes Weiner <hannes@...xchg.org>,
Michal Hocko <mhocko@...e.com>,
Joonsoo Kim <iamjoonsoo.kim@....com>,
Matthew Wilcox <willy@...radead.org>,
Minchan Kim <minchan@...nel.org>
Subject: Re: [PATCH] mm,shmem: Fix a typo in shmem_swapin_page()
Hugh Dickins <hughd@...gle.com> writes:
> On Wed, 28 Jul 2021, huang ying wrote:
>> On Sat, Jul 24, 2021 at 4:23 AM Hugh Dickins <hughd@...gle.com> wrote:
>> >
>> > I was wary because, if the (never observed) race to be fixed is in
>> > swap_cluster_readahead(), why was shmem_swapin_page() being patched?
>>
>> When we get a swap entry from the page table or shmem xarray, and no
>> necessary lock is held to prevent the swap device to be swapoff (e.g.
>> page table lock, page lock, etc.), it's possible that the swap device
>> has been swapoff when we operate on the swap entry (e.g. swapin).
>
> Yes. And even without any swapoff, the swap entry may no longer be
> right by the time we go to swap it in, or after it has been swapped in.
Yes.
> Both mm/memory.c and mm/shmem.c have done an unlocked lookup to get
> the swap entry, and have to do a pte_same() or shmem_confirm_swap()
> later, to ensure that what they've got is still right. That lockless
> lookup and raciness is intentional, and has been working for years.
Yes.
>> So we need to find a way to prevent the swap device to be swapoff,
>> get_swap_device() based on percpu_ref is used for that.
>
> To prevent swapoff, no, it would be bad if swapoff could be prevented
> indefinitely. But I think you're saying to prevent swapoff from
> completing - reaching the point of freeing structures which might
> still be being examined.
Yes.
> Yes, though I thought that was already prevented. But I may well have
> missed the inode_read_congested() case which came in two years ago
> (affecting shmem swapin only). And I'll admit now to never(?) having
> studied or tested the SWP_SYNCHRONOUS_IO case in do_swap_page() (with
> suspiciously no equivalent in shmem swapin): I'd better start.
>
> I do dislike the way SWP_SYNCHRONOUS_IO imported low-level swap business
> into do_swap_page(): I'd love to try to move that into swap_state.c, and
> in doing so hopefully get to appreciate it better (and the suggested
> swapoff race: I presume it comes from skipping swapcache_prepare()).
Yes. I think so too.
> But I have no time for that at present.
>
>> To avoid to
>> call get_swap_device() here and there (e.g. now it is called in many
>> different places), I think it's better to call get_swap_device() when
>> we just get a swap entry without holding the necessary lock, that is,
>> in do_swap_page() and shmem_swapin_page(), etc. So that we can delete
>> the get_swap_device() call in lookup_swap_cache(),
>> __read_swap_cache_async(), etc. This will make it easier to
>> understand when to use get_swap_device() and clean up the code. Do
>> you agree?
>
> Offhand I'd say that I do not agree: but I can easily imagine coming to
> agree with you, once I have tried to do better and discovered I cannot.
>
> I dislike the way swap internals are being pushed out to the higher
> levels. Particularly since those higher levels already have to deal
> with similar races which are not protected by get_swap_device().
>
> I'd forgotten how earlier you found you had to add get_swap_device()
> into lookup_swap_cache(), and friends: and can see the attraction of
> doing it once instead of here and there. But it is still there in
> lookup_swap_cache() etc, so that's a poor argument for these commits.
I have a patch in hand to delete get_swap_device() in
lookup_swap_cache() and other places. In fact, the bug fixed in this
patch is found when developing that patch.
An untested initial version of that cleanup patch is as below, I am
still working on it. In that patch, 6 get_swap_device() are deleted,
while 4 get_swap_device() are added. What matters more is that it's
more clear about when to call get_swap_device(). That is, when you get
a swap entry without necessary lock and want to operate on it.
>> > Not explained in its commit message, probably a misunderstanding of
>> > how mm/shmem.c already manages races (and prefers not to be involved
>> > in swap_info_struct stuff).
>>
>> Yes. The commit message isn't clean enough about why we do that.
>
> Thanks for clearing that up. In intervening days I did read about 50
> of the ~100 mails on these commits, April and later (yes, I was Cc'ed
> throughout, thanks, but that didn't give me the time). I've not yet
> reached any that explain the get_swap_device() placement, perhaps
> I'll change my mind when eventually I read those.
I will write about that in the patch to cleanup get_swap_device() calling.
>>
>> > But why do I now say it's bad? Because even if you correct the EINVAL
>> > to -EINVAL, that's an unexpected error: -EEXIST is common, -ENOMEM is
>> > not surprising, -ENOSPC can need consideration, but -EIO and anything
>> > else just end up as SIGBUS when faulting (or as error from syscall).
>>
>> Yes. -EINVAL isn't a good choice. If it's the swapoff race, then
>> retrying can fix the race, so -EAGAIN may be a choice. But if the
>> swap entry is really invalid (almost impossible in theory), we may
>> need something else, for example, WARN_ON_ONCE() and SIGBUS? This
>> reminds me that we may need to distinguish the two possibilities in
>> get_swap_device()?
>
> Ah, I guess in that last sentence you're thinking of what I realized
> in writing previous mail, behaviour when given a corrupted swap entry.
Yes. But because that should be impossible even in theory, I think we
can ignore it for now to avoid to make the code even more complicated?
Now, some corrupted swap entry can be identified and a error message can
be printed in get_swap_device().
Best Regards,
Huang, Ying
----------------------------------8<--------------------------------------
>From 8c83cf945436b98222b72903b4bd3063ad41e51a Mon Sep 17 00:00:00 2001
From: Huang Ying <ying.huang@...el.com>
Date: Fri, 23 Jul 2021 15:56:26 +0800
Subject: [PATCH] swap: cleanup get/_put_swap_device()
---
mm/madvise.c | 10 ++++++++++
mm/memory.c | 12 +++++++++++-
mm/swap_state.c | 11 -----------
mm/swapfile.c | 36 ++++++------------------------------
mm/zswap.c | 5 +++++
5 files changed, 32 insertions(+), 42 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 012129fbfaf8..5e38e888645d 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -199,6 +199,7 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
swp_entry_t entry;
struct page *page;
spinlock_t *ptl;
+ struct swap_info_struct *si;
orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
pte = *(orig_pte + ((index - start) / PAGE_SIZE));
@@ -209,11 +210,15 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
entry = pte_to_swp_entry(pte);
if (unlikely(non_swap_entry(entry)))
continue;
+ si = get_swap_device(entry);
+ if (!si)
+ continue;
page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE,
vma, index, false);
if (page)
put_page(page);
+ put_swap_device(si);
}
return 0;
@@ -234,6 +239,7 @@ static void force_shm_swapin_readahead(struct vm_area_struct *vma,
rcu_read_lock();
xas_for_each(&xas, page, end_index) {
swp_entry_t swap;
+ struct swap_info_struct *si;
if (!xa_is_value(page))
continue;
@@ -241,10 +247,14 @@ static void force_shm_swapin_readahead(struct vm_area_struct *vma,
rcu_read_unlock();
swap = radix_to_swp_entry(page);
+ si = get_swap_device(swap);
+ if (!si)
+ continue;
page = read_swap_cache_async(swap, GFP_HIGHUSER_MOVABLE,
NULL, 0, false);
if (page)
put_page(page);
+ put_swap_device(si);
rcu_read_lock();
}
diff --git a/mm/memory.c b/mm/memory.c
index 39e7a1495c3c..b150d1d577cd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1096,8 +1096,18 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
cond_resched();
if (ret == -EIO) {
+ int err;
+ struct swap_info_struct *si;
+
VM_WARN_ON_ONCE(!entry.val);
- if (add_swap_count_continuation(entry, GFP_KERNEL) < 0) {
+ si = get_swap_device(entry);
+ if (!si) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ err = add_swap_count_continuation(entry, GFP_KERNEL);
+ put_swap_device(si);
+ if (err < 0) {
ret = -ENOMEM;
goto out;
}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 1a29b4f98208..7a800f6783fc 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -335,14 +335,8 @@ struct page *lookup_swap_cache(swp_entry_t entry, struct vm_area_struct *vma,
unsigned long addr)
{
struct page *page;
- struct swap_info_struct *si;
- si = get_swap_device(entry);
- if (!si)
- return NULL;
page = find_get_page(swap_address_space(entry), swp_offset(entry));
- put_swap_device(si);
-
INC_CACHE_INFO(find_total);
if (page) {
bool vma_ra = swap_use_vma_readahead();
@@ -418,7 +412,6 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
struct vm_area_struct *vma, unsigned long addr,
bool *new_page_allocated)
{
- struct swap_info_struct *si;
struct page *page;
void *shadow = NULL;
@@ -431,12 +424,8 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
* called after lookup_swap_cache() failed, re-calling
* that would confuse statistics.
*/
- si = get_swap_device(entry);
- if (!si)
- return NULL;
page = find_get_page(swap_address_space(entry),
swp_offset(entry));
- put_swap_device(si);
if (page)
return page;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index e3dcaeecc50f..d7cd7fe2eaf9 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1472,14 +1472,9 @@ int __swap_count(swp_entry_t entry)
{
struct swap_info_struct *si;
pgoff_t offset = swp_offset(entry);
- int count = 0;
- si = get_swap_device(entry);
- if (si) {
- count = swap_count(si->swap_map[offset]);
- put_swap_device(si);
- }
- return count;
+ si = swp_swap_info(entry);
+ return swap_count(si->swap_map[offset]);
}
static int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
@@ -1501,15 +1496,10 @@ static int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
*/
int __swp_swapcount(swp_entry_t entry)
{
- int count = 0;
struct swap_info_struct *si;
- si = get_swap_device(entry);
- if (si) {
- count = swap_swapcount(si, entry);
- put_swap_device(si);
- }
- return count;
+ si = swp_swap_info(entry);
+ return swap_swapcount(si, entry);
}
/*
@@ -3430,10 +3420,7 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
unsigned char has_cache;
int err;
- p = get_swap_device(entry);
- if (!p)
- return -EINVAL;
-
+ p = swp_swap_info(entry);
offset = swp_offset(entry);
ci = lock_cluster_or_swap_info(p, offset);
@@ -3479,8 +3466,6 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
unlock_out:
unlock_cluster_or_swap_info(p, ci);
- if (p)
- put_swap_device(p);
return err;
}
@@ -3581,14 +3566,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
*/
page = alloc_page(gfp_mask | __GFP_HIGHMEM);
- si = get_swap_device(entry);
- if (!si) {
- /*
- * An acceptable race has occurred since the failing
- * __swap_duplicate(): the swap device may be swapoff
- */
- goto outer;
- }
+ si = swp_swap_info(entry);
spin_lock(&si->lock);
offset = swp_offset(entry);
@@ -3660,8 +3638,6 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
out:
unlock_cluster(ci);
spin_unlock(&si->lock);
- put_swap_device(si);
-outer:
if (page)
__free_page(page);
return ret;
diff --git a/mm/zswap.c b/mm/zswap.c
index 7944e3e57e78..f707a73e35aa 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -902,9 +902,14 @@ static int zswap_get_swap_cache_page(swp_entry_t entry,
struct page **retpage)
{
bool page_was_allocated;
+ struct swap_info_struct *si;
+ si = get_swap_device(entry);
+ if (!si)
+ return ZSWAP_SWAPCACHE_FAIL;
*retpage = __read_swap_cache_async(entry, GFP_KERNEL,
NULL, 0, &page_was_allocated);
+ put_swap_device(si);
if (page_was_allocated)
return ZSWAP_SWAPCACHE_NEW;
if (!*retpage)
--
2.30.2
Powered by blists - more mailing lists