linux-kernel - Re: [PATCH 04/10] mm, fsdax: Refactor memory-failure handler for dax mapping

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4f184987-3cc2-c72d-0774-5d20ea2e1d49@cn.fujitsu.com>
Date:   Thu, 14 Jan 2021 11:52:14 +0800
From:   Ruan Shiyang <ruansy.fnst@...fujitsu.com>
To:     zhong jiang <zhongjiang-ali@...ux.alibaba.com>,
        Jan Kara <jack@...e.cz>
CC:     <linux-kernel@...r.kernel.org>, <linux-xfs@...r.kernel.org>,
        <linux-nvdimm@...ts.01.org>, <linux-mm@...ck.org>,
        <linux-fsdevel@...r.kernel.org>, <linux-raid@...r.kernel.org>,
        <darrick.wong@...cle.com>, <dan.j.williams@...el.com>,
        <david@...morbit.com>, <hch@....de>, <song@...nel.org>,
        <rgoldwyn@...e.de>, <qi.fuli@...itsu.com>, <y-goto@...itsu.com>
Subject: Re: [PATCH 04/10] mm, fsdax: Refactor memory-failure handler for dax
 mapping



On 2021/1/14 上午11:26, zhong jiang wrote:
> 
> On 2021/1/14 9:44 上午, Ruan Shiyang wrote:
>>
>>
>> On 2021/1/13 下午6:04, zhong jiang wrote:
>>>
>>> On 2021/1/12 10:55 上午, Ruan Shiyang wrote:
>>>>
>>>>
>>>> On 2021/1/6 下午11:41, Jan Kara wrote:
>>>>> On Thu 31-12-20 00:55:55, Shiyang Ruan wrote:
>>>>>> The current memory_failure_dev_pagemap() can only handle 
>>>>>> single-mapped
>>>>>> dax page for fsdax mode.  The dax page could be mapped by multiple 
>>>>>> files
>>>>>> and offsets if we let reflink feature & fsdax mode work together. So,
>>>>>> we refactor current implementation to support handle memory 
>>>>>> failure on
>>>>>> each file and offset.
>>>>>>
>>>>>> Signed-off-by: Shiyang Ruan <ruansy.fnst@...fujitsu.com>
>>>>>
>>>>> Overall this looks OK to me, a few comments below.
>>>>>
>>>>>> ---
>>>>>>   fs/dax.c            | 21 +++++++++++
>>>>>>   include/linux/dax.h |  1 +
>>>>>>   include/linux/mm.h  |  9 +++++
>>>>>>   mm/memory-failure.c | 91 
>>>>>> ++++++++++++++++++++++++++++++++++-----------
>>>>>>   4 files changed, 100 insertions(+), 22 deletions(-)
>>>>
>>>> ...
>>>>
>>>>>>   @@ -345,9 +348,12 @@ static void add_to_kill(struct task_struct 
>>>>>> *tsk, struct page *p,
>>>>>>       }
>>>>>>         tk->addr = page_address_in_vma(p, vma);
>>>>>> -    if (is_zone_device_page(p))
>>>>>> -        tk->size_shift = dev_pagemap_mapping_shift(p, vma);
>>>>>> -    else
>>>>>> +    if (is_zone_device_page(p)) {
>>>>>> +        if (is_device_fsdax_page(p))
>>>>>> +            tk->addr = vma->vm_start +
>>>>>> +                    ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
>>>>>
>>>>> It seems strange to use 'pgoff' for dax pages and not for any other 
>>>>> page.
>>>>> Why? I'd rather pass correct pgoff from all callers of 
>>>>> add_to_kill() and
>>>>> avoid this special casing...
>>>>
>>>> Because one fsdax page can be shared by multiple pgoffs.  I have to 
>>>> pass each pgoff in each iteration to calculate the address in vma 
>>>> (for tk->addr).  Other kinds of pages don't need this. They can get 
>>>> their unique address by calling "page_address_in_vma()".
>>>>
>>> IMO,   an fsdax page can be shared by multiple files rather than 
>>> multiple pgoffs if fs query support reflink.   Because an page only 
>>> located in an mapping(page->mapping is exclusive), hence it  only has 
>>> an pgoff or index pointing at the node.
>>>
>>>   or  I miss something for the feature ?  thanks,
>>
>> Yes, a fsdax page is shared by multiple files because of reflink. I 
>> think my description of 'pgoff' here is not correct.  This 'pgoff' 
>> means the offset within the a file.  (We use rmap to find out all the 
>> sharing files and their offsets.)  So, I said that "can be shared by 
>> multiple pgoffs".  It's my bad.
>>
>> I think I should name it another word to avoid misunderstandings.
>>
> IMO,  All the sharing files should be the same offset to share the fsdax 
> page.  why not that ? 

The dedupe operation can let different files share their same data 
extent, though offsets are not same.  So, files can share one fsdax page 
at different offset.

> As you has said,  a shared fadax page should be 
> inserted to different mapping files.  but page->index and page->mapping 
> is exclusive.  hence an page only should be placed in an mapping tree.

We can't use page->mapping and page->index here for reflink & fsdax. 
And that's this patchset aims to solve.  I introduced a series of 
->corrupted_range(), from mm to pmem driver to block device and finally 
to filesystem, to use rmap feature of filesystem to find out all files 
sharing same data extent (fsdax page).


--
Thanks,
Ruan Shiyang.

> 
> And In the current patch,  we failed to found out that all process use 
> the fsdax page shared by multiple files and kill them.
> 
> 
> Thanks,
> 
>> -- 
>> Thanks,
>> Ruan Shiyang.
>>
>>>
>>>> So, I added this fsdax case here.  This patchset only implemented 
>>>> the fsdax case, other cases also need to be added here if to be 
>>>> implemented.
>>>>
>>>>
>>>> -- 
>>>> Thanks,
>>>> Ruan Shiyang.
>>>>
>>>>>
>>>>>> +        tk->size_shift = dev_pagemap_mapping_shift(p, vma, 
>>>>>> tk->addr);
>>>>>> +    } else
>>>>>>           tk->size_shift = page_shift(compound_head(p));
>>>>>>         /*
>>>>>> @@ -495,7 +501,7 @@ static void collect_procs_anon(struct page 
>>>>>> *page, struct list_head *to_kill,
>>>>>>               if (!page_mapped_in_vma(page, vma))
>>>>>>                   continue;
>>>>>>               if (vma->vm_mm == t->mm)
>>>>>> -                add_to_kill(t, page, vma, to_kill);
>>>>>> +                add_to_kill(t, page, NULL, 0, vma, to_kill);
>>>>>>           }
>>>>>>       }
>>>>>>       read_unlock(&tasklist_lock);
>>>>>> @@ -505,24 +511,19 @@ static void collect_procs_anon(struct page 
>>>>>> *page, struct list_head *to_kill,
>>>>>>   /*
>>>>>>    * Collect processes when the error hit a file mapped page.
>>>>>>    */
>>>>>> -static void collect_procs_file(struct page *page, struct 
>>>>>> list_head *to_kill,
>>>>>> -                int force_early)
>>>>>> +static void collect_procs_file(struct page *page, struct 
>>>>>> address_space *mapping,
>>>>>> +        pgoff_t pgoff, struct list_head *to_kill, int force_early)
>>>>>>   {
>>>>>>       struct vm_area_struct *vma;
>>>>>>       struct task_struct *tsk;
>>>>>> -    struct address_space *mapping = page->mapping;
>>>>>> -    pgoff_t pgoff;
>>>>>>         i_mmap_lock_read(mapping);
>>>>>>       read_lock(&tasklist_lock);
>>>>>> -    pgoff = page_to_pgoff(page);
>>>>>>       for_each_process(tsk) {
>>>>>>           struct task_struct *t = task_early_kill(tsk, force_early);
>>>>>> -
>>>>>>           if (!t)
>>>>>>               continue;
>>>>>> -        vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff,
>>>>>> -                      pgoff) {
>>>>>> +        vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, 
>>>>>> pgoff) {
>>>>>>               /*
>>>>>>                * Send early kill signal to tasks where a vma covers
>>>>>>                * the page but the corrupted page is not necessarily
>>>>>> @@ -531,7 +532,7 @@ static void collect_procs_file(struct page 
>>>>>> *page, struct list_head *to_kill,
>>>>>>                * to be informed of all such data corruptions.
>>>>>>                */
>>>>>>               if (vma->vm_mm == t->mm)
>>>>>> -                add_to_kill(t, page, vma, to_kill);
>>>>>> +                add_to_kill(t, page, mapping, pgoff, vma, to_kill);
>>>>>>           }
>>>>>>       }
>>>>>>       read_unlock(&tasklist_lock);
>>>>>> @@ -550,7 +551,8 @@ static void collect_procs(struct page *page, 
>>>>>> struct list_head *tokill,
>>>>>>       if (PageAnon(page))
>>>>>>           collect_procs_anon(page, tokill, force_early);
>>>>>>       else
>>>>>> -        collect_procs_file(page, tokill, force_early);
>>>>>> +        collect_procs_file(page, page->mapping, page_to_pgoff(page),
>>>>>
>>>>> Why not use page_mapping() helper here? It would be safer for THPs 
>>>>> if they
>>>>> ever get here...
>>>>>
>>>>>                                 Honza
>>>>>
>>>>
>>>
>>>
>>
> 
>