Message-ID: <af6122b4-2324-418b-b925-becf6036d9ab@linux.alibaba.com>
Date: Tue, 25 Feb 2025 18:15:47 +0800
From: Baolin Wang <baolin.wang@...ux.alibaba.com>
To: Zi Yan <ziy@...dia.com>
Cc: linux-mm@...ck.org, linux-fsdevel@...r.kernel.org,
Matthew Wilcox <willy@...radead.org>, Hugh Dickins <hughd@...gle.com>,
Kairui Song <kasong@...cent.com>, Miaohe Lin <linmiaohe@...wei.com>,
linux-kernel@...r.kernel.org, Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [PATCH v2 2/2] mm/shmem: use xas_try_split() in
shmem_split_large_entry()
On 2025/2/25 17:20, Baolin Wang wrote:
>
>
> On 2025/2/21 10:38, Zi Yan wrote:
>> On 20 Feb 2025, at 21:33, Zi Yan wrote:
>>
>>> On 20 Feb 2025, at 8:06, Zi Yan wrote:
>>>
>>>> On 20 Feb 2025, at 4:27, Baolin Wang wrote:
>>>>
>>>>> On 2025/2/20 17:07, Baolin Wang wrote:
>>>>>>
>>>>>>
>>>>>> On 2025/2/20 00:10, Zi Yan wrote:
>>>>>>> On 19 Feb 2025, at 5:04, Baolin Wang wrote:
>>>>>>>
>>>>>>>> Hi Zi,
>>>>>>>>
>>>>>>>> Sorry for the late reply due to being busy with other things:)
>>>>>>>
>>>>>>> Thank you for taking a look at the patches. :)
>>>>>>>
>>>>>>>>
>>>>>>>> On 2025/2/19 07:54, Zi Yan wrote:
>>>>>>>>> During shmem_split_large_entry(), large swap entries are covering
>>>>>>>>> n slots and an order-0 folio needs to be inserted.
>>>>>>>>>
>>>>>>>>> Instead of splitting all n slots, only the 1 slot covered by the
>>>>>>>>> folio needs to be split and the remaining n-1 shadow entries can
>>>>>>>>> be retained with orders ranging from 0 to n-1. This method only
>>>>>>>>> requires (n/XA_CHUNK_SHIFT) new xa_nodes instead of
>>>>>>>>> (n % XA_CHUNK_SHIFT) * (n/XA_CHUNK_SHIFT) new xa_nodes, compared
>>>>>>>>> to the original xas_split_alloc() + xas_split() one.
>>>>>>>>>
>>>>>>>>> For example, to split an order-9 large swap entry (assuming
>>>>>>>>> XA_CHUNK_SHIFT is 6), 1 xa_node is needed instead of 8.
>>>>>>>>>
>>>>>>>>> xas_try_split_min_order() is used to reduce the number of calls
>>>>>>>>> to xas_try_split() during split.
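
For concreteness, a rough userspace sketch (not taken from the patch): with
XA_CHUNK_SHIFT = 6, an order-9 entry spans 8 order-6 chunks; a uniform split
to order-0 allocates a new xa_node for every chunk, while the non-uniform
split only needs one node for the chunk that holds the target slot.

#include <stdio.h>

/*
 * Rough illustration only (not from the patch): xa_node counts when
 * splitting an order-9 entry down to order-0 with XA_CHUNK_SHIFT == 6.
 */
int main(void)
{
        int order = 9, chunk_shift = 6;
        int uniform_nodes = 1 << (order - chunk_shift); /* one node per order-6 chunk: 8 */
        int nonuniform_nodes = 1;                       /* only the chunk covering the target slot */

        printf("uniform split: %d new nodes, non-uniform split: %d new node\n",
               uniform_nodes, nonuniform_nodes);
        return 0;
}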
>>>>>>>>
>>>>>>>> For shmem swapin, if we cannot swap in the whole large folio by
>>>>>>>> skipping the swap cache, we will split the large swap entry
>>>>>>>> stored in the shmem mapping into order-0 swap entries, rather
>>>>>>>> than splitting it into other orders of swap entries. This is
>>>>>>>> because the next time we swap in a shmem folio through
>>>>>>>> shmem_swapin_cluster(), it will still be an order 0 folio.
>>>>>>>
>>>>>>> Right. But the swapin is one folio at a time, right?
>>>>>>> shmem_split_large_entry()
>>>>>>
>>>>>> Yes, now we always swap in an order-0 folio at a time from the async
>>>>>> swap device. However, for a sync swap device, we will skip the
>>>>>> swapcache and swap in the whole large folio (see commit 1dd44c0af4fa),
>>>>>> so shmem_split_large_entry() will not be called in this case.
>>>>
>>>> Got it. I will check the commit.
>>>>
>>>>>>
>>>>>>> should split the large swap entry and give you a slot to store
>>>>>>> the order-0 folio.
>>>>>>> For example, with an order-9 large swap entry, to swap in first
>>>>>>> order-0 folio,
>>>>>>> the large swap entry will become order-0, order-0, order-1,
>>>>>>> order-2,… order-8,
>>>>>>> after the split. Then the first order-0 swap entry can be used.
>>>>>>> Then, when a second order-0 is swapped in, the second order-0 can
>>>>>>> be used.
>>>>>>> When the last order-0 is swapped in, the order-8 would be split to
>>>>>>> order-7,order-6,…,order-1,order-0, order-0, and the last order-0
>>>>>>> will be used.
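
If it helps to visualize, here is a tiny userspace sketch (purely
illustrative, ignoring XA_CHUNK_SHIFT and the xarray itself) of the
buddy-style layout left behind once the first order-0 slot of an order-9
entry has been split out:

#include <stdio.h>

/*
 * Purely illustrative: after a non-uniform split of an order-9 entry so that
 * the first order-0 slot can be used, the remaining entries have orders
 * 0, 0, 1, 2, ..., 8 at power-of-two offsets (XA_CHUNK_SHIFT limits ignored).
 */
int main(void)
{
        int top_order = 9;

        printf("order-0 at offset 0 (the slot being swapped in)\n");
        for (int order = 0; order < top_order; order++)
                printf("order-%d entry at offset %d\n", order, 1 << order);
        return 0;
}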
>>>>>>
>>>>>> Yes, understood. However, for sequential swapin scenarios, originally
>>>>>> only one split operation is needed, whereas your approach increases
>>>>>> the number of split operations. Of course, I understand that in
>>>>>> non-sequential swapin scenarios, your patch will save some xarray
>>>>>> memory. It might be necessary to evaluate whether the increased
>>>>>> number of split operations has a significant impact on the
>>>>>> performance of sequential swapin?
>>>>
>>>> Is there a shmem swapin test I can run to measure this? xas_try_split()
>>>> should perform operations similar to the existing
>>>> xas_split_alloc() + xas_split().
>>>>
>>>>>>
>>>>>>> Maybe the swapin code assumes that, after shmem_split_large_entry(),
>>>>>>> all swap entries are order-0, which can lead to issues. There should
>>>>>>> be some check like: if the swap entry order > folio_order,
>>>>>>> shmem_split_large_entry() should be used.
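
A minimal sketch of such a check (not from the patch; the 'failed' label and
the surrounding swapin context are assumed):

/*
 * Sketch only: split the large swap entry whenever it is larger than the
 * folio being swapped in, instead of assuming everything is already order-0.
 * The 'failed' label is assumed from the surrounding swapin path.
 */
if (xa_get_order(&mapping->i_pages, index) > folio_order(folio)) {
        int split_order = shmem_split_large_entry(inode, index, swap, gfp);

        if (split_order < 0)
                goto failed;
}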
>>>>>>>>
>>>>>>>> Moreover, I did a quick test with swapping in order-6 shmem folios;
>>>>>>>> however, my test hung, and the console was continuously filled with
>>>>>>>> the following information. It seems there are some issues with
>>>>>>>> shmem swapin handling. Anyway, I need more time to debug and test.
>>>>>>> To swap in order-6 folios, shmem_split_large_entry() does not
>>>>>>> allocate any new xa_node, since XA_CHUNK_SHIFT is 6. It is weird to
>>>>>>> see the OOM error below. Let me know if there is anything I can help
>>>>>>> with.
>>>>>>
>>>>>> I encountered some issues while testing order-4 and order-6 swapin
>>>>>> with your patches. I roughly reviewed the patch, and it seems that
>>>>>> the new swap entry stored in the shmem mapping was not correctly
>>>>>> updated after the split.
>>>>>>
>>>>>> The following logic resets the swap entry after the split, and it
>>>>>> assumes that the large swap entry was always split to order 0. As
>>>>>> your patch suggests, if a non-uniform split is used, then the logic
>>>>>> for resetting the swap entry needs to be changed as well? Please
>>>>>> correct me if I missed something.
>>>>>>
>>>>>> /*
>>>>>>  * Re-set the swap entry after splitting, and the swap
>>>>>>  * offset of the original large entry must be continuous.
>>>>>>  */
>>>>>> for (i = 0; i < 1 << order; i++) {
>>>>>>         pgoff_t aligned_index = round_down(index, 1 << order);
>>>>>>         swp_entry_t tmp;
>>>>>>
>>>>>>         tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
>>>>>>         __xa_store(&mapping->i_pages, aligned_index + i,
>>>>>>                    swp_to_radix_entry(tmp), 0);
>>>>>> }
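
To make the assumption concrete (illustrative numbers only, not from the
code): with order = 9, a large entry starting at aligned_index = 512 and a
swap offset base 'base', the loop effectively does the following, which is
only meaningful if every one of the 512 slots ends up holding an order-0
entry:

/*
 * Illustrative expansion of the loop above for order = 9: each of the 512
 * slots gets its own order-0 swap value, so the logic only works when the
 * whole range really was split down to order 0.
 */
pgoff_t base = swp_offset(swap);

for (i = 0; i < 512; i++)
        __xa_store(&mapping->i_pages, 512 + i,
                   swp_to_radix_entry(swp_entry(swp_type(swap), base + i)), 0);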
>>>>
>>>> Right. I will need to adjust swp_entry_t. Thanks for pointing this out.
>>>>
>>>>>
>>>>> In addition, after your patch, shmem_split_large_entry() seems to
>>>>> always return 0 even though it splits a large swap entry, but we still
>>>>> need to re-calculate the swap entry value after splitting; otherwise it
>>>>> may return errors due to shmem_confirm_swap() validation failure.
>>>>>
>>>>> /*
>>>>>  * If the large swap entry has already been split, it is
>>>>>  * necessary to recalculate the new swap entry based on
>>>>>  * the old order alignment.
>>>>>  */
>>>>> if (split_order > 0) {
>>>>>         pgoff_t offset = index - round_down(index, 1 << split_order);
>>>>>
>>>>>         swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
>>>>> }
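
With illustrative numbers (not from the code): for index = 70 and
split_order = 6, round_down(70, 64) = 64, so offset = 6 and the swap entry is
advanced by 6 slots to match the slot actually being swapped in:

/* Illustrative numbers only: index = 70, split_order = 6. */
pgoff_t offset = 70 - round_down(70, 1 << 6);   /* 70 - 64 = 6 */

swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);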
>>>>
>>>> Got it. I will fix it.
>>>>
>>>> BTW, do you mind sharing your swapin tests so that I can test my new version properly?
>>>
>>> The diff below adjusts the swp_entry_t and returns the right order after
>>> shmem_split_large_entry(). Let me know if it fixes your issue.
>>
>> Fixed the compilation error. It would be great if you could share a
>> swapin test, so that I can test locally. Thanks.
>>
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index b35ba250c53d..bfc4ef511391 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -2162,7 +2162,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>>  {
>>          struct address_space *mapping = inode->i_mapping;
>>          XA_STATE_ORDER(xas, &mapping->i_pages, index, 0);
>> -        int split_order = 0;
>> +        int split_order = 0, entry_order = 0;
>>          int i;
>>
>>          /* Convert user data gfp flags to xarray node gfp flags */
>> @@ -2180,6 +2180,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>>                  }
>>
>>                  order = xas_get_order(&xas);
>> +                entry_order = order;
>>
>>                  /* Try to split large swap entry in pagecache */
>>                  if (order > 0) {
>> @@ -2192,23 +2193,23 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>>                                  xas_try_split(&xas, old, cur_order, GFP_NOWAIT);
>>                                  if (xas_error(&xas))
>>                                          goto unlock;
>> +
>> +                                /*
>> +                                 * Re-set the swap entry after splitting, and the swap
>> +                                 * offset of the original large entry must be continuous.
>> +                                 */
>> +                                for (i = 0; i < 1 << cur_order; i += (1 << split_order)) {
>> +                                        pgoff_t aligned_index = round_down(index, 1 << cur_order);
>> +                                        swp_entry_t tmp;
>> +
>> +                                        tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
>> +                                        __xa_store(&mapping->i_pages, aligned_index + i,
>> +                                                   swp_to_radix_entry(tmp), 0);
>> +                                }
>>                                  cur_order = split_order;
>>                                  split_order = xas_try_split_min_order(split_order);
>>                          }
>
> This looks incorrect to me. Suppose we are splitting an order-9 swap
> entry: in the first iteration of the loop, it splits the order-9 swap
> entry into 8 order-6 swap entries. At this point, the order-6 swap
> entries are reset, and everything seems fine.
>
> However, in the second iteration, where an order-6 swap entry is split
> into 63 order-0 swap entries, the split operation itself is correct. But
typo: should be 64
> when resetting the order-0 swap entry, it seems incorrect. Now the
> 'cur_order' = 6 and 'split_order' = 0, which means the range for the
> reset index is always between 0 and 63 (see __xa_store()).
Sorry for the confusion. The 'aligned_index' will be rounded down by
'cur_order' (which is 6), so the index is correct. But the swap offset
calculated by 'swp_offset(swap) + i' looks incorrect, because 'i' is
always between 0 and 63.
> > + for (i = 0; i < 1 << cur_order; i += (1 << split_order)) {
> > +         pgoff_t aligned_index = round_down(index, 1 << cur_order);
> > +         swp_entry_t tmp;
> > +
> > +         tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
> > +         __xa_store(&mapping->i_pages, aligned_index + i,
> > +                    swp_to_radix_entry(tmp), 0);
> > + }
>
> However, if the index is greater than 63, it appears that it is not set
> to the correct new swap entry after splitting. Please correct me if I
> missed anything.
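
For what it is worth, one way this could be handled (a rough, untested
sketch, not the actual fix; 'entry_order' is assumed to hold the order before
any split) is to base each slot's offset on its distance from the start of
the original large entry rather than on 'i' alone:

/*
 * Rough, untested sketch: make the swap offset track the distance from the
 * start of the original large entry.  'entry_order' is assumed to hold the
 * pre-split order of the entry.
 */
pgoff_t swap_index = round_down(index, 1 << entry_order);

for (i = 0; i < 1 << cur_order; i += (1 << split_order)) {
        pgoff_t aligned_index = round_down(index, 1 << cur_order);
        pgoff_t swap_offset = aligned_index - swap_index;
        swp_entry_t tmp;

        tmp = swp_entry(swp_type(swap), swp_offset(swap) + swap_offset + i);
        __xa_store(&mapping->i_pages, aligned_index + i,
                   swp_to_radix_entry(tmp), 0);
}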
>
>> -
>> -                        /*
>> -                         * Re-set the swap entry after splitting, and the swap
>> -                         * offset of the original large entry must be continuous.
>> -                         */
>> -                        for (i = 0; i < 1 << order; i++) {
>> -                                pgoff_t aligned_index = round_down(index, 1 << order);
>> -                                swp_entry_t tmp;
>> -
>> -                                tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
>> -                                __xa_store(&mapping->i_pages, aligned_index + i,
>> -                                           swp_to_radix_entry(tmp), 0);
>> -                        }
>>                  }
>>
>>  unlock:
>> @@ -2221,7 +2222,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>>          if (xas_error(&xas))
>>                  return xas_error(&xas);
>>
>> -        return split_order;
>> +        return entry_order;
>>  }
>>
>> /*
>>
>>
>> Best Regards,
>> Yan, Zi