linux-kernel - Re: [RFC PATCH v4 2/7] mm/demotion: Expose per node memory tier to sysfs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Mon, 6 Jun 2022 22:09:02 +0530
From:   Aneesh Kumar K V <aneesh.kumar@...ux.ibm.com>
To:     Jonathan Cameron <Jonathan.Cameron@...wei.com>
Cc:     linux-mm@...ck.org, akpm@...ux-foundation.org,
        Huang Ying <ying.huang@...el.com>,
        Greg Thelen <gthelen@...gle.com>,
        Yang Shi <shy828301@...il.com>,
        Davidlohr Bueso <dave@...olabs.net>,
        Tim C Chen <tim.c.chen@...el.com>,
        Brice Goglin <brice.goglin@...il.com>,
        Michal Hocko <mhocko@...nel.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Hesham Almatary <hesham.almatary@...wei.com>,
        Dave Hansen <dave.hansen@...el.com>,
        Alistair Popple <apopple@...dia.com>,
        Dan Williams <dan.j.williams@...el.com>,
        Feng Tang <feng.tang@...el.com>,
        Jagdish Gediya <jvgediya@...ux.ibm.com>,
        Baolin Wang <baolin.wang@...ux.alibaba.com>,
        David Rientjes <rientjes@...gle.com>
Subject: Re: [RFC PATCH v4 2/7] mm/demotion: Expose per node memory tier to
 sysfs

On 6/6/22 9:46 PM, Jonathan Cameron wrote:
> On Mon, 6 Jun 2022 21:31:16 +0530
> Aneesh Kumar K V <aneesh.kumar@...ux.ibm.com> wrote:
> 
>> On 6/6/22 8:29 PM, Jonathan Cameron wrote:
>>> On Fri, 3 Jun 2022 14:10:47 +0530
>>> Aneesh Kumar K V <aneesh.kumar@...ux.ibm.com> wrote:
>>>    
>>>> On 5/27/22 7:45 PM, Jonathan Cameron wrote:
>>>>> On Fri, 27 May 2022 17:55:23 +0530
>>>>> "Aneesh Kumar K.V" <aneesh.kumar@...ux.ibm.com> wrote:
>>>>>       
>>>>>> From: Jagdish Gediya <jvgediya@...ux.ibm.com>
>>>>>>
>>>>>> Add support to read/write the memory tierindex for a NUMA node.
>>>>>>
>>>>>> /sys/devices/system/node/nodeN/memtier
>>>>>>
>>>>>> where N = node id
>>>>>>
>>>>>> When read, It list the memory tier that the node belongs to.
>>>>>>
>>>>>> When written, the kernel moves the node into the specified
>>>>>> memory tier, the tier assignment of all other nodes are not
>>>>>> affected.
>>>>>>
>>>>>> If the memory tier does not exist, writing to the above file
>>>>>> create the tier and assign the NUMA node to that tier.
>>>>> creates
>>>>>
>>>>> There was some discussion in v2 of Wei Xu's RFC that what matter
>>>>> for creation is the rank, not the tier number.
>>>>>
>>>>> My suggestion is move to an explicit creation file such as
>>>>> memtier/create_tier_from_rank
>>>>> to which writing the rank gives results in a new tier
>>>>> with the next device ID and requested rank.
>>>>
>>>> I think the below workflow is much simpler.
>>>>
>>>> :/sys/devices/system# cat memtier/memtier1/nodelist
>>>> 1-3
>>>> :/sys/devices/system# cat node/node1/memtier
>>>> 1
>>>> :/sys/devices/system# ls memtier/memtier*
>>>> nodelist  power  rank  subsystem  uevent
>>>> /sys/devices/system# ls memtier/
>>>> default_rank  max_tier  memtier1  power  uevent
>>>> :/sys/devices/system# echo 2 > node/node1/memtier
>>>> :/sys/devices/system#
>>>>
>>>> :/sys/devices/system# ls memtier/
>>>> default_rank  max_tier  memtier1  memtier2  power  uevent
>>>> :/sys/devices/system# cat memtier/memtier1/nodelist
>>>> 2-3
>>>> :/sys/devices/system# cat memtier/memtier2/nodelist
>>>> 1
>>>> :/sys/devices/system#
>>>>
>>>> ie, to create a tier we just write the tier id/tier index to
>>>> node/nodeN/memtier file. That will create a new memory tier if needed
>>>> and add the node to that specific memory tier. Since for now we are
>>>> having 1:1 mapping between tier index to rank value, we can derive the
>>>> rank value from the memory tier index.
>>>>
>>>> For dynamic memory tier support, we can assign a rank value such that
>>>> new memory tiers are always created such that it comes last in the
>>>> demotion order.
>>>
>>> I'm not keen on having to pass through an intermediate state where
>>> the rank may well be wrong, but I guess it's not that harmful even
>>> if it feels wrong ;)
>>>    
>>
>> Any new memory tier added can be of lowest rank (rank - 0) and hence
>> will appear as the highest memory tier in demotion order.
> 
> Depends on driver interaction - if new memory is CXL attached or
> GPU attached, chances are the driver has an input on which tier
> it is put in by default.
> 
>> User can then
>> assign the right rank value to the memory tier? Also the actual demotion
>> target paths are built during memory block online which in most case
>> would happen after we properly verify that the device got assigned to
>> the right memory tier with correct rank value?
> 
> Agreed, though that may change the model of how memory is brought online
> somewhat.
> 
>>
>>> Races are potentially a bit of a pain though depending on what we
>>> expect the usage model to be.
>>>
>>> There are patterns (CXL regions for example) of guaranteeing the
>>> 'right' device is created by doing something like
>>>
>>> cat create_tier > temp.txt
>>> #(temp gets 2 for example on first call then
>>> # next read of this file gets 3 etc)
>>>
>>> cat temp.txt > create_tier
>>> # will fail if there hasn't been a read of the same value
>>>
>>> Assuming all software keeps to the model, then there are no
>>> race conditions over creation.  Otherwise we have two new
>>> devices turn up very close to each other and userspace scripting
>>> tries to create two new tiers - if it races they may end up in
>>> the same tier when that wasn't the intent.  Then code to set
>>> the rank also races and we get two potentially very different
>>> memories in a tier with a randomly selected rank.
>>>
>>> Fun and games...  And a fine illustration why sysfs based 'device'
>>> creation is tricky to get right (and lots of cases in the kernel
>>> don't).
>>>    
>>
>> I would expect userspace to be careful and verify the memory tier and
>> rank value before we online the memory blocks backed by the device. Even
>> if we race, the result would be two device not intended to be part of
>> the same memory tier appearing at the same tier. But then we won't be
>> building demotion targets yet. So userspace could verify this, move the
>> nodes out of the memory tier. Once it is verified, memory blocks can be
>> onlined.
> 
> The race is there and not avoidable as far as I can see. Two processes A and B.
> 
> A checks for a spare tier number
> B checks for a spare tier number
> A tries to assign node 3 to new tier 2 (new tier created)
> B tries to assign node 4 to new tier 2 (accidentally hits existing tier - as this
> is the same method we'd use to put it in the existing tier we can't tell this
> write was meant to create a new tier).
> A writes rank 100 to tier 2
> A checks rank for tier 2 and finds it is 100 as expected.
> B write rank 200 to tier 2 (it could check if still default but even that is racy)
> B checks rank for tier 2 rank and finds it is 200 as expected.
> A onlines memory.
> B onlines memory.
> 
> Both think they got what they wanted, but A definitely didn't.
> 
> One work around is the read / write approach and create_tier.
> 
> A reads create_tier - gets 2.
> B reads create_tier - gets 3.
> A writes 2 to create_tier as that's what it read.
> B writes 3 to create_tier as that's what it read.
> 
> continue with created tiers.  Obviously can exhaust tiers, but if this is
> root only, could just create lots anyway so no worse off.
>   
>>
>> Having said that can you outline the usage of
>> memtier/create_tier_from_rank ?
> 
> There are corner cases to deal with...
> 
> A writes 100 to create_tier_from_rank.
> A goes looking for matching tier - finds it: tier2
> B writes 200 to create_tier_from_rank
> B goes looking for matching tier - finds it: tier3
> 
> rest is fine as operating on different tiers.
> 
> Trickier is
> A writes 100 to create_tier_from_rank  - succeed.
> B writes 100 to create_tier_from_rank  - Could fail, or could just eat it?
> 
> Logically this is same as separate create_tier and then a write
> of rank, but in one operation, but then you need to search
> for the right one.  As such, perhaps a create_tier
> that does the read/write pair as above is the best solution.
> 

This all is good when we allow dynamic rank values. But currently we are 
restricting ourselves to three rank value as below:

rank   memtier
300    memtier0
200    memtier1
100    memtier2

Now with the above, how do we define a write to create_tier_from_rank. 
What should be the behavior if user write value other than above defined 
rank values? Also enforcing the above three rank values as supported 
implies teaching userspace about them. I am trying to see how to fit
create_tier_from_rank without requiring the above.

Can we look at implementing create_tier_from_rank when we start 
supporting dynamic tiers/rank values? ie,

we still allow node/nodeN/memtier. But with dynamic tiers a race free
way to get a new memory tier would be echo rank > 
memtier/create_tier_from_rank. We could also say, memtier0/1/2 are 
kernel defined memory tiers. Writing to memtier/create_tier_from_rank 
will create new memory tiers above memtier2 with the rank value specified?

-aneesh