Date:   Fri, 28 Oct 2022 10:35:44 +0530
From:   Aneesh Kumar K V <aneesh.kumar@...ux.ibm.com>
To:     "Huang, Ying" <ying.huang@...el.com>
Cc:     linux-mm@...ck.org, linux-kernel@...r.kernel.org,
        Andrew Morton <akpm@...ux-foundation.org>,
        Alistair Popple <apopple@...dia.com>,
        Bharata B Rao <bharata@....com>,
        Dan Williams <dan.j.williams@...el.com>,
        Dave Hansen <dave.hansen@...el.com>,
        Davidlohr Bueso <dave@...olabs.net>,
        Hesham Almatary <hesham.almatary@...wei.com>,
        Jagdish Gediya <jvgediya.oss@...il.com>,
        Johannes Weiner <hannes@...xchg.org>,
        Jonathan Cameron <Jonathan.Cameron@...wei.com>,
        Michal Hocko <mhocko@...nel.org>,
        Tim Chen <tim.c.chen@...el.com>, Wei Xu <weixugc@...gle.com>,
        Yang Shi <shy828301@...il.com>
Subject: Re: [RFC] memory tiering: use small chunk size and more tiers

On 10/28/22 8:33 AM, Huang, Ying wrote:
> Hi, Aneesh,
> 
> Aneesh Kumar K V <aneesh.kumar@...ux.ibm.com> writes:
> 
>> On 10/27/22 12:29 PM, Huang Ying wrote:
>>> We need some way to override the system default memory tiers.  For
>>> example, consider a system as follows,
>>>
>>> type		abstract distance
>>> ----		-----------------
>>> HBM		300
>>> DRAM		1000
>>> CXL_MEM		5000
>>> PMEM		5100
>>>
>>> Given the memory tier chunk size is 100, the default memory tiers
>>> could be,
>>>
>>> tier		abstract distance range		types
>>> ----		-----------------------		-----
>>> 3		300-400			HBM
>>> 10		1000-1100		DRAM
>>> 50		5000-5100		CXL_MEM
>>> 51		5100-5200		PMEM
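The tier numbers above follow from the integer quotient of abstract
distance and chunk size.  A minimal sketch of that mapping, assuming
simple integer division (tier_id() is an illustrative helper here, not
the in-kernel interface):

/*
 * Hedged sketch: derive a tier id as the integer quotient of abstract
 * distance and chunk size.  Names and values are illustrative only.
 */
#include <stdio.h>

static int tier_id(int abstract_distance, int chunk_size)
{
	return abstract_distance / chunk_size;
}

int main(void)
{
	const int chunk = 100;

	printf("HBM     -> tier %d\n", tier_id(300, chunk));   /* 3  */
	printf("DRAM    -> tier %d\n", tier_id(1000, chunk));  /* 10 */
	printf("CXL_MEM -> tier %d\n", tier_id(5000, chunk));  /* 50 */
	printf("PMEM    -> tier %d\n", tier_id(5100, chunk));  /* 51 */
	return 0;
}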
>>>
>>> If we want to group CXL_MEM and PMEM into one tier, we have 2 choices.
>>>
>>> 1) Override the abstract distance of CXL_MEM or PMEM.  For example, if
>>> we change the abstract distance of PMEM to 5050, the memory tiers
>>> become,
>>>
>>> tier		abstract distance range		types
>>> ----		-----------------------		-----
>>> 3		300-400			HBM
>>> 10		1000-1100		DRAM
>>> 50		5000-5100		CXL_MEM, PMEM
>>>
>>> 2) Override the memory tier chunk size.  For example, if we change the
>>> memory tier chunk size to 200, the memory tiers become,
>>>
>>> tier		abstract distance range		types
>>> ----		-----------------------		-----
>>> 1		200-400			HBM
>>> 5		1000-1200		DRAM
>>> 25		5000-5200		CXL_MEM, PMEM
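(With the same quotient as in the sketch above, choice 1) gives
5050 / 100 == 5000 / 100 == 50, so CXL_MEM and PMEM share tier 50;
choice 2) gives 5100 / 200 == 5000 / 200 == 25, so they share tier 25.)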
>>>
>>> But after some thought, I think choice 2) may not be good.  The
>>> problem is that even if 2 abstract distances are almost the same, they
>>> may be put in 2 tiers if they sit on different sides of a tier
>>> boundary.  For example, suppose the abstract distance of CXL_MEM is
>>> 4990, while the abstract distance of PMEM is 5010.  Although the
>>> difference between the abstract distances is only 20, CXL_MEM and PMEM
>>> will be put in different tiers if the tier chunk size is 50, 100, 200,
>>> 250, 500, ....  This makes choice 2) hard to use; it may become tricky
>>> to find an appropriate tier chunk size that satisfies all requirements.
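A small illustration of this boundary effect, assuming the same
integer-quotient mapping as in the sketch above (again, purely
illustrative, not the in-kernel code):

#include <stdio.h>

int main(void)
{
	/*
	 * 5000 is a multiple of each chunk size below, so a tier boundary
	 * falls at 5000 and separates adistance 4990 from 5010 every time.
	 */
	const int chunks[] = { 50, 100, 200, 250, 500 };
	size_t i;

	for (i = 0; i < sizeof(chunks) / sizeof(chunks[0]); i++)
		printf("chunk %3d: CXL_MEM(4990) -> tier %d, PMEM(5010) -> tier %d\n",
		       chunks[i], 4990 / chunks[i], 5010 / chunks[i]);
	return 0;
}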
>>>
>>
>> Shouldn't we wait to gain experience with how we end up mapping
>> devices with different latencies and bandwidths before tuning these
>> values?
> 
> Just want to discuss the overall design.
> 
>>> So I suggest abandoning choice 2) and using choice 1) only.  This
>>> makes the overall design and user space interface simpler and easier
>>> to use.  The overall design of the abstract distance could be,
>>>
>>> 1. Use decimal for abstract distance and its chunk size.  This makes
>>>    them more user friendly.
>>>
>>> 2. Make the tier chunk size as small as possible.  For example, 10.
>>>    By default this will put different memory types in one memory tier
>>>    only if their performance is almost the same.  And we will not
>>>    provide an interface to override the chunk size.
>>>
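(For scale, the small chunk size in point 2) above with the example
distances would give HBM tier 30, DRAM tier 100, CXL_MEM tier 500 and
PMEM tier 510; two memory types would share a default tier only if
their abstract distances fall within the same 10-wide bucket.)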
>>
>> This could also mean we end up with lots of memory tiers with
>> relatively small performance differences between them.  Again, it
>> depends on how HMAT attributes will be mapped to abstract distance.
> 
> Per my understanding, there will not be many memory types in a system,
> so there will not be many memory tiers either.  In most systems there
> are only 2 or 3 memory tiers, for example HBM, DRAM, CXL, etc.

So we don't need the chunk size to be 10, because we don't foresee
needing to group devices into that many tiers.

> Do you know of systems with many memory types?  The basic idea is to
> put different memory types in different memory tiers by default.  If
> users want to group them, they can do that by overriding the abstract
> distance of some memory type.
> 

With a small chunk size, and depending on how we are going to derive
abstract distance, I am wondering whether we would end up with lots of
memory tiers with no real value.  Hence my suggestion to wait on making
a change like this until we have code that maps HMAT/CDAT attributes to
abstract distance.

>>
>>> 3. Make the abstract distance of normal DRAM large enough, for
>>>    example 1000.  Then 100 tiers can be defined below DRAM, which is
>>>    more than enough in practice.
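(With a chunk size of 10, abstract distances 0-999 map to tiers 0-99,
so DRAM at 1000 lands in tier 100 and leaves room for 100 distinct
tiers faster than DRAM, which is where the 100 above comes from.)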
>>
>> Why 100? Will we really have that many tiers below/faster than DRAM? As of now 
>> I see only HBM below it.
> 
> Yes.  100 is more than enough.  We just want to avoid grouping different
> memory types by default.
> 
> Best Regards,
> Huang, Ying
> 
