Message-ID: <7ba328e5-3bc8-cb22-f00c-eddb8aea9a06@virtuozzo.com>
Date: Tue, 11 Oct 2022 12:07:28 +0300
From: Alexander Atanasov <alexander.atanasov@...tuozzo.com>
To: Nadav Amit <nadav.amit@...il.com>
Cc: "Michael S. Tsirkin" <mst@...hat.com>,
David Hildenbrand <david@...hat.com>,
Andrew Morton <akpm@...ux-foundation.org>, kernel@...nvz.org,
Linux Virtualization <virtualization@...ts.linux-foundation.org>,
LKML <linux-kernel@...r.kernel.org>,
Linux MM <linux-mm@...ck.org>
Subject: Re: RFC [PATCH v4 2/7] Enable balloon drivers to report inflated
memory
Hello,
On 10.10.22 17:47, Nadav Amit wrote:
> On Oct 10, 2022, at 12:24 AM, Alexander Atanasov <alexander.atanasov@...tuozzo.com> wrote:
>
>> Hello,
>>
>> On 10.10.22 9:18, Nadav Amit wrote:
>>> On Oct 7, 2022, at 3:58 AM, Alexander Atanasov <alexander.atanasov@...tuozzo.com> wrote:
[snip]
>>> Side-note: That’s not the case for VMware balloon. I actually considered
>>> calling adjust_managed_page_count() just to conform with other balloon
>>> drivers. But since we use totalram_pages() to communicate to the hypervisor
>>> the total-ram, this would create endless (and wrong) feedback loop. I am not
>>> claiming it is not possible to VMware balloon driver to call
>>> adjust_managed_page_count(), but the chances are that it would create more
>>> harm than good.
>>
>> Virtio does both, depending on the deflate-on-OOM option. I already suggested unifying all drivers to inflate the used memory, since that seems more logical to me - nobody expects totalram_pages() to change - but the current state is that both ways are accepted, and changing it can break existing users.
>> See discussion here https://lore.kernel.org/lkml/20220809095358.2203355-1-alexander.atanasov@virtuozzo.com/.
>
> Thanks for the reminder. I wish you can somehow summarize all of that into the
> cover-letter and/or the commit messages for these patches.
I will put excerpts and the relevant links into the next versions. I see
that the more I dig into it the deeper it becomes, so it needs more
explanation.
>>
>>
>>> Back to the matter at hand. It seems that you wish that the notifiers would
>>> be called following any changes that would be reflected in totalram_pages().
>>> So, doesn't it make more sense to call it from adjust_managed_page_count() ?
>>
>> It will hurt performance - all drivers work page by page, i.e. they update by +1/-1, and they do so under locks, which, as you already noted, can lead to bad things. The notifier will accumulate the change and let its users know how much has changed, so they can decide whether they have to recalculate - it can even do so asynchronously in order not to disturb the drivers.
>
> So updating the counters by 1 is ok (using atomic operation, which is not
> free)? And the reason it is (relatively) cheap is because nobody actually
> looks at the value (i.e., nobody actually acts on the value)?
>
> If nobody considers the value, then doesn’t it make sense just to update it
> less frequently, and then call the notifiers?
That's my point too.
The drivers update the managed page count by 1 at a time.
My goal is to fire the notifier once they are done.
All drivers are similar and work like this:
HV sends an inflate/deflate request
driver inflates/deflates:
  lock
    get_page()/put_page()
    optionally - adjust_managed_page_count(..., +-1)
  unlock
update core counter and fire the balloon-changed notifier
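A userspace sketch of that batching pattern - all names here are illustrative (not the real kernel API), with stdatomic and a pthread mutex standing in for the kernel counter and lock:

```c
/* Sketch of the batching idea described above: the driver adjusts its
 * counter page by page under a lock, then fires a single notification
 * with the accumulated delta once the whole request is done. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_long balloon_inflated_pages;   /* global counter */
static pthread_mutex_t balloon_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical notifier hook: a real driver would walk a notifier chain. */
static void balloon_changed_notify(long delta)
{
    printf("balloon changed by %ld pages\n", delta);
}

/* Handle one inflate/deflate request of 'nr' pages (negative = deflate). */
static void balloon_update(long nr)
{
    long delta = 0;
    long step = nr > 0 ? 1 : -1;

    pthread_mutex_lock(&balloon_lock);
    for (long i = 0; i != nr; i += step) {
        /* get_page()/put_page() and adjust_managed_page_count()
         * would happen here, one page at a time. */
        atomic_fetch_add(&balloon_inflated_pages, step);
        delta += step;
    }
    pthread_mutex_unlock(&balloon_lock);

    /* Notify once, outside the lock, with the accumulated change. */
    if (delta)
        balloon_changed_notify(delta);
}
```

The point is that users of the value see one coherent update per request instead of one callback per page.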
>>>> The difference is here:
>>>>
>>>> mm/zswap.c: return totalram_pages() * zswap_max_pool_percent / 100 <
>>>> mm/zswap.c: return totalram_pages() * zswap_accept_thr_percent / 100
>>>> uses percents and you can recalculate easy with
>>>>
>>>> +static inline unsigned long totalram_pages_current(void)
>>>> +{
>>>> + unsigned long inflated = 0;
>>>> +#ifdef CONFIG_MEMORY_BALLOON
>>>> + extern atomic_long_t mem_balloon_inflated_free_kb;
>>>> + inflated = atomic_long_read(&mem_balloon_inflated_free_kb);
>>>> + inflated >>= (PAGE_SHIFT - 10);
>>>> +#endif
>>>> + return (unsigned long)atomic_long_read(&_totalram_pages) - inflated;
>>>> +}
>
> So we have here two values, and it appears there is a hidden assumption that
> they are both updated atomically. Otherwise, it appears, inflated
> theoretically might be greater than _totalram_pages and we get a negative
> value and all hell breaks loose.
>
> But _totalram_pages and mem_balloon_inflated_free_kb are not updated
> atomically together (each one is, but not together).
>
I do not think that can happen - in that case totalram_pages() is not
adjusted, and you can never inflate more than total RAM.
Yes, they are not set atomically together, but see the use cases:
- a driver that does calculations on init.
  It will use the notifier to redo the calculations.
  The notifier will pass the values and the size of the change to help
  the driver decide whether it needs to recalculate.
- a user of totalram_pages() that does calculations at run time.
  I still have to research whether any users could be affected by the
  two values not being set atomically, assuming there can be a slight
  difference - i.e. do they need precise values, or are they calculating
  fractions.
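To guard against the race described above - a reader momentarily seeing inflated greater than totalram - the helper could clamp the subtraction. A userspace sketch of that idea, with the kernel accessors stubbed out by plain atomics (PAGE_SHIFT of 12 is an assumption):

```c
/* Clamped variant of the quoted helper: even if the two counters are
 * momentarily inconsistent, the result can never underflow and wrap
 * around to a huge value. */
#include <stdatomic.h>

#define PAGE_SHIFT 12   /* assumption: 4 KiB pages */

static atomic_long _totalram_pages;
static atomic_long mem_balloon_inflated_free_kb;

static unsigned long totalram_pages_current(void)
{
    unsigned long total = (unsigned long)atomic_load(&_totalram_pages);
    unsigned long inflated =
        (unsigned long)atomic_load(&mem_balloon_inflated_free_kb);

    inflated >>= (PAGE_SHIFT - 10);      /* KiB -> pages */

    /* Clamp: a racy reader must not observe an underflowed value. */
    return inflated < total ? total - inflated : 0;
}
```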
>>>> And you are good when you switch to _current version - si_meminfo_current is alike .
>>>>
>>>> On init (probably) all use some kind of fractions to calculate but when there is a set value via /proc/sys/net/ipv4/tcp_wmem for example it is just a value and you can not recalculate it. And here, please, share your ideas how to solve this.
>>> I don’t get all of that. Now that you provided some more explanations, it
>>> sounds that what you want is adjust_managed_page_count(), which we already
>>> have and affects the output of totalram_pages(). Therefore, totalram_pages()
>>> anyhow accounts for the balloon memory (excluding VMware’s). So why do we
>>> need to take mem_balloon_inflated_free_kb into account?
>> Ok, you have this:
>> / totalram
>> |----used----|b1|----free------|b2|
>>
>> drivers can inflate both b1 and b2 - b1 free gets smaller, b2 totalram pages get smaller. so when you need totalram_pages() to do a calculation you need to adjust it with the pages that are inflated in free/used (b1). VMWare is not exception , Virtio does the same.
>> And according to mst and davidh it is okay like this.
>> So i am proposing a way to handle both cases.
>
> Ugh. What about BALLOON_INFLATE and BALLOON_DEFLATE vm-events? Can’t this
> information be used instead of yet another counter? Unless, of course, you
> get the atomicity that I mentioned before.
What do you mean by vm-events?
>>> Sounds to me that all you want is some notifier to be called from
>>> adjust_managed_page_count(). What am I missing?
>>
>> Notifier will act as an accumulator to report size of change and it will make things easier for the drivers and users wrt locking.
>> Notifier is similar to the memory hotplug notifier.
>
> Overall, I am not convinced that there is any value of separating the value
> and the notifier. You can batch both or not batch both. In addition, as I
> mentioned, having two values seems racy.
I have identified two types of users so far (above) - there may be more
to come. One type needs the value to adjust its calculations. Having the
value is also necessary to report it to users and to OOM. There are
options with callbacks and so on, but they would complicate things with
no real gain. You are right about the atomicity, but I guess if that is
a problem for some user, it could find a way to ensure it - I have yet
to find such a place.
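Since the proposed notifier is modeled on the memory hotplug notifier, here is a minimal userspace sketch of such a chain; the struct and function names are illustrative only, as the actual interface is what this series is still discussing:

```c
/* Sketch of a balloon notifier chain, loosely modeled on the memory
 * hotplug notifier: users register a callback and receive both the new
 * value and the size of the change, so they can decide whether to
 * recalculate. */
#include <stddef.h>

struct balloon_notify {                 /* payload passed to callbacks */
    long delta_pages;                   /* size of the change */
    unsigned long inflated_pages;       /* new total inflated */
};

struct balloon_notifier {
    int (*callback)(struct balloon_notify *info);
    struct balloon_notifier *next;
};

static struct balloon_notifier *balloon_chain;

static void register_balloon_notifier(struct balloon_notifier *nb)
{
    nb->next = balloon_chain;
    balloon_chain = nb;
}

/* Called once per inflate/deflate request, after the per-page loop. */
static void balloon_notify_changed(struct balloon_notify *info)
{
    for (struct balloon_notifier *nb = balloon_chain; nb; nb = nb->next)
        nb->callback(info);
}
```

A user such as zswap would recalculate its percentage-based thresholds from the callback instead of reading totalram_pages() on every check.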
--
Regards,
Alexander Atanasov