linux-kernel - Re: [PATCHv6 1/1] mm: optimization on page allocation when CMA enabled

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAGWkznGbGciTocrqMowepCc=aStYFHPt8RPGYjWxQ4TRG1ZqXQ@mail.gmail.com>
Date:   Wed, 8 Nov 2023 16:55:19 +0800
From:   Zhaoyang Huang <huangzhaoyang@...il.com>
To:     Johannes Weiner <hannes@...xchg.org>
Cc:     "zhaoyang.huang" <zhaoyang.huang@...soc.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Roman Gushchin <roman.gushchin@...ux.dev>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, steve.kang@...soc.com,
        Vlastimil Babka <vbabka@...e.cz>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Joonsoo Kim <iamjoonsoo.kim@....com>
Subject: Re: [PATCHv6 1/1] mm: optimization on page allocation when CMA enabled

On Wed, Nov 8, 2023 at 1:28 AM Johannes Weiner <hannes@...xchg.org> wrote:
>
> On Mon, Oct 16, 2023 at 03:12:45PM +0800, zhaoyang.huang wrote:
> > From: Zhaoyang Huang <zhaoyang.huang@...soc.com>
> >
> > According to current CMA utilization policy, an alloc_pages(GFP_USER)
> > could 'steal' UNMOVABLE & RECLAIMABLE page blocks via the help of
> > CMA(pass zone_watermark_ok by counting CMA in but use U&R in rmqueue),
> > which could lead to following alloc_pages(GFP_KERNEL) fail.
> > Solving this by introducing second watermark checking for GFP_MOVABLE,
> > which could have the allocation use CMA when proper.
> >
> > -- Free_pages(30MB)
> > |
> > |
> > -- WMARK_LOW(25MB)
> > |
> > -- Free_CMA(12MB)
> > |
> > |
> > --
>
> We're running into the same issue in production and had an incident
> over the weekend because of it. The hosts have a raised
> vm.min_free_kbytes for network rx reliability, which makes the
> mismatch between free pages and what's actually allocatable by regular
> kernel requests quite pronounced. It wasn't OOMing this time, but we
> saw very high rates of thrashing while CMA had plenty of headroom.
>
> I had raised the broader issue around poor CMA utilization before:
> https://lore.kernel.org/lkml/20230726145304.1319046-1-hannes@cmpxchg.org/
>
> For context, we're using hugetlb_cma at several gigabytes to allow
> sharing hosts between jobs that use hugetlb and jobs that don't.
>
> > @@ -2078,6 +2078,43 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
> >
> >  }
> >
> > +#ifdef CONFIG_CMA
> > +/*
> > + * GFP_MOVABLE allocation could drain UNMOVABLE & RECLAIMABLE page blocks via
> > + * the help of CMA which makes GFP_KERNEL failed. Checking if zone_watermark_ok
> > + * again without ALLOC_CMA to see if to use CMA first.
> > + */
> > +static bool use_cma_first(struct zone *zone, unsigned int order, unsigned int alloc_flags)
> > +{
> > +     unsigned long watermark;
> > +     bool cma_first = false;
> > +
> > +     watermark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
> > +     /* check if GFP_MOVABLE pass previous zone_watermark_ok via the help of CMA */
> > +     if (zone_watermark_ok(zone, order, watermark, 0, alloc_flags & (~ALLOC_CMA))) {
> > +             /*
> > +              * Balance movable allocations between regular and CMA areas by
> > +              * allocating from CMA when over half of the zone's free memory
> > +              * is in the CMA area.
> > +              */
ok, thanks for point out.
Could we simple it like this, which will mis-judge kmalloc within
ioctl as GFP_USER. I think it is ok as it is rare
             if (current_is_kswapd() || !current->mm)
                 gfp_flags = GFP_KERNEL;
             else
                 gfp_flags = GFP_USER;
            free_pages = zone_page_state(zone, NR_FREE_PAGES);
            free_pages -= zone->lowmem_reserve[gfp_zone(gfp_flags)];
            free_pages -= wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
            cma_first = free_pages > zone_page_state(zone, NR_FREE_PAGES) / 2);

> > +             cma_first = (zone_page_state(zone, NR_FREE_CMA_PAGES) >
> > +                             zone_page_state(zone, NR_FREE_PAGES) / 2);
> > +     } else {
> > +             /*
> > +              * watermark failed means UNMOVABLE & RECLAIMBLE is not enough
> > +              * now, we should use cma first to keep them stay around the
> > +              * corresponding watermark
> > +              */
> > +             cma_first = true;
> > +     }
> > +     return cma_first;
>
> I think it's a step in the right direction. However, it doesn't take
> the lowmem reserves into account. With DMA32 that can be an additional
> multiple gigabytes of "free" memory not available to GFP_KERNEL. It
> also has a knee in the balancing curve because it doesn't take
> reserves into account *until* non-CMA is depleted - at which point it
> would already be below the use-CMA threshold by the full reserves and
> watermarks.
>
> A more complete solution would have to plumb the highest_zoneidx
> information through the rmqueue family of functions somehow, and
> always take unavailable free memory into account:
>
> ---
> Subject: [PATCH] mm: page_alloc: use CMA when kernel allocations are beginning
>  to fail
>
> We can get into a situation where kernel allocations are starting to
> fail on watermarks, but movable allocations still don't use CMA
> because they make up more than half of the free memory. This can
> happen in particular with elevated vm.min_free_kbytes settings, where
> the remaining free pages aren't available to non-atomic requests.
>
> Example scenario:
>
>       Free: 3.0G
> Watermarks: 2.0G
>        CMA: 1.4G
> -> non-CMA: 1.6G
>
> CMA isn't used because CMA <= free/2. Kernel allocations fail due to
> non-CMA < watermarks. If memory is mostly unreclaimable (e.g. anon
> without swap), the kernel is more likely to OOM prematurely.
>
> Reduce the probability of that happening by taking reserves and
> watermarks into account when deciding whether to start using CMA.
>
> Signed-off-by: Johannes Weiner <hannes@...xchg.org>
> ---
>  mm/page_alloc.c | 93 +++++++++++++++++++++++++++++++------------------
>  1 file changed, 59 insertions(+), 34 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 733732e7e0ba..b9273d7f23b8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2079,30 +2079,52 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
>
>  }
>
> +static bool should_try_cma(struct zone *zone, unsigned int order,
> +                          gfp_t gfp_flags, unsigned int alloc_flags)
> +{
> +       long free_pages;
> +
> +       if (!IS_ENABLED(CONFIG_CMA) || !(alloc_flags & ALLOC_CMA))
> +               return false;
> +
> +       /*
> +        * CMA regions can be used by movable allocations while
> +        * they're not otherwise in use. This is a delicate balance:
> +        * Filling CMA too soon poses a latency risk for actual CMA
> +        * allocations (think camera app startup). Filling CMA too
> +        * late risks premature OOMs from non-movable allocations.
> +        *
> +        * Start using CMA once it dominates the remaining free
> +        * memory. Be sure to take watermarks and reserves into
> +        * account when considering what's truly "free".
> +        *
> +        * free_pages can go negative, but that's okay because
> +        * NR_FREE_CMA_PAGES should not.
> +        */
> +
> +       free_pages = zone_page_state(zone, NR_FREE_PAGES);
> +       free_pages -= zone->lowmem_reserve[gfp_zone(gfp_flags)];
> +       free_pages -= wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
> +
> +       return zone_page_state(zone, NR_FREE_CMA_PAGES) > free_pages / 2;
> +}
> +
>  /*
>   * Do the hard work of removing an element from the buddy allocator.
>   * Call me with the zone->lock already held.
>   */
>  static __always_inline struct page *
> -__rmqueue(struct zone *zone, unsigned int order, int migratetype,
> -                                               unsigned int alloc_flags)
> +__rmqueue(struct zone *zone, unsigned int order, gfp_t gfp_flags,
> +         int migratetype, unsigned int alloc_flags)
>  {
>         struct page *page;
>
> -       if (IS_ENABLED(CONFIG_CMA)) {
> -               /*
> -                * Balance movable allocations between regular and CMA areas by
> -                * allocating from CMA when over half of the zone's free memory
> -                * is in the CMA area.
> -                */
> -               if (alloc_flags & ALLOC_CMA &&
> -                   zone_page_state(zone, NR_FREE_CMA_PAGES) >
> -                   zone_page_state(zone, NR_FREE_PAGES) / 2) {
> -                       page = __rmqueue_cma_fallback(zone, order);
> -                       if (page)
> -                               return page;
> -               }
> +       if (should_try_cma(zone, order, gfp_flags, alloc_flags)) {
> +               page = __rmqueue_cma_fallback(zone, order);
> +               if (page)
> +                       return page;
>         }
> +
>  retry:
>         page = __rmqueue_smallest(zone, order, migratetype);
>         if (unlikely(!page)) {
> @@ -2121,7 +2143,7 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
>   * a single hold of the lock, for efficiency.  Add them to the supplied list.
>   * Returns the number of new pages which were placed at *list.
>   */
> -static int rmqueue_bulk(struct zone *zone, unsigned int order,
> +static int rmqueue_bulk(struct zone *zone, unsigned int order, gfp_t gfp_flags,
>                         unsigned long count, struct list_head *list,
>                         int migratetype, unsigned int alloc_flags)
>  {
> @@ -2130,8 +2152,8 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>
>         spin_lock_irqsave(&zone->lock, flags);
>         for (i = 0; i < count; ++i) {
> -               struct page *page = __rmqueue(zone, order, migratetype,
> -                                                               alloc_flags);
> +               struct page *page = __rmqueue(zone, order, gfp_flags,
> +                                             migratetype, alloc_flags);
>                 if (unlikely(page == NULL))
>                         break;
>
> @@ -2714,8 +2736,8 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z,
>
>  static __always_inline
>  struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
> -                          unsigned int order, unsigned int alloc_flags,
> -                          int migratetype)
> +                          unsigned int order, gfp_t gfp_flags,
> +                          unsigned int alloc_flags, int migratetype)
>  {
>         struct page *page;
>         unsigned long flags;
> @@ -2726,7 +2748,8 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
>                 if (alloc_flags & ALLOC_HIGHATOMIC)
>                         page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
>                 if (!page) {
> -                       page = __rmqueue(zone, order, migratetype, alloc_flags);
> +                       page = __rmqueue(zone, order, migratetype,
> +                                        gfp_flags, alloc_flags);
>
>                         /*
>                          * If the allocation fails, allow OOM handling access
> @@ -2806,10 +2829,10 @@ static int nr_pcp_alloc(struct per_cpu_pages *pcp, struct zone *zone, int order)
>  /* Remove page from the per-cpu list, caller must protect the list */
>  static inline
>  struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
> -                       int migratetype,
> -                       unsigned int alloc_flags,
> -                       struct per_cpu_pages *pcp,
> -                       struct list_head *list)
> +                              gfp_t gfp_flags, int migratetype,
> +                              unsigned int alloc_flags,
> +                              struct per_cpu_pages *pcp,
> +                              struct list_head *list)
>  {
>         struct page *page;
>
> @@ -2818,7 +2841,7 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
>                         int batch = nr_pcp_alloc(pcp, zone, order);
>                         int alloced;
>
> -                       alloced = rmqueue_bulk(zone, order,
> +                       alloced = rmqueue_bulk(zone, order, gfp_flags,
>                                         batch, list,
>                                         migratetype, alloc_flags);
>
> @@ -2837,8 +2860,9 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
>
>  /* Lock and remove page from the per-cpu list */
>  static struct page *rmqueue_pcplist(struct zone *preferred_zone,
> -                       struct zone *zone, unsigned int order,
> -                       int migratetype, unsigned int alloc_flags)
> +                                   struct zone *zone, unsigned int order,
> +                                   gfp_t gfp_flags, int migratetype,
> +                                   unsigned int alloc_flags)
>  {
>         struct per_cpu_pages *pcp;
>         struct list_head *list;
> @@ -2860,7 +2884,8 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
>          */
>         pcp->free_count >>= 1;
>         list = &pcp->lists[order_to_pindex(migratetype, order)];
> -       page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list);
> +       page = __rmqueue_pcplist(zone, order, gfp_flags, migratetype,
> +                                alloc_flags, pcp, list);
>         pcp_spin_unlock(pcp);
>         pcp_trylock_finish(UP_flags);
>         if (page) {
> @@ -2898,13 +2923,13 @@ struct page *rmqueue(struct zone *preferred_zone,
>
>         if (likely(pcp_allowed_order(order))) {
>                 page = rmqueue_pcplist(preferred_zone, zone, order,
> -                                      migratetype, alloc_flags);
> +                                      gfp_flags, migratetype, alloc_flags);
>                 if (likely(page))
>                         goto out;
>         }
>
> -       page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
> -                                                       migratetype);
> +       page = rmqueue_buddy(preferred_zone, zone, order, gfp_flags,
> +                            alloc_flags, migratetype);
>
>  out:
>         /* Separate test+clear to avoid unnecessary atomics */
> @@ -4480,8 +4505,8 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
>                         continue;
>                 }
>
> -               page = __rmqueue_pcplist(zone, 0, ac.migratetype, alloc_flags,
> -                                                               pcp, pcp_list);
> +               page = __rmqueue_pcplist(zone, 0, gfp, ac.migratetype,
> +                                        alloc_flags, pcp, pcp_list);
>                 if (unlikely(!page)) {
>                         /* Try and allocate at least one page */
>                         if (!nr_account) {
> --
> 2.42.0

>