[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <2a9fee95-425d-4b27-8253-10540669d94c@lucifer.local>
Date: Fri, 24 Jan 2025 19:09:14 +0000
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: "Liam R. Howlett" <Liam.Howlett@...cle.com>,
Andrew Morton <akpm@...ux-foundation.org>,
maple-tree@...ts.infradead.org, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, Jann Horn <jannh@...gle.com>,
Peter Zijlstra <peterz@...radead.org>, Michal Hocko <mhocko@...e.com>,
Peng Zhang <zhangpeng.00@...edance.com>,
Oleg Nesterov <oleg@...hat.com>
Subject: Re: [PATCH] kernel/fork: Be more careful about dup_mmap() failures
On Fri, Jan 24, 2025 at 01:19:45PM -0500, Liam R. Howlett wrote:
> Adding Oleg, who I forgot on the start of this patch..
>
>
> * Lorenzo Stoakes <lorenzo.stoakes@...cle.com> [250124 06:15]:
> > On Thu, Jan 23, 2025 at 03:58:49PM -0500, Liam R. Howlett wrote:
> > > From: "Liam R. Howlett" <Liam.Howlett@...cle.com>
> > >
> > > In the even that there is a failure during dup_mmap(), the maple tree
> > > can be left in an unsafe state for other iterators besides the exit
> > > path.
> > >
> > > The unsafe state is created after the tree is cloned, but before the
> > > vmas are replaced; if a vma allocation fails (for instance), then the
> > > tree will have a marker (XA_ZERO_ENTRY) to denote where to stop
> > > destroying vmas on the exit path. This marker replaces a vma in the
> > > tree and may be treated as a pointer to a vma in iterators besides the
> > > special case exit_mmap() iterator.
> >
> > Thanks for addresing this.
>
> Thanks for looking at the patch.
>
> >
> > In the original issue, the problem arose in vm_area_dup() which I tracked
> > down to vma_iter_load() -> mas_walk() (see [0]).
> >
> > I'm not sure if your change would actually have addressed this? Would the
> > MMF_UNSTABLE flag prevent this code path from being taken (if so where?),
> > if not don't we need to add a check for this somewhere?
> >
> > [0]:https://lore.kernel.org/all/aa2c1930-becc-4bc5-adfb-96e88290acc7@lucifer.local/
>
> Remote vma accesses use check_stable_address_space() to avoid races
> with the oom killer, which checks MMF_UNSTABLE.
>
> Yeah, uprobe needs to do this and doesn't today.
>
> Oleg suggested this be done in build_map_info(), but the MMF_UNSTABLE
> bit is set under the mmap read lock and that function uses the read lock
> on the rmap, takes an mm references and records the address - then
> revisits the mm later to get the vma in register_for_each_vma().
>
> I think we need to check in register_for_each_vma(), prior to looking up
> the vma. This way we hold the write lock on the mm and will ensure the
> vma is stable from both the fork setup and a potential OOM events. The
> concern is that we have an invalid vma or the prev/next is invalid and
> that will only be a concern when operating on the vma itself, so this
> would be the safest place imo.
>
> The problem that has been created is that the next vma may be invalid.
> This arises when the rmap is used to find a vma then changes that vma.
> In this scenario, the mmap lock of the vma needs to be held and that mm
> needs to be checked as stable/not oom'ed. So we can limit the checking
> of the code to the scenario where another mm's vma is found and
> modified - which may cause a merge and look at the next vma.
>
> I haven't found anywhere else that looks up another mm_struct's vma and
> makes a change to the vma. Considering what that would mean, it makes
> sense that this is rare.
>
> So, I'll just do this:
>
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -1260,6 +1260,9 @@ register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new)
> * returns NULL in find_active_uprobe_rcu().
> */
> mmap_write_lock(mm);
> + if (check_stable_address_space(mm)
> + goto unlock;
> +
>
>
Yeah I think on reflection this is sensible, at least as a proximate
solution. It certainly doesn't harm to apply the MMF_UNSTABLE flag on top
of this and we can always expand where we check/what we do with this.
>
>
> >
> > >
> > > All the locks are dropped before the exit_mmap() call, but the
> > > incomplete mm_struct can be reached through (at least) the rmap finding
> > > the vmas which have a pointer back to the mm_struct.
> > >
> > > Up to this point, there have been no issues with being able to find an
> > > mm_sturct that was only partially initialised. Syzbot was able to make
> > > the incomplete mm_struct fail with recent forking changes, so it has
> > > been proven unsafe to use the mm_sturct that hasn't been initialised, as
> > > referenced in the link below.
> > >
> > > Although 8ac662f5da19f ("fork: avoid inappropriate uprobe access to
> > > invalid mm") fixed the uprobe access, it does not completely remove the
> > > race.
> > >
> > > This patch sets the MMF_OOM_SKIP to avoid the iteration of the vmas on
> > > the oom side (even though this is extremely unlikely to be selected as
> > > an oom victim in the race window), and sets MMF_UNSTABLE to avoid other
> > > potential users from using a partially initialised mm_struct.
> >
> > Good to have it set also I think.
> >
> > We were concerned that _somehow_ MMF_UNSTABLE might cause some issue in
> > OOM, which was why when I first thought of this we decided maybe not just
> > to be safe, but I am confident this will not be an issue as indicating that
> > something is about to be torn down and is unstable when it is in fact
> > unstable and about to be torn down is correct, plus I checked all the paths
> > where this impacted and found no issues.
> >
> > And in any case, setting this flag avoids the problem...!
> >
> > >
> > > Link: https://lore.kernel.org/all/6756d273.050a0220.2477f.003d.GAE@google.com/
> > > Fixes: d240629148377 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()")
> > > Cc: Jann Horn <jannh@...gle.com>
> > > Cc: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
> > > Cc: Peter Zijlstra <peterz@...radead.org>
> > > Cc: Michal Hocko <mhocko@...e.com>
> > > Cc: Peng Zhang <zhangpeng.00@...edance.com>
> > > Signed-off-by: Liam R. Howlett <Liam.Howlett@...cle.com>
> > > ---
> > > kernel/fork.c | 17 ++++++++++++++---
> > > 1 file changed, 14 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > index ded49f18cd95c..20b2120f019ca 100644
> > > --- a/kernel/fork.c
> > > +++ b/kernel/fork.c
> > > @@ -760,7 +760,8 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > > mt_set_in_rcu(vmi.mas.tree);
> > > ksm_fork(mm, oldmm);
> > > khugepaged_fork(mm, oldmm);
> > > - } else if (mpnt) {
> > > + } else {
> > > +
> > > /*
> > > * The entire maple tree has already been duplicated. If the
> > > * mmap duplication fails, mark the failure point with
> > > @@ -768,8 +769,18 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > > * stop releasing VMAs that have not been duplicated after this
> > > * point.
> > > */
> > > - mas_set_range(&vmi.mas, mpnt->vm_start, mpnt->vm_end - 1);
> > > - mas_store(&vmi.mas, XA_ZERO_ENTRY);
> > > + if (mpnt) {
> > > + mas_set_range(&vmi.mas, mpnt->vm_start, mpnt->vm_end - 1);
> > > + mas_store(&vmi.mas, XA_ZERO_ENTRY);
> > > + /* Avoid OOM iterating a broken tree */
> > > + set_bit(MMF_OOM_SKIP, &mm->flags);
> > > + }
> > > + /*
> > > + * The mm_struct is going to exit, but the locks will be dropped
> > > + * first. Set the mm_struct as unstable is advisable as it is
> > > + * not fully initialised.
> > > + */
> > > + set_bit(MMF_UNSTABLE, &mm->flags);
> >
> > Do we still need to do this if !mpnt?
>
> No, we probably don't need this now. But it still makes sense to do
> this considering what we are saying. There is a failed mm, let's mark
> it as unstable because it's going away in a moment - it's not stable,
> it's not even a complete mm.
>
> >
> > Also 'mpnt' is in the running for the worst variable name in mm...
> >
>
> mpnt is the oldMm vma Pointer, Now isn'T it? I think that's self
> documenting.
:)))
>
> Thanks,
> Liam
Powered by blists - more mailing lists