[lvc-project] [PATCH 6.1.y 1/3] mm/hugetlb: unshare page tables during VMA split, not before
Jann Horn
jannh at google.com
Mon Jul 7 16:19:23 MSK 2025
+cc maintainers of the relevant upstream code so they're aware of how I
broke stable with my backports of hugetlb fixes. Looks like I wasn't
being careful enough...
On Mon, Jul 7, 2025 at 12:39 PM Fedor Pchelkin <pchelkin at ispras.ru> wrote:
> On Fri, 20 Jun 2025 23:33:31 +0200, Jann Horn wrote:
> > Currently, __split_vma() triggers hugetlb page table unsharing through
> > vm_ops->may_split(). This happens before the VMA lock and rmap locks are
> > taken - which is too early, it allows racing VMA-locked page faults in our
> > process and racing rmap walks from other processes to cause page tables to
> > be shared again before we actually perform the split.
> >
> > Fix it by explicitly calling into the hugetlb unshare logic from
> > __split_vma() in the same place where THP splitting also happens. At that
> > point, both the VMA and the rmap(s) are write-locked.
> >
> > An annoying detail is that we can now call into the helper
> > hugetlb_unshare_pmds() from two different locking contexts:
> >
> > 1. from hugetlb_split(), holding:
> > - mmap lock (exclusively)
> > - VMA lock
> > - file rmap lock (exclusively)
> > 2. hugetlb_unshare_all_pmds(), which I think is designed to be able to
> > call us with only the mmap lock held (in shared mode), but currently
> > only runs while holding mmap lock (exclusively) and VMA lock
> >
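(Side note, since those two contexts drive the locking in the helper: the way
hugetlb_unshare_pmds() is meant to handle both is the new take_locks parameter.
A condensed sketch of the shape of that change, with the actual PMD-unsharing
walk elided, looks roughly like this - not the verbatim hunk:)

	static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
					 unsigned long start, unsigned long end,
					 bool take_locks)
	{
		...
		if (take_locks) {
			/* context 2: caller holds only the mmap lock */
			hugetlb_vma_lock_write(vma);
			i_mmap_lock_write(vma->vm_file->f_mapping);
		} else {
			/* context 1: hugetlb_split() already holds VMA + rmap locks */
			i_mmap_assert_write_locked(vma->vm_file->f_mapping);
		}

		/* ... walk [start, end) and huge_pmd_unshare() as before ... */

		if (take_locks) {
			i_mmap_unlock_write(vma->vm_file->f_mapping);
			hugetlb_vma_unlock_write(vma);
		}
	}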
> > Backporting note:
> > This commit fixes a racy protection that was introduced in commit
> > b30c14cd6102 ("hugetlb: unshare some PMDs when splitting VMAs"); that
> > commit claimed to fix an issue introduced in 5.13, but it should actually
> > also go all the way back.
> >
> > [jannh at google.com: v2]
> > Link: https://lkml.kernel.org/r/20250528-hugetlb-fixes-splitrace-v2-1-1329349bad1a@google.com
> > Link: https://lkml.kernel.org/r/20250528-hugetlb-fixes-splitrace-v2-0-1329349bad1a@google.com
> > Link: https://lkml.kernel.org/r/20250527-hugetlb-fixes-splitrace-v1-1-f4136f5ec58a@google.com
> > Fixes: 39dde65c9940 ("[PATCH] shared page table for hugetlb page")
> > Signed-off-by: Jann Horn <jannh at google.com>
> > Cc: Liam Howlett <liam.howlett at oracle.com>
> > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes at oracle.com>
> > Reviewed-by: Oscar Salvador <osalvador at suse.de>
> > Cc: Lorenzo Stoakes <lorenzo.stoakes at oracle.com>
> > Cc: Vlastimil Babka <vbabka at suse.cz>
> > Cc: <stable at vger.kernel.org> [b30c14cd6102: hugetlb: unshare some PMDs when splitting VMAs]
> > Cc: <stable at vger.kernel.org>
> > Signed-off-by: Andrew Morton <akpm at linux-foundation.org>
> > [stable backport: code got moved around, VMA splitting is in
> > __vma_adjust]
> > Signed-off-by: Jann Horn <jannh at google.com>
> > ---
> > include/linux/hugetlb.h | 3 +++
> > mm/hugetlb.c | 60 ++++++++++++++++++++++++++++++-----------
> > mm/mmap.c | 8 ++++++
> > 3 files changed, 55 insertions(+), 16 deletions(-)
> >
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index cc555072940f..26f2947c399d 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -239,6 +239,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> >
> > bool is_hugetlb_entry_migration(pte_t pte);
> > void hugetlb_unshare_all_pmds(struct vm_area_struct *vma);
> > +void hugetlb_split(struct vm_area_struct *vma, unsigned long addr);
> >
> > #else /* !CONFIG_HUGETLB_PAGE */
> >
> > @@ -472,6 +473,8 @@ static inline vm_fault_t hugetlb_fault(struct mm_struct *mm,
> >
> > static inline void hugetlb_unshare_all_pmds(struct vm_area_struct *vma) { }
> >
> > +static inline void hugetlb_split(struct vm_area_struct *vma, unsigned long addr) {}
> > +
> > #endif /* !CONFIG_HUGETLB_PAGE */
> > /*
> > * hugepages at page global directory. If arch support
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 14b9494c58ed..fc5d3d665266 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -95,7 +95,7 @@ static void hugetlb_vma_lock_free(struct vm_area_struct *vma);
> > static void hugetlb_vma_lock_alloc(struct vm_area_struct *vma);
> > static void __hugetlb_vma_unlock_write_free(struct vm_area_struct *vma);
> > static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
> > - unsigned long start, unsigned long end);
> > + unsigned long start, unsigned long end, bool take_locks);
> > static struct resv_map *vma_resv_map(struct vm_area_struct *vma);
> >
> > static inline bool subpool_is_free(struct hugepage_subpool *spool)
> > @@ -4900,26 +4900,40 @@ static int hugetlb_vm_op_split(struct vm_area_struct *vma, unsigned long addr)
> > {
> > if (addr & ~(huge_page_mask(hstate_vma(vma))))
> > return -EINVAL;
> > + return 0;
> > +}
> >
> > +void hugetlb_split(struct vm_area_struct *vma, unsigned long addr)
> > +{
> > /*
> > * PMD sharing is only possible for PUD_SIZE-aligned address ranges
> > * in HugeTLB VMAs. If we will lose PUD_SIZE alignment due to this
> > * split, unshare PMDs in the PUD_SIZE interval surrounding addr now.
> > + * This function is called in the middle of a VMA split operation, with
> > + * MM, VMA and rmap all write-locked to prevent concurrent page table
> > + * walks (except hardware and gup_fast()).
> > */
> > + mmap_assert_write_locked(vma->vm_mm);
> > + i_mmap_assert_write_locked(vma->vm_file->f_mapping);
>
>
> The above i_mmap lock assertion is firing on stable kernels from 5.10 to 6.1
> inclusive.
>
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 11489 at include/linux/fs.h:503 i_mmap_assert_write_locked include/linux/fs.h:503 [inline]
> WARNING: CPU: 0 PID: 11489 at include/linux/fs.h:503 hugetlb_split+0x267/0x300 mm/hugetlb.c:4917
> Modules linked in:
> CPU: 0 PID: 11489 Comm: syz-executor.4 Not tainted 6.1.142-syzkaller-00296-gfd0df5221577 #0
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> RIP: 0010:i_mmap_assert_write_locked include/linux/fs.h:503 [inline]
> RIP: 0010:hugetlb_split+0x267/0x300 mm/hugetlb.c:4917
> Call Trace:
> <TASK>
> __vma_adjust+0xd73/0x1c10 mm/mmap.c:736
> vma_adjust include/linux/mm.h:2745 [inline]
> __split_vma+0x459/0x540 mm/mmap.c:2385
> do_mas_align_munmap+0x5f2/0xf10 mm/mmap.c:2497
> do_mas_munmap+0x26c/0x2c0 mm/mmap.c:2646
> __mmap_region mm/mmap.c:2694 [inline]
> mmap_region+0x19f/0x1770 mm/mmap.c:2912
> do_mmap+0x84b/0xf20 mm/mmap.c:1432
> vm_mmap_pgoff+0x1af/0x280 mm/util.c:520
> ksys_mmap_pgoff+0x41f/0x5a0 mm/mmap.c:1478
> do_syscall_x64 arch/x86/entry/common.c:51 [inline]
> do_syscall_64+0x35/0x80 arch/x86/entry/common.c:81
> entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> RIP: 0033:0x46a269
> </TASK>
>
> Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
>
>
> The main reason is that those branches lack the following commit:
>
> commit ccf1d78d8b86e28502fa1b575a459a402177def4
> Author: Suren Baghdasaryan <surenb at google.com>
> Date: Mon Feb 27 09:36:13 2023 -0800
>
> mm/mmap: move vma_prepare before vma_adjust_trans_huge
>
> vma_prepare() acquires all locks required before VMA modifications. Move
> vma_prepare() before vma_adjust_trans_huge() so that VMA is locked before
> any modification.
>
> Link: https://lkml.kernel.org/r/20230227173632.3292573-15-surenb@google.com
> Signed-off-by: Suren Baghdasaryan <surenb at google.com>
> Signed-off-by: Andrew Morton <akpm at linux-foundation.org>
>
> so on those branches the needed lock is only acquired after
> vma_adjust_trans_huge() and the newly added hugetlb_split() have already run.
Oh, yuck. Indeed. Thanks for finding this.
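To spell out the broken ordering (condensed from the 6.1.y code after my
backport, not the literal source), __vma_adjust() currently does roughly:

	vma_adjust_trans_huge(orig_vma, start, end, adjust_next);
	if (is_vm_hugetlb_page(orig_vma)) {
		hugetlb_split(orig_vma, start);	/* asserts the i_mmap rwsem is held... */
		hugetlb_split(orig_vma, end);
	}

	if (file) {
		mapping = file->f_mapping;
		...
		i_mmap_lock_write(mapping);	/* ...but it is only taken here */
	}

so hugetlb_split() runs before the rmap lock it asserts on has been taken.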
> Please have a look at the straightforward write-up below. It does
> essentially what ccf1d78d8b86 ("mm/mmap: move vma_prepare before
> vma_adjust_trans_huge") did, but in the context of an old stable branch.
>
> If it looks okay, I'll be glad to prepare it as a formal patch and send it
> out for 5.10-5.15, too.
Thanks, that looks good to me.
> against 6.1.y
> -------------
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 0f303dc8425a..941880ed62d7 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -543,8 +543,6 @@ inline int vma_expand(struct ma_state *mas, struct vm_area_struct *vma,
> if (mas_preallocate(mas, vma, GFP_KERNEL))
> goto nomem;
>
> - vma_adjust_trans_huge(vma, start, end, 0);
> -
> if (file) {
> mapping = file->f_mapping;
> root = &mapping->i_mmap;
> @@ -562,6 +560,8 @@ inline int vma_expand(struct ma_state *mas, struct vm_area_struct *vma,
> vma_interval_tree_remove(vma, root);
> }
>
> + vma_adjust_trans_huge(vma, start, end, 0);
> +
> vma->vm_start = start;
> vma->vm_end = end;
> vma->vm_pgoff = pgoff;
> @@ -727,15 +727,6 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
> return -ENOMEM;
> }
>
> - /*
> - * Get rid of huge pages and shared page tables straddling the split
> - * boundary.
> - */
> - vma_adjust_trans_huge(orig_vma, start, end, adjust_next);
> - if (is_vm_hugetlb_page(orig_vma)) {
> - hugetlb_split(orig_vma, start);
> - hugetlb_split(orig_vma, end);
> - }
> if (file) {
> mapping = file->f_mapping;
> root = &mapping->i_mmap;
> @@ -775,6 +766,16 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
> vma_interval_tree_remove(next, root);
> }
>
> + /*
> + * Get rid of huge pages and shared page tables straddling the split
> + * boundary.
> + */
> + vma_adjust_trans_huge(orig_vma, start, end, adjust_next);
> + if (is_vm_hugetlb_page(orig_vma)) {
> + hugetlb_split(orig_vma, start);
> + hugetlb_split(orig_vma, end);
> + }
> +
> if (start != vma->vm_start) {
> if ((vma->vm_start < start) &&
> (!insert || (insert->vm_end != start))) {