diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu index 49216c96236b..41bed89c8ee5 100644 --- a/Documentation/ABI/testing/sysfs-devices-system-cpu +++ b/Documentation/ABI/testing/sysfs-devices-system-cpu @@ -229,6 +229,7 @@ What: /sys/devices/system/cpu/vulnerabilities /sys/devices/system/cpu/vulnerabilities/meltdown /sys/devices/system/cpu/vulnerabilities/spectre_v1 /sys/devices/system/cpu/vulnerabilities/spectre_v2 + /sys/devices/system/cpu/vulnerabilities/spec_store_bypass Date: January 2018 Contact: Linux kernel mailing list Description: Information about CPU vulnerabilities diff --git a/Documentation/cachetlb.txt b/Documentation/cachetlb.txt index d79b008e4a32..3f9f808b5119 100644 --- a/Documentation/cachetlb.txt +++ b/Documentation/cachetlb.txt @@ -317,10 +317,10 @@ maps this page at its virtual address. about doing this. The idea is, first at flush_dcache_page() time, if - page->mapping->i_mmap is an empty tree and ->i_mmap_nonlinear - an empty list, just mark the architecture private page flag bit. - Later, in update_mmu_cache(), a check is made of this flag bit, - and if set the flush is done and the flag bit is cleared. + page->mapping->i_mmap is an empty tree, just mark the architecture + private page flag bit. Later, in update_mmu_cache(), a check is + made of this flag bit, and if set the flush is done and the flag + bit is cleared. IMPORTANT NOTE: It is often important, if you defer the flush, that the actual flush occurs on the same CPU diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 9d10847d1998..28687bbe2ee0 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -2172,6 +2172,9 @@ bytes respectively. Such letter suffixes can also be entirely omitted. allow data leaks with this option, which is equivalent to spectre_v2=off. + nospec_store_bypass_disable + [HW] Disable all mitigations for the Speculative Store Bypass vulnerability + noxsave [BUGS=X86] Disables x86 extended register state save and restore using xsave. The kernel will fallback to enabling legacy floating-point and sse state. @@ -3194,6 +3197,48 @@ bytes respectively. Such letter suffixes can also be entirely omitted. Not specifying this option is equivalent to spectre_v2=auto. + spec_store_bypass_disable= + [HW] Control Speculative Store Bypass (SSB) Disable mitigation + (Speculative Store Bypass vulnerability) + + Certain CPUs are vulnerable to an exploit against + a common industry-wide performance optimization known + as "Speculative Store Bypass" in which recent stores + to the same memory location may not be observed by + later loads during speculative execution. The idea + is that such stores are unlikely and that they can + be detected prior to instruction retirement at the + end of a particular speculative execution window. + + In vulnerable processors, the speculatively forwarded + store can be used in a cache side channel attack, for + example to read memory to which the attacker does not + directly have access (e.g. inside sandboxed code). + + This parameter controls whether the Speculative Store + Bypass optimization is used. + + on - Unconditionally disable Speculative Store Bypass + off - Unconditionally enable Speculative Store Bypass + auto - Kernel detects whether the CPU model contains an + implementation of Speculative Store Bypass and + picks the most appropriate mitigation. If the + CPU is not vulnerable, "off" is selected.
If the + CPU is vulnerable, the default mitigation is + architecture and Kconfig dependent. See below. + prctl - Control Speculative Store Bypass per thread + via prctl. Speculative Store Bypass is enabled + for a process by default. The state of the control + is inherited on fork. + seccomp - Same as "prctl" above, but all seccomp threads + will disable SSB unless they explicitly opt out. + + Not specifying this option is equivalent to + spec_store_bypass_disable=auto. + + Default mitigations: + X86: If CONFIG_SECCOMP=y "seccomp", otherwise "prctl" + spia_io_base= [HW,MTD] spia_fio_base= spia_pedr= diff --git a/Documentation/spec_ctrl.rst b/Documentation/spec_ctrl.rst new file mode 100644 index 000000000000..32f3d55c54b7 --- /dev/null +++ b/Documentation/spec_ctrl.rst @@ -0,0 +1,94 @@ +=================== +Speculation Control +=================== + +Quite a few CPUs have speculation-related misfeatures which are in +fact vulnerabilities causing data leaks in various forms even across +privilege domains. + +The kernel provides mitigation for such vulnerabilities in various +forms. Some of these mitigations are compile-time configurable and some +can be supplied on the kernel command line. + +There is also a class of mitigations which are very expensive, but they can +be restricted to a certain set of processes or tasks in controlled +environments. The mechanism to control these mitigations is via +:manpage:`prctl(2)`. + +There are two prctl options which are related to this: + + * PR_GET_SPECULATION_CTRL + + * PR_SET_SPECULATION_CTRL + +PR_GET_SPECULATION_CTRL +----------------------- + +PR_GET_SPECULATION_CTRL returns the state of the speculation misfeature +which is selected with arg2 of prctl(2). The return value uses bits 0-3 with +the following meaning: + +==== ===================== =================================================== +Bit Define Description +==== ===================== =================================================== +0 PR_SPEC_PRCTL Mitigation can be controlled per task by + PR_SET_SPECULATION_CTRL. +1 PR_SPEC_ENABLE The speculation feature is enabled, mitigation is + disabled. +2 PR_SPEC_DISABLE The speculation feature is disabled, mitigation is + enabled. +3 PR_SPEC_FORCE_DISABLE Same as PR_SPEC_DISABLE, but cannot be undone. A + subsequent prctl(..., PR_SPEC_ENABLE) will fail. +==== ===================== =================================================== +
If all bits are 0 the CPU is not affected by the speculation misfeature. + +If PR_SPEC_PRCTL is set, then the per-task control of the mitigation is +available. If not set, prctl(PR_SET_SPECULATION_CTRL) for the speculation +misfeature will fail. + +PR_SET_SPECULATION_CTRL +----------------------- +PR_SET_SPECULATION_CTRL allows controlling the speculation misfeature, which +is selected by arg2 of :manpage:`prctl(2)` per task. arg3 is used to hand +in the control value, i.e. either PR_SPEC_ENABLE or PR_SPEC_DISABLE or +PR_SPEC_FORCE_DISABLE. + +Common error codes ------------------ ======= ================================================================= Value Meaning ======= ================================================================= EINVAL The prctl is not implemented by the architecture or unused + prctl(2) arguments are not 0. + +ENODEV arg2 selects an unsupported speculation misfeature.
+======= ================================================================= + +PR_SET_SPECULATION_CTRL error codes +----------------------------------- +======= ================================================================= +Value Meaning +======= ================================================================= +0 Success + +ERANGE arg3 is incorrect, i.e. it's neither PR_SPEC_ENABLE nor + PR_SPEC_DISABLE nor PR_SPEC_FORCE_DISABLE. + +ENXIO Control of the selected speculation misfeature is not possible. + See PR_GET_SPECULATION_CTRL. + +EPERM Speculation was disabled with PR_SPEC_FORCE_DISABLE and the caller + tried to enable it again. +======= ================================================================= + +Speculation misfeature controls +------------------------------- +- PR_SPEC_STORE_BYPASS: Speculative Store Bypass + + Invocations: + * prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS, 0, 0, 0); + * prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS, PR_SPEC_ENABLE, 0, 0); + * prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS, PR_SPEC_DISABLE, 0, 0); + * prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS, PR_SPEC_FORCE_DISABLE, 0, 0); diff --git a/Documentation/vm/remap_file_pages.txt b/Documentation/vm/remap_file_pages.txt index 560e4363a55d..f609142f406a 100644 --- a/Documentation/vm/remap_file_pages.txt +++ b/Documentation/vm/remap_file_pages.txt @@ -18,10 +18,9 @@ on 32-bit systems to map files bigger than can linearly fit into 32-bit virtual address space. This use-case is not critical anymore since 64-bit systems are widely available. -The plan is to deprecate the syscall and replace it with an emulation. -The emulation will create new VMAs instead of nonlinear mappings. It's -going to work slower for rare users of remap_file_pages() but ABI is -preserved. +The syscall is now deprecated and has been replaced with an emulation. The +emulation creates new VMAs instead of nonlinear mappings. It works more +slowly for the rare users of remap_file_pages(), but the ABI is preserved. One side effect of emulation (apart from performance) is that user can hit vm.max_map_count limit more easily due to additional VMAs. See comment for diff --git a/Makefile b/Makefile index 82b947eb9bcf..d548dd129f44 100644 --- a/Makefile +++ b/Makefile @@ -1,6 +1,6 @@ VERSION = 3 PATCHLEVEL = 16 -SUBLEVEL = 58 +SUBLEVEL = 59 EXTRAVERSION = NAME = Museum of Fishiegoodies diff --git a/arch/alpha/include/asm/pgtable.h b/arch/alpha/include/asm/pgtable.h index d8f9b7e89234..fce22cf88ee9 100644 --- a/arch/alpha/include/asm/pgtable.h +++ b/arch/alpha/include/asm/pgtable.h @@ -73,7 +73,6 @@ struct vm_area_struct; /* .. and these are ours ... */ #define _PAGE_DIRTY 0x20000 #define _PAGE_ACCESSED 0x40000 -#define _PAGE_FILE 0x80000 /* set:pagecache, unset:swap */ /* * NOTE!
The "accessed" bit isn't necessarily exact: it can be kept exactly @@ -268,7 +267,6 @@ extern inline void pgd_clear(pgd_t * pgdp) { pgd_val(*pgdp) = 0; } extern inline int pte_write(pte_t pte) { return !(pte_val(pte) & _PAGE_FOW); } extern inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY; } extern inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED; } -extern inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; } extern inline int pte_special(pte_t pte) { return 0; } extern inline pte_t pte_wrprotect(pte_t pte) { pte_val(pte) |= _PAGE_FOW; return pte; } @@ -345,11 +343,6 @@ extern inline pte_t mk_swap_pte(unsigned long type, unsigned long offset) #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val }) -#define pte_to_pgoff(pte) (pte_val(pte) >> 32) -#define pgoff_to_pte(off) ((pte_t) { ((off) << 32) | _PAGE_FILE }) - -#define PTE_FILE_MAX_BITS 32 - #ifndef CONFIG_DISCONTIGMEM #define kern_addr_valid(addr) (1) #endif diff --git a/arch/arc/include/asm/pgtable.h b/arch/arc/include/asm/pgtable.h index 7670f33b9ce2..3c5720a198cc 100644 --- a/arch/arc/include/asm/pgtable.h +++ b/arch/arc/include/asm/pgtable.h @@ -61,7 +61,6 @@ #define _PAGE_WRITE (1<<4) /* Page has user write perm (H) */ #define _PAGE_READ (1<<5) /* Page has user read perm (H) */ #define _PAGE_MODIFIED (1<<6) /* Page modified (dirty) (S) */ -#define _PAGE_FILE (1<<7) /* page cache/ swap (S) */ #define _PAGE_GLOBAL (1<<8) /* Page is global (H) */ #define _PAGE_PRESENT (1<<10) /* TLB entry is valid (H) */ @@ -73,7 +72,6 @@ #define _PAGE_READ (1<<3) /* Page has user read perm (H) */ #define _PAGE_ACCESSED (1<<4) /* Page is accessed (S) */ #define _PAGE_MODIFIED (1<<5) /* Page modified (dirty) (S) */ -#define _PAGE_FILE (1<<6) /* page cache/ swap (S) */ #define _PAGE_GLOBAL (1<<8) /* Page is global (H) */ #define _PAGE_PRESENT (1<<9) /* TLB entry is valid (H) */ #define _PAGE_SHARED_CODE (1<<11) /* Shared Code page with cmn vaddr @@ -269,15 +267,6 @@ static inline void pmd_set(pmd_t *pmdp, pte_t *ptep) pte; \ }) -/* TBD: Non linear mapping stuff */ -static inline int pte_file(pte_t pte) -{ - return pte_val(pte) & _PAGE_FILE; -} - -#define PTE_FILE_MAX_BITS 30 -#define pgoff_to_pte(x) __pte(x) -#define pte_to_pgoff(x) (pte_val(x) >> 2) #define pte_pfn(pte) (pte_val(pte) >> PAGE_SHIFT) #define pfn_pte(pfn, prot) (__pte(((pfn) << PAGE_SHIFT) | pgprot_val(prot))) #define __pte_index(addr) (((addr) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) @@ -365,7 +354,7 @@ void update_mmu_cache(struct vm_area_struct *vma, unsigned long address, /* Encode swap {type,off} tuple into PTE * We reserve 13 bits for 5-bit @type, keeping bits 12-5 zero, ensuring that - * both PAGE_FILE and PAGE_PRESENT are zero in a PTE holding swap "identifier" + * PAGE_PRESENT is zero in a PTE holding swap "identifier" */ #define __swp_entry(type, off) ((swp_entry_t) { \ ((type) & 0x1f) | ((off) << 13) }) diff --git a/arch/arm/include/asm/pgtable-2level.h b/arch/arm/include/asm/pgtable-2level.h index bed6c8fa54b5..e69cbccbd44d 100644 --- a/arch/arm/include/asm/pgtable-2level.h +++ b/arch/arm/include/asm/pgtable-2level.h @@ -118,7 +118,6 @@ #define L_PTE_VALID (_AT(pteval_t, 1) << 0) /* Valid */ #define L_PTE_PRESENT (_AT(pteval_t, 1) << 0) #define L_PTE_YOUNG (_AT(pteval_t, 1) << 1) -#define L_PTE_FILE (_AT(pteval_t, 1) << 2) /* only when !PRESENT */ #define L_PTE_DIRTY (_AT(pteval_t, 1) << 6) #define L_PTE_RDONLY (_AT(pteval_t, 1) << 7) #define L_PTE_USER 
(_AT(pteval_t, 1) << 8) diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h index 40d60a679fc8..343d7b08d7a7 100644 --- a/arch/arm/include/asm/pgtable-3level.h +++ b/arch/arm/include/asm/pgtable-3level.h @@ -77,7 +77,6 @@ */ #define L_PTE_VALID (_AT(pteval_t, 1) << 0) /* Valid */ #define L_PTE_PRESENT (_AT(pteval_t, 3) << 0) /* Present */ -#define L_PTE_FILE (_AT(pteval_t, 1) << 2) /* only when !PRESENT */ #define L_PTE_USER (_AT(pteval_t, 1) << 6) /* AP[1] */ #define L_PTE_SHARED (_AT(pteval_t, 3) << 8) /* SH[1:0], inner shareable */ #define L_PTE_YOUNG (_AT(pteval_t, 1) << 10) /* AF */ diff --git a/arch/arm/include/asm/pgtable-nommu.h b/arch/arm/include/asm/pgtable-nommu.h index 0642228ff785..c35e53ee6663 100644 --- a/arch/arm/include/asm/pgtable-nommu.h +++ b/arch/arm/include/asm/pgtable-nommu.h @@ -54,8 +54,6 @@ typedef pte_t *pte_addr_t; -static inline int pte_file(pte_t pte) { return 0; } - /* * ZERO_PAGE is a global shared page that is always zero: used * for zero-mapped memory areas etc.. diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h index 38bbf1e2a9b6..7f240ab05b06 100644 --- a/arch/arm/include/asm/pgtable.h +++ b/arch/arm/include/asm/pgtable.h @@ -279,12 +279,12 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot) * * 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 * 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 - * <--------------- offset ----------------------> < type -> 0 0 0 + * <--------------- offset ------------------------> < type -> 0 0 * - * This gives us up to 31 swap files and 64GB per swap file. Note that + * This gives us up to 31 swap files and 128GB per swap file. Note that * the offset field is always non-zero. */ -#define __SWP_TYPE_SHIFT 3 +#define __SWP_TYPE_SHIFT 2 #define __SWP_TYPE_BITS 5 #define __SWP_TYPE_MASK ((1 << __SWP_TYPE_BITS) - 1) #define __SWP_OFFSET_SHIFT (__SWP_TYPE_BITS + __SWP_TYPE_SHIFT) @@ -303,20 +303,6 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot) */ #define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > __SWP_TYPE_BITS) -/* - * Encode and decode a file entry. File entries are stored in the Linux - * page tables as follows: - * - * 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 - * 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 - * <----------------------- offset ------------------------> 1 0 0 - */ -#define pte_file(pte) (pte_val(pte) & L_PTE_FILE) -#define pte_to_pgoff(x) (pte_val(x) >> 3) -#define pgoff_to_pte(x) __pte(((x) << 3) | L_PTE_FILE) - -#define PTE_FILE_MAX_BITS 29 - /* Needs to be defined here and not in linux/mm.h, as it is arch dependent */ /* FIXME: this is not correct */ #define kern_addr_valid(addr) (1) diff --git a/arch/arm/mm/proc-macros.S b/arch/arm/mm/proc-macros.S index ee1d80593958..204985e79308 100644 --- a/arch/arm/mm/proc-macros.S +++ b/arch/arm/mm/proc-macros.S @@ -98,7 +98,7 @@ #endif #if !defined (CONFIG_ARM_LPAE) && \ (L_PTE_XN+L_PTE_USER+L_PTE_RDONLY+L_PTE_DIRTY+L_PTE_YOUNG+\ - L_PTE_FILE+L_PTE_PRESENT) > L_PTE_SHARED + L_PTE_PRESENT) > L_PTE_SHARED #error Invalid Linux PTE bit settings #endif #endif /* CONFIG_MMU */ diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index 7b2b696a1b8c..e9134f065d0b 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -25,7 +25,6 @@ * Software defined PTE bits definition. 
*/ #define PTE_VALID (_AT(pteval_t, 1) << 0) -#define PTE_FILE (_AT(pteval_t, 1) << 2) /* only when !pte_present() */ #define PTE_DIRTY (_AT(pteval_t, 1) << 55) #define PTE_SPECIAL (_AT(pteval_t, 1) << 56) #define PTE_WRITE (_AT(pteval_t, 1) << 57) @@ -402,13 +401,12 @@ extern pgd_t idmap_pg_dir[PTRS_PER_PGD]; /* * Encode and decode a swap entry: * bits 0-1: present (must be zero) - * bit 2: PTE_FILE - * bits 3-8: swap type - * bits 9-57: swap offset + * bits 2-7: swap type + * bits 8-57: swap offset */ -#define __SWP_TYPE_SHIFT 3 +#define __SWP_TYPE_SHIFT 2 #define __SWP_TYPE_BITS 6 -#define __SWP_OFFSET_BITS 49 +#define __SWP_OFFSET_BITS 50 #define __SWP_TYPE_MASK ((1 << __SWP_TYPE_BITS) - 1) #define __SWP_OFFSET_SHIFT (__SWP_TYPE_BITS + __SWP_TYPE_SHIFT) #define __SWP_OFFSET_MASK ((1UL << __SWP_OFFSET_BITS) - 1) @@ -426,18 +424,6 @@ extern pgd_t idmap_pg_dir[PTRS_PER_PGD]; */ #define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > __SWP_TYPE_BITS) -/* - * Encode and decode a file entry: - * bits 0-1: present (must be zero) - * bit 2: PTE_FILE - * bits 3-57: file offset / PAGE_SIZE - */ -#define pte_file(pte) (pte_val(pte) & PTE_FILE) -#define pte_to_pgoff(x) (pte_val(x) >> 3) -#define pgoff_to_pte(x) __pte(((x) << 3) | PTE_FILE) - -#define PTE_FILE_MAX_BITS 55 - extern int kern_addr_valid(unsigned long addr); #include diff --git a/arch/avr32/include/asm/pgtable.h b/arch/avr32/include/asm/pgtable.h index 4beff97e2033..ac7a817e2126 100644 --- a/arch/avr32/include/asm/pgtable.h +++ b/arch/avr32/include/asm/pgtable.h @@ -86,9 +86,6 @@ extern struct page *empty_zero_page; #define _PAGE_BIT_PRESENT 10 #define _PAGE_BIT_ACCESSED 11 /* software: page was accessed */ -/* The following flags are only valid when !PRESENT */ -#define _PAGE_BIT_FILE 0 /* software: pagecache or swap? */ - #define _PAGE_WT (1 << _PAGE_BIT_WT) #define _PAGE_DIRTY (1 << _PAGE_BIT_DIRTY) #define _PAGE_EXECUTE (1 << _PAGE_BIT_EXECUTE) @@ -101,7 +98,6 @@ extern struct page *empty_zero_page; /* Software flags */ #define _PAGE_ACCESSED (1 << _PAGE_BIT_ACCESSED) #define _PAGE_PRESENT (1 << _PAGE_BIT_PRESENT) -#define _PAGE_FILE (1 << _PAGE_BIT_FILE) /* * Page types, i.e. sizes. _PAGE_TYPE_NONE corresponds to what is @@ -210,14 +206,6 @@ static inline int pte_special(pte_t pte) return 0; } -/* - * The following only work if pte_present() is not true. - */ -static inline int pte_file(pte_t pte) -{ - return pte_val(pte) & _PAGE_FILE; -} - /* Mutator functions for PTE bits */ static inline pte_t pte_wrprotect(pte_t pte) { @@ -329,7 +317,6 @@ extern void update_mmu_cache(struct vm_area_struct * vma, * Encode and decode a swap entry * * Constraints: - * _PAGE_FILE at bit 0 * _PAGE_TYPE_* at bits 2-3 (for emulating _PAGE_PROTNONE) * _PAGE_PRESENT at bit 10 * @@ -346,18 +333,6 @@ extern void update_mmu_cache(struct vm_area_struct * vma, #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val }) -/* - * Encode and decode a nonlinear file mapping entry. We have to - * preserve _PAGE_FILE and _PAGE_PRESENT here. _PAGE_TYPE_* isn't - * necessary, since _PAGE_FILE implies !_PAGE_PROTNONE (?) 
- */ -#define PTE_FILE_MAX_BITS 30 -#define pte_to_pgoff(pte) (((pte_val(pte) >> 1) & 0x1ff) \ - | ((pte_val(pte) >> 11) << 9)) -#define pgoff_to_pte(off) ((pte_t) { ((((off) & 0x1ff) << 1) \ - | (((off) >> 9) << 11) \ - | _PAGE_FILE) }) - typedef pte_t *pte_addr_t; #define kern_addr_valid(addr) (1) diff --git a/arch/blackfin/include/asm/pgtable.h b/arch/blackfin/include/asm/pgtable.h index 0b049019eba7..b88a1558b0b9 100644 --- a/arch/blackfin/include/asm/pgtable.h +++ b/arch/blackfin/include/asm/pgtable.h @@ -45,11 +45,6 @@ extern void paging_init(void); #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val }) -static inline int pte_file(pte_t pte) -{ - return 0; -} - #define set_pte(pteptr, pteval) (*(pteptr) = pteval) #define set_pte_at(mm, addr, ptep, pteval) set_pte(ptep, pteval) diff --git a/arch/c6x/include/asm/pgtable.h b/arch/c6x/include/asm/pgtable.h index c0eed5b18860..78d4483ba40c 100644 --- a/arch/c6x/include/asm/pgtable.h +++ b/arch/c6x/include/asm/pgtable.h @@ -50,11 +50,6 @@ extern void paging_init(void); #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val }) -static inline int pte_file(pte_t pte) -{ - return 0; -} - #define set_pte(pteptr, pteval) (*(pteptr) = pteval) #define set_pte_at(mm, addr, ptep, pteval) set_pte(ptep, pteval) diff --git a/arch/cris/include/arch-v10/arch/mmu.h b/arch/cris/include/arch-v10/arch/mmu.h index e829e5a37bbe..47a5dd21749d 100644 --- a/arch/cris/include/arch-v10/arch/mmu.h +++ b/arch/cris/include/arch-v10/arch/mmu.h @@ -58,7 +58,6 @@ typedef struct /* Bits the HW doesn't care about but the kernel uses them in SW */ #define _PAGE_PRESENT (1<<4) /* page present in memory */ -#define _PAGE_FILE (1<<5) /* set: pagecache, unset: swap (when !PRESENT) */ #define _PAGE_ACCESSED (1<<5) /* simulated in software using valid bit */ #define _PAGE_MODIFIED (1<<6) /* simulated in software using we bit */ #define _PAGE_READ (1<<7) /* read-enabled */ @@ -105,6 +104,4 @@ typedef struct #define __S110 PAGE_SHARED #define __S111 PAGE_SHARED -#define PTE_FILE_MAX_BITS 26 - #endif diff --git a/arch/cris/include/arch-v32/arch/mmu.h b/arch/cris/include/arch-v32/arch/mmu.h index c1a13e05e963..e6db1616dee5 100644 --- a/arch/cris/include/arch-v32/arch/mmu.h +++ b/arch/cris/include/arch-v32/arch/mmu.h @@ -53,7 +53,6 @@ typedef struct * software. */ #define _PAGE_PRESENT (1 << 5) /* Page is present in memory. */ -#define _PAGE_FILE (1 << 6) /* 1=pagecache, 0=swap (when !present) */ #define _PAGE_ACCESSED (1 << 6) /* Simulated in software using valid bit. */ #define _PAGE_MODIFIED (1 << 7) /* Simulated in software using we bit. */ #define _PAGE_READ (1 << 8) /* Read enabled. 
*/ @@ -108,6 +107,4 @@ typedef struct #define __S110 PAGE_SHARED_EXEC #define __S111 PAGE_SHARED_EXEC -#define PTE_FILE_MAX_BITS 25 - #endif /* _ASM_CRIS_ARCH_MMU_H */ diff --git a/arch/cris/include/asm/pgtable.h b/arch/cris/include/asm/pgtable.h index 8b8c86793225..e824257971c4 100644 --- a/arch/cris/include/asm/pgtable.h +++ b/arch/cris/include/asm/pgtable.h @@ -114,7 +114,6 @@ extern unsigned long empty_zero_page; static inline int pte_write(pte_t pte) { return pte_val(pte) & _PAGE_WRITE; } static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_MODIFIED; } static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED; } -static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; } static inline int pte_special(pte_t pte) { return 0; } static inline pte_t pte_wrprotect(pte_t pte) @@ -290,9 +289,6 @@ static inline void update_mmu_cache(struct vm_area_struct * vma, */ #define pgtable_cache_init() do { } while (0) -#define pte_to_pgoff(x) (pte_val(x) >> 6) -#define pgoff_to_pte(x) __pte(((x) << 6) | _PAGE_FILE) - typedef pte_t *pte_addr_t; #endif /* __ASSEMBLY__ */ diff --git a/arch/frv/include/asm/pgtable.h b/arch/frv/include/asm/pgtable.h index eb0110acd19b..c49699d5902d 100644 --- a/arch/frv/include/asm/pgtable.h +++ b/arch/frv/include/asm/pgtable.h @@ -62,10 +62,6 @@ typedef pte_t *pte_addr_t; #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val }) -#ifndef __ASSEMBLY__ -static inline int pte_file(pte_t pte) { return 0; } -#endif - #define ZERO_PAGE(vaddr) ({ BUG(); NULL; }) #define swapper_pg_dir ((pgd_t *) NULL) @@ -298,7 +294,6 @@ static inline pmd_t *pmd_offset(pud_t *dir, unsigned long address) #define _PAGE_RESERVED_MASK (xAMPRx_RESERVED8 | xAMPRx_RESERVED13) -#define _PAGE_FILE 0x002 /* set:pagecache unset:swap */ #define _PAGE_PROTNONE 0x000 /* If not present */ #define _PAGE_CHG_MASK (PTE_MASK | _PAGE_ACCESSED | _PAGE_DIRTY) @@ -463,27 +458,15 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot) * Handle swap and file entries * - the PTE is encoded in the following format: * bit 0: Must be 0 (!_PAGE_PRESENT) - * bit 1: Type: 0 for swap, 1 for file (_PAGE_FILE) - * bits 2-7: Swap type - * bits 8-31: Swap offset - * bits 2-31: File pgoff + * bits 1-6: Swap type + * bits 7-31: Swap offset */ -#define __swp_type(x) (((x).val >> 2) & 0x1f) -#define __swp_offset(x) ((x).val >> 8) -#define __swp_entry(type, offset) ((swp_entry_t) { ((type) << 2) | ((offset) << 8) }) +#define __swp_type(x) (((x).val >> 1) & 0x1f) +#define __swp_offset(x) ((x).val >> 7) +#define __swp_entry(type, offset) ((swp_entry_t) { ((type) << 1) | ((offset) << 7) }) #define __pte_to_swp_entry(_pte) ((swp_entry_t) { (_pte).pte }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val }) -static inline int pte_file(pte_t pte) -{ - return pte.pte & _PAGE_FILE; -} - -#define PTE_FILE_MAX_BITS 29 - -#define pte_to_pgoff(PTE) ((PTE).pte >> 2) -#define pgoff_to_pte(off) __pte((off) << 2 | _PAGE_FILE) - /* Needs to be defined here and not in linux/mm.h, as it is arch dependent */ #define PageSkip(page) (0) #define kern_addr_valid(addr) (1) diff --git a/arch/hexagon/include/asm/pgtable.h b/arch/hexagon/include/asm/pgtable.h index d8bd54fa431e..6e35e71d2aea 100644 --- a/arch/hexagon/include/asm/pgtable.h +++ b/arch/hexagon/include/asm/pgtable.h @@ -61,13 +61,6 @@ extern unsigned long zero_page_mask; #define _PAGE_DIRTY (1<<1) #define _PAGE_ACCESSED (1<<2) -/* - * _PAGE_FILE is only meaningful if _PAGE_PRESENT is 
false, while - * _PAGE_DIRTY is only meaningful if _PAGE_PRESENT is true. - * So we can overload the bit... - */ -#define _PAGE_FILE _PAGE_DIRTY /* set: pagecache, unset = swap */ - /* * For now, let's say that Valid and Present are the same thing. * Alternatively, we could say that it's the "or" of R, W, and X @@ -456,57 +449,36 @@ static inline int pte_exec(pte_t pte) #define pgtable_cache_init() do { } while (0) /* - * Swap/file PTE definitions. If _PAGE_PRESENT is zero, the rest of the - * PTE is interpreted as swap information. Depending on the _PAGE_FILE - * bit, the remaining free bits are eitehr interpreted as a file offset - * or a swap type/offset tuple. Rather than have the TLB fill handler - * test _PAGE_PRESENT, we're going to reserve the permissions bits - * and set them to all zeros for swap entries, which speeds up the - * miss handler at the cost of 3 bits of offset. That trade-off can - * be revisited if necessary, but Hexagon processor architecture and - * target applications suggest a lot of TLB misses and not much swap space. + * Swap/file PTE definitions. If _PAGE_PRESENT is zero, the rest of the PTE is + * interpreted as swap information. The remaining free bits are interpreted as + * swap type/offset tuple. Rather than have the TLB fill handler test + * _PAGE_PRESENT, we're going to reserve the permissions bits and set them to + * all zeros for swap entries, which speeds up the miss handler at the cost of + * 3 bits of offset. That trade-off can be revisited if necessary, but Hexagon + * processor architecture and target applications suggest a lot of TLB misses + * and not much swap space. * * Format of swap PTE: * bit 0: Present (zero) - * bit 1: _PAGE_FILE (zero) - * bits 2-6: swap type (arch independent layer uses 5 bits max) - * bits 7-9: bits 2:0 of offset - * bits 10-12: effectively _PAGE_PROTNONE (all zero) - * bits 13-31: bits 21:3 of swap offset - * - * Format of file PTE: - * bit 0: Present (zero) - * bit 1: _PAGE_FILE (zero) - * bits 2-9: bits 7:0 of offset - * bits 10-12: effectively _PAGE_PROTNONE (all zero) - * bits 13-31: bits 26:8 of swap offset + * bits 1-5: swap type (arch independent layer uses 5 bits max) + * bits 6-9: bits 3:0 of offset + * bits 10-12: effectively _PAGE_PROTNONE (all zero) + * bits 13-31: bits 22:4 of swap offset * * The split offset makes some of the following macros a little gnarly, * but there's plenty of precedent for this sort of thing. */ -#define PTE_FILE_MAX_BITS 27 /* Used for swap PTEs */ -#define __swp_type(swp_pte) (((swp_pte).val >> 2) & 0x1f) +#define __swp_type(swp_pte) (((swp_pte).val >> 1) & 0x1f) #define __swp_offset(swp_pte) \ - ((((swp_pte).val >> 7) & 0x7) | (((swp_pte).val >> 10) & 0x003ffff8)) + ((((swp_pte).val >> 6) & 0xf) | (((swp_pte).val >> 9) & 0x7ffff0)) #define __swp_entry(type, offset) \ ((swp_entry_t) { \ - ((type << 2) | \ - ((offset & 0x3ffff8) << 10) | ((offset & 0x7) << 7)) }) - -/* Used for file PTEs */ -#define pte_file(pte) \ - ((pte_val(pte) & (_PAGE_FILE | _PAGE_PRESENT)) == _PAGE_FILE) - -#define pte_to_pgoff(pte) \ - (((pte_val(pte) >> 2) & 0xff) | ((pte_val(pte) >> 5) & 0x07ffff00)) - -#define pgoff_to_pte(off) \ - ((pte_t) { ((((off) & 0x7ffff00) << 5) | (((off) & 0xff) << 2)\ - | _PAGE_FILE) }) + ((type << 1) | \ + ((offset & 0x7ffff0) << 9) | ((offset & 0xf) << 6)) }) /* Oh boy. There are a lot of possible arch overrides found in this file. 
*/ #include diff --git a/arch/ia64/include/asm/pgtable.h b/arch/ia64/include/asm/pgtable.h index 7935115398a6..2f07bb3dda91 100644 --- a/arch/ia64/include/asm/pgtable.h +++ b/arch/ia64/include/asm/pgtable.h @@ -57,9 +57,6 @@ #define _PAGE_ED (__IA64_UL(1) << 52) /* exception deferral */ #define _PAGE_PROTNONE (__IA64_UL(1) << 63) -/* Valid only for a PTE with the present bit cleared: */ -#define _PAGE_FILE (1 << 1) /* see swap & file pte remarks below */ - #define _PFN_MASK _PAGE_PPN_MASK /* Mask of bits which may be changed by pte_modify(); the odd bits are there for _PAGE_PROTNONE */ #define _PAGE_CHG_MASK (_PAGE_P | _PAGE_PROTNONE | _PAGE_PL_MASK | _PAGE_AR_MASK | _PAGE_ED) @@ -300,7 +297,6 @@ extern unsigned long VMALLOC_END; #define pte_exec(pte) ((pte_val(pte) & _PAGE_AR_RX) != 0) #define pte_dirty(pte) ((pte_val(pte) & _PAGE_D) != 0) #define pte_young(pte) ((pte_val(pte) & _PAGE_A) != 0) -#define pte_file(pte) ((pte_val(pte) & _PAGE_FILE) != 0) #define pte_special(pte) 0 /* @@ -472,27 +468,16 @@ extern void paging_init (void); * * Format of swap pte: * bit 0 : present bit (must be zero) - * bit 1 : _PAGE_FILE (must be zero) - * bits 2- 8: swap-type - * bits 9-62: swap offset - * bit 63 : _PAGE_PROTNONE bit - * - * Format of file pte: - * bit 0 : present bit (must be zero) - * bit 1 : _PAGE_FILE (must be one) - * bits 2-62: file_offset/PAGE_SIZE + * bits 1- 7: swap-type + * bits 8-62: swap offset * bit 63 : _PAGE_PROTNONE bit */ -#define __swp_type(entry) (((entry).val >> 2) & 0x7f) -#define __swp_offset(entry) (((entry).val << 1) >> 10) -#define __swp_entry(type,offset) ((swp_entry_t) { ((type) << 2) | ((long) (offset) << 9) }) +#define __swp_type(entry) (((entry).val >> 1) & 0x7f) +#define __swp_offset(entry) (((entry).val << 1) >> 9) +#define __swp_entry(type,offset) ((swp_entry_t) { ((type) << 1) | ((long) (offset) << 8) }) #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val }) -#define PTE_FILE_MAX_BITS 61 -#define pte_to_pgoff(pte) ((pte_val(pte) << 1) >> 3) -#define pgoff_to_pte(off) ((pte_t) { ((off) << 2) | _PAGE_FILE }) - /* * ZERO_PAGE is a global shared page that is always zero: used * for zero-mapped memory areas etc.. 
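To make the ia64 re-encoding above concrete: with _PAGE_FILE gone, the swap type moves down to bits 1-7 and the offset widens to bits 8-62. Below is a minimal userspace sketch of the new round-trip, mirroring the __swp_* macros from the hunk above; plain unsigned long (64-bit, as on ia64) stands in for pte_val()/swp_entry_t, and the helper names are ours, not the kernel's.

    #include <assert.h>
    #include <stdio.h>

    /* New ia64 layout: bit 0 present (must be zero), bits 1-7 swap type,
     * bits 8-62 swap offset, bit 63 reserved for _PAGE_PROTNONE. */
    #define SWP_TYPE(val)        (((val) >> 1) & 0x7f)
    #define SWP_OFFSET(val)      (((val) << 1) >> 9)   /* also drops bit 63 */
    #define SWP_ENTRY(type, off) (((unsigned long)(type) << 1) | \
                                  ((unsigned long)(off) << 8))

    int main(void)
    {
            unsigned long e = SWP_ENTRY(31, 0xabcdefUL);

            assert(SWP_TYPE(e) == 31);
            assert(SWP_OFFSET(e) == 0xabcdefUL);
            assert((e & 1) == 0);          /* present bit stays clear */
            printf("entry=%#lx type=%lu offset=%#lx\n",
                   e, SWP_TYPE(e), SWP_OFFSET(e));
            return 0;
    }

The shift-left-then-right form of the offset decode is what lets the comment claim bits 8-62: it strips bit 63 without needing an explicit mask.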
diff --git a/arch/m32r/include/asm/pgtable-2level.h b/arch/m32r/include/asm/pgtable-2level.h index 9cdaf7350ef6..8fd8ee70266a 100644 --- a/arch/m32r/include/asm/pgtable-2level.h +++ b/arch/m32r/include/asm/pgtable-2level.h @@ -70,9 +70,5 @@ static inline pmd_t *pmd_offset(pgd_t * dir, unsigned long address) #define pfn_pte(pfn, prot) __pte(((pfn) << PAGE_SHIFT) | pgprot_val(prot)) #define pfn_pmd(pfn, prot) __pmd(((pfn) << PAGE_SHIFT) | pgprot_val(prot)) -#define PTE_FILE_MAX_BITS 29 -#define pte_to_pgoff(pte) (((pte_val(pte) >> 2) & 0x7f) | (((pte_val(pte) >> 10)) << 7)) -#define pgoff_to_pte(off) ((pte_t) { (((off) & 0x7f) << 2) | (((off) >> 7) << 10) | _PAGE_FILE }) - #endif /* __KERNEL__ */ #endif /* _ASM_M32R_PGTABLE_2LEVEL_H */ diff --git a/arch/m32r/include/asm/pgtable.h b/arch/m32r/include/asm/pgtable.h index 103ce6710f07..050f7a686e3d 100644 --- a/arch/m32r/include/asm/pgtable.h +++ b/arch/m32r/include/asm/pgtable.h @@ -80,8 +80,6 @@ extern unsigned long empty_zero_page[1024]; */ #define _PAGE_BIT_DIRTY 0 /* software: page changed */ -#define _PAGE_BIT_FILE 0 /* when !present: nonlinear file - mapping */ #define _PAGE_BIT_PRESENT 1 /* Valid: page is valid */ #define _PAGE_BIT_GLOBAL 2 /* Global */ #define _PAGE_BIT_LARGE 3 /* Large */ @@ -93,7 +91,6 @@ extern unsigned long empty_zero_page[1024]; #define _PAGE_BIT_PROTNONE 9 /* software: if not present */ #define _PAGE_DIRTY (1UL << _PAGE_BIT_DIRTY) -#define _PAGE_FILE (1UL << _PAGE_BIT_FILE) #define _PAGE_PRESENT (1UL << _PAGE_BIT_PRESENT) #define _PAGE_GLOBAL (1UL << _PAGE_BIT_GLOBAL) #define _PAGE_LARGE (1UL << _PAGE_BIT_LARGE) @@ -206,14 +203,6 @@ static inline int pte_write(pte_t pte) return pte_val(pte) & _PAGE_WRITE; } -/* - * The following only works if pte_present() is not true. - */ -static inline int pte_file(pte_t pte) -{ - return pte_val(pte) & _PAGE_FILE; -} - static inline int pte_special(pte_t pte) { return 0; diff --git a/arch/m68k/include/asm/mcf_pgtable.h b/arch/m68k/include/asm/mcf_pgtable.h index 3c793682e5d9..2500ce04fcc4 100644 --- a/arch/m68k/include/asm/mcf_pgtable.h +++ b/arch/m68k/include/asm/mcf_pgtable.h @@ -35,7 +35,6 @@ * hitting hardware. 
*/ #define CF_PAGE_DIRTY 0x00000001 -#define CF_PAGE_FILE 0x00000200 #define CF_PAGE_ACCESSED 0x00001000 #define _PAGE_CACHE040 0x020 /* 68040 cache mode, cachable, copyback */ @@ -243,11 +242,6 @@ static inline int pte_young(pte_t pte) return pte_val(pte) & CF_PAGE_ACCESSED; } -static inline int pte_file(pte_t pte) -{ - return pte_val(pte) & CF_PAGE_FILE; -} - static inline int pte_special(pte_t pte) { return 0; @@ -391,26 +385,13 @@ static inline void cache_page(void *vaddr) *ptep = pte_mkcache(*ptep); } -#define PTE_FILE_MAX_BITS 21 -#define PTE_FILE_SHIFT 11 - -static inline unsigned long pte_to_pgoff(pte_t pte) -{ - return pte_val(pte) >> PTE_FILE_SHIFT; -} - -static inline pte_t pgoff_to_pte(unsigned pgoff) -{ - return __pte((pgoff << PTE_FILE_SHIFT) + CF_PAGE_FILE); -} - /* * Encode and de-code a swap entry (must be !pte_none(e) && !pte_present(e)) */ #define __swp_type(x) ((x).val & 0xFF) -#define __swp_offset(x) ((x).val >> PTE_FILE_SHIFT) +#define __swp_offset(x) ((x).val >> 11) #define __swp_entry(typ, off) ((swp_entry_t) { (typ) | \ - (off << PTE_FILE_SHIFT) }) + (off << 11) }) #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __swp_entry_to_pte(x) (__pte((x).val)) diff --git a/arch/m68k/include/asm/motorola_pgtable.h b/arch/m68k/include/asm/motorola_pgtable.h index e0fdd4d08075..0085aab80e5a 100644 --- a/arch/m68k/include/asm/motorola_pgtable.h +++ b/arch/m68k/include/asm/motorola_pgtable.h @@ -28,7 +28,6 @@ #define _PAGE_CHG_MASK (PAGE_MASK | _PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_NOCACHE) #define _PAGE_PROTNONE 0x004 -#define _PAGE_FILE 0x008 /* pagecache or swap? */ #ifndef __ASSEMBLY__ @@ -168,7 +167,6 @@ static inline void pgd_set(pgd_t *pgdp, pmd_t *pmdp) static inline int pte_write(pte_t pte) { return !(pte_val(pte) & _PAGE_RONLY); } static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY; } static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED; } -static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; } static inline int pte_special(pte_t pte) { return 0; } static inline pte_t pte_wrprotect(pte_t pte) { pte_val(pte) |= _PAGE_RONLY; return pte; } @@ -266,19 +264,6 @@ static inline void cache_page(void *vaddr) } } -#define PTE_FILE_MAX_BITS 28 - -static inline unsigned long pte_to_pgoff(pte_t pte) -{ - return pte.pte >> 4; -} - -static inline pte_t pgoff_to_pte(unsigned off) -{ - pte_t pte = { (off << 4) + _PAGE_FILE }; - return pte; -} - /* Encode and de-code a swap entry (must be !pte_none(e) && !pte_present(e)) */ #define __swp_type(x) (((x).val >> 4) & 0xff) #define __swp_offset(x) ((x).val >> 12) diff --git a/arch/m68k/include/asm/pgtable_no.h b/arch/m68k/include/asm/pgtable_no.h index c527fc2ecf82..5673e64a93cf 100644 --- a/arch/m68k/include/asm/pgtable_no.h +++ b/arch/m68k/include/asm/pgtable_no.h @@ -37,8 +37,6 @@ extern void paging_init(void); #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val }) -static inline int pte_file(pte_t pte) { return 0; } - /* * ZERO_PAGE is a global shared page that is always zero: used * for zero-mapped memory areas etc.. 
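The m68k hunks above follow the same pattern as every other architecture in this series: _PAGE_FILE and the pte_to_pgoff()/pgoff_to_pte() helpers disappear, leaving only three PTE states for generic code to tell apart. A toy model of that trichotomy (illustrative only; PAGE_PRESENT and classify() are made-up names for this sketch, not kernel API):

    #include <stdio.h>

    #define PAGE_PRESENT 0x1UL   /* stand-in for the per-arch present bit */

    enum pte_kind { PTE_NONE, PTE_PRESENT, PTE_SWAP };

    /* Before this series a !present PTE could also be a nonlinear file
     * entry (_PAGE_FILE set); now !present && !none always means swap. */
    static enum pte_kind classify(unsigned long pte)
    {
            if (pte & PAGE_PRESENT)
                    return PTE_PRESENT;
            if (!pte)
                    return PTE_NONE;
            return PTE_SWAP;
    }

    int main(void)
    {
            printf("%d %d %d\n",
                   classify(0),                      /* PTE_NONE */
                   classify(PAGE_PRESENT | 0x1000),  /* PTE_PRESENT */
                   classify(0x2aUL << 8));           /* PTE_SWAP */
            return 0;
    }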
diff --git a/arch/m68k/include/asm/sun3_pgtable.h b/arch/m68k/include/asm/sun3_pgtable.h index f55aa04161e8..48657f9fdece 100644 --- a/arch/m68k/include/asm/sun3_pgtable.h +++ b/arch/m68k/include/asm/sun3_pgtable.h @@ -38,8 +38,6 @@ #define _PAGE_PRESENT (SUN3_PAGE_VALID) #define _PAGE_ACCESSED (SUN3_PAGE_ACCESSED) -#define PTE_FILE_MAX_BITS 28 - /* Compound page protection values. */ //todo: work out which ones *should* have SUN3_PAGE_NOCACHE and fix... // is it just PAGE_KERNEL and PAGE_SHARED? @@ -168,7 +166,6 @@ static inline void pgd_clear (pgd_t *pgdp) {} static inline int pte_write(pte_t pte) { return pte_val(pte) & SUN3_PAGE_WRITEABLE; } static inline int pte_dirty(pte_t pte) { return pte_val(pte) & SUN3_PAGE_MODIFIED; } static inline int pte_young(pte_t pte) { return pte_val(pte) & SUN3_PAGE_ACCESSED; } -static inline int pte_file(pte_t pte) { return pte_val(pte) & SUN3_PAGE_ACCESSED; } static inline int pte_special(pte_t pte) { return 0; } static inline pte_t pte_wrprotect(pte_t pte) { pte_val(pte) &= ~SUN3_PAGE_WRITEABLE; return pte; } @@ -202,18 +199,6 @@ static inline pmd_t *pmd_offset (pgd_t *pgd, unsigned long address) return (pmd_t *) pgd; } -static inline unsigned long pte_to_pgoff(pte_t pte) -{ - return pte.pte & SUN3_PAGE_PGNUM_MASK; -} - -static inline pte_t pgoff_to_pte(unsigned off) -{ - pte_t pte = { off + SUN3_PAGE_ACCESSED }; - return pte; -} - - /* Find an entry in the third-level pagetable. */ #define pte_index(address) ((address >> PAGE_SHIFT) & (PTRS_PER_PTE-1)) #define pte_offset_kernel(pmd, address) ((pte_t *) __pmd_page(*pmd) + pte_index(address)) diff --git a/arch/metag/include/asm/pgtable.h b/arch/metag/include/asm/pgtable.h index 0d9dc5487296..d0604c0a8702 100644 --- a/arch/metag/include/asm/pgtable.h +++ b/arch/metag/include/asm/pgtable.h @@ -47,7 +47,6 @@ */ #define _PAGE_ACCESSED _PAGE_ALWAYS_ZERO_1 #define _PAGE_DIRTY _PAGE_ALWAYS_ZERO_2 -#define _PAGE_FILE _PAGE_ALWAYS_ZERO_3 /* Pages owned, and protected by, the kernel. 
*/ #define _PAGE_KERNEL _PAGE_PRIV @@ -219,7 +218,6 @@ extern unsigned long empty_zero_page; static inline int pte_write(pte_t pte) { return pte_val(pte) & _PAGE_WRITE; } static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY; } static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED; } -static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; } static inline int pte_special(pte_t pte) { return 0; } static inline pte_t pte_wrprotect(pte_t pte) { pte_val(pte) &= (~_PAGE_WRITE); return pte; } @@ -327,10 +325,6 @@ static inline void update_mmu_cache(struct vm_area_struct *vma, #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val }) -#define PTE_FILE_MAX_BITS 22 -#define pte_to_pgoff(x) (pte_val(x) >> 10) -#define pgoff_to_pte(x) __pte(((x) << 10) | _PAGE_FILE) - #define kern_addr_valid(addr) (1) /* diff --git a/arch/microblaze/include/asm/pgtable.h b/arch/microblaze/include/asm/pgtable.h index 95cef0b5f836..550ea30cd489 100644 --- a/arch/microblaze/include/asm/pgtable.h +++ b/arch/microblaze/include/asm/pgtable.h @@ -40,10 +40,6 @@ extern int mem_init_done; #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val }) -#ifndef __ASSEMBLY__ -static inline int pte_file(pte_t pte) { return 0; } -#endif /* __ASSEMBLY__ */ - #define ZERO_PAGE(vaddr) ({ BUG(); NULL; }) #define swapper_pg_dir ((pgd_t *) NULL) @@ -207,7 +203,6 @@ static inline pte_t pte_mkspecial(pte_t pte) { return pte; } /* Definitions for MicroBlaze. */ #define _PAGE_GUARDED 0x001 /* G: page is guarded from prefetch */ -#define _PAGE_FILE 0x001 /* when !present: nonlinear file mapping */ #define _PAGE_PRESENT 0x002 /* software: PTE contains a translation */ #define _PAGE_NO_CACHE 0x004 /* I: caching is inhibited */ #define _PAGE_WRITETHRU 0x008 /* W: caching is write-through */ @@ -337,7 +332,6 @@ static inline int pte_write(pte_t pte) { return pte_val(pte) & _PAGE_RW; } static inline int pte_exec(pte_t pte) { return pte_val(pte) & _PAGE_EXEC; } static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY; } static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED; } -static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; } static inline void pte_uncache(pte_t pte) { pte_val(pte) |= _PAGE_NO_CACHE; } static inline void pte_cache(pte_t pte) { pte_val(pte) &= ~_PAGE_NO_CACHE; } @@ -499,11 +493,6 @@ static inline pmd_t *pmd_offset(pgd_t *dir, unsigned long address) #define pte_unmap(pte) kunmap_atomic(pte) -/* Encode and decode a nonlinear file mapping entry */ -#define PTE_FILE_MAX_BITS 29 -#define pte_to_pgoff(pte) (pte_val(pte) >> 3) -#define pgoff_to_pte(off) ((pte_t) { ((off) << 3) | _PAGE_FILE }) - extern pgd_t swapper_pg_dir[PTRS_PER_PGD]; /* diff --git a/arch/mips/include/asm/pgtable-32.h b/arch/mips/include/asm/pgtable-32.h index b4204c179b97..b84a3e3d1811 100644 --- a/arch/mips/include/asm/pgtable-32.h +++ b/arch/mips/include/asm/pgtable-32.h @@ -148,20 +148,6 @@ pfn_pte(unsigned long pfn, pgprot_t prot) #define __swp_entry(type,offset) \ ((swp_entry_t) { ((type) << 10) | ((offset) << 15) }) -/* - * Bits 0, 4, 8, and 9 are taken, split up 28 bits of offset into this range: - */ -#define PTE_FILE_MAX_BITS 28 - -#define pte_to_pgoff(_pte) ((((_pte).pte >> 1 ) & 0x07) | \ - (((_pte).pte >> 2 ) & 0x38) | \ - (((_pte).pte >> 10) << 6 )) - -#define pgoff_to_pte(off) ((pte_t) { (((off) & 0x07) << 1 ) | \ - 
(((off) & 0x38) << 2 ) | \ - (((off) >> 6 ) << 10) | \ - _PAGE_FILE }) - #else /* Swap entries must have VALID and GLOBAL bits cleared. */ @@ -177,31 +163,6 @@ pfn_pte(unsigned long pfn, pgprot_t prot) ((swp_entry_t) { ((type) << 8) | ((offset) << 13) }) #endif /* defined(CONFIG_64BIT_PHYS_ADDR) && defined(CONFIG_CPU_MIPS32) */ -#if defined(CONFIG_64BIT_PHYS_ADDR) && defined(CONFIG_CPU_MIPS32) -/* - * Bits 0 and 1 of pte_high are taken, use the rest for the page offset... - */ -#define PTE_FILE_MAX_BITS 30 - -#define pte_to_pgoff(_pte) ((_pte).pte_high >> 2) -#define pgoff_to_pte(off) ((pte_t) { _PAGE_FILE, (off) << 2 }) - -#else -/* - * Bits 0, 4, 6, and 7 are taken, split up 28 bits of offset into this range: - */ -#define PTE_FILE_MAX_BITS 28 - -#define pte_to_pgoff(_pte) ((((_pte).pte >> 1) & 0x7) | \ - (((_pte).pte >> 2) & 0x8) | \ - (((_pte).pte >> 8) << 4)) - -#define pgoff_to_pte(off) ((pte_t) { (((off) & 0x7) << 1) | \ - (((off) & 0x8) << 2) | \ - (((off) >> 4) << 8) | \ - _PAGE_FILE }) -#endif - #endif #if defined(CONFIG_64BIT_PHYS_ADDR) && defined(CONFIG_CPU_MIPS32) diff --git a/arch/mips/include/asm/pgtable-64.h b/arch/mips/include/asm/pgtable-64.h index e1c49a96807d..1659bb91ae21 100644 --- a/arch/mips/include/asm/pgtable-64.h +++ b/arch/mips/include/asm/pgtable-64.h @@ -291,13 +291,4 @@ static inline pte_t mk_swap_pte(unsigned long type, unsigned long offset) #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val }) -/* - * Bits 0, 4, 6, and 7 are taken. Let's leave bits 1, 2, 3, and 5 alone to - * make things easier, and only use the upper 56 bits for the page offset... - */ -#define PTE_FILE_MAX_BITS 56 - -#define pte_to_pgoff(_pte) ((_pte).pte >> 8) -#define pgoff_to_pte(off) ((pte_t) { ((off) << 8) | _PAGE_FILE }) - #endif /* _ASM_PGTABLE_64_H */ diff --git a/arch/mips/include/asm/pgtable-bits.h b/arch/mips/include/asm/pgtable-bits.h index e592f3687d6f..42461ccda809 100644 --- a/arch/mips/include/asm/pgtable-bits.h +++ b/arch/mips/include/asm/pgtable-bits.h @@ -50,8 +50,6 @@ /* * The following bits are implemented in software - * - * _PAGE_FILE semantics: set:pagecache unset:swap */ #define _PAGE_PRESENT_SHIFT 6 #define _PAGE_PRESENT (1 << _PAGE_PRESENT_SHIFT) @@ -64,14 +62,10 @@ #define _PAGE_MODIFIED_SHIFT 10 #define _PAGE_MODIFIED (1 << _PAGE_MODIFIED_SHIFT) -#define _PAGE_FILE (1 << 10) - #elif defined(CONFIG_CPU_R3000) || defined(CONFIG_CPU_TX39XX) /* * The following are implemented by software - * - * _PAGE_FILE semantics: set:pagecache unset:swap */ #define _PAGE_PRESENT_SHIFT 0 #define _PAGE_PRESENT (1 << _PAGE_PRESENT_SHIFT) @@ -83,8 +77,6 @@ #define _PAGE_ACCESSED (1 << _PAGE_ACCESSED_SHIFT) #define _PAGE_MODIFIED_SHIFT 4 #define _PAGE_MODIFIED (1 << _PAGE_MODIFIED_SHIFT) -#define _PAGE_FILE_SHIFT 4 -#define _PAGE_FILE (1 << _PAGE_FILE_SHIFT) /* * And these are the hardware TLB bits @@ -114,7 +106,6 @@ * The following bits are implemented in software * * _PAGE_READ / _PAGE_READ_SHIFT should be unused if cpu_has_rixi. 
- * _PAGE_FILE semantics: set:pagecache unset:swap */ #define _PAGE_PRESENT_SHIFT (0) #define _PAGE_PRESENT (1 << _PAGE_PRESENT_SHIFT) @@ -126,7 +117,6 @@ #define _PAGE_ACCESSED (1 << _PAGE_ACCESSED_SHIFT) #define _PAGE_MODIFIED_SHIFT (_PAGE_ACCESSED_SHIFT + 1) #define _PAGE_MODIFIED (1 << _PAGE_MODIFIED_SHIFT) -#define _PAGE_FILE (_PAGE_MODIFIED) #ifdef CONFIG_MIPS_HUGE_TLB_SUPPORT /* huge tlb page */ diff --git a/arch/mips/include/asm/pgtable.h b/arch/mips/include/asm/pgtable.h index f94d39f9e21d..2b8cb50ec11c 100644 --- a/arch/mips/include/asm/pgtable.h +++ b/arch/mips/include/asm/pgtable.h @@ -253,7 +253,6 @@ extern pgd_t swapper_pg_dir[]; static inline int pte_write(pte_t pte) { return pte.pte_low & _PAGE_WRITE; } static inline int pte_dirty(pte_t pte) { return pte.pte_low & _PAGE_MODIFIED; } static inline int pte_young(pte_t pte) { return pte.pte_low & _PAGE_ACCESSED; } -static inline int pte_file(pte_t pte) { return pte.pte_low & _PAGE_FILE; } static inline pte_t pte_wrprotect(pte_t pte) { @@ -309,7 +308,6 @@ static inline pte_t pte_mkyoung(pte_t pte) static inline int pte_write(pte_t pte) { return pte_val(pte) & _PAGE_WRITE; } static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_MODIFIED; } static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED; } -static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; } static inline pte_t pte_wrprotect(pte_t pte) { diff --git a/arch/mn10300/include/asm/pgtable.h b/arch/mn10300/include/asm/pgtable.h index 2ddaa67e7983..629181ae111e 100644 --- a/arch/mn10300/include/asm/pgtable.h +++ b/arch/mn10300/include/asm/pgtable.h @@ -134,7 +134,6 @@ extern pte_t kernel_vmalloc_ptes[(VMALLOC_END - VMALLOC_START) / PAGE_SIZE]; #define _PAGE_NX 0 /* no-execute bit */ /* If _PAGE_VALID is clear, we use these: */ -#define _PAGE_FILE xPTEL2_C /* set:pagecache unset:swap */ #define _PAGE_PROTNONE 0x000 /* If not present */ #define __PAGE_PROT_UWAUX 0x010 @@ -241,11 +240,6 @@ static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED; } static inline int pte_write(pte_t pte) { return pte_val(pte) & __PAGE_PROT_WRITE; } static inline int pte_special(pte_t pte){ return 0; } -/* - * The following only works if pte_present() is not true. 
- */ -static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; } - static inline pte_t pte_rdprotect(pte_t pte) { pte_val(pte) &= ~(__PAGE_PROT_USER|__PAGE_PROT_UWAUX); return pte; @@ -338,16 +332,11 @@ static inline int pte_exec_kernel(pte_t pte) return 1; } -#define PTE_FILE_MAX_BITS 30 - -#define pte_to_pgoff(pte) (pte_val(pte) >> 2) -#define pgoff_to_pte(off) __pte((off) << 2 | _PAGE_FILE) - /* Encode and de-code a swap entry */ -#define __swp_type(x) (((x).val >> 2) & 0x3f) -#define __swp_offset(x) ((x).val >> 8) +#define __swp_type(x) (((x).val >> 1) & 0x3f) +#define __swp_offset(x) ((x).val >> 7) #define __swp_entry(type, offset) \ - ((swp_entry_t) { ((type) << 2) | ((offset) << 8) }) + ((swp_entry_t) { ((type) << 1) | ((offset) << 7) }) #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __swp_entry_to_pte(x) __pte((x).val) diff --git a/arch/openrisc/include/asm/pgtable.h b/arch/openrisc/include/asm/pgtable.h index 37bf6a3ef8f4..18994ccb1185 100644 --- a/arch/openrisc/include/asm/pgtable.h +++ b/arch/openrisc/include/asm/pgtable.h @@ -125,7 +125,6 @@ extern void paging_init(void); #define _PAGE_CC 0x001 /* software: pte contains a translation */ #define _PAGE_CI 0x002 /* cache inhibit */ #define _PAGE_WBC 0x004 /* write back cache */ -#define _PAGE_FILE 0x004 /* set: pagecache, unset: swap (when !PRESENT) */ #define _PAGE_WOM 0x008 /* weakly ordered memory */ #define _PAGE_A 0x010 /* accessed */ @@ -240,7 +239,6 @@ static inline int pte_write(pte_t pte) { return pte_val(pte) & _PAGE_WRITE; } static inline int pte_exec(pte_t pte) { return pte_val(pte) & _PAGE_EXEC; } static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY; } static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED; } -static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; } static inline int pte_special(pte_t pte) { return 0; } static inline pte_t pte_mkspecial(pte_t pte) { return pte; } @@ -438,12 +436,6 @@ static inline void update_mmu_cache(struct vm_area_struct *vma, #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val }) -/* Encode and decode a nonlinear file mapping entry */ - -#define PTE_FILE_MAX_BITS 26 -#define pte_to_pgoff(x) (pte_val(x) >> 6) -#define pgoff_to_pte(x) __pte(((x) << 6) | _PAGE_FILE) - #define kern_addr_valid(addr) (1) #include diff --git a/arch/openrisc/kernel/head.S b/arch/openrisc/kernel/head.S index 1d3c9c28ac25..f14793306b03 100644 --- a/arch/openrisc/kernel/head.S +++ b/arch/openrisc/kernel/head.S @@ -754,11 +754,6 @@ /* ===============================================[ page table masks ]=== */ -/* bit 4 is used in hardware as write back cache bit. 
we never use this bit - * explicitly, so we can reuse it as _PAGE_FILE bit and mask it out when - * writing into hardware pte's - */ - #define DTLB_UP_CONVERT_MASK 0x3fa #define ITLB_UP_CONVERT_MASK 0x3a diff --git a/arch/parisc/include/asm/pgtable.h b/arch/parisc/include/asm/pgtable.h index b6762dfbe4fe..97b2fe45950f 100644 --- a/arch/parisc/include/asm/pgtable.h +++ b/arch/parisc/include/asm/pgtable.h @@ -146,7 +146,6 @@ extern void purge_tlb_entries(struct mm_struct *, unsigned long); #define _PAGE_GATEWAY_BIT 28 /* (0x008) privilege promotion allowed */ #define _PAGE_DMB_BIT 27 /* (0x010) Data Memory Break enable (B bit) */ #define _PAGE_DIRTY_BIT 26 /* (0x020) Page Dirty (D bit) */ -#define _PAGE_FILE_BIT _PAGE_DIRTY_BIT /* overload this bit */ #define _PAGE_REFTRAP_BIT 25 /* (0x040) Page Ref. Trap enable (T bit) */ #define _PAGE_NO_CACHE_BIT 24 /* (0x080) Uncached Page (U bit) */ #define _PAGE_ACCESSED_BIT 23 /* (0x100) Software: Page Accessed */ @@ -167,13 +166,6 @@ extern void purge_tlb_entries(struct mm_struct *, unsigned long); /* PFN_PTE_SHIFT defines the shift of a PTE value to access the PFN field */ #define PFN_PTE_SHIFT 12 - -/* this is how many bits may be used by the file functions */ -#define PTE_FILE_MAX_BITS (BITS_PER_LONG - PTE_SHIFT) - -#define pte_to_pgoff(pte) (pte_val(pte) >> PTE_SHIFT) -#define pgoff_to_pte(off) ((pte_t) { ((off) << PTE_SHIFT) | _PAGE_FILE }) - #define _PAGE_READ (1 << xlate_pabit(_PAGE_READ_BIT)) #define _PAGE_WRITE (1 << xlate_pabit(_PAGE_WRITE_BIT)) #define _PAGE_RW (_PAGE_READ | _PAGE_WRITE) @@ -186,7 +178,6 @@ extern void purge_tlb_entries(struct mm_struct *, unsigned long); #define _PAGE_ACCESSED (1 << xlate_pabit(_PAGE_ACCESSED_BIT)) #define _PAGE_PRESENT (1 << xlate_pabit(_PAGE_PRESENT_BIT)) #define _PAGE_USER (1 << xlate_pabit(_PAGE_USER_BIT)) -#define _PAGE_FILE (1 << xlate_pabit(_PAGE_FILE_BIT)) #define _PAGE_TABLE (_PAGE_PRESENT | _PAGE_READ | _PAGE_WRITE | _PAGE_DIRTY | _PAGE_ACCESSED) #define _PAGE_CHG_MASK (PAGE_MASK | _PAGE_ACCESSED | _PAGE_DIRTY) @@ -344,7 +335,6 @@ static inline void pgd_clear(pgd_t * pgdp) { } static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY; } static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED; } static inline int pte_write(pte_t pte) { return pte_val(pte) & _PAGE_WRITE; } -static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; } static inline int pte_special(pte_t pte) { return 0; } static inline pte_t pte_mkclean(pte_t pte) { pte_val(pte) &= ~_PAGE_DIRTY; return pte; } diff --git a/arch/powerpc/include/asm/pgtable-ppc32.h b/arch/powerpc/include/asm/pgtable-ppc32.h index 47edde8c3556..eaeb1d3a4f4e 100644 --- a/arch/powerpc/include/asm/pgtable-ppc32.h +++ b/arch/powerpc/include/asm/pgtable-ppc32.h @@ -314,8 +314,8 @@ static inline void __ptep_set_access_flags(pte_t *ptep, pte_t entry) /* * Encode and decode a swap entry. * Note that the bits we use in a PTE for representing a swap entry - * must not include the _PAGE_PRESENT bit, the _PAGE_FILE bit, or the - *_PAGE_HASHPTE bit (if used). -- paulus + * must not include the _PAGE_PRESENT bit or the _PAGE_HASHPTE bit (if used). 
+ * -- paulus */ #define __swp_type(entry) ((entry).val & 0x1f) #define __swp_offset(entry) ((entry).val >> 5) @@ -323,11 +323,6 @@ static inline void __ptep_set_access_flags(pte_t *ptep, pte_t entry) #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) >> 3 }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val << 3 }) -/* Encode and decode a nonlinear file mapping entry */ -#define PTE_FILE_MAX_BITS 29 -#define pte_to_pgoff(pte) (pte_val(pte) >> 3) -#define pgoff_to_pte(off) ((pte_t) { ((off) << 3) | _PAGE_FILE }) - /* * No page table caches to initialise */ diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h index 7356053b1133..e53a0c419b45 100644 --- a/arch/powerpc/include/asm/pgtable-ppc64.h +++ b/arch/powerpc/include/asm/pgtable-ppc64.h @@ -352,9 +352,6 @@ static inline void __ptep_set_access_flags(pte_t *ptep, pte_t entry) #define __swp_entry(type, offset) ((swp_entry_t){((type)<< 1)|((offset)<<8)}) #define __pte_to_swp_entry(pte) ((swp_entry_t){pte_val(pte) >> PTE_RPN_SHIFT}) #define __swp_entry_to_pte(x) ((pte_t) { (x).val << PTE_RPN_SHIFT }) -#define pte_to_pgoff(pte) (pte_val(pte) >> PTE_RPN_SHIFT) -#define pgoff_to_pte(off) ((pte_t) {((off) << PTE_RPN_SHIFT)|_PAGE_FILE}) -#define PTE_FILE_MAX_BITS (BITS_PER_LONG - PTE_RPN_SHIFT) void pgtable_cache_add(unsigned shift, void (*ctor)(void *)); void pgtable_cache_init(void); @@ -389,7 +386,7 @@ void pgtable_cache_init(void); * The last three bits are intentionally left to zero. This memory location * are also used as normal page PTE pointers. So if we have any pointers * left around while we collapse a hugepage, we need to make sure - * _PAGE_PRESENT and _PAGE_FILE bits of that are zero when we look at them + * _PAGE_PRESENT bit of that is zero when we look at them */ static inline unsigned int hpte_valid(unsigned char *hpte_slot_array, int index) { diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h index d98c1ecc3266..78856f71dd4a 100644 --- a/arch/powerpc/include/asm/pgtable.h +++ b/arch/powerpc/include/asm/pgtable.h @@ -32,7 +32,6 @@ struct mm_struct; static inline int pte_write(pte_t pte) { return pte_val(pte) & _PAGE_RW; } static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY; } static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED; } -static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; } static inline int pte_special(pte_t pte) { return pte_val(pte) & _PAGE_SPECIAL; } static inline int pte_none(pte_t pte) { return (pte_val(pte) & ~_PTE_NONE_MASK) == 0; } static inline pgprot_t pte_pgprot(pte_t pte) { return __pgprot(pte_val(pte) & PAGE_PROT_BITS); } diff --git a/arch/powerpc/include/asm/pte-40x.h b/arch/powerpc/include/asm/pte-40x.h index ec0b0b0d1df9..486b1ef81338 100644 --- a/arch/powerpc/include/asm/pte-40x.h +++ b/arch/powerpc/include/asm/pte-40x.h @@ -38,7 +38,6 @@ */ #define _PAGE_GUARDED 0x001 /* G: page is guarded from prefetch */ -#define _PAGE_FILE 0x001 /* when !present: nonlinear file mapping */ #define _PAGE_PRESENT 0x002 /* software: PTE contains a translation */ #define _PAGE_NO_CACHE 0x004 /* I: caching is inhibited */ #define _PAGE_WRITETHRU 0x008 /* W: caching is write-through */ diff --git a/arch/powerpc/include/asm/pte-44x.h b/arch/powerpc/include/asm/pte-44x.h index 4192b9bad901..36f75fab23f5 100644 --- a/arch/powerpc/include/asm/pte-44x.h +++ b/arch/powerpc/include/asm/pte-44x.h @@ -44,9 +44,6 @@ * - PRESENT *must* be in the bottom three bits because swap 
cache * entries use the top 29 bits for TLB2. * - * - FILE *must* be in the bottom three bits because swap cache - * entries use the top 29 bits for TLB2. - * * - CACHE COHERENT bit (M) has no effect on original PPC440 cores, * because it doesn't support SMP. However, some later 460 variants * have -some- form of SMP support and so I keep the bit there for @@ -68,7 +65,6 @@ * * There are three protection bits available for SWAP entry: * _PAGE_PRESENT - * _PAGE_FILE * _PAGE_HASHPTE (if HW has) * * So those three bits have to be inside of 0-2nd LSB of PTE. @@ -77,7 +73,6 @@ #define _PAGE_PRESENT 0x00000001 /* S: PTE valid */ #define _PAGE_RW 0x00000002 /* S: Write permission */ -#define _PAGE_FILE 0x00000004 /* S: nonlinear file mapping */ #define _PAGE_EXEC 0x00000004 /* H: Execute permission */ #define _PAGE_ACCESSED 0x00000008 /* S: Page referenced */ #define _PAGE_DIRTY 0x00000010 /* S: Page dirty */ diff --git a/arch/powerpc/include/asm/pte-8xx.h b/arch/powerpc/include/asm/pte-8xx.h index d44826e4ff97..f665520c9975 100644 --- a/arch/powerpc/include/asm/pte-8xx.h +++ b/arch/powerpc/include/asm/pte-8xx.h @@ -29,7 +29,6 @@ /* Definitions for 8xx embedded chips. */ #define _PAGE_PRESENT 0x0001 /* Page is valid */ -#define _PAGE_FILE 0x0002 /* when !present: nonlinear file mapping */ #define _PAGE_NO_CACHE 0x0002 /* I: cache inhibit */ #define _PAGE_SHARED 0x0004 /* No ASID (context) compare */ #define _PAGE_SPECIAL 0x0008 /* SW entry, forced to 0 by the TLB miss */ diff --git a/arch/powerpc/include/asm/pte-book3e.h b/arch/powerpc/include/asm/pte-book3e.h index 576ad88104cb..91a704952ca1 100644 --- a/arch/powerpc/include/asm/pte-book3e.h +++ b/arch/powerpc/include/asm/pte-book3e.h @@ -10,7 +10,6 @@ /* Architected bits */ #define _PAGE_PRESENT 0x000001 /* software: pte contains a translation */ -#define _PAGE_FILE 0x000002 /* (!present only) software: pte holds file offset */ #define _PAGE_SW1 0x000002 #define _PAGE_BAP_SR 0x000004 #define _PAGE_BAP_UR 0x000008 diff --git a/arch/powerpc/include/asm/pte-fsl-booke.h b/arch/powerpc/include/asm/pte-fsl-booke.h index 2c12be5f677a..3040c64a302b 100644 --- a/arch/powerpc/include/asm/pte-fsl-booke.h +++ b/arch/powerpc/include/asm/pte-fsl-booke.h @@ -13,14 +13,11 @@ - PRESENT *must* be in the bottom three bits because swap cache entries use the top 29 bits. - - FILE *must* be in the bottom three bits because swap cache - entries use the top 29 bits. 
*/ /* Definitions for FSL Book-E Cores */ #define _PAGE_PRESENT 0x00001 /* S: PTE contains a translation */ #define _PAGE_USER 0x00002 /* S: User page (maps to UR) */ -#define _PAGE_FILE 0x00002 /* S: when !present: nonlinear file mapping */ #define _PAGE_RW 0x00004 /* S: Write permission (SW) */ #define _PAGE_DIRTY 0x00008 /* S: Page dirty */ #define _PAGE_EXEC 0x00010 /* H: SX permission */ diff --git a/arch/powerpc/include/asm/pte-hash32.h b/arch/powerpc/include/asm/pte-hash32.h index 4aad4132d0a8..62cfb0c663bb 100644 --- a/arch/powerpc/include/asm/pte-hash32.h +++ b/arch/powerpc/include/asm/pte-hash32.h @@ -18,7 +18,6 @@ #define _PAGE_PRESENT 0x001 /* software: pte contains a translation */ #define _PAGE_HASHPTE 0x002 /* hash_page has made an HPTE for this pte */ -#define _PAGE_FILE 0x004 /* when !present: nonlinear file mapping */ #define _PAGE_USER 0x004 /* usermode access allowed */ #define _PAGE_GUARDED 0x008 /* G: prohibit speculative access */ #define _PAGE_COHERENT 0x010 /* M: enforce memory coherence (SMP systems) */ diff --git a/arch/powerpc/include/asm/pte-hash64.h b/arch/powerpc/include/asm/pte-hash64.h index 2505d8eab15c..1a66c25eeac2 100644 --- a/arch/powerpc/include/asm/pte-hash64.h +++ b/arch/powerpc/include/asm/pte-hash64.h @@ -16,7 +16,6 @@ */ #define _PAGE_PRESENT 0x0001 /* software: pte contains a translation */ #define _PAGE_USER 0x0002 /* matches one of the PP bits */ -#define _PAGE_FILE 0x0002 /* (!present only) software: pte holds file offset */ #define _PAGE_EXEC 0x0004 /* No execute on POWER4 and newer (we invert) */ #define _PAGE_GUARDED 0x0008 /* We can derive Memory coherence from _PAGE_NO_CACHE */ diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c index 71d084b6f766..3bfa934fff05 100644 --- a/arch/powerpc/mm/pgtable_64.c +++ b/arch/powerpc/mm/pgtable_64.c @@ -807,7 +807,7 @@ pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot) { pmd_t pmd; /* - * For a valid pte, we would have _PAGE_PRESENT or _PAGE_FILE always + * For a valid pte, we would have _PAGE_PRESENT always * set. We use this to check THP page at pmd level. * leaf pte for huge page, bottom two bits != 00 */ diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h index 589f9c65416a..53412712ea5e 100644 --- a/arch/s390/include/asm/pgtable.h +++ b/arch/s390/include/asm/pgtable.h @@ -237,10 +237,10 @@ extern unsigned long MODULES_END; _PAGE_DIRTY | _PAGE_YOUNG) /* - * handle_pte_fault uses pte_present, pte_none and pte_file to find out the - * pte type WITHOUT holding the page table lock. The _PAGE_PRESENT bit - * is used to distinguish present from not-present ptes. It is changed only - * with the page table lock held. + * handle_pte_fault uses pte_present and pte_none to find out the pte type + * WITHOUT holding the page table lock. The _PAGE_PRESENT bit is used to + * distinguish present from not-present ptes. It is changed only with the page + * table lock held. 
* * The following table gives the different possible bit combinations for * the pte hardware and software bits in the last 12 bits of a pte: @@ -267,7 +267,6 @@ extern unsigned long MODULES_END; * * pte_present is true for the bit pattern .xx...xxxxx1, (pte & 0x001) == 0x001 * pte_none is true for the bit pattern .10...xxxx00, (pte & 0x603) == 0x400 - * pte_file is true for the bit pattern .11...xxxxx0, (pte & 0x601) == 0x600 * pte_swap is true for the bit pattern .10...xxxx10, (pte & 0x603) == 0x402 */ @@ -644,13 +643,6 @@ static inline int pte_swap(pte_t pte) == (_PAGE_INVALID | _PAGE_TYPE); } -static inline int pte_file(pte_t pte) -{ - /* Bit pattern: (pte & 0x601) == 0x600 */ - return (pte_val(pte) & (_PAGE_INVALID | _PAGE_PROTECT | _PAGE_PRESENT)) - == (_PAGE_INVALID | _PAGE_PROTECT); -} - static inline int pte_special(pte_t pte) { return (pte_val(pte) & _PAGE_SPECIAL); @@ -1710,19 +1702,6 @@ static inline pte_t mk_swap_pte(unsigned long type, unsigned long offset) #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val }) -#ifndef CONFIG_64BIT -# define PTE_FILE_MAX_BITS 26 -#else /* CONFIG_64BIT */ -# define PTE_FILE_MAX_BITS 59 -#endif /* CONFIG_64BIT */ - -#define pte_to_pgoff(__pte) \ - ((((__pte).pte >> 12) << 7) + (((__pte).pte >> 1) & 0x7f)) - -#define pgoff_to_pte(__off) \ - ((pte_t) { ((((__off) & 0x7f) << 1) + (((__off) >> 7) << 12)) \ - | _PAGE_INVALID | _PAGE_PROTECT }) - #endif /* !__ASSEMBLY__ */ #define kern_addr_valid(addr) (1) diff --git a/arch/score/include/asm/pgtable-bits.h b/arch/score/include/asm/pgtable-bits.h index 7d65a96a82e5..0e5c6f466520 100644 --- a/arch/score/include/asm/pgtable-bits.h +++ b/arch/score/include/asm/pgtable-bits.h @@ -6,7 +6,6 @@ #define _PAGE_WRITE (1<<7) /* implemented in software */ #define _PAGE_PRESENT (1<<9) /* implemented in software */ #define _PAGE_MODIFIED (1<<10) /* implemented in software */ -#define _PAGE_FILE (1<<10) #define _PAGE_GLOBAL (1<<0) #define _PAGE_VALID (1<<1) diff --git a/arch/score/include/asm/pgtable.h b/arch/score/include/asm/pgtable.h index db96ad9afc03..5170ffdea643 100644 --- a/arch/score/include/asm/pgtable.h +++ b/arch/score/include/asm/pgtable.h @@ -90,15 +90,6 @@ static inline void pmd_clear(pmd_t *pmdp) ((pte_t *)page_address(pmd_page(*(dir))) + __pte_offset(address)) #define pte_unmap(pte) ((void)(pte)) -/* - * Bits 9(_PAGE_PRESENT) and 10(_PAGE_FILE)are taken, - * split up 30 bits of offset into this range: - */ -#define PTE_FILE_MAX_BITS 30 -#define pte_to_pgoff(_pte) \ - (((_pte).pte & 0x1ff) | (((_pte).pte >> 11) << 9)) -#define pgoff_to_pte(off) \ - ((pte_t) {((off) & 0x1ff) | (((off) >> 9) << 11) | _PAGE_FILE}) #define __pte_to_swp_entry(pte) \ ((swp_entry_t) { pte_val(pte)}) #define __swp_entry_to_pte(x) ((pte_t) {(x).val}) @@ -169,8 +160,8 @@ static inline pgprot_t pgprot_noncached(pgprot_t _prot) } #define __swp_type(x) ((x).val & 0x1f) -#define __swp_offset(x) ((x).val >> 11) -#define __swp_entry(type, offset) ((swp_entry_t){(type) | ((offset) << 11)}) +#define __swp_offset(x) ((x).val >> 10) +#define __swp_entry(type, offset) ((swp_entry_t){(type) | ((offset) << 10)}) extern unsigned long empty_zero_page; extern unsigned long zero_page_mask; @@ -198,11 +189,6 @@ static inline int pte_young(pte_t pte) return pte_val(pte) & _PAGE_ACCESSED; } -static inline int pte_file(pte_t pte) -{ - return pte_val(pte) & _PAGE_FILE; -} - #define pte_special(pte) (0) static inline pte_t pte_wrprotect(pte_t pte) diff --git 
a/arch/sh/include/asm/pgtable_32.h b/arch/sh/include/asm/pgtable_32.h index 0bce3d81569e..c646e563abce 100644 --- a/arch/sh/include/asm/pgtable_32.h +++ b/arch/sh/include/asm/pgtable_32.h @@ -26,8 +26,6 @@ * and timing control which (together with bit 0) are moved into the * old-style PTEA on the parts that support it. * - * XXX: Leave the _PAGE_FILE and _PAGE_WT overhaul for a rainy day. - * * SH-X2 MMUs and extended PTEs * * SH-X2 supports an extended mode TLB with split data arrays due to the @@ -51,7 +49,6 @@ #define _PAGE_PRESENT 0x100 /* V-bit : page is valid */ #define _PAGE_PROTNONE 0x200 /* software: if not present */ #define _PAGE_ACCESSED 0x400 /* software: page referenced */ -#define _PAGE_FILE _PAGE_WT /* software: pagecache or swap? */ #define _PAGE_SPECIAL 0x800 /* software: special page */ #define _PAGE_SZ_MASK (_PAGE_SZ0 | _PAGE_SZ1) @@ -105,14 +102,13 @@ static inline unsigned long copy_ptea_attributes(unsigned long x) /* Mask which drops unused bits from the PTEL value */ #if defined(CONFIG_CPU_SH3) #define _PAGE_CLEAR_FLAGS (_PAGE_PROTNONE | _PAGE_ACCESSED| \ - _PAGE_FILE | _PAGE_SZ1 | \ - _PAGE_HW_SHARED) + _PAGE_SZ1 | _PAGE_HW_SHARED) #elif defined(CONFIG_X2TLB) /* Get rid of the legacy PR/SZ bits when using extended mode */ #define _PAGE_CLEAR_FLAGS (_PAGE_PROTNONE | _PAGE_ACCESSED | \ - _PAGE_FILE | _PAGE_PR_MASK | _PAGE_SZ_MASK) + _PAGE_PR_MASK | _PAGE_SZ_MASK) #else -#define _PAGE_CLEAR_FLAGS (_PAGE_PROTNONE | _PAGE_ACCESSED | _PAGE_FILE) +#define _PAGE_CLEAR_FLAGS (_PAGE_PROTNONE | _PAGE_ACCESSED) #endif #define _PAGE_FLAGS_HARDWARE_MASK (phys_addr_mask() & ~(_PAGE_CLEAR_FLAGS)) @@ -343,7 +339,6 @@ static inline void set_pte(pte_t *ptep, pte_t pte) #define pte_not_present(pte) (!((pte).pte_low & _PAGE_PRESENT)) #define pte_dirty(pte) ((pte).pte_low & _PAGE_DIRTY) #define pte_young(pte) ((pte).pte_low & _PAGE_ACCESSED) -#define pte_file(pte) ((pte).pte_low & _PAGE_FILE) #define pte_special(pte) ((pte).pte_low & _PAGE_SPECIAL) #ifdef CONFIG_X2TLB @@ -445,7 +440,6 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot) * Encode and de-code a swap entry * * Constraints: - * _PAGE_FILE at bit 0 * _PAGE_PRESENT at bit 8 * _PAGE_PROTNONE at bit 9 * @@ -453,9 +447,7 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot) * swap offset into bits 10:30. For the 64-bit PTE case, we keep the * preserved bits in the low 32-bits and use the upper 32 as the swap * offset (along with a 5-bit type), following the same approach as x86 - * PAE. This keeps the logic quite simple, and allows for a full 32 - * PTE_FILE_MAX_BITS, as opposed to the 29-bits we're constrained with - * in the pte_low case. + * PAE. This keeps the logic quite simple. 
* * As is evident by the Alpha code, if we ever get a 64-bit unsigned * long (swp_entry_t) to match up with the 64-bit PTEs, this all becomes @@ -471,13 +463,6 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot) #define __pte_to_swp_entry(pte) ((swp_entry_t){ (pte).pte_high }) #define __swp_entry_to_pte(x) ((pte_t){ 0, (x).val }) -/* - * Encode and decode a nonlinear file mapping entry - */ -#define pte_to_pgoff(pte) ((pte).pte_high) -#define pgoff_to_pte(off) ((pte_t) { _PAGE_FILE, (off) }) - -#define PTE_FILE_MAX_BITS 32 #else #define __swp_type(x) ((x).val & 0xff) #define __swp_offset(x) ((x).val >> 10) @@ -485,13 +470,6 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot) #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) >> 1 }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val << 1 }) - -/* - * Encode and decode a nonlinear file mapping entry - */ -#define PTE_FILE_MAX_BITS 29 -#define pte_to_pgoff(pte) (pte_val(pte) >> 1) -#define pgoff_to_pte(off) ((pte_t) { ((off) << 1) | _PAGE_FILE }) #endif #endif /* __ASSEMBLY__ */ diff --git a/arch/sh/include/asm/pgtable_64.h b/arch/sh/include/asm/pgtable_64.h index dda8c82601b9..07424968df62 100644 --- a/arch/sh/include/asm/pgtable_64.h +++ b/arch/sh/include/asm/pgtable_64.h @@ -107,7 +107,6 @@ static __inline__ void set_pte(pte_t *pteptr, pte_t pteval) #define _PAGE_DEVICE 0x001 /* CB0: if uncacheable, 1->device (i.e. no write-combining or reordering at bus level) */ #define _PAGE_CACHABLE 0x002 /* CB1: uncachable/cachable */ #define _PAGE_PRESENT 0x004 /* software: page referenced */ -#define _PAGE_FILE 0x004 /* software: only when !present */ #define _PAGE_SIZE0 0x008 /* SZ0-bit : size of page */ #define _PAGE_SIZE1 0x010 /* SZ1-bit : size of page */ #define _PAGE_SHARED 0x020 /* software: reflects PTEH's SH */ @@ -129,7 +128,7 @@ static __inline__ void set_pte(pte_t *pteptr, pte_t pteval) #define _PAGE_WIRED _PAGE_EXT(0x001) /* software: wire the tlb entry */ #define _PAGE_SPECIAL _PAGE_EXT(0x002) -#define _PAGE_CLEAR_FLAGS (_PAGE_PRESENT | _PAGE_FILE | _PAGE_SHARED | \ +#define _PAGE_CLEAR_FLAGS (_PAGE_PRESENT | _PAGE_SHARED | \ _PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_WIRED) /* Mask which drops software flags */ @@ -260,7 +259,6 @@ static __inline__ void set_pte(pte_t *pteptr, pte_t pteval) */ static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY; } static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED; } -static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; } static inline int pte_write(pte_t pte) { return pte_val(pte) & _PAGE_WRITE; } static inline int pte_special(pte_t pte){ return pte_val(pte) & _PAGE_SPECIAL; } @@ -304,11 +302,6 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot) #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val }) -/* Encode and decode a nonlinear file mapping entry */ -#define PTE_FILE_MAX_BITS 29 -#define pte_to_pgoff(pte) (pte_val(pte)) -#define pgoff_to_pte(off) ((pte_t) { (off) | _PAGE_FILE }) - #endif /* !__ASSEMBLY__ */ #define pfn_pte(pfn, prot) __pte(((pfn) << PAGE_SHIFT) | pgprot_val(prot)) diff --git a/arch/sparc/include/asm/pgtable_32.h b/arch/sparc/include/asm/pgtable_32.h index b9b91ae19fe1..b2f7dc46a7d1 100644 --- a/arch/sparc/include/asm/pgtable_32.h +++ b/arch/sparc/include/asm/pgtable_32.h @@ -221,14 +221,6 @@ static inline int pte_young(pte_t pte) return pte_val(pte) & SRMMU_REF; } -/* - * The following only work if pte_present() 
is not true. - */ -static inline int pte_file(pte_t pte) -{ - return pte_val(pte) & SRMMU_FILE; -} - static inline int pte_special(pte_t pte) { return 0; @@ -375,22 +367,6 @@ static inline swp_entry_t __swp_entry(unsigned long type, unsigned long offset) #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val }) -/* file-offset-in-pte helpers */ -static inline unsigned long pte_to_pgoff(pte_t pte) -{ - return pte_val(pte) >> SRMMU_PTE_FILE_SHIFT; -} - -static inline pte_t pgoff_to_pte(unsigned long pgoff) -{ - return __pte((pgoff << SRMMU_PTE_FILE_SHIFT) | SRMMU_FILE); -} - -/* - * This is made a constant because mm/fremap.c required a constant. - */ -#define PTE_FILE_MAX_BITS 24 - static inline unsigned long __get_phys (unsigned long addr) { diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h index bfeb626085ac..66a1c3b5a153 100644 --- a/arch/sparc/include/asm/pgtable_64.h +++ b/arch/sparc/include/asm/pgtable_64.h @@ -137,7 +137,6 @@ bool kern_addr_valid(unsigned long addr); #define _PAGE_SOFT_4U _AC(0x0000000000001F80,UL) /* Software bits: */ #define _PAGE_EXEC_4U _AC(0x0000000000001000,UL) /* Executable SW bit */ #define _PAGE_MODIFIED_4U _AC(0x0000000000000800,UL) /* Modified (dirty) */ -#define _PAGE_FILE_4U _AC(0x0000000000000800,UL) /* Pagecache page */ #define _PAGE_ACCESSED_4U _AC(0x0000000000000400,UL) /* Accessed (ref'd) */ #define _PAGE_READ_4U _AC(0x0000000000000200,UL) /* Readable SW Bit */ #define _PAGE_WRITE_4U _AC(0x0000000000000100,UL) /* Writable SW Bit */ @@ -167,7 +166,6 @@ bool kern_addr_valid(unsigned long addr); #define _PAGE_EXEC_4V _AC(0x0000000000000080,UL) /* Executable Page */ #define _PAGE_W_4V _AC(0x0000000000000040,UL) /* Writable */ #define _PAGE_SOFT_4V _AC(0x0000000000000030,UL) /* Software bits */ -#define _PAGE_FILE_4V _AC(0x0000000000000020,UL) /* Pagecache page */ #define _PAGE_PRESENT_4V _AC(0x0000000000000010,UL) /* Present */ #define _PAGE_RESV_4V _AC(0x0000000000000008,UL) /* Reserved */ #define _PAGE_SZ16GB_4V _AC(0x0000000000000007,UL) /* 16GB Page */ @@ -332,22 +330,6 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot) } #endif -static inline pte_t pgoff_to_pte(unsigned long off) -{ - off <<= PAGE_SHIFT; - - __asm__ __volatile__( - "\n661: or %0, %2, %0\n" - " .section .sun4v_1insn_patch, \"ax\"\n" - " .word 661b\n" - " or %0, %3, %0\n" - " .previous\n" - : "=r" (off) - : "0" (off), "i" (_PAGE_FILE_4U), "i" (_PAGE_FILE_4V)); - - return __pte(off); -} - static inline pgprot_t pgprot_noncached(pgprot_t prot) { unsigned long val = pgprot_val(prot); @@ -609,22 +591,6 @@ static inline unsigned long pte_exec(pte_t pte) return (pte_val(pte) & mask); } -static inline unsigned long pte_file(pte_t pte) -{ - unsigned long val = pte_val(pte); - - __asm__ __volatile__( - "\n661: and %0, %2, %0\n" - " .section .sun4v_1insn_patch, \"ax\"\n" - " .word 661b\n" - " and %0, %3, %0\n" - " .previous\n" - : "=r" (val) - : "0" (val), "i" (_PAGE_FILE_4U), "i" (_PAGE_FILE_4V)); - - return val; -} - static inline unsigned long pte_present(pte_t pte) { unsigned long val = pte_val(pte); @@ -964,12 +930,6 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp); #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val }) -/* File offset in PTE support. 
*/ -unsigned long pte_file(pte_t); -#define pte_to_pgoff(pte) (pte_val(pte) >> PAGE_SHIFT) -pte_t pgoff_to_pte(unsigned long); -#define PTE_FILE_MAX_BITS (64UL - PAGE_SHIFT - 1UL) - int page_in_phys_avail(unsigned long paddr); /* diff --git a/arch/sparc/include/asm/pgtsrmmu.h b/arch/sparc/include/asm/pgtsrmmu.h index 79da17866fa8..ae51a111a8c7 100644 --- a/arch/sparc/include/asm/pgtsrmmu.h +++ b/arch/sparc/include/asm/pgtsrmmu.h @@ -80,10 +80,6 @@ #define SRMMU_PRIV 0x1c #define SRMMU_PRIV_RDONLY 0x18 -#define SRMMU_FILE 0x40 /* Implemented in software */ - -#define SRMMU_PTE_FILE_SHIFT 8 /* == 32-PTE_FILE_MAX_BITS */ - #define SRMMU_CHG_MASK (0xffffff00 | SRMMU_REF | SRMMU_DIRTY) /* SRMMU swap entry encoding @@ -94,13 +90,13 @@ * oooooooooooooooooootttttRRRRRRRR * fedcba9876543210fedcba9876543210 * - * The bottom 8 bits are reserved for protection and status bits, especially - * FILE and PRESENT. + * The bottom 7 bits are reserved for protection and status bits, especially + * PRESENT. */ #define SRMMU_SWP_TYPE_MASK 0x1f -#define SRMMU_SWP_TYPE_SHIFT SRMMU_PTE_FILE_SHIFT -#define SRMMU_SWP_OFF_MASK 0x7ffff -#define SRMMU_SWP_OFF_SHIFT (SRMMU_PTE_FILE_SHIFT + 5) +#define SRMMU_SWP_TYPE_SHIFT 7 +#define SRMMU_SWP_OFF_MASK 0xfffff +#define SRMMU_SWP_OFF_SHIFT (SRMMU_SWP_TYPE_SHIFT + 5) /* Some day I will implement true fine grained access bits for * user pages because the SRMMU gives us the capabilities to diff --git a/arch/tile/include/asm/pgtable.h b/arch/tile/include/asm/pgtable.h index 33587f16c152..b678c265655f 100644 --- a/arch/tile/include/asm/pgtable.h +++ b/arch/tile/include/asm/pgtable.h @@ -284,17 +284,6 @@ static inline pte_t pfn_pte(unsigned long pfn, pgprot_t prot) extern void start_mm_caching(struct mm_struct *mm); extern void check_mm_caching(struct mm_struct *prev, struct mm_struct *next); -/* - * Support non-linear file mappings (see sys_remap_file_pages). - * This is defined by CLIENT1 set but CLIENT0 and _PAGE_PRESENT clear, and the - * file offset in the 32 high bits. - */ -#define _PAGE_FILE HV_PTE_CLIENT1 -#define PTE_FILE_MAX_BITS 32 -#define pte_file(pte) (hv_pte_get_client1(pte) && !hv_pte_get_client0(pte)) -#define pte_to_pgoff(pte) ((pte).val >> 32) -#define pgoff_to_pte(off) ((pte_t) { (((long long)(off)) << 32) | _PAGE_FILE }) - /* * Encode and de-code a swap entry (see ). * We put the swap file type+offset in the 32 high bits; diff --git a/arch/tile/mm/homecache.c b/arch/tile/mm/homecache.c index 33294fdc402e..c7454a905d04 100644 --- a/arch/tile/mm/homecache.c +++ b/arch/tile/mm/homecache.c @@ -265,10 +265,6 @@ static int pte_to_home(pte_t pte) /* Update the home of a PTE if necessary (can also be used for a pgprot_t). */ pte_t pte_set_home(pte_t pte, int home) { - /* Check for non-linear file mapping "PTEs" and pass them through. */ - if (pte_file(pte)) - return pte; - #if CHIP_HAS_MMIO() /* Check for MMIO mappings and pass them through. 
*/ if (hv_pte_get_mode(pte) == HV_PTE_MODE_MMIO) diff --git a/arch/um/include/asm/pgtable-2level.h b/arch/um/include/asm/pgtable-2level.h index f534b73e753e..7afe86035fa7 100644 --- a/arch/um/include/asm/pgtable-2level.h +++ b/arch/um/include/asm/pgtable-2level.h @@ -41,13 +41,4 @@ static inline void pgd_mkuptodate(pgd_t pgd) { } #define pfn_pte(pfn, prot) __pte(pfn_to_phys(pfn) | pgprot_val(prot)) #define pfn_pmd(pfn, prot) __pmd(pfn_to_phys(pfn) | pgprot_val(prot)) -/* - * Bits 0 through 4 are taken - */ -#define PTE_FILE_MAX_BITS 27 - -#define pte_to_pgoff(pte) (pte_val(pte) >> 5) - -#define pgoff_to_pte(off) ((pte_t) { ((off) << 5) + _PAGE_FILE }) - #endif diff --git a/arch/um/include/asm/pgtable-3level.h b/arch/um/include/asm/pgtable-3level.h index 0032f9212e74..344c559c0a17 100644 --- a/arch/um/include/asm/pgtable-3level.h +++ b/arch/um/include/asm/pgtable-3level.h @@ -112,25 +112,5 @@ static inline pmd_t pfn_pmd(pfn_t page_nr, pgprot_t pgprot) return __pmd((page_nr << PAGE_SHIFT) | pgprot_val(pgprot)); } -/* - * Bits 0 through 3 are taken in the low part of the pte, - * put the 32 bits of offset into the high part. - */ -#define PTE_FILE_MAX_BITS 32 - -#ifdef CONFIG_64BIT - -#define pte_to_pgoff(p) ((p).pte >> 32) - -#define pgoff_to_pte(off) ((pte_t) { ((off) << 32) | _PAGE_FILE }) - -#else - -#define pte_to_pgoff(pte) ((pte).pte_high) - -#define pgoff_to_pte(off) ((pte_t) { _PAGE_FILE, (off) }) - -#endif - #endif diff --git a/arch/um/include/asm/pgtable.h b/arch/um/include/asm/pgtable.h index bf974f712af7..2324b624f195 100644 --- a/arch/um/include/asm/pgtable.h +++ b/arch/um/include/asm/pgtable.h @@ -18,7 +18,6 @@ #define _PAGE_ACCESSED 0x080 #define _PAGE_DIRTY 0x100 /* If _PAGE_PRESENT is clear, we use these: */ -#define _PAGE_FILE 0x008 /* nonlinear file mapping, saved PTE; unset:swap */ #define _PAGE_PROTNONE 0x010 /* if the user mapped it with PROT_NONE; pte_present gives true */ @@ -151,14 +150,6 @@ static inline int pte_write(pte_t pte) !(pte_get_bits(pte, _PAGE_PROTNONE))); } -/* - * The following only works if pte_present() is not true. - */ -static inline int pte_file(pte_t pte) -{ - return pte_get_bits(pte, _PAGE_FILE); -} - static inline int pte_dirty(pte_t pte) { return pte_get_bits(pte, _PAGE_DIRTY); diff --git a/arch/unicore32/include/asm/pgtable-hwdef.h b/arch/unicore32/include/asm/pgtable-hwdef.h index 7314e859cca0..e37fa471c2be 100644 --- a/arch/unicore32/include/asm/pgtable-hwdef.h +++ b/arch/unicore32/include/asm/pgtable-hwdef.h @@ -44,7 +44,6 @@ #define PTE_TYPE_INVALID (3 << 0) #define PTE_PRESENT (1 << 2) -#define PTE_FILE (1 << 3) /* only when !PRESENT */ #define PTE_YOUNG (1 << 3) #define PTE_DIRTY (1 << 4) #define PTE_CACHEABLE (1 << 5) diff --git a/arch/unicore32/include/asm/pgtable.h b/arch/unicore32/include/asm/pgtable.h index ed6f7d000fba..818d0f5598e3 100644 --- a/arch/unicore32/include/asm/pgtable.h +++ b/arch/unicore32/include/asm/pgtable.h @@ -283,20 +283,6 @@ extern pgd_t swapper_pg_dir[PTRS_PER_PGD]; #define MAX_SWAPFILES_CHECK() \ BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > __SWP_TYPE_BITS) -/* - * Encode and decode a file entry. 
File entries are stored in the Linux - * page tables as follows: - * - * 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 - * 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 - * <----------------------- offset ----------------------> 1 0 0 0 - */ -#define pte_file(pte) (pte_val(pte) & PTE_FILE) -#define pte_to_pgoff(x) (pte_val(x) >> 4) -#define pgoff_to_pte(x) __pte(((x) << 4) | PTE_FILE) - -#define PTE_FILE_MAX_BITS 28 - /* Needs to be defined here and not in linux/mm.h, as it is arch dependent */ /* FIXME: this is not correct */ #define kern_addr_valid(addr) (1) diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h index 8f45ed429ba0..f3c8dc4083af 100644 --- a/arch/x86/include/asm/cpufeature.h +++ b/arch/x86/include/asm/cpufeature.h @@ -191,11 +191,20 @@ #define X86_FEATURE_USE_IBPB (7*32+12) /* "" Indirect Branch Prediction Barrier enabled */ #define X86_FEATURE_USE_IBRS_FW (7*32+13) /* "" Use IBRS during runtime firmware calls */ +#define X86_FEATURE_SPEC_STORE_BYPASS_DISABLE (7*32+14) /* "" Disable Speculative Store Bypass. */ +#define X86_FEATURE_LS_CFG_SSBD (7*32+15) /* "" AMD SSBD implementation */ +#define X86_FEATURE_IBRS (7*32+16) /* Indirect Branch Restricted Speculation */ +#define X86_FEATURE_IBPB (7*32+17) /* Indirect Branch Prediction Barrier */ +#define X86_FEATURE_STIBP (7*32+18) /* Single Thread Indirect Branch Predictors */ +#define X86_FEATURE_MSR_SPEC_CTRL (7*32+19) /* "" MSR SPEC_CTRL is implemented */ +#define X86_FEATURE_SSBD (7*32+20) /* Speculative Store Bypass Disable */ +#define X86_FEATURE_ZEN (7*32+21) /* "" CPU is AMD family 0x17 (Zen) */ +#define X86_FEATURE_L1TF_PTEINV (7*32+22) /* "" L1TF workaround PTE inversion */ #define X86_FEATURE_RETPOLINE (7*32+29) /* "" Generic Retpoline mitigation for Spectre variant 2 */ #define X86_FEATURE_RETPOLINE_AMD (7*32+30) /* "" AMD Retpoline mitigation for Spectre variant 2 */ /* Because the ALTERNATIVE scheme is for members of the X86_FEATURE club... 
*/ -#define X86_FEATURE_KAISER (7*32+31) /* "" CONFIG_PAGE_TABLE_ISOLATION w/o nokaiser */ +#define X86_FEATURE_KAISER (7*32+31) /* CONFIG_PAGE_TABLE_ISOLATION w/o nokaiser */ /* Virtualization flags: Linux defined, word 8 */ #define X86_FEATURE_TPR_SHADOW (8*32+ 0) /* Intel TPR Shadow */ @@ -241,11 +250,13 @@ #define X86_FEATURE_SPEC_CTRL (10*32+26) /* "" Speculation Control (IBRS + IBPB) */ #define X86_FEATURE_INTEL_STIBP (10*32+27) /* "" Single Thread Indirect Branch Predictors */ #define X86_FEATURE_ARCH_CAPABILITIES (10*32+29) /* IA32_ARCH_CAPABILITIES MSR (Intel) */ +#define X86_FEATURE_SPEC_CTRL_SSBD (10*32+31) /* "" Speculative Store Bypass Disable */ /* AMD-defined CPU features, CPUID level 0x80000008 (EBX), word 11 */ -#define X86_FEATURE_IBPB (11*32+12) /* Indirect Branch Prediction Barrier */ -#define X86_FEATURE_IBRS (11*32+14) /* Indirect Branch Restricted Speculation */ -#define X86_FEATURE_STIBP (11*32+15) /* Single Thread Indirect Branch Predictors */ +#define X86_FEATURE_AMD_IBPB (11*32+12) /* Indirect Branch Prediction Barrier */ +#define X86_FEATURE_AMD_IBRS (11*32+14) /* Indirect Branch Restricted Speculation */ +#define X86_FEATURE_AMD_STIBP (11*32+15) /* Single Thread Indirect Branch Predictors */ +#define X86_FEATURE_VIRT_SSBD (11*32+25) /* Virtualized Speculative Store Bypass Disable */ /* * BUG word(s) @@ -260,6 +271,8 @@ #define X86_BUG_CPU_MELTDOWN X86_BUG(5) /* CPU is affected by meltdown attack and needs kernel page table isolation */ #define X86_BUG_SPECTRE_V1 X86_BUG(6) /* CPU is affected by Spectre variant 1 attack with conditional branches */ #define X86_BUG_SPECTRE_V2 X86_BUG(7) /* CPU is affected by Spectre variant 2 attack with indirect branches */ +#define X86_BUG_SPEC_STORE_BYPASS X86_BUG(8) /* CPU is affected by speculative store bypass attack */ +#define X86_BUG_L1TF X86_BUG(9) /* CPU is affected by L1 Terminal Fault */ #if defined(__KERNEL__) && !defined(__ASSEMBLY__) diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h index b8237d8a1e0c..facd048d5c72 100644 --- a/arch/x86/include/asm/io.h +++ b/arch/x86/include/asm/io.h @@ -340,4 +340,10 @@ extern void arch_phys_wc_del(int handle); #define arch_phys_wc_add arch_phys_wc_add #endif +#ifdef CONFIG_X86_PAT +extern int arch_io_reserve_memtype_wc(resource_size_t start, resource_size_t size); +extern void arch_io_free_memtype_wc(resource_size_t start, resource_size_t size); +#define arch_io_reserve_memtype_wc arch_io_reserve_memtype_wc +#endif + #endif /* _ASM_X86_IO_H */ diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 4af016445a05..f695719a41bd 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -671,6 +671,7 @@ struct kvm_x86_ops { int (*hardware_setup)(void); /* __init */ void (*hardware_unsetup)(void); /* __exit */ bool (*cpu_has_accelerated_tpr)(void); + bool (*has_emulated_msr)(int index); void (*cpuid_update)(struct kvm_vcpu *vcpu); /* Create, but do not attach this VCPU */ diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h index b87c91d5458e..ba2bed5258d3 100644 --- a/arch/x86/include/asm/nospec-branch.h +++ b/arch/x86/include/asm/nospec-branch.h @@ -172,6 +172,14 @@ enum spectre_v2_mitigation { SPECTRE_V2_IBRS, }; +/* The Speculative Store Bypass disable variants */ +enum ssb_mitigation { + SPEC_STORE_BYPASS_NONE, + SPEC_STORE_BYPASS_DISABLE, + SPEC_STORE_BYPASS_PRCTL, + SPEC_STORE_BYPASS_SECCOMP, +}; + extern char __indirect_thunk_start[]; extern char 
__indirect_thunk_end[]; @@ -195,22 +203,27 @@ static inline void vmexit_fill_RSB(void) #endif } -#define alternative_msr_write(_msr, _val, _feature) \ - asm volatile(ALTERNATIVE("", \ - "movl %[msr], %%ecx\n\t" \ - "movl %[val], %%eax\n\t" \ - "movl $0, %%edx\n\t" \ - "wrmsr", \ - _feature) \ - : : [msr] "i" (_msr), [val] "i" (_val) \ - : "eax", "ecx", "edx", "memory") +static __always_inline +void alternative_msr_write(unsigned int msr, u64 val, unsigned int feature) +{ + asm volatile(ALTERNATIVE("", "wrmsr", %c[feature]) + : : "c" (msr), + "a" ((u32)val), + "d" ((u32)(val >> 32)), + [feature] "i" (feature) + : "memory"); +} static inline void indirect_branch_prediction_barrier(void) { - alternative_msr_write(MSR_IA32_PRED_CMD, PRED_CMD_IBPB, - X86_FEATURE_USE_IBPB); + u64 val = PRED_CMD_IBPB; + + alternative_msr_write(MSR_IA32_PRED_CMD, val, X86_FEATURE_USE_IBPB); } +/* The Intel SPEC CTRL MSR base value cache */ +extern u64 x86_spec_ctrl_base; + /* * With retpoline, we must use IBRS to restrict branch prediction * before calling into firmware. @@ -219,14 +232,18 @@ static inline void indirect_branch_prediction_barrier(void) */ #define firmware_restrict_branch_speculation_start() \ do { \ + u64 val = x86_spec_ctrl_base | SPEC_CTRL_IBRS; \ + \ preempt_disable(); \ - alternative_msr_write(MSR_IA32_SPEC_CTRL, SPEC_CTRL_IBRS, \ + alternative_msr_write(MSR_IA32_SPEC_CTRL, val, \ X86_FEATURE_USE_IBRS_FW); \ } while (0) #define firmware_restrict_branch_speculation_end() \ do { \ - alternative_msr_write(MSR_IA32_SPEC_CTRL, 0, \ + u64 val = x86_spec_ctrl_base; \ + \ + alternative_msr_write(MSR_IA32_SPEC_CTRL, val, \ X86_FEATURE_USE_IBRS_FW); \ preempt_enable(); \ } while (0) diff --git a/arch/x86/include/asm/page_32_types.h b/arch/x86/include/asm/page_32_types.h index 3a52ee0e726d..bfceb5cc6347 100644 --- a/arch/x86/include/asm/page_32_types.h +++ b/arch/x86/include/asm/page_32_types.h @@ -27,8 +27,13 @@ #define N_EXCEPTION_STACKS 1 #ifdef CONFIG_X86_PAE -/* 44=32+12, the limit we can fit into an unsigned long pfn */ -#define __PHYSICAL_MASK_SHIFT 44 +/* + * This is beyond the 44 bit limit imposed by the 32bit long pfns, + * but we need the full mask to make sure inverted PROT_NONE + * entries have all the host bits set in a guest. + * The real limit is still 44 bits. + */ +#define __PHYSICAL_MASK_SHIFT 52 #define __VIRTUAL_MASK_SHIFT 32 #else /* !CONFIG_X86_PAE */ diff --git a/arch/x86/include/asm/pgtable-2level.h b/arch/x86/include/asm/pgtable-2level.h index 206a87fdd22d..89c50332a71e 100644 --- a/arch/x86/include/asm/pgtable-2level.h +++ b/arch/x86/include/asm/pgtable-2level.h @@ -62,44 +62,8 @@ static inline unsigned long pte_bitop(unsigned long value, unsigned int rightshi return ((value >> rightshift) & mask) << leftshift; } -/* - * Bits _PAGE_BIT_PRESENT, _PAGE_BIT_FILE and _PAGE_BIT_PROTNONE are taken, - * split up the 29 bits of offset into this range. 
- */ -#define PTE_FILE_MAX_BITS 29 -#define PTE_FILE_SHIFT1 (_PAGE_BIT_PRESENT + 1) -#define PTE_FILE_SHIFT2 (_PAGE_BIT_FILE + 1) -#define PTE_FILE_SHIFT3 (_PAGE_BIT_PROTNONE + 1) -#define PTE_FILE_BITS1 (PTE_FILE_SHIFT2 - PTE_FILE_SHIFT1 - 1) -#define PTE_FILE_BITS2 (PTE_FILE_SHIFT3 - PTE_FILE_SHIFT2 - 1) - -#define PTE_FILE_MASK1 ((1U << PTE_FILE_BITS1) - 1) -#define PTE_FILE_MASK2 ((1U << PTE_FILE_BITS2) - 1) - -#define PTE_FILE_LSHIFT2 (PTE_FILE_BITS1) -#define PTE_FILE_LSHIFT3 (PTE_FILE_BITS1 + PTE_FILE_BITS2) - -static __always_inline pgoff_t pte_to_pgoff(pte_t pte) -{ - return (pgoff_t) - (pte_bitop(pte.pte_low, PTE_FILE_SHIFT1, PTE_FILE_MASK1, 0) + - pte_bitop(pte.pte_low, PTE_FILE_SHIFT2, PTE_FILE_MASK2, PTE_FILE_LSHIFT2) + - pte_bitop(pte.pte_low, PTE_FILE_SHIFT3, -1UL, PTE_FILE_LSHIFT3)); -} - -static __always_inline pte_t pgoff_to_pte(pgoff_t off) -{ - return (pte_t){ - .pte_low = - pte_bitop(off, 0, PTE_FILE_MASK1, PTE_FILE_SHIFT1) + - pte_bitop(off, PTE_FILE_LSHIFT2, PTE_FILE_MASK2, PTE_FILE_SHIFT2) + - pte_bitop(off, PTE_FILE_LSHIFT3, -1UL, PTE_FILE_SHIFT3) + - _PAGE_FILE, - }; -} - /* Encode and de-code a swap entry */ -#define SWP_TYPE_BITS (_PAGE_BIT_FILE - _PAGE_BIT_PRESENT - 1) +#define SWP_TYPE_BITS 5 #define SWP_OFFSET_SHIFT (_PAGE_BIT_PROTNONE + 1) #define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > SWP_TYPE_BITS) @@ -113,4 +77,21 @@ static __always_inline pte_t pgoff_to_pte(pgoff_t off) #define __pte_to_swp_entry(pte) ((swp_entry_t) { (pte).pte_low }) #define __swp_entry_to_pte(x) ((pte_t) { .pte = (x).val }) +/* No inverted PFNs on 2 level page tables */ + +static inline u64 protnone_mask(u64 val) +{ + return 0; +} + +static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask) +{ + return val; +} + +static inline bool __pte_needs_invert(u64 val) +{ + return false; +} + #endif /* _ASM_X86_PGTABLE_2LEVEL_H */ diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h index 81bb91b49a88..5c686382d84b 100644 --- a/arch/x86/include/asm/pgtable-3level.h +++ b/arch/x86/include/asm/pgtable-3level.h @@ -176,24 +176,45 @@ static inline pmd_t native_pmdp_get_and_clear(pmd_t *pmdp) #define native_pmdp_get_and_clear(xp) native_local_pmdp_get_and_clear(xp) #endif -/* - * Bits 0, 6 and 7 are taken in the low part of the pte, - * put the 32 bits of offset into the high part. - * - * For soft-dirty tracking 11 bit is taken from - * the low part of pte as well. - */ -#define pte_to_pgoff(pte) ((pte).pte_high) -#define pgoff_to_pte(off) \ - ((pte_t) { { .pte_low = _PAGE_FILE, .pte_high = (off) } }) -#define PTE_FILE_MAX_BITS 32 - /* Encode and de-code a swap entry */ +#define SWP_TYPE_BITS 5 + +#define SWP_OFFSET_FIRST_BIT (_PAGE_BIT_PROTNONE + 1) + +/* We always extract/encode the offset by shifting it all the way up, and then down again */ +#define SWP_OFFSET_SHIFT (SWP_OFFSET_FIRST_BIT + SWP_TYPE_BITS) + #define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > 5) #define __swp_type(x) (((x).val) & 0x1f) #define __swp_offset(x) ((x).val >> 5) #define __swp_entry(type, offset) ((swp_entry_t){(type) | (offset) << 5}) -#define __pte_to_swp_entry(pte) ((swp_entry_t){ (pte).pte_high }) -#define __swp_entry_to_pte(x) ((pte_t){ { .pte_high = (x).val } }) + +/* + * Normally, __swp_entry() converts from arch-independent swp_entry_t to + * arch-dependent swp_entry_t, and __swp_entry_to_pte() just stores the result + * to pte. But here we have 32bit swp_entry_t and 64bit pte, and need to use the + * whole 64 bits. 
Thus, we shift the "real" arch-dependent conversion to + * __swp_entry_to_pte() through the following helper macro based on 64bit + * __swp_entry(). + */ +#define __swp_pteval_entry(type, offset) ((pteval_t) { \ + (~(pteval_t)(offset) << SWP_OFFSET_SHIFT >> SWP_TYPE_BITS) \ + | ((pteval_t)(type) << (64 - SWP_TYPE_BITS)) }) + +#define __swp_entry_to_pte(x) ((pte_t){ .pte = \ + __swp_pteval_entry(__swp_type(x), __swp_offset(x)) }) +/* + * Analogously, __pte_to_swp_entry() doesn't just extract the arch-dependent + * swp_entry_t, but also has to convert it from 64bit to the 32bit + * intermediate representation, using the following macros based on 64bit + * __swp_type() and __swp_offset(). + */ +#define __pteval_swp_type(x) ((unsigned long)((x).pte >> (64 - SWP_TYPE_BITS))) +#define __pteval_swp_offset(x) ((unsigned long)(~((x).pte) << SWP_TYPE_BITS >> SWP_OFFSET_SHIFT)) + +#define __pte_to_swp_entry(pte) (__swp_entry(__pteval_swp_type(pte), \ + __pteval_swp_offset(pte))) + +#include <asm/pgtable-invert.h> #endif /* _ASM_X86_PGTABLE_3LEVEL_H */ diff --git a/arch/x86/include/asm/pgtable-invert.h b/arch/x86/include/asm/pgtable-invert.h new file mode 100644 index 000000000000..a0c1525f1b6f --- /dev/null +++ b/arch/x86/include/asm/pgtable-invert.h @@ -0,0 +1,41 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _ASM_PGTABLE_INVERT_H +#define _ASM_PGTABLE_INVERT_H 1 + +#ifndef __ASSEMBLY__ + +/* + * A clear pte value is special, and doesn't get inverted. + * + * Note that even users that only pass a pgprot_t (rather + * than a full pte) won't trigger the special zero case, + * because even PAGE_NONE has _PAGE_PROTNONE | _PAGE_ACCESSED + * set. So the all zero case really is limited to just the + * cleared page table entry case. + */ +static inline bool __pte_needs_invert(u64 val) +{ + return val && !(val & _PAGE_PRESENT); +} + +/* Get a mask to xor with the page table entry to get the correct pfn. */ +static inline u64 protnone_mask(u64 val) +{ + return __pte_needs_invert(val) ? ~0ull : 0; +} + +static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask) +{ + /* + * When a PTE transitions from NONE to !NONE or vice-versa + * invert the PFN part to stop speculation. + * pte_pfn undoes this when needed.
+ */ + if (__pte_needs_invert(oldval) != __pte_needs_invert(val)) + val = (val & ~mask) | (~val & mask); + return val; +} + +#endif /* __ASSEMBLY__ */ + +#endif diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 56bae3eedc70..5a102146341b 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -115,11 +115,6 @@ static inline int pte_write(pte_t pte) return pte_flags(pte) & _PAGE_RW; } -static inline int pte_file(pte_t pte) -{ - return pte_flags(pte) & _PAGE_FILE; -} - static inline int pte_huge(pte_t pte) { return pte_flags(pte) & _PAGE_PSE; @@ -146,19 +141,29 @@ static inline int pte_special(pte_t pte) (pte_flags(pte) & (_PAGE_PRESENT|_PAGE_PROTNONE)); } +/* Entries that were set to PROT_NONE are inverted */ + +static inline u64 protnone_mask(u64 val); + static inline unsigned long pte_pfn(pte_t pte) { - return (pte_val(pte) & PTE_PFN_MASK) >> PAGE_SHIFT; + phys_addr_t pfn = pte_val(pte); + pfn ^= protnone_mask(pfn); + return (pfn & PTE_PFN_MASK) >> PAGE_SHIFT; } static inline unsigned long pmd_pfn(pmd_t pmd) { - return (pmd_val(pmd) & PTE_PFN_MASK) >> PAGE_SHIFT; + phys_addr_t pfn = pmd_val(pmd); + pfn ^= protnone_mask(pfn); + return (pfn & PTE_PFN_MASK) >> PAGE_SHIFT; } static inline unsigned long pud_pfn(pud_t pud) { - return (pud_val(pud) & PTE_PFN_MASK) >> PAGE_SHIFT; + phys_addr_t pfn = pud_val(pud); + pfn ^= protnone_mask(pfn); + return (pfn & PTE_PFN_MASK) >> PAGE_SHIFT; } #define pte_page(pte) pfn_to_page(pte_pfn(pte)) @@ -303,9 +308,23 @@ static inline pmd_t pmd_mkwrite(pmd_t pmd) return pmd_set_flags(pmd, _PAGE_RW); } -static inline pmd_t pmd_mknotpresent(pmd_t pmd) +static inline pud_t pud_set_flags(pud_t pud, pudval_t set) { - return pmd_clear_flags(pmd, _PAGE_PRESENT); + pudval_t v = native_pud_val(pud); + + return __pud(v | set); +} + +static inline pud_t pud_clear_flags(pud_t pud, pudval_t clear) +{ + pudval_t v = native_pud_val(pud); + + return __pud(v & ~clear); +} + +static inline pud_t pud_mkhuge(pud_t pud) +{ + return pud_set_flags(pud, _PAGE_PSE); } #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY @@ -329,21 +348,6 @@ static inline pmd_t pmd_mksoft_dirty(pmd_t pmd) return pmd_set_flags(pmd, _PAGE_SOFT_DIRTY); } -static inline pte_t pte_file_clear_soft_dirty(pte_t pte) -{ - return pte_clear_flags(pte, _PAGE_SOFT_DIRTY); -} - -static inline pte_t pte_file_mksoft_dirty(pte_t pte) -{ - return pte_set_flags(pte, _PAGE_SOFT_DIRTY); -} - -static inline int pte_file_soft_dirty(pte_t pte) -{ - return pte_flags(pte) & _PAGE_SOFT_DIRTY; -} - #endif /* CONFIG_HAVE_ARCH_SOFT_DIRTY */ /* @@ -362,19 +366,39 @@ static inline pgprotval_t massage_pgprot(pgprot_t pgprot) static inline pte_t pfn_pte(unsigned long page_nr, pgprot_t pgprot) { - return __pte(((phys_addr_t)page_nr << PAGE_SHIFT) | - massage_pgprot(pgprot)); + phys_addr_t pfn = (phys_addr_t)page_nr << PAGE_SHIFT; + pfn ^= protnone_mask(pgprot_val(pgprot)); + pfn &= PTE_PFN_MASK; + return __pte(pfn | massage_pgprot(pgprot)); } static inline pmd_t pfn_pmd(unsigned long page_nr, pgprot_t pgprot) { - return __pmd(((phys_addr_t)page_nr << PAGE_SHIFT) | - massage_pgprot(pgprot)); + phys_addr_t pfn = (phys_addr_t)page_nr << PAGE_SHIFT; + pfn ^= protnone_mask(pgprot_val(pgprot)); + pfn &= PTE_PFN_MASK; + return __pmd(pfn | massage_pgprot(pgprot)); +} + +static inline pud_t pfn_pud(unsigned long page_nr, pgprot_t pgprot) +{ + phys_addr_t pfn = (phys_addr_t)page_nr << PAGE_SHIFT; + pfn ^= protnone_mask(pgprot_val(pgprot)); + pfn &= PTE_PFN_MASK; + return __pud(pfn | 
massage_pgprot(pgprot)); +} + +static inline pmd_t pmd_mknotpresent(pmd_t pmd) +{ + return pfn_pmd(pmd_pfn(pmd), + __pgprot(pmd_flags(pmd) & ~_PAGE_PRESENT)); } +static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask); + static inline pte_t pte_modify(pte_t pte, pgprot_t newprot) { - pteval_t val = pte_val(pte); + pteval_t val = pte_val(pte), oldval = val; /* * Chop off the NX bit (if present), and add the NX portion of @@ -382,19 +406,67 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot) */ val &= _PAGE_CHG_MASK; val |= massage_pgprot(newprot) & ~_PAGE_CHG_MASK; - + val = flip_protnone_guard(oldval, val, PTE_PFN_MASK); return __pte(val); } static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot) { - pmdval_t val = pmd_val(pmd); + pmdval_t val = pmd_val(pmd), oldval = val; val &= _HPAGE_CHG_MASK; val |= massage_pgprot(newprot) & ~_HPAGE_CHG_MASK; + val = flip_protnone_guard(oldval, val, PTE_PFN_MASK); + return __pmd(val); +} + +#ifdef CONFIG_NUMA_BALANCING + +static inline pte_t pte_mknonnuma(pte_t pte) +{ + pteval_t val = pte_val(pte), oldval = val; + + val &= ~_PAGE_NUMA; + val |= (_PAGE_PRESENT|_PAGE_ACCESSED); + val = flip_protnone_guard(oldval, val, PTE_PFN_MASK); + return __pte(val); +} +#define pte_mknonnuma pte_mknonnuma + +static inline pte_t pte_mknuma(pte_t pte) +{ + pteval_t val = pte_val(pte), oldval = val; + val &= ~_PAGE_PRESENT; + val |= _PAGE_NUMA; + val = flip_protnone_guard(oldval, val, PTE_PFN_MASK); + return __pte(val); +} +#define pte_mknuma pte_mknuma + +static inline pmd_t pmd_mknonnuma(pmd_t pmd) +{ + pmdval_t val = pmd_val(pmd), oldval = val; + + val &= ~_PAGE_NUMA; + val |= (_PAGE_PRESENT|_PAGE_ACCESSED); + val = flip_protnone_guard(oldval, val, PTE_PFN_MASK); return __pmd(val); } +#define pmd_mknonnuma pmd_mknonnuma + +static inline pmd_t pmd_mknuma(pmd_t pmd) +{ + pmdval_t val = pmd_val(pmd), oldval = val; + + val &= ~_PAGE_PRESENT; + val |= _PAGE_NUMA; + val = flip_protnone_guard(oldval, val, PTE_PFN_MASK); + return __pmd(val); +} +#define pmd_mknuma pmd_mknuma + +#endif /* CONFIG_NUMA_BALANCING */ /* mprotect needs to preserve PAT bits when updating vm_page_prot */ #define pgprot_modify pgprot_modify @@ -917,6 +989,14 @@ static inline pte_t pte_swp_clear_soft_dirty(pte_t pte) } #endif +#define __HAVE_ARCH_PFN_MODIFY_ALLOWED 1 +extern bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot); + +static inline bool arch_has_pfn_modify_check(void) +{ + return boot_cpu_has_bug(X86_BUG_L1TF); +} + #include #endif /* __ASSEMBLY__ */ diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h index ada2cb8adbc2..1d0eefbf3e12 100644 --- a/arch/x86/include/asm/pgtable_64.h +++ b/arch/x86/include/asm/pgtable_64.h @@ -155,10 +155,6 @@ static inline int pgd_large(pgd_t pgd) { return 0; } /* PUD - Level3 access */ /* PMD - Level 2 access */ -#define pte_to_pgoff(pte) ((pte_val((pte)) & PHYSICAL_PAGE_MASK) >> PAGE_SHIFT) -#define pgoff_to_pte(off) ((pte_t) { .pte = ((off) << PAGE_SHIFT) | \ - _PAGE_FILE }) -#define PTE_FILE_MAX_BITS __PHYSICAL_MASK_SHIFT /* PTE - Level 1 access. */ @@ -166,23 +162,57 @@ static inline int pgd_large(pgd_t pgd) { return 0; } #define pte_offset_map(dir, address) pte_offset_kernel((dir), (address)) #define pte_unmap(pte) ((void)(pte))/* NOP */ -/* Encode and de-code a swap entry */ -#define SWP_TYPE_BITS (_PAGE_BIT_FILE - _PAGE_BIT_PRESENT - 1) +/* + * Encode and de-code a swap entry + * + * | ... | 11| 10| 9|8|7|6|5| 4| 3|2| 1|0| <- bit number + * | ... 
|SW3|SW2|SW1|G|L|D|A|CD|WT|U| W|P| <- bit names + | TYPE (59-63) | ~OFFSET (10-58) | 0 |0|0|X|X| X| X|X|SD|0| <- swp entry + * + * G (8) is aliased and used as a PROT_NONE indicator for + * !present ptes. We need to start storing swap entries above + * there. We also need to avoid using A and D because of an + * erratum where they can be incorrectly set by hardware on + * non-present PTEs. + * + * SD (1) in swp entry is used to store soft dirty bit, which helps us + * remember soft dirty over page migration + * + * Bit 7 in swp entry should be 0 because pmd_present checks not only P, + * but also L and G. + * + * The offset is inverted by a binary not operation to make the high + * physical bits set. + */ +#define SWP_TYPE_BITS 5 + #ifdef CONFIG_NUMA_BALANCING /* Automatic NUMA balancing needs to be distinguishable from swap entries */ -#define SWP_OFFSET_SHIFT (_PAGE_BIT_PROTNONE + 2) +#define SWP_OFFSET_FIRST_BIT (_PAGE_BIT_PROTNONE + 2) #else -#define SWP_OFFSET_SHIFT (_PAGE_BIT_PROTNONE + 1) +#define SWP_OFFSET_FIRST_BIT (_PAGE_BIT_PROTNONE + 1) #endif +/* We always extract/encode the offset by shifting it all the way up, and then down again */ +#define SWP_OFFSET_SHIFT (SWP_OFFSET_FIRST_BIT+SWP_TYPE_BITS) + #define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > SWP_TYPE_BITS) -#define __swp_type(x) (((x).val >> (_PAGE_BIT_PRESENT + 1)) \ - & ((1U << SWP_TYPE_BITS) - 1)) -#define __swp_offset(x) ((x).val >> SWP_OFFSET_SHIFT) -#define __swp_entry(type, offset) ((swp_entry_t) { \ - ((type) << (_PAGE_BIT_PRESENT + 1)) \ - | ((offset) << SWP_OFFSET_SHIFT) }) +/* Extract the high bits for type */ +#define __swp_type(x) ((x).val >> (64 - SWP_TYPE_BITS)) + +/* Shift up (to get rid of type), then down to get value */ +#define __swp_offset(x) (~(x).val << SWP_TYPE_BITS >> SWP_OFFSET_SHIFT) + +/* + * Shift the offset up "too far" by TYPE bits, then down again + * The offset is inverted by a binary not operation to make the high + * physical bits set. + */ +#define __swp_entry(type, offset) ((swp_entry_t) { \ + (~(unsigned long)(offset) << SWP_OFFSET_SHIFT >> SWP_TYPE_BITS) \ + | ((unsigned long)(type) << (64-SWP_TYPE_BITS)) }) + #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val((pte)) }) #define __swp_entry_to_pte(x) ((pte_t) { .pte = (x).val }) @@ -209,6 +239,8 @@ extern void cleanup_highmap(void); extern void init_extra_mapping_uc(unsigned long phys, unsigned long size); extern void init_extra_mapping_wb(unsigned long phys, unsigned long size); +#include <asm/pgtable-invert.h> + #endif /* !__ASSEMBLY__ */ #endif /* _ASM_X86_PGTABLE_64_H */ diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h index 90aaa6f39d61..32f1f8cac398 100644 --- a/arch/x86/include/asm/pgtable_types.h +++ b/arch/x86/include/asm/pgtable_types.h @@ -39,8 +39,6 @@ /* If _PAGE_BIT_PRESENT is clear, we use these: */ /* - if the user mapped it with PROT_NONE; pte_present gives true */ #define _PAGE_BIT_PROTNONE _PAGE_BIT_GLOBAL -/* - set: nonlinear file mapping, saved PTE; unset:swap */ -#define _PAGE_BIT_FILE _PAGE_BIT_DIRTY #define _PAGE_PRESENT (_AT(pteval_t, 1) << _PAGE_BIT_PRESENT) #define _PAGE_RW (_AT(pteval_t, 1) << _PAGE_BIT_RW) @@ -96,15 +94,15 @@ /* * Tracking soft dirty bit when a page goes to a swap is tricky. * We need a bit which can be stored in pte _and_ not conflict - * with swap entry format. On x86 bits 6 and 7 are *not* involved - * into swap entry computation, but bit 6 is used for nonlinear - * file mapping, so we borrow bit 7 for soft dirty tracking.
+ * with swap entry format. On x86 bits 1-4 are *not* involved + * in swap entry computation, but bit 7 is used for thp migration, + * so we borrow bit 1 for soft dirty tracking. * * Please note that this bit must be treated as swap dirty page - * mark if and only if the PTE has present bit clear! + * mark if and only if the PTE/PMD has present bit clear! */ #ifdef CONFIG_MEM_SOFT_DIRTY -#define _PAGE_SWP_SOFT_DIRTY _PAGE_PSE +#define _PAGE_SWP_SOFT_DIRTY _PAGE_RW #else #define _PAGE_SWP_SOFT_DIRTY (_AT(pteval_t, 0)) #endif @@ -115,7 +113,6 @@ #define _PAGE_NX (_AT(pteval_t, 0)) #endif -#define _PAGE_FILE (_AT(pteval_t, 1) << _PAGE_BIT_FILE) #define _PAGE_PROTNONE (_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE) #define _PAGE_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | \ diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index 801b59a7e97d..280bfa055fc2 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -165,6 +165,11 @@ extern const struct seq_operations cpuinfo_op; extern void cpu_detect(struct cpuinfo_x86 *c); extern void fpu_detect(struct cpuinfo_x86 *c); +static inline unsigned long long l1tf_pfn_limit(void) +{ + return BIT_ULL(boot_cpu_data.x86_phys_bits - 1 - PAGE_SHIFT); +} + extern void early_cpu_init(void); extern void identify_boot_cpu(void); extern void identify_secondary_cpu(struct cpuinfo_x86 *); diff --git a/arch/x86/include/asm/spec-ctrl.h b/arch/x86/include/asm/spec-ctrl.h new file mode 100644 index 000000000000..ae7c2c5cd7f0 --- /dev/null +++ b/arch/x86/include/asm/spec-ctrl.h @@ -0,0 +1,80 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _ASM_X86_SPECCTRL_H_ +#define _ASM_X86_SPECCTRL_H_ + +#include <linux/thread_info.h> +#include <asm/nospec-branch.h> + +/* + * On VMENTER we must preserve whatever view of the SPEC_CTRL MSR + * the guest has, while on VMEXIT we restore the host view. This + * would be easier if SPEC_CTRL were architecturally maskable or + * shadowable for guests but this is not (currently) the case. + * Takes the guest view of SPEC_CTRL MSR as a parameter and also + * the guest's version of VIRT_SPEC_CTRL, if emulated.
+ */ +extern void x86_virt_spec_ctrl(u64 guest_spec_ctrl, u64 guest_virt_spec_ctrl, bool guest); + +/** + * x86_spec_ctrl_set_guest - Set speculation control registers for the guest + * @guest_spec_ctrl: The guest content of MSR_SPEC_CTRL + * @guest_virt_spec_ctrl: The guest controlled bits of MSR_VIRT_SPEC_CTRL + * (may get translated to MSR_AMD64_LS_CFG bits) + * + * Avoids writing to the MSR if the content/bits are the same + */ +static inline +void x86_spec_ctrl_set_guest(u64 guest_spec_ctrl, u64 guest_virt_spec_ctrl) +{ + x86_virt_spec_ctrl(guest_spec_ctrl, guest_virt_spec_ctrl, true); +} + +/** + * x86_spec_ctrl_restore_host - Restore host speculation control registers + * @guest_spec_ctrl: The guest content of MSR_SPEC_CTRL + * @guest_virt_spec_ctrl: The guest controlled bits of MSR_VIRT_SPEC_CTRL + * (may get translated to MSR_AMD64_LS_CFG bits) + * + * Avoids writing to the MSR if the content/bits are the same + */ +static inline +void x86_spec_ctrl_restore_host(u64 guest_spec_ctrl, u64 guest_virt_spec_ctrl) +{ + x86_virt_spec_ctrl(guest_spec_ctrl, guest_virt_spec_ctrl, false); +} + +/* AMD specific Speculative Store Bypass MSR data */ +extern u64 x86_amd_ls_cfg_base; +extern u64 x86_amd_ls_cfg_ssbd_mask; + +static inline u64 ssbd_tif_to_spec_ctrl(u64 tifn) +{ + BUILD_BUG_ON(TIF_SSBD < SPEC_CTRL_SSBD_SHIFT); + return (tifn & _TIF_SSBD) >> (TIF_SSBD - SPEC_CTRL_SSBD_SHIFT); +} + +static inline unsigned long ssbd_spec_ctrl_to_tif(u64 spec_ctrl) +{ + BUILD_BUG_ON(TIF_SSBD < SPEC_CTRL_SSBD_SHIFT); + return (spec_ctrl & SPEC_CTRL_SSBD) << (TIF_SSBD - SPEC_CTRL_SSBD_SHIFT); +} + +static inline u64 ssbd_tif_to_amd_ls_cfg(u64 tifn) +{ + return (tifn & _TIF_SSBD) ? x86_amd_ls_cfg_ssbd_mask : 0ULL; +} + +#ifdef CONFIG_SMP +extern void speculative_store_bypass_ht_init(void); +#else +static inline void speculative_store_bypass_ht_init(void) { } +#endif + +extern void speculative_store_bypass_update(unsigned long tif); + +static inline void speculative_store_bypass_update_current(void) +{ + speculative_store_bypass_update(current_thread_info()->flags); +} + +#endif diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h index 547e344a6dc6..3a3ce1a45b54 100644 --- a/arch/x86/include/asm/thread_info.h +++ b/arch/x86/include/asm/thread_info.h @@ -72,6 +72,7 @@ struct thread_info { #define TIF_SIGPENDING 2 /* signal pending */ #define TIF_NEED_RESCHED 3 /* rescheduling necessary */ #define TIF_SINGLESTEP 4 /* reenable singlestep on user return*/ +#define TIF_SSBD 5 /* Reduced data speculation */ #define TIF_SYSCALL_EMU 6 /* syscall emulation active */ #define TIF_SYSCALL_AUDIT 7 /* syscall auditing active */ #define TIF_SECCOMP 8 /* secure computing */ @@ -97,6 +98,7 @@ struct thread_info { #define _TIF_SIGPENDING (1 << TIF_SIGPENDING) #define _TIF_SINGLESTEP (1 << TIF_SINGLESTEP) #define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED) +#define _TIF_SSBD (1 << TIF_SSBD) #define _TIF_SYSCALL_EMU (1 << TIF_SYSCALL_EMU) #define _TIF_SYSCALL_AUDIT (1 << TIF_SYSCALL_AUDIT) #define _TIF_SECCOMP (1 << TIF_SECCOMP) @@ -131,12 +133,12 @@ struct thread_info { #define _TIF_WORK_MASK \ (0x0000FFFF & \ ~(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT| \ - _TIF_SINGLESTEP|_TIF_SECCOMP|_TIF_SYSCALL_EMU)) + _TIF_SINGLESTEP|_TIF_SSBD|_TIF_SECCOMP|_TIF_SYSCALL_EMU)) /* work to do on any return to user space */ #define _TIF_ALLWORK_MASK \ - ((0x0000FFFF & ~_TIF_SECCOMP) | _TIF_SYSCALL_TRACEPOINT | \ - _TIF_NOHZ) + ((0x0000FFFF & ~(_TIF_SSBD | _TIF_SECCOMP)) | \ + _TIF_SYSCALL_TRACEPOINT | _TIF_NOHZ) /* 
Only used for 64 bit */ #define _TIF_DO_NOTIFY_MASK \ @@ -145,7 +147,7 @@ struct thread_info { /* flags to check in __switch_to() */ #define _TIF_WORK_CTXSW \ - (_TIF_IO_BITMAP|_TIF_NOTSC|_TIF_BLOCKSTEP) + (_TIF_IO_BITMAP|_TIF_NOTSC|_TIF_BLOCKSTEP|_TIF_SSBD) #define _TIF_WORK_CTXSW_PREV (_TIF_WORK_CTXSW|_TIF_USER_RETURN_NOTIFY) #define _TIF_WORK_CTXSW_NEXT (_TIF_WORK_CTXSW) diff --git a/arch/x86/include/uapi/asm/msr-index.h b/arch/x86/include/uapi/asm/msr-index.h index aa8bd503b6ed..353f76dcf8a7 100644 --- a/arch/x86/include/uapi/asm/msr-index.h +++ b/arch/x86/include/uapi/asm/msr-index.h @@ -35,6 +35,8 @@ #define MSR_IA32_SPEC_CTRL 0x00000048 /* Speculation Control */ #define SPEC_CTRL_IBRS (1 << 0) /* Indirect Branch Restricted Speculation */ #define SPEC_CTRL_STIBP (1 << 1) /* Single Thread Indirect Branch Predictors */ +#define SPEC_CTRL_SSBD_SHIFT 2 /* Speculative Store Bypass Disable bit */ +#define SPEC_CTRL_SSBD (1 << SPEC_CTRL_SSBD_SHIFT) /* Speculative Store Bypass Disable */ #define MSR_IA32_PRED_CMD 0x00000049 /* Prediction Command */ #define PRED_CMD_IBPB (1 << 0) /* Indirect Branch Prediction Barrier */ @@ -57,6 +59,11 @@ #define MSR_IA32_ARCH_CAPABILITIES 0x0000010a #define ARCH_CAP_RDCL_NO (1 << 0) /* Not susceptible to Meltdown */ #define ARCH_CAP_IBRS_ALL (1 << 1) /* Enhanced IBRS support */ +#define ARCH_CAP_SSB_NO (1 << 4) /* + * Not susceptible to Speculative Store Bypass + * attack, so no Speculative Store Bypass + * control required. + */ #define MSR_IA32_BBL_CR_CTL 0x00000119 #define MSR_IA32_BBL_CR_CTL3 0x0000011e @@ -218,6 +225,8 @@ #define MSR_AMD64_IBSBRTARGET 0xc001103b #define MSR_AMD64_IBS_REG_COUNT_MAX 8 /* includes MSR_AMD64_IBSBRTARGET */ +#define MSR_AMD64_VIRT_SPEC_CTRL 0xc001011f + /* Fam 16h MSRs */ #define MSR_F16H_L2I_PERF_CTL 0xc0010230 #define MSR_F16H_L2I_PERF_CTR 0xc0010231 diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c index ae2832700bb5..ccfb36b455a4 100644 --- a/arch/x86/kernel/cpu/amd.c +++ b/arch/x86/kernel/cpu/amd.c @@ -8,6 +8,7 @@ #include #include #include +#include #include #ifdef CONFIG_X86_64 @@ -470,6 +471,26 @@ static void bsp_init_amd(struct cpuinfo_x86 *c) va_align.mask = (upperbit - 1) & PAGE_MASK; va_align.flags = ALIGN_VA_32 | ALIGN_VA_64; } + + if (c->x86 >= 0x15 && c->x86 <= 0x17) { + unsigned int bit; + + switch (c->x86) { + case 0x15: bit = 54; break; + case 0x16: bit = 33; break; + case 0x17: bit = 10; break; + default: return; + } + /* + * Try to cache the base value so further operations can + * avoid RMW. If that faults, do not enable SSBD. + */ + if (!rdmsrl_safe(MSR_AMD64_LS_CFG, &x86_amd_ls_cfg_base)) { + setup_force_cpu_cap(X86_FEATURE_LS_CFG_SSBD); + setup_force_cpu_cap(X86_FEATURE_SSBD); + x86_amd_ls_cfg_ssbd_mask = 1ULL << bit; + } + } } static void early_init_amd(struct cpuinfo_x86 *c) @@ -535,6 +556,7 @@ static void init_amd_ln(struct cpuinfo_x86 *c) static void init_amd_zn(struct cpuinfo_x86 *c) { + set_cpu_cap(c, X86_FEATURE_ZEN); /* * Fix erratum 1076: CPB feature bit not being set in CPUID. It affects * all up to and including B1. 
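The spec-ctrl.h conversion helpers above work only because TIF_SSBD (bit 5 of thread_info->flags per the thread_info.h hunk) sits above SPEC_CTRL_SSBD (bit 2 of MSR_IA32_SPEC_CTRL per the msr-index.h hunk), so translating the per-task flag into MSR format is a single unconditional shift on the context-switch path. Below is a minimal stand-alone sketch of that bit movement; it is an illustration, not part of the patch, and assumes only the bit positions defined in the hunks above:

#include <stdint.h>
#include <stdio.h>

/* Bit positions as defined in the hunks above. */
#define TIF_SSBD		5	/* thread_info->flags bit */
#define _TIF_SSBD		(1UL << TIF_SSBD)
#define SPEC_CTRL_SSBD_SHIFT	2	/* MSR_IA32_SPEC_CTRL bit */
#define SPEC_CTRL_SSBD		(1ULL << SPEC_CTRL_SSBD_SHIFT)

/* Same arithmetic as ssbd_tif_to_spec_ctrl(): TIF bit down to MSR bit. */
static uint64_t ssbd_tif_to_spec_ctrl(uint64_t tifn)
{
	return (tifn & _TIF_SSBD) >> (TIF_SSBD - SPEC_CTRL_SSBD_SHIFT);
}

/* Inverse, as in ssbd_spec_ctrl_to_tif(): MSR bit back up to TIF bit. */
static uint64_t ssbd_spec_ctrl_to_tif(uint64_t spec_ctrl)
{
	return (spec_ctrl & SPEC_CTRL_SSBD) << (TIF_SSBD - SPEC_CTRL_SSBD_SHIFT);
}

int main(void)
{
	uint64_t tif = _TIF_SSBD;		/* task has SSBD requested */
	uint64_t msr = ssbd_tif_to_spec_ctrl(tif);

	printf("tif 0x%llx -> spec_ctrl 0x%llx\n",
	       (unsigned long long)tif, (unsigned long long)msr);
	printf("spec_ctrl 0x%llx -> tif 0x%llx\n",
	       (unsigned long long)msr,
	       (unsigned long long)ssbd_spec_ctrl_to_tif(msr));
	return 0;	/* prints 0x20 -> 0x4, then 0x4 -> 0x20 */
}

The BUILD_BUG_ON(TIF_SSBD < SPEC_CTRL_SSBD_SHIFT) in the real header enforces exactly this layout assumption at compile time: if the TIF bit ever moved below the MSR bit, the shift direction would be wrong.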
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c index 8af32e95e563..bed882333930 100644 --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -11,8 +11,10 @@ #include #include #include +#include +#include -#include +#include #include #include #include @@ -24,8 +26,31 @@ #include #include #include +#include static void __init spectre_v2_select_mitigation(void); +static void __init ssb_select_mitigation(void); +static void __init l1tf_select_mitigation(void); + +/* + * Our boot-time value of the SPEC_CTRL MSR. We read it once so that any + * writes to SPEC_CTRL contain whatever reserved bits have been set. + */ +u64 x86_spec_ctrl_base; +EXPORT_SYMBOL_GPL(x86_spec_ctrl_base); + +/* + * The vendor and possibly platform specific bits which can be modified in + * x86_spec_ctrl_base. + */ +static u64 x86_spec_ctrl_mask = SPEC_CTRL_IBRS; + +/* + * AMD specific MSR info for Speculative Store Bypass control. + * x86_amd_ls_cfg_ssbd_mask is initialized in identify_boot_cpu(). + */ +u64 x86_amd_ls_cfg_base; +u64 x86_amd_ls_cfg_ssbd_mask; #ifdef CONFIG_X86_32 @@ -94,9 +119,29 @@ void __init check_bugs(void) print_cpu_info(&boot_cpu_data); } + /* + * Read the SPEC_CTRL MSR to account for reserved bits which may + * have unknown values. The AMD64_LS_CFG MSR is cached in the early AMD + * init code as it is not enumerated and depends on the family. + */ + if (boot_cpu_has(X86_FEATURE_MSR_SPEC_CTRL)) + rdmsrl(MSR_IA32_SPEC_CTRL, x86_spec_ctrl_base); + + /* Allow STIBP in MSR_SPEC_CTRL if supported */ + if (boot_cpu_has(X86_FEATURE_STIBP)) + x86_spec_ctrl_mask |= SPEC_CTRL_STIBP; + /* Select the proper spectre mitigation before patching alternatives */ spectre_v2_select_mitigation(); + /* + * Select the proper mitigation for any exposure to the Speculative Store + * Bypass vulnerability. + */ + ssb_select_mitigation(); + + l1tf_select_mitigation(); + #ifdef CONFIG_X86_32 /* * Check whether we are able to run this kernel safely on SMP. @@ -157,6 +202,74 @@ static const char *spectre_v2_strings[] = { static enum spectre_v2_mitigation spectre_v2_enabled = SPECTRE_V2_NONE; +void +x86_virt_spec_ctrl(u64 guest_spec_ctrl, u64 guest_virt_spec_ctrl, bool setguest) +{ + u64 msrval, guestval, hostval = x86_spec_ctrl_base; + struct thread_info *ti = current_thread_info(); + + /* Is MSR_SPEC_CTRL implemented? */ + if (static_cpu_has(X86_FEATURE_MSR_SPEC_CTRL)) { + /* + * Restrict guest_spec_ctrl to supported values. Clear the + * modifiable bits in the host base value and OR in the + * modifiable bits from the guest value. + */ + guestval = hostval & ~x86_spec_ctrl_mask; + guestval |= guest_spec_ctrl & x86_spec_ctrl_mask; + + /* SSBD controlled in MSR_SPEC_CTRL */ + if (static_cpu_has(X86_FEATURE_SPEC_CTRL_SSBD)) + hostval |= ssbd_tif_to_spec_ctrl(ti->flags); + + if (hostval != guestval) { + msrval = setguest ? guestval : hostval; + wrmsrl(MSR_IA32_SPEC_CTRL, msrval); + } + } + + /* + * If SSBD is not handled in MSR_SPEC_CTRL on AMD, update + * MSR_AMD64_LS_CFG or MSR_VIRT_SPEC_CTRL if supported. + */ + if (!static_cpu_has(X86_FEATURE_LS_CFG_SSBD) && + !static_cpu_has(X86_FEATURE_VIRT_SSBD)) + return; + + /* + * If the host has SSBD mitigation enabled, force it in the host's + * virtual MSR value. If it's not permanently enabled, evaluate + * current's TIF_SSBD thread flag.
+ */ + if (static_cpu_has(X86_FEATURE_SPEC_STORE_BYPASS_DISABLE)) + hostval = SPEC_CTRL_SSBD; + else + hostval = ssbd_tif_to_spec_ctrl(ti->flags); + + /* Sanitize the guest value */ + guestval = guest_virt_spec_ctrl & SPEC_CTRL_SSBD; + + if (hostval != guestval) { + unsigned long tif; + + tif = setguest ? ssbd_spec_ctrl_to_tif(guestval) : + ssbd_spec_ctrl_to_tif(hostval); + + speculative_store_bypass_update(tif); + } +} +EXPORT_SYMBOL_GPL(x86_virt_spec_ctrl); + +static void x86_amd_ssb_disable(void) +{ + u64 msrval = x86_amd_ls_cfg_base | x86_amd_ls_cfg_ssbd_mask; + + if (boot_cpu_has(X86_FEATURE_VIRT_SSBD)) + wrmsrl(MSR_AMD64_VIRT_SPEC_CTRL, SPEC_CTRL_SSBD); + else if (boot_cpu_has(X86_FEATURE_LS_CFG_SSBD)) + wrmsrl(MSR_AMD64_LS_CFG, msrval); +} + #ifdef RETPOLINE static bool spectre_v2_bad_module; @@ -349,33 +462,333 @@ static void __init spectre_v2_select_mitigation(void) } } +#undef pr_fmt +#define pr_fmt(fmt) "Speculative Store Bypass: " fmt + +static enum ssb_mitigation ssb_mode = SPEC_STORE_BYPASS_NONE; + +/* The kernel command line selection */ +enum ssb_mitigation_cmd { + SPEC_STORE_BYPASS_CMD_NONE, + SPEC_STORE_BYPASS_CMD_AUTO, + SPEC_STORE_BYPASS_CMD_ON, + SPEC_STORE_BYPASS_CMD_PRCTL, + SPEC_STORE_BYPASS_CMD_SECCOMP, +}; + +static const char *ssb_strings[] = { + [SPEC_STORE_BYPASS_NONE] = "Vulnerable", + [SPEC_STORE_BYPASS_DISABLE] = "Mitigation: Speculative Store Bypass disabled", + [SPEC_STORE_BYPASS_PRCTL] = "Mitigation: Speculative Store Bypass disabled via prctl", + [SPEC_STORE_BYPASS_SECCOMP] = "Mitigation: Speculative Store Bypass disabled via prctl and seccomp", +}; + +static const struct { + const char *option; + enum ssb_mitigation_cmd cmd; +} ssb_mitigation_options[] = { + { "auto", SPEC_STORE_BYPASS_CMD_AUTO }, /* Platform decides */ + { "on", SPEC_STORE_BYPASS_CMD_ON }, /* Disable Speculative Store Bypass */ + { "off", SPEC_STORE_BYPASS_CMD_NONE }, /* Don't touch Speculative Store Bypass */ + { "prctl", SPEC_STORE_BYPASS_CMD_PRCTL }, /* Disable Speculative Store Bypass via prctl */ + { "seccomp", SPEC_STORE_BYPASS_CMD_SECCOMP }, /* Disable Speculative Store Bypass via prctl and seccomp */ +}; + +static enum ssb_mitigation_cmd __init ssb_parse_cmdline(void) +{ + enum ssb_mitigation_cmd cmd = SPEC_STORE_BYPASS_CMD_AUTO; + char arg[20]; + int ret, i; + + if (cmdline_find_option_bool(boot_command_line, "nospec_store_bypass_disable")) { + return SPEC_STORE_BYPASS_CMD_NONE; + } else { + ret = cmdline_find_option(boot_command_line, "spec_store_bypass_disable", + arg, sizeof(arg)); + if (ret < 0) + return SPEC_STORE_BYPASS_CMD_AUTO; + + for (i = 0; i < ARRAY_SIZE(ssb_mitigation_options); i++) { + if (!match_option(arg, ret, ssb_mitigation_options[i].option)) + continue; + + cmd = ssb_mitigation_options[i].cmd; + break; + } + + if (i >= ARRAY_SIZE(ssb_mitigation_options)) { + pr_err("unknown option (%s). Switching to AUTO select\n", arg); + return SPEC_STORE_BYPASS_CMD_AUTO; + } + } + + return cmd; +} + +static enum ssb_mitigation __init __ssb_select_mitigation(void) +{ + enum ssb_mitigation mode = SPEC_STORE_BYPASS_NONE; + enum ssb_mitigation_cmd cmd; + + if (!boot_cpu_has(X86_FEATURE_SSBD)) + return mode; + + cmd = ssb_parse_cmdline(); + if (!boot_cpu_has_bug(X86_BUG_SPEC_STORE_BYPASS) && + (cmd == SPEC_STORE_BYPASS_CMD_NONE || + cmd == SPEC_STORE_BYPASS_CMD_AUTO)) + return mode; + + switch (cmd) { + case SPEC_STORE_BYPASS_CMD_AUTO: + case SPEC_STORE_BYPASS_CMD_SECCOMP: + /* + * Choose prctl+seccomp as the default mode if seccomp is + * enabled. 
+ */ + if (IS_ENABLED(CONFIG_SECCOMP)) + mode = SPEC_STORE_BYPASS_SECCOMP; + else + mode = SPEC_STORE_BYPASS_PRCTL; + break; + case SPEC_STORE_BYPASS_CMD_ON: + mode = SPEC_STORE_BYPASS_DISABLE; + break; + case SPEC_STORE_BYPASS_CMD_PRCTL: + mode = SPEC_STORE_BYPASS_PRCTL; + break; + case SPEC_STORE_BYPASS_CMD_NONE: + break; + } + + /* + * We have three CPU feature flags that are in play here: + * - X86_BUG_SPEC_STORE_BYPASS - CPU is susceptible. + * - X86_FEATURE_SSBD - CPU is able to turn off speculative store bypass + * - X86_FEATURE_SPEC_STORE_BYPASS_DISABLE - engage the mitigation + */ + if (mode == SPEC_STORE_BYPASS_DISABLE) { + setup_force_cpu_cap(X86_FEATURE_SPEC_STORE_BYPASS_DISABLE); + /* + * Intel uses the SPEC CTRL MSR Bit(2) for this, while AMD uses + * a completely different MSR and bit dependent on family. + */ + switch (boot_cpu_data.x86_vendor) { + case X86_VENDOR_INTEL: + x86_spec_ctrl_base |= SPEC_CTRL_SSBD; + x86_spec_ctrl_mask |= SPEC_CTRL_SSBD; + wrmsrl(MSR_IA32_SPEC_CTRL, x86_spec_ctrl_base); + break; + case X86_VENDOR_AMD: + x86_amd_ssb_disable(); + break; + } + } + + return mode; +} + +static void ssb_select_mitigation(void) +{ + ssb_mode = __ssb_select_mitigation(); + + if (boot_cpu_has_bug(X86_BUG_SPEC_STORE_BYPASS)) + pr_info("%s\n", ssb_strings[ssb_mode]); +} + +#undef pr_fmt +#define pr_fmt(fmt) "Speculation prctl: " fmt + +static int ssb_prctl_set(struct task_struct *task, unsigned long ctrl) +{ + bool update; + + if (ssb_mode != SPEC_STORE_BYPASS_PRCTL && + ssb_mode != SPEC_STORE_BYPASS_SECCOMP) + return -ENXIO; + + switch (ctrl) { + case PR_SPEC_ENABLE: + /* If speculation is force disabled, enable is not allowed */ + if (task_spec_ssb_force_disable(task)) + return -EPERM; + task_clear_spec_ssb_disable(task); + update = test_and_clear_tsk_thread_flag(task, TIF_SSBD); + break; + case PR_SPEC_DISABLE: + task_set_spec_ssb_disable(task); + update = !test_and_set_tsk_thread_flag(task, TIF_SSBD); + break; + case PR_SPEC_FORCE_DISABLE: + task_set_spec_ssb_disable(task); + task_set_spec_ssb_force_disable(task); + update = !test_and_set_tsk_thread_flag(task, TIF_SSBD); + break; + default: + return -ERANGE; + } + + /* + * If being set on non-current task, delay setting the CPU + * mitigation until it is next scheduled. 
+ */ + if (task == current && update) + speculative_store_bypass_update_current(); + + return 0; +} + +int arch_prctl_spec_ctrl_set(struct task_struct *task, unsigned long which, + unsigned long ctrl) +{ + switch (which) { + case PR_SPEC_STORE_BYPASS: + return ssb_prctl_set(task, ctrl); + default: + return -ENODEV; + } +} + +#ifdef CONFIG_SECCOMP +void arch_seccomp_spec_mitigate(struct task_struct *task) +{ + if (ssb_mode == SPEC_STORE_BYPASS_SECCOMP) + ssb_prctl_set(task, PR_SPEC_FORCE_DISABLE); +} +#endif + +static int ssb_prctl_get(struct task_struct *task) +{ + switch (ssb_mode) { + case SPEC_STORE_BYPASS_DISABLE: + return PR_SPEC_DISABLE; + case SPEC_STORE_BYPASS_SECCOMP: + case SPEC_STORE_BYPASS_PRCTL: + if (task_spec_ssb_force_disable(task)) + return PR_SPEC_PRCTL | PR_SPEC_FORCE_DISABLE; + if (task_spec_ssb_disable(task)) + return PR_SPEC_PRCTL | PR_SPEC_DISABLE; + return PR_SPEC_PRCTL | PR_SPEC_ENABLE; + default: + if (boot_cpu_has_bug(X86_BUG_SPEC_STORE_BYPASS)) + return PR_SPEC_ENABLE; + return PR_SPEC_NOT_AFFECTED; + } +} + +int arch_prctl_spec_ctrl_get(struct task_struct *task, unsigned long which) +{ + switch (which) { + case PR_SPEC_STORE_BYPASS: + return ssb_prctl_get(task); + default: + return -ENODEV; + } +} + +void x86_spec_ctrl_setup_ap(void) +{ + if (boot_cpu_has(X86_FEATURE_MSR_SPEC_CTRL)) + wrmsrl(MSR_IA32_SPEC_CTRL, x86_spec_ctrl_base); + + if (ssb_mode == SPEC_STORE_BYPASS_DISABLE) + x86_amd_ssb_disable(); +} + +#undef pr_fmt +#define pr_fmt(fmt) "L1TF: " fmt +static void __init l1tf_select_mitigation(void) +{ + u64 half_pa; + + if (!boot_cpu_has_bug(X86_BUG_L1TF)) + return; + +#if defined(CONFIG_X86_32) && !defined(CONFIG_X86_PAE) + pr_warn("Kernel not compiled for PAE. No mitigation for L1TF\n"); + return; +#endif + + /* + * This is extremely unlikely to happen because almost all + * systems have far more MAX_PA/2 than RAM can be fit into + * DIMM slots. + */ + half_pa = (u64)l1tf_pfn_limit() << PAGE_SHIFT; + if (e820_any_mapped(half_pa, ULLONG_MAX - half_pa, E820_RAM)) { + pr_warn("System has more than MAX_PA/2 memory. L1TF mitigation not effective.\n"); + pr_info("You may make it effective by booting the kernel with mem=%llu parameter.\n", + half_pa); + pr_info("However, doing so will make a part of your RAM unusable.\n"); + pr_info("Reading https://www.kernel.org/doc/html/latest/admin-guide/l1tf.html might help you decide.\n"); + return; + } + + setup_force_cpu_cap(X86_FEATURE_L1TF_PTEINV); +} #undef pr_fmt #ifdef CONFIG_SYSFS -ssize_t cpu_show_meltdown(struct device *dev, struct device_attribute *attr, char *buf) + +static ssize_t cpu_show_common(struct device *dev, struct device_attribute *attr, + char *buf, unsigned int bug) { - if (!boot_cpu_has_bug(X86_BUG_CPU_MELTDOWN)) + if (!boot_cpu_has_bug(bug)) return sprintf(buf, "Not affected\n"); - if (boot_cpu_has(X86_FEATURE_KAISER)) - return sprintf(buf, "Mitigation: PTI\n"); + + switch (bug) { + case X86_BUG_CPU_MELTDOWN: + if (boot_cpu_has(X86_FEATURE_KAISER)) + return sprintf(buf, "Mitigation: PTI\n"); + + break; + + case X86_BUG_SPECTRE_V1: + return sprintf(buf, "Mitigation: __user pointer sanitization\n"); + + case X86_BUG_SPECTRE_V2: + return sprintf(buf, "%s%s%s%s\n", spectre_v2_strings[spectre_v2_enabled], + boot_cpu_has(X86_FEATURE_USE_IBPB) ? ", IBPB" : "", + boot_cpu_has(X86_FEATURE_USE_IBRS_FW) ? 
", IBRS_FW" : "", + spectre_v2_module_string()); + + case X86_BUG_SPEC_STORE_BYPASS: + return sprintf(buf, "%s\n", ssb_strings[ssb_mode]); + + case X86_BUG_L1TF: + if (boot_cpu_has(X86_FEATURE_L1TF_PTEINV)) + return sprintf(buf, "Mitigation: Page Table Inversion\n"); + break; + + default: + break; + } + return sprintf(buf, "Vulnerable\n"); } +ssize_t cpu_show_meltdown(struct device *dev, struct device_attribute *attr, char *buf) +{ + return cpu_show_common(dev, attr, buf, X86_BUG_CPU_MELTDOWN); +} + ssize_t cpu_show_spectre_v1(struct device *dev, struct device_attribute *attr, char *buf) { - if (!boot_cpu_has_bug(X86_BUG_SPECTRE_V1)) - return sprintf(buf, "Not affected\n"); - return sprintf(buf, "Mitigation: __user pointer sanitization\n"); + return cpu_show_common(dev, attr, buf, X86_BUG_SPECTRE_V1); } ssize_t cpu_show_spectre_v2(struct device *dev, struct device_attribute *attr, char *buf) { - if (!boot_cpu_has_bug(X86_BUG_SPECTRE_V2)) - return sprintf(buf, "Not affected\n"); + return cpu_show_common(dev, attr, buf, X86_BUG_SPECTRE_V2); +} - return sprintf(buf, "%s%s%s%s\n", spectre_v2_strings[spectre_v2_enabled], - boot_cpu_has(X86_FEATURE_USE_IBPB) ? ", IBPB" : "", - boot_cpu_has(X86_FEATURE_USE_IBRS_FW) ? ", IBRS_FW" : "", - spectre_v2_module_string()); +ssize_t cpu_show_spec_store_bypass(struct device *dev, struct device_attribute *attr, char *buf) +{ + return cpu_show_common(dev, attr, buf, X86_BUG_SPEC_STORE_BYPASS); +} + +ssize_t cpu_show_l1tf(struct device *dev, struct device_attribute *attr, char *buf) +{ + return cpu_show_common(dev, attr, buf, X86_BUG_L1TF); } #endif diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index b1546961eac8..895085cedf30 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -690,17 +690,32 @@ static void init_speculation_control(struct cpuinfo_x86 *c) * and they also have a different bit for STIBP support. Also, * a hypervisor might have set the individual AMD bits even on * Intel CPUs, for finer-grained selection of what's available. - * - * We use the AMD bits in 0x8000_0008 EBX as the generic hardware - * features, which are visible in /proc/cpuinfo and used by the - * kernel. So set those accordingly from the Intel bits. 
*/ if (cpu_has(c, X86_FEATURE_SPEC_CTRL)) { set_cpu_cap(c, X86_FEATURE_IBRS); set_cpu_cap(c, X86_FEATURE_IBPB); + set_cpu_cap(c, X86_FEATURE_MSR_SPEC_CTRL); } + if (cpu_has(c, X86_FEATURE_INTEL_STIBP)) set_cpu_cap(c, X86_FEATURE_STIBP); + + if (cpu_has(c, X86_FEATURE_SPEC_CTRL_SSBD) || + cpu_has(c, X86_FEATURE_VIRT_SSBD)) + set_cpu_cap(c, X86_FEATURE_SSBD); + + if (cpu_has(c, X86_FEATURE_AMD_IBRS)) { + set_cpu_cap(c, X86_FEATURE_IBRS); + set_cpu_cap(c, X86_FEATURE_MSR_SPEC_CTRL); + } + + if (cpu_has(c, X86_FEATURE_AMD_IBPB)) + set_cpu_cap(c, X86_FEATURE_IBPB); + + if (cpu_has(c, X86_FEATURE_AMD_STIBP)) { + set_cpu_cap(c, X86_FEATURE_STIBP); + set_cpu_cap(c, X86_FEATURE_MSR_SPEC_CTRL); + } } void get_cpu_cap(struct cpuinfo_x86 *c) @@ -803,21 +818,75 @@ static const __initconst struct x86_cpu_id cpu_no_meltdown[] = { {} }; -static bool __init cpu_vulnerable_to_meltdown(struct cpuinfo_x86 *c) +static const __initconst struct x86_cpu_id cpu_no_spec_store_bypass[] = { + { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_PINEVIEW }, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_LINCROFT }, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_PENWELL }, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_CLOVERVIEW }, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_CEDARVIEW }, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_SILVERMONT1 }, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_AIRMONT }, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_SILVERMONT2 }, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_MERRIFIELD }, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_CORE_YONAH }, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_XEON_PHI_KNL }, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_XEON_PHI_KNM }, + { X86_VENDOR_CENTAUR, 5, }, + { X86_VENDOR_INTEL, 5, }, + { X86_VENDOR_NSC, 5, }, + { X86_VENDOR_AMD, 0x12, }, + { X86_VENDOR_AMD, 0x11, }, + { X86_VENDOR_AMD, 0x10, }, + { X86_VENDOR_AMD, 0xf, }, + { X86_VENDOR_ANY, 4, }, + {} +}; + +static const __initconst struct x86_cpu_id cpu_no_l1tf[] = { + /* in addition to cpu_no_speculation */ + { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_SILVERMONT1 }, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_SILVERMONT2 }, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_AIRMONT }, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_MERRIFIELD }, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_MOOREFIELD }, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_GOLDMONT }, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_DENVERTON }, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_GEMINI_LAKE }, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_XEON_PHI_KNL }, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_XEON_PHI_KNM }, + {} +}; + +static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c) { u64 ia32_cap = 0; - if (x86_match_cpu(cpu_no_meltdown)) - return false; - if (cpu_has(c, X86_FEATURE_ARCH_CAPABILITIES)) rdmsrl(MSR_IA32_ARCH_CAPABILITIES, ia32_cap); + if (!x86_match_cpu(cpu_no_spec_store_bypass) && + !(ia32_cap & ARCH_CAP_SSB_NO)) + setup_force_cpu_bug(X86_BUG_SPEC_STORE_BYPASS); + + if (x86_match_cpu(cpu_no_speculation)) + return; + + setup_force_cpu_bug(X86_BUG_SPECTRE_V1); + setup_force_cpu_bug(X86_BUG_SPECTRE_V2); + + if (x86_match_cpu(cpu_no_meltdown)) + return; + /* Rogue Data Cache Load? No! 
*/ if (ia32_cap & ARCH_CAP_RDCL_NO) - return false; + return; + + setup_force_cpu_bug(X86_BUG_CPU_MELTDOWN); - return true; + if (x86_match_cpu(cpu_no_l1tf)) + return; + + setup_force_cpu_bug(X86_BUG_L1TF); } /* @@ -868,12 +937,7 @@ static void __init early_identify_cpu(struct cpuinfo_x86 *c) setup_force_cpu_cap(X86_FEATURE_ALWAYS); - if (!x86_match_cpu(cpu_no_speculation)) { - if (cpu_vulnerable_to_meltdown(c)) - setup_force_cpu_bug(X86_BUG_CPU_MELTDOWN); - setup_force_cpu_bug(X86_BUG_SPECTRE_V1); - setup_force_cpu_bug(X86_BUG_SPECTRE_V2); - } + cpu_set_bug_bits(c); } void __init early_cpu_init(void) @@ -1151,6 +1215,7 @@ void identify_secondary_cpu(struct cpuinfo_x86 *c) enable_sep_cpu(); #endif mtrr_ap_init(); + x86_spec_ctrl_setup_ap(); } struct msr_range { diff --git a/arch/x86/kernel/cpu/cpu.h b/arch/x86/kernel/cpu/cpu.h index c37dc37e8317..4e70442527ff 100644 --- a/arch/x86/kernel/cpu/cpu.h +++ b/arch/x86/kernel/cpu/cpu.h @@ -45,4 +45,7 @@ extern const struct cpu_dev *const __x86_cpu_dev_start[], extern void get_cpu_cap(struct cpuinfo_x86 *c); extern void cpu_detect_cache_sizes(struct cpuinfo_x86 *c); + +extern void x86_spec_ctrl_setup_ap(void); + #endif /* ARCH_X86_CPU_H */ diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c index aa3d005ad648..0d41977fb6ad 100644 --- a/arch/x86/kernel/cpu/intel.c +++ b/arch/x86/kernel/cpu/intel.c @@ -119,7 +119,10 @@ static void early_init_intel(struct cpuinfo_x86 *c) setup_clear_cpu_cap(X86_FEATURE_IBPB); setup_clear_cpu_cap(X86_FEATURE_STIBP); setup_clear_cpu_cap(X86_FEATURE_SPEC_CTRL); + setup_clear_cpu_cap(X86_FEATURE_MSR_SPEC_CTRL); setup_clear_cpu_cap(X86_FEATURE_INTEL_STIBP); + setup_clear_cpu_cap(X86_FEATURE_SSBD); + setup_clear_cpu_cap(X86_FEATURE_SPEC_CTRL_SSBD); } /* diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index 687152ec56e1..040e20ffb4be 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -30,6 +30,7 @@ #include #include #include +#include /* * per-CPU TSS segments. Threads are completely 'soft' on Linux, @@ -216,6 +217,148 @@ static inline void switch_to_bitmap(struct tss_struct *tss, } } +#ifdef CONFIG_SMP + +struct ssb_state { + struct ssb_state *shared_state; + raw_spinlock_t lock; + unsigned int disable_state; + unsigned long local_state; +}; + +#define LSTATE_SSB 0 + +static DEFINE_PER_CPU(struct ssb_state, ssb_state); + +void speculative_store_bypass_ht_init(void) +{ + struct ssb_state *st = this_cpu_ptr(&ssb_state); + unsigned int this_cpu = smp_processor_id(); + unsigned int cpu; + + st->local_state = 0; + + /* + * Shared state setup happens once on the first bringup + * of the CPU. It's not destroyed on CPU hotunplug. + */ + if (st->shared_state) + return; + + raw_spin_lock_init(&st->lock); + + /* + * Go over HT siblings and check whether one of them has set up the + * shared state pointer already. + */ + for_each_cpu(cpu, topology_thread_cpumask(this_cpu)) { + if (cpu == this_cpu) + continue; + + if (!per_cpu(ssb_state, cpu).shared_state) + continue; + + /* Link it to the state of the sibling: */ + st->shared_state = per_cpu(ssb_state, cpu).shared_state; + return; + } + + /* + * First HT sibling to come up on the core. Link shared state of + * the first HT sibling to itself. The siblings on the same core + * which come up later will see the shared state pointer and link + * themself to the state of this CPU. 
+ */ + st->shared_state = st; +} + +/* + * Logic: the first HT sibling to enable SSBD enables it for both siblings + * in the core, and the last sibling to disable it disables it for the whole + * core. This is how MSR_SPEC_CTRL works in "hardware": + * + * CORE_SPEC_CTRL = THREAD0_SPEC_CTRL | THREAD1_SPEC_CTRL + */ +static __always_inline void amd_set_core_ssb_state(unsigned long tifn) +{ + struct ssb_state *st = this_cpu_ptr(&ssb_state); + u64 msr = x86_amd_ls_cfg_base; + + if (!static_cpu_has(X86_FEATURE_ZEN)) { + msr |= ssbd_tif_to_amd_ls_cfg(tifn); + wrmsrl(MSR_AMD64_LS_CFG, msr); + return; + } + + if (tifn & _TIF_SSBD) { + /* + * Since this can race with prctl(), block reentry on the + * same CPU. + */ + if (__test_and_set_bit(LSTATE_SSB, &st->local_state)) + return; + + msr |= x86_amd_ls_cfg_ssbd_mask; + + raw_spin_lock(&st->shared_state->lock); + /* First sibling enables SSBD: */ + if (!st->shared_state->disable_state) + wrmsrl(MSR_AMD64_LS_CFG, msr); + st->shared_state->disable_state++; + raw_spin_unlock(&st->shared_state->lock); + } else { + if (!__test_and_clear_bit(LSTATE_SSB, &st->local_state)) + return; + + raw_spin_lock(&st->shared_state->lock); + st->shared_state->disable_state--; + if (!st->shared_state->disable_state) + wrmsrl(MSR_AMD64_LS_CFG, msr); + raw_spin_unlock(&st->shared_state->lock); + } +} +#else +static __always_inline void amd_set_core_ssb_state(unsigned long tifn) +{ + u64 msr = x86_amd_ls_cfg_base | ssbd_tif_to_amd_ls_cfg(tifn); + + wrmsrl(MSR_AMD64_LS_CFG, msr); +} +#endif + +static __always_inline void amd_set_ssb_virt_state(unsigned long tifn) +{ + /* + * SSBD has the same definition in SPEC_CTRL and VIRT_SPEC_CTRL, + * so ssbd_tif_to_spec_ctrl() just works. + */ + wrmsrl(MSR_AMD64_VIRT_SPEC_CTRL, ssbd_tif_to_spec_ctrl(tifn)); +} + +static __always_inline void intel_set_ssb_state(unsigned long tifn) +{ + u64 msr = x86_spec_ctrl_base | ssbd_tif_to_spec_ctrl(tifn); + + wrmsrl(MSR_IA32_SPEC_CTRL, msr); +} + +static __always_inline void __speculative_store_bypass_update(unsigned long tifn) +{ + if (static_cpu_has(X86_FEATURE_VIRT_SSBD)) + amd_set_ssb_virt_state(tifn); + else if (static_cpu_has(X86_FEATURE_LS_CFG_SSBD)) + amd_set_core_ssb_state(tifn); + else + intel_set_ssb_state(tifn); +} + +void speculative_store_bypass_update(unsigned long tif) +{ + preempt_disable(); + __speculative_store_bypass_update(tif); + preempt_enable(); +} + void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p, struct tss_struct *tss) { @@ -248,6 +391,9 @@ void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p, else hard_enable_TSC(); } + + if ((tifp ^ tifn) & _TIF_SSBD) + __speculative_store_bypass_update(tifn); } /* diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index 28373e51b94a..464c49cdbbb9 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -860,6 +860,12 @@ void __init setup_arch(char **cmdline_p) memblock_reserve(__pa_symbol(_text), (unsigned long)__bss_stop - (unsigned long)_text); + /* + * Make sure page 0 is always reserved because on systems with + * L1TF, its contents can be leaked to user processes.
+ */ + memblock_reserve(0, PAGE_SIZE); + early_reserve_initrd(); /* diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index 215815b6407c..93ffaccb2f48 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -77,6 +77,7 @@ #include #include #include +#include /* State of each CPU */ DEFINE_PER_CPU(int, cpu_state) = { 0 }; @@ -243,6 +244,8 @@ static void notrace start_secondary(void *unused) */ check_tsc_sync_target(); + speculative_store_bypass_ht_init(); + /* * Enable the espfix hack for this CPU */ @@ -1162,6 +1165,8 @@ void __init native_smp_prepare_cpus(unsigned int max_cpus) set_mtrr_aps_delayed_init(); out: preempt_enable(); + + speculative_store_bypass_ht_init(); } void arch_enable_nonboot_cpus_begin(void) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index c911af5d0cf5..09c85f17aa44 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -302,7 +302,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function, /* cpuid 0x80000008.ebx */ const u32 kvm_cpuid_8000_0008_ebx_x86_features = - F(IBPB) | F(IBRS); + F(AMD_IBPB) | F(AMD_IBRS) | F(VIRT_SSBD); /* cpuid 0xC0000001.edx */ const u32 kvm_supported_word5_x86_features = @@ -318,7 +318,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function, /* cpuid 7.0.edx*/ const u32 kvm_cpuid_7_0_edx_x86_features = - F(SPEC_CTRL) | F(ARCH_CAPABILITIES); + F(SPEC_CTRL) | F(SPEC_CTRL_SSBD) | F(ARCH_CAPABILITIES); /* all calls to cpuid_count() should be made on the same cpu */ get_cpu(); @@ -524,13 +524,20 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function, g_phys_as = phys_as; entry->eax = g_phys_as | (virt_as << 8); entry->edx = 0; - /* IBRS and IBPB aren't necessarily present in hardware cpuid */ - if (boot_cpu_has(X86_FEATURE_IBPB)) - entry->ebx |= F(IBPB); - if (boot_cpu_has(X86_FEATURE_IBRS)) - entry->ebx |= F(IBRS); + /* + * IBRS, IBPB and VIRT_SSBD aren't necessarily present in + * hardware cpuid + */ + if (boot_cpu_has(X86_FEATURE_AMD_IBPB)) + entry->ebx |= F(AMD_IBPB); + if (boot_cpu_has(X86_FEATURE_AMD_IBRS)) + entry->ebx |= F(AMD_IBRS); + if (boot_cpu_has(X86_FEATURE_VIRT_SSBD)) + entry->ebx |= F(VIRT_SSBD); entry->ebx &= kvm_cpuid_8000_0008_ebx_x86_features; cpuid_mask(&entry->ebx, 11); + if (boot_cpu_has(X86_FEATURE_LS_CFG_SSBD)) + entry->ebx |= F(VIRT_SSBD); break; } case 0x80000019: diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h index aff98b185e87..f1c4bed46efe 100644 --- a/arch/x86/kvm/cpuid.h +++ b/arch/x86/kvm/cpuid.h @@ -109,21 +109,21 @@ static inline bool guest_cpuid_has_ibpb(struct kvm_vcpu *vcpu) struct kvm_cpuid_entry2 *best; best = kvm_find_cpuid_entry(vcpu, 0x80000008, 0); - if (best && (best->ebx & bit(X86_FEATURE_IBPB))) + if (best && (best->ebx & bit(X86_FEATURE_AMD_IBPB))) return true; best = kvm_find_cpuid_entry(vcpu, 7, 0); return best && (best->edx & bit(X86_FEATURE_SPEC_CTRL)); } -static inline bool guest_cpuid_has_ibrs(struct kvm_vcpu *vcpu) +static inline bool guest_cpuid_has_spec_ctrl(struct kvm_vcpu *vcpu) { struct kvm_cpuid_entry2 *best; best = kvm_find_cpuid_entry(vcpu, 0x80000008, 0); - if (best && (best->ebx & bit(X86_FEATURE_IBRS))) + if (best && (best->ebx & bit(X86_FEATURE_AMD_IBRS))) return true; best = kvm_find_cpuid_entry(vcpu, 7, 0); - return best && (best->edx & bit(X86_FEATURE_SPEC_CTRL)); + return best && (best->edx & (bit(X86_FEATURE_SPEC_CTRL) | bit(X86_FEATURE_SPEC_CTRL_SSBD))); } static inline bool guest_cpuid_has_arch_capabilities(struct kvm_vcpu *vcpu) 
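The guest_cpuid_has_*() helpers above encode the dual enumeration scheme this patch has to deal with: AMD reports IBPB/IBRS/VIRT_SSBD in CPUID leaf 0x80000008 EBX, while Intel reports SPEC_CTRL and SPEC_CTRL_SSBD in leaf 7 EDX, and guest_cpuid_has_spec_ctrl() now accepts either path. For orientation, the same leaves can be probed from user space with GCC's cpuid.h helpers; the bit positions below come from the public Intel/AMD CPUID documentation, not from this patch, so treat them as an assumption:

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	/* Intel path: CPUID.(EAX=7,ECX=0).EDX bit 31 is SPEC_CTRL_SSBD */
	if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		printf("SPEC_CTRL_SSBD: %s\n", (edx & (1u << 31)) ? "yes" : "no");

	/* AMD path: CPUID.0x80000008.EBX bit 25 is VIRT_SSBD */
	if (__get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx))
		printf("VIRT_SSBD:      %s\n", (ebx & (1u << 25)) ? "yes" : "no");

	return 0;
}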
@@ -134,5 +134,13 @@ static inline bool guest_cpuid_has_arch_capabilities(struct kvm_vcpu *vcpu) return best && (best->edx & bit(X86_FEATURE_ARCH_CAPABILITIES)); } +static inline bool guest_cpuid_has_virt_ssbd(struct kvm_vcpu *vcpu) +{ + struct kvm_cpuid_entry2 *best; + + best = kvm_find_cpuid_entry(vcpu, 0x80000008, 0); + return best && (best->ebx & bit(X86_FEATURE_VIRT_SSBD)); +} + #endif diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 374d7c293740..57d83c29f097 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -37,7 +37,7 @@ #include #include #include -#include +#include #include #include "trace.h" @@ -148,6 +148,12 @@ struct vcpu_svm { } host; u64 spec_ctrl; + /* + * Contains guest-controlled bits of VIRT_SPEC_CTRL, which will be + * translated into the appropriate LS_CFG bits on the host to + * control speculation. + */ + u64 virt_spec_ctrl; u32 *msrpm; @@ -1230,6 +1236,7 @@ static void svm_vcpu_reset(struct kvm_vcpu *vcpu) u32 eax = 1; svm->spec_ctrl = 0; + svm->virt_spec_ctrl = 0; init_vmcb(svm); @@ -3139,11 +3146,18 @@ static int svm_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) break; case MSR_IA32_SPEC_CTRL: if (!msr_info->host_initiated && - !guest_cpuid_has_ibrs(vcpu)) + !guest_cpuid_has_spec_ctrl(vcpu)) return 1; msr_info->data = svm->spec_ctrl; break; + case MSR_AMD64_VIRT_SPEC_CTRL: + if (!msr_info->host_initiated && + !guest_cpuid_has_virt_ssbd(vcpu)) + return 1; + + msr_info->data = svm->virt_spec_ctrl; + break; case MSR_IA32_UCODE_REV: msr_info->data = 0x01000065; break; @@ -3218,7 +3232,7 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) break; case MSR_IA32_SPEC_CTRL: if (!msr->host_initiated && - !guest_cpuid_has_ibrs(vcpu)) + !guest_cpuid_has_spec_ctrl(vcpu)) return 1; /* The STIBP bit doesn't fault even if it's not advertised */ @@ -3259,6 +3273,16 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) break; set_msr_interception(svm->msrpm, MSR_IA32_PRED_CMD, 0, 1); break; + case MSR_AMD64_VIRT_SPEC_CTRL: + if (!msr->host_initiated && + !guest_cpuid_has_virt_ssbd(vcpu)) + return 1; + + if (data & ~SPEC_CTRL_SSBD) + return 1; + + svm->virt_spec_ctrl = data; + break; case MSR_STAR: svm->vmcb->save.star = data; break; @@ -3959,16 +3983,15 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu) clgi(); - local_irq_enable(); - /* * If this vCPU has touched SPEC_CTRL, restore the guest's value if * it's non-zero. Since vmentry is serialising on affected CPUs, there * is no need to worry about the conditional branch over the wrmsr * being speculatively taken. */ - if (svm->spec_ctrl) - native_wrmsrl(MSR_IA32_SPEC_CTRL, svm->spec_ctrl); + x86_spec_ctrl_set_guest(svm->spec_ctrl, svm->virt_spec_ctrl); + + local_irq_enable(); asm volatile ( "push %%" _ASM_BP "; \n\t" @@ -4062,6 +4085,18 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu) #endif ); + /* Eliminate branch target predictions from guest mode */ + vmexit_fill_RSB(); + +#ifdef CONFIG_X86_64 + wrmsrl(MSR_GS_BASE, svm->host.gs_base); +#else + loadsegment(fs, svm->host.fs); +#ifndef CONFIG_X86_32_LAZY_GS + loadsegment(gs, svm->host.gs); +#endif +#endif + /* * We do not use IBRS in the kernel.
If this vCPU has used the * SPEC_CTRL MSR it may have left it on; save the value and @@ -4080,25 +4115,12 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu) if (unlikely(!msr_write_intercepted(vcpu, MSR_IA32_SPEC_CTRL))) svm->spec_ctrl = native_read_msr(MSR_IA32_SPEC_CTRL); - if (svm->spec_ctrl) - native_wrmsrl(MSR_IA32_SPEC_CTRL, 0); - - /* Eliminate branch target predictions from guest mode */ - vmexit_fill_RSB(); - -#ifdef CONFIG_X86_64 - wrmsrl(MSR_GS_BASE, svm->host.gs_base); -#else - loadsegment(fs, svm->host.fs); -#ifndef CONFIG_X86_32_LAZY_GS - loadsegment(gs, svm->host.gs); -#endif -#endif - reload_tss(vcpu); local_irq_disable(); + x86_spec_ctrl_restore_host(svm->spec_ctrl, svm->virt_spec_ctrl); + vcpu->arch.cr2 = svm->vmcb->save.cr2; vcpu->arch.regs[VCPU_REGS_RAX] = svm->vmcb->save.rax; vcpu->arch.regs[VCPU_REGS_RSP] = svm->vmcb->save.rsp; @@ -4197,6 +4219,11 @@ static bool svm_cpu_has_accelerated_tpr(void) return false; } +static bool svm_has_emulated_msr(int index) +{ + return true; +} + static u64 svm_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) { return 0; @@ -4463,6 +4490,7 @@ static struct kvm_x86_ops svm_x86_ops = { .hardware_enable = svm_hardware_enable, .hardware_disable = svm_hardware_disable, .cpu_has_accelerated_tpr = svm_cpu_has_accelerated_tpr, + .has_emulated_msr = svm_has_emulated_msr, .vcpu_create = svm_create_vcpu, .vcpu_free = svm_free_vcpu, diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index fef8d308a5ee..f931787e067b 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -47,7 +47,7 @@ #include #include #include -#include +#include #include "trace.h" @@ -2528,7 +2528,7 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) break; case MSR_IA32_SPEC_CTRL: if (!msr_info->host_initiated && - !guest_cpuid_has_ibrs(vcpu)) + !guest_cpuid_has_spec_ctrl(vcpu)) return 1; msr_info->data = to_vmx(vcpu)->spec_ctrl; @@ -2633,11 +2633,11 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) break; case MSR_IA32_SPEC_CTRL: if (!msr_info->host_initiated && - !guest_cpuid_has_ibrs(vcpu)) + !guest_cpuid_has_spec_ctrl(vcpu)) return 1; /* The STIBP bit doesn't fault even if it's not advertised */ - if (data & ~(SPEC_CTRL_IBRS | SPEC_CTRL_STIBP)) + if (data & ~(SPEC_CTRL_IBRS | SPEC_CTRL_STIBP | SPEC_CTRL_SSBD)) return 1; vmx->spec_ctrl = data; @@ -7356,6 +7356,17 @@ static void vmx_handle_external_intr(struct kvm_vcpu *vcpu) local_irq_enable(); } +static bool vmx_has_emulated_msr(int index) +{ + switch (index) { + case MSR_AMD64_VIRT_SPEC_CTRL: + /* This is AMD only. */ + return false; + default: + return true; + } +} + static bool vmx_mpx_supported(void) { return (vmcs_config.vmexit_ctrl & VM_EXIT_CLEAR_BNDCFGS) && @@ -7540,10 +7551,10 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu) * is no need to worry about the conditional branch over the wrmsr * being speculatively taken. 
*/ - if (vmx->spec_ctrl) - native_wrmsrl(MSR_IA32_SPEC_CTRL, vmx->spec_ctrl); + x86_spec_ctrl_set_guest(vmx->spec_ctrl, 0); vmx->__launched = vmx->loaded_vmcs->launched; + asm( /* Store host registers */ "push %%" _ASM_DX "; push %%" _ASM_BP ";" @@ -7674,8 +7685,7 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu) if (unlikely(!msr_write_intercepted_l01(vcpu, MSR_IA32_SPEC_CTRL))) vmx->spec_ctrl = native_read_msr(MSR_IA32_SPEC_CTRL); - if (vmx->spec_ctrl) - native_wrmsrl(MSR_IA32_SPEC_CTRL, 0); + x86_spec_ctrl_restore_host(vmx->spec_ctrl, 0); /* Eliminate branch target predictions from guest mode */ vmexit_fill_RSB(); @@ -9035,6 +9045,7 @@ static struct kvm_x86_ops vmx_x86_ops = { .hardware_enable = hardware_enable, .hardware_disable = hardware_disable, .cpu_has_accelerated_tpr = report_flexpriority, + .has_emulated_msr = vmx_has_emulated_msr, .vcpu_create = vmx_create_vcpu, .vcpu_free = vmx_free_vcpu, diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 8fddaf37ff4f..6f38c8385aa9 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -906,6 +906,7 @@ static u32 emulated_msrs[] = { MSR_IA32_MISC_ENABLE, MSR_IA32_MCG_STATUS, MSR_IA32_MCG_CTL, + MSR_AMD64_VIRT_SPEC_CTRL, }; static unsigned num_emulated_msrs; @@ -4039,10 +4040,8 @@ static void kvm_init_msr_list(void) num_msrs_to_save = j; for (i = j = 0; i < ARRAY_SIZE(emulated_msrs); i++) { - switch (emulated_msrs[i]) { - default: - break; - } + if (!kvm_x86_ops->has_emulated_msr(emulated_msrs[i])) + continue; if (j < i) emulated_msrs[j] = emulated_msrs[i]; diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c index 471f1ff32eb1..9e77ca760eb4 100644 --- a/arch/x86/mm/init.c +++ b/arch/x86/mm/init.c @@ -4,6 +4,8 @@ #include #include #include /* for max_low_pfn */ +#include +#include #include #include @@ -699,3 +701,25 @@ void __init zone_sizes_init(void) free_area_init_nodes(max_zone_pfns); } +#ifdef CONFIG_SWAP +unsigned long max_swapfile_size(void) +{ + unsigned long pages; + + pages = generic_max_swapfile_size(); + + if (boot_cpu_has_bug(X86_BUG_L1TF)) { + /* Limit the swap file size to MAX_PA/2 for L1TF workaround */ + unsigned long long l1tf_limit = l1tf_pfn_limit(); + /* + * We encode swap offsets also with 3 bits below those for pfn + * which makes the usable limit higher. 
+ */ +#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE) + l1tf_limit <<= PAGE_SHIFT - SWP_OFFSET_FIRST_BIT; +#endif + pages = min_t(unsigned long long, l1tf_limit, pages); + } + return pages; +} +#endif diff --git a/arch/x86/mm/kmmio.c b/arch/x86/mm/kmmio.c index 637ab34ed632..36a1cc435493 100644 --- a/arch/x86/mm/kmmio.c +++ b/arch/x86/mm/kmmio.c @@ -114,24 +114,29 @@ static struct kmmio_fault_page *get_kmmio_fault_page(unsigned long page) static void clear_pmd_presence(pmd_t *pmd, bool clear, pmdval_t *old) { + pmd_t new_pmd; pmdval_t v = pmd_val(*pmd); if (clear) { - *old = v & _PAGE_PRESENT; - v &= ~_PAGE_PRESENT; - } else /* presume this has been called with clear==true previously */ - v |= *old; - set_pmd(pmd, __pmd(v)); + *old = v; + new_pmd = pmd_mknotpresent(*pmd); + } else { + /* Presume this has been called with clear==true previously */ + new_pmd = __pmd(*old); + } + set_pmd(pmd, new_pmd); } static void clear_pte_presence(pte_t *pte, bool clear, pteval_t *old) { pteval_t v = pte_val(*pte); if (clear) { - *old = v & _PAGE_PRESENT; - v &= ~_PAGE_PRESENT; - } else /* presume this has been called with clear==true previously */ - v |= *old; - set_pte_atomic(pte, __pte(v)); + *old = v; + /* Nothing should care about address */ + pte_clear(&init_mm, 0, pte); + } else { + /* Presume this has been called with clear==true previously */ + set_pte_atomic(pte, __pte(*old)); + } } static int clear_page_presence(struct kmmio_fault_page *f, bool clear) diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c index 2cd89be6bb4a..72d3a6878314 100644 --- a/arch/x86/mm/mmap.c +++ b/arch/x86/mm/mmap.c @@ -114,3 +114,24 @@ void arch_pick_mmap_layout(struct mm_struct *mm) mm->get_unmapped_area = arch_get_unmapped_area_topdown; } } + +/* + * Only allow root to set high MMIO mappings to PROT_NONE. + * This prevents an unprivileged user from setting them to PROT_NONE + * and then inverting them to point at valid memory for L1TF + * speculation. + * + * Note: locked-down kernels may want to disable the root override.
+ */ +bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot) +{ + if (!boot_cpu_has_bug(X86_BUG_L1TF)) + return true; + if (!__pte_needs_invert(pgprot_val(prot))) + return true; + /* If it's real memory always allow */ + if (pfn_valid(pfn)) + return true; + if (pfn >= l1tf_pfn_limit() && !capable(CAP_SYS_ADMIN)) + return false; + return true; +} diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c index aa7cdf54ad85..0aa429ec44d4 100644 --- a/arch/x86/mm/pageattr.c +++ b/arch/x86/mm/pageattr.c @@ -952,7 +952,8 @@ static int populate_pmd(struct cpa_data *cpa, pmd = pmd_offset(pud, start); - set_pmd(pmd, __pmd(cpa->pfn | _PAGE_PSE | massage_pgprot(pgprot))); + set_pmd(pmd, pmd_mkhuge(pfn_pmd(cpa->pfn >> PAGE_SHIFT, + canon_pgprot(pgprot)))); start += PMD_SIZE; cpa->pfn += PMD_SIZE; @@ -1022,7 +1023,8 @@ static int populate_pud(struct cpa_data *cpa, unsigned long start, pgd_t *pgd, * Map everything starting from the Gb boundary, possibly with 1G pages */ while (end - start >= PUD_SIZE) { - set_pud(pud, __pud(cpa->pfn | _PAGE_PSE | massage_pgprot(pgprot))); + set_pud(pud, pud_mkhuge(pfn_pud(cpa->pfn >> PAGE_SHIFT, + canon_pgprot(pgprot)))); start += PUD_SIZE; cpa->pfn += PUD_SIZE; diff --git a/arch/x86/mm/pat.c b/arch/x86/mm/pat.c index 657438858e83..bf144b6129b7 100644 --- a/arch/x86/mm/pat.c +++ b/arch/x86/mm/pat.c @@ -481,6 +481,20 @@ void io_free_memtype(resource_size_t start, resource_size_t end) free_memtype(start, end); } +int arch_io_reserve_memtype_wc(resource_size_t start, resource_size_t size) +{ + unsigned long type = _PAGE_CACHE_WC; + + return io_reserve_memtype(start, start + size, &type); +} +EXPORT_SYMBOL(arch_io_reserve_memtype_wc); + +void arch_io_free_memtype_wc(resource_size_t start, resource_size_t size) +{ + io_free_memtype(start, start + size); +} +EXPORT_SYMBOL(arch_io_free_memtype_wc); + pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn, unsigned long size, pgprot_t vma_prot) { diff --git a/arch/x86/tools/relocs.c b/arch/x86/tools/relocs.c index bbb1d2259ecf..c5d2acce7f15 100644 --- a/arch/x86/tools/relocs.c +++ b/arch/x86/tools/relocs.c @@ -984,9 +984,12 @@ static void emit_relocs(int as_text, int use_real_mode) die("Segment relocations found but --realmode not specified\n"); /* Order the relocations for more efficient processing */ - sort_relocs(&relocs16); sort_relocs(&relocs32); +#if ELF_BITS == 64 sort_relocs(&relocs64); +#else + sort_relocs(&relocs16); +#endif /* Print the relocations */ if (as_text) { diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c index 7005974c3ff3..baa1d038c864 100644 --- a/arch/x86/xen/smp.c +++ b/arch/x86/xen/smp.c @@ -27,6 +27,7 @@ #include #include +#include #include #include @@ -83,6 +84,8 @@ static void cpu_bringup(void) cpu_data(cpu).x86_max_cores = 1; set_cpu_sibling_map(cpu); + speculative_store_bypass_ht_init(); + xen_setup_cpu_clockevents(); notify_cpu_starting(cpu); @@ -334,6 +337,8 @@ static void __init xen_smp_prepare_cpus(unsigned int max_cpus) } set_cpu_sibling_map(0); + speculative_store_bypass_ht_init(); + if (xen_smp_intr_init(0)) BUG(); diff --git a/arch/xtensa/include/asm/pgtable.h b/arch/xtensa/include/asm/pgtable.h index cd1c00fc744f..6edbdb6ab225 100644 --- a/arch/xtensa/include/asm/pgtable.h +++ b/arch/xtensa/include/asm/pgtable.h @@ -89,8 +89,6 @@ * (PAGE_NONE)| PPN | 0 | 00 | ADW | 01 | 11 | 11 | * +-----------------------------------------+ * swap | index | type | 01 | 11 | 00 | - * +- - - - - - - - - - - - - - - - - - - - -+ - * file | file offset | 01 | 11 | 10 | * 
+-----------------------------------------+ * * For T1050 hardware and earlier the layout differs for present and (PAGE_NONE) @@ -111,7 +109,6 @@ * index swap offset / PAGE_SIZE (bit 11-31: 21 bits -> 8 GB) * (note that the index is always non-zero) * type swap type (5 bits -> 32 types) - * file offset 26-bit offset into the file, in increments of PAGE_SIZE * * Notes: * - (PROT_NONE) is a special case of 'present' but causes an exception for @@ -144,7 +141,6 @@ #define _PAGE_HW_VALID 0x00 #define _PAGE_NONE 0x0f #endif -#define _PAGE_FILE (1<<1) /* file mapped page, only if !present */ #define _PAGE_USER (1<<4) /* user access (ring=1) */ @@ -260,7 +256,6 @@ static inline void pgtable_cache_init(void) { } static inline int pte_write(pte_t pte) { return pte_val(pte) & _PAGE_WRITABLE; } static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY; } static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED; } -static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; } static inline int pte_special(pte_t pte) { return 0; } static inline pte_t pte_wrprotect(pte_t pte) @@ -388,11 +383,6 @@ ptep_set_wrprotect(struct mm_struct *mm, unsigned long addr, pte_t *ptep) #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val }) -#define PTE_FILE_MAX_BITS 26 -#define pte_to_pgoff(pte) (pte_val(pte) >> 6) -#define pgoff_to_pte(off) \ - ((pte_t) { ((off) << 6) | _PAGE_CA_INVALID | _PAGE_FILE | _PAGE_USER }) - #endif /* !defined (__ASSEMBLY__) */ diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c index 9175c84161ec..c55acc37fef2 100644 --- a/drivers/base/cpu.c +++ b/drivers/base/cpu.c @@ -438,14 +438,30 @@ ssize_t __weak cpu_show_spectre_v2(struct device *dev, return sprintf(buf, "Not affected\n"); } +ssize_t __weak cpu_show_spec_store_bypass(struct device *dev, + struct device_attribute *attr, char *buf) +{ + return sprintf(buf, "Not affected\n"); +} + +ssize_t __weak cpu_show_l1tf(struct device *dev, + struct device_attribute *attr, char *buf) +{ + return sprintf(buf, "Not affected\n"); +} + static DEVICE_ATTR(meltdown, 0444, cpu_show_meltdown, NULL); static DEVICE_ATTR(spectre_v1, 0444, cpu_show_spectre_v1, NULL); static DEVICE_ATTR(spectre_v2, 0444, cpu_show_spectre_v2, NULL); +static DEVICE_ATTR(spec_store_bypass, 0444, cpu_show_spec_store_bypass, NULL); +static DEVICE_ATTR(l1tf, 0444, cpu_show_l1tf, NULL); static struct attribute *cpu_root_vulnerabilities_attrs[] = { &dev_attr_meltdown.attr, &dev_attr_spectre_v1.attr, &dev_attr_spectre_v2.attr, + &dev_attr_spec_store_bypass.attr, + &dev_attr_l1tf.attr, NULL }; diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c index 56d46ffb08e1..f824836d2e7a 100644 --- a/drivers/block/floppy.c +++ b/drivers/block/floppy.c @@ -3459,6 +3459,9 @@ static int fd_locked_ioctl(struct block_device *bdev, fmode_t mode, unsigned int (struct floppy_struct **)&outparam); if (ret) return ret; + memcpy(&inparam.g, outparam, + offsetof(struct floppy_struct, name)); + outparam = &inparam.g; break; case FDMSGON: UDP->flags |= FTD_MSG; diff --git a/drivers/gpu/drm/ast/ast_ttm.c b/drivers/gpu/drm/ast/ast_ttm.c index b8246227bab0..951421fb5803 100644 --- a/drivers/gpu/drm/ast/ast_ttm.c +++ b/drivers/gpu/drm/ast/ast_ttm.c @@ -275,6 +275,8 @@ int ast_mm_init(struct ast_private *ast) return ret; } + arch_io_reserve_memtype_wc(pci_resource_start(dev->pdev, 0), + pci_resource_len(dev->pdev, 0)); ast->fb_mtrr = arch_phys_wc_add(pci_resource_start(dev->pdev, 0), 
pci_resource_len(dev->pdev, 0)); @@ -283,11 +285,15 @@ int ast_mm_init(struct ast_private *ast) void ast_mm_fini(struct ast_private *ast) { + struct drm_device *dev = ast->dev; + ttm_bo_device_release(&ast->ttm.bdev); ast_ttm_global_release(ast); arch_phys_wc_del(ast->fb_mtrr); + arch_io_free_memtype_wc(pci_resource_start(dev->pdev, 0), + pci_resource_len(dev->pdev, 0)); } void ast_ttm_placement(struct ast_bo *bo, int domain) diff --git a/drivers/gpu/drm/cirrus/cirrus_ttm.c b/drivers/gpu/drm/cirrus/cirrus_ttm.c index 92e6b7786097..4ff3060dd673 100644 --- a/drivers/gpu/drm/cirrus/cirrus_ttm.c +++ b/drivers/gpu/drm/cirrus/cirrus_ttm.c @@ -275,6 +275,9 @@ int cirrus_mm_init(struct cirrus_device *cirrus) return ret; } + arch_io_reserve_memtype_wc(pci_resource_start(dev->pdev, 0), + pci_resource_len(dev->pdev, 0)); + cirrus->fb_mtrr = arch_phys_wc_add(pci_resource_start(dev->pdev, 0), pci_resource_len(dev->pdev, 0)); @@ -284,6 +287,8 @@ int cirrus_mm_init(struct cirrus_device *cirrus) void cirrus_mm_fini(struct cirrus_device *cirrus) { + struct drm_device *dev = cirrus->dev; + if (!cirrus->mm_inited) return; @@ -293,6 +298,8 @@ void cirrus_mm_fini(struct cirrus_device *cirrus) arch_phys_wc_del(cirrus->fb_mtrr); cirrus->fb_mtrr = 0; + arch_io_free_memtype_wc(pci_resource_start(dev->pdev, 0), + pci_resource_len(dev->pdev, 0)); } void cirrus_ttm_placement(struct cirrus_bo *bo, int domain) diff --git a/drivers/gpu/drm/drm_vma_manager.c b/drivers/gpu/drm/drm_vma_manager.c index 63b471205072..68c1f32fb086 100644 --- a/drivers/gpu/drm/drm_vma_manager.c +++ b/drivers/gpu/drm/drm_vma_manager.c @@ -50,8 +50,7 @@ * * You must not use multiple offset managers on a single address_space. * Otherwise, mm-core will be unable to tear down memory mappings as the VM will - * no longer be linear. Please use VM_NONLINEAR in that case and implement your - * own offset managers. + * no longer be linear. * * This offset manager works on page-based addresses. 
That is, every argument * and return code (with the exception of drm_vma_node_offset_addr()) is given diff --git a/drivers/gpu/drm/mgag200/mgag200_ttm.c b/drivers/gpu/drm/mgag200/mgag200_ttm.c index 5a00e90696de..f9ebdd2b4c19 100644 --- a/drivers/gpu/drm/mgag200/mgag200_ttm.c +++ b/drivers/gpu/drm/mgag200/mgag200_ttm.c @@ -274,6 +274,9 @@ int mgag200_mm_init(struct mga_device *mdev) return ret; } + arch_io_reserve_memtype_wc(pci_resource_start(dev->pdev, 0), + pci_resource_len(dev->pdev, 0)); + mdev->fb_mtrr = arch_phys_wc_add(pci_resource_start(dev->pdev, 0), pci_resource_len(dev->pdev, 0)); @@ -282,10 +285,14 @@ int mgag200_mm_init(struct mga_device *mdev) void mgag200_mm_fini(struct mga_device *mdev) { + struct drm_device *dev = mdev->dev; + ttm_bo_device_release(&mdev->ttm.bdev); mgag200_ttm_global_release(mdev); + arch_io_free_memtype_wc(pci_resource_start(dev->pdev, 0), + pci_resource_len(dev->pdev, 0)); arch_phys_wc_del(mdev->fb_mtrr); mdev->fb_mtrr = 0; } diff --git a/drivers/gpu/drm/nouveau/nouveau_ttm.c b/drivers/gpu/drm/nouveau/nouveau_ttm.c index 7e185c122750..820117d86d89 100644 --- a/drivers/gpu/drm/nouveau/nouveau_ttm.c +++ b/drivers/gpu/drm/nouveau/nouveau_ttm.c @@ -397,6 +397,9 @@ nouveau_ttm_init(struct nouveau_drm *drm) drm->gem.vram_available = nouveau_fb(drm->device)->ram->size; drm->gem.vram_available -= nouveau_instmem(drm->device)->reserved; + arch_io_reserve_memtype_wc(nv_device_resource_start(device, 1), + nv_device_resource_len(device, 1)); + ret = ttm_bo_init_mm(&drm->ttm.bdev, TTM_PL_VRAM, drm->gem.vram_available >> PAGE_SHIFT); if (ret) { @@ -429,6 +432,8 @@ nouveau_ttm_init(struct nouveau_drm *drm) void nouveau_ttm_fini(struct nouveau_drm *drm) { + struct nouveau_device *device = nv_device(drm->device); + mutex_lock(&drm->dev->struct_mutex); ttm_bo_clean_mm(&drm->ttm.bdev, TTM_PL_VRAM); ttm_bo_clean_mm(&drm->ttm.bdev, TTM_PL_TT); @@ -440,4 +445,7 @@ nouveau_ttm_fini(struct nouveau_drm *drm) arch_phys_wc_del(drm->ttm.mtrr); drm->ttm.mtrr = 0; + arch_io_free_memtype_wc(nv_device_resource_start(device, 1), + nv_device_resource_len(device, 1)); + } diff --git a/drivers/gpu/drm/radeon/radeon_object.c b/drivers/gpu/drm/radeon/radeon_object.c index 6b1762e0879d..a1b0f3bb7e9a 100644 --- a/drivers/gpu/drm/radeon/radeon_object.c +++ b/drivers/gpu/drm/radeon/radeon_object.c @@ -359,6 +359,10 @@ void radeon_bo_force_delete(struct radeon_device *rdev) int radeon_bo_init(struct radeon_device *rdev) { + /* reserve PAT memory space to WC for VRAM */ + arch_io_reserve_memtype_wc(rdev->mc.aper_base, + rdev->mc.aper_size); + /* Add an MTRR for the VRAM */ if (!rdev->fastfb_working) { rdev->mc.vram_mtrr = arch_phys_wc_add(rdev->mc.aper_base, @@ -376,6 +380,7 @@ void radeon_bo_fini(struct radeon_device *rdev) { radeon_ttm_fini(rdev); arch_phys_wc_del(rdev->mc.vram_mtrr); + arch_io_free_memtype_wc(rdev->mc.aper_base, rdev->mc.aper_size); } /* Returns how many bytes TTM can move per IB. 
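The five DRM drivers touched above (ast, cirrus, mgag200, nouveau, radeon) all gain the same pairing: reserve the VRAM aperture as write-combining in PAT via arch_io_reserve_memtype_wc() before adding the MTRR with arch_phys_wc_add(), then release both at teardown. A condensed sketch of the pattern as it might look in a driver; my_dev, aper_base and aper_size are placeholder names rather than anything from the drivers above, and the include assumes the new declarations are reachable via <linux/io.h>:

#include <linux/io.h>	/* arch_io_reserve_memtype_wc(), arch_phys_wc_add() */

struct my_dev {
	resource_size_t aper_base;	/* VRAM aperture start */
	resource_size_t aper_size;	/* VRAM aperture length */
	int fb_mtrr;			/* cookie returned by arch_phys_wc_add() */
};

static int my_vram_init(struct my_dev *dev)
{
	/* Record the WC memtype for the whole aperture in PAT first... */
	arch_io_reserve_memtype_wc(dev->aper_base, dev->aper_size);
	/* ...then request write-combining via MTRR where PAT is not used. */
	dev->fb_mtrr = arch_phys_wc_add(dev->aper_base, dev->aper_size);
	return 0;
}

static void my_vram_fini(struct my_dev *dev)
{
	arch_phys_wc_del(dev->fb_mtrr);
	arch_io_free_memtype_wc(dev->aper_base, dev->aper_size);
}

Registering the range with PAT up front means later ioremap_wc() users agree on the memory type for the aperture instead of silently falling back to uncached mappings.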
diff --git a/drivers/hid/hid-debug.c b/drivers/hid/hid-debug.c index 8bf61d295ffd..7192fa1d2786 100644 --- a/drivers/hid/hid-debug.c +++ b/drivers/hid/hid-debug.c @@ -1150,6 +1150,8 @@ static ssize_t hid_debug_events_read(struct file *file, char __user *buffer, goto out; if (list->tail > list->head) { len = list->tail - list->head; + if (len > count) + len = count; if (copy_to_user(buffer + ret, &list->hid_debug_buf[list->head], len)) { ret = -EFAULT; @@ -1159,6 +1161,8 @@ static ssize_t hid_debug_events_read(struct file *file, char __user *buffer, list->head += len; } else { len = HID_DEBUG_BUFSIZE - list->head; + if (len > count) + len = count; if (copy_to_user(buffer, &list->hid_debug_buf[list->head], len)) { ret = -EFAULT; @@ -1166,7 +1170,9 @@ static ssize_t hid_debug_events_read(struct file *file, char __user *buffer, } list->head = 0; ret += len; - goto copy_rest; + count -= len; + if (count > 0) + goto copy_rest; } } diff --git a/drivers/macintosh/via-cuda.c b/drivers/macintosh/via-cuda.c index d61f271d2207..2d8c56360fb8 100644 --- a/drivers/macintosh/via-cuda.c +++ b/drivers/macintosh/via-cuda.c @@ -432,26 +432,20 @@ cuda_start(void) void cuda_poll(void) { - /* cuda_interrupt only takes a normal lock, we disable - * interrupts here to avoid re-entering and thus deadlocking. - */ - if (cuda_irq) - disable_irq(cuda_irq); - cuda_interrupt(0, NULL); - if (cuda_irq) - enable_irq(cuda_irq); + cuda_interrupt(0, NULL); } static irqreturn_t cuda_interrupt(int irq, void *arg) { + unsigned long flags; int status; struct adb_request *req = NULL; unsigned char ibuf[16]; int ibuf_len = 0; int complete = 0; - spin_lock(&cuda_lock); + spin_lock_irqsave(&cuda_lock, flags); /* On powermacs, this handler is registered for the VIA IRQ. But they use * just the shift register IRQ -- other VIA interrupt sources are disabled. 
@@ -464,7 +458,7 @@ cuda_interrupt(int irq, void *arg) #endif { if ((in_8(&via[IFR]) & SR_INT) == 0) { - spin_unlock(&cuda_lock); + spin_unlock_irqrestore(&cuda_lock, flags); return IRQ_NONE; } else { out_8(&via[IFR], SR_INT); @@ -593,7 +587,7 @@ cuda_interrupt(int irq, void *arg) default: printk("cuda_interrupt: unknown cuda_state %d?\n", cuda_state); } - spin_unlock(&cuda_lock); + spin_unlock_irqrestore(&cuda_lock, flags); if (complete && req) { void (*done)(struct adb_request *) = req->done; mb(); diff --git a/drivers/target/iscsi/iscsi_target_auth.c b/drivers/target/iscsi/iscsi_target_auth.c index ab4915c0d933..f7940fa64d44 100644 --- a/drivers/target/iscsi/iscsi_target_auth.c +++ b/drivers/target/iscsi/iscsi_target_auth.c @@ -26,18 +26,6 @@ #include "iscsi_target_nego.h" #include "iscsi_target_auth.h" -static int chap_string_to_hex(unsigned char *dst, unsigned char *src, int len) -{ - int j = DIV_ROUND_UP(len, 2), rc; - - rc = hex2bin(dst, src, j); - if (rc < 0) - pr_debug("CHAP string contains non hex digit symbols\n"); - - dst[j] = '\0'; - return j; -} - static void chap_binaryhex_to_asciihex(char *dst, char *src, int src_len) { int i; @@ -241,9 +229,16 @@ static int chap_server_compute_md5( pr_err("Could not find CHAP_R.\n"); goto out; } + if (strlen(chap_r) != MD5_SIGNATURE_SIZE * 2) { + pr_err("Malformed CHAP_R\n"); + goto out; + } + if (hex2bin(client_digest, chap_r, MD5_SIGNATURE_SIZE) < 0) { + pr_err("Malformed CHAP_R\n"); + goto out; + } pr_debug("[server] Got CHAP_R=%s\n", chap_r); - chap_string_to_hex(client_digest, chap_r, strlen(chap_r)); tfm = crypto_alloc_hash("md5", 0, CRYPTO_ALG_ASYNC); if (IS_ERR(tfm)) { @@ -348,9 +343,7 @@ static int chap_server_compute_md5( pr_err("Could not find CHAP_C.\n"); goto out; } - pr_debug("[server] Got CHAP_C=%s\n", challenge); - challenge_len = chap_string_to_hex(challenge_binhex, challenge, - strlen(challenge)); + challenge_len = DIV_ROUND_UP(strlen(challenge), 2); if (!challenge_len) { pr_err("Unable to convert incoming challenge\n"); goto out; @@ -359,6 +352,11 @@ static int chap_server_compute_md5( pr_err("CHAP_C exceeds maximum binary size of 1024 bytes\n"); goto out; } + if (hex2bin(challenge_binhex, challenge, challenge_len) < 0) { + pr_err("Malformed CHAP_C\n"); + goto out; + } + pr_debug("[server] Got CHAP_C=%s\n", challenge); /* * During mutual authentication, the CHAP_C generated by the * initiator must not match the original CHAP_C generated by diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c index 520c11c2dcca..92580bd45474 100644 --- a/fs/9p/vfs_file.c +++ b/fs/9p/vfs_file.c @@ -831,7 +831,6 @@ static const struct vm_operations_struct v9fs_file_vm_ops = { .fault = filemap_fault, .map_pages = filemap_map_pages, .page_mkwrite = v9fs_vm_page_mkwrite, - .remap_pages = generic_file_remap_pages, }; static const struct vm_operations_struct v9fs_mmap_file_vm_ops = { @@ -839,7 +838,6 @@ static const struct vm_operations_struct v9fs_mmap_file_vm_ops = { .fault = filemap_fault, .map_pages = filemap_map_pages, .page_mkwrite = v9fs_vm_page_mkwrite, - .remap_pages = generic_file_remap_pages, }; diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index e1553eb5ef61..c8f953b7c084 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -2025,7 +2025,6 @@ static const struct vm_operations_struct btrfs_file_vm_ops = { .fault = filemap_fault, .map_pages = filemap_map_pages, .page_mkwrite = btrfs_page_mkwrite, - .remap_pages = generic_file_remap_pages, }; static int btrfs_file_mmap(struct file *filp, struct vm_area_struct *vma) diff --git 
a/fs/ceph/addr.c b/fs/ceph/addr.c index 9bfd1af53b08..3f58a1d42c24 100644 --- a/fs/ceph/addr.c +++ b/fs/ceph/addr.c @@ -1327,7 +1327,6 @@ static int ceph_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) static struct vm_operations_struct ceph_vmops = { .fault = ceph_filemap_fault, .page_mkwrite = ceph_page_mkwrite, - .remap_pages = generic_file_remap_pages, }; int ceph_mmap(struct file *file, struct vm_area_struct *vma) diff --git a/fs/cifs/file.c b/fs/cifs/file.c index ebe39ef0616f..716c55ceee24 100644 --- a/fs/cifs/file.c +++ b/fs/cifs/file.c @@ -3111,7 +3111,6 @@ static struct vm_operations_struct cifs_file_vm_ops = { .fault = filemap_fault, .map_pages = filemap_map_pages, .page_mkwrite = cifs_page_mkwrite, - .remap_pages = generic_file_remap_pages, }; int cifs_file_strict_mmap(struct file *file, struct vm_area_struct *vma) diff --git a/fs/exec.c b/fs/exec.c index 70c31796a314..2acff9b648c0 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -206,8 +206,7 @@ static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos, if (write) { unsigned long size = bprm->vma->vm_end - bprm->vma->vm_start; - unsigned long ptr_size; - struct rlimit *rlim; + unsigned long ptr_size, limit; /* * Since the stack will hold pointers to the strings, we @@ -236,14 +235,16 @@ static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos, return page; /* - * Limit to 1/4-th the stack size for the argv+env strings. + * Limit to 1/4 of the max stack size or 3/4 of _STK_LIM + * (whichever is smaller) for the argv+env strings. * This ensures that: * - the remaining binfmt code will not run out of stack space, * - the program will have a reasonable amount of stack left * to work from. */ - rlim = current->signal->rlim; - if (size > ACCESS_ONCE(rlim[RLIMIT_STACK].rlim_cur) / 4) + limit = _STK_LIM / 4 * 3; + limit = min(limit, rlimit(RLIMIT_STACK) / 4); + if (size > limit) goto fail; } diff --git a/fs/ext4/file.c b/fs/ext4/file.c index c81044f51785..6c3ad343f4ae 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -195,7 +195,6 @@ static const struct vm_operations_struct ext4_file_vm_ops = { .fault = filemap_fault, .map_pages = filemap_map_pages, .page_mkwrite = ext4_page_mkwrite, - .remap_pages = generic_file_remap_pages, }; static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma) diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c index 89c3f2ca3c9b..9173ffad4ce4 100644 --- a/fs/f2fs/file.c +++ b/fs/f2fs/file.c @@ -87,7 +87,6 @@ static const struct vm_operations_struct f2fs_file_vm_ops = { .fault = filemap_fault, .map_pages = filemap_map_pages, .page_mkwrite = f2fs_vm_page_mkwrite, - .remap_pages = generic_file_remap_pages, }; static int get_parent_ino(struct inode *inode, nid_t *pino) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index ab9fad6fb6b1..893c1fcc6931 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -2138,7 +2138,6 @@ static const struct vm_operations_struct fuse_file_vm_ops = { .fault = filemap_fault, .map_pages = filemap_map_pages, .page_mkwrite = fuse_page_mkwrite, - .remap_pages = generic_file_remap_pages, }; static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma) diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c index 26b3f952e6b1..658b879d9846 100644 --- a/fs/gfs2/file.c +++ b/fs/gfs2/file.c @@ -496,7 +496,6 @@ static const struct vm_operations_struct gfs2_vm_ops = { .fault = filemap_fault, .map_pages = filemap_map_pages, .page_mkwrite = gfs2_page_mkwrite, - .remap_pages = generic_file_remap_pages, }; /** diff --git a/fs/inode.c 
b/fs/inode.c index 2eba072f7857..49df50c900ca 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -352,7 +352,6 @@ void address_space_init_once(struct address_space *mapping) INIT_LIST_HEAD(&mapping->private_list); spin_lock_init(&mapping->private_lock); mapping->i_mmap = RB_ROOT; - INIT_LIST_HEAD(&mapping->i_mmap_nonlinear); } EXPORT_SYMBOL(address_space_init_once); diff --git a/fs/nfs/file.c b/fs/nfs/file.c index 19cfc636b691..acf90dfc716b 100644 --- a/fs/nfs/file.c +++ b/fs/nfs/file.c @@ -618,7 +618,6 @@ static const struct vm_operations_struct nfs_file_vm_ops = { .fault = filemap_fault, .map_pages = filemap_map_pages, .page_mkwrite = nfs_vm_page_mkwrite, - .remap_pages = generic_file_remap_pages, }; static int nfs_need_sync_write(struct file *filp, struct inode *inode) diff --git a/fs/nilfs2/file.c b/fs/nilfs2/file.c index 24978153c0c4..43a2eb572782 100644 --- a/fs/nilfs2/file.c +++ b/fs/nilfs2/file.c @@ -136,7 +136,6 @@ static const struct vm_operations_struct nilfs_file_vm_ops = { .fault = filemap_fault, .map_pages = filemap_map_pages, .page_mkwrite = nilfs_page_mkwrite, - .remap_pages = generic_file_remap_pages, }; static int nilfs_file_mmap(struct file *file, struct vm_area_struct *vma) diff --git a/fs/ocfs2/mmap.c b/fs/ocfs2/mmap.c index 10d66c75cecb..9581d190f6e1 100644 --- a/fs/ocfs2/mmap.c +++ b/fs/ocfs2/mmap.c @@ -173,7 +173,6 @@ static int ocfs2_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) static const struct vm_operations_struct ocfs2_file_vm_ops = { .fault = ocfs2_fault, .page_mkwrite = ocfs2_page_mkwrite, - .remap_pages = generic_file_remap_pages, }; int ocfs2_mmap(struct file *file, struct vm_area_struct *vma) diff --git a/fs/proc/array.c b/fs/proc/array.c index 6a8bdd516f80..44a6da1a9d49 100644 --- a/fs/proc/array.c +++ b/fs/proc/array.c @@ -79,6 +79,7 @@ #include #include #include +#include #include #include #include @@ -326,6 +327,31 @@ static inline void task_seccomp(struct seq_file *m, struct task_struct *p) #ifdef CONFIG_SECCOMP seq_printf(m, "Seccomp:\t%d\n", p->seccomp.mode); #endif + seq_printf(m, "Speculation_Store_Bypass:\t"); + switch (arch_prctl_spec_ctrl_get(p, PR_SPEC_STORE_BYPASS)) { + case -EINVAL: + seq_printf(m, "unknown"); + break; + case PR_SPEC_NOT_AFFECTED: + seq_printf(m, "not vulnerable"); + break; + case PR_SPEC_PRCTL | PR_SPEC_FORCE_DISABLE: + seq_printf(m, "thread force mitigated"); + break; + case PR_SPEC_PRCTL | PR_SPEC_DISABLE: + seq_printf(m, "thread mitigated"); + break; + case PR_SPEC_PRCTL | PR_SPEC_ENABLE: + seq_printf(m, "thread vulnerable"); + break; + case PR_SPEC_DISABLE: + seq_printf(m, "globally mitigated"); + break; + default: + seq_printf(m, "vulnerable"); + break; + } + seq_putc(m, '\n'); } static inline void task_context_switch_counts(struct seq_file *m, diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 9380e4e291a6..a6242e29eef1 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -436,7 +436,6 @@ struct mem_size_stats { unsigned long anonymous; unsigned long anonymous_thp; unsigned long swap; - unsigned long nonlinear; u64 pss; }; @@ -446,7 +445,6 @@ static void smaps_pte_entry(pte_t ptent, unsigned long addr, { struct mem_size_stats *mss = walk->private; struct vm_area_struct *vma = mss->vma; - pgoff_t pgoff = linear_page_index(vma, addr); struct page *page = NULL; int mapcount; @@ -459,9 +457,6 @@ static void smaps_pte_entry(pte_t ptent, unsigned long addr, mss->swap += ptent_size; else if (is_migration_entry(swpent)) page = migration_entry_to_page(swpent); - } else if (pte_file(ptent)) { - if 
(pte_to_pgoff(ptent) != pgoff) - mss->nonlinear += ptent_size; } if (!page) @@ -470,9 +465,6 @@ static void smaps_pte_entry(pte_t ptent, unsigned long addr, if (PageAnon(page)) mss->anonymous += ptent_size; - if (page->index != pgoff) - mss->nonlinear += ptent_size; - mss->resident += ptent_size; /* Accumulate the size in pages that have been accessed. */ if (pte_young(ptent) || PageReferenced(page)) @@ -554,7 +546,6 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma) [ilog2(VM_ACCOUNT)] = "ac", [ilog2(VM_NORESERVE)] = "nr", [ilog2(VM_HUGETLB)] = "ht", - [ilog2(VM_NONLINEAR)] = "nl", [ilog2(VM_ARCH_1)] = "ar", [ilog2(VM_DONTDUMP)] = "dd", #ifdef CONFIG_MEM_SOFT_DIRTY @@ -628,10 +619,6 @@ static int show_smap(struct seq_file *m, void *v, int is_pid) (vma->vm_flags & VM_LOCKED) ? (unsigned long)(mss.pss >> (10 + PSS_SHIFT)) : 0); - if (vma->vm_flags & VM_NONLINEAR) - seq_printf(m, "Nonlinear: %8lu kB\n", - mss.nonlinear >> 10); - show_smap_vma_flags(m, vma); if (m->count < m->size) /* vma is copied successfully */ @@ -735,8 +722,6 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma, ptent = pte_clear_flags(ptent, _PAGE_SOFT_DIRTY); } else if (is_swap_pte(ptent)) { ptent = pte_swp_clear_soft_dirty(ptent); - } else if (pte_file(ptent)) { - ptent = pte_file_clear_soft_dirty(ptent); } set_pte_at(vma->vm_mm, addr, pte, ptent); diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c index 68d7fe857ba9..fdf6d28c4839 100644 --- a/fs/ubifs/file.c +++ b/fs/ubifs/file.c @@ -1556,7 +1556,6 @@ static const struct vm_operations_struct ubifs_file_vm_ops = { .fault = filemap_fault, .map_pages = filemap_map_pages, .page_mkwrite = ubifs_vm_page_mkwrite, - .remap_pages = generic_file_remap_pages, }; static int ubifs_file_mmap(struct file *file, struct vm_area_struct *vma) diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 593985a0447c..37f203111630 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -1481,5 +1481,4 @@ static const struct vm_operations_struct xfs_file_vm_ops = { .fault = xfs_filemap_fault, .map_pages = filemap_map_pages, .page_mkwrite = xfs_filemap_page_mkwrite, - .remap_pages = generic_file_remap_pages, }; diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h index 53b2acc38213..25c7dde4646a 100644 --- a/include/asm-generic/pgtable.h +++ b/include/asm-generic/pgtable.h @@ -445,21 +445,6 @@ static inline pte_t pte_swp_clear_soft_dirty(pte_t pte) { return pte; } - -static inline pte_t pte_file_clear_soft_dirty(pte_t pte) -{ - return pte; -} - -static inline pte_t pte_file_mksoft_dirty(pte_t pte) -{ - return pte; -} - -static inline int pte_file_soft_dirty(pte_t pte) -{ - return 0; -} #endif #ifndef __HAVE_PFNMAP_TRACKING @@ -821,6 +806,18 @@ static inline void pmdp_set_numa(struct mm_struct *mm, unsigned long addr, #endif /* CONFIG_MMU */ +#ifndef __HAVE_ARCH_PFN_MODIFY_ALLOWED +static inline bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot) +{ + return true; +} + +static inline bool arch_has_pfn_modify_check(void) +{ + return false; +} +#endif /* !_HAVE_ARCH_PFN_MODIFY_ALLOWED */ + #endif /* !__ASSEMBLY__ */ #ifndef io_remap_pfn_range diff --git a/include/linux/cpu.h b/include/linux/cpu.h index 0441fa7c20b1..ac169482c507 100644 --- a/include/linux/cpu.h +++ b/include/linux/cpu.h @@ -45,6 +45,10 @@ extern ssize_t cpu_show_spectre_v1(struct device *dev, struct device_attribute *attr, char *buf); extern ssize_t cpu_show_spectre_v2(struct device *dev, struct device_attribute *attr, char *buf); +extern ssize_t 
cpu_show_spec_store_bypass(struct device *dev, + struct device_attribute *attr, char *buf); +extern ssize_t cpu_show_l1tf(struct device *dev, + struct device_attribute *attr, char *buf); #ifdef CONFIG_HOTPLUG_CPU extern void unregister_cpu(struct cpu *cpu); diff --git a/include/linux/fs.h b/include/linux/fs.h index b5118c1cdb98..429793382eb6 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -395,7 +395,6 @@ struct address_space { spinlock_t tree_lock; /* and lock protecting it */ unsigned int i_mmap_writable;/* count VM_SHARED mappings */ struct rb_root i_mmap; /* tree of private and shared mappings */ - struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */ struct mutex i_mmap_mutex; /* protect tree, count, list */ /* Protected by tree_lock together with the radix tree */ unsigned long nrpages; /* number of total pages */ @@ -467,8 +466,7 @@ int mapping_tagged(struct address_space *mapping, int tag); */ static inline int mapping_mapped(struct address_space *mapping) { - return !RB_EMPTY_ROOT(&mapping->i_mmap) || - !list_empty(&mapping->i_mmap_nonlinear); + return !RB_EMPTY_ROOT(&mapping->i_mmap); } /* @@ -2430,8 +2428,6 @@ extern int sb_min_blocksize(struct super_block *, int); extern int generic_file_mmap(struct file *, struct vm_area_struct *); extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *); -extern int generic_file_remap_pages(struct vm_area_struct *, unsigned long addr, - unsigned long size, pgoff_t pgoff); int generic_write_checks(struct file *file, loff_t *pos, size_t *count, int isblk); extern ssize_t generic_file_read_iter(struct kiocb *, struct iov_iter *); extern ssize_t __generic_file_write_iter(struct kiocb *, struct iov_iter *); diff --git a/include/linux/io.h b/include/linux/io.h index b76e6e545806..b6af1f7d7094 100644 --- a/include/linux/io.h +++ b/include/linux/io.h @@ -101,4 +101,26 @@ static inline void arch_phys_wc_del(int handle) #define arch_phys_wc_add arch_phys_wc_add #endif +/* + * On x86 PAT systems we have memory tracking that keeps track of + * the allowed mappings on memory ranges. This tracking works for + * all the in-kernel mapping APIs (ioremap*), but where the user + * wishes to map a range from a physical device into user memory + * the tracking won't be updated. This API is to be used by + * drivers which remap physical device pages into userspace, + * and want to make sure they are mapped WC and not UC. + */
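/*
 * A hedged usage sketch, not part of this patch: how a driver that
 * maps device memory to userspace might pair the helpers declared
 * just below with ioremap_wc().  "example_fb" and its vram field are
 * hypothetical stand-ins for a real device structure.
 */
struct example_fb {
	void __iomem *vram;
};

static int example_fb_map_vram(struct example_fb *fb,
			       resource_size_t base, resource_size_t size)
{
	int ret;

	/* Register the range as WC with the (PAT) memtype tracking. */
	ret = arch_io_reserve_memtype_wc(base, size);
	if (ret)
		return ret;

	fb->vram = ioremap_wc(base, size);
	if (!fb->vram) {
		arch_io_free_memtype_wc(base, size);
		return -ENOMEM;
	}
	return 0;
}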
+#ifndef arch_io_reserve_memtype_wc +static inline int arch_io_reserve_memtype_wc(resource_size_t base, + resource_size_t size) +{ + return 0; +} + +static inline void arch_io_free_memtype_wc(resource_size_t base, + resource_size_t size) +{ +} +#endif + #endif /* _LINUX_IO_H */ diff --git a/include/linux/mm.h b/include/linux/mm.h index 65a4dc2da9de..1ecb0d0e56ec 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -125,7 +125,6 @@ extern unsigned int kobjsize(const void *objp); #define VM_ACCOUNT 0x00100000 /* Is a VM accounted object */ #define VM_NORESERVE 0x00200000 /* should the VM suppress accounting */ #define VM_HUGETLB 0x00400000 /* Huge TLB Page VM */ -#define VM_NONLINEAR 0x00800000 /* Is non-linear (remap_file_pages) */ #define VM_ARCH_1 0x01000000 /* Architecture-specific flag */ #define VM_DONTDUMP 0x04000000 /* Do not include in the core dump */ @@ -187,21 +186,19 @@ extern unsigned int kobjsize(const void *objp); extern pgprot_t protection_map[16]; #define FAULT_FLAG_WRITE 0x01 /* Fault was a write access */ -#define FAULT_FLAG_NONLINEAR 0x02 /* Fault was via a nonlinear mapping */ -#define FAULT_FLAG_MKWRITE 0x04 /* Fault was mkwrite of existing pte */ -#define FAULT_FLAG_ALLOW_RETRY 0x08 /* Retry fault if blocking */ -#define FAULT_FLAG_RETRY_NOWAIT 0x10 /* Don't drop mmap_sem and wait when retrying */ -#define FAULT_FLAG_KILLABLE 0x20 /* The fault task is in SIGKILL killable region */ -#define FAULT_FLAG_TRIED 0x40 /* second try */ -#define FAULT_FLAG_USER 0x80 /* The fault originated in userspace */ +#define FAULT_FLAG_MKWRITE 0x02 /* Fault was mkwrite of existing pte */ +#define FAULT_FLAG_ALLOW_RETRY 0x04 /* Retry fault if blocking */ +#define FAULT_FLAG_RETRY_NOWAIT 0x08 /* Don't drop mmap_sem and wait when retrying */ +#define FAULT_FLAG_KILLABLE 0x10 /* The fault task is in SIGKILL killable region */ +#define FAULT_FLAG_TRIED 0x20 /* Second try */ +#define FAULT_FLAG_USER 0x40 /* The fault originated in userspace */ /* * vm_fault is filled by the pagefault handler and passed to the vma's * ->fault function. The vma's ->fault is responsible for returning a bitmask * of VM_FAULT_xxx flags that give details about how the fault was handled. * - * pgoff should be used in favour of virtual_address, if possible. If pgoff - * is used, one may implement ->remap_pages to get nonlinear mapping support. + * pgoff should be used in favour of virtual_address, if possible. */ struct vm_fault { unsigned int flags; /* FAULT_FLAG_xxx flags */ @@ -270,9 +267,6 @@ struct vm_operations_struct { int (*migrate)(struct vm_area_struct *vma, const nodemask_t *from, const nodemask_t *to, unsigned long flags); #endif - /* called by sys_remap_file_pages() to populate non-linear mapping */ - int (*remap_pages)(struct vm_area_struct *vma, unsigned long addr, - unsigned long size, pgoff_t pgoff); }; struct mmu_gather; @@ -1103,7 +1097,6 @@ extern void user_shm_unlock(size_t, struct user_struct *); * Parameter block passed down to zap_pte_range in exceptional cases. 
*/ struct zap_details { - struct vm_area_struct *nonlinear_vma; /* Check page->index if set */ struct address_space *check_mapping; /* Check page->mapping if set */ pgoff_t first_index; /* Lowest page->index to unmap */ pgoff_t last_index; /* Highest page->index to unmap */ @@ -1121,8 +1114,6 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma, /** * mm_walk - callbacks for walk_page_range - * @pgd_entry: if set, called for each non-empty PGD (top-level) entry - * @pud_entry: if set, called for each non-empty PUD (2nd-level) entry * @pmd_entry: if set, called for each non-empty PMD (3rd-level) entry * this handler is required to be able to handle * pmd_trans_huge() pmds. They may simply choose to @@ -1130,16 +1121,18 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma, * @pte_entry: if set, called for each non-empty PTE (4th-level) entry * @pte_hole: if set, called for each hole at all levels * @hugetlb_entry: if set, called for each hugetlb entry - * *Caution*: The caller must hold mmap_sem() if @hugetlb_entry - * is used. + * @test_walk: caller specific callback function to determine whether + * we walk over the current vma or not. A positive returned + * value means "do page table walk over the current vma," + * and a negative one means "abort current page table walk + * right now." 0 means "skip the current vma." + * @mm: mm_struct representing the target process of page table walk + * @vma: vma currently walked (NULL if walking outside vmas) + * @private: private data for callbacks' usage * - * (see walk_page_range for more details) + * (see the comment on walk_page_range() for more details) */ struct mm_walk { - int (*pgd_entry)(pgd_t *pgd, unsigned long addr, - unsigned long next, struct mm_walk *walk); - int (*pud_entry)(pud_t *pud, unsigned long addr, - unsigned long next, struct mm_walk *walk); int (*pmd_entry)(pmd_t *pmd, unsigned long addr, unsigned long next, struct mm_walk *walk); int (*pte_entry)(pte_t *pte, unsigned long addr, @@ -1149,7 +1142,10 @@ struct mm_walk { int (*hugetlb_entry)(pte_t *pte, unsigned long hmask, unsigned long addr, unsigned long next, struct mm_walk *walk); + int (*test_walk)(unsigned long addr, unsigned long next, + struct mm_walk *walk); struct mm_struct *mm; + struct vm_area_struct *vma; void *private; }; @@ -1734,12 +1730,6 @@ struct vm_area_struct *vma_interval_tree_iter_next(struct vm_area_struct *node, for (vma = vma_interval_tree_iter_first(root, start, last); \ vma; vma = vma_interval_tree_iter_next(vma, start, last)) -static inline void vma_nonlinear_insert(struct vm_area_struct *vma, - struct list_head *list) -{ - list_add_tail(&vma->shared.nonlinear, list); -} - void anon_vma_interval_tree_insert(struct anon_vma_chain *node, struct rb_root *root); void anon_vma_interval_tree_remove(struct anon_vma_chain *node, @@ -1978,6 +1968,8 @@ int remap_pfn_range(struct vm_area_struct *, unsigned long addr, int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *); int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn); +int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr, + unsigned long pfn, pgprot_t pgprot); int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn); int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long len); diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index f9466c152c5d..249285eec500 100644 --- a/include/linux/mm_types.h +++ 
b/include/linux/mm_types.h @@ -272,15 +272,11 @@ struct vm_area_struct { /* * For areas with an address space and backing store, - * linkage into the address_space->i_mmap interval tree, or - * linkage of vma in the address_space->i_mmap_nonlinear list. + * linkage into the address_space->i_mmap interval tree. */ - union { - struct { - struct rb_node rb; - unsigned long rb_subtree_last; - } linear; - struct list_head nonlinear; + struct { + struct rb_node rb; + unsigned long rb_subtree_last; } shared; /* diff --git a/include/linux/nospec.h b/include/linux/nospec.h index e791ebc65c9c..0c5ef54fd416 100644 --- a/include/linux/nospec.h +++ b/include/linux/nospec.h @@ -7,6 +7,8 @@ #define _LINUX_NOSPEC_H #include +struct task_struct; + /** * array_index_mask_nospec() - generate a ~0 mask when index < size, 0 otherwise * @index: array element index @@ -55,4 +57,12 @@ static inline unsigned long array_index_mask_nospec(unsigned long index, \ (typeof(_i)) (_i & _mask); \ }) + +/* Speculation control prctl */ +int arch_prctl_spec_ctrl_get(struct task_struct *task, unsigned long which); +int arch_prctl_spec_ctrl_set(struct task_struct *task, unsigned long which, + unsigned long ctrl); +/* Speculation control for seccomp enforced mitigation */ +void arch_seccomp_spec_mitigate(struct task_struct *task); + #endif /* _LINUX_NOSPEC_H */ diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 59bca41b7229..d17b0437adfa 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -232,7 +232,6 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma); * arg: passed to rmap_one() and invalid_vma() * rmap_one: executed on each vma where page is mapped * done: for checking traversing termination condition - * file_nonlinear: for handling file nonlinear mapping * anon_lock: for getting anon_lock by optimized way rather than default * invalid_vma: for skipping uninterested vma */ @@ -241,7 +240,6 @@ struct rmap_walk_control { int (*rmap_one)(struct page *page, struct vm_area_struct *vma, unsigned long addr, void *arg); int (*done)(struct page *page); - int (*file_nonlinear)(struct page *, struct address_space *, void *arg); struct anon_vma *(*anon_lock)(struct page *page); bool (*invalid_vma)(struct vm_area_struct *vma, void *arg); }; diff --git a/include/linux/sched.h b/include/linux/sched.h index 982f752bff30..e689a930849d 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1973,7 +1973,8 @@ static inline void memalloc_noio_restore(unsigned int flags) #define PFA_NO_NEW_PRIVS 0 /* May not gain new privileges. 
*/ #define PFA_SPREAD_PAGE 1 /* Spread page cache over cpuset */ #define PFA_SPREAD_SLAB 2 /* Spread some slab caches over cpuset */ - +#define PFA_SPEC_SSB_DISABLE 3 /* Speculative Store Bypass disabled */ +#define PFA_SPEC_SSB_FORCE_DISABLE 4 /* Speculative Store Bypass force disabled*/ #define TASK_PFA_TEST(name, func) \ static inline bool task_##func(struct task_struct *p) \ @@ -1996,6 +1997,13 @@ TASK_PFA_TEST(SPREAD_SLAB, spread_slab) TASK_PFA_SET(SPREAD_SLAB, spread_slab) TASK_PFA_CLEAR(SPREAD_SLAB, spread_slab) +TASK_PFA_TEST(SPEC_SSB_DISABLE, spec_ssb_disable) +TASK_PFA_SET(SPEC_SSB_DISABLE, spec_ssb_disable) +TASK_PFA_CLEAR(SPEC_SSB_DISABLE, spec_ssb_disable) + +TASK_PFA_TEST(SPEC_SSB_FORCE_DISABLE, spec_ssb_force_disable) +TASK_PFA_SET(SPEC_SSB_FORCE_DISABLE, spec_ssb_force_disable) + /* * task->jobctl flags */ diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h index 4054b0994071..3258b7e8db3f 100644 --- a/include/linux/seccomp.h +++ b/include/linux/seccomp.h @@ -3,6 +3,8 @@ #include +#define SECCOMP_FILTER_FLAG_MASK SECCOMP_FILTER_FLAG_SPEC_ALLOW + #ifdef CONFIG_SECCOMP #include diff --git a/include/linux/swapfile.h b/include/linux/swapfile.h index 388293a91e8c..e4594de79bc4 100644 --- a/include/linux/swapfile.h +++ b/include/linux/swapfile.h @@ -9,5 +9,7 @@ extern spinlock_t swap_lock; extern struct plist_head swap_active_head; extern struct swap_info_struct *swap_info[]; extern int try_to_unuse(unsigned int, bool, unsigned long); +extern unsigned long generic_max_swapfile_size(void); +extern unsigned long max_swapfile_size(void); #endif /* _LINUX_SWAPFILE_H */ diff --git a/include/linux/swapops.h b/include/linux/swapops.h index e288d5c016a7..831a3168ab35 100644 --- a/include/linux/swapops.h +++ b/include/linux/swapops.h @@ -54,7 +54,7 @@ static inline pgoff_t swp_offset(swp_entry_t entry) /* check whether a pte points to a swap entry */ static inline int is_swap_pte(pte_t pte) { - return !pte_none(pte) && !pte_present_nonuma(pte) && !pte_file(pte); + return !pte_none(pte) && !pte_present_nonuma(pte); } #endif @@ -66,7 +66,6 @@ static inline swp_entry_t pte_to_swp_entry(pte_t pte) { swp_entry_t arch_entry; - BUG_ON(pte_file(pte)); if (pte_swp_soft_dirty(pte)) pte = pte_swp_clear_soft_dirty(pte); arch_entry = __pte_to_swp_entry(pte); @@ -82,7 +81,6 @@ static inline pte_t swp_entry_to_pte(swp_entry_t entry) swp_entry_t arch_entry; arch_entry = __swp_entry(swp_type(entry), swp_offset(entry)); - BUG_ON(pte_file(__swp_entry_to_pte(arch_entry))); return __swp_entry_to_pte(arch_entry); } diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h index ced92345c963..7ad3d612d753 100644 --- a/include/linux/vm_event_item.h +++ b/include/linux/vm_event_item.h @@ -73,10 +73,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, THP_ZERO_PAGE_ALLOC_FAILED, #endif #ifdef CONFIG_DEBUG_TLBFLUSH -#ifdef CONFIG_SMP NR_TLB_REMOTE_FLUSH, /* cpu tried to flush others' tlbs */ NR_TLB_REMOTE_FLUSH_RECEIVED,/* cpu received ipi for flush */ -#endif /* CONFIG_SMP */ NR_TLB_LOCAL_FLUSH_ALL, NR_TLB_LOCAL_FLUSH_ONE, #endif /* CONFIG_DEBUG_TLBFLUSH */ diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 58afc04c107e..24a1ccbcafa7 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -152,4 +152,16 @@ #define PR_SET_THP_DISABLE 41 #define PR_GET_THP_DISABLE 42 +/* Per task speculation control */ +#define PR_GET_SPECULATION_CTRL 52 +#define PR_SET_SPECULATION_CTRL 53 +/* Speculation control variants */ +# define PR_SPEC_STORE_BYPASS 
0 +/* Return and control values for PR_SET/GET_SPECULATION_CTRL */ +# define PR_SPEC_NOT_AFFECTED 0 +# define PR_SPEC_PRCTL (1UL << 0) +# define PR_SPEC_ENABLE (1UL << 1) +# define PR_SPEC_DISABLE (1UL << 2) +# define PR_SPEC_FORCE_DISABLE (1UL << 3) + #endif /* _LINUX_PRCTL_H */ diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h index b258878ba754..077344007063 100644 --- a/include/uapi/linux/seccomp.h +++ b/include/uapi/linux/seccomp.h @@ -14,6 +14,9 @@ #define SECCOMP_SET_MODE_STRICT 0 #define SECCOMP_SET_MODE_FILTER 1 +/* Valid flags for SECCOMP_SET_MODE_FILTER */ +#define SECCOMP_FILTER_FLAG_SPEC_ALLOW (1UL << 2) + /* * All BPF programs must return a 32-bit value. * The bottom 16-bits are for optional return data. diff --git a/kernel/fork.c b/kernel/fork.c index 315342c2aac4..27f6a67f692e 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -430,12 +430,8 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm) mapping->i_mmap_writable++; flush_dcache_mmap_lock(mapping); /* insert tmp into the share list, just after mpnt */ - if (unlikely(tmp->vm_flags & VM_NONLINEAR)) - vma_nonlinear_insert(tmp, - &mapping->i_mmap_nonlinear); - else - vma_interval_tree_insert_after(tmp, mpnt, - &mapping->i_mmap); + vma_interval_tree_insert_after(tmp, mpnt, + &mapping->i_mmap); flush_dcache_mmap_unlock(mapping); mutex_unlock(&mapping->i_mmap_mutex); } diff --git a/kernel/seccomp.c b/kernel/seccomp.c index d2596136b0d1..61e8e91482cf 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -16,6 +16,8 @@ #include #include #include +#include +#include #include #include #include @@ -205,9 +207,15 @@ static inline bool seccomp_may_assign_mode(unsigned long seccomp_mode) return true; } -static inline void seccomp_assign_mode(unsigned long seccomp_mode) +void __weak arch_seccomp_spec_mitigate(struct task_struct *task) { } + +static inline void seccomp_assign_mode(unsigned long seccomp_mode, + unsigned long flags) { current->seccomp.mode = seccomp_mode; + /* Assume default seccomp processes want spec flaw mitigation. */ + if ((flags & SECCOMP_FILTER_FLAG_SPEC_ALLOW) == 0) + arch_seccomp_spec_mitigate(current); set_tsk_thread_flag(current, TIF_SECCOMP); } @@ -507,7 +515,7 @@ static long seccomp_set_mode_strict(void) #ifdef TIF_NOTSC disable_TSC(); #endif - seccomp_assign_mode(seccomp_mode); + seccomp_assign_mode(seccomp_mode, 0); ret = 0; out: @@ -536,7 +544,7 @@ static long seccomp_set_mode_filter(unsigned int flags, long ret = -EINVAL; /* Validate flags. */ - if (flags != 0) + if (flags & ~SECCOMP_FILTER_FLAG_MASK) goto out; if (!seccomp_may_assign_mode(seccomp_mode)) @@ -546,7 +554,7 @@ static long seccomp_set_mode_filter(unsigned int flags, if (ret) goto out; - seccomp_assign_mode(seccomp_mode); + seccomp_assign_mode(seccomp_mode, flags); out: return ret; } diff --git a/kernel/sys.c b/kernel/sys.c index 0cb192dc4a93..3fb42f9897f1 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -53,6 +53,8 @@ #include #include +#include + #include /* Move somewhere else to avoid recompiling? 
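/*
 * Userspace sketch, illustration only and not part of the patch: a
 * task uses the new prctl pair to disable Speculative Store Bypass
 * for itself and reads the state back.  The PR_SPEC_* values are the
 * uapi constants added above; the resulting per-task state is also
 * visible in the Speculation_Store_Bypass field that fs/proc/array.c
 * now emits.
 */
#include <stdio.h>
#include <sys/prctl.h>
#include <linux/prctl.h>

static int disable_ssb_for_this_task(void)
{
	int state;

	if (prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS,
		  PR_SPEC_DISABLE, 0, 0))
		return -1;	/* EINVAL/ENODEV: no per-task control here */

	state = prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS, 0, 0, 0);
	/* Expect PR_SPEC_PRCTL | PR_SPEC_DISABLE on an affected CPU. */
	printf("speculation state: 0x%x\n", state);
	return 0;
}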
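/*
 * Sketch of the seccomp side, again illustration only: loading a
 * filter with SECCOMP_FILTER_FLAG_SPEC_ALLOW opts out of the implicit
 * SSB mitigation that seccomp_assign_mode() now applies by default.
 * The trivial allow-all BPF program below stands in for a real filter;
 * the caller is assumed to have set no_new_privs or hold CAP_SYS_ADMIN.
 */
#include <stddef.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

static int load_filter_keeping_speculation(void)
{
	struct sock_filter allow_all[] = {
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = 1,
		.filter = allow_all,
	};

	/* flags = 0 would also invoke arch_seccomp_spec_mitigate(). */
	return syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER,
		       SECCOMP_FILTER_FLAG_SPEC_ALLOW, &prog);
}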
*/ #include @@ -1832,6 +1834,17 @@ static int prctl_get_tid_address(struct task_struct *me, int __user **tid_addr) } #endif +int __weak arch_prctl_spec_ctrl_get(struct task_struct *t, unsigned long which) +{ + return -EINVAL; +} + +int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which, + unsigned long ctrl) +{ + return -EINVAL; +} + SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, unsigned long, arg4, unsigned long, arg5) { @@ -2010,6 +2023,16 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, me->mm->def_flags &= ~VM_NOHUGEPAGE; up_write(&me->mm->mmap_sem); break; + case PR_GET_SPECULATION_CTRL: + if (arg3 || arg4 || arg5) + return -EINVAL; + error = arch_prctl_spec_ctrl_get(me, arg2); + break; + case PR_SET_SPECULATION_CTRL: + if (arg4 || arg5) + return -EINVAL; + error = arch_prctl_spec_ctrl_set(me, arg2, arg3); + break; default: error = -EINVAL; break; diff --git a/mm/Makefile b/mm/Makefile index 4064f3ec145e..1eaa70bed7a4 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -3,7 +3,7 @@ # mmu-y := nommu.o -mmu-$(CONFIG_MMU) := fremap.o gup.o highmem.o madvise.o memory.o mincore.o \ +mmu-$(CONFIG_MMU) := gup.o highmem.o madvise.o memory.o mincore.o \ mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \ vmalloc.o pagewalk.o pgtable-generic.o diff --git a/mm/filemap.c b/mm/filemap.c index 128f3d34d874..94d69f277dae 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2078,7 +2078,6 @@ const struct vm_operations_struct generic_file_vm_ops = { .fault = filemap_fault, .map_pages = filemap_map_pages, .page_mkwrite = filemap_page_mkwrite, - .remap_pages = generic_file_remap_pages, }; /* This is used for a general mmap of a disk file */ diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c index d8d9fe3f685c..e609c215c5d2 100644 --- a/mm/filemap_xip.c +++ b/mm/filemap_xip.c @@ -306,7 +306,6 @@ static int xip_file_fault(struct vm_area_struct *vma, struct vm_fault *vmf) static const struct vm_operations_struct xip_file_vm_ops = { .fault = xip_file_fault, .page_mkwrite = filemap_page_mkwrite, - .remap_pages = generic_file_remap_pages, }; int xip_file_mmap(struct file * file, struct vm_area_struct * vma) diff --git a/mm/fremap.c b/mm/fremap.c deleted file mode 100644 index 72b8fa361433..000000000000 --- a/mm/fremap.c +++ /dev/null @@ -1,283 +0,0 @@ -/* - * linux/mm/fremap.c - * - * Explicit pagetable population and nonlinear (random) mappings support. - * - * started by Ingo Molnar, Copyright (C) 2002, 2003 - */ -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -#include -#include -#include - -#include "internal.h" - -static int mm_counter(struct page *page) -{ - return PageAnon(page) ? 
MM_ANONPAGES : MM_FILEPAGES; -} - -static void zap_pte(struct mm_struct *mm, struct vm_area_struct *vma, - unsigned long addr, pte_t *ptep) -{ - pte_t pte = *ptep; - struct page *page; - swp_entry_t entry; - - if (pte_present(pte)) { - flush_cache_page(vma, addr, pte_pfn(pte)); - pte = ptep_clear_flush(vma, addr, ptep); - page = vm_normal_page(vma, addr, pte); - if (page) { - if (pte_dirty(pte)) - set_page_dirty(page); - update_hiwater_rss(mm); - dec_mm_counter(mm, mm_counter(page)); - page_remove_rmap(page); - page_cache_release(page); - } - } else { /* zap_pte() is not called when pte_none() */ - if (!pte_file(pte)) { - update_hiwater_rss(mm); - entry = pte_to_swp_entry(pte); - if (non_swap_entry(entry)) { - if (is_migration_entry(entry)) { - page = migration_entry_to_page(entry); - dec_mm_counter(mm, mm_counter(page)); - } - } else { - free_swap_and_cache(entry); - dec_mm_counter(mm, MM_SWAPENTS); - } - } - pte_clear_not_present_full(mm, addr, ptep, 0); - } -} - -/* - * Install a file pte to a given virtual memory address, release any - * previously existing mapping. - */ -static int install_file_pte(struct mm_struct *mm, struct vm_area_struct *vma, - unsigned long addr, unsigned long pgoff, pgprot_t prot) -{ - int err = -ENOMEM; - pte_t *pte, ptfile; - spinlock_t *ptl; - - pte = get_locked_pte(mm, addr, &ptl); - if (!pte) - goto out; - - ptfile = pgoff_to_pte(pgoff); - - if (!pte_none(*pte)) - zap_pte(mm, vma, addr, pte); - - set_pte_at(mm, addr, pte, pte_file_mksoft_dirty(ptfile)); - /* - * We don't need to run update_mmu_cache() here because the "file pte" - * being installed by install_file_pte() is not a real pte - it's a - * non-present entry (like a swap entry), noting what file offset should - * be mapped there when there's a fault (in a non-linear vma where - * that's not obvious). - */ - pte_unmap_unlock(pte, ptl); - err = 0; -out: - return err; -} - -int generic_file_remap_pages(struct vm_area_struct *vma, unsigned long addr, - unsigned long size, pgoff_t pgoff) -{ - struct mm_struct *mm = vma->vm_mm; - int err; - - do { - err = install_file_pte(mm, vma, addr, pgoff, vma->vm_page_prot); - if (err) - return err; - - size -= PAGE_SIZE; - addr += PAGE_SIZE; - pgoff++; - } while (size); - - return 0; -} -EXPORT_SYMBOL(generic_file_remap_pages); - -/** - * sys_remap_file_pages - remap arbitrary pages of an existing VM_SHARED vma - * @start: start of the remapped virtual memory range - * @size: size of the remapped virtual memory range - * @prot: new protection bits of the range (see NOTE) - * @pgoff: to-be-mapped page of the backing store file - * @flags: 0 or MAP_NONBLOCKED - the later will cause no IO. - * - * sys_remap_file_pages remaps arbitrary pages of an existing VM_SHARED vma - * (shared backing store file). - * - * This syscall works purely via pagetables, so it's the most efficient - * way to map the same (large) file into a given virtual window. Unlike - * mmap()/mremap() it does not create any new vmas. The new mappings are - * also safe across swapout. - * - * NOTE: the @prot parameter right now is ignored (but must be zero), - * and the vma's default protection is used. Arbitrary protections - * might be implemented in the future. 
- */ -SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size, - unsigned long, prot, unsigned long, pgoff, unsigned long, flags) -{ - struct mm_struct *mm = current->mm; - struct address_space *mapping; - struct vm_area_struct *vma; - int err = -EINVAL; - int has_write_lock = 0; - vm_flags_t vm_flags = 0; - - pr_warn_once("%s (%d) uses deprecated remap_file_pages() syscall. " - "See Documentation/vm/remap_file_pages.txt.\n", - current->comm, current->pid); - - if (prot) - return err; - /* - * Sanitize the syscall parameters: - */ - start = start & PAGE_MASK; - size = size & PAGE_MASK; - - /* Does the address range wrap, or is the span zero-sized? */ - if (start + size <= start) - return err; - - /* Does pgoff wrap? */ - if (pgoff + (size >> PAGE_SHIFT) < pgoff) - return err; - - /* Can we represent this offset inside this architecture's pte's? */ -#if PTE_FILE_MAX_BITS < BITS_PER_LONG - if (pgoff + (size >> PAGE_SHIFT) >= (1UL << PTE_FILE_MAX_BITS)) - return err; -#endif - - /* We need down_write() to change vma->vm_flags. */ - down_read(&mm->mmap_sem); - retry: - vma = find_vma(mm, start); - - /* - * Make sure the vma is shared, that it supports prefaulting, - * and that the remapped range is valid and fully within - * the single existing vma. - */ - if (!vma || !(vma->vm_flags & VM_SHARED)) - goto out; - - if (!vma->vm_ops || !vma->vm_ops->remap_pages) - goto out; - - if (start < vma->vm_start || start + size > vma->vm_end) - goto out; - - /* Must set VM_NONLINEAR before any pages are populated. */ - if (!(vma->vm_flags & VM_NONLINEAR)) { - /* - * vm_private_data is used as a swapout cursor - * in a VM_NONLINEAR vma. - */ - if (vma->vm_private_data) - goto out; - - /* Don't need a nonlinear mapping, exit success */ - if (pgoff == linear_page_index(vma, start)) { - err = 0; - goto out; - } - - if (!has_write_lock) { -get_write_lock: - up_read(&mm->mmap_sem); - down_write(&mm->mmap_sem); - has_write_lock = 1; - goto retry; - } - mapping = vma->vm_file->f_mapping; - /* - * page_mkclean doesn't work on nonlinear vmas, so if - * dirty pages need to be accounted, emulate with linear - * vmas. - */ - if (mapping_cap_account_dirty(mapping)) { - unsigned long addr; - struct file *file = get_file(vma->vm_file); - /* mmap_region may free vma; grab the info now */ - vm_flags = vma->vm_flags; - - addr = mmap_region(file, start, size, vm_flags, pgoff); - fput(file); - if (IS_ERR_VALUE(addr)) { - err = addr; - } else { - BUG_ON(addr != start); - err = 0; - } - goto out_freed; - } - mutex_lock(&mapping->i_mmap_mutex); - flush_dcache_mmap_lock(mapping); - vma->vm_flags |= VM_NONLINEAR; - vma_interval_tree_remove(vma, &mapping->i_mmap); - vma_nonlinear_insert(vma, &mapping->i_mmap_nonlinear); - flush_dcache_mmap_unlock(mapping); - mutex_unlock(&mapping->i_mmap_mutex); - } - - if (vma->vm_flags & VM_LOCKED) { - /* - * drop PG_Mlocked flag for over-mapped range - */ - if (!has_write_lock) - goto get_write_lock; - vm_flags = vma->vm_flags; - munlock_vma_pages_range(vma, start, start + size); - vma->vm_flags = vm_flags; - } - - mmu_notifier_invalidate_range_start(mm, start, start + size); - err = vma->vm_ops->remap_pages(vma, start, size, pgoff); - mmu_notifier_invalidate_range_end(mm, start, start + size); - - /* - * We can't clear VM_NONLINEAR because we'd have to do - * it after ->populate completes, and that would prevent - * downgrading the lock. (Locks can't be upgraded). 
- */ - -out: - if (vma) - vm_flags = vma->vm_flags; -out_freed: - if (likely(!has_write_lock)) - up_read(&mm->mmap_sem); - else - up_write(&mm->mmap_sem); - if (!err && ((vm_flags & VM_LOCKED) || !(flags & MAP_NONBLOCK))) - mm_populate(start, size); - - return err; -} diff --git a/mm/gup.c b/mm/gup.c index 4d2fd27b33ab..b4164bfca326 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -61,7 +61,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma, */ if (likely(!(flags & FOLL_MIGRATION))) goto no_page; - if (pte_none(pte) || pte_file(pte)) + if (pte_none(pte)) goto no_page; entry = pte_to_swp_entry(pte); if (!is_migration_entry(entry)) diff --git a/mm/interval_tree.c b/mm/interval_tree.c index 4a5822a586e6..fb844a358326 100644 --- a/mm/interval_tree.c +++ b/mm/interval_tree.c @@ -21,8 +21,8 @@ static inline unsigned long vma_last_pgoff(struct vm_area_struct *v) return v->vm_pgoff + ((v->vm_end - v->vm_start) >> PAGE_SHIFT) - 1; } -INTERVAL_TREE_DEFINE(struct vm_area_struct, shared.linear.rb, - unsigned long, shared.linear.rb_subtree_last, +INTERVAL_TREE_DEFINE(struct vm_area_struct, shared.rb, + unsigned long, shared.rb_subtree_last, vma_start_pgoff, vma_last_pgoff,, vma_interval_tree) /* Insert node immediately after prev in the interval tree */ @@ -36,26 +36,26 @@ void vma_interval_tree_insert_after(struct vm_area_struct *node, VM_BUG_ON(vma_start_pgoff(node) != vma_start_pgoff(prev)); - if (!prev->shared.linear.rb.rb_right) { + if (!prev->shared.rb.rb_right) { parent = prev; - link = &prev->shared.linear.rb.rb_right; + link = &prev->shared.rb.rb_right; } else { - parent = rb_entry(prev->shared.linear.rb.rb_right, - struct vm_area_struct, shared.linear.rb); - if (parent->shared.linear.rb_subtree_last < last) - parent->shared.linear.rb_subtree_last = last; - while (parent->shared.linear.rb.rb_left) { - parent = rb_entry(parent->shared.linear.rb.rb_left, - struct vm_area_struct, shared.linear.rb); - if (parent->shared.linear.rb_subtree_last < last) - parent->shared.linear.rb_subtree_last = last; + parent = rb_entry(prev->shared.rb.rb_right, + struct vm_area_struct, shared.rb); + if (parent->shared.rb_subtree_last < last) + parent->shared.rb_subtree_last = last; + while (parent->shared.rb.rb_left) { + parent = rb_entry(parent->shared.rb.rb_left, + struct vm_area_struct, shared.rb); + if (parent->shared.rb_subtree_last < last) + parent->shared.rb_subtree_last = last; } - link = &parent->shared.linear.rb.rb_left; + link = &parent->shared.rb.rb_left; } - node->shared.linear.rb_subtree_last = last; - rb_link_node(&node->shared.linear.rb, &parent->shared.linear.rb, link); - rb_insert_augmented(&node->shared.linear.rb, root, + node->shared.rb_subtree_last = last; + rb_link_node(&node->shared.rb, &parent->shared.rb, link); + rb_insert_augmented(&node->shared.rb, root, &vma_interval_tree_augment); } diff --git a/mm/ksm.c b/mm/ksm.c index d8c9c689862c..032ebc32c26f 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -1749,7 +1749,7 @@ int ksm_madvise(struct vm_area_struct *vma, unsigned long start, */ if (*vm_flags & (VM_MERGEABLE | VM_SHARED | VM_MAYSHARE | VM_PFNMAP | VM_IO | VM_DONTEXPAND | - VM_HUGETLB | VM_NONLINEAR | VM_MIXEDMAP)) + VM_HUGETLB | VM_MIXEDMAP)) return 0; /* just ignore the advice */ #ifdef VM_SAO diff --git a/mm/madvise.c b/mm/madvise.c index b414e9af2191..040617386648 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -155,7 +155,7 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start, pte = *(orig_pte + ((index - start) / PAGE_SIZE)); pte_unmap_unlock(orig_pte, ptl); - if 
(pte_present(pte) || pte_none(pte) || pte_file(pte)) + if (pte_present(pte) || pte_none(pte)) continue; entry = pte_to_swp_entry(pte); if (unlikely(non_swap_entry(entry))) @@ -277,14 +277,7 @@ static long madvise_dontneed(struct vm_area_struct *vma, if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP)) return -EINVAL; - if (unlikely(vma->vm_flags & VM_NONLINEAR)) { - struct zap_details details = { - .nonlinear_vma = vma, - .last_index = ULONG_MAX, - }; - zap_page_range(vma, start, end - start, &details); - } else - zap_page_range(vma, start, end - start, NULL); + zap_page_range(vma, start, end - start, NULL); return 0; } @@ -305,7 +298,7 @@ static long madvise_remove(struct vm_area_struct *vma, *prev = NULL; /* tell sys_madvise we drop mmap_sem */ - if (vma->vm_flags & (VM_LOCKED|VM_NONLINEAR|VM_HUGETLB)) + if (vma->vm_flags & (VM_LOCKED | VM_HUGETLB)) return -EINVAL; f = vma->vm_file; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 2bfd852934ac..543a3e2e11c3 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -6580,10 +6580,7 @@ static struct page *mc_handle_file_pte(struct vm_area_struct *vma, return NULL; mapping = vma->vm_file->f_mapping; - if (pte_none(ptent)) - pgoff = linear_page_index(vma, addr); - else /* pte_file(ptent) is true */ - pgoff = pte_to_pgoff(ptent); + pgoff = linear_page_index(vma, addr); /* page is moved even if it's not RSS of this task(page-faulted). */ #ifdef CONFIG_SWAP @@ -6616,7 +6613,7 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma, page = mc_handle_present_pte(vma, addr, ptent); else if (is_swap_pte(ptent)) page = mc_handle_swap_pte(vma, addr, ptent, &ent); - else if (pte_none(ptent) || pte_file(ptent)) + else if (pte_none(ptent)) page = mc_handle_file_pte(vma, addr, ptent, &ent); if (!page && !ent.val) diff --git a/mm/memory.c b/mm/memory.c index 4dbf1173891d..06e065489e60 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -810,42 +810,40 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, /* pte contains position in swap or file, so copy. */ if (unlikely(!pte_present(pte))) { - if (!pte_file(pte)) { - swp_entry_t entry = pte_to_swp_entry(pte); - - if (likely(!non_swap_entry(entry))) { - if (swap_duplicate(entry) < 0) - return entry.val; - - /* make sure dst_mm is on swapoff's mmlist. */ - if (unlikely(list_empty(&dst_mm->mmlist))) { - spin_lock(&mmlist_lock); - if (list_empty(&dst_mm->mmlist)) - list_add(&dst_mm->mmlist, - &src_mm->mmlist); - spin_unlock(&mmlist_lock); - } - rss[MM_SWAPENTS]++; - } else if (is_migration_entry(entry)) { - page = migration_entry_to_page(entry); - - if (PageAnon(page)) - rss[MM_ANONPAGES]++; - else - rss[MM_FILEPAGES]++; - - if (is_write_migration_entry(entry) && - is_cow_mapping(vm_flags)) { - /* - * COW mappings require pages in both - * parent and child to be set to read. - */ - make_migration_entry_read(&entry); - pte = swp_entry_to_pte(entry); - if (pte_swp_soft_dirty(*src_pte)) - pte = pte_swp_mksoft_dirty(pte); - set_pte_at(src_mm, addr, src_pte, pte); - } + swp_entry_t entry = pte_to_swp_entry(pte); + + if (likely(!non_swap_entry(entry))) { + if (swap_duplicate(entry) < 0) + return entry.val; + + /* make sure dst_mm is on swapoff's mmlist. 
*/ + if (unlikely(list_empty(&dst_mm->mmlist))) { + spin_lock(&mmlist_lock); + if (list_empty(&dst_mm->mmlist)) + list_add(&dst_mm->mmlist, + &src_mm->mmlist); + spin_unlock(&mmlist_lock); + } + rss[MM_SWAPENTS]++; + } else if (is_migration_entry(entry)) { + page = migration_entry_to_page(entry); + + if (PageAnon(page)) + rss[MM_ANONPAGES]++; + else + rss[MM_FILEPAGES]++; + + if (is_write_migration_entry(entry) && + is_cow_mapping(vm_flags)) { + /* + * COW mappings require pages in both + * parent and child to be set to read. + */ + make_migration_entry_read(&entry); + pte = swp_entry_to_pte(entry); + if (pte_swp_soft_dirty(*src_pte)) + pte = pte_swp_mksoft_dirty(pte); + set_pte_at(src_mm, addr, src_pte, pte); } } goto out_set_pte; @@ -1019,11 +1017,9 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, * readonly mappings. The tradeoff is that copy_page_range is more * efficient than faulting. */ - if (!(vma->vm_flags & (VM_HUGETLB | VM_NONLINEAR | - VM_PFNMAP | VM_MIXEDMAP))) { - if (!vma->anon_vma) - return 0; - } + if (!(vma->vm_flags & (VM_HUGETLB | VM_PFNMAP | VM_MIXEDMAP)) && + !vma->anon_vma) + return 0; if (is_vm_hugetlb_page(vma)) return copy_hugetlb_page_range(dst_mm, src_mm, vma); @@ -1081,6 +1077,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, spinlock_t *ptl; pte_t *start_pte; pte_t *pte; + swp_entry_t entry; again: init_rss_vec(rss); @@ -1106,28 +1103,12 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, if (details->check_mapping && details->check_mapping != page->mapping) continue; - /* - * Each page->index must be checked when - * invalidating or truncating nonlinear. - */ - if (details->nonlinear_vma && - (page->index < details->first_index || - page->index > details->last_index)) - continue; } ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm); tlb_remove_tlb_entry(tlb, pte, addr); if (unlikely(!page)) continue; - if (unlikely(details) && details->nonlinear_vma - && linear_page_index(details->nonlinear_vma, - addr) != page->index) { - pte_t ptfile = pgoff_to_pte(page->index); - if (pte_soft_dirty(ptent)) - ptfile = pte_file_mksoft_dirty(ptfile); - set_pte_at(mm, addr, pte, ptfile); - } if (PageAnon(page)) rss[MM_ANONPAGES]--; else { @@ -1150,33 +1131,25 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, } continue; } - /* - * If details->check_mapping, we leave swap entries; - * if details->nonlinear_vma, we leave file entries. - */ + /* If details->check_mapping, we leave swap entries. 
*/ if (unlikely(details)) continue; - if (pte_file(ptent)) { - if (unlikely(!(vma->vm_flags & VM_NONLINEAR))) - print_bad_pte(vma, addr, ptent, NULL); - } else { - swp_entry_t entry = pte_to_swp_entry(ptent); - if (!non_swap_entry(entry)) - rss[MM_SWAPENTS]--; - else if (is_migration_entry(entry)) { - struct page *page; + entry = pte_to_swp_entry(ptent); + if (!non_swap_entry(entry)) + rss[MM_SWAPENTS]--; + else if (is_migration_entry(entry)) { + struct page *page; - page = migration_entry_to_page(entry); + page = migration_entry_to_page(entry); - if (PageAnon(page)) - rss[MM_ANONPAGES]--; - else - rss[MM_FILEPAGES]--; - } - if (unlikely(!free_swap_and_cache(entry))) - print_bad_pte(vma, addr, ptent, NULL); + if (PageAnon(page)) + rss[MM_ANONPAGES]--; + else + rss[MM_FILEPAGES]--; } + if (unlikely(!free_swap_and_cache(entry))) + print_bad_pte(vma, addr, ptent, NULL); pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); } while (pte++, addr += PAGE_SIZE, addr != end); @@ -1288,7 +1261,7 @@ static void unmap_page_range(struct mmu_gather *tlb, pgd_t *pgd; unsigned long next; - if (details && !details->check_mapping && !details->nonlinear_vma) + if (details && !details->check_mapping) details = NULL; BUG_ON(addr >= end); @@ -1384,7 +1357,7 @@ void unmap_vmas(struct mmu_gather *tlb, * @vma: vm_area_struct holding the applicable pages * @start: starting address of pages to zap * @size: number of bytes to zap - * @details: details of nonlinear truncation or shared cache invalidation + * @details: details of shared cache invalidation * * Caller must protect the VMA list */ @@ -1410,7 +1383,7 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start, * @vma: vm_area_struct holding the applicable pages * @address: starting address of pages to zap * @size: number of bytes to zap - * @details: details of nonlinear truncation or shared cache invalidation + * @details: details of shared cache invalidation * * The range must fit into one VMA. */ @@ -1600,9 +1573,30 @@ static int insert_pfn(struct vm_area_struct *vma, unsigned long addr, */ int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn) +{ + return vm_insert_pfn_prot(vma, addr, pfn, vma->vm_page_prot); +} +EXPORT_SYMBOL(vm_insert_pfn); + +/** + * vm_insert_pfn_prot - insert single pfn into user vma with specified pgprot + * @vma: user vma to map to + * @addr: target user address of this page + * @pfn: source kernel pfn + * @pgprot: pgprot flags for the inserted page + * + * This is exactly like vm_insert_pfn, except that it allows drivers to + * override pgprot on a per-page basis. + * + * This only makes sense for IO mappings, and it makes no sense for + * cow mappings. In general, using multiple vmas is preferable; + * vm_insert_pfn_prot should only be used if using multiple VMAs is + * impractical. + */ +int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr, + unsigned long pfn, pgprot_t pgprot) { int ret; - pgprot_t pgprot = vma->vm_page_prot; /* * Technically, architectures with pte_special can avoid all these * restrictions (same for remap_pfn_range). 
However we would like @@ -1620,19 +1614,29 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr, if (track_pfn_insert(vma, &pgprot, pfn)) return -EINVAL; + if (!pfn_modify_allowed(pfn, pgprot)) + return -EACCES; + ret = insert_pfn(vma, addr, pfn, pgprot); return ret; } -EXPORT_SYMBOL(vm_insert_pfn); +EXPORT_SYMBOL(vm_insert_pfn_prot); int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn) { + pgprot_t pgprot = vma->vm_page_prot; + BUG_ON(!(vma->vm_flags & VM_MIXEDMAP)); if (addr < vma->vm_start || addr >= vma->vm_end) return -EFAULT; + if (track_pfn_insert(vma, &pgprot, pfn)) + return -EINVAL; + + if (!pfn_modify_allowed(pfn, pgprot)) + return -EACCES; /* * If we don't have pte special, then we have to use the pfn_valid() @@ -1645,9 +1649,9 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, struct page *page; page = pfn_to_page(pfn); - return insert_page(vma, addr, page, vma->vm_page_prot); + return insert_page(vma, addr, page, pgprot); } - return insert_pfn(vma, addr, pfn, vma->vm_page_prot); + return insert_pfn(vma, addr, pfn, pgprot); } EXPORT_SYMBOL(vm_insert_mixed); @@ -1662,6 +1666,7 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd, { pte_t *pte; spinlock_t *ptl; + int err = 0; pte = pte_alloc_map_lock(mm, pmd, addr, &ptl); if (!pte) @@ -1669,12 +1674,16 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd, arch_enter_lazy_mmu_mode(); do { BUG_ON(!pte_none(*pte)); + if (!pfn_modify_allowed(pfn, prot)) { + err = -EACCES; + break; + } set_pte_at(mm, addr, pte, pte_mkspecial(pfn_pte(pfn, prot))); pfn++; } while (pte++, addr += PAGE_SIZE, addr != end); arch_leave_lazy_mmu_mode(); pte_unmap_unlock(pte - 1, ptl); - return 0; + return err; } static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud, @@ -1683,6 +1692,7 @@ static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud, { pmd_t *pmd; unsigned long next; + int err; pfn -= addr >> PAGE_SHIFT; pmd = pmd_alloc(mm, pud, addr); @@ -1691,9 +1701,10 @@ static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud, VM_BUG_ON(pmd_trans_huge(*pmd)); do { next = pmd_addr_end(addr, end); - if (remap_pte_range(mm, pmd, addr, next, - pfn + (addr >> PAGE_SHIFT), prot)) - return -ENOMEM; + err = remap_pte_range(mm, pmd, addr, next, + pfn + (addr >> PAGE_SHIFT), prot); + if (err) + return err; } while (pmd++, addr = next, addr != end); return 0; } @@ -1704,6 +1715,7 @@ static inline int remap_pud_range(struct mm_struct *mm, pgd_t *pgd, { pud_t *pud; unsigned long next; + int err; pfn -= addr >> PAGE_SHIFT; pud = pud_alloc(mm, pgd, addr); @@ -1711,9 +1723,10 @@ static inline int remap_pud_range(struct mm_struct *mm, pgd_t *pgd, return -ENOMEM; do { next = pud_addr_end(addr, end); - if (remap_pmd_range(mm, pud, addr, next, - pfn + (addr >> PAGE_SHIFT), prot)) - return -ENOMEM; + err = remap_pmd_range(mm, pud, addr, next, + pfn + (addr >> PAGE_SHIFT), prot); + if (err) + return err; } while (pud++, addr = next, addr != end); return 0; } @@ -1935,12 +1948,11 @@ int apply_to_page_range(struct mm_struct *mm, unsigned long addr, EXPORT_SYMBOL_GPL(apply_to_page_range); /* - * handle_pte_fault chooses page fault handler according to an entry - * which was read non-atomically. Before making any commitment, on - * those architectures or configurations (e.g. 
i386 with PAE) which - * might give a mix of unmatched parts, do_swap_page and do_nonlinear_fault - * must check under lock before unmapping the pte and proceeding - * (but do_wp_page is only called after already making such a check; + * handle_pte_fault chooses page fault handler according to an entry which was + * read non-atomically. Before making any commitment, on those architectures + * or configurations (e.g. i386 with PAE) which might give a mix of unmatched + * parts, do_swap_page must check under lock before unmapping the pte and + * proceeding (but do_wp_page is only called after already making such a check; * and do_anonymous_page can safely check later on). */ static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd, @@ -2340,25 +2352,11 @@ static inline void unmap_mapping_range_tree(struct rb_root *root, } } -static inline void unmap_mapping_range_list(struct list_head *head, - struct zap_details *details) -{ - struct vm_area_struct *vma; - - /* - * In nonlinear VMAs there is no correspondence between virtual address - * offset and file offset. So we must perform an exhaustive search - * across *all* the pages in each nonlinear VMA, not just the pages - * whose virtual address lies outside the file truncation point. - */ - list_for_each_entry(vma, head, shared.nonlinear) { - details->nonlinear_vma = vma; - unmap_mapping_range_vma(vma, vma->vm_start, vma->vm_end, details); - } -} - /** - * unmap_mapping_range - unmap the portion of all mmaps in the specified address_space corresponding to the specified page range in the underlying file. + * unmap_mapping_range - unmap the portion of all mmaps in the specified + * address_space corresponding to the specified page range in the underlying + * file. + * * @mapping: the address space containing mmaps to be unmapped. * @holebegin: byte in first page to unmap, relative to the start of * the underlying file. This will be rounded down to a PAGE_SIZE @@ -2387,7 +2385,6 @@ void unmap_mapping_range(struct address_space *mapping, } details.check_mapping = even_cows? NULL: mapping; - details.nonlinear_vma = NULL; details.first_index = hba; details.last_index = hba + hlen - 1; if (details.last_index < details.first_index) @@ -2397,8 +2394,6 @@ void unmap_mapping_range(struct address_space *mapping, mutex_lock(&mapping->i_mmap_mutex); if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap))) unmap_mapping_range_tree(&mapping->i_mmap, &details); - if (unlikely(!list_empty(&mapping->i_mmap_nonlinear))) - unmap_mapping_range_list(&mapping->i_mmap_nonlinear, &details); mutex_unlock(&mapping->i_mmap_mutex); } EXPORT_SYMBOL(unmap_mapping_range); @@ -2716,8 +2711,6 @@ void do_set_pte(struct vm_area_struct *vma, unsigned long address, entry = mk_pte(page, vma->vm_page_prot); if (write) entry = maybe_mkwrite(pte_mkdirty(entry), vma); - else if (pte_file(*pte) && pte_file_soft_dirty(*pte)) - pte_mksoft_dirty(entry); if (anon) { inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES); page_add_new_anon_rmap(page, vma, address); @@ -2858,8 +2851,7 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma, * if page by the offset is not ready to be mapped (cold cache or * something). 
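/*
 * Hedged driver-side sketch, not from this patch: using the
 * vm_insert_pfn_prot() helper added above to map device pages
 * write-combined from a fault handler.  "my_dev" and my_dev_pfn()
 * are hypothetical; the fault signature matches this kernel's
 * vm_operations_struct.
 */
struct my_dev {
	unsigned long pfn_base;
};

static unsigned long my_dev_pfn(struct my_dev *dev, pgoff_t pgoff)
{
	return dev->pfn_base + pgoff;
}

static int my_dev_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	struct my_dev *dev = vma->vm_private_data;
	unsigned long pfn = my_dev_pfn(dev, vmf->pgoff);
	int err;

	err = vm_insert_pfn_prot(vma, (unsigned long)vmf->virtual_address,
				 pfn, pgprot_writecombine(vma->vm_page_prot));
	if (err == -EACCES)	/* rejected by pfn_modify_allowed() */
		return VM_FAULT_SIGBUS;

	return err ? VM_FAULT_SIGBUS : VM_FAULT_NOPAGE;
}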
*/ - if (vma->vm_ops->map_pages && !(flags & FAULT_FLAG_NONLINEAR) && - fault_around_pages() > 1) { + if (vma->vm_ops->map_pages && fault_around_pages() > 1) { pte = pte_offset_map_lock(mm, pmd, address, &ptl); do_fault_around(vma, address, pte, pgoff, flags); if (!pte_same(*pte, orig_pte)) @@ -2989,7 +2981,7 @@ static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma, return ret; } -static int do_linear_fault(struct mm_struct *mm, struct vm_area_struct *vma, +static int do_fault(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, pte_t *page_table, pmd_t *pmd, unsigned int flags, pte_t orig_pte) { @@ -3009,44 +3001,6 @@ static int do_linear_fault(struct mm_struct *mm, struct vm_area_struct *vma, return do_shared_fault(mm, vma, address, pmd, pgoff, flags, orig_pte); } -/* - * Fault of a previously existing named mapping. Repopulate the pte - * from the encoded file_pte if possible. This enables swappable - * nonlinear vmas. - * - * We enter with non-exclusive mmap_sem (to exclude vma changes, - * but allow concurrent faults), and pte mapped but not yet locked. - * We return with mmap_sem still held, but pte unmapped and unlocked. - */ -static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma, - unsigned long address, pte_t *page_table, pmd_t *pmd, - unsigned int flags, pte_t orig_pte) -{ - pgoff_t pgoff; - - flags |= FAULT_FLAG_NONLINEAR; - - if (!pte_unmap_same(mm, pmd, page_table, orig_pte)) - return 0; - - if (unlikely(!(vma->vm_flags & VM_NONLINEAR))) { - /* - * Page table corrupted: show pte and kill process. - */ - print_bad_pte(vma, address, orig_pte, NULL); - return VM_FAULT_SIGBUS; - } - - pgoff = pte_to_pgoff(orig_pte); - if (!(flags & FAULT_FLAG_WRITE)) - return do_read_fault(mm, vma, address, pmd, pgoff, flags, - orig_pte); - if (!(vma->vm_flags & VM_SHARED)) - return do_cow_fault(mm, vma, address, pmd, pgoff, flags, - orig_pte); - return do_shared_fault(mm, vma, address, pmd, pgoff, flags, orig_pte); -} - static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma, unsigned long addr, int page_nid, int *flags) @@ -3161,15 +3115,12 @@ static int handle_pte_fault(struct mm_struct *mm, if (!pte_present(entry)) { if (pte_none(entry)) { if (vma->vm_ops) - return do_linear_fault(mm, vma, address, - pte, pmd, flags, entry); + return do_fault(mm, vma, address, pte, + pmd, flags, entry); return do_anonymous_page(mm, vma, address, pte, pmd, flags); } - if (pte_file(entry)) - return do_nonlinear_fault(mm, vma, address, - pte, pmd, flags, entry); return do_swap_page(mm, vma, address, pte, pmd, flags, entry); } diff --git a/mm/migrate.c b/mm/migrate.c index 4fb18030de62..102bdac2622e 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -180,37 +180,6 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma, return SWAP_AGAIN; } -/* - * Congratulations to trinity for discovering this bug. - * mm/fremap.c's remap_file_pages() accepts any range within a single vma to - * convert that vma to VM_NONLINEAR; and generic_file_remap_pages() will then - * replace the specified range by file ptes throughout (maybe populated after). - * If page migration finds a page within that range, while it's still located - * by vma_interval_tree rather than lost to i_mmap_nonlinear list, no problem: - * zap_pte() clears the temporary migration entry before mmap_sem is dropped. 
- * But if the migrating page is in a part of the vma outside the range to be - * remapped, then it will not be cleared, and remove_migration_ptes() needs to - * deal with it. Fortunately, this part of the vma is of course still linear, - * so we just need to use linear location on the nonlinear list. - */ -static int remove_linear_migration_ptes_from_nonlinear(struct page *page, - struct address_space *mapping, void *arg) -{ - struct vm_area_struct *vma; - /* hugetlbfs does not support remap_pages, so no huge pgoff worries */ - pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT); - unsigned long addr; - - list_for_each_entry(vma, - &mapping->i_mmap_nonlinear, shared.nonlinear) { - - addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT); - if (addr >= vma->vm_start && addr < vma->vm_end) - remove_migration_pte(page, vma, addr, arg); - } - return SWAP_AGAIN; -} - /* * Get rid of all migration entries and replace them by * references to the indicated page. @@ -220,7 +189,6 @@ static void remove_migration_ptes(struct page *old, struct page *new) struct rmap_walk_control rwc = { .rmap_one = remove_migration_pte, .arg = old, - .file_nonlinear = remove_linear_migration_ptes_from_nonlinear, }; rmap_walk(new, &rwc); diff --git a/mm/mincore.c b/mm/mincore.c index 725c80961048..3c4c8f6ab57e 100644 --- a/mm/mincore.c +++ b/mm/mincore.c @@ -124,17 +124,13 @@ static void mincore_pte_range(struct vm_area_struct *vma, pmd_t *pmd, ptep = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); do { pte_t pte = *ptep; - pgoff_t pgoff; next = addr + PAGE_SIZE; if (pte_none(pte)) mincore_unmapped_range(vma, addr, next, vec); else if (pte_present(pte)) *vec = 1; - else if (pte_file(pte)) { - pgoff = pte_to_pgoff(pte); - *vec = mincore_page(vma->vm_file->f_mapping, pgoff); - } else { /* pte is a swap entry */ + else { /* pte is a swap entry */ swp_entry_t entry = pte_to_swp_entry(pte); if (is_migration_entry(entry)) { @@ -142,9 +138,8 @@ static void mincore_pte_range(struct vm_area_struct *vma, pmd_t *pmd, *vec = 1; } else { #ifdef CONFIG_SWAP - pgoff = entry.val; *vec = mincore_page(swap_address_space(entry), - pgoff); + entry.val); #else WARN_ON(1); *vec = 1; diff --git a/mm/mmap.c b/mm/mmap.c index 44134131b8f1..77118da138fc 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -219,10 +219,7 @@ static void __remove_shared_vm_struct(struct vm_area_struct *vma, mapping->i_mmap_writable--; flush_dcache_mmap_lock(mapping); - if (unlikely(vma->vm_flags & VM_NONLINEAR)) - list_del_init(&vma->shared.nonlinear); - else - vma_interval_tree_remove(vma, &mapping->i_mmap); + vma_interval_tree_remove(vma, &mapping->i_mmap); flush_dcache_mmap_unlock(mapping); } @@ -639,10 +636,7 @@ static void __vma_link_file(struct vm_area_struct *vma) mapping->i_mmap_writable++; flush_dcache_mmap_lock(mapping); - if (unlikely(vma->vm_flags & VM_NONLINEAR)) - vma_nonlinear_insert(vma, &mapping->i_mmap_nonlinear); - else - vma_interval_tree_insert(vma, &mapping->i_mmap); + vma_interval_tree_insert(vma, &mapping->i_mmap); flush_dcache_mmap_unlock(mapping); } } @@ -777,14 +771,11 @@ again: remove_next = 1 + (end > next->vm_end); if (file) { mapping = file->f_mapping; - if (!(vma->vm_flags & VM_NONLINEAR)) { - root = &mapping->i_mmap; - uprobe_munmap(vma, vma->vm_start, vma->vm_end); + root = &mapping->i_mmap; + uprobe_munmap(vma, vma->vm_start, vma->vm_end); - if (adjust_next) - uprobe_munmap(next, next->vm_start, - next->vm_end); - } + if (adjust_next) + uprobe_munmap(next, next->vm_start, next->vm_end); 
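The mincore() change above only reroutes where residency information comes from inside the kernel; the userspace contract is unchanged. A small runnable illustration (editor's example, using an anonymous mapping for simplicity):

#define _DEFAULT_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        long psz = sysconf(_SC_PAGESIZE);
        unsigned char vec[4];
        char *p = mmap(NULL, 4 * psz, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
                return 1;
        p[0] = 1;                       /* fault in the first page only */
        if (mincore(p, 4 * psz, vec))
                return 1;
        for (int i = 0; i < 4; i++)
                printf("page %d: %s\n", i,
                       (vec[i] & 1) ? "resident" : "not resident");
        munmap(p, 4 * psz);
        return 0;
}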
 	mutex_lock(&mapping->i_mmap_mutex);
 	if (insert) {
@@ -2620,6 +2611,99 @@ SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
 	return vm_munmap(addr, len);
 }
+
+/*
+ * Emulation of deprecated remap_file_pages() syscall.
+ */
+SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
+		unsigned long, prot, unsigned long, pgoff, unsigned long, flags)
+{
+
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *vma;
+	unsigned long populate = 0;
+	unsigned long ret = -EINVAL;
+	struct file *file;
+
+	pr_warn_once("%s (%d) uses deprecated remap_file_pages() syscall. "
+			"See Documentation/vm/remap_file_pages.txt.\n",
+			current->comm, current->pid);
+
+	if (prot)
+		return ret;
+	start = start & PAGE_MASK;
+	size = size & PAGE_MASK;
+
+	if (start + size <= start)
+		return ret;
+
+	/* Does pgoff wrap? */
+	if (pgoff + (size >> PAGE_SHIFT) < pgoff)
+		return ret;
+
+	down_write(&mm->mmap_sem);
+	vma = find_vma(mm, start);
+
+	if (!vma || !(vma->vm_flags & VM_SHARED))
+		goto out;
+
+	if (start < vma->vm_start)
+		goto out;
+
+	if (start + size > vma->vm_end) {
+		struct vm_area_struct *next;
+
+		for (next = vma->vm_next; next; next = next->vm_next) {
+			/* hole between vmas ? */
+			if (next->vm_start != next->vm_prev->vm_end)
+				goto out;
+
+			if (next->vm_file != vma->vm_file)
+				goto out;
+
+			if (next->vm_flags != vma->vm_flags)
+				goto out;
+
+			if (start + size <= next->vm_end)
+				break;
+		}
+
+		if (!next)
+			goto out;
+	}
+
+	prot |= vma->vm_flags & VM_READ ? PROT_READ : 0;
+	prot |= vma->vm_flags & VM_WRITE ? PROT_WRITE : 0;
+	prot |= vma->vm_flags & VM_EXEC ? PROT_EXEC : 0;
+
+	flags &= MAP_NONBLOCK;
+	flags |= MAP_SHARED | MAP_FIXED | MAP_POPULATE;
+	if (vma->vm_flags & VM_LOCKED) {
+		struct vm_area_struct *tmp;
+		flags |= MAP_LOCKED;
+
+		/* drop PG_Mlocked flag for over-mapped range */
+		for (tmp = vma; tmp && tmp->vm_start < start + size;
+				tmp = tmp->vm_next) {
+			munlock_vma_pages_range(tmp,
+					max(tmp->vm_start, start),
+					min(tmp->vm_end, start + size));
+		}
+	}
+
+	file = get_file(vma->vm_file);
+	ret = do_mmap_pgoff(vma->vm_file, start, size,
+			prot, flags, pgoff, &populate);
+	fput(file);
+out:
+	up_write(&mm->mmap_sem);
+	if (populate)
+		mm_populate(ret, populate);
+	if (!IS_ERR_VALUE(ret))
+		ret = 0;
+	return ret;
+}
+
 static inline void verify_mm_writelocked(struct mm_struct *mm)
 {
 #ifdef CONFIG_DEBUG_VM
@@ -3094,8 +3178,7 @@ static void vm_lock_mapping(struct mm_struct *mm, struct address_space *mapping)
  *
  * mmap_sem in write mode is required in order to block all operations
  * that could modify pagetables and free pages without need of
- * altering the vma layout (for example populate_range() with
- * nonlinear vmas). It's also needed in write mode to avoid new
+ * altering the vma layout. It's also needed in write mode to avoid new
  * anon_vmas to be associated with existing vmas.
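For reference, what the emulation above means in userspace terms: a deprecated remap_file_pages() call on a shared mapping is now serviced by an ordinary fixed re-mmap() of the same file at a new offset. A hedged, runnable sketch (editor's example; error handling is minimal, the file name is arbitrary, and the glibc remap_file_pages() wrapper is assumed to be available):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        long psz = sysconf(_SC_PAGESIZE);
        int fd = open("/tmp/rfp-demo", O_RDWR | O_CREAT, 0600);
        char *addr;

        if (fd < 0 || ftruncate(fd, 4 * psz) < 0)
                return 1;
        addr = mmap(NULL, 4 * psz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (addr == MAP_FAILED)
                return 1;

        /* Deprecated: show file page 3 at the first page of the window... */
        remap_file_pages(addr, psz, 0, 3, 0);

        /* ...which the kernel now services roughly like this re-mmap: */
        mmap(addr, psz, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_FIXED, fd, 3 * psz);
        return 0;
}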
* * A single task can't take more than one mm_take_all_locks() in a row diff --git a/mm/mprotect.c b/mm/mprotect.c index bfc99abb6277..da93553d0c9c 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -110,7 +110,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, } if (updated) pages++; - } else if (IS_ENABLED(CONFIG_MIGRATION) && !pte_file(oldpte)) { + } else if (IS_ENABLED(CONFIG_MIGRATION)) { swp_entry_t entry = pte_to_swp_entry(oldpte); if (is_write_migration_entry(entry)) { @@ -258,6 +258,42 @@ unsigned long change_protection(struct vm_area_struct *vma, unsigned long start, return pages; } +static int prot_none_pte_entry(pte_t *pte, unsigned long addr, + unsigned long next, struct mm_walk *walk) +{ + return pfn_modify_allowed(pte_pfn(*pte), *(pgprot_t *)(walk->private)) ? + 0 : -EACCES; +} + +static int prot_none_hugetlb_entry(pte_t *pte, unsigned long hmask, + unsigned long addr, unsigned long next, + struct mm_walk *walk) +{ + return pfn_modify_allowed(pte_pfn(*pte), *(pgprot_t *)(walk->private)) ? + 0 : -EACCES; +} + +static int prot_none_test(unsigned long addr, unsigned long next, + struct mm_walk *walk) +{ + return 0; +} + +static int prot_none_walk(struct vm_area_struct *vma, unsigned long start, + unsigned long end, unsigned long newflags) +{ + pgprot_t new_pgprot = vm_get_page_prot(newflags); + struct mm_walk prot_none_walk = { + .pte_entry = prot_none_pte_entry, + .hugetlb_entry = prot_none_hugetlb_entry, + .test_walk = prot_none_test, + .mm = current->mm, + .private = &new_pgprot, + }; + + return walk_page_range(start, end, &prot_none_walk); +} + int mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev, unsigned long start, unsigned long end, unsigned long newflags) @@ -275,6 +311,19 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev, return 0; } + /* + * Do PROT_NONE PFN permission checks here when we can still + * bail out without undoing a lot of state. This is a rather + * uncommon case, so doesn't need to be very optimized. 
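The user-visible consequence of prot_none_walk() is worth spelling out: on architectures where arch_has_pfn_modify_check() returns true, downgrading a pfn-based mapping to PROT_NONE can now fail. A fragment (editor's illustration; assumes <errno.h>, <stdio.h> and <sys/mman.h>, and that addr/len describe a VM_PFNMAP region such as mmap()ed device memory):

/* mprotect() to PROT_NONE may now report the pfn check failing. */
if (mprotect(addr, len, PROT_NONE) == -1 && errno == EACCES)
        fprintf(stderr, "PROT_NONE refused: pfn fails the arch check\n");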
+ */ + if (arch_has_pfn_modify_check() && + (vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) && + (newflags & (VM_READ|VM_WRITE|VM_EXEC)) == 0) { + error = prot_none_walk(vma, start, end, newflags); + if (error) + return error; + } + /* * If we make a private mapping writable we increase our commit; * but (without finer accounting) cannot reduce our commit if we diff --git a/mm/mremap.c b/mm/mremap.c index 05f1180e9f21..6d49f62a4863 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -81,8 +81,6 @@ static pte_t move_soft_dirty_pte(pte_t pte) pte = pte_mksoft_dirty(pte); else if (is_swap_pte(pte)) pte = pte_swp_mksoft_dirty(pte); - else if (pte_file(pte)) - pte = pte_file_mksoft_dirty(pte); #endif return pte; } diff --git a/mm/msync.c b/mm/msync.c index 992a1673d488..bb04d53ae852 100644 --- a/mm/msync.c +++ b/mm/msync.c @@ -86,10 +86,7 @@ SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags) (vma->vm_flags & VM_SHARED)) { get_file(file); up_read(&mm->mmap_sem); - if (vma->vm_flags & VM_NONLINEAR) - error = vfs_fsync(file, 1); - else - error = vfs_fsync_range(file, fstart, fend, 1); + error = vfs_fsync_range(file, fstart, fend, 1); fput(file); if (error || start >= end) goto out; diff --git a/mm/nommu.c b/mm/nommu.c index 7f2c901c555e..b2b148319ba2 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -1999,14 +1999,6 @@ void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf) } EXPORT_SYMBOL(filemap_map_pages); -int generic_file_remap_pages(struct vm_area_struct *vma, unsigned long addr, - unsigned long size, pgoff_t pgoff) -{ - BUG(); - return 0; -} -EXPORT_SYMBOL(generic_file_remap_pages); - int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm, unsigned long addr, void *buf, int len, int write) { diff --git a/mm/pagewalk.c b/mm/pagewalk.c index 9056d22d2880..91810ba875ea 100644 --- a/mm/pagewalk.c +++ b/mm/pagewalk.c @@ -59,7 +59,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end, continue; split_huge_page_pmd_mm(walk->mm, addr, pmd); - if (pmd_none_or_trans_huge_or_clear_bad(pmd)) + if (pmd_trans_unstable(pmd)) goto again; err = walk_pte_range(pmd, addr, next, walk); if (err) @@ -86,9 +86,7 @@ static int walk_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end, break; continue; } - if (walk->pud_entry) - err = walk->pud_entry(pud, addr, next, walk); - if (!err && (walk->pmd_entry || walk->pte_entry)) + if (walk->pmd_entry || walk->pte_entry) err = walk_pmd_range(pud, addr, next, walk); if (err) break; @@ -97,6 +95,32 @@ static int walk_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end, return err; } +static int walk_pgd_range(unsigned long addr, unsigned long end, + struct mm_walk *walk) +{ + pgd_t *pgd; + unsigned long next; + int err = 0; + + pgd = pgd_offset(walk->mm, addr); + do { + next = pgd_addr_end(addr, end); + if (pgd_none_or_clear_bad(pgd)) { + if (walk->pte_hole) + err = walk->pte_hole(addr, next, walk); + if (err) + break; + continue; + } + if (walk->pmd_entry || walk->pte_entry) + err = walk_pud_range(pgd, addr, next, walk); + if (err) + break; + } while (pgd++, addr = next, addr != end); + + return err; +} + #ifdef CONFIG_HUGETLB_PAGE static unsigned long hugetlb_entry_end(struct hstate *h, unsigned long addr, unsigned long end) @@ -105,10 +129,10 @@ static unsigned long hugetlb_entry_end(struct hstate *h, unsigned long addr, return boundary < end ? 
boundary : end; } -static int walk_hugetlb_range(struct vm_area_struct *vma, - unsigned long addr, unsigned long end, +static int walk_hugetlb_range(unsigned long addr, unsigned long end, struct mm_walk *walk) { + struct vm_area_struct *vma = walk->vma; struct hstate *h = hstate_vma(vma); unsigned long next; unsigned long hmask = huge_page_mask(h); @@ -121,15 +145,14 @@ static int walk_hugetlb_range(struct vm_area_struct *vma, if (pte && walk->hugetlb_entry) err = walk->hugetlb_entry(pte, hmask, addr, next, walk); if (err) - return err; + break; } while (addr = next, addr != end); - return 0; + return err; } #else /* CONFIG_HUGETLB_PAGE */ -static int walk_hugetlb_range(struct vm_area_struct *vma, - unsigned long addr, unsigned long end, +static int walk_hugetlb_range(unsigned long addr, unsigned long end, struct mm_walk *walk) { return 0; @@ -137,115 +160,115 @@ static int walk_hugetlb_range(struct vm_area_struct *vma, #endif /* CONFIG_HUGETLB_PAGE */ +/* + * Decide whether we really walk over the current vma on [@start, @end) + * or skip it via the returned value. Return 0 if we do walk over the + * current vma, and return 1 if we skip the vma. Negative values means + * error, where we abort the current walk. + * + * Default check (only VM_PFNMAP check for now) is used when the caller + * doesn't define test_walk() callback. + */ +static int walk_page_test(unsigned long start, unsigned long end, + struct mm_walk *walk) +{ + struct vm_area_struct *vma = walk->vma; + if (walk->test_walk) + return walk->test_walk(start, end, walk); + + /* + * Do not walk over vma(VM_PFNMAP), because we have no valid struct + * page backing a VM_PFNMAP range. See also commit a9ff785e4437. + */ + if (vma->vm_flags & VM_PFNMAP) + return 1; + return 0; +} + +static int __walk_page_range(unsigned long start, unsigned long end, + struct mm_walk *walk) +{ + int err = 0; + struct vm_area_struct *vma = walk->vma; + + if (vma && is_vm_hugetlb_page(vma)) { + if (walk->hugetlb_entry) + err = walk_hugetlb_range(start, end, walk); + } else + err = walk_pgd_range(start, end, walk); + + return err; +} /** - * walk_page_range - walk a memory map's page tables with a callback - * @addr: starting address - * @end: ending address - * @walk: set of callbacks to invoke for each level of the tree - * - * Recursively walk the page table for the memory area in a VMA, - * calling supplied callbacks. Callbacks are called in-order (first - * PGD, first PUD, first PMD, first PTE, second PTE... second PMD, - * etc.). If lower-level callbacks are omitted, walking depth is reduced. + * walk_page_range - walk page table with caller specific callbacks * - * Each callback receives an entry pointer and the start and end of the - * associated range, and a copy of the original mm_walk for access to - * the ->private or ->mm fields. + * Recursively walk the page table tree of the process represented by @walk->mm + * within the virtual address range [@start, @end). During walking, we can do + * some caller-specific works for each entry, by setting up pmd_entry(), + * pte_entry(), and/or hugetlb_entry(). If you don't set up for some of these + * callbacks, the associated entries/pages are just ignored. + * The return values of these callbacks are commonly defined like below: + * - 0 : succeeded to handle the current entry, and if you don't reach the + * end address yet, continue to walk. + * - >0 : succeeded to handle the current entry, and return to the caller + * with caller specific value. 
+ * - <0 : failed to handle the current entry, and return to the caller + * with error code. * - * Usually no locks are taken, but splitting transparent huge page may - * take page table lock. And the bottom level iterator will map PTE - * directories from highmem if necessary. + * Before starting to walk page table, some callers want to check whether + * they really want to walk over the current vma, typically by checking + * its vm_flags. walk_page_test() and @walk->test_walk() are used for this + * purpose. * - * If any callback returns a non-zero value, the walk is aborted and - * the return value is propagated back to the caller. Otherwise 0 is returned. + * struct mm_walk keeps current values of some common data like vma and pmd, + * which are useful for the access from callbacks. If you want to pass some + * caller-specific data to callbacks, @walk->private should be helpful. * - * walk->mm->mmap_sem must be held for at least read if walk->hugetlb_entry - * is !NULL. + * Locking: + * Callers of walk_page_range() and walk_page_vma() should hold + * @walk->mm->mmap_sem, because these function traverse vma list and/or + * access to vma's data. */ -int walk_page_range(unsigned long addr, unsigned long end, +int walk_page_range(unsigned long start, unsigned long end, struct mm_walk *walk) { - pgd_t *pgd; - unsigned long next; int err = 0; + unsigned long next; + struct vm_area_struct *vma; - if (addr >= end) - return err; + if (start >= end) + return -EINVAL; if (!walk->mm) return -EINVAL; VM_BUG_ON(!rwsem_is_locked(&walk->mm->mmap_sem)); - pgd = pgd_offset(walk->mm, addr); + vma = find_vma(walk->mm, start); do { - struct vm_area_struct *vma = NULL; + if (!vma) { /* after the last vma */ + walk->vma = NULL; + next = end; + } else if (start < vma->vm_start) { /* outside vma */ + walk->vma = NULL; + next = min(end, vma->vm_start); + } else { /* inside vma */ + walk->vma = vma; + next = min(end, vma->vm_end); + vma = vma->vm_next; - next = pgd_addr_end(addr, end); - - /* - * This function was not intended to be vma based. - * But there are vma special cases to be handled: - * - hugetlb vma's - * - VM_PFNMAP vma's - */ - vma = find_vma(walk->mm, addr); - if (vma) { - /* - * There are no page structures backing a VM_PFNMAP - * range, so do not allow split_huge_page_pmd(). - */ - if ((vma->vm_start <= addr) && - (vma->vm_flags & VM_PFNMAP)) { - if (walk->pte_hole) - err = walk->pte_hole(addr, next, walk); - if (err) - break; - pgd = pgd_offset(walk->mm, next); + err = walk_page_test(start, next, walk); + if (err > 0) continue; - } - /* - * Handle hugetlb vma individually because pagetable - * walk for the hugetlb page is dependent on the - * architecture and we can't handled it in the same - * manner as non-huge pages. - */ - if (walk->hugetlb_entry && (vma->vm_start <= addr) && - is_vm_hugetlb_page(vma)) { - if (vma->vm_end < next) - next = vma->vm_end; - /* - * Hugepage is very tightly coupled with vma, - * so walk through hugetlb entries within a - * given vma. 
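To make the reworked interface documented above concrete, here is a minimal in-kernel style user (editor's sketch against the API in this patch, not a tested implementation): only .pte_entry is supplied, so the walker handles vma selection and hole skipping, and walk->private carries the result back.

static int count_present_pte(pte_t *pte, unsigned long addr,
                             unsigned long next, struct mm_walk *walk)
{
        if (pte_present(*pte))
                (*(unsigned long *)walk->private)++;
        return 0;
}

static unsigned long count_present(struct mm_struct *mm,
                                   unsigned long start, unsigned long end)
{
        unsigned long nr = 0;
        struct mm_walk walk = {
                .pte_entry = count_present_pte,
                .mm = mm,
                .private = &nr,
        };

        down_read(&mm->mmap_sem);       /* walk_page_range() asserts this */
        walk_page_range(start, end, &walk);
        up_read(&mm->mmap_sem);
        return nr;
}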
- */ - err = walk_hugetlb_range(vma, addr, next, walk); - if (err) - break; - pgd = pgd_offset(walk->mm, next); - continue; - } - } - - if (pgd_none_or_clear_bad(pgd)) { - if (walk->pte_hole) - err = walk->pte_hole(addr, next, walk); - if (err) + if (err < 0) break; - pgd++; - continue; } - if (walk->pgd_entry) - err = walk->pgd_entry(pgd, addr, next, walk); - if (!err && - (walk->pud_entry || walk->pmd_entry || walk->pte_entry)) - err = walk_pud_range(pgd, addr, next, walk); + if (walk->vma || walk->pte_hole) + err = __walk_page_range(start, next, walk); if (err) break; - pgd++; - } while (addr = next, addr < end); - + } while (start = next, start < end); return err; } diff --git a/mm/rmap.c b/mm/rmap.c index e84beb4dcfce..044dcbc770cc 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -597,9 +597,8 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma) if (!vma->anon_vma || !page__anon_vma || vma->anon_vma->root != page__anon_vma->root) return -EFAULT; - } else if (page->mapping && !(vma->vm_flags & VM_NONLINEAR)) { - if (!vma->vm_file || - vma->vm_file->f_mapping != page->mapping) + } else if (page->mapping) { + if (!vma->vm_file || vma->vm_file->f_mapping != page->mapping) return -EFAULT; } else return -EFAULT; @@ -1286,7 +1285,6 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, if (pte_soft_dirty(pteval)) swp_pte = pte_swp_mksoft_dirty(swp_pte); set_pte_at(mm, address, pte, swp_pte); - BUG_ON(pte_file(*pte)); } else if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION)) { /* Establish migration entry for a file page */ @@ -1328,207 +1326,6 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, return ret; } -/* - * objrmap doesn't work for nonlinear VMAs because the assumption that - * offset-into-file correlates with offset-into-virtual-addresses does not hold. - * Consequently, given a particular page and its ->index, we cannot locate the - * ptes which are mapping that page without an exhaustive linear search. - * - * So what this code does is a mini "virtual scan" of each nonlinear VMA which - * maps the file to which the target page belongs. The ->vm_private_data field - * holds the current cursor into that scan. Successive searches will circulate - * around the vma's virtual address space. - * - * So as more replacement pressure is applied to the pages in a nonlinear VMA, - * more scanning pressure is placed against them as well. Eventually pages - * will become fully unmapped and are eligible for eviction. - * - * For very sparsely populated VMAs this is a little inefficient - chances are - * there there won't be many ptes located within the scan cluster. In this case - * maybe we could scan further - to the end of the pte page, perhaps. - * - * Mlocked pages: check VM_LOCKED under mmap_sem held for read, if we can - * acquire it without blocking. If vma locked, mlock the pages in the cluster, - * rather than unmapping them. If we encounter the "check_page" that vmscan is - * trying to unmap, return SWAP_MLOCK, else default SWAP_AGAIN. 
- */ -#define CLUSTER_SIZE min(32*PAGE_SIZE, PMD_SIZE) -#define CLUSTER_MASK (~(CLUSTER_SIZE - 1)) - -static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount, - struct vm_area_struct *vma, struct page *check_page) -{ - struct mm_struct *mm = vma->vm_mm; - pmd_t *pmd; - pte_t *pte; - pte_t pteval; - spinlock_t *ptl; - struct page *page; - unsigned long address; - unsigned long mmun_start; /* For mmu_notifiers */ - unsigned long mmun_end; /* For mmu_notifiers */ - unsigned long end; - int ret = SWAP_AGAIN; - int locked_vma = 0; - - address = (vma->vm_start + cursor) & CLUSTER_MASK; - end = address + CLUSTER_SIZE; - if (address < vma->vm_start) - address = vma->vm_start; - if (end > vma->vm_end) - end = vma->vm_end; - - pmd = mm_find_pmd(mm, address); - if (!pmd) - return ret; - - mmun_start = address; - mmun_end = end; - mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); - - /* - * If we can acquire the mmap_sem for read, and vma is VM_LOCKED, - * keep the sem while scanning the cluster for mlocking pages. - */ - if (down_read_trylock(&vma->vm_mm->mmap_sem)) { - locked_vma = (vma->vm_flags & VM_LOCKED); - if (!locked_vma) - up_read(&vma->vm_mm->mmap_sem); /* don't need it */ - } - - pte = pte_offset_map_lock(mm, pmd, address, &ptl); - - /* Update high watermark before we lower rss */ - update_hiwater_rss(mm); - - for (; address < end; pte++, address += PAGE_SIZE) { - if (!pte_present(*pte)) - continue; - page = vm_normal_page(vma, address, *pte); - BUG_ON(!page || PageAnon(page)); - - if (locked_vma) { - if (page == check_page) { - /* we know we have check_page locked */ - mlock_vma_page(page); - ret = SWAP_MLOCK; - } else if (trylock_page(page)) { - /* - * If we can lock the page, perform mlock. - * Otherwise leave the page alone, it will be - * eventually encountered again later. - */ - mlock_vma_page(page); - unlock_page(page); - } - continue; /* don't unmap */ - } - - if (ptep_clear_flush_young_notify(vma, address, pte)) - continue; - - /* Nuke the page table entry. */ - flush_cache_page(vma, address, pte_pfn(*pte)); - pteval = ptep_clear_flush(vma, address, pte); - - /* If nonlinear, store the file page offset in the pte. */ - if (page->index != linear_page_index(vma, address)) { - pte_t ptfile = pgoff_to_pte(page->index); - if (pte_soft_dirty(pteval)) - ptfile = pte_file_mksoft_dirty(ptfile); - set_pte_at(mm, address, pte, ptfile); - } - - /* Move the dirty bit to the physical page now the pte is gone. */ - if (pte_dirty(pteval)) - set_page_dirty(page); - - page_remove_rmap(page); - page_cache_release(page); - dec_mm_counter(mm, MM_FILEPAGES); - (*mapcount)--; - } - pte_unmap_unlock(pte - 1, ptl); - mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); - if (locked_vma) - up_read(&vma->vm_mm->mmap_sem); - return ret; -} - -static int try_to_unmap_nonlinear(struct page *page, - struct address_space *mapping, void *arg) -{ - struct vm_area_struct *vma; - int ret = SWAP_AGAIN; - unsigned long cursor; - unsigned long max_nl_cursor = 0; - unsigned long max_nl_size = 0; - unsigned int mapcount; - - list_for_each_entry(vma, - &mapping->i_mmap_nonlinear, shared.nonlinear) { - - cursor = (unsigned long) vma->vm_private_data; - if (cursor > max_nl_cursor) - max_nl_cursor = cursor; - cursor = vma->vm_end - vma->vm_start; - if (cursor > max_nl_size) - max_nl_size = cursor; - } - - if (max_nl_size == 0) { /* all nonlinears locked or reserved ? 
*/ - return SWAP_FAIL; - } - - /* - * We don't try to search for this page in the nonlinear vmas, - * and page_referenced wouldn't have found it anyway. Instead - * just walk the nonlinear vmas trying to age and unmap some. - * The mapcount of the page we came in with is irrelevant, - * but even so use it as a guide to how hard we should try? - */ - mapcount = page_mapcount(page); - if (!mapcount) - return ret; - - cond_resched(); - - max_nl_size = (max_nl_size + CLUSTER_SIZE - 1) & CLUSTER_MASK; - if (max_nl_cursor == 0) - max_nl_cursor = CLUSTER_SIZE; - - do { - list_for_each_entry(vma, - &mapping->i_mmap_nonlinear, shared.nonlinear) { - - cursor = (unsigned long) vma->vm_private_data; - while (cursor < max_nl_cursor && - cursor < vma->vm_end - vma->vm_start) { - if (try_to_unmap_cluster(cursor, &mapcount, - vma, page) == SWAP_MLOCK) - ret = SWAP_MLOCK; - cursor += CLUSTER_SIZE; - vma->vm_private_data = (void *) cursor; - if ((int)mapcount <= 0) - return ret; - } - vma->vm_private_data = (void *) max_nl_cursor; - } - cond_resched(); - max_nl_cursor += CLUSTER_SIZE; - } while (max_nl_cursor <= max_nl_size); - - /* - * Don't loop forever (perhaps all the remaining pages are - * in locked vmas). Reset cursor on all unreserved nonlinear - * vmas, now forgetting on which ones it had fallen behind. - */ - list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.nonlinear) - vma->vm_private_data = NULL; - - return ret; -} - bool is_vma_temporary_stack(struct vm_area_struct *vma) { int maybe_stack = vma->vm_flags & (VM_GROWSDOWN | VM_GROWSUP); @@ -1574,7 +1371,6 @@ int try_to_unmap(struct page *page, enum ttu_flags flags) .rmap_one = try_to_unmap_one, .arg = (void *)flags, .done = page_not_mapped, - .file_nonlinear = try_to_unmap_nonlinear, .anon_lock = page_lock_anon_vma_read, }; @@ -1620,12 +1416,6 @@ int try_to_munlock(struct page *page) .rmap_one = try_to_unmap_one, .arg = (void *)TTU_MUNLOCK, .done = page_not_mapped, - /* - * We don't bother to try to find the munlocked page in - * nonlinears. It's costly. Instead, later, page reclaim logic - * may call try_to_unmap() and recover PG_mlocked lazily. 
- */ - .file_nonlinear = NULL, .anon_lock = page_lock_anon_vma_read, }; @@ -1753,14 +1543,6 @@ static int rmap_walk_file(struct page *page, struct rmap_walk_control *rwc) goto done; } - if (!rwc->file_nonlinear) - goto done; - - if (list_empty(&mapping->i_mmap_nonlinear)) - goto done; - - ret = rwc->file_nonlinear(page, mapping, rwc->arg); - done: mutex_unlock(&mapping->i_mmap_mutex); return ret; diff --git a/mm/shmem.c b/mm/shmem.c index efab7a79b05f..507231b0dfe3 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -2793,7 +2793,6 @@ static const struct vm_operations_struct shmem_vm_ops = { .set_policy = shmem_set_policy, .get_policy = shmem_get_policy, #endif - .remap_pages = generic_file_remap_pages, }; static struct dentry *shmem_mount(struct file_system_type *fs_type, diff --git a/mm/swap.c b/mm/swap.c index a4d58b5fc4be..6b833da5ce5a 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -1103,10 +1103,8 @@ void __init swap_setup(void) if (bdi_init(swapper_spaces[0].backing_dev_info)) panic("Failed to init swap bdi"); - for (i = 0; i < MAX_SWAPFILES; i++) { + for (i = 0; i < MAX_SWAPFILES; i++) spin_lock_init(&swapper_spaces[i].tree_lock); - INIT_LIST_HEAD(&swapper_spaces[i].i_mmap_nonlinear); - } #endif /* Use a smaller cluster for small-memory machines */ diff --git a/mm/swapfile.c b/mm/swapfile.c index 97235887cb5b..aef1f7e9143b 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -2166,6 +2166,35 @@ static int claim_swapfile(struct swap_info_struct *p, struct inode *inode) return 0; } + +/* + * Find out how many pages are allowed for a single swap device. There + * are two limiting factors: + * 1) the number of bits for the swap offset in the swp_entry_t type, and + * 2) the number of bits in the swap pte, as defined by the different + * architectures. + * + * In order to find the largest possible bit mask, a swap entry with + * swap type 0 and swap offset ~0UL is created, encoded to a swap pte, + * decoded to a swp_entry_t again, and finally the swap offset is + * extracted. + * + * This will mask all the bits from the initial ~0UL mask that can't + * be encoded in either the swp_entry_t or the architecture definition + * of a swap pte. + */ +unsigned long generic_max_swapfile_size(void) +{ + return swp_offset(pte_to_swp_entry( + swp_entry_to_pte(swp_entry(0, ~0UL)))) + 1; +} + +/* Can be overridden by an architecture for additional checks. */ +__weak unsigned long max_swapfile_size(void) +{ + return generic_max_swapfile_size(); +} + static unsigned long read_swap_header(struct swap_info_struct *p, union swap_header *swap_header, struct inode *inode) @@ -2201,22 +2230,7 @@ static unsigned long read_swap_header(struct swap_info_struct *p, p->cluster_next = 1; p->cluster_nr = 0; - /* - * Find out how many pages are allowed for a single swap - * device. There are two limiting factors: 1) the number - * of bits for the swap offset in the swp_entry_t type, and - * 2) the number of bits in the swap pte as defined by the - * different architectures. In order to find the - * largest possible bit mask, a swap entry with swap type 0 - * and swap offset ~0UL is created, encoded to a swap pte, - * decoded to a swp_entry_t again, and finally the swap - * offset is extracted. This will mask all the bits from - * the initial ~0UL mask that can't be encoded in either - * the swp_entry_t or the architecture definition of a - * swap pte. 
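The encode/decode round-trip in generic_max_swapfile_size() above is easiest to see with a toy model (editor's illustration; the 58-bit width is an arbitrary stand-in for whatever bits an architecture's swap pte can really hold):

#include <stdio.h>

#define SWP_OFFSET_BITS 58UL    /* assumed width, for illustration only */

/* "Encoding" to a pte keeps only the bits the pte can hold... */
static unsigned long encode(unsigned long off)
{
        return off & ((1UL << SWP_OFFSET_BITS) - 1);
}

/* ...and decoding loses nothing further in this toy model. */
static unsigned long decode(unsigned long pte)
{
        return pte;
}

int main(void)
{
        /* Same round-trip as generic_max_swapfile_size() above. */
        unsigned long maxpages = decode(encode(~0UL)) + 1;

        printf("max pages per swap device: %#lx\n", maxpages);
        return 0;
}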
- */ - maxpages = swp_offset(pte_to_swp_entry( - swp_entry_to_pte(swp_entry(0, ~0UL)))) + 1; + maxpages = max_swapfile_size(); last_page = swap_header->info.last_page; if (last_page > maxpages) { pr_warn("Truncating oversized swap area, only using %luk out of %luk\n", diff --git a/net/irda/af_irda.c b/net/irda/af_irda.c index a34379892d85..b4743db43b53 100644 --- a/net/irda/af_irda.c +++ b/net/irda/af_irda.c @@ -786,6 +786,13 @@ static int irda_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len) return -EINVAL; lock_sock(sk); + + /* Ensure that the socket is not already bound */ + if (self->ias_obj) { + err = -EINVAL; + goto out; + } + #ifdef CONFIG_IRDA_ULTRA /* Special care for Ultra sockets */ if ((sk->sk_type == SOCK_DGRAM) && @@ -2046,7 +2053,11 @@ static int irda_setsockopt(struct socket *sock, int level, int optname, err = -EINVAL; goto out; } - irias_insert_object(ias_obj); + + /* Only insert newly allocated objects */ + if (free_ias) + irias_insert_object(ias_obj); + kfree(ias_opt); break; case IRLMP_IAS_DEL:
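The irda_bind() check above turns a silent leak of the first IAS object into a clean failure: a second bind on an already-bound IrDA socket now returns EINVAL. A runnable sketch of the user-visible behaviour (editor's example; assumes an IrDA-era kernel with AF_IRDA support and <linux/irda.h>):

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/irda.h>

int main(void)
{
        int fd = socket(AF_IRDA, SOCK_STREAM, 0);
        struct sockaddr_irda addr;

        if (fd < 0)
                return 1;
        memset(&addr, 0, sizeof(addr));
        addr.sir_family = AF_IRDA;
        strncpy(addr.sir_name, "demo", sizeof(addr.sir_name) - 1);

        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
                return 1;               /* first bind: succeeds */
        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
                perror("second bind");  /* now fails with EINVAL */
        return 0;
}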