A Total Overhaul of Lunaix's Virtual Memory Model (#26)
* * Introducing a new declarative pte manipulation toolset.
Prior to this patch, the original page table API was a simple and
straightforward, yet rather verbose design, as can be seen from the
following characteristics:
1. `vmm_set_mapping` was the only way to set a pte in the page
table. It required explicitly specifying the physical address,
the virtual address and the pte attributes, which was by design,
for the sake of comprehensiveness. However, we found that it was
always accompanied by cumbersome address calculations and nasty
pte bit-masking just to get these arguments right, especially
when doing non-trivial mappings.
2. The existing design assumed strict 2-level paging and a fixed
4K page size, tightly coupled with x86's 32-bit paging. This made
it impossible to extend beyond these assumptions, for example to
add huge pages or support a non-x86 mmu.
3. Page table manipulation was not centralised; a number of
eccentric, one-off APIs were left dangling in the kboot area.
In light of these limitations, we have redesigned the entire virtual
memory interface. The key realisation is that a pointer to a pte
already encodes enough information to complete any pte read/write at
any level, and that plain pointer arithmetic automatically yields a
valid pointer to the desired pte, which lets us remove the bloat of
invoking vmm_set_mapping (see the sketch at the end of this entry).
Architecture-dependent information related to PTEs is abstracted
away from the generic kernel code base, giving purely declarative
PTE construction and page table manipulation.
* Refactoring done on kboot to use the new api.
* Refactoring done on pfault handler.
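To illustrate the idea (a minimal sketch, not the actual Lunaix API),
assume 32-bit x86 with 4KiB pages and the page table recursively
mapped at a fixed mount point; the names below (VMS_MOUNT, mkptep_va,
mkpte, map_contig) are hypothetical:

```c
#include <stdint.h>

typedef uint32_t  pte_t;   /* one 32-bit page table entry             */
typedef uintptr_t ptr_t;

#define LFT_SIZE   4096UL          /* leaf (4KiB) page size, assumed   */
#define VMS_MOUNT  0xffc00000UL    /* hypothetical recursive mount     */

/* Deduce the pointer to the leaf pte covering `va`: with the table
 * recursively mapped at `mnt`, the virtual address alone determines
 * where its pte lives. */
static inline pte_t*
mkptep_va(ptr_t mnt, ptr_t va)
{
    return (pte_t*)mnt + (va / LFT_SIZE);
}

/* Declaratively construct a pte value. */
static inline pte_t
mkpte(ptr_t pa, pte_t prot)
{
    return (pte_t)(pa & ~(LFT_SIZE - 1)) | prot | 0x1 /* present */;
}

/* Contiguous mapping without per-page address recalculation: just
 * advance the ptep and the physical address in lockstep. */
static void
map_contig(ptr_t va, ptr_t pa, unsigned int npages, pte_t prot)
{
    pte_t* ptep = mkptep_va(VMS_MOUNT, va);
    for (unsigned int i = 0; i < npages; i++, pa += LFT_SIZE) {
        ptep[i] = mkpte(pa, prot);
    }
}
```

Compared with vmm_set_mapping, a caller only ever deals with a pte_t*
and a pte value, and reaching a neighbouring pte is plain pointer
arithmetic rather than another address calculation.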
* * Correct ptep address deduction to take the pte size into account,
which previously resulted in unaligned pte writes
* Correct the use of memset and tlb invalidation when zeroing a
newly allocated page table: deduce the next-level ptep and use it
accordingly
* Simplify the pre-boot code (boot.S); move the setting of the CRx
registers into a more readable form.
* Allocate a new stack residing in higher-half memory for the
bootstrapping stage, allowing us to free the bootctx safely before
getting into lunad
* Adjust the bootctx helpers to work with the new vmm api.
* (LunaDBG) update the mm lookup to detect the huge-page mapping
correctly
* * Dynamically allocate a page table when a ptep triggers a page
fault because it points to a pte whose containing page table does
not exist yet. Previously we always assumed that the table was
allocated before the pte was written. This on-demand allocation
removes the overhead of walking all n levels just to ensure the
hierarchy exists (see the sketch at the end of this entry).
* The page fault handling procedure is refactored: all the important
information, such as the faulting pte and eip, is gathered into a
dedicated struct fault_context.
* State the definitions we have introduced, to make things clear.
* Rewrite the vmap function with the new ptep feature; the reduction
in LoC and complexity is significant.
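A rough sketch of this mechanism, reusing the hypothetical types and
the mkpte helper from the earlier snippet; the struct fault_context
fields and the extern helpers below are illustrative stand-ins, not
the real kernel symbols:

```c
#include <stdbool.h>
#include <string.h>

/* hypothetical helpers assumed to exist elsewhere */
extern bool   va_in_ptes_window(ptr_t va);  /* va inside the self-map? */
extern ptr_t  pmm_alloc_one(void);          /* grab one physical page  */
extern pte_t* ptep_step_up(pte_t* ptep);    /* ptep of the parent pte  */
extern void   tlb_flush_page(ptr_t va);

#define KERNEL_DATA 0x2    /* writable, supervisor-only (assumed bits) */

struct fault_context {
    ptr_t  fault_va;       /* faulting virtual address                 */
    ptr_t  fault_ip;       /* instruction pointer (eip) at the fault   */
    pte_t* fault_ptep;     /* ptep covering fault_va                   */
    pte_t  fault_pte;      /* its value at fault time                  */
    bool   kernel_access;  /* privilege level of the faulting access   */
};

/* If the fault hit a ptep whose containing table is missing, allocate
 * that table on demand instead of pre-building the whole hierarchy. */
static bool
handle_ptep_fault(struct fault_context* fc)
{
    if (!va_in_ptes_window(fc->fault_va)) {
        return false;               /* an ordinary data/code fault */
    }

    /* hook a fresh table one level up; the parent ptep is, again,
     * just pointer arithmetic away */
    *ptep_step_up(fc->fault_ptep) = mkpte(pmm_alloc_one(), KERNEL_DATA);

    /* drop any stale translation, then zero the new table through
     * its window before the faulting access is retried */
    ptr_t table_va = fc->fault_va & ~(LFT_SIZE - 1);
    tlb_flush_page(table_va);
    memset((void*)table_va, 0, LFT_SIZE);

    return true;
}
```

With this in place, code that writes ptes never has to ensure the
intermediate levels exist first; the first touch of a missing table
faults once and then resumes.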
* * Use huge pages to perform a fast and memory-efficient identity
mapping of the physical address space (first 3GiB). Doing so
eliminates the need for selective mapping of the bootloader's
mem_map (see the sketch at the end of this entry).
* Correct the address calculation in __alloc_contig_ptes
* Change the behavior of the former pagetable_alloc to offload most
pte setting to its caller, making it more portable. We also renamed
it to 'vmm_alloc_page'
* Perform some formatting to make things easier to read.
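A minimal sketch of such an identity mapping, assuming 32-bit x86
with PSE (4MiB huge leaves at the top level) and reusing the
hypothetical mkpte and KERNEL_DATA from the snippets above:

```c
#define L0T_SIZE     (4UL << 20)   /* one top-level entry spans 4MiB  */
#define IDMAP_LIMIT  (3UL << 30)   /* identity-map the first 3GiB     */
#define PDE_PS       (1u << 7)     /* x86 "page size" (huge leaf) bit */

static void
identity_map_lowmem(pte_t* l0_table)
{
    for (ptr_t pa = 0; pa < IDMAP_LIMIT; pa += L0T_SIZE) {
        /* one huge leaf per 4MiB: no second-level tables are needed,
         * so the first 3GiB costs 768 entries instead of ~768K ptes */
        l0_table[pa / L0T_SIZE] = mkpte(pa, KERNEL_DATA | PDE_PS);
    }
}
```

Since all of low memory is covered this way, there is no longer a
need to walk the bootloader's mem_map and map its ranges selectively.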
* * Rewrite vms duplication and deletion. Using the latest vmm
refactoring, the implementation is much cleaner and more intuitive
than before, although the LoC is slightly higher. The rewritten
versions are named `vmscpy` and `vmsfree`, as they remove the
assumption that the source vms is VMS_SELF
* Add `pmm_free_one` to allow the user to free a pmm page based on
its attributes. This is intended to solve the recently discovered
leakage of physical pages: the old pmm_free_page could not free a
PP_FGLOCKED page allocated to a page table, resulting in pages that
could not be freed by any means (see the sketch at the end of this
entry).
* Rename some functions for better clarity.
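A hedged sketch of the intent behind `pmm_free_one`; the struct
layout, flag values and signatures below are hypothetical, only the
behaviour mirrors this entry:

```c
#include <stdbool.h>

#define PP_FGLOCKED (1u << 0)  /* pinned page, e.g. backing a page table */

struct ppage {
    unsigned int refs;
    unsigned int flags;
};

/* old behaviour: a locked page can never be returned, which leaked
 * every page that had been handed to a page table */
static bool
pmm_free_page(struct ppage* pp)
{
    if (pp->flags & PP_FGLOCKED) {
        return false;
    }
    return --pp->refs == 0;
}

/* new: the caller states which attributes it owns, so a page-table
 * page can be released when the vms that owns it is torn down */
static bool
pmm_free_one(struct ppage* pp, unsigned int owned_flags)
{
    pp->flags &= ~owned_flags;
    if (pp->flags & PP_FGLOCKED) {
        return false;
    }
    return --pp->refs == 0;
}
```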
* * Rewrite vmm_lookupat with the new pte interface
* Adjust the memory layout such that the guest vms mount point is
shifted to just before the vms self-mount point. This removes the
effort of locating and skipping it during vmscpy
* Add an empty thread object as a placeholder, to prevent writes to
an undefined location when a context save/restore happens before the
threaded environment is initialized
* * Fix more issues related to the recent refactoring
1. introduce pte_mkhuge to mark a pte as a huge leaf; previously
this was done by setting the PS bit directly, which conflated it
with the different interpretation that bit has in a last-level
pte (see the sketch at the end of this entry)
2. fix the address incrementation in vmap
3. invalidate the tlb whenever we dynamically allocate a page.
* (LunaDBG) rewrite the vm probing, employing the latest pte
interface, and make it much more efficient by doing an actual page
walk rather than scanning linearly
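Regarding item 1: on x86, bit 7 of an entry means different things at
different levels, which is why a dedicated helper is safer than
setting the bit ad hoc. The bit positions below are architectural;
the helper shape (and the pte_t from the earlier snippets) is a
hypothetical sketch:

```c
#define X86_PDE_PS  (1u << 7)  /* directory level: entry is a huge leaf */
#define X86_PTE_PAT (1u << 7)  /* last level: selects the memory type   */

/* Only meaningful for a non-last-level entry; applying the same bit
 * to a 4KiB leaf would silently flip its PAT memory-type selection. */
static inline pte_t
pte_mkhuge(pte_t pte)
{
    return pte | X86_PDE_PS;
}
```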
* * Fix an issue where the bootstrap stack was so small that the
overflow corrupted adjacent kernel structures
* Add assertions in pmm to enforce better consistency and invariants
* The page fault handler is now aware of ptep faults and assigns
suitable permissions for level creation and page pre-allocation
* Ensure the mappings on dest_mnt are properly invalidated in the
TLB after we set up the vms to be copied to.
* (LunaDBG) Fix the ptep calculation at a specified level when
querying an individual pte
* * Rework the vms mounts: they now have a more unified interface,
removing the burden of passing vm_mnt on each function call. It
also allows us to track any dangling mount points (see the sketch
at the end of this entry)
* Fix an issue where dup_kernel_stack used the stack top as the
start address for copying, which caused the subsequent exec address
to be corrupted
* Fix ptep_step_out failing on non-VMS_SELF mount points
* Change the way assertion failures are reported: they are now
reported directly without going through another sys-trap, thus
preserving the stack around the failing point to ease our debugging
experience.
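A hedged sketch of what such a unified mount interface could look
like; the slot registry, window base and every name below are purely
illustrative of the two stated goals (no vm_mnt threaded through
every call, and detection of dangling mounts), reusing the ptr_t
typedef from the earlier snippets:

```c
#include <assert.h>
#include <stdbool.h>

#define MNT_WINDOW   (4UL << 20)    /* one 4MiB window per mount      */
#define MNT_BASE     0xff000000UL   /* hypothetical mount window area */
#define MAX_VMS_MNT  4

struct vms_mnt {
    ptr_t mnt;        /* where the foreign vms is mounted    */
    ptr_t vmroot_pa;  /* physical root of the mounted vms    */
    bool  busy;
};

static struct vms_mnt mnt_slots[MAX_VMS_MNT];

/* Grab a free mount window; callers keep the returned handle around
 * instead of passing a raw vm_mnt address to every helper. */
static struct vms_mnt*
vms_mount(ptr_t vmroot_pa)
{
    for (int i = 0; i < MAX_VMS_MNT; i++) {
        if (!mnt_slots[i].busy) {
            mnt_slots[i].mnt       = MNT_BASE + (ptr_t)i * MNT_WINDOW;
            mnt_slots[i].vmroot_pa = vmroot_pa;
            mnt_slots[i].busy      = true;
            return &mnt_slots[i];
        }
    }
    return NULL;    /* all mount windows in use */
}

static void
vms_unmount(struct vms_mnt* mnt)
{
    mnt->busy = false;
}

/* Any slot still busy at a checkpoint is a dangling mount. */
static void
vms_assert_no_dangling(void)
{
    for (int i = 0; i < MAX_VMS_MNT; i++) {
        assert(!mnt_slots[i].busy && "dangling vms mount");
    }
}
```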
* * ensure the tail pte check is performed regardless of the pte
value when doing a page table walk (e.g., vmsfree and vmscpy),
which previously was a bug
* fix the self-mount point being located incorrectly, which caused
the wrong one to be freed (vmsfree)
* ensure we unref the physical page only when the corresponding pte is
present (thus the pa is meaningful)
* add a flag in fault_context to indicate the mem-access privilege level
* address an issue where the stack-start ptep calculation was off by
one, causing a destroyed thread to accidentally free the adjacent
thread's kernel stack
* * Purge the old page.h
* * Refactor fault.c to remove unneeded things from the arch-dependent side.
* (LunaDBG) add utilities to interpret pte values and manipulate pteps
* * Add generic definitions for the arch-dependent pagetable
76 files changed: