feat: reset KVM_REG_ARM_PTIMER_CNT on VM boot #4987


Merged
7 changes: 7 additions & 0 deletions CHANGELOG.md
@@ -10,6 +10,13 @@ and this project adheres to

### Added

- [#4987](https://github.com/firecracker-microvm/firecracker/pull/4987): Reset
the physical counter register (`CNTPCT_EL0`) on VM startup. This prevents the
guest from reading the host's physical counter value. The reset is only
possible on 6.4 and newer kernels; on older kernels the physical counter will
still be passed to the guest unmodified. See more info
[here](https://github.com/firecracker-microvm/firecracker/blob/main/docs/prod-host-setup.md#arm-only-vm-physical-counter-behaviour).

### Changed

- [#4913](https://github.com/firecracker-microvm/firecracker/pull/4913): Removed
18 changes: 9 additions & 9 deletions docs/prod-host-setup.md
@@ -328,13 +328,16 @@ For vendor-specific recommendations, please consult the resources below:
- ARM:
[Speculative Processor Vulnerability](https://developer.arm.com/support/arm-security-updates/speculative-processor-vulnerability)

##### [ARM only] Physical counter directly passed through to the guest
##### [ARM only] VM Physical counter behaviour

On ARM, the physical counter (i.e `CNTPCT`) it is returning the
[actual EL1 physical counter value of the host][1]. From the discussions before
merging this change [upstream][2], this seems like a conscious design decision
of the ARM code contributors, giving precedence to performance over the ability
to trap and control this in the hypervisor.
On ARM, Firecracker tries to reset the `CNTPCT` physical counter on VM boot.
This is done in order to prevent the VM from reading the host's physical
counter value. Firecracker will only try to reset the counter if the host KVM
exposes the `KVM_CAP_COUNTER_OFFSET` capability. This capability is only
present in kernels containing
[this](https://lore.kernel.org/all/20230330174800.2677007-1-maz@kernel.org/)
patch series (i.e. kernels 6.4 and newer). For older kernels the counter value
will be passed through from the host unmodified.

##### Verification

@@ -428,6 +431,3 @@ To validate that the change took effect, the file
[^1]: Look for `GRUB_CMDLINE_LINUX` in file `/etc/default/grub` in RPM-based
systems, and
[this doc for Ubuntu](https://wiki.ubuntu.com/Kernel/KernelBootParameters).

[1]: https://elixir.free-electrons.com/linux/v4.14.203/source/virt/kvm/arm/hyp/timer-sr.c#L63
[2]: https://lists.cs.columbia.edu/pipermail/kvmarm/2017-January/023323.html
6 changes: 6 additions & 0 deletions src/vmm/src/arch/aarch64/regs.rs
@@ -99,6 +99,12 @@ arm64_sys_reg!(SYS_CNTV_CVAL_EL0, 3, 3, 14, 3, 2);
// https://elixir.bootlin.com/linux/v6.8/source/arch/arm64/include/asm/sysreg.h#L459
arm64_sys_reg!(SYS_CNTPCT_EL0, 3, 3, 14, 0, 1);

// Physical Timer EL0 Count Register
// The id of this register is the same as SYS_CNTPCT_EL0, but KVM defines it
// separately, so we do as well.
// https://elixir.bootlin.com/linux/v6.12.6/source/arch/arm64/include/uapi/asm/kvm.h#L259
arm64_sys_reg!(KVM_REG_ARM_PTIMER_CNT, 3, 3, 14, 0, 1);

// Translation Table Base Register
// https://developer.arm.com/documentation/ddi0595/2021-03/AArch64-Registers/TTBR1-EL1--Translation-Table-Base-Register-1--EL1-
arm64_sys_reg!(TTBR1_EL1, 3, 0, 2, 0, 1);
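The comment above notes that `KVM_REG_ARM_PTIMER_CNT` encodes to the same id as `SYS_CNTPCT_EL0`. A minimal sketch of how that id is plausibly assembled, following the field layout of the kernel's `ARM64_SYS_REG` uapi macro (the constants and the free function here are illustrative, not Firecracker's actual `arm64_sys_reg!` internals):

```rust
// Sketch of the ARM64 sysreg -> KVM register id encoding, per the field
// layout in the kernel uapi (arch/arm64/include/uapi/asm/kvm.h). The
// constants below follow that header; treat them as illustrative.
const KVM_REG_ARM64: u64 = 0x6000_0000_0000_0000;
const KVM_REG_SIZE_U64: u64 = 0x0030_0000_0000_0000;
const KVM_REG_ARM64_SYSREG: u64 = 0x0013 << 16;

fn arm64_sys_reg(op0: u64, op1: u64, crn: u64, crm: u64, op2: u64) -> u64 {
    KVM_REG_ARM64
        | KVM_REG_SIZE_U64
        | KVM_REG_ARM64_SYSREG
        | (op0 << 14) // Op0 field, bits [15:14]
        | (op1 << 11) // Op1 field, bits [13:11]
        | (crn << 7)  // CRn field, bits [10:7]
        | (crm << 3)  // CRm field, bits [6:3]
        | op2         // Op2 field, bits [2:0]
}

fn main() {
    // SYS_CNTPCT_EL0 and KVM_REG_ARM_PTIMER_CNT use the same encoding
    // (3, 3, 14, 0, 1), so they map to the same 64-bit register id.
    let cntpct = arm64_sys_reg(3, 3, 14, 0, 1);
    let ptimer_cnt = arm64_sys_reg(3, 3, 14, 0, 1);
    assert_eq!(cntpct, ptimer_cnt);
    println!("{:#x}", cntpct); // 0x603000000013df01
}
```

This is why KVM can define the two names separately while both ioctls ultimately address the same register.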
60 changes: 49 additions & 11 deletions src/vmm/src/arch/aarch64/vcpu.rs
@@ -13,6 +13,7 @@

use super::get_fdt_addr;
use super::regs::*;
use crate::vstate::kvm::OptionalCapabilities;
use crate::vstate::memory::GuestMemoryMmap;

/// Errors thrown while setting aarch64 registers.
@@ -78,6 +79,7 @@
cpu_id: u8,
boot_ip: u64,
mem: &GuestMemoryMmap,
optional_capabilities: &OptionalCapabilities,
) -> Result<(), VcpuError> {
let kreg_off = offset_of!(kvm_regs, regs);

@@ -106,6 +108,23 @@
vcpufd
.set_one_reg(id, &get_fdt_addr(mem).to_le_bytes())
.map_err(|err| VcpuError::SetOneReg(id, err))?;

// Reset the physical counter for the guest. This way we avoid the guest
// reading the host's physical counter value.
// Resetting KVM_REG_ARM_PTIMER_CNT for a single vCPU is enough, because there
// is only one timer struct with offsets per VM.
// Because access to KVM_REG_ARM_PTIMER_CNT is only present starting with the
// 6.4 kernel, we only perform the reset if KVM_CAP_COUNTER_OFFSET is present,
// as it was added in the same patch series as the ability to set the
// KVM_REG_ARM_PTIMER_CNT register. Patch series which introduced the needed
// changes:
// https://lore.kernel.org/all/20230330174800.2677007-1-maz@kernel.org/
// Note: the value observed by the guest will still be above 0, because there
// is a time delta between this reset and the first call to KVM_RUN.
if optional_capabilities.counter_offset {
vcpufd
.set_one_reg(KVM_REG_ARM_PTIMER_CNT, &[0; 8])
.map_err(|err| VcpuError::SetOneReg(id, err))?;

}
}
Ok(())
}
@@ -214,20 +233,21 @@
#[cfg(test)]
mod tests {
#![allow(clippy::undocumented_unsafe_blocks)]
use kvm_ioctls::Kvm;

use super::*;
use crate::arch::aarch64::layout;
use crate::test_utils::arch_mem;
use crate::vstate::kvm::Kvm;

#[test]
fn test_setup_regs() {
let kvm = Kvm::new().unwrap();
let vm = kvm.create_vm().unwrap();
let kvm = Kvm::new(vec![]).unwrap();
let vm = kvm.fd.create_vm().unwrap();
let vcpu = vm.create_vcpu(0).unwrap();
let mem = arch_mem(layout::FDT_MAX_SIZE + 0x1000);
let optional_capabilities = kvm.optional_capabilities();

let res = setup_boot_regs(&vcpu, 0, 0x0, &mem);
let res = setup_boot_regs(&vcpu, 0, 0x0, &mem, &optional_capabilities);
assert!(matches!(
res.unwrap_err(),
VcpuError::SetOneReg(0x6030000000100042, _)
@@ -237,13 +257,31 @@
vm.get_preferred_target(&mut kvi).unwrap();
vcpu.vcpu_init(&kvi).unwrap();

setup_boot_regs(&vcpu, 0, 0x0, &mem).unwrap();
setup_boot_regs(&vcpu, 0, 0x0, &mem, &optional_capabilities).unwrap();

// Check that the register is reset on compatible kernels.
// Because there is a time delta between the moment we reset the register and
// the moment we read it, we cannot compare it with 0. Instead we compare it
// with a meaningfully small value.
if optional_capabilities.counter_offset {
let mut reg_bytes = [0_u8; 8];
vcpu.get_one_reg(SYS_CNTPCT_EL0, &mut reg_bytes).unwrap();
let counter_value = u64::from_le_bytes(reg_bytes);

// We are reading SYS_CNTPCT_EL0 right after resetting it.
// If the reset happened successfully, the value should be quite small when we
// read it. If the reset did not happen, the value will be the same as on the
// host, which will surely be larger than `max_value`.
let max_value = 1000;

assert!(counter_value < max_value);
}
}

#[test]
fn test_read_mpidr() {
let kvm = Kvm::new().unwrap();
let vm = kvm.create_vm().unwrap();
let kvm = Kvm::new(vec![]).unwrap();
let vm = kvm.fd.create_vm().unwrap();
let vcpu = vm.create_vcpu(0).unwrap();
let mut kvi: kvm_bindings::kvm_vcpu_init = kvm_bindings::kvm_vcpu_init::default();
vm.get_preferred_target(&mut kvi).unwrap();
@@ -261,8 +299,8 @@

#[test]
fn test_get_set_regs() {
let kvm = Kvm::new().unwrap();
let vm = kvm.create_vm().unwrap();
let kvm = Kvm::new(vec![]).unwrap();
let vm = kvm.fd.create_vm().unwrap();
let vcpu = vm.create_vcpu(0).unwrap();
let mut kvi: kvm_bindings::kvm_vcpu_init = kvm_bindings::kvm_vcpu_init::default();
vm.get_preferred_target(&mut kvi).unwrap();
@@ -283,8 +321,8 @@
fn test_mpstate() {
use std::os::unix::io::AsRawFd;

let kvm = Kvm::new().unwrap();
let vm = kvm.create_vm().unwrap();
let kvm = Kvm::new(vec![]).unwrap();
let vm = kvm.fd.create_vm().unwrap();
let vcpu = vm.create_vcpu(0).unwrap();
let mut kvi: kvm_bindings::kvm_vcpu_init = kvm_bindings::kvm_vcpu_init::default();
vm.get_preferred_target(&mut kvi).unwrap();
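The `test_setup_regs` change above compares the counter against `max_value = 1000` ticks rather than 0. A back-of-envelope sketch of why that threshold cleanly separates "reset happened" from "host passthrough" (the 25 MHz frequency is illustrative — `CNTFRQ_EL0` varies by platform, up to 1 GHz on ARMv8.6+):

```rust
fn main() {
    // Illustrative ARM generic timer frequency; the real value comes from
    // CNTFRQ_EL0 and is platform-dependent. 25 MHz is a common older value.
    let freq_hz: u64 = 25_000_000;

    // The test's threshold of 1000 ticks corresponds to a tiny wall-clock
    // window between the reset and the read.
    let threshold_ticks: u64 = 1000;
    let threshold_us = threshold_ticks * 1_000_000 / freq_hz;
    assert_eq!(threshold_us, 40); // 1000 ticks @ 25 MHz = 40 microseconds

    // A host that booted only 10 seconds ago has already accumulated far
    // more ticks than the threshold, so a passthrough read cannot pass.
    let host_uptime_ticks = 10 * freq_hz; // 250 million ticks
    assert!(host_uptime_ticks > threshold_ticks * 100_000);
}
```

So even with scheduling jitter between `set_one_reg` and `get_one_reg`, a freshly reset counter stays orders of magnitude below any realistic host counter value.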
67 changes: 48 additions & 19 deletions src/vmm/src/builder.rs
@@ -68,6 +68,7 @@
use crate::vmm_config::boot_source::BootConfig;
use crate::vmm_config::instance_info::InstanceInfo;
use crate::vmm_config::machine_config::{MachineConfig, MachineConfigError};
use crate::vstate::kvm::Kvm;

use crate::vstate::memory::{GuestAddress, GuestMemory, GuestMemoryMmap};
use crate::vstate::vcpu::{Vcpu, VcpuConfig, VcpuError};
use crate::vstate::vm::Vm;
@@ -160,11 +161,17 @@
) -> Result<(Vmm, Vec<Vcpu>), StartMicrovmError> {
use self::StartMicrovmError::*;

let kvm = Kvm::new(kvm_capabilities)
.map_err(VmmError::Kvm)
.map_err(StartMicrovmError::Internal)?;
// Set up Kvm Vm and register memory regions.
// Build custom CPU config if a custom template is provided.
let mut vm = Vm::new(kvm_capabilities)
let mut vm = Vm::new(&kvm)
.map_err(VmmError::Vm)
.map_err(StartMicrovmError::Internal)?;
kvm.check_memory(&guest_memory)
.map_err(VmmError::Kvm)
.map_err(StartMicrovmError::Internal)?;
vm.memory_init(&guest_memory, track_dirty_pages)
.map_err(VmmError::Vm)
.map_err(StartMicrovmError::Internal)?;
@@ -186,7 +193,7 @@
#[cfg(target_arch = "x86_64")]
let (vcpus, pio_device_manager) = {
setup_interrupt_controller(&mut vm)?;
let vcpus = create_vcpus(&vm, vcpu_count, &vcpus_exit_evt).map_err(Internal)?;
let vcpus = create_vcpus(&kvm, &vm, vcpu_count, &vcpus_exit_evt).map_err(Internal)?;

// Make stdout non blocking.
set_stdout_nonblocking();
@@ -218,7 +225,7 @@
// Search for `kvm_arch_vcpu_create` in arch/arm/kvm/arm.c.
#[cfg(target_arch = "aarch64")]
let vcpus = {
let vcpus = create_vcpus(&vm, vcpu_count, &vcpus_exit_evt).map_err(Internal)?;
let vcpus = create_vcpus(&kvm, &vm, vcpu_count, &vcpus_exit_evt).map_err(Internal)?;
setup_interrupt_controller(&mut vm, vcpu_count)?;
vcpus
};
@@ -227,6 +234,7 @@
events_observer: Some(std::io::stdin()),
instance_info: instance_info.clone(),
shutdown_exit_code: None,
kvm,
vm,
guest_memory,
uffd,
@@ -476,7 +484,7 @@
uffd,
vm_resources.machine_config.track_dirty_pages,
vm_resources.machine_config.vcpu_count,
microvm_state.vm_state.kvm_cap_modifiers.clone(),
microvm_state.kvm_state.kvm_cap_modifiers.clone(),
)?;

#[cfg(target_arch = "x86_64")]
@@ -738,11 +746,16 @@
.map_err(VmmError::RegisterMMIODevice)
}

fn create_vcpus(vm: &Vm, vcpu_count: u8, exit_evt: &EventFd) -> Result<Vec<Vcpu>, VmmError> {
fn create_vcpus(
kvm: &Kvm,
vm: &Vm,
vcpu_count: u8,
exit_evt: &EventFd,
) -> Result<Vec<Vcpu>, VmmError> {
let mut vcpus = Vec::with_capacity(vcpu_count as usize);
for cpu_idx in 0..vcpu_count {
let exit_evt = exit_evt.try_clone().map_err(VmmError::EventFd)?;
let vcpu = Vcpu::new(cpu_idx, vm, exit_evt).map_err(VmmError::VcpuCreate)?;
let vcpu = Vcpu::new(cpu_idx, vm, kvm, exit_evt).map_err(VmmError::VcpuCreate)?;
vcpus.push(vcpu);
}
Ok(vcpus)
@@ -765,7 +778,7 @@
#[cfg(target_arch = "x86_64")]
let cpu_config = {
use crate::cpu_config::x86_64::cpuid;
let cpuid = cpuid::Cpuid::try_from(vmm.vm.supported_cpuid().clone())
let cpuid = cpuid::Cpuid::try_from(vmm.kvm.supported_cpuid.clone())
.map_err(GuestConfigError::CpuidFromKvmCpuid)?;
let msrs = vcpus[0]
.kvm_vcpu
@@ -801,16 +814,16 @@
cpu_config,
};

// Configure vCPUs with normalizing and setting the generated CPU configuration.
for vcpu in vcpus.iter_mut() {
vcpu.kvm_vcpu
.configure(vmm.guest_memory(), entry_addr, &vcpu_config)
.map_err(VmmError::VcpuConfigure)
.map_err(Internal)?;
}

#[cfg(target_arch = "x86_64")]
{
// Configure vCPUs with normalizing and setting the generated CPU configuration.
for vcpu in vcpus.iter_mut() {
vcpu.kvm_vcpu
.configure(vmm.guest_memory(), entry_addr, &vcpu_config)
.map_err(VmmError::VcpuConfigure)
.map_err(Internal)?;
}

// Write the kernel command line to guest memory. This is x86_64 specific, since on
// aarch64 the command line will be specified through the FDT.
let cmdline_size = boot_cmdline
@@ -845,6 +858,19 @@
}
#[cfg(target_arch = "aarch64")]
{
let optional_capabilities = vmm.kvm.optional_capabilities();
// Configure vCPUs with normalizing and setting the generated CPU configuration.
for vcpu in vcpus.iter_mut() {
vcpu.kvm_vcpu
.configure(
vmm.guest_memory(),
entry_addr,
&vcpu_config,
&optional_capabilities,
)
.map_err(VmmError::VcpuConfigure)
.map_err(Internal)?;
}
let vcpu_mpidr = vcpus
.iter_mut()
.map(|cpu| cpu.kvm_vcpu.get_mpidr())
@@ -1111,7 +1137,8 @@
.map_err(StartMicrovmError::Internal)
.unwrap();

let mut vm = Vm::new(vec![]).unwrap();
let kvm = Kvm::new(vec![]).unwrap();
let mut vm = Vm::new(&kvm).unwrap();
vm.memory_init(&guest_memory, false).unwrap();
let mmio_device_manager = MMIODeviceManager::new();
let acpi_device_manager = ACPIDeviceManager::new();
@@ -1137,14 +1164,15 @@
#[cfg(target_arch = "aarch64")]
{
let exit_evt = EventFd::new(libc::EFD_NONBLOCK).unwrap();
let _vcpu = Vcpu::new(1, &vm, exit_evt).unwrap();
let _vcpu = Vcpu::new(1, &vm, &kvm, exit_evt).unwrap();
setup_interrupt_controller(&mut vm, 1).unwrap();
}

Vmm {
events_observer: Some(std::io::stdin()),
instance_info: InstanceInfo::default(),
shutdown_exit_code: None,
kvm,
vm,
guest_memory,
uffd: None,
@@ -1362,15 +1390,16 @@
let vcpu_count = 2;
let guest_memory = arch_mem(128 << 20);

let kvm = Kvm::new(vec![]).expect("Cannot create Kvm");
#[allow(unused_mut)]
let mut vm = Vm::new(vec![]).unwrap();
let mut vm = Vm::new(&kvm).unwrap();
vm.memory_init(&guest_memory, false).unwrap();
let evfd = EventFd::new(libc::EFD_NONBLOCK).unwrap();

#[cfg(target_arch = "x86_64")]
setup_interrupt_controller(&mut vm).unwrap();

let vcpu_vec = create_vcpus(&vm, vcpu_count, &evfd).unwrap();
let vcpu_vec = create_vcpus(&kvm, &vm, vcpu_count, &evfd).unwrap();
assert_eq!(vcpu_vec.len(), vcpu_count as usize);
}

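The builder.rs changes above split the previously Vm-owned KVM handle into a standalone `Kvm` that `Vm::new` borrows, with both ending up owned by the `Vmm` and `create_vcpus` taking references to each. A minimal sketch of that ownership shape with stub types (the names mirror the diff, but the fields and bodies are invented for illustration):

```rust
#![allow(dead_code)]

// Stub types sketching the post-refactor ownership: `Kvm` is created first,
// `Vm` is constructed from a borrow of it, and both are moved into the `Vmm`.
// All fields and bodies here are illustrative, not Firecracker's real ones.
struct Kvm { kvm_cap_modifiers: Vec<u32> }
struct Vm;
struct Vcpu { index: u8 }
struct Vmm { kvm: Kvm, vm: Vm, vcpus: Vec<Vcpu> }

impl Kvm {
    fn new(kvm_cap_modifiers: Vec<u32>) -> Self {
        Kvm { kvm_cap_modifiers }
    }
}

impl Vm {
    // `Vm::new` now takes `&Kvm` instead of owning KVM state itself.
    fn new(_kvm: &Kvm) -> Self { Vm }
}

// Mirrors the new `create_vcpus(kvm, vm, ...)` signature: both handles are
// borrowed, so vCPU creation can consult KVM capabilities per-VM.
fn create_vcpus(_kvm: &Kvm, _vm: &Vm, vcpu_count: u8) -> Vec<Vcpu> {
    (0..vcpu_count).map(|index| Vcpu { index }).collect()
}

fn main() {
    let kvm = Kvm::new(vec![]);
    let vm = Vm::new(&kvm);
    let vcpus = create_vcpus(&kvm, &vm, 2);
    // The Vmm takes ownership of both handles once setup is complete.
    let vmm = Vmm { kvm, vm, vcpus };
    assert_eq!(vmm.vcpus.len(), 2);
}
```

The borrow-then-move pattern is what lets aarch64 vCPU configuration query `kvm.optional_capabilities()` (e.g. for `KVM_CAP_COUNTER_OFFSET`) without threading capability flags through the `Vm`.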
7 changes: 2 additions & 5 deletions src/vmm/src/device_manager/legacy.rs
@@ -244,14 +244,11 @@ impl PortIODeviceManager
#[cfg(test)]
mod tests {
use super::*;
use crate::test_utils::single_region_mem;
use crate::Vm;
use crate::vstate::vm::tests::setup_vm_with_memory;

#[test]
fn test_register_legacy_devices() {
let guest_mem = single_region_mem(0x1000);
let mut vm = Vm::new(vec![]).unwrap();
vm.memory_init(&guest_mem, false).unwrap();
let (_, mut vm, _) = setup_vm_with_memory(0x1000);
crate::builder::setup_interrupt_controller(&mut vm).unwrap();
let mut ldm = PortIODeviceManager::new(
Arc::new(Mutex::new(BusDevice::Serial(SerialDevice {