NFSroot with overlayfs on Debian 9 with Systemd

Intro

If you've reached this page, you're probably experimenting with odd ways to boot Linux. I've arrived at this stage via several previous attempts over the years, where I'd boot smallish flash drives read-only or read-mostly, mount dedicated tmpfs or other ramdisks on /var and the like... only to find out that browsers tend to save things in .cache etc. It's really quite a PITA to hunt down all the places where software may want to write.
This is where I started to appreciate the lewd beauty of "union filesystems" that can help create a thin "ramdisk for deltas", an RW layer in RAM cloaking a read-only mounted backend device - be it a fragile flash drive, or e.g. an NFS server export shared by many diskless clients.

For years, the popular FS to achieve this effect has been the AUFS. There's a nice howto, including a simple "RW layer insertion script".
That howto was my major source of inspiration; I have however re-shaped the approach to use overlayfs.
The resulting script, called overlay.sh, is attached.

Not everything is scripted. Debootstrap does its job, then the second stage installs some .debs in a chroot, but after that, you need to create the initial RAMdisk for NFSroot, based on a slightly modified configuration. The overlay.sh gets copied to its place within $NFSROOT by the debootstrap-to-NFS-chroot.sh script, but then you're dropped into a shell chrooted into $NFSROOT, where you may want to install your own kernel, and you do need to edit some config files by hand and update the initial RAMdisk.

There are a couple pitfalls that may be specific to Debian 9 and that you may want to be aware of.

Let's get our hands dirty.

Preparations

If this is the first time you're playing with debootstrap and a "Linux distro in a chroot", note that you should think of a nice name (and a disk partition to place it on) for the subdirectory where your NFSroot will live. You can create several such subdirs, to be used as several different "diskless distro roots" for your diskless machines...
Similarly to virtualization and guest disk images, you can have e.g. a production NFSroot and a handful of "testbeds" (alternative NFSroot subdirs) where you try things and that you can scrap if things don't work out very well.

Get yourself familiar with dhcpd, tftpd and pxelinux.
There are some nice howtos in the interwebs.
Try to pick up the principles. You should be able to get a vmlinuz binary and an initrd image, copy them to your /tftpboot by hand, and refer to them in pxelinux.cfg/default. The way I go about it, I install the desired kernel (from the distro repo or my own) into $NFSROOT/boot, create my custom initrd image in $NFSROOT/boot as well, and then copy both by hand (or using a trivial script - see the sketch right below) to their place under /tftpboot. We'll get to that below.
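
For illustration, a minimal sketch of such a copy script - the paths, file names and kernel version are placeholders that you'd adapt to your own layout:

#!/bin/sh
# Hypothetical helper: copy the kernel + initrd from the NFSroot tree
# to the TFTP directory referenced in pxelinux.cfg/default.
NFSROOT=/var/NFSboot/deb9_x86_64
TFTPDIR=/tftpboot/NFSroot
KVER=4.9.0-9-amd64

cp -v $NFSROOT/boot/vmlinuz-$KVER    $TFTPDIR/deb9amd64.kernel
cp -v $NFSROOT/boot/initrd.img-$KVER $TFTPDIR/deb9amd64.initrd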

In the file offered for download, maybe skim the README, and look at the variables defined at the top of the script called debootstrap-to-NFS-chroot.sh .
This is where you configure your paths, especially $DESTDIR - which is what I tend to call $NFSROOT in this text.

Also take a glance into the directory debcfg-nfsroot/, especially the script called 2nd-stage.sh that gets executed in a chroot, after debootstrap has installed the base system.

Install Debian using debootstrap

Debootstrap is Debian's own tool for installing a Debian environment into a subdirectory. That subdirectory can either be a temporary mountpoint for some disk partition that will become the future "/" of the distro being installed, or just a regular directory on some disk volume, which you advertise via NFS by listing it in /etc/exports.
The nfsroot= argument at the diskless machines (in the kernel cmdline in pxelinux) refers to that export on the NFS server.
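
To make the pieces concrete, here's a hedged sketch of the raw ingredients that the wrapper script automates - the suite, paths and export options are illustrative only:

# roughly what the wrapper boils down to (simplified):
debootstrap --arch=amd64 stretch /var/NFSboot/deb9_x86_64 http://deb.debian.org/debian

# ...and the corresponding export on the NFS server, in /etc/exports
# ("rw" here is what later allows the on-demand RW mounting discussed below):
/var/NFSboot/deb9_x86_64  192.168.192.0/24(rw,no_root_squash,async,no_subtree_check)

# then reload the exports:
exportfs -ra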

The script called debootstrap-to-NFS-chroot.sh is my own wrapper around the raw "debootstrap" that tries to add some marginal candy. Once the base Debian system is installed, this script of mine also calls a "second stage" script that adds a few extra packages and offers a chrooted shell for manual fine-tuning (such as creating a custom initial RAMdisk).
There's also a second script that allows you to just chroot into the jail, whenever you need to, without having to manually bind-mount (and then umount!) the several system directories.

So you configure the debootstrap-to-NFS-chroot.sh, you run the script, and you hold your breath...
The debootstrap called by the script will take a couple minutes (at least) to install Debian into your $NFSROOT directory.

When debootstrap is finished installing the base Debian system, the script chroots into the $NFSROOT target directory and calls a script called "2nd stage", which does a few further steps already in the chroot environment.

The 2nd stage ends by starting an interactive shell, still chrooted into the $NFSROOT directory - allowing you to continue with further modifications that may be needed in the target system. You will probably want to modify the initial RAMdisk - either right now, or you will need to chroot again later to finish the initrd (possibly several times, if you need to experiment with your own mods).

If you exit the interactive shell at the end of the 2nd stage, thereby exiting the chroot, the "script around debootstrap" tries to unmount the bind-mounted special directories that were needed in the chroot. (Thanks to Jenda & k3dAR for the hint that /dev/pts needs dedicated handling.)
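
For reference, a minimal sketch of what such a chroot helper amounts to - the actual script in the tarball may differ in details:

#!/bin/sh
# Hypothetical chroot helper: bind-mount the special filesystems,
# enter the jail, and clean up again on exit.
NFSROOT=/var/NFSboot/deb9_x86_64

mount -t proc  proc  $NFSROOT/proc
mount -t sysfs sysfs $NFSROOT/sys
mount --bind /dev     $NFSROOT/dev
mount --bind /dev/pts $NFSROOT/dev/pts   # the /dev/pts that needs dedicated handling

chroot $NFSROOT /bin/bash

umount $NFSROOT/dev/pts $NFSROOT/dev $NFSROOT/sys $NFSROOT/proc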

Modify the initial RAMdisk

For NFSroot, you need to modify $NFSROOT/etc/initramfs-tools/initramfs.conf thusly:

MODULES=netboot
DEVICE=
NFSROOT=auto
BOOT=nfs

Especially the BOOT= variable is missing in the stock initramfs.conf, and it does matter. Without it, the NFS modules do not get loaded when starting from initrd, and the kernel complains about "no root volume".

Speaking of modules, you may be wondering what to add into $NFSROOT/etc/initramfs-tools/modules . And the answer is: nothing particular. If you have modified the initramfs.conf in the aforementioned way, theoretically you do not need to specify any extra modules. Unless of course, you have some further needs. See the chapter on overlayfs for a good example.
For the moment, suffice to say that with the setup so far, your resulting initrd should get you booted via NFS.

And, speaking of initramfs, you need to do the following, while still chrooted into the $NFSROOT directory:
update-initramfs -u -k <your_kernel_version>

PXElinux config entry

To boot your kernel and your new initrd, copy them to someplace for tftpd to be able to reach them = someplace under /tftpboot. (Alternatively, you could possibly give permissions to tftpd to reach into your $NFSROOT/boot/ , to save yourself some copying.)

In pxelinux.cfg/default, you can start with an entry along these lines:

label deb9
  kernel NFSroot/deb9amd64.kernel
  append initrd=NFSroot/deb9amd64.initrd nfsroot=192.168.192.168:/var/NFSboot/deb9_x86_64 ip=dhcp ro

Yes, there's more, but the entry above should earn you a basic booting system, even if maybe with an odd error message or two.

In further chapters of this text, maybe keep in mind that the kernel cmdline args such as ro, rw, ip= and nfsroot= are intended for the user space scripts running inside initrd, rather than for the kernel itself :-) Not sure if maybe historically some of this was handled inside the kernel...

Overlayfs

If you have a flock of diskless clients, you probably want to boot them all off a shared NFSroot. In that scenario, you want to prevent the clients from racing for exclusive access to some files - lock files and many others. At which point you'll probably want to mount the root "read only".

Now... having a read-only filesystem is not particularly useful if you want to get some work done. Yes, of course you can mount individual network file systems in RW mode for your user data, but even that may not be enough. Some software just doesn't show all of its potential, or doesn't start at all, if the root volume is read-only.
That problem is solved by a filesystem layer that presents the host OS with a RW-mountable root volume, on top of a RO-mounted actual NFS volume - any differences are kept in RAM.

One such filesystem is the Overlayfs (of the unionfs family). In the interwebs you can find example setups using another such FS, called the AUFS - which has existed and flourished "out of tree" for a long time. The current state of affairs is that Overlayfs has made it into the vanilla kernel, and AUFS has not...
There are differences between AUFS and Overlayfs, but being a lazy person, the most important one for me is that Overlayfs lives in the upstream kernel source tree and is maintained along with it, which makes compilation against an ad hoc kernel version somewhat easier - especially now that loading an out-of-tree module (once a normal practice) taints the kernel.

In the tarball that you can download, there's a script called overlay.sh .
The root mountpoint manipulation / overlay insertion is just a couple lines, but the dance that they amount to is a little arcane => if you try to study what's going on, keep the docs on overlayfs (and possibly aufs) at hand. Let me leave a further explanation of how Overlayfs works as your homework.
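
Just to give a flavour of that dance, here is a minimal sketch of the kind of thing overlay.sh does in init-bottom - the mount point names (/rom, /rw) are my own placeholders, and the real script differs in details (error handling, plus the update-initramfs workaround mentioned below):

# At this point in init-bottom, the NFS root is mounted read-only at ${rootmnt}.
mkdir -p /rom /rw
mount -t tmpfs tmpfs /rw                  # the RAM-backed RW layer
mkdir -p /rw/upper /rw/work

mount -n -o move ${rootmnt} /rom          # park the RO NFS mount aside
mount -t overlay -o lowerdir=/rom,upperdir=/rw/upper,workdir=/rw/work overlay ${rootmnt}

# keep the layers reachable from the final root (handy for debugging)
mkdir -p ${rootmnt}/rom ${rootmnt}/rw
mount -n -o move /rom ${rootmnt}/rom
mount -n -o move /rw  ${rootmnt}/rw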

Copy the overlay.sh into $NFSROOT/etc/initramfs-tools/scripts/init-bottom , add a new line saying just "overlay" (= the kernel driver for the FS) into $NFSROOT/etc/initramfs-tools/modules , and update your initrd image.
Note that, for some reason, the overlay.sh script also gets run during update-initramfs, and has to work around that. (Read the contents of the script if interested in the details - it's no rocket science. That's the way it is in Debian 9; you may want to test for this weird feature in future Debian releases, just in case the workaround becomes moot.)

The true purpose of the script is to run early at boot inside the initrd, just after the NFSroot gets mounted at a temporary directory, before pivot_root.

The script allows you to run an interactive shell before or after the overlayfs sorcery. Look for commented hints inside the script. This allows you to investigate the initrd just as it executes, in vivo (paused where your shell got started). To continue the boot process, exit the shell.

The script also allows you to avoid the overlay insertion, and instead mount the NFSroot RW. Just append the two letters rw to the kernel command line in your boot loader.
If you do so, it's up to you to make sure that only one diskless client runs with the root mounted in RW mode, and note that any changes to the $NFSROOT on the server may wreak havoc on any diskless clients running simultaneously in RO mode (+overlayfs). The typical error message is "Stale file handle", and the only way out is a reboot of the affected client.
Mounting the NFSroot "read-write" is obviously useful for maintaining the diskless-booting Debian installation ($NFSROOT on the server), as an alternative to chrooting into the directory on the server (which appears to have runtime side effects on the server OS). Working on the NFSroot in RW mode is also a good way to test your NFS and misc sysctl performance tuning ;-)
To allow for this "on demand RW NFSroot", you definitely need to keep the NFS volume RW in /etc/exports on the server, and probably/maybe also in $NFSROOT/etc/fstab . Don't worry (very much) about the diskless clients - as long as you have "ro" as a default in the bootloader config (kernel cmdline), the NFSroot will get mounted RO on the clients. Whether this is enough in terms of idiot-proof safety and hacker-proof security is up to your judgement: it depends on how apt and fiddly your colleagues are, and how much is at stake if your $NFSROOT on the server gets tampered with - just your own work, some logon credentials, further sensitive data? (Consider making a backup if appropriate.)

Feel free to modify the overlay.sh script, or add your own scripts in init-bottom, if during the systemd-managed boot stages it's too late. Once the overlay gets mounted, even before pivot_root, you can already modify files on top of the read-only NFSroot export.

Further tips and pitfalls

Get rid of /etc/network/interfaces

Remove the classic file $NFSROOT/etc/network/interfaces . It appears to stand in the way of systemd-ese PnP networking, namely auto-detection of the particular NIC that happens to be plugged in at boot in a diskless system.
I've also noticed some problem with systemd-timesyncd which miraculously vanished after I removed that legacy file.
And, the legacy file appears to cause a "dependency-based hang" on some occasions at shutdown - some service that needs networking in order to shut down finds the interfaces already down, and systemd keeps waiting in vain for the dependency to get satisfied :-)

How to unpack the initrd

While you're messing with the initrd, you may be interested to get to know what it looks like on the inside - to see if your changes got applied, and generally to understand how things work under the hood.

To unpack the contents of an initrd, on recent distros you cannot just use "mc" the way it used to work in the past.
You may appreciate the following script:

#!/bin/sh

# As a first argument, supply the pathname to your initrd image.
# It will get unpacked into your current working directory.
# The initrd image will remain undamaged in its place,
# it won't vanish during the unpacking.

# The first cpio eats the uncompressed "early" archive (microcode etc.),
# the second one unpacks the gzip-compressed main archive.
(cpio -id; zcat | cpio -id) < "$1"

# source of this wisdom:
# https://unix.stackexchange.com/questions/163346/why-is-it-that-my-initrd-only-has-one-directory-namely-kernel
# Look for the answer by woolpool for a clear explanation.

Serial console

The boot process runs pretty fast - it can be over in a couple seconds. And some lines don't even make it into the system logs after the boot is finished; they can only be observed on the console.
To capture this, you may appreciate the following kernel cmdline curse:

console=ttyS0,115200n8

This means: redirect your console to the first serial port (/dev/ttyS0, aka COM1 in Windows), use the well-known standard rate of 115200 bps (the maximum supported by the legacy UART 16C550A), use 8 data bits and no parity. There's not a word about flow control - you can try with HW flow (RTS+CTS) or without flow control.

On the diskless client that you'd like to debug this way, obviously you need a physical COM port, probably at a legacy address (or maybe on PCI/PCI-e, with a driver compiled in monolithically). Legacy ISA/LPC UARTs are the safest bet. USB UARTs probably won't work, because the USB tree is among the last devices to get initialized, possibly well after the initrd is over and while the systemd boot is underway. Obviously you also need a serial port and some terminal emulator software on some other PC that will serve as the terminal in this exercise. In Windows, try PuTTY. In Linux, try Minicom - not sure if there's a better alternative in Linux to get a "native Linux" terminal emulation on a serial line.

If you want to see the console messages on the VGA console as well, try adding console=tty0 . You should get the output on both the VGA and the serial port.
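
For example, extending the pxelinux entry shown earlier (the last console= listed is the one that /dev/console ends up pointing at):

  append initrd=NFSroot/deb9amd64.initrd nfsroot=192.168.192.168:/var/NFSboot/deb9_x86_64 ip=dhcp ro console=tty0 console=ttyS0,115200n8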

If you run an interactive shell in the initrd, and you have just the serial console, the shell will appear on the serial console (and will accept keyboard input).

Whether or not you get a logon prompt on the serial console is up to the configuration of systemd (its serial-getty@.service template). Apparently, if you manually add a serial console via the bootloader, systemd will automatically start a getty on that serial port for you. (Otherwise not, as there's no guarantee that the serial port is free.)
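
If you'd rather not rely on that automatism, the generic systemd way (a hedged note - this is not specific to the scripts here) is to enable the serial getty explicitly, inside the chroot:

systemctl enable serial-getty@ttyS0.service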

Most bootloaders also allow you to use the serial console to access the boot loader itself. See the bootloader-specific documentation.

Some BIOSes even offer BIOS console redirection, including the setup screens, and that might also cater for DOS (for the parts of DOS and programs that use BIOS console services for text output). This support in the BIOS probably won't collide with using that same serial port for a bootloader and the Linux console, as those later software components do not use the BIOS services, and the serial UART hardware is stupid enough that it doesn't matter who's talking to it from the host CPU at any given moment. The BIOS is put aside while the full-fledged OS is running.

Network and NFS tuning

Probably the most significant tweak, within a LAN:

echo "net.ipv4.tcp_slow_start_after_idle=0" >> /etc/sysctl.d/99-sysctl.conf
echo "net.ipv4.tcp_slow_start_after_idle=0" >> $NFSROOT/etc/sysctl.d/99-sysctl.conf

For the NFSroot, the "noatime" mount option makes sense - although maybe not much if combined with "ro". I tend to have "rw" in the fstab, for on-demand RW mounting. You can then janitor the $NFSROOT system from a diskless machine, rather than in a chroot on the server - which may turn out to be a cleaner option. Anyway - for the RW operation, "noatime" is definitely a good idea. Saves seeks on the spinning rust in the server's storage back-end = speeds up your interactive work a lot.
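
For illustration, the root line in $NFSROOT/etc/fstab could look something like this (server IP and path taken from the pxelinux example above; the exact NFS options are a matter of taste):

192.168.192.168:/var/NFSboot/deb9_x86_64  /  nfs  rw,noatime  0  0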

Otherwise the distro or kernel defaults for TCP and NFS nowadays seem pretty reasonable for 1Gb Ethernet on a modern LAN. For 10Gb, you will need to flex your administrator muscle a bit more... There are nice tutorials throughout the interwebs. Out of scope here.

On the server, you may want to tweak the VM and block layer for maximum writeback throughput... See the attached script - feel free to put it in your /etc/rc.local. Not exactly related to the read-only NFSroot diskless clients, but it will show if the clients also mount some volumes in RW mode, be it via NFS or SAMBA or iSCSI or whatever (local disks, e.g. for OS deployment work, making data backups etc).
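
In case the attached script isn't at hand, here's a hedged sketch of the sort of writeback tuning meant here - the values are illustrative, not taken from that script, and they deserve benchmarking on your own hardware:

#!/bin/sh
# Let dirty pages accumulate in RAM and flush them in bigger batches.
echo 40   > /proc/sys/vm/dirty_ratio
echo 10   > /proc/sys/vm/dirty_background_ratio
echo 3000 > /proc/sys/vm/dirty_expire_centisecs

# Per-disk block layer knobs on the server's storage back-end:
for Q in /sys/block/sd*/queue; do
        echo 1024 > $Q/nr_requests
        echo 4096 > $Q/read_ahead_kb
done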

The DHCP bug in klibc/ipconfig (initrd)

The initrd in Debian 9 still suffers from a decade-old bug in ipconfig, a companion utility of klibc.

 aptitude show klibc-utils
The version of klibc in Debian 9 is 2.0.4-9, the bug got fixed in 2.0.4-10...
https://askubuntu.com/questions/1043810/pxe-boot-fails-with-ip-config-no-response-giving-up
https://bugs.launchpad.net/ubuntu/+source/klibc/+bug/1327412
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=756633
https://unix.stackexchange.com/questions/520261/make-headers-install-not-working-as-expected

The bug presents itself like this, during boot:

IP-Config: eth0 hardware address e0:db:55:0c:34:7e mtu 1500 DHCP
IP-Config: eth1 hardware address e0:db:55:0c:34:80 mtu 1500 DHCP
IP-Config: no response after 2 secs - giving up
IP-Config: eth0 hardware address e0:db:55:0c:34:7e mtu 1500 DHCP
IP-Config: eth1 hardware address e0:db:55:0c:34:80 mtu 1500 DHCP
IP-Config: no response after 3 secs - giving up

...ad infinitum.

Reportedly, some DHCP servers are more tolerant of the bug than others. In my case, using the isc-dhcpd in Debian 9, the bug triggers on maybe 20-100% of boot attempts, apparently depending on how many network interfaces the client box has and how many of them are connected while trying to boot... Once the klibc-ipconfig bug is fixed, any single interface is enough to PXE-boot Linux.

How to rebuild the ipconfig klibc-util from source:

cd /usr/src/
git clone git://git.kernel.org/pub/scm/libs/klibc/klibc.git
cd klibc
make 
# mind the error message, and the advice it gives you.
# The advice doesn't quite work - not verbatim, but combined with 
# https://unix.stackexchange.com/questions/520261/make-headers-install-not-working-as-expected
# you should get something like
make ARCH=x86_64 O=. -C /usr/src/linux-5.2.7/ headers_install INSTALL_HDR_PATH=/usr/src/klibc/linux
# (adjust /usr/src/linux-5.2.7/ to whatever kernel source tree you have at hand)
make
# If this compiled, feel free to check out the contents of usr/kinit/ipconfig/*
# Copy the static binary to its strategically appropriate intermediate destination:
cp ./usr/kinit/ipconfig/static/ipconfig $NFSROOT/usr/lib/klibc/bin/
# [chroot into $NFSROOT]
update-initramfs -u -k <your_kernel_version>
^D
# [maybe check by unpacking the new initrd, that the fresh ipconfig is in place]
cp $NFSROOT/boot/initrd.img-<your_kernel_version> <your_tftpboot_directory_and_path>

(It's actually not horrid at all - the procedure just looks a little intimidating at first...)

I probably should've just started with Debian 10 (Buster) right away :-)

How to set up a soft bridge early within initrd

By now, you are probably asking yourself:
just why would you want a soft bridge, in a diskless client for christ sake? For sport or what? Are you *nuts* ?

I actually have a plausible reason.
In my scenario, a key motivation to play with diskless-booted Linux is maintenance of on-disk OS images: backup / restore / cloning / "deployment" as they say nowadays. Cloning sounds so much like theft, which is definitely not what we practice - "deployment" sounds like some politically correct newspeak, but actually matches our reality of preloading legal, licensed OS images onto the PC hardware that we sell for industrial process control.
Industrial/embedded PCs come in many form factors, some of which don't have the boot drive easily accessible/removable. The BIOS/UEFI "firmware" (this *is* ugly newspeak) in different models features varying capability to boot via the "legacy BIOS" method or to PXE-boot via legacy BIOS or UEFI, outright bugs in that area, etc. It is often helpful to boot Linux just for its inherent hardware debugging capabilities. And if I can run a virtual environment, with the physical HDD passed through into the guest VM, it allows me e.g. to boot some OS deployment environment via the legacy BIOS method: old DOS (unsupported on modern PC HW), or older Windows PE-based bootable images (not compatible with UEFI boot either).

So in that context = where the diskless-booted Linux is used as a VM host (hypervisor), you need some way to give your VM guests access to your LAN. By default, the HV/emulator tool sets up a virtual LAN segment between the host and the guest, with no access to the upstream LAN. And you could of course route that virtual LAN to your physical LAN by an L3 hop. Maybe insert a NAT/masquerade. Assign IP addresses manually or run a DHCP server for the tiny virtual LAN.
But: isn't that kind of ugly?
Wouldn't it be neat, if your virtual guest VM's would have direct access to your physical LAN? To function in your LAN just as seamlessly as the physical machines, that happen to be lucky enough to be able to boot your tools directly? Especially, boot your deployment tools via PXE from the DHCP server in your LAN.
That's right: it *is* possible, and all you need is a soft bridge inside the diskless-booted Linux.

Alright, so just prepend a few cursewords such as brctl and ifconfig to the script that launches QEMU, and we're ready to go, right?
Well... not so fast.

Notice that long before your user space is fully booted (which is where QEMU can get started), your diskless system has already mounted its root via NFS (and added some arcane overlayfs trickery on top) - and before that, the initrd had already asked for an IP address via DHCP.
What do you think will happen, if you start messing with the netdevices underneath NFS, underneath overlayfs?

That's right - you need to create the bridge before attempting to mount NFS. Mounting NFS has to happen with the netdevice of choice = br0 already up and running, after the machine has obtained an IP address from your DHCP server via br0. Which means that you need to insert the setup of br0 into the early stages of the initrd boot, actually before systemd gets started.

In other words: how do we hook this into initramfs-tools? Well, it turns out that perhaps the most convenient place to insert the bridge setup is an initrd function called configure_networking(), which lives in a file called

$NFSROOT/usr/share/initramfs-tools/scripts/functions

Within that function, you can find several pre-existing goodies. You can learn what physical network interface to use for the diskless boot (especially if you have a hint available from pxelinux or ipxe = from your bootloader).
It is a good idea to stick to the MAC address that your "PXE" bootROM has used for DHCP previously, to get the same IP address assigned (and thus avoid wasting two IP addresses on just the bootloader and the Linux on the same machine). To make that stick, you will probably need to instruct your DHCP server to ignore the GUID/UUID, which is otherwise the primary identifier based on which IP addresses are handed out (rather than the raw MAC address). In Linux on the client, you probably have no way of knowing what GUID/UUID your BIOS PXE option ROM has used when asking for DHCP. Here is an example of how to configure the ISC DHCPd to ignore the GUID/UUID:
authoritative;
shared-network MYNETWORK {
   subnet 192.168.100.0 netmask 255.255.255.0 {
      default-lease-time 1200;
      max-lease-time 86400;
      range 192.168.100.100 192.168.100.254;
      option routers 192.168.100.1;

      if substring (option vendor-class-identifier, 0, 9) = "PXEClient" {
         # specify your TFTP boot server here:
         next-server 192.168.100.2;

         ####### now get the UUID/GUID out of the way: #######
         ignore-client-uids true;

         ...etc
      }
   }
}

But back to the point = to the client booting diskless over LAN, and how to insert the bridge in your initrd.
In the tarball for download, you'll find the relevant script snippets under the initrd-bridge subdirectory.
Apart from patching the configure_networking() function, you need to add a hook that will copy the brctl binary into your initrd, and add the "bridge" module to /etc/initramfs-tools/modules. Again, details are in the tarball (and a rough sketch of the patch follows below).
This bridge-related stuff is not included in the debootstrap-to-NFS-chroot script (not even in the 2nd-stage) - because for typical diskless use, you do not need the bridge.
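
To give you an idea, here's a minimal sketch of the kind of addition to configure_networking() that's meant here - it is not the exact snippet from the tarball, and it relies on the $DEVICE variable used by the stock initramfs-tools function:

# inside configure_networking(), before the DHCP request goes out
# (brctl and the "ip" utility must be present in the initrd - that's what the hook is for):
modprobe bridge
brctl addbr br0
brctl addif br0 $DEVICE        # the NIC chosen for netboot (e.g. taken from BOOTIF=)
ip link set dev $DEVICE up
ip link set dev br0 up         # br0 inherits the NIC's MAC, so DHCP hands out the same lease
DEVICE=br0                     # let the rest of the function run DHCP over br0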

Speaking of QEMU

Note that an instance of QEMU (your VM guest) will need an IP address too, and will ask for it via DHCP (that's why you've set up the bridge in the first place), and will need to use a different MAC address from the host PC, and get a different IP address. This probably cannot be avoided in a reasonable way - you simply do want the VM host (diskless Linux) to be alive simultaneously with the VM guest (something running in QEMU), and these are two distinct instances of different operating systems.
Note that, unless you specify a particular MAC address, QEMU will use a default MAC address for the virtual network interface: 52:54:00:12:34:56 . I.e. the MAC address would be identical in all your diskless clients running in parallel, leading to havoc. My workaround: combine the first three bytes, which are specific to QEMU, with the last three bytes taken from the physical MAC address of the network interface that took part in the netboot. That address is obtained indirectly, via the bridge interface configured within the initrd.

The following is an example script to start QEMU with a detailed sequence of command-line parameters, including the MAC address derivation:

#!/bin/bash

BRIDGE_PRESENT=`ip link show | grep br0`

if [ -z "$BRIDGE_PRESENT" ]; then
        echo "This OS instance lacks a soft-bridge. No br0 netdevice in the system."
        echo "Please PXEboot the bare metal into the 'Debian with Virtualization' boot profile."
        sleep 10
        exit 1
fi

# Derive the guest MAC address from the bridge MAC address.
# Beware of this if running multiple guests on the same diskless host!
# Note: without this, all guests would end up with 52:54:00:12:34:56 .
BR_MACADDR=`ifconfig br0 | grep ether | tr -s ' ' | cut -f 3 -d ' '`
echo "MAC address of the bridge: $BR_MACADDR"

GUEST_MACADDR=""
OCTET_NR=1

for OCTET in ${BR_MACADDR//:/ }
do
        if [ $OCTET_NR -eq 1 ]; then
                GUEST_MACADDR=52
        elif [ $OCTET_NR -eq 2 ]; then
                GUEST_MACADDR="$GUEST_MACADDR:54"
        elif [ $OCTET_NR -eq 3 ]; then
                GUEST_MACADDR="$GUEST_MACADDR:00"
        else
                GUEST_MACADDR="$GUEST_MACADDR:$OCTET"
        fi
        ((OCTET_NR=OCTET_NR+1))
done

echo "MAC address of the guest: $GUEST_MACADDR"

CPU_STUFF="-smp cpus=1,cores=2,threads=1,maxcpus=2"
MEMSIZE=1G

BIOS_DIR="-L /usr/share/qemu-efi"
BIOS_FILE="-bios OVMF-with-csm.fd"
#BIOS_FILE="-bios OVMF-pure-efi.fd"
BIOS_STUFF="$BIOS_DIR $BIOS_FILE"
#  you can also skip BIOS_STUFF altogether, to load the default SEABIOS,
#  probably without UEFI capability

CDROM_FILE="/mnt/smb2/NetBoot/Win7PE.iso"
CDROM_STUFF="-cdrom $CDROM_FILE"

HARDDISK_STUFF=""

for THIS_DRIVE in `ls /sys/block | grep sd`; do
        HARDDISK_STUFF="$HARDDISK_STUFF -drive file=/dev/$THIS_DRIVE,if=ide,format=raw,cache=directsync"
done

for THIS_DRIVE in `ls /sys/block | grep nvm`; do
        HARDDISK_STUFF="$HARDDISK_STUFF -drive file=/dev/$THIS_DRIVE,if=ide,format=raw"
done

#FLOPPY_FILE=/mnt/smb2/DOS/dos622.img
#FLOPPY_STUFF="-fda $FLOPPY_FILE"
#FLOPPY_STUFF="-drive file=$FLOPPY_FILE,if=floppy,readonly"
FLOPPY_STUFF=""

BOOT_ORDER=d
#BOOT_ORDER=n
BOOT_STUFF="-boot order=$BOOT_ORDER"

NETWORK_STUFF="-net nic,macaddr=$GUEST_MACADDR -net bridge"
#NETWORK_STUFF="-net nic,model=i82551,macaddr=$GUEST_MACADDR -net bridge"  

MOUSE_STUFF="-device usb-ehci,id=ehci -device usb-tablet -device usb-kbd"

#VGA_EMUL_TYPE=std
#VGA_EMUL_TYPE=cirrus
VGA_EMUL_TYPE=vmware
#VGA_EMUL_TYPE=virtio
#VGA_EMUL_TYPE=qxl

DISPLAY_TARGET=sdl
#DISPLAY_TARGET=gtk
#DISPLAY_STUFF="-display $DISPLAY_TARGET -vga $VGA_EMUL_TYPE -full-screen"
DISPLAY_STUFF="-display $DISPLAY_TARGET -vga $VGA_EMUL_TYPE"

qemu-system-x86_64 -enable-kvm -machine q35,accel=kvm $CPU_STUFF -m $MEMSIZE $FLOPPY_STUFF $CDROM_STUFF $HARDDISK_STUFF $NETWORK_STUFF $BOOT_STUFF $MOUSE_STUFF $DISPLAY_STUFF $BIOS_STUFF

Apologies for the following notes, which are probably off topic in the context of PXE-booting Linux:
In a particular DOS program, PS/2 keyboard emulation proved troublesome: keystrokes on Enter and the arrow keys got doubled. I ended up switching to an emulated USB keyboard, and that worked just fine for the culprit program...
Curiously, the -device usb-tablet is an optimal choice for mouse emulation.
Depending on what OS you want to boot inside the guest VM, you may need a larger or smaller $MEMSIZE.

Getting the MAC address from your boot loader

In recent versions of pxelinux (5 and above, if memory serves) you can use:
SYSAPPEND 2
in the respective menu entry. This makes pxelinux append BOOTIF=01-<MAC, hyphen-separated> to the kernel command line, which the configure_networking() function in the initrd understands.
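
Building on the earlier pxelinux example, the entry could look like this (purely illustrative):

label deb9
  kernel NFSroot/deb9amd64.kernel
  append initrd=NFSroot/deb9amd64.initrd nfsroot=192.168.192.168:/var/NFSboot/deb9_x86_64 ip=dhcp ro
  sysappend 2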

If you're using ipxe as a bootloader (instead of pxelinux), you can format the BOOTIF variable using the scripting capabilities of iPXE:
imgargs deb9amd64.kernel initrd=deb9amd64.initrd nfsroot=192.168.100.3:/var/NFSboot/deb9_x86_64 BOOTIF=01-${mac:hexhyp} ip=dhcp ro systemd.unit=graphical.target net.ifnames=0 intel_idle.max_cstate=1 mitigations=off


by: Frank Rysanek [rysanek AT fccps DOT cz]
in 2019-2020