A while ago I ran into an odd problem: a virtual machine running QEMM 9.0 (aka QEMM 97) would crash more or less every time it tried to read something from a floppy. No such problem was observable in any other environment. But what does QEMM have to do with reading from a floppy, anyway? Quite a lot.
It is well known that EMM386, QEMM, and their cousins provide upper memory (UMBs) and optionally emulate expanded memory (EMS) through the 386 paging unit. Memory above 1MB, normally not accessible from real-mode applications, is allocated and mapped below the 1MB boundary using paging. In the case of UMBs, memory pages are more or less statically “moved” (remapped) to addresses between 640KB and 1MB in order to fill gaps or even overlay unused ROMs. In the case of expanded memory, pages are swapped in and out of the page frame as requested through the EMS services.
In either case, 16:16 segmented memory addresses used by DOS and BIOS do not necessarily correspond to physical addresses, and that poses a problem for software which needs to operate with physical addresses, such as those used for DMA (direct memory access).
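To make the mismatch concrete, here is a minimal sketch of the two translation steps. The page map, the UMB location at D000:0000, and the backing memory at 0x110000 are made-up values for illustration, not anything taken from a real EMM386 or QEMM configuration.

```python
PAGE_SIZE = 4096  # 386 page size


def linear_address(segment, offset):
    """Turn a real-mode 16:16 address into a linear address (segment * 16 + offset)."""
    return ((segment & 0xFFFF) << 4) + (offset & 0xFFFF)


def physical_address(linear, page_map):
    """Translate a linear address through a toy page map.

    page_map maps linear page numbers to physical page numbers;
    pages that are not listed are treated as identity-mapped (1:1).
    """
    page, offset = divmod(linear, PAGE_SIZE)
    return page_map.get(page, page) * PAGE_SIZE + offset


# Hypothetical setup: one UMB page at D000:0000 backed by RAM at 0x110000.
page_map = {0xD0: 0x110}

lin = linear_address(0xD000, 0x0123)     # 0xD0123, what DOS and the BIOS see
phys = physical_address(lin, page_map)   # 0x110123, where the data really lives
low = physical_address(linear_address(0x0040, 0x0010), page_map)  # 0x410, unchanged
```

The CPU performs the second step automatically on every access. A device doing DMA sees only the result of the first step unless someone translates for it.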
EMM386 and Classic DMA
In a traditional PC, there is one very common device which uses DMA transfers: the floppy controller. The NEC μPD765A or compatible floppy drive controller (FDC) supports a programmed I/O (PIO) mode, but due to lack of buffering, PIO has extremely tight timing requirements which would not be achievable on the original 8088-based PCs, and even on newer systems would require disabling interrupts for the entire duration of a transfer.
To avoid this problem, IBM used the DMA capability of the FDC and the BIOS programmed the system’s DMA controller to assist with the transfers. When the FDC was reading or writing data, only the FDC and the DMA controller were involved while the CPU merrily went around its business (in the case of DOS/BIOS, idly waiting for the transfer to complete) without noticing anything, save perhaps a slight reduction in the available memory bandwidth.
Because the DMA controller is decoupled from the CPU, it has no idea about any address translation the CPU might be doing and operates directly on physical addresses. By now it should be apparent that when real-mode software sets up a DMA transfer to or from an address in expanded memory or a UMB, it will not be able to calculate the physical address correctly and the EMM (expanded memory manager) needs to intervene.
Luckily this isn’t too difficult. Software (in the case of a floppy, the BIOS) writes the 8-bit DMA page register and a 16-bit address register, which together form a 24-bit physical address (covering a 16MB address space). Since the memory manager can easily trap I/O port accesses, it is informed every time DOS/BIOS tries to write to the DMA controller. When the DMA transfer is actually initiated by writing the DMA mask register (to clear the mask), the memory manager steps in and checks the physical address.
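As a small illustration of how the two registers combine, the sketch below composes and splits a 24-bit address the way it works for the 8-bit DMA channels (the floppy uses one of these). The function names are mine, and the 16-bit channels, which compose their address differently, are left out.

```python
def dma_physical_address(page_reg, addr_reg):
    """Combine the 8-bit page register and 16-bit address register (8-bit channels)."""
    return ((page_reg & 0xFF) << 16) | (addr_reg & 0xFFFF)


def split_dma_address(phys):
    """Split a 24-bit physical address back into (page register, address register)."""
    if not 0 <= phys < (1 << 24):
        raise ValueError("classic DMA can only reach the first 16MB")
    return phys >> 16, phys & 0xFFFF


# A buffer the BIOS believes to be at physical 0xD0123:
programmed = dma_physical_address(0x0D, 0x0123)   # 0xD0123
# The register values needed to reach 0x110123 instead:
page_reg, addr_reg = split_dma_address(0x110123)  # (0x11, 0x0123)
```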
If the address is in memory with a 1:1 mapping between linear and physical addresses, there’s nothing else to do. If not, the memory manager must adjust the address to correspond to the actual physical memory, reprogram the DMA controller, and then initiate the transfer. The mapping is performed transparently and DOS/BIOS code has no idea all of this is happening behind its back. There is no need to modify the 16-bit software.
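The trap-and-fix-up sequence can be modeled roughly as follows. This is a toy, not how EMM386 or QEMM are actually implemented: the class, its method names, and the page map are invented for the example, and it ignores real-world complications such as buffers spanning physically non-contiguous pages or crossing a 64KB DMA boundary.

```python
PAGE_SIZE = 4096


class ToyMemoryManager:
    """Toy model of a memory manager trapping DMA programming for one channel."""

    def __init__(self, page_map):
        self.page_map = page_map  # linear page number -> physical page number
        self.page_reg = 0
        self.addr_reg = 0

    # Trapped port writes from DOS/BIOS are simply recorded.
    def write_page(self, value):
        self.page_reg = value & 0xFF

    def write_address(self, value):
        self.addr_reg = value & 0xFFFF

    def unmask(self):
        """Trapped mask-register write: fix up the address, then start the transfer."""
        programmed = (self.page_reg << 16) | self.addr_reg
        page, offset = divmod(programmed, PAGE_SIZE)
        actual_page = self.page_map.get(page, page)
        if actual_page != page:
            # Not a 1:1 mapping: reprogram the controller with the true address.
            actual = actual_page * PAGE_SIZE + offset
            self.page_reg, self.addr_reg = actual >> 16, actual & 0xFFFF
        # Return the physical address the transfer will really use.
        return (self.page_reg << 16) | self.addr_reg


mm = ToyMemoryManager({0xD0: 0x110})   # hypothetical UMB mapping
mm.write_page(0x0D)
mm.write_address(0x0123)
umb_target = mm.unmask()               # 0x110123, redirected behind the BIOS's back

mm.write_page(0x00)
mm.write_address(0x7C00)
low_target = mm.unmask()               # 0x7C00, identity-mapped and left alone
```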
EMM386 and Bus-Master DMA
Things get a lot more hairy with bus-mastering controllers which bypass the classic DMA controller. One of the first such devices was the Adaptec AHA-1540 SCSI HBA, an ISA device first available around 1986. Bus-mastering controllers (primarily disk and network) later became much more common with EISA and PCI.
Here the memory manager does not have any idea about the addresses involved, or perhaps even that there is a transfer going on. For disk transfers, the memory manager can hook the BIOS INT 13h interface and work around the problem by using a buffer in low memory with 1:1 mapping which the device can access. Memory contents have to be copied to/from the original buffer in remapped memory.
This is obviously slow and undesirable, but it was initially the only option to deal with bus-mastering disk controllers. Windows 3.0 used SMARTDRV with the double-buffering option for this purpose.
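The double-buffering idea can be sketched like this. Physical RAM is a plain bytearray, the device writes to whatever physical address it is given, and the hook copies the data through the page map afterwards. All names, addresses, and the page map are assumptions for illustration; a real INT 13h hook is considerably more involved.

```python
PAGE_SIZE = 4096


def to_physical(linear, page_map):
    """Translate a linear address; unlisted pages are identity-mapped."""
    page, offset = divmod(linear, PAGE_SIZE)
    return page_map.get(page, page) * PAGE_SIZE + offset


def device_dma_write(ram, phys, data):
    """A bus-mastering device stores data at a physical address, no translation."""
    ram[phys:phys + len(data)] = data


def cpu_copy(ram, page_map, dst_linear, src_linear, length):
    """Copy byte by byte the way the CPU would, going through the page map."""
    for i in range(length):
        src = to_physical(src_linear + i, page_map)
        dst = to_physical(dst_linear + i, page_map)
        ram[dst] = ram[src]


def hooked_disk_read(ram, page_map, dest_linear, data, bounce_linear):
    """Let the device fill an identity-mapped bounce buffer, then copy to the caller."""
    device_dma_write(ram, bounce_linear, data)  # linear == physical here
    cpu_copy(ram, page_map, dest_linear, bounce_linear, len(data))


ram = bytearray(0x120000)       # a bit over 1MB of toy physical memory
page_map = {0xD0: 0x110}        # hypothetical UMB backed by RAM above 1MB
hooked_disk_read(ram, page_map, 0xD0000, b"SECTOR", 0x8000)
```

After the call, the data sits at physical 0x110000, which is where reads from linear 0xD0000 end up, at the cost of one extra copy per transfer.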
In 1990, Microsoft published the VDS (Virtual DMA Services) specification which neatly solves the problem. A VDS-enabled device driver or firmware calls the VDS provider (EMM386, QEMM, Windows 3.x) when it needs to translate between linear and physical addresses. Bus-mastering devices of any kind can transfer directly into and out of application buffers with no performance loss and relatively low added complexity. The BIOS shipped with more or less all newer bus-mastering storage controllers supports VDS, and so do loadable drivers.
Where’s the Bug?
But back to the original problem. In the faulty VM, QEMM relocated the DOS buffers into UMBs to conserve conventional memory. As a consequence, whenever anything as trivial as ‘DIR A:’ was executed, the floppy read went into the remapped DOS buffers and QEMM needed to translate the physical address.
For some reason that is not at all apparent, and most likely by mistake, the BIOS programmed the DMA mask register twice when reading from floppy (currently visible here). This confused QEMM. When the mask register was first written, QEMM correctly translated the physical address corresponding to a UMB to the true address above 1MB. But when the mask register was written the second time, QEMM translated the address again, somehow not realizing that it was already above 1MB and did not need to be translated again. The translation result was different the second time, which meant that the floppy read would trash some innocent memory, usually crashing the DOS VM (with a nice QEMM error message) almost immediately. No such problem was observed with EMM386, even with DOS buffers in UMBs.
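The failure mode is easy to reproduce in a toy model. To be clear, I do not know what QEMM does internally. The sketch below merely assumes a page map in which the already translated address happens to have a mapping of its own, so that translating twice lands somewhere else. The class names, the map, and the guard flag are all hypothetical.

```python
PAGE_SIZE = 4096


def translate(addr, page_map):
    """Translate an address through a toy page map; unlisted pages are 1:1."""
    page, offset = divmod(addr, PAGE_SIZE)
    return page_map.get(page, page) * PAGE_SIZE + offset


class NaiveTrap:
    """Translates on every mask-register write, even a repeated one."""

    def __init__(self, page_map):
        self.page_map = page_map
        self.address = 0  # combined page + address registers

    def program(self, address):
        self.address = address

    def unmask(self):
        self.address = translate(self.address, self.page_map)
        return self.address


class GuardedTrap(NaiveTrap):
    """Translates only once per newly programmed address."""

    def __init__(self, page_map):
        super().__init__(page_map)
        self.needs_translation = False

    def program(self, address):
        super().program(address)
        self.needs_translation = True

    def unmask(self):
        if self.needs_translation:
            self.address = translate(self.address, self.page_map)
            self.needs_translation = False
        return self.address


# Hypothetical map: the UMB page, plus an unrelated mapping for page 0x110.
page_map = {0xD0: 0x110, 0x110: 0x2A0}

naive = NaiveTrap(page_map)
naive.program(0xD0123)
naive_first = naive.unmask()    # 0x110123, correct
naive_second = naive.unmask()   # 0x2A0123, someone else's memory

guarded = GuardedTrap(page_map)
guarded.program(0xD0123)
guarded_first = guarded.unmask()    # 0x110123
guarded_second = guarded.unmask()   # still 0x110123
```

Whether the real fix would be a guard like this one or something else entirely is a guess. The point is only that a translation step which is not idempotent must not be run twice on the same programmed address.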
As these things go, it was very lucky that the problem happened with floppy reads rather than writes; in the latter case, it would be floppies getting corrupted instead of memory, with much nastier and harder to spot consequences.
This is a curious case of an obscure bug (in QEMM) triggered by a BIOS which was quirky, but not outright buggy itself.
I suppose this is fixed in VirtualBox 4.3.16.
Entirely possible.
So, does VirtualBox use BIOS code from bochs?
It’s based on the bochs BIOS, though with quite a few modifications and additions.
I can’t tell if the duplicated DMA mask register setup is a mistake, a correction for some obscure bug somewhere else, or a clever way to pause the system slightly. Clever code should have comments.
Sometimes I am surprised at how long DOS remained viable even as programmers went spelunking deep into its inner workings and altering it.
Clever code without explaining comments isn’t so clever 🙂 I’m still leaning toward “mistake”.
And yes… it’s fascinating that DOS could function at all with all the memory managers, TSRs, multitaskers, etc. All with zero protection and just one stray pointer away from a crash. Not sure what that says about the PC industry.
Thanks! I think I had figured out via trial-and-error that I needed to pass some flag to disable something to do with floppy access when using QEMM on QEMU (using trial and error in CONFIG.SYS really takes me back), it’s nice to understand what was really going on!