After I Kryofluxed my MS OS/2 SDK 1.01 disks, I once again tried installing the OS in a VM. While the system booted up fine, it stubbornly refused to get past FORMAT. At the end, after going through all the cylinders and heads, it would always hang.
Analyzing the VM, I realized that it does not so much hang as crash, because the OS kernel stack gets exhausted. And it gets exhausted because the disk driver gets into a funk, keeps repeatedly doing SET DRIVE PARAMETERS and READ VERIFY SECTORS, then crashes, probably tries to show the trap screen, and just miserably dies without actually showing anything. But why?
So I thought, the source for the disk driver is on the OS/2 1.0 DDK (one of the two drivers provided in source form, the other being the serial port driver), let’s see what it does then. Except I discovered that it’s not that easy… because the DDK release notes say that it’s not really the source code for the OS/2 1.0 disk driver, but rather an updated version. And sure enough, the source code says it’s the OS/2 1.1 disk driver (even though the code is from May 1987!). Why Microsoft would have done that is anyone’s guess.
The DDK source code is still close enough to the OS/2 1.01 SDK disk driver that I was able to make some sense of the disassembly, though I could not find any real problem in the code. But I had a very good idea what the problem might be: OS/2 1.0, like a number of old UNIX PC/AT ports, most likely silently assumes that the disk is very slow and that interrupts take a while to arrive. I was not able to spot the exact bug, but it’s almost certain that the disk driver writes a command to the controller and only then updates some internal state. Sometimes the command completion interrupt arrives before the state update, and then things go badly wrong.
So I tried simply delaying interrupts by a millisecond. And sure enough, FORMAT no longer hung! That does not pinpoint the bug, but it is strong indirect evidence for the timing theory.
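To make the suspected ordering problem concrete, here is a toy model in Python. It is only a sketch of the theory above, not anything taken from the actual driver: the `ToyDriver` class, the `issue_command` function, and the 50-microsecond state update time are all made up for illustration. The command is written at time zero, and the only thing that varies is whether the completion interrupt or the driver's own bookkeeping happens first.

```python
class ToyDriver:
    """Hypothetical stand-in for a driver that tracks one in-flight command."""

    def __init__(self):
        self.pending = False   # set once the driver records the command
        self.completed = 0     # interrupts matched to a pending command
        self.unexpected = 0    # interrupts that arrived with nothing pending

    def on_interrupt(self):
        if self.pending:
            self.pending = False
            self.completed += 1
        else:
            # The handler has no record of the command it is being told about.
            self.unexpected += 1


def issue_command(driver, irq_latency_us, state_update_us=50):
    """Write a command at t=0, then replay the two follow-up events in time order.

    state_update_us is an arbitrary illustrative figure, not a measurement.
    On a tie the interrupt is processed first.
    """
    events = sorted([
        (state_update_us, 1, "state_update"),
        (irq_latency_us, 0, "interrupt"),
    ])
    for _time, _tie_break, name in events:
        if name == "state_update":
            driver.pending = True
        else:
            driver.on_interrupt()


# Fast (emulated) disk: the interrupt beats the state update.
fast = ToyDriver()
issue_command(fast, irq_latency_us=5)

# Interrupt delayed by one millisecond: the state update wins.
slow = ToyDriver()
issue_command(slow, irq_latency_us=1000)

print(fast.unexpected, fast.pending)  # 1 True
print(slow.completed, slow.pending)   # 1 False
```

In the fast case the toy driver ends up believing a command is still pending even though its interrupt has already come and gone. Whether the real driver fails in this exact manner is unknown, but it is the general shape of bug that a one-millisecond interrupt delay would hide.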
Interestingly, the hang seems to be specific to the FORMAT command, probably because FORMAT does something normal file operations don’t. If the disk is formatted with another OS (such as DOS), the MS OS/2 1.01 SDK can use the disk without any apparent trouble. OS/2 1.1 (starting with the MS OS/2 1.03 SDK from March 1988) does not have this problem, whatever it is exactly. Perhaps that’s why I could not spot anything in the provided source code.
As a side note, the OS/2 disk driver source mentions LOADALL by name in a comment: “WARNING!!!! Care must be taken to ensure that DS or ES are not put on the stack after they have been mapped to a buffer since if this buffer is in high memory and we are in real mode, the LOADALL mapping will be lost.” That refers to the mapping obtained by the PhysToVirt DevHlp.
It is amusing that the OS/2 device driver programming documentation says the same thing, but carefully never mentions LOADALL by name. Nor does it explain how the LOADALL equivalent is implemented on a 386. When one understands the implementation, the restrictions are obvious. Without that, it just sounds like voodoo. Which, I suppose, is exactly what it was.
Why does it say “…Version is 10.00” on OS/2 1.01?
The internal version number visible to applications was 10.0 for OS/2 1.0 (and 20.0 for V2.0). I believe the thinking was that DOS applications could safely assume that if the version was 10.0 or higher, they were running on OS/2, otherwise they were running on plain DOS.
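As a small illustration of that check, here is a sketch in Python. The function name `looks_like_os2` is made up, and the assumption is that the application has already obtained the major version number from the system (on DOS, the usual way would be the Get Version call, INT 21h function 30h).

```python
def looks_like_os2(major_version):
    """Return True if the reported major version suggests OS/2 rather than plain DOS.

    Assumes the convention described above: OS/2 1.x reports 10,
    OS/2 2.0 reports 20, and plain DOS reports its real major version.
    """
    return major_version >= 10


print(looks_like_os2(3))   # False (e.g. DOS 3.x)
print(looks_like_os2(10))  # True  (OS/2 1.0)
print(looks_like_os2(20))  # True  (OS/2 2.0)
```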