While researching the precise meaning of the Ctrl-Z (26 decimal, hex 1Ah, ASCII SUB) character in DOS, I was somewhat taken aback by this article which purports to correct a common misconception.
The article is, for the most part, entirely correct. The handle-based file API in DOS 2.0 and onward deals with pure binary files and no byte has any special meaning. Any special meaning that the Ctrl-Z character might have in such files is generally implemented in application programs and run-time libraries.
However, the statement that “MS-DOS didn’t have an End-Of-File character of any sort” is grossly misleading. The statement that “The treatment of character 26 and the handling of “text” files was a shared delusion, […] wholly layered above DOS itself” is simply untrue.
First of all, claiming that COMMAND.COM is “just an application” requires stretching the definition of what is “DOS” to the breaking point, and likely beyond. Sure, COMMAND.COM is not part of DOS… now try booting DOS without it.
And COMMAND.COM certainly ascribes special meaning to Ctrl-Z — for example when interpreting batch files, processing stops at a Ctrl-Z character. Again, claiming that batch files are merely an application construct, not part of DOS, is contrary to most people’s understanding of what DOS is.
But it’s not just COMMAND.COM. The DOS kernel itself (MSDOS.SYS in Microsoft’s releases) very much does ascribe special meaning to Ctrl-Z. One look at DOS 1.25 MSDOS.ASM is enough to ascertain that DOS does treat Ctrl-Z (look for ‘1AH’ in the source code) specially.
The catch is that Ctrl-Z has no particular meaning in the DOS file API but rather in the device I/O code.
When writing to the AUX device (typically a serial port), Ctrl-Z is written but terminates the output. When writing to the CON or LST devices (typically the console and a printer, respectively), Ctrl-Z terminates the output and isn’t written.
Analogously, when reading from the AUX device, Ctrl-Z is stored and terminates the input. Console input is treated similarly, but the logic is much more complex.
But Why?
Now that we’ve established that the DOS kernel does in fact treat Ctrl-Z as a special character, the logical next question is: why? What’s the point of all this?
To understand the purpose of Ctrl-Z, one has to consider early versions of CP/M, as well as early pre-releases of 86-DOS which, after all, mimicked CP/M.
What CP/M versions 1.x/2.x as well as 86-DOS 0.x had in common is that file sizes were not stored with byte granularity. Instead, file sizes were only tracked in terms of 128-byte “records”, which typically happened to correspond to 128-byte floppy disk sectors.
For executable programs, this was not an issue. When loading a program, CP/M or 86-DOS loaded a certain number of records/sectors. If there was some junk in the last record (very likely), it didn’t matter because it was never executed as code and may have been overwritten by data.
However, for text files, or possibly other data files, this was a problem. No one wanted up to 127 bytes of junk displayed on the screen or sent to the printer. CP/M, like old DEC operating systems, adopted the ASCII SUB (substitute) character to solve the problem. The SUB character is defined in the ASCII standard as “a control character used in the place of a character that has been found to be invalid or in error”, which means that SUB should never appear in legitimate (ASCII) text.
One possible approach was to pre-fill the 128-byte record buffer with Ctrl-Z (ASCII SUB) characters, write as many characters of text into the buffer as were available, and then write the buffer to disk. When reading, a Ctrl-Z character indicated the end of the file.
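A minimal sketch of that record-padding scheme, written in modern C purely for illustration (the actual CP/M and 86-DOS code of course worked through system calls and 128-byte disk buffers, not stdio):

#include <stdio.h>
#include <string.h>

#define RECORD_SIZE 128
#define CTRL_Z      0x1A

/* Write a text buffer as a sequence of 128-byte records, padding the
   final partial record with Ctrl-Z (SUB) characters. */
static int write_text_records(FILE *out, const char *text, size_t len)
{
    unsigned char record[RECORD_SIZE];
    size_t pos = 0;

    while (pos < len) {
        size_t chunk = len - pos;
        if (chunk > RECORD_SIZE)
            chunk = RECORD_SIZE;

        memset(record, CTRL_Z, RECORD_SIZE);   /* pre-fill with SUB */
        memcpy(record, text + pos, chunk);     /* copy the actual text */

        if (fwrite(record, 1, RECORD_SIZE, out) != RECORD_SIZE)
            return -1;                         /* write error */
        pos += chunk;
    }
    return 0;
}

Note that if the text is an exact multiple of 128 bytes, no padding is ever written, which matches the canonical usage described below.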
In the canonical usage, there was no requirement that a text file must be terminated with Ctrl-Z. If a text file was an exact multiple of 128 bytes, there was no need to add another record containing just a Ctrl-Z. When reading the file, the program already knew that there were no more records and that the file had been completely read.
By necessity, much of the Ctrl-Z handling was the responsibility of applications, not least because applications (rather than the OS) knew whether they were dealing with text or binary files. Applications processing text files considered Ctrl-Z to be an End-Of-File (EOF) marker on input, and added a Ctrl-Z at the end of output.
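As a rough sketch of that application-side convention in plain C (not tied to any particular DOS run-time library):

#include <stdio.h>

#define CTRL_Z 0x1A

/* Read a text file and echo it, treating the first Ctrl-Z as end of file. */
static void copy_text_in(FILE *in, FILE *out)
{
    int c;

    while ((c = fgetc(in)) != EOF && c != CTRL_Z)
        fputc(c, out);
}

/* Finish writing a text file by appending the Ctrl-Z EOF marker. */
static void finish_text_out(FILE *out)
{
    fputc(CTRL_Z, out);
}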
Byte Granular Files
Circa April 1981, 86-DOS version 1.0 was refined to track the file size on the basis of bytes rather than records. This removed most of the need for Ctrl-Z, but not all. Any existing text files, as well as text files transferred from CP/M disks, still had file sizes that were multiples of 128 bytes and almost certainly contained junk at the end. DOS-based applications therefore continued treating Ctrl-Z as an EOF marker in text files.
Likewise for transferring files to other systems, many DOS-based applications continued adding a Ctrl-Z to the end of text files.
The built-in Ctrl-Z handling in the DOS kernel for console/AUX/printer I/O didn’t go away, and neither did the Ctrl-Z processing in COMMAND.COM. It was propagated to DOS-like operating systems such as OS/2, Windows 9x, and Windows NT.
Consider the following (on a Windows 10 workstation):
M:\dos\86dos>type "86-DOS v0.34 #221 - 81-02-20.imd"
IMD 1.18: 27/12/2023 16:20:11
Generated by Applesauce Fast Imager 1.88.3
Although the IMD floppy image format is binary, the IMD header only contains ASCII text. Because it ends with a Ctrl-Z character, TYPEing an IMD file will only show the header text and none of the binary data. This trick was used by a number of binary formats — Ctrl-Z is not an EOF marker in this usage, but it is an end of text marker.
In other words, the TYPE command still treats Ctrl-Z as an EOF marker, even in modern Windows versions.
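A sketch of how a tool might pull just the text header out of such an image; the function name and buffer handling are made up for illustration and are not part of the IMD specification:

#include <stdio.h>

#define CTRL_Z 0x1A

/* Copy the ASCII header of an IMD-style image (everything before the
   first Ctrl-Z) into 'header'; the binary track data follows in the
   stream. Returns the header length. */
static size_t read_text_header(FILE *img, char *header, size_t max)
{
    size_t n = 0;
    int c;

    while (n + 1 < max && (c = fgetc(img)) != EOF && c != CTRL_Z)
        header[n++] = (char)c;

    header[n] = '\0';
    return n;
}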
There is another place where Ctrl-Z is not only used but necessary. Many DOS users know that in the absence of a text editor, it is possible to create a text file by using COPY CON FOO.TXT. But how does one terminate the file input? By using Ctrl-Z of course.
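An illustrative session might look like this (the ^Z is what DOS echoes when Ctrl-Z or F6 is pressed; the exact message wording varies between DOS versions):

C:\>COPY CON FOO.TXT
This is the first line.
This is the second line.
^Z
        1 File(s) copied

C:\>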
Summary
It is true that as far as the DOS file API is concerned, files on disk are purely binary and Ctrl-Z (or any other byte) has no special meaning. However, it is categorically untrue that DOS has no concept of Ctrl-Z as an EOF marker. When performing I/O to or from the console, AUX, or PRN device, Ctrl-Z is in fact an EOF marker, and this logic is built into the DOS kernel itself.
The command shell, COMMAND.COM, is an integral part of DOS and treats Ctrl-Z as an EOF marker in text files, including batch files.
The reason why Ctrl-Z is understood to be an EOF marker in text files is historical; Ctrl-Z was necessary on systems which did not track the exact file size and only managed fixed size (usually 128-byte) records.
Addendum: Manual Support
To further underscore the point, here are several quotes from the MS-DOS 3.21 User’s Reference manual (chosen simply because that’s the first I had on hand in electronic copy).
The /A switch of the COPY command is documented as follows:
Causes the file to be treated as an ASCII (text) file. Data in the file is copied up to but not including the first end-of-file mark (in edlin this is CONTROL-Z). The remainder of the file is not copied.
Documenting the F6 key functionality in the DOS input line editor (NB: F6 still behaves the same way in Windows 10):
Puts a CONTROL-Z (1AH) end-of-file character in the new template.
Explaining how to use COPY CON to create a batch file:
After the last line, press CONTROL-Z and then press RETURN to save the batch file. MS-DOS displays the message “1 File(s) copied” to show that it created the file.
While the first example, the COPY command, is implemented in COMMAND.COM, the other two (input line editing and file termination when copying from the console) are built into the DOS kernel itself.
It is clear that Microsoft understood Ctrl-Z to be an end-of-file character in DOS text files, and that the appropriate logic was built into the DOS kernel itself, even though the DOS file API does not treat any character specially.
“The handle-based file API in DOS 2.0 and onward deals with pure binary files and no byte has any special meaning.”
IMHO, the mention of DOS 2.0 and handles was unnecessary. Handles brought nothing new to this game. PC-DOS 1.0 (and almost certainly even earlier versions of DOS), while using only FCBs for file access, already considered files as pure binary, and no bytes had any special meaning either, just as with file handles. I’m not sure about the very early versions of DOS, of course.
I know that you explain all this in the main article body; I just find the mention of file handles weird and redundant.
“What CP/M versions 1.x/2.x as well as 86-DOS 0.x had in common is that file sizes were not stored with byte granularity. Instead, file sizes were only tracked in terms of 128-byte “records” ”
Now that you mention it, it got me thinking. CP/M 2.x for example, certainly had file compression tools. Having random padding junk at the tail end of every file must have been a major PITA for the makers of these programs. And CP/M 3 didn’t appear before… 1983(!), according to Wikipedia! Maybe Tim Paterson did something right (byte granularity), after all…
Huh, the Pedant Supreme JdeBP again?
Still, it’s perfectly possible to launch DOS without COMMAND.COM.
echo SHELL=C:\DOS\DOSSHELL.EXE >> C:\CONFIG.SYS
As I remember, SHELL= was added back in MS-DOS 2.0, which brought many Unix influences. Among them was moving process management out of COMMAND.COM into the kernel and making the former an ordinary application (for the most part) that can be replaced with something else if the user wishes. If you think that breaks the definition of DOS, it seems you would consider the MS-DOS kernel with the 4DOS shell no longer a DOS.
“Sure, COMMAND.COM is not part of DOS… now try booting DOS without it.”
Well, I did that for quite some time, when I was using 4DOS as a COMMAND.COM replacement.
And at least I guess that when using SHELL=4DOS.COM in CONFIG.SYS, COMMAND.COM is in fact not used – or am I wrong here?
I should have said “try booting DOS without it when all you have is DOS disks”. Certainly COMMAND.COM can be replaced, but in practice it’s either replaced with some COMMAND.COM compatible shell (like 4DOS) or DOS is booted into some custom application that just uses DOS as a boot loader.
Well, what does 4DOS think about Ctrl-Z? Or, if you go straight to DOSSHELL, how do you get at the TYPE command?
And yes, with third party software you can replace everything, including the DOS kernel itself.
Yes, the FCB-based I/O is really no different in this regard — files are pure binary data, same as with the handle-based I/O. Similarly talking about C runtime libraries seems unnecessary and distracting, as if all software were written in C. Which it’s not, and it especially wasn’t the case in the old DOS days.
The FAT file system in 86-DOS is in my opinion a significant improvement over the old CP/M file system. The CP/M approach where file name and block allocation information were all stored together in directory entries barely scaled to floppies, and definitely didn’t scale to hard disks. The FAT approach could handle contemporary hard disks even with the original FAT12, and could be relatively easily extended to FAT16 and FAT32 to deal with much larger disks and much larger files.
I don’t know how file compression worked on CP/M, but the compressed stream either needed some kind of terminator (e.g. the stop code in LZW), or had to store the length of the compressed data somewhere.
While Ctrl-Z is listed as the end of file marker for CP/M, the read and write routines will cheerfully go right past any such marker. The behavior of CP/M isn’t that different from DOS. Supposedly, CP/M would pad the last record with Ctrl-Zs but that doesn’t seem to happen all the time.
DRI had changed random reads/writes to use a 3-byte value for records, which would have led to file sizes of 2 GB. Rumors were that DRI had been planning a new file system which would have replaced the old one, but DRI dawdled long enough that a modified version of MS’s FAT had to be used instead.
I fully expect that 86-DOS replicated the CP/M behavior. In other words, as far as the OS file API is concerned, a file is just a bunch of bytes that are not interpreted in any way by the file I/O routines.
The Programmer’s CP/M Handbook (by Andy Johnson-Laird) says that “If the file that you have created is a standard CP/M ASCII text file, you must arrange to fill the unused portion of the record with the standard 1AH end-of-file characters as CP/M expects […]”. Elsewhere the book says that “Therefore, two possible conditions can indicate end-of-file: either encountering a 1AH, or receiving a return code from the BDOS function (in the A register) of 0FFH.” I can’t see any mention in the book that CP/M itself would do the Ctrl-Z padding, but it was expected that applications would do that if they were writing text files.
On DOS I don’t think this was very visible because PC DOS 1.0 already used byte granular file sizes, and e.g. EDLIN used that. But for example on the 86-DOS 0.34 disk, the text files are all fully padded with Ctrl-Z (NEWS.DOC and all the .ASM files).
This reminds me that I’m sometimes wondering how long some or most of those (to me) arcane looking ASCII control characters were actually in use. Sure, plenty of them are still relevant today (everything in the interval from BEL to CR for example), and some others sound like they could have been relevant to ASR terminals, but what about things like SOH (Start of Header), SOF (Start of Field), or all those separators?
Did that ever find widespread use? Does it maybe still today in some form?
See “Exact file sizes in CP/M Plus” here for the supported workarounds:
http://www.seasip.info/Cpm/bytelen.html
@Anthony Williams: That article shows the problem since there were two different methods of computing the exact file length. I think the most common use would have been DOS Plus copying files from FAT disks. Reported file length modulo 128 is easy math.
@Julian Oster: One of the few post-TTY systems to use the control characters was the BASICode cassette format which used start of text (Ctrl-B) and end of text (Ctrl-C) to surround data blocks. But that had a very slow data rate.
A single critical byte can easily be lost at higher rates, so more repetitive handshaking methods would be used. The values used in the floppy headers and gaps make it difficult to lose a bit without catching the problem.
I love seeing real-world examples that demonstrate the idea that (practical) new systems by necessity need to interface with the real world, and those interfaces in turn enforce interesting constraints on the old ones. Points as simple as “Yeah… we track file size by bytes now instead of records… but we’re not just going to throw away our old, already paid-for computers and files all at once” are easy to miss when looking back at something from the future without the context of the world during that era. Fantastic read, thank you!
@Julian Oster, some of the control characters are used as almost a “serialization” tool, particularly File Separator/Group Separator/Record Separator/Unit Separator (0x1C–0x1F).
Sometimes it’s easier to define your data structure as “here are the 25 top-level fields, with a File Separator between each, leaving unneeded ones blank. Field 9 has four sub-elements separated by Group Separators,…”
You see this in places like credit-card processing APIs, some of which are built on models that fit little terminals running on tiny 8-bit MCUs and communicating via dial-up. That sort of structure is a lot easier to wrangle than JSON or XML data structure when you’re really resource constrained.
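A toy sketch of that separator-based layout in C (the field names and values are invented for illustration):

#include <stdio.h>
#include <string.h>

#define FS 0x1C   /* File Separator: between top-level fields           */
#define GS 0x1D   /* Group Separator: between sub-elements of one field */

int main(void)
{
    char record[128];
    size_t i, fields = 1;

    /* Three top-level fields; the last one has two sub-elements. */
    snprintf(record, sizeof(record), "VISA%c001999%cAUTH%c123456",
             FS, FS, GS);

    /* A receiver simply walks the bytes and splits on the separators. */
    for (i = 0; record[i] != '\0'; i++)
        if (record[i] == FS)
            fields++;

    printf("%zu top-level fields in a %zu-byte record\n",
           fields, strlen(record));
    return 0;
}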
What next, you’ll tell us that JSON and XML do not solve all the world’s problems? Blasphemy!
The compromise, of course, is IFF. So good that both Apple and M$ had to copy (and mangle) it!
ITU-T: Hold my beer!
Hope you guys have heard of X.208 and ASN.1 🙂
I am amazed at the number of newer information exchange methods that lack any form of field-level error catching. Hoping that the transfer protocol notices the error and then sends all the data again is a bad idea with financial information. This makes me very happy to no longer be working on software for widespread financial companies.
Oh, that’s because “financial companies” tend to be run by the most egregious bullshit artists. If you don’t understand whatever the hell you’re doing, and you don’t care a patootie anyway, and responsibility is not really your forte, get employed in finance! It’s your calling.
That’s OT, though…
I come here for the humor. You do _not_ disappoint.
The debate about whether COMMAND.COM is a part of MS-DOS is kind of like the debate about whether ICMP is a part of IP. Technically ICMP is a protocol that runs on top of IP (the same way UDP and TCP do), while in practice a TCP/IP stack without UDP wouldn’t be considered complete by most people.
Side track: The 128-byte granularity bled over to XMODEM, which (at least originally) used 128-byte packets, and each file transmitted using XMODEM got its file size rounded up to the nearest 128-byte boundary.
In general not much of a problem, but it probably ate up some extra disk space on systems not using 128/256-byte sizes. (An example is the disk drives for the Commodore 8-bit computers, where each 256-byte sector contained 2 bytes linking to the next track/sector and 254 bytes of data (and as a bonus the first two bytes were the load address for program files).)