So I was working on improving a DOS emulator, when I found that something seemingly trivial wasn’t working right when COMMAND.COM was asked to do the following:
echo AB> foo.txt
echo CD>> foo.txt
Instead of ABCD, foo.txt contained ABBC.
I verified that yes, the right data was being passed to fwrite(), with the big caveat that what COMMAND.COM was doing wasn’t quite as straightforward as one might think:
- Open foo.txt
- Write ‘AB’
- Close foo.txt
- Open foo.txt
- Seek one byte backward from the end of the file
- Read one byte
- Write ‘CD’
- Close foo.txt
The reason for the complexity is that COMMAND.COM tries to deal with the case where the file ends with a Ctrl-Z character (which wasn’t the case for me); if so, the Ctrl-Z needs to be deleted. Somehow the seek/read/write sequence was confusing things. But why?
Sitting down with a debugger, I could just see how the C run-time library (Open Watcom) could be fixed to avoid this problem. But I could not shake a nagging feeling that such a basic bug would have had to be discovered and fixed years ago.
So I proceeded to write a simple test program which I could try with other compilers.
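The original test program is not reproduced here, but a minimal reconstruction of the sequence might look like this (the "wb"/"rb+" modes and the error handling are my guesses, not necessarily what COMMAND.COM or the original test used):

#include <stdio.h>

int main(void)
{
    char c;
    FILE *f;

    /* First redirection: create the file and write 'AB'. */
    f = fopen("foo.txt", "wb");
    if (f == NULL) return 1;
    fwrite("AB", 1, 2, f);
    fclose(f);

    /* Second redirection: reopen for update, check for a trailing
       Ctrl-Z, then append. */
    f = fopen("foo.txt", "rb+");
    if (f == NULL) return 1;
    fseek(f, -1L, SEEK_END);       /* seek one byte back from the end */
    fread(&c, 1, 1, f);            /* read one byte (Ctrl-Z check) */
    /* A write directly following a read, with no intervening flush or
       file positioning call -- the problematic sequence. */
    if (fwrite("CD", 1, 2, f) != 2)
        fprintf(stderr, "second fwrite failed\n");
    fclose(f);
    return 0;
}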
To my great surprise, the venerable Microsoft Visual C++ 6.0 as well as IBM C/C++ 3.6 for Windows both only wrote ‘AB’ to the output file! The ‘CD’ never got written at all.
I added further logging to determine that in both cases, the second fwrite() reported that it wrote zero bytes. But that’s where things got a bit weird.
For the Microsoft runtime, ferror() was set but errno was zero. For the IBM runtime, ferror() was clear but errno was set to 41. Which according to IBM’s errno.h header means EPUTANDGET… and what does that error even mean?
At this point, I knew I was doing something wrong. But what? For once, Stack Overflow actually had the right answer! Amazing, that almost never happens.
Why Oh Why?
Of course one has to wonder… why is it like this? Having basic file I/O functions behave in this non-obvious way (either quietly failing or not writing the expected data, depending on the sequence of other function calls) is clearly sub-optimal.
It is obvious that it would not be rocket science for the C library to keep a record of whether the most recent I/O was a read or a write, and perform the appropriate flush or seek when switching directions. Indeed it’s clear that for example the IBM C runtime keeps track internally, and issues a very specific error when the correct sequencing is violated.
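As a purely hypothetical illustration (this is not how any of the runtimes discussed here actually implement it), such bookkeeping could be as simple as a pair of wrapper functions:

#include <stdio.h>

/* Hypothetical wrappers that remember the last I/O direction and
   insert a no-op fseek() when the direction changes. Per ISO C, a
   file positioning call makes switching directions legal. */
enum last_op { OP_NONE, OP_READ, OP_WRITE };

struct tracked_file {
    FILE        *fp;
    enum last_op last;
};

size_t tracked_read(void *buf, size_t size, size_t n, struct tracked_file *tf)
{
    if (tf->last == OP_WRITE)
        fseek(tf->fp, 0, SEEK_CUR);   /* write -> read transition */
    tf->last = OP_READ;
    return fread(buf, size, n, tf->fp);
}

size_t tracked_write(const void *buf, size_t size, size_t n, struct tracked_file *tf)
{
    if (tf->last == OP_READ)
        fseek(tf->fp, 0, SEEK_CUR);   /* read -> write transition */
    tf->last = OP_WRITE;
    return fwrite(buf, size, n, tf->fp);
}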
The closest thing to an answer that I’ve been able to find is that “it’s always been this way”.
With the caveat that “always” means since circa 1979, not always always. Looking at the 1978 edition of K&R, it’s obvious why: the original K&R library only supported the read ("r"), write ("w"), and append ("a") modes for fopen(), with append being effectively a write. There was no update mode ("r+"), and hence reads and writes could not be mixed at all! That is very likely part of the puzzle.
By the time the oldest preserved ANSI C draft rolled out, the behavior was already set in stone. Consider how little things have changed over the years:
When a file is opened with update mode ('+' as the second or third character in the mode argument), both input and output may be performed on the associated stream. However, output may not be directly followed by input without an intervening call to the fflush function or to a file positioning function (fseek, fsetpos, or rewind), and input may not be directly followed by output without an intervening call to a file positioning function, unless the input operation encounters end-of-file. Opening a file with update mode may open or create a binary stream in some implementations.
(ANSI X3J11 C draft, 1988)
The ANSI C Rationale contains the following text:
A change of input/output direction on an update file is only allowed following a fsetpos, fseek, rewind, or fflush operation, since these are precisely the functions which assure that the I/O buffer has been flushed.
The implication is that when the I/O buffer contains data, it’s not safe to switch read/write direction.
The published ANSI C89/ISO C90 is nearly identical to the draft Standard and does not bear repeating here. In C99, “may not” was replaced with “shall not” but little else changed:
When a file is opened with update mode ('+' as the second or third character in the above list of mode argument values), both input and output may be performed on the associated stream. However, output shall not be directly followed by input without an intervening call to the fflush function or to a file positioning function (fseek, fsetpos, or rewind), and input shall not be directly followed by output without an intervening call to a file positioning function, unless the input operation encounters end-of-file. Opening (or creating) a text file with update mode may instead open (or create) a binary stream in some implementations.
(ISO C99, 1999)
Fast forward another (almost) quarter century, and we have this:
When a file is opened with update mode ('+' as the second or third character in the previously described list of mode argument values), both input and output may be performed on the associated stream. However, output shall not be directly followed by input without an intervening call to the fflush function or to a file positioning function (fseek, fsetpos, or rewind), and input shall not be directly followed by output without an intervening call to a file positioning function, unless the input operation encounters end-of-file. Opening (or creating) a text file with update mode may instead open (or create) a binary stream in some implementations.
(ISO C23, 2024)
As far as Standard C is concerned, this demonstrably has not changed from 1988 to the present.
But of course the ANSI X3J11 Committee did not invent the C library. It worked on the basis of earlier documents, namely the elusive 1984 /usr/group Standard in the case of the library.
While I couldn’t find a copy of the /usr/group Standard, the /usr/group committee likewise didn’t create the C library but rather tried to standardize existing implementations. Which means that the answer might lie in old UNIX manuals.
Even System V is too new and we have to look further back. The AT&T UNIX System III manual contains the following text in the fread manual page:
When a file is opened for update, both input and output may be done on the resulting stream. However, output may not be directly followed by input without an intervening fseek or rewind, and input may not be directly followed by output without an intervening fseek, rewind, or an input operation which encounters end of file.
(AT&T UNIX System III manual, 1980)
Hmm, that text from 1980 is rather similar to what ended up in ANSI C89. Sure, there was no fsetpos() yet (an ANSI C invention), and the text is oddly missing any mention of fflush(), even though flushing almost certainly made it OK to switch from writing to reading even then.
But it’s obvious that the restriction on switching between reading and writing on C library streams has been there for a very, very long time.
7th Edition UNIX (1979), even in the updated documentation from 1983, does not mention update mode for fopen() and hence does not offer any advice on switching read/write directions.
Current Practice
At least Linux (glibc) and FreeBSD allow free intermixing of reads and writes. The FreeBSD man page for fopen() states:
Reads and writes may be intermixed on read/write streams in any order, and do not require an intermediate seek as in previous versions of stdio. This is not portable to other systems, however; ISO/IEC 9899:1990 (“ISO C90”) and IEEE Std 1003.1 (“POSIX.1”) both require that a file positioning function intervene between output and input, unless an input operation encounters end-of-file.
In contrast, Microsoft’s library documentation (as of 2024) mirrors ISO C and states that flushing or seeking is required when changing read/write direction.
On the one hand, transparently handling the direction switching in the library is not outrageously difficult. On the other hand, doing so encourages programmers to write non-conforming C code which will fail in rather interesting ways on other implementations. As always, there are tradeoffs.
Old Source
Looking at historic source code proved quite interesting.
In 32V UNIX from 1979, fopen() clearly opens files for either reading or writing, but not both (and any mode other than ‘w’ or ‘a’ implicitly means ‘r’!).
V6 UNIX from 1975 is too old to even have fopen(). System III from 1980 on the other hand supports update mode, and opening streams for update sets an explicit _IORW flag (and, as mentioned above, the System III documentation demands extra care when switching I/O direction).
Things get confusing with V7 UNIX from 1979. Although the documentation does not show any update mode option for fopen(), the actual implementation supports it. In fact the V7 code from 1979 is nearly identical to what was in System III a year later. Why? I don’t know.
And then there’s the 2BSD code, again from 1979. While the BSD fopen() has no provision for indicating update mode with the ‘+’ character, it allows specifying open modes like "rw", setting both the _IOREAD and _IOWRT flags. In fact the 2BSD man page for fopen() explicitly lists "rw" and "ra" as supported open modes which allow both reading and writing, but there is nothing said about whether mixing fread() and fwrite() freely is allowed. There is also an explanatory README file with a note from November 1978 describing the change to allow mixed read and write access.
A 1977 paper by Dennis M. Ritchie, A New Input-Output Package, is quite clear that when fopen() was first conceived, a stream would support either reading or writing, but not both. It is also clear that users found this too restrictive, and by 1979 there were at least two different implementations (AT&T and BSD) which allowed mixed read/write streams.
Notably, in the BSD implementation, fopen() was modified to allow both reading and writing, but fread() and fwrite() were not. It is not clear to me if the BSD code was robust enough to allow free mixing of reads and writes. The AT&T documentation has always been clear that it’s not allowed.
And as far as Standard C and POSIX are concerned, that has not changed to this day. To write portable code, it is necessary to take some action when changing read/write direction. A dummy call such as
fseek( f, 0, SEEK_CUR );
is entirely sufficient to get the stream into a state where switching between reading and writing is safe.
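Applied to the COMMAND.COM-style sequence from the top of the post, that means one extra line between the read and the write:

fseek(f, -1L, SEEK_END);    /* seek one byte back from the end */
fread(&c, 1, 1, f);         /* read one byte (Ctrl-Z check) */
fseek(f, 0, SEEK_CUR);      /* dummy seek; switching to output is now safe */
fwrite("CD", 1, 2, f);      /* works portably */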
I suppose oddities like this just happen when you have nearly 50 years of history behind you.
It’s long been clear to me that stdio is just badly designed — it’s a “hide the complexity” affair, instead of a “bring it out in the open and make it manageable” one. So these kinds of awful gotchas are to be expected.
Just my 2 cents — the insight likely won’t help here.
With many storage devices, it was impossible to have both reads and writes. Paper tape was treated as two different logical devices, the punch and the reader, even if the physical device was designed to do both, just not at the same time.
Changing from read-or-write to read-and-write for a single open command required a lot of behind-the-scenes work to correctly handle single-direction devices and the clearing of buffers. I never found it that hard to close a file and then reopen it in a different mode, but I guess preferred programming methods have changed.
Magnetic tape too. Reading and writing can’t be intermixed easily.
What I realized is that, at least in the programs I write, most of the time a file is open either for reading or for writing. Rewriting files is certainly done, but it’s not something that happens all the time.
This is all good, but where does ABBC come from?
UNIX has such a nice byte-wise orthogonal interface, and stdio then breaks it for the sake of compatibility and efficiency — a rationale that hasn’t really held up for decades.
Convenience routines around {read,write}(2) certainly can’t hurt — that’s why me wrote librdwr — but stdio just obscures and confuses what’s really happening in a rather un-UNIXy manner.
Ironically, stdio may be more suited to environments like mess-dos than it ever was for UNIX.
The single-byte fread() places ‘B’ in the FILE buffer. The following two-byte fwrite() appends ‘CD’, so the buffer now holds ‘BCD’. When the file is closed and the buffer is flushed, the library writes two bytes… starting at the beginning of the buffer, so ‘BC’.
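Spelled out as a step-by-step trace (hypothetical, assuming the runtime behaves exactly as described above):

/* file on disk: "AB"; fseek(-1, SEEK_END) puts the position at 1
   fread of 1 byte  -> buffer holds "B",   position is now 2
   fwrite of "CD"   -> buffer holds "BCD", but the count to flush is 2
   fclose           -> flushes 2 bytes from the start of the buffer,
                       writing "BC" at offset 2: the file reads "ABBC" */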
If you read the Dennis M. Ritchie paper, stdio was designed to be portable. It obviously pre-dates DOS by a few years, but it was designed to handle systems with record-oriented I/O and all kinds of strangeness. Which it does.
It also has quite good performance thanks to buffering, especially when doing single-character reads or writes.
But if you think compatibility and efficiency are worthless, there’s nothing to discuss.
dmr’s pursuit of generality, while laudable in general (heh), led to him making another mistake like stdio, one that didn’t have as much impact — STREAMS.
Me point about the reasons of compatibility and efficiency is that they haven’t held up in this case: nearly everything is byte-wise now, and machines are powerful (parallel) enough that pipeline stalls[0] have become the greater evil.
So stdio was a design for the past and present, not the future. Yet many still feel they’re stuck w/ it to this day. Sound familiar?
YMMV, of course, as always…
[0] A term normally used in microarchitecture, but it just as well applies here.
Oh, and on the issue of short {read,write}s being inefficient, which remains somewhat of an issue: me’s studied that problem, as well. Me proposed solution is to implement hardware pipes, where the processor provides protected access to the kernel-side buffer and only triggers an interrupt when it’s filled up on write (if that happens before it’s drained by another process) or empty on read (if that happens before it’s replenished by another process).
(That’d certainly be more of a modernization than adding the nth vector instruction set.)
When I tried your example on MS-DOS 6.22:
echo AB> foo.txt
echo CD>> foo.txt
type foo.txt
I see:
AB
CD
Note the newlines between each of the “echo” statements’ output. Which is what I think the expected behavior should be, not “ABCD”, since the echo command appends newlines to the output. DOS’s echo command doesn’t offer an equivalent to the Unix “echo -n” command.
Of course! Because you’re not running it in the emulator I’m working on. So you don’t get the problem. I probably didn’t explain it well enough in the blog post.
And you’re right about the newlines, I don’t think they can really be suppressed in DOS.
You got me thinking: is there a way to remove the buffer? Let’s say even if we don’t want the speed increase… No, we can’t; we would need bit or byte transfers, and the hard disk and other media work in sectors, at least for the old media. Newer media read/write even longer blocks of information. So we need buffers in our operating system.
Now that I think of File Control Blocks in CP/M: doesn’t the application essentially manage the buffer? And isn’t that part of what simplifies stdio?
CP/M uses 128-byte records, so anything that does bytewise I/O on it needs to buffer the data. (My recollection is that the C libraries I’ve seen for CP/M tend to emulate UNIX-style open()/read()/write() and then have a UNIXy stdio library on top of that, rather than implementing fopen()/fread()/fwrite() directly with CP/M API calls).