Argh! Microsoft Notepad CRLF! Argh!

Date/Time Permalink: 08/25/08 08:27:53 pm
Category: HOWTOs and Guides

GNU/Linux/Unix users, and even Mac users, hate Microsoft Notepad. If you must use Windows, could you at least replace Notepad with - I don't know - anything else at all? Because Microsoft Notepad is to text what Internet Explorer is to HTML.

See, Notepad has the line-break formatting bug. If you write a text file on anything but Windows and send it to a Microsoft user and they open it in Notepad, all your paragraph line breaks (where you hit the 'return/enter' key) will not show up, resulting in a great big blob of text, or, if they're viewing with word-wrap turned off, a single loooooong line of text where you have to keep scrolling sideways.

Microsoft Wordpad? Doesn't have that bug. It's fine with Mac/Unix standard line breaks. Microsoft Word? Also doesn't have that bug. It can handle standard line breaks, too. The dozens of FOSS or proprietary third-party text editors out there for Windows? Also don't have that bug. Even DOS-bloody-EDIT doesn't have that bug, and it has more features and is faster than Notepad, too. So of course, every Microsoft user wants to use Notepad, and only Notepad.

When you receive a text file from a Microsoft user who wrote it in Notepad, and open it in most plain text editors, the line breaks will have an extra character interpreted as '^M'. And don't get me started on when somebody cuts and pastes Internet content into Notepad to save 'n' send. You get kind of an ASCII salad that isn't quite text and isn't quite binary. Again, these kinds of problems just don't show up with any other editors on the Windows platform that I've seen.
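
That '^M' is how many editors display a stray carriage return byte (0x0D). If you'd rather check a file than guess, here is a minimal Python sketch that counts each kind of line ending in a chunk of bytes. The function name `count_line_endings` is made up for this example, and it assumes an ASCII-compatible encoding such as UTF-8.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF, bare LF, and bare CR line endings in raw bytes.

    Hypothetical helper for illustration; assumes an ASCII-compatible
    encoding (ASCII, UTF-8, Latin-1), not UTF-16.
    """
    crlf = data.count(b"\r\n")
    # Every CRLF pair also contains one LF and one CR, so subtract the
    # pairs to get the bare ones.
    lf = data.count(b"\n") - crlf
    cr = data.count(b"\r") - crlf
    return {"CRLF": crlf, "LF": lf, "CR": cr}


if __name__ == "__main__":
    sample = b"written in Notepad\r\nwritten on Unix\n"
    print(count_line_endings(sample))
```

A file with nonzero counts in more than one column has mixed line endings, which is a likely sign that it passed through more than one editor.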

Let's try to understand what's going on here, setting down once and for all the tedious historical details of why this happens. This is going to hurt me as much as it hurts you. Here's some prerequisite reading.

Once upon a time, there were these things called typewriters. You know that prerequisite reading I just sent you off to do? You didn't do it, did you? Well, the "CR" (hex: 0x000D) character stands for "carriage return" and the "LF" (hex: 0x000A) stands for "line feed". That maps to a manual typewriter. They worked like this:

[Image: manual typewriter]

You'd start a new paragraph by feeding in the paper and then - with your left hand - shoving the carriage (the part on top that has the paper) all the way to the right so the keys would hit the spot on the far left first. Then as you typed, the carriage would advance one space at a time. When you reached the right margin (usually it went "ding!"), you'd have to push that carriage back again, and if you didn't also hit the line-feed lever, you'd start typing over the same line. So the line-feed lever is right there, mounted in the same spot you'd use to push the carriage back anyway, and you could combine both motions.

Sometimes, you'd want the actions to be separate. You'd want to skip down several lines, for instance. Sometimes you'd want to return the carriage without advancing the line, such as when you had to retype a place where you made a mistake and had to use Wite-Out, and then when it dried you had to type back over it. This was how we deleted.

Don't even ask how we did spreadsheets. The last remains of your sanity would swirl down the drain.

Anyway, the modern computer character system has all these left-over concepts from the manual typewriter (and later teletypes and printers), despite the fact that they're all appendices that we'd be better off evolving out of. That "CAPS LOCK" key, for instance, is next to the shift key because on the manual typewriter it was a "SHIFT LOCK", a physical lever lock which kept the shift key pressed down.

That's also why the backspace key is at the top of the keyboard, because it was hooked up to the carriage and physically moved it back to the right. Theoretically, you could do a carriage return by hitting the backspace key a bunch of times, if you were really mad at your finger and wanted to make it suffer.

The CR/LF ASCII characters are the modern control characters for these 'carriage return' and 'line feed' actions. Only we don't need to separate them any more, because we use the arrow keys/cursor keys to navigate within the document. And the rest of the world said, "Thank God! We have computers now, so we can just make the 'enter' key send a single, all-purpose character that advances to the next line of the virtual document and returns to the far left at the same time!"

And the people at Microsoft looked up from their desks and said, "What's that sound? Somebody's happy about something! Red alert! Battle stations! We must mess this up!" So they decided to make their DOS text-file standard insist on both a CR and a LF or no 'enter' happens. They didn't get another chance to have that much fun until they screwed up CSS box models.

Here, see for yourself, for a text document which says:

I like traffic lights.

But only when they're green.

here it is in binary view, Unix format:

[Image: hex view of the file with Unix newlines]

and in DOS format:

[Image: hex view of the file with DOS newlines]

There's the 0x0D0A right there!
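
To see the same thing without a hex editor, here is a small Python sketch that dumps the sample text both ways. The `hex_dump` helper is written for this example, and the exact bytes assume the sample is plain ASCII with one blank line between the two sentences and a line break at the end.

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Return a simple hex dump: offset, hex bytes, printable ASCII."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Show printable ASCII as-is and everything else as a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{width * 3 - 1}}  {text_part}")
    return "\n".join(lines)


# The sample from above (assumed layout: blank line between sentences).
unix_text = b"I like traffic lights.\n\nBut only when they're green.\n"
# The DOS version is the same text with every LF turned into CR LF.
dos_text = unix_text.replace(b"\n", b"\r\n")

if __name__ == "__main__":
    print("Unix (LF only):")
    print(hex_dump(unix_text))
    print()
    print("DOS (CR LF):")
    print(hex_dump(dos_text))
```

In the second dump, every `0a` has a `0d` sitting right in front of it, and the file is three bytes longer, one for each line break.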

Now, most any and every text-handling utility these days can handle either one. And it's no big deal to convert from one to the other. I actually have the classic sed solution saved in a shell script. Yeah, sure, it's no problem to filter your CSS to check for Internet Explorer before you start slinging div boxes around, too. What's a few more minutes off your life each day?
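
The sed script itself isn't reproduced in this post, so what follows is not it. It's a rough Python sketch of the same conversion, in both directions. The names `to_unix`, `to_dos`, and `convert_file` are invented for this example, and it only touches CR LF pairs: a lone CR (the old classic Mac OS convention) is left alone.

```python
from pathlib import Path


def to_unix(data: bytes) -> bytes:
    """Convert CR LF (DOS/Windows) line endings to LF (Unix)."""
    return data.replace(b"\r\n", b"\n")


def to_dos(data: bytes) -> bytes:
    """Convert line endings to CR LF without doubling existing CRs."""
    # Normalize to LF first, so a file that is already CR LF does not
    # come out as CR CR LF.
    return to_unix(data).replace(b"\n", b"\r\n")


def convert_file(path: str, target: str = "unix") -> None:
    """Rewrite a file in place with the requested line endings.

    target is "unix" or "dos"; both the parameter and its values are
    assumptions made for this sketch.
    """
    if target not in ("unix", "dos"):
        raise ValueError("target must be 'unix' or 'dos'")
    p = Path(path)
    data = p.read_bytes()
    p.write_bytes(to_unix(data) if target == "unix" else to_dos(data))
```

It works on bytes rather than text on purpose: opening the file in text mode would let Python translate the line endings itself and hide exactly the thing being fixed.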

But invariably, if I assume the user I'm sending the file to is using Notepad, they'll come to me later going, "I opened this in Emacs and it's full of '^M's!" If I assume the user is not using Notepad, they'll bounce it back going, "What the heck is wrong with you! Can't you find the 'enter' key?" I have to ask every single time. Like a fast food cook saying, "You want fries with that?" "Will that be with CRs or without?"

And it's been this way for 23 years!!! They still haven't fixed it, and we've spent an accumulated year off of all of our lives by now shuffling documents around because of one stupid program.

This blog post isn't going to change anything, is it?

Disclaimer: This post isn't just a rant; it serves a purpose. This problem comes up all the time with my work. Invariably, someone will ask "Why?" Now, instead of having to drop everything to teach an impromptu seminar on the history of typographic technology, I can just point them here.

Update: A year and a half later, Jeff Atwood kind of unintentionally repeats almost everything I say here. Plus adds some new characters, for two Unicode line breaks. Oh, and he also completely flubs identifying what a daisy wheel is, despite linking to a Wikipedia article that shows a photo of something which clearly isn't what he's talking about. And then comments are turned off, for some mysterious reason. So, uh, somebody let him know? I'm busy.
