Changing strings in files


In comp.lang.python, Chris Angelico <rosuav at gmail.com> wrote:
> Eli the Bearded <*@eli.users.panix.com> wrote:
>> Read first N lines of a file. If all parse as valid UTF-8, consider it text.
>> That's probably the rough method file(1) and Perl's -T use. (In
>> particular allow no nulls. Maybe allow ISO-8859-1.)
> ISO-8859-1 is basically "allow any byte values", so all you'd be doing
> is checking for a lack of NUL bytes.

ISO-8859-1, unlike similar Windows "charsets", does not use octets
128-159 for printable characters (they are C1 control codes there).
Charsets like Windows CP-1252 are nastier, because they do use that
range. Use of 1-31 will be pretty restricted in either, probably to
no more than tab, linefeed, and carriage return.
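
A rough sketch of that byte-range check, for illustration only. The
function name and the exact set of allowed control characters are
assumptions, not anything file(1) or Perl is known to do:

```python
# Hypothetical helper: treat data as ISO-8859-1 text only if it has no
# NUL, no C0 controls other than tab/LF/CR, and no octets in 128-159.
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}  # tab, linefeed, carriage return


def looks_like_latin1_text(data: bytes) -> bool:
    for b in data:
        # 0-31: reject everything except the few common whitespace controls.
        if b < 0x20 and b not in ALLOWED_CONTROLS:
            return False
        # 128-159: C1 controls in ISO-8859-1, printable only in CP-1252.
        if 0x80 <= b <= 0x9F:
            return False
    return True
```

Note that this rejects CP-1252 text using "smart quotes" (octets 147
and 148) and anything containing ESC, so it is stricter than a bare
NUL check.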

> I'd definitely recommend
> mandating UTF-8, as that's a very good way of recognizing valid text,
> but if you can't do that then the simple NUL check is all you really
> need.

Dealing with all UTF-8 is my preference, too.
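
For the UTF-8 case, the "first N lines" idea could be sketched roughly
like this. The function name and the default of 100 lines are
assumptions made for the example:

```python
import itertools


def looks_like_utf8_text(path, max_lines=100):
    """Guess whether a file is text from its first max_lines lines."""
    # Read in binary mode so Python does not try to decode anything yet.
    with open(path, "rb") as f:
        head = b"".join(itertools.islice(f, max_lines))
    # Allow no NULs, even though NUL is technically valid UTF-8.
    if b"\x00" in head:
        return False
    try:
        head.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Splitting on lines is safe for UTF-8 because a linefeed byte never
occurs inside a multibyte sequence. The catch is that a binary file
with no linefeeds at all counts as one very long "line".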

> And let's be honest here, there aren't THAT many binary files that
> manage to contain a total of zero NULs, so you won't get many false
> hits :)

There's always the issue of how much to read before deciding.
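
One way to bound the read is a fixed byte budget instead of a line
count. A cut at an arbitrary byte offset can land in the middle of a
multibyte character, so this sketch uses an incremental decoder, which
does not complain about an incomplete sequence at the end. The name
and the 8192-byte default are assumptions:

```python
import codecs


def head_is_utf8(path, max_bytes=8192):
    """Check at most max_bytes from the start of the file."""
    with open(path, "rb") as f:
        head = f.read(max_bytes)
    if b"\x00" in head:
        return False
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False: a character truncated by the cutoff is buffered
        # rather than reported as an error.
        decoder.decode(head, final=False)
    except UnicodeDecodeError:
        return False
    return True
```

Whatever the budget, a file can still turn binary just past the
cutoff, so this only ever gives a guess.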

Elijah
------
ASCII with embedded escapes? could be a VT100 animation