
Changing strings in files

> On 10 Nov 2020, at 19:30, Eli the Bearded <*> wrote:
> In comp.lang.python, Chris Angelico <rosuav at> wrote:
>> Eli the Bearded <*> wrote:
>>> Read first N lines of a file. If all parse as valid UTF-8, consider it text.
>>> That's probably the rough method file(1) and Perl's -T use. (In
>>> particular allow no nulls. Maybe allow ISO-8859-1.)
>> ISO-8859-1 is basically "allow any byte values", so all you'd be doing
>> is checking for a lack of NUL bytes.

A NUL check does not work for Windows UTF-16 files.
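
To make that concrete, here is a minimal sketch. The helper name has_nul and the sample string are mine, not from the thread. It encodes the same text as UTF-8 and as little-endian UTF-16 with a BOM (which is what I assume "Windows UTF-16" means here) and runs the NUL check on both.

```python
import codecs

def has_nul(data: bytes) -> bool:
    """Return True if the byte string contains at least one NUL byte."""
    return b"\x00" in data

sample = "hello, world\r\n"

utf8 = sample.encode("utf-8")
# Assumed "Windows style": a little-endian BOM followed by UTF-16-LE code units.
utf16 = codecs.BOM_UTF16_LE + sample.encode("utf-16-le")

print(has_nul(utf8))    # False: ASCII text in UTF-8 has no NUL bytes
print(has_nul(utf16))   # True: every ASCII character carries a 0x00 high byte

# A BOM test is one possible extra check before giving up on the file.
starts_with_bom = utf16.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
print(starts_with_bom)  # True
```

So a pure NUL test would call perfectly ordinary UTF-16 text "binary". Checking for a BOM first is only a heuristic, since UTF-16 files without a BOM also exist.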

> ISO-8859-1, unlike similar Windows "charset"s, does not use octets
> 128-190. Charsets like Windows CP-1252 are nastier, because they do
> use that range. Usage of 1-31 will be pretty restricted in either,
> probably not more than tab, linefeed, and carriage return.

Who told you that?

The C1 control set (0x80-0x9f) is used with the ISO 8-bit character sets.

One optimisation for the VT100 family of terminals is to send CSI as the
single byte 0x9b and not as the two bytes 0x1b '['. Back in the days of
9600 bps or 1200 bps connections, that was worth the effort.
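
A rough illustration of the saving, assuming the receiving terminal is in a mode that accepts 8-bit C1 controls (not all are). The single-byte CSI in the C1 set is 0x9b. The function name to_8bit_csi and the sample escape sequence are made up for this sketch, and the blind byte replacement is only good enough for a demonstration.

```python
ESC_BRACKET = b"\x1b["  # 7-bit form of CSI: ESC followed by '['
CSI_8BIT = b"\x9b"      # 8-bit form of CSI: one C1 control byte

def to_8bit_csi(stream: bytes) -> bytes:
    """Replace every two-byte ESC [ introducer with the one-byte C1 CSI."""
    return stream.replace(ESC_BRACKET, CSI_8BIT)

# Clear screen, home the cursor, print "hello" in red, reset attributes.
seven_bit = b"\x1b[2J\x1b[1;1H\x1b[31mhello\x1b[0m"
eight_bit = to_8bit_csi(seven_bit)

print(len(seven_bit), len(eight_bit))  # 24 20

# The 8-bit form is no longer valid UTF-8, because 0x9b cannot start a sequence.
try:
    eight_bit.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print(valid_utf8)  # False
```

One byte saved per control sequence, and the result is a stream that uses the 0x80-0x9f range while still being "text" to the terminal.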

>> I'd definitely recommend
>> mandating UTF-8, as that's a very good way of recognizing valid text,
>> but if you can't do that then the simple NUL check is all you really
>> need.
> Dealing with all UTF-8 is my preference, too.
>> And let's be honest here, there aren't THAT many binary files that
>> manage to contain a total of zero NULs, so you won't get many false
>> hits :)

There is the famous EICAR anti-virus test file: a valid 8086 program for
DOS that consists entirely of printable ASCII characters.

> There's always the issue of how much to read before deciding.

Simple: read it all. After all, you have to scan the whole file to do the replacement anyway.
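
Something along these lines, as a sketch only. The function name replace_in_text_file, the file names and the test strings are invented for the example. It uses the "no NULs and valid UTF-8" rule discussed above, and it works on bytes so that line endings are left exactly as they were.

```python
import tempfile
from pathlib import Path

def replace_in_text_file(path, old: str, new: str) -> bool:
    """Replace old with new if the whole file is NUL-free, valid UTF-8.

    Returns True if the file was treated as text (changed or not),
    False if it was skipped as binary.
    """
    data = Path(path).read_bytes()  # read it all in one go
    if b"\x00" in data:
        return False
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    updated = text.replace(old, new)
    if updated != text:
        # Write bytes back so that "\r\n" is not translated on any platform.
        Path(path).write_bytes(updated.encode("utf-8"))
    return True

with tempfile.TemporaryDirectory() as tmp:
    text_file = Path(tmp) / "notes.txt"
    text_file.write_bytes(b"colour\r\ncolour\r\n")
    binary_file = Path(tmp) / "blob.bin"
    binary_file.write_bytes(b"\x00\x01colour\xff")

    text_ok = replace_in_text_file(text_file, "colour", "color")
    binary_ok = replace_in_text_file(binary_file, "colour", "color")
    text_after = text_file.read_bytes()
    binary_after = binary_file.read_bytes()

print(text_ok, text_after)      # True b'color\r\ncolor\r\n'
print(binary_ok, binary_after)  # False b'\x00\x01colour\xff'
```

For files too large to hold in memory you would need two passes or a streaming approach, which this sketch does not attempt.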

> Elijah
> ------
> ASCII with embedded escapes? could be a VT100 animation

Or the output of software that colours its logs, maybe?
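
Either way, such a file passes both tests from earlier in the thread. A small sketch, with a made-up classify function and a made-up log line, that flags the escapes separately:

```python
def classify(data: bytes) -> str:
    """Label data as binary, plain text, or text carrying ESC bytes."""
    if b"\x00" in data:
        return "binary"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "binary"
    # ESC (0x1b) is a legal ASCII control character, so it is still valid UTF-8.
    return "text with escapes" if b"\x1b" in data else "plain text"

coloured_log = b"\x1b[32mINFO\x1b[0m server started\n"
plain_log = b"INFO server started\n"

print(classify(coloured_log))  # text with escapes
print(classify(plain_log))     # plain text
```

Whether "text with escapes" should be eligible for string replacement is a policy decision; the check only reports what it found.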

