Changing strings in files



> On 10 Nov 2020, at 19:30, Eli the Bearded <*@eli.users.panix.com> wrote:
> 
> In comp.lang.python, Chris Angelico <rosuav at gmail.com> wrote:
>> Eli the Bearded <*@eli.users.panix.com> wrote:
>>> Read first N lines of a file. If all parse as valid UTF-8, consider it text.
>>> That's probably the rough method file(1) and Perl's -T use. (In
>>> particular allow no nulls. Maybe allow ISO-8859-1.)
>> ISO-8859-1 is basically "allow any byte values", so all you'd be doing
>> is checking for a lack of NUL bytes.

A NUL check does not work for Windows UTF-16 files, where every ASCII
character is encoded with a NUL as its second byte.
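
A quick sketch of why, assuming the file holds plain ASCII text saved as
little-endian UTF-16 (the BOM is left out here for brevity). The helper name
looks_binary_by_nul is made up for illustration:

```python
text = "hello\n"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")  # little-endian UTF-16, no BOM in this sketch


def looks_binary_by_nul(data: bytes) -> bool:
    """The naive heuristic: any NUL byte means 'binary'."""
    return b"\x00" in data


print(looks_binary_by_nul(utf8))   # False
print(looks_binary_by_nul(utf16))  # True: every ASCII char carries a NUL byte
```

So perfectly ordinary text gets classified as binary by the NUL test alone.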

> 
> ISO-8859-1, unlike similar Windows "charset"s, does not use octets
> 128-190. Charsets like Windows CP-1252 are nastier, because they do
> use that range. Usage of 1-31 will be pretty restricted in either,
> probably not more than tab, linefeed, and carriage return.

Who told you that?

The C1 control set (0x80-0x9f) is used with the ISO 8-bit character sets.

One optimisation for the VT100 family of terminals is to send CSI as the
single byte 0x9b rather than as 0x1b '['. Back in the days of 9600 bps or
1200 bps connections, that was worth the effort.

> 
>> I'd definitely recommend
>> mandating UTF-8, as that's a very good way of recognizing valid text,
>> but if you can't do that then the simple NUL check is all you really
>> need.
> 
> Dealing with all UTF-8 is my preference, too.
> 
>> And let's be honest here, there aren't THAT many binary files that
>> manage to contain a total of zero NULs, so you won't get many false
>> hits :)

There is the famous EICAR anti-virus test file: a valid 8086 program for
DOS that consists entirely of printable ASCII characters.

> 
> There's always the issue of how much to read before deciding.

Simply read it all; after all, you have to scan the whole file to do the replacement.
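
Something along these lines, as a rough sketch only. It assumes the files are
small enough to hold in memory and that "text" means "decodes as UTF-8 with no
NULs", as suggested earlier in the thread. The function name
replace_in_text_file is invented for the example:

```python
from pathlib import Path
from typing import Optional


def replace_in_text_file(path: Path, old: str, new: str) -> Optional[int]:
    """Replace old with new if the whole file is valid UTF-8 text.

    Returns the number of replacements, or None if the file was skipped
    because it does not look like text.
    """
    data = path.read_bytes()  # read everything; we must scan it all anyway
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return None  # not valid UTF-8: leave the file untouched
    if "\x00" in text:
        return None  # NULs are valid UTF-8 but unlikely in real text
    count = text.count(old)
    if count:
        # Work in bytes so line endings are written back exactly as found.
        path.write_bytes(text.replace(old, new).encode("utf-8"))
    return count
```

Called as replace_in_text_file(Path("notes.txt"), "foo", "bar"), where
notes.txt is just a placeholder name. A real tool would probably write to a
temporary file and rename it rather than overwrite in place.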

> 
> Elijah
> ------
> ASCII with embedded escapes? could be a VT100 animation

The output of software that colours its logs maybe?

Barry

> -- 
> https://mail.python.org/mailman/listinfo/python-list
>