Changing strings in files
> On 10 Nov 2020, at 19:30, Eli the Bearded <*@eli.users.panix.com> wrote:
> In comp.lang.python, Chris Angelico <rosuav at gmail.com> wrote:
>> Eli the Bearded <*@eli.users.panix.com> wrote:
>>> Read first N lines of a file. If all parse as valid UTF-8, consider it text.
>>> That's probably the rough method file(1) and Perl's -T use. (In
>>> particular allow no nulls. Maybe allow ISO-8859-1.)
>> ISO-8859-1 is basically "allow any byte values", so all you'd be doing
>> is checking for a lack of NUL bytes.
A NUL check does not work for Windows UTF-16 files: they are text, yet every ASCII character in them carries a NUL byte.
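A minimal sketch of the problem and one possible workaround, assuming it is acceptable to look for a byte-order mark before doing the NUL check. The name `looks_like_text` is hypothetical, not something from this thread, and UTF-16 files written without a BOM would still be rejected:

```python
import codecs

def looks_like_text(data: bytes) -> bool:
    """Rough text check: BOM-marked UTF-16 first, then NUL-free strict UTF-8."""
    # Assumption: UTF-16 files of interest start with a byte-order mark.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        try:
            data.decode("utf-16")  # the codec uses the BOM to pick byte order
            return True
        except UnicodeDecodeError:
            return False
    # Everything else: a NUL byte means "binary".
    if b"\x00" in data:
        return False
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# "utf-16" encodes with a BOM followed by native-order code units.
sample = "hello\r\n".encode("utf-16")
print(b"\x00" in sample)        # the naive NUL check would call this binary
print(looks_like_text(sample))  # the BOM check accepts it as text
```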
> ISO-8859-1, unlike similar Windows "charset"s, does not use octets
> 128-190. Charsets like Windows CP-1252 are nastier, because they do
> use that range. Usage of 1-31 will be pretty restricted in either,
> probably not more than tab, linefeed, and carriage return.
Who told you that?
The C1 control set (0x80-0x9f) is used with the ISO 8-bit char sets.
One optimisation for the VT100 family of terminals is to send CSI as the
single byte 0x9b and not as 0x1b '['. Back in the days of 9600 bps or
1200 bps connections that was worth the effort.
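To make the byte values concrete, here is a small sketch. The helper `to_7bit_csi` is a hypothetical name, and it is only safe on data in a single-byte charset, since 0x9b can also appear as a continuation byte in UTF-8:

```python
import unicodedata

ESC = b"\x1b"
CSI_7BIT = ESC + b"["  # two bytes: ESC followed by '['
CSI_8BIT = b"\x9b"     # one byte: the C1 control CSI

# Each C1 control has a 7-bit form: ESC followed by (byte - 0x40).
print(bytes([CSI_8BIT[0] - 0x40]))  # the '[' of the two-byte form

# Python's ISO-8859-1 codec maps 0x9b to U+009B, a control character.
print(unicodedata.category(CSI_8BIT.decode("iso-8859-1")))

def to_7bit_csi(data: bytes) -> bytes:
    """Rewrite 8-bit CSI as ESC [ (assumes a single-byte charset, not UTF-8)."""
    return data.replace(CSI_8BIT, CSI_7BIT)

red_8bit = CSI_8BIT + b"31m"  # "set foreground red", one byte shorter
print(to_7bit_csi(red_8bit))
```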
>> I'd definitely recommend
>> mandating UTF-8, as that's a very good way of recognizing valid text,
>> but if you can't do that then the simple NUL check is all you really
> Dealing with all UTF-8 is my preference, too.
>> And let's be honest here, there aren't THAT many binary files that
>> manage to contain a total of zero NULs, so you won't get many false
>> hits :)
There is the famous EICAR virus test file: a valid 8086 program for DOS
that consists entirely of printable ASCII characters.
> There's always the issue of how much to read before deciding.
Simple: read it all. After all, you have to scan the whole file to do the replacement anyway.
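A rough sketch of that approach, assuming the UTF-8-only policy discussed above and files small enough to hold in memory. `replace_in_file` is a hypothetical helper name. It works on bytes for I/O so that line endings are left exactly as they were:

```python
from pathlib import Path

def replace_in_file(path: Path, old: str, new: str) -> bool:
    """Return True if the file was rewritten, False if skipped or unchanged."""
    raw = path.read_bytes()      # read the whole file once
    if b"\x00" in raw:           # a NUL anywhere: treat as binary
        return False
    try:
        text = raw.decode("utf-8")  # strict decode doubles as the text check
    except UnicodeDecodeError:
        return False
    if old not in text:
        return False             # nothing to do, leave the file untouched
    path.write_bytes(text.replace(old, new).encode("utf-8"))
    return True
```

The decision and the replacement share one pass over the data, so there is no separate "how much to read" question left to answer.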
> ASCII with embedded escapes? could be a VT100 animation
Or the output of software that colours its logs, maybe?
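If such files should count as text, one option is to strip the escape sequences before searching. A minimal sketch, assuming only 7-bit CSI sequences (ESC [ ... final byte) matter; `strip_csi` is a hypothetical name and other escape types (OSC, single-character escapes) are deliberately ignored:

```python
import re

# CSI sequence: ESC [, parameter bytes 0x30-0x3f, intermediate bytes
# 0x20-0x2f, then one final byte in 0x40-0x7e.
CSI_RE = re.compile(rb"\x1b\[[0-?]*[ -/]*[@-~]")

def strip_csi(data: bytes) -> bytes:
    """Remove 7-bit CSI sequences such as colour codes."""
    return CSI_RE.sub(b"", data)

line = b"\x1b[31mERROR\x1b[0m disk full\n"
print(strip_csi(line))
```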