
Changing strings in files

On Wed, Nov 11, 2020 at 6:52 AM Barry Scott <barry at> wrote:
> > On 10 Nov 2020, at 19:30, Eli the Bearded <*> wrote:
> >
> > In comp.lang.python, Chris Angelico <rosuav at> wrote:
> >> Eli the Bearded <*> wrote:
> >>> Read first N lines of a file. If all parse as valid UTF-8, consider it text.
> >>> That's probably the rough method file(1) and Perl's -T use. (In
> >>> particular allow no nulls. Maybe allow ISO-8859-1.)
> >> ISO-8859-1 is basically "allow any byte values", so all you'd be doing
> >> is checking for a lack of NUL bytes.
> A NUL check does not work for Windows UTF-16 files.

Yeah, so if you're expecting UTF-16, you would have to do the decode
to text first, and the check for NULs second. One of the big
advantages of UTF-8 is that you can do the checks in either order.
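
To make the ordering point concrete, here's a rough sketch. The function
names are made up for illustration, and it assumes the whole blob is
already in memory as bytes:

```python
def looks_like_utf8_text(data: bytes) -> bool:
    # In UTF-8 the byte 0x00 only ever encodes U+0000, so the NUL check
    # can run on the raw bytes, before or after decoding.
    if b"\x00" in data:
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def looks_like_utf16_text(data: bytes) -> bool:
    # UTF-16 stores every ASCII character with a zero byte next to it,
    # so decode first and only then look for U+0000 in the text.
    try:
        text = data.decode("utf-16")
    except UnicodeDecodeError:
        return False
    return "\x00" not in text
```

Feeding "hello".encode("utf-16") to the first function rejects it
because of the zero bytes, while the second one accepts it.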

> >> And let's be honest here, there aren't THAT many binary files that
> >> manage to contain a total of zero NULs, so you won't get many false
> >> hits :)
> There is the famous EICAR virus test file that is a valid 8086 program for
> DOS that is printing ASCII.

Yes. I didn't say "none", I said "aren't many" :) There's
fundamentally no way to know whether something is or isn't text based
on its contents alone. Raw audio data might just happen to look like
an RFC 822 email; it's just really, really unlikely.

> > There's always the issue of how much to read before deciding.
> Simply read it all; after all, you have to scan the whole file to do the replacement.

If the script's assuming it'll mostly work on small text files, it
might be very annoying to suddenly read in a 4GB blob of video file
just to find out that it's not text. But since we're talking
heuristics here, reading in a small chunk of the file is going to give
an extremely high chance of recognizing a binary file, with a
relatively small cost.
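
Something along these lines is what I have in mind. It's only a sketch
of the heuristic: probably_text and the 8192-byte sample size are
arbitrary choices of mine, and it assumes UTF-8 is the only text
encoding you care about:

```python
import codecs


def probably_text(path: str, sample_size: int = 8192) -> bool:
    # Read only the first chunk, never the whole file.
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    # Any NUL byte is treated as a sign of binary data.
    if b"\x00" in sample:
        return False
    # An incremental decoder tolerates a multi-byte character that is
    # cut off at the end of the sample. If the file is shorter than the
    # sample, nothing was cut off, so decode strictly (final=True).
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(sample, final=len(sample) < sample_size)
    except UnicodeDecodeError:
        return False
    return True
```

It will still be fooled by a binary file whose first 8 KB happen to be
valid UTF-8 with no NULs, but that's the nature of a heuristic.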