Changing strings in files


On Tue, 10 Nov 2020 22:08:54 +1100
Cameron Simpson <cs at cskk.id.au> wrote:

> On 10Nov2020 10:07, Manfred Lotz <ml_news at posteo.de> wrote:
> >On Tue, 10 Nov 2020 18:37:54 +1100
> >Cameron Simpson <cs at cskk.id.au> wrote:  
> >> Use os.walk for trees. scandir does a single directory.  
> >
> >Perhaps better. I like to use os.scandir this way
> >
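> >import os
> >from typing import Iterator
> >
> >def scantree(path: str) -> Iterator[os.DirEntry[str]]:
> >    """Recursively yield DirEntry objects (no directories)
> >          for a given directory.
> >    """
> >    for entry in os.scandir(path):
> >        if entry.is_dir(follow_symlinks=False):
> >            # Descend into real subdirectories (symlinks are not followed).
> >            yield from scantree(entry.path)
> >        else:
> >            # Only non-directory entries are yielded, as the docstring says.
> >            yield entry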
> >
> >Worked fine so far. I think I coded it this way because I wanted the
> >full path of the file the easy way.  
> 
> Yes, that's fine and easy to read. Note that this is effectively a 
> recursive call though, with the associated costs:
> 
> - a scandir (or listdir, whatever) has the directory open, and holds
> it open while you scan the subdirectories; by contrast os.walk only
> opens one directory at a time
> 
> - likewise, if you're maintaining data during a scan, that is held
> while you process the subdirectories; with an os.walk you tend to do
> that and release the memory before the next iteration of the main
> loop (obviously, depending exactly what you're doing)
> 
> However, directory trees tend not to be particularly deep, and the
> depth governs the excess state you're keeping around.
> 

Very interesting information. Thanks a lot for this. I will take a
closer look at os.walk.
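
For comparison, here is a minimal sketch of a similar generator built on
os.walk. It assumes only full file paths are needed, so it yields strings
rather than DirEntry objects. The name walk_files is hypothetical and does
not come from this thread.

```python
import os
from typing import Iterator


def walk_files(top: str) -> Iterator[str]:
    """Yield the full path of every name os.walk reports as a file below top."""
    # os.walk yields (dirpath, dirnames, filenames) once per directory and
    # does not descend into symlinked directories unless followlinks=True.
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            # Joining with dirpath gives the full path "the easy way".
            yield os.path.join(dirpath, name)
```

One difference from scantree: a symlink that points to a directory is listed
by os.walk among the directory names, so this sketch does not yield it.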

> >> >   - check if a file is a text file  
> >>
> >> This requires reading the entire file. You want to check that it
> >> consists entirely of lines of text in your expected text encoding.
> >> These days UTF-8 is the common default, but getting this correct
> >> is essential if you want to recognise text. So as a first cut,
> >> totally untested:
> >>
> >> ...  
> >
> >The reason I want to check if a file is a text file is that I don't
> >want to try replacing patterns in binary files (executable binaries,
> >archives, audio files and so on).  
> 
> Exactly, which is why you should not trust, say, the "file" utility.
> It scans only the opening part of the file. Great for rejecting
> files, but not reliable for being _sure_ about the whole file being
> text when it doesn't reject.
> 
> >Of course, to make this work nicely some heuristic check would be the
> >right thing (this is what the file command does). I am aware that a
> >heuristic check is not 100% reliable but I think it is good enough.  
> 
> Shrug. That is a risk you must evaluate yourself. I'm quite paranoid 
> about data loss, myself. If you've got backups or are working on
> copies the risks are mitigated.
> 
> You could perhaps take a more targeted approach: do your target files 
> have distinctive file extensions (for example, all the .py files in a 
> source tree)?
> 

There are some distinctive file extensions. The reason I am satisfied
with heuristics is that the string to change is pretty long, so there
is no real danger if I try to change it in a binary file: that string
is not going to be found in binary files.

The idea to skip binary files was simply to save time. 
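
A minimal sketch of such a heuristic follows. It assumes that a NUL byte in
the first block marks a file as binary (a common rule of thumb; UTF-16 text
would be misclassified) and that the text files are UTF-8. The names
looks_binary and replace_in_file and the 8192-byte block size are
illustrative assumptions, not something from this thread.

```python
def looks_binary(path: str, blocksize: int = 8192) -> bool:
    """Heuristic: treat a file as binary if its first block contains a NUL byte."""
    with open(path, "rb") as f:
        return b"\0" in f.read(blocksize)


def replace_in_file(path: str, old: str, new: str) -> bool:
    """Replace old with new in a UTF-8 text file; return True if it was rewritten."""
    if looks_binary(path):
        # Cheap early exit: skip the file without reading all of it.
        return False
    try:
        # newline="" keeps the original line endings unchanged.
        with open(path, encoding="utf-8", newline="") as f:
            text = f.read()
    except UnicodeDecodeError:
        # Not valid UTF-8 after all: leave the file untouched.
        return False
    if old not in text:
        return False
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text.replace(old, new))
    return True
```

The strict UTF-8 decode of the whole file acts as a second check for files
that pass the quick test, so a file that is not valid UTF-8 is never
rewritten. The sketch rewrites files in place, so it is best tried on copies
first.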

-- 
Manfred