on sorting things


On 20Dec2019 08:23, Chris Angelico <rosuav at gmail.com> wrote:
>On Fri, Dec 20, 2019 at 8:06 AM Eli the Bearded <*@eli.users.panix.com> wrote:
>> Consider a sort that first compares file size and if the same number
>> of bytes, then compares file checksum. Any decently scaled real world
>> implementation would memoize the checksum for speed, but only work it out
>> for files that do not have a unique file size. The key method requires
>> it worked out in advance for everything.
>>
>> But I see the key method handles the memoization under the hood for you,
>> so those simpler, more common sorts of sort get an easy to see benefit.
>
>I guess that's a strange situation that might actually need this kind
>of optimization, but if you really do have that situation, you can
>make a magical key that behaves the way you want.
[... example implementation ...]

The classic situation matching Eli's criteria is comparing file trees 
for equivalent files, for backup or synchronisation or hard linking 
purposes; I've a script which does exactly what he describes in terms of 
comparison (size, then checksum, but I checksum a short prefix before 
doing a full file checksum, so even more fiddly).
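
As a rough illustration of that staged comparison (size, then a prefix
checksum, then a full checksum), here is a minimal sketch of a lazy sort
key in Python. It is not the implementation elided above. The class name
LazyFileKey, the 4096-byte prefix length and the choice of SHA-256 are
all assumptions made for the example. Each checksum is computed only the
first time a comparison needs it, and is then cached on the key object.

```python
import hashlib
import os
from functools import cached_property, total_ordering


@total_ordering
class LazyFileKey:
    """Hypothetical sort key: size, then prefix checksum, then full checksum."""

    PREFIX_BYTES = 4096  # assumed prefix length, chosen for illustration

    def __init__(self, path):
        self.path = path
        self.size = os.path.getsize(path)  # cheap, so computed eagerly

    @cached_property
    def prefix_sum(self):
        # Only read the first PREFIX_BYTES bytes of the file.
        with open(self.path, "rb") as f:
            return hashlib.sha256(f.read(self.PREFIX_BYTES)).hexdigest()

    @cached_property
    def full_sum(self):
        # Hash the whole file in 1 MiB chunks.
        h = hashlib.sha256()
        with open(self.path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def _compare(self, other):
        # Walk the stages in order of cost; stop at the first difference,
        # so later (more expensive) stages are never touched needlessly.
        for attr in ("size", "prefix_sum", "full_sum"):
            a, b = getattr(self, attr), getattr(other, attr)
            if a != b:
                return -1 if a < b else 1
        return 0

    def __eq__(self, other):
        return self._compare(other) == 0

    def __lt__(self, other):
        return self._compare(other) < 0
```

With this, something like `paths.sort(key=LazyFileKey)` never checksums a
file whose size is unique in the list, since every comparison involving
it is settled at the size stage.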

However, my example above isn't very amenable to sorting, because you
never bother looking at checksums at all for files of different sizes.
OTOH, I do sort the files by size before running the checksum phases,
which lets one sync/reclaim the big files first, for example; that
ordering is a policy choice.
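
To make the grouping approach concrete, here is a minimal sketch in
Python, not my actual script. The function names (file_digest, split_by,
duplicate_groups), the 4096-byte prefix and the use of SHA-256 are
assumptions for illustration. Files are bucketed by size, the buckets
are visited biggest first, and checksums are only computed inside
buckets holding more than one file.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, limit=None):
    """SHA-256 of the first `limit` bytes, or of the whole file if None."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        if limit is not None:
            h.update(f.read(limit))
        else:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
    return h.hexdigest()


def split_by(paths, keyfunc):
    """Group paths by keyfunc(path); keep only groups of two or more."""
    groups = defaultdict(list)
    for p in paths:
        groups[keyfunc(p)].append(p)
    return [g for g in groups.values() if len(g) > 1]


def duplicate_groups(paths, prefix_bytes=4096):
    """Yield (size, [paths]) for sets of identical files, largest first."""
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)
    # Policy choice: handle the biggest files first.
    for size in sorted(by_size, reverse=True):
        group = by_size[size]
        if len(group) < 2:
            continue  # unique size: never checksummed at all
        # Cheap prefix checksum first, full checksum only for survivors.
        for prefix_group in split_by(group, lambda p: file_digest(p, prefix_bytes)):
            for dup_group in split_by(prefix_group, file_digest):
                yield size, dup_group
```

The only sort here is over the distinct sizes, which is why the
comparison itself doesn't need a clever key: files of different sizes
are never compared by checksum in the first place.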

Cheers,
Cameron Simpson <cs at cskk.id.au>