[Feature suggestion] extend file matching algorithm in Content comparison mode

Posts: 3 · Sorontik 24 Apr 2022, 17:43

Hello everyone,

The number of posts about this topic over the past 10+ years clearly shows a huge interest in a feature that reliably detects moved/renamed files, even on the first sync and independently of the underlying file system. I believe my concept makes this possible in an acceptable and reliable way, so I thought I'd suggest it and give you a chance to comment on it.

My idea is to extend the FFS compare algorithm in content-compare mode so it has a chance to find files that were moved or renamed, but are otherwise unchanged.

There is no guarantee that all moved/renamed files are detected. I just want to reduce the number of false negatives, i.e. files that were moved/renamed without FFS detecting it.

FFS has to somehow perform the following steps (potentially intertwined) for each comparison run:
1. Enumerate all files on both sides
2. Identify pairs of potentially matching files
3. Perform the compare on all of those pairs (according to compare settings)

According to the FFS manual on comparison settings, Step 2 is currently done by pairing files with equal relative paths and names.
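
To make the starting point concrete, here is a minimal Python sketch of pairing by relative path. It is only an illustration under my own assumptions, not FFS's actual code: the function name `pair_by_relative_path` and the use of plain path strings are hypothetical.

```python
def pair_by_relative_path(left, right):
    """Pair files that have the same relative path on both sides.

    left, right: iterables of relative path strings (hypothetical input format).
    Returns (paired, left_only, right_only) as sorted lists.
    """
    left_set, right_set = set(left), set(right)
    paired = sorted(left_set & right_set)       # candidates for step 3
    left_only = sorted(left_set - right_set)    # unpaired, exists only on the left
    right_only = sorted(right_set - left_set)   # unpaired, exists only on the right
    return paired, left_only, right_only


paired, left_only, right_only = pair_by_relative_path(
    ["docs/a.txt", "docs/b.txt"],
    ["docs/a.txt", "archive/b.txt"],
)
```

In this example `docs/b.txt` and `archive/b.txt` both end up unpaired, which is exactly the situation the rest of this post is about.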

My idea is to extend this step 2 in the content compare mode to add more potential pairs to the list.
Since all potential pairs are verified by bitwise content comparison in step 3, we don't need to worry about false positives in step 2.
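
As a rough sketch of why false positives are harmless, step 3 can be thought of as a filter over the candidate list. The helper names `same_content` and `verify_candidates` are made up for illustration, and the chunked read is just one plausible way to do a bitwise compare, not necessarily how FFS does it.

```python
import os


def same_content(path_a, path_b, chunk_size=1 << 20):
    """Return True only if both files are byte-for-byte identical."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False  # different sizes can never match
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                return False
            if not chunk_a:  # both files reached the end together
                return True


def verify_candidates(candidates):
    """Keep only the (left, right) guesses that survive the bitwise compare."""
    return [(a, b) for a, b in candidates if same_content(a, b)]
```

A wrong guess from step 2 costs one extra comparison and is then dropped, so it can never lead to a wrong sync decision.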

I can't see any problems aside from the extra time needed for the additional comparisons, but I think it's safe to say:
For anyone resorting to content comparison mode, runtime is not a major priority.
Therefore I think it makes sense in this mode to offer time-consuming options if they provide functionality that can't really be achieved otherwise.

I can see 2 different cases where we can try to increase our chances of finding two identical but moved/renamed files.

1. moved but not renamed
2. renamed (and maybe also moved)

Case 1:
We keep a list of all unpaired files.
When we find a file that exists on only one side and the list has no entry for its filename, we add the file to the list.
Otherwise we add the current file, together with the matching list entry, to our list of potential matches, which will be verified in step 3.
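
A minimal sketch of Case 1 could look like the following. The name `candidates_by_name` is hypothetical, and I am assuming that only entries from opposite sides should be offered as potential matches.

```python
import posixpath
from collections import defaultdict


def candidates_by_name(left_only, right_only):
    """Guess moved files: unpaired files on opposite sides with the same file name.

    left_only, right_only: relative paths of files that exist on one side only.
    Returns a list of (left_path, right_path) guesses to be verified in step 3.
    """
    seen = defaultdict(list)  # file name -> [(side, path)] of unpaired files so far
    candidates = []
    for side, paths in (("left", left_only), ("right", right_only)):
        for path in paths:
            name = posixpath.basename(path)
            for other_side, other_path in seen[name]:
                if other_side != side:  # assumption: only match across sides
                    candidates.append((other_path, path))
            seen[name].append((side, path))
    return candidates


guesses = candidates_by_name(
    ["a/report.txt", "x/new.txt"],
    ["b/report.txt", "c/other.txt"],
)
```

Here only `a/report.txt` and `b/report.txt` share a file name, so that pair is the single guess handed to step 3. If several unpaired files share a name, every cross-side combination is offered.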

Case 2:
We keep a hash table of unpaired files.
When we find a file that exists on only one side, we calculate a hash value for it and look it up in the hash table.
If there is no entry, we add a new one for the current file.
If there are entries for this hash value, we treat each of them as a potential match for the current file.
During step 3, these guesses will be verified or falsified.
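
Case 2 could be sketched roughly like this. The choice of key (file size plus a SHA-256 of the first 64 KiB) is purely my assumption, made to keep the hash cheap. The names `quick_key` and `candidates_by_hash` are hypothetical as well.

```python
import hashlib
import os
from collections import defaultdict


def quick_key(path, prefix_bytes=64 * 1024):
    """Cheap lookup key: file size plus a hash of the first prefix_bytes (assumed scheme)."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read(prefix_bytes)).hexdigest()
    return os.path.getsize(path), digest


def candidates_by_hash(left_only, right_only):
    """Guess renamed files: unpaired files on opposite sides with equal quick_key.

    Returns (left_path, right_path) guesses. They are NOT proven equal here;
    step 3 still has to verify each one bit by bit.
    """
    table = defaultdict(list)  # key -> unpaired left-side paths
    for path in left_only:
        table[quick_key(path)].append(path)
    candidates = []
    for path in right_only:
        for left_path in table.get(quick_key(path), []):
            candidates.append((left_path, path))
    return candidates
```

Two files with the same size and the same first 64 KiB but different later content would still be offered as a guess here, which is fine because step 3 falsifies it.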

Actually, Case 1 is a special variant of Case 2, but since calculating the hash value for a file (possibly requiring a download of at least part of the file) can be expensive, I thought it makes sense to deal with Case 1 in a cheaper way.

Case 1 could also be a helpful - yet dangerous - addition to the other comparison modes.

Just to be very clear about that: this post is NOT about using hashes to verify data transmission/storage or to "speed up" the file content comparison by using hashes.
Both ideas have been discussed plenty already and have always been dismissed as impossible or not beneficial, which I totally agree with.

What do you think?

Posts: 1 · andublin 25 May 2022, 13:57

Sounds good; any way to improve the detection of moved/renamed files should be considered.