Feature request: faster content comparison

Discuss new features and functions
Posts: 2
Joined: 21 Oct 2003

edrandall

I think the file content comparison could be made quicker. (I've done this myself in a Perl script I used before discovering FreeFileSync.)

Once the existence of two duplicate candidate files has been established (by filename+path match), don't compute a checksum for the whole file. Just do the first 10 kB or 100 kB of each; typically this is enough to tell whether the files differ. Only continue calculating, in further blocks of 10 or 100 kB, if the first block turns out to be identical.
This can save a lot of time on large image or video files, especially if one drive is accessed over a slow network link.

Provide further support for this with a UI option: a checkbox and size setting, "Consider files identical if first NNN kB match". If checked, the comparison stops after NNN kB and deems the files identical anyway; if unchecked, the comparison continues up to the full size of the file (it always stops once a difference is encountered).
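Roughly what I have in mind, as a minimal Python sketch (the block size and the optional NNN-kB cutoff are illustrative values, not anything FreeFileSync currently exposes):

def files_match(path_a, path_b, block_size=100 * 1024, max_prefix=None):
    # Compare two files block by block, stopping at the first difference.
    # If max_prefix (bytes) is set, stop after that many bytes and deem the
    # files identical, as per the proposed "first NNN kB match" checkbox.
    bytes_read = 0
    with open(path_a, 'rb') as fa, open(path_b, 'rb') as fb:
        while True:
            block_a = fa.read(block_size)
            block_b = fb.read(block_size)
            if block_a != block_b:
                return False  # first difference found: stop early
            if not block_a:
                return True   # both files ended together: identical
            bytes_read += len(block_a)
            if max_prefix is not None and bytes_read >= max_prefix:
                return True   # prefix matched: deem identical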
Posts: 2
Joined: 7 Mar 2002

rmaus

I also suggest a variant of Ed's proposal (or yet another option): examine both the first AND LAST chunks of a file when comparing. This is especially useful for rolling log files that are truncated to a fixed size.
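Something along these lines, as a rough Python sketch (the 64 kB probe size is arbitrary):

import os

def probe_first_and_last(path_a, path_b, probe=64 * 1024):
    # Quick check: file sizes, then the first and last `probe` bytes of each.
    # A match here does not prove the files are identical; it only says a full
    # comparison is worth doing (or can be skipped, per the proposed option).
    size = os.path.getsize(path_a)
    if size != os.path.getsize(path_b):
        return False
    with open(path_a, 'rb') as fa, open(path_b, 'rb') as fb:
        if fa.read(probe) != fb.read(probe):
            return False                        # heads differ
        if size > probe:
            fa.seek(max(size - probe, probe))   # avoid re-reading the head
            fb.seek(max(size - probe, probe))
            return fa.read(probe) == fb.read(probe)  # compare tails
        return True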
Posts: 3
Joined: 22 Jun 2022

Werve

I'd like to join in asking for an option in the settings to speed up the comparison using a checksum method (perhaps in the donation package, along with the other performance options?).
Open source projects using this approach are very fast:
https://github.com/arsenetar/dupeguru/
https://github.com/qarmin/czkawka
Posts: 16
Joined: 10 Jun 2019

Backitup

I did a synchronization of over 100,000 files earlier and the comparison finished in a very short time, roughly 10 to 15 seconds. That's pretty fast in my opinion. However, if there's room for improvement, then I'm all for it.
Posts: 3
Joined: 22 Jun 2022

Werve

On what storage type were the files stored?
I tried between a mechanical HDD and a microSD card, about 10 GB on each side (200+ files per side), and it took almost 3 hours. It also depends on where the differences are: for example, if two files have the same bytes except for the last one, the byte-by-byte comparison needs a lot more time; otherwise the process stops earlier (at the first difference found).
I remember running comparisons with the programs I linked, which use the chunk-checksum method, and they took minutes (I don't remember precisely how long, but well under an hour).
I think the easiest way to reduce the time is to treat files with different sizes as different before actually comparing content.
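For illustration, that pre-check is trivial (Python sketch):

import os

def same_size(path_a, path_b):
    # Files of different sizes cannot have identical content,
    # so the byte-by-byte comparison can be skipped entirely.
    return os.path.getsize(path_a) == os.path.getsize(path_b)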

Also, the BLAKE3 hash seems to be the fastest while remaining reliable: https://github.com/BLAKE3-team/BLAKE3
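For example, hashing just a fixed-size chunk instead of the whole file, using the blake3 Python bindings (assuming they are installed from PyPI; the 1 MB chunk size is arbitrary):

import blake3

def chunk_digest(path, offset=0, length=1024 * 1024):
    # Hash only `length` bytes starting at `offset`, not the whole file.
    with open(path, 'rb') as f:
        f.seek(offset)
        return blake3.blake3(f.read(length)).hexdigest()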
Posts: 2270
Joined: 22 Aug 2012

Plerry

The use of checksums for comparing files has been suggested multiple times, just as it has been explained multiple times that, although not technically impossible, this would not speed up the FFS comparison.
e.g.
viewtopic.php?t=5512
viewtopic.php?t=6296
viewtopic.php?t=1744
Site Admin
Posts: 7048
Joined: 9 Dec 2007

Zenju

The use of checksums [...] would not speed up the FFS comparison. Plerry, 23 Jun 2022, 11:11
Correct, but performance with checksums is much worse: while FFS is comparing file content, it stops at the first byte difference found. Using checksums, on the other hand, would always require reading the full file content.
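To illustrate the difference in work done (a Python sketch of the two strategies, not actual FFS code): the streaming comparison can return after the first block, while any whole-file checksum has to read both files completely before it can say anything.

import hashlib

def compare_streaming(path_a, path_b, block=256 * 1024):
    # Stops at the first differing block; best case reads one block per file.
    with open(path_a, 'rb') as fa, open(path_b, 'rb') as fb:
        while True:
            a, b = fa.read(block), fb.read(block)
            if a != b:
                return False
            if not a:
                return True

def compare_by_checksum(path_a, path_b, block=256 * 1024):
    # Always reads both files in full, even if the very first bytes differ.
    digests = []
    for path in (path_a, path_b):
        h = hashlib.sha256()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(block), b''):
                h.update(chunk)
        digests.append(h.digest())
    return digests[0] == digests[1]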
Posts: 3
Joined: 22 Jun 2022

Werve

The use of checksums [...] would not speed up the FFS comparison. Plerry, 23 Jun 2022, 11:11
Correct, but performance with checksums is much worse: while FFS is comparing file content, it stops at the first byte difference found. Using checksums, on the other hand, would always require reading the full file content. Zenju, 23 Jun 2022, 11:19
I also think that, in most cases, directly comparing bytes is faster than calculating a checksum of the entire file. That is why I suggested the method of partial checksums: for example, a checksum of 10 KB at the beginning of the file, one at the location 50% into the file content, and one at the end. One of the tools I mentioned, dupeguru, did not have this system, but since it was added it has become much faster. In particular, it performs these partial checksums only if the file is larger than X (user input).
This avoids wasting a lot of time in cases where the first byte difference is at the end of a large file.
But actually the same logic works even without checksums, by changing the comparison of large files to something other than a linear read.
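As a rough Python sketch of that idea (the 10 kB sample size and the size threshold stand in for the user-tunable values; this is not how dupeguru literally implements it):

import hashlib
import os

def partial_fingerprint(path, sample=10 * 1024, threshold=16 * 1024 * 1024):
    # Hash 10 kB at the start, middle and end of large files; hash small
    # files in full. Equal fingerprints mean "probably identical", so a
    # full byte comparison can still be run afterwards if desired.
    size = os.path.getsize(path)
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        if size <= threshold:
            h.update(f.read())               # small file: hash everything
        else:
            for offset in (0, size // 2, size - sample):
                f.seek(offset)
                h.update(f.read(sample))     # start, middle, end samples
    return (size, h.hexdigest())             # include size in the key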

The use of checksums for comparing files has been suggested multiple times, just as it has been explained multiple times that, although not technically impossible, this would not speed up the FFS comparison.
e.g.
viewtopic.php?t=5512
viewtopic.php?t=6296
viewtopic.php?t=1744 Plerry, 23 Jun 2022, 11:11
In the test I mentioned before, using the same folders, the program that compares with the partial-checksum method finished earlier. So at least in some cases it is faster than byte-by-byte checking.
That's why I suggested adding an option (a checkbox, maybe near the other performance settings) rather than replacing the default comparison.