Hi.
I've been trying to figure out whether it would be possible, when comparing files,
to first compare by timestamp and only after that compare by content.
What I would like to achieve is a faster comparison.
Best regards
Laeffe
Compare timestamps then content
- Posts: 1
- Joined: 16 Oct 2013
- Posts: 2450
- Joined: 22 Aug 2012
As can be found in the FFS help file under Comparison Settings, you can choose between
I. Compare by "File time and size"
II. Compare by "File content"
Obviously, option I compares faster than option II.
A (presently non-existing) option III, Compare by "File time and then file content",
would be about equally fast as option I if the timestamp is different,
and about as slow as option II if the timestamp is equal.
Anyway, it is at best as fast as option I, never faster.
The only potential benefit would be a more refined detection of conflicts.
(But, with the timestamps already being the same, how likely is it that the
content differs if the file size is also the same?)
Option III could be faster than option II when used as a consistency check,
but it is only faster in cases where you actually do have inconsistencies (in the timestamps).
However, when checking consistency, you normally do not expect to have any inconsistencies.
And, checking fully consistent sets takes essentially as long in option III as in option II.
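For illustration, the three variants could be sketched like this (hypothetical Python, not FFS's actual implementation; the 2-second tolerance is just an example value):

```python
import filecmp
import os

def compare_time_and_size(a: str, b: str, tolerance_s: float = 2.0) -> bool:
    """Option I: equal if sizes match and mtimes agree within a tolerance."""
    sa, sb = os.stat(a), os.stat(b)
    return sa.st_size == sb.st_size and abs(sa.st_mtime - sb.st_mtime) <= tolerance_s

def compare_content(a: str, b: str) -> bool:
    """Option II: byte-by-byte comparison; slow for large files."""
    return filecmp.cmp(a, b, shallow=False)

def compare_time_then_content(a: str, b: str, tolerance_s: float = 2.0) -> bool:
    """Hypothetical option III: cheap timestamp check first, content only on a tie."""
    sa, sb = os.stat(a), os.stat(b)
    if abs(sa.st_mtime - sb.st_mtime) > tolerance_s:
        return False  # timestamps differ: as fast as option I
    return compare_content(a, b)  # timestamps equal: as slow as option II
```

As the sketch shows, option III only skips the content read when the timestamps differ; for identical timestamps it pays the full cost of option II.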
Or am I completely missing the point ... ?
The comparison you suggest only seems to make sense if you have some timestamp issues.
If that is the case, you might also consider using option I and relaxing the
FileTimeTolerance (see the FFS help under Expert Settings).
- Posts: 23
- Joined: 15 Aug 2009
>option 3: The only potential benefit would be, you have a more refined detection of conflicts.
This makes total sense if I don't trust my storage media.
This could be as simple as CRC errors due to bad SATA cables on one drive, or other read errors.
Also, according to a recent article in c't computer magazine (which totally reflects my experience over the years), the error rate of consumer hard disks is no longer small once we consider TB-scale storage. So there may well be two files with the same date but different content, and it would be important to be notified.
Klaus
- Posts: 2450
- Joined: 22 Aug 2012
If you don't trust your storage media and want to check for consistency,
why not simply use option II: Compare by "File content" ?
After all:
> Option III could be faster than option II when used as a consistency check,
but is only faster in the case you do have inconsistencies (on timestamp).
...
The comparison you suggest only seems to make sense if you have some timestamp issues.
The likelihood of your timestamps getting corrupted by your storage media
is obviously much smaller than the likelihood that the content gets corrupted.
For such an unlikely case, i.m.h.o. it does not seem worth adding option III, as,
when you have no inconsistencies in the timestamps, option II and option III
are essentially equally fast.
And don't forget: FFS is first and foremost a synchronisation tool;
anything beyond that is nice, but not its prime scope ...
The fastest way to check for inconsistencies is building a checksum-hash
database locally at either side (using a hash algorithm you trust),
and then compare (the checksum of) both databases.
(See e.g. Maatkit / Percona-toolkit).
But, that's quite a different ballgame than FFS.
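As an illustration of that hash-database idea (a minimal sketch, assuming local access to both sides and using SHA-256 as one possible trusted algorithm; any of the dedicated tools mentioned will be far more sophisticated):

```python
import hashlib
import os

def hash_tree(root: str) -> dict:
    """Walk a directory tree; map each relative path to the SHA-256 of its content."""
    digests = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            digests[os.path.relpath(path, root)] = h.hexdigest()
    return digests

def inconsistencies(left: str, right: str) -> list:
    """Relative paths that are missing on one side or differ in content."""
    a, b = hash_tree(left), hash_tree(right)
    return sorted(p for p in a.keys() | b.keys() if a.get(p) != b.get(p))
```

Once such databases exist on both sides, only the (small) digests need to be transferred and compared, not the file contents themselves.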
- Site Admin
- Posts: 7211
- Joined: 9 Dec 2007
> FFS is first and foremost a synchronisation tool; anything beyond that is nice, but not its prime scope ...
FFS is actually about both file synchronization and consistency checks. "Compare by content" is specifically designed with a scenario like that in mind. It's a manual, but highly effective way to make sure that a backup medium is in a consistent state.
Personally, I use FFS in the following way:
First I associate the data I want to preserve with a checksum. This can be done either by storing it in a .zip or .rar archive or by calculating the .md5 and storing it next to the files. At this point I know that whenever there is data corruption in the future, the checksums will reveal the problem instantly.
However, what I don't know at this point is whether the data associated with the checksum was really consistent in the first place (yes, I've been bitten by this in the past...). This is where FFS comes into play: after the checksums have been generated, I compare the data again (extracting it again from the archives) against an authoritative reference data set using FFS and the "binary comparison" variant. If there were any errors up to the point where the checksums were generated, I would now find out with the help of FFS.
When there are no errors, I have absolute confidence that the checksums reference valid data and can use them in the future to check consistency at any time.
This scheme may take some getting used to, but it helps me sleep well at night knowing that I can always trust the data stored in my archive files (or otherwise associated with checksums), no matter how unreliable the storage medium is.
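The .md5 part of this scheme could be sketched roughly like this (a hypothetical helper using the common "digest  filename" sidecar format; not code from FFS or any particular md5 tool):

```python
import hashlib
import os

def _md5_of(path: str) -> str:
    """Compute the MD5 digest of a file, reading it in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def write_md5_sidecar(path: str) -> str:
    """Store the file's MD5 next to it as '<file>.md5'."""
    sidecar = path + ".md5"
    with open(sidecar, "w") as f:
        f.write(f"{_md5_of(path)}  {os.path.basename(path)}\n")
    return sidecar

def verify_md5_sidecar(path: str) -> bool:
    """Later consistency check: recompute the MD5 and compare with the sidecar."""
    with open(path + ".md5") as f:
        stored = f.read().split()[0]
    return _md5_of(path) == stored
```

The verification step detects corruption that happened after the sidecar was written; the FFS binary comparison against a reference set covers errors that happened before.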
- Posts: 3
- Joined: 2 Jun 2022
Heya, I just found this thread, since I was searching for exactly this feature. I guess FFS doesn't have anything like that so far?
Why I would consider this really useful
You might have files that get synchronized automatically by other means that don't preserve timestamps (e.g. Steam game-save sync), or maybe you just downloaded the same file on both PCs independently in the meantime. Doing a content comparison when date & size produce a conflict would avoid the need for manual resolution in those detectable scenarios. I hope I don't need to point out the obvious speed issues of content-only comparison... xP
Implementation
I'm thinking about making it a checkbox for the "File time and size" variant, to enable content comparison as a fallback for conflicts with identical file size. And *maybe* also adding a field for setting a max. file size for it (so you don't check the content of huge files if you're on a slow network; re-using the fields used in the filter tab). RFC?
Alternatively, a context menu option "compare contents of conflicts", available when you have conflicts selected, which auto-resolves identical files, would be another option. But I guess that would be more effort to implement, and also not used by that many people (hard to discover).
Remark
If you don't deem this feature worth the effort, maybe I'll try to implement it myself if I find some time. I could send you the patches for review & merge afterwards, if the feature is considered useful enough. Based on what I found on the forum here, it has already been requested quite a few times.
Related
Threads that seem to request the same idea/concept (just tried to collect them all for once):
viewtopic.php?t=8995
viewtopic.php?t=8774
viewtopic.php?t=7117
viewtopic.php?t=5836
viewtopic.php?t=3861 (for other/"weird" reasons, but could give the user the desired output as a side effect)
viewtopic.php?t=2333 (requested different solution, but to same core problem)
viewtopic.php?t=1512
viewtopic.php?t=1420 (this thread)