xxhash & hash-sync mode

Posts: 12 · +ffsuser+ 13 Jan 2021, 19:23

please excuse me, if this topic has been mentioned before - my search did not yield any hits.

i would like to know if the following scenario is conceivable in ffs:

hash source files - sync - hash destination files - compare hashes

as this is quite time consuming, it would be advantageous to integrate xxhash into ffs: https://github.com/Cyan4973/xxHash

the speed of xxhash is quite impressive, in my opinion.
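
For readers who want to picture the "hash source - sync - hash destination - compare hashes" scenario, here is a minimal sketch. It is not how FFS works. The function names `file_digest` and `copy_and_verify` are made up for illustration, and BLAKE2b from Python's standard library stands in for xxHash so the example needs no third-party package.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files never have to fit in RAM."""
    # Assumption: BLAKE2b is only a stand-in here; it is not xxHash.
    h = hashlib.blake2b(digest_size=16)
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()


def copy_and_verify(src: Path, dst: Path) -> bool:
    """Hash the source, copy it, hash the destination, compare the digests."""
    src_hash = file_digest(src)
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)
    return file_digest(dst) == src_hash


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "left" / "a.bin"
        src.parent.mkdir()
        src.write_bytes(b"example payload" * 1000)
        print(copy_and_verify(src, Path(tmp) / "right" / "a.bin"))
```

Both sides are read in full, once for each digest, so for two local disks the disk throughput rather than the hash function is likely to be the limit.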

xxHash is an Extremely fast Hash algorithm, running at RAM speed limits. It successfully completes the SMHasher test suite which evaluates collision, dispersion and randomness qualities of hash functions. Code is highly portable, and hashes are identical across all platforms (little / big endian).

Benchmarks

Hash Name       Width   Bandwidth (GB/s)   Small Data Velocity   Quality
XXH3 (SSE2)     64      31.5               133.1                 10
XXH128 (SSE2)   128     29.6               118.1                 10
XXH64           64      19.4               71.0                  10
XXH32           32      9.7                71.9                  10

Posts: 2451 · Plerry 14 Jan 2021, 11:58

See
viewtopic.php?t=6709&p=22256#p22256
viewtopic.php?t=6296&p=20695#p20695

Posts: 12 · +ffsuser+ 14 Jan 2021, 13:23

hi Plerry,

thanks a lot for the links.

first off, i just realized that my wording was a bit vague. of course i was referring to local disk-to-disk synchronization only, plus a verification of the hashes of both sides after the synchronization has finished.

anyway, seems like this option already exists and my suggestion comes far too late? :)

i was just so impressed by the benchmarks that xxhash offers, so i believed ffs (file content variant also) could greatly benefit from its speed.

do you have more details on how the VerifyCopiedFiles process works right now? from what i understood, it does not make use of a hash mechanism?

Posts: 2451 · Plerry 18 Jan 2021, 08:20

See the FFS Manual section on VerifyCopiedFiles and the further info it refers to.
And also that verification does not use hashes, as FFS does not use hashes, for the reasons explained in the earlier references.

Posts: 12 · +ffsuser+ 18 Jan 2021, 12:16

thanks for replying.

i understood that ffs does not make use of hash mechanisms.

my intention in mentioning xxhash was rather to suggest a possible new mode similar to the (quite slow) file-content mode, but only for mirroring locally connected devices, e.g. hdd -> external hdd, based on xxhash (including optional verification afterwards).

i am not sure if it can be implemented without major hurdles - theoretically, however, it should result in a safety and speed advantage due to the performance of xxhash.

Posts: 2451 · Plerry 18 Jan 2021, 14:19

As should be clear by now: it simply does not make sense for FFS to make use of hashes/checksums for the reasons referred to earlier. So, don't expect this feature to be introduced in FFS any time soon (if ever).

Posts: 12 · +ffsuser+ 19 Jan 2021, 11:43

> don't expect this feature to be introduced in FFS any time soon (if ever). (Plerry, 18 Jan 2021, 14:19)

no worries. be assured that i do not expect anything mandatory.

it was merely a suggestion and i was interested in your assessment.

so nothing to be afraid of.

Posts: 10 · Thiemo 23 Mar 2021, 21:37

I would like to take up this discussion and share my thoughts on it.

1. I agree that there is no point in implementing a hash comparison if the original file has to be transferred over a slow connection. As little as does a bitwise comparison and still there it is. I definitely see that the bit comparison is not meant for that use, as the manual points out to be careful with it.

2. There are client-server file sync solutions out there, but that is a different architecture and not feasible for simple use like ffs.

3. I noticed that under certain circumstances (e.g. file movement detection) ffs stores sync.ffs_db at one side, so I am very sure it would be possible to store locally generated hash values in this remote database. To detect manipulation of the database one could use a hash of its payload data stored into a metadata section of the database file. So, a hash based comparison could be achieved by downloading the database prior to the comparison. Mind, manipulations on the far side would only be detected if they were carried out by (another) ffs instance synchronising.
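
A minimal sketch of the idea in point 3, assuming a made-up JSON layout (this is not the sync.ffs_db format): the per-file hashes form the payload, and a digest of that payload is stored in a metadata section so a damaged or edited database can be recognised after download. `pack_db`, `unpack_db` and the field names are hypothetical.

```python
import hashlib
import json


def pack_db(file_hashes: dict) -> bytes:
    """Serialise the hash table and store a digest of the payload next to it."""
    payload = json.dumps(file_hashes, sort_keys=True)
    meta = {"payload_sha256": hashlib.sha256(payload.encode("utf-8")).hexdigest()}
    return json.dumps({"meta": meta, "payload": payload}).encode("utf-8")


def unpack_db(blob: bytes):
    """Return the hash table, or None if the payload does not match its digest."""
    try:
        doc = json.loads(blob.decode("utf-8"))
        payload = doc["payload"]
        expected = doc["meta"]["payload_sha256"]
    except (ValueError, KeyError, TypeError):
        return None  # not JSON, or not the expected layout
    if not isinstance(payload, str):
        return None
    actual = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    if actual != expected:
        return None  # payload was changed after the digest was written
    return json.loads(payload)
```

A plain digest like this catches corruption and naive edits only; whoever can rewrite the database file can also recompute the digest.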

Maybe it helps to understand my thoughts when I explain my "needs".

Internet --slow (http, ...)--> computer A ==faster (FFS)==> sftp server ==faster (FFS)==> computers B, C, ...

Computer A downloads game data (100 GB for a full installation, much less (10 GB?) for updates) that I want to back up on an sftp server, from where it is synced to other computers. I suspect that the game's download mechanism touches the dates of all the files (creation date, access date, update date?) even if the content is not altered. As the full synch to and from my sftp server takes more than four hours, I would like ffs to check on the file content by hash comparison.

So far my two dimes.

Out of curiosity, is sync.ffs_db a ffs specific format or is it some kind of embedded SQL db like SQLite?

Cheers Thiemo

Posts: 2451 · Plerry 24 Mar 2021, 11:47

> As the full synch to and from my sftp server takes more than four hours, I would like ffs to check on the file content by hash comparison.

In line with what you indicate to understand under 1) above, you should also appreciate that this approach is feasible (or more practical/faster) only if
• The FFS instance executing the sync and comparing hash(es) would run at the receiving/destination/target side machine and
• The source-side hash would already be available at/from the source side
This is quite specific and pretty restrictive.

> As little as does a bitwise comparison and still there it is.

Unless the above specifics apply, bitwise comparison is as fast as (or actually marginally faster than) when using hashes. Besides that, it can be used universally. That is very likely why the FFS author chose to add bitwise comparison and not hash based comparison.
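
To illustrate the point for the case where both files are readable from the machine doing the comparison, here is a minimal sketch of a byte-for-byte comparison. It is not FFS code and `files_equal` is a made-up name. Unlike a hash comparison it computes no digest, and it can stop at the first differing chunk instead of reading both files to the end.

```python
from pathlib import Path


def files_equal(left: Path, right: Path, chunk_size: int = 1 << 20) -> bool:
    """Byte-for-byte comparison that stops at the first differing chunk."""
    if left.stat().st_size != right.stat().st_size:
        return False  # different sizes: no need to read any content
    with left.open("rb") as fl, right.open("rb") as fr:
        while True:
            a = fl.read(chunk_size)
            b = fr.read(chunk_size)
            if a != b:
                return False
            if not a:  # both reads returned b"": end of both files
                return True
```

Python's standard library offers the same check as `filecmp.cmp(left, right, shallow=False)`.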

Posts: 10 · Thiemo 24 Mar 2021, 13:59

Hm, I feel I could not make myself understood. Maybe a sort of recipe will help. I envision the process like this.

1. FFS checks the remote side for the presence of a hash db (might get included into sync.ffs_db).
2a. Not present or corrupt: either consider all files to be different or fall back to another detection method like size and date.
2b. Transfer the db to the FFS side to calculate the diff.
3. Create/Update the db to mark all files to be transferred as to be transferred and with the new hash values.
4. Transfer the db.
5. Transfer a file to be transferred.
6. Update the db marking the file as done.
7. Transfer the db.
Repeat 5 to 7 until all is done.

In case of a crash, FFS could go on from the point of last successful transfer, i.e. no need to check the ones already marked as to be transferred again.
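
The resume part of the recipe above (steps 3 to 7) could be sketched roughly as follows. This is only an illustration under simplifying assumptions: the "remote" side is an ordinary directory, file names are flat, the db is a small JSON journal with the hypothetical name `transfer_journal.json`, and the hash values and the diff of steps 1 to 2b are left out.

```python
import json
import shutil
from pathlib import Path

JOURNAL_NAME = "transfer_journal.json"  # hypothetical name, not an FFS file


def load_journal(dest: Path) -> dict:
    """Read the journal; a missing or corrupt one counts as empty."""
    try:
        data = json.loads((dest / JOURNAL_NAME).read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return {}
    return data if isinstance(data, dict) else {}


def save_journal(dest: Path, journal: dict) -> None:
    """Write to a temporary file first, then rename it over the journal."""
    tmp = dest / (JOURNAL_NAME + ".tmp")
    tmp.write_text(json.dumps(journal, sort_keys=True), encoding="utf-8")
    tmp.replace(dest / JOURNAL_NAME)


def resumable_upload(source: Path, dest: Path, pending: list) -> list:
    """Copy the files named in `pending`, journaling progress after each one."""
    dest.mkdir(parents=True, exist_ok=True)
    journal = load_journal(dest)          # entries left over from a crashed run
    for name in pending:
        journal.setdefault(name, "pending")   # step 3: mark as to be transferred
    save_journal(dest, journal)               # step 4: transfer the db
    copied = []
    for name, state in sorted(journal.items()):
        if state == "done":
            continue                          # finished before the crash
        shutil.copy2(source / name, dest / name)  # step 5: transfer one file
        journal[name] = "done"                    # step 6: mark it as done
        save_journal(dest, journal)               # step 7: transfer the db
        copied.append(name)
    (dest / JOURNAL_NAME).unlink(missing_ok=True)  # run completed, drop journal
    return copied
```

After a crash, calling `resumable_upload` again with the same arguments copies only the files not yet marked as done. The sketch does not handle a file that changes again between the crash and the restart.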

Posts: 2451 · Plerry 24 Mar 2021, 16:16

The problem rests in your step 2b+3.
The saved hashes, as they may be stored in the db-files, would be the hashes of the files at the end of the previous FFS sync run, not the hashes of files that may have been modified in that location since then.
So, comparing the stored hashes is not useful in determining whether files may need to be synced.
Only comparing the actual hashes may indicate if files have changed and may need to be synced.
But (getting back to my repeated argument) as FFS only runs on a single machine, to determine the hash of the actual files, FFS would need to "download" the files to the machine running FFS so it can calculate the actual hashes (and then compare those to the hashes of the counter location).
As stated earlier, the same (i.e. the need for "downloading") holds for a bitwise comparison of files, but allows a direct (bitwise) comparison, instead of first calculating the hashes of the left and right files and then comparing those.

Posts: 10 · Thiemo 25 Mar 2021, 07:28

I see your point. My proposed process is indeed only useful if just the remote side is to be kept up to date. It does not work in the other direction at all.

I also totally agree with you that hash comparison is only of advantage if it avoids a data transfer that takes longer than calculating the hash. Asymmetrical transfer speeds should be taken into account as well. This means that bitwise comparison is always advantageous if both sources for the comparison are equally accessible. For all other cases, it depends on the circumstances, such as the sync direction.

With "As little as does a bitwise comparison and still there it is." I wanted to express, that for a comparably slow data access on the "remote" side with symmetrical speed, you are better off to just "upload" all the data instead of "downloading all the data, doing a bitwise comparison and uploading the differences again". If the speed is asymmetrical, it depends on the rate of asymmetry. I do not think this is an unusual setting. It will most probably occur in any environment where you want to sync from disk to SAN.
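
The trade-off can be put into rough numbers. The sketch below is a back-of-the-envelope model, not a measurement: the function names are made up, the speeds are hypothetical, the 100 GB total and 10 GB of changes are borrowed from the game data example earlier in the thread, and the time for the local comparison itself is ignored.

```python
def upload_all_seconds(total_mb: float, up_mb_s: float) -> float:
    """Time to simply upload everything without comparing first."""
    return total_mb / up_mb_s


def download_compare_upload_seconds(
    total_mb: float, changed_mb: float, down_mb_s: float, up_mb_s: float
) -> float:
    """Time to download everything for comparison, then upload the changes."""
    return total_mb / down_mb_s + changed_mb / up_mb_s


TOTAL_MB = 100_000    # 100 GB of game data
CHANGED_MB = 10_000   # 10 GB actually changed

# Symmetrical link, 10 MB/s each way (hypothetical speed).
sym_upload = upload_all_seconds(TOTAL_MB, 10)
sym_compare = download_compare_upload_seconds(TOTAL_MB, CHANGED_MB, 10, 10)

# Asymmetrical link, 100 MB/s down and 10 MB/s up (hypothetical speeds).
asym_upload = upload_all_seconds(TOTAL_MB, 10)
asym_compare = download_compare_upload_seconds(TOTAL_MB, CHANGED_MB, 100, 10)

print(sym_upload, sym_compare, asym_upload, asym_compare)
```

In this model, with symmetrical speed the download alone already takes as long as uploading everything, so comparing first can never win. With asymmetrical speed it wins only when the download is faster than the upload by more than the unchanged share of the data makes up for.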

Just my two dimes.