Suggestions: improvements to networked syncs

Discuss new features and functions
Posts: 16
Joined: 19 Jul 2017

ulfben

I have to synchronize large numbers of files across LAN and the Internet, using SFTP in both cases. I have two ideas and wonder if they're possible to implement:

1. Multiple upload streams
I can't say for sure, but looking at the logs it seems files are uploaded serially, one after another. The interface for configuring SFTP in FFS includes settings for connections and streams, but the info text says these only apply to directory listings (e.g. to speed up scans?). Can we get a setting for concurrent upload streams?

The FileZilla client allows this, and it speeds up transfers a lot (particularly in my case, with tons of smaller files).
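To illustrate (a rough Python sketch, with paramiko standing in for whatever FFS uses internally; host, user and file names are made up): each worker opens its own SFTP session, since requests within a single session are handled serially.

    # Rough sketch: one SFTP session per worker, uploads in parallel.
    from concurrent.futures import ThreadPoolExecutor
    import paramiko

    HOST, USER = "example.com", "ulfben"   # made-up credentials
    FILES = [("a.jpg", "/remote/a.jpg"), ("b.jpg", "/remote/b.jpg")]

    def upload(job):
        local, remote = job
        client = paramiko.SSHClient()
        client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        client.connect(HOST, username=USER)
        try:
            client.open_sftp().put(local, remote)
        finally:
            client.close()

    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(upload, FILES))   # raises if any upload failed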

2. Detect duplicate transfers, and copy on the server instead of transferring again
Without getting into the use case*: I often need several copies of the same file on a hard drive. Let's say videos several GB in size. Is it entirely out of scope for FreeFileSync to detect when a (large) duplicate file is being transferred, and just make a copy of the file that's already on the server instead?

It would require a database for checksums, but could be limited to files that are >10MB**. The smaller stuff can just be transferred anew.

I imagine an implementation like this (see the sketch after the list):
a) files are compared quickly (file name & size) between client and server. A sync database is populated with this info for each file that's >10MB.
b) at transfer time, we check each large file against the database, and if there's a collision (e.g. a similar file already exists on the server), we do the (costlier) bit-wise comparison to ensure they are in fact identical.
c) if step b shows they are identical, we just copy the file that is already on the server.
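In rough Python pseudocode (file_hash() and the dict-as-database are just stand-ins, and this assumes server-side files can be hashed somehow; hashlib.file_digest needs Python 3.11+):

    import hashlib, os

    SIZE_THRESHOLD = 10 * 1024 * 1024          # adjustable, see footnote **

    def file_hash(path):                       # stand-in: assumes the file
        with open(path, "rb") as f:            # is readable where we run
            return hashlib.file_digest(f, "sha256").hexdigest()

    # a) cheap keys first: index large files by (name, size)
    def build_index(server_paths):
        index = {}
        for p in server_paths:
            size = os.path.getsize(p)
            if size > SIZE_THRESHOLD:
                index.setdefault((os.path.basename(p), size), []).append(p)
        return index

    # b) + c) on a cheap-key collision, confirm with the costlier content
    # comparison; if identical, copy on the server instead of uploading
    def plan_transfer(local_path, index):
        key = (os.path.basename(local_path), os.path.getsize(local_path))
        for candidate in index.get(key, []):
            if file_hash(candidate) == file_hash(local_path):
                return ("server_copy", candidate)
        return ("upload", local_path)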


I realize these are big features to ask for, but I believe the performance improvement would be commensurate with the effort. In my case it would literally cut hours and hours off transfer times and lower bandwidth use.

* the use case: I synchronize my archive drive (>4TB). I'm a photographer and work on several different computers, all synced to the archive. Ergo: the archive holds copies of several drives (laptop, desktop, work), many of which hold the same files.
**10MB is just an example. Should probably be an adjustable value.
Site Admin
Posts: 7055
Joined: 9 Dec 2007

Zenju

1. Multiple upload streams
I can't say for sure, but looking at the logs it seems files are uploaded serially, one after another. The interface for configuring SFTP in FFS includes settings for connections and streams, but the info text says these only apply to directory listings (e.g. to speed up scans?). Can we get a setting for concurrent upload streams?

The FileZilla client allows this, and it speeds up transfers a lot (particularly in my case, with tons of smaller files). ulfben, 19 Jul 2017, 10:34
This is high up on the todo list: speeding up sync operations that are latency-bound, like (S)FTP. The plan is to add multi-session FTP comparison, similar to what's already available for SFTP, in the next release, and the more general parallel processing of multiple files (hopefully) in the release after that.
2. Detect duplicate transfers, and copy on the server instead of transferring again ulfben, 19 Jul 2017, 10:34
This is not implementable, except by running two process instances, one locally and one on the server.
Posts: 16
Joined: 19 Jul 2017

ulfben

1. That's great news! I look forward to it :)

2. Yeah, I guess we can't get SFTP to do binary comparisons for us. But there is a copy command in the protocol, no? So how about a less involved implementation (see the sketch after this list):

- When FFS creates the list of all files to upload, it can trivially check whether any file appears twice (first by filename and size, then confirmed with a hash comparison).
- If FFS detects that the user is about to upload the same file multiple times, it makes a list of all the target folders.
- When the first instance of a (duplicated) file has been uploaded, it executes a server-side copy to immediately "sync" it to all its target folders.
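Something like this, in a rough Python sketch (the "server_copy" action is hypothetical, standing in for whatever copy command the protocol offers):

    import hashlib, os
    from collections import defaultdict

    def file_hash(path):                           # Python 3.11+
        with open(path, "rb") as f:
            return hashlib.file_digest(f, "sha256").hexdigest()

    # uploads: list of (local_path, remote_path) pairs from the sync plan
    def merge_duplicates(uploads):
        by_key = defaultdict(list)                 # cheap key: name + size
        for local, remote in uploads:
            key = (os.path.basename(local), os.path.getsize(local))
            by_key[key].append((local, remote))
        actions = []
        for entries in by_key.values():
            if len(entries) == 1:                  # no collision: plain upload
                actions.append(("upload",) + entries[0])
                continue
            by_hash = defaultdict(list)            # confirm with a local hash
            for local, remote in entries:
                by_hash[file_hash(local)].append((local, remote))
            for dups in by_hash.values():
                first_local, first_remote = dups[0]
                actions.append(("upload", first_local, first_remote))
                for _, remote in dups[1:]:         # fan out server-side
                    actions.append(("server_copy", first_remote, remote))
        return actions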

Am I missing something obvious?

Thanks for your time!
Site Admin
Posts: 7055
Joined: 9 Dec 2007

Zenju

confirmed with a hash comparison). ulfben, 22 Jul 2017, 16:34
Unfortunately there is no such thing as a hash comparison without prior download.
Posts: 16
Joined: 19 Jul 2017

ulfben

Right, but I'm talking about local files now. The "less involved" implementation doesn't bother with finding duplicates on the server, only in the current upload queue.
Site Admin
Posts: 7055
Joined: 9 Dec 2007

Zenju

So essentially you're going to trust that the server state is the same as during the last sync without checking it?
Posts: 16
Joined: 19 Jul 2017

ulfben

I'm afraid I'm not seeing the problem. Let me try that again.

At some point FFS has decided "here are all the files I need to upload for [target] to match [source]", right? It has built a list of actions to take*.

So say the same 4GB movie appears in two places in that list, both slated for upload to the server. Instead of uploading two identical files, can't we "merge" those actions into one upload + one copy?


* In my case, almost exclusively uploads and moves. I realize there are tons of uses for FFS, and this idea might not help all of them. What I'm talking about here is the case of "trivial" mirroring, i.e. whatever the state of the server, it shall match the source when we're done.
Site Admin
Posts: 7055
Joined: 9 Dec 2007

Zenju

I guess this is doable, but it wouldn't be pretty: additional cost for local hash calculation, no progress indication while the copy takes place on the server, and it's not applicable to the majority of scenarios. A better solution, rather than optimizing the copying of duplicate files, might be to use symlinks. They are supported by SFTP, and uploading them is instant.
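For illustration, the symlink route over SFTP (a minimal paramiko sketch; host and paths are made up):

    import paramiko

    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect("example.com", username="ulfben")    # made-up host
    sftp = client.open_sftp()

    sftp.put("movie.mp4", "/archive/laptop/movie.mp4")  # upload once
    sftp.symlink("/archive/laptop/movie.mp4",           # then link the
                 "/archive/desktop/movie.mp4")          # duplicate, instantly
    client.close()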
Posts: 16
Joined: 19 Jul 2017

ulfben

The lack of progress indication for a remote copy action is a really good point! It makes the implementation I had in mind unreliable.

I also did some more research and found out that the SFTP protocol doesn't include a copy command anyway. There's a draft for a copy-file extension, but it isn't widely implemented yet.
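(Where the account has shell access, and not just SFTP, a server-side copy can be emulated over the same SSH connection. A rough paramiko sketch, with made-up host and paths, assuming a Unix-like server with cp available:)

    import shlex
    import paramiko

    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect("example.com", username="ulfben")    # made-up host

    src = "/archive/laptop/movie.mp4"
    dst = "/archive/desktop/movie.mp4"
    _, stdout, _ = client.exec_command(f"cp {shlex.quote(src)} {shlex.quote(dst)}")
    assert stdout.channel.recv_exit_status() == 0       # 0 means success
    client.close()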

Too bad. This could've been an excellent Summer of Code project. :)