Scan much slower with many exclusions

Get help for specific problems
User avatar
Posts: 2
Joined: 15 Oct 2020

Terrum

Hi all, I'm excited to be using FreeFileSync as a Linux alternative to a Windows program I was using that has a very similar layout (I won't mention the name in case I'm not allowed).

I donated straight away to support this software; however, I've noticed a somewhat major flaw. It seems that if I have thousands of files in my 'exclude' filter, the scan takes much longer than normal. Without excludes, a scan of a 1 TB drive takes about 5 minutes. With around 10k files in my exclude filter, it takes 6+ hours (in fact, it's still going as we speak, so it might take even longer).

Is there a reason FreeFileSync can't just scan as normal and then apply the excludes afterwards? Doing this would surely be much more efficient and quicker in the long run. The Windows program I used did this, and it seems to make the most sense.

Any help is greatly appreciated! :)
User avatar
Posts: 2251
Joined: 22 Aug 2012

Plerry

With an empty Exclude Filter, each file only needs to be compared to its potentially present, identically named counterpart.
With an Exclude Filter that has 10k entries, each individual folder and file name additionally needs to be compared against all 10k filter entries, or at least until it matches one of them, because each file that ultimately ends up in the sync has to meet "... none of the entries in the exclude list".
I don't know if that fully explains the big difference in time, but it can mean up to 10k times more comparisons to be performed ...
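
To illustrate, here is a minimal sketch of such an entry-by-entry check (hypothetical C++, not FreeFileSync's actual code; the paths are made up):

#include <iostream>
#include <string>
#include <vector>

// Check one scanned item against the exclude list, entry by entry.
// With 10k entries, every file and folder name may need up to 10k
// string comparisons before it is accepted.
bool isExcluded(const std::string& relPath, const std::vector<std::string>& excludeList)
{
    for (const std::string& entry : excludeList) // linear scan over all entries
        if (relPath == entry)                    // wildcard matching omitted for brevity
            return true;
    return false;
}

int main()
{
    const std::vector<std::string> excludeList = {"/games/a.bin", "/games/b.bin"}; // made-up entries
    std::cout << isExcluded("/games/a.bin", excludeList) << '\n';     // prints 1
    std::cout << isExcluded("/docs/report.txt", excludeList) << '\n'; // prints 0
}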

But having an Exclude Filter with 10k entries sounds pretty extreme.
I find it impressive that FFS can apparently even handle such a long list, albeit at a cost in time.
I also find it impressive that you apparently went through the effort of compiling such a long list.
However, although I do not know your exact use case, it is normally possible to keep the Exclude list fairly compact: exclude entire file types (e.g. *.txt) instead of individual files, exclude entire folders with all their content and subfolders, and use smart combinations of regular and wildcard characters. All of this is presented in the manual section on Excluding items.
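
For example, a compact exclude filter along these lines (the Steam path is only an illustration; see the Excluding items section of the manual for the exact syntax) can replace thousands of per-file entries:

\SteamLibrary\steamapps\common\
*.tmp
*\Thumbs.db

The first entry excludes one folder with everything below it, the second excludes an entire file type, and the third excludes a particular file name in any folder.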
User avatar
Posts: 2
Joined: 15 Oct 2020

Terrum

FFS does struggle to handle the list: it took a good 30 minutes of a frozen screen for it to process the exclusions, but other than that it seems to perform fine.

Basically, I'm excluding the hundreds of GBs of Steam games I have installed that I don't need mirrored (because I can just re-download them from Steam in case of an HDD failure).

My excludes do seem to include all the files inside the folders; I could try just having it exclude the folders instead. So I will try that and get back to you :)

Though in my personal opinion, it doesn't really make sense for each file/folder to be compared with the exclude filter unless a wildcard is involved. Surely it would make more sense to have it perform the scan the way I am suggesting. Can this be considered for future development?

Many thanks!

EDIT: It's way better now with just folder exclusions; not too sure why it included the files anyway. But thanks once again!
User avatar
Posts: 3555
Joined: 11 Jun 2019

xCSxXenon

Comparing and then filtering is no faster than filtering during the compare; either way, each file is still compared against every entry/line of the exclusion list. Glad you found your solution though!
User avatar
Site Admin
Posts: 7042
Joined: 9 Dec 2007

Zenju

I'm surprised by these performance numbers. FFS can probably be optimized further, but I'm not able to reproduce a bottleneck in the file exclusion filter:
Test case 1: scan system hard drive, fast SSD, ~1 million files, default exclusion filter
=> 10 seconds

Test case 2: add 20,000 explicit file exclusions, no wildcards
=> 32 seconds
So the CPU overhead of 20k exclude filter items on 1 million files is only 22 seconds, nowhere near the 6+ hours mentioned above. With slower hard disks this 22-second difference is not expected to change, because a slower disk only adds file I/O latency, while the filtering itself is CPU-bound.
Unless you're using some extremely slow computer (e.g. a Raspberry Pi), I suspect there is something else going on (maybe too little RAM, so that the slow system swap file is being used instead).
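
For reference, the pure CPU cost of the entry-by-entry check can be estimated without touching the disk at all. A rough, scaled-down micro-benchmark sketch (hypothetical code, not the FFS implementation):

#include <chrono>
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    const std::size_t fileCount   = 100'000; // scaled-down stand-in for ~1 million files
    const std::size_t filterCount = 20'000;  // explicit exclude entries, no wildcards

    std::vector<std::string> excludeList;
    for (std::size_t i = 0; i < filterCount; ++i)
        excludeList.push_back("/excluded/file_" + std::to_string(i) + ".bin");

    std::size_t hits = 0;
    const auto start = std::chrono::steady_clock::now();

    for (std::size_t f = 0; f < fileCount; ++f)
    {
        const std::string relPath = "/kept/file_" + std::to_string(f) + ".txt";
        for (const std::string& entry : excludeList) // linear check against every entry
            if (relPath == entry)
                ++hits;
    }

    const auto elapsed = std::chrono::steady_clock::now() - start;
    std::cout << "hits: " << hits << ", seconds: "
              << std::chrono::duration<double>(elapsed).count() << '\n';
}

On a typical desktop CPU this should finish within seconds, which is consistent with the small overhead measured above and suggests a 6+ hour scan points to something other than the filter itself.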
User avatar
Posts: 3555
Joined: 11 Jun 2019

xCSxXenon

In other words though, that is a 120% increase, making the run 2.2x as long. On a mechanical drive with a slower CPU or less RAM, that doesn't seem unreasonable. Also, it performs n*m operations, where n is the number of files and m is the number of filter entries (with 1 million files and 20,000 entries, that is already 20 billion comparisons). Each factor is linear on its own, and thus efficient, but if m grows in proportion to n, the total work becomes k*n^2, where k (between 0 and 1) is the ratio of filter entries to files. Even at a modest k of, say, 0.1, i.e. filter entries numbering 10% of the files, you are still left with n^2 growth, which is very inefficient at large quantities.
User avatar
Site Admin
Posts: 7042
Joined: 9 Dec 2007

Zenju

In other words though, that is a 120% increase, making the run 2.2x as long. [...] xCSxXenon, 16 Oct 2020, 17:48
In theory, yes; in practice apparently not, as my test case shows. A slower hard disk won't have an impact on CPU time. A slow CPU might explain a 6+ hour runtime only if it were ~1000 times slower than my test setup, which is unlikely, but only the OP can tell.
User avatar
Posts: 3555
Joined: 11 Jun 2019

xCSxXenon

Oh shoot, I misread his 'control' as 1 hour, not 5 minutes... OK, yeah, something is definitely odd, my apologies!
User avatar
Site Admin
Posts: 7042
Joined: 9 Dec 2007

Zenju

Unless wildcards are involved, FFS could check for filter matches in constant instead of linear time! So far I had never considered that someone might use filter lists this large, and for small filters the cost of filter evaluation is negligible. Also, when using filter lists that large, you're probably doing something wrong (e.g. not using wildcards). But who is FFS to judge! I've implemented this performance optimization for the next release, FFS 11.19. I'm no longer able to measure any performance drawback when using a 20,000-item exclude filter on 1 million files, compared to using no filter at all on the same file set!
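
A minimal sketch of that idea (illustrative only, not the actual FFS 11.19 code): the non-wildcard exclude entries go into a hash set, so each scanned item costs one average-constant-time lookup instead of a comparison against every entry.

#include <iostream>
#include <string>
#include <unordered_set>

int main()
{
    // Made-up entries; wildcard patterns would still need separate handling.
    const std::unordered_set<std::string> excludeSet = {"/games/a.bin", "/games/b.bin"};

    const auto isExcluded = [&](const std::string& relPath)
    {
        return excludeSet.count(relPath) != 0; // O(1) on average, regardless of list size
    };

    std::cout << isExcluded("/games/a.bin")     << '\n'; // prints 1
    std::cout << isExcluded("/docs/report.txt") << '\n'; // prints 0
}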
User avatar
Posts: 3555
Joined: 11 Jun 2019

xCSxXenon

I love algorithm analysis, it's my bread & butter lol!
Glad to see the change!