Case-sensitive collision detection

Discuss new features and functions
Posts: 11
Joined: 3 Sep 2020

nilsonj

Greetings. I've recently started using your software for a major data migration from a Drobo5N NAS so that I can upgrade the underlying filesystem of the array. In the process, I've discovered a strange situation that few others may experience, but with some careful logic your software could help alleviate the burden of understanding what is going on and how to fix it. More than anything, this is a spark for discussion and thoughts on the best way to handle this use case.

Essentially, I have noticed some case-based file collisions because the underlying FS is ext-based, but the presentation view is CIFS/SMB. Via SSH I can see the filesystem for what it really is. I might, for example, have two files or folders (it doesn't matter which; the behavior is the same):

MYFILE
MyFile

Both of these are separate entities to the underlying FS, and they may have gotten that way through any automated means of copying via SSH/SFTP or some other Unix-based manipulation that doesn't use the CIFS connection. (AFP is what originally caused this issue, I believe, as it does support case-sensitive naming.) When accessed via CIFS, the FFS software will dutifully read back the existence of the folders but miss the subtlety of the collision: it will list the parent directory, see BOTH folders, and add both to the queue. THEN, when it goes to copy files, it will read and copy the first one (say, the UPPER CASE one), and then read the SAME folder again, because it is listed twice under different casings. This MASKS the second folder/file. For flat files this is a smaller issue, but say you have two folders, each with very different contents: the first one will be copied twice, and the second one will be ignored entirely. The final destination will contain only a copy of the first folder's contents, and the two locations will NOT in fact be in sync.
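To make the masking concrete, here is a minimal Python sketch of what a naive listing-and-copy pass over a case-insensitive access path ends up doing (the mount points are hypothetical, and this is not FFS's actual code):

    import os
    import shutil

    # Hypothetical CIFS mount of the NAS share; the underlying ext
    # filesystem stores both "MYFILE" and "MyFile" as separate entries.
    src = "/mnt/cifs/share"
    dst = "/mnt/backup/share"

    for name in os.listdir(src):  # the listing returns BOTH casings
        # Over the case-insensitive access path, both names resolve to
        # whichever entry the server matches first, so the first folder
        # is copied twice and the second folder is silently masked.
        shutil.copytree(os.path.join(src, name),
                        os.path.join(dst, name),
                        dirs_exist_ok=True)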

I bring this up because the software, while unable to change its method of access or pull these files via CIFS in this situation, could definitely recognize the collision condition. Currently, the second copy pass for the second-referenced casing of the name generates errors because the "file already exists" on the destination while renaming the final temporary target. But the software already saw that the parent directory listed two folders/files whose names differ only in casing, and even though it can't access the second object, it should report that a case collision has made access to it impossible or at least questionable.
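The check itself could be as cheap as grouping each directory listing by case-folded name. A sketch (Python for illustration; using casefold() as the definition of case-insensitivity is an assumption, since CIFS may fold names differently):

    from collections import defaultdict

    def find_case_collisions(names):
        """Group directory entries by case-folded name; return groups
        containing more than one distinct spelling."""
        groups = defaultdict(list)
        for name in names:
            groups[name.casefold()].append(name)
        return [v for v in groups.values() if len(v) > 1]

    for variants in find_case_collisions(["MYFILE", "MyFile", "other.txt"]):
        print(f"Warning: case collision among {variants}; "
              f"contents read over CIFS may be unreliable.")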

Food for thought.

The problem above was fixed with some SSH manipulation of the underlying files: the case collisions flagged by FFS were renamed to non-colliding titles. This required a complete re-sync of the folder contents, which took longer than I'd like. (That brings me to an adjacent point: incremental re-scans would be a very tangible and desirable option for when something small changes on the source in a targeted way and simply needs to be updated in the list to produce a good delta, without rescanning the entire base.)

Thanks for your thoughts!
User avatar
Site Admin
Posts: 7170
Joined: 9 Dec 2007

Zenju

I think you're describing the following situation when accessing a case-sensitive file system (probably on Linux) over a case-insensitive CIFS connection:

1. FreeFileSync traversal lists two folders "folder" and "Folder"
2. accessing "folder" returns "folder" 's content => ok
3. accessing "Folder" also returns "folder" 's content => CIFS bug!

FreeFileSync is essentially receiving false data from CIFS, and I don't know how FFS could detect this.
Posts: 11
Joined: 3 Sep 2020

nilsonj

Yes, that is correct -- the detection would be noticing that there are two references to the same location/contents, which can be inferred from the CIFS access method paired with the case-insensitive collision. It could simply be a warning in the log that clues the user in to the possibility of an underlying incompatibility. Again, I'm not suggesting that FFS could FIX the problem on its own (at least not within the current purview of the software), but it could probably go further in letting the user know about a potential data-threatening issue. As it is, only the vigilant would understand the problem well enough to avoid losing a lot of stuff.
Posts: 11
Joined: 3 Sep 2020

nilsonj

As an aside, FFS could technically detect the capability of the underlying filesystem by intentionally writing a file and attempting to access it with reversed casing. That would reveal how the access method behaves and allow FFS to conclude whether case-sensitive access is or is not available. Then, if and when a collision is found, it knows whether that is a real issue or something that is likely returning correct data.
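A sketch of that probe (the dot-file naming and cleanup details are illustrative assumptions):

    import os
    import uuid

    def access_path_is_case_sensitive(directory):
        """Write a throwaway lowercase file, then ask for the same name
        in upper case. If the swapped casing resolves, the access path
        (e.g. a CIFS mount) is case-insensitive."""
        name = ".ffs-probe-" + uuid.uuid4().hex   # hex digits are lowercase
        path = os.path.join(directory, name)
        swapped = os.path.join(directory, name.upper())
        with open(path, "w"):
            pass
        try:
            return not os.path.exists(swapped)
        finally:
            os.remove(path)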
User avatar
Posts: 3940
Joined: 11 Jun 2019

xCSxXenon

Forgive my ignorance, but isn't having two directories with the same name a completely horrible FS/structure? I would feel like fixing that 'issue' above all else
Posts: 11
Joined: 3 Sep 2020

nilsonj

Yes it is, but sometimes you don't get to choose what happens with massive data moves and merges from multiple file structures. ;-) The point is to handle the error case gracefully, not to make a case for developing this error case intentionally.
User avatar
Posts: 3940
Joined: 11 Jun 2019

xCSxXenon

Well, that leads to what I was thinking: why add overhead to develop a feature for something so uncommon and against best practice anyhow?
Posts: 11
Joined: 3 Sep 2020

nilsonj

Yes, that's something I alluded to in my initial post. I understand that position, but there are quite a few aspects here that make this worth looking at (and talking about) rather than dismissing it out of hand because it seems uncommon. I don't necessarily argue that point, but this proves to me that I won't be able to trust the software to function autonomously in my environment; nor can I trust the error messages, because they don't reflect what is actually going wrong.

You can't control for user data. Truth be told, we don't know how prevalent this might be with NAS devices accessed by different methods. In my case, the very consumer-oriented Drobo5N offers at least 4 different protocols for managing the data, and its underlying filesystem handles case-sensitivity while not all presentation methods do. That may feel like a Drobo problem, but when you design a tool for robust data transfers that users can /depend/ on to produce /meaningful/ backups, you have to know it will handle the weird things that might come up (and, as this case proves, *do* come up) without casually allowing data to be missed. At the very least, the user should be made aware that a problem with the backup MIGHT exist, and I think that would be sufficient.

I adhere to "best practices" as much as possible, but when you move large collections of data using different protocols, you don't always get predictable results. Cleaning the data before practical use is not always an option, and most of the real world operates on this principle. Data warehousing is all about cleaning data -- not because people aren't following best practices at one time or another, but because they CAN'T all follow the same best practices with great consistency, and that leads to data integrity issues unless care is taken to clean things up.

Let's speak practically for a moment, though. Let me lay out ONE of the several instances that led me to find these problems in my own data store (there were four different situations that each produced this same behavior across a 25 TB storage array). I have a collection of music sorted by MusicBrainz. This app utilizes peer-reviewed metadata to tag and rename recordings into a folder structure, which was initially accessed via AFP (which supports case-sensitivity). As time went on and Apple lost favor with its own protocol, CIFS became the best way to connect to the device on a regular basis. This transition was made by the driver software, apart from any specific intervention on the user's part; in practical terms, I did not choose to make this transition, nor would most other users, and many would be completely ignorant of the change. As peer-reviewed metadata is edited, case naming conventions sometimes change. This happened in particular with the Sublime album "40 oz. to Freedom", which at one point was stylized as "40 Oz. to Freedom" -- an unimportant change from the user's perspective. When adding later tracks to this album, however, the MusicBrainz Python backend working over AFP had stored some tracks under the uppercase version, later favoring the lowercase for others. The result split tracks between two folders that shared the same name and were thus in conflict when accessed via CIFS. (I note that although AFP is no longer favored here, SSH/SFTP does still work for this purpose, but the overhead makes it impractical as a regular means of communicating with the NAS.)

I could lay out at least three other use cases resulting in this same error condition from my own experience alone, and I'm sure it could arise in other ways, too. I agree that focus must be maintained on what the software is supposed to do -- if it's a toaster, make sure it does the toasting well; but if you want it to be a reliable, autonomous, easy-to-use toaster for the common man, you'll have to think about some things beyond putting a piece of bread on an open flame. A balance is necessary, sure. I see an opportunity here, at least for the sake of discussion, to ask whether this piece of software should make some attempt to /ensure/ or /assure/ that it can copy the data from source to destination reliably. In my case, the software has failed, and after investigation the cause is well known to me. The opportunity now is to harden against this failure.

My solution, after thinking over the past couple of back-and-forths, would be the following:

1) Choose a filename with random characters, all lowercase.
2) Test whether the file exists at the source and at the destination.
3) Test-write the file to the source and the destination.
4) Test again whether the file exists at the source and at the destination.

This algorithm would provide all the information necessary to understand whether a collision problem may become an issue. If one should arise during the copy, appropriate and insightful error messages can be shown, directing the user toward remediation appropriate to his/her individual environment. It doesn't fix everything, but at least it sets the train of thought in the right direction. I could imagine as well that knowing whether a chosen medium /supports/ case-sensitivity would be useful, if case-sensitivity is indeed already considered among the program's options.
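A sketch of those four steps (Python for illustration; the probe naming and the idea of running it once per sync root are my assumptions, not anything FFS does today):

    import os
    import secrets

    def detect_case_handling(root):
        """Probe one sync root. Returns "collision" if the probe name is
        somehow already taken, "case-insensitive" if the freshly written
        lowercase file is reachable via swapped casing, otherwise
        "case-sensitive"."""
        name = "ffs-probe-" + secrets.token_hex(8)           # 1) random, all lowercase
        lower = os.path.join(root, name)
        upper = os.path.join(root, name.upper())
        if os.path.exists(lower) or os.path.exists(upper):   # 2) pre-existence check
            return "collision"
        with open(lower, "w"):                               # 3) test-write
            pass
        try:                                                 # 4) re-check existence
            return "case-insensitive" if os.path.exists(upper) else "case-sensitive"
        finally:
            os.remove(lower)

    for side in ("/mnt/source", "/mnt/destination"):         # hypothetical roots
        print(side, detect_case_handling(side))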
User avatar
Site Admin
Posts: 7170
Joined: 9 Dec 2007

Zenju

A heuristic to detect this kind of problem could be:

1. folder contains 2 or more items that are equal when ignoring case (=> clarify: according to which definition of case-insensitivity?), e.g. "file" and "File"
2. invent a new file name that is case-equivalent to the previous set, but spelled differently, e.g. "filE"
3. check if "filE" exists =>
    no: good, case-sensitive file system!
    yes: bad! cannot trust the file system to return correct data for "file" and "File"!

Now if only there were a way to implement this without paying too high a runtime performance cost...
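One way to keep the cost near zero: only probe directories whose listings actually contain case-colliding names, so case-clean trees pay nothing beyond the grouping. A sketch (Python; the variant-inventing helper is a hypothetical illustration, and casefold() again stands in for whatever definition of case-insensitivity is chosen):

    import os
    from collections import defaultdict

    def invent_variant(name, taken):
        """Step 2: derive a casing of `name` not already in the listing,
        e.g. "file"/"File" -> "filE" (any unseen variant will do)."""
        for i in range(len(name)):
            candidate = name[:i] + name[i:].swapcase()
            if candidate not in taken:
                return candidate
        return None  # name has no letters to vary

    def directory_lies_about_case(directory):
        names = os.listdir(directory)
        groups = defaultdict(list)
        for n in names:                 # step 1: spot case collisions
            groups[n.casefold()].append(n)
        for variants in groups.values():
            if len(variants) < 2:
                continue                # no collision => no probe, no cost
            probe = invent_variant(variants[0], set(names))
            if probe and os.path.exists(os.path.join(directory, probe)):
                return True             # step 3: "filE" resolved => bad!
        return False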
User avatar
Posts: 3940
Joined: 11 Jun 2019

xCSxXenon

There's the challenge of the century... How much overhead do you add for this once-in-a-blue-moon issue? Or I suppose it could be a toggle, with an indicated performance hit
Posts: 11
Joined: 3 Sep 2020

nilsonj

Right -- I certainly wouldn't advocate an algorithm that creates a performance hit across the board, but perhaps an "Advanced FS compatibility detection" setting would let users accept some performance hit (at most O(N) in the number of items) in exchange for the probing needed to avoid collisions and to get more appropriate error messages or handling.
Posts: 3
Joined: 16 Jan 2019

lionkm

I wonder if this is what's causing the problem I'm experiencing running FreeFileSync under Ubuntu 20.04. I'm synchronizing NTFS originals over my local Samba network to an ext4 Ubuntu 20.04 share. When it gets to a folder ".../hplip-3.21.12" (created by installing hplip in Ubuntu from a .run file on the NTFS volume), which contains, among other files, both Dat2drv and dat2drv, it returns the error "EEXIST: File exists [rename]". Judging from the error messages and the files I see on the source and destination drives, FreeFileSync appears to reach Dat2drv first and write it to the shared folder as dat2drv instead of Dat2drv. Then, when it gets to dat2drv, it appears to copy it to the shared folder as "dat2drv-c755.ffs_tmp", and when it tries to rename that file to dat2drv, the existing dat2drv causes the error. So instead of having a backup copy of both files on the destination drive, I have the contents of Dat2drv stored under the name dat2drv, and no copy of the real dat2drv at all. If I then run FreeFileSync again, it wants to write its copy of dat2drv over the one it created from Dat2drv under the wrong name, and running it yet again has it wanting to delete the original Dat2drv. What's puzzling is that there are thousands of files on both drives with upper- and lowercase first characters, and they synchronize properly and without errors, so how could the network be the problem?
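That matches the copy-to-temp-then-rename sequence described earlier in the thread. A sketch of the failing step (Python; the paths are hypothetical, the temp-file suffix is copied from the report, and FFS's real no-overwrite rename is modeled here with an explicit existence check):

    import os

    dst = "/mnt/share/hplip-3.21.12"   # hypothetical destination folder

    def finalize(final_name):
        """Write the temporary copy target, then rename it to its real
        name, refusing to overwrite -- the behavior EEXIST points to."""
        tmp = os.path.join(dst, final_name + "-c755.ffs_tmp")
        final = os.path.join(dst, final_name)
        open(tmp, "w").close()         # the temporary copy target
        if os.path.exists(final):      # on a case-insensitive share,
            raise FileExistsError(     # "dat2drv" also matches "Dat2drv"
                f"EEXIST: File exists [rename] {final}")
        os.rename(tmp, final)

    finalize("Dat2drv")   # succeeds
    finalize("dat2drv")   # fails: the first file's name masks this one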

At the same time, FreeFileSync refuses to back up 2 log files with the error message "ENOENT: No such file or directory [open]" (the files are ".../hplip-3.21.12/hplip-install_Wed-26-Jan-2022_14:27:44-f92e.ffs_tmp" and "...14:27:27-2566.ffs_tmp". Not sure what's going on there; the files and folders have the necessary permissions...
Posts: 3
Joined: 28 Nov 2022

Maikh

It also happened to me with files that have the same name but a different case, and also with files whose names look identical but where one is encoded in ANSI and the other in UTF-8...
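The same masking logic applies there: two different byte sequences on disk can render as the same display name. A small illustration (assuming "ANSI" here means a Latin-1 code page):

    # Two directory entries whose on-disk bytes differ but whose
    # decoded names look identical in a file browser:
    ansi_name = "Café".encode("latin-1")   # b'Caf\xe9'
    utf8_name = "Café".encode("utf-8")     # b'Caf\xc3\xa9'

    print(ansi_name == utf8_name)          # False: two distinct files
    print(ansi_name.decode("latin-1") ==
          utf8_name.decode("utf-8"))       # True: same visible name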