[Feature Request] Verify file integrity over time when doing long (file content) scans

Discuss new features and functions
Posts: 6
Joined: 6 May 2015

c9870

Hi, I have 3 drives (WD RE4-GP 2TB) and I use your software to mirror the data across the set, from Disk1 to Disk2 and Disk3. I have been using the time/size comparison to mirror the drives. I then decided to do a file content scan, and the software detected that a few files had changed, but I had no way of telling which copy was the correct one. One of the changed files was a static MP3; I listened to each copy and noticed no difference in the audio. Using an MD5 checker I confirmed that the files were different, but I could not tell how much was different, and again there was no noticeable difference in the MP3.

What I am proposing is to add some sort of data integrity check on all files: when they are first scanned using "file content" comparison, when they are copied from a drive, and when they are first written to a drive (the data is being read/written already, so you might as well compute the check now, while it is "fresh", for future comparisons).

This could use a CRC-16 or CRC-32 (cyclic redundancy check) calculation along with the date and size info: if date and size are unchanged, use the CRC to verify the data hasn't changed 'accidentally'.
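Just to illustrate what I mean, here is a rough sketch in Python (not anything from FFS; all the function names are made up): compute a CRC-32 while the file is "fresh", store it next to the date and size, and on a later scan only raise a flag when date/size still match but the CRC doesn't:

```
# Rough sketch of the proposed check (illustration only, not FFS code).
import os
import zlib

def crc32_of_file(path, chunk_size=1024 * 1024):
    """Stream the file and return its CRC-32 as an 8-digit hex string."""
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return format(crc & 0xFFFFFFFF, "08x")

def make_record(path):
    """Record date, size and CRC right after the file was copied ('fresh')."""
    st = os.stat(path)
    return {"mtime": int(st.st_mtime), "size": st.st_size,
            "crc32": crc32_of_file(path)}

def check_record(path, record):
    """On a later scan, classify the file as 'updated', 'ok' or suspicious."""
    st = os.stat(path)
    if int(st.st_mtime) != record["mtime"] or st.st_size != record["size"]:
        return "updated"                 # date/size changed -> user edited the file
    if crc32_of_file(path) != record["crc32"]:
        return "possible corruption"     # same date/size, but different content
    return "ok"
```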

Then add a feature so that when scanning drives, if date/size haven't changed (indicating the file most likely hasn't been updated by the user) but the CRC of FileA on Disk1 has changed while Disk2's stored CRC and FileA still match, the user gets a prompt saying that the file on Disk1 might be corrupt, so they can check whether the file was really updated and make the appropriate decision about which copy to keep. (This also applies to mirroring, since mirroring is one-way and could copy a corrupt file over to the second drive.)

I did some very rough, general math on how much space a CRC record per file would take: file name [280 characters], date [16 digits], size [16 digits, in bytes], a CRC [32 characters], and some overhead for you [64 characters]. As a text file that came out to 488 bytes; to make the math easy, call it 512 bytes. So every 2 files = 1 KB, 1,000 files = 500 KB, 10k files = 5 MB, 100k = 50 MB, 500k = 250 MB (my case), and 1 million = 500 MB. And that is the worst case, since most files are nowhere near the 280-character path limit.
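To sanity-check that scaling at the assumed 512 bytes per record:

```
# Worst-case overhead at an assumed 512 bytes per file record.
RECORD_BYTES = 512
for files in (2, 1_000, 10_000, 100_000, 500_000, 1_000_000):
    total_kb = files * RECORD_BYTES / 1024
    print(f"{files:>9,} files -> {total_kb:>10,.0f} KB ({total_kb / 1024:,.1f} MB)")
# 1,000,000 files -> ~500,000 KB, i.e. roughly the 500 MB worst case above.
```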

I personally wouldn't mind losing 250-500 MB per drive to know that the data on my three 2 TB drives is 'safe' from silent errors.


How I use the software:
Disk1 is the "always in use" drive.
Disk2 is in my computer and mirrored from Disk1 every few days, or whenever I add something big.
Disk3 is kept unplugged in a fire-resistant box, in case a fire or a crypto virus destroys the data on Disk1 and Disk2.

Thanks, I hope you add this.
Posts: 22
Joined: 6 Sep 2014

ferblaha

thumbs up
Site Admin
Posts: 7161
Joined: 9 Dec 2007

Zenju

FreeFileSync has the ability to verify copied files, see help file chapter "Expert Settings"
Posts: 22
Joined: 6 Sep 2014

ferblaha

It would be great if the option to verify files were available as a config parameter of each folder sync configuration, instead of a hard-to-reach global setting.
Posts: 6
Joined: 6 May 2015

c9870

> FreeFileSync has the ability to verify copied files, see help file chapter "Expert Settings" (Zenju)
Are you referring to "VerifyCopiedFiles" in the expert settings?
That is only part of what I am asking for. It only verifies that a file was read from Disk1, written to Disk2, and then read back correctly from Disk2.
What it does not do is verify that files haven't changed while just sitting on the drive (a.k.a. bit rot).

Saving a hash of each file would allow the software to check whether the file has changed by even a single bit. This is most useful with software like FFS and multiple drives (holding copies of the data), so that if a bit flips the user still has an intact copy they can use to replace the broken file.
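Here is the kind of decision I'd like the software to be able to make, as a rough sketch (nothing here is real FFS code; it assumes a SHA-256 hash that was stored while the file was known to be good):

```
# Illustration: given a hash stored when the file was known-good,
# decide which of the two current copies is still intact.
import hashlib

def sha256_of_file(path, chunk_size=1024 * 1024):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def pick_good_copy(stored_hash, copy_on_disk1, copy_on_disk2):
    h1 = sha256_of_file(copy_on_disk1)
    h2 = sha256_of_file(copy_on_disk2)
    if h1 == h2 == stored_hash:
        return "both copies intact"
    if h1 == stored_hash:
        return f"{copy_on_disk2} looks corrupted; restore it from {copy_on_disk1}"
    if h2 == stored_hash:
        return f"{copy_on_disk1} looks corrupted; restore it from {copy_on_disk2}"
    return "neither copy matches the stored hash; let the user decide"
```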

While looking around for a program that hashes files and stores the hashes in association with the files being hashed, I found a program called RapidCRC Unicode (http://www.ov2.eu/programs/rapidcrc-unicode). It does what I want: it reads files off a drive, generates a hash for each (I chose SHA256), saves the hashes to a file (either one hash file per file, one hash file per folder [listing every file in that folder], or a single hash file in a parent folder covering all child folders and files), and can later check the files for changes.

I chose the second option (one hash file per folder) because it allows single folders to be scanned at a time (unlike the third option, where you must scan the entire tree covered by the hash file) and it is less obtrusive than the first option, which creates thousands of extra files (one per file). I also have a few numbers to report on how much space it takes up (but first I'd like to note that it saves in a plain text format, so opening and viewing the hash files in Notepad works great).
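For anyone wondering what "one hash file per folder" boils down to, here is a small sketch of my own (not RapidCRC's code) that writes a sha256sum-style text file into each folder and verifies it later; the file name "checksums.sha256" is just something I picked:

```
# Sketch: one plain-text hash file per folder (hash + filename per line).
import hashlib
import os

HASH_FILE = "checksums.sha256"   # arbitrary name for this example

def sha256_of_file(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

def write_folder_hashes(folder):
    lines = []
    for name in sorted(os.listdir(folder)):
        full = os.path.join(folder, name)
        if os.path.isfile(full) and name != HASH_FILE:
            lines.append(f"{sha256_of_file(full)}  {name}\n")
    with open(os.path.join(folder, HASH_FILE), "w", encoding="utf-8") as f:
        f.writelines(lines)

def verify_folder_hashes(folder):
    with open(os.path.join(folder, HASH_FILE), encoding="utf-8") as f:
        for line in f:
            digest, name = line.rstrip("\n").split("  ", 1)
            actual = sha256_of_file(os.path.join(folder, name))
            print(("OK" if actual == digest else "MISMATCH") + f": {name}")
```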

I recorded numbers on two sets of folders:
1. Started with 20.2 GB in 755 files and 16 folders; ended with an extra 120 KB and 15 new files. So that isn't too bad.
2. Started with 0.99 TB in 450,167 files and 17,495 folders; ended with an extra 38 MB and 14,903 new files.

This shows that my estimate of 512 bytes per file was somewhat exaggerated; the real figure is closer to 88 bytes per file. (Note, though, that these hash files only contain the 64-character SHA256 hash and the file name [not the full path, since the hash file sits in the same folder as the file] and do not include the date, size, or other information that FFS would also need to save.)
For my entire drive, with 534k files and 25.7k folders, it only added about 25 thousand hash files and 50 MB (one fifth of what I estimated).
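For anyone who wants to check the math from the numbers above:

```
# Sanity check on the measured overhead (data set 2 above).
extra_bytes = 38 * 1024 * 1024          # ~38 MB of hash files
files = 450_167
print(extra_bytes / files)              # ~88.5 bytes per file

# Projection for the whole drive:
print(534_000 * 88.5 / 1024 / 1024)     # ~45 MB, close to the ~50 MB observed
```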

Although I did find some problems with the software.
It isn't easy to add files to a hash file once it has been created. To add files you have to recreate the hash file for the entire folder. That isn't too bad for archives, since those folders/files rarely change, but it is a pain for folders I am actively updating: I would first have to verify that the files I haven't changed are still correct, then re-run RapidCRC to hash all the files in the folder, even those that haven't changed.
If this were embedded in file management software (FFS, for example), the software would know which files have changed (using date and size) and could update or add hashes only for the updated/new files.
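A rough sketch of what that incremental update could look like (illustrative only; the "hash_index" dict just stands in for whatever database the sync software keeps):

```
# Sketch: only re-hash files whose date or size changed since the last run.
import hashlib
import os

def sha256_of_file(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

def update_hash_index(root, hash_index):
    """hash_index: {relative_path: (mtime, size, sha256)} from the previous run."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            st = os.stat(full)
            old = hash_index.get(rel)
            if old is not None and (old[0], old[1]) == (int(st.st_mtime), st.st_size):
                continue                          # unchanged -> keep the stored hash
            hash_index[rel] = (int(st.st_mtime), st.st_size,
                               sha256_of_file(full))   # new or updated file
    return hash_index
```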


PS: Also look at http://en.wikipedia.org/wiki/File_verification

PSS: When typing long replies to a forum post, type them into a word processor first (Microsoft Word, LibreOffice Writer, etc.), because sometimes your sign-in times out and you lose the text you have been typing for 2-3 hours (including the research and messing with a program to make sure it works). It is a pain :'(
Site Admin
Posts: 7161
Joined: 9 Dec 2007

Zenju

In general, it's the hardware's job to make sure data integrity is preserved. For hard disks this is done similarly to what you describe, by storing a checksum together with each block of data, which is verified each time the block is read. So in this sense, there's no need to duplicate the effort and manually try to do the same. The problem also isn't so much data corruption as long as it can be detected (e.g. a CRC error during read), but rather silent data corruption that results in no error at any level. This kind of issue is very rare, but possible. Most of these cases can be handled simply by using archives like zip or rar, which natively associate checksums with the data. So what's left are very large files that are critically important and need to be protected against this very rare case of silent data corruption. This is a quite narrow usage scenario.
For this specific scenario I'd argue that simply associating a hash with the files is not even enough security, since there is no guarantee that the file was not already corrupt when the hash was generated. So what's missing is an additional verification check after the hash was generated, but I don't see how software could do this second step automatically. This is essentially the user's judgement call.

From a user-level software perspective there is very little that can be verified at all:
http://blogs.msdn.com/b/oldnewthing/archive/2012/09/20/10350645.aspx

> PSS: When typing long replies to a forum post [...] you lose the text you have been typing for 2-3 hours

Please feel free to complain; you have no idea how many texts I've already lost to this bug:

https://sourceforge.net/p/forge/site-support/9579/
Posts: 6
Joined: 6 May 2015

c9870

Yes, the hard drive does have the ability to correct single-bit errors (even a few bits), but on old HDDs (which many of us are using) it's not just a few bits going bad, it's entire sectors or groups of sectors. When a whole group of sectors goes bad, the HDD cannot correct that on its own.

Also, yes, the file might already have been corrupt when the hash was generated, but that is unlikely, especially if the file is copied and hashed in real time or within hours of being created/updated, which FFS can already do with real-time and scheduled copying/mirroring. Later, when a "File Content" comparison shows that a file has been 'corrupted', FFS could look at the hash it saved when the file was first copied and determine which copy is the original, uncorrupted one.

Also, FFS is in a better position to keep the saved hashes up to date as files are created and updated, since FFS is the (metaphorical) vehicle that copies the files around between drives.


The main reason I brought this up is that FreeFileSync has an option to do "File Content" comparisons, which I assume reads the files, generates a hash for each, and then compares the hashes to see if they match. The flaw I saw (and what started this thread) is that it only reports that two files in the same folder (on their respective drives/locations) have the same date/time and size but different content; it never tells me which one is the "correct" file, even though those files had sat on the drives for months without being edited and were copied there using FFS. If FFS had kept the hashes from months ago (when the files were good), it could tell the user that the hashes now differ and which copy is still in its 'original' state.


My responses to the MSDN article you linked come from a non-programmer's perspective; I only did a little Visual Basic and never got very far into it.

1st: As for the code being optimized so that the hash ends up being computed from data cached in RAM instead of read off the drives themselves, that could be dealt with by separating out the read commands, clearing buffers, and using separate buffers for the Drive1 and Drive2 reads.

2nd: Other programs that compute and compare hashes can and do reliably produce the same hash for the same file, without fail, even across different drives, cache configurations, and over the network. Also, if the program is reading gigabytes or terabytes of data, the drive cache won't even be an issue: the 32-64 MB cache (or the 8 GB of MLC NAND in some Seagate SSHD drives) will be flushed many times over.

3rd: The hash archiving I propose doesn't have to be an "always on" feature (maybe on by default, or a sub-feature of "Detect moved files"), since there are cases where it might need to be turned off, such as when the user can write to a location but for some reason cannot read from it. But those of us who do want it should be able to have it: people like me who don't have the money to replace drives that are corrupting a few files out of hundreds of thousands, with drives going on 6 years old.

4th: Regarding the fourth paragraph from the end, the idea that verifying a file is pointless because a sector can spontaneously go bad afterwards: on the contrary, that is exactly where the victory is, especially when combined with a file copying program (like FFS) that can later look at the data, determine that one copy is bad, declare the copy on a separate drive/folder to be good, and swap the bad file out for the good one. #VictoryAchieved in my book.

5th and lastly, a quote from the article:
>“If you really want to do some sort of copy verification, you'd be better off SAVING THE CHECKSUM somewhere and having the ultimate CONSUMER OF THE DATA VALIDATE THE CHECKSUM and raise an integrity error if it discovers corruption.”

Memory is the key (a Red vs. Blue joke), and you cannot remember something you never committed to memory (or in this case, storage) in the first place. Even remembering a 32-bit (4-byte) CRC in the database would be better than what is currently done: just saying that two files differ without helping the user figure out which of the two is the correct one.
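Just to make "remembering a CRC in the database" concrete, here is a tiny sketch. FFS's real sync database is its own format, so this SQLite table and its column names are purely my own illustration:

```
# Sketch: a minimal table remembering path, date, size and CRC-32 per file.
import sqlite3

conn = sqlite3.connect("file_checks.db")   # hypothetical database file
conn.execute("""
    CREATE TABLE IF NOT EXISTS file_check (
        path   TEXT PRIMARY KEY,   -- relative path of the file
        mtime  INTEGER,            -- last-modified time when hashed
        size   INTEGER,            -- size in bytes when hashed
        crc32  TEXT                -- 8 hex chars = 4 bytes of actual data
    )
""")

def remember(path, mtime, size, crc32):
    conn.execute("INSERT OR REPLACE INTO file_check VALUES (?, ?, ?, ?)",
                 (path, mtime, size, crc32))
    conn.commit()

def lookup(path):
    row = conn.execute("SELECT mtime, size, crc32 FROM file_check WHERE path = ?",
                       (path,)).fetchone()
    return row   # None if the file was never hashed
```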


/article response




You could even write a plugin for, or integrate with, another program (like RapidCRC) that already has the hashing and hash-saving functionality, and hook it into your program, which is the one reading and copying the files around, so it updates the hash archive for new/updated files and checks (with user input) for file integrity.
Site Admin
Posts: 7161
Joined: 9 Dec 2007

Zenju

> When a whole group of sectors goes bad, the HDD cannot correct that on its own.
> The flaw I saw [...] is that it only reports that two files [...] have the same date/time and size but different content; it never tells me which one is the "correct" file

Data recovery in general is not such a big problem when you have multiple backups. Automated recovery also isn't really needed, since we're talking about a very rare situation. The issue is more with reliably detecting the corruption in the first place.

> Also, yes, the file might already have been corrupt when the hash was generated, but that is unlikely

I have had exactly this kind of data corruption a few times, and it was virtually undetectable. Only after a long investigation and program crashes did I find out that the software I had stored in archives was already corrupt before I put it in there.
Posts: 6
Joined: 6 May 2015

c9870

> Data recovery in general is not such a big problem when you have multiple backups.


FFS is the software providing the transport for backup in this case.
FFS already reads the files off the drive in their best available state (more on that later), writes them to another drive, and can optionally check that each file was written correctly (via an option in the "Expert Settings"). Later, at the user's direction (a "file content" scan), it scans both Disk1 (which had the original file) and Disk2 (with the second copy) and only reports that the two files differ, without telling the user which file is correct and which is corrupted.
Storing a hash of the file and showing the result to the user so they can decide what to do is better than just saying that one of the files might be broken.

> The issue is more with reliably detecting the corruption in the first place.


FFS already has a "File Content" scan that compares files to see whether they are identical at the bit (or at least byte) level. With this, FFS can reliably detect that a file is corrupted.
If a corrupted file is found in an archive folder (i.e. unchanged since it was placed on the separate drives), it would be great if FFS could report to the user which copy is correct. Storing a CRC or SHA256 (or another hash) of each file is the way to detect corruption that has occurred since the hash was made. FFS would also be saving date and size information, so it could even tell the user when a file has simply been updated.


> was already corrupt before I put it in there.


Yes, this will not catch corruption that happens before the file is first scanned, copied, and verified to a separate drive (at which point a hash would be created and saved on each drive/location).

The point of using software like FFS is to copy data from drive to drive, using real-time or scheduled batch jobs to pick up new files as soon as possible, before they have a chance to be corrupted, and copy them to another location. With copying plus hashing, a file can be copied and later verified to be unchanged since it was first copied and hashed. It doesn't protect the file before it was copied and hashed, but it still protects it afterwards: you can verify that it hasn't changed at all, and you have other copies to fall back on in case it has.

The fact that a few files might already be corrupt before the first scan is not a reason to skip verifying that files haven't changed afterwards. Likewise, the fact that most files (an estimated 99.99997% on my Disk2 and 99.9996% on my Disk3) will never be corrupted is not a reason to avoid at least trying to verify whether a file changes over time.
Posts: 6
Joined: 6 May 2015

c9870

Also, with hashes saved on each drive, the "File Content" scan could be run on multiple drives at the same time (for example, all 3 drives at once) and independently of a "master" drive; the software would only need to check each file against its stored hash instead of comparing File1 against File2.

FFS would first have to do a normal (date/size) scan and update the hash files on each drive, but this could make scanning much faster in multi-drive setups (3 drives in my case).

The work could be queued up: the main thread creates child threads that scan folder by folder, checking the hashes of files on a particular drive (a different child thread per drive or folder). Each child thread then reports back to the main thread the computed hash of the file and the hash stored in the database for that folder, so the main thread can compare them against the hashes coming from the other child threads.
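Here is a rough sketch of that layout (purely illustrative: one worker thread per drive hashes the same relative file, and the main thread compares the results against the stored hash):

```
# Sketch: hash the same file on several drives in parallel, then compare.
import hashlib
from concurrent.futures import ThreadPoolExecutor

def sha256_of_file(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

def check_across_drives(relative_path, drive_roots, stored_hash):
    """drive_roots e.g. ['D:/', 'E:/', 'F:/']; stored_hash comes from the database."""
    paths = [root + relative_path for root in drive_roots]
    with ThreadPoolExecutor(max_workers=len(paths)) as pool:   # one worker per drive
        results = dict(zip(drive_roots, pool.map(sha256_of_file, paths)))
    for drive, digest in results.items():
        status = "OK" if digest == stored_hash else "DOES NOT MATCH STORED HASH"
        print(f"{drive}{relative_path}: {status}")
    return results
```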


Just a thought I had that I wanted to type out, about drive scanning and making it a little better.

Also, as a much smaller feature, the child threads could record the access/read time for each file while reading the drive. When comparing the file hashes, the main thread could also compare these access/read times. If the main thread sees that a file took noticeably longer to read than the same file on a separate drive, but still came back with the correct hash, the software could recommend that the file be refreshed (rewritten).

I don't really know exactly how that would be monitored; probably the best option is the access time while the file is being scanned. If the drive has to retry reading a sector, the software would notice an access-time spike while reading the file: say 600-900 ms on my current drive for good files, then a random spike to 2000-3000 ms for a few sectors of the same file, and then back to the normal 600-900 ms. (I did notice that Windows reports the drive as idle for a few seconds while this happens.)
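A small sketch of how the timing could be watched while hashing (illustrative only; the threshold for "slow" is arbitrary):

```
# Sketch: time each chunk while hashing and flag suspicious slow spots.
import hashlib
import time

def hash_with_timing(path, chunk_size=1024 * 1024, slow_factor=3.0):
    h = hashlib.sha256()
    chunk_times = []
    with open(path, "rb") as f:
        while True:
            start = time.perf_counter()
            chunk = f.read(chunk_size)
            elapsed = time.perf_counter() - start
            if not chunk:
                break
            chunk_times.append(elapsed)
            h.update(chunk)
    slow = []
    if chunk_times:
        typical = sorted(chunk_times)[len(chunk_times) // 2]   # median read time
        slow = [t for t in chunk_times if t > slow_factor * typical]
    if slow:
        print(f"{path}: {len(slow)} unusually slow reads -> consider refreshing this file")
    return h.hexdigest()
```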

Again, this was just a thought I had while watching my drive read files.
Posts: 6
Joined: 6 May 2015

c9870

@zenju

I did not mean to make it sound like I don't like FreeFileSync; I was just trying to suggest a feature to make it even better file management software.

FFS is the best file syncing software I have found, and I will continue to use it, along with RapidCRC Unicode to fill this need.

Thanks for the great product.
Posts: 4
Joined: 1 Nov 2010

johndoe83753

Use MultiPAR so you can recover bad data.

That said, as a separate matter, block level copying in FFS would be great.