dupemerge
Last updated June 29th 2008, Version 0.999
Introduction
Most hard disks contain quite a lot of completely identical files, which consume a lot of disk space. This waste of space can be drastically reduced by hardlinking identical files (aka dupes) together, using the hardlink functionality of the NTFS file system.
Dupemerge searches for identical files on a logical drive and creates hardlinks among those files, thus saving lots of hard disk space.


Installation
Dupemerge.exe is a command line utility, so copy dupemerge.exe to some directory referenced by your PATH environment variable. %systemroot% is a good place, e.g. c:\winnt

Using dupemerge
dupemerge.exe can be controlled by a few command line arguments. Its highlights are as follows:

Specify path
More than one path can be specified to search for dupes.

dupemerge c:\data c:\test\42

The above command causes dupemerge to search below c:\data and c:\test\42 for dupes. Dupes may be spread across the given directory trees: e.g. if the files c:\data\a.txt, c:\data\dd.txt and c:\test\42\new.bat are dupes, they get hardlinked together.
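What "hardlinked together" means can be sketched in a few lines of Python. This is only an illustration of the idea, not dupemerge's actual code: the helper name merge_into_hardlink and the temporary suffix are made up for this example, and it assumes that both files were already verified to be identical and live on the same volume.

```python
import os

def merge_into_hardlink(keep, dupe):
    """Replace the file `dupe` by a hardlink to `keep` (hypothetical helper)."""
    if os.path.samefile(keep, dupe):
        return  # both names already refer to the same file, nothing to do
    tmp = dupe + ".merge-tmp"   # temporary name, assumed not to exist yet
    os.link(keep, tmp)          # create a second name for keep's data
    os.replace(tmp, dupe)       # swap the new link in place of the dupe
```

Linking to a temporary name first means the dupe is only replaced once the new hardlink exists.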

Use wildcards
In certain situations only a few files below a path, e.g. *.pdb, should be checked for dupes. To accomplish this, dupemerge can be run with filters so that only certain files are matched.

dupemerge --wildcard *.dbg --wildcard a*.pdb c:\data

In the above example dupemerge only searches for files which match the expressions specified with --wildcard. The --wildcard option can be used more than once.
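The effect of several wildcards can be pictured with Python's fnmatch module. This is a sketch under two assumptions that are not stated above: that a file is accepted if it matches any one of the given wildcards, and that matching ignores case, as is usual for Windows file names. The function name is made up for the example.

```python
from fnmatch import fnmatchcase

def matches_any_wildcard(filename, wildcards):
    """True if the file name matches at least one wildcard pattern."""
    name = filename.lower()  # assumption: case-insensitive matching
    return any(fnmatchcase(name, w.lower()) for w in wildcards)
```

With the wildcards from the example, matches_any_wildcard("a1.pdb", ["*.dbg", "a*.pdb"]) accepts the file, while "b1.pdb" is skipped.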

Use regular expressions
In certain situations only some kinds of files, e.g. all files whose names contain only letters, should be checked for dupes. To accomplish this, dupemerge can be run with regular expression filters so that only certain files are matched.

dupemerge --regexp "[a-z]*" c:\data

In the above example dupemerge only searches for files which match the regular expressions specified with --regexp. The --regexp option can be used more than once.
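Which regular expression dialect dupemerge uses, and whether an expression has to match the whole file name, is not stated here. The following Python sketch uses the re module as a stand-in and assumes a whole-name match; the function name is made up for the example.

```python
import re

def matches_any_regexp(filename, expressions):
    """True if the whole file name matches at least one expression."""
    return any(re.fullmatch(e, filename) for e in expressions)
```

The assumption matters for an expression like "[a-z]*": matched against the whole name it accepts only names made of lowercase letters, whereas in a plain substring search it would accept every name, because it can also match the empty string.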

List only
To find out which files are dupes without hardlinking those files, dupemerge can be run in list mode:

dupemerge --list c:\data c:\test\42

An extensive report is generated, showing which files below c:\data and c:\test\42 are dupes.

Size dependent check
The size of the files which are compared can be controlled by two switches:

dupemerge --minsize 3000 --maxsize 500000 c:\data

In the above example dupemerge searches for files bigger than 3000 bytes and smaller than 500000 bytes below c:\data.

Sort output
The output lists the dupegroups found either in random order, by cardinality, or by file size. This is controlled by the --sort switch, which takes the filesize or the cardinality modifier. The default behaviour is to show dupegroups in random order.

dupemerge --sort cardinality c:\data

In the above example dupemerge searches for files below c:\data and prints the output so that the dupegroup with the most identical files is printed first, and the dupegroup with the fewest members is printed last.

dupemerge --sort filesize c:\data

In the above example dupemerge searches for files below c:\data and prints the output so that the dupegroup which contains the largest files is printed first, and the dupegroup with the smallest files is printed last.
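The two sort orders can be sketched as follows. The data layout, a list of (file size, list of paths) tuples, and the function name are assumptions made for this illustration and say nothing about dupemerge's internals.

```python
def sort_dupegroups(groups, mode):
    """groups: list of (file_size, [paths]); mode: 'cardinality' or 'filesize'."""
    if mode == "cardinality":
        key = lambda g: len(g[1])   # number of identical files in the group
    elif mode == "filesize":
        key = lambda g: g[0]        # size of the files in the group
    else:
        raise ValueError("unknown sort mode: " + mode)
    return sorted(groups, key=key, reverse=True)  # biggest first
```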


Backgrounders
Dupemerge creates a cryptographic hash sum for each file found below the given paths and compares those hashes to each other to find the dupes. There is no file date comparison involved in detecting dupes, which might cause trouble.

To speed up the comparison, only files of the same size get compared to each other. Furthermore, the hash sums for equal-sized files are calculated incrementally: during the first pass only the first 4 kilobytes are hashed and compared, and during the following rounds more and more data is hashed and compared.
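The size grouping and the round-by-round hashing can be sketched in Python as follows. This is not dupemerge's code: the hash function (SHA-256), the doubling of the block size after each round, and the function name are assumptions made for the illustration. Only the 4 kilobyte first pass is taken from the description above.

```python
import hashlib
import os
from collections import defaultdict

def find_dupes(paths, first_chunk=4096):
    """Return lists of paths whose contents are identical."""
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)  # only equal sizes can be dupes

    result = []
    for size, group in by_size.items():
        if len(group) < 2:
            continue
        candidates = [group]
        offset, chunk = 0, first_chunk
        while offset < size and candidates:
            next_round = []
            for cand in candidates:
                buckets = defaultdict(list)
                for p in cand:
                    with open(p, "rb") as f:
                        f.seek(offset)
                        digest = hashlib.sha256(f.read(chunk)).digest()
                    buckets[digest].append(p)
                # files that differ in this block drop out early
                next_round.extend(g for g in buckets.values() if len(g) > 1)
            candidates = next_round
            offset += chunk
            chunk *= 2  # assumption: hash more data in each later round
        result.extend(candidates)
    return result
```

Files that differ early are discarded after reading only a few kilobytes, so complete files are read only for real dupes and for files that differ near the end.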

Due to the long runtime on large disks, some files which have already been hashed might change before all dupes of that file are found. To prevent false hardlink creation due to such intermediate changes, dupemerge saves the last write time of a file when it hashes the file, and checks whether this time has changed when it tries to hardlink dupes.
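The write time check boils down to a snapshot and a later comparison. A minimal Python sketch, with made-up function names and os.stat standing in for whatever dupemerge uses internally:

```python
import os

def record_write_time(path):
    """Remember the last write time (in nanoseconds) at hashing time."""
    return os.stat(path).st_mtime_ns

def still_unchanged(path, recorded_ns):
    """True if the last write time is still the one recorded at hashing time."""
    return os.stat(path).st_mtime_ns == recorded_ns
```

A file for which still_unchanged() returns False would have to be skipped or hashed again before it is hardlinked.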

Once dupemerge has been run, hardlinks among identical files exist. To save time during a second run on the same locations, dupemerge checks if a file is already a hardlink, and tries to find the other hardlinks by comparing the unique NTFS file-id. This saves a lot of time, because checksums, especially those for large files, need not be created twice.
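The same idea can be tried out in Python, where os.stat reports the volume as st_dev and the file id as st_ino (on Windows this is the NTFS file index). The sketch below, with a made-up function name, groups path names that refer to the same file, so that only one name per group would need to be hashed:

```python
import os
from collections import defaultdict

def group_by_file_id(paths):
    """Group path names that are hardlinks to the same file."""
    groups = defaultdict(list)
    for p in paths:
        st = os.stat(p)
        groups[(st.st_dev, st.st_ino)].append(p)  # volume + file id
    return list(groups.values())
```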

Dupemerge has a dupe find algorithm which is extremely tuned to perform well especially on large server disks, where it has been tested in depth to guarantee data integrity.


Limitations
  • dupemerge.exe can only be used with NT4/W2K/WXP/W2K3
  • Dupes can only be merged within one NTFS volume, under NT4/W2K/WXP/W2K3
  • Dupes can not be merged across NTFS volumes
  • Dupes can only be merged on *fixed* NTFS volumes
  • Dupes can only be merged on local NTFS volumes
  • There is an NTFS limit of not more than 1023 hardlinks to one file. Dupemerge knows about this and refuses to create more than 1023 hardlinks for one file
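A check against the hardlink limit could look like the following Python sketch. The function name is made up, and it assumes that the limit of 1023 counts all names of a file, including the original one. How dupemerge counts internally is not documented here.

```python
import os

MAX_HARDLINKS = 1023  # NTFS limit as given in the list above

def can_add_hardlink(path, limit=MAX_HARDLINKS):
    """True if `path` has fewer than `limit` names, so one more link is allowed."""
    return os.stat(path).st_nlink < limit
```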

Frequently Asked questions
Q: Hello, this may seem a basic question, but how do I know how much space dupemerge has saved by using hard links? If I have two identical directories A & B and run dupemerge.exe C:\A C:\B, I’d imagine that the resulting size of the two directories would be halved.
However windows explorer still thinks the size on disk of the two directories combined is double rather than half.
Can you not see the saved space via explorer?

A: You can't see the saved space via Explorer, because Explorer simply adds up the sizes of the files found below a given location. Because hardlinks are transparent, Explorer does not know that a file it adds to the sum is a hardlink to a file it has already counted, so it counts it as a separate file.
To see the saved space, open a command prompt and type 'dir', run dupemerge, and run 'dir' once again.
Or via Explorer: open the drive and take a look at the drive properties before and after running dupemerge.
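The difference between the two ways of counting can be reproduced with a small Python sketch. The function name is made up; it compares a simple sum of file sizes, which corresponds to the Explorer behaviour described above, with a sum that counts every file id only once. Cluster rounding and other file system overhead are ignored.

```python
import os

def tree_sizes(root):
    """Return (naive_bytes, unique_bytes) for all files below root."""
    naive = unique = 0
    seen = set()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            st = os.stat(os.path.join(dirpath, name))
            naive += st.st_size             # what a simple sum reports
            file_id = (st.st_dev, st.st_ino)
            if file_id not in seen:         # count each physical file once
                seen.add(file_id)
                unique += st.st_size
    return naive, unique
```

For two identical directories that were merged by hardlinks, the first number stays at the doubled size, while the second is about half of it.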

History
June 29th 2008 Version 0.999 released.
Dupemerge preserves the timestamps of original files when it merges the files via hardlinks.

Itanium binaries are available.

October 21st 2007 Version 0.998 released.
Dupemerge respects the NTFS limit of having not more than 1023 hardlinks on one file

Binaries are now available for 32bit and 64bit, because the compiler for this tool changed to VS2005

January 25th 2007 Version 0.997 released.
There is an issue with dupemerge when it climbs down junctions. Until a proper fix to the main-algorithm is out, dupemerge is limited not to run down junctions, and it keeps files below junctions completely untouched.

When dupemerge was called with the same path given more than once, e.g. dupemerge.exe c:\1 c:\1, the files in that directory got accidentally deleted.

November 3rd 2006 Version 0.995 released.
Improved the recursive runner performance, which yields faster scanning time.

Fixed a super rare merging bug which occurred if dupemerge was run a second time on the same directory and file sizes and numbers were in a super rare constellation.

Migrated to the hardlink baseservices components, which also drive LinkShellExt and ln.

March 18th 2006 Version 0.991 released.
Fixed a critical merging bug, which occurred in very rare situations when dupemerge was run a second time on the same directory. Did tests on large amounts of data, and checked the output via a second program, to prove integrity.

Added a calculation of the savings resulting from running dupemerge, even with the -l switch.

February 17th 2006 Version 0.985 released.
Added a check if given paths are on NTFS drives.

November 5th 2005 Version 0.98 released.
Fixed the problems with bogus progress output.

January 10th 2005 Version 0.95 released

Status
The 0.998 version is stable enough to satisfy most needs. A bugfixing release is scheduled for February 2008, which should contain a fix for the junction problem.

Acknowledgements
I wish to thank those who have contributed significantly to the development of dupemerge.


Open Issues
  • There is an issue with junctions: If a file is found twice or more via a junction, it accidentally gets deleted.
  • The number of dupegroups gets counted slightly incorrectly.
  • There is a problem with Cyrillic characters in path names, which causes dupemerge to output wrong dupegroups, but behind the scenes the dupegroups get created correctly.

Disclaimer
This program is provided as is.

Contact / Donations
Send bug reports or feature requests to Hermann Schinagl.

Dupemerge.exe is and will remain freeware, but if Dupemerge.exe was really helpful for you and saved lots of your time, please think of a donation, either via PayPal or by sending me a gift certificate from amazon.de.

Download
Windows 2000
Windows XP
Windows Server 2003
Windows Vista
Please make sure that the necessary runtime .dlls are installed on your system. This prerequisites package can be downloaded from Microsoft:

vcredist_x86.exe for Vs2005 Sp1 (2.6 MB)

Afterwards install the
dupemerge.zip (26KB)

 
Vista64
Windows XP64
Please make sure that the necessary runtime .dlls are installed on your system. This prerequisites package can be downloaded from Microsoft:

vcredist_x64.exe for Vs2005 Sp1 (3.1 MB)

Afterwards install the
dupemerge64.zip (26KB)



Windows Itanium
Please make sure that the necessary runtime .dlls are installed on your system. This prerequisites package can be downloaded from Microsoft:

vcredist_IA64.exe for Vs2005 Sp1 (6.1 MB)

Afterwards install the
dupemergeItanium.zip (26KB)