dupemerge
Last updated August 23 2013, Version 1.07


Introduction

Most hard disks contain quite a lot of completely identical files, which consume a lot of disk space. This waste of space can be drastically reduced by using the NTFS file system hardlink functionality to link the identical files ("dupes") together.
Dupemerge searches for identical files on a logical drive and creates hardlinks among those files, thus saving lots of hard disk space.

Installation

Dupemerge.exe is a command line utility which runs from a command prompt window or in a batch or cmd script. It needs no formal "setup". Copy dupemerge.exe to some directory referenced by your PATH environment variable. The %systemroot% directory (that is, c:\winnt or c:\windows) is a good place.

To remove it, delete the dupemerge.exe file from wherever you copied it.


Using dupemerge

dupemerge.exe is controlled by a few command line arguments. The most important ones are described below.

Specify path
More than one path can be specified to search for dupes.

dupemerge c:\data c:\test\42

The above command causes dupemerge to search c:\data and c:\test\42 and everything below them for dupes. Dupes may be spread across the given subdirectory trees: for example, if the files c:\data\a.txt, c:\data\dd.txt and c:\test\42\new.bat are dupes, they get hardlinked together.

Include via wildcards
In certain situations only a few files below a path, e.g. *.pdb, should be checked for dupes. To accomplish this, dupemerge can be run with filters that match only certain files.

dupemerge --wildcard *.dbg --wildcard a*.pdb c:\data

In the above example dupemerge only searches for files which match the expressions specified with --wildcard. The --wildcard option can be used more than once.

Exclude via wildcards
In certain situations all but a few files below a path, e.g. *.pdb, should be checked for dupes. To accomplish this, dupemerge can be run with filters that exclude certain files.

dupemerge --exclude *.dbg --exclude a*.pdb c:\data

In the above example dupemerge searches all files except the ones matching the expressions given via --exclude. The --exclude option can be used more than once.
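
The effect of the --wildcard and --exclude filters can be pictured with a small Python sketch. It is only an illustration: it uses Python's fnmatch module and ignores case, both of which are assumptions, and the function name select_files is hypothetical. Dupemerge's own matcher may differ in details.

```python
from fnmatch import fnmatchcase

def select_files(names, include=(), exclude=()):
    """Keep names that match any include pattern (all names if none is given)
    and that match no exclude pattern."""
    selected = []
    for name in names:
        lowered = name.lower()  # assumption: matching ignores case
        if include and not any(fnmatchcase(lowered, p.lower()) for p in include):
            continue
        if any(fnmatchcase(lowered, p.lower()) for p in exclude):
            continue
        selected.append(name)
    return selected

names = ["app.dbg", "alpha.pdb", "beta.pdb", "readme.txt"]
included = select_files(names, include=["*.dbg", "a*.pdb"])
excluded = select_files(names, exclude=["*.dbg", "a*.pdb"])
```

With the patterns from the two examples above, included holds app.dbg and alpha.pdb, while excluded holds the remaining two names.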

Use regular expressions
In certain situations some kinds of files, e.g. all files containing only letters, should be checked for dupes. To accomplish this, dupemerge can be run with regular expression filters specified, to only match certain files.

dupemerge --regexp "[a-z]*" c:\data

In the above example dupemerge only searches for files which match the regular expressions specified with --regexp. The --regexp option can be used more than once.

List only
To find out which files are dupes without hardlinking them, dupemerge can be run in list mode:

dupemerge --list c:\data c:\test\42

An extensive report is generated showing which files are dupes in c:\data and below and c:\test\42 and below.

Size dependent check
The size of the files to be compared can be controlled by two switches:

dupemerge --minsize 3000 --maxsize 500000 c:\data

In the above example dupemerge searches for files bigger than 3000 bytes and smaller than 500000 bytes below c:\data.

Sort Order
The found dupegroups can be printed in random order, by cardinality, or by filesize. This is controlled by the --sort switch, which takes either the filesize or the cardinality modifier. The default behaviour is to show dupegroups in random order.

dupemerge --sort cardinality c:\data

In the above example dupemerge searches for files below c:\data and prints the output so that the dupegroup with most identical files is printed first, and the dupegroup with fewest identical files is printed last.

dupemerge --sort filesize c:\data

In the above example dupemerge searches for files below c:\data and prints the output so that the dupegroup which contains the largest files is printed first, and the dupegroup with smallest files is printed last.
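
The two sort orders can be sketched in Python as follows. This is a minimal illustration, assuming a dupegroup is represented as a (file size, list of paths) pair; that representation and the name sort_dupegroups are hypothetical and say nothing about dupemerge's internals.

```python
def sort_dupegroups(groups, mode=None):
    """groups: list of (file_size, [paths]) pairs; mode: 'cardinality' or 'filesize'."""
    if mode == "cardinality":
        # the group with the most identical files comes first
        return sorted(groups, key=lambda g: len(g[1]), reverse=True)
    if mode == "filesize":
        # the group with the largest files comes first
        return sorted(groups, key=lambda g: g[0], reverse=True)
    return list(groups)  # no sort requested: keep the order in which groups were found

groups = [
    (4096, ["a.txt", "b.txt"]),
    (1048576, ["x.iso", "y.iso"]),
    (512, ["p.cfg", "q.cfg", "r.cfg"]),
]
by_cardinality = sort_dupegroups(groups, "cardinality")
by_filesize = sort_dupegroups(groups, "filesize")
```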

Output

Dupemerge.exe returns its status at the end of its operation:

-f c:\backup\test\deleteme.dat
!*h c:\backup\test\cannothardlink.dat
!\f (0x00000005) c:\data\failed\AccessDenied.txt
!/f (0x00000005) c:\data\failed\MappingFailed.txt

DupeMerge logs each action it performs and prefixes each line of the output with two characters. The first column of the output contains the Operation that was performed, and the second column specifies the Type of item that was processed.

Operation Description
* Hardlink a file
- Remove an item from the target that is not present in the source. Used during Smart Mirror
? Enumerate an item.
~ Item has been excluded by command line arguments.
\ Opening a file.
/ Map file into address space.
= Move/Rename a file.
! An error happened.

Item Description
f A File is processed.
h A Hardlink is processed.
s A Symbolic link file or Symbolic Link Directory.
j A Junction is processed.
d A Directory is processed.

Sample Description
~d d:\source\mydir The directory d:\source\mydir has been excluded intentionally, e.g. by --exclude.

~f d:\source\aFile The file d:\source\aFile has been excluded intentionally, e.g. by --exclude.

!\f (0x00000005) d:\src\deny The read access to the file 'deny' has been denied. This means the file is not part of the deduping process.

!/f (0x00000005) d:\src\deny Could not map the file 'deny' into the address space to calculate a checksum. This means the file is not part of the deduping process.

!-f (0x00000005) d:\src\deny Failed to delete a file because the access has been denied.

!=f (0x00000005) d:\src\Dupe.txt Failed to rename file before hardlinking.

!*h (0x00000476) d:\s\Gt1023.txt Dupemerge reached the OS limit of 1024 hardlinks per file.

!*h (0x000005b4) d:\s\changed.txt The timestamp of the file has changed since it was enumerated.

!*h (0x00000585) d:\ The NTFS implementation of this drive is broken. It returns the same file-index for files with different file sizes.
This is only a warning. DupeMerge continues but ignores the retrieved file-indices and calculates all dupe-info on its own, which takes a bit longer.
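
For post-processing the output in a script, the lines shown above can be split into their parts. The following Python sketch is one possible reading of the samples, not a specification of the format: it treats a leading ! as an error marker in front of the operation and type characters, and the error code in parentheses as optional. The names parse_line, OPERATIONS and ITEM_TYPES are hypothetical.

```python
import re

OPERATIONS = {"*": "hardlink", "-": "remove", "?": "enumerate", "~": "excluded",
              "\\": "open", "/": "map", "=": "move/rename"}
ITEM_TYPES = {"f": "file", "h": "hardlink", "s": "symbolic link",
              "j": "junction", "d": "directory"}

# optional '!', one operation character, one type character,
# an optional "(0x........)" error code, then the path
LINE_RE = re.compile(
    r"^(?P<error>!)?(?P<op>[*\-?~\\/=])(?P<item>[fhsjd])\s+"
    r"(?:\((?P<code>0x[0-9a-fA-F]+)\)\s+)?(?P<path>.+)$"
)

def parse_line(line):
    """Return a dict describing one output line, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    code = m.group("code")
    return {
        "error": m.group("error") is not None,
        "operation": OPERATIONS[m.group("op")],
        "item": ITEM_TYPES[m.group("item")],
        "code": int(code, 16) if code else None,
        "path": m.group("path"),
    }

sample = parse_line(r"!\f (0x00000005) c:\data\failed\AccessDenied.txt")
```

Lines that do not follow this pattern, such as summary statistics, yield None and can be handled separately.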


Backgrounders

Dupemerge creates a cryptographic hashsum for each file found below the given paths and compares those hashes to each other to find the dupes. There is no file date comparison involved in detecting dupes, only the size and content of the files.

To speed up comparison, only files with the same size get compared to each other. Furthermore the hashsums for equal sized files get calculated incrementally, which means that during the first pass only the first 4 kilobytes are hashed and compared, and during the next rounds more and more data is hashed and compared.
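
The idea of grouping by size first and then hashing in rounds can be sketched in Python. This is a simplified illustration, not dupemerge's actual algorithm: the use of SHA-256, the doubling of the chunk size per round, and hashing each chunk separately are all assumptions, and the name find_dupes is hypothetical.

```python
import hashlib
import os
from collections import defaultdict

def find_dupes(paths, first_chunk=4096):
    """Return lists of paths whose contents are identical."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    result = []
    for size, group in by_size.items():
        # only files of equal size can be dupes
        candidates = [group] if len(group) > 1 else []
        offset, chunk = 0, first_chunk
        while candidates and offset < size:
            refined = []
            for cand in candidates:
                by_hash = defaultdict(list)
                for path in cand:
                    with open(path, "rb") as fh:
                        fh.seek(offset)
                        digest = hashlib.sha256(fh.read(chunk)).digest()
                    by_hash[digest].append(path)
                # files that differ in this chunk drop out early
                refined.extend(g for g in by_hash.values() if len(g) > 1)
            candidates = refined
            offset += chunk
            chunk *= 2  # assumption: each round covers twice as much data
        result.extend(candidates)
    return result
```

Files that differ within the first 4 kilobytes are ruled out after a single small read, so large files are only read completely when they stay identical through every round.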

Due to long run times on large disks, a file which has already been hashsummed might change before all dupes of that file are found. To prevent false hardlink creation due to intermediate changes, dupemerge saves the write time of a file when it hashsums the file, and checks whether this time has changed when it tries to hardlink dupes.
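
The write-time check can be sketched in a few lines of Python. This only illustrates the principle; the function names snapshot and unchanged_since are hypothetical, and the sketch relies on the modification time reported by os.stat.

```python
import os

def snapshot(path):
    """Record the last-write time (in nanoseconds) at the moment the file is hashed."""
    return os.stat(path).st_mtime_ns

def unchanged_since(path, recorded_mtime_ns):
    """True if the file's last-write time still equals the recorded value."""
    return os.stat(path).st_mtime_ns == recorded_mtime_ns
```

A caller would store the value returned by snapshot together with the hash, and skip the hardlink step if unchanged_since later returns False.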

Multiple Runs
If dupemerge is run once, hardlinks among identical files are created. To save time during a second run on the same locations, dupemerge checks if a file is already a hardlink, and tries to find the other hardlinks by comparing the unique NTFS file-id. This saves a lot of time, because checksums for large files need not be created twice.
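
Finding names that already point to the same file, without reading any content, can be sketched in Python as below. The sketch assumes that os.stat exposes the file identity as the pair (st_dev, st_ino), which on Windows is filled from the volume serial number and the file index; the function name group_existing_hardlinks is hypothetical.

```python
import os
from collections import defaultdict

def group_existing_hardlinks(paths):
    """Group paths that already refer to the same file on disk."""
    by_id = defaultdict(list)
    for path in paths:
        st = os.stat(path)
        if st.st_nlink > 1:  # only files that have more than one name
            by_id[(st.st_dev, st.st_ino)].append(path)
    return [group for group in by_id.values() if len(group) > 1]
```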

Transaction based Hardlinking
Before DupeMerge hardlinks files together it renames the file to a temporary name, then creates the hardlink, and afterwards deletes the temporary file. All of that is done to be able to roll back the operation if, for example, the hardlinking fails.
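
The rename, link, delete sequence with roll-back can be sketched in Python. This is a minimal illustration of the steps described above, not DupeMerge's code; the function name replace_with_hardlink and the temporary file suffix are hypothetical.

```python
import os

def replace_with_hardlink(original, dupe):
    """Replace `dupe` with a hardlink to `original`, restoring `dupe` on failure."""
    backup = dupe + ".dupemerge-tmp"  # hypothetical temporary name
    os.rename(dupe, backup)           # step 1: move the dupe out of the way
    try:
        os.link(original, dupe)       # step 2: create the hardlink under the old name
    except OSError:
        os.rename(backup, dupe)       # roll back: put the dupe back in place
        raise
    os.remove(backup)                 # step 3: the old copy is no longer needed
```

If creating the hardlink fails, the dupe is restored under its original name and the error is passed on to the caller, so no data is lost.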

TimeStamp Handling
A tuple of hardlinks for one file always has one and the same timestamp. This is by design of NTFS. But things are a bit confusing sometimes, because after hardlinking the common timestamp is only shown once the hardlink has been opened or accessed. So it may happen that immediately after running dupemerge one observes different timestamps within a tuple of hardlinks, but after such a hardlink has been opened, e.g. for read, its timestamp changes to the timestamp of the whole tuple. That is also by design of NTFS.


Dupemerge has a dupe-find algorithm which is tuned to perform especially well on large server disks, where it has been tested in depth to guarantee data integrity.


Limitations
  • dupemerge.exe can only be used with NT4/W2K/WXP/W2K3/Windows7/W2K8
  • Dupes can only be merged within NTFS volumes, under NT4/W2K/WXP/W2K3/Windows7/W2K8
  • Dupes can not be merged across NTFS volumes
  • Dupes can only be merged on *fixed* NTFS volumes
  • Dupes can only be merged on local NTFS volumes
  • There is an NTFS limit of no more than 1023 hardlinks to one file. Dupemerge knows about this and won't create more than 1023 hardlinks for one file

Frequently Asked Questions

Q: Hello, this may seem a basic question, but how do I know how much space dupemerge has saved by using hard links? If I have two identical directories A & B and run dupemerge.exe C:\A C:\B, I'd imagine that the resulting size of the two directories would be halved. However Windows Explorer still thinks the size on disk of the two directories combined is double rather than half. Can you not see the saved space via Explorer?

A: You can't see the saved space via Explorer, because Explorer simply adds the size of files found below a given location. Explorer does not currently care when two files are hardlinked. It simply reports the size of each file's data, and then totals the sizes even though the total may include duplications. To see the saved space open a command prompt and type 'dir', run dupemerge, and once again run 'dir'. Or via Explorer: Open the drive and take a look at the drive properties, before and after running dupemerge.

One can use the Hard Link Shell Extension or the ln.exe --list command or the ln.exe --truesize command to see how many filenames are hard-linked to any file. That can provide another way to learn how much space is saved by using hard links.
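
The difference between the size Explorer adds up and the space actually used can also be estimated with a short script. The following Python sketch counts every file identity only once; it assumes that (st_dev, st_ino) from os.stat identifies a file, and the function name sizes is hypothetical. It is not how ln.exe --truesize is implemented.

```python
import os

def sizes(root):
    """Return (apparent_bytes, unique_bytes) for all files below root."""
    apparent = unique = 0
    seen = set()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            st = os.stat(os.path.join(dirpath, name))
            apparent += st.st_size           # what a naive sum reports
            file_id = (st.st_dev, st.st_ino)
            if file_id not in seen:          # count hardlinked data only once
                seen.add(file_id)
                unique += st.st_size
    return apparent, unique
```

The difference between the two returned numbers is the space saved by hardlinks below the given directory.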


History
August 23rd 2013 Version 1.07 released.
  • Dead junctions to a different drive could lead to hardlinks not being detected during all operations. Very nasty, but no data loss was caused.
August 20th 2013 Version 1.06 released.
August 7th 2013 Version 1.05 released.
  • Fixed a crash during cleanup at the very end when all was correctly done.

April 5th 2013 Version 1.04 released.
  • Fixed a crash when files were larger than 16gb.

October 28th 2012 Version 1.03 released.
  • The number of new dupegroups was reported non-deterministically when huge amounts of files were scanned, but was always calculated correctly.
  • The --output option didn't redirect output properly.
  • Improved summary statistics.
  • Error message is printed out if the 1024 hardlink limit per file is exceeded.

October 1st 2012 Version 1.0 released.
  • In rare situations not all dupes of a group were merged into one group but in e.g two groups.
  • Fast file enumeration is now the default.
  • Little tweaks here and there, but in general a long journey has ended and this version qualifies for a 1.0 release

September 16th 2012 Version 0.9998 released.
  • Fixed a bug which caused the message 'Could not map view of file' to show up on certain files, causing the files to not be part of the deduping process
  • Fixed a problem, where dupemerge did not find all dupes.
  • Time during operation is printed in readable hh:mm:ss.mss
  • Dupemerge now uses the fast file enumerator which is also used in LSE and ln.exe. This speeds up file enumeration by a factor of 10+
  • Fixed a problem where files larger than 4gb were never detected as dupes.
  • Files larger than 4gb sometimes might have caused the 'Could not map view of file' error message
  • In general filesize is not a limiting factor anymore
  • DupeMerge now also handles files with ReadOnly attribute set
  • Added the --output option
  • Added an error reporting capability.
  • Added the --exclude option
  • Fixed the statistic output so that only new dupegroups are printed out.

February 25th 2012 Version 0.9994 released.
  • Dupemerge does not climb down junctions or symbolic link directories and ignores symbolic link files.

December 17th 2010 Version 0.9993 released.
  • Dupemerge does not merge more than 1022 equal files, because there is an NTFS limit of no more than 1023 hardlinks to one file. This is a temporary solution. Let's see what can be done to improve the situation here.
  • The regular expression machine used in dupemerge has been changed to tre-0.8.0, which means regexp patterns are no longer case sensitive, and the regexp machine now works 100%.

June 29th 2008 Version 0.999 released.
Dupemerge preserves the timestamps of original files when it merges the files via hardlinks.

Itanium binaries are available.

October 21st 2007 Version 0.998 released.
Dupemerge respects the NTFS limit of no more than 1023 hardlinks to one file.

Binaries are now available for 32bit and 64bit, because the compiler for this tool changed to VS2005

January 25th 2007 Version 0.997 released.
There is an issue with dupemerge when it climbs down junctions/symbolic links. Until a proper fix to the main algorithm is out, dupemerge does not run down junctions/symbolic links, and it keeps files below junctions/symbolic links completely untouched.



November 3rd 2006 Version 0.995 released.
Improved the recursive runner performance, which yields faster scanning time.

Fixed a super rare merging bug, if dupemerge was run a second time on the same directory and file sizes and numbers were in a super rare constellation.

Migrated to the hardlink baseservices components, which also drive LinkShellExt and ln.

March 18th 2006 Version 0.991 released.
Fixed a critical merging bug, which occurred in very rare situations, when dupemerge was run a second time on the same directory. Did tests on large amounts of data, and checked the output via a second program, to prove integrity.

Added calculation on savings resulting in running dupemerge, even with -l switch.

February 17th 2006 Version 0.985 released. Added a check if given paths are on NTFS drives.

November 5th 2005 Version 0.98 released. Fixed the problems with bogus progress output.

January 10th 2005 Version 0.95 released

Status

The 1.07 version is the base for ongoing development, which will add support for multi-core machines, so that hash calculation is distributed onto all available cores.

Acknowledgements

I wish to thank those who have contributed significantly to the development of dupemerge.


Open Issues

With Dupemerge 1.07 all known open issues have been worked off.


License
  • This program is provided as is. Please see license.txt
  • ln.exe uses tre as the regular expression machine. See the BSD style tre license.
  • ln.exe uses ultragetopt for command line parsing. See the ultragetopt license.
  • ln.exe uses the Rockall v4.0 heapmanager for fast heap operations. See the EULA.

Contact / Donations
Bug reports or feature requests can be sent to Hermann Schinagl.

    Dupemerge.exe is and will be freeware, but if Dupemerge.exe was really helpful for you and saved lots of your time please think of donations either via PayPal, or by sending me a gift certificate from amazon.de, or by donating bitcoins:

    19irqhz5cMDkp2hf9YWqDY26PgYguJQdc7



    Download
    Windows 2000
    Windows XP
    Windows Server 2003
    Windows Vista
    Windows 7/8
    Please make sure that the necessary runtime .dlls are installed on your system. This prerequisites package can be downloaded from Microsoft:

    vcredist_x86.exe for VS2005 SP1, version 6195/June 2011 (2.6 Mb)

    Afterwards install the
    dupemerge.zip (167kB)

     

    Vista64
    Windows XP64
    Windows 7/8 64

    Please make sure that the necessary runtime .dlls are installed on your system. This prerequisites package can be downloaded from Microsoft:

    vcredist_x64.exe for VS2005 SP1, version 6195/June 2011 (3.0 Mb)

    Afterwards install the
    dupemerge64.zip (220kB)



    Windows Itanium
    Please make sure that the necessary runtime .dlls are installed on your system. This prerequisites package can be downloaded from Microsoft:

    vcredist_IA64.exe for VS2005 SP1, version 6195/June 2011 (6.3 Mb)

    Afterwards install the
    dupemergeItanium.zip (347kB)