|
Last updated September 17 2024, Version 1.104
|
|
Privacy Statement
|
The privacy statement can be found here
|
Quick Start
|
Download
Documentation
Frequently asked questions (FAQ)
History
Donations
|
Introduction
|
Most hard disks contain quite a lot of completely identical files,
which consume a lot of disk space. This waste of space can be drastically
reduced by using the NTFS file system hardlink functionality to link the
identical files ("dupes") together.
Dupemerge searches for identical files on a logical drive and creates hardlinks among those
files, thus saving lots of hard disk space.
|
Installation
|
Dupemerge.exe is a command line utility which runs from a command prompt window or in a batch or cmd script. It needs no formal "setup". Copy dupemerge.exe
to some directory referenced by your PATH environment variable.
The %systemroot% (that is, c:\winnt or c:\windows) is a good place.
To remove it, delete the dupemerge.exe file from wherever you copied it.
|
|
Using dupemerge
|
dupemerge.exe is controlled by a few command line arguments, the highlights of which are as follows:
|
|
Specify path
|
More than one path can be specified to search for dupes.
dupemerge c:\data c:\test\42
The above command causes dupemerge to search
c:\data and c:\test\42 and below for dupes. Dupes may be
spread across the given subdirectory trees: e.g. if the files c:\data\a.txt,
c:\data\dd.txt and c:\test\42\new.bat are dupes, they get hardlinked
together.
|
|
--include
Include via wildcards
|
In certain situations only a few files below a path, e.g. *.pdb,
should be checked for dupes. To accomplish this, dupemerge can be run
with filters specified, to only match certain files.
dupemerge --include *.dbg --include a*.pdb c:\data
In the above example dupemerge only searches for files which
match the expressions specified with --include. The --include option can be used
more than once.
This option supports taking its arguments from a file.
dupemerge --include @List.txt c:\data
The above example references a file List.txt, which contains one matching pattern per line, e.g.
*.dbg
*.sbr
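For example, a small batch snippet can build such a pattern file and pass it to dupemerge (a sketch only; List.txt and the patterns are just the ones from the example above):
rem Write one wildcard pattern per line into List.txt, then pass it via @
echo *.dbg>List.txt
echo *.sbr>>List.txt
dupemerge --include @List.txt c:\data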
|
|
--includedir
Include directories via wildcards
|
To selectively run Dupemerge on certain directories, the --includedir option can be used with wildcards.
dupemerge --includedir *test c:\data
Basically any arbitrary wildcard expression can be used, because
the wildcard expressions are translated into regular expressions. This means that e.g. *src\\sub?older*
is also a valid wildcard expression for --includedir.
The wildcard expression specified under --includedir is applied to the whole path, which means that e.g.
dupemerge --includedir "*fotos\\temp*" c:\data
will include all directories containing 'fotos\temp' and their subdirectories. The above example will
e.g. include 'fotos\tempur\myfotos', 'fotos\temp\myfotos' or 'fotos\tempomat\myfotos'. Please note
that '\' has to be escaped via '\\'.
This option supports taking its arguments from a file.
dupemerge --includedir @List.txt c:\data
The above example references a file List.txt, which contains one matching pattern per line, e.g.
*fotos\\temp*
aDir
|
|
--exclude
Exclude via wildcards
|
In certain situations all but a few files below a path, e.g. *.pdb,
should be checked for dupes. To accomplish this, dupemerge can be run
with filters specified, to exclude certain files.
dupemerge --exclude *.dbg --exclude a*.pdb c:\data
In the above example dupemerge searches all files except the ones given via
--exclude. The --exclude option can be used more than once.
This option supports taking its arguments from a file.
dupemerge --exclude @List.txt c:\data
The above example references a file List.txt, which contains one matching pattern per line, e.g.
*.pdb
*.sbr
Myfile*.*
|
|
--excludedir
Exclude directories via wildcards
|
In certain situations not all directories below a path should be checked for dupes. To accomplish this,
dupemerge can be run to exclude certain directories.
dupemerge --excludedir DontWantIt --excludedir DisLike c:\data
In the above example dupemerge searches all files except the ones below the directories excluded via --excludedir.
The --excludedir option can be used more than once in one invocation.
Basically any arbitrary wildcard expression can be used, because
the wildcard expressions are translated into regular expressions. This means that e.g. *file*.ext*.*
is also a valid wildcard expression for --excludedir.
dupemerge --excludedir *test* c:\data
The wildcard expression specified under --excludedir is applied to the whole path, which means that e.g.
dupemerge --excludedir "*fotos\\temp*" c:\data
will exclude all directories containing 'fotos\temp' and their
subdirectories. The above example will e.g. exclude 'fotos\tempur\myfotos', 'fotos\temp\myfotos'
or 'fotos\tempomat\myfotos'. Please note that '\' has to be escaped via '\\'.
This option supports taking its arguments from a file.
dupemerge --excludedir @List.txt c:\data
The above example references a file List.txt, which contains one matching pattern per line, e.g.
*fotos\\temp*
aDir
|
|
--regexp
Use regular expressions
|
In certain situations only certain kinds of files, e.g.
all files whose names contain only letters, should be checked for dupes. To accomplish this,
dupemerge can be run with regular expression filters specified, to only match certain
files.
dupemerge --regexp "[a-z]*" c:\data
In the above example dupemerge only searches for files which
match the regular expressions specified with --regexp. The --regexp option can be
used more than once.
This option supports taking its arguments from a file.
dupemerge --regexp @List.txt c:\data
The above example references a file List.txt, which contains one matching pattern per line, e.g.
[a-z]*
[0-9]*
|
|
--list
List only
|
To find out which files are dupes, but not hardlink
those files, dupemerge can be run in list mode:
dupemerge --list c:\data c:\test\42
An extensive report is generated showing which files are dupes in
c:\data and below and c:\test\42 and below.
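Since the report can be extensive, it can be handy to redirect it into a file for later review with plain cmd redirection (dupes.txt is just a hypothetical name):
dupemerge --list c:\data c:\test\42 > dupes.txt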
|
|
--minsize
--maxsize
Size dependent check
|
The size of the files to be compared can be controlled by two switches:
dupemerge --minsize 3000 --maxsize 500000 c:\data
In the above example dupemerge searches
for files bigger than 3000 bytes and smaller than 500000 bytes below
c:\data.
|
|
--sort
Sort Order
|
The output lists the found dupegroups either in random order, by cardinality, or by filesize. This
is controlled by the --sort switch, which takes the filesize or the cardinality
modifier. The default behaviour is to list dupegroups in random order.
dupemerge --sort cardinality c:\data
In the above example dupemerge searches
for files below c:\data and prints the output so that the dupegroup with the most
identical files is printed first, and the dupegroup with the fewest identical files is printed last.
dupemerge --sort filesize c:\data
In the above example dupemerge searches
for files below c:\data and prints the output so that the dupegroup containing the largest
files is printed first, and the dupegroup with the smallest files is printed last.
|
--supportfs
Support further filesystems
|
Nowadays there are a lot of filesystems from third party vendors which support hardlinks.
In order to provide the dupemerge.exe functionality on those drives, the supported filesystems can be given:
dupemerge --supportfs btrfs x:\location_to_be_deduped
Configuring your favourite filesystem to be recognized by dupemerge.exe is at your own risk. Basically dupemerge.exe performs all the operations
on the configured filesystems which it performs on NTFS. So make sure your filesystem supports the same primitives as NTFS does,
otherwise certain operations will fail.
|
|
|
|
Output
|
Dupemerge.exe reports the status of each operation in its output:
-f c:\backup\test\deleteme.dat
!*h c:\backup\test\cannothardlink.dat
!\f (0x00000005) c:\data\failed\AccessDenied.txt
!/f (0x00000005) c:\data\failed\MappingFailed.txt
Basically DupeMerge logs each action it performs, prefixing two characters
to each processed item on each line of the output. The first column of the output
contains the Operation which was performed, and the second column specifies the
Type of item which was processed.
Operation   Description
*           Hardlink a file.
-           Remove an item, e.g. the temporary file created during hardlinking.
?           Enumerate an item.
~           Item has been excluded by command line arguments.
\           Open a file.
/           Map a file into address space.
=           Move/rename a file.
!           An error happened.
'           Informational message.
|
Item   Description
f      A file is processed.
h      A hardlink is processed.
s      A symbolic link file or symbolic link directory is processed.
j      A junction is processed.
d      A directory is processed.
|
Sample                             Description
~d d:\source\mydir                 The directory d:\source\mydir has been excluded intentionally, e.g. via --excludedir.
~f d:\source\aFile                 The file d:\source\aFile has been excluded intentionally, e.g. via --exclude.
!\f (0x00000005) d:\src\deny       Read access to the file 'deny' was denied. The file is not part of the deduping process.
!/f (0x00000005) d:\src\deny       Could not map the file 'deny' into the address space to calculate a checksum. The file is not part of the deduping process.
!-f (0x00000005) d:\src\deny       Failed to delete a file because access was denied.
!=f (0x00000005) d:\src\Dupe.txt   Failed to rename a file before hardlinking.
!*h (0x00000476) d:\s\Gt1023.txt   Dupemerge reached the OS limit of 1024 hardlinks per file.
!*h (0x000005b4) d:\s\changed.txt  The timestamp of the file has changed since it was enumerated.
!*h (0x00000585) d:\               The NTFS implementation of this drive is broken: it returns the same file-index for files with different file sizes. This is only a warning; DupeMerge continues but ignores the retrieved file-indices and calculates all dupe info on its own, which takes a bit longer.
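Because each line starts with the two-character Operation/Type prefix, the log can be filtered with standard Windows tools. A minimal sketch, assuming the output was redirected into a file named dupemerge.log:
dupemerge --list c:\data > dupemerge.log
rem Show only the error lines, i.e. lines whose Operation column starts with '!'
findstr /B /C:"!" dupemerge.log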
|
|
|
Backgrounders
|
Dupemerge creates a cryptographic hashsum for each file found below
the given paths
and compares those hashes to each other to find the dupes. There is no file date comparison
involved in detecting dupes, only the size and content of the files.
To speed up comparison, only files with the same size get compared
to each other. Furthermore, the hashsums for equal-sized files
get calculated incrementally, which means that during the first
pass only the first 4 kilobytes are hashed and compared, and
during the next rounds more and more data is hashed and compared.
Due to long run times on large disks, a file which has already been
hashsummed might change before all dupes of that file are found.
To prevent false hardlink creation due to such intermediate changes,
dupemerge saves the file write time of a file when it hashsums
the file, and checks whether this time has changed when it tries to
hardlink dupes.
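To double-check by hand that two files reported as dupes are indeed byte-identical, the standard Windows tools fc and certutil can be used (a sketch; the file names are taken from the earlier example):
rem Byte-wise binary comparison of the two candidate files
fc /B c:\data\a.txt c:\data\dd.txt
rem Or compare their SHA256 hashes
certutil -hashfile c:\data\a.txt SHA256
certutil -hashfile c:\data\dd.txt SHA256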
Multiple Runs
Once dupemerge has been run, hardlinks among identical files
exist. To save time during a second run on the same
locations, dupemerge checks whether a file is already a hardlink, and
tries to find the other hardlinks by comparing the unique
NTFS file-id. This saves a lot of time, because
checksums for large files need not be created twice.
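The hardlink siblings of a file can also be listed by hand with the built-in fsutil tool, which shows all names pointing to the same NTFS file-id (the path is just an example):
fsutil hardlink list c:\data\a.txt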
Transaction based Hardlinking
Before DupeMerge hardlinks files together, it renames the file to a
temporary name, then creates the hardlink, and afterwards deletes the temporary file.
All that is done to be able to roll back the operation if e.g. the hardlinking fails.
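The same transaction can be sketched with built-in cmd commands (illustration only; the file and temporary names are hypothetical, and dupemerge performs these steps internally, with roll-back on failure):
rem 1. Move the dupe out of the way under a temporary name
ren c:\data\dupe.txt dupe.txt.tmp
rem 2. Create the hardlink under the original name, pointing to the kept copy
mklink /H c:\data\dupe.txt c:\data\original.txt
rem 3. On success drop the temporary file; on failure rename it back instead
del c:\data\dupe.txt.tmp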
TimeStamp Handling
A tuple of hardlinks to one file always has one common timestamp. This is by design of NTFS. But
things are a bit confusing sometimes, because after hardlinking the common timestamp is only shown
once the hardlink has been e.g. opened/accessed. So it may happen that immediately after dupemerge
one observes different timestamps within a tuple of hardlinks, but after such a hardlink has been opened
for e.g. read, the timestamp changes to the timestamp of the whole tuple. That is also by design of NTFS.
Dupemerge has a dupe-find algorithm which is tuned
to perform especially well on large server disks, where it has been
tested in depth to guarantee data integrity.
|
|
Limitations
|
- Dupemerge.exe can only be used with NT4/W2K/WXP/W2K3/Windows7/W2K8/W10/W11
- Dupes can only be merged within an NTFS volume, under NT4/W2K/WXP/W2K3/Windows7/W2K8/W10/W11
- Dupes can not be merged across NTFS volumes
- Dupes can only be merged on *fixed* NTFS volumes
- Dupes can only be merged on local NTFS volumes
- There is an NTFS limit of not more than 1023 hardlinks to one file. Dupemerge knows about this and won't create more than 1023 hardlinks for one file
|
|
Frequently Asked questions
|
Q: Hello, this may seem a basic question, but how do I know how much space dupemerge has saved by using hard links?
If I have two identical directories A & B and run dupemerge.exe C:\A C:\B, I'd imagine that the resulting size of the two directories would be halved. However Windows Explorer still thinks the size on disk of the two directories combined is double rather than half. Can you not see the saved space via Explorer?
A: You can't see the saved space via Explorer, because
Explorer simply adds up the sizes of the files found
below a given location. Explorer does not currently care whether two files are hardlinked;
it simply reports the size of each file's data and then totals the sizes, even though the total may include duplications.
To see the saved space, open a command prompt and
type 'dir', run dupemerge, and once again run 'dir'. Or via Explorer: open the drive
properties and compare the free space before and after running dupemerge.
One can use the Hard Link Shell Extension,
the ln.exe --list command,
or the ln.exe --truesize command
to see how many filenames are hardlinked
to any file. That can provide another way to learn how much space is saved by using hard links.
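A minimal way to measure the savings from the command line is to compare the volume's free bytes before and after a run, e.g. with the built-in fsutil tool:
fsutil volume diskfree c:
dupemerge c:\data
fsutil volume diskfree c:
rem The difference between the two reported free-byte values is the saved space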
|
|
History
|
August 14 2021
|
Version 1.104 released.
- Added the --supportfs option
- Recompiled with signed binary and new version number to make it to chocolatey
- [Internal] but important change from VS2005 (sic) to VS2017. Basically everything compiled smoothly except for the heap, thus...
- [Internal] Removed the Rockall fast heap. This was necessary, but also brought a big performance gain: memory allocation is 2 times faster,
and memory deletion is 10 times faster. Memory allocation is crucial for the core of Dupemerge.
- [Internal] Dropped Itanium configuration, since VS2017 does not support it anymore, and I am sure there is no Itanium hardware out in the wild anymore.
|
June 15 2017
|
Version 1.080 released.
|
October 18 2014
|
Version 1.07.001 released.
- Certain Unicode command line arguments could drive the command line parser crazy.
|
August 23rd 2013
|
Version 1.07 released.
- Dead junctions to a different drive could lead to not detecting hardlinks during all operations. Very nasty, but no data loss was caused.
|
August 20th 2013
|
Version 1.06 released.
- Introduced a sanity check for broken NTFS implementations.
- When hardlinking a tuple of equal files, the tuple's common date is the date of the oldest file in the tuple.
- The statistics are printed after the detailed log, not before.
|
August 7th 2013
|
Version 1.05 released.
- Fixed a crash during cleanup at the very end, when everything else had completed correctly.
|
April 5th 2013
|
Version 1.04 released.
- Fixed a crash when files were larger than 16 GB.
|
October 28th 2012
|
Version 1.03 released.
- The number of new dupegroups was reported non-deterministically when huge numbers of files were scanned, but it was always calculated correctly.
- The --output option didn't redirect output properly.
- Improved summary statistics.
- An error message is printed if the 1024 hardlink limit per file is exceeded.
|
October 1st 2012
|
Version 1.0 released.
- In rare situations not all dupes of a group were merged into one group, but into e.g. two groups.
- Fast file enumeration is now the default.
- Little tweaks here and there, but in general a long journey has ended and this version qualifies for a 1.0 release.
|
September 16th 2012
|
Version 0.9998 released.
- Fixed a bug which caused the message 'Could not map view of file' to show up on certain files, causing those files to not be part of the deduping process.
- Fixed a problem where dupemerge did not find all dupes.
- Time during operation is printed in readable hh:mm:ss.msec format.
- Dupemerge now uses the fast file enumerator which is also used in LSE and ln.exe. This speeds up file enumeration by a factor of 10+.
- Fixed a problem where files larger than 4 GB were never detected as dupes.
- Files larger than 4 GB sometimes might have caused the 'Could not map view of file' error message.
- In general, file size is no longer a limiting factor.
- DupeMerge now also handles files with the ReadOnly attribute set.
- Added the --output option.
- Added an error reporting capability.
- Added the --exclude option.
- Fixed the statistics output so that only new dupegroups are printed out.
|
February 25th 2012
|
Version 0.9994 released.
- Dupemerge does not climb down junctions or symbolic link directories, and it ignores symbolic link files.
|
December 17th 2010
|
Version 0.9993 released.
- Dupemerge does not merge more than 1022 equal files, because there is an
NTFS limit of not more than 1023 hardlinks to one file. This is a temporary
solution; let's see what can be done to improve the situation here.
- The regular expression engine used in dupemerge has been changed to tre-0.8.0,
which means regexp patterns are no longer case sensitive, and the regexp engine
now works correctly.
|
June 29th 2008
|
Version 0.999 released.
Dupemerge preserves the timestamps of original files when it merges the files via hardlinks.
Itanium binaries are available.
|
October 21st 2007
|
Version 0.998 released.
Dupemerge respects the NTFS limit of not more than 1023 hardlinks on one file.
Binaries are now available for 32-bit and 64-bit, because the compiler for this tool changed to VS2005.
|
January 25th 2007
|
Version 0.997
released.
There is an issue with dupemerge when it climbs down junctions/symbolic links. Until a proper
fix to the main algorithm is out, dupemerge does not run down junctions/symbolic links, and
it keeps files below junctions/symbolic links completely untouched.
|
November 3rd 2006
|
Version 0.995 released.
Improved the recursive runner performance, which yields faster scanning times.
Fixed a super rare merging bug which occurred if dupemerge was run a second time on the same directory
and file sizes and numbers were in a very rare constellation.
Migrated to the hardlink baseservices components, which also drive
LinkShellExt and
ln.
|
March 18th 2006
|
Version 0.991 released.
Fixed a critical merging bug, which occurred in very rare situations when
dupemerge was run a second time on the same directory. Did tests on large
amounts of data, and checked the output via a second program, to prove integrity.
Added calculation of the savings resulting from running dupemerge, even with the -l switch.
|
February 17th 2006
|
Version 0.985 released.
Added a check if given paths are on NTFS drives.
|
November 5th 2005
|
Version 0.98 released.
Fixed the problems with bogus progress output.
|
January 10th 2005
|
Version 0.95 released.
|
|
|
Status
|
The 1.100 version is the base for ongoing development, which will add support
for multi-core machines, so that hash calculation is distributed onto all available cores.
|
|
Acknowledgements
|
I wish to thank those who have contributed significantly to the development of dupemerge.
|
|
Open Issues
|
With Dupemerge 1.100 all known open issues have been resolved.
|
|
License
|
|
|
Contact / Donations
|
Send bug reports or feature requests to
Hermann Schinagl.
Dupemerge.exe is and will be freeware, but if Dupemerge.exe was really
helpful for you and saved lots of your time, please think of a donation, either via PayPal,
or by sending me a gift certificate,
or by donating bitcoins:
bc1q4hvevwrmnwt7jg8vws0v8xajywhffl4gwca5av
|
|
Download
|
|
|