Showing many duplicate files
I often backed up photos from a mobile phone without remembering whether I had already made other backups, so I probably have a number of duplicates. File names are usually different, although the contents are identical byte for byte. I found the programme fdupes, which can list duplicates. fdupes -Sr lists them like this:
2167950 bytes each:
./2018-03-09 All/IMG_4175.JPG
./IMG_20171113_170712.JPG

3154136 bytes each:
./2018-03-09 All/IMG_0187.JPG
./IMG_20160715_183705.JPG

2836777 bytes each:
./2018-03-09 All/IMG_0807.JPG
./IMG_20161123_011537.JPG
Still, I have several thousand duplicates, spread across different directories, so I need something more readable. I stored the output in a file and then wrote an awk script to run on it, which gives me something like this:
1) .
2) ./2018-03-09 All

3985 set(s) of duplicates:

IMG_20171113_170712.JPG IMG_4175.JPG
IMG_20160715_183705.JP  IMG_0187.JPG
IMG_20161123_011537.JPG IMG_0807.JPG
Files are grouped by the list of directories under which they are found (the same directory can appear multiple times, and any number of directories is possible).
I inspect these lists, process them with emacs to build lists of files to delete (I use the kill-region C-w and kill-rectangle C-x r k functions), and when I am confident enough that I have not made any mistake, I run "xargs rm" (note: don't do that if you have file names containing spaces or characters interpreted by the shell, like * or ?).
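As a stopgap until the script escapes special characters itself, NUL-delimited input makes xargs safe for such names. A sketch, where to_delete.txt is a hypothetical file holding one path per line (this only fails for file names that themselves contain a newline):

```shell
# to_delete.txt (hypothetical) holds one file path per line.
# tr turns each newline into a NUL byte, and xargs -0 then passes the
# paths to rm verbatim, so spaces, * and ? are taken literally.
tr '\n' '\0' < to_delete.txt | xargs -0 rm --
```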
My awk script is probably overcomplicated, but it seems to work. I tried to put in enough comments to make it readable. You can use it, and I welcome suggestions. While writing this post, it occurred to me that one improvement would be to put a \ before space, * and ? characters, to make the output safe to run through "xargs rm".
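For context, the script relies on awk's paragraph mode: with RS set to the empty string, each blank-line-separated group from fdupes becomes one record, and with FS = "\n" each line of the group becomes one field. A minimal sketch of that idea on fake input (the full script below additionally needs GNU awk, since gensub, PROCINFO["sorted_in"] and arrays of arrays are gawk extensions):

```shell
# Two fake fdupes groups, separated by a blank line. RS="" reads each
# group as one record, FS="\n" makes every line a field, so NF - 1 is
# the number of duplicate files (field 1 is the size line).
printf '10 bytes each:\n./a/x.jpg\n./y.jpg\n\n20 bytes each:\n./a/z.jpg\n./w.jpg\n' \
| awk 'BEGIN { RS = "" ; FS = "\n" } { printf "group %d: %d files\n", NR, NF - 1 }'
```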
# Sort full paths by dir, ignoring the file name
function compare_by_dir(i1, v1, i2, v2)
{
    # i1 and i2 are indexes, they are ignored
    # v1 and v2 are values, they are full file paths
    # extract the directory:
    # it is the longest string of non-newline characters followed by / and at least one non-newline
    dir1 = gensub(/([^\n]+)\/[^\n]+/, "\\1", "g", v1)
    dir2 = gensub(/([^\n]+)\/[^\n]+/, "\\1", "g", v2)
    if (dir1 < dir2)
        return -1
    if (dir1 > dir2)
        return 1
    return 0
}
# a record is a list of lines, separated by an empty line
BEGIN { RS = "" ; FS = "\n" ; max_len = 1 }
{
    record = $0
    # Remove line with size information
    sub(/[0-9]+ bytes each:\n/, "", record)
    # Split each line of the record into path_array
    split(record, path_array, "\n")
    dir_string = ""
    file_string = ""
    # parse directories sorted by name to avoid e.g. getting two dir_strings A:::B and B:::A
    PROCINFO["sorted_in"] = "compare_by_dir"
    for (i in path_array)
    {
        # extract dir name and file name from full path
        dir = gensub(/([^\n]+)\/[^\n]+/, "\\1", "g", path_array[i])
        file = gensub(/[^\n]+\/([^\n]+)/, "\\1", "g", path_array[i])
        # concatenate dir/file names into dir_string/file_string, with ::: as separator
        if (dir_string == "") {
            dir_string = dir
            file_string = file
        }
        else {
            dir_string = dir_string ":::" dir
            file_string = file_string ":::" file
        }
    }
    # store the max file name length (to use as column width for printing)
    if (length(file_string) > max_len)
        max_len = length(file_string)
    # dir_array[dir_string] is an array
    # dir_array[dir_string][i], where i is an integer from 1 to n, is a file_string
    # dir_array[dir_string]["files"] is n (i.e. highest index value)
    # set dir_array[dir_string]["files"] for the new file_string
    # note the parentheses: "!dir_string in dir_array" would test (!dir_string) in dir_array
    if (!(dir_string in dir_array))
        dir_array[dir_string]["files"] = 1
    else
        dir_array[dir_string]["files"]++
    # add the new file_string
    dir_array[dir_string][dir_array[dir_string]["files"]] = file_string
}
END {
    # make a separator, used between dir groups
    sep = "-"
    for (i = 1 ; i < max_len ; i++)
        sep = sep "-"
    # compare_by_dir is not suitable to parse dir_array, so restore default sort
    PROCINFO["sorted_in"] = "@unsorted"
    for (dir_string in dir_array) {
        # split every string to an array again, so that we can print it on different lines
        split(dir_string, dir_list, ":::")
        # print each dir with a number
        i = 1
        for (dir in dir_list) {
            printf "%d) %s\n", i, dir_list[dir]
            i++
        }
        # i reached (number of files in a duplicate set) + 1
        num_files = i - 1
        printf "\n%d set(s) of duplicates:\n\n", dir_array[dir_string]["files"]
        # parse all file_strings
        for (j = 1 ; j <= dir_array[dir_string]["files"] ; j++) {
            # split into elements, to print in columns
            split(dir_array[dir_string][j], file_array, ":::")
            for (k = 1 ; k <= num_files ; k++)
                printf "%-" max_len "s", file_array[k]
            print ""
        }
        # separate from the next group
        printf "\n"
        for (k = 1 ; k <= num_files ; k++)
            printf sep
        printf "\n\n"
    }
}
The way you are doing it is probably safer, but assuming that all these pictures were taken with the same device, and not in burst mode, I might try to use file metadata instead and identify duplicates based on the time at which each picture was created. I believe this is what you get in a terminal with: ls -l --full-time
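If the timestamp route is tried, sorting by modification time puts identical times on adjacent lines. A sketch, assuming GNU stat and JPG files in the current directory:

```shell
# Print "mtime  name" for every JPG and sort, so files sharing the
# same modification time end up next to each other (GNU stat assumed).
stat -c '%y %n' ./*.JPG | sort
```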
Otherwise, if manually selecting, classifying and sorting all these pictures were not an option, I would most probably dump them all at once to save time and storage space.
ls -l --full-time
I looked at that too, and so far it would have worked. For some photos, though, I found multiple files identical to each other but with slightly different timestamps (a 30 to 120 s difference).
Otherwise, if manually selecting, classifying and sorting all these pictures..
I do a manual selection, but with a long delay. For sorting, I usually just move the files of a given date into a directory whose name starts with that date and describes what is inside. This has been a good extension of my memory so far.
Anyway, below is a version with improved output (the column width is adapted for each column, and separately for each set of duplicates). At least I am learning about awk.
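The per-column widths work because awk's printf format string is just a string, so the width can be spliced in at run time, exactly as the "col" entries are used below. A tiny sketch (the width 15 and the file name are made up):

```shell
# Concatenating the width into the format yields "%-15s": left-justify
# and pad to 15 columns; the trailing | makes the padding visible.
awk 'BEGIN { w = 15 ; printf "%-" w "s|\n", "IMG_0187.JPG" }'
```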
# Sort full paths by dir, ignoring the file name
function compare_by_dir(i1, v1, i2, v2)
{
    # i1 and i2 are indexes, they are ignored
    # v1 and v2 are values, they are full file paths
    # extract the directory:
    # it is the longest string of non-newline characters followed by / and at least one non-newline
    dir1 = gensub(/([^\n]+)\/[^\n]+/, "\\1", "g", v1)
    dir2 = gensub(/([^\n]+)\/[^\n]+/, "\\1", "g", v2)
    if (dir1 < dir2)
        return -1
    if (dir1 > dir2)
        return 1
    return 0
}
# a record is a list of lines, separated by an empty line
BEGIN { RS = "" ; FS = "\n" ; max_len = 1 }
{
    record = $0
    # Remove line with size information
    sub(/[0-9]+ bytes each:\n/, "", record)
    # Split each line of the record into path_array
    split(record, path_array, "\n")
    dir_string = ""
    file_string = ""
    # parse directories sorted by name to avoid e.g. getting two dir_strings A:::B and B:::A
    PROCINFO["sorted_in"] = "compare_by_dir"
    num_dir = 0
    for (i in path_array)
    {
        num_dir++
        # extract dir name and file name from full path
        dir = gensub(/([^\n]+)\/[^\n]+/, "\\1", "g", path_array[i])
        file = gensub(/[^\n]+\/([^\n]+)/, "\\1", "g", path_array[i])
        file_len[num_dir] = length(file)
        # concatenate dir/file names into dir_string/file_string, with ::: as separator
        if (dir_string == "") {
            dir_string = dir
            file_string = file
        }
        else {
            dir_string = dir_string ":::" dir
            file_string = file_string ":::" file
        }
    }
    # store the max file name length (to use as column width for printing)
    if (length(file_string) > max_len)
        max_len = length(file_string)
    # dir_array[dir_string] is an array
    # dir_array[dir_string][i], where i is an integer from 1 to n, is a file_string
    # dir_array[dir_string]["files"] is n (i.e. highest index value)
    # dir_array[dir_string]["col" i], where i is an integer from 1 to the number of dirs/files
    # in a dir_string/file_string, is the max length of a filename in column i
    # if this is the first occurrence of this dir_string
    # note the parentheses: "!dir_string in dir_array" would test (!dir_string) in dir_array
    if (!(dir_string in dir_array)) {
        # capture the number of file_strings, i.e. 1
        dir_array[dir_string]["files"] = 1
        # capture the number of directories/files in the dir_string/file_string
        dir_array[dir_string]["num_dir"] = num_dir
        # set the length of columns to the filename lengths
        for (i = 1 ; i <= num_dir ; i++)
            dir_array[dir_string]["col" i] = file_len[i]
    }
    # if this is not the first occurrence
    else {
        # increase the number of file_strings by one
        dir_array[dir_string]["files"]++
        for (i = 1 ; i <= num_dir ; i++)
            if (file_len[i] > dir_array[dir_string]["col" i])
                dir_array[dir_string]["col" i] = file_len[i]
    }
    # add the new file_string
    dir_array[dir_string][dir_array[dir_string]["files"]] = file_string
}
END {
    # compare_by_dir is not suitable to parse dir_array, so restore default sort
    PROCINFO["sorted_in"] = "@unsorted"
    for (dir_string in dir_array) {
        # split every string to an array again, so that we can print it on different lines
        split(dir_string, dir_list, ":::")
        # print each dir with a number
        i = 1
        for (dir in dir_list) {
            printf "%d) %s\n", i, dir_list[dir]
            i++
        }
        # i reached (number of files in a duplicate set) + 1
        num_files = i - 1
        printf "\n%d set(s) of duplicates:\n\n", dir_array[dir_string]["files"]
        # parse all file_strings
        for (j = 1 ; j <= dir_array[dir_string]["files"] ; j++) {
            # split into elements, to print in columns
            split(dir_array[dir_string][j], file_array, ":::")
            printf "%-" dir_array[dir_string]["col1"] "s", file_array[1]
            for (k = 2 ; k <= num_files ; k++)
                printf " %-" dir_array[dir_string]["col" k] "s", file_array[k]
            print ""
        }
        # separate from the next group
        printf "\n"
        for (k = 1 ; k <= num_files ; k++)
            for (l = 1 ; l <= dir_array[dir_string]["col" k] + 1 ; l++)
                printf "-"
        printf "\n\n"
    }
}