I use PDF printing to collect news items and other articles. Consequently I have saved about 1500 files in my ‘PDF’ folder. At times I move some of these files to directories that are organized by topic. In my home directory there are about 8 directories and within those altogether 1000 subdirectories from which to choose.
An automatic filing system would be an ideal solution, but to my knowledge no such software is freely available for Linux. Perhaps it is possible to develop software for organizing files? From what I have heard, Python might be a good environment because it seems to have properties that are well suited for dealing with strings. However, programming is new to me. At this point I do not want to learn Python.
I have settled for Bash programming and will be satisfied with what is feasible there. That is why what follows is an exercise in Bash programming. Please remember, I am a newbie, and your comments and suggestions are most welcome. I suppose the code I develop here can be made more efficient. And there may be bugs I do not see.
In order to define which files should go into which folders I shall use file names. I shall generate a list of files to be relocated and a list of possible destination directories. I shall compare the names of files to be transferred with the names of files that are already located in the destination folders.
The goal
My aim is to generate a text file ‘transferlist.txt’ that contains a list of ‘mv’ commands telling which documents should be moved into which directories:
Code:
mv "Vibrations Turn Water Into Hydrogen.pdf" "/home/user/700 Energy/Hydrogen"
mv "What Synaptic Package Manager can do.pdf" "/home/user/300 Computers/Linux/Synaptic"
etc.
Once ‘transferlist.txt’ has been checked, it can be executed with the command below. Because the file is handed to the shell as an argument, it does not even need to be made executable first:
Code:
sh transferlist.txt
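One thing to watch for: a file name that itself contains a double quote, a dollar sign or a backtick would break the double-quoted ‘mv’ lines shown above. Below is a minimal sketch of a safer way to write such lines, using single quotes instead. The function names sh_quote and mv_line and the sample file name are made up for illustration; they are not part of my scripts.

```shell
#!/bin/bash
# sh_quote STRING: print STRING wrapped in single quotes so that a POSIX
# shell reads it back unchanged. Each embedded ' becomes '\''.
# (A trailing newline in STRING would be lost; that is ignored here.)
sh_quote() {
    printf "'%s'" "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")"
}

# mv_line FILE DIR: print one 'mv' command line for transferlist.txt.
# The -- tells mv that no options follow, in case a name starts with a dash.
mv_line() {
    printf 'mv -- %s %s\n' "$(sh_quote "$1")" "$(sh_quote "$2")"
}

# Hypothetical example with awkward characters in the file name:
mv_line "It's \"quoted\" \$5.pdf" "/home/user/700 Energy/Hydrogen"
```

Lines written this way can still be run with sh, because single quoting works the same way in sh and in Bash.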
Listing files and possible destinations
We shall start by preparing a list of files in the ‘PDF’ folder. The aim is to use this list, ‘pdffiles.txt’, located in the ~/Test directory, to indicate which files are to be moved to other destinations. For clarity, let me point out that in addition to PDF files there are other types of files in the PDF folder as well.
(The following scripts have been edited on the basis of suggestions received in response to this post.)
Code:
#!/bin/bash
# Build 'pdffiles.txt', the list of files that are to be moved to other destinations.
for f in ~/PDF/*; do
    [[ -e $f ]] || continue
    echo "$f" >> ~/Test/pdffiles.txt
done
# List the most likely destination directories - those that have at least 2 pdf files in them.
# The list below does not contain any hidden folders.
# Note: the *.pdf pattern stays inside the single quotes so that it is expanded in each directory, not in the current one.
find ~/ \( ! -regex '.*/\..*' \) -type d -exec sh -c 'set -- "$0"/*.pdf; [ "$#" -gt 1 ]' {} \; -print >> ~/Test/folders1.txt
# Edit folder pathnames so that individual words are separated for future matching with files
while IFS= read -r UneditedFolderLine
do
    # The next 2 lines replace slashes ('/') and underscores with spaces to separate words in folder pathnames
    Wordsseparated1=${UneditedFolderLine//\//" "}
    Wordsseparated=${Wordsseparated1//_/" "}
    # Not all directories are intended to be destinations for file transfers.
    for Folderword in $Wordsseparated
    do
        case "$Folderword" in
            *Music*|*Pictures*|*Calibre*|*PDF*)
                Warning="On"
                break
                ;;
            *)
                Warning="Off"
                ;;
        esac
    done
    if [ "${Warning}" = Off ]
    then
        echo "$UneditedFolderLine" >> ~/Test/folders2.txt
    fi
done < ~/Test/folders1.txt # This file was input for the previous set of operations.
# Sort the list of directories alphabetically:
sort -d ~/Test/folders2.txt > ~/Test/folders3.txt
# Sort the list on the basis of length, ascending, producing the final list of destination folders:
awk '{ print length, $0 }' ~/Test/folders3.txt | sort -n | cut -d" " -f2- > ~/Test/folders4.txt
So the key results of the first script are ‘pdffiles.txt’ listing the files to be moved and ‘folders4.txt’ listing potential destinations. Next we shall find matches between lines of these two documents.
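The length sort in the last line of the script can be tried out on its own. The sketch below wraps the same awk, sort and cut pipeline in a function and feeds it three sample paths. The function name sort_by_length and the sample paths are assumptions made up for this demonstration.

```shell
#!/bin/bash
# sort_by_length: read lines from standard input and print them shortest first.
# awk puts the length in front of each line, sort -n orders by that number,
# and cut removes the number again, keeping the spaces inside the path.
sort_by_length() {
    awk '{ print length($0), $0 }' | sort -n | cut -d" " -f2-
}

# Hypothetical sample paths, only for demonstration.
printf '%s\n' \
    "/home/user/300 Computers/Linux/Synaptic" \
    "/home/user/700 Energy" \
    "/home/user/700 Energy/Hydrogen" | sort_by_length
```

Because cut only removes the first field, paths that contain spaces come out intact.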
Matching the names of files to be transferred and files already in destination folders
So we have a list of files to be transferred, ‘pdffiles.txt’, and a list of possible destinations, ‘folders4.txt’. The code below generates ‘transferlist.txt’ on the basis of matching words in filenames.
For each file to be moved the code builds an index. The code visits every possible destination folder and sees how many times the words in the filename match words in the filenames of files in the destination folder. The folder that gets the highest index is selected to become the destination of the file to be transferred.
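The word matching at the heart of this idea can be sketched in isolation, which also makes it easier to test. This is only a minimal sketch, not the full script: the function names keep_word and count_matches and the sample file names are made up for illustration. It assumes that the directory part and the file ending should be stripped before a name is split into words.

```shell
#!/bin/bash
# keep_word WORD: succeed if WORD has at least 4 characters and no digits.
keep_word() {
    [ "${#1}" -ge 4 ] || return 1
    case $1 in
        *[0-9]*) return 1 ;;
    esac
    return 0
}

# count_matches NAME1 NAME2: print how many word pairs are identical between
# two file names, after dropping the directory part and the file ending.
count_matches() {
    local a b wa wb n
    a=${1##*/}; a=${a%.*}
    b=${2##*/}; b=${b%.*}
    n=0
    set -f                      # no filename globbing while splitting into words
    for wa in $a; do
        keep_word "$wa" || continue
        for wb in $b; do
            if [ "$wa" = "$wb" ]; then
                n=$((n + 1))
            fi
        done
    done
    set +f
    echo "$n"
}

# Hypothetical example: "Water" and "Hydrogen" are the shared words.
count_matches "/home/user/PDF/Vibrations Turn Water Into Hydrogen.pdf" \
              "Cheap Hydrogen From Water.pdf"
```

Adding up such counts over all files in a folder gives one possible index for that folder.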
So here we go.
Code:
#!/bin/bash
# Start by reading the list of documents (pdffiles.txt) to be moved
while IFS= read -r Filerow
do
    Index=0 # Index counts matches between words. Set it to zero.
    SoFarBestIndex=0
    TheBestFolder=aa
    SoFarBestFolder=bb
    # pdffiles.txt holds full paths, so drop the directory part and the file ending before splitting into words
    Filename=$(basename "$Filerow")
    Filename="${Filename%.*}"
    for Fileword in $Filename # Taking each word in the name of the file
    do
        if [ "${#Fileword}" -lt 4 ] || # except those words that are shorter than 4 characters
           [[ "${Fileword}" = *[[:digit:]]* ]] # or contain digits.
        then
            continue
        else
            # Next read the list of destination folders from file folders4.txt
            while IFS= read -r UneditedFolderrow
            do
                Index=0 # Index counts matches between words. Set it to zero.
                for FileInFolder in "$UneditedFolderrow"/*
                do
                    [ ! -f "$FileInFolder" ] && continue
                    # Next let us chop the names of files in the folder into separate words
                    FileInFolder1=$(basename "$FileInFolder")
                    FileInFolder2="${FileInFolder1%.*}" # Directory part and file ending removed; the loop below splits the rest into words.
                    for WordinFileInFolder in $FileInFolder2
                    do
                        if [ "${#WordinFileInFolder}" -lt 4 ] || [[ "${WordinFileInFolder}" = *[[:digit:]]* ]] ; then
                            continue
                        else
                            if [ "$WordinFileInFolder" = "$Fileword" ] # If there is a match,
                            then
                                Index=$((Index + 1)) # then add 1 to the Index.
                            fi
                        fi
                    done
                done
                FolderrowIndex=$Index # Setting the index for the folder as a whole, also when the folder has no files
                if [[ $FolderrowIndex -gt $SoFarBestIndex ]]
                then
                    SoFarBestFolder="$UneditedFolderrow"
                    SoFarBestIndex=$FolderrowIndex
                fi
                Index=0 # Even if the $UneditedFolderrow is not the best match, the Index has to be set to zero.
            done < /home/user/Test/folders4.txt # The file containing destination directories.
            TheBestFolder="$SoFarBestFolder"
        fi
    done
    if [ "${TheBestFolder}" != "aa" ] && [ "${TheBestFolder}" != "bb" ]; then
        printf 'mv "%s" "%s"\n' "$Filerow" "$TheBestFolder" >> /home/user/Test/transferlist19.txt
    else
        # List files that found no matching destinations:
        printf 'mv "%s" "%s"\n' "$Filerow" "/home/user/PDF1" >> /home/user/Test/orphans19.txt
    fi
done < /home/user/Test/pdffiles.txt # The file listing files to be transferred.
My computer has an Intel i5-3550 CPU and 8 GiB of memory. Nevertheless, my original effort trying to match 1500 files with 1000 destination directories was too much. The bash script killed itself after running for some 6 hours. I have now edited the script above and it runs much more smoothly, allowing me to multitask without any problems.
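A lot of the running time probably goes into reading every destination folder again for every word of every file. One idea, which I have not measured, is to collect the words of each folder once and keep them in memory. Here is a minimal sketch under some assumptions: it needs Bash 4 or newer for associative arrays, the names FolderWords, cache_folder and score_folder are made up for illustration, and the score is the total number of matches over all words of the file name, which is only one way to define the index.

```shell
#!/bin/bash
# Requires Bash 4 or newer (associative arrays).
declare -A FolderWords      # folder path -> words found in its file names

# cache_folder DIR: read DIR once and remember the words of its file names.
cache_folder() {
    local dir=$1 f name words=""
    for f in "$dir"/*; do
        [[ -f $f ]] || continue
        name=${f##*/}               # drop the directory part
        words+=" ${name%.*}"        # drop the file ending
    done
    FolderWords["$dir"]=$words
}

# score_folder FILENAME DIR: print the total number of matches between the
# words of FILENAME and the cached words of DIR (cache_folder must run first).
score_folder() {
    local name=${1##*/} w fw n=0
    name=${name%.*}
    set -f                          # no filename globbing while splitting
    for w in $name; do
        # Skip words shorter than 4 characters or containing digits.
        [[ ${#w} -ge 4 && $w != *[[:digit:]]* ]] || continue
        for fw in ${FolderWords["$2"]}; do
            if [[ $w == "$fw" ]]; then
                n=$((n + 1))
            fi
        done
    done
    set +f
    echo "$n"
}
```

The intended use would be to call cache_folder once for every line of the destination list, and then score_folder for each file and folder pair, keeping the folder with the highest score.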
I am not quite sure if the indexing part is accurate. I have tried to test it the best I can, but my logic fails to confirm the accuracy of the code. Do you see any mistakes?
Anyway, don’t you think that this is a proof of concept for a semiautomatic filing system? Isn’t that something any file addict needs?