
Thread: In search of a semiautomatic filing system - Bash newbie

  1. #1
    Join Date
    Nov 2008

    In search of a semiautomatic filing system - Bash newbie

    I use PDF printing to collect news items and other articles. Consequently I have saved about 1500 files in my ‘PDF’ folder. At times I move some of these files to directories that are organized by topic. In my home directory there are about 8 directories and within those altogether 1000 subdirectories from which to choose.

    An automatic filing system would be an ideal solution, but to my knowledge no such software is freely available for Linux. Perhaps it is possible to develop software for organizing files? From what I have heard, Python might be a good environment because it seems to have properties that are well suited for dealing with strings. However, programming is new to me. At this point I do not want to learn Python.

    I have settled for Bash programming and will be satisfied with what is feasible there. That is why what follows is an exercise in Bash programming. Please remember, I am a newbie, and your comments and suggestions are most welcome. I suppose the code I develop here can be made more efficient. And there may be bugs I do not see.

    In order to define which files should go into which folders I shall use file names. I shall generate a list of files to be relocated and a list of possible destination directories. I shall compare the names of files to be transferred with the names of files that are already located in the destination folders.

    The goal

    My aim is to generate a text file ‘transferlist.txt’ that contains a list of ‘mv’ commands telling which documents should be moved into which directories:

    mv "Vibrations Turn Water Into Hydrogen.pdf" "/home/user/700 Energy/Hydrogen"
    mv "What Synaptic Package Manager can do.pdf" "/home/user/300 Computers/Linux/Synaptic"
    When ‘transferlist.txt’ is ready, it can be executed simply by using the command below. Because the file is passed to sh as an argument, its permissions do not need to be changed first.

    sh transferlist.txt
    Listing files and possible destinations

    We shall start by preparing a list of files in the ‘PDF’ folder. The aim is to use this list, ‘pdffiles.txt’, located in the ~/Test directory, to indicate which files are to be moved to other destinations. For clarity let me point out that in addition to pdf files there are other types of files in the PDF folder as well.

    (The following scripts have been edited on the basis of suggestions received to this post.)

    # The list of files 'pdffiles.txt' that are to be moved to other destinations.
    for f in ~/PDF/*; do
        [[ -e $f ]] || continue
        echo "$f" >> ~/Test/pdffiles.txt
    done
    # List the most potential destination directories - those that have at least 2 pdf files in them.
    # The list below does not contain any hidden folders.
    find ~/ \( ! -regex '.*/\..*' \) -type d -exec sh -c 'set -- "$0"/*.pdf; [ $# -gt 1 ]' {} \; -print >> ~/Test/folders1.txt
    # Edit folder pathnames so that individual words are separated for future matching with files
    while read -r UneditedFolderLine
    do
        # The next 2 lines replace slashes ('/') and underscores with spaces to separate words in folder pathnames
        Wordsseparated1=${UneditedFolderLine//\//" "}
        Wordsseparated=${Wordsseparated1//_/" "}
        # Not all directories are intended to be destinations for file transfers.
        Warning=Off
        for Folderword in $Wordsseparated
        do
            case "$Folderword" in
                Music|Pictures|Calibre) Warning=On ;; # example words only; list your own excluded folders here
            esac
        done
        [ "${Warning}" = Off ] && echo "$UneditedFolderLine" >> ~/Test/folders2.txt
    done < ~/Test/folders1.txt # This file was the output of the find command above.
    # Sort the list of directories alphabetically:
    sort -d ~/Test/folders2.txt > ~/Test/folders3.txt
    # Sort the list on the basis of length, ascending, producing the final list of destination folders:
    awk '{ print length, $0 }' ~/Test/folders3.txt | sort -n | cut -d" " -f2- > ~/Test/folders4.txt
    So the key results of the first script are ‘pdffiles.txt’ listing the files to be moved and ‘folders4.txt’ listing potential destinations. Next we shall find matches between lines of these two documents.

    Matching the names of files to be transferred and files already in destination folders

    So we have a list of files to be transferred, ‘pdffiles.txt’, and a list of possible destinations, ‘folders4.txt’. The code below generates ‘transferlist.txt’ on the basis of matching words in filenames.

    For each file to be moved the code builds an index. The code visits every possible destination folder and sees how many times the words in the filename match words in the filenames of files in the destination folder. The folder that gets the highest index is selected to become the destination of the file to be transferred.

    So here we go.

    # Start by reading the list of documents (pdffiles.txt) to be moved
    while read -r Filerow
    do
        SoFarBestIndex=0 # The best index found so far for this file.
        TheBestFolder="aa" # Placeholder meaning that no folder has matched yet.
        Filename=$(basename "$Filerow") # Only the file name is compared, not its path,
        Filename="${Filename%.*}" # and the file ending is deleted.
        # Next read the list of destination folders from file folders4.txt
        while read -r UneditedFolderrow
        do
            Index=0 # Index counts matches between words. Set it to zero for every folder.
            for Fileword in $Filename # Taking each word in the name of the file
            do
                if [ "${#Fileword}" -lt 4 ] || # except those words that are shorter than 4 characters
                [[ "${Fileword}" = *[[:digit:]]* ]] # or contain digits.
                then
                    continue
                fi
                for FileInFolder in "$UneditedFolderrow"/*
                do
                    [ ! -f "$FileInFolder" ] && continue
                    # Next let us chop the names of files in the folder into separate words
                    FileInFolder1=$(basename "$FileInFolder")
                    FileInFolder2="${FileInFolder1%.*}" # File endings deleted; the loop below separates the words.
                    for WordinFileInFolder in $FileInFolder2
                    do
                        if [ "$WordinFileInFolder" = "$Fileword" ] # If there is a match,
                        then
                            let "Index += 1" # then add 1 to the Index.
                        fi
                    done
                done
            done
            FolderrowIndex=$Index # Setting the index for the folder as a whole
            if [[ $FolderrowIndex -gt $SoFarBestIndex ]]
            then
                SoFarBestIndex=$FolderrowIndex
                TheBestFolder=$UneditedFolderrow
            fi
        done < /home/user/Test/folders4.txt # The file containing destination directories.
        if [ "${TheBestFolder}" != "aa" ] && [ "${TheBestFolder}" != "bb" ]; then
            echo "mv \"$Filerow\" \"$TheBestFolder\"" >> /home/user/Test/transferlist19.txt
        else
            # List files that found no matching destinations:
            echo "mv \"$Filerow\" \"/home/user/PDF1\"" >> /home/user/Test/orphans19.txt
        fi
    done < /home/user/Test/pdffiles.txt # The file listing files to be transferred.
    My computer has an Intel i5-3550 CPU and 8GiB of memory. Nevertheless, my original effort trying to match 1500 files with 1000 destination directories was too much. The bash script killed itself after running some 6 hours. I have now edited the script above and it runs much more smoothly, allowing me to multitask without any problems.

    I am not quite sure if the indexing part is accurate. I have tried to test it the best I can, but my logic fails to confirm the accuracy of the code. Do you see any mistakes?

    Anyway, don’t you think that this is a proof of concept for a semiautomatic filing system? Isn’t that something any file addict needs?
    Last edited by gefalu2008; April 16th, 2013 at 12:42 PM.

  2. #2
    Join Date
    Feb 2013

    Re: In search of a semiautomatic filing system - Bash newbie

    ls > ~/pdffiles.txt
    This won't work if any of the file names contain embedded newlines. An unlikely situation to be sure, but not impossible.
    At the very least, hide control characters in file names with ls -q.

    I want to exclude directories that contain music, pictures or Calibre files.
    I'd probably use a case statement for this
    case $folder in
    *Music*|*Pictures*|*Calibre*) ;; # example patterns for folders to skip
    *) ;; # commands adding folder to the list go here
    esac
    Finally I sorted the list of folders on the basis of line length, ascending:
    while read -r folder
    do printf '%d\t%s\n' ${#folder} "$folder"
    done < folders2 | sort -n | cut -f2-
    As to the last script, assigning a weight to each folder in order to find the best fit for files of a certain kind reminds me of mail filtering tools in the mold of procmail. Have you looked into the possibility of using something like crm114 for this?

    Also, looping over every file for every destination folder is very inefficient. Just matching 1500 files with 1000 folders means the inner loop must be executed 1,500,000 times. And you're trying to match each word in the names of source files with each word in the names of files already in destination folders. Although bash offers some features that could help here (hint: associative arrays), I don't think a shell script is the right tool for the job.
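
    As a rough sketch of the associative-array hint, the folders could be indexed once by word, so that each file name is scored by lookups rather than by rescanning every folder. This assumes bash 4 or later; index_folder, best_folder and word_folders are hypothetical names, and the 4-character and digit filters are copied from the script in the first post.

```bash
#!/usr/bin/env bash
# Sketch only: names are hypothetical, matching is case-sensitive.

declare -A word_folders   # word -> newline-separated list of folders containing it

# Record every word (4+ characters, no digits) from the file names in one folder.
index_folder() {
    local folder=$1 path name word
    local -a words
    for path in "$folder"/*; do
        [ -f "$path" ] || continue
        name=${path##*/}              # drop the directory part
        name=${name%.*}               # drop the file ending
        read -r -a words <<< "$name"  # split the name on whitespace
        for word in "${words[@]}"; do
            [ "${#word}" -lt 4 ] && continue
            [[ $word == *[[:digit:]]* ]] && continue
            word_folders[$word]+="$folder"$'\n'
        done
    done
}

# Print the folder whose indexed words match the given file name most often.
# Prints an empty line when no folder matches at all.
best_folder() {
    local name=${1##*/} word folder best=""
    local -a words
    local -A score=()
    local -i n top=0
    name=${name%.*}
    read -r -a words <<< "$name"
    for word in "${words[@]}"; do
        [ -n "${word_folders[$word]:-}" ] || continue
        while IFS= read -r folder; do
            [ -n "$folder" ] || continue
            n=$(( ${score[$folder]:-0} + 1 ))
            score[$folder]=$n
            if (( n > top )); then
                top=$n
                best=$folder
            fi
        done <<< "${word_folders[$word]}"
    done
    printf '%s\n' "$best"
}
```

    Calling index_folder once per line of the folder list and then best_folder once per file to be moved would replace the nested loops. Each folder is read from disk only once, however many files are matched against it.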
    Last edited by schragge; April 8th, 2013 at 08:29 PM.

  3. #3
    Join Date
    Nov 2008

    Re: In search of a semiautomatic filing system - Bash newbie


    Thank you very much! I shall test these ideas. I suppose at least the idea of using the case statement would make the code a little bit more efficient.

    You are right about the huge amount of looping the script requires. That can be reduced by manually cutting the list of files into shorter parts.

    I am only starting with bash and know no other programming languages. Is there any way to make the script stop and liberate memory and/or processor capacity after some cycles? I would not mind if the process took much longer to complete, as long as the script would not take up all memory and the full capacity of one processor core.

  4. #4
    Join Date
    Apr 2012

    Re: In search of a semiautomatic filing system - Bash newbie

    I wonder if you might get some efficiency improvements by building lookup tables for the terms in the directory names using bash's associative arrays? For example, if we shamelessly steal this fragment from a poster on stackoverflow:

    If you need performance you don't want to iterate over your array repeatedly.

    In this case you can create an associative array (hash table, or dictionary) that represents an index of that array. I.e. it maps the array element into its index in the list:

    make_index () {
      local index_name=$1
      shift # drop the index name so that only the values remain in "$@"
      local -a value_array=("$@")
      local i
      # -A means associative array, -g means create a global variable:
      declare -g -A ${index_name}
      for i in "${!value_array[@]}"; do
        eval ${index_name}["${value_array[$i]}"]=$i
      done
    }
    Then you can do something along the lines of

    $ echo $dir1
    /home/user/700 Energy/Hydrogen
    $ echo $dir2
    /home/user/700 Water/Hydrogen
    $ dir1terms=( ${dir1//// } ); dir2terms=( ${dir2//// } )
    $ for term in ${dir1terms[@]}; do echo "$term"; done
    home
    user
    700
    Energy
    Hydrogen
    $ make_index dir1lookup "${dir1terms[@]}"
    $ make_index dir2lookup "${dir2terms[@]}"
    $ filename="Vibrations Turn Water Into Hydrogen"
    $ fileterms=( ${filename//// } )
    $ count=0; for term in ${fileterms[@]}; do test "${dir1lookup[$term]}" && ((++count)); done; echo $count
    1
    $ count=0; for term in ${fileterms[@]}; do test "${dir2lookup[$term]}" && ((++count)); done; echo $count
    2
    You could probably modify that to use the array index itself to weight the values in favor of matches higher (or lower) in the directory tree if you wanted. You could do something similar in python with 'dictionaries'.
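
    One way that weighting might look, as a sketch assuming bash 4 or later: each matching term adds its position in the path instead of 1, so matches deeper in the tree count for more. weighted_score is a hypothetical name, and "position" here counts words, so a directory called "700 Energy" contributes two positions.

```bash
#!/usr/bin/env bash
# Sketch only: sums the 1-based word position of every path term that
# also occurs in the file name. A repeated term keeps its last position.
weighted_score() {
    local dir=$1 filename=$2 term
    local -a dirterms fileterms
    local -A position=()
    local -i i score=0
    read -r -a dirterms <<< "${dir//\// }"   # slashes become spaces, then split
    read -r -a fileterms <<< "$filename"
    for i in "${!dirterms[@]}"; do
        position[${dirterms[$i]}]=$(( i + 1 ))
    done
    for term in "${fileterms[@]}"; do
        if [ -n "${position[$term]:-}" ]; then
            score=$(( score + ${position[$term]} ))
        fi
    done
    printf '%d\n' "$score"
}
```

    For the file name "Vibrations Turn Water Into Hydrogen" this gives 5 for "/home/user/700 Energy/Hydrogen" and 9 for "/home/user/700 Water/Hydrogen". Using a decreasing weight instead would favor matches near the top of the tree.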

  5. #5
    Join Date
    Nov 2008

    Re: In search of a semiautomatic filing system - Bash newbie


    Thank you very much! This talk of associative arrays makes my palms sweat - is it fear, is it eager anticipation? It will take days for me to apply this idea, but I shall try.

    Meanwhile I found two snippets of script that (when combined as below) might help me to shorten the list of possible destination directories. I hope the following script lists only those directories that already have at least one pdf file in them:

    find ~/ \( ! -regex '.*/\..*' \) -type d -exec sh -c 'set -- "$0"/'*.pdf'; [ $# -gt 0 ]' {} \; -print >> /home/user/folders.txt
    I do not want to end up in a situation where the code would list folders that have either zero or at least one pdf file.

  6. #6
    Join Date
    Jul 2007
    Ubuntu 14.04 Trusty Tahr

    Re: In search of a semiautomatic filing system - Bash newbie

    that line won't work too well, because (unless you enable nullglob) unmatched glob will simply degrade to literal '*.pdf' and each dir will pass [ $# -gt 0 ] test.

    try [ -f "$1" ] instead or simply run find -iname '*.pdf' and trim the filename part, along the lines of

    while read f; do echo ${f%.[pP][dD][fF]}; done < <( find -iname '*.pdf' ) | sort -u
    if number of dirs is much greater than number of pdfs this should be faster.
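
    A small sketch of the difference, assuming default shell options (nullglob off); glob_count and has_pdf are hypothetical names. The first function shows why counting parameters is misleading, the second is a variation of the -f test that checks every expansion rather than only "$1".

```bash
#!/usr/bin/env bash
# Sketch only: function names are hypothetical.

# Count the words the glob expands to. Without nullglob an unmatched
# pattern is passed through literally, so an empty directory still gives 1.
glob_count() {
    set -- "$1"/*.pdf
    echo "$#"
}

# Succeed only if at least one expansion is an existing regular file.
# The literal string '*.pdf' left by an unmatched glob fails the -f test.
has_pdf() {
    local f
    for f in "$1"/*.pdf; do
        [ -f "$f" ] && return 0
    done
    return 1
}
```

    Inside a loop over directories, has_pdf "$dir" && echo "$dir" would then print only the directories that really hold a pdf file.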
    Last edited by Vaphell; April 10th, 2013 at 05:57 AM.
    if your question is answered, mark the thread as [SOLVED]. Thx.
    To post code or command output, use [code] tags.
    Check your bash script here // BashFAQ // BashPitfalls

  7. #7
    Join Date
    Nov 2008

    Re: In search of a semiautomatic filing system - Bash newbie


    Thank you for joining the effort! You are right:

    Quote Originally Posted by Vaphell View Post
    that line won't work too well, because (unless you enable nullglob) unmatched glob will simply degrade to literal '*.pdf' and each dir will pass [ $# -gt 0 ] test.
    That line seems to produce a full list of all of my directories. I obtained considerable improvement by using

    find ~/ \( ! -regex '.*/\..*' \) -type d -exec sh -c 'set -- "$0"/'*.pdf'; [ $# -gt 1 ]' {} \; -print >> /home/user/folders.txt
    It seems that this line produces a list of directories that have at least 2 (i.e., -gt 1) pdf files in them. In addition to the pdf files, there may be subdirectories as well.

    I tried your suggestion, attempting to produce a list 'folders.txt' as follows:

    while read f; do echo ${f%.[pP][dD][fF]}; done < <( find -iname '*.pdf' ) | sort -u >> /home/user/folders.txt
    and what I got was a list of all of my pdf files, it seems. But maybe I did something wrong here.

    What I would like to have is a list of all directories that have pdf files in them. (And why not doc, txt, and odt files as well).

    In any case, your comments were welcome. I found an improvement, and I needed a break from associative arrays.
    Last edited by gefalu2008; April 10th, 2013 at 12:38 PM.

  8. #8
    Join Date
    Jul 2007
    Ubuntu 14.04 Trusty Tahr

    Re: In search of a semiautomatic filing system - Bash newbie

    lol, i must have been unconscious when i wrote that line, it only strips extensions

    $ f=/some/dir/and/then/some/file.pdf
    $ echo ${f%.[pP][dD][fF]}   # this strips only extension
    /some/dir/and/then/some/file
    $ echo ${f%/*}    # this strips everything from the last / (leaves dir)
    /some/dir/and/then/some
    try [ -f "$1" ], in case $1 stores unmatched glob it will be false (no file called '*.pdf'), legit file will return true and number of params doesn't matter.
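
    Putting the ${f%/*} expansion together with find gives the list of directories asked for earlier in the thread. This is only a sketch: pdf_dirs is a hypothetical name, hidden directories are not excluded here, and directory names containing newlines are not handled.

```bash
#!/usr/bin/env bash
# Sketch only: print each directory below a root that directly holds
# at least one file ending in .pdf (any letter case), one per line.
pdf_dirs() {
    local root=$1 f
    while IFS= read -r -d '' f; do
        printf '%s\n' "${f%/*}"    # strip the file name, keep the directory
    done < <(find "$root" -type f -iname '*.pdf' -print0) | sort -u
}
```

    Other document types could be covered by grouping several name tests, for example \( -iname '*.pdf' -o -iname '*.odt' \), in place of the single -iname test.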
    Last edited by Vaphell; April 10th, 2013 at 02:14 PM.

  9. #9
    Join Date
    Apr 2012

    Re: In search of a semiautomatic filing system - Bash newbie

    Quote Originally Posted by gefalu2008 View Post

    Thank you very much! This talk of associative arrays makes my palms sweat - is it fear, is it eager anticipation? It will take days for me to apply this idea, but I shall try.
    I hope I haven't sent you down a rabbit hole with that suggestion - I like playing with this kind of stuff but I am by no means an expert

    I do encourage you to take a look at python - it has a quite expressive syntax for this kind of processing - for example you could make a 'dictionary of dictionaries' perhaps keyed on the directories' inode values and then loop over that returning a [sorted] list of the members with the best match(es). I'm not sure if bash's arrays can be nested in that way.
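
    Bash does not support nested arrays, but one associative array with a composite key can stand in for a dictionary of dictionaries. A minimal sketch, assuming bash 4 or later and assuming that '|' never occurs in a folder path; add_term, get_count and term_count are hypothetical names.

```bash
#!/usr/bin/env bash
# Sketch only: emulate a two-level lookup with keys of the form "folder|term".

declare -A term_count    # "folder|term" -> number of times term was added

# add_term FOLDER TERM: count one more occurrence of TERM in FOLDER.
add_term() {
    local key="$1|$2"
    term_count[$key]=$(( ${term_count[$key]:-0} + 1 ))
}

# get_count FOLDER TERM: print the stored count, or 0 if none was recorded.
get_count() {
    local key="$1|$2"
    printf '%d\n' "${term_count[$key]:-0}"
}
```

    After add_term has been called for every word of every file in every folder, get_count "$folder" "$word" answers the same question as the innermost loop of the original script without touching the disk.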

    I haven't really looked at your code in detail but another thing that comes to mind from the point of view of performance is whether you have found a way to implement some kind of pruning i.e. if you are searching for $term and you don't find it in /path/to/dir/to/subdir then you know right away there's no point searching for it again in /path/to/dir or /path and so on

    Good luck and keep us posted with your progress

  10. #10
    Join Date
    Jul 2007
    Ubuntu 14.04 Trusty Tahr

    Re: In search of a semiautomatic filing system - Bash newbie

    +1 on exploring python. Managing tons of data is a chore in shells, even with associative arrays in bash
    easy transformation of lists and dictionaries is built right into the language
