Thread: In search of a semiautomatic filing system - Bash newbie

  1. #1
    Join Date
    Nov 2008
    Beans
    36

    In search of a semiautomatic filing system - Bash newbie

    I use PDF printing to collect news items and other articles. Consequently I have saved about 1500 files in my ‘PDF’ folder. At times I move some of these files to directories that are organized by topic. In my home directory there are about 8 directories and within those altogether 1000 subdirectories from which to choose.

    An automatic filing system would be the ideal solution, but to my knowledge no such software is freely available for Linux. Perhaps it is possible to develop software for organizing files? From what I have heard, Python might be a good environment because it seems to have properties that are well suited to dealing with strings. However, programming is new to me. At this point I do not want to learn Python.

    I have settled for Bash programming and will be satisfied with what is feasible there. That is why what follows is an exercise in Bash programming. Please remember, I am a newbie, and your comments and suggestions are most welcome. I suppose the code I develop here can be made more efficient. And there may be bugs I do not see.

    In order to define which files should go into which folders I shall use file names. I shall generate a list of files to be relocated and a list of possible destination directories. I shall compare the names of files to be transferred with the names of files that are already located in the destination folders.

    The goal

    My aim is to generate a text file, ‘transferlist.txt’, that contains a list of ‘mv’ commands telling which documents should be moved into which directories:

    Code:
    mv "Vibrations Turn Water Into Hydrogen.pdf" "/home/user/700 Energy/Hydrogen"
    mv "What Synaptic Package Manager can do.pdf" "/home/user/300 Computers/Linux/Synaptic"
    etc.
    Because ‘transferlist.txt’ is passed to sh as an argument, its properties do not even need to be changed to make it executable; it can be run simply by using the command

    Code:
    sh transferlist.txt
    Listing files and possible destinations

    We shall start by preparing a list of the files in the ‘PDF’ folder. The aim is to use this list, ‘pdffiles.txt’, located in the ~/Test directory, to indicate which files are to be moved to other destinations. For clarity, let me point out that in addition to pdf files there are other types of files in the PDF folder as well.

    (The following scripts have been edited on the basis of suggestions received in replies to this post.)

    Code:
    #!/bin/bash
    
    # The list of files 'pdffiles.txt' that are to be moved to other destinations.
    
    for f in ~/PDF/*; do
        [[ -e $f ]] || continue
        printf '%s\n' "$f" >> ~/Test/pdffiles.txt
    done
    
    # List the most likely destination directories - those that have at least 2 pdf files in them.
    # The list below does not contain any hidden folders.
    # Note: *.pdf must stay inside the single quotes so that the inner sh expands it, not the calling shell.
    find ~/ \( ! -regex '.*/\..*' \) -type d -exec sh -c 'set -- "$1"/*.pdf; [ $# -gt 1 ]' sh {} \; -print >> ~/Test/folders1.txt
    
    # Edit folder pathnames so that individual words are separated for future matching with files
    
    while read -r UneditedFolderLine
    do
        # The next 2 lines replace slashes (‘/’) and underscores with spaces to separate words in folder pathnames
        Wordsseparated1=${UneditedFolderLine//\//" "}
        Wordsseparated=${Wordsseparated1//_/" "}
    
        # Not all directories are intended to be destinations for file transfers.
        for Folderword in $Wordsseparated
        do
            case "$Folderword" in
            *Music*|*Pictures*|*Calibre*|*PDF*)
                Warning="On"
                break
                ;;
            *)
                Warning="Off"
                ;;
            esac
        done
    
        if [ "${Warning}" = Off ]
        then
            echo "$UneditedFolderLine" >> ~/Test/folders2.txt
        fi
    done < ~/Test/folders1.txt # This file is the input for the loop above.
    
    # Sort the list of directories alphabetically:
    sort -d ~/Test/folders2.txt > ~/Test/folders3.txt
    
    # Sort the list on the basis of length, ascending, producing the final list of destination folders
    # (-s keeps the alphabetical order among folders of equal length):
    awk '{ print length, $0 }' ~/Test/folders3.txt | sort -s -n | cut -d" " -f2- > ~/Test/folders4.txt
    So the key results of the first script are ‘pdffiles.txt’ listing the files to be moved and ‘folders4.txt’ listing potential destinations. Next we shall find matches between lines of these two documents.


    Matching the names of files to be transferred and files already in destination folders

    So we have a list of files to be transferred, ‘pdffiles.txt’, and a list of possible destinations, ‘folders4.txt’. The code below generates the transfer list on the basis of matching words in filenames.

    For each file to be moved the code builds an index. The code visits every possible destination folder and sees how many times the words in the filename match words in the filenames of files in the destination folder. The folder that gets the highest index is selected to become the destination of the file to be transferred.

    So here we go.

    Code:
    #!/bin/bash
    
    # Start by reading the list of documents (pdffiles.txt) to be moved
    while read -r Filerow
    do
        SoFarBestIndex=0   # The highest number of matches found so far for this file.
        SoFarBestFolder="" # The folder that produced it; empty means no match yet.
    
        # Match on the file name only: drop the directory part and the file ending.
        Filename=${Filerow##*/}
        Filename=${Filename%.*}
    
        # Keep the words that are at least 4 characters long and contain no digits.
        Filewords=""
        for Fileword in $Filename
        do
            if [ "${#Fileword}" -ge 4 ] && [[ "${Fileword}" != *[[:digit:]]* ]]
            then
                Filewords="$Filewords $Fileword"
            fi
        done
    
        # Next read the list of destination folders from file folders4.txt
        while read -r UneditedFolderrow
        do
            Index=0 # Index counts matches between words for this folder as a whole.
            for FileInFolder in "$UneditedFolderrow"/*
            do
                [ -f "$FileInFolder" ] || continue
    
                # Chop the name of the file in the folder into separate words
                FileInFolder1=${FileInFolder##*/}
                FileInFolder2=${FileInFolder1%.*} # File ending deleted.
    
                for Fileword in $Filewords
                do
                    for WordinFileInFolder in $FileInFolder2
                    do
                        if [ "$WordinFileInFolder" = "$Fileword" ] # If there is a match,
                        then
                            Index=$((Index + 1)) # then add 1 to the Index.
                        fi
                    done
                done
            done
            if [ "$Index" -gt "$SoFarBestIndex" ]
            then
                SoFarBestFolder=$UneditedFolderrow
                SoFarBestIndex=$Index
            fi
        done < /home/user/Test/folders4.txt # The file containing destination directories.
    
        if [ -n "$SoFarBestFolder" ]
        then
            echo "mv \"$Filerow\" \"$SoFarBestFolder\"" >> /home/user/Test/transferlist19.txt
        else
            # List files that found no matching destinations:
            echo "mv \"$Filerow\" \"/home/user/PDF1\"" >> /home/user/Test/orphans19.txt
        fi
    done < /home/user/Test/pdffiles.txt # The file listing files to be transferred.
    My computer has an Intel i5-3550 CPU and 8 GiB of memory. Nevertheless, my original attempt to match 1500 files with 1000 destination directories was too much for it: the bash script was killed after running for some 6 hours. I have now edited the script above and it runs much more smoothly, allowing me to multitask without any problems.

    I am not quite sure if the indexing part is accurate. I have tried to test it the best I can, but my logic fails to confirm the accuracy of the code. Do you see any mistakes?
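
    One way to check the indexing might be to pull the word comparison out into a small function and feed it names whose answers can be counted by hand. This is only a minimal sketch, assuming the rules described above (file ending dropped, words shorter than 4 characters or containing digits skipped); the name count_matches is made up for illustration:

```shell
#!/usr/bin/env bash
# count_matches NAME1 NAME2
# Prints how many word pairs are equal between two file names, ignoring
# the file ending, words shorter than 4 characters and words with digits.
# (count_matches is a hypothetical helper, not part of the scripts above.)
count_matches() {
    local a=${1##*/} b=${2##*/}   # drop any directory part
    a=${a%.*}; b=${b%.*}          # drop the file ending
    local wa wb n=0
    set -f                        # split into words without glob expansion
    for wa in $a; do
        [ "${#wa}" -lt 4 ] && continue
        [[ $wa = *[[:digit:]]* ]] && continue
        for wb in $b; do
            [ "$wa" = "$wb" ] && n=$((n + 1))
        done
    done
    set +f
    echo "$n"
}

# Inputs whose answers can be checked by hand:
count_matches "Vibrations Turn Water Into Hydrogen.pdf" "Hydrogen From Water.pdf"  # prints 2
count_matches "Linux 2013 tip.pdf" "Linux tip 2013.pdf"                            # prints 1
```

    If the numbers printed agree with a count done by eye, the same comparison can be trusted inside the larger loop.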

    Anyway, don’t you think that this is a proof of concept for a semiautomatic filing system? Isn’t that something any file addict needs?
    Last edited by gefalu2008; April 16th, 2013 at 12:42 PM.

  2. #2
    Join Date
    Feb 2013
    Beans
    Hidden!

    Re: In search of a semiautomatic filing system - Bash newbie

    Code:
    ls > ~/pdffiles.txt
    This won't work if any of the file names contain embedded newlines. An unlikely situation to be sure, but not impossible.
    See http://mywiki.wooledge.org/ParsingLs
    At the very least, hide control characters in file names with ls -q.
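
    A sketch of one way to avoid parsing ls altogether is to let the shell's own glob do the listing. The function name list_files is made up here, and the choice to skip (and report) names with embedded newlines is an assumption about what a line-based list should do:

```shell
#!/usr/bin/env bash
# list_files DIR: print the regular files in DIR, one name per line,
# without parsing the output of ls. (list_files is a hypothetical name.)
list_files() {
    local f
    for f in "$1"/*; do
        [ -f "$f" ] || continue   # skips directories and an unmatched glob
        case $f in
            *$'\n'*) printf 'skipped (newline in name): %q\n' "$f" >&2 ;;
            *)       printf '%s\n' "${f##*/}" ;;
        esac
    done
}
```

    Something like list_files ~/PDF > ~/pdffiles.txt would then replace the ls line.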

    I want to exclude directories that contain music, pictures or Calibre files.
    I'd probably use a case statement for this
    Code:
    case $folder in
    *Music*|*Pictures*|*Calibre*);;
    *) # commands adding folder to the list
      ;;
    esac
    Finally I sorted the list of folders on the basis of line length, ascending:
    Consider
    Code:
    while read -r folder
    do printf '%d\t%s\n' ${#folder} "$folder"
    done < folders2 | sort -n | cut -f2-
    As to the last script, assigning a weight to each folder in order to find the best fit for files of a certain kind reminds me of mail filtering tools in the mold of procmail. Have you looked into the possibility of using something like crm114 for this?

    Also, looping over every file for every destination folder is very inefficient. Just matching 1500 files with 1000 folders means the inner loop must be executed 1,500,000 times. And you're trying to match each word in the names of source files with each word in the names of files already in destination folders. Although bash offers some features that could help here (hint: associative arrays), I don't think a shell script is the right tool for the job.
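
    To make the hint concrete, here is a rough sketch (bash 4 or later) of how an associative array could cut the looping down: read every destination folder once, remember which folders each word occurs in, and then score a file by looking up its words. The names build_index, best_folder, usable and WordFolders are invented for this sketch, and ties between folders are broken arbitrarily:

```shell
#!/usr/bin/env bash
# Sketch only; all names below are made up. Requires bash 4+ (associative arrays).
declare -gA WordFolders=()   # word -> newline-separated folders (one entry per occurrence)

# usable WORD: true for words of 4+ characters that contain no digits.
usable() { [ "${#1}" -ge 4 ] && [[ $1 != *[[:digit:]]* ]]; }

# build_index FOLDERLIST: visit each listed folder once and record every
# usable word of every file name found there.
build_index() {
    local folder file name word
    while IFS= read -r folder; do
        for file in "$folder"/*; do
            [ -f "$file" ] || continue
            name=${file##*/}; name=${name%.*}
            set -f                       # split into words without globbing
            for word in $name; do
                usable "$word" && WordFolders[$word]+="$folder"$'\n'
            done
            set +f
        done
    done < "$1"
}

# best_folder FILE: print the folder whose file names share the most words
# with FILE; prints nothing when no folder matches.
best_folder() {
    local name=${1##*/} word folder best="" bestscore=0
    local -A score=()
    name=${name%.*}
    set -f
    for word in $name; do
        usable "$word" || continue
        while IFS= read -r folder; do
            [ -n "$folder" ] || continue
            score[$folder]=$(( ${score[$folder]:-0} + 1 ))
        done <<< "${WordFolders[$word]}"
    done
    set +f
    for folder in "${!score[@]}"; do
        if [ "${score[$folder]}" -gt "$bestscore" ]; then
            best=$folder; bestscore=${score[$folder]}
        fi
    done
    [ -n "$best" ] && printf '%s\n' "$best"
}
```

    With this shape the destination folders are read once in total rather than once per source file, so the cost per file depends on the number of words in its name, not on the number of folders.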
    Last edited by schragge; April 8th, 2013 at 08:29 PM.

  3. #3
    Join Date
    Nov 2008
    Beans
    36

    Re: In search of a semiautomatic filing system - Bash newbie

    schragge,

    thank you very much! I shall test these ideas. I suppose at least the idea of using the case statement would make the code a little bit more efficient.

    You are right about the huge amount of looping the script requires. That can be reduced by manually cutting the list of files into shorter parts.
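
    The cutting itself would not have to be done by hand. A small sketch, assuming the standard split utility is acceptable; split_list is a made-up helper name:

```shell
#!/bin/sh
# split_list LIST LINES PREFIX: cut LIST into pieces of at most LINES lines each.
# split names the pieces PREFIXaa, PREFIXab, PREFIXac, ...
# (split_list is a hypothetical helper; split is the standard utility.)
split_list() {
    split -l "$2" "$1" "$3"
}
```

    For example, split_list ~/Test/pdffiles.txt 100 ~/Test/pdffiles.part. would produce pieces of 100 lines, and the matching script could then be pointed at one piece at a time.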

    I am only starting with bash and know no other programming languages. Is there any way to make the script stop and liberate memory and/or processor capacity after some cycles? I would not mind if the process took much longer to complete, as long as the script would not take up all memory and the full capacity of one processor core.

  4. #4
    Join Date
    Apr 2012
    Beans
    7,256

    Re: In search of a semiautomatic filing system - Bash newbie

    I wonder if you might get some efficiency improvements by building lookup tables for the terms in the directory names using bash's associative arrays? For example, if we shamelessly steal this fragment from a poster on Stack Overflow

    http://stackoverflow.com/questions/3...50606#14550606

    If you need performance you don't want to iterate over your array repeatedly.

    In this case you can create an associative array (hash table, or dictionary) that represents an index of that array. I.e. it maps the array element into its index in the list:

    Code:
    make_index () {
      local index_name=$1
      shift
      local -a value_array=("$@")
      local i
      # -A means associative array, -g means create a global variable:
      declare -g -A ${index_name}
      for i in "${!value_array[@]}"; do
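        # Escaped so that eval sees the key as a quoted expansion (safe for spaces and special characters):
        eval "${index_name}[\"\${value_array[\$i]}\"]=\$i"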
      done
    }
    Then you can do something along the lines of

    Code:
    $ echo $dir1
    /home/user/700 Energy/Hydrogen
    $ echo $dir2
    /home/user/700 Water/Hydrogen
    $
    $ dir1terms=( ${dir1//// } ); dir2terms=( ${dir2//// } )
    $
    $ for term in ${dir1terms[@]}; do echo "$term"; done
    home
    user
    700
    Energy
    Hydrogen
    $
    $ make_index dir1lookup "${dir1terms[@]}"
    $ make_index dir2lookup "${dir2terms[@]}"
    $
    $ filename="Vibrations Turn Water Into Hydrogen"
    $ fileterms=( ${filename//// } )
    $
    $ count=0; for term in ${fileterms[@]}; do test "${dir1lookup[$term]}" && ((++count)); done; echo $count
    1
    $ count=0; for term in ${fileterms[@]}; do test "${dir2lookup[$term]}" && ((++count)); done; echo $count
    2
    $
    You could probably modify that to use the array index itself to weight the values in favor of matches higher (or lower) in the directory tree if you wanted. You could do something similar in python with 'dictionaries'.
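
    As a rough illustration of that weighting idea (a sketch under my own assumptions, not tested against a real tree): a match on the term at position i of the directory path adds i+1 instead of 1, so deeper components count for more. The name weighted_score is hypothetical, and a local associative array stands in for make_index to keep the example self-contained:

```shell
#!/usr/bin/env bash
# weighted_score DIR NAME: a match on the term at position i of DIR adds
# i+1 to the score, so deeper path components weigh more.
# (weighted_score is a made-up name; requires bash 4+ for local -A.)
weighted_score() {
    local dir=$1 name=$2 term i total=0
    local -a dirterms
    local -A lookup=()
    set -f                          # split on whitespace only, no globbing
    dirterms=( ${dir//\// } )
    for i in "${!dirterms[@]}"; do
        term=${dirterms[$i]}
        lookup[$term]=$i            # a repeated term keeps its deepest position
    done
    for term in $name; do
        [ -n "${lookup[$term]}" ] && total=$(( total + ${lookup[$term]} + 1 ))
    done
    set +f
    echo "$total"
}

weighted_score "/home/user/700 Energy/Hydrogen" "Vibrations Turn Water Into Hydrogen"  # prints 5
weighted_score "/home/user/700 Water/Hydrogen" "Vibrations Turn Water Into Hydrogen"   # prints 9
```

    Using the total number of terms minus i as the weight would favor matches higher in the tree instead.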

  5. #5
    Join Date
    Nov 2008
    Beans
    36

    Re: In search of a semiautomatic filing system - Bash newbie

    steeldriver,

    thank you very much! This talk of associative arrays makes my palms sweat - is it fear, is it eager anticipation? It will take days for me to apply this idea, but I shall try.

    Meanwhile I found two snippets of script that (when combined as below) might help me to shorten the list of possible destination directories. I hope the following script lists only those directories that already have at least one pdf file in them:

    Code:
    find ~/ \( ! -regex '.*/\..*' \) -type d -exec sh -c 'set -- "$0"/'*.pdf'; [ $# -gt 0 ]' {} \; -print >> /home/user/folders.txt
    I do not want to end up in a situation where the code lists every folder, whether it has zero or at least one pdf file.

  6. #6
    Join Date
    Jul 2007
    Location
    Poland
    Beans
    4,499
    Distro
    Ubuntu 14.04 Trusty Tahr

    Re: In search of a semiautomatic filing system - Bash newbie

    that line won't work too well, because (unless you enable nullglob) unmatched glob will simply degrade to literal '*.pdf' and each dir will pass [ $# -gt 0 ] test.

    try [ -f "$1" ] instead or simply run find -iname '*.pdf' and trim the filename part, along the lines of

    Code:
    while read f; do echo ${f%.[pP][dD][fF]}; done < <( find -iname '*.pdf' ) | sort -u
    if number of dirs is much greater than number of pdfs this should be faster.
    Last edited by Vaphell; April 10th, 2013 at 05:57 AM.
    if your question is answered, mark the thread as [SOLVED]. Thx.
    To post code or command output, use [code] tags.
    Check your bash script here // BashFAQ // BashPitfalls

  7. #7
    Join Date
    Nov 2008
    Beans
    36

    Re: In search of a semiautomatic filing system - Bash newbie

    Vaphell,

    thank you for joining the effort! You are right:

    Quote Originally Posted by Vaphell View Post
    that line won't work too well, because (unless you enable nullglob) unmatched glob will simply degrade to literal '*.pdf' and each dir will pass [ $# -gt 0 ] test.
    That line seems to produce a full list of all of my directories. I obtained considerable improvement by using

    Code:
    find ~/ \( ! -regex '.*/\..*' \) -type d -exec sh -c 'set -- "$0"/'*.pdf'; [ $# -gt 1 ]' {} \; -print >> /home/user/folders.txt
    It seems that this line produces a list of directories that have at least 2 (i.e., -gt 1) pdf files in them. In addition to the pdf files, there may be subdirectories as well.

    I tried your suggestion, attempting to produce a list 'folders.txt' as follows:

    Code:
    while read f; do echo ${f%.[pP][dD][fF]}; done < <( find -iname '*.pdf' ) | sort -u >> /home/user/folders.txt
    and what I got was a list of all of my pdf files, it seems. But maybe I did something wrong here.

    What I would like to have is a list of all directories that have pdf files in them. (And why not doc, txt, and odt files as well).

    In any case, your comments were welcome. I found an improvement, and I needed a break from associative arrays.
    Last edited by gefalu2008; April 10th, 2013 at 12:38 PM.

  8. #8
    Join Date
    Jul 2007
    Location
    Poland
    Beans
    4,499
    Distro
    Ubuntu 14.04 Trusty Tahr

    Re: In search of a semiautomatic filing system - Bash newbie

    lol, i must have been unconscious when i wrote that line, it only strips extensions

    Code:
    $ f=/some/dir/and/then/some/file.pdf
    $ echo ${f%.[pP][dD][fF]}   # this strips only extension
    /some/dir/and/then/some/file
    $ echo ${f%/*}    # this strips everything from the last / (leaves dir)
    /some/dir/and/then/some
    try [ -f "$1" ], in case $1 stores unmatched glob it will be false (no file called '*.pdf'), legit file will return true and number of params doesn't matter.
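
    both routes could look roughly like this (a sketch: the function names are made up, and it assumes the starting directory itself has no hidden component in its path):

```shell
#!/bin/sh
# Two hypothetical helpers; both print the directories under ROOT that
# directly contain at least one pdf file, skipping hidden paths.

# dirs_by_test ROOT: ask each directory whether its *.pdf glob matched.
# An unmatched glob stays the literal string "*.pdf", which is not a file.
dirs_by_test() {
    find "$1" ! -path '*/.*' -type d \
        -exec sh -c 'set -- "$1"/*.pdf; [ -f "$1" ]' sh {} \; -print | sort
}

# dirs_by_trim ROOT: find the pdf files, then cut off the file name with ${f%/*}.
# Note that -iname also matches .PDF, which the case-sensitive glob above does not.
dirs_by_trim() {
    find "$1" ! -path '*/.*' -type f -iname '*.pdf' |
        while IFS= read -r f; do printf '%s\n' "${f%/*}"; done | sort -u
}
```

    the first runs one sh per directory, the second one find over the files, so which is faster depends on the ratio of dirs to pdfs.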
    Last edited by Vaphell; April 10th, 2013 at 02:14 PM.

  9. #9
    Join Date
    Apr 2012
    Beans
    7,256

    Re: In search of a semiautomatic filing system - Bash newbie

    Quote Originally Posted by gefalu2008 View Post
    steeldriver,

    thank you very much! This talk of associative arrays makes my palms sweat - is it fear, is it eager anticipation? It will take days for me to apply this idea, but I shall try.
    I hope I haven't sent you down a rabbit hole with that suggestion - I like playing with this kind of stuff but I am by no means an expert

    I do encourage you to take a look at python - it has a quite expressive syntax for this kind of processing - for example you could make a 'dictionary of dictionaries' perhaps keyed on the directories' inode values and then loop over that returning a [sorted] list of the members with the best match(es). I'm not sure if bash's arrays can be nested in that way.

    I haven't really looked at your code in detail but another thing that comes to mind from the point of view of performance is whether you have found a way to implement some kind of pruning i.e. if you are searching for $term and you don't find it in /path/to/dir/to/subdir then you know right away there's no point searching for it again in /path/to/dir or /path and so on

    Good luck and keep us posted with your progress

  10. #10
    Join Date
    Jul 2007
    Location
    Poland
    Beans
    4,499
    Distro
    Ubuntu 14.04 Trusty Tahr

    Re: In search of a semiautomatic filing system - Bash newbie

    +1 on exploring python. Managing tons of data is a chore in shells, even with associative arrays in bash
    easy transformation of lists and dictionaries is built right into the language
