
Thread: strip all tags and convert .html to .txt?

  1. #1
    Join Date
    May 2012
    Beans
    22

    strip all tags and convert .html to .txt?

    Hello,

    I have a folder filled with .html files. I want to mass convert them to .txt files while stripping away all the tags etc.

    I basically just want the words from the files. Is there an easy way to do this? I could open each file individually and copy and paste, but that would take me like a year.

    Any help would be appreciated.

  2. #2
    Join Date
    Jan 2005
    Location
    South Africa
    Beans
    2,098
    Distro
    Ubuntu 12.04 Precise Pangolin

    Re: strip all tags and convert .html to .txt?

    Google seems to be your friend:

    http://www.google.co.za/search?hl=en...xt&btnG=Search

    just go through the links

    This one looks promising (but you must decide for yourself): http://www.aaronsw.com/2002/html2text/
    If you don't make backups of your important data, your data is obviously not important to you.

  3. #3
    Join Date
    Jan 2009
    Location
    ::1
    Beans
    2,460

    Re: strip all tags and convert .html to .txt?

    lynx --dump blablabla.html > blablabla.txt

  4. #4
    Join Date
    Aug 2010
    Location
    Lancs, United Kingdom
    Beans
    1,078
    Distro
    Xubuntu 14.04 Trusty Tahr

    Re: strip all tags and convert .html to .txt?

    I recommend html2text, which is available in the Ubuntu repository. There are several programs out there going by this name; I mean the one packaged for Ubuntu.

  5. #5
    Join Date
    May 2012
    Beans
    22

    Re: strip all tags and convert .html to .txt?

    Quote Originally Posted by sanderj View Post
    lynx --dump blablabla.html > blablabla.txt
    Can you rewrite that command line for me? All the files are on my desktop in a folder labeled html.

    Appreciate the help.

    Mike

  6. #6
    Join Date
    Feb 2008
    Location
    Planet earth, for now.
    Beans
    Hidden!
    Distro
    Xubuntu

    Re: strip all tags and convert .html to .txt?

    Quote Originally Posted by sanderj View Post
    lynx --dump blablabla.html > blablabla.txt
    The OP is reasonably new here. I'm not, and I have no idea what this means.

    @Bostoncab: Yes, it is a help forum, but posters not having a good look to solve their problems before posting is generally frowned upon. The poster who provided the google link was trying to say this nicely, and so am I ... Check my sig.
    Last edited by Bucky Ball; August 15th, 2012 at 02:14 AM.

  7. #7
    Join Date
    Jan 2009
    Location
    ::1
    Beans
    2,460

    Re: strip all tags and convert .html to .txt?

    Quote Originally Posted by bostoncab View Post
    Can you rewrite that command line for me? All the files are on my desktop in a folder labeled html.

    Appreciate the help.

    Mike

    Open a Terminal (CTRL - ALT - t)
    Go to the directory containing the .html files (maybe cd Desktop/html/)
    Find the name of a .html file. Let's assume somefile.html
    Convert that file like this:

    Code:
    lynx --dump somefile.html > somefile.txt
    The result should be in somefile.txt.


    If lynx is not yet installed, install it via the Software Center

  8. #8
    Join Date
    May 2012
    Beans
    22

    Re: strip all tags and convert .html to .txt?

    You can't run it on the entire folder at once?

    I don't care what is generally frowned upon. If you are not looking to help in absolute beginner talk, you should be down the hall in the sneering know-it-all lounge.

    Quote Originally Posted by sanderj View Post
    Open a Terminal (CTRL - ALT - t)
    Go to the directory containing the .html files (maybe cd Desktop/html/)
    Find the name of a .html file. Let's assume somefile.html
    Convert that file like this:

    Code:
    lynx --dump somefile.html > somefile.txt
    The result should be in somefile.txt.


    If lynx is not yet installed, install it via the Software Center

  9. #9
    Join Date
    Dec 2007
    Location
    Bombay
    Beans
    5,474
    Distro
    Lubuntu 14.04 Trusty Tahr

    Re: strip all tags and convert .html to .txt?

    Advanced HTML to text converter (html2text) also looks promising. Plus it's "supported" by Canonical.

    Edit: Already suggested by spjackson.
    Last edited by vasa1; August 15th, 2012 at 09:51 AM.
    de gustibus et coloribus non est disputandum -- Wiktionary

  10. #10
    Join Date
    Aug 2009
    Location
    Greece
    Beans
    1,701
    Distro
    Ubuntu Development Release

    Re: strip all tags and convert .html to .txt?

    Quote Originally Posted by bostoncab View Post
    You can't run it on the entire folder at once?

    I don't care what is generally frowned upon. If you are not looking to help in absolute beginner talk, you should be down the hall in the sneering know-it-all lounge.
    We are all trying to help here.

    For your situation, as you are an absolute beginner, I would recommend the following:

    First of all, give the folder with your html files an easy name, like 'html', and place it in your home folder (beside your Music, Videos etc. folders).

    Then, open a terminal via Ctrl+Alt+T

    You have to install a package. Run the command:
    Code:
    sudo apt-get install lynx-cur
    It will ask for your password. Type your login password (nothing will be printed to the screen, but don't worry) and hit [Enter]. If it asks you to confirm whether to install the package, type Y and hit [Enter] again. Wait until the installation is finished.

    By default, when you open a terminal you are "located" in your home folder, so you can "cd" (change directory) to your 'html' folder with the command:
    Code:
    cd html
    Once you are inside your html folder in your terminal, run the following command:
    Code:
    nano extract_text.sh
    'nano' is a command line text editor. It will let you create the file extract_text.sh and put text into it.

    Into the editor, copy and paste (or type) the following:
    Code:
    #!/bin/bash
    
    for html_file in *.html; do
       txt_file=${html_file%.html}".txt"
       lynx --dump "$html_file" > "$txt_file"
    done
    To copy from the terminal you have to use Ctrl+Shift+C, and to paste into the terminal you have to use Ctrl+Shift+V. Of course you can do all of this via the right-click menu as well.

    After you've entered all of the above in the editor, hit Ctrl+O, then [Enter] to save your file, and then Ctrl+X to exit the editor.

    Now you have a file named extract_text.sh placed inside your 'html' folder, where all of your html files are.

    Linux systems will only run a file as a program if it has the executable bit set. This bit can be set on a file with the command:
    Code:
    chmod +x some_file
    After this, the system treats 'some_file' as a program.

    So, you have to do the same to your extract_text.sh file:
    Code:
    chmod +x extract_text.sh
    and your file will be treated as a program from now on.
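    If you want to see the executable bit change for yourself, here is a small sketch. It works in a throwaway temporary directory, and the file name demo.sh is made up for the illustration, so nothing in your 'html' folder is touched.

```shell
# Work in a throwaway directory so the real 'html' folder is not touched.
demo_dir=$(mktemp -d)
demo_script="$demo_dir/demo.sh"   # hypothetical file name, for illustration only

# Create a tiny script. Files created this way start without the executable bit.
printf '#!/bin/bash\necho "hello from demo"\n' > "$demo_script"

# [ -x FILE ] is true only if the current user is allowed to execute FILE.
if [ -x "$demo_script" ]; then before="executable"; else before="not executable"; fi

chmod +x "$demo_script"

if [ -x "$demo_script" ]; then after="executable"; else after="not executable"; fi

echo "before chmod: $before"
echo "after chmod:  $after"
```

    The same [ -x extract_text.sh ] test can be used inside the 'html' folder to check that the chmod step worked.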

    In order to execute this program, just run
    Code:
    ./extract_text.sh
    and all your html files will be turned into txt, without the html tags inside them.
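    As a quick sanity check after the run, something like the sketch below could list any .html file that did not end up with a non-empty .txt next to it. The function name missing_txt is made up for this example, and it assumes the folder is ~/html as described above.

```shell
# Hypothetical helper: print every .html file in the given directory that has
# no non-empty .txt counterpart. Prints nothing if every file was converted.
missing_txt() {
   for html_file in "$1"/*.html; do
      # If the directory holds no .html files the pattern stays literal; skip it.
      [ -e "$html_file" ] || continue
      # -s is true when the file exists and is larger than zero bytes.
      [ -s "${html_file%.html}.txt" ] || echo "$html_file"
   done
}

# Assumes the folder is called 'html' and sits in your home folder.
missing_txt "$HOME/html"
```

    An empty result means every page has a text version. Note that a page with no visible text at all could produce an empty .txt and so still be listed.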

    I hope you learned something new from my post. If you want to understand better what the extract_text.sh script did, continue reading and I will explain.

    Let's take another look at the script:
    Code:
    #!/bin/bash
    
    for html_file in *.html; do
       txt_file=${html_file%.html}".txt"
       lynx --dump "$html_file" > "$txt_file"
    done
    The first line, #!/bin/bash, names the program that should execute the code that follows: the so-called interpreter. 'bash' is a very good and widely used one. There are others, like sh, csh, ksh etc.

    Then we have for html_file in *.html; do. This is a for loop. The pattern *.html expands to every file in the current directory whose name ends with .html, and on each pass the loop sets html_file to the next of those names. So each time the loop runs, the $html_file variable holds a different one of your .html filenames.
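    To watch the loop pick up file names without converting anything, here is a small sketch. It uses a throwaway directory with made-up empty files (a.html, b.html, notes.txt) and only prints what it would convert. The extra [ -e ... ] line is my addition, not part of the script above: it covers the case where no .html file exists, in which case the shell leaves the pattern as the literal text *.html.

```shell
# Throwaway directory with a few empty files, for illustration only.
demo_dir=$(mktemp -d)
touch "$demo_dir/a.html" "$demo_dir/b.html" "$demo_dir/notes.txt"
cd "$demo_dir" || exit 1

count=0
for html_file in *.html; do
   # With no matches, html_file would be the literal "*.html"; skip that case.
   [ -e "$html_file" ] || continue
   echo "would convert: $html_file"
   count=$((count + 1))
done
echo "$count file(s) matched"
```

    Here notes.txt is ignored because its name does not end with .html, so the loop body runs only for a.html and b.html.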

    The line txt_file=${html_file%.html}".txt" creates a new variable called txt_file. Its value is the content of html_file with the .html extension removed and .txt added in its place. Finally, the line lynx --dump "$html_file" > "$txt_file" does the conversion, reading from $html_file and writing the plain text to $txt_file.
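    The renaming step can be tried on its own, without lynx. In the sketch below, txt_name is a made-up helper (it is not part of the script above) that prints the .txt name for a given .html name, using the same ${...%.html} expansion.

```shell
# Hypothetical helper, not part of the script above: print the .txt name
# that corresponds to a given .html name.
txt_name() {
   # ${1%.html} is the first argument with one trailing ".html" removed.
   printf '%s\n' "${1%.html}.txt"
}

txt_name "index.html"          # prints: index.txt
txt_name "my page.html"        # prints: my page.txt
txt_name "archive.html.html"   # prints: archive.html.txt
```

    The last call shows that % removes only one .html from the end of the name, and the quotes keep names with spaces in one piece.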
