html2text question



Jefferythewind
December 2nd, 2009, 05:37 PM
Hi everyone,
I am using html2text to convert an HTML document to a text document. Sounds pretty simple, huh? Well, I am getting the conversion with this command:

html2text ~/inputfile.html > ~/output.txt

I get the output file, but it has many strange symbols throughout, like this:


_G_o_ _t_o_ _M_a_i_n_ _C_o_n_t_e_n_t
************ UUVVMM WWWWWW IInnffoorrmmaattiioonn SSyysstteemm ************

If I don't want to go through the text file and take all this stuff out one by one, how would I clean this up?

Thanks

wmcbrine
December 2nd, 2009, 07:09 PM
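I'm not familiar with html2text, but those are backspaces, which it's using to implement underlining (underscore, backspace, character) and boldface (character, backspace, character). When the file is viewed as plain text, the backspaces don't erase anything, so you see the extra underscores and doubled letters.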
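
If the file has already been written, the overstrike pairs described above can be removed after the fact. This is only a rough sketch: strip_overstrike is a made-up helper name, not something html2text provides, and it assumes every backspace follows the character it is meant to overwrite.

```shell
# Hypothetical helper (not part of html2text): delete every
# "character + backspace" pair read from standard input.
strip_overstrike() {
    bs=$(printf '\b')      # a literal backspace (0x08)
    sed "s/.${bs}//g"
}

# Demo on a small sample: underlined "Go" followed by bold "UVM".
printf '_\bG_\bo U\bUV\bVM\bM\n' | strip_overstrike
# prints: Go UVM
```

To clean an existing file, something like strip_overstrike < ~/output.txt > ~/clean.txt should work, where clean.txt is just an example name.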

Arndt
December 2nd, 2009, 10:42 PM
Do you have the manual page for the program?

Jefferythewind
December 3rd, 2009, 12:35 AM
Yeah, you can get the manual by typing:

man html2text

It can also be found here:
http://www.mbayer.de/html2text/html2text1.shtml

I took a look at that and couldn't really find anything that looks like the problem I have, but if you all can see something, that would be awesome.

laceration
December 3rd, 2009, 01:14 AM
Try this:

html2text ~/inputfile.html | strings > ~/output.txt

wmcbrine
December 3rd, 2009, 01:44 AM
From that page, the -nobs option is what you want. And dude, seriously, the description is exactly what I told you in #2.
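
For reference, the original command with the -nobs option added would look roughly like this. It assumes the html2text build documented on the linked manual page (other programs with the same name may not accept -nobs), and the two file names are the ones from the first post. The check around the call is only there so the snippet does nothing harmful when the program or the input file is missing.

```shell
# File names taken from the original post.
in="$HOME/inputfile.html"
out="$HOME/output.txt"

# Assumption: this html2text is the version whose manual documents -nobs.
if command -v html2text >/dev/null 2>&1 && [ -f "$in" ]; then
    # -nobs: render without backspace sequences for underline and bold.
    html2text -nobs "$in" > "$out"
else
    echo "html2text or $in not found; nothing converted" >&2
fi
```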

openuniverse
December 3rd, 2009, 07:00 AM
.

Jefferythewind
December 3rd, 2009, 04:16 PM
wmcbrine,
Sorry man, you're right. Thanks a lot for the help; I was a little lazy on that one.