[SOLVED] Bash: removing CR/LF characters



EvilMarshmallow
December 4th, 2008, 03:40 PM
I've got a script that almost works...

The basics: I download an HTML page using wget and I want to remove ALL newline characters from it. This removes some of them, but not all:


cat myfile | tr '\n' ' ' > outputfile

This however does not remove all the newline characters. Here's kinda what I'm seeing in the file:


<tr><td>blah blah blah BLAH
blah blah blah blah</td></tr>

After the all-caps BLAH, there is a newline character of some sort, but it's not replaced by the tr statement above. What should I be tr'ing to get this removed?

dwhitney67
December 4th, 2008, 03:54 PM
Where did the 'myfile' originate from? Maybe from M$-Windows?

It could be possible that you do not have a newline at the end of the statements in the file, but instead a return-character (\r), which sometimes is referred to as a line-feed.

albandy
December 4th, 2008, 03:56 PM
dos2unix filename

if you don't have it installed, install the tofrodos package.

sudo apt-get install tofrodos

EvilMarshmallow
December 4th, 2008, 04:01 PM
The file was generated by Lotus Notes. It is the contents of a particular database view output as HTML. Unfortunately, the geniuses who populate the database use the enter key to make it look pretty on the screen and those line feeds are saved in the text.

I tried your suggestion like this:

cat myfile | tr '\r' ' ' > outfile.txt

But now the output file will not open in gedit. It just hangs up. Any idea what I did wrong?

dwhitney67
December 4th, 2008, 04:08 PM
I was just guessing as to whether you have '\r' in your file. If indeed the file has them, I would follow albandy's advice and use dos2unix.



dos2unix myfile

EvilMarshmallow
December 4th, 2008, 04:10 PM
Albandy,

dos2unix had no effect. I still have the line breaks in there, and even running tr '\n' after dos2unix did nothing for me.

albandy
December 4th, 2008, 04:15 PM
OK, maybe if it is HTML you should parse it like an XML file.

albandy
December 4th, 2008, 04:17 PM
Open it with vim and take a screenshot.

EvilMarshmallow
December 4th, 2008, 04:21 PM
I looked at it and everywhere there's a new line, it shows ^M.

iponeverything
December 4th, 2008, 04:23 PM
Save the script below to a file named chomp, then do:
chmod 755 chomp


and do:

cat file | ./chomp


#!/usr/bin/perl
while(<>){
chomp;
print $_;
}

EvilMarshmallow
December 4th, 2008, 04:28 PM
chomp had no effect. These new lines won't go away!

albandy
December 4th, 2008, 04:30 PM
It's so strange, with dos2unix it should work.
Try with this parameter:

dos2unix -a filename

EvilMarshmallow
December 4th, 2008, 04:40 PM
Still nothing.

Here's what I've got...


wget file (works fine)

dos2unix -a file (works fine but doesn't seem to change anything)

If at this point I do


cat file | tr '\n' ' ' > file2

file2 fails to open in gedit.

dwhitney67
December 4th, 2008, 04:48 PM
Check that the permissions on the file are set correctly, such that you have read/write permissions on it. It seems odd that dos2unix would fail, unless of course you did not have permissions to write to the file.

P.S. You keep referring to '\n'; however, by your own admission, your file has '\r' characters in it. These would appear as ^M if you open the file in vim. vim will also indicate whether the file is read-only.

EvilMarshmallow
December 4th, 2008, 04:50 PM
I have full access to this file (testing from my Desktop folder). dos2unix just can't figure out that character.

I thought the idea behind dos2unix was to convert the \r's to \n???

My ultimate goal is to have NO NEW LINES anywhere. I'll be parsing the data out of this and newlines need to be removed.

EvilMarshmallow
December 4th, 2008, 05:07 PM
A new thought...

You'll note that I said the file would not open in gedit after the following:

dos2unix file
cat file | tr '\n' ' ' > file2

When I did this and then tried to open file2 in gedit, it failed completely, hung and I had to "Force Quit". However, when I tried just now in vim, it opened the file with a little warning at the bottom that "no eol was found".

It appears that dos2unix does convert the newline characters... but when I remove them from the file, it's left without an end marker. Any ideas about how I could remove all the special characters except the last one that marks the end of the file?

dwhitney67
December 4th, 2008, 05:08 PM
I have full access to this file (testing from my Desktop folder). dos2unix just can't figure out that character.

I thought the idea behind dos2unix was to convert the \r's to \n???

My ultimate goal is to have NO NEW LINES anywhere. I'll be parsing the data out of this and newlines need to be removed.

Can you verify that you have read-write permissions on the file? Just because it is located in your Desktop folder does not imply that you have write permissions.

Another alternative, if you are brave, is to examine the contents of the file using 'od'. This command will enable you to examine the bytes within the file.



/usr/bin/od -A d -x -v lotusfile

If you see any '0d' bytes, then these are the '\r'. If you see '0a' bytes, then these are '\n'.

P.S. Earlier I stated that a \r is a linefeed; this is incorrect. A \r is referred to as a carriage-return, and a \n is a linefeed.
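
For a quick count rather than a full dump, something like the following might do. It is only a sketch: count_eol_bytes is a made-up helper name (not from this thread), and it assumes a POSIX tr and wc. It deletes every byte except the one of interest and counts what is left.

```shell
# Hypothetical helper (not from the thread): count CR and LF bytes in a file.
count_eol_bytes() {
    # tr -cd deletes everything except the listed byte; wc -c counts what survives.
    cr=$(tr -cd '\r' < "$1" | wc -c)
    lf=$(tr -cd '\n' < "$1" | wc -c)
    # $((...)) strips the padding that some wc implementations add.
    echo "CR=$((cr)) LF=$((lf))"
}
```

Called as count_eol_bytes lotusfile, a file with only Unix line endings should report CR=0.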

EvilMarshmallow
December 4th, 2008, 05:13 PM
/usr/bin/od -A d -x -v file.txt | grep 0d

returns nothing after dos2unix has been run. The same thing with grep 0a returns many results. So dos2unix is doing the conversion, as I suspected. The problem is now how to get rid of the \n's from the file without destroying the end-of-file marker.

PS, I also chmod'ed the file so I'm sure I have write permission.
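
As far as I know there is no end-of-file marker byte on Unix; vim's message only means that the last line is not terminated by a newline. A rough sketch of joining everything into one line while keeping a single trailing newline follows. join_lines is a hypothetical name, and replacing each newline with a space (as in the tr commands earlier in the thread) is an assumption about the desired output.

```shell
# Hypothetical helper: put the whole file on one line, ending with one newline.
join_lines() {
    # Drop carriage returns, then turn every newline into a space.
    tr -d '\r' < "$1" | tr '\n' ' '
    # Emit one final newline so the result ends like a normal text file.
    echo
}
```

Usage would be along the lines of join_lines file > file2. Whether gedit copes with one very long line is a separate question that this does not address.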

albandy
December 4th, 2008, 07:07 PM
Is it possible to get a copy of that file?

EvilMarshmallow
December 4th, 2008, 07:11 PM
It's too big for UF.org's attachment policy. Can you pm me an email?

EvilMarshmallow
December 4th, 2008, 09:36 PM
Ok, thanks to albandy for some off-line help... I'm almost done with this script!

I discovered that awk removes the newlines that wouldn't go away before. Now I've got another sed/awk/tr/whatever question:

My records are delimited by two consecutive double-quote characters: ""

How can I change "" into a newline?



//Note: here's my script so far:
wget http://[xxx]/file
dos2unix -a file
cat jobs.txt | awk '{ printf "%s", $0 }' > file2

mssever
December 5th, 2008, 12:41 AM
For future reference, it's easy to remove ^M characters in vim. Type the following (all characters are literal, except that ^M means Ctrl+M, ^V means Ctrl+V, etc.):

:%s/^V^M//g

then hit enter, and voilà. ^V is an escape character to allow you to type control characters directly.

To change "" into \n in vim, I'd do something like this:

:%s/""/^V^J/g

However, for some reason, you can't enter ^J. Instead, you get ^@, which is the null byte. Either I'm misunderstanding something, or this is a bug in vim.

EvilMarshmallow
December 5th, 2008, 02:51 PM
OK, my problem has been solved completely!

I found the following page: http://www.computing.net/answers/unix/sed-newline/5640.html

This shows how to insert a newline into a file as part of a sed s/x/y/ command.

The long & short of it: You can't do \n in sed, you have to physically do a new line like this:


bob@computer:~$ sed 's:\"\":\
> :g' file1 > file2

That will replace every occurrence of two consecutive double quotes ("") with a newline.
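
GNU sed, as far as I know, also accepts \n in the replacement part of s///, but that is not guaranteed for other seds. A sketch that sidesteps the question by using awk instead; split_records is a hypothetical name, and the awk program is a swapped-in technique, not the sed command from the linked page.

```shell
# Hypothetical helper: start a new line wherever two double quotes appear.
split_records() {
    # gsub replaces every "" on the current line with a newline; print writes the result.
    awk '{ gsub(/""/, "\n"); print }' "$1"
}
```

Usage would be something like split_records file1 > file2.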

jackoverfull
December 9th, 2008, 06:30 PM
Be careful with that approach: I had to remove a \n in a file and discovered that non-GNU versions of sed (e.g. the Mac OS X one, the Solaris one…) may not support it.

I ended up using perl:


perl -pi -e 's/\n//g'
for unix linefeed and


perl -pi -e 's/\r\n//g'
for dos carriage-returns.

Now my script can handle both without problems, but i'm still trying to figure out if a file uses \r\n or \n only…any hint?

I'd prefer not to use anything Ubuntu-specific, since this script will end up on very different systems (at least OS X, different Linux distros, Windows with Cygwin and maybe Solaris).

mssever
December 10th, 2008, 03:30 AM
Now my script can handle both without problems, but i'm still trying to figure out if a file uses \r\n or \n only…any hint?
Since you're calling Perl to do your substitutions, Perl is already a dependency. Why not write the whole thing in Perl and simplify things? If you look for the regex or substring \r\n, there's a good chance that the file uses Windows style. Similarly for \r and MacOS. \n might be inconclusive. At any rate, either do a search, or manually convert all line endings if doing so won't ruin your data. I sometimes do a two-stage conversion, which handles files created with Windows, MacOS, and *nix:


Replace \r\n with \n
Replace \r with \n

The resulting file contains only lines ending with \n, which you can easily convert to another style if you want.
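
The two-stage conversion might look like this in a shell script. A minimal sketch, assuming an awk that understands \r in regular expressions; normalize_eol is a hypothetical name. Because awk splits its input on \n, the first stage only has to strip a \r left at the end of each record.

```shell
# Hypothetical helper: convert \r\n and lone \r line endings to \n.
normalize_eol() {
    awk '{
        sub(/\r$/, "")      # stage 1: \r\n becomes \n (awk already split on the \n)
        gsub(/\r/, "\n")    # stage 2: any remaining lone \r becomes \n
        print
    }' "$1"
}
```

Note that print always terminates the last line, so a file without a final newline gains one.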

jackoverfull
December 10th, 2008, 06:00 PM
Since you're calling Perl to do your substitutions, Perl is already a dependency.
You're right: originally that was a problem.


Why not write the whole thing in Perl and simplify things?
Because I don't know Perl well enough… :rolleyes:


If you look for the regex or substring \r\n, there's a good chance that the file uses Windows style. Similarly for \r and MacOS. \n might be inconclusive. At any rate, either do a search, or manually convert all line endings if doing so won't ruin your data. I sometimes do a two-stage conversion, which handles files created with Windows, MacOS, and *nix:


Replace \r\n with \n
Replace \r with \n

The resulting file contains only lines ending with \n, which you can easily convert to another style if you want.
The problem is not converting the file: my script can now support Unix (\n) and DOS (\r\n) files without problems. The problem is detecting the type of file… I still haven't found a way to determine whether the input file comes from DOS or Unix.

mssever
December 10th, 2008, 08:37 PM
The problem is detecting the type of file… I still haven't found a way to determine whether the input file comes from DOS or Unix.
I already answered that question:

If you look for the regex or substring \r\n, there's a good chance that the file uses Windows style. Similarly for \r and MacOS. \n might be inconclusive.
Or, as I said earlier, you might be able to get away with just converting everything to \n style (which will still work in Windows in many--but not all--cases). That way, you don't need to determine which style the file uses.

jackoverfull
December 11th, 2008, 12:03 AM
I already answered that question:

Sorry, I didn't think of that…

Yes, i was trying to go with


cat imputfile | grep \r\n
which seems to work... mostly.
I was wondering if there is a better way: the input files I'll get will always be mixed (containing "\n", "\r\n" and maybe even "\r")…



Or, as I said earlier, you might be able to get away with just converting everything to \n style (which will still work in Windows in many--but not all--cases). That way, you don't need to determine which style the file uses.

Right, but i can't convert a 24 mb file (or maybe more) every time i launch the script…;)
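
One thing to watch with the grep above: without quotes, the shell removes the backslashes before grep runs, so the pattern grep receives is the two letters rn. A sketch that searches for a real carriage-return byte instead; has_cr is a hypothetical name, and it only tells whether any CR is present, not which style dominates.

```shell
# Hypothetical helper: succeed if the file contains at least one CR byte.
has_cr() {
    # printf produces the literal CR byte; command substitution only strips trailing newlines.
    cr=$(printf '\r')
    # -q: print nothing, the exit status alone reports whether a match was found.
    grep -q "$cr" "$1"
}
```

It could be used as: if has_cr inputfile; then echo "has CRs"; fi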

mssever
December 11th, 2008, 12:42 AM
I was wondering if there is a better way: the input files I'll get will always be mixed (containing "\n", "\r\n" and maybe even "\r")…
If the files are mixed, then they're not Windows-style, or *nix style, or whatever. They're mixed. And you really only have two options: Use tools that transparently hide the differences,* or convert the line-endings to something consistent.


Right, but i can't convert a 24 mb file (or maybe more) every time i launch the script…;)
Two things:


If possible, why not convert it once and save the converted file? Since it has various line endings, the specific style probably isn't important, so you could just pick one (\n would probably be best) and change the file to it. Of course, if you don't have write permission, then you'd have to keep a local cache, which might be more complicated than it's worth.
You don't have to convert the entire file, especially if it's large. Just convert the interesting lines.


* No such tool comes immediately to mind, but it shouldn't be too hard to write a function or two to handle that kind of thing, especially if you move from bash to a more flexible language such as Ruby or Python

jackoverfull
December 11th, 2008, 01:02 AM
If the files are mixed, then they're not Windows-style, or *nix style, or whatever. They're mixed. And you really only have two options: Use tools that transparently hide the differences,* or convert the line-endings to something consistent.

right.

I'm trying to read only the first \n or \r\n, which should always be "the right one"…


If possible, why not convert it once and save the converted file? Since it has various line endings, the specific style probably isn't important, so you could just pick one (\n would probably be best) and change the file to it. Of course, if you don't have write permission, then you'd have to keep a local cache, which might be more complicated than it's worth.
Because I'm taking this file from another (proprietary) program, which rewrites it completely whenever it adds new data.


* No such tool comes immediately to mind, but it shouldn't be too hard to write a function or two to handle that kind of thing, especially if you move from bash to a more flexible language such as Ruby or Python
If it is possible to do the thing in another language it could be a way…
Rewriting everything would be a problem, since I don't know those languages well enough and don't have the time to learn them for this project (I have to finish in about 3 weeks… I've now written everything but the GUI and it still needs some testing), but using them to solve this particular problem may work.

mssever
December 11th, 2008, 02:50 AM
right.

I'm trying to read only the first \n or \r\n, which should always be "the right one"…
Then I'd probably look for the first occurrence of each line ending and get the offset of it. Then by looking at the smallest offset, I'd have my answer. I have no idea how to accomplish something like this in Bash, because I'd never attempt such a project in Bash. Other languages make this kind of thing so much easier. You'll likely have to resort to some other language (such as awk, Perl, Ruby, etc.) to get that info.
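
For what it's worth, the "first line ending wins" idea can be roughed out from a shell script by leaning on od and awk. This is only a sketch under assumptions: first_eol is a hypothetical name, and it assumes a POSIX od, tr and awk. od prints the file as hex bytes, tr puts one byte per line, and awk stops at the first 0a or 0d it sees.

```shell
# Hypothetical helper: report the first line ending found in a file.
# Prints one of: lf, crlf, cr, none.
first_eol() {
    od -An -v -tx1 "$1" | tr -s ' ' '\n' | awk '
        # The previous byte was a CR: the next byte decides between crlf and cr.
        prev_cr    { print (($1 == "0a") ? "crlf" : "cr"); found = 1; exit }
        $1 == "0a" { print "lf"; found = 1; exit }
        $1 == "0d" { prev_cr = 1 }
        # A CR as the very last byte counts as cr; no line ending at all is none.
        END        { if (!found) print (prev_cr ? "cr" : "none") }
    '
}
```

Since awk exits at the first match, a large file should not have to be read to the end, although that depends on how the pipeline is buffered.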



Because I'm taking this file from another (proprietary) program, which rewrites it completely whenever it adds new data.
Fair enough.



If it is possible to do the thing in another language it could be a way…
Rewriting everything would be a problem, since I don't know those languages well enough and don't have the time to learn them for this project (I have to finish in about 3 weeks… I've now written everything but the GUI and it still needs some testing), but using them to solve this particular problem may work.
I don't know how complex your project is, but I have several observations:


It's quite likely that your project would be simpler if done in another language, especially if it's a big project (Bash doesn't scale well). Of course, that's not always true, as there are some tasks for which Bash is ideal. But based on what you've said so far, I think you'd benefit from considering a switch. And Python might not be as hard to learn as you might think. Why not give it a weekend and see how quickly you progress?
If a hack job is acceptable, you could have a multi-lingual project, where you write a script in, say, Python that does some stuff and spits its output in a way that your Bash script can handle. That would minimize the amount of rewriting you have to do.
Coding a GUI in Bash sounds to me like a nightmare, unless your needs were so simple that zenity could handle it without resorting to all sorts of contortions.

jackoverfull
December 11th, 2008, 03:26 AM
Then I'd probably look for the first occurrence of each line ending and get the offset of it. Then by looking at the smallest offset, I'd have my answer. I have no idea how to accomplish something like this in Bash, because I'd never attempt such a project in Bash. Other languages make this kind of thing so much easier. You'll likely have to resort to some other language (such as awk, Perl, Ruby, etc.) to get that info.

Good idea, thank you. I can probably find a way to do that using Perl…


It's quite likely that your project would be simpler if done in another language, especially if it's a big project (Bash doesn't scale well). Of course, that's not always true, as there are some tasks for which Bash is ideal. But based on what you've said so far, I think you'd benefit from considering a switch. And Python might not be as hard to learn as you might think. Why not give it a weekend and see how quickly you progress?
Well, it started out as a small script and now I have 156 KB of sources… :lolflag:
Originally I didn't think it would be that complex. Anyway, after a couple of months it is almost finished (all the originally planned features are implemented, at least in an acceptable form; even this autodetection is not really needed, since it is highly improbable to get a Windows-formatted file outside Windows, but I'd like to include it for flexibility…) and I'm not so eager to rewrite everything… :biggrin:
Afaik, Bash is good at managing text streams, although it is not the only one; for now it is working well even with that behemoth 24.7 MB file I got. A usual input file will be around 1.6 MB, anyway. :)


If a hack job is acceptable, you could have a multi-lingual project, where you write a script in, say, Python that does some stuff and spits its output in a way that your Bash script can handle. That would minimize the amount of rewriting you have to do.
That is perfectly acceptable. Perl would be my first choice in doing something like that, since, as you pointed out before, I'm already depending on it.


Coding a GUI in Bash sounds to me like a nightmare, unless your needs were so simple that zenity could handle it without resorting to all sorts of contortions.
Sorry, my fault.
I never intended to code the gui in bash, absolutely, that would be insane!:lol:

The "engine" will be cross-platform, but every platform will have its own GUI…
I'm not even sure that I'm going to do the GUI myself: I'd prefer to stay away from Windows if possible, I have bad dreams when working on that OS… :rolleyes: