sed command - help please

**Timothy Taylor** · February 3rd, 2010

I have been trying to use the sed command to remove lines of what I suspect were Chinese or Japanese translation in text documents. These lines appear to consist mostly of non-ASCII characters like "Šš¥ß€FŠ"

I was hoping that this would delete lines containing non-ASCII characters:

sed -e '/^[:ascii:]/d' < test1.txt

but this doesn't appear to do anything.

I tried this:

sed -e '/^[\x00 - \x7F]/d' < test1.txt

but again, no luck.
I realise that this is probably because it is finding some ASCII characters in the lines I wish to discard?

Any ideas?

**Brandon Williams** · February 4th, 2010

Are you trying to delete all lines that have one or more non-ascii characters in them? In that case, I think you want the '^' symbol inside the '[...]' expression, as in:

Code:

sed -e '/[^:ascii:]/d'

Outside of the character class, the '^' matches on start-of-line. If it's the first character inside, it inverts the specified character class.
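To make the two roles of '^' concrete, here is a small sketch using a plain digit range instead of a named class. The two-line sample input is made up for illustration.

```shell
# Hypothetical two-line sample: one line of letters, one line of digits.
sample() { printf 'abc\n123\n'; }

# '^' outside the brackets anchors to start-of-line:
# delete lines that START with a digit.
sample | sed '/^[0-9]/d'     # prints: abc

# '^' as the first character inside the brackets inverts the set:
# delete lines that contain ANY non-digit character.
sample | sed '/[^0-9]/d'     # prints: 123
```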

**Lars Noodén** · February 4th, 2010

Maybe awk or egrep would be more appropriate.

Do you want to keep the line, but leave it empty, or eliminate the offending line completely?

**bokopperud** · February 4th, 2010

Try 'tr'... perhaps with the -d option.
Note: It must be *in* the pipeline, it doesn't take file-arguments itself (use 'cat' first).
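A rough sketch of what the 'tr -d' route could look like, assuming the unwanted bytes are all in the range 0x80-0xFF. The sample text and the helper name `strip_high_bytes` are made up. Note that tr deletes individual characters rather than whole lines, so any ASCII left on an affected line stays behind.

```shell
# Hypothetical sample: one ASCII line and one line with a UTF-8 'é' (bytes 0xC3 0xA9).
sample() { printf 'plain\ncaf\303\251 ok\n'; }

# Delete every byte in the range 0x80-0xFF (octal 200-377).
# LC_ALL=C asks tr to treat the input as single bytes.
strip_high_bytes() { LC_ALL=C tr -d '\200-\377'; }

sample | strip_high_bytes
# prints:
# plain
# caf ok
```

Since tr reads standard input, shell redirection should work as well as 'cat', e.g. `strip_high_bytes < test1.txt`.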

**Timothy Taylor** · February 4th, 2010

Thanks for the replies.

Originally Posted by Brandon Williams

Are you trying to delete all lines that have one or more non-ascii characters in them?

Yes.

Originally Posted by Brandon Williams

In that case, I think you want the '^' symbol inside the '[...]' expression, as in:

Code:

sed -e '/[^:ascii:]/d'

Outside of the character class, the '^' matches on start-of-line. If it's the first character inside, it inverts the specified character class.

I tried

sed -e '/[^:ascii:]/d' < test1.txt

and there is no output - it seems to delete every line...

Originally Posted by Lars Noodén

Do you want to keep the line, but leave it empty, or eliminate the offending line completely?

I want to remove it completely.

ETA
I have also tried:

sed -e '/[\x80 - \xFF]/d' < test1.txt

hoping that this would catch "extended" ASCII characters, but this seems to delete all the lines I actually want!

This seems to be a typical "simple 5-minute job"!

I'll have a look at awk, egrep and tr.
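A possible explanation for the `[\x80 - \xFF]` result, sketched with plain letters so it doesn't depend on `\x` escape support: inside a bracket expression a space is a literal member, so `x - y` appears to be read as 'x', a range from space to space, and 'y'. The sample input is made up.

```shell
# Hypothetical sample: one line with a space, one without.
sample() { printf 'hello world\nnospace\n'; }

# With spaces around the hyphen the set is: 'x', space..space, 'y'.
# Any line containing a space is deleted.
sample | sed '/[x - y]/d'    # prints: nospace

# Without the spaces it is the range x..y; neither line contains
# 'x' or 'y', so both lines are kept.
sample | sed '/[x-y]/d'
```

If that reading is right, `[\x80 - \xFF]` matches any line that contains a space, which would be nearly every line of ordinary text, and `^[\x00 - \x7F]` would only match lines whose first character happens to be one of the three set members.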

**mobilediesel** · February 4th, 2010

Originally Posted by Timothy Taylor

Thanks for the replies.

Yes.

I tried

sed -e '/[^:ascii:]/d' < test1.txt

and there is no output - it seems to delete every line...

I want to remove it completely.

You almost have it, try:

Code:

sed -e '/[^[:print:]]/d' test1.txt

[:print:] is for printable characters and there's no need for the < character.

**Timothy Taylor** · February 4th, 2010

Originally Posted by mobilediesel

You almost have it, try:

Code:

sed -e '/[^[:print:]]/d' test1.txt

[:print:] is for printable characters and there's no need for the < character.

No, that doesn't work either - no output.

**mobilediesel** · February 4th, 2010

Originally Posted by Timothy Taylor

No, that doesn't work either - no output.

That suggests that the file is all one line or every line contains non-printable characters.
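One way to check either possibility is sed's 'l' command, which prints each line with control characters shown as escapes and a '$' marking the end of the line. The sample line below is made up; a tab or a DOS-style carriage return is not in [:print:], so a file with CRLF line endings would match `[^[:print:]]` on every line.

```shell
# Hypothetical sample line containing a tab and a trailing carriage return.
sample() { printf 'a\tb\r\n'; }

# 'l' prints the pattern space unambiguously; -n suppresses the normal output.
sample | LC_ALL=C sed -n 'l'    # prints: a\tb\r$

# To rule out the "all one line" case, count the lines of the real file:
# wc -l < test1.txt
```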

**Timothy Taylor** · February 4th, 2010

Aye, something weird is going on...

I tried

sed -e '/[:ascii:]/d' test1.txt

and was surprised to get a list of paragraph numbers and some lines of the garbage text out.
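That output would be consistent with `[:ascii:]` being read as a plain set of the characters ':', 'a', 's', 'c' and 'i' rather than as a character class. POSIX does not define an `ascii` class, and the named classes it does define need their own brackets inside the bracket expression, as in `[[:print:]]`. Below is a minimal sketch of one way to drop lines containing non-ASCII bytes, assuming a sed that matches byte by byte under LC_ALL=C. The sample text and the helper name `drop_non_ascii` are made up.

```shell
# The single-bracket form, written as an explicit set:
# deletes any line containing ':', 'a', 's', 'c' or 'i'.
printf '12\ncats\nxyz\n' | sed '/[asci:]/d'
# prints:
# 12
# xyz

# Hypothetical sample: an ASCII line, a line with a UTF-8 'é'
# (bytes 0xC3 0xA9), and an ASCII line with a tab and a CRLF ending.
sample() { printf 'plain line\ncaf\303\251\ntab\there\r\n'; }

# Delete lines containing any byte that is neither printable ASCII
# nor whitespace. LC_ALL=C makes the classes cover ASCII only.
drop_non_ascii() { LC_ALL=C sed '/[^[:print:][:space:]]/d'; }

sample | drop_non_ascii    # keeps the first and third lines
```

Including [:space:] keeps lines that merely contain tabs or carriage returns. On the real file this would be run as `drop_non_ascii < test1.txt`.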