
Thread: Parsing html content with grep

  1. #1
    Join Date
    Jan 2012
    Beans
    753

    Parsing html content with grep

    I have an HTML page with lines like this in it:
    Code:
    <a id="p1874298" href="index.php?page=post&amp;s=view&amp;id=1874298">
    How do I use grep to output only the number (i.e. "1874298")? So far this is what I have:
    Code:
    wget -O - example.com | grep -m 1 --only-matching --perl-regexp '<a id=\"[^<>]*'
    But that doesn't fully work: it prints the whole start of the tag, not just the number.

  2. #2
    Join Date
    Jun 2006
    Location
    Brisbane Australia
    Beans
    713

    Re: Parsing html content with grep

    Code:
    wget -O - example.com | sed s'/^.*id=\([0-9]*\).*$/\1/'

  3. #3
    Join Date
    Jan 2012
    Beans
    753

    Re: Parsing html content with grep

    So then, would this be the best way to do it?
    Code:
    wget -O - example.com | grep -m 1 --only-matching --perl-regexp '<a id=\"[^<>]*' | sed s'/^.*id=\([0-9]*\).*$/\1/'
    Isn't there a better way to do this, rather than using both grep and sed? I'm sure it can be done with grep alone.
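
For the grep-only route asked about here, one possible sketch uses `\K` in the Perl-style pattern, which drops everything matched so far from the reported match. It assumes GNU grep built with `--perl-regexp` support, and the `html` variable is only a stand-in for the `wget` output, using the sample line from the first post.

```shell
# Sample line from the first post, standing in for the wget output.
html='<a id="p1874298" href="index.php?page=post&amp;s=view&amp;id=1874298">'

# \K discards the '<a id="p' prefix from the match, so -o prints only the digits.
# -m 1 stops reading after the first matching line.
first_id=$(printf '%s\n' "$html" | grep -m 1 -oP '<a id="p\K[0-9]+')
echo "$first_id"
```

With the real page, the same pattern would go after `wget -O - example.com |` in place of the `printf`.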

  4. #4
    Join Date
    Jun 2006
    Location
    Brisbane Australia
    Beans
    713

    Re: Parsing html content with grep

    Quote Originally Posted by Stonecold1995
    So then, would this be the best way to do it?
    Code:
    wget -O - example.com | grep -m 1 --only-matching --perl-regexp '<a id=\"[^<>]*' | sed s'/^.*id=\([0-9]*\).*$/\1/'
    Isn't there a better way to do this, rather than using both grep and sed? I'm sure it can be done with grep alone.
    Why did you add that grep back into the command I posted? You only need sed.

  5. #5
    Join Date
    Jun 2006
    Location
    Brisbane Australia
    Beans
    713

    Re: Parsing html content with grep

    I just realised you probably want to filter other lines out and just grab the number[s] from all the input. So you want something like:
    Code:
    wget -O - example.com | sed -n '/<a id=/s/^.*id=\([0-9]*\).*$/\1/p'
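
A small sketch of what `-n` and the `p` flag change, run on a two-line sample instead of the real page (the `<p>intro</p>` line is made up for illustration). The substitution itself works on the sample because the greedy `^.*` runs up to the last `id=` on the line, the one inside the `href`, which is followed directly by digits.

```shell
# Two-line stand-in for the page; the first line is invented for the demo.
html='<p>intro</p>
<a id="p1874298" href="index.php?page=post&amp;s=view&amp;id=1874298">'

# Without -n, lines that do not match pass through unchanged.
all=$(printf '%s\n' "$html" | sed 's/^.*id=\([0-9]*\).*$/\1/')

# With -n plus the p flag, only lines where the substitution succeeded are printed.
only=$(printf '%s\n' "$html" | sed -n '/<a id=/s/^.*id=\([0-9]*\).*$/\1/p')

echo "$all"
echo "$only"
```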

  6. #6
    Join Date
    Jan 2012
    Beans
    753

    Re: Parsing html content with grep

    So then if I wanted to only output the FIRST number, would it be best to pipe that to "head -1"? Or does sed have a feature like grep's "-m 1"?

  7. #7
    Join Date
    Jun 2006
    Location
    Brisbane Australia
    Beans
    713

    Re: Parsing html content with grep

    Yes, if I wanted only the first line of output then I would just add a "| head -1" to the end of that sed command.
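
As an alternative to `head -1`, sed's `q` command can act roughly like grep's `-m 1` by quitting after the first match. This is a sketch on a made-up three-line sample (the second ID, 1874299, is invented for illustration), not the only way to do it.

```shell
# Stand-in for the page; only the first <a> line comes from the thread.
html='<p>no link here</p>
<a id="p1874298" href="index.php?page=post&amp;s=view&amp;id=1874298">
<a id="p1874299" href="index.php?page=post&amp;s=view&amp;id=1874299">'

# On the first line containing '<a id=': substitute, print the result, then quit.
first=$(printf '%s\n' "$html" | sed -n '/<a id=/{s/^.*id=\([0-9]*\).*$/\1/p;q;}')
echo "$first"
```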

    However, if you want a more robust way to parse XML/HTML tagged data from the command line, I would look at xmlstarlet. I have used it before and it is excellent.
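
A rough idea of what the xmlstarlet approach might look like, assuming xmlstarlet is installed and the input is well-formed XML. The `<root>` wrapper and the closed `<a>` element are a hypothetical stand-in for the page, and the XPath expressions are just one way to pull out the number.

```shell
# Hypothetical, well-formed stand-in for the page.
xml='<root><a id="p1874298" href="index.php?page=post&amp;s=view&amp;id=1874298">post</a></root>'

ids=""
if command -v xmlstarlet >/dev/null 2>&1; then
    # For each <a> that has an id attribute, print the id minus its leading "p".
    ids=$(printf '%s\n' "$xml" | xmlstarlet sel -t -m '//a[@id]' -v 'substring(@id, 2)' -n)
fi
echo "$ids"
```

Real-world HTML is often not well-formed XML, so it would probably need tidying first, for example with xmlstarlet's `fo` subcommand and its `--html` and `--recover` options.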
