Thread: Parsing html content with grep

  1. #1
    Join Date
    Jan 2012
    Location
    /a/
    Beans
    753
    Distro
    Kubuntu 13.04 Raring Ringtail

    Parsing html content with grep

    I have an HTML page with things like this in it:
    Code:
    <a id="p1874298" href="index.php?page=post&amp;s=view&amp;id=1874298">
    How do I use grep to output only the number (i.e. "1874298")? So far this is what I have:
    Code:
    wget -O - example.com | grep -m 1 --only-matching --perl-regexp '<a id=\"[^<>]*'
    But that doesn't fully work: it prints the whole opening tag instead of just the number.
    The whole thing is so patently infantile, so foreign to reality, that to anyone with a friendly attitude to humanity it is painful to think that the great majority of mortals will never be able to rise above this view of life.
    ~Sigmund Freud

  2. #2
    Join Date
    Jun 2006
    Location
    Brisbane Australia
    Beans
    713

    Re: Parsing html content with grep

    Code:
    wget -O - example.com | sed 's/^.*id=\([0-9]*\).*$/\1/'

  3. #3
    Join Date
    Jan 2012
    Location
    /a/
    Beans
    753
    Distro
    Kubuntu 13.04 Raring Ringtail

    Re: Parsing html content with grep

    So then, would this be the best way to do it?
    Code:
    wget -O - example.com | grep -m 1 --only-matching --perl-regexp '<a id=\"[^<>]*' | sed 's/^.*id=\([0-9]*\).*$/\1/'
    Isn't there a better way to do this than using both grep and sed? I'm sure it can be done with grep alone.
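
A grep-only version does look possible with GNU grep's PCRE mode. The following is a minimal sketch, assuming GNU grep built with `-P` support and assuming the id attribute always has the form `p` followed by digits. The function name `first_post_id` and the inline sample line are only illustrative.

```shell
# Hypothetical helper: print the digits of the first <a id="p..."> seen on stdin.
# Assumption: GNU grep with PCRE (-P) support is available.
first_post_id() {
    # \K drops everything matched so far from the reported match,
    # so --only-matching prints just the digits; -m 1 stops after
    # the first matching line.
    grep -m 1 -oP '<a id="p\K[0-9]+'
}

# Sample input modelled on the line from the first post.
printf '%s\n' '<a id="p1874298" href="index.php?page=post&amp;s=view&amp;id=1874298">' |
    first_post_id
# prints: 1874298
```

With a live page this would be used as `wget -O - example.com | first_post_id`. Note that `-P` is a GNU extension, so this may not work with other grep implementations.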

  4. #4
    Join Date
    Jun 2006
    Location
    Brisbane Australia
    Beans
    713

    Re: Parsing html content with grep

    Originally posted by Stonecold1995:
    So then, would this be the best way to do it?
    Code:
    wget -O - example.com | grep -m 1 --only-matching --perl-regexp '<a id=\"[^<>]*' | sed 's/^.*id=\([0-9]*\).*$/\1/'
    Isn't there a better way to do this than using both grep and sed? I'm sure it can be done with grep alone.
    Why did you add that grep back into the command I posted? You only need sed.

  5. #5
    Join Date
    Jun 2006
    Location
    Brisbane Australia
    Beans
    713

    Re: Parsing html content with grep

    I just realised you probably want to filter other lines out and just grab the number[s] from all the input. So you want something like:
    Code:
    wget -O - example.com | sed -n '/<a id=/s/^.*id=\([0-9]*\).*$/\1/p'
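
To see what the `/<a id=/` address and the `p` flag do, here is a small sketch that runs the same sed expression against inline sample HTML instead of a live page. The sample lines and the function name `post_ids` are invented for illustration.

```shell
# Hypothetical helper wrapping the sed expression from the post above.
post_ids() {
    # -n turns off sed's default printing; the /<a id=/ address restricts
    # the substitution to anchor lines; the p flag prints a line only
    # when the substitution succeeded on it.
    sed -n '/<a id=/s/^.*id=\([0-9]*\).*$/\1/p'
}

# Invented sample: two matching anchors with an unrelated line between them.
printf '%s\n' \
    '<a id="p1874298" href="index.php?page=post&amp;s=view&amp;id=1874298">' \
    '<p>unrelated line</p>' \
    '<a id="p1874299" href="index.php?page=post&amp;s=view&amp;id=1874299">' |
    post_ids
# prints:
# 1874298
# 1874299
```

Because the leading `.*` is greedy, the expression captures the digits after the last `id=` on each line, which here is the one inside the `href`, not the `id="p..."` attribute. That happens to give the same number for this page layout.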

  6. #6
    Join Date
    Jan 2012
    Location
    /a/
    Beans
    753
    Distro
    Kubuntu 13.04 Raring Ringtail

    Re: Parsing html content with grep

    So then if I wanted to only output the FIRST number, would it be best to pipe that to "head -1"? Or does sed have a feature like grep's "-m 1"?

  7. #7
    Join Date
    Jun 2006
    Location
    Brisbane Australia
    Beans
    713

    Re: Parsing html content with grep

    Yes, if I wanted only the first line of output, I would just add "| head -1" to the end of that sed line.
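
On the question of a sed equivalent to grep's "-m 1": sed can stop by itself if the substitution is grouped with the `q` command, which quits after the first matching line. The following is a sketch only, using inline sample data and an illustrative function name `first_id`.

```shell
# Hypothetical helper: print the number from the first matching anchor only.
first_id() {
    # For the first line matching /<a id=/, substitute and print,
    # then q quits before any further input is processed.
    # With -n, q does not print the pattern space a second time.
    sed -n '/<a id=/{s/^.*id=\([0-9]*\).*$/\1/p;q;}'
}

# Invented sample with two anchors; only the first should be reported.
printf '%s\n' \
    '<a id="p1874298" href="index.php?page=post&amp;s=view&amp;id=1874298">' \
    '<a id="p1874299" href="index.php?page=post&amp;s=view&amp;id=1874299">' |
    first_id
# prints: 1874298
```

Piping to head gives the same result for this input; the `q` form simply avoids the extra process.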

    However, if you want a more robust way to parse XML/HTML-tagged data from the command line, I would look at xmlstarlet. I have used it before and it is excellent.
