
View Full Version : c/c++: reading a webpage into a string



wellery
October 10th, 2005, 01:18 AM
Does anyone know how to read a webpage from a given url into a string? So what I want is to end up with the html code in a string. I want to use c/c++.

dtfinch
October 10th, 2005, 04:19 AM
I'm still a little new to Linux programming, so I don't know the "correct" way to do it, but here's an easy, risky way if you trust the url is not malformed:
FILE *fp=popen("wget --quiet -O - http://www.google.com","r"); //pipe wget output to a file handle
... //then read fp like a normal file
pclose(fp); //close pipe

popen() is my new favorite function. wget can read the url from a file if you need it to be safer. It can also read from stdin, but popen doesn't support "rw".
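To flesh that out a bit, here's a minimal sketch of reading the whole pipe into a std::string. The function name read_command is just illustrative, and it assumes wget is installed if you point it at the wget command line above:

```cpp
#include <cstdio>
#include <stdexcept>
#include <string>

// Run a shell command via popen() and capture its entire stdout.
// e.g. read_command("wget --quiet -O - http://www.google.com")
// would return the page source (assuming wget is installed).
std::string read_command(const std::string &cmd)
{
    FILE *fp = popen(cmd.c_str(), "r");
    if (!fp)
        throw std::runtime_error("popen failed");

    std::string out;
    char buf[4096];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0)
        out.append(buf, n);       // append exactly the bytes read

    pclose(fp);
    return out;
}
```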

wellery
October 10th, 2005, 09:13 AM
That'll do the job. I was looking for a way to do it without using a linux program such as wget. I know in java and vb you can use an xml reader to read a webpage into a string from a url. I was looking to do the same thing here. This will do for now however. Thanks

thumper
October 10th, 2005, 09:21 AM
Well, if you really want to, open a TCP socket to port 80 of the site you want to connect to, push through the HTTP request (really quite small), and read the results it sends you.

The problem with doing it this way is that you then have to handle some of the HTTP internals yourself, such as redirections and getting frame contents, and just hope that they don't use too much javascript to populate their page.

Or find an HTTP library that someone else has written that does all this for you.
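For what it's worth, a bare-bones sketch of that socket approach might look like the following. The names build_request and http_get are just illustrative; it uses HTTP/1.0 with Connection: close so that "read until EOF" gives you the whole response, and error handling is kept minimal:

```cpp
#include <cstring>
#include <string>
#include <netdb.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Build a minimal HTTP/1.0 request. HTTP/1.0 keeps the example simple:
// the server closes the connection when done, so reading to EOF
// gives you the whole page.
std::string build_request(const std::string &host, const std::string &path)
{
    return "GET " + path + " HTTP/1.0\r\n"
           "Host: " + host + "\r\n"
           "Connection: close\r\n\r\n";
}

// Fetch headers + body from host on port 80.
// Returns an empty string on failure.
std::string http_get(const std::string &host, const std::string &path)
{
    struct hostent *he = gethostbyname(host.c_str()); // resolve hostname
    if (!he) return "";

    int sockd = socket(AF_INET, SOCK_STREAM, 0);
    if (sockd < 0) return "";

    struct sockaddr_in sin;
    std::memset(&sin, 0, sizeof sin);
    sin.sin_family = AF_INET;
    sin.sin_port = htons(80);                          // HTTP port
    std::memcpy(&sin.sin_addr, he->h_addr, he->h_length);

    std::string page;
    if (connect(sockd, (struct sockaddr *)&sin, sizeof sin) == 0) {
        std::string req = build_request(host, path);
        write(sockd, req.data(), req.size());          // push the request
        char buf[4096];
        ssize_t n;
        while ((n = read(sockd, buf, sizeof buf)) > 0) // read until EOF
            page.append(buf, n);
    }
    close(sockd);
    return page;
}
```

Note this hands you the raw response, status line and headers included, so you'd still need to split off the body yourself.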

jerome bettis
October 10th, 2005, 05:16 PM
in python you can do

import os
text = os.popen("lynx --accept-all-cookies --dump <url>").readlines()

text will contain the actual text of the page minus the html crap. for that to work you'll need to apt-get install lynx if it isn't already installed. this is kind of a hack and won't work on windows machines etc but it's the easiest way of doing it.

monkeyking
October 14th, 2005, 02:27 AM
Well he says that he wants it in c++ and without any linux programs.
So I don't really see a solution in python depending on lynx as valid. ;)

wellery
October 14th, 2005, 04:38 AM
I'm not sure if it's going to work, but the solution I've got is to use libcurl from:

http://curl.haxx.se/

isandir
October 23rd, 2005, 12:00 AM
This can be done in C++. I have a program that does something very similar. What you need to do is open a socket to the webserver you want your html from. Then send your get message and read what it sends back to you. My program just sends the html out to my screen but you could store it in a string.


connect(sockd, (struct sockaddr *)&sin, sizeof(sin));
FILE *readerc = fdopen(sockd, "r");
FILE *writerc = fdopen(dup(sockd), "w"); //duplicate the fd so each stream owns its own copy
fputs(myMessage.c_str(), writerc); //myMessage is the GET request; fputs avoids treating it as a printf format string
fflush(writerc);

char line[1000];
while (fgets(line, sizeof(line), readerc) != NULL)
    printf("%s", line);

Of course there is a lot of setup before this code, but once you have an open connection to a webserver, that's all you have to do to get the html out of it.
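For completeness, the setup might look something like this. It's a sketch only; setup_socket is a made-up helper name, and it fills in the same sockd and sin the snippet above uses:

```cpp
#include <cstring>
#include <netdb.h>
#include <netinet/in.h>
#include <sys/socket.h>

// Resolve a hostname and prepare the address for a port-80 connection.
// Returns a socket descriptor (the caller still calls connect()),
// or -1 if the hostname doesn't resolve.
int setup_socket(const char *host, struct sockaddr_in *sin)
{
    struct hostent *he = gethostbyname(host); // resolve the hostname
    if (!he)
        return -1;

    std::memset(sin, 0, sizeof *sin);
    sin->sin_family = AF_INET;
    sin->sin_port = htons(80);                // HTTP port
    std::memcpy(&sin->sin_addr, he->h_addr, he->h_length);

    return socket(AF_INET, SOCK_STREAM, 0);
}
```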