Data mining & machine learning looking for ideas


ppl
September 19th, 2007, 06:19 PM
I know there are all kinds of people hanging around here, so here is an open-ended question:

What could you learn from this data set, and how could I do it?


1999-09-30 23:55:35 upi51936 897 4769 IntOffpeak http://tottered/ 555.176.38.203 0.00300 M 21
1999-09-30 23:55:35 upi51936 892 4769 IntOffpeak http://tottered/ 555.176.38.203 0.00300 M 21
1999-09-30 23:55:35 upi51936 1502 4769 IntOffpeak http://haiti.com/ 555.176.38.203 1.10000 M 21
1999-09-30 23:55:37 upi51936 1388 4769 IntOffpeak http://ads.doctors.com/ 555.176.38.203 0.04000 M 21
1999-09-30 23:55:39 upi51320 16419 4769 IntOffpeak http://heinz.com/ 555.176.55.164 25.20600 40
1999-09-30 23:55:39 upi51320 14299 4769 IntOffpeak http://home.rollback.com/ 555.176.55.164 26.34000 40
1999-09-30 23:55:39 upi51320 12542 4769 IntOffpeak http://heinz.com/ 555.176.55.164 27.92300 40
1999-09-30 23:55:42 upi51936 17432 4769 IntOffpeak http://decoders.com/ 555.176.38.203 2.59800 M 21
1999-09-30 23:55:43 upi51936 704 4769 IntOffpeak http://tottered/ 555.176.38.203 0.00400 M 21
1999-09-30 23:55:48 upi51936 18597 4769 IntOffpeak http://decoders.com/ 555.176.38.203 2.63300 M 21
1999-09-30 23:55:49 upi51936 1311 4769 IntOffpeak http://haiti.com/ 555.176.38.203 1.00400 M 21
1999-09-30 23:55:50 upi51936 1198 4769 IntOffpeak http://ads.doctors.com/ 555.176.38.203 0.04900 M 21
1999-09-30 23:55:53 upi51936 15934 4769 IntOffpeak http://decoders.com/ 555.176.38.203 2.32000 M 21
1999-09-30 23:55:54 upi51936 1351 4769 IntOffpeak http://haiti.com/ 555.176.38.203 1.01200 M 21
1999-09-30 23:55:55 upi51936 2294 4769 IntOffpeak http://tottered/ 555.176.38.203 0.18400 M 21
1999-09-30 23:55:59 upi51875 16304 4769 IntOffpeak http://www.illegitimacy.com/ 555.176.32.48 1.47000 M 34
1999-09-30 23:56:00 upi51936 2568 4769 IntOffpeak http://decoders.com/ 555.176.38.203 0.92200 M 21
1999-09-30 23:56:01 upi51936 1830 4769 IntOffpeak http://decoders.com/ 555.176.38.203 1.00000 M 21
1999-09-30 23:56:01 upi51936 6399 4769 IntOffpeak http://decoders.com/ 555.176.38.203 1.76800 M 21
1999-09-30 23:56:02 upi51936 923 4769 IntOffpeak http://tottered/ 555.176.38.203 15.71400 M 21
1999-09-30 23:56:28 upi59607 24051 4769 IntOffpeak http://ami.com/ 555.176.26.170 3.02600 M 21
1999-09-30 23:56:29 upi59607 1322 4769 IntOffpeak http://haiti.com/ 555.176.26.170 1.00000 M 21
1999-09-30 23:56:30 upi59607 1198 4769 IntOffpeak http://ads.doctors.com/ 555.176.26.170 0.04100 M 21
1999-09-30 23:56:30 upi51936 1310 4769 IntOffpeak http://haiti.com/ 555.176.38.203 1.00900 M 21
1999-09-30 23:56:31 upi51936 1197 4769 IntOffpeak http://ads.doctors.com/ 555.176.38.203 0.00300 M 21
1999-09-30 23:56:40 upi59607 20626 4769 IntOffpeak http://ami.com/ 555.176.26.170 5.90500 M 21
1999-09-30 23:56:41 upi59607 1363 4769 IntOffpeak http://haiti.com/ 555.176.26.170 1.03400 M 21
1999-09-30 23:56:42 upi59607 11030 4769 IntOffpeak http://ads.doctors.com/ 555.176.26.170 18.91400 M 21
1999-09-30 23:56:45 upi51936 20435 4769 IntOffpeak http://decoders.com/ 555.176.38.203 2.97000 M 21
1999-09-30 23:56:46 upi51936 1307 4769 IntOffpeak http://haiti.com/ 555.176.38.203 0.99600 M 21
1999-09-30 23:56:47 upi51936 11286 4769 IntOffpeak http://ads.doctors.com/ 555.176.38.203 16.29300 M 21
1999-09-30 23:57:04 upi59607 23950 4769 IntOffpeak http://ami.com/ 555.176.26.170 2.88800 M 21
1999-09-30 23:57:05 upi59607 1323 4769 IntOffpeak http://haiti.com/ 555.176.26.170 1.01300 M 21
1999-09-30 23:57:06 upi51936 4679 4769 IntOffpeak http://prophesy.com/ 555.176.38.203 1.58500 M 21
1999-09-30 23:57:07 upi51936 1340 4769 IntOffpeak http://prophesy.com/ 555.176.38.203 0.86600 M 21
1999-09-30 23:57:08 upi51936 1717 4769 IntOffpeak http://www.floppily.com/ 555.176.38.203 1.00100 M 21
1999-09-30 23:57:09 upi51936 1585 4769 IntOffpeak http://tottered/ 555.176.38.203 2.10900 M 21
1999-09-30 23:57:11 upi51936 56257 4769 IntOffpeak http://www.doctors.co.nz/ 555.176.38.203 4.18700 M 21
1999-09-30 23:57:13 upi51936 642 4769 IntOffpeak http://www.doctors.co.nz/ 555.176.38.203 0.00400 M 21
1999-09-30 23:57:13 upi51936 679 4769 IntOffpeak http://www.doctors.co.nz/ 555.176.38.203 0.00300 M 21
1999-09-30 23:57:13 upi51936 11634 4769 IntOffpeak http://ads.doctors.com/ 555.176.38.203 0.30400 M 21
1999-09-30 23:57:14 upi51936 676 4769 IntOffpeak http://www.doctors.co.nz/ 555.176.38.203 0.00300 M 21
1999-09-30 23:57:14 upi51936 671 4769 IntOffpeak http://www.doctors.co.nz/ 555.176.38.203 0.00400 M 21
1999-09-30 23:57:14 upi51936 671 4769 IntOffpeak http://www.doctors.co.nz/ 555.176.38.203 0.00400 M 21
1999-09-30 23:57:14 upi51936 671 4769 IntOffpeak http://www.doctors.co.nz/ 555.176.38.203 0.00300 M 21
1999-09-30 23:57:14 upi51936 1697 4769 IntOffpeak http://www.doctors.co.nz/ 555.176.38.203 4.76500 M 21
1999-09-30 23:57:19 upi59607 24574 4769 IntOffpeak http://ami.com/ 555.176.26.170 2.64100 M 21
1999-09-30 23:57:20 upi59607 1363 4769 IntOffpeak http://haiti.com/ 555.176.26.170 1.03100 M 21
1999-09-30 23:57:21 upi59607 11018 4769 IntOffpeak http://ads.doctors.com/ 555.176.26.170 15.56400 M 21
1999-09-30 23:57:49 upi51936 64963 4769 IntOffpeak http://www.geographic.com/ 555.176.38.203 16.19200 M 21
1999-09-30 23:57:49 upi51936 7366 4769 IntOffpeak http://www.geographic.com/ 555.176.38.203 16.19000 M 21
1999-09-30 23:57:49 upi51936 38949 4769 IntOffpeak http://www.geographic.com/ 555.176.38.203 16.18900 M 21
1999-09-30 23:57:49 upi51936 12285 4769 IntOffpeak http://www.geographic.com/ 555.176.38.203 31.10700 M 21

There are four million rows like that, with no explanation; all we know is that it comes from a university's internet traffic. The first thing I can think of is to reduce the dataset size by removing redundancy. But what I could learn from it, and how, I have no idea yet. I don't even know what each column means.

ppl
September 19th, 2007, 06:30 PM
I am thinking about seeing what I can get out of WEKA, but I don't know how well WEKA will handle a data file of about 400 megabytes. I still think reducing the dataset size is a good idea anyway, and I need to convert it from plain text to a file format WEKA recognizes.
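For the conversion step, here is a minimal sketch in Python of turning the whitespace-separated rows into ARFF, WEKA's native text format. The attribute names (`unknown1`, `slot`, and so on) and the function name `to_arff` are placeholders, since the column meanings are not known at this point. Only the first nine tokens are used, because some rows in the sample have fewer trailing fields than others.

```python
def to_arff(lines, relation="weblog"):
    """Convert whitespace-separated log rows to ARFF text.

    Attribute names are placeholders, not the real column names.
    """
    header = [
        f"@relation {relation}",
        "",
        '@attribute timestamp date "yyyy-MM-dd HH:mm:ss"',
        "@attribute user string",
        "@attribute unknown1 numeric",
        "@attribute unknown2 numeric",
        "@attribute slot string",
        "@attribute url string",
        "@attribute ip string",
        "@attribute unknown3 numeric",
        "",
        "@data",
    ]
    rows = []
    for line in lines:
        t = line.split()
        if len(t) < 9:
            continue  # skip blank or truncated rows
        rows.append(",".join([
            f'"{t[0]} {t[1]}"',  # date and time joined into one quoted value
            t[2], t[3], t[4], t[5],
            f"'{t[6]}'",         # quote values containing ':' and '/'
            f"'{t[7]}'",
            t[8],
        ]))
    return "\n".join(header + rows) + "\n"


sample = [
    "1999-09-30 23:55:35 upi51936 897 4769 IntOffpeak "
    "http://tottered/ 555.176.38.203 0.00300 M 21"
]
print(to_arff(sample))
```

Declaring `url` and `ip` as `string` keeps the header short; many WEKA algorithms want nominal or numeric attributes instead, so these would likely need recoding before any learning.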

Wybiral
September 19th, 2007, 06:54 PM
I think the most you will get out of it is to group them into similar properties or find patterns between the properties...

But it's not going to be of much use if you don't even know what the properties are.

Looks like a web history of some kind; you could find statistical patterns between time/date and domain.

nanotube
September 19th, 2007, 07:54 PM
you could probably use R to analyze the data set (r-project.org, also available in the repos).

or, you could use yale machine learning: http://sourceforge.net/projects/yale

looking at the data columns, you could surmise the following about the meaning of the columns:

1999-09-30 23:55:35 upi51936 897 4769 IntOffpeak http://tottered/ 555.176.38.203 0.00300 M 21

date; time; machinehostname; ??; ??; time slot indicator; domainname; ip address (for the machine hostname); kilobytes served, maybe; ??; ??.

so... at least you have about half of the fields identified. :)

my guess is that these are the logs of some web host, hosting these sites, and distributing its load to a number of machines through a load balancer...

so, what exactly may you want to learn from this dataset?
for one, you could do a histogram of some fields by frequency (e.g., field 3, machine hostname, to see if the machines all handle about the same number of requests?). if field -3 (third from the end) is kb served, then you could find how many total kb are served per unit time per machine?

it's much easier to figure out what to do with the dataset, once you have an idea of what you want to learn from it, though. ;)
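The histogram idea above can be sketched in a few lines of Python. This is only an illustration under assumptions: it treats the third token as the "machine hostname" and the ninth token as the numeric quantity to sum, following the guesses in this post, and it indexes from the front because some sample rows have fewer trailing fields, which would shift a "third from the end" index. The name `summarize` is made up for the example.

```python
from collections import Counter, defaultdict

SAMPLE = """\
1999-09-30 23:55:35 upi51936 897 4769 IntOffpeak http://tottered/ 555.176.38.203 0.00300 M 21
1999-09-30 23:55:39 upi51320 16419 4769 IntOffpeak http://heinz.com/ 555.176.55.164 25.20600 40
1999-09-30 23:55:42 upi51936 17432 4769 IntOffpeak http://decoders.com/ 555.176.38.203 2.59800 M 21
1999-09-30 23:56:28 upi59607 24051 4769 IntOffpeak http://ami.com/ 555.176.26.170 3.02600 M 21
"""


def summarize(lines):
    """Return (requests per host, summed ninth field per host and minute)."""
    requests = Counter()
    per_minute = defaultdict(float)
    for line in lines:
        t = line.split()
        if len(t) < 9:
            continue  # skip blank or truncated rows
        host = t[2]
        minute = f"{t[0]} {t[1][:5]}"  # drop the seconds
        requests[host] += 1
        per_minute[(host, minute)] += float(t[8])  # assumed quantity column
    return requests, dict(per_minute)


requests, per_minute = summarize(SAMPLE.splitlines())
for host, n in requests.most_common():
    print(host, n)
```

For the full file, the same loop works on an open file handle instead of `SAMPLE.splitlines()`, so the 400 MB never has to sit in memory at once.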

ppl
September 19th, 2007, 08:16 PM
Yes, I was told it is a web history from the university's internet.

Thanks, I have now more or less figured out what most of the properties are.
After talking to a classmate, the first thing to do is reduce the dataset to something below 10 megabytes. WEKA, at least, can't handle a dataset larger than that on our lab machines (Core2 6400, 2 GB RAM).

ppl
September 19th, 2007, 08:48 PM
Thanks for the information. I was pretty much focused on WEKA, which is machine learning software that was mentioned in a lecture before. I am still not sure which one will suit my purpose.

But I guess the first thing I can do is get some statistical results out of the data. I think R should be good for that, but I don't know whether I will have enough time to learn R, as I don't plan to spend more than 30 hours on the whole project.

I will definitely try a machine learning approach in addition to a statistical one, with either Yale or Weka, maybe both. It looks like Yale has a better user interface, while Weka's strong point seems to be trying different algorithms, but I am not sure as I am not familiar with either.

Yes, I agree with you. As this dataset obviously comes from our university, the -3 field is the amount of traffic for that login session. This is what I have got so far:
1. Time & date
2. upi (id number)
3. ??
4. ??
5. traffic type: IntOffpeak (international offpeak), IntPeak, NZ (domestic)
6. url (visited?)
7. ip address (user's?)
8. traffic count
9. Male/Female(?)
10. number (age?)

I believe field 7 is the user's IP address: across the whole dataset it comes from a quite limited range, so it is obviously an intranet address.

As it is an open-ended question, it is hard to know from the beginning what I want to learn from it; it could be anything :-) But hopefully something useful, and I guess it doesn't have to make sense: the less sense it makes, the more useful it could be.

Wybiral
September 19th, 2007, 09:02 PM
Oh wow, if that is the sex+age you could really find out some cool statistics from that data... Especially if you mix in the time+date.

You should keep us posted on how you do.

ppl
September 19th, 2007, 09:24 PM
A Wiki page would be good :)

The sad thing is that there are so many possibilities, but without a direction it could take forever, especially considering the size of the dataset.

ppl
September 19th, 2007, 09:34 PM
I don't know how people make it work, but considering the dataset is 400 MB with 4 million entries, I think it should at least be preprocessed before being fed directly to any software; otherwise you would certainly need a supercomputer.

I could reduce the redundancy, map the URLs and IP addresses to something else, divide age into groups, etc., but there is certainly a limit to how much I can compress it that way. Also, could I divide it into small datasets and learn from it bit by bit? That doesn't make sense to me yet.
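On the "bit by bit" question: for simple statistics such as counts and sums it does work, because partial results from each piece can be added together. Below is a rough Python sketch of that idea, counting URLs chunk by chunk. The function names `read_chunks` and `count_urls` and the chunk size are made up for illustration, and the seventh token is assumed to be the URL. This does not carry over automatically to every learning algorithm, only to results that can be merged.

```python
import io
from collections import Counter
from itertools import islice


def read_chunks(handle, size):
    """Yield lists of at most `size` lines from an open file-like object."""
    while True:
        block = list(islice(handle, size))
        if not block:
            return
        yield block


def count_urls(handle, size=100_000):
    """Count URL occurrences one chunk at a time and merge the partial counts."""
    total = Counter()
    for block in read_chunks(handle, size):
        partial = Counter()
        for line in block:
            t = line.split()
            if len(t) >= 9:
                partial[t[6]] += 1  # seventh token assumed to be the URL
        total.update(partial)  # Counter.update adds counts together
    return total


demo = io.StringIO(
    "1999-09-30 23:56:28 upi59607 24051 4769 IntOffpeak http://ami.com/ 555.176.26.170 3.02600 M 21\n"
    "1999-09-30 23:56:29 upi59607 1322 4769 IntOffpeak http://haiti.com/ 555.176.26.170 1.00000 M 21\n"
    "1999-09-30 23:56:40 upi59607 20626 4769 IntOffpeak http://ami.com/ 555.176.26.170 5.90500 M 21\n"
)
totals = count_urls(demo, size=2)
print(totals.most_common())
```

With a real file, `open(path)` would replace the `io.StringIO` demo, and only one chunk of lines is held in memory at a time.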

nanotube
September 19th, 2007, 10:51 PM
well, if you have 2g of ram, you could just load the whole thing into R :) once there, you can condense the data set into smaller summary chunks for further processing...

Wybiral
September 19th, 2007, 11:17 PM
You could definitely compress the date some: the time can be generalized to minutes or hours (no need for seconds), and the rest could be compressed by using a lookup table of the data (one integer for each domain / IP).

If you could put it in a binary format, or a database, that would help as well. You should decide what you want out of it. Possibly pick a few correlations that you'd like to find.

Seems more like a problem for good-old-fashioned statistics than machine learning, but you might be able to use some kind of unsupervised learning technique to find new groups.
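The lookup-table idea could look roughly like this in Python. It is a sketch under assumptions: the seventh and eighth tokens are taken to be the domain and IP, timestamps are truncated to the minute, and the function name `encode` is invented for the example.

```python
def encode(lines):
    """Replace each domain and IP with a small integer and drop the seconds.

    Returns the encoded rows plus the two lookup tables, so the
    original strings can be recovered later.
    """
    domain_ids, ip_ids = {}, {}
    rows = []
    for line in lines:
        t = line.split()
        if len(t) < 9:
            continue  # skip blank or truncated rows
        minute = f"{t[0]} {t[1][:5]}"
        # setdefault assigns the next free integer the first time a value is seen
        d = domain_ids.setdefault(t[6], len(domain_ids))
        i = ip_ids.setdefault(t[7], len(ip_ids))
        rows.append((minute, d, i))
    return rows, domain_ids, ip_ids


sample = [
    "1999-09-30 23:55:35 upi51936 897 4769 IntOffpeak http://tottered/ 555.176.38.203 0.00300 M 21",
    "1999-09-30 23:55:35 upi51936 1502 4769 IntOffpeak http://haiti.com/ 555.176.38.203 1.10000 M 21",
    "1999-09-30 23:56:28 upi59607 24051 4769 IntOffpeak http://ami.com/ 555.176.26.170 3.02600 M 21",
    "1999-09-30 23:56:29 upi59607 1322 4769 IntOffpeak http://haiti.com/ 555.176.26.170 1.00000 M 21",
]
rows, domain_ids, ip_ids = encode(sample)
for row in rows:
    print(row)
```

Once the rows are tuples of small integers, exact duplicates within the same minute are easy to collapse with a `Counter`, which is one way to remove the redundancy discussed earlier in the thread.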

ppl
September 20th, 2007, 04:54 AM
Loading the whole thing into R sounds pretty cool. I know Weka has problems with datasets much smaller than the RAM; probably a problem with the algorithm or maybe Java, I don't know.

CptPicard
September 20th, 2007, 04:58 AM
Check out the Apriori algorithm (http://en.wikipedia.org/wiki/Apriori_algorithm) and pals. I took a class on them back when I was at uni; it was taught by the Toivonen guy who is mentioned in one of the Wikipedia article's references :) Apriori itself is rather slow though, there are better ones, and algorithms that find not only common subsets, but also possibly causal patterns in time... you'll also need to discretize your continuous variables...
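To give a feel for what Apriori does, here is a small, unoptimized Python sketch of the level-wise search for frequent itemsets. The "baskets" are made-up toy data (sets of sites, as if grouped per user session), not results from the real dataset, and `apriori` and `min_support` are just names chosen for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for every itemset seen in at least
    `min_support` transactions, found level by level."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = {item for t in transactions for item in t}
    current = {frozenset([item]) for item in items}
    current = {c for c in current if support(c) >= min_support}

    frequent = {}
    k = 1
    while current:
        for c in current:
            frequent[c] = support(c)
        k += 1
        # Join step: build size-k candidates from frequent (k-1)-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


# Toy baskets for illustration only.
baskets = [
    {"haiti.com", "ads.doctors.com", "decoders.com"},
    {"haiti.com", "ads.doctors.com", "ami.com"},
    {"haiti.com", "ads.doctors.com"},
    {"decoders.com"},
]
for itemset, count in sorted(apriori(baskets, 2).items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), count)
```

Continuous columns would have to be binned into categories first before they can appear as items, which is the discretization step mentioned above.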

ppl
September 20th, 2007, 05:05 AM
Yeah, that is the way I was thinking of compressing it as well. And as you mentioned, and I realize now, I need to decide early on which properties I think are important and throw out the others, rather than trying to get as much information as I can. Then at least I will get something out in the end.

So I decided I will give R a try first, as it is a tool for statistics and will probably give some straightforward results, but only if I can figure out how to use it. I am trying to start it but couldn't find the menu item for R and don't know what command to use. Googling now...

Wybiral
September 20th, 2007, 05:51 AM
Well, you can keep all of your data. But chop it down for whatever task you are doing. I would pick two or three specific goals and try each of them, trimming the data from each one. Then see which one presents the most interesting correlations. But certainly don't ditch any of it, you never know when you'll want a giant pool of data to play with.

nanotube
September 20th, 2007, 09:48 AM
if you have installed R from the repos (packages r-base-core and some other similar looking ones), start it with command "R". :) simple enough. ;)

ppl
September 20th, 2007, 09:47 PM
Thanks for that. It was easy to figure out after looking at a tutorial on the web. And I remembered I did exactly the same thing before, and only figured out that I should type R instead of r after googling it.

An update about the properties in the dataset.

3. units 4431 (in bytes)
4. chargeRate 4769 (in billionths of a cent)

-3 connectionTime "3.47100" (in seconds)
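Putting the field names from this thread together, a row could be parsed into a named record roughly as follows (Python sketch). Several things here are assumptions: the gender and age meanings are only the guesses made earlier in the thread, `"F"` is assumed as a possible value although the sample shows only `M` or nothing, connection time is read as the ninth token from the front because rows without the gender field would shift a from-the-end index, and `charge_in_cents` assumes the charge rate applies per byte, which the thread does not state.

```python
from typing import NamedTuple, Optional


class Record(NamedTuple):
    timestamp: str
    upi: str
    units: int               # bytes
    charge_rate: int         # billionths of a cent
    traffic_type: str
    url: str
    ip: str
    connection_time: float   # seconds
    gender: Optional[str]    # guessed meaning; missing in some rows
    age: Optional[int]       # guessed meaning


def parse(line):
    """Parse one log row, or return None for blank or truncated rows."""
    t = line.split()
    if len(t) < 9:
        return None
    tail = t[9:]  # zero, one, or two trailing fields
    gender = tail[0] if tail and tail[0] in ("M", "F") else None
    age = int(tail[-1]) if tail and tail[-1].isdigit() else None
    return Record(f"{t[0]} {t[1]}", t[2], int(t[3]), int(t[4]),
                  t[5], t[6], t[7], float(t[8]), gender, age)


def charge_in_cents(rec):
    # ASSUMPTION: charge_rate is per byte of `units`; not confirmed in the thread.
    return rec.units * rec.charge_rate / 1e9


a = parse("1999-09-30 23:55:35 upi51936 897 4769 IntOffpeak http://tottered/ 555.176.38.203 0.00300 M 21")
b = parse("1999-09-30 23:55:39 upi51320 16419 4769 IntOffpeak http://heinz.com/ 555.176.55.164 25.20600 40")
print(a)
print(b)
print(charge_in_cents(a))
```

Under that per-byte assumption, a rate of 4769 works out to roughly 5 cents per 2^20 bytes, which at least looks like a plausible tariff.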