Easy Ubuntu Clustering

ajt
January 4th, 2009, 08:58 PM
Discussions about Ubuntu clustering seem to be scattered over several forums, and discussion of the Easy Ubuntu Clustering blueprint has been dormant so long that the threads are archived and marked read-only!

I want to re-start discussion about Ubuntu clustering here, where Ubuntu users in Education and Science can contribute: This is related to my proposed blueprint for a "Bioinformatics workstation/server for Ubuntu" https://blueprints.launchpad.net/ubuntu/+spec/biobuntu, which I've marked as obsolete now that NEBC have released Bio-Linux5 based on Ubuntu 8.04 LTS http://nebc.nox.ac.uk/biolinux.html

The Beowulf 'clustering' and grid aspects of my blueprint overlap with those of the "Easy Ubuntu Clustering" blueprint. In particular, creation of Kerrighed packages for Ubuntu: Some (very) old Debian packages have been removed from the Kerrighed website because they are badly out of date with the upstream sources. I've compiled the latest Kerrighed sources under Ubuntu 8.04, and booted it successfully. I'll post details on the "Easy Ubuntu Clustering" wiki https://wiki.ubuntu.com/EasyUbuntuClustering

Tony.

Kellemora
January 12th, 2009, 05:55 PM
Hi ajt

Glad you brought this back to the front again!

I was partly involved in the SETI project quite a number of years ago, not on the computer end, just the CPU-sharing end. UC Berkeley would use my shared computers as part of their Beowulf cluster.

More recently I have done a little on-line research into connecting a few of my computers here together into a cluster, hopefully to speed up some of the intense graphics work we do.

I've already retrofitted 3 of my computers with gigabit LAN cards and played around with trying to figure out how to set them up, but I either kept hitting a brick wall or, when I did find some instructions, they were so far over my head it was pitiful.

What led to this in the first place was, I wanted to learn more about how servers work, so I downloaded and installed Edubuntu.
Edubuntu works out of the box so to speak, but all the computers become dumb terminals and use the computing power of the computer the workstations are connected to, as well as burn up all the memory available. Although I liked the idea of a server, I didn't like the fact IT had to provide everything, including the CPU resources.

With so many computers sitting around here, I thought, why not tie them all together. After all, we have dual and quad core computers now, so how hard could it be anyhow......?

Turns out, it's WAY over my head right now, hi hi..........
Nonetheless, I've learned to do different things on different computers using a KVM then taking all the output and stuff it on a file server. That way I can keep working on other things while the crunching is being done on other machines. NOT a Cluster of course, just more efficient use of the machines I do have here.

Since then I've dumbed down to a single Data File Server, which is nothing more than File Sharing, and keep all the working programs on different machines, whichever is best for the project at hand.

But wouldn't it be nice to be able to put all that computing power to use by tying it all together like it was one large computer?

I see it done all over the place. 4, 6, 8, up to 36, 150 and 250 computers all working together. But they are for specific projects designed to use such a system.

In any case, I'm going to keep monitoring this thread and hope it goes somewhere positive and progressive!

TTUL
Gary

ajt
January 12th, 2009, 08:09 PM
Hi ajt

Glad you brought this back to the front again!

Hello, Gary.

Thanks for joining this thread about Ubuntu Clustering :-)

I've been running openMosix for quite a while under Ubuntu 6.06 on our Beowulf cluster:

http://bioinformatics.rri.sari.ac.uk

I've posted a few times about openMosix on the Ubuntu Forums, but the threads are very dispersed so I decided to try and focus a discussion here in Education and Science where, I believe, potential Beowulf users might pick up on the thread!


I was partly involved in the SETI project quite a number of years ago, not on the computer end, just the CPU-sharing end. UC Berkeley would use my shared computers as part of their Beowulf cluster.

More recently I have done a little on-line research into connecting a few of my computers here together into a cluster, hopefully to speed up some of the intense graphics work we do.


We ran SETI as a test load when I first built our Beowulf, but I found it increasingly difficult to justify using all that electricity and switched to running Folding@home until we got into the top 1,000 ;-)


I've already retrofitted 3 of my computers with gigabit LAN cards and played around with trying to figure out how to set them up, but I either kept hitting a brick wall or, when I did find some instructions, they were so far over my head it was pitiful.


There are good instructions about how to setup DHCP and PXE boot Kerrighed at:

http://www.kerrighed.org/wiki/index.php/Kerrighed_on_NFSROOT

http://www.kerrighed.org/wiki/index.php/Kerrighed_on_NFSROOT_(contrib)
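
To give a feel for the DHCP side of a PXE setup, here is a minimal sketch in Python that prints ISC dhcpd `host` entries for a handful of nodes. It is not taken from the wiki pages above: the node names, MAC addresses, IP addresses and the `pxelinux.0` boot file are all made-up assumptions, and the helper names `host_stanza` and `cluster_stanzas` are hypothetical. Check everything against your own network before using it.

```python
# Sketch: render ISC dhcpd "host" stanzas for PXE-booting cluster nodes.
# All names, MAC addresses, IP addresses and file names are illustrative
# assumptions, not values from the Kerrighed documentation.


def host_stanza(name, mac, ip, tftp_server, boot_file="pxelinux.0"):
    """Return one dhcpd.conf host block that pins a node to a fixed IP
    and tells its PXE firmware where to fetch the boot loader."""
    return (
        f"host {name} {{\n"
        f"    hardware ethernet {mac};\n"
        f"    fixed-address {ip};\n"
        f"    next-server {tftp_server};\n"
        f'    filename "{boot_file}";\n'
        f"}}\n"
    )


def cluster_stanzas(macs, subnet_prefix="192.168.1.", first_host=101,
                    tftp_server="192.168.1.1"):
    """Name the nodes node1, node2, ... and give them consecutive addresses."""
    blocks = []
    for index, mac in enumerate(macs):
        blocks.append(host_stanza(f"node{index + 1}", mac,
                                  f"{subnet_prefix}{first_host + index}",
                                  tftp_server))
    return "\n".join(blocks)


if __name__ == "__main__":
    print(cluster_stanzas(["00:16:3e:00:00:01", "00:16:3e:00:00:02"]))
```

The output would be pasted inside the matching `subnet` declaration of `dhcpd.conf`. As mentioned elsewhere in this thread, only run a DHCP server on a network you control.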


[...]
But wouldn't it be nice to be able to put all that computing power to use by tying it all together like it was one large computer?

I see it done all over the place. 4, 6, 8, up to 36, 150 and 250 computers all working together. But they are for specific projects designed to use such a system.


It's not that complicated if you are motivated enough to experiment. My summer student Kenny Strouts wrote a brief tutorial about his project installing Kerrighed under Debian Etch:

http://bioinformatics.rri.sari.ac.uk/drupal/?q=wiki/tutorial_kerrighed


In any case, I'm going to keep monitoring this thread and hope it goes somewhere positive and progressive!


Please do, and see if you can interest any of your friends and colleagues in the thread too.

Bye,

Tony.

kloplop321
January 14th, 2009, 07:42 AM
I am excited about computer clustering as well, but because nothing is available for Ubuntu I have been trying live CDs. One of the computers I want to use has a LAN card that isn't supported :( and the live CD doesn't know how to do WiFi either; it thinks it is just an ethernet connection.

ajt
January 14th, 2009, 07:48 AM
The kerrighed project have recently announced a Mandriva live CD running the Kerrighed Kernel:

http://www.kerrighed.org/forum/viewtopic.php?p=607#607

You can use this to see if your computers would run Kerrighed, but beware that it runs a DHCP server, so don't do it on someone else's LAN ;-)

Bye,

Tony.

kSt.
January 14th, 2009, 12:02 PM
Good paper (few years old though) -

"OpenMosix, OpenSSI and Kerrighed: A Comparative Study"
http://hal.inria.fr/docs/00/07/06/04/PDF/RR-5399.pdf

OpenMosix is no longer maintained. Supported distributions for openSSI are Fedora Core 3 and Debian Sarge (with work in progress to port to Etch & Lenny) - so I've no idea as to the status of OpenSSI with regards to Ubuntu. (Maybe someone could shed some light on this?)

Kerrighed is exciting, but still a research project - not yet production ready. At the moment, node failure under Kerrighed causes the whole cluster to freeze. Work is being done to solve this problem; see the following;
http://article.gmane.org/gmane.linux.cluster.kerrighed.user/69

At present the best place to seek information concerning Kerrighed is the mailing list, see http://news.gmane.org/gmane.linux.cluster.kerrighed.user

ajt
January 14th, 2009, 12:22 PM
I'm still running openMosix (linux-2.4.26-om1), but I'm planning to upgrade to Kerrighed. I tried out openSSI under Ubuntu 6.06 and 8.04, but had a lot of problems just trying to get it to run. There was some discussion about SSI recently on the beowulf list.

Apparently, kerrighed is used in production by Kerlabs:

http://www.kerlabs.com/-Home-.html

However, I agree Kerrighed is still quite fragile...

My colleague Luca Clivio from the Mario Negri Institute in Milan visited Christine Morin's group and they were extremely helpful. It seems that Kerrighed functionality has been their priority, but the current version is more robust. One thing we could do is help to package Kerrighed for Ubuntu, and get more people interested in using it.

I posted a message to the kerrighed-users list about this discussion. My objective in starting this thread is to encourage people to try getting Kerrighed running under Ubuntu. It is already quite well supported by Mandriva. However, that is an rpm-based distribution and I want to use Ubuntu because I use Bio-Linux (which is based on Ubuntu).

Bye,

Tony.

djamu
January 14th, 2009, 04:50 PM
Hi all,

I've been using / compiling kerrighed for some years now. ( 3D, non-academic )


My thoughts / experiences / ideas on kerrighed:


I'll sum up a couple of points that I'll elaborate.

1. It's a bad idea to use Ubuntu > rather use a stock Debian.
Although Ubuntu originates from Debian, a lot of things changed aside from the kernel version. It's easier to upgrade a Debian kernel than to downgrade an Ubuntu one. ( And just forget about compiling it the Ubuntu way ... )

2. Someone mentioned OpenSSI which is nearly dead > know that Kerrighed re-used most of the OpenSSI + Mosix source.

3. While it's the most likely candidate to be used by mere mortals, it lacks some elementary functionality ( aside from the outdated howtos & documentation ).
Despite the info on their site, some advertised functionality will not ( ever ) be available
http://www.kerrighed.org/wiki/index.php/Status
Roadmap > November 2008 > thread migration will NOT be implemented any time soon.
This is very misleading, and leaves me with mixed feelings knowing that it is (was) an EU-funded operation that is now run by Kerlabs.

I made some inquiries via a major german corporation in regard to aforementioned feature ( Just to make sure I got an honest reply).

I cannot fully disclose the document but in short it reads that Kerlabs doesn't see any benefit in developing thread migration any further. This because of the small user base that might actually use it ..... (sic)
It must have been a prank if this wasn't a genuine answer. Nothing wrong with out-of-this-world academics, but really how can a company / management be so ignorant for not seeing the huge benefits to the community..
About everyone I know ( in 3D ) would hook up their computers to create a mini cluster(s) ..

I also got an estimate from Kerlabs > the cost to fully implement / support thread migration....


4. Kerrighed live CD > don't bother trying to run it on vmware machines, kernel doesn't support PCNET32 ( NIC drivers vmware uses ).
I'll release my own LiveCD soon ( has proper failover head-node with iptables nat / gateway, webinterface, support for most high-end renderers ( LW / RIB / Maya / Houdini etc .. ) + 3rd party plugin support hooks > to control cluster from within an application ( 3D/video ) etc...
Just need to work a bit on an interface.
( volunteers register on my site )


So in short:

Partially because of the aforementioned reasons there's a big chance I'll fork the project.
Another reason is that my targeted userbase is completely different > I don't need fancy checkpointing on 500 node clusters / hot adding / removing etc...
Most users of this fork will cluster fewer than 10 computers ( so no need for infiniband / myrinet ).

I've already set up a bug tracker so ....


2 routes to get this done >

-use venture capital ( or other private funding ) > which will make software de-facto proprietary
-use donations ... and keep it open sourced ...


More ideas ?

ajt
January 14th, 2009, 10:28 PM
Hello, djamu.

Not sure why you think it's a bad idea to use Ubuntu?

The Kerrighed kernel is based on the 'vanilla' kernel sources, so it doesn't make much difference if you compile it under Debian or Ubuntu.

I think it does make a difference which distro you run: I used Debian from Potato to Etch, but I'm now using Ubuntu because I prefer it. I know that Ubuntu is Debian really, but I think it's much better presented and easier to use both as a server and a desktop. In fact, I use our Beowulf as a terminal server and we run 'embarrassingly' parallel HPC tasks from NX sessions using openMosix or use MPI. I've not looked at Debian Lenny, but Sarge and Etch were crude on the Desktop compared to Ubuntu 6.06...

openSSI looks very interesting but, despite several attempts, I've now abandoned any thought of using it. We must be careful not to accuse anyone of using the 'MOSIX' source, because that's proprietary! I guess you are talking about the GPL version when openMosix was forked ;-)

The three main contenders for FLOSS (Free/Libre Open Source Software) versions of SSI (Single System Image) Linux kernels seem to be openMosix, openSSI and Kerrighed. It's now unlikely that openMosix will continue, but I'm watching development of PMI:

http://linuxpmi.org/trac/

I agree that the openSSI project seems inactive (but don't dismiss it, because it has a lot of good points too). That leaves Kerrighed, which is still an active EU FP6 (Framework 6) funded project until 2010.

Why does venture capital mean something has to be proprietary?

Canonical have invested a lot in Ubuntu...

Thanks for joining in :-)

Bye,

Tony.

kvk
January 14th, 2009, 10:45 PM
This is wonderful information- thanks for posting it. I've been interested for a while now in constructing a small cluster (8-16 nodes) for running ecological models, but both my hardware abilities and programming need a bit more development. This will be great stuff to read through!

djamu
January 15th, 2009, 01:58 AM
Hello, djamu.

Not sure why you think it's a bad idea to use Ubuntu?

The Kerrighed kernel is based on the 'vanilla' kernel sources, so it doesn't make much difference if you compile it under Debian or Ubuntu.


True, aside from the upstream implementations on Ubuntu that might not work on a downgraded kernel... can't think of any right now but I had my share of those before switching my servers back to Debian ( hopelessly outdated IDS packages, for example ).
Also a base Debian install installs much less "bloat" than Ubuntu server does.
For example > there's a new service in Intrepid (forgot the name) that lets you sign up for server monitoring...
Now I understand the need for Canonical to generate income.. but not letting me choose whether I want to install it is of a completely different magnitude... ( I try to avoid using Windows for the very same reason )....

In a server environment this is of less importance as the performance penalty is negligible...
But for tuned HPC cluster kernels ( to reduce interrupt jitter > whether NUMA / SMP ) every unnecessary service interrupt is one too many ...
So if I have the option to install a really minimal system (netinstall) or I have to search thru a stock Ubuntu server install to manually delete all unnecessary "helpful" services ... the choice should be obvious, at least common sense dictates the obvious



....
but Sarge and Etch were crude on the Desktop compared to Ubuntu 6.06...
....


That's a long time ago, they still don't install sudo in the base install :smile:

Another point in favour of me using Debian is that a lot more packages (not desktop related) are available and rigorously tested > the very reason for its slow cycle...
( A good example is the fail2ban package: stable on Debian, typo bug in 8.04 LTS preventing it from restarting on boot ... submitted to Launchpad with fix..., Intrepid has the proper version ) > BTW as a matter of fact, this kind of package regression is forbidden by LTS policy. The tight release cycle prevents thorough testing...


But let's not get into any kind of flaming, IMO I really think ubuntu is a great desktop distro..., just my personal opinion..



openSSI looks very interesting but, despite several attempts, I've now abandoned any thought of using it. We must be careful not to accuse anyone of using the 'MOSIX' source, because that's proprietary! I guess you are talking about the GPL version when openMosix was forked ;-)

Oops sorry my bad.. yes I did mean OpenMosix...

As for OpenSSI, I do think it's dead for 2 reasons.
1. Kerrighed uses a substantial part of its source > making OpenSSI practically obsolete
2. The XtreemOS project, which has (about) the same goal as OpenSSI ( where kerrighed is going to be part of )
http://www.xtreemos.eu/



...
It's now unlikely that openMosix will continue, but I'm watching development of PMI:


isn't PMI a rebranded OpenMOSIX ?



I agree that the openSSI project seems inactive (but don't dismiss it, because it has a lot of good points too). That leaves Kerrighed, which is still an active EU FP6 (Framework 6) funded project until 2010.


True, but since OpenSSI lacks the funding of the XtreemOS consortium ... chances are slim that it will emerge again..
http://www.xtreemos.eu/overview/plonearticlemultipage.2006-06-08.9297943452/partners

System Architecture Overview:
http://www.xtreemos.eu/science-and-research/plonearticlemultipage.2007-05-03.6408426978/copy_of_xtreemos-architecture


Why does venture capital mean something has to be proprietary?

It doesn't, it's just easier than providing just a service...
And I wouldn't if a couple of kernel devs chip in..


Canonical have invested a lot in Ubuntu... & Mark Shuttleworth flew in space
:lolflag:



On a more serious note.
Tony (AJT) you seem like a knowledgeable guy, I'd like to discuss in depth a couple of things ( compare my trunk compile kernel configs with yours / discuss different types of distributed FS's / tuning & jitter etc ... ) ..
But I'm a little afraid the title of the thread is not well chosen > I'll elaborate .. quite a lot of people reading this will have a hard time grasping the type of cluster we're discussing. To me it's obvious we're talking (headless) HPC clustering...
In other words this thread is prone to pollution as soon as people start suggesting for example RHEL or SAN failover mechanisms as an alternative...

It's getting really late now, let's see where this is going..


Jan

altonbr
January 15th, 2009, 04:09 AM
Here's an amazing article on how to turn Xboxes into a Beowulf cluster. Looks inherently simpler than I thought.

http://www.anandtech.com/linux/showdoc.aspx?i=2271&p=8

altonbr
January 15th, 2009, 04:23 AM
I also had these resources in another thread (http://ubuntuforums.org/showpost.php?p=5935648&postcount=4)...

Here are some resources I found:
Beowulf:
http://www.beowulf.org/overview/index.html
http://www.cacr.caltech.edu/beowulf/tutorial/building.html
http://www.phy.duke.edu/~rgb/Beowulf/beowulf_book/beowulf_book/node9.html

Related Projects:
https://computing.llnl.gov/linux/slurm/
http://www.rocksclusters.org/

Showcasing & Concepts:
http://helmer.sfe.se/
http://helmer2.sfe.se/
http://helmer3.sfe.se/
http://www.calvin.edu/~adams/research/microwulf/

Articles:
http://www.ibm.com/developerworks/library/l-halinux/
http://www.scribd.com/doc/312186/THE-GOOGLE-CLUSTER-ARCHITECTURE
http://www.clustermonkey.net//content/view/211/1/
http://www.phy.duke.edu/~rgb/Beowulf/beowulf_book/beowulf_book/node21.html

Comic:
http://imgs.xkcd.com/comics/network.png

kloplop321
January 15th, 2009, 09:05 AM
I would simply be interested in using 3 old computers running as a cluster to slightly speed things up and add some elbow room. I have tried some live CDs but my old laptop's LAN card isn't supported (odd), and I would need to run a fan control program on one of the computers because its fan control is not automatic from the BIOS (I find that stupid).
the last one is a pentium MMX machine.

machoo02
January 15th, 2009, 03:41 PM
altonbr,

You could add this to your list o' links:
http://www.calvin.edu/~adams/research/microwulf/

altonbr
January 15th, 2009, 07:04 PM
altonbr,

You could add this to your list o' links:
http://www.calvin.edu/~adams/research/microwulf/

It's already in there ;)

djamu
January 15th, 2009, 08:11 PM
I would simply be interested in using 3 old computers running as a cluster to slightly speed things up and add some elbow room. I have tried some live CDs but my old laptop's LAN card isn't supported (odd), and I would need to run a fan control program on one of the computers because its fan control is not automatic from the BIOS (I find that stupid).
the last one is a pentium MMX machine.

Don't want to discourage you but.
If it's a desktop OS whose load you want to share automatically between computers..... come back in a couple of years... this is not possible (yet)... Kerrighed is closest to what you want to do & can migrate processes of standard applications to other computer(s), BUT it can't split an application that uses a single process over 2 computers, and it doesn't distribute threads...
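
Since whole processes are what can be moved, one practical consequence is to structure a job as several independent OS processes rather than as threads inside one process. A minimal sketch in Python, assuming a toy workload (summing a range of integers); `WORKER_SOURCE` and `run_as_processes` are hypothetical names, and whether any given cluster kernel actually moves these processes to other nodes depends on how it is configured:

```python
import subprocess
import sys

# Each work unit is a tiny stand-alone program; here it just sums a range.
# The workload is a made-up example chosen so the result is easy to check.
WORKER_SOURCE = (
    "import sys; a, b = int(sys.argv[1]), int(sys.argv[2]); "
    "print(sum(range(a, b)))"
)


def run_as_processes(bounds):
    """Start one OS process per (start, stop) pair, then collect the results."""
    children = [
        subprocess.Popen(
            [sys.executable, "-c", WORKER_SOURCE, str(start), str(stop)],
            stdout=subprocess.PIPE,
            text=True,
        )
        for start, stop in bounds
    ]
    results = []
    for child in children:
        out, _ = child.communicate()  # wait for the child and read its output
        results.append(int(out))
    return results


if __name__ == "__main__":
    parts = run_as_processes([(0, 1000), (1000, 2000), (2000, 3000)])
    print(sum(parts))
```

All children are started before any result is read, so they run side by side; a single multi-threaded program doing the same work would stay on one machine.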

Brief recap of types:

The term supercomputer might be misleading, as in most cases it's not a single OS that runs over a group of CPUs but rather independent computers that run a common application(s) over a shared network(s) > in other words it would be more correct to talk about an HPC cluster than a supercomputer... ( not 1 computer )

Depending on the type of application / dataset(s) more or less speed is required for the interconnects.

An example:
Meet the most powerful supercomputer on the planet... in everyone's :) home.
Folding@home / Seti@home ( BOINC ) > http://boinc.berkeley.edu/
This is an extreme case of an Embarrassingly Parallel workload ( the correct term for this kind of HPC clustering ): very little data is sent over the network by a scheduler to each compute node..
This kind of number crunching doesn't require intermediate data from other processes and doesn't need distributed RAM.
A classical renderfarm is another example ( with a bit more network traffic )
http://en.wikipedia.org/wiki/Embarrassingly_parallel
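
A minimal sketch of that pattern in Python: cut the job into work units, crunch each unit using only its own input, and combine the partial results at the end. The workload (a sum of squares) and the names `make_work_units`, `crunch` and `run_job` are illustrative assumptions, not anything BOINC actually does.

```python
def make_work_units(total, unit_size):
    """Cut the range [0, total) into independent (start, stop) work units."""
    return [(start, min(start + unit_size, total))
            for start in range(0, total, unit_size)]


def crunch(unit):
    """Process one work unit using only its own input (toy: sum of squares)."""
    start, stop = unit
    return sum(n * n for n in range(start, stop))


def run_job(total, unit_size, mapper=map):
    """Hand every unit to `mapper` and combine the partial results.

    `mapper` defaults to the built-in serial map. Anything with the same
    calling convention (for example multiprocessing.Pool.map) can replace it,
    because no unit depends on another and the order of completion is
    irrelevant to the final sum.
    """
    return sum(mapper(crunch, make_work_units(total, unit_size)))


if __name__ == "__main__":
    print(run_job(1000, 64))
```

The only data that crosses the "network" here is a pair of integers going out and one integer coming back per unit, which is why this kind of job tolerates slow interconnects.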


On the other end you have those applications / dedicated clusters that require massive amounts of intermediate data / huge databases / shared memory > The big boys from the pictures...
These monsters usually have 3D torus shaped high speed networks ( infiniband / myrinet / 10Gb ) with multiple connections / node and are usually built for a single purpose ( you can't run just any app on them because it would run inefficiently )
These types of clusters usually run some flavor of MPI and can't run stock applications
http://en.wikipedia.org/wiki/Message_Passing_Interface
http://en.wikipedia.org/wiki/OpenMP
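
To show why these jobs lean so heavily on the interconnect, here is a minimal sketch in plain Python of a computation where every worker needs its neighbours' edge values before each step. This is NOT MPI: the "ranks" are just list chunks inside one process and the "messages" are counted rather than sent. The smoothing workload and the names `smooth_serial` and `smooth_partitioned` are assumptions made up for illustration.

```python
def smooth_serial(values, steps):
    """Reference: repeatedly replace each interior cell by the mean of
    itself and its two neighbours; the two end cells stay fixed."""
    cells = list(values)
    for _ in range(steps):
        cells = ([cells[0]]
                 + [(cells[i - 1] + cells[i] + cells[i + 1]) / 3
                    for i in range(1, len(cells) - 1)]
                 + [cells[-1]])
    return cells


def smooth_partitioned(values, steps, ranks):
    """Same computation with the cells split evenly over `ranks` workers.

    Assumes len(values) is divisible by ranks. Returns (cells, messages).
    """
    size = len(values) // ranks
    chunks = [list(values[r * size:(r + 1) * size]) for r in range(ranks)]
    messages = 0
    for _ in range(steps):
        # Communication phase: each rank receives one edge cell per neighbour.
        halos = []
        for r in range(ranks):
            left = chunks[r - 1][-1] if r > 0 else None
            right = chunks[r + 1][0] if r < ranks - 1 else None
            messages += (left is not None) + (right is not None)
            halos.append((left, right))
        # Computation phase: each rank updates only its own cells.
        updated = []
        for r, (left, right) in enumerate(halos):
            padded = (([left] if left is not None else [])
                      + chunks[r]
                      + ([right] if right is not None else []))
            offset = 1 if left is not None else 0
            new_chunk = []
            for i, cell in enumerate(chunks[r]):
                j = i + offset
                if j == 0 or j == len(padded) - 1:
                    new_chunk.append(cell)  # global end cell stays fixed
                else:
                    new_chunk.append(
                        (padded[j - 1] + padded[j] + padded[j + 1]) / 3)
            updated.append(new_chunk)
        chunks = updated
    return [cell for chunk in chunks for cell in chunk], messages
```

With 4 ranks, every step costs 6 messages before anyone can compute, and no rank can run ahead of its neighbours. Unlike the Folding@home style of job, the traffic grows with the number of steps, which is where the fast networks earn their keep.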

Closest to a real supercomputer ( in the correct sense of the word ) and what you most likely want to achieve, unrelated to the previous mentioned MPI systems ( which are individual machines connected thru a network ) are the SSI systems. > Kerrighed / (Open)Mosix ( now PMI ) / OpenSSI
http://en.wikipedia.org/wiki/Single-system_image



One more thing worth noting:
The reason why more and more clusters are built from COTS ( commercial off-the-shelf > normal desktop computers ) parts instead of custom hardware is that before you had 2 distinct types of CPUs

scalar > fast for single instruction, single data
http://en.wikipedia.org/wiki/Scalar_processor
http://en.wikipedia.org/wiki/SISD

vector > fast for single instruction, multiple data
http://en.wikipedia.org/wiki/Vector_processor
http://en.wikipedia.org/wiki/SIMD

more: http://en.wikipedia.org/wiki/MIMD

Because the x86 family was originally a pure scalar CPU, it was unsuited ( very slow ) to do any vector processing.. The addition of the MMX and later 3DNow! / SSE etc.. extensions ( which are vector capabilities ) made the x86 hybrid.
So every modern x86 currently has 1 scalar + 1 vector + floating point + .... / core ( the P4 with hyperthreading had 2 scalar cores with 1 vector and 1 floating point core )

The CELL CPU has 1 scalar + 8 vector cores

Roadrunner ( currently the fastest supercomp ) uses hybrid nodes ( 1 AMD Opteron coupled to 1 CELL )

wanted to keep it brief ... pfff look at all that text again :lolflag:

Jan

meatpan
January 17th, 2009, 05:26 PM
The scope of EasyUbuntuClustering is very wide, which might be why the discussions on this thread have been so broad.

Is it realistic for an 'easy' clustering solution to cater to inexperienced developers, offer portable environments via VM's, provide automated load-balancing, and still maintain high QoS?

These are great goals to shoot for, but IMO you will likely have more success if you just pick one or two of the major features.

I can't comment much on the process migration (automated load balancing), since it's been more than 4 years since I've used OpenMosix. These features typically require savvy system admins, and configuration problems can be difficult to diagnose.

Automated deployment of VMs on a cluster or grid (not going to get into a definition war here. I'm referring to a grid as a federated assembly of heterogeneous clusters) is a hot topic that industry giants such as VMware (EMC), IBM, Microsoft, and Amazon are investigating. There are numerous technical subtleties relating to this topic, and a logical solution might be to create a specific flavor of Debian or Ubuntu server that is adapted to hosting the popular VMs. This is especially relevant considering the CISC vendors have product roadmap plans that address common VM bottlenecks.

I admire your efforts to approach this complicated technical issue, and my only suggestion is to consider narrowing your scope a bit.

tinkertim
January 18th, 2009, 11:21 PM
Hello, djamu.

Not sure why you think it's a bad idea to use Ubuntu?

The Kerrighed kernel is based on the 'vanilla' kernel sources, so it doesn't make much difference if you compile it under Debian or Ubuntu.


It's not just a question of building, it's how the kernel is built (i.e. using the Debian / Ubuntu patching and build system). The current Kerrighed 2.6.20 kernel (the way it's patched and built) would make this problematic.

However, the jump to 2.6.29 would make it much easier.


I agree that the openSSI project seems inactive (but don't dismiss it, because it has a lot of good points too). That leaves Kerrighed, which is still an active EU FP6 (Framework 6) funded project until 2010.


I remain subscribed to the OpenSSI lists. Once in a while you'll see a flurry of activity, then radio silence for a few months. I'm still a big fan of the project, I think it will continue to advance for years to come. It's one of those projects where you need a solid block of time to actually do anything useful, so I'd imagine the devs devote what they can to it.


Why does venture capital mean something has to be proprietary?
Canonical have invested a lot in Ubuntu...


For a long time many people used the word 'commercial' to refer to non-free software which now causes a lot of confusion and mistrust.

Kerrighed has a very good, very honest and very transparent business model. Use / share / modify the software any way you like .. if you get stuck, paid support and development is available.

You will always have those who are a little over-sensitive to this kind of arrangement. For instance, someone might ask for help to implement thread migration and get a reply that they'll be better off hiring Kerlabs (which they would be). Then, the person will start screaming 'baitware' or worse.

I don't know if you'd be able to find a MOTU that would be interested in keeping up with Kerrighed, but surely nothing is preventing someone from making it available via PPA.

crazy___cow
January 19th, 2009, 08:53 AM
In the next month I'll try to set up a small cluster (using Kerrighed or MOSIX) for research purposes (http://afni.nimh.nih.gov/afni). I have two Dell PowerEdge 2900 monsters (2 CPUs x 4 cores) and several "old" dual-CPU Xeon 3 GHz machines. The first idea is to use Ubuntu Hardy with a custom kernel (2.6.20 for Kerrighed, or 2.6.28 for MOSIX, which is free for researchers). I think it will be a tricky job patching the kernel (with 64-bit support and all the modules to load on the PE2900s and the other PCs)... but I'll try to do it, because Ubuntu/Debian is the easiest and most powerful distro to use, configure and administer.

djamu
January 19th, 2009, 02:07 PM
Rereading previous posts, it seems I overlooked this argument, since I thought we were talking about headless compute nodes.

....
but Sarge and Etch were crude on the Desktop compared to Ubuntu 6.06...
....

So just out of curiosity.
Tony are you really implying an X server / client in the SSI nodes ? Aside from the unnecessary cost this would cause major problems as all GPU's should be of the same brand. Considering you'd only install the client X libraries on the nodes and use a remote X > XDMCP ( the machine which renders the image on screen / has display is the X server ) then still it's nearly impossible since kerrighed doesn't support LVS ( common cluster IP address ) that the X client could use to communicate with a standalone X server > could be handy for a multiseat system :).
I have some experience with OpenGL clusters, and that is a completely different topic in all aspects ( would be nice though ).....
In any case, the easiest way to implement a GUI is a web-based one tied to a job scheduler.

.
...
However, the jump to 2.6.29 would make it much easier.

in all aspects ... kerrighed being part of the kernel instead of a module, changes things drastically....

.
For a long time many people used the word 'commercial' to refer to non-free software which now causes a lot of confusion and mistrust.

Kerrighed has a very good, very honest and very transparent business model. Use / share / modify the software any way you like .. if you get stuck, paid support and development is available.

You will always have those who are a little over-sensitive to this kind of arrangement. For instance, someone might ask for help to implement thread migration and get a reply that they'll be better off hiring Kerlabs (which they would be). Then, the person will start screaming 'baitware' or worse.

I feel I need to clarify my statement on this.
The amount of money involved to develop thread migration ( the estimate I got from Kerlabs ) is considerable ( for a person ) and enough to buy myself a nice house.....
Knowing that it's very hard to turn OSS into something commercial / proprietary, one could only try to commercialize a web-based front-end / scheduler....


I don't know if you'd be able to find a MOTU that would be interested in keeping up with Kerrighed, but surely nothing is preventing someone from making it available via PPA.
I doubt an ubuntu MOTU will be interested/has time so yes PPA is a valid option...

Jan

ajt
January 20th, 2009, 07:31 PM
This is wonderful information- thanks for posting it. I've been interested for a while now in constructing a small cluster (8-16 nodes) for running ecological models, but both my hardware abilities and programming need a bit more development. This will be great stuff to read through!

Hello, kvk.

I'm a biologist, turned bioinformatician, and there are lots of people just like me and you building our own DIY Beowulf clusters :-)

The real essence of this thread is to try and make it as 'easy' as possible for Ubuntu users to build this type of Beowulf cluster. I'm sure you already know that it can be quite a steep learning curve if you've never done anything like this before. However, there is nothing magic or particularly difficult about putting the hardware together. To give you a flavour of this approach, there is a chemical modeling cluster called 'COBALT' (Computers On Benches All Linked Together):

http://www.cobalt.chem.ucalgary.ca/

We do some of this type of quantum chemistry modeling under Ubuntu ;-)

The Kerrighed links I posted recently give examples of setting up DHCP and PXE booting of cluster nodes. I'm using this to PXE boot eight of our compute nodes to experiment with Kerrighed. Anyway, thanks for joining the discussion.

Bye,

Tony.

ajt
January 20th, 2009, 07:42 PM
Here's an amazing article on how to turn Xboxes into a Beowulf cluster. Looks inherently simpler than I thought.

http://www.anandtech.com/linux/showdoc.aspx?i=2271&p=8

Hello, altonbr.

Connecting the boxes together is easy - Getting them to do something useful is the tricky bit :-)

Actually, I'm quite interested in clustering PS3's...

Not this one:

http://www.engadget.com/2007/08/11/sony-erects-massive-ps3-server-cluster-for-warhawk-mayhem/

More like this:

http://gravity.phy.umassd.edu/ps3.html

Bye,

Tony.

ajt
January 20th, 2009, 08:14 PM
[...]
But let's not get into any kind of flaming, IMO I really think ubuntu is a great desktop distro..., just my personal opinion..


Hello, Jan.

Good to know you like Ubuntu ;-)

[...]
isn't PMI a rebranded OpenMOSIX ?


It's a continuation of the project to port openMosix to the 2.6 kernel.

[...]
True, but since OpenSSI lacks the funding of the XtreemOS consortium, chances are slim that it will emerge again...
http://www.xtreemos.eu/overview/plonearticlemultipage.2006-06-08.9297943452/partners
[...]


In my (now obsolete) 'biobuntu' blue-print, you will see that I'm also interested in XtreemOS as well as Kerrighed:

https://blueprints.launchpad.net/ubuntu/+spec/biobuntu

[...]
[...]
But I'm a little afraid the title of the thread is not well chosen. I'll elaborate: quite a lot of people reading this will have a hard time grasping the type of cluster we're discussing. To me it's obvious we're talking about (headless) HPC clustering...
In other words, this thread is prone to pollution as soon as people start suggesting, for example, RHEL or SAN failover mechanisms as an alternative...


I disagree - I've got more than 26 years experience of 'parallel' computing but I obtained most of my computing (and image analysis) experience working as a biologist. I now work as a bioinformatician.

There are lots of people like me working in bioinformatics: A common thread in discussions like this is that it is MUCH more difficult to grasp the scientific questions being asked than it is to build a Beowulf. The technology is a means to an end. The draconian 'lock-down' policies of some highly centralised IT departments lack the flexibility to allow R&D approaches to be used by scientists who often know more about the technology than people trained in provision of IT for the 'enterprise'.

BTW, that's not a criticism of people providing IT for the 'enterprise'. It's an observation, born of my own experience, that IT for science and IT provided for the 'enterprise' have completely different objectives. My intention in starting a thread in this forum is to try to draw together opinions and help from other people with similar objectives to my own. I agree that it's important to keep the topic clear: It is how we make it 'easy' to cluster Ubuntu systems in order to use the aggregate resources of many computers to solve one problem. This is the essence of Beowulf:

http://www.beowulf.org/

Bye,

Tony.

ajt
January 20th, 2009, 08:23 PM
I would simply be interested in using 3 old computers running as a cluster to slightly speed things up and add some elbow room.


Hello, kloplop321.

I think that using a few old computers to learn about Beowulf clusters is fine, but at the end of the day they are still old computers and there is an overhead involved in communication between the compute nodes in a cluster. The first rule of 'parallel' computing is to determine what can be done independently by the nodes: That's where you gain an advantage.
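As a rough illustration of that rule, here is a minimal bash sketch. The `process_chunk` function is a hypothetical stand-in for one unit of real work; the only point is that each call depends on nothing but its own argument, so the calls can be started side by side and collected afterwards:

```shell
#!/bin/bash
# Sketch only: process_chunk is a hypothetical placeholder for real work.
# Each call is independent of the others, so they can run at the same time.
process_chunk() {
    local n=$1
    echo "chunk $n: $(( n * n ))"
}

run_all() {
    local n
    for n in "$@"; do
        process_chunk "$n" &    # start each independent job in the background
    done
    wait                        # block until every background job has finished
}

# The jobs may finish in any order, so sort the combined output.
run_all 1 2 3 4 | sort
```

On a single machine these background jobs simply share the local CPUs. Work that cannot be split up like this is where the communication overhead between nodes starts to hurt.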


I have tried some live CDs, but my old laptop's LAN card isn't supported (odd), and I would need to run a fan control program on one of the computers because its fan control is not automatic from the BIOS (I find that stupid).
The last one is a Pentium MMX machine.

Maybe this would be a good time to learn how to compile the Linux kernel?

It's not that difficult the second time you do it, but the first time takes a bit longer ;-)

Check the Kerrighed links I posted earlier in this thread for more info.

Thanks for joining in.

Bye,

Tony.

ajt
January 20th, 2009, 08:26 PM
It's already in there ;)

Hello, altonbr.

Why not add things like this to the Wiki:

https://wiki.ubuntu.com/EasyUbuntuClustering

Bye,

Tony.

ajt
January 20th, 2009, 08:58 PM
[...]

scalar > fast for single instruction single out
http://en.wikipedia.org/wiki/Scalar_processor
http://en.wikipedia.org/wiki/SISD

vector > fast for single instruction multiple out
http://en.wikipedia.org/wiki/Vector_processor
http://en.wikipedia.org/wiki/SIMD

more: http://en.wikipedia.org/wiki/MIMD


Hello, Jan.

The BIG difference is using SIMD/MIMD instead of Von Neumann Architecture:

http://en.wikipedia.org/wiki/Von_Neumann_architecture

SIMD (Single Instruction Multiple Data) is fast, but extremely limited. I used a 9216 SIMD processor array for image analysis some time ago (CLIP4R). There is a brief summary of this type of processor here if anyone is interested:

http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/GASTERATOS/iii.htm

However, we concluded that the advantages of using 'exotic' processor architectures like this do not outweigh the difficulty of programming them, or the 'corner-turning' problem of actually getting images into the SIMD processor array. Our CLIP4R was a very interesting machine, but its CPU boards ended up being made into wall clocks. Sad, but well worth the lessons learned. The host controller for the CLIP4R was a pdp11/34, so you can tell how long ago this was (1986). I replaced the pdp11 with a Vax 750, which had less than 1 MIP integer performance (a MIP is defined as an integer performance of one million VAX 11/780 instructions per second).

Enough nostalgia for now - This has little to do with Ubuntu clustering, but might rectify some misunderstandings about what 'parallel' computing involves. What we are talking about in the Easy Ubuntu Clustering blue-print, and this thread, is using a collection of Von Neumann architecture processors in different computers as if they were one large SMP (Symmetric Multi-Processing) computer. This is the SSI (Single System Image) approach to parallel computing used by openMosix, openSSI and Kerrighed. It's also the approach taken by Donald Becker, who invented the Beowulf cluster at NASA and commercialised it:

http://www.scyld.com/home.html

[Scyld Scefing is the legendary Danish king whose story opens the Beowulf poem]

http://en.wikipedia.org/wiki/Beowulf

Bye,

Tony.

ajt
January 20th, 2009, 09:14 PM
The scope of EasyUbuntuClustering is very wide, which might be why the discussions on this thread have been so broad.

Is it realistic for an 'easy' clustering solution to cater to inexperienced developers, offer portable environments via VM's, provide automated load-balancing, and still maintain high QoS?

These are great goals to shoot for, but IMO you will likely have more success if you just pick one or two of the major features.


Hello, meatpan.

I found the previous discussions dispersed over several different forums difficult to follow, and repeating many of the same questions. Yes, the discussion so far has been quite broad but, as we realise what people are actually interested in, I think the discussion will become more focussed.


I can't comment much on the process migration (automated load balancing), since it's been more than 4 years since I've used OpenMosix. These features typically require savvy system admins, and configuration problems can be difficult to diagnose.


Not really - I run a 90-node openMosix system and I'm a biologist ;-)


Automated deployment of VM's on a cluster or grid (not going to get into a definition war here. I'm referring to a grid as a federated assembly of heterogeneous clusters) is a hot topic that industry giants such as VMware (EMC), IBM, Microsoft, and Amazon are investigating. There are numerous technical subtleties relating this topic, and a logical solution might be to create a specific flavor of debian or ubuntu server that is adapted to hosting the popular VM's. This is especially relevant considering the CISC vendors have product roadmap plans that address common VM bottlenecks.


Hmm... 'cloudy' today?

The BIG problem with VM's is that they are designed to be used for HAC (High Availability Computing) not HPC (High Performance Computing). It's not HPC to virtualise instances of e.g. Ubuntu then merge them again into a 'virtual' SSI. It's a way of making your IT provisioning resilient and platform-independent. Nothing wrong with that, if you're an ISP...


I admire your efforts to approach this complicated technical issue, and my only suggestion is to consider narrowing your scope a bit.

Thanks - The technical aspects are really not that difficult, but the advantage of packaging up Kerrighed for Ubuntu would make it a lot easier!

Bye,

Tony.

ajt
January 20th, 2009, 09:25 PM
It's not just a question of building, it's how the kernel is built (i.e. using the Debian / Ubuntu patching and build system). The current Kerrighed 2.6.20 kernel (the way it's patched and built) would make this problematic.

However, the jump to 2.6.29 would make it much easier.


Hello, tinkertim.

I'm a bit confused by what you say about patching: The 'patch' program is the same under Debian and Ubuntu, and the Kerrighed patches are relative to the 'vanilla' kernel sources for 2.6.20. The kernel Makefile includes a 'deb-pkg' target, which has nothing to do with the Debian or Ubuntu patches to their kernels. I built and installed Kerrighed kernel debs under Ubuntu 6.06 using only the 2.6.20 sources and kerrighed patches.
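For anyone who has not done this before, the sequence I mean can be sketched roughly as follows. The directory and patch file names are placeholders (assumptions for illustration, not the real Kerrighed file names), the `-p1` strip level is also an assumption, and the script only prints the commands unless DRY_RUN is set to 0:

```shell
#!/bin/bash
# Sketch only: directory and patch names below are placeholders.
# With DRY_RUN=1 (the default here) commands are printed, not executed.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

build_kernel_deb() {
    local src=$1 patch=$2
    run cd "$src" || return 1
    run patch -p1 -i "$patch"   # apply the patch to the vanilla kernel tree
    run make oldconfig          # bring the kernel configuration up to date
    run make deb-pkg            # kernel Makefile target that builds .deb packages
}

build_kernel_deb linux-2.6.20 ../kerrighed.patch
```

Nothing here touches the Debian or Ubuntu kernel packaging: it is only 'patch' plus targets that ship in the kernel's own Makefile.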


I remain subscribed to the OpenSSI lists. Once in a while you'll see a flurry of activity, then radio silence for a few months. I'm still a big fan of the project, I think it will continue to advance for years to come. Its one of those projects where you need a solid block of time to actually do anything useful, so I'd imagine the devs devote what they can to it.


I've followed openSSI developments for years, and I tried quite hard to get openSSI running, to use as a replacement for openMosix, but gave up in the end.


[...]
I don't know if you'd be able to find a MOTU that would be interested in keeping up with Kerrighed, but surely nothing is preventing someone from making it available via PPA.

OK, I'll take that as you volunteering to help me do it :-)

Bye,

Tony.

ajt
January 20th, 2009, 09:31 PM
In the next month I'll try to set up a small cluster (using kerrighed or mosix) for research purposes (http://afni.nimh.nih.gov/afni). I have two dell poweredge 2900 monsters (2 CPUs x 4 cores) and several "old" dual-CPU Xeon 3GHz machines. The first idea is to use ubuntu hardy with a custom kernel (2.6.20 for kerrighed or 2.6.28 for mosix (it's free for researchers)). I think that it will be a tricky job patching the kernel (with 64 bit support and all the modules to load onto the pe2900s and other PCs)... but I'll try to do it, because ubuntu/debian is the easiest and most powerful distro to use, configure and administer.

Hello, crazy___cow.

Be very careful how you interpret "free for researchers" - I looked into this too, and the deal is you are given one 'free' snapshot of MOSIX2 kernel patches plus binaries of the MOSIX2 utilities. The snapshot is the current release, which is good, but you have to pay $1,000 per year to get updates and you don't get any sources other than the kernel patches. It depends what your objectives are. MOSIX2 looks very good, but it's not FLOSS. That might not matter to everyone, but it matters to me.

Bye,

Tony.

ajt
January 20th, 2009, 09:55 PM
[...]
Tony are you really implying an X server / client in the SSI nodes ? Aside from the unnecessary cost this would cause major problems as all GPU's should be of the same brand. Considering you'd only install the client X libraries on the nodes and use a remote X > XDMCP ( the machine which renders the image on screen / has display is the X server ) then still it's nearly impossible since kerrighed doesn't support LVS ( common cluster IP address ) that the X client could use to communicate with a standalone X server > could be handy for a multiseat system :).


Hello, Jan.

The Xserver runs on the client PC, where it displays the output of the Xclient running on the SSI. My present openMosix system has two servers and 88 compute nodes. One server is the 'head' node of the cluster, providing PXE + DHCP + NFSROOT and a web server:

http://bioinformatics.rri.sari.ac.uk

The other server is for interactive logins via SSH/NX/VNC and is where the user's home directories are stored. They are exported via NFS to the nodes. Only the servers are allowed to migrate processes away to other nodes. The Beowulf becomes unstable if all nodes are allowed to migrate processes. The compute nodes accept migrated processes from the servers, but do NOT migrate their own local processes. This allows us to run MPI and SGE as local processes without them being migrated off nodes...


I have some experience with OpenGL clusters, and that is a completely different topic in all aspects ( would be nice though ).....
In any case, the easiest way to implement a GUI is a web-based one tied to a job scheduler.


I've been looking at eyeOS:

http://eyeos.org/

I've installed it, but not done much with it yet.

Most of the time, we just log in using NX. If we want to run 'batch' jobs, we use SGE "qmon", or command-line tools to submit jobs. The webapps we run provide a web-based interface via the Beowulf web server using the Apache JK connector. Some, but not all, of the work accepted by these webapps is migrated automatically to compute nodes by openMosix. Any jobs using shared memory, in particular, are not migrated. This is due to the limitations of openMosix, as you know. Interestingly, I did experiment with the openMosix 'MIGSHM' patch: My processes migrated away fine, but they didn't come back!

Bye,

Tony.

zkovax
January 23rd, 2009, 04:09 AM
Hi,

How about porting Perceus to ubuntu/debian?

http://www.linux-mag.com/id/6386

To be frank, I am a cluster newbie. I have just built my own cluster to learn the technology, and after a lot of reading and digging I have started to use Perceus, more precisely Caoslinux, for the same reason you started this wiki: it makes cluster management easy, and it really does the job. On the other hand, I am an Ubuntu fan and I am familiar with it, so I would love to do the same with Ubuntu. Unfortunately I am not a linux/ubuntu admin guru, otherwise I would have started the porting already.

Probably you do not know, that Perceus is the provisioning part of Warewulf, Caoslinux is the distribution to manage them, and the whole lot is maintained by Infiscale.

http://www.perceus.org/portal/
http://www.caoslinux.org/index.html
http://www.infiscale.com/

Cheers,

Zsolt

masulli
January 23rd, 2009, 05:14 PM
Hi all!!

Can anyone post a simple and exhaustive guide on how to install a cluster of PCs using Kerrighed 2.3.0 and Ubuntu 8.10 or 8.04?

Thanks

ajt
January 25th, 2009, 02:18 PM
Hi all!!

Can anyone post a simple and exhaustive guide on how to install a cluster of PCs using Kerrighed 2.3.0 and Ubuntu 8.10 or 8.04?

Thanks

Hello, masulli.

Not yet, but I hope we can create a guide:

https://blueprints.launchpad.net/ubuntu/+spec/easy-ubuntu-clustering

Bye,

Tony.

ajt
January 25th, 2009, 04:18 PM
Hi,

How about porting Perceus to ubuntu/debian?

http://www.linux-mag.com/id/6386

[...]
Probably you do not know, that Perceus is the provisioning part of Warewulf, Caoslinux is the distribution to manage them, and the whole lot is maintained by Infiscale.

http://www.perceus.org/portal/
http://www.caoslinux.org/index.html
http://www.infiscale.com/


Hello, Zsolt.

Actually, I do know about Warewulf, but 'cAos' is an rpm-based distro and I want to use a deb-based distro - preferably Ubuntu ;-)

There are similar systems like DRBL:

http://drbl.sourceforge.net/

These systems are very useful for 'jumpstarting' or administering Beowulf clusters - DRBL has been used to deploy Kerrighed:

http://trac.nchc.org.tw/grid/wiki/krg_DRBL

However, systems like this do not address the issue of how to actually use a cluster for HPC (High Performance Computing). A Beowulf is not just a COW (Collection Of Workstations), it's more than that...

What is interesting about openMosix, openSSI and Kerrighed is that they do kernel-level load balancing in a similar way that an SMP (Symmetric Multi-Processing) computer does, to make use of multiple CPU's. There is a HUGE difference in cost between e.g. building a Beowulf cluster from 16 ordinary COTS (Commodity Off The Shelf) PC's and buying in a commercial 16p SMP computer system, which is why this is a popular approach.

The BIG difference is that, without non-COTS cluster node interconnects, node latency can be extremely large in comparison to local SMP processor interconnects. Kerrighed uses an optimised 'kernel' TCP/IP communication protocol to minimise the latency in communication between Linux kernels running on processors in different Kerrighed nodes. That's one reason why I think Kerrighed is well worth looking at.

Bye,

Tony.

zander1013
January 26th, 2009, 01:13 AM
hi,

i have tried to cover all the posts here (at least one time) so far. there has been a lot of lofty discussion of architecture and so forth that drifts from (and imo confuses) the issue.

i have only read joel adams paper on how he configures his microwulf using ubuntu one time. but it appears that he has got what the original thread post requested.

except for the fact that he gives the instructions where the assumption is that the slave nodes will be diskless the description of how to configure a cluster using ubuntu where the nodes are connected by a cots switch is as easy as it will get without some kind of cluster installer like that for the .iso distro of ubuntu.


once again that link is...
http://www.calvin.edu/~adams/research/microwulf/
...that's it folks!

that is the answer to this thread.


i intend to use it as a guide to try and establish my first cluster of two identical ibm t60 laptops run by ubuntu by simply connecting them with a cat-5 patch cord (no switch) and configuring them using his configuration instructions as a guide. i don't think there will be any problem doing this (without the switch) to get started.

perhaps the best way for those interested in actually building a cluster is to follow my lead with whatever hardware they have available. that is to say just use two boxes that run ubuntu, connect them with a cat-5 patch cord and use joel's configuration instructions as a guide.

once you have two machines that are an actual cluster just scale up.


peace!

-zander

antony_css
January 27th, 2009, 08:01 AM
What I am curious about is how the computers are connected... Certainly it won't work by just connecting two computers with a LAN cable, so it seems that a router or something more advanced is needed...

All I have is a modem which is provided by my ISP. I guess it is something that you connect it with a LAN cable and a telephone line so that you can access the web.

This is what I have done a month before:
==================================================
Hardware:
1. a win95 box with 133MHz CPU & 32Mb RAM
2. an xp box with 2GHz CPU,256Mb RAM
3. an ADSL modem with 4 slots for LANs
4. two LAN cables

Procedures:
I connected the two boxes to the modem with the LAN cables. Then, I tried to boot both boxes using liveCDs like CHAOS (http://midnightcode.org/projects/chaos/) and ClusterKnoppix (http://bofh.be/clusterknoppix/). After that, I tried some floppies like GoMF (http://gomf.sourceforge.net/index.html) and openMosixLOAF (http://openmosixloaf.sourceforge.net/). Finally, I tried to start the following shell script in the background:
# CPU-bound test load (bash syntax): keeps adding integers, effectively forever
s=0
for (( i=1; i>0; i=i+1 ))
do
    s=$((s+i))
done



Results:
liveCDs: cannot boot on the old box. They boot successfully on the new box, but it cannot configure itself as a node - something called a "DHCP server" was missing.

floppies:
a. GoMF can boot on both boxes, but the same problem occurred - the "DHCP server" was absent.
b. openMosixLOAF can boot on both boxes and successfully configures them as nodes after manually entering private IPs and the gateway. CPU load and RAM utilization can be viewed in real time from either box. The shell script cannot be migrated because the floppy does not ship with commands like "mosrun".
==================================================

To sum up, any attempt to construct openmosix-cluster using a modem failed (without compiling, of course). Could anyone tell me what is a "DHCP server"?

It seems that the so called "DHCP" is related to auto-detection & auto-configuration, and cannot be turned off. Is it possible to solve this without having to buy new hardware(s)?

antony_css
January 27th, 2009, 08:16 AM
Well, after surfing Wikipedia, I found that my ADSL modem is actually a router... :P
But that changes nothing... my ISP provided me this "modem" as a part of the internet service, but they didn't provide any further support on "advanced" usage, not even a manual...

ajt
January 27th, 2009, 06:25 PM
hi,

i have tried to cover all the posts here (at least one time) so far. there has been a lot of lofty discussion of architecture and so forth that drifts from (and imo confuses) the issue.


Hello, zander.

I think it's important to place what we are discussing in context: In fact, SIMD instructions are now used in the FPU's of modern processors and I think it's quite useful to have a little background information :-)


i have only read joel adams paper on how he configures his microwulf using ubuntu one time. but it appears that he has got what the original thread post requested.

except for the fact that he gives the instructions where the assumption is that the slave nodes will be diskless the description of how to configure a cluster using ubuntu where the nodes are connected by a cots switch is as easy as it will get without some kind of cluster installer like that for the .iso distro of ubuntu.


I don't think that building a Microwulf could be described as 'easy', and what I've read about Microwulf focusses more on the hardware aspects of how to construct a DIY Beowulf. Segregating 'system' and 'application' network traffic is a good idea, but it's not an original one:

http://bioinformatics.rri.sari.ac.uk/bobcat/


once again that link is...
http://www.calvin.edu/~adams/research/microwulf/
...that's it folks!

that is the answer to this thread.


I think the Microwulf project is great, in terms of its hardware, but it doesn't bring anything new to the software requirements of building Ubuntu clusters. In relation to 'diskless' booting, for example, there has been an Ubuntu 'diskless' wiki for a long time:

https://help.ubuntu.com/community/DisklessUbuntuHowto

The type of Beowulf cluster installation/management packages that are discussed by the Microwulf project are well known. What I suggested is that we try to package Kerrighed for Ubuntu, because Kerrighed is a state of the art SSI (Single System Image) Linux kernel load-balancing system that has already been packaged for Mandriva Linux. I think it would be useful to package it for Ubuntu too.


i intend to use it as a guide to try and establish my first cluster of two identical ibm t60 laptops run by ubuntu by simply connecting them with a cat-5 patch cord (no switch) and configuring them using his configuration instructions as a guide. i don't think there will be any problem doing this (without the switch) to get started.


Make sure you use a crossover cable if you do (or check that your laptop has autosensing NIC's). It's quite common for ISP's to provide a combined ADSL modem + router + WiFi these days, but a passive ethernet hub would be another solution for a low-cost experimental setup.


perhaps the best way for those interested in actually building a cluster is to follow my lead with whatever hardware they have available. that is to say just use two boxes that run ubuntu, connect them with a cat-5 patch cord and use joel's configuration instructions as a guide.

once you have two machines that are an actual cluster just scale up.


As Bart says: "Don't have a COW, man...", but that is exactly what you will create if you just connect together a Collection Of Workstations. There's a lot more to building a cluster than simply connecting computers together. However, connecting them up *is* quite important ;-)


peace!


A friend of mine saw graffiti in Florence recently that read:

Make tea, not war

I like that :-)

Bye,

Tony.

ajt
January 27th, 2009, 07:58 PM
[...]
To sum up, any attempt to construct openmosix-cluster using a modem failed (without compiling, of course). Could anyone tell me what is a "DHCP server"?

It seems that the so called "DHCP" is related to auto-detection & auto-configuration, and cannot be turned off. Is it possible to solve this without having to buy new hardware(s)?

Hello, antony_css.

A DHCP (Dynamic Host Configuration Protocol) server listens for computers on the network broadcasting DHCPDISCOVER requests and supplies them with a dynamic network configuration. This normally consists of an IP address, netmask and default router address. Often, the DHCP server also supplies DNS name server addresses and DNS domains to search. In addition, modern DHCP servers support network booting and supply information about a TFTP (Trivial File Transfer Protocol) server from which a boot image can be downloaded.

Although this might sound complicated, many routers run DHCP servers so all you have to do is connect your devices to the router, and find out what their IP addresses are by running:

ifconfig -a

This is necessary if you want to connect your computers together, because they need to know each other's IP address. This is a common private IP network:

192.168.0.254 router
192.168.0.1 node1
192.168.0.2 node2
...

For a small network, you can put this information in the /etc/hosts file.
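As a rough sketch, assuming the same 192.168.0.x layout as the example above and a hypothetical helper called `make_hosts`, the lines for /etc/hosts could be generated like this rather than typed by hand:

```shell
#!/bin/bash
# Sketch only: prints /etc/hosts-style lines for a small private network.
# The router on .254 and the nodeN names follow the example layout; the
# number of nodes is whatever you pass as the second argument.
make_hosts() {
    local prefix=$1 count=$2 i
    printf '%s.254\trouter\n' "$prefix"
    for (( i = 1; i <= count; i++ )); do
        printf '%s.%d\tnode%d\n' "$prefix" "$i" "$i"
    done
}

make_hosts 192.168.0 2
```

Check the output against the addresses your router actually handed out before adding it to /etc/hosts on each machine. DHCP leases can change, so this only stays correct if the addresses are fixed.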

Bye,

Tony.

punong_bisyonaryo
January 28th, 2009, 03:21 AM
Hi All!

I'm researching for a paper for our company (sadly, the paper is proprietary) on how to use existing open source solutions for cluster computing and balancing the loads between computers that merely do spreadsheets or emails with computers that do intensive compiles/builds. Looks like I found the right source of info!:D

I'll try to catch up on all the info you guys have written. Interesting stuff.

ajt
January 28th, 2009, 07:18 PM
Hi All!

I'm researching for a paper for our company (sadly, the paper is proprietary) on how to use existing open source solutions for cluster computing and balancing the loads between computers that merely do spreadsheets or emails with computers that do intensive compiles/builds. Looks like I found the right source of info!:D

I'll try to catch up on all the info you guys have written. Interesting stuff.

Hello, punong_bisyonaryo.

It's a popular wish to harness all the 'idle' CPU cycles on under-used office desktops, but the people using them will not be very happy when your intensive compile/build makes their web browser go catatonic!

What works best under these circumstances is 'volunteer' computing:

http://www.volunteerathome.com/

The point about volunteer computing is that the donor computer decides when to volunteer its resources (e.g. when a screen saver starts up).

In contrast, MOSIX/openMosix was actually designed to work the way that you want to by using the idle CPU's on workstations but the computers all have to run Linux. In practice, though, it's difficult to avoid someone switching a computer off, or rebooting it in the middle of your job that has been migrated onto it. When that happens your job is lost...

Clustering systems are only as reliable as the nodes they contain: If you don't control the nodes, then you don't control the cluster. For that reason, most people build a private cluster LAN with dedicated compute nodes connected to it. Even if you do control your company's desktops, you will still have problems running cluster software on the public (company) LAN, because you will be using up a lot of the available network bandwidth migrating programs and data to and from 'idle' desktops.

I think there are circumstances where it might work if e.g. you take over the company LAN and reboot Linux on the desktop PC's during out-of-hours and run your compile farm then. This is common practice in University PC teaching labs that run Micro$oft Windows by day, but Linux HPC software by night :-)

Good luck + please do post your conclusions here!

Bye,

Tony.

antony_css
January 29th, 2009, 05:44 AM
Well, somehow i found that dhcp can be set up as a daemon...
A floppy distribution "eucaristOS (http://eucaristos.sourceforge.net/)" actually uses this technique to set up a cluster on old computers...
However it does not support my network card... so sad...

altonbr
January 29th, 2009, 01:40 PM
Maybe we can take this information and start collecting it under https://help.ubuntu.com/community/Clustering ???

ajt
February 1st, 2009, 02:01 PM
Well, somehow i found that dhcp can be set up as a daemon...
A floppy distribution "eucaristOS (http://eucaristos.sourceforge.net/)" actually uses this technique to set up a cluster on old computers...
However it does not support my network card... so sad...

Hello, antony_css.

Beware, eucaristOS is based on an even older version of openMosix than I'm using!

You could compile in support for your network card if you download the openMosix kernel sources. However, I think you would be better off trying out the Kerrighed 'live' CD, which also runs DHCP for you:

http://www.kerlabs.com/dl/kerrighed-live.iso

Bye,

Tony.

ajt
February 1st, 2009, 08:23 PM
Maybe we can take this information and start collecting it under https://help.ubuntu.com/community/Clustering ???

Hello, altonbr.

OK, I've created a 'Clustering (https://help.ubuntu.com/community/Clustering)' page...

Bye,

Tony.

swisspipe
February 2nd, 2009, 11:56 AM
Hi all!
I am trying to create a little home cluster with Ubuntu 8.04 and Kerrighed, but I am facing some problems.
I ran the build.sh script located on the wiki page, but at some point it runs the command vi Makefile and stops, waiting for something. What do I have to do at this point? The comment says "# edit Extraversion"; what do I need to change/edit?
Thanks and cheers ;)

ajt
February 2nd, 2009, 07:54 PM
Hi all!
I am trying to create a little home cluster with Ubuntu 8.04 and Kerrighed, but I am facing some problems.
I ran the build.sh script located on the wiki page, but at some point it runs the command vi Makefile and stops, waiting for something. What do I have to do at this point? The comment says "# edit Extraversion"; what do I need to change/edit?
Thanks and cheers ;)

Hello, swisspipe.

That build.sh describes the commands I used to compile the Kerrighed kernel under Ubuntu 8.04. You have to edit the 'Extraversion' in the Makefile if you are building a 'deb' because the kernel name must obey the Debian package name conventions. The build failed with the default Kerrighed 'Extraversion', so I tried changing it to:

EXTRAVERSION = -krg-2.3.0

That resulted in a valid Debian package name :-)
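If you would rather not edit the Makefile by hand in vi, the same change can be sketched with sed. This is only an illustration: `set_extraversion` is a hypothetical helper, and it assumes the Makefile contains a line beginning with "EXTRAVERSION =", which is how the kernel's top-level Makefile normally looks:

```shell
#!/bin/bash
# Sketch only: sets EXTRAVERSION in a kernel Makefile without opening an editor.
# Assumes the Makefile already has a line starting with "EXTRAVERSION =".
set_extraversion() {
    local makefile=$1 value=$2
    # Replace the whole EXTRAVERSION line; the original is kept as <file>.bak.
    sed -i.bak "s/^EXTRAVERSION *=.*/EXTRAVERSION = ${value}/" "$makefile"
    grep '^EXTRAVERSION' "$makefile"    # show the resulting line
}

# Demonstration on a throwaway file rather than a real kernel tree.
demo=$(mktemp)
printf 'VERSION = 2\nPATCHLEVEL = 6\nSUBLEVEL = 20\nEXTRAVERSION =\n' > "$demo"
set_extraversion "$demo" "-krg-2.3.0"
rm -f "$demo" "$demo.bak"
```

In the kernel source directory the call would be `set_extraversion Makefile -krg-2.3.0`, which replaces the interactive "vi Makefile" step.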

The commands in "build.sh" only compile and package the kernel. I've not packaged the Kerrighed user-land tools yet.

Bye,

Tony.

antony_css
February 17th, 2009, 08:49 AM
I really wish to have a workable eucaristOS disk for my computers...
Is it possible to merge the latest kernel in ajt's deb to the eucaristOS source?

Looking into the source directory of eucaristOS, I didn't find any related folders like "/boot" for me to replace the kernel... Isn't that because the disk does not use grub as bootloader?

BTW, I keep making stupid mistakes even when I tried to make the floppy image from the source. It crashes due to an unknown error...like the one below:

$ sudo make all
Building floppy image.../bin/sh: Syntax error: Bad fd number
make: *** [floppy] Error 2

What should I do?

antony_css
February 17th, 2009, 09:50 AM
Haha! this is the clue!
http://ubuntuforums.org/showthread.php?t=382548
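
For anyone hitting the same error: a likely cause (my assumption, I have not checked the eucaristOS Makefile) is that /bin/sh on recent Ubuntu releases is dash rather than bash, and dash reports "Bad fd number" for some bash-only redirections. A quick sketch to check what /bin/sh really is:

```shell
#!/bin/bash
# Sketch only: report which shell /bin/sh actually resolves to.
sh_target() {
    readlink -f /bin/sh
}

echo "/bin/sh resolves to: $(sh_target)"

# If it turns out to be dash, one workaround to try is telling make to run
# its recipes with bash instead:
#   make SHELL=/bin/bash all
```

Passing SHELL on the make command line overrides the shell make uses for its recipes, so nothing on the system has to be changed.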

kgkv
February 17th, 2009, 02:15 PM
Hi,

How about porting Perceus to ubuntu/debian?

http://www.linux-mag.com/id/6386

To be frank, I am a cluster newbie. I have just built my own cluster to learn the technology, and after a lot of reading and digging I have started to use Perceus, more precisely Caoslinux, for the same reason you started this wiki: it makes cluster management easy, and it really does the job. On the other hand, I am an Ubuntu fan and I am familiar with it, so I would love to do the same with Ubuntu. Unfortunately I am not a Linux/Ubuntu admin guru, otherwise I would have started the porting already.

You may not know that Perceus is the provisioning part of Warewulf, Caoslinux is the distribution used to manage them, and the whole lot is maintained by Infiscale.

http://www.perceus.org/portal/
http://www.caoslinux.org/index.html
http://www.infiscale.com/

Cheers,

Zsolt

We have already "ported" Perceus (and Warewulf) to Debian. It needs
a lot of polishing, but it works for i386 and amd64 on our cluster
of ~80 machines.


deb http://biodev.ece.ucsb.edu/debian/ main contrib

On your server:
aptitude install perceus-server

There are scripts for creating debian VNFS there.

zkovax
February 21st, 2009, 01:53 PM
We have already "ported" perceus (and warewulf) to debian.


Hi kgkv,

This looks really good. I will have a look.

Zsolt

antony_css
February 23rd, 2009, 12:41 AM
It sounds interesting:
http://code.google.com/p/distcc/

Anyway, what is that?

bigjimjams
February 23rd, 2009, 08:25 AM
Hi all!!

Can anyone post a simple and exhaustive guide on how to install a cluster of PCs using Kerrighed 2.3.0 and Ubuntu 8.10 or 8.04?

Thanks

Hi Masulli, are you or anyone else still looking for a guide to do this? I've recently managed to get a small Kerrighed 2.3.0 cluster up and running under Ubuntu 8.04 with the 2.6 kernel. I'm currently in the process of writing a guide covering the setup process so other people at work know how to maintain it. If anyone is interested I can post a link on here.

punong_bisyonaryo
February 23rd, 2009, 09:27 AM
Hi Masulli, are you or anyone else still looking for a guide to do this? I've recently managed to get a small Kerrighed 2.3.0 cluster up and running under Ubuntu 8.04 with the 2.6 kernel. I'm currently in the process of writing a guide covering the setup process so other people at work know how to maintain it. If anyone is interested I can post a link on here.

Yes please! Definitely interested.:D

bigjimjams
February 23rd, 2009, 11:19 AM
Yes please! Definitely interested.:D

Seeing as there is a demand for it, I'll get it finished quickly and post it here. :D

mips
February 23rd, 2009, 12:10 PM
What I am curious about is how the computers are connected... Certainly it won't work by just connecting two computers with a LAN cable, so it seems that a router or something more advanced is needed...

For two PCs all that is required is a cross-over LAN cable and a configured hosts file. Multiple PCs will require a LAN switch. A router is not required.
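
As an illustration of the "configured hosts file" part, here is a sketch that prints /etc/hosts-style entries for a small cluster. The host names (node1, node2, ...) and the 192.168.0.x addresses are assumptions made up for the example, not taken from anyone's setup in this thread.

```shell
# Hypothetical helper: print hosts entries for COUNT machines on PREFIX.x
# Usage: gen_hosts <network-prefix> <count>
gen_hosts() {
    prefix=$1
    count=$2
    i=1
    while [ "$i" -le "$count" ]; do
        # One line per machine: address, a tab, then the (made-up) host name
        printf '%s.%s\tnode%s\n' "$prefix" "$i" "$i"
        i=$((i + 1))
    done
}

# Example: entries for two PCs joined by a cross-over cable
# gen_hosts 192.168.0 2
```

The output would be appended to /etc/hosts on every machine, so that each PC can reach the others by name without a DNS server.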

xingmu
February 24th, 2009, 12:14 AM
bigjimjams,

I am another person who would love to see a tutorial for this. I have been following the tutorial at http://trac.nchc.org.tw/grid/wiki/krg_DRBL. But since I am using a more recent version of Kerrighed, it seems there are some problems towards the end with actually starting the cluster. I am also very frustrated that this and other tutorials (e.g. http://www.in2dwok.com/indawok/?page_id=16) don't *explain* what is being done but just tell you to do it.

If you can at least get a draft of your tutorial, I would offer to help extend it based on my experiences. I think the owner of this thread started a page for EasyUbuntuClustering on the wiki. We should do the same (either add to that page or start a dedicated one). In my view, before anyone can help make clustering "easy", we need to have more people actually able to start a cluster. :}

bigjimjams
February 24th, 2009, 06:21 AM
bigjimjams,

I am another person who would love to see a tutorial for this. I have been following the tutorial at http://trac.nchc.org.tw/grid/wiki/krg_DRBL. But since I am using a more recent version of Kerrighed, it seems there are some problems towards the end with actually starting the cluster. I am also very frustrated that this and other tutorials (e.g. http://www.in2dwok.com/indawok/?page_id=16) don't *explain* what is being done but just tell you to do it.

If you can at least get a draft of your tutorial, I would offer to help extend it based on my experiences. I think the owner of this thread started a page for EasyUbuntuClustering on the wiki. We should do the same (either add to that page or start a dedicated one). In my view, before anyone can help make clustering "easy", we need to have more people actually able to start a cluster. :}

Hi Xingmu, sounds like a good plan, as I'd be interested in other people's opinions on what I did. If you are using Kerrighed 2.3.0, there is a known issue with the krgadm tool, which always states that the cluster is not running even when it is! I'll put a draft of the guide up on the "Easy" Clustering Wiki in the next day or so, as I'm currently swamped with a couple of tasks.

bigjimjams
February 26th, 2009, 08:21 AM
Hi Xingmu, sounds like a good plan, as I'd be interested in other people's opinions on what I did. If you are using Kerrighed 2.3.0, there is a known issue with the krgadm tool, which always states that the cluster is not running even when it is! I'll put a draft of the guide up on the "Easy" Clustering Wiki in the next day or so, as I'm currently swamped with a couple of tasks.

Hi Xingmu and anybody else interested, as promised in my earlier post I've added a draft guide for setting up a kerrighed 2.3.0 cluster in Ubuntu 8.04 on the Easy Ubuntu Clustering Wiki. Here is the link:

https://wiki.ubuntu.com/EasyUbuntuClustering/UbuntuKerrighedClusterGuide

If you have any questions, comments, suggestions or improvements let me know.

ajt
February 28th, 2009, 09:02 AM
Hi Xingmu and anybody else interested, as promised in my earlier post I've added a draft guide for setting up a kerrighed 2.3.0 cluster in Ubuntu 8.04 on the Easy Ubuntu Clustering Wiki. Here is the link:

https://wiki.ubuntu.com/EasyUbuntuClustering/UbuntuKerrighedClusterGuide

If you have any questions, comments, suggestions or improvements let me know.

Hello, bigjimjams.

Thanks for your excellent installation guide!

I've done more or less the same installation, but I've not documented it quite so well :-)

One difference is that I use UNFS3 (http://unfs3.sourceforge.net/) with 'ClusterNFS' extensions enabled for the NFSROOT. This allows all the nodes to share the same NFSROOT filesystem read-only, with 'tagged' links into a writable area for node-specific files.

Bye,

Tony.

xingmu
March 2nd, 2009, 05:26 AM
Hi Xingmu and anybody else interested, as promised in my earlier post I've added a draft guide for setting up a kerrighed 2.3.0 cluster in Ubuntu 8.04 on the Easy Ubuntu Clustering Wiki. Here is the link:

https://wiki.ubuntu.com/EasyUbuntuClustering/UbuntuKerrighedClusterGuide

If you have any questions, comments, suggestions or improvements let me know.

Thanks bigjimjams! I actually got the cluster up and running right after I saw your note about the "no running cluster" bug in krgadm. I ended up using a package, DRBL, which configures the dhcp-server, tftp, etc. automatically. I am thinking it might be useful to add this to your tutorial as an alternative installation method.

On another note and somewhat off-topic, has anyone tried installing Kerrighed straight from the SVN? I see that the above-mentioned bug has already been fixed in the SVN. I am also having trouble with processes not migrating to multi-core CPUs. I am hoping newer revisions might have fixed these problems. If so, are there any suggested revision numbers to use? Or is every revision supposed to be a working version?

bigjimjams
March 2nd, 2009, 03:55 PM
Hello, bigjimjams.

Thanks for your excellent installation guide!

I've done more or less the same installation, but I've not documented it quite so well :-)

One difference is that I use UNFS3 (http://unfs3.sourceforge.net/) with 'ClusterNFS' extensions enabled for the NFSROOT. This allows all the nodes to share the same NFSROOT filesystem read-only, with 'tagged' links into a writable area for node-specific files.

Bye,

Tony.

No problem Tony, happy to help out! I was creating the guide for documenting it at work anyway, so thought I'd share what I'd done to save other people having to search around the web like I did. Thanks for the tip about UNFS. I'll have a look at it, as it may be useful when we build our larger Kerrighed cluster in a couple of weeks. :D

bigjimjams
March 2nd, 2009, 04:07 PM
Thanks bigjimjams! I actually got the cluster up and running right after I saw your note about the "no running cluster" bug in krgadm. I ended up using a package, DRBL, which configures the dhcp-server, tftp, etc. automatically. I am thinking it might be useful to add this to your tutorial as an alternative installation method.

On another note and somewhat off-topic, has anyone tried installing Kerrighed straight from the SVN? I see that the above-mentioned bug has already been fixed in the SVN. I am also having trouble with processes not migrating to multi-core CPUs. I am hoping newer revisions might have fixed these problems. If so, are there any suggested revision numbers to use? Or is every revision supposed to be a working version?

Hi xingmu, I did see the DRBL option on the Kerrighed documentation webpage, but preferred to do it by hand on a first attempt. I agree that it might be useful to maybe add this as an alternative option. Are there any other benefits or drawbacks?

I'm planning on using the SVN version of Kerrighed in a couple of weeks, as mentioned in my previous post. So I can let you know how it goes then if you haven't beaten me to it! ;) I think they've fixed a number of bugs since 2.3, from looking at the responses on the mailing list.

As for your multi-core CPU problem, that seems quite strange, as I did plug an Intel quad-core into my Kerrighed test cluster and it worked fine with process migration. Was it using a 32-bit or 64-bit kernel?

NurChiN
March 3rd, 2009, 12:15 AM
a

xingmu
March 3rd, 2009, 01:48 AM
Hi xingmu, I did see the DRBL option on the Kerrighed documentation webpage, but preferred to do it by hand on a first attempt. I agree that it might be useful to maybe add this as an alternative option. Are there any other benefits or drawbacks?

I'm planning on using the SVN version of Kerrighed in a couple of weeks, as mentioned in my previous post. So I can let you know how it goes then if you haven't beaten me to it! ;) I think they've fixed a number of bugs since 2.3, from looking at the responses on the mailing list.

As for your multi-core CPU problem, that seems quite strange, as I did plug an Intel quad-core into my Kerrighed test cluster and it worked fine with process migration. Was it using a 32-bit or 64-bit kernel?

Ok, I've tried some of the SVN revisions now. I can confirm that r4762 and r5069 (latest) are working versions. They have "sort of" fixed the cluster status bug. I say "sort of" because krgadm cluster status outputs "0:1". My session_id is 1, but I can't understand what the "0" means (nb_min?). My node_ids are from 1 to 3, so 0 is not a node_id.
Also, auto-migration in the newer revisions doesn't work without setting up a scheduler (http://www.kerrighed.org/wiki/index.php/SchedConfig). I got that part figured out, but I *still* cannot get processes to properly use my dual-core CPUs. I am wondering if there are options I need to change when building the kernel?

BTW, I am using a 32-bit kernel, although the chips are 64-bit Xeons.

ajt
March 3rd, 2009, 10:25 AM
No problem Tony, happy to help out! I was creating the guide for documenting it at work anyway, so thought I'd share what I'd done to save other people having to search around the web like I did. Thanks for the tip about UNFS. I'll have a look at it, as it may be useful when we build our larger Kerrighed cluster in a couple of weeks. :D

Hello, bigjimjams.

The '3' in UNFS3 means NFS version 3, so you have to tell the Kerrighed kernel to mount the NFSROOT using the NFSv3 protocol, because the default is NFSv2. This is how I do it:

default linux
label linux
kernel vmlinuz-kerrighed
append root=/dev/nfs ip=dhcp nfsroot=192.168.0.254:/NFSROOT,v3 node_id=65 session_id=1
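
The only per-node part of that stanza is the node_id, so it lends itself to being generated. A minimal sketch, reusing the values from the configuration above; the function name pxe_entry is invented for illustration, and the sketch assumes one config file per node.

```shell
# Hypothetical helper: print a PXELINUX stanza that mounts the root over NFSv3.
# Usage: pxe_entry <nfs-server-ip> <nfsroot-path> <node_id> <session_id>
pxe_entry() {
    server=$1
    root=$2
    node_id=$3
    session_id=$4
    # The ",v3" suffix on nfsroot= is what selects NFS version 3.
    cat <<EOF
default linux
label linux
kernel vmlinuz-kerrighed
append root=/dev/nfs ip=dhcp nfsroot=$server:$root,v3 node_id=$node_id session_id=$session_id
EOF
}

# Example, reproducing the stanza above:
# pxe_entry 192.168.0.254 /NFSROOT 65 1
```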


Bye,

Tony.

bigjimjams
March 3rd, 2009, 11:03 AM
Ok, I've tried some of the SVN revisions now. I can confirm that r4762 and r5069 (latest) are working versions. They have "sort of" fixed the cluster status bug. I say "sort of" because krgadm cluster status outputs "0:1". My session_id is 1, but I can't understand what the "0" means (nb_min?). My node_ids are from 1 to 3, so 0 is not a node_id.
Also, auto-migration in the newer revisions doesn't work without setting up a scheduler (http://www.kerrighed.org/wiki/index.php/SchedConfig). I got that part figured out, but I *still* cannot get processes to properly use my dual-core CPUs. I am wondering if there are options I need to change when building the kernel?

BTW, I am using a 32-bit kernel, although the chips are 64-bit Xeons.

Hi xingmu, the "0" in "0:1" is normally the node id. As far as I know, the manual assignment of node ids in 2.3 was broken and the node ids were automatically assigned from the IP address of each node, so for example a node with IP address 192.168.1.0 would automatically be assigned node id 0.

As I mentioned earlier, I used the 32-bit kernel on an Intel Q6600 with the other nodes being AthlonXPs and it seemed to work fine with the process migration stuff, as all cores hit 100% usage. Therefore, I'm guessing it's something to do with the kernel settings.

bigjimjams
March 3rd, 2009, 11:06 AM
Hello, bigjimjams.

The '3' in UNFS3 means NFS version 3, so you have to tell the Kerrighed kernel to mount the NFSROOT using the NFSv3 protocol, because the default is NFSv2. This is how I do it:

default linux
label linux
kernel vmlinuz-kerrighed
append root=/dev/nfs ip=dhcp nfsroot=192.168.0.254:/NFSROOT,v3 node_id=65 session_id=1


Bye,

Tony.

Thanks Tony, I guessed it was to do with NFS version 3, but the code snippet helps. :D

jedi453
March 12th, 2009, 06:08 PM
Hi BigJimJams,

Thanks for the great guide!

I did notice a few problems with the guide however...:

- The guide doesn't show how to set up /etc/network/interfaces on the server

- In /etc/default/dhcp3-server you're missing a semi-colon at the end of the third to last line

- In /etc/default/tftpd-hpa RUN_DAEMON should equal "yes" (all lowercase), mine was yelling at me without this (I was using debian so maybe that's why).


Just as a side note, I was having trouble at the tftp stage. DHCP assigned an IP address OK, but I received the "PXE-E32" error. This only happened on two of the three computers I was using.
So, as this source (http://www.mail-archive.com/ltsp-discuss@lists.sourceforge.net/msg32044.html) suggests, I tried booting from a PXE CD. I used gPXE.
I went to their website (http://kernel.org/pub/software/utils/boot/gpxe/) and downloaded the latest version. Then I untarred it (tar -xzf gpxe-*) and changed directory into the new folder (cd gpxe-*).
Then I changed directory into src (cd src) and ran "make bin/gpxe.iso". The ISO was then in "bin/gpxe.iso". I was able to boot from this, which resolved the problem.

robasc
March 15th, 2009, 08:46 AM
Hello everyone, My name is Rob and I am new to clustering and I am looking to find some people to chat with that can help me learn about clustering. Thanks to the help of ajt I am in the right place.

I have got a cluster put together using the idea from Joel Adams and their project Microwulf.

I thought this Idea would be a good way to start.

First off Let me just say that I am not an expert Linux guru but I do know the basics and I have been playing around for quite some time using various Linux OS systems; however, I am new to Ubuntu.

I am also very serious about getting this project done. I have already spent a good amount of money building this cluster and now I want to get it up and running.

One other thing I also liked the tutorial on kerrighed. I would be interested in learning more if anyone would care to help.

Thanks everyone!

bigjimjams
March 16th, 2009, 01:18 PM
Hello everyone, My name is Rob and I am new to clustering and I am looking to find some people to chat with that can help me learn about clustering. Thanks to the help of ajt I am in the right place.

I have got a cluster put together using the idea from Joel Adams and their project Microwulf.

I thought this Idea would be a good way to start.

First off Let me just say that I am not an expert Linux guru but I do know the basics and I have been playing around for quite some time using various Linux OS systems; however, I am new to Ubuntu.

I am also very serious about getting this project done. I have already spent a good amount of money building this cluster and now I want to get it up and running.

One other thing I also liked the tutorial on kerrighed. I would be interested in learning more if anyone would care to help.

Thanks everyone!

Hi robasc, I'm happy to help (if I can). What sort of things do you need help with? It might be worth posting your questions on here.

ajt
March 16th, 2009, 07:29 PM
[...]
I am also very serious about getting this project done. I have already spent a good amount of money building this cluster and now I want to get it up and running.

One other thing I also liked the tutorial on kerrighed. I would be interested in learning more if anyone would care to help.

Thanks everyone!

Hello, robasc.

Thanks for joining this thread :-)

I'm interested in SSI (Single System Image). This is like SMP (Symmetric Multi Processing), but the interconnect between the CPUs spans different computer systems rather than just the memory bus or HyperTransport used in multi-CPU systems. In COTS (Commodity Off The Shelf) Beowulf clusters the interconnect is usually 100BASE-T or Gigabit Ethernet.

In the recent past, three SSI projects were popular: OpenSSI, openMosix and Kerrighed. The OpenSSI project looks good, but seems to be dormant because the code releases are quite old, and the openMosix project is now closed down. That leaves Kerrighed, which is being actively developed by an enthusiastic team:

http://www.kerrighed.org

I run a 90-node openMosix Beowulf, which we are hoping to upgrade to Kerrighed as soon as we can get it to run long enough to use ;-)

There are other ways of using a cluster, but the idea of a 'type 2' Beowulf class machine is that it has a distributed process space and automatically balances the load on different nodes. Older 'type 1' Beowulf clusters use a DRM (Distributed Resource Manager) like SGE (Sun Grid Engine) to balance the load and schedule jobs. Both types of Beowulf are used to run MPI (Message Passing Interface) programs to split the load of a single job over multiple nodes in order to use the aggregate performance of the cluster. SSI is best for 'embarrassingly' parallel computation. A good starting point to learn is ClusterMonkey:

http://www.clustermonkey.net/

Bye,

Tony.

robasc
March 17th, 2009, 02:37 PM
Thanks for the info ajt. I will be checking it out for sure.

Meanwhile, I have got something here that I need some of you to look over. I decided to go with a Kerrighed setup using BigJimJams' instructions; however, I was unsuccessful in getting the diskless nodes to boot up. I think you guys could probably help me to find my problems.

Ok, here is everything wrapped up in a webpage that I have done from beginning to end, which includes where I am with the project and what problems I am having. I know the page is lengthy, but I did not want to leave out anything on my setup. Please let me know what I am doing wrong and where I need to start.

Ok, here is the website to go to:

KerrighedCluster.htm (http://www.cpsspecialties.webhop.org/kerrighedCluster.htm)

ajt
March 17th, 2009, 05:08 PM
Thanks for the info ajt. I will be checking it out for sure.
[...]
Ok, here is everything wrapped up in a webpage that I have done from beginning to end, which includes where I am with the project and what problems I am having. I know the page is lengthy, but I did not want to leave out anything on my setup. Please let me know what I am doing wrong and where I need to start.

Ok, here is the website to go to:

KerrighedCluster.htm (http://www.cpsspecialties.webhop.org/kerrighedCluster.htm)

Hello, robasc.

That's a bit of a long read!

The server 192.168.1.128 is excluded from your network:

192.168.1.128/25 = 192.168.1.128/255.255.255.128

192.168.1.128 is the network address itself, so the lowest usable host address in this network is 192.168.1.129

Why do you not just use:

192.168.1.0/24 = 192.168.1.0/255.255.255.0
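
The arithmetic behind CIDR notation can be checked with a few lines of shell. This is a sketch using only POSIX shell arithmetic on a 64-bit system; the function names are made up for the example. It masks an address with a prefix length to get the network address, and takes the first usable host to be the network address plus one.

```shell
# Convert a dotted quad (a.b.c.d) to a single integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address on dots into $1..$4
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Convert an integer back to a dotted quad.
int_to_ip() {
    echo "$(( ($1 >> 24) & 255 )).$(( ($1 >> 16) & 255 )).$(( ($1 >> 8) & 255 )).$(( $1 & 255 ))"
}

# Usage: network_of <ip> <prefix-length>
network_of() {
    ip=$(ip_to_int "$1")
    mask=$(( (0xFFFFFFFF << (32 - $2)) & 0xFFFFFFFF ))
    int_to_ip $(( ip & mask ))
}

# Usage: first_host_of <ip> <prefix-length>
first_host_of() {
    ip=$(ip_to_int "$1")
    mask=$(( (0xFFFFFFFF << (32 - $2)) & 0xFFFFFFFF ))
    int_to_ip $(( (ip & mask) + 1 ))
}

# network_of 192.168.1.128 25
# first_host_of 192.168.1.128 25
# network_of 192.168.1.1 24
```

With a /25 prefix, 192.168.1.128 comes out as its own network address, which is why a server cannot use it as a host address; with a /24 prefix, every address from 192.168.1.1 to 192.168.1.254 is a usable host.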

Bye,

Tony.

robasc
March 17th, 2009, 09:44 PM
I know it is a very long read but I did not want to leave anything out.

I changed the network to 192.168.1.0/255.255.255.0 as you stated and now I get a PXE-E51 No DHCP or proxyDHCP offers were received.

I sure do appreciate everyone's help on this.

robasc
March 18th, 2009, 03:12 AM
Well it seems to be booting up but I get this:

Done.

Gave up waiting for root device. Common problems:

-Boot args (cat /proc/cmdline)
-Check rootdelay= (did the system wait long enough?)
-Check root= (did the system wait for the right device?)
-Missing modules (cat /proc/modules; ls /dev)
ALERT! /dev/nfs does not exist. Dropping to a shell!


BusyBox v1.10.2 (Ubuntu 1:1.10.2-1ubuntu6) built-in shell (ash)
Enter 'help' for a list of built-in commands.

(initramfs) _

Not really sure what this all means but I believe I am missing some files but not sure. That is my best guess.

I updated the bottom of the webpage with a screen capture if you care to look at it.

bigjimjams
March 18th, 2009, 07:08 AM
Well it seems to be booting up but I get this:

Done.

Gave up waiting for root device. Common problems:

-Boot args (cat /proc/cmdline)
-Check rootdelay= (did the system wait long enough?)
-Check root= (did the system wait for the right device?)
-Missing modules (cat /proc/modules; ls /dev)
ALERT! /dev/nfs does not exist. Dropping to a shell!


BusyBox v1.10.2 (Ubuntu 1:1.10.2-1ubuntu6) built-in shell (ash)
Enter 'help' for a list of built-in commands.

(initramfs) _

Not really sure what this all means but I believe I am missing some files but not sure. That is my best guess.

I updated the bottom of the webpage with a screen capture if you care to look at it.

Hi robasc, not sure if this would help, but it seems that it's booting the kernel from tftp fine but just not mounting the NFS. Have you tried editing /etc/network/interfaces on the server, and changing the ethernet card to manual, like you did with the nodes?

bigjimjams
March 18th, 2009, 08:05 AM
Well it seems to be booting up but I get this:

Done.

Gave up waiting for root device. Common problems:

-Boot args (cat /proc/cmdline)
-Check rootdelay= (did the system wait long enough?)
-Check root= (did the system wait for the right device?)
-Missing modules (cat /proc/modules; ls /dev)
ALERT! /dev/nfs does not exist. Dropping to a shell!


BusyBox v1.10.2 (Ubuntu 1:1.10.2-1ubuntu6) built-in shell (ash)
Enter 'help' for a list of built-in commands.

(initramfs) _

Not really sure what this all means but I believe I am missing some files but not sure. That is my best guess.

I updated the bottom of the webpage with a screen capture if you care to look at it.

Hi robasc, just had another thought. When you updated the subnet mask from 255.255.255.128 to 255.255.255.0 in the DHCP server config file, did you also update /etc/exports for the NFS server? I've just noticed on your webpage that it was still using 255.255.255.128, which might explain why the NFS isn't mounting.
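
One way to catch this kind of mismatch is to search the relevant config files for the old netmask. A sketch, assuming the file locations used in this thread (/etc/exports and /etc/dhcp3/dhcpd.conf); the function name is hypothetical.

```shell
# Hypothetical helper: list every line in the given files that still
# contains the old netmask. Usage: find_stale_mask <mask> <file>...
find_stale_mask() {
    mask=$1
    shift
    # -F: treat the mask as a fixed string, so the dots are not regex wildcards
    # -n: show line numbers; /dev/null forces grep to print file names
    grep -Fn -- "$mask" "$@" /dev/null
}

# Example:
# find_stale_mask 255.255.255.128 /etc/exports /etc/dhcp3/dhcpd.conf
```

The function returns non-zero when nothing matches, which is the good outcome here. After changing /etc/exports, the NFS server also needs to re-read it, for example with exportfs -ra.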

bigjimjams
March 18th, 2009, 05:27 PM
Hi BigJimJams,

Thanks for the great guide!

I did notice a few problems with the guide however...:

- The guide doesn't show how to set up /etc/network/interfaces on the server

- In /etc/default/dhcp3-server you're missing a semi-colon at the end of the third to last line

- In /etc/default/tftpd-hpa RUN_DAEMON should equal "yes" (all lowercase), mine was yelling at me without this (I was using debian so maybe that's why).


Just as a side note, I was having trouble at the tftp stage. DHCP assigned an IP address OK, but I received the "PXE-E32" error. This only happened on two of the three computers I was using.
So, as this source (http://www.mail-archive.com/ltsp-discuss@lists.sourceforge.net/msg32044.html) suggests, I tried booting from a PXE CD. I used gPXE.
I went to their website (http://kernel.org/pub/software/utils/boot/gpxe/) and downloaded the latest version. Then I untarred it (tar -xzf gpxe-*) and changed directory into the new folder (cd gpxe-*).
Then I changed directory into src (cd src) and ran "make bin/gpxe.iso". The ISO was then in "bin/gpxe.iso". I was able to boot from this, which resolved the problem.

Hi Jedi453, thanks for the feedback. From what I remember, the only change I made to the /etc/network/interfaces file on the server (which should have been in the guide) was to edit the configuration for eth0 from auto to manual. The IP address and subnet mask were then assigned manually, as mentioned in the guide. Point 2 has been updated in the guide. Thanks for spotting that, there was bound to have been a typo in there somewhere! ;) I think point 3 is a distribution thing, as I've used capitals (if I remember correctly, it was written like this when the file was automatically created) and it works fine.

robasc
March 18th, 2009, 07:01 PM
Hi robasc, just had another thought. When you updated the subnet mask from 255.255.255.128 to 255.255.255.0 in the DHCP server config file, did you also update /etc/exports for the NFS server? I've just noticed on your webpage that it was still using 255.255.255.128, which might explain why the NFS isn't mounting.


Hello BigJimJams, Yes I did, in fact I freshly installed the whole thing again using the subnet you gave in the tutorial.

But still having the same problem.

There are a couple of things I am wondering about here. As you might have already noticed on the webpage, I am using the intrepid amd64 version.

I had to modify the updates for intrepid rather than hardy. Also I had to modify:


debootstrap --arch i386 hardy /nfsroot/kerrighed http://archive.ubuntu.com/ubuntu/


To:


debootstrap --arch amd64 intrepid /nfsroot/kerrighed http://archive.ubuntu.com/ubuntu/


I am not sure if this has anything to do with it.

Another problem I experienced is when I had to add a user in /nfsroot/kerrighed.

When I checked the /etc/sudoers file there was no user installed so I had to manually install the user as noted in the webpage.

This was just a couple of things I have been wondering about.

I really do not know what else to do but I am trying to think of something.

robasc
March 18th, 2009, 07:30 PM
Hi robasc, not sure if this would help, but it seems that it's booting the kernel from tftp fine but just not mounting the NFS. Have you tried editing /etc/network/interfaces on the server, and changing the ethernet card to manual, like you did with the nodes?


Ok, I set the server to:


auto lo
iface lo inet loopback

iface eth0 inet manual


Just like the nodes.

Then I manually configured the server ip address like so:


sudo ifconfig eth0 192.168.1.1 netmask 255.255.255.0 up


Then restarted the network and dhcp3

and I still have the same issue.

cstrrider
March 18th, 2009, 10:45 PM
Hi
I am fairly new to clustering and I've been having a hard time finding an up-to-date guide that works. I really appreciate your Kerrighed guide on Ubuntu; it's the best clustering guide I have tried yet. I am running Ubuntu Server 8.10 with the ubuntu-desktop package installed. When I try to run dhcp3-server I always get this error:
sudo /etc/init.d/dhcp3-server restart
* Stopping DHCP server dhcpd3 [fail]
* Starting DHCP server dhcpd3 * check syslog for diagnostics.
[fail]

I hope someone can help me with this.
Thanks,
Cstrrider

robasc
March 18th, 2009, 10:58 PM
Assuming you are using the same addressing scheme as the guide try manually configuring your ip like so:


sudo ifconfig eth0 192.168.1.1 netmask 255.255.255.0 up


Then:


sudo /etc/init.d/networking restart
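
Since the init script only says to "check syslog for diagnostics", a quick way to see what the DHCP server actually complained about is to pull its most recent lines out of the log. A sketch, assuming the log is /var/log/syslog and that the daemon's messages contain the string "dhcpd"; the function name is made up.

```shell
# Hypothetical helper: show the last N syslog lines mentioning dhcpd.
# Usage: dhcpd_errors [logfile] [count]
dhcpd_errors() {
    logfile=${1:-/var/log/syslog}
    count=${2:-20}
    # Case-insensitive match, then keep only the most recent lines
    grep -i 'dhcpd' "$logfile" | tail -n "$count"
}

# Example, right after the failed restart (reading syslog may need sudo):
# dhcpd_errors
```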

robasc
March 19th, 2009, 10:47 AM
Everyone, I updated the webpage. I took some material and updated the /etc/dhcp3/dhcpd.conf so my data would not become cluttered and confusing.


kerrighedCluster.htm (http://www.cpsspecialties.webhop.org/kerrighedCluster.htm)

jedi453
March 19th, 2009, 10:58 AM
Hi robasc,

I got this problem too. Are you using an initramfs? I think I fixed it by changing my initramfs.conf file (which I think is in /etc/initramfs-tools/).

Make sure initramfs-tools is installed
sudo apt-get install initramfs-tools

Copy /etc/initramfs-tools/initramfs.conf to a backup file
sudo cp /etc/initramfs-tools/initramfs.conf{,.old}

Edit /etc/initramfs-tools/initramfs.conf and change the line BOOT=local to BOOT=nfs

Copy your current initramfs to a backup. Replace "<kernel-version>" with your kernel version in the form 2.6.xx-something (if you don't know what it is and you're using your running kernel, just replace it with "$(uname -r)" without quotes)
sudo cp /boot/initrd.img-<kernel-version>{,.old}


If that fails I recommend against continuing.

We're going to overwrite /srv/tftp/initrd.img (if it exists). So move it to a backup file
sudo mv /srv/tftp/initrd.img{,.old}

Then I remade my initramfs (making sure not to overwrite the initramfs I was using). Replace "<kernel-version>" with your kernel version in the form vmlinuz-2.6.xx-something (if you don't know what it is and you're using your running kernel, just replace it with "vmlinuz-$(uname -r)" without quotes (notice the "vmlinuz-" this time))
sudo mkinitramfs -o /srv/tftp/initrd.img <kernel-version>

Change /srv/tftp/pxelinux.cfg/default to accept the new initrd now called "initrd.img"

Then copy the old initramfs.conf back to the old name
sudo cp /etc/initramfs-tools/initramfs.conf{.old,}

I found this idea from a post on http://www.howtoforge.com/pxe_booting_debian .
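
The edit-and-restore part of those steps can be sketched as a small helper, so the BOOT= line is changed without opening an editor. The function name is hypothetical and the sketch assumes the conf file has a single line starting with "BOOT=". Note that, as far as I know, mkinitramfs expects the bare version string (as printed by uname -r) as its last argument, so that is what the commented command uses.

```shell
# Hypothetical helper: switch the BOOT= setting in an initramfs.conf-style
# file, saving the previous contents as <file>.old.
# Usage: set_initramfs_boot <conf-file> <local|nfs>
set_initramfs_boot() {
    conf=$1
    mode=$2
    cp "$conf" "$conf.old"
    # Rewrite only the BOOT= line; every other setting is copied unchanged
    sed "s/^BOOT=.*/BOOT=$mode/" "$conf.old" > "$conf"
}

# Sequence corresponding to the steps above (run as root):
# set_initramfs_boot /etc/initramfs-tools/initramfs.conf nfs
# mkinitramfs -o /srv/tftp/initrd.img "$(uname -r)"
# cp /etc/initramfs-tools/initramfs.conf.old /etc/initramfs-tools/initramfs.conf
```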

Good luck

robasc
March 19th, 2009, 12:46 PM
Hi robasc,

I got this problem too. Are you using an initramfs? I think I fixed it by changing my initramfs.conf file (which I think is in /etc/initramfs-tools/).

Make sure initramfs-tools is installed
Code:

sudo apt-get install initramfs-tools

Copy /etc/initramfs-tools/initramfs.conf to a backup file
Code:

sudo cp /etc/initramfs-tools/initramfs.conf{,.old}

Edit /etc/initramfs-tools/initramfs.conf and change the line BOOT=local to BOOT=nfs

Copy your current initramfs to a backup. Replace "<kernel-version>" with your kernel version in the form 2.6.xx-something (if you don't know what it is and you're using your running kernel, just replace it with "$(uname -r)" without quotes)
Code:

sudo cp /boot/initrd.img-<kernel-version>{,.old}






I appreciate that help jedi, unfortunately for me it did not work. I went up to the point above.

Any other suggestions would be appreciated as well.

Thanks

bigjimjams
March 19th, 2009, 05:01 PM
Hi robasc,

I got this problem too. Are you using an initramfs? I think I fixed it by changing my initramfs.conf file (which I think is in /etc/initramfs-tools/).

Make sure initramfs-tools is installed
sudo apt-get install initramfs-tools

Copy /etc/initramfs-tools/initramfs.conf to a backup file
sudo cp /etc/initramfs-tools/initramfs.conf{,.old}

Edit /etc/initramfs-tools/initramfs.conf and change the line BOOT=local to BOOT=nfs

Copy your current initramfs to a backup. Replace "<kernel-version>" with your kernel version in the form 2.6.xx-something (if you don't know what it is and you're using your running kernel, just replace it with "$(uname -r)" without quotes)
sudo cp /boot/initrd.img-<kernel-version>{,.old}


If that fails I recommend against continuing.

We're going to overwrite /srv/tftp/initrd.img (if it exists). So move it to a backup file
sudo mv /srv/tftp/initrd.img{,.old}

Then I remade my initramfs (making sure not to overwrite the initramfs I was using). Replace "<kernel-version>" with your kernel version in the form vmlinuz-2.6.xx-something (if you don't know what it is and you're using your running kernel, just replace it with "vmlinuz-$(uname -r)" without quotes (notice the "vmlinuz-" this time))
sudo mkinitramfs -o /srv/tftp/initrd.img <kernel-version>

Change /srv/tftp/pxelinux.cfg/default to accept the new initrd now called "initrd.img"

Then copy the old initramfs.conf back to the old name
sudo cp /etc/initramfs-tools/initramfs.conf{.old,}

I found this idea from a post on http://www.howtoforge.com/pxe_booting_debian .

Good luck

Hi Jedi453, this is a good point to note for doing a diskless boot. Now that you've mentioned it, I vaguely recall doing something similar. However, when you get round to using the kerrighed kernel, you don't use an initrd.img when booting.

bigjimjams
March 19th, 2009, 05:07 PM
Ok, I set the server to:



Just like the nodes.

Then I manually configured the server ip address like so:



Then restarted the network and dhcp3

and I still have the same issue.

Hi robasc, sorry to hear that you're still having trouble! I noticed you were using the 64-bit kernel and Intrepid. I've actually been setting up another kerrighed cluster today, this time using the 64-bit kernel and 8.04. I followed the same guide as documented and set up eth0, as you've just mentioned, and it seemed to diskless boot fine. So maybe something has changed in 8.10 that I'm missing!

Also, when you add a new user in the diskless boot system, the user isn't automatically added to the sudoers file, which is why I mentioned adding them in the guide. I'll keep thinking about possible solutions.

robasc
March 19th, 2009, 07:40 PM
Cool, well it looks like we're on to a new idea then. Let's solve the mystery.

Thanks

agungaryo
March 19th, 2009, 11:18 PM
Hi everyone,

I'm setting up a cluster for a simulation for the first time, and I followed the tutorial. The installation worked through to the end of the tutorial, apart from a few problems that I managed to solve. ^_^

As a first test I'm using 2 nodes and 1 server (Ubuntu Server Hardy). When I run "top" on a cluster node, it shows 3 CPUs.

When I run my simulation on the server, the server's CPU usage reaches 100%. But when I look on a cluster node, none of the 3 CPUs shows 100% usage, so the nodes are not sharing the load from the server.

I think the problem is that I don't know how to use Kerrighed so that it shares the server's load.

Thank you,


agung aryo

agungaryo
March 19th, 2009, 11:21 PM
Hi robasc,
I'm a beginner with Ubuntu.
I've read your post; it's long, but it's nice.
I just want to know:
does your current kernel have the NFS module?

bigjimjams
March 20th, 2009, 07:13 AM
Hi agung aryo, the server should not be running the kerrighed kernel, only the nodes should. Therefore any application run on the server won't migrate to the nodes. You have to ssh into one of the nodes, and launch the application/simulation from a node. If the application/simulation uses multiple processes, then you should see the processes migrate to all CPUs in the cluster and all of the nodes should reach CPU USAGE = 100% if you're working it hard enough.:D

robasc
March 20th, 2009, 10:39 AM
Hello agungaryo, I am not sure about an nfs module. I am very new to this as well. Hopefully everyone together can solve this issue.

Also, I would like to inform everyone that I installed 8.04 amd64 on the cluster and it booted up just fine and was able to log in on the nodes.

But I did run across a problem when installing kerrighed.

Here is the issue:

1. ) When I began performing the downloads, awk was not found. The program asked me to download gawk, so I did. I don't know if anyone else has run into this before.

2. ) When I got down to configuring the kernel I left everything at the default. I am not sure how to configure the kernel and check the NFS and network card drivers.

3. ) And finally I ran make and it failed. Here is the error output:


root@master-node:/usr/src/kerrighed-2.3.0# make
Making all in modules
make[1]: Entering directory `/usr/src/kerrighed-2.3.0/modules'
make -C "/usr/src/linux-2.6.20" modules_prepare
make[2]: Entering directory `/usr/src/linux-2.6.20'
CHK include/linux/version.h
CHK include/linux/utsrelease.h
make[2]: Leaving directory `/usr/src/linux-2.6.20'
touch .modules_prepare
rm -f "/usr/src/kerrighed-2.3.0/modules/asm"
[ "/usr/src/kerrighed-2.3.0" == "/usr/src/kerrighed-2.3.0" ] || lndir -silent "/usr/src/kerrighed-2.3.0/modules"
[: 1: ==: unexpected operator
/usr/src/kerrighed-2.3.0/modules: From and to directories are identical!
make[1]: *** [all-y] Error 1
make[1]: Leaving directory `/usr/src/kerrighed-2.3.0/modules'
make: *** [all-recursive] Error 1
root@master-node:/usr/src/kerrighed-2.3.0#



Not sure what is going on here. It seems there is a duplicate of /usr/src/kerrighed-2.3.0?

Anyone?

Going back to the ubuntu 8.10 amd64 version, I am going to install another partition so I can keep hammering away at it. Hopefully I will find the reason for nfs not working.

agungaryo
March 20th, 2009, 11:14 AM
(1) I think that's fine.
(2) Even though the default config selects most of the necessary options, you must make sure that your network driver and NFS are included in the kernel, because I've seen that the default setting leaves out some network drivers. I also had a problem at this step: my node has a SiS network card, but after I compiled the kernel and the node tried to boot, the network card wasn't detected even though I had included the SiS driver module. My solution was to change the network card, hehe.
(3) I see, yes, that's true... it's "numpuk" (Indonesian for "piled up").
I guess you should go into the kerrighed directory,
run "make clean" and "make uninstall",
and repeat the step.


For Intrepid amd64, you can check your current kernel with
"lsmod | grep nfs". I think nfs (maybe the client) must be present, because in the "test diskless boot" step the node tries to boot with the current kernel.

sorry if my answer above is wrong

thank you

agung aryo ( AccessNet )
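
A note on the "lsmod | grep nfs" check above: lsmod only lists loadable modules, so an NFS client that is compiled into the kernel would not show up there. /proc/filesystems lists every filesystem the running kernel currently supports, built in or loaded. A minimal sketch of that check follows; the function name has_fs is made up for illustration.

```shell
#!/bin/sh
# has_fs NAME [FILE]: succeed if filesystem NAME is listed in FILE.
# FILE defaults to /proc/filesystems, which covers built-in drivers as well
# as loaded modules. The last field of each line is the filesystem name.
has_fs() {
    awk -v fs="$1" '$NF == fs { found = 1 } END { exit !found }' "${2:-/proc/filesystems}"
}

if has_fs nfs; then
    echo "nfs: supported by the running kernel"
else
    echo "nfs: not registered (module not loaded, or not built)"
fi
```

If the second message appears, running "sudo modprobe nfs" and checking again distinguishes a module that simply was not loaded from a kernel built without NFS support.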

bigjimjams
March 20th, 2009, 12:53 PM
Hi robasc, yes, I noted a couple of errors in the guide when doing the fresh install yesterday. In response to 1), it should be gawk that you install and not awk. 2) The default config doesn't necessarily build your NIC driver into the kernel. If I remember correctly, if you go to Device Drivers ---> Network Drivers ---> and then 10/100 or 1000, you can select your network card. Make sure it is built in and not built as a module (not an "M"). NFS v3 is normally enabled by the default config. As for 3), I'll have a look at what I did, as I remember getting this error before I got things working, but I somehow managed to find a workaround. I'll let you know.
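
The "built in, not a module" requirement can also be checked after menuconfig by reading the generated .config, where "=y" means built in and "=m" means module. The sketch below is an illustration only: config_state and the KCONFIG override are made-up names, and CONFIG_R8169 is used because the Realtek 8169 driver comes up later in the thread; substitute the symbol for your own card.

```shell
#!/bin/sh
# config_state SYMBOL FILE: print "builtin", "module" or "unset" for a
# kernel config symbol, based on the =y / =m lines in FILE.
config_state() {
    if grep -q "^$1=y\$" "$2"; then
        echo builtin
    elif grep -q "^$1=m\$" "$2"; then
        echo module
    else
        echo unset
    fi
}

# KCONFIG is a hypothetical override; the default is the kernel tree used in the guide.
cfg=${KCONFIG:-/usr/src/linux-2.6.20/.config}
if [ -f "$cfg" ]; then
    # NIC driver, NFS client, and root-on-NFS support
    for sym in CONFIG_R8169 CONFIG_NFS_FS CONFIG_ROOT_NFS; do
        echo "$sym: $(config_state "$sym" "$cfg")"
    done
else
    echo "no kernel config found at $cfg"
fi
```

For a diskless node that boots without an initrd, all three should report "builtin", since nothing can load a module before the root filesystem is mounted over the network.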

jedi453
March 20th, 2009, 02:44 PM
Interesting. I think problem three is a result of installing build-essential. I think it installs version 4.x of gcc. This would cause a problem with kerrighed, as it only likes 3.3 of gcc. Gcc-3.3 is installed to /usr/bin/gcc-3.3 instead of /usr/bin/gcc. You'll probably have to copy gcc-3.3 to /usr/bin/gcc , or create a symbolic link. As I don't know much about symbolic links I'll tell you how to copy gcc-3.3 . I don't know why there would be a duplicate of /usr/src/kerrighed-2.3.0

First chroot to /nfsroot/kerrighed
sudo chroot /nfsroot/kerrighed

Then check for /usr/bin/gcc-3.3
ls /usr/bin/gcc-3.3

If there was no output for that, the following won't work. The output should be "/usr/bin/gcc-3.3" . If there was no output install gcc-3.3 ( sudo apt-get install gcc-3.3 ) and try again.

First back up your old version of gcc:
cp /usr/bin/gcc{,.old}

Then copy gcc-3.3 to /usr/bin/gcc
cp /usr/bin/gcc{-3.3,}

You may need to delete your /usr/src/kerrighed-2.3.0 and /usr/src/linux-2.6.20 (in chroot [/nfsroot/kerrighed]) and redownload them.

rm -rf /usr/src/kerrighed-2.3.0 /usr/src/linux-2.6.20

Then start from the third code line (right after chrooting and apt-get) in part 2.1 in bigjimjams's guide. Make sure to compile the kernel with the correct options. See above posts for how to do that.

Then the compile should go ok, unless you had another problem too.

If you want to change back things once you're done just move /usr/bin/gcc.old to /usr/bin/gcc
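
For the symbolic-link alternative mentioned above, here is a minimal sketch that can be run safely in a scratch directory. The stand-in "compiler" is a two-line script and use_compiler is a made-up helper name; nothing here touches /usr/bin.

```shell
#!/bin/sh
# use_compiler BINDIR VERSIONED: make BINDIR/gcc a symlink to VERSIONED
# (for example gcc-3.3). -s creates a symbolic link, -f replaces an existing one.
use_compiler() {
    ln -sf "$2" "$1/gcc"
}

scratch=$(mktemp -d)
# Stand-in for the real compiler binary (hypothetical, it only prints a label).
printf '#!/bin/sh\necho "gcc-3.3 stand-in"\n' > "$scratch/gcc-3.3"
chmod +x "$scratch/gcc-3.3"

use_compiler "$scratch" gcc-3.3
"$scratch/gcc"          # runs the gcc-3.3 stand-in through the link
ls -l "$scratch/gcc"    # shows that gcc points at gcc-3.3
```

Inside the chroot the equivalent would be "ln -sf gcc-3.3 /usr/bin/gcc", and switching back means pointing the link at the 4.x binary again (assumed here to be named gcc-4.2, going by the version reported later in the thread). A link has one advantage over copying: on Ubuntu /usr/bin/gcc is normally itself a symlink, and cp writes through an existing symlink, so the copy approach may overwrite the versioned 4.x binary rather than the link.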

agungaryo
March 20th, 2009, 03:58 PM
hi bigjimjams,


client (10.14.200.0/24)<---------->server simulation (ubuntu server)<-------> nodes cluster (192.168.1.0/24)

My simulation is on the server (Ubuntu server).
- Do you mean that I must reinstall the simulation on a cluster node?
- My simulation is a client-server application. If I install it on a cluster node, the client has to reach the 192.168.1.0 network. Does that mean I should redirect from 10.14.200.0 to 192.168.1.0?
- I have two cluster nodes, so why does "top" (run on a node) show 3 CPUs? Does that mean the server is included?


thank you


agung aryo

robasc
March 21st, 2009, 07:55 AM
Well Jedi, I tried what you said but I still got the same message about a duplicate.

I am going to try and reinstall it and see if this works.

jedi453
March 21st, 2009, 12:00 PM
Sorry I gave you two wrong answers.:( It's just tough to diagnose errors over the net. I assumed you got errors for the same reasons I did the first time I tried it. Maybe it's because I used Debian. You still might need to use gcc-3.3 to compile (as this source says http://www.kerrighed.org/wiki/index.php/Installing_Kerrighed_2.3.0). But if you try the code below and it returns version 4.1.x or 3.3.x then you might be fine according to the same source:
gcc --version

Don't give up!:)

Good luck!

robasc
March 21st, 2009, 05:27 PM
I agree it is not easy fixing problems over the internet.

gcc --version actually returned 4.2.4

Something else I noticed when configuring the kernel: my ethernet card is not listed.

It shows a Realtek 8169 but does not show a Realtek 8111C?

bigjimjams
March 23rd, 2009, 04:52 PM
Hi robasc, regarding point 3 of your post from March 20th (the failed make), I've been tinkering around and have possibly found a workaround; however, I didn't get a chance to boot up the kerrighed cluster to test it out. Once more it involves a couple of things that I remember doing but missed out of the guide. Firstly, in the chroot'ed nfsboot system, you need to check your environment variables, so do the following.
$ export
Look to see if you have the variables CC=gcc-3.3, CXX=g++-3.3 and CPP=cpp-3.3. If not, then you may also not have the packages installed. So do the following:
apt-get install gcc-3.3 g++-3.3 cpp-3.3
export CC=gcc-3.3
export CXX=g++-3.3
export CPP=cpp-3.3
Once you've got these installed and the variables setup, you can do the following to install Kerrighed, which is pretty much what the guide does.
cd /usr/src/kerrighed-*
./configure --with-kernel=/usr/src/linux-2.6.20
make patch
make defconfig
cd ../linux-2.6.20
make menuconfig
cd ../kerrighed-*
make kernel
make -i
make kernel-install
make install -i

The -i flag will ignore errors, so you'll still get the error message but it seems to build and install all the necessary tools, modules, etc in the correct places. Hope this helps!:D
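
The environment check described above (looking through the output of "export" for CC, CXX and CPP) can be automated. A minimal sketch follows; report_toolchain is a made-up name, and the expected values are simply the ones given in the post.

```shell
#!/bin/sh
# report_toolchain: print CC, CXX and CPP and flag any that are unset or
# differ from the 3.3 toolchain the post asks for. Returns non-zero if any
# variable needs attention.
report_toolchain() {
    status=0
    for pair in CC=gcc-3.3 CXX=g++-3.3 CPP=cpp-3.3; do
        name=${pair%%=*}          # variable name, e.g. CC
        want=${pair#*=}           # expected value, e.g. gcc-3.3
        eval "have=\${$name-}"    # current value, empty if unset
        if [ "$have" = "$want" ]; then
            echo "$name=$have ok"
        else
            echo "$name=${have:-<unset>} (expected $want)"
            status=1
        fi
    done
    return $status
}

report_toolchain || echo "set and export the variables above before running ./configure"
```

Bear in mind that "make -i" hides every failure, not only the known one, so it is worth confirming afterwards that the kernel image and the Kerrighed tools really were installed.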

agungaryo
March 24th, 2009, 12:20 AM
Hi everyone,

I have two interfaces in the server (Ubuntu server with DHCP, NFS and PXE servers):
eth0 (192.168.1.0/24)--> hub which connects the nodes
eth1 (10.14.200.0/24)--> client for the simulation (client-server application)


I've installed the Kerrighed cluster according to the tutorial.
bigjimjams says that I must run my simulation on a node (192.168.1.0/24), but my simulation client is on 10.14.200.0/24.
My simulation is installed on the server (Ubuntu server) and it needs to be triggered by the client.

1) Do I have to reinstall my simulation on a node if I want to use this clustering?
2) The client has problems if the simulation runs on a node (192.168.1.0/24) while the client is on 10.14.200.0/24. I could force my client to move to the 192.168.1.0/24 network, but imagine this Kerrighed cluster being used for a web server with a public IP address: how should the addresses be managed when, I guess, the nodes have private IP addresses? Any ideas on this problem?

Thank you,
agung aryo

bigjimjams
March 24th, 2009, 05:28 AM
Hi Agung Aryo, what type of simulation are you running? In my opinion, there are two different options:

1) The client connects to the simulation on the server, which can then execute other programs/processes of the simulation on the cluster nodes via ssh.
2) The client connects to the server, which then grabs the information from the client and passes this to the cluster nodes when executing the simulation on the cluster nodes via ssh.

In order to have kerrighed working properly, only the cluster nodes should run the kerrighed kernel; the server should just be running a standard Ubuntu kernel. Also, the cluster nodes should have their own private network with the server. A client should not be connected to this private network, as all a client should see is the server, which then submits jobs to the cluster nodes. Hope this helps?

koukoobangoo
March 26th, 2009, 10:52 AM
Hi everyone
I'm a new user of Kerrighed. I set up a cluster of three machines, just as described in the Easy Ubuntu Clustering tutorial for Kerrighed.

The problem is that the disks in each node don't seem to be working.
It is said that one of the features of Kerrighed is a cluster file system, which means total virtualization of all the nodes' disks.

Is Kerrighed really able to show all the disks as a single disk?

ajt
March 26th, 2009, 11:10 AM
Hello, koukoobangoo.

KerFS was (temporarily) removed at Kerrighed 2.0.0 because it caused kernel crashes. It now seems to be developed as a separate module:

http://www.kerrighed.org/wiki/index.php/KernelDevelKdFS

Bye,

Tony.

apprentice_clusterer
March 29th, 2009, 02:36 AM
In order to have kerrighed working properly, only the cluster nodes should run the kerrighed kernel, the server should just be running a standard Ubuntu kernel.

Hello bigjimjams,

why can't the server participate in the kerrighed cluster? I couldn't find a technical explanation for this on their website, but maybe I haven't looked hard enough :confused:

Since kerrighed is an SSI cluster, it would be nice to be able to use also the server as an additional node; excluding from the migration, of course, some of the processes (e.g. X).

I'm building a very small experimental cluster, with less than 5 nodes... giving up on one would make quite a difference :)

bigjimjams
March 29th, 2009, 05:14 AM
Hi apprentice_clusterer, as far as I know it's because all the nodes that are part of the SSI cluster have to share a root file system of some sort, whether this is via NFS, UNFS3, etc. The server can't be in the cluster as it can't share the same root file system. Hope this helps.

bigjimjams
March 29th, 2009, 05:31 AM
Hi robasc, I've managed to get kerrighed running using the 64-bit version of Hardy. However, I ended up using revision 4762 of the SVN version of kerrighed, as I seemed to run into issues compiling 2.3 for a 64-bit kernel. I followed the same procedure as the guide on Easy Ubuntu Clustering, but the following was different in the kerrighed install:

sudo apt-get install subversion
svn checkout svn://scm.gforge.inria.fr/svn/kerrighed/trunk /nfsroot/kerrighed/usr/src/trunk -r 4762
sudo chroot /nfsroot/kerrighed
apt-get install automake autoconf libtool pkg-config gawk rsync bzip2 gcc-3.3 ncurses-dev wget lsb-release xmlto patchutils xutils-dev build-essential grub g++-3.3
cd /usr/src/trunk
./autogen.sh
./configure CC=gcc-3.3 CXX=g++-3.3
make defconfig
cd kernel
make menuconfig
cd ..
make kernel
make -i
make kernel-install
make install -i

I'll update the wiki soon so that it details how to install the SVN version too. In later revisions from SVN you can drop the -i flag, as they seem to have corrected the src folder == dest folder issue. You also have to make sure you enable certain options in the kernel for the new scheduler in Kerrighed. These can be found on the SchedConfig (http://www.kerrighed.org/wiki/index.php/SchedConfig) page.

As for the network card, I had the same chipset in my old machine and just enabled the 8169 in the kernel and it worked fine.
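
The "src folder == dest folder" issue lines up with the "[: 1: ==: unexpected operator" line in the build log posted earlier in the thread. That message format is what dash, the default /bin/sh on Ubuntu, prints when "[" is given "==": bash accepts it, but POSIX only defines "=" for string equality. The failed test then lets lndir run on two identical directories, which aborts the build. A minimal sketch of the portable form (same_dir is a made-up name, and this is not the actual Kerrighed Makefile):

```shell
#!/bin/sh
# same_dir A B: succeed when the two path strings are identical.
# Uses "=", the POSIX string-equality operator, so it behaves the same in
# dash and bash ("==" inside [ ] is a bash extension).
same_dir() {
    [ "$1" = "$2" ]
}

src=/usr/src/kerrighed-2.3.0
dst=/usr/src/kerrighed-2.3.0
if same_dir "$src" "$dst"; then
    echo "source and build directories are identical; skipping lndir"
else
    echo "would run: lndir -silent $src/modules"
fi
```

If this diagnosis is right, building with bash as make's shell (for example "make SHELL=/bin/bash") may avoid the error without needing -i; that is untested here.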

ajt
March 29th, 2009, 01:41 PM
Hi apprentice_clusterer, as far as I know it's because all the nodes that are part of the SSI cluster have to share a root file system of some sort, whether this is via NFS, UNFS3, etc. The server can't be in the cluster as it can't share the same root file system. Hope this helps.

Hello, bigjimjams and apprentice_clusterer.

That's not quite right - they have to share configuration information, and an easy way to achieve that is to use a single NFSROOT. However, you could just replicate the information manually. We've run Kerrighed on four independent Debian compute servers without any problem - well, at least not relating to sharing the root filesystem ;-)

In our case, I could share the server root filesystem, but I don't, because it's already being shared with openMosix compute nodes using UNFS3 with 'cluster' extensions enabled, as I mentioned previously. I'll post instructions about UNFS3 and clusterNFS on the Wiki when I've got time. Basically, the UNFS3 server interprets 'tags' for different client IP addresses or for clients in general, e.g.:


unique_file
unique_file$$192.168.1.98$$
unique_file$$192.168.1.99$$

shared_file
shared_file$$CLIENT$$


When "unique_file" is requested by a client, the UNFS3 server checks for the presence of 'tagged' files on the server. If tagged files are found, they are interpreted, otherwise the normal file is served. This makes it possible to share the root filesystem of the NFSROOT server with the PXE-booted compute nodes.

The contents of "unique_file" on the server are different from those of either tagged file served to the nodes with those IP addresses. The contents of "shared_file" on the server are different from the contents of "shared_file" on the clients, but are the same on all clients. There is a performance overhead when interpreting tags, so I don't use them where they are not actually needed.

This is all documented at:

http://unfs3.sourceforge.net/
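
The tag lookup described above can be modelled in a few lines. This is only an illustration of the idea, not UNFS3's implementation: resolve_tagged is a made-up name, the tag spelling is copied from the listing in this post (check the UNFS3 documentation for the exact syntax), and the precedence of a per-IP tag over the generic CLIENT tag is an assumption.

```shell
#!/bin/sh
# resolve_tagged DIR NAME CLIENT_IP: print the file a client would be served.
# Assumed lookup order: NAME$$IP$$, then NAME$$CLIENT$$, then plain NAME.
resolve_tagged() {
    for candidate in "$1/$2\$\$$3\$\$" "$1/$2\$\$CLIENT\$\$" "$1/$2"; do
        if [ -e "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

export_dir=$(mktemp -d)
# Recreate the example listing (single quotes keep the $$ literal).
for f in unique_file 'unique_file$$192.168.1.98$$' 'unique_file$$192.168.1.99$$' \
         shared_file 'shared_file$$CLIENT$$'; do
    : > "$export_dir/$f"
done

resolve_tagged "$export_dir" unique_file 192.168.1.98   # per-IP copy
resolve_tagged "$export_dir" unique_file 192.168.1.50   # falls back to the plain file
resolve_tagged "$export_dir" shared_file 192.168.1.50   # generic client copy
rm -rf "$export_dir"
```

Every lookup may probe several names before it finds a match, which is one way to picture the performance overhead mentioned above.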

Bye,

Tony.

floz23
March 29th, 2009, 08:36 PM
Greetings everyone! I've been watching this thread for a while now, and decided that I have to help a few friends try and reduce their povray rendering times :P

Now, I have a few questions that I can't seem to find clear answers to.

1. The server for the cluster will be a Core i7 machine with 12 GB of RAM, so it's obvious that I'd run a 64-bit OS. Does this mean that all of my nodes have to run 64-bit Kerrighed kernels too? Or can the nodes run both 32-bit and 64-bit builds of Kerrighed?

2. povray 3.7 is now natively multithreaded. Can anyone confirm that the threads from povray 3.7 are distributed through all the nodes?

Much thanks,
-Adam

bigjimjams
March 30th, 2009, 05:25 AM
Hi Adam, it depends on which setup you choose, but normally the server doesn't take part in the kerrighed cluster, so the server runs a different kernel from the nodes and it doesn't matter whether that kernel is 32-bit or 64-bit. However, I'm unable to say whether you can mix 32-bit and 64-bit kernels within the cluster; my guess is probably not. As for 2), kerrighed only supports process migration, so threads will only run on the CPU where their process was launched. If this CPU is multi-core, then the threads have access to all the cores of that CPU. Hope this helps.
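
Since only whole processes can migrate, one way to give the cluster something to spread out is to split the job into several independent processes instead of relying on threads. The sketch below shows the pattern with placeholder workers; run_workers is a made-up name, and a real job would start, say, one renderer process per frame or per image region in place of the placeholder command.

```shell
#!/bin/sh
# run_workers N OUTDIR: start N independent worker processes in the
# background and wait for all of them to finish.
run_workers() {
    i=1
    while [ "$i" -le "$1" ]; do
        # Placeholder worker: a separate process that records its own PID.
        sh -c 'echo "worker pid $$"' > "$2/worker.$i" &
        i=$((i + 1))
    done
    wait    # block until every background worker has exited
}

out=$(mktemp -d)
run_workers 4 "$out"
cat "$out"/worker.*     # four lines, each written by a different process
rm -rf "$out"
```

Whether the processes actually move to other nodes still depends on how Kerrighed's scheduler and migration settings are configured on the cluster.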

koukoobangoo
March 30th, 2009, 05:33 AM
Hi everybody..
I'm now working on clustering. I set up a trial cluster of three machines, just as described in EasyUbuntuClustering. I'm using Kerrighed.

The problem is that I couldn't find a solution for storage. It is said that Kerrighed can virtualize all the disks on each node. Is that true?

And if not, what shared-disk file system should I use?

Need help

Thanks

koukoobangoo
March 30th, 2009, 05:53 AM
Hi tony..

I know this is a separate module, but I can't find any way to download it. I went to the INRIA repository and finally found a version of Kerrighed containing KDFS instead of KerFS. I installed it, but it doesn't seem to be working.

Has anyone else found a solution for storage with kerrighed??

djamu
March 30th, 2009, 10:06 AM
floz23

Did you try to contact me thru my site ?




.....
Has anyone else found a solution for storage with kerrighed??

You could use OCFS2 > I wouldn't try it if I were you / on a small setup. It's designed for SAN storage with LUNs ( either iSCSI / fiber )

GlusterFS > works quite well ( albeit some parts run in userspace ), supports load balancing / HA configs /

or..
Just stick to NFS > you can tremendously speed it up using AUFS & a COW loop ( you only read from the server > changes are written to a local FS, e.g. tmpfs ) ...

floz23
March 30th, 2009, 01:51 PM
Did you try to contact me thru my site

I sent you a message a few moments ago...

agungaryo
March 31st, 2009, 08:01 AM
Hi bigjimjams, thanks for responding.

I'm new to clustering.
My simulation can be described as being like a web server.
I want to analyze how much capacity the web server has if I use clustering.

cluster nodes ------- server (web server )----- clients (web clients)

Thousands of clients will request the site, which overloads the web server (CPU usage 100%). I'm trying to solve that problem with clustering.

Any ideas on my problem?


thank you
agung aryo

ajt
March 31st, 2009, 04:03 PM
Hello, agung aryo.

A Beowulf is not the same thing as a web server farm: Most people use a DNS 'round-robin' to spread the load between multiple web servers. That means that when clients look up the web server's name, the DNS answers with the address of one machine in the available pool of servers.
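
The round-robin idea itself is simple enough to sketch: request number N goes to server N modulo the pool size. The function name and the server names below are made up for illustration. With real DNS round-robin the rotation is done by the DNS server across the address records for one hostname, and resolver caching makes the spread only approximate.

```shell
#!/bin/sh
# pick_server N SERVER...: print the server that request number N maps to,
# cycling through the pool in order.
pick_server() {
    n=$1
    shift                        # the remaining arguments are the pool
    idx=$(( n % $# + 1 ))        # 1-based position within the pool
    eval "echo \"\${$idx}\""     # print the idx-th remaining argument
}

# Six requests over a hypothetical pool of three servers.
req=0
while [ "$req" -lt 6 ]; do
    echo "request $req -> $(pick_server "$req" web1 web2 web3)"
    req=$((req + 1))
done
```

This spreads whole requests across machines, which is a different thing from what an SSI cluster does when it migrates the processes of a single job.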

Bye,

Tony.

agungaryo
March 31st, 2009, 05:12 PM
Thank you, ajt.
I have tried using multiple servers with round-robin balancing, but the result was not satisfactory. Some issues appeared, such as database crashes and DNS caching (while the server was getting hundreds of requests from clients), and when I checked, the common bottleneck wasn't bandwidth but CPU usage, which causes "time out" errors at the client.
Does Kerrighed really not help with my problem?

By the way, do I have to install my simulation on a cluster node (Kerrighed kernel) to use clustering?

thank you
agung aryo

robasc
April 2nd, 2009, 06:07 PM
Hello everyone, It's been a little while since I last posted. BigJimJams, I have not had a chance to try what you said yet, been busy working. I will try it this week when I get a chance.

apprentice_clusterer
April 4th, 2009, 05:23 AM
Hello, bigjimjams and ajt,

thank you both for having explained the problem and for having also suggested a solution.

I will try to put in practise the suggestions.

regards

robasc
April 13th, 2009, 09:20 AM
Well BigJimJams

I did everything you said and I got this error:

trying to load: pxelinux.cfg/default

could not find kernel image: linux
boot:

robasc
April 13th, 2009, 10:01 AM
Hey BigJimJams, I got it to boot and I am logged in to all of my nodes but I am still having some trouble though.

I am trying to use the commands from the setup files for kerrighed and they are not doing anything.

Check to see if nodes in the cluster are running

sudo krgadm nodes

When I try to run this I get "command not found".

So then I tried to run it in a chroot (chroot /nfsroot/kerrighed), and it asks whether I am sure I want to run kerrighed, but it does not let me type yes or no. It just goes back to the command prompt.

I looked at the processes and only saw CPUs 1 and 2 running, if I am viewing it correctly.

I do not think I understand how to use kerrighed yet.

Rickles65
April 15th, 2009, 01:02 PM
Maybe I'm just thick headed today... So is there nothing out there any more that does process (thread) migration?

We have a need (and it's just 3 machines) for high CPU threads to be moved off to the more idle machines.

We'd done this before (about 5 years ago) with OpenMosix. We just added the appropriate kernel mods, NFS mounts and off we went.

The machines are fully loaded in this config as well so there's no need for SSI in the current config either. Just the net mounts and kernel changes.

bigjimjams
April 17th, 2009, 05:10 AM
Hey BigJimJams, I got it to boot and I am logged in to all of my nodes but I am still having some trouble though.

I am trying to use the commands from the setup files for kerrighed and they are not doing anything.

Check to see if nodes in the cluster are running

sudo krgadm nodes

When I try to run this I get "command not found".

So then I tried to run it in a chroot (chroot /nfsroot/kerrighed), and it asks whether I am sure I want to run kerrighed, but it does not let me type yes or no. It just goes back to the command prompt.

I looked at the processes and only saw CPUs 1 and 2 running, if I am viewing it correctly.

I do not think I understand how to use kerrighed yet.

Hi Robasc, it might be that the kernel was built but the tools were not. What is in /usr/local/bin on the nodes? Also, are you using the svn version of kerrighed? If so, did you create the /config directory, modify the fstab of the nodes, and configure the new scheduler that kerrighed uses by running sudo krg_legacy_scheduler? Can you give us the tail of dmesg from one of the nodes?
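The first check above can be scripted. The following is a minimal sketch, assuming the tools land in /usr/local/bin as the question implies; the function name `missing_krg_tools` is made up, and the tool list contains only the commands named in this thread.

```shell
# Report which of the Kerrighed tools named in this thread are missing
# from a directory (default: /usr/local/bin, assumed install location).
missing_krg_tools() (
    dir=${1:-/usr/local/bin}
    for tool in krgadm krgcapset krg_legacy_scheduler; do
        # Print the name of every tool that is absent or not executable.
        [ -x "$dir/$tool" ] || printf '%s\n' "$tool"
    done
)
```

Run on a node, `missing_krg_tools` prints nothing if all three are present. `dmesg | tail -n 30` then gives the tail of the kernel log asked for above.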

jbbjshlws
April 20th, 2009, 10:58 AM
Hello all!,

Firstly, I would like to say this is a fantastic guide; it has helped a lot.

I have run into a few problems, but nothing a Google search didn't solve (the problems are/were mostly from lack of experience with Linux).

The problem I am facing now, which Google couldn't help with, is this error after I have PXE booted the client node:


Gave up waiting for root device. Common problems:

-Boot args (cat /proc/cmdline)
-Check rootdelay= (did the system wait long enough?)
-Check root= (did the system wait for the right device?)
-Missing modules (cat /proc/modules; ls /dev)
ALERT! /dev/nfs does not exist. Dropping to a shell!


I have read through and seen that robasc has had this same error, but I didn't see any solution to the problem. My setup differs from Rob's in that I am running all i386 32-bit machines, and I stayed with the suggested:
$ sudo apt-get install debootstrap
$ sudo debootstrap --arch i386 hardy /nfsroot/kerrighed http://archive.ubuntu.com/ubuntu/

for the bootable file system.

I am not sure why I am getting this error, and any help would be greatly appreciated. Let me know if you need any logs (and let me know how to get them... lol).

Cheers,
Josh

bigjimjams
April 21st, 2009, 03:49 PM
Hello all!,

Firstly, I would like to say this is a fantastic guide; it has helped a lot.

I have run into a few problems, but nothing a Google search didn't solve (the problems are/were mostly from lack of experience with Linux).

The problem I am facing now, which Google couldn't help with, is this error after I have PXE booted the client node:


Gave up waiting for root device. Common problems:

-Boot args (cat /proc/cmdline)
-Check rootdelay= (did the system wait long enough?)
-Check root= (did the system wait for the right device?)
-Missing modules (cat /proc/modules; ls /dev)
ALERT! /dev/nfs does not exist. Dropping to a shell!


I have read through and seen that robasc has had this same error, but I didn't see any solution to the problem. My setup differs from Rob's in that I am running all i386 32-bit machines, and I stayed with the suggested:
$ sudo apt-get install debootstrap
$ sudo debootstrap --arch i386 hardy /nfsroot/kerrighed http://archive.ubuntu.com/ubuntu/

for the bootable file system.

I am not sure why I am getting this error, and any help would be greatly appreciated. Let me know if you need any logs (and let me know how to get them... lol).

Cheers,
Josh

Hi Josh, is this with the standard Ubuntu kernel, rather than the kerrighed kernel? If so, then I can only suggest that the problem may lie in the /etc/exports file or in the /srv/tftp/pxelinux.cfg/default file. Can you put the contents of both files on here so I can see if I spot something out of the ordinary? However, if this is using the kerrighed kernel, have you made sure to include the drivers for your network card and NFS when configuring the kernel? Hope this leads in the right direction.
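One quick thing to rule out on the server is a pxelinux config that names a kernel or initrd file which is not actually in the TFTP directory. A rough sketch of such a check follows; the function name `missing_pxe_files` is invented here, and it only understands the simple KERNEL / APPEND initrd= layout used in this thread.

```shell
# List the kernel and initrd files named in a pxelinux config that are
# missing from the TFTP root directory.
missing_pxe_files() (
    cfg=$1     # e.g. /srv/tftp/pxelinux.cfg/default
    root=$2    # e.g. /srv/tftp
    {
        # File name given on the KERNEL line.
        awk 'toupper($1) == "KERNEL" { print $2 }' "$cfg"
        # File name given as initrd=<file> on the APPEND line.
        tr ' \t' '\n\n' < "$cfg" | sed -n 's/^initrd=//p'
    } | while read -r f; do
        [ -e "$root/$f" ] || printf '%s\n' "$f"
    done
)
```

For example, `missing_pxe_files /srv/tftp/pxelinux.cfg/default /srv/tftp` prints nothing when both files are in place.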

jbbjshlws
April 21st, 2009, 10:01 PM
Hello bigjimjams!
I have followed the how-to guide exactly, step by step. I am very new to Linux, so I am not 100% sure which kernel has been used, but it should be exactly what was in your how-to.

Now the contents of /etc/exports are:

# /etc/exports: the access control list for filesystems which may be exported
# to NFS clients. See exports(5).
#
# Example for NFSv2 and NFSv3:
# /srv/homes hostname1(rw,sync) hostname2(ro,sync)
#
# Example for NFSv4:
# /srv/nfs4 gss/krb5i(rw,sync,fsid=0,crossmnt)
# /srv/nfs4/homes gss/krb5i(rw,sync)
#
# /etc/exports #
/nfsroot/kerrighed 192.168.1.0/255.255.255.0(rw,no_subtree_check,async,no_root_squash)

and the contents of /srv/tftp/pxelinux.cfg/default are:

LABEL linux
KERNEL vmlinuz-2.6.27-7-generic
APPEND root=/dev/nfs initrd=initrd.img-2.6.27-7-generic nfsroot=192.168.1.1:/nfsroot/kerrighed ip=dhcp rw


I am able to boot through to where it says:
ALERT! /dev/nfs does not exist. Dropping to a shell!

Now I have looked, and I do not have /dev/nfs either in the root of my file system or at /nfsroot/kerrighed/dev/nfs for the bootable file system. Is this normal?

If it helps, I am using 2.6.27-7-generic as my Ubuntu installation; I haven't upgraded to 2.6.27-11-generic, as I figured this could throw more spanners in the works.

I can give you my MSN details if it would help speed up communication; I will then transcribe the solutions to this forum.

bigjimjams
April 22nd, 2009, 04:27 AM
Hello bigjimjams!
I have followed the how-to guide exactly, step by step. I am very new to Linux, so I am not 100% sure which kernel has been used, but it should be exactly what was in your how-to.

Now the default contents of etc/exports is:



and the contents of /srv/tftp/pxelinux is:




I am able to boot through to where it says:


Now I have looked, and I do not have /dev/nfs either in the root of my file system or at /nfsroot/kerrighed/dev/nfs for the bootable file system. Is this normal?

If it helps, I am using 2.6.27-7-generic as my Ubuntu installation; I haven't upgraded to 2.6.27-11-generic, as I figured this could throw more spanners in the works.

I can give you my MSN details if it would help speed up communication; I will then transcribe the solutions to this forum.

Hi Josh, I think I may have just spotted a mistake in the guide. In /etc/exports the ip address should read 192.168.1.1/255.255.255.0 and not 192.168.1.0/255.255.255.0 as it states. This is then the same as the ip address in /srv/tftp/pxelinux.cfg/default which states where to get the nfsroot from. Hope this helps.

jbbjshlws
April 22nd, 2009, 05:04 AM
Hello,
I have changed the IP address to match the IP of the server in both of those files, but I am still getting the same error:

ALERT! /dev/nfs does not exist. Dropping to a shell!

I will double-check that all of the other files used refer to the same IP address. Could anything else be causing this problem?

Do I need to repeat any of the steps now that I have changed the values in those files (besides restarting the daemons)?

mmcastig
April 22nd, 2009, 11:33 AM
Bigjimjams,
I am experiencing the exact same problem as jbbjshlws. My /etc/exports and /srv/tftp/pxelinux.cfg/default read respectively as follows:

# /etc/exports: the access control list for filesystems which may be exported
# to NFS clients. See exports(5).
#
# Example for NFSv2 and NFSv3:
# /srv/homes hostname1(rw,sync) hostname2(ro,sync)
#
# Example for NFSv4:
# /srv/nfs4 gss/krb5i(rw,sync,fsid=0,crossmnt)
# /srv/nfs4/homes gss/krb5i(rw,sync)
#
# /etc/exports #
/nfsroot/kerrighed 192.168.1.1/255.255.255.0(rw,no_subtree_check,async,no_root_squash)
--------------------------------------------------------------------------
LABEL linux
KERNEL vmlinuz-2.6.27-13-generic
APPEND root=/dev/nfs initrd=initrd.img-2.6.27-13-generic nfsroot=192.168.1.1:/nfsroot/kerrighed ip=dhcp rw
--------------------------------------------------------------------------
The node just says that it can't find /dev/nfs and drops to a shell.
Hopefully this helps.

-mmcastig

P.S I do have the 2.6.27-14 kernel image, but it panics on boot for some reason, so I've reverted the whole thing back to the 13 image. Not sure if that means my computer is wack or something like that.

jbbjshlws
April 23rd, 2009, 05:55 AM
That makes me feel a little less silly, if two of us had the same issue!

I have tried a couple of things, but nothing seems to get away from this error!,

please make sure if you work it out to post the results for me, and vice versa.

Cheers,
Josh

mmcastig
April 23rd, 2009, 05:56 PM
It occurs to me that I might be barking up the wrong tree in pursuing Kerrighed. I'm trying to set up a load-balancing cluster, mostly for mass music manipulation (e.g. ID3 tag editing, mass filename changes, and CDDB matching) with the program Easytag. I'd also like to just bolster the performance of my tower (I have 4 other computers sitting around). Should I be trying to set up a Kerrighed cluster for all this, or should I be headed in more of a Mosix direction? Or something I haven't tried yet?

jbbjshlws
April 23rd, 2009, 06:08 PM
From what I understand, Kerrighed can do everything that openMosix can do and is still supported; both are SSI systems. I am not the guru of clustering, though.

cfrieler
April 23rd, 2009, 11:37 PM
If anyone can venture a guess or provide references.....

I played with the first versions of OpenMosix not too long after I began seriously pursuing clustering. I've been happily using Caos Linux with Warewulf and now Caos-NSA, but I've always wanted the transparency of SSI. One issue that I can't find info on is the following: I routinely compile my MCMC simulations to use OpenMP on each 4-CPU node and then distribute tasks among the nodes using MPI. If thread migration is allowed to occur on an SSI cluster, will it be smart enough to move all the threads that need the same local memory?

Also...
I'm about to try the procedure in the HowTo on a collection of older SMP boxes I have, but they are all based on the Intel 450nx chipset. Past versions of Ubuntu have had issues with this chipset, and the problem has been resolved and reintroduced a few times. Any idea whether the latest kernel works on these older quad slot systems?

Thanks, and when I get something up and running, I'd be happy to perform testing for others.
Cliff

jbbjshlws
April 24th, 2009, 10:19 AM
Hello,
I am still getting the same error:

Gave up waiting for root device. Common problems:

-Boot args (cat /proc/cmdline)
-Check rootdelay= (did the system wait long enough?)
-Check root= (did the system wait for the right device?)
-Missing modules (cat /proc/modules; ls /dev)
ALERT! /dev/nfs does not exist. Dropping to a shell!

I have looked up the /dev/nfs issue through Google and tried some of the tips and tricks to get it working, but none have worked so far. I would still love a helping hand, and I am happy to send a PayPal donation to the project or to someone (your choice) who can get the issue fixed before Sunday. ("Issue fixed" means I can get past this error; if you stick around long enough to help me get the Kerrighed kernel running, that would be appreciated too.) If you are interested, please let me know ASAP so I can pass on my details.

Kind Regards,
Joshua

jedi453
April 24th, 2009, 02:51 PM
Hello,
I am still getting the same error:

Gave up waiting for root device. Common problems:

-Boot args (cat /proc/cmdline)
-Check rootdelay= (did the system wait long enough?)
-Check root= (did the system wait for the right device?)
-Missing modules (cat /proc/modules; ls /dev)
ALERT! /dev/nfs does not exist. Dropping to a shell!

I have looked up the /dev/nfs issue through Google and tried some of the tips and tricks to get it working, but none have worked so far. I would still love a helping hand, and I am happy to send a PayPal donation to the project or to someone (your choice) who can get the issue fixed before Sunday. ("Issue fixed" means I can get past this error; if you stick around long enough to help me get the Kerrighed kernel running, that would be appreciated too.) If you are interested, please let me know ASAP so I can pass on my details.

Kind Regards,
Joshua

It looks like you're using the initrd (or initramfs) that came with your install. This might not work, as it's not built for running over NFS. See my post (#86 in this thread) about this problem and a way that fixed it for me. Don't worry about doing this for the kerrighed kernel, as you won't need an initrd.
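Post #86 is not reproduced in this part of the thread, so the following is only a sketch of one common way to build an NFS-capable initrd with Ubuntu's initramfs-tools, not necessarily the method described there. It assumes the usual BOOT= and MODULES= settings in initramfs.conf; the function name `make_netboot_conf` is made up for illustration.

```shell
# Write a copy of an initramfs-tools config with NFS booting switched on.
# BOOT=nfs and MODULES=netboot are the initramfs-tools settings for a
# network root; source and destination are passed in so that the system's
# own config file is not edited in place.
make_netboot_conf() (
    src=$1
    dst=$2
    sed -e 's/^BOOT=.*/BOOT=nfs/' \
        -e 's/^MODULES=.*/MODULES=netboot/' "$src" > "$dst"
)
```

A possible use, with paths that are assumptions: copy /etc/initramfs-tools to a scratch directory such as /tmp/netboot, run `make_netboot_conf /etc/initramfs-tools/initramfs.conf /tmp/netboot/initramfs.conf`, build the image with `mkinitramfs -d /tmp/netboot -o /srv/tftp/initrd.img-netboot 2.6.27-7-generic`, and point the initrd= entry in the pxelinux config at the new file.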

Good Luck!

jbbjshlws
April 25th, 2009, 06:11 AM
OK, I have tried what you suggested, jedi453, and I am now getting a different error:

read: Connection refused

and when it finally realises that the connection has actually been refused (after about 1 hour of trying)

it comes up with

mount call failed:13
Done.
Begin: Running /scripts/nfs-bottom ...
Done.
Done.
Begin: Running /scripts/init-bottom ...
mount: mounting /root/dev on /dev/.static/dev failed: No such file or directory
Done.
mount: mounting /sys on /root/sys failed: No such file or directory
mount: mounting /proc on /root/proc failed: No such file or directory
Target filesystem doesn't have /sbin/init.
No init found. Try passing init= bootarg.

BusyBox v1.10.2 (Ubuntu 1:1.10.2-1ubuntu6) built-in shell (ash)
Enter 'help' for a list of built-in commands.

(initramfs)


So I am not sure if this is what you were expecting, but this is the result I got, and I am not sure where to go from here. The offer still stands if someone can help me out really quickly!

Cheers,
Josh

jedi453
April 25th, 2009, 09:34 PM
OK, I have tried what you suggested, jedi453, and I am now getting a different error:

read: Connection refused

and when it finally realises that the connection has actually been refused (after about 1 hour of trying)

it comes up with

mount call failed:13
Done.
Begin: Running /scripts/nfs-bottom ...
Done.
Done.
Begin: Running /scripts/init-bottom ...
mount: mounting /root/dev on /dev/.static/dev failed: No such file or directory
Done.
mount: mounting /sys on /root/sys failed: No such file or directory
mount: mounting /proc on /root/proc failed: No such file or directory
Target filesystem doesn't have /sbin/init.
No init found. Try passing init= bootarg.

BusyBox v1.10.2 (Ubuntu 1:1.10.2-1ubuntu6) built-in shell (ash)
Enter 'help' for a list of built-in commands.

(initramfs)


So I am not sure if this is what you were expecting, but this is the result I got, and I am not sure where to go from here. The offer still stands if someone can help me out really quickly!

Cheers,
Josh

OK, I've searched the Internet without perfect matches, I've found "tftpd: read: connection refused" ( http://forum.soft32.com/linux2/tftpd-read-Connection-refused-ftopict11717.html ) which suggests shutting off your firewall. I can only guess at this point, so it might help if you posted all the files the kerrighed guide tells you to edit and create, or double check them yourself. Also look here: http://www.howtoforge.com/pxe_booting_debian to learn a little more about pxe booting, many of the problems I faced setting up kerrighed were from lack of understanding on my part. It could be the version of tftpd you're using or the edition of ubuntu, or a mistake in following the guide. You should try reinstalling all the required software then rebooting the servers. I also had an odd problem (granted not the same one) setting mine up because I hadn't rebooted in a long while, so rebooting might help also.

Good Luck!

jbbjshlws
April 25th, 2009, 09:58 PM
OK, after every change I have made I have restarted the servers, just in case something in the daemons needed refreshing (I guess this stems from the old Windows habit of restarting when there is a problem), but this has not changed any of the output. The kernel version I am using is 2.6.27-7; I kept it the same as the guide to cause less confusion. I will try redoing the entire how-to on another box to see if that changes anything.

Cheers,
Josh

ajt
April 26th, 2009, 07:26 AM
It occurs to me that I might be barking up the wrong tree in pursuing Kerrighed. I'm trying to set up a load-balancing cluster, mostly for mass music manipulation (e.g. ID3 tag editing, mass filename changes, and CDDB matching) with the program Easytag. I'd also like to just bolster the performance of my tower (I have 4 other computers sitting around). Should I be trying to set up a Kerrighed cluster for all this, or should I be headed in more of a Mosix direction? Or something I haven't tried yet?

Hello, mmcastig.

I run a 90-node openMosix cluster and it works very well, but I'm going to start using Kerrighed instead soon because the openMosix project was officially closed over a year ago and openMosix is now unsupportable. It uses the 2.4 kernel, which does not support SATA hard drives. I would try to get Kerrighed running on your kit - It does work ;-)

Bye,

Tony.

ajt
April 26th, 2009, 07:45 AM
If anyone can venture a guess or provide references.....

[...]If thread migration is allowed to occur on an SSI cluster, will it be smart enough to move all the threads that need the same local memory?


Hello, Cliff.

Neither openMosix nor Kerrighed supports 'thread' (i.e. lightweight process) migration. In fact, POSIX threads are usually implemented as separate processes under Linux and communicate using shared memory. That means they will not migrate under openMosix or the 'stable' version of Kerrighed.

What you want is NUMA (Non-Uniform Memory Architecture) where the kernel attempts to observe data locality using memory management hardware to keep data in the memory of the CPU where it is actually being processed.

SSI (Single System Image) is more like SMP (Symmetric Multi Processing) where data locality is not considered by the load balancing algorithm, which just distributes the CPU load independently of memory bandwidth. The openMosix load balancing algorithm does try to avoid memory depletion on a processing element by migrating processes away, but it only supports uniprocessor architecture and will only use one CPU even if two or more are present. I'm not sure how stable Kerrighed support for SMP is yet.

Bye,

Tony.

ajt
April 26th, 2009, 08:19 AM
OK, I've searched the Internet without perfect matches, I've found "tftpd: read: connection refused" ( http://forum.soft32.com/linux2/tftpd-read-Connection-refused-ftopict11717.html ) which suggests shutting off your firewall. I can only guess at this point, so it might help if you posted all the files the kerrighed guide tells you to edit and create, or double check them yourself. Also look here: http://www.howtoforge.com/pxe_booting_debian to learn a little more about pxe booting, many of the problems I faced setting up kerrighed were from lack of understanding on my part. It could be the version of tftpd you're using or the edition of ubuntu, or a mistake in following the guide. You should try reinstalling all the required software then rebooting the servers. I also had an odd problem (granted not the same one) setting mine up because I hadn't rebooted in a long while, so rebooting might help also.

Good Luck!

Hello, jedi453 and Josh.

There are two different issues here:

#1 Does tftpd work
#2 Does NFS work

You can test both of these independently from any Linux client. Just use the tftp client on a stand-alone Ubuntu machine to see if you can download the pxelinux secondary bootstrap. Then, try to mount the NFSROOT on e.g. /mnt. [tip] Use "showmount -e server_name" on the client to see what filesystems 'server_name' is exporting. You should also restart the NFS server after making changes to /etc/exports.
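The two checks can be wrapped up roughly as below. This is a sketch under assumptions: the `-c` option belongs to the tftp-hpa client (other tftp clients differ), the function names are invented, and the server address, file name and export path in the usage note are the ones used elsewhere in this thread.

```shell
# Check 1: can a file be fetched over TFTP? (tftp-hpa client syntax assumed)
check_tftp() (
    [ $# -eq 2 ] || { echo "usage: check_tftp SERVER FILE" >&2; exit 2; }
    # Download into a scratch directory and make sure the file is non-empty.
    cd "$(mktemp -d)" && tftp "$1" -c get "$2" && [ -s "$2" ]
)

# Check 2: is the export listed by the server, and can it be mounted?
check_nfs() (
    [ $# -eq 3 ] || { echo "usage: check_nfs SERVER EXPORT MOUNTPOINT" >&2; exit 2; }
    showmount -e "$1" | grep -q "^$2[[:space:]]" || exit 1
    sudo mount -t nfs "$1:$2" "$3" && ls "$3" >/dev/null && sudo umount "$3"
)
```

From a client on the private network this might be run as `check_tftp 192.168.1.1 pxelinux.0` and `check_nfs 192.168.1.1 /nfsroot/kerrighed /mnt`; the file name pxelinux.0 is the conventional name of the bootstrap and is an assumption here. If the first fails, look at tftpd; if only the second fails, look at /etc/exports and the NFS server.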

Bye,

Tony.

ajt
April 26th, 2009, 08:26 AM
Bigjimjams,
I am experiencing the exact same problem as jbbjshlws. My /etc/exports and /srv/tftp/pxelinux.cfg/default read respectively as follows:

# /etc/exports: the access control list for filesystems which may be exported
# to NFS clients. See exports(5).
#
# Example for NFSv2 and NFSv3:
# /srv/homes hostname1(rw,sync) hostname2(ro,sync)
#
# Example for NFSv4:
# /srv/nfs4 gss/krb5i(rw,sync,fsid=0,crossmnt)
# /srv/nfs4/homes gss/krb5i(rw,sync)
#
# /etc/exports #
/nfsroot/kerrighed 192.168.1.1/255.255.255.0(rw,no_subtree_check,async,no_root_squash)


Hello, mmcastig.

Did you intend to export your NFSROOT to exactly one host?

Your exports config only exports the filesystem to the single host 192.168.1.1. To export it to the network 192.168.1.0 you should use:

/nfsroot/kerrighed 192.168.1.0/255.255.255.0(rw,no_subtree_check,async,no_root_squash)

Bye,

Tony.

ajt
April 26th, 2009, 08:30 AM
Maybe I'm just thick headed today... So is there nothing out there any more that does process (thread) migration?

We have a need (and it's just 3 machines) for high CPU threads to be moved off to the more idle machines.

We'd done this before (about 5 years ago) with OpenMosix. We just added the appropriate kernel mods, NFS mounts and off we went.

The machines are fully loaded in this config as well so there's no need for SSI in the current config either. Just the net mounts and kernel changes.

Hello, Rickles65.

Did you really migrate threads, using the MIGSHM patch?

I never got this to work properly - It was only ever 'alpha' release software. We stopped using it because it makes openMosix unstable.

Bye,

Tony.

ajt
April 26th, 2009, 08:40 AM
Hello bigjimjams!
I have followed the how to guide exactly step by step, i am very new to linux so i am nost 100% sure what kernal has been used, but it should be exactly what was in your how to.


Hello, Josh.

I don't think you did follow bigjimjams guide exactly ;-)


LABEL linux
KERNEL vmlinuz-2.6.27-7-generic
APPEND root=/dev/nfs initrd=initrd.img-2.6.27-7-generic nfsroot=192.168.1.1:/nfsroot/kerrighed ip=dhcp rw


You're trying to boot the 'generic' Ubuntu kernel, not the patched kerrighed kernel! The generic kernel does not support NFSROOT...

Bye,

Tony.

ajt
April 26th, 2009, 09:35 AM
Thank you, ajt.
But I've tried using multiple servers with round-robin balancing, and the result is not satisfactory. Some issues appear, such as database crashes and DNS cache problems (when the server gets hundreds of requests from clients), and I've checked that the common error is not in bandwidth but in CPU usage, which causes "time out" errors at the client.
Does kerrighed really not help with my problem?


Hello, Agung.

The Apache web server spawns multiple threads which can be executed on different SMP processors local to the machine that Apache is running on. The Apache server keeps track of these threads using shared memory, so the SSI you use must support thread (as opposed to process) migration: At present, Kerrighed does not support thread migration.


By the way, do I have to install my simulation on the cluster nodes (kerrighed kernel) to use clustering?


Yes, and you need to give the parent apache2 process inheritable migration capability using:


krgcapset -d <apache2.pid> +CAN_MIGRATE


Where <apache2.pid> is the process ID of the apache2 server.

Bye,

Tony.

cfrieler
April 27th, 2009, 11:18 PM
... In fact, POSIX threads are usually implemented as separate processes under Linux and communicate using shared memory. That means they will not migrate under openMosix or the 'stable' version of Kerrighed.


Tony,
Thanks for the very informative reply. A lot of complexity here that is going to require some study. But since you apparently command a broad range within this topic area, let me ask a couple additional questions.

Does OpenMP produce threads, POSIX-compliant threads, or its own separate processes? Is it compiler dependent?

To keep my programming model simple, I've built a relatively homogeneous cluster. All nodes are quad CPU with 4GB RAM, and they are run diskless, using CAOS-NSA. Currently, I write a task processing code that is compiled for 4 CPUs, highly optimized and benchmarked to run several orders of magnitude longer than the comm time necessary to transfer it, the supporting data and the results. These tasks are kicked off by a separate MPI code that has a single instance on each node.

My thought was that with SSI I could do away with the MPI code. I had hoped that if I kicked off all the tasks on the master node, they would migrate as necessary to available nodes for processing. This would relieve me of coordinating, and make it easier to mix very different speeds of nodes.

Given that context, any additional comments about Ubuntu/Kerrighed? Any suggestions on other approaches/technologies I should look at?

Again, thanks for your insight.
Regards, Cliff

jbbjshlws
April 28th, 2009, 09:56 AM
Hello, Josh.

I don't think you did follow bigjimjams guide exactly ;-)


LABEL linux
KERNEL vmlinuz-2.6.27-7-generic
APPEND root=/dev/nfs initrd=initrd.img-2.6.27-7-generic nfsroot=192.168.1.1:/nfsroot/kerrighed ip=dhcp rw


You're trying to boot the 'generic' Ubuntu kernel, not the patched kerrighed kernel! The generic kernel does not support NFSROOT...

Bye,

Tony.

This was taken from the how to guide:
This guide is split into two parts: the first covers how to setup the server for diskless booting the nodes using the current kernel, the second part of the guide covers setting up Kerrighed 2.3.0 and incorporating it into the diskless boot configuration of part one.

It states that you are using the current kernel, and later that you set up the Kerrighed kernel and incorporate it. So at the moment I am using the current kernel, until I have been able to boot successfully and can start part two of the guide. For the moment I think it is set up as per the guide for where I am up to; sorry for the confusion.

I am still having this issue. I installed Ubuntu onto another box, and I can connect to the NFS share on the server without a problem, so it is being shared, but the problem persists. I have tried the guide in post 86, but it comes up with other errors, so I reinstalled the machine from scratch to make sure I didn't stuff up any settings, but I have the same problem. I have replicated the same setup on some virtual PCs, with the same problem. So either I am consistently doing the same thing wrong, or the guide does not cater for my version of Ubuntu (2.6.27-7-generic).

I have followed the guide at http://www.cpsspecialties.webhop.org/kerrighedCluster.htm as well, with the same issues. Its author is running the same version of Ubuntu as I am, had the same issue earlier on, and got it solved two weeks ago, but has not posted the solution yet. This could be the key. Thanks for the help so far; I'm sure we can work this out! As I have said before, I can give SSH details to people if they need to have a look around to spot a silly setting I have got wrong. This might make solving this over the Internet a little bit easier!


Cheers,
Josh

jedi453
April 28th, 2009, 06:09 PM
This was taken from the how to guide:

It states that you are using the current kernel, and later that you set up the Kerrighed kernel and incorporate it. So at the moment I am using the current kernel, until I have been able to boot successfully and can start part two of the guide. For the moment I think it is set up as per the guide for where I am up to; sorry for the confusion.

I am still having this issue. I installed Ubuntu onto another box, and I can connect to the NFS share on the server without a problem, so it is being shared, but the problem persists. I have tried the guide in post 86, but it comes up with other errors, so I reinstalled the machine from scratch to make sure I didn't stuff up any settings, but I have the same problem. I have replicated the same setup on some virtual PCs, with the same problem. So either I am consistently doing the same thing wrong, or the guide does not cater for my version of Ubuntu (2.6.27-7-generic).

I have followed the guide at http://www.cpsspecialties.webhop.org/kerrighedCluster.htm as well, with the same issues. Its author is running the same version of Ubuntu as I am, had the same issue earlier on, and got it solved two weeks ago, but has not posted the solution yet. This could be the key. Thanks for the help so far; I'm sure we can work this out! As I have said before, I can give SSH details to people if they need to have a look around to spot a silly setting I have got wrong. This might make solving this over the Internet a little bit easier!


Cheers,
Josh


Edit: did you set up the server with a static ip address?

Our server has two network cards, one is setup for internet access, the other is connected to a switch, which connects the four nodes. The server network card connected to the switch is manually configured with the IP address 192.168.1.1 and subnet mask 255.255.255.0.

See here for details: http://www.yolinux.com/TUTORIALS/LinuxTutorialNetworking.html

/Edit

So you've done the guide twice without success. It is written for Hardy Heron (8.04), and it looks like you're running Intrepid Ibex (8.10). I can't guarantee that's the problem, but since you've tried twice without success, installing Hardy Heron and restarting the guide might be a reasonable alternative. As kerrighed depends on 2.6.20, you aren't getting a better kernel by using Intrepid, so the only things you would be giving up are the newer packages and maybe some time.

The other thing to consider is skipping that test and continuing with the guide (starting at the beginning of part two) as if it had worked. After all, it is just a test. At any rate, I would post all the files the guide tells you to edit/create as a zip file attachment or in a post.

djdirty
April 30th, 2009, 07:07 AM
There has got to be a global issue here... Or at least something that is so insignificantly stupid that it's funny!

I have tried on a multitude of different versions of Ubuntu to get this working! From 6.10 right up to 9.10! This is just getting annoying now!

I am working with Josh on this one.. And I haven't got any further than he has.. I am stuck at the dropping to busy box..

I have done as the error message says and checked /proc/cmdline, and the UUID is in fact exactly the same UUID as /dev/sda1 (which is my root partition). Is this the correct UUID, or is it meant to be something different? My understanding is that when the client is trying to boot, it is waiting for the root device with that UUID, and the fact that that device has already been booted (on the server) is an issue.
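For comparison, the pxelinux APPEND line quoted earlier in the thread passes root=/dev/nfs, so that is what a node booted from it would be expected to show. A small sketch for pulling the root= argument out of a kernel command line follows; the function name `root_arg` is made up for illustration.

```shell
# Print the value of the root= argument from a kernel command line string.
# nfsroot=... is deliberately not matched, only an argument starting "root=".
root_arg() {
    printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^root=//p' | tail -n 1
}
```

Running `root_arg "$(cat /proc/cmdline)"` in the node's (initramfs) shell shows what the node was really booted with. If it prints a UUID rather than /dev/nfs, one possibility is that the command was run on the server, or that the node is not picking up the APPEND line from /srv/tftp/pxelinux.cfg/default.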

Can anyone help in this issue? Would be much appreciated! We have big plans for this!

Cheers,

Dave

jbbjshlws
April 30th, 2009, 08:57 AM
Hello All,
I have tried to complete the setup as jedi suggested, but it still did not work. I have got numerous errors, probably stemming back to:

apt-get install automake autoconf libtool pkg-config awk rsync bzip2 gcc-3.3 libncurses5 libncurses5-dev wget lsb-release xmlto patchutils xutils-dev build-essential

This threw a lot of errors. I was not sure of the version of awk, nor of many other settings (Linux menus etc).

I will be investigating these further before I post about them on the forum.

My question is: once this is fully set up, why can't someone just upload a VM image of a working system? Only a few small settings would need to be changed to make it work on your own network, and it seems it would be a lot more reliable and repeatable. Could anyone who has been graced with a working system do this? I am more than happy to host files on my server to better the community (any files). Please let me know, and I can give FTP details.

Cheers,
Josh

jedi453
April 30th, 2009, 12:05 PM
Sorry I've had such a bad record fixing these problems :( .

Are you still using Intrepid Ibex (8.10)? It doesn't have gcc-3.3 (see http://packages.ubuntu.com/intrepid/allpackages, which takes a long time to load, even with broadband), and that would cause a lot of errors: apt-get aborts the whole command when one package cannot be found, so none of the packages would install, and nothing would work after that...
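
One way to check this before running the long install line is to ask apt which of the packages it can actually offer on the installed release. This is only a minimal sketch: it assumes `apt-cache policy` prints its usual "Candidate:" line, and the helper names `has_candidate` and `report_packages` are made up for illustration.

```shell
# has_candidate: reads `apt-cache policy <pkg>` output on stdin and succeeds
# only if a real candidate version is listed (not "(none)", not absent).
has_candidate() {
    grep -Eq '^[[:space:]]*Candidate:[[:space:]]*[^([:space:]]'
}

# report_packages: hypothetical wrapper, prints one status line per package
# name given on the command line.
report_packages() {
    for pkg in "$@"; do
        if apt-cache policy "$pkg" 2>/dev/null | has_candidate; then
            echo "ok       $pkg"
        else
            echo "MISSING  $pkg"
        fi
    done
}
```

Running something like `report_packages automake autoconf libtool gcc-3.3` should then show which names the release does not carry, so they can be dealt with one at a time.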

djdirty
April 30th, 2009, 05:10 PM
I have tried running through the installation with Hardy (8.04) and I get the error:

Begin: Running /scripts/nfs-premount
Connect: Connection timed out

Then it just sits there... Whats the go there?


~Davo

nerdopolis
April 30th, 2009, 05:24 PM
Hi. I got my server to boot (not yet with Kerrighed). The problem was in the initrd.

On the root server (not /nfsroot/kerrighed) try:

Back up your current initrd:
sudo cp /boot/initrd.img-$(uname -r) /boot/initrd.img-$(uname -r).bak

Back up your initrd config file:
sudo cp /etc/initramfs-tools/initramfs.conf ~/initrdconfig

This command will replace your initramfs config with the one required to get NFS booting:
printf 'MODULES=netboot\nBUSYBOX=y\nCOMPCACHE_SIZE=""\nBOOT=nfs\nDEVICE=eth0\nNFSROOT=auto\n' | sudo tee /etc/initramfs-tools/initramfs.conf

Create the new initrd:
sudo update-initramfs -u

Restore the initramfs config:
sudo mv ~/initrdconfig /etc/initramfs-tools/initramfs.conf

Copy the new initrd and the kernel to the /srv/tftp folder:
sudo cp /boot/initrd.img-$(uname -r) /srv/tftp/
sudo cp /boot/vmlinuz-$(uname -r) /srv/tftp/

Remove the NFS initrd:
sudo rm /boot/initrd.img-$(uname -r)

Put the old initrd in place:
sudo mv /boot/initrd.img-$(uname -r).bak /boot/initrd.img-$(uname -r)

Update the configuration. This whole block is a single command!
printf "LABEL linux
KERNEL vmlinuz-$(uname -r)
APPEND root=/dev/nfs initrd=initrd.img-$(uname -r) nfsroot=192.168.1.0:/nfsroot/kerrighed ip=dhcp rw" | sudo tee /srv/tftp/pxelinux.cfg/default

After that, it booted and froze at "Starting NFS Common Utilities". I was able to remove it from the startup scripts without any problems. It still boots and I am able to browse and write to the file system, but just in case, move it to your home folder rather than deleting it:
sudo mv /nfsroot/kerrighed/etc/init.d/nfs-common ~/nfs-common

nerdopolis
April 30th, 2009, 05:37 PM
djdirty: sounds like your NFS server isn't working... (that's just my guess)

Assuming you are using the guide's (https://wiki.ubuntu.com/EasyUbuntuClustering/UbuntuKerrighedClusterGuide) folders and IP addresses:

does your /etc/exports look like this?
/nfsroot/kerrighed 192.168.1.0/255.255.255.0(rw,no_subtree_check,async,no_root_squash)

did you run:
sudo exportfs -avr

did you run:
sudo /etc/init.d/nfs-kernel-server restart

Thats just my guess...
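
The exports check can also be scripted, which makes it easier to repeat after each edit. A minimal sketch, assuming the one-line-per-directory layout of /etc/exports shown above; the function name `export_options` is invented here.

```shell
# export_options: print every client(options) entry that an exports-style
# file grants for one directory; exit non-zero if the directory is absent.
export_options() {
    # $1 = exports file, $2 = exported directory
    awk -v dir="$2" '
        $1 == dir { found = 1; for (i = 2; i <= NF; i++) print $i }
        END       { exit !found }
    ' "$1"
}
```

For example, `export_options /etc/exports /nfsroot/kerrighed | grep no_root_squash` should print the client entry if the option is present. After `sudo exportfs -avr`, `showmount -e localhost` on the server is another way to see what is really being exported.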

djdirty
April 30th, 2009, 08:52 PM
Thanks Nerdopolis..

I did edit those configs to look exactly like that..

I will have a go at your other suggestion and see where that gets me.


~Davo

jbbjshlws
May 1st, 2009, 03:46 AM
Hello, Good news!
It seems the 3rd time is the charm, as I am able to get to the login screen now. The error I am having is that when I go to "make kernel" it tells me:

From and to directories are identical!
[all-y] Error 1
Leaving directory /usr/src/kerrighed-2.3.0/modules
[all-recursive] Error 1


i hope this is something simple,....please!!

Cheers,
Josh

bigjimjams
May 1st, 2009, 05:34 AM
Hi Josh, are you using a 32-bit or 64-bit kernel? If it's the 32-bit version, I found that if you use the make kernel -i command, make will ignore this error and still build everything required. Just to note: if you use the SVN trunk version of Kerrighed instead of 2.3.0, this error seems to have been fixed. However, if you are using a 64-bit kernel, I never managed to get 2.3.0 working with that, but did get the SVN trunk version working.

Also, when the guide states to install awk, it should actually be gawk. It might also be worth installing grub so that when you run the command make kernel-install later on it won't complain. Hope this helps.

djdirty
May 1st, 2009, 06:44 AM
This is so weird! We both got it working on separate boxes! What the heck..

I'm at the same error.. Will try the -i switch when I get a chance. :-) Thanks Bigjimjams!

jbbjshlws
May 1st, 2009, 10:27 AM
Hello bigjimjams!!!

This worked wonderfully!! That is to say, it gets past this issue, and I have got to 2.3, Starting Kerrighed. When I go to boot the machines now
I get to:

Trying to load:pxelinux.cfg/default
Could not find kernel image: linux
boot:

could this be related to using the -i parameter or something else?


Cheers,
Joshua

nerdopolis
May 1st, 2009, 02:32 PM
Try updating your
/srv/tftp/pxelinux.cfg/default

file.

make sure the Kerrighed kernel is in
/srv/tftp/
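
A small check along these lines can list which files named in the PXELINUX config are missing from the TFTP root. It is a sketch only: it assumes the simple KERNEL / APPEND layout used in this thread, and `pxe_missing_files` is a made-up helper name.

```shell
# pxe_missing_files: print each file named by a KERNEL line or an initrd=
# option in the PXELINUX config that does not exist under the TFTP root.
pxe_missing_files() {
    # $1 = TFTP root (e.g. /srv/tftp), $2 = path to the config file
    awk '
        toupper($1) == "KERNEL" { print $2 }
        {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^initrd=/) { sub(/^initrd=/, "", $i); print $i }
        }
    ' "$2" |
    while read -r f; do
        [ -e "$1/$f" ] || echo "$f"
    done
}
```

Calling `pxe_missing_files /srv/tftp /srv/tftp/pxelinux.cfg/default` prints nothing when every referenced file is in place.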

nerdopolis
May 1st, 2009, 05:28 PM
If anyone ran the commands in my previous post: I realized there was an error in one of them. Run sudo update-initramfs -u if you ran them.

jbbjshlws
May 1st, 2009, 07:34 PM
Hello,
I have checked /srv/tftp/pxelinux.cfg/default and i have:

LABEL kerrighedlinux
KERNEL vmlinuz-2.6.20-krg
APPEND console=tty1 root=/dev/nfs nfsroot=10.54.12.5:/nfsroot/kerrighed ip=dhcp rw




i have restarted the daemons, and the server that they are on

also in /srv/tftp/ i have:
initrd.img-2.6.24-23-generic
pxelinux.cfg
vmlinuz-2.6.24-23-generic
pxelinux.0
vmlinuz-2.6.20-krg

so i see no reason it is not working, please help!

macotto
May 2nd, 2009, 01:27 AM
clustering as well, but because of no ubuntu availability I have been trying live CD's, and one of the computers I want to use it on doesn't have support for my LAN card. :sad: and it doesn't know how to do WiFi either, it thinks it is just an ethernet connection

bigjimjams
May 2nd, 2009, 09:50 AM
Hello,
I have checked /srv/tftp/pxelinux.cfg/default and i have:

LABEL kerrighedlinux
KERNEL vmlinuz-2.6.20-krg
APPEND console=tty1 root=/dev/nfs nfsroot=10.54.12.5:/nfsroot/kerrighed ip=dhcp rw



If you try changing the LABEL to linux instead of kerrighedlinux (I think the guide may have said this - Whoops!) it may work. I think I found this problem recently when I performed another kerrighed install but haven't got around to changing the guide.
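
The "Could not find kernel image: linux" message fits this: as far as I know, older PXELINUX versions fall back to a label called linux when the config has no DEFAULT line, so a config whose only label is kerrighedlinux never gets selected. The sketch below checks for that situation; the fallback name is an assumption about the PXELINUX version in use, and both function names are invented.

```shell
# pxe_boot_target: print the label PXELINUX would try to boot: the word after
# DEFAULT if the config has one, otherwise "linux" (assumed fallback).
pxe_boot_target() {
    awk '
        toupper($1) == "DEFAULT" { print $2; found = 1; exit }
        END { if (!found) print "linux" }
    ' "$1"
}

# pxe_has_label: succeed if the config defines "LABEL <name>".
pxe_has_label() {
    # $1 = config file, $2 = label name
    awk -v want="$2" '
        toupper($1) == "LABEL" && $2 == want { ok = 1 }
        END { exit !ok }
    ' "$1"
}
```

With these, `pxe_has_label "$cfg" "$(pxe_boot_target "$cfg")"` fails for a config that only defines LABEL kerrighedlinux. Either renaming the label to linux or adding a DEFAULT kerrighedlinux line should make it pass.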

jbbjshlws
May 3rd, 2009, 04:24 AM
Hello Bigjimjams!!

That LABEL definitely needed to be linux; that fixed that error, and I was able to successfully boot to the next error...(sigh...)

the error is:

Looking up port of RPC 100005/1 on 10.54.12.5
Portmap: server 10.54.12.5 not responding, timed out
Root-NFS: unable to get mountd port number from server, using default mount: Server 10.54.12.5 not responding, timed out
Root-NFS: Server returned error -5 while mounting /nfsroot/kerrighed
VFS: unable to mount root fs via NFS, trying floppy.
VFS: Insert root floppy and press ENTER

Now if I press enter I get the following error:

VFS: Cannot open root device "nfs" or unknown-block (2,0)
Please append a correct "root=" boot option
Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block (2,0)

When i run the command rpcinfo -p on the server and i look at the results for 100005 i get:

100005 1 udp 35644 mountd
100005 1 tcp 46678 mountd

not sure if this is the port that program should be running on or if those settings are correct. this error consistently comes up on 3 different machines (out of 3) all with different motherboards and hardware, so i don't think that this error is hardware dependent. soo close!!!

Cheers,
Josh

biggels
May 3rd, 2009, 09:36 AM
I have enjoyed reading your article and it has given me some insight into creating the clustered SSI system with Kerrighed. I have a question, I would be setting this up at home, and I would like to use the systems around the home in the bedrooms, my plan was to send out a Wake On Lan packet to have them boot to PXE before the HD and then start processing any work necessary. Can you have Kerrighed working with the Ubuntu GUI as well?

this way if my family wants to use the machines and they do any heavy duty processing, they are still usable and all of them have the advantage of the cluster.

If this is not possible, is there any way I could have Kerrighed installed on one machine that has a GUI as the "master" machine, so that any excess CPU load is handed off to the other machines and it is not all terminal based?


much appreciated,


Cat

jbbjshlws
May 3rd, 2009, 09:46 AM
this is a good question, is there any display with Kerrighed at all?
I have played around with Xgrid a little bit, and that is quite graphical, but im not sure about Kerrighed, i will let you know when i get it solved if you dont find out sooner!,

Cheers,
Josh

bigjimjams
May 3rd, 2009, 12:48 PM
I can't see why you can't install the ubuntu-desktop package on the Kerrighed nodes to have a GUI on each machine that you can log into. This can be done over an NFS boot. Also, using the setup described in the Ubuntu Kerrighed guide, you can log in to any of the Kerrighed nodes via a terminal and launch processes which can then be distributed, so why not use a GUI frontend? It's possible I'm overlooking something here...

jbbjshlws
May 3rd, 2009, 08:32 PM
Any ideas from an expert about this?

Hello Bigjimjams!!

That LABEL definitely needed to be linux; that fixed that error, and I was able to successfully boot to the next error...(sigh...)

the error is:

Looking up port of RPC 100005/1 on 10.54.12.5
Portmap: server 10.54.12.5 not responding, timed out
Root-NFS: unable to get mountd port number from server, using default mount: Server 10.54.12.5 not responding, timed out
Root-NFS: Server returned error -5 while mounting /nfsroot/kerrighed
VFS: unable to mount root fs via NFS, trying floppy.
VFS: Insert root floppy and press ENTER

Now if I press enter I get the following error:

VFS: Cannot open root device "nfs" or unknown-block (2,0)
Please append a correct "root=" boot option
Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block (2,0)

When i run the command rpcinfo -p on the server and i look at the results for 100005 i get:

100005 1 udp 35644 mountd
100005 1 tcp 46678 mountd

not sure if this is the port that program should be running on or if those settings are correct. this error consistently comes up on 3 different machines (out of 3) all with different motherboards and hardware, so i don't think that this error is hardware dependent. soo close!!!

Cheers,
Josh


I will try installing it on a GUI system as soon as i can get this whole setup working (one step at a time!) don't want to introduce more variables into it atm.

Cheers,
Josh

bigjimjams
May 4th, 2009, 06:18 AM
Hi Josh, can you give the full listing of rpcinfo -p so we can see whether you have mountd, a portmapper and nfs running? Can you also check whether a portmapper is installed on the nfsroot filesystem? Also, can you post the contents of /etc/hosts, /etc/hosts.allow and /etc/hosts.deny for the server and nodes?
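
The rpcinfo listing can be reduced to the three services that matter for an NFS root. A minimal sketch, assuming the usual five-column `rpcinfo -p` output with the service name in the last column; `rpc_missing_services` is a hypothetical helper name.

```shell
# rpc_missing_services: read `rpcinfo -p` output on stdin and print each of
# portmapper, nfs and mountd that has no registration in the listing.
rpc_missing_services() {
    awk '
        { seen[$NF] = 1 }
        END {
            n = split("portmapper nfs mountd", need, " ")
            for (i = 1; i <= n; i++)
                if (!(need[i] in seen)) print need[i]
        }
    '
}
```

On the server, `rpcinfo -p | rpc_missing_services` prints nothing when all three are registered. From another machine, `rpcinfo -p <server address>` additionally shows whether the portmapper is reachable over the network at all.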

jbbjshlws
May 4th, 2009, 06:29 AM
Hey!
These are the settings for the / (server) system
rpcinfo -p

program vers proto port
100000 2 tcp 111 portmapper
100000 2 udp 111 portmapper
100024 1 udp 41200 status
100024 1 tcp 47208 status
100003 2 udp 2049 nfs
100003 3 udp 2049 nfs
100003 4 udp 2049 nfs
100021 1 udp 38440 nlockmgr
100021 3 udp 38440 nlockmgr
100021 4 udp 38440 nlockmgr
100003 2 tcp 2049 nfs
100003 3 tcp 2049 nfs
100003 4 tcp 2049 nfs
100021 1 tcp 34066 nlockmgr
100021 3 tcp 34066 nlockmgr
100021 4 tcp 34066 nlockmgr
100005 1 udp 60030 mountd
100005 1 tcp 35569 mountd
100005 2 udp 60030 mountd
100005 2 tcp 35569 mountd
100005 3 udp 60030 mountd
100005 3 tcp 35569 mountd




/etc/hosts
127.0.0.1 localhost
127.0.1.1 clustered

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts



/etc/hosts.allow
# /etc/hosts.allow: list of hosts that are allowed to access the system.
# See the manual pages hosts_access(5) and hosts_options(5).
#
# Example: ALL: LOCAL @some_netgroup
# ALL: .foobar.edu EXCEPT terminalserver.foobar.edu
#
# If you're going to protect the portmapper use the name "portmap" for the
# daemon name. Remember that you can only use the keyword "ALL" and IP
# addresses (NOT host or domain names) for the portmapper, as well as for
# rpc.mountd (the NFS mount daemon). See portmap(8) and rpc.mountd(8)
# for further information.
#


/etc/hosts.deny

# /etc/hosts.deny: list of hosts that are _not_ allowed to access the system.
# See the manual pages hosts_access(5) and hosts_options(5).
#
# Example: ALL: some.host.name, .some.domain
# ALL EXCEPT in.fingerd: other.host.name, .other.domain
#
# If you're going to protect the portmapper use the name "portmap" for the
# daemon name. Remember that you can only use the keyword "ALL" and IP
# addresses (NOT host or domain names) for the portmapper, as well as for
# rpc.mountd (the NFS mount daemon). See portmap(8) and rpc.mountd(8)
# for further information.
#
# The PARANOID wildcard matches any host whose name does not match its
# address.

# You may wish to enable this to ensure any programs that don't
# validate looked up hostnames still leave understandable logs. In past
# versions of Debian this has been the default.
# ALL: PARANOID


These are the settings for the /nfsroot/kerrighed client system

/etc/hosts
# /etc/hosts #
127.0.0.1 localhost

10.65.12.5 clustered
10.65.12.6 kerrighednode1
10.65.12.7 kerrighednode2
10.65.12.8 kerrighednode3
10.65.12.9 kerrighednode4
10.65.12.10 kerrighednode5






/etc/hosts.allow
# /etc/hosts.allow: list of hosts that are allowed to access the system.
# See the manual pages hosts_access(5) and hosts_options(5).
#
# Example: ALL: LOCAL @some_netgroup
# ALL: .foobar.edu EXCEPT terminalserver.foobar.edu
#
# If you're going to protect the portmapper use the name "portmap" for the
# daemon name. Remember that you can only use the keyword "ALL" and IP
# addresses (NOT host or domain names) for the portmapper, as well as for
# rpc.mountd (the NFS mount daemon). See portmap(8) and rpc.mountd(8)
# for further information.
#

/etc/hosts.deny

# /etc/hosts.deny: list of hosts that are _not_ allowed to access the system.
# See the manual pages hosts_access(5) and hosts_options(5).
#
# Example: ALL: some.host.name, .some.domain
# ALL EXCEPT in.fingerd: other.host.name, .other.domain
#
# If you're going to protect the portmapper use the name "portmap" for the
# daemon name. Remember that you can only use the keyword "ALL" and IP
# addresses (NOT host or domain names) for the portmapper, as well as for
# rpc.mountd (the NFS mount daemon). See portmap(8) and rpc.mountd(8)
# for further information.
#
# The PARANOID wildcard matches any host whose name does not match its
# address.

# You may wish to enable this to ensure any programs that don't
# validate looked up hostnames still leave understandable logs. In past
# versions of Debian this has been the default.
# ALL: PARANOID

How can i check for a port mapper on the nfsroot system?

This is the code from all of the places asked, please let me know if you need anything else

Cheers, Joshua

djdirty
May 4th, 2009, 06:31 AM
From what i can work out it is an issue with the Kernel... Because the generic kernel can boot and is a completely working version of Ubuntu..

Can anyone upload their Kerrighed kernel? Wouldn't that be the easiest thing to do?

~Davo

bigjimjams
May 4th, 2009, 07:30 AM
How can i check for a port mapper on the nfsroot system?

This is the code from all of the places asked, please let me know if you need anything else

Cheers, Joshua

You could try to put all the node IP addresses in the server /etc/hosts file too. Also, did you remember to include the network card drivers and NFS in the Kerrighed kernel and not as modules? Does the /sbin directory contain portmap on the nodes?

Also, just noticed that your /etc/hosts on the server doesn't contain the ip address for itself only the localhost, if you add it, it may solve the problem, as I believe portmap uses the /etc/host* files.
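
Whether a driver or NFS support is built in or only a module can be read straight from the kernel's .config. The sketch below assumes the standard `CONFIG_X=y` / `CONFIG_X=m` format; the helper name `kconfig_state` is invented, and the option names in the usage note are examples to adapt to your own hardware.

```shell
# kconfig_state: print "y" (built in), "m" (module) or "unset" for one
# option in a kernel .config file.
kconfig_state() {
    # $1 = path to .config, $2 = option name such as CONFIG_ROOT_NFS
    state=$(sed -n "s/^$2=\([ym]\)\$/\1/p" "$1" | head -n 1)
    echo "${state:-unset}"
}
```

For an NFS-root node, options such as CONFIG_NFS_FS, CONFIG_ROOT_NFS, CONFIG_IP_PNP and CONFIG_IP_PNP_DHCP, plus the NIC driver (for example CONFIG_E1000 for Intel PRO/1000 cards), would be expected to report `y`. A loop like `for o in CONFIG_NFS_FS CONFIG_ROOT_NFS; do echo "$o $(kconfig_state .config "$o")"; done` shows them at a glance.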

jbbjshlws
May 4th, 2009, 07:46 AM
Hello bigjimjams,
I have tested with the changed /etc/hosts file, and it does not change the problem. The /sbin directory does contain portmap on the nodes.

Also, just noticed that your /etc/hosts on the server doesn't contain the ip address for itself only the localhost, if you add it, it may solve the problem, as I believe portmap uses the /etc/host* files.

the server name is clustered, ip address 10.65.12.5, it is in the list

did you remember to include the network card drivers and nfs in the kerrighed kernel and not as modules?

I did include them for most of the systems around the home, i did not include them for the laptop here, and the laptop has a different error (don't care about atm). this tells me that the network card drivers are functioning correctly in the desktop computers.

Hope this helps,

Cheers,
Josh

bigjimjams
May 4th, 2009, 08:50 AM
Hi Josh, from the /etc/hosts files you posted for the server and client, it appeared as though it was in the client list but not the server list. It should be present in both with the correct IP addresses, as the server may have more than one. Have you changed this and tried it since the post?

jbbjshlws
May 4th, 2009, 09:04 AM
Yes i had changed the server file, and put it and all of the clients into the list, checked it on 3 systems and all did the same thing, restarted the server and the daemons.

Cheers,

bigjimjams
May 4th, 2009, 04:10 PM
Hi Josh, sorry to hear you're still having trouble! The only other thing I can think of at the moment: have you checked /etc/network/interfaces on the server and made sure that the ethernet adapter which is being used with the static IP address for the Kerrighed network is set to manual and not auto? This should look something like the following if this is eth0:

iface eth0 inet manual

and not like this:

iface eth0 inet auto

The nodes should already have a configuration like this, as mentioned in the guide. Also did you sudo the exportfs -avr command for the server?
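
To see what a given interfaces file actually declares, the method can be pulled out with a one-liner. This is a sketch under the assumption that the file uses plain `iface <name> inet <method>` stanzas; `iface_method` is a made-up name. As far as I know, the methods ifupdown accepts here include static, dhcp, manual and loopback, while `auto` is a separate `auto eth0` line rather than a method.

```shell
# iface_method: print the IPv4 method (static, dhcp, manual, ...) declared
# for one interface in an interfaces-style file; prints nothing if absent.
iface_method() {
    # $1 = interfaces file, $2 = interface name
    awk -v ifc="$2" '
        $1 == "iface" && $2 == ifc && $3 == "inet" { print $4; exit }
    ' "$1"
}
```

On a node, `iface_method /etc/network/interfaces eth0` would be expected to print `manual`, so that the network scripts leave the interface that carries the NFS root alone.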

jbbjshlws
May 4th, 2009, 08:01 PM
Hello, I checked /etc/network/interfaces, and it is set correctly (that is to say, I have it set to eth2, but that's because it is the 2nd card in the server). I did sudo the exportfs command, as without sudo it produced an error. So we are still stuck!
Bugger!,

Any other suggestions?

Cheers,
Josh

bigjimjams
May 5th, 2009, 05:23 AM
Hi Josh, one thing I've just noticed is that in your earlier error message at boot up the portmap was looking for server 10.54.12.5, whereas in your /etc/hosts lists your server is 10.65.12.5. I think this is what is causing the problem. Can you check all your config files, especially the dhcp and dns settings. I think somewhere the ip address of your server has been entered wrongly. If you can't find anything you could always change the ip addresses you are using to 10.54.*.* instead of 10.65.*.* to see if this works.
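
This kind of mismatch is easy to miss by eye, so here is a minimal sketch that compares the address in the PXELINUX nfsroot= option with the addresses listed in a hosts file. It assumes the `nfsroot=<ip>:<path>` form used in this thread; both function names are invented for illustration.

```shell
# nfsroot_server: print the server address from the nfsroot= option of a
# PXELINUX config (the part before the colon).
nfsroot_server() {
    sed -n 's/.*nfsroot=\([^:[:space:]]*\):.*/\1/p' "$1" | head -n 1
}

# host_has_address: succeed if a hosts-style file lists the given address
# in its first column.
host_has_address() {
    # $1 = hosts file, $2 = IPv4 address
    awk -v ip="$2" '$1 == ip { ok = 1 } END { exit !ok }' "$1"
}
```

With the files posted earlier, `nfsroot_server /srv/tftp/pxelinux.cfg/default` gives 10.54.12.5, and `host_has_address /nfsroot/kerrighed/etc/hosts 10.54.12.5` fails because the hosts file only knows 10.65.12.5. A quick `grep -rn '10\.54\.' /etc/dhcp3 /srv/tftp/pxelinux.cfg` is another way to hunt for the stray address.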

djdirty
May 5th, 2009, 06:04 AM
Im at the same issue and all my IP addresses are consistent...

jbbjshlws
May 5th, 2009, 10:34 AM
OK! good eyes! that worked, i have now been able to finish this guide!!
yahoo!!!, now what can i run on it that would test it?

something that would normally run on one computer and take X time, that i can run on the cluster and take X/nodes??

As soon as I can see that it is fully functional I will start on your request, biggels, and try to integrate it into a GUI system so that all of the nodes can still be of some use while sharing the processes. Does anyone else know the best way to skin this cat?

Cheers, thankyou heaps for getting me this far,

Joshua!

bigjimjams
May 5th, 2009, 03:59 PM
No problems, happy to help! Glad you've finally got it all working!! :) You could try some benchmarking software to start, what was the one they used on Microwulf?
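
Short of a real benchmark, a handful of identical CPU-bound processes is enough to watch in top. The sketch below is only an illustration: `busy_worker` and `run_workers` are invented names, and whether the cluster spreads the workers over several nodes depends on how process migration is configured, which this script does not touch.

```shell
# busy_worker: spin through a fixed number of loop iterations (pure CPU work).
busy_worker() {
    # $1 = number of iterations
    i=0
    while [ "$i" -lt "$1" ]; do
        i=$((i + 1))
    done
}

# run_workers: start N busy_worker processes in the background and wait for
# all of them to finish.
run_workers() {
    # $1 = number of workers, $2 = iterations per worker
    n=$1
    while [ "$n" -gt 0 ]; do
        busy_worker "$2" &
        n=$((n - 1))
    done
    wait
}
```

Timing `time run_workers 1 5000000` against `time run_workers 3 5000000` gives a rough idea: if the three workers land on three different processors, both runs should take about the same wall-clock time. A single process never gets faster this way, so the "X/nodes" expectation only applies to work that is split into several processes.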

jbbjshlws
May 6th, 2009, 04:11 AM
Hmm, spoke too soon. It seems I was able to get all the correct prompts as per the howto guide, but on further inspection it seems it is not working. I went to the link below to check whether it was running:


http://www.kerrighed.org/wiki/index.php/FAQ
How do I know if the cluster is running? (aka How can I check that krgadm cluster start succeeded?)

Several ways:

* On the cluster nodes, the kernel log contains a line like

Kerrighed is running on XX nodes

where XX is the number of nodes.
* /proc/cpuinfo shows all CPUs of the cluster
* /proc/meminfo shows all memory of the cluster
* /proc/stat shows stats about all CPUs of the cluster



and all of the log files are blank. In fact the entire /proc directory is empty (/nfsroot/kerrighed/proc from the server), so I am not sure what has gone wrong. I have backed up the whole system just in case something really weird happens and I lose everything!...

I also checked the info when i type in the "top" command into terminal and from what i understand my result should have several lines of cpu's, please let me know if this looks correct (only posted the top half of the output)

top - 04:09:46 up 19:29, 0 users, load average: 0.10, 0.04, 0.01
Tasks: 43 total, 1 running, 42 sleeping, 0 stopped, 0 zombie
Cpu(s): 20.6%us, 0.3%sy, 0.0%ni, 78.1%id, 0.0%wa, 0.3%hi, 0.7%si, 0.0%st
Mem: 256400k total, 138540k used, 117860k free, 9848k buffers
Swap: 262120k total, 0k used, 262120k free, 102368k cached



any help is appreciated,
Cheers,
Josh

bigjimjams
May 6th, 2009, 05:20 AM
Hi Josh, /nfsroot/kerrighed/proc on the server should be blank, but when the nodes are up and running the /proc directory on each of them should contain information about that machine and the rest of the cluster nodes. Have you set nbmin=0 and the session=1 in /nfsroot/kerrighed/etc/kerrighed_nodes? Also, the line in dmesg, "Kerrighed is running on XX nodes", will not be present until you run sudo krgadm cluster start on one of the nodes. Does sudo krgadm nodes from a node list all the Kerrighed nodes in the form node_id:session? Can you post your dmesg from a cluster node?
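
The kernel-log check from the FAQ can be turned into a small filter. This sketch assumes the log line has exactly the wording quoted from the FAQ ("Kerrighed is running on XX nodes"); `krg_node_count` is a hypothetical name.

```shell
# krg_node_count: read kernel log text on stdin and print the node count from
# the last "Kerrighed is running on XX nodes" line; fail if there is none.
krg_node_count() {
    sed -n 's/.*Kerrighed is running on \([0-9][0-9]*\) nodes.*/\1/p' |
        tail -n 1 |
        grep .
}
```

On a node, `dmesg | krg_node_count` prints the number once the cluster has been started, and returns a non-zero status (printing nothing) before that.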

djdirty
May 6th, 2009, 05:42 AM
Hi..

My nodes are still not booting..

Here is a screen shot..

http://img159.imageshack.us/img159/6350/kerrighgehdu.png

Obviously im all in Virtualbox.. Is that an issue?

I have run through the entire setup process again to make sure i havent missed anything for a 5th time...

jbbjshlws
May 6th, 2009, 05:42 AM
ok i have checked the /nfsroot/kerrighed/etc/kerrighed_nodes file and it is as you described,

When sudo krgadm nodes is run, I get 4 nodes (out of 4) in the list (6:1, 7:1, 8:1, 10:1),

If i try to access the proc directory from a node it seems to hang, if i look in the proc directory on the server, it is still empty even after i have typed in krgadm cluster start

i cannot post the dmesg from a node as there is nothing in the directory,

Cheers,

djdirty
May 6th, 2009, 06:49 AM
All sorted..

For anyone setting up in Virtualbox.. The Adapter type needs to be the Intel Server for the server and Intel Desktop for clients..

:-)

djdirty
May 6th, 2009, 08:21 AM
Ok.. I've mucked around with my dhcpd.conf for a bit.. I want the DHCP server to give an IP address to any PC that asks and have it start loading Kerrighed.. Anyone wanna upload their conf file that has this setup?

I dont want to have to enter in allll MAC addresses..

nerdopolis
May 6th, 2009, 09:08 AM
djdirty: This seems to work for me:

# /etc/dhcp3/dhcpd.conf #
subnet 192.168.1.0 netmask 255.255.255.0 {
option routers 192.168.1.1;
option broadcast-address 192.168.1.255;
range 192.168.1.1 192.168.1.254;
option subnet-mask 255.255.255.0;
filename "pxelinux.0";}

jbbjshlws: try installing and running htop. htop showed multiple processors when I sampled the Kerrighed live disk. top did not.

bigjimjams
May 6th, 2009, 03:23 PM
djdirty: This seems to work for me:
jbbjshlws: try installing and running htop. htop showed multiple processors when I sampled the Kerrighed live disk. top did not.

I believe top should do if you press the "1" key to display all CPUs.
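
Independently of top or htop, the processor entries in /proc/cpuinfo can simply be counted. A minimal sketch; `count_processors` is an invented helper and it only assumes the usual "processor : N" lines.

```shell
# count_processors: count "processor" entries in a cpuinfo-style file
# (defaults to /proc/cpuinfo when no file is given).
count_processors() {
    grep -c '^processor[[:space:]]*:' "${1:-/proc/cpuinfo}"
}
```

Run on a node, `count_processors` should report more than that machine's own CPUs once /proc reflects the whole cluster.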

bigjimjams
May 6th, 2009, 03:28 PM
/proc should display information from the machine that has booted that file system, which explains why it is empty on the server, as it never booted the kerrighed file system. Kerrighed modifies some of the information found in /proc to represent the whole cluster rather than the individual system.

Your problem may be related to a bad driver or something missing from the kernel or filesystem. Does the /nfsroot/kerrighed/etc/fstab contain the correct mount information for /proc?

djdirty
May 7th, 2009, 02:51 AM
Thanks Nerdopolis.. I tried that then my TFTP wouldnt even boot.. So i reverted back to my old dhcpd.conf and my tftp STILL wont boot! Heres my file..

# /etc/dhcp3/dhcpd.conf #
# General options
option dhcp-max-message-size 2048;
use-host-decl-names on;
deny unknown-clients;
deny bootp;
# DNS settings
option domain-name "kerrighed"; # Just an example name, call it whatever you want to.
option domain-name-servers 10.65.12.5; # The ip address of the dhcp/tftp/nfs server.
# Information about the network setup
subnet 10.65.12.0 netmask 255.255.253.0 {
option routers 10.65.12.5; # IP address of the dhcp/tftp/nfs server.
option broadcast-address 10.65.12.255; # Broadcast address for your network.
}
# Declaring IP addresses for nodes and PXE info
group {
filename "pxelinux.0"; # location of PXE bootloader. Path is relative to tftpd's root(/srv/tftp/)
option root-path "10.65.12.5:/nfsroot/kerrighed"; # Location of the bootable filesystem on NFS server
host kerrighednode1 {
fixed-address 10.65.12.101; # IP address for kerrighednode1.
hardware ethernet 08:00:27:E5:41:49; # MAC address of the kerrighednode1's ethernet adapter
}
server-name "kerrighedserver"; # Name of the PXE server
next-server 10.65.12.5; # The IP address of the dhcp/tftp/nfs server
}

And my /srv/tftp/pxelinux.cfg/default file:

LABEL linux
KERNEL vmlinuz-2.6.20-krg
APPEND console=tty1 root=/dev/nfs nfsroot=10.65.12.5:/nfsroot/kerrighed ip=dhcp rw

Grrrr..

And yes i have restarted all daemons and even the server...

djdirty
May 7th, 2009, 02:58 AM
Fail.. Firewall LOL

djdirty
May 7th, 2009, 06:04 AM
djdirty: This seems to work for me:

# /etc/dhcp3/dhcpd.conf #
subnet 192.168.1.0 netmask 255.255.255.0 {
option routers 192.168.1.1;
option broadcast-address 192.168.1.255;
range 192.168.1.1 192.168.1.254;
option subnet-mask 255.255.255.0;
filename "pxelinux.0";}' > /etc/dhcp3/dhcpd.conf

apt-get install tftpd-hpa -y

printf '#Defaults for tftpd-hpa
#RUN_DAEMON="no"
#OPTIONS="-l -s /var/lib/tftpboot"
# /etc/default/tftp-hpa #
#Defaults for tftp-hpa
RUN_DAEMON="yes"
OPTIONS="-l -s /srv/tftp"
jbbjshlws: try installing and running htop. htop showed multiple processors when I sampled the Kerrighed live disk. top did not.


Any other ones? Me and Josh dont wanna sit here and type in over 400 MAC addresses!

bigjimjams
May 7th, 2009, 12:53 PM
Did you see my reply to the message you sent me earlier? Hope that dhcp config may help!
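
One possible way to avoid typing hundreds of host entries by hand is sketched below. It assumes the MAC addresses can be harvested from the DHCP server's lease file, which only works if the nodes have been allowed to take a dynamic lease at least once, so not while "deny unknown-clients" is in force. The function names, the lease file path and the numbering scheme are illustrative assumptions, not the config bigjimjams refers to.

```sh
#!/bin/sh
# Print one dhcpd "host" stanza per MAC address read from stdin (one MAC per line).
# Arguments: host name prefix, first three octets of the IP, first host number.
# The host number is reused as the last octet, so this only covers up to 254 hosts per call.
make_host_entries() {
    name_prefix=$1
    ip_prefix=$2
    n=$3
    while read -r mac; do
        [ -n "$mac" ] || continue
        printf 'host %s%d {\n    hardware ethernet %s;\n    fixed-address %s.%d;\n}\n' \
            "$name_prefix" "$n" "$mac" "$ip_prefix" "$n"
        n=$((n + 1))
    done
}

# List the unique MAC addresses found in a dhcpd lease file.
macs_from_leases() {
    sed -n 's/^[[:space:]]*hardware ethernet \([0-9A-Fa-f:]*\);.*/\1/p' "$1" | sort -u
}
```

Usage might look like this, assuming the lease file lives at /var/lib/dhcp3/dhcpd.leases: macs_from_leases /var/lib/dhcp3/dhcpd.leases | make_host_entries kerrighednode 10.65.12 101 >> hosts.conf. The same MAC list would also be usable for wake-on-LAN.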

nerdopolis
May 7th, 2009, 04:06 PM
I made a critical mistake in the /etc/dhcp3/dhcpd.conf. It was supposed to be:

# /etc/dhcp3/dhcpd.conf #
subnet 192.168.1.0 netmask 255.255.255.0 {
    option routers 192.168.1.1;
    option broadcast-address 192.168.1.255;
    range 192.168.1.1 192.168.1.254;
    option subnet-mask 255.255.255.0;
    filename "pxelinux.0";
}

I accidentally copied and pasted too much information from the batch file I was working on. Sorry.

djdirty
May 8th, 2009, 12:03 AM
Rito.. Seeing as we have all these PCs! They all have differing network cards.. How do we install drivers into the kernel for different cards?
Cheers,

Davo

jbbjshlws
May 9th, 2009, 08:22 AM
Hello!
After many tests and re-installations (even since the last post!) I have been able to successfully get the top command to work. I am able to see the 3 processors (over 3 nodes) in one view. Kerrighed is not migrating the processes across to the other nodes though. I have a node set up with "top" running to view the processors on the network (where it is showing all 3 processors), and on another machine I am running hardinfo to bench test the system. When it is running you can see the node running hardinfo go up to 100% usage, but the other two processors in the cluster don't change their value (sitting around 0.3%).
I have also tried john -test (apt-get install john), and crafty (apt-get install crafty) to test and they all have the same results as hardinfo.

So this brings me to two questions, does kerrighed speed up both multi-threaded and single-threaded applications?

And how can I migrate the tasks from one node to another? I have tried

$ sudo krgcapset -d +CAN_MIGRATE

and also

krgcapset -k [pid] --effective +CAN_MIGRATE
migrate [pid] [node_id]

where I had the pid as that of hardinfo and the node_id as the node that it was running on (kerrighednode1).

So once again i am stuck with a cluster nearly working!!.

any ideas?
Cheers

jbbjshlws
May 9th, 2009, 08:23 AM
Like Davo, i also need to install the drivers to have all of the computers on the network being able to work. i need the drivers for the DELL Optiplex 755 and 745 series computers, the nic's are:

Broadcom® 5754 Gigabit Ethernet LAN solution 10/100/1000 Ethernet
Intel® 82566DM Gigabit LAN 10/100/1000
Broadcom 5721C1 NetXtreme Gigabit Ethernet PCI-E

How can i get these and integrate them into the kernel?

Cheers, Josh

bigjimjams
May 9th, 2009, 03:15 PM
Hi Josh, are those benchmark tests multi-process or just multi-threaded? Kerrighed only migrates processes to other nodes and has no support for migrating threads at the moment. If you run krgcapset -s, does it show that the inheritable and effective sets have the CAN_MIGRATE option? An easy way to test if Kerrighed is working is to write a simple C program that is an infinite loop and spawn as many copies of it as you have cores. If every core in your cluster is at 100% then you know it works!
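
The post suggests C for the test program; the same idea can be sketched in plain shell, which is the language of the other scripts in this thread. The point is that every burner is a separate process, since processes are what Kerrighed migrates. The function names are made up for illustration, and nothing here is Kerrighed-specific: the shell it is run from would still need the CAN_MIGRATE capability discussed above.

```sh
#!/bin/sh
# Start COUNT independent busy-loop processes and remember their PIDs.
BURNER_PIDS=""

start_burners() {
    count=$1
    i=0
    while [ "$i" -lt "$count" ]; do
        # Each burner runs in its own background subshell, i.e. its own process.
        ( while :; do :; done ) &
        BURNER_PIDS="$BURNER_PIDS $!"
        i=$((i + 1))
    done
}

# Kill every burner started above and reap them.
stop_burners() {
    for pid in $BURNER_PIDS; do
        kill "$pid" 2>/dev/null || true
    done
    wait
    BURNER_PIDS=""
}
```

For a three-core cluster one would run start_burners 3, watch top or htop on any node for a while, and finish with stop_burners. If only the local CPU climbs to 100%, the processes are not being migrated.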

jbbjshlws
May 9th, 2009, 11:17 PM
Hello Bigjimjams,

When I run krgcapset -s it does show "CAN_MIGRATE" under "Inheritable Effective Capabilities: 03".

I made an infinite loop in a sh script (not sure how to do it in C) and it did not migrate. Those bench test programs, to my knowledge, have multi-threaded support; I am not sure about multi-process. What is everyone else running on their clusters? Surely I could test with the same application.

Do you get an advantage when you use normal (aka written by other people with no intention to run on a cluster, eg video encoding etc) Single-threaded applications with the kerrighed cluster?

so i have a list of cpu's in top, they dont migrate tasks, and everything seems setup correctly, (but clearly isn't) whats the next attack?

Cheers,
Joshua

nerdopolis
May 10th, 2009, 10:49 AM
Try it with yes. yes is a program that comes with Linux, and with no arguments brings the CPU usage up to 100%. try running two instances of yes, and see if they migrate (two CPU's should be at 100%). If it doesn't try manually migrating them.

jbbjshlws
May 13th, 2009, 05:23 AM
Hello nerdopolis, i have tried using yes, i filled half a screen with text and i was only able to get a single cpu to reach a maximum of 3.7% usage, even with several instances of it running. how were you able to have it reach 100% (the computers i am testing on are around 2.4ghz machines) . thanks for the dhcp code, this really helped....now silly me i still need the mac addresses for the wol anyway!


Do you get an advantage when you use normal (aka written by other people with no intention to run on a cluster, eg video encoding etc) Single-threaded applications with the kerrighed cluster?

Also I am still unsure how to integrate the:
Broadcom® 5754 Gigabit Ethernet LAN solution 10/100/1000 Ethernet
Intel® 82566DM Gigabit LAN 10/100/1000
Broadcom® 5721C1 NetXtreme Gigabit Ethernet PCI-E
Drivers into the kernel. How can i get these and integrate them into the kernel?


I am in the process of trying out the latest svn for kerrighed on a different machine. Can anyone please verify whether, when they try to download it with svn checkout svn://scm.gforge.inria.fr/svn/kerrighed/trunk, it stops before the full package is fetched? For me it comes up with an error:
'/SVNTrunk/kernel/include/linux/netfilter_ipv4' locked
although the file is not locked. I have run cleanup and tried unlocking it, and I have tried this on a couple of machines with the same result.

nerdopolis
May 13th, 2009, 03:21 PM
1.
Weird... try
yes > /dev/null

2.
I don't think a single-threaded application will be accelerated by a cluster. I don't think there is a way to run a single thread on two CPUs. But you can multitask: one CPU can be loaded running the process and a second can be running your web browser or something.

And by the looks of it, threaded apps won't even benefit much until Kerrighed has thread migration. (But I am wondering if performance will increase if an application developer splits his program into multiple processes somehow.)

3.
In my home directory I made a folder called svn and ran your command. It worked on my box; maybe it was the SVN revision. Try downloading again into a new folder, and see if it works.

jbbjshlws
May 14th, 2009, 12:11 PM
Hello nerdopolis,
I am downloading the svn again, onto a different machine, in a different folder, so it should work, (hopefully,) it seems to have gotten further than last time.
Typing yes > /dev/null into a machine definitely brings a CPU to 100%.

I can't test it on my cluster as currently when i boot them up, and type in krgadm nodes status, it comes up with a blank list of no nodes (not even the one i am logged onto). I am thinking this has something to do with possibly the need to reset a cluster session or maybe i am terminating the session in a weird way. it is not consistent.

Also the other issue I face is, when they do occasionally appear in the list, when I type in krgadm cluster start followed by top, the computer that I run the command on freezes. All of the nodes in the cluster are not communicating and I think this could also have something to do with a session issue. Is there a limit to how many nodes, or to the IP addresses of nodes across subnets, that could affect Kerrighed's stability?

I was able to integrate the Broadcom drivers through the "make menuconfig" menu and ticking all of the drivers that were available and integrating them, they seem to be booting. I am still unsure how to integrate the Intel 82566DM Gigabit LAN drivers. they were not in the list.


Cheers,

Joshua

nerdopolis
May 15th, 2009, 08:06 PM
In the menuconfig is there any option for something like Intel e1000 or something like that? because I think that card may fall under that category.

I wouldn't know what the maximum number of nodes is... on the Kerrighed web page it talks about a 110 node cluster though...
http://www.kerrighed.org/forum/viewtopic.php?p=606#606

When the nodes fail to appear in krgadm nodes status, can you ping them?

for your Kerrighed problems, all I can think of right now is to try a shut-down on all nodes, rebooting the master node, and then powering up the other nodes again.

EDIT: I almost forgot, did the SVN download successfully?
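
For the driver question, a quick check of how an option ended up in the kernel's .config can save a trip through menuconfig. The sketch below assumes, to the best of my knowledge, that CONFIG_E1000 is the symbol behind "Intel(R) PRO/1000 Gigabit Ethernet support" and CONFIG_TIGON3 the one behind the Broadcom tg3 driver; the helper name kconfig_state is made up for illustration.

```sh
#!/bin/sh
# Report how a kernel option is set in a .config file:
# "built-in" (=y), "module" (=m) or "not set".
kconfig_state() {
    config_file=$1
    symbol=$2
    case "$(grep -E "^${symbol}=" "$config_file" | head -n 1)" in
        *=y) echo "built-in" ;;
        *=m) echo "module" ;;
        *)   echo "not set" ;;
    esac
}
```

Run from the kernel source directory, for example kconfig_state .config CONFIG_E1000. A driver reported as "module" has to be reachable at boot time, which on an NFS-root node means putting it in an initramfs.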

jbbjshlws
May 16th, 2009, 03:05 AM
Hello nerdopolis,
I have checked the menuconfig and i have got Intel(R) PRO/1000 Gigabit Ethernet support there, i had it checked as built-in, but it still does not work. I could not find Intel e1000 in the list at all.

I am able to ping the computer without a problem

cluster@kerrighednode249:~$ krgadm nodes status

cluster@kerrighednode249:~$ ping 10.65.12.5
PING 10.65.12.5 (10.65.12.5) 56(84) bytes of data.
64 bytes from 10.65.12.5: icmp_seq=1 ttl=64 time=0.202 ms
64 bytes from 10.65.12.5: icmp_seq=2 ttl=64 time=0.334 ms

--- 10.65.12.5 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.202/0.268/0.334/0.066 ms
cluster@kerrighednode249:~$

You can see the result above when i type in krgadm nodes status, nothing...

The network is spread across 3 subnet's, i wouldn't think this would cause a problem, dhcp seems to be dealing out the files completely fine, but maybe i am wrong.

when i run:
cluster@kerrighednode249:~$ krgadm cluster poweroff
cluster@kerrighednode249:~$


The computer/computers do not turn off, and when I run the restart command they do not turn off either. Both of these commands used to restart the computers, and seem to work when there are nodes in that status list.

Not sure if this matters, but I have got my /etc/dhcp3/dhcpd.conf file automatically assigning IP addresses, and my /etc/hosts file set with a list of all of the IP addresses in the range, each with an assigned name, kerrighednode1 to kerrighednode500. Clearly not all of these will be used at any one time, but if I didn't make this change in the hosts file, it got the server name assigned.

How do you define the master node?
And how can you shut down just the master node (or any single node)?

I tried "krgadm nodes poweroff -n" followed by its ip address, name or id, but it didn't work.

Also have you tried updating your install to the latest svn? anything i should know before i proceed?


I am checking the svn now. The folder is 230.7 MB and has 23,221 files in it. It seemed to lock the computer while it was downloading; not sure if that is expected (I would think not), so please let me know if this is your result too. I tried downloading it again to compare sizes, but I got the error
svn: Can't create directory 'trunk/kernel/include/asm-m32r/m32104ut/.svn/props': Operation not supported
FIXED: The problem was downloading to a file system that was not able to cope with the structure (both FAT and NTFS didn't work); I was able to download to a Linux file system. The second issue was that I was originally downloading with svn checkout svn://scm.gforge.inria.fr/svn/kerrighed/ and, as it happens, there are over 1 million files and over 11 gig of data; it crashed the Ubuntu system and locked it out completely. So when downloading the svn, make sure to have enough room, and to download to the correct file system!

Cheers,
Josh

nerdopolis
May 16th, 2009, 11:05 AM
On each node individually, run
sudo poweroff
then do the same on the server.
I think those krgadm shutdown commands are stubs. I'm not sure...

BTW: When I said "master node" I meant to say the server. Sorry.


For the drivers:
You may need to use an initramfs... The driver seemingly supports your card, but I read somewhere it's only supported as a module. In the make menuconfig, try building the PRO/1000 driver as a module. Once it finishes compiling, carry out the following steps:

In the nfs file system change /etc/initramfs-tools/initramfs.conf to look like
MODULES=netboot
BUSYBOX=y
COMPCACHE_SIZE=""
BOOT=nfs
DEVICE=eth0
NFSROOT=auto

and in the nfs file system run:
update-initramfs -c -k 2.6.20-krg
copy the kernel and the new initramfs to /srv/tftp

change your /srv/tftp/pxelinux.cfg/default:

LABEL linux
KERNEL vmlinuz-2.6.20-krg
APPEND root=/dev/nfs initrd=initrd.img-2.6.20-krg nfsroot=10.65.12.5:/nfsroot/kerrighed ip=dhcp rw





I didn't get to try out the SVN version yet. I didn't get much time with my cluster this week...

robasc
May 20th, 2009, 11:29 AM
Well it's been a while since my last posting. I have been so busy with work I have not had time to do anything.

Anyways, here is where I left off.

Section: 2.2: configure kerrighed of the guide

I went to edit the /etc/kerrighed_nodes file but there was no file existing by that name?

I must have made a mistake somewhere but I do not see one anywhere?

Any thoughts anyone?

bigjimjams
May 21st, 2009, 05:04 AM
Hi robasc, have you tried creating the file yourself and adding the lines:
session=1
nbmin=0
I think I may have had to do this myself before. Hope it helps.
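
Creating the file can be done in one step; a small sketch follows. The function name is hypothetical and the path is passed in as an argument, because on an NFS-root setup like the one in this thread the nodes' /etc presumably lives under /nfsroot/kerrighed/etc on the server. An existing file is left alone so a hand-tuned nbmin is not overwritten.

```sh
#!/bin/sh
# Create the nodes file with the two lines from the post above, unless it already exists.
# The path is a parameter so the function can be tried on a scratch file first.
ensure_kerrighed_nodes() {
    nodes_file=$1
    if [ ! -e "$nodes_file" ]; then
        printf 'session=1\nnbmin=0\n' > "$nodes_file"
    fi
}
```

For example, as root on the server: ensure_kerrighed_nodes /nfsroot/kerrighed/etc/kerrighed_nodes.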

robasc
May 21st, 2009, 10:23 AM
Yes I did, I was not sure if I had to or not. It's been a while since I last played with the cluster.

I have had some other issues with kerrighed. I noticed when I tried to configure the scheduler for kerrighed like so at http://www.kerrighed.org/wiki/index.php/SchedConfig:


Enabling the configurable scheduler framework
Kernel configuration

To (re-)enable the scheduler framework, select "Cluster support" --> "Kerrighed support for global scheduling" --> "Run-time configurable scheduler framework" (CONFIG_KRG_SCHED_CONFIG).


Well just to fill you in, the whole point of this is because I cannot get my nodes to boot up with kerrighed.

I keep getting r8169 eth0: link down

sending DHCP requests <3>DHCP/BOOTP: reply not for us
timed out!

and it just keeps looping through this.

I believe that my network card does not like the r8169 configured by the kerrighed kernel?

If I switch this back to boot from the linux kernel for nfs it boots fine. Not sure what to do from here yet.

Any clues? Could it be that my chipset is not supported?

I did notice that you said you had the same chipset in your model and it worked, is that correct Bigjimjams?

It would not let me enter "Kerrighed support for global scheduling" to choose the "Run-time configurable scheduler framework". It has me blocked out of this.

I used the install you provided from post #109, added the svn and configured schedulers in /etc/fstab with:


configfs /config configfs defaults 0 0

bigjimjams
May 22nd, 2009, 05:41 AM
Hi, I know from the kerrighed.users mailing list that there have been some issues with the 8169.ko driver being outdated. One solution was to change to the skge.ko driver. My chipset was actually the 8111, but the 8169 driver worked. The configuration options for the svn kernel should already be chosen. The only one you may have to check is the "Automatic module loading" one, which I've found to have been disabled before.

To activate the new scheduler, you run:
sudo krg_legacy_scheduler
from a node when the cluster is booted. Also, I've found that "nbmin" actually works again in the svn version, so you could set this in /etc/kerrighed_nodes to the number of nodes in your cluster.

As for the booting of nodes, I noticed that a BOOTP line appeared, I think the DHCP server should have a line which denies BOOTP but allows booting via DHCP/TFTP.

d_chall
May 22nd, 2009, 11:13 AM
We have already "ported" perceus (and warewulf) to debian.. It needs
a lot of polishing but it works for i386 and amd64 on our cluster
of ~80 machines.


deb http://biodev.ece.ucsb.edu/debian/ main contrib

On your server:
aptitude install perceus-server

There are scripts for creating debian VNFS there.
kgkv - Are you still working with Perceus on Ubuntu/Debian? I tried installing your package and everything seemed to work great at first, but whenever I try to boot the worker nodes, I get the error: "failed to execute / init, Kernel panic - not syncing: No init found. Try passing init = option to kernel". I've tried talking to people on the Perceus support IRC channel, but so far they've mostly just said that they don't test it on Debian. Did you ever face this problem, or do you know any possible solutions? Also, if you know where I could find newer packages or premade Debian/Ubuntu VNFS files that would be great. Thanks for your work on this!

jbbjshlws
May 25th, 2009, 01:47 AM
I can't test it on my cluster as currently when i boot them up, and type in krgadm nodes status, it comes up with a blank list of no nodes (not even the one i am logged onto). I am thinking this has something to do with possibly the need to reset a cluster session or maybe i am terminating the session in a weird way. it is not consistent.

Hello Nerdopolis,
This was related to the DHCP issue, if i have the mac addresses in the list, it seems to work without a problem, when they are automatically all added, i get a blank list of no nodes. I have a feeling that it worked for a while because the mac address was still in the dhcp lease time, when that expired they all disappeared out of the krgadm nodes list. So for now until i can actually get some tasks migrating across on the 3 node network i will put this issue off, as it is introducing more problems.


I ran yes with "yes > /dev/null" and it does bring the cpu to 100% without question, but even with several instances of it running it does not migrate across to any other cpu's.

I am now able to see the nodes on the network when I type in "krgadm nodes status", and the CPUs from them in "top" after pressing 1 (or in htop).

does anyone else have any other ideas on having tasks migrate from one cpu to another?

I have decided i will get the cluster working on 3 computers before stuffing around with upgraded svn's, new nic drivers or automatic dhcp config, i feel this is a good move...

Let me know what i can test to see that the tasks are migrating, bigjimjams, if you could please post the code you used to write a simple C program that is an infinite loop, it would be appreciated, i have not tested this, i did test a sh script but it didn't migrate.

Cheers,

nerdopolis
May 26th, 2009, 10:16 PM
Hi. I looked at some files in a Kerrighed live CD session for hints, and one thing I noticed is that they are passing session_id, and node_id arguments to the Kerrighed kernel. (from the tftp config).

I think that's what might be keeping Kerrighed from working (although I'm not sure, but I did read from a somewhat incomplete tutorial that you have to pass the two arguments, after I found out about them)

The problem with the node_id argument is that it will require a separate configuration file in the /srv/tftp/pxelinux.cfg/ folder for each node's IP, so that each node can get a different node_id argument (instead of just the /srv/tftp/pxelinux.cfg/default file).

How many nodes are you planning to connect? I can create a bash script that creates these files for you, once I have the number of nodes.
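
For anyone who wants to create such a file by hand: PXELINUX looks for a per-client file named after the client's IPv4 address written as eight uppercase hex digits, then tries shorter and shorter prefixes of that name, and only then falls back to "default". A minimal sketch of the conversion, with ip_to_hex as an illustrative name:

```sh
#!/bin/sh
# Convert a dotted-quad IPv4 address to the 8-digit uppercase hex name
# that PXELINUX looks for under pxelinux.cfg/.
ip_to_hex() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address into its four octets
    IFS=$old_ifs
    printf '%02X%02X%02X%02X\n' "$1" "$2" "$3" "$4"
}

ip_to_hex 10.65.12.6    # prints 0A410C06
```

Each octet is zero-padded to two digits, so the file for 10.65.12.6 would be /srv/tftp/pxelinux.cfg/0A410C06.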

jbbjshlws
May 27th, 2009, 04:05 AM
I am starting with about 120, then in mid June i will have access to about 400. towards the end of July about 800. i can get exact figures for you, but at the moment i cannot reliably have more than 4 computers connect and see the processors in top. and can't get even two computers to share the cpu usage.

Cheers,
Joshua

nerdopolis
May 27th, 2009, 03:38 PM
Below I have the contents of the bash script file that defines the boot command line for 1016 nodes. They all have different node_id passed, and it also has tftp pass the session_id=1 to the nodes . The bash script defines the node_id for ip addresses 10.54.12.1 to 10.25.16.254 (assuming that's what you use, if its not I'll tweak it). The files are put into /srv/tftp/pxelinux.cfg, and the file names are the ip address they define the kernel command line for in hex.


#!/bin/bash
nodeid=1
a=10
b=54
c=12
d=1
# Loop until the third octet passes 15; the exit at the bottom ends the script.
while true
do
    IP_ADDR=$a.$b.$c.$d
    # File name is the IP address in hex.
    filename=$( printf '%0X' ${IP_ADDR//./ } )
    printf 'LABEL linux
KERNEL vmlinuz-2.6.20-krg
APPEND root=/dev/nfs initrd=initrd.img-2.6.20-krg nfsroot=10.54.12.5:/nfsroot/kerrighed ip=dhcp rw session_id=1 node_id=' > /srv/tftp/pxelinux.cfg/$filename
    printf '%s\n' "$nodeid" >> /srv/tftp/pxelinux.cfg/$filename
    d=$((d + 1))
    nodeid=$((nodeid + 1))
    if (( d == 255 ))
    then
        d=1
        c=$((c + 1))
    fi
    if (( c == 16 ))
    then
        exit 0
    fi
done

If the idea works, I'll attach one that does IP addresses 192.168.1.1-192.168.1.254, which is what the guide uses.

schreck494
May 27th, 2009, 04:26 PM
I have no idea if you guys know anything about Open MPI, but I figured it couldn't hurt to ask. I am trying to build a beowulf cluster that will be used with this program:
http://parsec.ices.utexas.edu/
Right now, I have something that is just an updated version of the microwulf cluster running 8.04. I installed Open MPI from the repositories, and then used the Open MPI compilers to compile a parallel version of parsec. I think, however, that Open MPI is not configured correctly (I haven't changed anything since it installed), or I just don't know how to properly run parallel applications with MPI. Any suggestions would be greatly appreciated.
Thanks,
Mark

jbbjshlws
May 27th, 2009, 05:20 PM
Hello nerdopolis, I will be trying this tonight. Before I do, I just wanted to double check with you why the new range would be 10.54.12.1 to 10.25.16.254. If it is going over 1016 nodes, I would have thought the range would have been 10.54.12.1-10.54.15.254; I might be missing something here. The server IP we are using is 10.65.12.5, and the current range is 10.65.12.6-10.65.15.254. I tried the live CD and I was able to get the server computer up and running, but any other computers trying to boot from it did not work. Are others successfully able to have the live CD working?

Cheers,
Joshua

nerdopolis
May 27th, 2009, 06:16 PM
Oops. You're right. it would have stopped at 10.54.15.254. I just wasn't thinking correctly when I wrote the post.


EDIT: I found and fixed a problem that caused a glitch in tftp. The bash script would output 10.65.12.1 as A41C1. TFTP wanted it to be 0A410C01. If you ran my file, you're going to need to run it again.

EDIT 2: I also found another issue with the bash script. If you ran the bash script, you're going to need to run it again. I am sorry for any inconvenience this might have caused.

Try this:
#!/bin/bash
nodeid=1
a=10
b=65
c=12
d=6
# Loop until the third octet passes 15; the exit at the bottom ends the script.
while true
do
    IP_ADDR=$a.$b.$c.$d
    # File name is the IP address as 8 uppercase hex digits, e.g. 0A410C06.
    filename=$( printf '%02X' ${IP_ADDR//./ } )
    printf 'LABEL linux
KERNEL vmlinuz-2.6.20-krg
APPEND root=/dev/nfs nfsroot=10.65.12.5:/nfsroot/kerrighed ip=dhcp rw session_id=1 node_id=' > /srv/tftp/pxelinux.cfg/$filename
    printf '%s\n' "$nodeid" >> /srv/tftp/pxelinux.cfg/$filename
    d=$((d + 1))
    nodeid=$((nodeid + 1))
    if (( d == 255 ))
    then
        d=1
        c=$((c + 1))
    fi
    if (( c == 16 ))
    then
        exit 0
    fi
done

I was able to get the live CD to work but it's a little hacky: when the server is booting, you need to attach it to a DHCP server, so that it can get its own IP address. After the server loads, when you go to boot your nodes, you need to disconnect the DHCP server so that the nodes don't get the first DHCP server's information.

cfrieler
May 28th, 2009, 04:16 AM
@Mark

I haven't used OpenMPI, but have used MPICH in several installations over the years.
It's not easy to convert a standard program to efficiently use MPI on a cluster. If all you did was to recompile using the MPI library, you shouldn't expect any computation on the other nodes. You haven't broken the code into distributable pieces yet!

My recommendation would be to stick to a simple demo program until you can confirm that MPI is working. There is a common "Hello World!" and a pi calculator that are both very easy to compile and run. Note that for MPICH, there are special compile and run commands that must be used. Use this simple program to touch each of the compute nodes and confirm that the MPI daemons are communicating. Then you can move on to learning how to convert your target code to distribute part of the task and really use the cluster.

Good Luck!

schreck494
May 28th, 2009, 10:46 AM
The program is already programmed to be run in a parallel environment. My problem is I have no idea how to set up Open MPI.
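
A common first smoke test for Open MPI, along the lines cfrieler describes, is to give mpirun a hostfile and launch something trivial such as hostname on every node. The sketch below only writes the hostfile; the node names and slot counts are placeholders, write_hostfile is a made-up helper, and it assumes passwordless ssh between the nodes and the same Open MPI install on each of them.

```sh
#!/bin/sh
# Write an Open MPI hostfile: one "hostname slots=N" line per node.
# Arguments: output file, slots per node, then the node names.
write_hostfile() {
    out=$1
    slots=$2
    shift 2
    : > "$out"
    for node in "$@"; do
        printf '%s slots=%s\n' "$node" "$slots" >> "$out"
    done
}

# Example smoke test (node names are placeholders):
#   write_hostfile hosts 4 node1 node2
#   mpirun --hostfile hosts -np 8 hostname
```

If every node name shows up in the output, the launcher is working and the parallel build of the real program can be started the same way, with its executable in place of hostname.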

jbbjshlws
June 11th, 2009, 08:05 PM
Hello,
I have been very busy mucking around with this cluster, and I am trying to run an application on a clustered node that was written to test clusters with OpenMP. The problem is the program is written for Windows PCs, so I have installed "apt-get install ubuntu-desktop" and also "apt-get install wine" and had wine do all of its updates http://www.winehq.org/download/deb

after this, i installed winetricks, ran "sh winetricks dotnet20 vcrun2008" and proceeded through the prompts. it came up with errors for vcrun2008 with write permission issues, i chmod and chown the .wine directories to the logged in root user (cluster) and i still got the same errors for vcrun2008, (vcrun2005sp2 worked with no problems) so i played around for a while, the solution was to extract the files in vcrun2008 in a windows environment then copy the directories across to ubuntu then run the files, and it seems to install with no issue.

when i go to run the attached file, i get the attached error, (too long to post)

I do not get this or the above error when i run the file on a machine that has been installed from scratch (with 8.04), and not booted into ubuntu 8.04 from the nfs kerrighed kernel. i do not understand the error or how to fix it. i am posting this here because it does not seem to be a wine issue, as it works fine with wine on a normal un-clustered machine, so it seems related to the cluster setup.

Any help is appreciated,

Kind Regards,
Joshua

nerdopolis
June 14th, 2009, 09:11 PM
I think if that program was to run on a Kerrighed cluster in wine, it would still fail because Kerrighed has no thread migration, so I don't think kerrighed can support spanning processes' threads across multiple nodes yet, and I think that program would just create another thread in the wine process.

Kerrighed's live CD has a bash script that starts the cluster (I saw some weird ssh stuff come up when I ran it...), and runs a test. If you want it I'll attach it for you.

jbbjshlws
June 15th, 2009, 06:14 AM
Hello nerdopolis

thank you for your reply, it would be awesome if you could attach that ssh file for me to test with.

So the best way to test the file i have would be porting this to mono is this correct?

directhex
June 15th, 2009, 07:37 AM
OpenMP is explicitly NOT designed to run code spanning multiple nodes. Performance when doing so with an OpenMP emulator is generally terrible, even with a high-speed Infiniband network.

nerdopolis
June 15th, 2009, 09:24 PM
I have the bash scripts for starting Kerrighed attached. it contains executables from the Kerrighed Live CD. demokrg is the one you run. it calls all the weird ssh commands I told you about, and then it runs another bash script that starts Kerrighed, and then a cpu eater, and then brings up htop. If it works it should show the number of cpus, all at 100%.

I think the best place to run this is on the nfs server as chroot. Extract the files to /bin, or another folder where Linux sees executables.

EDIT: I missed a file in the archive, so I reattached it.

DiabolicalGamer
June 21st, 2009, 10:33 PM
Hello Everyone, I've been trying to get a cluster running in VMWare using Kerrighed. I tried version 2.3.0 as well as the latest SVN version. I am able to get a 2 node cluster running and am able to manually migrate a process between them. I would however like to have this happen automatically. I realized last night that I hadn't configured krg_legacy_scheduler, which I originally assumed was an old package. Today I've been struggling to get /config mounted as ConfigFS. Not sure where to go from here. Are there any specific changes I need to make to the kernel when I'm compiling to get it to work, or does it only work with a certain version? I could really use some help. Thanks in advance for your speedy replies. /DG
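
A minimal sketch of the ConfigFS step described above, assuming a stock Linux setup: it checks whether the running kernel offers configfs at all (the kernel option is CONFIG_CONFIGFS_FS) and, if so, mounts it on /config. The function names are made up for illustration, and whether krg_legacy_scheduler needs anything beyond this mount is not confirmed anywhere in this thread.

```shell
#!/bin/sh
# Succeed if the given filesystem list (default: /proc/filesystems) names
# configfs, i.e. the running kernel was built with configfs support.
has_configfs() {
    grep -qw configfs "${1:-/proc/filesystems}"
}

# Mount configfs on /config unless it is already mounted there.
# Needs root. /config is the mount point mentioned in the post above.
mount_configfs() {
    mkdir -p /config
    grep -q ' /config configfs ' /proc/mounts || mount -t configfs none /config
}

# Usage (as root on a cluster node):
#   if has_configfs; then
#       mount_configfs
#   else
#       echo "kernel lacks configfs; rebuild with CONFIG_CONFIGFS_FS" >&2
#   fi
```

If has_configfs fails, no mount command will help, and the kernel has to be rebuilt with configfs enabled (or the configfs module loaded, if it was built as a module).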

Sir Trysalot
June 23rd, 2009, 12:05 AM
Hello All!
I’m Tony, and I would like to hear some of your opinions on what I’m planning, and also some more information on the benefits of running Kerrighed over some of the other ways of setting up a cluster.

I am planning a system using the Atom 330 based D945GCLF2 Intel board, with 2 Gb of ram per node. I have thoughts of eventually growing to 8 – 16 total boards. Am I wasting my time setting up a cluster with these?

This is the most current and most relevant forum and thread that I have found concerning Beowulf cluster computing.

Thank you for taking the time to read and respond to my post.

Tony.
):P

DiabolicalGamer
July 7th, 2009, 01:11 AM
Hey Tony, I have been trying to set up a cluster using Kerrighed for about a week or two now and I've gotta say it's not as easy as it looked originally. It's still in beta and documentation is either missing or scattered across the forums. I wouldn't discourage you from trying, as I've learned a lot during the process, but if you've got a specific time schedule I'd definitely caution you. I hope to write a good How-To once I've got everything working, so check this thread soon. :) /DG

a.mason
July 9th, 2009, 10:42 AM
Hi all - I'm Alicia Mason; I work with Tony / ajt at the Rowett Institute in Aberdeen. We now have a 7-node Beowulf running Kerrighed under a spin of Ubuntu, Bio-Linux 5, which I set up with Tony's help.

In doing this, I've made a lot of revisions, fleshings-out and corrections to BigJimJams' guide, which was excellent; it just needed a little work. I'd appreciate it if anyone has feedback, more suggestions, questions etc. The guide as is will produce a working 4-node Beowulf if you follow it exactly, and I'm working on adding instructions to make the server also a functioning node as we've done with our cluster.

Thanks everyone

AM

nerdopolis
July 9th, 2009, 11:09 PM
I think I read on some incomplete Kerrighed tutorial somewhere that you have to pass the session_id (which is the same for each node), and the node_id (which is different for each node) to the kerrighed kernel with tftp, for Kerrighed to work.

BTW: Kerrighed 2.4.0 is out.

EDIT: just reading the release notes for 2.4.0, and they say you need to pass session_id=xx but nothing about node_id, which simplifies a lot of things.

gabrielaca
July 19th, 2009, 05:55 PM
To a.mason: I appreciate your effort in posting those instructions. I'm planning on building a 6-node cluster myself, and your experience would be very helpful.

artefact
July 22nd, 2009, 08:27 AM
By default, Kerrighed uses the last byte of your IP address as node_id, but you can still pass the node_id parameter if you want. This is still true for 2.4.0.

Jean

lrilling
July 22nd, 2009, 10:14 AM
Hi BigJimJams,

Hi Xingmu and anybody else interested, as promised in my earlier post I've added a draft guide for setting up a kerrighed 2.3.0 cluster in Ubuntu 8.04 on the Easy Ubuntu Clustering Wiki. Here is the link:

https://wiki.ubuntu.com/EasyUbuntuClustering/UbuntuKerrighedClusterGuide

If you have any questions, comments, suggestions or improvements let me know.

I've setup a link to your guide from Kerrighed's documentation page.

As you probably noticed, a few changes are required for Kerrighed 2.4, mostly because of the new global scheduler framework. In short:


A script called krg_legacy_scheduler is now installed, and is automatically run at cluster start, provided that /etc/init.d/kerrighed was properly installed.
The bugs of krgadm should now be fixed. If not, please tell us on kerrighed.users@listes.irisa.fr.

You could check the release notes (http://www.kerrighed.org/wiki/index.php/Release2_4_0) and the updated howto (http://www.kerrighed.org/wiki/index.php/Installing_Kerrighed_2.4.0) for details.

Thanks!

Louis

lrilling
July 22nd, 2009, 01:37 PM
Hello,

Although this post is getting old, I would like to fix a few wrong statements.

2. Someone mentioned OpenSSI which is nearly dead > know
that Kerrighed re-used most of the OpenSSI + Mosix source.

Kerrighed does not (and never did) borrow any source code from OpenSSI + openMosix.
The only piece of code that could matter is the global scheduler, whose algorithm is only
inspired by the MOSIX one (the openMosix one should be almost identical to
MOSIX's).

3. While it's being the most likely candidate to be used
by mere mortals, it lacks some elementary functionality ( aside from the
outdated howto's & documentation ).
Despite the info on their site, some advertised functionality will not ( ever )
be available
http://www.kerrighed.org/wiki/index.php/Status
Roadmap >november 2008 > thread migration will NOT be implemented any time soon.
This is very misleading,

It's true that thread migration is not on any roadmap yet. The status page does
not mention it anymore. The roadmap was given as we, Kerrighed developers,
wished it to be, but like with any free software project time cannot be
guaranteed.

and leaves me with mixed feeling knowing that it is (was)
a EU funded operation, that is now run by Kerlabs.

Kerlabs was actually created one year before EU decided to fund XtreemOS (not
Kerrighed). XtreemOS only planned contributions to Kerrighed, and it's still the
case since Kerlabs entered XtreemOS in June 2009. Before Kerlabs was created,
Kerrighed was funded by various French institutes (INRIA, University of Rennes
1, the French department of defense, EDF).

I made some inquiries via a major german corporation in
regard to aforementioned feature ( Just to make sure I got an honest
reply).

That's an offensive statement, really. Every post on the kerrighed.users mailing
list is considered, especially regarding threads (and you surely noticed it).
The poster's identity is interesting, but not a reason for not daring to reply.

I cannot fully disclose the document but in short it reads
that Kerlabs doesn't see any benefit in developing thread migration any further.

This is quite wrong. The reasons for not making it a priority to develop threads
are publicly detailed, and even mentioned in
http://www.kerrighed.org/wiki/index.php/FAQ. Kerlabs never stated (and it's
not Kerlabs' opinion) that thread migration does not deserve being developed.

This because of the small user base that might actually use it ..... (sic)
It must have been a prank if this wasn't a genuine answer. Nothing wrong with
out-of-this-world academics, but really how can a company / management be so
ignorant for not seeing the huge benefits to the community..
About everyone I know ( in 3D ) would hook up their computers to create a mini
cluster(s) ..

And how can someone who knows nothing about Kerrighed internals (the statements
above prove it, sorry) dismiss the technical reasons mentioned in
http://article.gmane.org/gmane.linux.cluster.kerrighed.user/181? Ask anyone from
the parallel and distributed community about the chances for software DSM (this
is the scientific term for memory sharing in COTS clusters) to provide
performance, and they will confirm what I stated.

So in short:

Partially because of for-mentioned reasons there's a big chance I'll fork the
project.
Another reason is that my targeted userbase is completely different > I don't
need fancy checkpointing on 500 node clusters / hot adding / removing etc...
Most users of this fork will cluster less then 10 computers ( so no need for
infiniband / myrinet ).

I've already set up a bug tracker so ....

Forking is totally pointless! If you think that you can get people working on
thread migration (and seriously, I doubt it, as Kerrighed is already short on
developers), why wouldn't they contribute to the original project? This is in no
way incompatible with other developments, and nobody from the Kerrighed
community ever discouraged contributions.

Really, Kerrighed does need contributors (developers, testers, users' feedback, etc), not pure criticism followed by forking.

Louis

nerdopolis
July 23rd, 2009, 04:30 PM
To Artefact: Last digit of the IP or last segment? would 192.168.1.2 and 192.168.1.12 be duplicate nodes?

lrilling
July 24th, 2009, 04:57 AM
To Artefact: Last digit of the IP or last segment? would 192.168.1.2 and 192.168.1.12 be duplicate nodes?

Least significant byte. 192.168.1.2 and 192.168.1.12 will have node id 2 and 12 respectively.

Louis
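
The rule described in the replies above can be sketched in a few lines of shell. This is only an illustration, assuming IPv4 dotted-quad addresses; the function names are made up here and are not part of Kerrighed. The parameter names session_id and node_id are the ones mentioned in this thread.

```shell
#!/bin/sh
# Print the default node_id for an IPv4 address: its last (least
# significant) byte, as described in the reply above.
node_id_from_ip() {
    echo "${1##*.}"
}

# Print the Kerrighed kernel parameters for one node.
# $1 = session_id (the same on every node), $2 = optional explicit node_id.
krg_boot_args() {
    if [ -n "${2:-}" ]; then
        echo "session_id=$1 node_id=$2"
    else
        echo "session_id=$1"
    fi
}

# Usage:
#   node_id_from_ip 192.168.1.12    # prints 12
#   krg_boot_args 1                 # prints session_id=1
#   krg_boot_args 1 12              # prints session_id=1 node_id=12
```

Because only the last byte is used, two nodes would only get the same default node_id if their addresses ended in the same byte, for example on two different subnets.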

ajt
July 26th, 2009, 06:09 AM
Hello,

Although this post is getting old, I would like to fix a few wrong statements.
[...]


Hello, Louis.

Thanks very much for your post: I started this thread in an effort to bring together several disconnected discussions about SSI and clustering under Ubuntu, and to promote Kerrighed in a forum read by people involved in education and science. Quite often, discussions that are of practical value to people who are NOT developers are buried on lists and forums that are only read by developers. However, as you've seen on this list, I believe that there is quite a lot of interest from the Ubuntu community about using, rather than developing, Kerrighed in teaching and research.

In a post off-list, you asked me if I intend to use a Kerrighed kernel for my 'biobuntu' live DVD. Indeed I do: I would like to create an Ubuntu-based replacement for ClusterKnoppix, the Knoppix/OpenMosix-based live CD that can be used to transform a classroom of Windows PC's into a useful Beowulf cluster. Any help and advice that you can give us would be welcome. I've tried out your existing Kerrighed live CD, and I would like to create an Ubuntu equivalent.

One issue I have with your existing live CD is that it should, clearly, not be booted on a LAN, because it advertises services that conflict with LAN servers. This is not a problem, of course, when booting computers that are on a private LAN. However, for the use-case of converting a classroom of Windows PC's into a Kerrighed cluster it is likely that there will already be servers set up by central IT services and, as is the case at our University, drastic action will be taken to avoid such conflicts happening - Like preventing people from booting the live CD!

Thanks again for joining the EasyUbuntuClustering discussion, and for making Kerrighed available. If you have time, please correct errors on the Ubuntu Kerrighed installation wiki started by Bigjimjams. Alicia will be updating it for Kerrighed 2.4.0 once we get it working on her 'kitcat' cluster at RINH. We also have a prototype Kerrighed cluster called 'santabarbara' built by my colleague Luca in Milan. Our project is to get these two clusters talking using XtreemOS, and Alicia will be in Milan next week helping Luca to install Kerrighed 2.4.0 on 'santabarbara'.

Bye,

Tony.

ajt
July 26th, 2009, 06:36 AM
The program is already programmed to be run in a parallel environment. My problem is I have no idea how to set up Open MPI.

Hello, schreck494.

Try using LAM (Local Area Multicomputer) MPI instead

aptitude install lam-dev lam-mpidoc


You need to create a file lam-bhost.def telling LAM that your node has nn processors, where nn is the total number of processors available in your Kerrighed cluster

node-name cpu=nn


This file is used by default when you start LAM

lamboot
lamnodes


Active MPI processes will then be migrated to other nodes by Kerrighed.

Bye,

Tony.
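
The LAM steps above could be scripted roughly as follows. This is a sketch, not a tested recipe: the helper name bhost_line, the node name node1, the CPU count 8 and the program name are all placeholders. The boot schema is passed to lamboot explicitly so the example does not depend on where your installation keeps its default lam-bhost.def.

```shell
#!/bin/sh
# Print one LAM boot-schema line: "<node-name> cpu=<count>".
# $1 = node name, $2 = number of CPUs LAM should assume on that node.
bhost_line() {
    printf '%s cpu=%s\n' "$1" "$2"
}

# Usage on the cluster (hypothetical node name, CPU count and program):
#   bhost_line node1 8 > lam-bhost.def
#   lamboot -v lam-bhost.def     # start LAM with this boot schema
#   lamnodes                     # list the nodes and CPUs LAM sees
#   mpirun C ./my_mpi_program    # start one MPI process per CPU
#   lamhalt                      # shut LAM down afterwards
```

With a single node entry, LAM starts every MPI process on that one node, and spreading them across the cluster is then left to Kerrighed's process migration, as described above.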

Merc248
July 26th, 2009, 07:48 PM
I personally use RHEL 5.3 and OSCAR 5.1rc1 for managing a ten node cluster. OSCAR seems pretty good for deploying an HPC cluster, but how does Kerrighed stack up against it?

ajt
July 27th, 2009, 07:27 AM
I personally use RHEL 5.3 and OSCAR 5.1rc1 for managing a ten node cluster. OSCAR seems pretty good for deploying an HPC cluster, but how does Kerrighed stack up against it?

Hello, Merc248.

OSCAR is used to set up clusters of Red-Hat derivative Linux PC's for HPC using MPI, not SSI. There was an SSI-OSCAR project for kerrighed in 2005:

http://ssi-oscar.gforge.inria.fr/

SSI-OSCAR is now part of OSCAR, but it has been dormant for two years:

http://svn.oscar.openclustergroup.org/trac/oscar/changeset/6146

I think OSCAR is good, but I don't think it can be used by people who want to build HPC clusters of Ubuntu PC's using an SSI kernel.

There was talk on this forum of using Warewulf/Perceus to deploy Kerrighed under Ubuntu, but I don't know if anyone has actually done it yet?

Our wiki, started by Bigjimjams, documents how to deploy Kerrighed under Ubuntu manually. Of course, it would be great if someone can automate the process of creating an Ubuntu SSI cluster using ideas from the RHEL-based cluster provisioning world. The best system I've seen is Scyld, but you have to pay for that :-)

Bye,

Tony.

Merc248
July 27th, 2009, 12:51 PM
Ahhh, so I'm comparing apples and oranges. :) Forgive my ignorance, I'm still learning a lot about setting up clusters. I'm not too sure if I can stray too far away from building a cluster with MPI unfortunately.

I wonder, however, if there's a clustering solution on Ubuntu that uses MPI and is about as easy to deal with as OSCAR? Doesn't seem like OSCAR installs all too cleanly on Debian systems (if it does at all.)

Syndr
July 27th, 2009, 06:52 PM
I am interested in setting up a small cluster using kerrighed. However, I do have a couple of basic questions about the general system.

My main use of such a system would be for standard desktop applications. As such I would need to run the X window system and a full desktop environment of some sort. How realistic would this be? Would it significantly impact the performance (assuming older systems)?

My understanding of the file system is that all nodes would use the '/nfsroot/kerrighed' directory on the server as a root filesystem. So something like X would have to be run on all the nodes, correct?

Also, in order to actually use the cluster, you have to run applications on the nodes themselves, and not the server. Therefore, to use for a desktop system you would likely want to interact directly with one of the nodes instead of the server?

I have not actually worked with clustering before, and as such want to have a fairly good grasp of how it would work before I attempt to set it up.


Thanks

ajt
July 28th, 2009, 05:24 AM
I am interested in setting up a small cluster using kerrighed. However, I do have a couple of basic questions about the general system.

My main use of such a system would be for standard desktop applications. As such I would need to run the X window system and a full desktop environment of some sort. How realistic would this be? Would it significantly impact the performance (assuming older systems)?


Hello, Syndr.

I think it's realistic, but most of the documentation I've seen about Kerrighed assumes that the 'head' server (i.e. the DHCP/NFSROOT server) doesn't run the kerrighed kernel. However, that's not how we use it!

You need to compile a stand-alone version of the Kerrighed kernel to run on the head node, and use UNFS3 with ClusterNFS extensions enabled to share the '/' filesystem of the head node with the compute nodes. Alicia will add instructions about how to do this to the wiki. Then, you make the head node a terminal server using FreeNX, and run your desktop apps. from an NX client.


My understanding of the file system is that all nodes would use the '/nfsroot/kerrighed' directory on the server as a root filesystem. So something like X would have to be run on all the nodes, correct?


Yes, and no - You run X11 apps. on the head node and Kerrighed migrates them to other compute nodes automatically. The X11 server runs on your client PC. If you use UNFS3 as I described, you don't need '/nfsroot/kerrighed' at all, just share the '/' filesystem of the head node with the compute nodes.


Also, in order to actually use the cluster, you have to run applications on the nodes themselves, and not the server. Therefore, to use for a desktop system you would likely want to interact directly with one of the nodes instead of the server?


That's wrong: An SSI system is not a COW (Collection Of Workstations)!

The whole idea of the Kerrighed kernel is to make your Beowulf cluster look like a large SMP machine to applications. You can, if you want, run apps. on the nodes. However, that does not require an SSI kernel and is really just a pool of computers that you can run your apps. on manually.


I have not actually worked with clustering before, and as such want to have a fairly good grasp of how it would work before I attempt to set it up.


People use computer clusters in many different ways: recently, 'cloud' computing has become popular. However, 'cloud' computing is really just a new name for buying computer time commercially from a computer bureau instead of owning your own computers.

In fact, I think it's a good idea if you seldom use computers, but it becomes less and less attractive the more you use and pay for 'cloud' computing. An alternative is to build your own Beowulf cluster, and present it to your users as their own 'cloud'. This is now happening at major computing centres like the NSF, where users want the simplicity of 'cloud' computing, but they don't want to pay commercial 'cloud' fees.

Your idea of using a Kerrighed cluster to run desktop applications is realistic, and it's what I do now with openMosix. We're still testing Kerrighed 2.4.0, but we will upgrade our Beowulf from openMosix as soon as we think it is stable enough for 'production' work.

Good luck with your project, and post here to tell us how you get on.

Bye,

Tony.

gabrielaca
July 28th, 2009, 06:55 PM
Hello ajt, I just read the wiki, and correct me if I'm wrong: I can have as many nodes as I can afford by following the instructions, since the only difference will be the number of IP addresses and the corresponding hostnames. Am I correct?


Now some questions: will Kerrighed recognize a multicore processor? I'm guessing this one is easy, but what about multiprocessor boards, with 2 or 4 multicore processors on a single board? :confused:

This install is based on i386 Ubuntu, but can it be done on AMD64 Ubuntu? What I mean is, will anything change drastically in the install process?

thanks.

Syndr
July 29th, 2009, 04:36 AM
Using the server itself as part of the cluster definitely does make sense here. Not only would you have more power available to the actual cluster, but it would make it easier to interact with the cluster itself, as the server is actually part of it.
How would a stand-alone version of the Kerrighed kernel be different from the kernel run on the nodes? I would assume that the nodes would run a different kernel than the server, correct?

I'll have to find out more about UNFS3, as I know very little about it. Instructions would definitely be very helpful here. Do you have any suggestions on where to look to learn more about it? How is it different from NFS?

I had not actually considered using something like FreeNX to interact with the cluster, but that would definitely open up some interesting possibilities.

Thanks for your help on this, and I will definitely let you know how it works out.


-Syndr

lrilling
August 3rd, 2009, 07:25 AM
In a post off-list, you asked me if I intend to use a Kerrighed kernel for my 'biobuntu' live DVD. Indeed I do: I would like to create an Ubuntu-based replacement for ClusterKnoppix, the Knoppix/OpenMosix-based live CD that can be used to transform a classroom of Windows PC's into a useful Beowulf cluster. Any help and advice that you can give us would be welcome. I've tried out your existing Kerrighed live CD, and I would like to create an Ubuntu equivalent.

One issue I have with your existing live CD is that it should, clearly, not be booted on a LAN, because it advertises services that conflict with LAN servers. This is not a problem, of course, when booting computers that are on a private LAN. However, for the use-case of converting a classroom of Windows PC's into a Kerrighed cluster it is likely that there will already be servers set up by central IT services and, as is the case at our University, drastic action will be taken to avoid such conflicts happening - Like preventing people from booting the live CD!

I understand your use case, but this is exactly what we did not want to deal with when building the live CD. I suspect that the live CD should be customized for every use case because of the various configurations of other servers on the LAN, which is definitely against the idea of easy testing. Ideas are welcome though!

If you have time, please correct errors on the Ubuntu Kerrighed installation wiki started by Bigjimjams. Alicia will be updating it for Kerrighed 2.4.0 once we get it working on her 'kitcat' cluster at RINH.

AFAICS the update for Kerrighed 2.4.0 is the only "error".

We also have a prototype Kerrighed cluster called 'santabarbara' built by my colleague Luca in Milan. Our project is to get these two clusters talking using XtreemOS, and Alicia will be in Milan next week helping Luca to install Kerrighed 2.4.0 on 'santabarbara'.

According to an XtreemOS demo I've seen, it should work :) Thanks for the update, we are always interested in users' experiences with Kerrighed.

Louis