Easy Ubuntu Clustering
ajt
August 7th, 2009, 04:36 PM
Hello ajt, I just read the wiki, and correct me if I'm wrong: I can have as many nodes as I can afford by following the instructions, since the only difference will be the number of IP addresses and the corresponding hostnames. Am I correct?
Hello, gabrielaca.
It depends how deep your pockets are ;-)
Kerrighed does have a limit on the number of nodes: 32766, defined in the source file "include/kerrighed/sys/types.h":
#define KERRIGHED_MAX_NODES (1<<NR_BITS_IN_MAX_NODE_ID) /* Real limit 32766 */
Now some questions: will Kerrighed recognize a multicore processor? I'm guessing this one is easy, but what about multiprocessor boards, with 2 or 4 multicore processors on a single board?
Kerrighed has a limit on the number of CPUs it can support, defined in the same file "include/kerrighed/sys/types.h". As you see, the limit is currently set to four CPUs per node (i.e. motherboard):
#define KERRIGHED_MAX_CPU_PER_NODE 4
#define KERRIGHED_MAX_CPU (KERRIGHED_MAX_NODES * KERRIGHED_MAX_CPU_PER_NODE)
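As a rough back-of-the-envelope check, the two limits above multiply out to the largest CPU count a Kerrighed 2.4 cluster could address. The sketch below uses the "real limit" of 32766 from the source comment rather than the value of NR_BITS_IN_MAX_NODE_ID, which isn't shown in this thread; the variable names are just for illustration.

```shell
# Limits quoted from include/kerrighed/sys/types.h (Kerrighed 2.4)
max_nodes=32766        # "Real limit" from the comment on KERRIGHED_MAX_NODES
max_cpu_per_node=4     # KERRIGHED_MAX_CPU_PER_NODE

# Largest number of CPUs the cluster could address under these limits
max_cpu=$((max_nodes * max_cpu_per_node))
echo "$max_cpu"        # prints 131064
```

In practice the per-node limit of four CPUs is the one a small home cluster is more likely to run into than the node limit.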
This install is based on i386 Ubuntu, but can it be done on AMD64 Ubuntu? What I mean is, will anything change drastically in the install process?
thanks.
I don't know because we've not tried it yet, but I don't expect it to.
Please post a message here if you try it out, and tell us how you got on.
Bye,
Tony.
ajt
August 7th, 2009, 04:57 PM
Using the server itself as part of the cluster definitely does make sense here. Not only would you have more power available to the actual cluster, but it would make it easier to interact with the cluster itself, as the server is actually part of it.
How would a stand-alone version of the Kerrighed kernel be different from the kernel run on the nodes? I would assume that the nodes would run a different kernel than the server, correct?
Hello, Syndr.
Just disable ROOT on NFS:
diff .config.old .config
...
< CONFIG_ROOT_NFS=y
---
> # CONFIG_ROOT_NFS is not set
I'll have to find out more about UNFS3, as I know very little about it. Instructions would definitely be very helpful here. Do you have any suggestions on where to look to learn more about it? How is it different from NFS?
Alicia will update the wiki soon: UNFS3 is a user-space, as opposed to kernel-space, NFS server. That means it's not as efficient as the kernel NFS server, but it's more flexible for our purposes. UNFS3 adopted the ClusterNFS project, which was based on an older version of UNFS. If you configure UNFS3 with ClusterNFS extensions when it is built, you can get it to read 'tags' with IP addresses etc. and serve particular files to particular hosts, or one set of files to clients, and another locally.
We just use UNFS3 from the Ubuntu repositories, with this config in "/etc/default/unfs3":
...
# Cluster extensions
# Enable cluster extensions. When this option is enabled, so-called tagged
# files are handled differently from normal files, making it possible to
# serve different file contents to different clients for the same filename.
# See tags(7) for a description of tagged files. This option causes a
# performance hit.
CLUSTER_EXTENSION="-c"
I had not actually considered using something like FreeNX to interact with the cluster, but that would definitely open up some interesting possibilities.
It does - we use it all the time: Works great over WAN or broadband!
Thanks for your help on this, and I will definitely let you know how it works out.
Please do, because what I hope we can do here is help each other get useful Kerrighed clusters running under Ubuntu.
Bye,
Tony.
Qwas
August 8th, 2009, 11:22 AM
Hi, how about using Ubuntu LTSP as a base system? AFAIK it has DHCP, TFTP etc. already integrated, so I guess it's a better starting point than basic Ubuntu. Right now I'm trying to install Kerrighed this way, but I'm having issues building the Kerrighed kernel, so if you know of a precompiled kernel, links are welcome.
One more thing: if somebody has made it through the tutorial, is it possible to create a working virtual machine from this project and share it?
ajt
August 8th, 2009, 04:10 PM
Hi, how about using Ubuntu LTSP as a base system? AFAIK it has DHCP, TFTP etc. already integrated, so I guess it's a better starting point than basic Ubuntu. Right now I'm trying to install Kerrighed this way, but I'm having issues building the Kerrighed kernel, so if you know of a precompiled kernel, links are welcome.
One more thing: if somebody has made it through the tutorial, is it possible to create a working virtual machine from this project and share it?
Hello, Qwas.
I have looked at LTSP, but its aims and objectives are different to those of a typical Beowulf cluster: Each LTSP node is intended to be booted diskless as a 'thin' client that runs its programs on the LTSP server. The nodes are 'thin', because they are only used to run what is, to all intents and purposes, an X-terminal.
The advantage of LTSP is that you can reuse old PCs as 'thin' clients instead of buying expensive X-terminals. You still need to buy quite a powerful LTSP server, which runs all the programs for the 'thin' clients. However, this is where Kerrighed might help if you cluster several systems to provide an SSI (Single System Image) LTSP server.
Indeed, it is possible to produce an SSI system using 'fat' workstations, each of which runs an SSI kernel and shares its resources with the others, but that is very different to LTSP. This is how openMosix has sometimes been used: A group of people agree to share their workstation resources, so that when they are not busy, other people can use their compute resources. However, this model is vulnerable to at least two serious problems: #1 People complain that their workstation is slow because 'everyone' else is sharing it, #2 People switch off their own workstation without telling anyone else and crash the Beowulf cluster.
Conventional wisdom is that you have to use dedicated compute nodes that are under your administrative control to build a useful Beowulf. Of course, you may only have temporary control of the compute nodes for the duration of a transient Beowulf cluster running on a classroom of Windows PCs overnight, for example. However, you can ensure that nobody switches off a compute node that your job has been migrated to while the classroom is under your control and dedicated to running the Beowulf cluster.
Bye,
Tony.
Qwas
August 9th, 2009, 02:30 AM
Hi Tony, thanks for the answer.
No, I don't mean the clients will be used by people. They will provide computing power; they won't be terminals at all, and they will be truly dedicated to computing. It's just that LTSP and Kerrighed share the same idea: nodes (or thin clients) use an image fetched over TFTP from the control computer, and this is exactly what LTSP does, right? My idea was to use the LTSP services, which are otherwise complicated to install (DHCP, TFTP and so on), swap the original LTSP client image stored in the system for the Kerrighed node image, tune the configuration files, and boot the nodes with it so they become cluster nodes.
ajt
August 9th, 2009, 06:38 AM
Hi Tony, thanks for the answer.
No, I don't mean the clients will be used by people. They will provide computing power; they won't be terminals at all, and they will be truly dedicated to computing. It's just that LTSP and Kerrighed share the same idea: nodes (or thin clients) use an image fetched over TFTP from the control computer, and this is exactly what LTSP does, right? My idea was to use the LTSP services, which are otherwise complicated to install (DHCP, TFTP and so on), swap the original LTSP client image stored in the system for the Kerrighed node image, tune the configuration files, and boot the nodes with it so they become cluster nodes.
Hello, Qwas.
Well, the main point about LTSP is that the 'thin' clients DON'T provide any computing power, the LTSP server does that and all LTSP applications run on the LTSP server, not the 'thin' clients - That's why they are called 'thin'. All the LTSP clients provide is a GUI for applications that are actually running on the LTSP server.
Kerrighed nodes are 'fat' clients in this sense because they DO provide computing power to the SSI (Single System Image).
Setting up DHCP and TFTP is not really all that complicated, and I think we could probably do everything in a few debs to create an Ubuntu-based Kerrighed Beowulf cluster. The idea you have does make sense but, in my opinion, using LTSP as the basis for setting up DHCP/TFTP for Kerrighed doesn't.
There are several existing projects used for 'provisioning' of compute nodes in HPC: Perceus/Warewulf, for example, has already been suggested earlier on this thread:
http://www.perceus.org/portal/
I think that Perceus looks like a more promising way forward than trying to customise LTSP, but I can see that Kerrighed might be useful as a way of improving the performance of an LTSP server and in that respect, I do see a useful link between the two projects.
Bye,
Tony.
lrilling
August 10th, 2009, 06:46 AM
Kerrighed has a limit on the number of CPUs it can support, defined in the same file "include/kerrighed/sys/types.h". As you see, the limit is currently set to four CPUs per node (i.e. motherboard):
#define KERRIGHED_MAX_CPU_PER_NODE 4
#define KERRIGHED_MAX_CPU (KERRIGHED_MAX_NODES * KERRIGHED_MAX_CPU_PER_NODE)
There is actually no good reason for this limit, and it was removed in the 2.6.30 port. It wasn't removed in Kerrighed 2.4 because nobody seemed to complain about it, so it was forgotten...
Louis
mike919
August 10th, 2009, 10:37 AM
Hey everyone,
I was working on following the network boot part of this guide with Ubuntu server 8.04.3 and I was running into a problem with the NFS server that a few people might also have encountered earlier in this thread.
Everything would work with the network booting _until_ I rebooted the main node. At that point the NFS kernel daemon took a long time to start and I would get connection refused errors when the other nodes tried to boot over network.
Checking the /var/log/messages of the main node I found this reason for NFS taking a long time to start:
RPC: failed to contact local rpcbind server (errno 5).
rpcbind: server localhost not responding, timed out
This presumably also was causing the connection refused errors. Finally, I found a solution to this here (http://www.linuxquestions.org/questions/mandriva-30/service-nfs-start-not-working-598581/).
This was for Mandriva, but the Ubuntu equivalent was this:
sudo apt-get install sysvconfig
sudo service portmap start
sudo service nfs-common start
sudo service nfs-kernel-server start
From this point on the network boot worked even after rebooting! Thanks for the great guide, and hopefully this prevents some frustration for other people.
Mike
count_dracula
August 11th, 2009, 07:57 PM
Thank you all for your messages. I'm wondering if you could help me with this issue I have.
This is what I want to achieve:
I do basic multimedia editing for my home videos. My current distro is Debian Lenny.
Since I have some old computers at home, I'm thinking I can use their CPU power to help me on the editing/rendering.
Would the guide posted by BigJimJams help achieve that? Would Perceus do it?
If any of the above ...
Should my main computer (the one I'm interacting with for video editing) be the main node (the server) or just a regular node?
[ or if you have a network design that would allow me to achieve my objective that's cool too ]
Thanks in advance.
1024Jon
August 12th, 2009, 11:59 AM
Solved. Just in case anyone else has run into the same problem, it appears to be a bug with Kerrighed 2.4. I tried again using 2.3 and everything went fine.
Hey, great guide. I ran into one snag, and I'm not really sure where to look to fix it. Everything works fine until I get to ./configure and then make patch. When I run make patch I get "make: *** No rule to make target `patch'. Stop." If I continue, skipping over this step, everything else seems OK until I boot up the nodes and log in. When I try to run krgadm, I get "krgadm: error while loading shared libraries: libkerrighed.so.1: cannot open shared object file: No such file or directory", and I've checked that the file is present in /usr/local/lib. Another thing I noticed is that /etc/kerrighed_nodes was missing. My questions are: first, do you think the problems are related, which was my first thought? And second, any thoughts on the make patch problem? I'm not really sure where to look. Any input would be appreciated. Thanks.
artefact
August 13th, 2009, 07:24 AM
As written on this page [1], 'make patch' is no longer needed as of version 2.4.0.
Regarding the library issue, you may need to run ldconfig after 'make install' and/or adjust paths in /etc/ld.so.conf.
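A quick way to check the library suggestion is to see whether the directory holding libkerrighed.so.1 is actually listed in the linker configuration. This is only a sketch: the helper name dir_in_ldconf is made up for illustration, and it only looks at the one file you give it (it does not follow "include" lines, which Ubuntu's /etc/ld.so.conf normally uses).

```shell
# Succeeds if directory $1 appears on a line of its own in the
# ld.so.conf-style file $2 (hypothetical helper, not part of Kerrighed).
dir_in_ldconf() {
    grep -qxF -- "$1" "$2"
}

# Typical use on the nodes' root filesystem:
#   dir_in_ldconf /usr/local/lib /etc/ld.so.conf || \
#       echo /usr/local/lib | sudo tee -a /etc/ld.so.conf
#   sudo ldconfig                       # rebuild the linker cache
#   ldconfig -p | grep libkerrighed     # confirm the library is now found
```

If the directory is already listed, simply re-running ldconfig after 'make install' may be all that is needed.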
And, finally, kerrighed_nodes is not really necessary. You can just pass session_id and, optionally, node_id as boot parameters, session_id being the same for all nodes. node_id can be computed automatically if autonodeid is enabled in the kernel config (CONFIG_KRG_AUTONODEID=y).
The given link has been updated recently to include some notes about the configuration of session_id and node_id.
Regards,
Jean
[1] http://kerrighed.org/wiki/index.php/Installing_Kerrighed_2.4.0
1024Jon
August 13th, 2009, 10:07 PM
Thanks, i will recompile with 2.4 and try again.
ajt
August 17th, 2009, 07:59 PM
And, finally, kerrighed_nodes is not really necessary. You can just pass session_id and, optionally, node_id as boot parameters, session_id being the same for all nodes. node_id can be computed automatically if autonodeid is enabled in the kernel config (CONFIG_KRG_AUTONODEID=y).
[...]
Hello, Jean.
We've found that auto-configuration of Kerrighed does not work properly on the 'head' node unless "kerrighed_nodes" is present. In our case, we use "kerrighed_nodes" because our nodes are sharing the root filesystem of the NFSROOT server using UNFS3 with ClusterNFS extensions enabled.
Bye,
Tony.
Baleyba
August 27th, 2009, 05:02 AM
Hi all!
I'm trying to compile the Kerrighed kernel for the nodes.
I'm using version 2.4.0.
I'm using this tutorial: https://wiki.ubuntu.com/EasyUbuntuClustering/UbuntuKerrighedClusterGuide and the official Kerrighed doc for 2.4.0 compilation.
PXE boot is okay but I'm getting problems with kernel compilation.
After launching "make install", at the end I'm getting:
Checking for ELILO...No
Checking for LILO...No
Checking for SILO...No
Checking for PALO...No
Should I make a bootdisk? (y/N)
WARNING: Your system is probably unbootable now. After correcting any
problems, rerun this script with the command `mkboot -i'.
make[6]: *** [install] Error 1
make[5]: *** [install] Error 2
make[4]: *** [install] Error 2
make[3]: *** [install] Error 2
make[3]: Leaving directory `/usr/src/kerrighed-2.4.0/kernel'
make[2]: *** [kernel-install] Error 2
make[2]: Leaving directory `/usr/src/kerrighed-2.4.0'
make[1]: *** [install-am] Error 2
make[1]: Leaving directory `/usr/src/kerrighed-2.4.0'
make: *** [install-recursive] Error 1
I don't understand what happened.
I answer no to the "bootdisk" question and just after that I get these errors.
After that, if I check the generated files, "/lib/modules/2.6.20-krg" is missing...
Can you help me, please?
thanks a lot.
Regards,
Bal.
Baleyba
August 27th, 2009, 05:33 AM
Ok I found the solution ;)
I should answer Yes to the bootdisk question ... sorry.
Baleyba
August 28th, 2009, 07:03 AM
Hi,
I've run into one other problem.
Everything is installed and working.
I followed the tutorial and added the CAN_MIGRATE capabilities,
but it doesn't work.
I see all the CPUs and all the memory of the cluster, but all the applications I run only use the local node's CPU.
It goes to 100% and the others aren't used :(
I tried to run krg_legacy_scheduler but it fails.
I'm getting:
toto@kerrighednode1:~$ krg_legacy_scheduler
[: 21: ==: unexpected operator
configfs is not mounted on /config, aborting.
configfs has to be mounted on /config.
To do so:
1. mkdir /config (as root)
2. Have the following line in /etc/fstab (reboot, or mount it
manually on each node once it has just been addded):
configfs /config configfs defaults 0 0
Aborting, please adjust your system configuration and try again.
I don't understand because configfs exists and is mounted:
toto@kerrighednode1:~$ mount
rootfs on / type rootfs (rw)
/dev/root on / type nfs (rw,vers=2,rsize=4096,wsize=4096,hard,nolock,proto=udp,timeo=11,retrans=2,sec=sys,addr=192.168.2.23)
proc on /proc type proc (rw,nosuid,nodev,noexec)
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec)
tmpfs on /var/run type tmpfs (rw,nosuid,nodev,noexec)
tmpfs on /var/lock type tmpfs (rw,nosuid,nodev,noexec)
/dev/root on /dev/.static/dev type nfs (rw,vers=2,rsize=4096,wsize=4096,hard,nolock,proto=udp,timeo=11,retrans=2,sec=sys,addr=192.168.2.23)
udev on /dev type tmpfs (rw)
tmpfs on /dev/shm type tmpfs (rw)
devpts on /dev/pts type devpts (rw)
tmpfs on /var/run type tmpfs (rw,nosuid,nodev,noexec)
tmpfs on /var/lock type tmpfs (rw,nosuid,nodev,noexec)
tmpfs on /var/run type tmpfs (rw,nosuid,nodev,noexec)
tmpfs on /var/lock type tmpfs (rw,nosuid,nodev,noexec)
configfs on /config type configfs (rw)
tmpfs on /var/run type tmpfs (rw,nosuid,nodev,noexec)
tmpfs on /var/lock type tmpfs (rw,nosuid,nodev,noexec)
So the first time I mounted the "config" directory, Kerrighed wrote inside it:
/config/krg_scheduler/probes and
/config/krg_scheduler/schedulers
...
So why doesn't it find this mount point?
Thanks for your help.
regards,
Bal.
squallgoh
September 5th, 2009, 06:01 AM
Sorry to hijack this thread, I do not know which forum is appropriate for questions on Beowulf.
I plan to build a server cluster that supports multiple X-Terminals. Would doing it Beowulf style be of any advantage as the applications I'll run are the usual business apps (OpenOffice etc) that are not programmed to take advantage of Parallel computing.
The main reason behind me wanting a cluster instead of a single server machine is for easier scalability (to handle more X-terminals). Is Beowulf a completely wrong direction to look at? In essence I want my server to be easily maintainable and scalable, and to save total computer hardware cost. Main environment is the business desktop: email, web browsing, OpenOffice etc.
Sorry if this is the wrong forum to post in, would the experts kindly point me in the right direction if I happen to post in the wrong place.
Thank you.
jvin248
September 6th, 2009, 08:09 PM
What you'll want to do is build the server stack with Kerrighed (the 'Beowulf' module you're working on). Then use LTSP installed on that instance to run the clients. LTSP is very easy (about five commands, there's a Ubuntu recipe that's in the Ubuntu forums/tutorials or IRC #ltsp - very helpful group).
The hard part is the Kerrighed system (that's discussed in this forum) - it's supposed to combine many machines into a 'single machine' - at least as far as network/users interacting with it - and allow adding/removing server horsepower as the tasks require. Most other projects, like Beowulf etc, are building parallel machines that require parallel aware software to run them as parallel instances.
My question, since I'm here, is what is the easiest recipe to get the Kerrighed system running on Ubuntu (8.04.3 or 9.04)? Any links?
ajt
September 7th, 2009, 10:28 AM
Sorry to hijack this thread, I do not know which forum is appropriate for questions on Beowulf.
Hello, squallgoh.
You're not really hijacking this thread - I started it here because I wanted to encourage people involved in education and science to think about how we might put together a *useful* Ubuntu-based Beowulf!
I plan to build a server cluster that supports multiple X-Terminals. Would doing it Beowulf style be of any advantage as the applications I'll run are the usual business apps (OpenOffice etc) that are not programmed to take advantage of Parallel computing.
That depends on what you want to do: I agree with jvin248's reply to your question. LTSP is designed for this sort of thing, but it just might be useful to use a Beowulf as the LTSP 'back-end' server. However, there are lots of other solutions to 'virtualise' user sessions. The main point to grasp is that a Beowulf is normally used to aggregate compute resources, not divide them up virtually...
The main reason behind me wanting a cluster instead of a single server machine is for easier scalability (to handle more X-terminals). Is Beowulf a completely wrong direction to look at? In essence I want my server to be easily maintainable and scalable, and to save total computer hardware cost. Main environment is the business desktop: email, web browsing, OpenOffice etc.
I understand that, and it's a reasonable suggestion to consider using a Beowulf in terms of scalability etc., but you could use e.g. a DNS round-robin as a simple solution to provide a server from a compute farm to each individual client.
Sorry if this is the wrong forum to post in, would the experts kindly point me in the right direction if I happen to post in the wrong place.
It's not the wrong place to post: Please let us know what you find out about running your desktop applications this way. I'm sure that many other people, myself included, are interested in using an Ubuntu Beowulf as the back end server for LTSP for all the reasons you mention.
Bye,
Tony.
ajt
September 7th, 2009, 10:41 AM
What you'll want to do is build the server stack with Kerrighed (the 'Beowulf' module you're working on). Then use LTSP installed on that instance to run the clients. LTSP is very easy (about five commands, there's a Ubuntu recipe that's in the Ubuntu forums/tutorials or IRC #ltsp - very helpful group).
Hello, jvin248.
I agree, but we don't really know yet how well 'office' type desktop applications will migrate under Kerrighed. My experience with openMosix is that things work quite well, but the delays introduced when processes migrate can be quite long and disruptive to interactive work. This is not a problem if, for example, you're running a computationally intensive task in the background. However, if you have to wait a few seconds for a response to a GUI application because the process is being migrated it makes it very difficult to use.
I've used a GUI application "consed" for editing DNA assembly consensus sequences under openMosix. It works quite well as long as it is either 'locked' to a node or not allowed to migrate(!). If we allow the openMosix load-balancer to move the "consed" GUI app. between different nodes to balance the (dynamic) load then it is not very easy to use. I think the same would be true of 'office' applications (spreadsheets etc.).
[...]
My question, since I'm here, is what is the easiest recipe to get the Kerrighed system running on Ubuntu (8.04.3 or 9.04)? Any links?
Our own Wiki?
http://wiki.ubuntu.com/EasyUbuntuClustering/UbuntuKerrighedClusterGuide
Bye,
Tony.
Arceliar
September 10th, 2009, 08:49 PM
So why doesn't it find this mount point?
Thanks for your help.
regards,
Bal.
I had the same problem here. The krg_legacy_scheduler is calling /bin/sh which is dash on newer [Ubuntu] releases, but the script doesn't seem to be dash friendly.
Editing /usr/bin/krg_legacy_scheduler (or /usr/local/bin, wherever you put it) to begin with
#!/bin/bash should do the trick--at least, for me it did.
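For anyone who would rather fix the script than change the interpreter: the "[: 21: ==: unexpected operator" message is what dash typically prints when a [ ... ] test uses "==", which is a bash extension; POSIX [ only knows "=". Line 21 of krg_legacy_scheduler isn't shown in this thread, so the function below is just an illustrative, portable version of that kind of mount check (the name configfs_mounted_on is made up), not the script's actual code.

```shell
# Portable check: is configfs mounted on the given mount point?
# Reads mount(8)-style lines ("dev on /path type fstype (opts)") from stdin.
configfs_mounted_on() {
    target=$1
    while read -r dev _on path _type fstype _rest; do
        # POSIX test uses "=" for string equality; "==" is a bash extension
        if [ "$fstype" = "configfs" ] && [ "$path" = "$target" ]; then
            return 0
        fi
    done
    return 1
}

# Example:
#   mount | configfs_mounted_on /config && echo "configfs is mounted on /config"
```

Written this way, the check behaves the same under dash and bash, so the #!/bin/sh line could stay as it is.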
mo0nykit
September 25th, 2009, 08:53 AM
Hello!
As far as I know, Kerrighed allows the head node to see multiple processors as its own. So here's my situation:
Blender (an open source 3D modeling, animation, rendering package), has options that set the number of threads to start when rendering. So if I have 8 dual-core nodes (16 processors), that means that I can set this option to 16? And will Blender automatically take advantage of those nodes (thanks to the Kerrighed multi-processor abstraction layer)?
Thanks!
nerdopolis
September 25th, 2009, 06:13 PM
Unfortunately I don't think so. Kerrighed 2.4.1 (the current version) supports process migration, but not thread migration.
There are older experimental versions which do support thread migration, but you lose the added stability/features of the new versions.
They removed thread migration for stability reasons when they wrote version 2.0.0, because they changed their focus from an experimental proof of concept to a stable product. (I don't think they even have thread migration as a compile option in 2.4.1 for adventurous types.) But they do plan to add it again if they have enough resources/developers.
mo0nykit
September 25th, 2009, 06:25 PM
@nerdopolis
Thanks! So that means I'll have to read up more on render queue managers for now :)
Gun1Dwn
September 29th, 2009, 06:40 PM
Hi, I've been following this thread for a few days now. I have my own Kerrighed/Ubuntu cluster set up with 3 nodes: one is a quad core and the others are single P4s. My problem is that when I start several instances of "yes > /dev/null &", the processes migrate, but only to 4 of the CPUs on the cluster, and not always the same CPUs: sometimes it's one of the P4s and 3 of the cores on the quad. It's kind of random. I was wondering why it won't migrate to all six? Any help is much appreciated. I love the guide, keep up the good work!!
Gun1Dwn
September 29th, 2009, 08:07 PM
I figured it out... My krgcapset settings are 01067 for -e -d -i -p. Each core on the quad is 2.4 GHz and the P4s are 1.8 GHz, so it took around 10 instances of "yes > /dev/null &" to max them all out. I think this is because when Kerrighed migrates the processes, it sees that my quad core is about twice as fast, so it gives each core two processes to compensate. That comes out to about 10 instances. I really hope they get true thread migration implemented. It would be nice.
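The guess above can be written out as simple arithmetic. This is only a restatement of the reasoning in the post (two processes per quad-core core, one per P4), not a description of how Kerrighed's scheduler actually weights nodes; the variable names are illustrative.

```shell
quad_cores=4            # cores in the quad-core node
p4_nodes=2              # single-CPU P4 nodes
procs_per_fast_core=2   # assumed weighting from the post
procs_per_p4=1

# Number of busy-loop processes needed to load every CPU under that guess
total=$((quad_cores * procs_per_fast_core + p4_nodes * procs_per_p4))
echo "$total"           # prints 10
```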
nerdopolis
September 30th, 2009, 06:39 PM
How can I setup the master node to also participate in the Kerrighed cluster? Do I just have the master node boot off the Kerrighed kernel?
Thanks in advance.
Gun1Dwn
September 30th, 2009, 07:05 PM
I too am interested in having the server as part of the cluster. I see a lot of benefit from having it as part of the cluster rather than separate.
Gun1Dwn
September 30th, 2009, 10:42 PM
I'm wondering if it would be possible to have a hard drive in each node, with all the nodes mirrored so they appear as one drive. This would cut down on the overhead of a single-drive setup, since each node would have a local resource. I just don't see the sense in having a single drive serve an entire cluster of machines.
ajt
October 3rd, 2009, 06:46 PM
How can I setup the master node to also participate in the Kerrighed cluster? Do I just have the master node boot off the Kerrighed kernel?
Thanks in advance.
Hello, nerdopolis.
Not quite - You have to disable NFSROOT support in the kernel config or it won't boot stand-alone. There are also some issues about files having to be *identical* on all nodes or processes won't migrate. Louis explained that files are re-opened by the kernel on the remote node for reasons of efficiency when a process migrates.
We've got Kerrighed 2.4.0 running OK stand-alone and on PXE-booted nodes.
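Before building the stand-alone kernel, it's easy to confirm that root-on-NFS really is switched off in the config. A minimal sketch, assuming a standard kernel .config file; kconfig_enabled is a made-up helper name, and CONFIG_ROOT_NFS is the option from the diff earlier in the thread.

```shell
# Succeeds if option $1 is built in (=y) in the kernel config file $2.
kconfig_enabled() {
    grep -q "^$1=y\$" "$2"
}

# Example, run from the kernel source directory:
#   if kconfig_enabled CONFIG_ROOT_NFS .config; then
#       echo "CONFIG_ROOT_NFS is still enabled"
#   fi
```

A disabled option shows up as a comment line ("# CONFIG_ROOT_NFS is not set"), which the pattern deliberately does not match.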
Bye,
Tony.
ajt
October 3rd, 2009, 07:02 PM
I'm wondering if it would be possible to have a hard drive in each node, with all the nodes mirrored so they appear as one drive. This would cut down on the overhead of a single-drive setup, since each node would have a local resource. I just don't see the sense in having a single drive serve an entire cluster of machines.
Hello, Gun1Dwn.
HPC 'provisioning' systems replicate/mirror a system image onto a new or replacement node. This can be done via PXE, but many people just use the local disk for swap and /tmp. It's a lot easier to manage a cluster of 'stateless' nodes, and this seems to be a popular compromise.
Sun Microsystems had a 'cache' filesystem that made use of local drives on 'dataless' nodes to store a copy of any files that were accessed from an NFS server. Not mirroring, but replicating files locally where needed.
Bye,
Tony.
lrilling
October 6th, 2009, 04:07 AM
Louis explained that files are re-opened by the kernel on the remote node for reasons of efficiency when a process migrates.
To be precise, non-NFS files before migration are not re-opened after migration, but the application could re-open them, and thus requires them to be identical. Conversely, NFS files before migration are re-opened after migration, and thus *must* be NFS files on the target machine.
Louis
lrilling
October 6th, 2009, 04:14 AM
Sun Microsystems had a 'cache' filesystem that made use of local drives on 'dataless' nodes to store a copy of any files that were accessed from an NFS server. Not mirroring, but replicating files locally where needed.
Linux 2.6.30, and thus the Kerrighed port on it, has a similar facility called FS-cache http://people.redhat.com/dhowells/fscache/FS-Cache.pdf. NFS is FS-Cache aware, but it can be used for OpenAFS and iso9660 too for instance.
Louis
Gun1Dwn
October 7th, 2009, 03:32 AM
The FS-Cache is an interesting idea. I'll have to look into it more. Right now I have an issue with bandwidth, as I only have 100 Mbit cards. I have 2 cards in each node and I was wondering how I can go about using both cards on each machine. I think it would be difficult to do, since there is only one root FS and one set of config files which all the nodes share. Is there a way to do this without having to make a separate root for each node, as they did with Microwulf (http://www.calvin.edu/%7Eadams/research/microwulf/)? And if so, would bonding the cards be beneficial?
o0splitpaw0o
October 25th, 2009, 02:49 AM
I've been going the Beowulf route recently and ran into this thread. I notice it's been quiet for a while. I've been using the notes from http://www.calvin.edu/~adams/research/microwulf/
They are a bit outdated, but I got the partitions completed and PXE going. I can get as far as TFTP at this time, but can't for the life of me get the nodes to boot off the mounted node folders mentioned in the notes. I'm kind of puzzled about the pxelinux.0 section. I'm not sure if the notes were for devices without PXE built in, so I don't know if I need to add this to my TFTP boot folder. I copied the kernel over, but I guess I am a bit confused on this.
Hence why I'm admit in learning it. I could use some better information on this. I've been posting my progress on this here http://www.youtube.com/watch?v=0KwP6hDk2dc
Thanks!
ajt
October 26th, 2009, 09:20 AM
I've been going the Beowulf route recently and ran into this thread. I notice it's been quiet for a while. I've been using the notes from http://www.calvin.edu/~adams/research/microwulf/
Hello, o0splitpaw0o.
Yes, the thread has been a bit quiet - thanks for posting :-)
They are a bit outdated, but I got the partitions completed and PXE going. I can get as far as TFTP at this time, but can't for the life of me get the nodes to boot off the mounted node folders mentioned in the notes. I'm kind of puzzled about the pxelinux.0 section. I'm not sure if the notes were for devices without PXE built in, so I don't know if I need to add this to my TFTP boot folder. I copied the kernel over, but I guess I am a bit confused on this.
This is how I've done it: First you need to configure your DHCP server:
# @(#)bobcat:/etc/dhcp3/dhcpd.conf 2009-04-02 A.J.Travis
ddns-update-style ad-hoc;
subnet 192.168.0.0 netmask 255.255.255.0 {
    option subnet-mask 255.255.255.0;
    option broadcast-address 192.168.0.255;
    option domain-name-servers 192.168.0.254;
    option routers 192.168.0.254;
    allow booting;
    allow bootp;
    filename "pxelinux.0";
}
[...]
Then configure pxelinux:
manager@bobcat:/var/lib/tftpboot/pxelinux.cfg$ cat C0A80041
default linux
label linux
kernel vmlinuz-kerrighed
append root=/dev/nfs ip=dhcp nfsroot=192.168.0.254:/NFSROOT,v3 node_id=65 session_id=1
In this case, the file "C0A80041" is hex for node 192.168.0.65
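The per-node file name is just the node's IPv4 address written as eight upper-case hex digits, which is what pxelinux looks for under pxelinux.cfg/. A small sketch for generating those names; the function name ip_to_pxe_name is made up for illustration, and it assumes a plain dotted-quad address with no leading zeros in the octets.

```shell
# Convert a dotted-quad IPv4 address to the 8-digit upper-case hex
# file name used under pxelinux.cfg/ (hypothetical helper).
# The body runs in a subshell so the IFS change does not leak out.
ip_to_pxe_name() (
    set -f          # no filename globbing while splitting
    IFS=.
    set -- $1       # split the address into its four octets
    printf '%02X%02X%02X%02X\n' "$1" "$2" "$3" "$4"
)

ip_to_pxe_name 192.168.0.65    # prints C0A80041
```

Looping this over the node addresses gives one config file name per node, each of which can carry its own node_id on the append line.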
Hence why I'm admit in learning it. I could use some better information on this. I've been posting my progress on this here http://www.youtube.com/watch?v=0KwP6hDk2dc
Thanks!
Just watched it ;-)
Bye,
Tony.
baggzy
October 30th, 2009, 02:20 PM
In reply to post #213... which I haven't found a solution for anywhere else...
I cannot get my nodes to boot up with kerrighed.
I keep getting r8169 eth0: link down
sending DHCP requests <3>DHCP/BOOTP: reply not for us timed out!
and it just keeps looping through this.
I believe that my network card does not like the r8169 configured by the kerrighed kernel?
If I switch this back to boot from the linux kernel for nfs it boots fine. Not sure what to do from here yet.
any clues? Could it be that my chipset is not supported?
I had the exact same problem. Days of scouring the internet revealed that there are problems with this family of Realtek cards which have been resolved in kernel 2.6 but not in 2.4. The solution back then was to get a new or patched r8168 driver from Realtek and replace the r8169 module in the initrd image. Unfortunately, since NFS set-up of kerrighed needs the card to work before it can run the initrd, the module can't be loaded. Doh! So the new driver has to be compiled into the kernel but doesn't seem to be compatible with that. Or at least I was unable to get that to compile. Nor did any of the other fixes out there help. So I finally resorted to hacking the driver code. And would you believe it, the solution couldn't be simpler! It's quick and dirty, but works:
cd /usr/src/kerrighed-2.4.1/_kernel/drivers/net #Don't miss out the underscore!
edit r8169.c #Replace "edit" with any editor.
Search for the text "link down" and you'll find the routine rtl8169_check_link_status:
static void rtl8169_check_link_status(struct net_device *dev,
                                      struct rtl8169_private *tp, void __iomem *ioaddr)
{
    unsigned long flags;

    spin_lock_irqsave(&tp->lock, flags);
    if (tp->link_ok(ioaddr)) {
        netif_carrier_on(dev);
        if (netif_msg_ifup(tp))
            printk(KERN_INFO PFX "%s: link up\n", dev->name);
    } else {
        if (netif_msg_ifdown(tp))
            printk(KERN_INFO PFX "%s: link down\n", dev->name);
        netif_carrier_off(dev);
    }
    spin_unlock_irqrestore(&tp->lock, flags);
}
Change netif_carrier_off to netif_carrier_on and then recompile the kernel.
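If you would rather script the edit than do it by hand, something like the following Python sketch should do it. It is only an illustration: the function name is made up, and it assumes the routine in your copy of r8169.c looks like the one quoted above. It changes only the first netif_carrier_off(dev) call found after the start of rtl8169_check_link_status and leaves every other call in the file alone.

```python
def force_carrier_on(source: str) -> str:
    """Swap netif_carrier_off for netif_carrier_on in the link-status routine."""
    # Match the function definition, not the places where it is called.
    marker = "rtl8169_check_link_status(struct net_device *dev"
    start = source.find(marker)
    if start == -1:
        raise ValueError("rtl8169_check_link_status definition not found")
    old = "netif_carrier_off(dev);"
    pos = source.find(old, start)
    if pos == -1:
        raise ValueError("no netif_carrier_off(dev) call after the definition")
    return source[:pos] + "netif_carrier_on(dev);" + source[pos + len(old):]


# Tiny demonstration on a cut-down version of the routine.
demo = (
    "static void rtl8169_check_link_status(struct net_device *dev,\n"
    "    struct rtl8169_private *tp, void __iomem *ioaddr)\n"
    "{\n"
    "    if (tp->link_ok(ioaddr)) {\n"
    "        netif_carrier_on(dev);\n"
    "    } else {\n"
    "        netif_carrier_off(dev);\n"
    "    }\n"
    "}\n"
)
print(force_carrier_on(demo))
```

To use it for real you would read r8169.c into a string, pass it through the function, and write the result back (keep a copy of the original). The hack makes the driver report the carrier as up even when the hardware says the link is down, so treat it as a diagnostic aid rather than a fix.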
That's it! Insanely simple, isn't it? :)
baggzy
November 2nd, 2009, 06:53 AM
By the way, it's unclear what the effect of that hack will be on network stability. Since my Kerrighed system hung every time I brought up the second node I've given up. This may be the result of the above hack, or may not. Still, hope it helps!
lrilling
November 6th, 2009, 05:26 AM
By the way, it's unclear what the effect of that hack will be on network stability. Since my Kerrighed system hung every time I brought up the second node I've given up. This may be the result of the above hack, or may not. Still, hope it helps!
Wait a minute. You're starting a cluster of one node and then you try adding a second node? Could you describe the sequence of Kerrighed commands that you are using?
I wonder if you are trying to use the not-yet-working-but-being-worked-on dynamic node addition feature of Kerrighed.
Louis
baggzy
November 6th, 2009, 02:24 PM
It's entirely possible I was doing something stupid. I'd followed the instructions from start to finish, then booted up the head node fine. Logged in, wandered around a bit, all looked fine. I ran "top" and could see my quad cores happily waiting for flops to process. Then I nfs booted the second node, which booted up fine (once I hacked that r8169.c driver), and again wandered around the file system and all appeared to be well. But then when I ran "top", or "ps", or anything that actually used some cpu, both nodes would hard freeze. I think I had kerrighed configured to auto-start once 2 nodes were available, so I guess it was running. Am I being a dolt?
baggzy
November 6th, 2009, 02:30 PM
Sorry - correction! The "head node" as I called it was also nfs-booted, so I should have referred to it as node1, and the other as node2 in my previous post. Both nfs-booted off another PC. (Sorry about that, I've been playing with so many different systems they're all blurring together...)
lrilling
November 7th, 2009, 11:34 AM
Sorry - correction! The "head node" as I called it was also nfs-booted, so I should have referred to it as node1, and the other as node2 in my previous post. Both nfs-booted off another PC. (Sorry about that, I've been playing with so many different systems they're all blurring together...)
Ok, I see nothing wrong with your setup. The autostart detail is interesting. Do you have the same issue if you disable autostart (just do not give the parameter) and, after having booted both nodes, run
# krgadm cluster start
instead?
Anyway, we need kernel logs in order to figure out what went wrong. To collect those logs you could follow this guide http://www.kerrighed.org/wiki/index.php/KernelLogs.
Thanks,
Louis
baggzy
November 8th, 2009, 01:33 PM
No, I didn't try that. I'm away this weekend, but I'll set it up again next week and give that a try. If it still fails I'll send the logs... Would be nice to get it working! :D Cheers.
baggzy
November 11th, 2009, 08:07 PM
Ok, so I have a freshly installed kerrighed system. Two nodes to start with. They both nfs boot fine, and run fine when kerrighed isn't running. But once I start kerrighed they both freeze within about 10 seconds... The last entries in the kern.log file are:
Nov 11 23:01:50 node1 kernel: kerrighed: Cluster start with nodes 101-102 ...
Nov 11 23:01:50 node1 kernel: kerrighed: Cluster start succeeded.
Nov 11 23:01:50 node1 kernel: Kerrighed is running on 2 nodes: 101-102
Nov 11 23:01:50 node1 kernel: successfully registered scheduler_policy_type round_robin_balancer
Nov 11 23:01:50 node1 kernel: successfully registered scheduler_policy_type mosix_load_balancer
Any ideas?
ajt
November 12th, 2009, 06:28 AM
Ok, so I have a freshly installed kerrighed system. Two nodes to start with. They both nfs boot fine, and run fine when kerrighed isn't running. But once I start kerrighed they both freeze within about 10 seconds... The last entries in the kern.log file are:
[...]
Any ideas?
Hello, baggzy.
Are you running a 32-bit or 64-bit kernel?
As Louis pointed out recently, the Kerrighed developers are giving priority to the 64-bit kernel and he recommends using Kerrighed 2.3.0 for i386/x86_32.
We have two stable 32-bit Kerrighed 2.3.0 clusters running at RINH, but had a nightmare experience trying to upgrade them. We've now gone back to 2.3.0 until we upgrade to 64-bit Kerrighed at the end of our current project, which depends on Bio-Linux 5.0 (32-bit Ubuntu 8.04 LTS).
Bye,
Tony.
baggzy
November 12th, 2009, 06:46 AM
Hi Tony & Louis,
I'm running the 64-bit 2.4.1 kerrighed and linux 2.6.20 kernel on two quad core AMD Phenom 9950's on two similar mainboards (an MSI DKA790GX and an MSI KA790GX). I pretty much followed bigjimjams' guide exactly, so that's my basic set-up. Am I the first person to have this freezing problem? :-k Any idea how I can find out what the problem is?
Cheers!
ajt
November 12th, 2009, 07:58 AM
Hi Tony & Louis,
I'm running the 64-bit 2.4.1 kerrighed and linux 2.6.20 kernel on two quad core AMD Phenom 9950's on two similar mainboards (an MSI DKA790GX and an MSI KA790GX). I pretty much followed bigjimjams' guide exactly, so that's my basic set-up. Am I the first person to have this freezing problem? :-k Any idea how I can find out what the problem is?
Cheers!
Hello, baggzy.
Are you sharing the same NFSROOT for both systems r/w?
Try using two completely separate NFSROOT filesystems or, as I've previously mentioned here, use UNFS3 with 'cluster' extensions enabled to avoid multiple nodes attempting to write to (or create) the same file:
aptitude install unfs3
HTH,
Tony.
baggzy
November 12th, 2009, 05:49 PM
No joy I'm afraid. I installed completely separate nfsroot's for each node but they still freeze when I do "top" or "ps" after starting kerrighed. Weird thing is that I can still ping them, so they are alive. But I can't log in - they ask for my password, print the login message, then freeze.
ajt
November 13th, 2009, 06:08 AM
No joy I'm afraid. I installed completely separate nfsroot's for each node but they still freeze when I do "top" or "ps" after starting kerrighed. Weird thing is that I can still ping them, so they are alive. But I can't log in - they ask for my password, print the login message, then freeze.
Hello, baggzy.
I had similar symptoms to this because "udev" isn't working properly in our setup on Alicia's cluster "kitcat". It seems the tty devices were not created, and neither were /dev/random and /dev/urandom, so SSH was broken. My current work-around is:
login on "kitcat"
cd /lib/udev/devices
sudo ./MAKEDEV generic
This directory is then used by diskless clients sharing the root filesystem of "kitcat" to populate /dev in memory at boot time.
Is everything that you need in /dev on your nodes?
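One quick way to answer that is to run a small check on each node. The Python sketch below is only an illustration: the list of device names is my assumption of a sensible minimum (the tty and random devices mentioned above, plus console and null), not an official list, so adjust it to suit.

```python
import os

# Assumed minimum set of device nodes; extend as needed.
REQUIRED = ("console", "null", "random", "urandom", "tty", "tty1")


def missing_devices(dev_dir="/dev", required=REQUIRED):
    """Return the names from `required` that do not exist under dev_dir."""
    return [name for name in required
            if not os.path.exists(os.path.join(dev_dir, name))]


print(missing_devices())
```

An empty list means every name in REQUIRED is present. Pointing dev_dir at /lib/udev/devices on the server checks the static copy that the diskless clients use.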
If "udev" is working the devices should be created automatically.
Bye,
Tony.
lrilling
November 13th, 2009, 06:56 AM
Ok, so I have a freshly installed kerrighed system. Two nodes to start with. They both nfs boot fine, and run fine when kerrighed isn't running. But once I start kerrighed they both freeze within about 10 seconds... The last entries in the kern.log file are:
Nov 11 23:01:50 node1 kernel: kerrighed: Cluster start with nodes 101-102 ...
Nov 11 23:01:50 node1 kernel: kerrighed: Cluster start succeeded.
Nov 11 23:01:50 node1 kernel: Kerrighed is running on 2 nodes: 101-102
Nov 11 23:01:50 node1 kernel: successfully registered scheduler_policy_type round_robin_balancer
Nov 11 23:01:50 node1 kernel: successfully registered scheduler_policy_type mosix_load_balancer
Any ideas?
Hi baggzy,
There is nothing wrong with this log. Are the logs of the other node similar?
Thanks,
Louis
renkinjutsu
November 13th, 2009, 11:52 PM
for some reason, i got the impression that the server can share computer power with the cluster, but after reading the wiki on setting up the kerrighed cluster, i realized that the cluster shares computer power between nodes, and the server is just a server..
Is it possible to turn the cluster into an extension of the server's computer power?
ajt
November 14th, 2009, 06:29 AM
for some reason, i got the impression that the server can share computer power with the cluster, but after reading the wiki on setting up the kerrighed cluster, i realized that the cluster shares computer power between nodes, and the server is just a server..
Is it possible to turn the cluster into an extension of the server's computer power?
Hello, renkinjutsu.
The answer to your question is a definite maybe ;-)
I also want to use Kerrighed, with the 'head' node (i.e. login server) being part of the SSI cluster. However, it's not as easy as setting up a Kerrighed cluster consisting entirely of PXE booted NFSROOT compute nodes.
We have set up a 32-bit Kerrighed 2.3.0 cluster, but it is unstable when the NFS server "kitcat" is included in the cluster. Our "kitcat" cluster is stable when only the PXE-booted nodes form the cluster, but not when "kitcat" itself is a Kerrighed node. We're investigating what the problem is, but it appears to be related to autoconfiguration of Kerrighed nodes.
Look out for my student Alicia posting details about this on the Wiki.
Bye,
Tony.
baggzy
November 14th, 2009, 11:45 AM
Tony - thanks for the idea. My /dev directory looked fine but I ran MAKEDEV anyway. The /lib/udev/devices directory is now fully populated, but the nodes still freeze.
Louis - the logs on the two nodes look like this:
node1 (which I issue "krgadm cluster start" on):
Nov 14 01:24:54 node1 kernel: Start loading Kerrighed...
Nov 14 01:24:54 node1 kernel: Init Kerrighed worker(s)...Init Kerrighed low-level framework...
Nov 14 01:24:54 node1 kernel: Init kerrighed syscall mechanism
Nov 14 01:24:54 node1 kernel: Kerrighed tools - init module
Nov 14 01:24:54 node1 kernel: TIPC: Started in network mode
Nov 14 01:24:54 node1 kernel: TIPC: Own node address <1.1.102>, network identity 1
Nov 14 01:24:54 node1 kernel: RPC initialisation done
Nov 14 01:24:54 node1 kernel: Init Kerrighed low-level framework (nodeid 101) : done
Nov 14 01:24:54 node1 kernel: Init Kerrighed distributed services...
Nov 14 01:24:54 node1 kernel: KDDM initialisation : start
Nov 14 01:24:54 node1 kernel: KDDM set init
Nov 14 01:24:54 node1 kernel: KDDM set init : done
Nov 14 01:24:54 node1 kernel: KDDM initialisation done
Nov 14 01:24:54 node1 kernel: KerMM initialisation : start
Nov 14 01:24:54 node1 kernel: KerMM initialisation done
Nov 14 01:24:54 node1 kernel: DVFS initialisation : start
Nov 14 01:24:54 node1 kernel: FAF: initialisation : start
Nov 14 01:24:54 node1 kernel: FAF: initialisation : done
Nov 14 01:24:54 node1 kernel: DVFS initialisation done
Nov 14 01:24:54 node1 kernel: KerIPC initialisation : start
Nov 14 01:24:54 node1 kernel: KerIPC initialisation done
Nov 14 01:24:54 node1 kernel: Proc initialisation: start
Nov 14 01:24:54 node1 kernel: Proc initialisation: done
Nov 14 01:24:54 node1 kernel: EPM initialisation: start
Nov 14 01:24:54 node1 kernel: EPM initialisation: done
Nov 14 01:24:54 node1 kernel: Init Kerrighed distributed services: done
Nov 14 01:24:54 node1 kernel: scheduler initialization succeeded!
Nov 14 01:24:54 node1 kernel: Kerrighed... loaded!
Nov 14 01:24:54 node1 kernel: Try to enable bearer on lo:<5>TIPC: Enabled bearer <eth:lo>, discovery domain <1.1.0>, priority 10
Nov 14 01:24:54 node1 kernel: ok
Nov 14 01:24:54 node1 kernel: Try to enable bearer on eth0:<5>TIPC: Enabled bearer <eth:eth0>, discovery domain <1.1.0>, priority 10
Nov 14 01:24:54 node1 kernel: ok
Nov 14 01:26:06 node1 kernel: TIPC: Established link <1.1.102:eth0-1.1.103:eth0> on network plane B
Nov 14 01:26:39 node1 kernel: kerrighed: Cluster start with nodes 101-102 ...
Nov 14 01:26:39 node1 kernel: kerrighed: Cluster start succeeded.
Nov 14 01:26:39 node1 kernel: Kerrighed is running on 2 nodes: 101-102
Nov 14 01:26:39 node1 kernel: successfully registered scheduler_policy_type round_robin_balancer
Nov 14 01:26:39 node1 kernel: successfully registered scheduler_policy_type mosix_load_balancer
node2:
Nov 14 01:25:50 node2 kernel: Start loading Kerrighed...
Nov 14 01:25:50 node2 kernel: Init Kerrighed worker(s)...Init Kerrighed low-level framework...
Nov 14 01:25:50 node2 kernel: Init kerrighed syscall mechanism
Nov 14 01:25:50 node2 kernel: Kerrighed tools - init module
Nov 14 01:25:50 node2 kernel: TIPC: Started in network mode
Nov 14 01:25:50 node2 kernel: TIPC: Own node address <1.1.103>, network identity 1
Nov 14 01:25:50 node2 kernel: RPC initialisation done
Nov 14 01:25:50 node2 kernel: Init Kerrighed low-level framework (nodeid 102) : done
Nov 14 01:25:50 node2 kernel: Init Kerrighed distributed services...
Nov 14 01:25:50 node2 kernel: KDDM initialisation : start
Nov 14 01:25:50 node2 kernel: KDDM set init
Nov 14 01:25:50 node2 kernel: KDDM set init : done
Nov 14 01:25:50 node2 kernel: KDDM initialisation done
Nov 14 01:25:50 node2 kernel: KerMM initialisation : start
Nov 14 01:25:50 node2 kernel: KerMM initialisation done
Nov 14 01:25:50 node2 kernel: DVFS initialisation : start
Nov 14 01:25:50 node2 kernel: FAF: initialisation : start
Nov 14 01:25:50 node2 kernel: FAF: initialisation : done
Nov 14 01:25:50 node2 kernel: DVFS initialisation done
Nov 14 01:25:50 node2 kernel: KerIPC initialisation : start
Nov 14 01:25:50 node2 kernel: KerIPC initialisation done
Nov 14 01:25:50 node2 kernel: Proc initialisation: start
Nov 14 01:25:50 node2 kernel: Proc initialisation: done
Nov 14 01:25:50 node2 kernel: remove_proc_entry: /proc/meminfo busy, count=1
Nov 14 01:25:50 node2 kernel: remove_proc_entry: /proc/loadavg busy, count=1
Nov 14 01:25:50 node2 kernel: remove_proc_entry: /proc/stat busy, count=1
Nov 14 01:25:50 node2 kernel: remove_proc_entry: /proc/uptime busy, count=1
Nov 14 01:25:50 node2 kernel: EPM initialisation: start
Nov 14 01:25:50 node2 kernel: EPM initialisation: done
Nov 14 01:25:50 node2 kernel: Init Kerrighed distributed services: done
Nov 14 01:25:50 node2 kernel: scheduler initialization succeeded!
Nov 14 01:25:50 node2 kernel: Kerrighed... loaded!
Nov 14 01:25:50 node2 kernel: Try to enable bearer on lo:<5>TIPC: Enabled bearer <eth:lo>, discovery domain <1.1.0>, priority 10
Nov 14 01:25:50 node2 kernel: ok
Nov 14 01:25:50 node2 kernel: Try to enable bearer on eth0:<5>TIPC: Enabled bearer <eth:eth0>, discovery domain <1.1.0>, priority 10
Nov 14 01:25:50 node2 kernel: ok
Nov 14 01:25:50 node2 kernel: TIPC: Established link <1.1.103:eth0-1.1.102:eth0> on network plane B
Nov 14 01:25:51 node2 kernel: de_put: deferred delete of uptime
Nov 14 01:25:51 node2 kernel: de_put: deferred delete of loadavg
Nov 14 01:25:51 node2 kernel: de_put: deferred delete of stat
Nov 14 01:25:51 node2 kernel: de_put: deferred delete of meminfo
Nov 14 01:26:24 node2 kernel: Kerrighed is running on 2 nodes: 101-102
Nov 14 01:26:24 node2 kernel: successfully registered scheduler_policy_type round_robin_balancer
Nov 14 01:26:24 node2 kernel: successfully registered scheduler_policy_type mosix_load_balancer
Are those proc messages on node2 a problem?
Out of interest, am I correct that there's no way to NFS-boot the nodes using a module (i.e. not compiled-in) driver for my network card? I've had a stab at it (editing initramfs.conf, creating an initrd.img, recompiling the kernel without the driver built-in, and reboot) but IP-config always reports that there are no network cards...
I'm wondering if the problem is that comms problems are causing a freeze like the one people report when a node goes down. I've hacked the r8169.c driver so it doesn't issue false "link down" messages to the kernel, but there may be other issues too. I can download the official driver from Realtek but can only install it as a module... ](*,)
lrilling
November 14th, 2009, 12:14 PM
Louis - the logs on the two nodes look like this:
node1 (which I issue "krgadm cluster start" on):
[...]
node2:
[...]
Nov 14 01:25:50 node2 kernel: remove_proc_entry: /proc/meminfo busy, count=1
Nov 14 01:25:50 node2 kernel: remove_proc_entry: /proc/loadavg busy, count=1
Nov 14 01:25:50 node2 kernel: remove_proc_entry: /proc/stat busy, count=1
Nov 14 01:25:50 node2 kernel: remove_proc_entry: /proc/uptime busy, count=1
[...]
Nov 14 01:25:51 node2 kernel: de_put: deferred delete of uptime
Nov 14 01:25:51 node2 kernel: de_put: deferred delete of loadavg
Nov 14 01:25:51 node2 kernel: de_put: deferred delete of stat
Nov 14 01:25:51 node2 kernel: de_put: deferred delete of meminfo
[...]
Are those proc messages on node2 a problem?
I don't think so. Those messages look very rare (I personally saw one only once, and it was not on a machine I was controlling), and just report about an unexpected but safely handled case. They are absolutely not Kerrighed-specific actually, although they were triggered by the initialization phase of Kerrighed.
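For anyone comparing logs from two nodes by eye, a small script can pick out the kernel messages that appear on one node but not the other. This Python sketch assumes the syslog line layout seen in this thread ("Mon DD HH:MM:SS host kernel: message"); the function names are made up for illustration.

```python
import re

# Timestamp, host name, then "kernel:" and the message itself.
LINE = re.compile(r"^\w{3}\s+\d+\s+\d\d:\d\d:\d\d\s+\S+\s+kernel:\s?(.*)$")


def kernel_messages(log_text):
    """Return kernel message bodies with timestamp and host stripped."""
    messages = []
    for line in log_text.splitlines():
        match = LINE.match(line)
        if match:
            messages.append(match.group(1))
    return messages


def only_in_first(first_log, second_log):
    """Messages present in first_log that never occur in second_log."""
    seen = set(kernel_messages(second_log))
    return [msg for msg in kernel_messages(first_log) if msg not in seen]


node1 = "Nov 14 01:24:54 node1 kernel: Proc initialisation: done\n"
node2 = (
    "Nov 14 01:25:50 node2 kernel: Proc initialisation: done\n"
    "Nov 14 01:25:50 node2 kernel: remove_proc_entry: /proc/stat busy, count=1\n"
)
print(only_in_first(node2, node1))
```

Lines that embed a node-specific value, such as the TIPC own node address, will also show up as differences, so expect a little noise alongside the interesting messages.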
Out of interest, am I correct that there's no way to NFS-boot the nodes using a module (i.e. not compiled-in) driver for my network card? I've had a stab at it (editing initramfs.conf, creating an initrd.img, recompiling the kernel without the driver built-in, and reboot) but IP-config always reports that there are no network cards...
You're not correct ;) It is possible to NFS-boot with the NIC driver as a loadable module, but this requires you to:
use an initramfs (otherwise the kernel will try, and fail, to boot on NFS before any module can be loaded);
configure the initramfs for NFS booting;
ensure that the NIC driver is included in the initramfs;
ensure that the initramfs' udev is able to load the NIC driver's module automatically, or hack the initramfs to make it load the driver manually;
assign static node ids to your Kerrighed nodes (use the node_id= kernel command line parameter), since autoconfiguration only works when the NIC driver is built into the kernel.
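To illustrate the last point, here is a minimal Python sketch that builds a pxelinux APPEND line with a static node id for each node. It only reuses parameters already seen in this thread (root=/dev/nfs, initrd=, nfsroot=, ip=dhcp, rw, node_id=, session_id=); the function name and the default session id of 1 are assumptions for the example.

```python
def kerrighed_append(node_id, nfsroot, initrd=None, session_id=1):
    """Build a pxelinux APPEND line carrying a static Kerrighed node id."""
    parts = ["root=/dev/nfs"]
    if initrd:
        # Only needed when booting through an initramfs.
        parts.append(f"initrd={initrd}")
    parts += [
        f"nfsroot={nfsroot}",
        "ip=dhcp",
        "rw",
        f"node_id={node_id}",
        f"session_id={session_id}",
    ]
    return "APPEND " + " ".join(parts)


for node_id in (101, 102):
    print(kerrighed_append(node_id, "192.168.1.1:/nfsroot/kerrighed",
                           initrd="initrd.img-2.6.20-krg"))
```

Each node then needs its own file under pxelinux.cfg, so that it picks up its own node_id rather than a shared default.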
I'm wondering if the problem is that comms problems are causing a freeze like the one people report when a node goes down. I've hacked the r8169.c driver so it doesn't issue false "link down" messages to the kernel, but there may be other issues too. I can download the official driver from Realtek but can only install it as a module... ](*,)
This may be the cause of your problem.
You should definitely try a module that is reported to work correctly with 2.6.20 kernels.
Louis
baggzy
November 14th, 2009, 02:38 PM
Nope, still no joy. I have fully functioning comms \:D/, but the nodes still freeze. ](*,)
For anyone who has the same network driver problem...
I followed the instructions in nerdopolis' post #210, but when I unpacked the initrd.img file the module wasn't in there. :-k Strange. So I manually hacked it in there as follows:
- Go to http://www.realtek.com.tw website and find the driver for your network card.
- Download the driver to /nfsroot/kerrighed/usr/src/
- Expand the tarball. Note the [driver directory] created.
- cd [driver directory]. Follow the instructions in the readme to make it.
- Note the name of the [driver].ko created (in the src directory).
- Recompile the kernel without the offending driver.
- Create an initrd as follows:
chroot /nfsroot/kerrighed
mkinitramfs -o /boot/initrd.img-2.6.20-krg 2.6.20-krg
exit
cp /nfsroot/kerrighed/boot/initrd.img-2.6.20-krg /var/lib/tftpboot/
cp /nfsroot/kerrighed/boot/vmlinuz-2.6.20-krg /var/lib/tftpboot/
Edit your boot setup (/var/lib/tftpboot/pxelinux.cfg/default or whatever):
LABEL linux
KERNEL vmlinuz-2.6.20-krg
APPEND root=/dev/nfs initrd=initrd.img-2.6.20-krg nfsroot=192.168.1.1:/nfsroot/kerrighed ip=dhcp rw
Hack the initrd:
mkdir /nfsroot/kerrighed/boot/initrd # The directory to unpack the initrd into.
cd /nfsroot/kerrighed/boot/initrd
gunzip < ../initrd.img-2.6.20-krg | cpio -i # Unpack the initrd.
cp /nfsroot/kerrighed/usr/src/[driver directory]/src/[driver].ko lib/modules/2.6.20-krg/kernel/drivers/net/
find | cpio -H newc -o | gzip -9 > ../initrd.img-2.6.20-krg # Repack the initrd.
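Before repacking, it is worth confirming that the module really is inside the unpacked tree. A minimal Python sketch, assuming the initrd was unpacked into a directory as above; the module file name used in the example is hypothetical, so substitute the [driver].ko you actually built.

```python
import os


def find_module(initrd_dir, module_name):
    """Return paths (relative to initrd_dir) of files named module_name."""
    hits = []
    for root, _dirs, files in os.walk(initrd_dir):
        for name in files:
            if name == module_name:
                full = os.path.join(root, name)
                hits.append(os.path.relpath(full, initrd_dir))
    return sorted(hits)


# Example call; "r8168.ko" is a placeholder module name.
print(find_module("/nfsroot/kerrighed/boot/initrd", "r8168.ko"))
```

An empty list means the copy step did not land where expected. Depending on how the initramfs loads modules, the module dependency files inside the tree may also need regenerating before the driver is found at boot; I have not verified that for this setup.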
Apologies for any mistakes or typos, since I'm writing this from memory.
Now if I could just get kerrighed to stop freezing... [-o< Any other ideas welcome!
baggzy
November 14th, 2009, 11:05 PM
Oh, and after all that copy the initrd again:
cp /nfsroot/kerrighed/boot/initrd.img-2.6.20-krg /var/lib/tftpboot/
Can't believe I forgot to say that. #-o
renkinjutsu
November 15th, 2009, 01:36 AM
Hello, renkinjutsu.
The answer to your question is a definite maybe ;-)
I also want to use Kerrighed, with the 'head' node (i.e. login server) being part of the SSI cluster. However, it's not as easy as setting up a Kerrighed cluster consisting entirely of PXE booted NFSROOT compute nodes.
We have set up a 32-bit Kerrighed 2.3.0 cluster, but it is unstable when the NFS server "kitcat" is included in the cluster. Our "kitcat" cluster is stable when only the PXE-booted nodes form the cluster, but not when "kitcat" itself is a Kerrighed node. We're investigating what the problem is, but it appears to be related to autoconfiguration of Kerrighed nodes.
Look out for my student Alicia posting details about this on the Wiki.
Bye,
Tony.
thanks ajt, it reassures me when i know that a bunch of smart people are going after the same thing i want. I can't wait to hear the results of your experiments!!
lrilling
November 15th, 2009, 10:11 AM
Nope, still no joy. I have fully functioning comms \:D/, but the nodes still freeze. ](*,)
Ok. Maybe producing some debugging kernel logs could help. After observing the freeze, could you trigger sys request 'w' on both nodes and send the resulting logs?
Two ways of triggering sysrq:
- Hit Alt+SysRq+w on the machine's keyboard,
- (as root on the machine) echo w > /proc/sysrq-trigger
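The second method is easy to script if you want to trigger it on several nodes, for example over ssh. A minimal Python sketch of the same write; the function name is made up, and the path parameter exists only so the function can be tried against an ordinary file first.

```python
def trigger_sysrq(key="w", path="/proc/sysrq-trigger"):
    """Write a single SysRq command character to the trigger file."""
    if len(key) != 1:
        raise ValueError("SysRq key must be a single character")
    # Equivalent to: echo w > /proc/sysrq-trigger (needs root on a real node).
    with open(path, "w") as trigger:
        trigger.write(key)


# On a node, as root:
# trigger_sysrq("w")
```

The blocked-task dump then appears in the kernel log, as with the keyboard method.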
Thanks,
Louis
baggzy
November 15th, 2009, 11:42 AM
Hi Louis,
Thanks for the suggestion. Here are the logs...
node1:
Nov 15 15:23:45 node1 kernel: Start loading Kerrighed...
Nov 15 15:23:45 node1 kernel: Init Kerrighed worker(s)...Init Kerrighed low-level framework...
Nov 15 15:23:45 node1 kernel: Init kerrighed syscall mechanism
Nov 15 15:23:45 node1 kernel: Kerrighed tools - init module
Nov 15 15:23:45 node1 kernel: TIPC: Started in network mode
Nov 15 15:23:45 node1 kernel: TIPC: Own node address <1.1.102>, network identity 0
Nov 15 15:23:45 node1 kernel: RPC initialisation done
Nov 15 15:23:45 node1 kernel: Init Kerrighed low-level framework (nodeid 101) : done
Nov 15 15:23:45 node1 kernel: Init Kerrighed distributed services...
Nov 15 15:23:45 node1 kernel: KDDM initialisation : start
Nov 15 15:23:45 node1 kernel: KDDM set init
Nov 15 15:23:45 node1 kernel: KDDM set init : done
Nov 15 15:23:45 node1 kernel: KDDM initialisation done
Nov 15 15:23:45 node1 kernel: KerMM initialisation : start
Nov 15 15:23:45 node1 kernel: KerMM initialisation done
Nov 15 15:23:45 node1 kernel: DVFS initialisation : start
Nov 15 15:23:45 node1 kernel: FAF: initialisation : start
Nov 15 15:23:45 node1 kernel: FAF: initialisation : done
Nov 15 15:23:45 node1 kernel: DVFS initialisation done
Nov 15 15:23:45 node1 kernel: KerIPC initialisation : start
Nov 15 15:23:45 node1 kernel: KerIPC initialisation done
Nov 15 15:23:45 node1 kernel: Proc initialisation: start
Nov 15 15:23:45 node1 kernel: Proc initialisation: done
Nov 15 15:23:45 node1 kernel: EPM initialisation: start
Nov 15 15:23:45 node1 kernel: EPM initialisation: done
Nov 15 15:23:45 node1 kernel: Init Kerrighed distributed services: done
Nov 15 15:23:45 node1 kernel: scheduler initialization succeeded!
Nov 15 15:23:45 node1 kernel: Kerrighed... loaded!
Nov 15 15:23:45 node1 kernel: Try to enable bearer on lo:<5>TIPC: Enabled bearer <eth:lo>, discovery domain <1.1.0>, priority 10
Nov 15 15:23:45 node1 kernel: ok
Nov 15 15:23:45 node1 kernel: Try to enable bearer on eth0:<5>TIPC: Enabled bearer <eth:eth0>, discovery domain <1.1.0>, priority 10
Nov 15 15:23:45 node1 kernel: ok
Nov 15 15:23:45 node1 /usr/sbin/cron[425726071]: (CRON) INFO (pidfile fd = 3)
Nov 15 15:23:45 node1 /usr/sbin/cron[425726072]: (CRON) STARTUP (fork ok)
Nov 15 15:23:45 node1 /usr/sbin/cron[425726072]: (CRON) INFO (Running @reboot jobs)
Nov 15 15:24:05 node1 ntpd[425726012]: getaddrinfo: "::1" invalid host address, ignored
Nov 15 15:24:05 node1 ntpd[425726107]: signal_no_reset: signal 17 had flags 4000000
Nov 15 15:24:27 node1 ntpd_initres[425726107]: host name not found: ntp.ubuntu.com
Nov 15 15:24:27 node1 ntpd_initres[425726107]: couldn't resolve `ntp.ubuntu.com', giving up on it
Nov 15 15:28:08 node1 kernel: TIPC: Established link <1.1.102:eth0-1.1.103:eth0> on network plane B
Nov 15 15:28:42 node1 kernel: kerrighed: Cluster start with nodes 101-102 ...
Nov 15 15:28:42 node1 kernel: kerrighed: Cluster start succeeded.
Nov 15 15:28:42 node1 kernel: Kerrighed is running on 2 nodes: 101-102
Nov 15 15:28:42 node1 kernel: successfully registered scheduler_policy_type round_robin_balancer
Nov 15 15:28:42 node1 kernel: successfully registered scheduler_policy_type mosix_load_balancer
Nov 15 15:31:04 node1 kernel: SysRq : Show Blocked State
Nov 15 15:31:04 node1 kernel:
Nov 15 15:31:04 node1 kernel: free sibling
Nov 15 15:31:04 node1 kernel: task PC stack pid father child younger older
Nov 15 15:31:04 node1 kernel: krg/0 D ffff810006122810 0 425726035 19 425726036 425724008 (L-TLB)
Nov 15 15:31:04 node1 kernel: ffff81011c7e1cc0 0000000000000046 0000000000000000 0000000000000001
Nov 15 15:31:04 node1 kernel: ffff810006122810 ffff81011c7e1c70 ffff81011d7ac2c0 ffffffff8072f910
Nov 15 15:31:04 node1 kernel: ffffffff807321d0 ffffffff8072f910 ffffffff8072f910 ffffffff8072f910
Nov 15 15:31:04 node1 kernel: Call Trace:
Nov 15 15:31:04 node1 kernel: [<ffffffff8804f8ec>] :kerrighed:__rpc_emergency_send_buf_alloc+0x55/0xb0
Nov 15 15:31:04 node1 kernel: [<ffffffff88022e4c>] :kerrighed:__sleep_on_kddm_obj+0x12e/0x243
Nov 15 15:31:04 node1 kernel: [<ffffffff8022c5bc>] default_wake_function+0x0/0xe
Nov 15 15:31:04 node1 kernel: [<ffffffff88023574>] :kerrighed:__get_kddm_obj_entry+0x4d/0xb7
Nov 15 15:31:04 node1 kernel: [<ffffffff8802b460>] :kerrighed:generic_kddm_grab_object+0x418/0x619
Nov 15 15:31:04 node1 kernel: [<ffffffff88045d61>] :kerrighed:update_dynamic_node_info_worker+0x0/0x30b
Nov 15 15:31:04 node1 kernel: [<ffffffff88045d8d>] :kerrighed:update_dynamic_node_info_worker+0x2c/0x30b
Nov 15 15:31:04 node1 kernel: [<ffffffff8804606c>] :kerrighed:update_dynamic_cpu_info_worker+0x0/0x16c
Nov 15 15:31:04 node1 kernel: [<ffffffff80244eeb>] run_workqueue+0x95/0x16a
Nov 15 15:31:04 node1 kernel: [<ffffffff8024871c>] keventd_create_kthread+0x0/0x65
Nov 15 15:31:04 node1 kernel: [<ffffffff802459d6>] worker_thread+0x191/0x1e8
Nov 15 15:31:04 node1 kernel: [<ffffffff8022c5bc>] default_wake_function+0x0/0xe
Nov 15 15:31:04 node1 kernel: [<ffffffff8024871c>] keventd_create_kthread+0x0/0x65
Nov 15 15:31:04 node1 kernel: [<ffffffff80245845>] worker_thread+0x0/0x1e8
Nov 15 15:31:04 node1 kernel: [<ffffffff802489c6>] kthread+0x125/0x163
Nov 15 15:31:04 node1 kernel: [<ffffffff8020a7b8>] child_rip+0xa/0x12
Nov 15 15:31:04 node1 kernel: [<ffffffff8024871c>] keventd_create_kthread+0x0/0x65
Nov 15 15:31:04 node1 kernel: [<ffffffff802488a1>] kthread+0x0/0x163
Nov 15 15:31:04 node1 kernel: [<ffffffff8020a7ae>] child_rip+0x0/0x12
Nov 15 15:31:04 node1 kernel:
Nov 15 15:31:04 node1 kernel: krg_legacy_sc D ffff810006122810 0 425726130 425726055 (NOTLB)
Nov 15 15:31:04 node1 kernel: ffff81011bcedd58 0000000000000082 0000000000000000 0000000000000000
Nov 15 15:31:04 node1 kernel: ffffffff8066c950 ffff81011bcedd08 0000000000000000 ffffffff8072f910
Nov 15 15:31:04 node1 kernel: ffffffff807321d0 ffffffff8072f910 ffffffff8072f910 ffffffff8072f910
Nov 15 15:31:04 node1 kernel: Call Trace:
Nov 15 15:31:04 node1 kernel: [<ffffffff8804f8ec>] :kerrighed:__rpc_emergency_send_buf_alloc+0x55/0xb0
Nov 15 15:31:04 node1 kernel: [<ffffffff88022e4c>] :kerrighed:__sleep_on_kddm_obj+0x12e/0x243
Nov 15 15:31:04 node1 kernel: [<ffffffff8022c5bc>] default_wake_function+0x0/0xe
Nov 15 15:31:04 node1 kernel: [<ffffffff8802b2ad>] :kerrighed:generic_kddm_grab_object+0x265/0x619
Nov 15 15:31:04 node1 kernel: [<ffffffff8801e5ab>] :kerrighed:hashed_string_list_lock_hash+0x5f/0x66
Nov 15 15:31:04 node1 kernel: [<ffffffff8801f102>] :kerrighed:global_config_attr_store_begin+0x4e/0x69
Nov 15 15:31:04 node1 kernel: [<ffffffff8802234e>] :kerrighed:pset_attribute_store+0x28/0x86
Nov 15 15:31:04 node1 kernel: [<ffffffff802b7354>] configfs_write_file+0xc1/0xea
Nov 15 15:31:04 node1 kernel: [<ffffffff8027b2b0>] vfs_write+0xad/0x172
Nov 15 15:31:04 node1 kernel: [<ffffffff8027b9b0>] sys_write+0x88/0xc3
Nov 15 15:31:04 node1 kernel: [<ffffffff80209a0e>] system_call+0x7e/0x83
Nov 15 15:31:04 node1 kernel:
Nov 15 15:31:04 node1 kernel: top D ffff810006122810 0 425726203 425726096 (NOTLB)
Nov 15 15:31:04 node1 kernel: ffff81011c233c38 0000000000000082 0000000000000000 0000000000000000
Nov 15 15:31:04 node1 kernel: 00000000f000000d ffff81011c233be8 ffff81011fc30000 ffffffff8072f910
Nov 15 15:31:04 node1 kernel: ffffffff807321d0 ffffffff8072f910 ffffffff8072f910 ffffffff8072f910
Nov 15 15:31:04 node1 kernel: Call Trace:
Nov 15 15:31:04 node1 kernel: [<ffffffff802a0b68>] inotify_d_instantiate+0x3e/0x68
Nov 15 15:31:04 node1 kernel: [<ffffffff802adc4d>] proc_root_lookup+0x12/0x30
Nov 15 15:31:04 node1 kernel: [<ffffffff88022e4c>] :kerrighed:__sleep_on_kddm_obj+0x12e/0x243
Nov 15 15:31:04 node1 kernel: [<ffffffff8022c5bc>] default_wake_function+0x0/0xe
Nov 15 15:31:04 node1 kernel: [<ffffffff88023574>] :kerrighed:__get_kddm_obj_entry+0x4d/0xb7
Nov 15 15:31:04 node1 kernel: [<ffffffff8802add5>] :kerrighed:generic_kddm_get_object+0x1f9/0x333
Nov 15 15:31:04 node1 kernel: [<ffffffff88045403>] :kerrighed:show_stat+0x1d1/0x56d
Nov 15 15:31:04 node1 kernel: [<ffffffff8026a742>] remove_vma_list+0x61/0x6e
Nov 15 15:31:04 node1 kernel: [<ffffffff80279407>] __get_unused_fd+0x66/0xe7
Nov 15 15:31:04 node1 kernel: [<ffffffff8029343a>] seq_read+0x105/0x28b
Nov 15 15:31:04 node1 kernel: [<ffffffff8027b41f>] vfs_read+0xaa/0x16e
Nov 15 15:31:04 node1 kernel: [<ffffffff8027b8ed>] sys_read+0x88/0xc3
Nov 15 15:31:04 node1 kernel: [<ffffffff80209a0e>] system_call+0x7e/0x83
Nov 15 15:31:04 node1 kernel:
node2:
Nov 15 15:27:53 node2 kernel: Start loading Kerrighed...
Nov 15 15:27:53 node2 kernel: Init Kerrighed worker(s)...Init Kerrighed low-level framework...
Nov 15 15:27:53 node2 kernel: Init kerrighed syscall mechanism
Nov 15 15:27:53 node2 kernel: Kerrighed tools - init module
Nov 15 15:27:53 node2 kernel: TIPC: Started in network mode
Nov 15 15:27:53 node2 kernel: TIPC: Own node address <1.1.103>, network identity 0
Nov 15 15:27:53 node2 kernel: RPC initialisation done
Nov 15 15:27:53 node2 kernel: Init Kerrighed low-level framework (nodeid 102) : done
Nov 15 15:27:53 node2 kernel: Init Kerrighed distributed services...
Nov 15 15:27:53 node2 kernel: KDDM initialisation : start
Nov 15 15:27:53 node2 kernel: KDDM set init
Nov 15 15:27:53 node2 kernel: KDDM set init : done
Nov 15 15:27:53 node2 kernel: KDDM initialisation done
Nov 15 15:27:53 node2 kernel: KerMM initialisation : start
Nov 15 15:27:53 node2 kernel: KerMM initialisation done
Nov 15 15:27:53 node2 kernel: DVFS initialisation : start
Nov 15 15:27:53 node2 kernel: FAF: initialisation : start
Nov 15 15:27:53 node2 kernel: FAF: initialisation : done
Nov 15 15:27:53 node2 kernel: DVFS initialisation done
Nov 15 15:27:53 node2 kernel: KerIPC initialisation : start
Nov 15 15:27:53 node2 kernel: KerIPC initialisation done
Nov 15 15:27:53 node2 kernel: Proc initialisation: start
Nov 15 15:27:53 node2 kernel: Proc initialisation: done
Nov 15 15:27:53 node2 kernel: EPM initialisation: start
Nov 15 15:27:53 node2 kernel: EPM initialisation: done
Nov 15 15:27:53 node2 kernel: Init Kerrighed distributed services: done
Nov 15 15:27:53 node2 kernel: scheduler initialization succeeded!
Nov 15 15:27:53 node2 kernel: Kerrighed... loaded!
Nov 15 15:27:53 node2 kernel: Try to enable bearer on lo:<5>TIPC: Enabled bearer <eth:lo>, discovery domain <1.1.0>, priority 10
Nov 15 15:27:53 node2 kernel: ok
Nov 15 15:27:53 node2 kernel: Try to enable bearer on eth0:<5>TIPC: Enabled bearer <eth:eth0>, discovery domain <1.1.0>, priority 10
Nov 15 15:27:53 node2 kernel: ok
Nov 15 15:27:53 node2 kernel: TIPC: Established link <1.1.103:eth0-1.1.102:eth0> on network plane B
Nov 15 15:27:53 node2 /usr/sbin/cron[429920407]: (CRON) INFO (pidfile fd = 3)
Nov 15 15:27:53 node2 /usr/sbin/cron[429920408]: (CRON) STARTUP (fork ok)
Nov 15 15:27:53 node2 /usr/sbin/cron[429920408]: (CRON) INFO (Running @reboot jobs)
Nov 15 15:28:13 node2 ntpd[429920348]: getaddrinfo: "::1" invalid host address, ignored
Nov 15 15:28:13 node2 ntpd[429920431]: signal_no_reset: signal 17 had flags 4000000
Nov 15 15:28:27 node2 kernel: Kerrighed is running on 2 nodes: 101-102
Nov 15 15:28:27 node2 kernel: successfully registered scheduler_policy_type round_robin_balancer
Nov 15 15:28:27 node2 kernel: successfully registered scheduler_policy_type mosix_load_balancer
Nov 15 15:28:35 node2 ntpd_initres[429920431]: host name not found: ntp.ubuntu.com
Nov 15 15:28:35 node2 ntpd_initres[429920431]: couldn't resolve `ntp.ubuntu.com', giving up on it
Nov 15 15:37:12 node2 kernel: input: AT Translated Set 2 keyboard as /class/input/input3
Nov 15 15:37:41 node2 kernel: SysRq : Show Blocked State
Nov 15 15:37:41 node2 kernel:
Nov 15 15:37:41 node2 kernel: free sibling
Nov 15 15:37:41 node2 kernel: task PC stack pid father child younger older
Nov 15 15:37:41 node2 kernel: udevd D ffff810006122810 0 429918484 1 429919452 19 (NOTLB)
Nov 15 15:37:41 node2 kernel: ffff81011e0cfc38 0000000000000082 0000000000000000 0000000000000000
Nov 15 15:37:41 node2 kernel: ffff81011e430e40 ffff81011e0cfbe8 ffffffff806bbc70 ffffffff8072f910
Nov 15 15:37:41 node2 kernel: ffffffff807321d0 ffffffff8072f910 ffffffff8072f910 ffffffff8072f910
Nov 15 15:37:41 node2 kernel: Call Trace:
Nov 15 15:37:41 node2 kernel: [<ffffffff88022e4c>] :kerrighed:__sleep_on_kddm_obj+0x12e/0x243
Nov 15 15:37:41 node2 kernel: [<ffffffff8022c5bc>] default_wake_function+0x0/0xe
Nov 15 15:37:41 node2 kernel: [<ffffffff88023574>] :kerrighed:__get_kddm_obj_entry+0x4d/0xb7
Nov 15 15:37:41 node2 kernel: [<ffffffff8802add5>] :kerrighed:generic_kddm_get_object+0x1f9/0x333
Nov 15 15:37:41 node2 kernel: [<ffffffff88045403>] :kerrighed:show_stat+0x1d1/0x56d
Nov 15 15:37:41 node2 kernel: [<ffffffff80279407>] __get_unused_fd+0x66/0xe7
Nov 15 15:37:41 node2 kernel: [<ffffffff8029343a>] seq_read+0x105/0x28b
Nov 15 15:37:41 node2 kernel: [<ffffffff8027b41f>] vfs_read+0xaa/0x16e
Nov 15 15:37:41 node2 kernel: [<ffffffff8027b8ed>] sys_read+0x88/0xc3
Nov 15 15:37:41 node2 kernel: [<ffffffff80209a0e>] system_call+0x7e/0x83
Nov 15 15:37:41 node2 kernel:
Nov 15 15:37:41 node2 kernel: krg/0 D ffff810006122810 0 429920372 19 429920373 429918320 (L-TLB)
Nov 15 15:37:41 node2 kernel: ffff81011c145cc0 0000000000000046 0000000000000000 ffffffff807324a0
Nov 15 15:37:41 node2 kernel: 000000018066c950 ffff81011c145c70 0000000000000080 ffffffff8072f910
Nov 15 15:37:41 node2 kernel: ffffffff807321d0 ffffffff8072f910 ffffffff8072f910 ffffffff8072f910
Nov 15 15:37:41 node2 kernel: Call Trace:
Nov 15 15:37:41 node2 kernel: [<ffffffff8804f8ec>] :kerrighed:__rpc_emergency_send_buf_alloc+0x55/0xb0
Nov 15 15:37:41 node2 kernel: [<ffffffff88022e4c>] :kerrighed:__sleep_on_kddm_obj+0x12e/0x243
Nov 15 15:37:41 node2 kernel: [<ffffffff8022c5bc>] default_wake_function+0x0/0xe
Nov 15 15:37:41 node2 kernel: [<ffffffff88023574>] :kerrighed:__get_kddm_obj_entry+0x4d/0xb7
Nov 15 15:37:41 node2 kernel: [<ffffffff8802b460>] :kerrighed:generic_kddm_grab_object+0x418/0x619
Nov 15 15:37:41 node2 kernel: [<ffffffff88045d61>] :kerrighed:update_dynamic_node_info_worker+0x0/0x30b
Nov 15 15:37:41 node2 kernel: [<ffffffff88045d8d>] :kerrighed:update_dynamic_node_info_worker+0x2c/0x30b
Nov 15 15:37:41 node2 kernel: [<ffffffff80244eeb>] run_workqueue+0x95/0x16a
Nov 15 15:37:41 node2 kernel: [<ffffffff8024871c>] keventd_create_kthread+0x0/0x65
Nov 15 15:37:41 node2 kernel: [<ffffffff802459d6>] worker_thread+0x191/0x1e8
Nov 15 15:37:41 node2 kernel: [<ffffffff8022c5bc>] default_wake_function+0x0/0xe
Nov 15 15:37:41 node2 kernel: [<ffffffff8024871c>] keventd_create_kthread+0x0/0x65
Nov 15 15:37:41 node2 kernel: [<ffffffff80245845>] worker_thread+0x0/0x1e8
Nov 15 15:37:41 node2 kernel: [<ffffffff802489c6>] kthread+0x125/0x163
Nov 15 15:37:41 node2 kernel: [<ffffffff8020a7b8>] child_rip+0xa/0x12
Nov 15 15:37:41 node2 kernel: [<ffffffff8024871c>] keventd_create_kthread+0x0/0x65
Nov 15 15:37:41 node2 kernel: [<ffffffff802488a1>] kthread+0x0/0x163
Nov 15 15:37:41 node2 kernel: [<ffffffff8020a7ae>] child_rip+0x0/0x12
Nov 15 15:37:41 node2 kernel:
Nov 15 15:37:41 node2 kernel: top D ffff810006122810 0 429920494 429920292 (NOTLB)
Nov 15 15:37:41 node2 kernel: ffff81011eeebc38 0000000000000082 0000000000000000 0000000000000000
Nov 15 15:37:41 node2 kernel: 00000000f000000d ffff81011eeebbe8 ffff81011fc30000 ffffffff8072f910
Nov 15 15:37:41 node2 kernel: ffffffff807321d0 ffffffff8072f910 ffffffff8072f910 ffffffff8072f910
Nov 15 15:37:41 node2 kernel: Call Trace:
Nov 15 15:37:41 node2 kernel: [<ffffffff8804f8ec>] :kerrighed:__rpc_emergency_send_buf_alloc+0x55/0xb0
Nov 15 15:37:41 node2 kernel: [<ffffffff88022e4c>] :kerrighed:__sleep_on_kddm_obj+0x12e/0x243
Nov 15 15:37:41 node2 kernel: [<ffffffff8022c5bc>] default_wake_function+0x0/0xe
Nov 15 15:37:41 node2 kernel: [<ffffffff8802add5>] :kerrighed:generic_kddm_get_object+0x1f9/0x333
Nov 15 15:37:41 node2 kernel: [<ffffffff88045403>] :kerrighed:show_stat+0x1d1/0x56d
Nov 15 15:37:41 node2 kernel: [<ffffffff8026a742>] remove_vma_list+0x61/0x6e
Nov 15 15:37:41 node2 kernel: [<ffffffff80279407>] __get_unused_fd+0x66/0xe7
Nov 15 15:37:41 node2 kernel: [<ffffffff8029343a>] seq_read+0x105/0x28b
Nov 15 15:37:41 node2 kernel: [<ffffffff8027b41f>] vfs_read+0xaa/0x16e
Nov 15 15:37:41 node2 kernel: [<ffffffff8027b8ed>] sys_read+0x88/0xc3
Nov 15 15:37:41 node2 kernel: [<ffffffff80209a0e>] system_call+0x7e/0x83
Nov 15 15:37:41 node2 kernel:
Nov 15 15:37:41 node2 kernel: bash D ffff810006122810 0 429920495 429920430 (NOTLB)
Nov 15 15:37:41 node2 kernel: ffff81011c175b48 0000000000000082 0000000000000000 ffff81011e6858b0
Nov 15 15:37:41 node2 kernel: ffff81011f3f0d80 ffff81011c175af8 000000001f498dc0 ffffffff8072f910
Nov 15 15:37:41 node2 kernel: ffffffff807321d0 ffffffff8072f910 ffffffff8072f910 ffffffff8072f910
Nov 15 15:37:41 node2 kernel: Call Trace:
Nov 15 15:37:41 node2 kernel: [<ffffffff80281ec9>] do_lookup+0x63/0x1ae
Nov 15 15:37:41 node2 kernel: [<ffffffff80283e83>] __link_path_walk+0xc43/0xd9a
Nov 15 15:37:41 node2 kernel: [<ffffffff804dc824>] xprt_timer+0x0/0x7f
Nov 15 15:37:41 node2 kernel: [<ffffffff88022e4c>] :kerrighed:__sleep_on_kddm_obj+0x12e/0x243
Nov 15 15:37:41 node2 kernel: [<ffffffff8022c5bc>] default_wake_function+0x0/0xe
Nov 15 15:37:41 node2 kernel: [<ffffffff88023574>] :kerrighed:__get_kddm_obj_entry+0x4d/0xb7
Nov 15 15:37:41 node2 kernel: [<ffffffff8802add5>] :kerrighed:generic_kddm_get_object+0x1f9/0x333
Nov 15 15:37:41 node2 kernel: [<ffffffff88044e9f>] :kerrighed:show_meminfo+0x105/0x498
Nov 15 15:37:41 node2 kernel: [<ffffffff8028b02e>] dput+0x21/0x14c
Nov 15 15:37:41 node2 kernel: [<ffffffff80283e83>] __link_path_walk+0xc43/0xd9a
Nov 15 15:37:41 node2 kernel: [<ffffffff8025f309>] get_page_from_freelist+0x2bf/0x348
Nov 15 15:37:41 node2 kernel: [<ffffffff8028ed75>] mntput_no_expire+0x1c/0x76
Nov 15 15:37:41 node2 kernel: [<ffffffff8029343a>] seq_read+0x105/0x28b
Nov 15 15:37:41 node2 kernel: [<ffffffff8027b41f>] vfs_read+0xaa/0x16e
Nov 15 15:37:41 node2 kernel: [<ffffffff8027b8ed>] sys_read+0x88/0xc3
Nov 15 15:37:41 node2 kernel: [<ffffffff80209a0e>] system_call+0x7e/0x83
Nov 15 15:37:41 node2 kernel:
lrilling
November 16th, 2009, 06:51 AM
Hi baggzy,
Those logs show that communications are not working well. It's unfortunately rather hard to debug this. You could however try one thing: unplug the network cables of your nodes and re-plug them, and then report about what happens (with logs and sysrq w please). I expect that this will temporarily unblock the system.
Another thing: I doubt that this can solve your issue, but you should assign a session_id between 1 and 9999 (TIPC's valid range of network IDs) to your cluster. Just use the session_id= parameter on the kernel command line. Not specifying one leaves it at 0, and I'm not sure that TIPC supports this well.
Thanks,
Louis
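For anyone following along, here is a minimal shell sketch of the session_id check described above. It only inspects a kernel command line string and tests the 1 to 9999 range Louis mentions; the function name check_session_id is made up for illustration and is not part of Kerrighed or TIPC.

```shell
# Hypothetical helper: find session_id=N in a kernel command line string
# and check that N lies in the 1..9999 range mentioned above.
check_session_id() {
    cmdline=$1
    id=
    for word in $cmdline; do
        case $word in
            session_id=*) id=${word#session_id=} ;;
        esac
    done
    case $id in
        ''|*[!0-9]*)
            echo "session_id missing or not a number"
            return 1
            ;;
    esac
    if [ "$id" -ge 1 ] && [ "$id" -le 9999 ]; then
        echo "session_id=$id ok"
    else
        echo "session_id=$id out of range (1-9999)"
        return 1
    fi
}
```

On a running node the string would normally come from /proc/cmdline, for example: check_session_id "$(cat /proc/cmdline)"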
lrilling
November 17th, 2009, 05:38 AM
Hi baggzy,
Those logs show that communications are not working well.
Looking at your first hack on the NIC driver, this kind of hack can definitely cause this sort of Kerrighed freeze, since it prevents TIPC (and Kerrighed) from retransmitting lost packets.
I don't know about the driver from Realtek, though.
Louis
baggzy
November 17th, 2009, 04:35 PM
Hi baggzy,
Those logs show that communications are not working well. It's unfortunately rather hard to debug this. You could however try one thing: unplug the network cables of your nodes and re-plug them, and then report about what happens (with logs and sysrq w please). I expect that this will temporarily unblock the system.
Another thing: I doubt that this can solve your issue, but you should assign a session_id between 1 and 9999 (TIPC's valid range of network IDs) to your cluster. Just use the session_id= parameter on the kernel command line. Not specifying one leaves it at 0, and I'm not sure that TIPC supports this well.
Thanks,
Louis
Hi Louis,
You're right! Unplugging the network cables unfreezes the nodes. Weird. Does this help us fix it? (I've set the session_id up as well, though I think this was being done in the /etc/kerrighed_nodes file anyway.) I'll post the sysrq logs shortly.
I suspected that driver hack would prove unsatisfactory, though it did give me a chance to boot up the nodes and create an initrd, so not completely useless. But I've now got the latest official Realtek driver, so you'd think I'd be ok...
Still, making progress... slowly...
baggzy
November 17th, 2009, 07:48 PM
Hi Louis,
Well, I had a poke around and couldn't find any more comms issues, so I tried a bunch of other stuff... and it looks like I have an ACPI problem. I have the following ACPI-related messages in the kern.log:
Nov 17 23:27:55 node1 kernel: Checking aperture...
Nov 17 23:27:55 node1 kernel: CPU 0: aperture @ 8000000 size 32 MB
Nov 17 23:27:55 node1 kernel: Aperture too small (32 MB)
Nov 17 23:27:55 node1 kernel: No AGP bridge found
Nov 17 23:27:55 node1 kernel: Your BIOS doesn't leave a aperture memory hole
Nov 17 23:27:55 node1 kernel: Please enable the IOMMU option in the BIOS setup
Nov 17 23:27:55 node1 kernel: This costs you 64 MB of RAM
Nov 17 23:27:55 node1 kernel: Mapping aperture over 65536 KB of RAM @ 8000000
Nov 17 23:27:55 node1 kernel: ACPI: bus type pci registered
Nov 17 23:27:55 node1 kernel: PCI: BIOS Bug: MCFG area at e0000000 is not E820-reserved
Nov 17 23:27:55 node1 kernel: PCI: Not using MMCONFIG.
Nov 17 23:27:55 node1 kernel: PCI: Using configuration type 1
Nov 17 23:27:55 node1 kernel: ACPI: Interpreter enabled
Nov 17 23:27:55 node1 kernel: ACPI: Using IOAPIC for interrupt routing
Nov 17 23:27:55 node1 kernel: Error attaching device data
Nov 17 23:27:55 node1 kernel: Error attaching device data
Nov 17 23:27:55 node1 kernel: Error attaching device data
Nov 17 23:27:55 node1 kernel: Error attaching device data
Nov 17 23:27:55 node1 kernel: Error attaching device data
Nov 17 23:27:55 node1 kernel: Error attaching device data
Nov 17 23:27:55 node1 kernel: PCI-DMA: Disabling AGP.
Nov 17 23:27:55 node1 kernel: PCI-DMA: aperture base @ 8000000 size 65536 KB
Nov 17 23:27:55 node1 kernel: PCI-DMA: using GART IOMMU.
Nov 17 23:27:55 node1 kernel: PCI-DMA: Reserving 64MB of IOMMU area in the AGP aperture
Nov 17 23:27:55 node1 kernel: PCI: Ignore bogus resource 6 [0:0] of 0000:01:05.0
Nov 17 23:27:55 node1 kernel: ACPI: (supports S0 S1 S3 S4 S5)
Nov 17 23:27:55 node1 kernel: IP-Config: No network devices available.
Nov 17 23:27:55 node1 kernel: error: -2
Nov 17 23:27:55 node1 kernel: Can't read /etc/hostname
Nov 17 23:27:55 node1 kernel: Kerrighed session ID : 1
Nov 17 23:27:55 node1 kernel: Kerrighed node ID : 101
Nov 17 23:27:55 node1 kernel: Kerrighed nb nodes : 0
Nov 17 23:27:55 node1 kernel: Kerrighed min nodes : -1
Nov 17 23:27:55 node1 kernel: Freeing unused kernel memory: 216k freed
Nov 17 23:27:55 node1 kernel: r8168 Gigabit Ethernet driver 8.014.00-NAPI loaded
Nov 17 23:27:55 node1 kernel: ACPI: PCI Interrupt 0000:02:00.0[A] -> GSI 17 (level, low) -> IRQ 17
Nov 17 23:27:55 node1 kernel: PCI: Setting latency timer of device 0000:02:00.0 to 64
Nov 17 23:27:55 node1 kernel: r8168: This product is covered by one or more of the following patents: US5,307,459, US5,434,872, US5,732,094, US6,570,884, US6,115,776, and US6,327,625.
Nov 17 23:27:55 node1 kernel: eth0: Identified chip type is 'RTL8168C/8111C'.
Nov 17 23:27:55 node1 kernel: eth0: RTL8168B/8111B at 0xffffc2000004a000, 00:24:21:32:a6:22, IRQ 1278
Nov 17 23:27:55 node1 kernel: r8168 Copyright (C) 2009 Realtek NIC software team <nicfae@realtek.com>
Nov 17 23:27:55 node1 kernel: This program comes with ABSOLUTELY NO WARRANTY; for details, please see <http://www.gnu.org/licenses/>.
Nov 17 23:27:55 node1 kernel: This is free software, and you are welcome to redistribute it under certain conditions; see <http://www.gnu.org/licenses/>.
Nov 17 23:27:55 node1 kernel: r8168: eth0: link down
Nov 17 23:27:55 node1 kernel: r8168: eth0: link up
Nov 17 23:27:55 node1 kernel: r8168: eth0: link up
None of these looked serious to me. (What do you think?) But if I disable ACPI with the "acpi=off" boot option, I don't get the freezing problem.
So problem (kind of) solved. \\:D/
I guess the obvious question is... will this have any effect on the performance of the cluster?
Cheers!
baggzy
November 17th, 2009, 08:06 PM
I've narrowed it down a bit - seems like the problem is related to that IOMMU error. If I use the following boot option (which I found on t'internet when I googled the IOMMU error I was getting):
iommu=noaperture,memaper=3
and delete the "acpi=off" boot option, I still don't get the freezing problem.
So I guess I'll go with that. :D
It's time to put this thing to work...
Many thanks to you and Tony for all your ideas & assistance! You guys rock.
lrilling
November 18th, 2009, 06:06 AM
I've narrowed it down a bit - seems like the problem is related to that IOMMU error. If I use the following boot option (which I found on t'internet when I googled the IOMMU error I was getting):
iommu=noaperture,memaper=3
and delete the "acpi=off" boot option, I still don't get the freezing problem.
So I guess I'll go with that. :D
It's time to put this thing to work...
Many thanks to you and Tony for all your ideas & assistance! You guys rock.
Happy that you finally got it working !
Looks like you guessed right. I don't know much about IOMMU and ACPI, so you probably did the right thing :)
Louis
gvpy
December 15th, 2009, 11:09 AM
Correct me if I'm wrong:
The guide https://wiki.ubuntu.com/EasyUbuntuClustering/UbuntuKerrighedClusterGuide sets up a cluster in which the nodes share resources.
Can the server also share resources with the nodes? I mean, what needs to be configured and how in order to make the processes of the server get migrated to the nodes?
I guess the kernel in the server also needs to be patched with Kerrighed, but I honestly have no idea about the other settings. Any thoughts?
quequotion
March 9th, 2010, 01:04 PM
Correct me if I'm wrong:
The guide https://wiki.ubuntu.com/EasyUbuntuClustering/UbuntuKerrighedClusterGuide sets up a cluster in which the nodes share resources.
Can the server also share resources with the nodes? I mean, what needs to be configured and how in order to make the processes of the server get migrated to the nodes?
I guess the kernel in the server also needs to be patched with Kerrighed, but I honestly have no idea about the other settings. Any thoughts?
I have exactly the same question. I'm about half-way through setting this up, and I realized I can't see how the head could migrate processes into the cluster...
One thing does come to mind though, having mounted /proc into the chroot might allow it to see the head's processes. I don't really know how or if that works. If it does, how does one give those processes permission to migrate?
quequotion
March 9th, 2010, 01:52 PM
Just disable ROOT on NFS:
diff .config.old .config
...
< CONFIG_ROOT_NFS=y
---
> # CONFIG_ROOT_NFS is not set
This appears to be an answer to my question.. and it's been posted and referenced a few times in the thread.... but what does it mean? (Sorry I think I'm going to need pictures or diagrams to wrap my head around this one...)
-edit-
Just to be sure, I mean to migrate processes from the PC running the head. This looks difficult to me as it does not have Kerrighed installed and uses an entirely different kernel and architecture from the cluster. (Although, even with the same kernel on the same arch, I don't see how it could work.)
like this:
Here's a minimal cluster, one head and one node.
[pxe/nfs server, karmic - amd64]=(head, hardy i386)
[pxe/nfs client]=(node, hardy - i386)
head and node can create processes and migrate them.
client has no reason or means to create processes (it has no operating system without node).
Can the server introduce its processes into the head and migrate them into the cluster?
ajt
March 10th, 2010, 06:47 AM
This appears to be an answer to my question.. and it's been posted and referenced a few times in the thread.... but what does it mean? (Sorry I think I'm going to need pictures or diagrams to wrap my head around this one...)
-edit-
Just to be sure, I mean to migrate processes from the PC running the head. This looks difficult to me as it does not have Kerrighed installed and uses an entirely different kernel and architecture from the cluster. (Although, even with the same kernel on the same arch, I don't see how it could work.)
Hello, quequotion.
I've been away from this thread for a while, but we've been working on the "server-as-node" configuration of Alicia's Beowulf here in Aberdeen.
We can migrate processes off the head now to PXE-booted compute nodes, but it requires some careful planning :-)
Basically, Kerrighed expects the same binaries in *exactly* the same place on all nodes. If you PXE-boot everything and share a common NFSROOT, there are few problems, except for the show-stopper that you can't migrate processes off the 'login' node.
We've addressed that by using UNFS3 with 'ClusterNFS' extensions enabled to share the '/' filesystem on the head node with the PXE-booted compute nodes. There are two difficulties with this approach, that we have now identified:
#1 The stand-alone node(s) must have statically configured id and session
#2 Any services provided by the node must be started *after* Kerrighed
We've now got a reasonably stable 10-node/18p 32-bit Kerrighed cluster running under Bio-Linux 5.0:
http://kitcat.rri.sari.ac.uk
There are two stand-alone nodes: one is the 'head' node and the other is "kitcat" itself, which is our login server. Processes will migrate off "kitcat" OK to the rest of the nodes. However, we still have problems if I enable USE_REMOTE_MEMORY, and Louis has said that the 32-bit version is not fully supported because they are giving priority to work on the 64-bit version. We will upgrade to 64-bit Ubuntu 10.04 and test Kerrighed in that environment until Bio-Linux 6.0 is released in June 2010.
Bye,
Tony.
soulsage
March 25th, 2010, 10:12 AM
I've been playing with Kerrighed for about 3 weeks now and I just have a few questions ](*,)
Has anyone been able to actually use Kerrighed's features?
Is there a way to get Java (ImageJ) to recognize all the memory in a Kerrighed cluster?
(I've tried, with no success)
What programs are there for Kerrighed, or must everything be custom made?
Does migration/Kerrighed always hang while running linpack?
lrilling
March 26th, 2010, 06:33 AM
Is there a way to get Java (ImageJ) to recognize all the memory in a Kerrighed cluster?
(I've tried, with no success)
Memory aggregation is a feature that appeared in Kerrighed 2.4.0. Some configuration is required in order to use it. Please check the FAQ:
http://kerrighed.org/wiki/index.php/FAQ#My_process_cannot_use_remote_memory
What programs are there for Kerrighed, or must everything be custom made?
I don't understand this question. Could you elaborate a bit?
Does migration/Kerrighed always hang while running linpack?
There are known issues with sockets and migration, especially for MPI and X applications. Problems are still under investigation.
Thanks,
Louis
soulsage
March 26th, 2010, 07:49 AM
Thank you for your reply.
What programs are there for Kerrighed, or must everything be custom made?
...
Basically, I am asking which programs fully work with Kerrighed, or must I create my own program to work with it?
...
Also, I checked the configuration required in order to use memory aggregation. Still no success.
nerdopolis
March 26th, 2010, 06:29 PM
As far as I know, you can distribute normal running programs throughout the cluster (run 3 apps on node 1 and 3 apps on node 2, for example, and then migrate them between nodes).
But I don't think single programs themselves get the full benefit of the cluster yet, as in order to benefit from multiple processors they have to be written to use multiple threads. Some programs use multiple threads already, because we have dual- and quad-core processors now (and dual-processor motherboards). Unfortunately, Kerrighed cannot yet distribute a program's threads across the nodes; the threads can only be on one node at a time.
That feature is planned for Kerrighed in the future, but it's a complex feature, and the devs don't have enough resources for that.
lrilling
March 29th, 2010, 05:25 AM
But I don't think single programs themselves get the full benefit of the cluster yet, as in order to benefit from multiple processors they have to be written to use multiple threads.
Some programs use multiple threads already, because we have dual- and quad-core processors now (and dual-processor motherboards). Unfortunately, Kerrighed cannot yet distribute a program's threads across the nodes; the threads can only be on one node at a time.
If you replace "threads" by "processes" then it already works.
Louis
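To illustrate Louis's point: since Kerrighed can move whole processes but not the individual threads of one process, work has to be split into separate processes before it can spread over the nodes. The sketch below is plain shell with nothing Kerrighed-specific in it; work and run_workers are hypothetical names, and the summing loop only stands in for a real CPU-bound job.

```shell
# Hypothetical CPU-bound task: sum the integers 1..N.
work() {
    n=$1
    i=1
    total=0
    while [ "$i" -le "$n" ]; do
        total=$((total + i))
        i=$((i + 1))
    done
    echo "$total"
}

# Start one background process per argument, wait for all of them,
# then print the results in argument order.
# Each worker is a separate process (a subshell), not a thread.
run_workers() {
    outdir=$(mktemp -d) || return 1
    k=0
    for n in "$@"; do
        k=$((k + 1))
        work "$n" > "$outdir/$k.out" &
    done
    wait
    j=1
    while [ "$j" -le "$k" ]; do
        cat "$outdir/$j.out"
        j=$((j + 1))
    done
    rm -rf "$outdir"
}
```

Whether the workers actually end up on different nodes depends on how migration is configured on the cluster, which this sketch does not touch.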
lrilling
March 29th, 2010, 05:27 AM
Also, I checked the configuration required in order to use memory aggregation. Still no success.
Could you show us detailed steps of what you did?
Thanks,
Louis
anitemp
March 31st, 2010, 08:53 PM
I'm trying to get my Kerrighed system to work and I'm facing similar issues too. I recently found the 'live' version of Kerrighed 2.3.0 at this (http://www.kerlabs.com/dl/kerrighed-live.iso) link. Not a full fledged solution, though I could do what I wanted to.
Good luck!!
renomcdonald
April 2nd, 2010, 12:12 AM
Hey, I've looked all over and clustering has officially confused me! :-x
Would the Ubuntu Enterprise Cloud computing allow me to run a virtual machine and use the resources of all the nodes for processing? Cause that's what I want.
lrilling
April 2nd, 2010, 05:22 AM
Hey, I've looked all over and clustering has officially confused me! :-x
That's why Kerrighed is being developed: to let people forget about what clusters really are. But for administrators, some knowledge will still be required.
Would the Ubuntu Enterprise Cloud computing allow me to run a virtual machine and use the resources of all the nodes for processing? Cause that's what I want.
As far as I know UEC will allow you to multiplex your nodes, that is deploy several virtual machines per node, but it won't allow you to aggregate nodes into a single virtual machine.
Regards,
Louis
renomcdonald
April 2nd, 2010, 09:29 AM
oh ok. so Kerrighed would be my best solution then?;)
lrilling
April 6th, 2010, 04:50 AM
oh ok. so Kerrighed would be my best solution then?;)
;)
amichaud
April 11th, 2010, 11:41 PM
Has anyone tried Ansys/Fluent and also Matlab with Kerrighed? I've been tasked with building a new cluster for the department and these are the programs we will run.
For the hardware I'm looking to run Dual Xeon 5520 quad core cpus in each node with 8 nodes (64 cores) and a master node with similar specs. 12Gigs ram on each node. I've also been asked to include a hdd on each node which doesn't make sense to me...I'd rather use the money to buy more ram.
For storage I'm looking at 15TB SAN
Dell Powervault MD3000i
Will Kerrighed handle this hardware and software?
Any insight is much appreciated.
lrilling
April 13th, 2010, 04:32 AM
For the hardware I'm looking to run Dual Xeon 5520 quad core cpus in each node with 8 nodes (64 cores) and a master node with similar specs. 12Gigs ram on each node. I've also been asked to include a hdd on each node which doesn't make sense to me...I'd rather use the money to buy more ram.
For storage I'm looking at 15TB SAN
Dell Powervault MD3000i
Will Kerrighed handle this hardware and software?
Kerrighed is known to run on Nehalem CPUs, so Xeon 5520s are OK.
For your storage, as long as it can be used to export a shared NFS root to the compute nodes, it's OK too.
Regards,
Louis
gabrielaca
April 13th, 2010, 07:41 PM
I'm sorry, does Kerrighed support multiple CPUs? amichaud said dual quad-core Xeons; I thought only single CPUs were supported at this time. Could you clarify?
gabrielaca
April 13th, 2010, 08:17 PM
yes it can handle up to 4 cpus per node, he he he sorry my mistake :lolflag:
lrilling
April 14th, 2010, 04:35 AM
yes it can handle up to 4 cpus per node, he he he sorry my mistake :lolflag:
This limitation of 4 cpus is mostly artificial and gone in the development version, which is based on Linux 2.6.30.
Regards,
Louis
renkinjutsu
April 14th, 2010, 12:31 PM
I read somewhere on linux.com (in an article written in 2006) that Kerrighed supports migrating individual threads to balance CPUs, but on the Kerrighed website "support for distributed threads" is listed under "Future Work".
Can anyone currently using Kerrighed tell me if threaded processes are distributed over the cluster?
lrilling
April 15th, 2010, 04:32 AM
I read somewhere on linux.com (in an article written in 2006) that Kerrighed supports migrating individual threads to balance CPUs, but on the Kerrighed website "support for distributed threads" is listed under "Future Work".
Can anyone currently using Kerrighed tell me if threaded processes are distributed over the cluster?
Thread migration used to work in the now very old version 1.0.2 of Kerrighed, which was based on Linux 2.4.29. This feature was dropped when porting Kerrighed to Linux 2.6.x. It should be re-introduced in the future. Please read this post for more details: https://listes.irisa.fr/sympa/arc/kerrighed.users/2008-12/msg00024.html
Thanks,
Louis
quequotion
April 18th, 2010, 04:28 PM
#1 The stand-alone node(s) must have statically configured id and session
I think this was the plan anyway.. was there something unusual about the way you specified their id and session?
#2 Any services provided by the node must be started *after* Kerrighed
I don't really know what this means or how to implement it :(
Could you give examples of your /etc/exports, /etc/fstab, /etc/network/interfaces, and /etc/hosts?
By the way, thank you for helping out so patiently... This thread is very long, but having looked through a few more pages I see you've had to repeat yourself several times.
joshruehlig
April 22nd, 2010, 04:48 PM
I haven't had the time to read every single post in this thread yet, but I just wanted to ask a question before I head to class. (I'll read through it after.)
https://wiki.edubuntu.org/EasyUbuntuClustering/UbuntuKerrighedClusterGuide
In this well-written guide he talks about having a main machine and nodes (one being the head node). Would this mean my main computer with a GUI wouldn't be doing any of the computing work? The main computer in my case would be my best one, and I'd want it to be part of the cluster's computing power as well.
This is what my ideal setup would consist of,
main computer with server install and gui + 3-5 nodes connected to it just for share of ram/cpu/hard drive. Is this possible with this guide?
Thanks for your help!
PS. the server is to host a few sites (motion webcam stream, files) Remote Desktop.
And the cluster is just for fun, and for testing...
renkinjutsu
April 22nd, 2010, 05:32 PM
yup, the nfs/tftp server will not be part of the cluster.
You can easily set the server up on a weaker machine.. it doesn't take much computing power to serve files and act as a boot server. it's mostly disk throughput and network speed that counts for the server.
Since kerrighed will only be installed onto ONE computer, the server, you can leave your main desktop as it is and just reboot it into kerrighed whenever you start your cluster.
that's all i've gathered.. i don't actually have a kerrighed cluster :P
joshruehlig
April 22nd, 2010, 06:18 PM
Thanks for the quick reply!
hmm still not sure what I want to do yet, the main reason for the server is for motion, I wonder how hard that would be to run on the cluster.
BTW, the Ubuntu Enterprise Cloud seems like way more than I need, but it also fills my requirement of several servers hooked together. Lol, I guess I'm just confusing myself now. Has anyone here messed around with UEC?
Benic
April 23rd, 2010, 03:46 PM
Very interesting thread!
I've been using for some time a specific R script to analyze textual data that uses a very large matrix as input. When I bootstrap the results it can take several days (4-5) to process. In order to save some time, as I need to do that frequently, I've decided to set up a small cluster since the script involves the Rmpi library (based on lam-mpi). Not an easy task!!!
Did anyone succeed in building such a cluster? I've implemented the exact same R and MPI setup on all computers, set up a passwordless ssh access, reconfigured my Guarddog firewall to allow parallel computing, but when I test lam on two computers, I can't turn the second one into a node, despite positive output of lamboot -v, as you can see below:
~$ lamboot -v
LAM 7.1.2/MPI 2 C++/ROMIO - Indiana University
n-1<3976> ssi:boot:base:linear: booting n0 (10.42.43.1)
n-1<3976> ssi:boot:base:linear: booting n1 (10.42.43.2)
n-1<3976> ssi:boot:base:linear: finished
When it comes to lamexec, I get that:
~$ lamexec C hostname
bureau1
dlo_inet (sendto): Operation not permitted
If I try:
~$ mpirun -np 1 R --no-save
I get this error when I run a function involving Rmpi:
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 5006 on node bureau1 exited on signal 13 (Broken pipe).
--------------------------------------------------------------------------
How could I get that error? If I force R to use Rmpi, I usually have to kill the process, same with ssh (or rsh) 10.42.43.2 R --slave.
A simple question (that may be the answer to all that), do I need to set all computers as ssh servers or only one as it is in this case?
Since I'm a complete newbie in the marvelous world of clustering computers, any help would be appreciated!
Thanks!
ajt
April 29th, 2010, 11:27 AM
Very interesting thread!
[...]
A simple question (that may be the answer to all that), do I need to set all computers as ssh servers or only one as it is in this case?
Since I'm a complete newbie in the marvelous world of clustering computers, any help would be appreciated!
Thanks!
Hello, Benic.
Yes, you need to have SSH servers on all your nodes and set up password-less logins. LAM is very easy to use. First create SSH keys (use a 'null/empty' pass-phrase when prompted):
cd                                  # go to your login directory
ssh-keygen -t rsa                   # accept the defaults; leave the pass-phrase empty
cd .ssh
cp -ai id_rsa.pub authorized_keys   # authorize your own public key
Now, either copy the .ssh directory to all the nodes or use NFS to mount your login folder on all the nodes. You will no longer be prompted for a password. When this is working, start up LAM (e.g. on node1,2,3 with 1 CPU in each)
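# Write the LAM boot schema (one line per node)
cat > lam-bhost.def <<EOF
node1 cpu=1
node2 cpu=1
node3 cpu=1
EOF
# Boot LAM on the listed nodes, then list them
lamboot -v lam-bhost.def
lamnodes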
Now, you should be able to run your MPI programs!
Bye,
Tony.
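Before running lamboot, it can save time to confirm that password-less SSH really works to every node. A rough sketch, assuming the OpenSSH client (BatchMode and ConnectTimeout are OpenSSH options) and using a made-up function name, check_nodes:

```shell
# Hypothetical helper: try a non-interactive SSH login to each node.
# BatchMode=yes makes ssh fail instead of prompting for a password,
# so a node that still asks for one is reported as FAILED.
check_nodes() {
    for node in "$@"; do
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$node" true >/dev/null 2>&1; then
            echo "$node: ok"
        else
            echo "$node: FAILED"
        fi
    done
}
```

For the three-node example above this would be called as: check_nodes node1 node2 node3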
ajt
April 29th, 2010, 11:49 AM
I think this was the plan anyway.. was there something unusual about the way you specified their id and session?
Hello, quequotion.
Yes, we specified it to be consistent with the automatic configuration of the PXE-booted nodes.
I don't really know what this means or how to implement it :(
Could you give examples of your /etc/exports, /etc/fstab, /etc/network/interfaces, and /etc/hosts?
manager@kitcat[manager] cat /etc/exports
# /etc/exports #
/ 192.168.0.0/255.255.255.0(ro,no_root_squash)
#/export 192.168.0.0/255.255.255.0(rw,no_root_squash)
manager@kitcat[manager] cat /etc/fstab
# /etc/fstab: static file system information.
#
# <file system> <mount point> <type> <options> <dump> <pass>
proc /proc proc defaults 0 0
# /dev/sde1
UUID=dd20339c-d208-42e0-8311-f24e77dc128e / ext3 relatime,errors=remount-ro 0 1
# /dev/sde3
UUID=fc839f2a-b004-486e-9153-13e357b6ea49 /backups ext3 relatime,noauto 0 2
# /dev/sde2
UUID=a7479698-1258-47a6-925f-422c84f5a612 none swap sw 0 0
# /dev/md0
/dev/md0 /export ext3 relatime 0 2
/dev/scd0 /media/cdrom0 udf,iso9660 user,noauto,exec,utf8 0 0
configfs /config configfs defaults 0 0
manager@kitcat[manager] cat /etc/network/interfaces
auto lo
iface lo inet loopback
auto eth2
iface eth2 inet static
address 143.234.32.16
netmask 255.255.240.0
gateway 143.234.36.2
auto eth0
iface eth0 inet static
address 192.168.0.254
netmask 255.255.255.0
auto eth1
iface eth1 inet static
address 192.168.1.254
netmask 255.255.255.0
manager@kitcat[manager] cat /etc/hosts
127.0.0.1 localhost
127.0.1.1 kitcat
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
# servers
192.168.1.254 krg254
192.168.0.254 node254 kitcat
# nodes
192.168.1.2 krg2
192.168.0.2 node2
192.168.1.3 krg3
192.168.0.3 node3
192.168.1.4 krg4
192.168.0.4 node4
192.168.1.5 krg5
192.168.0.5 node5
192.168.1.6 krg6
192.168.0.6 node6
192.168.1.7 krg7
192.168.0.7 node7
192.168.1.8 krg8
192.168.0.8 node8
192.168.1.1 krg1
192.168.0.1 node1
By the way, thank you for helping out so patiently... This thread is very long, but having looked through a few more pages I see you've had to repeat yourself several times.
That's OK, I'm happy to help if I can - I'd like to spread the word about Kerrighed in the Ubuntu community.
Bye,
Tony.
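The node entries in the /etc/hosts listing above follow a regular pattern (krgN on 192.168.1.N, nodeN on 192.168.0.N), so for a larger cluster they could be generated rather than typed. A small sketch; gen_hosts is a hypothetical name, and the address scheme is simply copied from the listing:

```shell
# Hypothetical helper: print /etc/hosts lines for nodes FIRST..LAST,
# using the krgN / nodeN naming and addressing shown in the listing above.
gen_hosts() {
    first=$1
    last=$2
    n=$first
    while [ "$n" -le "$last" ]; do
        printf '192.168.1.%s krg%s\n' "$n" "$n"
        printf '192.168.0.%s node%s\n' "$n" "$n"
        n=$((n + 1))
    done
}
```

The output is meant to be reviewed and pasted into /etc/hosts by hand, for example: gen_hosts 1 8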
ajt
April 29th, 2010, 11:54 AM
This limitation of 4 cpus is mostly artificial and gone in the development version, which is based on Linux 2.6.30.
Hello, Louis.
In our tests, 32-bit Kerrighed seems very reluctant to use more than one core per processor chip and we had to get the cluster loaded significantly more than one CPU hog per CPU core before SMP processors on the same chip were used. Is this a result of the openMosix load-balancing algorithm?
Bye,
Tony.
Benic
April 29th, 2010, 01:33 PM
Hello, Benic.
Yes, you need to have SSH servers on all your nodes and set up password-less logins.
Thanks Tony, it was the missing piece in the puzzle!
lrilling
April 30th, 2010, 05:08 AM
Hello, Louis.
In our tests, 32-bit Kerrighed seems very reluctant to use more than one core per processor chip and we had to get the cluster loaded significantly more than one CPU hog per CPU core before SMP processors on the same chip were used. Is this a result of the openMosix load-balancing algorithm?
This results from a simplification we made when implementing the load-balancing algorithm. The implemented algorithm simply does not consider the number of cores per node, which is equivalent to behaving as if each node had only one core.
Some improvements to this algorithm are planned (e.g. number of cores per node, different CPU speeds), but right now I can't tell you when they will appear.
Thanks,
Louis
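A toy illustration of the difference Louis describes, not Kerrighed's actual scheduler code: each input line is "name runnable_tasks cores", and the two hypothetical functions pick a target node with and without taking the core count into account.

```shell
# Input on stdin: one line per node, "name runnable_tasks cores".

# Core-blind choice: fewest runnable tasks wins, as if every node
# had a single core.
pick_ignoring_cores() {
    awk 'NR == 1 || $2 < best { best = $2; name = $1 } END { print name }'
}

# Core-aware choice: fewest runnable tasks per core wins.
pick_per_core() {
    awk 'NR == 1 || $2 / $3 < best { best = $2 / $3; name = $1 } END { print name }'
}
```

With a 4-core node running 2 tasks and a 1-core node running 1 task, the core-blind version picks the 1-core node even though the 4-core node still has idle cores, which matches the reluctance Tony observed.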
renkinjutsu
April 30th, 2010, 08:39 AM
lrilling,
do you think kerrighed will pick up thread migration again once the desired stability is finally met? Or would it just be impossible at that point?
lrilling
May 3rd, 2010, 04:47 AM
lrilling,
do you think kerrighed will pick up thread migration again once the desired stability is finally met? Or would it just be impossible at that point?
We should start working on it in the near future. But I have no means to give a reliable estimate of when it will be done. Nothing makes it impossible; it is just a matter of priorities.
Thanks,
Louis
exsecratus
May 11th, 2010, 12:31 PM
Hi. Is it possible to use this with software like Blender or Maya to run a render farm?
Or even to use VMware to run 3ds Max?
o/
jvin248
May 16th, 2010, 07:49 PM
Not having done it, http://dynebolic.org/ is supposed to be set up for clustering machines specifically for blender and related programs. I was a bit surprised to see last release was ~2007 as there used to be a lot of activity there.
Kerrighed should definitely allow doing this as well - build cluster then run any program on top of it across the group of computers. Blender was set up to multi-thread and use all resources provided. How to actually get it working .. I'm not sure, .. but it's been done by others already.
samshamolian
May 17th, 2010, 04:20 PM
Hello all.. a bunch of sweet words on how much this site rocks and everything (really, I just don't know how to write such stuff).
I'm a computer science student entering my final year, and I'm thinking about a graduation project.
I have my mind set on doing something related to cluster computing. I (personally) think it's a really neat subject: it may be needed in the near future as computer usage increases rapidly, and many organizations want high-performance computing without paying tons of dollars. I also think of it as a good way to improve my GNU/Linux capabilities and FOSS knowledge. I have many other reasons to go on with this project, so, long story short, I want to do a graduation project related to cluster computing. Now I have some questions:
- First, I'll appreciate any resources related to the subject. I of course found a lot of material on the web (me love Google), but it'd be nice if someone pointed me to some of those underground, hidden, unknown pages that are not too obvious to find, the kind you had to bookmark so you could make use of them one day. (I found alla mentioning in more than one place in his writings that he wants to write something about clustering.)
- Secondly, and this is more important, I want to hear your opinions on what exactly I can do using clusters that would be useful as a graduation project. As I dig deeper on the web, I find that the subject is much easier than I thought it would be in the first place; everything is already implemented. It'll come down to just connecting a bunch of computers with a LAN, then installing a clustering distro on each node, and voilà.. it's done (almost). So the issue now is what is going to run on the system; this is going to be the sole benefit of the project, and this is my question. I thought about some stuff: a web or database server, or any kind of server, to make use of high availability and load balancing; running a highly demanding application that needs huge processing power (I'm thinking about just installing one of those clients that work on large research projects); some other lame ideas..
OK, after I wrote this I thought it would be too large to post here, so if you find this post not informative enough, please let me know and I'll explain more. Waiting for your comments :) Thank you very much, and keep up the good work.
noval
May 17th, 2010, 08:32 PM
Dear all,
I would like to ask a question. I have already installed Kerrighed and compiled the kernel on my laptop following the guide, and I checked all the files needed by the cluster server:
/boot/vmlinuz-2.6.20-krg (Kerrighed kernel)
/boot/System.map (Kernel symbol table)
/lib/modules/2.6.20-krg (Kerrighed kernel module)
/etc/init.d/kerrighed (Kerrighed service script)
/etc/default/kerrighed (Kerrighed service configuration file)
/usr/local/share/man/* (Look inside these subdirectories for Kerrighed man pages)
/usr/local/bin/krgadm (The cluster administration tool)
/usr/local/bin/krgcapset (Tool for setting capabilities of processes on the cluster)
/usr/local/bin/krgcr-run (Tool for checkpointing processes)
/usr/local/bin/migrate (Tool for migrating processes)
/usr/local/lib/libkerrighed-* (Libraries needed by Kerrighed)
/usr/local/include/kerrighed (Headers for Kerrighed libraries)
All of them are present.
But when I try to boot hostnode1 and hostnode2 (the client nodes), they can't connect, even though each already has its own static IP.
Here is the error:
intel 810_AC97 Audio,version 1.01,05:13:06 May 16 2010
oprofile :using timer interrupt
TCP cubic registered
NET: Registered protocol family 1
NET: Registered protocol family 17
TIPC:Activated (version 1.7.5 compiled May 16 2010 05:18:38)
NET: Registered protocol family 30
TIPC: Started in single node mode
acpi_processor-0571 [00] processor_get_psd : invalid _PSD data
acpi_processor-0571 [00] processor_get_psd : invalid _PSD data
Using IPI Shortcut mode
Time : tsc clocksource has been installed.
r8169 : eth0: link up
Sending DHCP request…………timed out!
IP-Config:Retrying forever (NFS root)…
r8169 : eth0: link up
Sending DHCP request…………timed out!
IP-Config:Retrying forever (NFS root)…
Please help me, what should I do?
Best regards
thanks
noval
lrilling
May 18th, 2010, 04:37 AM
You said that the nodes should get static IPs, but the kernel is trying to get a dynamic one. If your setup definitely targets static IPs, you should check that they are properly given to the kernel in its command line. Otherwise, you should check your DHCP server.
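As a minimal sketch of what handing a static address to the kernel on its command line could look like: the kernel's NFS-root support accepts an ip= argument of the form ip=<client-ip>:<server-ip>:<gw-ip>:<netmask>:<hostname>:<device>:<autoconf>. The helper name static_ip_arg is made up for illustration, and the addresses and the eth0 device are taken from the example setup discussed in this thread, so they may not match your network.

```shell
#!/bin/sh
# Build a static "ip=" kernel argument of the form
#   ip=<client-ip>:<server-ip>:<gw-ip>:<netmask>:<hostname>:<device>:<autoconf>
# static_ip_arg is a hypothetical helper name, not part of any package.
static_ip_arg() {
    client=$1; server=$2; gateway=$3; netmask=$4; host=$5; dev=$6
    # "off" in the last field disables DHCP/BOOTP autoconfiguration.
    printf 'ip=%s:%s:%s:%s:%s:%s:off\n' \
        "$client" "$server" "$gateway" "$netmask" "$host" "$dev"
}

# Example values copied from the setup described in this thread.
static_ip_arg 10.11.13.101 10.11.13.1 10.11.13.1 255.255.255.0 kerrighednode1 eth0
```

The resulting string would go on the kernel command line in place of ip=dhcp (for a PXE boot, that is the APPEND line of the node's PXELINUX configuration file).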
Thanks,
Louis
noval
May 18th, 2010, 10:01 AM
Dear Louis,
I have already checked the DHCP server by connecting PCs that have an OS (Ubuntu) installed; those PCs get the static IPs that were set on the DHCP server, as they should.
But when booting over the network for a diskless boot, the problem happens.
Here is my setup:
1.Install the DHCP server package with aptitude or apt-get, as root:
# aptitude install dhcp3-server
2.Check the DHCP server configuration file, /etc/default/dhcp3-server, and make sure it names the right network card. In this case it is eth0:
# /etc/default/dhcp3-server #
interfaces="eth0"
3.Configure the DHCP daemon to issue addresses only to nodes, and tell it which addresses to give them.
# /etc/dhcp3/dhcpd.conf #
# General options
option dhcp-max-message-size 2048;
use-host-decl-names on;
deny unknown-clients; # This will stop any non-node machines from appearing on the cluster network.
deny bootp;
# DNS settings
option domain-name "kerrighed"; # Just an example name - call it whatever you want.
option domain-name-servers 10.11.13.1; # The server's IP address, manually configured earlier.
# Information about the network setup
subnet 10.11.13.0 netmask 255.255.255.0 {
option routers 10.11.13.1; # Server IP as above.
option broadcast-address 10.11.13.255; # Broadcast address for your network.
}
# Declaring IP addresses for nodes and PXE info
group {
filename "pxelinux.0"; # PXE bootloader. Path is relative to /var/lib/tftpboot
option root-path "10.11.13.1:/nfsroot/kerrighed"; # Location of the bootable filesystem on NFS server
host kerrighednode1 {
fixed-address 10.11.13.101; # IP address for the first node, kerrighednode1 for example.
hardware ethernet 00:26:18:B6:CD:19; # MAC address of the node's ethernet adapter
}
host kerrighednode2 {
fixed-address 10.11.13.102;
hardware ethernet 00:26:18:B6:C9:CB;
}
server-name "kerrighedserver"; # Name of the server. Call it whatever you like.
next-server 10.11.13.1; # Server IP, as above.
}
4.As root, install the TFTP server package, tftpd-hpa, with aptitude or apt-get:
# aptitude install tftpd-hpa
5.Open its configuration file, /etc/default/tftpd-hpa, and make sure it uses the following settings.
# /etc/default/tftpd-hpa #
#Defaults for tftpd-hpa
RUN_DAEMON="NO"
OPTIONS="-l -s /var/lib/tftpboot"
6.Now we need to configure inetd to run the TFTP server. Add the following line to /etc/inetd.conf:
tftp dgram udp wait root /usr/sbin/in.tftpd /usr/sbin/in.tftpd -s /var/lib/tftpboot
7.As root, install syslinux, which is the system required for you to be able to PXE boot the nodes, and copy the PXE bootloader code from it to the TFTP server directory. This is the bootloader you told the DHCP daemon about in its configuration file earlier.
# aptitude install syslinux
# cp /usr/lib/syslinux/pxelinux.0 /var/lib/tftpboot
8.Still as root, create a directory to store the default configuration for all the nodes. They will search in this directory for configuration files during the PXE boot process.
# mkdir /var/lib/tftpboot/pxelinux.cfg
9.Still as root, copy your current kernel and initrd from /boot to /var/lib/tftpboot/ in order to test the diskless-boot system. Replace the kernel version below with whatever you are using.
# cp /boot/vmlinuz-2.6.24-16-generic /boot/initrd.img-2.6.24-16-generic /var/lib/tftpboot/
10.Create the file /var/lib/tftpboot/pxelinux.cfg/default.
This will be the fallback configuration file that the nodes use to PXE
boot when they can't find a file specific to their own IP address.
LABEL linux
KERNEL vmlinuz-2.6.24-16-generic
APPEND console=tty1 root=/dev/nfs initrd=initrd.img-2.6.24-16-generic nfsroot=10.11.13.1:/nfsroot/kerrighed ip=dhcp rw
11.As root, install the packages nfs-kernel-server and nfs-common, which comprise the NFS server program. Keep your root authorisation until you're done working with the NFS server.
# apt-get install nfs-kernel-server nfs-common
12. Make a directory to store the bootable filesystem in:
# mkdir /nfsroot/kerrighed
13.Edit /etc/exports, which configures NFS file transfers. Add the following in order to make NFS export the filesystem that will be stored in the directory you just made:
# /etc/exports
# /nfsroot/kerrighed 10.11.13.0/255.255.255.0(rw,no_subtree_check,async,no_root_squash)
14.Re-export the file systems, since you just changed how this is done:
# exportfs -avr
15.Use debootstrap to install a minimal Ubuntu Hardy system into the bootable filesystem directory:
# aptitude install debootstrap
# debootstrap --arch i386 hardy /nfsroot/kerrighed http://archive.ubuntu.com/ubuntu/
16.Change the current root of the file system to the bootable filesystem directory (stay chrooted until the guide tells you otherwise.) This is so that you can work with the bootable filesystem directly, as if it were a separate machine, while we make some adjustments to it.
# chroot /nfsroot/kerrighed
17.Set the root password for the bootable filesystem
# passwd
18.Mount the /proc directory of the current machine as the bootable system's /proc directory, so you can use programs on the bootable FS:
# mount -t proc none /proc
19.Edit /etc/apt/sources.list.
deb http://archive.canonical.com/ubuntu hardy partner
deb http://archive.ubuntu.com/ubuntu/ hardy main universe restricted multiverse
deb http://security.ubuntu.com/ubuntu/ hardy-security universe main multiverse restricted
deb http://archive.ubuntu.com/ubuntu/ hardy-updates universe main multiverse restricted
deb-src http://archive.ubuntu.com/ubuntu/ hardy main universe restricted multiverse
deb-src http://security.ubuntu.com/ubuntu/ hardy-security universe main multiverse restricted
deb-src http://archive.ubuntu.com/ubuntu/ hardy-updates universe main multiverse restricted
20.Update the current package listing:
# aptitude update
21. Install some packages that our nodes need for using NFS and DHCP:
# apt-get install dhcp3-common nfs-common nfsbooted openssh-server
22. Make it work with NFS. Edit /etc/fstab, the filesystem index, of the bootable filesystem to look like this:
# /etc/fstab
#
#<file system> <mount point> <type> <options> <dump> <pass>
proc /proc proc defaults 0 0
/dev/nfs / nfs defaults 0 0
23. Edit /etc/hosts. This is so that the DHCP server will know which of the PXE-booted nodes get which IP and hostname. Add all your cluster nodes and the server to it. In our example case, it looks like this:
# /etc/hosts #
127.0.0.1 localhost
10.11.13.1 kerrighedserver
10.11.13.101 kerrighednode1
10.11.13.102 kerrighednode2
24.Do the following to create a symbolic link which will automount the bootable filesystem as /dev/nfs on the server, when it starts up. This should not collide with other existing services in the directory (e.g. anything that looks like /etc/rcS.d/S34xxxxxxx), so check carefully before you create the link. If there's a service with a similar name, disable it before you do anything else.
$ ln -sf /etc/network/if-up.d/mountnfs /etc/rcS.d/S34mountnfs
25. Edit /etc/network/interfaces and add a line to stop Network Manager from managing the nodes' Ethernet cards, as this can cause issues with NFS. Ours looks like the following:
# ...
# The loopback interface:
auto lo
iface lo inet loopback
# The primary network interface, manually configured to protect NFS:
iface eth0 inet manual
26. Create a user for the bootable system. Replace <username> with whatever you want to call her. adduser will ask you for her real name and other details, so play along.
# adduser <username>
27. Ensure that the new user is in the /etc/sudoers file so she can run root commands on the cluster:
# /etc/sudoers #
#User privilege specification
root ALL=(ALL) ALL
<username> ALL=(ALL) ALL
28. Exit the root shell, and then exit from the chrooted bootable filesystem. You're done configuring the bootable FS and can now test it with your common-or-garden kernel.
# exit
# exit
29. Restart the servers on the head node, because you've been messing with them. You need to be root to do this:
# /etc/init.d/tftpd-hpa restart
# /etc/init.d/dhcp3-server restart
# /etc/init.d/nfs-kernel-server restart
Configure the BIOS of each node to have the following boot order: your primary boot device should be PXE, which will usually be described as "network boot" or "LAN boot". In certain cases you may need to enable the network cards as boot devices in the BIOS, reboot, and then set the boot priority. Remember also to disable "halt on all errors", since this can mess up your PXE booting.
Boot each of the nodes to see if it works. If so, you should be presented with a login prompt, where you can log-in using the username you defined earlier. When it all works, you're ready to try with a Kerrighed kernel.
Now that we've got a diskless boot system setup to use as a server, we only need to build the Kerrighed kernel for the nodes to use, put it in the bootable FS, and configure the Kerrighed settings properly in order to have a working SSI (Single System Image) cluster.
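Step 10 above mentions configuration files specific to a node's own IP address. Among the names PXELINUX tries in pxelinux.cfg/ before falling back to default is the client's IPv4 address written as eight uppercase hexadecimal digits. A small sketch for computing that file name follows; the function name pxe_hex_name is hypothetical and the address is the first node from the setup above.

```shell
#!/bin/sh
# pxe_hex_name IPV4: print the address as eight uppercase hex digits,
# the per-host file name PXELINUX looks for under pxelinux.cfg/.
# (Hypothetical helper name; expects a plain dotted-quad address.)
pxe_hex_name() {
    echo "$1" | {
        # Split the dotted quad into its four octets.
        IFS=. read -r a b c d
        printf '%02X%02X%02X%02X\n' "$a" "$b" "$c" "$d"
    }
}

pxe_hex_name 10.11.13.101   # prints 0A0B0D65
```

A file /var/lib/tftpboot/pxelinux.cfg/0A0B0D65 would then apply only to kerrighednode1, which is one way to give a single node a different kernel or command line while testing.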
jetsam
May 18th, 2010, 01:47 PM
No problem then. Nothing to it, really. Should have my cluster up and running in about 75 years.
lrilling
May 19th, 2010, 06:31 AM
Dear Louis,
I have already checked the DHCP server by connecting PCs that have an OS (Ubuntu) installed; those PCs get the static IPs that were set on the DHCP server, as they should.
But when booting over the network for a diskless boot, the problem happens.
If I understand correctly, a regular DHCP client running on top of Ubuntu with an Ubuntu kernel gets the IP address, but the Kerrighed kernel's internal client fails to get one. If you're not using Kerrighed 2.4.4, I recommend that you upgrade to this version, since it contains fixes for your NIC driver (r8169).
Thanks,
Louis
robinhoods
May 20th, 2010, 11:23 AM
Setting up the bootable filesystem
This isn't as simple as just copying the OS files into another directory - you'll need the debootstrap package to install the bootable filesystem, so install this first (you should still be root.) Once it's installed, use debootstrap to install a minimal Ubuntu Hardy system into the bootable filesystem directory:
# aptitude install debootstrap
# debootstrap --arch i386 hardy /nfsroot/kerrighed http://archive.ubuntu.com/ubuntu/
Change the current root of the file system to the bootable filesystem directory (stay chrooted until the guide tells you otherwise.) This is so that you can work with the bootable filesystem directly, as if it were a separate machine, while we make some adjustments to it.
# chroot /nfsroot/kerrighed
Set the root password for the bootable filesystem. You can use a program called apg, the automated password generator, to create a good one.
# passwd
Mount the /proc directory of the current machine as the bootable system's /proc directory, so you can use programs on the bootable FS:
# mount -t proc none /proc
Edit /etc/apt/sources.list. We want to add access to some extra Ubuntu repositories in order to be able to download the necessary packages for the FS. Add these lines:
deb http://archive.canonical.com/ubuntu hardy partner
deb http://archive.ubuntu.com/ubuntu/ hardy main universe restricted multiverse
deb http://security.ubuntu.com/ubuntu/ hardy-security universe main multiverse restricted
deb http://archive.ubuntu.com/ubuntu/ hardy-updates universe main multiverse restricted
deb-src http://archive.ubuntu.com/ubuntu/ hardy main universe restricted multiverse
deb-src http://security.ubuntu.com/ubuntu/ hardy-security universe main multiverse restricted
deb-src http://archive.ubuntu.com/ubuntu/ hardy-updates universe main multiverse restricted
Update the current package listing from the repositories you just added so you'll be able to install things from them:
# aptitude update
Install some packages that our nodes need for using NFS and DHCP:
$ apt-get install dhcp3-common nfs-common nfsbooted openssh-server
Now we need to make it work with NFS. Edit /etc/fstab, the filesystem index, of the bootable filesystem to look like this:
# /etc/fstab
#
# <file system> <mount point> <type> <options> <dump> <pass>
proc /proc proc defaults 0 0
/dev/nfs / nfs defaults 0 0
Edit /etc/hosts. This is so that the DHCP server will know which of the PXE-booted nodes get which IP and hostname. Add all your cluster nodes and the server to it. In our example case, it looks like this:
# /etc/hosts #
127.0.0.1 localhost
192.168.1.1 kerrighedserver
192.168.1.101 kerrighednode1
192.168.1.102 kerrighednode2
192.168.1.103 kerrighednode3
192.168.1.104 kerrighednode4
192.168.1.105 kerrighednode5
192.168.1.106 kerrighednode6
Do the following to create a symbolic link which will automount the bootable filesystem as /dev/nfs on the server, when it starts up. This should not collide with other existing services in the directory (e.g. anything that looks like /etc/rcS.d/S34xxxxxxx), so check carefully before you create the link. If there's a service with a similar name, disable it before you do anything else.
$ ln -sf /etc/network/if-up.d/mountnfs /etc/rcS.d/S34mountnfs
Edit /etc/network/interfaces and add a line to stop Network Manager from managing the nodes' Ethernet cards, as this can cause issues with NFS. Ours looks like the following:
# ...
# The loopback interface:
auto lo
iface lo inet loopback
# The primary network interface, manually configured to protect NFS:
iface eth0 inet manual
Create a user for the bootable system. Replace <username> with whatever you want to call her. adduser will ask you for her real name and other details, so play along.
# adduser <username>
Ensure that the new user is in the /etc/sudoers file so she can run root commands on the cluster:
# /etc/sudoers #
#User privilege specification
root ALL=(ALL) ALL
<username> ALL=(ALL) ALL
Exit the root shell, and then exit from the chrooted bootable filesystem. You're done configuring the bootable FS and can now test it with your common-or-garden kernel.
# exit
# exit
jetsam
May 20th, 2010, 11:55 AM
Looks like it's coming along. Wow.
pengin80
May 21st, 2010, 07:41 AM
Hello all,
I'm a new Ubuntu user (on Server 10.04 64bit), who's found himself in possession of 7 Dell Optiplex 755s. My goal is to evaluate a number of options to see how we can most effectively harness all their power.
On the short list are Condor, GridGain, Hadoop and, more recently, Kerrighed. Reading BigJimJams' excellent guide (https://wiki.ubuntu.com/EasyUbuntuClustering/UbuntuKerrighedClusterGuide) has opened my eyes to the possibility of network booting all the machines. For all the clustering options listed above, it seems so much easier than having to install and maintain each node. Thanks to Jedi453's post (http://ubuntuforums.org/showpost.php?p=6922286&postcount=86) I've just managed to fix the boot errors and get an Ubuntu 10.04 image to start (I've not moved on to the Kerrighed kernel yet).
I have a few noob questions which I hope you may be able to help with.
1) I can't find the package nfsbooted in lucid. As a result I carried on without it and everything seems OK so far; the lucid net boot seems to work. Someone pointed out to me that I can get the package from here (http://packages.debian.org/lenny/nfsbooted), but do I really need to? What does this package do?
2) All the booted nodes seem to take on the same hostname. They are able to ping each other by the proper hostname (presumably because of the /etc/hosts file), but it doesn't seem right that they all have the same name on the command line. How do other people address this?
3) Related to (2), is it really viable to have each node booting from the same image at /nfsroot/kerrighed? E.g. when one node runs a bit of software and creates a lock file, will this cause problems for all the other nodes?
Thanks for your help and thoughts.
joeinbend
June 9th, 2010, 11:20 AM
Hey Penguin,
I'm in a very similar situation with wanting to build a PXE boot cluster. Have you made any further progress on this? Hopefully we can jumpstart this conversation!
ductiletoaster
June 16th, 2010, 02:31 AM
OK, so I was going to use this guide (https://wiki.ubuntu.com/EasyUbuntuClustering/UbuntuKerrighedClusterGuide) to set up a cluster.
Available Computers:
AMD Athlon 64 3700 2.2 GHz (this one I was thinking about making a GUI workstation)
2 GB DDR 400 RAM
200 GB HDD
AMD Athlon 64 3200 2.0 GHz
512 MB DDR 400 RAM (might be 1 GB, don't remember)
160 GB HDD
Pentium 4 1.7 GHz
256 MB RAM
NO! HDD
Pentium 4 1.7 GHz
256 MB RAM
NO! HDD
Celeron 667 MHz
192 MB RAM (might be 128 MB)
15 GB HDD
My Questions:
1) Does it matter that some of the CPUs are 32-bit and others are 64-bit? Keep in mind it's a diskless-boot Kerrighed Ubuntu cluster...
2) And what would be the best way to partition the main server's hard drive(s)?
D3ATH-D3AL3R
June 17th, 2010, 04:34 PM
Hi guys...
I am trying to set up an Ubuntu cluster using Kerrighed in VirtualBox, but I am stuck at the NFS part. When I boot my machines they get stuck at the (initramfs) prompt after showing the following error:
Gave up waiting for root device. Common problems:
- Boot args (cat /proc/cmdline)
- Check rootdelay= (did the system wait long enough?)
- Check root= (did the system wait for the right device?)
- Missing modules (cat /proc/modules; ls /dev)
ALERT! /dev/nfs does not exist. Dropping to a shell!
nerdopolis
June 17th, 2010, 09:03 PM
/dev/null! Kerrighed 3.0.0 was released on Monday (6/14/2010)!
ductiletoaster: If you use a 32-bit image, it should not matter, as most 64-bit processors can run 32-bit code.
lrilling
June 18th, 2010, 04:35 AM
Hi guys...
I am trying to set up an Ubuntu cluster using Kerrighed in VirtualBox, but I am stuck at the NFS part. When I boot my machines they get stuck at the (initramfs) prompt after showing the following error:
Gave up waiting for root device. Common problems:
- Boot args (cat /proc/cmdline)
- Check rootdelay= (did the system wait long enough?)
- Check root= (did the system wait for the right device?)
- Missing modules (cat /proc/modules; ls /dev)
ALERT! /dev/nfs does not exist. Dropping to a shell!
Your initramfs was probably not configured for NFS root. With Debian's initramfs-tools, you can enable it in /etc/initramfs-tools/initramfs.conf.
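A minimal sketch of that change, assuming the stock initramfs-tools layout in which initramfs.conf carries a BOOT= line (BOOT=local by default). The function name set_boot_nfs is made up for illustration, and it is demonstrated on a scratch file rather than on the real configuration.

```shell
#!/bin/sh
# set_boot_nfs FILE: make FILE say BOOT=nfs (hypothetical helper name).
set_boot_nfs() {
    conf=$1
    if grep -q '^BOOT=' "$conf"; then
        # Rewrite the existing setting in place (GNU sed syntax).
        sed -i 's/^BOOT=.*/BOOT=nfs/' "$conf"
    else
        # No BOOT= line yet, so append one.
        echo 'BOOT=nfs' >> "$conf"
    fi
}

# Demonstration on a scratch file standing in for initramfs.conf.
scratch=$(mktemp)
printf 'MODULES=most\nBOOT=local\n' > "$scratch"
set_boot_nfs "$scratch"
cat "$scratch"
rm -f "$scratch"
```

On a real system this would be run as root against /etc/initramfs-tools/initramfs.conf in the node filesystem, followed by update-initramfs -u so that the initramfs is rebuilt with the new setting.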
Thanks,
Louis
lrilling
June 18th, 2010, 04:37 AM
ductiletoaster: If you use a 32-bit image, it should not matter, as most 64-bit processors can run 32-bit code.
Except that Kerrighed 2.4's support for 32 bits is buggy, and Kerrighed 3.0 does not even compile on 32 bits.
Thanks,
Louis
nerdopolis
June 18th, 2010, 10:57 PM
Is support for 32-bit processors completely dropped, or will it get 32-bit support back some time?
faisalmehmood
June 19th, 2010, 02:34 AM
I'm still running openMosix (linux-2.4.26-om1), but I'm planning to upgrade to Kerrighed. I tried out openSSI under Ubuntu 6.06 and 8.04, but had a lot of problems just trying to get it to run. There was some discussion about SSI recently on the beowulf list. ):P
lrilling
June 21st, 2010, 08:07 AM
Is support for 32-bit processors completely dropped, or will it get 32-bit support back some time?
It is not planned to get it back. Contributions are welcome though.
Thanks,
Louis
D3ATH-D3AL3R
June 22nd, 2010, 11:24 AM
Your initramfs was probably not configured for NFS root. With Debian's initramfs-tools, you can enable it in /etc/initramfs-tools/initramfs.conf.
Thanks,
Louis
I did what you said, but I still can't boot my nodes. BTW I am using Ubuntu 9.04. Will it work? (The guide is for Ubuntu 8.04.)
lrilling
June 23rd, 2010, 05:17 AM
I did what you said, but I still can't boot my nodes. BTW I am using Ubuntu 9.04. Will it work? (The guide is for Ubuntu 8.04.)
You may have to specify BOOT=nfs in kernel command line. Also make sure that your NIC driver is either built in the kernel, or is put in the initramfs. Again, this is how Debian's initramfs-tools works. I don't know if Ubuntu changes initramfs-tools.
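To get a NIC driver into the initramfs with initramfs-tools, the usual approach is to list the module name in /etc/initramfs-tools/modules and rebuild the image. A sketch follows, where add_initramfs_module is a hypothetical helper and r8169 only stands in for whatever driver your card actually uses (lspci -k or lsmod on a working system should tell you; a VirtualBox guest will most likely need a different module).

```shell
#!/bin/sh
# add_initramfs_module FILE MODULE: append MODULE to FILE unless it is
# already listed on a line of its own (hypothetical helper name).
add_initramfs_module() {
    modules_file=$1; module=$2
    # -x: whole-line match, -F: treat the name as a fixed string.
    if ! grep -qxF "$module" "$modules_file" 2>/dev/null; then
        echo "$module" >> "$modules_file"
    fi
}

# Demonstration on a scratch file standing in for
# /etc/initramfs-tools/modules.
scratch=$(mktemp)
add_initramfs_module "$scratch" r8169
add_initramfs_module "$scratch" r8169   # second call changes nothing
cat "$scratch"
rm -f "$scratch"
```

After editing the real file as root, update-initramfs -u rebuilds the initramfs so that the driver is available before the NFS root is mounted.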
Thanks,
Louis
D3ATH-D3AL3R
June 23rd, 2010, 02:36 PM
[SOLVED]
I read thread 155 (thanx to nerdopolis) and everything went fine.
Thnx again:)
a.kazemi
June 25th, 2010, 05:07 AM
Hi all,
I want to run a sample MPI C or C++ program, for example hello world, on two computers running Ubuntu, Eclipse and MPI. I can run MPI on each of these two PCs and define virtual nodes with this command:
-np "number of nodes" ${workspace_loc:"project name"}/Debug/"project name"
in the program arguments. It works fine :lolflag: and now I want to run distributed computing on two PCs connected to each other through an Ethernet P2P network. Can anybody help me configure Eclipse and the network connection to do this?
Maybe I have to mention that each of these PCs has just one core!
D3ATH-D3AL3R
June 27th, 2010, 12:54 PM
Hi guys....
I am stuck in part 2 of the guide. I have compiled the kernel 3 times, but every time I can't find the /etc/kerrighed_nodes file. I have checked these files:
/boot/vmlinuz-2.6.20-krg (Kerrighed kernel)
/boot/System.map (Kernel symbol table)
/lib/modules/2.6.20-krg (Kerrighed kernel module)
/etc/init.d/kerrighed (Kerrighed service script)
/etc/default/kerrighed (Kerrighed service configuration file)
/usr/local/share/man/* (Look inside these subdirectories for Kerrighed man pages)
/usr/local/bin/krgadm (The cluster administration tool)
/usr/local/bin/krgcapset (Tool for setting capabilities of processes on the cluster)
/usr/local/bin/krgcr-run (Tool for checkpointing processes)
/usr/local/bin/migrate (Tool for migrating processes)
/usr/local/lib/libkerrighed-* (Libraries needed by Kerrighed)
/usr/local/include/kerrighed (Headers for Kerrighed libraries)
They are there, but I don't know where I'm going wrong. Moreover, I don't have the grub folder in /boot. BTW I'm using Ubuntu 8.04.4 and I have gcc 3.3 and gcc 4.2, g++ 3.3 and g++ 4.2. My nodes are booting fine with the new kernel, but without the /etc/kerrighed_nodes file it is useless.
Thanks in advance....
lrilling
June 28th, 2010, 06:09 AM
And my nodes are booting fine with the new kernel. But without the /etc/kerrighed_nodes file it is useless.
You are supposed to write this file yourself if you want to use it.
$ man 5 kerrighed_nodes
details the contents of this file.
However, you should note that the use of /etc/kerrighed_nodes has been deprecated for a long time. The related documentation has even been removed in Kerrighed 3.0.0.
Thanks,
Louis
D3ATH-D3AL3R
July 2nd, 2010, 04:24 PM
Hi guys... I have successfully set up Kerrighed 2.4.1 on Ubuntu 8.04. Now I'm trying to set up a cluster with Kerrighed 3.0 and Ubuntu 9.04. I keep getting this error during kernel compilation.
CC arch/x86/vdso/vdso32-setup.o
/usr/src/kerrighed-3.0.0/_kernel/arch/x86/vdso/vdso32-setup.c:In function 'import_vdso_context':
/usr/src/kerrighed-3.0.0/_kernel/arch/x86/vdso/vdso32-setup.c:321: error: 'compat' undeclared (first use in this function)
/usr/src/kerrighed-3.0.0/_kernel/arch/x86/vdso/vdso32-setup.c:321: error: (Each undeclared identifier is reported only once
/usr/src/kerrighed-3.0.0/_kernel/arch/x86/vdso/vdso32-setup.c:321: error: for each function it appears in.)
/usr/src/kerrighed-3.0.0/_kernel/arch/x86/vdso/vdso32-setup.c:322: error: 'vdso_pages' undeclared (first use in this function)
/usr/src/kerrighed-3.0.0/_kernel/arch/x86/vdso/vdso32-setup.c:324: warning: comparison between pointer and integer
/usr/src/kerrighed-3.0.0/_kernel/arch/x86/vdso/vdso32-setup.c:325: error: 'vdso_size' undeclared (first use in this function)
make[3]: *** [arch/x86/vdso/vdso32-setup.o] Error 1
make[2]: *** [arch/x86/vdso] Error 2
make[1]: *** [sub-make] Error 2
make: *** [all] Error 2
This error occurs during the make command.
lrilling
July 5th, 2010, 08:51 AM
You must disable IA32 emulation (see http://www.kerrighed.org/wiki/index.php/FAQ#Which_kernel_options_should_I_enable.2Fdisable_.3F).
I could swear I made it impossible to build a Kerrighed kernel with this option enabled. How did you configure the kernel?
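A quick way to check a kernel configuration before building, sketched under the assumption that the option in question is CONFIG_IA32_EMULATION, as it is named in a mainline x86_64 .config. The helper name ia32_emulation_enabled is made up, and the scratch file only imitates a .config.

```shell
#!/bin/sh
# ia32_emulation_enabled CONFIG_FILE: succeed if the given kernel config
# has IA32 emulation switched on (hypothetical helper name).
ia32_emulation_enabled() {
    grep -q '^CONFIG_IA32_EMULATION=y' "$1"
}

# Demonstration on a scratch file imitating a kernel .config.
cfg=$(mktemp)
printf 'CONFIG_X86_64=y\n# CONFIG_IA32_EMULATION is not set\n' > "$cfg"
if ia32_emulation_enabled "$cfg"; then
    echo "IA32 emulation is enabled: disable it before building"
else
    echo "IA32 emulation is disabled"
fi
rm -f "$cfg"
```

Pointed at the .config in the kernel build directory, this gives a yes/no answer without opening menuconfig.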
Thanks,
Louis
Hereticq2
July 6th, 2010, 04:05 PM
You must disable IA32 emulation (see http://www.kerrighed.org/wiki/index.php/FAQ#Which_kernel_options_should_I_enable.2Fdisable_.3F).
I could swear I made it impossible to build a Kerrighed kernel with this option enabled. How did you configure the kernel?
Thanks,
Louis
Hello. I have the same problem.
I built the kernel in automatic mode using the command for building Kerrighed from the manual: http://www.kerrighed.org/docs/releases/3.0/INSTALL
After I got this error I tried to compile the kernel manually.
But menuconfig doesn't have this option. I'm building the kernel with GCC 4.4. Maybe the problem is in GCC?
I'm using Ubuntu 10.04.
Other ideas? :) Thank you.
Erwanenharma
July 7th, 2010, 06:39 AM
Hi Everyone!
I'm trying to set up a kerrighed cluster with this howto:
https://wiki.ubuntu.com/EasyUbuntuClustering/UbuntuKerrighedClusterGuide
I have this problem:
root@kerrighedserver:/usr/src/kerrighed-2.4.1/kernel# make kernel-install
make -C /usr/src/kerrighed-2.4.1/_kernel O=/usr/src/kerrighed-2.4.1/kernel kernel-install
make[2]: *** No rule to make target `kernel-install'. Stop.
make[1]: *** [kernel-install] Error 2
make: *** [kernel-install] Error 2
if anyone has answer please let me know... I'm stuck here.
Best regards.
lrilling
July 7th, 2010, 06:40 AM
Hello. I have the same problem.
I built the kernel in automatic mode using the command for building Kerrighed from the manual: http://www.kerrighed.org/docs/releases/3.0/INSTALL
After I got this error I tried to compile the kernel manually.
But menuconfig doesn't have this option
Or are you trying to build for 32 bits? This is not supported and (my fault) not checked for in menuconfig.
Thanks,
Louis
Hereticq2
July 7th, 2010, 10:03 AM
Or are you trying to build for 32 bits? This is not supported and (my fault) not checked for in menuconfig.
Thanks,
Louis
OK, I understand.
Yes, I'm building for 32 bits.
D3ATH-D3AL3R
July 7th, 2010, 01:36 PM
OK, I get it.
I think I just have to compile Kerrighed on a 64-bit machine.
Thanks....:popcorn:
D3ATH-D3AL3R
July 7th, 2010, 01:42 PM
Hi Everyone!
I'm trying to set up a kerrighed cluster with this howto:
https://wiki.ubuntu.com/EasyUbuntuClustering/UbuntuKerrighedClusterGuide
I have this problem:
root@kerrighedserver:/usr/src/kerrighed-2.4.1/kernel# make kernel-install
make -C /usr/src/kerrighed-2.4.1/_kernel O=/usr/src/kerrighed-2.4.1/kernel kernel-install
make[2]: *** No rule to make target `kernel-install'. Stop.
make[1]: *** [kernel-install] Error 2
make: *** [kernel-install] Error 2
if anyone has answer please let me know... I'm stuck here.
Best regards.
Hi Erwanenharma (http://ubuntuforums.org/member.php?u=1108434),
I think i got the same error so i used kerrighed svn. It is listed in post #109 of this thread. After that everything went fine.
I hope it helps.:p
Hereticq2
July 9th, 2010, 03:18 AM
Hello, all.
I have a cluster with 3 nodes with Intel Xeon 2.40GHz CPUs.
In /proc/cpuinfo I see 12 CPUs, but if I run the Linpack benchmark, only 4 CPUs do any work.
Why? Or can all the CPUs only work together with MPI?
The legacy scheduler behaviour is running.
And this is how I configured the scheduler:
# krgcapset -d +CAN_MIGRATE
# krgcapset -k $$ -d +CAN_MIGRATE
I'm interested in running virtualisation (VirtualBox, VMware or anything else that works) on my cluster. Can I do it with Kerrighed? And can virtualisation use all the CPUs in the cluster?
Thanks
lrilling
July 9th, 2010, 07:43 AM
Hello, all.
I have a cluster with 3 nodes with Intel Xeon 2.40GHz CPUs.
In /proc/cpuinfo I see 12 CPUs, but if I run the Linpack benchmark, only 4 CPUs do any work.
Why? Or can all the CPUs only work together with MPI?
The legacy scheduler behaviour is running.
And this is how I configured the scheduler:
# krgcapset -d +CAN_MIGRATE
# krgcapset -k $$ -d +CAN_MIGRATE
Note that the two lines above are equivalent, and will only do what you want if you launch the benchmark from the same shell.
I'm interested in running virtualisation (VirtualBox, VMware, or anything else that works) on my cluster. Can I do that with Kerrighed? And can virtualisation use all the CPUs in the cluster?
No known virtualization technology (VMware, Xen, VirtualBox, Qemu, KVM, Linux vservers, Linux containers) can benefit from Kerrighed right now. The most promising one is currently Linux containers (which should supersede Linux vservers some day). Qemu, and maybe later KVM could benefit from Kerrighed's migration in some future, but not from resource aggregation (especially CPUs).
Thanks,
Louis
Hereticq2
July 9th, 2010, 08:47 AM
Note that the two lines above are equivalent, and will only do what you want if you launch the benchmark from the same shell.
Louis
Sorry, I don't understand. Where do I need to launch the benchmark?
I log in to the first Kerrighed node and launch the benchmark there.
D3ATH-D3AL3R
July 10th, 2010, 01:45 PM
Hi guys....
I have set up a Kerrighed cluster as a project for my college. Can anyone list some applications in which thread migration or process migration is shown in graphical or textual form?
ajt
July 11th, 2010, 11:35 AM
Hi guys....
I have set up a Kerrighed cluster as a project for my college. Can anyone list some applications in which thread migration or process migration is shown in graphical or textual form?
Hello, D3ATH-D3AL3R.
There was a graphical tool to show migration in openMosix:
http://www.openmosixview.com/
However this, like openMosix itself, is now unsupported...
I've been thinking about porting "mosmon" from openMosix to Kerrighed, but I've not done it yet. It might be nice if someone ported openMosixview :-)
Bye,
Tony.
ajt
July 11th, 2010, 11:48 AM
You may have to specify BOOT=nfs in kernel command line. Also make sure that your NIC driver is either built in the kernel, or is put in the initramfs. Again, this is how Debian's initramfs-tools works. I don't know if Ubuntu changes initramfs-tools.
Hello, Louis.
AFAIK Ubuntu works the same way as Debian, but I find it a lot easier to build the NIC driver into the kernel and PXE-boot without an initrd :-)
Bye,
Tony.
ajt
July 11th, 2010, 12:00 PM
I'm going to try booting nodes on our Kerrighed Beowulf using Perceus:
http://www.perceus.org
I mentioned Perceus some time ago on this thread - has anyone succeeded in setting up Perceus for node provisioning in a Kerrighed Beowulf cluster?
At present, I'm using a home-brew system based on UNFS3 with 'clusterNFS' extensions enabled. Interestingly, Oracle use similar 'context' dependent symbolic links in OCFS (Oracle Cluster File System):
http://www.dba-oracle.com/real_application_clusters_rac_grid/cdsl.html
Although OCFS was designed to support Oracle databases, they have recently added POSIX compatibility to OCFS so it can be used as a general purpose filesystem. Perceus does things quite differently...
Bye,
Tony.
daniele.fetoni
July 19th, 2010, 12:39 PM
Hi guys
I have already built a Kerrighed 2.4 cluster, but now I am struggling with Kerrighed 3.0.0.
I have built the kernel (several times) with no errors on a Debian lenny AMD64 distro.
The machines net-boot without problems, but after loading vmlinuz-2.6.30-krg3.0.0 they crash and reboot, then crash and reboot again and again...
I know that the DHCP, TFTP and NFS servers work fine, so I think the trouble must be in the kernel. Maybe I should tick some options in menuconfig, or something similar.
My question is simple. Could anyone who has managed to build a working Kerrighed cluster tell me how they built the kernel?
Please, I need to use Kerrighed 3.0.0 because of the node hotplug improvements.
Thanks a lot
Daniele
lrilling
July 20th, 2010, 09:48 AM
AFAIK Ubuntu works the same way as Debian, but I find it a lot easier to build the NIC driver into the kernel and PXE-boot without an initrd :-)
I agree with you.
However, who knows what essential things distros might put in their initramfs? So I don't try to convince people not to use an initramfs, because I could get far more questions for which I would have no answer ;)
Thanks,
Louis
lrilling
July 20th, 2010, 09:50 AM
Sorry, I don't understand. Where do I need to launch the benchmark?
I log in to the first Kerrighed node and launch the benchmark there.
You need to launch the benchmark from the very same shell in which you entered the krgcapset commands that you mentioned. You may already be doing it correctly, but since I have not seen any definite evidence of it, I am restating this rule.
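A minimal sketch of that rule, assuming the benchmark binary is called ./xhpl (a hypothetical name; substitute your own command). The krgcapset line is the one quoted earlier; the wrapper function and its name are only for illustration:

```shell
# Enable migration for this shell, then start the given command from it,
# so that the command is a child of the shell that holds the capability.
run_with_migration() {
    if ! command -v krgcapset >/dev/null 2>&1; then
        echo "krgcapset not found: is this a Kerrighed node?" >&2
        return 127
    fi
    krgcapset -d +CAN_MIGRATE || return 1
    "$@"
}

# Usage on a Kerrighed node (./xhpl is a placeholder benchmark):
#   run_with_migration ./xhpl
```

Starting the benchmark from a different terminal or login session would not pick up the setting made here.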
Thanks,
Louis
lrilling
July 20th, 2010, 10:03 AM
Hi guys....
I have set up a Kerrighed cluster as a project for my college. Can anyone list some applications in which thread migration or process migration is shown in graphical or textual form?
There is a project on this at Polytech'Tours (a French college). You could have a look at this video (in French, sorry): http://www.youtube.com/user/mistergom.
Alexandre Lissy is the guy leading this project. You can get his email from kerrighed.users mailing list or its partial mirroring in gmane http://news.gmane.org/gmane.linux.cluster.kerrighed.user
Thanks,
Louis
mrrstrat
July 22nd, 2010, 01:40 PM
OK: now I am chiming in.
I do have a 'cookbook' list of instructions I created for setting up an Ubuntu-based high performance computing cluster (based on a Beowulf design).
The topology is:
* One main computer
* several diskless/monitorless PCs that receive a Linux kernel over TFTP and boot up
* 100BT connected with a switch for fastest messaging
* a common landing zone (directory) for the cluster all PCs use
* Uses SSH to pass messages with OpenMPI
* Supports Open MPI C++/Fortran
How it works: one computer has all the dev tools and hosts the other computers, which netboot from the host machine when they are turned on. Then you simply build programs with the OpenMPI compiler and mpirun the jobs.
I am getting about 20x the power of a single-processor computer with a couple of AM2 and several 400FSB XP3000 machines. The satellite computers have their MAC addresses set up manually, and this made implementation MUCH easier.
I just upgraded to 10.04 LTS, and am going to set everything up again (it takes under an hour). I wrote the instructions so that I could set everything up again quickly without having to remember how I did it each time.
I WILL post this PDF this weekend after I set up the cluster again.
I had to create this set of instructions because I could not find instructions that made it easy to set up an HPC cluster. I based it on MANY different sources and perfected a simple way to do it in stages. The posts I have read here pushed me to figure it out: lots of fragmented advice, but NO EASY BRAINLESS way to do it.
And: I did not settle for anything less than a true HPC cluster that supports the latest OpenMPI downloads.
danmc
July 26th, 2010, 12:27 AM
OK: now I am chiming in.
[...]
And: I did not settle for anything less than a true HPC cluster that supports the latest OpenMPI downloads.
I would love a look at that. I'm looking into using a parallel cluster for a remote rendering project, and a lot of the guides I have been looking at are either outdated or just don't quite work.
singing accordionist
July 26th, 2010, 06:26 PM
Is the recipe download file still available somewhere?
mrrstrat
July 26th, 2010, 10:19 PM
Here is the version that sets up my cluster:
This is setup for a 32/64 bit master node and the kernels for each node can be 32/64 bit.:popcorn:
Be warned: you have to have a clue about Linux and basic computer network topology. In this guide (if you can call it that) I set up my M5 computing cluster. This is basically a collective document from many other sources, with my changes for how I set up my HPC.
For me, this is what I needed: just a list of things to do, files to edit, and things to look for to be sure it's working. I actually paste the blue text into the files referred to in the guide. I did not write this for mass consumption (meaning other people), but it may be a welcome help for others just trying to get an HPC going and not wanting an army of PCs booting from CDs (which works for some, but not me).
I just wanted something I could drop cheap motherboards into with the bare minimum of hardware (no disks, no video cards if possible, no CDs).
mrrstrat
July 26th, 2010, 11:40 PM
I just want to add in that while I 'hijacked' other recipes for setting up a Linux HPC, I had to have a solution that used a 'standard' installation, as I use my computer for many different things as well as a HPC.
* Again: this guide was never supposed to get posted, but I have had a few people ask about it. So if you can overlook the 'edgy' writing style, you can use it now *
I am using the standard Linux kernels with the small change needed to boot from NFS: I list how I got it to work in the text. I did not want a bunch of custom kernels to download and maintain that might clash with my host system. After all, I wanted to program for an HPC and not spend time debugging why it's not working.
I had to have something I could replicate with whatever version of Ubuntu I had. And this 'guide' assumes Ubuntu. I also am assuming that a user wants a 'main node' that writes software to be launched on the connected nodes using MPI (with ssh).
The NFS directory created in the guide was a 32-bit installation that my 64-bit host would remote compile code for. This was a pain, so now I have everything 32-bit: so I compile on the host and launch from the host.
If you want to set up a mixed system with 32-bit and 64-bit kernels, I can tell you that I have found no advantage in it. The pain of keeping 32-bit extensions in a 64-bit environment, and the lack of support in 64-bit for some 32-bit activities, made the benefit of 32/64 operation non-existent.
I was able to easily keep up with the constant stream of kernel updates with the method in the guide I made, and it was STABLE and I never had to work on it.
I would start the cluster manually (three services needed to be started: DHCP, FTP, SSH). I had some simple checks to make sure things were working, including a simple MPI program that would run on all of the nodes. I never had problems and basically followed the guide I wrote.
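A check of that kind can be as small as running `hostname` on every node through mpirun. This is a sketch, not the program from the guide: the hostfile name nodes.txt and the function names are assumptions, and it relies on Open MPI's `--hostfile` and `-np` options.

```shell
# Count usable entries in a hostfile (one node per line; blank lines and
# lines starting with '#' are ignored).
count_hosts() {
    grep -cEv '^[[:space:]]*(#|$)' "$1"
}

# Start as many copies of `hostname` as there are nodes in the hostfile.
mpi_smoke_test() {
    hostfile=$1
    n=$(count_hosts "$hostfile")
    mpirun --hostfile "$hostfile" -np "$n" hostname
}

# Usage (nodes.txt is a placeholder file listing the node names):
#   mpi_smoke_test nodes.txt
```

If the output lists the nodes you expect, SSH logins and the MPI installation are at least reachable from the main node.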
The guide uses DHCP 3x, and used JAUNTY as the NFS OS the nodes used. Both of these should be able to be changed as needed (as I will verify this weekend to support my 32-bit Ubuntu 10.04 LTS main node).
ductiletoaster
July 27th, 2010, 03:36 AM
except that Kerrighed 2.4's support for 32 bits is buggy, and Kerrighed 3.0 does not even compile on 32 bits.
OK, so even though it can't be compiled on a 32-bit system, can I still use a 32-bit diskless node? My main node would be 64-bit.
lrilling
July 27th, 2010, 05:57 AM
OK, so even though it can't be compiled on a 32-bit system, can I still use a 32-bit diskless node? My main node would be 64-bit.
No. Kerrighed kernels only run on 64-bit hardware. If your "main" node is the NFS server and is not part of the SSI cluster, then yes, you should be able to use a 32-bit machine for it.
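To see which machines qualify, one option is to look for the `lm` (long mode) flag in /proc/cpuinfo, which x86 CPUs able to run a 64-bit kernel report. A small sketch; the function name is made up for illustration:

```shell
# Reads cpuinfo-style text on stdin and succeeds if a "flags" line
# lists "lm" as a separate word (so "lahf_lm" alone does not count).
is_64bit_capable() {
    grep -E '^flags[[:space:]]*:' | grep -qw 'lm'
}

# Usage on the machine being checked:
#   is_64bit_capable < /proc/cpuinfo && echo "64-bit capable"
```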
Thanks,
Louis
ductiletoaster
July 28th, 2010, 02:58 AM
OK, so to clarify... Kerrighed won't run on 32-bit (GOT that), but you said I can still use the 32-bit hardware if the main node is not part of an SSI cluster?
lrilling
July 28th, 2010, 07:38 AM
OK, so to clarify... Kerrighed won't run on 32-bit (GOT that), but you said I can still use the 32-bit hardware if the main node is not part of an SSI cluster?
You can use 32 bit hardware, but only for the main node, and only if it is not part of the SSI cluster.
Hope it's clearer...
Thanks,
Louis
ductiletoaster
July 29th, 2010, 03:49 PM
Yes, thank you, that helped. Are there any particular guides geared toward clusters running 32-bit hardware? My issue is that I have several computers (listed below):
AMD Athlon 64 3700 2.2 ghz (64bit *was planning on using as the "main node")
2gb DDR 400 RAM
200 gb HDD
AMD Athlon 64 3200 2.0 ghz (64bit)
512mb DDR 400 RAM (might be 1gb don't remb)
160gb HDD
Pentium 4 1.7 ghz (32bit)
256mb RAM
NO! HDD
Pentium 4 1.7 ghz (32bit)
256mb RAM
NO! HDD
Celeron 667 mhz (32bit)
192mb (might be 128mb)
15gb HDD
As you can see, three of the machines are 32-bit and two are 64-bit. I'm trying to figure out how I could run them together as a cluster. Also keep in mind that I wanted to set up the system so that all but the first machine would be diskless (I have to make sure this is possible with all the machines). However, I am open to any suggestions.
danmc
August 1st, 2010, 11:25 AM
Here is the version that sets up my cluster:
This is setup for a 32/64 bit master node and the kernels for each node can be 32/64 bit.:popcorn:
Be warned: you have to have a clue about Linux and basic computer network topology. In this guide (if you can call it that) I set up my M5 computing cluster. This is basically a collective document from many other sources, with my changes for how I set up my HPC.
For me, this is what I needed: just a list of things to do, files to edit, and things to look for to be sure it's working. I actually paste the blue text into the files referred to in the guide. I did not write this for mass consumption (meaning other people), but it may be a welcome help for others just trying to get an HPC going and not wanting an army of PCs booting from CDs (which works for some, but not me).
I just wanted something I could drop cheap motherboards into with the bare minimum of hardware (no disks, no video cards if possible, no CDs).
Thanks for the guide. It works great in 8.04 with a couple of slight modifications, but for the life of me I can't get it to work with 9.04, which unfortunately is the version I need because Blender dislikes 8.04.:(
ductiletoaster
August 2nd, 2010, 03:42 PM
I had blender running fine on 8.04LTS. You could also try 10.04.
I personally have found that .10 releases are usually unreliable! If you're doing development work and need reliability, use .04...
danmc
August 3rd, 2010, 11:54 PM
I had blender running fine on 8.04LTS. You could also try 10.04.
I personally have found that .10 releases are usually unreliable! If you're doing development work and need reliability, use .04...
The latest version won't: it has dependencies that aren't in 8.04's repos, so it's a catch-22 situation.
I can get 8.04 PXE booting no problem, but 9.04 just hangs and drops to an "ALERT! /dev/nfs does not exist" error, even though I have used the exact same techniques:(?.
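As far as I know, that ALERT is printed by the initramfs when it is still set up to boot from a local disk, so it may not be a kernel problem at all; treat that as an assumption about this setup rather than a diagnosis. With Debian-style initramfs-tools the boot method is the BOOT= line in etc/initramfs-tools/initramfs.conf inside the node's root filesystem. A sketch that switches it to nfs in a given copy of that file (the function name is illustrative, and `sed -i` assumes GNU sed):

```shell
# Set BOOT=nfs in an initramfs.conf-style file, replacing an existing
# BOOT= line or appending one if the file has none.
set_boot_nfs() {
    conf=$1
    if grep -q '^BOOT=' "$conf"; then
        sed -i 's/^BOOT=.*/BOOT=nfs/' "$conf"
    else
        echo 'BOOT=nfs' >> "$conf"
    fi
}

# Usage (the path is a placeholder for your node's root filesystem):
#   set_boot_nfs /path/to/node-root/etc/initramfs-tools/initramfs.conf
# The initrd then has to be rebuilt, for example with
#   mkinitramfs -o <output file> <kernel version>
```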
gabrielaca
August 6th, 2010, 09:19 AM
The latest version won't: it has dependencies that aren't in 8.04's repos, so it's a catch-22 situation.
I can get 8.04 PXE booting no problem, but 9.04 just hangs and drops to an "ALERT! /dev/nfs does not exist" error, even though I have used the exact same techniques:(?.
Hello danmc, by latest version do you mean Blender 2.5 (betas)? I haven't got my cluster yet, but I want to run some 3D Blender and Lux on it, so as of now only Blender 2.49 can run on the HPC.
nerdopolis
August 6th, 2010, 06:10 PM
Kerrighed seems to have virtual images on their site. http://www.kerrighed.org/forum/viewtopic.php?p=615#615
phydiux
August 7th, 2010, 04:06 PM
Or are you trying to build for 32 bits? This is not supported and (my fault) not checked for in menuconfig.
Thanks,
Louis
Can anyone chime in and tell me what the last version that worked with 32-bit processors was? I'm trying to install Kerrighed 2.4.4, but am getting:
"make[2]: *** No rule to make target `kernel-install'. Stop."
when issuing "make kernel-install" - I don't have the option on the computers that I want to cluster to run 64-bit Linux, so I'm kind of stuck. I also tried with Kerrighed version 2.4.1 and get the same error.
Thanks in advance...
phydiux
August 8th, 2010, 08:05 AM
Can anyone chime in and tell me what the last version that worked with 32-bit processors was?
An update: I've been able to compile 2.4.4 by not making any changes under the /usr/src/kerrighed-2.4.4/kernel/ directory and doing a make and then make install from the kerrighed-2.4.4 directory (basically, skipping the following from the guide)
# cd kernel
# make defconfig
# make menuconfig
# make kernel
# make kernel-install
and instead, from /usr/src/kerrighed-2.4.4/, simply doing this:
# make
# make install
My kerrighed compiled properly and I was able to follow the rest of the guide through to completion.
I now have an 18 node cluster running - here's my top:
top - 09:31:57 up 29 min, 1 user, load average: 1.07, 4.41, 6.03
Tasks: 319 total, 1 running, 318 sleeping, 0 stopped, 0 zombie
Cpu352: 0.7%us, 0.7%sy, 0.0%ni, 94.8%id, 0.0%wa, 0.2%hi, 3.5%si, 0.0%st
Cpu384: 0.0%us, 0.0%sy, 0.0%ni, 99.0%id, 0.0%wa, 0.0%hi, 1.0%si, 0.0%st
Cpu416: 0.0%us, 0.0%sy, 0.0%ni, 99.0%id, 0.0%wa, 0.0%hi, 1.0%si, 0.0%st
Cpu448: 0.0%us, 0.0%sy, 0.0%ni, 98.2%id, 0.0%wa, 0.2%hi, 1.5%si, 0.0%st
Cpu480: 0.0%us, 0.2%sy, 0.0%ni, 98.0%id, 0.0%wa, 0.2%hi, 1.5%si, 0.0%st
Cpu512: 0.0%us, 0.0%sy, 0.0%ni, 99.0%id, 0.0%wa, 0.3%hi, 0.7%si, 0.0%st
Cpu544: 0.0%us, 0.3%sy, 0.0%ni, 98.7%id, 0.0%wa, 0.3%hi, 0.7%si, 0.0%st
Cpu576: 0.0%us, 0.0%sy, 0.0%ni, 98.7%id, 0.0%wa, 0.0%hi, 1.3%si, 0.0%st
Cpu608: 0.0%us, 0.2%sy, 0.0%ni, 98.3%id, 0.0%wa, 0.2%hi, 1.2%si, 0.0%st
Cpu640: 0.0%us, 0.2%sy, 0.0%ni, 98.5%id, 0.0%wa, 0.2%hi, 1.0%si, 0.0%st
Cpu672: 0.0%us, 0.0%sy, 0.0%ni, 99.0%id, 0.0%wa, 0.3%hi, 0.7%si, 0.0%st
Cpu704: 0.0%us, 0.2%sy, 0.0%ni, 98.5%id, 0.0%wa, 0.0%hi, 1.2%si, 0.0%st
Cpu736: 0.0%us, 0.2%sy, 0.0%ni, 98.8%id, 0.0%wa, 0.2%hi, 0.7%si, 0.0%st
Cpu768: 0.0%us, 0.0%sy, 0.0%ni, 99.0%id, 0.0%wa, 0.0%hi, 1.0%si, 0.0%st
Cpu800: 0.0%us, 0.0%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.7%si, 0.0%st
Cpu832: 0.0%us, 0.2%sy, 0.0%ni, 98.2%id, 0.0%wa, 0.2%hi, 1.2%si, 0.0%st
Cpu864: 0.0%us, 0.0%sy, 0.0%ni, 98.5%id, 0.0%wa, 0.0%hi, 1.5%si, 0.0%st
Cpu896: 0.0%us, 0.0%sy, 0.0%ni, 98.7%id, 0.0%wa, 0.0%hi, 1.3%si, 0.0%st
Mem: 11569408k total, 503476k used, 11065932k free, 0k buffers
Swap: 0k total, 0k used, 0k free, 232412k cached
18 processors, with 11.5 Gigs of RAM - I'm happy with that. I have room for two more machines in my rack, so I will probably bump the cluster up to 20 computers. It's now time to do some testing :) Thanks for the wonderful amount of information in this thread - it's been a great help on getting my cluster up and running.
ajt
August 16th, 2010, 07:58 AM
I decided to replace our UNFS3+ClusterNFS stateless node provisioning with Perceus. We've got Perceus 1.6 installed from the deb:
http://altruistic.infiscale.org/deb/perceus16.deb
And we can boot nodes from the "gravityos-base" VNFS:
http://altruistic.infiscale.org/~ian/gravityos-base.vnfs
So far, I've only been able to find a VNFS generation script for 32-bit Ubuntu 8.04:
http://www.cs.indiana.edu/~adkulkar/perceus-ubuntu/
Does anyone know of a script to create a 64-bit Ubuntu 10.04 VNFS?
Thanks,
Tony.
ajt
August 16th, 2010, 08:01 AM
An update: I've been able to compile 2.4.4 by not making any changes under the /usr/src/kerrighed-2.4.4/kernel/ directory and doing a make and then make install from the kerrighed-2.4.4 directory (basically, skipping the following from the guide)
[...]
18 processors, with 11.5 Gigs of RAM - I'm happy with that. I have room for two more machines in my rack, so I will probably bump the cluster up to 20 computers. It's now time to do some testing :) Thanks for the wonderful amount of information in this thread - it's been a great help on getting my cluster up and running.
Hello, phydiux.
We got this far, but found the cluster very fragile and have gone back to 2.3.0 which is more stable in our hands. Have you done much testing?
Bye,
Tony.
ductiletoaster
August 25th, 2010, 07:43 PM
To phydiux:
So are you saying that, using your method, you have 32-bit-only CPUs in your system? Or do all of your computers have 64-bit processors?
As I mentioned in a previous post, I have 5 machines, only two of which are 64-bit. The 32-bit machines are older, but they would add about 4.067GHz of processing and 704MB of RAM.
phydiux
August 27th, 2010, 12:05 PM
We got this far, but found the cluster very fragile and have gone back to 2.3.0 which is more stable in our hands. Have you done much testing?
Actually, I haven't had a chance to play with the cluster in the last week, but it did seem to be fragile when I was working on it. I was running distributed.net to test the cluster, and it seemed like processes would show that they were working but hang on the cluster in no specific manner or for no specific reason. This happened regardless of whether or not I was using memory sharing on the cluster. If I issued a "dnetc -shutdown", the processes would not shut down, and I would have to shutdown the cluster to get the processes to actually shut down. I thought there was some quirk with dnetc, so I assumed that I would look for another piece of software to test with - I just had not had the time to do that yet. So, I can say that the stability was questionable. Now that you mention that, I think I will compile kerrighed 2.3.0 and see if that works better for me. I hope it does.
So are you saying that, using your method, you have 32-bit-only CPUs in your system? Or do all of your computers have 64-bit processors?
All of the machines in my cluster are 32-bit machines. They are not capable of running a 64-bit OS - they're all Pentium III class processors.
As I mentioned in a previous post, I have 5 machines, only two of which are 64-bit. The 32-bit machines are older, but they would add about 4.067GHz of processing and 704MB of RAM.
Well, you could compile Kerrighed as a 32-bit cluster (your 64-bit machines don't *have* to run 64-bit - they are still capable of running 32-bit) - the downside to this is that you can't run Kerrighed 3.x, you need to run the 2.x version of Kerrighed.
Does anyone know why they dropped 32-bit support in 3.x?
I'm at a crossroads right now - it's hard to stick with the 32-bit Kerrighed since they're no longer targeting/allowing 32-bit systems in the new versions, but at the same time I have 18Ghz worth of 32-bit processors along with over 11 gigs of RAM available if I stay 32-bit, and that whole setup is just sitting there doing nothing. If I use more current hardware, I'm probably in a better spot long-term, but at more expense, and I would have to purchase that stuff, since I don't have it sitting around.
lrilling
August 31st, 2010, 06:18 AM
Does anyone know why they dropped 32-bit support in 3.x?
Hello phydiux,
To be fair 32 bit support was already dropped with Kerrighed 2.4. The difference in Kerrighed 3.0 is that it does not build anymore for 32 bit.
The big reason for dropping 32 bit is the restricted manpower that we have. We prefer to focus on stability/features/performance for today's and tomorrow's machine, rather than investing time in keeping yesterday's machine alive.
Another reason is that no customer seems interested in paying for 32 bit support. In other words, nobody seems willing to fund developers for 32 bit.
Thanks,
Louis
amiller2k10
September 12th, 2010, 08:01 PM
An update: I've been able to compile 2.4.4 by not making any changes under the /usr/src/kerrighed-2.4.4/kernel/ directory and doing a make and then make install from the kerrighed-2.4.4 directory (basically, skipping the following from the guide)
[...]
18 processors, with 11.5 Gigs of RAM - I'm happy with that. I have room for two more machines in my rack, so I will probably bump the cluster up to 20 computers. It's now time to do some testing :) Thanks for the wonderful amount of information in this thread - it's been a great help on getting my cluster up and running.
I managed to get the kernel compiled and installed on a local hard drive. It boots up, Kerrighed starts, etc. Now I am trying to set it up over NFS. I can boot from NFS using a non-KRG kernel. However, when I use the KRG kernel, it will not boot. It downloads the kernel and even gets as far as showing me Kerrighed information on the boot screen, but then it says the NFS server is not responding. I assume this is when the kernel tries to mount the root file system. Note that I have the root file system option built into the kernel (not as a module), as well as NFS file system and NFS server support. I also found another NFS tutorial that said to turn on BOOTP and RARP under networking options, but that didn't help. It seems like I've tried the things mentioned in this thread. Any ideas??
I'd ultimately like to get this running on Ubuntu 10.04, but I installed 8.04 just to try to rule out some upgrade issues. I am trying the 2.4.1 version of Kerrighed. Any ideas??
Thanks.
ajt
September 13th, 2010, 04:34 AM
[...]
It seems like I've tried the things mentioned on this thread, any ideas??
I'd ultimately like to get this running on Ubuntu 10.04, but I installed 8.04 just to try to rule out some upgrade issues. I am trying the 2.4.1 version of Kerrighed. Any ideas??
Thanks.
Hello, amiller2k10.
It's possible that you've not compiled the Kerrighed kernel with support for your network card built-in, if the NFSROOT filesystem is inaccessible.
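One way to check this is to look in the kernel's .config for options set to `=y` (built in) rather than `=m` (module) or unset. A sketch, assuming the usual option names for an NFS root (CONFIG_NFS_FS, CONFIG_ROOT_NFS, CONFIG_IP_PNP, CONFIG_IP_PNP_DHCP); CONFIG_YOUR_NIC_DRIVER is a placeholder for whatever driver your card needs, and the function name is only for illustration:

```shell
# Usage: check_builtin /path/to/.config CONFIG_FOO CONFIG_BAR ...
# Prints one line per option and fails if any is not set to "=y".
check_builtin() {
    config_file=$1
    shift
    status=0
    for opt in "$@"; do
        if grep -q "^${opt}=y\$" "$config_file"; then
            echo "ok       ${opt}"
        else
            echo "MISSING  ${opt} (not built in)"
            status=1
        fi
    done
    return "$status"
}

# Example call (replace the path and the driver option with your own):
#   check_builtin /path/to/.config CONFIG_NFS_FS CONFIG_ROOT_NFS \
#       CONFIG_IP_PNP CONFIG_IP_PNP_DHCP CONFIG_YOUR_NIC_DRIVER
```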
HTH,
Tony.
amiller2k10
September 13th, 2010, 08:50 PM
Hello, amiller2k10.
It's possible that you've not compiled the Kerrighed kernel with support for your network card built-in, if the NFSROOT filesystem is inaccessible.
HTH,
Tony.
Hey Tony,
Good thought but unless I am looking at this the wrong way, it appears to be built-in. I am using an NVIDIA network card. From what I understand, they use the forcedeth driver. In my .config, I have CONFIG_FORCEDETH=y (asterisk in menuconfig).
Unless I am missing something, shouldn't that cover it?
Thanks,
Tony
amiller2k10
September 14th, 2010, 07:19 AM
Well it isn't the kernel. I finally got it booting. It appears there is something going on with the NFS server. Time to troubleshoot it!!
anshumax
October 6th, 2010, 02:01 PM
Hello everyone,
I'm a newbie to cluster systems and just wanted to build a basic cluster from my PC and the laptops I have at home. For the time being, my aim is to set up a simple cluster with 1 head node (i.e. my laptop) and 1 node (i.e. my PC). I've followed the steps in the Easy Ubuntu Clustering guide. I had trouble at many steps, but I managed to troubleshoot them by reading the posts in this thread. I first managed to boot into a basic Ubuntu hardy system as described in the guide, after changing the initrd.img to use BOOT=nfs. The minimal hardy I installed using debootstrap works fine, and I'm able to view the NFS filesystem as well. It prompts for a login at startup; I log in and I'm at the console of the node. It works, because when I do something like 'cd /usr/var/' and then 'ls' it shows me the kerrighed and linux tarballs and the extracted folders.
But when I build a kerrighed kernel and boot into it, the same setup gives me an error:
Looking up port of RPC 100005/1 on 192.168.1.1
Portmap: server 191.168.1.1 not responding, timedout
Root-NFS: unable to get mountd port number from server, using default mount: Server 192.168.1.1 not responding, timed out
Root-NFS: Server returned error-5 while mounting /nfsroot/kerrighed
VFS: unable to mount root fs via NFS, trying floppy.
VFS: Insert root floppy and press ENTER
My headnode ie. kerrighedserver is 192.168.1.1 and kerrighednode1 is 192.168.1.101
Now the funny thing is that when I modify the /var/lib/tftpboot/pxelinux.cfg/default file and change it to 'KERNEL vmlinuz-2.6.32-25-generic' and 'initrd=initrd.img-2.6.32-25-generic', the system boots into the minimal hardy system (Yes, I know I'm using the 2.6.32-25 kernel for minimal hardy installation because I have ubuntu 10.04 installed on my laptop but I've installed hardy using debootstrap to /nfsroot/kerrighed)
The kerrighed kernel built easily upon following the steps in the guide.
Why does NFS work with the linux kernel(ie. vmlinuz-2.6.32-25-generic and initrd.img-2.6.32-25-generic) but not with the Kerrighed kernel(vmlinuz-2.6.20-krg)?
anshumax
October 6th, 2010, 03:10 PM
OK here's the contents of all the files mentioned in the guide.
/etc/default/dhcp3-server :
# Defaults for dhcp initscript
# sourced by /etc/init.d/dhcp
INTERFACES="eth0";
/etc/dhcp3/dhcpd.conf :
# /etc/dhcp3/dhcpd.conf #
# General options
option dhcp-max-message-size 2048;
use-host-decl-names on;
deny unknown-clients; # This will stop any non-node machines from appearing on the cluster network.
deny bootp;
# DNS settings
option domain-name "kerrighed"; # Just an example name - call it whatever you want.
option domain-name-servers 192.168.1.1; # The server's IP address, manually configured earlier.
# Information about the network setup
subnet 192.168.1.0 netmask 255.255.255.0 {
option routers 192.168.1.1; # Server IP as above.
option broadcast-address 192.168.1.255; # Broadcast address for your network.
}
# Declaring IP addresses for nodes and PXE info
group {
filename "pxelinux.0"; # PXE bootloader. Path is relative to /var/lib/tftpboot
option root-path "192.168.1.1:/nfsroot/kerrighed"; # Location of the bootable filesystem on NFS server
host kerrighednode1 {
fixed-address 192.168.1.101; # IP address for the first node, kerrighednode1 for example.
hardware ethernet 00:1C:C0:02:FB:74; # MAC address of the node's ethernet adapter
}
server-name "kerrighedserver"; # Name of the server. Call it whatever you like.
next-server 192.168.1.1; # Server IP, as above.
}
/etc/default/tftpd-hpa :
# /etc/default/tftpd-hpa
TFTP_USERNAME="tftp"
TFTP_DIRECTORY="/var/lib/tftpboot"
TFTP_ADDRESS="0.0.0.0:69"
TFTP_OPTIONS="--secure"
RUN_DAEMON="YES"
OPTIONS="-l -s /var/lib/tftpboot"
/var/lib/tftpboot/pxelinux.cfg/default :
LABEL linux
KERNEL vmlinuz-2.6.20-krg
APPEND root=/dev/nfs noinitrd nfsroot=192.168.1.1:/nfsroot/kerrighed ip=dhcp rw boot=nfs
/etc/exports :
/nfsroot *(rw,async,no_root_squash,no_subtree_check)
/nfsroot/kerrighed *(rw,async,no_root_squash,no_subtree_check)
/nfsroot/kerrighed/etc/fstab :
# <file system> <mount point> <type> <options> <dump> <pass>
proc /proc proc defaults 0 0
/dev/nfs / nfs defaults 0 0
configfs /config configfs defaults 0 0
/nfsroot/kerrighed/etc/hosts :
127.0.0.1 localhost
192.168.1.1 kerrighedserver
192.168.1.101 kerrighednode1
/nfsroot/kerrighed/etc/network/interfaces :
# The loopback interface:
auto lo
iface lo inet loopback
# The primary network interface, manually configured to protect NFS:
iface eth0 inet manual
I'll be happy to post the contents of any other file that might help in solving this problem.
Cheers and thanks!
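One quick sanity check on listings like the ones above is whether the nfsroot= argument in pxelinux.cfg/default names a path that /etc/exports really exports. What follows is only a minimal Python sketch of that check and is not part of the guide; the helper names parse_nfsroot and exported_paths are made up for illustration, and the two strings are copied from the files posted above.

```python
def parse_nfsroot(append_line):
    """Return (server, path) from the nfsroot= argument, or None if absent."""
    for token in append_line.split():
        if token.startswith("nfsroot="):
            # Kernel format: nfsroot=[<server-ip>:]<root-dir>[,<nfs-options>]
            target = token[len("nfsroot="):].split(",", 1)[0]
            server, sep, path = target.partition(":")
            if not sep:  # no server given, only a path
                return None, target
            return server, path
    return None


def exported_paths(exports_text):
    """Return the directory column of every non-comment line in /etc/exports."""
    paths = []
    for line in exports_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            paths.append(line.split()[0])
    return paths


# Values copied from the post above.
append = "root=/dev/nfs noinitrd nfsroot=192.168.1.1:/nfsroot/kerrighed ip=dhcp rw boot=nfs"
exports = """\
/nfsroot *(rw,async,no_root_squash,no_subtree_check)
/nfsroot/kerrighed *(rw,async,no_root_squash,no_subtree_check)
"""

server, path = parse_nfsroot(append)
consistent = path in exported_paths(exports)
print(server, path, consistent)
```

With the values posted here the path is exported, so these two files at least agree with each other.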
ajt
October 6th, 2010, 03:20 PM
Hello everyone,
[...]
Why does NFS work with the linux kernel(ie. vmlinuz-2.6.32-25-generic and initrd.img-2.6.32-25-generic) but not with the Kerrighed kernel(vmlinuz-2.6.20-krg)?
Hello, anshumax.
The default kerrighed kernel does not use an initrd, so you have to compile the correct ethernet driver into the kernel in order to access the NFSROOT.
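A rough way to check this is to look at the kernel .config and confirm that the relevant options are set to y (built in) rather than m (module) or left unset. The Python sketch below is only an illustration: builtin_status is a made-up helper, the sample text is invented, and CONFIG_E1000 merely stands in for whatever driver your network card needs.

```python
def builtin_status(config_text, options):
    """Return {option: 'y', 'm', ... or 'unset'} for each requested CONFIG_ option."""
    values = {}
    for line in config_text.splitlines():
        line = line.strip()
        # Lines such as "# CONFIG_FOO is not set" are comments and are skipped.
        if line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            values[name] = value
    return {opt: values.get(opt, "unset") for opt in options}


# Invented sample; on a real system read the .config in the kernel source tree.
sample = """\
CONFIG_ROOT_NFS=y
CONFIG_IP_PNP=y
CONFIG_IP_PNP_DHCP=y
CONFIG_E1000=m
# CONFIG_R8169 is not set
"""

wanted = ["CONFIG_ROOT_NFS", "CONFIG_IP_PNP", "CONFIG_IP_PNP_DHCP", "CONFIG_E1000"]
status = builtin_status(sample, wanted)
problems = [opt for opt, value in status.items() if value != "y"]
print(problems)
```

In this invented sample the driver is built as a module, which without an initrd would leave the kernel unable to reach the NFS server at boot.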
HTH,
Tony.
anshumax
October 6th, 2010, 04:06 PM
Hi Tony,
Yes, I know that by default the Kerrighed kernel does not require an initrd. I removed the line 'initrd=initrd.img-2.6.32-25-generic' from the pxelinux.cfg/default file.
anshumax
mazimi
November 29th, 2010, 06:48 PM
Hi,
This thread along with the Wiki article was very helpful in setting up my Kerrighed cluster. I currently have a system consisting of 20 nodes. The head node is running Ubuntu 10.04 x64, DHCP, TFTP and NFS. The remaining 19 compute nodes are running Ubuntu 8.04 x64 with Kerrighed 2.4.1. I have access to the head node via SSH as it is connected to the network. The 19 compute nodes are not directly connected to the network and to access any one of them I first have to SSH to the head node and from there SSH to any compute node. I can run jobs from the compute nodes and can have the application use memory from the entire cluster.
My question is how can I make it so that users can simply login to the head node and submit their jobs there without having to access the kerrighed cluster? Is this even possible? If not, should I connect one of the compute nodes to the network and have my users SSH to that and submit jobs from there? Any help/advice would be appreciated.
lrilling
November 30th, 2010, 04:40 AM
Hi,
You have three options:
1. make your head node part of the SSI cluster (you should ask ajt/Tony about this),
2. connect one of the nodes to the external network. In this case, your head node need not act as a front-end.
3. (variant of 2.) port forward the SSH service from your head node to one of the cluster nodes. This way you need not connect any cluster node to the external network: the nodes will only be reachable through the SSH service.
Hope this helps.
Thanks,
Louis
mazimi
November 30th, 2010, 12:31 PM
3. (variant of 2.) port forward the SSH service from your head node to one of the cluster nodes. This way you need not connect any cluster node to the external network: the nodes will only be reachable through the SSH service.
I really like this idea, thanks a lot Louis!
I have some other general questions about the performance of my Kerrighed cluster which are not as critical but I would appreciate any insight you or others could give on this.
1. It's my understanding that Kerrighed is seen as an SMP system by applications. Consequently, a multi-threaded application (let's say 16 threads) can be started on a quad-core compute node and the threads will be distributed to other compute nodes efficiently. This would be my guess for how Kerrighed's scheduler works for SMP applications. However, when you have an MPI application and you launch it via 'mpirun' on the Kerrighed cluster, the MPI scheduler connects to the other compute nodes via SSH and distributes the jobs among your 'hosts'. Is this the best method for handling MPI applications on Kerrighed?
2. I imagine that running an MPI application on Kerrighed won't be as efficient as running it on an MPI system, is this correct?
3. In general, for applications such as molecular dynamics and finite element, will PVM or MPI systems show better performance? I've scanned a few published articles but many of them seem to be over a decade old.
lrilling
December 1st, 2010, 04:37 AM
1. It's my understanding that Kerrighed is seen as an SMP system by applications. Subsequently a multi-threaded application (lets say 16 threads) can be started on a quad core compute node but the threads will be distributed to other compute nodes efficiently. This would be my guess for how Kerrighed's scheduler works for SMP applications.
Please note that threads cannot be distributed yet. IOW, single-threaded processes can migrate, but threads of multi-threaded processes cannot. Similarly, a process cannot create another thread remotely.
However, when you have an MPI application and you launch your application via 'mpirun' on the kerrighed cluster, the MPI scheduler connects to the other compute nodes via SSH and distributes the jobs among your 'hosts'. Is this the best method for handling MPI applications on kerrighed?
Currently yes. There are still issues with MPI applications when they are distributed by remote fork/migration.
2. I imagine that running an MPI application on Kerrighed won't be as efficient as running it on an MPI system, is this correct?
I'm not sure I understand what you call an "MPI system", but your assumption sounds correct. The gains that Kerrighed could bring are to transparently migrate MPI workers off nodes so that those nodes can be removed, or to re-balance the load.
3. In general, for applications such as molecular dynamics and finite element, will PVM or MPI systems show better performance? I've scanned a few published articles but many of them seem to be over a decade old.
Native MPI support will probably show the best performance. However Kerrighed can definitely help when MPI support is not coded in the application. It is then enough to program the application for multi-core architectures.
Thanks,
Louis
Trongersoll
December 13th, 2010, 06:26 PM
Ok, I can see that Kerrighed (http://www.kerrighed.org/) has moved on to the 64-bit world. One of the main advantages of the Beowulf concept was the use of Commodity computers. It is going to be years before 32-bit machines are relics rather than abundantly available at affordable prices. Since Kerrighed (http://www.kerrighed.org/) is no longer an option for us 32-bit people, what is available that will do something similar?
Also, since the Server in a Kerrighed (http://www.kerrighed.org/) cluster is a server and not really a head, is it relatively easy for the nodes to use more than one hard drive in the server? Can extra drives be added to the server easily?
togueter
December 16th, 2010, 06:03 AM
Hi, I tried to install Kerrighed. I followed all the steps of the guide, but my head node doesn't boot with the Kerrighed kernel. I tried to install Kerrighed on the head node (outside /nfsroot), but when I boot my system it says it is trying to boot via NFS (why? I want to boot the head node, not a client node), then that it is trying to boot from floppy. Eh? I press the Enter key and... kernel panic!!!
I think I do not understand the guidelines, because I assumed that Kerrighed has to be installed twice: once in the directory /nfsroot/kerrighed/ (as a chroot) so that the nodes boot with that kernel, and once at / on the master node (front-end) so that it can communicate with the nodes.
thanks
ajt
December 16th, 2010, 06:21 AM
Hi, togueter.
The default Kerrighed install builds an NFSROOT kernel, and the head node does NOT run the Kerrighed kernel. If you want to run a Kerrighed kernel on the head node, you need to disable NFSROOT in the Kerrighed kernel configuration, then rebuild and install a stand-alone kernel on the head node.
HTH,
Tony.
togueter
December 16th, 2010, 06:29 AM
OK Tony, thanks. But do the Kerrighed binaries have to be installed on the front-end, or not?
ajt
December 16th, 2010, 06:40 AM
Ok, I can see that Kerrighed (http://www.kerrighed.org/) has moved on to the 64-bit world. One of the main advantages of the Beowulf concept was the use of Commodity computers. It is going to be years before 32-bit machines are relics rather than abundantly available at affordable prices. Since Kerrighed (http://www.kerrighed.org/) is no longer an option for us 32-bit people, what is available that will do something similar?
Also, since the Server in a Kerrighed (http://www.kerrighed.org/) cluster is a server and not really a head, is it relatively easy for the nodes to use more than one hard drive in the server? Can extra drives be added to the server easily?
Hi, Trongersoll.
Re: alternatives - It depends what you want to do: The main feature of Kerrighed is SSI (Single System Image). I used to use 32-bit openMosix, but this project has now closed:
http://openmosix.sourceforge.net/
This is still a viable system for 32-bit computers, but is no longer being developed and, critically, is limited to the 2.4 kernel and has very limited support for SATA drives.
Before deciding to switch to kerrighed, I also had a look at openSSI:
http://openssi.org/
There is a 32-bit version for Debian Lenny (i386) updated 2010/02/18, but the openSSI project seems dormant and I decided to try Kerrighed instead.
I have to say that the 32-bit version of Kerrighed 2.3.0 does work, and if you want to learn about SSI then this is an option. However, as Louis has said before the developers don't have the resources to devote to the 32-bit version. Remember, this is FLOSS and we are free to develop the 32-bit version of Kerrighed if we want to. However, look on eBay and you will see *many* very cheap COTS 64-bit server motherboards available!
It's a different matter if you want to use SSI for your work. In particular, if you want to use the aggregate memory of a Kerrighed Beowulf cluster you are limited to ~3GiB and the Kerrighed kernel does not support PAE. In that respect, you can only use 32-bit Kerrighed for transparent process migration - That's quite useful, though :-)
Bye,
Tony.
nandugopan
December 16th, 2010, 08:18 AM
Guys,
I am completely new to clustering and, after reading this thread, felt I should try setting up a small cluster for our lab. We have 4 HP Z800 workstations (4x2 processors with 6 cores each) with Gigabit Ethernet cards. Our lab runs domain-decomposition molecular dynamics simulations on the institute cluster, using gcc + gfortran + FFTW2 + MPICH. I just wanted to estimate how much of an improvement in computational power can be brought about by an in-house cluster of workstations.
Would there be a considerable improvement in computational power from setting up a cluster with just 4 machines? (Most of the posts here seem to refer to 20+ machines.)
Is there a 'how-to' or manual (wishful thinking) on how to go about setting up a cluster? I remember seeing something on how to set up an Ubuntu cluster somewhere. Can't quite find the link.
Hoping to hear your inputs
Thanks.:p
ajt
December 16th, 2010, 08:43 AM
Guys,
[...]
It uses gcc + gfortran + FFTW2 + MPICH. I just wanted to estimate how much of an improvement in computational power can be brought about by an in-house clustering of workstations.
[...]
Thanks.:p
Hi, nandugopan.
You don't need Kerrighed SSI to run MPICH programs, which use the MPICH library under a standard Linux kernel. However, one strategy I've used is to configure my MPI hostfile to use only one node of a cluster and allow automatic openMosix/Kerrighed load-balancing to distribute MPI processes between nodes by overloading the number of slots on that node (i.e. more slots than cores). In deciding how much of an advantage this might be, I recommend reading up about Amdahl's law:
http://en.wikipedia.org/wiki/Amdahl%27s_law
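To get a feel for the numbers, Amdahl's law says the speedup on N workers is 1 / ((1 - p) + p / N), where p is the fraction of the run time that can be parallelised. Here is a small Python sketch; the value p = 0.95 is purely an assumed figure for illustration (measure your own code), and the worker counts of 12 and 48 assume 2 x 6 cores per workstation as described in the question.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Upper bound on speedup from Amdahl's law: 1 / ((1 - p) + p / N)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


p = 0.95  # assumed parallel fraction, not a measured value
one_workstation = amdahl_speedup(p, 12)   # 2 processors x 6 cores
four_workstations = amdahl_speedup(p, 48)  # 4 workstations x 12 cores
print(round(one_workstation, 2), round(four_workstations, 2))
```

Note that this is an upper bound: it ignores the cost of communication over Gigabit Ethernet, which for domain decomposition codes can matter a great deal.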
HTH,
Tony.
Trongersoll
December 16th, 2010, 01:17 PM
Hi Tony,
Thanks for your response. I started to play with OpenMosix about 5 years ago, but got busy with life and lost interest. I've accumulated a number of 32-bit machines over the years. The server can be 32 bit so i don't have a problem there, but if i go 32 bit i have 5 nodes at hand. If i go 64 bit, I have one node. Generally, when i upgrade my computers the old hardware would become a new node.
Right now I'm wondering how much of a problem i would have reverse engineering the 64 bit updates to 32 bit. I need to think about this some.
vak
January 8th, 2011, 12:43 PM
GUYS, I need SSI badly in Ubuntu! (http://ubuntuforums.org/showthread.php?t=1662580)
cellstorm
February 5th, 2011, 01:54 PM
here are debian packages:
http://www.kerrighed.org/debian/
well hidden....
here how to use that:
http://www.estrellateyarde.org/discover/cluster-kerrighed-en-linux
at least I think so.
this guide is probably well known here, but anyways, could be helpful:
http://www.debianadmin.com/how-to-set-up-a-high-performance-cluster-hpc-using-debian-lenny-and-kerrighed.html
vak
February 5th, 2011, 02:32 PM
@cellstorm, hey, thanks for sharing this info! I am going to try it
Did you try both OpenSSI and Kerrighed?
How good is Kerrighed in comparison to OpenSSI ?
I saw the feature comparison table (http://en.wikipedia.org/wiki/Single-system_image), so in particular it would be interesting to know how is it now in Kerrighed with
* Single I/O space and
* Cluster IP
cellstorm
February 5th, 2011, 03:19 PM
Nope, I did not try anything. I was just researching. Anyway, things seem to be more complicated than I thought.
I was looking for a simple replacement for openMosix: a kernel which can be put on a live CD and then used for distributed computing for multimedia apps.
As I understand it, the kernel used by Kerrighed needs to be compiled by hand, because it won't boot without the driver for your network card compiled into it.
Also, multithreaded applications do not benefit from it (yet); I hope this returns soon.
The server, which holds all the applications the slaves are using, cannot be used for computing. Which is a bit strange: how would I work in e.g. Blender then?
vak
February 5th, 2011, 05:44 PM
@cellstorm, well, those guides are too far from debian packages available by your first link :)
The packages have been installed by me OK using dpkg -i
However boot without any configuration failed (kernel panic) :)
Also, I didn't find any guide that explains how to install and configure Kerrighed from the .deb
I'll try tomorrow.
lrilling
February 7th, 2011, 07:37 AM
here are debian packages:
http://www.kerrighed.org/debian/
well hidden....
Guys, there is a good reason for not telling the world about those debs: they are part of XtreemOS 3.0, and as such include advanced features that are not so well documented nor stabilized.
here how to use that:
http://www.estrellateyarde.org/discover/cluster-kerrighed-en-linux
at least I think so.
Typically, I doubt (cannot read spanish well enough) that this guide explains how to use the debs above, since they require some specific network configuration which is absolutely not required with Kerrighed 3.0.0...
If you still want to use those debs, the network configuration is explained in some README.Net file (to be found under /usr/share/doc/kerrighed in package kerrighed).
Thanks,
Louis
lrilling
February 7th, 2011, 07:42 AM
the server, which holds all the applications the slaves are using, cannot be used for computing. Which is a bit strange, how would I work in e.g blender then?
The server only serves DHCP, PXE requests (to ship their kernel to the nodes) and files (through NFS). Applications must be launched from regular cluster nodes.
Btw, there is no real notion of slave in Kerrighed. All nodes participate equally in the cluster. The server itself is not part of the cluster in the recommended configurations.
Thanks,
Louis
vak
February 7th, 2011, 07:49 AM
@lrilling, I have the following kernel panic problem during boot with those debs: http://article.gmane.org/gmane.linux.cluster.kerrighed.user/1084
lrilling
February 7th, 2011, 07:52 AM
@cellstorm, well, those guides are too far from debian packages available by your first link :)
The packages have been installed by me OK using dpkg -i
However boot without any configuration failed (kernel panic) :)
Also, I didn't find any guide that explains how to install and configure Kerrighed from the .deb
Indeed, no guide covers those debs. However, you should succeed if you combine the usual guide(s) for Kerrighed 3.0.0 (see the official one http://www.kerrighed.org/wiki/index.php/UserDoc for instance) and file README.Net (to be found under /usr/share/doc/kerrighed). You should also be aware that two flavors of the kernel (packages kerrighed-image-2.6.30-krg and kerrighed-headers-2.6.30-krg) are shipped:
- version 3.0.0-1 includes support for a cluster-wide IP address, and requires network namespace isolation between the host system and the Kerrighed container.
- version 3.0.0+nonet-1 does not include support for cluster-wide IP address, and does not require network namespace isolation.
Thanks,
Louis
lrilling
February 7th, 2011, 07:54 AM
@lrilling, I have a following kernel panic problem during the boot with those debs: http://article.gmane.org/gmane.linux.cluster.kerrighed.user/1084
@vak: See my (coming) reply on kerrighed.users.
Thanks,
Louis
vak
February 7th, 2011, 08:02 AM
You should also be aware that two flavors of the kernel (packages kerrighed-image-2.6.30-krg and kerrighed-headers-2.6.30-krg) are shipped:
...
@lrilling, many thanks for the hint about the difference. It will save me a day.
Now I have to boot somehow in the system first :)
So I am curious and am looking forward to your reply on the Kerrighed mailing list.
ajithabraham.m
February 8th, 2011, 10:48 PM
Hi,
I did everything as you said in the tutorial, but when I start Kerrighed:
root@cluster:/home/kerrighed-src# /etc/init.d/kerrighed start
modinfo: could not find module kerrighed
* Starting Kerrighed: [ OK ]
and when I check the status:
root@cluster:/home/kerrighed-src# /etc/init.d/kerrighed status
modinfo: could not find module kerrighed
* Kerrighed status: not loaded [ OK ]
thanks
lrilling
February 10th, 2011, 06:51 AM
Does kerrighed appear in lsmod?
# lsmod | grep kerrighed
If not, try (on any cluster node):
# depmod -a
and then retry:
# /etc/init.d/kerrighed start
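If you prefer to script the first check, lsmod simply formats /proc/modules, whose first column is the module name. A minimal Python sketch of the same test follows; module_loaded is a made-up helper, and the sample text is invented rather than taken from a real Kerrighed node. Unlike grep, it matches the exact module name, not a substring.

```python
def module_loaded(proc_modules_text, name):
    """Return True if a module with exactly this name is listed in /proc/modules text."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == name:
            return True
    return False


# Invented sample in /proc/modules layout: name size refcount deps state address
sample = """\
nfs 254000 1 - Live 0xffffffffa0100000
lockd 67000 1 nfs, Live 0xffffffffa00e0000
sunrpc 190000 3 nfs,lockd, Live 0xffffffffa00a0000
"""

print(module_loaded(sample, "kerrighed"))
# On a real node: module_loaded(open("/proc/modules").read(), "kerrighed")
```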
Thanks,
Louis
abu_nawas
February 16th, 2011, 06:02 AM
My system freezes when migrating. Does anyone know how to solve this problem?
lrilling
February 17th, 2011, 11:28 AM
Please give kernel logs (both source and target machine), stack backtraces of blocked tasks if possible (you can have them with SysRq-w, or echo w > /proc/sysrq-trigger), and of course detail a bit what you did: which process, how launched, how migrated.
And Kerrighed version of course :)
Thanks,
Louis
awannabeee
March 6th, 2011, 12:49 PM
Forgive me if I am posting in the wrong thread!
Can someone direct me to tutorials for a single-node cluster setup? I tried the Hadoop tutorial and got confused at the configuration stage:
Configuration
Use the following:
conf/core-site.xml:
<configuration> <property> <name>fs.default.name</name> <value>hdfs://localhost:9000</value> </property> </configuration>
The above comes from http://hadoop.apache.org/common/docs/current/single_node_setup.html. Can someone explain it to me, or tell me where to find an explanation?
baggzy
June 27th, 2011, 09:03 AM
Hi all! I'm attempting to install kerrighed 3.0.0 via NFSBOOT. There have been some changes to kerrighed and debian which aren't reflected in the online docs, so I thought I'd share my experiences in case it assists others. In this post I'll be covering the NFSBOOT part of the process. I'll compile and run kerrighed later. I'm booting off a machine running Ubuntu 10.10, for anyone who's interested.
The basic process is described in the official guide (http://www.kerrighed.org/wiki/index.php/Kerrighed_on_NFSROOT) and also this contributed doc (http://www.kerrighed.org/wiki/index.php/Kerrighed_on_NFSROOT_%28contrib%29) but I had to change some of the details. Follow the official guide but refer to the points below as you go.
1) The format for /etc/default/tftpd-hpa has changed. Using RUN_DAEMON = "yes" and OPTIONS = "-l -s /srv/tftp" doesn't seem to work any more. Instead I used what's shown below. The default file has TFTP_OPTIONS="--secure" but I couldn't get that to work. tftpd-hpa runs as a daemon by default now, so no need for the -l flag. (And no need for xinetd either.)
# /etc/default/tftpd-hpa
TFTP_USERNAME="tftp"
TFTP_DIRECTORY="/srv/tftp"
TFTP_ADDRESS="0.0.0.0:69"
TFTP_OPTIONS=""
2) The official guide doesn't mention which file to edit to "Configure the DHCP server". The file in question is /etc/dhcp3/dhcpd.conf. The machine I'm booting off (the "server") has IP address 192.168.0.1 on network card eth0, but is connected to the cluster via a second network card, eth1, with IP 192.168.1.1. The nodes will have IPs starting at 192.168.1.101 and be called node101, node102, etc. My dhcpd.conf therefore looks like this:
#
# Sample configuration file for ISC dhcpd for Debian
#
# Attention: If /etc/ltsp/dhcpd.conf exists, that will be used as
# configuration file instead of this file.
#
# $Id: dhcpd.conf,v 1.1.1.1 2002/05/21 00:07:44 peloy Exp $
#
# The ddns-updates-style parameter controls whether or not the server will
# attempt to do a DNS update when a lease is confirmed. We default to the
# behavior of the version 2 packages ('none', since DHCP v2 didn't
# have support for DDNS.)
ddns-update-style none;
# option definitions common to all supported networks...
#option domain-name "example.org";
#option domain-name-servers ns1.example.org, ns2.example.org;
default-lease-time 600;
max-lease-time 7200;
# If this DHCP server is the official DHCP server for the local
# network, the authoritative directive should be uncommented.
#authoritative;
# Use this to send dhcp log messages to a different log file (you also
# have to hack syslog.conf to complete the redirection).
log-facility local7;
# No service will be given on this subnet, but declaring it helps the
# DHCP server to understand the network topology.
#subnet 10.152.187.0 netmask 255.255.255.0 {
#}
# This is a very basic subnet declaration.
#subnet 10.254.239.0 netmask 255.255.255.224 {
# range 10.254.239.10 10.254.239.20;
# option routers rtr-239-0-1.example.org, rtr-239-0-2.example.org;
#}
# This declaration allows BOOTP clients to get dynamic addresses,
# which we don't really recommend.
#subnet 10.254.239.32 netmask 255.255.255.224 {
# range dynamic-bootp 10.254.239.40 10.254.239.60;
# option broadcast-address 10.254.239.31;
# option routers rtr-239-32-1.example.org;
#}
# A slightly different configuration for an internal subnet.
#subnet 10.5.5.0 netmask 255.255.255.224 {
# range 10.5.5.26 10.5.5.30;
# option domain-name-servers ns1.internal.example.org;
# option domain-name "internal.example.org";
# option routers 10.5.5.1;
# option broadcast-address 10.5.5.31;
# default-lease-time 600;
# max-lease-time 7200;
#}
# Hosts which require special configuration options can be listed in
# host statements. If no address is specified, the address will be
# allocated dynamically (if possible), but the host-specific information
# will still come from the host declaration.
#host passacaglia {
# hardware ethernet 0:0:c0:5d:bd:95;
# filename "vmunix.passacaglia";
# server-name "toccata.fugue.com";
#}
# Fixed IP addresses can also be specified for hosts. These addresses
# should not also be listed as being available for dynamic assignment.
# Hosts for which fixed IP addresses have been specified can boot using
# BOOTP or DHCP. Hosts for which no fixed address is specified can only
# be booted with DHCP, unless there is an address range on the subnet
# to which a BOOTP client is connected which has the dynamic-bootp flag
# set.
#host fantasia {
# hardware ethernet 08:00:07:26:c0:a5;
# fixed-address fantasia.fugue.com;
#}
# You can declare a class of clients and then do address allocation
# based on that. The example below shows a case where all clients
# in a certain class get addresses on the 10.17.224/24 subnet, and all
# other clients get addresses on the 10.0.29/24 subnet.
#class "foo" {
# match if substring (option vendor-class-identifier, 0, 4) = "SUNW";
#}
#shared-network 224-29 {
# subnet 10.17.224.0 netmask 255.255.255.0 {
# option routers rtr-224.example.org;
# }
# subnet 10.0.29.0 netmask 255.255.255.0 {
# option routers rtr-29.example.org;
# }
# pool {
# allow members of "foo";
# range 10.17.224.10 10.17.224.250;
# }
# pool {
# deny members of "foo";
# range 10.0.29.10 10.0.29.230;
# }
#}
### PART 1
# General options
option dhcp-max-message-size 2048;
use-host-decl-names on;
deny bootp;
### PART 2
option domain-name "kerrighed";
option domain-name-servers 192.168.1.1;
option ntp-servers ntp.network.net;
### PART 3
subnet 192.168.1.0 netmask 255.255.255.0 {
option routers 192.168.1.1;
option broadcast-address 192.168.1.255;
# Define the first and last IP address to be authorized
range 192.168.1.101 192.168.1.199;
# This set up the node name to « nodeXX » with XX the id of the node (ip-address based).
send host-name = concat("node", binary-to-ascii(10, 8, ".", substring(leased-address, 3, 1)));
# If you want to limit which boxes are using kerrighed, use something like this :
# host ssi1 { fixed-address 192.168.0.101; hardware ethernet xx:xx:xx:xx:xx:xx;
filename "/srv/tftp/pxelinux.0";
}
### PART 4
group {
server-name "server";
option root-path "192.168.1.1:/NFSROOT/kerrighed";
}
3) As far as I recall, I didn't change anything in the standard file, I just added the stuff at the bottom. But I left all the default gumph in there just in case. A few things to note:
i) Since the server is on 192.168.1.1, starting the range at 192.168.1.2 would mean that the node names start with node2... So I started the range at 101 so the first node is node101. I suppose I could reconfigure eth1 on the server to be 192.168.1.254 or something, but I didn't.
ii) I moved the filename "/srv/tftp/pxelinux.0"; from the group block to the subnet block. It didn't seem to work in the group block - I'd get the following error on the node:
PXE-E53: No boot filename received
4) I added the following to /etc/default/dhcp3-server:
INTERFACES="eth1"
5) I added the following to /etc/network/interfaces:
auto eth1
iface eth1 inet static
address 192.168.1.1
netmask 255.255.255.0
6) I added the following to /etc/hosts.allow because without it I'd get a TFTP timeout error:
ALL: 192.168.1.*
7) If you make changes to the dhcp or tftp config files you need to restart various services. As Steve Kelly suggests in his guide (http://www.stevekelly.eu/cluster.shtml), I set up a script to do this, called "restart-dhcp". You'll need to chmod 755 it, then put it somewhere in your $PATH (I put it in ~/bin and add that to my PATH in my .bashrc). Note that "restart" doesn't seem to work with tftpd-hpa so you have to "stop" it, then "start" it.
#
# tftpd-hpa restart is broken, so don't use it!
#
sudo restart portmap
sudo service dhcp3-server restart
sudo stop tftpd-hpa
sudo start tftpd-hpa
sudo service nfs-kernel-server restart
sudo exportfs -ra
8) The output from this script should look like the code below. If any of the process IDs are missing, the daemon probably didn't actually start, even though a "status" check will say it did. There's obviously a bug somewhere, and the only solution I found was to reboot the server.
portmap start/running, process 844
* Stopping DHCP server dhcpd3 [ OK ]
* Starting DHCP server dhcpd3 [ OK ]
tftpd-hpa stop/waiting
tftpd-hpa start/running, process 23714
* Stopping NFS kernel daemon [ OK ]
* Unexporting directories for NFS kernel daemon... [ OK ]
* Exporting directories for NFS kernel daemon... [ OK ]
* Starting NFS kernel daemon [ OK ]
9) The official guide doesn't mention it, but after you've done "sudo mkdir /srv/tftp" you have to do the following. The first thing your node will do when it tries to NFSBOOT is look for a file called pxelinux.0 in /srv/tftp. Obviously if it's not there it won't boot. Next it will fire up TFTP and try to transfer the boot files specified in the /srv/tftp/pxelinux.cfg/default file. If it fails it's either because TFTP isn't running, isn't set up correctly, doesn't have the required permissions, or your node isn't listed in /etc/hosts.allow. The chown may not be necessary, but if you have problems at the TFTP stage give it a try.
sudo cp /usr/lib/syslinux/pxelinux.0 /srv/tftp/
sudo chown -R tftp:tftp /srv/tftp
10) After you've debootstrapped lenny and chroot'd into /NFSROOT/kerrighed, in addition to mounting /proc I also had to do the following to avoid errors while using "apt-get install":
mount -t devpts none /dev/pts
11) To avoid warnings about your LOCALE the first thing you need to install is locales, then configure it and select "en_GB.UTF-8":
apt-get install locales
dpkg-reconfigure locales
12) After you've installed nfs-common, to avoid statd failing during boot, and holding things up about 5 minutes each time, edit /etc/default/nfs-common and change NEED_STATD to no:
NEED_STATD=no
13) After installing initramfs-tools edit /etc/initramfs-tools/initramfs.conf and change BOOT=local to:
BOOT=nfs
14) If you want to boot your node into debian to see if everything has worked so far...
i) Do the following. (Answer "No" to creating a link, and "No" to aborting.)
apt-get install linux-image-2.6.26-2-amd64
ii) exit (i.e. leave chroot)
iii) Copy the boot files:
sudo cp /NFSROOT/kerrighed/boot/vmlinuz-2.6.26-2-amd64 /srv/tftp/
sudo cp /NFSROOT/kerrighed/boot/initrd.img-2.6.26-2-amd64 /srv/tftp/
iv) Create or edit /srv/tftp/pxelinux.cfg/default to:
DEFAULT debian
LABEL debian
kernel vmlinuz-2.6.26-2-amd64
append ip=dhcp root=/dev/nfs initrd=initrd.img-2.6.26-2-amd64 nfsroot=192.168.1.1:/NFSROOT/kerrighed,rw rw
initrd initrd.img-2.6.26-2-amd64
15) That's it! You should be able to NFSBOOT your node(s)...
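Before power-cycling a node, it can help to confirm that the files mentioned in steps 9 and 14 are actually in the TFTP root. The following is only a minimal sketch: check_tftp_root is a made-up helper name (it is not part of tftpd-hpa or syslinux), and it tests readability for the user running it, not for the tftp user.

```shell
#!/bin/sh
# Sketch: report which PXE boot files are missing from a TFTP root.
# check_tftp_root is a hypothetical helper, not a tftpd-hpa or syslinux tool.
# Usage: check_tftp_root TFTP_ROOT [extra files relative to TFTP_ROOT...]
check_tftp_root() {
    root=$1
    shift
    missing=0
    # pxelinux.0 and pxelinux.cfg/default are always needed; the kernel
    # and initrd names are supplied by the caller.
    for f in pxelinux.0 pxelinux.cfg/default "$@"; do
        if [ ! -r "$root/$f" ]; then
            echo "missing or unreadable: $root/$f"
            missing=1
        fi
    done
    return "$missing"
}
```

With the names used in step 14 this would be called as: check_tftp_root /srv/tftp vmlinuz-2.6.26-2-amd64 initrd.img-2.6.26-2-amd64. It prints nothing and returns 0 when everything is in place.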
lrilling
June 28th, 2011, 11:55 AM
Hi all! I'm attempting to install kerrighed 3.0.0 via NFSBOOT. There have been some changes to kerrighed and debian which aren't reflected in the online docs, so I thought I'd share my experiences in case it assists others. In this post I'll be covering the NFSBOOT part of the process. I'll compile and run kerrighed later. I'm booting off a machine running Ubuntu 10.10, for anyone who's interested.
The basic process is described in the official guide (http://www.kerrighed.org/wiki/index.php/Kerrighed_on_NFSROOT) and also this contributed doc (http://www.kerrighed.org/wiki/index.php/Kerrighed_on_NFSROOT_%28contrib%29) but I had to change some of the details. Follow the official guide but refer to the points below as you go.
Hi baggzy,
First of all thanks for sharing your notes. If you don't mind, I'm going to comment on some of the items, and with your permission I could also improve the official Kerrighed wiki :)
1) The format for /etc/default/tftpd-hpa has changed. Using RUN_DAEMON = "yes" and OPTIONS = "-l -s /srv/tftp" doesn't seem to work any more. Instead I used what's shown below. The default file has TFTP_OPTIONS="--secure" but I couldn't get that to work.
According to tftpd-hpa's documentation, when using --secure, option filename in the DHCP server configuration should not contain any path (eg. pxelinux.0, not /srv/tftp/pxelinux.0), and the file should of course live exactly in the daemon's directory (ie /srv/tftp). Have you tried setting up tftpd-hpa+dhcpd.conf this way?
tftpd-hpa runs as a daemon by default now, so no need for the -l flag. (And no need for xinetd either.)
# /etc/default/tftpd-hpa
TFTP_USERNAME="tftp"
TFTP_DIRECTORY="/srv/tftp"
TFTP_ADDRESS="0.0.0.0:69"
TFTP_OPTIONS=""
Good candidate for improving the wiki.
2) The official guide doesn't mention which file to edit to "Configure the DHCP server". The file in question is /etc/dhcp3/dhcp.conf.
I guess that you meant /etc/dhcp3/dhcpd.conf, right? Note that on newer distros (eg Debian Squeeze), it has even moved to /etc/dhcp/dhcpd.conf. The package has also been renamed isc-dhcp-server.
The machine I'm booting off (the "server") has IP address 192.168.0.1 on network card eth0, but is connected to the cluster via a second network card with IP 192.168.1.1 called eth1. The nodes will have IP's starting at 192.168.1.101 and be called node101, node102, etc. My dhcp.conf therefore looks like this:
#
# Sample configuration file for ISC dhcpd for Debian
#
# Attention: If /etc/ltsp/dhcpd.conf exists, that will be used as
# configuration file instead of this file.
#
# $Id: dhcpd.conf,v 1.1.1.1 2002/05/21 00:07:44 peloy Exp $
#
# The ddns-updates-style parameter controls whether or not the server will
# attempt to do a DNS update when a lease is confirmed. We default to the
# behavior of the version 2 packages ('none', since DHCP v2 didn't
# have support for DDNS.)
ddns-update-style none;
# option definitions common to all supported networks...
#option domain-name "example.org";
#option domain-name-servers ns1.example.org, ns2.example.org;
default-lease-time 600;
max-lease-time 7200;
# If this DHCP server is the official DHCP server for the local
# network, the authoritative directive should be uncommented.
#authoritative;
# Use this to send dhcp log messages to a different log file (you also
# have to hack syslog.conf to complete the redirection).
log-facility local7;
# No service will be given on this subnet, but declaring it helps the
# DHCP server to understand the network topology.
#subnet 10.152.187.0 netmask 255.255.255.0 {
#}
# This is a very basic subnet declaration.
#subnet 10.254.239.0 netmask 255.255.255.224 {
# range 10.254.239.10 10.254.239.20;
# option routers rtr-239-0-1.example.org, rtr-239-0-2.example.org;
#}
# This declaration allows BOOTP clients to get dynamic addresses,
# which we don't really recommend.
#subnet 10.254.239.32 netmask 255.255.255.224 {
# range dynamic-bootp 10.254.239.40 10.254.239.60;
# option broadcast-address 10.254.239.31;
# option routers rtr-239-32-1.example.org;
#}
# A slightly different configuration for an internal subnet.
#subnet 10.5.5.0 netmask 255.255.255.224 {
# range 10.5.5.26 10.5.5.30;
# option domain-name-servers ns1.internal.example.org;
# option domain-name "internal.example.org";
# option routers 10.5.5.1;
# option broadcast-address 10.5.5.31;
# default-lease-time 600;
# max-lease-time 7200;
#}
# Hosts which require special configuration options can be listed in
# host statements. If no address is specified, the address will be
# allocated dynamically (if possible), but the host-specific information
# will still come from the host declaration.
#host passacaglia {
# hardware ethernet 0:0:c0:5d:bd:95;
# filename "vmunix.passacaglia";
# server-name "toccata.fugue.com";
#}
# Fixed IP addresses can also be specified for hosts. These addresses
# should not also be listed as being available for dynamic assignment.
# Hosts for which fixed IP addresses have been specified can boot using
# BOOTP or DHCP. Hosts for which no fixed address is specified can only
# be booted with DHCP, unless there is an address range on the subnet
# to which a BOOTP client is connected which has the dynamic-bootp flag
# set.
#host fantasia {
# hardware ethernet 08:00:07:26:c0:a5;
# fixed-address fantasia.fugue.com;
#}
# You can declare a class of clients and then do address allocation
# based on that. The example below shows a case where all clients
# in a certain class get addresses on the 10.17.224/24 subnet, and all
# other clients get addresses on the 10.0.29/24 subnet.
#class "foo" {
# match if substring (option vendor-class-identifier, 0, 4) = "SUNW";
#}
#shared-network 224-29 {
# subnet 10.17.224.0 netmask 255.255.255.0 {
# option routers rtr-224.example.org;
# }
# subnet 10.0.29.0 netmask 255.255.255.0 {
# option routers rtr-29.example.org;
# }
# pool {
# allow members of "foo";
# range 10.17.224.10 10.17.224.250;
# }
# pool {
# deny members of "foo";
# range 10.0.29.10 10.0.29.230;
# }
#}
### PART 1
# General options
option dhcp-max-message-size 2048;
use-host-decl-names on;
deny bootp;
### PART 2
option domain-name "kerrighed";
option domain-name-servers 192.168.1.1;
option ntp-servers ntp.network.net;
### PART 3
subnet 192.168.1.0 netmask 255.255.255.0 {
option routers 192.168.1.1;
option broadcast-address 192.168.1.255;
# Define the first and last IP address to be authorized
range 192.168.1.101 192.168.1.199;
# This set up the node name to « nodeXX » with XX the id of the node (ip-address based).
send host-name = concat("node", binary-to-ascii(10, 8, ".", substring(leased-address, 3, 1)));
# If you want to limit which boxes are using kerrighed, use something like this :
# host ssi1 { fixed-address 192.168.0.101; hardware ethernet xx:xx:xx:xx:xx:xx;
filename "/srv/tftp/pxelinux.0";
}
### PART 4
group {
server-name "server";
option root-path "192.168.1.1:/NFSROOT/kerrighed";
}
3) As far as I recall, I didn't change anything in the standard file, I just added the stuff at the bottom. But I left all the default gumph in there just in case. A few things to note:
i) Since the server is on 192.168.1.1, starting the range at 192.168.1.2 would mean that the node names start with node2... So I started the range at 101 so the first node is node101. I suppose I could reconfigure eth1 on the server to be 192.168.1.254 or something, but I didn't.
ii) I moved the filename "/srv/tftp/pxelinux.0"; from the group block to the subnet block. It didn't seem to work in the group block - I'd get the following error on the node:
PXE-E53: No boot filename received
I have to admit that I don't understand the reason for this group declaration, since it contains no host declarations. I was told that it was working as it is written in the wiki, but I'm very much inclined to believe you :) To me the contents of this group declaration should be simply put in the subnet statement above. People needing more complex DHCP configurations will probably know how to create appropriate group or class declarations. Do you believe that there is some interest for this group declaration?
4) I added the following to /etc/default/dhcp3-server:
INTERFACES="eth1"
Good candidate for the wiki too.
5) I added the following to /etc/network/interfaces:
auto eth1
iface eth1 inet static
address 192.168.1.1
netmask 255.255.255.0
Good candidate for the wiki.
6) I added the following to /etc/hosts.allow because without it I'd get a TFTP timeout error:
ALL: 192.168.1.*
Good catch! I don't remember anybody mentioning it (which is not to say that nobody ever told me about it, I just don't remember:)). As far as I can see, you must have something in /etc/hosts.deny that matches your nodes and tftp. Otherwise, leaving /etc/hosts.allow empty should be enough to have tftpd-hpa working.
7) If you make changes to the dhcp or tftp config files you need to restart various services. As Steve Kelly suggests in his guide (http://www.stevekelly.eu/cluster.shtml), I set up a script to do this, called "restart-dhcp". You'll need to chmod 755 it, then put it somewhere in your $PATH (I put it in ~/bin and added that to my PATH in my .bashrc). Note that "restart" doesn't seem to work with tftpd-hpa so you have to "stop" it, then "start" it.
#
# tftpd-hpa restart is broken, so don't use it!
#
sudo restart portmap
sudo service dhcp3-server restart
sudo stop tftpd-hpa
sudo start tftpd-hpa
sudo service nfs-kernel-server restart
sudo exportfs -ra
I would not include this in the wiki, but maybe add some link to your post.
8) The output from this script should look like the code below. If any of the process id's are missing the daemon probably didn't actually start. But a "status" check will say it did. There's obviously a bug somewhere and the only solution I found was to reboot the server.
portmap start/running, process 844
* Stopping DHCP server dhcpd3 [ OK ]
* Starting DHCP server dhcpd3 [ OK ]
tftpd-hpa stop/waiting
tftpd-hpa start/running, process 23714
* Stopping NFS kernel daemon [ OK ]
* Unexporting directories for NFS kernel daemon... [ OK ]
* Exporting directories for NFS kernel daemon... [ OK ]
* Starting NFS kernel daemon [ OK ]
9) The official guide doesn't mention it, but after you've done "sudo mkdir /srv/tftp" you have to do the following. The first thing your node will do when it tries to NFSBOOT is look for a file called pxelinux.0 in /srv/tftp. Obviously if it's not there it won't boot. Next it will fire up TFTP and try to transfer the boot files specified in the /srv/tftp/pxelinux.cfg/default file. If it fails it's either because TFTP isn't running, isn't set up correctly, doesn't have the required permissions, or your node isn't listed in /etc/hosts.allow. The chown may not be necessary, but if you have problems at the TFTP stage give it a try.
sudo cp /usr/lib/syslinux/pxelinux.0 /srv/tftp/
sudo chown -R tftp:tftp /srv/tftp
Good candidate for the wiki.
10) After you've debootstrapped lenny and chroot'd into /NFSROOT/kerrighed, in addition to mounting /proc I also had to do the following to avoid errors while using "apt-get install":
mount -t devpts none /dev/pts
I think that the errors you mention are dpkg complaining that it is not able to log. This however does not prevent dpkg from correctly installing packages, so I personally don't mount /dev/pts. I can add this to the wiki though, if people can feel more comfortable this way.
11) To avoid warnings about your LOCALE the first thing you need to install is locales, then configure it and select "en_GB.UTF-8":
apt-get install locales
dpkg-reconfigure locales
Same category as the previous item, IMHO.
12) After you've installed nfs-common, to avoid statd failing during boot, and holding things up about 5 minutes each time, edit /etc/default/nfs-common and change NEED_STATD to no:
NEED_STATD=no
I wouldn't recommend disabling statd, especially when you mount NFS trees with file locking support (IIRC this is the default for mounts done from /etc/fstab). You might have issues because either /var/run or /var/lib/nfs is shared (each node must have its own ones, for instance using tmpfs mounts). /var/lib/nfs should be added to the /etc/fstab sample of the wiki, however it requires a bit of initialization. I don't know if nfsbooted has some helpers for this.
13) After installing initramfs-tools edit /etc/initramfs-tools/initramfs.conf and change BOOT=local to:
BOOT=nfs
Good candidate too.
14) If you want to boot your node into debian to see if everything has worked so far...
i) Do the following. (Answer "No" to creating a link, and "No" to aborting.)
apt-get install linux-image-2.6.26-2-amd64
ii) exit (i.e. leave chroot)
iii) Copy the boot files:
sudo cp /NFSROOT/kerrighed/boot/vmlinuz-2.6.26-2-amd64 /srv/tftp/
sudo cp /NFSROOT/kerrighed/boot/initrd.img-2.6.26-2-amd64 /srv/tftp/
iv) Create or edit /srv/tftp/pxelinux.cfg/default to:
DEFAULT debian
LABEL debian
kernel vmlinuz-2.6.26-2-amd64
append ip=dhcp root=/dev/nfs initrd=initrd.img-2.6.26-2-amd64 nfsroot=192.168.1.1:/NFSROOT/kerrighed,rw rw
initrd initrd.img-2.6.26-2-amd64
15) That's it! You should be able to NFSBOOT your node(s)...
Thanks again for your notes!
Louis
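As a side note on the "send host-name" line in the dhcpd.conf above: substring(leased-address, 3, 1) picks the fourth byte of the leased address and binary-to-ascii(10, 8, ...) prints it in decimal, so the node name is simply "node" plus the last octet. A rough shell equivalent, where node_name_for_ip is a name invented for this illustration:

```shell
#!/bin/sh
# Sketch: mirror the dhcpd host-name expression in shell.
# node_name_for_ip is a hypothetical helper, not part of any package.
node_name_for_ip() {
    # ${1##*.} strips everything up to and including the last dot,
    # leaving the fourth octet of a dotted-quad address.
    echo "node${1##*.}"
}
```

This matches the naming described in the notes: a lease of 192.168.1.101 gives node101, and a range starting at 192.168.1.2 would give node2.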
baggzy
July 1st, 2011, 12:52 PM
First of all thanks for sharing your notes. If you don't mind, I'm going to comment on some of the items, and with your permission I could also improve the official Kerrighed wiki :)
Of course! :)
I guess that you meant /etc/dhcp3/dhcpd.conf, right? Note that on newer distros (eg Debian Squeeze), it has even moved to /etc/dhcp/dhcpd.conf. The package has also been renamed isc-dhcp-server.
Doh! Yes, that's what I meant. (And new location noted.)
I have to admit that I don't understand the reason for this group declaration, since it contains no host declarations. I was told that it was working as it is written in the wiki, but I'm very much inclined to believe you :) To me the contents of this group declaration should be simply put in the subnet statement above. People needing more complex DHCP configurations will probably know how to create appropriate group or class declarations. Do you believe that there is some interest for this group declaration?
I don't know if the group declaration actually does anything either, now you mention it... My nodes boot fine without it (I just tried). On the other hand, moving the remaining two lines up to the subnet section doesn't appear to change anything, so maybe those lines do nothing.
You must have something in /etc/hosts.deny that matches your nodes and tftp. Otherwise, leaving /etc/hosts.allow empty should be enough to have tftpd-hpa working.
No, my hosts.deny is empty. Odd.
I think that the errors you mention are dpkg complaining that it is not able to log. This however does not prevent dpkg from correctly installing packages, so I personally don't mount /dev/pts. I can add this to the wiki though, if people can feel more comfortable this way.
That's true, it just gets rid of all those warnings. As does the LOCALES thing.
I wouldn't recommend disabling statd, especially when you mount NFS trees with file locking support (IIRC this is the default for mounts done from /etc/fstab). You might have issues because either /var/run or /var/lib/nfs is shared (each node must have its own ones, for instance using tmpfs mounts). /var/lib/nfs should be added to the /etc/fstab sample of the wiki, however it requires a bit of initialization. I don't know if nfsbooted has some helpers for this.
From what I read there's a bug in statd which means it ends up in a race with portmapper. As a result statd won't start and takes ages to time out. That happens three times during boot, which means it took my nodes around 18 minutes to boot, instead of 1. If it causes problems I'll put it back, or find some other work-around. I'll let you know.
Cheers!
baggzy
July 1st, 2011, 05:35 PM
Hi all! I was going to share my experiences with the compilation process for kerrighed 3.0.0, but it went without a hitch so there isn't really anything to say. :)
Sadly when I tried to boot I was hit with the same problem that a few others have had - an immediate crash where the screen fills with multi-coloured flashing text. This happens so fast I had to make a movie with my digital camera to see what happened. Basically the last message is "Trying to load: pxelinux.cfg/default ... ok" then it crashes. Presumably the crash happens as soon as it executes vmlinuz-2.6.30-krg3.0.0... See below for a couple of images from the movie. These are two consecutive frames, so it happens in less than 1/30th of a second.
This leaves us with little information on which to base an investigation. The only other threads I found on this (here (http://comments.gmane.org/gmane.linux.cluster.kerrighed.user/1216) and here (https://bbs.archlinux.org/viewtopic.php?id=110758)) offered no solution. Anyone know what the problem is? Or, even better, how to solve it?
The only clue I have to offer is that I successfully built and booted linux-2.6.30 from the same tarball that kerrighed is patching. So the problem isn't with debian, it's with the kerrighed patches...
baggzy
July 2nd, 2011, 11:38 AM
According to tftpd-hpa's documentation, when using --secure, option filename in the DHCP server configuration should not contain any path (eg. pxelinux.0, not /srv/tftp/pxelinux.0), and the file should of course live exactly in the daemon's directory (ie /srv/tftp). Have you tried setting up tftpd-hpa+dhcpd.conf this way?
Sorry, I forgot to reply to this one... I just tried it and yes that seems to work. :) Still getting the multicoloured screen of death though... :(
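To make the --secure point concrete: following Louis's description, the filename handed out by DHCP has to be relative to the TFTP directory. The sketch below converts an absolute path into the form that would go into dhcpd.conf. secure_boot_filename is a hypothetical helper written for this post; it only manipulates strings and does not touch tftpd-hpa or dhcpd.

```shell
#!/bin/sh
# Sketch: turn a path into the name a --secure TFTP setup would expect.
# secure_boot_filename is a hypothetical helper.
# Usage: secure_boot_filename TFTP_DIRECTORY PATH
secure_boot_filename() {
    dir=${1%/}      # TFTP directory without a trailing slash
    path=$2
    case $path in
        "$dir"/*)
            # Inside the TFTP directory: strip the directory prefix.
            rel=${path#"$dir"/}
            echo "$rel"
            ;;
        /*)
            # Absolute path elsewhere: the daemon could not serve it.
            echo "error: $path is outside $dir" >&2
            return 1
            ;;
        *)
            # Already relative: use as-is.
            echo "$path"
            ;;
    esac
}
```

For the setup in this thread, secure_boot_filename /srv/tftp /srv/tftp/pxelinux.0 prints pxelinux.0, which is the value to use for the filename option when TFTP_OPTIONS="--secure" is set.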
lrilling
July 4th, 2011, 09:54 AM
Hi baggzy,
Hi all! I was going to share my experiences with the compilation process for kerrighed 3.0.0, but it went without a hitch so there isn't really anything to say. :)
Sadly when I tried to boot I was hit with the same problem that a few others have had - an immediate crash where the screen fills with multi-coloured flashing text. This happens so fast I had to make a movie with my digital camera to see what happened. Basically the last message is "Trying to load: pxelinux.cfg/default ... ok" then it crashes. Presumably the crash happens as soon as it executes vmlinuz-2.6.30-krg3.0.0... See below for a couple of images from the movie. These are two consecutive frames, so it happens in less than 1/30th of a second.
This leaves us with little information on which to base an investigation. The only other threads I found on this (here (http://comments.gmane.org/gmane.linux.cluster.kerrighed.user/1216) and here (https://bbs.archlinux.org/viewtopic.php?id=110758)) offered no solution. Anyone know what the problem is? Or, even better, how to solve it?
The only clue I have to offer is that I successfully built and booted linux-2.6.30 from the same tarball that kerrighed is patching. So the problem isn't with debian, it's with the kerrighed patches...
The only "solution" found so far for this issue is to build Kerrighed from the git repository. As far as I know, it's no less stable than Kerrighed 3.0.0, so it's not risky. The big problem with this screen of death is that I couldn't reproduce it myself. Others tried with some debugging options enabled, but could not obtain consistent results :/ Anyway, it looks "solved" in git!
Thanks,
Louis
Cluster Penguin
July 10th, 2011, 02:26 PM
Never mind.
Bocho
August 1st, 2011, 05:14 PM
Hi All, and thanks for your posts and replies!
I hope this information will be useful for some people with "multicoloured screen" (I was also among them). "Therapy":
after you build your kernel and get the "multicoloured screen", rename it from "vmlinuz-2.6.30-krg3.0.0" to something else, e.g. "vmlinuz" (do not forget to update "default" in /srv/tftp/pxelinux.cfg/) ... and your kernel is alive.
And vice versa: if you want :) to see "multicoloured screen" - change the name of your working kernel to something else.
Works for me, has been tested several times.
-----------------------------------------------------
On a slightly different topic: since the dhcpd package has been renamed to isc-dhcp-server, I use this script to restart the services:
sudo restart portmap
sudo service isc-dhcp-server restart
sudo stop tftpd-hpa
sudo start tftpd-hpa
sudo service nfs-kernel-server restart
sudo exportfs -ra
Thank you baggzy for:
I moved the filename "/srv/tftp/pxelinux.0"; from the group block to the subnet block.
This saves my neurons :)
Best regards!
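For anyone trying the rename, here is a minimal sketch that regenerates pxelinux.cfg/default for whatever kernel filename is chosen, modelled on the Debian entry from baggzy's notes. pxe_default is a hypothetical helper and the label "krg" in the usage line is just an example. Whether the Kerrighed kernel needs an initrd is not something this thread settles, so that argument is optional.

```shell
#!/bin/sh
# Sketch: print a pxelinux.cfg/default for a given kernel filename.
# pxe_default is a hypothetical helper; it only writes to stdout.
# Usage: pxe_default LABEL KERNEL NFSROOT [INITRD]
pxe_default() {
    label=$1
    kernel=$2
    nfsroot=$3
    initrd=${4:-}
    # Same kernel arguments as the Debian example earlier in the thread.
    append="ip=dhcp root=/dev/nfs nfsroot=$nfsroot,rw rw"
    echo "DEFAULT $label"
    echo "LABEL $label"
    echo "kernel $kernel"
    if [ -n "$initrd" ]; then
        echo "append $append initrd=$initrd"
    else
        echo "append $append"
    fi
}
```

After renaming the kernel file in /srv/tftp, something like pxe_default krg vmlinuz 192.168.1.1:/NFSROOT/kerrighed | sudo tee /srv/tftp/pxelinux.cfg/default keeps the config and the filename in step.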
lrilling
August 6th, 2011, 11:04 AM
I hope this information will be useful for some people with "multicoloured screen" (I was also among them). "Therapy":
after you build your kernel and get the "multicoloured screen", rename it from "vmlinuz-2.6.30-krg3.0.0" to something else, e.g. "vmlinuz" (do not forget to update "default" in /srv/tftp/pxelinux.cfg/) ... and your kernel is alive.
And vice versa: if you want :) to see "multicoloured screen" - change the name of your working kernel to something else.
Works for me, has been tested several times.
Hi Bocho, this is an interesting point. I must mention that I almost never use "vmlinuz" as filename, more something like "2.6.20-SMP-64". But your observation makes me think that the multi-colored screen effect may be a bug around pxelinux. Maybe just a filename length issue...
Thanks,
Louis
Retrogamer95
August 14th, 2011, 05:55 PM
Lots of good information here. Once I get a few more workstations from the trash, I think I will try to make a cluster.
wbw
August 29th, 2011, 03:37 PM
I'm interested in making a Beowulf cluster and this thread looks dead on it.
However, I don't have a dedicated switch to use. All machines here are connected on a switch with a fixed IP, therefore if I install dhcp-server on one machine, it shouldn't harm the current network, right? Basically, what I want to know is if it's possible to perform the instructions from the guide the same way without breaking the current LAN?
Any help is appreciated. Thanks in advance.
ajt
August 29th, 2011, 03:49 PM
I'm interested in making a Beowulf cluster and this thread looks dead on it.
However, I don't have a dedicated switch to use. All machines here are connected on a switch with a fixed IP, therefore if I install dhcp-server on one machine, it shouldn't harm the current network, right? Basically, what I want to know is if it's possible to perform the instructions from the guide the same way without breaking the current LAN?
Any help is appreciated. Thanks in advance.
Hi, wbw.
Your network administrator (if that's not you) will not be happy about you putting another DHCP server on the LAN if one is running already. It's likely that one of your switches is the DHCP server if no other machine is doing it. However, this is like NOTHING in comparison to the deeply anti-social aspect of running a Beowulf on your LAN because it's really easy to saturate the network with traffic between Beowulf nodes.
That's why a Beowulf normally has a private LAN...
You can get a second-hand switch inexpensively off eBay ;-)
HTH,
Tony.
wbw
August 29th, 2011, 05:03 PM
Hi, wbw.
Your network administrator (if that's not you) will not be happy about you putting another DHCP server on the LAN if one is running already. It's likely that one of your switches is the DHCP server if no other machine is doing it. However, this is like NOTHING in comparison to the deeply anti-social aspect of running a Beowulf on your LAN because it's really easy to saturate the network with traffic between Beowulf nodes.
That's why a Beowulf normally has a private LAN...
You can get a second-hand switch inexpensively off eBay ;-)
HTH,
Tony.
Thanks for the fast answer!
Well, I'm not sure if there is a DHCP server already running, since all connections here use static IPs only. I understand the point about saturating the LAN, but I was planning on using only two machines for testing purposes. If it works, we will probably be allowed to get a shiny new gigabit switch with some boxes. However, we will only get new "toys" if we prove it will work without any extra expense (oh, and probably without causing an incident on the LAN). =p
Considering the nature of this experiment (really few nodes, computation tasks only to test the process distribution), do you think it can still be problematic for the local LAN?
Thanks again, regards.
EDIT: Is it possible to do it with a router, or does it have to be a switch? With a router I could test at home with a desktop and a laptop, avoiding possible conflicts.
quequotion
October 2nd, 2011, 01:13 PM
EDIT: Is it possible to do it with a router, or does it have to be a switch? With a router I could test at home with a desktop and a laptop, avoiding possible conflicts.
If your machines have multiple (at least two on the main node, one on a single client node works, two on client nodes if there are 2+) ethernet ports, you can create a private LAN without a separate router or switch. Static IPs would be the way to go in this case.
lrilling
October 3rd, 2011, 04:33 AM
If your machines have multiple (at least two on the main node, one on a single client node works, two on client nodes if there are 2+) ethernet ports, you can create a private LAN without a separate router or switch. Static IPs would be the way to go in this case.
The basic configuration of TIPC (on which Kerrighed's kernel-to-kernel communications and node discovery rely) requires broadcast ethernet for node discovery though. It's probably possible to tune TIPC to use a router instead, but this is becoming tricky :)
Thanks,
Louis
quequotion
October 9th, 2011, 08:50 PM
The basic configuration of TIPC (on which Kerrighed's kernel-to-kernel communications and node discovery rely) requires broadcast ethernet for node discovery though. It's probably possible to tune TIPC to use a router instead, but this is becoming tricky :)
Thanks,
Louis
LOL, I misread. I thought wbw wanted to do it without a router....
liangrx06
October 17th, 2011, 09:22 AM
Hi All, and thanks for your posts and replies!
I hope this information will be useful for some people with "multicoloured screen" (I was also among them). "Therapy":
after you build your kernel and get the "multicoloured screen", rename it from "vmlinuz-2.6.30-krg3.0.0" to something else, e.g. "vmlinuz" (do not forget to update "default" in /srv/tftp/pxelinux.cfg/) ... and your kernel is alive.
And vice versa: if you want :) to see "multicoloured screen" - change the name of your working kernel to something else.
Works for me, has been tested several times.
-----------------------------------------------------
On a slightly different topic: since the dhcpd package has been renamed to isc-dhcp-server, I use this script to restart the services:
sudo restart portmap
sudo service isc-dhcp-server restart
sudo stop tftpd-hpa
sudo start tftpd-hpa
sudo service nfs-kernel-server restart
sudo exportfs -ra
Thank you baggzy for:
This saves my neurons :)
Best regards!
Hello!
I'm having trouble building a kerrighed 3.0.0 cluster.
I know from this page that you have successfully built it.
Can you help me? I mean, can you send me a detailed installation guide?
Thank you very much!
lrilling
October 19th, 2011, 05:20 AM
Hello!
I'm having trouble building a kerrighed 3.0.0 cluster.
I know from this page that you have successfully built it.
Can you help me? I mean, can you send me a detailed installation guide?
Thank you very much!
Hello,
(Just curious) if you are following the official guide (http://kerrighed.sourceforge.net/wiki/index.php/UserDoc) from kerrighed.org, which step is failing?
Thanks,
Louis
DB1177
December 7th, 2011, 07:36 AM
Is this possible with ubuntu 11.1? I'm new to all of this so I apologize if this is a silly question.
magn0x
April 18th, 2012, 07:19 AM
DB1177, there is no reason, in principle, why clustering will not work with Ubuntu 11.10. However, I'm currently running into difficulties getting it to work. I'm basing my set up on BigJimJams - UbuntuKerrighedClusterGuide created on 26/02/2009. The DHCP server appears to work but, it now appears to be called isc-dhcp-server rather than dhcp3-server. I appear to have set up the TFTP server, but, I'm not confident the bootable filesystem is correctly setup. I get some error messages when using:
# debootstrap --arch i386 oneiric /nfsroot/kerrighed http://archive.ubuntu.com/ubuntu/
Even so, the file system appears to be created in the /nfsroot/kerrighed directory. But, the node1 is not booting over the LAN from the filesystem on the head node.
ajt
April 18th, 2012, 07:49 AM
DB1177, there is no reason, in principle, why clustering will not work with Ubuntu 11.10. However, I'm currently running into difficulties getting it to work. I'm basing my set up on BigJimJams - UbuntuKerrighedClusterGuide created on 26/02/2009. The DHCP server appears to work but, it now appears to be called isc-dhcp-server rather than dhcp3-server. I appear to have set up the TFTP server, but, I'm not confident the bootable filesystem is correctly setup. I get some error messages when using:
# debootstrap --arch i386 oneiric /nfsroot/kerrighed http://archive.ubuntu.com/ubuntu/
Even so, the file system appears to be created in the /nfsroot/kerrighed directory. But, the node1 is not booting over the LAN from the filesystem on the head node.
Hi, magn0x.
Have you monitored the DHCP server to see if it is responding?
tail -f /var/log/daemon.log | fgrep DHCP
HTH,
Tony.
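If the live tail scrolls past too quickly, a per-message-type count of the same log can show at a glance whether the server ever answered. This is only a sketch: dhcp_summary is a made-up helper name, and it assumes the ISC dhcpd log lines contain the usual message names (DHCPDISCOVER, DHCPOFFER, DHCPREQUEST, DHCPACK).

```shell
#!/bin/sh
# Sketch: count DHCP message types in log lines read from stdin.
# dhcp_summary is a hypothetical helper, not part of the dhcp packages.
dhcp_summary() {
    # Extract the message type from each line, then count each type
    # and print it as "TYPE COUNT".
    grep -o 'DHCP[A-Z]*' | sort | uniq -c | awk '{ print $2, $1 }'
}
```

It would be run as dhcp_summary < /var/log/daemon.log. Seeing DHCPDISCOVER counted with no DHCPOFFER would suggest the server hears the node but is not answering it.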
magn0x
April 18th, 2012, 08:16 AM
Hi ajt,
Thanks, and nice one. Even though ifconfig indicates that the correct IP address has been allocated to eth0 on node1, I find 'No such file exists', so it's back to the beginning for me. I need to think about this.
kind regards,
magn0x
ajt
April 18th, 2012, 08:50 AM
Hi ajt,
Thanks, and nice one. Even though ifconfig indicates that the correct IP address has been allocated to eth0 on node1, I find 'No such file exists', so it's back to the beginning for me. I need to think about this.
kind regards,
magn0x
Hi, magn0x.
The 'No such file exists' error suggests that you've not set up TFTP properly. Bear in mind that pxelinux.0 should be in:
/var/lib/tftpboot/
And your DHCP server config should be something like:
subnet 192.168.0.0 netmask 255.255.255.0 {
option routers 192.168.0.254;
filename "pxelinux.0";
option root-path "192.168.0.254:/var/lib/kerrighed/image";
}
HTH,
Tony.
magn0x
April 18th, 2012, 10:13 AM
Hi Tony,
Yes, the first error was that LABEL in pxelinux.cfg/default was spelt LABLE. :mad:
Code:
/var/lib/tftpboot/pxelinux.0 OK :p
But, I need to review DHCP server config which is more like; #-o
Code:
subnet 192.168.2.0 netmask 255.255.255.0 {
option routers 192.168.2.1;
option broadcast-address 192.168.2.255;
}
group {
filename "pxelinux.0";
option root-path "192.168.2.1:/nfsroot/kerrighed";
host magn0x1 {
hardware ethernet nn:nn:nn:nn:nn:nn;
fixed-address 192.168.2.101;
}
host magn0x2 {
hardware ethernet yy:yy:yy:yy:yy:yy;
fixed-address 192.168.2.102;
}
server-name "kerrighedserver";
next-server 192.168.2.1;
}
-------------------------------
First node on booting reports:
Trying to load: pxelinux.cfg/default OK
No DEFAULT or UI configuration directive found!
boot:
Steve
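Since a misspelt keyword (LABLE) already caused one failure here, a quick sanity check of pxelinux.cfg/default might save a reboot cycle. The sketch below only looks for the presence of a DEFAULT or UI line and a LABEL line. pxe_cfg_check is a hypothetical helper, and it does not explain why pxelinux reported the missing directive in this particular case.

```shell
#!/bin/sh
# Sketch: basic sanity check of a pxelinux config file.
# pxe_cfg_check is a hypothetical helper, not part of syslinux.
# Usage: pxe_cfg_check /path/to/pxelinux.cfg/default
pxe_cfg_check() {
    cfg=$1
    status=0
    # Keywords are matched case-insensitively at the start of a line.
    if ! grep -Eiq '^[[:space:]]*(default|ui)[[:space:]]' "$cfg"; then
        echo "no DEFAULT or UI directive in $cfg"
        status=1
    fi
    if ! grep -Eiq '^[[:space:]]*label[[:space:]]' "$cfg"; then
        echo "no LABEL directive in $cfg (check the spelling)"
        status=1
    fi
    return "$status"
}
```

It is also worth confirming that the file being checked is the one the TFTP server actually serves, i.e. the pxelinux.cfg/default under the TFTP directory (/var/lib/tftpboot in this setup).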
tehowe
April 19th, 2012, 02:34 AM
Since this is the 'easy' thread :)
Has anyone tried a kerrighed node on Precise yet? This looks like it would be fun/frustrating to learn about, even between say a notebook and a desktop as an exercise. That's assuming you can use a commercial router (DD-WRT?) as the go-between. It'll be interesting to read back through the thread and see how it's done...
magn0x
April 19th, 2012, 05:05 AM
I've discovered that to set up the Ubuntu 11.10 image to be booted from the nfs server I require nfsbooted. This is not available in the Ubuntu repositories. It appears to have been in the Debian repositories at version (0.0.15) unstable, but, has since been removed. While it can still be found it appears to be a nightmare to run; requiring various prerequisites.
Has anyone any alternative suggestions to prepare the image for booting from the nfs server :?:
NB The latest version of Kerrighed is 3.0.0 see http://www.kerrighed.org/wiki/index.php/Main_Page.
Since this is based on Linux 2.6.30 it may prove difficult to run on Ubuntu 11.10 (Linux 3.0.0-17-generic)