HetCluster
The HetCluster is a cluster of PC nodes used for small-scale numerical
computations in research and education by members of the Theory Group
of the Physics Department of the National Technical University of
Athens. You can see current and past scientific and educational
activity on the cluster here.
Table of Nodes
Node | CPU (GHz) | Cache (MB) | RAM (GB) | M/B | Other | User | Notes
het1 * | 3.0 (PIV) | 1 | 2 | ASUS P4C800-E | HD250G | | RP: ILB 0910; CA 0411
het2 | 3.0 (PIV) | 1 | 2 | ASUS P4C800-E | HD80G | | RP: ILB 0910; RP: ILB 0607 Fan; CA 0411
het3 | 3.0 (PIV) | 1 | 2 | ASUS P4C800-E | HD80G | | RP: ILB 0910; CA 0411
het4 * | 3.0 (PIV) | 1 | 2 | ASUS P4C800-E | HD80G | | RP: ILB 0910; CA 0411
het5 | 3.0 (PIV) | 1 | 1 | ASRock P4V88 | HD80G | | ILB 0507
het6 | 3.0 (PIV) | 1 | 1 | ASRock P4V88 | HD80G | | RP: ILB 0910; ILB 0507
het7 | 2.4+2.4 (Dual Xeon) | 0.5 | 2 | Intel SE7500CW2 | HD80G+HD80G | | CA 0211
het8 | 3.0 (PIV) | 2 | 0.5 | ASRock 755V88 | HD80G | | RP: ILB 0607 RAM; UP: ILB 0512 M/B+CPU+VGA+RAM; CA 0211
het9 | 2.5 (PIV) | 0.5 | 0.5 | ASRock P4i65G | HD80G | | RP: ILB 0610 M/B+RAM; CA 0211
het10 | 2.5 (PIV) | 0.5 | 0.5 | ASUS P4T533-C | HD80G | | UP: ILB 0512 PS; CA 0211
het11 * | 2.5 (PIV) | 0.5 | 0.5 | ASUS P4T533-C | HD80G | | RP: ILB 0910; CA 0211
het12 | 2.5 (PIV) | 0.5 | 0.5 | ASRock P4i65G | HD80G | | RP: ILB 0610 M/B+RAM+VGA; CA 0211
het13 | 2.5 (PIV) | 0.5 | 0.5 | ASUS P4T533-C (?) | HD80G | | RP: ILB 0605 M/B+VGA; RP: ILB 0610 PS; CA 0211
het14 | 3.0 (PIV) | 2 | 0.5 | ASRock 755V88 | HD80G | | UP: ILB 0512 M/B+CPU+RAM; CA 0211
het15 * | 2.5 (PIV) | 0.5 | 0.5 | ASUS P4T533-C (?) | HD80G | | RP: ILB 0605 M/B; UP: ILB 0512 PS; CA 0211
het16 | 3.0 (PIV) | 2 | 0.5 | ASRock 755V88 | HD80G | | RP: ILB 0910; UP: ILB 0602 M/B+CPU+RAM+VGA; CA 0211
het17 | 3.4 (P D 945) | 2 | 1.0 (DDR2 @400MHz) | Asus P5VD2-VM | HD160G | | ILB 0702
het18 | 3.4 (P D 945) | 2 | 1.0 (DDR2 @533MHz) | Asus P5VD2-VM | HD160G | | ILB 0702
het19 | 3.4 (P D 945) | 2 | 1.0 (DDR2 @533MHz) | Asus P5VD2-VM | HD160G | | ILB 0702
het20 | 3.4 (P D 945) | 2 | 1.0 (DDR2 @533MHz) | Asus P5VD2-VM | HD160G | | ILB 0702
het21 | 3.0 (C2 Duo) | 6 | 4.0 (DDR2 @800MHz) | Asus P5K-VM | HD250G | | ILB 0805
het22 | 3.0 (C2 Duo) | 6 | 4.0 (DDR2 @800MHz) | Asus P5K-VM | HD250G | | ILB 0805
het23 | 3.0 (C2 Duo) | 6 | 4.0 (DDR2 @800MHz) | Asus P5K-VM | HD250G | | ILB 0805
het24 | 3.0 (C2 Duo) | 6 | 4.0 (DDR2 @800MHz) | Asus P5K-VM | HD250G | | ILB 0805
het25 | 3.0 (C2 Duo) | 6 | 4.0 (DDR2 @800MHz) | Asus P5K-VM | HD250G | | ILB 0805
het26 | 3.0 (C2 Duo) | 6 | 4.0 (DDR2 @800MHz) | Asus P5K-VM | HD250G | | ILB 0805
het27 | 3.0 (C2 Duo) | 6 | 4.0 (DDR2 @800MHz) | Asus P5K-VM | HD250G | | ILB 0805
het28 | 3.0 (C2 Duo) | 6 | 4.0 (DDR2 @800MHz) | Asus P5K-VM | HD250G | | ILB 0805
het29 | 3.0 (C2 Duo) | 6 | 4.0 (DDR2 @1872MHz) | Asus P5K-VM | HD250G | | ILB 0805
het30 | 3.0 (C2 Duo) | 6 | 4.0 (DDR2 @1872MHz) | Asus P5K-VM | HD250G | | ILB 0805
het31 | 3.0 (C2 Duo) | 6 | 4.0 (DDR2 @1872MHz) | Asus P5K-VM | HD250G | | ILB 0805
het32 | 3.0 (C2 Duo) | 6 | 4.0 (DDR2 @1872MHz) | Asus P5K-VM | HD250G | | ILB 0805
het33 | 3.0 (C2 Duo) | 6 | 4.0 (DDR2 @1872MHz) | Asus P5K-VM | HD250G | | ILB 0805
het34 | 3.0 (C2 Duo) | 6 | 4.0 (DDR2 @1872MHz) | Asus P5K-VM | HD250G | | ILB 0805
het35 | 3.0 (C2 Duo) | 6 | 4.0 (DDR2 @1872MHz) | Asus P5K-VM | HD250G | | ILB 0805
het36 | 3.07 (i7) | 8 | 12.0 (@1066MHz) | ASUSTeK P6T SE | HD500G+HD500G | farakos | PPS 0910
het37 | 2.33 (Quad) | 2 | 4.0 (DDR2 @800MHz) | Dell 0M858N (Optiplex 760) | HD250G | farakos | PPS 0910
het38 | 2.33 (Quad) | 2 | 4.0 (DDR2 @800MHz) | Dell 0M858N (Optiplex 760) | HD250G | farakos | PPS 0910
Abbreviations:
PIV = Intel Pentium 4, P D 945 = Intel Pentium D 945, C2 Duo = Intel Core 2 Duo, Quad = Intel Core 2 Quad, i7 = Intel Core i7
UP = Upgrade, RP = Replace/Repair, HWP = Hardware Problem
M/B = Motherboard, HD = Hard Disk, PS = Power Supply, VGA = Video Card
CA = California Computers, ILB = Infolab, PPS = Papasavvas
Notes:
- Nodes marked with * are non-operational; see the Notes column for an explanation.
News
- 100223: het15 dead; het7 grub problem; het11 boot starts then blank
screen; het4 needs a screen; het1 needs network configuration (screen).
- 091214: het36,37,38 added to cluster
- 091021: het1,2,3,4,6,11,16 repaired by ILB: power supplies (4) and a
motherboard battery replaced, BIOS repaired. het1: configure the
network; het11: check with a monitor.
- 091015: het1,2,3,4,6,11,16 to ILB for
repair. het9 back to the network, no video.
- 091006: het1,2,3,4,6,11,16 dead, no
power. het9 not seen in network, het8 needs checking (starts after restart).
- 090327: Cluster finally moved to the
basement. het1 has no power.
- 080530: het21-34 available for testing. Acquired het29-het35. BIOS
setting: het21-28 have the RAM frequency set to auto (lshw: 800MHz??)
and het29-34 have the RAM frequency set to DDR2@1067.
- 080528: het21-28 available for
testing. Acquired het26-het28.
- 080520: Acquired het21-25 from Infolab (Asus P5K-VM, Core 2 Duo
E8400 @ 3.00GHz, 64KB L1 cache, 6MB L2 cache, RAM DDR2 4x1GB/1066MHz
Kingston KHX8500D2, HD 250GB ATA WD WD2500AAKS-0, DVD-ROM DDU1671S).
Acquired a KVM-0831 and a 3COM Baseline Switch 2016.
- 080107: het3: no power. het16: power lamp on, fan off. het11: no
inittab (Linux complains); on restart, DVD light on, red light only,
nothing happens. Disk? M/B?
- 071106: Slow network on the small hub corrected: port no. 16 on the
large hub was defective.
- 070828: het16 has a hardware problem: it does not turn on (green
lamp on) and the PS fan does not start.
- 070727: het11 has unknown hardware problems, probably after a power
cut. Probably the hard disk (starts the BIOS then stops).
- 070521: het15 and het13 have been repaired (M/B+VGA). het3 is simply
offline (no hub available).
- 070418: het15 has hardware problems; taken to ILB. het17-het20 had a
short in the reset-button connection (problem at startup, no power);
repaired by ILB.
- 070328: het13 has hardware problems. het17-het20 are available to
all; the software is at a testing phase, so check your programs for
correctness and efficiency. het2 had a corrupt filesystem; the OS was
reinstalled and old data erased.
- 070227: The new het17-het20 have been successfully added to the
cluster. They will soon be available to all.
- 070116: het12 and het8 are back. het13 has hardware problems. het2
most likely has disk damage. Need to reinstall the OS?
- 061106: het12 is back: M/B ASUS P4T533-C -> ASRock P4i65G; RAM and
video card replaced. See BUG: Soft lockup on CPU#0. Reinstall Fedora?
- 061026: het9 is back: M/B ASUS P4T533-C
-> ASRock P4i65G and RAM replaced.
- 061017: het13 is back: Power supply replaced
- 061016: het12 and het13 have hardware
problems (power supply?short?).
- 060919: het9 has hardware problems
- 060719: Replaced fan (broken) on
het2. Replaced memory (single chip) on het8. Repairing het8 filesystem.
- 060717: Node het15 is back. Nodes het2 and het8 have hardware problems.
- 060330: Node het16 is back for users.
- 060303: Node het16 is back. Software
upgrade in process.
- 060210: Node het16 is off due to hardware
problems.
- 060131: New node het16 added to the cluster.
- 051228: het8 is back. New video card: GeForce FX5200 AGP8X 128MB with TV-out.
- 051223: het10 is back: new power supply
- 051221: het8 and het10 are out: VGA card
and power supply respectively.
- 051221: New nodes on the cluster: het14
and het15
- 051221: het2, het9 and het12 are back. het9 and het12 should be
regarded as potentially unstable.
General Information
The HetCluster is a cluster of PCs connected to the network via 10/100
Ethernet. Each node is independent and runs Linux (the Fedora Core 4
and 6 and the Ubuntu Server 8.04 and 9.10 distributions). For
information on how to obtain an account and on the existing software,
contact the administrator. You may log in to a node via ssh at the
address node.physics.ntua.gr. For the moment there is no server and
the filesystems are independent. This may change in the future: a
common filesystem may be created on het7 and mounted on all other
nodes on /data via NFS. Some observed instabilities of NFS have made
us delay this option.
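For example, to log in to node het3 (replace myuser with the account
name given to you by the administrator; it is only a placeholder here):
ssh myuser@het3.physics.ntua.gr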
You may use ssh to submit jobs on remote machines. This has been set
up so that no password is asked between het nodes. A prototype script
for job submission/remote command execution that checks node occupancy
is /usr/local/bin/rtop. Use it to check the load of each machine
before you decide to submit your jobs.
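As an illustration of how such node-monitoring helpers can be put
together, the following is only a minimal sketch (it assumes
passwordless ssh between the nodes and is not the actual
/usr/local/bin/rtop installed on the cluster):
#!/bin/bash
# Minimal rtop-like sketch (illustrative only, not the installed script):
# print the load average and a snapshot of top for each node given as an
# argument, or for het1..het38 if no argument is given.
nodes="$@"
[ -z "$nodes" ] && nodes=$(seq -f "het%g" 1 38)
for n in $nodes; do
    echo "===== $n ====="
    ssh -o ConnectTimeout=5 "$n" "uptime; top -b -n 1 | head -15" 2>/dev/null \
        || echo "$n: not reachable"
done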
Rules for Using the Cluster
Please observe the following rules, and in case they cannot
accommodate your needs contact the administrator.
- Long jobs should be breakable into jobs of 8-12 hours at most. If you
have not done this until now, it is a good idea to learn how. Jobs must
be movable to other nodes on less than a day's notice should it become
necessary.
- Respect other people's need for computer time: benchmark your code,
optimize it as well as possible (especially long jobs), and use the
best compiler (in our case the Intel compilers icc and ifort seem to
do the best job). Get advice if you are not sure; the administrator
may be able to help.
- Each person will be given priority on a number of nodes depending on
their needs. For this you should contact the administrator, stating
your needs in total CPU time on each node (normalized approximately to
a 3.0 GHz PIV processor), memory and number of nodes. Total CPU time
can be "infinite", but higher priority will be given to shorter jobs.
- When you do not need the nodes any more, notify the administrator.
Users that lock nodes unnecessarily will be blacklisted.
- Everybody will be allowed to run on any node given that:
  - the node is empty and not allocated to anyone and the job will not
    run for more than 2 days, or
  - the job is submitted at nice level 20 (with /usr/bin/nice, not the
    shell built-in, please), and
  - there are no memory conflicts with the needs of the primary user.
  - If the primary user complains, the job will be stopped (not killed)
    and the owner has to restart it manually when the node is free, or
    kill it (learn about the kill command, especially kill -STOP and
    kill -CONT; a short example is given after this list).
- Notify the administrator of anything abnormal on the nodes you are using.
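A few illustrative commands for the points above (a.out and the PID
12345 are placeholders; note that on Linux the maximum niceness is 19,
so larger adjustments are clamped to the lowest priority):
/usr/bin/nice -n 20 ./a.out >& log &   (start the job at the lowest priority)
kill -STOP 12345                       (suspend the job with process id 12345)
kill -CONT 12345                       (resume it when the node is free again)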
Tips for Using the Cluster
ssh gives great flexibility for automating remote operations on the
nodes. You will benefit greatly from learning how to use it. Some
examples are given below.
- Monitor the usage of all nodes: Helpful scripts (available on the
nodes) are rtop, rload and ruse. rtop gives a snapshot of the command
top on each node, rload reports the load averages, and ruse the load
averages and the most important jobs. Each command can take as
arguments the names of one or more hosts in order to report only on
them, e.g.
rload (reports on all nodes)
rtop het3 het15 het4 het6
rload 3 5 7 9 15
ruse 3 het7 het9 15 het14 3
- nice your jobs to the required level using the command
/usr/bin/nice, e.g.
/usr/bin/nice -n 20 a.out >& log &
/usr/bin/nice -n 8 a.out >& log &
- Copy files: Use the script rcopy to copy files to the exact same
location in the filesystem of all or some of the cluster nodes. Use
relative, not absolute, links, e.g.:
rcopy file1 dir1 (copies the file file1 and the directory dir1 to all
nodes - but not to the one you are on now!)
rcopy -n 2 -n het7 -n 12 file1 file2
(copies the files file1, file2 to het2, het7, het12)
- Submit Jobs: The possibilities are endless. One example is the
script rexec, with which I submit a job sign.com, passing parameters
to it via flags, simply by editing the table at the top of the
script. A simpler example is rrun, where by giving the command:
rrun run.het1 run.het2 run.het3
I submit the jobs run.hetn on the nodes hetn (a minimal sketch of such
a script is given at the end of this section).
- Transfer Data: The GNU utilities tar and find, together with scp
and/or ssh, can work wonders:
scp -r -p het2:data . (copies the whole directory ~/data and its
contents from het2 to the current node)
ssh het2 "tar cfp - data" | tar xfvp -
(does the same using the capabilities of tar)
More complicated tasks can be handled by scripts that keep data
directories up to date. An example is the script get-new-data-remote,
which is used together with the script get-new-data; a rough sketch of
this kind of script is given below.
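As a sketch of what such a script may look like (the directory data
and the one-day window are only examples; this is not the actual
get-new-data-remote script), find and tar can be combined over ssh:
#!/bin/bash
# Illustrative sketch only: fetch from a remote node all files under data/
# that were modified during the last day, preserving paths and permissions.
node=${1:-het2}
ssh "$node" 'find data -type f -mtime -1 -print0 | tar -c --null -T - -f -' | tar xvpf -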
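Finally, a minimal sketch of an rrun-like submitter (again only an
illustration; it assumes the job scripts exist at the same path on
every node and that the remote shell is bash, and it is not the
installed rrun script):
#!/bin/bash
# Illustrative sketch only: for each argument of the form run.hetN,
# start it on node hetN at low priority and collect its output in run.hetN.log.
for job in "$@"; do
    node=${job##*.}                 # e.g. run.het3 -> het3
    ssh "$node" "cd $PWD && /usr/bin/nice -n 20 ./$job >& $job.log" &
done
wait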