HetCluster
The HetCluster is a cluster of PC nodes used for small-scale numerical
computations in research and education by members of the Theory Group
of the Physics Department of the National Technical University of
Athens. You can see current and past scientific and educational
activity on the cluster here.
Table of Nodes
Node | CPU (GHz) | Cache (MB) | RAM (GB) | M/B | Other | User | Notes
het1 * | 3.0 (PIV) | 1 | 2 | ASUS P4C800-E | HD250G | | RP: ILB 0910; CA 0411
het2 | 3.0 (PIV) | 1 | 2 | ASUS P4C800-E | HD80G | | RP: ILB 0910; RP: ILB 0607 Fan; CA 0411
het3 | 3.0 (PIV) | 1 | 2 | ASUS P4C800-E | HD80G | | RP: ILB 0910; CA 0411
het4 * | 3.0 (PIV) | 1 | 2 | ASUS P4C800-E | HD80G | | RP: ILB 0910; CA 0411
het5 | 3.0 (PIV) | 1 | 1 | ASRock P4V88 | HD80G | | ILB 0507
het6 | 3.0 (PIV) | 1 | 1 | ASRock P4V88 | HD80G | | RP: ILB 0910; ILB 0507
het7 | 2.4+2.4 (Dual Xeon) | 0.5 | 2 | Intel SE7500CW2 | HD80G+HD80G | | CA 0211
het8 | 3.0 (PIV) | 2 | 0.5 | ASRock 755V88 | HD80G | | RP: ILB 0607 RAM; UP: ILB 0512 M/B+CPU+VGA+RAM; CA 0211
het9 | 2.5 (PIV) | 0.5 | 0.5 | ASRock P4i65G | HD80G | | RP: ILB 0610 M/B+RAM; CA 0211
het10 | 2.5 (PIV) | 0.5 | 0.5 | ASUS P4T533-C | HD80G | | UP: ILB 0512 PS; CA 0211
het11 * | 2.5 (PIV) | 0.5 | 0.5 | ASUS P4T533-C | HD80G | | RP: ILB 0910; CA 0211
het12 | 2.5 (PIV) | 0.5 | 0.5 | ASRock P4i65G | HD80G | | RP: ILB 0610 M/B+RAM+VGA; CA 0211
het13 | 2.5 (PIV) | 0.5 | 0.5 | ASUS P4T533-C (?) | HD80G | | RP: ILB 0605 M/B+VGA; RP: ILB 0610 PS; CA 0211
het14 | 3.0 (PIV) | 2 | 0.5 | ASRock 755V88 | HD80G | | UP: ILB 0512 M/B+CPU+RAM; CA 0211
het15 * | 2.5 (PIV) | 0.5 | 0.5 | ASUS P4T533-C (?) | HD80G | | RP: ILB 0605 M/B; UP: ILB 0512 PS; CA 0211
het16 | 3.0 (PIV) | 2 | 0.5 | ASRock 755V88 | HD80G | | RP: ILB 0910; UP: ILB 0602 M/B+CPU+RAM+VGA; CA 0211
het17 | 3.4 (P D 945) | 2 | 1.0 (DDR2 @400MHz) | Asus P5VD2-VM | HD160G | | ILB 0702
het18 | 3.4 (P D 945) | 2 | 1.0 (DDR2 @533MHz) | Asus P5VD2-VM | HD160G | | ILB 0702
het19 | 3.4 (P D 945) | 2 | 1.0 (DDR2 @533MHz) | Asus P5VD2-VM | HD160G | | ILB 0702
het20 | 3.4 (P D 945) | 2 | 1.0 (DDR2 @533MHz) | Asus P5VD2-VM | HD160G | | ILB 0702
het21 | 3.0 (C2 Duo) | 6 | 4.0 (DDR2 @800MHz) | Asus P5K-VM | HD250G | | ILB 0805
het22 | 3.0 (C2 Duo) | 6 | 4.0 (DDR2 @800MHz) | Asus P5K-VM | HD250G | | ILB 0805
het23 | 3.0 (C2 Duo) | 6 | 4.0 (DDR2 @800MHz) | Asus P5K-VM | HD250G | | ILB 0805
het24 | 3.0 (C2 Duo) | 6 | 4.0 (DDR2 @800MHz) | Asus P5K-VM | HD250G | | ILB 0805
het25 | 3.0 (C2 Duo) | 6 | 4.0 (DDR2 @800MHz) | Asus P5K-VM | HD250G | | ILB 0805
het26 | 3.0 (C2 Duo) | 6 | 4.0 (DDR2 @800MHz) | Asus P5K-VM | HD250G | | ILB 0805
het27 | 3.0 (C2 Duo) | 6 | 4.0 (DDR2 @800MHz) | Asus P5K-VM | HD250G | | ILB 0805
het28 | 3.0 (C2 Duo) | 6 | 4.0 (DDR2 @800MHz) | Asus P5K-VM | HD250G | | ILB 0805
het29 | 3.0 (C2 Duo) | 6 | 4.0 (DDR2 @1872MHz) | Asus P5K-VM | HD250G | | ILB 0805
het30 | 3.0 (C2 Duo) | 6 | 4.0 (DDR2 @1872MHz) | Asus P5K-VM | HD250G | | ILB 0805
het31 | 3.0 (C2 Duo) | 6 | 4.0 (DDR2 @1872MHz) | Asus P5K-VM | HD250G | | ILB 0805
het32 | 3.0 (C2 Duo) | 6 | 4.0 (DDR2 @1872MHz) | Asus P5K-VM | HD250G | | ILB 0805
het33 | 3.0 (C2 Duo) | 6 | 4.0 (DDR2 @1872MHz) | Asus P5K-VM | HD250G | | ILB 0805
het34 | 3.0 (C2 Duo) | 6 | 4.0 (DDR2 @1872MHz) | Asus P5K-VM | HD250G | | ILB 0805
het35 | 3.0 (C2 Duo) | 6 | 4.0 (DDR2 @1872MHz) | Asus P5K-VM | HD250G | | ILB 0805
het36 | 3.07 (i7) | 8 | 12.0 (@1066MHz) | ASUSTeK P6T SE | HD500G+HD500G | farakos | PPS 0910
het37 | 2.33 (Quad) | 2 | 4.0 (DDR2 @800MHz) | Dell 0M858N (Optiplex 760) | HD250G | farakos | PPS 0910
het38 | 2.33 (Quad) | 2 | 4.0 (DDR2 @800MHz) | Dell 0M858N (Optiplex 760) | HD250G | farakos | PPS 0910
Abbreviations:
PIV = Intel Pentium 4, P D 945 = Intel Pentium D 945, C2 Duo = Intel Core 2 Duo, Quad = Intel Core 2 Quad, i7 = Intel Core i7
UP = Upgrade, RP = Replace/Repair, HWP = Hardware Problem
M/B = Motherboard, HD = Hard Disk, PS = Power Supply, VGA = Video Card
CA = California Computers, ILB = Infolab, PPS = Papasavvas
Notes:
- Nodes marked with * are non-operational; see the Notes column for an explanation.
News
- 100223: het15 dead; het7 grub problem; het11 boot starts then blank
screen; het4 needs a screen; het1 needs network configuration (screen).
- 091214: het36,37,38 added to cluster
- 091021: het1,2,3,4,6,11,16 repaired by ILB: power supplies (4) and a
motherboard battery replaced, BIOS repaired. het1: configure the
network; het11: check with a monitor.
- 091015: het1,2,3,4,6,11,16 to ILB for
repair. het9 back to the network, no video.
- 091006: het1,2,3,4,6,11,16 dead, no
power. het9 not seen in network, het8 needs checking (starts after restart).
- 090327: Cluster finally moved to the
basement. het1 has no power.
- 080530: het21-34 available for testing. Acquired het29-het35. BIOS
setting: het21-28 have the RAM frequency set to auto (lshw: 800MHz??)
and het29-34 have the RAM frequency set to DDR2@1067.
- 080528: het21-28 available for
testing. Acquired het26-het28.
- 080520: Acquired het21-25 from Infolab (Asus P5K-VM, Core 2 Duo
E8400 @ 3.00GHz, 64KB L1 cache, 6MB L2 cache, RAM DDR2 4x1GB/1066MHz
Kingston KHX8500D2, HD 250GB ATA WD WD2500AAKS-0, DVD-ROM DDU1671S).
Acquired a KVM-0831 and a 3COM Baseline Switch 2016.
- 080107: het3: no power. het16: power lamp on, fan off. het11: no
inittab (Linux complains); on restart, DVD light on, red light only,
nothing happens. Disk? M/B?
- 071106: Slow network on the small hub corrected: port no. 16 on the
large hub was defective.
- 070828: het16 has a hardware problem: it does not turn on (green
lamp on) and the PS fan does not start.
- 070727: het11 has unknown hardware problems, probably after a power
cut. Probably the hard disk (starts the BIOS then stops).
- 070521: het15 and het13 have been repaired (M/B+VGA). het3 is simply
offline (no hub available).
- 070418: het15 has hardware problems; taken to ILB. het17-het20 had a
short in the reset-button connection (problem at startup, no power);
repaired by ILB.
- 070328: het13 has hardware problems. het17-het20 are available to
all; the software is at a testing phase, so check your programs for
correctness and efficiency. het2 had a corrupt filesystem; the OS was
reinstalled and old data erased.
- 070227: The new het17-het20 have been successfully added to the
cluster. They will soon be available to all.
- 070116: het12 and het8 are back. het13 has hardware problems. het2
most likely has disk damage. Need to reinstall the OS?
- 061106: het12 is back: M/B ASUS P4T533-C -> ASRock P4i65G; RAM and
video card replaced. See BUG: Soft lockup on CPU#0. Reinstall Fedora?
- 061026: het9 is back: M/B ASUS P4T533-C
-> ASRock P4i65G and RAM replaced.
- 061017: het13 is back: Power supply replaced
- 061016: het12 and het13 have hardware
problems (power supply?short?).
- 060919: het9 has hardware problems
- 060719: Replaced fan (broken) on
het2. Replaced memory (single chip) on het8. Repairing het8 filesystem.
- 060717: Node het15 is back. Nodes het2 and het8 have hardware problems.
- 060330: Node het16 is back for users.
- 060303: Node het16 is back. Software
upgrade in process.
- 060210: Node het16 is off due to hardware
problems.
- 060131: New node het16 added to the cluster.
- 051228: het8 is back. New video card: GeForce FX5200 AGP8X 128MB with TV-out.
- 051223: het10 is back: new power supply
- 051221: het8 and het10 are out: VGA card
and power supply respectively.
- 051221: New nodes on the cluster: het14
and het15
- 051221: het2, het9 and het12 are back. het9 and het12 should be
regarded as potentially unstable.
General Information
The HetCluster is a cluster of PCs connected to the network via 10/100
Ethernet. Each node is independent and runs Linux (the Fedora Core 4
and 6 and the Ubuntu Server 8.04 and 9.10 distributions). For
information on how to obtain an account and on the existing software,
contact the administrator. You may log in to a node via ssh at the
address node.physics.ntua.gr. For the moment there is no server and
the filesystems are independent. This may change in the future: a
common filesystem may be created on het7 and mounted on all other
nodes on /data via NFS. Some observed instabilities of NFS have made
us delay this option.
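For example, to log in to node het3 (replace myuser with the account
name given to you by the administrator; it is only a placeholder here):
ssh myuser@het3.physics.ntua.gr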
You may use ssh to submit jobs on remote machines. This has been set
up so that no password is asked between het nodes. A prototype script
for job submission/remote command execution that checks node occupancy
is /usr/local/bin/rtop. Use it to check the load of each machine
before you decide to submit your jobs.
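As an illustration of how such node-monitoring helpers can be put
together, the following is only a minimal sketch (it assumes
passwordless ssh between the nodes and is not the actual
/usr/local/bin/rtop installed on the cluster):
#!/bin/bash
# Minimal rtop-like sketch (illustrative only, not the installed script):
# print the load average and a snapshot of top for each node given as an
# argument, or for het1..het38 if no argument is given.
nodes="$@"
[ -z "$nodes" ] && nodes=$(seq -f "het%g" 1 38)
for n in $nodes; do
    echo "===== $n ====="
    ssh -o ConnectTimeout=5 "$n" "uptime; top -b -n 1 | head -15" 2>/dev/null \
        || echo "$n: not reachable"
done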
Rules for Using the Cluster
Please observe the following rules, and in case they cannot
accommodate your needs contact the administrator.
- Long jobs should be breakable into jobs of 8-12 hours at most. If you
have not done this until now, it is a good idea to learn how. Jobs must
be movable to other nodes on less than a day's notice should it become
necessary.
- Respect other people's need for computer time: benchmark your code,
optimize it as well as possible (especially long jobs), and use the
best compiler (in our case the Intel compilers icc and ifort seem to
do the best job). Get advice if you are not sure; the administrator
may be able to help.
- Each person will be given priority on a number of nodes depending on
their needs. For this you should contact the administrator, stating
your needs in total CPU time on each node (normalized approximately to
a 3.0 GHz PIV processor), memory and number of nodes. Total CPU time
can be "infinite", but higher priority will be given to shorter jobs.
- When you do not need the nodes any more, notify the administrator.
Users that lock nodes unnecessarily will be blacklisted.
- Everybody will be allowed to run on any node given that:
  - the node is empty and not allocated to anyone and the job will not
    run for more than 2 days, or
  - the job is submitted at nice level 20 (with /usr/bin/nice, not the
    shell built-in, please), and
  - there are no memory conflicts with the needs of the primary user.
  - If the primary user complains, the job will be stopped (not killed)
    and the owner has to restart it manually when the node is free, or
    kill it (learn about the kill command, especially kill -STOP and
    kill -CONT; a short example is given after this list).
- Notify the administrator of anything abnormal on the nodes you are using.
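A few illustrative commands for the points above (a.out and the PID
12345 are placeholders; note that on Linux the maximum niceness is 19,
so larger adjustments are clamped to the lowest priority):
/usr/bin/nice -n 20 ./a.out >& log &   (start the job at the lowest priority)
kill -STOP 12345                       (suspend the job with process id 12345)
kill -CONT 12345                       (resume it when the node is free again)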
Tips for Using the Cluster
ssh gives great flexibility for automating remote operations on the
nodes. You will benefit greatly from learning how to use it. Some
examples are given below.
- Monitor the usage of all nodes: Helpful scripts (available on the
nodes) are rtop, rload and ruse. rtop gives a snapshot of the command
top on each node, rload reports the load averages, and ruse the load
averages and the most important jobs. Each command can take as
arguments the names of one or more hosts in order to report only on
them, e.g.
rload (reports on all nodes)
rtop het3 het15 het4 het6
rload 3 5 7 9 15
ruse 3 het7 het9 15 het14 3
- nice your jobs to the required level using the command
/usr/bin/nice, e.g.
/usr/bin/nice -n 20 a.out >& log &
/usr/bin/nice -n 8 a.out >& log &
- Copy files: Use the script rcopy to copy files to the exact same
location in the filesystem of all or some of the cluster nodes. Use
relative, not absolute, links, e.g.:
rcopy file1 dir1 (copies the file file1 and the directory dir1 to all
nodes - but not to the one you are on now!)
rcopy -n 2 -n het7 -n 12 file1 file2
(copies the files file1, file2 to het2, het7, het12)
- Submit Jobs: The possibilities are endless. One example is the
script rexec, with which I submit a job sign.com, passing parameters
to it via flags, simply by editing the table at the top of the
script. A simpler example is rrun, where by giving the command:
rrun run.het1 run.het2 run.het3
I submit the jobs run.hetn on the nodes hetn (a minimal sketch of such
a script is given at the end of this section).
- Transfer Data: The GNU utilities tar and find, together with scp
and/or ssh, can work wonders:
scp -r -p het2:data . (copies the whole directory ~/data and its
contents from het2 to the current node)
ssh het2 "tar cfp - data" | tar xfvp -
(does the same using the capabilities of tar)
More complicated tasks can be handled by scripts that keep data
directories up to date. An example is the script get-new-data-remote,
which is used together with the script get-new-data; a rough sketch of
this kind of script is given below.
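As a sketch of what such a script may look like (the directory data
and the one-day window are only examples; this is not the actual
get-new-data-remote script), find and tar can be combined over ssh:
#!/bin/bash
# Illustrative sketch only: fetch from a remote node all files under data/
# that were modified during the last day, preserving paths and permissions.
node=${1:-het2}
ssh "$node" 'find data -type f -mtime -1 -print0 | tar -c --null -T - -f -' | tar xvpf -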
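Finally, a minimal sketch of an rrun-like submitter (again only an
illustration; it assumes the job scripts exist at the same path on
every node and that the remote shell is bash, and it is not the
installed rrun script):
#!/bin/bash
# Illustrative sketch only: for each argument of the form run.hetN,
# start it on node hetN at low priority and collect its output in run.hetN.log.
for job in "$@"; do
    node=${job##*.}                 # e.g. run.het3 -> het3
    ssh "$node" "cd $PWD && /usr/bin/nice -n 20 ./$job >& $job.log" &
done
wait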