The UF High Performance Computing (HPC) Center's Phase IIb cluster was down for several days in late July and early August for maintenance and upgrades. Until then, the cluster had been running 24/7 since its latest expansion in January 2007.
Hardware maintenance has to be performed regularly to avoid erratic errors, and the software has to be upgraded to make the latest features available. The cluster was down for this maintenance from July 31 until August 9. All nodes were power-cycled, which caused the power supplies that were about to fail to actually fail; these were then replaced. Several other hardware problems were resolved and parts were replaced. The OS on all nodes was upgraded to the CentOS 4.5 release, which is equivalent to Red Hat Enterprise Linux.
One of the reasons for the shutdown was to upgrade the RapidScale parallel file system to a new release that supports simultaneous access by file system clients over SDP/InfiniBand and over TCP/Ethernet. This is crucial for our next project, which is to support large-scale, fast-access storage over the UF Campus Research Network. In addition, some network changes were made to better support the new capability.
The UF High Performance Computing Center began operation, after several years of preparation by the HPC Committee, with Phase I of the HPC cluster, which became operational in August 2004. That cluster was manufactured by Dell and had 200 compute nodes, each with 2 GB of RAM and two Xeon processors. The current cluster, Phase IIb, is about 8 times as powerful. It was manufactured by Rackable and has 400 nodes, each with two dual-core Opteron processors (i.e. four CPUs) and 4 to 8 GB of RAM. This cluster has 1,600 CPUs, of which about three quarters are connected by a fast communication network: an InfiniBand fabric capable of 10 Gb (gigabits) per second between two nodes.
For those interested in the gory details: each of the eight I/O servers (targets) is connected via its dual-port LionCub HCA to two separate fabrics. Port 0 is connected to the core switches and is used for both message passing and I/O among the IB-enabled nodes. Port 1 is connected to the Cisco 3012 gateway (running a separate subnet manager) and is used exclusively for IPoIB storage traffic from the Ethernet-only nodes. There is a dedicated 3-channel trunk on the gateway for each server (24 GigE ports in aggregate) and a corresponding port-channel on the Catalyst 6506.
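As a back-of-the-envelope check on the numbers above, the Ethernet-side trunks can be tallied in a few lines of Python. This is only a sketch: the server and channel counts come from the description above, while the 1 Gbit/s per GigE port is an assumed nominal line rate that ignores protocol overhead, and the function name `trunk_summary` is made up for illustration.

```python
# Tally of the Ethernet-side storage trunks described above.
IO_SERVERS = 8           # I/O servers (targets), from the text
CHANNELS_PER_TRUNK = 3   # GigE channels in each server's dedicated trunk, from the text
GIGE_GBIT_PER_S = 1.0    # ASSUMPTION: nominal GigE line rate, no protocol overhead


def trunk_summary(servers=IO_SERVERS, channels=CHANNELS_PER_TRUNK,
                  rate_gbit_s=GIGE_GBIT_PER_S):
    """Return port count and nominal bandwidth figures for the trunks."""
    ports = servers * channels                 # total GigE ports on the gateway
    per_server = channels * rate_gbit_s        # nominal Gbit/s available to one server
    aggregate = ports * rate_gbit_s            # nominal Gbit/s across all servers
    return {
        "gige_ports": ports,
        "per_server_gbit_s": per_server,
        "aggregate_gbit_s": aggregate,
        "aggregate_gbyte_s": aggregate / 8,    # 8 bits per byte
    }


if __name__ == "__main__":
    for name, value in trunk_summary().items():
        print(f"{name}: {value}")
```

These figures are nominal upper bounds for the Ethernet path only; the throughput actually seen by the Ethernet-only nodes will be lower.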
The cluster has 30 TB of storage with a very fast parallel file system on top of it. Multiple CPUs can write to the storage at over 3 gigabytes per second. The latest expansion of the cluster was installed in January 2007, and the cluster now has over 1,600 CPUs in 400 nodes; about 80 nodes are connected by Ethernet only.
The University of Florida was one of the five finalists in the "Innovation and Promise" category of the Storage World "Best Practices in Storage" Awards Program. The award identifies and acknowledges excellence among users of storage IT solutions and approaches. Finalists in each category were honored at a ceremony on April 18, 2007, at the Storage Networking World conference in San Diego. All five finalists received an award. Jon Akers from the HPC Center attended to receive the award.
A lot of information about the University of Florida High Performance Computing Center and its activities is available on the web at http://www.hpc.ufl.edu . An annual report was recently completed. It provides a brief overview of the activities of the Center and its governing body, the HPC Committee (ITAC-HPC). Because this is the first annual report, some information from previous years is included for completeness. Please send requests for a copy to Erik Deumens at deumens@qtp.ufl.edu .

