Saturday, July 31, 2010

Supercomputer in India




C-DAC's HPCC (High Performance Computing and Communication) initiatives are aimed at designing, developing and deploying advanced computing systems, tools and technologies that impact strategically important application areas.

Fostering an environment of innovation and working with cutting-edge technologies, C-DAC has deployed its PARAM series of supercomputers to address diverse applications in science, engineering and business computing at various institutions in India and abroad.

C-DAC's commitment to the HPCC initiative has once again manifested itself as a deliverable through the design, development and deployment of PARAM Padma, a terascale supercomputing system.

PARAM Padma is C-DAC's next generation high performance scalable computing cluster, currently with a peak computing power of one teraflop. The hardware environment is powered by compute nodes based on state-of-the-art POWER4 RISC processors, using copper and SOI technology, in Symmetric Multiprocessor (SMP) configurations. These nodes are connected through a primary high performance System Area Network, PARAMNet-II, designed and developed by C-DAC, with Gigabit Ethernet as a backup network.

The PARAM Padma is powered by C-DAC's flexible and scalable HPCC software environment. The storage system of PARAM Padma has been designed to provide a primary storage of 5 Terabytes, scalable to 22 Terabytes. The network centric storage architecture, based on state-of-the-art Storage Area Network (SAN) technologies, ensures high performance, scalable and reliable storage. It uses Fibre Channel Arbitrated Loop (FC-AL) based technology for interconnecting storage subsystems such as Parallel File Servers, NAS Servers, Metadata Servers, RAID Storage Arrays and Automated Tape Libraries, achieving an I/O performance of up to 2 Gigabytes/second.

The Secondary backup storage subsystem is scalable from 10 Terabytes to 100 Terabytes with an automated tape library and support for DLT, SDLT and LTO Ultrium tape drives. It implements a Hierarchical Storage Management (HSM) technology to optimize the demand on primary storage and effectively utilize the secondary storage.

The PARAM Padma system is also accessible by users from remote locations.
An Overview of PARAM Padma – A Teraflop Computing System

PARAM Padma is a teraflop cluster comprising 62 four-way SMP nodes and one 32-way SMP node, equivalent to 280 POWER4 RISC processors, interconnected with C-DAC's own PARAMNet system area network technology as shown in Figure 1. The theoretical peak performance of the complete configuration is 1.13 teraflops. The nodes are connected through a primary high performance system area network, PARAMNet-II, with Gigabit Ethernet as a backup network.

Each node is a 4-way SMP supporting four 1 GHz POWER4 RISC processors, and the aggregate memory for each compute node is 8 Gigabytes. Each processor core has a 16 KB L1 cache with a latency of 4 ns to 6 ns, and two processor cores share a 1.41 MB L2 cache with a latency of 9 ns to 14 ns. Four processors share a common 128 MB L3 cache. PARAM Padma has 6 file servers, each with four UltraSPARC-III processors in a 4-way SMP configuration and an aggregate memory of 16 Gigabytes.
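As a rough cross-check of the quoted peak, the figure can be approximated from the processor count, assuming each 1 GHz POWER4 core completes four floating-point operations per cycle (two fused multiply-add units); the 4 flops/cycle rate is an assumption of this sketch, not a number stated in the text.

280 processors × 4 flops/cycle × 1.0 GHz = 1,120 Gflops ≈ 1.12 Tflops for the full configuration, and 4 × 4 flops/cycle × 1.0 GHz = 16 Gflops per 4-way node. This is slightly below the quoted 1.13-teraflop theoretical peak; the small difference may come from rounding or from the clock of the 32-way node, which is not specified here.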

Major components of PARAMNet-II are the SAN Switch (16 ports), the PARAMNet-II Network Interface Card (NIC) with C-DAC's Communication Co-Processor (CCP-III), and C-DAC's Virtual Interface Provider Library (C-VIPL), part of the C-DAC HPCC suite of software tools, as shown in Figure 2 and Figure 3. A PARAMNet-II network comprises N hosts connected in a non-blocking fat-tree topology. The PARAMNet-II switch is based on a non-blocking crossbar architecture and supports 8/16 ports providing 2.5 Gbps full duplex raw bandwidth per port (2 Gbps full duplex data bandwidth). The non-blocking architecture of the switch allows multilevel switching for realizing a large cluster. The switch offers a very low latency of the order of 0.5 microseconds, and it uses an interval routing scheme and group adaptive routing based on a Least Recently Used (LRU) algorithm to ensure uniform bandwidth distribution.
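The relation between the quoted raw and data rates is consistent with standard 8b/10b line coding on the serial links; the coding scheme is an inference here, not something stated in the original.

2.5 Gbps raw × 8/10 = 2.0 Gbps of data bandwidth per port per direction, i.e. roughly 250 MB/s each way.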

A single switch can support up to eight (SAN-SW8) or sixteen (SAN-SW16) hosts. To support more hosts, a multistage network is adopted; this network can be made non-blocking or blocking, depending on the number of switches used and the desired cost/performance trade-off. Various topologies are possible by modifying the routing tables of the switches.


Figure 3: C-DAC’s Virtual Interface Provider (C-VIPL)

Two types of configurations of PARAM Padma are made for multi-level switching of PARAMNet-II switches. In the first configuration, five first-level switches (SAN-SW16) and one second-level switch (SAN-SW16) have been employed to configure the cluster. The second configuration involves twelve switches and is fully non-blocking. These are split into eight first-level and four second-level switches. There are no bottlenecks in this topology, and the bisectional bandwidth scales with the number of nodes. For the entire cluster of 62 nodes, the available bisectional bandwidth is 4 Gbytes/sec. For both configurations, latencies associated with packet routing are very small (~1.5 microseconds for three levels of switching).
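One consistent reading of the fully non-blocking, twelve-switch arrangement, assuming each 16-port first-level switch dedicates 8 ports to hosts and 8 ports to uplinks (an assumption of this sketch, not stated explicitly above):

8 first-level switches × 8 host ports = 64 host ports, enough for the 62 nodes; and 8 first-level switches × 8 uplinks = 64 uplinks = 4 second-level switches × 16 ports, so the per-switch uplink capacity matches the per-switch host capacity and no link is oversubscribed.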


The NIC card is based on the CCP-III chip, built with 0.15 micron, 1-million-gate technology. The NIC card provides an interface to SAN-SW8 and SAN-SW16 switches, and it supports 2.5 Gbps (fiber) links and a host interface of PCI 2.2, 64-bit/66 MHz. The NIC card supports connection-oriented (VIA) and connectionless (AM) protocols. The CCP has been designed to reduce software latency and increase data throughput, the main parameters for good and effective communication. It avoids unnecessary copying of data either by directly delivering the message into the destination buffer or by copying it to a page-aligned temporary area from where the kernel can remap it, thereby reducing the number of copies.

C-DAC's HPCC suite of software tools on PARAM Padma effectively addresses the performance and usability challenges of clusters through high performance communication protocols and a rich set of program development, system management and software engineering tools. KSHIPRA, a communication substrate designed to support low latency and high bandwidth, is key to the high level of aggregate system performance and scalability of C-DAC's HPCC software. C-VIPL, part of KSHIPRA, is a scalable communication substrate for clusters of multiprocessors designed to support low latency, high bandwidth and a high level of aggregate system performance. C-VIPL is an application program interface for PARAMNet-II and adheres to version 1 of the VIA Specification. It supports diverse operating systems such as AIX, Linux, Solaris and Windows. C-VIPL is compatible with C-MPI and with MVICH, an MPICH implementation of MPI for the Virtual Interface Architecture (VIA). The HPCC software also provides low overhead communication, an optimized MPI (C-MPI) and a Parallel File System (PFS) with an MPI-IO interface to enable applications to scale on large clusters. Included in the HPCC software suite of products are high performance compilers, parallel debuggers, data visualisers, performance profilers, and cluster monitoring and management tools.

C-MPI is a high performance implementation of the MPI standard for a Cluster of Multi Processors (CLUMPS). C-MPI also leverages the fact that most high performance networks provide substantial exchange communication bandwidth. This allows the tuned algorithms to simultaneously send and receive messages over the network, which helps reduce the number of communication hops. In addition, the algorithms effectively use the higher shared memory communication bandwidth on multiprocessor nodes. C-MPI also takes advantage of SMP features by using direct memory copy for intra-node messages instead of going through an intermediate shared space and the network. This is critical to improving MPI communication performance on PARAM Padma.
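As an illustration of the exchange pattern described above (each process sending and receiving in the same step), a minimal MPI sketch in C is given below; this is generic MPI code, not anything C-MPI-specific, and the message size and neighbour pairing are arbitrary choices for the example.

#include <mpi.h>
#include <stdlib.h>

/* Pairwise exchange: each rank swaps a buffer with a neighbour in one call,
   letting the MPI library drive the send and the receive concurrently. */
int main(int argc, char **argv)
{
    int rank, size;
    const int n = 1 << 20;               /* doubles per message (arbitrary) */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *sendbuf = malloc(n * sizeof(double));
    double *recvbuf = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++)
        sendbuf[i] = (double)rank;        /* dummy payload */

    int partner = rank ^ 1;               /* pair ranks 0-1, 2-3, ... */
    if (partner < size)
        MPI_Sendrecv(sendbuf, n, MPI_DOUBLE, partner, 0,
                     recvbuf, n, MPI_DOUBLE, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}

When both ranks of a pair live on the same SMP node, an implementation with the shared memory optimization described above can satisfy this call with a direct memory copy rather than a trip through the network.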

C-PFS, a client-server, user-level parallel file system, addresses the need for high I/O throughput. Exporting MPI-IO interfaces for parallel programming and a UNIX interface for system management, C-PFS fully exploits the concurrent data paths between the compute nodes and the terabytes of storage in PARAM Padma. The storage system of PARAM Padma has been designed to provide a primary storage of 5 terabytes, scalable to 22 terabytes. A network-centric storage architecture has been used, based on state-of-the-art Storage Area Network (SAN) technologies, ensuring high performance, scalable and reliable storage. It uses Fibre Channel Arbitrated Loop (FC-AL) based technology for interconnecting storage subsystems such as parallel file servers and automated tape libraries, achieving an I/O performance of up to 2 Gigabytes/second. The secondary backup storage subsystem is scalable from 10 terabytes to 100 terabytes with an automated tape library. It implements hierarchical storage management (HSM) to optimize the demand on primary storage and effectively utilize the secondary storage.
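Because C-PFS exposes an MPI-IO interface, parallel applications can write a single shared file collectively. The sketch below is a minimal, generic MPI-IO example in C; the file name and block size are placeholders, and nothing in it is specific to C-PFS.

#include <mpi.h>

/* Each rank writes its own contiguous block of a shared file via MPI-IO. */
int main(int argc, char **argv)
{
    int rank;
    const int n = 1024;                   /* doubles per rank (arbitrary) */
    double buf[1024];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < n; i++)
        buf[i] = (double)rank;            /* dummy data */

    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Collective write: rank r owns the byte range starting at r*n doubles. */
    MPI_Offset offset = (MPI_Offset)rank * n * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, n, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}

A parallel file system underneath such a call can stripe the collective write across its file servers, which is how the concurrent data paths mentioned above are exploited.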

The industrial design and packaging of PARAM Padma offers flexibility to scale from a system with a few nodes to systems having a large number of nodes, as shown in Figure 1. The PARAM Padma enclosure has been designed taking into consideration the environmental requirements of high performance computing sub-assemblies, such as heat transfer and electromagnetic interference/compatibility. It is a 19-inch, 48U standard rack with shielded cable trays and accessories for housing compute nodes, file servers, network switches and cables.

Application and System Benchmarks

Several characteristics of various application and system benchmarks were considered during the design and development of PARAM Padma, in order to reduce the cost of communication from both a hardware and a software point of view. Considered here is the first configuration of PARAM Padma, with HPCC software over PARAMNet as the parallel programming environment for execution of the benchmarks. In addition, a Gigabit Ethernet interconnect with the IBM MPI programming environment is also considered for the execution of several benchmarks.

Macro and micro benchmarks have been used to test and extract the sustained performance of PARAM Padma. P-COMS (PARAM Communication Overhead Measurement Suites, version 1.1.1), a set of test suites, has been used to model the performance of MPI point-to-point and collective communications on PARAM Padma. These suites compare the performance of point-to-point communications, including send and receive overheads for different send and receive modes and different (contiguous) message lengths, and estimate the network latency and bandwidth. It has been observed that latencies are as low as 15 to 20 microseconds and the bandwidth is 160 MB/s on PARAMNet with HPCC software. A comparative study of measured communication overhead times for the different system area networks, PARAMNet with HPCC software versus Gigabit Ethernet with IBM MPI, indicates that the overheads for MPI communication primitives are considerably lower on PARAMNet.
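For reference, the kind of measurement such a suite performs can be approximated with a simple MPI ping-pong. The sketch below is a generic illustration; the message size, repetition count and the halving of round-trip time are conventional choices, not details taken from P-COMS.

#include <mpi.h>
#include <stdio.h>

/* Ping-pong between ranks 0 and 1: half the round-trip time estimates
   one-way message time; bytes divided by that time estimates bandwidth. */
int main(int argc, char **argv)
{
    const int n = 1 << 20;                /* message size in bytes (arbitrary) */
    const int reps = 100;
    static char buf[1 << 20];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, n, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, n, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, n, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, n, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t = (MPI_Wtime() - t0) / (2.0 * reps);   /* one-way time per message */

    if (rank == 0)
        printf("one-way time %.1f us, bandwidth %.1f MB/s\n",
               t * 1e6, n / t / 1e6);

    MPI_Finalize();
    return 0;
}

Latency is normally reported from very small messages and bandwidth from large ones, so a real measurement suite sweeps the message size rather than using a single value as this sketch does.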

NPB (the NAS Parallel Benchmarks) is a collection of benchmarks used to test the performance of PARAM Padma. NAS 2.3 comprises eight CFD-derived problems, coded in MPI with standard C and Fortran 77/90. The LU kernel of NAS performs a triangular factorization of a matrix and involves sending small messages of less than 100 bytes. Initial experiments indicate that the execution time for the LU Class C problem decreases linearly up to 62 processors of PARAM Padma with PARAMNet-II and HPCC software. Tuning and optimization of the code is being carried out.

HPL (High Performance LINPACK), used for the TOP500 Supercomputers List, is a popular benchmark suite to evaluate the capabilities of supercomputers and clusters. The results of the benchmark are published semi-annually in the Top500 list of the world's most powerful computers. The benchmark involves solving a dense system of linear equations. HPL performance depends mainly on the performance of the underlying communication network, the tasks executing at the different nodes of the cluster, the shared memory implementation of MPI, and the quality of the process-to-processor mapping. The higher bandwidth and lower latency of the high performance PARAMNet switch network with C-DAC's HPCC software have resulted in better performance in comparison with the Gigabit interconnect. The results of the HPL benchmark on 32 and 64 processors reveal a minor improvement on the PARAMNet architecture with HPCC software, in comparison to Gigabit Ethernet with IBM MPI. However, when the number of nodes is increased, the performance of HPL on PARAMNet with HPCC software shows substantial improvement over the Gigabit interconnect with IBM MPI. The Top-500 test with optimal parameters on the 62-node (248-processor) configuration has yielded approximately 532 Gflops against the peak performance of 992 Gflops on PARAM Padma. The HPL performance is thus approximately 53.6 % of peak. PARAMNet has been found to perform much better due to its low latency and high bandwidth in all HPL tests, and it scales very well up to 62 nodes (248 processors).
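The quoted efficiency follows directly from the two figures above, and the 992 Gflops peak is itself consistent with the per-core rate assumed earlier (4 flops/cycle at 1 GHz, an assumption rather than a stated specification):

248 processors × 4 flops/cycle × 1.0 GHz = 992 Gflops, and 532 / 992 ≈ 0.536, i.e. about 53.6 % of peak.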

Scientific and Engineering Applications on PARAM Padma

Real life complex application problems and scientific and engineering research are the driving force behind the development of PARAM Padma. Many applications in critical scientific and engineering fields such as Bioinformatics, Computational Structural Mechanics, Computational Atmospheric Sciences, Seismic Data Processing, Computational Fluid Dynamics, Evolutionary Computing and Computational Chemistry have been executed on PARAM Padma. The following paragraphs describe some of the high performance computing activities pursued at C-DAC in these areas.

In Bioinformatics, realistic simulations of large biomolecules using molecular codes such as AMBER, CHARMM and GROMACS have been carried out; these codes have been ported to PARAM Padma. Figure 5 gives the results of a ten-nanosecond simulation done using AMBER. An in-house problem solving environment has also been developed so that biologists can execute codes like AMBER, CHARMM, FASTA and BLAST (parallel versions of which have been ported) through a simple interface that shields the user from the intricacies of parallel computing.

Developments in Computational Structural Mechanics include FEMCOMP for stress analysis of FRP composite structures, NONLIN for stability analysis of thin-walled structures, and FRACT3D, a parallel fracture mechanics package based on domain decomposition.



In Computational Atmospheric Sciences, the Climate System Model (CCSM2) for climate change simulations and the Mesoscale Model (MM5) for simulations at one-kilometer resolution have been implemented. Figure 6 gives the results obtained using MM5 on PARAM Padma. Pre-processors to ingest Indian meteorological data into the MM5 modeling system for regional weather forecasting have been developed at C-DAC. The capability for running long climate simulations with CCSM2 on PARAM Padma is available.

Under Seismic Data Processing activities, a parallel seismic modeling and migration package (WAVES) for oil and natural gas exploration has been developed on PARAM Padma. This parallel software focuses on the implementation of high-precision 3-D seismic migration and modeling algorithms, and our experiments indicate that the software is scalable to a very large number of processors.

Major activities in Computational Fluid Dynamics include simulations of hypersonic flow for a re-entry vehicle, fuel flow characteristics of an IC engine, and a general 2-D Navier-Stokes solver. Figure 7 gives the performance of the 2-D Navier-Stokes solver on PARAM Padma up to 248 processors, showing good scalability.



In the evolutionary computing area, parallel genetic algorithm based methodologies for protein structure prediction, multiple sequence alignment and financial modeling have been developed. In Computational Quantum Chemistry, an indigenous package called INDMOL for electronic structure and molecular properties has been developed and benchmarked on PARAM Padma. GAMESS, a widely used public-domain code, has also been ported.

C-DAC's Tera-Scale Supercomputing Facility (CTSF)

While the need and usefulness of high performance supercomputing in business as well as scientific and engineering applications is unquestioned and growing rapidly, it is not economically viable to have such facilities at all user sites. Recognizing this need, C-DAC had earlier set up the National PARAM Supercomputing Facility (NPSF) at Pune, housing its earlier generation PARAM 10000, a system with 100 Gflops of peak computing power. C-DAC recently established C-DAC's Tera-Scale Supercomputing Facility (CTSF) in Bangalore, which houses PARAM Padma as shown in Figure 8. Many premier research organizations have been using these facilities, and encouraging performance is being reported for several industrial and scientific applications.



The primary objectives of CTSF are:

* To provide high performance computing facilities and services for the
scientific and research community and for the enterprise.

* To establish the technological capabilities in high performance computing that
have hitherto been confined only to developed countries.

* To solve some of the grand challenge problems which are the key to
economic growth, environmental understanding and research breakthroughs
in science and engineering.

Users can opt for one or more of the following options to access the CTSF resources remotely:

* Establishing a 56.6 Kbps dialup link over PSTN (Public Switched Telephone
Network).

* Establishing a dedicated 128 Kbps link over ISDN (Integrated Services Digital
Network).

* Establishing a 64 Kbps leased line terrestrial circuit between remote locations
and C-DAC.

* Providing a secure login via the Internet.

Conclusions

From a hardware and system software point of view, our experience in building the teraflop-class PARAM Padma allowed us to understand the scalability issues of the high-performance system area network PARAMNet-II and its associated HPCC suite of system software. The PARAMNet switch architecture enables low-cost, high-performance implementations because of its functional simplicity. The results of the HPL benchmark used for the Top-500 list show an efficiency of 53.6 % on 62 nodes. Further developments in these areas are in progress.

Many research and development organizations and academic institutions in India are actively involved in building small clusters using off-the-shelf hardware and software components. Development of the PARAM series of supercomputers has enabled tackling large scientific problems that need very large clusters. Many PARAM series supercomputers have been deployed in leading premier institutes in India, and a few outside, under various parallel computing collaborative projects. The recently established C-DAC Tera-Scale Supercomputing Facility (CTSF), which houses PARAM Padma, is open to scientists and research workers in India and outside.
