The Cybermedia Center at Osaka University was founded in April 2000 by merging the former Computation Center, the former Education Center for Information Processing and part of the university library. This reorganization was intended to promote education and research comprehensively in view of rapid developments in information technology.
The goal of our center is twofold: 1) to continue providing stable infrastructure services as well as technical knowledge about supercomputers, information education systems and networks used around the world, and 2) to pursue research that enables the most advanced infrastructure services.
Advantages of Historical Vector Computing Maintained
At our center, we introduced a 20‐node SX‐8R system in January 2007. It replaced the SX‐5/128M8, which had stunned the HPC community as the first vector‐type supercomputer to exceed a peak performance of 1 TFLOPS and which ranked 8th on the TOP500 list in 2001. While computational performance on the TOP500 list has increased phenomenally since the SX‐5, we now give first priority to the benefits users gain from continuous improvement in performance rather than to a mere performance index. In line with this policy, we decided to upgrade the system in phases, adding a 10‐node SX‐9 system in July 2008.
As far as the STREAM benchmark is concerned, we have realized how challenging it is to surpass the SX‐4 (shipped in 1994 and introduced to our center in 1996), which uses Synchronous‐SRAM as its main memory. On this benchmark the SX‐4 was not beaten by its successor, the SX‐5 (with Synchronous‐DRAM), or even by the recent SX‐8 (with Fast Cycle RAM or DDR2‐DRAM), particularly for short array lengths; nevertheless, performance on real application programs has improved thanks to sophisticated compiler technology. On the other hand, recent microprocessors equipped with large on‐die caches can achieve performance comparable to vector machines as long as their caches are effective. Although the performance degradation of these processors, which now have several megabytes of cache, is less significant than it was ten years ago, the superiority of vector machines remains remarkable.
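For readers unfamiliar with STREAM, the following is a minimal sketch of its triad kernel, a[i] = b[i] + q*c[i], swept over a few array lengths. It is written in pure Python for illustration only and is not the official benchmark: the function name `stream_triad`, the array lengths and the repeat count are assumptions, and interpreter overhead means the reported figures do not reflect real memory bandwidth.

```python
import time

def stream_triad(n, q=3.0, repeats=5):
    """Time a STREAM-style triad a[i] = b[i] + q*c[i] for array length n.

    Returns (a, best_seconds, bytes_moved, mb_per_s).
    Illustrative only: list overhead dominates in pure Python.
    """
    b = [1.0] * n
    c = [2.0] * n
    a = [0.0] * n
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        a = [bi + q * ci for bi, ci in zip(b, c)]
        best = min(best, time.perf_counter() - start)
    # STREAM counts 24 bytes per triad iteration:
    # two 8-byte reads (b, c) and one 8-byte write (a).
    bytes_moved = 3 * 8 * n
    mb_per_s = bytes_moved / best / 1e6 if best > 0 else float("inf")
    return a, best, bytes_moved, mb_per_s

if __name__ == "__main__":
    # Sweep short to long arrays (lengths chosen arbitrarily for the sketch).
    for n in (100, 10_000, 1_000_000):
        _, seconds, nbytes, rate = stream_triad(n)
        print(f"n={n:>9}  bytes={nbytes:>10}  best={seconds:.6f}s  {rate:.1f} MB/s")
```

Plotting the rate against the array length is what exposes the cache and memory effects discussed above; on a compiled implementation the curve drops once the arrays no longer fit in cache.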

In terms of the STREAM2 benchmark, vector computers perform very well even for very short loop lengths, partly owing to automatic loop collapse by the compiler. The benchmark also makes the efficiency of a microprocessor's cache visible. In fact, it became clear that the vector machine is superior to conventional scalar processors even for short loop lengths, a case that was thought to favor microprocessors: vector operations remain effective even in the range of very short array lengths where the microprocessor's L1 cache is effective.

Over the past ten years, the technology of earlier high‐end microprocessors has been passed down to budget‐priced products, and low‐power technology has spread at the same time. Although large‐scale PC‐based cluster systems are becoming popular, they still leave issues of operation and maintenance unresolved. We built a cluster system that can cooperate closely with the vector machine in order to benefit from both architectures. It has become a good example of the synergy that different architectures can provide.
NEC Crossbar vs. Chelsio Unified‐Wire
Of the 20 nodes of the SX‐8R system introduced at this time, 8 nodes are interconnected at a bidirectional 16 GB/s through the IXS crossbar switch. According to the Intel MPI Benchmark 3.0, the measured Ping‐Pong latency is 3.79 microseconds, and the maximum bidirectional Send‐Recv throughput is 18.0 GB/s. On the Linpack HPC benchmark, 2.056 TFLOPS (N = 352,256) was achieved, which is 91.3% of peak performance.
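The Linpack figures above can be cross‐checked with a little arithmetic. The sketch below derives the peak performance implied by the reported result and efficiency, the memory needed for the N x N double‐precision matrix, and a rough run time using only the leading 2/3 N^3 term of the operation count. The function names are illustrative, and the run time is an estimate under that approximation, not a measured value.

```python
def implied_peak_tflops(rmax_tflops, efficiency):
    """Peak performance implied by an achieved result and its peak ratio."""
    return rmax_tflops / efficiency

def hpl_matrix_bytes(n, bytes_per_element=8):
    """Memory for the dense N x N coefficient matrix in double precision."""
    return n * n * bytes_per_element

def hpl_runtime_seconds(n, rmax_tflops):
    """Rough run time from the leading-order operation count (2/3) * N^3.

    Lower-order terms are ignored, so this is an approximation.
    """
    flops = (2.0 / 3.0) * n ** 3
    return flops / (rmax_tflops * 1e12)

if __name__ == "__main__":
    # Values quoted in the text.
    rmax, efficiency, n = 2.056, 0.913, 352_256
    print(f"implied peak : {implied_peak_tflops(rmax, efficiency):.3f} TFLOPS")
    print(f"matrix size  : {hpl_matrix_bytes(n) / 1e12:.3f} TB")
    print(f"est. run time: {hpl_runtime_seconds(n, rmax) / 3600:.1f} h")
```

With the quoted values this gives an implied peak of roughly 2.25 TFLOPS and a matrix of just under 1 TB, which indicates how much of the aggregate memory of the 8 nodes such a run occupies.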

All the PCs in our cluster system of more than 600 nodes were introduced simultaneously and are interconnected with Chelsio’s T310‐CX (10GBase‐CX4, ToE‐enabled). The Ping‐Pong latency is 10.49 microseconds, and the maximum bidirectional Send‐Recv throughput is 1.39 GB/s. At present, the throughput appears to be limited by overhead from the internal processing of the sock driver of Intel’s MPI. Although it is possible to switch the stack to RDMA and improve the throughput, we give priority to flexibility in cluster construction and to usability with GridMPI.
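To put the two sets of interconnect figures side by side, the sketch below applies a simple latency‐plus‐bandwidth model, T(n) = latency + n / bandwidth. This is a simplifying assumption rather than the measurement method: it uses the Ping‐Pong latency together with the maximum bidirectional Send‐Recv throughput as if they described one transfer, and the dictionary and function names are illustrative.

```python
def transfer_time(nbytes, latency_s, bandwidth_bytes_per_s):
    """Modelled time to move nbytes: fixed latency plus size over bandwidth."""
    return latency_s + nbytes / bandwidth_bytes_per_s

def half_performance_size(latency_s, bandwidth_bytes_per_s):
    """Message size at which half of the asymptotic bandwidth is reached."""
    return latency_s * bandwidth_bytes_per_s

# Figures quoted in the text (latency in seconds, throughput in bytes/s).
IXS = {"latency": 3.79e-6, "bandwidth": 18.0e9}    # SX-8R, IXS crossbar
CX4 = {"latency": 10.49e-6, "bandwidth": 1.39e9}   # PC cluster, T310-CX

def slowdown(nbytes):
    """Modelled ratio of PC-cluster to SX transfer time for one message."""
    t_cx4 = transfer_time(nbytes, CX4["latency"], CX4["bandwidth"])
    t_ixs = transfer_time(nbytes, IXS["latency"], IXS["bandwidth"])
    return t_cx4 / t_ixs

if __name__ == "__main__":
    for size in (8, 1_024, 65_536, 16 * 1024 * 1024):
        print(f"{size:>10} bytes: modelled slowdown {slowdown(size):.2f}x")
    print("n_1/2 IXS:", round(half_performance_size(**{
        "latency_s": IXS["latency"], "bandwidth_bytes_per_s": IXS["bandwidth"]})))
```

Under this model the gap between the two interconnects is governed by the latency ratio for small messages and by the throughput ratio for large ones, which is why the choice between them depends on the message sizes an application actually uses.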
HPC Challenge Benchmark: SX vs. PC Cluster
The HPCC benchmark is gaining popularity as a comprehensive measure of the performance of HPC systems. While it is not realistic to compare different systems straightforwardly on the basis of this benchmark, as one can with the Linpack benchmark, HPCC can, with careful consideration, give some insight into the performance characteristics of HPC systems.
Let us use Kiviat diagrams to compare the performance of different systems, based on the results submitted to the HPCC benchmark site as of October 2007. For our measurements, we set the number of MPI processes equal to the number of CPUs. For the multiple‐node configuration, the number of MPI processes was set to the number of nodes, and parallelization within a node was achieved without OpenMP by using the automatically parallelized BLAS library for the SX and the multi‐threaded GotoBLAS for the Xeon‐5160.
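A Kiviat diagram places each metric on its own axis, so the scores must first be brought onto a common scale. The following sketch shows one common way to do this, normalizing each metric to the best system; it is not necessarily the procedure used for our diagrams. The system names, metric names and numbers are placeholders invented for the example and are not measured results.

```python
import math

def normalize_for_kiviat(results, lower_is_better=()):
    """Scale every metric to (0, 1], where 1.0 marks the best system.

    results: {system: {metric: value}}, same metrics for every system.
    lower_is_better: metrics such as latency, where smaller values win.
    """
    metrics = list(next(iter(results.values())))
    normalized = {name: {} for name in results}
    for metric in metrics:
        values = [scores[metric] for scores in results.values()]
        invert = metric in lower_is_better
        best = min(values) if invert else max(values)
        for name, scores in results.items():
            value = scores[metric]
            normalized[name][metric] = best / value if invert else value / best
    return normalized

def kiviat_vertices(scores):
    """Polygon vertices for one system, one evenly spaced axis per metric."""
    count = len(scores)
    return [
        (value * math.cos(2 * math.pi * i / count),
         value * math.sin(2 * math.pi * i / count))
        for i, value in enumerate(scores.values())
    ]

if __name__ == "__main__":
    # Placeholder numbers for illustration only, not benchmark results.
    placeholder = {
        "system_a": {"hpl": 2.0, "stream": 8.0, "latency": 4.0},
        "system_b": {"hpl": 4.0, "stream": 2.0, "latency": 8.0},
    }
    scaled = normalize_for_kiviat(placeholder, lower_is_better={"latency"})
    for name, scores in scaled.items():
        print(name, scores, kiviat_vertices(scores))
```

A balanced system then appears as a polygon close to regular, while a system that is weak on one index shows a visible dent on that axis.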

These measurements show that the SX series has excellent single‐node performance, with balanced scores across many performance measures. In contrast, microprocessor‐based systems show relatively poor numbers on some performance indexes.

Let us take a closer look at each of the measures. The gap between the SX‐8R and the Xeon system is narrowing with respect to HPL (High Performance Linpack) and DGEMM (matrix‐matrix multiply), because improved cache blocking and SIMD‐extension/multi‐threading techniques are becoming pervasive on commodity processors; nevertheless, the SX‐8R remains competitive against scalar systems on STREAM and RandomRing Bandwidth. There is also controversy over whether HPL is a neutral performance indicator, since recent commodity‐processor‐based systems can easily achieve relatively high HPL performance even with low‐speed interconnects. On the other hand, some indexes change depending on the actual number of nodes used for the measurement. These results suggest that, for certain application codes, a hybrid system of a vector machine and a PC cluster may be appropriate.
The Grid Operation Center at the Cybermedia Center
The Grid Operation Center at our center researches the application of Grid technology to center operations and has been actively involved in the development of Grid middleware. We provide users with our PC clusters as part of the Grid resources. Furthermore, we are preparing a pioneering framework that enables resource provision including the vector system.
The supercomputer centers of the seven major national universities in Japan charge users for electricity based on actual usage. Of the Grid middleware established so far, the secure PKI‐based authentication framework was implemented first. On the other hand, a virtual organization management system and an accounting function still await development, and they are needed so that Grid technology can be deployed sooner in our center's operations.

So far, the application procedure for using computing resources has not required identity verification, because payment for system usage is guaranteed. However, issuing a Grid certificate requires in‐person identification through an interview, as stipulated in the PMA's classic profile. At present, registration authorities have not been deployed nationwide, which is the largest obstacle to the deployment of Grid services.
We give first priority to issuing Grid certificates to all users of our center nationwide, and we have built our own accounting system. Finally, we have defined a certificate issuance procedure that does not require identification through an interview, so that identification does not impede Grid operations. We are prepared to propose it to the PMA as a profile. These outcomes have also contributed to the Japanese NAREGI project (National Research Grid Initiative). A cooperative evaluation project using the Tokyo Institute of Technology's TSUBAME system is now in progress, and it is expected to become the foundation of a full‐scale operational Grid system in Japan.