A Case for NOW

Thomas E. Anderson, David E. Culler, David A. Patterson and the NOW team

Notes by Li Li and Teck Chia (4/21/98)


Motivation

As technology changes at a fast pace, MPPs are no longer cost effective and target only the biggest tasks. Most engineering workstations have a large amount of memory and very fast processors, both of which sit idle most of the time. A switched local network makes access to remote memory far faster than access to local disk. Why not harness these machines as a Network of Workstations (NOW) and make it a win for all users? With a good resource allocation policy, we can explicitly preserve interactive performance while dedicating unused resources throughout the network to demanding applications: DRAM for memory-intensive programs, disks for I/O-bound programs, and CPUs for parallel programs.

Why NOW?

Current NOW Testbed

100 Sun UltraSPARCs connected by a Myrinet switched network, all in one room.

According to the LINPACK benchmark, NOW ranks among the top 200 supercomputers in the world. This is very impressive!

Key NOW-Enabling Technology

  1. Cooperative File Caching
     Remote memory access time is an order of magnitude faster than local disk access time. If local memory is not enough and there is ample memory on remote machines, why not page across the network?

  2. RAID
     Instead of managing RAID in hardware, we can manage it in software. Instead of hooking a RAID to a single host computer, we can view all the disks in the network as one RAID. Any client machine can take over managing the RAID if another one crashes.

  3. GLUnix
     It's envisioned as a layer on top of unmodified commercial UNIX that globally manages network resources.

  4. Active Messages
     Active messages are used for communication between processes to reduce network latency and processing overhead.

  5. xFS
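
As a rough illustration of the software RAID idea listed above, the sketch below stripes data blocks across several "disks" and adds one XOR parity block, so that any single lost block can be rebuilt from the survivors. This is a generic RAID-style parity scheme, not the NOW team's implementation; the function names and the in-memory stripe are assumptions made for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of a list of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def write_stripe(data_blocks):
    """Return the blocks to store, one per disk: the data plus one parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def reconstruct(stripe, lost_index):
    """Rebuild the block at lost_index by XOR-ing all surviving blocks."""
    survivors = [block for i, block in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)


# Hypothetical stripe: three data blocks, each on a different workstation's disk.
stripe = write_stripe([b"now!", b"disk", b"raid"])

# Pretend the workstation holding block 1 crashed; any client can rebuild it.
assert reconstruct(stripe, 1) == b"disk"
```

Because the reconstruction is plain computation over blocks fetched from the surviving machines, no single host has to own the array, which is the point the notes make about any client being able to manage the RAID.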

Discussion

It's instructive to compare what NOW originally claimed with what NOW has achieved. There seems to be a large discrepancy. It's also relevant to compare NOW with Condor and Sprite. It's interesting that A Case for NOW (1994) didn't reference the Condor project, since Condor had more or less the same idea back in 1991.

To realize the ideal NOW, process migration is key to maintaining interactive performance. NOW never attempted to implement process migration, whereas both Condor and Sprite implement it. Condor builds on top of an existing OS such as UNIX, while Sprite is a network OS compatible with UNIX; NOW has GLUnix on top of UNIX. It seems that Condor achieves most of the goals of the ideal NOW. Three features of NOW aren't present in Condor: xFS, cooperative caching, and Active Messages for fast communication. NOW explores fast communication over a switched network while Condor doesn't.

Ideal NOW

I think it's impossible to make it a win for all users. If you want to run a parallel database application, security is a problem: you don't want payroll information to be revealed to desktop users. Processes may crash, so consistency and availability will be a big problem for database applications. How do we deal with these security, consistency, and availability issues?

Current NOW implementation

One size fits all will never work. Perhaps NOW has transformed into a cost-effective way of building supercomputers. The approach MPPs took to building supercomputers is no longer cost effective as technology trends change. The death of Thinking Machines Corporation indicates the failure of this approach. Do you think NOW is a cost-effective way of building supercomputers?