Summer Institute on Advanced Computation

August 20-23, 2000


College of Engineering & CS
Wright State University
Dayton, Ohio 45435-0001

NT Clusters --
The HPVM Project

 

Dr. Mario Lauria

Assistant Professor
Department of Computer and Information Science
Ohio State University

 
August 22, 2000 7:30 p.m. This site contains the "web-based proceedings" of Summer Institute on Advanced Computation that focused on high- performance high-throughput  Cluster Computing. 
Slides
           

DR. OSCAR GARCIA: Our speaker tonight came all the way from San Diego, and we were a little worried when at 4:30 his plane had not yet landed, so Jay was keeping track calling the airlines and checking trying to figure out whether we had to find another speaker.

He is an Assistant Professor in the Department of Computer and Information Science at Ohio State University. He actually did postdoctoral at the University of California, San Diego. How he left San Diego to come to Columbus I don't know, but I guess there must have been some incentives.

Previous to that, he spent some time also in a NATO-CNR Advanced Science Fellowship at the University of Illinois. He was a Fulbright fellow at Urbana-Champaign, so he can't say he didn't know the Midwest was like. His doctorate is from the "Federico II" University of Naples. He obtained his Ph.D in '94. He received Laurea Summa Cum Laude in Electrical Engineering from that university in 1992, a degree in Italy.

His interests include commodity cluster architectures, which you've heard a little bit about, high performance communication software for clusters, and he has worked in projects including the Fast Messages, HPVM and the NT Supercluster at NCSA. I guess his lecture tonight has to do with the NT Supercluster.

Let's welcome Mario Lauria.
           
 

DR. MARIO LAURIA: Thank you for that long introduction. Tonight I'm going to talk about this project I've been involved in which is HPVM. HPVM is basically a collection of software libraries and it's intended for clusters of PC's connected with high-speed metrical minute.

First of all, I'd like to thank the organizers of the advanced school for inviting me. I'm honored to be here and am happy to have the opportunity to talk about this project.

Let me clarify that all of this work has been done when I was a member of the Concurrent Systems Architecture Group of Andrew Chien. I only recently left the group to join Ohio State University, and the head of the group is Professor Andrew Chien. Here are the names of other people who have been working at various times in this project.

I'll also be talking about the NT Supercluster, in which we built a large cluster in collaboration with the NCSA. Among the people of NCSA that collaborated with us is Robert Pennington.

This is the outline of my talk. I'll be describing the history of Fast Messages. Fast Messages is the high performance communication software, the low level software, which is at the base of the HPVM suite. I'll also talk about the NT Supercluster, which is an organization which HPVM has been put to test using the real-world applications.

I'll conclude my talk describing what's the current status of HPVM and some of the current projects. I won't be talking too much about motivation for cluster technology. This is a school about clusters, so I'm sure you have heard a lot of good things about clusters and you know pretty much everything about the motivations.

Let me introduce HPVM. One of the reasons clusters are so attractive is that commodity processors have become so powerful, and they make a really good building block for high-performance computation. Another important enabling technology which has become available are the high-performance, high-speed communication networks.

We are using one of these, which is called Myrinet. Others that have been around are ATM, VIA. This is new technology I'll briefly be mentioning. Very recently there has been Gigabit Ethernet. These are technologies that make clusters buyable and practicable to build.

So this hardware is there. What is missing to build a supercomputer out of PC's? What's missing is the software, the glue to keep all these things together. And this is the role of HPVM. The idea was to build something like a visual interface, a user-level program interface that was hiding all the complexity which is lying beneath it. So this middle layer is what we built and we called HPVM.

HPVM is a collection of libraries. All this software is sitting directly on top of these high-speed networks. We are using Myrinet. There is nothing special about Myrinet. The reason we adopted Myrinet is that when we started the project in '94, it was one of fastest and also one of the least expensive networks that were available.

Another important feature is that with Myrinet you can put some of the software on top of the network interfaces. This is interesting from a research point of view because it allows you to do things like division of tasks between the host computer and the network interface.

So there is the hardware. Right on top of the hardware we built this high-performance library called Fast Messages. Fast Messages is a low-level library, meaning it provides only a few services, only the essential services, which are reliable communication and flow control and Nomad shells. The main purpose in designing Fast Messages was to deliver to the upper layer of the software the highest possible share of performance made available by the hardware.

Later on top of Fast Messages we built other libraries, other user-level libraries like MPI, Put/Get, and Global Arrays, which are libraries which are popular among Cray T3 users. Basically the project went into building this suite of software in layers. We started from the bottom layers and kept adding a layer at a time.

Another important thing is that Fast Messages accessing Myrinet. It's accessing the hardware directly. It's not going through the operating system. This is one of the defining features of messaging layers like Fast Messages. The reason is we wanted to build very high-performance communication software, so we couldn't afford the additional of overhead of going through the operating system.

The target we had was to have communication latencies on the order of tenths of microseconds. In '94 when we started communication, using TCP/IP was on the order milliseconds for latency time. The reason is the TCP/IP is, of course, going through the operating system, and just to make an operating system call and the contact switch is in the order of a hundred microseconds. So that's the reason why Fast Messages is a user-level library.

So why design protocol? Why design communication software from scratch? I'll use this slide to explain why. This is a hypothetical communication software that has a custom overhead of 125 microseconds. Meaning every time I want to send a message or every time I want to receive a message, I have to spend a fixed amount of CPU time, which is equivalent to 125 microseconds.

This is the graph of the bandwidth you will be measuring on a network that is supposed to be infrequently fast and for which we have to pay this 125 microsecond overhead every time we want to send or receive a message. It's a very simple graph that shows -- sorry. I didn't used a infrequently fast network. I used a network that which has a bandwidth of one gigabit per second. One gigabit per second corresponds to 120 megabytes per second. You can see that eventually you would reach the physical bandwidth of the network. However, to get to that bandwidth you have to use very large message sizes.

Let me point out that to get to half of the bandwidth the hardware is making available, you need to use messages that are at least 16K large or even larger. So is 16K large number? Is it a big number? How large is a 16K byte message? Is that bad or good? Our reasoning is that 16K is quite a large number. There is a number of studies that show applications tend to use very small messages. And here is reported some of the results.

Typically these studies are measurements of the size of messages on an Ethernet network like a department network. The results of the studies is that most of these messages are less than 500 or 200 bytes. So let's say that your application has an average size of 500 bytes. If the bandwidth you are seeing is given by this graph and your typical message size is this one, you can see you'd be using a very small fraction of the bandwidth which is potentially available on your network.

So this was the rationale to start from scratch and build an entirely new communication software. So the idea was to build a communication software which had a very small communication overhead and that was using all the possible hardware features and was using, for example, the processor and network interface to do some of the protocol processing and so on.

The final goal was to deliver the bandwidth, the physical bandwidth, to the application level. Because there is a difference in saying we are using a one gigabit per second network and delivering a bandwidth with this performance to the applications.

We start with FM, which is the name we gave to the low level communication software. We started adding one feature at the time. We tried to put the least amount of software as possible into this library. You can see that even to do the simplest thing like link management, which is just a few instructions needed to send the data to the network card and tell the network the data is there and to send it to the other network, even with a few instructions you already lose quite a bit of the performance.

These are numbers from the start of the project in '94 and '95. At that time there the early Myrinet boards weren't that fast. They had the processor, but the processor was very simple then. At the beginning of project we were also using Sun workstations. And one of the things we found out was that the IO bus of this workstation, the SBus, wasn't optimized for small-size transactions. They were optimized for large transactions. So if you were using a DMA, you were getting good performance. If you were not using DMA and were using the processor to move data one word at a time, you were getting very much lower performance.

So we kept adding as few features as possible. In the end the only services that Fast Messages was offering was reliable communication, because we thought that was important and we couldn't do without it, flow control, and pretty much that's it.

This is the final performance we got. The blue line is the Fast Messages performance of the first three days of Fast Messages, which is happening in June of '95. This is compared with the other line I drew before. That's the hypothetical performance curve I showed you before. So you can see the difference.

The important difference here, it's not much in the peak bandwidth, the important difference is in the fast rise of the curve. You can see that to get half of the big bandwidth, which is this case is 10 megabytes per second, you only need messages of 64 bytes of size. The reason we achieved this result is we tried to keep the overhead as low as possible. The overhead, again, is the amount of CPU time you have to spend just to send or receive a message.

As I said, FM is a low-level messaging layer. It's not really intended for end users; rather it is intended to build a user-level library on top of it. Another intended purpose was to use it as a communication support for runtimes because there was some research in the same group on distributed object languages, and so one of the projects was to use FM as a transport within the runtime support for this language.

Briefly, this is a description of the program interface of Fast Messages. If you really want to use Fast Messages to send and receive messages, this is the primitive you will be using. There is a send, and there's no receive. Because the interface is an active message style interface, meaning on the send side in addition to specifying the destination and the buffer of the data you want to send and the size, you also specify a user-defined function that will use the data on the receive side. That's why there is no FM_receive. There's an FM_extract, which is a function which must be called frequently on the receive side to allow the processing of incoming messages.

So we built this low-level library. We got very good performance. The performance we were getting at the application level was comparable to, for example, OC3 speed, which was commonly available at that time. OC3 was about 150 megabytes a second. 150 megabits per second is more or less equivalent to 18 megabytes per second we were getting.

This is our lower-level library. We wanted to build a user-level library on top of this. How good is an interface like this, such a simple interface, if you want to build something on top of it? So we wanted to build a user-level library, which is MPI, and build on top of FM. So this is the result, and as you can see it wasn't very good. The performance we were getting with our implementation of MPI was quite distant from the performance we were getting with the low-level FM library itself.

So we started studying the reason for this. This is the same graph before but this is in terms of efficiency, meaning the percentage of bandwidth of the MPI library with respect to the FM library. Without an MPI implementation like this, if you're writing a program, it probably would be better to use FM instead of MPI.

When we started studying the problem, we realized there was some inefficiencies at the interface between the low-level library FM-MPI, and these inefficiencies were due to additional copies of the data we were sending through MPI. The reason there was additional copies for things like sending a packet. A common operation is every protocol processing is to take the data, add in your header and send it to the other side. You need to add a header for identification purposes.

In doing this operation, given the nature of the FM interface, we were obliged to copy the data in a staging buffer, copy the header and then send it out of this buffer. Given the high performance we were trying to achieve, even a memory to memory copy was hurting performance. So we realized that this was the reason for these inefficiencies.

So we kept working on this. We introduced a new version of the library which had a little more sophisticated interface. For example, one of the things that this interface does is allows you to send a message in pieces, and each piece is copied directly to the memory on the network interface.

So if you have an interface like this, as opposed to the old interface in which the message had to be sent in one single piece, with an interface like this you can now efficiently assemble a message without unnecessary copies. So this was the result. Now you can see that that the MPI curve is trailing the FM curve much closer in terms of efficiency. And this is the graph. Even at the user level, we were able to have success in our goal of delivering the highest possible share of performance to the applications.

Some of you may have noted here that the peak bandwidth in this graph has now become 90 megabytes per second. If you remember before in the graph for FM1 it was close to 18 megabytes per second. The reason is in addition to improving and making research on the interface between the libraries, we also switched hardware. So the previous graph was using FM on top of SAN workstations using Myrinet. This one is using PCI -- PC, again, interconnected to Myrinet.

This result is from a year and a half after the previous one. The reason the performance is so much better here is not so much the processor is fast, and that was also an important factor, but the important reason here is that the IO bus was so much better. The IO bus is what was really making a difference in our project.

The PCI bus is a more advanced bus than Sbus; for example, one of the things it allows you to do is to move data using the processor with a program IO approach with the same efficiency as with DMA transfers. So things now are designed -- the data is sent through the IO bus by the processor on the send side and then is moved using a DMA on the receive side. For FM, if you have an IO bus, it allows you to do fast transfers when the processor is moving data, and that's going to help a lot with performance.

Latency also went down. In this case, the advantage is given by the reason that the processor here is so much faster. In terms of MPI, these are very good numbers. These are comparable to what you would find in a supercomputer like the Origin 2000.

At this point in the project we now had a user-level library, and we decided it was time to test the quality of this communication software with real-world applications. We teamed up with NCSA, and thanks to their funding we were able to build a large cluster. 192 -- actually, 96 dual Pentium Pro machines. We used this cluster to do research on things like how well a high-performance library like FM and HPVM scale with such a large number or processors, and how operating systems scale when you're using architecture with such a large number of nodes.

Of course, we used this cluster to make a direct comparison with the applications that were already running on the Origin 2000 or the SP2 and NCSA. I'll be showing this direct comparison using some of the NCSA applications. All the numbers I'll be showing were taken on the first version of the cluster which uses 192 Pentium II clocked at 300 megahertz.

There is currently another version of the machine which is running NCSA, which uses 256 Pentiums. Some of those are the 300 megahertz, some are the new 500 megahertz parts, and they are planning to build a larger machine sometime next year. Let me clarify that this is a production machine. NCSA wanted a production machine. They didn't want it to just make research like our group. So this is a production machine which has been running continuously since '98.

A couple pictures. These are the HP machines we used. These are Kayaks. Each one has 500 megabytes of RAM. These are dual Pentium II. They also built some Compaq. These Compaqs had some problem with the PCI bridge, so they didn't make it in the final production cluster. They are being used as a separate, smaller cluster which is for development. We assembled the cluster in the department and then when moved it to NCSA. This is a picture of the cluster when it was just finished in our department.

This is a list of the applications we have been running these two years. I'll be showing the graphs for a couple of these. All these applications were already at NCSA either on the Origin 2000 or the SP2. We ported them to the NT Supercluster. The porting wasn't that complicated. Basically we had to recompile this application against the MPI-FM library.

The graphs I will be showing are measurements taken without too much tuning. This is the first of the applications. This is used to solve 2-D Navier-Stokes equations. You can see this is a comparison between different machines. Let me point out that the red line is the NT cluster performance. The best performance in the pack is on the Origin 2000.

AUDIENCE MEMBER: (Inaudible.) It's pretty linear.

DR. MARIO LAURIA: Yes, in this application. It lists up to 64. Some applications scaled well even farther, at least on the NT Supercluster. Some of them scaled well in the cluster and didn't scale too well on the Origin 2000 beyond 64 processors.

The reason is that the Origin 2000 has an architecture in which 64 processors are in one box. If you wanted to add more you need to add a second box and the communication time, the latency, going from one box to the other, is higher than communication within a box. So that's why some applications didn't scale too well beyond 64.

You can see the performance ratio between the supercluster and the best run on the Origin 2000 is approximately 2, 1.5. The ratio was approximately constant on all the applications we used. The NT cluster performed quite well compared to SP2. This is the same application. This time it is speedup. Here again the performance ratio is between 1.5 and 2. This application is a kernel to do the Conjugated Gradient method. I don't know what happened to the processor numbers, but if I remember correctly this was 64 and this was 128. Again, the performance ratio is between 1.5 and 2.

Scaling. This is still in our applications. Cactus is an application to solve Einstein's equation on general relatively. Performance, as you can see the scaling is quite good. The ratio between the Origin and the NT this time is about 2.5.

AUDIENCE MEMBER: How do you explain that variation? You are right there with the Origin, and on the other you had a major difference. What characteristics of the code makes --

DR. MARIO LAURIA: What difference are you pointing to? This is the scaling.

AUDIENCE MEMBER: Okay.

DR. MARIO LAURIA: This is not absolute performance. This is performance compared to --

AUDIENCE MEMBER: I got you now. I didn't read the green.

DR. MARIO LAURIA: Okay. In absolute terms the ratio is about 2.5.

AUDIENCE MEMBER: Got you.

DR. MARIO LAURIA: This is the Quantum Monte Carlo application, and, again, these are absolute values, and here again the ratio is about 1.7.

So let me point out that all these numbers are taken with the 300 megahertz Pentium II's because this is what was available two years ago. The new machines NCSA is buying to enlarge the cluster are 500, that they bought last year, were 500, 550 megahertz. And I'll be buying and building my own cluster this year and probably will be using one gigahertz machines. In all this time I think the Origin 2000 processor still clocked at the 295 megahertz that we used at that time. So by today this gap should be much narrower.

This slide is meant to give an interesting characterization of all these supercomputers. I've often been asked, What's the difference between the NT Supercluster and a Beowulf Cluster? Here is a way of describing the difference. What I'm reporting here is on the first column is the ratio of the number of megaflops available for each node in the machine. The second is the number of FLOPS, Floating Point Operation Per second, for each byte of the communication bandwidth between the nodes.

So to get this number, we took the megaflop, the total of megaflops on each number, and divided it by the bandwidth that you can get using MPI-FM. This number is the megaflop divided by the run take time. Basically, this is a way of characterizing all these machines from the point of view of the ratio between the computing power of the processor compared to the communication performance.

It's the ratio between node performance to communication performance. Both in terms of bandwidth, in terms of latency, communication latency. You can see probably the most significant column is the last one, and you can see the NT Supercluster has an order of magnitude very close to the Origin 2000 and the Cray T3E.

What are the implications of these numbers? The way I read this is that these are more general purpose machines, and you can use this machine, a Beowulf machine. A Beowulf machine is great if you have an application that does a lot of computation but not a lot of communication. So in a sense you have to restrict yourself to a smaller number of applications if you have a Beowulf. With an NT cluster, or any of the other machines here, you have more freedom in the choice of applications that will perform well on your machine.

Let me briefly describe where we are today and what project I'm working on which is connected to HPVM. The latest release of HPVM is adding one more primary interface. Probably more interesting is that there is now support for shared memory. Meaning if you have a dual processor machine or a quadprocessor machine, it used to be to the case with the previous releases of HPVM that data was sent to the nearest switch and sent back, now this doesn't happen. If data is sent to a process which is running on the same machine, communication happens through shared memory, so it's much faster.

We also added support for VIA. VIA is the standard proposed by Intel, Compaq, and Microsoft, and it's a standard for high-performance interconnect. And Fast Messages has been widely recognized together with projects like Active Messages and U-Net as one of the research projects that inspired the creation of VIA. Today you can buy at least one high-speed network that conforms to the VIA standard, and that network is called Giganet.

The numbers. There has been some improvement in the Myrinet numbers, as you can see. Numbers have improved despite the fact that we have added something else in the software. In a sense we made the software a little more complicated by adding the shared memory support. The performance is better than the previous release, and essentially one of the main reasons is that we are using faster and faster hardware and that's why we're getting better and better numbers.

AUDIENCE MEMBER: Is VIA a Giganet competitor? I don't know anything about VIA.

DR. MARIO LAURIA: Giganet is a commercial product that you can buy that's implementing the VIA standard. The idea behind the VIA standard is that one day there will be several products implementing this standard and will be competing for the same market, and hopefully the hardware will become cheaper and cheaper. As you can expect, if you share memory, the performance you get is much better than one through Myrinet or Giganet. This is our latest graph for performance. There are four curves because there is now the performance number for VIA. VIA, again, is the standard for user-level communication, and we'll see how successful it is going to be.

One exciting thing about HPVM is that a number of other research groups have used it for their research. For example, the projects I'm aware of is a project at Ohio State. Before I joined the department, there was a group of professors. Panda. They used Fast Messages to do research on collective communication with support for collective communication. There is a group in Amsterdam. They have a nice cluster and they are using FM, and they have augmented it with multicast functionality. My alma mater is another group with a small cluster. They have been using it to build a video server. The used a cluster to build a video sever.

As I mentioned before, Fast Messages together with other projects from Berkeley, Active Messages from Cornell, U-Net, and Princeton has been the inspiration for the VIA standard. These are the URL's in case anyone wants to download the latest version of HPVM.

One of the exciting things about clusters is that you can use them to do research on systems and system architecture. And this is a project I'm currently working on. One of the things we learned after building the NT Supercluster is the machines become faster and faster, and applications are more or less the same. And the tendency of users is, as they get faster and faster hardware, is to run the application on larger and larger datasets. So there's an extreme need for large storage systems.

When we've moved to San Diego, we learned even the San Diego Supercomputer Center had similar interests. They actually have a group which is entirely dedicated to the support of large dataset applications. Large dataset applications require hundreds of gigabytes of data. An example of this is the Digital Sky Survey, which is a project between several universities in which they are taking a digital map of the sky at several different wavelengths, and the data they are collecting one day will amount to several terabytes of data.

On the internet there are plenty of other applications that can use large storage systems. Probably all of you know the TerraServer project was a demonstration project built by Microsoft. I'm sure everyone knows about the streamed audio/video.

Let me take a minute to talk about the third bullet. The reason I arrived here today and have not been have been here since Sunday is that in San Diego there a conference on bioinfomatics. This is an exploding field in which the bulk of the research is gathering data in these huge databases and trying to make sense of these huge amounts of data. Data like the human genome and Proting structures. This is an exploding field in which large datasets are increasingly and increasingly important. The play a very large role.

There are, of course, some commercial systems to handle this amount of data; however, it's very expensive and usually these are targeted, price wise, to large supercomputing centers. Our idea was we have clusters which are 16, 32, even more machines. The PC disk technology is the most inexpensive disk technology you can buy. Why not use a cluster to build something like a terabyte storage server using inexpensive ID disks?

The idea was to buy ID disks and put 2, 3, 4 of them on each one of the nodes on the cluster and then take advantage of the fast communication to move data across the nodes and do things like data scribing or data declustering and redistribution and so on. The advantage of this approach is not only are you using inexpensive technology, you are using the parallelism of all these machines to get a very high aggregate IO bandwidth. So if you distribute your data across all the disks, you can potentially read and write from all of them at the same time and get a very large IO bandwidth.

So the project initially was this. We had this cluster at the Department of Computer Science at UCSD. Across campus there was this supercomputing center. The idea was to connect our cluster to their SP2 with a high-speed link and try to do something interesting like running an application on the SP2 using our cluster as the storage server for the applications.

Here are some of the technical details. We used a 32-node cluster. We bought 64 disks. The largest that were available at the time. So for $30,000 we were able to put together one terabyte of disks. And this $30,000 has to be compared with the price of these large existing commercial storage servers which have price tags of hundreds of thousands of dollars.

The challenge, of course, is that once you have installed all these disks on all these machines, you have to find a way of using them. So the idea was to use something like an existing parallel IO library to do the high performance IO. There were a number of projects on high performance IO, so we didn't want to start our own. We wanted to use an existing piece of software.

We started experimenting with Panda. I mentioned before that HPVM has been used by several groups. There is a group at the University of Illinois doing research on higher performance IO. Their library is called Panda. They had been using HPVM, so we had the benefit of having Panda, which was already running on top of our cluster.

On top of Panda there's SRB, which is a piece of software built by the San Diego Supercomputer Center to do remote access to storage systems. The idea was to use Panda to access the disks in parallel and to get high performance IO and then use SRB to move data in and out of the cluster.

I think I'm running a little, so let me conclude by saying, clusters are becoming increasingly popular. Our approach to building a cluster was to use NT. The bulk of the research was in high performance communication, and then the collateral aspect was let's use NT and that proved to be an interesting research issue in itself.

Projects like HPVM demonstrate the maturity of the technology, and the things you can do. They are interesting because HPVM is not just a research project. Thanks to the NT Supercluster which is running NCSA, I think we have a convincing demonstration of the level of maturity achieved by this technology. It's also interesting because it's a good starting point for further research on cluster architecture, and I gave as an example the storage server I'm working on.

That's it, and if you have any questions I'll be happy to answer them.

AUDIENCE MEMBER: Let me ask you. Everybody in clusters, of course, is interested in high-performance floating point computation. Has anybody thought of transactions processing?

DR. MARIO LAURIA: Yes. I'm not that familiar with this use of a cluster. Some of the successful commercial applications of cluster in transaction processing are -- on one side there is Oracle that has a prior version of their Delta Base; and on the other side there are companies like Compaq, who are strongly promoting their PC and their hardware for an application like this.

I can only mention let's see, what else. Microsoft started something in clusters, but it was a half-hearted attempt. They have something like, I don't remember the commercial name, something that resembles a cluster which is basically it's only 2PC's connected together running NT. The purpose of this Microsoft product is just to have a higher availability product, and I think that's mainly targeted to customers that want to put databases on their machines to have highly reliable machines.

So there are a number of applications. I'm not familiar with a research project in that area.

AUDIENCE MEMBER: What is the motivation for doing NT rather than Linux?

DR. MARIO LAURIA: There are a couple motivations. One was we're using PC's, and 99 percent of PC's are using Windows. Why not use Windows to do our performance computation? In principle, there is nothing wrong with the idea, so we started and we tried and it worked.

After having worked with both Unix and NT, because the projects that started on SPARC and are using Solaris, I can say there is no clear winner on what's the best operating system to use for building a cluster, so basically whatever you're more comfortable with, use it.

So one reason was to enable a larger number of users to use clusters. The idea is if you used Windows, that's the operating system that almost everybody has on their desktop, so we are probably enabling a larger share of people to use clusters. Unix users and now, thanks to this work, it can use that by anybody. The other reason is there was a lot of funding available for someone doing research in NT, so that was another good motivation.

AUDIENCE MEMBER: I don't know if it was in FM or HPVM where for MPI you locally shared memory and remotely you go across the network, right?

DR. MARIO LAURIA: Yes. This is in --

AUDIENCE MEMBER: I'm sort of curious about --

DR. MARIO LAURIA: So the shared memory is inside Fast Messages now. Fast Messages is the lower level on top of which we used MPI.

AUDIENCE MEMBER: So this decision about whether to use shared memory across the network must be made pretty far down?

DR. MARIO LAURIA: That's correct. That's why it's in Fast Messages.

AUDIENCE MEMBER: Is that with every MPI connection or something like that?

DR. MARIO LAURIA: No. MPI doesn't know anything about what's going on downstairs, so to speak. I used MPICH to do this implementation of MPI. MPICH is a public domain implementation of MPI which has this nice feature which is built on top of -- inside there is another layer called ADI, which is another program interface. It's designed this way to make it easier to port to a new platform.

ADI basically is a restricted set of functions. Once you have implemented the functions in ADI, which are about a dozen, you have the entire MPICH library running on that platform. Remember MPI has something like 115 primitives, so it would be quite a lot of work to implement all them directly on top of Fast Messages. The MPICH people at Argonne had this nice idea to build on top of ADI. So the ADI in turn is implemented on Fast Messages, and the ADI doesn't just call Fast Messages to sent data to another node.

AUDIENCE MEMBER: This decision has been made at each message that gets sent?

DR. MARIO LAURIA: Basically MPI makes an ADI call which makes a call to Fast Messages, and then Fast Messages makes the decision of whether to use Myrinet or the shared memory.

AUDIENCE MEMBER: There are three questions lingering in my mind. I'll do them one by one. Are you aware of other systems of about the same significance of size which are running on NT?

DR. MARIO LAURIA: Not on NT.

AUDIENCE MEMBER: I'm asking just NT, about others with we know.

DR. MARIO LAURIA: Not this size. I think there are smaller clusters, but not this large that I know of.

AUDIENCE MEMBER: I have a comment on that. It turns out that there is a machine, which is a database machine, it's not a cluster, but you can think of a parallel machine too, and it's a data warehousing machine which is very popular. It's called the Teradata. The classical implementation was in Unix. They now have an NT machine of which we have the second copy that has ever been produced so it's very recent.

AUDIENCE MEMBER: I understand. My second question is have I heard you well saying there was a lot of funding for NT? From who?

DR. MARIO LAURIA: Microsoft, who else?

AUDIENCE MEMBER: My third question. You said that was a production system at NCSA. What kind of production are you doing?

DR. MARIO LAURIA: Running the application I listed before.

AUDIENCE MEMBER: I missed that, sorry.

DR. MARIO LAURIA: As you can imagine, there is a lot of competition for applications like this to run on the Origin 2000 or the SP2 at NCSA. There are, of course, hundreds of users running an application like this and they are interested in any kind of expensive machine that can run these things.

You can think of the NT Supercluster as an MPI machine. All the applications are inside applications. Anything that runs MPI fast, that's good enough for them. And the first year or two the NT Supercluster was run on friendly mode, meaning they were not charging users for the use, now they are charging, so a lot of users could go there and get an account and run it get in the queue like the origin 2000.

AUDIENCE MEMBER: You said you were building another one this fall. Funded by whom?

DR. MARIO LAURIA: Good question. I've just started. I was hired in March. So for now, just my startup funds and writing grant proposals. We'll start with a small cluster from the funds I have.

AUDIENCE MEMBER: It's called the Ohio Board of Regents.

DR. MARIO LAURIA: Yes.

AUDIENCE MEMBER: Thank you.

DR. MARIO LAURIA: Another question?

AUDIENCE MEMBER: (Inaudible.)

DR. MARIO LAURIA: This graph?

AUDIENCE MEMBER: At 2048 you were looking at bandwidth.

DR. MARIO LAURIA: This is the packetization that is going on within FM. Packets in this version of FM was 2K. So after 2K if you sent a message which is 2K and one, you need two packets instead of one. So the dip reflects the additional processing for the second packet.

MPI is following the dip with a shift because it's adding a header. So when the MPI graph refers to user data, the actual MPI message is longer because there is user data plus 32 bytes of header or something like that.

AUDIENCE MEMBER: I don't know if this is a fair question, but looking at some of those curves and seeing the differential (Inaudible). What's your prediction on supercomputers? Where is SGI and others going?

DR. MARIO LAURIA: First of all, let me observe that SP2 is a cluster itself. The difference is that an SP2 is built using the IBM workstation, the SP6000. The difference is they use a proprietary network. Besides that, that's a cluster. The Origin 2000 is not a cluster. So we'll see.

For a prediction on performance, I'll take this and make two observations. As I said, these are 300 megahertz. Probably by the time you use a 800 megahertz processor -- or one gigahertz, the two curves would be overlapping.

The other interesting observation is this is a ten hundred megahertz processor and this is a 295 megahertz processor. Communication performance is more or less the same, so why this wide gap? The reason I believe is that the floating point performance of Pentiums is not that great compared to the R10K.

Pentiums were built as general purpose machines, built for desktops. It will be interesting to see when the Merced comes out, the IA-64, because for this new processor a floating point unit has been very well designed. HP was involved in the design, and they're good at building floating point coprocessors.

I think when the Itanium processor becomes available, there will be another jump in performance in a comparison like this just because the floating point performance will be so much better at the same clock speed.

AUDIENCE MEMBER: Are clusters going to be the supercomputer of the future?

DR. MARIO LAURIA: I think they already are, and not only for the performance reasons. I think the main reason is the cost, plus they are scalable. One interesting thing I've seen in all the campuses I've been on and to the various supercomputing centers I've been to, it's often the case that people on campus decide -- people in medicine or biology -- they decide they want their own cluster, so they go to the supercomputing center and ask for directions. They say, I heard about the cluster. I have the money to build a cluster because they're not expensive. Can you help me with this?

NCSA was quite ahead. In '98 they funded our project. San Diego Supercomputer Center has been a little late on cluster technology. They weren't so enthusiastic. If you go there, they will showcase the 1,192 processor SP2 they just got in January. They are brand new things.

Only now they have hired people to build a supercluster. They are building clusters because someone from the Scripps Institution decided they wanted to build a cluster for their simulation and walked into the supercomputing center and said could you help us.

There a lot of people who have taken the initiative and decided I want my own supercomputer. I have the money for it.

AUDIENCE MEMBER: Can I try a different angle? You know, I'm certainly not an expert in economics, but think of it this way. If you are building a supercomputer for a customer, you have to provide very reliable software and at the same time hardware that you do not want to redesign too often.

Now, if he wanted to change Myrinet because he wants to go to Giganet, that's not a big deal. You don't have to move that much. But a supercomputer manufacturer that has a box and already has the processor, or if he gets a better processor, all he gets is custom boxes that are already built for him. He doesn't have the expenditure of having to design all the compilation for the new processor. It's off-the-self.

So the supercomputer manufacturer is going to have to utilize commodity processors and commodity networks. I think once they get into that mode, they will be doing the same. There will be cluster supercomputers, not because of any architectural issues, just because they are cheaper. And they are cheaper because everybody is going to be using the same processor on their desktops.

So that's my explanation, but I don't think SGI is going to go broke because they are getting into using. That's why they are doing clusters.

AUDIENCE MEMBER: It seems to me they saw we're getting beat by these, so if you can't beat them, join them.

AUDIENCE MEMBER: What is dead, of course, is the vector machines that use highly vectorized processors, and they require very specialized designs, and they are still around. But I think those are not going to be. See, it's a marketplace driven. If you produce many of a kind, it's the old Model T. So I don't think it's an economic mystery there.

DR. MARIO LAURIA: Right. If you build a cluster using the latest PC technology, you are leveraging Intel investments in that technology.

AUDIENCE MEMBER: I think, just looking at your graph and ignoring economics, you have 2CPUs of similar speeds, but there's a separation. And sure, your CPU's are going to get faster, but you're comparing that against an Origin 2000. Let's look at an Origin 3000 in order to be current, and, as a matter of fact, no matter how fast those CPU's get, the Origin CPU's can be just as fast. As a matter of fact, there's no reason why they can't simply use the same CPU.

DR. MARIO LAURIA: Since you are comparing SGI investment and processor design against Intel investment and processor design, you can guess who's going to win. Look at the amount of investments.

AUDIENCE MEMBER: The question is: Who is putting the money into the research to produce the hardware? And Intel is putting the money for the clusters, and SGI probably will also capitalize on Intel. So the competition, it's going to be between the same processors. It's not going to be --

AUDIENCE MEMBER: Right. So I think basically you're never going to catch up in this graph.

AUDIENCE MEMBER: Especially if you increase the number of processors.

AUDIENCE MEMBER: Because it doesn't matter how fast your CPU gets, right?

DR. MARIO LAURIA: Un-hum.

AUDIENCE MEMBER: You have to change your slope somehow. Now, I don't mean that you can't change your slope somehow, but one suspects that to change the slope you have to spend dollars.

AUDIENCE MEMBER: The point here is that the ratio there is 1.7 faster than NT. The cost is what?

DR. MARIO LAURIA: Probably 1 to 5.

AUDIENCE MEMBER: Sure. When you start arguing price performance, I agree with you. But I think just to look at the two lines, I think they've crossed at zero and they're never going to cross again. I mean, as a matter of fact, they're getting farther apart.

AUDIENCE MEMBER: I don't think they're going to cross because I don't think anybody is going to spend five times for half the performance.

AUDIENCE MEMBER: That's not five times for half the performance, it's five times for twice the performance.

AUDIENCE MEMBER: Or something like that. Maybe the National Security Agency will, and they have in the past and they still do. The interesting thing would be to a 3-D graph in which you would include cost.

AUDIENCE MEMBER: That's another story. There it looks like those two groups are getting farther apart.

AUDIENCE MEMBER: If you put cost, the dimensions go the other way.

AUDIENCE MEMBER: Right. And then that company can't compete anymore, so that top curve will disappear.

AUDIENCE MEMBER: Exactly my point.

AUDIENCE MEMBER: It will disappear. Except for NCSA and some other, maybe DOD. Some places will still need very high performance machines that are the top of the line. But there are not going to be too many, and they are going to be very expensive. That's my crystal ball.

AUDIENCE MEMBER: How far has the scaling been tested? How many processors?

DR. MARIO LAURIA: For that application, I don't know. I don't have any of the graphs here.

AUDIENCE MEMBER: The question is, is there any indication of saturation?

DR. MARIO LAURIA: It's very application dependant, so for some you can see that, for some you don't.

AUDIENCE MEMBER: That's why I was asking about the transaction machine because those are the database --

DR. MARIO LAURIA: But probably a transaction machine doesn't need high performance communication.

AUDIENCE MEMBER: Let me give you one example, and this is the only example I really know. When you think of all the Wal-Mart's in the country connected together to a distribution center, that's not insignificant computation. And not only that, but they gather that data. It's not just a transaction, it's what you do with the data that you are gathering. You're analyzing trends. Many banks are actually betting on market fluctuations. I mean, there are people doing economic studies that depend on having the results quickly.

AUDIENCE MEMBER: I think, though, Oscar you're just arguing against yourself now, because it could be those are the folks who will pay five times as much for twice the performance.

AUDIENCE MEMBER: I didn't say there wouldn't be some, but they are not going to be as many as, let's say, probably the five hundred or thousand universities and different researchers. In any university you probably will have ten clusters.

AUDIENCE MEMBER: I would think lots more. I mean, clusters are definitely a university game.

AUDIENCE MEMBER: So you will have more of the not so specialized. You may have JP Morgan for one of those doing the heavy-duty banking stuff, but you're not going to have the local bank, I'm guessing, you know.

AUDIENCE MEMBER: All we can do is speculate.

AUDIENCE MEMBER: Yeah.

DR. OSCAR GARCIA: Any other questions?

Okay. Thank you. So much.

 

last revised:11/16/00 05:11:28 PM
editor: pmateti@cs.wright.edu