Summer Institute on Advanced Computation

August 20-23, 2000


College of Engineering & CS
Wright State University
Dayton, Ohio 45435-0001

Software Tools and Programming Paradigms for HPC

Joshua Simons
Distinguished Engineer in High Performance Computing
Sun Microsystems

 
August 21, 2000. 1:00 p.m. This site contains the "web-based proceedings" of the Summer Institute on Advanced Computation, which focused on high-performance, high-throughput cluster computing.
           

DR. OSCAR GARCIA: Our luncheon speaker we are very happy to announce is from Sun Microsystems, one of the contributors to this event. He is a Distinguished Engineer in High Performance Computing at Sun Microsystems. Mr. Joshua Simons joined Sun in 1996 from Thinking Machines, where he worked in a variety of software areas related to high performance computing during his eight years with that company. He holds a Science Masters -- Masters of Science. S.M. is the way it says here, so Masters of Science Degree in Computer Science from Harvard.
Let's welcome Joshua Simons.

           


MR. JOSHUA SIMONS: So can you hear me okay? I know you're eating, but I'd like to make this as interactive as possible. So please ask questions. I'll try to repeat them, since I'm the only one that's miked here.
I actually have two talks here. We'll see if we get to both of them. I'm going to talk about HPC tools and programming paradigms. It's a fairly short talk, so we won't be able to get into a lot of depth into each of the areas, but I hope we can have an interesting discussion nonetheless.
In terms of an overview for the talk, I'd like to give you a few words about my view, and, I guess, Sun's view on current HPC system configurations, what's interesting out there, what's happening in that space. I'll talk a bit about what current programming paradigms look like, where I believe that's going in the future and give you some ideas. There are lots of question marks there about potential directions.
And then as an example and not as a product pitch, I'm going to go through some of the features of one of the products we have called HPC ClusterTools, which is a distributed memory programming environment. All the various tools in that, we'll go through that.
And as I said, if there's time, we'll spend a little bit going over three strategic investments that Sun's making that aren't directly related to HPC at first blush, but I think if we talk about this together, I'm hoping that you'll see that they actually do have an interesting relationship to HPC. I will try to get to that. I'd be interested in your feedback there.
Okay. So what is an HPC system at this point? The first point I'd like to make is that I believe that we've gone through essentially what I would call the renaissance of HPC at this point. And I mean a renaissance in the literal sense. It's a rebirth; it's not a flowering necessarily of interesting new technology and introducing cool ways of doing things.
What's happened is that, if you look at -- as Oscar mentioned, I spent eight years at Thinking Machines. If you look at the past, say ten years or so, it's littered with a very large number of broken and dead computer companies that were trying to address the HPC market directly and were marketing solely into HPC. Thinking Machines, KSR, MasPar. We could name many more companies.
What's happened is that the more mainstream vendors, like IBM, HP, and now Sun as well, have stepped up and moved into the HPC space. They're primarily doing that by using what I would call essentially off-the-shelf technology that's been developed not particularly for HPC. So in the sense that there are more and stronger vendors involved in the HPC space at this point, I think of it as a rebirth.
As I said before, in terms of some funky, new hardware there are some exceptions. Tera, for example, is doing some interesting technology. Of course, I don't want to make the point that SMPs are not interesting technology, but in terms of technologies that are targeted directly at HPC, you don't really see that very much anymore. So that's point number one.
Point number two, and I'm borrowing terms from the commercial side here, we believe that horizontal, vertical, and diagonal scaling are extremely important for HPC. What determines which one is most appropriate for you is what kind of applications you are actually trying to run and what kinds of problems you are actually trying to solve.
What we mean by horizontal scaling is large numbers of small boxes ganged together, so more like a Beowulf Cluster. You choose what small is, say 4 CPUs or below per node.
Vertical scaling refers to large boxes, so large SMPs. For example, Sun's current generation goes up to a 64-processor HPC 10000, which is a fairly large box.
Diagonal, as you would imagine, is the combination of both of them. That's very large configurations of very large boxes. That's the kind of thing, for example, you would see used for an ASCI configuration by DOE, or by any number of terascale installations.
In terms of bandwidth and latency requirements, they're all over the map. Certainly you can find, and I'm sure everyone in this room could name, applications that require both very high bandwidth and extremely low latencies. There's also emerging an increasingly large section of the market that doesn't have those requirements.
If you look at bioinformatics for example, which we consider to be very much HPC, you really -- if you're doing matching for example, you could really do that with wet noodles running between your nodes. It's massively parallel. It's task parallel. You don't really need much in terms of interconnect, but you do need computational power.
And what got cut off at the bottom there, and I apologize for that, is the comment that RAS, reliability, availability and serviceability are also extremely important. If only because when you tend to build these very large installations, it's extremely important to make sure not to lose components from that machine.
For example, if you're running an MPI job across that entire config, you don't want one of the nodes crashing on you. So characteristics that have been historically important on the commercial side, the RAS capabilities, are becoming increasingly important on the HPC side.
That's kind of our view of the space. Sun sort of parenthetically covers that space actually quite well. But this is, again, not a pitch for Sun. I want to talk more about the technology.
So there are lots of requirements for tool sets, and I just wanted to mention a few of them here. Certainly any HPC tool set that is developed needs to adequately, or more than adequately, support the programming models that are currently in use. And there are essentially three prevalent programming models that we see at this point as being important.
One is thread-level parallelism. And that means explicit threads, perhaps with POSIX threads, or implicit threads, using auto parallelization with a compiler, and directives using OpenMP. Those are all fair game for being labeled as thread-level parallelism.
Sun in particular believes this is important. Again, we've got these very large SMPs. You see increasing numbers of vendors working towards these larger and larger machines. DEC, IBM and whatnot, HP as well. This is a model that's here to stay. It's not necessarily an easy model to use, but it's an effective way of getting parallelism out of an application on a single box.
Message passing is the second paradigm and for us this means MPI. For all intents and purposes, I believe not much new coding is happening in PVM at this point. MPI, as of the version 2 standard, is now general enough to pretty much deal with anything that was being done in PVM.
And then the hybrid model, the mixture of these two, is actually becoming quite interesting to a lot of HPC customers. That means having threads within your node and using MPI only between your nodes. Again, it's important that your toolkit support this.
Performance analysis capabilities are obviously extremely important. This is an area, and I'm sure you're well aware, if you look out in the wide world of tools development, I would guess this is probably the richest area where there are the most number of tools. In spite of that, I don't think that anyone's really solved the message passing performance analysis problem for large applications. And I don't claim we have either at this point. I think it's an area of active research.
Parallel I/O tends to be important for the sort of traditional HPC customers. I think the jury is still out on how much of the overall HPC market actually takes advantage of parallel I/O. I'd be interested to hear if people here have an opinion on that.
Resource management is critical. The ability to take your clustered system, because we do believe most HPC systems in the future are going to be some kind of a cluster, take those and actually treat them as single computational resources. We'll talk a little bit more about that in a minute.
And then of course, high performance. You can have all the functionality in your tool kit that I mentioned above and more, but if you don't have high performance then there's really no point to it. And we'll talk about some of the optimizations that should be done in an MPI library, for example, to make it high performance.
This is my last bad slide. Where are programming models going in the future? Certainly the current models are going to continue, and that's a good thing. I want to acknowledge the fact that we've moved forward. If you think back about five, ten years ago we did not have what I'd call a lingua franca for allowing folks like you to write applications, portable applications, and move them around on multiple HPC platforms.
Well, we finally got there. MPI, it's low level. It's got all the functionality there. It is low level, but you can write your applications and you can move them between vendor platforms. This is a good thing, but is that good enough? I don't think it is.
Certainly OpenMP has come along. That standardized the development of directives for thread-level parallelism. Looks to me as though -- well, there is work now going on to expand OpenMP into the NUMA realm. You'll see more and more vendors with NUMA-like capabilities. The DEC Wildfire is a good example, SGI. Sun also had technology that was called WildFire. It was an internal beta program that was actually discussed at the last Supercomputing conference, so if you saw that paper you understand that Sun is doing something that is similar to NUMA, although not quite the same.
So OpenMP will continue to evolve, certainly for NUMA. I believe that there will also be a motion towards evolving OpenMP for clusters, and I label that as "The Revenge of HPF." I think HPF failed to take off. It sort of nose-dived into the runway before it really got full speed. And there are a lot of reasons for that, and we can talk about that if people are interested. It was a bit of a hard language for the vendors to actually develop compilers for, efficient compilers.
What I'm hoping is that because OpenMP has become so popular and is so widely accepted by the community, that we can continue to extend its locality awareness that's being developed on the NUMA side, continue to develop that to move it out into the cluster realm. And keep it simple enough so that it continues to be accepted and still usable by customers to sort of up level their codes a little bit, add some data parallel flavor, add some layout directives, the way HPF sort of did that, and raise us up a bit more.
But above and beyond that, can we do better than that? I think there are a couple of interesting places to look at least. Certainly high-level object-oriented frameworks like POOMA, which was developed at Los Alamos National Lab, is an interesting thing. The nice thing about a system like POOMA is that they have shown it is possible to raise the level of abstraction for your programming but still maintain high performance.
It's always been the case that people associate high levels of abstraction and object-oriented programming in particular with very high overhead and very poor performance. It doesn't have to be that way, and I think we should look at moving upstream a bit into these higher levels of abstraction.
There's a question about what Java's role is in HPC. And I have to mention Java, right? I'm from Sun. I can't help that. The Java Grande Forum, I'm sure some of you are involved in this, is looking at different ways of integrating Java technology into HPC.
And there are different efforts running and there are different time lines. There are language modifications being considered to actually make Java an HPC worthy language itself for coding your actual codes, but you can also use it as an infrastructure for building interesting conglomerations of code that work in a distributed realm. And we'll talk a little bit more about that in a minute.
I'd also add, and I know it's one of the themes in your institute here this summer, is the idea of global models, and that's a bullet that actually fell completely off the screen. So I would label those others sort of local models, what you would run within a single site. But global models, I believe, are extremely important for HPC.
They fall into a couple of different categories, at least the way that I look at it. Distributed resource management; the ability to tie together a set of distributed resources the way Globus and Legion do, to give you access to these computational resources that are out there somewhere in this cloud of network and computation. Where you don't necessarily need to know where you're going, but you're going after a particular capability.
We've actually done some internal research around a Jini-based distributed resource manager to use Jini as an infrastructure for dealing with the coming and going of computational resources in the wide area.
On the other side there's Metacomputing; the ability to tie together resources that are distributed globally and use those to solve a single problem. That one is a little bit more of a stretch from my point of view. And we will get there eventually, I believe, as the bandwidth within the network continues to increase. We won't ever solve the latency problem in the wide area, so it's going to be appropriate for particular kinds of applications, but I don't think ultimately there's anything stopping us from doing that sort of thing.
So Globus and NetSolve are two examples. SETI, I like to think of that as an interesting example. And then Popular Power is a small startup that's actually trying to commercialize essentially what SETI is doing, and that looks like an interesting technology to follow as well.
Having made those sort of general comments, I want to run through our cluster tool products in some amount of detail to give you an idea of what a typical product for distributed memory computing looks like from our point of view at least, and I'll try to dive into some detail in places where I think it might be interesting.
So just to place us in context, we've talked a little bit about this. On the right-hand side, I'm talking about the development side. This is just meant to echo the fact that we believe that there are, again, these two or three main programming environments; the threaded applications and then distributed.
This set of slides is going to deal with the distributed aspect, but I did want to mention that, obviously, a core component in building a distributed memory toolkit is that you have a high performing local node toolkit as well. So everything is built on high performance compilers, good analyzers and good performance libraries.
The toolkit is built for both single SMPs and clusters of SMPs, and that's a key point. People tend to think of distributed memory computing as being for clusters. I mean that's obvious, right? But we also are very serious about making sure that MPI in particular runs extremely well on single SMPs. It's important for us because we do have such large SMPs available. We do have a fair number of HPC customers that would consider using a single 64-processor box for an MPI application, for example, and we could talk about that in a minute.
So what's in the toolkit? I don't think this should be too much of a surprise. There's a resource manager. The MPI library itself. There's a distributed math library. We have a distributed parallel debugger or development environment, and MPI I/O and parallel file system are both aspects of our parallel I/O solution.
And I should mention at this point that most of this technology came over from Thinking Machines about four years ago when I came over with the group, so it was acquired and the folks were all hired. And that was one of a number of HPC-oriented acquisitions that Sun has made over the last several years.
So this echoes the basic point that it's for clusters of UltraSPARC-based machines of any size. So desktop -- actually below desktop. We have rack mounted systems as well, rack mounted clusters -- up to SMP+, which was the technology that I mentioned was discussed at Supercomputing last year.
The cluster interconnect is an important point. Certainly we support any TCP connected interconnect. We've also done some particular work around SCI, and I'll talk more about why we did that. SCI is not necessarily interesting in and of itself, but technologically it's interesting, and we'll get to that on the next slide.
We've also opened up our MPI architecture to allow third-party vendors to plug into our HPC software stack, and I think this is a key point for any reasonable HPC approach, software approach, is to allow this interoperability. The reason I mention this, is what we've done with this, this is about a million lines of source code, we've actually made it available on the web under Sun's community source agreement.
One of the first organizations that signed up to use this was Myricom. I'm sure you're familiar with their Myrinet hardware. What they're doing is actually integrating their Myrinet hardware and their GM layer with our MPI library, so that they'll be able to offer a low-latency solution through our MPI and across their interconnect.
Okay. This is sort of product oriented. The only thing I want to mention on this slide is the last bullet. The release that I'm talking about right now is reasonably scaled. We can support single parallel jobs that span up to 64 nodes and consume a thousand processes. And that's not a limit on the size of your cluster certainly. The cluster can be much larger if you have a resource manager that deals with that. But this is more a testing limit than anything else, and we'll be taking that higher in the future.
So a resource manager. How do we do this? The strategy that we've taken to date, up until last month actually, was purely a third-party approach. We offer something called the CRE; Cluster Runtime Environment. That is essentially a smart job launcher that's capable of taking your job, you tell it where you want to put the -- it actually does load balancing -- but in its simplest case, you tell it where you want to put your job and it puts it there.
The idea was that third-party resource managers would be layered on top of that. We didn't want to make a decision for the customer to force them into a particular resource manager, because we found that customers had heterogeneous computing environments and had previously chosen a resource manager. So we wanted to work with whatever you had chosen.
We did, in spite of that, do an integration with LSF from Platform Computing, because we felt as though they did have a good share of the market, they have a good amount of functionality, and if people wanted to use that, we would offer additional integration there.
Now one thing that has changed is the bottom line here. Sun, I think we announced it end of last month, we acquired Gridware. Gridware is the company that produces Codine and GRD. What we've done is essentially brought a distributed resource manager in-house within Sun, and what you'll see over time is a move to integrate the ClusterTools product more and more with that. This will become part of the base capabilities within Solaris at some point.
The goal is that every box actually be able to -- that every box actually be able to function as part of a distributed cluster, resource-management cluster. But we still want to maintain third-party interoperability. That's important to allow you to continue to make a choice.
So let me spend a few minutes talking about the MPI library and give you an idea of what we've done in terms of some of the optimizations in this library. I feel as though it's actually -- this is fairly objective, since I've talked to other people outside of Sun about this as well -- I would argue it's one of the best or maybe the best MPI implementation at this point. In terms of optimization levels for our platform, it's certainly true.
So general capabilities. This is a completely new native implementation of the MPI library. When we started at Thinking Machines we actually were building on an MPICH core, and we decided that we weren't going to be able to get to where we needed to be, in terms of scalability and performance. So we essentially ripped all of that out and redid that implementation. We used some funding from the DOE to do that.
It has all of the, what we would label, as the significant MPI 2 features of -- all the features of MPI 2. One-sided communication is not in there, and we can talk about that in a second. It's thread-safe. You have to do this if you're going to effectively support the mixed programming model that we were talking about earlier. It's also a significant point that applications can be developed and debugged with single instances of the parallel debugger that we ship as part of ClusterTools, which we'll go through in a second.
In terms of optimizations, we spent a lot of work on SCI, because SCI has a memory-to-memory semantic. It allows you to take physical memory on one box and actually map it into the virtual address space of another box. You need kernel involvement to set this mapping up. But once you've done that, you effect transfers of data across that interconnect by doing loads and stores into those memory regions. So you completely sidestep the operating system at that point, and that gives you access to pretty low latency. By low, I mean at least with an SCI implementation, on the order of seven microseconds user process to user process between boxes in a cluster. So that's why we spent time working on SCI.
We also take advantage of an UltraSPARC special instruction called the block-move instruction that allows us to move chunks of data around within the MPI library, as we have to do buffer copies. One of the interesting features of the block-move instruction is that it doesn't pollute your cache. So if you've gone through the trouble of setting up all your cache state for your computation, you really don't want the MPI library trouncing on that, so we avoid that.
We've implemented co-scheduling, which you can think of as an approximation of gang scheduling. We don't have a hardware synchronization mechanism across the cluster, but the co-scheduling techniques that were developed primarily at MIT and Berkeley allow you to do approximate gang scheduling based on local information available to you.
We've also done a lot of work on locality exploitation. And the point here is that if you have a big SMP as a node, or even a moderately sized SMP as a node, you should take advantage of that wherever you can.
So if you're doing, for example, a broadcast operation in a cluster of SMPs, certainly everybody would build a spanning tree in order to get the data out to all the leaf nodes in your job, but there's no point in building the nodes of the spanning tree on a 64-processor box. You build, what, five or six levels of that spanning tree on the box. There's no point. Why don't you just drop the data into a single location and have everyone read from it? By doing that, by short-circuiting those kinds of things and taking advantage of the shared-memory nature of the boxes, we can get higher performance that way.
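The "five or six levels" figure can be checked with a small calculation. The sketch below assumes a doubling broadcast, where every process that already holds the data forwards it to one new process per round; the function name is illustrative and is not part of any MPI library.

```python
def broadcast_rounds(num_procs: int) -> int:
    """Rounds needed when the number of processes holding the data
    doubles each round (an assumed, idealized tree broadcast)."""
    if num_procs < 1:
        raise ValueError("need at least one process")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, without floats.
    return (num_procs - 1).bit_length()


if __name__ == "__main__":
    for n in (2, 16, 64):
        print(n, "processes ->", broadcast_rounds(n), "rounds")
```

For a 64-processor box this gives six rounds, which is the tree depth that a single shared-memory write followed by concurrent reads avoids.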
If you do that, one of the things that you might think, just to give you an idea of the level of optimizations that we're doing, is that if I do have all the 64 CPUs in my box go and read memory locations to pull data out of there, now I'm overloading my memory system because I have this big hot spot in the system.
So what we actually do is some pipelining and some round robin access to those data structures to actually spread out the load so that different processes are actually attacking that buffer in different segment order, which turns out to give you a nice performance boost.
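One way to picture the round-robin access is to split the shared buffer into segments and have each reader start at a different segment. This is only a sketch of the idea under that assumption; the segment count, the rotation rule, and the function names are illustrative, not the library's actual scheme.

```python
def segment_order(reader: int, num_segments: int) -> list[int]:
    """Order in which one reader visits the buffer segments: start at a
    reader-specific offset and wrap around (assumed rotation rule)."""
    start = reader % num_segments
    return [(start + step) % num_segments for step in range(num_segments)]


def readers_per_segment(num_readers: int, num_segments: int, step: int) -> list[int]:
    """How many readers touch each segment at a given step when staggered."""
    counts = [0] * num_segments
    for reader in range(num_readers):
        counts[segment_order(reader, num_segments)[step]] += 1
    return counts


def naive_readers_per_segment(num_readers: int, num_segments: int, step: int) -> list[int]:
    """Same count when every reader walks the segments in the same order."""
    counts = [0] * num_segments
    counts[step] = num_readers
    return counts


if __name__ == "__main__":
    print("naive    :", naive_readers_per_segment(64, 8, 0))
    print("staggered:", readers_per_segment(64, 8, 0))
```

With 64 readers and 8 segments, the naive order puts all 64 readers on one segment at every step, while the staggered order spreads them 8 per segment.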
We also do lazy connections, which is really important for dealing with very large jobs. We made the observation that it's not necessarily the case, it's not typically the case, that when a large job is running, a large MPI job is running, that all of the point-to-point connections that are possible between those processes are actually used. So there's no point at startup time in setting up the N squared connections between all the process pairs.
So what we do, and you can turn this off if you don't want to use this, we establish the end points, connections, when the first communication happens between those two processes, which allows us to just consume the resources we need and just set up the connections when we need them.
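The saving from lazy connections can be illustrated with a toy model. The class below is a hypothetical stand-in, not the library's interface: a "connection" is just a dictionary entry created on the first send between a pair of ranks.

```python
class LazyConnections:
    """Toy connection table that opens a link on first use (illustrative only)."""

    def __init__(self, num_procs: int) -> None:
        self.num_procs = num_procs
        self._open: dict[tuple[int, int], list[bytes]] = {}

    def send(self, src: int, dst: int, payload: bytes) -> None:
        key = (min(src, dst), max(src, dst))  # one link per unordered pair
        if key not in self._open:
            self._open[key] = []  # stand-in for real endpoint setup
        self._open[key].append(payload)

    def open_count(self) -> int:
        return len(self._open)


def eager_pair_count(num_procs: int) -> int:
    """Links needed if every pair is connected at startup."""
    return num_procs * (num_procs - 1) // 2


if __name__ == "__main__":
    n = 1024
    conns = LazyConnections(n)
    for rank in range(n):  # nearest-neighbour ring exchange
        conns.send(rank, (rank + 1) % n, b"halo")
    print("lazy :", conns.open_count())
    print("eager:", eager_pair_count(n))
```

For a 1024-process job that only talks to ring neighbours, the lazy table ends up with 1024 links, against 523776 for the all-pairs setup.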
We've also done a full multi-protocol implementation of the library. What I mean by that, and this is where Myricom comes in on the next slide, is on a pair-wise basis, we make a determination of the most efficient data pathway between two processes in the MPI job.
So, for example, if the two processes are on separate boxes and connected by TCP, we'll choose a TCP connection and do it that way and we'll incur the latency by using that pathway. If the two processes are on the same box, we use shared memory and just set up a shared-memory region and just slosh the data back and forth between there. If an SCI adapter is available, then we'll use RSM, which is this Remote Shared Memory I was describing before, where the memory semantic is set up between boxes to move data very quickly.
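The pair-wise choice described here can be sketched as a small decision function. The names and the order of the checks (shared memory first, then RSM, then TCP) are assumptions for illustration; the talk does not spell out the library's actual selection logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proc:
    rank: int
    node: str              # hostname of the box the process runs on
    has_sci: bool = False  # True if that box has an SCI adapter


def choose_protocol(a: Proc, b: Proc) -> str:
    """Pick a data pathway for one pair of processes (assumed ordering)."""
    if a.node == b.node:
        return "shm"  # same box: shared-memory region
    if a.has_sci and b.has_sci:
        return "rsm"  # both boxes on SCI: Remote Shared Memory
    return "tcp"      # fallback that works on any TCP interconnect


if __name__ == "__main__":
    p0 = Proc(0, "nodeA", has_sci=True)
    p1 = Proc(1, "nodeA", has_sci=True)
    p2 = Proc(2, "nodeB", has_sci=True)
    p3 = Proc(3, "nodeC")
    for x, y in [(p0, p1), (p0, p2), (p0, p3)]:
        print(x.rank, y.rank, choose_protocol(x, y))
```

A third-party protocol module would, in this picture, add another branch to the same per-pair decision.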
Now, the bottom layer, the TCP layer and the RSM and shmem, it's called a Protocol Module Layer. And part of the effort surrounding putting all this out for community source was to develop a well-defined API, a protocol module API, this dotted line here, that actually will allow third-party IHVs, Independent Hardware Vendors, to plug into this architecture. As I mentioned, Myricom is doing this.
We're also very interested, just generally speaking, in the InfiniBand specification and that whole process. Are people familiar with InfiniBand? No. Okay. Let me just say a few words about that, because you'll be hearing a lot about that over the next couple of years.
InfiniBand is a merging of two earlier industry efforts, Future I/O and Next Generation I/O. All the big players broke up into these two different camps, and everyone was developing their own version of the industry's next generation I/O and cluster interconnect.
And Sun actually played a key role in saying that this was a silly thing to do. Why should we have two specs going forward? That wasn't going to help the industry at all. So we managed to get everybody together in a room, and we decided to develop one common specification. So eventually this will replace PCI, and you will see it as a cluster interconnect from, I would guess, virtually all the vendors at one point.
Some basic numbers. It comes, at least initially, in three different flavors. There's a 250-megabyte per second bidirectional; there's a one-gigabyte per second bidirectional; and there's a three-gigabyte per second bidirectional. If you're at all familiar with VIA, the Virtual Interface Architecture, the verbs that are used for communicating with this underlying hardware are very much like VIA. It's a message-queue oriented semantic. It has this OS bypass mechanism in it, whereby, by adding descriptors onto these queue pairs, they're processed directly by the hardware so there's no intervention by the OS. And therefore you should be able to get very low latency.
So you will see this probably, I don't know for sure, but I think Intel will be rolling out some of the low-level 1x, 250-megabyte per second hardware. I think by the end of this year actually you will start to see bits and pieces of this. But you'll see over the next three or more years that there's going to be an increased emphasis on InfiniBand, and you heard it here first. In any case, we, of course, want to be able to plug into that with our MPI architecture, and you'll see other vendors doing that as well I'm sure.
So what else do you need in the toolkit? You need a subroutine library, a math library. This particular library, S3L, is built on top of MPI. It's fully thread-safe. When I say thread-safe in this talk, what I really mean is more MT warm; and what I mean by that is that there's actually a fair amount of concurrency in our implementation.
You can get thread-safety by locking and unlocking at the entry and exit to each of your subroutines, but there's no concurrency whatsoever. What we've done is we've pushed a lot of the synchronization fairly far down into the library so that you can actually have threads running around inside the library in nontrivial ways and give you real concurrency. So S3L does this as well.
The parallel capabilities are listed there. I'm not going to go through them. There are a few more as well. This is fairly standard. We're always on the lookout for other things that should be in that library. It tends to be tricky because there are so many different requirements in the HPC space these days.
Let me quickly go through a few screen shots on Prism. This is our integrated development environment; sort of one-stop shopping for dealing with debugging and performance analysis for a message-passing application. It supports F77, F90, C. It has some basic support for C++. This is something that we're working on. We expect to get up to full C++ at some point.
I'll just go through some of these slides just to give you an idea what it looks like. It's a little bit long in the tooth now in terms of the way it works. This is built actually with the Motif toolkit. We've had some thoughts about doing this in Java at some point.
What's being illustrated here, if you can read the small red rectangle, is this basic concept of Psets, process sets. This is a key piece of Prism. We designed this from the get-go to be extremely scalable. And the idea behind Psets is it lets you define semantically meaningful subsets of the processes in your MPI job and then allows you to issue any Prism command to those subsets.
So, for example, what you're seeing up here, it says define Pset master to be 0. So rank 0 in my MPI job is now called master. So typing master is no savings over typing the digit zero, but if I define Pset "slaves" to be "all - master", then I now have a mnemonic for referring to all of the other processes in the job that are not the master. The "all - master" implies some set notation. We can do basic set operations in terms of defining these Psets. You can define Psets based on the values of variables within individual processes in the job, which turns out to be pretty useful.
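The Pset idea maps naturally onto ordinary set operations. The sketch below models it with Python sets; it is not Prism command syntax, and the helper names and the per-rank variable values are made up for illustration.

```python
def define_pset(psets: dict, name: str, members) -> frozenset:
    """Register a named set of MPI ranks (hypothetical helper)."""
    psets[name] = frozenset(members)
    return psets[name]


def pset_where(values: dict, predicate) -> frozenset:
    """Ranks whose value of some program variable satisfies a predicate."""
    return frozenset(rank for rank, value in values.items() if predicate(value))


if __name__ == "__main__":
    psets = {"all": frozenset(range(8))}
    define_pset(psets, "master", {0})
    define_pset(psets, "slaves", psets["all"] - psets["master"])  # "all - master"

    # Assumed snapshot of a variable named err in each process.
    err = {rank: (1 if rank in (3, 5) else 0) for rank in psets["all"]}
    define_pset(psets, "failing", pset_where(err, lambda v: v != 0))

    for name in ("master", "slaves", "failing"):
        print(name, sorted(psets[name]))
```

Any command could then be applied to every rank in the chosen set, which is what makes the naming scheme scale to large jobs.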
The point here is for scalability. Any command can be modified by these Pset qualifiers. You'll notice at the bottom it says (prism all). Prism has the concept of a current Pset, so anything that I type or mouse at this point, at least currently, will be sent to all the processes.
It's a little bit different than the way debuggers normally work, because if I, for example, have all the processes stopped at a breakpoint and I say continue Pset slaves, all the slave processes will continue, but Prism maintains control, and I can still issue debugging commands to rank zero. So it's somewhat of an asynchronous interface.
We put a fair amount of effort into giving you global displays of the program, and this is another key point that really needs to be emphasized in any toolkit, especially when dealing with these large distributed applications. You have to be able to get a sense of what's going on in your application, and you can't really get that if the only tool at your disposal is something that allows you to go in and sort of probe one or a couple of processes at a time. So we spent a fair amount on displays like this, which take a little bit of explaining.
What you're looking at is three different zoom levels of the same display. It's basically a generalization of the linear call stack you'd get from, say, DBX, where you're stopped at a particular point. You say where, it says well you were in main. You called subroutine one; you called subroutine two; you called subroutine three.
What we realized is that even for very, very large jobs with many, many processes, it's not typically the case that you entered main in your application and then suddenly hit a case statement and went out to a thousand different locations in your program. There's a lot of commonality in the way your processes trace through these applications.
So by using that observation, what we essentially do is take the N linear call stacks in your application and merge them into a tree structure. So the way to read this, and you probably can't see this, is that on the top left everyone was in main. They made two recursive calls to search, and now you start to get a bifurcation. Some of the processes went to the right and made a call to alpha beta. The others you can't see; they've been iconified. But on the right, again, there's another bifurcation because two different calls to alpha beta were made at two different call sites.
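The merging step described above can be sketched as a prefix tree keyed by stack frame. This is a minimal illustration under assumptions, not Prism's data structure: frames are plain strings, a call site is encoded in the frame name (for example `alphabeta@41`), and the names `merge_stacks` and `render` are hypothetical.

```python
def merge_stacks(stacks):
    """Merge per-process call stacks (outermost frame first) into one tree.

    stacks maps rank -> list of frame labels. Each tree node has the form
    {"ranks": [...], "children": {frame_label: node}}.
    """
    root = {"ranks": sorted(stacks), "children": {}}
    for rank in sorted(stacks):
        node = root
        for frame in stacks[rank]:
            # Processes that share a prefix of frames share the same nodes.
            child = node["children"].setdefault(
                frame, {"ranks": [], "children": {}}
            )
            child["ranks"].append(rank)
            node = child
    return root


def render(node, depth=0):
    """Return indented text lines, one per tree node, with the ranks in it."""
    lines = []
    for frame, child in node["children"].items():
        lines.append("  " * depth + f"{frame} {child['ranks']}")
        lines.extend(render(child, depth + 1))
    return lines


if __name__ == "__main__":
    example = {
        0: ["main", "search", "search", "alphabeta@41"],
        1: ["main", "search", "search", "alphabeta@57"],
        2: ["main", "search", "search", "alphabeta@41"],
    }
    print("\n".join(render(merge_stacks(example))))
```

Because most processes share long prefixes, the tree stays small even when the number of processes is large, which is what makes the display usable for big jobs.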
As you zoom up, you start to get line numbers in the files, and as you zoom all the way to the top, you get actual arguments that were passed within particular processes. So, for example, Process 2 had the following arguments passed on the stack at that point.
This is an active display. I could, for example, double click on this guy. My Pset would automatically update to contain Process 1, 2, and 3, and then any command that I would type at that point would just be sent to processes 1, 2, and 3, and the individual stacks would have been adjusted so that I would be at the appropriate point in the call stack. Again, it's just a way of giving you a global view of what's happening with the program.
Another simple way of looking at the program from a global point of view is just looking at the Psets that have been defined in the application. Some of them are predefined by Prism. This idea of a current Pset -- what this is telling me is that rank 0 of my job is stopped at a breakpoint, and the other fifteen processes in this job are currently still in the run state. If someone had hit a seg fault or had a bus error, then the error grid would be illuminated appropriately. You can also see masters and slaves, which I've defined previously.
Another key point in a toolkit, and sometimes you find this integrated with a debugger and sometimes you don't, is visualization. And this is visualization in particular to support debugging; it's not, generally speaking, fancy presentation graphics. We found that there was actually a lot of power to be had or given to the user in completely integrating the visualization with the debugger.
So essentially what you're looking at here, we call these data visualizers, and these are popped up in a debugging session with a print statement. It's a variant on a print statement. If you say print A, where A is an array, normally all the million elements would just get printed out on your screen. Well, that's not very good. So if you say "Print A on window name", it will pop up one of these visualizers.
So the way to look at this is, this is a text visualizer, and it's showing a three-dimensional array. I can tell it's three-dimensional because if you look at the top, there are three axes up there. There are two on the black rectangle and a third one on the slider. This is easier seen in an interactive demo, but the white rectangle in the black rectangle is a representation of this data window. So if I grab that white rectangle and drag it around, interactively I can pan across this data plane. That turns out to be useful in some cases.
You can also take and change the data representation to map. Instead of looking at numerical values, you can map it to pixel values, and that's nice for looking at physical simulations for example. There's a rendering for complex data, magnitude and direction. You can build histograms. There are a number of other basic capabilities that are here to aid the programmer in debugging the application. It turns out to be one of the more useful features of Prism.
We have basic message-passing event analysis. I would frankly say it's not yet up to the level of Vampir, for example. You can use Vampir with our MPI, if you'd like to do that, but we did feel compelled to give at least some amount of integrated performance analysis capability within Prism itself for those that don't want a third-party tool.
You can also drill down on the state of your MPI messaging queues, and this is another theme that I think is important. Exposing some internal state from within the MPI implementation is important typically. Looking at queues is one of the more obvious ways of doing that.
What you're seeing here is, rank runs this way, so Rank 0, Rank 1, Rank 2, and then entries in the queue for those ranks are shown off to the right. The color represents the MPI communicator. You can click on one of those and drill down and actually find out what data is sitting in that queue and what the data types are that it's associated with. Very useful for trying to find race conditions and looking at basic programming errors.
Okay. In terms of parallel I/O. As I mentioned, there are two capabilities. We have a full implementation of the MPI I/O standard within this toolkit, and I believe the other vendors are getting to this point as well. We feel it's important to have this, if only to support the more traditional HPC vendors that really are into high performance parallel I/O.
So MPI I/O writes either to the Unix file system, which doesn't give you performance but it gives you compatibility with the Unix file system. It does sequential writes and sequential reads, or it can write into PFS, which is our Parallel File System, which does what I'm sure you would all expect. It lets you have some number of storage nodes attached to some number of cluster nodes in your machine. The storage objects that are attached to a particular node in the cluster are controlled by a PFS I/O daemon. The I/O daemon is responsible for all data motion onto and off of the storage devices on that node.
The way you get your parallelism and scalability here is by having your multiprocess compute job, which is shown up here as a two-process job. And these processes could certainly be running down on the same nodes that the I/O daemons are running on. By having those compute processes contacting the I/O daemons in parallel and then having the I/O daemons contacting storage in parallel in order to get as many spindles moving as possible, you get increased scalability by increasing the parallelism of your job and also potentially by increasing the parallelism of your file system. So this is included as well.
Just make a note here that the performance of something like this depends a lot on the -- you'll notice the green lines, most of them are going across, at least the ones between the nodes, are going across the cluster interconnect, so performance is critical and gated by the bandwidth, mostly the bandwidth of the interconnect that's used for the configuration.
To give you some idea of performance, this is reasonable performance I think for this generation of hardware that we're talking about, the UltraSPARC II. The first two lines are talking about within a single SMP, so 2.5 microsecond latency user process to user process within an MPI application.
And then MPI shared bandwidths of about 200 megabytes per second between processes in the job, and we can sustain that between multiple pairs because of the high back plane bandwidth of the SMP. Then the rest shows SCI latency, which I had said was seven. This is actually a slightly outdated slide. And then the bandwidth numbers aren't important.
As I said, I think for SCI the real deal there is latency reduction along the lines of what InfiniBand is doing and along the lines of a next-generation interconnect that Sun has, that we can't talk about today because we're not under NDA, but the point is that the RSM work, the low latency work, will transfer forward into the future and continue to offer low latency and very high bandwidth for MPI applications.
Okay. I think I'll stop with that. I still have a few minutes, so are there any questions on ClusterTools before I move on? Yes, go ahead.
AUDIENCE MEMBER: On the Prism, I assume that that software is good on any platform?
MR. JOSHUA SIMONS: The developers back at Sun that work on this, do all of their development, well, virtually all of their development, on Solaris for UltraSPARC, and the product itself is only supported on Solaris for UltraSPARC. So the answer from a supported product point of view from Sun, is that it's SPARC Solaris, in fact, it's UltraSPARC Solaris; however, the fact that we've released this under open source, community source, has made it possible, at least in theory, for other parties to come in and actually port this to other platforms.
For example -- I said before that most of our development is done on Solaris and SPARC. We have S3L, for example, that group within Sun, just because they want to, they actually do a lot of their development on Linux as well. So Prism, in particular, is the hardest one to move over because it understands instruction formats and it has to understand compiler stabs formats. So teaching it about new compilers and teaching it about new chips is not a trivial operation, but it could be done.
AUDIENCE MEMBER: Do you know of anybody that's started that kind of effort?
MR. JOSHUA SIMONS: No one to my knowledge has done that.
AUDIENCE MEMBER: I have a question on the MPI when it decides about the shmem or RSM or TCP/IP. Is that a compile-time thing?
MR. JOSHUA SIMONS: Not at all. It's done at runtime, and the decision is made on a pairwise basis, so within a single job some processes may be using one method and some processes may be using another method. In fact, one process itself may be using shmem to get to some processes and TCP to get to some other processes and RSM to get to other processes. Completely dynamic, it's all determined on the fly.
AUDIENCE MEMBER: (Inaudible.)
MR. JOSHUA SIMONS: No. It makes the decision up front. It finds out -- it looks and says I need to connect to him. What's the best pathway? It has an idea of rankings of pathways, so if you had a TCP connection and an RSM connection between those two nodes, it would choose the RSM connection.
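The pairwise, ranked choice of pathway described in this answer can be sketched as follows. This is a toy model under stated assumptions, not Sun MPI's code: the preference order `shmem`, then `rsm`, then `tcp` is assumed (the talk only states that RSM is preferred over TCP), TCP is assumed to reach every node, and the names `usable`, `choose`, and `plan` are hypothetical.

```python
# Assumed ranking, best first. Only "RSM beats TCP" comes from the talk.
PREFERENCE = ["shmem", "rsm", "tcp"]


def usable(node_a, node_b, rsm_nodes):
    """Return the set of pathways available between two processes' nodes."""
    paths = {"tcp"}                  # assumption: TCP reaches every node
    if node_a == node_b:
        paths.add("shmem")           # same SMP: shared memory is possible
    elif node_a in rsm_nodes and node_b in rsm_nodes:
        paths.add("rsm")             # both nodes sit on the RSM interconnect
    return paths


def choose(node_a, node_b, rsm_nodes):
    """Pick the highest-ranked pathway that is actually available."""
    paths = usable(node_a, node_b, rsm_nodes)
    return next(p for p in PREFERENCE if p in paths)


def plan(placement, rsm_nodes):
    """placement maps rank -> node. Return {(i, j): pathway} for i < j."""
    ranks = sorted(placement)
    return {
        (i, j): choose(placement[i], placement[j], rsm_nodes)
        for idx, i in enumerate(ranks)
        for j in ranks[idx + 1:]
    }


if __name__ == "__main__":
    placement = {0: "a", 1: "a", 2: "b", 3: "c"}
    for pair, path in plan(placement, rsm_nodes={"a", "b"}).items():
        print(pair, path)
```

In this example rank 0 ends up using all three pathways at once, one per peer, which mirrors the behavior described in the answer above.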
AUDIENCE MEMBER: I'd like to get some more information about Prism. Is there a place or source where I can get that?
MR. JOSHUA SIMONS: Yes. Let me talk to you afterwards, and I'll try to come up with a good URL for you.
AUDIENCE MEMBER: Okay.
MR. JOSHUA SIMONS: Are there any other questions? Yes.
AUDIENCE MEMBER: What's the future of Java from an unbiased opinion?
MR. JOSHUA SIMONS: Unbiased, right? It's an interesting question, and a lot of people when they ask that, they tend to, understandably, you know, a person from Sun is a person from Sun. But I actually come at Java less as a person from Sun and more as an HPC person. Because in some ways I have less visibility into the whole Java process than the average customer or the average person who's involved in the Java community.
From an HPC point of view, it's unclear to me. It's not clear to me that we will necessarily ever get to the point where you would want to really code your kernels in Java. It certainly makes much more sense to use it as an infrastructure for -- and I'll talk more about this in a minute -- for example, for grid computing, and using it as infrastructure for doing these large distributed kinds of things. But whether or not it's going to fly as a computational language, I don't know.
The JavaGrande effort is looking at enhancements to the language that would make it better, but then the question is, well, do you stick with the JVM or should we really put a lot of effort in native compilers? And I could imagine, if we put an effort into native compilers, then there are those that would claim that Java is a cleaner language for doing object-oriented programming than something like C++, in which case it could take off, but I don't think it's going to take off in the context of a JVM, and I don't think the HPC community really needs the portability that comes with the JVM execution. The jury is still out.
AUDIENCE MEMBER: How does Bill Gates figure into all this?
MR. JOSHUA SIMONS: From an HPC point of view, not at all. So that's my answer, because I don't know the bigger story. Other questions? Okay.
I'm just going to spend a few minutes on these. Maybe by the time I'm done these still won't seem like they're HPC related, but I hope to convince you that they are.
These are three areas that Sun -- and I labeled them as one investment -- we call these Three Big Bets that the company has made, and we did this in the past. We bet that networks would be really important, for example. There are other things that the company has done that have turned out to be right.
Well, these are the next three bets that we've made, and they're based on what we see going on with technology out there in the computing world, and I don't, in particular, mean HPC necessarily. The three new ones are, not in any particular order, and I'll go through each of these in a minute, massive scale, integrated stack, and continuous real-time.
We tend to date these because we'll go back in a year, two years, three years, and see how right we were. We're investing enough of our resources in these sorts of things, that if we're not right, we're in big trouble. But I think we are right, and, as I said, I think there are ramifications to everybody in this room.
So this is one of the motivators for why these three big bets are important. And basically what this slide is saying is that if you look at the top, that's the traditional model that people have been using for delivering software and delivering services essentially. You go out and you buy your application and you install it on your local machine and you use it; or maybe you put in a local server and you NFS mount it, but it's gone out. Many, many copies are distributed, and it's used in a distributed manner.
You do updates by sending out a new CD-ROM or by downloading something to your site and using it that way. The new model, which everyone's been hearing about, it's more related to an ASP model, an Application Service Provider model. And everything is heading in this direction, right, where everything is out on the web. And, you know, the logical conclusion of that is that nothing is local. Even your confidential banking data and whatnot or documents that you're writing, for example, may actually be out on some kind of secure server somewhere on the web.
One of the benefits of that is that you have ubiquitous access to it. As long as you can be convinced that it's secure and only you can get at it, then why do you care where it is as long as you can get it from wherever you are.
So that's kind of where things are going, and that has some implications. So Sun is putting a lot of effort into what we call an integrated stack. And the integrated stack is an e-commerce concept. It's really everything from the OS all the way up through application servers, web servers, JVM, Java 2 Enterprise Edition, directory, databases.
All of this stuff has to be brought together in a really efficient way to run as well as possible on Sun platforms so that we can offer a solution for the Internet economy essentially, right? This is all e-commerce based. And above that level that I mentioned there are broker engines and whatnot. And you wonder what does this possibly have to do with HPC, right?
What you're looking at here is kind of a portal model, and portal is somewhat of an overloaded term. Sometimes people, when you mention a portal, what they think of is a web page that you go to that is aggregating access. It has links off to lots and lots of different places. It's your one-stop shopping for doing things on the web.
Well, the other definition of a portal, and it's the definition that we prefer, is that it's typically a web page that you would go to, but what it's doing is it's giving you a portal through something into something else. So, for example, we have something called sun.net that we use as Sun employees that allows me to walk up to any browser anywhere on any kind of a machine, as long as it's got Java running in the browser, and drill through our corporate firewall to access my e-mail, access an increasing number of applications that have been web-enabled and made available through this portal service to employees that are traveling wherever they are. So that's the kind of portal that we see as being interesting and useful.
If you think about it from an HPC point of view, it's not a big change to the picture, right? So you replace those generic apps with some kind of a parallel multibarbell sort of thing there with a parallel file system behind it, and you allow access to your HPC applications from anyplace, anytime.
So who would use this? Like a supercomputer site, for example. Somebody who has lots of remote users that need secure access to their data. It's very much along the lines of what's happening on the commerce side just translated over into technical applications. There's really no difference.
The stack doesn't care if you're doing technical computing or you're doing database accesses. It doesn't matter. The framework is common, and we think this will become increasingly important. My understanding is that this basic idea is one of the themes of SC 2000, this Escape 2000 that they're running in Texas this year. So you'll probably see more of this in the future, and I think increasingly over time it's going to be important for us to figure out how that stack can be used for leveraging what's going on at HPC.
And kind of a subtext that's running throughout what I'm saying is that, you know, historically -- I've been doing HPC for twelve years now or something like that, and we've always prided ourselves on being on the leading edge and doing things that other people aren't doing. That's still true in terms of content. But in terms of what's driving the industry and technology, it's not HPC anymore. It's this huge Internet economy.
And the point I made before about commodity nodes being sort of responsible, at least in part, for the renaissance of HPC, that's an example of this leverage that I'm talking about. We need to look at what's going on in the commercial space because that's going to be the infrastructure that we should be taking advantage of on the HPC side. It's no longer going to be possible to go off in our own directions. It's just too expensive to do just HPC oriented sorts of very, very large efforts.
I don't mean to say that companies won't put effort into HPC-related things. For example, the group of fifty engineers that worked on ClusterTools; well, ClusterTools has no bearing whatsoever on the commercial space. So as a vendor you still need to make investments into some HPC specific aspects for the community, but the point is that there's a huge amount of effort and a huge amount of money and thinking pouring into this commercial side, and we really need to figure out how to leverage it on the technical side.
Another thing that happens that's quite popular in the commercial space is this B2B stuff, business to business, where you have multiple sites communicating using some kind of high-level protocol for doing something. Maybe they're doing supply chain analysis or supply chain management, which is all gobbledygook to me, right? So you replace that. And, this is trivial, right? This is what I would call site-to-site stacks.
Now, if you think about, you know, DOE is a good example, I'm sure many of you are examples as well, where you have multiple sites that are in close contact potentially -- eventually being used for doing megacomputing sorts of things.
Well, there's infrastructure here for building those kinds of applications. And, again, I don't think we should reinvent the wheel. There's a huge impetus here, and I think we should take advantage of it.
I won't say too much about continuous real-time, except to point out that what this is about is, if this stack and these machines are going to be the basis of an Internet economy, then it needs to be there all the time. The way our CTO says it, it's in terms of continuous availability versus percent uptime. It's kind of the difference between planes and computers. You don't talk about percent uptime for a plane, right? It better always be up. That's what we have to be striving for.
And there's a benefit there for HPC as well. As I said earlier, the way you build a large HPC complex these days is by aggregating large numbers of commodity components to build them, and you need those components to stay up. Your MPI app is not going to make forward progress if the nodes in your machine are down all the time. So the RAS capabilities and some of the other efforts that are happening to make things better on the commercial side, again, have a lot of relevance on the technical side.
Mission critical, brand critical refers to things along the lines of, well, if E-bay goes down or CNN or something goes down, well, when that happens, we show up on the news, right? This is not something that we as a company, or any company that is providing infrastructure for the net, can deal with. We have to take this very, very seriously. And, again, you benefit from it.
So massive scale. HPC has always been about massive scale, so there's nothing new there. But the new thing again is that the commercial side is now becoming increasingly about massive scale. Why is that? Because there are these huge scaling pressures.
If you look at the sum of computational and storage demand, it really goes up proportional to the number of devices, number of users, the duty cycle and the bandwidth. So, in particular, duty cycle goes up as bandwidth goes up. As the net becomes more useful, as the bandwidth goes higher, people are staying on-line more and more. As more DSL lines come on-line and more cable modems come on-line, everything is increasing.
When we hit the wireless revolution, which we are on the brink of entering at this point, the number of devices is going to explode. It's going to just be huge. So that's going to drive the commercial side into this big problem, right? If you look at the two curves here, the lower one is meant to be a rough indication of Moore's law, where processing power is doubling every eighteen months. The other one is Gilder's law, where the capacity of network bandwidth is doubling about every six to nine months.
So there's a problem there. We have a huge amount of infrastructure developing. If you look at wave division multiplexing and dense wave division multiplexing and terabit routers and optical home computing, the backbone is already at terabit so it's just huge. How are we going to keep this full, because you know the demand is going to be there.
The way you keep this full, the way you take care of disparity between these curves, is by aggregating very large numbers of CPUs in machines at your service points in the network so that you can actually have enough horsepower to push data out into these big pipes.
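To put rough numbers on the disparity between the two curves, the doubling periods quoted above can be plugged into a small calculation. This is only a back-of-the-envelope sketch: it assumes clean exponential growth from a common starting point, and the function names `growth` and `gap` are hypothetical.

```python
def growth(months, doubling_months):
    """Growth factor after `months`, given one doubling every `doubling_months`."""
    return 2.0 ** (months / doubling_months)


def gap(months, cpu_doubling=18.0, net_doubling=9.0):
    """How many times faster bandwidth has grown than per-CPU processing power."""
    return growth(months, net_doubling) / growth(months, cpu_doubling)


if __name__ == "__main__":
    for years in (1, 3, 5):
        months = 12 * years
        slow = gap(months, net_doubling=9.0)   # bandwidth doubles every 9 months
        fast = gap(months, net_doubling=6.0)   # bandwidth doubles every 6 months
        print(f"after {years} years: gap between {slow:.1f}x and {fast:.1f}x")
```

Under these assumptions the gap is itself exponential, so a single CPU falls further behind the pipe each year. Read as a CPU count, the ratio is a rough indication of how many more processors a service point would need just to keep pace with the bandwidth.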
Well, that sounds a lot like HPC, right? They're using it for different reasons, but they're still aggregating and they're going to have all the problems that HPC folks have had for years with things like downtime, with administrative tools. This is a huge area. It's constantly pounded into us that one of the biggest problems with large HPC installations is not necessarily the programming tools, it's the administrative tools that keep the darn thing up and running and let you use it in effective ways.
If you look at AOL as an example, I believe they have around forty thousand CPUs at this point scattered amongst their data centers, and they use, what I called earlier, horizontal scaling. Those tend to be 4 CPU boxes, maybe a little bit bigger than that. So a very, very large box count.
We have to solve this problem. If we don't solve it, then the Internet economy is in big trouble, and parenthetically I think HPC would be in trouble as well. But I think there's -- again, the message is there's a huge benefit to us on the HPC side because we can leverage a lot of the work that's done here.
I think I'll stop. Any questions about that last bit?
AUDIENCE MEMBER: Maybe this is a simplistic question, but the integrated stack looks to be very much like a huge database. How is it different?
MR. JOSHUA SIMONS: The question is: How is the integrated stack different than a huge database?
Databases are actually a part of the stack.
AUDIENCE MEMBER: The directory part.
MR. JOSHUA SIMONS: Right, directory, but also data itself. The stack at its lowest level you have the operating system, file system, directory servers, database servers as well, but then above that you start aggregating higher and higher levels of abstraction. So, for example, messaging servers are considered to be in there, so e-mail servers, portal servers that offer that kind of portal technology that I was mentioning. E-commerce engines above that. Security engines that allow secure transmissions between these things.
So it's really -- it's shorthand in our world, at least, for everything that you need to walk into a company and get them functioning, fully functioning in the Internet world as an e-commerce entity.
AUDIENCE MEMBER: So the word stack is not really meaningful, right?
MR. JOSHUA SIMONS: It's a stack in the sense that the offerings that are included in it are actually fairly well layered on top of each other, so it's a stack in that sense.
AUDIENCE MEMBER: And do you feel comfortable with the security issues that are in place?
MR. JOSHUA SIMONS: Do I want my private correspondence kept out on a web server somewhere? The question was do I feel comfortable with the security model that's in place right now.
The answer is no right now. You know, I think there is a lot of work that has to be done there in order to improve security, but you couldn't have better people than, for example, banks bothering you about this, because they care passionately about this. If we solve it for banks, then I'm sure we've solved it for HPC, right? There aren't many people more paranoid than banks. Well, maybe that's not true. That's not true. I take that back. So maybe there are a few other hurdles to be crossed there in the government space.
Other questions? Okay. Thank you.

last revised:11/16/00 05:59:49 PM
editor: pmateti@cs.wright.edu