Summer Institute on Advanced Computation, August 20-23, 2000, College of Engineering & CS, Wright State University, Dayton, Ohio 45435-0001
DR. OSCAR GARCIA: Our luncheon speaker we are very happy to announce is from
Sun Microsystems, one of the contributors to this event. He is a Distinguished
Engineer in High Performance Computing at Sun Microsystems. Mr. Joshua
Simons joined Sun in 1996 from Thinking Machines, where he worked in a variety
of software areas related to high performance computing during his eight years
with that company. He holds a Science Masters -- Masters of Science. S.M. is the
way it says here, so Masters of Science Degree in Computer Science from Harvard.
Let's welcome Joshua Simons.
MR. JOSHUA SIMONS: So can you hear me okay? I know you're eating, but I'd like
to make this as interactive as possible. So please ask questions. I'll try to
repeat them, since I'm the only one that's miked here.
I actually have two talks here. We'll see if we get to both of them. I'm going
to talk about HPC tools and programming paradigms. It's a fairly short talk, so
we won't be able to get into a lot of depth into each of the areas, but I hope
we can have an interesting discussion nonetheless.
In terms of an overview for the talk, I'd like to give you a few words about my
view, and, I guess, Sun's view on current HPC System Configurations, what's
interesting out there, what's happening in that space. I'll talk a bit about
what current programming paradigms look like, where I believe that's going in
the future and give you some ideas. There's lots of question marks there about
potential directions.
And then as an example and not as a product pitch, I'm going to go through some
of the features of one of the products we have called HPC ClusterTools, which is
a distributed memory programming environment. All the various tools in that,
we'll go through that.
And as I said, if there's time, we'll spend a little bit going over three
strategic investments that Sun's making that aren't directly related to HPC at
first blush, but I think if we talk about this together, I'm hoping that you'll
see that they actually do have an interesting relationship to HPC. I will try to
get to that. I'd be interested in your feedback there.
Okay. So what is an HPC system at this point? The first point I'd like to make
is that I believe that we've gone through essentially what I would call the
renaissance of HPC at this point. And I mean a renaissance in the literal sense.
It's a rebirth; it's not a flowering necessarily of interesting new technology
and introducing cool ways of doing things.
What's happened is that, if you look at -- as Oscar mentioned, I spent eight
years at Thinking Machines. If you look at the past, say ten years or so, it's
littered with a very large number of broken and dead computer companies that
were trying to address the HPC market directly and solely marketing to HPC.
Thinking Machines, KSR, MasPar. We could name many more companies.
What's happened is that the more mainstream vendors, like IBM, HP, and now Sun
as well, have stepped up and moved into the HPC space. They're primarily doing
that by using what I would call essentially off-the-shelf technology that's been
developed not particularly for HPC. So in the sense that there are more and
stronger vendors involved in the HPC space at this point, I think of it as a
rebirth.
As I said before, in terms of some funky, new hardware there are some
exceptions. Tera, for example, is doing some interesting technology. Of course,
I don't want to make the point that SMPs are not interesting technology, but
in terms of technologies that are targeted directly at HPC, you don't really
see that very much anymore.
So that's point number one.
Point number two, and I'm borrowing terms from the commercial side here, we
believe that horizontal, vertical, and diagonal scaling are extremely important for HPC.
What determines which one is most appropriate for you, is what kind of
applications you are actually trying to run and what kinds of problems you are
actually trying to solve.
What we mean by horizontal scaling is large numbers of small boxes ganged
together, so more like a Beowulf Cluster. You choose what small is, say 4 CPUs
or below per node.
Vertical scaling refers to large boxes, so large SMPs. For example, Sun's
current generation goes up to a 64-processor HPC 10000, which is a fairly large
box.
Diagonal, as you would imagine, is the combination of both of them. That's very
large configurations of very large boxes. That's the kind of thing, for example,
you would see used for an ASCI configuration by DOE, or by any number of
terascale installations.
In terms of bandwidth and latency requirements, they're all over the map.
Certainly you can find, and I'm sure everyone in this room could name,
applications that require both very high bandwidth and extremely low latencies.
There's also emerging an increasingly large section of the market that doesn't
have those requirements.
If you look at bioinformatics for example, which we consider to be very much HPC,
you really -- if you're doing matching for example, you could really do that
with wet noodles running between your nodes. It's massively parallel. It's task
parallel. You don't really need much in terms of interconnect, but you do need
computational power.
And what got cut off at the bottom there, and I apologize for that, is the
comment that RAS, reliability, availability and serviceability are also
extremely important. If only because when you tend to build these very large
installations, it's extremely important to make sure not to lose components from
that machine.
For example, if you're running an MPI job across that entire config, you don't
want one of the nodes crashing on you. So characteristics that have been
historically important on the commercial side, the RAS capabilities, are
becoming increasingly important on the HPC side.
That's kind of our view of the space. Sun sort of parenthetically covers that
space actually quite well. But this is, again, not a pitch for Sun. I want to
talk more about the technology.
So there are lots of requirements for tool sets, and I just wanted to mention a
few of them here. Certainly any HPC tool set that is developed needs to
adequately, or more than adequately, support the programming models that are
currently in use. And there are essentially three prevalent programming models
that we see at this point as being important.
One is thread-level parallelism. And that means explicit threads, perhaps with
POSIX threads, or implicit threads, using auto parallelization with a compiler,
and directives using OpenMP. Those are all fair game for being labeled as
thread-level parallelism.
Sun in particular believes this is important. Again, we've got these very large
SMP’s. You see increasing numbers of vendors working towards these larger and
larger machines. DEC, IBM and whatnot, HP as well. This is a model that's here
to stay. It's not necessarily an easy model to use, but it's an effective way of
getting parallelism out of an application on a single box.
Message passing is the second paradigm and for us this means MPI. For all
intents and purposes, I believe not much new coding is happening in PVM at this
point. MPI, as of the version 2 standard, is now general enough to pretty
much deal with anything that was being done in PVM.
And then the hybrid model, the mixture of these two, is actually becoming quite
interesting to a lot of HPC customers. That means having threads within your
node and using MPI only between your nodes. Again, it's important that your
toolkit support this.
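
As a rough illustration of the hybrid model just described, here is a single-process Python simulation. Each simulated node sums its share of the data with a thread pool (standing in for threads within an SMP), and one partial result per node is then handed to a root through a queue (standing in for MPI between nodes). The names `node_partial_sum` and `hybrid_sum`, and the queue used as the "interconnect", are illustrative assumptions and not part of MPI or of any Sun product.

```python
from concurrent.futures import ThreadPoolExecutor
from queue import Queue


def node_partial_sum(chunk, threads_per_node):
    """Sum one node's data using several threads (the intra-node level)."""
    # Split the node's chunk into one strided slice per thread.
    slices = [chunk[i::threads_per_node] for i in range(threads_per_node)]
    with ThreadPoolExecutor(max_workers=threads_per_node) as pool:
        return sum(pool.map(sum, slices))


def hybrid_sum(data, num_nodes, threads_per_node):
    """Simulate threads within each node and messages between nodes."""
    interconnect = Queue()  # hypothetical stand-in for inter-node messages
    for node in range(num_nodes):
        chunk = data[node::num_nodes]  # this node's share of the data
        # Each node sends exactly one message: its partial sum.
        interconnect.put(node_partial_sum(chunk, threads_per_node))
    # The root "receives" one message per node and combines them.
    return sum(interconnect.get() for _ in range(num_nodes))


total = hybrid_sum(list(range(1000)), num_nodes=4, threads_per_node=8)
```

The point of the structure is that the number of messages depends on the number of nodes, not on the number of threads inside each node.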
Performance analysis capabilities are obviously extremely important. This is an
area, and I'm sure you're well aware, if you look out in the wide world of tools
development, I would guess this is probably the richest area where there are the
most number of tools. In spite of that, I don't think that anyone's really
solved the message passing performance analysis problem for large applications.
And I don't claim we have either at this point. I think it's an area of active
research.
Parallel I/O tends to be important for the sort of traditional HPC customers. I
think the jury is still out how much of the overall HPC market actually takes
advantage of parallel I/O. I'd be interested to hear if people here have an
opinion on that.
Resource management is critical. The ability to take your clustered system,
because we do believe most HPC systems in the future are going to be some kind
of a cluster, take those and actually treat them as single computational
resources. We'll talk a little bit more about that in a minute.
And then of course, high performance. You can have all the functionality in your
tool kit that I mentioned above and more, but if you don't have high performance
then there's really no point to it. And we'll talk about some of the
optimizations that should be done in an MPI library, for example, to make it
high performance.
This is my last bad slide. Where are programming models going in the future?
Certainly the current models are going to continue, and that's a good thing. I
want to acknowledge the fact that we've moved forward. If you think back about
five, ten years ago we did not have what I'd call a lingua franca for allowing
folks like you to write applications, portable applications, and move them
around on multiple HPC platforms.
Well, we finally got there. MPI, it's low level. It's got all the functionality
there. It is low level, but you can write your applications and you can move
them between vendor platforms. This is a good thing, but is that good enough? I
don't think it is.
Certainly OpenMP has come along. That standardized the development of directives
for thread-level parallelism. Looks to me as though-- well, there is work now
going on to expand OpenMP into the NUMA realm. You'll see more and more vendors
with NUMA-like capabilities. The DEC Wildfire for example is a good example, SGI.
Sun also had technology that was called WildFire. It was an internal beta
program that was actually discussed at the last Supercomputing conference, so if
you saw that paper you understand that Sun is doing something that is similar to
NUMA, although not quite the same.
So OpenMP will continue to evolve, certainly for NUMA. I believe that there will
also be a motion towards evolving OpenMP for clusters, and I label that as “The
Revenge of HPF.” I think HPF failed to take-off. It sort of nose-dived into
the runway before it really got full speed. And there are a lot of reasons for
that, and we can talk about that if people are interested. It was a bit of a
hard language for the vendors to actually develop compilers for, efficient
compilers.
What I'm hoping is that because OpenMP has become so popular and is so widely
accepted by the community, that we can continue to extend its locality
awareness that's developed on the NUMA side, continue to develop that to move it
out into the cluster realm. And keep it simple enough so that it continues to
be accepted and still usable by customers to sort of up-level their codes a
little bit, add some data parallel flavor, add some layout directives, the way
HPF sort of did that, and raise us up a bit more.
But above and beyond that, can we do better than that? I think there are a
couple of interesting places to look at least. Certainly high-level
object-oriented frameworks like POOMA, which was developed at Los Alamos
National Lab, is an interesting thing. The nice thing about a system like POOMA
is that they have shown it is possible to raise the level of abstraction for
your programming but still maintain high performance.
It's always been the case that people associate high levels of abstraction and
object-oriented programming in particular with very high overhead and very poor
performance. It doesn't have to be that way, and I think we should look at
moving upstream a bit into these higher levels of abstraction.
There's a question about what Java's role is in HPC. And I have to mention Java,
right? I'm from Sun. I can't help that. The JavaGrande forum, I'm sure some of
you are involved in this, is looking at different ways of integrating Java
technology into HPC.
And there are different efforts running and there are different time lines.
There are language modifications being considered to actually make Java an HPC
worthy language itself for coding your actual codes, but you can also use it as
an infrastructure for building interesting conglomerations of code that work in
a distributed realm. And we'll talk a little bit more about that in a minute.
I'd also add, and I know it's one of the themes in your institute here this
summer, is the idea of global models, and that's a bullet that actually fell
completely off the screen. So I would label those others sort of local models,
what you would run within a single site. But global models, I believe, are
extremely important for HPC.
They fall into a couple of different categories, at least the way that I look at
it. Distributed resource management; the ability to tie together a set of
distributed resources the way Globus and Legion do, to give you access to these
computational resources that are out there somewhere in this cloud of network
and computation. Where you don't necessarily need to know where you're going,
but you're going after a particular capability.
We've actually done some internal research around a Jini-based distributed
resource manager to use Jini as an infrastructure for dealing with the coming
and going of computational resources in the wide area.
On the other side there's Metacomputing; the ability to tie together resources
that are distributed globally and use those to solve a single problem. That one
is a little bit more of a stretch from my point of view. And we will get there
eventually, I believe, as the bandwidth within the network continues to
increase. We won't ever solve the latency problem in the wide area, so it's
going to be appropriate for particular kinds of applications, but I don't think
ultimately there's anything stopping us from doing that sort of thing.
So Globus and NetSolve are two examples. SETI, I like to think of that as an
interesting example. And then PopularPower is a small startup that's actually
trying to commercialize essentially what SETI is doing, and that looks like an
interesting technology to follow as well.
Having made those sort of general comments, I want to run through our cluster
tool products in some amount of detail to give you an idea of what a typical
product for distributed memory computing looks like from our point of view at
least, and I'll try to dive into some detail in places where I think it might be
interesting.
So just to place us in context, we've talked a little bit about this. On the
right-hand side, I'm talking about the development side. This is just meant to
echo the fact that we believe that there are, again, these two or three main
programming environments; the threaded applications and then distributed.
This set of slides is going to deal with the distributed aspect, but I did want
to mention that, obviously, a core component in building a distributed memory
toolkit is that you have a high performing local node toolkit as well. So
everything is built on high performance compilers, good analyzers and good
performance libraries.
The toolkit is built for both single SMP’s and clusters of SMP’s, and that's
a key point. People tend to think of distributed memory computing as being for
clusters. I mean that's obvious, right? But we also are very serious about
making sure that MPI in particular runs extremely well on single SMP’s. It's
important for us because we do have such large SMP’s available. We do have a
fair number of HPC customers that would consider using a single 64-processor box
for an MPI application, for example, and we could talk about that in a minute.
So what's in the toolkit? I don't think this should be too much of a surprise.
There's a resource manager. The MPI library itself. There's a distributed math
library. We have a distributed parallel debugger or development environment, and
MPI I/O and parallel file system are both aspects of our parallel I/O solution.
And I should mention at this point that most of this technology came over from
Thinking Machines about four years ago when I came over with the group, so it
was acquired and the folks were all hired. And that was one of a number of HPC-oriented
acquisitions that Sun has made over the last several years.
So this echoes the basic point that it's for clusters of UltraSPARC-based
machines of any size. So desktop-- actually below desktop. We have rack mounted
systems as well, rack mounted clusters -- up to SMP+, which was the technology
that I mentioned was discussed at supercomputing last year.
The cluster interconnect is an important point. Certainly we support any TCP
connected interconnect. We've also done some particular work around SCI, and
I'll talk more about why we did that. SCI is not necessarily interesting in and
of itself, but technologically it's interesting, and we'll get to that on the
next slide.
We've also opened up our MPI architecture to allow third-party vendors to plug
into our HPC software stack, and I think this is a key point for any reasonable
HPC approach, software approach, is to allow this interoperability. The reason I
mention this, is what we've done with this, this is about a million lines of
source code, we've actually made it available on the web under Sun's community
source agreement.
One of the first organizations that signed up to use this was Myricom. I'm sure
you're familiar with their Myrinet hardware. What they're doing is actually
integrating their Myrinet hardware and their GM layer with our MPI library, so
that they'll be able to offer a low-latency solution through our MPI and across
their interconnect.
Okay. This is sort of product oriented. The only thing I want to mention on this
slide is the last bullet. The release that I'm talking about right now is
reasonably scaled. We can support single parallel jobs that span up to 64 nodes
and consume a thousand processes. And that's not a limit on the size of your
cluster certainly. The cluster can be much larger if you have a resource manager
that deals with that. But this is more a testing limit than anything else, and
we'll be taking that higher in the future.
So a resource manager. How do we do this? The strategy that we've taken to date,
up until last month actually, was purely a third-party approach. We offer
something called the CRE; Cluster Runtime Environment. That is essentially a
smart job launcher that's capable of taking your job, you tell it where you want
to put the -- it actually does load balancing -- but in its simplest case, you
tell it where you want to put your job and it puts it there.
The idea was that third-party resource managers would be layered on top of that.
We didn't want to make a decision for the customer to force them into a
particular resource manager, because we found that customers had heterogeneous
computing environments and had previously chosen a resource manager. So we
wanted to work with whatever you had chosen.
We did, in spite of that, do an integration with LSF from Platform Computing,
because we felt as though they did have a good share of the market, they have a
good amount of functionality, and if people wanted to use that, we would offer
additional integration there.
Now one thing that has changed is the bottom line here. Sun, I think we
announced it end of last month, we acquired Gridware. Gridware is the company
that produces Codine and GRD. What we've done is essentially brought a
distributed resource manager in-house within Sun, and what you'll see over time
is a move to integrate the ClusterTools product more and more with that. This
will become part of the base capabilities within Solaris at some point.
The goal is that every box actually be able to function as part of a
distributed resource-management cluster.
But we still want to maintain third-party interoperability. That's important to
allow you to continue to make a choice.
So let me spend a few minutes talking about the MPI library and give you an idea
of what we've done in terms of some of the optimizations in this library. I feel
as though it's actually -- this is fairly objective, since I've talked to other
people outside of Sun about this as well -- I would argue it's one of the best
or maybe the best MPI implementation at this point. In terms of optimization
levels for our platform, it's certainly true.
So general capabilities. This is a completely new native implementation of the
MPI library. When we started at Thinking Machines we actually were building on
an MPICH core, and we decided that we weren't going to be able to get to where
we needed to be, in terms of scalability and performance. So we essentially
ripped all of that out and redid that implementation. We used some funding from
the DOE to do that.
It has all of what we would label the significant features of MPI 2.
One-sided communication is not in there, and we can
talk about that in a second. It's thread-safe. You have to do this if you're
going to effectively support the mixed programming model that we were talking
about earlier. It's also a significant point that applications can be developed
and debugged with single instances of the parallel debugger that we ship as part
of ClusterTools, which we'll go through in a second.
In terms of optimizations, we spent a lot of work in SCI, because SCI has a
memory-to-memory semantic. It allows you to take physical memory on one box and
actually map it into the virtual address space of another box. You need kernel
involvement to set this mapping up, but once
you've done that, you effect transfers of data across that interconnect by doing
loads and stores into those memory regions. So you completely sidestep the
operating system at that point, and that gives you access to pretty low latency.
By low, I mean at least with an SCI implementation, on the order of seven
microseconds user process to user process between boxes in a cluster. So that's
why we spent time working on SCI.
We also take advantage of an UltraSPARC special instruction called the
block-move instruction that allows us to move chunks of data around within the
MPI library, as we have to do buffer copies. One of the interesting features of
the block-move instruction is that it doesn't pollute your cache. So if you've
gone through the trouble of setting up all your cache state for your
computation, you really don't want the MPI library trouncing on that, so we
avoid that.
We've implemented co-scheduling, which you can think of as an approximation of
gang scheduling. We don't have a hardware synchronization mechanism across the
cluster, but the co-scheduling techniques that were developed primarily at MIT
and Berkeley allow you to do approximate gang scheduling based on local
information available to you.
We've also done a lot of work on locality exploitation. And the point here is
that if you have a big SMP as a node, or even moderately sized SMP as a node,
you should take advantage of that wherever you can.
So if you're doing, for example, a broadcast operation in a cluster of SMP’s,
certainly everybody would build a spanning tree in order to get the data out to
all the leaf nodes in your job, but there's no point in building the nodes of
the spanning tree on a 64 processor box. You build, what, five or six levels of
that spanning tree on the box. There's no point. Why don't you just drop the
data into a single location and have everyone read from it? By doing that, by
short-circuiting those kinds of things and taking advantage of the shared-memory
nature of the boxes, we can get higher performance that way.
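
To put numbers on the "five or six levels" remark, the sketch below counts communication rounds only. It assumes a binomial-style tree in which the set of processes holding the data can at most double each round, and it treats "everyone reads one shared buffer" as a single step. Both function names are hypothetical, and a round count says nothing about what each round actually costs.

```python
def tree_broadcast_rounds(num_procs):
    """Rounds for a tree broadcast where holders can double each round."""
    # Smallest r with 2**r >= num_procs, computed with integers only.
    return (num_procs - 1).bit_length()


def hierarchical_broadcast_rounds(num_nodes, procs_per_node):
    """Tree between nodes, then one shared-memory read step inside each node."""
    inter_node = tree_broadcast_rounds(num_nodes)
    intra_node = 1 if procs_per_node > 1 else 0  # all local ranks read one buffer
    return inter_node + intra_node


flat_single_box = tree_broadcast_rounds(64)              # tree over 64 CPUs
shared_single_box = hierarchical_broadcast_rounds(1, 64)  # one buffer, one step
```

Under these assumptions a flat tree over a 64-processor box needs six rounds, while the shared-buffer approach needs one step on the box regardless of how many CPUs it has.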
If you do that, one of the things that you might think, just to give you an idea of
the level of optimizations that we're doing, if I do have all the 64 CPUs in my
box go and read memory locations to pull data out of there, now I'm overloading
my memory system because I have this big hot spot in the system.
So what we actually do is some pipelining and some round robin access to those
data structures to actually spread out the load so that different processes are
actually attacking that buffer in different segment order, which turns out to
give you a nice performance boost.
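
One way to picture the "different segment order" idea is a simple rotation: every process reads all segments of the shared buffer, but each starts at a different one. The rotation rule and both function names below are assumptions made for illustration; the talk does not say which ordering the library actually uses.

```python
def segment_order(rank, num_segments):
    """Order in which one process reads the buffer's segments.

    Each rank starts at a different segment and wraps around.
    """
    start = rank % num_segments
    return [(start + step) % num_segments for step in range(num_segments)]


def max_readers_per_segment(num_procs, num_segments):
    """Worst-case number of processes on any one segment at any step."""
    worst = 0
    for step in range(num_segments):
        hits = [0] * num_segments
        for rank in range(num_procs):
            hits[segment_order(rank, num_segments)[step]] += 1
        worst = max(worst, max(hits))
    return worst
```

If all 64 processes walked the segments in the same order, all 64 would hit the same segment at every step. With the rotation and 8 segments, this model puts at most 8 readers on a segment at once.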
We also do lazy connections, which is really important for dealing with very
large jobs. We made the observation that it's not necessarily the case, it's not
typically the case, that when a large job is running, a large MPI job is
running, that all of the point-to-point connections that are possible between
those processes are actually used. So there's no point at startup time in
setting up the N squared connections between all the process pairs.
So what we do, and you can turn this off if you don't want to use this, we
establish the end points, connections, when the first communication happens
between those two processes, which allows us to just consume the resources we
need and just set up the connections when we need them.
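
A minimal sketch of the lazy-connection idea, assuming a hypothetical `LazyConnections` table that records a link the first time a pair of ranks communicates. It is not the library's implementation; it only contrasts the number of links actually opened with the all-pairs count an eager setup would create.

```python
class LazyConnections:
    """Connection table that opens a link only on first use."""

    def __init__(self, num_procs):
        self.num_procs = num_procs
        self.open_links = set()  # unordered pairs of ranks

    def send(self, src, dst, payload):
        link = frozenset((src, dst))
        if link not in self.open_links:
            # First communication between this pair: set up the link now.
            self.open_links.add(link)
        return payload  # a real library would transmit over the link here

    def eager_link_count(self):
        """Links an all-pairs setup at startup would create."""
        return self.num_procs * (self.num_procs - 1) // 2


# A 1000-process job in which each rank talks only to its ring neighbour.
job = LazyConnections(1000)
for rank in range(1000):
    job.send(rank, (rank + 1) % 1000, b"halo")
```

In this nearest-neighbour example the job opens 1000 links, where connecting every pair at startup would open 499500.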
We've also done a full multi-protocol implementation of the library. What I mean
by that, and this is where Myricom comes in on the next slide, is on a pair-wise
basis, we make a determination of the most efficient data pathway between two
processes in the MPI job.
So, for example, if the two processes are on separate boxes and connected
by TCP, we'll choose a TCP connection and do it that way and we'll incur the
latency by using that pathway. If the two processes are on the same box, we use
shared memory and just set up a shared-memory region and just slosh the data
back and forth between there. If an SCI adapter is available, then we'll use RSM,
which is this Remote Shared Memory I was describing before, where the memory
semantic is set up between boxes to move data very quickly.
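
The pair-wise decision described above can be sketched as a small function. The function name, the string labels, and the `sci_nodes` parameter are hypothetical; the only thing taken from the talk is the ordering of choices: shared memory on the same box, RSM when SCI is available, TCP otherwise.

```python
def choose_protocol(node_a, node_b, sci_nodes=frozenset()):
    """Pick a data path for one pair of processes.

    node_a, node_b: names of the boxes the two processes run on.
    sci_nodes: boxes assumed to have an SCI adapter (hypothetical input).
    """
    if node_a == node_b:
        return "shm"  # same box: shared-memory region
    if node_a in sci_nodes and node_b in sci_nodes:
        return "rsm"  # both ends on SCI: Remote Shared Memory
    return "tcp"      # fallback: any TCP-connected interconnect


paths = {
    "same box": choose_protocol("n0", "n0"),
    "SCI pair": choose_protocol("n0", "n1", sci_nodes={"n0", "n1"}),
    "TCP pair": choose_protocol("n0", "n2", sci_nodes={"n0", "n1"}),
}
```

Because the choice is made per pair, one job can use all three paths at the same time.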
Now, the bottom layer, the TCP layer and the RSM and shmem, it's called a
Protocol Module Layer. And part of the effort surrounding putting all this out
for community source, was to develop a well-defined API, a protocol module API,
this dotted line here, that actually will allow third-party IHV’s, Independent
Hardware Vendors, to plug into this architecture. As I mentioned, Myricom is
doing this.
We're also very interested, just generally speaking, in the InfiniBand
specification and that whole process. Are people familiar with InfiniBand? No.
Okay. Let me just say a few words about that, because you'll be hearing a lot
about that over the next couple of years.
InfiniBand is a merging of two earlier industry efforts, Future I/O and Next
Generation I/O. All the big players broke up into these two different camps, and
everyone was developing their own version of the industry's next generation I/O
and cluster interconnect.
And Sun actually played a key role in saying that this was a silly thing to do.
Why should we have two specs going forward? That wasn't going to help the
industry at all. So we managed to get everybody together in a room, and we
decided to develop one common specification. So eventually this will replace PCI,
and you will see it as cluster interconnect from, I would guess, virtually all
the vendors at one point.
Some basic numbers. It comes, at least initially, in three different flavors.
There's a 250-megabyte per second bidirectional; there's a one-gigabyte per
second bidirectional; and there's a three-gigabyte per second bidirectional. If
you're at all familiar with the VIA, the Virtual Interface Architecture, the
verbs that are used for communicating with this underlying hardware are very much
like VIA. It's a message-queue oriented semantic. It has this OS-bypass
mechanism in it, whereby, by adding descriptors onto these queue pairs,
they're processed directly by the hardware so there's no intervention by the OS.
And therefore you should be able to get very low latency.
So you will see this probably, I don't know for sure, but I think Intel will be
rolling out some of the low-level 1x, 250-megabyte-per-second hardware. I
think by the end of this year actually you will start to see bits and pieces
of this. But you'll see over the next three or more years that there's going to
be an increased emphasis on InfiniBand, and you heard it here first. In any
case, we, of course, want to be able to plug into that with our MPI
architecture, and you'll see other vendors doing that as well I'm sure.
So what else do you need in the toolkit? You need a subroutine library, a
math library. This particular library, S3L, is built on top of MPI. It's fully
thread-safe. When I say thread-safe in this talk, what I really mean is more MT
warm; and what I mean by that, is that there's actually a fair amount of
concurrency in our implementation.
You can get thread-safety by locking and unlocking at the entry and exit to each
of your subroutines, but there's no concurrency whatsoever. What we've done is
we've pushed a lot of the synchronization fairly far down into the library so
that you can actually have threads running around inside the library in
nontrivial ways and give you real concurrency. So S3L does this as well.
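
The difference between locking at every entry point and pushing synchronization further down can be shown with two toy classes. Everything here (`CoarseLibrary`, `FineLibrary`, `can_overlap`) is a hypothetical illustration of the distinction, not S3L's design. A non-blocking acquire is used so the check is deterministic.

```python
import threading


class CoarseLibrary:
    """Thread-safe, but one lock guards every entry point."""

    def __init__(self, num_buffers):
        self._lock = threading.Lock()

    def lock_for(self, index):
        return self._lock  # same lock regardless of which buffer is used


class FineLibrary:
    """Thread-safe with locking pushed down to each buffer."""

    def __init__(self, num_buffers):
        self._locks = [threading.Lock() for _ in range(num_buffers)]

    def lock_for(self, index):
        return self._locks[index]


def can_overlap(library):
    """While buffer 0 is locked, can work on buffer 1 proceed?"""
    with library.lock_for(0):
        other = library.lock_for(1)
        acquired = other.acquire(blocking=False)
        if acquired:
            other.release()
        return acquired
```

Both classes are thread-safe, but only the second lets work on different buffers proceed at the same time: `can_overlap` reports no overlap for the coarse version and overlap for the fine-grained one.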
The parallel capabilities are listed there. I'm not going to go through them.
There are a few more as well. This is fairly standard. We're always on the
lookout for other things that should be in that library. It tends to be tricky
because there's so many different requirements in the HPC space these days.
Let me quickly go through a few screen shots on Prism. This is our
integrated-development environment; sort of one-stop shopping for dealing with
debugging and performance analysis for a message-passing application. It
supports F77, F90, C. It has some basic support for C++. This is something that
we're working on. We expect to get up to full C++ at some point.
I'll just go through some of these slides just to give you an idea what it looks
like. It's a little bit long in the tooth now in terms of the way it works. This
is built actually with the Motif toolkit. We've had some thoughts about doing
this in Java at some point.
What's being illustrated here, if you can read the small red rectangle, is this
basic concept of Psets, process sets. This is a key piece of Prism. We designed
this from the get-go to be extremely scalable. And the idea behind Psets is it
lets you define semantically meaningful subsets of the processes in your MPI job
and then allows you to issue any Prism command to those subsets.
So, for example, what you're seeing up here, it says define Pset master to be 0.
So rank 0 in my MPI job is now called master. So typing master is no savings
over typing the digit zero, but if I define Pset "slaves" to be
"all - master", then I now have a mnemonic for referring to all of
the other processes in the job that are not the master. The "all - master"
implies some set notation. We can do basic set operations in terms
of defining these Psets. You can define Psets based on the values of variables
within individual processes in the job, which turns out to be pretty useful.
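
The set notation behind Psets maps directly onto ordinary set operations. The sketch below uses Python sets rather than Prism's own command syntax, which is not shown in the talk; `define_psets` and `pset_where` are hypothetical helper names.

```python
def define_psets(num_procs):
    """Build the 'all', 'master', and 'slaves' process sets by rank."""
    all_procs = set(range(num_procs))
    master = {0}                  # rank 0 is called master
    slaves = all_procs - master   # "all - master"
    return {"all": all_procs, "master": master, "slaves": slaves}


def pset_where(variables, predicate):
    """Process set chosen by the value of a variable in each process.

    variables: dict mapping rank -> that process's value of some variable.
    """
    return {rank for rank, value in variables.items() if predicate(value)}


psets = define_psets(4)
# Ranks whose (hypothetical) residual variable is still above a threshold.
slow = pset_where({0: 0.01, 1: 0.5, 2: 0.02, 3: 0.9}, lambda r: r > 0.1)
```

A command addressed to a named set then applies to every rank in it, which is what keeps the interface usable as the process count grows.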
The point here is for scalability. Any command can be modified by these Pset
qualifiers. You'll notice at the bottom it says (prism all). Prism has the
concept of a current Pset, so anything that I type or mouse at this point, at
least currently, will be sent to all the processes.
It's a little bit different than the way debuggers normally work, because if I,
for example, have all the processes stopped at a breakpoint and I say continue
Pset slaves, all the slave processes will continue, but Prism maintains control,
and I can still issue debugging commands to rank zero. So it's somewhat of an
asynchronous interface.
We put a fair amount of effort into giving you global displays of the program,
and this is another key point that really needs to be emphasized in any toolkit,
especially when dealing with these large distributed applications. You have to
be able to get a sense of what's going on in your application, and you can't
really get that if the only tool at your disposal is something that allows you
to go in and sort of probe one or a couple of processes at a time. So we spent a
fair amount on displays like this, which take a little bit of explaining.
What you're looking at is three different zoom levels of the same display. It's
basically a generalization of the linear call stack you'd get from, say, DBX,
where you're stopped at a particular point. You say where, it says well you were
in main. You called subroutine one; you called subroutine two; you called
subroutine three.
What we realized is that even for very, very large jobs with many, many
processes, it's not typically the case that you entered main in your application
and then suddenly hit a case statement and went out to a thousand different
locations in your program. There's a lot of commonality in the way your
processes trace through these applications.
So by using that observation, what we essentially do is take the N linear call
stacks in your application and merge them into a tree structure. So the way to
read this, and you probably can't see this, is that on the top left everyone was
in main. They made two recursive calls to search, and now you start to get a
bifurcation. Some of the processes went to the right and made a call to alpha
beta. The others you can't see; they've been iconified. But on the right, again,
there's another bifurcation because two different calls to alpha beta were made
at two different call sites.
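The merging of per-process call stacks into a single tree can be sketched in a few lines of Python. This is a minimal illustration of the idea, not Prism's implementation; the data layout, the function names, and the sample stacks (including the `alpha_beta:40` call-site labels) are assumptions invented for the example.

```python
# Hypothetical sketch: merge per-process call stacks into one tree.
# Frames are listed outermost first. A frame label may include a call site,
# so two calls to the same function from different lines form two branches.

def new_node():
    return {"ranks": set(), "children": {}}


def merge_stacks(stacks_by_rank):
    """Return a tree in which each node records the ranks that reached it."""
    root = new_node()
    for rank, frames in stacks_by_rank.items():
        node = root
        node["ranks"].add(rank)
        for frame in frames:
            if frame not in node["children"]:
                node["children"][frame] = new_node()
            node = node["children"][frame]
            node["ranks"].add(rank)
    return root


def render(node, depth=0, lines=None):
    """Flatten the tree into indented text lines, one per frame."""
    if lines is None:
        lines = []
    for name, child in sorted(node["children"].items()):
        lines.append("  " * depth + f"{name} {sorted(child['ranks'])}")
        render(child, depth + 1, lines)
    return lines


# Made-up sample data for a four-process job.
sample = {
    0: ["main", "search", "search", "evaluate"],
    1: ["main", "search", "search", "alpha_beta:40"],
    2: ["main", "search", "search", "alpha_beta:40"],
    3: ["main", "search", "search", "alpha_beta:52"],
}

tree = merge_stacks(sample)
print("\n".join(render(tree)))
```

Because most processes share a long common prefix, the tree stays small even when the number of ranks is large, and the rank set stored at each node is what a display could use to select those processes.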
As you zoom up, you start to get line numbers in the files, and as you zoom all
the way to the top, you get actual arguments that were passed within particular
processes. So, for example, Process 2 had the following arguments passed on the
stack at that point.
This is an active display. I could, for example, double click on this guy. My
Pset would automatically update to contain Process 1, 2, and 3, and then any
command that I would type at that point would just be sent to processes 1, 2,
and 3, and the individual stacks would have been adjusted so that I would be at
the appropriate point in the call stack. Again, it's just a way of giving you a
global view of what's happening with the program.
Another simple way of looking at the program from a global point of view is just
looking at the Psets that have been defined in the application. Some of them are
predefined by Prism. This idea of a current Pset -- what this is telling me is that
rank 0 of my job is stopped at a breakpoint, and the other fifteen processes in
this job are currently still in the run state. If someone had hit a seg fault or
had a bus error, then the error grid would be illuminated appropriately. You can
also see masters and slaves, which I've defined previously.
Another key point in a toolkit, and sometimes you find this integrated with a
debugger and sometimes you don't, is visualization. And this is visualization in
particular to support debugging; it's not, generally speaking, fancy
presentation graphics. We found that there was actually a lot of power to be had
or given to the user in completely integrating the visualization with the
debugger.
So essentially what you're looking at here, we call these data visualizers, and
these are popped up in a debugging session with a print statement. It's a
variant on a print statement. If you say print A, where A is an array, normally
all the million elements would just get printed out on your screen. Well, that's
not very good. So if you say “Print A on window name”, it will pop up one
of these visualizers.
So the way to look at this is, this is a text visualizer, and it's showing a
three-dimensional array. I can tell it's three-dimensional because if you look
at the top, there are three axes up there. There are two on the black rectangle
and a third one on the slider. This is easier seen in an interactive demo, but
the white rectangle in the black rectangle is a representation of this data
window. So if I grab that white rectangle and drag it around, interactively I
can pan across this data plane. That turns out to be useful in some cases.
You can also take and change the data representation to map. Instead of looking
at numerical values, you can map it to pixel values, and that's nice for looking
at physical simulations for example. There's a rendering for complex data,
magnitude and direction. You can build histograms. There are a number of other
basic capabilities that are here to aid the programmer in debugging the
application. It turns out to be one of the more useful features of Prism.
We have basic message-passing event analysis. I would frankly say it's not yet
up to the level of Vampir, for example. You can use Vampir with our MPI, if
you'd like to do that, but we did feel compelled to give at least some amount of
integrated performance analysis capability within Prism itself for those that
don't want a third-party tool.
You can also drill down on the state of your MPI messaging queues, and this is
another theme that I think is important. Exposing some internal state from
within the MPI implementation is important typically. Looking at queues is one
of the more obvious ways of doing that.
What you're seeing here is, rank runs this way, so Rank 0, Rank 1, Rank 2, and
then entries in the queue for those ranks are shown off to the right. The color
represents the MPI communicator. You can click on one of those and drill down
and actually find out what data is sitting in that queue and what the data types
are that it's associated with. Very useful for trying to find race conditions
and looking at basic programming errors.
Okay. In terms of parallel I/O. As I mentioned, there are two capabilities. We
have a full implementation of the MPI I/O standard within this toolkit, and I
believe the other vendors are getting to this point as well. We feel it's
important to have this, if only to support the more traditional HPC vendors that
really are into high performance parallel I/O.
So MPI I/O writes either to the Unix file system, which doesn't give you performance
but it gives you compatibility with the Unix file system. It does sequential
writes and sequential reads, or it can write into PFS, which is our Parallel
File System, which does what I'm sure you would all expect. It lets you have
some number of storage nodes attached to some number of cluster nodes in your
machine. The storage objects that are attached to a particular node in the
cluster are controlled by a PFS I/O daemon. The I/O daemon is responsible for
all data motion onto and off of the storage devices on that node.
The way you get your parallelism and scalability here is by having your
multiprocess compute job, which is shown up here as a two-process job. And these
processes could certainly be running down on the same nodes that the I/O daemons
are running on. By having those compute processes contacting the I/O daemons in
parallel and then having the I/O daemons contacting storage in parallel in order
to get as many spindles moving as possible, so you get increased scalability
by increasing the parallelism of your job and also potentially by increasing the
parallelism of your file system. So this is included as well.
Just make a note here that the performance of something like this depends a lot
on the -- you'll notice the green lines, most of them are going across, at least
the ones between the nodes, are going across the cluster interconnect, so
performance is critical and gated by the bandwidth, mostly the bandwidth of the
interconnect that's used for the configuration.
To give you some idea of performance, this is reasonable performance I think for
this generation of hardware that we're talking about, the UltraSPARC II. The
first two lines are talking about within a single SMP, so 2.5 microsecond
latency user process to user process within an MPI application.
And then MPI shared bandwidths of about 200 megabytes per second between
processes in the job, and we can sustain that between multiple pairs because of
the high back plane bandwidth of the SMP. Then the rest shows SCI latency, which
I had said was seven. This is actually a slightly outdated slide. And then the
bandwidth numbers aren't important.
As I said, I think for SCI the real deal there is latency reduction along the
lines of what InfiniBand is doing and along the lines of a next-generation
interconnect that Sun has, that we can't talk about today because we're not
under NDA, but the point is that the RSM work, the low latency work, will
transfer forward into the future and continue to offer low latency and very high
bandwidth for MPI applications.
Okay. I think I'll stop with that. I still have a few minutes, so are there any
questions on ClusterTools before I move on? Yes, go ahead.
AUDIENCE MEMBER: On the Prism, I assume that that software is good on any
platform?
MR. JOSHUA SIMONS: The developers back at Sun that work on this, do all of their
development, well, virtually all of their development, on Solaris for
UltraSPARC, and the product itself is only supported on Solaris for UltraSPARC.
So the answer from a supported product point of view from Sun, is that it's
SPARC Solaris, in fact, it's UltraSPARC Solaris; however, the fact that we've
released this under open source, community source, has made it possible, at
least in theory, for other parties to come in and actually port this to other
platforms.
For example -- I said before that most of our development is done on Solaris and
SPARC. We have S3L, for example, that group within Sun, just because they want
to, they actually do a lot of their development on Linux as well. So Prism, in
particular, is the hardest one to move over because it understands instruction
formats and it has to understand compiler stab formats. So teaching it about new compilers
and teaching it about new chips is not a trivial operation, but it could be
done.
AUDIENCE MEMBER: Do you know of anybody that's started that kind of effort?
MR. JOSHUA SIMONS: No one to my knowledge has done that.
AUDIENCE MEMBER: I have a question on the MPI when it decides about the shmem or
RSM or TCP/IP. Is that a compile-time thing?
MR. JOSHUA SIMONS: Not at all. It's done at runtime, and the decision is made on a
pairwise basis, so within a single job some processes may be using one method
and some processes may be using another method. In fact, one process itself may
be using shmem to get to some processes and TCP to get to some other processes
and RSM to get to other processes. Completely dynamic, it's all determined
on the fly.
AUDIENCE MEMBER: (Inaudible.)
MR. JOSHUA SIMONS: No. It makes the decision up front. It finds out -- it looks
and says I need to contact him. What's the best pathway? It has an idea of
rankings of pathways, so if you had a TCP connection and an RSM connection
between those two nodes, it would choose the RSM connection.
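The per-pair pathway choice described in this answer can be sketched roughly as follows. This is a hypothetical Python illustration, not the Sun MPI implementation: the talk states only that RSM is preferred over TCP, so ranking shared memory first, assuming TCP reaches every node, and the node names and link table are all assumptions.

```python
# Hypothetical sketch of choosing a transport per pair of processes at startup.
# Assumed preference order, best first. The talk only says RSM beats TCP;
# putting shared memory at the top is an assumption.
RANKING = ["shmem", "rsm", "tcp"]


def available_paths(node_a, node_b, rsm_links):
    """Return the set of transports that can connect two nodes."""
    paths = {"tcp"}                       # assumption: TCP reaches every node
    if node_a == node_b:
        paths.add("shmem")                # same SMP, shared memory is possible
    if frozenset((node_a, node_b)) in rsm_links:
        paths.add("rsm")                  # an RSM-capable link joins the nodes
    return paths


def choose_path(node_a, node_b, rsm_links):
    """Pick the highest-ranked transport available between two nodes."""
    paths = available_paths(node_a, node_b, rsm_links)
    return next(p for p in RANKING if p in paths)


# Made-up job layout: four ranks on three nodes, one RSM link between n0 and n1.
rank_to_node = {0: "n0", 1: "n0", 2: "n1", 3: "n2"}
rsm_links = {frozenset(("n0", "n1"))}

for peer in (1, 2, 3):
    path = choose_path(rank_to_node[0], rank_to_node[peer], rsm_links)
    print(f"rank 0 -> rank {peer}: {path}")
```

In this layout a single process ends up using all three transports at once, one per peer, which is the situation the answer describes.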
AUDIENCE MEMBER: I'd like to get some more information about Prism. Is there a
place or source where I can get that?
MR. JOSHUA SIMONS: Yes. Let me talk to you afterwards, and I'll try to come up
with a good URL for you.
AUDIENCE MEMBER: Okay.
MR. JOSHUA SIMONS: Are there any other questions? Yes.
AUDIENCE MEMBER: What's the future of Java from an unbiased opinion?
MR. JOSHUA SIMONS: Unbiased, right? It's an interesting question, and a lot of
people when they ask that, they tend to, understandably, you know, a person from
Sun is a person from Sun. But I actually come at Java less as a person
from Sun and more as an HPC person. Because in some ways I have less
visibility into the whole Java process than the average customer or the average
person who's involved in the Java community.
From an HPC point of view, it's unclear to me. It's not clear to me that we will
necessarily ever get to the point where you would want to really code your
kernels in Java. It certainly makes much more sense to use it as an
infrastructure for -- and I'll talk more about this in a minute -- for example,
for grid computing, and using it as infrastructure for doing these large
distributed kinds of things. But whether or not it's going to fly as a
computational language, I don't know.
The JavaGrande effort is looking at enhancements to the language that would make
it better, but then the question is, well, do you stick with the JVM or should
we really put a lot of effort in native compilers? And I could imagine, if we
put an effort into native compilers, then there are those that would claim that
Java is a cleaner language for doing object-oriented programming than something
like C++, in which case it could take off, but I don't think it's going to take
off in the context of a JVM, and I don't think the HPC community really needs
the portability that comes with the JVM execution. The jury is still out.
AUDIENCE MEMBER: How does Bill Gates figure into all this?
MR. JOSHUA SIMONS: From an HPC point of view, not at all. So that's my answer,
because I don't know the bigger story. Other questions? Okay.
I'm just going to spend a few minutes on these. Maybe by the time I'm done these
still won't seem like they're HPC related, but I hope to convince you that they
are.
These are three areas that Sun -- and I labeled them as one investment -- we
call these Three Big Bets that the company has made, and we did this in the
past. We bet that networks would be really important, for example. There are
other things that the company has done that have turned out to be right.
Well, these are the next three bets that we've made, and they're based on what
we see going on as technology is out there sort of in the computing world, and I
don't, in particular, mean HPC necessarily. The three new ones are, not in any
particular order, and I'll go through each of these in a minute, massive scale,
integrated stack, and continuous real-time.
We tend to date these because we'll go back in a year, two years, three years,
and see how right we were. We're investing enough of our resources in these sorts
of things, that if we're not right, we're in big trouble. But I think we are
right, and, as I said, I think there are ramifications to everybody in this
room.
So this is one of the motivators for why these three big bets are important. And
basically what this slide is saying is that if you look at the top, that's the
traditional model that people have been using for delivering software and
delivering services essentially. You go out and you buy your application and you
install it on your local machine and you use it; or maybe you put in a local
server and you NFS mount it, but it's gone out. Many, many copies are
distributed, and it's used in a distributed manner.
You do updates by sending out a new CD-ROM or by downloading something to your
site and using it that way. The new model, which everyone's been hearing about,
it's more related to an ASP model, an Application Service Provider model. And
everything is heading in this direction, right, where everything is out on the
web. And, you know, the logical conclusion of that is that nothing is local.
Even your confidential banking data and whatnot or documents that you're
writing, for example, may actually be out on some kind of secure server
somewhere on the web.
One of the benefits of that is that you have ubiquitous access to it. As long as
you can be convinced that it's secure and only you can get at it, then why do
you care where it is as long as you can get it from wherever you are.
So that's kind of where things are going, and that has some implications. So Sun
is putting a lot of effort into what we call an integrated stack. And the
integrated stack is an e-commerce concept. It's really everything from the OS
all the way up through application servers, web servers, JVM, Java 2 Enterprise
Edition, directory, databases.
All of this stuff has to be brought together in a really efficient way to run as
well as possible on Sun platforms so that we can offer a solution for the
Internet economy essentially, right? This is all e-commerce based. And above
that level that I mentioned there are broker engines and whatnot. And you wonder
what does this possibly have to do with HPC, right?
What you're looking at here is kind of a portal model, and portal is somewhat of an
overloaded term. Sometimes people, when you mention a portal, what they think of
is a web page that you go to that is aggregating access. It has links off to
lots and lots of different places. It's your one-stop shopping for doing things
on the web.
Well, the other definition of a portal, and it's the definition that we prefer,
is that it's typically a web page that you would go to, but what it's doing is
it's giving you a portal through something into something else. So, for example,
we have something called sun.net that we use as Sun employees that allows me to
walk up to any browser anywhere on any kind of a machine, as long as it's got
Java running in the browser, and drill through our corporate firewall to access
my e-mail, access an increasing number of applications that have been
web-enabled and made available through this portal service to employees that are
traveling wherever they are. So that's the kind of portal that we see as being
interesting and useful.
If you think about it from an HPC point of view, it's not a big change to the
picture, right? So you replace those generic apps with some kind of a parallel
multibarbell sort of thing there with a parallel file system behind it, and you
allow access to your HPC applications from anyplace, anytime.
So who would use this? Like a supercomputer site, for example. Somebody who has
lots of remote users that need secure access to their data. It's very much along
the lines of what's happening on the commerce side just translated over into
technical applications. There's really no difference.
The stack doesn't care if you're doing technical computing or you're doing
database accesses. It doesn't matter. The framework is common, and we think this
will become increasingly important. My understanding is that this basic idea is
one of the themes of SC 2000, this Escape 2000 that they're running in Texas this
year. So you'll probably see more of this in the future, and I think
increasingly over time it's going to be important for us to figure out how that
stack can be used for leveraging what's going on at HPC.
And kind of a subtext that's running throughout what I'm saying is that, you know,
historically -- I've been doing HPC for twelve years now or something like that,
and we've always prided ourselves on being on the leading edge and doing things
that other people aren't doing. That's still true in terms of content. But in
terms of what's driving the industry and technology, it's not HPC anymore. It's
this huge Internet economy.
And the point I made before about commodity nodes being sort of responsible, at
least in part, for the renaissance of HPC, that's an example of this leverage
that I'm talking about. We need to look at what's going on in the commercial
space because that's going to be the infrastructure that we should be taking
advantage of on the HPC side. It's no longer going to be possible to go off in
our own directions. It's just too expensive to do just HPC oriented sorts of
very, very large efforts.
I don't mean to say that companies won't put effort into HPC-related things. For
example, the group of fifty engineers that worked on ClusterTools; well,
ClusterTools has no bearing whatsoever on the commercial space. So as a vendor
you still need to make investments into some HPC specific aspects for the
community, but the point is that there's a huge amount of effort and a huge
amount of money and thinking pouring into this commercial side, and we really
need to figure out how to leverage it on the technical side.
Another thing that happens that's quite popular in the commercial space is this
B2B stuff, business to business, where you have multiple sites communicating
using some kind of high-level protocol for doing something. Maybe they're doing
supply chain analysis or supply chain management, which is all gobbledygook to
me, right? So you replace that. And, this is trivial, right? This is what I
would call site-to-site stacks.
Now, if you think about, you know, DOE is a good example, I'm sure many of you
are examples as well, where you have multiple sites that are in close contact
potentially -- eventually being used for doing megacomputing sorts of things.
Well, there's infrastructure here for building those kinds of applications. And,
again, I don't think we should reinvent the wheel. There's a huge impetus here,
and I think we should take advantage of it.
I won't say too much about continuous real-time, except to point out that what
this is about is, if this stack and these machines are going to be the basis of
an Internet economy, then it needs to be there all the time. The way our CTO
says it, it's in terms of continuous availability versus percent uptime. It's
kind of the difference between planes and computers. You don't talk about
percent uptime for a plane, right? It better always be up. That's what we have
to be striving for.
And there's a benefit there for HPC as well. As I said earlier, the way you
build a large HPC complex these days is by aggregating large numbers of
commodity components to build them, and you need those components to stay up.
Your MPI app is not going to make forward progress if the nodes in your machine
are down all the time. So the RAS capabilities and some of the other efforts
that are happening to make things better on the commercial side, again, have a
lot of relevance on the technical side.
Mission critical, brand critical refers to things along the lines of, well, if
E-bay goes down or CNN or something goes down, well, when that happens, we show
up on the news, right? This is not something that we as a company, or any
company that is providing infrastructure for the net, can deal with. We have to
take this very, very seriously. And, again, you benefit from it.
So massive scale. HPC has always been about massive scale, so there's nothing
new there. But the new thing again is that the commercial side is now becoming
increasingly about massive scale. Why is that? Because there are these huge
scaling pressures.
If you look at the sum of computational and storage demand, it really goes up
proportional to the number of devices, number of users, the duty cycle and the
bandwidth. So, in particular, duty cycle goes up as bandwidth goes up. As the
net becomes more useful, as the bandwidth goes higher, people are staying
on-line more and more. As more DSL lines come on-line and more cable modems come
on-line, everything is increasing.
When we hit the wireless revolution, which we are on the brink of entering at
this point, the number of devices is going to explode. It's going to just be
huge. So that's going to drive the commercial side into this big problem, right?
If you look at the two curves here, the lower one is meant to be a rough
indication of Moore's law, where processing power is doubling every eighteen
months. The other one is Gilder's law, where the capacity of network bandwidth is
doubling about every six to nine months.
So there's a problem there. We have a huge amount of infrastructure developing.
If you look at wave division multiplexing and dense wave division multiplexing
and Terabyte Routers and optical home computing, the backbone is already at
Terabyte so it's just huge. How are we going to keep this full, because you know
the demand is going to be there.
The way you keep this full, the way you take care of disparity between these
curves, is by aggregating very large numbers of CPUs in machines at your service
points in the network so that you can actually have enough horsepower to push
data out into these big pipes.
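To put rough numbers on the disparity between the two curves, the doubling periods quoted above (eighteen months for processing power, six to nine months for network bandwidth) can be compared directly. The Python sketch below is a back-of-the-envelope calculation only; the function names are made up, and treating the bandwidth-to-CPU ratio as a proxy for how much more aggregation is needed is an assumption, not a claim from the talk.

```python
# Back-of-the-envelope comparison of two exponential growth curves,
# using the doubling periods quoted in the talk.

def growth(months, doubling_period_months):
    """Growth factor after `months` for a quantity that doubles every period."""
    return 2 ** (months / doubling_period_months)


def bandwidth_to_cpu_gap(months, cpu_doubling=18, net_doubling=9):
    """How much faster network capacity has grown than per-CPU power."""
    return growth(months, net_doubling) / growth(months, cpu_doubling)


horizon = 36  # three years, an arbitrary horizon chosen for illustration
for net_doubling in (9, 6):
    gap = bandwidth_to_cpu_gap(horizon, net_doubling=net_doubling)
    print(f"bandwidth doubling every {net_doubling} months: gap = {gap:.0f}x")
```

With these figures, three years of growth leaves network capacity about 4 times (nine-month doubling) to 16 times (six-month doubling) further ahead of a single processor than it is today, which is the gap that aggregating more CPUs at the service points has to cover.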
Well, that sounds a lot like HPC, right? They're using it for different
reasons, but they're still aggregating and they're going to have all
the problems that HPC folks have had for years with things like downtime, with
administrative tools. This is a huge area. It's constantly pounded into us that
one of the biggest problems with large HPC installations is not necessarily the
programming tools, it's the administrative tools that keep the darn thing up
and running and let you use it in effective ways.
If you look at AOL as an example, I believe they have around forty thousand CPUs at
this point scattered amongst their data centers, and they use, what I called
earlier, horizontal scaling. Those tend to be 4 CPU boxes, maybe a little bit
bigger than that. So a very, very large box count.
We have to solve this problem. If we don't solve it, then the Internet economy
is in big trouble, and parenthetically I think HPC would be in trouble as well.
But I think there's -- again, the message is there's a huge benefit to us on the
HPC side because we can leverage a lot of the work that's done here.
I think I'll stop. Any questions about that last bit?
AUDIENCE MEMBER: Maybe this is a simplistic question, but the integrated stack
looks to be very much like a huge database. How is it different?
MR. JOSHUA SIMONS: The question is: How is the integrated stack different than a
huge database?
Databases are actually a part of the stack.
AUDIENCE MEMBER: The directory part.
MR. JOSHUA SIMONS: Right, directory, but also data itself. In the stack, at its
lowest level you have the operating system, file system, directory servers,
database servers as well, but then above that you start aggregating higher and
higher levels of abstraction. So, for example, messaging servers are considered
to be in there, so e-mail servers, portal servers that offer that kind of portal
technology that I was mentioning. E-commerce engines above that. Security
engines that allow secure transmissions between these things.
So it's really -- it's shorthand in our world, at least, for everything that you
need to walk into a company and get them functioning, fully functioning in the
Internet world as an e-commerce entity.
AUDIENCE MEMBER: So the word stack is not really meaningful, right?
MR. JOSHUA SIMONS: It's a stack in the sense that the offerings that are
included in it are actually fairly well layered on top of each other, so it's a
stack in that sense.
AUDIENCE MEMBER: And do you feel comfortable with the security issues that are
in place?
MR. JOSHUA SIMONS: Do I want my private correspondence kept out on a web server
somewhere? The question was do I feel comfortable with the security model
that's in place right now.
The answer is no right now. You know, I think there is a lot of work that has to
be done there in order to improve security, but you couldn't have better people
than, for example, banks bothering you about this, because they care
passionately about this. If we solve it for banks, then I'm sure we've solved it
for HPC, right? There aren't many people more paranoid than banks. Well, maybe
that's not true. That's not true. I take that back. So maybe there are a few
other hurdles to be crossed there in the government space.
Other questions? Okay. Thank you.
| last revised:11/16/00 05:59:49 PM |
| editor: pmateti@cs.wright.edu |