Summer Institute on Advanced Computation

August 20-23, 2000


College of Engineering & CS
Wright State University
Dayton, Ohio 45435-0001

Linux Clusters

 
Dr. Kumaran Kalyanasundarm
Systems Engineering Specialist
Technology Center, SGI

Mr Doug Johnson
Ohio Supercomputer Center

 
August 22, 2000 1:00 p.m. This site contains the "web-based proceedings" of Summer Institute on Advanced Computation that focused on high- performance high-throughput  Cluster Computing. 
Slides1 Slides2 Slides3
           

DR. OSCAR GARCIA: In our afternoon lecture, we're going to hear from SGI's Kumaran Kalyanasundaram about his views. He's going to be assisted by Doug Johnson, and he will introduce Doug Johnson. So Kumaran, whenever you want.

           
 

DR. KALYANASUNDARAM: Okay. Thanks for coming. Actually, this is going to be an introductory lecture, or class, to Linux clusters. And I picked up most of the slides that I was going to use today from a class I attended at the Ohio Supercomputer Center. And the Ohio Supercomputer Center has a 128-node SGI Linux cluster that's up and operational, and some of you might have experience using it.

The person who's the technical lead and put together the class is Doug Johnson of OSC, and he was kind enough to drive up here and assist me in doing this. In other words, it means he's going to do it while I'm going to sit and listen just like you guys do. So, I mean, we talked about it this morning as to what needs to be presented, and after talking to some of you guys yesterday, I think it's going to be a very introductory level about Linux clusters.

So what we're going to talk about is understanding how to go about designing Linux clusters. So what do you choose in terms of hardware? So what do you choose the terms of processors? Do I choose Pentium III's? Do I choose Xeons? Alphas? Celerons? AMT's? What do I choose? What should I base that on? What should be the cache? How much memory per node? In a cluster, what should be the disk systems and file systems I should be using and things like that.

We'll also be talking about the various tools available for managing the system, okay, as a single entity. Something I mentioned in yesterday's talk. We have been talking about various tools that are available for monitoring the system, things like PCP Performance CoPilot. We will talk about a batch scheduler, and I mentioned PBS as an example, so we'll talk about PBS. We'll also talk about the various compilers available in the open source and closed source that people use to run these problems.

We'll talk a little bit about MPI. More importantly, we'll talk about the various networking options that you have available. You know, the things I mentioned yesterday, the default commodity, Fast Ethernet, Gigabit Ethernet, Myrinet, Giganet, Dolphin XSCI and things like that. So we'll mention the various things. So it's a little more in-depth than what we did yesterday, but it will give you an idea what you should pick and choose when you are designing your own cluster.

After that, Doug has a few slides to talk about for the cluster at OSC, and he is going to tell you about it. You can get accountants on it, and run your jobs on it. So he's going to talk about the cluster hardware and also the environment which is available there.

I know some of you have listened to Doug before, and this is another opportunity for you to. He has put some more information in the slides, and this is another opportunity to listen to it. Doug.

MR. DOUG JOHNSON: I think I have to have the microphone. I'm kind of a quiet speaker. Luckily they've got this nice PA system. I came out here with a laptop and realized I didn't have Office installed, so we're at the last minute kind of moving around the presentations.

I'm here to talk about Beowulf clusters today, and we're going to start off with a design course that has been developed at OSC. This was developed in conjunction with the systems staff and our science and technology support staff. So we have people coming to this content from both ends of the computer business. The systems' people, who administer the systems everyday, and then the science and technology support staff, who are on staff at OSC to assist researchers in using our hardware and software and who do research of their own.

So one of the goals of the tutorial is to provide a nice overview of what goes into putting together one of these clusters and how much time it takes and what kinds of decisions you would have to make. We have a general introduction and, as Kumaran said, we will have a good hardware overview of the different processor choices, memory choices, interconnects.

We'll gloss over some of these tings, because this is really a two-day course, so we're going to skip a lot of things. But after talking to a few people and listening to some of the talks this morning, hopefully we'll present something a little different than what's already been presented.

 

DR. KALYANASUNDARAM: You guys feel free to stop him any time you feel it's been done. Both of us haven't attended many of the other sessions, so we don't know what was presented. We're going by the abstracts there in the documents. So if you think this is basic or it's been done before, stop him and we'll go somewhere else. Or, if it's too advanced, again, stop him.

MR. DOUG JOHNSON: The Beowulf Cluster Project it's a variation on a very old theme. For decades people have had the idea of putting together small, inexpensive or small, expensive computers in a group and getting them to work together on larger problems. So Beowulf clusters are just a variation on that theme.

It started at NASA Goddard in 1994. Their goal was to do postprocessing of data from grand challenge problems being run on some of the national resources, some of supercomputers that NASA had as well as the National Science Foundation supercomputers. It's amazing to see how far Linux clusters have come, because six years later they're not being used to postprocess data from grand challenge problems on the National Science Foundation supercomputers, they're being used as the National Science Foundation supercomputers to run grand challenge problems.

They believed that Beowulf clusters would be composed of commodity hardware, and, of course, commodity is in the eye of the beholder or the eye of the definer or the person trying to sell you that commodity hardware. They're usually using Linux as the operating system, but it can be any open source operating system. The free BSD, Net DSD distributions are very popular as well, and I will go into a little detail as to why we think Linux is the best choice for this.

Sometimes the components in the software suites can be closed source. Not every facet of your software environment can be developed by the open-source community or through free projects. And some of the examples of those are a compiler suite. The Portland Group compilers, the suite of Fortran and C and C++ compilers, or the TotalView parallel debuggers are examples of critical pieces that we rely on to make this is a successful environment, so it doesn't have to all be free.

The main goal is to originally give the resources back the to the researchers. We're a centralized facility at OSC, so we're kind of perverting this project. We are the holders of the computer resources, and we're trying to do the same thing. We'll talk more about how we plan to give the resources back to the researchers later. It's a real interesting program.

Some people in this room are building their own Beowulf clusters, either at the department level or the individual lab level. Anybody's who's worked on these for even a small amount of time realizes that you do have control, but this is the key thing to remember: You inherit the work of maintaining the system and understanding the components and how they all fit together and debugging. So it's a trade-off. You don't have to give a computer vendor a ton of money to get a high-performance computing resource, but a lot of the work is going to fall on your shoulders.

Typically, we see that the computers that make up clusters are dedicated for that use, sometimes. There are, of course, examples of computer labs also being used as clusters. We saw one upstairs. I'll talk a little later this afternoon on a separate set of slides, and I'll bring up some provocative statements about how to do computer labs and clusters and kind of an interesting funding model for how a department would do funding for these types of computers.

The talk this morning mentioned the large cluster out at Stanford. There are definitely several examples of very large clusters, anywhere from 256 processor source up to thousands, or around a thousand. We consider a small cluster to be anywhere from four processors to 128. We have a 128-processor cluster at OSC. We consider it to be a small resource. It's large, but it's definitely not as large as what the national labs are putting together.

So we have a mission at OSC. When the state created us, a few years after that they came up with a mission statement, just like every organization has to have these days. And when we started this project, we needed to make sure that this project fit into what we're supposed to be doing according to our mission statement, and it dovetails very nicely. But in the end, the metric for the success of the project is how well it facilitates our users' research.

So we have to always be skeptical about this program. Beowulfs are very popular. Everybody is very interested in them. It's very exciting projects to work on, but being skeptical is something that we should always carry with us.

We started our Beowulf cluster project back in January of '99. We'll talk about our other cluster projects in another set of slides. We had two clusters before this one, and this was a system of Tindell Pentium -- or, five dual Pentium II 400 systems with four and a half gigabytes memory aggregate. We've added on a couple things.

Since we've purchased this cluster for testing, we're investigating the use of fibre channel for access to network attached storage, a system area network, we've also replaced the 100 megabit with Gigabit Ethernet cards. The interconnect of choice for this cluster was the Myrinet PCI card. We also evaluated, or attempted to evaluate, the Dolphin XSCI card. The Scalable Coherent Interface card. In the section on interconnects, we'll talk about both of these.

We did open this up to friendly users. Anybody who asked us, plus some people we asked to evaluate the technology, and we deemed it a success. It was a compelling enough technology that we were able to unplug an SP2, an 8-processor SP2 that we had, and use this computer in its place for these researchers that were still using that resource.

I think I want to touch on the hardware we have currently. We'll talk more about this larger cluster later. We now have a large Linux cluster, and it's made up of 33 quad Pentium Xeon computers. Pentium III, not Pentium II. That's a typo. These are SGI 1400's. They have 2 gigabytes of memory each, 660 gigabytes of disk aggregate, and each compute node has two, 64-bit Myrinet PCI cards for the interconnect. These are the cards used for the MPI traffic exclusively. No other traffic goes over these cards other than MPI traffic.

On the front end we have connectivity to our mass storage server, which is an Origin 2000, and that's a HIPPI interconnect. It's an 800 megabit per second interface. Basically the only traffic that goes over that is NSF traffic to the user's home directories.

Can a Beowulf cluster be part of a production HPC environment? We're going to assume a couple things, and we won't hold tightly to these prerequisites here because we're not going to go into as much depth today as this course originally did. But you should be familiar with Unix, Linux, and a passing familiarity with MPI or distributed shared memory programming of some sort. PVM, MPI. In the hardware overview section, we want to cover the processor memory and IO. And under the heading of IO will be included the network, the interconnects.

There's a lot of choices on the market for CPU's these days. We see an amazing increase in the performance of processors that are available for not that much money. Every year we see incredibly more powerful processors than the previous generation CPU's come onto the market. And Intel has done a good job of being the de facto standard of the commodity high volume market, and we've lived with the X86 architecture for the last twenty years.

The latest incarnation of it is the Pentium III. It's basically what we you're going to get if you buy a desktop computer, and this has some very nice consequences. The widespread of adoption has a nice consequence in that the software under it, for the Linux side at the very least, is very well-supported and very well-understood in how it executes and how it runs, and very well-debugged.

So with the wide adoption of this one processor, we see this Linux operating system that also has wide adoption live up to the saying, That with enough eyes, all bugs are very shallow. So it's a very well-debugged and stable platform.

The Pentium III is, like I said, the latest incarnation. It has additional extensions to the instruction set of the traditional X86 processor. These are called the SSE, the Streaming Store Instructions. They're a set of SIMD, that's a Single Instruction Multiple Data, for the noncomputer science people here. That's where they have a large bit field that gets operated on by one single instruction, but you have multiple copies of data in that one large bit field.

Now, the practical application of this in everyday use we're still wrestling with. We're still trying to figure out what these streaming store instructions and the SIMD instructions mean for the scientific or engineering programmer.

The cache on the Pentium III is limited to 512K, and runs at half processor's clock speed. With the newer faster processors, this is now dated information. They've decreased the cache size by half, but it now runs synchronous with the processor. In contrast, the Pentium III Xeon chip has the option of going all the way up to two megabytes in the L2 cache size, and that memory does run synchronous with the clock of the chip.

In practical experience with these with two different chips, we've seen that the Xeon, at least in these 550 megahertz and slower versions, have a very poor performing L2 cache. Even though they're running synchronous with the processor's clock, the latencies are very high. This kind of goes back to something we've all been told for many years about memory and computers. The smaller, the faster.

So even though they're able to go up to 2 megabytes, they weren't able to go up to that size of an L2 cache and keep it as fast as the 512 kilobyte L2 cache that's running at half the clock speed. It does not scale as well as you would want it to. So that's just a bit of a caveat.

It's also possible to purchase quadprocessor Xeon boxes. The maximum number of CPU's that you could put into a traditional Pentium III box is going to be two. With the Xeons you can by 4, and there are even some companies that manufacture 8-processor Pentium III Xeon boxes.

DR. KALYANASUNDARAM: Can you talk about scalability of applications within a 4-processor Xeon box?

MR. DOUG JOHNSON: The question is can we talk about how well an SMP node with four Xeon processors scales. Not very well. The front side bus, and this was a point that was touched on this morning, of these Intel processors is either a 100 megahertz or 133. It is a bus. They do not have a crossbar switch or a multiple bus architecture for the memory. So there is a 64-bit wide data bus running at either 100 megahertz or 133, and all four processors have to share that.

And the other traffic running on that bus is the cache coherency for the L level 2 cache on those CPU's, and the method Intel uses to do cache coherency bus snooping. They do not have an extra channel between the CPU's to keep the cache coherent. That traffic also has to contend with the normal processor to main memory traffic.

So consequently, one application that is trying to monopolize the memory bandwidth -- if you have another copy of that running on another CPU on that same board, they're only going to get half of the memory bus bandwidth to main memory. And even worse, when you have four copies of that programming running, you have to share it by a fourth. Because of the fact it is a bus and there's no crossbar, it's perfectly scaled by the number of processors that are in contention.

So what that comes out to in real numbers is, you have 300 megabytes of sustainable bandwidth to main memory, and if all four processors need as much as they can get, they'll only get 75 megabytes per second each. We'll come back to the number, because that's about the same that we can expect out of our interconnect to the other systems in the network using our high performance gigabit class interconnects.

Processor choices on the market are the Intel Celeron, which is a cheaper version of the Pentium II architecture. It has a smaller cache, and once again, we find that the rule holds, even though it's smaller -- well, they sell it has a lower performance cache, it turns out to be a better performer than the larger caches in the Pentium III. It does have a slower front side bus, so it has slightly lower main memory sustainable bandwidth. And it does not officially support SMP configurations, but there is a note here about how one could possibly use the Celerons in an SMP machine.

The fastest X86 compatible processor available on the market today is the AMDK7, the Athlon chip. It's a very compelling processor. It's a re-implementation of the X86 architecture, plus they also have their own extension that they use to get better performance out of a chip. It has a large L1 cache, as you can see. I think that is split 64 kilobytes between instruction cache and 64 for data cache.

It supports large L2 cache sizes as well, but we haven't seen the really large cache sizes on those chips yet. They haven't started manufacturing those. It has the same memory bus as the alpha, which is the EV6 bus. If they were able to get a motherboard manufacturer to manufacture multiple processor versions of this, it would have a crossbar for the memory bus. So you would have much better scaling for the SMP configurations if they were just available.

AUDIENCE MEMBER: Any idea when that will happen?

MR. DOUG JOHNSON: I have not heard one project to make that available. What it comes down to is the reason you that you would want to by an SMP computer compared to a single processor, is that it ends up being cheaper. You get a higher processor count for less money.

DR. KALYANASUNDARAM: Not only that, you end up getting less in networking cards too.

MR. DOUG JOHNSON: Yeah. And Kumaran pointed out you have to buy less infrastructure for it. There are less power supplies, less hard drives, less interconnects to buy, less network cards. So are there are advantages for single processor nodes.

AUDIENCE MEMBER: One of the things -- and if you're going to touch on this it later on fine, tell me that -- what I haven't understood is that the multiple processor machine, I have one, basically, Internet address I'm coming in at. How do I address the different processors once I'm inside there?

MR. DOUG JOHNSON: Yeah. We'll come back to that when we talk about MPI.

AUDIENCE MEMBER: Fine.

MR. DOUG JOHNSON: So a non-X86 non-Intel processor that is very popular on the market is the Compaq Alpha. It's been a very successful design. It's been on the market for almost ten years now. Digital, who designed the chip, has had a tradition of designing longevity into their processor designs. We saw that the VAX CPU design was a two-decade design. There was a story last week that this is your last chance to by a VAX from Compaq. So if you need to buy VAX's for your department, this is your last chance to buy new ones.

So the Compaq was notable because it was a 64-bit chip 64 RISC, Reduced Instruction Set Chip. And the 21 264 is about the fourth generation design of this chip, and it is bar none the fastest chip on the market for scalar processors, Vector withstanding. It supports different levels of L2 cache, 2 to 4 megabytes. It supports a 200 megahertz front-side memory bus. And for the SMP configurations of this computer, it is on a crossbar switch. And they have done very good things with their crossbar designs to have those scale to a large number of processors, surprisingly large.

Consequently, they're able to go up to 16-way configurations, and they've got plans for 32, or, I think, they are selling 32 processor configurations of this chip now. It's much more expensive than your typical X86 processor, unless you compare it with higher-end quadprocessor Xeon boxes.

Kind of on the side. The Compaq, while it's a very high-performance chip and is supported under Linux, it's kind of plagued with some problems. Namely, that not as many people are using it. And so the third party adoption of it has been smaller. You don't see as many commercial third party codes, such as Fluent and LS-DYNA porting over to the Linux for the Alpha. So if you're going to build a cluster and you rely on third party codes, you're more likely going to see a Linux X86 version rather than the Linux Alpha version of the software.

DR. KALYANASUNDARAM: So the processor of choice would be two processor Pentium III's or the AMDK7's.

MR. DOUG JOHNSON: Yes. If I was buying another 32-bit X86 cluster today, I would by either single processor AMD chips, or a dual Pentium III coppermine. Now, these buzz words that come out of Intel now and again, sometimes they actually mean something. So the coppermine is a slight revision to the Pentium III architecture.

And this is where I mentioned earlier they lowered the size of L2 cache on the newer Pentium III's. It's smaller. It's 256K, but it runs synchronous with the clock of the chip all the way up to their 1.13 gigahertz processors that they're selling today. So those chips have much better L2 cache performance. And if I was buying a Pentium III today, I would make sure it is coppermined.

AUDIENCE MEMBER: What about the communication between the two processors? Is there a potential weak link?

MR. DOUG JOHNSON: It's still the same.

AUDIENCE MEMBER: It's the same as the Xeon, right?

MR. DOUG JOHNSON: It's the same problem that the X86 architecture -- or, from Intel, has. It's a shared bus.

AUDIENCE MEMBER: And so going with that, with a dual processor on that, you could get into a situation where that's a weak link in the chain; is that correct?

MR. DOUG JOHNSON: Definitely.

DR. KALYANASUNDARAM: You would see it more on the four processor.

AUDIENCE MEMBER: I understand that. But if your communication between boxes goes up, and I'm able to switch out cards and take an Athlon and increase my communication between the box, wouldn't I be smarter going with individual Athlons versus dual processor Pentiums because I could run into that weak link of the processor?

MR. DOUG JOHNSON: For the dual processor it's important to understand your application and what's its memory bandwidth requirements are. With the hierarchal memory model that all these new processors have, having L1, L2, cache, a lot of those problems can be alleviated in the design of algorithms, specifically taking into account how cache is used on these Intel architectures. In the end, there's a lot of applications that are, no matter what you do, I mean, the fundamental algorithm is memory bandwidth intensive and will run poorly on this architecture.

DR. KALYANASUNDARAM: Most of the people who run codes use MPI within the box and even outside the box. They don't use some sort of SMP directives to run within a box and outside the box. With PMI you won't see a lot of cache trashing that you would see with, say, SMP directives. So the bus is not so often used that you know the snooping based cache coherency protocol has to be in action all the time.

You know, you are picking a cache line and there's a good chance that the other processor won't have the same cache line, not sharing the same piece of data, so you're not going to see that. The problem with single processor is the extra hardware you need to buy.

AUDIENCE MEMBER: What is the difference?

DR. KALYANASUNDARAM: The money is in the networking with these boxes.

AUDIENCE MEMBER: Okay.

MR. DOUG JOHNSON: If you have to buy double the network for same number of CPU's, it can add to the cost.

AUDIENCE MEMBER: If I wanted to get a 16-processor with the Athlon, I would I pay a certain price; if I get a 16-processor, which is only actually eight boxes doing the dual processor Pentiums, is it half? Can you price that out?

MR. DOUG JOHNSON: No, it's not. I don't think it's actually half. It's somewhere around 60% if we just do a roughest estimate on the numbers to get the same amount of processors. But that's, of course, a ballpark figure.

DR. KALYANASUNDARAM: And the AMT's are cheaper than SMP's.

MR. DOUG JOHNSON: They're about the same price for equivalent clock for speeds today. If I was building a small cluster around 16 processors, there's no reason -- well, for most problems, I'd go ahead and buy the single processor boxes. But for larger clusters, that's when you're probably more interested in getting the higher density of processors per node.

AUDIENCE MEMBER: Where would you say that goes? Sixteen you're saying you would go with --

MR. DOUG JOHNSON: 16. I would go; 32 I'd have to decide. It's kind of up in the air at that point.

DR. KALYANASUNDARAM: You are going from space considerations too.

MR. DOUG JOHNSON: Space considerations increase, cooling considerations increase. Because even though you only have one processor per box, you still have a 300-watt power supply. And so when you start going up to 32 PC's, that starts to cramp a lot of people's space. That's why we see a lot of vendors selling these 1 U height, one and a half inch height dual processor systems, because you can fit so many in one rack and it's easy to find space for one rack. But if you had to line this whole wall with shelving to store your cluster, that's when it's becomes difficult to find your space.

So all of these computers use SDram. There's a couple motherboards out on the market that they try to get you to use. This stuff called RDram, Rambus, Dram. It's a variation on Synchronous Dram. In most cases we're going to be using 100 megahertz SDram. And we could go into this topic for hours on what different levels of memory we want to buy, how much on a chip, how we want to put it into the computer. But we're going to kind of gloss over it and just look at the implications of SDram.

DR. KALYANASUNDARAM: Actually, the Intel web page has a lot of benchmarks of the RDram and the 133 megahertz SDram, and they were totally shocked by the results they got for RDram.

MR. DOUG JOHNSON: Yes. Some of the newer more expensive RDram has not performed as well as has people expected it to. I've been told why it doesn't perform as well. Because on paper it sounds like a great idea. Unfortunately, I'm not well-versed enough in the hardware to explain why RDram has been kind lackluster performance-wise.

So if you were putting together a cluster for your lab, how much memory you put in an individual node is going to be dependant upon your application. 128 megabytes, maybe even 64 megabytes could be more than enough depending on what your individual research is. In some cases you need one gigabyte of memory per processor.

We try to follow the industry standard for balanced systems, which says you should have one megabyte of memory for every megaflop that your processors can deliver. So in our large cluster we have 550 megahertz Pentium III's that can theoretically do 550 megaflops, so we have 512 megabytes of RAM per processor. But we're designing a general purpose computer for the widest range of people. So characterization of what your problem really needs is very important.

Beyond the processor, one of the most important pieces is the interconnect. Now, this is not the interconnect that you use just for NFS traffic or for R shelling to the machine. It's what the Myrinet traffic runs over, or, I'm sorry, the MPI traffic.

So if you're programming parallel computers, you're most likely going to be doing explicit message passing, and you're either going to be using PVM or MPI or high performance Fortran. And we'll come back to those. But all of those protocols, when they communicate to another computer, are going to be using the interconnect between the nodes. So there is a lot of choices today as to what you can use.

The most popular interconnects are going to be Myrinet, Fast Ethernet, the 100 megabit per second interconnect, and then Gigabit Ethernet, which is becoming more popular and widely supported by vendors now that there a copper standard and more people are looking to Gigabit Ethernet as an interconnect.

Some of the less commonly used interconnects are Giganet. This is a Gigabit class network card that implements the VIA protocol, the Virtual Interface Architecture. This is an industry standard message passing primitives language developed by Intel, Compaq, Microsoft. It's a consortium, and it's more commonly used in commercial applications like databases, but some people are using VIA as the communication primitive on which to build MPI or some other higher level message passing language on top of.

Dolphin XSCI. SCI stands for Scalable Coherent Interface. The Systran SCRAMM Net, which is an interesting card, it's basically a small amount of shared memory between all the nodes. Very low latency, but not very high bandwidth. HIPPI, which is an 800-megabit or 6400-megabit per second connection. Very expensive. GSN, which is the 6400-megabit per second variation of that, and ATM.

So we'll look at Myrinet first. This is our interconnect of choice. We think it's a great product. It's very well-designed. Myrinet is manufactured by a company called Myricom, they're in Pasadena. They're a start up out of Caltech. Chuck Sietz is the President of the company, and he's one of the people that came up with the architecture that is used in the Intel paragon class of supercomputers. Now. Those aren't made anymore, but they're still selling cards and switches that are based loosely on that design.

It's a full-duplex communication channel with 1.28 gigabytes per second each way. You can now buy a card that is two gigabytes per second each way. That's 250 megabytes per second. We'll talk about what the consequences of that number are, because, if you remember, I was saying that on a typical Intel box you only have 3000 megabytes per second to main memory, so how are you possibly going to fill an interconnect with traffic at 250 megabytes per second if you can only, from your processor, talk to memory at 300 megabytes per second?

The answer is that it uses user level OS bypass message passing primitives to do the communication. There's no use of a TCP/IP stack of some other higher overhead protocol for the cards to communicate to each other. They called this layered GM. Glenn's messages. It's extremely highly optimized to minimize data movement and to minimize host CPU utilization. So Those are two things they are trying to do.

They don't want the processor that's doing your computation to be affected by communications. Plus, they don't want to do unnecessary copies of your memory addresses just to communicate what you're communicating to the other nodes. Using that low level, you can build several different layers on top of that. And one of those, and the important one in our opinion, is MPI.

That's the only kind of traffic that runs over our Myrinet cards. We don't use any others, so it's exclusively for that use. Some people have also implemented this virtual interface architecture over Myrinet. They're an ANSI standard. You can go out today, build a card to the specification and sell it as a Myrinet competitor.

This is a general block diagram of the connection that it has. So the important part here is that with this DMA controller, you're able to pin regions of memory. Now, when I say that, that means that this DMA controller knows how to access variables. So if I have a variable named X, this DMA controller can access that address, the real physical address, directly.

It doesn't have to let the OS go through the page tables and tell it where that memory is located at really. It already has that memory region pin. So, when I say copy X to node whatever, this DMA controller can go out and get that memory on its own. It doesn't have to have the extra access through the TLB to get this information back to it.

So that's how they're getting the latencies down to single microseconds. The typical zero-length message half round trip latency for this technology is nine microseconds. Now, for a card that you can buy for a $1,000, that's an incredible feat.

A lot of the logic is run through the Myrinet control program. It uses an onboard risk processor running at twice the clock speed of the PCI bus, and the fast local memory is twice the clock speed of the PCI bus. That's used for buffering purposes. If it doesn't need the buffer, it will not use that memory. That's again minimizing the amount of copying of data around, so to minimize memory bus transactions. And it's got the bidirectional SAN link at 1.28 gigabytes per second. It's a very compelling architecture, and they have a very good road map of where they're taking this.

AUDIENCE MEMBER: Did you say they're a thousand bucks for one card?

MR. DOUG JOHNSON: Yeah. Just to give you an idea how much this would cost with the switch too, the card is more extensive, but one port is $250. Now, you can't by just one port. You have to buy at least eight, but you can go up to 128 ports on one switch. So I don't have anything in here about the switch, but let me talk about the switch now because this is a good time to talk about it.

The typical building block for the smaller clusters is going to be their 16-port switch. This is built on one VLSI switch that they use called a 16X bar. So they have a 16-port crossbar switch that they use for that, for the 16-port switch. So that 16-port crossbar is non-blocking and adds a maximal latency of 100 nanoseconds. So the worst your switch will ever add, if you have a 16-port switch, is 100 nanoseconds to the latency of your communication.

So their goal with the network is to make it not interesting to the computer scientist. They don't want the computer scientist studying it, because normally that means they're looking at some bottleneck that the network has created.

They also have a 64-port and that will scale up to 128 ports on one switch. Those are available this fall. The-64 port is available today. Those are built out of 16 of those 16 cross port crossbar chips. So it's called a Clause 64. Clause is a term for putting crossbars together.

So they've got a 4 by 4 network or 16-port crossbar switches. So what that gives you is you've got full bisection of those 64 ports, and plus it's non-blocking and reconfigurable because you could end up with one crossbar completely saturated and the routes are reconfigurable through the switch.

When you put that many crossbars together, you're going to add latency to the -- that the switch adds to the communications, but it's only 500 nanoseconds worst case. And if you're on the same crossbar, the port you're communicating through is on the same crossbar as your designation, it's 100 nanoseconds just like the smaller switches. So it's really an amazing performer.

AUDIENCE MEMBER: I'm trying to get something straight in my mind. When we talked about the number if I had the 16-node cluster that I want to build and you said, well, if I had just 16, I'd probably go with single processors. Are you talking of using this kind of a network card --

MR. DOUG JOHNSON: Yeah.

AUDIENCE MEMBER: -- and saying I would spend a $1,000 for each one of those for a card in it versus going with a dual processor and sharing a card?

MR. DOUG JOHNSON: Yeah.

AUDIENCE MEMBER: Man, going with dual processors I'm saving 8K right out of box.

MR. DOUG JOHNSON: Ten when you factor in the switch. So you're going to spend $250 a port. The 500 number, they've lowered their prices. The most you'll spend for a single port is $250.

AUDIENCE MEMBER: I had a number in my mind for like, you know, 15K total I could get a 16-node system, and if I'm looking at $8,000 here just on these cards, obviously I'm off in my pricing.

MR. DOUG JOHNSON: It's not cheap, but it has to be a decision that is driven by the application. So if you need single microsecond latencies with throughput of 80 megabytes per second for these older version cards, it would be a trade off between that or 100 megabit.

If you're going to put together the same system with Gigabit Ethernet, you might be able to put something together for a little bit less for a small count of ports on a switch. But as you get into the higher-port density switches for Gigabit Ethernet, you're typically looking at a $1,000 a port, and it reverses the cost.

AUDIENCE MEMBER: Do you have any experience on applications, Doug, as to what kind of applications would something this high, what kind? Can you give me an example?

MR. DOUG JOHNSON: I wish I had some newer slides on this, but what I eventually want to include in this tutorial is -- I can't remember the researcher's name, but to talk about the cost of parallel computation. So we'll come back to that question because we're going to talk about the different classes of parallel programs and I'll talk about it a little bit more. I don't have any really good slides or analytical discussions of parallel communication and computation. That's something I haven't worked into this presentation yet.

DR. KALYANASUNDARAM: I've seen running through dynamic codes of Pentium clusters, I've seen running two processes on a single box or running one process on a single box. So if you're running an eight-way job, or running one process per box or on eight boxes or running two processes on four boxes, the speed up is the same.

AUDIENCE MEMBER: What kind of communication was it using?

DR. KALYANASUNDARAM: Myrinet.

AUDIENCE MEMBER: Could you do that same calculation using something like 100 based-T?

DR. KALYANASUNDARAM: Yes. We did that.

AUDIENCE MEMBER: Does it show a tremendous decrease in its performance clusters?

DR. KALYANASUNDARAM: Not the performance between the four box and the eight box, but between the Myrinet and the networking is what made the difference. I'm doing a comparison of running an eight-way job on eight nodes of a Myrinet cluster, or on four nodes of a Myrinet cluster, and then comparing it to the same thing with Fast Ethernet.

The ratio was the same between the four-way and eight-way, which means the performance was the same. So the difference in performance had to do with the difference in the networking as opposed to the difference in the --

AUDIENCE MEMBER: How much was that difference in the performance by the networking is what I'm interested in? Was it significant?

DR. KALYANASUNDARAM: I have the resource. I can mail it to you. It is significant.

MR. DOUG JOHNSON: It's going to be very dependant upon the application. For CFD applications, a lot of those algorithms are latency sensitive, and so the lower latency that they can communicate to the other processes, the better they will perform. We have examples of codes that only do 200 floating point operations between communications, and that's not that much work between communicating. And if that was a traditional 100-megabit network, the costs of communicating that often would be very high. You would drive your host CPU utilization way up, and the latency that's added every time you communicate would really make the performance of an application go down.

Now, we have other applications that we run on the Myrinet network with MPI jobs that they compute for minutes before they communicate, and they wouldn't be heard at all if they were running over a 100 megabit because the relative cost added is so small when you look at the total run.

AUDIENCE MEMBER: Is there any profiling stuff out there that would be allow us to be able to --

MR. DOUG JOHNSON: Yes.

AUDIENCE MEMBER: You're going to talk about that later?

MR. DOUG JOHNSON: Yes.

AUDIENCE MEMBER: Okay.

MR. DOUG JOHNSON: The other nice part about Myrinet and the network that you can build out of these cards is they're scalable to thousands of nodes. You will add latency to the network in the worst case scenario, but you can preserve the bisectional bandwidth. Meaning if I went up to 1,024 ports on a Myrinet network, and this has been done, they preserve full bisectional bandwidth at that network.

For latency sensitive applications they might start to see something, but if you have a lot of communication going on and you're preserving that full bisectional bandwidth, the application is going to perform well. Even in some of the newer supercomputers, they can't say that they do this. And for some applications it's going to be a real design flaw. So that's the really nice thing about the Myrinet network is it's fast, it's relatively inexpensive, and it's scalable.

So now we'll go to kind of the other end; cheap and slow. Fast Ethernet is ubiquitous. It's everywhere. It's cheap. You can get a card for $15 that performs great. You can get a network switch that performs well enough for under $100 a port. And it's very well-understood. Meaning, that the drivers for it are very well-developed and they'll just work right. And you also have just a plethora of vendors to choose from, so you're never going to be locked into being a Myrinet junky and two years down the road they jack the price up on you because you're addicted.

Of course, going with Fast Ethernet, a lot of people are comfortable with going with Gigabit Ethernet. It's ten times as fast. One gigabit per second, of course. It's now available in copper interconnects using Cat5E. Slightly higher twists per inch. The problem is the switches are relatively expensive, especially when you get to the larger number of ports.

This fall you'll be able to buy up to 128 ports on a single gigabit switch. The problem is, what do you do if you have to go beyond a 120 ports with a gigabit switch? It's not scalable in the same way that Myrinet is. You will not preserve the bisectional bandwidth of your Gigabit Ethernet network if you have to go to 100 of nodes. In fact, you may not preserve in one switch itself.

We've been told that in some switches with the larger port density if you go above a certain number of cards, you start to use a backing store mechanism; meaning, they buffer your communication until there's capacity available in the back plane to get the packets where they need to go. That's a real down side if you're looking for the highest performance interconnect network that you can possibly build.

A lot of people are holding out hope for Gigabit Ethernet, and that's because it has the potential of being just like 100 megabit. It was expensive to begin with, but now it's cheap and ubiquitous. So eventually everything will have a Gigabit Ethernet connection and you can buy it for $50 a card and less than $100 a port.

One of the other interconnects that I mentioned is Giganet C-lan. This is a proprietary interconnect. It's 1.25 gigabytes per second, nonblocking full duplex, so it's supposedly got performance characteristics very similar to Myrinet. It implements VIA in hardware, so the message-passing primitives are op codes that you can call on the processor on that board.

So it has very low latency. Very good communication characteristics. Up to 30 ports per switch. It's not clear how scalable though. I don't think they can connect their switches together. I've been told that you cannot scale past 30 nodes with this technology. I could be wrong. I don't have enough information on the this. But they're at giganet.com if you're more interested in that.

The other interconnect I have a little more information about and experience with is Dolphin XSCI. This is the Scalable Coherent Interface. It's a very good interconnect. It's about as expensive as the Myrinet cards are, or about as cheap depending on how you want to define it. It's also an ANSI standard. You can go out and build your clone Dolphin XSCI card. It has a 400-megabyte full duplex link, and it allows an application to have a single memory address range across the cluster.

You can do two different things with this. You could let your application share that single address range for shared memory communication in a program; or, you could use MPI built on top of the shared memory. And so the reason that you'd want to do that is that the obvious limitation here, if you have to share a single address range across a whole cluster on these 32-bit nodes, you can only address 32 bits of memory. So the maximum shared memory segment that you're going to create with this type of technology on a 32-bit machine is going to be 4 gigabytes. So if you have a lot of nodes, a lot of processors, that's not much memory per node.

The other way to do that is with explicit message passing. Instead of shared memory programming, you'd use a library like MPI. So this also gets around this problem here. If you use the shared memory programming model for this card, everything is a cache miss. So the performance of a shared memory application could be kind of low or it could be very good. It depends on whether it's tailored specifically to the peculiarities of this network.

With MPI, it uses the shared memory to communicate between nodes instead of allocating everything in that shared memory address range. That's the better way to use this interconnect as an explicit message-passing network.

It has a slightly lower latency than Myrinet. It's somewhere around 5 microseconds, and this number up here, the 400 megabytes per second, you're never going to get that on a PC anyway. The PCI bandwidth is 133 or 266. Well, there are PCI buses that are 64 bit and 66 megahertz now that could theoretically saturate this 400 megabyte per second link.

There is no switch for this. They used to design switches for this network. They're now a 2-D TORUS. Say you have four connections coming out of each card and you build a 2-D TORUS out of that, and they have switches built into the card so if traffic comes across a link that is not intended for that destination, that switch on the card just keeps routing it.

So if you lose a link in your TORUS though, those switches on a chip will notice that those other links are down and it will reroute your traffic. Now, of course, that's going to do the same kind of thing that you run into traditionally with the TORUS. If you have too many links down, you're going to add a lot of overhead to your network, or you can end up with an island, a node you can't get to because all the links around it are down in that TORUS. They have very good hardware design.

I won't talk about the Systran card. It's a very odd card, and I only know one research group using it. HIPPI is what we use for our NFS traffic to our home directories. It's very good for this because it's got MTU size, Maximal Transfer Unit size of 64K. So for large file transfers, a card is not constantly having to send interrupts to the processor to process packets. It has very large packets. So large file transfers run very well on it.

The cards are expensive and the switches are more expensive, and it only makes sense if you have that infrastructure in place already. So if you have that infrastructure it does work in Linux. I have a plot I'll show later about how it really performs on TCP and UDP traffic.

GSN is somehow even more expensive than HIPPI, and it's the follow-on to HIPPI architecture. It's 800 megabytes per second full duplex. Like I said here, the highest PSI bandwidth is 4200 megabits per second. This is a 6400-megabit per second connection. It just doesn't make sense for this class of machine today.

ATM a couple years ago looked like a good choice; it doesn't look like a good choice anymore. There are faster, cheaper cards out there. Plus, you typically have to use the TCP/IP stack as your communication protocol on ATM. Now, there was a project out of Cornell called U-net that did the OS bypass and programmed directly to the card to do communication, but it's not very well-supported.

If you need to patch into an ATM network it works. We don't have experience with it personally, but we have seen labs that have used ATM and Linux to patch in their legacy ATM networks.

I won't talk too much about disks. The one point that was brought up this morning was the question as to whether you want to buy SCSI or IDE. And I used to say SCSI is definitely what you would want to buy because the vendors when they manufacture the platters and the arms, those platters or arms don't know whether they're going to be SCSI or not.

And traditionally, the disk vendors have taken higher quality lots of those and put those in their SCSI devices. I don't know if that's still true. But it used to also be that the SCSI disks were higher performance, and that's definitely not true anymore. The ATA 100 IDE interface was mentioned this morning, and that's a 100-megabyte per second bus for IDE devices.

And the question was asked why would you need that much bandwidth per disk. Well, one disk can 35 megabytes per second out of these new designs, so we're getting to the point that the performance of the disks are on a good curve. Seemed like we were stuck at 5 megabytes per second for a long time for a single disk, but these new disks are just astounding. Single millisecond times for incredibly large file systems for $150. Just great.

So I have a hard time saying go with SCSI because the IDE ATA 100 disks are so good. So that's definitely one area where you can save money. Of course, this is now dated, expect for the price. SCSI disks are still more expensive.

Somebody asked earlier what the concurrency characteristics of IDE are, and I don't understand that at all. I don't know what happens when there are multiple processes at the bus level, what's happening when there's contention. But IDE should definitely be considered.

We've been looking at serverless system area network file systems like GFS. It's attached through fibre channel storage. This is still very much a research project and unfortunately is not ready for use in even the most experimental clusters. Unless you're doing research on distributed file systems and serverless file systems, I would not recommend it today. Look at it again in maybe a year.

I'm going change gears a little bit and quickly go through these slides because, I want to get a plug in for not only what we're doing with the cluster but a plug in for the center in general, because I think most people in this room are from schools here in Ohio. So I'm really going to go through these quick cause there's a lot of duplication of information.

We do cluster computing because everybody is doing it. We're not leading; we're following the pack. Professors, when they start at a university and get their start up funds don't go out and buy a $15,000 Linux workstation, they go out and buy some cheap PC's and build a Linux cluster or just use Linux on their PC in a lab.

So there's a lot of advantages that it has. You can do high-performance computing with parallel algorithms or high-throughput computing. Run parameter space investigations really quickly if you have enough nodes. Or you can do some kind of fail over computing if you're doing some kind of data base.

AUDIENCE MEMBER: I work for the government and the government has warehouses filled with what is considered obsolete computers. I've seen them. They rise to the ceiling, pallets of them shrinkwrapped just gathering dust. How successful would it be to tap into some of these old computers?

MR. DOUG JOHNSON: It all depends on your problem and how many resource you have access to. If you have no access to high-performance computing, you're going to have an infinite amount more computing power than you do with old computers.

There's one project called the Stone Lab Computer or the Stone Super Computer Cluster, and they took a bunch of old 386's and 486's and they ran their science. They did their research on that system, and they even submitted a paper to SC '98 for the Gordon Bell prize for infinite price to performance ratio. The committee didn't buy it.

It all just depends on what you have access to because it's still worth it if that's the best that you can put together. I mean, we're spoiled by these new computers, but we forget how happy we were when we got these 32 bit X86 processors. So there's still a lot that can be done with old computers. You just have to remember how slow they are.

DR. OSCAR GARCIA: Maybe we can start a discussion here. My concept is a little different from just what the job is. It's also in how much of a hurry are you to complete your job. I think there is a new -- just like there was in electrical engineering (unintelligible) for amplifiers, I think there is a dollar problem for high-performance and high-throughput computing.

In other words, if you have time to burn, you can actually do it with few dollars. If you have dollars that you can burn and you're in a hurry, you better spend it in the machine, which is pretty much what I think we got from Kumaran last night. The reliability and being able to be sure that you're going to get your program running, and if it doesn't, you have service at your fingertips then that's really very important.

DR. KALYANASUNDARAM: Much of the cost of the high-end systems that we make, the proprietary systems, is in the fast networking and also in the cache. You have 4 megabyte to 8 megabyte L2 caches. So it depends on your application. If it's a trivially loosely double parallel problem that is based upon integers you can go ahead and run it on clusters, so the only problem is how do you manage them and make them run, if you want to write MPI and things like that.

If you have a floating point arithmetic intensive code that you don't want to spend time making the MPI based application and you want to use SMP directives to parallelize it and things like that because it's much easier to do and you want a reliable machine that you don't want to spend a lot of money managing it, then an high-end system, if you have the money, it's better to buy a high-end system.

MR. DOUG JOHNSON: We don't try to do single-application, single-threaded performance comparisons between the clusters and RT90 because it will lose hideously. The T90 is so fast for a single problem. But that's also the goal of parallel computing. Because even though a single process is going to be so much slower than the T90, if you have a parallel algorithm -- and this might be of interest to the structures people in the audience -- running a specific model using LS-DYNA, which is a structure's code, we can duplicate the performance of T90 on the same problem with eight processors of our cluster, and those eight processors were a heck of a lot cheaper than that T90 was. So that's why this is such a compelling project. But it comes down to the problem.

So we have a brief history of cluster computing at OSC. We've been doing it for a little bit with some slightly different designs than what the Beowulf people were originally trying. We tried two variations, and this was all before I came to OSC on this, and with varying level of success or adoption by our users. You'll notice we don't have large clusters of Dec workstations with ATM or FIDDI today, but we do have clusters of Intel boxes, and those are the same ones we listed earlier.

This is our mission statement and why we're doing this. This is kind of a restatement of what I was saying earlier about how this brings the resources out to the individual researcher. And I'll talk more about how we really think the resources are going to go out to the individual researcher in just a couple slides here.

Computers are getting faster compared to even the high-end. There's lots of downside to clusters. It's a lot more work to manage, and it requires a different funding model. And so what we've arrived at OSC is the concept of a cascade. We won't be able to buy a cluster that we can keep on the floor of our computer room for four years like we have with some of our traditional supercomputer systems that we've purchased. They will just not be useful to the researcher four years after we bought them.

We figure that the lifetime of a cluster will be 18 to 24 months. 24 months at the very most. So how do we turn over this hardware now often. We came up with this plan of cascading the hardware out to individual researchers. So in the first quarter of next year on OSC's web site look for the Hardware Seed Grant Program. We will be carving up our current cluster and giving it as grants to individual research labs.

Now, I emphasize the individual research lab. This is not for a department to come ask for a machine. This is for an individual researcher or group of researchers in a lab to request this hardware.

So that doesn't really help us with the funding part. You know, that doesn't save us any money by given away these computers. But at the department level, if you could implement a model like this where you force everybody in your department or everybody in your college to buy computers from you and you buy computers for your cluster, keep them for twelve months and then ship them out to the individual labs or the computer labs after that, you can cascade through those machines. You can keep a cluster and they get the slightly older machine, but they're still probably fast enough or capable enough. That's one way to deal with the funding issue that this brings up. It's a way to get more computing resources out to your department and to the individual researcher.

DR. KALYANASUNDARAM: You're giving away the networking too?

MR. DOUG JOHNSON: The Myrinet networking cards will go with it. We can't give away the 164-port switch. we don't know what we're going to do with that. There will be some money you will have to spend. A few thousand dollars for another Myrinet switch, but it's definitely a great deal.

AUDIENCE MEMBER: What's the condition on this? This is for people who are from Ohio?

MR. DOUG JOHNSON: Yeah. Ohio colleges, public and private.

AUDIENCE MEMBER: When do you think the date will be?

AUDIENCE MEMBER: What time?

MR. DOUG JOHNSON: First quarter, second quarter of next year are the dates that I've been told. And of course, we're open to new grant applications as we speak, so if you want to use this resource today, develop your codes, test your codes on it today and that will give it a much better story in the grant proposal anyway if you've used it before. So I'd definitely encourage that.

AUDIENCE MEMBER: There's nothing in there for us at all is there?

MR. DOUG JOHNSON: I'm afraid not. We are a state resource as it is, so we unfortunately can't help anybody else. Now, if you have some PI's that are in the state, maybe there is something that can be done then for getting an account. We're always looking for other people to collaborate with the PI's and the state.

We're collaborating with the community in evaluating and implementing new technologies, and we're also trying to put together a cohesive software story. I go everywhere with my laptop and I run Linux on my laptop, and I've got the compilers installed. I've got these different libraries I use. I've got a debugger and I can test code and do a lot of work on my laptop or my workstation at work. And the software environment is similar to the cluster.

So we took that idea a step further. We're giving away the software that we purchase for the cluster, so we have a state-wide license program. For researchers who have accounts at OSC, they are able to request licenses of the Portland Group Compilers, the NAG Scientific Libraries, the Etnus TotalView Parallel Debugger, and ADS, which is a scientific visualization application tool kit.

I'll have a link up for that in a later slide. But this is an excellent program that we want people to take advantage of. Because the goal is, it's easier for you to sit there and do the development on your laptop. But if you did that before, you'd have to do the development on your laptop and port it over to a T90 or an Origin 2000, and the development environment is a little different, the precision in the machine is a different.

It's a 64-bit machine compared to 32-bit. So you're going to have every headache that goes along with porting your code. It's hard enough to write it already, let alone maintain it on several platforms with different OS's and different architectures, different compilers, everything. We feel this is a compelling story to tell. Again, we're just following what researchers are doing. Most people are running Linux, so we're following what people are doing in the state.

We have staff that provides support for using this resource. The Science and technology support staff at oschelp@osc.edu. Easy enough. Everybody should know that if they have accounts at OSC. And they will help you with your day-to-day questions to your not so day-to-day questions. They are scientists and engineers and they are there to help you.

So this is all to encourage parallel programming. Vector processors are in most cases very expensive, and in lot of cases we can get more computation done if we are able to take advantage of parallel resources. We also want to see the clusters rolled out into individual research labs. It's already happening without us; we want to help it. We want to do the evaluation of the software and hardware so that we can make some mistakes that you can avoid.

This is just, you know, kind of some nice pictures. This is our first cluster. The same information as earlier, but that's where we do all our software testing now, and it has Myrinet, Gigabit Ethernet, SCI, and Fibre Channel, and we test everything there and break it a lot.

This is our production cluster. Production as of the end of this month. It's a free resource as it stands. Your grants do not get charged against the usage on this, but that changes in a week so run your codes now. That's just a description of how it came into being.

This is a collaborative project between SGI and OSC. SGI has a lot of people on staff that know a lot about high-performance computing, and they're doing a lot of work on Linux in the operating system realm and also the compiler realm for the new Itanium IA-64, the Intel 64-bit processor. They're going to be a great resource for the researcher for getting software to facilitate their research.

We set it up in Mountain View, California and showed it on the floor at SC in '99. We ran a user code on it. We ran seven different scientific applications developed by researchers in Ohio on the floor of SC after only having about two weeks to port the code over to this new platform.

This is just a rough schematic of how it's connected together for mass storage. All of the compute nodes have access to the file system on the Origin 2000, that is, the user's home directory. Plus, there's the 800-megabit per second connection to the front-end for interactive use of your file system, home file system. That is connected to a 30-terabyte tape storage library that has been upgraded to how much, 120?

AUDIENCE MEMBER: I guess so.

MR. DOUG JOHNSON: That's tertiary storage so large files and frequently used files are migrated out to tape. Their I nodes are left, their blocks are deleted so we can have infinite disk space supposedly.

User accounts are through the OSC user services data base on our cluster. We have job accounting, so we can charge against your research grant, not moneywise, but for your resource units that we allocate.

The utilization. This is a little bit higher than what it's been over the summer. We're averaging about 30% over the summer. 70% is about what you would expect for a parallel computer, a supercomputer. We have a mix of 70% of the cycles going to parallel jobs and 30% to serial jobs.

We, of course, want to emphasize parallel programming on this cluster because that's what makes the most sense on it. You'll get the best performance and the highest throughput, most likely, for most applications using a parallel programming paradigm.

That does not mean that we don't want serial throughput users as well. We just worked with a researcher at OSU that did some human genome mapping, and we got very good performance on that and very good turn-around time. It was it was a throughput-type job. We try to keep the system as available as possible. We try to minimize the amount of scheduled downtime and, of course, we do our best to minimize the amount of unscheduled downtime. And so I think that's our last slide here.

I'm going to talk about these real quick, because we were talking about this earlier. I put these in here just to give you an idea of what TCP and UDP string performance is for theses gigabit class interconnects. The maximum is really 300 megabits per second. We see this on most interconnects.

This is card manufacturing by a Althion. There's another we tested from Sysconnect that performed a little better than that. But the maximum number we've seen for TCP performance is 300 megabits per second, so we think that's the stack. The TCP stack is maxing out at that. This is a slightly more detailed plot of that performance showing the difference in the characteristics of the smaller block size transfers on the HIPPI compared to the Gigabit Ethernet network.

UDP performance. The one we really used the HIPPI network for, because that's the kind of traffic that runs over through NFS. It's really good. 634 megabits per second and 425. The difference between these two numbers is with the socket buffer size that's allocated for those sockets that it opens. The reason these are higher numbers is because they are not AK limited. They don't have to acknowledge. You just dump a bunch UDP packets out to the network, and if they get there great; if they don't, that's the protocol. So these are just some numbers to give you a general idea of what these gigabit class networks can do. A bit of a tangent from what we are talking about here.

Does anybody have any questions?

AUDIENCE MEMBER: (Inaudible.) Myrinet. The other cards that you actually have in the machines are above those numbers?

MR. DOUG JOHNSON: Yeah. I don't think I have a plot with me, but the throughput that you would see is 60 megabytes per second for the larger block transfer sizes. That's using MPI. So the latency is a lot lower, and plus the affect it has on the host CPU utilization is going to be a lot lower. Unfortunately, I don't think I have any plots on any of these for the Myrinet network.

Now, if you have a good PCI bus implementation with the newer Myrinet cards -- we're kind of hamstrung by an underperforming PCI bus that Intel manufactured, and the most for DMA transfers across the bus that we can even get is 70-something megabytes per second. So we're getting most of that, but we're never going to get more than that. Even though we have cards that can supposedly saturate the bus, which is 133 megabyte per second bus. But I have seen DMA transfer numbers on some supermicro dual processor boards that are around 500 megabytes per second. So you would really need a nice bus like that for these 250-megabyte per second channel cards that Myrinet is now selling.

With the GM/MPI layer, you're going to get some good fraction of that 250 megabytes per second. Not like with the TCP performance where we're only getting a third or so of the theoretical maximum from that network. You really will see a good percentage of the maximum.

Maybe I should bring up a slide on where the information is for the statewide license and also for grant applications, just in case we have people here who don't have accounts on our machines. Let me bring up the slide that has that information for just a few minutes. So when I was talking about a grant from OSC, this is not money. We don't handout money, but we hand out resource units for use of our supercomputers. And being a user at OSC, you will also be able to be eligible for the statewide license program of the software.

Our technical information server, which has all the documentation about using all of our resources, is the oscinfo.osc.edu link. That also has information about how to apply for grants. It also has information about our statewide license for software and how to get the licenses. I have a couple extra links here, but these are the two I wanted to show right now about how to apply for a grant or how to get the statewide license software.

I can't over emphasize how important it is to understand the characteristics of your problem. I'll come back to how do to that a little bit later when we talk about profiling strategies, code profiling and code characterization strategies.

But we can fall into several different categories when you look at your parallel algorithms or just type of problems in general. You could have your embarrassingly parallel; it's going to maybe communicate at start up, a little bit at intermediate stages in the program and maybe collect up some results at the end.

You're going to have random walk type problems that do this. It really doesn't rely on the network. You're not going to see that much of a performance degradation if you have 100 megabit or Myrinet. So if you can avoid buying Myrinet for your particular problem, you definitely want to know that you can avoid it.

Random walk problems like Monte Carlo simulations. Some particle physics problems and cryptography. Cryptography is more of an example of an embarrassingly serial program. This block-level parallel is more inalienable to domain decomposition to where you compute the solution over several nodes and you would use explicit message-passing, such as MPI or PVM that you looked at this morning.

One example of this is all the routines included in the Scalapack parallel linear algebra library. I want to kind of emphasize the importance of these libraries again because they are great tools. There are a lot of people that reinvent the wheel everyday, and with these routines it's possible to do a lot of linear algebra, primitives and not so primitive things, singular value decompositions, your Eigan values and Eigan vectors. Everything that would be very difficult for you to hand code in a parallel routine has already been done for you by somebody else, so it's important to know what tools are out there already that are available through a simple library call. It also makes your code easier to read and to maintain when you minimize the complexity by using libraries.

Loop-level parallel is where you decompose a do loop normally, and it's usually using shared memory for the communication. So if you have a matrix multiply, it's very easy to break that up across several processors using shared memory with OpenMP. Those are compiler directives. You explicitly put what looks like comments in your code, but the compiler knows to look at them as directives of how to split that code up using threads.

And multilevel is kind of a combination of those two. You do some kind of domain decomposition of your problem with explicit message passing then you have local parallelism using shared memory and OpenMP.

Now, we mentioned this. When you think of that, it's a lot more expensive to develop. Well, it's a lot more confusing and a lot more difficult to come up with an algorithm and write code for something like this. But a lot of new computers are going to require this to take advantage of the hardware that is being manufactured.

The IBM SP is an example of this. They greatly emphasize size shared memory for local communication with OpenMP, local communication and then just a little bit of MPI for off-node communication. So it's mentioned here because it's going to be important for these newer models of supercomputers.

Then we have the serially program or embarrassingly serial. It's where you can run it, a lot of different copies, do the parameter space investigation, and clusters are important for this. If you have a program with a large number of degrees of freedom that have to be investigated, you don't necessarily have to go out and come up with a parallel algorithm for these clusters to be useful to you. Also, with this kind of an application you would not need the high-speed interconnect because you're not going to be using it so much.

AUDIENCE MEMBER: What is the difference between that and embarrassingly parallel?

MR. DOUG JOHNSON: Because it actually does do some communication.

AUDIENCE MEMBER: This one does?

MR. DOUG JOHNSON: No. The serially parallel or embarrassingly serial or however you want to call it. It doesn't do any kind of communication at all.

AUDIENCE MEMBER: The serially parallel, that sounds like the throughput computing use of clusters.

MR. DOUG JOHNSON: Yes, it is. And it's really not that much different between embarrassingly parallel. I would even lump in some parallel applications to embarrassingly parallel. I've seen some parallel programs that will compute locally for minutes before communicating remotely. They're still parallel programs. They still require MPI. But they are not that much different from serial when it comes down to it.

This is just a little thing to keep in mind when you're dealing with the X86 computers, and the poor performance of the memory bus. This is just a little exercise in what kind of performance you can expect and how to think about your memory use.

The memory bus cycle is 10 nanoseconds; they run at 100 megahertz, and you have a 64-bit data path. So the maximum you are going to get out of that is 800 megabytes per second. But the memory latency for the row access strobes is really 60 nanoseconds, so you're going to have slightly lower performance because you are using SDram, and you've got this, you know, incredibly convoluted way of making memory work.

So you are going to see a demonstrated bandwidth of somewhere around 300 megabytes per second or 37% of the maximum. And so you just have to keep in mind that you are going the have all these clock cycles waiting for the memory to be refreshed. You're going the be waiting on the row access strobe to go through and refresh this SDram.

We use this benchmark to measure the memory bandwidth. It's a very important tool that's incredibly simple. It's a very long vector length with a scalar multiply of a vector with an add. It's so simple, but it has held lots of computer manufactures to a higher standard than before it was conceived. It's a very important tool because in most cases the limiting factor in application performance is going to be memory bandwidth.

Seymour Cray knew that a long time ago, and a lot of computer manufacturers have known that for a long time. That's why we spend so much money on computers like the T90. They have incredible memory systems. And even though these processors are getting faster today, the memory is not getting that much faster.

I'm going to take a minute to kind of skip some stuff we don't want to go into here. In fact, I'm going to switch tutorials here. We're going to skip down to the user environment. What happens when we look into the cluster at OSC? There are several different ways to access our cluster through Telnet, R login, or SSH.

SSH is the preferred method because when we use computers out in the lab or go to a location of a computer we don't control ourselves or don't trust, we don't want to sit there and send our clear-text password to our other computer because somebody could be sniffing it. It's a better idea to use SSH because it encrypts your passwords on all traffic. So if you can install SSH clients on your computers, it's a better and safer way to access our computers.

It also has another benefit in that it does automatic export foreigning. So these steps here that you would have to normally follow to be able to run X clients on your local PC, you don't have to do for SSH. It's already done. It uses the X off program and sets up a secure way for those X clients to be run back on your work station. If you can only use Telnet, these two steps are the method that you'll have to follow to use X clients on your workstation.

We support most main stream shells. Forcing shells on users is like forcing them to use VI or EMAX. It's a personal choice and we're not going to be the ones to decide it. The default shell is KSH though, so you would have to call into our consultation line to get your shell changed. That information is on OSC info on how to do that.

One of the most frustrating things when logging into somebody else's computer that you don't manage software on is logging in and having a braindead environment, where you don't have the right path or right environment variables, you don't have the shell you want. So we've tried to avoid that. When you log into our cluster, you will supposedly have the right path and right environment variables to use the cluster from the get-go. You do not have to edit your dot profile files, your log-in files, if you don't want to.

We achieve this through using a knockoff of the Cray modules, which is a set of Perl scripts that we use to maintain a user entrainment. So if you at your command prompt type module list, you'll get a long list of the modules that you have loaded in your environment. So we see here that this user has the PBS environment, the Portland Group compiler environment already set up. Of course, the modules, knows where to find the modules program and the MPI with a channel GM interface all set up ready to go.

If they need other programs, they can type this module avail, and you'll get a list of all the different software packages that are available on the cluster. And if you want to use HDF, which are binary IO routines, you would just type module load HDF and you would get that exact environment that's necessary to use that software. So here's an example of a user loading a module called SCMS. You can also unload the module if you decide you don't need that in your environment or you want to try a different version of that software, because we do sometimes support multiple versions of the software.

The development environment that we have includes the Portland Group compiler suite. I mentioned this earlier, and it's included in our statewide software license. They were a vendor for traditional HPC systems. The wrote HPF compilers, High Performance Fortran. This is another way to do parallel programming. There are compiler directives, and it looks like Fortran 90. You just add in what look like comments to have directives to the compiler to tell it how to parallelize your code.

The Portland Group was contracted by DOE and Intel to provide compilers for the Intel ASCI Red computer. This is the 9,000 processor Pentium Pro supercomputer that's installed at Sandia. It's been number one on the top 500 list for as long as anyone can remember, and these are the compilers they use on it.

It's an optimizing compiler for the P6 core that includes the Pentium Pro Pentium II and Pentium III, and they have versions of this suite available for X86 Linux, Solaris, and NT Windows. So if you have NT on your desktop, you can also get the same compiler suite. It includes C compiler, C++, Fortran 70 and Fortran 90, and High Performance Fortran.

The Fortran compilers, C and C++ compilers all support OpenMP directives. So if your method of parallelization is compiler directives, you can use this compiler suite for that. They are link compatible with GCC, which is the free software -- compiler suite that usually comes with Linux and is available for a plethora of platforms and architectures. And the suite also includes a debugger and profiler. We'll talk a little bit more about the profiler in a few slides, but you can also use GDB as the debugger on these executables that are compiled with a debugging flag. So if GDB is something you're comfortable with, these compilers will not keep you from using that.

These are just some common options that one would use with these compilers. I know a lot of people are getting into the habit of using the // comment style in their C programs. It's a lot easier than closing your comment with the */. It's sometimes a little bit easier. It's not ANSI though. You have to tell the compiler to accept those.

You can tell it to look at compiler directives for OpenMP and SGI style. Pragmas, the do-across type pragmas for parallelization. It's also got what every compiler should have, things to tell you when you're not doing something quite by the ANSI standard, or you can also lower that compliance standard or program in KNRC if you still have the need to. We recommend using the Enforces Strict ANSI C compliance because it keeps you from introducing bugs a lot of the time.

It supports the traditional Unix command line flags that you expect from compilers, unless you're on AIX. Include files and library search paths are the same as what you would expect. And the C++ compiler options are very similar. We'll talk a little bit more about the recommended flags that I have added in here for the command lines and what these different optimizations do. We'll talk about those in few slides.

The Fortran 77/Fortran 90 compiler. These are two separate compilers. Fortran 77 is a subset of the Fortran 90 standard. Unfortunately, the way they arrived at their Fortran 90 compiler was rather convoluted. They had an HPF compiler that they turned into a Fortran 90 compiler, and then they wrote a Fortran 77 compiler. So if you have Fortran 77 code, it's recommended you use the Fortran 77 compiler from Portland.

If you have some Fortran 90 features in your Fortran 77 code like a lot of us have done, where we just add in the little bits we're interested in, you'll have to use the Fortran 90 compiler and hopefully not run into any problems. But we've had some less than optimal experiences compiling Fortran 77 code with the Fortran 90 compiler.

So optimization is exceptionally important for getting good performance out of your high-level language on these modern architectures. And we can get a lot of information back from the compiler as to what it is or what it is not doing when we compile a code. So if we look at these default recommended flags, we recommend that you specify the type of processor, TP, P6. So that tells it to generate code for P6 core. That means this code will not run on a Pentium IV. So if you want to take this executable and run it on an older Pentium or a 486, it will not run.

We pass a couple different flags to the vectorizer in the optimizer. We specify a cache size. We tell it to be a associative with its operations. This does not tell it that the cache is associative to however many sets. This is saying that the compiler can try something different to arrive at the same answer, mathematically speaking.

DR. KALYANASUNDARAM: Basically round-off.

MR. DOUG JOHNSON: It can cause round-off problems, because if you have a divide it might do something differently than what would normally be generated. So even though we put this in the recommended flags, I'd be very careful with that option. I would experiment with it.

AUDIENCE MEMBER: Are we getting a little off of the cluster idea?

MR. DOUG JOHNSON: We're kind of getting bogged down in the details of the cluster. Maybe we should skip forward.

AUDIENCE MEMBER: Maybe the people here are interested in this. I don't know. Fortran and clusters is not my idea. Sorry, but I'm just --

MR. DOUG JOHNSON: That's all I do, Oscar, is Fortran on clusters.

AUDIENCE MEMBER: There you go.

AUDIENCE MEMBER: That's all the guy beside me does.

AUDIENCE MEMBER: All right.

MR. DOUG JOHNSON: Let me kind of get through the compiler section a little quicker, and we won't talk so much about what it's doing when it optimizes and how to take advantage of these.

One of the nice things about the compilers is that all command line optimizations can be put in line in the code for specific blocks of the code. So if you have some loops or a segment of code that is particularly inalienable to a certain type of optimization but the rest of the file isn't, you can specify that it only be applied to that section of the code. That's kind of a nice feature.

We'll skip the PMI compiler wrappers. This is just how you would compile your MPI programs with the Portland Group compilers on our cluster. And we will also skip the libraries and switch to code profiling, which there were a lot of questions on.

There's a couple things that came up as far as questions with code profiling. And that's for individual program performance and then parallel program performance. And when I was speaking to someone earlier, they were interested in a tool, one program that can tell them this. And unfortunately, it's more of a method on a cluster, but it is possible to do what needs to be done to profile your code completely. So it just takes a little more work.

It falls into three very broad categories; timing, profiling and hardware performance counters. This goes from the easiest to the hardest as far as how much work it takes to extract meaningful information. Of course, the simplest is just to look at your watch and see how long it takes your code to execute on a cluster. You can put explicit calls in MPI programs to get the time that it takes a certain segment of your code to execute. You could either to that with the get time of day call, or a better way in an MPI program is with the MPIW time function.

So beyond just simple timing, you would also want to, in a nonintrusive way, get an idea of how much time your program is spending in a particular routine or a particular set of routines. And the way to do this is with a code profiler. And that's either Gprof or PGprof. To take advantage of this, you have to compile with a special flag. So it does had add some overhead to the execution of your code, but it will produce a file that will do an analysis. We'll show you an analysis of how your code ran.

Here is an example of a program being compiled with a specific dash PG flag to use the built-in GNU tools, profiling tool, Gprof. And this is the output that you get. You get seconds, number of calls to that subroutine, and the percentage of the time spent in that subroutine for the program.

That's going to be basically your first step in profiling your serial code. You can also do it with the built-in profiler, with the Portland Group compiler suite, which is called PGprof. It has slightly different command line options for the compile because you have more options as to how you want to profile your code. You can either do it by the function level, as we saw in the last example, or line by line.

Now, line by line should be taken with a grain of salt, because so much optimization is done by the compiler that it might not really correspond to what's going on in the execution of your code anymore. We don't really recommend anything other than function-level profiling, and it's just a slightly different tool, other than the Gprof program.

And it has a nice X Windows interface, so instead of that output in our terminal we get an X client that comes up. And you can sort by different variables, and it's just a slightly different tool that gives you the same information. If you don't have an X environment available, it will print out the command line interface to it. So there's a caveat when using the profilers on a cluster. Each process --

DR. KALYANASUNDARAM: So the profiling, is it profiling the time taken in the routine or is --

MR. DOUG JOHNSON: Yes.

AUDIENCE MEMBER: Repeat the question.

MR. DOUG JOHNSON: The question was whether it was profiling the time taken in every routine. The answer is yes. So if you have five different subroutines you've written in your program, it will tell you how much the main program and the five subroutines took.

The downside to it is that it doesn't work with parallel programs. We'd like to see it work with parallel programs, but it writes out one file. So here's an example. I run this code in my home directory. It writes out this one file called PGprof.out. What if I have 100 copies of this running as part of an MPI program and they have all the same directory mounted over NFS? Only one is going to write out the file correctly and maybe file locking will work with NFS properly. It is a stateless file system and you're relying on stat D and lock D, which is always a mistake. So you're going to end up with a corrupted profile file. So it doesn't work with parallel programs.

We are trying to figure out with Portland how to make this work. They can get it to work today with HPF programs but that's because they control the interface into how the parallel program is started. Hopefully sometime in the future we'll be able to do regular old-fashioned code profiling with parallel programs to get function-level performance out of our codes. Otherwise, you'll have to program it in and get timings out of the code explicitly. So it's a little more work.

But the thing that we can do with MPI codes on clusters, and this is just a great tool, is something called Jumpshot. And all this does is that when you compile your MPI program, you compile with an MPI library that have stubs. They have place holders for the regular MPI calls that keep statistics on how much each call is being used, how much is being transferred, et cetera and when. So you end up with an output file from this as well, but it does the right thing.

This doesn't give you information about individual functions or individual lines like our command line profiling tools do, but what it does is it creates this prog name dot C log file. And we run Jumpshot on it and we get an output. Well, after we add the command line options for the compile that are listed here. We can end up with a graphical user interface that's a representation of the communication in our program.

So in this particular example I've zoomed in on a very specific block of communication. You can see how short the time period is by the time line. That's the elapsed time. And they're color coded. So the dark blue or purple is MPI reduce; the green is MPI I receive; and the kind of turquoise is I send; and then we end up in waits.

And you can you see how much these states are. You can see little bits of white when it's waiting on a receive, so you'd have to zoom in on the specific blocks of code. But with this, it doesn't look too bad, you have this process sending things off to these four processes. I don't know what it's doing in this particular block of code here, but this is the way to for sure get the communication characteristics of your parallel program.

It's also a way to debug problems. So if you end up with one thread that is taking an inordinate amount of time to send information to another thread that has a blocking receive, that other thread is not going to be doing any work because it's waiting on the other thread to end. This is a tool to allow you to find imbalances in how you decomposed your problem across the different MPI threads.

AUDIENCE MEMBER: If we see a bunch of red dots all over the place, we know this is a heavy communication kind of software; and if we see lots of blue, that's a good thing typically. Is that what you're saying, or not necessarily?

MR. DOUG JOHNSON: Right through here, the CPU is actually doing work.

AUDIENCE MEMBER: Okay.

MR. DOUG JOHNSON: All of this time that's in the light blue is some kind of communication. And so the less you see of this, the finer-grained your problem is. The longer this part is, the worse or the more time you're spending waiting on communications.

I've seen programs that do communication very well, and I'm not going to say that this program is doing communication very poorly. But these times are minimized, and everybody arrives at the point where they are sending or receiving at the same time, and they're very well synchronized. In some cases people are coming up to a point where they're ready to send and the receiver is not quite ready because they have some other computation to finish before they're ready to do whatever kind of communication. This is just kind of a heuristic tool to look at the communication characteristics of your program.

AUDIENCE MEMBER: Can I use this tool to decide whether I need to go with the more expensive card or lesser expensive card; and if so, what kind of information will I be looking for in here?

MR. DOUG JOHNSON: Look for more longer blues.

AUDIENCE MEMBER: Longer blues means I can get by with a cheaper card?

MR. DOUG JOHNSON: Longer blues means you need the more expensive card.

AUDIENCE MEMBER: Longer blue means I need the more expensive card?

MR. DOUG JOHNSON: Yes.

AUDIENCE MEMBER: I see a lot of black.

MR. DOUG JOHNSON: Yeah.

AUDIENCE MEMBER: And I want a lot of black?

MR. DOUG JOHNSON: Now, here again, there's a lot of different factors, and there's no one thing to look for to see that you need a better performing network. Definitely, if you have a lot of communication compared to computation, you need a faster network. Or if you see you have one process sitting around waiting to send something and another process that's doing computation for a long time past that time point that is the intended receiver, you see there is some kind of imbalance and you need to go back and rethink the algorithm or how you're doing communication in the program itself.

AUDIENCE MEMBER: Would I benefit if I did the same job and sent it to four processors and then turned around and use this with eight processors and sixteen and looked at the stuff?

MR. DOUG JOHNSON: Yes. Because that will show you the kind of scalability of your algorithm.

AUDIENCE MEMBER: Again, I'm looking for keeping the blue lines to, you know, if they are getting real long or --

MR. DOUG JOHNSON: Or there's a lot more blue than there is the time spent actually processing.

DR. KALYANASUNDARAM: This shows the time that you first started sending the data and the time it finished sending the data?

MR. DOUG JOHNSON: Yeah. And the intended senders and receivers.

DR. KALYANASUNDARAM: But these are I sends and I receives, which are basically asynchronous sends and asynchronous receives. So while sending it, it may well be the CPU is doing some sort of computation.

MR. DOUG JOHNSON: Yeah. It definitely could be the case, or it could be that it's spending that time sending this amount of data. It could be sending a large amount of data, and it's sending it to several destinations, so it's doing an I send progressively with different destination tags.

DR. KALYANASUNDARAM: So the question then becomes should I send one long message or several short messages?

MR. DOUG JOHNSON: There's the penalty for startup time, latency, and then the penalty for communication length, so that definitely has to be balanced. So analyzing these things analytically can be very involved. It's not that complicated of an equation. There's seven or so different variables. But it drastically varies the outcome as you change the different parameters of that. And it comes down to task granularity, how much do you have to compute between communicating, the penalty for communicating, the length of the communication time, the number of processors. And it all just varies so widely as those parameters vary.

This is more of a heuristic tool, like I said. You look at the program to see if you have an imbalanced algorithm. You look at the program to see if you are spending a lot of time communicating compared to computing. You might expect that, or you might think your communications are going to finish quicker than what you are seeing when you run this program. So it is not a scientific process when you use this tool.

AUDIENCE MEMBER: You said you were zoomed into time. Actually, this whole picture, if I could back up and not zoom in, I would see black, blue, black and blue, and keep that checkerboard pattern as I'm going down. And, hopefully, from a nonzoomed in picture, I would look at it and say, Am I seeing more black than I am blue? Okay, hey, that sounds good for the cheaper.

MR. DOUG JOHNSON: Or the other thing you can see that is bad is these blue bands being very diagonal, because that means you have other people ready to send before other people are ready to receive.

AUDIENCE MEMBER: This uses Jumpshot on Java?

MR. DOUG JOHNSON: It's included with the MPICH distribution.

AUDIENCE MEMBER: So the compiler flag I have to add in is which one?

MR. DOUG JOHNSON: Right here. You have to add that MPI log and also the MPE library to link that in.

AUDIENCE MEMBER: So whether I use the Portland compiler or not, it doesn't matter? I could use those with --

MR. DOUG JOHNSON: Yes. With the GNU compiler or it's just a library inside of the MPI routines.

AUDIENCE MEMBER: How much of a penalty am I looking at in performance? Assuming there is a penalty.

MR. DOUG JOHNSON: It's not very high.

AUDIENCE MEMBER: Really?

MR. DOUG JOHNSON: When you make that call to MPI I send or I receive or MPI send, it's just collecting statistics when you do that, so if you have a system that has implemented get time of day really poorly, that might start to hurt your performance of your code, but it's minimal. I think in the documentation they might give some more pseudoanalytic numbers of the overhead that it places on the codes.

So the least abstract way to profile your code on a Linux cluster is to use the model specific registers of the CPU. These are hardware counters that most modern CPU's implement to give you access to different events and their relative amounts for your codes. They are noninvasive, and the overhead of using them is next to nothing.

In an environment where you're task switching a lot, it might add a little overhead because there's extra state to save in the OS when you're accessing these hardware counters. But it happens anytime when you've applied the patches to the kernel that these need.

The first of these tools is Lperfex, and this was developed at OSC. It relies on a patch to the Linux kernel that can be found at this web site. It's to allow user-level programs to access registers in the CPU that do these hardware event countings. It's noninvasive and you can get a large number of statistics back from your code when you run it; everything from floating point operations per second, or just to get back the cycles per instruction or the integer operations per second or your memory bandwidth usage in reads or writes. This is the tool that is able to tell you, averaged over your entire application, what the characteristics of application's execution are.

So for this particular example we look at floating point operations for this one program, and it tells us that we're getting 65 megaflops on the code. So maybe we're happy with that, maybe we're not. We would have to look at a wide variety of different events that are supported by the model specific registers. There is actually 68 total, and these are some of the more important ones in our opinion, and some of the ones that we understand. 68 different events in the architecture is a lot, and we're not Intel hardware experts. We don't know what all of those really mean and how they affect the running of codes and the efficiency of the code on the processors.

So these are the important ones. And from this we can find out how many floating point operations per second our code is getting, what the read bandwidth is to memory we are using is, the write bandwidth we are using, L1 cache use.

AUDIENCE MEMBER: (Inaudible.)

MR. DOUG JOHNSON: For the entire MPI program?

AUDIENCE MEMBER: (Inaudible.)

MR. DOUG JOHNSON: Per node.

AUDIENCE MEMBER: (Inaudible.)

MR. DOUG JOHNSON: Yes. But we haven't figured out yet how to do it. In this case, we're just running this on a specific application, but theoretically there's a way to look at it for the system in total. And that way we could, if we wanted to, monitor the whole cluster and look at how many floating point operations per second every processor in the cluster is using, or how much memory bandwidth is currently being used for reads or writes or whatever, that we can get out of these events. We just haven't figured out how to do the global access yet. That kind of goes back to there's a lot of do-it-yourself involved with building these tools. We will surely put this on our web site when we figure it out, because that is something that we want to do.

DR. KALYANASUNDARAM: What did you exactly want to know?

AUDIENCE MEMBER: (Inaudible.)

MR. DOUG JOHNSON: It's definitely a way that we want to monitor without the user explicitly putting it in their job. We want to be able to look at that ourselves too. And I'm sure we'll announce it when we figure it out. And we'll be using another tool that I'll talk about later to actually do that called PCP, Performance CoPilot. It's one of the tools that SGI has released to the Linux community.

AUDIENCE MEMBER: This is individual CPU data we're looking at now?

MR. DOUG JOHNSON: Yes. Unfortunately, I don't have an example of us using this on a parallel code, but it is possible.

DR. KALYANASUNDARAM: (Unintelligible.)

MR. DOUG JOHNSON: With OpenMP, it doesn't work with threaded programs today, but with MPI programs it does. And this slide gives you an example. You would just do the MPI run, or the equivalent statement for your system, and then Lperfex the event that you want to monitor and then your executable name.

DR. KALYANASUNDARAM: So it leaves a particular output file in every --

MR. DOUG JOHNSON: It's standard out. It also has an option to write out to a file.

DR. KALYANASUNDARAM: And a different file for each processor or each node?

MR. DOUG JOHNSON: No. It wouldn't be smart about that. Those are one of the things we have to figure out how to add. We have to write some Romeo routines.

AUDIENCE MEMBER: Have you used other tools like MC Create or Cumulus?

MR. DOUG JOHNSON: No. I haven't heard of either of these, in fact.

AUDIENCE MEMBER: They are both available from Oak Ridge.

MR. DOUG JOHNSON: Oak Ridge.

AUDIENCE MEMBER: MC3 is designed specifically for clusters.

MR. DOUG JOHNSON: I'll have to take a look at those. We are using PAPI, Parallel Application Performance Interface. Lperfex is nonintrusive. You don't have to change your code to get these metrics back, but it averages over your whole application.

Well, what if you're interested in one specific subroutine? You would have to instrument the code explicitly for that section of the program. And to do that, you would have to use the PAPI calls which allow you to instrument sections of your code. And we support this on the cluster now as well. And this is no longer questioned. It's not "should be" compatible, it "is" compatible. That's what we use today for writing Lperfex on top of, so they do coexist.

AUDIENCE MEMBER: You mentioned these were the Intel numbers and so on. What about AMD or any of the other processors, will they not work with them, or you don't know?

MR. DOUG JOHNSON: I don't know.

DR. KALYANASUNDARAM: Does AMD have performance codes?

MR. DOUG JOHNSON: They do, but it needs a different driver and they have different events that you can monitor. The good people down and UTK who have more money and time than us are doing more maintenance on different platforms. PAPI is a cross platform set of metrics you can use, so they maintain the interface to several different platforms. I think that includes AMD chips, the Alpha, IRIX, or MIPS chips. Just all these different hardware architectures for different OS's.

And so what we're in the process of doing is rewriting Lperfex so it uses the PAPI low levels instead of the kernel driver low levels. If we abstract a PAPI, Lperfex works on every platform. That doesn't mean every platform will support an IO transaction event, but if it does, we can get it back through Lperfex in a uniform way across platforms.

We've got just a couple more topics to cover. one is nobody's favorite, but it's a reality. If more than one person uses a computer, especially a cluster, you need to have some sort of job-scheduling software. You know, in an ideal world we'd all call each other and reserve times and have perfect utilization of the computers. But we live in the real world, so we have limited resources as far as processors, memory, and network interfaces even. Everybody wants to have their job done this minute, and they want to run as many copies as they can. But that would be to the detriment to others, so we have to enforce policies.

Luckily, we can just point off that these policies, when somebody complains, that these policies have been created by management and by peer review, in the case of OSC's resources, so luckily we can pass some of that stuff off to deflect the heat a little bit. It let's us control the maximum resources available and how jobs are executed and when.

So there's a lot of choices for job scheduling on Linux clusters. This is just a first cut of the usual suspects. There's a few more, but these are the most common. The last two, LSF and PBS are by far the most popular and most complete and also the most configurable. LSF is commercial, and I should capitalize that C because they're really commercial. It's very expensive. PBS is free though.

DR. KALYANASUNDARAM: Not anymore.

MR. DOUG JOHNSON: Well, PBS is distributed under a license that's very similar to the California Board of Regents License. It was developed under contract for the U.S. government by MRJ for NASA, the NASA NAS computing facility.

By law and contract, the code that they have developed remains open to the public. So the license that they've developed it under is such that I could download a copy of PBS, call it Doug BS, and sell it to somebody if I could convince them to give me money for it. The only thing I would have to do is put a disclaimer and the original license in saying this code was developed under contract, yadda, yadda, yadda, and here are the terms for redistribution.

So MRJ, the company that developed this code, has been purchased by somebody else. They were making money off of PBS; they have decided they can make a lot more. So they're going to be offering binary only versions of this program that have enhancements that you can buy for money and are only available that way.

That doesn't mean you still can't download the free version. So the free version will always be available. And I don't want to criticize MRJ, the company that has maintained and developed this source code, but there's lot of work that's done on it by individuals and people in the community, and I think you're going to continue to see PBS well supported by the community.

A lot of the work was paid for by government contracts. Not just the NASA one, but some of the national labs have paid them to modify how the scheduler works or how it looks at clusters of systems. And those changes were rolled back into the free distribution. So I think you are still going to see that. I think you are still going to see people giving MRJ money and still see those changes make it back into version that we can all benefit from.

So we narrowed it down to those two and did an evaluation and found that LSF, while a good scheduler, good batch environment, didn't do everything we wanted it to do. Plus the price was a big factor in our decision. Overall there were no significant differences in the functionality between the two. But PBS provides us with more opportunities for customizations, optimizations, and development. And we've done some of that, and we'll going into that in a bit.

This is just a brief history of PBS and its development, most of what I've already covered. So this is how it works. You have a server on one instance of the server process, and this is the interface that communicates from the user environment to the scheduler and to all of the clients that are available to run your job on.

The PBS scheduler is a separate process. It is not the server. The scheduler is more or less the algorithm that runs jobs or doesn't run jobs. So the scheduler that we use on the cluster is a fifo scheduler. This is not optimal. There's some modifications that you can make to the behavior of the fifo to allow starving jobs to eventually get resources allocated to them.

For instance, say we have 16 processors available in our cluster, and I have a 16-processor parallel job that I want to run. Say there's two jobs running that are taking one processor each. I can't run my job. We don't recommend over committing resources. It's not a productive thing to do. So that 16-processor job is going to sit in the queue. What if two other people come along and submit their two processor jobs? They're going to get run.

There are options to turn on in the scheduler to account for starving jobs. So if my 16-processor job sits for too long, it will stop running jobs so that it can run that job. Now this method is the enemy of throughput. It will make nobody happy, more or less. Of course, there are lots of options to come up with, different scheduling algorithms that do backfill, take into account how long the short jobs need, the smaller jobs need, decides whether to run those or not, but we're basically living with what's available by default.

Now, I say by default because the scheduler is a complete separate module, and if you're a computer scientist or someone interested in job-scheduling algorithms, this is a great environment for you to prototype your algorithm in. You can drop in a new scheduler with fairly little effort on programming it to the PBS interface part. Of course, coming up with an algorithm for scheduling probably isn't trivial.

It allows you to write a C program or TCL code to define what your scheduler is. So it has a very nice modular interface for deciding how you are going to do scheduling for your resources. If you don't like what is the default with the fifo scheduler, you can modify it to your heart's content. It has a very well-defined interface for doing that.

AUDIENCE MEMBER: Are you saying at OSC you use fifo and that's it? So if I have that proverbial 15-, 16-processor job out there, it could sit in the queue for a while, or you have the software where after so long you will stop putting the small ones in?

MR. DOUG JOHNSON: We do a combination of things and we'll come back to that. Some of it requires human intervention.

Each client, each resource that's available to a PBS job, has a MOM running on it. A PBS MOM. These are on each of the compute nodes. This communicates back to the scheduler the resource usage of the current job running on it or whether it's available, what state it's in, whether it's up, down, if it's down, it's not communicating, so the scheduler knows to mark that node down, if it's node exclusive, or how many resources are left on that node.

Say we've got a two-processor job running using 500 megabytes RAM, it's going to report back to the scheduler how much resources it has left to schedule jobs on that particular node. So we've got four-processor nodes. We can run another two-processor job that uses up 1500 megabytes of RAM on that individual node.

Beyond the modular design of the scheduler and the interface that PBS provides you to write your own scheduler or modify the one that is already there, it also has an application programming interface to get all sorts of information back from the batch queue system.

It has an API to get back everything the server can find out from all the MOMS in the scheduler, you can get every piece of information. A little later in this set of slides I'll show you why that is important. It also has task manager primitives. A task manager API on the MOM's. These are the primitives that I can call to tell the MOM on an individual node to start processes, so that is the mechanism that the scheduler uses to fill the batch request. But we can use that and extend that to our own needs, and I'll show you how we've done that later.

This is just a flow-through of how PBS would handle a job. You have to put together a script that requests a certain amount of resources for your particular job and submit it to the server. The server does a first pass through the request. And it's an iterative process because it will iterate through every queue from the queue with the lowest amount of resources up to the queue that has the most amount of time, most amount processes, most amount of memory, that can be allocated by that queue, and it will schedule you in the first one that you can fit into. Then eventually your job will either sit in the queue scheduled and then eventually get run through the MOM pool that's been allocated for that particular job.

If you're running PBS locally, it's important to do things in a certain order when starting and stopping PBS, because it's possible to have systems go down and come back up or do maintenance and continue to maintain the same queue structure. The jobs might be restarted from the beginning, but it's possible to preserve what's in the queue without having everybody resubmit their jobs.

To start up, you want the MOMS on the nodes to start up so they're available when the server starts up. If they're not there, the server polls on, I think, a five-minute interval. So if it says that all nodes aren't there, it will check five minutes later to see if those MOMS are on line. It has a certain list of MOMS that it expects. You would want to start up the server second and then the scheduler the very last.

I won't talk about the T create because that's installation specific, and installation of PBS and initial configuration is a little involved.

When starting the server, you always want to specify an extra flag; meaning, that it looks for MOM's that already have jobs running. The MOM's are on the compute nodes. The server might go own independently of the MOM's, and if they have jobs running, you don't want the server to come back up and see that the MOM has some jobs running and not understand whose those are, and it will kill them off unless you specify T hot.

So that's definitely a required command line for the start up of the batch queue server. If you exit the server, you don't want to do a kill 9 on the server process. You want to explicitly call QTERM which will shut down the server but leave the jobs running on the compute nodes.

If you need to restart the MOM's and not affect the jobs, you do want to do a kill 9 on them, and you can actually do upgrades of PBS this way and leave the jobs running. If you have a colleague using a shared cluster and you really don't want to make him mad but you have to upgrade something cause you made some change to the PBS environment, there are definitely ways to keep things running and everybody happy.

Setting up a batch queue system. I've had limited experience with this, but I've come to learn that defining batch queues and limits on batch queues and how they interact with each other is a black art, because it never makes anybody happy with the throughput that they get on their jobs. This is regardless of what the policies are. Even if they're not constrained by the policies, they feel constrained by how the batch queue system will run their job. So this gives you the opportunity to do this at home and have fun with this yourself. It's an involved process, which I'm sure anybody that's done this will agree with.

The interface to manage the PBS server is with a queue manager interface. It's the QMGR command. When you type this, you end up at a command prompt. It's like a shell for the batch queue. You can query the state of every facet of the batch queue system and also set the parameters, create new queues, delete queues, start queues, stop queues.

It can also be called Nscripts, but you need to specify the C flag to let it know that the next arguments are to be evaluated in, quote, batch mode. It's not being submitted to the batch to be executed, but you're just executing those through a shell script.

With the default queue, you have to define the order in which jobs are processed and how; when a job is submitted, how they are routed. We start with a top level batch queue called batch. Nobody ever runs a job in the batch queue. So in your queue request script, you never request the batch queue as your destination because all it the does is route.

We do want it to be our default queue, though, because we're sending everything through it. Also at that point we can set an access control list as to who is allowed to query the scheduler and submit requests to the scheduler. So we basically, probably for the worse, opened up the ACL hosts to any system at osc.edu. So we can have client programs that connect to our scheduler and make requests of it. That's probably not a good feature, but it's a feature that you can do.

So if you want to be able to submit jobs from your work station in your lab, you can set up your cluster this way so that you can just type queue sub at the command line of your workstation or your laptop, as long as you've enabled the ACL hosts and set your computer to the specified lists of hosts that are allowed to do this.

What I want to do is gloss over a couple of the specifics of the queue manager configurations and get to what we've done as far as the day-to-day running of it. When you log, don't log everything. It's incredibly verbose. At this level, which is the level that we log, we feel that's enough information. Anything more is just too much. We want to be able to parse through it. Just to give you an idea, with this level, we have, I think, upwards of 600 megabytes of logs in six months, so it's very verbose itself. So, really, do not turn on the maximum. It's almost worthless at that point.

Luckily, these text files that you end up with for your logs are very compressible. In fact, they are compressible by a ratio of 1 to 100. So Gzip your files when you don't need them. It saves a significant amount of space.

Like I said, we have a top level queue that does routing. That might route to another routing queue that eventually gets to an execution queue. It routes based on the time requirement and the processor time requested and could also be by the memory requested. So this is our queue configuration at the center. We have four different time ranges that people can request. So if you need good turnaround, or if you're only going to have a short turnaround time on your job, you have a 0 to 5 hour queue available for all of the different counts of processors from a single node up to the whole cluster.

What this allows us to do is it gives us better control over draining the queue. If we needed downtime, we would have to the turn off the batch queue system 160 hours before if everybody, by default, got into the 160-hour queue. That is an enemy of throughput. That doesn't make people happy when we're draining the queues because their jobs won't run and they see all this unutilized hardware.

So what we do is 160 hours before the downtime, we turn off the long queue, 40 hours before the downtime, so on and so forth. So people running those short jobs that are less than a day long, are barely going to notice the downtime. And we get decent utilization, and people can even notice that, yeah, there is downtime scheduled. The longer queues have been turned off. I've only got a day to run. I want to only run jobs that need 5 to 10 hours or a day. Eventually we get down to where's there one day left before the downtime, and we don't bother turning off the queues at that point because we'll just kill off the jobs and let them restart from their initial point.

Let me skip through a little bit more of the PBS attributes because I think this is a verbose introduction to configuration of the PBS system. Let me go back to what limits we can set. We can set the number of individual process queues, like serial processors. We can limit the number of single-processor jobs to however many we want, so we don't have 128 single-process jobs running on the cluster. We can also limit how many jobs an individual user can have running at any time on the cluster. So bench queue systems are incredibly configurable and somewhat boring.

These will be available on the web, so if you need a reference to how to set up your own local batch queue system on your cluster, these will be available. And it will also tell you about configuring the fifo for different behaviors that it supports. Most of this is concerned with whether it has the starving job characteristics turned on or not.

AUDIENCE MEMBER: Is that at OSC's web?

MR. DOUG JOHNSON: I'll put up the web site later and these will be available as PDF's. They won't be available until tonight though.

When a job gets started, the end point is going to be through the MOM, and so there's a couple things we want to do when a job starts and finishes. The PBS MOM interface gives us the opportunity to run prolonged scripts and epilogue scripts. This allows us to slightly modify the environment that the job is going to see when they get started on that particular node that they've been allocated.

With the prologue, we create something called dollar TMPDIR that's accessible by the job. It creates a directory on every node you've been allocated in temp space that has a unique name that's identical across all the nodes that you've been allocated in that job. So when I show you some examples of job submission scripts in a little bit, you'll see the use of dollar TMPDIR. That will copy stuff out to that specified directory that's unique for that job for only that one time. That directory will be deleted at the end of the job. It's a place to prestage files and have access to fast local disk and have a common place to copy back from on every one of the nodes.

So this is the script that gets run before every job gets started in their batch environment. We also want to, of course, clean up after that. When the job finishes, we're going to run an epilogue script that deletes that space that we created for the user, the directory that we created for the user.

Another important thing that we've found is that it's hard to develop on these clusters. It's hard to develop programs on any computer. When you're running in a job environment, scheduled job environment like PBS or any of our other systems, it makes it hard for a user to make a change, recompile, and then, you know, run the program on the command line to get the change that he's seeing because we enforce limits on our interactive front-end. We only have one 4-processor computer that has to support every user using 128 other processors on the cluster. So we can't let people run a lot of jobs on the interactive front-end.

So a way to get around that is to provide an interactive batch for the users. This is where you would type queue sub with a minimum of requests for the resources that you need with the capital I flag, and then be allocated the number of nodes you've requested and get a pseudoshell on that first node in your allocation list. So that way you just feel like you're in an interactive shell, but you're really going through the batch queue environment.

You're scheduled, you're not using other peoples' resources, you're not keeping other people from the resources that they need. It's just the right way to let users do interactive work in a scheduled environment. These are the modifications that need to be made so that you can use X clients in your interactive batch job as well.

Parallel jobs present a special problem in a job-scheduled environment. The MPI run command that comes with MPICH, assumes that you've created a node list in a specified file that it looks for. Well, we don't want the users to create their node list because we've allocated them certain nodes for them to run on. So we've had to modify the way that parallel jobs get started, and this is the same kind of problem that anybody with PBS would run into.

The MPI run command's other problem is that it's not scalable. It uses R shell to start all of its processes, and there's a maximum number of sockets that can be open at any time under any Unix operating system, and Linux is subject to it as well. So if you have a very, very large cluster, you will have to definitely do something other than R shells to start a very large parallel job.

With PBS it's possible for us to use the same primitives that the scheduler uses to start jobs and query resources from those nodes. We can also use that to start parallel jobs. That's the correct way to do it. It just takes a little bit of programming.

That PBS, API for the task manager, the API is defined at this web site. It's a fairly simple set of calls where you specify where your destination is and what task you want to spawn, and then it has a wait command. So it's somewhat asynchronous at the API level.

We created the command called MPIEXEC. We support a couple different flags. But the really nice thing about this -- and anybody who's used MPICH with workstations has noticed that it takes forever for a job to start up because the R shell mechanism is just so low -- with MPIEXEC, since it's using this task manager API provided by PBS, it's almost instantaneous. It feels like you're programming and using one computer at that point.

One of the other benefits is that, and this is not the case today, but eventually this MPIEXEC routine will be more robust than the MPI run shell script. That's because, with this, it's possible to send a signal out to every process. With R shells, sometimes you end up with processes that just get orphaned. They are just sitting out there on the compute nodes with no parent anymore and nobody to control them.

With this, all of the processes on the node stay children of the MOM's on the individual nodes. So there's a greater control over children in the job process sense. So there's greater control for the MOM to clean up those jobs, and it just is a better way to run parallel jobs.

AUDIENCE MEMBER: Do you use that MPIEXEC in place of MPI run?

MR. DOUG JOHNSON: We have a shell script that mimics this behavior today. We have a C program that uses the task master primitives, but it's not reliable enough. There's currently some bugs in the task manager layer where sockets that have been opened are closed because there's somebody that's off by one in their counting, so we need to make the task manager layer more robust before we use this in a production sense. You could use this today on our cluster, the C program, and you can download the source of it today, but it's not robust enough for us to use on a large cluster.

AUDIENCE MEMBER: So eventually this will replace MPI run?

MR. DOUG JOHNSON: Yes. As ugly as the MPI run script is, if you've ever looked at that, it's not that complicated. It's just really ugly shell scripting that they have in the MPICH distribution.

This is kind of a schematic of how it works. MPIEXEC communicates through the PBS MOM; the mother superior on that first node that you've been allocated in your job. That's notation they use. So the first node you get allocated, that's mother superior, and that can tell all the other PBS MOM's in the node list that you've been allocated what jobs to start, what program to run and how to run them.

I'm going to skip forward to system management and monitoring. Like I said, SGI is supporting Linux in is a very aggressive way. It seems like every other week they're coming out with a new tool that has traditionally been closed source or has not been available for Linux, and they've ported new tools over to Linux. So we're seeing a lot of these tools they've released becoming more mature on Linux. The ports are becoming more stable, because that's what they are. These were tools developed for IRIX and they're having to be ported over to a new platform.

One of the tools that from basically day one of its release has been very useful is Performance CoPilot. This is a tool that can be used for two different things. System-wide performance management, and also it can be used as a monitoring tool. Really what it is, is a way to hierarchally manage system data. Since it's hierarchal, it allows you to kind of renormalize to different levels of detail.

So if you're interested in a specific theme, you can kind of drill down into that level of detail that you want. There's a lot of data that's returned to the central daemon that makes up PCP, and it requires that the data be self-describing. So I've got a schematic of how the layout of PCP is on the next slide. But you want the data returned to PCP to be self-describing.

You don't want to have to guess as to what it means. So when you look at a value returned through PCP, you want to know what the units are and what it really means. And if you're measuring something, you don't want to affect what you're measuring. If you're looking at network utilization, you don't want the measurement of it to double your utilization of the network. So everything that's inside PCP is designed with the goal to minimally affect what it's measuring.

There's a graphical interface, graphical user interface. But there's also a command line called PMinfo. There's hundreds of different events that you can monitor, and I'll go into just as view of those.

Here's kind of how it fits together. Centrally, you have something called the PCMD running, the Performance Collection Metric Daemon. Off of that, there's all these little tasks called PMDA's, Performance Metric Domain Agents. These could be a shell script. They could be a full-blown daemon, or they could something else, as long as you program it to the interface that they designed in the PCP interface to PCMD.

With that you can monitor disk CPU, network logs, switches and routers. It let's you collect and manage data from a large group of things and packages it in a manageable way. I'll kind of give a little view of how that's done. You get that information back through some sort of client. Either PMview, which is the graphical interface; PMinfo, which is the command line; or the logger, which is a process that will build log files out of specific events that you want to monitor.

We talked earlier about the performance counter access for an individual process. One of the things we're trying to figure out how to do now is get an aggregate of those counter events for the entire system. So if we're looking at, say, node one of our cluster, say we're interested in what its memory bandwidth utilization is. We want to eventually be able to collect those global counter statistics through this tool. It's possible, it's just we have to write the tool to make it work.

With PMinfo, it's scriptable. It can be used inside of scripts, and is a nice command line interface. Every piece of data that you can get through PCP that you've defined in your PMDA's can be retrieved using the PMinfo command line interface. It has these following options. So if you just type PMinfo with no arguments, you just get back a list of all of the PMDA's that are available on that one host that you just ran PMinfo on. Now, that's not so interesting, but if you specify the metric source for the PMCD, it's going to query that remote instance of the PMCD and find out what PMDA's it has locally on that resource. And that can be in a list. That doesn't just have to be one node.

And then you can query specific PMDA's if you know what they are on that specific node or specific node list; or, it can be a specific list of multiple PMDA's to query the values from. So just a simple example of this is on the local host, if I to a PMinfo disk dot all dot read underscore bytes, I get back this value.

So the name is kind of self-describing right there. I know that it's read bytes, from my disk, but I know that the data is more self-describing than that. So I have to do a little more digging into the hierarchy and find out what that means. So the data type that is returned is a 32-bit unsigned INT, and its units are kilobytes. So, I didn't know from the name that was kilobytes. It could have been bytes. I didn't know how to read that value. So it's self-describing.

I don't know time period. I'm going to have to add something to the example to find out. I think that is from the last minute. I'll have to add to this example. It also has a time period that this was sampled over.

In addition to the command line interface, there's also a nice graphical user interface. That's called PMview. Now, with PMview, if you just type PMview on the command line, you don't get this information built up out of it. PMview is something you can use in a script that you build up information from PMinfo to be graphically displayed in some manner.

So what I've used is one script that they ship by default that gives me the specific information from a node list that I type in. I'm looking at eight nodes on our cluster and the CPU utilization on those nodes. It also shows me the network interfaces. There's virtually no network traffic on any of the network interfaces in these nodes. They're all running serial jobs. These are all serial jobs that are running here.

And so this can kind of give you a flavor of what the goal of this tool is. If I want to look at an individual node, I can pretty easily look at an individual node. But say I've got a 100 of these. Maybe I'm looking for something that's in the red, if it's got over subscription of the CPU's, these bar graphs for those values will turn to some different color.

What it is, is a tool that can look at a large amount of data and look for different trims or different problem areas. You can also drill down to a more individual node level. From this interface, I was able to bring up specifics about one individual computer and get more information about the recent bytes transferred, received and sent, and then more specifics about the history of the load, the memory usage and the current CPU usage. Once again, the goal is with the application to be able the look at a large amount of data and look for things that look like they are out of the ordinary.

So why we think that the PCP is a good tool is not so much the graphical user interface, but if we can define our own PMDA's, say I'm very worried about NIS running on a specific node, I can write a chron job that goes out and runs every few minutes to look for whether that's running and run a shell script that looks for certain things.

Or, define a performance metric domain agent that collects that information up, reports it back to the PMCD. Now I can run one script somewhere locally and get all that information back from a lot of nodes. So this tool, in one way, can be used for system-wide performance management, like what we want to do looking at the model specific registers globally for the whole cluster, but it can also monitor events and gives us a single entry point into monitoring all of those different events. So it's a very extensible tool in that manner.

I know that we've kind of jumped from presentation to presentation, and with these four files that I'll put up on the web this evening available as PDF's, there might be a lot of questions after that. Feel free to send me an e-mail. Are there any questions now?

AUDIENCE MEMBER: Where do you see OSC going with your clusters? I know you're at 128 now, and you've obviously put into place the hardware, the communication side, that allows to expand it readily, can you give us any indication? I know you're planning on a trickle down in the cascade area. Are you just going to stay at 128? Are you hoping to grow this any?

MR. DOUG JOHNSON: What I've been told, since they are decisions made at a different level than mine, and this has been publicly put out for discussion to our state user group, is that the management at OSC sees OSC supporting three types of platforms in the future: A large vector machine; a large MPP system, like a T3E or some of the other MPP systems available on the market; and a cluster.

So we do have plans to put in another cluster beyond the one we have currently, and that is sometime in the next six months. That will be in conjunction with the hardware cascade. It will be a process of evaluation and reevaluation of where we are and what is the success level. How are we able to run these resources? Are we able to run everything, cause it is more resource intensive? And do our users care about this hardware.

AUDIENCE MEMBER: What have you seen so far on the maintenance cost of your cluster compared to what you've seen on some of your supercomputers?

MR. DOUG JOHNSON: We're able to buy hardware maintenance for them for a significant amount less. Software maintenance is a bit different. For this system, we're not necessarily purchasing a package of tools from one vendor. We're having to pull these together from different sources and make sure they work together, extend them a little bit to make them work the way we think they should work in our environment.

We don't have one person to go out and say this piece doesn't work. It kind of falls on our shoulders to make things work together. I'm full-time on the cluster project, more or less, and we have three other people that work on it. One other is full-time, but he has some other research projects. He's a research scientist with the center. And then two other people work on it a fair percentage of their time.

AUDIENCE MEMBER: Is that more or less than what you're seeing on the other computer systems?

MR. DOUG JOHNSON: Well, if you add up what work is really being put into the other systems, it all is probably about the same. We do have other people on the system staff. We have on-site engineers from SCI and Cray. Some of those other costs for those other systems are rolled into the maintenance cost, so we don't pay the salaries of the on-site engineers from the vendors directly. Somehow that's rolled in. Now, they don't do every bit of work. They a lot of times funnel software problems to the appropriate people inside their organizations, so it's a very effective model that way.

With the cluster, we're kind of left holding the bag with problems, so that is a real downside to it. It's also one of the most exciting things about it. If you don't like the way something is being done, you are able to change it. It's why we think so many researchers are interested in having these kinds of resources at their own institutions, because not only can you get work done on them, people are getting work done on them, they are a heck of a lot of fun.

AUDIENCE MEMBER: (Inaudible.)

MR. DOUG JOHNSON: We've always used fifo, the default scheduler that comes with PBS. In the future we're looking at a scheduler from Purdue called the Fair Share Scheduler. And this does backfill. Backfill is if it has a large job that's waiting to run and there's a bunch of little jobs that are running, it will try to schedule more little jobs that will complete at all about the same time so it can then run that larger job.

So backfill is only effective when you have a large enough system with a large enough number of different users requesting resources and they tell the truth about the time they need. You know, in a lot of cases we just get lazy. How much CPU time do I need? Oh, I don't know. A hundred hours. And we just type that in even if we only need ten hours. That's the reality of job scheduling.

It's very coarse-grained scheduling. All of the algorithms, these algorithms that they use for scheduling in OS's, are so much better because you have a large choice and it's quick, and you schedule stuff and unschedule stuff. With the job scheduling, it's a lot more difficult.

Any other questions?

AUDIENCE MEMBER: Are there any provisions if you're trying to do benchmarking where you really need exclusive access to machines?

MR. DOUG JOHNSON: It's definitely possible today. Now, if you need a job run across the whole cluster, you, of course, don't want to start with that. If you have a job that needs just one processor but you don't want to be contending with any other processes on that node, you'll want to specify that you want one node with 4 CPU's. That way you're going to be charged for 4 CPU's, even though you've only used one, but it will guarantee that you're node exclusive.

If you have a parallel job, it's the same kind of thing. You schedule every processor on that node in your request. That doesn't mean you are going to use every processor. Even with parallel jobs, sometimes it's beneficial to run only one process per node. Either because it needs a large amount of memory, or the memory bandwidth requirements of it are so high that you would completely cut in half the performance if you had two processes or four processes per node because of the limitations of the bus architecture on the Intel platform.

We do have some users in that boat. Their specific algorithm, and we've done our best, we've almost done everything we can think of to optimize the algorithm to be more cache friendly and use less main memory bandwidth, but there's basically nothing more we can do in a reasonable amount of time to make it run better. He has to run one process per four processor node, and he gets charged for all four processors that he's allocated. He theoretically gets charged. We haven't turned on charging yet.

Any other questions? Let me put my e-mail address up and also the web link to where these for PDF files will be. Thank you.

 

last revised:11/16/00 06:01:25 PM
editor: pmateti@cs.wright.edu