Summer Institute on Advanced Computation

August 20-23, 2000


College of Engineering & CS
Wright State University
Dayton, Ohio 45435-0001

High-end Architectures: Linux Clusters at the SGI Technology Center

Dr. Kumaran Kalyanasundaram

Systems Engineering Specialist
Technology Center, SGI

 
August 21, 2000 7:30 p.m. This site contains the "web-based proceedings" of Summer Institute on Advanced Computation that focused on high- performance high-throughput  Cluster Computing.
           

DR. OSCAR GARCIA: Tonight's speaker is Dr. Kumaran Kalyanasundaram. He is a Systems Engineering Specialist at the Technology Center at SGI. His doctoral work was in Computational Fluid Dynamics at Iowa State University, and he obtained his Ph.D in 1994. He actually was involved in developing a parallel surface reconstruction algorithm for a two-dimensional manifold into a three-dimensional. Three-dimensional object, I guess.

Kumaran has been at SGI since 1995, and he works for the Technology Center. And I've learned also during dinner that they have a very interesting group for people, particularly university people, who are interested in doing benchmarking with some of their applications. In the past few months he has been the Technical Lead for Linux clusters within the Technology Center. He has been very active in demonstrating the SGI 1200 compute cluster at a number of different meetings.

Let's welcome Kumaran.
           
 

DR. KALYANASUNDARAM: You feel very good about yourself when somebody reads something about you. I heard the speaker yesterday went on for two hours, so Jay told me that I need to go beyond two to win the award. There's a television set that they're given as an award. So I think my presentation is only for about four hours, so I'll try to drag it along to five or six so that I'm sure to win this.

The first question is why is a person from Silicon Graphics talking about Linux and an open-source operating system and about Linux clusters, which is all commodity stuff. I work for a group called the Technology Center, as Oscar said. Within the Technology Center, my main job is to do benchmarks on the high-end architectures, the proprietary stuff that we sell. The big SMP boxes that some of you will be aware of.

We do customer benchmarks. So if you have codes that you want to benchmark, if you want to buy our systems, we are the people who go and tune it for you and try to beat the other vendors who will be competing in the business. We also do standard benchmarks like spec stream and things like that and release it too on our systems.

The main reason for us moving to Linux from IRIX, which is another flavor of Unix, is that it's driven by Intel. One of the things we are doing is, for people who have heard of our Origin systems, which are the big SMP boxes, they have the MIPS processor and the IRIX operating system, in the future -- as you all know, Intel is working on their 64-bit chips along with HP, which is the IA-64. It used to be called Mercer. It used to be code name Mercer, and now it's Itanium. It's been due to be released every month. It's been postponed for more than a year and a half. It's supposed to be out at the end of this year.

At SGI, we think that the IA-64, not the first set of processes, not the Itanium, but the following ones, will slowly Intel because all its focus is on developing microprocessors. The Intel 64-bit microprocessors will be world-class processors. They will be world beaters. There's no way within SGI, which focuses on hardware architectures and operating systems and software and things like that, it will be very difficult for us to invest in a research group which develops MIPS processes and makes them as good as, say, the IA-64 processes three or four years from now.

So what we have done is, for our big systems, which are the SMP boxes or the shared-memory boxes which can scale up to 512 processes, we currently run the MIPS processes and IRIX operating system. We can also run the IA-64 processes whenever they are going to be released.

So we decided we are going to support IA-64. If IA-64 fails, then we always have the MIPS processes to fall back on. We have a small team of people working on developing the MIPS processes, and we have a road map up to 2005 and 2006, and all we have to do is extend it if IA-64 fails. If IA-64 does become, great. I'm sure it will, because it has the backing of this huge marketing machine called Intel. We want to be able to support it and not compete against it.

So now that we have the IA-64 processor in our road map, what would be the best operating system running on IA-64? HP has IA-64 in their road map, IBM has, SGI has, and several other people have. Most of the people working in the commodity space like Dell, Compaq, I'm sure they're going to run NT or maybe Linux.

IBM is working on monitoring an operating system basically to run on IA-64. HP is thinking of HPUX. SGI could have ported Irix on to IA-64. If you see the different flavors of operating systems being ported on IA-64, what would the most successful one, or what would be the operating system that many of the application windows, like say Oracle or Bioinformatics application windows or CFD application windows? What would be the operating system they would be porting to? That would be Linux, because that has the backing of the entire open-source community.

So we decided rather than pull IRIX, which would be like fourth or fifth in the priority list to all the vendors, and people will never port applications onto IA-64 running IRIX, we decided we would port Linux onto to IA-64. Which is easy, right? Because they have the backing of the open-source community.

What is difficult is, what we want Linux to do is not what all the other people want Linux to do. Today Linux scales to four processes. We build systems which are 512 processes. Today Linux is a 32-bit operating system; ours is a complete 64-bit operating system. Linux doesn't have even have a general file system. It's the slowest file system to boot.

So what do we do? What we are doing is, for people who watch the open-source community, we are releasing a lot of the open source stuff, a lot of our IRIX stuff, like our journal file system, all the software in IRIX, like, say, our Performance Copilot, which is a system monitoring tool, to Linux. We are hoping that the open-source community will accept it, and they have been very cooperative so far.

And our hope is by the time IA-64 is released by the end of this year and when it goes into our big systems, we would at least have a 64-processor system with IA-64 and Linux running on it. So that's our goal. That's our ambition. That's our aim. That's why we got interested in Linux.

So now we have products in the low-end with the IA-32 processors, a bunch of PC's running Linux or one Windows NT, and we have these small silver boxes, which can be 2 CPU, rack mountable boxes that one can use. Perfect systems for using as clusters or the full Xeon processor boxes running Linux which can be used as file servers. The reason we have those in our product list is basically because of all the work we are doing with Linux and we are hoping to learn from it. So that's a long story as to why we are in Linux.

So this is going to be a typical after-dinner talk. I won't go in depth into what clusters are and how to design a cluster and things like that. I've been asked to give a talk tomorrow at 1:30, I think, and I'll go through in depth on that. But today I just want to give you an overview, you know, maybe a 300-feet overview or 1000-feet overview as to what are clusters.

You hear this name being used so often. People think clusters is something new. The word Beowulf is used very often. Why do you spend so much money on buying a supercomputer from SGI? I can build on at one-hundredth of the cost and it's called a Beowulf cluster. Many people say that.

I was telling somebody here about a story. When I started doing this Linux cluster work, someone from the University of Connecticut, who had a full processor Origin and wanted to upgrade it to more number of processes, wanted to buy a bigger Origin. He was teaming up with a professor from MIT, and the professor from MIT told him, Why do you spend so many money upgrading your Origin? Why don't we set up a Beowulf cluster? It's one-tenth the price and you can add as many processors as you want, and it's very cheap.

So he decided he was going to buy a Beowulf cluster. Why spend so much money on a supercomputer when I can get it so cheap? He told the salesperson from SGI, I'm going to buy a Beowulf cluster. So the sales person decided, I'm not going to be able to sell an SGI, so let me at least sell a Beowulf cluster to him. So he said, Oh, we do Beowulf clusters too, so I'll sell you one.

So he came to me and said, Can you run this benchmark that they have on a Beowulf cluster? They want to find out what type of networking they need to have to connect the different systems together. Should it be some propriety stuff or should it be commodity stuff?

So I ran this benchmark. It took me two weeks to set up the machine and run the benchmark. So I run it and finally I'm successful. It runs successfully. Then I go for a ten-minute break and come back and run it again, it crashes. The same code. So the things are not very reliable when it runs. It's tough monitoring this whole system. It's tough making it work. It's tough to give you answers every time. Every time you have to install a new device driver for the new NIC cards that you have, it takes a long time to do that. It's not an easy thing. It's not the most reliable thing, but it's a very good experimental thing.

So there's something you get in performance and reliability that you're paying for in the big machines that you don't get in clusters, right? And the other thing is not all applications run very well on the clusters. You are using commodity 32-bit processes with very little secondary level caches on them. They have a couple of integer units, but just one floating point unit.

So if you have codes that are not floating point integer codes, high-energy physics codes, for example, which are trivially parallel, which do very little communication, needs very little communication, day-to-day compulsion types of problems, they run very well on clusters.

But if you have floating point intensive problems with a lot of communication, a CFD-type problem, where you cannot decompose the grid on which your fluid problem is being solved very easy, then you are going to have a tough time running it on a cluster, and SMP boxes will do much better. So you have to understand the type of problem you're solving to be able to run it on a cluster.

So let's talk about the type of clusters. Let's talk about cluster interconnects. The various types of proprietary stuff. One of the good things about having an open-source operating system is that you easily get device travelers for all the different types of networks that are being supported. Commodity stuff like Fast Ethernet. Proprietary stuff like Giganet or Myrinet or GSN. They're all available on Linux.

So what are the different types of clusters? It's just a naming thing. Some are availability clusters, so they are basically clusters that are being used for high availability types of applications. The web space a good example. So you have these clusters, and clusters renders itself very easily to RAS types of features. If one of your nodes goes down, there's always several other nodes to take up and do the work.

Throughput clusters. These are basically all the systems in your lab. And if you are running some sort of a batch environment and you are submitting jobs from one of the computers, it knows what the big different systems are. It knows which are the ones that have free cycles to spend and locates that job to that particular node.

So these are applications that are not cluster aware. It need not know that you're running it on a cluster. These are basically single-processor jobs that are going to be running on a single node in a cluster, and it need not know that there are other nodes which make up this huge machine.

Capability Clusters are compute clusters, which are your famous Beowulf clusters or scientific clusters. And there you run parallel jobs which are written in MPI. As I said, when you look at high availability clusters, the first type of clusters, the set of nodes in a cluster easily renders itself to RAS-like features. If one node goes down, there's also another node to pick it up.

When we talk about high availability, what do we mean? The general definition of high availability is that things are available 99.99% of the time, which is about eight hours of downtime for a whole year for which this system running. You can hide these downtimes because if one of your nodes goes down, there's always other nodes to do the work.

When we talk about high availability, we not only talk about the nodes in the cluster, not just the hardware, it's the different components that hardware depends upon. So it's not just the computers which make up the thing, it's also the networking which has to be highly available. It has to be the disk on which these, maybe file systems, on which these resources depend upon. They need to be highly available and also the applications. You need some sort of checkpoint restart mechanism, where if your application fails on one of the nodes, you can checkpoint it and restart it on another node or it fails over to another node.

So what is the high availability environment? In the simplistic sense, it's the nodes of the cluster, but it also includes the NIC cards or the networking, it also includes the disk in the cluster, it also includes the applications. So you have to have different types of monitoring tools to monitor the different stuff so you know when something is down, it can fail over to another node.

So this is a very simplistic figure of a high availability environment for people who are not exposed to it. So here you have two nodes in the cluster, A and B, both connected to some sort of a public network, which can be ten-based D, hundred-based D, FDDI, ATM, whatever. And sever A and server B also have a private network, which is called a heartbeat network, on which it keeps constantly checking whether the system is up. A checks whether B is still running and B checks whether A is running. So if one of the nodes fails, then it knows it has to pick up the applications that is running on the other one and run it on this one. This is very simplistic.

A bit more complicated figure would be this. You've introduced a fibre channel rate box in the figure, which has two controllers, so if one of the controllers fail it can fail over to the other one; and it also has dual paths to both the nodes in the cluster. So there's loop A and loop B. So if one of the loops fail, it can actually fail over to the other. So it's dual-hosted, and there's two loops.

So what is the different stuff that you would monitor here? You want to be able to monitor the storage processor and see if there's a failure here. You want to have some cluster management tools which will monitor the nodes and know they are up and running, and when they are down and so things can fail over to the other one. That's done by the heartbeat network. And you also want to monitor the applications so that when the applications fail there's some sort of checkpoint restart mechanism within the applications.

That's what this line says. So the heartbeat measures the nodes. The high availability framework monitors measures things within the storage, the fiber channel rate, like volumes, file systems, and the networking. And then application specifications monitor whether the application is up and running.

So when you have a failure, it can happen different places. It can happen within the storage system; it can happen within the networking; or it can happen within a node. A node can fail, or the application within a node can fail. And in all these cases, you need to notify the person who is administering the system.

If it is a storage failure, if it's one of the paths of the storage which has failed, you want to be able to access it by an alternate path. And whatever the case is, you want to make sure the downtime is very small. There could be times when there may not be a reel failure.

For example, when nodes are checking the heartbeat, it may not get a very quick response from the other system, and at that time it might construe it as a failure and the applications on the system might fail over. That could be a false fail load. In those cases you want to avoid those by making sure you build in logical mechanisms where, just because a system doesn't respond very quickly, you don't assume that system has failed.

Let's keep this very interactive. So if you have questions jump in, or if I'm saying something that you totally disagree with just stop me and say, you should stop. I need to take over or something. And people have done that before, so that's not the first time, and that's why I was hoping that you guys would do it too.

As far as Linux and high availability is concerned, since our focus is on Linux, these are the software packages I could find in the open source community that are available. These are all software packages which are used in the internet server space. Most of the high availability applications are the internet space, and these are various pieces of software that are being used.

So Linux is not yet ready to move into the enterprise type of market space, and basically not just because it doesn't have very good high availability solutions that we talked about, but also because it doesn't have -- the main thing is it doesn't have an journal file system. And once it gets that -- SGI has released it's XFS journal file system to the open source. IBM has released theirs. Linux has its own journal file system. So there are different journal file systems that are being tested out and hopefully one of them will get accepted. And that will be the first step, and along with the different pieces of more mature high availability software for Linux to be accepted in the enterprise space.

One of the things SGI is working on, SGI has a high availability software it's working on called fail safe, which does all this monitoring for you, which I talked about. And fail safe runs on IRIX, and we are in the process of porting it and releasing on Linux along with SUSE.

So this kind of reiterates what I said. You need these features and high availability framework, which hopefully will be available in the near future, because people are porting their fail safe type of solutions or high availability type solutions to Linux.

You will need some sort of cluster management tool which is almost available. People are using more of these clusters, and they are looking at system monitoring tools, cluster management tools, performance monitoring tools, and things like that. More of these are getting available in the open source. And the most important thing is having a journal file system. If you have a journal file system, one of things that would be very useful is a clustered journal filed system, and SGI is working on it.

I'll kind of interject. I mean, I want this to be a very general presentation, but in between I have to do my part as an SGI employee to talk about the SGI tools that are being ported onto Linux. One of the things is we have released XFS to the open source, and the other thing is we have released something called CXFS, which is Cluster XFS, in which you can run a storage area network-type of solution on a cluster, which have been extremely popular, in which all the nodes in the cluster can access. So the file system on each node can simultaneously access a single file system.

And that's part of IRIX now, and we are working towards releasing it on Linux. And the people we are working with are the people from Linux Care and hopefully in about eight to nine months CXFS will be part of Linux. So once you have a clustered journal file system in Linux, that would be very useful in any type of Linux cluster, whether it's a throughput cluster or a compute cluster or whether it is a high availability cluster.

So here is a very, very simplistic example, okay. In the Internet space where high availability -- these clusters are used for high availability solutions. You have a bunch of nodes or computers in a cluster, and there's kind of a hybrid throughput cluster and a high-availability cluster. So each one is doing it's job, and if it fails, the other ones can pick it up. Meanwhile, they are running their own applications too.

So you have a switch, maybe a Cisco switch, in the Internet space and you are receiving some requests, maybe HTTP requests. And all this switch does is, as soon as it receives it, it uses some sort of a system monitoring tool and delivers that or sends that request to one of nodes in the cluster. One of the nodes work on it and it sends back the request and it's sent back to the Internet space. Simple example.

It's kind of a throughput cluster, where each node which receives the request, is not aware there are other nodes in the cluster. It's also a high-availability cluster, because if one of the nodes goes down, then the request can always be sent to the other nodes.

One of things is there's no dynamic feedback from these nodes to the switch. So when you receive an HTTP request, it can be a small one or big one. Both of them are sent to a certain node, and that node can be working on it, but after a certain time the switch doesn't receive the information back. It thinks that that node is down, then it resends it to another node. That's kind of the false fail over I was talking about when nodes are testing each other for heartbeats, and when one of the nodes is down and the other node really doesn't know it because it just -- or when both nodes are really up and one of the nodes sends a heartbeat request and it doesn't receive it in a certain amount of time, there's no intelligence built in which for it to know that that node is actually not down but that the request didn't come back for some other reason. So actually Cisco is working on a switch, on a dynamic feedback switch, which helps to build some sort of feedback mechanism into the switch to avoid this kind of false fail over.

What are the different types of tools you would need to manage a cluster like that? You need some sort of system monitor tool. The system monitoring tool is basically for to know how the load is distributed among the nodes in the cluster. You need some sort of a cluster management tool to treat it as a single system as opposed to different nodes.

Questions? So let's look at throughput clusters. Throughput clusters are used for various reasons in the market space. It's sometimes used in structural and fluid mechanics types of applications. If people have seen the latest spec numbers for the Pentium III, it's amazing. Not just the integer spec number, which is always good, the floating point spec number. It's gradually catching up with all the superscalar risc processors.

I have seen structural mechanics codes -- and I shouldn't say that, because I would prefer selling bigger computers which are more expensive than this cheaper stuff -- but I have seen structural mechanics applications and fluid dynamics applications, fluid dynamic applications like Fluent and Star CD, and on a Pentium III chip that has such a small L2 cache they are only about half as slow compared to a risk superscalar processor. A Pierre, a MIPS or an IBM PC, power PC. So the results are amazing.

The only thing you can fail in a cluster environment, is because you are NFS mounting the file system. And if you have a lot of I/O to be written, that can be extremely slow. It's all single processor and you don't -- and then managing the system and making sure your applications will run very well and will never fail and the system will never fail. Those are the things you will have problems in.

As far as performance of a Pentium III, a 32-bit chip, it's catching up with all the risk processors. And I'm talking about the 833 Pentium III. And you have the 900 plus and the gigahertz Pentium III.

Usually the way people choose processors in the Linux cluster environment is to plot price performance. So usually if the one gigahertz Pentium III is out, it's not the latest one which goes into a cluster, into a node, but usually they will plot price performance and choose the one that gives them the best performance for the lowest price.

So the Pentium III of choice today is usually the 700 megahertz or the Sound 50 megahertz Pentium III. It's not usually the 800's or the 900's and the gigahertz because they're too expensive. That's why I chose the 833 and said, see the performance of that studied with all the risk processors and it's amazing.

EDS Server Farms. When you are designing chip logic, you want to run simulations of the same chip, so it's a very trivially parallel problem, renders itself nicely to solve on a cluster. Render farms. When people who Disney animation movies and things like that, many are running the same algorithm but on different pieces of data. It renders itself very nicely to run on a cluster.

So to manage on a throughput cluster, what are the different tools you need? Same ones. You need something for managing the cluster as a single entity. You need a system monitoring tool. You need accounting tools because people are submitting jobs and you want to do that. And the last piece, or the most important thing is, you need some sort of a batch scheduler to be able to submit these jobs. The batch scheduler will find out which nodes in the cluster are not being used and run those jobs on those nodes.

So the batch scheduler of choice usually in a cluster environment, in this open source area, is PBS. PBS is the follow-on to NQS. It is the follow-on from NQS from Cray, and NASA Ames developed it and they worked on it. Actually, PBS was bought by another company, so they are going to charge actually for PBS, but they are claiming they will have two types of PBS sources available. One available in the open source, for which there won't be any support; and one which they provide for which they will charge for support.

The competition to PBS, which is not open source, is, of course, LSF. PBS is the one of choice. There's always this thing about why I should release something to the open source and why I should charge for something. Linux, unfortunately, doesn't have a lot of kernel monitoring features available in other Unix operating systems, IRIX, Solaris, HPUX and things like that. So it's very elementary today in Linux.

And to get these features added to be able to support these features, either for developing system monitoring tools or whether to develop batch schedulers which have checkpoint restarts and things like that, you want to be able to work with the open-source community, and you have a better chance of having these features added to the kernel if your software is open source, as opposed to you charging for it. So PBS is open source, and I'm sure it will work better with Linux than not open source or software like LSF.

So let's move on to compute clusters or Beowulf clusters. Everybody knows what a Beowulf cluster is, and it all started in 1994 with commodity processes and commodity networking like Ethernet. It's always been popular in academic and research environments because people don't mind if things break. They have, you know, another project to put it all back together again. But the thing is, it's becoming very popular in commercial environments too. Especially in two areas; oil and gas. Seismic processing codes and Bioinfomatics.

Those are two areas where Linux clusters are becoming very popular, and you're seeing more and more of proprietary hardware. Proprietary big single image systems being replaced by Linux clusters. Amoco is one company that I worked with which bought a huge Linux cluster. So you're seeing more and more of these Linux clusters being accepted.

So the mileposts. The different mileposts. There's the first SGI Beowulf cluster we sold. I don't know if there's any people from Ohio Supercomputer Center here. It's actually not a typical Beowulf cluster, in that the type of system we sold to them was one which is actually shelf mountable not rack mountable. These are boxes which have Xeon processes and not Pentium III, so they're not the best in price. So it's not a very typical cluster.

The important thing to note here is that Linux clusters are replacing proprietary hardware. And the oil and gas is an area where things are happening. The other thing to note is even in research environments people have come to believe that Linux clusters can fail easily, and there's a lot of research that goes into making them extremely stable.

So if a vendor is able to provide this type of support, you know, software to run the system and also hardware which is extremely reliable, then people are willing to pay for it. And almost all the money in Linux seems to be in support. It's a free operating system. The hardware is extremely cheap. So all the money is in support, and that's what all the companies are fighting for, and that's why you see all these companies like Linux Care -- first off all, the Linux distribution companies, like RedHat, SUSE, Double Linux. Then you have the professional services companies like Linux Care, and also the big ones like IBM, HP all getting into Linux.

AUDIENCE MEMBER: One of the things I have wondered for a very long time is why isn't more artificial intelligence brought into trying to do diagnosis the failures of this unreliable part of the software? It seems that nobody wants to do some experimental AI on software. They want to do AI here, software here, but the two shall never come together.

DR. KALYANASUNDARAM: Give me an example.

AUDIENCE MEMBER: I don't know. This may not be a good example because I haven't found a rationale, but there is something called genetic programming. Why could you not, I'm thinking now a little bit freely, why could you not try to do some automatic diagnosis and repair?

Because eventually, software seems to continue to change very rapidly, and every time you change, you know very well you are at risk. If you could have software that somehow or another can repair itself or fail and survive, those things are possible. They're not totally impossible to have some kind of repair mechanism.

DR. KALYANASUNDARAM: I don't know much about genetic engineering to answer that, but, I don't know.

AUDIENCE MEMBER: Okay.

DR. KALYANASUNDARAM: Anybody else?

AUDIENCE MEMBER: I had a different question. I'm just curious as you give away more and more of the stuff and work with this as you just said, how are SGI's interests being maintained and are you going into the service area too?

DR. KALYANASUNDARAM: The thing is, you know, that's been one of the most difficult things for me to understand. If you give away stuff, how are you going to make any money? It's good for the small companies. I mean, when has RedHat ever posted a quarter in which they have made profits so far? Zero, right?

So here's the thing, one of the first pieces of software to be released to open source by SGI was called Performance Copilot. I don't know if you guys have heard about it. Performance Copilot is a system monitoring software which was basically on the big SMP boxes, the Origin systems and things like that. It gives you, using GUI tools, it gives you a 3-D -- is has a 3-D visualization tool.

Underneath it's using the same tools that you would use like OS View, talk, SAR, and things like that to collect data, par, and things like that and present it in kind of 3-dimensional graph, so it's easier for you to visualize as to what is going wrong in your system. And it's a very nice tool that was running on IRIX, and we were selling it and making a lot of money. It was packaged along with IRIX, so people had to buy it.

But people who bought our big systems were buying it because they want to see what is the bottleneck. For example, if I have a 32P system today and I want to scale 64P, should I just buy processors or should I buy equal amounts of this, or should I buy equal amounts of memory? That's a tough question to answer. So people are using these tools to study the usage level. Is the bottleneck memory? Is the bottleneck processes? It the bottleneck this and things like that. It was a very nice tool to understand what is going on. That was one of the first pieces of software that was released to the open source.

What we did to make money was to, even though you can take PCP from our open source and compile it and run it, we never released the GUI pieces to the open source. We said, If you buy it from us, it comes with the monitoring stuff. So all we released was the data collecting stuff, so all the stuff which goes for being able to display all these 3-D stuff, we never released it to the open source. So you have to buy it from us to do that, so it's kind of a selling mechanism.

The other thing we did was, if you want PCP from us, so if you don't pick it up from the open source and compile it yourself and run it, and you don't get the GUI stuff, you get only the data collection stuff, if you buy it from us and you want us and you want us to support it, then we will support it only on our hardware.

So it's basically Linux. You can run it on Dell, you can run it on Compaq, on an HP system anywhere, but you have to buy an SGI system. And if it's running on an SGI system, then we'll support it. So our basic intention was that you will buy more and more of our hardware to run the PCP stuff if you need the GUI tools. The thing is, somebody in the Czech Republic has already developed these GUI tools based on the collection stuff and released it to the open source.

So that was one way to look at it. Now, I'll show you another way to look at it. The catch here is, if we don't release something to the open source and we charged for it, so we put all this stuff onto Linux and we charge for it, the problem is we will never be able to work with the open source community with something that is not open source, like PCP, and make them build all the performance measuring mechanisms within the kernel that we need to make PCP successful. That will never happen.

You know, if you want some feature or something like that, how do we -- we can, of course, build it and release it to the open source, and the open source can just ignore saying that's a particular feature that is not needed for the general thing. It's something you need for PC, but you're charging. That is not needed.

On the other hand, if you release it to open source, more people have their hands on it, and as they see this mechanism work, they are building these features into the kernel and PCP is becoming very successful. The other thing that is happening is PCP is now being ported onto Solaris. PC is being ported onto HPUX. And they are doing everything to make PCP successful and making sure their kernel has all these data collection mechanisms within their kernel.

Now, we couldn't have dictated that if we hadn't released PCP to the open source. So how does SGI make money? SGI doesn't make any money out of PCP, but SGI can sit there, go into an environment where people have Sun boxes, HP boxes, SGI boxes and Linux boxes, and say, We can build a system monitoring mechanism based upon PCP, which is open source, that can support and run on all the boxes here. And PCP is already running on all these operating systems, which I couldn't have done if I hadn't released PCP to the open source. So that's the thinking here.

So in a way it's kind of supporting. It's professional services here. I'm not doing any engineering. I'm just going there and supporting all these boxes and building a system monitoring tool based on PCP. It's support. You are making most of your money in services and support and personnel services.

So this is something I've repeated for a long time in this talk. Why is suddenly clusters taking over? Clusters have been here for a long time. People have been using clustered workstations. You know, they have workstations, each one on the desktop all connected to a public network, maybe ten-based T. And people have sat and wondered always, How do I make my application not just run on my box but on all the other boxes that are sitting idle in the night? So this has happened ten years back.

My master's thesis is based on using Unix sockets. There used to be a popular commercial available panel algorithm called Vias(phonetic) Arrow, which people used for designing aircrafts. A fluid dynamic's code. And it runs on a single workstation. It will take you two days to get your results. And when your workstation is crunching out the results, the rest of them are sitting and sleeping.

So people are wondering, how do I use that? And my master's thesis was basically based on developing a parallel algorithm which runs on a network of HP workstations. So people are doing that for a long time. The reason it's become very popular now is because a) the processors have become very fast; and b) the proprietary networking like Myrinet, Giganet, Gigabit Ethernet and things like that have become very fast.

So if you have codes which communicate a lot between the nodes, you can get this type of networking. And you have an operating system, which is open source, and you have all the device drivers available. And also the applications, you know, space, the algorithms, parallel algorithms have become very mature. And now you have message-passing software like MPI and maybe PVM, which are extremely mature that people are writing their applications on. And you have a lot of commercial applications based on MPI.

Here is as bit of applications in different engineering spaces and scientific spaces that are available based on MPI that are extremely popular. The MM5 is a modeling code purely based on MPI. There is a shared-memory version too. It's funny. Because if I'm not giving the cluster talk and I'm giving the talk as to why SGI, CC, Numa is better than clusters, MM5 is an example of that too because MM5 has an OpenMP version which runs on the shared-memory machines which scales as well as the MPI version, and there's nobody maintaining it.

There are two scientists at NCAR who basically, every time MM5 has to be modified or new features added and who has to support the MPI version and rewrite it. So this is a way of telling people that it's easier to buy a shared-memory machine because it's easier to write applications, and you don't have to have two full-time research scientists supporting it. So that's my argument there. Today, because I'm giving my clusters talk, MM5 is an example of an MPI application or an MPI-based application which scales very well.

Gamess is another. Chemistry code from Iowa State. Zeus-MP is a hydrodynamics code developed at NCSA. In fact, they have a web page in which they compare the performance of different Windows architecture including clusters. I think NCSA has an NT cluster, so the performance is on an NT cluster.

Navier-Stokes codes. Monte Carlo simulation codes and Financial Analytics is an MPI problem, trivially parallel problem, that runs on parallel clusters. In fact, Morgan Stanley, JP Morgan, have both enlisted in buying an SGI Linux cluster, a 32P cluster, and they run their financial analytics stuff on it.

And then QCD, or the latest case theory, famous experiment in physics, the processor at Indiana University Stephen Gotlieb has a huge Linux cluster. And just this morning we managed to sell a 48-node Linux cluster. We were working on an RFP, and I got the news that SGI managed to sell to a 48-node Linux cluster to Fermilab for an experimental latest case theory.

There are different areas, and when I take these applications, these are specific applications that run very well on clusters because the applications have been developed as they run that. So it's not applicable to all the areas.

So the other thing is, not only do you have MPI support, now you have a debugger, and the debugger of choice is TotalView from Etnus. It's closed source; it's not open source. You have to pay for it, but it can debug MPI jobs running on thousands of nodes on thousands of processes.

Talking about networking, the important feature why all this networking is extremely fast is because most of them have all these bypass implementations where you don't have to talk to the kernel to communicate from one machine to the other. For example, Myrinet, their NIC cards have a processor in them which allows communication between the different nodes and you don't talk to the kernel at all. And because of that, you don't even see CPU overhead on the processes within a node. If you're running a single process or a single node and you have a NIC card in it, than that NIC card actually has the processor on it which takes up all the overhead. So there's no CPU overhead, and your jobs can run on the CPU. They can make full utilization of the CPU's within the node in a cluster.

The drawback is that most of these things are no longer called Beowulf clusters. They are called Myrinet clusters or Giganet clusters. Because the cost of a node in these clusters -- if you really go and spec out the motherboard, the different components in the cluster, you can build a node in a cluster with two Pentium III's and the L440 GX Intel motherboard for about 1,500 bucks. If you buy it from a vendor, it will be about 2,000 bucks for a 2U box with a single-port Ethernet, one IDE drive, one floppy no CD ROM, 2U box, with the L440 GX two Pentium III's. So around the 700 hundred megahertz range. Myrinet's card would be much more expensive. So the actual money in these clusters is in the networking and not in the processes or the nodes in the cluster.

So this is the configuration of the Ohio Supercomputer Center. It's a Beowulf cluster using these full processor, shelf-mountable systems. As I said, this is not the best choice, basically, because there are Xeon processors with a big L2 cache. And usually the possessor of choice is the Intel Celeron or the AMDK-7 or the Pentium III's and not the Xeon processors.

The price performance choice would be these two-rack mountable boxes which we call S1200's, but they're all OEM from VA Linux, and VA Linux actually sells it to a lot of companies, so you can get them. The ones we have actually have SCSI drives, but the more cheaper solution would be IDE drives. They have the L440 GX Intel motherboard, which has 2 PCI buses and gives you 2 PCI slots, but the motherboard actually has 6 PCI slots in them.

You can use Myrinet, and they all have a default, the motherboard has a default Ethernet port, and you can connect them all and have a head node which could be anything, maybe one of the nodes in the cluster or another node with a PCI fiber channel card and a fiber channel J board, which is NFS mounted over this private network. And then you can also have Myrinet connected to one of the PCI slots on the box to be a able to do your applications, talk communication.

So the Ethernet is used for NFS mounting the file system, while Myrinet is used to for the applications to talk to each other. People also have 1U boxes with the same motherboard with a lesser number of PCI slots, so that you can have more systems in a single rack.

The key measure, when you choose an interconnect, whether it's Myrinet, Gigabit Ethernet, Fast Ethernet, it's all applications dependant. If your applications do very little communication between each other, then all you need is the Ethernet. If your applications do a lot of communications, then you need a faster interconnect and you want to pay for it.

So latency and bandwidth and CPU overhead are the three terms you would hear, so you want faster bandwidth, low latency, and very little CPU overhead. As I said with Myrinet, you have very little CPU overhead because the NIC card itself has a processor on it.

So the different protocols that they use. Fast Ethernet uses TCP/IP. Myrinet uses something called GM, which is Glenn messaging, and again it bypasses the kernel. And Myrinet is usually, from when I have talked to different customers, if they want a proprietary network that is fast and it is not Fast Ethernet, it is usually Myrinet.

Giganet is another one which is competing with Myrinet. And one thing not mentioned here is Dolphin XSCI, the Scalable Current Interface, which people use. SCI has been used in, say, the Cray giga ring building the Convex SMP boxes and things like that. The problem with Dolphin XSCI though is it is a bit expensive, but the latency is very small. It's only about two microseconds, so it's extremely small. The other thing about Dolphin XSCI is the Intel motherboards do support cache coherency, so it's tough to build a system with Dolphin XSCI in which the motherboard doesn't support cache coherency.

And finally GSN, which stands for Gigabit System Network, uses the HIPPI 6400, which gives you 6400 megabytes per second, uses the schedule transfer protocol, or ST. It's way too expensive. I mean, there is support which is being done to Linux, the device drivers are being done, and you can get it in PCI, but it's way too expensive in this the space I just mentioned here.

These are all hardware round-trip latencies. Remember, hardware latency is different from application latencies, so it's not application. Sorry, these are MPI latencies. So it's about 18 microseconds for Myrinet, 25 for Giganet, and for Gigabit Ethernet it's about 100. And you can also see the price of the card and the price of the switch.

So between Giganet and Myrinet, the price of a NIC card is more expensive for Myrinet, but the switch is much cheaper. So these numbers are a bit old, so they could have changed, but the ratios and all remain the same. It just gives you an idea as to how much you are spending. So you can kind of think -- the problem is, when I showed you the 4-processor Xeon process system and then the 2-processor system, and looking at the price of a NIC card, then you are going to wonder it's more advantageous to have more processors in a single node because you are going to have lesser number of nodes and you can have lesser number of NIC cards because the cards are as expensive the node in the system.

The problem though is when you are using commodity stuff, the Intel front side bus doesn't scale very well at four processors. It scales only up to two processors. So if you're running applications within a box and you don't want to use MPI but you want to use some sort of shared-memory protocol, you know, Unix fork or something like that, or OpenMP within a box, you won't see scalability up to four processors. The default, or the good number, is two processors within a box.

So Myrinet is the one SGI kind of teamed up with, and it seems to be the networking of choice for people who want fast networking and need more than something like Fast Ethernet. The bandwidth is actually 1.28 gigabytes per second to be exact, and it's full duplex, so it's two ways. Myrinet is supported on almost all the Unix flavors and NT and Linux. And the great thing is it has an OS bypass strategy, which means that applications do very well using MPI.

I'm going to just mention it, but it's kind of useless, but there is networking developed by SGI called the Gigabyte System Network called the HIPPI/HP6400. It's very expensive. The switch is expensive, the cards are expensive, but it's being ported onto Linux. And there are two companies which sell PCI variations of this, of the cards too, the gigabytes. It's HP6400. It uses a protocol called the Schedule Transfer Protocol, which you can view it as DMA with the networking.

The most important thing is we are working on bringing the Schedule Transfer Protocol to the Gigabit Ethernet. So the problem with Gigabit Ethernet is that the CPU overhead is way too much in Gigabit Ethernet. It's cheaper than Myrinet and cheaper than Giganet, but the CPU overhead is way too much. Once the Schedule Transfer Protocol, hopefully, comes to Gigabit Ethernet and not TCP/IP, then you are going to see lower latency, higher bandwidth, lesser overhead in the CPU's.

Are people awake? Good. How many hours are we into this? Like two hours right? So I have to quickly wrap up in another with two or three hours. Maybe if I finish with the slides I'll go through them again. It's good practice for me too.

When you are designing a compute cluster, what are the things you have to worry about? The first thing is, do you need any special nodes? Do you need a front-end node to which you have, say, the disk system attached to it, which is NFS mounted to all the systems. Do you need that? Do you need that to do any special visualization? Are you going to do any cluster management from there? Are you going to run your batch scheduler in such a way that all the jobs that are submitted from the front end to the cluster and nobody has access or log-in capabilities onto any node in the cluster?

What should be the node width? Should these be the 2U boxes, the 1U boxes for size reasons? For space reasons? Do you need to have two processors or one processor or four processors and things like that. How many PCI slots you need, what type of networking you need, and these are the different considerations you need to have. What type of processors? Pentium? Celeron? Alphas? AMD?

So, if you buy a thin node cluster, which is usually two processors, Pentium III's, this is the way it looks. So you have the different nodes which are connected to some sort of a switch, which could be a Myrinet switch or a Fast Ethernet switch if you want to use the default Ethernet.

If you want a bigger cluster, like the Xeon processor one -- and this picture is kind of incorrect because it restricts to four processors. When I talk about clusters, it's not just these commodity clusters, it could the big SGI SMP boxes connected together, and those could be like 128 or 512, so you could have fatter nodes with more CPU's. And the advantage here is you are using lesser number of cards per box, and you are also scaling within a box if the bus allows you to scale.

So the advantage is you can use shared memory parallelism with lower latencies, higher bandwidths inside the box, and outside the box you need the communication with the MPI stuff. And that's what it says. So OpenMP or loop level, fine-grained parallelism within a box and coarse-grained parallelism outside the box.

So in general, when you talk about compute clusters, you need some sort of a cluster management software. So you want to use this from the front end, manage the cluster as a single entity. You want some sort of a message-passing software to write your applications to talk to the different nodes. You want some sort of system monitoring software where you can actually see the loads on the different nodes in the cluster and be able to manage them.

You want some sort of a batch scheduler which basically submits jobs and runs them on the different nodes. You want some sort of an automatic software installer on the different nodes, so some sort of a console from where you can install, connected to the serial ports of the different nodes in the cluster where you automatically install software and things like that.

So somebody asked me how does SGI make money. This way. We actually sell a cluster environment software called Ace, which has all the different pieces in it. I don't if you know this, but SGI also sells what is called it's own Linux distribution called Pro Pack. So it's based on the RedHat distribution, SUSE, Turbo Linux.

If you buy a box from us, you can choose to buy RedHat 6.1 or 6.2; SUSE, whatever version it is; Turbo Linux, and install it on the system. We also have an SGI Pro Pack CD, which comes free of cost. This SGI Pro Pack CD will, of course, run on any box, but it's basically intended for our boxes. And it has a number of patches with have been released to the open-source community but has not been accepted for the current version. It was too late for the current version of the kernel. It might make it to the next version of the kernel.

What these patches have is basically performance enhancement. There is an NFS patch to make NFS performance better. Then there is a big memory patch which supports greater than 4 gigabytes of memory and things like that. So patches like these are released and they are basically, what I'll call the overlays, so they are not part of the kernel that we release, and we support it on our systems.

So we have an Apache patch, so if people want to choose our box, then we say we will give you this SGI Pro Pack which has a patch in which Apache runs faster, so people tend to buy our hardware, and also software tools like this and save. The great thing is that when up buy it from a vendor which has it's own Linux distribution, sells commodity hardware and also has software packages, that is one number to call to get support for all those different tools. Otherwise, you are buying the hardware from somewhere, you are looking for Linux support over the web, and things like that. I don't want to turn this into a marketing talk, but that's the idea behind Linux and Linux clusters.

Any questions?

AUDIENCE MEMBER: You had a slide on automatic parallelization. You didn't make a comment on that. Is there anything you want to add about the automatic parallelization with OpenMP?

DR. KALYANASUNDARAM: Do you know what OpenMP is?

AUDIENCE MEMBER: It runs on some shared-memory systems.

DR. KALYANASUNDARAM: Yes. Automatic parallelization is basically the compiler does all the fine-grained parallelism for you. When you have do loops, which have no data dependency on them, the compiler automatically adds the OpenMP directive on top of it, and those run on SMP boxes. So if you have fatter nodes, you can use OpenMP, fine-grained parallelism within a box, and you can use PMI to communicate within boxes.

AUDIENCE MEMBER: That's something SGI is --

DR. KALYANASUNDARAM: Yeah. This OpenMP stuff is supported on Linux, and our compilers, our MIPS compilers which used to run on IRIX, we have ported onto IA-64, and they have been released to the open source. They are called the SGI Pro 64 compilers. And so it supports OpenMP, and C, C++, F77, F90, and it's all part of the open source. And it's already there.

AUDIENCE MEMBER: (Inaudible.)

DR. KALYANASUNDARAM: Not for MPI, OpenMP. There's now way to do MPI automatic parallelization.

AUDIENCE MEMBER: (Inaudible.)

DR. KALYANASUNDARAM: I would like to see that not happen basically because people like me will lose our jobs.

AUDIENCE MEMBER: You see, that's precisely the point. We do that. We obsolete ourselves all the time, and that's the way we make progress. That's why IBM had its little hiccups for a while because they didn't want to obsolete themselves, and it is not until you make things like MPI. MPI is too hard. MPI is too hard.

DR. KALYANASUNDARAM: It is too hard.

AUDIENCE MEMBER: If you don't make it easier, people are just not going to program on MPI. They are going to find something else.

DR. KALYANASUNDARAM: Well, to build that type of a logic into a compiler is extremely difficult.

AUDIENCE MEMBER: I'm sure it is.

DR. KALYANASUNDARAM: There used to be a compiler from The Portland group called HPF, High Performance Fortran, and people are using that basically to take CM5 codes, codes written on the Thinking Machines, use the HPF compiler and automatically change them into whatever architecture you want. SMP systems or distributed message-passing systems.

So the compiler did the job, but it was not a very good job. If you go and rewrite it yourself you would get much better performance, but there was an automatic compiler which could do that, create all those distributed message-passing directives and things like that, but it was not the most efficient way. When a compiler does that, it's going to analyze a very small piece of the code at the time.

Most of the people who write MPI codes look for coarse-grained parallelism. So they look at parallelism on top of the code, and they say I'll run the same code on different pieces of data. How do I spit this whole program up? When people do OpenMP, it's more fine-grained. It's all the do loop level, and that's what the compiler is also doing.

Even with OpenMP if your loop is too big, the compiler will think there is some data dependency there. If you have subroutine calls within your loop, the compiler will think there is data dependency there and won't parallelize it, so you have to go and do it on your own.

It has to be a clean loop. As simple as, say, a vector sum or something. They are a little more complex than that. I have to give some credit to the compiler folks, but as simple as that for it to do any parallelism on it.

AUDIENCE MEMBER: Let me give you my naive version, okay? My naive version is that there are different levels of abstraction in any program. And you can look at the program in a very massive way, and then you can look at it in more detail and more detail, and each one of those is a difficult level of brand new parallelism.

Now, the problem that we have in parallelizing automatically, is that as humans we go and look at the whole problem and we immediately somehow can see at what level of abstraction it's is easiest to extract the parallelism. And there is parallelism and different degrees of parallelism at each one of the levels.

What nobody has done is precisely analyze in an automatic way what are these different levels of extraction. And software is hard to analyze in that sense, but not impossible. The fact that it's hard, doesn't mean it's not possible.

DR. KALYANASUNDARAM: It's not impossible. It's just that there's too much logic you need to build into the compilers to be able to do that.

AUDIENCE MEMBER: If you had a language, and I'm dreaming now, or if you knew how to change a language so that you could actually analyze those different levels of abstraction; in other words, if you could have a rougher description like some of those software engineering languages in which you have empty boxes, and then you build bigger empty boxes and so and so forth, I think it would be possible to improve on the automatic parallelization. But again, It will never happen in my lifetime, so.

DR. KALYANASUNDARAM: I hope it's not in my lifetime.

Any other questions? I hope this was useful.

last revised:11/21/00 05:52:30 PM
editor: pmateti@cs.wright.edu