Summer Institute on Advanced Computation, August 20-23, 2000, College of Engineering & CS, Wright State University, Dayton, Ohio 45435-0001
DR. PRABHAKER MATETI: I'm going to be talking about three separate things today. The first is Beowulf cluster construction, and depending on your interest, I can go slow on this. I happen to be quite knowledgeable about Linux, per se, but I'm a relative newcomer to the Beowulf field. And I also want us to be able to get some hands-on experience with PVM and MPI.
I know from talking to several of you, that some of you are very comfortable with PVM; others are more comfortable with MPI and so on. So I'm going to aim for a rather low level of both PVM and MPI here, and then when we are in lab, I'm going to show the standard examples that come with the distribution and let you go from there.
To repeat, a Beowulf cluster is a parallel computer built from commodity hardware and open source software. Whether that is really a must to be called that is debatable. Typically the clusters have these characteristics: you try to have as high-speed a network as you can afford. As you can gather, one of the driving forces behind all of this is to build a supercomputer of power equal to some traditional supercomputers, but with far fewer dollars spent in constructing it. So there is no particular limit as to which high-speed network one must have and how far we should go before things become not very useful.
So the first thing we do in trying to reduce the costs is to use off-the-shelf hardware and open source software as much as possible. And typically the programming is done using these libraries that we have been talking about. And I will go into these libraries in greater detail later on today.
Here is the original statement about their goals. “Beowulf is the project that produced the software for off-the-shelf clustered workstations based on commodity PC class hardware, a high bandwidth internal network and Linux operating system.”
This was a statement written several years ago. At that time the typical Ethernet LAN that one had was ten megabits per second, and therefore the so-called bonding of multiple cards into higher bandwidth was a major item in the construction of the clusters. With a hundred megabits per second already quite inexpensive now, it may not make much sense. In fact, we have an experimental setup that you will see later on today with gigabit networks, and it is difficult to find applications that will push that network to its typical performance. So maybe this high bandwidth has now become, well, the commodity bandwidth that you will typically find.
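To make the link speeds concrete, here is a minimal sketch of the arithmetic. It assumes an ideal link with no latency or protocol overhead, and the function name, the 10 MB message size, and the two-card bonding case are all hypothetical choices for illustration.

```python
def transfer_seconds(message_bytes, link_megabits_per_s, bonded_links=1):
    """Ideal time to move a message, ignoring latency and protocol overhead."""
    bits = message_bytes * 8
    capacity_bits_per_s = link_megabits_per_s * 1_000_000 * bonded_links
    return bits / capacity_bits_per_s

MESSAGE_BYTES = 10_000_000  # hypothetical 10 MB message

# (label, megabits per second per card, number of bonded cards)
LINKS = [
    ("10 Mbit Ethernet", 10, 1),
    ("two bonded 10 Mbit cards", 10, 2),
    ("Fast Ethernet", 100, 1),
    ("Gigabit Ethernet", 1000, 1),
]

for label, mbps, cards in LINKS:
    print(f"{label}: {transfer_seconds(MESSAGE_BYTES, mbps, cards):.2f} s")
```

Under these assumptions bonding two slow cards only halves the transfer time, while one step up in Ethernet generation cuts it tenfold, which is why bonding loses its appeal once the faster cards are cheap.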
Of course, it produces a cluster of power comparable to your typical supercomputers with a very low implementation cost, and in general it can scale quite well. It can grow and shrink quite well. In a minute, I'll show you a little slide that tells you how the size of these Beowulf clusters has been growing in the last few years. At the moment, I don't know the max number on this, but we know of examples where there are clusters of a thousand nodes.
Again things are happening rapidly enough that it is difficult to keep up with this. Here is one that we think is probably the best. The biggest Beowulf has a thousand nodes, and obviously this picture is not capturing all of it. You will see that there are a whole lot of PCs left in their original cases, and it is surprising that they do that. They might as well take them out of the cases and leave them in the open air. And there is lots of wiring, and this is another one of the things you will see: because of the number of PCs there, you cannot really have the physical space to connect the keyboard and the monitor and the mouse even if they came with it. So in general, not because you are trying to save dollars on these three components, but for physical space reasons, you will go buy some kind of a switch that will connect these three devices to one of these machines.
So when there is some node that is slightly misbehaving, or what have you, and you need to log in, et cetera, you use the switch to connect that set of I/O devices to that particular node and then deal with it.
AUDIENCE MEMBER: Do you know what interconnect they use, John Koza, he's doing genetic coding. Do you know what interconnect they have?
DR. PRABHAKER MATETI: This is straight hundred megabit from what I can tell. I'm sure they are using a lot of switches. Exactly how the switches are connected to each other, we don't know. But I'm pretty sure these are really high quality switches there. A torus network is typical in these.
AUDIENCE MEMBER: You know Michael Raimer is doing some of that stuff.
DR. PRABHAKER MATETI: Okay. Here is another that is still in the process of being built. Now, they are trying to build an even bigger Beowulf in this project. If you're interested, visit the site there and you'll see some more detail.
From this point on I'm not sure at what level I should talk about various things, but here is a typical PC. Again, there was a time when people trying to build clusters would select each one of these. I think it is no longer a reasonable thing to do. You just go to a reliable vendor, such as say HP or Dell or whatever, and you probably will go after a typical so-called server configuration. But even as they say, the K-Mart brand PCs will do fine I think.
The thing to keep in mind is that a number of these components, even from high quality vendors, probably are not as good. So you would want to check the details of their motherboard spec. For example, the amount of cache has been going down, as you may know, on the Pentium III chip itself. What used to be a standard 512K cache is now 256K.
So you would want this cache to be as large as possible because of the intended applications. They tend to be numerical computations with a lot of tight loops and so on. So having a large amount of cache at all levels, L1, L2, L3, et cetera, is good.
And then there is this thing called front-side bus speed. It is already nearly obsolete to talk about 100 MHz. 133 MHz is typical. There are higher speeds now available on server modules. And obviously, the higher the better. And memory expansion is another thing one should keep in mind.
The typical boards these days can easily do 512 megabytes. As you all know compared to upgrading the CPU versus memory, the general wisdom is that you should put more memory before you put a higher CPU. So the higher the better, and so you would want to worry about that. Lastly some of the servers may not have enough expansion slots and you would want to go for the largest number you could have.
Several motherboards come with built-in options. Of these, probably the only built-ins you should go for are the standard IDE and floppy kinds of things. All the rest you probably don't want to be in there. Now, because of the large number of nodes that are going to be hooked together, you would want the board to have better than typical monitoring of the hardware. It should be able to check the internal temperature. Temperature sensing of the CPU, et cetera, is now standard, but the general air temperature in the box also should be sensed, and the board should be able to turn the machine off, et cetera. And you'll see that beyond a certain reasonably small number of machines in a cluster, even trying to boot the cluster can become a major problem. Already in our lab upstairs, for example, we have about 25 machines, and they need to be booted in a particular order. There's a server, and that server had better be up before the remaining 24 nodes come up. So you want something like this so-called Wake-on-LAN. Most LAN adapters, built-in or otherwise, now have this on the board; you only need to enable it and have the appropriate software for it, so that things can be booted reasonably conveniently.
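The "appropriate software" for Wake-on-LAN can be quite small. The sketch below builds the standard magic packet (six 0xFF bytes followed by the target MAC address repeated sixteen times) and shows one way to broadcast it over UDP. The MAC addresses, the boot-order list, and the choice of port 9 are assumptions for illustration and are not taken from the lab described in the talk.

```python
import socket

def magic_packet(mac):
    """Build a Wake-on-LAN magic packet: 6 bytes of 0xFF, then the MAC 16 times."""
    mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(mac_bytes) != 6:
        raise ValueError("a MAC address must be 6 bytes")
    return b"\xff" * 6 + mac_bytes * 16

def wake(mac, broadcast="255.255.255.255", port=9):
    """Send the magic packet as a UDP broadcast (port 9 is a common convention)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(magic_packet(mac), (broadcast, port))

# Hypothetical boot order: the server's MAC first, then the compute nodes.
BOOT_ORDER = ["00:11:22:33:44:55", "00:11:22:33:44:56"]

packet = magic_packet(BOOT_ORDER[0])
print(len(packet))  # 6 + 16 * 6 = 102 bytes
```

A real boot script would call `wake` on the server first and wait until the server answers before waking the remaining nodes; that waiting step is left out here.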
And, of course, not all boards are compatible with Linux. This doesn't mean there is some hidden flaw either in Linux or in the motherboards; it may mean the motherboard has very new devices and the Linux guys have yet to upgrade their code to deal with this stuff. So if you typically go for a motherboard that's been around for, let's say, six to seven months, that probably is enough to guarantee compatibility with Linux.
As I said, please stop me and ask me to spend more time, because I'm not sure how useful this information at this level of hardware detail is to you, and I do want to do a little bit of MPI and PVM today.
Surprisingly, the CPU does not matter to these clusters. Whichever appeals to you in terms of price and performance, any one of these CPUs will do fine. As you know, if you want SMP boards, SMP works properly only with the Intel Pentium chips. Even the Celerons, depending on the board, may or may not work right. But Celerons are proving to be quite fast for the dollars you spend. It is something that you may actually want to benchmark before you decide to buy the Pentium IIIs or IVs instead of the Celerons.
The Athlon has been pushing the clock frequency far more aggressively than Intel. I think right now it is already several months old to have one-gigahertz speeds.
In terms of memory, 100 MHz SDRAM is already obsolete. 133 MHz is now common. Now, Rambus is being pushed as the kind of memory to have for high-speed applications, but it is something like four to five times as expensive as SDRAM as of now. When the price comes down, again, you will all go with Rambus. Rambus is listed somewhere around 400 MHz. So depending on the price, you may or may not go for that.
Hard disk. This is where you may have to update some of the advice you see in the Beowulf FAQs that we have included. Those frequently asked questions and answers were written several years ago.
It is generally agreed, of course, that SCSI devices are faster than IDE devices; but compared to the price increase there, most people think it is not worth it to go after SCSI devices, particularly because it means you have the added expense of controllers and so on. IDE itself has been increasing its speeds. The so-called ATA-33 is now on the low, low-end machines. The typical machine today comes with ATA-66, which roughly speaking stands for 66 megabytes per second in burst mode. That has already been upped to 100. So you could be looking for that, but there are not many motherboards as of now that have ATA-100 built in.
The capacities of IDE devices have been growing very, very fast. Today you could walk to the local CompUSA and buy a 40-gig hard disk for something around $160. So it no longer makes sense to leave hard disks out of these clusters for price reasons. You may still try to boot off NFS and so on for other reasons, for administrative reasons, but not for reasons of cost.
As you go into these high performance clusters, RAID systems become important. RAID is a technology that basically uses multiple hard disks. In an abstract way they all act like one hard disk, but they keep mirrored images on the other hard disks. So it improves the reliability, and nearly all the RAID controllers have drivers within Linux.
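The mirroring idea can be sketched as a toy model: every write goes to all member disks, so a read still succeeds after one disk is lost. This only illustrates the mirrored form of RAID described above (other RAID levels use striping or parity instead), and the class and method names are hypothetical, not any controller's interface.

```python
class MirroredVolume:
    """Toy model of mirroring: each block is written to every healthy disk."""

    def __init__(self, disks=2):
        self.disks = [dict() for _ in range(disks)]
        self.failed = set()

    def write(self, block, data):
        for index, disk in enumerate(self.disks):
            if index not in self.failed:
                disk[block] = data

    def fail(self, disk_index):
        """Simulate losing one disk and everything stored on it."""
        self.failed.add(disk_index)
        self.disks[disk_index].clear()

    def read(self, block):
        for index, disk in enumerate(self.disks):
            if index not in self.failed:
                return disk[block]
        raise IOError("all mirrors have failed")

volume = MirroredVolume(disks=2)
volume.write(0, b"simulation results")
volume.fail(0)              # the first disk dies
survivor = volume.read(0)   # the data is still served from the mirror
print(survivor)
```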
As I said earlier, compute nodes don't need a keyboard, monitor, or mouse. Not only do they not need them; when you are putting such a large number of rather large boxes together in a room, there is just no place for all these devices to be kept. You typically would want some kind of a front-end that shows various things nicely, et cetera. And in some systems you will need to make sure that the BIOS is able to boot beyond the checking that it does immediately after power is turned on, the so-called POST check.
Some of these BIOSes will stop there if they detect that the keyboard is not there. You want to check that the motherboard has the appropriate BIOS. Once in a while, to troubleshoot, you would want to connect the keyboard, monitor, and mouse, so there are various switches available to hook them up.
Now to the interconnection network. ATM is still too expensive and people are not going for it. Myrinet was fairly common at the high end until a year or so ago. But Gigabit Ethernet has become quite reasonable. As for Fast Ethernet, a typical Fast Ethernet network card these days is on the order of ten dollars. So it just doesn't make sense not to use Fast Ethernet anymore.
The TCP/IP stack, on the other hand, at least in the Linux world, has not been upgraded and optimized, and so typically even on these 100 megabit networks, you may see bandwidth at the low end, in the range of 32 to 60 megabits per second.
Now, there are some cards that Linux may not have the drivers for, so you would want to make sure that you buy the network cards for which there are reliable drivers. This is a little comparison that basically shows that as you go over into these higher block sizes the faster networks peak at a certain point and there seems to be some anomaly there and then they drop and then they pick up to a reasonable limit there.
This is another point that comes up again and again. Unless you tweak various things at the TCP/IP level, at the level of drivers and code in Linux, the block sizes are going to be rather low. So unless you tweak your Linux code and recompile, they are probably operating somewhere there. Once you set it up, you want to recompile your kernel very carefully.
The gigabit cards are now very well supported, and for no particular good reason I'm showing you two of them. The bottom one happens to be the one we bought for our network. It's a Netgear GA 620. It goes for something like $230 per unit, and it has these optical connections, so you need to buy fiber cables.
Myrinet used to be the network to use in really high-speed versions of these clusters, but I think it has almost as quickly fallen out of favor now.
You need to think a little bit about how you install the various things and this kind of stuff probably makes sense to the system administrator types in the audience, so I'm going to go over it quickly, but if it appeals to you ask me about this later.
The basic Linux installation has now become very trivial. Almost as simple as installing your Windows 98. It used to be complicated; it no longer is the case.
Of course, there are lots of cautions here that you need to be aware of. Linux is still not fully plug-and-play. And if you left your devices in that mode, and this typically can be turned off in the BIOS, frequently Linux will freeze somewhere during the installation. Occasionally we have seen it freeze. It installs fine, but the first time it boots before it completes boot it will freeze, and along the way the file system would have been corrupted enough that you have to redo the whole thing. So until you troubleshoot everything, you basically turn it off and then install it and make sure everything is good and then one device at a time you can turn it back on to plug-and-play.
For convenience, the typical clusters are set up to give a single system view. A single system view, of course, is good, but you do not have to share the file system throughout the net, including the binaries themselves. You will get better performance if you have a locally installed operating system and so on. In our lab, the OSIS lab, that is the case.
I should say a few words about the lab. Basically that lab is not a dedicated Beowulf cluster. What you would call it is come-and-go kind of nodes, but once in a while we can set it up to behave like a cluster. So the only thing shared in our lab is this subdirectory called /cluster, and that is mounted from one of the machines. It is not a specially made server; it happens to be one of the ordinary machines that's just labeled as a server and acts as a server.
Again the details on this slide probably make sense to the system admin types. Make sure you set up accounts in such a way that if a user can log in on one machine, he can log in on the other machines without having to go through reauthentication. In order to install MPI, the simplest thing is to go to one of the sites and download the source code. There are various other vendors that will give you precompiled binaries, but probably the simplest and most reliable way is actually to get the source code, and it is so simple that you all can do it without understanding a bit about how the stuff is being built.
I have heard comments about these implementations, so let me say some positive things. We have tried both of these. We have not run large-scale MPI programs yet, but we have had no problems with either one of these implementations. It turns out that it is fairly trivial to make them coexist. So both MPICH, which comes from Argonne and Mississippi State University, and LAM, which originated at the Ohio Supercomputer Center and is now at the University of Notre Dame, seem good for us, and it so happens that LAM comes with better looking X Window oriented utilities that will deal with MPI.
So effectively installation is just this. You unpack the distribution. For those of you who know, the word for that is untar. So here comes a tar archive and you untar it. You run configure, which is the same as saying just type the command configure, don't type the word run, and there it goes. It takes a few seconds, and then you type make and there it is. Only make sure the binaries go into the right place, which is done by saying the prefix should be, in our case, /cluster/MPICH. Then you set up your path and environment to make sure the binaries are on your executable path, and you are ready to start compiling and running these programs.
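The steps just described (untar, configure with a prefix, make, then adjust the path) can be written down as a small sketch. It only assembles and prints the commands rather than running them. The archive name `mpich.tar.gz`, the assumption that it is gzip-compressed, the GNU-style `--prefix=` spelling, and the `bin` subdirectory are all assumptions; check the README of the distribution you download for the exact options.

```python
import os

def install_steps(tarball, prefix):
    """Return the commands from the talk as argument lists, in order."""
    return [
        ["tar", "-xzf", tarball],                # unpack ("untar") the distribution
        ["./configure", f"--prefix={prefix}"],   # run inside the unpacked directory
        ["make"],                                # build
    ]

def path_with(prefix, current_path):
    """Prepend the installation's (assumed) bin directory to a PATH-style string."""
    return os.path.join(prefix, "bin") + os.pathsep + current_path

# Hypothetical archive name; the prefix is the shared directory named in the talk.
for step in install_steps("mpich.tar.gz", "/cluster/MPICH"):
    print(" ".join(step))
print("PATH=" + path_with("/cluster/MPICH", "/usr/bin"))
```

Installing under the shared /cluster directory means this is done once rather than on every node.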
So indeed the MPICH installation and the PVM installation are similar. Each of these can happen in under ten minutes on these modern machines. Remember, we are installing these in a shared directory, so you do not have the problem of installing them on thirty different nodes. On the other hand, as I said repeatedly, it probably will give you much better performance to actually duplicate them.
I just put this heading there as a placeholder for me to remind me about this topic. When you have a room full of these PC's, power requirements become a thing to worry about. The typical room may not be wired to deliver, say, 50 amps or so, so you have to be careful about that. It so happens that the standard AC seems to cool them well enough. So even to a number such as probably a 100 or so PC's in a room about this size, the standard AC probably will be fine. This is one of the good side effects of PCs being made robust in terms of heat. Beyond that, obviously you need to cool them in a special way.
Now, I thought I would show you some performance figures, and here is a cluster for which we have performance figures. This is a few years old. This is a cluster called the Little Blue Penguin at this lab, LANL. Here is a little description of it: 64 SMP machines, each with two Pentium II CPUs at 333 MHz and not a whole lot of memory, but lots of hard disk space, and this is using Myrinet. Compare it to this traditional supercomputer: the pink one is the traditional supercomputer, the blue one is the cluster. So the performance of these clusters can be very, very good. This is when running this linear accelerator application.
AUDIENCE MEMBER: The higher graph is better performance or worse performance?
DR. PRABHAKER MATETI: The higher graph should be better.
AUDIENCE MEMBER: It says time. Is that time to do the job?
DR. PRABHAKER MATETI: I'll have to double check. It’s supposed to be better.
AUDIENCE MEMBER: I was just curious.
DR. PRABHAKER MATETI: You are right.
Obviously, it's a good platform for scientific applications, and this was the original intent of that. It can also be used for other applications such as image processing, et cetera, et cetera. And more and more these are being used for Internet servers. Some of you may know about this new search engine that has kind of taken over from all these other search engines, Google. Google now runs on a Linux cluster. And it's a very, very clever search engine. It's an extremely fast search engine, as you can judge by the performance it gives you, and it runs on Linux clusters.
Let me summarize this part of the talk, by giving you some pointers. These are the very standard ones. The original site is the beowulf.org. It has not been updated in looks for maybe a couple years, so it kind of looks dull, but the second one has lots of updated info. They have current events related to Beowulf showing up there. Then there is the extreme Linux, which has a bunch of new information and, of course, there is this IEEE task force on cluster computing.
With that, let me switch to the next topic. But let me take a break and ask if you have questions or comments or do you want me to elaborate on any one of these things.
Some are busy taking down the URLs, so this is a good time to mention this. I'm going to have all of our slides on the web; these are PowerPoint slides. They will be downloadable for you from our web site. This will probably happen as of Thursday morning. So you don't have to take these URLs down. The things that you are seeing on the screen here, you will have them completely downloadable.
AUDIENCE MEMBER: Why do you feel that the Myrinet network is falling out of favor?
DR. PRABHAKER MATETI: Primarily because of price. With gigabit cards in the range of $200 apiece, it has to show significant performance advantage.
AUDIENCE MEMBER: I don't mean to be controversial, but I feel that the prices are basically reversed. The NIC's are much cheaper when you look at the gigabit Ethernet, but the per port cost is kind of switched.
DR. PRABHAKER MATETI: Yes and no. We bought just recently a 4-port gigabit switch. That's about two thousand dollars. Now, when you go from 4 to 8 to 16 and so on, the cost is increasing. But I think the cost of switches will rapidly fall in the next few months. The 4-port switch is easy to afford right now, and we bought two of them.
So we tried to see what performance we get, and also now is the time to announce how old our cluster is. We couldn't do any performance estimates yet, but we could have two little clusters, of four machines each, that are running at one gigabit. Instead, because of the large number of people that we have, I chose to have a larger cluster of 25 machines running on a hundred megabits.
Now, our cluster, depending on how you count, is either a week old or less than two and a half days old. So if it crashes, I have a back-up plan of letting you telnet to the Ohio Supercomputer Center to try some of these MPI, et cetera, programs. Hopefully it will stay up all right and you will be able to try some things in the next hour and a half.
AUDIENCE MEMBER: This is a small point, but on the IDE hard drives, I was curious about the ATA 100. I thought that a lot of the hard drives, unless you get a very high spindle speed you can't really produce enough to keep up with the ATA 66.
DR. PRABHAKER MATETI: In a way that is what you might call a myth, which I think the SCSI group wants to continue to propagate. The good operating systems such as Linux and so on do a whole lot of buffering, and as you know 7,200 RPM is very common. 10,000 RPM IDE hard disks are available now, so it's not clear. Whether the SCSI hard disks are that much better and a must-have for two to two and a half times the cost, I don't know. For that money, people would rather have larger capacity.
See, the larger persistent, permanent storage medium is becoming important for us. Already we are at a situation where CDs cannot transport your data. It's only 640 meg.
AUDIENCE MEMBER: The newer ATA 100 disks in some cases, and this holds true for SCSI as well, are getting 35 megabytes per second on read performance. So with just a couple of disks it's possible to saturate that 100-megabyte per second channel. There is a 33 percent overhead in the IDE protocol too, so 33 percent of that is eaten up by the protocol.
DR. PRABHAKER MATETI: There's no question that in a pure analysis you will find that the SCSI devices are definitely faster. But if you do this at the level of the system, meaning you analyze the particular mix of jobs that you are running and you throw in the large amounts of buffering that a good operating system like Linux will do, whether the overall performance is better for the money is the question.
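The audience member's arithmetic can be checked with a few lines. The three numbers below are the ones quoted in the exchange; treating the overhead as a flat fraction of the burst rate is a simplifying assumption, and the variable names are hypothetical.

```python
import math

CHANNEL_MB_S = 100.0       # ATA 100 burst rate quoted in the discussion
PROTOCOL_OVERHEAD = 0.33   # fraction attributed to the IDE protocol
DISK_MB_S = 35.0           # per-disk read rate quoted in the discussion

# Bandwidth left for data after the protocol takes its share.
usable_mb_s = CHANNEL_MB_S * (1 - PROTOCOL_OVERHEAD)

# Smallest whole number of disks whose combined rate fills the channel.
disks_to_saturate = math.ceil(usable_mb_s / DISK_MB_S)

print(f"usable channel: {usable_mb_s:.0f} MB/s")
print(f"disks needed to saturate it: {disks_to_saturate}")
```

With these figures about 67 MB/s is usable, so two disks at 35 MB/s each are enough to fill the channel, which matches the "couple of disks" in the question.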
Remember, I think the deals guys don't often say this prominently enough, but the driving force is not just this commodity business. Commodity makes it cheaper. The whole thing has to be driven by price versus the performance you get for that price. That is my way of putting it using this word commodity.
Commodity, whether it's commoditied or not, if it is going to go for ten dollars a piece, I think the clusters will be made with that, and obviously because of the price they will become commodity.
AUDIENCE MEMBER: What is the price of like that SGI machine versus a cluster?
DR. PRABHAKER MATETI: I wish Oscar were here. We got our SGI machine just about six, seven months ago. He can give us an exact figure. When he comes back we'll ask him.
AUDIENCE MEMBER: (Inaudible.)
DR. PRABHAKER MATETI: I don't know the exact price. We'll find a definite number. And I also should check the Y scale. Was it time or something else or was it mismarked.
AUDIENCE MEMBER: Which system are you talking about?
DR. PRABHAKER MATETI: His question was what is the price of the SGI.
AUDIENCE MEMBER: Which system at Wright State are you talking about?
DR. PRABHAKER MATETI: We have the Origin 2K.
Jay, do you happen to know the price of this SGI machine we have.
AUDIENCE MEMBER: No.
DR. PRABHAKER MATETI: We'll postpone that question to Oscar because he will know the exact details of it.
AUDIENCE MEMBER: Might we be getting a hard copy of your charts?
DR. PRABHAKER MATETI: My slides? A hard copy we could make it happen, sure. But we are certainly putting this on the web.
Now, let me ask Oscar. There was a question. People are interested in the pricing figures on the SGI machine and the details of what it has.
(A nontranscribed discussion was had.)
DR. PRABHAKER MATETI: Now, you have heard the acronym many times, but here it is again. Parallel Virtual Machine. And today I'm going to go over this in some detail so that you are hopefully able to run at least programs written by other people, and you should be able to look through the code and understand some of the details.
The important thing to note here is that PVM can run on a network of really heterogeneous machines, including CPU's being different, operating systems being different. Obviously, you need to make sure you have compiled the programs for these multiple platforms before you could make use of that.
So effectively these are its components. On each machine that participates in this PVM, in this virtual machine, you would make sure there is this daemon running, pvmd3. You will see this later on, but if you are going to do this in a cluster, you can start it during boot time, which makes it convenient. Otherwise, PVM is set up in such a way that a nonprivileged, ordinary user can actually install PVM on multiple machines and make use of the network.
The other component that you will have reason to work with is called pvm, and this is like a command shell. With it you can configure the virtual machine in the sense of adding, deleting, et cetera, nodes, and you can check what is happening. If pvmd is not already started, it will detect that and start it.
So effectively a regular user can just type pvm, and the local host on which you did that will then have a pvmd running, even when you quit that pvm program.
And then there is this library. This is where you would have written your programs with this message-passing paradigm. Most PVM programs use this master-worker style of coding, and you would have linked your program with this library. We'll go through some details of the system calls and libraries. Effectively the libraries are static archives. These are their names here. And I have a little bit of detail in the next couple of slides.
This is the main library that you link with a typical C program, and it has routines to initiate and terminate processes on a given host. It can pack, send, and receive messages, and here comes again this barrier synchronization. There is a way to do that. It is also possible from the program to query and change the configuration of PVM during runtime.
So you can add and delete nodes as it runs. It also deals with data format conversion if the CPUs are of different byte order and so on. And it talks to the local daemon. This is the library if you are linking with Fortran programs. This is the library if you are making use of dynamic groups. We won't get into these details.
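To see why a library has to handle byte order at all, here is a minimal sketch using Python's standard `struct` module. It is not PVM code; it only shows that the same integer has different byte layouts on big-endian and little-endian machines, and that agreeing on one wire format lets either side recover the value. The sample value is arbitrary.

```python
import struct

value = 1_000_000

big = struct.pack(">i", value)     # most significant byte first
little = struct.pack("<i", value)  # least significant byte first
print(big.hex(), little.hex())

# Reading big-endian bytes as if they were little-endian gives a wrong number.
misread = struct.unpack("<i", big)[0]

# If sender and receiver agree on the wire format, the value survives the trip.
recovered = struct.unpack(">i", big)[0]

print(misread == value, recovered == value)
```

A message-passing library that packs data into a common format before sending, and unpacks on receipt, spares the application from doing this conversion by hand on every heterogeneous pair of hosts.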
A typical running PVM application looks like that. I think colors would have been good, but here it is. The shaded things show you which components are where. Every host that participates, so that the whole thing becomes one PVM, will be running one daemon. Then the application processes are started by this daemon.
Here is the simplest PVM program that makes sense in the world of PVM itself. I have deleted a bunch of declarations so it fits on a slide, and I've also deleted some code that I'll emphasize as I go.
Here is the simplest PVM Hello World. This is making use of two processes. On the left-hand side is a program that we will compile and call hello; on the right-hand side is a program that you would compile and name hello_other.
So with that, you'll see that one of the first things we do in a PVM program is to invoke pvm_mytid. Luckily this can be called any number of times, but you would typically want to call it somewhere at the very top, and that gives you essentially what we would call the process ID. This is not the same as the process ID in the Unix world; this is PVM's way of numbering this particular process.
Then you can call, any number of times, this call from the library to spawn, to fork off, processes. When you are doing PVM programming you would not do straight forking with Unix sockets or Unix IPC kinds of mechanisms. Instead, you would do whatever forking you want through this pvm_spawn.
And in this case you are telling it the name of the program. The name of the program is hello_other. So this will talk to the local daemon, this pvmd, and it tries to find where this program hello_other is. It is aware of the architecture, et cetera, and will look at the appropriate paths trying to discover the executable whose name is hello_other.
And hopefully it will succeed, in which case the value that CC gets will be a nonnegative value. Based on that, you will do something on the left-hand side of the code. Eventually, when you are all done with the application, you are supposed to invoke pvm_exit. So that is how the typical program looks.
Let's look at some of the dotted items. So assuming that worked fine, we would have a number one there, and we would then try to receive or send, et cetera, et cetera. In this example here we are trying to receive whatever package somebody would send. The minus one there says any sender, any kind of message. And pvm_bufinfo gives the information regarding who the sender was. And pvm_upkstr, unpack string, unpacks what we received into the buffer that you would have declared in that process. So at that point you have the message in this local memory called buf, and here we are trying to print the task number and the content of the buffer you just received. It would say something like hello my number is this.
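The "minus one means any sender, any kind of message" rule can be illustrated with a small simulation. This is not the PVM library; the `Mailbox` class, its methods, and the sample task number are hypothetical stand-ins that mimic only the wildcard matching and the step of finding out who the sender was.

```python
ANY = -1  # wildcard, mirroring the -1 arguments described in the talk

class Mailbox:
    """Toy message queue with sender and tag matching."""

    def __init__(self):
        self.pending = []

    def send(self, sender, tag, text):
        self.pending.append((sender, tag, text))

    def recv(self, sender=ANY, tag=ANY):
        """Return the first message matching sender and tag, or None."""
        for index, (msg_sender, msg_tag, _) in enumerate(self.pending):
            if sender in (ANY, msg_sender) and tag in (ANY, msg_tag):
                return self.pending.pop(index)
        return None  # a real blocking receive would wait here instead

box = Mailbox()
box.send(sender=0x40002, tag=1, text="hello, world")

# Accept anything, then inspect who it came from, as the hello program does.
sender, tag, text = box.recv(ANY, ANY)
print(f"from t{sender:x}: {text}")
```

Passing a specific sender or tag instead of `ANY` narrows the match, which is how a master can wait for a reply from one particular worker.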
And here is the complete program without any deleted lines. So here is the program that I encourage you to copy and compile and try and see if it will successfully run when you go visit our lab later today.
AUDIENCE MEMBER: Are you going to have handouts?
DR. PRABHAKER MATETI: Yes. Handouts probably not, but we have a little projection upstairs. So we'll have that.
AUDIENCE MEMBER: Because it's not readable.
DR. PRABHAKER MATETI: I know. All of that code, I'm not printing that. That you will read on your screen when you log in on our machines.
Here is a little diagram that shows how things work in this master-worker paradigm. The master starts and he would spawn off a bunch of processes, and one of the first things each of these workers will do is to so-called enroll into this PVM system. That is what was happening when we did pvm_mytid.
After that, they are going to be receiving some kind of information regarding the computation they should be performing from the master, so that is this receive. Then each of these workers will be crunching on their own tasks and their own jobs. Eventually they have answers and they will send them back and the master is collecting all of these answers at the top. That is how a typical computation goes.
And this is trying to show you the same thing without getting into the timing of it, but illustrating what goes where and how.
Here are some details, and I think I'm going to rush through this so we could have a quick enough introduction to MPI at the same kind of level so you can run this code.
pvm_mytid is something I said you will be calling in the beginning. This does the so-called enrolling. As a result, the caller gets a unique task identifier. If the calling process is already enrolled, it simply returns the previously assigned number. And I'm going to follow this style of showing you, at the bottom, the exact syntax of how you would make use of it.
pvm_spawn is the one that starts new processes. You can give this call additional detail and it will try to find an appropriate host that has the proper kinds of things. In the first example here, there was no particular choice being made, so it will find whichever node happens to be in your list at the top. The second one says I want the RS6K architecture, and only on that kind of architecture should you find a host and fork a process.
pvm_exit. This is the last thing that you will do in each one of these processes so that all the resources that were given to that process are reclaimed.
pvm_addhosts is something you could invoke through the PVM console program, but if you are trying to do this through your PVM program you would call this. And in the C language interface, you basically give it an array of host names and some additional information. In the example there we are trying to add four hosts at the same time.
And obviously there is a way to delete the hosts. And this is a routine that you will use to pack before you can send. This is something that we should be aware of. As I said, PVM is a message-passing library. Before a message can be formed, you need to pack the message yourself. If it is a structure, well, then you need to make sure you pack it element by element, and it can be done recursively. But effectively the job of packing a data type is yours.
And after you pack, there is this hidden so-called active send buffer that gets filled as you call this. This word active is a little misleading. It is not active in any typical sense of the word. We could say instead it is the current send buffer.
There is the corresponding reverse that unpacks what you get. Again, there is a receive buffer that is the current receive buffer. Having received some message, that's where it is. And until you unpack it, it is not in your own local memory. If you remember, in the hello world program we did the unpack string to get the data from the sender.
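The point about packing a structure element by element, and unpacking it in the same order on the other side, can be sketched in plain C. This is only a model: it makes no PVM calls, and the names `MsgBuf`, `Particle`, `pack_particle`, and `unpack_particle` are hypothetical stand-ins for the hidden current buffer and the library's pack and unpack routines.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the hidden "current" buffer. */
typedef struct {
    unsigned char data[256];
    size_t len;   /* bytes packed so far */
    size_t pos;   /* read cursor used while unpacking */
} MsgBuf;

/* An application structure that has to be packed field by field. */
typedef struct {
    int id;
    double weight;
    char name[16];
} Particle;

static void pack_bytes(MsgBuf *b, const void *src, size_t n) {
    assert(b->len + n <= sizeof b->data);   /* buffer must have room */
    memcpy(b->data + b->len, src, n);
    b->len += n;
}

static void unpack_bytes(MsgBuf *b, void *dst, size_t n) {
    assert(b->pos + n <= b->len);           /* cannot read past packed data */
    memcpy(dst, b->data + b->pos, n);
    b->pos += n;
}

/* The sender packs each element of the structure in a fixed order. */
void pack_particle(MsgBuf *b, const Particle *p) {
    pack_bytes(b, &p->id, sizeof p->id);
    pack_bytes(b, &p->weight, sizeof p->weight);
    pack_bytes(b, p->name, sizeof p->name);
}

/* The receiver must unpack in exactly the same order. */
void unpack_particle(MsgBuf *b, Particle *p) {
    unpack_bytes(b, &p->id, sizeof p->id);
    unpack_bytes(b, &p->weight, sizeof p->weight);
    unpack_bytes(b, p->name, sizeof p->name);
}
```

Until `unpack_particle` runs, the data sits only in the buffer, which mirrors the remark that a received message is not in your own local memory until you unpack it. A real library would also deal with byte order between different architectures, which this sketch ignores.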
Here is a routine that would send the current buffer, and that is why the buffer is not an explicit argument in here. The receiver's task ID is the first argument, and there is a user-given tag so that the receiver can distinguish among the many messages he may receive from this particular sender.
This combines the two functions to a certain extent, and depending on how complex your program is, you would do your own packing and then send.
This is a multicast version of sending, and it sends this to that many tasks whose number is given here and whose corresponding task ID's should be in an array called the tid's.
This is how you would receive. A receiver would name the sender from which he wishes to receive, and from that sender there may have been multiple messages, and depending on the message tag, you could selectively receive some messages.
pvm_nrecv. This is the same as pvm_recv except this is a nonblocking receive operation, so you definitely would want to check the result value given by this. Whereas, typically in the previous example of pvm_recv, it will wait for that operation to complete.
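To make the difference between selective receiving and a nonblocking receive concrete, here is a small plain-C model of a mailbox. It makes no PVM calls; `Mailbox`, `mbox_put`, and `mbox_try_recv` are hypothetical names, and `ANY` plays the role of the minus one wildcard from the hello example. The try-receive returns 0 immediately when nothing matches, which is the reason the result of a nonblocking receive has to be checked.

```c
#include <assert.h>

#define ANY (-1)          /* wildcard: any sender, or any tag */
#define MAILBOX_CAP 16

typedef struct { int src; int tag; int payload; } Msg;

/* Messages are kept in arrival order. */
typedef struct { Msg q[MAILBOX_CAP]; int count; } Mailbox;

/* Deposit a message; returns 0 on success, -1 if the mailbox is full. */
int mbox_put(Mailbox *mb, int src, int tag, int payload) {
    assert(mb->count >= 0 && mb->count <= MAILBOX_CAP);
    if (mb->count == MAILBOX_CAP) return -1;
    mb->q[mb->count].src = src;
    mb->q[mb->count].tag = tag;
    mb->q[mb->count].payload = payload;
    mb->count++;
    return 0;
}

/* Nonblocking receive: returns 1 and fills *out with the oldest message
   matching (src, tag); returns 0 right away if no such message exists. */
int mbox_try_recv(Mailbox *mb, int src, int tag, Msg *out) {
    for (int i = 0; i < mb->count; i++) {
        const Msg *m = &mb->q[i];
        if ((src == ANY || m->src == src) && (tag == ANY || m->tag == tag)) {
            *out = *m;
            /* Close the gap so the remaining messages keep their order. */
            for (int j = i + 1; j < mb->count; j++) mb->q[j - 1] = mb->q[j];
            mb->count--;
            return 1;
        }
    }
    return 0;
}
```

A blocking receive would behave like calling `mbox_try_recv` repeatedly until it returns 1, with the caller making no progress in the meantime.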
There are some collective communication primitives. By collective we mean the following: that all the processes that are running on all the hosts that are running pvmd need to participate in this.
pvm_barrier. This is the barrier synchronization being done in PVM there. So all the callers, all the processes that have this line of code in there, will be waiting until all of them arrive at that point. In this case five of them, whose names are worker, should be reaching there.
pvm_bcast. This is the broadcasting of the data that is in the current buffer. You give a list of workers in there and luckily it is not sent to the sender itself, even though it is in the same group.
Now, these two words, the broadcast, gather, and the next couple, you'll see them again in MPI also.
pvm_gather. This specifies a member, and that specified member receives messages, and he would gather all of them into this large array. So, for example, if a given worker is sending one row of a matrix, all of those rows put together will become the matrix. That is what we mean by gather.
pvm_scatter. There is a corresponding scatter. This is the reverse of it. So in order to take one matrix, let's say, and then give one row of it to an individual worker, you would use pvm_scatter. It takes this matrix and kind of cuts it into these rows, and one row at a time is then given to the workers. And root here is not a root in the sense of some kind of a tree structure, but root simply means the invoker of this. And again, one more time, repeating this. In order for pvm_scatter to work all the processes should be invoking this.
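The matrix-row picture of scatter and gather can be modeled in a few lines of plain C. This sketch makes no PVM calls and runs in a single process: `scatter_rows`, `gather_rows`, and `scale_row` are hypothetical names, and each worker's private memory is represented by a separate row buffer.

```c
#include <assert.h>
#include <string.h>

/* Scatter: the root's row-major matrix (nworkers rows of rowlen ints) is
   cut into rows, and row w is copied into worker w's private buffer. */
void scatter_rows(const int *matrix, int nworkers, int rowlen, int **local) {
    assert(nworkers > 0 && rowlen > 0);
    for (int w = 0; w < nworkers; w++)
        memcpy(local[w], matrix + (size_t)w * rowlen,
               (size_t)rowlen * sizeof(int));
}

/* Stand-in for the computation each worker performs on its own row. */
void scale_row(int *row, int rowlen, int factor) {
    for (int j = 0; j < rowlen; j++) row[j] *= factor;
}

/* Gather: the reverse of scatter. Row w of the result matrix is taken
   from worker w, so all the rows put together become the matrix. */
void gather_rows(int **local, int nworkers, int rowlen, int *matrix) {
    assert(nworkers > 0 && rowlen > 0);
    for (int w = 0; w < nworkers; w++)
        memcpy(matrix + (size_t)w * rowlen, local[w],
               (size_t)rowlen * sizeof(int));
}
```

In the real library each of these copies is a message between processes, and every member of the group has to make the call, which is what the loops over `w` stand in for here.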
pvm_reduce. This is when the master receives all the results and wants to compute the total result in some way. And here is the binary operator that works on two operands. If you remember, I had described to you the general reduction operation. So you have a whole array of numbers that you got, and this binary operator is inserted between every pair of them. And the entire operation is performed by the group of processes.
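The idea of inserting a binary operator between every pair of values can be written down as a fold. The plain-C sketch below makes no PVM calls; `BinOp`, `reduce_linear`, and `reduce_tree` are hypothetical names. The pairwise version is only one possible way a group could share the combining work, not a description of how any particular library does it, and it gives the same answer as the left-to-right version only when the operator is associative.

```c
#include <assert.h>

/* A reduction operator takes two operands and returns one value. */
typedef int (*BinOp)(int, int);

int op_sum(int a, int b) { return a + b; }
int op_max(int a, int b) { return a > b ? a : b; }

/* Left-to-right fold: ((x0 op x1) op x2) op ... */
int reduce_linear(const int *x, int n, BinOp op) {
    assert(n > 0);
    int acc = x[0];
    for (int i = 1; i < n; i++) acc = op(acc, x[i]);
    return acc;
}

/* Pairwise reduction: in each round, neighbors at distance `stride`
   are combined, so the work is spread over the participants. */
int reduce_tree(const int *x, int n, BinOp op) {
    int buf[64];
    assert(n > 0 && n <= 64);
    for (int i = 0; i < n; i++) buf[i] = x[i];
    for (int stride = 1; stride < n; stride *= 2)
        for (int i = 0; i + stride < n; i += 2 * stride)
            buf[i] = op(buf[i], buf[i + stride]);
    return buf[0];
}
```

For the values 3, 1, 4, 1, 5, both functions yield 14 with `op_sum` and 5 with `op_max`.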
Here is how you would prepare to execute your PVM programs. The PVM programs are typically expected to be there in a directory such as this. Your home, and you'll probably set up -- this is all changeable by the way -- you will probably set up a subdirectory called pvm3, and in there you will probably have a directory called bin. And then, if you are interested in running multiple architectures, you would have a directory named Linux; and if you're dealing with, say, NT there will be a directory named NT and so on.
So the different binaries that are compiled from the same source code, but compiled for different platforms, will go into different directories. And you would also want the PVM library itself to be conveniently available. A simple way to do that is to have a symbolic link. Here is how you would compile. The whole thing is supposed to be one line. You have your myprog.c and you invoke your ordinary C compiler, and you make sure you tell it where the library is, and you are done.
AUDIENCE MEMBER: Is there a make file for this?
DR. PRABHAKER MATETI: There are make files for all of this. In the end, you don't have to really type all this stuff, but you do need to set up your executable path. So when you go upstairs to our lab, make sure that export path shows PVM binaries in it.
You can set up your own file of host machines that will participate in your PVM. This is an ordinary text file. You put in a bunch of names, and I'll explain to you about our machine names in a few minutes.
You can create, in general, your own .rhosts file that sits in your home directory. And here is an example. I just threw in a whole bunch of machine names. Most of them I think are made-up names.
And you would invoke this pvmd3 with that host file, and the PVM daemon will then run on all these hosts whose names you have given. Here is how you would execute it, as though it's a normal program, and it is a normal program.
And I suppose we should make sure how to quit PVM. The application components of course should have executed pvm_exit at the bottom of their code. To make sure that all of these daemons, PVM daemons themselves will go away, the simplest thing to do is to get into PVM console, so type PVM and then the command that you issue in there is halt. That will remove all the daemons.
So that's the end of that particular talk. And I still have one more to go and then I want to give a little bit of detail about our lab before we take the break. Any questions about this part?
Much of this MPI thing is going to parallel what we said about PVM, except for some execution paths and so on. So MPI, and this is why there are going to be, I'm sure, some questions, and then we will have hopefully some discussion with participation from you guys.
Why should one do MPI instead of PVM if it is so similar? And they are similar. This also works on a network of heterogeneous machines. Like PVM, there are multiple implementations. In the case of PVM, there is just one open source implementation. In this case there are actually three, but these are the two very well-known ones.
MPI people will often say they have formally specified the standard, but that word will not fly in the software engineering world, so I replaced it with rigorously specified. It is indeed a very carefully written document that describes the entire set of functions that MPI defines.
Of course, the typical implementation is such that you do have a high degree of portability. If you read deeply into MPI literature, they do acknowledge there can be problems in trying to port your MPI applications from one host to another wholly different host if the implementations are by different vendors.
But as long as an implementation such as MPICH is everywhere, then it is indeed very portable. Of course, it enables third-party libraries for a variety of things, and it can fully overlap communication.
One more thing, about which there was a little bit of confusion. What we currently have available generally is MPI 1.x. I think our version actually is 1.1 or so. MPI 2 has been defined. So some of the comments about, well, MPI does not have dynamic creation of processes et cetera, are taken care of in MPI 2. But these implementations are yet to become generally available.
So here are some of the features that MPI 2 has, but widely available vendor implementations, or for that matter open source implementations, are yet to happen. In terms of the total number of functions, I haven't counted them. Somebody gave me a count of 129. So I thought it was safe to say 125 plus. And the typical claim is that you could write a good enough MPI program using just about six.
I think the standard hello-world-type of program here, uses just about that number. So here is the simplest skeleton that I could think of, trying to illustrate the manager worker paradigm again. And on the left is real code. On the right-hand side is some comments regarding the specific things that are showing up on the left-hand side.
Now, this is probably a good time to bring up this acronym SPMD one more time. Here is a single program that's going to run with multiple data. And because of that goal, typically you would see somewhere at the very top this kind of if statement. An if statement where the process asks: What is my number? And if I am number zero, I am the manager. If I'm not number zero, I am the worker.
So it is kind of a stretch there to say it is single program. Well, it is single program, but right at the top we had branched off. The manager is then invoking a different routine altogether from the typical worker who is invoking the worker.
One more time, let's go through this initialization and finalization. In the case of PVM, you did this task ID finding, and it didn't hurt you to do it multiple times. Here it does matter that you do MPI_Init at the very top and you do this exactly once. It could have been a little more robust, but that is the way it goes. And similarly, at the bottom of your code, before you quit the process, you would invoke MPI_Finalize.
Here is the code for the manager. So the manager first of all finds out how many processes we have got. And this is the part that I was telling you about, that MPI 2 can have processes dynamically created. With the MPI we have, at the time of invoking your program you need to say how many processes you are going to have.
For example, we might decide to have ten processes, but that number ten is not compiled in. And so one of the first things you will do is ask what the size of the process pool is, and that call, MPI_Comm_size, gives you that number and returns that value in ntasks.
Once again, I haven't shown all the declarations necessary for this thing to compile fully.
AUDIENCE MEMBER: That is not the same thing as the number of processors you are using?
DR. PRABHAKER MATETI: No, not the number of processors but the number of processes that you have used. There is a command called mpirun that takes a parameter about the number of processes. So you may say something like 10 there and that would be that number.
So having determined that, and you being the manager whose number is zero, the processes have a numbering scheme of zero up to and including 9. So for that many workers you are going to find some way of capturing what the next piece of work is, and then you send some description of the work to the worker. In this little example my description of the work, “work”, is simply one integer. So the actual data is in this buffer. This is telling you how many of these items you are sending. The next argument, MPI_INT, is saying what kind of element these are. Then comes the number of the worker itself and some user-created work tag used for matching purposes.
And MPI_COMM_WORLD, we should spend some time on this separately. For now, let me just say that it stands for the group of these processes. So having started this program, we have 10 processes that are working together. And that is MPI_COMM_WORLD.
Now really sophisticated programming will involve splitting that so-called communicator into subgroups. There are very clever ways of doing that, but I don't think we'll be able to get to those kinds of details here.
Now the worker will have a typical structure like this, where he typically has a nonterminating loop because he's going to be told explicitly by the master I have no more work for you and now you can terminate.
So he receives a piece of work and there are a bunch of arguments that describe various things. Typically, you would want to find out who sent this, et cetera, and that result is given in that status, which is the last argument of MPI_Recv.
And then the worker would do whatever actual work is. And then in this code here I'm just using this made up name "doWork" and that is expected to produce some kind of result, and again in this example the result happens to be one number. And that number you are sending back to your master in this one line of code there.
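The shape of the manager and worker code described here can be modeled without any MPI calls. In the plain-C sketch below, everything runs in one process and a "send" is just a function call. The names `WORKTAG`, `DIETAG`, `Message`, `worker_step`, and `run_manager` are hypothetical, and `do_work` (squaring an integer) is a made-up stand-in for the doWork routine mentioned in the talk.

```c
#include <assert.h>

#define WORKTAG 1   /* hypothetical tag: "here is a piece of work" */
#define DIETAG  2   /* hypothetical tag: "no more work, terminate"  */

typedef struct { int tag; int value; } Message;

/* Made-up stand-in for doWork: one integer in, one integer out. */
static int do_work(int work) { return work * work; }

/* One pass through the worker's loop body. Returns 0 when the worker is
   told to terminate, otherwise 1 with the result stored in *reply. */
int worker_step(const Message *in, Message *reply) {
    if (in->tag == DIETAG) return 0;
    reply->tag = WORKTAG;
    reply->value = do_work(in->value);
    return 1;
}

/* The manager (rank 0) hands items to ranks 1..ntasks-1 in turn and
   adds up the results. When the work runs out it tells every worker to
   stop, and reports how many acknowledged by leaving their loop. */
long run_manager(const int *items, int nitems, int ntasks,
                 int *workers_stopped) {
    long total = 0;
    int next = 0;
    assert(ntasks >= 2);   /* at least one worker besides the manager */
    while (next < nitems) {
        for (int rank = 1; rank < ntasks && next < nitems; rank++) {
            Message work = { WORKTAG, items[next++] };
            Message result = { 0, 0 };
            if (worker_step(&work, &result)) total += result.value;
        }
    }
    *workers_stopped = 0;
    for (int rank = 1; rank < ntasks; rank++) {
        Message die = { DIETAG, 0 };
        Message unused = { 0, 0 };
        if (!worker_step(&die, &unused)) (*workers_stopped)++;
    }
    return total;
}
```

In a real MPI program the manager and the workers are separate processes running the same executable, branching on their rank, and each call to `worker_step` corresponds to a send from the manager, a receive and a send in the worker, and a receive back in the manager.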
There is a standard example of computing pi. And I think you are probably not going to be able to see this, but this is an example I recommend you try. This is in our directories. It comes with the distribution. This is typical of many such applications. If you have a computation which is basically trying to do a summation of a series, the way you arrange your workers is like a cone. With p workers, you would ask one worker to do some operation on the first item, on item p + 1, on item 2p + 1, and so on, and then the next worker will start at item 2, then p + 2, 2p + 2, and so on.
So it's like a cone. And each worker will do the corresponding thing and they send the subresults to the master and it becomes the master's job to put the final answer together.
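The interleaved assignment of terms to workers can be sketched in plain C. This is a single-process model with no MPI calls, and it assumes the series is the midpoint-rule approximation of the integral of 4/(1+x^2) from 0 to 1, which equals pi; whether the lab example uses exactly this series is an assumption. The function names `partial_pi` and `estimate_pi` are hypothetical, and indices are zero-based here.

```c
#include <assert.h>

/* Partial sum computed by one worker. With nprocs workers, the worker
   with this rank handles terms rank, rank + nprocs, rank + 2*nprocs, ... */
double partial_pi(int rank, int nprocs, int n) {
    double h;
    double sum = 0.0;
    assert(nprocs > 0 && rank >= 0 && rank < nprocs && n > 0);
    h = 1.0 / n;                       /* width of one interval */
    for (int i = rank; i < n; i += nprocs) {
        double x = h * (i + 0.5);      /* midpoint of interval i */
        sum += 4.0 / (1.0 + x * x);
    }
    return h * sum;
}

/* The master's job: put the final answer together from the subresults. */
double estimate_pi(int nprocs, int n) {
    double pi = 0.0;
    for (int rank = 0; rank < nprocs; rank++)
        pi += partial_pi(rank, nprocs, n);
    return pi;
}
```

Because every term index belongs to exactly one worker, the estimate depends on the number of workers only through floating-point rounding. In an MPI version the loop in `estimate_pi` would be replaced by a reduction over the group.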
Now because our processes are all created up front, the group membership is static. As a result, you also have this good side effect that there are no race conditions caused, and group formation is collective, meaning all the members will get to know about it.
And here are some of the details of the various library calls that we have used so far. The general form is that you give the buffer to send, and you then describe the quantity of data in it. There are N units of data in there and each item of data is of this MPI type, which you should give, and there is a destination process, which is a number in the range of zero to nine, if you chose a total of ten, and a certain work tag and a certain group that you would have chosen.
The typical group, in the simple example, is MPI_COMM_WORLD. And MPI_Recv has almost exactly the same structure. At the end you have the status in which you get some information regarding the sender.
Well, send and receive couple together in the following sense: The sender's destination, of course, should be a valid process rank. Rank is the word MPI uses for the identifier in that range of zero through nine. The receiver should, of course, also specify a valid source. The communicator, that was this MPI_COMM_WORLD, has to be the same for the sender and the receiver. This is where the notion of group comes in. Unless they belong to the same group, process A cannot send to process B; and if you want to send between groups, then you need to do a little more sophisticated programming. And the tags, of course, should match, and the message data types must match; and the receiver's buffer should be large enough. When all of these things are okay, then the send-receive occurs.
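The list of conditions for a send and a receive to pair up can be written as a single predicate. The plain-C sketch below is a model only, with no MPI calls: `SendDesc`, `RecvDesc`, `ElemType`, and `send_matches_recv` are hypothetical names, a communicator is represented by a bare integer, and wildcard sources and tags are deliberately left out.

```c
#include <assert.h>

/* Hypothetical element types standing in for the MPI data types. */
typedef enum { T_INT, T_DOUBLE, T_CHAR } ElemType;

typedef struct {
    int dest;        /* rank the sender names as destination */
    int tag;
    int comm;        /* communicator, modeled as an integer id */
    ElemType type;
    int count;       /* number of elements being sent */
} SendDesc;

typedef struct {
    int source;      /* rank the receiver names as source */
    int tag;
    int comm;
    ElemType type;
    int capacity;    /* number of elements the receive buffer can hold */
} RecvDesc;

/* Returns 1 when every condition from the list holds, otherwise 0. */
int send_matches_recv(const SendDesc *s, int sender_rank,
                      const RecvDesc *r, int receiver_rank,
                      int group_size) {
    assert(group_size > 0);
    if (s->dest < 0 || s->dest >= group_size) return 0;     /* valid destination rank */
    if (r->source < 0 || r->source >= group_size) return 0; /* valid source rank */
    if (s->dest != receiver_rank) return 0;   /* message is addressed to this receiver */
    if (r->source != sender_rank) return 0;   /* receiver expects this sender */
    if (s->comm != r->comm) return 0;         /* same communicator on both sides */
    if (s->tag != r->tag) return 0;           /* tags match */
    if (s->type != r->type) return 0;         /* data types match */
    if (r->capacity < s->count) return 0;     /* receive buffer is large enough */
    return 1;
}
```

Changing any one field so that a condition fails, for example giving the receive a different tag or a smaller capacity than the send's count, makes the predicate return 0.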
Another little thing about semantics. P sends message, let's say m1 and then M2 to Q, and, of course, Q will receive them in that order. If P sends different messages to different processes, then you should not be concluding anything regarding when did R receive it and when did the process named Q receive it.
Now, there is a considerable variety in terms of blocking and nonblocking. And unfortunately both PVM and MPI use the word asynchronous in a way that conflicts with the more theoretical, more conceptual computer science usage. You need to be cautious about these words.
Send and receive can be blocking or not. By blocking or not we simply mean the following, and it should be decoupled from this word asynchronous. If you are receiving, and I think we said this, obviously you cannot receive if there is no message to receive. But if you invoke this receive in a nonblocking way, the receive will terminate. It will come back saying, I tried to receive, but I failed. That's what we mean by nonblocking.
A blocking send, of course, can be coupled with a nonblocking receive and vice-versa. And the send itself comes in a bunch of varieties. If you do nothing it is in the standard mode, otherwise there are these three other modes, which are to be understood carefully because of subtle semantics that can play in there.
Apart from this nonblocking thing, this is about the same as a regular send, and the same goes for this one. If you happen to use the nonblocking kind, you can then use these two routines to wait or test whether things have worked okay or not. If you remember, a handle was one of the arguments in the previous send and receive calls.
Now back to something similar to PVM: collective communication. The broadcast call here on the left-hand side should be invoked by all the processes, even though all but one process is doing something different. One process is the one that is sending stuff, whereas the remaining processes are receiving the stuff. So all ten of them would invoke this at the same time, but one of those ten processes is doing the broadcast and the others would therefore be receiving it.
MPI_Scatter. A similar thing applies to scatter. If you understood pvm_scatter, this is similar. But please note the order of parameters is of course different. In this case the scatter is actually being done by the process whose number is root, and that is that argument.
MPI_Gather. Gather is again similar in meaning to what we had seen in PVM.
MPI_Reduce. Here is the reduce. So when all your workers have computed their subresults, they would have sent them to the master. And I think you would agree that if the master has to be doing the complete computation of the result, that would not be a good thing. So here is an operation called reduce, where the reduction is being performed by the entire group. So it's not just the master that will do it. But conceptually speaking, as I said, if you write down your subresults in a row like that, you basically insert this binary operator in between every two operands. If you then evaluate the expression, that's what reduction will yield. In this case, that reduction is being performed not by just one process named master, but by the entire group.
And there are various predefined reduction operations. As you can see they're all binary operations. And of course you can write your own. Effectively you need to make sure that the signature of the reduction operation you write is appropriate. Other than that, any kind of operation will do.
So here are the top ten reasons that Ohio Supercomputer produces. I thought I'd reproduce them for you, and I have slightly re-ordered the numbers. So I'll let you read them, and I think I'm almost at the end of this. If not, I'll come back to this. This is trying to say that MPI is better than PVM. There is a lot of fairly straightforward advice in there.
First of all, as you saw, practically for every major send-receive kind of operation that PVM has, there's a corresponding operation in MPI.
If you want dynamic process creation right now, PVM can do it; MPI cannot. But MPI can in the very near future. So roughly speaking, it kind of boils down to such detail. On the other hand, there is considerable vendor support for MPI, and the entire community is moving towards that.
With PVM, by and large, you may continue to use the software that exists there, and because it is generally understood better, because of the longer development experience that people have, PVM may survive. But for all new code, I think the general recommendation is that you should be doing that in MPI.
AUDIENCE MEMBER: On No. 7 up there: MPI defines a 3rd party profiling mechanism. Does that mean that there should be profiling --
DR. PRABHAKER MATETI: For example, in the end you want to figure out which part of your code is really critical to improve. There have been all kinds of other applications that do that. And then the same kind of thing also gets into having really sophisticated debuggers.
AUDIENCE MEMBER: That's what I was leading to yesterday at lunch we heard from Sun on how --
DR. PRABHAKER MATETI: Right. They were claiming a hole there.
AUDIENCE MEMBER: They were claiming Prism on their debugger, and my question was going to be, are there other 3rd party --
DR. PRABHAKER MATETI: I think so. I don't have a list of these other 3rd party names, but, yes, there are quite a few. As far as I could tell, there are at least a dozen or so implementations of MPI. Practically every supercomputer manufacturer has their own MPI implementation and associated applications like debuggers.
AUDIENCE MEMBER: So profiling, I can get a software that I can --
DR. PRABHAKER MATETI: From somebody else and plug it into the free version of MPI for example.
AUDIENCE MEMBER: Can I take my source code, if I have it, that has not been parallelized and run this thing through a profiler and --
DR. PRABHAKER MATETI: No. It's not trying to do that. Now I think I understand the question. What Mitch is getting at is the following: Suppose you have some old piece of software, a plain and straight-forward sequential piece of code, and he wants to learn how to parallelize it by watching the execution profile of that system. No. That is not what we are talking about.
That kind of profiling, standard Unix tools can give you. Even GCC can give you that. I remember the flag as -p. So along with the -c et cetera, put -p, and out of it will come a file, typically something.out, and then you feed it through a profiler and it will print a nice listing: this routine consumed 10 percent, that routine consumed 20 percent, and so on. So standard utilities can work on sequential code.
This is intended for MPI applications themselves. How do we profile that? For example, in all those MPI sends we talked about, the varieties of them, there are hidden implications about which is better. Should I do MPI_Isend, or should I do the plain MPI_Send? And there can be a considerable difference in performance if you don't choose the right one.
MPI standard is trying to do this in such a good way that particular implementations don't have an effect of that, but unfortunately they will.
AUDIENCE MEMBER: Another question sort of along those lines. Is there any kind of program or do you know are people working on programs that can automate the parallelization of a particular piece of software using MPI?
DR. PRABHAKER MATETI: No. That is the goal of a large number of research groups and some commercial vendors will say they can do this. But unless you have given hints in the code, they are unable to do that. And for some reason the hint giving activity is limited to FORTRAN. The similar thing is not happening for C. I guess you can kind of tell the reason, because C is a little too lower level than FORTRAN.
The proper way to do it is, of course, manually. You take the source code and, in this design tree, you kind of go up one level, produce a design by reverse engineering it from the code, and then you recode it.
AUDIENCE MEMBER: You alluded to No. 3 and I think No. 10. What's the difference as to why --
DR. PRABHAKER MATETI: Rigorously specified means that the document is so carefully written in terms of its English, in terms of its choice of words, et cetera. In fact, we included a paper in book three that says PVM and MPI are very different, or whatever, something different. It's worth reading that. Remember, that group is the same group that developed PVM and then developed MPICH. They say the PVM documentation confuses these two words, asynchronous and nonblocking. So it is in that kind of sense that the description of MPI is done much more carefully. It is not formal in the sense that it is defined in mathematics, discrete math and logic. No, by no means.
It's a standard in the following sense: it is a consensus document developed by a large number of groups; whereas PVM was the result of effectively one research group, and they kept cranking out implementations that were good and kept getting updated, so there was no competition in its implementation. MPI was not like that. Its spec was developed through group effort, negotiated. Users want this. Designers didn't want it. But some middle ground is reached, and then it is very carefully worded. So that's how it came about. So item 3 and item 10 are not the same.
AUDIENCE MEMBER: (Inaudible.)
DR. PRABHAKER MATETI: I don't think they are trying to do that. They think they have done it carefully and that's that. And it is an old thing that I don't think -- this is another reason why we may not wish to program in PVM anymore. Effectively the effort has kind of stopped. But on the other hand, I don't remember when exactly this version came out, but now there is a version of PVM that runs on NT. That is fairly recent. So there is an effort to branch out into other platforms, but there is no effort in really improving it. The current version is, I think, number 3.4. It's unlikely there will be a really different version after that.
AUDIENCE MEMBER: One last question here, and that is, yesterday at lunch the guy from Sun was talking about binary coding.
DR. PRABHAKER MATETI: I was only there for a few minutes, so I'm not sure exactly.
AUDIENCE MEMBER: I thought he said Sun had their own version of MPI.
DR. PRABHAKER MATETI: He said that. It's their own MPI that you need to buy. He was claiming that his was the fastest MPI ever across all vendors.
AUDIENCE MEMBER: Right. When you say a standard, obviously that is a different version of MPI, or is it?
DR. PRABHAKER MATETI: No. No. By standard we mean a spec; such as, a while ago I was saying here is what reduce is supposed to do. And we have of course tried to simplify it on the right-hand side here. But there is a very carefully crafted paragraph that describes what MPI_Reduce is supposed to do. And the MPI standard is a collection of such paragraphs for all the 129 or so functions.
AUDIENCE MEMBER: I think what you're saying here, I should be able to take my MPI code, run off and compile it on Sun and it should be the greatest thing since sliced bread.
DR. PRABHAKER MATETI: It must. It better run. However, as all standards go, it may not tell you what the performance differences are between, say, MPI_Isend and the regular MPI_Send. It may so happen that MPICH does a good job on MPI_Send. The semantics is there, but how fast would it do it? It may differ. One particular vendor may optimize MPI_Send; another one may have optimized MPI_Isend.
I know Oscar is interested in answering some of your questions.
DR. OSCAR GARCIA: One of the issues is that at Argonne National Lab, there is a significant group and they are some of the leaders in setting up that standard. Now, again, as it happens with all standards, each vendor wants to do a one-upmanship on the code and they do. So as long as you stick to the standard I think you are fairly safe. It doesn't mean you are scot-free, because I have run standard codes and they didn't work on SGI last time I tried it. And so what can I say? There are going to be some implementation differences and that's with any standard.
AUDIENCE MEMBER: I missed the talk last night. Did the guy from SGI say they basically had an implementation?
DR. OSCAR GARCIA: The guy from SGI didn't cover MPI last night.
DR. PRABHAKER MATETI: The standard is expected to be followed. There should be no violation of that. But there are various performance properties that you cannot deduce from the standard, but you can from an actual implementation.
DR. OSCAR GARCIA: What happens when you really get down to it is that you don't usually keep up with all the upgrades of the operating system, so if you are not keeping up with the upgrades, you're going to find that the standard doesn't quite -- has not the operation of that machine.
DR. PRABHAKER MATETI: Okay. I think we are ready to quit, but let me take a quick minute. My goal is not that you can program the next major MPI application in the one and a half hours, but I thought we'll take some existing code and go through the mechanics of compiling and running and so on. That is what we are set up to do.
Here is a little description of it. And you will note that we have private IP addresses. In the lab you will log in with this kind of a log-in ID: WSU in upper case, and then a two-digit number. I suggest you choose the same two-digit number that is on the machine, even though you could use some other number. So we have accounts numbered WSU01 up to WSU50. There are exactly 26 seats, and it looks like we are not 26 people. But if more people show up, we have some extra folding chairs and you can watch over someone's shoulder. And there's no password. Once you do that, you should be in the typical X11 Unix environment, and if you need help at that level of Unix, I have a couple of students who will be there in the lab.
| last revised:11/16/00 06:02:50 PM |
| editor: pmateti@cs.wright.edu |