Summer Institute on Advanced Computation, August 20-23, 2000, College of Engineering & CS, Wright State University, Dayton, Ohio 45435-0001
DR. OSCAR GARCIA: Some of you may not have been with us last night. Mario Lauria is an Assistant Professor at Ohio State University. He has a doctorate from "Federico II" in Italy, and he actually had a Fulbright at Illinois. He also worked at the San Diego Supercomputer Center, postdoctoral. He has recently arrived in the Ohio community, and we are delighted to have him here. Let's welcome him again.
DR. MARIO LAURIA: Thank you, Oscar. The presentation I gave yesterday was more like an overview of HPVM, and HPVM is what we will be talking about today. The slides from my talk yesterday are available outside on the desk. Yesterday was more on the research issues, today will be more on the practical issues of how to build and use an NT Cluster. Since I think you have heard enough about clusters today, I'll focus mainly on what's different about using NT and an NT Cluster with respect to using a Unix-based cluster.
First, again, let me thank the organizers of the summer school. I'm delighted to be here and to talk about the work I've been doing on HPVM. I'd like to point out that HPVM is a project which was carried out by the Concurrent Systems Architecture Group, of which I've been a member since '94. The Concurrent Systems Architecture Group is led by Professor Andrew Chien. I owe a great deal to Phil Papadopoulos, who has been a member of the group. I have stolen many of his slides from a tutorial he gave last year. Phil Papadopoulos is currently with the San Diego Supercomputer Center.
As I said, I will be talking about HPVM. HPVM is a collection of libraries that enables someone to build a cluster running the NT operating system. I'll be focusing on the practical issues: management, overall design, and installation. I'll also mention some of the performance results we obtained and some of the details. I won't be going into much depth about details. If someone is interested, you can ask me for more.
I'd like this to be a very interactive talk, so please ask questions. Don't be afraid to ask questions. I usually use questions to pace my talk and to adjust a little the content of it. I'll try to tell you the story about using NT. Remember, I don't work for Microsoft. I'm not sold out to Microsoft NT. So I'll try to give you a fair picture of what's good and what's bad about using NT. We should be done by four and we should have a break around 2:30.
First of all, why we chose NT. Why build a cluster using NT as the operating system? The obvious reason is that we are using PCs, and Windows in general is on pretty much every PC. I would say it's hard to buy a PC today which doesn't have Windows already installed on it. So the idea was to try to enable this larger community of users and to give them access to cluster technology.
As I mentioned yesterday, there's another good reason. There was quite a bit of funding available for the research, and another interesting reason was nobody had done it, so that was a good reason by itself. Those are more political reasons, and here are also some technical reasons.
NT has some good points, as we found out. One is good support for SMP systems, multiprocessor systems. It's very common to build clusters using dual-processor or even quad-processor machines, and NT is one of the operating systems that can take advantage of the SMP architecture.
One of the nice things that we found out and that we found very useful is that NT has a nice library of lightweight threads. It has a nice thread library. Since we are using this within FM, FM is the lower layer of HPVM, it was very important to have lightweight threads.
As you know, one of the issues with thread libraries is the time it takes to switch from one thread to another, the additional overhead of using threads. With NT we were lucky, and we found out you can have very fast thread switching. That was very handy.
These are two areas in which Linux is a little weak. I think Linux is catching up with support for SMP. I'm not aware of any really lightweight thread library on Linux yet. This is one of the reasons we are putting out an NT version of HPVM. The latest release is available for NT. We are a little behind with Linux releases of HPVM.
NT has good monitoring tools and good administration tools. Another important issue is that for Windows, there is good driver support, and this is not always the case with Linux. Every time you want to buy a peripheral, you will see that the first device driver they put out is for Windows, either NT or 98. Sometime later they also put out the Linux device drivers, but this is not always the case.
I already covered this, NT versus Linux. I mentioned the good things about NT. One of the good things about Linux is that it's open source. This is handy if you are doing research at the systems level and you want to tweak and work with the operating system's internals. The command-line interface is good in Linux, and there's nothing like that command-line interface in NT. If there is, it's mainly GNU tools which have been ported to Windows NT, which I will be showing.
Some of these capabilities are becoming available to the operating system. As I mentioned, you can find some UNIX tools for NT, and we have Samba, which allows you to mix Linux and NT systems in one cluster. You have Perl.
The overall message here is there is no clear winner. Depending on what you want to do, one operating system could be better than the other. And what should drive the choice is what operating system you are more comfortable with, what operating system is more proper for your needs, and other considerations like this.
AUDIENCE MEMBER: Let me ask you, I don't do threads very often, but I have a hard time thinking that a thread is light and switching and it's heavy and switching. What do you mean by that?
DR. MARIO LAURIA: In FM, Fast Messages, we use the threads in the following way: You remember that FM has an active-message style interface, meaning on the send side when you want to send a message, you specify the destination, the buffer you want to send, the length. And then you pass a fourth argument, which is a handler, basically a user-defined function that will take care of the data on the receive side.
In an active-message style interface there is no receive primitive. You cannot do a receive. When the message is received on the receive side, the data will automatically be taken care of by the handler. And we're using threads to execute the handler on the receive side.
So this processing of the data on the receive side in a sense is hidden from the ongoing computation. And the reason you want a fast switch between threads is you don't want to spend additional overhead to process the incoming data and incoming messages.
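The send-with-handler pattern described above can be sketched in a few lines of Python. This is only an illustration of the active-message style: the names `Node`, `send`, and `store_handler` are hypothetical and are not the real FM function signatures, and the handler is called inline rather than in a lightweight thread.

```python
from typing import Callable, Dict, List

# Hypothetical names throughout: this mimics the style of an
# active-message interface, not the real FM function signatures.
Handler = Callable[[int, bytes], None]


class Node:
    """A toy cluster node that runs a handler for every message it gets."""

    def __init__(self, node_id: int, network: Dict[int, "Node"]) -> None:
        self.node_id = node_id
        self.network = network
        network[node_id] = self

    def send(self, dest: int, buf: bytes, length: int, handler: Handler) -> None:
        # The sender names the destination, the buffer, the length, and the
        # handler. Note that there is no matching receive() call anywhere.
        self.network[dest]._deliver(self.node_id, buf[:length], handler)

    def _deliver(self, src: int, data: bytes, handler: Handler) -> None:
        # In FM the handler runs in a thread on the receiver; here it is
        # simply called inline to keep the sketch short.
        handler(src, data)


received: List[str] = []


def store_handler(src: int, data: bytes) -> None:
    """User-defined function that consumes the data on the receive side."""
    received.append(f"from {src}: {data.decode()}")


network: Dict[int, Node] = {}
node0, node1 = Node(0, network), Node(1, network)
node1.send(0, b"sync", 4, store_handler)
print(received)
```

The receiving node never asks for data; the only code that touches the payload is the handler the sender named.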
So just to give you an idea, consider sending a message from one node to another using FM, say a very short message like one byte, two bytes. Suppose you want to send a message just for synchronization purposes. So you're sending a message, and the total time to send this message is about 10 microseconds with Fast Messages using Myrinet, so it's very fast.
If to process this message you have an overhead on the receive side of 100 microseconds or worse, you're defeating the purpose of a fast message. Ideally, you want this message to be processed as fast as possible without spending too many CPU cycles, without subtracting too many CPU cycles from the ongoing computation on the receive side. One of the main objectives of Fast Messages was communication with very low overhead.
By overhead I mean the CPU cycles you have to spend just to do the sending or the receiving. So that's the reason we need very lightweight threads. The situation is even worse when you're receiving messages from several nodes. Messages can be very long. Suppose you have messages which are 1 megabyte, 10 megabytes in size, and these messages are packetized, meaning they are broken into packets of 2K. The way FM is designed, every time a packet is received, the handler is invoked to process the incoming packet, and there is a different thread for each handler.
So if you have 32 nodes in your cluster and 31 nodes are sending messages to node 0, for example, the case for a reduce operation, you have a situation in which this node, node 0, is receiving 31 different messages and it has to execute 31 different handlers. So you're switching between 31 different threads. If you're not careful, you can be spending a lot of time to switch from one thread to another.
Remember, each packet is moving only 2K of data, and 2K of data can be processed in a very short time, a small number of CPU cycles. Suppose you just have to copy them; this can be done in a small number of microseconds. If you have to spend 100 microseconds just to switch to another thread and perform a few microseconds of useful computation, you have a very large overhead. There are other trade-offs between NT and Linux, and we'll show them as we go on.
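A rough calculation makes the point concrete. The 2K packet size and the 100 microsecond switch cost come from the discussion above; the 5 microsecond copy time and the 1 microsecond lightweight switch are assumed figures chosen only for illustration.

```python
PACKET_BYTES = 2 * 1024  # packet size quoted in the talk


def packets_needed(message_bytes: int) -> int:
    """Number of 2K packets a message is broken into (ceiling division)."""
    return -(-message_bytes // PACKET_BYTES)


def useful_fraction(work_us: float, switch_us: float) -> float:
    """Share of receive-side time spent on useful work per packet."""
    return work_us / (work_us + switch_us)


# Assumed figure: about 5 microseconds to copy one 2K packet.
COPY_US = 5.0

n = packets_needed(1024 * 1024)          # a 1 megabyte message
slow = useful_fraction(COPY_US, 100.0)   # 100 us thread switch
fast = useful_fraction(COPY_US, 1.0)     # assumed 1 us lightweight switch

print(n, round(slow, 3), round(fast, 3))
```

Under these assumptions a 1 megabyte message means 512 handler invocations, and with a 100 microsecond switch less than 5% of the receive-side time goes to useful work.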
Let me interrupt to show this. What I'm doing here is the following: I'm connected to the cluster in San Diego using this thing called Terminal Server. I'll be using this tool to show some of the things I will be talking about. What is Terminal Server? Terminal Server is sort of the analog of X Windows on Linux.
For a long time, one of the disadvantages of NT is that you could not connect to a machine from a remote machine. That's a simple thing to do with Linux. Linux has the notion you can export your display. There was nothing like this in NT. This for a long time was a real disadvantage with NT. Things improved when Microsoft developed this technology. Actually, they bought it. It's called terminal server.
This is an add-on to NT, but it's already included in the operating system for Windows 2000. This allows you, from any machine, to connect to a remote NT machine and export the desktop. This is the desktop of a machine called HPVM PDC, which is one of the machines in our cluster in San Diego. On these machines there are installed some of those GNU tools I was mentioning before. Like, I can do a reach. I can do a grep, and so on.
The reason there are GNU tools installed on our machines is they come in handy to do the porting of existing applications. If you are starting from scratch developing an application for an NT cluster, you would be using the Microsoft tools, Visual C, and so on. If you have an existing application which has been running for years on a Unix operating system, it's better to do the porting using these GNU tools I will be mentioning later.
This is a brief description of the cluster I'll be using. I'm afraid it doesn't read very well. The cluster currently is composed of 32 HP Kayak PCs and 32 HP LPr NetServers. I'll show a picture of those. The HP Kayak is a desktop unit. The NetServer is a new generation of PCs which are rack mounted, so they are very thin and they are built so they can be stacked, and they are especially designed for building clusters.
These are the specs. The fast interconnect we are using is Myrinet. This is what we have been using for our project since '94. Recently we also added Giganet. Giganet is another fast network and probably you've heard about it. The peculiarity of Giganet is it's an implementation of the VIA standard. VIA is the standard that was introduced by Intel, Compaq and Microsoft. It's an example of the user-level communication architecture that we pioneered with Fast Messages. The good thing is that since this is a standard, hopefully one day there will be other implementations of this and the price will come down.
We also have Fast Ethernet. We use this for administration purposes, things like network shares and accessing the network for administration purposes. Machines are installed on shelves, and we have console switches and just two monitors for all these machines.
By the way, one of the concerns we had in building these clusters, which are sizable clusters, was heat. So this is something you don't always think of. But when you put so many machines in one room, they produce a lot of heat. One reason for using just two monitors and console switches is that also helps. Not only with space, of course, but also with the heat considerations.
So these are our pictures. These are the shelves with glass doors holding the HPs. The picture is not that clear, but I think you can see what it's about. These are two desktops that are very similar to the Compaq we have here, and there are two for each shelf. And here you can see the monitor. Compare that with one of these, which are the new HP NetServers. Each one of these is a PC. So this is a 2U rack unit. As you can see, this kind of construction allows for very compact assembly.
For HPVM you need a Pentium processor. Memory, the more you have, the better. All those machines have at least 384 megabytes of memory, most of them have 500 megabytes. The new ones, the new NetServers, the thin ones, have 1 gigabyte of memory.
High-performance network. Myrinet has been around for quite a number of years. It's very successful in academic projects and academic environments. Giganet is also coming out. An interesting alternative, potential alternative, is Gigabit Ethernet. Just recently Gigabit Ethernet has come out with a copper version, a twisted-pair version, which will make it even cheaper. However, there isn't yet a good fast communication library using this hardware.
The issue of homogeneous versus heterogeneous hardware. If you have machines that are exactly the same from the hardware point of view, this makes things much easier. Every time you have to roll out a new version of the operating system, you can make an image on one machine and distribute it to all of them. Things are a little more complicated if you have different hardware.
The old machines you saw in the picture have NT Server, Terminal Server Edition, installed. This way we can connect to each single node the way I showed before. One nice thing is that security is good, and you don't have to worry about passwords traveling over the network. This technology, as I said, comes standard with Windows 2000.
Monitoring is quite simplified and quite easy. I'll show you another example. This again is the desktop of HPVM PDC. HPVM PDC is kind of the front end of our cluster. If you go to Programs, Administrative Tools, there is something called Terminal Server Administration. This is a very handy tool. It gives you an overview of what is going on on each node.
You can see here there is a list of all the machines on the cluster. We named them Shard, and they are numbered from 179 to 242. You just click one. Suppose you want to see what is going on in Shard 220, you can choose between user view (there is nobody logged on here), process view, session view, and so on.
AUDIENCE MEMBER: If I were running a regular NT job that I had not parallelized, only one of those would show anything?
DR. MARIO LAURIA: Correct.
AUDIENCE MEMBER: Just checking.
DR. MARIO LAURIA: You can see terminal server really made a difference in administrating a cluster. Not only because it allows you to connect remotely to one machine, it also comes with this nice suite of tools.
Briefly, these are some of the people that have been involved. As I said, Andrew Chien, who's been the head of the group. Scott Pakin and myself are the two people who have been involved in this project the longest. And Phil, who just left the group. If anybody comes to the San Diego area, he's a good reference person to get in touch with in case you want to work with clusters.
He's been hired by the San Diego Supercomputer Center just for the purpose of building and developing some cluster knowledge there. He is starting a cluster group there, and he is going to be the reference point for all the people on campus that are going to be building their own clusters.
Suppose you want to build an NT cluster. First of all you have to download HPVM, which includes application libraries and device drivers for Myrinet. When you buy Myrinet, it comes with its own NT device driver. We have a slightly modified version so you have to install this one instead. In addition to an application library, there are other pieces of HPVM. Some of these are services. NT services are the equivalent of daemons in Unix.
You have to add something else to HPVM, which is a remote execution tool. This is another area of weakness of NT with respect to Linux. With Linux, there are at least minimal tools. You can do a remote shell, or, if you're concerned with security, you can do an SSH to another machine to start a job there. There are no good implementations of these tools for NT.
Initially, we started using a commercial tool called LSF. LSF is a good tool. However, it's a commercial tool, and you have to pay for it. LSF is available for several platforms. It was started as a Unix tool and was later ported to NT. It provides a number of services like remote job creation, and it is also a good tool for process monitoring. It can tell you the CPU load on all the nodes at the same time and other statistics like this.
For some time we used this. LSF under NT requires some work to make it work properly, so after a while we switched to Catapult. NCSA is still using LSF for their NT cluster. Catapult is a much simpler tool. It doesn't provide process monitoring. Basically, it's a small application built on DCOM. DCOM is a library for distributed object programming that comes with NT. It's built into NT. Catapult was developed at Berkeley as a pet project for a larger project and now it's being distributed among academic users. So the message here is if you want a full-fledged tool, buy LSF, otherwise you can use Catapult.
Other things that could be useful: Perl, and the NT Resource Kit, which is a collection of useful tools. It's handy to have. Of course, you need some kind of development environment like Visual C++, which comes with a debugger. There is a simpler version of the debugger called WinDbg. It comes in the SDK. This can be useful if you want to do debugging of parallel applications.
This is mostly what you are going to need. Now, some other options. Suppose you want to build an NT cluster but are not fully convinced you want to use HPVM. What are your other options? One is PVM. PVM is one of the precursors to the MPI library. It's another message-passing library. There is a Winsock-based implementation of PVM.
There are other implementations of MPI for NT. One is WMPI from the University of Coimbra in Portugal. There is another, commercial version, and probably this is the best implementation of MPI you can find around. This has been developed by Tony Skjellum. He's a professor at Mississippi State University. He's one of the people who contributed to the definition of the MPI standard. He founded a company just to commercialize advanced implementations of MPI. They also have an implementation for NT.
Finally, Myricom, the manufacturer of Myrinet, has a port of MPICH over GM. GM is their own library for fast communication. GM was inspired by FM, our communication library. The performance of GM is not that good compared to FM because they are supporting additional services, but that's another alternative that you have.
Three out of four of these libraries are MPI. Why? For HPVM we have a number of libraries, but 99% of the applications we have found, the applications that we ported to our clusters, are MPI applications. So in a sense HPVM is a software suite to build an MPI cluster, an MPI machine.
Let's go back to HPVM and NT. Some practical advice about managing an NT cluster. Managing an NT cluster takes some work. Probably this is true for all kinds of clusters. The reason, I believe, is that nobody has yet produced a complete full-fledged solution for building a cluster. For every cluster you are going to build there are a lot of details that need to be ironed out on the administration and on the programming sides. This is also true for NT.
Remember, these machines were built to work as stand-alone machines. They were not designed to work together. So you should expect some additional work to have this bunch of machines working together. Some basic requirements are a local hard disk drive for the operating system, similar to what you need for Linux.
There are some additional reasons for NT to have its own local disk. NT cannot be remote booted like Unix. We use DHCP for IP setup. That was very handy. It served us well. We never had any problems with DHCP. There are nice graphical tools to do the administration of DHCP on NT.
There is a peculiarity about NT, which is the system registry, which is a place where all applications keep some configuration information. This registry is a mixed blessing, because you know every time an application needs to store configuration information it's going to be there. It also means there is a lot of essential configuration information there.
So you have to be careful. If the system registry breaks you have to basically rebuild the machine. There are other disadvantages: the registry grows and never shrinks. It's unusual for an application to clean up after itself once it's uninstalled, so one problem is that the registry keeps growing.
Another issue we found in dealing with NT is how do you deal with the upgrades and how do you roll out a new version of the operating system or how do you install the latest collection of -- in NT there is this thing called the -- I don't remember the name. Anyway, they decided that they need to fix a collection of bugs that were found, so you have these --
AUDIENCE MEMBER: Service packs.
DR. MARIO LAURIA: Service packs, thank you. So basically once a year you have a new service pack. Linux does the same thing, but in a different way. They put out a new release of the operating system. Microsoft pretends that the operating system is always the same and you only need to do some minor adjustments, and gives you a service pack.
So how do you install a service pack on 32 machines? The way we did it is just by keeping a reference image of a single node on a machine like HPVM PDC on the front end, and when you need to add one of your service packs, you build a new clean image of a node and then you roll out the image to all the nodes. There are some commercial applications that allow you to do this remote reimaging on machines.
Another reason we needed these reimaging tools, these tools to recreate a clean installation on a machine, our group works on doing system research. So it's often the case that we will do some experiment that leaves the machine in an unknown state or breaks some essential part of the operating system. If you are in this kind of research, it's handy to have something that can rebuild an entire machine in a few minutes.
AUDIENCE MEMBER: How long does that process take? Is it like a broadcast so all of them simultaneously updated?
DR. MARIO LAURIA: We'll get there in two slides. I'm going to talk in the next slides about some of the commercial products we have been using and our experience with them. The nice thing is that after trying a couple of them, we found they worked quite well. I'll mention our experience with them. Some of the things we used are disks, and we especially liked Imagecast and we also tried DiskImage. They worked quite well.
So, why do you need to buy a commercial application to reimage a machine? Taking the image is the easy part: you can easily take a snapshot, an image of a machine. You take a machine, format the disk, install the operating system, and then copy the entire contents of the disk, and that's an image.
The problem is when you go and copy back this image on 32 machines, you have the problem of personalizing all these copies. Personalizing means changing things like the IP address or the name of the machine. Also in NT there is something called the SID, the security identifier, which is a unique ID that each machine must have. If you change the SID of a machine after reimaging it, the machine will appear as a new machine to the domain server. So the main purpose of these tools is, first of all, to make it easy to have a collection of these images and to propagate them to the nodes, and the important task is the personalization of each individual node. And all the tools do a decent job at this.
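To make the personalization step concrete, here is a minimal sketch that generates one name and one address per node for a cloned image. The Shard numbering follows the cluster described earlier; the lowercase `shard` prefix, the `10.0.0.10` base address, and the `NodeIdentity` record are hypothetical, and the sketch deliberately leaves out SID handling, which the imaging tools take care of.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address
from typing import List


@dataclass
class NodeIdentity:
    name: str
    ip: str


def personalize(first: int, last: int, base_ip: str) -> List[NodeIdentity]:
    """Build one identity record per node for a cloned disk image.

    The numbering follows the cluster in the talk; the address range
    is purely hypothetical.
    """
    start = IPv4Address(base_ip)
    return [
        NodeIdentity(name=f"shard{n}", ip=str(start + (n - first)))
        for n in range(first, last + 1)
    ]


nodes = personalize(179, 242, "10.0.0.10")
print(len(nodes), nodes[0], nodes[-1])
```

Every clone starts from identical bytes, so a table like this is the minimum that has to be applied to each copy before the machines can coexist on one network.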
The problem we found out is that they are not really designed to do reimaging of dozens of nodes at a time. So they work well, but it can take a long time when you have to do all of them, because basically this will be copying the same image to all the disks. So it took something like twenty minutes.
We also found out that we had to do a few machines at a time, otherwise things were breaking up. It's still good enough if you don't reimage all the machines very often. It's still good enough if you have to reimage a single machine at a time. Say an experiment goes bad and you want to reimage a single machine, these tools have worked very well.
Another reason it takes so long is that to do the personalization, it can take several reboots. Some of these tools use tricks like changing a configuration file. They put it in the machine, change it again and so on. Or in another case they were just booting the machine with a minimal operating system like MS-DOS, with just enough software to do network communication, using this minimal bootstrap installation to download the image and then reboot the machine with the full operating system.
AUDIENCE MEMBER: That's not automated is it?
DR. MARIO LAURIA: Yes, it is. So in one image case the only thing you had to do was -- no, it's all automated. But you have to wait there. So it was automated in the sense you could just click. In one window you click the image you want to propagate, in the other window you click on the nodes you want to reimage, and then you start it and it goes by itself and does everything. To create a new image it takes something like ten minutes. So it's not really bad unless you have to do a large number of machines very often.
This is an answer to your question. If you had to do this on 64 machines, it can take hours. Remember, 64 machines is a medium-sized cluster. The one at NCSA is 128, and they are thinking of going to 256. And Phil Papadopoulos is working at the San Diego Supercomputer Center, where they have this spanking new SP2 with one thousand one hundred or so nodes, and they'd like to show that clusters can go this high too. So this becomes an issue with very large configurations.
So after trying all these nice commercial products, Phil came up with another solution. It turns out you can do this yourself. The solution that Phil came up with is to install Linux in addition to NT on each node, and keep the image of the disk on the disk itself, ready every time you have to restore it.
So a typical image is something like 500, 800 megabytes, and today each machine has a disk with a typical size of 8 gigabytes, 16 gigabytes, 20 gigabytes. You can easily reserve a partition just to store images of a machine. And then you use Linux to copy the image of the disk between the boot partition and the repository partition.
So the question is: Why do you need to boot into Linux? Couldn't you do this with NT itself? In theory you could copy the boot partition somewhere into a large file, and vice versa when you want to reimage a machine. The reason we had to do it with Linux is basically because in NT there's nothing like dd. You cannot do a byte-by-byte copy of an entire partition, so that's why we have to resort to Linux.
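What dd does here is nothing more than a raw, block-by-block copy between a partition and a file. A minimal sketch of that idea follows; it uses ordinary temporary files as stand-ins for the boot partition and the stored image, so the file names are hypothetical and no real device is touched.

```python
import os
import tempfile

BLOCK_SIZE = 1024 * 1024  # copy 1 MB at a time


def raw_copy(src_path: str, dst_path: str, block_size: int = BLOCK_SIZE) -> int:
    """Copy src to dst block by block and return the number of bytes copied.

    On Linux, src_path could be a partition device and dst_path an image
    file, or the other way around. Ordinary files stand in for both here.
    """
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            block = src.read(block_size)
            if not block:
                break
            dst.write(block)
            copied += len(block)
    return copied


# Demonstration with a small stand-in for a boot partition.
workdir = tempfile.mkdtemp()
partition = os.path.join(workdir, "boot_partition.bin")
image = os.path.join(workdir, "clean_image.img")
with open(partition, "wb") as f:
    f.write(os.urandom(3 * 1024 * 1024 + 17))

saved = raw_copy(partition, image)      # take the image
restored = raw_copy(image, partition)   # put it back later
print(saved, restored)
```

The copy knows nothing about file systems, which is why it can capture and restore an NTFS boot partition from an operating system that cannot read NTFS.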
Here's the trick: storing the image on the same disk and using dd to copy the image back and forth. Now you can reimage the entire cluster very fast, because you don't have to broadcast the same image to all the nodes. This makes things much simpler. Basically, to reimage the entire cluster, you have to change the configuration file on each node which tells which operating system to boot, and you have to change a second configuration file in the Linux partition to tell Linux which image to dd, to copy byte by byte, into the boot partition. With this trick, reimaging the entire cluster takes a constant time of about ten minutes.
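The difference in scaling can be shown with a small model. The twenty minutes per network pass and the ten minutes for a local restore are the figures quoted in the talk; the batch size of 4 machines is an assumption, since the talk only says "a few machines at a time".

```python
import math


def network_reimage_minutes(nodes: int, batch_size: int, minutes_per_batch: float) -> float:
    """Total time when nodes are reimaged over the network a few at a time."""
    return math.ceil(nodes / batch_size) * minutes_per_batch


def local_reimage_minutes(nodes: int, minutes_per_node: float = 10.0) -> float:
    """Every node restores from its own disk in parallel, so size drops out."""
    return minutes_per_node if nodes > 0 else 0.0


# Assumed: 4 machines per batch, 20 minutes per batch.
for n in (32, 64, 128, 256):
    print(n, network_reimage_minutes(n, 4, 20.0), local_reimage_minutes(n))
```

Under these assumptions 64 nodes need 320 minutes over the network, which matches the "it can take hours" experience, while the local restore stays at ten minutes regardless of cluster size.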
This is possible because, as I said, each PC, each node in the cluster, has a large disk, and the large disk is not used that much. You need a local disk because you need a place to store the operating system to boot, and you also need it for the virtual memory, for the swap file. Besides that, it's not much used. It's quite unusual to find a program, an application, that uses the local disk.
Briefly, here is the tool to do this administration on NT. This is to show how we divided the main disk. As you can see, there is D, which is the boot partition, this partition just for images, two gigabytes. And the last partition is the scratch where we have the swap file. We have a separate partition for Linux. And a small detail I want to show you is that all these use NTFS except these two, which are FAT, for those of you who are familiar with the NT file systems. The reason we had to use FAT is Linux cannot read the NTFS file system.
So again, when you want to reimage a disk, you tell the machine to boot using this operating system instead of this. Once the node has come up as a Linux node, it will copy from one file into this partition, partition D, and the next time it reboots it will reboot the new copy of the operating system here.
Here are some more details on how we did all this. We used something called loadlin, which you can use to boot Linux from Windows 98. I'm sure you can do this with something like lilo; you just need to know where the application stores its configuration files to tell it what operating system to use next time you bootstrap.
That's the other important thing I want to mention. You must be aware of compatibility issues between operating systems. You must be aware that Linux can read a FAT partition but not NTFS. Not a big deal, but something you need to know. One good thing in dealing with Linux and NT is that you have plenty of documentation on both sides.
AUDIENCE MEMBER: Is loadlin something that comes standard with NT?
DR. MARIO LAURIA: No. I think it comes from Linux.
AUDIENCE MEMBER: When you do this reboot, you go into Linux and do the copying, so you're back to a good image of NT. Can that be automated so you don't have to go through each individual machine and tell it to reboot to NT?
DR. MARIO LAURIA: Yes, you can. We have an NT script that does all this. You can do this because you have a way of accessing remote disks. I can show you how to do this. I don't know if you can read this. There is an easy way in NT to access other disks once all the machines are in one NT domain.
For example, this is the directory again on HPVM PDC, which is the PC we use on the front-end. With a notation like this, I can access, in this case, the partition C of the disk on the Shard 220. You can have a script that accesses the file there and changes it or makes changes to it.
You can do the same thing -- you have to change one of the scripts to tell the machine what operating system to reboot next time, and then you need to make changes to another script to tell Linux what image to copy inside. Of course in Linux you have to create a script that whenever the operating system boots up, it makes a copy of these things and then restarts. Having a local copy of the image also takes care of those personalization issues.
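A script of the kind described above might look like the following sketch. Everything specific in it is an assumption: the use of the administrative share (`C$`), the file name `nextos.cfg`, and its one-line format are made up for illustration, and a temporary directory stands in for each node's remote disk so the example runs anywhere.

```python
import tempfile
from pathlib import Path


def remote_drive(node: str, drive: str) -> str:
    r"""UNC notation for a drive on another node, e.g. \\shard220\C$.

    Using the administrative share (C$) is an assumption of this sketch.
    """
    return f"\\\\{node}\\{drive}$"


def select_next_boot(config_file: Path, os_name: str) -> None:
    """Rewrite a one-line, made-up config file naming the OS to boot next."""
    config_file.write_text(f"next_boot={os_name}\n")


print(remote_drive("shard220", "C"))

# A temporary directory stands in for each node's remote drive, so the
# sketch runs anywhere; on the front end the root would be the UNC path.
nodes = [f"shard{n}" for n in range(179, 183)]
fake_cluster = Path(tempfile.mkdtemp())
for node in nodes:
    node_root = fake_cluster / node
    node_root.mkdir()
    select_next_boot(node_root / "nextos.cfg", "linux")

print(sorted(p.name for p in fake_cluster.iterdir()))
```

The front end only edits small text files on each node; the heavy copying happens locally on the node after it reboots.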
AUDIENCE MEMBER: How about SID? Don't you have to reset it for every machine?
DR. MARIO LAURIA: No, because you're keeping a local copy of the image so the SID is the same. It's a doubly smart thing.
This was a quick overview on the administration issues of the NT cluster. Let's go to development issues and take a look at NT as a development platform.
NT and Unix are different, especially if you're developing applications. Someone who starts to use NT coming from the Linux world would be tempted to use the ports of GNU tools to NT. In some cases this is a good idea, and in some cases it is not a good idea. One reason it's not a good idea is that these ports don't always retain the full functionality of the original GNU tools.
Some of the things available are listed here. The most useful and comprehensive set of ports is the one made available by Cygwin. Cygwin, by the way, has recently been acquired by Red Hat, which is good because it means the tools will probably become more and more available and there will be more development of those. For some of these tools there is still some work to do. Some are already good, like the port of Perl, and other things like grep and Emacs. I never used the MKS Toolkit. Everything I needed I was able to find in the Cygwin suite. These are public domain. Here is the URL if you want to try some of these tools yourself.
If you're starting with NT, I would suggest you don't use things like gcc for NT. Try to use Visual C. One good thing to have if you start doing development under NT is a subscription to MSDN. If you subscribe to MSDN you will have everything you are going to need: good documentation and all the Microsoft tools. Even if you don't want to subscribe to MSDN, the complete MSDN documentation is online at msdn.microsoft.com.
As I said, it's not always a good idea to use GNU tools. When is it a good idea? It's a good idea when you want to port an existing Unix application to NT. In this case Cygwin tools can make life much easier. One of the main problems in porting a Unix application to NT is, guess what? The makefile format is different. Visual C++ doesn't have a notion of a makefile. I don't know if you're familiar with the Visual C++ environment. It's quite different.
Microsoft also provides a make utility called Nmake that is included with Visual C++, but the format of the make file used by Nmake is different. So what I do when I have to do a porting of an application is I use the original make file with the Cygwin make utility and just use the command-line version of the Visual C++ Compiler. So basically I use the Microsoft compiler with a GNU make.
Debugging is different: you have a visual integrated development environment you have to get used to. Once more, a disadvantage of this is that you cannot do command-line debugging like good old GDB. You have to start the complete environment. The nice thing is that you can do some kind of remote debugging using the environment. It used to be the case that to debug a parallel application with GDB you had to open a window on each node and start GDB on all of them. With the visual environment, you can simply tell the debugger to go and debug the process which is running on the remote machine.
If you have an existing application that was developed under Unix, the least-effort route is to use Gmake plus CL. This is GNU make plus the command-line version of the Microsoft compiler. So are there differences between GCC and CL? Yes. CL is much more restrictive from the syntax point of view, so expect more warnings. Clearly there are some differences in the system calls, so you have to take care of this. There is no gettimeofday in NT, so if you want to do timings in your application, you have to do a small amount of source transformation.
This slide is about remote access. I already covered that to a certain extent. There are a couple things to add. How do you access a remote machine? It used to be the case that this wasn't possible or it was very hard with NT. Now everything is much easier with the terminal server. Is terminal server good enough? Yes and no. It's good when you need an entire desktop. However, sometimes you don't need an entire desktop. Sometimes you just like the output of your application, and there is nothing like RSH and SSH that you can use to do remote execution of an application with redirection of IO. You either have the entire desktop of the remote machine or nothing.
There are a number of ways of doing remote access. The tendency of NT is to have a separate tool for each type of remote access. So if you need to edit the registry of your remote machine, you use a certain tool. If you have to do administration of a file system on a remote machine, you use another tool. If you have to start or stop services, which are the equivalent of daemons, you have to use a separate utility.
This is a disadvantage of NT. It provides you with plenty of tools, but there is a different tool for each thing you need to do. There's nothing like a unified approach. On Linux you typically do an SSH and then some command line. With NT, you have to learn your way through all these tools.
Remote execution. I'm making a distinction between remote access and remote execution. Remote execution is, in a sense, a subset of remote access. Remote execution is simply how you start an application on all the nodes of the cluster. This is a little different from remote access because in addition to starting a process and redirecting I/O, you need an additional feature, which is to be able to kill a process.
Remote execution is an area in which all clusters are somewhat lacking. I think this is a problem even for Linux. With Linux, you can use remote shell, but whenever you want to kill an application you can have problems. Why? Because it's often the case that you're not just starting an executable on the remote nodes. It's more typical you start a script. Because, for example, you need to set some environment variables or do other things like that. So if you start a script remotely and then you kill the script, the executable that was started by the script would still be running. This is one of the reasons remote execution is one of the areas that still needs more work both in Linux and NT.
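The kill problem described above can be reproduced on a single POSIX machine. The following is a minimal sketch, not what the cluster launchers in the talk do: it assumes a Unix-like system and uses a process group so that killing the job also kills whatever the script started. The function names are hypothetical, chosen only for illustration.

```python
import os
import signal
import subprocess


def launch_in_own_group(command):
    """Start a shell command as the leader of a new session and process group."""
    # start_new_session=True makes the child call setsid() before exec,
    # so the shell and every process it starts share one process group.
    return subprocess.Popen(command, shell=True, start_new_session=True)


def kill_whole_job(proc, sig=signal.SIGTERM):
    """Signal every process in the job's group, then reap the shell."""
    os.killpg(os.getpgid(proc.pid), sig)
    return proc.wait()


if __name__ == "__main__":
    # The "script" starts a long-running executable in the background
    # and waits for it, much like a wrapper that sets up the environment.
    job = launch_in_own_group("sleep 60 & wait")
    print("exit status:", kill_whole_job(job))
```

Signalling only the shell's process ID would leave the `sleep` running, which is the failure mode described in the talk. Signalling the group reaches both processes.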
This is what we're using for remote execution in NT. Again, LSF is being used at NCSA. For some time we also used it. Now some of these DCOM-based launchers are becoming popular. One is the one we are using, called Catapult. It was written by the cluster group at Berkeley. I've seen another one which is being included in the NT release of MPICH. I haven't tried that yet.
We are done with Part I. In Part II I'll be giving some more details on HPVM itself. Also, as an example, I've downloaded and installed the NAS benchmarks on our clusters to give you an idea how to do the porting of a Linux application on an NT cluster.
In the second part I'd like to talk briefly about HPVM. The first part was about NT and dealing with the NT cluster and issues like the administration, NT as a development environment, the tools you need to do all of the above, and the pros and cons of NT versus Linux.
I have to say I'm a little disappointed that nobody has come up with a fight about Linux and NT. I was hoping for a more combative audience. I'd like to hear about your opinions. Maybe in the second part.
In the second part I just want to talk briefly about the HPVM structure. I won't go over all the slides in too much detail. I want to give you a high-level overview of HPVM. So if somebody is interested, you can ask me for more details. I will give you a high-level overview and then get back to the cluster and show how we run a real application. The real application is very simple. It's one of the NAS benchmarks, for which I modified a make file. In the first part I mentioned the easiest way of porting a Unix application to NT, and I'll show you an example of this.
So HPVM is a collection of libraries. HPVM is software to support high-performance communication on a cluster of PCs that are interconnected with Myrinet. HPVM is composed of a lower layer, which is FM (Fast Messages), the software that accesses the network, the Myrinet network. The access is direct, meaning this is a user-level library. If you want, it's a user-level protocol. Low-level, meaning it has a very simple interface.
On top of this there are some higher-level interfaces like MPI. HPVM is the result of a research project focused on getting the highest possible performance out of a high-speed network like Myrinet. So a lot of research, a lot of work went into how to minimize the overhead the communication software adds in using the high-speed network.
This is a more graphical representation. I don't know if you can see the layered structure of HPVM. The lower level here is the hardware. We started with Myrinet in '94. In the latest release we added support for VIA, which is the standard for high speed interconnect, meaning you can run Fast Messages here on Myrinet or something like Giganet that implements VIA.
The latest release also supports shared memory. Why is this interesting? It's interesting because if you run Fast Messages on a cluster of dual processors or quad processors, you want to take advantage of shared memory whenever you're sending messages to the other processor on the same machine. In a sense you're using shared memory as a shortcut. Instead of going to the switch and coming back to the same machine, you can take advantage of faster mechanisms going through shared memory.
On top of Fast Messages, which is a very simple interface, an Active Messages-like interface if you are familiar with it, there are some user-level interfaces like MPI, SHMEM and Global Arrays. These are popular among users of the Cray T3E. In the latest release BSP has also been added. BSP is another research kind of communication library.
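The shortcut can be pictured as a per-destination choice of transport. The sketch below is only an illustration under assumed names: `transport_table`, the host names, and the transport labels are hypothetical and are not part of the Fast Messages interface.

```python
def transport_table(rank_hosts, network="myrinet"):
    """Map each ordered pair of distinct ranks to the transport it would use.

    rank_hosts[r] is the name of the machine running rank r. Ranks on the
    same machine use shared memory; all other pairs go through the network.
    """
    table = {}
    for src, src_host in enumerate(rank_hosts):
        for dst, dst_host in enumerate(rank_hosts):
            if src == dst:
                continue
            same_machine = src_host == dst_host
            table[(src, dst)] = "shared_memory" if same_machine else network
    return table


if __name__ == "__main__":
    # Four ranks placed on two dual-processor machines (hypothetical names).
    for pair, transport in sorted(transport_table(["node0", "node0", "node1", "node1"]).items()):
        print(pair, transport)
```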
We are mostly concerned with MPI. 99% of the applications that have been real applications we have been using on our processors use MPI for communication. In a sense, if you are a user of this cluster, you could just learn about MPI and ignore everything else and basically use an HPVM cluster as a very fast MPI machine.
That's just a list of the libraries you have seen in the previous slide. As I said, this is the supported transport hardware you can use: either Myrinet or Giganet. This fast hardware differentiates an HPVM cluster from a Beowulf cluster.
A Beowulf cluster is essentially a number of PCs that communicate through a standard TCP/IP Ethernet network. What makes the fast communication on the HPVM NT cluster possible is the combination of fast hardware and high-performance software like Fast Messages running on it.
As I said all applications are running on MPI. That's what we are going to talk about. I'm going to show the performance of MPI and not show you performance of SHMEM or put and get because this is what our application was built for.
When I say high performance, what do I mean? Here are some numbers. If you use HPVM on Myrinet, this is the communication performance you are getting. If you are using the lower-level interface of FM, you would be seeing a peak bandwidth of 100 megabytes per second and latencies on the order of 8 to 9 microseconds. These are numbers taken on the newest HP machines on our cluster, the NetServers. These are machines clocked at 550 megahertz. I'm specifying the machine because the faster the machine, the faster the protocol processing and so the better these numbers.
If you're using the MPI program interface, which is basically another software layer on top of FM, you're getting a slightly lower performance of something like 91 megabytes per second with a message size of 64K and 9.6 microseconds latency. So this shows that this additional layer of software is adding approximately a 10% overhead on top of the FM overhead.
These are quite good numbers, especially if you compare them to what's available on the other commercial supercomputers like the SP2 or the Origin 2000. If you have a lot of money and you want to buy a supercomputer today, you have to choose between the IBM SP2 and the Origin 2000. The performance you would be getting spending one to two million dollars won't be that different from these ones.
If you use Giganet instead of Myrinet, the performance is slightly lower because the Giganet performance is not as good as Myrinet. It's still a respectable number considering this is commodity hardware and it's something you are building yourself. As you can expect, if you use shared memory between processors on the same machine, you are getting a much better performance.
First of all because you are no longer limited by hardware constraints like the PCI bandwidth, the bandwidth of the I/O bus. You're taking advantage of the much lower latency of accessing main memory as opposed to accessing the memory on a network card, and of the much higher memory bandwidth available on the system bus.
AUDIENCE MEMBER: Why is the number so much lower for MPI on shared memory?
DR. MARIO LAURIA: On shared memory?
AUDIENCE MEMBER: It was only a small bit on the Myrinet one, 10% or so, and on the shared memory it's --
DR. MARIO LAURIA: That's a good question.
AUDIENCE MEMBER: It's actually lower than Myrinet on bandwidth.
DR. MARIO LAURIA: Latency is lower, of course.
AUDIENCE MEMBER: Right, latency is good.
DR. MARIO LAURIA: Why? I think the reason is that with MPI you need an additional copy, a memory-to-memory copy. To tell you the truth, I haven't looked in depth into this, and I think the reason is the additional copy. I won't be able to tell you why you need an additional copy. It's interesting to see how much a simple memory-to-memory copy hurts at this level of performance. To get a performance like this, it took a lot of research into how to avoid additional copies. But this is what is going on here.
AUDIENCE MEMBER: Is this for a specific application or are you running a benchmark?
DR. MARIO LAURIA: No. These are microbenchmarks. We used two tests: one to measure latency, one to measure bandwidth. The latency test is a ping-pong test. You send a message and the receive side sends the message back. So you measure the time between sending and receiving the answer. That is the ping-pong.
The bandwidth test is a test in which you send a thousand messages, then you wait for the answer on the last one, and you divide the total number of bytes you have sent to the other side by the time.
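The two microbenchmarks can be sketched as follows. This is an illustration of the measurement pattern only: it uses a local socket pair and a thread as a stand-in for the MPI send and receive calls used in the talk, so the numbers it produces say nothing about Myrinet or HPVM. All function names are hypothetical.

```python
import socket
import threading
import time


def recv_exact(sock, nbytes):
    """Read exactly nbytes from a stream socket."""
    data = bytearray()
    while len(data) < nbytes:
        chunk = sock.recv(nbytes - len(data))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        data.extend(chunk)
    return bytes(data)


def _echo(sock, size, rounds):
    # Receive side of the ping-pong: send every message straight back.
    for _ in range(rounds):
        sock.sendall(recv_exact(sock, size))


def _sink(sock, size, count):
    # Receive side of the bandwidth test: answer only after the last message.
    for _ in range(count):
        recv_exact(sock, size)
    sock.sendall(b"\x01")


def _run(peer_fn, size, count, sender):
    # Shared setup: start the peer thread, time the sender, clean up.
    near, far = socket.socketpair()
    peer = threading.Thread(target=peer_fn, args=(far, size, count))
    peer.start()
    start = time.perf_counter()
    sender(near, bytes(size))
    elapsed = time.perf_counter() - start
    peer.join()
    near.close()
    far.close()
    return elapsed


def ping_pong_round_trip(size=8, rounds=1000):
    """Average seconds between sending a message and receiving the answer."""
    def sender(sock, message):
        for _ in range(rounds):
            sock.sendall(message)
            recv_exact(sock, size)
    return _run(_echo, size, rounds, sender) / rounds


def streaming_bandwidth(size=65536, count=1000):
    """Bytes per second: total bytes sent divided by the time to the final answer."""
    def sender(sock, message):
        for _ in range(count):
            sock.sendall(message)
        recv_exact(sock, 1)
    return size * count / _run(_sink, size, count, sender)


if __name__ == "__main__":
    print("round trip (s):", ping_pong_round_trip())
    print("bandwidth (bytes/s):", streaming_bandwidth())
```

The ping-pong function reports the full round-trip time, as described in the talk. Whether the published latency figures are round-trip or one-way times is not stated in the source.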
AUDIENCE MEMBER: Do you use the same microbench program?
DR. MARIO LAURIA: Yeah. Same microbench program. At the MPI level I'm using MPI send, MPI receive.
So these are the same numbers, but this is a graph showing the performance. On this axis there is message size, and that axis is megabytes per second observed bandwidth. This is just for Myrinet and VIA. The shared memory transport is not depicted here. As you can see, the curve for Myrinet is higher than the one for Giganet. You can see the packetization effect. Data is sent in packets of 2K. This is the impact of packetization.
Another interesting thing to look at is this number here. One way of measuring how much overhead your communication software is adding is to look at this number, which is the size of the message you need to get half of the peak bandwidth. So how do you measure this number? You take the peak bandwidth.
For example, 100 megabytes per second, divide that in half, 50 megabytes per second, and then look at the size of the message you need to get that bandwidth. Of course, in general, the larger the message, the better bandwidth you get. Because with longer messages, you can amortize the fixed overhead of sending or receiving a message.
What I call overhead is the number of CPU cycles which you need just to process the data to send or the data to receive. These are the CPU cycles you want to minimize because they are subtracted from the useful computation. So you want the least amount of overhead, and this is a figure that measures how good the overhead of the communication software is.
So this is quite a good number. It means that with only 500 bytes you can already get half of the peak bandwidth. You don't need very long messages to see the peak bandwidth. This is useful because the smaller the useful size of messages, the more general is your communication software. You don't need an application with very long messages to take advantage of your fast network.
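The half-bandwidth message size can be read off a measured curve by interpolation. The sketch below assumes a list of (message size, bandwidth) samples. The sample values in the usage example are synthetic, generated from a simple model with a fixed per-message cost of 5 microseconds and a 100 megabytes per second asymptote. Those two parameters are assumptions chosen to be consistent with the figures quoted in the talk, not measurements.

```python
def modeled_bandwidth(size, fixed_cost_s, peak_bytes_per_s):
    """Simple model: each message costs a fixed time plus size / peak."""
    return size / (fixed_cost_s + size / peak_bytes_per_s)


def half_bandwidth_size(samples):
    """Return the message size at which half of the peak bandwidth is reached.

    samples is a list of (message_size_bytes, bandwidth) pairs sorted by
    size. The peak is the largest bandwidth in the samples, and the answer
    is linearly interpolated between the two samples that bracket peak / 2.
    """
    target = max(bw for _, bw in samples) / 2.0
    prev_size, prev_bw = samples[0]
    if prev_bw >= target:
        return float(prev_size)
    for size, bw in samples[1:]:
        if bw >= target:
            frac = (target - prev_bw) / (bw - prev_bw)
            return prev_size + frac * (size - prev_size)
        prev_size, prev_bw = size, bw
    raise ValueError("samples never reach half of their own peak")


if __name__ == "__main__":
    # Synthetic curve (assumed parameters, not measured data).
    sizes = [2 ** k for k in range(4, 21)]
    curve = [(n, modeled_bandwidth(n, 5e-6, 100e6)) for n in sizes]
    print("half-bandwidth message size (bytes):", half_bandwidth_size(curve))
```

Under this model the half-bandwidth size equals the fixed cost times the peak bandwidth, which is why a small value indicates low per-message overhead.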
This is the largest cluster we built. In '98, when the group was still at the University of Illinois at Urbana-Champaign, we teamed up with NCSA. NCSA decided to fund the construction of a large cluster. They wanted to have a production machine, something they could use to run their applications on.
So this number is not correct. We went up to 192 in April of '98, and then it was later expanded to 256. Anyway, this is to give you an idea. This is the number of processors. The nodes are half that number because we used dual processors. We actually used HP Kayaks, that I mentioned in the first half of my talk, and it was later expanded adding Pentium III's. And they're planning for an even larger machine whenever Itanium comes up.
I'll be showing the performance of some of the real applications NCSA has been running on this machine. When we teamed up with NCSA, our interest was to build such a large cluster to study issues like how our software scales up with such a large number of nodes. NCSA's interest was to see whether they could use this machine to run their existing applications. We took some of the applications that were running on the Origin 2000 and the IBM SP2, we compiled them under NT, and we ran them on the NT cluster.
All these numbers are for applications we just recompiled under NT without too much fine tuning. You can see a comparison between the NT and the Origin numbers. This application is a Navier-Stokes equation kernel. The performance ratio between the NT cluster and the best Origin 2000 performance is between 1.5 and 2, depending on the application. Again, this is the performance. On the X axis is the number of processors. On the Y axis there is gigaflops here and speedup there. These are general results. Speedup on NT is usually better than on the Origin 2000.
This is another application. It's a Conjugate Gradient kernel. Again, the ratio is between 1.5 and 2. In this case this is 7 to 14; the ratio is about 2. You can see good scaling up to 128 nodes. We couldn't go further because the largest Origin had 128 nodes.
An interesting observation I also made yesterday is that the performance ratio is two, and this is using 300 megahertz Pentium II's, and this is using R10Ks clocked at 295 megahertz. So these two processors are clocked at approximately the same speed. Communication performance is comparable, on the order of 10 microseconds latency and 100 megabytes per second bandwidth.
So how do you explain this ratio of 2 to 1? The way I explain it is that what you are seeing here is the difference between the floating point performance of the Pentium and the floating point performance of an R10K. The floating point performance of a Pentium is notoriously average compared to the very good floating point performance of an R10K. The situation is expected to change with the new IA-64 architecture, which has a much better design for the floating point unit.
Another interesting observation is that this is a 300 megahertz Pentium II. If we had to build a cluster today, we would probably be using a one gigahertz Pentium III, and so today we can expect this performance gap to be much smaller or probably even null.
This is a short list. I apologize for the invisible ink. I'll briefly mention other efforts in this area, meaning other software you can find for clusters and high-performance communication. The first one is BIP. BIP is a research project from the University of Lyon. It's also high performance and it's also using Myrinet. They're using Linux and not NT. Performance is even a little better than Fast Messages; however, there are certain differences in functionality. For example, BIP doesn't guarantee reliable delivery. You can have a buffer overrun on the receive side.
As you can understand, there is a trade off between how much functionality you put in one of these high-performance communications software and the performance you get. The more functionality you add, the slower you get. That's why, for example, GM, which is what comes with the Myricom Myrinet, is slower than our own Fast Messages. However, GM also supports TCP/IP. If you use GM as a communication software, you can also use TCP on your network.
This is the Real World Computing Partnership. This is a Japanese project, and they also have an MPI implementation. It has slightly better raw performance; however, their curve doesn't grow as fast as that of Fast Messages, and so the half-bandwidth message size is not as good as ours.
Then there is Active Messages and U-Net. These are some of the old projects. They started more or less around the same time as Fast Messages in '94. All of this research is on the same topic, high-performance communication, but each one of them gives a different contribution to the field. They have different results, and each one of these projects has a slightly different perspective on this topic.
I won't talk about the structure of HPVM. If you are interested, there are the slides, and you can ask me more about this. HPVM stands for High Performance Virtual Machine. HPVM is supposed to hide all the complexities of the hardware that's beneath it. So the idea is if you are happy with it, you could just use the MPI library that comes with HPVM and ignore what's beneath. You cannot ignore the details if you are the administrator of the cluster. If you are the user, you can ignore all of this. I'll pretend you are all users, and so now we'll go to the user part and show you how to use it.
Let's go back to the cluster. Again, this is the front-end of our cluster. I'm using terminal server to connect to this. One thing I can show you is that I can actually use terminal server to connect to any of the machines. This is useful if you want to see what is going on in some machine, or, for example, if you want to debug something on a machine.
I don't know if you're that familiar with the NAS benchmarks. The NAS benchmarks are a collection of small kernels, small applications. These are MPI codes. Some are Fortran. Some are C. They are often used to do performance comparisons between parallel machines. It's pretty much a standard suite of MPI benchmarks. As you can see, there are some here called BT, CG, EP, IS, LU. You can see make files.
So these are available for quite a long list of different Unix flavors. There is not yet an official port for NT. I had to do the port manually. As you can see, I made my own copy of the make file. Let's take a look. As I said before, you can do this a couple of ways. The easiest way is to use the GNU make file in combination with the Microsoft compiler. If you are really brave, you can rewrite the make file using the Nmake syntax. Nmake is the Microsoft version of make.
This is the top level make file. There are not many changes to do here. It would be more interesting to see the lower level of make file, the one that actually does the compilation. You can see here some of the changes, which are very minor but annoying changes. First of all, object files terminate in dot OBJ. So you have to change this.
The second annoying thing is the NT slash is the opposite of the Linux slash. So whenever it's needed, you have to change the slash. However, this is a GNU tool, so this slash needs an escape otherwise make will trip on it. You don't have to do it always. You need only to do it when this is an argument of Microsoft tool.
So this is the list of object files that are given as an argument to the Microsoft compiler. The Microsoft compiler wants the NT slash, so you have to change the slash. The executable ends in dot EXE, and you have to slightly change the compilation rules. So instead of dot F dot O, you have dot F dot OBJ. No big deal.
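The renaming steps just described are mechanical enough to script. The sketch below is a hypothetical helper, not part of the NAS distribution or of the port shown in the talk. It rewrites Unix-style object and executable names into the form the Microsoft tools expect, and doubles the backslashes as one possible way of escaping them inside a GNU make file. The example path is illustrative.

```python
import posixpath


def to_nt_path(unix_path):
    """Convert a Unix-style object path to the NT form: .o -> .obj, / -> backslash."""
    stem, ext = posixpath.splitext(unix_path)
    if ext == ".o":
        ext = ".obj"
    return (stem + ext).replace("/", "\\")


def to_nt_executable(unix_path):
    """Convert a Unix-style program path to the NT form with an .exe suffix."""
    return unix_path.replace("/", "\\") + ".exe"


def escape_for_make(nt_path):
    """Double each backslash so the path can sit in a make file (assumed escaping)."""
    return nt_path.replace("\\", "\\\\")


if __name__ == "__main__":
    objects = ["bt.o", "../common/print_results.o"]
    print(" ".join(escape_for_make(to_nt_path(o)) for o in objects))
```

As noted in the talk, the slash only needs changing where the path is an argument to a Microsoft tool, so a helper like this would be applied to the compiler's argument list rather than to the whole make file.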
As you can see here, the first line is importing a definition file. With the Microsoft C Compiler, you have to use CL instead of CC. Okay. I've done all these changes. I've compiled it. I have a dot EXE file. Now that we have an executable, we have the problem of starting the executable in all nodes and running it.
So for this we'll be using this launcher. It is called Catapult. Six months ago I would have used LSF. We are no longer using LSF. One reason is that LSF requires some work to get it running. In November, the research programmer who was in charge of administering LSF left the group, and so instead of hiring a new programmer we decided to go for a simpler tool.
So the syntax is very easy. I don't know if you can see. It basically needs two arguments. One is the executable with all the arguments, the entire command-line you want to be executed on the remote machine. And then there is a simple list of the hosts you want to run these things on.
There is the executable I built. Those of you familiar with the NAS benchmarks know that there are several classes of this benchmark. Each class is different in the amount of data it's using. To make things simple, I used the small class. I'll be running this on one processor, but this won't be much different. So to start this thing on a machine, I'll be doing something like this. First of all, I have to copy the executable. You can make this automatic with a script. I'm doing it manually to show you some of the steps. I'm copying the executable to the local disk of each node because one problem of these small launcher applications is that they cannot use shares. What NT calls shares is what Linux and Unix call mounted file systems. So if you use something like Catapult, you cannot use shares, and the executable must be on the local disk. So I'm copying this executable to the local disk using this notation. I don't know if you can see that. This is one of those mechanisms that Windows NT gives you to access remote machines.
Now that the executable is on the C partition of the machine called Shard 211, I can start it. This is the program executing. Let me explain what is going on here. This machine is the front-end. To start an application on remote nodes, I'm using Catapult, which is a very simple launcher, and specifying two things. One is the command line I want to be executed, and the other is the machine I want this to be executed on.
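The manual copy step can be automated along these lines. This is a sketch under stated assumptions: it assumes the nodes expose the standard Windows administrative share for drive C (written `C$`), that the caller has rights to write there, and that host names such as `shard211` and the destination folder `hpvm` are placeholders rather than the names used on the actual cluster.

```python
import shutil


def admin_share_path(host, drive, relative):
    """Build a UNC path to a file on a remote machine's administrative share."""
    return "\\\\{}\\{}$\\{}".format(host, drive, relative.lstrip("\\"))


def stage_executable(local_exe, hosts, relative_dest, copy=shutil.copy2):
    """Copy one executable to the local disk of every listed node.

    The copy function is a parameter so the loop can be exercised without
    a real network. Returns the list of destination paths.
    """
    staged = []
    for host in hosts:
        target = admin_share_path(host, "C", relative_dest)
        copy(local_exe, target)
        staged.append(target)
    return staged


if __name__ == "__main__":
    # Dry run: print what would be copied instead of touching any disk.
    def report(src, dst):
        print("copy", src, "->", dst)

    stage_executable("bt.exe", ["shard211", "shard220"], "hpvm\\bt.exe", copy=report)
```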
The command line is the name of the executable plus these arguments that are required by MPI FM. Each MPI FM application needs these arguments. One is the number of processors I'm running this on. The other is a key. Why do I need a key? Because this is how MPI FM recognizes all the processes that belong to the same application. That's why you need these two arguments. And then this is the name of the machine.
Here you can see the launcher at work. The essential tasks of a launcher are, first create a shell on the remote node and execute the command line of that shell. The other important task is to redirect the output, and this is how Catapult is redirecting the output of the benchmark to this machine. This is the output you see from this BT benchmark. It gives you a bunch of statistics.
As I said, Catapult provides remote job execution and redirection of I/O. The other critical task that's required of something like Catapult is to be able to kill a job. From this point of view, Catapult is not perfect.
Suppose something goes wrong in an application like BT and you do control C. In this case it would work fine. The control C would be intercepted and the application BT would be killed on the remote node. But suppose that instead of running BT directly I start, which is more often the case, a batch file, a script that in turn starts BT. If I do control C here, the result I would get is that the shell would be killed on the remote node but the executable would still be running.
In a case like this what I have to do is use the Terminal Server administration tool. Remember, it gives you a picture of what is going on on the remote node. Click on Shard 211, and this will give you a list of processes running on 211. I click on sessions, then processes, we order this by user, and I'll be able to find the application.
There is also a way of doing this through a script. You can use Catapult to remotely run a tool called kill to kill the processes. Basically that's pretty much it. If you were using the Microsoft integrated environment, you would be skipping all the phases of make file transformation and so on. You start with an executable and then you use Catapult with the executable.
I'm pretty much done with what I wanted to talk about. Are there any questions or any clarifications you want? In the second half I have received even fewer questions than in the first half, and I'm a little concerned about this. So let me try to start a fight about this Linux versus NT thing. As you can see, there is nothing you cannot do with NT compared to what you would be able to do on Linux. How many here are familiar with Unix and routinely program with Unix and would know how to do the same thing on here?
AUDIENCE MEMBER: Do you want something controversial to be said? Do you want a fight?
DR. MARIO LAURIA: Sure.
AUDIENCE MEMBER: How about this? What would be good and juicy? That you can do it in NT is certainly possible. You've certainly --
DR. MARIO LAURIA: But you have to learn the way of doing it. It's not a uniform way of doing it like Linux.
AUDIENCE MEMBER: You can solve the problem using NT. It's obvious you have, and there were motivations for doing so. It's obvious there were difficulties you had to overcome. If I wanted to make a controversial statement, I could say, Do you want to use NT as its own punishment, being that you had those difficulties?
DR. MARIO LAURIA: If you're a Unix user, I agree with you. If you are a Unix user and you're comfortable with Unix, you probably don't have a good reason to build an NT cluster. But if you are starting from scratch and you are already familiar with NT, it probably doesn't make sense to relearn everything under Linux, and you'd better start with NT.
AUDIENCE MEMBER: That's a hope for Microsoft.
DR. MARIO LAURIA: That's a hope for Microsoft, yes.
AUDIENCE MEMBER: The biggest problem I've got is NT is an operating system controlled by Microsoft that I have to pay for and Linux is free.
DR. MARIO LAURIA: This is true. There are good things and bad things about this. From our point of view, meaning people who build software that is distributed and released, this is an advantage because with Linux, what happens is you build a release for Linux version 2.02 and everything works. Then Linux 2.04 comes out and your software doesn't work anymore because some little detail has changed in the operating system, in the kernel. With NT we've never had that problem. It's a much more stable interface, much more predictable, and from our point of view, it is much better.
You can do more work if you need to access the internal operating system, but this wasn't much of interest to us. Our reasoning was we are using commodity hardware and probably without modifying it. We don't go on the motherboard with a soldering iron and fix things. Probably we should do the same thing with the software. We want to use commodity software without any modification, and I think we were successful in this. We used NT without any modification, and installing HPVM is very easy because you just click on the setup icon and that takes care of everything for you, including installing the driver and everything.
AUDIENCE MEMBER: One thing just popped up in my mind. You have 64 units in your cluster. How many licenses do you have to buy from Microsoft to run it?
DR. MARIO LAURIA: None. Because when you buy the PC, the operating system is already there.
AUDIENCE MEMBER: That's if you buy a preloaded PC. What if you want to build one yourself?
DR. MARIO LAURIA: We found out this is not a good idea because it's a lot of work.
AUDIENCE MEMBER: We understand that. A lot of work but little money, and there are people who don't have fat pockets.
DR. MARIO LAURIA: Yes and no, because if you're hiring a cheap grad student, that's true. If you are not, you need a lot of work to assemble this machine. Assembling a machine is not a lot of work if it's 1, 2, 3. If you have 64 to put together, it's a lot of work. I know when we had to install two disks on 32 machines, it was a lot of work. It required five people two full days of work to do this.
There's only one disk in a machine. No big deal. You open it, you connect those flop cables, you have to install the controller for the disk, you have to unscrew and rescrew. For two machines it's no big deal. If you have to do that 32 times it's a lot of work. And with 32 machines, you have a good chance that every now and then one will break. If you have a nice machine with a warranty, it's much easier to call the guy and say, "We have a broken machine. Send a replacement." If you have to do it yourself, now, again, you're on your own.
AUDIENCE MEMBER: You can make the argument in terms of hardware certainly, but it seems that you're reinstalling the operating system on a regular basis, if for no other reason than to clear the registry. So the whole idea of being supported for that kind of falls apart, because they're not going to come out and install it, because you're doing it yourself.
DR. MARIO LAURIA: That's why you need a very good tool to reimage your nodes. But this is true for Linux too.
AUDIENCE MEMBER: I'm saying the argument you just made argues for buying the hardware from the vendor but not necessarily buying the hardware with Windows on it.
DR. MARIO LAURIA: No. I'm not following you. Why? You need a machine with this operating system, and you also need to be ready to reinstall the operating system at will on many machines at a time, every time you need it. So basically you need the operating system and a good tool to clean up the machine.
AUDIENCE MEMBER: Maybe I missed the thread of the argument, but that's an issue that's independent of whether you're using Linux or NT.
DR. MARIO LAURIA: Correct. So Phil Papadopoulos is starting his own cluster project at the San Diego Supercomputer Center, and they have decided to use Linux. The first thing they are doing, even before installing the machines, is developing a Linux tool to reimage all the nodes. So you're right, this is an issue for any type of cluster. From this point of view, the choice of operating system is neutral. You can argue that with Linux this is something you can do by yourself. You have more tools. With NT you probably either have to buy a commercial product or, if you want to do it yourself, you have to use Linux. But it's an issue independent of the operating system you use.
AUDIENCE MEMBER: Do you have any idea or numbers? This may be unfair, and you may not be able to know this. With Linux clusters, how often are they reinstalling every node, and how often are you reinstalling every node?
DR. MARIO LAURIA: It depends a lot on what you are doing with the cluster. If this is a cluster that's being used for research, of course, you are doing a lot of reinstalls because you will be breaking the machine all the time. If you're just running a simple application, you probably won't be doing many reinstalls with either operating system.
AUDIENCE MEMBER: You're doing systems and (Inaudible) intrusive kind of research.
DR. MARIO LAURIA: Even if this is user-level stuff, you still have to keep tampering with the driver and changing it, and the services, which are like daemons, they interact with the operating system and so on.
FM, Fast Messages, and other fast messaging layers like this are user level, meaning they go directly to the network. This doesn't mean there is no operating system support for that. The operating system is kept out of the way as much as possible, but, of course, it's still there. You still need the operating system, for example, when you have to map the memory of the interface into the main memory of the machine. That's the way the memory on the NIC is accessed.
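[Editor's sketch] The idea of asking the operating system once for a mapping and then touching that memory directly can be illustrated with a small Python example. This is only an analogy under stated assumptions, not HPVM or FM code: an ordinary temporary file stands in for the NIC memory, and the names `REGION_SIZE`, `map_region`, and `demo` are hypothetical. On a real cluster a device driver would expose the interface memory instead of a file.

```python
import mmap
import os
import tempfile

# Hypothetical size of the window that stands in for NIC memory.
REGION_SIZE = mmap.PAGESIZE


def map_region(path, size=REGION_SIZE):
    """Ask the OS once to map `size` bytes of `path` into this process."""
    with open(path, "r+b") as f:
        # The mapping stays valid after the file object is closed.
        return mmap.mmap(f.fileno(), size)


def demo():
    """Write through the mapping, then read the bytes back from the file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "fake_nic_memory.bin")
        # Assumption: a zero-filled file plays the role of device memory.
        with open(path, "wb") as f:
            f.write(b"\x00" * REGION_SIZE)

        region = map_region(path)
        message = b"hello"
        # After the mapping is set up, this is a plain memory store;
        # no read() or write() call is issued for the data itself.
        region[0:len(message)] = message
        region.flush()
        region.close()

        with open(path, "rb") as f:
            return f.read(len(message))


if __name__ == "__main__":
    print(demo())
```

The operating system is involved only when the mapping is created and torn down; the data transfer itself is ordinary loads and stores, which is the sense in which the operating system is "kept out of the way".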
DR. OSCAR GARCIA: Well, I guess we are the survivors. And I want to thank Mario for his excellent presentation. I also want to thank you for your tolerance with all the little glitches we have had, and I repeat once more, do not leave without giving us your evaluations.
Thank you very much, and I hope to be able to do it again next year.
| last revised:11/16/00 06:03:55 PM |
| editor: pmateti@cs.wright.edu |