Summer Institute on Advanced Computation

August 20-23, 2000


College of Engineering & CS
Wright State University
Dayton, Ohio 45435-0001

Introduction to the Condor System

 

Dr. Prabhaker Mateti

Department of Computer Science and Engineering
Wright State University

 
August 23, 2000 8:30 a.m. This site contains the "web-based proceedings" of Summer Institute on Advanced Computation that focused on high- performance high-throughput  Cluster Computing. 
Slides
           
 

DR. PRABHAKER MATETI: Well, you have me once again giving a talk. Today we'll be talking about Condor, and we are going to be a little more specific this time than we were in the previous discussions about Condor.

Hopefully, most of you have attended Miron Livny's talk at dinner on the first day, and he gave an excellent talk, and I'm going to be more detailed that he was. I'm going to be less conceptual and more concrete in terms of what you do, how you invoke. I'm making a use of a bunch of tutorials that Wisconsin has put together. Of course, I have adapted to suit my style of lecturing on things.

I think we should focus a little bit on the essential background idea of a system like Condor. It may have started long before this phrase cluster computing came into the world. You may not realize that Condor is about twelve years or so old. It is been in development ever since, and the current version is somewhere in the order of six some number.

So what it is, in modern terms, in terms of cluster computing is, Condor deals with nodes that come and go into the cluster. Even during the day, and, of course, it can happen on a weekly basis or on a monthly basis, et cetera. So we should understand clusters with part-time nodes is an idea that Condor deals with.

Behind that, there are these two sub ideas. One of them is cycle stealing. That happens to be the name the people used for this idea; that is, the running of jobs on workstations that do not belong to the owners. Once again, I remind you by workstations, we are not talking about the SPARC's and the Sun's and the MIPS and the SGI and so on, but anything that qualifies to be a workstation using the definition that I gave you a while ago. Which is, there is such a thing as a workstation operating system. And that has user authentication that has networking as a central idea and so on.

Then there is this notion of idleness. That comes into play because the typical user would not want you to use his machine when the machine is actively being used by him. Now, the definition of idleness needs to be flexible. Here is a little quick definition so that we get a handle on what is being meant by idle. And of course, there are all kinds of tools. Condor actually happens to be just one of them that deals with clusters with part time-nodes.

Another idea that we should also try to distinguish is between high performance versus high throughput. This was a point made very eloquently in the dinner talk, and I'm just repeating that over here.

These part-time nodes come and go, and this has to do with essentially how the users, how the owners, of these workers feel. Essentially, they should be willing to cooperate. And it boils down to the last two items; that they are willing to share the resources they have, and I'm focusing on CPU's, but you'll note that Condor will make heavy use of the disk space of the host.

The other aspect of it is willingness to trust. This has to do with computer security and so on. Regardless of the claims made by Condor developers, we should take those claims with a grain of salt. Those processes, when they are running on your machine, you are saying I trust you guys that everything will be okay on my machine when you leave, but that is something that could go wrong.

Another idea one needs to understand is the granularity of migration. This happens to be a very active area of research right now, migration. There are all kinds of approaches. They can be basically divided based on the granularity into these two; process migration and object migration.

The word object I'm using here in the same sense as in object-oriented programs. So in that sense, a process is a collection of objects. In case you haven't been noticing the object-oriented area closely enough, these days objects are classified again into two sub classes. They can be passive, meaning they are more like the old C++ kind of objects, or they can be active objects. That is objects that house a process inside them. And it is these active objects that have their open idea of mobility. A fancy term for them now is agents. They decide to go from one machine to another based on some criteria.

But in any case, such objects together constitute, in the granularity that I'm trying to describe here, a process. So we may decide to move an entire process or, even though one process is made of several objects, we may decide to move a few of these objects around. So in that sense we may have a single process running with objects that are scattered all over. And some of these objects are passive; some of these objects are active.

And process migration therefore is much coarser in grain. Process migration, one of the research items, that is how do we move a process from one machine to another. The old standard idea is you actually stop the process, take a snapshot of it, and then copy the snapshot in the other machine and then reinvoke it. To give you a crude way of describing that, it's almost like cause a core dump in some manner, make an executable program out of the core dump, take it to the other machine and you reinvoke it.

The word we use for that essential idea is checkpointing. Checkpointing is a way of causing a core dump, and then making an executable program out of it. It is not just that. Because of the way the operating systems have evolved, when you cause a core dump, a core dump does not keep track of the entire state of the process.

For example, a core dump cannot quite tell you which files were opened at the time of the core dump. So that when you reactivate this core dump, the file is opened at exactly the same point. You have read a thousand bytes, and now you are about to read the thousand and first byte, that's when the core dump occurred. So those details disappear. So checkpointing is a nontrivial matter to do in the context of the existing operating systems.

The very cutting edge operating systems do this job very well. The checkpointing idea then becomes part of the operating system service. If you are looking at the standard operating system we have now, such as VMS or Linux, Unix, Solaris, whatever; on all of these things checkpointing needs to be done essentially by the user processes.

AUDIENCE MEMBER: What is the algorithm to determine checkpointing? I understand checkpointing, or, actually, the preservation of the state of process in an interrupt situation, so any software interrupt will do that. But the way in which I visualize it, and I don't know for sure, is that you during the process insert a software interrupt. Is that the idea?

DR. PRABHAKER MATETI: Correct. Oscar's question was what are the essential difficulties in doing checkpointing and how do you determine that. As I said, the basic problem is that the operating systems, being older, did not think of it as an essential service to perform. That's why we'll come back to this idea in Condor.

Their basic answer is the following: If you need this, take your existing object code files, the .O files, but don't link it with the standard C library, link it with a semantically equivalent but enhanced library of Condor. So Condor has the usual standard library, plus it is maintaining all kind of things that you are doing through that library. So it has extra information built into it. As a result, when you ask it to do quote/unquote core dump, it can capture the full state. So that is what Condor is doing.

Now, Condor also happens to be very conservative and also, in the same sense of word, somewhat simplistic. Checkpointing in the distributed context is even harder. As I said, if a process sees multiple objects and these objects are all over the place, you can imagine that you might do a checkpoint of one object but not the checkpoint of another object because that is still okay for whatever reason.

So there is a particular order in which these checkpoints of a distributed processes are taken. Otherwise, when you want to restart such a process, you may have to roll back the states, and if you didn't do the checkpointing well enough you may unfortunately end up rolling back the state of the process all the way to the very beginning losing all the computation that you have performed so far. So distributed checkpointing is another major problem. There are clever algorithms by now, but they are not part of the Condor package, as it exists now.

You'll see Condor has this notion of a single ordinary kind of process, where we are now talking about such a process having active objects inside them. But effectively there is an overall point. In order to do a proper checkpoint, the system needs to maintain a reasonably complete record of all the operating system's calls that the process has made so far. Only then it is able to do a worthy enough checkpoint, worthy in the sense we could restart such a process elsewhere.

Now, of course, all of this we are still assuming that the architecture is nearly identical. Now, when we say that, we mean the two usual things. That the CPU is, of course, the same, the instruction set; and the operating system is the same. We are not dependent so much on the hard disk size, whether the monitor you are hooked to is low rez here high rez there, the mouse is exactly alike, no. But at a certain abstract level, the quote/unquote architecture is identical.

AUDIENCE MEMBER: You mentioned there was some cutting edge operating systems that possibly could share --

DR. PRABHAKER MATETI: For example, if you take your Vanilla Linux distribution, there are now patches produced by other research groups. You apply those patches to Linux, and then you recompile. Now Linux will be able to do checkpointing in a reasonably thorough way. These are patches. Now, remember, patches sometimes mean throwing away some entire files and replacing their files. So these are not like adding three lines of code here, four lines of code there, no. Sometimes they are replacing an entire subcomponent of the system.

AUDIENCE MEMBER: How do you see those operating systems, I guess, evolving or do you think this is --

DR. PRABHAKER MATETI: Checkpointing is going to be a central component in future ones. As I said, Linux, in this experimental sense, is already checkpoint capable. But if you are doing the standard RedHat, or Caldar or what have you, those are not checkpoint capable.

AUDIENCE MEMBER: Checkpoint capable for homogeneous systems?

DR. PRABHAKER MATETI: Checkpoint capable for, of course, homogeneous systems.

AUDIENCE MEMBER: Because, you know, if you have a heterogeneous systems, then you need a translation table.

DR. PRABHAKER MATETI: Right. Already, as you know, Linux has become a heterogeneous operating system. It now works on Alpha CPU's and so on. Nobody is claiming that you could checkpoint it on a regular PC with an Intel architecture and then move it over to the Alpha machine and begin to resume it's execution from there.

AUDIENCE MEMBER: Now, you may get into this later this morning and so --

DR. PRABHAKER MATETI: No. Unfortunately I think in the next few minutes I'll get specific to Condor. I'm simply trying to set the limitations and the robustness of Condor in context. Most of its robustness comes from its conservative design. It's not trying to be cutting edge, and I think we should understand that. That this group's main goal is to get this high throughput rather than high performance.

AUDIENCE MEMBER: How difficult is it in the, I guess, the modified versions of Linux that has the checkpointing capability? How difficult is it to stop the program and then to restart it based on these checkpoints?

DR. PRABHAKER MATETI: I have tried it as experiments and it seems to work, but we don't have extensive experience with these things. So I don't know how reliable it is after you have applied these patches, how reliable it's normal use of the operating system is. You know, the way these operating systems are, particularly Linux, when you apply some sophisticated user supplied patches, they're cutting edge and they are by and large worthy ideas, but they have not been completely debugged ideas. And that's one of the reasons why commercial distributors like RedHat and so on wouldn't do them, because they may crash in normal use when you are not even doing checkpointing.

That's where we also need to be conservative, and if we are trying to do this in a nonexperimental setup, I don't think we want to willy-nilly to apply these kernel patches.

AUDIENCE MEMBER: Just to make sure I'm with you. What you are saying is this may work on an individual machine, but if I have a process, even if it's a homogenous system of say eight Linux Beowulf cluster and I have an MPI program out there running on eight nodes, the ability to stop those processes with a checkpoint and restart the job back up is very much still difficult.

DR. PRABHAKER MATETI: Very much an experimental and iffy thing. Now, MPI programs, when they are working together to restart them, I think it is still a research issue. We are talking about an ordinary program. Now ordinary in the following sense. Forget about PVM and MPI and this ordinary business. Forget about also an ordinary program in the following sense that a regular C program can, of course, do fork; Can have multiple regular processes without working with PVM or MPI. Probably doing standard sockets and so on. Even those programs are hard to get them restarted even though the checkpoints have worked fine. But if you wrote a straight sequential piece of code but that would take, let's say, a 1000 hours to compute, by applying these checkpoint patches to a good kernel and running them, you will be able to restart such a sequential program.

AUDIENCE MEMBER: When you mentioned earlier that to get back to it you have to do all those system operations, or all the calls to the OS, that would be incredibly time consuming, wouldn't it?

DR. PRABHAKER MATETI: Again, that is in the sense an overstatement, that you want to record all the system calls. For instance, to give you an example where optimization can happen. You open the file, you process this file and then you closed it. At that point that entire sequence of system calls with respect to that file are now irrelevant. So essentially you want to be able to keep track of all the quote/unquote open objects. And they are not just files.

For example, you may be in a certain negotiation through the network and you need to be able to remember that you have sent "X" number of messages whose content was this. And only when you know all of that, when you restart it, you need to make sure that the receiver is now saying, yes, I am in receipt of all of those things.

Otherwise, when you restart this process, this process assumes that I sent you this piece of message, that piece of message, but that receiver may not. If the receiver itself has checkpointed, we now have a difficult problem of synchronizing these two. That's what makes distributed checkpointing much more work.

AUDIENCE MEMBER: What I see here is a tremendous opportunity for architectures. A traditional architecture being mostly oriented towards trying to preserve the state of the CPU. Now we are trying to preserve the state of the whole process. Now, if we had hardware support for that, your question of how much time it takes would be -- as a matter of fact, it may speed up the whole process if you had hardware support for the addresses or the references and to open objects. So it is a tremendous opportunity for a processor designer to do that.

AUDIENCE MEMBER: Another thing battling in my head then, is that if, you know, one of things going on here is trying to do commodity computing. Is the average PC user going to benefit from that? I guess maybe he would. Maybe he would have a more robust system. But if he doesn't then we're diverging again instead of converging on hardware.

AUDIENCE MEMBER: With a given fact that today you have multiprocessors, having support for the state of process has to be universally beneficial.

DR. PRABHAKER MATETI: I think there is no debate in the research community about checkpoint becoming a central issue in future designs of computer systems. There is no debate on that. But the research problem, the fundamental research problems have not been fully resolved yet. Particularly in the context of distributed checkpointing.

AUDIENCE MEMBER: (Inaudible.)

DR. PRABHAKER MATETI: That is right. That is why now the overall suggestion is that the user will indicate and the user will make a judgment as to what was important state and he records.

AUDIENCE MEMBER: I mean, I think that's kind of how we do business. We set up in the morning, get our desk all out and there might be points in the day, well, it's lunchtime, so I'll kind of pack up and move off. And likewise I could have points where it's easy to pack up and move on.

DR. PRABHAKER MATETI: Even in the old standard Unix it is possible for you to make a system call and say do a core dump even though nothing went wrong with me, and then it will do a core dump.

Now, I'm going to be specific now to the Condor in the next hour or so. My goal is to tell you enough so that you could try a few hands-on lab exercises.

So Condor is a system for high throughput computing by making use of idle computing resources. It manages both the machines that the owners are willing to let you use and, of course, your jobs. It has been a rather stable system, and maybe I'm understating it here when I say thousands of CPU hours. I think they're claiming much, much larger than that.

If you look at the essential techniques that Condor uses to accomplish all of this, it comes to what we just discussed regarding migratory programs. So far we've had enough discussion about checkpointing, but not much about remote IO. I would like to explain that. And the second point, the resource matching, we'll explain in the slides, the Condor people put out.

Remote IO is the following problem: When you are running your program on your machine, let's say you open a file whose path name is /home/pmatetif.c. If that same program is running elsewhere what guarantee is there that that path is my file and not some other namesake file whose content is entirely different. This guarantee is not there even in the context of worldwide mounting of shared file systems.

So the real solution to that kind of a problem is whenever you are doing input/output of that kind, do not do those input/output on the host machine, but do it on the machine of the user's request, where the user's request originates.

In other words, any time you are trying to read /home/pmatetif.c, it will really happen on my machine even though my machine may be a puny machine and the rest of the process has actually migrated and is running in the Condor pool. That is the essential reason for this notion of remote IO. And, of course, it is not something that is limited only to files. It is limited to all kinds of IO. That's where, again, Condor makes very conservative design choices.

AUDIENCE MEMBER: That means that the bandwidth between your home system, whether weak or strong, is going to be highly utilized.

DR. PRABHAKER MATETI: Very much. In actual fact when a Condor job is running in a Condor pool, there is a Condor shadow process running on your own machine.

Similarly, if you happen to have a program that produces some kind of a dialogue, it says answer me this question, and now you are supposed to type the answer. Obviously such a program is not usable in the context of Condor because you don't know exactly when it is running.

So here are some of the assumptions it makes. It assumes that a large number of workstations are idle most of the time. Owners of such machines would not mind being used by others while idle. The notion of idle is not so obvious, and we'll see why. Owners want their work to be given high priority. I think these are very fair assumptions. We should understand a few roles that Condor attributes to quote/unquote people. Owner is the one that offers his machine for use by others. That's our definition of who the owner is.

Then we have users. A user requests to run his jobs, and an administrator manages the pool of available machines, and, of course, it is entirely possible, you are the one person playing all three roles.

For example, during the day you may be doing other work and you don't want this computation to slow you down with this other work. You'd rather do fast Internet browsing while you are there. And you might have a computation task you would like to run on your own machine but when you are not there physically. So you could submit to yourself a job and ask Condor to do the thing for you.

It does all of this through a very, very clever idea. Clever and simple idea called the Classified Advertisements. Here is an example of an owner describing his machine, and we'll go through some of these details later on, but just to look at what he's describing. He is describing the name of it, and the kind of machine it is, how much disk space it has, what kind of speed it runs at, et cetera, et cetera. And he also says, well, these are the kinds of requirements when I let others use. Only when my loads average is this, my keyboard is idle for so long and so on. And we'll come back to an example like that one more time later on.

So effectively that is how an owner offers his machine. And here is how a user requests a computational resource. Here is an example. He's saying that he has an executable that he wants to run, and this particular program has no really stringent requirements, and therefore he's saying that his computing universe is quote/unquote standard. And we'll come back to that again in the few minutes.

Whereas this particular program does do some keyboard input, does output something to the screen, et cetera, it is not really an interactive thing. Rather than use keyboard input, what would have been keyboard input is actually going to come from these files. So this is how you would describe your job, and you submit this to the Condor pool.

Here is a slightly more detailed example of a job describing what its really core requirements are. This particular job is saying we want an Intel machine, and we want it to be running Linux at that time, and we want the memory on the machine to be 32 meg or more, and so on and so forth. And based on that it will compute a number called rank. Some a Condor pool of machines can be a single machine, and this is going to be managed and determined by the central manager. It is the central manager that is going to look at the offerings made by the owners and the requests made by the users, and it will try to do this matchmaking. And each machine that is in the pool runs various daemons to help with this entire business.

Here is a little system structure diagram. The central manager runs in one place, and here is a submit machine on the left-hand side and an execution machine on it be right-hand side. There are of course many, many execution machines. There can be many, many submit machines, and it is possible to have many central managers, but it's hard to administer that, so we typically have one central manager in a cluster. They all talk to each other, and the central manager is always keeping track of which ones are available, et cetera.

There is this Condor resource agent. It is really a daemon named the condor_startd. It allows a machine to execute Condor jobs and it enforces the owner's policy regarding if, and when, he comes back and begins to work. Even though so far it has been idle, what is the requirement that this user insists on. Will he allow this process to continue running when he returns, or does he want an immediate eviction of this process from his machine?

So those are the kinds of policies you would have described, and condor_startd facilitates that. The bottom one here, Condor User Agent, is something that schedules the various things and allows you to submit various jobs.

Here is where the robustness of this Condor comes from. It is nearly entirely due to checkpointing. Checkpointing allows guarantee forward progress of your jobs. If an execute machine crashes, you only lose work done since the last checkpoint. This is fairly obvious. And, of course, Condor maintains persistent job queue. Persistent is simply a fancy word for saying we have a list of these jobs on a disk, so it's not just a memory-based data structure.

Condor is good for managing a large number of jobs. You could easily give a thousand jobs of the following kind. Because of the nature of this, you, of course, are willing to revise the way your program works. You will rewrite your program until such a way that it computes from a certain number of input files, produces results into a certain number of output files, and you prepare a thousand different datasets for that. So these are the thousand sets of input, and these are the eventually produced thousand sets of outputs.

Once you manage to rewrite that, you could with one submit file make a submission request that says I would like this particular program to be run a thousand times, but each time with a different dataset. Now when you are talking about such large numbers of jobs, you often will run into this problem of saying that my job can be divided into subcomputations of A, B, and C, and they must be performed in that order. If you don't describe that, Condor may chose to run B before it has run A.

That is where we need to describe the dependencies. As you know, if the dependencies are circular, we cannot solve that. So as long as the dependencies are noncircular, we can deal with it. And the computer science term for that is Directed Acyclic Graph that describes these dependencies. And that's what this DAGMan is.

Once again, throughput, not only the robustness, but throughput is helped greatly by the checkpointing. The bottom bullet is about those remote system calls that we just discussed. I had explained the idea in the context of IO, but it is a system call that is being done remotely. So when the program is executing on the other machine, the system calls are going to be transferred to the originating machine, and that's where the system call occurs through the shadow process.

Before you begin to prepare your program for Condor execution, you need to think about these kinds of things. What kind of IO does it do? Does it use TCP/IP at all? Can the job be resumed? Did you write it in such a way that it can be? Does it do multiple processes? The old standard way of doing multiple processes is the fork and exec and so on. Or you may have done PVM, et cetera.

So let's focus on IO for a minute. Well, there is of course, this well-known IO that Unix uses, and this old-fashioned term coming out of this name Teletype, abbreviated now forever as TTY. So interactive TTY, this is your dumb terminal kind of text based. You type something and out comes again ASCII text. That is one kind of typical IO. I think everybody understands what we mean by interactive Teletype. By "Batch" TTY, what we are saying is even though it may be really TTY, the program will run perfectly well if we redirect the input from a file as standard input (STDIN) and redirect the output to a standard output (STDOUT) file and similarly this standard error (STDERR).

As long as you can do that, that is the so-called "Batch" TTY. Of course, there are programs that will do X Windows. There are programs that expect remotely mounted file systems and so on and so forth. There is going to be some IO that you may do on the local file system itself, and then there been all kinds of socket based input/output through the network protocols.

In order to deal with this, Condor has fashioned this idea of universes. And these are the four universes it currently has. As they enhance their system, they intend to add more universes. The Vanilla Universe, that is the name of the universe. That is not just a generic description of it. And then there is the Standard and the Scheduler and PVM universes.

In the next couple of slides we'll go into the details of these things. For example, let's look at the plain IO supplement. There is, of course, no interactive TTY support in any of these universes. If you look at X Window the standard universe does not support X Window at all. Whereas, if you are only dealing with input/output of files that are mounted through a shared filed system such as NFS, AFS, et cetera, then of course we have that in all of them. In the Vanilla universe, it is forbidden to do local input/output. If you are depending on TCP/IP stuff, then standard does provide that.

Here are a couple of universes that probably require a little more explanation. There is a universe corresponding to the PVM that we have talked about at some length, and these deals with multiple processes in Condor where the multiple processes have happened through PVM spawn. So it still does not deal with processes you may have done through fork. As long as you have done them through PVM spawn, then it is able to deal with that.

AUDIENCE MEMBER: This might be this is a silly question, but is there any fundamental reason why Condor couldn't use MPI?

DR. PRABHAKER MATETI: No fundamental reason. It is a question of development issues. As of now I think they are working on it and it will probably get added later on. In fact, that is their promised feature in the upcoming releases.

AUDIENCE MEMBER: That will be this year?

DR. PRABHAKER MATETI: That time scale I don't know.

AUDIENCE MEMBER: That requires that there is a PVM copy in the receiving machines.

DR. PRABHAKER MATETI: Okay. I think I'll come to that, but let me answer that question immediately for now. In order for this Condor scheme to work, obviously Condor needs to be installed on every host. Unless it so happens that you are talking about a pool of machines that are quote/unquote a cluster and you already have a shared system there. But by nature of this Condor, it is not a bunch of machines that would have, in general, a common shared mount point.

And, again, this may not have come through clearly, but Condor can now deal with Window NT machines and various flavors of Unix machines together. Obviously when you have described a job, you probably cannot write a job unless it is extremely Batch TTY oriented that will run equally well in NT versus Unix. You may have to say it will only run on that.

Going back to the universes. The scheduler is the universe I explained a while ago that you may wish to have your own jobs run on your own machine but when you are not actively using your machine. For this purpose alone it is an excellent idea that you install Condor on your machine even if you're such a paranoid person that you do not want any other user use your machine. This is just for your own use. If you install Condor on your machine, you could certainly make use of the scheduler universe so that it runs things when you are not actively doing other things.

When you are about to submit jobs to Condor, knowing the IO and other features of your program, you chose a universe, and you make sure that it does not have interactive Teletype. You make it batch ready. And if you think that this program is such a lengthy computation that there's a good chance that the machine on the other end may crash, et cetera, et cetera, therefore you wish to have checkpointing done. In such a situation you need to re-link your program; but otherwise, you do not need to do this.

Now in particular, when we say re-link, we are of course assuming that you have the component object code files; the so-called dot O or dot OBJ files. But if it is a commercial program you are you running, all you will you would probably have is an executable, in which case you there's no chance you can do condor_compile in order to re-link.

If you have the source code, well, you would go through the entire process of recompiling and re-linking with-- Instead of the standard C library, you would re-link it with the Condor library which then gives you the stability to do checkpointing thoroughly. Having prepared your job, you create a submit description file and you run this command called condor_submit.

I think we already understand what it means to say making your job batch ready, but here are the specific things that we are talking about. It should be able to run in the background, which means, of course, no interactive Teletype kind of input, windows, graphics, et cetera, et cetera. On the other hand if it is batch TTY, then you need to prepare such input files and Condor will do the redirection needed.

The Condor compile is the command looks like that.

AUDIENCE MEMBER: Is Condor only for C?

DR. PRABHAKER MATETI: The question was whether Condor is only for C. The answer is no. You could compile your programs with whatever standard compiler, but Condor compile is kind of misnomer there. What it is doing is the linking of the object code files. As long as you have linkable modules, it doesn't care.

The submit description file. It describes to Condor certain things about your particular executable, such as what kind of universe is it assuming, the input/output, et cetera, et cetera. And I think I have an example right after this. Here is an example of a condor_submit file where the universe chosen is standard, and therefore it has the corresponding IO support.

And the one that I want to focus your attention on right now is this is describing this word queue without a number that is following it. So this is a single instance of one executable file being expected to run just once. No matter which condor host machine it will run, it is going to be as though you had done a change directory to this initial directory and then you began the execution of the program. So that is the initial directory. And even though Condor lets you not specify that in a submit file in certain context, I think it's a poor habit to encourage. So always put the initial directory there.

So here are some details about that. Submit a single job to the standard universe, and the entire thing is equivalent to doing this. That you changed the current directory to the initial directory given there. Then you are running that my_job.condor from that location. The arguments that it has you have supplied them in the submit file, and note how the input/output redirection is happening using the files. And so this is, the last two lines there, as though you have done that, but remember that is not exactly what you did.

AUDIENCE MEMBER: Is that also implied there, that you have an ampersand at the end, that it's a background job?

DR. PRABHAKER MATETI: The question is it also implied there in the second line that process is running in the background, and the answer is yes.

"Clusters" and "processes". Warning, this word processes is being used in a nonstandard way here, and wherever I remembered, I tried to replace it with the word proc. A submit file describe one or more jobs in general. The collection of jobs is called a cluster. Each job in that cluster is a proc. And, of course, in the Unix sense of the word process, when that proc is running that is a Unix process.

Of course, such a proc is given an ID, and Condor calls it the Job ID. And it is in general going to have this kind of structure. The number, such as 23 there, that is the number of this cluster. Assuming this cluster had described more than five, more than six proc in it, it has a dot five and now we are referring to the sixth proc in that cluster.

Now, my apologies there. I didn't put this bullet down, but I should. A cluster is describing a bunch of procs, but they're identical executables but different datasets. So assuming there are six procs here, there probably would be job ID's of the kind 23.0 and 23.1 and so on. And proc numbers start at zero.

Here is an example of a cluster with 500 procs in it. First of all, it is kind of obvious to know that this last line, which was previously just queue, is now saying queue 500. That alone is not good enough to solve the problem of how do these 500 with the same executables deal with their input/output.

And there is this little macro idea. Previously we had a specific name there, and now we have a macro based on the process number. And there are a few more sophisticated versions of this, and you can cook up a different initial directory name based on the specific job ID. In this case, we did the simple-minded thing of saying run followed by the number of the process. So that number in the process is going to range from 0 to 499 in this context.

AUDIENCE MEMBER: In your /home/wsu03/condor you have one --

DR. PRABHAKER MATETI: I have one executable called my_job.condor.

AUDIENCE MEMBER: You have 500 directories though?

DR. PRABHAKER MATETI: I should have created 500 directories --

AUDIENCE MEMBER: With input files in there --

DR. PRABHAKER MATETI: So I have 500 directories, each one containing one file to begin with, which is a stdin, stdout, and stderr being generated. And all of this log is going to go into my_job.log in that directory. So in the end we will have 500 directories containing four files at a minimum each.

So this is describing some of the details I thought worthwhile putting it on the slide in case you missed my explanation. Here is the command for condor_submit. So having prepared that file you, invoke the command called condor_submit. Condor_submit parses the file and creates a "ClassAd" corresponding to the submission file that you have given it. And it creates the files when you execute the proc and, of course, it sends this classified add to the Condor scheduler that is running on that particular host.

Now, after having submitted this job, you may want to see if your jobs are running or still being put in the queue and nothing more is happening. You can monitor your jobs through the condor_q; or, you can look at the "User Log". These are all distinct commands. You can also look through condor_status; or, you can run -- in that slide there is an underscore that is not showing through will this projection. There is an underscore here and here and so on. Those are all one-word commands. Nearly all Condor commands begin with the word Condor and have an underscore after it.

You can set up your submission files so it sends you e-mail, but if you submitted 500 jobs it's going to be a nuisance to do this, so you may want to instead look at condor_history after completion. The next few slides have some details, but I think we can probably skip some of this.

"User Log" file logs the following, at a minimum, logs the following events. When the job is submitted, that is an event. When it actually starts executing on a host in the Condor pool, that event is also logged, and whether it is checkpointed and at one moment, or whether the owner came back to the machine, therefore your process had to be vacated and so on and so forth.

Condor_status has various options so you can see which machines in the Condor pool are running which Condor jobs. This is not the same as doing the process list with PS. It is only going to show you the Condor related activity.

Should you decide for some reason remove the jobs that you had previously submitted, then you invoke this, condor_rm, Condor remove. Of course, you should only be able to remove your jobs, but obviously the administrator or the root can remove other people's jobs for whatever reason.

By default, if you don't say anything it is going to send you mail, so you better say something. Either you say do not send me mail, notification = never; or, if you are so thoroughly enjoying such e-mail, then you say notification = always. So then every little event sends you an e-mail. Like if it gets checkpointed, it says, "I got checkpointed." If it gets restarted, it will say, "I got restarted." The typical good choice I think is notification = error, and you can give an e-mail address.

Condor_history. All of this is logged, of course, and condor_history is a way of browsing through these events that have happened. And basically you may be interested in the status field ("ST") that says "C" for "completed," "X" for the job was somehow removed.

Now, getting some more general description of ClassAds. Going back to this example that I had shown you briefly. The basic idea that then gets derived from this ClassAd is the following: That there is the owner offering such a ClassAd, offering such a service, and there is a requester of such a service.

If you take a look at these two ClassAds A and B, we have a notion of matching. The matching is not quite a symmetric thing, and therefore you ask the question in asymmetric way: Does A match B? and vice versa. Based on that, you then say we have a deal and then that jobs goes in there.

Now, this is not really a Condor ClassAd, but I think it's fairly easy to understand in the context of apartments and renters. So on the left-hand side is the ClassAd A, so this is the owner of the apartment. So he is saying my type is apartment, and he's targeting apartment renters and is describing some detail regarding the apartment. Like its square footage, what the dollar figure is for the rent, whether it's on the bus line or not, et cetera, et cetera.

And here is a renter that wants to rent the place, but he wants the place to be on the bus line, and he wants the square footage to be 2700. And as you can see this particular apartment satisfies both of these. That is the notion of matching.

This is a rigorously specified language, just as any other programming language is, so Condor is not just intuitive based on what this example is, but actually rigorously evaluating these things. So ClassAds love Condor to be a general system.

Constraints, ranks on matched expressions by the entities themselves, and only this so-called priority logic is integrated is in the software component called Matchmaker. And all the principal entities are represented by ClassAds. Typical examples are Machines, Jobs, Submitters, and there are quite a few others.

Here is a real example for Machines, Computers. There is a predefined list of attributes, but you can come up new attributes entirely on your own. In this case we want the attribute called Friend. And he is defining Friend to been one of these two names, "tannenba" and "wright." And ResearchGroup is Owner "jbasney" and another owner "raman." This is not the owner of that machine but whoever is bringing a job and asking: Could we run this job here? The owner of such a request.

There is this notion of trusted. Just an attribute name, there is no extra semantics to that name. Of course, it's numonic, and it means what the English word is suggesting there. But the owner is not "rival," this is a particular's owner's name, and the owner is not "riffraff." Again, there's no semantics there, and here is the requirement: That trusted should be true and this Boolean expression should evaluate to true.

Based on all of this, he can compute a number called rank. Rank is a predefined attribute name and you would typically put in expressions that computes a number.

I think only the last one needs to be explained. Well, let's start with all the bullets. So this machine will never start a job submitted by either "rival" or "riffraff" because trusted is false. If you look at trusted, if one of these two guys do it, this is going to fail. On the other hand, if someone from the research group in, our case we defined it as the owner being either "jbasney" or "raman," if they submit a job, it will always run.

If you look at the rest of the Boolean expressions for the rank, for the requirement, you'll see that it is the research group or something. So it's already trusted and the research group is true, so the rest of expression does not need to be validated.

On the other hand, if somebody else submits a job, it will only run on that machine provided the keyboard has been idle for more than 15 minutes because of number that it is put in there and the load average is less than 0.3.

That is how the semantics of the ClassAds works out. If you look at the rank of that expression there, if the machine is running a job submitted by owner "foo," it is going to give this a rank of 0 because that particular owner "foo" is neither a friend nor in the research group. That's how that expression evaluates. So based on these rank numbers, it will choose. If there is a competition for resources, the rank numbers play a role in the choice.

Here is an example ClassAd for a job. I think we looked at briefly a few minutes ago this particular job is saying that it wants the CPU to be Intel, the operating system to be running in Linux, and it wants memory to be at least 20 megs, and on and so forth.

AUDIENCE MEMBER: Can you put equivalent statements in the ClassAd? In other words, just like you had other expressions indicating the type of machine, you could actually make equivalences in some of the characters.

DR. PRABHAKER MATETI: Right. Wherever you see expressions, the language is powerful enough that it is essentially unquantified Boolean expressions. "No" for all, or this kind of stuff. But other than that, it is a very powerful Boolean expression. It can be very, very complicated. Our examples are rather simple.

So Condor defines a number of standard attributes, and they're all listed in the manual. And to see what other attributes you have in your particular pool, you run this program condor_status long, don't put the minus in there, and the particular host name, and it will give you the attributes that are defined in there.

A custom defined attribute might not be defined on all the machines, but if they are defined then you can use it. There are a few other things called "Meta-Operators" that I think we probably might skip, we may not make it otherwise to the end of this talk.

Priorities. There are essentially two priorities. It's kind of confusing that the priorities work one way in the one case and the other way in the job priorities. In the case the User Priorities, the lower the number the better it is. In the case of the Job Priorities, the higher the value the better your choice. Why this has to be so I don't know, but that's the why it worked out.

User Priorities in Condor. Each active user in the pool has a user priority, and you saw how it gets computed. You can change these priorities by invoking the command called condor_userprio. Again, to repeat, the lower the number the better. The user is given a share of machines based on this priority.

As an example, if Fred as a priority 10 and Joe has a priority of 20, then assuming the machines are available in that proportion, Fred will get twice as many machines as Joe will get. Of course, these priorities are continually adjusted. The people who have machines allocated greater than their priority, then their priority keeps decresing so that there is a certain notion of fairness in all of this.

On the other hand, this is of course Preemption. So the priorities work out in such a way that here comes a user that needs to be given the machines because of the priority. A higher -- remember higher means -- sorry. Higher means lower. Such a user will get his job paused and then the machine is given to the one whose priority is better. Now, it is paused. Meaning at that point a checkpoint may not occur, but if that job is really going to vacate that machine then a checkpoint will occur.

Of course, priorities can be viewed with condor_q, can be changed at anytime with Condor priority, et cetera. And this being the job priority, the higher the number the better your chances of running it.

Managing a large cluster of jobs. Inter-job dependencies become important when you are talking about large numbers of data sets and so on. That is where this acyclic directed graph comes into role, and the program that manipulates that kind of stuff is Condor DAGMan.

For example, if jobs are such that it is made of four things, A, B, C, D; but A should run first and then you want to run B and C, there's no particular requirement about the order of whether B should run before C or vice versa, but only after both of them finish should D run, then -- I'm sorry. I think we'll come to that example in syntactic form in a couple slides.

But submitting a large job means your initial directories, because of input/output redirection, needs to be properly done. And the typical way you would do that is to make use of this standard macro that substitutes where dollar process is, with a number of the process and you hook up a path name with that. The queue syntax there gives you a number. In this example 1,000. A cluster, of course, is more efficient. You may submit a thousand individual jobs, but it is better to submit one job with a queue number of 1000. If these jobs are of course different executables, you have to have different submit files. But if the job, that is the executable's file, is the same but you are simply running it on a whole bunch of different datasets, it's better to describe it through one job submit file and then say queue and then give input/output files in the directories.

AUDIENCE MEMBER: That dollar sign process, is that available in other places in this submit?

DR. PRABHAKER MATETI: Throughout the submit files, yes.

AUDIENCE MEMBER: You can use it as an argument?

DR. PRABHAKER MATETI: Right.

AUDIENCE MEMBER: (Inaudible.)

DR. PRABHAKER MATETI: You can do that. The question was is this dollar process available only in a rather limited context or is it a more general, and the answer is in all the submit files you can put that.

DAGMan handles a set of jobs, and you can also have this preimposed. And I think the next one is an example, so let's go to an example first. The one I described a minute or two ago, that we have job made of four subjobs A, B, C, and D. A should happen first, then B and C should happen next. There's no requirement about the order of B and C, but only after B and C is finished, D should begin. Here is the actual submit file of that.

The first four lines are describing where the submit files are for these names. There's no requirement that job A should have a submit file whose name is A.submit, whatever. The name of the file doesn't matter. The first four lines are introducing job identifiers pneumonic job identifiers. They are simply associating an identifier with a corresponding job description file.

Now, we have these two scripts. They are scripts, they can be executables, et cetera. They are saying before you do -- before you begin the invocation of the job D, run the program whose name is d_input_checker. Presumably, this particular user is concerned that sometimes B and C don't produce good output, so he wants to make sure that the output generated by B and C is good. That's what hopefully d_input_checker is doing.

Similarly, he is probably concerned that A doesn't always produced good output, so he has a postprocessing request there after the job A, which is going to do something on A output whose executable a_output_processor, and it is examining the file whose name is A.out hopefully produced by the job A.

Here is the little description of the actual directed acyclic graph. It is describing the directed acyclic graph in segments. Effectively, it's describing one node and its children. So here the node A has the children B and C and both the nodes B and C have the same child D.

Setting up a DAG, you create all the submit description files for these various individual jobs, and you prepare your executables, and you can in such a thing have a mix of these two universes, Vanilla And standard, and you set up your PRE/POST commands or scripts because they will be invoked as you describe them, and then you run this command called condor_submit_dag, and then you give the name of this file that has all of this.

There is a way to get rid of some of these dependencies, the obvious ones are it shutdowns itself you can do, or there's one other way that administrator can get remove of it. I'm going to skip and lot of this because I'm running out of time.

When you are submitting PVM based programs to Condor, it makes an assumption that the underlining paradigm is this Master-Worker Paradigm. And it does not have an active way of verifying that, but effectively the kinds of things that you could do with pvm_addhost, Condor will do on your behalf when you have described your universe as PVM.

The next few slides are describing things that an administrator would be interested in. So far we have focused on the user-level activity. The owner may actually want these Condor processes to vacate his machine the moment he comes back and so on, and that is the kind of thing that this slide is describing and this is the default. If you don't actively change it, this is what is going to be the case.

So, for example, at the moment the keyboard becomes busy the job gets suspended. The administrator is the one who should be invoking this process called condor_master. And it is possible to install Condor as a nonsuper user, but it is not recommended. So assuming that you install this through cooperation through your super user, the root, you would probably want to invoke condor-master during the boot scripts. In our labs upstairs that's what we have done.

There are these various administrator commands, with the obvious suggested semantics. Condor_off so that it turns everything off and gets rid of all these processes, and it will, of course, do checkpointing as it does that.

You can control, overall control, who is allowed to use your machines by host-based control. And here is an example. Just a simple example. We are saying that anybody who's coming into us with a host name that ends with .com, we don't want them to be there. That is the first line. Only those hosts who have the names ending in cs.wright.edu are allowed to do some write operations. That's the second line.

On the other hand, we want to deny these two groups of connections that may come through modem connections, such as PPP something, or this particular IP address range that we've had some bad past experience, so we want to get rid of these guys. On the other hand you want to allow this administrator to the come in through this particular machine osis111. Of course, allow the owner, and that is the name of the full host name, and allow this particular administrator, and both of these.

Another thing to observe when you are in the lab is the configuration hierarchy of the entire installation. There is a file called condor_config that is a pool-wide default. These are all normal text files that you can edit with any editor. There is a condor_config.local for each machine in the pool. And depending on what you said during the install regarding whether the file system is shared or not, these files will sit in one directory or will sit in the directories each individual machines. And condor_config.root is the administrator's thing.

Before Miron left he wanted me to make sure to make this offer to you all. He wants to keep his Condor pool of jobs very, very full. And so if you have any computations that can take advantage of the Condor pools, send him a request, his address is there. It is also there in our participants list, and he says he is more than happy to accommodate. He says it will take a couple days before you get the accounts.

For whatever reason they are not giving the source code with the usual immediate access, but you can download the executables from that location and there is a rather thorough manual that also you can get.

Now, I'm almost finished with Condor at this point, but I would like to go through some of the details of the lab setup that we have done for the Condor.

Let me take any questions or comments you have at this point. I need to switch to a different slide set here.

AUDIENCE MEMBER: You said that the condor_master is recommended to be run in a privileged account like root. Is there a really good reason I should let software that I don't get to see the software code for (Inaudible).

DR. PRABHAKER MATETI: If you are that paranoid, I guess you could not. It is entirely doable to run the condor_master as the nonroot. But the various setup details are such that they are easiest when you run them as the root. For example, in our lab condor_master is being run with root privileges. On the other hand, we do have a pseudouser whose name is Condor.

AUDIENCE MEMBER: So it doesn't itself need those root privileges.

DR. PRABHAKER MATETI: It doesn't really need, but it's simpler. Let's put it this way. It's an administration convenience more than anything else. Now Condor, almost throughout, is conspicuously silent about security details even though we did hear him say a lot about during Globus glide-in that he described.

These are some of the slides you have already seen before, but I'm going to focus on the last couple of slides that are new. Some of you were wondering what kinds of machines our lab machines are. Here are the details on that. They are all Pentium III machines. We haven upgraded about half of them in the following way just a few days ago. Either they have a 700 megahertz Pentium III there or it has a dual CPU setup there. All the machines there the motherboards are identical. They are dual SMP motherboards. In some of them there is only one 700 megahertz, but in the others there are the two CPU's there. With the exception of a couple whose padlocks we couldn't open, they all have half a gig of RAM. They all have rather than low capacity you would think, but in all of our setup we are using less than 50% of the hard disk even though we duplicate each install on every hard disk. We do that with a simple and a very realistic goal we have: We want each PC usable even when the network is completely unavailable. That lab is run for a variety of courses, one of them is a security course. We often run individual machines by themselves and occasionally we hook three together to do something and so on. So the software is duplicated on each one, and we find there is more than enough hard disk there.

They come with a built-in 3-COM 100 megabits network, and every PC has either a gigabit extra network card or a rather cheap, almost two-dollar kind of network card that does 10 megabits. That is used for the security-related experiments.

Now, going to what we will do today. You will log in as you did yesterday with the name WSU and some two digit number ranging from 0 through 50, no password. These are the exercises that we have setup. Hopefully all of these are self-explanatory once you visit this directory cluster Condor Examples. In that directory you will immediately see two subdirectories; one called Users, the other one called Admin. The User's contain examples of user kind of stuff. All four are user examples. I guess I didn't list the administrator's examples in here for fear it was becoming too long. That's about it, and I'll hopefully you'll enjoy our lab. I'm crossing my fingers there will be no glitches there. Let us know, and I, and a couple of my students will be there to help you with any details.

DR. OSCAR GARCIA: This is Prabhaker's last lecture. I want to tell you that every one of the lecturers did work on their particular presentations, especially for this summer institute, but Prabhaker actually went far out of his way and his laboratory was not ready for Beowulf and Condor. This was done especially for the summer institute and so were the transparencies. You can tell he has been working at it for-- I would say two months at least. So I want to express my gratitude and I think you should also do the same.

 

last revised:11/16/00 06:05:13 PM
editor: pmateti@cs.wright.edu