00 Clusters, Jobs, and Queues
- IBM Spectrum LSF ("Load Sharing Facility") is enterprise-class workload management software. It distributes work across existing heterogeneous IT resources to create a shared, scalable, and fault-tolerant infrastructure that delivers faster, more balanced, and more reliable workload performance at lower cost.
- First-come, first-served (FCFS) scheduling
- The default LSF scheduling policy.
- Jobs are considered for dispatch based on their order in the queue.
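The FCFS rule above can be sketched in a few lines of Python. This is only an illustration of "order in the queue", not LSF code: the `Job` class and its `submit_time` field are hypothetical names introduced here, and real LSF dispatch also depends on queue priority, resource availability, and other configured policies.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    submit_time: int  # hypothetical logical timestamp, smaller = submitted earlier


def fcfs_order(jobs):
    """Return job names in the order an FCFS policy would consider them."""
    # sorted() is stable, so jobs with equal submit_time keep their input order.
    return [job.name for job in sorted(jobs, key=lambda j: j.submit_time)]


if __name__ == "__main__":
    pending = [Job("render", 3), Job("simulate", 1), Job("report", 2)]
    print(fcfs_order(pending))
```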
01 What Is a Cluster
1. Definition
- A group of computers (hosts) running LSF that work together as a single unit, combining computing power and sharing workload and resources.
- A cluster provides a single-system image for a network of computing resources.
- Hosts can be grouped into clusters in a number of ways. A cluster could contain:
- All the hosts in a single administrative group
- All the hosts on one file server or sub-network
- Hosts that perform similar functions
2. Configuration
- Define the hosts in your cluster in the lsf.cluster.cluster_name file.
- Tip: The name of your cluster should be unique. It should not be the same as any host name or queue name.
02 What Is a Job
1. Definition
- A unit of work that runs in the LSF system.
- A job is a command that is submitted to LSF for execution.
- LSF schedules, controls, and tracks the job according to configured policies.
- Jobs can be complex problems, simulation scenarios, extensive calculations, anything that needs compute power.
2. Job Slot
1) Definition
- A job slot is a bucket into which a single unit of work is assigned in the LSF system.
- Hosts are configured to have a number of job slots available and queues dispatch jobs to fill job slots.
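As a rough sketch of how job slots bound the amount of work a host accepts, the following Python function fills free slots in submission order. The host names, slot counts, and the "first host with a free slot" choice are assumptions made for illustration; LSF itself selects the best available host according to its configured policies.

```python
def fill_job_slots(pending_jobs, host_slots):
    """Assign pending jobs, in order, to hosts that still have a free slot.

    pending_jobs: list of job names in queue order.
    host_slots:   dict mapping host name -> number of configured job slots.
    Returns (assignments, still_pending).
    """
    free = dict(host_slots)  # copy so the caller's configuration is not modified
    assignments = {}
    still_pending = []
    for job in pending_jobs:
        # Simplification: take the first host that has a free slot.
        host = next((h for h, n in free.items() if n > 0), None)
        if host is None:
            still_pending.append(job)  # no free slot anywhere, the job keeps waiting
        else:
            free[host] -= 1            # one job occupies one slot
            assignments[job] = host
    return assignments, still_pending


if __name__ == "__main__":
    print(fill_job_slots(["j1", "j2", "j3", "j4"], {"hostA": 2, "hostB": 1}))
```

With three slots in total and four jobs, the last job stays pending until a slot is released.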
2) Configuration
- Define job slot limits in the lsb.resources file.
3. Job States
1) Definition: the state of a job
2) Types
State | Description |
PEND | Waiting in a queue for scheduling and dispatch |
RUN | Dispatched to a host and running |
DONE | Finished normally with zero exit value |
EXIT | Finished with non-zero exit value |
PSUSP | Suspended while pending |
USUSP | Suspended by user |
SSUSP | Suspended by the LSF system |
POST_DONE | Post-processing is completed without errors |
POST_ERR | Post-processing is completed with errors |
WAIT | Members of a chunk job that are waiting to run |
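When scripting around job status, it can help to group the states in the table into broader buckets. The sketch below does that in Python; the state names and descriptions come from the table, but the four bucket names ("pending", "running", "suspended", "finished") are this example's own grouping, not LSF terminology.

```python
JOB_STATES = {
    "PEND": "Waiting in a queue for scheduling and dispatch",
    "RUN": "Dispatched to a host and running",
    "DONE": "Finished normally with zero exit value",
    "EXIT": "Finished with non-zero exit value",
    "PSUSP": "Suspended while pending",
    "USUSP": "Suspended by user",
    "SSUSP": "Suspended by the LSF system",
    "POST_DONE": "Post-processing is completed without errors",
    "POST_ERR": "Post-processing is completed with errors",
    "WAIT": "Members of a chunk job that are waiting to run",
}

# Hypothetical grouping used only in this example.
_BUCKETS = {
    "pending": {"PEND", "WAIT"},
    "running": {"RUN"},
    "suspended": {"PSUSP", "USUSP", "SSUSP"},
    "finished": {"DONE", "EXIT", "POST_DONE", "POST_ERR"},
}


def classify(state):
    """Map a job state name to one of the example's four buckets."""
    for bucket, states in _BUCKETS.items():
        if state in states:
            return bucket
    raise ValueError(f"unknown job state: {state!r}")


if __name__ == "__main__":
    for name in JOB_STATES:
        print(name, "->", classify(name))
```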
03 What Is a Queue
1. Definition
- A cluster-wide container for jobs.
- All jobs wait in queues until they are scheduled and dispatched to hosts.
- Queues do not correspond to individual hosts; each queue can use all server hosts in the cluster, or a configured subset of the server hosts.
- When you submit a job to a queue, you do not need to specify an execution host.
- LSF dispatches the job to the best available execution host in the cluster to run that job.
- Queues implement different job scheduling and control policies.
2. Configuration
- Define queues in the lsb.queues file.
- Tip: The names of your queues should be unique. They should not be the same as the cluster name or the name of any host in the cluster.
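04 Commands for Clusters, Jobs, and Queues
Category | Command | Description |
Cluster | lshosts | Displays static resource information about the hosts in the cluster |
Cluster | bhosts | Displays resource and job information about the server hosts in the cluster |
Cluster | lsid | Displays the LSF version, the cluster name, and the master host name |
Cluster | lsclusters | Displays cluster status and size |
Job | bjobs | Displays the jobs in the system |
Job | bsub | Submits a job |
Job Slot | bhosts | Displays job slot limits for hosts and host groups |
Job Slot | bqueues | Displays job slot limits for queues |
Job Slot | busers | Displays job slot limits for users and user groups |
Queue | bqueues | Displays the available queues |
Queue | bsub -q | Submits a job to the specified queue |
Queue | bparams | Displays the default queues |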
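For scripted submission, the `bsub` and `bsub -q` entries from the table can be assembled as an argument list in Python. This is a minimal sketch: the helper name `build_bsub_command` and the queue name `normal` are assumptions for illustration, and the function only builds the command without running it, so it works on a machine without LSF installed.

```python
import shlex


def build_bsub_command(command, queue=None):
    """Build the argument list for submitting `command` with bsub.

    If `queue` is given, the job is submitted to that queue with `-q`;
    otherwise LSF uses its default queue.
    """
    args = ["bsub"]
    if queue is not None:
        args += ["-q", queue]
    args += shlex.split(command)  # split the job command the way a shell would
    return args


if __name__ == "__main__":
    # "normal" is a hypothetical queue name; check bqueues for the real ones.
    print(build_bsub_command("sleep 10", queue="normal"))
```

On a host where LSF is installed, the resulting list could be passed to `subprocess.run` to submit the job.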