
LSF Clusters, Jobs, and Queues

by yororing 2024. 4. 22.

00 Clusters, Jobs, and Queues

  • IBM Spectrum LSF ("Load Sharing Facility") is enterprise-class software that distributes work across existing heterogeneous IT resources to create a shared, scalable, and fault-tolerant infrastructure that delivers faster, more balanced, and more reliable workload performance and reduces cost.
  • First-come, first-served scheduling (FCFS)
    • The default scheduling policy in LSF.
    • Jobs are considered for dispatch based on their order in the queue.
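
To make the FCFS idea concrete, here is a minimal Python sketch. It is not LSF code: the function name `fcfs_order` and the `(job_name, submit_time)` tuples are made up for illustration, and it only shows that jobs are considered in the order they entered the queue.

```python
def fcfs_order(jobs):
    """Return job names in the order FCFS would consider them for dispatch.

    jobs: list of (job_name, submit_time) tuples. Both fields are
    hypothetical and exist only for this illustration.
    """
    # sorted() is stable, so jobs with equal submit times keep their list order.
    return [name for name, _ in sorted(jobs, key=lambda job: job[1])]


jobs = [("render", 30), ("compile", 10), ("simulate", 20)]
print(fcfs_order(jobs))  # ['compile', 'simulate', 'render']
```

The job submitted earliest is considered first, regardless of where it appears in the input list.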

01 What Is a Cluster

1. Definition

  • A group of computers (hosts) running LSF that work together as a single unit, combining computing power and sharing workload and resources.
  • A cluster provides a single-system image for a network of computing resources.
  • Hosts can be grouped into clusters in a number of ways. A cluster could contain:
    • All the hosts in a single administrative group
    • All the hosts on one file server or sub-network
    • Hosts that perform similar functions

2. Configuration

  • Define the hosts in your cluster in the lsf.cluster.cluster_name file, where cluster_name is the name of your cluster.
  • Tip: The name of your cluster should be unique. It should not be the same as the name of any host or queue.
 

02 What Is a Job

1. Definition

  • A unit of work that runs in the LSF system.
  • A job is a command that is submitted to LSF for execution.
  • LSF schedules, controls, and tracks the job according to configured policies. 
  • Jobs can be complex problems, simulation scenarios, extensive calculations, anything that needs compute power.

2. Job Slot

1) Definition

  • A job slot is a bucket into which a single unit of work is assigned in the LSF system.
  • Hosts are configured to have a number of job slots available and queues dispatch jobs to fill job slots.
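
The relationship between queues, hosts, and job slots can be sketched as a toy model. This is an illustration under simplifying assumptions, not how LSF actually chooses hosts: the `dispatch` function, the host names, and the rule "fill hosts in the order they are listed" are all invented for this example.

```python
from collections import deque


def dispatch(pending_jobs, host_slots):
    """Fill free job slots with pending jobs, oldest job first.

    pending_jobs: job names in queue order.
    host_slots: dict mapping a host name to its number of free job slots.
    Returns (assignments, still_pending).
    """
    pending = deque(pending_jobs)
    free = dict(host_slots)  # copy so the caller's dict is not modified
    assignments = {}
    # Assumption for this sketch: hosts are filled in the order listed.
    for host in free:
        while free[host] > 0 and pending:
            job = pending.popleft()
            assignments[job] = host
            free[host] -= 1
    return assignments, list(pending)


assignments, still_pending = dispatch(
    ["j1", "j2", "j3", "j4"], {"hostA": 2, "hostB": 1}
)
print(assignments)    # {'j1': 'hostA', 'j2': 'hostA', 'j3': 'hostB'}
print(still_pending)  # ['j4']
```

With three slots and four jobs, one job stays in the queue until a slot frees up.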

2) Configuration

  • Define job slot limits in the lsb.resources file.

3. Job States

1) Definition: the status of a job

2) Types

State      Description
PEND       Waiting in a queue for scheduling and dispatch
RUN        Dispatched to a host and running
DONE       Finished normally with zero exit value
EXIT       Finished with non-zero exit value
PSUSP      Suspended while pending
USUSP      Suspended by user
SSUSP      Suspended by the LSF system
POST_DONE  Post-processing completed without errors
POST_ERR   Post-processing completed with errors
WAIT       Members of a chunk job that are waiting to run
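
When scripting around job states, it can help to bucket them into broader groups. The sketch below uses the state names from the table above. The group labels ("waiting", "running", and so on) and the `state_group` helper are my own convenience for this post, not LSF terminology.

```python
# Grouping is an assumption made for illustration; LSF itself only
# defines the individual state names.
STATE_GROUPS = {
    "waiting": {"PEND", "WAIT"},
    "running": {"RUN"},
    "finished": {"DONE", "EXIT"},
    "suspended": {"PSUSP", "USUSP", "SSUSP"},
    "post-processing": {"POST_DONE", "POST_ERR"},
}


def state_group(state):
    """Return the group label for an LSF job state name."""
    for group, states in STATE_GROUPS.items():
        if state in states:
            return group
    raise ValueError(f"unknown job state: {state!r}")


print(state_group("USUSP"))  # suspended
print(state_group("DONE"))   # finished
```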

03 What Is a Queue

1. Definition

  • A cluster-wide container for jobs.
  • All jobs wait in queues until they are scheduled and dispatched to hosts.
  • Queues do not correspond to individual hosts; each queue can use all server hosts in the cluster, or a configured subset of the server hosts.
  • When you submit a job to a queue, you do not need to specify an execution host.
  • LSF dispatches the job to the best available execution host in the cluster to run that job.
  • Queues implement different job scheduling and control policies.

2. Configuration

  • Define queues in the lsb.queues file.
  • Tip: The names of your queues should be unique. They should not be the same as the cluster name or the name of any host in the cluster.

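04 Commands for Clusters, Jobs, and Queues

Category  Command     Description
Cluster   lshosts     Displays static resource information about the hosts in the cluster
          bhosts      Displays resource and job information about the server hosts in the cluster
          lsid        Displays the LSF version, the cluster name, and the master host name
          lsclusters  Displays cluster status and size
Job       bjobs       Displays the jobs in the system
          bsub        Submits a job
Job Slot  bhosts      Displays job slot limits for hosts and host groups
          bqueues     Displays job slot limits for queues
          busers      Displays job slot limits for users and user groups
Queue     bqueues     Displays the available queues
          bsub -q     Submits a job to the specified queue
          bparams     Displays the default queues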
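
If you drive these commands from a script, a small helper can assemble the `bsub` argument list. This is only a sketch: `build_bsub_command` is a hypothetical helper, and the queue name `normal` is an assumed example that may not exist in your cluster. The code only builds the argument list, so it runs without LSF installed.

```python
import shlex


def build_bsub_command(command, queue=None):
    """Build the argument list for submitting a command with bsub.

    command: the command line to run as a job, e.g. "sleep 60".
    queue: optional queue name, passed to bsub with -q.
    """
    args = ["bsub"]
    if queue is not None:
        args += ["-q", queue]
    # Split the job's command line the way a POSIX shell would.
    args += shlex.split(command)
    return args


print(build_bsub_command("sleep 60", queue="normal"))
# ['bsub', '-q', 'normal', 'sleep', '60']
```

On a host where LSF is installed, the resulting list could be passed to `subprocess.run` to submit the job. Without `queue`, the job goes to whichever default queue the cluster is configured with.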

References

  1. https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=lsf-clusters-jobs-queues 