{"id":130,"date":"2019-10-11T13:26:03","date_gmt":"2019-10-11T17:26:03","guid":{"rendered":"https:\/\/www.bu.edu\/engit\/?page_id=130"},"modified":"2019-11-13T13:05:11","modified_gmt":"2019-11-13T18:05:11","slug":"gpu","status":"publish","type":"page","link":"https:\/\/www.bu.edu\/engit\/knowledge-base\/grid\/gpu\/","title":{"rendered":"GPU"},"content":{"rendered":"<h1>THIS IS AN ARCHIVE<\/h1>\n<p>These queues are retired, and this page will be removed after a review.<\/p>\n<div dir=\"ltr\" id=\"content\" lang=\"en\">\n<p>Before using the GPUs on the Grid, follow the <a href=\"\/engit\/knowledge-base\/grid\/grid-gridinstructions\">general ENG-Grid instructions<\/a> and the <a href=\"\/engit\/knowledge-base\/grid\/software\/grid-cuda\">CUDA instructions<\/a>.<\/p>\n<h1 id=\"GPU-enabledGridqueues\">GPU-enabled Grid queues<\/h1>\n<p>The current GPU-enabled queues on the ENG-Grid are:<\/p>\n<pre class=\"darkSnippet\">gpu.q    -- 1 GPU and 2 CPU cores per 4GB RAM workstation node\r\n            Currently 16 nodes with a total of 16 GeForce Kepler GTX 650 (2GB) GPUs\r\n            (they additionally have small Quadro NVS GPUs (256MB) attached to the displays, not really useful for CUDA)\r\nbudge.q  -- 8 GPUs and 8 CPU cores per 24GB RAM per node\r\n            Currently 2 nodes with a total of 16 GPUs:  8 non-Fermi Tesla M1060's (4GB), 6 Tesla Fermi M2050's, and 2 Tesla Fermi M2090's (6GB)\r\nbungee.q -- 2 or 3 GPUs and 8 CPU cores per 24GB RAM node, with QDR InfiniBand networking\r\n            Currently 3 nodes with 1 Tesla Kepler K20 (6GB) and 2 Tesla Fermi M2070\/2075's (6GB) each, plus 13 nodes with 2 Tesla Fermi M2070\/2075's (6GB) each. 
\r\n            (Divided into bungee.q and bungee-exclusive.q for use by buy-in researchers)\r\ngpuinteractive.q -- a subset of budge.q intended for GPU but not CPU-intensive \"qlogin\" jobs<\/pre>\n<p>If you wish to be added to the permissions list to use these queues, please email enghelp [at] bu.edu.<\/p>\n<h2 id=\"UsingGPUresources\">Using GPU resources<\/h2>\n<p>For GPU submission on the Grid, we have configured a consumable resource complex called &#8220;gpu&#8221; on these queues. Each host has an integer quantity of the gpu resource equal to the number of GPUs in it. Machines with Fermi or Kepler-generation GPUs also have the boolean resources &#8220;fermi&#8221; and &#8220;kepler&#8221;, respectively.<\/p>\n<p>To see the status of arbitrary complex resources on the queue, use qstat with the -F switch, like this:<\/p>\n<pre class=\"darkSnippet\">qstat -q bungee.q,budge.q,gpu.q -F gpu,fermi,kepler<\/pre>\n<p>If you submit to any gpu-enabled queue and intend to use the GPU for computation, you should submit with the switch &#8220;-l gpu=1&#8221;.<\/p>\n<p>For example, suppose you run:<\/p>\n<pre class=\"darkSnippet\">cd \/mnt\/nokrb\/yourusername\r\nqsub -q gpu.q -cwd -l gpu=1 -b y \".\/mycudaprogram\"<\/pre>\n<p>That picks a node in the gpu.q queue that has the gpu resource free and consumes that resource. The machine still has another &#8220;slot&#8221; available for use by a qsub that does *not* request the gpu.<\/p>\n<p>Since there are 16 machines in the gpu.q queue with 2 CPUs each but only 1 GPU each, there are 32 slots total but only 16 units of the gpu resource. If all the slots are empty and you submit 17 jobs that each request &#8220;-l gpu=1&#8221;, 16 jobs will run, one per host, and the 17th will wait in the queue until one of them finishes and frees up a gpu. If you instead submit 16 jobs that each request a GPU and 16 that *don&#8217;t*, they will all execute simultaneously and nothing will wait in the queue. 
For bungee.q, there are 128 slots (8 cores in each bungee machine x 16 machines), but only 32 resources in the &#8220;gpu&#8221; complex (2 gpus in each bungee machine x 16 machines).<\/p>\n<p>If you specifically want two Fermi GPUs on bungee.q, run:<\/p>\n<pre class=\"darkSnippet\">qsub -q bungee.q -cwd -l gpu=2 -l fermi=true -b y \".\/mycudaprogram\"<\/pre>\n<p>To specifically avoid Fermi GPUs, use fermi=false. If you don&#8217;t care what kind of GPU you get, omit the fermi= switch entirely.<\/p>\n<p>Please do not request a gpu resource in the queue if you do not intend to use the gpu for that job. Likewise, please do not attempt to use the gpu in the queue without requesting the gpu resource: running more GPU jobs than there are GPUs in the system only slows everything down. Note that specifying &#8220;gpu=2&#8221; doesn&#8217;t actually change whether your code is *allowed* to use 2 GPUs or one; the &#8220;gpu&#8221; complex is basically an honor system. It means that you&#8217;ve &#8220;reserved&#8221; both GPUs on that machine for your own work, and as long as other people who are using 1 or 2 gpus also specify gpu=1 or gpu=2 accordingly, nobody should conflict. 
Of course, as soon as someone runs gpu code without having reserved a gpu, this accounting no longer helps, so if you intend to use a gpu, please make sure to always request the complex.<\/p>\n<p>Likewise, if you request an interactive slot, make sure to &#8220;qlogin&#8221; to gpuinteractive.q, and never ssh directly into machines in the queue:<\/p>\n<pre class=\"darkSnippet\">qlogin -q gpuinteractive.q -l gpu=1\r\n(for an interactive login where you intend to run GPU code)<\/pre>\n<p>or<\/p>\n<pre class=\"darkSnippet\">qlogin -q gpuinteractive.q\r\n(for an interactive login where you do not intend to use the GPU.  NOTE WELL -- there's really no reason to do this!  For a basic login where you don't intend to use the GPU, there's no reason to use gpuinteractive.q at all -- use another queue that has far more slots in it, such as interactive.q!)<\/pre>\n<h2 id=\"EXAMPLE.3ASubmittingaCUDAJobthroughqsub\">EXAMPLE: Submitting a CUDA Job through qsub<\/h2>\n<p>We recommend that once you&#8217;re running production jobs, you submit batch jobs (qsub) instead of interactive jobs (qlogin). 
Refer to <a class=\"http\" href=\"\/engit\/knowledge-base\/grid\/software\/grid-cuda\">Grid CUDA<\/a> for step-by-step instructions on building a CUDA program in our environment, test your code on the command line, and then read below to batch it up.<\/p>\n<p>Set up Grid Engine as described at <a class=\"nonexistent\" href=\"\/engit\/Grid-GridGPU-moin-GridInstructions\">Grid Instructions<\/a>, and write a shell script (called gridrun.sh in this example) that includes all of the switches you wish to use. Put both the script and the binary you wish to run in your \/mnt\/nokrb directory.<\/p>\n<pre class=\"darkSnippet\">#$ -V\r\n#$ -cwd\r\n#$ -q budge.q\r\n#$ -l fermi=false\r\n#$ -l gpu=1\r\n#$ -N yourJobName\r\n#$ -j y\r\n\r\n.\/yourCudaBinary<\/pre>\n<p>Now change to the \/mnt\/nokrb\/yourusername directory where you put both the script and binary, and run:<\/p>\n<pre class=\"darkSnippet\">qsub gridrun.sh<\/pre>\n<p>Alternatively, you could forgo the shell script and put all of the switches on the command line, like this, but this gets unwieldy when there are too many options:<\/p>\n<pre class=\"darkSnippet\">qsub -V -cwd -q budge.q -l fermi=false -l gpu=1 -N yourJobName -j y -b y \".\/yourCudaBinary\"<\/pre>\n<p>Note that this script uses the &#8220;-V&#8221; switch to export the environment variables of your current shell (including any library paths you have sourced) to the job, and the &#8220;-j y&#8221; switch to merge stderr (normally the .e file) into stdout (the .o file). It submits to &#8220;budge.q&#8221; and asks for one non-Fermi GPU. You could use the other queues, including bungee.q, if you need different features.<\/p>\n<h2 id=\"SubmittingaCUDAJobwithannvidia-smioperation\">Submitting a CUDA Job with an nvidia-smi operation<\/h2>\n<p>The gpu complex only reports the number of available GPUs on a node, trusting the users to have requested GPUs honestly using &#8220;-l gpu=#&#8221;. 
For more information, you can use deviceQuery or nvidia-smi, which report real-time GPU statistics.<\/p>\n<p>For deviceQuery, follow the instructions at <a class=\"http\" href=\"http:\/\/www.resultsovercoffee.com\/2011\/02\/cudavisibledevices.html\">http:\/\/www.resultsovercoffee.com\/2011\/02\/cudavisibledevices.html<\/a><\/p>\n<p>Here is an example that uses nvidia-smi to do something similar: it checks the available memory on each GPU in the system and passes back the device number of the unloaded GPU, which you can then pass as an argument to your binary for use with cudaSetDevice.<\/p>\n<pre class=\"darkSnippet\">#$ -cwd\r\nhostname\r\ndev=`nvidia-smi -a | grep Free | awk '{print $3}'|.\/choose_device.sh`\r\n.\/command -device $dev<\/pre>\n<p>Incorporate this into your own submission script and use it to pass an argument to your program so that it can call cudaSetDevice appropriately.<\/p>\n<p>The sample runs below submit a script, nvidiamem.sh, that prints the hostname and the per-GPU memory figures. Note that bungee.q has 2 GPUs per node and budge.q has 8, and that the third submission specifically asks for Fermis:<\/p>\n<pre class=\"darkSnippet\">bungee:\/mnt\/nokrb\/kamalic$ qsub -q bungee.q nvidiamem.sh\r\nYour job 2334109 (\"nvidiamem.sh\") has been submitted\r\nbungee:\/mnt\/nokrb\/kamalic$ qsub -q budge.q nvidiamem.sh\r\nYour job 2334110 (\"nvidiamem.sh\") has been submitted\r\nbungee:\/mnt\/nokrb\/kamalic$ qsub -q bungee.q -l fermi=true nvidiamem.sh\r\nYour job 2334113 (\"nvidiamem.sh\") has been submitted<\/pre>\n<pre class=\"darkSnippet\">bungee:\/mnt\/nokrb\/kamalic$ more nvidiamem.sh.o*\r\n::::::::::::::\r\nnvidiamem.sh.o2334109\r\n::::::::::::::\r\nbungee16\r\n4092\r\n4092\r\n::::::::::::::\r\nnvidiamem.sh.o2334110\r\n::::::::::::::\r\nbudge02.bu.edu\r\n4092\r\n4092\r\n4092\r\n4092\r\n4092\r\n4092\r\n4092\r\n4092\r\n::::::::::::::\r\nnvidiamem.sh.o2334113\r\n::::::::::::::\r\nbungee05\r\n5365\r\n5365<\/pre>\n<p>Below is an example on a machine which has two CUDA cards, showing how to use the CUDA_VISIBLE_DEVICES variable to show only one of the two devices, 
query it to confirm that it&#8217;s the only one showing up, and then run on that device:<\/p>\n<pre class=\"darkSnippet\">hpcl-19:~\/Class\/cuda\/cudademo$ \r\n\/ad\/eng\/support\/software\/linux\/all\/x86_64\/cuda\/cuda_sdk\/C\/bin\/linux\/release\/deviceQuery -noprompt|egrep \"^Device\"\r\n[deviceQuery] starting...\r\nDevice 0: \"D14P2-30\"\r\nDevice 1: \"Quadro NVS 295\"\r\n[deviceQuery] test results...\r\nPASSED<\/pre>\n<p><strong>NOTE that on some platforms, &#8220;nvidia-smi&#8221; actually MISREPORTS the device numbers! It&#8217;s best to use deviceQuery, or to sanity-check what&#8217;s being reported!<\/strong><\/p>\n<pre class=\"darkSnippet\">[So we see both devices.  Now we set only the first device visible:]\r\n\r\nhpcl-19:~\/Class\/cuda\/cudademo$ export CUDA_VISIBLE_DEVICES=\"0\"\r\nhpcl-19:~\/Class\/cuda\/cudademo$ \r\n\/ad\/eng\/support\/software\/linux\/all\/x86_64\/cuda\/cuda_sdk\/C\/bin\/linux\/release\/deviceQuery -noprompt|egrep \"^Device\"\r\n[deviceQuery] starting...\r\nDevice 0: \"D14P2-30\"\r\n[deviceQuery] test results...\r\nPASSED\r\nhpcl-19:~\/Class\/cuda\/cudademo$ .\/cudademo\r\n[SNIP]\r\n9.000000 258064.000000 259081.000000 260100.000000 261121.000000\r\n\r\n[Now we set only the second device visible:]\r\n\r\nhpcl-19:~\/Class\/cuda\/cudademo$ export CUDA_VISIBLE_DEVICES=\"1\"\r\nhpcl-19:~\/Class\/cuda\/cudademo$ \r\n\/ad\/eng\/support\/software\/linux\/all\/x86_64\/cuda\/cuda_sdk\/C\/bin\/linux\/release\/deviceQuery -noprompt|egrep \"^Device\"\r\n[deviceQuery] starting...\r\nDevice 0: \"Quadro NVS 295\"\r\n[deviceQuery] test results...\r\nPASSED\r\nhpcl-19:~\/Class\/cuda\/cudademo$ .\/cudademo\r\n[SNIP]\r\n9.000000 258064.000000 259081.000000 260100.000000 261121.000000\r\nhpcl-19:~\/Class\/cuda\/cudademo$<\/pre>\n<p>Note that for a program as small as cudademo, any difference in speed between the two cards is meaningless.<\/p>\n<p>&nbsp;<\/p>\n<h2 id=\"SeeAlso\">See Also<\/h2>\n<p>&nbsp;<\/p>\n<ul>\n<li><a class=\"http\" 
href=\"http:\/\/www.nvidia.com\/object\/gpu-applications.html\">http:\/\/www.nvidia.com\/object\/gpu-applications.html<\/a><\/li>\n<li>The cuda examples directory of <a class=\"https\" href=\"https:\/\/github.com\/eng-it\/grid-tests\">https:\/\/github.com\/eng-it\/grid-tests<\/a><\/li>\n<\/ul>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>THIS IS AN ARCHIVE These queues are retired, and this page will be removed after a review. Before using the GPUs on the Grid, follow the general ENG-Grid instructions and the CUDA instructions. GPU-enabled Grid queues The current GPU-enabled queues on the ENG-Grid are: gpu.q &#8212; 1 GPU and 2 CPU cores per 4GB RAM [&hellip;]<\/p>\n","protected":false},"author":16541,"featured_media":0,"parent":27,"menu_order":15,"comment_status":"closed","ping_status":"closed","template":"","meta":[],"_links":{"self":[{"href":"https:\/\/www.bu.edu\/engit\/wp-json\/wp\/v2\/pages\/130"}],"collection":[{"href":"https:\/\/www.bu.edu\/engit\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/www.bu.edu\/engit\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/www.bu.edu\/engit\/wp-json\/wp\/v2\/users\/16541"}],"replies":[{"embeddable":true,"href":"https:\/\/www.bu.edu\/engit\/wp-json\/wp\/v2\/comments?post=130"}],"version-history":[{"count":11,"href":"https:\/\/www.bu.edu\/engit\/wp-json\/wp\/v2\/pages\/130\/revisions"}],"predecessor-version":[{"id":931,"href":"https:\/\/www.bu.edu\/engit\/wp-json\/wp\/v2\/pages\/130\/revisions\/931"}],"up":[{"embeddable":true,"href":"https:\/\/www.bu.edu\/engit\/wp-json\/wp\/v2\/pages\/27"}],"wp:attachment":[{"href":"https:\/\/www.bu.edu\/engit\/wp-json\/wp\/v2\/media?parent=130"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}