<?xml version="1.0" encoding="UTF-8"?> <rss
version="2.0"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:atom="http://www.w3.org/2005/Atom"
xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
> <channel><title>gehrcke.de &#187; Clobi</title> <atom:link href="http://gehrcke.de/category/clobi/feed/" rel="self" type="application/rss+xml" /><link>http://gehrcke.de</link> <description>Jan-Philip Gehrcke&#039;s website</description> <lastBuildDate>Fri, 07 Oct 2011 15:57:11 +0000</lastBuildDate> <language>en</language> <sy:updatePeriod>hourly</sy:updatePeriod> <sy:updateFrequency>1</sy:updateFrequency> <generator>http://wordpress.org/?v=3.3</generator> <item><title>Google Summer of Code end: code upload and acknowledgement</title><link>http://gehrcke.de/2009/08/google-summer-of-code-end-code-upload-and-acknowledgement/</link> <comments>http://gehrcke.de/2009/08/google-summer-of-code-end-code-upload-and-acknowledgement/#comments</comments> <pubDate>Mon, 24 Aug 2009 19:17:01 +0000</pubDate> <dc:creator>Jan-Philip Gehrcke</dc:creator> <category><![CDATA[Amazon Web Services]]></category> <category><![CDATA[CernVM]]></category> <category><![CDATA[Clobi]]></category> <category><![CDATA[General]]></category> <category><![CDATA[GSoC 2009]]></category> <category><![CDATA[Nimbus]]></category> <category><![CDATA[Personal Stuff]]></category> <category><![CDATA[Python]]></category> <guid
isPermaLink="false">http://gehrcke.de/?p=929</guid> <description><![CDATA[<p>The Google Summer of Code 2009 final evaluation deadline is today; 19 UTC. I don&#8217;t have time to summarize my summer here now, but there are two things I want to say to the world. First, I want to thank many people for enriching my summer. Second, I would like to announce the Clobi project [...]]]></description> <content:encoded><![CDATA[<p>The <em>Google Summer of Code 2009</em> final evaluation deadline is today; 19 UTC. I don&#8217;t have time to summarize my summer here now, but there are two things I want to say to the world. First, I want to thank many people for enriching my summer. Second, I would like to announce the <em>Clobi</em> project on <em>Google Code</em>.<span
id="more-929"></span></p><h4>Acknowledgement</h4><ul><li><a
href="http://www.mcs.anl.gov/~keahey/">Kate Keahey</a> (<a
href="http://www.anl.gov/">Argonne National Laboratory</a>/<a
href="http://globus.org/">The Globus Alliance</a>/<a
href="http://workspace.globus.org/">Nimbus team</a>).<br
/> She was a fabulous mentor. I can tell how much this was worth by looking at how much I&#8217;ve learned throughout the summer just from our great conversations <img
src='http://gehrcke.de/wp/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> . Thank you for all your support and dedication and for pushing me and the project forwards!</li></ul><ul><li><a
href="http://www.mcs.anl.gov/~tfreeman/">Tim Freeman</a> and <a
href="http://www.linkedin.com/pub/david-labissoniere/b/888/663">David LaBissoniere</a> (<a
href="http://www.anl.gov/">Argonne National Laboratory</a>/<a
href="http://globus.org/">The Globus Alliance</a>/<a
href="http://workspace.globus.org/">Nimbus team</a>).<br
/> Wonderful, extensive and patient extra-premium-support-with-special-treatment regarding <em>Nimbus/Workspace</em> in the MUD. Thank you so much! You smoothed my technical way through the project.</li></ul><ul><li><a
href="https://excess.org/">Ian Ward</a> (creator of <a
href="http://excess.org/urwid">urwid</a>, a console user interface library for <em>Python</em>).<br
/> He gave me great support while I was implementing the user interface for <em>Clobi&#8217;s Resource Manager</em> using <em>urwid</em>&#8217;s new <em>SelectEventLoop</em> technology. Talking to him saved me a lot of valuable time. Thank you!</li></ul><ul><li><a
href="http://www.elastician.com/">Mitchell Garnaat</a> (the creator of <a
href="http://code.google.com/p/boto/">boto</a>, a Python interface to Amazon Web Services) and the <a
href="http://groups.google.com/group/boto-users">boto users mailing list</a>.<br
/> Great and essential support for almost a year now. Thank you very much for answering many questions!</li></ul><ul><li><a
href="https://savannah.cern.ch/users/pbuncic">Predrag Buncic</a> (<a
href="http://cernvm.cern.ch/cernvm/">CernVM</a>) and <a
href="http://www.linkedin.com/in/artemharutyunyan">Artem Harutyunyan</a> (<a
href="http://aliceinfo.cern.ch/Collaboration/index.html">ALICE@LHC</a>).<br
/> Thank you <a
href="http://gehrcke.de/2009/06/cernvm-on-nimbusec2-public-key-injection-problem/">for your support</a> regarding <em>CernVM</em> on <em>Nimbus</em>!</li></ul><ul><li><a
href="http://www.linkedin.com/pub/jakub-moscicki/0/629/b79">Jakub Moscicki</a>, <a
href="http://www3.imperial.ac.uk/people/u.egede">Ulrik Egede</a>, <a
href="http://homepages.physik.uni-muenchen.de/~Johannes.Elmsheuser/contact.html">Johannes Elmsheuser</a> and the rest of the <a
href="http://ganga.web.cern.ch/ganga/">Ganga</a> crew.<br
/> Thanks a lot for very important, effective and efficient support concerning the system behind <em>Clobi</em> and <em>Clobi&#8217;s Ganga backend</em>. You are great!</li></ul><ul><li><a
href="http://www.linkedin.com/pub/stefan-kluth/4/72a/971">Stefan Kluth</a> and <a
href="http://www.linkedin.com/profile?viewProfile=&#038;key=15670615">Stefan Stonjek</a> (<a
href="http://atlas.ch/">ATLAS@LHC</a>, <a
href="http://www.mpp.mpg.de/">Max-Planck-Institut für Physik, Munich</a>)<br
/> Thank you for many helpful discussions, support and beautiful times in Munich!</li></ul><ul><li><a
href="http://people.cs.uchicago.edu/~borja/">Borja Sotomayor</a> (<a
href="http://www.uchicago.edu/">University of Chicago</a>, <a
href="http://globus.org">Globus Alliance</a>).<br
/> He did a great job as the GSoC mentoring organization administrator. Thank you for this and for introducing the MUD <img
src='http://gehrcke.de/wp/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> .</li></ul><ul><li>Paul D Marshall.<br
/> Thank you for discussions regarding dynamic deployment of computing resources.</li></ul><ul><li><a
href="http://www.linkedin.com/pub/xiaoming-gao/9/a90/348">Xiaoming Gao</a>.<br
/> Thank you for the exciting cooperation regarding <em>Virtual Block Stores</em> for Nimbus.</li></ul><ul><li><a
href="http://en.wikipedia.org/wiki/Alex_Martelli">Alex Martelli</a>.<br
/> He answered <a
href="http://stackoverflow.com/questions/1185660/python-is-os-read-os-write-on-an-os-pipe-threadsafe">an important Python question</a> that most people could not answer. This smoothed my way to implementing inter-thread communication in <em>Clobi&#8217;s Resource Manager</em>.</li></ul><h4>Clobi @ Google Code</h4><p>In <a
href="http://gehrcke.de/2009/08/distribute-high-performance-computing-jobs-among-multiple-computing-clouds/">this blog post</a> I introduced <strong>Clobi</strong>, the result of this <em>Google Summer of Code</em> project. Today, I created a new project page for <em>Clobi</em> on <em>Google Code</em>: <a
href="http://code.google.com/p/clobi/">http://code.google.com/p/clobi/</a></p><p>As a first action, I pushed my local <a
href="http://en.wikipedia.org/wiki/Mercurial_%28software%29">mercurial</a> code repository into the online repository. You can browse the code <a
href="http://code.google.com/p/clobi/source/browse/#hg">here</a> and you can look through the development history (the commits I&#8217;ve made) <a
href="http://code.google.com/p/clobi/source/list">here</a>.</p><p>Then I prepared a test release of all <em>Clobi</em> components. You can get it <a
href="http://code.google.com/p/clobi/downloads/list">in the download section</a>.</p><p>That&#8217;s all about <em>Clobi</em> for the time being. From tomorrow on, I will continue with my master&#8217;s thesis project in physics (about <em>Magnetic Particle Imaging</em>). Next week, the <a
href="http://www.icmrm10.montana.edu/">ICMRM</a> conference in Montana starts and I have to make a poster for my contribution (abstract <a
href="gehrcke_ICMRM09_abstr_MPI_MMF2vsSPM_090713.pdf">here</a>).</p><p><strong>I will continue with <em>Clobi</em></strong> when there is more free time <img
src='http://gehrcke.de/wp/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> .</p> ]]></content:encoded> <wfw:commentRss>http://gehrcke.de/2009/08/google-summer-of-code-end-code-upload-and-acknowledgement/feed/</wfw:commentRss> <slash:comments>2</slash:comments> </item> <item><title>new system successfully tested: &#8220;Distribution of High Performance Computing Jobs among Multiple Computing Clouds&#8221;</title><link>http://gehrcke.de/2009/08/distribute-high-performance-computing-jobs-among-multiple-computing-clouds/</link> <comments>http://gehrcke.de/2009/08/distribute-high-performance-computing-jobs-among-multiple-computing-clouds/#comments</comments> <pubDate>Tue, 18 Aug 2009 12:14:26 +0000</pubDate> <dc:creator>Jan-Philip Gehrcke</dc:creator> <category><![CDATA[Amazon Web Services]]></category> <category><![CDATA[ATLAS Software]]></category> <category><![CDATA[CernVM]]></category> <category><![CDATA[Clobi]]></category> <category><![CDATA[GSoC 2009]]></category> <category><![CDATA[Linux]]></category> <category><![CDATA[Nimbus]]></category> <category><![CDATA[Python]]></category> <category><![CDATA[Science]]></category> <category><![CDATA[Technical Stuff]]></category> <guid
isPermaLink="false">http://gehrcke.de/?p=794</guid> <description><![CDATA[<p>Hello you out there!</p><p>I just started running the first serious test of the system I&#8217;ve developed during this year&#8217;s Google Summer of Code. If I wanted to put it in sensational words, the test could be called &#8220;Distribution of Particle Physics High Performance Computing Jobs among Multiple Computing Clouds&#8221;; just to get some readers [...]]]></description> <content:encoded><![CDATA[<p>Hello you out there!</p><p>I just started running the first serious test of the system I&#8217;ve developed during this year&#8217;s <a
href="http://code.google.com/soc/">Google Summer of Code</a>. If I wanted to put it in sensational words, the test could be called <strong>&#8220;Distribution of Particle Physics High Performance Computing Jobs among Multiple Computing Clouds&#8221;</strong>; just to get some readers <img
src='http://gehrcke.de/wp/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> . During the test, there will be some time when I just sit around and watch my monitor, so I decided to share my experience with the new system and keep a record of the test progress in this blog post.<span
id="more-794"></span></p><p><strong>Content</strong> (don&#8217;t worry, the sections are short):</p><ul><li><a
href="#intro">0 Introduction</a></li><li><a
href="#prep">1 Preparation</a></li><li><a
href="#start">2 VM startup</a></li><li><a
href="#sessmon">3 Session monitoring (number of Job Agents)</a></li><li><a
href="#submit">4 Job submission</a></li><li><a
href="#sessmon2">5 Session monitoring (number of jobs)</a></li><li><a
href="#jobmon">6 Job monitoring</a></li><li><a
href="#jobcomplete">7 Job completion, output receipt</a></li><li><a
href="#output">8 Examine output</a></li><li><a
href="#shutdown">9 VM shutdown</a></li><li><a
href="#appendix">10 Appendix</a></li></ul><h4><a
name="intro">0 Introduction</a></h4><p>First of all, I have to introduce the system. This is the longest section.</p><p>It&#8217;s a job scheduling system supporting Virtual Machines (VMs) in multiple <a
href="http://en.wikipedia.org/wiki/Infrastructure_as_a_service">Infrastructure-as-a-Service computing clouds</a>. That is why I came up with the name <strong>&#8220;Clobi&#8221;</strong>, which is loosely derived from &#8220;cloud&#8221; and &#8220;combination&#8221;. If you know a better name, then please let me know <img
src='http://gehrcke.de/wp/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' /> . Currently, the system is using and supporting <a
href="http://workspace.globus.org/">Nimbus</a> (and it&#8217;s prepared for Cumulus, Nimbus&#8217; storage service) and <a
href="http://aws.amazon.com/">Amazon Web Services</a> (more precisely <a
href="http://aws.amazon.com/ec2/">EC2</a>, <a
href="http://aws.amazon.com/sqs/">SQS</a>, <a
href="http://aws.amazon.com/simpledb/">SimpleDB</a>, <a
href="http://aws.amazon.com/s3/">S3</a>).</p><p>It&#8217;s an &#8220;elastic&#8221; and &#8220;scalable&#8221; job system that can set up a huge computing resource pool almost instantly. VMs are added to or removed from the pool dynamically, based on demand. Jobs are submitted and processed using an asynchronous queueing system. An arbitrary number of clients is allowed to submit jobs to an existing resource pool. The basic reliability is inherited from the core messaging components Amazon SQS (for queueing) and Amazon SimpleDB (for bookkeeping): to prevent data from being lost or becoming unavailable, it is stored redundantly and geographically dispersed across multiple datacenters.</p><p>Furthermore, the job system&#8217;s components are highly decoupled, which allows single components to fail or to get re-initialized without affecting the others.</p><p>The motivating application for this system is <a
href="https://twiki.cern.ch/twiki/bin/view/Atlas/AtlasComputing">ATLAS Computing</a> (for the <a
href="http://atlas.ch/">ATLAS experiment</a> at <a
href="http://lhc.web.cern.ch/lhc/">LHC</a>, <a
href="http://public.web.cern.ch/public/">CERN</a> (Geneva)): a common ATLAS Computing application (the so-called &#8220;full chain&#8221;) will be run during this test.</p><p>But the basic system is totally generic and can be used whenever it&#8217;s convenient to distribute jobs among different clouds. This is the case, for example, when one covers the baseline demand for computing power at a low price by operating one&#8217;s own Nimbus cloud, but wants to be able to instantly absorb peaks in demand by simply adding Amazon’s EC2 to the resource pool for a certain amount of time. With <strong>Clobi</strong>, combining different clouds into one big resource pool becomes very easy.</p><p>These are the main components used during this test:</p><ul><li>a <strong>special ATLAS Software Virtual Machine image</strong> based on <a
href="http://cernvm.cern.ch/cernvm/">CernVM</a>. I placed it on <a
href="http://workspace.globus.org/clouds/nimbus.html">Nimbus Teraport Cloud</a> and (as Amazon Machine Image) on S3 for EC2</li></ul><ul><li>the <strong>Clobi Resource Manager</strong> (observing job queues, starting/killing VMs, &#8230;)</li></ul><ul><li>the <strong>Clobi Job Agent</strong> (running on VMs, polling &#038; running jobs, bookkeeping, &#8230;)</li></ul><ul><li>the <strong>Clobi Job Management Interface</strong> (providing methods to submit / remove / kill / monitor / &#8230; jobs)</li></ul><ul><li>a <a
href="http://ganga.web.cern.ch/ganga/">Ganga</a> <strong>Clobi Backend</strong> (integrates <strong>Clobi</strong> into <a
href="http://ganga.web.cern.ch/ganga/">Ganga</a>, which is &#8220;an easy-to-use frontend for job definition and management&#8221;)</li></ul><p>The meaning of these components will become clearer in the following parts.</p><p>Let me show step by step &#8212; but only very roughly &#8212; how I use <strong>Clobi</strong> in the first serious test. Many details are left out, but you will get the principle. After reading this blog post, you will have an overview of what the system does and what I&#8217;ve actually done during the summer.</p><h4><a
name="prep">1 Preparation</a></h4><p>I&#8217;ve prepared session/cloud configuration files. Using them, I started a new <strong>Clobi session</strong> (a resource pool) with the <strong>Clobi Resource Manager</strong> (it&#8217;s a <a
href="http://www.python.org/">Python</a> application and I use it locally, here on my desktop machine). At first, it does a lot of configuration and initialization work, including setting up SQS queues and SimpleDB domains. Interaction with Amazon Web Services is done via the <a
href="http://code.google.com/p/boto/">boto</a> module for Python. After initialization, the Resource Manager offers an interactive mode (it&#8217;s a multi-threaded console application with user interface, built using the <a
href="http://excess.org/urwid">urwid</a> module for Python).</p><h4><a
name="start">2 VM startup</a></h4><p>Using the <strong>Resource Manager</strong>, I started one VM on EC2 and one on Nimbus Teraport Cloud, both based on &#8220;the special ATLAS Software VM&#8221;, containing <a
href="http://atlas-computing.web.cern.ch/atlas-computing/projects/releases/status/">ATLAS Software 15.2.0</a> and the <strong>Clobi Job Agent</strong>. Starting VMs manually is done with a very simple command. The driving forces in the background are boto in case of EC2 and the <a
href="http://workspace.globus.org/vm/TP2.2/dev/reference.html">Nimbus cloud reference client</a>, which I&#8217;ve wrapped and controlled via Python&#8217;s subprocess module. The main loop thread of the <strong>Resource Manager</strong> periodically polls the states of freshly started EC2 instances and the states of Nimbus client subprocesses to figure out whether the requested actions succeeded.</p><p>The following screenshot shows the <strong>Resource Manager</strong> in action (you basically see a terminal window). Follow a bit of the log. As you can see, it&#8217;s very easy to run VMs and &#8212; after a certain amount of time &#8212; the <strong>Resource Manager</strong> detects that both VMs have successfully started booting:</p><div
id="attachment_796" class="wp-caption aligncenter" style="width: 222px"><a
href="http://gehrcke.de/wp/wp-content/uploads/001_RM_VMs_started.png" rel="lightbox[794]"><img
src="http://gehrcke.de/wp/wp-content/uploads/001_RM_VMs_started-212x300.png" alt="Clobi Resource Manager showing two started Virtual Machines" title="001_RM_VMs_started" width="212" height="300" class="size-medium wp-image-796" /></a><p
class="wp-caption-text"><strong>Clobi Resource Manager</strong> showing two started Virtual Machines</p></div><p>The EC2 VM needed around 10 minutes to start up, while the Nimbus VM needed 20 minutes. Reason: the AMI is ~10 GB in size; the image on Nimbus ~20 GB (wasted space &#038; time, but it&#8217;s just a test..).</p><p>You might say: &#8220;Only two VMs? Boring..&#8221;. I say: I could have taken several hundred. The point is: it would not make any difference, except in cost and in the amount of space used for log files. <strong>Clobi</strong> is built on technology that is designed to work reliably across different orders of magnitude. This is often called &#8220;scalable&#8221; or &#8220;elastic&#8221;. Basically, this positive characteristic is inherited from Amazon&#8217;s SQS, SDB and S3, which are used by <strong>Clobi</strong> to manage and control the system.</p><h4><a
name="sessmon">3 Session monitoring (number of Job Agents)</a></h4><p>I am skipping almost all the details of how the components exchange information. But I have to tell you the following so as not to confuse you completely:</p><ul><li>the <strong>Resource Manager</strong> gave some bootstrap information to the VMs.</li></ul><ul><li>the <strong>Job Agent</strong> is automatically invoked on VM operating system startup.</li></ul><p>Using the bootstrap information, each VM&#8217;s <strong>Job Agent</strong> &#8220;registers with SimpleDB&#8221;. The <strong>Resource Manager</strong> has a monitoring function that checks SimpleDB for running Job Agents:</p><div
id="attachment_803" class="wp-caption aligncenter" style="width: 310px"><a
href="http://gehrcke.de/wp/wp-content/uploads/2009/08/002_RM_JAs_running.png" rel="lightbox[794]"><img
src="http://gehrcke.de/wp/wp-content/uploads/2009/08/002_RM_JAs_running-300x169.png" alt="Clobi Resource Manager showing two started Job Agents" title="002_RM_JAs_running" width="300" height="169" class="size-medium wp-image-803" /></a><p
class="wp-caption-text"><strong>Clobi Resource Manager</strong> showing two started <strong>Clobi Job Agents</strong></p></div><p>Voilà, now it&#8217;s definitely known that both VMs successfully started their <strong>Job Agents</strong>. These start in a &#8220;watching/lurking&#8221; state, periodically polling SQS for jobs.</p><h4><a
name="submit">4 Job submission</a></h4><p>The SimpleDB / SQS / S3 data structure together with <strong>Clobi&#8217;s Job Management Interface</strong> makes it possible to submit jobs with different priorities, to remove jobs, to kill running jobs and to monitor jobs. Furthermore, an input/output sandbox archive can be sent and received. This is needed to deliver executables and small input data and to receive small output data as well as stdout/err and other logs.</p><p>I&#8217;ve downloaded Ganga and installed it on my local machine. Then I started developing a new &#8220;<strong>Clobi backend</strong>&#8221; to integrate <strong>Clobi&#8217;s Job Management Interface</strong> into Ganga. Using this new backend, it&#8217;s possible to submit/kill/monitor/.. jobs right from the Ganga interface, using Ganga&#8217;s common job description and management commands.</p><p>To test the system, I&#8217;ve prepared some shell scripts that run <a
href="http://gehrcke.de/2009/06/atlas-software-how-to-run-the-full-chain/">&#8220;The Full Chain&#8221;</a> on the worker nodes. This is a very good test to validate the whole system: it needs some very small input files, only works if the ATLAS Software was set up properly (uoooh.. not trivial!), stresses the VM (the simulation step consumes a lot of CPU power) and leaves some small output files for the output sandbox.</p><p>At Ganga startup, I provided a configuration file containing a few essential pieces of information about the <strong>Clobi session</strong> that I had set up before via the <strong>Resource Manager</strong>. From this configuration file, Ganga&#8217;s <strong>Clobi backend</strong> knows, for example, which SimpleDB domain to query and to which SQS queues the job messages must be submitted. Using this bootstrap information, <strong>an arbitrary number of Ganga instances could be used from anywhere to submit and manage jobs</strong> within this particular <strong>Clobi session</strong>.</p><p>I will now submit the same job (the &#8220;full chain&#8221; thing) three times: the Nimbus VM has two virtual cores and its <strong>Job Agent</strong> will try to receive and run two jobs at the same time. The EC2 VM (m1.small) only has one virtual core. Hence, three jobs are needed to use the VMs to full capacity <img
src='http://gehrcke.de/wp/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> . This is a screenshot from the Ganga terminal session where I submitted the jobs:</p><div
id="attachment_805" class="wp-caption aligncenter" style="width: 227px"><a
href="http://gehrcke.de/wp/wp-content/uploads/2009/08/003_Ganga_submitted.png" rel="lightbox[794]"><img
src="http://gehrcke.de/wp/wp-content/uploads/2009/08/003_Ganga_submitted-217x300.png" alt="Ganga: job submission via Clobi backend" title="003_Ganga_submitted" width="217" height="300" class="size-medium wp-image-805" /></a><p
class="wp-caption-text">Ganga: job submission via <strong>Clobi backend</strong></p></div><p>The <strong>Clobi</strong> backend successfully did its job: it created three <strong>Clobi</strong> job IDs, submitted three SQS messages and uploaded three input sandbox archives.</p><h4><a
name="sessmon2">5 Session monitoring (number of jobs)</a></h4><p>The <strong>Resource Manager</strong> is able to observe the queues and to determine the number of jobs submitted to them. It recognizes two jobs in the queue for priority 2:<br
/><div
id="attachment_809" class="wp-caption aligncenter" style="width: 310px"><a
href="http://gehrcke.de/wp/wp-content/uploads/2009/08/004_RM_2jobs.png" rel="lightbox[794]"><img
src="http://gehrcke.de/wp/wp-content/uploads/2009/08/004_RM_2jobs-300x170.png" alt="Clobi Resource Manager detected two jobs in the queues" title="004_RM_2jobs" width="300" height="170" class="size-medium wp-image-809" /></a><p
class="wp-caption-text"><strong>Clobi Resource Manager</strong> detected two jobs in the queues</p></div></p><p>Only two? Maybe one <strong>Job Agent</strong> polled a job right after submission, or maybe the SQS measurement was not exact (this is possible, too). Anyway, a short time later there is only one job left in the queues and then they are empty. This means that the <strong>Job Agent</strong> on the Nimbus VM successfully grabbed two jobs and the EC2 VM grabbed one:<br
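 />A minimal Python sketch of this grabbing behaviour, not Clobi's actual code: plain in-memory deques stand in for the SQS priority queues, and the names <code>JobAgent</code>, <code>poll</code> and <code>slots</code> are hypothetical. It assumes one job slot per virtual core and that lower priority numbers are served first; the post states neither.

```python
from collections import deque


class JobAgent:
    """Toy stand-in for a Job Agent; all names here are hypothetical."""

    def __init__(self, name, slots):
        self.name = name
        self.slots = slots      # assumed: one job slot per virtual core
        self.running = []

    def poll(self, queues):
        """Grab jobs until all slots are busy or every queue is empty.

        `queues` maps a priority number to a deque of job IDs. Lower
        numbers are polled first, which is an assumption of this sketch.
        """
        for priority in sorted(queues):
            queue = queues[priority]
            while queue and self.slots - len(self.running) > 0:
                self.running.append(queue.popleft())
        return list(self.running)


# Three identical jobs in the priority-2 queue, as in the test above.
queues = {1: deque(), 2: deque(["job-a", "job-b", "job-c"])}
nimbus_agent = JobAgent("nimbus-vm", slots=2)   # two virtual cores
ec2_agent = JobAgent("ec2-vm", slots=1)         # m1.small, one virtual core

nimbus_jobs = nimbus_agent.poll(queues)
ec2_jobs = ec2_agent.poll(queues)
jobs_left = sum(len(q) for q in queues.values())
print(nimbus_jobs, ec2_jobs, jobs_left)
```

With three jobs queued, the two-slot agent takes two, the one-slot agent takes the remaining one, and the queues end up empty.<br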
/><div
id="attachment_812" class="wp-caption aligncenter" style="width: 310px"><a
href="http://gehrcke.de/wp/wp-content/uploads/2009/08/005_RM_0jobs.png" rel="lightbox[794]"><img
src="http://gehrcke.de/wp/wp-content/uploads/2009/08/005_RM_0jobs-300x169.png" alt="Clobi Resource Manager detects zero jobs in the queues" title="005_RM_0jobs" width="300" height="169" class="size-medium wp-image-812" /></a><p
class="wp-caption-text"><strong>Clobi Resource Manager</strong> detects zero jobs in the queues</p></div></p><p>While Ganga and I are waiting for the jobs to finish (this takes some time and Ganga periodically polls the state of the jobs via <strong>Clobi backend / Clobi Job Management Interface</strong>), I will use the time to point out an important fact: the goal is to automate the observe-queues-and-start/kill-VMs-as-required process in the future. The current <strong>Resource Manager</strong> is well prepared for this. Let me show you the monitoring loop:<br
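 />Such an automated decision step is only a goal so far, so the following Python snippet is a hypothetical sketch of one possible policy and not part of Clobi. The function name <code>plan_scaling</code> and all of its parameters are assumptions made for illustration.

```python
import math


def plan_scaling(queued_jobs, idle_slots, slots_per_vm, max_new_vms):
    """Return how many VMs to start (positive) or soft-kill (negative).

    Hypothetical policy: start just enough VMs to cover queued jobs that
    no idle slot can take, and release idle capacity once the queues
    are empty. Idle slots are converted to whole VMs, a simplification.
    """
    uncovered = queued_jobs - idle_slots
    if uncovered > 0:
        return min(math.ceil(uncovered / slots_per_vm), max_new_vms)
    if queued_jobs == 0 and idle_slots >= slots_per_vm:
        return -(idle_slots // slots_per_vm)
    return 0


decisions = [
    plan_scaling(5, 1, 2, 10),     # 4 uncovered jobs, 2 slots per VM
    plan_scaling(100, 0, 2, 10),   # demand exceeds the cap of 10 new VMs
    plan_scaling(0, 4, 2, 10),     # empty queues, 4 idle slots
]
print(decisions)
```

A monitoring loop could call such a function once per iteration with the two numbers it already observes (jobs in the queues and capacity of the running Job Agents) and then start or soft-kill the returned number of VMs.<br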
/><div
id="attachment_814" class="wp-caption aligncenter" style="width: 301px"><a
href="http://gehrcke.de/wp/wp-content/uploads/2009/08/006_RM_monitoring_loop.png" rel="lightbox[794]"><img
src="http://gehrcke.de/wp/wp-content/uploads/2009/08/006_RM_monitoring_loop-291x300.png" alt="Clobi Resource Manager showing its monitoring loop" title="006_RM_monitoring_loop" width="291" height="300" class="size-medium wp-image-814" /></a><p
class="wp-caption-text"><strong>Clobi Resource Manager</strong> showing its monitoring loop</p></div></p><p>It periodically observes the number of jobs in the queues and the number of running <strong>Job Agents</strong>. Based on this information, the <strong>Resource Manager</strong> could easily start / kill virtual machines (I&#8217;ve already demonstrated how easy starting is; killing is described later). I have not implemented this algorithm yet, because a) I had no time and b) I could have done it quick and dirty, but I really did not need this feature to develop and test the rest of the system. But this feature will come, because if it&#8217;s implemented properly with intelligent policies, it&#8217;s just great.</p><h4><a
name="jobmon">6 Job monitoring</a></h4><p>As I&#8217;ve already mentioned, Ganga periodically checks the jobs&#8217; states. For this purpose, the <strong>Ganga Clobi backend</strong> provides a special method that Ganga calls from time to time from one of its monitoring threads. Normally this happens quietly, but I&#8217;ve put some debug output into this method. Let&#8217;s check it out:<br
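 />What such a monitoring method does can be sketched in Python under loud assumptions: this is not the backend's real code, <code>update_monitoring</code> and <code>CLOBI_TO_GANGA</code> are hypothetical names, and of the Clobi state names used only <code>completed_success</code> appears in this post's logs.

```python
# Only "completed_success" appears in the logs of this post; the other
# Clobi state names below are made up for illustration.
CLOBI_TO_GANGA = {
    "completed_success": "completed",
    "running": "running",            # hypothetical Clobi state
    "completed_error": "failed",     # hypothetical Clobi state
}


def update_monitoring(jobs, fetch_state):
    """Poll each job once and return the IDs whose status changed.

    `jobs` maps a Clobi job ID to its last known Ganga-side status and
    `fetch_state` is any callable returning the current Clobi state.
    Unknown states leave the stored status untouched.
    """
    changed = []
    for job_id, old_status in jobs.items():
        new_status = CLOBI_TO_GANGA.get(fetch_state(job_id), old_status)
        if new_status != old_status:
            jobs[job_id] = new_status
            changed.append(job_id)
    return changed


# A dictionary plays the role of the remote bookkeeping store.
remote = {"job-1": "completed_success", "job-2": "running"}
jobs = {"job-1": "running", "job-2": "running"}
changed = update_monitoring(jobs, remote.get)
print(changed, jobs)
```

A caller would use the returned IDs to trigger follow-up work, such as fetching the output sandbox of a job that has just completed.<br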
/><div
id="attachment_815" class="wp-caption aligncenter" style="width: 310px"><a
href="http://gehrcke.de/wp/wp-content/uploads/2009/08/007_Ganga_monitoring_jobs.png" rel="lightbox[794]"><img
src="http://gehrcke.de/wp/wp-content/uploads/2009/08/007_Ganga_monitoring_jobs-300x97.png" alt="Ganga receives monitoring information via the Clobi backend" title="007_Ganga_monitoring_jobs" width="300" height="97" class="size-medium wp-image-815" /></a><p
class="wp-caption-text">Ganga receives monitoring information via the <strong>Clobi backend</strong></p></div></p><h4><a
name="jobcomplete">7 Job completion, output receipt</a></h4><p>After some more time, Ganga discovered that one of the three jobs finished successfully. This means that the <strong>Job Agent</strong> detected a return code of 0 from the job shell script and could successfully store the output sandbox archive on S3. At this point, the <strong>Clobi backend</strong> triggers the download and extraction of the output sandbox archive. This looks like:</p><div
class="wp-geshi-highlight-wrap5"><div
class="wp-geshi-highlight-wrap4"><div
class="wp-geshi-highlight-wrap3"><div
class="wp-geshi-highlight-wrap2"><div
class="wp-geshi-highlight-wrap"><div
class="wp-geshi-highlight"><div
class="text"><pre class="de1">Clobi                              : INFO     status for job-090818044445-3585-3088: completed_success
Ganga.GPIDev.Lib.Job               : INFO     job 60 status changed to &quot;completed&quot;
Clobi                              : INFO     download atlassessions/0907210728-testsess-0c7e/jobs/out_sndbx_job-090818044445-3585-3088.tar.bz2 from S3
Clobi                              : INFO     store key 0907210728-testsess-0c7e/jobs/out_sndbx_job-090818044445-3585-3088.tar.bz2 as file /home/gurke/gangadir/workspace/gurke/LocalAMGA/60/output/out_sndbx_job-090818044445-3585-3088.tar.bz2 from bucket atlassessions
Clobi                              : INFO     Download of output sandbox archive successfull.</pre></div></div></div></div></div></div></div><p>Did you have doubts that this is my first serious test and everything has worked so far? Some parts of the system have already been tested thoroughly, of course. But the <strong>Clobi backend</strong> for Ganga made the transition from vitally-important-features-missing to just-scratch-along-usability only a few hours ago. I&#8217;m really very happy that everything has worked so far, but the output sandbox archive extraction could be improved:</p><div
class="wp-geshi-highlight-wrap5"><div
class="wp-geshi-highlight-wrap4"><div
class="wp-geshi-highlight-wrap3"><div
class="wp-geshi-highlight-wrap2"><div
class="wp-geshi-highlight-wrap"><div
class="wp-geshi-highlight"><div
class="text"><pre class="de1">Clobi                              : CRITICAL Error while extracting output sandbox
Clobi                              : CRITICAL Traceback:
Traceback (most recent call last):
  File &quot;/mnt/hgfs/E/gsoc_code_repo/ganga_clobi_backend/Clobi/Clobi.py&quot;, line 243, in clobi_dl_extrct_outsandbox_arc
    sp = subprocess.Popen(
NameError: global name 'subprocess' is not defined</pre></div></div></div></div></div></div></div><p>Yoooah, I (want to) use Python&#8217;s subprocess module to extract the <code>tar.bz2</code> archive with the system&#8217;s <code>tar</code>, but I forgot to <code>import subprocess</code> <img
src='http://gehrcke.de/wp/wp-includes/images/smilies/icon_sad.gif' alt=':-(' class='wp-smiley' /> . This is forgotten and fixed easily <img
src='http://gehrcke.de/wp/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> . Btw: of course I got three downloaded output archives and three extraction errors <img
src='http://gehrcke.de/wp/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> .</p><h4><a
name="output">8 Examine output</a></h4><p>The output sandbox archive files were stored on my local machine by Ganga&#8217;s <strong>Clobi backend</strong>. I take a look into one of them by extracting it manually:</p><div
class="wp-geshi-highlight-wrap5"><div
class="wp-geshi-highlight-wrap4"><div
class="wp-geshi-highlight-wrap3"><div
class="wp-geshi-highlight-wrap2"><div
class="wp-geshi-highlight-wrap"><div
class="wp-geshi-highlight"><div
class="text"><pre class="de1">$ ls
out_sndbx_job-090818044451-3585-01c2.tar.bz2
$ tar xjf out_sndbx_job-090818044451-3585-01c2.tar.bz2
$ ls
AOD_007410_00001.pool.root  evgen.log  joblog_job-090818044451-3585-01c2  recoAOD.log
EVGEN_007410_00001.pool.root  jobagent_log  out_sndbx_job-090818044451-3585-01c2.tar.bz2</pre></div></div></div></div></div></div></div><p><strong>Great! The AOD file is there</strong>. This means that 1) <strong>Clobi</strong> did perfect work to control and manage the job and 2) the ATLAS Computing part (&#8220;The Full Chain&#8221;) worked perfectly:</p><ul><li>the interaction between a certain particle (which was defined within the input sandbox) and the ATLAS detector was successfully simulated.</li></ul><ul><li>the ATLAS detector output (basically times and voltages) was calculated successfully.</li></ul><ul><li>particle tracks and energy deposits were successfully reconstructed from times and voltages.</li></ul><ul><li>an event summary was successfully built from tracks and energy deposits.</li></ul><p>&#8220;The summary&#8221; is saved within the AOD file, which successfully returned to my local machine. Cool. Every single part of the system worked as it should (psss, don&#8217;t think of the extraction&#8230;).</p><h4><a
name="shutdown">9 VM shutdown</a></h4><p>You may have asked yourself how VMs can be killed dynamically. Besides the &#8220;hard kill&#8221; (invoking Nimbus/EC2 API calls to shut down a VM), I&#8217;ve implemented a mechanism that I call &#8220;soft kill&#8221;: the <strong>Resource Manager</strong> sets the &#8220;soft kill flag&#8221; for a specific VM (in SimpleDB) and the corresponding <strong>Job Agent</strong> checks it from time to time. Once the flag is set, the <strong>Job Agent</strong> waits until all currently running jobs are done and then shuts down the VM. Let&#8217;s watch it in action (I had to look up the command of my own application, too little sleep recently!):<br
/><div
id="attachment_820" class="wp-caption aligncenter" style="width: 288px"><a
href="http://gehrcke.de/wp/wp-content/uploads/2009/08/008_RM_softkill.png" rel="lightbox[794]"><img
src="http://gehrcke.de/wp/wp-content/uploads/2009/08/008_RM_softkill-278x300.png" alt="Clobi Resource Manager setting up the softkill flag for both VMs" title="008_RM_softkill" width="278" height="300" class="size-medium wp-image-820" /></a><p
class="wp-caption-text"><strong>Clobi Resource Manager</strong> setting up the softkill flag for both VMs</p></div></p><p>After waiting some time, we see the number of running <strong>Job Agents</strong> decrease to zero:<br
/><div
id="attachment_823" class="wp-caption aligncenter" style="width: 288px"><a
href="http://gehrcke.de/wp/wp-content/uploads/2009/08/009_RM_softkill_0jobagents.png" rel="lightbox[794]"><img
src="http://gehrcke.de/wp/wp-content/uploads/2009/08/009_RM_softkill_0jobagents-278x300.png" alt="Clobi Resource Manager detected that both Job Agents / VMs have shut down" title="009_RM_softkill_0jobagents" width="278" height="300" class="size-medium wp-image-823" /></a><p
class="wp-caption-text"><strong>Clobi Resource Manager</strong> detected that both Job Agents / VMs have shut down</p></div></p><h4><a
name="appendix">10 Appendix</a></h4><p>If you are very interested, you can find some additional material:</p><ul><li>My earlier work on this topic (from last year), &#8220;Amazon Web Services for ATLAS Computing&#8221; (AWSAC), can be found here: <a
href="http://gehrcke.de/awsac">http://gehrcke.de/awsac</a>.</li></ul><ul><li>The original GSoC project description <a
href="http://gehrcke.de/projects/google-summer-of-code/">can be found here</a>.</li></ul><ul><li>I&#8217;ve already written <a
href="http://gehrcke.de/category/technical-stuff/google-summer-of-code/">some blog posts about this project during Google Summer of Code</a>.</li></ul><ul><li>The latest visualizations of the system (from the planning period) are these two diagrams: <a
href="http://gehrcke.de/gsoc/jobsystem_scheme.jpeg" rel="lightbox[794]">one</a>, <a
href="http://gehrcke.de/gsoc/jobsystem_scheme_two_sessions.jpeg" rel="lightbox[794]">two</a>.</li></ul><p>A detailed, up-to-date and exact description of the system (&#8220;<strong>Clobi</strong>&#8221;) is planned for the future.</p><p>I will need the last days of GSoC to implement some important missing features, to find and fix bugs and to clean everything up to make it presentable. I will definitely continue working on this project after GSoC (as time allows, of course). I am currently thinking about pushing the project to either <a
href="http://bitbucket.org/">bitbucket</a> or <a
href="http://code.google.com/">Google Code</a>. Both support <a
href="http://en.wikipedia.org/wiki/Mercurial_%28software%29">Mercurial</a> repositories, which is what I have used for my code so far (locally).</p><p>If you like this, spread it! Every question and/or comment is much appreciated!</p><p>Thanks for listening <img
src='http://gehrcke.de/wp/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /></p> ]]></content:encoded> <wfw:commentRss>http://gehrcke.de/2009/08/distribute-high-performance-computing-jobs-among-multiple-computing-clouds/feed/</wfw:commentRss> <slash:comments>3</slash:comments> </item> </channel> </rss>
