Focus Demo: CANFAR+Skytree: A Cloud Computing and Data Mining System for Astronomy
ASP Conference Series, Astronomical Society of the Pacific

CANFAR + Skytree: A Cloud Computing and Data Mining System for Astronomy
Nicholas M. Ball
National Research Council Canada, 5071 West Saanich Road, Victoria, BC V9E 2E7
Abstract.
This is a companion Focus Demonstration article to the CANFAR + Skytree poster (Ball 2012, this volume), demonstrating the usage of the Skytree machine learning software on the Canadian Advanced Network for Astronomical Research (CANFAR) cloud computing system. CANFAR + Skytree is the world’s first cloud computing system for data mining in astronomy.
1. Introduction
CANFAR (Gaudet et al. 2011) is the cloud computing system of the Canadian Astronomy Data Centre (CADC). It is the first system designed to provide this capability to astronomers. Skytree is the world’s most advanced machine learning software. It acts as a machine learning server to allow advanced data mining on large data. The CANFAR + Skytree combination allows Skytree to be run on up to 500 cores simultaneously, the current size of the CANFAR system. In this paper, we reproduce the Focus Demonstration session given at the conference, showing an example Skytree run, how to access and use CANFAR, and how to use the two in concert.
2. Running Skytree
Skytree can be run interactively on a UNIX or Mac OS X system. Installation is by unzipping the tarfile into a directory:

    tar -zxf SkytreeServer11.3.2.tgz
This results in a directory containing the skytree-server executable, an example dataset, a .lic license file, and some others.
We show the running of the nearest neighbors algorithm, allkn, on the example dataset supplied with the software. The dataset is from the Sloan Digital Sky Survey, reflecting the company’s academic roots and links to astronomy. It contains just under 100,000 rows (galaxies), and four SDSS colors.

http://canfar.phys.uvic.ca

Figure 1. Typical Skytree invocation on the terminal, showing the allkn run detailed in the text.
Begin by selecting suitable rows:

    cd SkytreeServer11.3.2/datasets
    awk -F, '(NF==3 || NF==7)' sdss100kx4.skytree \
    > sdss100kx4.subsample.skytree
    cd ..
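The awk filter above keeps only lines with exactly 3 or 7 comma-separated fields. As a rough illustration, the same selection can be sketched in Python. The reading that a 3-field line is the header and a 7-field line is a complete data row (three metadata columns plus four colors) is an inference from the header string quoted in this paper, not a documented Skytree rule, and the sample lines and the function name `select_rows` are made up for the example.

```python
def select_rows(lines, allowed_field_counts=(3, 7), sep=","):
    """Keep lines whose number of sep-separated fields is in allowed_field_counts.

    Mirrors the effect of: awk -F, '(NF==3 || NF==7)'
    """
    kept = []
    for line in lines:
        stripped = line.rstrip("\n")
        # awk reports NF == 0 for an empty record, whereas "".split(",")
        # would return one empty field, so handle that case explicitly.
        n_fields = len(stripped.split(sep)) if stripped else 0
        if n_fields in allowed_field_counts:
            kept.append(stripped)
    return kept


# Made-up sample in the assumed layout: one header line, then data rows.
sample = [
    "header,meta:3,double:4\n",
    "1,2,3,0.5,0.4,0.3,0.2\n",  # complete row: 7 fields, kept
    "4,5,6,0.1,0.2\n",          # incomplete row: 5 fields, dropped
    "\n",                       # blank line, dropped
    "7,8,9,1.5,1.4,1.3,1.2\n",  # complete row: 7 fields, kept
]
selected = select_rows(sample)
print(selected)
```

Filtering on field count is a cheap way to discard truncated or malformed rows before they reach the machine learning step, at the cost of not checking the field values themselves.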
This is typical of analysis using Skytree: as with any data mining, one prepares the data first. Some data preparation and results analysis tools are now available with the software, but the machine learning invocation remains separate, on whatever file it is passed. Input files are typically ASCII format. The .skytree extension represents an explicit header style in which datatypes are given, which enables some algorithms to run faster, but the file is otherwise ASCII CSV.

We then run allkn:

    ./skytree-server allkn \
    --references_in=datasets/sdss100kx4.subsample.skytree \
    --k_neighbors=1 \
    --distances_out=distances.out \
    --indices_out=indices.out

The program is invoked via the skytree-server executable, followed by the algorithm name (in this case allkn) and the arguments appropriate to the algorithm. In this case, these are the input file (references_in), the number of neighbors to find (1), and the output files for the neighbor distances and file positions. The typical appearance of this run in the terminal is shown in Figure 1. Each algorithm is fully documented if invoked with the --help argument.

Once run, we have obtained the neighbor distances, and, via the indices, which objects are the neighbors. These may be cast into a suitable form for visualization, e.g.

    paste -d \\0 distances.out indices.out \
    datasets/sdss100kx4.subsample.skytree > tmp.csv
    sed s/'header,double:1,header,unsigned_int:1,header,meta:3,\
    double:4'/'

which may be visualized in a program such as TOPCAT (Taylor 2005). Again, this is typical of an analysis with Skytree: it outputs results, which are then further processed.

In this case, if one histograms the distances, selects those with large distances, and plots a color-color plot, e.g., u − g versus g − r (ug vs. gr), it is clear that allkn has found outliers. Obviously such a measure in isolation is crude (one might want to calculate, for example, the local outlier factor), but it exemplifies the kind of analysis that can be rapidly built up using data mining.
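To make the outlier idea concrete, the following pure-Python sketch shows what a k = 1 all-nearest-neighbors query computes and how a simple distance cut flags isolated objects. It is a brute-force O(n²) illustration only, not Skytree's algorithm (which is not described in this paper). The function names, the two-dimensional color values, and the threshold are all hypothetical.

```python
import math


def all_nearest_neighbors(points):
    """For each point, return (distance, index) of its nearest other point.

    Brute force, O(n^2): suitable for a toy example only.
    """
    results = []
    for i, p in enumerate(points):
        best_dist, best_idx = math.inf, -1
        for j, q in enumerate(points):
            if i == j:
                continue  # a point is not counted as its own neighbor
            d = math.dist(p, q)  # Euclidean distance (Python 3.8+)
            if d < best_dist:
                best_dist, best_idx = d, j
        results.append((best_dist, best_idx))
    return results


def flag_outliers(neighbors, threshold):
    """Indices of points whose nearest-neighbor distance exceeds threshold."""
    return [i for i, (dist, _) in enumerate(neighbors) if dist > threshold]


# Hypothetical (u-g, g-r) colors: a tight clump plus one isolated point.
colors = [
    (1.20, 0.50),
    (1.25, 0.55),
    (1.18, 0.48),
    (1.22, 0.52),
    (3.00, 2.00),  # far from the clump
]
neighbors = all_nearest_neighbors(colors)
outliers = flag_outliers(neighbors, threshold=0.5)
print(outliers)
```

A fixed distance threshold is the crude measure the text refers to: it ignores how dense the surrounding region is, which is what a measure such as the local outlier factor is designed to account for.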
3. Running Software on CANFAR (Including Skytree)
Access to CANFAR requires a CADC account and a CANFAR account. These are set up via the CADC webpage, and by request to [email protected]. Once given an account, access CANFAR via ssh:

    desktop> ssh
This places the user on the CANFAR head node, from which it is possible to utilize software interactively, and run short processing jobs (e.g., a half hour or less). X Windows and X-forwarding are supported. Thus, one may install software as desired, including Skytree, and run it as above. Detailed usage of CANFAR is documented on the wiki, at http://canfar.phys.uvic.ca/wiki.

Rather than installing software in one’s home directory on the CANFAR head node, the bulk of the interaction with the system is via a virtual machine (VM). The VM is created by the user, who then has full root access to it. Access is via ssh:
CANFAR> vmcreate
To shut down the VM, use vmstop. Once a VM exists, one does not vmcreate it again, but starts it using vmstart.

CANFAR has implemented, via International Virtual Observatory Alliance protocols, a filesystem, VOSpace, that gives CANFAR users access to hundreds of terabytes of persistent storage. Access to VOSpace requires an X.509 certificate, which can be obtained by the user. VOSpace can be mounted as a filesystem, which enables it to be treated as another directory tree, and accessed from one’s desktop, the CANFAR head node, one’s VM, or a batch job.
Batch jobs are managed by the Condor scheduling system. To prepare a batch job, a Condor submission file and a calling script are created on the CANFAR head node, which in turn calls a script on the VM. To submit a job, the VM is shut down, and the Condor submission command is given:
CANFAR> vmstop
One may then monitor the execution of the job via the usual Condor commands, e.g., condor_q.
Skytree is invoked on the command line or in a script as part of one’s analysis. Running in batch allows up to 500 instances of Skytree to be run simultaneously.
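As an illustration of the batch workflow described above, the sketch below generates a generic Condor submit description that queues many copies of one calling script, each receiving its process number as an argument. It uses only standard Condor submit commands (universe, executable, arguments, output, error, log, queue). The script name run_skytree.sh, the logs directory, and the function name are hypothetical, and any CANFAR-specific settings (for example, how a job is tied to a particular VM) are not given in this paper and are therefore omitted.

```python
def make_submit_description(executable, n_jobs, log_dir="logs"):
    """Return the text of a generic Condor submit description file.

    Each of the n_jobs processes receives its process number via
    $(Process), which the calling script can use to choose its input.
    """
    if n_jobs < 1:
        raise ValueError("n_jobs must be at least 1")
    lines = [
        "universe   = vanilla",
        f"executable = {executable}",
        "arguments  = $(Process)",
        f"output     = {log_dir}/job_$(Process).out",
        f"error      = {log_dir}/job_$(Process).err",
        f"log        = {log_dir}/jobs.log",
        f"queue {n_jobs}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical calling script; 500 matches the core count quoted in the text.
submit_text = make_submit_description("run_skytree.sh", n_jobs=500)
print(submit_text)
```

A file with this content would normally be handed to condor_submit, after which condor_q lists the queued and running processes. Splitting the input so that each process number maps to an independent piece of work is left to the calling script.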
4. Conclusions
CANFAR + Skytree represents the world’s first cloud computing system for data mining in astronomy, and is open for use by any interested member of the astronomical community. For further details on usage, see the poster paper (Ball 2012, this volume), or visit the CANFAR + Skytree website at https://sites.google.com/site/nickballastronomer.

Acknowledgments.
This research used the facilities of the Canadian Astronomy Data Centre, operated by the National Research Council of Canada with the support of the Canadian Space Agency. Funding for CANFAR was provided by CANARIE via the Network Enabled Platforms Supporting Virtual Organisations program. The author thanks D. Schade, A. Gray and M. Hack for their contributions to this work.