cloud-kepler Documentation
Release 1.2
Scott Fleming, Andrea Zonca, Jack Flowers, Peter McCullough, Ellen Price
Contents
1 System configuration 3
1.1 Python and Virtualenv setup . . . 3
1.2 Hadoop setup. . . 3
1.3 Lein setup . . . 5
1.4 LEMUR setup . . . 5
1.5 References . . . 6
2 Quickstart Guide 7 2.1 Specifying the data to download . . . 7
2.2 Configuration file options . . . 7
3 Retrieving and downloading data 9 3.1 get_data– Get data from MAST or hard disk . . . 9
3.2 join_quarters– Stitch multiple quarters of data together . . . 9
4 BLS pulse algorithm 11 4.1 drive_bls_pulse– Driver interface to BLS pulse . . . 11
4.2 bls_pulse_python– Naive pure Python implementation . . . 11
4.3 bls_pulse_vec– Vectorized Python implementation . . . 11
4.4 bls_pulse_cython– Optimized Cython implementation . . . 11
5 detrend– Detrend lightcurve data 13
6 clean_signal– Signal cleaning (removal of strong periodic signals) 15
7 postprocessing– Analyze output from BLS pulse 17
8 utils– Utility functions 19
Python Module Index 21
cloud-kepler Documentation, Release 1.2
cloud-kepler is a cloud-enabledKeplerplanet searching pipeline. Contents:
CHAPTER
1
System configuration
1.1 Python and Virtualenv setup
To set up Python and Virtualenv, run the following commands from a terminal:
cd ~/temp
curl -L -o virtualenv.py https://raw.github.com/pypa/virtualenv/master/virtualenv.py python virtualenv.py cloud-kepler --no-site-packages
. cloud-kepler/bin/activate pip install numpy
pip install simplejson pip install pyfits
Test that the basic python code is working:
cat {DIRECTORY_WITH_CLOUD_KEPLER}/test/test_q1.txt | python {DIRECTORY_WITH_CLOUD_KEPLER}/python/download.py
If it starts downloading and spewing base64 encoded numpy arrays, then you’re good.
1.2 Hadoop setup
Install Oracle VM VirtualBox 4.2.14 from VirtualBox-4.2.14-86644-win from
https://www.virtualbox.org/wiki/Downloads
Extract cloudera-quickstart-demo-vm-4.3.0-virtualbox.tar.gz fromhttps://ccp.cloudera.com/display/SUPPORT/Cloudera+QuickStart+VM
Enter the created folder and extract cloudera-quickstart-demo-vm-4.3.0-virtualbox.tar, you should end up with cloudera-quickstart-demo-vm.ovf and cloudera-quickstart-demo-vm.vmdk in whatever folder you extracted to Open up Oracle VM VirtualBox Manager
Select the New icon, the Create Virtual Machine window boots up. For operating system, select Linux and Ubuntu
For memory size, select 4096 MB
For Hard Drive, select “Use an existing virtual hard drive” and path to cloudera-quickstart-demo-vm.vmdk Press Create. Virtual machine now selectable in the main window on virtualbox manager.
Press the Settings button, opens the settings window. Choose the system tab
Change chipset to ICH9, make sure “Enable IO APIC” is checked. Select it and pressed Start, boot begins, this part takes a little while.
If it gets stuck on any one step for more than 20 minutes, you can assume something is wrong. Eventually the boot sequence will end and you will see a desktop in your virtual machine. Success!
1.2.1 WordCount Example
Note that this assumes a cloudera vm distribution of hadoop.
Inside your virtual machine, go to the Cloudera Hadoop Tutorial at http://www.cloudera.com/content/cloudera-content/cloudera-docs/HadoopTutorial/CDH4/Hadoop-Tutorial/ht_topic_5_1.html
Copy the source code for WordCount and past it into the gedit text editor. Save as WordCount.java in the cloudera’s home folder.
Per the instructions there, open terminal, cd to the home directory, then run as follows:
mkdir wordcount_classes
javac -cp /usr/lib/hadoop/*:/usr/lib/hadoop/client-0.20/* -d wordcount_classes WordCount.java
Right click on the wordcount_classes folder you made (it will be in the home directory) and select compress. Choose .jar as the file format and wordcount as filename:
echo "Hello World Bye World" > file0 echo "Hello Hadoop Goodbye Hadoop" > file1
hadoop fs -mkdir /user/cloudera /user/cloudera/wordcount /user/cloudera/wordcount/input hadoop fs -put file* /user/cloudera/wordcount/input
hadoop jar wordcount.jar org.myorg.WordCount /user/cloudera/wordcount/input output
According to the Cloudera Tutoria, this should be all you need to do, but I got an error message here, so everything is not quite right yet.
When you first log onto the virtual machine, it should begin with a firefox window open to some kind of cloudera page. Go to this and click the Cloudera Manager link.
Enter ‘admin’ and ‘admin’ as a username and password to access it.
Now you can see the health of your setup’s various components. mapreduce1 will probably be listed as in poor health. click on it
You should see that the jobtracker is the problem. Return to terminal:
sudo -u hdfs hadoop fs -mkdir /tmp/mapred/system
sudo -u hdfs hadoop fs -chown mapred:hadoop /tmp/mapred/system
Then restart jobtracker by clicking instances the instances tab, clicking on jobtracker, clicking to the processes tab, selecting the actions tab in the corner, and selecting restart:
hadoop jar wordcount.jar org.myorg.WordCount /user/cloudera/wordcount/input output
This time it should work:
hadoop fs -cat output/part-00000
This will open up the output folder for you from the hadoop run. It should look like this:
cloud-kepler Documentation, Release 1.2 Bye 1 Goodbye 1 Hadoop 2 Hello 2 World 2
If it looks like that then you are good.
It is worth noting that Hadoop won’t work unless the directory you set as your output both does not currently exist and is in your hadoop fs home directory.
1.3 Lein setup
Note that this assumes a cloudera vm distribution of hadoop. You can find Lein athttps://github.com/technomancy/leiningen
Download the script from https://raw.github.com/technomancy/leiningen/stable/bin/lein and place it wherever you want:
export $HOME=/home cd
cd ..
cd etc/profile.d sudo vim lein.sh
On one line of the file writeexport PATH=$PATH:{wherever your lein file is located}(in my case /home/cloudera/Desktop)
Save the file and exit.
Exit and reenter terminal to get back to you home directory:
chmod 755 {location of lein}
Lein should now be functioning, call ‘lein’ in terminal to test.
1.4 LEMUR setup
Note that this assumes a cloudera vm distribution of hadoop.
Lemur can be downloaded fromhttp://download.climate.com/lemur/releases/lemur-1.3.1.tgz. follow that link and the file should appear in your download folder.
Extract it, and then put it wherever you want it to be:
export $HOME=/home cd
cd ..
cd etc/profile.d sudo vim lemur.sh
You are now writing a file which will allow your system to recognize lemur.
on the first line of the file writeexport LEMUR_HOME={wherever you saved your lemur file}(in my case /home/cloudera/Desktop/lemur).
on the second line of the file writeexport LEMUR_AWS_ACCESS_KEY={your aws access key}
on the third line of the file writeexport LEMUR_AWS_SECRET_KEY={your aws secret key}
on the fourth line of the file writeexport PATH=$PATH:$LEMUR_HOME/bin
save the file and exit.
Lemur should now work, calllemurin terminal to test.
1.5 References
• Koch, D.G., Borucki, W.J., Basri, G., et al. 2010, The Astrophysical Journal Letters, 713, L79 10.1088/2041-8205/713/2/L79
• Kovacs, G., Zucker, S., & Mazeh, T. 2002, Astronomy & Astrophysics, 391, 36910.1051/0004-6361:20020802
• Still, M., & Barclay, T. 2012, Astrophysics Source Code Library, 8004 • LEMUR launcher, Limote M. et al. 2012The Climate Corporation
CHAPTER
2
Quickstart Guide
A normal run of cloud-kepler can be started by:
more input.txt | python get_data.py mast | python join_quarters.py | python drive_bls_pulse.py -c config.conf
This sequence downloads all data from MAST and runs it through the algorithm with the parameters in a configuration file.
2.1 Specifying the data to download
The input file (or lines typed directly tostdin) should include the KIC ID, quarter number, and cadence identifier on each line, such as:
011013072 1 llc 011013072 2 slc 011600006 * llc
The special quarter identifier*will download all available quarters for the given KIC ID.slcindicates short-cadence data andllcindicates long-cadence data.
The Python scriptget_data.pyalso accepts the keyworddatafollowed by an absolute or relative filepath of a top-level data directory, with the same structure as theKeplerarchive on MAST; use this option instead ofmastif your data is stored locally.
2.2 Configuration file options
There are several options that can be specified in a configuration file; the same options can be specified via command line options, but they will be overriden by the file if it is provided (with the-cflag). A standard configuration file looks like: [DEFAULT] segment = 2 min_duration = 0.01 max_duration = 0.5 n_bins = 1000 direction = 0 mode = cython print_format = encode verbose = no profiling = off 7
Additional options will be added as needed, such as for detrending flags.
CHAPTER
3
Retrieving and downloading data
3.1
get_data
– Get data from MAST or hard disk
3.2
join_quarters
– Stitch multiple quarters of data together
CHAPTER
4
BLS pulse algorithm
4.1
drive_bls_pulse
– Driver interface to BLS pulse
4.2
bls_pulse_python
– Naive pure Python implementation
4.3
bls_pulse_vec
– Vectorized Python implementation
4.4
bls_pulse_cython
– Optimized Cython implementation
CHAPTER
5
detrend
– Detrend lightcurve data
CHAPTER
6
clean_signal
– Signal cleaning (removal of strong periodic
signals)
CHAPTER
7
postprocessing
– Analyze output from BLS pulse
CHAPTER
8
utils
– Utility functions
Python Module Index
p
postprocessing,17