cloud-kepler Documentation


Release 1.2

Scott Fleming, Andrea Zonca, Jack Flowers, Peter McCullough, Ellen Price


Contents

1 System configuration
1.1 Python and Virtualenv setup
1.2 Hadoop setup
1.3 Lein setup
1.4 LEMUR setup
1.5 References

2 Quickstart Guide
2.1 Specifying the data to download
2.2 Configuration file options

3 Retrieving and downloading data
3.1 get_data – Get data from MAST or hard disk
3.2 join_quarters – Stitch multiple quarters of data together

4 BLS pulse algorithm
4.1 drive_bls_pulse – Driver interface to BLS pulse
4.2 bls_pulse_python – Naive pure Python implementation
4.3 bls_pulse_vec – Vectorized Python implementation
4.4 bls_pulse_cython – Optimized Cython implementation

5 detrend – Detrend lightcurve data

6 clean_signal – Signal cleaning (removal of strong periodic signals)

7 postprocessing – Analyze output from BLS pulse

8 utils – Utility functions

Python Module Index


cloud-kepler is a cloud-enabled Kepler planet searching pipeline.


CHAPTER

1 System configuration

1.1 Python and Virtualenv setup

To set up Python and Virtualenv, run the following commands from a terminal:

cd ~/temp
curl -L -o virtualenv.py https://raw.github.com/pypa/virtualenv/master/virtualenv.py
python virtualenv.py cloud-kepler --no-site-packages
. cloud-kepler/bin/activate
pip install numpy
pip install simplejson
pip install pyfits

Test that the basic Python code is working:

cat {DIRECTORY_WITH_CLOUD_KEPLER}/test/test_q1.txt | python {DIRECTORY_WITH_CLOUD_KEPLER}/python/download.py

If the script starts downloading data and printing base64-encoded numpy arrays, the setup is working.

1.2 Hadoop setup

Install Oracle VM VirtualBox 4.2.14 (installer VirtualBox-4.2.14-86644-win) from https://www.virtualbox.org/wiki/Downloads

Download cloudera-quickstart-demo-vm-4.3.0-virtualbox.tar.gz from https://ccp.cloudera.com/display/SUPPORT/Cloudera+QuickStart+VM and extract it.

Enter the created folder and extract cloudera-quickstart-demo-vm-4.3.0-virtualbox.tar. You should end up with cloudera-quickstart-demo-vm.ovf and cloudera-quickstart-demo-vm.vmdk in the folder you extracted to.

Open Oracle VM VirtualBox Manager.

Click the New icon; the Create Virtual Machine window opens.

For operating system, select Linux and Ubuntu.

For memory size, select 4096 MB.

For hard drive, select “Use an existing virtual hard drive” and browse to cloudera-quickstart-demo-vm.vmdk.

Press Create. The virtual machine is now selectable in the main window of VirtualBox Manager.

Press the Settings button to open the settings window, then choose the System tab.

Change the chipset to ICH9 and make sure “Enable IO APIC” is checked.

Select the virtual machine and press Start. The boot takes a little while.

If the boot stays on any one step for more than 20 minutes, you can assume something is wrong. Otherwise the boot sequence will eventually end and you will see a desktop in your virtual machine.

1.2.1 WordCount Example

Note that this assumes a Cloudera VM distribution of Hadoop.

Inside your virtual machine, go to the Cloudera Hadoop Tutorial at http://www.cloudera.com/content/cloudera-content/cloudera-docs/HadoopTutorial/CDH4/Hadoop-Tutorial/ht_topic_5_1.html

Copy the source code for WordCount and paste it into the gedit text editor. Save it as WordCount.java in the cloudera user's home folder.

Per the instructions there, open a terminal, cd to the home directory, then run the following:

mkdir wordcount_classes

javac -cp /usr/lib/hadoop/*:/usr/lib/hadoop/client-0.20/* -d wordcount_classes WordCount.java

Right-click the wordcount_classes folder you made (it will be in the home directory) and select Compress. Choose .jar as the file format and wordcount as the filename. Then create the input files, copy them into HDFS, and run the job:

echo "Hello World Bye World" > file0
echo "Hello Hadoop Goodbye Hadoop" > file1
hadoop fs -mkdir /user/cloudera /user/cloudera/wordcount /user/cloudera/wordcount/input
hadoop fs -put file* /user/cloudera/wordcount/input
hadoop jar wordcount.jar org.myorg.WordCount /user/cloudera/wordcount/input output

According to the Cloudera Tutorial, this should be all you need to do, but I got an error message at this step, so one more fix is needed.

When you first log on to the virtual machine, a Firefox window should be open on a Cloudera page. Go to this window and click the Cloudera Manager link.

Enter ‘admin’ as both the username and the password to access it.

Now you can see the health of your setup's various components. mapreduce1 will probably be listed as in poor health. Click on it.

You should see that the jobtracker is the problem. Return to the terminal:

sudo -u hdfs hadoop fs -mkdir /tmp/mapred/system

sudo -u hdfs hadoop fs -chown mapred:hadoop /tmp/mapred/system

Then restart the jobtracker: click the Instances tab, click jobtracker, open the Processes tab, select the Actions menu in the corner, and choose Restart. Then run the job again:

hadoop jar wordcount.jar org.myorg.WordCount /user/cloudera/wordcount/input output

This time it should work:

hadoop fs -cat output/part-00000

This prints the output of the Hadoop run. It should look like this:

Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2

If your output matches, the Hadoop setup is working.

Note that the Hadoop job will fail unless the output directory you specify does not already exist and is located in your Hadoop fs home directory.
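To check the expected WordCount result without a cluster, the counting can be reproduced locally. The following is a minimal pure-Python sketch, assuming Python 3; it is not the Java WordCount from the Cloudera tutorial, and the function name word_count is hypothetical. It only mimics the map step (split lines into words) and the reduce step (sum the occurrences per word) on the contents of file0 and file1.

```python
from collections import Counter


def word_count(lines):
    """Count whitespace-separated words, mimicking the map and reduce steps."""
    counts = Counter()
    for line in lines:
        # "Map": split the line into words. "Reduce": sum occurrences per word.
        counts.update(line.split())
    # Sort by word so the order matches the sample Hadoop output.
    return sorted(counts.items())


# Contents of file0 and file1 from the echo commands.
sample = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]
for word, count in word_count(sample):
    print(word, count)
```

The printed pairs should agree with the contents of output/part-00000.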

1.3 Lein setup

Note that this assumes a Cloudera VM distribution of Hadoop. You can find Lein at https://github.com/technomancy/leiningen

Download the script from https://raw.github.com/technomancy/leiningen/stable/bin/lein and place it wherever you want. Then create a profile script for it:

export HOME=/home
cd
cd ..
cd etc/profile.d
sudo vim lein.sh

On one line of the file, write export PATH=$PATH:{wherever your lein file is located} (in my case /home/cloudera/Desktop).

Save the file and exit.

Exit and reenter the terminal to get back to your home directory:

chmod 755 {location of lein}

Lein should now work; run lein in a terminal to test.

1.4 LEMUR setup

Note that this assumes a Cloudera VM distribution of Hadoop.

Lemur can be downloaded from http://download.climate.com/lemur/releases/lemur-1.3.1.tgz. Follow that link and the file should appear in your download folder.

Extract it, and then put it wherever you want it to be. Then create a profile script for it:

export HOME=/home
cd
cd ..
cd etc/profile.d
sudo vim lemur.sh

You are now writing a file which will allow your system to recognize lemur.

On the first line of the file, write export LEMUR_HOME={wherever you saved your lemur file} (in my case /home/cloudera/Desktop/lemur).

On the second line of the file, write export LEMUR_AWS_ACCESS_KEY={your aws access key}

On the third line of the file, write export LEMUR_AWS_SECRET_KEY={your aws secret key}

On the fourth line of the file, write export PATH=$PATH:$LEMUR_HOME/bin

Save the file and exit.

Lemur should now work; run lemur in a terminal to test.
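If lein or lemur is still not recognized after reopening the terminal, a quick check of the environment can narrow down which step went wrong. The sketch below assumes Python 3 and is not part of cloud-kepler; check_setup is a hypothetical helper. It looks for the three LEMUR_* variables written to lemur.sh and for the lein and lemur executables on PATH.

```python
import os
import shutil

# Variable names taken from the lemur.sh lines described in this section.
REQUIRED_VARS = ("LEMUR_HOME", "LEMUR_AWS_ACCESS_KEY", "LEMUR_AWS_SECRET_KEY")


def check_setup(tools=("lein", "lemur"), env=None):
    """Return a list of problems found; an empty list means every check passed."""
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_VARS:
        if not env.get(name):
            problems.append("environment variable %s is not set" % name)
    for tool in tools:
        # shutil.which searches the directories listed in PATH for an executable.
        if shutil.which(tool, path=env.get("PATH")) is None:
            problems.append("%s was not found on PATH" % tool)
    return problems


for problem in check_setup():
    print(problem)
```

No output means the variables are set and both tools were found. Otherwise each printed line points at the profile script entry to recheck.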

1.5 References

• Koch, D.G., Borucki, W.J., Basri, G., et al. 2010, The Astrophysical Journal Letters, 713, L79, doi:10.1088/2041-8205/713/2/L79

• Kovacs, G., Zucker, S., & Mazeh, T. 2002, Astronomy & Astrophysics, 391, 369, doi:10.1051/0004-6361:20020802

• Still, M., & Barclay, T. 2012, Astrophysics Source Code Library, 8004

• LEMUR launcher, Limote M. et al. 2012, The Climate Corporation

CHAPTER

2 Quickstart Guide

A normal run of cloud-kepler can be started by:

more input.txt | python get_data.py mast | python join_quarters.py | python drive_bls_pulse.py -c config.conf

This sequence downloads the data listed in input.txt from MAST, stitches the quarters together, and runs the result through the BLS pulse algorithm with the parameters given in the configuration file config.conf.
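The same chain can also be driven from Python rather than the shell. The following is a rough sketch, assuming Python 3.5 or later; run_pipeline and KEPLER_STAGES are hypothetical names, not part of cloud-kepler. Unlike a real shell pipe, each stage here runs to completion before the next one starts, so it suits small inputs only.

```python
import subprocess


def run_pipeline(stages, input_text):
    """Pass input_text through each command in turn, like a shell pipe."""
    data = input_text.encode()
    for command in stages:
        # The output of one stage becomes the standard input of the next.
        result = subprocess.run(
            command, input=data, stdout=subprocess.PIPE, check=True
        )
        data = result.stdout
    return data.decode()


# The three stages of the quickstart command, as argument lists.
# They are only defined here, not executed.
KEPLER_STAGES = [
    ["python", "get_data.py", "mast"],
    ["python", "join_quarters.py"],
    ["python", "drive_bls_pulse.py", "-c", "config.conf"],
]
```

Calling run_pipeline(KEPLER_STAGES, text), where text holds the contents of input.txt, from the directory containing the scripts should then behave like the shell command. check=True makes a failing stage raise an error instead of passing empty data along.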

2.1 Specifying the data to download

The input file (or lines typed directly to stdin) should include the KIC ID, quarter number, and cadence identifier on each line, such as:

011013072 1 llc
011013072 2 slc
011600006 * llc

The special quarter identifier * will download all available quarters for the given KIC ID. slc indicates short-cadence data and llc indicates long-cadence data.
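To make the line format concrete, here is a small parsing sketch, assuming Python 3. It is an illustration only and not the parser used by get_data.py; Request and parse_request are hypothetical names. The KIC ID is kept as a string so that leading zeros survive, and the * quarter is represented as None.

```python
from collections import namedtuple

# Hypothetical container for one input line; not part of cloud-kepler.
Request = namedtuple("Request", ["kic_id", "quarter", "cadence"])


def parse_request(line):
    """Parse 'KIC_ID QUARTER CADENCE'; quarter is an int, or None for '*'."""
    kic_id, quarter, cadence = line.split()
    if cadence not in ("llc", "slc"):
        raise ValueError("cadence must be 'llc' or 'slc', got %r" % cadence)
    # '*' requests every available quarter for this KIC ID.
    quarter = None if quarter == "*" else int(quarter)
    return Request(kic_id, quarter, cadence)


for text in ["011013072 1 llc", "011013072 2 slc", "011600006 * llc"]:
    print(parse_request(text))
```

A line with the wrong number of fields or an unknown cadence raises ValueError, which is a convenient way to catch typos in the input file before any download starts.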

The Python script get_data.py also accepts the keyword data followed by an absolute or relative filepath of a top-level data directory, with the same structure as the Kepler archive on MAST; use this option instead of mast if your data is stored locally.

2.2 Configuration file options

There are several options that can be specified in a configuration file; the same options can be specified via command-line options, but they will be overridden by the file if it is provided (with the -c flag). A standard configuration file looks like:

[DEFAULT]
segment = 2
min_duration = 0.01
max_duration = 0.5
n_bins = 1000
direction = 0
mode = cython
print_format = encode
verbose = no
profiling = off

Additional options, such as detrending flags, will be added as needed.
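The precedence rule (file values win over command-line values) can be illustrated with the standard configparser module. This is a sketch assuming Python 3, not the code in drive_bls_pulse.py; load_options is a hypothetical helper, and the value types in CONVERTERS are guesses based on the sample values.

```python
import configparser

SAMPLE = """\
[DEFAULT]
segment = 2
min_duration = 0.01
max_duration = 0.5
n_bins = 1000
direction = 0
mode = cython
print_format = encode
verbose = no
profiling = off
"""

# Guessed value types; options not listed here are kept as strings.
CONVERTERS = {"min_duration": float, "max_duration": float, "n_bins": int}
BOOLEAN_OPTIONS = ("verbose", "profiling")


def load_options(text, cli_options=None):
    """Merge command-line options with file options; the file takes precedence."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    options = dict(cli_options or {})
    for key, raw in parser.defaults().items():
        if key in BOOLEAN_OPTIONS:
            # getboolean understands yes/no, on/off, true/false and 1/0.
            options[key] = parser.getboolean("DEFAULT", key)
        else:
            options[key] = CONVERTERS.get(key, str)(raw)
    return options


# A command-line value of 50 for n_bins is replaced by the file's 1000.
print(load_options(SAMPLE, {"n_bins": 50})["n_bins"])
```

Options given only on the command line are left untouched, since the loop overwrites just the keys that appear in the file.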


CHAPTER

3 Retrieving and downloading data

3.1 get_data – Get data from MAST or hard disk

3.2 join_quarters – Stitch multiple quarters of data together

CHAPTER

4 BLS pulse algorithm

4.1 drive_bls_pulse – Driver interface to BLS pulse

4.2 bls_pulse_python – Naive pure Python implementation

4.3 bls_pulse_vec – Vectorized Python implementation

4.4 bls_pulse_cython – Optimized Cython implementation

CHAPTER

5 detrend – Detrend lightcurve data

CHAPTER

6 clean_signal – Signal cleaning (removal of strong periodic signals)

CHAPTER

7 postprocessing – Analyze output from BLS pulse

CHAPTER

8 utils – Utility functions

Python Module Index

p

postprocessing