An Open Source Data Integration Project using Python. Mike Pittaro BayPiggies, September 13, 2007

(1)

An Open Source Data Integration Project using Python

Mike Pittaro

(2)

The SnapLogic Project



Open Source Data Integration Framework

● Our goal is to simplify data access and transformation
● We expose data as an HTTP endpoint
● Based on REST architecture and resource modeling

Design Goals

● Scalable
● Extensible (by an ordinary developer)
● Easier to use than writing code for every data interface
● Target developers, not business users
    ● 'Data munging' (Greg Wilson)
● Bridge the gap between the Web and enterprise data
    ● Enable data mashups

(3)

SnapLogic System Block Diagram

[Figure: SnapLogic system block diagram. Labels recoverable from the diagram:
  Servers: Data Server (HTTP Listener and Dispatcher), Management Server, Other Data Servers
  Pipeline components: Database Reader, Database Writer, File Reader, File Writer, RSS/Atom Reader, Sorter, Joiner, Lookup, User Defined Compute
  Storage: Resource Definitions (RDF Data Store); user permissions + access control (relational tables)
  Clients: Browser Client (Flex), Command Line Interface, SnapAdmin, Python Program, Client API, SnapScript classes
  Connections: REST, XML/RPC]

● Resource Definition (ResDef)
    ● URI
    ● Component
    ● Properties for component (SQL query, user, password)
    ● Input and output views

(4)

Finding and Hiring Python Developers



Very Difficult

● Tried the baypiggies mailing list in January 2007
    ● Triggered a jobs list discussion, a name change discussion, a metalist discussion, and a list split (?)
    ● Never got a reply :(
● We are a startup, so risk is a factor

It seems better to hire good programmers

● Good programmers can use any language
● We can convert them
● It takes a few months for Python to really sink in

(5)

The first prototype



Lots of ideas; we needed to prove some concepts

● Modeling data as a resource
● Streaming data through REST pipelines
● Storing definitions of resources

Started the prototype in March 2006 with:

● Python
● CherryPy as the core HTTP server
● A desire to leverage as much as possible from existing packages and libraries
● No strong bias toward the final decisions

Successful

● Proved the concepts
● Learned a lot about Python and the larger community

(6)

Why Use Python for the prototype?

Rapid development

● Prove concepts quickly
● Get working code sooner, play with it
● Throw it away and start again
● Identify the real work to be done

Readable code

● The ability to understand what we tried was important
● The ability to study what others have done (don't reinvent)
● Readable code reduces ramp-up for new developers

My 'hidden agenda'

● I'm biased towards Python
    ● This guy named Tim (PEP 20) said it was good in 1992
● Wanted to see if we could build the whole system in Python

(7)

Results of the prototype



Technologies needed           What can we leverage?

Basic libraries               Python standard library
HTTP server                   CherryPy, BaseHTTPServer, twisted, many more
Database connectivity         Python database API (PEP 249)
Compact data encoding         PyASN.1
RSS/Atom formats              FeedParser
Resource definitions          RDFLib
Resource database             RDFLib store, PySQLite
Error and activity logging    logging package
Command line tool             cmd package
Plugins                       modules, packages, import
Configuration file            import, ConfigParser

(8)

Writing our own HTTP Server



HTTP is a core part of our data server

● We looked at CherryPy, twisted, BaseHTTPServer, paste, mod_python, wsgi and mod_wsgi
● All the existing HTTP servers treat the request/response as a single unit
● Key requirement: streaming of requests/responses

We wrote our own in Python

● ~440 lines of code, 520 lines of comment / docstring (pycount)
● Took days, not weeks
● It's not a full-featured, general-purpose HTTP server
    ● We did it to solve the 'streaming' problem
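One way to picture the 'streaming' requirement is HTTP/1.1 chunked transfer encoding, where each record can go on the wire as soon as it is produced instead of after the whole response has been assembled. The following is only a sketch of that idea, not the SnapLogic server code; the function name `chunked` and the JSON-lines record format are assumptions made for illustration.

```python
import json
from typing import Iterable, Iterator


def chunked(records: Iterable[dict]) -> Iterator[bytes]:
    """Yield each record as one HTTP/1.1 chunk: hex size, CRLF, data, CRLF.

    Hypothetical helper: the record format (one JSON object per line)
    is an assumption, not something the slides specify.
    """
    for record in records:
        payload = json.dumps(record).encode("utf-8") + b"\n"
        yield b"%x\r\n" % len(payload) + payload + b"\r\n"
    # A zero-length chunk marks the end of the body.
    yield b"0\r\n\r\n"
```

Because `chunked` is a generator, a server loop can write each frame to the socket as it arrives, so memory use stays flat no matter how many records flow through.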

(9)

Tools of the trade



The basic tools

● Unix, Windows, LDAP, and SMTP/IMAP email
● Mailman mailing lists
● Subversion for code
● Trac, MoinMoin for bugs and specs
● Komodo, Eclipse/PyDev, and vim

CMU SEI CMM Level

● Not sure, probably 0, 1 or maybe 2

The Joel Test

● We score 11.5 out of 12
● http://www.joelonsoftware.com/articles/fog0000000043.html

(10)

Answer: This is Toxic Waste.

Question: What are open source bug fixes?

We found bugs in some of the packages

We fixed them

Then what?

● Now we had a custom patched version of something
● We didn't want to redistribute it

Solution:

● Send the bug reports back, with test cases
● Send the patch or fix back
● The probability of a fix is much higher if you provide it

We still haven't found a core Python bug

(11)

SnapAdmin: A Command line utility



We needed a command line utility

● Server management commands
● Repository creation
● User and security management
● Import/export from repository

We used the standard Cmd package

● Relatively easy to use
● Cmd prefers a 'noun-verb' syntax
    ● 'resource import ...' vs. 'import resource'
● One team member wrote a command shell first
    ● Read the library reference
    ● Keep it under your pillow
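The 'noun-verb' preference follows from how the standard `cmd.Cmd` class dispatches: the first word of the input line selects a `do_<word>` method, and the rest of the line is passed in as one argument string. A minimal sketch, assuming a hypothetical `AdminShell` class and made-up command names (this is not the SnapAdmin source):

```python
import cmd


class AdminShell(cmd.Cmd):
    """Hypothetical admin shell; command names are illustrative only."""

    prompt = "admin> "

    def do_resource(self, arg):
        """resource import|export PATH"""
        # Cmd dispatched on the noun; split off the verb ourselves.
        verb, _, path = arg.partition(" ")
        handler = getattr(self, "_resource_" + verb, None)
        if handler is None:
            self.stdout.write("usage: resource import|export PATH\n")
            return
        handler(path.strip())

    def _resource_import(self, path):
        self.stdout.write("importing resources from %s\n" % path)

    def _resource_export(self, path):
        self.stdout.write("exporting resources to %s\n" % path)

    def do_quit(self, arg):
        """quit: leave the shell"""
        return True  # a true return value stops cmdloop()
```

Calling `AdminShell().cmdloop()` starts the interactive prompt. With this layout every verb for a noun lives behind one `do_` method, which is why 'resource import' is easier to express than 'import resource'.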

(12)

Deeper Python hacking: SnapScript

We have a ResDef class that's very complex

● It is the basis for our resource definitions
● Lots of RDFLib and graph code in the class
● The class interface requires knowledge of a lot of the system

We needed a simpler resource interface

● For programmers defining resources
● For programmers linking and using resources

We created a package called SnapScript

● A set of very generic classes that encapsulate ResDef
● They provide a simple interface to resources
● They 'hide' the ResDef class
● Implemented using __setattr__ and __getattr__ hooks

(13)

ResDef code sample

    from SnapLogic.Utils.Exceptions import *
    from SnapLogic.Utils.ResDef import ListTypes
    from SnapLogic.Utils.ResDef import ResDef

    # Create a resource definition for the database read component
    r = ResDef.getResDef('SnapLogic.Components.DBRead')
    r.URI = '/Trac/TracTickets'
    r.setProperty('description', 'Read tickets from a Trac database.')
    r.setProperty('title', 'ReadTracTickets')
    r.setProperty('DBConnect', '/Trac/TracDatabase')
    sql = """SELECT ... """
    r.setProperty('SQLStmt', sql)

    # Fields of the view
    view = (
        ('id', SnapNumber, ''),
        ('summary', SnapString, ''),
        ('owner', SnapString, ''),
        ('type', SnapString, ''),
    )
    r.setListProperty('InputView', ListTypes.OutputView, view)
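To make the resource definition above more concrete, here is a rough sketch of what a database read step could do with an SQL statement and a list of view field names, using only the Python database API (PEP 249). It is not the DBRead component; `read_records`, the in-memory SQLite database, and the `ticket` table layout are all assumptions for illustration.

```python
import sqlite3


def read_records(connection, sql, field_names):
    """Run sql on a PEP 249 connection and yield one dict per row.

    Hypothetical helper: field_names plays the role of the view's
    field list in the resource definition.
    """
    cursor = connection.cursor()
    try:
        cursor.execute(sql)
        while True:
            row = cursor.fetchone()
            if row is None:
                break
            yield dict(zip(field_names, row))
    finally:
        cursor.close()


# An in-memory SQLite database stands in for the Trac database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ticket (id INTEGER, summary TEXT, owner TEXT, type TEXT)")
conn.execute("INSERT INTO ticket VALUES (1, 'Fix login', 'mike', 'defect')")
records = list(read_records(
    conn,
    "SELECT id, summary, owner, type FROM ticket",
    ("id", "summary", "owner", "type"),
))
```

Fetching one row at a time keeps the reader compatible with a streaming pipeline, since no step has to hold the full result set.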

(14)

Equivalent code using SnapScript class

    from SnapLogic import SnapScript

    # Create a database read resource for the Trac tickets
    r = SnapScript.Resource.Resource(component='SnapLogic.Components.DBRead')
    r.props.URI = 'Trac/Tickets'
    r.props.description = 'Read tickets from a Trac database.'
    r.props.title = 'ReadTracTickets'
    r.props.DBConnect = '/Trac/TracDatabase'
    r.props.SQLStmt = """ SELECT .... """

    view1 = (
        ('id', SnapNumber, ''),
        ('summary', SnapString, ''),
        ('owner', SnapString, ''),
        ('type', SnapString, ''),
    )
    r.props.outputview.output = view1

    r.check()
    r.saveToServer(server1)

SnapLogic.Resource.props is 'hooked' by __setattr__(), and assignments are treated specially.
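The hooking idea can be sketched in a few lines: an object whose `__setattr__` forwards every assignment to an explicit setter on a wrapped object. This is an illustration of the general technique, not the SnapScript implementation; `RecordingResDef` is a stand-in, `Props` is a hypothetical name, and `getProperty` is assumed by analogy with the `setProperty` call shown earlier.

```python
class RecordingResDef:
    """Stand-in for a ResDef-like object with explicit accessor calls."""

    def __init__(self):
        self.properties = {}

    def setProperty(self, name, value):
        self.properties[name] = value

    def getProperty(self, name):  # assumed counterpart of setProperty
        return self.properties[name]


class Props:
    """Attribute-style facade: props.title = x calls setProperty('title', x)."""

    def __init__(self, resdef):
        # Bypass our own __setattr__ hook when storing the wrapped object.
        object.__setattr__(self, "_resdef", resdef)

    def __setattr__(self, name, value):
        self._resdef.setProperty(name, value)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        try:
            return self._resdef.getProperty(name)
        except KeyError:
            raise AttributeError(name) from None
```

With this in place, `Props(resdef).title = 'ReadTracTickets'` has the same effect as `resdef.setProperty('title', 'ReadTracTickets')`, which is the kind of simplification the SnapScript slide describes.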

(15)

People are going to see our Code!

The license is Open Source, GPL v2

● What will people think of our code?
● It's cleanup time!

Better organization of module/package structure

● DataServer, Components, Utils, SnapScript packages

Code docstrings

● The original code included good documentation
● We were inconsistent about comments versus docstrings
● We standardized on epydoc/epytext
● Code 'owner' did the documentation

(16)

Coding Style



We had guidelines initially

● We didn't follow them consistently
● Later standardized on PEP 8 + 120 columns (versus 80)

Naming Conventions

● All classes initially had a 'Qbf' prefix, later 'Snap'
    ● An utterly useless thing to do; Python has namespaces!
● Overuse of __ for private
● Guidelines for lowercase, CamelCase, lowerCamelCase
● WeHaveAGermanWorkingOnTheProject

Imports get out of hand quickly

● Python library
● Third party packages
● Our packages/modules

(17)

Python is not Java or C++



Good Advice

● http://dirtsimple.org/2004/12/python-is-not-java.html

Typical problems

● I can add strings!

        s = ''
        for i in range(10):
            s = s + 'x'

        # good examples
        s2 = 10 * 'x'
        s3 = ''.join(list_of_strings)

● Writing getters and setters
    ● Use property()



Everyone now has a Python Cookbook
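The "Use property()" advice can be illustrated as follows: callers keep using plain attribute syntax, and validation is added behind it later without changing any calling code. The `Resource` class and its `title` rule here are hypothetical, chosen only for the example.

```python
class Resource:
    """Hypothetical class: title looks like a plain attribute to callers."""

    def __init__(self, title):
        self.title = title  # goes through the property setter below

    @property
    def title(self):
        return self._title

    @title.setter
    def title(self, value):
        value = value.strip()
        if not value:
            raise ValueError("title must not be empty")
        self._title = value
```

Code written as `r.title = 'ReadTracTickets'` works the same before and after the property is introduced, so there is no reason to start with `getTitle()` and `setTitle()` methods.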

(18)

Testing and the 'build' process



We have an automated build and test process

● Driven by buildbot
● Watches for checkins, or we can start it manually

The test flow:

● Check out the code
● Build the client code (Adobe Flex)
● Build an installer image (Bitrock)
● Install on a virtual machine
● Run module unit tests
● Run integration tests
● Generate epydocs for code and API documentation

Code Coverage

● We collect code coverage using figleaf while tests run

(19)

Testing: Every module needs unit tests



We started with:

    # module body
    # ...
    if __name__ == '__main__':
        pass  # run tests

● 'Tests' were really usage examples
● Tests were not consistently run or updated

We now use unittest

● Every module has a unit test in a ./test subdirectory
● Starting to use PyMock for more complex tests

Going Forward

● Unit tests are developed much earlier
● Coverage is always checked
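For comparison with the `if __name__ == '__main__'` style, a module-level test written with `unittest` might look like the sketch below. The function under test, `split_uri`, is a hypothetical helper invented for the example, and `run_tests` shows one way a build step could run the case programmatically.

```python
import io
import unittest


def split_uri(uri):
    """Hypothetical helper: split a resource URI into its path segments."""
    return [segment for segment in uri.split("/") if segment]


class SplitUriTest(unittest.TestCase):
    def test_leading_slash_is_ignored(self):
        self.assertEqual(split_uri("/Trac/TracTickets"), ["Trac", "TracTickets"])

    def test_empty_uri_has_no_segments(self):
        self.assertEqual(split_uri(""), [])


def run_tests():
    """Run the test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SplitUriTest)
    return unittest.TextTestRunner(stream=io.StringIO()).run(suite)
```

Unlike usage examples under `__main__`, each test states an expected result, so an automated build can tell a pass from a failure through `run_tests().wasSuccessful()`.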

(20)

Integration testing



This is system level testing for us

● Depends on having the basic modules in order
● Requires a lot of setup/infrastructure in place
    ● Databases, reference data, etc.

We verify code coverage for all tests

● Essential for Python: there's no 'compiler' pass to catch errors before a line actually runs
● Unit testing alone can't cover all our code

Client testing

● Not automated yet
● It's difficult since the client is not 'forms' oriented

(21)

Packaging and Installs



We use Bitrock for the main installer

● Installer downloads additional eggs from the Cheese Shop
● Lots of dependencies on other packages

We don't create rpms or debs

● We might do this in the future

The biggest install problems

● We need a Python installation to get started
● Old versions of Python included in Linux distributions
● Need permissions to add additional site packages
● Dependencies on database libraries
    ● Database modules typically include a compile step

(22)

Building the SnapLogic.org site

SnapLogic.org is our public site

● Strong commitment to Python and Open Source applications

Red Hat Linux

● Apache + mod_python
● Python 2.4.4
● Trac 0.10.4 with MySQL database
● Mailman mailing lists
● Blog is WordPress (PHP)
    ● Nobody but me liked NewsBruiser
● Django for content downloads and registration process

Future Site tasks

● Better wiki (MoinMoin?)
● Web Content Management System (Django or Zope/Plone)

(23)

Future tasks for the SnapLogic project



Python 3000 support

● We started late on Python 2.5 support
● Not really a big deal, we just didn't prioritize it

WSGI

● We will likely move to a WSGI interface for the server
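A WSGI application can return any iterable of byte strings, so a generator lets the body be produced one record at a time. The sketch below shows that shape under stated assumptions; `record_app` and its placeholder data are hypothetical and say nothing about how the SnapLogic server would actually adopt WSGI.

```python
from wsgiref.util import setup_testing_defaults


def record_app(environ, start_response):
    """WSGI application whose body is produced lazily, one line per record."""
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])

    def body():
        for ticket_id in range(1, 4):  # placeholder data source
            yield ("ticket %d\n" % ticket_id).encode("utf-8")

    return body()


# Call the application directly, the way a WSGI server would.
environ = {}
setup_testing_defaults(environ)
captured = {}


def start_response(status, headers, exc_info=None):
    captured["status"] = status
    captured["headers"] = headers


chunks = list(record_app(environ, start_response))
```

Whether the client actually receives the records incrementally depends on the WSGI server in front of the application, which is the part that would need checking against the streaming requirement.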



Performance

● Python execution is still mainly single processor
● We use threads, but benefits are minimal since we are compute intensive
● Looking into parallelism (WSGI + Parallel Python?)

More connectivity

● New SnapLogic components
● Python DB API is not as rich as Perl DBI yet
