(1)An Open Source Data Integration Project
using Python
Mike Pittaro
(2)The SnapLogic Project
Open Source Data Integration Framework
Our goal is to simplify data access and transformation
We expose data as an HTTP endpoint
Based on REST Architecture and resource modeling
Design Goals
Scalable
Extensible (by ordinary developer)
Easier to use than writing code for every data interface
Target developers, not business users
'Data munging' (Greg Wilson)
Bridge the gap between the Web and Enterprise Data
Enable data mashups
(3)SnapLogic System Block Diagram
[Block diagram. Data Server: HTTP Listener and Dispatcher in front of pipeline components (Database Reader, Database Writer, File Reader, File Writer, RSS/Atom Reader, Lookup Joiner, Sorter, Compute, User Defined).
Management Server: Resource Definitions (RDF Data Store), user permissions + access control (relational tables).
Clients: Browser Client (Flex), Command Line Interface (SnapAdmin), Python Program using the Client API and SnapScript classes, Other Data Servers.
Connections in the diagram are labeled REST and XML/RPC.]
● Resource Definition (ResDef)
● URI
● Component
● Properties for component (SQL Query, user, password)
● Input and output views
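The four parts of a resource definition listed above can be pictured as a small record. This is only an illustrative sketch: the class name, field names, and sample values are hypothetical, and the real ResDef is stored in an RDF data store rather than a plain Python object.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceDefinition:
    """Hypothetical in-memory model of the ResDef parts named on the slide."""
    uri: str                                            # where the resource is exposed
    component: str                                      # component that does the work
    properties: dict = field(default_factory=dict)      # component settings
    input_views: dict = field(default_factory=dict)     # named input record layouts
    output_views: dict = field(default_factory=dict)    # named output record layouts

    def describe(self):
        """Return a one-line summary of the definition."""
        return f"{self.uri} -> {self.component} ({len(self.properties)} properties)"


# Sample values are made up for illustration.
tickets = ResourceDefinition(
    uri="/Trac/TracTickets",
    component="SnapLogic.Components.DBRead",
    properties={"SQLStmt": "SELECT ...", "user": "trac", "password": "secret"},
    output_views={"output": [("id", "number"), ("summary", "string")]},
)
```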
(4)Finding and Hiring Python Developers
Very Difficult
Tried baypiggies mailing list in January 2007
triggered a jobs list discussion, a name change discussion, a metalist
discussion, a list split (?)
Never got a reply :(
We are a startup – risk is a factor
It seems better to hire good programmers
Good programmers can use any language
We can convert them
It takes a few months for Python to really sink in
(5) The first prototype
Lots of ideas, we needed to prove some concepts
Modeling data as a resource
Streaming data through REST pipelines
Storing definitions of resources
Started the prototype in March 2006 with:
Python
CherryPy as the core HTTP server
A desire to leverage as much as possible from existing packages
and libraries
No strong bias toward the final decisions
Successful
Proved the concepts
Learned a lot about Python and the larger community
(6) Why Use Python for the prototype ?
Rapid development
Prove concepts quickly
Get working code sooner, play with it
Throw it away and start again
Identify the real work to be done
Readable code
The ability to understand what we tried was important
The ability to study what others have done (don't reinvent)
Readable code reduces ramp up for new developers
My 'hidden agenda'
I'm biased towards Python
This guy named Tim (PEP 20) said it was good in 1992
Wanted to see if we could build the whole system in Python
(7) Results of the prototype
Technologies needed, and what we can leverage for each:
Basic libraries: Python standard library
HTTP Server: CherryPy, BaseHTTPServer, twisted, many more.
Database connectivity: Python database API (PEP 249)
Compact data encoding: PyASN.1
RSS/Atom formats: FeedParser
Resource definitions: RDFLib
Resource database: RDFLib store, PySQLite
Error and activity logging: logging package
Command line tool: cmd package
Plugins: modules, packages, import
Configuration File: import, ConfigParser
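As one example of leaning on an existing standard, the sketch below reads rows through the Python database API (PEP 249). It uses the standard library's sqlite3 module purely as a stand-in driver; the table, the data, and the read_rows helper are hypothetical and are not SnapLogic code.

```python
import sqlite3


def read_rows(connection, query, parameters=()):
    """Yield rows one at a time from any PEP 249 connection."""
    cursor = connection.cursor()
    try:
        cursor.execute(query, parameters)
        while True:
            row = cursor.fetchone()
            if row is None:      # PEP 249: None signals the end of the result set
                break
            yield row
    finally:
        cursor.close()


# Set up a throwaway in-memory database with made-up ticket data.
connection = sqlite3.connect(":memory:")
setup = connection.cursor()
setup.execute("CREATE TABLE ticket (id INTEGER, summary TEXT, owner TEXT)")
setup.executemany(
    "INSERT INTO ticket VALUES (?, ?, ?)",
    [(1, "Login fails", "ann"), (2, "Typo in docs", "bob")],
)
connection.commit()

# The '?' placeholder is sqlite3's paramstyle; other drivers may use a different one.
rows = list(read_rows(connection, "SELECT id, summary FROM ticket WHERE owner = ?", ("ann",)))
```

Because the helper only touches cursor(), execute(), fetchone() and close(), it should work with other PEP 249 drivers once the placeholder style is adjusted.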
(8) Writing our own HTTP Server
HTTP is a core part of our data server
We looked at CherryPy, twisted, BaseHTTPServer, paste,
mod_python, wsgi and mod_wsgi
All the existing http servers treat the request/response as a single
unit.
Key requirement – streaming of request/responses
We wrote our own in Python
~440 lines of code, 520 lines of comment / docstring (pycount)
Took days, not weeks
It's not a full featured, general purpose HTTP server
We did it to solve the 'streaming' problem
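To make the 'streaming' requirement concrete, here is a minimal sketch of a response that is sent piece by piece instead of as a single unit. It is not the server described above: it uses today's standard library http.server and HTTP/1.1 chunked transfer encoding, and the names encode_chunks, produce_rows, StreamingHandler and serve are hypothetical.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def encode_chunks(blocks):
    """Wrap each block of bytes in HTTP/1.1 chunked framing, lazily."""
    for block in blocks:
        if block:  # skip empty blocks: a zero-length chunk would end the body early
            yield b"%x\r\n%s\r\n" % (len(block), block)
    yield b"0\r\n\r\n"  # terminating chunk


def produce_rows(count):
    """Stand-in for a data source that produces records over time."""
    for number in range(count):
        yield f"{number},row-{number}\n".encode("utf-8")


class StreamingHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # chunked encoding requires HTTP/1.1

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/csv")
        self.send_header("Transfer-Encoding", "chunked")
        self.end_headers()
        for chunk in encode_chunks(produce_rows(5)):
            # Each row leaves the server as soon as it has been produced.
            self.wfile.write(chunk)
            self.wfile.flush()


def serve(port=8000):
    """Run the sketch server on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), StreamingHandler).serve_forever()
```

The point of the design is that the handler never holds the whole response in memory; the data source and the socket advance together.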
(9) Tools of the trade
The basic tools
Unix, Windows, LDAP, and smtp/imap email
mailman mailing lists
subversion for code
Trac, MoinMoin for bugs and specs
Komodo, Eclipse/Pydev, and vim
CMU SEI CMM Level
Not sure, probably 0, 1 or maybe 2
The Joel Test
We score 11.5 out of 12
http://www.joelonsoftware.com/articles/fog0000000043.html
(10) Answer: This is Toxic Waste.
Question:
What are open source bug fixes?
We found bugs in some of the packages
We fixed them
Then what ?
Now, we had a custom patched version of something
We didn't want to redistribute it.
Solution:
Send the bug reports back, with test cases.
Send the patch or fix back.
The probability of a fix is much higher if you provide it.
We still haven't found a core Python bug
(11)SnapAdmin: A Command line utility
We needed a command line utility
Server management commands
Repository creation
User and security management
Import/export from repository
We used the standard Cmd package
Relatively easy to use
Cmd prefers a 'noun verb' syntax
resource import ... vs.
import resource
One team member wrote a command shell first
read the library reference
keep it under your pillow
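The 'noun verb' preference follows from how the standard cmd.Cmd class dispatches: the first word of the line selects a do_<word> method, and the rest of the line is passed in as one string. The sketch below shows that shape; AdminShell and its commands are hypothetical and are not the actual SnapAdmin code.

```python
import cmd


class AdminShell(cmd.Cmd):
    """Hypothetical admin shell; Cmd dispatches on the first word only."""
    prompt = "admin> "

    def do_resource(self, arg):
        """resource import|export NAME"""
        # The noun ('resource') picked this method; the verb is parsed by hand.
        verb, _, name = arg.partition(" ")
        if verb in ("import", "export") and name.strip():
            self.stdout.write(f"{verb} resource {name.strip()}\n")
        else:
            self.stdout.write("usage: resource import|export NAME\n")

    def do_quit(self, arg):
        """quit: leave the shell"""
        return True  # a true return value stops cmdloop()
```

Calling AdminShell().cmdloop() starts the interactive prompt, and the docstrings double as the text shown by the built-in help command.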
(12) Deeper Python hacking: SnapScript
We have a ResDef class that's very complex
It is the basis for our resource definitions
Lots of RDFlib and graph code in the class
The class interface requires knowledge of a lot of the system.
We needed a simpler resource interface
For programmers defining resources
For programmers linking and using resources
We created a package called SnapScript
A set of very generic classes that encapsulate ResDef
They provide a simple interface to resources
They 'hide' the ResDef class
Implemented using __setattr__ and __getattr__ hooks
(13)ResDef code sample
from SnapLogic.Utils.Exceptions import *
from SnapLogic.Utils.ResDef import ListTypes
from SnapLogic.Utils.ResDef import ResDef
r = ResDef.getResDef('SnapLogic.Components.DBRead')
r.URI = '/Trac/TracTickets'
r.setProperty('description', 'Read tickets from a Trac database.')
r.setProperty('title', 'ReadTracTickets')
r.setProperty('DBConnect', '/Trac/TracDatabase')
sql = """SELECT ... """
r.setProperty('SQLStmt', sql)
view = (
    ('id', SnapNumber, ''),
    ('summary', SnapString, ''),
    ('owner', SnapString, ''),
    ('type', SnapString, ''),
)
r.setListProperty('InputView', ListTypes.OutputView, view)
(14)Equivalent code using SnapScript class
from SnapLogic import SnapScript
#Create a database read resource for the Trac tickets
r = SnapScript.Resource.Resource(component='SnapLogic.Components.DBRead')
r.props.URI = 'Trac/Tickets'
r.props.description = 'Read tickets from a Trac database.'
r.props.title = 'ReadTracTickets'
r.props.DBConnect = '/Trac/TracDatabase'
r.props.SQLStmt = """ SELECT .... """
view1 = (
    ('id', SnapNumber, ''),
    ('summary', SnapString, ''),
    ('owner', SnapString, ''),
    ('type', SnapString, ''),
)
r.props.outputview.output = view1
r.check()
r.saveToServer(server1)
SnapLogic.Resource.props is 'hooked' by __setattr__(), and assignments are treated specially.
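The hooking idea can be sketched in a few lines. This is an assumption-laden illustration, not the SnapScript source: LegacyDef stands in for a complex class with an explicit setter API, and Props and Resource are hypothetical names.

```python
class LegacyDef:
    """Stand-in for a complex class with an explicit setter API."""

    def __init__(self):
        self.properties = {}

    def set_property(self, name, value):
        self.properties[name] = value

    def get_property(self, name):
        return self.properties[name]


class Props:
    """Turns attribute access into calls on the wrapped definition."""

    def __init__(self, target):
        # Bypass our own __setattr__ while storing the wrapped object.
        object.__setattr__(self, "_target", target)

    def __setattr__(self, name, value):
        # Every assignment such as props.title = '...' lands here.
        self._target.set_property(name, value)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        try:
            return self._target.get_property(name)
        except KeyError:
            raise AttributeError(name) from None


class Resource:
    """Simple facade: callers only ever touch .props."""

    def __init__(self):
        self._definition = LegacyDef()
        self.props = Props(self._definition)
```

With this in place, r = Resource() followed by r.props.title = 'ReadTracTickets' stores the value through set_property() without the caller ever seeing the wrapped class.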
(15)People are going to see our Code !
The license is Open Source, GPL V2
What will people think of our code ?
It's cleanup time !
Better organization of module/package structure
DataServer, Components, Utils, SnapScript packages
Code docstrings
The original code included good documentation
We were inconsistent, comments versus docstrings
We standardized on epydoc/epytext
Code 'owner' did the documentation
(16)Coding Style
We had guidelines initially
We didn't follow them consistently
Later standardized on pep 8 + 120 columns (versus 80)
Naming Conventions
All classes initially had a 'Qbf' prefix, later 'Snap'
An utterly useless thing to do; Python has namespaces!
Overuse of __ for private
Guidelines for lowercase, CamelCase, lowerCamelCase
WeHaveAGermanWorkingOnTheProject
imports get out of hand quickly
Python library
Third party packages
Our packages/modules
(17)Python is not Java or C++
Good Advice
http://dirtsimple.org/2004/12/python-is-not-java.html
Typical problems
I can add strings !
# bad example: builds a new string on every pass
s = ''
for i in range(10):
    s = s + 'x'
# good examples
s2 = 10 * 'x'
s3 = ''.join(list_of_strings)
Writing getters and setters
Use property()
Everyone now has a Python Cookbook
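For the getters and setters point, a short sketch of what property() buys: callers use plain attribute syntax, and validation can be added later without changing them. The Ticket class and its rule are hypothetical.

```python
class Ticket:
    """Hypothetical class showing a property instead of getSummary()/setSummary()."""

    def __init__(self, summary):
        self.summary = summary  # goes through the setter below

    @property
    def summary(self):
        return self._summary

    @summary.setter
    def summary(self, value):
        cleaned = value.strip()
        if not cleaned:
            raise ValueError("summary must not be empty")
        self._summary = cleaned


ticket = Ticket("  Login fails  ")
ticket.summary = "Login fails on retry"  # plain assignment, still validated
```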
(18)Testing and the 'build' process
We have an automated build and test process
Driven by buildbot
Watches for checkins, or we can start it manually
The test flow:
Check out the code
Build the client code (Adobe Flex)
Build an installer image (Bitrock)
Install on a virtual machine
Run module unit tests
Run integration tests
Generate epydocs for code and API documentation
Code Coverage
We collect code coverage using figleaf while tests run
(19) Testing: Every module needs unit tests
We started with:
# module body
# ...
if __name__ == '__main__':
    # run tests
'Tests' were really usage examples
Tests were not consistently run or updated
We now use unittest
Every module has a unit test in a ./test subdirectory
Starting to use Pymock for more complex tests.
Going Forward
Unit tests are developed much earlier
Coverage is always checked
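A minimal sketch of the unittest style described above. The split_uri function and its tests are hypothetical; in the layout from the slide, the test class would live in its own file under the module's ./test subdirectory and import the function under test.

```python
import unittest


def split_uri(uri):
    """Split a resource URI such as '/Trac/Tickets' into its path segments."""
    return [segment for segment in uri.split("/") if segment]


class SplitUriTest(unittest.TestCase):
    def test_leading_slash_is_ignored(self):
        self.assertEqual(split_uri("/Trac/Tickets"), ["Trac", "Tickets"])

    def test_empty_uri_has_no_segments(self):
        self.assertEqual(split_uri(""), [])
```

Unlike an ad hoc __main__ block, these tests can be discovered and run automatically, for example with python -m unittest, which is what lets a build process run them on every checkin.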
(20)Integration testing
This is system level testing for us
Depends on having the basic modules in order
Requires a lot of setup/infrastructure in place
Databases, reference data, etc.
We verify code coverage for all tests
Essential for Python – there's no 'compiler' to catch syntax errors
Unit testing alone can't cover all our code
Client testing
Not automated yet
It's difficult since the client is not 'forms' oriented.
(21) Packaging and Installs
We use bitrock for the main installer
Installer downloads additional eggs from the Cheese Shop
Lots of dependencies on other packages
We don't create rpms or debs
We might do this in the future.
The biggest install problems
We need a Python installation to get started
Old versions of Python included in Linux distributions
Need permissions to add additional site packages
Dependencies on database libraries
database modules typically include a compile step.
(22) Building the SnapLogic.org site
SnapLogic.org is our public site
Strong commitment to Python and Open Source applications
Red Hat Linux
Apache + mod_python
Python 2.4.4
Trac 0.10.4 with MySQL database
Mailman mailing lists
blog is WordPress (php)
Nobody but me liked newsbruiser
Django for content downloads and registration process
Future Site tasks
Better wiki (MoinMoin ?)
Web Content Management System (Django or Zope/Plone)
(23) Future tasks for the SnapLogic project
Python 3000 support
We started late on Python 2.5 support
Not really a big deal, we just didn't prioritize it.
WSGI
We will likely move to WSGI interface for the server
Performance
Python execution is still mainly single processor
We use threads, but benefits are minimal since we are compute
intensive
Looking into parallelism (WSGI + Parallel Python?)
More connectivity
New SnapLogic components
Python DB API is not as rich as Perl DBI yet