Soma:
What is Soma?
It’s Big Data Candy for the Cloud.
The Soma platform helps Data Scientists collaborate to discover and share new facts from large datasets hosted on shared infrastructure.
All this while lowering development and operations costs.
Meet our Customers
Expert
See themselves as “experts” or authorities on a subject. Want the big picture and like easy-to-use, specialised applications with great visualisation.
Creative
People who see themselves as Data “artists”. Need to explain the meaning of the data. Good generalists, can code, with a flair for the visual or data narrative.
Engineer
See themselves as “engineers”. Focused on the technical problem of managing data — how to get it, store it, and learn from it. Normally strong software developers with some O/R statistics.
Researcher
See themselves as “scientists”. People with a deep academic background in maths, machine learning & modeling complex processes. Reluctant coders.
Customers we support now
Creative
Need to explain the meaning of the data.
Good generalists, can code, with a flair for the visual or data narrative.
Engineer
Focused on the technical problem of managing data. Normally strong software developers.
Researcher
People with deep academic background in science, maths, machine learning
What we deliver to customers
Creative
Now:
● Gitlab integration
● from gitlab
● Web facing applications
Researcher
Now:
● Discovery early adopters
Early September:
● Discovery platform rollout
Engineer
Now:
● Big Data Cluster
● Container Management
November:
● Fully operational big data station
Right Now
Mesos based Cloud O/S
● Cluster of 88 CPUs, 295 GB of memory
● Distributed Application Scheduling
● Resource Scheduling
● Container Management
● DNS service discovery
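To make "resource scheduling" concrete, here is a minimal sketch, not Mesos itself. It treats the cluster as one aggregate pool with the 88 CPUs and 295 GB quoted above and launches tasks only while their requested CPU and memory still fit. The task names and sizes are made up for illustration, and a real Mesos cluster offers resources per machine rather than from one pool.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    cpus: int
    mem_gb: int


@dataclass
class Pool:
    """Aggregate cluster resources (simplification: one pool, no per-node limits)."""
    cpus: int
    mem_gb: int

    def try_launch(self, task: Task) -> bool:
        """Reserve resources for the task if they fit; report whether it launched."""
        if task.cpus <= self.cpus and task.mem_gb <= self.mem_gb:
            self.cpus -= task.cpus
            self.mem_gb -= task.mem_gb
            return True
        return False


def schedule(pool: Pool, tasks: list) -> tuple:
    """Greedy, in-order scheduling: launch what fits, queue the rest."""
    launched, waiting = [], []
    for task in tasks:
        (launched if pool.try_launch(task) else waiting).append(task.name)
    return launched, waiting


# Cluster totals come from the slide; the tasks below are hypothetical.
pool = Pool(cpus=88, mem_gb=295)
tasks = [
    Task("batch-job", cpus=64, mem_gb=200),
    Task("web-app", cpus=4, mem_gb=8),
    Task("big-model", cpus=32, mem_gb=100),
    Task("stream-job", cpus=16, mem_gb=64),
]
launched, waiting = schedule(pool, tasks)
print(launched)  # ['batch-job', 'web-app', 'stream-job']
print(waiting)   # ['big-model']
```

Here "big-model" has to wait because only 20 CPUs are free when it is considered, while the smaller "stream-job" behind it still fits.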
Deployment
Gitlab, Mesos Cluster, Zookeeper Cluster, HDFS Cluster, Integrated DNS, CI servers, Docker Registry
Gitlab
● All applications MUST be in gitlab
Mesos Cluster and Container Manager
● Let’s have a look at what is running right now:
“can mix both batch and real-time processing”
“process at batch and real-time Velocity”
Source Control Management
Continuous Deployment
Service Monitoring
Always available key datasets
● DBPedia
● SemanticWeb Dogfood
1. Have a gitlab account.
2. Ask Research Ops to add the Soma Role to your project.
3. If you are accepted, you will be guided through “dockerizing” your gitlab project.
4. Once accepted, every push to your master branch will be deployed and accessible online through Soma.
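The rule implied by these steps can be sketched as a small check. This is only an illustration: the `Project` fields and the `should_deploy` function are hypothetical names, and the actual hook between gitlab and the cluster is not described here.

```python
from dataclasses import dataclass


@dataclass
class Project:
    name: str
    has_soma_role: bool = False  # step 2: role added by Research Ops
    dockerized: bool = False     # step 3: project has been "dockerized"


def should_deploy(project: Project, pushed_branch: str) -> bool:
    """Deploy only accepted, dockerized projects, and only on pushes to master."""
    return (
        project.has_soma_role
        and project.dockerized
        and pushed_branch == "master"
    )


accepted = Project("demo-app", has_soma_role=True, dockerized=True)
pending = Project("new-app")  # has a gitlab project but is not yet accepted

print(should_deploy(accepted, "master"))     # True
print(should_deploy(accepted, "feature-x"))  # False
print(should_deploy(pending, "master"))      # False
```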
Integrated Discovery platform
SOMA Discover - a hosted discovery tool, based on the Smarter Data project, that allows exploration of data and sharing of results.
Other internal tools such as Sig.ma, Social Lens, and other projects to follow.
Goals for Research Ops
Nurture a Data Engineering community at Insight with supportive experts, shared tools & best practices
Provide a Shared analytics platform for Data Scientists at Insight (Soma)
Encourage new research and engagements with the wider
Nurture
● Provide a structured approach to managing and releasing all Engineering IP (Code and Data) at Insight
○ Source control (Git)
○ Release management
○ Assist in IP management
● Provide Quality Circles for Engineering practices
○ 2 Groups - Data Visualisation & Big Data, Workshops to
Provide
● Build big data infrastructure for Insight
○ Soma platform
● Support ongoing Hadoop development
○ Hadoop clusters, Dataspace support
● Support Ad Hoc projects requiring scale
○ Cancer atlas
● Provide “Big Data” Expertise to the Linked Data group
Problems being met
● High cost in research when data scales to “Big Data” [P1]
○ Ad Hoc Maintenance of big data sets is expensive [P2]
○ Development complexity of valuable Big Data jobs is prohibitive [P3]
● The high cost in Operating Big Data infrastructure [P4]
○ Scarcity of hardware and lack of funds for new Hardware [P5]
○ Inability to maintain a core operations team [P7]
Soma serving our customers
Soma Create - Serves data fresh from the source. Has queryable large datasets that are both highly available & up-to-date. Has a service to mash these up.
Soma Engineer - Provides a Lambda architecture consuming, cleaning, processing and loading the data to the data layer.
Soma Discover - Useful blocks of processing that can be connected together using a nice GUI. Works with many datastores.
Soma Expert - Vertical applications, each solving a real-world problem. These apps are built by Insight’s Data Researchers and Data Creatives.
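For readers new to the Lambda architecture mentioned under Soma Engineer, here is a minimal sketch of the idea, not Soma's implementation. A batch layer recomputes a view from the full cleaned dataset, a speed layer keeps an incremental view of events that arrived since the last batch run, and a query merges the two. The function names, the event shape and the sample keys are assumptions made for illustration.

```python
from collections import Counter


def clean(events):
    """Cleaning step: drop malformed events that lack a 'key' field."""
    return [e for e in events if "key" in e]


def batch_view(master_dataset):
    """Batch layer: recompute counts from the full dataset."""
    return Counter(event["key"] for event in master_dataset)


class SpeedLayer:
    """Speed layer: incrementally count events that arrived after the last batch run."""

    def __init__(self):
        self.view = Counter()

    def ingest(self, event):
        self.view[event["key"]] += 1


def query(key, batch, speed):
    """Serving layer: merge the batch view with the real-time view."""
    return batch[key] + speed.view[key]


# Hypothetical events; one record is malformed and is removed by clean().
raw = [{"key": "dbpedia"}, {"key": "dbpedia"}, {"bad": 1}, {"key": "dogfood"}]
batch = batch_view(clean(raw))

speed = SpeedLayer()
speed.ingest({"key": "dbpedia"})     # arrives after the batch run
speed.ingest({"key": "new-source"})  # not yet seen by the batch layer

print(query("dbpedia", batch, speed))     # 3
print(query("new-source", batch, speed))  # 1
print(query("missing", batch, speed))     # 0
```

The point of the split is that the batch view can be thrown away and rebuilt from the source data at any time, while the speed layer only has to cover the gap since the last rebuild.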
Goals
Soma to be a complete ecosystem to help researchers deliver “Big Data” distributed applications
Showcase Insight expertise
Standardize best practices for linked data at big data scales
Deliver targeted applications & tools
Distributed O/S (Better than cloud)
● We use Mesos based infrastructure to provide
○ Scheduling Process Execution of Jobs/Applications across the cluster
○ Resource scheduling of the needed CPU/Memory/Storage for
Where we are now
What we have
Soma Engineer - Standard Mesos platform - Provides a Lambda architecture consuming, cleaning, processing and loading the data to the data layer.
Soma Discover - Smarter Data - an interactive, expressive query tool that creates data blocks & visualisations
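The "data blocks" idea behind Soma Discover can be pictured as small processing steps chained together. The sketch below is illustrative only and is not the Smarter Data API: `connect`, `filter_block`, `select_block` and `count_by_block`, and the sample records, are hypothetical.

```python
from functools import reduce
from typing import Callable, Iterable

Block = Callable[[Iterable[dict]], Iterable[dict]]


def connect(*blocks: Block) -> Block:
    """Chain blocks so that the output of each one feeds the next."""
    def pipeline(records):
        return reduce(lambda data, block: block(data), blocks, records)
    return pipeline


def filter_block(field: str, value) -> Block:
    """Keep only records whose field equals value."""
    return lambda records: (r for r in records if r.get(field) == value)


def select_block(*fields: str) -> Block:
    """Project each record down to the named fields."""
    return lambda records: ({f: r[f] for f in fields} for r in records)


def count_by_block(field: str) -> Block:
    """Aggregate records into one row per distinct value of field."""
    def block(records):
        counts = {}
        for r in records:
            counts[r[field]] = counts.get(r[field], 0) + 1
        return [{"value": k, "count": v} for k, v in counts.items()]
    return block


records = [
    {"type": "person", "country": "IE", "name": "A"},
    {"type": "person", "country": "IE", "name": "B"},
    {"type": "place", "country": "IE", "name": "C"},
    {"type": "person", "country": "FR", "name": "D"},
]

people_by_country = connect(
    filter_block("type", "person"),
    select_block("country"),
    count_by_block("country"),
)
result = list(people_by_country(records))
print(result)  # [{'value': 'IE', 'count': 2}, {'value': 'FR', 'count': 1}]
```

Because every block has the same shape (records in, records out), a GUI only needs to let users pick blocks and order them.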
What we need help on
Soma Expert - Pivoty - a medical index built from standard HCLS datasets and uses a Pivot Browser
Soma Create - The Insight Standard Dataset - a shared