Open access to data and analysis tools from the CMS experiment at the LHC
Thomas McCauley
(for the CMS Collaboration and QuarkNet) University of Notre Dame, USA
Outline
• CMS at the LHC
• 1st public release of CMS data
• CMS masterclasses
• Large data release
• Open data portal
CMS at the LHC
• CMS (Compact Muon Solenoid) is one of the two general-purpose experiments at the LHC
• Over 350 papers published describing searches for SUSY and exotica, measurements of QCD, electroweak, top, b, forward, and heavy-ion physics, as well as the discovery of the Higgs boson and its properties
• Collected ~28 1/fb of proton-proton collision data at center-of-mass energies up to 8 TeV
• Nearly 3000 physicists and ~800 engineers from over 40 countries
CMS public data (i)
The CMS experiment has allowed the release of the following data to the public for use in education and outreach:
• 2000 events each of J/ψ → μμ and J/ψ → ee
• 2000 events each of Υ → μμ and Υ → ee
• 500 events each of Z → μμ and Z → ee
• 1000 events each of W → μν and W → eν
• 100,000 events each of di-muon, di-electron, and di-jet events in the energy range 2-110 GeV
• 19 Higgs candidate events in the mass range 120-130 GeV: 10 γγ, 1 2e2μ, 1 4e, 1 4μ, 2 bb, 2 ττ, 2 WW
• ~50 1/pb of single muons for top quark analysis
Bold: indicates datasets already delivered and/or in use
Masterclasses
• Masterclasses: students travel to nearby universities and research laboratories to listen to lectures, analyze real LHC data, and interact with other groups via videoconference.
• International masterclasses are organized under the auspices of IPPOG, the International Particle Physics Outreach Group (http://ippog.web.cern.ch), with central organization at TU Dresden and Notre Dame. In 2014 (from Feb 12 - Apr 12) there were 69 CMS masterclasses in 26 countries in 12 languages.
• CMS masterclass developed in collaboration with QuarkNet (http://quarknet.fnal.gov)
CMS masterclasses in 2014
https://quarknet.i2u2.org/content/running-cms-wzh-path-masterclass
2014 CMS masterclass exercise
• Students use up to 30 separate datasets, each with 100 events containing samples from the W, Z, and di-lepton events (one 4-lepton and two di-photon Higgs candidate events included)
• Each group views up to 100 events in an event display and attempts to determine whether each is a W or a Z (di-lepton) event.
• If a W, did it decay into an electron and a neutrino or into a muon and a neutrino? What is the charge of the lepton?
• If a Z, is it di-electron or di-muon? What is the invariant mass?
• What is the W+:W- ratio? What does it imply about the proton and its structure?
• What does the invariant mass spectrum look like? (There will be several unexpected peaks from the di-lepton background)
• 2015: content the same, data analysis tools improved (covered later); what follows shows
After an introduction by the moderator covering HEP and the experiment, start by opening the event display:
Select a set of 100 W, Z, J/ψ, and Υ events (each with a Higgs candidate
[Event display annotations: electron? significant MET?]
Mark the answer on the spreadsheet (hosted on Google docs):
[Figure labels: muon, muon]
2014 results
• Students correctly identified an electron 90% of the time and a muon 93% of the time
• Students correctly identified an event as a W 91% of the time
• Students correctly identified an event as a Z candidate (i.e. an event with 2 leptons) 92% of the time
• When the students correctly identified an event as W → μν (W → eν), they correctly identified the charge 84% (81%) of the time; 11% (16%) of these events were assigned no charge
[Plot legend: CMS value]
Videoconference
http://cds.cern.ch/record/1693152
Students communicate and discuss results with other masterclass groups using Vidyo (http://cern.ch/vidyo), with support from CERN and FNAL IT.
For 2015
• Exercise to remain the same
• New IPPOG masterclasses start next month
• Masterclasses for CERN visitors start next week
• New browser-based tool developed by RWTH Aachen will replace Google spreadsheets and include creation of plots on-the-fly
• New event display
Beyond 2015: opportunity to use open data from CMS to develop new exercises
https://www.i2u2.org/elab/cms/cima/index.php
Web-based data entry and histogram tool developed by RWTH Aachen
CMS Open Data policy
CMS has drafted and adopted a data preservation, re-use, and open-access policy which includes:
• Commitment to publication in open-access journals
• Release of data to the public
• Preservation and release of software and documentation needed for reconstruction and analysis
• In the future: a commitment to release data after a suitable embargo period
https://cms-docdb.cern.ch/cgi-bin/PublicDocDB/ShowDocument?docid=6032
New release (i)
The new release of CMS data is much larger and more extensive than previous releases:
• Half of the reconstructed data from 2010 proton-proton collisions at 7 TeV (tens of 1/pb)
• ~30 TB in size
• In CMS Analysis Object Data (AOD) format (ROOT files)
CMS AOD
• Contains the information needed for an analysis, such as physics objects, tracks, calo hits, vertices, trigger info, etc.
• ROOT-based format that requires CMSSW to read and analyze
• Q: How can/will the public handle such a dataset?
• A (partially): Initially focus on an already-proven,
Open Data Portal
• Data, tools, and resources for analysis have been made available via an “open data portal”
• Portal is divided into two main areas: “Education” and “Research”
• Datasets are distinguished as either “primary” or “derived”
• Philosophy: include and build upon the previous and current success of public data in education and outreach, but also allow for more in-depth, complex analysis
• Built with Invenio digital library software: http://invenio-software.org
The portal is a collaboration between CERN, CMS, ATLAS, ALICE, and LHCb: what follows is a description of the CMS content
Derived dataset record
• A “derived” dataset is a dataset that has been created from a primary dataset and contains reduced information (like four-vectors)
• Software with which to create the derived datasets is provided
• Analysis of derived datasets does not require special CMS software (but production of
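Because a derived dataset reduces each object to a four-vector, its analysis needs nothing beyond standard tools. The following is a minimal sketch in Python, assuming a hypothetical record layout in which each muon is stored as (pt, eta, phi, mass) in GeV. The layout, the function names, and the sample values are illustrative assumptions and are not taken from the actual CMS derived-dataset format.

```python
import math

MUON_MASS_GEV = 0.105658  # muon rest mass in GeV


def to_four_vector(pt, eta, phi, mass):
    """Convert (pt, eta, phi, mass) to Cartesian (E, px, py, pz), all in GeV."""
    px = pt * math.cos(phi)
    py = pt * math.sin(phi)
    pz = pt * math.sinh(eta)
    energy = math.sqrt(px * px + py * py + pz * pz + mass * mass)
    return energy, px, py, pz


def invariant_mass(p1, p2):
    """Invariant mass of a pair: M^2 = (E1 + E2)^2 - |p1 + p2|^2."""
    energy = p1[0] + p2[0]
    px = p1[1] + p2[1]
    py = p1[2] + p2[2]
    pz = p1[3] + p2[3]
    m_squared = energy * energy - (px * px + py * py + pz * pz)
    return math.sqrt(max(m_squared, 0.0))  # guard against rounding below zero


# Made-up muon pair, back-to-back in the transverse plane (illustration only)
mu_plus = to_four_vector(pt=45.0, eta=0.0, phi=0.0, mass=MUON_MASS_GEV)
mu_minus = to_four_vector(pt=45.0, eta=0.0, phi=math.pi, mass=MUON_MASS_GEV)
dimuon_mass = invariant_mass(mu_plus, mu_minus)
print(f"di-muon invariant mass: {dimuon_mass:.2f} GeV")
```

A back-to-back pair with 45 GeV of transverse momentum each gives a mass of about 90 GeV, which is the kind of value a student would look for when checking a Z candidate.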
CMS-specific CERN VM
• Analysis of primary datasets requires the CMSSW environment; we provide it in a virtual machine image
• VM contains SLC5, the CMS software environment, and access to primary datasets via XRootD
• Example code also
Invenio and CERN support
• Open data portal built with Invenio (a familiar example of an application using Invenio is the CERN Document Server, http://cdsweb.cern.ch)
• Invenio provides document organization, search capability, and handling of metadata
• The portal relies on CERN support and services for data storage, access to and distribution of data, and security and bandwidth restrictions
Data re-use
• Data released under the Creative Commons CC0 waiver, essentially releasing it into the public domain: http://creativecommons.org/publicdomain/zero/1.0
• Data are identified with digital object identifiers (DOI) and it is expected that third parties will access the data using these
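Since datasets are identified by DOIs, a third party can reach a record through the public DOI resolver at https://doi.org/. The following is a minimal sketch, assuming a placeholder DOI string that is hypothetical and does not refer to any real CMS record. The helper name `doi_to_url` is likewise illustrative.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi):
    """Build a resolver URL from a DOI, accepting an optional 'doi:' prefix."""
    cleaned = doi.strip()
    if cleaned.lower().startswith("doi:"):
        cleaned = cleaned[4:].strip()
    if not cleaned.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode characters that are unsafe in a URL path, keeping '/'
    return DOI_RESOLVER + quote(cleaned, safe="/")


# Hypothetical placeholder DOI, for illustration only
example_url = doi_to_url("doi:10.99999/EXAMPLE.DATASET.0001")
print(example_url)
```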
Outlook
• CMS public data has reached thousands of students all over the world via CMS masterclasses
• Re: open data portal: “We can conclude that about ~82k distinct users visited our site since the launch, out of which ~600 people downloaded EOS files over HTTP, ~5k read About pages, ~21k viewed collections, ~16k used event display, ~3k used histogramming, ~21k viewed records, and ~10k used search.” - T. Simko (Invenio team), 19 Dec 2014
• Next: Improve tools and with new, large data release