Things Made Easy: One Click CMS Integration with Solr & Drupal
Peter M. Wolanin, Ph.D.
Momentum Specialist (principal engineer), Acquia, Inc.
Drupal contributor drupal.org/user/49851
Co-maintainer of the Drupal Apache Solr Search Integration module
May 10, 2012
• What is Drupal?
• What Apache Solr features are integrated with Drupal?
• Why is Drupal plus Apache Solr better than starting from scratch?
• What elements of the search can you configure in the UI without code?
• You are starting a new website project?
• You are wondering how hard it is to actually integrate Apache Solr with a website?
• You already use Drupal but not with Apache Solr?
• You like things that are easy yet powerful?
Drupal: Web Application Framework + CMS == Social Publishing Platform
[Diagram: Content Mgmt Systems and Social Software Tools, with the labels blogs / wikis, forums / comments, social ranking, social tagging, users, social networks, workflow, taxonomy, semantic web, RSS, content analytics]
Drupal “… is as much a Social Software platform as it is a web content management system.”
Drupal + Solr Provides Immediate Access to Rich Search Features
• Dynamic content requires dynamic navigation, which is provided by an effective search
• Search facets mean no dead ends
• Solr provides better keyword relevancy in results
• Much faster searches for sites with lots of content
• By avoiding database queries, Drupal with Solr scales better
DEMO: A Drupal 7 partial copy of the conference site with Apache Solr integration
Drupal Has User Accounts, Roles & Permissions
• Define custom roles
• Set granular access controls by role
• Configure user behavior:
– Registration
– Email
– Profiles
– Pictures
Drupal Modules Add Functionality
“There’s a module for that”
More than 4100 Drupal 7 community modules
Often controlled by role-based permissions
Drupal core and modules are GPL v2+, and have a huge, active community
Drupal is Written in PHP, Which Makes for Easy Customization
• The Drupal architecture encourages and provides many avenues for customization by writing modules rather than patching Drupal core.
• Drupal has a huge community of users.
• Approximately 10,000 sites report to Drupal.org that they use the Apache Solr Search Integration module.
Drupal Entities are Content + Data
[Diagram: grid of nodes, Node 1 through Node 9]
• Nodes are the basic entity used for text content
• The entity system is extensible and can represent any data
• Examples of data stored within Drupal entities:
– Text
– Geographic location
Entity Types are Enriched With User-configurable Data Fields
• Define new data fields on a node using the Field API module.
– Text, images, integers, date, reference, etc.
• Flexible and configurable in the UI
• No programming required (many existing modules)
A Strong Framework for Content Classification
• Core taxonomy system
• Modules provide taxonomy-based appearance, access control
Standard input options include free tagging, flat-controlled, and
Drupal + Solr Search for Business, Government and NGOs
http://www.mattel.com/search/apachesolr_search/
http://www.hrw.org/en/search/apachesolr_search/
http://www.restorethegulf.gov/search/apachesolr_search/
http://www.nypl.org/search/apachesolr_search/
http://www.mylifetime.com/community/search/apachesolr_search/
http://opensource.com/search/apachesolr_search/
https://www.ethicshare.org/publications/
http://www.poly.edu/search/apachesolr_search/
https://www.eff.org/search/site/
http://www.whitehouse.gov/search/site/
http://www.emporia.edu/search/site/
Drupal Has Already Solved Many Solr Integration Challenges
The most important - content indexing.
Facets, sorting, and highlighting of results.
Immediate integration with the More Like This and spell-check handlers.
Included sub-module integrates content access permissions by indexing to and filtering Solr
Easy Content Recommendation!
Uses the MLT handler
The Module Has a Pipeline for Indexing Drupal Content to Solr
Drupal entities are processed into one (or more) document objects. Each document object is converted to XML and sent to Solr.
Node object (title, nid, type) → Drupal functions → Document object (entity_type, label, entity_id, bundle) → XML string:
<doc>
  <field name="entity_type">node</field>
  <field name="label">Hello Drupal</field>
  <field name="entity_id">101</field>
  <field name="bundle">session</field>
</doc>
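The last step of this pipeline, turning a document object into an XML string, can be sketched in plain PHP. This is a minimal illustration rather than the module's actual code: the function name sketch_document_to_xml() and the array-based "document" are assumptions made for the example, and only the four field names come from the slide.

```php
<?php
// Hypothetical helper, NOT part of the Apache Solr Search Integration module.
// Turns an associative array of field name => value(s) into a <doc> element.
function sketch_document_to_xml(array $fields): string {
  $xml = '<doc>';
  foreach ($fields as $name => $values) {
    // A field may carry one value or several (multi-valued fields).
    foreach ((array) $values as $value) {
      $xml .= sprintf(
        '<field name="%s">%s</field>',
        htmlspecialchars($name, ENT_QUOTES, 'UTF-8'),
        htmlspecialchars((string) $value, ENT_NOQUOTES, 'UTF-8')
      );
    }
  }
  return $xml . '</doc>';
}

// Hypothetical node data mirroring the example above.
$node = ['nid' => 101, 'title' => 'Hello Drupal', 'type' => 'session'];

// Map node properties onto the document field names shown above.
$document = [
  'entity_type' => 'node',
  'label' => $node['title'],
  'entity_id' => $node['nid'],
  'bundle' => $node['type'],
];

$xml = sketch_document_to_xml($document);
echo $xml, PHP_EOL;
```

Escaping both the field name and the value keeps the XML well-formed when a title contains characters such as & or <.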
Entity Meta-data Gives Automatic Facets!
• Content types
• Taxonomy terms per vocabulary
• Content authors
• Posted and modified dates
• Text and numbers selected via select list/radios/check boxes
Drupal Modules Implement Hooks to Control Indexing and Display
HOOK_apachesolr_index_document_build($document, $entity, $entity_type, $env_id)
• By creating a Drupal module (in PHP), you can implement module and theme “hooks” to extend or alter Drupal behavior.
• Change or replace the data normally indexed.
• Modify the search results and their appearance.
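As a rough sketch of what implementing this hook might look like, the example below adds one extra field to session nodes. Several things here are assumptions for illustration: the module name "mymodule", the "room" property on the entity, the field name "ss_room" (assuming an "ss_" prefix maps to a single-valued string dynamic field in the schema), and the SketchDocument stand-in class, which exists only so the code runs outside Drupal.

```php
<?php
// Stand-in for the module's document class so this sketch runs on its own.
// It is an assumption that the real document object offers addField().
class SketchDocument {
  public $fields = [];

  public function addField($key, $value) {
    $this->fields[$key][] = $value;
  }
}

/**
 * Implements hook_apachesolr_index_document_build() for a hypothetical
 * module named "mymodule".
 */
function mymodule_apachesolr_index_document_build($document, $entity, $entity_type, $env_id) {
  // Only touch nodes of the "session" bundle.
  if ($entity_type === 'node' && $entity->type === 'session') {
    // "ss_room" and $entity->room are hypothetical names for this example.
    $document->addField('ss_room', $entity->room);
  }
}

// Usage outside Drupal, with hypothetical entity data and environment ID.
$document = new SketchDocument();
$entity = (object) ['nid' => 101, 'type' => 'session', 'room' => 'Ballroom A'];
mymodule_apachesolr_index_document_build($document, $entity, 'node', 'solr');
print_r($document->fields);
```

Inside Drupal, the function would live in the module's .module file and be called by the indexing pipeline for each entity, so no explicit call is needed.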
Updates to an Entity or Related Meta-data Cause Reindexing
• Drupal entities are indexed during Drupal cron (typically invoked via *nix cron).
• By using a specialized tracking table, content can automatically be queued for reindex when changed, and subsets of content can potentially be sent to different Solr indexes.
• Entities include many ID-based reference fields (e.g. the User ID of the author). Changes to the referenced data are also watched.
Indexing Tracking Tables Maintain Order
+-------------+-----------+-------------+--------+------------+
| entity_type | entity_id | bundle      | status | changed    |
+-------------+-----------+-------------+--------+------------+
| node        |        36 | session     |      1 | 1336520756 |
| node        |        37 | session     |      1 | 1336510489 |
| node        |        38 | session     |      1 | 1336510456 |
| node        |        39 | session     |      1 | 1336510456 |
| node        |        40 | speaker_bio |      1 | 1336510456 |
+-------------+-----------+-------------+--------+------------+
When a node is updated, the “changed” timestamp is updated.
The indexing pipeline tracks the largest timestamp and entity_id which has been indexed.
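One way to read the "largest timestamp and entity_id" rule is as a cursor over rows ordered by (changed, entity_id). The sketch below works on a plain PHP array holding the rows from the table rather than a database, and sketch_next_batch() is a hypothetical name, so treat it as an illustration of the idea and not the module's query.

```php
<?php
// Hypothetical helper: returns the next rows to index after the last
// indexed position, ordered by "changed" and then by "entity_id".
function sketch_next_batch(array $rows, int $last_changed, int $last_id, int $limit): array {
  $pending = array_filter($rows, function ($row) use ($last_changed, $last_id) {
    return $row['changed'] > $last_changed
      || ($row['changed'] === $last_changed && $row['entity_id'] > $last_id);
  });
  usort($pending, function ($a, $b) {
    return [$a['changed'], $a['entity_id']] <=> [$b['changed'], $b['entity_id']];
  });
  return array_slice($pending, 0, $limit);
}

// Rows copied from the tracking table above.
$rows = [
  ['entity_type' => 'node', 'entity_id' => 36, 'bundle' => 'session', 'status' => 1, 'changed' => 1336520756],
  ['entity_type' => 'node', 'entity_id' => 37, 'bundle' => 'session', 'status' => 1, 'changed' => 1336510489],
  ['entity_type' => 'node', 'entity_id' => 38, 'bundle' => 'session', 'status' => 1, 'changed' => 1336510456],
  ['entity_type' => 'node', 'entity_id' => 39, 'bundle' => 'session', 'status' => 1, 'changed' => 1336510456],
  ['entity_type' => 'node', 'entity_id' => 40, 'bundle' => 'speaker_bio', 'status' => 1, 'changed' => 1336510456],
];

// Assume node 38 (changed = 1336510456) was the last entity indexed.
$batch = sketch_next_batch($rows, 1336510456, 38, 3);
$ids = array_column($batch, 'entity_id');
print_r($ids);
```

Using the entity_id as a tie-breaker matters because several rows can share one timestamp, as nodes 38 to 40 do here; without it, rows with the same "changed" value could be skipped or indexed twice.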
Example: Taxonomy Term Classifying a Node is Changed
Grapefruit → Citrus fruit
All nodes classified with this term are queued to be re-indexed by setting the “changed” column to the current time.
Thus you will correctly match ‘Citrus’ instead of ‘Grapefruit’ for those documents.
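The queueing step described here amounts to touching the “changed” column for every affected row. A minimal sketch on an in-memory array follows; the function name, the choice of which nodes carry the renamed term, and the fixed timestamp are all assumptions made for the example.

```php
<?php
// Hypothetical helper: marks the given entities as changed "now" so the
// indexing pipeline picks them up again on the next cron run.
function sketch_queue_for_reindex(array $tracking, array $entity_ids, int $now): array {
  foreach ($tracking as &$row) {
    if (in_array($row['entity_id'], $entity_ids, true)) {
      $row['changed'] = $now;
    }
  }
  unset($row); // Break the reference left over from the loop.
  return $tracking;
}

// Two tracked nodes; assume only node 36 is classified with the renamed term.
$tracking = [
  ['entity_id' => 36, 'changed' => 1336520756],
  ['entity_id' => 37, 'changed' => 1336510489],
];

// A fixed timestamp keeps the example reproducible; real code would use time().
$tracking = sketch_queue_for_reindex($tracking, [36], 1336600000);
print_r($tracking);
```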
When Unpublished, Content is Purged
Drupal core includes a simple editorial workflow where content may be toggled between published (visible) and unpublished (incomplete, removed, spam, etc).
The module immediately removes content from the index when unpublished, and also tracks it for future removal in case the Solr server is
Search Using Dismax Query Parsing & Boosting Features
• Dynamic fields in schema.xml used to index standard and custom entity data fields
• Dismax (or EDismax) handler used for keyword searching across multiple fields and per-field boosts
• Query-time boosting options available in the UI
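To make the per-field boosts concrete, the sketch below assembles request parameters for an EDismax keyword search. defType, q, qf and wt are standard Solr parameters; the helper name, the "content" field, and the boost values are assumptions for illustration and are not taken from the module.

```php
<?php
// Hypothetical helper: builds Solr request parameters for an EDismax
// keyword search with per-field boosts expressed as "field^boost".
function sketch_edismax_params(string $keywords, array $boosts): array {
  $qf = [];
  foreach ($boosts as $field => $boost) {
    $qf[] = $field . '^' . $boost;
  }
  return [
    'defType' => 'edismax',
    'q' => $keywords,
    'qf' => implode(' ', $qf),
    'wt' => 'json',
  ];
}

// Weight matches in the label five times as heavily as body text.
// "content" is an assumed field name.
$params = sketch_edismax_params('hello drupal', ['label' => 5, 'content' => 1]);

// URL-encode the parameters for a GET request to a Solr select handler.
echo http_build_query($params), PHP_EOL;
```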
A Query Object Is Used to Prepare and Run Searches
$query->setParam('hl.fl', $field);
$keys = $query->getParam('q');
$response = $query->search();
More Modules Available to Add More Features
• ApacheSolr Attachments
• Apache Solr Multisite Search
• Apache Solr Organic Groups Integration
• Apachesolr User indexing
• Apachesolr Commerce
To Wrap Up!
Drupal has extensive Apache Solr integration already, and is highly customizable.
The Drupal platform is widely adopted, and the Drupal community drives rapid innovation.
Acquia provides Enterprise Drupal support and a network of partners.
Acquia includes a secure, hosted Solr index with every support subscription.
Other PHP Integration Tools
• http://www.solarium-project.org/
• http://php.net/solr
• http://pecl.php.net/package/solr
• http://code.google.com/p/solr-php-client/
Caveat: don’t use serialized PHP response format in a custom integration - use JSON writer.
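For a custom integration, reading a JSON response might look like the sketch below. The response body is a hand-written example shaped like Solr's JSON writer output (wt=json), not captured from a real server. The slide does not give a reason for the caveat; one general consideration is that json_decode() only ever produces plain data, whereas calling unserialize() on data received over the network can instantiate objects.

```php
<?php
// Hypothetical response body, shaped like Solr's JSON writer output.
$body = '{"responseHeader":{"status":0},'
  . '"response":{"numFound":1,"start":0,'
  . '"docs":[{"entity_id":101,"label":"Hello Drupal"}]}}';

// Decode into associative arrays; null signals malformed JSON here.
$data = json_decode($body, true);
if ($data === null) {
  throw new RuntimeException('Invalid JSON: ' . json_last_error_msg());
}

$labels = [];
foreach ($data['response']['docs'] as $doc) {
  $labels[] = $doc['label'];
}
echo $data['response']['numFound'], ' result(s): ', implode(', ', $labels), PHP_EOL;
```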
• Do you love Drupal, Solr, the LAMP stack, DevOps or anything related, and working at a fast-growing and successful startup?
• Boston and Portland area U.S. offices.
• Some remote opportunities as well.
• Come talk to me! pwolanin in IRC #drupal or #solr