Aggregation - Cloudmesh MongoDB Usage Quickstart

3.7.2.3.13 Aggregation

Aggregation operations are used to process given data and produce summarized results. Aggregation operations collect data from a number of documents and provide collective results by grouping data. PyMongo in its documentation offers a separate framework that supports data aggregation. This aggregation framework can be used to “provide projection capabilities to reshape the returned data” [23].

In the aggregation pipeline, documents pass through multiple stages that transform them into the aggregated result. The basic stages include filters, which select documents much like queries, and document transformations, which change the form of the output documents. Other stages group or sort documents by specific fields. Because they use native MongoDB operations, the pipeline operators aggregate results efficiently.
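To make the flow of documents through stages concrete, the following is a minimal pure-Python sketch that imitates a three-stage pipeline (filter, group, sort) on an in-memory list. It does not use PyMongo or a database. The sample documents and the helper names `match_stage`, `group_sum_stage`, and `sort_stage` are hypothetical and exist only for illustration. The `pipeline` list at the end shows how the same idea would look as stage documents; it is data only and is not executed here.

```python
from collections import defaultdict

# Hypothetical sample documents; the field names are illustrative only.
orders = [
    {"customer": "alice", "status": "A", "amount": 50},
    {"customer": "bob", "status": "A", "amount": 20},
    {"customer": "alice", "status": "B", "amount": 70},
    {"customer": "bob", "status": "A", "amount": 35},
]


def match_stage(docs, field, value):
    """Filter stage: pass on only documents whose field equals value."""
    return [d for d in docs if d.get(field) == value]


def group_sum_stage(docs, key, value_field):
    """Group stage: one output document per distinct key, with a summed total."""
    totals = defaultdict(int)
    for d in docs:
        totals[d[key]] += d[value_field]
    return [{"_id": k, "total": v} for k, v in totals.items()]


def sort_stage(docs, field, descending=False):
    """Sort stage: reorder the document stream by one field."""
    return sorted(docs, key=lambda d: d[field], reverse=descending)


# Each stage consumes the output of the previous one.
matched = match_stage(orders, "status", "A")
grouped = group_sum_stage(matched, "customer", "amount")
result = sort_stage(grouped, "total", descending=True)

# The same idea written as stage documents, in the form accepted by
# PyMongo's Collection.aggregate(). Shown as data only, not executed.
pipeline = [
    {"$match": {"status": "A"}},
    {"$group": {"_id": "$customer", "total": {"$sum": "$amount"}}},
    {"$sort": {"total": -1}},
]
```

The point of the sketch is only the chaining: every stage receives a stream of documents and hands a new stream to the next stage.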

The addFields stage adds new fields to documents. It reshapes each document in the stream, similarly to the project stage. The output documents contain all existing fields from the input documents plus the newly added fields [24]. The following example shows how to add student details to a document.

    db.cloudmesh_community.aggregate([
      {
        $addFields: {
          "document.StudentDetails": {
            $concat: ['$document.student.FirstName', '$document.student.LastName']
          }
        }
      }
    ])

The bucket stage categorizes incoming documents into groups based on specified expressions. Those groups are called buckets [24]. The following example sorts users into age ranges per city. Note that it achieves the bucketing with a group stage and a switch expression rather than with the bucket stage itself.

    db.user.aggregate([
      {
        "$group": {
          "_id": {
            "city": "$city",
            "age": {
              "$let": {
                "vars": {
                  "age": { "$subtract": [ { "$year": new Date() }, { "$year": "$birthDay" } ] }
                },
                "in": {
                  "$switch": {
                    "branches": [
                      { "case": { "$lt": [ "$$age", 20 ] }, "then": 0 },
                      { "case": { "$lt": [ "$$age", 30 ] }, "then": 20 },
                      { "case": { "$lt": [ "$$age", 40 ] }, "then": 30 },
                      { "case": { "$lt": [ "$$age", 50 ] }, "then": 40 },
                      { "case": { "$lt": [ "$$age", 200 ] }, "then": 50 }
                    ]
                  }
                }
              }
            }
          },
          "count": { "$sum": 1 }
        }
      }
    ])

In the bucketAuto stage, the boundaries are determined automatically in an attempt to distribute documents evenly into a specified number of buckets. In the following operation, input documents are grouped into four buckets according to the values in the price field [24].

    db.artwork.aggregate([
      {
        $bucketAuto: {
          groupBy: "$price",
          buckets: 4
        }
      }
    ])

The collStats stage returns statistics regarding a collection or view [24].

    db.matrices.aggregate([ { $collStats: { latencyStats: { histograms: true } } } ])

The count stage passes to the next stage a document that contains the number of documents that were input to the stage [24].

    db.scores.aggregate([
      { $match: { score: { $gt: 80 } } },
      { $count: "passing_scores" }
    ])

The facet stage processes multiple aggregation pipelines on the same input documents within a single stage [24].

    db.artwork.aggregate([
      {
        $facet: {
          "categorizedByTags": [
            { $unwind: "$tags" },
            { $sortByCount: "$tags" }
          ],
          "categorizedByPrice": [
            // Filter out documents without a price e.g., _id: 7
            { $match: { price: { $exists: 1 } } },
            {
              $bucket: {
                groupBy: "$price",
                boundaries: [ 0, 150, 200, 300, 400 ],
                default: "Other",
                output: {
                  "count": { $sum: 1 },
                  "titles": { $push: "$title" }
                }
              }
            }
          ],
          "categorizedByYears(Auto)": [
            { $bucketAuto: { groupBy: "$year", buckets: 4 } }
          ]
        }
      }
    ])

The geoNear stage returns an ordered stream of documents based on their proximity to a geospatial point. The output documents include an additional distance field and can include a location identifier field [24].

    db.places.aggregate([
      {
        $geoNear: {
          near: { type: "Point", coordinates: [ -73.99279, 40.719296 ] },
          distanceField: "dist.calculated",
          maxDistance: 2,
          query: { type: "public" },
          includeLocs: "dist.location",
          num: 5,
          spherical: true
        }
      }
    ])

The graphLookup stage performs a recursive search on a collection. To each output document, it adds a new array field that contains the traversal results of the recursive search for that document [24].

    db.travelers.aggregate([
      {
        $graphLookup: {
          from: "airports",
          startWith: "$nearestAirport",
          connectFromField: "connects",
          connectToField: "airport",
          maxDepth: 2,
          depthField: "numConnections",
          as: "destinations"
        }
      }
    ])

The group stage groups the input documents by a specified expression and outputs one document per distinct group. It has a RAM limit of 100 MB. If the stage exceeds this limit, it produces an error [24].

    db.sales.aggregate([
      {
        $group: {
          _id: { month: { $month: "$date" }, day: { $dayOfMonth: "$date" }, year: { $year: "$date" } },
          totalPrice: { $sum: { $multiply: [ "$price", "$quantity" ] } },
          averageQuantity: { $avg: "$quantity" },
          count: { $sum: 1 }
        }
      }
    ])

The indexStats stage returns statistics regarding the use of each index for a collection [24].

    db.orders.aggregate([ { $indexStats: { } } ])

The limit stage controls the number of documents passed to the next stage in the pipeline [24].

    db.article.aggregate([ { $limit: 5 } ])

The listLocalSessions stage returns information about the sessions on the currently connected mongos or mongod instance [24].

    db.aggregate([ { $listLocalSessions: { allUsers: true } } ])

The listSessions stage lists all sessions that have been active long enough to propagate to the system.sessions collection [24].

    use config
    db.system.sessions.aggregate([ { $listSessions: { allUsers: true } } ])

The lookup stage performs left outer joins to other collections in the same database [24].

    {
      $lookup: {
        from: <collection to join>,
        localField: <field from the input documents>,
        foreignField: <field from the documents of the "from" collection>,
        as: <output array field>
      }
    }

The match stage filters the document stream. Only matching documents pass to the next stage [24].

    db.articles.aggregate([ { $match: { author: "dave" } } ])

The project stage reshapes documents by adding or removing fields.

    db.books.aggregate([ { $project: { title: 1, author: 1 } } ])

The redact stage reshapes the documents in the stream by restricting their content based on information stored in the documents themselves [24].

    db.accounts.aggregate([
      { $match: { status: "A" } },
      {
        $redact: {
          $cond: {
            if: { $eq: [ "$level", 5 ] },
            then: "$$PRUNE",
            else: "$$DESCEND"
          }
        }
      }
    ])

The replaceRoot stage replaces a document with a specified embedded document [24].

    db.produce.aggregate([ { $replaceRoot: { newRoot: "$in_stock" } } ])

The sample stage randomly selects a specified number of documents from its input [24].

    db.users.aggregate([ { $sample: { size: 3 } } ])

The skip stage skips a specified number of initial documents and passes the remaining documents on to the next stage of the pipeline [24].

    db.article.aggregate([ { $skip: 5 } ])
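The listings above use mongo shell syntax. In PyMongo, a pipeline is an ordinary Python list of dictionaries in which the stage names and keys are quoted strings. The following minimal sketch builds such a pipeline for paging through matching documents by combining the match, skip, and limit stages. The helper name `build_page_pipeline`, the sort on `_id`, and the database and collection names in the comment are assumptions made for illustration; the code only constructs the pipeline and does not contact a server.

```python
def build_page_pipeline(author, page, page_size):
    """Return an aggregation pipeline (list of stage dicts) for one result page.

    Pages are numbered from 1. The helper is hypothetical and only
    assembles the stage documents; it does not run them.
    """
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive integers")
    return [
        {"$match": {"author": author}},      # keep only this author's documents
        {"$sort": {"_id": 1}},               # stable order so pages do not overlap
        {"$skip": (page - 1) * page_size},   # drop the documents of earlier pages
        {"$limit": page_size},               # keep at most one page of documents
    ]


pipeline = build_page_pipeline("dave", page=3, page_size=5)

# With PyMongo installed and a MongoDB server running, the pipeline could be
# executed roughly as follows (database name "test" is an assumption):
#
#   from pymongo import MongoClient
#   for doc in MongoClient().test.articles.aggregate(pipeline):
#       print(doc)
```

Placing skip before limit matters: reversing the two stages would first cut the stream to `page_size` documents and then skip some or all of them.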

The sort stage reorders the document stream by a specified sort key [24].

    db.users.aggregate([
      { $sort: { age: -1, posts: 1 } }
    ])

The sortByCount stage groups the incoming documents based on the value of a specified expression and counts the documents in each distinct group [24].

    db.exhibits.aggregate([
      { $unwind: "$tags" },
      { $sortByCount: "$tags" }
    ])

The unwind stage deconstructs an array field from the input documents to output one document for each element [24]. The two forms below are equivalent.

    db.inventory.aggregate([ { $unwind: "$sizes" } ])

    db.inventory.aggregate([ { $unwind: { path: "$sizes" } } ])

The out stage writes the results of the aggregation pipeline into a collection. This stage must be the last stage of a pipeline [24].

    db.books.aggregate([
      { $group: { _id: "$author", books: { $push: "$title" } } },
      { $out: "authors" }
    ])

Another option among the aggregation operations is the Map/Reduce framework, which essentially consists of two functions, map and reduce. The first emits a key-value pair for each tag in the array, while the latter “sums over all of the emitted values for a given key” [23].

The last step in the Map/Reduce process is to call the map_reduce() function and iterate over the results [23]. A Map/Reduce operation either writes its result data to a collection or returns the results inline. If the output is written to a collection, one can perform subsequent Map/Reduce operations on the same input collection [25]. An operation that returns its results inline must keep them within the BSON document size limit, which is currently 16 MB. These types of operations are not supported by views [25]. PyMongo's API supports all features of MongoDB's Map/Reduce engine [26]. Moreover, Map/Reduce can return more detailed results when the full_response=True argument is passed to the map_reduce() function [26].
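The division of labor between the map and reduce functions described above can be imitated in plain Python. The sketch below counts tags across a few in-memory documents: the map step emits a (tag, 1) pair for each tag, and the reduce step sums the values emitted for each key. The sample documents and the names `map_tags`, `reduce_counts`, and `run_map_reduce` are hypothetical. The sketch deliberately does not call PyMongo's map_reduce(), which needs a running server and was removed from the Collection API in PyMongo 4.0.

```python
from collections import defaultdict

# Hypothetical documents with a "tags" array, in the spirit of a
# tag-counting example. The field names are illustrative only.
things = [
    {"x": 1, "tags": ["dog", "cat"]},
    {"x": 2, "tags": ["cat"]},
    {"x": 3, "tags": ["mouse", "cat", "dog"]},
    {"x": 4, "tags": []},
]


def map_tags(doc):
    """Map step: emit a (tag, 1) pair for each tag in the document."""
    for tag in doc["tags"]:
        yield tag, 1


def reduce_counts(key, values):
    """Reduce step: sum all values that were emitted for one key."""
    return sum(values)


def run_map_reduce(docs, mapper, reducer):
    """Collect the emitted values per key, then reduce each key's values."""
    emitted = defaultdict(list)
    for doc in docs:
        for key, value in mapper(doc):
            emitted[key].append(value)
    return {key: reducer(key, values) for key, values in emitted.items()}


tag_counts = run_map_reduce(things, map_tags, reduce_counts)

# Iterate over the results, as one would over the output of a real job.
for tag, count in sorted(tag_counts.items()):
    print(tag, count)
```

In MongoDB itself the map and reduce functions are written in JavaScript and run on the server; the Python version only mirrors their logic.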

In document Introduction to Clouds and Machine Learning (Page 136-142)