Using the MongoDB $bucket Aggregation Stage in Java

kommradHomer
4 min read · Jul 31, 2020


MongoDB and a $16 metal bucket for kids

In some cases, using the driver or SDK for a system is a lot harder and more confusing than interacting with that system directly. In my experience, the MongoDB Java driver is one such case.

With the Java driver, you might feel at a loss trying to make something simple work, something you can already do easily on the console. Even though the aggregation pipeline is already a torture to write on the console, the Java driver might set you back a little as well, mostly because its structure is not intuitive enough.

New in MongoDB 3.4, the $bucket aggregation pipeline stage “categorizes incoming documents into groups, called buckets, based on a specified expression and bucket boundaries and outputs a document per each bucket.” It comes in really handy if you want to group things into predefined classes and run an operation over the items of each class, like counting different aspects inside each class. And with $bucketAuto, you don’t need to define the boundaries for the buckets (or classes) either. You can read more about the whole pattern in this post.

Now that we are done with what $bucket is, I’ll try to show you how it’s used via the Java driver. Below is how you use the bucket operation on the console:

{
  $bucket: {
    groupBy: <expression>,
    boundaries: [ <lowerbound1>, <lowerbound2>, ... ],
    default: <literal>,
    output: {
      <output1>: { <$accumulator expression> },
      ...
      <outputN>: { <$accumulator expression> }
    }
  }
}

For a start, you can use the com.mongodb.client.model.Aggregates helper class to create almost every pipeline stage through simple static factory methods. But compared to what I do on the console, I find it really hard to find information on how they’re used, and that’s where I feel the lack of intuition.

As an example, let’s assume that we have a collection of students that looks like the list below:

{"student_id":"XXX","year":5,"name":"john doe1","attn":3,"final_grade":88}
{"student_id":"AAA","year":4,"name":"john doe2","attn":12,"final_grade":77}
{"student_id":"YYY","year":4,"name":"john doe3","attn":9,"final_grade":66}
{"student_id":"ZZZ","year":5,"name":"john doe7","attn":8,"final_grade":22}
{"student_id":"TTT","year":5,"name":"john doe5","attn":6,"final_grade":33}
.
.

If I want to bucket these students into grading classes like the ones I was subjected to during my high school years in Turkey, with 0–45 falling into 1, 45–55 into 2, 55–70 into 3, 70–85 into 4 and 85–100 into 5, I need to set up the boundaries for the bucket as:

[0,45,55,70,85,100]

You should watch out for documents that might not fall into any bucket. If you have such documents in your collection, you should also provide the “default” bucket, “a literal that specifies the _id of an additional bucket that contains all documents whose groupBy expression result does not fall into a bucket specified by boundaries”, or you will get an exception. To learn more about how bucket boundaries work and what the limitations for the default bucket are, you can click here.
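To make the boundary rule concrete, here is a small plain-Java sketch that mimics how $bucket picks a bucket: each boundary is the inclusive lower bound of a bucket and the next boundary is its exclusive upper bound. It does not use the driver at all, and the class and method names are made up for illustration.

```java
import java.util.OptionalInt;

/** Plain-Java illustration of how $bucket assigns a value to a bucket. Hypothetical helper, no driver needed. */
final class BucketBoundaries {

    // The boundaries from the article: each value is the inclusive lower bound of a bucket.
    static final int[] BOUNDARIES = {0, 45, 55, 70, 85, 100};

    /**
     * Returns the lower bound (which $bucket uses as the bucket _id) for a grade,
     * or empty when the grade falls outside every bucket and would need the "default" bucket.
     */
    static OptionalInt bucketFor(int grade) {
        for (int i = 0; i < BOUNDARIES.length - 1; i++) {
            // Lower bound is inclusive, upper bound is exclusive.
            if (grade >= BOUNDARIES[i] && grade < BOUNDARIES[i + 1]) {
                return OptionalInt.of(BOUNDARIES[i]);
            }
        }
        return OptionalInt.empty();
    }

    public static void main(String[] args) {
        for (int grade : new int[] {22, 45, 88, 100}) {
            OptionalInt bucket = bucketFor(grade);
            String target = bucket.isPresent() ? "bucket " + bucket.getAsInt() : "default bucket";
            System.out.println(grade + " -> " + target);
        }
    }
}
```

Note that with these boundaries a grade of exactly 100 lands outside the last bucket, so a collection containing such a grade would need either a default bucket or a higher last boundary.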

I also need to specify the groupBy expression, which decides the bucket for each document. It can simply point to a field, or be any complex expression, like the $hour of a date field. As I want to group students into buckets by their grades, I’ll be using the final_grade field. With these two parameters I could easily create the corresponding Bson object for my pipeline stage, but to do additional operations on the documents of each bucket, we need the com.mongodb.client.model.BucketOptions class.
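With only those two parameters, the stage could be sketched as below. This assumes the MongoDB Java driver (3.7 or later) is on the classpath; the class name and the render helper are hypothetical and only there to print the stage for comparison with the console syntax.

```java
import com.mongodb.MongoClientSettings;
import com.mongodb.client.model.Aggregates;
import org.bson.BsonDocument;
import org.bson.conversions.Bson;

import java.util.Arrays;
import java.util.List;

/** Hypothetical holder class for the simplest form of the $bucket stage. */
final class SimpleBucketStage {

    /** Builds a $bucket stage that groups documents by final_grade. */
    static Bson build() {
        List<Integer> boundaries = Arrays.asList(0, 45, 55, 70, 85, 100);
        // "$final_grade" is a field path, so it needs the leading "$" just like on the console.
        return Aggregates.bucket("$final_grade", boundaries);
    }

    /** Renders a stage to a BsonDocument so it can be compared with the console syntax. */
    static BsonDocument render(Bson stage) {
        return stage.toBsonDocument(BsonDocument.class, MongoClientSettings.getDefaultCodecRegistry());
    }

    public static void main(String[] args) {
        System.out.println(render(build()).toJson());
    }
}
```

When no output is specified, MongoDB returns a count field per bucket, so this form is enough if all you need is the number of students in each grade class.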

A BucketOptions object can be passed to the Aggregates.bucket() method to use optional features, like additional fields in the output generated for each bucket. To add new fields to the output, you must provide accumulator expressions, which makes sense because these expressions are run for each matching document and the results must be piled up somewhere!

In my example, I’ll use an extra field to count how many of the students in each bucket have enough attendance, which is 8, and store it in attended_class. To do so, I’ll need to evaluate an expression that compares the attn of each student to 8 and returns a 1 or 0 for the accumulator.
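One way to express that comparison is a $cond wrapped in a $sum accumulator. The sketch below assumes that “enough attendance” means attn greater than or equal to 8 (the text does not say which comparison is meant) and that the MongoDB Java driver is on the classpath; the class and method names are hypothetical.

```java
import com.mongodb.client.model.Accumulators;
import com.mongodb.client.model.BsonField;
import org.bson.Document;

import java.util.Arrays;

/** Hypothetical holder class for the attended_class output field. */
final class AttendedClassField {

    static final int REQUIRED_ATTENDANCE = 8;

    /** Builds { $cond: [ { $gte: ["$attn", 8] }, 1, 0 ] }. The >= comparison is an assumption. */
    static Document hasEnoughAttendance() {
        Document enough = new Document("$gte", Arrays.asList("$attn", REQUIRED_ATTENDANCE));
        // $cond yields 1 when the condition holds and 0 otherwise.
        return new Document("$cond", Arrays.asList(enough, 1, 0));
    }

    /** Sums the 1/0 results, which counts the students with enough attendance in each bucket. */
    static BsonField build() {
        return Accumulators.sum("attended_class", hasEnoughAttendance());
    }

    public static void main(String[] args) {
        System.out.println(build().getName() + " <- $sum of " + hasEnoughAttendance().toJson());
    }
}
```

The driver has no dedicated builder for $cond in the versions I am sure about, so the expression is built from plain Document objects that mirror the console syntax.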

The second extra field will be the complete list of student_ids in each bucket, which can be done simply with the $push operator.

And the third field will be the total number of students in the bucket, simply done with $sum and stored in students_total. The final shape of my BucketOptions and its usage in the Aggregates.bucket() method looks like this:
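A minimal sketch of that shape, reconstructed from the description above and not the exact original snippet. The output field name student_ids, the default bucket _id "other", the >= comparison for attendance, and the class name are all assumptions; the MongoDB Java driver is expected on the classpath.

```java
import com.mongodb.MongoClientSettings;
import com.mongodb.client.model.Accumulators;
import com.mongodb.client.model.Aggregates;
import com.mongodb.client.model.BucketOptions;
import org.bson.BsonDocument;
import org.bson.Document;
import org.bson.conversions.Bson;

import java.util.Arrays;

/** Hypothetical holder class for the $bucket stage with extra output fields. */
final class StudentBucketStage {

    static Bson build() {
        // 1 when the student has attn >= 8, else 0 (the comparison direction is an assumption).
        Document attended = new Document("$cond",
                Arrays.asList(new Document("$gte", Arrays.asList("$attn", 8)), 1, 0));

        BucketOptions options = new BucketOptions()
                // Assumed _id for documents outside the boundaries; only needed if such documents exist.
                .defaultBucket("other")
                .output(
                        Accumulators.sum("attended_class", attended),
                        Accumulators.push("student_ids", "$student_id"),
                        Accumulators.sum("students_total", 1));

        Bson b = Aggregates.bucket("$final_grade", Arrays.asList(0, 45, 55, 70, 85, 100), options);
        return b;
    }

    /** Renders the stage so it can be compared with the console syntax. */
    static BsonDocument render(Bson stage) {
        return stage.toBsonDocument(BsonDocument.class, MongoClientSettings.getDefaultCodecRegistry());
    }

    public static void main(String[] args) {
        System.out.println(render(build()).toJson());
    }
}
```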

You can use the created Bson b as a pipeline stage anywhere appropriate. To build on the example, I will use only the students in their 5th year, filtering them with a $match stage, and then use the $bucket stage to group them as we’ve discussed. The final code will look like this:
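Here is a sketch of how the two stages could be put together, again reconstructed from the description and not the original snippet. The connection string, the database name school, the collection name students, the output field name student_ids and the >= attendance comparison are all assumptions for illustration; running main needs a reachable MongoDB instance and the driver on the classpath.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Accumulators;
import com.mongodb.client.model.Aggregates;
import com.mongodb.client.model.BucketOptions;
import com.mongodb.client.model.Filters;
import org.bson.Document;
import org.bson.conversions.Bson;

import java.util.Arrays;
import java.util.List;

/** Hypothetical class that buckets the 5th year students by final_grade. */
final class FifthYearBuckets {

    /** Builds the pipeline: $match on year 5, then $bucket on final_grade. */
    static List<Bson> buildPipeline() {
        // 1 when attn >= 8, else 0 (the comparison direction is an assumption).
        Document attended = new Document("$cond",
                Arrays.asList(new Document("$gte", Arrays.asList("$attn", 8)), 1, 0));

        BucketOptions options = new BucketOptions().output(
                Accumulators.sum("attended_class", attended),
                Accumulators.push("student_ids", "$student_id"),
                Accumulators.sum("students_total", 1));

        Bson match = Aggregates.match(Filters.eq("year", 5));
        Bson b = Aggregates.bucket("$final_grade", Arrays.asList(0, 45, 55, 70, 85, 100), options);
        return Arrays.asList(match, b);
    }

    public static void main(String[] args) {
        // Connection string, database and collection names are placeholders.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> students = client.getDatabase("school").getCollection("students");
            // Each result document is one bucket, with _id set to the bucket's lower boundary.
            for (Document bucket : students.aggregate(buildPipeline())) {
                System.out.println(bucket.toJson());
            }
        }
    }
}
```

No default bucket is set in this sketch, so the aggregation would fail if any 5th year student had a final_grade outside the boundaries.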
