Back to Blog

Pipeline — Databases

·Reha Tuncer·Data Engineering
SQLMongoDBDatabasesCRUDMySQLPyMongo
View source on GitHub

Pipeline — Databases

A progressive journey through SQL and MongoDB — from basic table creation and CRUD operations to multi-table joins, aggregations, constraints, triggers, and NoSQL document operations with Python/PyMongo.


Learning Objectives

#Concept
1Create databases and tables with proper data types and constraints
2Perform CRUD operations: SELECT, INSERT, UPDATE, DELETE
3Filter and sort data with WHERE, ORDER BY, and LIKE
4Aggregate data across groups with GROUP BY, AVG, MAX, COUNT, SUM
5Join multiple tables with INNER JOIN, LEFT JOIN, and chained joins
6Enforce data integrity with NOT NULL, UNIQUE, PRIMARY KEY, AUTO_INCREMENT, ENUM
7Handle null values and computed columns with COALESCE and arithmetic
8Automate database logic with CREATE TRIGGER (BEFORE/AFTER, INSERT/UPDATE)
9Design schemas for many-to-many relationships and time-series data
10Interact with MongoDB via the shell: show dbs, use, find, insert, update, delete
11Perform CRUD operations programmatically with PyMongo in Python
12Build MongoDB aggregation pipelines for data analysis ($group, $sort, $limit)

Task-by-Task Reference

Each task below highlights the unique challenge it posed and the new technique introduced — techniques from earlier tasks are not repeated.


Task 0 — Create Database (0-create_database_if_missing.sql)

Challenge: Set up the foundational container for all subsequent work — the database itself — with idempotent execution.

Approach: CREATE DATABASE IF NOT EXISTS db_0 — safely re-runnable without errors.

New techniques introduced:

TechniquePurpose
CREATE DATABASE IF NOT EXISTSCreate a new database only if it doesn't already exist
Idempotent DDLSafe to run multiple times — no side effects on re-execution

Key takeaway: Every SQL workflow starts with database creation. IF NOT EXISTS makes scripts safe for repeated execution.


Task 1 — Create Table (1-first_table.sql)

Challenge: Define a table structure with typed columns — introducing DDL and data type selection.

Approach: CREATE TABLE IF NOT EXISTS first_table (id INT, name VARCHAR(256)) — two columns with specific types, backtick-quoted identifiers.

New techniques introduced:

TechniquePurpose
CREATE TABLE IF NOT EXISTSDefine a table structure with idempotent execution
INT data typeInteger column for numeric IDs
VARCHAR(256) data typeVariable-length string up to 256 characters
Backtick-quoted identifiersEscape table/column names that may conflict with reserved words

Key takeaway: Every table column needs an explicit data type. Choose INT for numbers, VARCHAR(N) for strings — the type constrains what data can be stored.


Task 2 — List All Rows (2-list_values.sql)

Challenge: Retrieve the entire contents of a table — the simplest possible read operation.

Approach: SELECT * FROM first_table — the wildcard * pulls every column.

New techniques introduced:

TechniquePurpose
SELECT * FROM tableRetrieve all columns and all rows
FROM clauseSpecify which table to query

Key takeaway: SELECT * is the simplest read operation — it returns every row and every column from the specified table.


Task 3 — Insert Row (3-insert_value.sql)

Challenge: Add data to the table — introducing the write side of CRUD.

Approach: INSERT INTO first_table (id, name) VALUES (89, 'Best School') — explicit column naming makes the intent clear.

New techniques introduced:

TechniquePurpose
INSERT INTO table (col1, col2) VALUES (v1, v2)Add a new row with specified column values
Explicit column specificationMakes insertion order-independent and self-documenting

Key takeaway: Named columns in INSERT make the operation explicit and maintainable — you're not relying on column order.


Task 4 — Filter and Sort (4-best_score.sql)

Challenge: Retrieve only records meeting a condition, ordered by a specific column — introducing WHERE and ORDER BY.

Approach: SELECT score, name FROM second_table WHERE score >= 10 ORDER BY score DESC — filter first, then sort.

New techniques introduced:

TechniquePurpose
WHERE conditionFilter rows — keep only those matching the condition
>= comparison operatorGreater-than-or-equal filter
ORDER BY column DESCSort results in descending order

Key takeaway: WHERE filters rows before they're returned; ORDER BY sequences the surviving rows. Both operate on the result set, not the table.


Task 5 — Average (5-average.sql)

Challenge: Compute a single aggregate value across all rows — introducing aggregate functions.

Approach: SELECT AVG(score) AS average FROM second_table — collapses all rows into one number.

New techniques introduced:

TechniquePurpose
AVG(column)Compute the arithmetic mean of a numeric column
AS aliasGive the result column a readable name

Key takeaway: Aggregate functions like AVG() collapse multiple rows into a single computed value — the result is one row, not one per input row.


Task 6 — Aggregate with Grouping (6-avg_temperatures.sql)

Challenge: Compute averages per city — introducing GROUP BY for partitioned aggregation.

Approach: SELECT city, AVG(value) AS avg_temp FROM temperatures GROUP BY city ORDER BY avg_temp DESC — group, aggregate within each group, then sort groups.

New techniques introduced:

TechniquePurpose
GROUP BY columnPartition rows into groups based on a column's values
Aggregate + GROUP BYAVG() computes per-group, not global
ORDER BY aggregate_alias DESCSort groups by their aggregate value

Key takeaway: GROUP BY partitions data before aggregation — each group gets its own AVG(), MAX(), or COUNT(). Without it, aggregates apply globally.


Task 7 — Maximum with Grouping (7-max_state.sql)

Challenge: Find the maximum temperature per state — applying MAX() in a grouped context.

Approach: SELECT state, MAX(value) AS max_temp FROM temperatures GROUP BY state ORDER BY state — same pattern as Task 6, with MAX() instead of AVG().

New techniques introduced:

TechniquePurpose
MAX(column)Find the highest value in a column (or per group)

Key takeaway: MAX() finds the highest value per group — same GROUP BY pattern, different aggregate. The grouping logic is identical across all aggregates.


Task 8 — Inner Join (8-genre_id_by_show.sql)

Challenge: Connect two related tables to list shows that have genres — introducing INNER JOIN.

Approach: SELECT tv_shows.title, tv_show_genres.genre_id FROM tv_shows INNER JOIN tv_show_genres ON tv_shows.id = tv_show_genres.show_id ORDER BY title, genre_id.

New techniques introduced:

TechniquePurpose
INNER JOIN table ON conditionCombine rows from two tables where the condition matches
table.column qualified namesDisambiguate columns when multiple tables share names
ORDER BY col1, col2Multi-column sort (primary by title, secondary by genre_id)

Key takeaway: INNER JOIN returns only rows where a match exists in BOTH tables. If a show has no genre, it's excluded from the result.


Task 9 — Left Join: Missing Genres (9-no_genre.sql)

Challenge: List shows that have NO linked genre — the anti-join pattern using LEFT JOIN + IS NULL.

Approach: SELECT tv_shows.title, tv_show_genres.genre_id FROM tv_shows LEFT JOIN tv_show_genres ON tv_shows.id = tv_show_genres.show_id WHERE tv_show_genres.genre_id IS NULL ORDER BY title.

New techniques introduced:

TechniquePurpose
LEFT JOINKeep ALL rows from the left table — add right-table data where a match exists
WHERE column IS NULLFind rows where the join produced no match
Anti-join patternLEFT JOIN + IS NULL = "find records in A with no match in B"

Key takeaway: LEFT JOIN preserves every row from the left table. When there's no match, right-table columns are NULL — filter on that to find orphans.


Task 10 — Count with Grouping (10-count_shows_by_genre.sql)

Challenge: Count how many shows belong to each genre — introducing COUNT() with joins and grouping.

Approach: SELECT tv_genres.name AS genre, COUNT(tv_show_genres.genre_id) AS number_of_shows FROM tv_genres LEFT JOIN tv_show_genres ON tv_genres.id = tv_show_genres.genre_id GROUP BY tv_genres.name ORDER BY number_of_shows DESC.

New techniques introduced:

TechniquePurpose
COUNT(column)Count non-NULL values — with LEFT JOIN, counts only matched rows
COUNT + LEFT JOIN patternGenres with 0 shows appear as 0 in the count
GROUP BY text_columnGroup by a name/string, not just IDs

Key takeaway: COUNT() with LEFT JOIN ensures even empty categories appear (with count 0). INNER JOIN + COUNT() would silently drop them.


Task 11 — Sum Ratings (11-rating_shows.sql)

Challenge: Sum ratings for each show — introducing SUM() aggregate.

Approach: SELECT tv_shows.title, SUM(tv_show_ratings.rate) AS rating FROM tv_shows LEFT JOIN tv_show_ratings ON tv_shows.id = tv_show_ratings.show_id GROUP BY tv_shows.title ORDER BY rating DESC.

New techniques introduced:

TechniquePurpose
SUM(column)Total all values in a numeric column (or per group)
LEFT JOIN for optional ratingsShows with no ratings still appear (SUM = NULL or 0)

Key takeaway: SUM() totals numeric values. With LEFT JOIN, shows without ratings return NULL — use COALESCE(SUM(...), 0) if you want 0 instead.


Task 12 — Sum with Chained Joins (12-rating_genres.sql)

Challenge: Sum ratings by genre — navigating three tables in one query with chained LEFT JOINs.

Approach: SELECT tv_genres.name, SUM(tv_show_ratings.rate) AS rating FROM tv_genres LEFT JOIN tv_show_genres ON tv_genres.id = tv_show_genres.genre_id LEFT JOIN tv_show_ratings ON tv_show_genres.show_id = tv_show_ratings.show_id GROUP BY tv_genres.name ORDER BY rating DESC.

New techniques introduced:

TechniquePurpose
Multiple consecutive LEFT JOINsChain through intermediate tables to reach the target data
Junction table traversalgenres → show_genres → show_ratings — the junction table bridges the relationship

Key takeaway: Chain multiple joins to traverse complex relationships. Each join adds one more hop through the schema — the query grows linearly with the path length.


Task 13 — Unique Constraints (13-uniq_users.sql)

Challenge: Create a users table that prevents duplicate emails — introducing data integrity constraints.

Approach: CREATE TABLE IF NOT EXISTS users (id INT NOT NULL AUTO_INCREMENT, email VARCHAR(256) NOT NULL UNIQUE, name VARCHAR(256), PRIMARY KEY (id)).

New techniques introduced:

TechniquePurpose
AUTO_INCREMENTAutomatically generate sequential integer IDs
NOT NULLReject INSERTs that omit this column
UNIQUEReject INSERTs that duplicate an existing value
PRIMARY KEY (col)Uniquely identify each row — implies NOT NULL + UNIQUE

Key takeaway: Constraints like NOT NULL, UNIQUE, and PRIMARY KEY enforce data integrity at the database level — bad data is rejected before it enters the table.


Task 14 — ENUM Constraint (14-country_users.sql)

Challenge: Extend the users table with a country column restricted to three specific values — introducing enumerated types.

Approach: CREATE TABLE ... country ENUM('US', 'CO', 'TN') NOT NULL DEFAULT 'US' — only these three values are accepted.

New techniques introduced:

TechniquePurpose
ENUM('val1', 'val2', 'val3')Restrict a column to a predefined set of string values
DEFAULT valueProvide a fallback value when none is specified in INSERT

Key takeaway: ENUM acts as a check constraint on strings — the database rejects any value not in the list. Use it for columns with a fixed, known set of valid options.


Task 15 — Real-World Aggregation (15-fans.sql)

Challenge: Rank band origins by total fans from a real dataset — applying grouping and summing to non-trivial data.

Approach: SELECT origin, SUM(fans) AS nb_fans FROM metal_bands GROUP BY origin ORDER BY nb_fans DESC — group by text column, sum numeric.

New techniques introduced:

TechniquePurpose
Grouping by text/string columnGROUP BY origin works on VARCHAR just like INT
Real dataset contextApply SQL patterns to meaningful, large-scale data

Key takeaway: GROUP BY and SUM() work with any data type — string columns for grouping, numeric columns for aggregation. The pattern is universal.


Task 16 — Pattern Matching and COALESCE (16-glam_rock.sql)

Challenge: Find Glam rock bands, compute their lifespan, and handle active bands (no split year) — introducing LIKE and COALESCE.

Approach: SELECT band_name, COALESCE(split, 2020) - formed AS lifespan FROM metal_bands WHERE style LIKE '%Glam rock%' ORDER BY lifespan DESC.

New techniques introduced:

TechniquePurpose
LIKE '%pattern%'Pattern matching with % wildcards (0+ characters)
COALESCE(val, default)Return the first non-NULL value — treat NULL split as "still active"
Arithmetic in SELECTCompute derived columns (split - formed) directly in the query

Key takeaway: COALESCE() provides fallback values for NULL — use it to handle missing data gracefully. Arithmetic in SELECT creates computed columns on the fly.


Task 17 — AFTER INSERT Trigger (17-store.sql)

Challenge: Automatically decrement item quantity when an order is placed — introducing database triggers for reactive logic.

Approach: CREATE TRIGGER decrease_quantity AFTER INSERT ON orders FOR EACH ROW UPDATE items SET quantity = quantity - NEW.number WHERE name = NEW.item_name.

New techniques introduced:

TechniquePurpose
CREATE TRIGGER name timing event ON tableDefine automated logic that fires on data changes
AFTER INSERTExecute the trigger body after a row is inserted
FOR EACH ROWRun the trigger once per affected row
NEW.columnReference the newly inserted row's column values
DELIMITER //Change statement terminator to allow ; inside trigger body

Key takeaway: Triggers automate reactive operations — when an order is placed, inventory updates itself. NEW refers to the row being inserted; OLD refers to the row being updated/deleted.


Task 18 — BEFORE UPDATE Trigger (18-valid_email.sql)

Challenge: Reset the valid_email flag whenever a user's email address changes — introducing conditional trigger logic with BEFORE UPDATE.

Approach: CREATE TRIGGER reset_valid_email BEFORE UPDATE ON users FOR EACH ROW IF OLD.email != NEW.email THEN SET NEW.valid_email = 0; END IF.

New techniques introduced:

TechniquePurpose
BEFORE UPDATEFire the trigger BEFORE the row is written — can modify NEW values
IF condition THEN ... END IFConditional logic inside a trigger body
OLD.column vs NEW.columnCompare old and new values to detect changes
SET NEW.column = valueModify the incoming row before it's saved

Key takeaway: BEFORE UPDATE triggers can inspect and modify data before it hits the table. Compare OLD vs NEW to detect changes, then adjust NEW to enforce business rules.


Schema Reference

hbtn_0d_tvshows.sql — TV Shows + Genres (Many-to-Many)

TableColumnsPurpose
tv_genresid (PK), nameGenre catalog
tv_showsid (PK), titleShow catalog
tv_show_genresshow_id (FK), genre_id (FK)Junction table — links shows to genres

Pattern: Junction table with composite foreign keys enables many-to-many relationships.

hbtn_0d_tvshows_rate.sql — TV Shows + Genres + Ratings

Adds tv_show_ratings (show_id, rate) — a fact table allowing multiple ratings per show.

metal_bands.sql — Bands Dataset

Single flat table: metal_bands (id, band_name, fans, formed, origin, split, style). Denormalized for simple queries. YEAR data type for formed/split. split is NULL for active bands.

temperatures.sql — Weather Time-Series

Flat table: temperatures (city, state, year, month, value). Designed for efficient GROUP BY queries on city and time dimensions.


Technique Inventory

TaskNew technique summarizedCategory
0CREATE DATABASE IF NOT EXISTSDDL — Database
1CREATE TABLE, INT, VARCHAR(N)DDL — Tables
2SELECT * FROM tableCRUD — Read
3INSERT INTO ... VALUESCRUD — Create
4WHERE, ORDER BY DESCFiltering & Sorting
5AVG(), AS aliasingAggregates
6GROUP BY, aggregate + groupingGrouped Aggregation
7MAX()Aggregates
8INNER JOIN ... ONJoins
9LEFT JOIN + IS NULL (anti-join)Joins
10COUNT() with GROUP BY + LEFT JOINAggregates + Joins
11SUM() with GROUP BYAggregates
12Multiple chained LEFT JOINsJoins
13AUTO_INCREMENT, NOT NULL, UNIQUE, PRIMARY KEYConstraints
14ENUM, DEFAULTConstraints
15Real-world aggregation on text columnAggregates
16LIKE '%pattern%', COALESCE(), arithmetic in SELECTPattern Matching & Nulls
17CREATE TRIGGER AFTER INSERT, NEW referenceTriggers
18BEFORE UPDATE, IF in trigger, OLD vs NEWTriggers

MongoDB Shell (Tasks 22–29)

Each task below highlights the unique challenge it posed and the new technique introduced — techniques from MySQL tasks are not repeated.


Task 22 — List Databases (22-list_databases)

Challenge: Discover what databases exist on the MongoDB server — the first interaction with a new NoSQL environment.

Approach: show dbs — MongoDB shell's equivalent of SHOW DATABASES in MySQL. Returns all database names and their sizes.

New techniques introduced:

TechniquePurpose
show dbsList all databases on the MongoDB server
MongoDB shell (mongosh)Interactive JavaScript-based shell for MongoDB operations

Key takeaway: show dbs is your first command in MongoDB — it confirms the server is reachable and shows available databases. Unlike MySQL, databases are lazily created on first insert.


Task 23 — Select/Create Database (23-use_or_create_database)

Challenge: Switch to (or implicitly create) a database — MongoDB has no explicit CREATE DATABASE.

Approach: use my_db — MongoDB creates the database automatically when you first insert data into it. The use command switches context without requiring the database to already exist.

New techniques introduced:

TechniquePurpose
use db_nameSwitch to a database (creates it lazily on first write)
Implicit database creationNo CREATE DATABASE needed — databases spring into existence on first insert

Key takeaway: In MongoDB, use switches context; the database isn't physically created until you insert a document. This is fundamentally different from MySQL's explicit CREATE DATABASE.


Task 24 — Insert Document (24-insert)

Challenge: Add data to a MongoDB collection — introducing document insertion with JSON-like syntax.

Approach: db.school.insert({ name: "Holberton school" }) — insert a BSON document (JavaScript object) into the school collection. Collections are created implicitly on first insert.

New techniques introduced:

TechniquePurpose
db.collection.insert({...})Insert a new document into a collection
JSON/BSON document syntaxKey-value pairs in { field: value } format
Implicit collection creationCollections are created automatically on first insert()

Key takeaway: MongoDB stores JSON-like documents. insert() adds a document to a collection — both the collection and database are created implicitly if they don't exist.


Task 25 — Find All Documents (25-all)

Challenge: Retrieve all documents from a collection — the MongoDB equivalent of SELECT *.

Approach: db.school.find() — without arguments, returns every document in the collection. Outputs documents in readable JSON format.

New techniques introduced:

TechniquePurpose
db.collection.find()Retrieve all documents from a collection
Cursor outputfind() returns a cursor that the shell iterates and displays

Key takeaway: find() with no arguments returns all documents — MongoDB's SELECT *. The shell automatically iterates the cursor and pretty-prints results.


Task 26 — Find with Filter (26-match)

Challenge: Retrieve only documents matching a specific condition — introducing query filters.

Approach: db.school.find({ name: "Holberton school" }) — pass a query document to filter results. Only documents whose name field matches the value are returned.

New techniques introduced:

TechniquePurpose
find({ field: value })Filter documents by exact field match
Query documentMongoDB uses {key: value} objects as query predicates

Key takeaway: MongoDB queries are themselves JSON-like documents. find({name: "X"}) is equivalent to WHERE name = 'X' in SQL — the filter is declarative.


Task 27 — Count Documents (27-count)

Challenge: Count how many documents exist in a collection — introducing document counting.

Approach: db.school.count() — returns the total number of documents. Can also accept a filter: db.school.count({name: "Holberton school"}).

New techniques introduced:

TechniquePurpose
db.collection.count()Return the total number of documents in a collection
count({filter})Count only documents matching a query filter

Key takeaway: count() is MongoDB's COUNT(*). It's a simple, fast way to check collection size or count matching documents.


Task 28 — Update Documents (28-update)

Challenge: Modify existing documents by adding new fields — introducing the $set operator and multi option.

Approach: db.school.update({ name: "Holberton school" }, { $set: { address: "972 Mission street" } }, { multi: true }) — match documents by name, then add an address field via $set. multi: true updates all matching documents, not just the first.

New techniques introduced:

TechniquePurpose
db.collection.update(query, update, options)Modify documents matching a query
$set operatorAdd or modify specific fields without replacing the entire document
{ multi: true }Update ALL matching documents (default updates only the first)

Key takeaway: $set modifies only specified fields — the rest of the document is untouched. Always use { multi: true } when you intend to update multiple documents; without it, only the first match is changed.


Task 29 — Delete Documents (29-delete)

Challenge: Remove documents from a collection — introducing bulk deletion.

Approach: db.school.deleteMany({ name: "Holberton school" }) — delete all documents matching the filter. deleteMany() removes every match; deleteOne() would remove only the first.

New techniques introduced:

TechniquePurpose
db.collection.deleteMany({filter})Delete all documents matching a query filter
deleteOne() vs deleteMany()deleteOne removes first match; deleteMany removes all matches

Key takeaway: deleteMany() is MongoDB's DELETE FROM ... WHERE ... — it removes all matching documents in one operation. Use deleteOne() when you want to limit removal to a single document.


Python & PyMongo (Tasks 30–34)

Transitioning from the MongoDB shell to programmatic database access with Python's pymongo driver.


Task 30 — List All Documents in Python (30-all.py)

Challenge: Retrieve all documents from a MongoDB collection and return them as a Python list — bridging the shell-to-Python gap.

Approach: list(mongo_collection.find())find() returns a Cursor; wrapping it in list() materializes all documents into a Python list of dicts.

New techniques introduced:

TechniquePurpose
pymongo collection objectPython driver interface matching shell commands
mongo_collection.find()Return a Cursor over all documents in the collection
list(cursor)Materialize a Cursor into a Python list of dictionaries

Key takeaway: PyMongo's API mirrors the shell — collection.find() is equivalent to db.collection.find(). Wrap the cursor in list() to get all documents at once.


Task 31 — Insert with Keyword Arguments (31-insert_school.py)

Challenge: Accept arbitrary keyword arguments and insert them as a MongoDB document — introducing dynamic document creation.

Approach: mongo_collection.insert(kwargs) — Python's **kwargs captures all keyword arguments as a dictionary, then insert() stores it directly as a MongoDB document.

New techniques introduced:

TechniquePurpose
**kwargs parameterCapture arbitrary keyword arguments as a dictionary
insert(document)Insert a Python dict as a MongoDB document (returns _id)

Key takeaway: **kwargs + insert() lets you pass any field-value pairs from Python directly into MongoDB — the Python dict maps 1:1 to a BSON document.


Task 32 — Update Many with $set (32-update_topics.py)

Challenge: Update the topics field for ALL schools matching a given name — translating the shell update pattern to Python.

Approach: mongo_collection.update_many({'name': name}, {'$set': {'topics': topics}}) — filter by name, then use $set to add/overwrite the topics field. update_many() ensures all matches are updated.

New techniques introduced:

TechniquePurpose
update_many(filter, update)Update all documents matching the filter
{'$set': {field: value}} in PythonSame $set operator, expressed as a Python dict

Key takeaway: PyMongo update operations use the same MongoDB operators ($set, $inc, etc.) — just expressed as Python dicts. update_many() is the safe default; update_one() only touches the first match.


Task 33 — Find by Array Field (33-schools_by_topic.py)

Challenge: Find all schools whose topics array contains a specific value — introducing array field matching in MongoDB.

Approach: mongo_collection.find({'topics': topic}) — when the field is an array, MongoDB automatically matches documents where ANY element equals the query value.

New techniques introduced:

TechniquePurpose
Array field matchingfind({array_field: value}) matches if any array element equals value
Implicit $in behaviorMongoDB treats scalar-to-array matching as "contains"

Key takeaway: MongoDB intelligently matches array fields — find({'topics': 'math'}) finds documents where topics is an array containing 'math'. No special syntax needed.


Task 34 — Nginx Log Analysis (34-log_stats.py)

Challenge: Connect to a real MongoDB instance and analyze Nginx access logs — introducing MongoClient, database/collection access, and count_documents().

Approach: MongoClient('mongodb://127.0.0.1:27017') connects to the local server. Then client.logs.nginx navigates to the nginx collection in the logs database. count_documents({}) counts total logs; count_documents({'method': 'GET'}) counts per HTTP method.

New techniques introduced:

TechniquePurpose
MongoClient(uri)Establish a connection to a MongoDB server
client.db_name.collection_nameAttribute-style access to database and collection
count_documents({filter})Count documents matching a filter (modern replacement for count())
Real-world data analysisApply MongoDB queries to production Nginx log data

Key takeaway: MongoClient is the entry point — it connects to MongoDB, and from there you navigate databases and collections with dot notation. count_documents() is the modern, preferred way to count.


Python & PyMongo — Aggregation (Tasks 105–106)

Advanced data analysis using MongoDB's aggregation pipeline — grouping, sorting, and limiting results.


Task 105 — Students Sorted by Average Score (105-students.py)

Challenge: Compute each student's average score from a nested topics array and sort them by that average — introducing the aggregation pipeline.

Approach: A two-stage pipeline: $addFields creates a new averageScore field by averaging $topics.score with $avg, then $sort orders by averageScore descending. aggregate(pipeline) executes and returns a cursor.

New techniques introduced:

TechniquePurpose
aggregate([pipeline_stages])Execute a sequence of data transformations
$addFields stageAdd computed fields to each document
$avg: '$nested.field'Compute the average of an array field's values
$sort: {field: -1}Sort results in descending order (-1)

Key takeaway: The aggregation pipeline processes documents through stages in sequence — each stage transforms the data flowing from the previous one. $addFields enriches documents; $sort reorders them. Together they form a composable data processing chain.


Task 106 — Top 10 IPs (106-log_stats.py)

Challenge: Extend the Nginx log analysis (Task 34) to find the 10 most frequent IP addresses — introducing $group, $sum, and $limit.

Approach: A three-stage pipeline appended to the existing script: $group clusters documents by $ip and counts each with $sum: 1, $sort orders by count descending, $limit: 10 keeps only the top 10.

New techniques introduced:

TechniquePurpose
$group stageGroup documents by a field, computing per-group aggregates
_id: '$field' in $groupSpecify the grouping key (field to group by)
$sum: 1Count documents per group (1 per document, summed)
$limit: NRestrict output to the first N documents

Key takeaway: $group is MongoDB's GROUP BY_id defines the bucket, accumulator expressions ($sum, $avg, $max) compute per-group values. $limit caps results, making it perfect for "top N" queries.


Technique Inventory — MongoDB

TaskNew technique summarizedCategory
22show dbs — list all databasesMongoDB Shell
23use db_name — implicit DB creationMongoDB Shell
24db.collection.insert({...}) — document insertionMongoDB CRUD
25db.collection.find() — retrieve all documentsMongoDB CRUD
26find({field: value}) — query with filterMongoDB Queries
27db.collection.count() — document countingMongoDB Queries
28$set operator, update(), multi: trueMongoDB Updates
29deleteMany() — bulk document deletionMongoDB CRUD
30list(mongo_collection.find()) — materialize cursor to Python listPyMongo CRUD
31insert(kwargs), **kwargs — dynamic insertPyMongo CRUD
32update_many() with $set in PythonPyMongo Updates
33Array field matching in find()PyMongo Queries
34MongoClient, count_documents() — real-world analysisPyMongo Connection
105aggregate(), $addFields, $avg, $sortAggregation Pipeline
106$group, $sum, $limit — top-N groupingAggregation Pipeline

Resources

MySQL

MongoDB