curriculum/challenges/english/blocks/lecture-working-with-sql/68816f386bc30d36f59e9563.md
Normalization is the process of organizing a relational database using established normal forms to reduce data redundancy and improve data integrity.
To normalize a database, you analyze the attributes and relationships in your table structures to identify opportunities for simplification based on normalization rules. Then, you divide the data into smaller, more focused tables and establish relationships between them using primary and foreign keys. These smaller tables will store all the data that you originally had, but they will be easier to manage, organize, and work with, compared to larger tables.
By minimizing data redundancy, normalization reduces storage space and prevents inconsistencies. For example, if a customer's address changes, you only need to update it in one customer table rather than in every order record.
It also helps preserve data integrity by making sure that dependencies are enforced by primary and foreign key constraints. This reduces the likelihood of insertion, update, and deletion anomalies.
A normalized database is easier to understand and maintain, which contributes to a well-designed database system.
The set of rules that you should follow to organize your tables are called normal forms. Each normal form builds on the previous ones, with rules becoming stricter as you move to higher normal forms. The First Normal Form (1NF) is the foundational form, with the most basic rules.
A table is in First Normal Form (1NF) if:
students table, you may have a student with multiple phone numbers. Instead of storing them as a comma-separated list in one cell, you should create a separate phone_numbers table with a foreign key pointing to the students table and only store one value in each cell. This keeps each value atomic and the design normalized.Essentially, the First Normal Form (1NF) focuses on making values simple and atomic.
The Second Normal Form (2NF) builds on 1NF, requiring that all 1NF requirements are met plus additional constraints.
Understanding 2NF requires familiarity with two key concepts: superkeys and candidate keys. A superkey is any set of attributes that uniquely identifies each row in a table, meaning no two rows can have identical values across all columns in the superkey.
For example, if we have a customers table with three columns:
customer_id | name | email
Let's say that customer_id and email must be unique for each record. Some examples of superkeys for this table would be:
customer_id{ customer_id, name }, this combination uniquely identifies rows because customer_id alone is unique, regardless of name's uniqueness.{ customer_id, name, email }email, since we are adding the unique constraint to this column too, each record can be uniquely identified by the email.Any set of attributes that, together, identify each row can be a superkey.
A candidate key is similar and yet, a bit different. It's a set of one or more columns on the table that can uniquely identify each record. A table can have multiple candidate keys but you'll need to choose one to act as the primary key.
They sound quite similar, right? Their difference is that the superkey can also contain additional attributes that are not necessary to identify each row uniquely, while the candidate key only contains attributes that do identify the row uniquely.
In our customers example, we mentioned that customer_id, { customer_id, name }, { customer_id, name, email }, and email can be superkeys. In this case, only customer_id and email can be candidate keys because candidate keys can't include attributes that don't identify the row uniquely, like name. This illustrates the minimality principle that distinguishes candidate keys from superkeys.
Now that you're familiar with these concepts, let's go back to the Second Normal Form (2NF).
The Second Normal Form (2NF) is based on addressing partial dependencies. A partial dependency occurs in a table when an attribute that is not part of the primary key only depends on part of a composite primary key (a composite key is a primary key made of multiple columns). This can lead to data redundancy and update anomalies.
For example, let's say we have an orders table with these columns:
order_id | item_id | order_date | quantity | order_shipping_city
In this table, the primary key is the combination of order_id and item_id because the same item ID can be in different orders, but their combination will be unique. You can see that there is a partial dependency between order_id and order_shipping_city. order_id is part of the primary key. order_shipping_city depends on order_id because every order with the same ID will have the same shipping city. However, the shipping city does not depend on the item_id, but this is also part of the primary key. Therefore, order_shipping_city does not depend on the entire primary key.
To fix this, you can split the table into two smaller tables, like order_header and order_items. In the order_header table, you could store the top-level information about the orders:
order_id | order_date | order_shipping_city
In the order_items table, you could store information about the items in the different orders that were submitted:
order_id | item_id | quantity
With these changes, both tables will be in Second Normal Form (2NF).
And that leads us to the Third Normal Form (3NF). This normal form builds on the second normal form.
For a table to be in Third Normal Form, it must:
To understand the Third Normal Form (3NF), you need to understand transitive dependencies. A transitive dependency occurs when an attribute that is not part of the primary key depends on another attribute that is not part of the primary key, which in turn depends on the primary key.
For example, let's modify our orders table to have these columns:
order_id | customer_id | customer_city | city_postal_code | order_date | quantity
The primary key in this new table is order_id because it uniquely identifies each row.
There is a transitive dependency in this table:
order_id > customer_id > customer_city > city_postal_code
The city's postal code is determined by the customer's city, which is determined by the customer's ID, which is determined by the order's ID. city_postal_code, an attribute that is not part of the primary key, is determined by customer_city, which is also not part of the primary key, but customer_city is determined by order_id through customer_id. This means that the city's postal code will be repeated for every order placed by customers who live in the same city. This will lead to data redundancy. Additionally, any change to a city's postal code would require updates across multiple records.
To solve the transitive dependency, you would need to split the table into multiple tables. First, an orders table:
order_id | customer_id | order_date | quantity
Then, a customers table:
customer_id | city_name
And a cities table:
city_name | city_postal_code
This removes the transitive dependency from the table.
And finally, we have the Boyce-Codd Normal Form (BCNF). This is also known as Normal Form 3.5 because it addresses some anomalies that the Third Normal Form may not address. For a table to be in the Boyce-Codd Normal Form (BCNF), it has to:
The goal of this normal form is to ensure that every attribute or set of attributes that determines another attribute is a superkey, which is a candidate key or a superset of them.
Understanding and applying the normal forms is essential for designing robust and efficient relational databases. By eliminating data redundancy and ensuring logical data dependencies, normalization leads to improved data integrity, reduced storage requirements, and simplified database maintenance. You should usually aim to reach the Third Normal Form (3NF).
Which normal form is primarily concerned with replacing groups of data within a single row with single values?
First Normal Form (1NF)
Second Normal Form (2NF)
Think about the most basic level of organization within a table and what constitutes a single, indivisible piece of data.
Third Normal Form (3NF)
Think about the most basic level of organization within a table and what constitutes a single, indivisible piece of data.
Boyce-Codd Normal Form (BCNF)
Think about the most basic level of organization within a table and what constitutes a single, indivisible piece of data.
1
A table is in Second Normal Form (2NF) if it is in First Normal Form (1NF) and if which of the following is true?
All non-key attributes are transitively dependent on the primary key.
Think about the issue that the Second Normal Form (2NF) addresses when a table has a composite primary key.
All non-key attributes are fully functionally dependent on the entire primary key.
Every determinant is a candidate key.
Think about the issue that the Second Normal Form (2NF) addresses when a table has a composite primary key.
There are no repeating groups of data.
Think about the issue that the Second Normal Form (2NF) addresses when a table has a composite primary key.
2
Which normal form addresses transitive dependencies, where a non-key attribute depends on another non-key attribute?
First Normal Form (1NF)
Think about the indirect relationships between non-key attributes and the primary key that this normal form aims to eliminate.
Second Normal Form (2NF)
Think about the indirect relationships between non-key attributes and the primary key that this normal form aims to eliminate.
Third Normal Form (3NF)
Boyce-Codd Normal Form (BCNF)
Think about the indirect relationships between non-key attributes and the primary key that this normal form aims to eliminate.
3