src/postgres/third-party-extensions/postgresql_anonymizer/docs/how-to/1-static_masking.md
Static Masking is the simplest way to hide personal information! This idea is simply to destroy the original data or replace it with an artificial one.
Over the years, Paul has collected data about his customers and their purchases in a simple database. He recently installed a brand new sales application and the old database is now obsolete. He wants to save it and he would like to remove all personal information before archiving it.
In this section, we will learn:
DROP TABLE IF EXISTS customer CASCADE;
DROP TABLE IF EXISTS payout CASCADE;
CREATE TABLE customer (
id SERIAL PRIMARY KEY,
firstname TEXT,
lastname TEXT,
phone TEXT,
birth DATE,
postcode TEXT
);
Insert a few persons:
INSERT INTO customer
VALUES
(107,'Sarah','Conor','060-911-0911', '1965-10-10', '90016'),
(258,'Luke', 'Skywalker', NULL, '1951-09-25', '90120'),
(341,'Don', 'Draper','347-515-3423', '1926-06-01', '04520')
;
SELECT * FROM customer;
Sales are tracked in a simple table:
CREATE TABLE payout (
id SERIAL PRIMARY KEY,
fk_customer_id INT REFERENCES customer(id),
order_date DATE,
payment_date DATE,
amount INT
);
Let's add some orders:
INSERT INTO payout
VALUES
(1,107,'2021-10-01','2021-10-01', '7'),
(2,258,'2021-10-02','2021-10-03', '20'),
(3,341,'2021-10-02','2021-10-02', '543'),
(4,258,'2021-10-05','2021-10-05', '12'),
(5,258,'2021-10-06','2021-10-06', '92')
;
CREATE EXTENSION IF NOT EXISTS anon CASCADE;
SELECT anon.init();
SELECT setseed(0);
Paul wants to hide the last name and the phone numbers of his clients.
He will use the fake_last_name() and partial() functions for that:
SECURITY LABEL FOR anon ON COLUMN customer.lastname
IS 'MASKED WITH FUNCTION anon.fake_last_name()';
SECURITY LABEL FOR anon ON COLUMN customer.phone
IS 'MASKED WITH FUNCTION anon.partial(phone,2,$$X-XXX-XX$$,2)';
SELECT anon.anonymize_table('customer');
SELECT id, firstname, lastname, phone
FROM customer;
This is called
Static Maskingbecause the real data has been permanently replaced. We'll see later how we can use dynamic anonymization or anonymous exports.
Declare a new masking rule and run the static anonymization function again.
Paul realizes that the postcode gives a clear indication of where his customers live. However he would like to have statistics based on their "postcode area".
Add a new masking rule to replace the last 3 digits by 'x'.
Aggregate the customers based on their anonymized postcode.
Paul wants age-based statistic. But he also wants to hide the real birth date of the customers.
Replace all the birth dates by January 1rst, while keeping the real year.
::: tip
HINT: You can use the make_date function !
:::
Even if the "customer" is properly anonymized, we can still isolate a given individual based on data stored outside of the table. For instance, we can identify the best client of Paul's boutique with a query like this:
WITH best_client AS (
SELECT SUM(amount), fk_customer_id
FROM payout
GROUP BY fk_customer_id
ORDER BY 1 DESC
LIMIT 1
)
SELECT c.*
FROM customer c
JOIN best_client b ON (c.id = b.fk_customer_id)
::: important
This is called Singling Out a person.
:::
We need to anonymize even further by removing the link between a person and its company. In the "order" table, this link is materialized by a foreign key on the field "fk_company_id". However we can't remove values from this column or insert fake identifiers because if would break the foreign key constraint.
How can we separate the customers from their payouts while respecting the integrity of the data?
Find a function that will shuffle the column "fk_company_id" of the "payout" table
::: tip
HINT: Check out the static masking section of the documentation :::
SECURITY LABEL FOR anon ON COLUMN customer.firstname
IS 'MASKED WITH FUNCTION anon.fake_first_name()';
SELECT anon.anonymize_table('customer');
SELECT id, firstname, lastname
FROM customer;
SECURITY LABEL FOR anon ON COLUMN customer.postcode
IS 'MASKED WITH FUNCTION anon.partial(postcode,2,$$xxx$$,0)';
SELECT anon.anonymize_table('customer');
SELECT id, firstname, lastname, postcode
FROM customer;
SELECT postcode, COUNT(id)
FROM customer
GROUP BY postcode;
SECURITY LABEL FOR anon ON COLUMN customer.birth
IS 'MASKED WITH FUNCTION make_date(EXTRACT(YEAR FROM birth)::INT,1,1)';
SELECT anon.anonymize_table('customer');
SELECT id, firstname, lastname, birth
FROM customer;
Let's mix up the values of the fk_customer_id:
SELECT anon.shuffle_column('payout','fk_customer_id','id');
Now let's try to single out the best client again :
WITH best_client AS (
SELECT SUM(amount), fk_customer_id
FROM payout
GROUP BY fk_customer_id
ORDER BY 1 DESC
LIMIT 1
)
SELECT c.*
FROM customer c
JOIN best_client b ON (c.id = b.fk_customer_id);
WARNING
Note that the link between a customer and its payout is now
completely false. For instance, if a customer A had 2 payouts. One of
these payout may be linked to a customer B, while the second one is
linked to a customer C.
In other words, this shuffling method with respect the foreign key constraint (aka the referential integrity) but it will break the data integrity. For some use case, this may be a problem.
In this case, Pierre will not be able to produce a BI report with the shuffle data, because the links between the customers and their payments are fake.