a data de-identification procedure
identifying fields within a data record are replaced by one or more artificial identifiers called pseudonyms
pseudonyms make a data record less identifiable without sacrificing data analysis and processing
protect user identities
secure a dataset from identification
🚨 GDPR 🚨
An EU regulation governing how companies can collect and use personal data. GDPR was adopted in 2016, with a two-year transition period for companies to adjust before enforcement began.
May 25, 2018
Lawfulness, Fairness and Transparency
Purpose Limitation
Data Minimization
Accuracy
Storage Limitation
Integrity and Confidentiality
GDPR sets a broad definition of what constitutes personal data.
Any information that relates to an individual, or that could identify a user, is now off-limits for most purposes without a lawful basis such as proper consent.
Countries in the European Union (EU) ...actually 🤔
Article 3(1) "applies to the processing of personal data in the context of the activities of an establishment of a controller or a processor in the Union, regardless of whether the processing takes place in the Union or not"
🎉 no more legal stuff 🎉
PSEUDONYMIZATION
An approach for treating personal data so that it cannot be used to identify individual users without the use of additional information. Most techniques involve replacing data with a placeholder value, or pseudonym. This pseudonym may be a masked version of a record or a token used for retrieving the original value.
Recommended by GDPR
ANONYMIZATION
A permanent de-identification of a data set so that no party can identify the individuals it refers to, no matter what additional data they possess. Since anonymized data cannot be used to identify any individual, it is no longer considered personal data and as such does not fall under the purview of GDPR.
altering or replacing a record or part of a record without changing its format. For example, an unmasked social security number (SSN) might be stored as 679-69-8549, but a masked SSN using a technique to substitute the digits might look like 145-126-7741.
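A minimal sketch of masking by digit substitution, assuming a fixed digit-for-digit table. The table and the names `mask_digits` and `unmask_digits` are made up for illustration; a fixed table like this is trivially reversible and is not a recommendation for real data.

```python
# Hypothetical substitution table: every digit maps to a different digit.
_FORWARD = str.maketrans("0123456789", "7294015836")
_BACKWARD = str.maketrans("7294015836", "0123456789")


def mask_digits(value):
    """Swap each digit for its substitute; dashes and other characters stay put."""
    return value.translate(_FORWARD)


def unmask_digits(value):
    """Undo mask_digits by applying the inverse table."""
    return value.translate(_BACKWARD)


masked = mask_digits("679-69-8549")
print(masked)                 # 586-56-3106
print(unmask_digits(masked))  # 679-69-8549
```

Because only digits are substituted, the separators and the length of the input are unchanged in the output.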
a technique for replacing specific personal data with less specific values. For example, if a user's date of birth is August 20, 1997, then an approximated record might be stored as July 1 to September 25, 1997, or even just 1997.
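As a rough illustration of replacing a date with a less specific value, here is a sketch that reduces a date of birth to a calendar quarter or a year. The function name `generalize_date` and the two granularity levels are assumptions chosen for the example.

```python
from datetime import date


def generalize_date(value, granularity="year"):
    """Return a coarser, less identifying label for a date."""
    if granularity == "year":
        return str(value.year)
    if granularity == "quarter":
        quarter = (value.month - 1) // 3 + 1
        return f"{value.year}-Q{quarter}"
    raise ValueError(f"unknown granularity: {granularity!r}")


birthday = date(1997, 8, 20)
print(generalize_date(birthday, "quarter"))  # 1997-Q3
print(generalize_date(birthday))             # 1997
```

Unlike masking, this is one-way: the exact day cannot be recovered from the stored label.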
a cryptographic process that converts data into an unreadable format (ciphertext) so that only individuals or systems with access to the appropriate key can decrypt and read it.
Is encryption mandatory? No, not according to GDPR.
tokenizing a piece of data means replacing it with a unique token that acts as a stand-in and can be used to retrieve the original value.
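To make the idea of a token as a stand-in concrete, here is a small in-memory sketch. `TokenVault` is a hypothetical class, not part of any library, and a real token store would live in a separate, access-controlled system rather than in a dictionary.

```python
import secrets


class TokenVault:
    """In-memory stand-in for a token store (hypothetical, for illustration)."""

    def __init__(self):
        self._value_by_token = {}
        self._token_by_value = {}

    def tokenize(self, value):
        # Reuse the existing token so the same value always maps to one token.
        if value in self._token_by_value:
            return self._token_by_value[value]
        token = secrets.token_urlsafe(16)
        self._value_by_token[token] = value
        self._token_by_value[value] = token
        return token

    def detokenize(self, token):
        # Raises KeyError for tokens this vault never issued.
        return self._value_by_token[token]


vault = TokenVault()
token = vault.tokenize("679-69-8549")
print(token)                    # random; carries no information about the SSN
print(vault.detokenize(token))  # 679-69-8549
```

The token is random, so the original value can only be recovered by someone with access to the vault.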
Python already supports a common pattern that lets an engineer replace an attribute with a set of methods that intercept values as they're written and read: either getter/setter methods or properties.
For the following examples we'll use an overly simplified data masking and unmasking algorithm:
def shift(string, reverse=False):
    """Shift each ASCII letter one place within its case range, wrapping around."""
    step = -1 if reverse else 1
    new_string = []
    for char in string:
        value = ord(char)
        if ord('A') <= value <= ord('Z'):
            low, high = ord('A'), ord('Z')
        elif ord('a') <= value <= ord('z'):
            low, high = ord('a'), ord('z')
        else:
            # Leave anything that is not an ASCII letter untouched.
            new_string.append(char)
            continue
        values = range(low, high + 1)
        index = values.index(value) + step
        new_string.append(chr(values[index % len(values)]))
    return ''.join(new_string)
then to mask and unmask...
def mask(value):
    return shift(value)


def unmask(value):
    return shift(value, reverse=True)
looks something like:
$ python simple.py Frank Valcarcel
MASKING "Frank Valcarcel"
MASKED "Gsbol Wbmdbsdfm"
UNMASKED "Frank Valcarcel"
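The contents of simple.py are not shown, so the following is only a guess at a script that would print those three lines. To stay short it swaps the loop-based shift for an equivalent `str.translate` table; `main` and the table names are assumptions.

```python
import string
import sys

# Rotate each ASCII letter one place within its own case. str.translate leaves
# any character that is missing from the table unchanged.
_LOWER, _UPPER = string.ascii_lowercase, string.ascii_uppercase
_PLAIN = _LOWER + _UPPER
_ROTATED = _LOWER[1:] + _LOWER[:1] + _UPPER[1:] + _UPPER[:1]
_FORWARD = str.maketrans(_PLAIN, _ROTATED)
_BACKWARD = str.maketrans(_ROTATED, _PLAIN)


def mask(value):
    return value.translate(_FORWARD)


def unmask(value):
    return value.translate(_BACKWARD)


def main(argv):
    """Mask and unmask the words given on the command line."""
    value = ' '.join(argv)
    masked = mask(value)
    print(f'MASKING "{value}"')
    print(f'MASKED "{masked}"')
    print(f'UNMASKED "{unmask(masked)}"')


if __name__ == '__main__':
    main(sys.argv[1:])
```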
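getter/setter
class User:
    def __init__(self):
        self._name = None

    @property
    def name(self):
        # Unmask on read so callers only ever see the real value.
        return unmask(self._name)

    @name.setter
    def name(self, value):
        # Mask on write so only the pseudonym is stored.
        self._name = mask(value)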
looks something like:
$ python
>>> user = User()
>>> user.name = "Frank Valcarcel"
>>> print(f'user.name => {user.name}')
user.name => Frank Valcarcel
>>> print(f'user._name => {user._name}')
user._name => Gsbol Wbmdbsdfm
from django.db import models
from django.contrib.auth.models import AbstractUser


class User(AbstractUser):
    name = models.CharField(max_length=128, blank=True)
    phone = models.CharField(max_length=17, blank=True)
    date_of_birth = models.DateField(blank=True, null=True)
    ip_address = models.GenericIPAddressField(blank=True, null=True)
which fields are personal data?
ALL OF THEM
utils.py
def shift(string, reverse=False):
    """Shift ord value of character within range."""
    new_string = []
    for char in string:
        ...
    return ''.join(new_string)


def mask(value):
    return shift(value)


def unmask(value):
    return shift(value, True)
Updated user/models.py
from django.db import models
from django.contrib.auth.models import AbstractUser

from app.utils import mask, unmask


class User(AbstractUser):
    _name = models.CharField(max_length=128, blank=True)

    @property
    def name(self):
        return unmask(self._name)

    @name.setter
    def name(self, value):
        self._name = mask(value)
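If several attributes need the same treatment, writing one property pair per field gets repetitive. One possible way to factor it out in plain Python is a descriptor. `MaskedAttribute` is a hypothetical name, the `mask`/`unmask` functions below are stand-ins that simply reverse the string, and this sketch has not been tried against Django's model machinery.

```python
def mask(value):
    # Stand-in for the real masking function: reverses the string.
    return value[::-1]


def unmask(value):
    # Stand-in inverse of mask.
    return value[::-1]


class MaskedAttribute:
    """Descriptor that stores a masked value under a leading-underscore name."""

    def __set_name__(self, owner, name):
        # 'name' on the class is backed by '_name' on the instance.
        self.storage_name = f'_{name}'

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        stored = getattr(instance, self.storage_name, None)
        return None if stored is None else unmask(stored)

    def __set__(self, instance, value):
        masked = None if value is None else mask(value)
        setattr(instance, self.storage_name, masked)


class User:
    name = MaskedAttribute()
    phone = MaskedAttribute()


user = User()
user.name = "Frank Valcarcel"
print(user.name)   # Frank Valcarcel
print(user._name)  # lecraclaV knarF
```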
☐ Update the QuerySet
☐ Exclude Pseudonyms
☐ Update django-admin
as defined, our model cannot be queried using any of the unmasked values, e.g. we cannot call methods like User.objects.filter(name='Frank')
to get this working, we're going to override our model's QuerySet, more specifically the _filter_or_exclude method
let's look at the source code... django.db.models.query
so let's just insert our masked values, and call super() on the parent models.QuerySet for everything else
class UserQuerySet(models.QuerySet):
    # _filter_or_exclude is a private Django method, so its signature
    # may differ between Django versions.
    def _filter_or_exclude(self, negate, *args, **kwargs):
        masked_fields = ['name']
        for field in masked_fields:
            value = kwargs.pop(field, None)
            if value is not None:
                # Query the stored column with the masked form of the value.
                kwargs[f'_{field}'] = mask(value)
        return super()._filter_or_exclude(negate, *args, **kwargs)
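To finish this, we override AuthUserManager.get_queryset
# Assumed alias: AuthUserManager is Django's built-in UserManager.
from django.contrib.auth.models import UserManager as AuthUserManager


class UserManager(AuthUserManager):
    def get_queryset(self):
        return UserQuerySet(self.model, using=self._db)


class User(AbstractUser):
    ...
    objects = UserManager()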
✅ User.objects.filter(name='Frank')
✅ User.objects.exclude(name='Frank')
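The keyword rewriting done inside `_filter_or_exclude` can be tried without a database. The sketch below is a standalone approximation, not Django code: `translate_lookups` is a hypothetical helper, and it additionally accepts the explicit `name__exact` form. Other lookups are passed through untouched and would still need their own handling.

```python
def translate_lookups(kwargs, masked_fields, mask):
    """Rewrite exact-match lookups on masked fields to target the stored column."""
    translated = {}
    for key, value in kwargs.items():
        field, sep, lookup = key.partition('__')
        is_exact = lookup in ('', 'exact')
        if field in masked_fields and is_exact and value is not None:
            # 'name' becomes '_name', 'name__exact' becomes '_name__exact'.
            translated[f'_{field}{sep}{lookup}'] = mask(value)
        else:
            translated[key] = value
    return translated


print(translate_lookups({'name': 'Frank', 'is_active': True}, ['name'], mask=str.upper))
# {'_name': 'FRANK', 'is_active': True}
```

Here `str.upper` is only a placeholder for the real masking function, so the effect on the value is easy to see.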
☑️ Update the QuerySet
☐ Exclude Pseudonyms
☐ Update django-admin
pseudonyms are useless to calling code and will pollute our model instances
django has a method, defer, that we can call on QuerySets, so let's look at the source code again... QuerySet.defer()
This will be an easy addition now that we override models.QuerySet
class UserQuerySet(models.QuerySet):
    def _filter_or_exclude(self, negate, *args, **kwargs):
        masked_fields = ['name']
        # Names of the stored (masked) columns, e.g. '_name'.
        pseudonym_fields = [f'_{field}' for field in masked_fields]
        for field in masked_fields:
            value = kwargs.pop(field, None)
            if value is not None:
                kwargs[f'_{field}'] = mask(value)
        queryset = super()._filter_or_exclude(negate, *args, **kwargs)
        # Defer the pseudonym columns so they are not loaded onto instances.
        return queryset.defer(*pseudonym_fields)
>>> u = User.objects.get(name='Frank Valcarcel')
>>> u.__dict__
{'id': 1, 'password': 'pbkdf2_sha256...=', 'last_login': datetime.datetime(2018, 6, 11, 16, 20, 19, 775436, tzinfo=<UTC>), 'is_superuser': True, 'username': 'frank', 'first_name': 'Frank', 'last_name': '', 'email': 'katie@cuttlesoft.com', 'is_staff': True, 'is_active': True, 'date_joined': datetime.datetime(2018, 6, 11, 16, 20, 14, 334272, tzinfo=<UTC>)}
Note: I did not override the all() query, but query methods that call filter, like get, will behave this way.
☑️ Update the QuerySet
☑️ Exclude Pseudonyms
☐ Update django-admin
write/read is masked/unmasked... but what about the django-admin?
We need to update our admin so we can read and write appropriately
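from django import forms
from django.contrib import admin
from django.contrib.auth.admin import UserAdmin as AuthUserAdmin
from django.contrib.auth.forms import UserChangeForm as AuthUserChangeForm

from .models import User  # assumes admin.py sits next to models.py


class UserChangeForm(AuthUserChangeForm):
    name = forms.CharField()

    class Meta:
        model = User
        fields = ('username', 'password', 'name')

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Show the unmasked value and reuse the validators of the stored column.
        self.fields['name'].initial = self.instance.name
        self.fields['name'].validators = self._meta.model._meta.get_field('_name').validators

    def clean(self, *args, **kwargs):
        cleaned_data = super().clean(*args, **kwargs)
        # Assign through the property so the value is masked before saving.
        self.instance.name = self.cleaned_data.get('name')
        return cleaned_data


@admin.register(User)
class UserAdmin(AuthUserAdmin):
    form = UserChangeForm
    fieldsets = [
        (None, {'fields': ['username', 'password']}),
        ('Personal Data', {'fields': ['name']}),
    ]
    list_display = ('username', 'name')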
☑️ Update the QuerySet
☑️ Exclude Pseudonyms
☑️ Update django-admin
coming soon