In my previous article I mentioned that I would be posting an additional article each week – I had written this a few weeks back but due to spending my spare time working on my data privacy app, I hadn’t got round to publishing. I have tried to keep this as short as possible but data masking is such an intricate subject, it’s difficult to not get too technical.
Note. This guidance is my opinion based on experience and this approach should be discussed within your organisation. With any of my articles and opinions, please do provide feedback.
Many organisations are going through a period of review and transition to ensure that data held in test / non-production environments is obfuscated, so that they process personal data in compliance with the DPA / GDPR.
An Oracle report states “Most organizations if not all copy production data into test and development environments to allow system administrators to test upgrades, patches and fixes.” and “Almost 1 out of 4 companies responded that this live data had been lost or stolen and 50% said that they had no way of knowing if the data in non-production environment had been compromised.”
The purpose of this post is to give people, organisations and anyone with an interest a recommendation for carrying out data masking in a compliant and efficient manner, both on internal systems and on those owned and supported by 3rd parties.
Anonymisation vs pseudonymisation
Many organisations currently refer to their data masking activities as anonymisation, which is incorrect under the GDPR because the data can still be re-identified. Under the new legislation this is referred to as ‘pseudonymisation’.
Pseudonymised data allows the data controller to use the information in ways that go beyond the original purpose, giving more flexibility to the controller and more privacy to the data subject – pseudonymisation is best suited to many organisations’ activities moving forward. True anonymisation is impractical in many cases, as it would require a full test data set to be built at substantial cost, with referential integrity across all interlinking systems; without that integrity, the data will likely be unusable.
Pseudonymisation best practice
To carry out pseudonymisation effectively and in a compliant manner, organisations should adhere to the following principles throughout.
First and foremost – get your 3rd parties involved!
Very few applications within an organisation’s IT estate are developed and maintained in-house – the vast majority are provided by 3rd parties, whether as a SaaS tool or internally hosted.
With that in mind, why do organisations spend lots of valuable time working out the data architecture and creating masking procedures, when the solution provider should be doing this for you? They are the experts and should be providing you with a compliant solution – if your use of the system involves personal data then this needs to be factored in at the beginning. This is one of the first approaches you should take, and it may prove incredibly useful and time saving.
(You are likely to find that software providers are reluctant; however, this should become a standard within the industry, and that will be difficult to achieve without a consistent approach from organisations.)
Embed security from the outset;
- Very commonly missed: data masking should be treated the same as, and given the same level of scrutiny as, encryption and the management of keys. As with encryption, if your masking algorithm or process document is not secured and properly segregated, the masking itself is worthless because it can be reverse engineered. A secure process must be in place ensuring that these sensitive documents are appropriately segregated and restricted via access control.
Early engagement for projects and ad hoc requests;
- Organisations must ensure that there is a process available at the requirements gathering phase to define what data is required for test purposes (live or not), so that the request can be fulfilled in a timely manner. This reduces the risk of live data being used due to a time restriction and evidences privacy by design. For 3rd party solutions, this allows the relevant stakeholder to liaise with the 3rd party to ensure that test data is provided, tailored to the organisation’s data architecture. In the event that the 3rd party does not provide such information or isn’t willing to take part, the data architecture will need to be reviewed and understood by your internal team to ensure that a capability can be built (Note, if it cannot be built an organisation MUST document and evidence this decision as this is undermining privacy by design).
Find the data;
- For requests that are addressing existing non-compliances, i.e. an environment that already contains live data, the first step for masking is to find the data. This is best carried out by an automated scanning tool; however, if you are aware of the environments and databases, this can be manually mapped out by viewing the database and table structures. You must ensure that the time is taken at this stage to identify all in scope items, as data remanence can lead to re-identification if not properly segregated.
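To give a feel for what an automated scan does, here is a minimal sketch in Python. It is not a substitute for a proper scanning tool. The pattern names, the regular expressions and the `scan_rows` function are all my own illustrative assumptions, and real patterns would need tuning to your data.

```python
import re

# Hypothetical patterns for illustration only; tune these to your own data.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "uk_postcode": re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}\b", re.IGNORECASE),
    "card_number": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}


def scan_rows(table_name, rows):
    """Return a sorted list of (table, column, pattern_name) findings."""
    findings = set()
    for row in rows:
        for column, value in row.items():
            if value is None:
                continue
            text = str(value)
            for name, pattern in PATTERNS.items():
                if pattern.search(text):
                    findings.add((table_name, column, name))
    return sorted(findings)


# Made-up sample rows standing in for a table read from a test database.
sample = [
    {"id": 1, "contact": "jane@example.com", "notes": "Lives at SW1A 1AA"},
    {"id": 2, "contact": "n/a", "notes": None},
]
findings = scan_rows("customers", sample)
```

Note that the free-text `notes` column is flagged as well as the obvious `contact` column, which is exactly the kind of field that gets missed when mapping structures by hand.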
Identifying fields for masking;
- Due to the nature of data privacy and the combinations for identifying an individual, it is impossible to provide a specific set of data fields that need masking – a ‘top down’ approach is advised to ensure that what you are exposing is necessary and appropriately risk assessed.
- To begin with, the team carrying out the masking should specify the absolute essential fields required for the solution to be effective and provide the required referential integrity – typically this is a customer number or a unique ID (see X for internal identifiers)
- The essential fields must be reviewed by your data protection office so that an SME can provide the relevant sign off that an individual cannot be identified – if an individual can be identified externally, the procedure should be reviewed (internally identifiable may be suitable)
- Once the fields have been agreed on, a procedure can be written taking into account the sensitivity of the document when reviewing, storing and sharing.
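The ‘top down’ approach above can be sketched as an allow-list: only the fields that have been signed off are kept, and everything else is masked by default. This is a minimal illustration, assuming records are simple dictionaries. The field names, `APPROVED_FIELDS` and `apply_allow_list` are hypothetical.

```python
# Hypothetical allow-list: the fields signed off as essential.
APPROVED_FIELDS = {"customer_id", "account_status"}


def apply_allow_list(record, approved=APPROVED_FIELDS, placeholder="MASKED"):
    """Keep approved fields; replace every other value with a placeholder."""
    return {
        field: (value if field in approved else placeholder)
        for field, value in record.items()
    }


record = {
    "customer_id": "C-1001",
    "name": "Jane Doe",
    "email": "jane@example.com",
    "account_status": "active",
}
masked = apply_allow_list(record)
```

The benefit of working this way round is that a new column added to the source system is masked until somebody has reviewed it, rather than exposed until somebody notices it.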
Take a holistic approach;
- When carrying out such exercises it is in the organisation’s best interest to take a holistic view and break down silos when identifying and building a masking plan. As an organisation is likely to be taking this stance with its existing GDPR programme, this should be a great opportunity to take an enterprise wide approach – when doing so, the ability to spot or create a common denominator is critical to having a scalable and manageable masking strategy.
Masking procedures – top three;
- Shuffling – is the process of taking the fields within a subset of data and running a ‘shuffling’ script to move values around. This is an easy method for obfuscating information; however, as the complete database is still intact, it is likely that reverse engineering can be carried out to re-identify.
- Substitution – is the process of substituting a field with another value, working best with common variables such as firstname, lastname, postcode etc. The script will take the value and substitute it with another value out of a lookup table.
- Masking out – is the process of removing the information or masking out the value, similar to the masked PAN information for PCI such as XXXX-XXXX-XXXX-0123 – this is a secure way of obfuscating data where no referential integrity is required.
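The three procedures above can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions (rows held as dictionaries, a small made-up lookup table, hypothetical function names) rather than a production masking script.

```python
import random


def shuffle_column(rows, field, seed=None):
    """Shuffling: redistribute the existing values of one field across the rows."""
    rng = random.Random(seed)
    values = [row[field] for row in rows]
    rng.shuffle(values)
    return [{**row, field: value} for row, value in zip(rows, values)]


def substitute_column(rows, field, lookup, seed=None):
    """Substitution: replace each value with an entry picked from a lookup table."""
    rng = random.Random(seed)
    return [{**row, field: rng.choice(lookup)} for row in rows]


def mask_out(value, visible=4, mask_char="X"):
    """Masking out: hide all but the last `visible` letters/digits, keeping separators.

    Values with `visible` or fewer letters/digits are returned unchanged.
    """
    to_mask = max(sum(ch.isalnum() for ch in value) - visible, 0)
    masked = []
    for ch in value:
        if ch.isalnum() and to_mask > 0:
            masked.append(mask_char)
            to_mask -= 1
        else:
            masked.append(ch)
    return "".join(masked)


# Made-up sample data and lookup table.
FIRST_NAMES = ["Alex", "Sam", "Chris", "Jo"]
rows = [
    {"id": 1, "first_name": "Jane", "postcode": "AB1 2CD"},
    {"id": 2, "first_name": "Omar", "postcode": "EF3 4GH"},
    {"id": 3, "first_name": "Li", "postcode": "IJ5 6KL"},
]
shuffled = shuffle_column(rows, "postcode", seed=1)
substituted = substitute_column(rows, "first_name", FIRST_NAMES, seed=1)
masked_pan = mask_out("4111-1111-1111-0123")
```

Even in this toy version the weakness of shuffling is visible: every original postcode is still present in the output, and a shuffle can leave some values against their original row.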
There are many techniques for masking data and all have their pros and cons – to achieve a satisfactory level of pseudonymisation, a combination of techniques should be used for each respective field to ensure that there is a suitable balance between security and data quality.
Successful pseudonymisation is not just dependent on having a good masking technique and delivering quality data – the environment that the information sits in is equally important. The following issues are common following the implementation of a masking work stream;
- Shared environments – if your environments are shared by multiple projects, you must ensure that they are all using the same pseudonymised data sets. If there is true, live data within the environment, even for a legitimate purpose, the masking will be rendered obsolete, because individuals can be identified by comparing databases on the common identifier.
- Non-production environments must be logically segregated from the production estate, with no connection between them, to ensure that test data is not accidentally put in live environments and vice versa. This should be reflected in user accounts and access control to ensure that there is full auditability throughout. Test accounts should be identifiable.
- Staging environment – If masking cannot be carried out as a procedure in transit, a staging environment will be required to hold the data before it is masked. In the event that this is required, there should be a Dev and Test environment built as close to production controls as possible to ensure the security of the data – this should also be risk assessed and tolerated as part of the test data provisioning process. You must ensure that any data remanence is securely removed from this environment.
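To illustrate the shared environment risk above, the sketch below checks whether a supposedly masked data set still shares identifiers with a live data set sitting in the same environment. It is a simple assumption-laden example: the `customer_id` field name, the sample rows and the `reidentifiable_ids` function are hypothetical.

```python
def reidentifiable_ids(masked_rows, live_rows, key="customer_id"):
    """Return the identifiers that appear in both the masked and the live data set."""
    live_keys = {row[key] for row in live_rows}
    return sorted({row[key] for row in masked_rows if row[key] in live_keys})


# Made-up rows: the masked set has substituted names but an unchanged identifier.
masked_rows = [
    {"customer_id": "C-1001", "name": "Alex Smith"},
    {"customer_id": "C-1002", "name": "Sam Jones"},
]
live_rows = [
    {"customer_id": "C-1001", "name": "Jane Doe"},
]
overlap = reidentifiable_ids(masked_rows, live_rows)
```

Any identifier returned here can be joined straight back to the live record, so the masking on that row offers no protection while the two data sets sit side by side.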
Securing the procedure and process;
As mentioned earlier on, masking procedures should be treated the same as encryption keys and should follow the same level of scrutiny.
Technical restriction or legitimate need
Organisations may find that there is a technical restriction or legitimate need for the use of live data within a non-production environment such as;
- Major incident that the use of live data will mitigate and reduce harm to the data subject
- Testing of a system that has a technical restriction when attempting to use masked data with a clear benefit to the data subject
When an organisation faces the above situations, a process needs to be in place to ensure the following aspects are covered;
- Privacy notice to ensure customers are aware that their data may be used in the event of the above with the below as mitigation
- Process for ensuring that the non-production environment has the controls in place to match that of the production environment or as close to as possible whilst the data resides
- Process to ensure that the request is risk assessed (covering the do nothing vs do something scenarios), reviewed and approved by senior management proactively when possible or reactively if there is a time constraint (critical incident) to ensure that only absolutely necessary requests are progressed
- Process to monitor the environment and ensure that a secure decommissioning process is followed immediately upon completion
- Ensure that the environment is included in the scope of a SARs request
- (This hasn’t been done before as far as I am aware but would be highly recommended) Process to contact the affected individuals before or after to keep them informed of when the data is being used and why – discuss if this is something that you can provide as an opt in
A core component of testing integrity is ensuring that primary systems and downstream systems have referential integrity, which commonly relies on a unique reference or customer number. In an ideal world, all systems that hold data would feed from a central database that is the master for all information, and each record would have a unique customer reference. Unfortunately, this customer identifier is still considered personal data, because the individual can be identified internally, which reduces the effectiveness and compliance of the masking procedures unless proactively addressed.
Options for internal identifiers
- Internally assess the risk of the internal identifier and reduce the risk and accessibility of the master, i.e. restrict all access to anything that gives a user the ability to see what is related to the key
- Replicate your master database and use a compound masking script to mask the unique identifier throughout – provide logical segregation and restrict access to this key, and you have referential integrity that is only re-identifiable if access is granted to the key store, similar to how your internal PKI / key management process works.
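One way to picture the second option is a keyed hash (HMAC) applied to the identifier in every table. This is my own choice of technique for illustration and not necessarily what a given ‘compound masking script’ does. The key value, table contents and function names below are placeholders; in practice the key would be loaded from your restricted key store.

```python
import hashlib
import hmac


def pseudonymise_id(identifier, key, length=16):
    """Return a keyed, deterministic token for an identifier using HMAC-SHA256."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncated for readability; a shorter token raises the chance of collisions.
    return digest[:length]


def mask_table(rows, key, field="customer_id"):
    """Replace the identifier in every row with its keyed token."""
    return [{**row, field: pseudonymise_id(row[field], key)} for row in rows]


# Placeholder key for illustration only; never hard-code a real key.
KEY = b"placeholder-key-load-from-your-key-store"

customers = [{"customer_id": "C-1001", "segment": "retail"}]
orders = [
    {"order_id": 1, "customer_id": "C-1001"},
    {"order_id": 2, "customer_id": "C-2002"},
]
masked_customers = mask_table(customers, KEY)
masked_orders = mask_table(orders, KEY)
```

Because the same identifier always produces the same token under the same key, the customer and order tables still join after masking. Anyone holding the key can recompute the token for a known identifier, which is why access to the key needs the same restriction as the master database itself.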
Embedding your plan as part of a programme (Doesn’t have to be your GDPR programme)
Similar to the identification of required fields, a top down approach should be taken to implementation of a masking strategy. Summary of the steps below – these may repeat steps above;
- Create processes and governance from the outset
- Consult with 3rd party solution owners to understand data architecture and request tailored solutions
- Identify your data
- Analyse structures and identify common denominators
- Devise a plan for pseudonymisation aiming for complete masking across environments
- Mask what you can holistically
- Mask what you can in isolation
- Risk assess and summarise position and analyse what is left, then decide on either;
  - Deleting remaining data
  - Proceeding at risk and updating privacy policies (must have a legitimate reason as per the live data process above and must be temporary until a compliant solution is sought)