One of the most common problems that IT departments face is moving production data into down-level systems. Sometimes this means pushing the data from production to the development environment, to the QA environment, or to some other environment within the enterprise.
With the rise of data security requirements and regulations that need to be followed, this process can be extremely time-consuming. A typical flow starts with a restore of the production database to the down-level system. That step alone can take hours or days depending on factors such as network speed, IO speed on the down-level environment, other work being performed on the down-level environment, and the size of the database (among other factors). Once the database restore has completed, the data in the database then needs to be de-identified.
De-identification of data is a critical step in this process. Many companies have compliance regulations that require customer, supplier, or employee data to be masked as the data is moved from production to down-level environments. Beyond compliance regulations, companies that have users, customers, or employees in regions such as Canada or the European Union are subject to requirements to be able to “forget” users, including in any down-level environments to which the data has been moved. Because of the regulations around personal data, de-identification of personal data becomes a “must do” very quickly, even for companies that don’t do business in the regions involved. The EU GDPR is the perfect example of this, as it covers EU citizens living in other countries. If you are a US-based company doing business with an EU citizen who lives in the US, that person is still covered by GDPR and their data needs to be handled accordingly.
Proper de-identification of data needs to be done in such a way that there is no means of reverse engineering the process and getting back to the original values, which would allow customers to be identified from the data that lives in the down-level environment. If there is no way to identify a specific person in the dataset, we have satisfied the requirements of the various laws that companies need to abide by.
While simple UPDATE statements can reset every record to the name “John Doe” and every email address to “[email protected]”, it takes far more than that to remove personally identifiable information from the database. First, you need to identify which fields hold personally identifiable data, and it’s going to be more than just Name and Email Address. Fields like Address would also qualify, as would Medical Conditions if you’re working with a medical database, and so on. Most development and QA systems are also going to need the records to retain some sort of uniqueness, while maintaining the inability to roll back the change and get to the original value.
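As a minimal sketch of what that can look like, the T-SQL below masks a hypothetical dbo.Customers table; the table name, the column names, and the replacement email domain are all assumptions for illustration, not part of any real schema. Because the replacement values are derived from the row’s own key, every record stays unique, yet nothing in the result can be traced back to the original name or email address.

-- Minimal masking sketch against a hypothetical dbo.Customers table.
-- CustomerID, FirstName, LastName, and EmailAddress are assumed column names.
UPDATE dbo.Customers
SET FirstName    = 'John',
    LastName     = CONCAT('Doe ', CustomerID),
    EmailAddress = CONCAT('customer', CustomerID, '@example.com');

Deriving the fake values from the surrogate key is one simple way to keep rows distinguishable; randomly generated values per row would work just as well, as long as the original value is never an input that could be reversed.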
Typically, companies handle this entire process as a manual workflow, potentially with a couple of different people handling the different steps. A Production Database Administrator takes a backup of production (if one doesn’t already exist that can be used), which is then handed off to a Development Database Administrator. The Development Database Administrator restores that backup to the development system and then updates whatever fields have been identified as holding private information. The success or failure of this depends on whether the Development Database Administrator who runs the de-identification code runs all of the scripts, and whether those scripts run without error.
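To make the hand-off concrete, the commands below are a minimal sketch of that backup-and-restore step. The database names, backup path, logical file names, and target file paths are all hypothetical and would need to be replaced with the real values for the environment.

-- Production DBA: take a copy-only full backup so the normal backup chain is not disturbed.
BACKUP DATABASE SalesProd
    TO DISK = N'\\backupshare\SalesProd_Full.bak'
    WITH COPY_ONLY, COMPRESSION;

-- Development DBA: restore the backup as the development copy of the database.
RESTORE DATABASE SalesDev
    FROM DISK = N'\\backupshare\SalesProd_Full.bak'
    WITH MOVE N'SalesProd_Data' TO N'D:\SQLData\SalesDev.mdf',
         MOVE N'SalesProd_Log'  TO N'L:\SQLLogs\SalesDev.ldf',
         REPLACE;

Only after the restore completes do the de-identification scripts run, which is exactly where the process tends to slow down and where a missed script leaves personal data exposed.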
Accounting for all of this work is going to take time, potentially a lot of time. This is especially true if the scripts have to modify every row in the database, probably using a cursor so that the transaction log on the database doesn’t fill. Depending on the size of the database, these scripts could take days to run, and those are days during which the development team is unable to work, because they can’t access the database until no personal information remains within it.
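As a sketch of how that per-row work is often kept from filling the log, the loop below masks the same hypothetical dbo.Customers table in batches rather than with a row-by-row cursor; the table, column names, batch size, and the assumption that the database uses the SIMPLE recovery model are all illustrative, not prescriptive.

-- Mask the hypothetical dbo.Customers table in batches so each transaction,
-- and therefore the active portion of the transaction log, stays small.
DECLARE @BatchSize INT = 10000;

WHILE 1 = 1
BEGIN
    UPDATE TOP (@BatchSize) c
    SET    FirstName    = 'John',
           LastName     = CONCAT('Doe ', c.CustomerID),
           EmailAddress = CONCAT('customer', c.CustomerID, '@example.com')
    FROM   dbo.Customers AS c
    WHERE  c.EmailAddress IS NULL
       OR  c.EmailAddress NOT LIKE '%@example.com';   -- skip rows already masked

    IF @@ROWCOUNT = 0 BREAK;   -- nothing left to mask

    CHECKPOINT;   -- under the SIMPLE recovery model, lets the log space used by the batch be reused
END;

Even with batching, a table with hundreds of millions of rows takes a long time to rewrite, which is why the elapsed time quoted to the development team is usually measured in days rather than hours.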
All of this is done so that we as IT professionals can ensure that the data we have been entrusted with, namely our customers’ data, is safely protected while we continue to enhance the systems and products that we manage.
As IT professionals, the responsibility of managing the movement of data from production to non-production environments falls to us, especially when development teams need data that is as close to actual production as possible in order to troubleshoot production problems that users of the application are encountering.
Third-party solutions are going to become key to making the processes described here repeatable and documented.
Database schemas change more frequently than we care to admit, and keeping up with these changes while ensuring that user data is properly protected is a big ask for any IT organization. The larger the organization, the harder it becomes to track all the changes to the schema and the platform. Being able to offload those requirements to a software solution that helps with this process becomes key to making the process successful, repeatable, and documentable.