So for GDPR compliance there is no other way than to dump the partitions containing the data to delete, remove those rows from the dump then drop the partitions and import the edited dump? How can this be done with incoming data and requests hitting the valid data? Wouldn’t that require to interrupt the service for as long as the op would take? How are users usually handling such situation?
Hi there –
First, a remidner that GDPR does not oblige to real-time instant deletion. You can implement a once per day (or less frequent) removal of affected records. You could mark a record for deletion, and then delete as a regular batch when needed. Also, since what you want to remove is personal data and we are talking analytics here, you could always select all the data for the affected customer, and reingest the rows (using DEDUP) with null values for any data that is deemed as identifiable. As long as your upsert keys don’t include the data deemed as personal, you could have instant updates of the rows for your eligible users.
Alternatively, its possible using a copy-on-write process with two tables. You can copy preserved rows to the secondary table without the deleted rows. When querying, you query both tables and merge the results. It is a pain point to handle GDPR and HIPAA.
Many users tier their data already, either by using SAMPLE BY to create rollups, or simply by deleting data older than N days as part of their normal data management.
You can use UPDATE to anonymise the data for right to be forgotten, depending on the compliance requirements.
Finally, you could go with the deleted
column and erase all sensitive column values when marking the row as deleted. That can be done efficiently with deduplication. Of course, it’s a good idea to “GC” deleted rows occasionally just to save disk space. That can be done by inserting “clean” partition rows into a temp table, dropping the original partition and the inserting the “clean” rows back.