Developing applications that never delete
Software is a funny business. For example, you have Account A and Account B. You withdraw from one and add to the other inside of a transaction. Except that’s not how accounting works. You should probably do this in a different way.
It is not only possible to write a database application that never issues an update or a delete, but often practical. Developers of IoT (Internet of Things) applications do this all the time. Devices send time series data, usually status information, which goes in a table with the time stamp. Regardless of whether you are using a traditional database like Oracle, a newer distributed SQL database like CockroachDB, Yugabyte, or MariaDB Xpand, or even a NoSQL database like MongoDB, the method is essentially the same.
Consider a table like this:
Customer { id BIGINT(0) UNSIGNED AUTO_UNIQUE NOT NULL, name_given TINYTEXT, name_middle TINYTEXT, name_family TINYTEXT, email [varchar] TINYTEXT, dob DATETIME }
An update is required if the customer changes their email or family name. However, this means history is lost. An update can logically be thought of as a delete and an insert. Another way of doing it would be something like:
Customer { entry_id BIGINT(0) UNSIGNED AUTO_UNIQUE NOT NULL, entry_date TIMESTAMP NOT NULL, id BIGINT(0) UNSIGNED NOT NULL, name_given TINYTEXT, name_middle TINYTEXT, name_family TINYTEXT, email [varchar] TINYTEXT, dob DATETIME }
The entry_id
becomes the unique key for the row, but the id
is the key identifying that unique person. To find someone’s current name and email you would issue a query like:
select … from Customer where id=1 and entry_date = (select max(entry_date) from customer where id =1)
This query pulls the last entry for the customer where id
equals 1. To change the customer’s email or family name you simply insert a new row with an id
of 1 and a new row. (Note: Don’t do max(entry_id)
if the id
is an auto_unique
and not a sequence.)
This has a clear disadvantage in that you need a subquery and another join. However, it has a clear advantage if, for instance, some communication or other data comes back with the old family name or the firm receives an email from an old email address. Another advantage is that it dates information. In some jurisdictions information is required to be purged upon request or based on the date it was captured. This design makes that easy.
There are some other issues to consider. Consider the problem of finding customers who were shipped a particular item. You might have Customer
, Order
, Shipment
, and Shipped_Item
tables. Assuming you want only the “current” record and that all of the tables are versioned, you end up with at least three subqueries. Instead you can have a more traditional structure like the first customer table definition but issue inserts on delete with an archive table:
Customer_Archive { archive_id BIGINT(0) UNSIGNED AUTO_UNIQUE NOT NULL, customer_id BIGINT(0) UNSIGNED NOT NULL, entry_date TIMESTAMP NOT NULL, name_given TINYTEXT, name_middle TINYTEXT, name_family TINYTEXT, email [varchar] TINYTEXT, dob DATETIME }
The advantage of this is that only the current record is in the Customer
, Order
, Shipment
, and Shipped_Item
tables and the number of joins is reduced. Plus it maintains a search advantage over audit logs. There is a disadvantage to queries that search current records in combination with history.
In any operational system, one does not want history to get in the way of efficiency. While the application may never delete, some system process may need to purge records older than a given date. Moreover, it may make sense to feed an analytical database some types of data.
Updates and deletes remove history. Regardless of the structure you chose, when designing a database schema, it is sensible to take a note from double entry accounting and consider preserving history in addition to the current state. This principle is not for every application, but it is not merely for IoT or accounting applications.