Afraid to delete data? Think again

Data is a valuable corporate asset, which is why many organizations have a strategy of never deleting any of it. Yet as data volumes continue to grow, keeping all data around can get very expensive. An estimated 30% of data stored by organizations is redundant, obsolete or trivial (ROT), while a study from Splunk found that 60% of organizations say that half or more of their data is dark — which means its value is unknown.

Some obsolete data may pose a risk as companies deal with the increasing threats of ransomware and cyberattacks; this data may be underprotected and valuable to hackers. In addition, internal policies or industry regulations may require that organizations delete certain data, such as ex-employee data, financial data or PII, after a set period.

Another issue with storing large amounts of obsolete data is that it clutters file servers, draining productivity. A 2021 survey by Wakefield Research found that 54% of U.S. office professionals agreed that they spend more time searching for documents and files than responding to emails and messages.

Being responsible stewards of the enterprise IT budget means that every file must earn its keep down to the last byte. It also means that data should not be prematurely deleted if it has value. A responsible deletion strategy must be executed in stages: inactive cold data should move to less expensive storage and backup resources, and once data becomes obsolete, there should be a methodical way to confine and delete it. The question is how to create an efficient data deletion process that identifies, finds and deletes data systematically.

Barriers to data deletion

Cultural: We are all data hoarders by nature, and without analytics to help us understand what data has truly become obsolete, it’s hard to change an organizational mindset of retaining all data forever. Unfortunately, that mindset is no longer sustainable given the astronomical growth in recent years of unstructured data, from genomics and medical imaging to streaming video, electric cars and IoT products. While deleting data that has no present or potential future purpose is not data loss, most storage admins have suffered the ire of users who inadvertently deleted files and then blamed IT.

Legal/regulatory: Some data must be retained for a given term, although usually not forever. In other cases, corporate policy allows data, such as PII, to be held only for a limited time. How do you know which data is governed by which rule, and how do you prove you are complying?

Lack of systematic tools to understand data usage: Manually figuring out what data has become obsolete and getting users to act on it is tedious and time-consuming, so it never gets done.

Tips for data deletion

Create a well-defined data management policy

Developing a sustainable data lifecycle management policy requires the right analytics. You’ll want to understand data usage so you can identify what can be deleted based on data type (interim data, for example) and data use (data that has not been accessed in a long time, for example). This also helps gain buy-in from business users, because deletion is based on objective criteria rather than a subjective decision.
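As a rough illustration of usage-based analytics, the sketch below walks a directory tree and lists files that have not been read or modified within a given number of days. The function name `find_cold_files` and the idle threshold are hypothetical, and the approach assumes file timestamps are trustworthy; on volumes mounted with access-time updates disabled, only the modification time is meaningful.

```python
import os
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def find_cold_files(root, min_idle_days, now=None):
    """Return (path, size_in_bytes) for files idle at least min_idle_days.

    Hypothetical helper: "idle" here means neither read nor modified,
    judged by the later of the access and modification timestamps.
    """
    now = time.time() if now is None else now
    cutoff = now - min_idle_days * SECONDS_PER_DAY
    cold = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                st = path.stat()
            except OSError:
                continue  # file vanished or is unreadable; skip it
            last_used = max(st.st_atime, st.st_mtime)
            if last_used < cutoff:
                cold.append((str(path), st.st_size))
    return cold
```

Summing the sizes returned for each department or share gives the kind of objective figure that can be put in front of data owners.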

With this knowledge, you can map out how data will transition over time: from primary storage to cooler tiers, possibly in the cloud, to archive storage, then confined out of the user space in a hidden location and, finally, deletion.
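The transition path described above can be sketched as a simple age-based policy. The thresholds below are placeholders chosen for illustration, not recommendations, and `LifecyclePolicy` and `lifecycle_stage` are hypothetical names; a real policy would also account for data type and regulatory holds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LifecyclePolicy:
    # Days since last use at which data moves to each stage (assumed values).
    cool_after_days: int = 90
    archive_after_days: int = 365
    confine_after_days: int = 1095
    delete_after_days: int = 1125  # confinement plus a 30-day grace period


def lifecycle_stage(days_idle, policy=LifecyclePolicy()):
    """Map days since last use to the stage the data should occupy."""
    if days_idle >= policy.delete_after_days:
        return "delete"
    if days_idle >= policy.confine_after_days:
        return "confined"
    if days_idle >= policy.archive_after_days:
        return "archive"
    if days_idle >= policy.cool_after_days:
        return "cool tier"
    return "primary"
```

Keeping the thresholds in one small object makes it easier to agree on a different policy per dataset without changing the logic.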

Considerations that may impact the policy include regulations, potential long-term value of data and the cost of storage and backups at every stage from primary to archive storage. These decisions can have enormous consequences if, say, datasets are deleted and then later needed for analytics or forecasting. 

Develop a communications plan for users and stakeholders

For a given workload or dataset, data owners should understand the costs versus the benefits of retaining data. Ideally, the data lifecycle policy is agreed upon by all stakeholders, unless it is dictated by an industry regulation. Share the data usage analytics and the policy with stakeholders so they understand when data will expire and whether there is a grace period during which data is held in a confined or “undeleted” container. Confinement makes it easier for users to agree to data deletion workflows, because they know that if they need the data they can “unconfine” it within the grace period and get it back.
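One way to picture confinement with a grace period is the minimal sketch below. It assumes a single file system and a hidden holding directory; the function names, the JSON sidecar records and the layout are all hypothetical, and a production workflow would add auditing, permissions and error handling.

```python
import json
import shutil
import time
import uuid
from pathlib import Path

SECONDS_PER_DAY = 86400


def confine(path, confine_dir, now=None):
    """Move a file out of user space and record where it came from."""
    now = time.time() if now is None else now
    src = Path(path).resolve()
    confine_dir = Path(confine_dir)
    confine_dir.mkdir(parents=True, exist_ok=True)
    confine_dir = confine_dir.resolve()
    token = uuid.uuid4().hex  # unique name avoids collisions in the holding area
    held = confine_dir / token
    shutil.move(str(src), str(held))
    record = {"original": str(src), "held": str(held), "confined_at": now}
    (confine_dir / f"{token}.json").write_text(json.dumps(record))
    return record


def unconfine(record):
    """Restore a confined file to its original location."""
    held = Path(record["held"])
    original = Path(record["original"])
    original.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(held), str(original))
    held.with_name(held.name + ".json").unlink()


def purge_expired(confine_dir, grace_days, now=None):
    """Permanently delete confined files whose grace period has elapsed."""
    now = time.time() if now is None else now
    purged = []
    for meta in sorted(Path(confine_dir).glob("*.json")):
        record = json.loads(meta.read_text())
        if now - record["confined_at"] >= grace_days * SECONDS_PER_DAY:
            Path(record["held"]).unlink(missing_ok=True)
            meta.unlink()
            purged.append(record["original"])
    return purged
```

Because nothing is destroyed until `purge_expired` runs, a user who objects during the grace period can be made whole with a single `unconfine` call.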

For long-term data that must be retained, ensure users understand the cost and any extra steps required to access data from deep archival storage. For example, data committed to Amazon S3 Glacier Deep Archive may take hours to retrieve (on the order of 12 hours at the standard retrieval tier), and egress fees will often apply.

Plan for technical issues that may arise

Deleting data is not a zero-cost operation. We usually think only of read/write speeds, but deletion consumes system performance as well. Take this example from a theme park: 100K guest photos are taken each day, and each photo is retained for up to 30 days after the customer has left the park. Once the first photos reach the 30-day limit, the storage system's daily workload doubles: it must ingest 100K new photos and delete 100K expired ones.

Workarounds for delete performance, known as “lazy deletes,” can deprioritize the delete workload. But if the system can’t delete data at least as fast as new data is ingested, you will need to add storage to hold expired data. In scale-out systems, you may need to add nodes to handle deletes.
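The arithmetic behind the theme park example can be checked with a short simulation. This is a simplified model under stated assumptions: a constant daily ingest, a fixed daily delete capacity, and photos that expire once they are older than the retention period. The function name `expired_backlog` is hypothetical.

```python
def expired_backlog(ingest_per_day, delete_capacity_per_day, retention_days, days):
    """Return how many expired items are still waiting to be deleted after `days`."""
    backlog = 0
    for day in range(1, days + 1):
        # Once the retention period has passed, one day's worth of
        # ingested items expires every day.
        if day > retention_days:
            backlog += ingest_per_day
        # The system deletes as much of the backlog as its capacity allows.
        backlog -= min(backlog, delete_capacity_per_day)
    return backlog


# Deletes keep pace with ingest: no expired data accumulates.
print(expired_backlog(100_000, 100_000, retention_days=30, days=60))  # 0

# Deletes lag ingest by 20K per day: 600,000 expired photos after 60 days.
print(expired_backlog(100_000, 80_000, retention_days=30, days=60))  # 600000
```

In the second case the backlog grows by 20,000 photos a day with no upper bound, which is the extra capacity that would have to be provisioned just to hold data that should already be gone.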

A better approach is to tier cold data out of the primary file system and then confine and delete it, mitigating the issue of unwanted load and performance impact on the active filesystem.  

Put the data management plan into action

Once the policy has been determined for each dataset, you will need a plan for execution. An independent data management platform provides a unified approach covering all data sources and storage technologies. This can deliver better visibility and reporting on enterprise datasets while also automating data management actions. Collaboration between IT and line-of-business (LOB) teams is an integral part of execution and reduces friction, because LOB teams feel they have a say in data management. Department heads are often surprised to find that 70% of their data is infrequently accessed.

Given the current trajectory of data growth worldwide (data is projected to nearly double from 97 ZB in 2022 to 181 ZB in 2025), enterprises have little choice but to revisit data deletion policies and find a way to delete more data than they have in the past.

Without the right tools and collaboration, this can turn into a political battlefield. Yet by making data deletion another well-planned tactic in the overall data management strategy, IT will have a more manageable data environment that delivers better user experiences and value for the money spent on storage, backups and data protection. 

Kumar Goswami is CEO and cofounder of Komprise.

