Actifio All Featured Tech Talks

Enabling Database DevOps with Large Volumes of Data

One of the biggest struggles any development organization faces is testing its application with production volumes of data. Databases keep metadata on their tables and indexes, to better generate execution plans for optimal data retrieval.

What this means is that it is always a challenge for developers to understand how their applications will scale with ever-growing volumes of data. This is a place where developers who work in corporate environments have an advantage over software vendors. The corporate developers do have some level of access to a production data set, or at least the actual volume of data in production.

The developer at a software vendor will have to use a data generation tool, which may or may not offer the same degree of data distribution that production workloads would normally have. This distribution of data is one of the most important facets of database performance, because of a behavior known as parameter sniffing, which when data is badly skewed can cause performance issues. Database engines cache their execution plans as the process of plan generation is very expensive, so the plan is built with the set of parameters that were used on that first execution.

A dramatic example of this would be if a table had 10 million values of a column called ID, and the first 9.5 million were sequential integers (1,2,3…) and the other 500,000 values were all the number 9 (dear data professional, please ignore my terrible data architecture for this example). A plan created with a value of 9, would have a vastly different execution plan than one created with one of the unique values, so when the plan is executed with one of the other parameters, a suboptimal plan would be created. This behavior, amongst others, is why it is so important to test with a data set with production volume and data distribution.

The unquestionable movement of DevOps and Agile across development organizations has definitely moved into database development. Just like application code, it’s extremely important to maintain source control consistency, while allowing multiple developers to work at the same time. While several advancements in database deployment technologies have occurred, such as the ability to deploy the RDBMS in a container, making it easy for the developer to have a copy of the engine running on their personal computer. Additionally, many continuous integration and development (CI\CD) toolsets have evolved to provide support for database usage. While these enhancements are all beneficial, we still have the problem of trying to get a database, that may be terabytes in size, to each developer for testing.

There are multiple technical approaches to sharing data across development environments. One is to use a shared server and provide each developer with their own schema and copy of the data. This works well with smaller data sets that can be managed, but enterprises frequently object to larger databases being used in this fashion because of the sheer volume of data required. Another approach is for all developers to use a smaller data set that has the reference data only, and a small amount of sample transactional data to be able to validate data workflows. This solution is acceptable, but it delays performance testing until multiple developers have brought their work together into a database with a full set of data. This delay is antithetical to the DevOps process and precludes developer workflow.

The cost of enterprise-class storage is not inconsequential and is important to this discussion. As is the challenge of moving around large volumes of data. One common scenario, that is particularly problematic is the limits of restoring data so that these development databases stay current for testing. This process, while able to be easily automated, is typically performed at the behest of the developer and in the best case has inherent latency involved with executing the restore. In the worst case, the DBA has to manually take an extract or backup of the database, move it to another server, hopefully, scrub any sensitive data, and then restore it. These processes can sap hours or even days of expensive developer productivity, which in addition to the infrastructure costs, can reduce your deployment speed.

There are challenges around DevOps and database workloads. While the tooling for DevOps has improved to meet the needs of the developer, some of the challenges around both making changes to databases as well as maintaining a current development environment have not improved. The size of large databases along with the friction involved makes large backups and restores impractical for most developers.

The best practical solution to meet all these needs is a solution that has the following capabilities:

  1. Can reuse backups to provision storage efficient database aware clones i.e. a software-based Copy Data Management solution that can perform database clones but also perform backup and disaster recovery anywhere on-premises on in any cloud using any storage.
  2. Has the flexibility to be used on-premises or in the cloud
  3. Supports a wide array of databases such as SAP HANA / ASE / MaxDB, Oracle, Oracle EBS, MS SQL, MySQL, PostgreSQL, Db2, and other enterprise databases
  4. Incremental forever refresh from production databases to test environments on-premises or in the cloud
  5. A software that can not only provision rapid database clones but also provision clones of application servers so that the full application stack can be provisioned to test environments

Conceptually, you are virtualizing your production database by creating virtual copies of it. This reduces the overall storage costs, and greatly reduces the time to productivity for the developers, as snapshots can be made available instantly, and can integrate common CI/CD tooling for databases.

 

 

 

 

 

 

About the author

Joey D'Antoni

Thought leader with 15 years of IT experience, working with a wide variety of data platforms including Microsoft SQL Server, Oracle, Hadoop and other data platforms. Extensive experience in cloud design and architecture, specifically focused on hybrid cloud models. Strong experience in clustering and storage. Frequent speaker at major IT conferences such as TechEd, PASS Summit, NoSQL Now, PASS Business Analytics Conference. SQL Server MVP, Dun and Bradstreet MVP.

Leave a Comment