
Data Virtualization: ETL Without the Data Movement

  1. Keeping Down-Level Environments Populated
  2. The Cost of Managing Database Development Environments
  3. Employees Need Proper Data in Order to Make Good Decisions
  4. Data Virtualization: ETL Without the Data Movement
  5. The Challenges of Testing Backup and Recovery in an Enterprise Environment
  6. RTOs and RPOs Should Be Driving Recovery, Not Just Backups
  7. Development Should Get the Same Care as Production
  8. Enabling Database DevOps with Large Volumes of Data

The most challenging part of any business intelligence (BI) or data warehouse project is moving data from operational systems into a system where it can be optimized for reporting. If you are uninitiated to common corporate data architecture, most organizations have systems like point of sale, inventory management, and, in larger organizations, enterprise resource planning (ERP). These systems are typically referred to as online transaction processing (OLTP) systems and are optimized for concurrency and write performance, because you want to be able to process all your sales without your customers waiting on transactions to complete.


While those OLTP systems hold the most current data and might seem like the best source for real-time reporting, users trying to report against live production data would have a negative impact on the sales or ERP system. Since relational databases have to lock records to maintain data consistency, transactions that read data, like reports, have the potential to block write transactions like sales.

Business intelligence users, typically analysts but ranging all the way up to executives, want to be able to use reports, dashboards, and of course everyone’s favorite data management tool: Microsoft Excel. This means a transformation process needs to take place to move the data into a reporting-friendly structure, which will involve development effort and potentially software licensing.

The solution to this is to move the data from the OLTP systems into a read-optimized structure called a data warehouse. Data warehouses are designed using a “star schema” with facts and dimensions and are referred to as online analytical processing (OLAP) systems. A fact represents a process, event, or action. A fact table usually contains some type of measurement or aggregation, as well as relationships to various dimension tables through surrogate keys. Dimensions describe an entity that is involved in one or more facts. For example, if our fact table represents online sales by item, we may have dimensions such as product, customer, and date.
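To make the star schema concrete, here is a minimal sketch of what a fact table and its dimensions might look like for the online-sales example above. The table and column names are illustrative assumptions rather than a prescribed design, and SQLite is used only because it ships with Python and keeps the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database for illustration

# Dimension tables describe the entities involved in a sale.
conn.executescript("""
CREATE TABLE dim_product (
    product_key   INTEGER PRIMARY KEY,  -- surrogate key
    product_name  TEXT,
    category      TEXT
);
CREATE TABLE dim_customer (
    customer_key  INTEGER PRIMARY KEY,  -- surrogate key
    customer_name TEXT,
    region        TEXT
);
CREATE TABLE dim_date (
    date_key  INTEGER PRIMARY KEY,      -- e.g. 20240131
    full_date TEXT,
    year      INTEGER,
    month     INTEGER
);

-- The fact table records one row per sale line, holding measurements
-- (quantity, sales_amount) plus surrogate keys into each dimension.
CREATE TABLE fact_online_sales (
    product_key  INTEGER REFERENCES dim_product(product_key),
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    date_key     INTEGER REFERENCES dim_date(date_key),
    quantity     INTEGER,
    sales_amount REAL
);
""")

# A typical analytical query: sales by category and month.
query = """
SELECT p.category, d.year, d.month, SUM(f.sales_amount) AS total_sales
FROM fact_online_sales f
JOIN dim_product p ON p.product_key = f.product_key
JOIN dim_date    d ON d.date_key    = f.date_key
GROUP BY p.category, d.year, d.month;
"""
print(conn.execute(query).fetchall())  # empty until the ETL loads data
```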

This process of moving data out of the OLTP system into the OLAP system is traditionally referred to as the extract, transform, and load (ETL) process. You may also hear about extract, load, and transform processes, which are most commonly used in conjunction with data lakes, but for the purposes of this article, we are going to focus on ETL. ETL processes were traditionally run once a day, overnight, but as business cycles have moved faster, many organizations have moved to running ETL processes every four hours, or even more frequently than that. A small subset of companies may even have built streaming solutions in order to populate their OLAP system in real time.
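As a rough illustration of what the transform and load steps involve, the sketch below takes rows from a hypothetical OLTP orders table, maps natural keys to surrogate keys, and loads the result into the fact table defined above. The source schema and lookup logic are assumptions for illustration; real ETL tooling handles many more edge cases.

```python
from datetime import datetime

def transform_order(row, product_keys, customer_keys):
    """Turn one OLTP order line into a fact-table row (illustrative only)."""
    order_date, product_code, customer_id, qty, amount = row
    date_key = int(datetime.fromisoformat(order_date).strftime("%Y%m%d"))
    return (
        product_keys[product_code],   # surrogate keys built during the
        customer_keys[customer_id],   # dimension-load step
        date_key,
        qty,
        amount,
    )

def run_etl(oltp_conn, dw_conn, product_keys, customer_keys):
    # Extract: pull order lines from the (hypothetical) OLTP schema.
    rows = oltp_conn.execute(
        "SELECT order_date, product_code, customer_id, quantity, amount "
        "FROM orders"
    ).fetchall()

    # Transform: map natural keys to surrogate keys and derive date_key.
    facts = [transform_order(r, product_keys, customer_keys) for r in rows]

    # Load: bulk insert into the star schema.
    dw_conn.executemany(
        "INSERT INTO fact_online_sales "
        "(product_key, customer_key, date_key, quantity, sales_amount) "
        "VALUES (?, ?, ?, ?, ?)",
        facts,
    )
    dw_conn.commit()
```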

As you can imagine, the cost of building and maintaining these ETL processes is high. Most ETL processes pull from multiple database engines and file sources, and various transformations and calculations take place along the way, which makes them very complex. Any change in a source system schema or file structure has the potential to break the ETL process. In business intelligence projects, ETL development is typically 80-90% of the work for a given project.

The other complexity is increasing the velocity of the data. The overnight processing model worked well before business became 24/7 and international; global companies with operations in multiple time zones demand more real-time scenarios. Changing from running your ETL process once a day to every four hours requires only small changes, like ensuring your rows carry time stamps, but moving to a real-time solution requires configuration such as change data capture and, most commonly, a streaming data solution like Apache Kafka to move the data. Building this level of functionality requires a large development effort, and maintaining it is no easier. With the number of moving parts and the volume of data, this activity can be fraught with error. With any increase in velocity, the cost and the level of development required increase.
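As one sketch of the “small changes” mentioned above, an incremental extract can use a timestamp (watermark) column on the source rows so that each four-hour run only pulls what changed since the previous run. The column and table names here are assumptions; a real solution would also need to handle deletes, late-arriving rows, and clock skew.

```python
def extract_changed_rows(oltp_conn, last_watermark):
    """Pull only rows modified since the previous ETL run (illustrative)."""
    rows = oltp_conn.execute(
        "SELECT order_id, order_date, product_code, customer_id, "
        "       quantity, amount, modified_at "
        "FROM orders WHERE modified_at > ?",
        (last_watermark,),
    ).fetchall()

    # The new watermark is the largest modified_at seen; persist it so the
    # next run (e.g. four hours later) starts where this one left off.
    new_watermark = max((r[-1] for r in rows), default=last_watermark)
    return rows, new_watermark
```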

One potential solution to this is known as hybrid transaction and analytical processing (HTAP), which combines both OLTP and OLAP workloads into a single system for reporting and processing. This notion is appealing but challenging to implement on legacy systems, as the architecture requires a specialized query engine, typically in-memory, that can process transactions with minimal locking while at the same time allowing analysts to query the same data.

Most organizations haven’t implemented these features in their OLTP systems, and if they are using vendor-supplied OLTP applications, those applications may not support the technology. Some examples of this technology include Oracle TimesTen, Microsoft SQL Server In-Memory OLTP, and SAP HANA. Because of this lack of adoption, systems that try to support both reporting and transactional activity will face unpredictable performance due to blocking. So building a data structure that supports reporting independently of transactional activity is critical to getting your business users the information they need, while letting your front-line business operate as expected.

About the author

Joey D'Antoni

Thought leader with 15 years of IT experience, working with a wide variety of data platforms including Microsoft SQL Server, Oracle, and Hadoop. Extensive experience in cloud design and architecture, specifically focused on hybrid cloud models. Strong experience in clustering and storage. Frequent speaker at major IT conferences such as TechEd, PASS Summit, NoSQL Now, and the PASS Business Analytics Conference. SQL Server MVP, Dun and Bradstreet MVP.
