Introduction

I have, as mentioned in a previous blog, been working on a data warehouse project that uses CDC (change data capture) to extract the source system data. I would like to share some of the experiences and the evaluation criteria we used when selecting a CDC tool for the project.
The reasons for using a specialist tool were as follows:
- Couldn't move all the source databases to SQL Server 2008
- Needed the captured change data to be sent to another server instance
- A consistent management experience for CDC across all our source systems. The two database vendors the tool needed to support were SQL Server and IBM AS400
Timescales

First, we underestimated how long it would take to select a CDC tool. We planned one month for both drawing up the evaluation criteria and running the tests, and it took about two to three months. The time went on evaluating the tools against the criteria and on gaining access to database systems owned by other projects.
However, development of our ETL solution was able to continue while the CDC tool hadn't yet been selected, because we used generic columns for the CDC metadata. Our requirements allowed this approach because only a small set of metadata was needed from the CDC tool, and most tools offered it. To ensure that changes were picked up in the right order, we used a row ID, a big integer identity column, on the CDC destination table. This worked because most CDC tools record changes in the same order as they are made.
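The destination-table approach above can be sketched as follows. This is a minimal illustration using SQLite as a stand-in for the CDC destination database; the table and column names are hypothetical, not taken from any specific CDC tool.

```python
import sqlite3

# A generic CDC destination table: a big-integer identity column (row_id)
# records the order in which the CDC tool delivered each change.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE cdc_customer (
        row_id      INTEGER PRIMARY KEY AUTOINCREMENT,  -- identity seed
        operation   TEXT,   -- I = insert, U = update, D = delete
        captured_at TEXT,   -- metadata supplied by the CDC tool
        customer_id INTEGER,
        name        TEXT
    )
""")

# Changes arrive from the CDC tool in commit order; the identity column
# preserves that order for the downstream ETL.
changes = [
    ("I", "2011-01-01T10:00:00", 1, "Alice"),
    ("U", "2011-01-01T10:05:00", 1, "Alicia"),
    ("D", "2011-01-01T10:10:00", 1, None),
]
conn.executemany(
    "INSERT INTO cdc_customer (operation, captured_at, customer_id, name) "
    "VALUES (?, ?, ?, ?)", changes)

# The ETL reads changes ordered by row_id, so they are applied in the
# same order the CDC tool committed them.
rows = conn.execute(
    "SELECT row_id, operation FROM cdc_customer ORDER BY row_id").fetchall()
print(rows)  # [(1, 'I'), (2, 'U'), (3, 'D')]
```

Because the metadata columns are generic, the same destination schema works regardless of which CDC tool is eventually chosen.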
Evaluation criteria

Here are some of the categories of criteria, which our client used, to select which CDC tools to evaluate; they are common-sense categories:
- Impact: How CDC would impact the servers and databases of both source and target systems. Server impact was measured by how many resources the tool consumed on the server: memory, CPU, etc.
The database impact was whether the tool made any schema changes and whether it had performance implications for the application. This was important to know because one source was a third-party application whose support contract would have been invalidated if any schema changes were made to its database. To test for schema changes I used a simple procedure: use the VSDBCMD tool to import the schema into a schema file before installing CDC, import it again to another file after the install, and then use the VS2010 schema compare tool to compare the two schema files.
- Schema changes: Does the CDC tool cope with schema changes being made on the source system? Could the tool continue to work if schema changes were made to columns which weren't being captured, or if any new columns were added?
- Speed: How quickly were changes committed to the target server? The metrics for this were the volume of changes and the speed of committing them, and also how long the initial synchronisation took to complete.
- Management: What was the management tool like, how easy was it to use, and how quickly could CDC be recovered from errors or disasters?
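The schema-change check described under the Impact criterion (snapshot the schema before and after installing CDC, then compare) can be sketched in miniature. This is an illustrative analogue only, using SQLite's catalogue in place of VSDBCMD schema files and the VS2010 compare tool; the table names are hypothetical.

```python
import sqlite3

def snapshot_schema(conn):
    """Capture object name -> DDL text, analogous to exporting a schema file."""
    return dict(conn.execute(
        "SELECT name, sql FROM sqlite_master WHERE sql IS NOT NULL"))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")

before = snapshot_schema(conn)  # snapshot taken before installing CDC

# Simulate a CDC tool that quietly alters the source schema on install.
conn.execute("ALTER TABLE orders ADD COLUMN cdc_flag INTEGER")
conn.execute("CREATE TABLE cdc_orders (row_id INTEGER, op TEXT)")

after = snapshot_schema(conn)   # snapshot taken after installing CDC

# The "schema compare" step: any added or altered objects mean the tool
# has touched the database schema.
added = set(after) - set(before)
altered = {name for name in before if before[name] != after.get(name)}
print(sorted(added), sorted(altered))  # ['cdc_orders'] ['orders']
```

A non-empty result from either set would have been enough to rule a tool out for the third-party database mentioned above.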
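A minimal harness for the Speed criterion might look like the following: commit a batch of changes to a target table and report throughput. This is a hypothetical sketch using SQLite; it only illustrates the shape of the measurement, not the numbers any real CDC tool would produce.

```python
import sqlite3
import time

def measure_commit_speed(batch_size=10_000):
    """Time how long a batch of changes takes to commit; return rows/sec."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE target (row_id INTEGER, payload TEXT)")
    batch = [(i, f"change-{i}") for i in range(batch_size)]

    start = time.perf_counter()
    conn.executemany("INSERT INTO target VALUES (?, ?)", batch)
    conn.commit()
    elapsed = time.perf_counter() - start

    return batch_size / elapsed  # rows committed per second

print(f"{measure_commit_speed():.0f} rows/sec")
```

Running the same harness at several batch sizes gives the volume-versus-speed picture described above.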
Another factor was the number of databases and tables we wanted to capture data from. We had 20 source databases with about 15 tables each. Depending on the CDC tool, more or less time would have to be spent developing and deploying the CDC solution.
To develop or to configure, that is the question? (sorry, couldn't resist the Shakespeare pun)
That is the question we found ourselves asking, as some tools required code to be developed to create a CDC solution. For the tools which required development, the development experience and environment had to be considered as part of the evaluation: the level of privileges required on the development computer; whether there was any integration with source control, or some other way to protect against loss of work; and how easily development could be handed over from one developer to another.
Another aspect of the evaluation was how easy it was to deploy the CDC solution from our development environment to production, and whether any of the deployment could be automated through scripting or exporting.
Observations

While evaluating, we found that the same tool gave a different level of change data capture depending on the database vendor it was configured against. For example, when configured against the AS400 the tool was able to supply the full row of data, but when configured against SQL Server it could only supply the columns which had changed. The reason was how the tool had implemented CDC for SQL Server: it did not require the SQL Server replication components to be installed, and without those components the SQL Server log file only records the columns which changed. Hence the tool could only supply the changed columns' data and not the full row.
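The practical consequence of a changed-columns-only feed is that the consumer must merge each delta onto the last known full image of the row to rebuild a complete record. A minimal sketch of that merge, with hypothetical row and column names not taken from any specific CDC tool:

```python
# Last full image of each row, keyed by primary key (e.g. obtained from
# the initial synchronisation).
current = {1: {"id": 1, "name": "Alice", "city": "Leeds"}}

# Changed-columns-only deltas, as a SQL Server-style feed might supply them:
# (primary key, {changed column: new value}).
deltas = [
    (1, {"city": "York"}),    # only 'city' changed
    (1, {"name": "Alicia"}),  # only 'name' changed
]

for key, changed_columns in deltas:
    # Merge the delta over the stored image to recover the full row.
    current[key] = {**current[key], **changed_columns}

print(current[1])  # {'id': 1, 'name': 'Alicia', 'city': 'York'}
```

With a full-row feed (as we saw against the AS400) this merge step is unnecessary, which is worth factoring into the evaluation.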
We also found that the same CDC tool behaved differently when running against different processor architectures (x86 and x64); the differences were in the settings and the memory requirements.
The CDC tools we looked at were:
- Attunity - http://www.attunity.com/index.aspx - which has a range of CDC products
- DBMoto - http://www.hitsw.com/products_services/dbmoto/dbmoto_dsheet.html
- IBM InfoSphere Change Data Capture - http://www-01.ibm.com/software/data/infosphere/change-data-capture
- Oracle GoldenGate - http://www.oracle.com/technetwork/middleware/goldengate/overview/index.html