Wednesday, November 9, 2016

Trickle Data Load


Problem statement: The enterprise needs to load millions of rows form legacy Salesforce CRM to new Salesforce CRM. While IT team continue data migration activity other business units must be able to use system for day to day operations. For example, resolving customer issues, responding queries, marketing activity and so on. Salesforce multitenant architecture with verity of APIs (Bulk /SOAP/Rest) handle this very well with very few exceptions.  

The particular problem that I want to highlight global search. When we bulk load data it affects global search indexing. Global search work asynchronously but gives a feel of near real time. But what we have to load is billions of rows. When we load high volumes of data if make the global search unusable as the indexing is not complete. In our test it took whole week for global search to work properly. CSR users were not very happy about it and they break various SAL due to this.


Solution:
The approach is load data as slowly as possible. If you load relatively less records (say 5k or less) the global search is available in the matter of seconds (for me, it was less than 30 seconds). So, instead of loading data in bulk we choose to load data in “Trickle Load” fashion. In a way it’s exactly opposite of bulk data loading. In Bulk data loading we upload all data and periodically keep checking the status. In Trickle load we load small chunk and wait for salesforce to finish updating global indexes. After some time, we submit again another chunk and this can go on and on till we finish.
If you think of this in a way, it’s just tread of between limits.

How to Trickle Load huge data?

1.   Split large CSV file of data into batches of 5k (php code is provided at the bottom)
2.   Trigger job of data loading at every 5 minutes to insert one batch file
3.   The easiest way for me was using data loader CLI to run from command prompt


Things to keep in mind:
·     With this approach we can load max 1,440,000 records per 24 hours (5k * (24* 12(batches per hour)))
·     As the wait time increase you will notice less problem of jamming the global search indexes. As the wait time reduced you will notice more the problem of jamming the global search indexes
·     Higher API consumption can be result with this approach.
·     We consider to wait for 5 minutes but you can choose another number as per your ability to wait after jamming problem.