Question

How to effectively load Customer Spine data with Cassandra/DDS in Pega Marketing (Cloud)

Hi guys,

We are using Pega Platform 8.1.5 (PegaCloud) with Pega Marketing 8.1.5

In our current setup we are using the following repositories in order to store Customer data:

  • Customer DDS table (internal Cassandra) - contains 100% of the Customer data
  • Customer DB table (Postgress) - simply contains the CustomerID field (we do this due the limitation on Pega Marketing Segments which require a relational database table). We might add new fields if required for segmentation.

The (simplified) load process is as follows:

  1. We get a Customer file in JSON format as our input. It contains a flat structure of customer attributes.
  2. Using a combination of a dataflow and several datasets, we:
    1. truncate the Customer DB table
    2. truncate the Customer DDS table
  3. Load the JSON file into Pega
  4. We then write the Customer properties into:
    1. the Customer DB table
    2. the Customer DDS table

So, basically, we do a truncate and an insert.

This works for us because is a simple data load process using dataflows/datasets which provides a flexible integration point (the input JSON file can be modified to contain new attributes which are transparently loaded into Pega).

At this point, the file is getting pretty big (and it will get bigger) and it takes about 40 mins to load.

The new requirement is to have 24/7 data availability.

We are considering 2 options in order to achieve it:

OPTION 1: Changing the load process from "full" to "incremental".

This is the preferred option right now since we believe it would achieve 24/7 availability.

Basically, instead of doing a truncate/insert, we would to never truncate and would do an insert/update.

We believe this is easily achievable by configuring the datasets appropriately:

  • The DB dataset as "Insert new and overwrite existing records"
  • The DDS dataset does an insert/update by default.

The question for this option is:

  1. Assuming we are loading several million records to DDS/Cassandra, taking approx 1 hour: What is the impact if records are being retrieved from Cassandra to serve the inbound channel whilst the data load is in progress.
    1. locking issues?
    2. performance issues?
    3. any other issues to watch for?

[Note: As suggested by Kevin in the comments below, we will investigate the option of using partitions and splitting the input file into several to take advantage of parallel data flow runs.]

OPTION 2: Introducing the concept of an staging area.

We think this option would get us close to 24/7 availability, but not to 100%.

We could optimize the process by introducing the concept of a typical "staging area".

If we were doing this with old fashion ETL, we would load the data into a Customer DB staging table, and once the load is complete, we would simply rename the tables.

We are trying to do something similar with datasets, something like:

  1. Load the input JSON file into Pega using a File DataSet - no changes here
  2. Write the data (Customer IDs only) into a staging Customer DB table - change, but no issues here
  3. Write the data (100% of the data) into a staging Customer DDS - change, but no issues here
  4. Instead of renaming the Customer DB table (and considering that we are on PegaCloud and don’t have full DB control), we are thinking about creating a dataflow which as follows:
    1. Source component: the staging Customer DB table
    2. Destination component: the "final" Customer DB table
  5. Instead of renaming the Customer DDS, we are thinking about creating a dataflow which as follows:
    1. Source component: the staging Customer DDS
    2. Destination component: the "final" Customer DDS

So, essentially, we would achieve the same as in a typical ETL world, but moving the data from the staging area to the "final" area still takes a few minutes.

Do you guys have any suggestions? Do these approaches make sense? Any suggestions to improve the process are highly appreciated.

Thank you very much,

Marcelo.

Correct Answer
November 21, 2019 - 7:32am

one thought regarding your initial file dataset loading process, you should consider using partitions if possible (versus using one giant file) as distinct files can be processed in parallel in dataflow distribution runs. This potentially can shorten the initial loading time.

Comments

Keep up to date on this post and subscribe to comments

November 21, 2019 - 7:32am

one thought regarding your initial file dataset loading process, you should consider using partitions if possible (versus using one giant file) as distinct files can be processed in parallel in dataflow distribution runs. This potentially can shorten the initial loading time.

November 21, 2019 - 5:19pm
Response to KevinZheng_GCS

Thank you Kevin, we will definitely explore this option.

Also, I added more detail to the initial post. 

Could you please be so kind to have a look and provide any feedback?

Thank you!

Marcelo.

December 3, 2019 - 8:28am
Response to KevinZheng_GCS

Hi Kevin,

I've done some tests as you suggested. I split the one giant file into chunks of 100,000 records each so that each of the files can be processed in parallel in dataflow distribution runs. Since we have 3 dataflow nodes, the load time is now around one third of the original time.

Thanks a lot for this suggestion. It really helps.

Regards,

Marcelo.

November 21, 2019 - 3:50pm

Thank you Kevin, we will definitely explore this option.