Azure Data Lake incremental load with file partition

I'm designing Data Factory pipelines to load data from Azure SQL DB into Azure Data Lake.

My initial load/POC was a small subset of data, and I was able to load it from the SQL tables into Azure DL.

Now there is a huge volume of tables (some with a billion-plus rows) that I want to load from SQL DB into Azure DL using Data Factory. The MS docs mention two options, i.e. watermark columns and change tracking. Let's say I have a "cust_transaction" table that has millions of rows; if I load it to the DL, it lands as "cust_transaction.txt".

Questions:

1) What would be an optimal design to incrementally load the source data from SQL DB into that file in the data lake?

2) How do I split or partition the files into smaller files?

3) How should I merge and load the deltas from the source data into the files?

Thanks.

azure azure-data-lake






asked Nov 13 at 3:55 by Galaxite (edited Nov 13 at 9:25 by AshokPeddakotla-MSFT)

1 Answer

You will want multiple files. Typically, my data lakes have multiple zones. The first zone is Raw. It contains a copy of the source data organized into entity/year/month/day folders, where entity is a table in your SQL DB. Typically, those files are incremental loads. Each incremental load for an entity has a file name similar to Entity_YYYYMMDDHHMMSS.txt (and maybe even more info than that) rather than just Entity.txt. And the timestamp in the file name is the end of the incremental slice (the max possible insert or update time in the data) rather than just the current time, wherever possible (sometimes they are roughly the same and it doesn't matter, but I tend to get a consistent incremental slice end time for all tables in my batch). You can achieve the date folders and the timestamp in the file name by parameterizing the folder and file in the dataset.
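
For illustration only, here is a minimal Python sketch of how such a Raw-zone folder and file name could be composed; the "raw" zone name and the entity value are hypothetical examples, and inside Data Factory itself this is normally expressed with dataset parameters and pipeline expressions rather than code.

    # Sketch: compose a Raw-zone folder and file name for one incremental slice.
    # The "raw" zone name and the entity value are assumed examples.
    from datetime import datetime

    def raw_path(entity: str, slice_end: datetime, trigger_id: str = "") -> str:
        # entity/year/month/day folders, file stamped with the slice end time
        folder = f"raw/{entity}/{slice_end:%Y/%m/%d}"
        suffix = f"_{trigger_id}" if trigger_id else ""
        return f"{folder}/{entity}_{slice_end:%Y%m%d%H%M%S}{suffix}.txt"

    # e.g. raw/cust_transaction/2018/11/13/cust_transaction_20181113063000.txt
    print(raw_path("cust_transaction", datetime(2018, 11, 13, 6, 30)))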



Melissa Coates has two good articles on Azure Data Lake: Zones in a Data Lake and Data Lake Use Cases and Planning. Her naming conventions are a bit different from mine, but both of us would tell you to just be consistent. I would land the incremental load file in Raw first. It should reflect the incremental data as it was loaded from the source. If you need a merged version, that can be done with Data Factory or U-SQL (or your tool of choice) and landed in the Standardized Raw zone. There are some performance issues with small files in a data lake, so consolidation can be good, but it all depends on what you plan to do with the data after you land it there. Most users would not access data in the Raw zone, instead using data from the Standardized Raw or Curated zones. Also, I want Raw to be an immutable archive from which I could regenerate data in other zones, so I tend to leave it in the files as it landed. But if you found you needed to consolidate there, that would be fine.
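
As a sketch of that merge step with "your tool of choice": the comments below mention Databricks, so here is roughly how the current version could be derived in PySpark. The paths and the cust_id/modified_date column names are assumptions for illustration; the same idea could be expressed in a Data Factory activity or U-SQL instead.

    # Sketch: derive a "current version" of an entity from the Raw incremental
    # files. Paths and the cust_id/modified_date columns are assumed examples.
    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.getOrCreate()

    raw = spark.read.csv("raw/cust_transaction/*/*/*/*.txt",
                         header=True, inferSchema=True)

    # Keep only the latest row per business key.
    w = Window.partitionBy("cust_id").orderBy(F.col("modified_date").desc())
    current = (raw.withColumn("rn", F.row_number().over(w))
                  .filter(F.col("rn") == 1)
                  .drop("rn"))

    # Land the consolidated view where downstream users read from.
    current.write.mode("overwrite").parquet("standardized/cust_transaction")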



Change tracking is a reliable way to get changes, but I don't like the naming convention/file organization in the example from the MS docs. I would make sure your file name has the entity name and a timestamp on it. Their example uses Incremental - [PipelineRunID]; I would prefer [Entity]_[YYYYMMDDHHMMSS]_[TriggerID].txt (or leave the run ID off) because it is more informative to others. I also tend to use the Trigger ID rather than the pipeline Run ID. The Trigger ID spans all the pipelines executed in that trigger instance (batch), whereas the pipeline Run ID is specific to that pipeline.



If you can't do change tracking, the watermark is fine. I usually can't add change tracking to my sources and have to go with a watermark. The issue is that you are trusting that the application's modified date is accurate. Are there ever times when a row is updated and the modified date is not changed? When a row is inserted, is the modified date also populated, or would you have to check two columns to get all new and changed rows? These are the things we have to consider when we can't use change tracking.
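
If you do end up on the watermark approach, the extraction logic could look roughly like the sketch below. It assumes hypothetical created_date/modified_date columns and a previously stored watermark value, and it uses pyodbc plus a CSV writer purely for illustration; in practice the same query would sit in the Copy activity's source.

    # Sketch of a watermark-based incremental pull. The table and the
    # created_date/modified_date columns are assumed examples.
    import csv
    import pyodbc

    def extract_increment(conn_str, watermark, slice_end, out_file):
        # Check both columns so new rows are caught even if modified_date
        # is not populated on insert (one of the risks noted above).
        sql = """
            SELECT *
            FROM dbo.cust_transaction
            WHERE (created_date  > ? AND created_date  <= ?)
               OR (modified_date > ? AND modified_date <= ?)
        """
        conn = pyodbc.connect(conn_str)
        try:
            cursor = conn.cursor()
            cursor.execute(sql, watermark, slice_end, watermark, slice_end)
            with open(out_file, "w", newline="") as f:
                writer = csv.writer(f)
                writer.writerow([col[0] for col in cursor.description])
                writer.writerows(cursor.fetchall())
        finally:
            conn.close()
        return slice_end  # persist this as the new watermark for the next run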



To summarize:

• Load incrementally and name your incremental files intelligently.
• If you need a current version of the table in the data lake, keep that as a separate file in your Standardized Raw or Curated zone.






answered Nov 13 at 16:44 by mmarie

• Thanks for your reply. Two questions: 1) For a very large table, do you output a single file, or how do you split it into multiple files? 2) Based on your feedback, there will be lots of files for a single entity (e.g. for customer there can be a number of files with timestamps, cust_<YYYYHHMM>). If the user wants to access the data via Power BI, how can a consolidated view of each entity be created?
  – Galaxite
  Nov 13 at 22:36

• 1) It becomes less about the size of the table (unless it's extremely wide, with several hundred columns) and more about how often the data changes. You may have millions of rows, but how many inserts and updates are done each day/hour/whatever? How often would your incremental load run? The performance guide says to try to keep your files under 2 GB. Even with millions of rows, that is pretty easy to do with incremental loads that run every 12-24 hours. 2) You would never have two files with the same timestamp. And using those incremental files would not be ideal for analytics...
  – mmarie
  Nov 13 at 22:49

• This is where your different zones come in. You want a record of all the changes in Raw, but your Power BI user probably just wants the current state. So use Data Factory or U-SQL to get the latest version of each row from Raw and land it in the Curated zone for end-user consumption. You can partition that however you see fit... by primary key or other value.
  – mmarie
  Nov 13 at 22:55

• 1) There are two very large tables with a few billion rows, and on average there will be about a million new rows every day. There are not many columns, though. A nightly incremental load will be sufficient. 2) Yes, there will never be two files with the same timestamp. What I meant was: when you have multiple files for the same entity, how do we present a single view to the user, since the source data is coming from a single table (e.g. the customer table from Azure SQL DB will output multiple files for customer with different timestamps)?
  – Galaxite
  Nov 13 at 23:02

• You could also use Databricks on top of your Raw files to create your curated current version. That is probably the preferred way these days. Basically, users don't read Raw; they read the curated data. You have to take extra ELT steps to get there by doing some light transformation to pull only the latest version of each row.
  – mmarie
  Nov 13 at 23:17
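
As a rough sanity check on the 2 GB guidance mentioned in the comments above, with assumed numbers of about a million changed rows per day (the figure given earlier in the thread) and a couple of hundred bytes per delimited row, a nightly incremental file stays far below that limit:

    # Back-of-the-envelope sizing with assumed numbers: ~1,000,000 changed
    # rows per day at ~200 bytes per delimited row.
    rows_per_day = 1_000_000   # assumption based on the figure in the comments
    bytes_per_row = 200        # assumption
    gb = rows_per_day * bytes_per_row / 1024**3
    print(f"{gb:.2f} GB per daily file")   # about 0.19 GB, well under 2 GB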










