How To Deduplicate Records While Loading Data Into Snowflake

Data duplication is a common issue in data engineering work, and when it happens it becomes a costly matter. This blog describes how to de-duplicate data when a COPY command is executed in Snowflake.

Snowflake does not enforce uniqueness or primary key constraints on tables. One of the usual requirements when loading data with the COPY command from an external stage is handling duplicate records and avoiding them (especially when you have many duplicate rows or duplicate files).
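
As a quick illustration, here is a minimal sketch (the pk_demo table and its values are hypothetical, not part of the data set used below) showing that a primary key declared on a Snowflake table does not stop duplicate rows from being inserted:

-- the primary key is recorded as metadata but not enforced
create or replace transient table pk_demo (
    id number primary key,
    name varchar(50)
);

insert into pk_demo values (1, 'a'), (1, 'a'); -- both rows are accepted

select count(*) from pk_demo; -- returns 2, the duplicate is not rejected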

This episode is a comprehensive and practical guide, with hands-on exercises, to learn how to deduplicate data while loading it into Snowflake.

Once you complete this video, you will be able to answer the following questions:

  1. Does the file format have a de-duplication parameter?
  2. Does the COPY command have a de-duplication parameter?
  3. How do you handle deduplication with a COPY ... SELECT statement?
  4. How do you handle duplicate data that is spread across many files in partitioned folders inside an external stage?

Sample Data Set

Here is the data set which is used in this guide.
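The original files are not reproduced here, but based on the user table defined below, the CSV files look roughly like this (illustrative rows only, with ids 1 and 2 duplicated as referenced in the COPY example later):

id,first_name,last_name,email,gender,about_me
1,John,Doe,john.doe@example.com,M,Loves hiking
1,John,Doe,john.doe@example.com,M,Loves hiking
2,Jane,Roe,jane.roe@example.com,F,Enjoys reading
2,Jane,Roe,jane.roe@example.com,F,Enjoys reading
3,Sam,Poe,sam.poe@example.com,M,Plays chess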

SQL Used In This Deduplication Article

The File Format & The allow_duplicate Parameter

    create or replace file format allow_duplicate_ff 
        type = 'csv' 
        compression = 'none' 
        field_delimiter = ',' 
        record_delimiter = '\n' 
        skip_header = 1 
        field_optionally_enclosed_by = '\042' 
        escape = '\134'
        allow_duplicate = false;

-- this is not allowed:
-- the allow_duplicate parameter applies to other (semi-structured) file format types such as json, not to csv,
-- and we will see it in another episode. So let's remove this parameter and re-create the file format.
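
For completeness, here is the re-created file format without the allow_duplicate parameter; the rest of the examples below assume this version of allow_duplicate_ff:

create or replace file format allow_duplicate_ff 
    type = 'csv' 
    compression = 'none' 
    field_delimiter = ',' 
    record_delimiter = '\n' 
    skip_header = 1 
    field_optionally_enclosed_by = '\042' 
    escape = '\134';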

Creating User Table With 6 Fields

create or replace transient table user(
    id number,
    first_name varchar(100),
    last_name varchar(100),
    email varchar(100),
    gender varchar(1),
    about_me varchar(500)
);

select * from user; -- no record in this table

Load Data Using Put Command To User Stage

-- step-2: list the files uploaded (via put) to the user stage
list @~/ch07/csv/;

-- run the copy command; the file has 2 duplicate records, for id-1 and id-2
copy into user
    from @~/ch07/csv/01_user_data.csv
    file_format = (format_name = 'allow_duplicate_ff')
    force = true 
    on_error = 'continue';
  
select * from user;

-- the duplicate rows are present, and that is expected
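
To confirm the duplicates, a simple aggregation (a minimal sketch, assuming id is the logical key of the user table) shows which ids were loaded more than once:

-- ids that appear more than once after the plain copy
select id, count(*) as row_count
from user
group by id
having count(*) > 1;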

Copy Command To Remove Duplicate Records

truncate table user; -- truncate the table
select * from user; -- and there is no data in the user table

-- now let's re-run the copy command and re-load the same data 
-- this time using the distinct clause
copy into user
from 
    (   
    select distinct * from @~/ch07/csv/01_user_data.csv t
    )
file_format = (format_name = 'allow_duplicate_ff')
force = true 
on_error = 'continue';

-- let's run the select commands
select count(*) from user;
select * from user; 
    -- there are no duplicates
    -- but the order of the inserted rows is different
    -- let's review the query profile
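
Note that select distinct only removes rows that are identical across all columns. If you instead need one row per key (for example one row per id, regardless of differences in the other columns), one option is to load into a staging table and then deduplicate with a window function. The sketch below assumes a hypothetical user_dedup target table with the same columns as user:

-- minimal sketch (user_dedup is a hypothetical target table)
create or replace transient table user_dedup like user;

insert into user_dedup
select id, first_name, last_name, email, gender, about_me
from user
qualify row_number() over (partition by id order by id) = 1;

select count(*) from user_dedup; -- one row per id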

Deduplication & Partitioned Files

What if we have multiple files that contain the same duplicate records? How do we deal with that?

truncate table user; -- truncate the table
select * from user order by id; -- and there is no data in the user table

list @~/ch07/csv/hour;

-- how the pattern option works with de-duplication
copy into user
from 
    (   
    select distinct * from @~/ch07/csv/hour t
    )
file_format = (format_name = 'allow_duplicate_ff')
force = true 
pattern='.*[.]csv'
on_error = 'continue';


-- check the performance with large duplicate tables
create or replace transient table user_20k_duplicate(
    id number,
    first_name varchar(100),
    last_name varchar(100),
    email varchar(100),
    gender varchar(1),
    about_me varchar(500)
);

create or replace transient table user_20k_unique(
    id number,
    first_name varchar(100),
    last_name varchar(100),
    email varchar(100),
    gender varchar(1),
    about_me varchar(500)
);

-- without removal of duplicate values
copy into user_20k_duplicate
        from @~/ch07/csv/user_20k_duplicate.csv
        file_format = (format_name = 'allow_duplicate_ff')
        force = true 
        on_error = 'continue';
-- now let's re-run the copy command and re-load the same data with the distinct clause,
-- this time into the user_20k_unique table
copy into user_20k_unique
from 
    (   
    select distinct * from @~/ch07/csv/user_20k_duplicate.csv t
    )
file_format = (format_name = 'allow_duplicate_ff')
force = true 
on_error = 'continue';
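
One way to compare the two loads (a sketch using simple counts plus the standard INFORMATION_SCHEMA.COPY_HISTORY table function; table names follow the examples above) is:

-- rows loaded into each table
select count(*) from user_20k_duplicate;
select count(*) from user_20k_unique;

-- copy history for the last hour, showing files and row counts per load
select *
from table(information_schema.copy_history(
        table_name => 'USER_20K_UNIQUE',
        start_time => dateadd(hours, -1, current_timestamp())));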

Deduplication with 1.5M Rows

-- large data set with lots of duplicates
create or replace transient table customers_duplicate (
	customer_pk number(38,0),
	salutation varchar(10),
	first_name varchar(20),
	last_name varchar(30),
	gender varchar(1),
	marital_status varchar(1),
	day_of_birth date,
	birth_country varchar(60),
	email_address varchar(50),
	city_name varchar(60),
	zip_code varchar(10),
	country_name varchar(20),
	gmt_timezone_offset number(10,2),
	preferred_cust_flag boolean,
	registration_time timestamp_ltz(9)
);

-- list the file inside the big_data/hour folder
list @~/ch07/csv/big_data/hour; 


-- we have a total of 3 files with a combined size of ~220 MB

copy into customers_duplicate
    from @~/ch07/csv/big_data/hour
    file_format = (format_name = 'allow_duplicate_ff')
    on_error = 'CONTINUE' 
    pattern='.*[.]csv'
    force = true;

-- check record counts
select count(*) from customers_duplicate;

-- check for duplicate records
select customer_pk, count(1) from customers_duplicate group by customer_pk having count(1) > 1;
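
A quick way to quantify how much duplication was loaded is a simple count comparison on the same table:

-- total rows vs. distinct keys in the loaded data
select count(*) as total_rows,
       count(distinct customer_pk) as unique_keys
from customers_duplicate;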

Large Data Set Deduplication

create or replace transient table customers_unique (
	customer_pk number(38,0),
	salutation varchar(10),
	first_name varchar(20),
	last_name varchar(30),
	gender varchar(1),
	marital_status varchar(1),
	day_of_birth date,
	birth_country varchar(60),
	email_address varchar(50),
	city_name varchar(60),
	zip_code varchar(10),
	country_name varchar(20),
	gmt_timezone_offset number(10,2),
	preferred_cust_flag boolean,
	registration_time timestamp_ltz(9)
);

-- the table should be empty
select * from customers_unique;

-- let's run the copy command with pattern matching, loading distinct rows into the customers_unique table
copy into customers_unique
from 
    (
        select distinct * from @~/ch07/csv/big_data/hour t
    ) 
    file_format = (format_name = 'allow_duplicate_ff')
    on_error = 'CONTINUE' 
    pattern='.*[.]csv'
    force = true;

-- check the row count and confirm that no customer_pk repeats
select count(*) from customers_unique;
select customer_pk, count(1) from customers_unique group by customer_pk having count(1) > 1; -- should return no rows
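
Finally, a sanity-check sketch: if every duplicate in the source files was an exact full-row duplicate, the row count of customers_unique should match the distinct key count of customers_duplicate.

-- rows kept after deduplication
select count(*) from customers_unique;

-- distinct keys in the raw, duplicated load
select count(distinct customer_pk) from customers_duplicate;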