
Validate Data In Snowflake While Loading


This blog explains how to run a validation process while loading small or large CSV files into Snowflake. This is done using a single parameter; it has certain limitations, and if you know how to use it appropriately, you can save a lot of compute cost.

How To Validate Data In Snowflake While Loading CSV Files

Data loading can take time, and if a conversion or parsing issue occurs while loading data into a table, the cycle of fixing the issue and re-running the COPY command is time consuming. This chapter helps you understand how data can be validated before it is loaded into Snowflake, and what options are available to debug data issues.

This episode covers the following questions:

  1. Does the file format have a data validation parameter?
  2. Does the COPY command have a data validation parameter?
  3. What are the limitations of the data validation option, and when can it not be used?
  4. Does adding the validation parameter increase compute cost?
  5. Does adding the validation parameter take a lot of time for large files?
  6. How do you check past validation errors across all errors and partitioned files?

This episode describes each of the data validation options and helps you solve a problem through a hands-on exercise. Once you complete this video, you will be able to answer the questions above.

Sample Data Set

Here is the data set used in this guide.

SQL Used In This Validation Article


Create Customer Table

```sql
-- customer table with 15 columns having different data types
create or replace transient table customer_validation (
    customer_pk number(38,0),
    salutation varchar(10),
    first_name varchar(50),
    last_name varchar(50),
    gender varchar(1),
    marital_status varchar(1),
    day_of_birth date,
    birth_country varchar(60),
    email_address varchar(50),
    city_name varchar(60),
    zip_code varchar(10),
    country_name varchar(20),
    gmt_timezone_offset number(10),
    preferred_cust_flag boolean,
    registration_time timestamp_ltz(9)
);
```
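Before loading any files, it can help to sanity-check the table definition by inserting one well-formed row by hand. The values below are made up purely for illustration; only the column types come from the DDL above.

```sql
-- hypothetical sample row matching the 15-column definition above
insert into customer_validation values (
    1, 'Mr.', 'John', 'Doe', 'M', 'S',
    '1985-04-12', 'UNITED STATES', 'john.doe@example.com',
    'San Jose', '95101', 'UNITED STATES', -8, true,
    '2022-01-15 10:30:00'
);

-- confirm the row landed with the expected types
select * from customer_validation;
```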

Create Standard File Format & Put Command

```sql
-- create a file format called csv_ff
create or replace file format csv_ff
    type = 'csv'
    compression = 'none'
    field_delimiter = ','
    record_delimiter = '\n'
    skip_header = 1
    field_optionally_enclosed_by = '\047';
```

SnowSQL Put Command & List User Stage

```sql
-- let's load the data using the put command
put
    file:///tmp/ch08/*
    @~/ch08/small-csv
    auto_compress = false;

-- list the user stage location
list @~/ch08/small-csv;
```

Run Copy Command (Without Validation Flag)

```sql
-- run the copy command to load data from the stage into the table
copy into customer_validation
    from @~/ch08/small-csv/customer_01_one_error.csv
    file_format = csv_ff;

-- now rerun the copy command with on_error = 'continue',
-- which skips the error records instead of failing the whole load
copy into customer_validation
    from @~/ch08/small-csv/customer_01_one_error.csv
    file_format = csv_ff
    on_error = 'continue';
```
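After a load attempt like the ones above, you can review what happened per file without re-running anything. A sketch using the COPY_HISTORY table function (adjust the table name and time window to your environment):

```sql
-- review recent load attempts for this table, including error counts
select file_name, status, row_count, row_parsed,
       first_error_message, error_count
from table(information_schema.copy_history(
    table_name => 'CUSTOMER_VALIDATION',
    start_time => dateadd(hour, -1, current_timestamp())
));
```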


Run Copy Command with Validation Flag

```sql
-- option-1: validation_mode = return_errors
--   returns all errors (parsing, conversion, etc.) across all files
--   specified in the copy statement
copy into customer_validation
    from @~/ch08/small-csv/customer_01_one_error.csv
    file_format = csv_ff
    force = true
    validation_mode = return_errors;

-- option-2: validation_mode = return_n_rows
--   validates and returns the first n rows (here n = 1),
--   failing at the first error encountered
copy into customer_validation
    from @~/ch08/small-csv/customer_01_one_error.csv
    file_format = csv_ff
    force = true
    validation_mode = return_1_rows;

-- option-3: validation_mode = return_all_errors
--   returns all errors across all files specified in the copy statement,
--   including files partially loaded during an earlier load
copy into customer_validation
    from @~/ch08/small-csv/customer_01_one_error.csv
    file_format = csv_ff
    force = true
    validation_mode = return_all_errors;
```
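The rows returned by a validation-mode run are just a query result, so they can be persisted for later triage. A sketch using RESULT_SCAN, which must run immediately after the validation COPY statement (the table name here is made up):

```sql
-- capture the validation errors from the previous statement
-- into a table for later analysis
create or replace temporary table copy_validation_errors as
select * from table(result_scan(last_query_id()));

select * from copy_validation_errors;
```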

Multiple-Line Errors

```sql
-- list the user stage location
-- (if you have not seen my stage-related video yet, please watch it later)
list @~/ch08/small-csv;

-- run the copy command in validation mode;
-- we don't know where all the issues are yet
copy into customer_validation
    from @~/ch08/small-csv/customer_02_three_errors.csv
    file_format = csv_ff
    on_error = 'continue'
    force = true
    validation_mode = return_errors;

-- run without validation mode and skip the error records
copy into customer_validation
    from @~/ch08/small-csv/customer_02_three_errors.csv
    file_format = csv_ff
    on_error = 'continue'
    force = true;

-- check the table
select * from customer_validation;
```

One Line, Many Errors

```sql
-- list the user stage location
list @~/ch08/small-csv;

-- run the copy command in validation mode;
-- we don't know where all the issues are yet
copy into customer_validation
    from @~/ch08/small-csv/customer_03_one_line_many_error.csv
    file_format = csv_ff
    on_error = 'continue'
    force = true
    validation_mode = return_errors;

-- run without validation mode and skip the error records
copy into customer_validation
    from @~/ch08/small-csv/customer_03_one_line_many_error.csv
    file_format = csv_ff
    on_error = 'continue'
    force = true;

-- check the table
select * from customer_validation;
```

Large Files & Validation Parameter

```sql
-- here are my large files (csv compressed with gzip)
list @~/ch08/csv/partition;

-- there are 10 data files in the user stage, close to a 3.5M-row data set
-- 3 of the files contain data errors
-- we don't know which rows, in which files, are bad

-- since they are gz files, we need a new file format
create or replace file format csv_gz_ff
    type = 'csv'
    compression = 'gzip'
    field_delimiter = ','
    field_optionally_enclosed_by = '\042'
    skip_header = 1;

-- copy command with validation mode for all errors
copy into customer_validation
    from @~/ch08/csv/partition
    file_format = csv_gz_ff
    on_error = 'continue'
    force = true
    pattern = '.*[.]csv[.]gz'
    validation_mode = return_all_errors;

-- let's check the result:
-- does it bring data issues from all files (parsing as well as conversion)?
-- and note the total time taken to parse
select * from customer_validation limit 10;

-- what happens if we ask for the first 10 rows instead
copy into customer_validation
    from @~/ch08/csv/partition
    file_format = csv_gz_ff
    on_error = 'continue'
    force = true
    pattern = '.*[.]csv[.]gz'
    validation_mode = return_10_rows;

-- check the time taken, and which files were picked


select * from table(validate(customer_validation, job_id=>'query-id'));
-- example query id: 01a81f86-3200-94b8-0002-14d20005a25e

select * from table(validate(customer_validation, job_id=>'01a81f81-3200-93d3-0002-14d20005d3a6'));


-- validation mode does not support copy transformations,
-- so the statement below fails
copy into customer_validation
from 
(
    select distinct * from @~/ch08/csv/partition t
)
file_format = csv_gz_ff
on_error = 'continue'
force = true
pattern='.*[.]csv[.]gz'
validation_mode = return_all_errors;

```
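Since validation_mode cannot be combined with a transformation, one common workaround (a sketch, reusing the same stage and file format; the deduplicated table name is made up) is to validate and load the raw files first, then apply the transformation as a separate SQL step:

```sql
-- step 1: validate the raw files (no transformation)
copy into customer_validation
    from @~/ch08/csv/partition
    file_format = csv_gz_ff
    pattern = '.*[.]csv[.]gz'
    validation_mode = return_all_errors;

-- step 2: load the raw files, skipping bad records
copy into customer_validation
    from @~/ch08/csv/partition
    file_format = csv_gz_ff
    on_error = 'continue'
    force = true
    pattern = '.*[.]csv[.]gz';

-- step 3: apply the transformation in plain sql
create or replace transient table customer_dedup as
select distinct * from customer_validation;
```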