DBT - Data Build Tool

3.1. DBT - Data Build Tool#

3.1.1. DBT overview#

Modeling changes are easy to follow and revert
Explicit dependencies between models
Explore dependencies between models
Data quality tests
Error reporting
Incremental load or fact tables
Track history of dimension tables
Easy to access documantation

3.1.2. DBT setup#

3.1.2.1. For Windows#

3.1.2.1.1. Python#

This is the Python installer you want to use:

https://www.python.org/ftp/python/3.10.7/python-3.10.7-amd64.exe

Please make sure that you work with Python 3.11 as newer versions of python might not be compatible with some of the dbt packages.

3.1.2.1.2. Virtualenv setup#

Here are the commands we executed in this lesson:

cd Desktop
mkdir course
cd course

virtualenv venv
venv\Scripts\activate

3.1.2.2. For Mac#

3.1.2.2.1. iTerm2#

We suggest you to use iTerm2 instead of the built-in Terminal application.

https://iterm2.com/

3.1.2.2.2. Homebrew#

Homebrew is a widely popular application manager for the Mac. This is what we use in the class for installing a virtualenv.

https://brew.sh/

Here are the commands we execute in this lesson:

mkdir course
cd course
virtualenv venv
. venv/bin/activate

3.1.2.3. dbt installation #

pip install dbt-core dbt-ADAPTER_NAME
# or : pip install dbt-core dbt-bigquery
# or (run local): pip íntall dbt
#On Linux/Mac: which dbt

dbt setup

Initialize the dbt profiles folder on Mac/Linux:

mkdir ~/.dbt

Initialize the dbt profiles folder on Windows:

mkdir %userprofile%\.dbt

Create a dbt project (all platforms):

dbt init <dbt project path>

Let’s check the DBT config for each platform:

BigQuery
…

Then config the DBT connection

Let’s check DBT connection after config

dbt debug

3.1.2.4. About dbt projects / Folder structure#

Resource	Description
models	Each model lives in a single file and contains logic that either transforms raw data into a dataset that is ready for analytics or, more often, is an intermediate step in such a transformation.
snapshots	A way to capture the state of your mutable tables so you can refer to it later.
seeds	CSV files with static data that you can load into your data platform with dbt.
data tests	SQL queries that you can write to test the models and resources in your project.
macros	Blocks of code that you can reuse multiple times.
docs	Docs for your project that you can build.
sources	A way to name and describe the data loaded into your warehouse by your Extract and Load tools.
exposures	A way to define and describe a downstream use of your project.
metrics	A way for you to define metrics for your project.
groups	Groups enable collaborative node organization in restricted collections.
analysis	A way to organize analytical SQL queries in your project such as the general ledger from your QuickBooks.
semantic models	Semantic models define the foundational data relationships in MetricFlow and the dbt Semantic Layer, enabling you to query metrics using a semantic graph.
saved queries	Saved queries organize reusable queries by grouping metrics, dimensions, and filters into nodes visible in the dbt DAG.

dbtlearn/              
├── README.md          
├── dbt_project.yml    # Configuration file for the dbt project, specifying project settings and configurations. 
│── profiles.yml         # Cấu hình kết nối tới Snowflake, BigQuery, Redshift, v.v.
│── packages.yml         # Quản lý các package dbt (tương tự requirements.txt của Python)
├── models/            # Contains SQL files that define dbt models, which are SQL queries that transform raw data into the desired format.  
│   ├── staging/         # Các bảng tạm (staging tables) trước khi chuyển đổi
│   ├── marts/           # Các bảng đầu ra (final data marts)
│   ├── intermediate/    # Các bảng trung gian (dùng cho transformation)
│   ├── sources.yml      # Định nghĩa nguồn dữ liệu (source tables)
│   └── schema.yml       # Định nghĩa schema, kiểm tra và tài liệu mô hình
├── snapshots/         # Contains SQL files that define dbt snapshots, which capture the state of a table at a specific point in time.  
├── tests/             # Contains SQL files that define dbt tests, which are used to validate the data and ensure data quality.  
├── macros/            # Contains SQL files that define dbt macros, which are reusable SQL snippets that can be used in models, tests, and other dbt files.  
├── seeds/             # Contains CSV files that are used to seed data into the database, providing static data that can be referenced in models.  
├── analysis/          # Contains SQL files for analysis purposes, allowing for ad-hoc queries and analysis outside of the main models.  
├── logs/              # Contains log files generated by dbt operations, useful for debugging and tracking the execution of dbt commands.  
│   └── dbt.log        # Log file for dbt operations  
└── target/            # Contains compiled files and artifacts generated by dbt, including the manifest file which contains metadata about the dbt project.  
    └── manifest.json  # JSON file containing the dbt manifest

3.1.3. Configs & Properties#

3.1.3.1. Project configuration#

dbt_project.yml: defines the directory of the dbt project and other project configurations.

YAML key	Value description
name	Your project’s name in snake case
version	Version of your project
require-dbt-version	Restrict your project to only work with a range of dbt Core versions
profile	The profile dbt uses to connect to your data platform
model-paths	Directories to where your model and source files live
seed-paths	Directories to where your seed files live
test-paths	Directories to where your test files live
analysis-paths	Directories to where your analyses live
macro-paths	Directories to where your macros live
snapshot-paths	Directories to where your snapshots live
docs-paths	Directories to where your docs blocks live
vars	Project variables you want to use for data compilation

For complete details on project configurations, see dbt_project.yml.

Chú ý sử dụng đúng naming convention: https://docs.getdbt.com/reference/dbt_project.yml#naming-convention

3.1.3.2. Models config #

Model configs có thể định nghĩa theo 3 cách: ^886e82

dbt_project.yml: nơi tổng hợp config của project và từng resource files (top-level)
properties.yml: đặt trong từng resources để apply config cho resource đó (mid-level , ghi đè top-level)
config() block trong từng file .sql (low-level, ghi đè mid-level và top-level)

Một số properties chỉ được cài đặt trong properties.yml:

description
tests
docs
columns
quote
source properties (e.g. loaded_at_field, freshness)
exposure properties (e.g. type, maturity)
macro properties (e.g. arguments)

# models/jaffle_shop.yml
version: 2

sources:
  - name: raw_jaffle_shop
    description: A replica of the postgres database used to power the jaffle_shop app.
    tables:
      - name: customers
        columns:
          - name: id
            description: Primary key of the table
            tests:
              - unique
              - not_null

      - name: orders
        columns:
          - name: id
            description: Primary key of the table
            tests:
              - unique
              - not_null

          - name: user_id
            description: Foreign key to customers

          - name: status
            tests:
              - accepted_values:
                  values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']


models:
  - name: stg_jaffle_shop__customers
    config:
      tags: ['pii']
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null

  - name: stg_jaffle_shop__orders
    config:
      materialized: view
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']
              config:
                severity: warn

How to apply config on resource path: (follow the syntax)

3.1.3.3. General properties#

3.1.3.3.1. columns #

định nghĩa columns properties: data_type, description, quote, tests, tags, meta

version: 2

models:
  - name: customers
    description: "Thông tin về khách hàng"
    columns:
      - name: customer_id
        description: "Mã định danh duy nhất của khách hàng"
        data_type: "integer"
        tests:
          - not_null
          - unique
      - name: email
        description: "Địa chỉ email của khách hàng"
        data_type: "string"
        tests:
          - not_null
          - unique
        tags: ["PII"]
        meta:
          sensitivity: "high"

models customers có hai cột customer_id và email, mỗi cột được mô tả chi tiết với các thuộc tính như description, data_type, tests, tags và meta. Việc định nghĩa chi tiết này giúp cải thiện tài liệu hóa và đảm bảo chất lượng dữ liệu trong dự án dbt của bạn.

3.1.3.3.2. config #

3.1.3.3.3. constraints#

Các constraints (ràng buộc) trong dbt không chỉ đơn thuần là validation (kiểm tra dữ liệu) sau khi chạy model, mà chúng thực sự được áp dụng ở cấp độ cấu trúc bảng trên database.

constraints giúp bảo vệ tính toàn vẹn dữ liệu ngay từ đầu, trong khi dbt test giúp phát hiện dữ liệu sai nhưng không ngăn chặn. Nếu database của bạn hỗ trợ constraints, đây là một cách mạnh mẽ để kiểm soát chất lượng dữ liệu trực tiếp trong schema.

Cụ thể:

Khi bạn khai báo constraints trong model, dbt sẽ yêu cầu database áp dụng ràng buộc trên bảng (ví dụ: cột không được null, cột phải unique, có khóa chính/khóa ngoại, v.v.).
Nếu dữ liệu vi phạm constraints này, database sẽ chặn việc nhập dữ liệu vào bảng ngay từ đầu.
Điều này khác với việc chạy test trong dbt (ví dụ: dbt test), vì test chỉ kiểm tra sau khi dữ liệu đã được nạp vào.

3.1.3.3.4. deprecation_date #

3.1.3.3.5. description #

description được sử dụng để cung cấp mô tả cho các resources như models, source, seed, snapshot, analyses, macro, test dữ liệu và các cột tương ứng của chúng. Những mô tả này giúp tài liệu hóa dự án và được hiển thị trên trang web tài liệu do dbt tạo ra.

3.1.3.3.6. lastest_version #

latest_version được sử dụng để chỉ định phiên bản mới nhất của một mô hình đã được phiên bản hóa. Điều này đặc biệt hữu ích khi bạn có nhiều phiên bản của cùng một mô hình và muốn kiểm soát phiên bản nào sẽ được sử dụng mặc định trong các tham chiếu không chỉ định rõ ràng phiên bản.

models:
  - name: ten_mo_hinh
    latest_version: 2
    versions:
      - v: 3
      - v: 2
      - v: 1

Trong ví dụ trên, mặc dù có ba phiên bản (1, 2, 3), nhưng latest_version được đặt là 2. Điều này có nghĩa là bất kỳ tham chiếu nào đến mô hình ten_mo_hinh mà không chỉ định phiên bản cụ thể (ví dụ: ref('ten_mo_hinh')) sẽ mặc định trỏ đến phiên bản 2. Phiên bản 3 sẽ được coi là “prerelease” (phiên bản chuẩn bị phát hành), trong khi phiên bản 1 được coi là “old” (cũ).

Nếu không chỉ định latest_version, dbt sẽ mặc định coi phiên bản 3 là phiên bản mới nhất. Do đó, ref('ten_mo_hinh') sẽ trỏ đến ten_mo_hinh.v3

3.1.3.3.7. include-exclude columns #

3.1.3.3.8. Data tests #

tests property defines assertions about a column, table, or view. 4 kiểm thử tích hợp sẵn trong dbt:

not_null: Xác nhận rằng không có giá trị null trong một cột.
unique: Xác nhận rằng không có giá trị trùng lặp trong một cột.
accepted_values: Xác nhận rằng tất cả các giá trị trong một cột đều nằm trong một danh sách giá trị được cung cấp.
relationships: Xác nhận rằng tất cả các bản ghi trong bảng con có một bản ghi tương ứng trong bảng cha (tính toàn vẹn tham chiếu).

Ngoài ra còn có các custom tests.

3.1.3.3.9. versions #

versions cho phép bạn quản lý và theo dõi các phiên bản khác nhau của một mô hình theo thời gian. Điều này đặc biệt hữu ích khi bạn cần thực hiện các thay đổi quan trọng đối với mô hình mà không muốn ảnh hưởng đến các quy trình hoặc người dùng hiện tại đang dựa vào phiên bản cũ.

# models/<schema>.yml
version: 2

models:
  - name: model_name
    versions:
      - v: <version_identifier> # required
        defined_in: <file_name> # optional -- default is <model_name>_v<v>
        columns:
          # specify all columns, or include/exclude columns from the top-level model YAML definition
          - [include](/reference/resource-properties/include-exclude): <include_value>
            [exclude](/reference/resource-properties/include-exclude): <exclude_list>
          # specify additional columns
          - name: <column_name> # required
      - v: ...
    
    # optional
    [latest_version](/reference/resource-properties/latest_version): <version_identifier> 

3.1.3.4. General configs#

Cấu hình	Mục đích
`docs`	Điều khiển việc hiển thị tài nguyên trong tài liệu tự động của dbt và đặt màu cho các nút.
`access`	Xác định mức độ truy cập của mô hình (`private`, `protected`, `public`) để kiểm soát phạm vi tham chiếu.
`alias`	Đặt tên thay thế cho bảng hoặc view được tạo trong cơ sở dữ liệu.
`contract`	Xác định hợp đồng cho mô hình, bao gồm các ràng buộc về schema và dữ liệu.
`database`	Chỉ định cơ sở dữ liệu nơi tài nguyên sẽ được tạo.
`enabled`	Bật hoặc tắt tài nguyên trong quá trình chạy dbt.
`event_time`	Đánh dấu một cột là cột thời gian sự kiện, hữu ích cho các thao tác liên quan đến thời gian.
`full_refresh`	Buộc làm mới hoàn toàn cho các mô hình gia tăng trong lần chạy tiếp theo.
`grants`	Áp dụng quyền truy cập cho các đối tượng cơ sở dữ liệu được tạo bởi dbt.
`group`	Gán mô hình vào một nhóm để quản lý và tổ chức.
`meta`	Lưu trữ thông tin bổ sung tùy chỉnh cho tài nguyên.
`persist_docs`	Lưu trữ mô tả tài nguyên dưới dạng nhận xét trong cơ sở dữ liệu.
`pre-hook & post-hook`	Thực thi các lệnh SQL hoặc macro trước và sau khi chạy một mô hình.
`schema`	Chỉ định schema nơi tài nguyên sẽ được tạo.
`tags`	Gắn thẻ cho tài nguyên để phân loại và lựa chọn trong các thao tác dbt.
`unique_key`	Xác định khóa duy nhất cho các mô hình gia tăng.

3.1.4. Resources#

3.1.4.1. Models#

Models are the basic building block of your business logic
Materialized as tables, views, etc…
They live in SQL files in the models folder
Models can reference each other and use templates and macros

Materializations types

Category	View	Table	Incremental	Ephemeral (CTEs)	Materialized View
Pros	- No additional data is stored. - Always has the latest records from source data.	- Fast to query.	- Significantly reduces build time by only transforming new records.	- Allows reusable logic. - Keeps the data warehouse clean by reducing table clutter.	- Combines query performance of a table with data freshness of a view. - Can refresh automatically depending on the database. - `dbt run` works like with views.
Cons	- Slow for significant transformations or when stacked on top of other views.	- Takes time to rebuild, especially for complex transformations. - New records are not automatically added.	- Requires additional configuration and is more complex than other models.	- Cannot be directly queried. - Cannot be referenced (`ref()`) in `dbt run-operation`. - Overuse can make queries harder to debug.	- Fewer configuration options depending on the database platform. - Not all databases support materialized views.
Advice	- Start with views before switching to another materialization. - Best for models that do not perform heavy transformations.	- Use for models queried by BI tools for better user experience. - Suitable for slow transformations used by many downstream models.	- Best for event-style data. - Use when dbt runs become too slow (do not start with incremental models).	- Use for lightweight transformations early in the DAG. - Best for models used in only one or two downstream models.	- Use when incremental models are sufficient, but you want the data platform to manage incremental logic and refresh.
Use it when…	- You want a lightweight representation. - You don’t reuse data too often.	- You read from this model repeatedly.	- Fact tables. - Appends to tables.	- You merely want an alias to your data.	- Similar use cases as incremental models but want the database to manage refresh.
Don’t use it when…	- You read from the same model several times.	- Building single-use models. - Your model is populated incrementally.	- You want to update historical records.	- You read from the same model several times.	- The database platform doesn’t support materialized views.

Có thể config materializations trong 3 cách theo như [[#^886e82]]

Ví dụ trong dbt_project.yml

# dbt_project.yml
models:
  dbtlearn:
    +materialized: view
    dim:
      +materialized: table

hoặc trong config block

3.1.4.1.1. Table/View Models#

{{
  config(
    materialized = 'view'
    )
}}

WITH src_hosts AS (
    SELECT
        *
    FROM
        {{ ref('src_hosts') }}
)
SELECT
    host_id,
    NVL(
        host_name,
        'Anonymous'
    ) AS host_name,
    is_superhost,
    created_at,
    updated_at
FROM
    src_hosts

3.1.4.1.2. Incremental Models#

The fct/fct_reviews.sql model:

{{
  config(
    materialized = 'incremental',
    on_schema_change='fail'
    )
}}
WITH src_reviews AS (
  SELECT * FROM {{ ref('src_reviews') }}
)
SELECT * FROM src_reviews
WHERE review_text is not null

{% if is_incremental() %}
  AND review_date > (select max(review_date) from {{ this }})
{% endif %}

3.1.4.1.3. Ephemeral view#

+materialized: ephemeral

Mỗi 1 model dạng ephemeral sẽ được lưu thành query tạm (mà không tạo thành đối tượng trên database: table, view…)

3.1.4.2. Seeds#

Seeds là local files dùng để upload trực triếp lên data warehouse from DBT, được lưu trữ trong folder /seeds/

Các command run model project không tải lại các file seed lên warehouse, ta cần chạy lệnh dbt seed hoặc dbt build để load các folder lên data warehouse

How to write Seed properties

#seeds/<filename>.yml
version: 2

seeds:
  - name: <string>
    description: <markdown_string>
    docs:
      show: true | false
      node_color: <color_id>
    config:
      <seed_config>: <config_value>
    tests:
      - <test>
      - ... # declare additional tests
    columns:
      - name: <column name>
        description: <markdown_string>
        meta: {<dictionary>}
        quote: true | false
        tags: [<string>]
        tests:
          - <test>
          - ... # declare additional tests

      - name: ... # declare properties of additional columns

  - name: ... # declare properties of additional seeds

3.1.4.3. Sources#

Sources là những data layer trừu tượng đại diện cho input data (data từ các nguồn database, schema khác) mà không bị thay đổi trong quá trình build và run DBT, tuy nhiên vì nó là data ở dạng database nên giá trị sẽ có tính chất freshness (thay vì cố định như seeds local files)

Định nghĩa Sources in model/, check các properties and config

# models/sources.yml
version: 2

sources:
  - name: airbnb # Source name
    schema: raw  # Schema name
    tables:
      - name: listings # reference to the table name
        identifier: raw_listings # Table name of data source

      - name: hosts
        identifier: raw_hosts

      - name: reviews
        identifier: raw_reviews
        loaded_at_field: date # check mức độ freshness của trường 'date' với hiện tại, nếu date quá cũ sau hiện tại thì sẽ warning hoặc error
        freshness:
          warn_after: {count: 1, period: hour}
          error_after: {count: 24, period: hour}
          filter: datediff('day', date, current_timestamp) < 2 # optional

filter sẽ thực hiện query filder bảng trước khi chạy assertions để tránh TH query cả bảng, nhằm tối ưu chi phí và performance.

3.1.4.4. Snapshots#

Strategy Snapshot được sử dụng để ghi lại sự thay đổi của dữ liệu theo thời gian bằng cách tạo bảng snapshot. Điều này rất hữu ích cho việc theo dõi các thay đổi trong dữ liệu quan trọng, chẳng hạn như thông tin khách hàng, trạng thái đơn hàng hoặc bất kỳ dữ liệu nào có thể thay đổi theo thời gian.

Snapshot sẽ tạo ra các bảng có cấu trúc gần giống như các bảng được snapshot, bổ sung thêm các trường snapshot_id, dbt_valid_from, dbt_valid_to. Khi bảng gốc được thay đổi và chạy command dbt snapshot sẽ update lại row bị thay đổi trong bảng gốc trong bảng snapshot tương ứng.

Có 2 strategies cho việc snapshot:

Timestamp Strategy (Chiến lược dựa trên dấu thời gian)

Cơ chế hoạt động: So sánh unique_key customer_id + một cột dấu thời gian (updated_at, modified_date, v.v.) để xác định khi nào bản ghi đã thay đổi.
Nếu giá trị trong cột này thay đổi, DBT sẽ tạo một bản ghi snapshot mới.
Phù hợp với: Các bảng có một cột timestamp đại diện cho lần cập nhật gần nhất.

snapshot customer_snapshot {
target_database: my_database
target_schema: snapshots
unique_key: customer_id
strategy: timestamp
updated_at: updated_at } ```

Ưu điểm:
- Hiệu suất cao vì chỉ cần kiểm tra một cột timestamp.
- Dễ triển khai nếu dữ liệu có cột timestamp đáng tin cậy.
Nhược điểm:
- Không thể theo dõi các thay đổi nếu dữ liệu không có cột updated_at.
- Nếu timestamp không được cập nhật đúng cách, có thể bỏ lỡ thay đổi.

Check Strategy (Chiến lược kiểm tra toàn bộ hàng)

Cơ chế hoạt động: So sánh toàn bộ giá trị của một hoặc nhiều cột để phát hiện thay đổi.
Nếu bất kỳ giá trị nào trong danh sách cột được kiểm tra thay đổi, DBT sẽ tạo một snapshot mới.
Phù hợp với: Các bảng không có cột updated_at hoặc khi bạn muốn theo dõi sự thay đổi của một tập hợp cột cụ thể.

snapshot order_snapshot {
  target_database: my_database
  target_schema: snapshots
  unique_key: order_id
  strategy: check
  check_cols: ['status', 'total_price', 'customer_id']
}

Ưu điểm:
- Linh hoạt hơn vì không phụ thuộc vào cột timestamp.
- Có thể theo dõi nhiều thay đổi cùng lúc.
Nhược điểm:
- Tốn tài nguyên hơn vì phải so sánh toàn bộ dữ liệu trong các cột chỉ định.
- Cần chọn các cột kiểm tra phù hợp, nếu không có thể tạo quá nhiều snapshot không cần thiết.

So sánh Timestamp vs Check Strategy

Tiêu chí	Timestamp Strategy	Check Strategy
Cơ chế	Dựa vào cột timestamp	So sánh nhiều cột chỉ định
Hiệu suất	Cao hơn do chỉ kiểm tra một cột	Chậm hơn nếu nhiều cột được kiểm tra
Dễ triển khai	Dễ nếu có cột timestamp	Phức tạp hơn
Độ chính xác	Phụ thuộc vào độ tin cậy của timestamp	Chính xác hơn nếu chọn đúng cột
Khi nào nên sử dụng từng loại?

Dùng timestamp strategy nếu bảng dữ liệu có cột updated_at hoặc dấu thời gian đáng tin cậy.
Dùng check strategy nếu cần theo dõi sự thay đổi của nhiều cột hoặc không có cột timestamp.

Cấu trúc file snapshot

-- `snapshots/scd_raw_listings.sql`

-- snapshot begining block
{% snapshot scd_raw_listings %}

-- define how to snapshot by snapshots config
{{
   config(
       target_schema='DEV',
       unique_key='id',
       strategy='timestamp',
       updated_at='updated_at',
       hard_delete="invalidate" 
   )
}}

-- define what is snapshoted ?
select * FROM {{ source('airbnb', 'listings') }}

-- snapshot ending block
{% endsnapshot %}

3.1.4.4.1. snapshot properties #

hard_deletes: action cho record trong bảng snapshot khi record đó bị delete khỏi bảng nguồn

3.1.4.5. Tests#

tests là các kiểm thử dữ liệu giúp đảm bảo tính toàn vẹn và chính xác của dữ liệu trong mô hình. Khi chạy lệnh dbt test, dbt sẽ thực thi các kiểm thử đã định nghĩa để xác nhận rằng dữ liệu đáp ứng các quy tắc và ràng buộc đã đặt ra.

dbt hỗ trợ hai loại kiểm thử chính là Generic Tests và Singular Tests

🔹 Chạy test:

dbt test : chạy toàn bộ các tests
dbt test --select customers: chỉ chạy test cho một mô hình cụ thể (customer)

🔹 Kết quả:

Nếu test thành công: Hiển thị ✅ PASSED.
Nếu test thất bại: Hiển thị ❌ FAILED với số dòng không đạt yêu cầu.

3.1.4.5.1. Generic Tests (Kiểm thử chung)#

Là các kiểm thử có thể áp dụng cho nhiều mô hình hoặc cột khác nhau.
Được tích hợp sẵn trong dbt hoặc từ các package mở rộng như dbt_utils.

Test Name	Mô tả
`unique`	Đảm bảo cột có giá trị duy nhất (không trùng lặp).
`not_null`	Đảm bảo cột không chứa giá trị NULL.
`accepted_values`	Đảm bảo giá trị của cột nằm trong danh sách cho phép.
`relationships`	Đảm bảo giá trị của cột nằm trong danh sách giá trị cột khác
macro function	Sử dụng các macro custom function để test
dbt package	Sử dụng các dbt packages để thêm function test
Cách sử dụng: Generic tests được khai báo trực tiếp trong tệp models/schema.yml.

# models/schema.yml
version: 2

models:
  - name: dim_listings_cleansed
    description: Cleansed table which contains Airbnb listings.
    columns:

      - name: listing_id
        description: Primary key for the listing
        tests:
          - unique
          - not_null

      - name: host_id
        description: The hosts's id. References the host table.
        tests:
          - not_null
          - relationships:
              to: ref('dim_hosts_cleansed') 
              field: host_id # reference field 'host_id' in `dim_hosts_cleansed`

      - name: room_type
        description: Type of the apartment / room
        tests:
          - accepted_values:
              values: ['Entire home/apt',
                      'Private room',
                      'Shared room',
                      'Hotel room']

      - name: minimum_nights
        description: '{{ doc("dim_listing_cleansed__minimum_nights") }}'
        tests:
          - positive_value # use macro function

  - name: dim_hosts_cleansed
    description: Cleansed table for the Airbnb hosts
    columns:
      - name: host_id
        description: The id of the host. This is the primary key.
        tests:
          - not_null
          - unique
      
      - name: host_name
        description: The name of the host
        tests:
          - not_null
      
      - name: is_superhost
        description: Defines whether the hosts is a superhost.
        tests:
          - accepted_values:
              values: ['t', 'f']

  - name: dim_listings_w_hosts
    tests:
      - dbt_expectations.expect_table_row_count_to_equal_other_table: # use package "dbt_expectations"
          compare_model: source('airbnb', 'listings')
    columns:
      - name: price
        tests:
          - dbt_expectations.expect_column_values_to_be_of_type:
              column_type: number
          - dbt_expectations.expect_column_quantile_values_to_be_between:
              quantile: .99
              min_value: 50
              max_value: 500
          - dbt_expectations.expect_column_max_to_be_between:
              max_value: 5000
              config:
                severity: warn

  - name: fct_reviews
    columns:
      - name: listing_id
        tests:
          - relationships:
              to: ref('dim_listings_cleansed')
              field: listing_id

      - name: reviewer_name
        tests:
          - not_null
      
      - name: review_sentiment
        tests:
          - accepted_values:
              values: ['positive', 'neutral', 'negative']

💡 Khi chạy dbt test, dbt sẽ tự động tạo và chạy các truy vấn SQL để kiểm tra điều kiện trên.

3.1.4.5.2. Singular Tests (Kiểm thử tùy chỉnh)#

Là các kiểm thử được viết dưới dạng truy vấn SQL.
Kiểm thử này có thể kiểm tra các quy tắc phức tạp, chẳng hạn như điều kiện logic hoặc dữ liệu bất thường.
Được khai báo trong thư mục tests/ dưới dạng file SQL.

Cách sử dụng: Tạo tệp tests/dim_listings_minimun_nights.sql:

# tests/dim_listings_minimun_nights.sql
SELECT
    *
FROM
    {{ ref('dim_listings_cleansed') }}
WHERE minimum_nights < 1
LIMIT 10

Thực hiện test: dbt test --select dim_listings_cleansed 💡 Nếu truy vấn trả về bất kỳ dòng nào, test sẽ thất bại.

3.1.4.5.3. Mở rộng tests với dbt-utils#

Dbt hỗ trợ thêm các kiểm thử nâng cao từ package dbt-utils, giúp kiểm thử linh hoạt hơn.

Trong file packages.yml:

packages:
  - package: dbt-labs/dbt_utils
    version: 1.3.0 # version

Ví dụ: Kiểm tra trường email có đúng định dạng không?

models:
  - name: customers
    columns:
      - name: email
        tests:
          - dbt_utils.expression_is_true:
              expression: "email LIKE '%@%.%'"

🔹 Nếu có email không đúng định dạng, test sẽ thất bại.

3.1.4.5.4. Check SQL to get fail test#

Nếu trong TH chạy test bị lỗi, thông báo sẽ chỉ ra file .sql chạy test bị fail. Để check file .sql đó:

For Mac/Linux

cat <filepath>

For Windows:

type <filepath>

3.1.4.6. Macros#

🔹Macro trong dbt là các đoạn mã SQL được viết bằng Jinja có thể tái sử dụng, tương tự như hàm (function) trong lập trình. Macro giúp:

Tránh lặp lại code.
Viết SQL động có thể thay đổi tùy theo tham số.
Dễ dàng tùy chỉnh logic xử lý dữ liệu.
Giúp duy trì code sạch và dễ bảo trì.

🔹Macros được lưu trong thư mục macros/ của dự án dbt. Khi dbt chạy, nó sẽ biên dịch macro thành SQL thực thi.

🔹Macros có thể chứa:

Biến ({{ variable }}) để chèn giá trị động.
Logic điều kiện (if-else) để tùy chỉnh SQL.
Vòng lặp (for loop) để lặp qua danh sách giá trị.

3.1.4.6.1. Cấu trúc Macro và cách sử dụng#

-- macros/<filename.sql>

{% macro macro_name(param1, param2) %}
    -- Logic SQL của macro
{% endmacro %}

🔹For Loop + Sử dụng trong singular tests :

-- macros/no_nulls_in_columns.sql

{% macro no_nulls_in_columns(model) %}
    SELECT * FROM {{ model }} WHERE
    {% for col in adapter.get_columns_in_relation(model) -%}
        {{ col.column }} IS NULL OR
    {% endfor %}
    FALSE
{% endmacro %}

Sử dụng trong singular test

-- tests/no_nulls_in_dim_listings.sql
{{ no_nulls_in_columns(ref('dim_listings_cleansed')) }}

🔹SQL Select Template + Sử dụng trong generic tests:

-- macros/positive_value.sql
{% test positive_value(model, column_name) %}
SELECT
    *
FROM
    {{ model }}
WHERE
    {{ column_name}} < 1
{% endtest %}

Sử dụng trong generic tests:

# models/schema.yml

models:
  - name: dim_listings_cleansed
    columns:
      - name: minimum_nights
        tests:
          - positive_value # use macro function

🔹SQL Function Template + Sử dụng trong models definition:

{% macro cents_to_dollars(column_name, scale=2) %}

({{ column_name }} / 100)::numeric(16, {{ scale }})  

{% endmacro %}

🔹Dùng if-else trong macro

{% macro check_table(table_name) %}
    {% if execute %}  -- Chỉ chạy trong dbt run, không chạy khi compile
    SELECT COUNT(*) FROM {{ ref(table_name) }};
    {% else %} -- chạy nếu chạy dbt compile
    SELECT 'Skipping execution';
    {% endif %}
{% endmacro %}

-- Sử dụng: {{ check_table('customers') }}

🔹Dùng for loop trong macro

{% macro generate_columns(columns) %}
    {% for col in columns %}
        {{ col }} AS transformed_{{ col }},
    {% endfor %}
{% endmacro %}


-- Sử dụng:
SELECT 
	{{ generate_columns(['sales', 'profit', 'revenue']) }} 
FROM financial_data;

-- Sau khi compile:
SELECT 
	sales AS transformed_sales, 
	profit AS transformed_profit, 
	revenue AS transformed_revenue 
FROM financial_data;

🔹Get biến môi trường (Environment Variables)

{% macro get_env_var(var_name, default='') %}
    {% if var(var_name) %}
        {{ var(var_name) }}
    {% else %}
        {{ default }}
    {% endif %}
{% endmacro %}

-- Sử dụng:
SELECT {{ get_env_var('database_name', 'default_db') }} AS active_db;

3.1.4.6.2. Macro properties#

Macro properties được định nghĩa trong file properties.yml (nên để trong folder macro/) dùng để miêu tả function và các arguments liên quan:

check

3.1.4.7. Packages#

Packages DBT được list tại https://hub.getdbt.com/:

dbt_utils: bổ sung nhiều macro có thể (re)used across dbt projects
dbt_expectations: tích hợp nhiều macro test thay vì phải tự build hoặc test bên ngoài DBT
…

Cài thêm package The contents of packages.yml:

packages:
  - package: dbt-labs/dbt_utils
    version: 1.3.0

Chạy lệnh để load package: dbt deps

Sử dụng package

Sử dụng trong models

-- models/fct_reviews.sql
{{
  config(
    materialized = 'incremental',
    on_schema_change='fail'
    )
}}
WITH src_reviews AS (
  SELECT * FROM {{ ref('src_reviews') }}
)
SELECT 
  {{ dbt_utils.generate_surrogate_key(['listing_id', 'review_date', 'reviewer_name', 'review_text']) }}
    AS review_id,
  * 
  FROM src_reviews
WHERE review_text is not null
{% if is_incremental() %}
  AND review_date > (select max(review_date) from {{ this }})
{% endif %}

3.1.4.8. Documentation#

Định nghĩa mô tả các models và field The models/schema.yml after adding the documentation:

version: 2

models:
  - name: dim_listings_cleansed
    description: Cleansed table which contains Airbnb listings.
    columns:
      
      - name: listing_id
        description: Primary key for the listing
        tests:
          - unique
          - not_null
        
      - name: host_id
        description: The hosts's id. References the host table.
        tests:
          - not_null
          - relationships:
              to: ref('dim_hosts_cleansed')
              field: host_id

      - name: room_type
        description: Type of the apartment / room
        tests:
          - accepted_values:
              values: ['Entire home/apt', 'Private room', 'Shared room', 'Hotel room']

      - name: minimum_nights
        description: '{{ doc("dim_listing_cleansed__minimum_nights") }}'
        tests:
          - positive_value

  - name: dim_hosts_cleansed
    columns:
      - name: host_id
        tests:
          - not_null
          - unique
      
      - name: host_name
        tests:
          - not_null
      
      - name: is_superhost
        tests:
          - accepted_values:
              values: ['t', 'f']
  
  - name: fct_reviews
    columns:
      - name: listing_id
        tests:
          - relationships:
              to: ref('dim_listings_cleansed')
              field: listing_id

      - name: reviewer_name
        tests:
          - not_null
      
      - name: review_sentiment
        tests:
          - accepted_values:
              values: ['positive', 'neutral', 'negative']

Định nghĩa tham chiếu trong file markdown The contents of models/docs.md:

{% docs dim_listing_cleansed__minimum_nights %}
Minimum number of nights required to rent this property. 

Keep in mind that old listings might have `minimum_nights` set 
to 0 in the source tables. Our cleansing algorithm updates this to `1`.

{% enddocs %}

Định nghĩa tham chiếu trong file markdown chứa special assets The contents of models/overview.md:

{% docs __overview__ %}
# Airbnb pipeline

Hey, welcome to our Airbnb pipeline documentation!

Here is the schema of our input data:
![input schema](https://dbtlearn.s3.us-east-2.amazonaws.com/input_schema.png)

{% enddocs %}

Tips: Có thể sử dụng VSCode Extension Power User for dbt để sửa documantation cho nhanh

3.1.4.8.1. Build Documentation#

dbt docs generate: build documantation
dbt docs serve: chạy website hiện metadata của project và DAG Linage Graph

3.1.4.9. Analyses#

Folder analyses/ được dùng để chứa SQL ad-hoc queries, report, hoặc phân tích dữ liệu mà không cần phải materialize (tạo bảng hoặc view) như các mô hình (models/).

Viết và lưu trữ câu truy vấn SQL mà bạn có thể chạy khi cần.
Phân tích dữ liệu mà không làm ảnh hưởng đến mô hình chính.
Chia sẻ truy vấn phức tạp giữa các thành viên trong nhóm.

The contents of analyses/full_moon_no_sleep.sql:

WITH fullmoon_reviews AS (
    SELECT * FROM {{ ref('fullmoon_reviews') }}
)
SELECT
    is_full_moon,
    review_sentiment,
    COUNT(*) as reviews
FROM
    fullmoon_reviews
GROUP BY
    is_full_moon,
    review_sentiment
ORDER BY
    is_full_moon,
    review_sentiment

Sau khi viết xong file SQL query, chạy phân tích bằng dbt compile và check SQL sau khi compile bằng

For Mac: cat target/compiled/{project_name}/<filepath>.sql
Window: type target/compiled/{project_name}/<filepath>.sql

3.1.4.10. Hooks#

Hooks là SQLs được chạy tự động trước hoặc sau một sự kiện cụ thể, giúp kiểm soát và tối ưu quá trình ETL.

Hook Type	Chạy Khi Nào?	Ví Dụ
`pre-hook`	Trước khi chạy một model (model/seed/snapshot)	Xóa dữ liệu cũ
`post-hook`	Sau khi chạy một model (model/seed/snapshot)	Ghi log, kiểm tra dữ liệu
`on-run-start`	Khi bắt đầu `dbt run/seed/snapshot`	Ghi log toàn bộ pipeline
`on-run-end`	Khi kết thúc `dbt run/seed/snapshot`	Cập nhật trạng thái pipeline

3.1.4.10.1. `pre-hook` và `post-hook` cho model #

Được sử dụng để chạy SQL trước (pre-hook) hoặc sau (post-hook) khi chạy một model.

Ví dụ 1: Audit logs trước khi chạy model

-- models/orders.sql
{{ config(
    materialized='table',
    pre_hook="INSERT INTO audit_logs (table_name, action, timestamp) VALUES ('orders', 'start', current_timestamp)",
    post_hook="INSERT INTO audit_logs (table_name, action, timestamp) VALUES ('orders', 'end', current_timestamp)"
) }}

SELECT * FROM raw_data.orders;

📌 Khi chạy dbt run, dbt sẽ:

Chạy pre_hook: Thêm log vào audit_logs trước khi chạy model.
Chạy model orders.sql.
Chạy post_hook: Thêm log vào audit_logs sau khi model chạy xong.

Ví dụ 2: Xóa dữ liệu cũ trước khi load

{{ config(
    materialized='incremental',
    pre_hook="DELETE FROM analytics.orders WHERE order_date >= '2024-01-01'"
) }}
SELECT * FROM raw_data.orders WHERE order_date >= '2024-01-01';

📌 Giúp tránh trùng dữ liệu khi load bảng mới.

Ví dụ 3: Thêm trực tiếp vào dbt_project.yml để thực hiện cho tất cả các models trong configs

# dbt_project.yml
+post-hook:
      - "GRANT SELECT ON {{ this }} TO ROLE REPORTER"

3.1.4.10.2. `on-run-start` và `on-run-end` (Chạy đầu/cuối dự án)#

Các hooks này chạy trước hoặc sau toàn bộ pipeline dbt run, thay vì chỉ chạy cho từng model.

Ví dụ 4: Ghi log vào bảng run_logs khi bắt đầu và kết thúc pipeline

Trong file dbt_project.yml:

on-run-start:
  - "INSERT INTO run_logs (run_id, status, start_time) VALUES ('{{ invocation_id }}', 'started', current_timestamp)"

on-run-end:
  - "UPDATE run_logs SET status='completed', end_time=current_timestamp WHERE run_id='{{ invocation_id }}'"

📌 Mục đích:

on-run-start: Ghi lại ID của lần chạy vào run_logs trước khi chạy models.
on-run-end: Cập nhật trạng thái khi quá trình chạy kết thúc.

3.1.4.10.3. Khi nào nên dùng Hooks?#

✅ Audit Logging: Ghi log trước/sau khi chạy models.
✅ Data Cleanup: Xóa dữ liệu cũ trước khi chạy (pre-hook).
✅ Data Quality Checks: Chạy dbt test tự động trước khi chạy models.
✅ Session Configuration: Set biến môi trường (SET TIME ZONE 'UTC').
✅ Incremental Loads: Xóa dữ liệu trùng (DELETE FROM table WHERE ...).

3.1.4.10.4. Lưu ý khi sử dụng Hooks#

⚠ Hooks chạy trong transaction của dbt, nên nếu model fail, post-hook có thể không chạy.
⚠ Không lạm dụng hooks vì có thể gây phức tạp cho pipeline.
⚠ Chạy SQL nặng trong post-hook có thể ảnh hưởng hiệu suất.

3.1.4.11. Exposures (Dashboard)#

Exposures trong dbt là cách để theo dõi và quản lý những hệ thống, báo cáo, hoặc ứng dụng sử dụng dữ liệu từ dbt. Nó giúp bạn hiểu ai đang sử dụng dữ liệu được tạo bởi dbt, từ đó giúp dễ dàng kiểm soát và đánh giá tác động khi thay đổi dữ liệu.

Mục tiêu: 🔹 Quản lý tài nguyên phụ thuộc: Biết được dashboard, ML model, hoặc hệ thống nào đang dùng dữ liệu.
🔹 Cải thiện tài liệu (dbt docs): Hiển thị các ứng dụng sử dụng dbt models trong dbt docs.
🔹 Hỗ trợ lineage (dbt deps): Xác định nếu có thay đổi trong models ảnh hưởng đến báo cáo nào.
🔹 Tăng khả năng truy xuất nguồn gốc dữ liệu: Biết ai đang tiêu thụ dữ liệu dbt.

3.1.4.11.1. Config Exposure#

Tạo file exposures.yml (hoặc bất kỳ tên gì) trong thư mục models/ hoặc exposures/:

Template

# models/<filename>.yml
version: 2  
  
exposures:  
- name: <string_with_underscores>  
[description](https://docs.getdbt.com/reference/resource-properties/description): <markdown_string>  
type: {dashboard, notebook, analysis, ml, application}  
url: <string>  
maturity: {high, medium, low} # Indicates level of confidence or stability in the exposure  
[tags](https://docs.getdbt.com/reference/resource-configs/tags): [<string>]  
[meta](https://docs.getdbt.com/reference/resource-configs/meta): {<dictionary>}  
owner:  
name: <string>  
email: <string>  
  
depends_on:  
- ref('model')  
- ref('seed')  
- source('name', 'table')  
- metric('metric_name')  
  
label: "Human-Friendly Name for this Exposure!"  
[config](https://docs.getdbt.com/reference/resource-properties/config):  
enabled: true | false  
  
- name: ... # declare properties of additional exposures

# models/dashboard.yml
version: 2

exposures:
  - name: executive_dashboard
    label: Executive Dashboard
    type: dashboard
    maturity: low
    url: https://00d200da.us1a.app.preset.io/superset/dashboard/x/?edit=true&native_filters_key=fnn_HJZ0z42ZJtoX06x7gRbd9oBFgFLbnPlCW2o_aiBeZJi3bZuyfQuXE96xfgB
    description: Executive Dashboard about Airbnb listings and hosts
      

    depends_on:
      - ref('dim_listings_w_hosts')
      - ref('mart_fullmoon_reviews')

    owner:
      name: Zoltan C. Toth
      email: dbtstudent@gmail.com

Thuộc tính	Ý nghĩa
`name`	Tên duy nhất của exposure (không có dấu cách).
`type`	Loại hệ thống sử dụng dữ liệu (dashboard, ML, application, analysis, etc.).
`label`	Tên dễ đọc của exposure (hiển thị trong dbt docs).
`description`	Mô tả về hệ thống hoặc báo cáo đang dùng dữ liệu.
`depends_on`	Danh sách models dbt hoặc sources mà hệ thống này phụ thuộc vào.
`owner`	Người chịu trách nhiệm cho hệ thống này (thường là team hoặc user cụ thể).

Các loại Exposures phổ biến

Loại (`type`)	Ý nghĩa
`dashboard`	Các bảng điều khiển (Looker, Tableau, Power BI, etc.).
`ml`	Mô hình Machine Learning sử dụng dữ liệu từ dbt.
`application`	Ứng dụng tiêu thụ dữ liệu từ dbt.
`analysis`	Báo cáo hoặc nghiên cứu dùng dữ liệu từ dbt.

3.1.4.11.2. Hiển thị Exposures trong dbt docs#

Chạy lệnh sau để tạo tài liệu:

dbt docs generate
dbt docs serve

Sau đó, mở trình duyệt và xem lineage của exposures trong dbt UI.

3.1.4.11.3. Kiểm tra dependencies của một Exposure#

dbt ls --select exposure:sales_dashboard+

📌 Mục đích: Hiển thị toàn bộ models mà sales_dashboard phụ thuộc vào.

3.1.4.12. Variables#

Variables (vars) trong dbt là các biến có thể được truyền vào dbt để sử dụng trong models, macros, hoặc tests. Chúng giúp tùy chỉnh tham số, giúp dbt linh hoạt hơn mà không cần sửa đổi SQL code trực tiếp.

3.1.4.12.1. Cách thiết lập `vars` trong dbt#

Có 3 cách chính để thiết lập biến (vars):

3.1.4.12.1.1. Định nghĩa trong `dbt_project.yml`#

Bạn có thể khai báo biến trực tiếp trong file dbt_project.yml:

# dbt_project.yml

vars:
  country: "USA"
  start_date: "2024-01-01"
  max_orders: 1000

📌 Biến này sẽ có giá trị mặc định và có thể ghi đè khi chạy dbt.

3.1.4.12.1.2. Truyền `vars` khi chạy `dbt run`#

Bạn có thể truyền biến khi chạy lệnh dbt:

dbt run --vars '{"country": "Canada", "start_date": "2024-02-01"}'

📌 Giá trị của biến sẽ thay thế giá trị trong dbt_project.yml chỉ trong lần chạy đó.

3.1.4.12.1.3. Định nghĩa `vars` trong một model cụ thể#

Bạn có thể đặt vars chỉ cho một model trong config:

{{ config(
    materialized='table',
    vars={"country": "UK"}
) }}

SELECT * 
FROM customers 
WHERE country = '{{ var("country", "USA") }}'

📌 Nếu biến country không được đặt, nó sẽ mặc định là "USA".

3.1.4.12.2. Cách sử dụng `vars` trong dbt#

3.1.4.12.2.1. SQL models#

Trong file models/customers.sql, bạn có thể sử dụng vars để lọc dữ liệu:

SELECT * 
FROM customers 
WHERE country = '{{ var("country", "USA") }}'

📌 Nếu chạy: dbt run --vars '{"country": "Canada"}' → Query sẽ chỉ lấy khách hàng ở Canada.

3.1.4.12.2.2. macros#

Biến cũng có thể dùng trong macros (Jinja):

{% macro filter_by_country() %}
    country = '{{ var("country", "USA") }}'
{% endmacro %}

Sau đó có thể dùng macro này trong model:

SELECT * 
FROM customers 
WHERE {{ filter_by_country() }}

3.1.4.12.2.3. seeds#

Nếu bạn có seed CSV, bạn có thể dùng vars để giới hạn dữ liệu:

seeds:
  my_project:
    my_seed_file:
      vars:
        start_date: "2024-01-01"

Sau đó, trong SQL:

SELECT * 
FROM {{ ref('my_seed_file') }}
WHERE order_date >= '{{ var("start_date", "2023-01-01") }}'

📌 Khi chạy dbt seed, dữ liệu sẽ được lọc theo start_date.

3.1.4.12.2.4. tests#

Bạn có thể tham chiếu vars khi viết test tùy chỉnh:

tests:
  - name: test_minimum_orders
    description: "Check if number of orders is greater than threshold"
    sql: >
      SELECT COUNT(*) 
      FROM {{ ref('orders') }} 
      WHERE total_orders < {{ var("max_orders", 500) }}

📌 Giá trị max_orders có thể thay đổi khi chạy dbt.

3.1.5. DBT Command #

Command	Description
init	Initializes a new dbt project Khởi tạo một dự án dbt mới trong thư mục hiện tại, tạo ra cấu trúc thư mục và các tệp cấu hình cần thiết. Khi muốn tạo một dự án dbt mới. `dbt init my_project`: ✔️ Dự án mới được tạo.
debug	Debugs dbt connections and projects Kiểm tra kết nối và cấu hình của dbt để đảm bảo rằng mọi thứ được thiết lập đúng cách. ✔️ Kiểm tra và báo lỗi kết nối nếu có.
build	Builds and tests all selected resources (models, seeds, snapshots, tests), this will do: run models, test tests, snapshot snapshots, seed seeds run model, test, snapshot và seed theo DAG order. Lệnh này giúp xây dựng toàn bộ dự án dbt một cách toàn diện. - Khi muốn chạy toàn bộ pipeline dữ liệu bao gồm: mô hình, kiểm thử, snapshot và seed theo đúng thứ tự phụ thuộc. - Hữu ích trong quá trình vận hành, đảm bảo dữ liệu được xử lý đúng và kiểm thử dữ liệu tự động. Kết quả: ✔️ Mô hình mới được chạy trên database. ✔️ Kiểm thử dữ liệu chạy để đảm bảo tính đúng đắn. ✔️ Snapshot được cập nhật (nếu có).
run	Runs the models in a project follow DAG order run models SQL đã được biên dịch trong dự án dbt trên cơ sở dữ liệu mục tiêu. Lệnh này tạo ra các bảng hoặc view dựa trên các mô hình đã định nghĩa. - Khi bạn chỉ muốn chạy mô hình SQL mà không cần chạy tests, seed, snapshot. - Hữu ích trong quá trình phát triển, để kiểm tra mô hình có chạy đúng không. 🔹`dbt run --select sales_report` : ✔️ Chạy mô hình `sales_report` trên database. 🔹`dbt run --select sales_report+` : ✔️ Chạy mô hình `sales_report` và tất cả dependents (luồng downstream - các bảng mà phụ thuộc vào bảng đích) trên database. 🔹`dbt run --select +sales_report` : ✔️ Chạy mô hình `sales_report` và tất cả dependencies (luồng upstream - các bảng mà bảng đích phụ thuộc) trên database. 🔹`dbt run --select +sales_report+` : ✔️ Chạy full luồng mô hình `sales_report`.
test	Executes tests defined (models, sources, snapshots, seeds) in a project Chạy các kiểm thử dữ liệu được định nghĩa trên các mô hình, nguồn dữ liệu, snapshot và seed. Lệnh này giúp xác minh tính toàn vẹn và chất lượng của dữ liệu. - Khi muốn kiểm tra tính đúng đắn của dữ liệu (ví dụ: không có giá trị NULL, giá trị duy nhất, quan hệ khóa ngoại đúng). - Hữu ích trong quá trình vận hành, giúp phát hiện lỗi dữ liệu kịp thời. `dbt test`: ✔️ Hiển thị lỗi nếu có dữ liệu không hợp lệ.
seed	Loads CSV files into the database Tải các tệp CSV vào cơ sở dữ liệu. Điều này hữu ích cho việc nhập dữ liệu tĩnh hoặc ít thay đổi, chẳng hạn như bảng mã quốc gia hoặc bảng tra cứu. - Khi muốn nạp dữ liệu tĩnh vào database từ file CSV (ví dụ: danh sách quốc gia, danh mục sản phẩm). - Hữu ích khi khởi tạo dự án hoặc chạy thử nghiệm. `dbt seed`: ✔️ Dữ liệu từ `country_list.csv` trong folder `seed/` được tải lên database.
snapshot	Executes “snapshot” jobs defined in a project Thực thi các công việc “snapshot” được định nghĩa trong dự án, cho phép theo dõi lịch sử thay đổi của dữ liệu theo thời gian. - Khi cần lưu lại lịch sử thay đổi dữ liệu, giúp phân tích dữ liệu theo thời gian. - Hữu ích với dữ liệu có sự thay đổi theo thời gian như: giá sản phẩm, trạng thái đơn hàng. `dbt snapshot`: ✔️ Lưu trạng thái hiện tại của khách hàng vào bảng snapshot.
docs	Generates documentation for a project Tạo và phục vụ tài liệu cho dự án dbt. Lệnh này có hai lệnh con: - `dbt docs generate`: Tạo trang web tài liệu cho dự án bằng cách biên dịch các tài nguyên và thu thập metadata từ cơ sở dữ liệu. - `dbt docs serve`: Khởi động máy chủ web để phục vụ tài liệu và mở trang web tài liệu trong trình duyệt mặc định. - Khi muốn tạo tài liệu tự động cho mô hình dữ liệu. - Hữu ích trong quản lý dữ liệu, giúp team dễ dàng tra cứu thông tin. `dbt docs serve`: ✔️ Website tài liệu hiển thị metadata và lineage của các mô hình.
run-operation	Invokes a macro, including running arbitrary maintenance SQL against the database Gọi một macro, bao gồm việc chạy các lệnh SQL bảo trì tùy ý trên cơ sở dữ liệu. Khi muốn chạy macro tùy chỉnh (ví dụ: xóa cache, cập nhật metadata). `dbt run-operation drop_old_tables` ✔️ Macro `drop_old_tables` chạy, xóa bảng cũ.
clean	Deletes artifacts present (folder: `\targer`) in the dbt project Khi muốn dọn dẹp thư mục `target/` và `dbt_packages/`, tránh lỗi do dữ liệu cũ. `dbt clean`: ✔️ Thư mục cũ bị xóa, đảm bảo môi trường sạch.
clone	Clones selected models from the specified state Sao chép các mô hình được chọn từ trạng thái được chỉ định. Khi muốn sao chép trạng thái mô hình từ một môi trường khác. `dbt clone --state prod` ✔️ Sao chép dữ liệu từ môi trường `prod`.
compile	Compiles (but does not run) the models in a project Biên dịch các mô hình trong dự án mà không thực thi chúng, tạo ra các tệp SQL đã biên dịch trong thư mục `target`. Khi muốn xem SQL đã biên dịch của mô hình mà không chạy nó. `dbt compile --select sales_report`: ✔️ File SQL được tạo trong `target/compiled/`.
deps	Downloads dependencies for a project Tải về các phụ thuộc cho dự án dbt, chẳng hạn như các gói dbt được chỉ định trong tệp `packages.yml`. Khi cần tải về package dbt từ `packages.yml` (ví dụ: `dbt_utils`). `dbt deps`: ✔️ Các package được tải về thư mục `dbt_packages/`.
invocation	Enables users to debug long-running sessions by interacting with active invocations.
list	Lists resources defined in a dbt project Liệt kê các tài nguyên được định nghĩa trong dự án dbt. Lệnh này chấp nhận các đối số lựa chọn tương tự như `dbt run`. Khi muốn liệt kê các mô hình trong dự án. `dbt ls --resource-type model`: ✔️ Liệt kê tất cả các mô hình.
parse	Parses a project and writes detailed timing info
retry	Retry the last run `dbt` command from the point of failure Thử lại lệnh dbt cuối cùng từ điểm thất bại. `dbt retry`: ✔️ Tiếp tục từ chỗ lỗi trước đó.
show	Previews table rows post-transformation Biên dịch định nghĩa dbt-SQL của một mô hình, kiểm thử, phân tích hoặc một truy vấn dbt-SQL tùy ý được truyền vào bằng `--inline`, sau đó chạy truy vấn đó trên kho dữ liệu và xem trước kết quả trong terminal. Khi muốn xem trước dữ liệu của một mô hình. `dbt show --select sales_report`: ✔️ Hiển thị kết quả SQL của `sales_report`.
source	Provides tools for working with source data (including validating that sources are “fresh”) Cung cấp các lệnh con hữu ích khi làm việc với dữ liệu nguồn. Lệnh này có một lệnh con là `dbt source freshness`, dùng để kiểm tra độ mới của các bảng nguồn. Khi muốn kiểm tra độ mới của dữ liệu nguồn. `dbt source freshness`: ✔️ Cảnh báo nếu dữ liệu quá cũ.

3.1.6. Test & debug DBT#

The original Great Expectations project on GitHub: great-expectations/great_expectations
dbt-expectations: calogica/dbt-expectations

For the final code in packages.yml, models/schema.yml and models/sources.yml, please refer to the course’s Github repo: nordquant/complete-dbt-bootcamp-zero-to-hero

Testing a single model:

dbt test --select dim_listings_w_hosts

Testing individual sources:

dbt test --select source:airbnb.listings

Debugging dbt:

dbt --debug test --select dim_listings_w_hosts

Logging:

The contents of macros/logging.sql:

{% macro learn_logging() %}
    {{ log("Call your mom!") }}
    {{ log("Call your dad!", info=True) }} --> Logs to the screen, too
--  {{ log("Call your dad!", info=True) }} --> This will be put to the screen
    {# log("Call your dad!", info=True) #} --> This won't be executed
{% endmacro %}

Executing the macro:

dbt run-operation learn_logging

3.1.7. Deployment#

3.1.8. Orchestration#

Links to different orchestrators

3.1.8.1. Dagster#

Set up your environment

Let’s create a virtualenv and install dbt and dagster. These packages are located in [requirements.txt].

virutalenv venv -p python3.11
pip install -r requirements.txt

Create a dagster project

Dagster has a command for creating a dagster project from an existing dbt project:

dagster-dbt project scaffold --project-name dbt_dagster_project --dbt-project-dir=dbtlearn

At this point in the course, open schedules.py and uncomment the schedule logic.

Start dagster Now that our project is created, start the Dagster server:

On Windows - PowerShell (Like the VSCode Terminal Window)

cd dbt_dagster_project
$env:DAGSTER_DBT_PARSE_PROJECT_ON_LOAD = 1
dagster dev

On Windows (Using cmd)

cd dbt_dagster_project
setx DAGSTER_DBT_PARSE_PROJECT_ON_LOAD 1
dagster dev

On Linux / Mac

cd dbt_dagster_project
DAGSTER_DBT_PARSE_PROJECT_ON_LOAD=1 dagster dev

We will continue our work on the dagster UI at http://localhost:3000/

Making incremental models compatible with orchestrators: The updated contents of models/fct/fct_reviews.sql:

{{
  config(
    materialized = 'incremental',
    on_schema_change='fail'
    )
}}
WITH src_reviews AS (
  SELECT * FROM {{ ref('src_reviews') }}
)
SELECT 
  {{ dbt_utils.generate_surrogate_key(['listing_id', 'review_date', 'reviewer_name', 'review_text']) }} as review_id,
  *
FROM src_reviews
WHERE review_text is not null

{% if is_incremental() %}
  {% if var("start_date", False) and var("end_date", False) %}
    {{ log('Loading ' ~ this ~ ' incrementally (start_date: ' ~ var("start_date") ~ ', end_date: ' ~ var("end_date") ~ ')', info=True) }}
    AND review_date >= '{{ var("start_date") }}'
    AND review_date < '{{ var("end_date") }}'
  {% else %}
    AND review_date > (select max(review_date) from {{ this }})
    {{ log('Loading ' ~ this ~ ' incrementally (all missing dates)', info=True)}}
  {% endif %}
{% endif %}

Passing a time range to our incremental model:

dbt run --select fct_reviews  --vars '{start_date: "2024-02-15 00:00:00", end_date: "2024-03-15 23:59:59"}'

Reference - Working with incremental strategies: https://docs.getdbt.com/docs/build/incremental-models#about-incremental_strategy