Polars

4.2. Polars#

4.2.1. Overview#

1. So sánh Pandas, Polars, PySpark, và Dask

Tiêu chí	Pandas	Polars	PySpark	Dask
Hiệu năng	Trung bình (single-threaded).	Rất cao (Rust, multi-threaded).	Cao (phân tán trên nhiều cluster).	Cao (đa luồng hoặc phân tán).
Kích thước dữ liệu	Giới hạn bởi bộ nhớ RAM.	Lớn (out-of-core và Apache Arrow).	Rất lớn (phân tán trên nhiều node).	Lớn (out-of-core, phân tán).
Lazy Evaluation	Không.	Có (tối ưu hóa pipeline).	Có (nhưng chậm hơn Polars).	Có.
Dễ sử dụng	Rất dễ (API quen thuộc với Python).	Dễ (API tương tự Pandas).	Trung bình (cần hiểu rõ Spark API).	Trung bình (API tương tự Pandas).
Khả năng mở rộng	Hạn chế (thích hợp cho dữ liệu nhỏ).	Tốt (Arrow, Parquet, ORC).	Rất cao (được thiết kế cho hệ thống lớn).	Cao (hỗ trợ phân tán và dữ liệu lớn).
Thời gian khởi chạy	Nhanh.	Nhanh.	Chậm (cần thiết lập SparkContext).	Trung bình.
Quản lý tài nguyên	Không hỗ trợ quản lý phân tán.	Không hỗ trợ quản lý phân tán.	Có (hỗ trợ cluster phân tán).	Có (quản lý song song, cluster nhỏ).
Ngôn ngữ cốt lõi	Python.	Rust.	Scala/Java (API Python là giao diện).	Python.
Định dạng hỗ trợ	CSV, Excel, SQL, JSON.	CSV, Parquet, JSON, IPC, ORC.	CSV, Parquet, ORC, Avro, JSON.	CSV, Parquet, JSON, Zarr.
Xử lý song song	Không (single-threaded).	Có (tích hợp đa luồng).	Có (phân tán trên cluster).	Có (đa luồng hoặc phân tán).
Ưu điểm chính	Dễ dùng, phổ biến, nhiều tài liệu hỗ trợ.	Rất nhanh, hỗ trợ lazy evaluation, tối ưu hóa.	Tốt cho dữ liệu cực lớn và xử lý phân tán.	Tốt cho dữ liệu lớn nhưng quen thuộc với Pandas.
Nhược điểm chính	Hiệu năng thấp với dữ liệu lớn.	Còn mới, tài liệu ít hơn Pandas.	Cần tài nguyên cluster, phức tạp hơn.	Hiệu năng kém hơn Polars với dữ liệu nhỏ.
Khi nào sử dụng?	Xử lý dữ liệu nhỏ và vừa, phân tích nhanh.	Dữ liệu lớn, cần hiệu năng cao, xử lý nhanh.	Dữ liệu cực lớn, xử lý phân tán trên cluster.	Khi cần xử lý dữ liệu lớn và sử dụng API quen thuộc Pandas.

Pandas: Lý tưởng cho xử lý dữ liệu nhỏ và vừa, dễ học, nhiều tài liệu hỗ trợ.
Polars: Phù hợp khi cần hiệu năng cao hoặc xử lý dữ liệu lớn nhanh chóng.
PySpark: Dành cho dữ liệu cực lớn, đòi hỏi xử lý phân tán trên cluster.
Dask: Là lựa chọn tốt nếu bạn cần xử lý dữ liệu lớn với cách tiếp cận quen thuộc từ Pandas.

2. Differences in concepts between Polars and pandas

Polars không có multi-index/index

pandas gắn nhãn cho mỗi hàng bằng “index”, trong khi Polars không dùng index mà đánh chỉ số các hàng bằng vị trí số nguyên.
Polars thiết kế để kết quả dễ dự đoán và câu lệnh dễ đọc, tránh sự phức tạp do “index” gây ra.

Polars sử dụng Apache Arrow thay vì NumPy arrays

Polars lưu trữ dữ liệu theo chuẩn Apache Arrow, tối ưu hóa việc phân tích dữ liệu, tăng tốc độ tải, giảm bộ nhớ, và tính toán nhanh hơn. pandas sử dụng NumPy arrays.
Polars utilizes the Arrow Columnar Format for its data orientation

Polars hỗ trợ hoạt động song song tốt hơn pandas

Polars khai thác khả năng xử lý đồng thời mạnh mẽ của Rust, cho phép nhiều hoạt động chạy song song. pandas chủ yếu hoạt động đơn luồng, cần thư viện bổ sung (như Dask) để chạy song song.

Polars có thể đánh giá lười biếng và tối ưu hóa câu truy vấn

pandas chỉ hỗ trợ đánh giá tức thời (eager evaluation). Polars hỗ trợ cả đánh giá tức thời và đánh giá lười biếng (lazy evaluation).
Khi dùng lazy evaluation, Polars tối ưu hóa tự động các câu truy vấn, cải thiện tốc độ và giảm tiêu thụ bộ nhớ.

Sử dụng index như một kỹ thuật tối ưu hóa

Polars có thể sử dụng cấu trúc dữ liệu “index” tương tự như cơ sở dữ liệu để tối ưu hóa, nhưng không dùng nó để quản lý dữ liệu như pandas.

3. Differences in Key syntax between Polars and pandas

Use the same syntax like pandas might be run, but it likely runs slower than it should. Then, it should be used by rewrite code syntax

import polars as pl

df = pl.DataFrame({"foo": [1, 2, 3], "bar": [None, "bak", "baz"]})
csv_file = "docs/assets/data/path.csv"

selecting data in parallel optimization

# select columns
df.select("a")

# filter
df.filter(pl.col("a") < 10)

lazy model to optimize query by identify that only the relevant. By calling the .collect method at the end con eagerly evaluate the query

# lazy mode: scan_csv
df = pl.scan_csv(csv_file)

# eager mode: read_csv
df = pl.read_csv(csv_file)
grouped_df = df.group_by("id1").agg(pl.col("v1").sum()).collect()

Column assignment in parallel

# Column assignment

# in Pandas
df.assign(
    tenXValue=lambda df_: df_.value * 10,
    hundredXValue=lambda df_: df_.value * 100,
)

# in Polars
df.with_columns(
    tenXValue=pl.col("value") * 10,
    hundredXValue=pl.col("value") * 100,
)

# Column assignment based on predicate

# in Pandas
df.assign(a=lambda df_: df_.a.where(df_.c != 2, df_.b))

# in Polars: Polars compute every branch of an when -> then -> otherwise in parallel
df.with_columns(
    pl.when(pl.col("c") == 2)  # when
    .then(pl.col("b"))  # then
    .otherwise(pl.col("a"))  # otherwise
    .alias("a")
)

filter in Polars

df.filter((pl.col("m2_living") > 2500) & (pl.col("price") < 300000))

dataframe transformation

df = pd.DataFrame(
    {
        "c": [1, 1, 1, 2, 2, 2, 2],
        "type": ["m", "n", "o", "m", "m", "n", "n"],
    }
)

# in pandas
df["size"] = df.groupby("c")["type"].transform(len)

# in polars
df.with_columns(pl.col("type").count().over("c").alias("size"))

# multi
df.with_columns(
    pl.col("c").count().over("c").alias("size"),
    pl.col("c").sum().over("type").alias("sum"),
    pl.col("type").reverse().over("c").alias("reverse_type"),
)

4.2.2. Data type and structures#

4.2.2.1. Data-types#

Category	Datatype
Numeric	`Int8`, `Int16`, `Int32`, `Int64` (both negative and positive)
	`UInt8`, `UInt16`, `UInt32`, `UInt64` (non-negative)
	`Float32`, `Float64` (need fast calculation by approximation)
	`Decimal` (need precision calculation)
Nested	`List`
	`Array`
	`Struct` (like `Dict`)
Temporal	`Date`
	`Time`
	`Datetime`
	`Duration`
Miscellaneous	`Boolean`
	`String` (UTF-8 encoded)
	`Binary`
	`Categorical` (inferred strings at runtime)
	`Enum` (Set predetermined strings)
	`Object`
	`Null`

4.2.2.2. Series#

import polars as pl

s1 = pl.Series("ints", [1, 2, 3, 4, 5])
s2 = pl.Series("uints", [1, 2, 3, 4, 5], dtype=pl.UInt64)
print(s1)

shape: (5,)
Series: 'ints' [i64]
[
	1
	2
	3
	4
	5
]

(s1.dtype, s2.dtype)

(Int64, UInt64)

4.2.2.3. Dataframe#

from datetime import date

df = pl.DataFrame(
    {
        "name": [
            "Alice Archer",
            "Ben Brown",
            "Chloe Cooper",
            "Daniel Donovan",
        ],
        "birthdate": [
            date(1997, 1, 10),
            date(1985, 2, 15),
            date(1983, 3, 22),
            date(1981, 4, 30),
        ],
        "weight": [57.9, 72.5, 53.6, 83.1],  # (kg)
        "height": [1.56, 1.77, 1.65, 1.75],  # (m)
    }
)

print(df)

shape: (4, 4)
┌────────────────┬────────────┬────────┬────────┐
│ name           ┆ birthdate  ┆ weight ┆ height │
│ ---            ┆ ---        ┆ ---    ┆ ---    │
│ str            ┆ date       ┆ f64    ┆ f64    │
╞════════════════╪════════════╪════════╪════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
│ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   │
└────────────────┴────────────┴────────┴────────┘

df.head(2)

shape: (2, 4)

name	birthdate	weight	height
str	date	f64	f64
"Alice Archer"	1997-01-10	57.9	1.56
"Ben Brown"	1985-02-15	72.5	1.77

df.tail(2)

shape: (2, 4)

name	birthdate	weight	height
str	date	f64	f64
"Chloe Cooper"	1983-03-22	53.6	1.65
"Daniel Donovan"	1981-04-30	83.1	1.75

df.sample(2)

shape: (2, 4)

name	birthdate	weight	height
str	date	f64	f64
"Chloe Cooper"	1983-03-22	53.6	1.65
"Ben Brown"	1985-02-15	72.5	1.77

print(df.glimpse(return_as_string=True))

Rows: 4
Columns: 4
$ name       <str> 'Alice Archer', 'Ben Brown', 'Chloe Cooper', 'Daniel Donovan'
$ birthdate <date> 1997-01-10, 1985-02-15, 1983-03-22, 1981-04-30
$ weight     <f64> 57.9, 72.5, 53.6, 83.1
$ height     <f64> 1.56, 1.77, 1.65, 1.75

print(df.describe())

shape: (9, 5)
┌────────────┬────────────────┬─────────────────────┬───────────┬──────────┐
│ statistic  ┆ name           ┆ birthdate           ┆ weight    ┆ height   │
│ ---        ┆ ---            ┆ ---                 ┆ ---       ┆ ---      │
│ str        ┆ str            ┆ str                 ┆ f64       ┆ f64      │
╞════════════╪════════════════╪═════════════════════╪═══════════╪══════════╡
│ count      ┆ 4              ┆ 4                   ┆ 4.0       ┆ 4.0      │
│ null_count ┆ 0              ┆ 0                   ┆ 0.0       ┆ 0.0      │
│ mean       ┆ null           ┆ 1986-09-04 00:00:00 ┆ 66.775    ┆ 1.6825   │
│ std        ┆ null           ┆ null                ┆ 13.560082 ┆ 0.097082 │
│ min        ┆ Alice Archer   ┆ 1981-04-30          ┆ 53.6      ┆ 1.56     │
│ 25%        ┆ null           ┆ 1983-03-22          ┆ 57.9      ┆ 1.65     │
│ 50%        ┆ null           ┆ 1985-02-15          ┆ 72.5      ┆ 1.75     │
│ 75%        ┆ null           ┆ 1985-02-15          ┆ 72.5      ┆ 1.75     │
│ max        ┆ Daniel Donovan ┆ 1997-01-10          ┆ 83.1      ┆ 1.77     │
└────────────┴────────────────┴─────────────────────┴───────────┴──────────┘

4.2.2.4. schema#

# set schema
df = pl.DataFrame(
    {
        "name": ["Alice", "Ben", "Chloe", "Daniel"],
        "age": [27, 39, 41, 43],
    },
    schema={"name": None, "age": pl.UInt8},
)

print(df)

shape: (4, 2)
┌────────┬─────┐
│ name   ┆ age │
│ ---    ┆ --- │
│ str    ┆ u8  │
╞════════╪═════╡
│ Alice  ┆ 27  │
│ Ben    ┆ 39  │
│ Chloe  ┆ 41  │
│ Daniel ┆ 43  │
└────────┴─────┘

# override datatype in some of columns (not all columns)
df = pl.DataFrame(
    {
        "name": ["Alice", "Ben", "Chloe", "Daniel"],
        "age": [27, 39, 41, 43],
    },
    schema_overrides={"age": pl.UInt16},
)

print(df)

shape: (4, 2)
┌────────┬─────┐
│ name   ┆ age │
│ ---    ┆ --- │
│ str    ┆ u16 │
╞════════╪═════╡
│ Alice  ┆ 27  │
│ Ben    ┆ 39  │
│ Chloe  ┆ 41  │
│ Daniel ┆ 43  │
└────────┴─────┘

4.2.3. I/O#

4.2.3.1. CSV#

# in eager mode (executed immediately)
df = pl.read_csv("docs/assets/data/path.csv")

# lazy mode (prefer)
df = pl.scan_csv("docs/assets/data/path.csv")

# write csv
df.write_csv("docs/assets/data/path.csv")

For Excel files : https://docs.pola.rs/user-guide/io/excel/

For Multiple files strategy

If data in multiple file is the same schema:

# in eager mode
df = pl.read_csv("docs/assets/data/my_many_files_*.csv")

# lazy mode (prefer)
df = pl.scan_csv("docs/assets/data/my_many_files_*.csv")
df.show_graph()

If data in multiple file is not in 1 table, but do same task (parallel):

import glob

import polars as pl

queries = []
for file in glob.glob("docs/assets/data/my_many_files_*.csv"):
    q = pl.scan_csv(file).group_by("bar").agg(pl.len(), pl.sum("foo"))
    queries.append(q)

dataframes = pl.collect_all(queries)
print(dataframes)

4.2.3.2. Parquet (prefer than CSV)#

Polars is optimize to load and write parquet file format.

4.2.3.2.1. Single parquet file#

# in eager mode (executed immediately)
df = pl.read_parquet("docs/assets/data/path.parquet")

# lazy mode (prefer)
df = pl.scan_parquet("docs/assets/data/path.parquet")

# write csv
df.write_parquet("docs/assets/data/path.parquet")

When we scan a Parquet file stored in the cloud, we can also apply predicate and projection pushdowns. This can significantly reduce the amount of data that needs to be downloaded. For scanning a Parquet file in the cloud, see Cloud storage.

4.2.3.2.2. Hive partitioned data#

https://docs.pola.rs/user-guide/io/hive/

For READ hive partitioned data:

┌───────────────────────────────────────────────────────┐
│ File path                                             │
╞═══════════════════════════════════════════════════════╡
│ docs/assets/data/hive/year=2023/month=11/data.parquet │
│ docs/assets/data/hive/year=2023/month=12/data.parquet │
│ docs/assets/data/hive/year=2024/month=01/data.parquet │
│ docs/assets/data/hive/year=2024/month=02/data.parquet │
└───────────────────────────────────────────────────────┘

import polars as pl

df = pl.scan_parquet("docs/assets/data/hive/").collect()

with pl.Config(tbl_rows=99):
    print(df)

For READ Handling mixed files:

┌─────────────────────────────────────────────────────────────┐
│ File path                                                   │
╞═════════════════════════════════════════════════════════════╡
│ docs/assets/data/hive_mixed/description.txt                 │
│ docs/assets/data/hive_mixed/year=2023/month=11/data.parquet │
│ docs/assets/data/hive_mixed/year=2023/month=12/data.parquet │
│ docs/assets/data/hive_mixed/year=2024/month=01/data.parquet │
│ docs/assets/data/hive_mixed/year=2024/month=02/data.parquet │
└─────────────────────────────────────────────────────────────┘

df = pl.scan_parquet(
    "docs/assets/data/hive_mixed/**/*.parquet", hive_partitioning=True
).collect()

with pl.Config(tbl_rows=99):
    print(df)

For READ specific hive files:

┌─────────────────────────────────────────────────────────────┐
│ File path                                                   │
╞═════════════════════════════════════════════════════════════╡
│ docs/assets/data/hive_mixed/description.txt                 │
│ docs/assets/data/hive_mixed/year=2023/month=11/data.parquet │
│ docs/assets/data/hive_mixed/year=2023/month=12/data.parquet │
│ docs/assets/data/hive_mixed/year=2024/month=01/data.parquet │
│ docs/assets/data/hive_mixed/year=2024/month=02/data.parquet │
└─────────────────────────────────────────────────────────────┘

df = pl.scan_parquet(
    [
        "docs/assets/data/hive/year=2024/month=01/data.parquet",
        "docs/assets/data/hive/year=2024/month=02/data.parquet",
    ],
    hive_partitioning=True,
).collect()

print(df)

For WRITE to hive files:

df = pl.DataFrame({"a": [1, 1, 2, 2, 3], "b": [1, 1, 1, 2, 2], "c": 1})
print(df)

shape: (5, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i32 │
╞═════╪═════╪═════╡
│ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1   ┆ 1   │
│ 2   ┆ 1   ┆ 1   │
│ 2   ┆ 2   ┆ 1   │
│ 3   ┆ 2   ┆ 1   │
└─────┴─────┴─────┘

# partitioned by the columns a and b
df.write_parquet("docs/assets/data/hive_write/", partition_by=["a", "b"])

The output following paths:

┌──────────────────────────────────────────────────────┐
│ File path                                            │
╞══════════════════════════════════════════════════════╡
│ docs/assets/data/hive_write/a=1/b=1/00000000.parquet │
│ docs/assets/data/hive_write/a=2/b=1/00000000.parquet │
│ docs/assets/data/hive_write/a=2/b=2/00000000.parquet │
│ docs/assets/data/hive_write/a=3/b=2/00000000.parquet │
└──────────────────────────────────────────────────────┘

4.2.3.3. Json#

# in eager mode (executed immediately)
df = pl.read_json("docs/assets/data/path.json")

# read Newline Delimited JSON
df = pl.read_ndjson("docs/assets/data/path.json")

# lazy mode (prefer - only Newline Delimited JSON)
df = pl.scan_ndjson("docs/assets/data/path.csv")

# write csv
df.write_json("docs/assets/data/path.json")

4.2.3.4. Database#

detail: https://docs.pola.rs/user-guide/io/database/

4.2.3.4.1. Engine#

Polars xử lý data theo cơ chế column-wise Apache Arrow format nên một số engine có thể bị chậm do phải load data row-wise vào Python trước khi copying data lại vào the column-wise Apache Arrow format

row-wise engine: SQLAlchemy or DBAPI2 connection
column-wise engine: ConnectorX or ADBC

4.2.3.4.2. Read#

read_database_uri faster than read_database for SQLAlchemy or DBAPI2 connection if you are using a SQLAlchemy or DBAPI2

via URI

# via URI
uri = "postgresql://username:password@server:port/database"
query = "SELECT * FROM foo"

pl.read_database_uri(query=query, uri=uri)

via connection engine

# via connection engine
from sqlalchemy import create_engine

conn = create_engine("sqlite:///test.db")
# conn = sqlite3.connect("test.db")

query = "SELECT * FROM foo"

pl.read_database(query=query, connection=conn.connect())

4.2.3.4.3. Write#

support engine:

SQLAlchemy
Arrow Database Connectivity (ADBC)

uri = "postgresql://username:password@server:port/database"
df = pl.DataFrame({"foo": [1, 2, 3]})

df.write_database(table_name="records", connection=uri)

uri = "postgresql://username:password@server:port/database"
df = pl.DataFrame({"foo": [1, 2, 3]})

df.write_database(table_name="records", connection=uri, engine="adbc")

4.2.3.5. Cloud Storage#

Service: AWS S3, Azure Blob Storage, Google Cloud Storage
File-format: Parquet, CSV, IPC, NDJSON

Authen cloud

import polars as pl

source = "s3://bucket/*.parquet"

storage_options = {
    "aws_access_key_id": "<secret>",
    "aws_secret_access_key": "<secret>",
    "aws_region": "us-east-1",
}
df = pl.scan_parquet(source, storage_options=storage_options).collect()

4.2.3.5.1. Read#

Using pl.scan_* functions to read from cloud storage can benefit from predicate and projection pushdowns, where the query optimizer will apply them before the file is downloaded. This can significantly reduce the amount of data that needs to be downloaded. The query evaluation is triggered by calling collect.

source = "s3://bucket/*.parquet"

# Read file
df = pl.read_parquet(source)

# Scanning (query optimization)
df = (
    pl.scan_parquet(source)
    .filter(pl.col("id") < 100)
    .select("id", "value")
    .collect()
)

4.2.3.5.2. Write#

using:

s3fs for S3,
adlfs for Azure Blob Storage
gcsfs for Google Cloud Storage

import polars as pl
import s3fs

df = pl.DataFrame(
    {
        "foo": ["a", "b", "c", "d", "d"],
        "bar": [1, 2, 3, 4, 5],
    }
)

fs = s3fs.S3FileSystem()
destination = "s3://bucket/my_file.parquet"

# write parquet
with fs.open(destination, mode="wb") as f:
    df.write_parquet(f)

4.2.3.6. Bigquery#

4.2.3.6.1. Read#

import polars as pl
from google.cloud import bigquery

client = bigquery.Client()

# Perform a query.
QUERY = (
    "SELECT name FROM `bigquery-public-data.usa_names.usa_1910_2013` "
    'WHERE state = "TX" '
    "LIMIT 100"
)
query_job = client.query(QUERY)  # API request
rows = query_job.result()  # Waits for query to finish

df = pl.from_arrow(rows.to_arrow())

4.2.3.6.2. Write#

from google.cloud import bigquery
import io

client = bigquery.Client()

# Write DataFrame to stream as parquet file; does not hit disk
with io.BytesIO() as stream:
    df.write_parquet(stream)
    stream.seek(0)
    job = client.load_table_from_file(
        stream,
        destination="tablename",
        project="projectname",
        job_config=bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.PARQUET,
        ),
    )
job.result()  # Waits for the job to complete

4.2.3.7. Hugging Face#

https://docs.pola.rs/user-guide/io/hugging-face/

4.2.4. Expression#

import polars as pl
from datetime import date

df = pl.DataFrame(
    {
        "name": [
            "Alice Archer",
            "Ben Brown",
            "Chloe Cooper",
            "Daniel Donovan",
        ],
        "birthdate": [
            date(1997, 1, 10),
            date(1985, 2, 15),
            date(1983, 3, 22),
            date(1981, 4, 30),
        ],
        "weight": [57.9, 72.5, 53.6, 83.1],  # (kg)
        "height": [1.56, 1.77, 1.65, 1.75],  # (m)
    }
)

print(df)

shape: (4, 4)
┌────────────────┬────────────┬────────┬────────┐
│ name           ┆ birthdate  ┆ weight ┆ height │
│ ---            ┆ ---        ┆ ---    ┆ ---    │
│ str            ┆ date       ┆ f64    ┆ f64    │
╞════════════════╪════════════╪════════╪════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
│ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   │
└────────────────┴────────────┴────────┴────────┘

4.2.4.1. Selecting and filtering#

bmi_expr = pl.col("weight") / (pl.col("height") ** 2)
print(bmi_expr)

[(col("weight")) / (col("height").pow([dyn int: 2]))]

bmi_expr just is an expressions with lazy, no computations have taken place yet. That’s what we need contexts for by select, with_columns, filter, group_by, …

4.2.4.1.1. select and create#

select

Select column

subdf = df.select(pl.col("weight"), pl.col("name"))
print(subdf)

shape: (4, 2)
┌────────┬────────────────┐
│ weight ┆ name           │
│ ---    ┆ ---            │
│ f64    ┆ str            │
╞════════╪════════════════╡
│ 57.9   ┆ Alice Archer   │
│ 72.5   ┆ Ben Brown      │
│ 53.6   ┆ Chloe Cooper   │
│ 83.1   ┆ Daniel Donovan │
└────────┴────────────────┘

subdf2 = df.select(["weight", "name"])
subdf2.equals(subdf)

True

df.select("weight")

shape: (4, 1)

weight
f64
57.9
72.5
53.6
83.1

produce new columns / series that are aggregations, combinations of other columns, or literals and same lenght with df or must be a scalar:

Scalars will be broadcast to match the length of the remaining series
Literals, like the number used above, are also broadcast
Broadcasting can also occur within expressions

select only includes the columns selected by its input expressions

result = df.select(
    bmi=bmi_expr,
    avg_bmi=bmi_expr.mean(),
    ideal_max_bmi=25,
    deviation=(bmi_expr - bmi_expr.mean()) / bmi_expr.std(),
)
print(result)

shape: (4, 4)
┌───────────┬───────────┬───────────────┬───────────┐
│ bmi       ┆ avg_bmi   ┆ ideal_max_bmi ┆ deviation │
│ ---       ┆ ---       ┆ ---           ┆ ---       │
│ f64       ┆ f64       ┆ i32           ┆ f64       │
╞═══════════╪═══════════╪═══════════════╪═══════════╡
│ 23.791913 ┆ 23.438973 ┆ 25            ┆ 0.115645  │
│ 23.141498 ┆ 23.438973 ┆ 25            ┆ -0.097471 │
│ 19.687787 ┆ 23.438973 ┆ 25            ┆ -1.22912  │
│ 27.134694 ┆ 23.438973 ┆ 25            ┆ 1.210946  │
└───────────┴───────────┴───────────────┴───────────┘

4.2.4.1.1.1. by datatype#

float_df = df.select(pl.col(pl.Float64))
print(float_df)

shape: (4, 2)
┌────────┬────────┐
│ weight ┆ height │
│ ---    ┆ ---    │
│ f64    ┆ f64    │
╞════════╪════════╡
│ 57.9   ┆ 1.56   │
│ 72.5   ┆ 1.77   │
│ 53.6   ┆ 1.65   │
│ 83.1   ┆ 1.75   │
└────────┴────────┘

4.2.4.1.1.2. by pattern matching in columns name#

^ : start
$ : end
* : multi-characters
. : single-character
…

result = df.select("^*ght$")
print(result)

shape: (4, 2)
┌────────┬────────┐
│ weight ┆ height │
│ ---    ┆ ---    │
│ f64    ┆ f64    │
╞════════╪════════╡
│ 57.9   ┆ 1.56   │
│ 72.5   ┆ 1.77   │
│ 53.6   ┆ 1.65   │
│ 83.1   ┆ 1.75   │
└────────┴────────┘

4.2.4.1.1.3. all columns#

result = df.select(pl.all())
print(result)

shape: (4, 4)
┌────────────────┬────────────┬────────┬────────┐
│ name           ┆ birthdate  ┆ weight ┆ height │
│ ---            ┆ ---        ┆ ---    ┆ ---    │
│ str            ┆ date       ┆ f64    ┆ f64    │
╞════════════════╪════════════╪════════╪════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
│ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   │
└────────────────┴────────────┴────────┴────────┘

4.2.4.1.1.4. exclude columns#

result = df.select(pl.all().exclude("^*ght$"))
print(result)

shape: (4, 2)
┌────────────────┬────────────┐
│ name           ┆ birthdate  │
│ ---            ┆ ---        │
│ str            ┆ date       │
╞════════════════╪════════════╡
│ Alice Archer   ┆ 1997-01-10 │
│ Ben Brown      ┆ 1985-02-15 │
│ Chloe Cooper   ┆ 1983-03-22 │
│ Daniel Donovan ┆ 1981-04-30 │
└────────────────┴────────────┘

4.2.4.1.1.5. `with_columns`#

very similar to the context select, but with_columns creates a new dataframe that contains the columns from the original dataframe and the new columns according to its input expressions

result = df.with_columns(
    bmi=bmi_expr,
    avg_bmi=bmi_expr.mean(),
    ideal_max_bmi=25,
    deviation=(bmi_expr - bmi_expr.mean()) / bmi_expr.std(),
)
print(result)

shape: (4, 8)
┌───────────────┬────────────┬────────┬────────┬───────────┬───────────┬───────────────┬───────────┐
│ name          ┆ birthdate  ┆ weight ┆ height ┆ bmi       ┆ avg_bmi   ┆ ideal_max_bmi ┆ deviation │
│ ---           ┆ ---        ┆ ---    ┆ ---    ┆ ---       ┆ ---       ┆ ---           ┆ ---       │
│ str           ┆ date       ┆ f64    ┆ f64    ┆ f64       ┆ f64       ┆ i32           ┆ f64       │
╞═══════════════╪════════════╪════════╪════════╪═══════════╪═══════════╪═══════════════╪═══════════╡
│ Alice Archer  ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   ┆ 23.791913 ┆ 23.438973 ┆ 25            ┆ 0.115645  │
│ Ben Brown     ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   ┆ 23.141498 ┆ 23.438973 ┆ 25            ┆ -0.097471 │
│ Chloe Cooper  ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   ┆ 19.687787 ┆ 23.438973 ┆ 25            ┆ -1.22912  │
│ Daniel        ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   ┆ 27.134694 ┆ 23.438973 ┆ 25            ┆ 1.210946  │
│ Donovan       ┆            ┆        ┆        ┆           ┆           ┆               ┆           │
└───────────────┴────────────┴────────┴────────┴───────────┴───────────┴───────────────┴───────────┘

4.2.4.1.1.6. create column by group by expression#

import polars as pl

# Sample DataFrame
df = pl.DataFrame(
    {
        "group": ["A", "A", "B", "B", "B"],
        "value": [10, 20, 15, 25, 35],
    }
)

result = df.with_columns(
    pl.col("value").mean().over("group").alias("group_mean")
)

# Group by "group" and calculate group-wise mean
group_mean = df.group_by("group").agg(
    pl.col("value").mean().alias("group_mean")
)

# Join the group mean back to the original DataFrame
result2 = df.join(group_mean, on="group")

print(result2)
print(result.equals(result2))

shape: (5, 3)
┌───────┬───────┬────────────┐
│ group ┆ value ┆ group_mean │
│ ---   ┆ ---   ┆ ---        │
│ str   ┆ i64   ┆ f64        │
╞═══════╪═══════╪════════════╡
│ A     ┆ 10    ┆ 15.0       │
│ A     ┆ 20    ┆ 15.0       │
│ B     ┆ 15    ┆ 25.0       │
│ B     ┆ 25    ┆ 25.0       │
│ B     ┆ 35    ┆ 25.0       │
└───────┴───────┴────────────┘
True

4.2.4.1.2. rename#

4.2.4.1.2.1. `alias`#

set name

result = df.select(
    bmi_expr.alias("bmi"),
    bmi_expr.mean().alias("avg_bmi"),
    ((bmi_expr - bmi_expr.mean()) / bmi_expr.std()).alias("deviation"),
)
print(result)

shape: (4, 3)
┌───────────┬───────────┬───────────┐
│ bmi       ┆ avg_bmi   ┆ deviation │
│ ---       ┆ ---       ┆ ---       │
│ f64       ┆ f64       ┆ f64       │
╞═══════════╪═══════════╪═══════════╡
│ 23.791913 ┆ 23.438973 ┆ 0.115645  │
│ 23.141498 ┆ 23.438973 ┆ -0.097471 │
│ 19.687787 ┆ 23.438973 ┆ -1.22912  │
│ 27.134694 ┆ 23.438973 ┆ 1.210946  │
└───────────┴───────────┴───────────┘

4.2.4.1.2.2. prefixing and suffixing#

result = df.select(
    (pl.col("^*ight$") * 10).name.prefix("10_multiple_"),
    (pl.col("birthdate").dt.year()).name.suffix("_year"),
)
print(result)

shape: (4, 3)
┌────────────────────┬────────────────────┬────────────────┐
│ 10_multiple_weight ┆ 10_multiple_height ┆ birthdate_year │
│ ---                ┆ ---                ┆ ---            │
│ f64                ┆ f64                ┆ i32            │
╞════════════════════╪════════════════════╪════════════════╡
│ 579.0              ┆ 15.6               ┆ 1997           │
│ 725.0              ┆ 17.7               ┆ 1985           │
│ 536.0              ┆ 16.5               ┆ 1983           │
│ 831.0              ┆ 17.5               ┆ 1981           │
└────────────────────┴────────────────────┴────────────────┘

4.2.4.1.2.3. custom function#

# There is also `.name.to_uppercase`, so this usage of `.map` is moot.
result = df.select(pl.all().name.map(str.upper))
print(result)

shape: (4, 4)
┌────────────────┬────────────┬────────┬────────┐
│ NAME           ┆ BIRTHDATE  ┆ WEIGHT ┆ HEIGHT │
│ ---            ┆ ---        ┆ ---    ┆ ---    │
│ str            ┆ date       ┆ f64    ┆ f64    │
╞════════════════╪════════════╪════════╪════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
│ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   │
└────────────────┴────────────┴────────┴────────┘

4.2.4.1.3. filter#

filter

result = df.filter(
    pl.col("birthdate").is_between(date(1982, 12, 31), date(1996, 1, 1)),
    pl.col("height") > 1.7,
)
print(result)

shape: (1, 4)
┌───────────┬────────────┬────────┬────────┐
│ name      ┆ birthdate  ┆ weight ┆ height │
│ ---       ┆ ---        ┆ ---    ┆ ---    │
│ str       ┆ date       ┆ f64    ┆ f64    │
╞═══════════╪════════════╪════════╪════════╡
│ Ben Brown ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
└───────────┴────────────┴────────┴────────┘

4.2.4.2. Basic operation#

import polars as pl
import numpy as np

np.random.seed(42)  # For reproducibility.

df = pl.DataFrame(
    {
        "nrs": [1, 2, 3, None, 5],
        "names": ["foo", "ham", "spam", "egg", "spam"],
        "random": np.random.rand(5),
        "groups": ["A", "A", "B", "A", "B"],
    }
)
print(df)

shape: (5, 4)
┌──────┬───────┬──────────┬────────┐
│ nrs  ┆ names ┆ random   ┆ groups │
│ ---  ┆ ---   ┆ ---      ┆ ---    │
│ i64  ┆ str   ┆ f64      ┆ str    │
╞══════╪═══════╪══════════╪════════╡
│ 1    ┆ foo   ┆ 0.37454  ┆ A      │
│ 2    ┆ ham   ┆ 0.950714 ┆ A      │
│ 3    ┆ spam  ┆ 0.731994 ┆ B      │
│ null ┆ egg   ┆ 0.598658 ┆ A      │
│ 5    ┆ spam  ┆ 0.156019 ┆ B      │
└──────┴───────┴──────────┴────────┘

4.2.4.2.1. Comparisons#

result = df.select(
    (pl.col("nrs") > 1).alias("nrs > 1"),  # .gt
    (pl.col("nrs") >= 3).alias("nrs >= 3"),  # ge
    (pl.col("random") < 0.2).alias("random < .2"),  # .lt
    (pl.col("random") <= 0.5).alias("random <= .5"),  # .le
    (pl.col("nrs") != 1).alias("nrs != 1"),  # .ne
    (pl.col("nrs") == 1).alias("nrs == 1"),  # .eq
)
print(result)

shape: (5, 6)
┌─────────┬──────────┬─────────────┬──────────────┬──────────┬──────────┐
│ nrs > 1 ┆ nrs >= 3 ┆ random < .2 ┆ random <= .5 ┆ nrs != 1 ┆ nrs == 1 │
│ ---     ┆ ---      ┆ ---         ┆ ---          ┆ ---      ┆ ---      │
│ bool    ┆ bool     ┆ bool        ┆ bool         ┆ bool     ┆ bool     │
╞═════════╪══════════╪═════════════╪══════════════╪══════════╪══════════╡
│ false   ┆ false    ┆ false       ┆ true         ┆ false    ┆ true     │
│ true    ┆ false    ┆ false       ┆ false        ┆ true     ┆ false    │
│ true    ┆ true     ┆ false       ┆ false        ┆ true     ┆ false    │
│ null    ┆ null     ┆ false       ┆ false        ┆ null     ┆ null     │
│ true    ┆ true     ┆ true        ┆ true         ┆ true     ┆ false    │
└─────────┴──────────┴─────────────┴──────────────┴──────────┴──────────┘

Use the operators &, |, and ~, for the Boolean operations “and”, “or”, and “not”, respectively, or the functions of the same name

# Boolean operators & | ~
result = df.select(
    ((~pl.col("nrs").is_null()) & (pl.col("groups") == "A")).alias(
        "number not null and group A"
    ),
    ((pl.col("random") < 0.5) | (pl.col("groups") == "B")).alias(
        "random < 0.5 or group B"
    ),
    (pl.col("groups") != "B").alias("group not B"),
)

print(result)

shape: (5, 3)
┌─────────────────────────────┬─────────────────────────┬─────────────┐
│ number not null and group A ┆ random < 0.5 or group B ┆ group not B │
│ ---                         ┆ ---                     ┆ ---         │
│ bool                        ┆ bool                    ┆ bool        │
╞═════════════════════════════╪═════════════════════════╪═════════════╡
│ true                        ┆ true                    ┆ true        │
│ true                        ┆ false                   ┆ true        │
│ false                       ┆ true                    ┆ false       │
│ false                       ┆ false                   ┆ true        │
│ false                       ┆ true                    ┆ false       │
└─────────────────────────────┴─────────────────────────┴─────────────┘

# Corresponding named functions `and_`, `or_`, and `not_`.
result2 = df.select(
    (pl.col("nrs").is_null().not_().and_(pl.col("groups") == "A")).alias(
        "number not null and group A"
    ),
    ((pl.col("random") < 0.5).or_(pl.col("groups") == "B")).alias(
        "random < 0.5 or group B"
    ),
)
print(result.equals(result2))

True

4.2.4.2.2. Unique values#

n_unique: exact number but may be slow for large dataset
approx_n_unique: approximation by uses the algorithm HyperLogLog++ to estimate the result

long_df = pl.DataFrame({"numbers": np.random.randint(0, 100_000, 100_000)})

result = long_df.select(
    pl.col("numbers").n_unique().alias("n_unique"),
    pl.col("numbers").approx_n_unique().alias("approx_n_unique"),
)

print(result)

shape: (1, 2)
┌──────────┬─────────────────┐
│ n_unique ┆ approx_n_unique │
│ ---      ┆ ---             │
│ u32      ┆ u32             │
╞══════════╪═════════════════╡
│ 63218    ┆ 63784           │
└──────────┴─────────────────┘

value_counts: unique values and their counts as structs datatype
unique: return only unique value
unique_counts: return only unique value count (need unique(maintain_order=True))

result = df.select()

result = df.select(
    pl.col("names").value_counts(sort=True).alias("value_counts"),
    pl.col("names").unique(maintain_order=True).alias("unique"),
    pl.col("names").unique_counts().alias("unique_counts"),
)

print(result)

shape: (4, 3)
┌──────────────┬────────┬───────────────┐
│ value_counts ┆ unique ┆ unique_counts │
│ ---          ┆ ---    ┆ ---           │
│ struct[2]    ┆ str    ┆ u32           │
╞══════════════╪════════╪═══════════════╡
│ {"spam",2}   ┆ foo    ┆ 1             │
│ {"foo",1}    ┆ ham    ┆ 1             │
│ {"ham",1}    ┆ spam   ┆ 2             │
│ {"egg",1}    ┆ egg    ┆ 1             │
└──────────────┴────────┴───────────────┘

4.2.4.2.3. Conditional (when - then - else)#

when: accept predicate expression (True/False seri)
then: corresponding values of the expression inside if when = True
otherwise: corresponding values of the expression inside if when = False, or null if not provided

result = df.select(
    pl.col("nrs"),
    pl.when(pl.col("nrs") == 1)
    .then(pl.col("nrs") + 1)
    .otherwise(
        pl.when(pl.col("nrs").is_null())
        .then(9999)
        .otherwise(pl.col("nrs") ** 3)
    )
    .alias("Collatz"),
)

print(result)

shape: (5, 2)
┌──────┬─────────┐
│ nrs  ┆ Collatz │
│ ---  ┆ ---     │
│ i64  ┆ i64     │
╞══════╪═════════╡
│ 1    ┆ 2       │
│ 2    ┆ 8       │
│ 3    ┆ 27      │
│ null ┆ 9999    │
│ 5    ┆ 125     │
└──────┴─────────┘

4.2.4.2.4. Expression expansion#

a shorthand notation for when you want to apply the same transformation to multiple columns

import polars as pl

df = pl.DataFrame(
    {  # As of 14th October 2024, ~3pm UTC
        "ticker": ["AAPL", "NVDA", "MSFT", "GOOG", "AMZN"],
        "company_name": [
            "Apple",
            "NVIDIA",
            "Microsoft",
            "Alphabet (Google)",
            "Amazon",
        ],
        "price": [229.9, 138.93, 420.56, 166.41, 188.4],
        "day_high": [231.31, 139.6, 424.04, 167.62, 189.83],
        "day_low": [228.6, 136.3, 417.52, 164.78, 188.44],
        "year_high": [237.23, 140.76, 468.35, 193.31, 201.2],
        "year_low": [164.08, 39.23, 324.39, 121.46, 118.35],
    }
)

print(df)

shape: (5, 7)
┌────────┬───────────────────┬────────┬──────────┬─────────┬───────────┬──────────┐
│ ticker ┆ company_name      ┆ price  ┆ day_high ┆ day_low ┆ year_high ┆ year_low │
│ ---    ┆ ---               ┆ ---    ┆ ---      ┆ ---     ┆ ---       ┆ ---      │
│ str    ┆ str               ┆ f64    ┆ f64      ┆ f64     ┆ f64       ┆ f64      │
╞════════╪═══════════════════╪════════╪══════════╪═════════╪═══════════╪══════════╡
│ AAPL   ┆ Apple             ┆ 229.9  ┆ 231.31   ┆ 228.6   ┆ 237.23    ┆ 164.08   │
│ NVDA   ┆ NVIDIA            ┆ 138.93 ┆ 139.6    ┆ 136.3   ┆ 140.76    ┆ 39.23    │
│ MSFT   ┆ Microsoft         ┆ 420.56 ┆ 424.04   ┆ 417.52  ┆ 468.35    ┆ 324.39   │
│ GOOG   ┆ Alphabet (Google) ┆ 166.41 ┆ 167.62   ┆ 164.78  ┆ 193.31    ┆ 121.46   │
│ AMZN   ┆ Amazon            ┆ 188.4  ┆ 189.83   ┆ 188.44  ┆ 201.2     ┆ 118.35   │
└────────┴───────────────────┴────────┴──────────┴─────────┴───────────┴──────────┘

For example: compute the mean value of the columns “price” and “year_high” and will rename them as “avg_year_high” and “avg_year_high”, respectively

# instead of
[
    pl.col("price").mean().alias("avg_price"),
    pl.col("year_high").mean().alias("avg_year_high"),
]

# using
expr = pl.col("price", "year_high").mean().name.prefix("avg_")
df.select(expr)

shape: (1, 2)

avg_price	avg_year_high
f64	f64
228.84	248.17

# multiply all columns with data type Float64 by 1.1
expr = (pl.col(pl.Float64) * 1.1).name.suffix("*1.1")
result = df.select(expr)
print(result)

shape: (5, 5)
┌───────────┬──────────────┬─────────────┬───────────────┬──────────────┐
│ price*1.1 ┆ day_high*1.1 ┆ day_low*1.1 ┆ year_high*1.1 ┆ year_low*1.1 │
│ ---       ┆ ---          ┆ ---         ┆ ---           ┆ ---          │
│ f64       ┆ f64          ┆ f64         ┆ f64           ┆ f64          │
╞═══════════╪══════════════╪═════════════╪═══════════════╪══════════════╡
│ 252.89    ┆ 254.441      ┆ 251.46      ┆ 260.953       ┆ 180.488      │
│ 152.823   ┆ 153.56       ┆ 149.93      ┆ 154.836       ┆ 43.153       │
│ 462.616   ┆ 466.444      ┆ 459.272     ┆ 515.185       ┆ 356.829      │
│ 183.051   ┆ 184.382      ┆ 181.258     ┆ 212.641       ┆ 133.606      │
│ 207.24    ┆ 208.813      ┆ 207.284     ┆ 221.32        ┆ 130.185      │
└───────────┴──────────────┴─────────────┴───────────────┴──────────────┘

4.2.4.2.4.1. How to select columns#

4.2.4.2.4.1.1. by columns name#

eur_usd_rate = 1.2

result = df.with_columns(
    (
        pl.col(
            "price",
            "day_high",
            "day_low",
            "year_high",
            "year_low",
        )
        / eur_usd_rate
    ).round(1)
)
print(result)

shape: (5, 7)
┌────────┬───────────────────┬───────┬──────────┬─────────┬───────────┬──────────┐
│ ticker ┆ company_name      ┆ price ┆ day_high ┆ day_low ┆ year_high ┆ year_low │
│ ---    ┆ ---               ┆ ---   ┆ ---      ┆ ---     ┆ ---       ┆ ---      │
│ str    ┆ str               ┆ f64   ┆ f64      ┆ f64     ┆ f64       ┆ f64      │
╞════════╪═══════════════════╪═══════╪══════════╪═════════╪═══════════╪══════════╡
│ AAPL   ┆ Apple             ┆ 191.6 ┆ 192.8    ┆ 190.5   ┆ 197.7     ┆ 136.7    │
│ NVDA   ┆ NVIDIA            ┆ 115.8 ┆ 116.3    ┆ 113.6   ┆ 117.3     ┆ 32.7     │
│ MSFT   ┆ Microsoft         ┆ 350.5 ┆ 353.4    ┆ 347.9   ┆ 390.3     ┆ 270.3    │
│ GOOG   ┆ Alphabet (Google) ┆ 138.7 ┆ 139.7    ┆ 137.3   ┆ 161.1     ┆ 101.2    │
│ AMZN   ┆ Amazon            ┆ 157.0 ┆ 158.2    ┆ 157.0   ┆ 167.7     ┆ 98.6     │
└────────┴───────────────────┴───────┴──────────┴─────────┴───────────┴──────────┘

4.2.4.2.4.1.2. by datatype#

result2 = df.with_columns(
    (
        pl.col(
            pl.Float32,
            pl.Float64,
        )
        / eur_usd_rate
    ).round(2)
)
print(result2)

shape: (5, 7)
┌────────┬───────────────────┬────────┬──────────┬─────────┬───────────┬──────────┐
│ ticker ┆ company_name      ┆ price  ┆ day_high ┆ day_low ┆ year_high ┆ year_low │
│ ---    ┆ ---               ┆ ---    ┆ ---      ┆ ---     ┆ ---       ┆ ---      │
│ str    ┆ str               ┆ f64    ┆ f64      ┆ f64     ┆ f64       ┆ f64      │
╞════════╪═══════════════════╪════════╪══════════╪═════════╪═══════════╪══════════╡
│ AAPL   ┆ Apple             ┆ 191.58 ┆ 192.76   ┆ 190.5   ┆ 197.69    ┆ 136.73   │
│ NVDA   ┆ NVIDIA            ┆ 115.78 ┆ 116.33   ┆ 113.58  ┆ 117.3     ┆ 32.69    │
│ MSFT   ┆ Microsoft         ┆ 350.47 ┆ 353.37   ┆ 347.93  ┆ 390.29    ┆ 270.33   │
│ GOOG   ┆ Alphabet (Google) ┆ 138.68 ┆ 139.68   ┆ 137.32  ┆ 161.09    ┆ 101.22   │
│ AMZN   ┆ Amazon            ┆ 157.0  ┆ 158.19   ┆ 157.03  ┆ 167.67    ┆ 98.63    │
└────────┴───────────────────┴────────┴──────────┴─────────┴───────────┴──────────┘

4.2.4.2.4.1.3. by pattern matching in columns name#

^ : start
$ : end
* : multi-characters
. : single-character
…

result = df.select(pl.col("ticker", "^.*_high$", "^.*_low$"))
print(result)

shape: (5, 5)
┌────────┬──────────┬───────────┬─────────┬──────────┐
│ ticker ┆ day_high ┆ year_high ┆ day_low ┆ year_low │
│ ---    ┆ ---      ┆ ---       ┆ ---     ┆ ---      │
│ str    ┆ f64      ┆ f64       ┆ f64     ┆ f64      │
╞════════╪══════════╪═══════════╪═════════╪══════════╡
│ AAPL   ┆ 231.31   ┆ 237.23    ┆ 228.6   ┆ 164.08   │
│ NVDA   ┆ 139.6    ┆ 140.76    ┆ 136.3   ┆ 39.23    │
│ MSFT   ┆ 424.04   ┆ 468.35    ┆ 417.52  ┆ 324.39   │
│ GOOG   ┆ 167.62   ┆ 193.31    ┆ 164.78  ┆ 121.46   │
│ AMZN   ┆ 189.83   ┆ 201.2     ┆ 188.44  ┆ 118.35   │
└────────┴──────────┴───────────┴─────────┴──────────┘

4.2.4.2.4.2. Generate expressions#

Instead of Way 1, should do by Way 2 to:

do a better job at optimising the query
parallelise the execution of the actual computations.

# Way 1:
result = df
for tp in ["day", "year"]:
    result = result.with_columns(
        (pl.col(f"{tp}_high") - pl.col(f"{tp}_low")).alias(f"{tp}_amplitude")
    )


# Way 2 (should do):
def amplitude_expressions(time_periods):
    for tp in time_periods:
        yield (pl.col(f"{tp}_high") - pl.col(f"{tp}_low")).alias(
            f"{tp}_amplitude"
        )


result = df.with_columns(amplitude_expressions(["day", "year"]))
print(result)

shape: (5, 9)
┌────────┬──────────────┬────────┬──────────┬───┬───────────┬──────────┬─────────────┬─────────────┐
│ ticker ┆ company_name ┆ price  ┆ day_high ┆ … ┆ year_high ┆ year_low ┆ day_amplitu ┆ year_amplit │
│ ---    ┆ ---          ┆ ---    ┆ ---      ┆   ┆ ---       ┆ ---      ┆ de          ┆ ude         │
│ str    ┆ str          ┆ f64    ┆ f64      ┆   ┆ f64       ┆ f64      ┆ ---         ┆ ---         │
│        ┆              ┆        ┆          ┆   ┆           ┆          ┆ f64         ┆ f64         │
╞════════╪══════════════╪════════╪══════════╪═══╪═══════════╪══════════╪═════════════╪═════════════╡
│ AAPL   ┆ Apple        ┆ 229.9  ┆ 231.31   ┆ … ┆ 237.23    ┆ 164.08   ┆ 2.71        ┆ 73.15       │
│ NVDA   ┆ NVIDIA       ┆ 138.93 ┆ 139.6    ┆ … ┆ 140.76    ┆ 39.23    ┆ 3.3         ┆ 101.53      │
│ MSFT   ┆ Microsoft    ┆ 420.56 ┆ 424.04   ┆ … ┆ 468.35    ┆ 324.39   ┆ 6.52        ┆ 143.96      │
│ GOOG   ┆ Alphabet     ┆ 166.41 ┆ 167.62   ┆ … ┆ 193.31    ┆ 121.46   ┆ 2.84        ┆ 71.85       │
│        ┆ (Google)     ┆        ┆          ┆   ┆           ┆          ┆             ┆             │
│ AMZN   ┆ Amazon       ┆ 188.4  ┆ 189.83   ┆ … ┆ 201.2     ┆ 118.35   ┆ 1.39        ┆ 82.85       │
└────────┴──────────────┴────────┴──────────┴───┴───────────┴──────────┴─────────────┴─────────────┘

4.2.4.3. Datatype Operations#

Full link: https://docs.pola.rs/api/python/stable/reference/datatypes.html

import polars as pl

df = pl.DataFrame(
    {  # As of 14th October 2024, ~3pm UTC
        "ticker": ["AAPL", "NVDA", "MSFT", "GOOG", "AMZN"],
        "company_name": [
            "Apple",
            "NVIDIA",
            "Microsoft",
            "Alphabet (Google)",
            "Amazon",
        ],
        "price": [229.9, 138.93, 420.56, 166.41, 188.4],
        "day_high": [231, 139, 424, 167, 189],
        "day_low": [228, 136, 417, 164, 188],
        "year_high": [237, 140, 468, 193, 201],
        "year_low": [164, 39, 324, 121, 118],
    }
)

print(df)

shape: (5, 7)
┌────────┬───────────────────┬────────┬──────────┬─────────┬───────────┬──────────┐
│ ticker ┆ company_name      ┆ price  ┆ day_high ┆ day_low ┆ year_high ┆ year_low │
│ ---    ┆ ---               ┆ ---    ┆ ---      ┆ ---     ┆ ---       ┆ ---      │
│ str    ┆ str               ┆ f64    ┆ i64      ┆ i64     ┆ i64       ┆ i64      │
╞════════╪═══════════════════╪════════╪══════════╪═════════╪═══════════╪══════════╡
│ AAPL   ┆ Apple             ┆ 229.9  ┆ 231      ┆ 228     ┆ 237       ┆ 164      │
│ NVDA   ┆ NVIDIA            ┆ 138.93 ┆ 139      ┆ 136     ┆ 140       ┆ 39       │
│ MSFT   ┆ Microsoft         ┆ 420.56 ┆ 424      ┆ 417     ┆ 468       ┆ 324      │
│ GOOG   ┆ Alphabet (Google) ┆ 166.41 ┆ 167      ┆ 164     ┆ 193       ┆ 121      │
│ AMZN   ┆ Amazon            ┆ 188.4  ┆ 189      ┆ 188     ┆ 201       ┆ 118      │
└────────┴───────────────────┴────────┴──────────┴─────────┴───────────┴──────────┘

4.2.4.3.1. Casting#

4.2.4.3.1.1. `cast`#

strict that determines how Polars behaves when it encounters a value that cannot be converted from the source data type to the target data type
- strict=True: raise if error
- strict=False: null if error

result = df.select(
    pl.col("price").cast(pl.Float32).name.suffix("_integers_as_floats"),
    pl.col("day_high").cast(pl.Int32).name.suffix("_floats_as_integers"),
)
print(result)

shape: (5, 2)
┌──────────────────────────┬─────────────────────────────┐
│ price_integers_as_floats ┆ day_high_floats_as_integers │
│ ---                      ┆ ---                         │
│ f32                      ┆ i32                         │
╞══════════════════════════╪═════════════════════════════╡
│ 229.899994               ┆ 231                         │
│ 138.929993               ┆ 139                         │
│ 420.559998               ┆ 424                         │
│ 166.410004               ┆ 167                         │
│ 188.399994               ┆ 189                         │
└──────────────────────────┴─────────────────────────────┘

4.2.4.3.1.2. downcasting numerical#

Reduce the memory footprint of a column by changing the precision associated with its numeric data type:

Int64 –> Int16
Float64 –> Float32

print(f"Before downcasting: {df.estimated_size()} bytes")
result = df.with_columns(
    pl.col("price").cast(pl.Float32),
    pl.col("day_high").cast(pl.Int32),
    pl.col("year_low")
    .cast(pl.Int8, strict=False)
    .alias("year_low_convert_with_strict"),
)
print(f"After downcasting: {result.estimated_size()} bytes")
print(result)

Before downcasting: 263 bytes
After downcasting: 229 bytes
shape: (5, 8)
┌────────┬────────────────┬────────────┬──────────┬─────────┬───────────┬──────────┬───────────────┐
│ ticker ┆ company_name   ┆ price      ┆ day_high ┆ day_low ┆ year_high ┆ year_low ┆ year_low_conv │
│ ---    ┆ ---            ┆ ---        ┆ ---      ┆ ---     ┆ ---       ┆ ---      ┆ ert_with_stri │
│ str    ┆ str            ┆ f32        ┆ i32      ┆ i64     ┆ i64       ┆ i64      ┆ ct            │
│        ┆                ┆            ┆          ┆         ┆           ┆          ┆ ---           │
│        ┆                ┆            ┆          ┆         ┆           ┆          ┆ i8            │
╞════════╪════════════════╪════════════╪══════════╪═════════╪═══════════╪══════════╪═══════════════╡
│ AAPL   ┆ Apple          ┆ 229.899994 ┆ 231      ┆ 228     ┆ 237       ┆ 164      ┆ null          │
│ NVDA   ┆ NVIDIA         ┆ 138.929993 ┆ 139      ┆ 136     ┆ 140       ┆ 39       ┆ 39            │
│ MSFT   ┆ Microsoft      ┆ 420.559998 ┆ 424      ┆ 417     ┆ 468       ┆ 324      ┆ null          │
│ GOOG   ┆ Alphabet       ┆ 166.410004 ┆ 167      ┆ 164     ┆ 193       ┆ 121      ┆ 121           │
│        ┆ (Google)       ┆            ┆          ┆         ┆           ┆          ┆               │
│ AMZN   ┆ Amazon         ┆ 188.399994 ┆ 189      ┆ 188     ┆ 201       ┆ 118      ┆ 118           │
└────────┴────────────────┴────────────┴──────────┴─────────┴───────────┴──────────┴───────────────┘

4.2.4.3.1.3. strings to numeric#

df = pl.DataFrame(
    {
        "integers_as_strings": ["1", "2", "3"],
        "floats_as_strings": ["4.0", "5.8", "-6.3"],
        "floats": [4.0, 5.8, -6.3],
    }
)

result = df.select(
    pl.col("integers_as_strings").cast(pl.Int32),
    pl.col("floats_as_strings").cast(pl.Float64),
    pl.col("floats").cast(pl.String),
)
print(result)

shape: (3, 3)
┌─────────────────────┬───────────────────┬────────┐
│ integers_as_strings ┆ floats_as_strings ┆ floats │
│ ---                 ┆ ---               ┆ ---    │
│ i32                 ┆ f64               ┆ str    │
╞═════════════════════╪═══════════════════╪════════╡
│ 1                   ┆ 4.0               ┆ 4.0    │
│ 2                   ┆ 5.8               ┆ 5.8    │
│ 3                   ┆ -6.3              ┆ -6.3   │
└─────────────────────┴───────────────────┴────────┘

4.2.4.3.1.4. booleans#

df = pl.DataFrame(
    {
        "integers": [-1, 0, 2, 3, 4],
        "floats": [0.0, 1.0, 2.0, 3.0, 4.0],
        "bools": [True, False, True, False, True],
    }
)

result = df.select(
    pl.col("integers").cast(pl.Boolean),
    pl.col("floats").cast(pl.Boolean),
    pl.col("bools").cast(pl.Int8),
)
print(result)

4.2.4.3.2. Number#

https://docs.pola.rs/api/python/stable/reference/expressions/computation.html

result = df.select(
    pl.col("nrs").alias("raw_nrs"),
    (pl.col("nrs") + 5).alias("nrs + 5"),
    (pl.col("nrs") - 5).alias("nrs - 5"),
    (pl.col("nrs") * pl.col("random")).alias("nrs * random"),
    (pl.col("nrs") / pl.col("random")).alias("nrs / random"),
    (pl.col("nrs") ** 2).alias("nrs ** 2"),
    (pl.col("nrs") % 3).alias("nrs % 3"),
)

print(result)

shape: (5, 7)
┌─────────┬─────────┬─────────┬──────────────┬──────────────┬──────────┬─────────┐
│ raw_nrs ┆ nrs + 5 ┆ nrs - 5 ┆ nrs * random ┆ nrs / random ┆ nrs ** 2 ┆ nrs % 3 │
│ ---     ┆ ---     ┆ ---     ┆ ---          ┆ ---          ┆ ---      ┆ ---     │
│ i64     ┆ i64     ┆ i64     ┆ f64          ┆ f64          ┆ i64      ┆ i64     │
╞═════════╪═════════╪═════════╪══════════════╪══════════════╪══════════╪═════════╡
│ 1       ┆ 6       ┆ -4      ┆ 0.37454      ┆ 2.669941     ┆ 1        ┆ 1       │
│ 2       ┆ 7       ┆ -3      ┆ 1.901429     ┆ 2.103681     ┆ 4        ┆ 2       │
│ 3       ┆ 8       ┆ -2      ┆ 2.195982     ┆ 4.098395     ┆ 9        ┆ 0       │
│ null    ┆ null    ┆ null    ┆ null         ┆ null         ┆ null     ┆ null    │
│ 5       ┆ 10      ┆ 0       ┆ 0.780093     ┆ 32.047453    ┆ 25       ┆ 2       │
└─────────┴─────────┴─────────┴──────────────┴──────────────┴──────────┴─────────┘

Or use named functions

# Python only:
result_named_operators = df.select(
    pl.col("nrs").alias("raw_nrs"),
    (pl.col("nrs").add(5)).alias("nrs + 5"),
    (pl.col("nrs").sub(5)).alias("nrs - 5"),
    (pl.col("nrs").mul(pl.col("random"))).alias("nrs * random"),
    (pl.col("nrs").truediv(pl.col("random"))).alias("nrs / random"),
    (pl.col("nrs").pow(2)).alias("nrs ** 2"),
    (pl.col("nrs").mod(3)).alias("nrs % 3"),
)

print(result.equals(result_named_operators))

True

4.2.4.3.3. Datetime#

https://docs.pola.rs/api/python/stable/reference/expressions/temporal.html

from datetime import date, datetime, time

df = pl.DataFrame(
    {
        "date": [date(2022, 2, 1), date(2022, 1, 5)],
        "end_date": [date(2022, 3, 4), date(2022, 4, 2)],
        "string": ["2022-01-01", "2022-01-02"],
    }
)

result = df.with_columns(
    date_to_str=pl.col("date").dt.to_string("%Y-%m-%d"),
    str_to_date=pl.col("string").str.to_date("%Y-%m-%d"),
    str_to_datetime=pl.col("string").str.to_datetime("%Y-%m-%d"),
    year_of_date=pl.col("date").dt.year(),
    after_10_bdays=pl.col("date").dt.add_business_days(10),
    days_delta=(pl.col("end_date") - pl.col("date")).dt.total_days(),
)
result

shape: (2, 9)

date	end_date	string	date_to_str	str_to_date	str_to_datetime	year_of_date	after_10_bdays	days_delta
date	date	str	str	date	datetime[μs]	i32	date	i64
2022-02-01	2022-03-04	"2022-01-01"	"2022-02-01"	2022-01-01	2022-01-01 00:00:00	2022	2022-02-15	31
2022-01-05	2022-04-02	"2022-01-02"	"2022-01-05"	2022-01-02	2022-01-02 00:00:00	2022	2022-01-19	87

4.2.4.3.4. String#

https://docs.pola.rs/api/python/stable/reference/expressions/string.html

import polars as pl

df = pl.DataFrame(
    {
        "language": ["English", "Dutch", "Portuguese", "Finish"],
        "fruit": ["pear", "peer", "pêra", "päärynä"],
    }
)

len and size

# len and size
result = df.with_columns(
    pl.col("fruit").str.len_bytes().alias("byte_count"),
    pl.col("fruit").str.len_chars().alias("letter_count"),
)
print(result)

shape: (4, 4)
┌────────────┬─────────┬────────────┬──────────────┐
│ language   ┆ fruit   ┆ byte_count ┆ letter_count │
│ ---        ┆ ---     ┆ ---        ┆ ---          │
│ str        ┆ str     ┆ u32        ┆ u32          │
╞════════════╪═════════╪════════════╪══════════════╡
│ English    ┆ pear    ┆ 4          ┆ 4            │
│ Dutch      ┆ peer    ┆ 4          ┆ 4            │
│ Portuguese ┆ pêra    ┆ 5          ┆ 4            │
│ Finish     ┆ päärynä ┆ 10         ┆ 7            │
└────────────┴─────────┴────────────┴──────────────┘

Parsing strings

result = df.select(
    pl.col("fruit"),
    pl.col("fruit").str.starts_with("p").alias("starts_with_p"),
    pl.col("fruit").str.contains("p..r").alias("p..r"),
    pl.col("fruit").str.contains("e+").alias("e+"),
    pl.col("fruit").str.ends_with("r").alias("ends_with_r"),
)
print(result)

shape: (4, 5)
┌─────────┬───────────────┬───────┬───────┬─────────────┐
│ fruit   ┆ starts_with_p ┆ p..r  ┆ e+    ┆ ends_with_r │
│ ---     ┆ ---           ┆ ---   ┆ ---   ┆ ---         │
│ str     ┆ bool          ┆ bool  ┆ bool  ┆ bool        │
╞═════════╪═══════════════╪═══════╪═══════╪═════════════╡
│ pear    ┆ true          ┆ true  ┆ true  ┆ true        │
│ peer    ┆ true          ┆ true  ┆ true  ┆ true        │
│ pêra    ┆ true          ┆ false ┆ false ┆ false       │
│ päärynä ┆ true          ┆ true  ┆ false ┆ false       │
└─────────┴───────────────┴───────┴───────┴─────────────┘

Extract

df = pl.DataFrame(
    {
        "urls": [
            "http://vote.com/ballon_dor?candidate=messi&ref=polars",
            "http://vote.com/ballon_dor?candidat=jorginho&ref=polars",
            "http://vote.com/ballon_dor?candidate=ronaldo&ref=polars",
        ]
    }
)
result = df.select(
    pl.col("urls").str.extract(r"candidate=(\w+)", group_index=1),
)
print(result)

shape: (3, 1)
┌─────────┐
│ urls    │
│ ---     │
│ str     │
╞═════════╡
│ messi   │
│ null    │
│ ronaldo │
└─────────┘

Replace

df = pl.DataFrame({"text": ["123abc", "abc456"]})
result = df.with_columns(
    pl.col("text").str.replace(r"\d", "-"),
    pl.col("text").str.replace_all(r"\d", "-").alias("text_replace_all"),
)
print(result)

shape: (2, 2)
┌────────┬──────────────────┐
│ text   ┆ text_replace_all │
│ ---    ┆ ---              │
│ str    ┆ str              │
╞════════╪══════════════════╡
│ -23abc ┆ ---abc           │
│ abc-56 ┆ abc---           │
└────────┴──────────────────┘

Modifying

addresses = pl.DataFrame(
    {
        "addresses": [
            "128 PERF st",
            "Rust blVD, 158",
            "PoLaRs Av, 12",
            "1042 Query sq",
        ]
    }
)
addr = pl.col("addresses")
chars = ", 0123456789"
addresses = addresses.select(
    addr.alias("originals"),
    addr.str.to_titlecase(),
    addr.str.to_lowercase().alias("lower"),
    addr.str.to_uppercase().alias("upper"),
    addr.str.strip_chars(chars).alias("strip"),
    addr.str.strip_chars_end(chars).alias("end"),
    addr.str.strip_chars_start(chars).alias("start"),
    addr.str.strip_prefix("128 ").alias("prefix"),
    addr.str.strip_suffix(", 158").alias("suffix"),
    addr.str.head(3).alias("3_characters_begin"),
    addr.str.tail(3).alias("3_characters_end"),
    addr.str.slice(2, length=4).alias("from_2_with_len_4"),
)
addresses

shape: (4, 12)

originals	addresses	lower	upper	strip	end	start	prefix	suffix	3_characters_begin	3_characters_end	from_2_with_len_4
str	str	str	str	str	str	str	str	str	str	str	str
"128 PERF st"	"128 Perf St"	"128 perf st"	"128 PERF ST"	"PERF st"	"128 PERF st"	"PERF st"	"PERF st"	"128 PERF st"	"128"	" st"	"8 PE"
"Rust blVD, 158"	"Rust Blvd, 158"	"rust blvd, 158"	"RUST BLVD, 158"	"Rust blVD"	"Rust blVD"	"Rust blVD, 158"	"Rust blVD, 158"	"Rust blVD"	"Rus"	"158"	"st b"
"PoLaRs Av, 12"	"Polars Av, 12"	"polars av, 12"	"POLARS AV, 12"	"PoLaRs Av"	"PoLaRs Av"	"PoLaRs Av, 12"	"PoLaRs Av, 12"	"PoLaRs Av, 12"	"PoL"	" 12"	"LaRs"
"1042 Query sq"	"1042 Query Sq"	"1042 query sq"	"1042 QUERY SQ"	"Query sq"	"1042 Query sq"	"Query sq"	"1042 Query sq"	"1042 Query sq"	"104"	" sq"	"42 Q"

4.2.4.3.5. Lists and array#

4.2.4.3.5.1. List#

https://docs.pola.rs/api/python/stable/reference/expressions/list.html

List is suitable for columns those have values with 1-D or difference in lengths (unknown lenght)

from datetime import datetime
import polars as pl

df = pl.DataFrame(
    {
        "names": [
            ["Anne", "Averill", "Adams"],
            ["Brandon", "Brooke", "Borden", "Branson"],
            ["Camila", "Campbell"],
            ["Dennis", "Doyle"],
        ],
        "children_ages": [
            [5, 7],
            [],
            [],
            [8, 11, 18],
        ],
        "medical_appointments": [
            [],
            [],
            [],
            [datetime(2022, 5, 22, 16, 30)],
        ],
    }
)

print(df)

shape: (4, 3)
┌─────────────────────────────────┬───────────────┬───────────────────────┐
│ names                           ┆ children_ages ┆ medical_appointments  │
│ ---                             ┆ ---           ┆ ---                   │
│ list[str]                       ┆ list[i64]     ┆ list[datetime[μs]]    │
╞═════════════════════════════════╪═══════════════╪═══════════════════════╡
│ ["Anne", "Averill", "Adams"]    ┆ [5, 7]        ┆ []                    │
│ ["Brandon", "Brooke", … "Brans… ┆ []            ┆ []                    │
│ ["Camila", "Campbell"]          ┆ []            ┆ []                    │
│ ["Dennis", "Doyle"]             ┆ [8, 11, 18]   ┆ [2022-05-22 16:30:00] │
└─────────────────────────────────┴───────────────┴───────────────────────┘

weather = pl.DataFrame(
    {
        "station": [f"Station {idx}" for idx in range(1, 6)],
        "temperatures": [
            "20 5 5 E1 7 13 19 9 6 20",
            "18 8 16 11 23 E2 8 E2 E2 E2 90 70 40",
            "19 24 E9 16 6 12 10 22",
            "E2 E0 15 7 8 10 E1 24 17 13 6",
            "14 8 E0 16 22 24 E1",
        ],
    }
)
weather = weather.with_columns(pl.col("temperatures").str.split(" "))
print(weather)

shape: (5, 2)
┌───────────┬──────────────────────┐
│ station   ┆ temperatures         │
│ ---       ┆ ---                  │
│ str       ┆ list[str]            │
╞═══════════╪══════════════════════╡
│ Station 1 ┆ ["20", "5", … "20"]  │
│ Station 2 ┆ ["18", "8", … "40"]  │
│ Station 3 ┆ ["19", "24", … "22"] │
│ Station 4 ┆ ["E2", "E0", … "6"]  │
│ Station 5 ┆ ["14", "8", … "E1"]  │
└───────────┴──────────────────────┘

# explode each element in its own rows
result = weather.explode("temperatures")
print(result)

shape: (49, 2)
┌───────────┬──────────────┐
│ station   ┆ temperatures │
│ ---       ┆ ---          │
│ str       ┆ str          │
╞═══════════╪══════════════╡
│ Station 1 ┆ 20           │
│ Station 1 ┆ 5            │
│ Station 1 ┆ 5            │
│ Station 1 ┆ E1           │
│ Station 1 ┆ 7            │
│ …         ┆ …            │
│ Station 5 ┆ E0           │
│ Station 5 ┆ 16           │
│ Station 5 ┆ 22           │
│ Station 5 ┆ 24           │
│ Station 5 ┆ E1           │
└───────────┴──────────────┘

# slicing
result = weather.with_columns(
    pl.col("temperatures").list.head(3).alias("head"),
    pl.col("temperatures").list.tail(3).alias("tail"),
    pl.col("temperatures").list.slice(-3, 2).alias("two_next_to_last"),
)
print(result)

shape: (5, 5)
┌───────────┬──────────────────────┬────────────────────┬────────────────────┬──────────────────┐
│ station   ┆ temperatures         ┆ head               ┆ tail               ┆ two_next_to_last │
│ ---       ┆ ---                  ┆ ---                ┆ ---                ┆ ---              │
│ str       ┆ list[str]            ┆ list[str]          ┆ list[str]          ┆ list[str]        │
╞═══════════╪══════════════════════╪════════════════════╪════════════════════╪══════════════════╡
│ Station 1 ┆ ["20", "5", … "20"]  ┆ ["20", "5", "5"]   ┆ ["9", "6", "20"]   ┆ ["9", "6"]       │
│ Station 2 ┆ ["18", "8", … "40"]  ┆ ["18", "8", "16"]  ┆ ["90", "70", "40"] ┆ ["90", "70"]     │
│ Station 3 ┆ ["19", "24", … "22"] ┆ ["19", "24", "E9"] ┆ ["12", "10", "22"] ┆ ["12", "10"]     │
│ Station 4 ┆ ["E2", "E0", … "6"]  ┆ ["E2", "E0", "15"] ┆ ["17", "13", "6"]  ┆ ["17", "13"]     │
│ Station 5 ┆ ["14", "8", … "E1"]  ┆ ["14", "8", "E0"]  ┆ ["22", "24", "E1"] ┆ ["22", "24"]     │
└───────────┴──────────────────────┴────────────────────┴────────────────────┴──────────────────┘

access in namespace list: https://docs.pola.rs/api/python/stable/reference/expressions/list.html

# Function with list

result = weather.with_columns(
    pl.col("temperatures")
    .list.eval(
        pl.element().cast(pl.Int64, strict=False).is_null()
    )  # each element cast to INTEGER (null if failded), then mask where null
    .list.sum()  # count True
    .alias("errors"),
)
print(result)

shape: (5, 3)
┌───────────┬──────────────────────┬────────┐
│ station   ┆ temperatures         ┆ errors │
│ ---       ┆ ---                  ┆ ---    │
│ str       ┆ list[str]            ┆ u32    │
╞═══════════╪══════════════════════╪════════╡
│ Station 1 ┆ ["20", "5", … "20"]  ┆ 1      │
│ Station 2 ┆ ["18", "8", … "40"]  ┆ 4      │
│ Station 3 ┆ ["19", "24", … "22"] ┆ 1      │
│ Station 4 ┆ ["E2", "E0", … "6"]  ┆ 3      │
│ Station 5 ┆ ["14", "8", … "E1"]  ┆ 2      │
└───────────┴──────────────────────┴────────┘

Row-wise computations in list (cross columns in same row)

weather_by_day = pl.DataFrame(
    {
        "station": [f"Station {idx}" for idx in range(1, 11)],
        "day_1": [17, 11, 8, 22, 9, 21, 20, 8, 8, 17],
        "day_2": [15, 11, 10, 8, 7, 14, 18, 21, 15, 13],
        "day_3": [16, 15, 24, 24, 8, 23, 19, 23, 16, 10],
    }
)
print(weather_by_day)

shape: (10, 4)
┌────────────┬───────┬───────┬───────┐
│ station    ┆ day_1 ┆ day_2 ┆ day_3 │
│ ---        ┆ ---   ┆ ---   ┆ ---   │
│ str        ┆ i64   ┆ i64   ┆ i64   │
╞════════════╪═══════╪═══════╪═══════╡
│ Station 1  ┆ 17    ┆ 15    ┆ 16    │
│ Station 2  ┆ 11    ┆ 11    ┆ 15    │
│ Station 3  ┆ 8     ┆ 10    ┆ 24    │
│ Station 4  ┆ 22    ┆ 8     ┆ 24    │
│ Station 5  ┆ 9     ┆ 7     ┆ 8     │
│ Station 6  ┆ 21    ┆ 14    ┆ 23    │
│ Station 7  ┆ 20    ┆ 18    ┆ 19    │
│ Station 8  ┆ 8     ┆ 21    ┆ 23    │
│ Station 9  ┆ 8     ┆ 15    ┆ 16    │
│ Station 10 ┆ 17    ┆ 13    ┆ 10    │
└────────────┴───────┴───────┴───────┘

rank_pct = (pl.element().rank(descending=True) / pl.all().count()).round(2)

result = weather_by_day.with_columns(
    # create the list of homogeneous data
    pl.concat_list(pl.all().exclude("station")).alias("all_temps")
).select(
    # select all columns except the intermediate list
    pl.all().exclude("all_temps"),
    # compute the rank by calling `list.eval`
    pl.col("all_temps").list.eval(rank_pct, parallel=True).alias("temps_rank"),
)

print(result)

shape: (10, 5)
┌────────────┬───────┬───────┬───────┬────────────────────┐
│ station    ┆ day_1 ┆ day_2 ┆ day_3 ┆ temps_rank         │
│ ---        ┆ ---   ┆ ---   ┆ ---   ┆ ---                │
│ str        ┆ i64   ┆ i64   ┆ i64   ┆ list[f64]          │
╞════════════╪═══════╪═══════╪═══════╪════════════════════╡
│ Station 1  ┆ 17    ┆ 15    ┆ 16    ┆ [0.33, 1.0, 0.67]  │
│ Station 2  ┆ 11    ┆ 11    ┆ 15    ┆ [0.83, 0.83, 0.33] │
│ Station 3  ┆ 8     ┆ 10    ┆ 24    ┆ [1.0, 0.67, 0.33]  │
│ Station 4  ┆ 22    ┆ 8     ┆ 24    ┆ [0.67, 1.0, 0.33]  │
│ Station 5  ┆ 9     ┆ 7     ┆ 8     ┆ [0.33, 1.0, 0.67]  │
│ Station 6  ┆ 21    ┆ 14    ┆ 23    ┆ [0.67, 1.0, 0.33]  │
│ Station 7  ┆ 20    ┆ 18    ┆ 19    ┆ [0.33, 1.0, 0.67]  │
│ Station 8  ┆ 8     ┆ 21    ┆ 23    ┆ [1.0, 0.67, 0.33]  │
│ Station 9  ┆ 8     ┆ 15    ┆ 16    ┆ [1.0, 0.67, 0.33]  │
│ Station 10 ┆ 17    ┆ 13    ┆ 10    ┆ [0.33, 0.67, 1.0]  │
└────────────┴───────┴───────┴───────┴────────────────────┘

4.2.4.3.5.2. Array (prefer)#

https://docs.pola.rs/api/python/stable/reference/expressions/array.html

Suitable for multi-dimension and known and fixed shape

In short, prefer the data type Array over List because it is more memory efficient and more performant. If you cannot use Array, then use List:

when the values within a column do not have a fixed shape; or
when you need functions that are only available in the list API.

df = pl.DataFrame(
    {
        "bit_flags": [
            [True, True, True, True, False],
            [False, True, True, True, True],
        ],
        "tic_tac_toe": [
            [
                [" ", "x", "o"],
                [" ", "x", " "],
                ["o", "x", " "],
            ],
            [
                ["o", "x", "x"],
                [" ", "o", "x"],
                [" ", " ", "o"],
            ],
        ],
    },
    schema={  # define array
        "bit_flags": pl.Array(pl.Boolean, 5),
        "tic_tac_toe": pl.Array(pl.String, (3, 3)),
    },
)

print(df)

shape: (2, 2)
┌───────────────────────┬─────────────────────────────────┐
│ bit_flags             ┆ tic_tac_toe                     │
│ ---                   ┆ ---                             │
│ array[bool, 5]        ┆ array[str, (3, 3)]              │
╞═══════════════════════╪═════════════════════════════════╡
│ [true, true, … false] ┆ [[" ", "x", "o"], [" ", "x", "… │
│ [false, true, … true] ┆ [["o", "x", "x"], [" ", "o", "… │
└───────────────────────┴─────────────────────────────────┘

access in namespace arr: https://docs.pola.rs/api/python/stable/reference/expressions/array.html

df = pl.DataFrame(
    {
        "first_last": [
            ["Anne", "Adams"],
            ["Brandon", "Branson"],
            ["Camila", "Campbell"],
            ["Dennis", "Doyle"],
        ],
        "fav_numbers": [
            [42, 0, 1],
            [2, 3, 5],
            [13, 21, 34],
            [73, 3, 7],
        ],
    },
    schema={
        "first_last": pl.Array(pl.String, 2),
        "fav_numbers": pl.Array(pl.Int32, 3),
    },
)

result = df.select(
    pl.col("first_last").arr.join(" ").alias("name"),
    pl.col("fav_numbers").arr.sort(),
    pl.col("fav_numbers").arr.max().alias("largest_fav"),
    pl.col("fav_numbers").arr.sum().alias("summed"),
    pl.col("fav_numbers").arr.contains(3).alias("likes_3"),
)
print(result)

shape: (4, 5)
┌─────────────────┬───────────────┬─────────────┬────────┬─────────┐
│ name            ┆ fav_numbers   ┆ largest_fav ┆ summed ┆ likes_3 │
│ ---             ┆ ---           ┆ ---         ┆ ---    ┆ ---     │
│ str             ┆ array[i32, 3] ┆ i32         ┆ i32    ┆ bool    │
╞═════════════════╪═══════════════╪═════════════╪════════╪═════════╡
│ Anne Adams      ┆ [0, 1, 42]    ┆ 42          ┆ 43     ┆ false   │
│ Brandon Branson ┆ [2, 3, 5]     ┆ 5           ┆ 10     ┆ true    │
│ Camila Campbell ┆ [13, 21, 34]  ┆ 34          ┆ 68     ┆ false   │
│ Dennis Doyle    ┆ [3, 7, 73]    ┆ 73          ┆ 83     ┆ true    │
└─────────────────┴───────────────┴─────────────┴────────┴─────────┘

4.2.4.3.6. Categorical and enums#

https://docs.pola.rs/api/python/stable/reference/expressions/categories.html

Prefer Enum over Categorical whenever possible.

When the categories are fixed and known up front, use Enum.
When you don’t know the categories or they are not fixed then you must use Categorical

4.2.4.3.7. Structs#

access in namespace struct: https://docs.pola.rs/api/python/stable/reference/expressions/struct.html

import polars as pl

ratings = pl.DataFrame(
    {
        "Movie": [
            "Cars",
            "IT",
            "ET",
            "Cars",
            "Up",
            "IT",
            "Cars",
            "ET",
            "Up",
            "Cars",
        ],
        "Theatre": [
            "NE",
            "ME",
            "IL",
            "ND",
            "NE",
            "SD",
            "NE",
            "IL",
            "IL",
            "NE",
        ],
        "Avg_Rating": [4.5, 4.4, 4.6, 4.3, 4.8, 4.7, 4.5, 4.9, 4.7, 4.6],
        "Count": [30, 27, 26, 29, 31, 28, 28, 26, 33, 28],
    }
)
result = ratings.select(pl.col("Theatre").value_counts(sort=True))
print(result)

shape: (5, 1)
┌───────────┐
│ Theatre   │
│ ---       │
│ struct[2] │
╞═══════════╡
│ {"NE",4}  │
│ {"IL",3}  │
│ {"ME",1}  │
│ {"ND",1}  │
│ {"SD",1}  │
└───────────┘

4.2.4.3.7.1. `unnest`#

result = ratings.select(pl.col("Theatre").value_counts(sort=True)).unnest(
    "Theatre"
)
print(result)

shape: (5, 2)
┌─────────┬───────┐
│ Theatre ┆ count │
│ ---     ┆ ---   │
│ str     ┆ u32   │
╞═════════╪═══════╡
│ NE      ┆ 4     │
│ IL      ┆ 3     │
│ ME      ┆ 1     │
│ ND      ┆ 1     │
│ SD      ┆ 1     │
└─────────┴───────┘

The function unnest will turn each field of the Struct into its own column.

rating_series = pl.Series(
    "ratings",
    [
        {"Movie": "Cars", "Theatre": "NE", "Avg_Rating": 4.5},
        {"Movie": "Toy Story", "Theatre": "ME", "Avg_Rating": 4.9},
    ],
)
print(rating_series)

unnest_sr = rating_series.struct.unnest()
print("------\nUnnest Series: \n", unnest_sr)

shape: (2,)
Series: 'ratings' [struct[3]]
[
	{"Cars","NE",4.5}
	{"Toy Story","ME",4.9}
]
------
Unnest Series: 
 shape: (2, 3)
┌───────────┬─────────┬────────────┐
│ Movie     ┆ Theatre ┆ Avg_Rating │
│ ---       ┆ ---     ┆ ---        │
│ str       ┆ str     ┆ f64        │
╞═══════════╪═════════╪════════════╡
│ Cars      ┆ NE      ┆ 4.5        │
│ Toy Story ┆ ME      ┆ 4.9        │
└───────────┴─────────┴────────────┘

4.2.4.3.7.2. `field`: extract field#

result = rating_series.struct.field("Movie")
print(result)

shape: (2,)
Series: 'Movie' [str]
[
	"Cars"
	"Toy Story"
]

4.2.4.3.7.3. `rename_fields`#

result = rating_series.struct.rename_fields(["Film", "State", "Value"])
result.to_frame().unnest("ratings")

shape: (2, 3)

Film	State	Value
str	str	f64
"Cars"	"NE"	4.5
"Toy Story"	"ME"	4.9

4.2.4.3.7.4. detect duplicated / unique rows#

result = ratings.filter(pl.struct("Movie", "Theatre").is_duplicated())
print(result)

shape: (5, 4)
┌───────┬─────────┬────────────┬───────┐
│ Movie ┆ Theatre ┆ Avg_Rating ┆ Count │
│ ---   ┆ ---     ┆ ---        ┆ ---   │
│ str   ┆ str     ┆ f64        ┆ i64   │
╞═══════╪═════════╪════════════╪═══════╡
│ Cars  ┆ NE      ┆ 4.5        ┆ 30    │
│ ET    ┆ IL      ┆ 4.6        ┆ 26    │
│ Cars  ┆ NE      ┆ 4.5        ┆ 28    │
│ ET    ┆ IL      ┆ 4.9        ┆ 26    │
│ Cars  ┆ NE      ┆ 4.6        ┆ 28    │
└───────┴─────────┴────────────┴───────┘

result = ratings.filter(pl.struct("Movie", "Theatre").is_unique())
print(result)

shape: (5, 4)
┌───────┬─────────┬────────────┬───────┐
│ Movie ┆ Theatre ┆ Avg_Rating ┆ Count │
│ ---   ┆ ---     ┆ ---        ┆ ---   │
│ str   ┆ str     ┆ f64        ┆ i64   │
╞═══════╪═════════╪════════════╪═══════╡
│ IT    ┆ ME      ┆ 4.4        ┆ 27    │
│ Cars  ┆ ND      ┆ 4.3        ┆ 29    │
│ Up    ┆ NE      ┆ 4.8        ┆ 31    │
│ IT    ┆ SD      ┆ 4.7        ┆ 28    │
│ Up    ┆ IL      ┆ 4.7        ┆ 33    │
└───────┴─────────┴────────────┴───────┘

ratings.unique()

shape: (10, 4)

Movie	Theatre	Avg_Rating	Count
str	str	f64	i64
"Cars"	"ND"	4.3	29
"Up"	"IL"	4.7	33
"Cars"	"NE"	4.5	28
"IT"	"ME"	4.4	27
"IT"	"SD"	4.7	28
"ET"	"IL"	4.6	26
"Up"	"NE"	4.8	31
"ET"	"IL"	4.9	26
"Cars"	"NE"	4.6	28
"Cars"	"NE"	4.5	30

4.2.4.4. Duplicated#

ratings

shape: (10, 4)

Movie	Theatre	Avg_Rating	Count
str	str	f64	i64
"Cars"	"NE"	4.5	30
"IT"	"ME"	4.4	27
"ET"	"IL"	4.6	26
"Cars"	"ND"	4.3	29
"Up"	"NE"	4.8	31
"IT"	"SD"	4.7	28
"Cars"	"NE"	4.5	28
"ET"	"IL"	4.9	26
"Up"	"IL"	4.7	33
"Cars"	"NE"	4.6	28

Filter unique rows

ratings.unique()

shape: (10, 4)

Movie	Theatre	Avg_Rating	Count
str	str	f64	i64
"Cars"	"NE"	4.5	30
"IT"	"ME"	4.4	27
"Cars"	"NE"	4.5	28
"ET"	"IL"	4.6	26
"Up"	"NE"	4.8	31
"Cars"	"NE"	4.6	28
"Up"	"IL"	4.7	33
"IT"	"SD"	4.7	28
"ET"	"IL"	4.9	26
"Cars"	"ND"	4.3	29

Detect duplicated rows

result = ratings.filter(pl.struct("Movie", "Theatre").is_duplicated())
print(result)

shape: (5, 4)
┌───────┬─────────┬────────────┬───────┐
│ Movie ┆ Theatre ┆ Avg_Rating ┆ Count │
│ ---   ┆ ---     ┆ ---        ┆ ---   │
│ str   ┆ str     ┆ f64        ┆ i64   │
╞═══════╪═════════╪════════════╪═══════╡
│ Cars  ┆ NE      ┆ 4.5        ┆ 30    │
│ ET    ┆ IL      ┆ 4.6        ┆ 26    │
│ Cars  ┆ NE      ┆ 4.5        ┆ 28    │
│ ET    ┆ IL      ┆ 4.9        ┆ 26    │
│ Cars  ┆ NE      ┆ 4.6        ┆ 28    │
└───────┴─────────┴────────────┴───────┘

Detect unique rows

result = ratings.filter(pl.struct("Movie", "Theatre").is_unique())
print(result)

shape: (5, 4)
┌───────┬─────────┬────────────┬───────┐
│ Movie ┆ Theatre ┆ Avg_Rating ┆ Count │
│ ---   ┆ ---     ┆ ---        ┆ ---   │
│ str   ┆ str     ┆ f64        ┆ i64   │
╞═══════╪═════════╪════════════╪═══════╡
│ IT    ┆ ME      ┆ 4.4        ┆ 27    │
│ Cars  ┆ ND      ┆ 4.3        ┆ 29    │
│ Up    ┆ NE      ┆ 4.8        ┆ 31    │
│ IT    ┆ SD      ┆ 4.7        ┆ 28    │
│ Up    ┆ IL      ┆ 4.7        ┆ 33    │
└───────┴─────────┴────────────┴───────┘

4.2.4.5. Missing#

In Polars, missing data is represented by the value null. This missing value null is used for all data types, including numerical types.

import polars as pl

df = pl.DataFrame(
    {
        "value": [1, None],
    },
)
print(df)

shape: (2, 1)
┌───────┐
│ value │
│ ---   │
│ i64   │
╞═══════╡
│ 1     │
│ null  │
└───────┘

4.2.4.5.1. Count null#

null_count_df = df.null_count()
print(null_count_df)

shape: (1, 1)
┌───────┐
│ value │
│ ---   │
│ u32   │
╞═══════╡
│ 1     │
└───────┘

4.2.4.5.2. Check null#

is_null_series = df.select(
    pl.col("value").is_null(),
)
print(is_null_series)

shape: (2, 1)
┌───────┐
│ value │
│ ---   │
│ bool  │
╞═══════╡
│ false │
│ true  │
└───────┘

4.2.4.5.3. Fill null#

df = pl.DataFrame(
    {
        "col1": [0.5, 1, 1.5, 2, 2.5],
        "col2": [1, None, 3, None, 5],
    },
)
fill_df = df.with_columns(
    pl.col("col2").fill_null(3).alias("fill_by_literal"),
    pl.col("col2")
    .fill_null((2 * pl.col("col1")).cast(pl.Int64))
    .alias("fill_by_other_col"),
    pl.col("col2").fill_null(strategy="forward").alias("fill_forward"),
    pl.col("col2").fill_null(strategy="backward").alias("fill_backward"),
    pl.col("col2").interpolate().alias("fill_by_interpolation"),
)
print(fill_df)

shape: (5, 7)
┌──────┬──────┬─────────────────┬─────────────────┬──────────────┬───────────────┬─────────────────┐
│ col1 ┆ col2 ┆ fill_by_literal ┆ fill_by_other_c ┆ fill_forward ┆ fill_backward ┆ fill_by_interpo │
│ ---  ┆ ---  ┆ ---             ┆ ol              ┆ ---          ┆ ---           ┆ lation          │
│ f64  ┆ i64  ┆ i64             ┆ ---             ┆ i64          ┆ i64           ┆ ---             │
│      ┆      ┆                 ┆ i64             ┆              ┆               ┆ f64             │
╞══════╪══════╪═════════════════╪═════════════════╪══════════════╪═══════════════╪═════════════════╡
│ 0.5  ┆ 1    ┆ 1               ┆ 1               ┆ 1            ┆ 1             ┆ 1.0             │
│ 1.0  ┆ null ┆ 3               ┆ 2               ┆ 1            ┆ 3             ┆ 2.0             │
│ 1.5  ┆ 3    ┆ 3               ┆ 3               ┆ 3            ┆ 3             ┆ 3.0             │
│ 2.0  ┆ null ┆ 3               ┆ 4               ┆ 3            ┆ 5             ┆ 4.0             │
│ 2.5  ┆ 5    ┆ 5               ┆ 5               ┆ 5            ┆ 5             ┆ 5.0             │
└──────┴──────┴─────────────────┴─────────────────┴──────────────┴───────────────┴─────────────────┘

4.2.4.5.4. Drop null#

df.drop_nulls("col2")

shape: (3, 2)

col1	col2
f64	i64
0.5	1
1.5	3
2.5	5

4.2.4.5.5. NaN value#

NaN (Not a Number) values are considered to be a type of floating point data and are not considered to be missing data (null) in Polars:

NaN values are not counted with the function null_count
NaN values are filled when you use the specialised function fill_nan method but are not filled with the function fill_null.
Polars has the functions is_nan and fill_nan, which work in a similar way to the functions is_null and fill_null, but only use for NaN value
Các phương thức tính toán (mean, sum, …) sẽ bỏ qua giá trị missing null nhưng sẽ bao gồm cả giá trị NaN. Do đó, để tránh ảnh hưởng đến kết quả bởi giá trị NaN, ta cần fill NaN bởi null trước khi tính toán.

import numpy as np

print("DF generate NaN value")
df = pl.DataFrame(
    {
        "dividend": [1, 0, -1],
        "divisor": [1, 0, -1],
    }
)
result = df.select(pl.col("dividend") / pl.col("divisor"))

print(result, end="\n--------------\n")

print("DF with NaN value")
nan_df = pl.DataFrame(
    {
        "raw_value_with_NaN": [1.0, np.nan, float("nan"), 3.0],
    },
)
print(nan_df, end="\n--------------\n")

DF generate NaN value
shape: (3, 1)
┌──────────┐
│ dividend │
│ ---      │
│ f64      │
╞══════════╡
│ 1.0      │
│ NaN      │
│ 1.0      │
└──────────┘
--------------
DF with NaN value
shape: (4, 1)
┌────────────────────┐
│ raw_value_with_NaN │
│ ---                │
│ f64                │
╞════════════════════╡
│ 1.0                │
│ NaN                │
│ NaN                │
│ 3.0                │
└────────────────────┘
--------------

print("Handling NaN value")
mean_nan_df = nan_df.with_columns(
    pl.col("raw_value_with_NaN")
    .fill_nan(None)
    .alias("replaced_by_missing_null"),
).select(
    pl.all().mean().name.suffix("_mean"),
    pl.all().sum().name.suffix("_sum"),
)
mean_nan_df

Handling NaN value

shape: (1, 4)

raw_value_with_NaN_mean	replaced_by_missing_null_mean	raw_value_with_NaN_sum	replaced_by_missing_null_sum
f64	f64	f64	f64
NaN	2.0	NaN	4.0

4.2.5. Dataframe Manipulation#

4.2.5.1. Replace series#

zip_with: Where mask evaluates true, take values from self. Where mask evaluates false, take values from other.

s1 = pl.Series([1, 2, 3, 4, 5])
s2 = pl.Series([5, 4, 3, 2, 1])
s1.zip_with(s1 < s2, s2)

shape: (5,)


i64
1
2
3
2
1

4.2.5.2. Aggregation#

import polars as pl

url = "https://theunitedstates.io/congress-legislators/legislators-historical.csv"

schema_overrides = {
    "first_name": pl.Categorical,
    "gender": pl.Categorical,
    "type": pl.Categorical,
    "state": pl.Categorical,
    "party": pl.Categorical,
}

dataset = pl.read_csv(url, schema_overrides=schema_overrides).with_columns(
    pl.col("birthday").str.to_date(strict=False)
)

dataset.head()

shape: (5, 36)

last_name	first_name	middle_name	suffix	nickname	full_name	birthday	gender	type	state	district	senate_class	party	url	address	phone	contact_form	rss_url	twitter	twitter_id	facebook	youtube	youtube_id	mastodon	bioguide_id	thomas_id	opensecrets_id	lis_id	fec_ids	cspan_id	govtrack_id	votesmart_id	ballotpedia_id	washington_post_id	icpsr_id	wikipedia_id
str	cat	str	str	str	str	date	cat	cat	cat	i64	i64	cat	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	i64	str	str	str	i64	str
"Bassett"	"Richard"	null	null	null	null	1745-04-02	"M"	"sen"	"DE"	null	2	"Anti-Administration"	null	null	null	null	null	null	null	null	null	null	null	"B000226"	null	null	null	null	null	401222	null	null	null	507	"Richard Bassett (Delaware poli…
"Bland"	"Theodorick"	null	null	null	null	1742-03-21	"M"	"rep"	"VA"	9	null	null	null	null	null	null	null	null	null	null	null	null	null	"B000546"	null	null	null	null	null	401521	null	null	null	786	"Theodorick Bland (congressman)"
"Burke"	"Aedanus"	null	null	null	null	1743-06-16	"M"	"rep"	"SC"	2	null	null	null	null	null	null	null	null	null	null	null	null	null	"B001086"	null	null	null	null	null	402032	null	null	null	1260	"Aedanus Burke"
"Carroll"	"Daniel"	null	null	null	null	1730-07-22	"M"	"rep"	"MD"	6	null	null	null	null	null	null	null	null	null	null	null	null	null	"C000187"	null	null	null	null	null	402334	null	null	null	1538	"Daniel Carroll"
"Clymer"	"George"	null	null	null	null	1739-03-16	"M"	"rep"	"PA"	-1	null	null	null	null	null	null	null	null	null	null	null	null	null	"C000538"	null	null	null	null	null	402671	null	null	null	1859	"George Clymer"

4.2.5.2.1. `group_by`#

After using group_by we use agg to apply aggregating expressions to the groups

from datetime import date


def compute_age():
    return date.today().year - pl.col("birthday").dt.year()


def avg_birthday(gender: str) -> pl.Expr:
    return (
        compute_age()
        .filter(pl.col("gender") == gender)
        .mean()
        .alias(f"avg {gender} birthday")
    )


def get_name() -> pl.Expr:
    return pl.col("first_name") + pl.lit(" ") + pl.col("last_name")


q = (
    dataset.lazy()
    .filter(pl.col("type").is_not_null())  # filter dataset before group by
    .sort("birthday", descending=True)  # sort dataset before aggregate
    .group_by(
        "first_name",
        pl.col("birthday").dt.year().alias("birthyear"),
    )
    .agg(
        # count the number of rows in the group (which means we count how many people in the data set have each unique first name)
        pl.len(),
        # combine the values of the column “gender” into a list by referring the column but omitting an aggregate function
        pl.col("gender"),
        # get the first value of the column “last_name” within the group
        pl.first("last_name"),  # Short for `pl.col("last_name").first()`
        # Expression expansion
        pl.col("govtrack_id", "icpsr_id").count().name.prefix("count_"),
        # (Conditional expression) how many delegates of a state are “Pro” administration
        (pl.col("party") == "Anti-Administration").sum().alias("anti"),
        # filter group but dont need to filter dataset bởi vì có thể ảnh hưởng đến các expression khác
        avg_birthday("M"),
        avg_birthday("F"),
        # after sort by birthday, we get the youngest and oldest
        get_name().first().alias("youngest"),
        get_name().last().alias("oldest"),
        # sort in group (not use sorted in dataframe)
        get_name().sort().first().alias("alphabetical_first"),
        # sort by another columns
        pl.col("gender")
        .sort_by(get_name())
        .first()
        .alias("sort_by_another_column"),
    )
    .filter(pl.col("birthyear").is_not_null())  # filter dataset after group by
    .sort("len", descending=True)  # immediately sort the result
    .limit(5)  # limit it to the top five rows
)

df = q.collect()
df

shape: (5, 14)

first_name	birthyear	len	gender	last_name	count_govtrack_id	count_icpsr_id	anti	avg M birthday	avg F birthday	youngest	oldest	alphabetical_first	sort_by_another_column
cat	i32	u32	list[cat]	str	u32	u32	u32	f64	f64	str	str	str	cat
"John"	1826	19	["M", "M", … "M"]	"Ferdon"	19	19	0	198.0	null	"John Ferdon"	"John Eden"	"John Burch"	"M"
"John"	1825	16	["M", "M", … "M"]	"Hungerford"	16	16	0	199.0	null	"John Hungerford"	"John Hoge"	"John Atkins"	"M"
"John"	1831	16	["M", "M", … "M"]	"McCormick"	16	16	0	193.0	null	"John McCormick"	"John Clark"	"John Arnot"	"M"
"John"	1835	15	["M", "M", … "M"]	"Coghlan"	15	14	0	189.0	null	"John Coghlan"	"John Gilfillan"	"John Brown"	"M"
"John"	1845	14	["M", "M", … "M"]	"Tarsney"	14	14	0	179.0	null	"John Tarsney"	"John Smith"	"John Allen"	"M"

the groups of that column as lists

For multi-label of grouper and multi expressions each group

result = df.group_by(
    (pl.col("birthdate").dt.year() // 10 * 10).alias(
        "decade"
    ),  # group by label 1
    (pl.col("height") < 1.7).alias("short?"),  # group by label 2
).agg(
    pl.len(),
    pl.col("height").max().alias("tallest"),
    pl.col("weight", "height").mean().name.prefix("avg_"),
)
print(result)

shape: (3, 6)
┌────────┬────────┬─────┬─────────┬────────────┬────────────┐
│ decade ┆ short? ┆ len ┆ tallest ┆ avg_weight ┆ avg_height │
│ ---    ┆ ---    ┆ --- ┆ ---     ┆ ---        ┆ ---        │
│ i32    ┆ bool   ┆ u32 ┆ f64     ┆ f64        ┆ f64        │
╞════════╪════════╪═════╪═════════╪════════════╪════════════╡
│ 1980   ┆ true   ┆ 1   ┆ 1.65    ┆ 53.6       ┆ 1.65       │
│ 1990   ┆ true   ┆ 1   ┆ 1.56    ┆ 57.9       ┆ 1.56       │
│ 1980   ┆ false  ┆ 2   ┆ 1.77    ┆ 77.8       ┆ 1.76       │
└────────┴────────┴─────┴─────────┴────────────┴────────────┘

Custom Function in Group By

Python thường chậm hơn Rust. Ngoài chi phí chạy mã byte “chậm”, Python còn phải nằm trong giới hạn của Khóa phiên dịch toàn cầu (GIL). Điều này có nghĩa là nếu bạn sử dụng lambda hoặc hàm Python tùy chỉnh để áp dụng trong giai đoạn song song, tốc độ của Polars sẽ bị giới hạn khi chạy mã Python, ngăn không cho nhiều luồng thực thi hàm. Polars sẽ cố gắng song song hóa việc tính toán các hàm tổng hợp trên các nhóm, vì vậy bạn nên tránh sử dụng lambda và các hàm Python tùy chỉnh càng nhiều càng tốt. Thay vào đó, hãy cố gắng duy trì trong phạm vi API biểu thức Polars. Tuy nhiên, điều này không phải lúc nào cũng thực hiện được, vì vậy nếu muốn tìm hiểu thêm về cách sử dụng lambda, bạn có thể xem Custom Function

4.2.5.2.2. `group_by_dynamic`#

(See in Time Series Section)

4.2.5.3. Custom function#

(https://docs.pola.rs/user-guide/expressions/user-defined-python-functions/)

NOTE: Việc sử dụng custom function có thể ảnh hướng đến performance khi thực hiện tính toán song song của polars. Do đó, hãy cố gắng tận dụng các python API, thay vì là các function tuỳ chỉnh. Tuy nhiên trong TH cần thiết, ta có:

map_elements: Mỗi 1 giá trị sẽ chạy qua 1 function nào đó, và không liên quan đến giá trị khác trong seri
map_batches: Pass full seri vào function và trả ra 1 seri tương ứng là các value đã được biến đổi.

4.2.5.3.1. `map_elements`#

Mỗi 1 giá trị sẽ chạy qua 1 function nào đó, và không liên quan đến giá trị khác trong seri

Problem with map_elements:

Limited to individual items: Thông thường, bạn sẽ muốn có một phép tính cần thực hiện trên toàn bộ Chuỗi, thay vì từng mục riêng lẻ.
Performance overhead: Ngay cả khi bạn muốn xử lý từng mục riêng lẻ, việc gọi một hàm cho từng mục riêng lẻ vẫn chậm; tất cả các lệnh gọi hàm bổ sung đó đều bổ sung thêm rất nhiều chi phí.

import math

df = pl.DataFrame(
    {
        "keys": ["a", "a", "b", "b"],
        "values": [10, 7, 1, 23],
    }
)


def my_log(value):
    if (value % 2) == 0:
        return math.log(value)
    else:
        return value + 1000


out = df.select(pl.col("values").map_elements(my_log, return_dtype=pl.Float64))
print(out)

shape: (4, 1)
┌──────────┐
│ values   │
│ ---      │
│ f64      │
╞══════════╡
│ 2.302585 │
│ 1007.0   │
│ 1001.0   │
│ 1023.0   │
└──────────┘

4.2.5.3.2. `map_batches`#

Pass full seri vào function và trả ra 1 seri tương ứng là các value đã được biến đổi.

You can pass a return_dtype argument to map_batches if you want to override the inferred type.
- int -> Int64
- float -> Float64
- bool -> Boolean
- str -> String
- list[tp] -> List[tp] (where the inner type is inferred with the same rules)
- dict[str, [tp]] -> struct
- Any -> object (Prevent this at all times)

def diff_from_mean(series):
    # This will be very slow for non-trivial Series, since it's all Python
    # code:
    total = 0
    for value in series:
        total += value
    mean = total / len(series)
    return pl.Series([value - mean for value in series])


# Apply our custom function to a full Series with map_batches():
out = df.with_columns(pl.col("values").map_batches(diff_from_mean))
print("== select() with UDF ==")
print(out)

# Apply our custom function per group:
print("== group_by() with UDF ==")
out = df.group_by("keys").agg(pl.col("values").map_batches(diff_from_mean))
print(out)

# Apply our custom function per group and get original df
out2 = df.with_columns(
    pl.col("values")
    .map_batches(diff_from_mean)
    .over("keys")
    .alias("values_each_grp")
)
print(out2)

== select() with UDF ==
shape: (4, 2)
┌──────┬────────┐
│ keys ┆ values │
│ ---  ┆ ---    │
│ str  ┆ f64    │
╞══════╪════════╡
│ a    ┆ -0.25  │
│ a    ┆ -3.25  │
│ b    ┆ -9.25  │
│ b    ┆ 12.75  │
└──────┴────────┘
== group_by() with UDF ==
shape: (2, 2)
┌──────┬───────────────┐
│ keys ┆ values        │
│ ---  ┆ ---           │
│ str  ┆ list[f64]     │
╞══════╪═══════════════╡
│ b    ┆ [-11.0, 11.0] │
│ a    ┆ [1.5, -1.5]   │
└──────┴───────────────┘
shape: (4, 3)
┌──────┬────────┬─────────────────┐
│ keys ┆ values ┆ values_each_grp │
│ ---  ┆ ---    ┆ ---             │
│ str  ┆ i64    ┆ f64             │
╞══════╪════════╪═════════════════╡
│ a    ┆ 10     ┆ 1.5             │
│ a    ┆ 7      ┆ -1.5            │
│ b    ┆ 1      ┆ -11.0           │
│ b    ┆ 23     ┆ 11.0            │
└──────┴────────┴─────────────────┘

4.2.5.3.3. Optimize custom function#

Sử dụng các hàm được viết bằng compiled language
- For numeric: các hàm numpy, scipy
- Sử dụng numba để compiling lại function, đặc biệt có thể decorate bằng @guvectorize, cho phép tạo ra ufunc by compiling a Python function to fast machine code.
NOTE: Missing data is not allowed when calling generalized ufuncs

from numba import float64, guvectorize, int64


# This will be compiled to machine code, so it will be fast. The Series is
# converted to a NumPy array before being passed to the function. See the
# Numba documentation for more details:
# https://numba.readthedocs.io/en/stable/user/vectorize.html
@guvectorize([(int64[:], float64[:])], "(n)->(n)")
def diff_from_mean_numba(arr, result):
    total = 0
    for value in arr:
        total += value
    mean = total / len(arr)
    for i, value in enumerate(arr):
        result[i] = value - mean


out = df.select(pl.col("values").map_batches(diff_from_mean_numba))
print("== select() with UDF ==")
print(out)

out = df.group_by("keys").agg(
    pl.col("values").map_batches(diff_from_mean_numba)
)
print("== group_by() with UDF ==")
print(out)

== select() with UDF ==
shape: (4, 1)
┌────────┐
│ values │
│ ---    │
│ f64    │
╞════════╡
│ -0.25  │
│ -3.25  │
│ -9.25  │
│ 12.75  │
└────────┘
== group_by() with UDF ==
shape: (2, 2)
┌──────┬───────────────┐
│ keys ┆ values        │
│ ---  ┆ ---           │
│ str  ┆ list[f64]     │
╞══════╪═══════════════╡
│ a    ┆ [1.5, -1.5]   │
│ b    ┆ [-11.0, 11.0] │
└──────┴───────────────┘

Sử dụng is_elementwise=True argument in map_batches to stream results into the function, which means it might not get all values at once. Nhưng phải đảm bảo function được thực hiện trên từng element (thay vì cần dựa trên các elements khác nữa).

4.2.5.3.4. For multi-columns#

Combine multiple columns into a Struct, and then the function can extract the columns back out:

# Add two arrays together:
@guvectorize([(int64[:], int64[:], float64[:])], "(n),(n)->(n)")
def add(arr, arr2, result):
    for i in range(len(arr)):
        result[i] = arr[i] + arr2[i]


df3 = pl.DataFrame({"values1": [1, 2, 3], "values2": [10, 20, 30]})

out = df3.select(
    # Create a struct that has two columns in it:
    pl.struct(["values1", "values2"])
    # Pass the struct to a lambda that then passes the individual columns to
    # the add() function:
    .map_batches(
        lambda combined: add(
            combined.struct.field("values1"), combined.struct.field("values2")
        )
    )
    .alias("add_columns")
)
print(out)

shape: (3, 1)
┌─────────────┐
│ add_columns │
│ ---         │
│ f64         │
╞═════════════╡
│ 11.0        │
│ 22.0        │
│ 33.0        │
└─────────────┘

4.2.5.4. Window functions#

Là hàm cho phép tính toán trên từng group và trả kết quả trong df ban đầu thông qua method over giống như window để tính toán

Tham số:

mapping_strategy that determines how the results of the expression over the group are mapped back to the rows of the dataframe.
- mapping_strategy = 'group_to_rows': kết quả trả ra có thứ tự các dòng giống với original dataframe
- mapping_strategy = 'explode': Nếu không quan trọng thứ tự dòng trong original dataframe, nên sử dụng explode để cải thiện perform, khi đó các dòng cùng group sẽ đứng cạnh nhau.
- mapping_strategy = 'join': biểu diễn kết quả thành list các giá trị output trong group đó, và repeat lại cho các dòng trong cùng 1 group.

import polars as pl

types = "Grass Water Fire Normal Ground Electric Psychic Fighting Bug Steel Flying Dragon Dark Ghost Poison Rock Ice Fairy".split()
type_enum = pl.Enum(types)
# then let's load some csv data with information about pokemon
pokemon = pl.read_csv(
    "https://gist.githubusercontent.com/ritchie46/cac6b337ea52281aa23c049250a4ff03/raw/89a957ff3919d90e6ef2d34235e6bf22304f3366/pokemon.csv",
).cast({"Type 1": type_enum, "Type 2": type_enum})
print(pokemon.head())

shape: (5, 13)
┌─────┬───────────────────────┬────────┬────────┬───┬─────────┬───────┬────────────┬───────────┐
│ #   ┆ Name                  ┆ Type 1 ┆ Type 2 ┆ … ┆ Sp. Def ┆ Speed ┆ Generation ┆ Legendary │
│ --- ┆ ---                   ┆ ---    ┆ ---    ┆   ┆ ---     ┆ ---   ┆ ---        ┆ ---       │
│ i64 ┆ str                   ┆ enum   ┆ enum   ┆   ┆ i64     ┆ i64   ┆ i64        ┆ bool      │
╞═════╪═══════════════════════╪════════╪════════╪═══╪═════════╪═══════╪════════════╪═══════════╡
│ 1   ┆ Bulbasaur             ┆ Grass  ┆ Poison ┆ … ┆ 65      ┆ 45    ┆ 1          ┆ false     │
│ 2   ┆ Ivysaur               ┆ Grass  ┆ Poison ┆ … ┆ 80      ┆ 60    ┆ 1          ┆ false     │
│ 3   ┆ Venusaur              ┆ Grass  ┆ Poison ┆ … ┆ 100     ┆ 80    ┆ 1          ┆ false     │
│ 3   ┆ VenusaurMega Venusaur ┆ Grass  ┆ Poison ┆ … ┆ 120     ┆ 80    ┆ 1          ┆ false     │
│ 4   ┆ Charmander            ┆ Fire   ┆ null   ┆ … ┆ 50      ┆ 65    ┆ 1          ┆ false     │
└─────┴───────────────────────┴────────┴────────┴───┴─────────┴───────┴────────────┴───────────┘

# Rank by each group
result = pokemon.select(
    pl.col("Name", "Type 1"),
    pl.col("Speed")
    .rank("dense", descending=True)
    .over("Type 1")
    .alias("Speed rank"),
)

print(result)

shape: (163, 3)
┌───────────────────────┬─────────┬────────────┐
│ Name                  ┆ Type 1  ┆ Speed rank │
│ ---                   ┆ ---     ┆ ---        │
│ str                   ┆ enum    ┆ u32        │
╞═══════════════════════╪═════════╪════════════╡
│ Bulbasaur             ┆ Grass   ┆ 6          │
│ Ivysaur               ┆ Grass   ┆ 3          │
│ Venusaur              ┆ Grass   ┆ 1          │
│ VenusaurMega Venusaur ┆ Grass   ┆ 1          │
│ Charmander            ┆ Fire    ┆ 7          │
│ …                     ┆ …       ┆ …          │
│ Moltres               ┆ Fire    ┆ 5          │
│ Dratini               ┆ Dragon  ┆ 3          │
│ Dragonair             ┆ Dragon  ┆ 2          │
│ Dragonite             ┆ Dragon  ┆ 1          │
│ Mewtwo                ┆ Psychic ┆ 2          │
└───────────────────────┴─────────┴────────────┘

Over multi-group

result = pokemon.select(
    pl.col("Name", "Type 1", "Type 2"),
    pl.col("Speed")
    .rank("dense", descending=True)
    .over("Type 1", "Type 2")
    .alias("Speed rank"),
)

print(result)

shape: (163, 4)
┌───────────────────────┬─────────┬────────┬────────────┐
│ Name                  ┆ Type 1  ┆ Type 2 ┆ Speed rank │
│ ---                   ┆ ---     ┆ ---    ┆ ---        │
│ str                   ┆ enum    ┆ enum   ┆ u32        │
╞═══════════════════════╪═════════╪════════╪════════════╡
│ Bulbasaur             ┆ Grass   ┆ Poison ┆ 6          │
│ Ivysaur               ┆ Grass   ┆ Poison ┆ 3          │
│ Venusaur              ┆ Grass   ┆ Poison ┆ 1          │
│ VenusaurMega Venusaur ┆ Grass   ┆ Poison ┆ 1          │
│ Charmander            ┆ Fire    ┆ null   ┆ 7          │
│ …                     ┆ …       ┆ …      ┆ …          │
│ Moltres               ┆ Fire    ┆ Flying ┆ 2          │
│ Dratini               ┆ Dragon  ┆ null   ┆ 2          │
│ Dragonair             ┆ Dragon  ┆ null   ┆ 1          │
│ Dragonite             ┆ Dragon  ┆ Flying ┆ 1          │
│ Mewtwo                ┆ Psychic ┆ null   ┆ 2          │
└───────────────────────┴─────────┴────────┴────────────┘

Result is scalar value

result = pokemon.select(
    pl.col("Name", "Type 1", "Speed"),
    pl.col("Speed").mean().over(pl.col("Type 1")).alias("Mean speed in group"),
)

print(result)

shape: (163, 4)
┌───────────────────────┬─────────┬───────┬─────────────────────┐
│ Name                  ┆ Type 1  ┆ Speed ┆ Mean speed in group │
│ ---                   ┆ ---     ┆ ---   ┆ ---                 │
│ str                   ┆ enum    ┆ i64   ┆ f64                 │
╞═══════════════════════╪═════════╪═══════╪═════════════════════╡
│ Bulbasaur             ┆ Grass   ┆ 45    ┆ 54.230769           │
│ Ivysaur               ┆ Grass   ┆ 60    ┆ 54.230769           │
│ Venusaur              ┆ Grass   ┆ 80    ┆ 54.230769           │
│ VenusaurMega Venusaur ┆ Grass   ┆ 80    ┆ 54.230769           │
│ Charmander            ┆ Fire    ┆ 65    ┆ 86.285714           │
│ …                     ┆ …       ┆ …     ┆ …                   │
│ Moltres               ┆ Fire    ┆ 90    ┆ 86.285714           │
│ Dratini               ┆ Dragon  ┆ 50    ┆ 66.666667           │
│ Dragonair             ┆ Dragon  ┆ 70    ┆ 66.666667           │
│ Dragonite             ┆ Dragon  ┆ 80    ┆ 66.666667           │
│ Mewtwo                ┆ Psychic ┆ 130   ┆ 99.25               │
└───────────────────────┴─────────┴───────┴─────────────────────┘

result = pokemon.sort("Type 1").select(
    pl.col("Type 1").head(3).over("Type 1", mapping_strategy="explode"),
    pl.col("Name")
    .sort_by(pl.col("Speed"), descending=True)
    .head(3)
    .over("Type 1", mapping_strategy="explode")
    .alias("fastest/group"),
    pl.col("Name")
    .sort_by(pl.col("Attack"), descending=True)
    .head(3)
    .over("Type 1", mapping_strategy="explode")
    .alias("strongest/group"),
    pl.col("Name")
    .sort()
    .head(3)
    .over("Type 1", mapping_strategy="explode")
    .alias("sorted_by_alphabet"),
)
print(result)

shape: (43, 4)
┌────────┬───────────────────────┬───────────────────────┬─────────────────────────┐
│ Type 1 ┆ fastest/group         ┆ strongest/group       ┆ sorted_by_alphabet      │
│ ---    ┆ ---                   ┆ ---                   ┆ ---                     │
│ enum   ┆ str                   ┆ str                   ┆ str                     │
╞════════╪═══════════════════════╪═══════════════════════╪═════════════════════════╡
│ Grass  ┆ Venusaur              ┆ Victreebel            ┆ Bellsprout              │
│ Grass  ┆ VenusaurMega Venusaur ┆ VenusaurMega Venusaur ┆ Bulbasaur               │
│ Grass  ┆ Victreebel            ┆ Exeggutor             ┆ Exeggcute               │
│ Water  ┆ Starmie               ┆ GyaradosMega Gyarados ┆ Blastoise               │
│ Water  ┆ Tentacruel            ┆ Kingler               ┆ BlastoiseMega Blastoise │
│ …      ┆ …                     ┆ …                     ┆ …                       │
│ Rock   ┆ Kabutops              ┆ Kabutops              ┆ Geodude                 │
│ Ice    ┆ Jynx                  ┆ Articuno              ┆ Articuno                │
│ Ice    ┆ Articuno              ┆ Jynx                  ┆ Jynx                    │
│ Fairy  ┆ Clefable              ┆ Clefable              ┆ Clefable                │
│ Fairy  ┆ Clefairy              ┆ Clefairy              ┆ Clefairy                │
└────────┴───────────────────────┴───────────────────────┴─────────────────────────┘

4.2.5.5. Folds functions (across multi-columns)#

fold là một phương pháp được sử dụng để thực hiện các thao tác trên nhiều cột bằng cách áp dụng một hàm tùy chỉnh một cách tuần tự. Nó giúp kết hợp hoặc tổng hợp dữ liệu từ nhiều cột thành một kết quả duy nhất.

Cách hoạt động của fold:

fold bắt đầu với một giá trị khởi tạo (init value) acc và sau đó áp dụng một hàm kết hợp (binary function) function để xử lý dữ liệu từng cột theo cách tuần tự trên các cột exprs.
Giá trị khởi tạo sẽ được kết hợp với dữ liệu từ các cột exprs, và kết quả của bước trước đó sẽ được sử dụng làm đầu vào cho bước tiếp theo.

Tại sao phải dùng fold?

Xử lý nhiều cột linh hoạt: fold giúp bạn dễ dàng áp dụng một logic phức tạp trên nhiều cột mà không cần viết nhiều câu lệnh hoặc vòng lặp.
Hiệu năng cao: Vì Polars được thiết kế để tối ưu hóa hiệu năng, sử dụng fold sẽ nhanh hơn so với cách xử lý truyền thống như vòng lặp Python hoặc các thao tác pandas.
Tương thích với lazy execution: Khi dùng Polars ở chế độ lazy, fold giúp tận dụng các tối ưu hóa mà Polars cung cấp.
Tính tổng quát: fold rất mạnh mẽ vì nó cho phép định nghĩa bất kỳ phép toán tùy chỉnh nào (như tổng, tích, tìm giá trị lớn nhất/nhỏ nhất, v.v.).

Cú pháp:

pl.fold(init_value, lambda acc, col: <hàm tùy chỉnh>, exprs)

init_value: Giá trị khởi tạo.
lambda acc, col: Hàm tùy chỉnh, trong đó:
- acc là giá trị được tích lũy qua các bước.
- col là giá trị của cột hiện tại.
exprs: Danh sách các cột hoặc các biểu thức cần xử lý.

import operator
import polars as pl

df = pl.DataFrame(
    {
        "label": ["foo", "bar", "spam"],
        "a": [1, 2, 3],
        "b": [10, 20, 30],
    }
)

result = df.select(
    pl.fold(
        acc=pl.lit(0),
        function=operator.add,
        exprs=pl.col("a", "b"),
    ).alias("sum_fold"),
    pl.sum_horizontal(pl.col("a", "b")).alias("sum_horz"),
)

print(result)

shape: (3, 2)
┌──────────┬──────────┐
│ sum_fold ┆ sum_horz │
│ ---      ┆ ---      │
│ i64      ┆ i64      │
╞══════════╪══════════╡
│ 11       ┆ 11       │
│ 22       ┆ 22       │
│ 33       ┆ 33       │
└──────────┴──────────┘

Bài toán tìm giá trị lớn nhất giữa các cột

df = pl.DataFrame({"col1": [11, 2, 3], "col2": [4, 10, 6], "col3": [7, 8, 9]})

result = df.with_columns(
    pl.fold(
        acc=float("-inf"),
        function=lambda acc, col: pl.Series(
            "max", acc.zip_with(acc > col, col)
        ),
        exprs=[pl.col("col1"), pl.col("col2"), pl.col("col3")],
    ).alias("max_values")
)

print(result)

shape: (3, 4)
┌──────┬──────┬──────┬────────────┐
│ col1 ┆ col2 ┆ col3 ┆ max_values │
│ ---  ┆ ---  ┆ ---  ┆ ---        │
│ i64  ┆ i64  ┆ i64  ┆ f64        │
╞══════╪══════╪══════╪════════════╡
│ 11   ┆ 4    ┆ 7    ┆ 11.0       │
│ 2    ┆ 10   ┆ 8    ┆ 10.0       │
│ 3    ┆ 6    ┆ 9    ┆ 9.0        │
└──────┴──────┴──────┴────────────┘

4.2.5.6. Numpy function#

This means that if a function is not provided by Polars, we can use NumPy and we still have fast columnar operations through the NumPy API.

import polars as pl
import numpy as np

df = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

out = df.select(np.log(pl.all()).name.suffix("_log"))
print(out)

shape: (3, 2)
┌──────────┬──────────┐
│ a_log    ┆ b_log    │
│ ---      ┆ ---      │
│ f64      ┆ f64      │
╞══════════╪══════════╡
│ 0.0      ┆ 1.386294 │
│ 0.693147 ┆ 1.609438 │
│ 1.098612 ┆ 1.791759 │
└──────────┴──────────┘

4.2.5.7. Join#

Loại Join	Cú pháp	Mô tả
Equi inner join	`join(..., how="inner")`	Giữ lại các hàng khớp ở cả hai bảng (bên trái và bên phải).
Equi left outer join	`join(..., how="left")`	Giữ lại tất cả các hàng từ bảng bên trái và các hàng khớp từ bảng bên phải. Hàng không khớp ở bên trái sẽ có giá trị null ở cột bên phải.
Equi right outer join	`join(..., how="right")`	Giữ lại tất cả các hàng từ bảng bên phải và các hàng khớp từ bảng bên trái. Hàng không khớp ở bên phải sẽ có giá trị null ở cột bên trái.
Equi full join	`join(..., how="full")`	Giữ lại tất cả các hàng từ cả hai bảng, bất kể có khớp hay không. Hàng không khớp ở bảng này sẽ có giá trị null ở cột của bảng kia.
Equi semi join	`join(..., how="semi")`	Giữ lại các hàng từ bảng bên trái mà có khớp ở bảng bên phải.
Equi anti join	`join(..., how="anti")`	Giữ lại các hàng từ bảng bên trái mà không có khớp ở bảng bên phải.
Non-equi inner join	`join_where`	Tìm tất cả các cặp hàng từ bảng bên trái và bảng bên phải thỏa mãn điều kiện (predicate) cụ thể.
Asof join	`join_asof/join_asof_by`	Giống left outer join nhưng thay vì khớp chính xác, nó khớp với giá trị khóa gần nhất.
Cartesian product	`join(..., how="cross")`	Tính tích Descartes của hai bảng (kết hợp tất cả các hàng từ bảng này với tất cả các hàng từ bảng kia).

Equi join: Khớp các hàng dựa trên giá trị chính xác của khóa (key).
Non-equi join: Khớp các hàng dựa trên điều kiện tùy chỉnh (ví dụ: lớn hơn, nhỏ hơn…).
Asof join: Hữu ích khi làm việc với dữ liệu chuỗi thời gian, khớp các giá trị gần nhất theo thứ tự.
Cartesian product: Kết hợp tất cả các hàng của cả hai bảng, dẫn đến một bảng kết quả lớn. Thường ít được sử dụng do tốn tài nguyên.

4.2.5.7.1. Equi join (giống key merge)#

import polars as pl

# Tạo hai bảng DataFrame mẫu
df_left = pl.DataFrame({"id": [1, 2, 3], "value_left": ["A", "B", "C"]})

df_right = pl.DataFrame({"id": [2, 3, 4], "value_right": ["X", "Y", "Z"]})

# In các bảng
print("Left Table:")
print(df_left)
print("\nRight Table:")
print(df_right)

Left Table:
shape: (3, 2)
┌─────┬────────────┐
│ id  ┆ value_left │
│ --- ┆ ---        │
│ i64 ┆ str        │
╞═════╪════════════╡
│ 1   ┆ A          │
│ 2   ┆ B          │
│ 3   ┆ C          │
└─────┴────────────┘

Right Table:
shape: (3, 2)
┌─────┬─────────────┐
│ id  ┆ value_right │
│ --- ┆ ---         │
│ i64 ┆ str         │
╞═════╪═════════════╡
│ 2   ┆ X           │
│ 3   ┆ Y           │
│ 4   ┆ Z           │
└─────┴─────────────┘

# Equi semi join

result = df_left.join(df_right, on="id", how="semi")
print(result)

shape: (2, 2)
┌─────┬────────────┐
│ id  ┆ value_left │
│ --- ┆ ---        │
│ i64 ┆ str        │
╞═════╪════════════╡
│ 2   ┆ B          │
│ 3   ┆ C          │
└─────┴────────────┘

# Equi anti join

result = df_left.join(df_right, on="id", how="anti")

print(result)

shape: (1, 2)
┌─────┬────────────┐
│ id  ┆ value_left │
│ --- ┆ ---        │
│ i64 ┆ str        │
╞═════╪════════════╡
│ 1   ┆ A          │
└─────┴────────────┘

4.2.5.7.2. Non-equi inner join (key thoả mãn điều kiện)#

join_where (Non-equi inner join)

For example: join những thằng có khả năng mua item (điều kiện là cast của user phải lớn hơn giá của item)

# Non-equi inner join: join_where

df_left = pl.DataFrame({"item_id": [1, 2, 3], "price": [10, 25, 30]})

df_right = pl.DataFrame(
    {"user_id": ["A", "B", "C"], "cast_availabel": [5, 27, 35]}
)

result = df_left.join_where(
    df_right, pl.col("price") <= pl.col("cast_availabel")
)
print(result)

shape: (5, 4)
┌─────────┬───────┬─────────┬────────────────┐
│ item_id ┆ price ┆ user_id ┆ cast_availabel │
│ ---     ┆ ---   ┆ ---     ┆ ---            │
│ i64     ┆ i64   ┆ str     ┆ i64            │
╞═════════╪═══════╪═════════╪════════════════╡
│ 1       ┆ 10    ┆ B       ┆ 27             │
│ 1       ┆ 10    ┆ C       ┆ 35             │
│ 2       ┆ 25    ┆ B       ┆ 27             │
│ 2       ┆ 25    ┆ C       ┆ 35             │
│ 3       ┆ 30    ┆ C       ┆ 35             │
└─────────┴───────┴─────────┴────────────────┘

4.2.5.7.3. Asof join (key gần đúng, thay vì chính xác)#

Asof Join (Approximate Sorted Join) là một kỹ thuật rất hữu ích trong xử lý dữ liệu chuỗi thời gian hoặc dữ liệu liên tục, đặc biệt khi cần khớp các giá trị gần nhất thay vì khớp chính xác. Đây là một công cụ mạnh mẽ cho các ứng dụng sau:

Financial Data: Giả sử bạn có một bảng giá cổ phiếu và một bảng các sự kiện kinh tế, bạn có thể sử dụng asof join để lấy giá cổ phiếu gần nhất trước mỗi sự kiện kinh tế.
Log and Monitoring Analysis: Giả sử bạn có một bảng log lỗi hệ thống và một bảng ghi trạng thái máy chủ, bạn có thể dùng asof join để lấy trạng thái gần nhất tại thời điểm lỗi xảy ra.
User Behavior Analysis: bạn có dữ liệu clickstream (hành động người dùng trên website) và các mốc thời gian chạy chiến dịch quảng cáo, bạn có thể dùng asof join để xác định chiến dịch nào ảnh hưởng đến hành vi người dùng.
Healthcare and Patient Monitoring: Giả sử bạn có dữ liệu đo nhịp tim liên tục và danh sách các lần bệnh nhân dùng thuốc, bạn có thể sử dụng asof join để xem xét ảnh hưởng của thuốc lên nhịp tim.
Time Series in Manufacturing: Trong một dây chuyền sản xuất, bạn có dữ liệu từ các cảm biến và các bản ghi bảo trì. Asof join có thể giúp bạn kết hợp dữ liệu cảm biến với các hoạt động bảo trì gần nhất để phân tích nguyên nhân lỗi.

4.2.5.7.4. Cartesian product (cross join)#

4.2.5.8. Concat#

How:

vertical: mở rộng theo chiều dọc
horizontal: mở rộng theo chiều ngang
diagonal: mở rộng theo cả hai chiều, fill null bởi các giá trị không có

df_v1 = pl.DataFrame(
    {
        "a": [1],
        "b": [3],
    }
)
df_v2 = pl.DataFrame(
    {
        "a": [2],
        "b": [4],
    }
)
df_vertical_concat = pl.concat(
    [
        df_v1,
        df_v2,
    ],
    how="vertical",
)
print(df_vertical_concat)

shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 3   │
│ 2   ┆ 4   │
└─────┴─────┘

4.2.5.9. Pivot#

df = pl.DataFrame(
    {
        "foo": ["A", "A", "B", "B", "C"],
        "N": [1, 2, 2, 4, 2],
        "bar": ["k", "l", "m", "n", "o"],
    }
)
print(df)

shape: (5, 3)
┌─────┬─────┬─────┐
│ foo ┆ N   ┆ bar │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ A   ┆ 1   ┆ k   │
│ A   ┆ 2   ┆ l   │
│ B   ┆ 2   ┆ m   │
│ B   ┆ 4   ┆ n   │
│ C   ┆ 2   ┆ o   │
└─────┴─────┴─────┘

# eager mode
out = df.pivot("bar", index="foo", values="N", aggregate_function="first")
print(out)

shape: (3, 6)
┌─────┬──────┬──────┬──────┬──────┬──────┐
│ foo ┆ k    ┆ l    ┆ m    ┆ n    ┆ o    │
│ --- ┆ ---  ┆ ---  ┆ ---  ┆ ---  ┆ ---  │
│ str ┆ i64  ┆ i64  ┆ i64  ┆ i64  ┆ i64  │
╞═════╪══════╪══════╪══════╪══════╪══════╡
│ A   ┆ 1    ┆ 2    ┆ null ┆ null ┆ null │
│ B   ┆ null ┆ null ┆ 2    ┆ 4    ┆ null │
│ C   ┆ null ┆ null ┆ null ┆ null ┆ 2    │
└─────┴──────┴──────┴──────┴──────┴──────┘

4.2.5.10. Unpivot#

import polars as pl

df = pl.DataFrame(
    {
        "A": ["a", "b", "a"],
        "B": [1, 3, 5],
        "C": [10, 11, 12],
        "D": [2, 4, 6],
    }
)
print(df)

shape: (3, 4)
┌─────┬─────┬─────┬─────┐
│ A   ┆ B   ┆ C   ┆ D   │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╡
│ a   ┆ 1   ┆ 10  ┆ 2   │
│ b   ┆ 3   ┆ 11  ┆ 4   │
│ a   ┆ 5   ┆ 12  ┆ 6   │
└─────┴─────┴─────┴─────┘

out = df.unpivot(["C", "D"], index=["A", "B"])
print(out)

shape: (6, 4)
┌─────┬─────┬──────────┬───────┐
│ A   ┆ B   ┆ variable ┆ value │
│ --- ┆ --- ┆ ---      ┆ ---   │
│ str ┆ i64 ┆ str      ┆ i64   │
╞═════╪═════╪══════════╪═══════╡
│ a   ┆ 1   ┆ C        ┆ 10    │
│ b   ┆ 3   ┆ C        ┆ 11    │
│ a   ┆ 5   ┆ C        ┆ 12    │
│ a   ┆ 1   ┆ D        ┆ 2     │
│ b   ┆ 3   ┆ D        ┆ 4     │
│ a   ┆ 5   ┆ D        ┆ 6     │
└─────┴─────┴──────────┴───────┘

4.2.6. Time Series#

4.2.6.1. Parsing datetime#

# When read file
df = pl.read_csv("docs/assets/data/apple_stock.csv", try_parse_dates=True)
print(df)

# convert string column to datetime
df = df.with_columns(pl.col("Date").str.to_date("%Y-%m-%d"))

4.2.6.2. Filter#

# Equal datetime
filtered_df = df.filter(
    pl.col("Date") == datetime(1995, 10, 16),
)
print(filtered_df)

# Between range
filtered_range_df = df.filter(
    pl.col("Date").is_between(datetime(1995, 7, 1), datetime(1995, 11, 1)),
)
print(filtered_range_df)

4.2.6.3. Grouping#

Parameters for group_by_dynamic (detail)

every: indicates the interval of the window
period: indicates the duration of the window
offset: can be used to offset the start of the windows

annual_average_df = df.group_by_dynamic("Date", every="1y").agg(
    pl.col("Close").mean()
)

df_with_year = annual_average_df.with_columns(
    pl.col("Date").dt.year().alias("year")
)
print(df_with_year)

4.2.7. Lazy API#

Thay vì thực hiện ngay lập tức mỗi thao tác (như trong DataFrame thông thường), LazyFrame ghi nhận các thao tác thành một cây kế hoạch (query plan). Điều này mang lại một số lợi ích:

Tối ưu hóa toàn diện: Polars có thể phân tích toàn bộ cây kế hoạch để thực hiện các tối ưu hóa như loại bỏ thao tác không cần thiết, giảm số lượng đọc/ghi dữ liệu, và sắp xếp lại thứ tự các phép tính để tăng hiệu suất.
Hạn chế I/O: Lazy API giảm thiểu số lần truy cập dữ liệu, làm tăng tốc độ xử lý.
Xử lý dữ liệu lớn: Lazy API phù hợp để làm việc với dữ liệu lớn vì nó giảm tải bộ nhớ.
Bắt được lỗi liên quan đến schema trong quá trình precessing thay vì phải load hết toàn bộ data

LazyFrame vs DataFrame

DataFrame (Eager API): Thực hiện các thao tác ngay lập tức. Tốt cho các bài toán nhỏ và dễ thử nghiệm.
LazyFrame (Lazy API): Chỉ ghi nhận thao tác và thực thi chúng khi bạn gọi collect().

Tối ưu hóa của Lazy API

Polars sử dụng nhiều kỹ thuật tối ưu hóa như:

Predicate Pushdown: Đẩy các phép lọc xuống sớm để giảm lượng dữ liệu xử lý.
Projection Pushdown: Chỉ đọc các cột cần thiết thay vì đọc toàn bộ dữ liệu.
Expression Simplification: Tái cấu trúc các biểu thức phức tạp để giảm chi phí tính toán.
Joins Optimization: Tối ưu các phép nối (join) dữ liệu.

Apply Lazy API

read file: pl.scan_
convert exist dataframe: .lazy()
execute query plan: .collect()

import polars as pl

q1 = (
    pl.scan_csv(
        r"coding\learning\contents\3_programming_and_frameworks\python\python_frameworks\polars\data\data.csv"
    )
    .with_columns(pl.col("symbol").str.to_uppercase())
    .filter(pl.col("capital") > 100_000_000_000)
)

# show query graph

# check: optimized = False
q1.show_graph(optimized=False)

# optimal query plan
q1.show_graph()

Polars

Contents

4.2. Polars#

4.2.1. Overview#

4.2.2. Data type and structures#

4.2.2.1. Data-types#

4.2.2.2. Series#

4.2.2.3. Dataframe#

4.2.2.4. schema#

4.2.3. I/O#

4.2.3.1. CSV#

4.2.3.2. Parquet (prefer than CSV)#

4.2.3.2.1. Single parquet file#

4.2.3.2.2. Hive partitioned data#

4.2.3.3. Json#

4.2.3.4. Database#

4.2.3.4.1. Engine#

4.2.3.4.2. Read#

4.2.3.4.3. Write#

4.2.3.5. Cloud Storage#

4.2.3.5.1. Read#

4.2.3.5.2. Write#

4.2.3.6. Bigquery#

4.2.3.6.1. Read#

4.2.3.6.2. Write#

4.2.3.7. Hugging Face#

4.2.4. Expression#

4.2.4.1. Selecting and filtering#

4.2.4.1.1. select and create#

4.2.4.1.1.1. by datatype#

4.2.4.1.1.2. by pattern matching in columns name#

4.2.4.1.1.3. all columns#

4.2.4.1.1.4. exclude columns#

4.2.4.1.1.5. with_columns#

4.2.4.1.1.6. create column by group by expression#

4.2.4.1.2. rename#

4.2.4.1.2.1. alias#

4.2.4.1.2.2. prefixing and suffixing#

4.2.4.1.2.3. custom function#

4.2.4.1.3. filter#

4.2.4.2. Basic operation#

4.2.4.2.1. Comparisons#

4.2.4.2.2. Unique values#

4.2.4.2.3. Conditional (when - then - else)#

4.2.4.2.4. Expression expansion#

4.2.4.2.4.1. How to select columns#

4.2.4.2.4.1.1. by columns name#

4.2.4.2.4.1.2. by datatype#

4.2.4.2.4.1.3. by pattern matching in columns name#

4.2.4.2.4.2. Generate expressions#

4.2.4.3. Datatype Operations#

4.2.4.3.1. Casting#

4.2.4.3.1.1. cast#

4.2.4.3.1.2. downcasting numerical#

4.2.4.3.1.3. strings to numeric#

4.2.4.3.1.4. booleans#

4.2.4.3.2. Number#

4.2.4.3.3. Datetime#

4.2.4.3.4. String#

4.2.4.3.5. Lists and array#

4.2.4.3.5.1. List#

4.2.4.3.5.2. Array (prefer)#

4.2.4.3.6. Categorical and enums#

4.2.4.3.7. Structs#

4.2.4.3.7.1. unnest#

4.2.4.3.7.2. field: extract field#

4.2.4.3.7.3. rename_fields#

4.2.4.3.7.4. detect duplicated / unique rows#

4.2.4.4. Duplicated#

4.2.4.5. Missing#

4.2.4.5.1. Count null#

4.2.4.5.2. Check null#

4.2.4.5.3. Fill null#

4.2.4.5.4. Drop null#

4.2.4.5.5. NaN value#

4.2.5. Dataframe Manipulation#

4.2.5.1. Replace series#

4.2.5.2. Aggregation#

4.2.5.2.1. group_by#

4.2.5.2.2. group_by_dynamic#

4.2.4.1.1.5. `with_columns`#

4.2.4.1.2.1. `alias`#

4.2.4.3.1.1. `cast`#

4.2.4.3.7.1. `unnest`#

4.2.4.3.7.2. `field`: extract field#

4.2.4.3.7.3. `rename_fields`#

4.2.5.2.1. `group_by`#

4.2.5.2.2. `group_by_dynamic`#

4.2.5.3.1. `map_elements`#

4.2.5.3.2. `map_batches`#