Overview
Tutorials
Using Patito for DataFrame Validation
Have you ever found yourself relying on some column of an external data source being non-nullable only to find out much later that the assumption proved to be false? What about discovering that a production machine learning model has had a huge performance regression because a new category was introduced to a categorical column? You might not have encountered any of these exact scenarios, but perhaps similar ones; they illustrate the necessity of validating your data.
A machine learning model might ingest data from a production system that changes frequently, and the author of the model wants to be notified if certain assumptions no longer hold. Or perhaps a data analyst might rely on a pre-processing step that removes all discontinued products from a data set, and this should be validated and communicated clearly in their Jupyter notebook.
Patito is a dataframe validation library built on top of polars, initially open sourced by Oda, which tries to solve this problem. The polars dataframe library has lately been making the rounds among data scientists, and for good reason. It can be considered a total replacement for the well-known pandas library, initially tempting you with its advertised top-notch performance, but then sealing the deal with its intuitive and expressive API. The exact virtues of polars are a topic for another article; suffice it to say that it comes highly recommended and has some great introductory documentation.
The core idea of Patito is that you should define a so-called “model” for each of your data sources. A model is a declarative python class which describes the general properties of a tabular data set: the names of all the columns, their types, value bounds, and so on… These models can then be used to validate the data sources when they are ingested into your project’s data pipeline. In turn, your models become a trustworthy, centralized catalog of all the core facts about your data, facts you can safely rely upon during development.
Enough chit chat, let’s get into some technical details! Let’s say that your project keeps track of products, and that these products have four core properties:
- A unique, numeric identifier
- A name
- An ideal temperature zone of either "dry", "cold", or "frozen"
- A product demand given as a percentage of the total sales forecast for the next week
In tabular form the data might look something like this:

| product_id | name | temperature_zone | demand_percentage |
|---|---|---|---|
| 1 | Apple | dry | 0.23% |
| 2 | Milk | cold | 0.61% |
| 3 | Ice cubes | frozen | 0.01% |
| … | … | … | … |
We now start to model the restrictions we want to put upon our data. In Patito this is done by defining a class which inherits from patito.Model, a class which has one field annotation for each column in the data. These models should preferably be defined in a centralized place, conventionally <YOUR_PROJECT_NAME>/models.py, where you can easily find and refer to them.
from typing import Literal

import patito as pt

class Product(pt.Model):
    product_id: int
    name: str
    temperature_zone: Literal["dry", "cold", "frozen"]
    demand_percentage: float
Here we have used typing.Literal from the standard library in order to specify that temperature_zone is not only a str, but specifically one of the literal values "dry", "cold", or "frozen".
You can now use this class to represent a single specific instance of a product:
>>> Product(product_id=1, name="Apple", temperature_zone="dry", demand_percentage=0.23)
Product(product_id=1, name='Apple', temperature_zone='dry', demand_percentage=0.23)
The class also automatically offers input data validation, for instance if you provide an invalid value for temperature_zone.
>>> Product(product_id=64, name="Pizza", temperature_zone="oven", demand_percentage=0.12)
ValidationError: 1 validation error for Product
temperature_zone
unexpected value; permitted: 'dry', 'cold', 'frozen' (type=value_error.const; given=oven; permitted=('dry', 'cold', 'frozen'))
A discerning reader might notice that this looks suspiciously like pydantic’s data models, and that is in fact because it is!
Patito’s model class is built upon pydantic.BaseModel and therefore offers all of pydantic’s functionality. But the difference is that Patito extends pydantic’s validation of singular object instances to collections of the same objects represented as dataframes.
Let’s take the data presented in the table above and represent it as a polars dataframe.
>>> import polars as pl
>>> product_df = pl.DataFrame(
... {
... "product_id": [1, 2, 3],
... "name": ["Apple", "Milk", "Ice cubes"],
... "temperature_zone": ["dry", "cold", "frozen"],
... "demand_percentage": [0.23, 0.61, 0.01],
... }
... )
We can now use Product.validate() in order to validate the content of our dataframe.
>>> from project.models import Product
>>> Product.validate(product_df)
None
Well, that wasn’t really interesting…
The validate method simply returns None if no errors are found. It is intended as a guard statement to be put before any logic that requires the data to be valid. That way you can rely on the data being compatible with the given model schema; otherwise the .validate() method would have raised an exception.
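For example, a downstream transformation might start with such a guard. The following sketch is illustrative; the function name and derived column are not part of Patito:

import polars as pl

def with_demand_share(product_df: pl.DataFrame) -> pl.DataFrame:
    # Guard: raises a validation error if the dataframe violates the model schema.
    Product.validate(product_df)
    # From here on we can safely rely on the schema guaranteed by the model.
    return product_df.with_columns(
        (pl.col("demand_percentage") / pl.col("demand_percentage").sum()).alias("demand_share")
    )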
Let’s try this with invalid data, setting the temperature zone of one of the products to "oven".
>>> invalid_product_df = pl.DataFrame(
... {
... "product_id": [64, 64],
... "name": ["Pizza", "Cereal"],
... "temperature_zone": ["oven", "dry"],
... "demand_percentage": [0.07, 0.16],
... }
... )
>>> Product.validate(invalid_product_df)
ValidationError: 1 validation error for Product
temperature_zone
Rows with invalid values: {'oven'}. (type=value_error.rowvalue)
Now we’re talking! Patito allows you to define a single class which validates both singular object instances and dataframe collections without code duplication!
Patito tries to rely as much as possible on pydantic’s existing modelling concepts, naturally extending them to the dataframe domain where suitable.
Model fields annotated with str will map to dataframe columns stored as pl.Utf8, int as pl.Int8/pl.Int16/…/pl.Int64, and so on. Field types wrapped in Optional allow null values, while bare types do not.
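As a minimal illustration (the Inventory model and its fields are hypothetical):

from typing import Optional

import patito as pt

class Inventory(pt.Model):
    product_id: int               # bare type: null values fail validation
    supplier_note: Optional[str]  # Optional: null values are allowed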
But certain modelling concepts are not applicable in the context of singular object instances, and are therefore necessarily not part of pydantic’s API.
Take product_id as an example: you would expect this column to be unique across all products, and duplicates should therefore be considered invalid. In pydantic you have no way to express this, but Patito expands upon pydantic in various ways in order to represent dataframe-related constraints. One of these extensions is the unique parameter accepted by patito.Field, which allows you to specify that all the values of a given column should be unique.
class Product(pt.Model):
    product_id: int = pt.Field(unique=True)
    name: str
    temperature_zone: Literal["dry", "cold", "frozen"]
    demand_percentage: float
The patito.Field class accepts the same parameters as pydantic.Field, but adds additional dataframe-specific constraints documented here. In those cases where Patito’s built-in constraints do not suffice, you can specify arbitrary constraints in the form of polars expressions which must evaluate to True for each row in order for the dataframe to be considered valid. Let’s say we want to make sure that demand_percentage sums up to 100% for the entire dataframe, otherwise we might be missing one or more products. We can do this by passing the constraints parameter to patito.Field.
class Product(pt.Model):
    product_id: int = pt.Field(unique=True)
    name: str
    temperature_zone: Literal["dry", "cold", "frozen"]
    demand_percentage: float = pt.Field(constraints=pt.field.sum() == 100.0)
Here patito.field is an alias for the field’s own column and is automatically replaced with polars.col("demand_percentage") before validation.
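In other words, the constraint above should be equivalent to spelling out the column reference explicitly (the ExplicitProduct class below is purely illustrative):

import patito as pt
import polars as pl

class ExplicitProduct(pt.Model):
    # Same constraint as constraints=pt.field.sum() == 100.0
    demand_percentage: float = pt.Field(
        constraints=pl.col("demand_percentage").sum() == 100.0
    )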
If we now use this improved class to validate invalid_product_df, we should detect new errors.
>>> Product.validate(invalid_product_df)
ValidationError: 3 validation errors for Product
product_id
2 rows with duplicated values. (type=value_error.rowvalue)
temperature_zone
Rows with invalid values: {'oven'}. (type=value_error.rowvalue)
demand_percentage
2 rows does not match custom constraints. (type=value_error.rowvalue)
Patito has now detected that product_id contains duplicates and that demand_percentage does not sum up to 100%!
Several more properties and methods are available on patito.Model, as outlined in the API Reference below; you can for instance generate valid mock dataframes for testing purposes with Model.examples(). You can also dynamically construct models with methods such as Model.select(), Model.prefix(), and Model.join().
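A quick sketch of both ideas (see the API reference below for the exact behavior):

# Two rows of mock data; unspecified columns are filled with dummy values.
mock_df = Product.examples({"product_id": [1, 2]})

# Dynamically derive a new model, e.g. with prefixed field names.
PrefixedProduct = Product.prefix("product_")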
Integrating Polars With an SQL Database Using Efficient Caching
Many data-driven applications involve data that must be retrieved from a database, either a remote data warehouse solution such as Databricks or Snowflake, or a local database such as DuckDB or SQLite3. Patito offers a database-agnostic API to query such sources, returning the result as a polars DataFrame, while offering intelligent query caching on top. By the end of this tutorial you will be able to write data ingestion logic that looks like this:
from typing import Optional

from db import my_database

@my_database.as_query(cache=True)
def users(country: Optional[str] = None):
    query = "select * from users"
    if country:
        query += f" where country = '{country}'"
    return query
The wrapped users function will now construct, execute, cache, and return the results of the SQL query in the form of a polars.DataFrame object. The cache for users(country="NO") will be stored independently from users(country="US"), and so on.
This, along with other functionality that will be explained later, allows you to integrate your local data pipeline with your remote database effortlessly.
The following tutorial will explain how to construct a patito.Database object which provides Patito with the required context to execute SQL queries against your database of choice. In turn, patito.Database.query() can be used to execute SQL query strings directly, and patito.Database.as_query() can be used to wrap functions that produce SQL query strings. The latter decorator turns functions into patito.Database.Query objects which act very much like the original functions, only that they actually execute the constructed queries and return the results as DataFrames when invoked. The Query object also has additional methods for managing the query caches and more.
This tutorial will take a relatively opinionated approach to how to organize your code. For less opinionated documentation, see the referenced classes and methods above.
Setup
The following tutorial will depend on patito having been installed with the caching extension group:
pip install patito[caching]
Code samples in this tutorial will use DuckDB, but you should be able to replace it with your database of choice as you follow along:
pip install duckdb
Construct a patito.Database Object
To begin we need to provide Patito with the tools required to query your database of choice. First we must implement a query handler: a function that takes a query string as its first argument, executes the query, and returns the result of the query in the form of an Arrow table.
We are going to use DuckDB as our detailed example in this tutorial, but example code for other databases, including SQLite3, is provided at the end of this section.
We start by creating a db.py module in the root of our application, and implement db.connection as a way to connect to a DuckDB instance.
import duckdb

def connection(name: str) -> duckdb.DuckDBPyConnection:
    return duckdb.connect(name)
Here db.connection() must be provided with a name: either :memory: to store the data in-memory, or a file name to persist the data on-disk.
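As a quick sanity check (assuming duckdb is installed), the function can be exercised like this:

# Connect to a throwaway in-memory database and run a trivial query.
conn = connection(":memory:")
print(conn.query("select 42 as answer").arrow())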
We can use this new function in order to implement our query handler.
import duckdb
import pyarrow as pa

def connection(name: str) -> duckdb.DuckDBPyConnection:
    return duckdb.connect(name)

def query_handler(query: str, *, name: str = ":memory:") -> pa.Table:
    cursor = connection(name=name).cursor()
    return cursor.query(query).arrow()
Notice how the first argument of query_handler is the query string to be executed, as required by Patito, while the name keyword is specific to our database of choice.
It is now simple for us to create a patito.Database object by providing db.query_handler:
from pathlib import Path

import duckdb
import patito as pt
import pyarrow as pa

def connection(name: str) -> duckdb.DuckDBPyConnection:
    return duckdb.connect(name)

def query_handler(query: str, name: str = ":memory:") -> pa.Table:
    cursor = connection(name).cursor()
    return cursor.query(query).arrow()

my_database = pt.Database(query_handler=query_handler)
Additional arguments can be provided to the Database constructor, for example a custom cache directory. These additional parameters are documented here.
Documentation for constructing query handlers and patito.Database objects for other databases is provided in the collapsible sections below:

SQLite3: see the “Examples” section of patito.Database.

Other databases: you are welcome to create a GitHub issue if you need help integrating with your specific database of choice.
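For reference, a minimal SQLite3 query handler might look like the following sketch, adapted from the patito.Database examples (the path parameter is our own addition):

import sqlite3

import patito as pt
import pyarrow as pa

def sqlite_query_handler(query: str, path: str = ":memory:") -> pa.Table:
    # Execute the query and convert the rows into an Arrow table.
    cursor = sqlite3.connect(path).cursor()
    cursor.execute(query)
    columns = [description[0] for description in cursor.description]
    data = [dict(zip(columns, row)) for row in cursor.fetchall()]
    return pa.Table.from_pylist(data)

my_database = pt.Database(query_handler=sqlite_query_handler)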
Querying the Database Directly
The db module is now complete and we should be able to use it in order to execute queries directly against our in-memory database.
>>> from db import my_database
>>> my_database.query("select 1 as a, 2 as b")
shape: (1, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i32 ┆ i32 │
╞═════╪═════╡
│ 1 ┆ 2 │
└─────┴─────┘
The query result is provided in the form of a polars DataFrame object. Additional parameters can be provided to patito.Database.query() as described here. As an example, the query result can be provided as a polars.LazyFrame by specifying lazy=True.
>>> from db import my_database
>>> my_database.query("select 1 as a, 2 as b", lazy=True)
<polars.LazyFrame object at 0x13571D310>
Any additional keyword arguments provided to patito.Database.query() are forwarded directly to the original query handler, so the following will execute the query against the database stored in my.db:
>>> my_database.query("select * from my_table", name="my.db")
(Figure: delegation of parameters provided to patito.Database.query().)
Wrapping Query-Producing Functions
Let’s assume that you have a project named user-analyzer which analyzes users. The associated python package should therefore be named user_analyzer. By convention, functions for retrieving data from a remote database should be placed in the user_analyzer.fetch sub-module.
Using this module should be as simple as…
from user_analyzer import fetch
user_df = fetch.users()
Start by creating the python file for the fetch sub-module; it should be located at projects/user-analyzer/user_analyzer/fetch.py.
Next, implement the users function as a function that returns a SQL query that should produce the intended data when executed in the remote database…
def users():
    return "select * from d_users"
This is clearly not enough; the fetch.users function only returns a query string for now. But it can trivially be converted to a function that returns a dataframe instead by using the query decorator from db…
from db import query

@query()
def users():
    return "select * from d_users"
Polars vs. Pandas
When user_analyzer.fetch.users() is invoked it will return a polars DataFrame by default.
Polars is a DataFrame library that is highly recommended over pandas; it will be familiar to most pandas users and can be easily converted to pandas when needed.
You can find introductory documentation for polars here.
If you still prefer to use pandas you can use the .to_pandas() method like this…
from user_analyzer import fetch
# This returns a polars DataFrame
user_df = fetch.users()
# This returns a pandas DataFrame
user_df = fetch.users().to_pandas()
We can also add parameters to the users function if needed. Let’s say we want to be able to filter on the users’ country codes:
from typing import Optional

from db import query

@query()
def users(country: Optional[str] = None):
    if country:
        return f"select * from d_users where country_code = '{country}'"
    else:
        return "select * from d_users"
You can now construct a DataFrame of all Finnish users by writing fetch.users(country="FI"). If you want to access the SQL query rather than executing it, you can retrieve it with fetch.users.query_string(country="FI").
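Given the query function above, the rendered query would look like this (illustrative output):

>>> fetch.users.query_string(country="FI")
"select * from d_users where country_code = 'FI'"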
Specifying custom database parameters
The @query decorator will by default execute your SQL query against the ANALYTICS.ANALYTICS database schema. If your query needs to use a different schema, warehouse, user, etc., you can specify a custom db_params parameter to the query decorator. Here is an example where we execute the query against ANALYTICS.ANALYTICS_FORECASTING instead of ANALYTICS.ANALYTICS.
from db import query

FORECASTING_SCHEMA = {"schema": "ANALYTICS_FORECASTING"}

@query(db_params=FORECASTING_SCHEMA)
def covid_cases():
    return "select * from stg_covid_cases"
Normalizing column types
A Snowflake query might produce different column types based on how many rows are returned and/or the value bounds of each column.
In order to ensure consistent behavior, db.query by default _upcasts_ all lower-typed dtypes, such as Int8 to Int64, Float16 to Float64, and so on. This behavior can be disabled by providing normalize_column_types=False to the @query decorator.
from db import query

@query(normalize_column_types=False)
def example_query():
    return "example query"
Cache Your Queries to Speed Up Data Retrieval
Some database queries may take a long time to execute due to the data set being large and/or the computations being intensive. In those cases you might want to store the result for reuse rather than re-executing the query every single time you invoke fetch.X().
Luckily, this is really easy with db.query: you can simply add the cache=True parameter to the decorator and caching will be automatically enabled! Enabling caching for fetch.users will look like this…
...

@query(cache=True)
def users(country: Optional[str] = None):
    ...
Now, if you execute fetch.users() it will query the database directly, but the _next_ time you execute it, it will instantaneously return the result from the previous execution. The @query decorator will cache the results based on the query string itself, so fetch.users(), fetch.users(country="FI"), fetch.users(country="NO"), and so on will be cached independently.
Lazy data retrieval
You can also specify the lazy=True parameter to the @query decorator in order to receive the query result in the form of a LazyFrame object rather than a DataFrame. This parameter plays well with cached query decorators since it will only read the strictly required data from the cache.
...

@query(cache=True, lazy=True)
def users():
    ...

# Only the subset of the rows with age_in_years >= 67 will be read into memory
pensioners = users().filter(pl.col("age_in_years") >= 67).collect()
Refreshing the cache
Sometimes you may want to forcefully reset the cache of a query function in order to get the latest version of the data from the remote database. This can be done by invoking X.refresh_cache() rather than X() directly.
Let’s say you want to retrieve the latest set of Norwegian users from the database…
from user_analyzer import fetch
user_df = fetch.users.refresh_cache(country="NO")
This will delete the cached version of the Norwegian users if the result has already been cached, and return the latest result.
The next time you invoke fetch.users(country="NO") you will get the latest version of the cache. If you want to clear all caches, regardless of query parameterization, you can use the X.clear_caches() method.
from user_analyzer import fetch
fetch.users.clear_caches()
The .refresh_cache() and .clear_caches() methods are in fact two of several methods that are automatically added to @query-decorated functions; the full list of such methods is:

- .clear_caches() - Delete all cache files of the given query function such that new data will be fetched the _next_ time the query is invoked.
- .refresh_cache(*args, **kwargs) - Force the resulting SQL query produced by the given parameters to be executed in the remote database and repopulate the parameter-specific cache.
- .cache_path(*args, **kwargs) - Return a pathlib.Path object pointing to the parquet file that is used to store the cache for the given parameters.
- .query_string(*args, **kwargs) - Return the SQL query string to be executed.
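Put together, a typical interactive session might use these methods like this (a sketch; the printed values depend on your project layout and query function):

from user_analyzer import fetch

# Where is the cache for Finnish users stored?
print(fetch.users.cache_path(country="FI"))

# Which SQL string would be executed for that parameterization?
print(fetch.users.query_string(country="FI"))

# Remove every cache file for this query function.
fetch.users.clear_caches()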
Automatically refreshing old caches
Sometimes it makes sense to cache a query result, but not forever.
In such cases you can specify the Time to Live (TTL) of the cache, automatically refreshing the cache when it becomes older than the specified TTL.
This can be done by specifying the ttl argument to the @query decorator as a datetime.timedelta.
Let’s say that we want to fetch the newest collection of users once a day, but otherwise cache the results. This can be achieved in the following way…
from datetime import timedelta

from db import query

@query(
    cache=True,
    ttl=timedelta(days=1),
)
def users(country: Optional[str] = None):
    ...
The first time you invoke fetch.users(), the query will be executed in the remote database and the result will be cached. After that, the cache will be used until you invoke fetch.users() more than 24 hours after the cache was initially created. Then the cache will be automatically refreshed. You can also force a cache refresh at any time by using the .refresh_cache() method, for instance for all Norwegian users by executing fetch.users.refresh_cache(country="NO").
Specify custom cache files (advanced)
If you want to store the cached results in specific parquet files, you can specify the cache parameter to the @query decorator as a string or as a pathlib.Path object. Let’s say you want to store the users in a file called users.parquet; this can be done in the following way:
from db import query

@query(cache="users.parquet")
def users(country: Optional[str] = None):
    ...
The file path users.parquet is a so-called relative path and is therefore interpreted relative to the artifacts/query_cache sub-directory within the project’s root. You can inspect the resulting path by executing users.cache_path():
from user_analyzer import fetch
print(fetch.users.cache_path())
# Outputs: /repo/projects/user-analyzer/artifacts/query_cache/users.parquet
You can also specify an absolute path if required. Let’s say you want to place the file in <REPO>/projects/user-analyzer/users.parquet:
from db import PROJECT_DIR, query

@query(cache=PROJECT_DIR / "users.parquet")
def users(country: Optional[str] = None):
    ...
The problem with the previous custom cache path is that fetch.users(country="NO") and fetch.users(country="FI") will write to the same cache file, thus refreshing the cache much more often than strictly necessary. It would be more efficient to have a separate cache file for each country. You can achieve this by inserting a {country} formatting placeholder, like with an f-string, in the custom cache path:
from db import PROJECT_DIR, query

@query(cache=PROJECT_DIR / "users-{country}.parquet")
def users(country: Optional[str] = None):
    ...
Finnish users will now be cached in users-FI.parquet, while Norwegian users will be cached in users-NO.parquet.
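You can verify the parameter-specific paths with .cache_path(); the output below is illustrative:

>>> fetch.users.cache_path(country="FI")
PosixPath('/repo/projects/user-analyzer/users-FI.parquet')
>>> fetch.users.cache_path(country="NO")
PosixPath('/repo/projects/user-analyzer/users-NO.parquet')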
Automatic Data Validation
The @query decorator integrates with the Patito DataFrame validation library, allowing you to automatically validate the data fetched from the remote database. If the concept of data validation, and why you should apply it in your data science projects, is new to you, then you should read “Using Patito for DataFrame Validation”.
Let’s say that we have a fetch.products() query function which produces a DataFrame with three columns.
from db import query

@query()
def products():
    return """
        select
            product_id,
            warehouse_department,
            current_retail_price
        from products
    """
Given this query we might want to validate the following assumptions:
- product_id is a unique integer assigned to each product.
- warehouse_department takes one of three permissible values: "Dry", "Cold", or "Frozen".
- current_retail_price is a positive floating point number.
By convention we should define a Patito model class named Product placed in <project_module>/models.py.
from typing import Literal

import patito as pt

class Product(pt.Model):
    product_id: int = pt.Field(unique=True)
    warehouse_department: Literal["Dry", "Cold", "Frozen"]
    current_retail_price: float = pt.Field(gt=0)
We can now use user_analyzer.models.Product to automatically validate the data produced by user_analyzer.fetch.products by providing the model keyword to the @query decorator.
from db import query
from user_analyzer import models

@query(model=models.Product)
def products():
    return """
        select
            product_id,
            warehouse_department,
            current_retail_price
        from products
    """
Whenever you invoke fetch.products, the data will be guaranteed to follow the schema of models.Product; otherwise an exception will be raised. You can therefore rest assured that the production data will not substantially change without you noticing it.
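If the guarantee is ever broken, the decorated function raises the usual Patito validation error, which you can surface in whatever way suits your pipeline; a sketch:

import patito as pt

from user_analyzer import fetch

try:
    product_df = fetch.products()
except pt.exceptions.ValidationError as exc:
    # For example: alert on unexpected warehouse_department values.
    print(f"products() returned invalid data: {exc}")
    raise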
API Reference
patito.DataFrame
- class patito.DataFrame(data=None, schema=None, *, schema_overrides=None, orient=None, infer_schema_length=100, nan_to_null=False)
A sub-class of polars.DataFrame with additional functionality related to Model.
Two different methods are available for constructing model-aware data frames. Assume a simple model with two fields:
>>> import patito as pt
>>> class Product(pt.Model):
...     name: str
...     price_in_cents: int
...
We can construct a data frame containing products and then associate the Product model to the data frame using DataFrame.set_model:

>>> df = pt.DataFrame({"name": ["apple", "banana"], "price": [25, 61]}).set_model(
...     Product
... )
Alternatively, we can use the custom Product.DataFrame class which automatically associates the Product model to the data frame at instantiation.

>>> df = Product.DataFrame({"name": ["apple", "banana"], "price": [25, 61]})
The df data frame now has a set of model-aware methods such as Product.validate.
patito.DataFrame.cast
- DataFrame.cast(strict=False)
Cast columns to dtypes specified by the associated Patito model.
- Parameters:
  strict (bool) – If set to False, columns which are technically compliant with the specified field type will not be cast. For example, a column annotated with int is technically compliant with pl.UInt8, even if pl.Int64 is the default dtype associated with int-annotated fields. If strict is set to True, the resulting dtypes will be forced to the default dtype associated with each python type.
- Returns:
  A dataframe with columns cast to the correct dtypes.
- Return type:
  DataFrame
Examples
Create a simple model:
>>> import patito as pt
>>> import polars as pl
>>> class Product(pt.Model):
...     name: str
...     cent_price: int = pt.Field(dtype=pl.UInt16)
...
Now we can use this model to cast some simple data:
>>> Product.DataFrame({"name": ["apple"], "cent_price": ["8"]}).cast()
shape: (1, 2)
┌───────┬────────────┐
│ name  ┆ cent_price │
│ ---   ┆ ---        │
│ str   ┆ u16        │
╞═══════╪════════════╡
│ apple ┆ 8          │
└───────┴────────────┘
patito.DataFrame.derive
- DataFrame.derive()
Populate columns which have pt.Field(derived_from=...) definitions.

If a column field on the data frame model has patito.Field(derived_from=...) specified, the given value will be used to define the column. If derived_from is set to a string, the column will be derived from the given column name. Alternatively, an arbitrary polars expression can be given, the result of which will be used to populate the column values.
- Returns:
A new dataframe where all derivable columns are provided.
- Return type:
  DataFrame
- Raises:
TypeError – If the derived_from parameter of patito.Field is given as something other than a string or polars expression.
Examples
>>> import patito as pt
>>> import polars as pl
>>> class Foo(pt.Model):
...     bar: int = pt.Field(derived_from="foo")
...     double_bar: int = pt.Field(derived_from=2 * pl.col("bar"))
...
>>> Foo.DataFrame({"foo": [1, 2]}).derive()
shape: (2, 3)
┌─────┬─────┬────────────┐
│ foo ┆ bar ┆ double_bar │
│ --- ┆ --- ┆ ---        │
│ i64 ┆ i64 ┆ i64        │
╞═════╪═════╪════════════╡
│ 1   ┆ 1   ┆ 2          │
│ 2   ┆ 2   ┆ 4          │
└─────┴─────┴────────────┘
patito.DataFrame.drop
- DataFrame.drop(columns=None, *more_columns)
Drop one or more columns from the dataframe.
If name is not provided then all columns not specified by the associated patito model, for instance set with DataFrame.set_model, are dropped.
- Parameters:
columns (Union[str, Collection[str], None]) – A single column string name, or list of strings, indicating which columns to drop. If not specified, all columns not specified by the associated dataframe model will be dropped.
  more_columns (str) – Additional named columns to drop.
- Returns:
New dataframe without the specified columns.
- Return type:
  DataFrame
Examples
>>> import patito as pt
>>> class Model(pt.Model):
...     column_1: int
...
>>> Model.DataFrame({"column_1": [1, 2], "column_2": [3, 4]}).drop()
shape: (2, 1)
┌──────────┐
│ column_1 │
│ ---      │
│ i64      │
╞══════════╡
│ 1        │
│ 2        │
└──────────┘
patito.DataFrame.fill_null
- DataFrame.fill_null(value=None, strategy=None, limit=None, matches_supertype=True)
Fill null values using a filling strategy, literal, or Expr.

If "defaults" is provided as the strategy, the model fields with default values are used to fill missing values.
- Parameters:
value (Optional[Any]) – Value used to fill null values.
  strategy (Optional[Literal['forward', 'backward', 'min', 'max', 'mean', 'zero', 'one', 'defaults']]) – Accepts the same arguments as polars.DataFrame.fill_null in addition to "defaults" which will use the field’s default value if provided.
  limit (Optional[int]) – The number of consecutive null values to forward/backward fill. Only valid if strategy is "forward" or "backward".
  matches_supertype (bool) – Fill all matching supertypes of the fill value.
- Returns:
A new dataframe with nulls filled in according to the provided strategy parameter.
- Return type:
  DataFrame
Example
>>> import patito as pt
>>> class Product(pt.Model):
...     name: str
...     price: int = 19
...
>>> df = Product.DataFrame(
...     {"name": ["apple", "banana"], "price": [10, None]}
... )
>>> df.fill_null(strategy="defaults")
shape: (2, 2)
┌────────┬───────┐
│ name   ┆ price │
│ ---    ┆ ---   │
│ str    ┆ i64   │
╞════════╪═══════╡
│ apple  ┆ 10    │
│ banana ┆ 19    │
└────────┴───────┘
patito.DataFrame.get
- DataFrame.get(predicate=None)
Fetch the single row that matches the given polars predicate.
If you expect a data frame to already consist of one single row, you can use .get() without any arguments to return that row.
- Raises:
RowDoesNotExist – If zero rows evaluate to true for the given predicate.
MultipleRowsReturned – If more than one row evaluates to true for the given predicate.
RuntimeError – The superclass of both RowDoesNotExist and MultipleRowsReturned, if you want to catch both exceptions with the same class.
- Parameters:
predicate (Optional[Expr]) – A polars expression defining the criteria of the filter.
- Returns:
A pydantic-derived base model representing the given row.
- Return type:
  Model
Example
>>> import patito as pt
>>> import polars as pl
>>> df = pt.DataFrame({"product_id": [1, 2, 3], "price": [10, 10, 20]})
The .get() method will by default return a dynamically constructed pydantic model if no model has been associated with the given dataframe:

>>> df.get(pl.col("product_id") == 1)
UntypedRow(product_id=1, price=10)
If a Patito model has been associated with the dataframe, by the use of DataFrame.set_model(), then the given model will be used to represent the return type:
>>> class Product(pt.Model):
...     product_id: int = pt.Field(unique=True)
...     price: float
...
>>> df.set_model(Product).get(pl.col("product_id") == 1)
Product(product_id=1, price=10.0)
You can invoke .get() without any arguments on dataframes containing exactly one row:

>>> df.filter(pl.col("product_id") == 1).get()
UntypedRow(product_id=1, price=10)
If the given predicate matches multiple rows, a MultipleRowsReturned exception will be raised:

>>> try:
...     df.get(pl.col("price") == 10)
... except pt.exceptions.MultipleRowsReturned as e:
...     print(e)
...
DataFrame.get() yielded 2 rows.
If the given predicate matches zero rows, a RowDoesNotExist exception will be raised:

>>> try:
...     df.get(pl.col("price") == 0)
... except pt.exceptions.RowDoesNotExist as e:
...     print(e)
...
DataFrame.get() yielded 0 rows.
patito.DataFrame.read_csv
- classmethod DataFrame.read_csv(*args, **kwargs)
Read CSV and apply correct column name and types from model.
If any fields have derived_from specified, the given expression will be used to populate the given column(s).
- Parameters:
*args – All positional arguments are forwarded to polars.read_csv.
  **kwargs – All keyword arguments are forwarded to polars.read_csv.
- Returns:
A dataframe representing the given CSV file data.
- Return type:
  DataFrame
Examples
The DataFrame.read_csv method can be used to automatically set the correct column names when reading CSV files without headers.

>>> import io
>>> import patito as pt
>>> class CSVModel(pt.Model):
...     a: float
...     b: str
...
>>> csv_file = io.StringIO("1,2")
>>> CSVModel.DataFrame.read_csv(csv_file, has_header=False)
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ f64 ┆ str │
╞═════╪═════╡
│ 1.0 ┆ 2   │
└─────┴─────┘
The derived_from parameter of pt.Field allows you to specify the mapping between the CSV file’s column names and the final column names you intend to construct.

>>> import io
>>> import patito as pt
>>> class CSVModel(pt.Model):
...     a: float
...     b: str = pt.Field(derived_from="source_of_b")
...
>>> csv_file = io.StringIO("a,source_of_b\n1,1")
>>> CSVModel.DataFrame.read_csv(csv_file).drop()
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ f64 ┆ str │
╞═════╪═════╡
│ 1.0 ┆ 1   │
└─────┴─────┘
patito.DataFrame.set_model
- DataFrame.set_model(model)
Associate a given patito Model with the dataframe.

The model schema is used by methods that depend on a model being associated with the given dataframe, such as DataFrame.validate() and DataFrame.get(). DataFrame(...).set_model(Model) is equivalent to Model.DataFrame(...).
- Parameters:
model (Model) – Sub-class of patito.Model declaring the schema of the dataframe.
- Returns:
Returns the same dataframe, but with an attached model that is required for certain model-specific dataframe methods to work.
- Return type:
  DataFrame
Examples
>>> from typing_extensions import Literal
>>> import patito as pt
>>> import polars as pl
>>> class SchoolClass(pt.Model):
...     year: int = pt.Field(dtype=pl.UInt16)
...     letter: Literal["A", "B"] = pt.Field(dtype=pl.Categorical)
...
>>> classes = pt.DataFrame(
...     {"year": [1, 1, 2, 2], "letter": list("ABAB")}
... ).set_model(SchoolClass)
>>> classes
shape: (4, 2)
┌──────┬────────┐
│ year ┆ letter │
│ ---  ┆ ---    │
│ i64  ┆ str    │
╞══════╪════════╡
│ 1    ┆ A      │
│ 1    ┆ B      │
│ 2    ┆ A      │
│ 2    ┆ B      │
└──────┴────────┘
>>> casted_classes = classes.cast()
>>> casted_classes
shape: (4, 2)
┌──────┬────────┐
│ year ┆ letter │
│ ---  ┆ ---    │
│ u16  ┆ cat    │
╞══════╪════════╡
│ 1    ┆ A      │
│ 1    ┆ B      │
│ 2    ┆ A      │
│ 2    ┆ B      │
└──────┴────────┘
>>> casted_classes.validate()
patito.DataFrame.validate
- DataFrame.validate()
Validate the schema and content of the dataframe.
You must invoke .set_model() before invoking .validate() in order to specify how the dataframe should be validated.
- Returns:
  The original dataframe, if correctly validated.
- Return type:
  DataFrame
- Raises:
  TypeError – If DataFrame.set_model() has not been invoked prior to validation. Note that patito.Model.DataFrame automatically invokes DataFrame.set_model() for you.
  patito.exceptions.ValidationError – If the dataframe does not match the specified schema.
Examples
>>> from typing import Literal
>>> import patito as pt
>>> class Product(pt.Model):
...     product_id: int = pt.Field(unique=True)
...     temperature_zone: Literal["dry", "cold", "frozen"]
...     is_for_sale: bool
...
>>> df = pt.DataFrame(
...     {
...         "product_id": [1, 1, 3],
...         "temperature_zone": ["dry", "dry", "oven"],
...     }
... ).set_model(Product)
>>> try:
...     df.validate()
... except pt.ValidationError as exc:
...     print(exc)
...
3 validation errors for Product
is_for_sale
  Missing column (type=type_error.missingcolumns)
product_id
  2 rows with duplicated values. (type=value_error.rowvalue)
temperature_zone
  Rows with invalid values: {'oven'}. (type=value_error.rowvalue)
patito.Model
- class patito.Model(**data)
Custom pydantic class for representing table schema and constructing rows.
patito.Model.DataFrame
- Model.DataFrame(data=None, schema=None, *, schema_overrides=None, orient=None, infer_schema_length=100, nan_to_null=False)
A sub-class of polars.DataFrame with additional functionality related to Model.
Two different methods are available for constructing model-aware data frames. Assume a simple model with two fields:
>>> import patito as pt
>>> class Product(pt.Model):
...     name: str
...     price_in_cents: int
...
We can construct a data frame containing products and then associate the Product model to the data frame using DataFrame.set_model:

>>> df = pt.DataFrame({"name": ["apple", "banana"], "price": [25, 61]}).set_model(
...     Product
... )
Alternatively, we can use the custom Product.DataFrame class which automatically associates the Product model to the data frame at instantiation.

>>> df = Product.DataFrame({"name": ["apple", "banana"], "price": [25, 61]})
The df data frame now has a set of model-aware methods such as Product.validate.
patito.Model.LazyFrame
- Model.LazyFrame(data=None, schema=None, *, schema_overrides=None, orient=None, infer_schema_length=100, nan_to_null=False)
LazyFrame class associated to DataFrame.
patito.Model.columns
- property Model.columns: List[str]
Return the name of the dataframe columns specified by the fields of the model.
- Returns:
List of column names.
Example
>>> import patito as pt
>>> class Product(pt.Model):
...     name: str
...     price: int
...
>>> Product.columns
['name', 'price']
patito.Model.defaults
- property Model.defaults: dict[str, Any]
Return default field values specified on the model.
- Returns:
Dictionary containing fields with their respective default values.
Example
>>> from typing_extensions import Literal
>>> import patito as pt
>>> class Product(pt.Model):
...     name: str
...     price: int = 0
...     temperature_zone: Literal["dry", "cold", "frozen"] = "dry"
...
>>> Product.defaults
{'price': 0, 'temperature_zone': 'dry'}
patito.Model.drop
- classmethod Model.drop(name)
Return a new model where one or more fields are excluded.
- Parameters:
name (Union[str, Iterable[str]]) – A single string field name, or a list of such field names, which will be dropped.
- Return type:
  Type[Model]
- Returns:
New model class where the given fields have been removed.
Examples
>>> class MyModel(Model):
...     a: int
...     b: int
...     c: int
...
>>> MyModel.columns
['a', 'b', 'c']
>>> MyModel.drop("c").columns
['a', 'b']
>>> MyModel.drop(["b", "c"]).columns
['a']
patito.Model.dtypes
- property Model.dtypes: dict[str, Type[polars.datatypes.classes.DataType]]
Return the polars dtypes of the dataframe.
Unless Field(dtype=…) is specified, the highest signed column dtype is chosen for integer and float columns.
- Returns:
A dictionary mapping string column names to polars dtype classes.
Example
>>> import patito as pt
>>> class Product(pt.Model):
...     name: str
...     ideal_temperature: int
...     price: float
...
>>> Product.dtypes
{'name': Utf8, 'ideal_temperature': Int64, 'price': Float64}
patito.Model.example
- classmethod Model.example(**kwargs)
Produce model instance with filled dummy data for all unspecified fields.
The type annotation of each unspecified field is used to fill in type-correct dummy data, e.g. -1 for int, "dummy_string" for str, and so on…

The first item of typing.Literal annotations is used for dummy values.
- Parameters:
**kwargs (Any) – Provide explicit values for any fields which should not be filled with dummy data.
- Returns:
A pydantic model object filled with dummy data for all unspecified model fields.
- Return type:
  Model
- Raises:
TypeError – If one or more of the provided keyword arguments do not match any fields on the model.
Example
>>> from typing import Literal
>>> import patito as pt
>>> class Product(pt.Model):
...     product_id: int = pt.Field(unique=True)
...     name: str
...     temperature_zone: Literal["dry", "cold", "frozen"]
...
>>> Product.example(product_id=1)
Product(product_id=1, name='dummy_string', temperature_zone='dry')
patito.Model.example_value
- classmethod Model.example_value(field)
Return a valid example value for the given model field.
- Parameters:
  field (str) – Field name identifier.
- Return type:
  Union[date, datetime, float, int, str, None]
- Returns:
A single value which is consistent with the given field definition.
- Raises:
NotImplementedError – If the given field has no example generator.
Example
>>> from typing import Literal
>>> import patito as pt
>>> class Product(pt.Model):
...     product_id: int = pt.Field(unique=True)
...     name: str
...     temperature_zone: Literal["dry", "cold", "frozen"]
...
>>> Product.example_value("product_id")
-1
>>> Product.example_value("name")
'dummy_string'
>>> Product.example_value("temperature_zone")
'dry'
patito.Model.examples
- classmethod Model.examples(data=None, columns=None)
Generate polars dataframe with dummy data for all unspecified columns.
This constructor accepts the same data format as polars.DataFrame.
- Parameters:
  data (Union[dict, Iterable, None]) – Data to populate the dummy dataframe with. If given as an iterable of values, then column names must also be provided. If not provided at all, a dataframe with a single row populated with dummy data is provided.
  columns (Optional[Iterable[str]]) – Ignored if data is provided as a dictionary. If data is provided as an iterable, then columns will be used as the column names in the resulting dataframe. Defaults to None.
- Return type:
  DataFrame
- Returns:
A polars dataframe where all unspecified columns have been filled with dummy data which should pass model validation.
- Raises:
TypeError – If one or more of the model fields are not mappable to polars column dtype equivalents.
Example
>>> from typing import Literal
>>> import patito as pt
>>> class Product(pt.Model):
...     product_id: int = pt.Field(unique=True)
...     name: str
...     temperature_zone: Literal["dry", "cold", "frozen"]
...
>>> Product.examples()
shape: (1, 3)
┌──────────────┬──────────────────┬────────────┐
│ name         ┆ temperature_zone ┆ product_id │
│ ---          ┆ ---              ┆ ---        │
│ str          ┆ cat              ┆ i64        │
╞══════════════╪══════════════════╪════════════╡
│ dummy_string ┆ dry              ┆ 0          │
└──────────────┴──────────────────┴────────────┘
>>> Product.examples({"name": ["product A", "product B"]})
shape: (2, 3)
┌───────────┬──────────────────┬────────────┐
│ name      ┆ temperature_zone ┆ product_id │
│ ---       ┆ ---              ┆ ---        │
│ str       ┆ cat              ┆ i64        │
╞═══════════╪══════════════════╪════════════╡
│ product A ┆ dry              ┆ 0          │
│ product B ┆ dry              ┆ 1          │
└───────────┴──────────────────┴────────────┘
patito.Model.from_row
- classmethod Model.from_row(row, validate=True)
Represent a single data frame row as a Patito model.
- Parameters:
  row (Union[polars.DataFrame, pandas.DataFrame]) – A dataframe, either polars or pandas, consisting of a single row.
  validate (bool) – If False, skip pydantic validation of the given row data.
- Returns:
A patito model representing the given row data.
- Return type:
  Model
- Raises:
TypeError – If the given type is neither a pandas nor a polars DataFrame.
Example
>>> import patito as pt
>>> import polars as pl
>>> class Product(pt.Model):
...     product_id: int
...     name: str
...     price: float
...
>>> df = pl.DataFrame(
...     [["1", "product name", "1.22"]],
...     schema=["product_id", "name", "price"],
... )
>>> Product.from_row(df)
Product(product_id=1, name='product name', price=1.22)
>>> Product.from_row(df, validate=False)
Product(product_id='1', name='product name', price='1.22')
patito.Model.join
- classmethod Model.join(other, how)
Dynamically create a new model compatible with an SQL Join operation.
For instance, ModelA.join(ModelB, how="left") will create a model containing all the fields of ModelA and ModelB, but where all fields of ModelB have been made Optional, i.e. nullable. This is consistent with the LEFT JOIN SQL operation making all the columns of the right table nullable.
- Parameters:
other (Type[Model]) – Another patito Model class.
  how (Literal['inner', 'left', 'outer', 'asof', 'cross', 'semi', 'anti']) – The type of SQL Join operation.
- Return type:
  Type[Model]
- Returns:
A new model type compatible with the resulting schema produced by the given join operation.
Examples
>>> class A(Model):
...     a: int
...
>>> class B(Model):
...     b: int
...
>>> InnerJoinedModel = A.join(B, how="inner")
>>> InnerJoinedModel.columns
['a', 'b']
>>> InnerJoinedModel.nullable_columns
set()
>>> LeftJoinedModel = A.join(B, how="left")
>>> LeftJoinedModel.nullable_columns
{'b'}
>>> OuterJoinedModel = A.join(B, how="outer")
>>> sorted(OuterJoinedModel.nullable_columns)
['a', 'b']
>>> A.join(B, how="anti") is A
True
patito.Model.non_nullable_columns
- property Model.non_nullable_columns: set[str]
Return names of those columns that are non-nullable in the schema.
- Returns:
Set of column name strings.
Example
>>> from typing import Optional
>>> import patito as pt
>>> class MyModel(pt.Model):
...     nullable_field: Optional[int]
...     inferred_nullable_field: int = None
...     non_nullable_field: int
...     another_non_nullable_field: str
...
>>> sorted(MyModel.non_nullable_columns)
['another_non_nullable_field', 'non_nullable_field']
patito.Model.nullable_columns
- property Model.nullable_columns: set[str]
Return names of those columns that are nullable in the schema.
- Returns:
Set of column name strings.
Example
>>> from typing import Optional
>>> import patito as pt
>>> class MyModel(pt.Model):
...     nullable_field: Optional[int]
...     inferred_nullable_field: int = None
...     non_nullable_field: int
...     another_non_nullable_field: str
...
>>> sorted(MyModel.nullable_columns)
['inferred_nullable_field', 'nullable_field']
patito.Model.pandas_examples
- classmethod Model.pandas_examples(data, columns=None)
Generate dataframe with dummy data for all unspecified columns.
Offers the same API as the pandas.DataFrame constructor. Non-iterable values, besides strings, are repeated until they become as long as the iterable arguments.
- Parameters:
  data (Union[dict, Iterable]) – Data to populate the dummy dataframe with. If not a dict, column names must also be provided.
  columns (Optional[Iterable[str]]) – Ignored if data is a dict. If data is an iterable, it will be used as the column names in the resulting dataframe. Defaults to None.
- Return type:
  DataFrame
- Returns:
A pandas DataFrame filled with dummy example data.
- Raises:
ImportError – If pandas has not been installed. You should install patito[pandas] in order to integrate patito with pandas.
TypeError – If column names have not been specified in the input data.
Example
>>> from typing import Literal
>>> import patito as pt
>>> class Product(pt.Model):
...     product_id: int = pt.Field(unique=True)
...     name: str
...     temperature_zone: Literal["dry", "cold", "frozen"]
...
>>> Product.pandas_examples({"name": ["product A", "product B"]})
   product_id       name temperature_zone
0          -1  product A              dry
1          -1  product B              dry
patito.Model.prefix
- classmethod Model.prefix(prefix)
Return a new model where all field names have been prefixed.
- Parameters:
prefix (str) – String prefix to add to all field names.
- Return type:
  Type[Model]
- Returns:
New model class with all the same fields only prefixed with the given prefix.
Example
>>> class MyModel(Model):
...     a: int
...     b: int
...
>>> MyModel.prefix("x_").columns
['x_a', 'x_b']
patito.Model.rename
- classmethod Model.rename(mapping)
Return a new model class where the specified fields have been renamed.
- Parameters:
mapping (Dict[str, str]) – A dictionary where the keys are the old field names and the values are the new names.
- Return type:
  Type[Model]
- Returns:
A new model class where the given fields have been renamed.
- Raises:
ValueError – If non-existent fields are renamed.
Example
>>> class MyModel(Model):
...     a: int
...     b: int
...
>>> MyModel.rename({"a": "A"}).columns
['b', 'A']
patito.Model.select
- classmethod Model.select(fields)
Create a new model consisting of only a subset of the model fields.
- Parameters:
fields (Union[str, Iterable[str]]) – A single field name as a string or a collection of strings.
- Return type:
  Type[Model]
- Returns:
  A new model containing only the fields specified by fields.
- Raises:
ValueError – If one or more non-existent fields are selected.
Example
>>> class MyModel(Model):
...     a: int
...     b: int
...     c: int
...
>>> MyModel.select("a").columns
['a']
>>> sorted(MyModel.select(["b", "c"]).columns)
['b', 'c']
patito.Model.sql_types
- property Model.sql_types: dict[str, str]
Return compatible DuckDB SQL types for all model fields.
- Returns:
Dictionary with column name keys and SQL type identifier strings.
Example
>>> from typing import Literal
>>> import patito as pt
>>> class MyModel(pt.Model):
...     int_column: int
...     str_column: str
...     float_column: float
...     literal_column: Literal["a", "b", "c"]
...
>>> MyModel.sql_types
{'int_column': 'INTEGER', 'str_column': 'VARCHAR', 'float_column': 'DOUBLE', 'literal_column': 'enum__4a496993dde04060df4e15a340651b45'}
patito.Model.suffix
- classmethod Model.suffix(suffix)
Return a new model where all field names have been suffixed.
- Parameters:
suffix (str) – String suffix to add to all field names.
- Return type:
  Type[Model]
- Returns:
New model class with all the same fields only suffixed with the given suffix.
Example
>>> class MyModel(Model):
...     a: int
...     b: int
...
>>> MyModel.suffix("_x").columns
['a_x', 'b_x']
patito.Model.unique_columns
- property Model.unique_columns: set[str]
Return columns with uniqueness constraint.
- Returns:
Set of column name strings.
Example
>>> from typing import Optional
>>> import patito as pt
>>> class Product(pt.Model):
...     product_id: int = pt.Field(unique=True)
...     barcode: Optional[str] = pt.Field(unique=True)
...     name: str
...
>>> sorted(Product.unique_columns)
['barcode', 'product_id']
patito.Model.valid_dtypes
- property Model.valid_dtypes: dict[str, List[pl.PolarsDataType | pl.List]]
Return a list of polars dtypes which Patito considers valid for each field.
The first item of each list is the default dtype chosen by Patito.
- Returns:
A dictionary mapping each column string name to a list of valid dtypes.
- Raises:
NotImplementedError – If one or more model fields are annotated with types not compatible with polars.
Example
>>> from pprint import pprint
>>> import patito as pt
>>> class MyModel(pt.Model):
...     bool_column: bool
...     str_column: str
...     int_column: int
...     float_column: float
...
>>> pprint(MyModel.valid_dtypes)
{'bool_column': [Boolean],
 'float_column': [Float64, Float32],
 'int_column': [Int64, Int32, Int16, Int8, UInt64, UInt32, UInt16, UInt8],
 'str_column': [Utf8]}
patito.Model.valid_sql_types
- property Model.valid_sql_types: dict[str, List[Literal['BIGINT', 'INT8', 'LONG', 'BLOB', 'BYTEA', 'BINARY', 'VARBINARY', 'BOOLEAN', 'BOOL', 'LOGICAL', 'DATE', 'DOUBLE', 'FLOAT8', 'NUMERIC', 'DECIMAL', 'HUGEINT', 'INTEGER', 'INT4', 'INT', 'SIGNED', 'INTERVAL', 'REAL', 'FLOAT4', 'FLOAT', 'SMALLINT', 'INT2', 'SHORT', 'TIME', 'TIMESTAMP', 'DATETIME', 'TIMESTAMP WITH TIMEZONE', 'TIMESTAMPTZ', 'TINYINT', 'INT1', 'UBIGINT', 'UINTEGER', 'USMALLINT', 'UTINYINT', 'UUID', 'VARCHAR', 'CHAR', 'BPCHAR', 'TEXT', 'STRING']]]
Return a list of DuckDB SQL types which Patito considers valid for each field.
The first item of each list is the default dtype chosen by Patito.
- Returns:
A dictionary mapping each column string name to a list of DuckDB SQL types represented as strings.
- Raises:
NotImplementedError – If one or more model fields are annotated with types not compatible with DuckDB.
Example
>>> import patito as pt
>>> from pprint import pprint
>>> class MyModel(pt.Model):
...     bool_column: bool
...     str_column: str
...     int_column: int
...     float_column: float
...
>>> pprint(MyModel.valid_sql_types)
{'bool_column': ['BOOLEAN', 'BOOL', 'LOGICAL'],
 'float_column': ['DOUBLE', 'FLOAT8', 'NUMERIC', 'DECIMAL', 'REAL', 'FLOAT4', 'FLOAT'],
 'int_column': ['INTEGER', 'INT4', 'INT', 'SIGNED', 'BIGINT', 'INT8', 'LONG', 'HUGEINT', 'SMALLINT', 'INT2', 'SHORT', 'TINYINT', 'INT1', 'UBIGINT', 'UINTEGER', 'USMALLINT', 'UTINYINT'],
 'str_column': ['VARCHAR', 'CHAR', 'BPCHAR', 'TEXT', 'STRING']}
patito.Model.validate
- classmethod Model.validate(dataframe)
Validate the schema and content of the given dataframe.
- Parameters:
dataframe (Union[polars.DataFrame, pandas.DataFrame]) – Polars DataFrame to be validated.
- Raises:
patito.exceptions.ValidationError – If the given dataframe does not match the given schema.
- Return type:
  None

Examples
>>> from typing import Literal
>>> import patito as pt
>>> import polars as pl
>>> class Product(pt.Model):
...     product_id: int = pt.Field(unique=True)
...     temperature_zone: Literal["dry", "cold", "frozen"]
...     is_for_sale: bool
...
>>> df = pl.DataFrame(
...     {
...         "product_id": [1, 1, 3],
...         "temperature_zone": ["dry", "dry", "oven"],
...     }
... )
>>> try:
...     Product.validate(df)
... except pt.ValidationError as exc:
...     print(exc)
...
3 validation errors for Product
is_for_sale
  Missing column (type=type_error.missingcolumns)
product_id
  2 rows with duplicated values. (type=value_error.rowvalue)
temperature_zone
  Rows with invalid values: {'oven'}. (type=value_error.rowvalue)
patito.Model.with_fields
- classmethod Model.with_fields(**field_definitions)
Return a new model class where the given fields have been added.
- Parameters:
**field_definitions (Any) – The keywords are of the form field_name=(field_type, field_default). Specify ... if no default value is provided. For instance, column_name=(int, ...) will create a new non-optional integer field named "column_name".
- Return type:
  Type[Model]
- Returns:
A new model with all the original fields and the additional field definitions.
Example
>>> class MyModel(Model):
...     a: int
...
>>> class ExpandedModel(MyModel):
...     b: int
...
>>> MyModel.with_fields(b=(int, ...)).columns == ExpandedModel.columns
True
patito.Field
- class patito.Field(default=PydanticUndefined, *, default_factory=None, alias=None, title=None, description=None, exclude=None, include=None, const=None, gt=None, ge=None, lt=None, le=None, multiple_of=None, allow_inf_nan=None, max_digits=None, decimal_places=None, min_items=None, max_items=None, unique_items=None, min_length=None, max_length=None, allow_mutation=True, regex=None, discriminator=None, repr=True, **extra)
Annotate model field with additional type and validation information.
This class is built on pydantic.Field and you can find its full documentation here. Patito adds additional parameters which are used when validating dataframes; these are documented here.
- Parameters:
constraints (Union[polars.Expression, List[polars.Expression]]) – A single constraint or list of constraints, expressed as polars expression objects. All rows must satisfy the given constraint. You can refer to the given column with pt.field, which will automatically be replaced with polars.col(<field_name>) before evaluation.
  unique (bool) – All row values must be unique.
  dtype (polars.datatype.DataType) – The given dataframe column must have the given polars dtype, for instance polars.UInt64 or pl.Float32.
  gt (Optional[float]) – All values must be greater than gt.
  ge (Optional[float]) – All values must be greater than or equal to ge.
  lt (Optional[float]) – All values must be less than lt.
  le (Optional[float]) – All values must be less than or equal to le.
  multiple_of (Optional[float]) – All values must be multiples of the given value.
  const (bool) – If set to True all values must be equal to the provided default value, the first argument provided to the Field constructor.
  regex (str) – UTF-8 string column must match regex pattern for all row values.
  min_length (int) – Minimum length of all string values in a UTF-8 column.
  max_length (int) – Maximum length of all string values in a UTF-8 column.
- Returns:
Object used to represent additional constraints put upon the given field.
- Return type:
FieldInfo
Examples
>>> import patito as pt
>>> import polars as pl
>>> class Product(pt.Model):
...     # Do not allow duplicates
...     product_id: int = pt.Field(unique=True)
...
...     # Price must be stored as unsigned 16-bit integers
...     price: int = pt.Field(dtype=pl.UInt16)
...
...     # The product name should be from 3 to 128 characters long
...     name: str = pt.Field(min_length=3, max_length=128)
...
...     # Represent colors in the form of upper cased hex colors
...     brand_color: str = pt.Field(regex=r"^\#[0-9A-F]{6}$")
...
>>> Product.DataFrame(
...     {
...         "product_id": [1, 1],
...         "price": [400, 600],
...         "brand_color": ["#ab00ff", "AB00FF"],
...     }
... ).validate()
Traceback (most recent call last):
  ...
patito.exceptions.ValidationError: 4 validation errors for Product
name
  Missing column (type=type_error.missingcolumns)
product_id
  2 rows with duplicated values. (type=value_error.rowvalue)
price
  Polars dtype Int64 does not match model field type. (type=type_error.columndtype)
brand_color
  2 rows with out of bound values. (type=value_error.rowvalue)
patito.Database
- class patito.Database(query_handler, cache_directory=None, default_ttl=datetime.timedelta(days=364))
Construct manager for executing SQL queries and caching the results.
- Parameters:
query_handler (Callable[..., Table]) – The function that the Database object should use for executing SQL queries. Its first argument should be the SQL query string to execute, and it should return the query result as an arrow table, for instance pyarrow.Table.
cache_directory (Optional[Path]) – Path to the directory where caches should be stored as parquet files. If not provided, the XDG Base Directory Specification will be used to determine the suitable cache directory, by default ~/.cache/patito or ${XDG_CACHE_HOME}/patito.
default_ttl (timedelta) – The default Time To Live (TTL), or in other words, how long to wait until caches are refreshed due to old age. The given default TTL can be overwritten by specifying the ttl parameter in Database.query(). The default is 52 weeks.
Examples
We start by importing the necessary modules:
>>> from pathlib import Path
>>> import patito as pt
>>> import pyarrow as pa
In order to construct a Database, we need to provide the constructor with a function that can execute query strings. How to construct this function will depend on what you actually want to run your queries against, for example a local or remote database. For the purposes of demonstration we will use SQLite since it is built into Python's standard library, but you can use anything, for example Snowflake or PostgreSQL.
Following the example in Python's standard library documentation, we create an in-memory SQLite database. It will contain a single table named movies containing some dummy data. The details do not really matter here; the only important part is that we construct a database which we can run SQL queries against.
>>> import sqlite3
>>> def dummy_database() -> sqlite3.Cursor:
...     connection = sqlite3.connect(":memory:")
...     cursor = connection.cursor()
...     cursor.execute("CREATE TABLE movies(title, year, score)")
...     data = [
...         ("Monty Python Live at the Hollywood Bowl", 1982, 7.9),
...         ("Monty Python's The Meaning of Life", 1983, 7.5),
...         ("Monty Python's Life of Brian", 1979, 8.0),
...     ]
...     cursor.executemany("INSERT INTO movies VALUES(?, ?, ?)", data)
...     connection.commit()
...     return cursor
Using this dummy database, we are now able to construct a function which accepts SQL queries as its first parameter, executes the query, and returns the query result in the form of an Arrow table.
>>> def query_handler(query: str) -> pa.Table:
...     cursor = dummy_database()
...     cursor.execute(query)
...     columns = [description[0] for description in cursor.description]
...     data = [dict(zip(columns, row)) for row in cursor.fetchall()]
...     return pa.Table.from_pylist(data)
We can now construct a Database object, providing query_handler as the way to execute SQL queries.
>>> db = pt.Database(query_handler=query_handler)
The resulting object can now be used to execute SQL queries against the database and return the result in the form of a polars DataFrame object.
>>> db.query("select * from movies order by year limit 1")
shape: (1, 3)
┌──────────────────────────────┬──────┬───────┐
│ title                        ┆ year ┆ score │
│ ---                          ┆ ---  ┆ ---   │
│ str                          ┆ i64  ┆ f64   │
╞══════════════════════════════╪══════╪═══════╡
│ Monty Python's Life of Brian ┆ 1979 ┆ 8.0   │
└──────────────────────────────┴──────┴───────┘
But the main way to use a Database object is to use the @Database.as_query decorator to wrap functions which return SQL query strings.
>>> @db.as_query()
... def movies(newer_than_year: int):
...     return f"select * from movies where year > {newer_than_year}"
This decorator converts the function from producing query strings to actually executing the given query, returning the query result in the form of a polars DataFrame object.
>>> movies(newer_than_year=1980)
shape: (2, 3)
┌───────────────────────────────────┬──────┬───────┐
│ title                             ┆ year ┆ score │
│ ---                               ┆ ---  ┆ ---   │
│ str                               ┆ i64  ┆ f64   │
╞═══════════════════════════════════╪══════╪═══════╡
│ Monty Python Live at the Hollywo… ┆ 1982 ┆ 7.9   │
│ Monty Python's The Meaning of Li… ┆ 1983 ┆ 7.5   │
└───────────────────────────────────┴──────┴───────┘
Caching is not enabled by default, but it can be enabled by specifying cache=True to the @db.as_query(...) decorator. Other arguments are also accepted, such as lazy=True if you want to retrieve the results in the form of a LazyFrame instead of a DataFrame, and ttl if you want to specify another TTL. Any additional keyword arguments are forwarded to the query handler when the SQL query is executed. You can read more about these parameters in the documentation of Database.query().
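As a rough sketch of how these parameters combine (assuming the db and movies setup from the examples above; the wrapper name recent_movies is our own illustration), a cached, lazy query wrapper might look like this:
>>> from datetime import timedelta
>>> @db.as_query(cache=True, ttl=timedelta(hours=12), lazy=True)
... def recent_movies(newer_than_year: int):
...     return f"select * from movies where year > {newer_than_year}"
>>> recent_movies(newer_than_year=1980)  # doctest: +SKIP
<LazyFrame at ...>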
patito.Database.Query
- class patito.database.DatabaseQuery(query_constructor, cache_directory, query_handler, ttl, lazy=False, cache=False, model=None, query_handler_kwargs=None)
A class acting as a function that returns a polars.DataFrame when called.
patito.Database.Query.cache_path
- Database.Query.cache_path(*args, **kwargs)
Return the path to the parquet cache that would store the result of the query.
- Parameters:
args (ParamSpecArgs) – The positional arguments passed to the wrapped function.
kwargs (ParamSpecKwargs) – The keyword arguments passed to the wrapped function.
- Return type:
Optional[Path]
- Returns:
A deterministic path to a parquet cache. None if caching is disabled.
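For illustration, given the cached recent_movies wrapper sketched above (the exact directory and hash depend on your cache configuration and query string):
>>> recent_movies.cache_path(newer_than_year=1980)  # doctest: +SKIP
PosixPath('.../patito/recent_movies/<query-md5-hash>.parquet')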
patito.Database.Query.clear_caches
- Database.Query.clear_caches()
Delete all parquet cache files produced by this query wrapper.
- Return type:
None
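A minimal sketch, again assuming the cached recent_movies wrapper from above:
>>> recent_movies(newer_than_year=1980)  # executes the query and writes the cache  # doctest: +SKIP
>>> recent_movies.clear_caches()  # deletes every parquet cache file for this wrapper  # doctest: +SKIP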
patito.Database.Query.query_string
- Database.Query.query_string(*args, **kwargs)
Return the query to be executed for the given parameters.
- Parameters:
*args (ParamSpecArgs) – Positional arguments used to construct the query string.
**kwargs (ParamSpecKwargs) – Keyword arguments used to construct the query string.
- Return type:
str
- Returns:
The query string produced for the given input parameters.
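Continuing the movies example from the Database documentation above, the constructed query string can be inspected without executing it:
>>> movies.query_string(newer_than_year=1980)
'select * from movies where year > 1980'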
patito.Database.Query.refresh_cache
- Database.Query.refresh_cache(*args, **kwargs)
Force query execution by refreshing the cache.
- Parameters:
*args (ParamSpecArgs) – Positional arguments used to construct the SQL query string.
**kwargs (ParamSpecKwargs) – Keyword arguments used to construct the SQL query string.
- Return type:
TypeVar(DF, bound=Union[DataFrame, LazyFrame], covariant=True)
- Returns:
A DataFrame representing the result of the newly executed query.
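A minimal sketch, assuming the cached recent_movies wrapper from above; the query is re-executed and the parquet cache is overwritten with the fresh result:
>>> recent_movies.refresh_cache(newer_than_year=1980)  # doctest: +SKIP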
patito.Database.Query.__call__
- Database.Query.__call__(*args, **kwargs)
Call self as a function.
- Return type:
TypeVar(DF, bound=Union[DataFrame, LazyFrame], covariant=True)
patito.Database.as_query
- Database.as_query(*, lazy=False, cache=False, ttl=None, model=None, **kwargs)
Execute the returned query string and return a polars dataframe.
- Parameters:
lazy (bool) – If the result should be returned as a LazyFrame rather than a DataFrame. Allows more efficient reading from parquet caches if caching is enabled.
cache (Union[str, Path, bool]) – If queries should be cached in order to save time and costs. The cache will only be used if the exact same SQL string has been executed before. If the parameter is specified as True, a parquet file is created for each unique query string, located at artifacts/query_cache/<function_name>/<query_md5_hash>.parquet. If a string or pathlib.Path object is provided, the given path will be used, but it must have a '.parquet' file extension. Relative paths are interpreted relative to artifacts/query_cache/ in the workspace root. The given parquet path will be overwritten if the query string changes, so only the latest query string value will be cached.
ttl (Optional[timedelta]) – The Time to Live (TTL) of the cache specified as a datetime.timedelta object. When the cache becomes older than the specified TTL, the query will be re-executed on the next invocation of the query function and the cache will be refreshed.
model (Optional[Type[Model]]) – An optional Patito model used to validate the content of the dataframe before return.
**kwargs (Any) – Connection parameters forwarded to sql_to_polars, for example db_params.
- Return type:
Callable[[QueryConstructor[ParamSpec(P)]], DatabaseQuery[ParamSpec(P), Union[DataFrame, LazyFrame]]]
- Returns:
A new function which returns a polars DataFrame based on the query specified by the original function’s return string.
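As a sketch of the model parameter (the Movie model below is our own illustration, matching the movies table from the earlier examples), the decorated query's result is validated against the model before being returned:
>>> class Movie(pt.Model):
...     title: str
...     year: int
...     score: float
...
>>> @db.as_query(model=Movie)
... def all_movies():
...     return "select * from movies"
>>> df = all_movies()  # raises a validation error if the result does not fit Movie  # doctest: +SKIP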
patito.Database.query
- Database.query(query, *, lazy=False, cache=False, ttl=None, model=None, **kwargs)
Execute the given query and return the query result as a DataFrame or LazyFrame.
See patito.Database.as_query for a more powerful way to build and execute queries.
- Parameters:
query (str) – The query string to be executed, for instance an SQL query.
lazy (bool) – If the query result should be returned in the form of a LazyFrame instead of a DataFrame.
cache (Union[str, Path, bool]) – If the query result should be saved and re-used the next time the same query is executed. Can also be provided as a path. See Database.as_query() for full documentation.
ttl (Optional[timedelta]) – How long to use cached results until the query is re-executed anyway.
model (Optional[Type[Model]]) – A patito.Model to optionally validate the query result.
**kwargs (Any) – All additional keyword arguments are forwarded to the query handler which executes the given query.
- Return type:
Union[DataFrame, LazyFrame]
- Returns:
The result of the query in the form of a DataFrame if lazy=False, or a LazyFrame otherwise.
Examples
We will use DuckDB as our example database.
>>> import duckdb
>>> import patito as pt
We will construct a really simple query source from an in-memory database.
>>> db = duckdb.connect(":memory:")
>>> query_handler = lambda query: db.cursor().query(query).arrow()
>>> query_source = pt.Database(query_handler=query_handler)
We can now use Database.query() in order to execute queries against the in-memory database.
>>> query_source.query("select 1 as a, 2 as b, 3 as c")
shape: (1, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 │
╞═════╪═════╪═════╡
│ 1   ┆ 2   ┆ 3   │
└─────┴─────┴─────┘
patito.duckdb
patito.duckdb.Database
- Database.__init__(path=None, read_only=False, **kwargs)
Instantiate a new DuckDB database, either persisted to disk or in-memory.
- Parameters:
path (Optional[Path]) – Optional path to store all the data to. If None the data is persisted in-memory only.
read_only (bool) – If the database connection should be a read-only connection.
**kwargs (Any) – Additional keywords forwarded to duckdb.connect().
Examples
>>> import patito as pt >>> db = pt.duckdb.Database() >>> db.to_relation("select 1 as a, 2 as b").create_table("my_table") >>> db.query("select * from my_table").to_df() shape: (1, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 2 │ └─────┴─────┘
patito.duckdb.Database.create_enum_types
- Database.create_enum_types(model)
Define SQL enum types in DuckDB database.
- Parameters:
model (Type[TypeVar(ModelType, bound=Model)]) – Model for which all Literal-annotated or enum-annotated string fields will get respective DuckDB enum types.
- Return type:
None
Example
>>> from typing import Literal
>>> import patito as pt
>>> class EnumModel(pt.Model):
...     enum_column: Literal["A", "B", "C"]
...
>>> db = pt.duckdb.Database()
>>> db.create_enum_types(EnumModel)
>>> db.enum_types
{'enum__7ba49365cc1b0fd57e61088b3bc9aa25'}
patito.duckdb.Database.create_table
- Database.create_table(name, model)
Create table with schema matching the provided Patito model.
See Relation.insert_into() for how to insert data into the table after creation. The Relation.create_table() method can also be used to create a table from a given relation and insert the data at the same time.
- Parameters:
name (str) – Name of the new database table.
model (Type[Model]) – Patito model indicating names and types of table columns.
- Returns:
Relation pointing to the new table.
- Return type:
Relation[ModelType]
Example
>>> from typing import Optional >>> import patito as pt >>> class MyModel(pt.Model): ... str_column: str ... nullable_string_column: Optional[str] ... >>> db = pt.duckdb.Database() >>> db.create_table(name="my_table", model=MyModel) >>> db.table("my_table").types {'str_column': VARCHAR, 'nullable_string_column': VARCHAR}
patito.duckdb.Database.create_view
- Database.create_view(name, data)
Create a view based on the given data source.
- Return type:
Relation
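Example
A minimal sketch of usage (the view name and data are our own illustration; compare Relation.create_view() below):
>>> import patito as pt
>>> db = pt.duckdb.Database()
>>> db.create_view(name="my_view", data=pt.DataFrame({"a": [1, 2]}))
>>> db.query("select * from my_view").count()
2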
patito.duckdb.Database.default
- classmethod Database.default()
Return the default DuckDB database.
- Return type:
Database
- Returns:
A patito Database object wrapping around the given connection.
Example
>>> import patito as pt >>> db = pt.duckdb.Database.default() >>> db.query("select 1 as a, 2 as b").to_df() shape: (1, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 2 │ └─────┴─────┘
patito.duckdb.Database.empty_relation
- Database.empty_relation(schema)
Create relation with zero rows, but correct schema that matches the given model.
- Parameters:
schema (Type[TypeVar(ModelType, bound=Model)]) – A patito model which specifies the column names and types of the given relation.
- Return type:
Relation[TypeVar(ModelType, bound=Model)]
Example
>>> import patito as pt >>> class Schema(pt.Model): ... string_column: str ... bool_column: bool ... >>> db = pt.duckdb.Database() >>> empty_relation = db.empty_relation(Schema) >>> empty_relation.to_df() shape: (0, 2) ┌───────────────┬─────────────┐ │ string_column ┆ bool_column │ │ --- ┆ --- │ │ str ┆ bool │ ╞═══════════════╪═════════════╡ └───────────────┴─────────────┘ >>> non_empty_relation = db.query( ... "select 'dummy' as string_column, true as bool_column" ... ) >>> non_empty_relation.union(empty_relation).to_df() shape: (1, 2) ┌───────────────┬─────────────┐ │ string_column ┆ bool_column │ │ --- ┆ --- │ │ str ┆ bool │ ╞═══════════════╪═════════════╡ │ dummy ┆ true │ └───────────────┴─────────────┘
patito.duckdb.Database.execute
- Database.execute(query, *parameters)
Execute SQL query in DuckDB database.
- Parameters:
query (str) – A SQL statement to execute. Does not have to be terminated with a semicolon (;).
parameters (Collection[Union[str, int, float, bool]]) – One or more sets of parameters to insert into prepared statements. The values are replaced in place of the question marks (?) in the prepared query.
- Return type:
None
Example
>>> import patito as pt >>> db = pt.duckdb.Database() >>> db.execute("create table my_table (x bigint);") >>> db.execute("insert into my_table values (1), (2), (3)") >>> db.table("my_table").to_df() shape: (3, 1) ┌─────┐ │ x │ │ --- │ │ i64 │ ╞═════╡ │ 1 │ │ 2 │ │ 3 │ └─────┘
Parameters can be specified when executing prepared queries.
>>> db.execute("delete from my_table where x = ?", (2,)) >>> db.table("my_table").to_df() shape: (2, 1) ┌─────┐ │ x │ │ --- │ │ i64 │ ╞═════╡ │ 1 │ │ 3 │ └─────┘
Multiple parameter sets can be specified when executing multiple prepared queries.
>>> db.execute( ... "delete from my_table where x = ?", ... (1,), ... (3,), ... ) >>> db.table("my_table").to_df() shape: (0, 1) ┌─────┐ │ x │ │ --- │ │ i64 │ ╞═════╡ └─────┘
patito.duckdb.Database.from_connection
- classmethod Database.from_connection(connection)
Create database from native DuckDB connection object.
- Parameters:
connection (DuckDBPyConnection) – A native DuckDB connection object created with duckdb.connect().
- Return type:
Database
- Returns:
A Database object wrapping around the given connection.
Example
>>> import duckdb
>>> import patito as pt
>>> connection = duckdb.connect()
>>> database = pt.duckdb.Database.from_connection(connection)
patito.duckdb.Database.query
- Database.query(query, alias='query_relation')
Execute arbitrary SQL select query and return the relation.
- Parameters:
query (str) – Arbitrary SQL select query.
alias (str) – The alias to assign to the resulting relation, to be used in further queries.
- Return type:
Relation
- Returns:
A relation representing the data produced by the given query.
Example
>>> import patito as pt >>> db = pt.duckdb.Database() >>> relation = db.query("select 1 as a, 2 as b, 3 as c") >>> relation.to_df() shape: (1, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ 1 ┆ 2 ┆ 3 │ └─────┴─────┴─────┘
>>> relation = db.query("select 1 as a, 2 as b, 3 as c", alias="my_alias") >>> relation.select("my_alias.a").to_df() shape: (1, 1) ┌─────┐ │ a │ │ --- │ │ i64 │ ╞═════╡ │ 1 │ └─────┘
patito.duckdb.Database.table
- Database.table(name)
Return relation representing all the data in the given table.
- Parameters:
name (str) – The name of the table.
- Return type:
Relation
Example
>>> import patito as pt >>> df = pt.DataFrame({"a": [1, 2], "b": [3, 4]}) >>> db = pt.duckdb.Database() >>> relation = db.to_relation(df) >>> relation.create_table(name="my_table") >>> db.table("my_table").to_df() shape: (2, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 3 │ │ 2 ┆ 4 │ └─────┴─────┘
patito.duckdb.Database.to_relation
- Database.to_relation(derived_from)
Create a new relation object based on data source.
The given data will be represented as a relation associated with the database.
Database(x).to_relation(y) is equivalent to Relation(y, database=Database(x)).
- Parameters:
derived_from (RelationSource) – One of either a polars or pandas DataFrame, a pathlib.Path to a parquet or CSV file, a SQL query string, or an existing relation.
- Return type:
Relation
Example
>>> import patito as pt >>> db = pt.duckdb.Database() >>> db.to_relation("select 1 as a, 2 as b").to_df() shape: (1, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 2 │ └─────┴─────┘ >>> db.to_relation(pt.DataFrame({"c": [3, 4], "d": ["5", "6"]})).to_df() shape: (2, 2) ┌─────┬─────┐ │ c ┆ d │ │ --- ┆ --- │ │ i64 ┆ str │ ╞═════╪═════╡ │ 3 ┆ 5 │ │ 4 ┆ 6 │ └─────┴─────┘
patito.duckdb.Database.__contains__
- Database.__contains__(table)
Return True if the database contains a table with the given name.
- Parameters:
table (str) – The name of the table to be checked for.
- Return type:
bool
Examples
>>> import patito as pt >>> db = pt.duckdb.Database() >>> "my_table" in db False >>> db.to_relation("select 1 as a, 2 as b").create_table(name="my_table") >>> "my_table" in db True
patito.duckdb.Relation
- Relation.__init__(derived_from, database=None, model=None)
Create a new relation object containing data to be queried with DuckDB.
- Parameters:
derived_from (Union[pandas.DataFrame, polars.DataFrame, patito.DataFrame, Path, str, DuckDBPyRelation, Relation]) – Data to be represented as a DuckDB relation object. Can be one of the following types:
A pandas or polars DataFrame.
An SQL query represented as a string.
A Path object pointing to a CSV or a parquet file. The path must point to an existing file with either a .csv or .parquet file extension.
A native DuckDB relation object (duckdb.DuckDBPyRelation).
A patito.duckdb.Relation object.
database (Optional[Database]) – Which database to load the relation into. If not provided, the default DuckDB database will be used.
model (Optional[Type[TypeVar(ModelType, bound=Model)]]) – Sub-class of patito.Model which specifies how to deserialize rows when fetched with methods such as Relation.get() and __iter__(). It will also be used to create a strict table schema if a table schema should be constructed with Relation.create_table(). If not provided, a dynamic model fitting the relation schema will be created when required. Can also be set later dynamically by invoking Relation.set_model().
- Raises:
ValueError – If any one of the following cases is encountered: if a provided Path object does not have a .csv or .parquet file extension, or if both a database and a relation object are provided but the relation object does not belong to the database.
TypeError – If the type of derived_from is not supported.
Examples
Instantiated from a dataframe:
>>> import patito as pt >>> df = pt.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}) >>> pt.duckdb.Relation(df).filter("a > 2").to_df() shape: (1, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 3 ┆ 6 │ └─────┴─────┘
Instantiated from an SQL query:
>>> pt.duckdb.Relation("select 1 as a, 2 as b").to_df() shape: (1, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 2 │ └─────┴─────┘
patito.duckdb.Relation.add_prefix
- Relation.add_prefix(prefix, include=None, exclude=None)
Add a prefix to all the columns of the relation.
- Parameters:
prefix (str) – A string to prepend to all the column names.
include (Optional[Iterable[str]]) – If provided, only the given columns will be renamed.
exclude (Optional[Iterable[str]]) – If provided, the given columns will not be renamed.
- Raises:
TypeError – If both include and exclude are provided at the same time.
- Return type:
Relation
Examples
>>> import patito as pt >>> relation = pt.duckdb.Relation("select 1 as column_1, 2 as column_2") >>> relation.add_prefix("renamed_").to_df() shape: (1, 2) ┌──────────────────┬──────────────────┐ │ renamed_column_1 ┆ renamed_column_2 │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞══════════════════╪══════════════════╡ │ 1 ┆ 2 │ └──────────────────┴──────────────────┘
>>> relation.add_prefix("renamed_", include=["column_1"]).to_df() shape: (1, 2) ┌──────────────────┬──────────┐ │ renamed_column_1 ┆ column_2 │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞══════════════════╪══════════╡ │ 1 ┆ 2 │ └──────────────────┴──────────┘
>>> relation.add_prefix("renamed_", exclude=["column_1"]).to_df() shape: (1, 2) ┌──────────┬──────────────────┐ │ column_1 ┆ renamed_column_2 │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞══════════╪══════════════════╡ │ 1 ┆ 2 │ └──────────┴──────────────────┘
patito.duckdb.Relation.add_suffix
- Relation.add_suffix(suffix, include=None, exclude=None)
Add a suffix to all the columns of the relation.
- Parameters:
suffix (str) – A string to append to all the column names.
include (Optional[Collection[str]]) – If provided, only the given columns will be renamed.
exclude (Optional[Collection[str]]) – If provided, the given columns will not be renamed.
- Raises:
TypeError – If both include and exclude are provided at the same time.
- Return type:
Relation
Examples
>>> import patito as pt >>> relation = pt.duckdb.Relation("select 1 as column_1, 2 as column_2") >>> relation.add_suffix("_renamed").to_df() shape: (1, 2) ┌──────────────────┬──────────────────┐ │ column_1_renamed ┆ column_2_renamed │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞══════════════════╪══════════════════╡ │ 1 ┆ 2 │ └──────────────────┴──────────────────┘
>>> relation.add_suffix("_renamed", include=["column_1"]).to_df() shape: (1, 2) ┌──────────────────┬──────────┐ │ column_1_renamed ┆ column_2 │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞══════════════════╪══════════╡ │ 1 ┆ 2 │ └──────────────────┴──────────┘
>>> relation.add_suffix("_renamed", exclude=["column_1"]).to_df() shape: (1, 2) ┌──────────┬──────────────────┐ │ column_1 ┆ column_2_renamed │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞══════════╪══════════════════╡ │ 1 ┆ 2 │ └──────────┴──────────────────┘
patito.duckdb.Relation.aggregate
- Relation.aggregate(*aggregations, group_by, **named_aggregations)
Return relation formed by GROUP BY SQL aggregation(s).
- Parameters:
aggregations (str) – Zero or more aggregation expressions such as "sum(column_name)" and "count(distinct column_name)".
named_aggregations (str) – Zero or more aggregated expressions where the keyword is used to name the given aggregation. For example, my_column="sum(column_name)" is inserted as "sum(column_name) as my_column" in the executed SQL query.
group_by (Union[str, Iterable[str]]) – A single column name or iterable collection of column names to group by.
- Return type:
Relation
Examples
>>> import patito as pt >>> df = pt.DataFrame({"a": [1, 2, 3], "b": ["X", "Y", "X"]}) >>> relation = pt.duckdb.Relation(df) >>> relation.aggregate( ... "b", ... "sum(a)", ... "greatest(b)", ... max_a="max(a)", ... group_by="b", ... ).to_df() shape: (2, 4) ┌─────┬────────┬─────────────┬───────┐ │ b ┆ sum(a) ┆ greatest(b) ┆ max_a │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ f64 ┆ str ┆ i64 │ ╞═════╪════════╪═════════════╪═══════╡ │ X ┆ 4.0 ┆ X ┆ 3 │ │ Y ┆ 2.0 ┆ Y ┆ 2 │ └─────┴────────┴─────────────┴───────┘
patito.duckdb.Relation.alias
The alias that can be used to refer to the given relation in queries. Can be set with Relation.set_alias().
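Example
A minimal sketch (the alias value is our own illustration):
>>> import patito as pt
>>> pt.duckdb.Relation("select 1 as a").set_alias("my_alias").alias
'my_alias'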
patito.duckdb.Relation.all
- Relation.all(*filters, **equalities)
Return True if the given predicate(s) are true for all rows in the relation.
See Relation.filter() for additional information regarding the parameters.
- Parameters:
filters (str) – SQL predicates to satisfy.
equalities (Union[int, float, str]) – SQL equality predicates to satisfy.
- Return type:
bool
Examples
>>> import patito as pt >>> df = pt.DataFrame( ... { ... "even_number": [2, 4, 6], ... "odd_number": [1, 3, 5], ... "zero": [0, 0, 0], ... } ... ) >>> relation = pt.duckdb.Relation(df) >>> relation.all(zero=0) True >>> relation.all( ... "even_number % 2 = 0", ... "odd_number % 2 = 1", ... zero=0, ... ) True >>> relation.all(zero=1) False >>> relation.all("odd_number % 2 = 0") False
patito.duckdb.Relation.case
- Relation.case(*, from_column, to_column, mapping, default)
Map values of one column over to a new column.
- Parameters:
from_column (str) – Name of column defining the domain of the mapping.
to_column (str) – Name of column to insert the mapped values into.
mapping (Dict[Union[str, float, int, None], Union[str, float, int, None]]) – Dictionary defining the mapping. The dictionary keys represent the input values, while the dictionary values represent the output values. Items are inserted into the SQL case statement by their repr() string value.
default (Union[str, float, int, None]) – Default output value for inputs which have no provided mapping.
- Return type:
Relation
Examples
The following case statement…
>>> import patito as pt >>> db = pt.duckdb.Database() >>> relation = db.to_relation("select 1 as a union select 2 as a") >>> relation.case( ... from_column="a", ... to_column="b", ... mapping={1: "one", 2: "two"}, ... default="three", ... ).order(by="a").to_df() shape: (2, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ str │ ╞═════╪═════╡ │ 1 ┆ one │ │ 2 ┆ two │ └─────┴─────┘
… is equivalent to:
>>> case_statement = pt.sql.Case( ... on_column="a", ... mapping={1: "one", 2: "two"}, ... default="three", ... as_column="b", ... ) >>> relation.select(f"*, {case_statement}").order(by="a").to_df() shape: (2, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ str │ ╞═════╪═════╡ │ 1 ┆ one │ │ 2 ┆ two │ └─────┴─────┘
patito.duckdb.Relation.cast
- Relation.cast(model=None, strict=False, include=None, exclude=None)
Cast the columns of the relation to types compatible with the associated model.
The associated model must either be set by invoking Relation.set_model() or provided with the model parameter.
Any columns of the relation that are not part of the given model schema will be left as-is.
- Parameters:
model (Optional[TypeVar(ModelType, bound=Model)]) – The model to cast the relation to, used if Relation.set_model() has not been invoked or if the associated model is intended to be overwritten.
strict (bool) – If set to False, columns which are technically compliant with the specified field type will not be cast. For example, a column annotated with int is technically compliant with SMALLINT, even if INTEGER is the default SQL type associated with int-annotated fields. If strict is set to True, the resulting dtypes will be forced to the default dtype associated with each Python type.
include (Optional[Collection[str]]) – If provided, only the given columns will be cast.
exclude (Optional[Collection[str]]) – If provided, the given columns will not be cast.
- Return type:
TypeVar(RelationType, bound=Relation)
- Returns:
New relation where the columns have been cast according to the model schema.
Examples
>>> import patito as pt >>> class Schema(pt.Model): ... float_column: float ... >>> relation = pt.duckdb.Relation("select 1 as float_column") >>> relation.types["float_column"] INTEGER >>> relation.cast(model=Schema).types["float_column"] DOUBLE
>>> relation = pt.duckdb.Relation("select 1::FLOAT as float_column") >>> relation.cast(model=Schema).types["float_column"] FLOAT >>> relation.cast(model=Schema, strict=True).types["float_column"] DOUBLE
>>> class Schema(pt.Model): ... column_1: float ... column_2: float ... >>> relation = pt.duckdb.Relation( ... "select 1 as column_1, 2 as column_2" ... ).set_model(Schema) >>> relation.types {'column_1': INTEGER, 'column_2': INTEGER} >>> relation.cast(include=["column_1"]).types {'column_1': DOUBLE, 'column_2': INTEGER} >>> relation.cast(exclude=["column_1"]).types {'column_1': INTEGER, 'column_2': DOUBLE}
patito.duckdb.Relation.coalesce
- Relation.coalesce(**column_expressions)
Replace null-values in given columns with respective values.
For example, coalesce(column_name=value) is compiled to f"coalesce({column_name}, {repr(value)}) as column_name" in the resulting SQL.
- Parameters:
column_expressions (Union[str, int, float]) – Keywords indicate which columns to coalesce, while the string representations of the respective arguments are used as the null-replacements.
- Returns:
Relation where values have been filled in for nulls in the given columns.
- Return type:
Relation
Examples
>>> import patito as pt >>> df = pt.DataFrame( ... { ... "a": [1, None, 3], ... "b": ["four", "five", None], ... "c": [None, 8.0, 9.0], ... } ... ) >>> relation = pt.duckdb.Relation(df) >>> relation.coalesce(a=2, b="six").to_df() shape: (3, 3) ┌─────┬──────┬──────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ i64 ┆ str ┆ f64 │ ╞═════╪══════╪══════╡ │ 1 ┆ four ┆ null │ │ 2 ┆ five ┆ 8.0 │ │ 3 ┆ six ┆ 9.0 │ └─────┴──────┴──────┘
patito.duckdb.Relation.columns
- property Relation.columns: List[str]
Return the columns of the relation as a list of strings.
Examples
>>> import patito as pt >>> pt.duckdb.Relation("select 1 as a, 2 as b").columns ['a', 'b']
patito.duckdb.Relation.count
- Relation.count()
Return the number of rows in the given relation.
- Return type:
int
- Returns:
Number of rows in the relation as an integer.
Examples
>>> import patito as pt >>> relation = pt.duckdb.Relation("select 1 as a") >>> relation.count() 1 >>> (relation + relation).count() 2
The Relation.__len__() method invokes Relation.count() under the hood, and is equivalent:
>>> len(relation)
1
>>> len(relation + relation)
2
patito.duckdb.Relation.create_table
- Relation.create_table(name)
Create new database table based on relation.
If self.model is set with Relation.set_model(), then the model is used to infer the table schema. Otherwise, a permissive table schema is created based on the relation data.
- Returns:
A relation pointing to the newly created table.
- Return type:
Relation
Examples
>>> from typing import Literal
>>> import patito as pt
>>> df = pt.DataFrame({"enum_column": ["A", "A", "B"]}) >>> relation = pt.duckdb.Relation(df) >>> relation.create_table("permissive_table").types {'enum_column': VARCHAR}
>>> class TableSchema(pt.Model): ... enum_column: Literal["A", "B", "C"] ... >>> relation.set_model(TableSchema).create_table("strict_table").types {'enum_column': enum__7ba49365cc1b0fd57e61088b3bc9aa25}
patito.duckdb.Relation.create_view
- Relation.create_view(name, replace=False)
Create new database view based on relation.
- Returns:
A relation pointing to the newly created view.
- Return type:
Relation
Examples
>>> import patito as pt >>> db = pt.duckdb.Database() >>> df = pt.DataFrame({"column": ["A", "A", "B"]}) >>> relation = db.to_relation(df) >>> relation.create_view("my_view") >>> db.query("select * from my_view").to_df() shape: (3, 1) ┌────────┐ │ column │ │ --- │ │ str │ ╞════════╡ │ A │ │ A │ │ B │ └────────┘
patito.duckdb.Relation.distinct
- Relation.distinct()
Drop all duplicate rows of the relation.
- Return type:
TypeVar(RelationType, bound=Relation)
Example
>>> import patito as pt >>> df = pt.DataFrame( ... [[1, 2, 3], [1, 2, 3], [3, 2, 1]], ... schema=["a", "b", "c"], ... orient="row", ... ) >>> relation = pt.duckdb.Relation(df) >>> relation.to_df() shape: (3, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ 1 ┆ 2 ┆ 3 │ │ 1 ┆ 2 ┆ 3 │ │ 3 ┆ 2 ┆ 1 │ └─────┴─────┴─────┘ >>> relation.distinct().to_df() shape: (2, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ 1 ┆ 2 ┆ 3 │ │ 3 ┆ 2 ┆ 1 │ └─────┴─────┴─────┘
patito.duckdb.Relation.drop
- Relation.drop(*columns)
Remove specified column(s) from relation.
- Parameters:
columns (str) – Any number of string column names to be dropped.
- Return type:
Relation
Examples
>>> import patito as pt >>> relation = pt.duckdb.Relation("select 1 as a, 2 as b, 3 as c") >>> relation.columns ['a', 'b', 'c'] >>> relation.drop("c").columns ['a', 'b'] >>> relation.drop("b", "c").columns ['a']
patito.duckdb.Relation.except_
- Relation.except_(other)
Remove all rows that can be found in the other relation.
- Parameters:
other (Union[pandas.DataFrame, polars.DataFrame, patito.DataFrame, Path, str, DuckDBPyRelation, Relation]) – Another relation or something that can be cast to a relation.
- Return type:
TypeVar(RelationType, bound=Relation)
- Returns:
New relation without the rows that can be found in the other relation.
Example
>>> import patito as pt >>> relation_123 = pt.duckdb.Relation( ... "select 1 union select 2 union select 3" ... ) >>> relation_123.order(by="1").to_df() shape: (3, 1) ┌─────┐ │ 1 │ │ --- │ │ i64 │ ╞═════╡ │ 1 │ │ 2 │ │ 3 │ └─────┘ >>> relation_2 = pt.duckdb.Relation("select 2") >>> relation_2.to_df() shape: (1, 1) ┌─────┐ │ 2 │ │ --- │ │ i64 │ ╞═════╡ │ 2 │ └─────┘ >>> relation_123.except_(relation_2).order(by="1").to_df() shape: (2, 1) ┌─────┐ │ 1 │ │ --- │ │ i64 │ ╞═════╡ │ 1 │ │ 3 │ └─────┘
patito.duckdb.Relation.execute
- Relation.execute()
Execute built relation query and return result object.
- Return type:
DuckDBPyRelation
- Returns:
A native duckdb.DuckDBPyResult object representing the executed query.
Examples
>>> import patito as pt >>> relation = pt.duckdb.Relation( ... "select 1 as a, 2 as b union select 3 as a, 4 as b" ... ) >>> result = relation.aggregate("sum(a)", group_by="").execute() >>> result.description [('sum(a)', 'NUMBER', None, None, None, None, None)] >>> result.fetchall() [(4,)]
patito.duckdb.Relation.filter
- Relation.filter(*filters, **equalities)
Return subset of rows of relation that satisfy the given predicates.
The method returns self if no filters are provided.
- Parameters:
filters (str) – A conjunction of SQL WHERE clauses.
equalities (Union[str, int, float]) – A conjunction of SQL equality clauses. The keyword name is the column and the parameter is the value of the equality.
- Returns:
A new relation where all rows satisfy the given criteria.
- Return type:
Relation
Examples
>>> import patito as pt >>> df = pt.DataFrame( ... { ... "number": [1, 2, 3, 4], ... "string": ["A", "A", "B", "B"], ... } ... ) >>> relation = pt.duckdb.Relation(df) >>> relation.filter("number % 2 = 0").to_df() shape: (2, 2) ┌────────┬────────┐ │ number ┆ string │ │ --- ┆ --- │ │ i64 ┆ str │ ╞════════╪════════╡ │ 2 ┆ A │ │ 4 ┆ B │ └────────┴────────┘
>>> relation.filter(number=1, string="A").to_df() shape: (1, 2) ┌────────┬────────┐ │ number ┆ string │ │ --- ┆ --- │ │ i64 ┆ str │ ╞════════╪════════╡ │ 1 ┆ A │ └────────┴────────┘
patito.duckdb.Relation.get
- Relation.get(*filters, **equalities)
Fetch the single row that matches the given filter(s).
If you expect a relation to already return one row, you can use get() without any arguments to return that row.
- Raises:
RuntimeError – If not exactly one single row matches the given filter.
- Parameters:
filters (str) – A conjunction of SQL where clauses.
equalities (Any) – A conjunction of SQL equality clauses. The keyword name is the column and the parameter is the value of the equality.
- Returns:
A Patito model representing the given row.
- Return type:
Model
Examples
>>> import patito as pt >>> import polars as pl >>> df = pt.DataFrame({"product_id": [1, 2, 3], "price": [10, 10, 20]}) >>> relation = pt.duckdb.Relation(df).set_alias("my_relation")
The .get() method will by default return a dynamically constructed Patito model if no model has been associated with the given relation:
>>> relation.get(product_id=1)
my_relation(product_id=1, price=10)
If a Patito model has been associated with the relation, by the use of Relation.set_model(), then the given model will be used to represent the return type:
>>> class Product(pt.Model): ... product_id: int = pt.Field(unique=True) ... price: float ... >>> relation.set_model(Product).get(product_id=1) Product(product_id=1, price=10.0)
You can invoke .get() without any arguments on relations containing exactly one row:
>>> relation.filter(product_id=1).get()
my_relation(product_id=1, price=10)
If the given predicate matches multiple rows, a MultipleRowsReturned exception will be raised:
>>> try:
...     relation.get(price=10)
... except pt.exceptions.MultipleRowsReturned as e:
...     print(e)
...
Relation.get(price=10) returned 2 rows!
If the given predicate matches zero rows, a RowDoesNotExist exception will be raised:
>>> try:
...     relation.get(price=0)
... except pt.exceptions.RowDoesNotExist as e:
...     print(e)
...
Relation.get(price=0) returned 0 rows!
patito.duckdb.Relation.inner_join
- Relation.inner_join(other, on)
Inner join relation with other relation source based on condition.
- Parameters:
other (Union[pandas.DataFrame, polars.DataFrame, patito.DataFrame, Path, str, DuckDBPyRelation, Relation]) – A source which can be cast to a Relation object and be used as the right table in the join.
on (str) – Join condition following the INNER JOIN ... ON in the SQL query.
- Returns:
New relation based on the joined relations.
- Return type:
Relation
Example
>>> import patito as pt >>> products_df = pt.DataFrame( ... { ... "product_name": ["apple", "banana", "oranges"], ... "supplier_id": [2, 1, 3], ... } ... ) >>> products = pt.duckdb.Relation(products_df) >>> supplier_df = pt.DataFrame( ... { ... "id": [1, 2], ... "supplier_name": ["Banana Republic", "Applies Inc."], ... } ... ) >>> suppliers = pt.duckdb.Relation(supplier_df) >>> products.set_alias("p").inner_join( ... suppliers.set_alias("s"), ... on="p.supplier_id = s.id", ... ).to_df() shape: (2, 4) ┌──────────────┬─────────────┬─────┬─────────────────┐ │ product_name ┆ supplier_id ┆ id ┆ supplier_name │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 ┆ str │ ╞══════════════╪═════════════╪═════╪═════════════════╡ │ apple ┆ 2 ┆ 2 ┆ Applies Inc. │ │ banana ┆ 1 ┆ 1 ┆ Banana Republic │ └──────────────┴─────────────┴─────┴─────────────────┘
patito.duckdb.Relation.insert_into
- Relation.insert_into(table)
Insert all rows of the relation into a given table.
The relation must contain all the columns present in the target table. Extra columns are ignored and the column order is automatically matched with the target table.
- Parameters:
table (str) – Name of the table to insert values into.
- Returns:
The original relation, i.e. self.
- Return type:
Relation
Examples
>>> import patito as pt >>> db = pt.duckdb.Database() >>> db.to_relation("select 1 as a").create_table("my_table") >>> db.table("my_table").to_df() shape: (1, 1) ┌─────┐ │ a │ │ --- │ │ i64 │ ╞═════╡ │ 1 │ └─────┘ >>> db.to_relation("select 2 as a").insert_into("my_table") >>> db.table("my_table").to_df() shape: (2, 1) ┌─────┐ │ a │ │ --- │ │ i64 │ ╞═════╡ │ 1 │ │ 2 │ └─────┘
patito.duckdb.Relation.intersect
- Relation.intersect(other)
Return a new relation containing the rows that are present in both relations.
This is a set operation which will remove duplicate rows as well.
- Parameters:
other (Union[pandas.DataFrame, polars.DataFrame, patito.DataFrame, Path, str, DuckDBPyRelation, Relation]) – Another relation with the same column names.
- Returns:
A new relation with only those rows that are present in both relations.
- Return type:
Relation[Model]
Example
>>> import patito as pt >>> df1 = pt.DataFrame({"a": [1, 1, 2], "b": [1, 1, 2]}) >>> df2 = pt.DataFrame({"a": [1, 1, 3], "b": [1, 1, 3]}) >>> pt.duckdb.Relation(df1).intersect(pt.duckdb.Relation(df2)).to_df() shape: (1, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 1 │ └─────┴─────┘
patito.duckdb.Relation.join
- Relation.join(other, *, on, how='inner')
Join relation with other relation source based on condition.
See duckdb.Relation.inner_join() and Relation.left_join() for alternative method shortcuts instead of using how.
- Parameters:
other (Union[pandas.DataFrame, polars.DataFrame, patito.DataFrame, Path, str, DuckDBPyRelation, Relation]) – A source which can be cast to a Relation object and be used as the right table in the join.
on (str) – Join condition following the INNER JOIN ... ON in the SQL query.
how (Literal['inner', 'left']) – Either "left" or "inner" for what type of SQL join operation to perform.
- Returns:
New relation based on the joined relations.
- Return type:
Relation
Example
>>> import patito as pt >>> products_df = pt.DataFrame( ... { ... "product_name": ["apple", "banana", "oranges"], ... "supplier_id": [2, 1, 3], ... } ... ) >>> products = pt.duckdb.Relation(products_df) >>> supplier_df = pt.DataFrame( ... { ... "id": [1, 2], ... "supplier_name": ["Banana Republic", "Applies Inc."], ... } ... ) >>> suppliers = pt.duckdb.Relation(supplier_df) >>> products.set_alias("p").join( ... suppliers.set_alias("s"), ... on="p.supplier_id = s.id", ... how="inner", ... ).to_df() shape: (2, 4) ┌──────────────┬─────────────┬─────┬─────────────────┐ │ product_name ┆ supplier_id ┆ id ┆ supplier_name │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 ┆ str │ ╞══════════════╪═════════════╪═════╪═════════════════╡ │ apple ┆ 2 ┆ 2 ┆ Applies Inc. │ │ banana ┆ 1 ┆ 1 ┆ Banana Republic │ └──────────────┴─────────────┴─────┴─────────────────┘
>>> products.set_alias("p").join( ... suppliers.set_alias("s"), ... on="p.supplier_id = s.id", ... how="left", ... ).to_df() shape: (3, 4) ┌──────────────┬─────────────┬──────┬─────────────────┐ │ product_name ┆ supplier_id ┆ id ┆ supplier_name │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 ┆ str │ ╞══════════════╪═════════════╪══════╪═════════════════╡ │ apple ┆ 2 ┆ 2 ┆ Applies Inc. │ │ banana ┆ 1 ┆ 1 ┆ Banana Republic │ │ oranges ┆ 3 ┆ null ┆ null │ └──────────────┴─────────────┴──────┴─────────────────┘
patito.duckdb.Relation.left_join
- Relation.left_join(other, on)
Left join relation with other relation source based on condition.
- Parameters:
other (Union[pandas.DataFrame, polars.DataFrame, patito.DataFrame, Path, str, DuckDBPyRelation, Relation]) – A source which can be cast to a Relation object and be used as the right table in the join.
on (str) – Join condition following the LEFT JOIN ... ON in the SQL query.
- Returns:
New relation based on the joined tables.
- Return type:
Relation
Example
>>> import patito as pt >>> products_df = pt.DataFrame( ... { ... "product_name": ["apple", "banana", "oranges"], ... "supplier_id": [2, 1, 3], ... } ... ) >>> products = pt.duckdb.Relation(products_df) >>> supplier_df = pt.DataFrame( ... { ... "id": [1, 2], ... "supplier_name": ["Banana Republic", "Applies Inc."], ... } ... ) >>> suppliers = pt.duckdb.Relation(supplier_df) >>> products.set_alias("p").left_join( ... suppliers.set_alias("s"), ... on="p.supplier_id = s.id", ... ).to_df() shape: (3, 4) ┌──────────────┬─────────────┬──────┬─────────────────┐ │ product_name ┆ supplier_id ┆ id ┆ supplier_name │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 ┆ str │ ╞══════════════╪═════════════╪══════╪═════════════════╡ │ apple ┆ 2 ┆ 2 ┆ Applies Inc. │ │ banana ┆ 1 ┆ 1 ┆ Banana Republic │ │ oranges ┆ 3 ┆ null ┆ null │ └──────────────┴─────────────┴──────┴─────────────────┘
patito.duckdb.Relation.limit
- Relation.limit(n, *, offset=0)
Remove all but the first n rows.
- Parameters:
n (int) – The number of rows to keep.
offset (int) – Disregard the first offset rows before starting to count which rows to keep.
- Return type:
TypeVar(RelationType, bound=Relation)
- Returns:
New relation with only n rows.
Example
>>> import patito as pt >>> relation = ( ... pt.duckdb.Relation("select 1 as column") ... + pt.duckdb.Relation("select 2 as column") ... + pt.duckdb.Relation("select 3 as column") ... + pt.duckdb.Relation("select 4 as column") ... ) >>> relation.limit(2).to_df() shape: (2, 1) ┌────────┐ │ column │ │ --- │ │ i64 │ ╞════════╡ │ 1 │ │ 2 │ └────────┘ >>> relation.limit(2, offset=2).to_df() shape: (2, 1) ┌────────┐ │ column │ │ --- │ │ i64 │ ╞════════╡ │ 3 │ │ 4 │ └────────┘
patito.duckdb.Relation.model
The model associated with the relation, if set with Relation.set_model().
patito.duckdb.Relation.order
- Relation.order(by)
Change the order of the rows of the relation.
- Parameters:
by (Union[str, Iterable[str]]) – An ORDER BY SQL expression such as "age DESC" or ("age DESC", "name ASC").
- Return type:
TypeVar(RelationType, bound=Relation)
- Returns:
New relation where the rows have been ordered according to by.
Example
>>> import patito as pt >>> df = pt.DataFrame( ... { ... "name": ["Alice", "Bob", "Charles", "Diana"], ... "age": [20, 20, 30, 35], ... } ... ) >>> df shape: (4, 2) ┌─────────┬─────┐ │ name ┆ age │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════════╪═════╡ │ Alice ┆ 20 │ │ Bob ┆ 20 │ │ Charles ┆ 30 │ │ Diana ┆ 35 │ └─────────┴─────┘ >>> relation = pt.duckdb.Relation(df) >>> relation.order(by="age desc").to_df() shape: (4, 2) ┌─────────┬─────┐ │ name ┆ age │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════════╪═════╡ │ Diana ┆ 35 │ │ Charles ┆ 30 │ │ Alice ┆ 20 │ │ Bob ┆ 20 │ └─────────┴─────┘ >>> relation.order(by=["age desc", "name desc"]).to_df() shape: (4, 2) ┌─────────┬─────┐ │ name ┆ age │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════════╪═════╡ │ Diana ┆ 35 │ │ Charles ┆ 30 │ │ Bob ┆ 20 │ │ Alice ┆ 20 │ └─────────┴─────┘
patito.duckdb.Relation.rename
- Relation.rename(**columns)
Rename columns as specified.
- Parameters:
**columns (str) – A set of keyword arguments where the keyword is the old column name and the value is the new column name.
- Raises:
ValueError – If any of the given keywords do not exist as columns in the relation.
- Return type:
Relation
Examples
>>> import patito as pt >>> relation = pt.duckdb.Relation("select 1 as a, 2 as b") >>> relation.rename(b="c").to_df().select(["a", "c"]) shape: (1, 2) ┌─────┬─────┐ │ a ┆ c │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 2 │ └─────┴─────┘
patito.duckdb.Relation.select
- Relation.select(*projections, **named_projections)
Return relation based on one or more SQL SELECT projections.
Keyword arguments are converted into {arg} as {keyword} in the executed SQL query.
- Parameters:
*projections (Union[str, int, float]) – One or more strings representing SQL statements to be selected. For example "2" or "another_column".
**named_projections (Union[str, int, float]) – One or more keyword arguments where the keyword specifies the name of the new column and the value is an SQL statement defining the content of the new column. For example new_column="2 * another_column".
- Return type:
Relation
Examples
>>> import patito as pt >>> db = pt.duckdb.Database() >>> relation = db.to_relation(pt.DataFrame({"original_column": [1, 2, 3]})) >>> relation.select("*").to_df() shape: (3, 1) ┌─────────────────┐ │ original_column │ │ --- │ │ i64 │ ╞═════════════════╡ │ 1 │ │ 2 │ │ 3 │ └─────────────────┘ >>> relation.select("*", multiplied_column="2 * original_column").to_df() shape: (3, 2) ┌─────────────────┬───────────────────┐ │ original_column ┆ multiplied_column │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════════════════╪═══════════════════╡ │ 1 ┆ 2 │ │ 2 ┆ 4 │ │ 3 ┆ 6 │ └─────────────────┴───────────────────┘
patito.duckdb.Relation.set_alias
- Relation.set_alias(name)
Set SQL alias for the given relation to be used in further queries.
- Parameters:
name (str) – The new alias for the given relation.
- Returns:
A new relation containing the same query but addressable with the new alias.
- Return type:
Relation
Example
>>> import patito as pt >>> relation_1 = pt.duckdb.Relation("select 1 as a, 2 as b") >>> relation_2 = pt.duckdb.Relation("select 1 as a, 3 as c") >>> relation_1.set_alias("x").inner_join( ... relation_2.set_alias("y"), ... on="x.a = y.a", ... ).select("x.a", "y.a", "b", "c").to_df() shape: (1, 4) ┌─────┬─────┬─────┬─────┐ │ a ┆ a:1 ┆ b ┆ c │ │ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╪═════╡ │ 1 ┆ 1 ┆ 2 ┆ 3 │ └─────┴─────┴─────┴─────┘
patito.duckdb.Relation.set_model
- Relation.set_model(model)
Associate a given Patito model with the relation.
The returned relation has an associated .model attribute which can in turn be used by several methods such as Relation.get(), Relation.create_table(), and Relation.__iter__.
- Parameters:
model – A Patito Model class specifying the intended schema of the relation.
- Returns:
A new relation with the associated model.
- Return type:
Relation[model]
Example
>>> from typing import Literal >>> import patito as pt >>> class MySchema(pt.Model): ... float_column: float ... enum_column: Literal["A", "B", "C"] ... >>> relation = pt.duckdb.Relation( ... "select 1 as float_column, 'A' as enum_column" ... ) >>> relation.get() query_relation(float_column=1, enum_column='A') >>> relation.set_model(MySchema).get() MySchema(float_column=1.0, enum_column='A') >>> relation.create_table("unmodeled_table").types {'float_column': INTEGER, 'enum_column': VARCHAR} >>> relation.set_model(MySchema).create_table("modeled_table").types {'float_column': DOUBLE, 'enum_column': enum__7ba49365cc1b0fd57e61088b3bc9aa25}
patito.duckdb.Relation.to_df
- Relation.to_df()
Return a polars DataFrame representation of relation object.
- Returns:
A patito.DataFrame object which inherits from polars.DataFrame.
- Return type:
DataFrame
Example
>>> import patito as pt >>> pt.duckdb.Relation("select 1 as column union select 2 as column").order( ... by="1" ... ).to_df() shape: (2, 1) ┌────────┐ │ column │ │ --- │ │ i64 │ ╞════════╡ │ 1 │ │ 2 │ └────────┘
patito.duckdb.Relation.to_pandas
- Relation.to_pandas()
Return a pandas DataFrame representation of relation object.
- Returns:
A pandas.DataFrame object containing all the data of the relation.
- Return type:
DataFrame
Example
>>> import patito as pt >>> pt.duckdb.Relation("select 1 as column union select 2 as column").order( ... by="1" ... ).to_pandas() column 0 1 1 2
patito.duckdb.Relation.to_series
- Relation.to_series()
Convert the given relation to a polars Series.
- Raises:
TypeError – If the given relation does not contain exactly one column.
- Returns:
A polars.Series object containing the data of the relation.
- Return type:
Series
Example
>>> import patito as pt >>> relation = pt.duckdb.Relation("select 1 as a union select 2 as a") >>> relation.order(by="a").to_series() shape: (2,) Series: 'a' [i32] [ 1 2 ]
patito.duckdb.Relation.types
- Relation.types
Return the SQL types of all the columns of the given relation.
- Returns:
A dictionary where the keys are the column names and the values are SQL types as strings.
- Return type:
dict[str, str]
Examples
>>> import patito as pt >>> pt.duckdb.Relation("select 1 as a, 'my_value' as b").types {'a': INTEGER, 'b': VARCHAR}
patito.duckdb.Relation.union
- Relation.union(other)
Produce a new relation that contains the rows of both relations.
The + operator can also be used to union two relations.
The two relations must have the same column names, but not necessarily in the same order, as reordering of columns is automatically performed, unlike regular SQL.
Duplicates are not dropped.
- Parameters:
other (Union[pandas.DataFrame, polars.DataFrame, patito.DataFrame, Path, str, DuckDBPyRelation, Relation]) – A patito.duckdb.Relation object or something that can be cast to patito.duckdb.Relation. See Relation.
- Return type:
TypeVar(RelationType, bound=Relation)
- Returns:
New relation containing the rows of both self and other.
- Raises:
TypeError – If the two relations do not contain the same columns.
Examples
>>> import patito as pt >>> relation_1 = pt.duckdb.Relation("select 1 as a") >>> relation_2 = pt.duckdb.Relation("select 2 as a") >>> relation_1.union(relation_2).to_df() shape: (2, 1) ┌─────┐ │ a │ │ --- │ │ i64 │ ╞═════╡ │ 1 │ │ 2 │ └─────┘
>>> (relation_1 + relation_2).to_df() shape: (2, 1) ┌─────┐ │ a │ │ --- │ │ i64 │ ╞═════╡ │ 1 │ │ 2 │ └─────┘
patito.duckdb.Relation.with_columns
- Relation.with_columns(**named_projections)
Return relation with additional columns.
If a provided column expression already exists as a column on the relation, the given column is overwritten.
- Parameters:
named_projections (Union[str, int, float]) – A set of column expressions, where the keyword is used as the column name, while the right-hand argument is a valid SQL expression.
- Return type:
Relation
- Returns:
Relation with the given columns appended, or possibly overwritten.
Examples
>>> import patito as pt >>> db = pt.duckdb.Database() >>> relation = db.to_relation("select 1 as a, 2 as b") >>> relation.with_columns(c="a + b").to_df() shape: (1, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ 1 ┆ 2 ┆ 3 │ └─────┴─────┴─────┘
patito.duckdb.Relation.with_missing_defaultable_columns
- Relation.with_missing_defaultable_columns(include=None, exclude=None)
Add missing defaultable columns filled with the default values of correct type.
Make sure to invoke Relation.set_model() with the correct model schema before executing Relation.with_missing_defaultable_columns().
- Parameters:
include (Optional[Iterable[str]]) – If provided, only fill in default values for missing columns that are part of this collection of column names.
exclude (Optional[Iterable[str]]) – If provided, do not fill in default values for missing columns that are part of this collection of column names.
- Returns:
New relation where missing columns with default values according to the schema have been filled in.
- Return type:
Relation
Example
>>> import patito as pt >>> class MyModel(pt.Model): ... non_default_column: int ... another_non_default_column: int ... default_column: int = 42 ... another_default_column: int = 42 ... >>> relation = pt.duckdb.Relation( ... "select 1 as non_default_column, 2 as default_column" ... ) >>> relation.to_df() shape: (1, 2) ┌────────────────────┬────────────────┐ │ non_default_column ┆ default_column │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞════════════════════╪════════════════╡ │ 1 ┆ 2 │ └────────────────────┴────────────────┘ >>> relation.set_model(MyModel).with_missing_defaultable_columns().to_df() shape: (1, 3) ┌────────────────────┬────────────────┬────────────────────────┐ │ non_default_column ┆ default_column ┆ another_default_column │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ i64 │ ╞════════════════════╪════════════════╪════════════════════════╡ │ 1 ┆ 2 ┆ 42 │ └────────────────────┴────────────────┴────────────────────────┘
patito.duckdb.Relation.with_missing_nullable_columns
- Relation.with_missing_nullable_columns(include=None, exclude=None)
Add missing nullable columns filled with correctly typed nulls.
Make sure to invoke Relation.set_model() with the correct model schema before executing Relation.with_missing_nullable_columns().
- Parameters:
include (Optional[Iterable[str]]) – If provided, only fill in null values for missing columns that are part of this collection of column names.
exclude (Optional[Iterable[str]]) – If provided, do not fill in null values for missing columns that are part of this collection of column names.
- Returns:
New relation where missing nullable columns have been filled in with null values.
- Return type:
Relation
Example
>>> from typing import Optional >>> import patito as pt >>> class MyModel(pt.Model): ... non_nullable_column: int ... nullable_column: Optional[int] ... another_nullable_column: Optional[int] ... >>> relation = pt.duckdb.Relation("select 1 as nullable_column") >>> relation.to_df() shape: (1, 1) ┌─────────────────┐ │ nullable_column │ │ --- │ │ i64 │ ╞═════════════════╡ │ 1 │ └─────────────────┘ >>> relation.set_model(MyModel).with_missing_nullable_columns().to_df() shape: (1, 2) ┌─────────────────┬─────────────────────────┐ │ nullable_column ┆ another_nullable_column │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════════════════╪═════════════════════════╡ │ 1 ┆ null │ └─────────────────┴─────────────────────────┘
patito.duckdb.Relation.__add__
- Relation.__add__(other)
Execute self.union(other).
See Relation.union() for full documentation.
- Return type:
TypeVar(RelationType, bound=Relation)
patito.duckdb.Relation.__getitem__
- Relation.__getitem__(key)
Return Relation with selected columns.
Uses Relation.select() under-the-hood in order to perform the selection. Can technically be used to rename columns, define derived columns, and so on, but prefer the use of Relation.select() for such use cases.
- Parameters:
key (Union[str, Iterable[str]]) – Columns to select, either a single column represented as a string, or an iterable of strings.
- Return type:
Relation
- Returns:
New relation only containing the column subset specified.
Example
>>> import patito as pt
>>> relation = pt.duckdb.Relation("select 1 as a, 2 as b, 3 as c")
>>> relation.to_df()
shape: (1, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 2   ┆ 3   │
└─────┴─────┴─────┘
>>> relation[["a", "b"]].to_df()
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 2   │
└─────┴─────┘
>>> relation["a"].to_df()
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
└─────┘
patito.duckdb.Relation.__iter__
- Relation.__iter__()
Iterate over rows in relation.
If Relation.set_model() has been invoked first, the given model will be used to deserialize each row. Otherwise a Patito model is dynamically constructed which fits the schema of the relation.
- Returns:
An iterator of patito Model objects representing each row.
- Return type:
Iterator[Model]
Example
>>> from typing import Literal
>>> import patito as pt
>>> df = pt.DataFrame({"float_column": [1, 2], "enum_column": ["A", "B"]})
>>> relation = pt.duckdb.Relation(df).set_alias("my_relation")
>>> for row in relation:
...     print(row)
...
float_column=1 enum_column='A'
float_column=2 enum_column='B'
>>> list(relation)
[my_relation(float_column=1, enum_column='A'), my_relation(float_column=2, enum_column='B')]
>>> class MySchema(pt.Model):
...     float_column: float
...     enum_column: Literal["A", "B", "C"]
...
>>> relation = relation.set_model(MySchema)
>>> for row in relation:
...     print(row)
...
float_column=1.0 enum_column='A'
float_column=2.0 enum_column='B'
>>> list(relation)
[MySchema(float_column=1.0, enum_column='A'), MySchema(float_column=2.0, enum_column='B')]
patito.duckdb.Relation.__len__
- Relation.__len__()
Return the number of rows in the relation.
See Relation.count() for full documentation.
- Return type:
int
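Example

A minimal sketch for illustration; the query string is an arbitrary assumption:

>>> import patito as pt
>>> relation = pt.duckdb.Relation("select 1 as a union all select 2 as a")
>>> len(relation)  # equivalent to relation.count()
2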
patito.duckdb.Relation.__str__
- Relation.__str__()
Return string representation of Relation object.
Includes an expression tree, the result columns, and a result preview.
- Return type:
str
Example
>>> import patito as pt
>>> products = pt.duckdb.Relation(
...     pt.DataFrame(
...         {
...             "product_name": ["apple", "red_apple", "banana", "oranges"],
...             "supplier_id": [2, 2, 1, 3],
...         }
...     )
... ).set_alias("products")
>>> print(str(products))  # xdoctest: +SKIP
---------------------
--- Relation Tree ---
---------------------
arrow_scan(94609350519648, 140317161740928, 140317161731168, 1000000)
---------------------
-- Result Columns --
---------------------
- product_name (VARCHAR)
- supplier_id (BIGINT)
---------------------
-- Result Preview --
---------------------
product_name  supplier_id
VARCHAR       BIGINT
[ Rows: 4]
apple         2
red_apple     2
banana        1
oranges       3
>>> suppliers = pt.duckdb.Relation(
...     pt.DataFrame(
...         {
...             "id": [1, 2],
...             "supplier_name": ["Banana Republic", "Applies Inc."],
...         }
...     )
... ).set_alias("suppliers")
>>> relation = (
...     products.set_alias("p")
...     .inner_join(
...         suppliers.set_alias("s"),
...         on="p.supplier_id = s.id",
...     )
...     .aggregate(
...         "supplier_name",
...         num_products="count(product_name)",
...         group_by=["supplier_id", "supplier_name"],
...     )
... )
>>> print(str(relation))  # xdoctest: +SKIP
---------------------
--- Relation Tree ---
---------------------
Aggregate [supplier_name, count(product_name)]
  Join INNER p.supplier_id = s.id
    arrow_scan(94609350519648, 140317161740928, 140317161731168, 1000000)
    arrow_scan(94609436221024, 140317161740928, 140317161731168, 1000000)
---------------------
-- Result Columns --
---------------------
- supplier_name (VARCHAR)
- num_products (BIGINT)
---------------------
-- Result Preview --
---------------------
supplier_name    num_products
VARCHAR          BIGINT
[ Rows: 2]
Applies Inc.     2
Banana Republic  1
Licence
MIT License
Copyright (c) 2022 Oda Group Holding AS
Copyright (c) 2023 Jakob Gerhard Martinussen and contributors
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Patito offers a simple way to declare pydantic data models which double as schemas for your polars data frames. These schemas can be used for simple and performant data frame validation, for generating valid mock data frames in tests, and for retrieving and representing singular rows in an object-oriented manner.
Patito has first-class support for polars, a “blazingly fast DataFrames library written in Rust”.
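As a minimal sketch of the validation use case (the Temperature model and the example frames below are hypothetical, and the exact exception raised on invalid data varies between Patito versions, so the failing call is only indicated in a comment):

import patito as pt
import polars as pl

class Temperature(pt.Model):
    celsius: float

# A frame that conforms to the model validates without raising:
Temperature.validate(pl.DataFrame({"celsius": [21.0, 22.5]}))

# A null in the non-optional "celsius" column would raise a validation error:
# Temperature.validate(pl.DataFrame({"celsius": [21.0, None]}))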
Installation
You can simply install Patito with pip like so:
pip install patito
DuckDB Integration
Patito can also integrate with DuckDB. In order to enable this integration you must explicitly specify it during installation:
pip install 'patito[duckdb]'
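With the extra installed, the pt.duckdb.Relation class documented in the reference above becomes available. As a quick smoke test, here is a sketch mirroring the earlier examples:

>>> import patito as pt
>>> pt.duckdb.Relation("select 1 as a").to_df()
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
└─────┘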