Simple ETL Configuration with Pydantic Settings

Tools used:

Python

Small ETL projects do not need a complicated configuration layer. Most of the time I want three things: read a few values from .env, derive predictable file paths from the project root, and fail early when the expected folders are missing.

For that shape, I like one Settings class with typed environment values and computed path fields. The environment controls file names and runtime knobs; the Python code owns path construction.

Tiny Project Shape

etl/
├── .env
├── data/
│   ├── raw/
│   └── out/
└── src/etl/config.py

.env

ETL_BATCH_SIZE=5000
ETL_INPUT_FILE=customers.csv
ETL_OUTPUT_FILE=customers_clean.parquet

Settings

config.py

Python

from functools import cached_property
from pathlib import Path

from pydantic import Field, computed_field
from pydantic_settings import BaseSettings, SettingsConfigDict


ROOT = Path(__file__).resolve().parents[2]


class Settings(BaseSettings):
    model_config = SettingsConfigDict(
        env_file=ROOT / ".env",
        env_prefix="ETL_",
        extra="ignore",
    )

    batch_size: int = Field(default=1000, ge=1)
    input_file: str = "customers.csv"
    output_file: str = "customers_clean.parquet"

    @computed_field
    @cached_property
    def raw_dir(self) -> Path:
        return ROOT / "data" / "raw"

    @computed_field
    @cached_property
    def out_dir(self) -> Path:
        return ROOT / "data" / "out"

    @computed_field
    @cached_property
    def input_path(self) -> Path:
        return self.raw_dir / self.input_file

    @computed_field
    @cached_property
    def output_path(self) -> Path:
        return self.out_dir / self.output_file


settings = Settings()

print(settings.batch_size)
print(settings.input_path)
print(settings.output_path)

# --------------------
5000
/etl/data/raw/customers.csv
/etl/data/out/customers_clean.parquet

The useful split is small but important: BaseSettings reads and validates the values that can change between environments, while the computed fields keep all path construction in one place.

@computed_field also means the derived paths are included when you serialize the model, for example with model_dump(), so a logged settings snapshot shows them next to the raw values. @cached_property means each path is calculated once per Settings instance. That is a good fit for ETL scripts where configuration is created at startup and treated as read-only.

The caveat is the same one that makes the pattern useful: once a path has been read, its cached value is not recomputed, so do not mutate settings.input_file later and expect settings.input_path to follow. Create a new Settings object if the configuration changes.

Directory Check

I still keep filesystem checks outside the settings class. Creating Settings() should read configuration. Starting the pipeline should check whether the expected folders exist.

check required folders

Python

def check_dirs(settings: Settings) -> None:
    missing = [path for path in (settings.raw_dir, settings.out_dir) if not path.is_dir()]
    if missing:
        lines = "\n".join(f"- {path}" for path in missing)
        raise FileNotFoundError(f"Missing ETL directories:\n{lines}")


check_dirs(settings)
print("ready")

# --------------------
ready

That keeps validation honest: field validation belongs to Pydantic, and checks against the outside world happen at the edge of the pipeline.
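To see the failure path without touching a real project, the check can be exercised against a temporary directory. This is a sketch using only the standard library; fake_settings is a hypothetical stand-in that carries just the two attributes check_dirs reads, and check_dirs is repeated so the snippet runs on its own.

```python
import tempfile
from pathlib import Path
from types import SimpleNamespace


def check_dirs(settings) -> None:
    # Same logic as check_dirs above, repeated for a standalone run.
    missing = [path for path in (settings.raw_dir, settings.out_dir) if not path.is_dir()]
    if missing:
        lines = "\n".join(f"- {path}" for path in missing)
        raise FileNotFoundError(f"Missing ETL directories:\n{lines}")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "data" / "raw").mkdir(parents=True)  # data/out is left out on purpose

    # Hypothetical stand-in for Settings with only the two directory fields.
    fake_settings = SimpleNamespace(
        raw_dir=root / "data" / "raw",
        out_dir=root / "data" / "out",
    )

    try:
        check_dirs(fake_settings)
    except FileNotFoundError as exc:
        message = str(exc)
    else:
        message = ""

    # Once the folder exists, the same call passes silently.
    (root / "data" / "out").mkdir()
    check_dirs(fake_settings)

print(message)
```

The error lists every missing folder at once, so a fresh checkout can be fixed in one pass rather than one failed run per directory.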

In the Pipeline

pipeline.py

Python

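# read_customers, clean_customers, and write_customers are the project's own
# helpers; they are not shown in this post.
settings = Settings()
check_dirs(settings)  # fail before any data is read

df = read_customers(settings.input_path, batch_size=settings.batch_size)
clean = clean_customers(df)
write_customers(clean, settings.output_path)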

print(clean.head(3).to_string(index=False))

# --------------------
 customer country  revenue
Company A      AT     1200
Company B      DE      940
Company C      AT      650

That is enough for most small ETL scripts: typed env values, visible derived paths, one explicit directory check, and no hidden configuration layer doing too much behind the scenes.