Small ETL projects do not need a complicated configuration layer. Most of the time I want three things: read a few values from .env, derive predictable file paths from the project root, and fail early when the expected folders are missing.
For that shape, I like one Settings class with typed environment values and computed path fields. The environment controls file names and runtime knobs; the Python code owns path construction.
Tiny Project Shape
etl/
├── .env
├── data/
│   ├── raw/
│   └── out/
└── src/etl/config.py

# .env
ETL_BATCH_SIZE=5000
ETL_INPUT_FILE=customers.csv
ETL_OUTPUT_FILE=customers_clean.parquet
Settings
from functools import cached_property
from pathlib import Path

from pydantic import Field, computed_field
from pydantic_settings import BaseSettings, SettingsConfigDict

# config.py lives at etl/src/etl/config.py, so parents[2] is the project root (etl/).
ROOT = Path(__file__).resolve().parents[2]


class Settings(BaseSettings):
    model_config = SettingsConfigDict(
        env_file=ROOT / ".env",
        env_prefix="ETL_",  # ETL_BATCH_SIZE maps to batch_size, and so on
        extra="ignore",
    )

    # Values that may differ between environments.
    batch_size: int = Field(default=1000, ge=1)
    input_file: str = "customers.csv"
    output_file: str = "customers_clean.parquet"

    # Derived paths: built in code, never read from the environment.
    @computed_field
    @cached_property
    def raw_dir(self) -> Path:
        return ROOT / "data" / "raw"

    @computed_field
    @cached_property
    def out_dir(self) -> Path:
        return ROOT / "data" / "out"

    @computed_field
    @cached_property
    def input_path(self) -> Path:
        return self.raw_dir / self.input_file

    @computed_field
    @cached_property
    def output_path(self) -> Path:
        return self.out_dir / self.output_file


settings = Settings()
print(settings.batch_size)
print(settings.input_path)
print(settings.output_path)
# --------------------
5000
/etl/data/raw/customers.csv
/etl/data/out/customers_clean.parquet
The useful split is small but important: BaseSettings reads and validates the values that can change between environments, while the computed fields keep all path construction in one place.
@computed_field also means those derived paths are part of the settings snapshot if you serialize the model. @cached_property means each path is calculated once per Settings instance. That is a good fit for ETL scripts where configuration is created at startup and treated as read-only.
The caveat comes from the same thing that makes the pattern useful: a cached path is never recomputed. Do not mutate settings.input_file later and expect settings.input_path to follow. Create a new Settings object if the configuration changes.
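The stale-path behavior comes from `functools.cached_property` itself, so it can be shown without Pydantic. In this sketch `PathsDemo` is a hypothetical plain dataclass standing in for `Settings`; only the caching mechanism is the same.

```python
from dataclasses import dataclass, replace
from functools import cached_property
from pathlib import Path


@dataclass
class PathsDemo:
    """Hypothetical stand-in for Settings; only the caching matters here."""

    input_file: str = "customers.csv"

    @cached_property
    def input_path(self) -> Path:
        return Path("data") / "raw" / self.input_file


demo = PathsDemo()
first = demo.input_path          # computed and cached on first access
demo.input_file = "orders.csv"   # mutation does not invalidate the cache
stale = demo.input_path          # still the path for customers.csv

# A new instance starts with an empty cache, so the path is derived again.
fresh = replace(demo, input_file="orders.csv").input_path

print(stale.name)  # customers.csv
print(fresh.name)  # orders.csv
```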
Directory Check
I still keep filesystem checks outside the settings class. Creating Settings() should read configuration. Starting the pipeline should check whether the expected folders exist.
def check_dirs(settings: Settings) -> None:
    missing = [path for path in (settings.raw_dir, settings.out_dir) if not path.is_dir()]
    if missing:
        lines = "\n".join(f"- {path}" for path in missing)
        raise FileNotFoundError(f"Missing ETL directories:\n{lines}")


check_dirs(settings)
print("ready")
# --------------------
ready
That keeps validation honest: field validation belongs to Pydantic, and checks against the outside world happen at the edge of the pipeline.
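To see the failure path without touching a real project, the check can be pointed at a temporary directory. This is a sketch: `DirsDemo` is a hypothetical stand-in that exposes the same `raw_dir` and `out_dir` attributes as `Settings`, and `check_dirs` is repeated so the example is self-contained.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class DirsDemo:
    """Hypothetical stand-in with the two attributes check_dirs reads."""

    raw_dir: Path
    out_dir: Path


def check_dirs(settings: DirsDemo) -> None:
    missing = [path for path in (settings.raw_dir, settings.out_dir) if not path.is_dir()]
    if missing:
        lines = "\n".join(f"- {path}" for path in missing)
        raise FileNotFoundError(f"Missing ETL directories:\n{lines}")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "data" / "raw").mkdir(parents=True)  # create raw/ but not out/
    demo = DirsDemo(raw_dir=root / "data" / "raw", out_dir=root / "data" / "out")
    try:
        check_dirs(demo)
        message = ""
    except FileNotFoundError as exc:
        message = str(exc)

# The message lists only the directory that is actually missing.
print(message)
```

Because the error names every missing folder at once, one run is enough to see what needs to be created.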
In the Pipeline
settings = Settings()
check_dirs(settings)
df = read_customers(settings.input_path, batch_size=settings.batch_size)
clean = clean_customers(df)
write_customers(clean, settings.output_path)
print(clean.head(3).to_string(index=False))
# --------------------
 customer  country  revenue
Company A       AT     1200
Company B       DE      940
Company C       AT      650
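The helpers read_customers, clean_customers, and write_customers are not defined in this post. One way they could look, as a sketch only: it assumes pandas, a CSV with customer, country, and revenue columns, and cleaning rules invented purely for illustration. Writing Parquet also assumes a Parquet engine such as pyarrow is installed.

```python
from io import StringIO

import pandas as pd


def read_customers(source, batch_size: int) -> pd.DataFrame:
    """Read the CSV in chunks of batch_size rows, then combine the chunks."""
    chunks = pd.read_csv(source, chunksize=batch_size)
    return pd.concat(chunks, ignore_index=True)


def clean_customers(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical rules: drop unnamed rows, normalize country codes, sort."""
    out = df.dropna(subset=["customer"]).copy()
    out["country"] = out["country"].str.strip().str.upper()
    return out.sort_values("revenue", ascending=False, ignore_index=True)


def write_customers(df: pd.DataFrame, path) -> None:
    """Write the cleaned frame as Parquet (needs pyarrow or fastparquet)."""
    df.to_parquet(path, index=False)


# Tiny in-memory input so the read and clean steps run without any files.
raw = StringIO(
    "customer,country,revenue\n"
    "Company B, de ,940\n"
    ",AT,10\n"
    "Company A,at,1200\n"
)
clean = clean_customers(read_customers(raw, batch_size=2))
print(clean.to_string(index=False))
```

With real settings, `source` would be `settings.input_path` and the write step would receive `settings.output_path`, exactly as in the pipeline snippet.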
That is enough for most small ETL scripts: typed env values, visible derived paths, one explicit directory check, and no hidden configuration layer doing too much behind the scenes.