Project source-tree

Below is the layout of the project (to 10 levels), followed by the contents of each key file.

Project directory layout

safezip/
├── src
│   └── safezip
│       ├── cli
│       │   ├── __init__.py
│       │   └── _main.py
│       ├── tests
│       │   ├── __init__.py
│       │   ├── conftest.py
│       │   ├── test_cli.py
│       │   ├── test_guard.py
│       │   ├── test_integration.py
│       │   ├── test_sandbox.py
│       │   └── test_streamer.py
│       ├── __init__.py
│       ├── _core.py
│       ├── _events.py
│       ├── _exceptions.py
│       ├── _guard.py
│       ├── _sandbox.py
│       ├── _streamer.py
│       └── py.typed
├── AGENTS.md
├── conftest.py
├── CONTRIBUTING.rst
├── docker-compose.yml
├── Dockerfile
├── Makefile
├── pyproject.toml
├── README.rst
└── tox.ini

README.rst

=======
safezip
=======
.. image:: https://raw.githubusercontent.com/barseghyanartur/safezip/main/docs/_static/safezip_logo.webp
   :alt: SafeZip Logo
   :align: center

Hardened ZIP extraction for Python - secure by default.

.. image:: https://img.shields.io/pypi/v/safezip.svg
   :target: https://pypi.python.org/pypi/safezip
   :alt: PyPI Version

.. image:: https://img.shields.io/pypi/pyversions/safezip.svg
   :target: https://pypi.python.org/pypi/safezip/
   :alt: Supported Python versions

.. image:: https://github.com/barseghyanartur/safezip/actions/workflows/test.yml/badge.svg?branch=main
   :target: https://github.com/barseghyanartur/safezip/actions
   :alt: Build Status

.. image:: https://readthedocs.org/projects/safezip/badge/?version=latest
    :target: http://safezip.readthedocs.io
    :alt: Documentation Status

.. image:: https://img.shields.io/badge/docs-llms.txt-blue
    :target: https://safezip.readthedocs.io/en/latest/llms.txt
    :alt: llms.txt - documentation for LLMs

.. image:: https://deepwiki.com/badge.svg
    :target: https://deepwiki.com/barseghyanartur/safezip
    :alt: Ask DeepWiki

.. image:: https://img.shields.io/badge/license-MIT-blue.svg
   :target: https://github.com/barseghyanartur/safezip/#License
   :alt: MIT

.. image:: https://coveralls.io/repos/github/barseghyanartur/safezip/badge.svg?branch=main&service=github
    :target: https://coveralls.io/github/barseghyanartur/safezip?branch=main
    :alt: Coverage

``safezip`` is a zero-dependency, hardened wrapper around Python's
``zipfile`` module that defends against the most common ZIP-based attacks:
ZipSlip path traversal, ZIP bombs, and malformed/crafted archives.

Features
========

- **ZipSlip protection** - relative traversal (``..``), absolute Unix and
  Windows paths, Windows UNC paths, and Windows drive-relative paths are all
  blocked. Filenames are Unicode NFC-normalised before path resolution,
  which defeats combining-character disguises of ``..`` components. Null
  bytes in filenames cause Python's ``zipfile`` layer to truncate the name
  before safezip sees it; the Sandbox also checks for null bytes as
  defence-in-depth.
- **ZIP bomb protection** - per-member and cumulative decompression ratio
  limits abort extraction before runaway decompression can exhaust disk or
  memory.
- **File size limits** - per-member size is checked against the declared header
  value at open time (Guard phase) and again against actual decompressed bytes
  during streaming (Streamer phase).  Total extraction size is enforced
  cumulatively across all members at stream time.
- **ZIP64 consistency checks** - crafted archives with inconsistent ZIP64
  extra fields are rejected before decompression begins.
- **Symlink policy** - configurable: ``REJECT`` (default), ``IGNORE``, or
  ``RESOLVE_INTERNAL`` (symlink entries are extracted as regular files; no OS
  symlink is created on disk).
- **Atomic writes** - every member is written to a temporary file first;
  the destination is only created after all checks pass.  No partial files
  are left on disk after a security abort.
- **Secure by default** - all limits are active without any configuration.
- **Zero dependencies** - standard library only.
- **Environment variable overrides** - all
  limits (including ``symlink_policy``) can be set via ``SAFEZIP_*``
  environment variables for containerised deployments.

Prerequisites
=============

Python 3.10 or later.  No additional packages required.

Installation
============
With ``uv``:

.. code-block:: sh

    uv pip install safezip

Or with ``pip``:

.. code-block:: sh

    pip install safezip

Quick start
===========

Drop-in replacement for the common ``zipfile`` extraction pattern:

.. pytestfixture: file_zip
.. code-block:: python
    :name: test_safe_extract

    from safezip import safe_extract

    safe_extract("path/to/file.zip", "/var/files/extracted/")

Or use the ``SafeZipFile`` context manager for more control:

.. pytestfixture: file_zip
.. code-block:: python
    :name: test_safe_zipfile

    from safezip import SafeZipFile

    with SafeZipFile("path/to/file.zip") as zf:
        print(zf.namelist())
        zf.extractall("/var/files/extracted/")

Custom limits
=============
See the `Default limits`_ for reference.

.. pytestfixture: file_zip
.. code-block:: python
    :name: test_custom_limits

    from safezip import SafeZipFile, SymlinkPolicy

    with SafeZipFile(
        "path/to/file.zip",
        max_file_size=100 * 1024 * 1024,      # 100 MiB per member (default: 1 GiB)
        max_total_size=500 * 1024 * 1024,     # 500 MiB total (default: 5 GiB)
        max_files=1_000,                      # (default: 10 000)
        max_per_member_ratio=50.0,            # (default: 200)
        max_total_ratio=50.0,                 # (default: 200)
        max_nesting_depth=1,                  # (default: 3)
        symlink_policy=SymlinkPolicy.IGNORE,  # (default: SymlinkPolicy.REJECT)
    ) as zf:
        zf.extractall("/var/files/extracted/")

Recursive extraction
====================

When an archive contains nested ``.zip`` files, set ``recursive=True`` to
descend into them automatically. All safety limits apply at every level. Each
nested archive is extracted into a directory named after it (without the
extension). Nested archive detection is content-based (not extension-based):
the member is streamed to a temporary file, inspected with
``zipfile.is_zipfile()``, and either recursed into (temp file used as input,
then deleted) or renamed to its final destination as a plain file.

.. pytestfixture: nested_file_zip
.. code-block:: python
    :name: test_recursive_extraction

    from safezip import SafeZipFile

    # archive.zip
    #   readme.txt
    #   data.zip          ← will be descended into, not extracted as a blob
    #     report.csv

    with SafeZipFile("path/to/archive.zip", recursive=True, max_nesting_depth=3) as zf:
        zf.extractall("/var/files/extracted/")

    # Result on disk:
    #   /var/files/extracted/readme.txt
    #   /var/files/extracted/data/report.csv

With ``max_nesting_depth=0``, opening any nested archive raises
``NestingDepthError`` before extracting a single byte from it:

.. pytestfixture: nested_file_zip
.. code-block:: python
    :name: test_recursive_extraction_depth_limit

    import pytest
    from safezip import SafeZipFile, NestingDepthError

    # archive.zip
    #   readme.txt
    #   data.zip          ← depth 1 exceeds max_nesting_depth=0 → NestingDepthError
    #     report.csv

    with pytest.raises(NestingDepthError):
        with SafeZipFile(
            "path/to/archive.zip", recursive=True, max_nesting_depth=0
        ) as zf:
            zf.extractall("/var/files/extracted/")

Security event monitoring
=========================

.. pytestfixture: file_zip
.. code-block:: python
    :name: test_security_event_monitoring

    from safezip import SafeZipFile, SecurityEvent

    def my_monitor(event: SecurityEvent) -> None:
        print(f"[safezip] {event.event_type} archive={event.archive_hash}")

    with SafeZipFile("path/to/file.zip", on_security_event=my_monitor) as zf:
        zf.extractall("/var/files/extracted/")

Environment variable overrides
==============================
See the `Default limits`_ for reference.

All limits can be overridden without changing code:

.. code-block:: sh

    export SAFEZIP_MAX_FILE_SIZE=104857600    # 100 MiB (default: 1 GiB)
    export SAFEZIP_MAX_TOTAL_SIZE=524288000   # 500 MiB (default: 5 GiB)
    export SAFEZIP_MAX_FILES=1000             # (default: 10 000)
    export SAFEZIP_MAX_PER_MEMBER_RATIO=50    # (default: 200)
    export SAFEZIP_MAX_TOTAL_RATIO=50         # (default: 200)
    export SAFEZIP_MAX_NESTING_DEPTH=1        # (default: 3)
    export SAFEZIP_SYMLINK_POLICY=ignore      # reject | ignore | resolve_internal (default: reject)

Default limits
==============

+--------------------------+------------+
| Parameter                | Default    |
+==========================+============+
| ``max_file_size``        | 1 GiB      |
+--------------------------+------------+
| ``max_total_size``       | 5 GiB      |
+--------------------------+------------+
| ``max_files``            | 10 000     |
+--------------------------+------------+
| ``max_per_member_ratio`` | 200        |
+--------------------------+------------+
| ``max_total_ratio``      | 200        |
+--------------------------+------------+
| ``max_nesting_depth``    | 3          |
+--------------------------+------------+
| ``symlink_policy``       | REJECT     |
+--------------------------+------------+
| ``recursive``            | False      |
+--------------------------+------------+

Testing
=======

All tests run inside Docker to prevent accidental pollution of the host system:

.. code-block:: sh

    make test

To test a specific Python version:

.. code-block:: sh

    make test-env ENV=py312

Writing documentation
=====================

Keep the following hierarchy:

.. code-block:: text

    =====
    title
    =====

    header
    ======

    sub-header
    ----------

    sub-sub-header
    ~~~~~~~~~~~~~~

    sub-sub-sub-header
    ^^^^^^^^^^^^^^^^^^

    sub-sub-sub-sub-header
    ++++++++++++++++++++++

    sub-sub-sub-sub-sub-header
    **************************

License
=======

MIT

Support
=======
For security issues, contact me at the e-mail given in the `Author`_ section.

For all other issues, go
to `GitHub <https://github.com/barseghyanartur/safezip/issues>`_.

Author
======

Artur Barseghyan <artur.barseghyan@gmail.com>

CONTRIBUTING.rst

Contributor guidelines
======================

.. _safezip: https://github.com/barseghyanartur/safezip/
.. _uv: https://docs.astral.sh/uv/
.. _tox: https://tox.wiki
.. _ruff: https://beta.ruff.rs/docs/
.. _doc8: https://doc8.readthedocs.io/
.. _pre-commit: https://pre-commit.com/#installation
.. _issues: https://github.com/barseghyanartur/safezip/issues
.. _discussions: https://github.com/barseghyanartur/safezip/discussions
.. _pull request: https://github.com/barseghyanartur/safezip/pulls
.. _versions manifest: https://github.com/actions/python-versions/blob/main/versions-manifest.json

Developer prerequisites
-----------------------

pre-commit
~~~~~~~~~~

Refer to `pre-commit`_ for installation instructions.

TL;DR:

.. code-block:: sh

    curl -LsSf https://astral.sh/uv/install.sh | sh  # Install uv
    uv tool install pre-commit                        # Install pre-commit
    pre-commit install                                # Install hooks

Installing `pre-commit`_ ensures all contributions adhere to the project's
code quality standards.

Code standards
--------------

`ruff`_ and `doc8`_ are triggered automatically by `pre-commit`_.

To run checks manually:

.. code-block:: sh

    make doc8
    make ruff

Virtual environment
-------------------

.. code-block:: sh

    make create-venv

Installation
------------

.. code-block:: sh

    make install

Testing
-------

**All tests must be run inside Docker.**  This prevents malicious test
archives from being accidentally extracted onto the host filesystem.

.. code-block:: sh

    make test

To test a single environment:

.. code-block:: sh

    make test-env ENV=py312

For an interactive shell inside the container:

.. code-block:: sh

    make shell

In any case, GitHub Actions runs the full matrix automatically on every push.

Releases
--------
**Build the package for releasing:**

.. code-block:: sh

    make package-build

----

**Test the built package:**

.. code-block:: sh

    make check-package-build

----

**Make a test release (test.pypi.org):**

.. code-block:: sh

    make test-release

----

**Release (pypi.org):**

.. code-block:: sh

    make release

Adding tests
------------

- All test archives must be crafted programmatically in ``conftest.py`` using
  Python's ``struct`` module or ``zipfile``.  Do not commit pre-built ``.zip``
  files.
- Every new security check must have a corresponding test in the relevant
  ``test_*.py`` file.
- Integration tests must verify that no partial files remain on disk after a
  security abort (atomic write contract).

Pull requests
-------------

Open a `pull request`_ against the ``dev`` branch only, never directly against
``main``.

.. note::

    Create pull requests to the ``dev`` branch only!

Examples of welcome contributions:

- Fixing documentation typos or improving explanations.
- Adding test cases for new edge cases.
- Extending support for additional archive attack vectors.
- Improving error messages.

General checklist
~~~~~~~~~~~~~~~~~

- Does your change require documentation updates?
- Does your change require new tests?
- Does your change add any external dependencies?
  If so, reconsider: ``safezip`` is intentionally dependency-free.

When fixing bugs
~~~~~~~~~~~~~~~~

- Add a regression test that reproduces the bug before your fix.

When adding a new feature
~~~~~~~~~~~~~~~~~~~~~~~~~

- Update ``README.rst`` (quick start, default limits table if relevant).
- Update ``plan.rst`` if the architectural design changes.
- Add appropriate tests in the correct ``test_*.py`` file.

GitHub Actions
--------------

Tests run on Python 3.10–3.14 (all non-EOL versions).  See the
`versions manifest`_ for the full list of available Python versions.

Questions
---------

Ask on GitHub `discussions`_.

Issues
------

Report bugs or request features on GitHub `issues`_.

**Do not report security vulnerabilities on GitHub.**
Contact the author directly at artur.barseghyan@gmail.com.

AGENTS.md

# AGENTS.md — safezip

**Package version**: See pyproject.toml
**Repository**: <https://github.com/barseghyanartur/safezip>
**Maintainer**: Artur Barseghyan <artur.barseghyan@gmail.com>

This file is for AI agents and developers using AI assistants to work on or with
safezip. It covers two distinct roles: **using** the package in application code,
and **developing/extending** the package itself.

---

## 1. Project Mission (Never Deviate)

> Hardened ZIP extraction for Python — secure by default, zero dependencies,
> production-grade.

- Secure defaults are never relaxed without an explicit caller decision.
- No external dependencies. Ever.
- The three-phase security model (Guard → Sandbox → Streamer) is preserved.
- No partial files on disk after a security abort.

---

## 2. Using safezip in Application Code

### Simple case

<!-- pytestfixture: file_zip -->
```python name=test_simple_case
from safezip import safe_extract

# Secure defaults protect against all common attacks
safe_extract("path/to/file.zip", "/var/files/extracted/")
```

### With monitoring and custom limits

<!-- pytestfixture: file_zip -->
```python name=test_with_monitoring_and_custom_limits
from safezip import SafeZipFile, SecurityEvent

def monitor(event: SecurityEvent) -> None:
    print(f"Security event: {event.event_type}")

with SafeZipFile(
    "path/to/file.zip",
    max_file_size=100 * 1024 * 1024,  # 100 MiB per member
    on_security_event=monitor,
) as zf:
    zf.extractall("/var/files/extracted/")
```

### Exception handling

All safezip exceptions inherit from `SafezipError`:

<!-- pytestfixture: file_zip -->
```python name=test_exception_handling
from safezip import (
    safe_extract,
    SafezipError,
    UnsafeZipError,          # path traversal or disallowed symlink
    CompressionRatioError,   # ZIP bomb attempt
    FileSizeExceededError,   # member too large
    TotalSizeExceededError,  # cumulative size exceeded
    FileCountExceededError,  # too many entries
    MalformedArchiveError,   # structurally invalid archive
    NestingDepthError,       # nested archive depth exceeded
)

try:
    safe_extract("path/to/file.zip", "/var/files/extracted/")
except UnsafeZipError:
    ...
except CompressionRatioError:
    ...
except SafezipError:
    # catch-all for any safezip violation
    ...
```

### Secure defaults reference

<!-- pytestfixture: file_zip -->
```python name=test_secure_defaults_reference
from safezip import SafeZipFile, SymlinkPolicy

SafeZipFile(
    "path/to/file.zip",
    max_file_size=1 * 1024**3,       # 1 GiB per member
    max_total_size=5 * 1024**3,      # 5 GiB total
    max_files=10_000,
    max_per_member_ratio=200.0,
    max_total_ratio=200.0,
    max_nesting_depth=3,
    symlink_policy=SymlinkPolicy.REJECT,
)
```

All limits are overridable via environment variables:

| Variable | Type | Default |
| --- | --- | --- |
| `SAFEZIP_MAX_FILE_SIZE` | int (bytes) | 1 GiB |
| `SAFEZIP_MAX_TOTAL_SIZE` | int (bytes) | 5 GiB |
| `SAFEZIP_MAX_FILES` | int | 10 000 |
| `SAFEZIP_MAX_PER_MEMBER_RATIO` | float | 200.0 |
| `SAFEZIP_MAX_TOTAL_RATIO` | float | 200.0 |
| `SAFEZIP_MAX_NESTING_DEPTH` | int | 3 |
| `SAFEZIP_SYMLINK_POLICY` | str | reject |

Resolution order: constructor argument > environment variable > hardcoded default.
Invalid env values are logged and then ignored.
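
The resolution order above can be sketched as follows. This is a minimal
illustration, not safezip's actual code: `resolve_int_limit` is a hypothetical
helper name, and falling back to the default after an invalid value is an
assumption based on the rule stated above.

```python
import logging
import os
from typing import Optional

logger = logging.getLogger("safezip.security")


def resolve_int_limit(
    explicit: Optional[int], env_name: str, default: int
) -> int:
    """Hypothetical helper: constructor argument > env variable > default."""
    if explicit is not None:
        return explicit
    raw = os.environ.get(env_name)
    if raw is not None:
        try:
            return int(raw)
        except ValueError:
            # Invalid values are logged, then ignored in favour of the default.
            logger.warning("Ignoring invalid value for %s", env_name)
    return default


os.environ["SAFEZIP_MAX_FILES"] = "1000"
from_env = resolve_int_limit(None, "SAFEZIP_MAX_FILES", 10_000)
from_arg = resolve_int_limit(5, "SAFEZIP_MAX_FILES", 10_000)
```

Here `from_env` comes from the environment and `from_arg` from the explicit
argument, which always wins.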

### What safezip does not do

- **Write mode** — `SafeZipFile` is read-only. It does not expose `open()`,
  `read()`, or any write-mode methods from `zipfile.ZipFile`.
- **Recursive extraction by default** — nested `.zip` members are extracted as
  raw files unless `recursive=True` is passed. Recursion is then bounded by
  `max_nesting_depth`.
- **Create OS symlinks** — `RESOLVE_INTERNAL` extracts symlink entries as
  regular files containing the target path as bytes. See section 5.

---

## 3. Architecture

Each extraction passes through three phases in order. Each phase owns exactly
one module. When adding a new check, identify the correct phase first.

| Phase | File | Runs | Raises |
| --- | --- | --- | --- |
| **Guard** | `_guard.py` | On `SafeZipFile.__init__()`, before any decompression | `FileCountExceededError`, `FileSizeExceededError`, `MalformedArchiveError` |
| **Sandbox** | `_sandbox.py` | Per member, before streaming begins | `UnsafeZipError` |
| **Streamer** | `_streamer.py` | Per member, during decompression | `FileSizeExceededError`, `TotalSizeExceededError`, `CompressionRatioError` |

**Guard** owns: file count limit, declared per-member size, ZIP64 consistency,
null bytes in filenames.
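
A rough sketch of what a header-only pre-check of this kind might look like,
using only `zipfile`. The function name `guard_check` is hypothetical, and
`ValueError` stands in for `FileCountExceededError` and
`FileSizeExceededError` so that the example stays self-contained.

```python
import io
import zipfile


def guard_check(zf: zipfile.ZipFile, max_files: int, max_file_size: int) -> None:
    """Hypothetical pre-check: inspects headers only, decompresses nothing."""
    infos = zf.infolist()
    if len(infos) > max_files:
        raise ValueError(f"too many entries: {len(infos)} > {max_files}")
    for info in infos:
        # file_size is the *declared* uncompressed size from the central
        # directory; the real size is only known while streaming.
        if info.file_size > max_file_size:
            raise ValueError("declared member size exceeds limit")


# Build a small two-member archive in memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as out:
    out.writestr("a.txt", b"x" * 10)
    out.writestr("b.txt", b"y" * 10)

with zipfile.ZipFile(buf) as zf:
    guard_check(zf, max_files=10, max_file_size=100)  # within both limits
```

Because declared sizes can be forged, a check like this only complements the
runtime checks in the Streamer; it cannot replace them.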

**Sandbox** owns: path traversal detection, absolute/UNC path rejection, Unicode
NFC normalisation, null-byte rejection, path length limit, symlink policy
(REJECT / IGNORE / RESOLVE_INTERNAL).
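
The path rules listed above can be illustrated with a short stdlib-only
sketch. `resolve_member` is a hypothetical name and `ValueError` stands in for
`UnsafeZipError`; the real `_sandbox.py` may order or implement these checks
differently, and the path length limit and symlink policy are left out.

```python
import unicodedata
from pathlib import Path, PureWindowsPath


def resolve_member(base: Path, name: str) -> Path:
    """Hypothetical check: return the destination path or raise ValueError."""
    name = unicodedata.normalize("NFC", name)
    if "\x00" in name:
        raise ValueError("null byte in member name")
    # PureWindowsPath recognises drive letters, UNC prefixes and backslashes
    # on every platform, so Windows-style names are rejected on POSIX too.
    win = PureWindowsPath(name)
    if name.startswith("/") or win.drive or win.root:
        raise ValueError("absolute or drive-qualified member name")
    if ".." in win.parts:
        raise ValueError("parent-directory component in member name")
    # Final containment check on the fully resolved path.
    base = base.resolve()
    dest = (base / name).resolve()
    if not dest.is_relative_to(base):
        raise ValueError("member escapes the extraction directory")
    return dest


safe = resolve_member(Path("extracted"), "docs/readme.txt")
```

The explicit `..` test is deliberately conservative: it also rejects names
such as `a/../b` that would resolve inside the base directory.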

**Streamer** owns: per-member decompressed size, cumulative total size,
per-member ratio, cumulative ratio, atomic write contract (temp file → rename
on success, unlink on failure).

**Orchestration** (`_core.py`) — `SafeZipFile` and `safe_extract`. `_extract_one`
calls the three phases in order per member. Environment variable resolution,
security event emission, and symlink policy dispatch live here.

### Key files

| File | Purpose |
| --- | --- |
| `src/safezip/_core.py` | Public API, orchestration, env overrides, event emission |
| `src/safezip/_guard.py` | Phase A: static pre-checks |
| `src/safezip/_sandbox.py` | Phase B: path resolution, symlink policy |
| `src/safezip/_streamer.py` | Phase C: streaming extraction, atomic writes |
| `src/safezip/_exceptions.py` | Exception hierarchy (all inherit `SafezipError`) |
| `src/safezip/_events.py` | `SecurityEvent`, `SymlinkPolicy`, callback type |
| `src/safezip/tests/conftest.py` | All test archive fixtures |
| `pyproject.toml` | Build, ruff, mypy, pytest-cov configuration |
| `README.rst` | End-user documentation; keep in sync with code |

---

## 4. Security Principles

**1. Default limits are sacred.**
Never lower them in examples or generated code. If a user asks you to relax a
limit, warn about the tradeoff explicitly before complying.

**2. Atomicity is non-negotiable.**
Every member must follow: temp file → all checks pass → `replace()` to
destination. On any exception: `unlink(missing_ok=True)` the temp file. The
destination must never be created or modified if a check fails. No partial
files may remain on disk.
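
A minimal sketch of this contract, assuming a size limit as the only runtime
check. `atomic_stream` is a hypothetical name and `ValueError` stands in for
`FileSizeExceededError`; the real streamer also tracks ratios and cumulative
totals.

```python
import io
import os
import tempfile
from pathlib import Path
from typing import BinaryIO


def atomic_stream(
    src: BinaryIO, dest: Path, max_size: int, chunk: int = 65536
) -> None:
    """Hypothetical sketch: stream src into dest without exposing partial data."""
    # The temp file lives in the destination directory so that replace()
    # stays on one filesystem and is therefore atomic.
    fd, tmp_name = tempfile.mkstemp(dir=dest.parent, prefix=".tmp-")
    tmp = Path(tmp_name)
    try:
        written = 0
        with os.fdopen(fd, "wb") as out:
            while data := src.read(chunk):
                written += len(data)
                if written > max_size:
                    raise ValueError("member exceeds size limit")
                out.write(data)
        tmp.replace(dest)  # only reached when every check has passed
    except BaseException:
        tmp.unlink(missing_ok=True)
        raise


workdir = Path(tempfile.mkdtemp())
atomic_stream(io.BytesIO(b"hello"), workdir / "ok.txt", max_size=10)
```

If the limit is exceeded mid-stream, the temp file is removed and the
destination path is never created.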

**3. Never merge phase responsibilities.**
Path checks belong in `_sandbox.py`. Static header checks in `_guard.py`.
Runtime byte checks in `_streamer.py`. Do not add path logic to the streamer
or size logic to the guard.

**4. Zero external dependencies.**
stdlib only. If you are considering adding an import that is not in the Python
standard library, the answer is no.

**5. Security events must not be suppressible.**
Exceptions raised inside `on_security_event` callbacks are caught and logged,
but the original security exception always propagates. Never let a broken
callback silently swallow a violation.
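
One way to picture this rule is sketched below. `emit_event` and
`reject_member` are hypothetical names, the event is reduced to a plain string
such as `"zip_bomb"` (an invented event type), and `ValueError` stands in for
the real security exception.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("safezip.security")


def emit_event(
    callback: Optional[Callable[[str], None]], event_type: str
) -> None:
    """Hypothetical emitter: a failing callback is logged, never propagated."""
    if callback is None:
        return
    try:
        callback(event_type)
    except Exception:
        logger.exception("on_security_event callback failed")


def broken_monitor(event_type: str) -> None:
    raise RuntimeError("monitoring backend is down")


def reject_member() -> None:
    emit_event(broken_monitor, "zip_bomb")
    # The violation is raised regardless of what the callback did.
    raise ValueError("compression ratio exceeded")
```

Calling `reject_member()` raises `ValueError`, not the `RuntimeError` from the
broken callback.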

---

## 5. Known Intentional Behaviors — Do Not Treat as Bugs

### RESOLVE_INTERNAL extracts symlink entries as regular files

ZIP entries flagged as symlinks (via `external_attr` Unix mode `S_IFLNK`) are
written as regular files containing the link target path as bytes. Python's
`zipfile` does not create OS symlinks. The post-extraction `check_symlink` /
`_verify_symlink_chain` code in `_sandbox.py` is only reached if the OS creates
an actual symlink, which does not happen in the current extraction path.

This is **safe**: a regular file containing the text `"../escape.txt"` is
harmless. Real OS symlink creation and chain verification are
**not yet implemented**; they are future work (see the implementation note
below).

**If asked to implement real symlink support:** in `_extract_one`, for
`RESOLVE_INTERNAL` + `is_symlink_entry`, read the target bytes, call
`os.symlink(target, dest)`, then call `check_symlink(dest, base, policy)`,
unlink if unsafe. Add tests for both safe and escaping targets. Update README.
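
For reference, detecting a symlink entry from `external_attr` can be sketched
like this. The name `is_symlink_entry` appears above, but its signature here
is an assumption; the sketch also ignores `create_system`, which a stricter
check might consult.

```python
import stat
import zipfile


def is_symlink_entry(info: zipfile.ZipInfo) -> bool:
    """Illustrative check: the Unix mode is the high 16 bits of external_attr."""
    return stat.S_ISLNK(info.external_attr >> 16)


# A symlink entry and a regular-file entry, built without writing an archive.
link = zipfile.ZipInfo("link")
link.external_attr = (stat.S_IFLNK | 0o777) << 16

plain = zipfile.ZipInfo("plain.txt")
plain.external_attr = (stat.S_IFREG | 0o644) << 16
```

With these two entries, `is_symlink_entry(link)` is true and
`is_symlink_entry(plain)` is false.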

### compress_size == 0 skips the ratio check — this is correct

The ratio check in `_streamer.py` is gated on `compress_size > 0`. This is not
a vulnerability. Python's `zipfile` uses the central directory's `compress_size`
to control how many compressed bytes it reads. The only case where
`compress_size == 0` reaches the streamer for a member that successfully
decompresses is a genuinely empty member (zero bytes), for which skipping the
ratio check is correct behavior.
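
The gate can be pictured with the following sketch. `check_ratio` is a
hypothetical name and `ValueError` stands in for `CompressionRatioError`.

```python
def check_ratio(decompressed: int, compress_size: int, max_ratio: float) -> None:
    """Hypothetical ratio gate, skipped when there are no compressed bytes."""
    if compress_size > 0 and decompressed / compress_size > max_ratio:
        raise ValueError("compression ratio exceeded")


# An empty member: nothing to divide by, nothing to check.
check_ratio(decompressed=0, compress_size=0, max_ratio=200.0)

# A ratio of 10 is well below the limit of 200.
check_ratio(decompressed=1_000, compress_size=100, max_ratio=200.0)
```

Gating on `compress_size > 0` also avoids a division by zero.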

A crafted archive with `compress_size=0` in the central directory but non-empty
content is rejected by Python's `zipfile` with `BadZipFile` (CRC failure) before
the streamer is reached. This has been empirically verified. **Do not attempt to
"fix" this skip.**

### Nested archives are extracted as raw files by default

With the default `recursive=False`, members that are themselves archives
(`.zip`, `.jar`, `.whl`, `.egg`, etc.) are extracted as opaque blobs;
`SafeZipFile` does not auto-recurse. Passing `recursive=True` descends into
nested archives. `max_nesting_depth`, the `_nesting_depth` parameter, and
`NestingDepthError` guard against runaway recursion.

### In-memory archives (BinaryIO) receive full overlap detection

When `SafeZipFile` is instantiated with a `BinaryIO` (e.g., `BytesIO`) instead
of a filesystem path, the Guard phase now spills the buffer to a temporary
file to run `detect_zip_bomb()`. This ensures Fifield-style overlap detection
and extra-field quoting checks are applied to in-memory archives, closing a
previous bypass. The buffer position is restored after detection so the
caller's `zipfile.ZipFile` instance is not disturbed.
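
The spill-and-restore step might look roughly like this. `spill_to_tempfile`
is a hypothetical helper, and `detect_zip_bomb()` itself is not reproduced
here.

```python
import io
import os
import shutil
import tempfile
from typing import BinaryIO


def spill_to_tempfile(stream: BinaryIO) -> str:
    """Hypothetical helper: copy a seekable stream to disk, restore its position."""
    saved = stream.tell()
    try:
        stream.seek(0)
        with tempfile.NamedTemporaryFile(suffix=".zip", delete=False) as tmp:
            shutil.copyfileobj(stream, tmp)
            return tmp.name  # the caller is responsible for deleting this file
    finally:
        # Runs on success and on failure, so the caller's position survives.
        stream.seek(saved)


buf = io.BytesIO(b"PK\x05\x06" + b"\x00" * 18)  # a minimal empty ZIP archive
buf.seek(7)
path = spill_to_tempfile(buf)
with open(path, "rb") as fh:
    copied = fh.read()
os.unlink(path)
```

After the call, `buf.tell()` is still 7 and `copied` holds the full archive
bytes.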

---

## 6. Agent Workflow: Adding Features or Fixing Bugs

When asked to add a feature or fix a bug, follow these steps in order:

1. **Check the mission** — Does the change preserve zero deps, secure defaults,
   and the three-phase model?
2. **Identify the correct phase** — Guard (static/header), Sandbox (path/policy),
   or Streamer (runtime/bytes).
3. **For bug fixes: write the regression fixture first** — Add a programmatic
   archive fixture to `src/safezip/tests/conftest.py` that reproduces the bug.
   The test must fail before your fix.
4. **Implement the change** in the correct phase file.
5. **Add/update exceptions** in `_exceptions.py` if a new error type is needed
   (inherit from `SafezipError`).
6. **Add event emission** in `_core.py` (`self._emit_event("event_type")`) if
   the check fires inside `_extract_one`.
7. **Export** new public symbols from `__init__.py` and `__all__`.
8. **Write tests:**
   - Unit test in `test_[phase].py` (e.g., `test_streamer.py`).
   - Integration test in `test_integration.py` verifying no partial files remain.
   - Legitimate-input test confirming the happy path still works.
9. **Update documentation** if you modify public API, CLI, or default limits,
   by running the `update-documentation` skill after committing. It will scan
   code vs docs and auto‑fix misalignments.
10. **MUST run:** either a single-environment test (`make test-env ENV=py312`)
    or the full matrix (`make test`).
11. **MUST run:** `make pre-commit`.
12. If `pip-audit` fails on `docs/requirements.txt`, run
    the `make compile-requirements-upgrade` command.
    > **Note:** `docs/requirements.txt` targets Python ≥ 3.12 (built on
    > ReadTheDocs with Python 3.14, or locally on Python 3.13). Some pinned
    > packages (e.g. `ipython>=9`) require Python ≥ 3.12 and are intentional.
    > Do **not** downgrade them to satisfy older Python versions.

### Acceptable new features

- Windows reserved filename detection (Phase B / Sandbox).
- Additional event types for new violation categories.
- Optional recursive extraction (caller-controlled, guarded by `_nesting_depth`).
- Real OS symlink creation under `RESOLVE_INTERNAL` (see section 5).

### Forbidden

- Adding any external dependency.
- Lowering default limits.
- Bypassing or merging phases.
- Writing directly to the destination path (must use temp file).
- Exposing write-mode or `open()`/`read()` methods on `SafeZipFile`.

---

## 7. Testing Rules

### All tests must run inside Docker

```sh
make test                   # full matrix (Python 3.10–3.14)
make test-env ENV=py312     # single version
make shell                  # interactive shell
```

Do not run `pytest` directly on the host machine. Malicious test archives must
not touch the host filesystem.

### Test layout

```text
src/safezip/tests/
    conftest.py          — all archive fixtures (add new ones here)
    test_guard.py        — Phase A tests
    test_sandbox.py      — Phase B tests
    test_streamer.py     — Phase C tests
    test_integration.py  — end-to-end tests
```

The **root `conftest.py`** (project root) is for `pytest-codeblock` documentation
testing only. Do not add security fixtures there.

### Fixture rules

- Craft all test archives programmatically using `struct` or `zipfile`. Do not
  commit pre-built `.zip` files.
- Use `tmp_path` for all output. Never write to a fixed path.

### Required assertions for every security abort test

```python
# 1. pytest.raises wraps the full operation, not just extractall
with pytest.raises(SpecificError):
    with SafeZipFile(...) as zf:
        zf.extractall(dest)

# 2. Atomicity: no partial files remain
remaining = [f for f in dest.rglob("*") if not f.is_dir()]
assert not remaining
```

### Checklist for every new security check

- [ ] Fixture in `conftest.py` that triggers the violation
- [ ] Test asserting the correct exception is raised
- [ ] Test asserting no partial files remain after abort
- [ ] Test asserting a legitimate archive still extracts correctly
- [ ] Integration test in `test_integration.py`
- [ ] Event emission tested if applicable

---

## 8. Coding Conventions

Run all linting checks:

```sh
make pre-commit
```

### Formatting

- Line length: **88 characters** (ruff).
- Import sorting: `isort`; `safezip` is `known-first-party`.
- Target: `py310`. Run `make ruff` to check. `ruff fix = true` auto-fixes on
  commit — do not fight the formatter.

### Ruff rules in effect

`B`, `C4`, `E`, `F`, `G`, `I`, `ISC`, `INP`, `N`, `PERF`, `Q`, `SIM`.

Explicitly ignored:

| Rule | Reason |
| --- | --- |
| `G004` | f-strings in logging calls are allowed |
| `ISC003` | explicit string concatenation (`+`) across lines is allowed |
| `PERF203` | `try/except` in loops allowed in `conftest.py` only |

### Style

- Every non-test module must have `__all__`, `__author__`, `__copyright__`,
  `__license__` at module level.
- Logger: always `logging.getLogger("safezip.security")`. Never use `__name__`.
- Log member names truncated to 256 characters in `extra` dicts (privacy).
- Always chain exceptions: `raise X(...) from exc`.
- Type annotations on all public functions. Use `Optional[X]` (not `X | None`)
  to match the existing codebase.
- `SecurityEvent` must never include member names, paths, or filesystem
  information — `event_type`, `archive_hash`, and `timestamp` only.

### Pull requests

Target the `dev` branch only. Never open a PR directly to `main`.

---

## 9. Prompt Templates

**Explaining usage to a user:**
> You are an expert in secure Python file handling. Explain how to use safezip
> for [task]. Start with secure defaults. Include exception handling. Note that
> symlink entries are extracted as regular files, not OS symlinks.

**Implementing a new feature:**
> Extend safezip with [feature]. Follow the AGENTS.md agent workflow (section 6):
> identify the correct phase, implement, add tests verifying atomicity and events,
> update README. Preserve zero external dependencies and secure defaults.

**Fixing a bug:**
> Reproduce [bug] with a new programmatic fixture in conftest.py. The test must
> fail before the fix. Then fix in the correct phase file. Add tests asserting
> the correct exception, no partial files on disk, and that legitimate archives
> still extract successfully.

**Reviewing a change:**
> Review this safezip change against AGENTS.md: Does it preserve zero deps?
> Does it maintain the three-phase model? Does it follow the atomic write
> contract? Are all new checks tested with both violation and legitimate inputs?

conftest.py

"""
Pytest fixtures for documentation testing.

DO NOT ADD OTHER FIXTURES HERE.
"""

import io
import zipfile
from pathlib import Path

import pytest


@pytest.fixture()
def file_zip(tmp_path):
    """A valid ZIP file named file.zip."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("hello.txt", b"Hello, world!\n")
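    p = Path("path/to") / "file.zip"
    # Create the parent directory first; write_bytes() does not create it.
    p.parent.mkdir(parents=True, exist_ok=True)
    p.write_bytes(buf.getvalue())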
    return p


@pytest.fixture()
def nested_file_zip(tmp_path):
    """archive.zip containing readme.txt and data.zip (which contains report.csv).

    Matches the README 'Recursive extraction' example exactly::

        archive.zip
          readme.txt
          data.zip
            report.csv
    """
    inner_buf = io.BytesIO()
    with zipfile.ZipFile(inner_buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("report.csv", b"id,value\n1,100\n")

    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("readme.txt", b"Archive readme\n")
        zf.writestr("data.zip", inner_buf.getvalue())

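    p = Path("path/to") / "archive.zip"
    # Create the parent directory first; write_bytes() does not create it.
    p.parent.mkdir(parents=True, exist_ok=True)
    p.write_bytes(buf.getvalue())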
    return p

docker-compose.yml

services:
  tox:
    build: .
    volumes:
      - ./htmlcov:/app/htmlcov

pyproject.toml

[project]
name = "safezip"
description = "Hardened ZIP extraction for Python - secure by default."
readme = "README.rst"
version = "0.1.7"
requires-python = ">=3.10"
dependencies = []
authors = [
    { name = "Artur Barseghyan", email = "artur.barseghyan@gmail.com" },
]
maintainers = [
    { name = "Artur Barseghyan", email = "artur.barseghyan@gmail.com" },
]
license = "MIT"
classifiers = [
    "Development Status :: 4 - Beta",
    "Intended Audience :: Developers",
    "Operating System :: OS Independent",
    "Programming Language :: Python :: 3.10",
    "Programming Language :: Python :: 3.11",
    "Programming Language :: Python :: 3.12",
    "Programming Language :: Python :: 3.13",
    "Programming Language :: Python :: 3.14",
    "Programming Language :: Python :: 3.15",
    "Programming Language :: Python",
    "Topic :: Security",
    "Topic :: Software Development :: Libraries :: Python Modules",
    "Topic :: System :: Archiving :: Compression",
]
keywords = [
    "zip",
    "security",
    "zipslip",
    "zipbomb",
    "hardened",
    "safe",
]

[project.scripts]
safezip = "safezip.cli:main"

[project.urls]
Homepage = "https://github.com/barseghyanartur/safezip/"
Repository = "https://github.com/barseghyanartur/safezip/"
Issues = "https://github.com/barseghyanartur/safezip/issues"

[project.optional-dependencies]
all = ["safezip[dev,test,docs,build]"]
dev = [
    "detect-secrets",
    "doc8",
    "ipython",
    "mypy",
    "ruff",
    "uv",
]
test = [
    "pytest",
    "pytest-cov",
    "pytest-codeblock",
]
docs = [
    "sphinx",
    "sphinx-autobuild",
    "sphinx-rtd-theme>=1.3.0",
    "sphinx-no-pragma",
    "sphinx-markdown-builder",
    "sphinx-llms-txt-link",
    "sphinx-source-tree",
]
build = [
    "build",
    "twine",
    "wheel",
]

[tool.setuptools]
package-dir = {"" = "src"}

[tool.setuptools.packages.find]
where = ["src"]
include = ["safezip", "safezip.*"]

[build-system]
requires = ["setuptools>=41.0", "wheel"]
build-backend = "setuptools.build_meta"

[tool.ruff]
line-length = 88
lint.select = [
    "B",
    "C4",
    "E",
    "F",
    "G",
    "I",
    "ISC",
    "INP",
    "N",
    "PERF",
    "Q",
    "SIM",
]
lint.ignore = [
    "G004",
    "ISC003",
]
fix = true
src = ["src/safezip"]
exclude = [
    ".bzr",
    ".direnv",
    ".eggs",
    ".git",
    ".hg",
    ".mypy_cache",
    ".nox",
    ".pants.d",
    ".ruff_cache",
    ".svn",
    ".tox",
    ".venv",
    "__pypackages__",
    "_build",
    "buck-out",
    "build",
    "dist",
    "node_modules",
    "venv",
    "docs",
]
target-version = "py310"
# Allow unused variables when underscore-prefixed.
lint.dummy-variable-rgx = "^(_+|(_+[a-zA-Z0-9_]*[a-zA-Z0-9]+?))$"

[tool.ruff.lint.isort]
known-first-party = ["safezip"]

[tool.ruff.lint.per-file-ignores]
"conftest.py" = [
    "PERF203"  # Allow `try`-`except` inside loops despite the performance overhead
]

[tool.doc8]
ignore-path = [
    "docs/requirements.txt",
    "src/safezip.egg-info/SOURCES.txt",
]

[tool.pytest.ini_options]
addopts = [
    "-ra",
    "-vvv",
    "-q",
    "--cov=safezip",
    "--ignore=.tox",
    "--cov-report=html",
    "--cov-report=term",
    "--cov-append",
    "--capture=no",
]
testpaths = [
    "src/safezip/tests",
    ".",
    "**/*.rst",
    "**/*.md",
]
pythonpath = ["src"]
norecursedirs = [".git", ".tox"]

[tool.coverage.run]
relative_files = true
omit = [".tox/*"]
source = ["safezip"]

[tool.coverage.report]
show_missing = true
exclude_lines = [
    "pragma: no cover",
    "@overload",
]

[tool.mypy]
check_untyped_defs = true
warn_unused_ignores = true
warn_redundant_casts = true
warn_unused_configs = true
ignore_missing_imports = true

[tool.sphinx-source-tree]
ignore = [
    "*.egg-info",
    "*.py,cover",
    "*.pyc",
    "*.pyo",
    ".DS_Store",
    ".coverage",
    ".coverage.*",
    ".git",
    ".hg",
    ".hypothesis",
    ".idea",
    ".mypy_cache",
    ".nox",
    ".pre-commit-config.yaml",
    ".pre-commit-hooks.yaml",
    ".pytest_cache",
    ".readthedocs.yaml",
    ".ruff_cache",
    ".secrets.baseline",
    ".svn",
    ".tox",
    ".venv",
    ".vscode",
    "CHANGELOG.rst",
    "CODE_OF_CONDUCT.rst",
    "LICENSE",
    "SECURITY.rst",
    "Thumbs.db",
    "__pycache__",
    "build",
    "codebin",
    "dist",
    "docs/Makefile",
    "docs/_build",
    "docs/_static",
    "docs/changelog.rst",
    "docs/code_of_conduct.rst",
    "docs/customization",
    "docs/make.bat",
    "docs/requirements.txt",
    "docs/security.rst",
    "docs/source_tree.rst",
    "docs/source_tree_full.rst",
    "env",
    "htmlcov",
    "node_modules",
    "venv",
    "ARCHITECTURE.rst",
    ".coderabbit.yaml",
    ".coveralls",
    "docs/full-llms.rst",
    "docs/llms.rst",
    "docs/contributor_guidelines.rst",
    "docs/package.rst",
    "docs/documentation.rst",
    "docs/index.rst",
]
order = [
    "README.rst",
    "CONTRIBUTING.rst",
    "AGENTS.md",
]

[[tool.sphinx-source-tree.files]]
output = "docs/full_llms.rst"
title = "Full project source-tree"

[[tool.sphinx-source-tree.files]]
output = "docs/llms.rst"
title = "Project source-tree"
ignore = [
    "*.egg-info",
    "*.py,cover",
    "*.pyc",
    "*.pyo",
    ".DS_Store",
    ".coverage",
    ".coverage.*",
    ".git",
    ".hg",
    ".hypothesis",
    ".idea",
    ".mypy_cache",
    ".nox",
    ".pre-commit-config.yaml",
    ".pre-commit-hooks.yaml",
    ".pytest_cache",
    ".readthedocs.yaml",
    ".ruff_cache",
    ".secrets.baseline",
    ".svn",
    ".tox",
    ".venv",
    ".vscode",
    "CHANGELOG.rst",
    "CODE_OF_CONDUCT.rst",
    "LICENSE",
    "SECURITY.rst",
    "Thumbs.db",
    "__pycache__",
    "build",
    "codebin",
    "dist",
    "docs/Makefile",
    "docs/_build",
    "docs/_static",
    "docs/changelog.rst",
    "docs/code_of_conduct.rst",
    "docs/customization",
    "docs/make.bat",
    "docs/requirements.txt",
    "docs/security.rst",
    "docs/source_tree.rst",
    "docs/source_tree_full.rst",
    "env",
    "htmlcov",
    "node_modules",
    "venv",
    "examples",
    "docs",
    "ARCHITECTURE.rst",
    ".coderabbit.yaml",
    ".coveralls",
    "docs/full-llms.rst",
    "docs/llms.rst",
    "docs/contributor_guidelines.rst",
    "docs/package.rst",
    "docs/documentation.rst",
    "docs/index.rst",
]

src/safezip/__init__.py

"""safezip - Hardened ZIP extraction for Python."""

from ._core import SafeZipFile, safe_extract
from ._events import SecurityEvent, SymlinkPolicy
from ._exceptions import (
    CompressionRatioError,
    FileCountExceededError,
    FileSizeExceededError,
    MalformedArchiveError,
    NestingDepthError,
    SafezipError,
    TotalSizeExceededError,
    UnsafeZipError,
)

__title__ = "safezip"
__version__ = "0.1.7"
__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"
__all__ = (
    # Core
    "SafeZipFile",
    "safe_extract",
    # Events / policy
    "SecurityEvent",
    "SymlinkPolicy",
    # Exceptions
    "SafezipError",
    "UnsafeZipError",
    "FileSizeExceededError",
    "TotalSizeExceededError",
    "CompressionRatioError",
    "FileCountExceededError",
    "NestingDepthError",
    "MalformedArchiveError",
)

src/safezip/_core.py

"""SafeZipFile: the public hardened wrapper around zipfile.ZipFile."""

import hashlib
import logging
import os
import stat
import zipfile
from contextlib import suppress
from pathlib import Path
from typing import BinaryIO, Optional, Union

from ._events import SecurityEvent, SecurityEventCallback, SymlinkPolicy
from ._exceptions import (
    CompressionRatioError,
    FileCountExceededError,
    FileSizeExceededError,
    MalformedArchiveError,
    NestingDepthError,
    TotalSizeExceededError,
    UnsafeZipError,
)
from ._guard import validate_archive
from ._sandbox import check_symlink, resolve_member_path
from ._streamer import CumulativeCounters, stream_extract_member

__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"
__all__ = (
    "SafeZipFile",
    "safe_extract",
)

log = logging.getLogger("safezip.security")

_ARCHIVE_EXTENSIONS = frozenset(
    {".zip", ".jar", ".war", ".ear", ".apk", ".aar", ".whl", ".egg"}
)


def _archive_stem(name: str) -> str:
    """Strip the archive extension from *name*, returning the base stem.

    Only the final suffix is stripped (ZIP-family archives do not use compound
    extensions like .tar.gz). The suffix match is case-insensitive.

    Examples::

        archive.zip  → archive
        lib.whl      → lib
        app.jar      → app
        data.csv     → data.csv   (non-archive extension unchanged)
    """
    p = Path(name)
    if p.suffix.lower() in _ARCHIVE_EXTENSIONS:
        return p.stem
    return name


def _env_int(name: str, default: int) -> int:
    val = os.environ.get(name)
    if val is None:
        return default
    try:
        return int(val)
    except ValueError:
        return default


def _env_float(name: str, default: float) -> float:
    val = os.environ.get(name)
    if val is None:
        return default
    try:
        return float(val)
    except ValueError:
        return default


def _env_bool(name: str, default: bool) -> bool:
    val = os.environ.get(name)
    if val is None:
        return default
    if val.lower() in ("1", "true", "yes", "on"):
        return True
    if val.lower() in ("0", "false", "no", "off"):
        return False
    log.warning(
        "Ignoring unrecognised %s value %r; using default %r.",
        name,
        val,
        default,
    )
    return default


def _sanitise_mode(path: Path, *, strip_special_bits: bool = True) -> None:
    """Strip setuid/setgid/sticky bits from *path* if requested."""
    if not strip_special_bits:
        return
    try:
        current = path.stat().st_mode
        safe = current & ~(stat.S_ISUID | stat.S_ISGID | stat.S_ISVTX)
        if safe != current:
            os.chmod(path, safe)
    except OSError:
        pass  # best-effort; extraction already succeeded


def _env_symlink_policy(default: SymlinkPolicy) -> SymlinkPolicy:
    """Read SAFEZIP_SYMLINK_POLICY from the environment.

    Accepted values (case-insensitive): ``reject``, ``ignore``,
    ``resolve_internal``.  Any other value is logged and ignored.
    """
    val = os.environ.get("SAFEZIP_SYMLINK_POLICY")
    if val is None:
        return default
    mapping = {
        "reject": SymlinkPolicy.REJECT,
        "ignore": SymlinkPolicy.IGNORE,
        "resolve_internal": SymlinkPolicy.RESOLVE_INTERNAL,
    }
    resolved = mapping.get(val.lower())
    if resolved is None:
        log.warning(
            "Ignoring unrecognised SAFEZIP_SYMLINK_POLICY value %r; using default %r.",
            val,
            default.value,
        )
        return default
    return resolved


_DEFAULT_MAX_FILE_SIZE: int = _env_int("SAFEZIP_MAX_FILE_SIZE", 1 * 1024**3)
_DEFAULT_MAX_TOTAL_SIZE: int = _env_int("SAFEZIP_MAX_TOTAL_SIZE", 5 * 1024**3)
_DEFAULT_MAX_FILES: int = _env_int("SAFEZIP_MAX_FILES", 10_000)
_DEFAULT_MAX_PER_MEMBER_RATIO: float = _env_float("SAFEZIP_MAX_PER_MEMBER_RATIO", 200.0)
_DEFAULT_MAX_TOTAL_RATIO: float = _env_float("SAFEZIP_MAX_TOTAL_RATIO", 200.0)
_DEFAULT_MAX_NESTING_DEPTH: int = _env_int("SAFEZIP_MAX_NESTING_DEPTH", 3)
_DEFAULT_SYMLINK_POLICY: SymlinkPolicy = _env_symlink_policy(SymlinkPolicy.REJECT)
_DEFAULT_RECURSIVE: bool = _env_bool("SAFEZIP_RECURSIVE", False)


def _archive_hash(file: Union[str, os.PathLike, BinaryIO]) -> str:
    """
    Return the first 16 hex chars of the SHA-256 of the first 64 KiB of the archive.

    Content-based hashing ensures different files at the same path produce
    different hashes in SecurityEvent records.

    This is a **prefix fingerprint**, not a whole-archive hash. Two archives
    that differ only after the first 64 KiB will produce the same value.  Its
    purpose is incident correlation in SecurityEvent records (linking multiple
    events from the same extraction session), not integrity verification.

    For BinaryIO inputs the stream position is saved before reading and
    restored afterwards so the caller's zipfile.ZipFile instance is not
    disturbed.
    """
    h = hashlib.sha256()
    if isinstance(file, (str, os.PathLike)):
        try:
            with open(file, "rb") as fh:
                h.update(fh.read(65536))
        except OSError:
            h.update(str(file).encode())
        return h.hexdigest()[:16]

    pos = file.tell()
    try:
        h.update(file.read(65536))
    finally:
        with suppress(OSError):
            file.seek(pos)
    return h.hexdigest()[:16]


class SafeZipFile:
    """A hardened, composition-based wrapper around :class:`zipfile.ZipFile`.

    All defences are enabled by default.  Limits can be relaxed by passing
    explicit constructor arguments or by setting environment variables.

    .. note::

        This class intentionally does **not** expose ``open()``, ``read()``,
        or any write-mode methods from the underlying ``zipfile.ZipFile``.
        Callers needing lower-level access must use ``zipfile.ZipFile``
        directly, accepting the associated risks.
    """

    def __init__(
        self,
        file: Union[str, os.PathLike, BinaryIO],
        mode: str = "r",
        *,
        max_file_size: Optional[int] = None,
        max_total_size: Optional[int] = None,
        max_files: Optional[int] = None,
        max_per_member_ratio: Optional[float] = None,
        max_total_ratio: Optional[float] = None,
        max_nesting_depth: Optional[int] = None,
        symlink_policy: Optional[SymlinkPolicy] = None,
        password: Optional[bytes] = None,
        on_security_event: SecurityEventCallback = None,
        _nesting_depth: int = 0,
        recursive: Optional[bool] = None,
        strip_special_bits: bool = True,
    ) -> None:
        # Resolve limits: constructor arg > env var > module-level default
        # Env vars are read at runtime to support test monkeypatching
        self._max_file_size = (
            max_file_size
            if max_file_size is not None
            else _env_int("SAFEZIP_MAX_FILE_SIZE", _DEFAULT_MAX_FILE_SIZE)
        )
        self._max_total_size = (
            max_total_size
            if max_total_size is not None
            else _env_int("SAFEZIP_MAX_TOTAL_SIZE", _DEFAULT_MAX_TOTAL_SIZE)
        )
        self._max_files = (
            max_files
            if max_files is not None
            else _env_int("SAFEZIP_MAX_FILES", _DEFAULT_MAX_FILES)
        )
        self._max_per_member_ratio = (
            max_per_member_ratio
            if max_per_member_ratio is not None
            else _env_float(
                "SAFEZIP_MAX_PER_MEMBER_RATIO", _DEFAULT_MAX_PER_MEMBER_RATIO
            )
        )
        self._max_total_ratio = (
            max_total_ratio
            if max_total_ratio is not None
            else _env_float("SAFEZIP_MAX_TOTAL_RATIO", _DEFAULT_MAX_TOTAL_RATIO)
        )
        self._max_nesting_depth = (
            max_nesting_depth
            if max_nesting_depth is not None
            else _env_int("SAFEZIP_MAX_NESTING_DEPTH", _DEFAULT_MAX_NESTING_DEPTH)
        )
        self._symlink_policy = (
            symlink_policy
            if symlink_policy is not None
            else _env_symlink_policy(_DEFAULT_SYMLINK_POLICY)
        )
        self._recursive = (
            recursive
            if recursive is not None
            else _env_bool("SAFEZIP_RECURSIVE", _DEFAULT_RECURSIVE)
        )
        self._strip_special_bits = strip_special_bits
        self._password = password
        self._on_security_event = on_security_event
        self._archive_hash = _archive_hash(file)
        self._nesting_depth = _nesting_depth

        if _nesting_depth > self._max_nesting_depth:
            self._emit_event("nesting_depth_exceeded")
            log.warning(
                "Nesting depth limit exceeded",
                extra={
                    "event": "nesting_depth_exceeded",
                    "nesting_depth": _nesting_depth,
                    "max_nesting_depth": self._max_nesting_depth,
                    "archive_hash": self._archive_hash,
                },
            )
            raise NestingDepthError(
                f"Nested archive depth {_nesting_depth} exceeds "
                f"max_nesting_depth={self._max_nesting_depth}."
            )

        try:
            self._zf = zipfile.ZipFile(file, mode)
        except zipfile.BadZipFile as exc:
            raise MalformedArchiveError(f"Cannot open archive: {exc}") from exc

        # Run the Guard immediately on open
        try:
            validate_archive(
                self._zf, self._max_files, self._max_file_size, self._max_total_size
            )
        except FileCountExceededError:
            self._emit_event("file_count_exceeded")
            raise
        except FileSizeExceededError:
            self._emit_event("declared_size_exceeded")
            raise
        except MalformedArchiveError:
            self._emit_event("malformed_archive")
            raise

    # ------------------------------------------------------------------
    # Context manager
    # ------------------------------------------------------------------

    def __enter__(self) -> "SafeZipFile":
        return self

    def __exit__(self, *args: object) -> None:
        self.close()

    def close(self) -> None:
        """Close the underlying archive."""
        self._zf.close()

    # ------------------------------------------------------------------
    # Read-only inspection (safe subset of zipfile.ZipFile)
    # ------------------------------------------------------------------

    def namelist(self) -> list:
        """Return a list of archive member names."""
        return self._zf.namelist()

    def infolist(self) -> list:
        """Return a list of ZipInfo objects for all archive members."""
        return self._zf.infolist()

    def getinfo(self, name: str) -> zipfile.ZipInfo:
        """Return a ZipInfo object for *name*."""
        return self._zf.getinfo(name)

    # ------------------------------------------------------------------
    # Extraction
    # ------------------------------------------------------------------

    def extract(
        self,
        member: Union[str, zipfile.ZipInfo],
        path: Union[str, os.PathLike],
        *,
        pwd: Optional[bytes] = None,
    ) -> str:
        """Safely extract a single *member* to *path*.

        :param member: Member name string or ZipInfo object.
        :param path: Destination directory (required; no default).
        :param pwd: Optional decryption password.
        :returns: The path to the extracted file as a string.
        :raises UnsafeZipError: On path traversal, absolute paths, or symlinks.
        :raises FileSizeExceededError: If the member is too large.
        :raises CompressionRatioError: If the compression ratio is too high.
        :raises TypeError: If path is None.
        """
        if path is None:
            raise TypeError(
                "SafeZipFile.extract() requires an explicit 'path' argument."
            )
        base = Path(path).resolve()
        counters = CumulativeCounters()
        info = (
            member if isinstance(member, zipfile.ZipInfo) else self._zf.getinfo(member)
        )
        dest = self._extract_one(info, base, counters, pwd or self._password)
        return str(dest)

    def extractall(
        self,
        path: Union[str, os.PathLike],
        members: Optional[list] = None,
        *,
        pwd: Optional[bytes] = None,
    ) -> None:
        """Safely extract all (or selected) members to *path*.

        :param path: Destination directory (required; no default).
        :param members: Optional list of member names or ZipInfo objects.
        :param pwd: Optional decryption password.
        :raises UnsafeZipError: On path traversal, absolute paths, or symlinks.
        :raises FileSizeExceededError: If any member is too large.
        :raises TotalSizeExceededError: If total extracted size is too large.
        :raises CompressionRatioError: If any ratio limit is exceeded.
        :raises TypeError: If path is None.
        """
        if path is None:
            raise TypeError(
                "SafeZipFile.extractall() requires an explicit 'path' argument; "
                "extraction to the current working directory is not permitted."
            )
        base = Path(path).resolve()
        counters = CumulativeCounters()
        effective_pwd = pwd or self._password

        if members is None:
            infos = self._zf.infolist()
        else:
            infos = [
                m if isinstance(m, zipfile.ZipInfo) else self._zf.getinfo(m)
                for m in members
            ]

        for info in infos:
            self._extract_one(info, base, counters, effective_pwd)

    def _extract_one(
        self,
        info: zipfile.ZipInfo,
        base: Path,
        counters: CumulativeCounters,
        pwd: Optional[bytes],
    ) -> Path:
        """Core per-member extraction logic."""
        # Directories - create and skip streaming
        if info.filename.endswith("/"):
            dest = resolve_member_path(base, info.filename.rstrip("/"))
            dest.mkdir(parents=True, exist_ok=True)
            return dest

        # Validate and resolve the destination path (Sandbox / Phase B)
        try:
            dest = resolve_member_path(base, info.filename)
        except UnsafeZipError:
            self._emit_event("zip_slip_detected")
            log.warning(
                "Path traversal attempt blocked",
                extra={
                    "event": "zip_slip_detected",
                    "member": info.filename[:256],
                    "archive_hash": self._archive_hash,
                },
            )
            raise

        # Check for symlinks in the *source* entry
        # (detect if the ZIP entry itself is stored as a symlink)
        attr = (info.external_attr >> 16) & 0xFFFF
        is_symlink_entry = bool(attr and stat.S_ISLNK(attr))

        if is_symlink_entry:
            if self._symlink_policy == SymlinkPolicy.REJECT:
                self._emit_event("symlink_rejected")
                log.warning(
                    "Symlink entry rejected",
                    extra={
                        "event": "symlink_rejected",
                        "member": info.filename[:256],
                        "archive_hash": self._archive_hash,
                    },
                )
                raise UnsafeZipError(
                    f"Symlink entry {info.filename!r} rejected (symlink_policy=REJECT)."
                )
            if self._symlink_policy == SymlinkPolicy.IGNORE:
                self._emit_event("symlink_ignored")
                log.warning(
                    "Symlink entry skipped (IGNORE policy)",
                    extra={
                        "event": "symlink_ignored",
                        "member": info.filename[:256],
                        "archive_hash": self._archive_hash,
                    },
                )
                return dest

        # Nested archive guard
        suffix = Path(info.filename).suffix.lower()
        is_archive_extension = suffix in _ARCHIVE_EXTENSIONS

        # Non-recursive: keep the debug log but don't gate on content
        if not self._recursive:
            if is_archive_extension:
                log.debug(
                    "Nested archive detected: %r - extracting as raw file,"
                    " not recursing.",
                    info.filename,
                )
        else:
            # Recursive path: stream to temp first, then content-detect
            tmp = dest.parent / (
                f"{dest.name}.safezip_tmp_{os.getpid()}_{os.urandom(4).hex()}"
            )
            try:
                try:
                    stream_extract_member(
                        self._zf,
                        info,
                        tmp,
                        max_file_size=self._max_file_size,
                        max_per_member_ratio=self._max_per_member_ratio,
                        max_total_size=self._max_total_size,
                        max_total_ratio=self._max_total_ratio,
                        counters=counters,
                        pwd=pwd,
                    )
                except FileSizeExceededError:
                    self._emit_event("file_size_exceeded")
                    log.warning(
                        "Member size limit exceeded during streaming",
                        extra={
                            "event": "file_size_exceeded",
                            "member": info.filename[:256],
                            "archive_hash": self._archive_hash,
                        },
                    )
                    raise
                except TotalSizeExceededError:
                    self._emit_event("total_size_exceeded")
                    log.warning(
                        "Cumulative extraction size limit exceeded during streaming",
                        extra={
                            "event": "total_size_exceeded",
                            "member": info.filename[:256],
                            "archive_hash": self._archive_hash,
                        },
                    )
                    raise
                except CompressionRatioError:
                    self._emit_event("compression_ratio_exceeded")
                    log.warning(
                        "Compression ratio limit exceeded during streaming",
                        extra={
                            "event": "compression_ratio_exceeded",
                            "member": info.filename[:256],
                            "archive_hash": self._archive_hash,
                        },
                    )
                    raise
                # Content-based detection (avoids extension-spoofing)
                if zipfile.is_zipfile(tmp):
                    nested_dest = dest.parent / _archive_stem(dest.name)
                    nested_dest.mkdir(parents=True, exist_ok=True)
                    with SafeZipFile(
                        tmp,
                        max_file_size=self._max_file_size,
                        max_total_size=self._max_total_size,
                        max_files=self._max_files,
                        max_per_member_ratio=self._max_per_member_ratio,
                        max_total_ratio=self._max_total_ratio,
                        max_nesting_depth=self._max_nesting_depth,
                        symlink_policy=self._symlink_policy,
                        password=self._password,
                        on_security_event=self._on_security_event,
                        recursive=True,
                        _nesting_depth=self._nesting_depth + 1,
                    ) as nested_zf:
                        nested_zf.extractall(nested_dest, pwd=pwd)
                    return nested_dest
                else:
                    # Not a ZIP — rename temp to final destination as a regular file
                    tmp.replace(dest)
                    return dest
            finally:
                tmp.unlink(missing_ok=True)

        # Stream-extract with all runtime monitors (Phase C)
        try:
            stream_extract_member(
                self._zf,
                info,
                dest,
                max_file_size=self._max_file_size,
                max_per_member_ratio=self._max_per_member_ratio,
                max_total_size=self._max_total_size,
                max_total_ratio=self._max_total_ratio,
                counters=counters,
                pwd=pwd,
            )
        except FileSizeExceededError:
            self._emit_event("file_size_exceeded")
            log.warning(
                "Member size limit exceeded during streaming",
                extra={
                    "event": "file_size_exceeded",
                    "member": info.filename[:256],
                    "archive_hash": self._archive_hash,
                },
            )
            raise
        except TotalSizeExceededError:
            self._emit_event("total_size_exceeded")
            log.warning(
                "Cumulative extraction size limit exceeded during streaming",
                extra={
                    "event": "total_size_exceeded",
                    "member": info.filename[:256],
                    "archive_hash": self._archive_hash,
                },
            )
            raise
        except CompressionRatioError:
            self._emit_event("compression_ratio_exceeded")
            log.warning(
                "Compression ratio limit exceeded during streaming",
                extra={
                    "event": "compression_ratio_exceeded",
                    "member": info.filename[:256],
                    "archive_hash": self._archive_hash,
                },
            )
            raise

        # Post-extraction permission sanitisation
        if not dest.is_symlink():
            _sanitise_mode(dest, strip_special_bits=self._strip_special_bits)

        # Post-extraction symlink check (RESOLVE_INTERNAL policy)
        if dest.is_symlink() and self._symlink_policy == SymlinkPolicy.RESOLVE_INTERNAL:
            skip = check_symlink(dest, base, self._symlink_policy)
            if skip:
                dest.unlink(missing_ok=True)

        return dest

    def _emit_event(self, event_type: str) -> None:
        """Emit a SecurityEvent to the configured callback (if any)."""
        if self._on_security_event is None:
            return
        event = SecurityEvent(
            event_type=event_type,
            archive_hash=self._archive_hash,
        )
        try:
            self._on_security_event(event)
        except Exception:
            log.exception(
                "on_security_event callback raised an exception "
                "(event_type=%r); suppressing to preserve security "
                "enforcement.",
                event_type,
            )


def safe_extract(
    archive: Union[str, os.PathLike, BinaryIO],
    destination: Union[str, os.PathLike],
    **kwargs,
) -> None:
    """
    Convenience function: extract *archive* to *destination* using safe defaults.

    All keyword arguments are forwarded to :class:`SafeZipFile`.

    :param archive: Path to the ZIP file, or a file-like binary object.
    :param destination: Directory to extract into.
    """
    with SafeZipFile(archive, **kwargs) as zf:
        zf.extractall(destination)

src/safezip/_events.py

"""Security event types and symlink policy enum."""

import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Optional

__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"
__all__ = (
    "SecurityEvent",
    "SymlinkPolicy",
    "SecurityEventCallback",
)


class SymlinkPolicy(Enum):
    """Controls how symlink members in archives are handled."""

    REJECT = "reject"
    """Any symlink entry raises UnsafeZipError (default)."""

    IGNORE = "ignore"
    """Symlink entries are skipped without raising; a warning is logged and a
    ``symlink_ignored`` event is emitted."""

    RESOLVE_INTERNAL = "resolve_internal"
    """Symlink entries are extracted as regular files containing the raw link-target
    bytes. No OS symlink is created on disk."""


@dataclass
class SecurityEvent:
    """Minimal, privacy-safe payload emitted to the on_security_event callback.

    Deliberately excludes filenames, paths, and member names so that
    forwarding this to a third-party service cannot leak confidential
    filesystem information.
    """

    event_type: str
    """Short string identifying what happened, e.g. 'zip_slip_detected'."""

    archive_hash: str
    """First 16 hex chars of SHA-256 of the **first 64 KiB** of archive content.

    This is a prefix fingerprint for incident correlation, not a whole-archive
    integrity hash. Two archives that differ only after the first 64 KiB will
    share the same value.
    """

    timestamp: float = field(default_factory=time.time)
    """Wall-clock time at the moment of detection (time.time())."""


# Type alias for the optional callback
SecurityEventCallback = Optional[Callable[[SecurityEvent], None]]

src/safezip/_exceptions.py

"""Exception hierarchy for safezip."""

__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"
__all__ = (
    "SafezipError",
    "UnsafeZipError",
    "FileSizeExceededError",
    "TotalSizeExceededError",
    "CompressionRatioError",
    "FileCountExceededError",
    "NestingDepthError",
    "MalformedArchiveError",
)


class SafezipError(Exception):
    """Base class for all safezip security exceptions."""


class UnsafeZipError(SafezipError):
    """Path traversal, absolute paths, or disallowed symlinks detected."""


class FileSizeExceededError(SafezipError):
    """A single member's decompressed size exceeds max_file_size."""


class TotalSizeExceededError(SafezipError):
    """Cumulative decompressed size across all members exceeds max_total_size."""


class CompressionRatioError(SafezipError):
    """Compression ratio exceeds the configured limit (per-member or total)."""


class FileCountExceededError(SafezipError):
    """Archive entry count exceeds max_files."""


class NestingDepthError(SafezipError):
    """Nested archive depth exceeds max_nesting_depth."""


class MalformedArchiveError(SafezipError):
    """Archive is structurally invalid (ZIP64 inconsistency, count mismatch, etc.)."""

src/safezip/_guard.py

"""Phase A: pre-extraction static validation (the Guard)."""

import logging
import mmap
import os
import struct
import tempfile
import zipfile
from contextlib import suppress
from dataclasses import dataclass, field
from typing import IO, BinaryIO, List, Optional, Tuple

from ._exceptions import (
    FileCountExceededError,
    FileSizeExceededError,
    MalformedArchiveError,
)

log = logging.getLogger("safezip.security")

__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"
__all__ = ("validate_archive",)


@dataclass
class ScanResult:
    """Three-valued outcome of inspecting a zip file for overlapping records."""

    is_bomb: Optional[bool]
    invalid_reason: Optional[str] = None
    overlap_detail: Optional[str] = None

    @classmethod
    def clean(cls) -> "ScanResult":
        return cls(is_bomb=False)

    @classmethod
    def bomb(cls, detail: str) -> "ScanResult":
        return cls(is_bomb=True, overlap_detail=detail)

    @classmethod
    def invalid(cls, reason: str) -> "ScanResult":
        return cls(is_bomb=None, invalid_reason=reason)


# ---------------------------------------------------------------------------
# Comprehensive Zip Bomb Detection (Fifield 2019)
# ---------------------------------------------------------------------------

ZIP64_EXTRA_ID = 0x0001
COMPRESS_STORED = 0
COMPRESS_DEFLATE = 8
COMPRESS_BZIP2 = 12
SENTINEL_32 = 0xFFFFFFFF
SENTINEL_16 = 0xFFFF


@dataclass
class Config:
    max_aggregate_ratio: float = 10000.0  # Very high; let Streamer handle ratio checks
    max_total_uncompressed_bytes: int = (
        10 * 1024**3
    )  # 10 GiB; above SafeZipFile default
    max_file_count: int = 100_000  # Above SafeZipFile default of 10_000
    max_deflate_ratio: float = 1_032.0
    max_bzip2_ratio: float = 1_434_375.0


@dataclass
class FileEntry:
    filename: str
    header_offset: int
    compressed_size: int
    uncompressed_size: int
    compress_type: int
    cdh_extra_len: int = 0
    lfh_extra_len: int = -1
    data_start: int = 0
    data_end: int = 0


@dataclass
class Issue:
    kind: str
    detail: str


@dataclass
class DetectionResult:
    is_bomb: bool = False
    issues: List[Issue] = field(default_factory=list)
    compression_ratio: float = 0.0
    total_uncompressed: int = 0
    file_count: int = 0
    zip_size: int = 0
    zip64: bool = False


def _find_eocd(mm: mmap.mmap, file_size: int) -> int:
    sig = b"PK\x05\x06"
    search_start = max(0, file_size - 65535 - 22)
    mm.seek(search_start)
    tail = mm.read(file_size - search_start)
    pos = tail.rfind(sig)
    return search_start + pos if pos != -1 else -1


def _read_eocd(mm: mmap.mmap, file_size: int) -> Tuple[int, int, bool]:
    eocd_pos = _find_eocd(mm, file_size)
    if eocd_pos == -1:
        raise ValueError("No End of Central Directory record found")

    mm.seek(eocd_pos)
    eocd = mm.read(22)
    if len(eocd) < 22:
        raise ValueError("Truncated EOCD")

    # Offset 10 holds the total entry count; offset 8 only counts the
    # entries on this disk.
    cd_count_16 = struct.unpack_from("<H", eocd, 10)[0]
    cd_offset_32 = struct.unpack_from("<I", eocd, 16)[0]

    if eocd_pos >= 20:
        mm.seek(eocd_pos - 20)
        locator = mm.read(20)
        if locator[:4] == b"PK\x06\x07":
            zip64_eocd_offset = struct.unpack_from("<Q", locator, 8)[0]
            mm.seek(zip64_eocd_offset)
            eocd64 = mm.read(56)
            if len(eocd64) >= 56 and eocd64[:4] == b"PK\x06\x06":
                cd_count_64 = struct.unpack_from("<Q", eocd64, 32)[0]
                cd_offset_64 = struct.unpack_from("<Q", eocd64, 48)[0]
                return cd_offset_64, cd_count_64, True

    return cd_offset_32, cd_count_16, False


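def _parse_zip64_extra(extra_bytes: bytes) -> dict:
    """Return the 64-bit values found in the ZIP64 extended information block.

    Values are read positionally (uncompressed size, compressed size, local
    header offset) for as many 8-byte slots as the block holds. The ZIP
    specification only stores a slot when the matching 32-bit field is the
    ``0xFFFFFFFF`` sentinel, so a block that omits a leading slot is
    attributed to the wrong key here.
    """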
    result: dict = {}
    i = 0
    while i + 4 <= len(extra_bytes):
        hdr_id = struct.unpack_from("<H", extra_bytes, i)[0]
        data_len = struct.unpack_from("<H", extra_bytes, i + 2)[0]
        i += 4
        if hdr_id == ZIP64_EXTRA_ID:
            j = i
            if j + 8 <= i + data_len:
                result["uncompressed_size"] = struct.unpack_from("<Q", extra_bytes, j)[
                    0
                ]
                j += 8
            if j + 8 <= i + data_len:
                result["compressed_size"] = struct.unpack_from("<Q", extra_bytes, j)[0]
                j += 8
            if j + 8 <= i + data_len:
                result["header_offset"] = struct.unpack_from("<Q", extra_bytes, j)[0]
            break
        i += data_len
    return result


def parse_central_directory(
    mm: mmap.mmap, file_size: int
) -> Tuple[List[FileEntry], bool]:
    cd_offset, cd_count, is_zip64 = _read_eocd(mm, file_size)
    entries: List[FileEntry] = []

    mm.seek(cd_offset)
    cdh_sig = b"PK\x01\x02"

    for _ in range(cd_count):
        header = mm.read(46)
        if len(header) < 46:
            raise ValueError(
                f"Truncated central directory header: expected 46 bytes, "
                f"got {len(header)}"
            )
        if header[:4] != cdh_sig:
            raise ValueError(
                f"Invalid central directory header signature: "
                f"expected {cdh_sig!r}, got {header[:4]!r}"
            )

        compress_type = struct.unpack_from("<H", header, 10)[0]
        compressed_size32 = struct.unpack_from("<I", header, 20)[0]
        uncomp_size32 = struct.unpack_from("<I", header, 24)[0]
        fname_len = struct.unpack_from("<H", header, 28)[0]
        extra_len = struct.unpack_from("<H", header, 30)[0]
        comment_len = struct.unpack_from("<H", header, 32)[0]
        header_offset32 = struct.unpack_from("<I", header, 42)[0]

        fname_bytes = mm.read(fname_len)
        extra_bytes = mm.read(extra_len)
        mm.seek(comment_len, 1)

        filename = fname_bytes.decode("utf-8", errors="replace")

        z64 = _parse_zip64_extra(extra_bytes)

        compressed_size = z64.get("compressed_size", compressed_size32)
        uncompressed_size = z64.get("uncompressed_size", uncomp_size32)
        header_offset = z64.get("header_offset", header_offset32)

        if compressed_size32 == SENTINEL_32 and "compressed_size" not in z64:
            compressed_size = 0
        if uncomp_size32 == SENTINEL_32 and "uncompressed_size" not in z64:
            uncompressed_size = 0
        if header_offset32 == SENTINEL_32 and "header_offset" not in z64:
            header_offset = 0

        entries.append(
            FileEntry(
                filename=filename,
                header_offset=header_offset,
                compressed_size=compressed_size,
                uncompressed_size=uncompressed_size,
                compress_type=compress_type,
                cdh_extra_len=extra_len,
            )
        )

    return entries, is_zip64


LFH_FIXED = 30


def resolve_data_intervals(mm: mmap.mmap, entries: List[FileEntry]) -> None:
    lfh_sig = b"PK\x03\x04"
    file_size = mm.size()

    for e in entries:
        if e.header_offset + LFH_FIXED > file_size:
            e.data_start = e.header_offset
            e.data_end = e.header_offset + e.compressed_size
            continue

        mm.seek(e.header_offset)
        lfh = mm.read(LFH_FIXED)
        if len(lfh) < LFH_FIXED or lfh[:4] != lfh_sig:
            e.data_start = e.header_offset
            e.data_end = e.header_offset + e.compressed_size
            continue

        lfh_fname_len = struct.unpack_from("<H", lfh, 26)[0]
        lfh_extra_len = struct.unpack_from("<H", lfh, 28)[0]
        e.lfh_extra_len = lfh_extra_len

        e.data_start = e.header_offset + LFH_FIXED + lfh_fname_len + lfh_extra_len
        e.data_end = e.data_start + e.compressed_size


def check_overlapping_files(
    entries: List[FileEntry],
) -> List[Tuple[FileEntry, FileEntry]]:
    if not entries:
        return []

    sorted_e = sorted(entries, key=lambda e: e.data_start)
    overlaps: List[Tuple[FileEntry, FileEntry]] = []
    max_end = sorted_e[0].data_end
    max_end_entry = sorted_e[0]

    for e in sorted_e[1:]:
        if e.data_start < max_end:
            overlaps.append((max_end_entry, e))
        if e.data_end > max_end:
            max_end = e.data_end
            max_end_entry = e

    return overlaps


def check_extra_field_quoting(entries: List[FileEntry]) -> List[FileEntry]:
    if not entries:
        return []

    sorted_e = sorted(entries, key=lambda e: e.header_offset)
    suspicious: List[FileEntry] = []

    for i, e in enumerate(sorted_e[:-1]):
        next_e = sorted_e[i + 1]
        eff_extra = e.lfh_extra_len if e.lfh_extra_len >= 0 else e.cdh_extra_len
        if eff_extra > 0 and e.data_start >= next_e.header_offset:
            suspicious.append(e)

    return suspicious


def check_compression_ratios(
    entries: List[FileEntry], cfg: Config
) -> List[Tuple[FileEntry, float]]:
    suspicious = []
    for e in entries:
        if e.compressed_size <= 0:
            continue
        ratio = e.uncompressed_size / e.compressed_size
        limit = (
            cfg.max_bzip2_ratio
            if e.compress_type == COMPRESS_BZIP2
            else cfg.max_deflate_ratio
        )
        if ratio > limit:
            suspicious.append((e, ratio))
    return suspicious


def detect_zip_bomb(path: str, cfg: Optional[Config] = None) -> DetectionResult:
    if cfg is None:
        cfg = Config()

    zip_size = os.path.getsize(path)
    result = DetectionResult(is_bomb=False, zip_size=zip_size)

    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        try:
            entries, is_zip64 = parse_central_directory(mm, zip_size)
        except (ValueError, struct.error) as exc:
            result.issues.append(
                Issue("parse_error", f"Could not parse central directory: {exc}")
            )
            result.is_bomb = True
            return result

        result.zip64 = is_zip64
        result.file_count = len(entries)

        try:
            resolve_data_intervals(mm, entries)
        except Exception:
            for e in entries:
                if e.data_start == 0:
                    e.data_start = e.header_offset
                    e.data_end = e.header_offset + e.compressed_size

    overlaps = check_overlapping_files(entries)
    if overlaps:
        has_full = any(a.header_offset == b.header_offset for a, b in overlaps)
        kind = "full_overlap" if has_full else "quoted_overlap"
        sample = [(a.filename, b.filename) for a, b in overlaps[:3]]
        result.issues.append(
            Issue(
                kind,
                f"Overlapping file data detected ({len(overlaps)} pair(s)). "
                f"Sample: {sample}. "
                f"Matches Fifield "
                f"{'full-overlap' if has_full else 'quoted-overlap (or giant-steps)'} "
                f"construction.",
            )
        )
        result.is_bomb = True

    extra_q = check_extra_field_quoting(entries)
    if extra_q:
        names = [e.filename for e in extra_q[:3]]
        result.issues.append(
            Issue(
                "extra_field_quoting",
                f"Extra-field quoting detected in {len(extra_q)} file(s): {names}. "
                "LFH extra fields enclose subsequent local file headers.",
            )
        )
        result.is_bomb = True

    total_uncompressed = sum(e.uncompressed_size for e in entries)
    result.total_uncompressed = total_uncompressed
    overall_ratio = total_uncompressed / zip_size if zip_size > 0 else 0.0
    result.compression_ratio = overall_ratio

    if overall_ratio > cfg.max_aggregate_ratio:
        result.issues.append(
            Issue(
                "aggregate_ratio",
                f"Extreme aggregate compression ratio: {overall_ratio:,.0f}:1 "
                f"({total_uncompressed / 1024**3:.2f} GiB uncompressed from "
                f"{zip_size / 1024**2:.2f} MiB zip)",
            )
        )
        result.is_bomb = True

    if total_uncompressed > cfg.max_total_uncompressed_bytes:
        gib = 1024**3
        result.issues.append(
            Issue(
                "total_size",
                f"Total uncompressed size {total_uncompressed / gib:.2f} GiB "
                f"exceeds limit of {cfg.max_total_uncompressed_bytes / gib:.2f} GiB",
            )
        )
        result.is_bomb = True

    bad_ratios = check_compression_ratios(entries, cfg)
    if bad_ratios:
        worst_entry, worst_ratio = max(bad_ratios, key=lambda x: x[1])
        cname = {
            COMPRESS_STORED: "stored",
            COMPRESS_DEFLATE: "DEFLATE",
            COMPRESS_BZIP2: "bzip2",
        }.get(worst_entry.compress_type, str(worst_entry.compress_type))
        limit = (
            cfg.max_bzip2_ratio
            if worst_entry.compress_type == COMPRESS_BZIP2
            else cfg.max_deflate_ratio
        )
        result.issues.append(
            Issue(
                "per_file_ratio",
                f"File '{worst_entry.filename}' ({cname}) ratio {worst_ratio:,.0f}:1 "
                f"exceeds the {cname} theoretical maximum of {limit:,.0f}:1",
            )
        )
        result.is_bomb = True

    if result.file_count > cfg.max_file_count:
        result.issues.append(
            Issue(
                "file_count",
                f"Suspiciously high file count: {result.file_count:,} "
                f"(threshold {cfg.max_file_count:,})",
            )
        )
        result.is_bomb = True

    return result


class ZipInspector:
    """Parses a zip file's structural records and checks for overlapping spans.

    Based on the approach described in David Fifield's zip bomb research:
    https://www.bamsoftware.com/hacks/zipbomb/
    """

    _SEARCH_BLOCK = 8192

    def __init__(self, fileobj: BinaryIO, verbose: bool = False) -> None:
        self._fobj = fileobj
        self._verbose = verbose
        self._file_size: int = 0
        self._record_spans: list[tuple[int, int]] = []

    def scan(self) -> ScanResult:
        """Inspect the zip file and return a ScanResult."""
        self._fobj.seek(0, os.SEEK_END)
        self._file_size = self._fobj.tell()
        self._record_spans = []

        directory = self._locate_central_directory()
        if directory is None:
            return ScanResult.invalid("could not locate a valid central directory")

        num_entries, cd_byte_length, cd_offset = directory
        local_spans = self._walk_central_directory(
            num_entries, cd_byte_length, cd_offset
        )
        if local_spans is None:
            return ScanResult.invalid(
                "central directory parse error or unsupported feature"
            )

        return self._check_spans(local_spans)

    def _locate_central_directory(self) -> Optional[tuple[int, int, int]]:
        """Scan backwards through the file for a valid EOCD record."""
        block = self._SEARCH_BLOCK
        cursor = self._file_size
        readback = 22
        carry = b""
        check_count = 1

        while True:
            cursor -= readback
            if cursor < 0:
                return None

            self._fobj.seek(cursor, os.SEEK_SET)
            window = self._fobj.read(readback) + carry[:21]

            while check_count > 0:
                check_count -= 1
                if (
                    window[check_count] == 0x50
                    and window[check_count + 1] == 0x4B
                    and window[check_count + 2] == 0x05
                    and window[check_count + 3] == 0x06
                ):
                    result = self._validate_eocd(
                        window[check_count + 4 : check_count + 22],
                        cursor + check_count,
                    )
                    if result is not None:
                        return result

            carry = window
            readback = ((cursor - 1) & (block - 1)) + 1
            check_count = readback

    def _validate_eocd(
        self, eocd_body: bytes, eocd_offset: int
    ) -> Optional[tuple[int, int, int]]:
        """Validate the EOCD record and handle Zip64 when needed."""
        if len(eocd_body) < 18:
            return None

        raw = struct.unpack("<HHHHLLH", eocd_body)
        disk_num, cd_start_disk, *_, comment_len = raw

        if disk_num != 0 or cd_start_disk != 0:
            return None
        if eocd_offset + 22 + comment_len > self._file_size:
            return None

        spans_scratch: list[tuple[int, int]] = [
            (eocd_offset, eocd_offset + 22 + comment_len)
        ]

        entries_on_disk, total_entries, cd_length, cd_offset = raw[2:6]

        if (
            entries_on_disk == 0xFFFF
            or total_entries == 0xFFFF
            or cd_length == 0xFFFFFFFF
            or cd_offset == 0xFFFFFFFF
        ):
            z64 = self._read_zip64_records(eocd_offset, spans_scratch)
            if z64 is None:
                return None
            total_entries, cd_length, cd_offset = z64
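        elif total_entries != entries_on_disk:
            return None

        # Bounds-check the central directory for classic and Zip64 archives
        # alike, so a forged Zip64 record cannot point past the end of the file.
        if cd_offset + cd_length > self._file_size:
            return None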

        spans_scratch.append((cd_offset, cd_offset + cd_length))
        spans_scratch.reverse()
        self._record_spans.extend(spans_scratch)
        return (total_entries, cd_length, cd_offset)

    def _read_zip64_records(
        self,
        eocd_offset: int,
        spans_scratch: list[tuple[int, int]],
    ) -> Optional[tuple[int, int, int]]:
        """Read the Zip64 locator and record."""
        if eocd_offset < 20:
            return None

        self._fobj.seek(eocd_offset - 20, os.SEEK_SET)
        loc_sig, loc_disk, z64_eocd_offset, loc_total_disks = struct.unpack(
            "<LLQL", self._fobj.read(20)
        )
        if (
            loc_sig != 0x07064B50
            or loc_disk != 0
            or loc_total_disks != 1
            or z64_eocd_offset + 56 > self._file_size
        ):
            return None

        spans_scratch.append((eocd_offset - 20, eocd_offset))

        self._fobj.seek(z64_eocd_offset, os.SEEK_SET)
        z64 = struct.unpack("<LQHHLLQQQQ", self._fobj.read(56))
        z64_sig, z64_record_size, _, _, z64_disk, z64_cd_disk, *rest = z64
        if (
            z64_sig != 0x06064B50
            or z64_record_size < 44
            or z64_disk != 0
            or z64_cd_disk != 0
        ):
            return None

        spans_scratch.append((z64_eocd_offset, z64_eocd_offset + 12 + z64_record_size))

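        entries_on_disk, total_entries, cd_length, cd_offset = rest
        # Single-disk archives must report the same count in both fields.
        if entries_on_disk != total_entries:
            return None
        return (total_entries, cd_length, cd_offset)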

    def _walk_central_directory(
        self, num_entries: int, cd_byte_length: int, cd_offset: int
    ) -> Optional[list[tuple[int, int]]]:
        """Read every central directory header and resolve to local entry spans."""
        self._fobj.seek(cd_offset, os.SEEK_SET)
        cd_bytes = self._fobj.read(cd_byte_length)

        local_spans: list[tuple[int, int]] = []
        cursor = 0
        remaining = num_entries

        while remaining > 0:
            if cursor + 46 > cd_byte_length:
                return None

            span = self._parse_cdh_entry(cd_bytes, cursor, cd_byte_length)
            if span is None:
                return None

            entry_span, next_cursor = span
            local_spans.append(entry_span)
            cursor = next_cursor
            remaining -= 1

        if cursor != cd_byte_length:
            return None
        return local_spans

    def _parse_cdh_entry(
        self, cd_bytes: bytes, offset: int, cd_length: int
    ) -> Optional[tuple[tuple[int, int], int]]:
        """Parse one central directory header and return local span."""
        hdr = struct.unpack("<LHHHHHHLLLHHHHHLL", cd_bytes[offset : offset + 46])
        offset += 46

        if hdr[0] != 0x02014B50:
            return None

        fname_len, extra_len, comment_len = hdr[10], hdr[11], hdr[12]
        total_variable = fname_len + extra_len + comment_len
        if offset + total_variable > cd_length:
            return None

        compressed_size = hdr[8]
        uncompressed_size = hdr[9]
        disk_number = hdr[13]
        local_hdr_offset = hdr[16]

        if (
            compressed_size == 0xFFFFFFFF
            or uncompressed_size == 0xFFFFFFFF
            or disk_number == 0xFFFF
            or local_hdr_offset == 0xFFFFFFFF
        ):
            z64_result = self._resolve_zip64_cdh_fields(
                cd_bytes,
                offset + fname_len,
                offset + fname_len + extra_len,
                compressed_size,
                uncompressed_size,
                disk_number,
                local_hdr_offset,
            )
            if z64_result is None:
                return None
            (
                compressed_size,
                uncompressed_size,
                disk_number,
                local_hdr_offset,
            ) = z64_result
        # Advance past the variable-length fields (name, extra, comment).
        offset += total_variable

        if disk_number != 0:
            return None
        if local_hdr_offset + 30 > self._file_size:
            return None

        local_end = self._measure_local_entry(
            local_hdr_offset,
            compressed_size,
            uncompressed_size,
            hdr[7],
        )
        if local_end is None:
            return None

        return ((local_hdr_offset, local_end), offset)

    @staticmethod
    def _resolve_zip64_cdh_fields(
        cd_bytes: bytes,
        extra_start: int,
        extra_end: int,
        compressed_size: int,
        uncompressed_size: int,
        disk_number: int,
        local_hdr_offset: int,
    ) -> Optional[tuple[int, int, int, int]]:
        """Walk the extra field looking for Zip64 extended information block."""
        pos = extra_start
        while pos + 4 <= extra_end:
            field_id, field_data_len = struct.unpack("<HH", cd_bytes[pos : pos + 4])
            pos += 4
            if pos + field_data_len > extra_end:
                return None
            field_end = pos + field_data_len

            if field_id != 0x0001:
                pos = field_end
                continue

            if uncompressed_size == 0xFFFFFFFF:
                if pos + 8 > field_end:
                    return None
                uncompressed_size = struct.unpack("<Q", cd_bytes[pos : pos + 8])[0]
                pos += 8
            if compressed_size == 0xFFFFFFFF:
                if pos + 8 > field_end:
                    return None
                compressed_size = struct.unpack("<Q", cd_bytes[pos : pos + 8])[0]
                pos += 8
            if local_hdr_offset == 0xFFFFFFFF:
                if pos + 8 > field_end:
                    return None
                local_hdr_offset = struct.unpack("<Q", cd_bytes[pos : pos + 8])[0]
                pos += 8
            if disk_number == 0xFFFF:
                if pos + 4 > field_end:
                    return None
                disk_number = struct.unpack("<L", cd_bytes[pos : pos + 4])[0]
                pos += 4

            if pos != field_end:
                return None

            return (compressed_size, uncompressed_size, disk_number, local_hdr_offset)

        return None

    def _measure_local_entry(
        self,
        local_offset: int,
        compressed_size: int,
        uncompressed_size: int,
        expected_crc: int,
    ) -> Optional[int]:
        """Read the local file header and return the byte offset after this entry."""
        self._fobj.seek(local_offset, os.SEEK_SET)
        raw = self._fobj.read(30)
        if len(raw) < 30:
            return None

        lfh = struct.unpack("<LHHHHHLLLHH", raw)
        if lfh[0] != 0x04034B50:
            return None

        fname_len, extra_len = lfh[9], lfh[10]
        flags = lfh[2]

        entry_end = local_offset + 30 + fname_len + extra_len + compressed_size
        if entry_end > self._file_size:
            return None

        if flags & 0x08:
            descriptor_end = self._measure_data_descriptor(
                entry_end, expected_crc, compressed_size, uncompressed_size
            )
            if descriptor_end is None:
                return None
            entry_end = descriptor_end

        return entry_end

    def _measure_data_descriptor(
        self,
        descriptor_offset: int,
        expected_crc: int,
        compressed_size: int,
        uncompressed_size: int,
    ) -> Optional[int]:
        """Determine the extent of the optional data descriptor."""
        self._fobj.seek(descriptor_offset, os.SEEK_SET)
        raw = self._fobj.read(24)

        if len(raw) == 24:
            d = struct.unpack("<LLQQ", raw)
            if (
                d[0] == 0x08074B50
                and d[1] == expected_crc
                and d[2] == compressed_size
                and d[3] == uncompressed_size
            ):
                return descriptor_offset + 24

        if len(raw) >= 20:
            d = struct.unpack("<LQQ", raw[:20])
            if (
                d[0] == expected_crc
                and d[1] == compressed_size
                and d[2] == uncompressed_size
            ):
                return descriptor_offset + 20

        if len(raw) >= 16:
            d = struct.unpack("<LLLL", raw[:16])
            if (
                d[0] == 0x08074B50
                and d[1] == expected_crc
                and d[2] == compressed_size
                and d[3] == uncompressed_size
            ):
                return descriptor_offset + 16

        if len(raw) >= 12:
            d = struct.unpack("<LLL", raw[:12])
            if (
                d[0] == expected_crc
                and d[1] == compressed_size
                and d[2] == uncompressed_size
            ):
                return descriptor_offset + 12

        return None

    def _check_spans(self, local_spans: list[tuple[int, int]]) -> ScanResult:
        """Merge local entry spans with structural spans and scan for overlaps."""
        all_spans = local_spans + self._record_spans
        all_spans.sort()

        _, prev_end = all_spans[0]

        for span_start, span_end in all_spans[1:]:
            if prev_end > span_start:
                return ScanResult.bomb(
                    f"records overlap: previous ends at {prev_end}, "
                    f"next starts at {span_start}"
                )
            prev_end = span_end

        return ScanResult.clean()


def _run_overlap_detection(path: str, cfg: Optional[Config]) -> None:
    """Run detect_zip_bomb against a filesystem path and raise on positive."""
    try:
        result = detect_zip_bomb(path, cfg)
    except Exception as exc:
        raise MalformedArchiveError(
            f"Failed to parse archive for overlap detection: {exc}"
        ) from exc
    if result.is_bomb:
        details = "; ".join(i.detail for i in result.issues[:2])
        raise MalformedArchiveError(f"overlapping entries detected: {details}")


def _check_overlapping_entries(
    fileobj: IO[bytes], cfg: Optional[Config] = None
) -> None:
    """Detect Fifield-style zip bombs using comprehensive detection.

    This function uses `detect_zip_bomb()` to analyse the archive for overlapping
    entries, extra-field quoting, and other Fifield 2019 attack vectors.

    For in-memory BinaryIO objects without a filesystem path, the archive is
    spilled to a temporary file to enable mmap-based detection.

    :param fileobj: A seekable binary file object.
    :param cfg: Optional Config with limits. If not provided, uses defaults.
    :raises MalformedArchiveError: If the archive cannot be parsed or
        `detect_zip_bomb()` flags it.
    """
    path = getattr(fileobj, "name", None)

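    # Only trust ``name`` when it is a real filesystem path. File objects
    # opened from a descriptor expose an int here, and re-opening that would
    # close the caller's descriptor; wrappers may also carry a name that does
    # not exist on disk. Both cases fall through to the temp-file path below.
    if isinstance(path, (str, bytes, os.PathLike)) and os.path.isfile(path):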
        _run_overlap_detection(path, cfg)
        return

    # BinaryIO input: spill to a temporary file so mmap-based detection
    # can run. Save and restore position so the caller's zipfile.ZipFile
    # instance is not disturbed.
    try:
        pos = fileobj.tell()
    except OSError:
        pos = None
    try:
        try:
            fileobj.seek(0)
        except OSError:
            log.warning(
                "Skipping Fifield-style zip bomb detection: "
                "in-memory archive is not seekable."
            )
            return

        tmp_path = None
        try:
            with tempfile.NamedTemporaryFile(suffix=".zip", delete=False) as tmp:
                tmp_path = tmp.name
                tmp.write(fileobj.read())
            _run_overlap_detection(tmp_path, cfg)
        finally:
            if tmp_path is not None:
                with suppress(OSError):
                    os.unlink(tmp_path)
    finally:
        if pos is not None:
            with suppress(OSError):
                fileobj.seek(pos)


def _check_zip64_consistency(info: zipfile.ZipInfo) -> None:
    """Detect ZIP64 inconsistencies and missing ZIP64 blocks.

    Two classes of problem are caught:

    1. **Missing ZIP64 block**: A 32-bit field holds the sentinel value
       ``0xFFFFFFFF`` (meaning "look in ZIP64 extra field"), but no ZIP64
       extra field is present.  This is always a malformed archive.

    2. **Disagreeing ZIP64 block**: A ZIP64 extra field is present, but the
       64-bit value it reports differs from the size that Python's
       ``zipfile`` resolved from the central directory.  A crafted archive
       can set the 32-bit field to a small non-sentinel value while hiding a
       huge size in the ZIP64 block; Python uses the small 32-bit value, but
       we see the discrepancy and reject the archive.
    """
    if info.file_size == SENTINEL_32 or info.compress_size == SENTINEL_32:
        zip64 = _parse_zip64_extra(info.extra) if info.extra else {}
        if not zip64:
            raise MalformedArchiveError(
                f"Entry {info.filename!r} has a ZIP64 sentinel (0xFFFFFFFF) "
                f"in the 32-bit size field but no ZIP64 extra field is present. "
                f"Archive is malformed."
            )
        return

    if not info.extra:
        return
    zip64 = _parse_zip64_extra(info.extra)
    if not zip64:
        return

    if "uncompressed_size" in zip64 and zip64["uncompressed_size"] != info.file_size:
        raise MalformedArchiveError(
            f"ZIP64 inconsistency in entry {info.filename!r}: "
            f"extra field reports uncompressed_size="
            f"{zip64['uncompressed_size']}, "
            f"but central directory reports {info.file_size}. "
            f"Archive may be crafted."
        )

    if "compressed_size" in zip64 and zip64["compressed_size"] != info.compress_size:
        raise MalformedArchiveError(
            f"ZIP64 inconsistency in entry {info.filename!r}: "
            f"extra field reports compressed_size="
            f"{zip64['compressed_size']}, "
            f"but central directory reports {info.compress_size}. "
            f"Archive may be crafted."
        )


def _validate_entry(info: zipfile.ZipInfo, max_file_size: int) -> None:
    """Validate a single ZipInfo entry during the Guard phase."""
    # Null bytes in filename
    if "\x00" in info.filename:
        raise MalformedArchiveError(
            f"Entry filename contains a null byte: {info.filename!r}"
        )

    # ZIP64 consistency
    _check_zip64_consistency(info)

    # Declared size early-rejection (Streamer enforces at stream time too)
    if info.file_size > max_file_size:
        raise FileSizeExceededError(
            f"Entry {info.filename!r} declares uncompressed size "
            f"{info.file_size:,} bytes, which exceeds the limit of "
            f"{max_file_size:,} bytes."
        )


def validate_archive(
    zf: zipfile.ZipFile,
    max_files: int,
    max_file_size: int,
    max_total_size: int,
) -> None:
    """Phase A: run all pre-extraction static checks.

    :param zf: An open zipfile.ZipFile instance (read-only access).
    :param max_files: Maximum number of entries permitted.
    :param max_file_size: Maximum permitted uncompressed size for any entry.
    :param max_total_size: Maximum permitted total uncompressed size.
    :raises FileCountExceededError: If the archive has too many entries.
    :raises FileSizeExceededError: If any entry's declared size is too large.
    :raises MalformedArchiveError: If structural anomalies are detected.
    """
    try:
        entries = zf.infolist()
    except Exception as exc:
        raise MalformedArchiveError(f"Cannot read central directory: {exc}") from exc

    if len(entries) > max_files:
        raise FileCountExceededError(
            f"Archive contains {len(entries):,} entries, "
            f"which exceeds the limit of {max_files:,}."
        )

    if zf.fp is not None:
        cfg = Config(
            max_total_uncompressed_bytes=max_total_size,
            max_file_count=max_files,
        )
        _check_overlapping_entries(zf.fp, cfg)

    for info in entries:
        _validate_entry(info, max_file_size)

src/safezip/_sandbox.py

"""Phase B: path resolution and symlink policy enforcement (the Sandbox)."""

import unicodedata
from pathlib import Path

from ._events import SymlinkPolicy
from ._exceptions import UnsafeZipError

__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"
__all__ = (
    "resolve_member_path",
    "check_symlink",
)

# Practical upper bound; real OS limits vary but 4096 is a safe conservative cap
_MAX_PATH_LENGTH = 4096


def resolve_member_path(
    base: Path,
    member_filename: str,
) -> Path:
    """Resolve and validate a ZIP member filename against *base*.

    Applies the full path normalisation pipeline:

    1. Unicode NFC normalisation (so canonically equivalent spellings of a
       name are compared in one composed form).
    2. Null-byte rejection.
    3. Reject absolute Unix paths (starting with ``/``) and absolute Windows
       paths (drive letter + slash, e.g. ``C:/``).
    4. Reject any ``..`` path component.
    5. Verify the resolved path is inside *base*.
    6. Reject paths whose resolved length exceeds ``_MAX_PATH_LENGTH``.

    :param base: The extraction root directory (must be absolute).
    :param member_filename: Raw filename string from the ZIP central directory.
    :returns: Resolved absolute Path inside *base*.
    :raises UnsafeZipError: If the filename is unsafe for any reason.
    """
    # 1. Unicode NFC normalisation
    try:
        normalized = unicodedata.normalize("NFC", member_filename)
    except (TypeError, ValueError) as err:
        raise UnsafeZipError(f"Cannot normalise filename: {member_filename!r}") from err

    # 2. Null-byte rejection
    if "\x00" in normalized:
        raise UnsafeZipError(f"Filename contains a null byte: {normalized!r}")

    # 3. Normalise separators
    _norm = normalized.replace("\\", "/")

    # Reject absolute Unix paths and UNC paths (start with '/')
    if _norm.startswith("/"):
        raise UnsafeZipError(f"Absolute path detected in filename: {member_filename!r}")

    # Reject absolute Windows paths with drive letters (e.g. "C:/Windows")
    if len(_norm) >= 3 and _norm[1] == ":" and _norm[2] == "/" and _norm[0].isalpha():
        raise UnsafeZipError(
            f"Absolute Windows path detected in filename: {member_filename!r}"
        )

    parts = _norm.split("/")

    # Walk the components: skip empty and "." parts, reject "..", and strip
    # Windows-style relative drive prefixes (e.g. "C:relpath") as defence-in-depth.
    clean_parts = []
    for part in parts:
        # Skip empty parts (double-slashes) and current-dir dots
        if not part or part == ".":
            continue
        # Reject parent-directory traversal
        if part == "..":
            raise UnsafeZipError(
                f"Path traversal detected in filename: {member_filename!r}"
            )
        # Strip Windows-style relative drive
        # references (e.g. "C:relpath" → "relpath")
        if len(part) >= 2 and part[1] == ":" and part[0].isalpha():
            part = part[2:]
            if not part:
                continue
            if part == "..":
                raise UnsafeZipError(
                    f"Path traversal detected in filename: {member_filename!r}"
                )
        clean_parts.append(part)

    if not clean_parts:
        raise UnsafeZipError(f"Filename resolves to empty path: {member_filename!r}")

    # 4. Build the resolved path
    resolved = base
    for part in clean_parts:
        resolved = resolved / part

    # 5. Confirm the resolved path is inside base
    try:
        resolved_real = resolved.resolve()
        base_real = base.resolve()
        resolved_real.relative_to(base_real)
    except (ValueError, RuntimeError, OSError) as err:
        raise UnsafeZipError(
            f"Resolved path escapes base directory: {resolved!r} is not under {base!r}"
        ) from err

    # 6. Path length check
    if len(str(resolved)) > _MAX_PATH_LENGTH:
        raise UnsafeZipError(
            f"Resolved path is too long ({len(str(resolved))} chars): "
            f"{str(resolved)[:120]!r}..."
        )

    return resolved


def check_symlink(
    extracted_path: Path,
    base: Path,
    policy: SymlinkPolicy,
) -> bool:
    """
    Check whether *extracted_path* is a symlink and apply the symlink policy.

    :param extracted_path: The path that was just extracted.
    :param base: The extraction root directory.
    :param policy: The configured symlink policy.
    :returns: ``True`` if the member should be skipped (IGNORE policy).
    :raises UnsafeZipError: If REJECT policy or chain exits base directory.
    """
    if not extracted_path.is_symlink():
        return False

    if policy == SymlinkPolicy.REJECT:
        raise UnsafeZipError(
            f"Symlink detected and symlink_policy is REJECT: {extracted_path}"
        )

    if policy == SymlinkPolicy.IGNORE:
        return True  # caller should skip this member

    # RESOLVE_INTERNAL: follow the full chain and verify every hop
    _verify_symlink_chain(extracted_path, base)
    return False


def _verify_symlink_chain(link_path: Path, base: Path) -> None:
    """Verify the full symlink chain from *link_path* stays inside *base*.

    Follows every link until a non-symlink is reached or an escape is detected.

    :raises UnsafeZipError: If any link in the chain exits *base*.
    """
    visited = set()
    current = link_path
    base_real = base.resolve()

    while current.is_symlink():
        current_resolved = current.resolve()
        real = str(current_resolved)
        if real in visited:
            # Cycle detected; treat as unsafe
            raise UnsafeZipError(
                f"Symlink cycle detected at {current}: refusing to follow further."
            )
        visited.add(real)

        try:
            current_resolved.relative_to(base_real)
        except (ValueError, RuntimeError, OSError) as err:
            raise UnsafeZipError(
                f"Symlink chain for {link_path} exits the base directory "
                f"at {current} → {current_resolved}"
            ) from err
        current = current_resolved
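
The purely lexical part of the path pipeline in ``resolve_member_path`` can be condensed into a small standalone sketch. ``is_lexically_unsafe`` is a hypothetical name used only here. It deliberately omits NFC normalisation, the relative drive-prefix stripping, and the ``resolve()``-based containment check, so it is a simplification and not a replacement for the function above.

```python
def is_lexically_unsafe(name: str) -> bool:
    """Return True if *name* fails the simple string-level checks.

    Hypothetical helper: mirrors only the null-byte, absolute-path and
    ".." component checks, not the full Sandbox pipeline.
    """
    norm = name.replace("\\", "/")  # treat backslashes as separators
    if "\x00" in norm:
        return True
    if norm.startswith("/"):  # absolute Unix path or UNC path
        return True
    if len(norm) >= 3 and norm[0].isalpha() and norm[1:3] == ":/":
        return True  # absolute Windows path such as "C:/Windows"
    return ".." in norm.split("/")  # any parent-directory component


samples = [
    "../../evil.txt",
    "/etc/passwd",
    "C:\\Windows\\system.ini",
    "subdir/data.txt",
]
for sample in samples:
    print(repr(sample), is_lexically_unsafe(sample))
```

Note that a name such as ``notes..txt`` passes, because only a component that is exactly ``..`` counts as traversal.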

src/safezip/_streamer.py

"""Phase C: streaming extraction with runtime enforcement (the Streamer)."""

import contextlib
import logging
import os
import zipfile
from pathlib import Path
from typing import Optional

from ._exceptions import (
    CompressionRatioError,
    FileSizeExceededError,
    TotalSizeExceededError,
)

__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"
__all__ = (
    "stream_extract_member",
    "CumulativeCounters",
)

log = logging.getLogger("safezip.security")

_CHUNK_SIZE = 65_536  # 64 KiB


class CumulativeCounters:
    """Tracks totals across all members in a single extractall/extract call."""

    __slots__ = ("bytes_written", "compressed_bytes")

    def __init__(self) -> None:
        self.bytes_written: int = 0
        self.compressed_bytes: int = 0


def stream_extract_member(
    zf: zipfile.ZipFile,
    member: zipfile.ZipInfo,
    dest: Path,
    *,
    max_file_size: int,
    max_per_member_ratio: float,
    max_total_size: int,
    max_total_ratio: float,
    counters: CumulativeCounters,
    pwd: Optional[bytes] = None,
) -> None:
    """
    Stream a single member from *zf* to *dest* with full runtime enforcement.

    Extraction is atomic: bytes are written to a temporary file and renamed to
    *dest* only after all checks pass.  If any check raises, the temporary file
    is deleted and *dest* is never created/modified.

    :param zf: Open zipfile.ZipFile instance (internal use only).
    :param member: The ZipInfo entry to extract.
    :param dest: Final destination path (must already be path-validated).
    :param max_file_size: Per-member decompressed size limit in bytes.
    :param max_per_member_ratio: Per-member decompressed/compressed ratio
           limit.
    :param max_total_size: Cumulative decompressed size limit across all
           members.
    :param max_total_ratio: Cumulative ratio limit across all members.
    :param counters: Shared counters for cumulative checks.
    :param pwd: Optional decryption password.
    """
    dest.parent.mkdir(parents=True, exist_ok=True)

    tmp_name = f"{dest.name}.safezip_tmp_{os.getpid()}_{os.urandom(4).hex()}"
    tmp_path = dest.parent / tmp_name

    # compress_size may be 0 for data-descriptor archives
    compress_size = member.compress_size
    member_bytes_written = 0

    try:
        with zf.open(member, pwd=pwd) as src, open(tmp_path, "wb") as dst:
            while True:
                chunk = src.read(_CHUNK_SIZE)
                if not chunk:
                    break

                chunk_len = len(chunk)
                member_bytes_written += chunk_len
                counters.bytes_written += chunk_len

                # --- Per-member size check ---
                if member_bytes_written > max_file_size:
                    raise FileSizeExceededError(
                        f"Member {member.filename!r} exceeded max_file_size="
                        f"{max_file_size:,} bytes "
                        f"(decompressed {member_bytes_written:,} bytes so "
                        "far)."
                    )

                # --- Per-member ratio check ---
                # Only when compress_size is known (not a data-descriptor
                # entry).
                if compress_size > 0:
                    ratio = member_bytes_written / compress_size
                    if ratio > max_per_member_ratio:
                        raise CompressionRatioError(
                            f"Member {member.filename!r} compression ratio "
                            f"{ratio:.1f}:1 exceeds "
                            f"max_per_member_ratio={max_per_member_ratio}:1."
                        )

                # --- Cumulative size check ---
                if counters.bytes_written > max_total_size:
                    raise TotalSizeExceededError(
                        f"Cumulative decompressed size "
                        f"{counters.bytes_written:,} bytes exceeds "
                        f"max_total_size={max_total_size:,} bytes."
                    )

                # --- Cumulative ratio check ---
                # Estimate the compressed bytes consumed by this chunk by
                # apportioning compress_size over the declared file_size.
                if compress_size > 0:
                    counters.compressed_bytes += (
                        chunk_len * compress_size // max(member.file_size, 1)
                    )
                if counters.compressed_bytes > 0:
                    total_ratio = counters.bytes_written / counters.compressed_bytes  # noqa
                    if total_ratio > max_total_ratio:
                        raise CompressionRatioError(
                            f"Cumulative compression ratio {total_ratio:.1f}:1 "
                            f"exceeds max_total_ratio={max_total_ratio}:1."
                        )

                dst.write(chunk)

        # All checks passed - atomic rename to final destination
        tmp_path.replace(dest)

    except Exception:
        # Clean up partial / temporary file on any failure
        with contextlib.suppress(OSError):
            tmp_path.unlink(missing_ok=True)
        raise
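
The idea behind the per-member ratio check, counting decompressed bytes chunk by chunk and aborting as soon as the ratio limit is crossed, can be shown without ``zipfile``. The sketch below uses ``zlib`` directly. ``bounded_inflate`` is a hypothetical function, and ``ValueError`` stands in for ``CompressionRatioError``. It illustrates the technique and is not the Streamer's implementation.

```python
import zlib


def bounded_inflate(
    compressed: bytes, max_ratio: float, chunk_size: int = 65_536
) -> int:
    """Decompress a zlib stream in bounded chunks; return the bytes produced.

    Raises ValueError once decompressed/compressed exceeds *max_ratio*.
    """
    decomp = zlib.decompressobj()
    written = 0
    buf = compressed
    while buf:
        # max_length caps the output of this call; input that was not
        # consumed yet is kept in decomp.unconsumed_tail.
        chunk = decomp.decompress(buf, chunk_size)
        buf = decomp.unconsumed_tail
        written += len(chunk)
        if written / len(compressed) > max_ratio:
            raise ValueError(
                f"ratio {written / len(compressed):.1f}:1 exceeds {max_ratio}:1"
            )
    written += len(decomp.flush())  # any output still buffered
    return written


benign = zlib.compress(b"hello world" * 10)
print(bounded_inflate(benign, max_ratio=200.0))

bomb = zlib.compress(b"\x00" * (2 * 1024 * 1024))
try:
    bounded_inflate(bomb, max_ratio=200.0)
except ValueError as exc:
    print("rejected:", exc)
```

Because the check runs after every chunk, the highly compressible input is rejected after a few chunks instead of after all 2 MiB have been produced.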

src/safezip/cli/__init__.py

"""safezip.cli — command-line interface for safezip."""

from ._main import main

__all__ = ("main",)

src/safezip/cli/_main.py

"""safezip CLI — hardened ZIP extraction from the command line."""

import argparse
import sys
import zipfile
from pathlib import Path

from safezip import SafeZipFile, SymlinkPolicy, safe_extract
from safezip._exceptions import SafezipError

__all__ = ("main",)

_SYMLINK_POLICIES = {
    "reject": SymlinkPolicy.REJECT,
    "ignore": SymlinkPolicy.IGNORE,
    "resolve_internal": SymlinkPolicy.RESOLVE_INTERNAL,
}


def _build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="safezip",
        description="Hardened ZIP extraction — safe by default.",
    )
    parser.add_argument(
        "--version",
        action="version",
        version=f"%(prog)s {_version()}",
    )

    sub = parser.add_subparsers(dest="command", required=True)

    # ------------------------------------------------------------------ extract
    ext = sub.add_parser("extract", help="Extract a ZIP archive safely.")
    ext.add_argument("archive", help="Path to the ZIP file.")
    ext.add_argument("destination", help="Directory to extract into.")
    ext.add_argument(
        "--max-file-size",
        type=int,
        metavar="BYTES",
        help="Max uncompressed size per member (default: 1 GiB).",
    )
    ext.add_argument(
        "--max-total-size",
        type=int,
        metavar="BYTES",
        help="Max total uncompressed size (default: 5 GiB).",
    )
    ext.add_argument(
        "--max-files",
        type=int,
        metavar="N",
        help="Max number of members (default: 10 000).",
    )
    ext.add_argument(
        "--max-per-member-ratio",
        type=float,
        metavar="RATIO",
        help="Max compression ratio per member (default: 200).",
    )
    ext.add_argument(
        "--max-total-ratio",
        type=float,
        metavar="RATIO",
        help="Max overall compression ratio (default: 200).",
    )
    ext.add_argument(
        "--max-nesting-depth",
        type=int,
        metavar="N",
        help="Max nested-archive depth (default: 3).",
    )
    ext.add_argument(
        "--symlink-policy",
        choices=list(_SYMLINK_POLICIES),
        default=None,
        metavar="POLICY",
        help=(
            "How to handle symlink entries: reject (default), ignore, resolve_internal."
        ),
    )
    ext.add_argument(
        "--password",
        metavar="PWD",
        help="Decryption password for encrypted archives.",
    )
    ext.add_argument(
        "--recursive",
        action="store_true",
        default=False,
        help="Enable recursive extraction of nested archives.",
    )

    # --------------------------------------------------------------------- list
    lst = sub.add_parser("list", help="List members of a ZIP archive.")
    lst.add_argument("archive", help="Path to the ZIP file.")

    return parser


def _version() -> str:
    try:
        from safezip import __version__

        return __version__
    except ImportError:
        return "unknown"


def _cmd_extract(args: argparse.Namespace) -> int:
    kwargs: dict = {}

    for attr in (
        "max_file_size",
        "max_total_size",
        "max_files",
        "max_per_member_ratio",
        "max_total_ratio",
        "max_nesting_depth",
        "recursive",
    ):
        val = getattr(args, attr, None)
        if val is not None:
            kwargs[attr] = val

    if args.symlink_policy is not None:
        kwargs["symlink_policy"] = _SYMLINK_POLICIES[args.symlink_policy]

    if args.password is not None:
        kwargs["password"] = args.password.encode()

    dest = Path(args.destination)
    dest.mkdir(parents=True, exist_ok=True)

    try:
        safe_extract(args.archive, dest, **kwargs)
    except (SafezipError, FileNotFoundError, zipfile.BadZipFile) as exc:
        print(f"error: {exc}", file=sys.stderr)
        return 1

    print(f"Extracted to {dest.resolve()}")
    return 0


def _cmd_list(args: argparse.Namespace) -> int:
    try:
        with SafeZipFile(args.archive) as zf:
            for name in zf.namelist():
                print(name)
    except (SafezipError, FileNotFoundError, zipfile.BadZipFile) as exc:
        print(f"error: {exc}", file=sys.stderr)
        return 1

    return 0


def main() -> None:
    parser = _build_parser()
    args = parser.parse_args()

    if args.command == "extract":
        sys.exit(_cmd_extract(args))
    elif args.command == "list":
        sys.exit(_cmd_list(args))
    else:  # pragma: no cover
        parser.print_help()
        sys.exit(1)

src/safezip/tests/__init__.py

"""Tests for safezip."""

__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"

src/safezip/tests/conftest.py

"""Pytest fixtures: factory functions that craft malicious ZIP archives."""

import io
import stat
import struct
import zipfile
import zlib

import pytest

__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"
__all__ = (
    "zipslip_archive",
    "absolute_path_archive",
    "unicode_traversal_archive",
    "high_ratio_archive",
    "many_files_archive",
    "null_byte_filename_archive",
    "zip64_inconsistency_archive",
    "legitimate_archive",
    "symlink_archive",
    "setuid_archive",
    "data_descriptor_empty_archive",
    "data_descriptor_invalid_bomb_archive",
    "fifield_bomb_archive",
)


# ---------------------------------------------------------------------------
# Archive factory helpers
# ---------------------------------------------------------------------------


def _make_zip_bytes(entries: list[tuple[str, bytes]]) -> bytes:
    """Create a ZIP in memory from (filename, content) pairs."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, content in entries:
            info = zipfile.ZipInfo(name)
            zf.writestr(info, content)
    return buf.getvalue()


def _make_zip_bytes_stored(entries: list[tuple[str, bytes]]) -> bytes:
    """Create a stored (uncompressed) ZIP in memory."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        for name, content in entries:
            info = zipfile.ZipInfo(name)
            zf.writestr(info, content)
    return buf.getvalue()


# ---------------------------------------------------------------------------
# Fixtures
# ---------------------------------------------------------------------------


@pytest.fixture()
def zipslip_archive(tmp_path):
    """A ZIP whose sole entry has a path-traversal filename."""
    data = _make_zip_bytes([("../../evil.txt", b"evil content")])
    p = tmp_path / "zipslip.zip"
    p.write_bytes(data)
    return p


@pytest.fixture()
def absolute_path_archive(tmp_path):
    """A ZIP with an absolute Unix-style path entry."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        info = zipfile.ZipInfo("/etc/passwd")
        zf.writestr(info, "root:x:0:0:root:/root:/bin/bash\n")
    data = buf.getvalue()
    p = tmp_path / "absolute.zip"
    p.write_bytes(data)
    return p


@pytest.fixture()
def unicode_traversal_archive(tmp_path):
    """A ZIP with combining Unicode characters that NFC-normalises to a path
    still containing a ``..`` traversal component.

    The filename ``e\\u0301vil/../../escape.txt`` uses U+0301 COMBINING ACUTE
    ACCENT (NFD form of ``é``).  After Unicode NFC normalisation the combining
    accent is folded into the precomposed ``é``, yielding
    ``évil/../../escape.txt``.  The ``..`` components are unaffected by NFC
    and must still be detected and rejected.
    """
    # e + COMBINING ACUTE ACCENT → é after NFC; the traversal stays intact
    data = _make_zip_bytes([("e\u0301vil/../../escape.txt", b"escaped")])
    p = tmp_path / "unicode_traversal.zip"
    p.write_bytes(data)
    return p


@pytest.fixture()
def high_ratio_archive(tmp_path):
    """A ZIP whose content compresses at a very high ratio (zeros)."""
    # 2 MiB of zeros → compressed to ~2 KB → ratio ~1000:1
    data_bytes = b"\x00" * (2 * 1024 * 1024)
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("zeros.bin", data_bytes)
    p = tmp_path / "bomb.zip"
    p.write_bytes(buf.getvalue())
    return p


@pytest.fixture()
def many_files_archive(tmp_path):
    """A ZIP with more entries than the default max_files limit allows."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for i in range(15_000):
            zf.writestr(f"file_{i:05d}.txt", b"x")
    p = tmp_path / "many_files.zip"
    p.write_bytes(buf.getvalue())
    return p


@pytest.fixture()
def null_byte_filename_archive(tmp_path):
    """A ZIP with a null byte injected into a filename via raw struct manipulation.

    Python's zipfile won't let us write such names directly, so we craft the
    raw bytes: a minimal ZIP with one entry whose filename contains \\x00.
    """
    # Minimal ZIP structure:
    # Local file header + file data + central directory + end of central directory
    filename = b"safe\x00../../etc/passwd"
    fname_len = len(filename)
    content = b"evil"
    content_len = len(content)

    # Local file header (signature 0x04034b50)
    local_header = (
        struct.pack(
            "<4s2H3H4s2I2H",
            b"PK\x03\x04",  # signature
            20,  # version needed
            0,  # flags
            0,  # compression (stored)
            0,  # mod time
            0,  # mod date
            b"\x00\x00\x00\x00",  # CRC-32
            content_len,  # compressed size
            content_len,  # uncompressed size
            fname_len,  # filename length
            0,  # extra field length
        )
        + filename
        + content
    )

    local_offset = 0

    # Central directory header (signature 0x02014b50)
    # Format: 4s sig | 6H (ver_made,ver_needed,flags,compress,mod_time,mod_date) |
    #         4s CRC | 2I (comp_size,uncomp_size) |
    #         5H (fname_len,extra_len,comment_len,disk_start,int_attr) |
    #         2I (ext_attr, offset)  → 17 items, 46 bytes
    central_header = (
        struct.pack(
            "<4s6H4s2I5H2I",
            b"PK\x01\x02",  # signature
            0x031E,  # version made by (Unix, v30)
            20,  # version needed
            0,  # flags
            0,  # compression
            0,  # mod time
            0,  # mod date
            b"\x00\x00\x00\x00",  # CRC-32
            content_len,  # compressed size (I)
            content_len,  # uncompressed size (I)
            fname_len,  # filename length
            0,  # extra field length
            0,  # file comment length
            0,  # disk number start
            0,  # internal file attributes
            0,  # external file attributes (I)
            local_offset,  # relative offset of local header (I)
        )
        + filename
    )

    central_offset = len(local_header)
    central_size = len(central_header)

    # End of central directory record (signature 0x06054b50)
    eocd = struct.pack(
        "<4s4H2IH",
        b"PK\x05\x06",  # signature
        0,  # disk number
        0,  # disk with central dir
        1,  # entries on this disk
        1,  # total entries
        central_size,  # size of central directory
        central_offset,  # offset of central directory
        0,  # comment length
    )

    data = local_header + central_header + eocd
    p = tmp_path / "nullbyte.zip"
    p.write_bytes(data)
    return p


@pytest.fixture()
def zip64_inconsistency_archive(tmp_path):
    """A ZIP with a ZIP64 extra field that disagrees with the central directory.

    We craft a minimal archive where the ZIP64 extra field reports a size of
    999_999_999 bytes but the 32-bit central directory field reports 100 bytes.
    Python will use the 32-bit value (100), but our ZIP64 check sees 999_999_999
    and raises MalformedArchiveError.
    """
    filename = b"test.txt"
    fname_len = len(filename)
    content = b"A" * 100

    # ZIP64 extra field reporting a huge uncompressed size
    zip64_uncompressed = 999_999_999
    zip64_extra = struct.pack(
        "<HHQ",
        0x0001,  # ZIP64 tag
        8,  # size of following data (8 bytes = one uint64)
        zip64_uncompressed,  # uncompressed size (disagrees with 32-bit field below)
    )
    extra_len = len(zip64_extra)

    # Local file header - 32-bit uncompressed size = 100 (not sentinel)
    local_header = (
        struct.pack(
            "<4s2H3H4s2I2H",
            b"PK\x03\x04",
            20,
            0,
            0,
            0,
            0,
            b"\x00\x00\x00\x00",
            len(content),
            len(content),  # 32-bit uncompressed size = 100
            fname_len,
            extra_len,
        )
        + filename
        + zip64_extra
        + content
    )

    local_offset = 0

    # Central directory header - 32-bit uncompressed size = 100 (not sentinel)
    # Format: 4s | 6H | 4s CRC | 2I (comp,uncomp) | 5H | 2I → 17 items, 46 bytes
    central_header = (
        struct.pack(
            "<4s6H4s2I5H2I",
            b"PK\x01\x02",
            0x031E,
            20,
            0,
            0,
            0,
            0,
            b"\x00\x00\x00\x00",
            len(content),  # compressed size (I)
            len(content),  # 32-bit uncompressed size = 100 (I, not sentinel)
            fname_len,
            extra_len,
            0,  # comment length
            0,  # disk number start
            0,  # internal attributes
            0,  # external attributes (I)
            local_offset,  # offset of local header (I)
        )
        + filename
        + zip64_extra
    )

    central_offset = len(local_header)
    central_size = len(central_header)

    eocd = struct.pack(
        "<4s4H2IH",
        b"PK\x05\x06",
        0,
        0,
        1,
        1,
        central_size,
        central_offset,
        0,
    )

    data = local_header + central_header + eocd
    p = tmp_path / "zip64_inconsistency.zip"
    p.write_bytes(data)
    return p


@pytest.fixture()
def legitimate_archive(tmp_path):
    """A well-formed, safe archive with a few text files."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("hello.txt", b"Hello, world!\n")
        zf.writestr("subdir/data.txt", b"Some data\n")
        zf.writestr("subdir/nested/deep.txt", b"Deep file\n")
    p = tmp_path / "legitimate.zip"
    p.write_bytes(buf.getvalue())
    return p


@pytest.fixture()
def symlink_archive(tmp_path):
    """A ZIP containing one regular file and one Unix symlink entry.

    The symlink entry's content (the link target) is ``../escape.txt``,
    which would point outside the extraction root if followed blindly.
    The entry is flagged as a symlink via the upper 16 bits of
    ``ZipInfo.external_attr`` (Unix mode ``S_IFLNK | 0o755``).
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        # A harmless regular file that must always be extractable
        zf.writestr("readme.txt", b"safe content\n")
        # Symlink entry: mode S_IFLNK | 0o755, content = link target
        sym = zipfile.ZipInfo("link.txt")
        sym.external_attr = (stat.S_IFLNK | 0o755) << 16
        zf.writestr(sym, "../escape.txt")
    p = tmp_path / "symlink.zip"
    p.write_bytes(buf.getvalue())
    return p


@pytest.fixture()
def setuid_archive(tmp_path):
    """A ZIP with a regular file that has setuid bit (04755) in external_attr."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        info = zipfile.ZipInfo("suid_binary")
        info.external_attr = (0o4755 & 0xFFFF) << 16
        zf.writestr(info, b"ELF\x00")
    p = tmp_path / "setuid.zip"
    p.write_bytes(buf.getvalue())
    return p


@pytest.fixture()
def data_descriptor_empty_archive(tmp_path):
    """Valid ZIP with empty member using data descriptor (compress_size=0)."""
    comp_data = b""
    comp_size = 0
    uncomp_size = 0
    crc = 0

    filename = b"empty.txt"
    fname_len = len(filename)

    # Local header: sizes=0, flags=0x08, method=0 (stored, since empty)
    local_header = (
        struct.pack(
            "<4sHHHHHIIIHH",
            b"PK\x03\x04",
            20,
            0x08,
            0,  # stored
            0,
            0,
            0,
            0,
            0,
            fname_len,
            0,
        )
        + filename
    )

    # Data descriptor
    descriptor = struct.pack("<4sIII", b"PK\x07\x08", crc, comp_size, uncomp_size)

    local_with_desc = local_header + comp_data + descriptor

    # Central header: sizes=0, flags=0x08
    central_header = (
        struct.pack(
            "<4sHHHHHHIIIHHHHHII",
            b"PK\x01\x02",
            0x0314,
            20,
            0x08,
            0,
            0,
            0,
            crc,
            comp_size,
            uncomp_size,
            fname_len,
            0,
            0,
            0,
            0,
            0,
            0,
        )
        + filename
    )

    cd_offset = len(local_with_desc)
    cd_size = len(central_header)
    eocd = struct.pack("<4sHHHHIIH", b"PK\x05\x06", 0, 0, 1, 1, cd_size, cd_offset, 0)

    archive_bytes = local_with_desc + central_header + eocd
    p = tmp_path / "dd_empty.zip"
    p.write_bytes(archive_bytes)
    return p


@pytest.fixture()
def data_descriptor_invalid_bomb_archive(tmp_path):
    """
    Invalid ZIP with non-empty member, data descriptor, but CD compress_size=0.
    """
    uncomp_data = b"\x00" * 2000
    compressor = zlib.compressobj(
        zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, -zlib.MAX_WBITS
    )
    comp_data = compressor.compress(uncomp_data) + compressor.flush()
    comp_size = len(comp_data)
    uncomp_size = len(uncomp_data)
    crc = zlib.crc32(uncomp_data)

    filename = b"bomb.txt"
    fname_len = len(filename)

    # Local header: sizes=0, flags=0x08, method=8 (deflate)
    local_header = (
        struct.pack(
            "<4sHHHHHIIIHH",
            b"PK\x03\x04",
            20,
            0x08,
            8,
            0,
            0,
            0,
            0,
            0,
            fname_len,
            0,
        )
        + filename
    )

    # Data descriptor with real sizes
    descriptor = struct.pack("<4sIII", b"PK\x07\x08", crc, comp_size, uncomp_size)

    local_with_desc = local_header + comp_data + descriptor

    # Central header: compress_size=0 (invalid mismatch), uncomp_size=real
    central_header = (
        struct.pack(
            "<4sHHHHHHIIIHHHHHII",
            b"PK\x01\x02",
            0x0314,
            20,
            0x08,
            8,
            0,
            0,
            crc,
            0,  # invalid comp_size=0
            uncomp_size,
            fname_len,
            0,
            0,
            0,
            0,
            0,
            0,
        )
        + filename
    )

    cd_offset = len(local_with_desc)
    cd_size = len(central_header)
    eocd = struct.pack("<4sHHHHIIH", b"PK\x05\x06", 0, 0, 1, 1, cd_size, cd_offset, 0)

    archive_bytes = local_with_desc + central_header + eocd
    p = tmp_path / "dd_invalid_bomb.zip"
    p.write_bytes(archive_bytes)
    return p


@pytest.fixture()
def fifield_bomb_archive(tmp_path):
    """A Fifield-style zip bomb: multiple central directory entries all pointing
    to the same compressed local entry (overlapping spans).

    Structure:
      - One real local entry at offset 0 containing 200 bytes of zeros.
      - Central directory with 3 entries, all with local_header_offset=0,
        so their spans all overlap the single local entry.

    This is structurally invalid per the ZIP specification and should be
    detected and rejected by _check_overlapping_entries before any
    decompression occurs.
    """
    content = b"\x00" * 200
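    # zlib.compress() wraps the stream in a 2-byte header and a 4-byte
    # Adler-32 trailer; strip both to get the raw DEFLATE data ZIP expects.
    compressed = zlib.compress(content)[2:-4]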
    crc = zlib.crc32(content) & 0xFFFFFFFF
    comp_size = len(compressed)
    uncomp_size = len(content)

    fname = b"data.bin"
    fname_len = len(fname)

    local_header = (
        struct.pack(
            "<4s2H3H4s2I2H",
            b"PK\x03\x04",
            20,
            0,
            8,
            0,
            0,
            struct.pack("<I", crc),
            comp_size,
            uncomp_size,
            fname_len,
            0,
        )
        + fname
        + compressed
    )

    local_offset = 0

    def make_cd_entry(offset):
        return (
            struct.pack(
                "<4s6H4s2I5H2I",
                b"PK\x01\x02",
                0x031E,
                20,
                0,
                8,
                0,
                0,
                struct.pack("<I", crc),
                comp_size,
                uncomp_size,
                fname_len,
                0,
                0,
                0,
                0,
                0,
                offset,
            )
            + fname
        )

    cd = make_cd_entry(local_offset) * 3
    cd_offset = len(local_header)
    cd_size = len(cd)

    eocd = struct.pack(
        "<4s4H2IH",
        b"PK\x05\x06",
        0,
        0,
        3,
        3,
        cd_size,
        cd_offset,
        0,
    )

    data = local_header + cd + eocd
    p = tmp_path / "fifield_bomb.zip"
    p.write_bytes(data)
    return p

src/safezip/tests/test_cli.py

"""Tests for the safezip CLI."""

import io
import zipfile
from unittest.mock import patch

import pytest

from safezip.cli._main import main

__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"


@pytest.fixture()
def simple_archive(tmp_path):
    """A simple valid ZIP archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("file1.txt", b"content1\n")
        zf.writestr("dir/file2.txt", b"content2\n")
    p = tmp_path / "simple.zip"
    p.write_bytes(buf.getvalue())
    return p


class TestExtractCommand:
    """Tests for the extract command."""

    def test_extract_basic(self, simple_archive, tmp_path, capsys):
        """Basic extraction works."""
        dest = tmp_path / "out"
        with patch("sys.argv", ["safezip", "extract", str(simple_archive), str(dest)]):
            with pytest.raises(SystemExit) as exc_info:
                main()
            assert exc_info.value.code == 0

        assert (dest / "file1.txt").read_text() == "content1\n"
        assert (dest / "dir" / "file2.txt").read_text() == "content2\n"
        captured = capsys.readouterr()
        assert "Extracted to" in captured.out

    def test_extract_with_max_file_size(self, simple_archive, tmp_path):
        """Extract with --max-file-size flag."""
        dest = tmp_path / "out"
        with patch(
            "sys.argv",
            [
                "safezip",
                "extract",
                str(simple_archive),
                str(dest),
                "--max-file-size",
                "1000",
            ],
        ):
            with pytest.raises(SystemExit) as exc_info:
                main()
            assert exc_info.value.code == 0

        assert (dest / "file1.txt").exists()

    def test_extract_with_max_files(self, simple_archive, tmp_path):
        """Extract with --max-files flag."""
        dest = tmp_path / "out"
        with patch(
            "sys.argv",
            [
                "safezip",
                "extract",
                str(simple_archive),
                str(dest),
                "--max-files",
                "10",
            ],
        ):
            with pytest.raises(SystemExit) as exc_info:
                main()
            assert exc_info.value.code == 0

        assert (dest / "file1.txt").exists()

    def test_extract_with_symlink_policy(self, simple_archive, tmp_path):
        """Extract with --symlink-policy flag."""
        dest = tmp_path / "out"
        with patch(
            "sys.argv",
            [
                "safezip",
                "extract",
                str(simple_archive),
                str(dest),
                "--symlink-policy",
                "reject",
            ],
        ):
            with pytest.raises(SystemExit) as exc_info:
                main()
            assert exc_info.value.code == 0

        assert (dest / "file1.txt").exists()

    def test_extract_with_recursive_flag(self, simple_archive, tmp_path):
        """Extract with --recursive flag."""
        dest = tmp_path / "out"
        with patch(
            "sys.argv",
            [
                "safezip",
                "extract",
                str(simple_archive),
                str(dest),
                "--recursive",
            ],
        ):
            with pytest.raises(SystemExit) as exc_info:
                main()
            assert exc_info.value.code == 0

        assert (dest / "file1.txt").exists()

    def test_extract_nonexistent_archive(self, tmp_path, capsys):
        """Extract fails with nonexistent archive."""
        dest = tmp_path / "out"
        with patch("sys.argv", ["safezip", "extract", "/nonexistent.zip", str(dest)]):
            with pytest.raises(SystemExit) as exc_info:
                main()
            assert exc_info.value.code == 1

        captured = capsys.readouterr()
        assert "error:" in captured.err

    def test_extract_creates_destination(self, simple_archive, tmp_path):
        """Extract creates destination directory if it doesn't exist."""
        dest = tmp_path / "nested" / "out"
        with patch("sys.argv", ["safezip", "extract", str(simple_archive), str(dest)]):
            with pytest.raises(SystemExit) as exc_info:
                main()
            assert exc_info.value.code == 0

        assert dest.exists()
        assert (dest / "file1.txt").exists()

    def test_extract_zipslip_rejected(self, zipslip_archive, tmp_path, capsys):
        """Extract rejects ZipSlip archive."""
        dest = tmp_path / "out"
        with patch("sys.argv", ["safezip", "extract", str(zipslip_archive), str(dest)]):
            with pytest.raises(SystemExit) as exc_info:
                main()
            assert exc_info.value.code == 1

        captured = capsys.readouterr()
        assert "error:" in captured.err

    def test_extract_zipbomb_rejected(self, high_ratio_archive, tmp_path, capsys):
        """Extract rejects ZIP bomb."""
        dest = tmp_path / "out"
        with patch(
            "sys.argv",
            [
                "safezip",
                "extract",
                str(high_ratio_archive),
                str(dest),
                "--max-per-member-ratio",
                "10",
            ],
        ):
            with pytest.raises(SystemExit) as exc_info:
                main()
            assert exc_info.value.code == 1

        captured = capsys.readouterr()
        assert "error:" in captured.err

    def test_extract_too_many_files_rejected(
        self, many_files_archive, tmp_path, capsys
    ):
        """Extract rejects archive with too many files."""
        dest = tmp_path / "out"
        with patch(
            "sys.argv",
            [
                "safezip",
                "extract",
                str(many_files_archive),
                str(dest),
                "--max-files",
                "100",
            ],
        ):
            with pytest.raises(SystemExit) as exc_info:
                main()
            assert exc_info.value.code == 1

        captured = capsys.readouterr()
        assert "error:" in captured.err

    def test_extract_null_byte_filename_rejected(
        self, null_byte_filename_archive, tmp_path, capsys
    ):
        """Extract rejects archive with null byte in filename."""
        dest = tmp_path / "out"
        with patch(
            "sys.argv",
            ["safezip", "extract", str(null_byte_filename_archive), str(dest)],
        ):
            with pytest.raises(SystemExit) as exc_info:
                main()
            # Can exit with 1 due to either MalformedArchiveError (null byte)
            # or BadZipFile (CRC error from crafted archive)
            assert exc_info.value.code == 1

        captured = capsys.readouterr()
        assert "error:" in captured.err


class TestListCommand:
    """Tests for the list command."""

    def test_list_basic(self, simple_archive, capsys):
        """List command shows archive members."""
        with patch("sys.argv", ["safezip", "list", str(simple_archive)]):
            with pytest.raises(SystemExit) as exc_info:
                main()
            assert exc_info.value.code == 0

        captured = capsys.readouterr()
        assert "file1.txt" in captured.out
        assert "dir/file2.txt" in captured.out

    def test_list_nonexistent_archive(self, capsys):
        """List fails with nonexistent archive."""
        with patch("sys.argv", ["safezip", "list", "/nonexistent.zip"]):
            with pytest.raises(SystemExit) as exc_info:
                main()
            assert exc_info.value.code == 1

        captured = capsys.readouterr()
        assert "error:" in captured.err


class TestVersionFlag:
    """Tests for --version flag."""

    def test_version_flag(self, capsys):
        """--version flag displays version."""
        with patch("sys.argv", ["safezip", "--version"]):
            with pytest.raises(SystemExit) as exc_info:
                main()
            assert exc_info.value.code == 0

        captured = capsys.readouterr()
        assert "safezip" in captured.out

src/safezip/tests/test_guard.py

"""Tests for Phase A: the Guard (pre-extraction validation)."""

import io
import struct
import zipfile
import zlib

import pytest

from safezip import (
    FileCountExceededError,
    FileSizeExceededError,
    MalformedArchiveError,
    SafeZipFile,
)
from safezip._guard import ScanResult, ZipInspector

__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"


class TestFileCountLimit:
    """Guard rejects archives with too many entries."""

    def test_many_files_raises(self, many_files_archive, tmp_path):
        with pytest.raises(FileCountExceededError):
            SafeZipFile(many_files_archive)

    def test_many_files_custom_limit_passes(self, many_files_archive, tmp_path):
        # Allow up to 20 000 files - should open without error
        with SafeZipFile(many_files_archive, max_files=20_000):
            pass

    def test_file_count_exactly_at_limit(self, tmp_path):
        buf = io.BytesIO()
        with zipfile.ZipFile(buf, "w") as zf:
            for i in range(5):
                zf.writestr(f"file_{i}.txt", b"x")
        p = tmp_path / "five.zip"
        p.write_bytes(buf.getvalue())
        with SafeZipFile(p, max_files=5):
            pass

    def test_file_count_one_over_limit(self, tmp_path):
        buf = io.BytesIO()
        with zipfile.ZipFile(buf, "w") as zf:
            for i in range(6):
                zf.writestr(f"file_{i}.txt", b"x")
        p = tmp_path / "six.zip"
        p.write_bytes(buf.getvalue())
        with pytest.raises(FileCountExceededError):
            SafeZipFile(p, max_files=5)


class TestDeclaredFileSizeLimit:
    """Guard rejects archives whose declared sizes exceed max_file_size."""

    def test_large_declared_size_raises(self, tmp_path):
        # Store a 200-byte member and set max_file_size=100: the declared
        # size in the central directory exceeds the limit, so the Guard
        # rejects the archive on open.
        buf = io.BytesIO()
        with zipfile.ZipFile(buf, "w") as zf:
            zf.writestr("data.bin", b"A" * 200)
        p = tmp_path / "large.zip"
        p.write_bytes(buf.getvalue())

        with pytest.raises(FileSizeExceededError):
            SafeZipFile(p, max_file_size=100)

    def test_size_exactly_at_limit_passes(self, tmp_path):
        buf = io.BytesIO()
        with zipfile.ZipFile(buf, "w") as zf:
            zf.writestr("data.bin", b"A" * 100)
        p = tmp_path / "exact.zip"
        p.write_bytes(buf.getvalue())
        with SafeZipFile(p, max_file_size=100):
            pass


class TestNullByteInFilename:
    """Null bytes in ZIP filenames are neutralised by Python's zipfile layer.

    Python 3.x's :mod:`zipfile` truncates filenames at the first null byte
    when reading the central directory (e.g. ``safe\x00../../etc/passwd``
    becomes ``safe``).  Our Guard therefore never sees a null byte in
    ``ZipInfo.filename``; the Sandbox's ``resolve_member_path`` carries the
    defence-in-depth check for callers that bypass ``zipfile``.

    This test verifies the safe outcome: no traversal path survives Python's
    null-byte truncation.
    """

    def test_null_byte_filename_truncated_safely(self, null_byte_filename_archive):
        """Python truncates at the null byte; the traversal part is never seen."""
        # Python truncates 'safe\x00../../etc/passwd' → 'safe'
        with SafeZipFile(null_byte_filename_archive) as zf:
            names = zf.namelist()
        # No null bytes survive Python's filename decoding
        assert not any("\x00" in n for n in names), (
            f"Null byte survived Python's filename decoding: {names!r}"
        )
        # No directory-traversal components should be present
        assert not any(".." in n for n in names), (
            f"Traversal component present after null-byte truncation: {names!r}"
        )


class TestZip64Inconsistency:
    """Guard detects ZIP64 extra fields that disagree with central directory."""

    def test_zip64_inconsistency_raises(self, zip64_inconsistency_archive):
        with pytest.raises(MalformedArchiveError):
            SafeZipFile(zip64_inconsistency_archive)


class TestLegitimateArchive:
    """Guard passes well-formed archives."""

    def test_legitimate_archive_passes(self, legitimate_archive):
        with SafeZipFile(legitimate_archive) as zf:
            assert len(zf.namelist()) == 3

    def test_namelist_accessible(self, legitimate_archive):
        with SafeZipFile(legitimate_archive) as zf:
            names = zf.namelist()
        assert "hello.txt" in names

    def test_infolist_accessible(self, legitimate_archive):
        with SafeZipFile(legitimate_archive) as zf:
            infos = zf.infolist()
        assert any(i.filename == "hello.txt" for i in infos)

    def test_getinfo_accessible(self, legitimate_archive):
        with SafeZipFile(legitimate_archive) as zf:
            info = zf.getinfo("hello.txt")
        assert info.filename == "hello.txt"


class TestOverlappingEntryDetection:
    """Guard rejects archives with overlapping local entries (Fifield-style bombs)."""

    def test_fifield_bomb_raises_malformed(self, fifield_bomb_archive):
        with pytest.raises(MalformedArchiveError, match="overlapping"):
            SafeZipFile(fifield_bomb_archive)

    def test_fifield_bomb_no_extraction_attempted(self, fifield_bomb_archive, tmp_path):
        dest = tmp_path / "out"
        dest.mkdir()
        with pytest.raises(MalformedArchiveError):
            SafeZipFile(fifield_bomb_archive)
        assert list(dest.iterdir()) == []

    def test_legitimate_archive_passes_overlap_check(self, legitimate_archive):
        with SafeZipFile(legitimate_archive) as zf:
            assert len(zf.namelist()) > 0

    def test_overlap_check_does_not_decompress(self, fifield_bomb_archive, tmp_path):
        with pytest.raises(MalformedArchiveError, match="overlapping"):
            SafeZipFile(fifield_bomb_archive, max_per_member_ratio=100_000.0)


def _lfh(filename: bytes, data: bytes, compress_type: int = 0) -> bytes:
    """Build a Local File Header + data."""
    return (
        struct.pack(
            "<LHHHHHLLLHH",
            0x04034B50,
            20,
            0,
            compress_type,
            0,
            0,
            zlib.crc32(data) & 0xFFFFFFFF,
            len(data),
            len(data),
            len(filename),
            0,
        )
        + filename
        + data
    )


def _cdh(
    filename: bytes, data: bytes, local_offset: int, compress_type: int = 0
) -> bytes:
    """Build a Central Directory Header."""
    return (
        struct.pack(
            "<LHHHHHHLLLHHHHHLL",
            0x02014B50,
            20,
            20,
            0,
            compress_type,
            0,
            0,
            zlib.crc32(data) & 0xFFFFFFFF,
            len(data),
            len(data),
            len(filename),
            0,
            0,
            0,
            0,
            0,
            local_offset,
        )
        + filename
    )


def _eocd(
    num_entries: int, cd_size: int, cd_offset: int, comment: bytes = b""
) -> bytes:
    """Build an End of Central Directory record."""
    return (
        struct.pack(
            "<LHHHHLLH",
            0x06054B50,
            0,
            0,
            num_entries,
            num_entries,
            cd_size,
            cd_offset,
            len(comment),
        )
        + comment
    )


def _build_zip(*files: tuple[bytes, bytes]) -> bytes:
    """Build a well-formed zip from (filename, data) pairs."""
    lfhs, cdhs = [], []
    cursor = 0
    for fname, data in files:
        lfh = _lfh(fname, data)
        cdhs.append(_cdh(fname, data, cursor))
        lfhs.append(lfh)
        cursor += len(lfh)

    cd = b"".join(cdhs)
    return b"".join(lfhs) + cd + _eocd(len(files), len(cd), cursor)


def _build_overlap_zip(fname_a: bytes, fname_b: bytes, data: bytes) -> bytes:
    """Build a zip where two CDH entries point to the same LFH offset."""
    lfh = _lfh(fname_a, data)
    cdh1 = _cdh(fname_a, data, 0)
    cdh2 = _cdh(fname_b, data, 0)
    cd = cdh1 + cdh2
    return lfh + cd + _eocd(2, len(cd), len(lfh))


class TestZipInspector:
    """Tests for the ZipInspector overlap detection."""

    def _scan(self, data: bytes) -> ScanResult:
        return ZipInspector(io.BytesIO(data)).scan()

    def test_clean_single_file(self):
        data = _build_zip((b"readme.txt", b"hello"))
        result = self._scan(data)
        assert result.is_bomb is False

    def test_clean_two_files_sequential(self):
        data = _build_zip(
            (b"a.txt", b"first file contents"),
            (b"b.txt", b"second file contents"),
        )
        assert self._scan(data).is_bomb is False

    def test_clean_many_files(self):
        files = [(f"file{i}.txt".encode(), f"content {i}".encode()) for i in range(50)]
        data = _build_zip(*files)
        assert self._scan(data).is_bomb is False

    def test_clean_empty_file_entry(self):
        data = _build_zip((b"empty", b""))
        assert self._scan(data).is_bomb is False

    def test_overlap_two_cdh_same_offset(self):
        data = _build_overlap_zip(b"a", b"b", b"kernel data")
        assert self._scan(data).is_bomb is True

    def test_overlap_detail_is_populated(self):
        data = _build_overlap_zip(b"x", b"y", b"data")
        result = self._scan(data)
        assert result.is_bomb is True
        assert result.overlap_detail is not None

    def test_overlap_at_offset_zero(self):
        """Entries with data_start=0 should still be detected as overlapping."""
        lfh1 = _lfh(b"a", b"data")
        cdh1 = _cdh(b"a", b"data", 0)
        cdh2 = _cdh(b"b", b"data", 0)
        cd = cdh1 + cdh2
        data = lfh1 + cd + _eocd(2, len(cd), len(lfh1))
        result = self._scan(data)
        assert result.is_bomb is True

    def test_invalid_not_a_zip(self):
        result = self._scan(b"this is not a zip file at all")
        assert result.is_bomb is None

    def test_invalid_empty_bytes(self):
        result = self._scan(b"")
        assert result.is_bomb is None

    def test_invalid_truncated_eocd(self):
        result = self._scan(b"PK\x05\x06\x00\x00")
        assert result.is_bomb is None

    def test_invalid_garbage_with_pk_bytes(self):
        result = self._scan(b"\x00" * 100 + b"PK\x05\x06" + b"\xff" * 18)
        assert result.is_bomb is None

    def test_invalid_cdh_signature_mismatch(self):
        raw = bytearray(_build_zip((b"f", b"data")))
        cdh_pos = raw.find(b"PK\x01\x02")
        raw[cdh_pos] = 0xFF
        assert self._scan(bytes(raw)).is_bomb is None

    def test_invalid_lfh_signature_mismatch(self):
        raw = bytearray(_build_zip((b"f", b"data")))
        raw[0] = 0xFF
        assert self._scan(bytes(raw)).is_bomb is None

    def test_gap_does_not_trigger_bomb(self):
        gap = b"\x00" * 16
        lfh1 = _lfh(b"a", b"data1")
        lfh2 = _lfh(b"b", b"data2")
        off1 = 0
        off2 = len(lfh1) + len(gap)
        cdh1 = _cdh(b"a", b"data1", off1)
        cdh2 = _cdh(b"b", b"data2", off2)
        cd = cdh1 + cdh2
        raw = lfh1 + gap + lfh2 + cd + _eocd(2, len(cd), off2 + len(lfh2))
        assert self._scan(raw).is_bomb is False

    def test_leading_bytes_not_a_bomb(self):
        prefix = b"\x00" * 32
        lfh = _lfh(b"x", b"payload")
        cdh = _cdh(b"x", b"payload", len(prefix))
        cd = cdh
        raw = prefix + lfh + cd + _eocd(1, len(cd), len(prefix) + len(lfh))
        assert self._scan(raw).is_bomb is False

    def test_zip_with_comment(self):
        raw = _build_zip((b"x.txt", b"data"))
        eocd_pos = raw.rfind(b"PK\x05\x06")
        comment = b"this is a zip comment"
        head = raw[:eocd_pos]
        eocd = raw[eocd_pos:]
        new_eocd = eocd[:20] + struct.pack("<H", len(comment)) + comment
        assert self._scan(head + new_eocd).is_bomb is False

    def test_invalid_split_across_disks(self):
        lfh = _lfh(b"a", b"data")
        cdh = _cdh(b"a", b"data", 0)
        cd = cdh
        eocd = struct.pack(
            "<LHHHHLLH",
            0x06054B50,
            0,
            1,
            1,
            1,
            len(cd),
            0,
            0,
        )
        raw = lfh + cd + eocd
        assert self._scan(raw).is_bomb is None

    def test_invalid_eocd_entries_mismatch(self):
        lfh = _lfh(b"a", b"data")
        cdh = _cdh(b"a", b"data", 0)
        cd = cdh
        eocd = struct.pack(
            "<LHHHHLLH",
            0x06054B50,
            0,
            0,
            1,
            2,
            len(cd),
            len(lfh),
            0,
        )
        raw = lfh + cd + eocd
        assert self._scan(raw).is_bomb is None

    def test_invalid_eocd_cd_extends_past_eof(self):
        lfh = _lfh(b"a", b"data")
        cdh = _cdh(b"a", b"data", 0)
        cd = cdh
        eocd = struct.pack(
            "<LHHHHLLH",
            0x06054B50,
            0,
            0,
            1,
            1,
            len(cd) + 1000,
            len(lfh),
            0,
        )
        raw = lfh + cd + eocd
        assert self._scan(raw).is_bomb is None

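    def test_invalid_cd_extends_past_eof(self):
        # The central directory holds the same CDH twice, so both entries
        # point at local offset 0; the scan must not report a clean archive.
        lfh = _lfh(b"a", b"data")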
        cdh = _cdh(b"a", b"data", 0)
        cd = cdh
        raw = lfh + cd + cd + _eocd(2, len(cd) * 2, len(lfh))
        result = self._scan(raw)
        assert result.is_bomb is not False

    def test_invalid_local_offset_past_eof(self):
        lfh = _lfh(b"a", b"data")
        cdh_bad = _cdh(b"a", b"data", 999999)
        cd = cdh_bad
        raw = lfh + cd + _eocd(1, len(cd), len(lfh))
        assert self._scan(raw).is_bomb is None

    def test_invalid_cdh_variable_field_truncated(self):
        lfh = _lfh(b"a", b"data")
        cdh_base = struct.pack(
            "<LHHHHHHLLLHHHHHLL",
            0x02014B50,
            20,
            20,
            0,
            0,
            0,
            0,
            zlib.crc32(b"data") & 0xFFFFFFFF,
            4,
            4,
            1,
            0,
            0,
            0,
            0,
            0,
            0,
        )
        cdh = cdh_base + b"a"
        raw = lfh + cdh + _eocd(1, len(cdh), len(lfh))
        assert self._scan(raw).is_bomb is not True

    def test_invalid_lfh_signature_invalid(self):
        lfh = _lfh(b"a", b"data")
        raw_lfh = bytearray(lfh)
        raw_lfh[0] = 0xFF
        cdh = _cdh(b"a", b"data", 0)
        cd = cdh
        raw = bytes(raw_lfh) + cd + _eocd(1, len(cd), len(lfh))
        assert self._scan(raw).is_bomb is None

    def test_invalid_cdh_disk_nonzero(self):
        lfh = _lfh(b"a", b"data")
        cdh_base = (
            struct.pack(
                "<LHHHHHHLLLHHHHHLL",
                0x02014B50,
                20,
                20,
                0,
                0,
                0,
                0,
                zlib.crc32(b"data") & 0xFFFFFFFF,
                4,
                4,
                1,
                0,
                0,
                1,  # disk number start (nonzero)
                0,
                0,
                0,
            )
            + b"a"
        )
        cdh = cdh_base
        cd = cdh
        raw = lfh + cd + _eocd(1, len(cd), len(lfh))
        assert self._scan(raw).is_bomb is not True

    def test_cdh_extra_field_truncated(self):
        lfh = _lfh(b"a", b"data")
        cdh_base = struct.pack(
            "<LHHHHHHLLLHHHHHLL",
            0x02014B50,
            20,
            20,
            0,
            0,
            0,
            0,
            zlib.crc32(b"data") & 0xFFFFFFFF,
            4,
            4,
            5,
            0,
            0,
            0,
            0,
            0,
            0,
        )
        extra = struct.pack("<HH", 0x0001, 4)
        cdh = cdh_base + b"filename" + extra
        cd = cdh
        raw = lfh + cd + _eocd(1, len(cd), len(lfh))
        assert self._scan(raw).is_bomb is None

    def test_cdh_zip64_extra_invalid_size(self):
        lfh = _lfh(b"a", b"data")
        cdh_base = struct.pack(
            "<LHHHHHHLLLHHHHHLL",
            0x02014B50,
            20,
            20,
            0,
            0,
            0,
            0,
            zlib.crc32(b"data") & 0xFFFFFFFF,
            0xFFFFFFFF,
            0xFFFFFFFF,
            10,
            0,
            0,
            0,
            0,
            0,
            0xFFFFFFFF,
        )
        extra = struct.pack("<HHQ", 0x0001, 4, 100)
        cdh = cdh_base + b"filename" + extra
        cd = cdh
        raw = lfh + cd + _eocd(1, len(cd), len(lfh))
        assert self._scan(raw).is_bomb is None

src/safezip/tests/test_integration.py

"""End-to-end integration tests using real crafted malicious archives."""

import io
import stat
import zipfile

import pytest

from safezip import (
    CompressionRatioError,
    FileCountExceededError,
    FileSizeExceededError,
    MalformedArchiveError,
    NestingDepthError,
    SafeZipFile,
    SymlinkPolicy,
    UnsafeZipError,
    safe_extract,
)

__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"


class TestZipSlip:
    """ZipSlip path traversal attacks are blocked before any bytes reach disk."""

    def test_relative_traversal_blocked(self, zipslip_archive, tmp_path):
        dest = tmp_path / "out"
        dest.mkdir()
        with pytest.raises(UnsafeZipError), SafeZipFile(zipslip_archive) as zf:
            zf.extractall(dest)
        # Confirm no file escaped to the parent
        evil = tmp_path / "evil.txt"
        assert not evil.exists()

    def test_absolute_path_blocked(self, absolute_path_archive, tmp_path):
        dest = tmp_path / "out"
        dest.mkdir()
        with pytest.raises(UnsafeZipError), SafeZipFile(absolute_path_archive) as zf:
            zf.extractall(dest)

    def test_traversal_leaves_no_files(self, zipslip_archive, tmp_path):
        dest = tmp_path / "out"
        dest.mkdir()
        with pytest.raises(UnsafeZipError), SafeZipFile(zipslip_archive) as zf:
            zf.extractall(dest)
        assert not list(dest.rglob("*"))

    def test_unicode_traversal_blocked(self, unicode_traversal_archive, tmp_path):
        dest = tmp_path / "out"
        dest.mkdir()
        with (
            pytest.raises(UnsafeZipError),
            SafeZipFile(unicode_traversal_archive) as zf,
        ):
            zf.extractall(dest)


class TestZipBomb:
    """ZIP bomb attacks are detected and aborted."""

    def test_high_ratio_bomb_blocked(self, high_ratio_archive, tmp_path):
        dest = tmp_path / "out"
        dest.mkdir()
        with (
            pytest.raises(CompressionRatioError),
            SafeZipFile(high_ratio_archive, max_per_member_ratio=10.0) as zf,
        ):
            zf.extractall(dest)

    def test_high_ratio_no_partial_files(self, high_ratio_archive, tmp_path):
        dest = tmp_path / "out"
        dest.mkdir()
        with (
            pytest.raises(CompressionRatioError),
            SafeZipFile(high_ratio_archive, max_per_member_ratio=10.0) as zf,
        ):
            zf.extractall(dest)
        remaining = [f for f in dest.rglob("*") if not f.is_dir()]
        assert not remaining

    def test_file_size_lie_blocked(self, tmp_path):
        """A member whose declared size exceeds max_file_size is rejected."""
        # Store 2000 bytes but set max_file_size=500. The declared size
        # exceeds the limit, so the Guard rejects the archive on open.
        buf = io.BytesIO()
        with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
            zf.writestr("data.bin", b"X" * 2000)
        p = tmp_path / "lie.zip"
        p.write_bytes(buf.getvalue())
        dest = tmp_path / "out"
        dest.mkdir()

        with (
            pytest.raises(FileSizeExceededError),
            SafeZipFile(p, max_file_size=500) as zf,
        ):
            zf.extractall(dest)

    def test_many_files_bomb_blocked(self, many_files_archive, tmp_path):
        """Archive with too many files is blocked at the Guard phase."""
        with pytest.raises(FileCountExceededError):
            SafeZipFile(many_files_archive)


class TestExplicitPathRequirement:
    """extractall must receive an explicit path; CWD is never used silently."""

    def test_extractall_requires_path(self, legitimate_archive, tmp_path):
        """extractall with an explicit, valid path extracts the archive."""
        dest = tmp_path / "out"
        dest.mkdir()
        with SafeZipFile(legitimate_archive) as zf:
            zf.extractall(dest)  # must not raise
        assert (dest / "hello.txt").exists()

    def test_extractall_wrong_type_raises(self, legitimate_archive):
        """Passing None as path raises TypeError or AttributeError."""
        with (
            SafeZipFile(legitimate_archive) as zf,
            pytest.raises((TypeError, AttributeError)),
        ):
            zf.extractall(None)

    def test_extract_with_none_path_raises(self, legitimate_archive):
        """Passing None as path to extract() raises TypeError."""
        with SafeZipFile(legitimate_archive) as zf, pytest.raises(TypeError):
            zf.extract("hello.txt", None)

    def test_extractall_with_members_list(self, legitimate_archive, tmp_path):
        """extractall with a members list extracts only those members."""
        dest = tmp_path / "out"
        dest.mkdir()
        with SafeZipFile(legitimate_archive) as zf:
            zf.extractall(dest, members=["hello.txt"])
        # Only hello.txt should exist
        assert (dest / "hello.txt").exists()
        contents = list(dest.rglob("*"))
        assert len(contents) == 1


class TestMalformedArchive:
    """Structurally invalid archives raise MalformedArchiveError."""

    def test_not_a_zip_raises_malformed(self, tmp_path):
        """A file that is not a ZIP at all raises MalformedArchiveError."""
        bad = tmp_path / "bad.zip"
        bad.write_bytes(b"this is not a zip file")
        with pytest.raises(MalformedArchiveError):
            SafeZipFile(bad)

    def test_zip64_inconsistency_raises(self, zip64_inconsistency_archive):
        """ZIP64 extra field that disagrees with central directory is rejected."""
        with pytest.raises(MalformedArchiveError):
            SafeZipFile(zip64_inconsistency_archive)


class TestFifieldBomb:
    """End-to-end: Fifield-style zip bomb is blocked at Guard phase."""

    def test_fifield_bomb_blocked_end_to_end(self, fifield_bomb_archive, tmp_path):
        dest = tmp_path / "out"
        dest.mkdir()
        with (
            pytest.raises(MalformedArchiveError),
            SafeZipFile(fifield_bomb_archive) as zf,
        ):
            zf.extractall(dest)
        remaining = [f for f in dest.rglob("*") if not f.is_dir()]
        assert not remaining

    def test_security_event_fires_on_fifield_bomb(self, fifield_bomb_archive, tmp_path):
        """on_security_event callback receives 'malformed_archive' for Fifield bomb."""
        events = []
        with pytest.raises(MalformedArchiveError):
            SafeZipFile(fifield_bomb_archive, on_security_event=events.append)
        assert any(e.event_type == "malformed_archive" for e in events)

    def test_fifield_bomb_as_bytesio_rejected(self, fifield_bomb_archive):
        """Fifield bomb as BytesIO is rejected."""
        data = fifield_bomb_archive.read_bytes()
        bio = io.BytesIO(data)
        with pytest.raises(MalformedArchiveError):
            SafeZipFile(bio)

    def test_legitimate_archive_as_bytesio_passes(self, legitimate_archive):
        """Legitimate archive as BytesIO passes."""
        data = legitimate_archive.read_bytes()
        bio = io.BytesIO(data)
        with SafeZipFile(bio) as zf:
            assert len(zf.namelist()) > 0

    def test_fifield_bomb_bytesio_event_fires(self, fifield_bomb_archive):
        """on_security_event fires for in-memory Fifield bomb."""
        events = []
        data = fifield_bomb_archive.read_bytes()
        bio = io.BytesIO(data)
        with pytest.raises(MalformedArchiveError):
            SafeZipFile(bio, on_security_event=events.append)
        assert any(e.event_type == "malformed_archive" for e in events)


class TestSecurityEventCoverage:
    """on_security_event callback fires for all security violation types."""

    def test_callback_fires_on_path_traversal(self, zipslip_archive, tmp_path):
        events = []
        dest = tmp_path / "out"
        dest.mkdir()
        with (
            pytest.raises(UnsafeZipError),
            SafeZipFile(zipslip_archive, on_security_event=events.append) as zf,
        ):
            zf.extractall(dest)
        assert any(e.event_type == "zip_slip_detected" for e in events)

    def test_callback_fires_on_file_size_exceeded(self, tmp_path):
        buf = io.BytesIO()
        with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
            zf.writestr("data.bin", b"A" * 1000)
        p = tmp_path / "large.zip"
        p.write_bytes(buf.getvalue())
        dest = tmp_path / "out"
        dest.mkdir()
        events = []
        with (
            pytest.raises(FileSizeExceededError),
            SafeZipFile(p, max_file_size=500, on_security_event=events.append) as zf,
        ):
            zf.extractall(dest)
        # The Guard may fire "declared_size_exceeded" (declared header size >
        # limit) or the Streamer may fire "file_size_exceeded" (actual
        # decompressed bytes > limit).  Both indicate a file-size violation.
        size_events = {"file_size_exceeded", "declared_size_exceeded"}
        assert any(e.event_type in size_events for e in events)

    def test_callback_fires_on_ratio_exceeded(self, high_ratio_archive, tmp_path):
        events = []
        dest = tmp_path / "out"
        dest.mkdir()
        with (
            pytest.raises(CompressionRatioError),
            SafeZipFile(
                high_ratio_archive,
                max_per_member_ratio=10.0,
                on_security_event=events.append,
            ) as zf,
        ):
            zf.extractall(dest)
        assert any(e.event_type == "compression_ratio_exceeded" for e in events)

    def test_callback_fires_on_file_count_exceeded(self, many_files_archive, tmp_path):
        events = []
        with pytest.raises(FileCountExceededError):
            SafeZipFile(many_files_archive, on_security_event=events.append)
        assert any(e.event_type == "file_count_exceeded" for e in events)

    def test_callback_fires_on_symlink_rejected(self, symlink_archive, tmp_path):
        events = []
        dest = tmp_path / "out"
        dest.mkdir()
        with (
            pytest.raises(UnsafeZipError),
            SafeZipFile(symlink_archive, on_security_event=events.append) as zf,
        ):
            zf.extractall(dest)
        assert any(e.event_type == "symlink_rejected" for e in events)


class TestLegitimateExtraction:
    """Well-formed archives extract correctly and completely."""

    def test_all_files_extracted(self, legitimate_archive, tmp_path):
        dest = tmp_path / "out"
        dest.mkdir()
        with SafeZipFile(legitimate_archive) as zf:
            zf.extractall(dest)
        assert (dest / "hello.txt").read_bytes() == b"Hello, world!\n"
        assert (dest / "subdir" / "data.txt").read_bytes() == b"Some data\n"
        assert (dest / "subdir" / "nested" / "deep.txt").read_bytes() == b"Deep file\n"

    def test_safe_extract_convenience(self, legitimate_archive, tmp_path):
        dest = tmp_path / "out"
        dest.mkdir()
        safe_extract(legitimate_archive, dest)
        assert (dest / "hello.txt").exists()

    def test_context_manager_closes_properly(self, legitimate_archive, tmp_path):
        dest = tmp_path / "out"
        dest.mkdir()
        with SafeZipFile(legitimate_archive) as zf:
            zf.extractall(dest)
        # After context exit, the underlying ZipFile's fp should be None (closed).
        # zipfile.ZipFile.close() sets self.fp = None.
        assert zf._zf.fp is None


class TestSecurityEventCallback:
    """on_security_event callback is called on security events."""

    def test_callback_called_on_zip_slip(self, zipslip_archive, tmp_path):
        events = []

        def capture(event):
            events.append(event)

        dest = tmp_path / "out"
        dest.mkdir()
        with (
            pytest.raises(UnsafeZipError),
            SafeZipFile(zipslip_archive, on_security_event=capture) as zf,
        ):
            zf.extractall(dest)
        # Note: callback is called for monitored events during extraction;
        # path traversal may be detected in sandbox before callback fires.
        # The test verifies no crash occurs.

    def test_callback_exception_does_not_swallow_security_error(
        self, zipslip_archive, tmp_path
    ):
        def broken_callback(event):
            raise RuntimeError("callback broken")

        dest = tmp_path / "out"
        dest.mkdir()
        # The UnsafeZipError must still propagate even if callback raises
        with (
            pytest.raises(UnsafeZipError),
            SafeZipFile(zipslip_archive, on_security_event=broken_callback) as zf,
        ):
            zf.extractall(dest)


class TestNestingDepthLimit:
    """SafeZipFile refuses instantiation when _nesting_depth exceeds the limit."""

    def test_nesting_depth_exceeded_raises(self, legitimate_archive):
        """_nesting_depth > max_nesting_depth raises NestingDepthError."""
        with pytest.raises(NestingDepthError):
            SafeZipFile(legitimate_archive, max_nesting_depth=3, _nesting_depth=4)

    def test_nesting_depth_at_limit_passes(self, legitimate_archive):
        """_nesting_depth == max_nesting_depth is allowed."""
        with SafeZipFile(legitimate_archive, max_nesting_depth=3, _nesting_depth=3):
            pass

    def test_nesting_depth_zero_always_passes(self, legitimate_archive):
        """Default _nesting_depth=0 never raises."""
        with SafeZipFile(legitimate_archive):
            pass

    def test_nesting_depth_env_var_respected(self, legitimate_archive, monkeypatch):
        """SAFEZIP_MAX_NESTING_DEPTH env var is honoured when no constructor arg
        is given."""
        monkeypatch.setenv("SAFEZIP_MAX_NESTING_DEPTH", "1")
        # depth=2 > env-var limit of 1 → should raise
        with pytest.raises(NestingDepthError):
            SafeZipFile(legitimate_archive, _nesting_depth=2)

    def test_nesting_depth_exceeded_event(self, legitimate_archive):
        """nesting_depth_exceeded event is emitted when depth exceeds limit."""
        events = []
        with pytest.raises(NestingDepthError):
            SafeZipFile(
                legitimate_archive,
                max_nesting_depth=1,
                _nesting_depth=2,
                on_security_event=events.append,
            )
        assert any(e.event_type == "nesting_depth_exceeded" for e in events)


class TestNestedArchiveGuard:
    """Nested archive members are extracted as raw files, not recursed into."""

    def test_inner_zip_extracted_as_raw_file(self, tmp_path):
        inner_buf = io.BytesIO()
        with zipfile.ZipFile(inner_buf, "w") as inner_zf:
            inner_zf.writestr("secret.txt", b"inner content")
        inner_bytes = inner_buf.getvalue()

        outer_buf = io.BytesIO()
        with zipfile.ZipFile(outer_buf, "w") as outer_zf:
            outer_zf.writestr("readme.txt", b"outer content")
            outer_zf.writestr("nested.zip", inner_bytes)
        outer_p = tmp_path / "outer.zip"
        outer_p.write_bytes(outer_buf.getvalue())

        dest = tmp_path / "out"
        dest.mkdir()
        with SafeZipFile(outer_p) as zf:
            zf.extractall(dest)

        # The nested.zip should be present as a raw file, not recursed
        assert (dest / "nested.zip").exists()
        assert (dest / "nested.zip").read_bytes() == inner_bytes
        # The inner secret.txt should NOT be extracted
        assert not (dest / "secret.txt").exists()


class TestRecursiveNestingDepthIntegration:
    """Real zip-within-zip recursion is stopped at max_nesting_depth.

    These tests use an actual nested archive and a realistic recursive
    extraction helper to verify that the guard fires in practice, not just
    when the counter is poked directly.
    """

    @staticmethod
    def _build_nested_zip(levels: int) -> bytes:
        """Return bytes of a zip nested *levels* deep.

        The innermost zip contains ``secret.txt``.  Every outer layer wraps
        the previous one as ``inner.zip`` plus a ``readme.txt`` so there is
        always a regular file at each level too.
        """
        buf = io.BytesIO()
        with zipfile.ZipFile(buf, "w") as zf:
            zf.writestr("secret.txt", b"innermost content")
        data = buf.getvalue()

        for _ in range(levels - 1):
            buf = io.BytesIO()
            with zipfile.ZipFile(buf, "w") as zf:
                zf.writestr("readme.txt", b"outer level content")
                zf.writestr("inner.zip", data)
            data = buf.getvalue()

        return data

    @staticmethod
    def _recursive_extract(zip_path, dest, *, depth=0, max_nesting_depth=2):
        """Minimal recursive extractor that passes *depth* to SafeZipFile.

        This is the pattern a caller must follow to get nesting protection.
        SafeZipFile raises NestingDepthError before opening the archive when
        *depth* exceeds *max_nesting_depth*.
        """
        with SafeZipFile(
            zip_path,
            max_nesting_depth=max_nesting_depth,
            _nesting_depth=depth,
        ) as zf:
            zf.extractall(dest)
            for name in zf.namelist():
                if name.endswith(".zip"):
                    nested_src = dest / name
                    nested_dest = dest / (name[:-4] + "_contents")
                    nested_dest.mkdir()
                    TestRecursiveNestingDepthIntegration._recursive_extract(
                        nested_src,
                        nested_dest,
                        depth=depth + 1,
                        max_nesting_depth=max_nesting_depth,
                    )

    def test_recursive_extraction_stopped_at_depth_limit(self, tmp_path):
        """Recursion into a 3-level archive raises NestingDepthError at level 3.

        Archive layout::

            outer.zip          (depth 0 — opened fine)
              readme.txt
              inner.zip        (depth 1 — opened fine)
                readme.txt
                inner.zip      (depth 2 — raises, exceeds max_nesting_depth=1)
                  secret.txt
        """
        outer_p = tmp_path / "outer.zip"
        outer_p.write_bytes(self._build_nested_zip(3))
        dest = tmp_path / "out"
        dest.mkdir()

        with pytest.raises(NestingDepthError):
            self._recursive_extract(outer_p, dest, max_nesting_depth=1)

    def test_recursive_extraction_succeeds_within_limit(self, tmp_path):
        """Recursion within the depth limit extracts every level successfully.

        With max_nesting_depth=2 and a 3-level archive (depths 0, 1, 2),
        all levels are within the limit and secret.txt reaches disk.
        """
        outer_p = tmp_path / "outer.zip"
        outer_p.write_bytes(self._build_nested_zip(3))
        dest = tmp_path / "out"
        dest.mkdir()

        self._recursive_extract(outer_p, dest, max_nesting_depth=2)

        innermost = dest / "inner_contents" / "inner_contents" / "secret.txt"
        assert innermost.read_bytes() == b"innermost content"


class TestBuiltinRecursiveExtraction:
    """SafeZipFile with recursive=True auto-descends into nested zip members."""

    @staticmethod
    def _build_zip(members: list[tuple[str, bytes]]) -> bytes:
        buf = io.BytesIO()
        with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
            for name, content in members:
                zf.writestr(name, content)
        return buf.getvalue()

    def test_recursive_false_is_default_raw_blob(self, tmp_path):
        """recursive=False (default) leaves nested zips as raw files."""
        inner = self._build_zip([("secret.txt", b"inner")])
        outer_p = tmp_path / "outer.zip"
        outer_p.write_bytes(self._build_zip([("inner.zip", inner)]))
        dest = tmp_path / "out"
        dest.mkdir()

        with SafeZipFile(outer_p) as zf:
            zf.extractall(dest)

        assert (dest / "inner.zip").exists()
        assert not (dest / "inner" / "secret.txt").exists()

    def test_recursive_extracts_nested_content(self, tmp_path):
        """recursive=True descends into inner.zip and extracts its content."""
        inner = self._build_zip([("secret.txt", b"inner content")])
        outer_p = tmp_path / "outer.zip"
        outer_p.write_bytes(
            self._build_zip([("readme.txt", b"outer"), ("inner.zip", inner)])
        )
        dest = tmp_path / "out"
        dest.mkdir()

        with SafeZipFile(outer_p, recursive=True) as zf:
            zf.extractall(dest)

        assert (dest / "readme.txt").read_bytes() == b"outer"
        assert (dest / "inner" / "secret.txt").read_bytes() == b"inner content"
        assert not (dest / "inner.zip").exists()

    def test_recursive_depth_limit_raises(self, tmp_path):
        """recursive=True stops at max_nesting_depth and raises NestingDepthError."""
        # 3-level deep: outer -> middle.zip -> inner.zip -> secret.txt
        innermost = self._build_zip([("secret.txt", b"deep")])
        middle = self._build_zip([("inner.zip", innermost)])
        outer_p = tmp_path / "outer.zip"
        outer_p.write_bytes(self._build_zip([("middle.zip", middle)]))
        dest = tmp_path / "out"
        dest.mkdir()

        # max_nesting_depth=1 allows depth 0 and 1; opening depth-2 raises
        with (
            pytest.raises(NestingDepthError),
            SafeZipFile(outer_p, recursive=True, max_nesting_depth=1) as zf,
        ):
            zf.extractall(dest)

    def test_recursive_file_size_enforced_in_nested_zip(self, tmp_path):
        """File size limit applies inside nested zips when recursive=True."""
        inner = self._build_zip([("big.txt", b"A" * 2000)])
        outer_p = tmp_path / "outer.zip"
        outer_p.write_bytes(self._build_zip([("inner.zip", inner)]))
        dest = tmp_path / "out"
        dest.mkdir()

        with (
            pytest.raises(FileSizeExceededError),
            SafeZipFile(outer_p, recursive=True, max_file_size=500) as zf,
        ):
            zf.extractall(dest)

    def test_recursive_traversal_in_nested_zip_blocked(self, tmp_path):
        """Path traversal inside a nested zip is blocked when recursive=True."""
        inner = self._build_zip([("../../evil.txt", b"escaped")])
        outer_p = tmp_path / "outer.zip"
        outer_p.write_bytes(self._build_zip([("inner.zip", inner)]))
        dest = tmp_path / "out"
        dest.mkdir()

        with (
            pytest.raises(UnsafeZipError),
            SafeZipFile(outer_p, recursive=True) as zf,
        ):
            zf.extractall(dest)

        assert not (tmp_path / "evil.txt").exists()

    def test_recursive_mixed_members(self, tmp_path):
        """Regular files and nested zips are both handled correctly."""
        inner = self._build_zip([("data.txt", b"nested data")])
        outer_p = tmp_path / "outer.zip"
        outer_p.write_bytes(
            self._build_zip(
                [
                    ("top.txt", b"top level"),
                    ("pkg.zip", inner),
                ]
            )
        )
        dest = tmp_path / "out"
        dest.mkdir()

        with SafeZipFile(outer_p, recursive=True) as zf:
            zf.extractall(dest)

        assert (dest / "top.txt").read_bytes() == b"top level"
        assert (dest / "pkg" / "data.txt").read_bytes() == b"nested data"
        assert not (dest / "pkg.zip").exists()

    def test_recursive_content_detection_bypasses_extension(self, tmp_path):
        """A nested ZIP named with a non-ZIP extension is still recursed into
        when recursive=True (content-based detection)."""
        inner = self._build_zip([("secret.txt", b"inner content")])
        outer_buf = io.BytesIO()
        with zipfile.ZipFile(outer_buf, "w") as zf:
            zf.writestr("data.csv", inner)
        outer_p = tmp_path / "outer.zip"
        outer_p.write_bytes(outer_buf.getvalue())
        dest = tmp_path / "out"
        dest.mkdir()

        with SafeZipFile(outer_p, recursive=True) as zf:
            zf.extractall(dest)

        # .csv is not a known archive extension, so directory name stays as-is
        assert (dest / "data.csv" / "secret.txt").read_bytes() == b"inner content"

    def test_recursive_non_zip_with_zip_extension_not_recursed(self, tmp_path):
        """A file named .zip that is not actually a ZIP is extracted as a plain file."""
        outer_buf = io.BytesIO()
        with zipfile.ZipFile(outer_buf, "w") as zf:
            zf.writestr("fake.zip", b"this is not a zip file at all")
        outer_p = tmp_path / "outer.zip"
        outer_p.write_bytes(outer_buf.getvalue())
        dest = tmp_path / "out"
        dest.mkdir()

        with SafeZipFile(outer_p, recursive=True) as zf:
            zf.extractall(dest)

        assert (dest / "fake.zip").read_bytes() == b"this is not a zip file at all"


class TestPermissionSanitisation:
    """Dangerous Unix permission bits are stripped from extracted files."""

    def test_setuid_stripped_by_default(self, setuid_archive, tmp_path):
        """setuid bit is stripped by default."""
        dest = tmp_path / "out"
        dest.mkdir()
        with SafeZipFile(setuid_archive) as zf:
            zf.extractall(dest)
        mode = (dest / "suid_binary").stat().st_mode
        assert not (mode & stat.S_ISUID), "setuid bit must be stripped by default"

    def test_normal_permissions_unaffected(self, legitimate_archive, tmp_path):
        """Stripping special bits does not affect normal file access."""
        dest = tmp_path / "out"
        dest.mkdir()
        with SafeZipFile(legitimate_archive) as zf:
            zf.extractall(dest)
        for f in dest.rglob("*"):
            if f.is_file():
                assert f.stat().st_mode & stat.S_IRUSR


class TestSymlinkPolicy:
    """SafeZipFile enforces the configured SymlinkPolicy for ZIP symlink entries.

    A ZIP symlink entry is identified by the upper 16 bits of
    ``ZipInfo.external_attr`` carrying a Unix ``S_IFLNK`` file mode.
    The entry's data bytes contain the link target path.
    """

    def test_reject_is_default(self, symlink_archive, tmp_path):
        """Default policy (REJECT) raises UnsafeZipError on any symlink entry."""
        dest = tmp_path / "out"
        dest.mkdir()
        with pytest.raises(UnsafeZipError), SafeZipFile(symlink_archive) as zf:
            zf.extractall(dest)

    def test_reject_explicit_raises(self, symlink_archive, tmp_path):
        """Explicit REJECT policy raises UnsafeZipError on a symlink entry."""
        dest = tmp_path / "out"
        dest.mkdir()
        with (
            pytest.raises(UnsafeZipError),
            SafeZipFile(symlink_archive, symlink_policy=SymlinkPolicy.REJECT) as zf,
        ):
            zf.extractall(dest)

    def test_ignore_skips_symlink_entry(self, symlink_archive, tmp_path):
        """IGNORE policy silently skips symlink entries; no file is created."""
        dest = tmp_path / "out"
        dest.mkdir()
        with SafeZipFile(symlink_archive, symlink_policy=SymlinkPolicy.IGNORE) as zf:
            zf.extractall(dest)
        # The symlink entry must not appear on disk
        assert not (dest / "link.txt").exists()

    def test_ignore_preserves_regular_files(self, symlink_archive, tmp_path):
        """IGNORE policy skips symlinks but still extracts regular entries."""
        dest = tmp_path / "out"
        dest.mkdir()
        with SafeZipFile(symlink_archive, symlink_policy=SymlinkPolicy.IGNORE) as zf:
            zf.extractall(dest)
        assert (dest / "readme.txt").read_bytes() == b"safe content\n"

    def test_resolve_internal_extracts_target_as_file(self, symlink_archive, tmp_path):
        """RESOLVE_INTERNAL extracts the symlink target path as a regular file.

        Because the ZIP entry's content is the target string (not an OS
        symlink), the extracted file is a plain file containing that string.
        The post-extraction symlink check only fires when the OS creates an
        actual symlink (not applicable here), so extraction succeeds.
        """
        dest = tmp_path / "out"
        dest.mkdir()
        with SafeZipFile(
            symlink_archive, symlink_policy=SymlinkPolicy.RESOLVE_INTERNAL
        ) as zf:
            zf.extractall(dest)
        # The entry is written as a regular file containing the target path
        extracted = dest / "link.txt"
        assert extracted.exists()
        assert not extracted.is_symlink()
        assert extracted.read_text() == "../escape.txt"
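

# Illustrative sketch, not part of safezip: how a ZIP symlink entry of the
# kind described in TestSymlinkPolicy can be built and recognised with the
# standard library alone.  Both helper names are hypothetical, and the code
# relies on the io, stat and zipfile imports this module already uses.
def _sketch_make_symlink_entry(name: str, target: str) -> bytes:
    """Return ZIP bytes holding a single entry flagged as a Unix symlink."""
    info = zipfile.ZipInfo(name)
    info.create_system = 3  # 3 = Unix, so external_attr carries a file mode
    info.external_attr = (stat.S_IFLNK | 0o777) << 16
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(info, target)  # the entry data is the link target path
    return buf.getvalue()


def _sketch_is_symlink_entry(info: zipfile.ZipInfo) -> bool:
    """Return True when the upper 16 bits of external_attr hold S_IFLNK."""
    return stat.S_ISLNK(info.external_attr >> 16)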


class TestCompressSizeZero:
    """compress_size == 0 only occurs legitimately for empty members.

    Python's zipfile uses the central directory compress_size to control how
    many bytes it reads during decompression.  A non-empty member with
    compress_size=0 in the CD causes zipfile to read 0 bytes and then fail
    the CRC check (BadZipFile), so it never reaches the streamer's ratio logic.

    The only reachable case is a genuinely empty member, for which skipping
    the ratio check is correct — there is nothing to decompress.
    """

    def test_empty_member_skips_ratio_check_correctly(
        self, data_descriptor_empty_archive, tmp_path
    ):
        """Empty member (compress_size=0) extracts successfully even with a
        tight ratio limit.  Skipping the ratio check is correct behaviour."""
        dest = tmp_path / "out"
        dest.mkdir()

        with zipfile.ZipFile(data_descriptor_empty_archive) as zf:
            info = zf.infolist()[0]
            assert info.compress_size == 0
            assert info.file_size == 0

        with SafeZipFile(data_descriptor_empty_archive, max_per_member_ratio=1.0) as zf:
            zf.extractall(dest)

        assert (dest / "empty.txt").read_bytes() == b""

    def test_nonempty_with_zero_cd_compress_size_rejected_by_zipfile(
        self, data_descriptor_invalid_bomb_archive, tmp_path
    ):
        """A crafted archive with compress_size=0 in the CD but non-empty data
        is rejected by Python's zipfile with BadZipFile before the streamer's
        ratio logic is even reached.  The gap is not exploitable through
        Python's zipfile layer."""
        dest = tmp_path / "out"
        dest.mkdir()

        # Verify the CD does report compress_size=0 despite non-empty content.
        with zipfile.ZipFile(data_descriptor_invalid_bomb_archive) as zf:
            info = zf.infolist()[0]
            assert info.compress_size == 0
            assert info.file_size > 0

        # SafeZipFile opens fine (Guard sees compress_size=0, file_size=2000,
        # both within limits).  BadZipFile is raised by zipfile's CRC check
        # during streaming — before safezip's ratio logic is ever reached.
        with (
            pytest.raises(zipfile.BadZipFile),
            SafeZipFile(data_descriptor_invalid_bomb_archive) as zf,
        ):
            zf.extractall(dest)

        # No partial files left.
        remaining = [f for f in dest.rglob("*") if not f.is_dir()]
        assert not remaining


class TestEnvVarHandling:
    """Environment variable parsing edge cases."""

    def test_invalid_symlink_policy_env(self, legitimate_archive, monkeypatch, caplog):
        """Invalid symlink policy is logged and defaults to REJECT."""
        monkeypatch.setenv("SAFEZIP_SYMLINK_POLICY", "invalid_policy")
        with SafeZipFile(legitimate_archive, symlink_policy=None) as zf:
            assert zf._symlink_policy == SymlinkPolicy.REJECT
        assert "Ignoring unrecognised" in caplog.text

    def test_env_var_read_at_import_time(self, monkeypatch):
        """Changing env vars after import does not affect cached defaults.

        The module-level singletons (_DEFAULT_*) are evaluated once at import time.
        Late env changes do not alter limits on new SafeZipFile instances.
        """
        import safezip._core as _core

        original_default = _core._DEFAULT_MAX_FILES
        monkeypatch.setenv("SAFEZIP_MAX_FILES", "99")
        assert original_default == _core._DEFAULT_MAX_FILES

src/safezip/tests/test_sandbox.py

"""Tests for Phase B: path resolution and symlink policy (the Sandbox)."""

import os
import tempfile
from pathlib import Path

import pytest

from safezip import UnsafeZipError
from safezip._sandbox import resolve_member_path

__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"


def _symlinks_available() -> bool:
    """Return True if we can create symlinks (requires privileges on Windows)."""
    # A fresh temporary directory avoids collisions with stale probe files
    # from an interrupted run and is removed even when symlink creation fails.
    try:
        with tempfile.TemporaryDirectory(prefix="_safezip_symlink_probe") as tmp:
            target = Path(tmp) / "target"
            link = Path(tmp) / "link"
            target.mkdir()
            link.symlink_to(target)
        return True
    except (OSError, NotImplementedError):
        return False


class TestPathTraversal:
    """resolve_member_path rejects all forms of path traversal."""

    def test_dotdot_relative(self, tmp_path):
        with pytest.raises(UnsafeZipError, match="traversal"):
            resolve_member_path(tmp_path, "../../evil.txt")

    def test_dotdot_in_middle(self, tmp_path):
        with pytest.raises(UnsafeZipError, match="traversal"):
            resolve_member_path(tmp_path, "subdir/../../../evil.txt")

    def test_dotdot_windows_style(self, tmp_path):
        with pytest.raises(UnsafeZipError, match="traversal"):
            resolve_member_path(tmp_path, "subdir\\..\\..\\evil.txt")

    def test_absolute_unix_path(self, tmp_path):
        with pytest.raises(UnsafeZipError):
            resolve_member_path(tmp_path, "/etc/passwd")

    def test_absolute_windows_path(self, tmp_path):
        with pytest.raises(UnsafeZipError):
            resolve_member_path(tmp_path, "C:\\Windows\\System32\\cmd.exe")

    def test_unc_path(self, tmp_path):
        with pytest.raises(UnsafeZipError):
            resolve_member_path(tmp_path, "//server/share/evil.txt")


class TestNullByte:
    """resolve_member_path rejects filenames with null bytes."""

    def test_null_byte_rejected(self, tmp_path):
        with pytest.raises(UnsafeZipError):
            resolve_member_path(tmp_path, "safe\x00../../etc/passwd")

    def test_null_byte_at_start(self, tmp_path):
        with pytest.raises(UnsafeZipError):
            resolve_member_path(tmp_path, "\x00evil.txt")


class TestLegitimateFilenames:
    """resolve_member_path accepts well-formed filenames."""

    def test_simple_filename(self, tmp_path):
        result = resolve_member_path(tmp_path, "hello.txt")
        assert result == tmp_path / "hello.txt"

    def test_nested_filename(self, tmp_path):
        result = resolve_member_path(tmp_path, "subdir/data.txt")
        assert result == tmp_path / "subdir" / "data.txt"

    def test_deep_nested(self, tmp_path):
        result = resolve_member_path(tmp_path, "a/b/c/d/e.txt")
        assert result == tmp_path / "a" / "b" / "c" / "d" / "e.txt"

    def test_windows_separator_legitimate(self, tmp_path):
        """Windows-style separators are normalised to forward slashes."""
        result = resolve_member_path(tmp_path, "subdir\\file.txt")
        assert result == tmp_path / "subdir" / "file.txt"

    def test_result_is_inside_base(self, tmp_path):
        result = resolve_member_path(tmp_path, "subdir/file.txt")
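        # Compare path components rather than string prefixes, so a sibling
        # directory sharing the same name prefix cannot satisfy the check.
        assert tmp_path in result.parents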

    def test_unicode_filename(self, tmp_path):
        result = resolve_member_path(tmp_path, "données/résumé.txt")
        assert result.name == "résumé.txt"

    def test_leading_slash_rejected(self, tmp_path):
        """A leading slash is treated as an absolute path and rejected."""
        with pytest.raises(UnsafeZipError, match="Absolute path"):
            resolve_member_path(tmp_path, "/file.txt")

    def test_dot_components_stripped(self, tmp_path):
        result = resolve_member_path(tmp_path, "./subdir/./file.txt")
        assert result == tmp_path / "subdir" / "file.txt"

    def test_empty_parts_stripped(self, tmp_path):
        result = resolve_member_path(tmp_path, "subdir//file.txt")
        assert result == tmp_path / "subdir" / "file.txt"


class TestDrivePrefixBypass:
    """resolve_member_path rejects ZipSlip bypass via Windows drive-prefix stripping."""

    def test_dotdot_after_drive_prefix_double(self, tmp_path):
        with pytest.raises(UnsafeZipError, match="traversal"):
            resolve_member_path(tmp_path, "C:../C:../etc/target")

    def test_dotdot_after_drive_prefix_lowercase(self, tmp_path):
        with pytest.raises(UnsafeZipError, match="traversal"):
            resolve_member_path(tmp_path, "c:../foo")

    @pytest.mark.parametrize(
        "payload",
        [
            "C:../etc/passwd",
            "c:../foo",
            "Z:../../../etc/shadow",
        ],
    )
    def test_dotdot_after_drive_prefix_various(self, tmp_path, payload):
        with pytest.raises(UnsafeZipError, match="traversal"):
            resolve_member_path(tmp_path, payload)

    def test_backslash_drive_prefix(self, tmp_path):
        """Backslash normalised to slash before drive-prefix stripping."""
        with pytest.raises(UnsafeZipError, match="traversal"):
            resolve_member_path(tmp_path, "C:..\\..\\foo")

    def test_bare_drive_prefix_double_dot(self, tmp_path):
        """Single part that is exactly 'X:..' strips to '..'."""
        with pytest.raises(UnsafeZipError, match="traversal"):
            resolve_member_path(tmp_path, "X:..")

    def test_drive_prefix_multiple_traversal(self, tmp_path):
        """Multiple '..' components after stripping are caught immediately."""
        with pytest.raises(UnsafeZipError, match="traversal"):
            resolve_member_path(tmp_path, "A:../../etc/passwd")

    def test_drive_prefix_legitimate_relative(self, tmp_path):
        """Stripping 'C:' from a valid relative path succeeds."""
        result = resolve_member_path(tmp_path, "C:subdir/file.txt")
        assert result == tmp_path / "subdir" / "file.txt"

    def test_bare_drive_no_suffix(self, tmp_path):
        """Bare drive 'C:' strips to empty string → empty path error."""
        with pytest.raises(UnsafeZipError, match="empty"):
            resolve_member_path(tmp_path, "C:")


class TestResolveContainment:
    """resolve_member_path uses .resolve() to catch symlink-based escapes."""

    @pytest.mark.skipif(
        not _symlinks_available(),
        reason="symlink creation not available or requires elevated privileges",
    )
    def test_symlink_in_base_dir_escapes(self, tmp_path):
        """Path resolving through a symlink outside base is rejected."""
        real_dir = tmp_path / "real_outside"
        real_dir.mkdir()
        (real_dir / "secret.txt").write_text("leaked")

        base = tmp_path / "base"
        base.mkdir()
        evil_link = base / "subdir"
        os.symlink(str(real_dir), str(evil_link))

        with pytest.raises(UnsafeZipError, match="escapes base"):
            resolve_member_path(base, "subdir/secret.txt")

    @pytest.mark.skipif(
        not _symlinks_available(),
        reason="symlink creation not available or requires elevated privileges",
    )
    def test_base_is_symlink(self, tmp_path):
        """Base directory is itself a symlink — both sides resolve identically."""
        real_base = tmp_path / "real_base"
        real_base.mkdir()

        link_base = tmp_path / "link_base"
        os.symlink(str(real_base), str(link_base))

        result = resolve_member_path(link_base, "file.txt")
        # Return value uses the symlink name; resolved paths must agree.
        assert result == link_base / "file.txt"
        assert result.resolve() == real_base.resolve() / "file.txt"

    @pytest.mark.skipif(
        not _symlinks_available(),
        reason="symlink creation not available or requires elevated privileges",
    )
    def test_nested_symlink_component(self, tmp_path):
        """Chained symlinks resolving outside base are rejected."""
        outside = tmp_path / "outside"
        outside.mkdir()
        (outside / "data.txt").write_text("secret")

        base = tmp_path / "base"
        base.mkdir()
        # base/a -> b, and b/secret -> outside
        b_dir = tmp_path / "b"
        b_dir.mkdir()
        os.symlink(str(outside), str(b_dir / "secret"))
        os.symlink(str(b_dir), str(base / "a"))

        with pytest.raises(UnsafeZipError, match="escapes base"):
            resolve_member_path(base, "a/secret/data.txt")


class TestPathLengthLimit:
    """resolve_member_path rejects excessively long paths."""

    def test_very_long_filename_rejected(self, tmp_path):
        long_name = "a" * 5000 + ".txt"
        with pytest.raises(UnsafeZipError, match="too long"):
            resolve_member_path(tmp_path, long_name)

src/safezip/tests/test_streamer.py

"""Tests for Phase C: streaming extraction (the Streamer)."""

import io
import zipfile

import pytest

from safezip import (
    CompressionRatioError,
    FileSizeExceededError,
    MalformedArchiveError,
    SafeZipFile,
)

__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"


class TestFileSizeLimit:
    """Streamer enforces per-member file size limits at stream time."""

    def test_size_exceeded_raises(self, tmp_path):
        buf = io.BytesIO()
        with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
            zf.writestr("data.bin", b"A" * 1000)
        p = tmp_path / "large.zip"
        p.write_bytes(buf.getvalue())
        dest = tmp_path / "out"
        dest.mkdir()

        with (
            pytest.raises(FileSizeExceededError),
            SafeZipFile(p, max_file_size=500) as zf,
        ):
            zf.extractall(dest)

    def test_no_partial_file_after_size_failure(self, tmp_path):
        """Atomic write: no partial file must remain after FileSizeExceededError."""
        buf = io.BytesIO()
        with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
            zf.writestr("data.bin", b"A" * 1000)
        p = tmp_path / "large.zip"
        p.write_bytes(buf.getvalue())
        dest = tmp_path / "out"
        dest.mkdir()

        with (
            pytest.raises(FileSizeExceededError),
            SafeZipFile(p, max_file_size=500) as zf,
        ):
            zf.extractall(dest)

        # No partial files or temp files should remain
        remaining = list(dest.rglob("*"))
        assert not remaining, f"Partial files found: {remaining}"

    def test_size_at_limit_passes(self, tmp_path):
        buf = io.BytesIO()
        with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
            zf.writestr("data.bin", b"A" * 100)
        p = tmp_path / "ok.zip"
        p.write_bytes(buf.getvalue())
        dest = tmp_path / "out"
        dest.mkdir()

        with SafeZipFile(p, max_file_size=100) as zf:
            zf.extractall(dest)
        assert (dest / "data.bin").read_bytes() == b"A" * 100


class TestTotalSizeLimit:
    """Streamer enforces cumulative total size across all members."""

    def test_total_size_exceeded(self, tmp_path):
        """Total size over the limit is rejected in the Guard phase, at open time."""
        buf = io.BytesIO()
        with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
            for i in range(5):
                zf.writestr(f"file_{i}.bin", b"A" * 300)
        p = tmp_path / "multi.zip"
        p.write_bytes(buf.getvalue())

        # 5 members x 300 bytes = 1500 bytes, above max_total_size=1000; the
        # constructor raises before any extraction is attempted.
        with pytest.raises(MalformedArchiveError):
            SafeZipFile(p, max_file_size=1000, max_total_size=1000)


class TestCompressionRatioLimit:
    """Streamer enforces per-member and total compression ratio limits."""

    def test_per_member_ratio_exceeded(self, high_ratio_archive, tmp_path):
        """High-ratio archive (zeros) triggers per-member ratio check."""
        dest = tmp_path / "out"
        dest.mkdir()
        with (
            pytest.raises(CompressionRatioError),
            SafeZipFile(high_ratio_archive, max_per_member_ratio=10.0) as zf,
        ):
            zf.extractall(dest)

    def test_no_partial_file_after_ratio_failure(self, high_ratio_archive, tmp_path):
        """Atomic write: no partial file must remain after CompressionRatioError."""
        dest = tmp_path / "out"
        dest.mkdir()

        with (
            pytest.raises(CompressionRatioError),
            SafeZipFile(high_ratio_archive, max_per_member_ratio=10.0) as zf,
        ):
            zf.extractall(dest)

        remaining = [f for f in dest.rglob("*") if not f.is_dir()]
        assert not remaining, f"Partial files found: {remaining}"

    def test_high_ratio_passes_with_generous_limit(self, high_ratio_archive, tmp_path):
        """Same archive passes if we allow a high ratio (both per-member and total)."""
        dest = tmp_path / "out"
        dest.mkdir()
        with SafeZipFile(
            high_ratio_archive,
            max_per_member_ratio=2000.0,
            max_total_ratio=2000.0,
            max_file_size=5 * 1024 * 1024,
        ) as zf:
            zf.extractall(dest)
        assert (dest / "zeros.bin").exists()


class TestAtomicWrite:
    """Extraction destinations are created atomically."""

    def test_successful_extraction_creates_file(self, tmp_path):
        buf = io.BytesIO()
        with zipfile.ZipFile(buf, "w") as zf:
            zf.writestr("output.txt", b"hello safezip")
        p = tmp_path / "ok.zip"
        p.write_bytes(buf.getvalue())
        dest = tmp_path / "out"
        dest.mkdir()

        with SafeZipFile(p) as zf:
            zf.extractall(dest)
        assert (dest / "output.txt").read_bytes() == b"hello safezip"

    def test_extract_single_member(self, tmp_path):
        buf = io.BytesIO()
        with zipfile.ZipFile(buf, "w") as zf:
            zf.writestr("a.txt", b"AAA")
            zf.writestr("b.txt", b"BBB")
        p = tmp_path / "two.zip"
        p.write_bytes(buf.getvalue())
        dest = tmp_path / "out"
        dest.mkdir()

        with SafeZipFile(p) as zf:
            zf.extract("a.txt", dest)
        assert (dest / "a.txt").read_bytes() == b"AAA"
        assert not (dest / "b.txt").exists()

    def test_no_temp_files_after_success(self, tmp_path):
        buf = io.BytesIO()
        with zipfile.ZipFile(buf, "w") as zf:
            zf.writestr("hello.txt", b"world")
        p = tmp_path / "ok.zip"
        p.write_bytes(buf.getvalue())
        dest = tmp_path / "out"
        dest.mkdir()

        with SafeZipFile(p) as zf:
            zf.extractall(dest)

        all_files = list(dest.rglob("*"))
        temp_files = [f for f in all_files if ".safezip_tmp_" in f.name]
        assert not temp_files
