This directory contains the test suite for the METS XML Synthesis Pipeline. The tests are designed to verify the functionality of the pipeline components and ensure that they work together correctly. The test suite provides comprehensive coverage of the pipeline's functionality, from data loading to XML validation.
The test suite consists of:
- `conftest.py`: Contains pytest fixtures that set up the test environment
- `smoke_test.py`: A comprehensive smoke test that verifies the basic functionality of all pipeline components
The smoke test includes individual test functions for each component of the pipeline (a hypothetical sketch of one such test follows the list):

- `test_data_loader`: Tests the loading and validation of input JSON files
- `test_metadata_builder`: Tests the construction of SDV-compatible metadata
- `test_model`: Tests the training of a generative model
- `test_sampler`: Tests the generation of synthetic data
- `test_reassembler`: Tests the conversion of synthetic data to XML
- `test_validator`: Tests the validation of the generated XML
- `test_pipeline`: Tests the end-to-end pipeline execution
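For orientation, each per-component test consumes the matching fixture and asserts on its output. The sketch below is hypothetical (the real assertions live in `smoke_test.py`) and assumes the `synthetic_data` fixture yields a dictionary of pandas DataFrames:

```python
import pandas as pd


def test_sampler(synthetic_data, logger):
    """Hypothetical sketch; the real test_sampler may assert more."""
    assert synthetic_data is not None
    # Assumed shape: a dict mapping table names to synthetic DataFrames.
    for name, df in synthetic_data.items():
        logger.info("synthetic table %r has %d rows", name, len(df))
        assert isinstance(df, pd.DataFrame)
        assert not df.empty
```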
The test fixtures in `conftest.py` are designed to provide reusable components for tests. They follow a dependency chain that mirrors the pipeline's execution flow:

- `config` (session scope): Provides a test configuration with temporary output paths
- `logger` (session scope): Provides a logger for tests
- `dmdsec_df`, `file_df`, `structmap_df` (module scope): Provide the input DataFrames
- `tables` (module scope): Combines the DataFrames into a dictionary
- `metadata` (module scope): Builds metadata from the input DataFrames
- `model` (module scope): Trains a model on the input data
- `synthetic_data` (function scope): Generates synthetic data from the model
- `xml_output_path` (function scope): Reassembles the synthetic data into XML
The fixtures use different scopes to optimize test execution (see the sketch after this list):

- Session scope: Created once per test session (e.g., `config`, `logger`)
- Module scope: Created once per test module (e.g., `dmdsec_df`, `metadata`, `model`)
- Function scope: Created for each test function (e.g., `synthetic_data`, `xml_output_path`)
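The following is a minimal sketch of how such a fixture chain can be declared with explicit scopes, assuming a dictionary-based test configuration. The fixture names and scopes match the lists above, but the bodies are placeholders rather than the project's real implementations:

```python
"""Hypothetical sketch of conftest.py's fixture chain. The real fixtures load the
project's input JSON files and train an SDV model; the bodies below are stand-ins."""
import logging

import pandas as pd
import pytest


@pytest.fixture(scope="session")
def config(tmp_path_factory):
    # Created once per session: test configuration with temporary output paths.
    return {"output_dir": str(tmp_path_factory.mktemp("outputs")), "random_seed": 42}


@pytest.fixture(scope="session")
def logger():
    # Created once per session: shared logger for all tests.
    return logging.getLogger("tests")


@pytest.fixture(scope="module")
def dmdsec_df():
    # Created once per module: stand-in for the dmdSec input DataFrame.
    return pd.DataFrame({"dmd_id": ["dmd-1"], "title": ["Example record"]})


@pytest.fixture(scope="module")
def tables(dmdsec_df):
    # Combines the input DataFrames into a dictionary (file_df and structmap_df omitted here).
    return {"dmdsec": dmdsec_df}


@pytest.fixture(scope="function")
def synthetic_data(tables, config):
    # Created for each test function: the real fixture samples from the trained
    # model; this placeholder just copies the input tables.
    return {name: df.copy() for name, df in tables.items()}
```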
To run the smoke test:

```bash
# Run with pytest (recommended)
pytest tests/smoke_test.py -v

# Or run the script directly
./tests/smoke_test.py
```

To run a specific test function:

```bash
pytest tests/smoke_test.py::test_data_loader -v
```

To run all tests with detailed output:

```bash
pytest tests/ -v
```
Understanding the difference between testing and production usage is important:
| Aspect | Testing (pytest) | Production |
|---|---|---|
| Entry Point | `pytest tests/smoke_test.py` | `python -m src.data_archive_ml_synthesizer.pipeline` |
| Configuration | Test configuration created in `conftest.py` | Configuration from `config.yaml` |
| Output Files | Temporary files | Files specified in `config.yaml` |
| Purpose | Verify code functionality | Generate synthetic METS XML |
| Environment | Test fixtures automatically set up | Manual configuration required |
When to use each approach:
- Use pytest during development to verify that your code changes don't break existing functionality
- Use production mode when you want to generate actual synthetic METS XML documents for your use case
When writing new tests, you can use the fixtures defined in `conftest.py` to set up the test environment. For example:

```python
def test_example(config, synthetic_data, logger):
    """Example test using fixtures."""
    # Your test code here
    assert synthetic_data is not None
```

The fixtures will be automatically created and injected into your test function.
- Use fixtures: Leverage the existing fixtures to set up the test environment
- Test one thing at a time: Each test function should focus on testing a single aspect of the code
- Use descriptive names: Test function names should clearly describe what they're testing
- Include assertions: Every test should include at least one assertion to verify the expected behavior
- Handle exceptions: Use pytest's exception handling (`pytest.raises`) to test for expected exceptions; a short sketch follows this list
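For the last point, `pytest.raises` asserts that a block of code raises a specific exception. A minimal, self-contained sketch, using a plain file read as a stand-in for the pipeline's real loader:

```python
import json

import pytest


def test_loader_rejects_missing_file(tmp_path):
    """Illustrative only: assert that an expected exception is raised."""
    missing = tmp_path / "does_not_exist.json"

    # Replace this stand-in with a call to the pipeline's real data loader.
    with pytest.raises(FileNotFoundError):
        with open(missing) as fh:
            json.load(fh)
```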
The test configuration is created in `conftest.py` using the `create_test_config()` function from `smoke_test.py`. It uses the same input files as the main configuration but creates temporary output paths to avoid modifying the project's output files.
The test configuration includes:
- Input paths pointing to the same input files as the main configuration
- Temporary output paths for generated files
- A fixed random seed for reproducibility
- Validation settings matching the main configuration
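As a rough illustration, assuming the configuration is passed around as a plain dictionary (the real schema is defined by `config.yaml` and `create_test_config()` in `smoke_test.py`), it could be assembled like this; all keys and paths below are placeholders:

```python
import tempfile
from pathlib import Path


def create_test_config() -> dict:
    # Hypothetical sketch; the real create_test_config() in smoke_test.py defines
    # the actual keys, paths, and validation settings.
    out_dir = Path(tempfile.mkdtemp(prefix="mets_test_"))
    return {
        "input": {
            # Placeholder paths: same input files as the main configuration.
            "dmdsec": "data/input/dmdsec.json",
            "file": "data/input/file.json",
            "structmap": "data/input/structmap.json",
        },
        "output": {
            # Temporary paths so the project's real outputs are never overwritten.
            "xml": str(out_dir / "synthetic_mets.xml"),
        },
        "random_seed": 42,                 # fixed seed for reproducibility
        "validation": {"enabled": True},   # mirrors the main configuration
    }
```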