gal_goku_sims Package

This package handles simulation data processing and computation of summary statistics.

Main Modules

hmf Module

Halo Mass Function computations from simulation catalogs.

Main Classes:

HMF

Computes the halo mass function from simulation halo catalogs.

Key functionality:

Read halo catalogs from simulations
Compute mass functions with proper binning
Handle multiple cosmologies and redshifts
Export results in standardized formats

Typical Usage:

from gal_goku_sims import hmf

# Initialize HMF computer
hmf_calc = hmf.HMF(
    catalog_path='path/to/halos.hdf5',
    box_size=1000.0,  # Mpc/h
    redshift=2.0
)

# Compute mass function
masses, phi = hmf_calc.compute()
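
The binning step can be illustrated independently of the HMF class. The following is a minimal NumPy sketch, not the package's implementation: it bins halo masses in equal intervals of log10(M) and divides the counts by the box volume and the bin width to obtain dn/dlog10(M). The function name mass_function and all of its arguments are hypothetical, and the mock masses are random numbers used only to make the example runnable.

```python
import numpy as np


def mass_function(masses, box_size, n_bins=10, m_min=None, m_max=None):
    """Estimate dn/dlog10(M) in (h/Mpc)^3 from halo masses in Msun/h.

    Hypothetical helper for illustration; not part of gal_goku_sims.
    """
    log_m = np.log10(np.asarray(masses, dtype=float))
    lo = log_m.min() if m_min is None else np.log10(m_min)
    hi = log_m.max() if m_max is None else np.log10(m_max)

    # Equal-width bins in log10(M)
    edges = np.linspace(lo, hi, n_bins + 1)
    counts, _ = np.histogram(log_m, bins=edges)

    # Normalize by comoving volume and by the bin width in dex
    dlog_m = np.diff(edges)
    phi = counts / (box_size**3 * dlog_m)

    # Bin centers, taken at the midpoint in log space
    centers = 10.0 ** (0.5 * (edges[:-1] + edges[1:]))
    return centers, phi


# Mock catalog: masses drawn uniformly in log10(M) between 1e11 and 1e14 Msun/h
rng = np.random.default_rng(0)
mock_masses = 10.0 ** rng.uniform(11.0, 14.0, size=5000)
m_centers, phi = mass_function(
    mock_masses, box_size=1000.0, n_bins=6, m_min=1e11, m_max=1e14
)
```

Multiplying phi by the box volume and the bin width recovers the raw halo counts per bin, which is a quick consistency check on any mass function output.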

xi Module

Correlation function computations from simulation data.

Main Classes:

HaloXi

Computes halo-halo correlation functions from simulation catalogs.

Key functionality:

Compute 2-point correlation functions
Support for mass-threshold samples
Jackknife error estimation
Parallel computation with MPI

Typical Usage:

from gal_goku_sims import xi

# Initialize correlation function computer
xi_calc = xi.HaloXi(
    catalog_path='path/to/halos.hdf5',
    box_size=1000.0,  # Mpc/h
    mass_threshold=1e12  # Msun/h
)

# Compute correlation function
r, xi_r = xi_calc.compute()
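
For reference, a 2-point correlation function in a periodic box can be estimated by comparing measured pair counts with the analytic expectation for a uniform distribution. The sketch below assumes the natural estimator xi = DD/RR - 1 and a brute-force pair count; it is not how HaloXi is implemented, and the function name xi_periodic is hypothetical. It is only valid for separations below half the box size and stores all N x N separations in memory, so it suits small samples only.

```python
import numpy as np


def xi_periodic(pos, box_size, r_edges):
    """Natural estimator xi = DD/RR - 1 for points in a periodic cubic box.

    Hypothetical helper for illustration; not part of gal_goku_sims.
    """
    pos = np.asarray(pos, dtype=float)
    r_edges = np.asarray(r_edges, dtype=float)
    n = len(pos)

    # Pairwise separation vectors, wrapped with the minimum-image convention
    diff = pos[:, None, :] - pos[None, :, :]
    diff -= box_size * np.round(diff / box_size)
    dist = np.sqrt((diff**2).sum(axis=-1))

    # Count each pair once (i < j)
    i_upper = np.triu_indices(n, k=1)
    dd, _ = np.histogram(dist[i_upper], bins=r_edges)

    # Expected unique pair counts for a uniform distribution
    shell_vol = 4.0 / 3.0 * np.pi * np.diff(r_edges**3)
    rr = 0.5 * n * (n - 1) * shell_vol / box_size**3
    return dd / rr - 1.0


# Uniform random points should give xi close to zero
rng = np.random.default_rng(1)
random_pos = rng.uniform(0.0, 100.0, size=(500, 3))
r_edges = np.linspace(5.0, 25.0, 5)
xi_random = xi_periodic(random_pos, box_size=100.0, r_edges=r_edges)
```

For production-sized catalogs a tree-based or grid-based pair counter would replace the dense distance matrix.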

mpi_helper Module

gal_goku_sims.mpi_helper.Allgatherv_helper(MPI, comm, data, data_type)

Each rank should call this with the data on that rank.

Parameters:
    MPI : the mpi4py.MPI module
    comm : the MPI communicator
    data : the 1D array on each rank; the size of data may differ between ranks
    data_type : type of each element of the data array

gal_goku_sims.mpi_helper.distribute_array(comm, data)

Distribute the array "data" equally between ranks and return the load for each rank individually.

gal_goku_sims.mpi_helper.distribute_array_split_comm(size, color, data)

Similar to distribute_array(), but useful for a split communicator.

gal_goku_sims.mpi_helper.distribute_files(comm, fnames)

Distribute a list of files among the available ranks.

Parameters:
    comm : MPI communicator
    fnames : a list of file names

Returns:
    A list of files for each rank

gal_goku_sims.mpi_helper.into_chunks(comm, length)

Similar to distribute_array, but returns the start and end indexes of all ranks. Use this if each rank needs to know the start and end index of all other ranks.

Parameters:
    comm : MPI communicator
    length : the total length of the array to be distributed

Returns:
    start, end : the start and end index of the array for each rank, sorted by rank number. If padding is not zero, the start am

MPI utilities for parallel processing of simulation data.

Key Functions:

Process distribution across MPI ranks
Collective operations for data gathering
Efficient parallel I/O
Error handling in MPI context

Typical Usage:

import numpy as np
from mpi4py import MPI

from gal_goku_sims import mpi_helper

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Distribute 100 task indices across ranks with distribute_array(comm, data);
# each rank receives its own share of the array
tasks = np.arange(100)
my_tasks = mpi_helper.distribute_array(comm, tasks)
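
To make the idea of an equal split concrete, here is a minimal sketch in plain NumPy that runs without MPI. It is not the mpi_helper implementation: the names split_for_rank and chunk_bounds are hypothetical, and the rule used here (the first ranks each take one extra element when the length does not divide evenly) is an assumption that may differ from what the package does.

```python
import numpy as np


def split_for_rank(data, rank, size):
    """Return the share of `data` that a given rank would process.

    Hypothetical helper; mimics an even split without needing MPI.
    """
    return np.array_split(np.asarray(data), size)[rank]


def chunk_bounds(length, size):
    """Return start and end indices of every rank's chunk, ordered by rank.

    Hypothetical helper; the remainder goes to the lowest ranks.
    """
    base, extra = divmod(length, size)
    counts = np.full(size, base, dtype=int)
    counts[:extra] += 1          # first `extra` ranks take one more element
    end = np.cumsum(counts)
    start = end - counts
    return start, end


tasks = np.arange(10)
share_rank1 = split_for_rank(tasks, rank=1, size=4)
start, end = chunk_bounds(len(tasks), size=4)
```

With 10 tasks and 4 ranks, the chunks hold 3, 3, 2 and 2 elements, and tasks[start[r]:end[r]] matches split_for_rank(tasks, r, 4) for every rank r.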

Data Formats

Halo Catalogs

Halo catalogs are expected in HDF5 format with the following structure:

halos.hdf5
├── mass          # Halo masses [Msun/h]
├── pos           # Positions [Mpc/h], shape (N, 3)
├── vel           # Velocities [km/s], shape (N, 3)
└── metadata
    ├── box_size  # Box size [Mpc/h]
    ├── redshift  # Redshift
    └── cosmology # Cosmological parameters
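
A file with this layout can be written and read back with h5py. The sketch below makes several assumptions that the layout above does not settle: metadata entries are stored as scalar datasets inside a metadata group (they could equally be HDF5 attributes), the cosmology entry is omitted because its internal structure is not specified, and the function name write_mock_catalog and the random mock values are purely illustrative. The file is kept in memory so that nothing is written to disk.

```python
import h5py
import numpy as np


def write_mock_catalog(h5file, n_halos=100, box_size=1000.0, redshift=2.0, seed=0):
    """Fill an open HDF5 file with a mock halo catalog.

    Hypothetical helper; metadata is stored as scalar datasets (an assumption).
    """
    rng = np.random.default_rng(seed)
    h5file.create_dataset("mass", data=10.0 ** rng.uniform(11, 14, n_halos))
    h5file.create_dataset("pos", data=rng.uniform(0, box_size, (n_halos, 3)))
    h5file.create_dataset("vel", data=rng.normal(0, 300, (n_halos, 3)))
    meta = h5file.create_group("metadata")
    meta.create_dataset("box_size", data=box_size)
    meta.create_dataset("redshift", data=redshift)


# In-memory HDF5 file: the "core" driver with backing_store=False avoids disk I/O
with h5py.File("mock_halos.hdf5", "w", driver="core", backing_store=False) as f:
    write_mock_catalog(f)
    mass = f["mass"][:]                       # shape (N,)
    pos = f["pos"][:]                         # shape (N, 3)
    box_size = float(f["metadata/box_size"][()])
    redshift = float(f["metadata/redshift"][()])
```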

Correlation Functions

Correlation function outputs are saved in HDF5 format:

xi.hdf5
├── r             # Separation bins [Mpc/h]
├── xi            # Correlation function values
├── xi_err        # Errors (if computed)
└── metadata
    ├── mass_threshold  # Mass threshold [Msun/h]
    ├── redshift        # Redshift
    └── n_pairs         # Number of pairs per bin
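
How xi_err is computed is not specified here. If it holds jackknife errors, as the HaloXi feature list suggests, the standard delete-one estimate could be formed as in the following sketch. The function name jackknife_error is hypothetical, and xi_samples[k] is assumed to be the correlation function measured with the k-th subvolume removed.

```python
import numpy as np


def jackknife_error(xi_samples):
    """Delete-one jackknife mean and standard error per separation bin.

    xi_samples has shape (n_jk, n_bins). Hypothetical helper for illustration.
    """
    xi_samples = np.asarray(xi_samples, dtype=float)
    n_jk = xi_samples.shape[0]
    mean = xi_samples.mean(axis=0)
    # Jackknife variance: (n - 1) / n times the sum of squared deviations
    var = (n_jk - 1) / n_jk * ((xi_samples - mean) ** 2).sum(axis=0)
    return mean, np.sqrt(var)


# Three jackknife realizations of a two-bin correlation function
samples = np.array([[1.0, 2.0], [3.0, 2.0], [5.0, 2.0]])
xi_mean, xi_err = jackknife_error(samples)
```

The (n - 1) / n prefactor, rather than 1 / (n - 1), accounts for the strong correlation between delete-one samples.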

Performance Considerations

MPI Parallelization

For large datasets, use MPI parallelization:

mpirun -np 16 python compute_correlations.py

Memory Management

When working with large catalogs:

Use chunked reading with HDF5
Process data in batches
Clear memory explicitly with del statements
Monitor memory usage with memory_profiler
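
Batch processing can be sketched as follows: a histogram is accumulated one slice at a time, so only a single batch is held in memory. The function name histogram_in_batches is hypothetical. The example uses a NumPy array as a stand-in for the catalog; an h5py dataset supports the same slicing and reads only the requested portion from disk.

```python
import numpy as np


def histogram_in_batches(dataset, edges, batch_size=100_000):
    """Accumulate histogram counts over slices of a large 1D dataset.

    `dataset` only needs len() and slicing, so a NumPy array or an
    h5py dataset both work. Hypothetical helper for illustration.
    """
    counts = np.zeros(len(edges) - 1, dtype=np.int64)
    for start in range(0, len(dataset), batch_size):
        batch = np.asarray(dataset[start:start + batch_size])
        counts += np.histogram(batch, bins=edges)[0]
    return counts


# Stand-in for a large column of log10 halo masses
rng = np.random.default_rng(2)
log_m = rng.uniform(11.0, 14.0, size=250_000)
edges = np.linspace(11.0, 14.0, 7)
counts = histogram_in_batches(log_m, edges, batch_size=60_000)
```

Because histogram counts are additive, the batched result is identical to a single pass over the full array.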

Optimization Tips

Use pre-computed pair counts when possible
Cache frequently accessed data
Vectorize operations with NumPy
Profile code to identify bottlenecks