gal_goku_sims Package

This package handles simulation data processing and computation of summary statistics.

Main Modules

hmf Module

Halo Mass Function computations from simulation catalogs.

Main Classes:

HMF

Computes the halo mass function from simulation halo catalogs.

Key functionality:

  • Read halo catalogs from simulations

  • Compute mass functions with proper binning

  • Handle multiple cosmologies and redshifts

  • Export results in standardized formats

Typical Usage:

from gal_goku_sims import hmf

# Initialize HMF computer
hmf_calc = hmf.HMF(
    catalog_path='path/to/halos.hdf5',
    box_size=1000.0,  # Mpc/h
    redshift=2.0
)

# Compute mass function
masses, phi = hmf_calc.compute()
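
To illustrate what computing a mass function with binning can involve, the sketch below counts halos in equal-width bins of log10(M) and divides by the box volume and the bin width to get dn/dlog10(M). It is a minimal NumPy sketch under stated assumptions, not the code behind HMF.compute(); the function name mass_function, its arguments, and the synthetic catalog are hypothetical.

```python
import numpy as np


def mass_function(masses, box_size, n_bins=20):
    """Return bin centers [Msun/h] and dn/dlog10(M) [(Mpc/h)^-3 dex^-1].

    Hypothetical helper for illustration; not part of gal_goku_sims.
    """
    log_m = np.log10(np.asarray(masses, dtype=float))
    # Equal-width bins in log10(M) spanning the whole catalog
    edges = np.linspace(log_m.min(), log_m.max(), n_bins + 1)
    counts, _ = np.histogram(log_m, bins=edges)
    dlog_m = np.diff(edges)
    volume = box_size ** 3
    # Geometric bin centers, converted back to linear mass
    centers = 10.0 ** (0.5 * (edges[:-1] + edges[1:]))
    return centers, counts / (volume * dlog_m)


# Synthetic catalog: 10^4 halos with log-uniform masses (illustration only)
rng = np.random.default_rng(0)
fake_masses = 10.0 ** rng.uniform(11.0, 14.0, size=10_000)
centers, phi = mass_function(fake_masses, box_size=1000.0)
```

Binning in log10(M) rather than in M keeps the bins comparably populated across several decades in mass.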

xi Module

Correlation function computations from simulation data.

Main Classes:

HaloXi

Computes halo-halo correlation functions from simulation catalogs.

Key functionality:

  • Compute 2-point correlation functions

  • Support for mass-threshold samples

  • Jackknife error estimation

  • Parallel computation with MPI

Typical Usage:

from gal_goku_sims import xi

# Initialize correlation function computer
xi_calc = xi.HaloXi(
    catalog_path='path/to/halos.hdf5',
    box_size=1000.0,  # Mpc/h
    mass_threshold=1e12  # Msun/h
)

# Compute correlation function
r, xi_r = xi_calc.compute()
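
For orientation, a 2-point correlation function in a periodic box can be estimated with the natural estimator xi = DD/RR - 1, where DD are the measured pair counts and RR the counts expected for a uniform distribution. The sketch below is a brute-force NumPy version for small test catalogs only. It is not how HaloXi is implemented, and the name xi_natural and its arguments are hypothetical.

```python
import numpy as np


def xi_natural(pos, box_size, r_edges):
    """Natural estimator xi = DD/RR - 1 for a periodic box.

    Brute force with O(N^2) memory; only suitable for small catalogs.
    Hypothetical helper, not part of gal_goku_sims.
    """
    pos = np.asarray(pos, dtype=float)
    r_edges = np.asarray(r_edges, dtype=float)
    n = len(pos)
    # Indexes of all unique pairs (i < j)
    i, j = np.triu_indices(n, k=1)
    d = pos[i] - pos[j]
    # Minimum-image convention for periodic boundaries
    d -= box_size * np.round(d / box_size)
    r = np.sqrt((d ** 2).sum(axis=1))
    dd, _ = np.histogram(r, bins=r_edges)
    # Expected unique-pair counts for a uniform distribution
    shell_vol = 4.0 / 3.0 * np.pi * np.diff(r_edges ** 3)
    rr = 0.5 * n * (n - 1) * shell_vol / box_size ** 3
    r_mid = 0.5 * (r_edges[:-1] + r_edges[1:])
    return r_mid, dd / rr - 1.0


# Uniform random points, so xi should scatter around zero
rng = np.random.default_rng(1)
pos = rng.uniform(0.0, 100.0, size=(1000, 3))
r_edges = np.linspace(5.0, 40.0, 8)
r_mid, xi_r = xi_natural(pos, box_size=100.0, r_edges=r_edges)
```

The largest separation is kept below half the box size so that the minimum-image distance is unambiguous.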

mpi_helper Module

gal_goku_sims.mpi_helper.Allgatherv_helper(MPI, comm, data, data_type)[source]

Each rank should call this with the data held on that rank.

Parameters:

  • MPI : the mpi4py.MPI module

  • comm : the MPI communicator

  • data : the 1D array on each rank; its size may differ between ranks

  • data_type : the type of each element of the data array

gal_goku_sims.mpi_helper.distribute_array(comm, data)[source]

Distribute the array “data” equally between ranks and return the load for each rank individually.

gal_goku_sims.mpi_helper.distribute_array_split_comm(size, color, data)[source]

Similar to distribute_array(), but useful for a split communicator.

gal_goku_sims.mpi_helper.distribute_files(comm, fnames)[source]

Distribute a list of files among the available ranks.

Parameters:

  • comm : the MPI communicator

  • fnames : a list of file names

Returns:

  • A list of files for each rank

gal_goku_sims.mpi_helper.into_chunks(comm, length)[source]

Similar to distribute_array(), but returns the start and end indexes of all ranks. Use this if each rank needs to know the start and end index of all other ranks.

Parameters:

  • comm : the MPI communicator

  • length : the total length of the array to be distributed

Returns:

  • start, end : the start and end index of the array for each rank, sorted by rank number. If padding is not zero, the start am
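
One common way to derive per-rank start and end indexes for a near-equal contiguous split is sketched below. It runs without MPI so the arithmetic can be checked on its own. The function chunk_bounds is hypothetical, and the splitting convention used by into_chunks() itself may differ (for example in how leftover elements or padding are handled).

```python
import numpy as np


def chunk_bounds(length, n_ranks):
    """Start/end indexes (end exclusive) of near-equal contiguous chunks.

    Hypothetical helper: the first (length % n_ranks) ranks get one
    extra element each.
    """
    base, extra = divmod(length, n_ranks)
    counts = np.full(n_ranks, base, dtype=int)
    counts[:extra] += 1
    end = np.cumsum(counts)
    start = end - counts
    return start, end


# 10 elements over 4 ranks: the per-rank counts are 3, 3, 2, 2
start, end = chunk_bounds(length=10, n_ranks=4)
```

In an MPI program, n_ranks would come from comm.Get_size(), and rank r would process data[start[r]:end[r]].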

MPI utilities for parallel processing of simulation data.

Key Functions:

  • Process distribution across MPI ranks

  • Collective operations for data gathering

  • Efficient parallel I/O

  • Error handling in MPI context

Typical Usage:

from gal_goku_sims import mpi_helper
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Distribute work across ranks with distribute_array(comm, data),
# documented above; each rank receives its own share of the array
tasks = np.arange(100)
my_tasks = mpi_helper.distribute_array(comm, tasks)

Data Formats

Halo Catalogs

Halo catalogs are expected in HDF5 format with the following structure:

halos.hdf5
├── mass          # Halo masses [Msun/h]
├── pos           # Positions [Mpc/h], shape (N, 3)
├── vel           # Velocities [km/s], shape (N, 3)
└── metadata
    ├── box_size  # Box size [Mpc/h]
    ├── redshift  # Redshift
    └── cosmology # Cosmological parameters
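
The layout above can be exercised with h5py, as in the sketch below, which writes a small synthetic catalog and reads it back. Two assumptions are not confirmed by the source: that metadata is an HDF5 group, and that box_size and redshift are stored as scalar datasets rather than attributes. The cosmology entry is left out because its format is not specified here.

```python
import os
import tempfile

import h5py
import numpy as np

rng = np.random.default_rng(2)
n_halos = 100
path = os.path.join(tempfile.mkdtemp(), "halos.hdf5")

# Write a small synthetic catalog following the layout above
with h5py.File(path, "w") as f:
    f.create_dataset("mass", data=10.0 ** rng.uniform(11, 14, n_halos))
    f.create_dataset("pos", data=rng.uniform(0, 1000.0, (n_halos, 3)))
    f.create_dataset("vel", data=rng.normal(0, 300.0, (n_halos, 3)))
    meta = f.create_group("metadata")
    # Assumption: scalar datasets; real files may use attributes instead
    meta.create_dataset("box_size", data=1000.0)
    meta.create_dataset("redshift", data=2.0)

# Read the catalog back
with h5py.File(path, "r") as f:
    mass = f["mass"][:]
    pos = f["pos"][:]
    box_size = f["metadata/box_size"][()]
    redshift = f["metadata/redshift"][()]
```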

Correlation Functions

Correlation function outputs are saved in HDF5 format:

xi.hdf5
├── r             # Separation bins [Mpc/h]
├── xi            # Correlation function values
├── xi_err        # Errors (if computed)
└── metadata
    ├── mass_threshold  # Mass threshold [Msun/h]
    ├── redshift        # Redshift
    └── n_pairs         # Number of pairs per bin
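
This section does not say how xi_err is obtained. Since the xi module lists jackknife error estimation, one plausible route is the delete-one jackknife: split the box into subvolumes, recompute xi with each subvolume left out in turn, and take the scaled scatter of those estimates. The sketch below shows only that final step; jackknife_error and the toy input are hypothetical.

```python
import numpy as np


def jackknife_error(xi_jk):
    """Delete-one jackknife standard error per separation bin.

    xi_jk has shape (n_sub, n_bins): row k is the estimate obtained
    with subvolume k left out. Hypothetical helper for illustration.
    """
    xi_jk = np.asarray(xi_jk, dtype=float)
    n_sub = xi_jk.shape[0]
    mean = xi_jk.mean(axis=0)
    # The (n_sub - 1) / n_sub factor accounts for the strong overlap
    # between leave-one-out samples
    var = (n_sub - 1) / n_sub * ((xi_jk - mean) ** 2).sum(axis=0)
    return np.sqrt(var)


# Toy input: 3 leave-one-out estimates in 2 separation bins
xi_jk = np.array([[1.0, 0.5], [2.0, 0.5], [3.0, 0.5]])
xi_err = jackknife_error(xi_jk)
```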

Performance Considerations

MPI Parallelization

For large datasets, use MPI parallelization:

mpirun -np 16 python compute_correlations.py

Memory Management

When working with large catalogs:

  1. Use chunked reading with HDF5

  2. Process data in batches

  3. Clear memory explicitly with del statements

  4. Monitor memory usage with memory_profiler
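
Steps 1 to 3 above can be sketched as a generator that yields one batch at a time. The helper iter_chunks is hypothetical; it relies only on len() and slicing, so it works on a NumPy array (used here as a stand-in) as well as on an h5py.Dataset, where each slice reads only the requested range from disk.

```python
import numpy as np


def iter_chunks(dataset, chunk_size):
    """Yield consecutive slices of `dataset` along its first axis.

    Works for any object that supports len() and slicing, such as a
    NumPy array or an h5py.Dataset.
    """
    n = len(dataset)
    for start in range(0, n, chunk_size):
        yield dataset[start:start + chunk_size]


# Stand-in for a large on-disk 'mass' dataset
masses = np.logspace(11, 14, 1000)
threshold = 1e12
n_above = 0
for chunk in iter_chunks(masses, chunk_size=256):
    # Accumulate a small summary instead of keeping the batch
    n_above += int(np.count_nonzero(chunk >= threshold))
    del chunk  # release the batch before the next read
```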

Optimization Tips

  • Use pre-computed pair counts when possible

  • Cache frequently accessed data

  • Vectorize operations with NumPy

  • Profile code to identify bottlenecks
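
As a small illustration of the caching and vectorization tips, the sketch below counts halos above several mass thresholds in two ways: one full pass over the catalog per threshold, and a version that sorts the masses once and then locates every threshold with a single np.searchsorted call. The catalog and thresholds are synthetic and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
masses = 10.0 ** rng.uniform(11.0, 14.0, size=50_000)
thresholds = np.array([1e11, 1e12, 1e13])

# Loop version: one full pass over the catalog per threshold
counts_loop = np.array([np.count_nonzero(masses >= t) for t in thresholds])

# Vectorized version: sort once (the sorted array can be cached and
# reused), then find all thresholds by binary search in one call
sorted_masses = np.sort(masses)
first_idx = np.searchsorted(sorted_masses, thresholds, side="left")
counts_vec = masses.size - first_idx
```

With side="left", first_idx is the position of the first mass that is at least the threshold, so subtracting it from the catalog size gives the number of halos in each mass-threshold sample.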