gal_goku Package

This is the main package containing the emulator framework and galaxy clustering models.

Main Modules

gal Module

GalBase Class

LargeScaleGal Class

emus Module

To evaluate all the emulators for the Halo Mass Function. This is the interface of single_fid.py and summary_stats.py to work with the generated emulated Halo Mass Fucntion functions.

class gal_goku.emus.BaseStatEmu(X, Y, model_error, logging_level='info', multi_bin=False)[source]

Bases: object

__init__(X, Y, model_error, logging_level='info', multi_bin=False)[source]

The base emu to be inherited by single fidelity emulators built on any summary statistics This is the interface for all classses above.

configure_logging(logging_level)[source]

Sets up logging based on the provided logging level.

leave_bunch_out(n_out=5, narrow=0)[source]

Leaves out a random bunch of samples out n_out: Number of samples to leave out

loo_train_pred(savefile, narrow=0)[source]

Get the leave one out predictions

predict(X_test)[source]

Predict the mean and variance of the emulator

train_pred_all_sims()[source]

Train the model on all simulations and comapre with the truth

class gal_goku.emus.Hmf(data_dir, y_log=True, fid='L2', multi_bin=False, logging_level='INFO', narrow=False, no_merge=True)[source]

Bases: BaseStatEmu

__init__(data_dir, y_log=True, fid='L2', multi_bin=False, logging_level='INFO', narrow=False, no_merge=True)[source]

data_dir: Directory where the data is stored r_range: Range of r to consider fid: The fiducial cosmology to consider, default is ‘L2’ logging_level: Logging level cleaning_method: Method to clean the negative bins,

default is ‘linear_interp’; otherwuse replace with a small number, e.g. 1e-10

BaseStatEmu Class

class gal_goku.emus.BaseStatEmu(X, Y, model_error, logging_level='info', multi_bin=False)[source]

Bases: object

Base emulator class for summary statistics.

Key Methods:

predict(X_test)[source]

Predict the mean and variance of the emulator

loo_train_pred(savefile, narrow=0)[source]

Get the leave one out predictions

train_pred_all_sims()[source]

Train the model on all simulations and comapre with the truth

leave_bunch_out(n_out=5, narrow=0)[source]

Leaves out a random bunch of samples out n_out: Number of samples to leave out

__init__(X, Y, model_error, logging_level='info', multi_bin=False)[source]

The base emu to be inherited by single fidelity emulators built on any summary statistics This is the interface for all classses above.

configure_logging(logging_level)[source]

Sets up logging based on the provided logging level.

leave_bunch_out(n_out=5, narrow=0)[source]

Leaves out a random bunch of samples out n_out: Number of samples to leave out

loo_train_pred(savefile, narrow=0)[source]

Get the leave one out predictions

predict(X_test)[source]

Predict the mean and variance of the emulator

train_pred_all_sims()[source]

Train the model on all simulations and comapre with the truth

Hmf Class

class gal_goku.emus.Hmf(data_dir, y_log=True, fid='L2', multi_bin=False, logging_level='INFO', narrow=False, no_merge=True)[source]

Bases: BaseStatEmu

Halo Mass Function emulator.

Parameters:

  • data_dir (str): Directory containing the training data

  • y_log (bool): Whether to use log-space for predictions

  • fid (str): Fiducial cosmology identifier

  • multi_bin (bool): Use multi-bin mode

  • logging_level (str): Logging verbosity level

  • narrow (bool): Use narrow parameter range

  • no_merge (bool): Don’t merge data from multiple sources

__init__(data_dir, y_log=True, fid='L2', multi_bin=False, logging_level='INFO', narrow=False, no_merge=True)[source]

data_dir: Directory where the data is stored r_range: Range of r to consider fid: The fiducial cosmology to consider, default is ‘L2’ logging_level: Logging level cleaning_method: Method to clean the negative bins,

default is ‘linear_interp’; otherwuse replace with a small number, e.g. 1e-10

emu_cosmo Module

Cosmological emulator utilities.

halo_tools Module

Tools for halo manipulations, mass conversions, and halo model calculations.

init_power Module

Class to generate simulation ICS, separated out for clarity.

class gal_goku.init_power.SimulationICs(*, seed: int = 9281110, redshifts: float = 99, redend: float = 0, omega0: float = 0.288, omegab: float = 0.0472, hubble: float = 0.7, scalar_amp: float = 2.427e-09, ns: float = 0.97, rscatter: bool = False, m_nu: float = 0, nu_hierarchy: str = 'normal', uvb: str = 'pu', nu_acc: float = 1e-05, unitary: bool = True, w0_fld: float = -1.0, wa_fld: float = 0.0, N_ur: float = 3.044, alpha_s: float = 0, MWDM_therm: float = 0, python: str = 'python')[source]

Bases: object

Class for creating the initial conditions for a simulation. There are a few things this class needs to do:

  • Generate CAMB input files

  • Generate MP-GenIC input files (to use CAMB output)

  • Run CAMB and MP-GenIC to generate ICs

The class will store the parameters of the simulation.

We also save a copy of the input and enough information to reproduce the results exactly in SimulationICs.json.

Many things are left hard-coded.

We assume flatness.

Init parameters:

redshifts - redshifts at which to generate ICs omegab - baryon density. Note that if we do not have gas particles,

still set omegab, but set separate_gas = False

omega0 - Total matter density at z=0 (includes massive neutrinos and

baryons)

hubble - Hubble parameter, h, which is H0 / (100 km/s/Mpc) scalar_amp - A_s at k = 0.05, comparable to the Planck value. ns - Scalar spectral index m_nu - neutrino mass unitary - if true, do not scatter modes, but use a unitary gaussian

amplitude.

Remove:

separate_gas - if true the ICs will contain baryonic particles;

If false, just DM.

cambfile(maxk=0.5) Tuple[numpy.ndarray, numpy.ndarray][source]

Generate the IC power spectrum using classylss.

Basically is using pre_params feed into class and compute the powerspec files based on the range of redshift and redend. All files are stored in the directory camb_out.

Return:

camb_output (str) : power specs directory. default: “camb_linear/”

property json: dict

these are the variables will be saved into json

gal_goku.init_power.get_neutrino_masses(total_mass: float, hierarchy: str) numpy.ndarray[source]

Get the three neutrino masses, including the mass splittings. Hierarchy is ‘inverted’ (two heavy), ‘normal’ (two light) or degenerate.

Initial conditions and linear power spectrum calculations using ClassyLSS.

summary_stats Module

Loading the computed summary statistics from the data files.

class gal_goku.summary_stats.BaseSummaryStats(data_dir, fid, narrow=False, MPI=None, logging_level='INFO')[source]

Bases: object

Base class for summary statistics

configure_logging(logging_level='INFO', logger_name='summary_stats')[source]

Sets up logging based on the provided logging level in an MPI environment.

get_cosmo_params()[source]

get comological parameters from the simulations listed in the labels

get_ics(keys)[source]

Get the desired keys from the ICs file

get_labels()[source]

Get the labels we use for each simulation, they are in this format 10p_Box{BoxSize}_Par{Npart}_0001

get_params_array()[source]

Get the cosmological parameters as an array

get_sims_specs()[source]

Get the simulation specs from the ICs file

load_ics()[source]

Load the IC json file

make_big_ic_file(base_dirs=['/scratch/06536/qezlou/Goku/FOF/HF/', '/scratch/06536/qezlou/Goku/FOF/L1/', '/scratch/06536/qezlou/Goku/FOF/L2/', '/scratch/06536/qezlou/Goku/FOF/L2/narrow/', '/scratch/06536/qezlou/Goku/FOF/HF/narrow/'])[source]
class gal_goku.summary_stats.HMF(data_dir, fid, z=2.5, mass_range=(11.1, 12.5), narrow=False, no_merge=True, chi2=False, logging_level='INFO')[source]

Bases: BaseSummaryStats

Halo mass function

get_coeffs(ind=None, delta_r=None, *kwargs)[source]

Retrun the spline fits in an array Parameters: ————– ind: np.ndarray, optional, default=None

Index of the simulations to fit. If None, fit all simulations

kwargs: dict

Keyword arguments for utils.ConstrainedSplineFitter

Returns:

2D array of the spline coefficients and the knots (coeffs, knots)

get_counts()[source]

Return the halo mass function for all the simulations and the corresponding mask for the missing bins. TODO: We can improve this by not removing the very low count bins at high mass end and just rely on the Poisson likelihood to use that bins too.

Returns:

  • full_bins (np.ndarray) – The uniform mass bins used for all the simulations

  • log_hmfs (np.ndarray) – The halo mass function for all the simulations, natural log scale

  • mult_fac (np.ndarray) – The multiplicative factor to convert the hmf to counts in bins - It is zeros for the massive missing bins.

  • params (np.ndarray) – The cosmological parameters for each simulation

  • sim_tags (np.ndarray) – The simulation tags

get_data(get_counts=False, noise_floor=0.0)[source]

Get the data for the emulator :param get_counts: Whether to get log(hmf) with uncertainties (True) or

halo counts in bins (False)

Parameters:

noise_floor (float) – The fractional noise floor to add in quadrature to the Poisson errors when get_counts is False

Returns:

  • full_bins (np.ndarray) – The uniform mass bins used for all the simulations

  • log_hmfs or counts (np.ndarray) – if get_counts is False: The halo mass function for all the simulations, log10 scale else: The halo counts in the bins for all the simulations

  • log_hmf_errs (np.ndarray) –

    if get_counts is False:

    The uncertainties for the halo mass function, log10 scale

    else:

    Array of zeros

  • params (np.ndarray) – The cosmological parameters for each simulation

  • sim_tags (np.ndarray) – The simulation tags

get_labels()[source]

It is just the simulation tags

get_pca(x, n_components=4)[source]
get_smoothed(x, delta_r=None, ind=None, *kwargs)[source]

Get the smoothed halo mass function evaluated at x Parameters: ————– x: np.ndarray or list of np.ndarray

Array of x values to evaluate the spline at

ind: np.ndarray, optional, default=None

Index of the simulations to fit. If None, fit all simulations

kwargs: dict

Keyword arguments for utils.ConstrainedSplineFitter

get_wt_err(noise_floor=0.0)[source]

Return the halo mass function for all the simulations and the corresponding uncertainties. NOT: For now we assign very large uncertainties to the bins with insufficient counts. ToDo: Use Jacknife resampling to get the uncertainties

load()[source]

Load the Halo Mass Function computed for simulations and saved on data_dir

class gal_goku.summary_stats.HaloHaloPower(data_dir, z, max_k=0.5, log_k_interp=None, fid='HF', logging_level='INFO')[source]

Bases: BaseSummaryStats

Class to prepare the propagator for the emulator.

__init__(data_dir, z, max_k=0.5, log_k_interp=None, fid='HF', logging_level='INFO')[source]

Initialize the Propagator class.

Parameters:

basedir (str) – Base directory where the data files are located.

get_labels()[source]

It is just the simulation tags

get_powers()[source]

Loading the matter power spectrum for each simulation. The saved files are in plain ascii format, with the first line being the header and the rest being the k and P_m values.

class gal_goku.summary_stats.MatterPower(data_dir, z, max_k=0.5, log_k_interp=None, fid='HF', logging_level='INFO')[source]

Bases: BaseSummaryStats

Class to prepare the propagator for the emulator.

__init__(data_dir, z, max_k=0.5, log_k_interp=None, fid='HF', logging_level='INFO')[source]

Initialize the Propagator class.

Parameters:

basedir (str) – Base directory where the data files are located.

get_labels()[source]

It is just the simulation tags

get_powers()[source]

Loading the matter power spectrum for each simulation. The saved files are in plain ascii format, with the first line being the header and the rest being the k and P_m values.

class gal_goku.summary_stats.ProjCorr(data_dir, fid, logging_level='INFO')[source]

Bases: BaseSummaryStats

Projected correlation function, w_p

bin_in_param_space(r_range=(0, 100), num_bins=2)[source]

Computet the average wp of all the simulations with low and high values of the cosmological parameters :param r_range: The range of r values to consider :type r_range: tuple :param num_bins: Number of bins to divide the cosmological parameters :type num_bins: int

Returns:

  • rp (array) – The r values

  • av_wp (array) – The average wp for each bin

  • all_bins (array) – The bin edges for each parameter

find_nan_log10_bins()[source]
get_labels()[source]

Get the labels we use for each simulation, they are in this format 10p_Box{BoxSize}_Par{Npart}_0001

get_mean_std(r_range=(0, 100))[source]

get the mean and std of the projected correlation function, the std computed over the different realizations of the same HOD and cosmology

get_wp()[source]

Take the avergae along pi direction

class gal_goku.summary_stats.Propagator(data_dir, z, max_k=0.5, fid='HF', logging_level='INFO')[source]

Bases: BaseSummaryStats

Class to prepare the propagator for the emulator.

__init__(data_dir, z, max_k=0.5, fid='HF', logging_level='INFO')[source]

Initialize the Propagator class.

Parameters:

basedir (str) – Base directory where the data files are located.

get_labels()[source]

It is just the simulation tags

get_model_gk(k)[source]

Get the propagator G(k) = P_hm / P_lin at arbitrary k values. Returns: ——- gk: np.ndarray, shape=(n_sims, n_mbins, n_k)

get_powers()[source]

Get the computed propagator computed for each simulation. It is:

G(k, m) = P_hlin(k, m) / P_init(k, m) where P_hlin is the cross correlation of the halos and the initial power spectrum, P_init is the initial power spectrum, k and m are the wavenumber and mass bins respectively.

model(k, g0, g2, g4, sigma_d)[source]

The model for the propagator G(k) = P_hm / P_lin G(k) = (g_0 + g_2 k^2 + g_4 k^4) exp(- sigma_d^2_lin k^2 / 2 ) where sigma_d is the Zeldovich displacement. Returns: ——- G(k): np.ndarray, shape=(n_sims, n_mbins)

The propagator G(k) for each simulation and mass bin.

class gal_goku.summary_stats.Xi(data_dir, fid, mass_range=(11.0, 12.32), narrow=False, MPI=None, logging_level='INFO')[source]

Bases: BaseSummaryStats

Speherically Averaged correlation function

__init__(data_dir, fid, mass_range=(11.0, 12.32), narrow=False, MPI=None, logging_level='INFO')[source]

Calling this will load the xi(r, n1, n2) for all the simulations Parameters: ————– data_dir: str

The directory containing the data

fid: str

The folder id

narrow: bool, optional, default=False

Whether it is Goku-narrow or not

logging_level: str, optional, default=’INFO’

The logging level

get_labels()[source]

It is just the simulation tags

get_wt_err(rcut=(0.2, 61), remove_sims=None)[source]

Return the correlation function for all the simulations and the corresponding uncertainties. NOT: For now we assign very large uncertainties to the bins with NaN values and simulations missing more than alpha_bad fraction of the r-bins for some massive mass pairs. ToDo: Use Jacknife resampling to get the uncertainties :param rcut: The range of r values to consider for fitting :type rcut: tuple :param alpha_bad: The fraction of bad points to consider :type alpha_bad: float

Returns:

  • bins (np.ndarray, shape=(n,3)) – The bins of the correlation function in (r, m1, m2) format

  • corr (np.ndarray, shape=(n*n_bins,)) – The correlation function for the “remaining” bins at (r, m1, m2) bins

  • corr_err (np.ndarray, shape=(n*n_bins)) – The correlation function errors for the “remaining” bins at (r, m1, m2) bins

  • params (np.ndarray, shape=(n*n_bins, prams_size)) – The cosmological parameters for the “remaining” bins

get_wt_err_interped_deprecated(rcut=(0.2, 61), good_sims=False)[source]
Parameters:
  • rcut (tuple) – The range of r values to consider for fitting

  • good_sims (bool) – Whether to use only the good simulations. The good simulations are defined as those with a low fraction of bad points which is hard-coded in the method hard_coded_good_l2_sims() for now. This removes ~ 150 L2 simulations.

Returns:

  • bins (np.ndarray, shape=(n,3)) – The bins of the correlation function in (r, m1, m2) format

  • interped_log_corr (np.ndarray, shape=(n*n_bins,)) – The correlation function for the “remaining” bins at (r, m1, m2) bins. Replacing the missing values with a minimal spline interpolation

  • corr_err (np.ndarray, shape=(n*n_bins)) – The correlation function errors for the “remaining” bins at (r, m1, m2) bins

  • params (np.ndarray, shape=(n*n_bins, prams_size)) – The cosmological parameters for the “remaining” bins

get_xi_n1_n2(r_ind, symmetric=False)[source]

Get Xi(n1, n2) for all simulations at fixed r

get_xi_n1_n2_r()[source]

Get xi(n1, n2, r) for all simulations Returns: ————– rbins: np.ndarray, shape=(n_rbins,)

The r values

mass_bins: np.ndarray, shape=(n_mass_bins,)

The mass bins, storing the mass values in descending order. This is the order the corr is stored

all_corrs: np.ndarray, shape=(n_sims, n_mass_bins, n_mass_bins, n_rbins)

The xi(n1, n2, r) for all the simulations

get_xi_r(mass_pair, rcut=(0.2, 61))[source]

Get xi(r) for all simulations at fixed mass pair

hard_coded_good_l2_sims()[source]

Following on https://github.com/qezlou/private-gal-emu/issues/39#issuecomment-3226453844 there are ~ 150 L2-W sims that miss mroe than 25% of their xi(r) values for the largest mass bins, this is the list of those sims

make_3d_corr(corr, symmetric=False)[source]

Pass xi(n_mass_pairs, n_rbins) and get a 3D array (n_mass_pairs, n_mass_pairs, n_rbins)

minimal_spline_interp(good_sims, ind_r)[source]

Use Cubic spline only for recovering random NaN values. No extrapolation is done here, if the NaN values are at the edges, they will be replaced by a small value of xi = 1e-4 and a very large uncertainty of 1e6. Parameters: ————– good_sims: np.ndarray

The indices of the simulations to consider. Simulation with too many NaN values should be removed beforehand. A list is provided in hard_coded_good_l2_sims() which is curated for highest mass bin and is applicable for lower mass bins as well.

ind_r: np.ndarray

The indices of the r values to consider

Returns:

interpolated_values: np.ndarray, shape=(len(good_sims), n_mass_pairs, len(ind_r))

The xi(r, n1, n2) values for the good simulations with NaN values replaced by interpolated values or a small value of 1e-4.

corr_err: np.ndarray, shape=(len(good_sims), n_mass_pairs, len(ind_r))

The uncertainty estimates for the interpolated values

remove_sims(sim_id)[source]

get the indices to the simulations to remove Parameters: ———– sim_id: list

List of simulation IDs to remove

Returns:

keep_sims: np.ndarray

Indices of the simulations to keep

spline_nan_interp(mass_pair, spline_s=0.01, rcut=(0.2, 61), alpha_bad=0.35)[source]

Fit univariate spline to xi(r) for a specific mass pair, handling NaN values.

Parameters:
  • mass_pair (tuple) – Mass bin pair (m1, m2) to analyze

  • spline_s (float, default=1e-2) – Spline smoothing parameter

  • rcut (tuple, default=(0.2, 61)) – Min/max radius range in Mpc/h

  • alpha_bad (float, default=0.35) – Threshold fraction of valid points required

Returns:

  • rbins (ndarray) – Radial bins within rcut range (Mpc/h)

  • log_corr (ndarray) – Log10 of correlation function values

  • eval_splines (ndarray) – Spline evaluations at each rbin, it is interpolated log10(xi(r))

  • bad_sims_mask (ndarray) – Mask of simulations with a fraction of less than alpha_bad valid points

unconcatenate(corr, bins)[source]

Unconcatenate the large xi arrays used for the emulator :param corr: The correlation function :type corr: np.ndarray, shape=(n_sims, n_bins) :param bins: The bins of the correlation function in (m1, m2, r) format :type bins: np.ndarray, shape=(n_bins, 3)

Returns:

corr – The correlation function for each mass pair

Return type:

np.ndarray, shape=(n_sims, len(mass_pairs), len(rbins))

gal_goku.summary_stats.get_pairs(data_dir, eval_mbins, no_merge=False, narrow=False)[source]

Summary statistics computations from simulation data.

single_fid Module

Train a single fidelity emualtor

class gal_goku.single_fid.EvaluateSingleFid(X, Y, model_err, logging_level='INFO')[source]

Bases: object

configure_logging(logging_level)[source]

Sets up logging based on the provided logging level.

leave_bunch_out(n_out=1, sub_sample=None)[source]

Leave one out cross validation

load_saved_loo(savefile)[source]

Load the saved leave one out predictions

loo_errors()[source]

Compute the leave one out errors

loo_train_pred(mbins, savefile=None, labels=None, sub_sample=None)[source]

Iterate over the samples to leave one out and train the model Parameters: ——- mbins: array

The projected bins radii that used for training

savefile:

The file to save the results to

Returns:

mean_pred: the predictions for the left out samples while trained on the rest var_pred: the variance of the predictions

predict(X)[source]

Predict the output for the given input

train(X_train=None, Y_train=None)[source]

Train the model

class gal_goku.single_fid.EvaluateSingleFidMultiBins(X, Y, model_err, logging_level='INFO')[source]

Bases: EvaluateSingleFid

Build a single fidelity emulator for each bin of the summary statistics

__init__(X, Y, model_err, logging_level='INFO')[source]
configure_logging(logging_level)[source]

Sets up logging based on the provided logging level.

leave_bunch_out(n_out=1)[source]

Leave one out cross validation

loo_train_pred(bins, savefile=None, labels=None)[source]

Iterate over the samples to leave one out and train the model Parameters: ——- bins: array

The projected bins radii that used for training

savefile:

The file to save the results to

predict(X)[source]

Predict the output for the given input

train(X_train=None, Y_train=None)[source]

Train the model

class gal_goku.single_fid.SingleFid(X, Y, model_err=None, logging_level='INFO')[source]

Bases: object

build_model(X_train=None, Y_train=None, kernel=None)[source]

Build the GP model

configure_logging(logging_level)[source]

Sets up logging based on the provided logging level.

normalize(X, Y, model_err)[source]

Normalize the input data such it is between 0 and 1 We don’t normalize the output data as it the median of the stack could be 0. Returns: ——– X_normalized: normalized input data between 0 and 1 X_min: minimum value of the input data X_max: maximum value of the input data mean_func: the mean of the output to be used as the mean function in the GP model

predict(X)[source]

Predict the output for the given input

train(X_train=None, Y_train=None, max_iter=1000000000)[source]

Train the GP model

class gal_goku.single_fid.TestSingleFiled(n_samples=200)[source]

Bases: object

get_truth(X)[source]
predict(X=None)[source]

Get the predictionfor this simple test

Single-fidelity Gaussian Process emulator implementations.

multi_fid Module

Multi-fidelity Gaussian Process emulator implementations.

utils Module

class gal_goku.utils.BayesianPiecewisePolynomialFitter[source]

Bases: object

__init__()[source]

Initialize the fitter for piecewise quadratic polynomials with Bayesian inference.

fit_piecewise(x, y, m)[source]

Fit piecewise quadratic polynomials to m consecutive bins using Bayesian inference.

Parameters: x (array): 1D array of x-coordinates. y (2D array): 2D array of y-coordinates with shape (num_examples, num_dimensions). m (int): Number of bins to group for fitting.

Returns: list: A list of lists containing posterior samples for polynomial coefficients. Outer list corresponds to dimensions.

class gal_goku.utils.ConstrainedSplineFitter(degree=2, s=None, constraints=True, logging_level='INFO')[source]

Bases: object

__init__(degree=2, s=None, constraints=True, logging_level='INFO')[source]

A Spline fitter which requires the second derivative to be all negative. Of particular use in fitting the halo mass function where we expect the slope to be decreasing with mass. THis is similar to what implemented in MiraTitanII arxiv:2003.12116 Parameters: ———– degree: int, default=2

Degree of the spline

s: int, default=x.size, optional

Smoothing factor, refer to scipy.interpolate.generate_knots. This helps find the internal knots to achieve the desired smoothness.

constraints: bool, default=True

If True, enforce the constraints on the second derivative to be negative

configure_logging(logging_level)[source]

Sets up logging based on the provided logging level.

fit_spline(x, y, knots, sigma=1)[source]

Fit a spline with constraints on the coefficients of the quadratic term.

Parameters: x (array): 1D array of x-coordinates. y (array): 1D array of y-coordinates. knots (array): Internal knot positions for the spline.

Returns: BSpline: A spline object with the specified constraints.

Utility functions for data processing and manipulation.

plot Module

Plotting utilities for visualization of results.

plot_gal Module

Specialized plotting functions for galaxy clustering visualizations.