HMF Emulation Tutorial
======================

This tutorial focuses on Halo Mass Function (HMF) emulation, a key component of Goku-ELG.

Learning Objectives
-------------------

- Understand the halo mass function and its importance
- Load and prepare HMF training data
- Train single-fidelity and multi-fidelity emulators
- Perform cross-validation
- Interpret and validate results

What is the Halo Mass Function?
--------------------------------

The halo mass function :math:`\phi(M)` describes the number density of dark matter halos as a function of their mass:

.. math::

   \frac{dn}{d\log_{10}M} = \phi(M) = \ln(10) \cdot M \cdot \frac{dn}{dM}

It's a fundamental quantity in cosmology that depends on:

- Cosmological parameters (:math:`\Omega_m`, :math:`\sigma_8`, etc.)
- Redshift
- The power spectrum

Setting Up
----------

.. code-block:: python

   import numpy as np
   import matplotlib.pyplot as plt
   from gal_goku import emus
   import h5py
   
   # Set plotting style
   plt.style.use('seaborn-v0_8-darkgrid')

Loading Training Data
---------------------

.. code-block:: python

   # Initialize the HMF emulator with training data
   data_dir = 'path/to/hmf/data'  # Update this path
   
   hmf_emu = emus.Hmf(
       data_dir=data_dir,
       y_log=True,              # Work in log-space
       fid='L2',                # Fiducial simulation
       multi_bin=False,         # Single-bin mode
       logging_level='INFO',
       narrow=False,            # Use full parameter range
       no_merge=True            # Don't merge datasets
   )
   
   print(f"Loaded {len(hmf_emu.labels)} training simulations")
   print(f"Mass bins: {len(hmf_emu.mbins)}")
   print(f"Mass range: {hmf_emu.mbins.min():.2f} to {hmf_emu.mbins.max():.2f}")

Training the Emulator
---------------------

Simple Training
~~~~~~~~~~~~~~~

.. code-block:: python

   # Train on all simulations and compare with truth
   predictions, truth, mass_bins = hmf_emu.train_pred_all_sims()
   
   print(f"Training complete!")
   print(f"Predictions shape: {predictions.shape}")
   print(f"Truth shape: {truth.shape}")

Visualizing Training Results
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   # Plot a few examples
   n_examples = 4
   fig, axes = plt.subplots(2, 2, figsize=(14, 10))
   axes = axes.flatten()
   
   for i in range(n_examples):
       ax = axes[i]
       
       # Plot truth and prediction
       ax.plot(mass_bins, truth[i], 'ko-', label='Truth', alpha=0.6)
       ax.plot(mass_bins, predictions[i], 'r^-', label='Prediction', alpha=0.8)
       
       # Calculate residual
       residual = np.abs((predictions[i] - truth[i]) / truth[i]) * 100
       
       ax.set_xlabel(r'$\log_{10}(M / M_\odot h^{-1})$')
       ax.set_ylabel(r'$\log_{10}[\phi(M)]$')
       ax.set_title(f'Simulation {i}: Mean error = {residual.mean():.2f}%')
       ax.legend()
       ax.grid(True, alpha=0.3)
   
   plt.tight_layout()
   plt.savefig('hmf_training_examples.png', dpi=150)
   print("Saved training examples plot")

Cross-Validation
----------------

Leave-One-Out Cross-Validation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   # Perform leave-one-out cross-validation
   savefile = 'hmf_loo_results.h5'
   hmf_emu.loo_train_pred(savefile=savefile, narrow=0)
   
   print(f"LOO results saved to {savefile}")

Analyzing LOO Results
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   # Load and analyze LOO results
   with h5py.File(savefile, 'r') as f:
       loo_predictions = f['predictions'][:]
       loo_truth = f['truth'][:]
       loo_variances = f['variances'][:]
   
   # Compute errors
   absolute_errors = loo_predictions - loo_truth
   relative_errors = (absolute_errors / loo_truth) * 100
   
   print(f"Mean absolute error: {np.mean(np.abs(absolute_errors)):.4f}")
   print(f"Mean relative error: {np.mean(np.abs(relative_errors)):.2f}%")
   print(f"Max relative error: {np.max(np.abs(relative_errors)):.2f}%")

Visualizing LOO Performance
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   fig, axes = plt.subplots(1, 2, figsize=(14, 5))
   
   # Left: Scatter plot of predictions vs truth
   ax = axes[0]
   ax.scatter(loo_truth.flatten(), loo_predictions.flatten(),
              alpha=0.3, s=5)
   
   # Perfect prediction line
   lims = [loo_truth.min(), loo_truth.max()]
   ax.plot(lims, lims, 'r--', linewidth=2, label='Perfect prediction')
   
   ax.set_xlabel('True log(HMF)', fontsize=12)
   ax.set_ylabel('Predicted log(HMF)', fontsize=12)
   ax.set_title('Leave-One-Out: Predictions vs Truth', fontsize=14)
   ax.legend()
   ax.grid(True, alpha=0.3)
   
   # Right: Residual distribution
   ax = axes[1]
   ax.hist(relative_errors.flatten(), bins=50, alpha=0.7, edgecolor='black')
   ax.axvline(0, color='red', linestyle='--', linewidth=2)
   ax.set_xlabel('Relative Error [%]', fontsize=12)
   ax.set_ylabel('Frequency', fontsize=12)
   ax.set_title('Distribution of Relative Errors', fontsize=14)
   ax.grid(True, alpha=0.3)
   
   plt.tight_layout()
   plt.savefig('loo_performance.png', dpi=150)
   print("Saved LOO performance plot")

Leave-Bunch-Out Cross-Validation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   # Leave out 5 simulations at a time
   n_out = 5
   X_test, Y_test, Y_pred, var_pred = hmf_emu.leave_bunch_out(n_out=n_out)
   
   print(f"Left out {n_out} simulations")
   print(f"Test set shape: {Y_test.shape}")
   
   # Compute errors
   lbo_errors = np.abs((Y_pred - Y_test) / Y_test) * 100
   print(f"Mean LBO error: {lbo_errors.mean():.2f}%")

Making Predictions
------------------

Single Cosmology
~~~~~~~~~~~~~~~~

.. code-block:: python

   # Define a test cosmology
   test_cosmo = np.array([[
       0.30,      # omega_m
       0.048,     # omega_b
       0.70,      # h
       2.0e-9,    # A_s
       0.96,      # n_s
       -1.0,      # w0
       0.0,       # wa
       3.046,     # N_ur
       0.0,       # alpha_s
       0.06       # m_nu
   ]])
   
   # Predict
   pred, var = hmf_emu.predict(test_cosmo)
   
   # Convert from log-space
   hmf_pred = 10**pred[0]
   
   # Plot
   plt.figure(figsize=(10, 6))
   plt.semilogy(mass_bins, hmf_pred, 'b-', linewidth=2)
   plt.xlabel(r'$\log_{10}(M / M_\odot h^{-1})$', fontsize=14)
   plt.ylabel(r'$\phi(M)$ [dex$^{-1}$ (Mpc/$h$)$^{-3}$]', fontsize=14)
   plt.title('HMF Prediction for Test Cosmology', fontsize=16)
   plt.grid(True, alpha=0.3)
   plt.savefig('hmf_test_prediction.png', dpi=150)

Parameter Dependence Study
~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   # Study how HMF changes with omega_m
   omega_m_values = np.linspace(0.26, 0.36, 11)
   
   hmf_vs_omega_m = []
   
   for om in omega_m_values:
       test_cosmo = np.array([[om, 0.048, 0.70, 2.0e-9, 0.96,
                               -1.0, 0.0, 3.046, 0.0, 0.06]])
       pred, _ = hmf_emu.predict(test_cosmo)
       hmf_vs_omega_m.append(10**pred[0])
   
   hmf_vs_omega_m = np.array(hmf_vs_omega_m)
   
   # Plot results
   plt.figure(figsize=(12, 6))
   
   # Select a few mass bins to plot
   mass_indices = [5, 10, 15, 20]
   
   for idx in mass_indices:
       mass = mass_bins[idx]
       plt.plot(omega_m_values, hmf_vs_omega_m[:, idx],
                'o-', linewidth=2, label=f'log(M) = {mass:.1f}')
   
   plt.xlabel(r'$\Omega_m$', fontsize=14)
   plt.ylabel(r'$\phi(M)$ [dex$^{-1}$ (Mpc/$h$)$^{-3}$]', fontsize=14)
   plt.title('HMF Dependence on Matter Density', fontsize=16)
   plt.legend(fontsize=11)
   plt.grid(True, alpha=0.3)
   plt.tight_layout()
   plt.savefig('hmf_vs_omega_m.png', dpi=150)

Advanced: Multi-Fidelity Emulation
-----------------------------------

Using Multiple Fidelities
~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   # Initialize multi-fidelity emulator
   hmf_emu_mf = emus.Hmf(
       data_dir=data_dir,
       y_log=True,
       fid='L2',
       multi_bin=True,          # Enable multi-bin mode
       logging_level='INFO'
   )
   
   print("Multi-fidelity emulator initialized")

The multi-fidelity approach combines:

1. **High-fidelity**: Large-box, high-resolution simulations (expensive)
2. **Low-fidelity**: Smaller-box or lower-resolution simulations (cheap)

This improves accuracy while reducing computational cost.

Best Practices
--------------

1. **Always use log-space** for HMF to handle the large dynamic range
2. **Check cross-validation** before using for science
3. **Monitor uncertainties** - large uncertainties indicate extrapolation
4. **Validate against known results** when possible
5. **Use appropriate mass range** - avoid extrapolating beyond training data

Common Issues
-------------

Issue: High Errors at Low Masses
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Cause**: Low number of halos in small mass bins leads to noisy measurements.

**Solution**: 
- Increase simulation volume
- Use wider mass bins
- Apply smoothing

Issue: Poor Extrapolation
~~~~~~~~~~~~~~~~~~~~~~~~~~

**Cause**: Testing cosmologies far from training set.

**Solution**:
- Check if test parameters are within training range
- Add more training simulations
- Use informative priors

Exercises
---------

1. **Redshift Dependence**: Modify the tutorial to study HMF at different redshifts.

2. **Parameter Sensitivity**: Create a 2D plot showing HMF sensitivity to both omega_m and sigma_8.

3. **Custom Validation**: Implement k-fold cross-validation instead of leave-one-out.

4. **Compare with Theory**: Compare emulator predictions with theoretical HMF (e.g., Tinker et al. 2008).

Next Steps
----------

- Move on to :doc:`galaxy_clustering` to convert HMF to observable quantities
- Explore :doc:`advanced_topics` for sophisticated GP techniques
- Check the notebooks in ``emu/notebooks/hmf_emu/`` for more examples