Batch Processing Tutorial
===========================

This tutorial demonstrates how to use the ORCA Descriptors library for efficient batch processing of molecular descriptors with support for pandas integration, parallelization, and advanced descriptor definition.

Installation
-------------

To use batch processing with pandas integration, install pandas as an optional dependency::

   pip install 'orca-descriptors[pandas]'

Or install pandas separately::

   pip install pandas

Basic Usage
-----------

The ``ORCABatchProcessing`` class provides efficient batch processing of molecular descriptors.

Creating a Batch Processor
~~~~~~~~~~~~~~~~~~~~~~~~~~

Initialize a batch processor with your ORCA configuration::

   from orca_descriptors import Orca, ORCABatchProcessing

   # Create an Orca instance
   orca = Orca(
       functional="PBE0",
       basis_set="def2-SVP",
       method_type="Opt",
       n_processors=4,
   )

   # Create a batch processor
   batch_processing = ORCABatchProcessing(
       orca=orca,
       working_dir=".",
   )

You can also create a batch processor without an existing Orca instance::

   batch_processing = ORCABatchProcessing(
       functional="PBE0",
       basis_set="def2-SVP",
       method_type="Opt",
       n_processors=4,
   )

Calculating Descriptors
~~~~~~~~~~~~~~~~~~~~~~~

Calculate descriptors for a list of SMILES strings::

   smiles_list = ["C1=CC=CC=C1", "CCO", "CC(=O)C"]
   
   # Calculate all available descriptors
   result = batch_processing.calculate_descriptors(smiles_list)
   
   # Result is a DataFrame (if pandas available) or list of dictionaries
   print(result)

Working with Pandas
-------------------

The batch processor seamlessly integrates with pandas DataFrames and Series.

DataFrame Input
~~~~~~~~~~~~~~~

Pass a DataFrame with a 'smiles' column::

   import pandas as pd
   
   df = pd.DataFrame({
       'smiles': ['C1=CC=CC=C1', 'CCO', 'CC(=O)C'],
       'name': ['Benzene', 'Ethanol', 'Acetone']
   })
   
   # Calculate descriptors - original columns are preserved
   df_result = batch_processing.calculate_descriptors(df['smiles'])
   
   # DataFrame contains original columns + descriptor columns
   print(df_result.columns)
   # Output: Index(['smiles', 'name', 'homo_energy', 'lumo_energy', ...])

Series Input
~~~~~~~~~~~~

You can also pass a pandas Series directly::

   smiles_series = pd.Series(['C1=CC=CC=C1', 'CCO', 'CC(=O)C'])
   
   df_result = batch_processing.calculate_descriptors(smiles_series)
   
   # Result is a DataFrame with descriptor columns only
   print(df_result.head())

List Input
~~~~~~~~~~

Plain Python lists are also supported::

   smiles_list = ['C1=CC=CC=C1', 'CCO', 'CC(=O)C']
   
   result = batch_processing.calculate_descriptors(smiles_list)
   
   # Result is a DataFrame (if pandas available) or list of dictionaries

Defining Descriptors with XMolecule API
----------------------------------------

The XMolecule API allows you to define descriptors with their parameters using a special placeholder molecule. This is especially useful for descriptors that require parameters.

Creating an X Molecule
~~~~~~~~~~~~~~~~~~~~~~~

Create an X molecule using the ``x_molecule()`` method::

   x = batch_processing.x_molecule()

Using X Molecule to Define Descriptors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Call descriptor methods on the Orca instance with the X molecule to define descriptors with parameters::

   descriptors = [
       orca.ch_potential(x),
       orca.electronegativity(x),
       orca.abs_hardness(x),
       orca.topological_distance(x, 'O', 'O'),  # Distance between oxygen atoms
       orca.mo_energy(x, -3),  # HOMO-2 energy (index -3)
   ]
   
   result = batch_processing.calculate_descriptors(
       smiles_list,
       descriptors=descriptors
   )
   
   # Result columns include parameterized descriptor names:
   # 'ch_potential', 'electronegativity', 'abs_hardness', 
   # 'topological_distance_O_O', 'mo_energy_-3'

Complete Example
~~~~~~~~~~~~~~~~

Here's a complete example using the XMolecule API::

   from orca_descriptors import Orca, ORCABatchProcessing
   import pandas as pd
   
   # Initialize
   orca = Orca(functional="PBE0", basis_set="def2-SVP")
   batch_processing = ORCABatchProcessing(orca=orca)
   
   # Create X molecule
   x = batch_processing.x_molecule()
   
   # Define descriptors with parameters
   descriptors = [
       orca.homo_energy(x),
       orca.lumo_energy(x),
       orca.gap_energy(x),
       orca.mo_energy(x, -1),  # HOMO
       orca.mo_energy(x, -2),  # HOMO-1
       orca.topological_distance(x, 'C', 'C'),  # C-C distances
       orca.topological_distance(x, 'O', 'O'),  # O-O distances
   ]
   
   # Load your dataset
   df = pd.read_csv('molecules.csv')
   
   # Calculate descriptors
   df_result = batch_processing.calculate_descriptors(
       df['smiles'],
       descriptors=descriptors
   )
   
   # Save results
   df_result.to_csv('molecules_with_descriptors.csv', index=False)

Selecting Descriptors
---------------------

You can specify which descriptors to calculate in two ways:

Method 1: Using Descriptor Names
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Pass a list of descriptor names as strings::

   selected_descriptors = [
       'homo_energy',
       'lumo_energy',
       'gap_energy',
       'dipole_moment',
       'molecular_volume'
   ]
   
   result = batch_processing.calculate_descriptors(
       smiles_list,
       descriptors=selected_descriptors
   )

Method 2: Using XMolecule API (Recommended)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Use the XMolecule API for descriptors with parameters::

   x = batch_processing.x_molecule()
   
   descriptors = [
       orca.homo_energy(x),
       orca.mo_energy(x, -3),  # With parameter
       orca.topological_distance(x, 'O', 'O'),  # With parameters
   ]
   
   result = batch_processing.calculate_descriptors(
       smiles_list,
       descriptors=descriptors
   )

Parallelization
---------------

The batch processor supports multiple parallelization modes for efficient processing of large datasets.

Sequential Processing
~~~~~~~~~~~~~~~~~~~~~

Default mode - processes molecules one by one::

   batch_processing = ORCABatchProcessing(
       orca=orca,
       parallel_mode="sequential"
   )

Multiprocessing
~~~~~~~~~~~~~~~

Use Python multiprocessing to run multiple ORCA calculations in parallel::

   batch_processing = ORCABatchProcessing(
       orca=orca,
       parallel_mode="multiprocessing",
       n_workers=4  # Number of parallel workers
   )
   
   # Process molecules in parallel
   result = batch_processing.calculate_descriptors(smiles_list)

The multiprocessing mode automatically adjusts time estimates based on the number of workers and parallel efficiency.

MPI (mpirun)
~~~~~~~~~~~~

Use ORCA's built-in MPI parallelization::

   batch_processing = ORCABatchProcessing(
       orca=orca,
       parallel_mode="mpirun",
       use_mpirun=True,
       n_processors=4  # Processors per ORCA calculation
   )

Progress Tracking
-----------------

The batch processor provides detailed progress information.

Progress Output
~~~~~~~~~~~~~~~

By default, progress information is displayed::

   result = batch_processing.calculate_descriptors(
       smiles_list,
       progress=True  # Default
   )
   
   # Output:
   # INFO - Estimating calculation times...
   # INFO - Processing molecule 1/10 (remaining: 10, estimated time: ~5m)
   # INFO - Processing molecule 2/10 (remaining: 9, estimated time: ~4m 30s, avg: 30.5s/molecule)
   # INFO - Processing molecule 3/10 (remaining: 8, CACHED, avg: 0.1s/molecule)

Cached Molecules
~~~~~~~~~~~~~~~~

When a molecule is found in cache, the progress shows "CACHED" instead of time estimates::

   # First calculation
   result1 = batch_processing.calculate_descriptors(smiles_list)
   
   # Second calculation (from cache)
   result2 = batch_processing.calculate_descriptors(smiles_list)
   # Output: INFO - Processing molecule 1/10 (remaining: 10, CACHED)

Disable Progress
~~~~~~~~~~~~~~~~

To disable progress output::

   result = batch_processing.calculate_descriptors(
       smiles_list,
       progress=False
   )

Error Handling
--------------

The batch processor handles errors gracefully. If a descriptor calculation fails for a molecule, it sets that value to ``None`` and continues processing::

   df = pd.DataFrame({
       'smiles': ['C1=CC=CC=C1', 'INVALID_SMILES', 'CCO']
   })
   
   result = batch_processing.calculate_descriptors(df['smiles'])
   
   # Failed calculations are marked as None
   print(result[result['homo_energy'].isna()])

Error messages are logged at different levels:

- Brief error summary: ``logging.INFO``
- Detailed error information: ``logging.DEBUG``

Example: QSAR Dataset Preparation
----------------------------------

Here's a complete example of preparing a QSAR dataset::

   import pandas as pd
   from orca_descriptors import Orca, ORCABatchProcessing
   
   # Initialize calculator
   orca = Orca(
       functional="PBE0",
       basis_set="def2-SVP",
       method_type="Opt",
       n_processors=4,
   )
   
   # Create batch processor with multiprocessing
   batch_processing = ORCABatchProcessing(
       orca=orca,
       parallel_mode="multiprocessing",
       n_workers=4,
   )
   
   # Create X molecule for descriptor definition
   x = batch_processing.x_molecule()
   
   # Define descriptors
   descriptors = [
       orca.homo_energy(x),
       orca.lumo_energy(x),
       orca.gap_energy(x),
       orca.ch_potential(x),
       orca.electronegativity(x),
       orca.abs_hardness(x),
       orca.dipole_moment(x),
       orca.molecular_volume(x),
   ]
   
   # Load your molecular dataset
   df = pd.read_csv('molecules.csv')  # Contains 'smiles' column
   
   # Calculate descriptors
   df_descriptors = batch_processing.calculate_descriptors(
       smiles_column=df['smiles'],
       descriptors=descriptors
   )
   
   # Filter molecules based on descriptors
   active_molecules = df_descriptors[df_descriptors['gap_energy'] < 3.0]
   
   # Save results
   df_descriptors.to_csv('molecules_with_descriptors.csv', index=False)
   
   # Statistical analysis
   print(df_descriptors[['homo_energy', 'lumo_energy', 'gap_energy']].describe())

Example: Batch Processing with Custom Parameters
-------------------------------------------------

Process molecules in batches with different configurations::

   # Process in batches
   batch_size = 10
   all_results = []
   
   for i in range(0, len(df), batch_size):
       batch = df.iloc[i:i+batch_size]
       
       x = batch_processing.x_molecule()
       descriptors = [
           orca.homo_energy(x),
           orca.lumo_energy(x),
           orca.gap_energy(x)
       ]
       
       batch_result = batch_processing.calculate_descriptors(
           smiles_column=batch['smiles'],
           descriptors=descriptors
       )
       all_results.append(batch_result)
   
   # Combine results
   final_df = pd.concat(all_results, ignore_index=True)

Tips and Best Practices
-----------------------

1. **Caching**: The library automatically caches calculation results. Recalculating descriptors for the same molecules uses cached results, significantly speeding up subsequent runs.

2. **Multiprocessing**: For large datasets, use ``parallel_mode="multiprocessing"`` with an appropriate number of workers (typically equal to CPU cores).

3. **Descriptor Selection**: Use the XMolecule API to define descriptors with parameters, making your code more readable and maintainable.

4. **Progress Monitoring**: Keep ``progress=True`` (default) to monitor long-running calculations and identify cached molecules.

5. **Data Validation**: Check for ``None`` values in the result DataFrame to identify molecules that failed descriptor calculation.

6. **Memory Management**: For very large datasets, process molecules in batches to manage memory usage.

Available Descriptors
---------------------

See the :doc:`descriptors` documentation for a complete list of available descriptors and their parameters.