Batch Processing Tutorial

This tutorial shows how to compute molecular descriptors in batches with the ORCA Descriptors library, covering pandas integration, parallel execution, and parameterized descriptor definitions.

Installation

To use batch processing with pandas integration, install pandas as an optional dependency:

pip install 'orca-descriptors[pandas]'

Or install pandas separately:

pip install pandas

Basic Usage

The ORCABatchProcessing class provides efficient batch processing of molecular descriptors.

Creating a Batch Processor

Initialize a batch processor with your ORCA configuration:

from orca_descriptors import Orca, ORCABatchProcessing

# Create an Orca instance
orca = Orca(
    functional="PBE0",
    basis_set="def2-SVP",
    method_type="Opt",
    n_processors=4,
)

# Create a batch processor
batch_processing = ORCABatchProcessing(
    orca=orca,
    working_dir=".",
)

You can also create a batch processor without an existing Orca instance:

batch_processing = ORCABatchProcessing(
    functional="PBE0",
    basis_set="def2-SVP",
    method_type="Opt",
    n_processors=4,
)

Calculating Descriptors

Calculate descriptors for a list of SMILES strings:

smiles_list = ["C1=CC=CC=C1", "CCO", "CC(=O)C"]

# Calculate all available descriptors
result = batch_processing.calculate_descriptors(smiles_list)

# Result is a DataFrame if pandas is installed, otherwise a list of dictionaries
print(result)
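Because the return type depends on whether pandas is installed, downstream code may want a single shape to work with. The following is a minimal sketch of one way to normalize it; `as_records` is a hypothetical helper and not part of the library, and the sample values are made up for illustration.

```python
def as_records(result):
    """Return descriptor results as a list of dicts, one per molecule.

    `as_records` is a hypothetical helper, not part of orca_descriptors.
    """
    # pandas DataFrames expose to_dict; orient="records" yields one dict per row.
    if hasattr(result, "to_dict"):
        return result.to_dict(orient="records")
    # Otherwise assume the documented fallback: a list of dictionaries.
    return [dict(row) for row in result]


# Stand-in for the no-pandas return value (keys and values are placeholders).
fallback_result = [
    {"smiles": "CCO", "homo_energy": -7.1},
    {"smiles": "CC(=O)C", "homo_energy": None},
]
records = as_records(fallback_result)
```

The helper copies each row, so later changes to `records` leave the original result untouched.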

Working with Pandas

The batch processor accepts pandas DataFrames and Series as input and returns its results as a DataFrame.

DataFrame Input

Pass a DataFrame with a ‘smiles’ column:

import pandas as pd

df = pd.DataFrame({
    'smiles': ['C1=CC=CC=C1', 'CCO', 'CC(=O)C'],
    'name': ['Benzene', 'Ethanol', 'Acetone']
})

# Pass the whole DataFrame so that the original columns are preserved
df_result = batch_processing.calculate_descriptors(df)

# DataFrame contains original columns + descriptor columns
print(df_result.columns)
# Output: Index(['smiles', 'name', 'homo_energy', 'lumo_energy', ...])

Series Input

You can also pass a pandas Series directly:

smiles_series = pd.Series(['C1=CC=CC=C1', 'CCO', 'CC(=O)C'])

df_result = batch_processing.calculate_descriptors(smiles_series)

# Result is a DataFrame with descriptor columns only
print(df_result.head())
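Since a Series input returns descriptor columns only, any metadata has to be reattached afterwards. The sketch below shows one way to do that with plain pandas. It assumes the result has one row per input molecule in the input order, which the text above does not state explicitly, and it uses a hand-written stand-in DataFrame with placeholder values instead of a real calculation.

```python
import pandas as pd

df = pd.DataFrame({
    "smiles": ["C1=CC=CC=C1", "CCO", "CC(=O)C"],
    "name": ["Benzene", "Ethanol", "Acetone"],
})

# Stand-in for calculate_descriptors(df["smiles"]); the values are placeholders,
# not computed results.
df_descriptors = pd.DataFrame({
    "homo_energy": [-1.0, -2.0, -3.0],
    "lumo_energy": [1.0, 2.0, 3.0],
})

# Assumption: one output row per input molecule, in the input order.
assert len(df_descriptors) == len(df)

# Align on position rather than index labels, then join column-wise.
combined = pd.concat(
    [df.reset_index(drop=True), df_descriptors.reset_index(drop=True)],
    axis=1,
)
```

Resetting both indexes avoids misaligned rows when the input DataFrame was filtered or sorted before the calculation.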

List Input

Plain Python lists are also supported:

smiles_list = ['C1=CC=CC=C1', 'CCO', 'CC(=O)C']

result = batch_processing.calculate_descriptors(smiles_list)

# Result is a DataFrame if pandas is installed, otherwise a list of dictionaries

Defining Descriptors with XMolecule API

The XMolecule API allows you to define descriptors with their parameters using a special placeholder molecule. This is especially useful for descriptors that require parameters.

Creating an X Molecule

Create an X molecule using the x_molecule() method:

x = batch_processing.x_molecule()

Using X Molecule to Define Descriptors

Call descriptor methods on the Orca instance with the X molecule to define descriptors with parameters:

descriptors = [
    orca.ch_potential(x),
    orca.electronegativity(x),
    orca.abs_hardness(x),
    orca.topological_distance(x, 'O', 'O'),  # Distance between oxygen atoms
    orca.mo_energy(x, -3),  # HOMO-2 energy (index -3)
]

result = batch_processing.calculate_descriptors(
    smiles_list,
    descriptors=descriptors
)

# Result columns include parameterized descriptor names:
# 'ch_potential', 'electronegativity', 'abs_hardness',
# 'topological_distance_O_O', 'mo_energy_-3'
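To make the placeholder idea concrete, here is a toy illustration of how calling a descriptor method with a placeholder can record a request instead of running a calculation. This is a conceptual sketch only and not the library's implementation: `Placeholder`, `DescriptorSpec`, and `ToyCalculator` are hypothetical names, and the column naming simply mimics the names shown above.

```python
from dataclasses import dataclass


class Placeholder:
    """Sentinel standing in for 'whatever molecule is processed later'."""


@dataclass(frozen=True)
class DescriptorSpec:
    """Records which descriptor to compute and with which parameters."""
    name: str
    params: tuple = ()

    @property
    def column(self) -> str:
        # Join the name and parameters, e.g. 'mo_energy_-3'.
        return "_".join([self.name, *map(str, self.params)])


class ToyCalculator:
    """Hypothetical calculator illustrating deferred descriptor definitions."""

    def mo_energy(self, mol, index):
        if isinstance(mol, Placeholder):
            # Called with the placeholder: record the request instead of computing.
            return DescriptorSpec("mo_energy", (index,))
        raise NotImplementedError("a real calculation would run here")

    def topological_distance(self, mol, atom1, atom2):
        if isinstance(mol, Placeholder):
            return DescriptorSpec("topological_distance", (atom1, atom2))
        raise NotImplementedError("a real calculation would run here")


x = Placeholder()
calc = ToyCalculator()
specs = [calc.mo_energy(x, -3), calc.topological_distance(x, "O", "O")]
columns = [spec.column for spec in specs]
```

The benefit of this pattern is that parameters are checked by ordinary method signatures at definition time, before any molecule is processed.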

Complete Example

Here’s a complete example using the XMolecule API:

from orca_descriptors import Orca, ORCABatchProcessing
import pandas as pd

# Initialize
orca = Orca(functional="PBE0", basis_set="def2-SVP")
batch_processing = ORCABatchProcessing(orca=orca)

# Create X molecule
x = batch_processing.x_molecule()

# Define descriptors with parameters
descriptors = [
    orca.homo_energy(x),
    orca.lumo_energy(x),
    orca.gap_energy(x),
    orca.mo_energy(x, -1),  # HOMO
    orca.mo_energy(x, -2),  # HOMO-1
    orca.topological_distance(x, 'C', 'C'),  # C-C distances
    orca.topological_distance(x, 'O', 'O'),  # O-O distances
]

# Load your dataset
df = pd.read_csv('molecules.csv')

# Calculate descriptors
df_result = batch_processing.calculate_descriptors(
    df['smiles'],
    descriptors=descriptors
)

# Save results
df_result.to_csv('molecules_with_descriptors.csv', index=False)

Selecting Descriptors

You can specify which descriptors to calculate in two ways:

Method 1: Using Descriptor Names

Pass a list of descriptor names as strings:

selected_descriptors = [
    'homo_energy',
    'lumo_energy',
    'gap_energy',
    'dipole_moment',
    'molecular_volume'
]

result = batch_processing.calculate_descriptors(
    smiles_list,
    descriptors=selected_descriptors
)

Method 2: Using XMolecule API (Recommended)

Use the XMolecule API for descriptors with parameters:

x = batch_processing.x_molecule()

descriptors = [
    orca.homo_energy(x),
    orca.mo_energy(x, -3),  # With parameter
    orca.topological_distance(x, 'O', 'O'),  # With parameters
]

result = batch_processing.calculate_descriptors(
    smiles_list,
    descriptors=descriptors
)

Parallelization

The batch processor supports multiple parallelization modes for efficient processing of large datasets.

Sequential Processing

The default mode processes molecules one at a time:

batch_processing = ORCABatchProcessing(
    orca=orca,
    parallel_mode="sequential"
)

Multiprocessing

Use Python multiprocessing to run multiple ORCA calculations in parallel:

batch_processing = ORCABatchProcessing(
    orca=orca,
    parallel_mode="multiprocessing",
    n_workers=4  # Number of parallel workers
)

# Process molecules in parallel
result = batch_processing.calculate_descriptors(smiles_list)

The multiprocessing mode automatically adjusts time estimates based on the number of workers and parallel efficiency.
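The formula behind the adjusted estimate is not given here. One plausible reading, shown purely as an assumption and not taken from the library, divides the sequential estimate by the number of workers scaled by an efficiency factor between 0 and 1. The function name and the 80 % efficiency value below are illustrative.

```python
def estimate_remaining_seconds(remaining, avg_seconds, n_workers, efficiency=1.0):
    """Rough wall-clock estimate for `remaining` molecules.

    Assumed formula (not taken from the library):
        remaining * avg_seconds / (n_workers * efficiency)
    """
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return remaining * avg_seconds / (n_workers * efficiency)


# 10 molecules at 30 s each take 300 s sequentially and
# about 93.75 s with 4 workers at an assumed 80 % efficiency.
sequential = estimate_remaining_seconds(10, 30.0, n_workers=1)
parallel = estimate_remaining_seconds(10, 30.0, n_workers=4, efficiency=0.8)
```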

MPI (mpirun)

Use ORCA’s built-in MPI parallelization:

batch_processing = ORCABatchProcessing(
    orca=orca,
    parallel_mode="mpirun",
    use_mpirun=True,
    n_processors=4  # Processors per ORCA calculation
)
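When worker counts and per-calculation processors are combined, it helps to check the total against the machine's core count. The sketch below assumes that each worker runs one ORCA calculation using n_processors cores, which this tutorial does not confirm; `max_workers_for` is a hypothetical helper, not a library function.

```python
import os


def max_workers_for(n_processors, total_cores=None):
    """Largest worker count such that n_workers * n_processors <= total_cores."""
    if n_processors < 1:
        raise ValueError("n_processors must be at least 1")
    if total_cores is None:
        # os.cpu_count() can return None; fall back to a single core.
        total_cores = os.cpu_count() or 1
    # Always allow at least one worker, even if it oversubscribes a small machine.
    return max(1, total_cores // n_processors)


# On a 16-core machine with 4 processors per ORCA calculation,
# 4 workers use all cores without oversubscribing.
n_workers = max_workers_for(n_processors=4, total_cores=16)
```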

Progress Tracking

The batch processor provides detailed progress information.

Progress Output

By default, progress information is displayed:

result = batch_processing.calculate_descriptors(
    smiles_list,
    progress=True  # Default
)

# Output:
# INFO - Estimating calculation times...
# INFO - Processing molecule 1/10 (remaining: 10, estimated time: ~5m)
# INFO - Processing molecule 2/10 (remaining: 9, estimated time: ~4m 30s, avg: 30.5s/molecule)
# INFO - Processing molecule 3/10 (remaining: 8, CACHED, avg: 0.1s/molecule)

Cached Molecules

When a molecule is found in cache, the progress shows “CACHED” instead of time estimates:

# First calculation
result1 = batch_processing.calculate_descriptors(smiles_list)

# Second calculation (from cache)
result2 = batch_processing.calculate_descriptors(smiles_list)
# Output: INFO - Processing molecule 1/10 (remaining: 10, CACHED)

Disable Progress

To disable progress output:

result = batch_processing.calculate_descriptors(
    smiles_list,
    progress=False
)

Error Handling

The batch processor handles errors gracefully. If a descriptor calculation fails for a molecule, it sets that value to None and continues processing:

df = pd.DataFrame({
    'smiles': ['C1=CC=CC=C1', 'INVALID_SMILES', 'CCO']
})

result = batch_processing.calculate_descriptors(df['smiles'])

# Failed calculations are marked as None
print(result[result['homo_energy'].isna()])
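Checking a single column can miss molecules for which only some descriptors failed. The sketch below separates complete rows from rows with any missing descriptor, using plain pandas on a hand-written stand-in result. The placeholder values and the presence of a 'smiles' column in the result are assumptions made for illustration.

```python
import pandas as pd

# Stand-in for a calculate_descriptors result; values are placeholders.
result = pd.DataFrame({
    "smiles": ["C1=CC=CC=C1", "INVALID_SMILES", "CCO"],
    "homo_energy": [-1.0, None, -2.0],
    "lumo_energy": [1.0, None, None],
})

descriptor_cols = ["homo_energy", "lumo_energy"]

# A row counts as failed if any requested descriptor is missing.
failed_mask = result[descriptor_cols].isna().any(axis=1)
failed = result[failed_mask]
complete = result[~failed_mask]

# Count missing values per descriptor to see which calculations fail most.
missing_per_descriptor = result[descriptor_cols].isna().sum()
```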

Error messages are logged at different levels:

Brief error summary: logging.INFO
Detailed error information: logging.DEBUG
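To see the detailed error information, the log level has to be lowered to DEBUG. A minimal sketch using the standard logging module follows. The logger name "orca_descriptors" is an assumption based on the package name; check the package source for the name it actually uses.

```python
import logging

# Show INFO and above on the root logger, in a "LEVEL - message" layout
# similar to the progress output shown earlier.
logging.basicConfig(level=logging.INFO, format="%(levelname)s - %(message)s")

# Assumption: the library logs under a logger named "orca_descriptors".
library_logger = logging.getLogger("orca_descriptors")

# Lower the threshold for this logger only to see detailed error information.
library_logger.setLevel(logging.DEBUG)
```

Setting the level on the named logger rather than the root logger keeps DEBUG output from other packages out of the console.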

Example: QSAR Dataset Preparation

Here’s a complete example of preparing a QSAR dataset:

import pandas as pd
from orca_descriptors import Orca, ORCABatchProcessing

# Initialize calculator
orca = Orca(
    functional="PBE0",
    basis_set="def2-SVP",
    method_type="Opt",
    n_processors=4,
)

# Create batch processor with multiprocessing
batch_processing = ORCABatchProcessing(
    orca=orca,
    parallel_mode="multiprocessing",
    n_workers=4,
)

# Create X molecule for descriptor definition
x = batch_processing.x_molecule()

# Define descriptors
descriptors = [
    orca.homo_energy(x),
    orca.lumo_energy(x),
    orca.gap_energy(x),
    orca.ch_potential(x),
    orca.electronegativity(x),
    orca.abs_hardness(x),
    orca.dipole_moment(x),
    orca.molecular_volume(x),
]

# Load your molecular dataset
df = pd.read_csv('molecules.csv')  # Contains 'smiles' column

# Calculate descriptors
df_descriptors = batch_processing.calculate_descriptors(
    smiles_column=df['smiles'],
    descriptors=descriptors
)

# Filter molecules based on descriptors
active_molecules = df_descriptors[df_descriptors['gap_energy'] < 3.0]

# Save results
df_descriptors.to_csv('molecules_with_descriptors.csv', index=False)

# Statistical analysis
print(df_descriptors[['homo_energy', 'lumo_energy', 'gap_energy']].describe())

Example: Batch Processing with Custom Parameters

Process a large dataset in fixed-size batches and combine the results:

# Assumes df, orca, and batch_processing from the previous example

# Define descriptors once; they do not change between batches
x = batch_processing.x_molecule()
descriptors = [
    orca.homo_energy(x),
    orca.lumo_energy(x),
    orca.gap_energy(x),
]

# Process in batches
batch_size = 10
all_results = []

for i in range(0, len(df), batch_size):
    batch = df.iloc[i:i + batch_size]

    batch_result = batch_processing.calculate_descriptors(
        smiles_column=batch['smiles'],
        descriptors=descriptors
    )
    all_results.append(batch_result)

# Combine results
final_df = pd.concat(all_results, ignore_index=True)
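The slicing pattern used in the loop can be factored into a small generator, which also makes it easy to verify that the final, shorter batch is not dropped. This is a generic sketch; `iter_batches` is a hypothetical helper, not part of the library.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


smiles_list = ["C1=CC=CC=C1", "CCO", "CC(=O)C", "CO", "CCN"]

# Five molecules in batches of two give batch sizes 2, 2, and 1.
batches = list(iter_batches(smiles_list, batch_size=2))
```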

Tips and Best Practices

Caching: The library automatically caches calculation results. Recalculating descriptors for the same molecules uses cached results, significantly speeding up subsequent runs.
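Multiprocessing: For large datasets, use parallel_mode="multiprocessing" with an appropriate number of workers. A common starting point is the number of CPU cores; if each ORCA calculation itself uses several processors (n_processors), keep n_workers × n_processors within the available core count to avoid oversubscription.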
Descriptor Selection: Use the XMolecule API to define descriptors with parameters, making your code more readable and maintainable.
Progress Monitoring: Keep progress=True (default) to monitor long-running calculations and identify cached molecules.
Data Validation: Check for None values in the result DataFrame to identify molecules that failed descriptor calculation.
Memory Management: For very large datasets, process molecules in batches to manage memory usage.

Available Descriptors

See the Descriptors Reference documentation for a complete list of available descriptors and their parameters.