Batch Processing Tutorial

This tutorial shows how to use the ORCA Descriptors library to calculate molecular descriptors in batches. It covers pandas integration, parallelization, and parameterized descriptor definitions.

Installation

To use batch processing with pandas integration, install pandas as an optional dependency:

pip install 'orca-descriptors[pandas]'

Or install pandas separately:

pip install pandas

Basic Usage

The ORCABatchProcessing class provides efficient batch processing of molecular descriptors.

Creating a Batch Processor

Initialize a batch processor with your ORCA configuration:

from orca_descriptors import Orca, ORCABatchProcessing

# Create an Orca instance
orca = Orca(
    functional="PBE0",
    basis_set="def2-SVP",
    method_type="Opt",
    n_processors=4,
)

# Create a batch processor
batch_processing = ORCABatchProcessing(
    orca=orca,
    working_dir=".",
)

You can also create a batch processor without an existing Orca instance:

batch_processing = ORCABatchProcessing(
    functional="PBE0",
    basis_set="def2-SVP",
    method_type="Opt",
    n_processors=4,
)

Calculating Descriptors

Calculate descriptors for a list of SMILES strings:

smiles_list = ["C1=CC=CC=C1", "CCO", "CC(=O)C"]

# Calculate all available descriptors
result = batch_processing.calculate_descriptors(smiles_list)

# Result is a DataFrame if pandas is installed, otherwise a list of dictionaries
print(result)
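
Because the return type depends on whether pandas is installed, downstream code may want a single shape to work with. The following is a minimal sketch that assumes only what is stated above (the result is either a DataFrame or a list of dictionaries). The helper name to_records is hypothetical and not part of the library.

```python
from typing import Any, Dict, List


def to_records(result: Any) -> List[Dict[str, Any]]:
    """Return descriptor results as a list of plain dictionaries.

    Hypothetical helper: accepts either a pandas DataFrame or a list of
    dictionaries, the two result shapes described in this tutorial.
    """
    # A DataFrame exposes to_dict(orient="records"); a plain list does not.
    if hasattr(result, "to_dict"):
        return result.to_dict(orient="records")
    # Copy each row so callers can modify the records freely.
    return [dict(row) for row in result]


# Stand-in data for illustration only (not real calculation output).
example = [{"smiles": "CCO", "homo_energy": None}]
records = to_records(example)
```

With this in place, the rest of a script can iterate over records without checking which form calculate_descriptors returned.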

Working with Pandas

The batch processor accepts pandas DataFrames and Series as input.

DataFrame Input

Pass a DataFrame with a ‘smiles’ column:

import pandas as pd

df = pd.DataFrame({
    'smiles': ['C1=CC=CC=C1', 'CCO', 'CC(=O)C'],
    'name': ['Benzene', 'Ethanol', 'Acetone']
})

# Calculate descriptors - pass the whole DataFrame so original columns are preserved
df_result = batch_processing.calculate_descriptors(df)

# DataFrame contains original columns + descriptor columns
print(df_result.columns)
# Output: Index(['smiles', 'name', 'homo_energy', 'lumo_energy', ...])

Series Input

You can also pass a pandas Series directly:

smiles_series = pd.Series(['C1=CC=CC=C1', 'CCO', 'CC(=O)C'])

df_result = batch_processing.calculate_descriptors(smiles_series)

# Result is a DataFrame with descriptor columns only
print(df_result.head())

List Input

Plain Python lists are also supported:

smiles_list = ['C1=CC=CC=C1', 'CCO', 'CC(=O)C']

result = batch_processing.calculate_descriptors(smiles_list)

# Result is a DataFrame if pandas is installed, otherwise a list of dictionaries

Defining Descriptors with XMolecule API

The XMolecule API allows you to define descriptors with their parameters using a special placeholder molecule. This is especially useful for descriptors that require parameters.

Creating an X Molecule

Create an X molecule using the x_molecule() method:

x = batch_processing.x_molecule()

Using X Molecule to Define Descriptors

Call descriptor methods on the Orca instance with the X molecule to define descriptors with parameters:

descriptors = [
    orca.ch_potential(x),
    orca.electronegativity(x),
    orca.abs_hardness(x),
    orca.topological_distance(x, 'O', 'O'),  # Distance between oxygen atoms
    orca.mo_energy(x, -3),  # HOMO-2 energy (index -3)
]

result = batch_processing.calculate_descriptors(
    smiles_list,
    descriptors=descriptors
)

# Result columns include parameterized descriptor names:
# 'ch_potential', 'electronegativity', 'abs_hardness',
# 'topological_distance_O_O', 'mo_energy_-3'
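
The column names listed above follow a visible pattern: the descriptor name, followed by each parameter, joined with underscores. The sketch below reproduces that pattern so you can predict a column name before running a calculation. It illustrates the naming seen in this tutorial and is not the library's own code. The function column_name is hypothetical.

```python
def column_name(descriptor: str, *params: object) -> str:
    """Join a descriptor name and its parameters with underscores.

    Hypothetical helper mirroring the column names shown in the tutorial.
    """
    return "_".join([descriptor, *(str(p) for p in params)])


names = [
    column_name("ch_potential"),
    column_name("topological_distance", "O", "O"),
    column_name("mo_energy", -3),
]
```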

Complete Example

Here’s a complete example using the XMolecule API:

from orca_descriptors import Orca, ORCABatchProcessing
import pandas as pd

# Initialize
orca = Orca(functional="PBE0", basis_set="def2-SVP")
batch_processing = ORCABatchProcessing(orca=orca)

# Create X molecule
x = batch_processing.x_molecule()

# Define descriptors with parameters
descriptors = [
    orca.homo_energy(x),
    orca.lumo_energy(x),
    orca.gap_energy(x),
    orca.mo_energy(x, -1),  # HOMO
    orca.mo_energy(x, -2),  # HOMO-1
    orca.topological_distance(x, 'C', 'C'),  # C-C distances
    orca.topological_distance(x, 'O', 'O'),  # O-O distances
]

# Load your dataset
df = pd.read_csv('molecules.csv')

# Calculate descriptors
df_result = batch_processing.calculate_descriptors(
    df['smiles'],
    descriptors=descriptors
)

# Save results
df_result.to_csv('molecules_with_descriptors.csv', index=False)

Selecting Descriptors

You can specify which descriptors to calculate in two ways: by name, as shown in Method 1 below, or with the XMolecule API described in the previous section.

Method 1: Using Descriptor Names

Pass a list of descriptor names as strings:

selected_descriptors = [
    'homo_energy',
    'lumo_energy',
    'gap_energy',
    'dipole_moment',
    'molecular_volume'
]

result = batch_processing.calculate_descriptors(
    smiles_list,
    descriptors=selected_descriptors
)

Parallelization

The batch processor supports multiple parallelization modes for efficient processing of large datasets.

Sequential Processing

Default mode - processes molecules one by one:

batch_processing = ORCABatchProcessing(
    orca=orca,
    parallel_mode="sequential"
)

Multiprocessing

Use Python multiprocessing to run multiple ORCA calculations in parallel:

batch_processing = ORCABatchProcessing(
    orca=orca,
    parallel_mode="multiprocessing",
    n_workers=4  # Number of parallel workers
)

# Process molecules in parallel
result = batch_processing.calculate_descriptors(smiles_list)

The multiprocessing mode automatically adjusts time estimates based on the number of workers and parallel efficiency.
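
The tutorial does not give the formula used for these estimates. As a rough mental model, assume the remaining work is divided across the workers and scaled by a parallel efficiency between 0 and 1. The sketch below implements that assumed model. The function name and its parameters are hypothetical.

```python
def estimate_remaining_seconds(
    n_remaining: int,
    avg_seconds_per_molecule: float,
    n_workers: int = 1,
    efficiency: float = 1.0,
) -> float:
    """Estimate wall-clock time left under a simple assumed model.

    Assumption: work divides evenly across workers, and `efficiency`
    (0 < efficiency <= 1) accounts for parallel overhead.
    """
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the interval (0, 1]")
    sequential_seconds = n_remaining * avg_seconds_per_molecule
    return sequential_seconds / (n_workers * efficiency)


sequential = estimate_remaining_seconds(10, 30.0)
parallel = estimate_remaining_seconds(8, 30.0, n_workers=4, efficiency=0.5)
```

Under this model, four workers at 50% efficiency behave like two ideal workers, which is why adding workers does not always shorten a run proportionally.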

MPI (mpirun)

Use ORCA’s built-in MPI parallelization:

batch_processing = ORCABatchProcessing(
    orca=orca,
    parallel_mode="mpirun",
    use_mpirun=True,
    n_processors=4  # Processors per ORCA calculation
)

Progress Tracking

The batch processor provides detailed progress information.

Progress Output

By default, progress information is displayed:

result = batch_processing.calculate_descriptors(
    smiles_list,
    progress=True  # Default
)

# Output:
# INFO - Estimating calculation times...
# INFO - Processing molecule 1/10 (remaining: 10, estimated time: ~5m)
# INFO - Processing molecule 2/10 (remaining: 9, estimated time: ~4m 30s, avg: 30.5s/molecule)
# INFO - Processing molecule 3/10 (remaining: 8, CACHED, avg: 0.1s/molecule)

Cached Molecules

When a molecule is found in cache, the progress shows “CACHED” instead of time estimates:

# First calculation
result1 = batch_processing.calculate_descriptors(smiles_list)

# Second calculation (from cache)
result2 = batch_processing.calculate_descriptors(smiles_list)
# Output: INFO - Processing molecule 1/10 (remaining: 10, CACHED)
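
To see why the second run is fast, it helps to picture a cache keyed by the molecule and the calculation settings. The library's actual cache key and storage format are not described here, so the following is only a conceptual sketch. The names cache_key and DescriptorCache are hypothetical, as is the choice of fields in the key.

```python
import hashlib
from typing import Callable, Dict


def cache_key(smiles: str, functional: str, basis_set: str, method_type: str) -> str:
    """Build a stable key from the molecule and settings (assumed fields)."""
    raw = "|".join([smiles, functional, basis_set, method_type])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class DescriptorCache:
    """In-memory cache that runs the expensive step only on a miss."""

    def __init__(self) -> None:
        self._store: Dict[str, dict] = {}
        self.misses = 0

    def get_or_compute(self, key: str, compute: Callable[[], dict]) -> dict:
        if key not in self._store:
            self.misses += 1  # the expensive calculation runs here
            self._store[key] = compute()
        return self._store[key]


cache = DescriptorCache()
key = cache_key("CCO", "PBE0", "def2-SVP", "Opt")
# The lambda stands in for an ORCA calculation; the value is a placeholder.
first = cache.get_or_compute(key, lambda: {"homo_energy": 0.0})
second = cache.get_or_compute(key, lambda: {"homo_energy": 0.0})
```

One consequence of keying on the settings is that changing the functional or basis set would produce a new key and trigger a fresh calculation.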

Disable Progress

To disable progress output:

result = batch_processing.calculate_descriptors(
    smiles_list,
    progress=False
)

Error Handling

The batch processor handles errors gracefully. If a descriptor calculation fails for a molecule, it sets that value to None and continues processing:

df = pd.DataFrame({
    'smiles': ['C1=CC=CC=C1', 'INVALID_SMILES', 'CCO']
})

result = batch_processing.calculate_descriptors(df['smiles'])

# Failed calculations are marked as None; isna() selects those rows
print(result[result['homo_energy'].isna()])

Error messages are logged at different levels:

  • Brief error summary: logging.INFO

  • Detailed error information: logging.DEBUG
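
The behavior described above (record None, log a short message at INFO, log details at DEBUG, then continue) can be sketched in plain Python. This illustrates the pattern and is not the library's implementation. The logger name, safe_calculate, and fake_calculate are all hypothetical.

```python
import logging
from typing import Callable, Dict, List, Optional

logger = logging.getLogger("batch_example")  # hypothetical logger name


def safe_calculate(
    smiles_list: List[str],
    calculate: Callable[[str], float],
) -> List[Dict[str, Optional[object]]]:
    """Apply `calculate` to each SMILES, storing None when it fails."""
    rows = []
    for smiles in smiles_list:
        try:
            value: Optional[float] = calculate(smiles)
        except Exception as exc:
            # Brief summary at INFO, full traceback at DEBUG.
            logger.info("Descriptor failed for %s: %s", smiles, exc)
            logger.debug("Details for %s", smiles, exc_info=True)
            value = None
        rows.append({"smiles": smiles, "homo_energy": value})
    return rows


def fake_calculate(smiles: str) -> float:
    """Stand-in for a real calculation; returns a placeholder value."""
    if smiles == "INVALID_SMILES":
        raise ValueError("could not parse SMILES")
    return 0.0


rows = safe_calculate(["C1=CC=CC=C1", "INVALID_SMILES", "CCO"], fake_calculate)
failed = [row["smiles"] for row in rows if row["homo_energy"] is None]
```

To see the detailed messages from any logger during debugging, the standard call logging.basicConfig(level=logging.DEBUG) raises the verbosity for the whole process.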

Example: QSAR Dataset Preparation

Here’s a complete example of preparing a QSAR dataset:

import pandas as pd
from orca_descriptors import Orca, ORCABatchProcessing

# Initialize calculator
orca = Orca(
    functional="PBE0",
    basis_set="def2-SVP",
    method_type="Opt",
    n_processors=4,
)

# Create batch processor with multiprocessing
batch_processing = ORCABatchProcessing(
    orca=orca,
    parallel_mode="multiprocessing",
    n_workers=4,
)

# Create X molecule for descriptor definition
x = batch_processing.x_molecule()

# Define descriptors
descriptors = [
    orca.homo_energy(x),
    orca.lumo_energy(x),
    orca.gap_energy(x),
    orca.ch_potential(x),
    orca.electronegativity(x),
    orca.abs_hardness(x),
    orca.dipole_moment(x),
    orca.molecular_volume(x),
]

# Load your molecular dataset
df = pd.read_csv('molecules.csv')  # Contains 'smiles' column

# Calculate descriptors
df_descriptors = batch_processing.calculate_descriptors(
    smiles_column=df['smiles'],
    descriptors=descriptors
)

# Filter molecules based on descriptors
active_molecules = df_descriptors[df_descriptors['gap_energy'] < 3.0]

# Save results
df_descriptors.to_csv('molecules_with_descriptors.csv', index=False)

# Statistical analysis
print(df_descriptors[['homo_energy', 'lumo_energy', 'gap_energy']].describe())

Example: Batch Processing with Custom Parameters

Process molecules in fixed-size batches and combine the results. The descriptor list is defined inside the loop, so it can be varied per batch if needed:

# Process in batches
batch_size = 10
all_results = []

for i in range(0, len(df), batch_size):
    batch = df.iloc[i:i+batch_size]

    x = batch_processing.x_molecule()
    descriptors = [
        orca.homo_energy(x),
        orca.lumo_energy(x),
        orca.gap_energy(x)
    ]

    batch_result = batch_processing.calculate_descriptors(
        smiles_column=batch['smiles'],
        descriptors=descriptors
    )
    all_results.append(batch_result)

# Combine results
final_df = pd.concat(all_results, ignore_index=True)
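
When pandas is not installed and the input is a plain list of SMILES strings, the same batching idea can be sketched with a small generator. The helper chunked is hypothetical and uses only the standard library.

```python
from typing import Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


smiles = ["C1=CC=CC=C1", "CCO", "CC(=O)C", "CO", "CC"]
batches = list(chunked(smiles, 2))
```

Each batch could then be passed to calculate_descriptors in turn, with the resulting lists of dictionaries extended into one combined list.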

Tips and Best Practices

  1. Caching: The library automatically caches calculation results. Recalculating descriptors for the same molecules uses cached results, significantly speeding up subsequent runs.

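  2. Multiprocessing: For large datasets, use parallel_mode="multiprocessing" with an appropriate number of workers (typically equal to the number of CPU cores). If each ORCA calculation also uses several processors (n_processors), keep n_workers × n_processors within the number of available cores to avoid oversubscription.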

  3. Descriptor Selection: Use the XMolecule API to define descriptors with parameters, making your code more readable and maintainable.

  4. Progress Monitoring: Keep progress=True (default) to monitor long-running calculations and identify cached molecules.

  5. Data Validation: Check for None values in the result DataFrame to identify molecules that failed descriptor calculation.

  6. Memory Management: For very large datasets, process molecules in batches to manage memory usage.
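
For tip 2, one way to pick a worker count is to divide the available cores by the processors each calculation uses. This sketch assumes every worker launches one ORCA job that uses n_processors cores, which the tutorial does not state explicitly. The function suggest_n_workers is hypothetical.

```python
import os
from typing import Optional


def suggest_n_workers(
    n_processors_per_job: int,
    total_cores: Optional[int] = None,
) -> int:
    """Suggest a worker count that avoids oversubscribing the CPU.

    Assumption: each worker runs one job using `n_processors_per_job` cores.
    """
    if n_processors_per_job < 1:
        raise ValueError("n_processors_per_job must be at least 1")
    if total_cores is None:
        # os.cpu_count() can return None on some platforms.
        total_cores = os.cpu_count() or 1
    return max(1, total_cores // n_processors_per_job)


workers_on_16_cores = suggest_n_workers(4, total_cores=16)
workers_on_2_cores = suggest_n_workers(4, total_cores=2)
```

The result is a starting point. Memory per calculation and disk throughput can also limit how many jobs run well side by side.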

Available Descriptors

See the Descriptors Reference documentation for a complete list of available descriptors and their parameters.