Skip to content

feat: Add Opt-in Formula Reading Support #137

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

PSU3D0
Copy link

@PSU3D0 PSU3D0 commented Jul 7, 2025

Overview

This PR introduces optional formula reading capabilities to python-calamine, allowing users to access the underlying formulas in spreadsheet cells rather than just their calculated values.

Motivation

Previously, python-calamine only provided access to cell values (the calculated results of formulas). Users had no way to access the actual formula expressions, which is essential for:

  • Spreadsheet analysis and auditing
  • Formula debugging and validation
  • Converting between different spreadsheet formats while preserving formulas
  • Building tools that need to understand the calculation logic

Implementation

Core Changes

  1. Opt-in Design: Formula reading is disabled by default (read_formulas=False) to maintain backward compatibility and avoid performance overhead for users who don't need formulas.

  2. New API Parameter: Added read_formulas parameter to all workbook creation methods:

    • CalamineWorkbook.from_object(path_or_filelike, read_formulas=False)
    • CalamineWorkbook.from_path(path, read_formulas=False)
    • CalamineWorkbook.from_filelike(filelike, read_formulas=False)
    • load_workbook(path_or_filelike, read_formulas=False)
  3. Formula Iterator: Added CalamineSheet.iter_formulas() method that returns a CalamineFormulaIterator with consistent dimensions matching the data iterator.

  4. Performance Optimization: Implemented on-demand coordinate mapping instead of pre-allocating expanded ranges, ensuring minimal memory overhead.

Technical Details

  • Coordinate Consistency: Formula iterator returns the same dimensions as data iterator, with empty strings for cells without formulas
  • Graceful Error Handling: Clear error message when attempting to access formulas without enabling the feature
  • Memory Efficient: Uses get_value() lookups rather than expanding formula ranges upfront
  • Iterator Properties: Both CalamineCellIterator and CalamineFormulaIterator expose position, start, width, and height properties for coordinate recovery

File Format Support

Supports formula reading across all major spreadsheet formats:

  • Excel (.xlsx, .xls, .xlsb)
  • OpenDocument (.ods)

Formula syntax varies by format (e.g., ODS uses of:=SUM([.A1:.B1]) vs Excel's SUM(A1:B1)).

API Examples

# Default behavior - formulas disabled
wb = load_workbook("spreadsheet.xlsx")
sheet = wb.get_sheet_by_name("Sheet1")
# sheet.iter_formulas()  # Raises ValueError

# Enable formula reading
wb = load_workbook("spreadsheet.xlsx", read_formulas=True)
sheet = wb.get_sheet_by_name("Sheet1")

# Access both data and formulas
data = list(sheet.iter_rows())      # [[10.0, 15.0, 25.0]]
formulas = list(sheet.iter_formulas())  # [["", "", "SUM(A1:B1)"]]

# Iterator properties for coordinate mapping
formula_iter = sheet.iter_formulas()
print(f"Start: {formula_iter.start}, Size: {formula_iter.width}x{formula_iter.height}")

Backward Compatibility

  • Default behavior unchanged (read_formulas=False)
  • No breaking changes to existing APIs
  • Existing code continues to work without modification

Testing

Comprehensive test suite covering:

  • All supported file formats (.xlsx, .xls, .ods)
  • All workbook creation methods
  • Error handling for disabled formulas
  • Iterator property validation
  • Coordinate consistency between data and formula iterators

Performance Impact

  • Zero overhead when formulas are disabled (default)
  • Minimal memory usage when formulas are enabled due to on-demand lookup strategy
  • No upfront range expansion or pre-allocation

@PSU3D0
Copy link
Author

PSU3D0 commented Jul 7, 2025

A human addendum to the AI's diff summary ^:

This lib is much faster than openpyxl but remained unusable for my use-case due to:

  • Lack of formula support
  • Lack of coordinate recovery

This PR should be a big step forward for broader usability and more use-cases. The next thing I'd hit would be a unified CellData iterator that exposes value, coordinate, optional formula, and potentially style information (still need to check Calamine source on this).

Cheers!

@PSU3D0
Copy link
Author

PSU3D0 commented Jul 12, 2025

@dimastbk bumping this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant