Skip to content

Extractions, calculations and optimizations + some documentation #6

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 28 commits into from

Conversation

egeakman
Copy link
Member

@egeakman egeakman commented May 25, 2024

Added some documentation 📚

Added extractions for:

  • more socials for speakers
  • room: room name
  • start: session start time
  • end: session end time

Added calculations/extractions for:

  • talks_in_parallel: looks for intersecting sessions
  • talks_after: looks for the next sessions in all rooms
  • talks_before: looks for the previous sessions in all rooms
  • next_talk: looks for the session in the same room in talks_after list
  • prev_talk: looks for the session in the same room in talks_before list

Optimizations:

  • Changed the code to generate publishable submissions/speakers only once
  • Replaced the hard-coded website_url subdomain, now it reads from config
  • mkdir the paths if they don't exist
  • Recommend the usage of .env

Example result extracted from 2023 data:

"SURAC3": {
  "code": "SURAC3",
  "title": "Sprint Orientation",
  "speakers": [],
  "submission_type": "Talk",
  "slug": "sprint-orientation",
  "track": null,
  "state": "confirmed",
  "abstract": "",
  "tweet": "",
  "duration": "20",
  "level": "",
  "delivery": "in-person",
  "room": "PyCharm (Forum Hall)",
  "start": "2023-07-21T17:30:00+02:00",
  "end": "2023-07-21T17:50:00+02:00",
  "talks_in_parallel": [],
  "talks_after": [
    "ZUPSUD"
  ],
  "talks_before": [
    "P7CVWV",
    "TJW3YZ",
    "URF3Z9",
    "G9NKUY",
    "VFLKKR",
    "HN3EDM",
    "7VLWUV"
  ],
  "next_talk": "ZUPSUD",
  "prev_talk": "P7CVWV",
  "website_url": "https://ep2023.europython.eu/session/sprint-orientation"
}

@egeakman egeakman requested a review from artcz May 25, 2024 21:11
@egeakman egeakman self-assigned this May 25, 2024
src/transform.py Outdated
Comment on lines 223 to 224
and session.start >= submission.end
and session.code not in submission.talks_in_parallel
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can change these two lines to include intersecting talks like these:

image

In the current version, PYO panel would not show How Python can help monitor governments as a talk after.

We can change it to do so.

@egeakman egeakman changed the title Extractions, calculations and optimizations Extractions, calculations and optimizations + some documentation May 29, 2024
@egeakman egeakman added documentation Improvements or additions to documentation enhancement New feature or request labels May 29, 2024
Copy link
Collaborator

@Kislovskiy Kislovskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put a few ideas, but we could do it in the next Pull Request, if you like them.

For this PR, do you think we could add some tests?

## Installation

1. Clone the repository.
2. Install the dependency management tool: ``make deps/pre``
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe an idea for the future. What about we just install programapi as a Python package? And have a following functionality:

  • programapi download
  • programapi transform
    By the way do you think that we can avoid saving unnecessary / private fields when downloading?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about it and didn't want to do it when there was bunch of other stuff to implement first. But I support it, I think we can do it after this PR.

About the private/unused fields:

  • IINM some fields like email, do_not_record cannot be excluded when downloading.
  • Answers can probably be excluded, but we do ?questions=all to get all the answers, so we can manage what to include/exclude in the model ad-hoc.

src/transform.py Outdated

website_url: str | None = None
website_url: str

@model_validator(mode="before")
@classmethod
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of doing a custom extract, can we delegate this to Pydantic?
This theoretically should remove all if else branching below and will make parsing easier to understand. What do you think?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I mean is something like this. Since we know the data model from Pretix and the data model we want to serve after we could simplify code a lot by utilizing .parse_raw method.

Here is the simple example on what I mean:

from pydantic import BaseModel
from typing import List
import json

class Item(BaseModel):
    name: str
    price: float
    quantity: int

class Order(BaseModel):
    order_id: int
    items: List[Item]


json_data = '''
{
    "order_id": 123,
    "items": [
        {"name": "apple", "price": 1.2, "quantity": 10},
        {"name": "banana", "price": 0.8, "quantity": 5}
    ]
}
'''

order = Order.parse_raw(json_data)

print(order.order_id)
for item in order.items:
    print(f"Item: {item.name}, Price: {item.price}, Quantity: {item.quantity}")

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm new to Pydantic, and these tips are really helpful for me. Thanks ❤️

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. Pydantic can support us in two major ways:

  1. Represent and work with data in a structured way (as BaseModels) instead of hacking around with nested dicts and lists.
  2. Convert data between JSON strings and Python objects.

If we only want the JSON de-/serialization, having huge model_validators a valid the solution: We can do all the hacky dict magic we want, and in the end we get a BaseModel instance.

Pydantic also offers some tools to perform transformations in a more structured way:

[Click me] Example from the Pretix API
from enum import Enum

from pydantic import BaseModel, Field, computed_field


class PretixOrderStatus(Enum):
  PAID = "p"
  PENDING = "n"
  EXPIRED = "e"
  CANCELED = "c"


class PretixOrderPosition(BaseModel):
  order_id: str = Field(alias="order")
  item_id: int = Field(alias="item")
  attendee_name: str | None


class PretixOrder(BaseModel):
  id: str = Field(alias="code")
  status: PretixOrderStatus = Field(exclude=True)
  positions: list[PretixOrderPosition]

  @computed_field
  def is_paid(self) -> bool:
      return self.status is PretixOrderStatus.PAID


order_json = """
{
"code": "ABC01",
"status": "p",
"positions": [
  {
    "order": "ABC01",
    "item": 123,
    "attendee_name": "Jane Doe"
  },
  {
    "order": "ABC01",
    "item": 234,
    "attendee_name": "John Doe"
  }
]
}
"""

order = PretixOrder.model_validate_json(order_json)
assert order.model_dump_json(indent=2) == """\
{
"id": "ABC01",
"positions": [
  {
    "order_id": "ABC01",
    "item_id": 123,
    "attendee_name": "Jane Doe"
  },
  {
    "order_id": "ABC01",
    "item_id": 234,
    "attendee_name": "John Doe"
  }
],
"is_paid": true
}"""

I know this is a matter of opinion, feel free to keep the code as it is.

PS: BaseModel.parse_raw is deprecated in Pydantic v2, please use BaseModel.model_validate_json instead 🙂

@egeakman egeakman requested review from Kislovskiy and NMertsch May 31, 2024 23:06
Copy link
Collaborator

@NMertsch NMertsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a ton @egeakman for taking over this task! There is always room for improvement, but I think this PR is good enough to be merged.

I see these general areas to work on in future PRs (besides the core functionality):

  • Testing (see also Artem's comment)
  • Make use of Pydantic's toolbox to reduce long @model_validator(mode="before") sections (e.g. computed fields, aliases, excluded fields, enums) (see here and here)
  • Separate object parsing from object relationships (see here)

src/transform.py Outdated
tweet: str = ""
duration: str

level: str = ""
delivery: str | None = ""
delivery: str = ""

# This is embedding a slot inside a submission for easier lookup later
room: str | None = None
start: datetime | None = None
end: datetime | None = None

# TODO: once we have schedule data then we can prefill those in the code here
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me this is mixing too many concerns:

  1. Representation of individual submissions
  2. Temporal relationship between submissions

I would prefer to store the relationships (parallel, before, after, next, previous) in a separate data structure which references these PretalxSubmission objects.
Then we wouldn't have this situation where creating a "complete" PretalxSubmission object requires to steps:

submission = PretalxSubmission(...)  # call `__init__`, which only initialized a part of the object
submission.set_talks_in_parallel(...)  # initialize another part of the object
submission.set_talks_after(...)  # initialize yet another part of the object
[...]

Field like talks_in_parallel: list[str] | None = None can easily cause to accidents: my_submission.talks_in_parallel sometimes contains the "talks in parallel to my submission", depending on when I access the field.

That would be a non-trivial change, and the current implementation seems to work, so I'm fine with merging it as it is. But let's try to not walk this path much further, else it might get messy 🙂

src/transform.py Outdated

website_url: str | None = None
website_url: str

@model_validator(mode="before")
@classmethod
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. Pydantic can support us in two major ways:

  1. Represent and work with data in a structured way (as BaseModels) instead of hacking around with nested dicts and lists.
  2. Convert data between JSON strings and Python objects.

If we only want the JSON de-/serialization, having huge model_validators a valid the solution: We can do all the hacky dict magic we want, and in the end we get a BaseModel instance.

Pydantic also offers some tools to perform transformations in a more structured way:

[Click me] Example from the Pretix API
from enum import Enum

from pydantic import BaseModel, Field, computed_field


class PretixOrderStatus(Enum):
  PAID = "p"
  PENDING = "n"
  EXPIRED = "e"
  CANCELED = "c"


class PretixOrderPosition(BaseModel):
  order_id: str = Field(alias="order")
  item_id: int = Field(alias="item")
  attendee_name: str | None


class PretixOrder(BaseModel):
  id: str = Field(alias="code")
  status: PretixOrderStatus = Field(exclude=True)
  positions: list[PretixOrderPosition]

  @computed_field
  def is_paid(self) -> bool:
      return self.status is PretixOrderStatus.PAID


order_json = """
{
"code": "ABC01",
"status": "p",
"positions": [
  {
    "order": "ABC01",
    "item": 123,
    "attendee_name": "Jane Doe"
  },
  {
    "order": "ABC01",
    "item": 234,
    "attendee_name": "John Doe"
  }
]
}
"""

order = PretixOrder.model_validate_json(order_json)
assert order.model_dump_json(indent=2) == """\
{
"id": "ABC01",
"positions": [
  {
    "order_id": "ABC01",
    "item_id": 123,
    "attendee_name": "Jane Doe"
  },
  {
    "order_id": "ABC01",
    "item_id": 234,
    "attendee_name": "John Doe"
  }
],
"is_paid": true
}"""

I know this is a matter of opinion, feel free to keep the code as it is.

PS: BaseModel.parse_raw is deprecated in Pydantic v2, please use BaseModel.model_validate_json instead 🙂

Copy link
Collaborator

@Kislovskiy Kislovskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

t looks solid for the first iteration. I'm unsure if we need .gitignore files in data/public, data/raw directories.

We could also declare all the project dependencies in pyproject.toml instead of having pyproject.toml + requirements.in + requirements.txt, but let's do it in a separate Pull Request.

And yeah, there are a few things to improve, but we need to start somewhere. Thanks a lot for your contribution 🤩

src/transform.py Outdated
return values

@staticmethod
def compute_talks_in_parallel(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what you do here is you filtering the list of sessions to find the list of parallel talks. If that's right, maybe it would be easier to do with a helper function is_parallel, smth like this:

def get_parallel_talks(
    session: PretalxSession, all_sessions: List[PretalxSession]
) -> List[String]:
    def is_parallel(other_session: PretalxSession) -> bool:
        return (
            other_session.code != session.code
            and other_session.start is not None
            and session.start is not None
            and other_session.start < session.end
            and other_session.end > session.start
        )
    
    return [s.code for s in filter(is_parallel, all_sessions)]

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be done later, when tests are introduced.

src/transform.py Outdated
return talks_parallel

@staticmethod
def compute_talks_after(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one looks a bit complex, we could consider refactoring it a bit after introducing some tests.

@egeakman egeakman requested a review from NMertsch June 2, 2024 20:37
@egeakman egeakman requested a review from Kislovskiy June 4, 2024 14:37
@egeakman
Copy link
Member Author

egeakman commented Jun 4, 2024

Here is a more meaningful diff: ce1de63

@egeakman egeakman closed this Jun 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants