Extractions, calculations and optimizations + some documentation #6
Conversation
src/transform.py (outdated)

```python
and session.start >= submission.end
and session.code not in submission.talks_in_parallel
```
I put down a few ideas; if you like them, we could do them in the next Pull Request.
For this PR, do you think we could add some tests?
```markdown
## Installation

1. Clone the repository.
2. Install the dependency management tool: ``make deps/pre``
```
Maybe an idea for the future: what about we just install programapi as a Python package, with the following functionality?

```shell
programapi download
programapi transform
```
By the way, do you think we can avoid saving unnecessary / private fields when downloading?
I thought about it and didn't want to do it when there was a bunch of other stuff to implement first. But I support it; I think we can do it after this PR.
About the private/unused fields:
- IINM some fields like `email` and `do_not_record` cannot be excluded when downloading.
- Answers can probably be excluded, but we use `?questions=all` to get all the answers, so we can manage what to include/exclude in the model ad hoc.
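Managing the include/exclude in the model could look like this hedged sketch: Pydantic v2's `Field(exclude=True)` keeps a field available in Python but drops it from serialized output. The `Speaker` model and its field names are illustrative, not the project's actual schema:

```python
from pydantic import BaseModel, Field


class Speaker(BaseModel):
    # Hypothetical model; field names are illustrative only
    name: str
    email: str = Field(exclude=True)           # private: never serialized
    do_not_record: bool = Field(exclude=True)  # private: never serialized


speaker = Speaker(name="Jane Doe", email="jane@example.org", do_not_record=True)
print(speaker.model_dump())  # {'name': 'Jane Doe'}
```

So even if the Pretalx download can't omit these fields, the public dump would still exclude them.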
src/transform.py (outdated)

```diff
- website_url: str | None = None
+ website_url: str

  @model_validator(mode="before")
  @classmethod
```
Instead of doing a custom extract, can we delegate this to Pydantic? This theoretically should remove all the `if`/`else` branching below and will make the parsing easier to understand. What do you think?
What I mean is something like this: since we know the data model from Pretix and the data model we want to serve, we could simplify the code a lot by utilizing the `.parse_raw` method.
Here is a simple example of what I mean:
```python
import json
from typing import List

from pydantic import BaseModel


class Item(BaseModel):
    name: str
    price: float
    quantity: int


class Order(BaseModel):
    order_id: int
    items: List[Item]


json_data = '''
{
    "order_id": 123,
    "items": [
        {"name": "apple", "price": 1.2, "quantity": 10},
        {"name": "banana", "price": 0.8, "quantity": 5}
    ]
}
'''

order = Order.parse_raw(json_data)
print(order.order_id)
for item in order.items:
    print(f"Item: {item.name}, Price: {item.price}, Quantity: {item.quantity}")
```
I'm new to Pydantic, and these tips are really helpful for me. Thanks ❤️
I agree. Pydantic can support us in two major ways:

- Represent and work with data in a structured way (as `BaseModel`s) instead of hacking around with nested dicts and lists.
- Convert data between JSON strings and Python objects.

If we only want the JSON de-/serialization, having huge `model_validator`s is a valid solution: we can do all the hacky `dict` magic we want, and in the end we get a `BaseModel` instance.

Pydantic also offers some tools to perform transformations in a more structured way:

- `computed_field` to create new fields based on parsed data
- `exclude` to exclude fields during serialization
- aliases to automatically rename fields
- `Enum` parsing to avoid "magic strings"

Example from the Pretix API:
```python
from enum import Enum

from pydantic import BaseModel, Field, computed_field


class PretixOrderStatus(Enum):
    PAID = "p"
    PENDING = "n"
    EXPIRED = "e"
    CANCELED = "c"


class PretixOrderPosition(BaseModel):
    order_id: str = Field(alias="order")
    item_id: int = Field(alias="item")
    attendee_name: str | None


class PretixOrder(BaseModel):
    id: str = Field(alias="code")
    status: PretixOrderStatus = Field(exclude=True)
    positions: list[PretixOrderPosition]

    @computed_field
    def is_paid(self) -> bool:
        return self.status is PretixOrderStatus.PAID


order_json = """
{
    "code": "ABC01",
    "status": "p",
    "positions": [
        {"order": "ABC01", "item": 123, "attendee_name": "Jane Doe"},
        {"order": "ABC01", "item": 234, "attendee_name": "John Doe"}
    ]
}
"""

order = PretixOrder.model_validate_json(order_json)
assert order.model_dump_json(indent=2) == """\
{
  "id": "ABC01",
  "positions": [
    {
      "order_id": "ABC01",
      "item_id": 123,
      "attendee_name": "Jane Doe"
    },
    {
      "order_id": "ABC01",
      "item_id": 234,
      "attendee_name": "John Doe"
    }
  ],
  "is_paid": true
}"""
```
I know this is a matter of opinion, feel free to keep the code as it is.
PS: `BaseModel.parse_raw` is deprecated in Pydantic v2, please use `BaseModel.model_validate_json` instead 🙂
Thanks a ton @egeakman for taking over this task! There is always room for improvement, but I think this PR is good enough to be merged.
I see these general areas to work on in future PRs (besides the core functionality):

- Testing (see also Artem's comment)
- Make use of Pydantic's toolbox to reduce long `@model_validator(mode="before")` sections (e.g. computed fields, aliases, excluded fields, enums) (see here and here)
- Separate object parsing from object relationships (see here)
src/transform.py (outdated)

```diff
  tweet: str = ""
  duration: str

  level: str = ""
- delivery: str | None = ""
+ delivery: str = ""

  # This is embedding a slot inside a submission for easier lookup later
  room: str | None = None
  start: datetime | None = None
  end: datetime | None = None

  # TODO: once we have schedule data then we can prefill those in the code here
```
To me this is mixing too many concerns:

- Representation of individual submissions
- Temporal relationships between submissions

I would prefer to store the relationships (parallel, before, after, next, previous) in a separate data structure which references these `PretalxSubmission` objects.

Then we wouldn't have this situation where creating a "complete" `PretalxSubmission` object requires multiple steps:

```python
submission = PretalxSubmission(...)    # call `__init__`, which only initializes a part of the object
submission.set_talks_in_parallel(...)  # initialize another part of the object
submission.set_talks_after(...)        # initialize yet another part of the object
# [...]
```

Fields like `talks_in_parallel: list[str] | None = None` can easily lead to accidents: `my_submission.talks_in_parallel` sometimes contains the "talks in parallel to my submission", depending on when I access the field.

That would be a non-trivial change, and the current implementation seems to work, so I'm fine with merging it as it is. But let's try not to walk this path much further, else it might get messy 🙂
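One way to keep the relationships out of the submission objects is a separate index built once from fully-initialized submissions. A hedged sketch, using a simplified stand-in for `PretalxSubmission` (the real model has many more fields):

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Submission:
    # Simplified, hypothetical stand-in for PretalxSubmission
    code: str
    start: datetime
    end: datetime


def build_parallel_index(submissions: list[Submission]) -> dict[str, list[str]]:
    """Map each submission code to the codes of all time-overlapping submissions."""
    return {
        a.code: [
            b.code
            for b in submissions
            if b.code != a.code and b.start < a.end and b.end > a.start
        ]
        for a in submissions
    }


subs = [
    Submission("A", datetime(2023, 7, 19, 10, 0), datetime(2023, 7, 19, 11, 0)),
    Submission("B", datetime(2023, 7, 19, 10, 30), datetime(2023, 7, 19, 11, 30)),
    Submission("C", datetime(2023, 7, 19, 12, 0), datetime(2023, 7, 19, 13, 0)),
]
print(build_parallel_index(subs))  # A and B overlap; C stands alone
```

Because the submissions are frozen and the index is built in one pass, there is no window where `talks_in_parallel` exists but is not yet populated.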
It looks solid for the first iteration. I'm unsure if we need `.gitignore` files in the `data/public` and `data/raw` directories.
We could also declare all the project dependencies in `pyproject.toml` instead of having `pyproject.toml` + `requirements.in` + `requirements.txt`, but let's do it in a separate Pull Request.
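Consolidating could look like this hedged sketch of a PEP 621 `[project]` table; the package metadata and dependency names here are illustrative and would need to match what `requirements.in` actually pins:

```toml
# Hypothetical consolidation; names/versions must be taken from requirements.in
[project]
name = "programapi"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
    "pydantic>=2",
    "requests",
]

[project.optional-dependencies]
dev = [
    "pytest",
]
```

A lock file can then still be generated from `pyproject.toml` (e.g. with `pip-compile`), so the single-source-of-truth benefit is kept without losing reproducible installs.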
And yeah, there are a few things to improve, but we need to start somewhere. Thanks a lot for your contribution 🤩
src/transform.py (outdated)

```python
        return values

    @staticmethod
    def compute_talks_in_parallel(
```
I think what you do here is filter the list of sessions to find the list of parallel talks. If that's right, maybe it would be easier to do with a helper function `is_parallel`, something like this:

```python
def get_parallel_talks(
    session: PretalxSession, all_sessions: list[PretalxSession]
) -> list[str]:
    def is_parallel(other_session: PretalxSession) -> bool:
        return (
            other_session.code != session.code
            and other_session.start is not None
            and session.start is not None
            and other_session.start < session.end
            and other_session.end > session.start
        )

    return [s.code for s in filter(is_parallel, all_sessions)]
```
This could be done later, when tests are introduced.
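When tests are introduced, the overlap logic above is a natural first target. A minimal sketch in pytest style (plain functions, runnable without pytest too), using `SimpleNamespace` as a hypothetical stand-in since the real `PretalxSession` model isn't shown here:

```python
from datetime import datetime
from types import SimpleNamespace


def is_parallel(session, other) -> bool:
    # Overlap check mirroring the suggested helper above
    return (
        other.code != session.code
        and other.start is not None
        and session.start is not None
        and other.start < session.end
        and other.end > session.start
    )


def make_session(code: str, start_hour: int, end_hour: int):
    # SimpleNamespace stands in for PretalxSession in this sketch
    return SimpleNamespace(
        code=code,
        start=datetime(2023, 7, 19, start_hour),
        end=datetime(2023, 7, 19, end_hour),
    )


def test_overlapping_sessions_are_parallel():
    assert is_parallel(make_session("A", 10, 11), make_session("B", 10, 12))


def test_back_to_back_sessions_are_not_parallel():
    assert not is_parallel(make_session("A", 10, 11), make_session("B", 11, 12))


def test_session_is_not_parallel_to_itself():
    a = make_session("A", 10, 11)
    assert not is_parallel(a, a)
```

The back-to-back case pins down the boundary semantics (a talk starting exactly when another ends is not "in parallel"), which is easy to break during refactoring.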
src/transform.py (outdated)

```python
        return talks_parallel

    @staticmethod
    def compute_talks_after(
```
This one looks a bit complex; we could consider refactoring it a bit after introducing some tests.
Here is a more meaningful diff: ce1de63
Added some documentation 📚
Added extractions for:
Added calculations/extractions for:
Optimizations:
Example result extracted from 2023 data: