Extractions, calculations and optimizations + some documentation #6
Conversation
src/transform.py (outdated)

```python
and session.start >= submission.end
and session.code not in submission.talks_in_parallel
```
I put down a few ideas; if you like them, we could do them in the next Pull Request.
For this PR, do you think we could add some tests?
```markdown
## Installation

1. Clone the repository.
2. Install the dependency management tool: ``make deps/pre``
```
Maybe an idea for the future: what about we just install programapi as a Python package, with the following functionality?

```shell
programapi download
programapi transform
```
By the way, do you think we can avoid saving unnecessary / private fields when downloading?
I thought about it and didn't want to do it when there was a bunch of other stuff to implement first. But I support it; I think we can do it after this PR.
About the private/unused fields:
- IINM some fields like `email` and `do_not_record` cannot be excluded when downloading.
- Answers can probably be excluded, but we use `?questions=all` to get all the answers, so we can manage what to include/exclude in the model ad hoc.
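Managing the include/exclude in the model could look like this hedged sketch: Pydantic v2's `Field(exclude=True)` keeps a field available in Python but drops it from serialized output. The `Speaker` model and its field names are illustrative, not the project's actual schema:

```python
from pydantic import BaseModel, Field


class Speaker(BaseModel):
    # Hypothetical model; field names are illustrative only
    name: str
    email: str = Field(exclude=True)           # private: never serialized
    do_not_record: bool = Field(exclude=True)  # private: never serialized


speaker = Speaker(name="Jane Doe", email="jane@example.org", do_not_record=True)
print(speaker.model_dump())  # {'name': 'Jane Doe'}
```

So even if the Pretalx download can't omit these fields, the public dump would still exclude them.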
src/transform.py (outdated)

```diff
- website_url: str | None = None
+ website_url: str

  @model_validator(mode="before")
  @classmethod
```
Instead of doing a custom extract, can we delegate this to Pydantic? This theoretically should remove all the `if`/`else` branching below and will make the parsing easier to understand. What do you think?
What I mean is something like this: since we know the data model from Pretix and the data model we want to serve, we could simplify the code a lot by utilizing the `.parse_raw` method.
Here is a simple example of what I mean:
```python
import json
from typing import List

from pydantic import BaseModel


class Item(BaseModel):
    name: str
    price: float
    quantity: int


class Order(BaseModel):
    order_id: int
    items: List[Item]


json_data = '''
{
    "order_id": 123,
    "items": [
        {"name": "apple", "price": 1.2, "quantity": 10},
        {"name": "banana", "price": 0.8, "quantity": 5}
    ]
}
'''

order = Order.parse_raw(json_data)
print(order.order_id)
for item in order.items:
    print(f"Item: {item.name}, Price: {item.price}, Quantity: {item.quantity}")
```
I'm new to Pydantic, and these tips are really helpful for me. Thanks ❤️
I agree. Pydantic can support us in two major ways:

- Represent and work with data in a structured way (as `BaseModel`s) instead of hacking around with nested dicts and lists.
- Convert data between JSON strings and Python objects.

If we only want the JSON de-/serialization, having huge `model_validator`s is a valid solution: we can do all the hacky `dict` magic we want, and in the end we get a `BaseModel` instance.

Pydantic also offers some tools to perform transformations in a more structured way:

- `computed_field` to create new fields based on parsed data
- `exclude` to exclude fields during serialization
- aliases to automatically rename fields
- `Enum` parsing to avoid "magic strings"

Example from the Pretix API:
```python
from enum import Enum

from pydantic import BaseModel, Field, computed_field


class PretixOrderStatus(Enum):
    PAID = "p"
    PENDING = "n"
    EXPIRED = "e"
    CANCELED = "c"


class PretixOrderPosition(BaseModel):
    order_id: str = Field(alias="order")
    item_id: int = Field(alias="item")
    attendee_name: str | None


class PretixOrder(BaseModel):
    id: str = Field(alias="code")
    status: PretixOrderStatus = Field(exclude=True)
    positions: list[PretixOrderPosition]

    @computed_field
    def is_paid(self) -> bool:
        return self.status is PretixOrderStatus.PAID


order_json = """
{
    "code": "ABC01",
    "status": "p",
    "positions": [
        {"order": "ABC01", "item": 123, "attendee_name": "Jane Doe"},
        {"order": "ABC01", "item": 234, "attendee_name": "John Doe"}
    ]
}
"""

order = PretixOrder.model_validate_json(order_json)
assert order.model_dump_json(indent=2) == """\
{
  "id": "ABC01",
  "positions": [
    {
      "order_id": "ABC01",
      "item_id": 123,
      "attendee_name": "Jane Doe"
    },
    {
      "order_id": "ABC01",
      "item_id": 234,
      "attendee_name": "John Doe"
    }
  ],
  "is_paid": true
}"""
```
I know this is a matter of opinion, feel free to keep the code as it is.
PS: `BaseModel.parse_raw` is deprecated in Pydantic v2, please use `BaseModel.model_validate_json` instead 🙂
Thanks a ton @egeakman for taking over this task! There is always room for improvement, but I think this PR is good enough to be merged.
I see these general areas to work on in future PRs (besides the core functionality):

- Testing (see also Artem's comment)
- Make use of Pydantic's toolbox to reduce long `@model_validator(mode="before")` sections (e.g. computed fields, aliases, excluded fields, enums) (see here and here)
- Separate object parsing from object relationships (see here)
src/transform.py (outdated)

```diff
  tweet: str = ""
  duration: str

  level: str = ""
- delivery: str | None = ""
+ delivery: str = ""

  # This is embedding a slot inside a submission for easier lookup later
  room: str | None = None
  start: datetime | None = None
  end: datetime | None = None

  # TODO: once we have schedule data then we can prefill those in the code here
```
To me this is mixing too many concerns:

- Representation of individual submissions
- Temporal relationships between submissions

I would prefer to store the relationships (parallel, before, after, next, previous) in a separate data structure which references these `PretalxSubmission` objects.

Then we wouldn't have this situation where creating a "complete" `PretalxSubmission` object requires multiple steps:

```python
submission = PretalxSubmission(...)    # call `__init__`, which only initializes a part of the object
submission.set_talks_in_parallel(...)  # initialize another part of the object
submission.set_talks_after(...)        # initialize yet another part of the object
# [...]
```

Fields like `talks_in_parallel: list[str] | None = None` can easily lead to accidents: `my_submission.talks_in_parallel` sometimes contains the "talks in parallel to my submission", depending on when I access the field.

That would be a non-trivial change, and the current implementation seems to work, so I'm fine with merging it as it is. But let's try not to walk this path much further, else it might get messy 🙂
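One way to keep the relationships out of the submission objects is a separate index built once from fully-initialized submissions. A hedged sketch, using a simplified stand-in for `PretalxSubmission` (the real model has many more fields):

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Submission:
    # Simplified, hypothetical stand-in for PretalxSubmission
    code: str
    start: datetime
    end: datetime


def build_parallel_index(submissions: list[Submission]) -> dict[str, list[str]]:
    """Map each submission code to the codes of all time-overlapping submissions."""
    return {
        a.code: [
            b.code
            for b in submissions
            if b.code != a.code and b.start < a.end and b.end > a.start
        ]
        for a in submissions
    }


subs = [
    Submission("A", datetime(2023, 7, 19, 10, 0), datetime(2023, 7, 19, 11, 0)),
    Submission("B", datetime(2023, 7, 19, 10, 30), datetime(2023, 7, 19, 11, 30)),
    Submission("C", datetime(2023, 7, 19, 12, 0), datetime(2023, 7, 19, 13, 0)),
]
print(build_parallel_index(subs))  # A and B overlap; C stands alone
```

Because the submissions are frozen and the index is built in one pass, there is no window where `talks_in_parallel` exists but is not yet populated.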
It looks solid for the first iteration. I'm unsure if we need `.gitignore` files in the `data/public` and `data/raw` directories.
We could also declare all the project dependencies in `pyproject.toml` instead of having `pyproject.toml` + `requirements.in` + `requirements.txt`, but let's do it in a separate Pull Request.
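Consolidating could look like this hedged sketch of a PEP 621 `[project]` table; the package metadata and dependency names here are illustrative and would need to match what `requirements.in` actually pins:

```toml
# Hypothetical consolidation; names/versions must be taken from requirements.in
[project]
name = "programapi"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
    "pydantic>=2",
    "requests",
]

[project.optional-dependencies]
dev = [
    "pytest",
]
```

A lock file can then still be generated from `pyproject.toml` (e.g. with `pip-compile`), so the single-source-of-truth benefit is kept without losing reproducible installs.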
And yeah, there are a few things to improve, but we need to start somewhere. Thanks a lot for your contribution 🤩
src/transform.py (outdated)

```python
        return values

    @staticmethod
    def compute_talks_in_parallel(
```
I think what you do here is filter the list of sessions to find the list of parallel talks. If that's right, maybe it would be easier to do with a helper function `is_parallel`, something like this:

```python
def get_parallel_talks(
    session: PretalxSession, all_sessions: list[PretalxSession]
) -> list[str]:
    def is_parallel(other_session: PretalxSession) -> bool:
        return (
            other_session.code != session.code
            and other_session.start is not None
            and session.start is not None
            and other_session.start < session.end
            and other_session.end > session.start
        )

    return [s.code for s in filter(is_parallel, all_sessions)]
```
This could be done later, when tests are introduced.
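When tests are introduced, the overlap logic above is a natural first target. A minimal sketch in pytest style (plain functions, runnable without pytest too), using `SimpleNamespace` as a hypothetical stand-in since the real `PretalxSession` model isn't shown here:

```python
from datetime import datetime
from types import SimpleNamespace


def is_parallel(session, other) -> bool:
    # Overlap check mirroring the suggested helper above
    return (
        other.code != session.code
        and other.start is not None
        and session.start is not None
        and other.start < session.end
        and other.end > session.start
    )


def make_session(code: str, start_hour: int, end_hour: int):
    # SimpleNamespace stands in for PretalxSession in this sketch
    return SimpleNamespace(
        code=code,
        start=datetime(2023, 7, 19, start_hour),
        end=datetime(2023, 7, 19, end_hour),
    )


def test_overlapping_sessions_are_parallel():
    assert is_parallel(make_session("A", 10, 11), make_session("B", 10, 12))


def test_back_to_back_sessions_are_not_parallel():
    assert not is_parallel(make_session("A", 10, 11), make_session("B", 11, 12))


def test_session_is_not_parallel_to_itself():
    a = make_session("A", 10, 11)
    assert not is_parallel(a, a)
```

The back-to-back case pins down the boundary semantics (a talk starting exactly when another ends is not "in parallel"), which is easy to break during refactoring.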
src/transform.py (outdated)

```python
        return talks_parallel

    @staticmethod
    def compute_talks_after(
```
This one looks a bit complex; we could consider refactoring it a bit after introducing some tests.
Here is a more meaningful diff: ce1de63
Added some documentation 📚
Added extractions for:
Added calculations/extractions for:
Optimizations:
Example result extracted from 2023 data: