Modify the web scraping script (growth_lab_scraper.py) and the OpenAlex client (openAlex_client.py) to log newly discovered publications into the ETL metadata database table.
Context:
This is the first step in tracking a publication's lifecycle through the ETL pipeline.
Tasks:
- Add database connection logic to growth_lab_scraper.py and openAlex_client.py, or preferably create a shared utility function in backend/etl/utils/storage_utils.py for database interactions (a minimal sketch follows this list).
- When a new publication is identified, insert a new record into the metadata table.
- Populate fields such as publication_id, source_url, title (if available), and discovery_timestamp, and set initial statuses (e.g., download_status='Pending'). Use appropriate logging (loguru).
- Handle potential database errors gracefully. In particular, a publication may be discovered again, so consider an INSERT OR IGNORE or UPSERT strategy keyed on publication_id to avoid duplicate entries.
- Write unit tests (using pytest and mocking the database interaction) to verify that the scripts attempt to log data correctly (see the test sketch after the acceptance criteria).
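A minimal sketch of the shared utility described above, assuming a SQLite-backed metadata table. The database path, the table name (publication_metadata), and its column names are illustrative assumptions, not fixed by this issue:

```python
# backend/etl/utils/storage_utils.py (sketch; schema and path are assumed)
import sqlite3
from datetime import datetime, timezone

from loguru import logger

DB_PATH = "etl_metadata.db"  # assumed location; adjust to the project's config


def get_connection(db_path: str = DB_PATH) -> sqlite3.Connection:
    """Open a connection to the ETL metadata database."""
    return sqlite3.connect(db_path)


def log_discovered_publication(
    conn: sqlite3.Connection,
    publication_id: str,
    source_url: str,
    title: str | None = None,
) -> bool:
    """Insert a newly discovered publication; ignore duplicates.

    Returns True if a new row was inserted, False if this
    publication_id was already being tracked.
    """
    try:
        cur = conn.execute(
            """
            INSERT OR IGNORE INTO publication_metadata
                (publication_id, source_url, title,
                 discovery_timestamp, download_status)
            VALUES (?, ?, ?, ?, 'Pending')
            """,
            (publication_id, source_url, title,
             datetime.now(timezone.utc).isoformat()),
        )
        conn.commit()
        if cur.rowcount == 0:
            # INSERT OR IGNORE changed nothing: the row already existed.
            logger.debug(f"Publication {publication_id} already tracked; skipped")
            return False
        logger.info(f"Logged new publication {publication_id} from {source_url}")
        return True
    except sqlite3.Error:
        logger.exception(f"Failed to log publication {publication_id}")
        raise
```

Keeping the INSERT OR IGNORE inside the utility means both scrapers get the same duplicate handling for free, and the boolean return value lets callers distinguish genuinely new publications from re-discoveries.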
Acceptance Criteria:
- Running the scraper components results in new rows being added to the ETL metadata table.
- Relevant fields (ID, source, timestamp, initial statuses) are populated.
- Database interactions are handled via utility functions.
- Basic error handling (e.g., duplicates) is implemented.
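One way to meet the testing task: exercise the utility against an in-memory SQLite database whose schema mirrors the assumed sketch above (table and column names are illustrative). A scraper-side test would instead patch storage_utils.log_discovered_publication with unittest.mock and assert it is called once per discovered publication.

```python
# tests/test_storage_utils.py (sketch; assumes the schema from the utility above)
import sqlite3

import pytest

from backend.etl.utils.storage_utils import log_discovered_publication


@pytest.fixture
def conn():
    """In-memory database with the assumed metadata schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE publication_metadata (
            publication_id TEXT PRIMARY KEY,
            source_url TEXT,
            title TEXT,
            discovery_timestamp TEXT,
            download_status TEXT
        )
        """
    )
    yield conn
    conn.close()


def test_insert_and_duplicate_ignored(conn):
    # First discovery inserts a row.
    assert log_discovered_publication(conn, "pub-1", "https://example.org/p1", "A Title")
    # Re-discovering the same publication_id is ignored, not an error.
    assert not log_discovered_publication(conn, "pub-1", "https://example.org/p1", "A Title")
    count = conn.execute("SELECT COUNT(*) FROM publication_metadata").fetchone()[0]
    assert count == 1
```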
Implement a component that efficiently determines which publications are new or have been updated since the last ETL run.
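A hedged sketch of such a component, reusing the assumed publication_metadata table from above: it partitions incoming publication IDs into new and already-known sets. Detecting updates to known publications could then compare a source-supplied modification date or a content hash against a stored value.

```python
# Sketch: classify incoming publications against the metadata table.
import sqlite3
from typing import Iterable


def split_new_and_known(
    conn: sqlite3.Connection,
    incoming_ids: Iterable[str],
) -> tuple[set[str], set[str]]:
    """Return (new_ids, known_ids) relative to the metadata table."""
    incoming = set(incoming_ids)
    if not incoming:
        return set(), set()
    placeholders = ",".join("?" * len(incoming))
    rows = conn.execute(
        f"SELECT publication_id FROM publication_metadata "
        f"WHERE publication_id IN ({placeholders})",
        tuple(incoming),
    ).fetchall()
    known = {row[0] for row in rows}
    return incoming - known, known
```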