Check for new publications #8

Open · shreyasgm opened this issue Mar 13, 2025 · 1 comment

shreyasgm commented Mar 13, 2025

Implement a component that efficiently determines which publications are new or have been updated since the last ETL run:

  1. Compare scraped publication metadata with the existing manifest
  2. Detect both new publications and updates to existing ones
  3. Identify which files need to be downloaded or reprocessed
  4. Generate a processing plan for the pipeline (a sketch of this comparison follows the list)
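
A minimal sketch of that comparison logic, assuming the manifest is a dict keyed by publication ID and each scraped record carries a `content_hash` and `file_urls` field — `ProcessingPlan`, `build_processing_plan`, and the field names are illustrative, not an existing schema:

```python
from dataclasses import dataclass, field


@dataclass
class ProcessingPlan:
    """Which publications need work in this ETL run (illustrative structure)."""
    new: list[dict] = field(default_factory=list)      # not in the manifest yet
    updated: list[dict] = field(default_factory=list)  # in the manifest but changed
    files_to_download: list[str] = field(default_factory=list)


def build_processing_plan(scraped: list[dict], manifest: dict[str, dict]) -> ProcessingPlan:
    """Compare scraped publication metadata against the existing manifest.

    Assumes each scraped record has 'publication_id', 'content_hash', and
    'file_urls' keys -- hypothetical field names for illustration.
    """
    plan = ProcessingPlan()
    for record in scraped:
        pub_id = record["publication_id"]
        existing = manifest.get(pub_id)
        if existing is None:
            # Never seen before: new publication, download everything it has.
            plan.new.append(record)
            plan.files_to_download.extend(record.get("file_urls", []))
        elif record.get("content_hash") != existing.get("content_hash"):
            # Known publication whose content changed: reprocess it.
            plan.updated.append(record)
            plan.files_to_download.extend(record.get("file_urls", []))
    return plan
```

Downstream pipeline stages would then consume the plan instead of re-scanning every publication.
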
shreyasgm added this to the ETL milestone Mar 13, 2025
shreyasgm added the help wanted label Apr 9, 2025
shreyasgm (Owner) commented:

Modify ETL Scraper to Log Discovered Publications

Modify the web scraper (growth_lab_scraper.py) and the OpenAlex client (openAlex_client.py) to log newly discovered publications into the ETL metadata database table.

Context:
This is the first step in tracking a publication's lifecycle through the ETL pipeline.

Tasks:

  1. Add database connection logic to growth_lab_scraper.py and openAlex_client.py (or, preferably, create a shared utility function in backend/etl/utils/storage_utils.py for database interactions; a sketch of such a utility follows this list).
  2. When a new publication is identified, insert a new record into the metadata table.
  3. Populate fields such as publication_id, source_url, title (if available), and discovery_timestamp, and set initial statuses (e.g., download_status='Pending'). Log each insertion with loguru.
  4. Handle potential database errors gracefully (e.g., duplicate entries when a publication is discovered again; an INSERT OR IGNORE or UPSERT strategy keyed on publication_id would work).
  5. Write unit tests (using pytest and mocking the database interaction) to verify that the scripts attempt to log data correctly (a test sketch follows the acceptance criteria below).
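
One possible shape for the shared utility from task 1, assuming a SQLite-backed metadata table named `etl_metadata` with a unique `publication_id` column; the table name, column names, and function name are placeholders until the schema is settled:

```python
# backend/etl/utils/storage_utils.py (sketch)
import sqlite3
from datetime import datetime, timezone
from typing import Optional

from loguru import logger


def log_discovered_publication(
    conn: sqlite3.Connection,
    publication_id: str,
    source_url: str,
    title: Optional[str] = None,
) -> None:
    """Insert a newly discovered publication into the ETL metadata table.

    The UPSERT-style ON CONFLICT clause keeps re-discovered publications
    from raising duplicate-key errors (task 4).
    """
    discovery_timestamp = datetime.now(timezone.utc).isoformat()
    conn.execute(
        """
        INSERT INTO etl_metadata (publication_id, source_url, title,
                                  discovery_timestamp, download_status)
        VALUES (?, ?, ?, ?, 'Pending')
        ON CONFLICT(publication_id) DO NOTHING
        """,
        (publication_id, source_url, title, discovery_timestamp),
    )
    conn.commit()
    logger.info("Logged discovered publication {}", publication_id)
```

growth_lab_scraper.py and openAlex_client.py would then call this helper for each publication they discover instead of talking to the database directly.
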

Acceptance Criteria:

  • Running the scraper components results in new rows being added to the ETL metadata table.
  • Relevant fields (ID, source, timestamp, initial statuses) are populated.
  • Database interactions are handled via utility functions.
  • Basic error handling (e.g., duplicates) is implemented.
  • Unit tests pass.
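
A minimal pytest sketch for task 5, mocking the connection so no real database is touched; it targets the hypothetical log_discovered_publication() utility sketched above:

```python
# tests/etl/test_storage_utils.py (sketch)
from unittest.mock import MagicMock

from backend.etl.utils.storage_utils import log_discovered_publication


def test_log_discovered_publication_inserts_row():
    """The utility should attempt an INSERT keyed on the publication_id."""
    conn = MagicMock()  # stands in for the real database connection

    log_discovered_publication(
        conn,
        publication_id="pub-123",
        source_url="https://example.org/pub-123",
        title="Example Publication",
    )

    # Exactly one INSERT was attempted and the transaction was committed.
    assert conn.execute.call_count == 1
    params = conn.execute.call_args.args[1]
    assert params[0] == "pub-123"
    conn.commit.assert_called_once()
```
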
