Check for new publications #8

Open · shreyasgm opened this issue Mar 13, 2025 · 1 comment

shreyasgm commented Mar 13, 2025

Implement a component that efficiently determines which publications are new or have been updated since the last ETL run:

  1. Compare scraped publication metadata with the existing manifest
  2. Detect both new publications and updates to existing ones
  3. Identify which files need to be downloaded or reprocessed
  4. Generate a processing plan for the pipeline (a sketch of this comparison follows the list)
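
A minimal sketch of that comparison logic, assuming the manifest is a dict keyed by publication ID and each scraped record carries a `content_hash` and `file_urls` field — `ProcessingPlan`, `build_processing_plan`, and the field names are illustrative, not an existing schema:

```python
from dataclasses import dataclass, field


@dataclass
class ProcessingPlan:
    """Which publications need work in this ETL run (illustrative structure)."""
    new: list[dict] = field(default_factory=list)      # not in the manifest yet
    updated: list[dict] = field(default_factory=list)  # in the manifest but changed
    files_to_download: list[str] = field(default_factory=list)


def build_processing_plan(scraped: list[dict], manifest: dict[str, dict]) -> ProcessingPlan:
    """Compare scraped publication metadata against the existing manifest.

    Assumes each scraped record has 'publication_id', 'content_hash', and
    'file_urls' keys -- hypothetical field names for illustration.
    """
    plan = ProcessingPlan()
    for record in scraped:
        pub_id = record["publication_id"]
        existing = manifest.get(pub_id)
        if existing is None:
            # Never seen before: new publication, download everything it has.
            plan.new.append(record)
            plan.files_to_download.extend(record.get("file_urls", []))
        elif record.get("content_hash") != existing.get("content_hash"):
            # Known publication whose content changed: reprocess it.
            plan.updated.append(record)
            plan.files_to_download.extend(record.get("file_urls", []))
    return plan
```

Downstream pipeline stages would then consume the plan instead of re-scanning every publication.
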
shreyasgm added this to the ETL milestone Mar 13, 2025
shreyasgm added the help wanted label Apr 9, 2025
shreyasgm (Owner) commented:

Modify ETL Scraper to Log Discovered Publications

Modify the web scraper (growth_lab_scraper.py) and the OpenAlex client (openAlex_client.py) to log newly discovered publications into the ETL metadata database table.

Context:
This is the first step in tracking a publication's lifecycle through the ETL pipeline.

Tasks:

  1. Add database connection logic to growth_lab_scraper.py and openAlex_client.py (or, preferably, create a shared utility function in backend/etl/utils/storage_utils.py for database interactions; a sketch of such a utility follows this list).
  2. When a new publication is identified, insert a new record into the metadata table.
  3. Populate fields such as publication_id, source_url, title (if available), and discovery_timestamp, and set initial statuses (e.g., download_status='Pending'). Log each insertion with loguru.
  4. Handle potential database errors gracefully (e.g., duplicate entries when a publication is discovered again; an INSERT OR IGNORE or UPSERT strategy keyed on publication_id would work).
  5. Write unit tests (using pytest and mocking the database interaction) to verify that the scripts attempt to log data correctly (a test sketch follows the acceptance criteria below).
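
One possible shape for the shared utility from task 1, assuming a SQLite-backed metadata table named `etl_metadata` with a unique `publication_id` column; the table name, column names, and function name are placeholders until the schema is settled:

```python
# backend/etl/utils/storage_utils.py (sketch)
import sqlite3
from datetime import datetime, timezone
from typing import Optional

from loguru import logger


def log_discovered_publication(
    conn: sqlite3.Connection,
    publication_id: str,
    source_url: str,
    title: Optional[str] = None,
) -> None:
    """Insert a newly discovered publication into the ETL metadata table.

    The UPSERT-style ON CONFLICT clause keeps re-discovered publications
    from raising duplicate-key errors (task 4).
    """
    discovery_timestamp = datetime.now(timezone.utc).isoformat()
    conn.execute(
        """
        INSERT INTO etl_metadata (publication_id, source_url, title,
                                  discovery_timestamp, download_status)
        VALUES (?, ?, ?, ?, 'Pending')
        ON CONFLICT(publication_id) DO NOTHING
        """,
        (publication_id, source_url, title, discovery_timestamp),
    )
    conn.commit()
    logger.info("Logged discovered publication {}", publication_id)
```

growth_lab_scraper.py and openAlex_client.py would then call this helper for each publication they discover instead of talking to the database directly.
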

Acceptance Criteria:

  • Running the scraper components results in new rows being added to the ETL metadata table.
  • Relevant fields (ID, source, timestamp, initial statuses) are populated.
  • Database interactions are handled via utility functions.
  • Basic error handling (e.g., duplicates) is implemented.
  • Unit tests pass.
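
A minimal pytest sketch for task 5, mocking the connection so no real database is touched; it targets the hypothetical log_discovered_publication() utility sketched above:

```python
# tests/etl/test_storage_utils.py (sketch)
from unittest.mock import MagicMock

from backend.etl.utils.storage_utils import log_discovered_publication


def test_log_discovered_publication_inserts_row():
    """The utility should attempt an INSERT keyed on the publication_id."""
    conn = MagicMock()  # stands in for the real database connection

    log_discovered_publication(
        conn,
        publication_id="pub-123",
        source_url="https://example.org/pub-123",
        title="Example Publication",
    )

    # Exactly one INSERT was attempted and the transaction was committed.
    assert conn.execute.call_count == 1
    params = conn.execute.call_args.args[1]
    assert params[0] == "pub-123"
    conn.commit.assert_called_once()
```
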
