Growth Lab scraper assumptions #25

Open
shreyasgm opened this issue Apr 7, 2025 · 1 comment

@shreyasgm
Owner

  1. CSS Class Dependencies

The scraper relies on specific CSS classes that could change in a website redesign:

title_element = pub_element.find("span", {"class": "biblio-title"})
authors_element = pub_element.find("span", {"class": "biblio-authors"})
abstract_element = pub_element.find("div", {"class": "biblio-abstract-display"})

Potential Solutions:

  • Create a selector configuration dictionary at the top of the class for easy updates
  • Implement a monitoring system to detect when selectors stop working
  • Add fallback selectors for critical data points
  • Consider adding XPath alternatives for critical elements
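
A rough sketch of the first and third suggestions combined, assuming a hypothetical `SELECTORS` table and `find_with_fallback` helper (neither exists in the scraper today). Only the first selector for each field is taken from the code above; the fallback entries are made-up placeholders. The helper only needs an element with a BeautifulSoup-style `find(name, attrs)` method:

```python
from typing import Any, Optional

# Hypothetical selector table: each field lists (tag, attrs) candidates in
# priority order. The first entry per field is the selector currently used by
# the scraper; the second entries are placeholder fallbacks, not real classes.
SELECTORS = {
    "title": [
        ("span", {"class": "biblio-title"}),
        ("h2", {"class": "title"}),  # assumed fallback
    ],
    "authors": [
        ("span", {"class": "biblio-authors"}),
        ("div", {"class": "authors"}),  # assumed fallback
    ],
    "abstract": [
        ("div", {"class": "biblio-abstract-display"}),
        ("div", {"class": "abstract"}),  # assumed fallback
    ],
}


def find_with_fallback(pub_element: Any, field: str) -> Optional[Any]:
    """Return the first match for `field`, trying its selectors in order."""
    for tag, attrs in SELECTORS[field]:
        match = pub_element.find(tag, attrs)
        if match is not None:
            return match
    return None  # every selector failed; the caller decides how to report it
```

With this in place, `title_element = find_with_fallback(pub_element, "title")` would replace the direct `find` call, and a redesign would only require editing the table.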

  2. Endnote Format Assumptions

The Endnote file parsing assumes a very consistent format:

for line in lines:
    if line.startswith("%"):
        key = line[1]
        value = line[3:].strip()
        
        if key == "X":  # Abstract
            soup = BeautifulSoup(value, "html.parser")

Problems:

  • Inadequate documentation about expected format
  • Unclear if abstracts always contain HTML
  • No validation for unexpected format variations
  • No robust error handling for malformed files

Potential Solutions:

  • Consider using a dedicated bibliographic parsing library
  • Add explicit format validation and error recovery
  • Document format assumptions clearly
  • Add unit tests with various Endnote file formats
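
As a starting point for the validation and error-recovery suggestion, here is a minimal stdlib-only sketch. The `parse_endnote` function and its return shape are hypothetical. It assumes each field is a `%` followed by a one-character tag and an optional value, and that untagged lines continue the previous field; both assumptions would need checking against real export files:

```python
import re
from collections import defaultdict

# "%" + one non-space tag character, then optionally whitespace and a value.
TAG_LINE = re.compile(r"^%(\S)(?:\s(.*))?$")


def parse_endnote(text):
    """Parse tagged text into ({tag: [values]}, [warnings]) without raising."""
    fields = defaultdict(list)
    warnings = []
    last_tag = None
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.rstrip()
        if not line:
            continue  # blank lines carry no data
        match = TAG_LINE.match(line)
        if match:
            last_tag = match.group(1)
            fields[last_tag].append((match.group(2) or "").strip())
        elif line.startswith("%"):
            # Looks like a tag line but does not fit the expected shape.
            warnings.append(f"line {lineno}: malformed tag line {line!r}")
            last_tag = None
        elif last_tag is not None:
            # Assumption: an untagged line continues the previous value.
            fields[last_tag][-1] += " " + line.strip()
        else:
            warnings.append(f"line {lineno}: text outside any field {line!r}")
    return dict(fields), warnings
```

Returning warnings instead of raising lets the scraper keep the fields that did parse and log the rest. Whether the `X` value is then passed through an HTML parser can be decided separately, once it is known if abstracts always contain markup.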
@shreyasgm shreyasgm self-assigned this Apr 7, 2025
@shreyasgm shreyasgm added this to the ETL milestone Apr 7, 2025
@shreyasgm shreyasgm assigned karandaryanani and unassigned shreyasgm Apr 7, 2025
@shreyasgm shreyasgm added the help wanted Extra attention is needed label Apr 9, 2025
karandaryanani added a commit that referenced this issue Apr 22, 2025
- Added configurable selector system with fallbacks for resilient extraction
- Implemented SelectorMonitor to track and report selector performance
- Added XPath support after discovering frequent failures in file paths and authors
- A little bit of updating endnote html parsing but need to come back to that
@karandaryanani
Collaborator

Completed:

  • Added a configurable selector system with fallbacks to make data extraction more resilient to site changes
  • Implemented a SelectorMonitor class to track and report selector performance, making it easier to identify when selectors need updating
  • Added XPath support as a final fallback mechanism, after discovering frequent failures in file paths and author extraction
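
For readers who have not opened the commit, the monitoring idea can be illustrated roughly as below. This is not the `SelectorMonitor` from the commit, whose interface is not shown in this thread; the class name `SelectorMonitorSketch` and its methods are invented for illustration only:

```python
from collections import Counter


class SelectorMonitorSketch:
    """Illustrative only: counts attempts and hits per selector name."""

    def __init__(self):
        self.attempts = Counter()
        self.hits = Counter()

    def record(self, selector_name, found):
        """Record one extraction attempt and whether it found an element."""
        self.attempts[selector_name] += 1
        if found:
            self.hits[selector_name] += 1

    def success_rate(self, selector_name):
        """Return hits / attempts, or None if the selector was never tried."""
        attempts = self.attempts[selector_name]
        if attempts == 0:
            return None
        return self.hits[selector_name] / attempts

    def failing(self, threshold=0.5):
        """Return sorted names whose success rate is below `threshold`."""
        return sorted(
            name for name in self.attempts
            if self.success_rate(name) < threshold
        )
```

Calling `record` after each extraction attempt and checking `failing()` at the end of a run would flag selectors that probably need updating.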

Remaining for PR completion:

  • Need to complete the enhanced EndNote HTML parsing improvements
