Growth Lab scraper assumptions #25

Open
@shreyasgm

Description

  1. CSS Class Dependencies

The scraper relies on specific CSS classes that could change in a website redesign:

title_element = pub_element.find("span", {"class": "biblio-title"})
authors_element = pub_element.find("span", {"class": "biblio-authors"})
abstract_element = pub_element.find("div", {"class": "biblio-abstract-display"})

Potential Solutions:

  • Create a selector configuration dictionary at the top of the class for easy updates
  • Implement a monitoring system to detect when selectors stop working
  • Add fallback selectors for critical data points
  • Consider adding XPath alternatives for critical elements
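A minimal sketch of the first and third bullets: centralize every selector in one dictionary, with an ordered list of fallbacks per field. The primary class names below mirror the ones the scraper currently hard-codes; the fallback selectors and the `find_with_fallback` helper are illustrative assumptions, not existing code.

```python
from bs4 import BeautifulSoup

# Hypothetical selector configuration. The first (tag, class) pair per field
# matches the scraper's current hard-coded selectors; the second entries are
# placeholder fallbacks for a redesigned page, not real Growth Lab markup.
SELECTORS = {
    "title": [("span", "biblio-title"), ("h2", "pub-title")],
    "authors": [("span", "biblio-authors"), ("div", "pub-authors")],
    "abstract": [("div", "biblio-abstract-display"), ("div", "pub-abstract")],
}

def find_with_fallback(pub_element, field):
    """Try each (tag, class) pair in order; return the first match or None."""
    for tag, css_class in SELECTORS[field]:
        element = pub_element.find(tag, {"class": css_class})
        if element is not None:
            return element
    return None

html = '<div><span class="biblio-title">Growth Diagnostics</span></div>'
pub = BeautifulSoup(html, "html.parser")
title = find_with_fallback(pub, "title")
```

Returning `None` (rather than raising) for every missed field also gives a natural hook for the monitoring bullet: counting `None` results per field in a scrape run would flag a selector that has silently stopped matching.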
  2. Endnote Format Assumptions

The Endnote file parsing assumes a very consistent format:

for line in lines:
    if line.startswith("%"):
        key = line[1]
        value = line[3:].strip()
        
        if key == "X":  # Abstract
            soup = BeautifulSoup(value, "html.parser")

Problems:

  • Inadequate documentation about expected format
  • Unclear if abstracts always contain HTML
  • No validation for unexpected format variations
  • No robust error handling for malformed files

Potential Solutions:

  • Consider using a dedicated bibliographic parsing library
  • Add explicit format validation and error recovery
  • Document format assumptions clearly
  • Add unit tests with various Endnote file formats
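As a sketch of explicit format validation with error recovery, assuming the common EndNote export convention of %-prefixed tag lines (`%T` title, `%A` author, `%X` abstract) separated by blank lines. `parse_endnote` is a hypothetical helper, not the scraper's current parser:

```python
def parse_endnote(text):
    """Parse EndNote-style tagged text into records, collecting (not raising)
    errors for lines that do not match the expected "%K value" shape."""
    records, current, errors = [], {}, []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            if current:  # a blank line ends the current record
                records.append(current)
                current = {}
            continue
        # Validate the "%K value" shape instead of assuming it.
        if len(line) < 3 or not line.startswith("%") or line[2] != " ":
            errors.append(f"line {lineno}: unexpected format: {line!r}")
            continue
        key, value = line[1], line[3:].strip()
        current.setdefault(key, value)
    if current:
        records.append(current)
    return records, errors

sample = "%T Growth Diagnostics\n%A Hausmann, Ricardo\n%X <p>An abstract.</p>\n"
records, errors = parse_endnote(sample)
```

Malformed lines land in `errors` with their line numbers instead of crashing the run, which also gives the unit tests in the last bullet something concrete to assert against.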

Labels: help wanted
