Growth Lab scraper assumptions #25

1. CSS Class Dependencies

The scraper relies on specific CSS classes that could change in a website redesign:

```python
title_element = pub_element.find("span", {"class": "biblio-title"})
authors_element = pub_element.find("span", {"class": "biblio-authors"})
abstract_element = pub_element.find("div", {"class": "biblio-abstract-display"})
```

**Potential Solutions:**
- Create a selector configuration dictionary at the top of the class for easy updates
- Implement a monitoring system to detect when selectors stop working
- Add fallback selectors for critical data points
- Consider adding XPath alternatives for critical elements
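The first three suggestions can be combined in one small helper. The following is a minimal sketch, not the scraper's actual code: `SELECTORS`, `find_first`, and every fallback class name are hypothetical, and only the first selector of each field is taken from the snippet above. It relies only on the element exposing a `find(tag, attrs)` method, which a BeautifulSoup `Tag` does.

```python
import logging

logger = logging.getLogger(__name__)

# Field name -> ordered list of (tag, attrs) candidates.
# The first candidate per field is the selector currently used by the scraper;
# the later ones are hypothetical placeholders for fallbacks.
SELECTORS = {
    "title": [
        ("span", {"class": "biblio-title"}),
        ("span", {"class": "title"}),  # hypothetical fallback
    ],
    "authors": [
        ("span", {"class": "biblio-authors"}),
        ("span", {"class": "authors"}),  # hypothetical fallback
    ],
    "abstract": [
        ("div", {"class": "biblio-abstract-display"}),
        ("div", {"class": "abstract"}),  # hypothetical fallback
    ],
}


def find_first(element, field, selectors=SELECTORS):
    """Return the first match for `field`, trying each candidate in order.

    `element` is anything with a `find(tag, attrs)` method, such as a
    BeautifulSoup Tag. Returns None when no candidate matches.
    """
    for position, (tag, attrs) in enumerate(selectors[field]):
        found = element.find(tag, attrs)
        if found is not None:
            if position > 0:
                # A fallback matched: the primary selector has probably changed.
                logger.warning("Primary selector for %r failed; used fallback %d",
                               field, position)
            return found
    logger.warning("No selector matched for %r", field)
    return None
```

With this in place, a call such as `find_first(pub_element, "title")` would replace the direct `pub_element.find(...)` lookups, and the logged warnings give a basic signal that a selector has stopped working.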

2. Endnote Format Assumptions

The Endnote file parsing assumes that every field line has the same fixed shape: a `%` sign, a one-character tag, a single space, and then the value:

```python
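for line in lines:
    if line.startswith("%"):
        # Assumes a one-character tag directly after "%";
        # a bare "%" line would raise IndexError here.
        key = line[1]
        # Assumes exactly one separator character between tag and value.
        value = line[3:].strip()

        if key == "X":  # Abstract
            # Assumes the abstract may contain HTML markup.
            soup = BeautifulSoup(value, "html.parser")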
```

**Problems:**
- Inadequate documentation about expected format
- Unclear if abstracts always contain HTML
- No validation for unexpected format variations
- No robust error handling for malformed files

**Potential Solutions:**
- Consider using a dedicated bibliographic parsing library
- Add explicit format validation and error recovery
- Document format assumptions clearly
- Add unit tests with various Endnote file formats
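As an illustration of explicit format validation, here is a minimal sketch that checks each line against the shape the current slicing implies and records anything that does not fit instead of silently misreading it. The names `TAG_LINE` and `parse_endnote_lines` are hypothetical, and treating non-blank lines without a `%` tag as problems is an assumption; real exports may use continuation lines that need different handling.

```python
import re
from collections import defaultdict

# Assumed line shape, inferred from the slicing in the existing parser:
# "%", one non-space tag character, one space, then the value.
TAG_LINE = re.compile(r"^%(\S) (.*)$")


def parse_endnote_lines(lines):
    """Parse tagged lines into {tag: [values]} and collect malformed lines.

    Returns a (fields, problems) tuple, where problems is a list of
    (line_number, line_text) pairs for lines that did not match TAG_LINE.
    """
    fields = defaultdict(list)
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # blank lines are ignored
        match = TAG_LINE.match(line)
        if match is None:
            problems.append((number, line))
            continue
        key, value = match.groups()
        fields[key].append(value.strip())
    return dict(fields), problems


fields, problems = parse_endnote_lines(["%T A title", "%A Ann", "%A Bob", "%", "stray"])
print(fields)    # {'T': ['A title'], 'A': ['Ann', 'Bob']}
print(problems)  # [(4, '%'), (5, 'stray')]
```

Returning the problem lines rather than raising lets the caller decide whether a malformed file should be skipped, logged, or treated as a hard failure, and the small input list above doubles as a starting point for unit tests.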
