Restore functionality to resist temporary bad TED responses when parsing video pages #209

Closed
benoit74 opened this issue Jun 28, 2024 · 2 comments · Fixed by #214

Comments

@benoit74
Collaborator

To retrieve video information, the TED scraper fetches the video page at a URL like https://ted.com/talks/franco_sacchi_a_tour_of_nollywood_nigeria_s_booming_film_industry?language=nl and looks for the __NEXT_DATA__ JSON inside the page, which contains, among other things, the localized title and description.

This is done in the extract_info_from_video_page function in scraper.py.
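For illustration only, here is a minimal sketch of that lookup with an explicit guard for a missing payload. It is not the code in scraper.py: the helper name `extract_next_data` and the regex-based approach are assumptions chosen to keep the example standard-library only, and the sketch also assumes the JSON sits in a `<script id="__NEXT_DATA__">` tag.

```python
import json
import re

# Hypothetical helper, not the actual scraper code.
# Assumes the payload lives in <script id="__NEXT_DATA__" ...>...</script>.
NEXT_DATA_RE = re.compile(
    r'<script[^>]*\bid="__NEXT_DATA__"[^>]*>(.*?)</script>',
    re.DOTALL,
)


def extract_next_data(html: str):
    """Return the parsed __NEXT_DATA__ JSON, or None if the page lacks it."""
    match = NEXT_DATA_RE.search(html)
    if match is None:
        # Bad response: report it to the caller instead of crashing on None.
        return None
    return json.loads(match.group(1))
```

Returning None for a page without the tag lets the caller decide what to do with a bad response, rather than failing with an attribute error on a missing element.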

We currently have a few recipes intermittently failing with the error An error occurred: 'NoneType' object has no attribute 'string'.

Looking at the HTML content returned in those cases, there is no __NEXT_DATA__ JSON inside the page.

Loading the page again on my machine, the __NEXT_DATA__ JSON is there.

So clearly the scraper should be more resilient to intermittent bad responses from the TED server.

This was indeed the case in 2.10.0, where extract_info_from_video_page had retry logic; it got dropped in https://github.com/openzim/ted/pull/130/files when adapting to the new DOM.

I think we should just restore this functionality by again pausing 5 seconds and retrying up to 5 times, just like in 2.10.0.
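A rough sketch of what such a retry loop could look like is below. It is not the 2.10.0 code: `fetch_with_retry`, `MissingNextDataError`, and the `fetch` callable are hypothetical names, and the sketch assumes "up to 5 times" means 5 attempts in total, with a 5-second pause between attempts.

```python
import time


class MissingNextDataError(Exception):
    """Hypothetical error raised when every attempt returned a bad page."""


def fetch_with_retry(fetch, attempts=5, delay=5.0, sleep=time.sleep):
    """Call fetch() until it returns a non-None value.

    fetch    -- hypothetical callable returning the parsed data, or None
                when the response is unusable
    attempts -- total number of tries (assumed to be 5)
    delay    -- pause in seconds between tries (assumed to be 5)
    sleep    -- injectable so the pause can be skipped in tests
    """
    for attempt in range(1, attempts + 1):
        result = fetch()
        if result is not None:
            return result
        if attempt < attempts:
            # Only pause when another attempt will follow.
            sleep(delay)
    raise MissingNextDataError(f"no usable response after {attempts} attempts")
```

Raising a dedicated error after the last attempt keeps the failure explicit in the logs instead of surfacing as a generic attribute error.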

@benoit74 benoit74 added this to the 3.1.0 milestone Jul 1, 2024
@benoit74
Collaborator Author

benoit74 commented Jul 1, 2024

Moving this to 3.1.0: it is mostly straightforward to implement and seems to be randomly impacting about 5-10% of the recipes.

@benoit74
Collaborator Author

Since we currently have no plan for when we will be able to work on 3.1.0, and since this bug makes it nearly impossible for https://farm.openzim.org/recipes/ted_topic_all to succeed, I'm going to make a patch release 3.0.3.
