Some older feeds showing up after a cache clear #124

Open
tjguk opened this issue Mar 11, 2016 · 5 comments

Comments

@tjguk
Member

tjguk commented Mar 11, 2016

Problem

After we cleared the cache to address another issue, some very old posts showed up.

Details

A quick investigation suggests that these feeds / posts don't provide the date fields which the planet / feedparser software looks for, so the code falls back to a default value such as today's date.
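
A minimal sketch of the suspected fallback, to make the failure mode concrete. The key names mirror feedparser's parsed-date entry fields, but the function name `entry_timestamp` and the fallback logic itself are assumptions for illustration, not code taken from planet.

```python
import calendar
import time

# feedparser exposes parsed dates under these entry keys when the feed
# supplies them; the lookup order here is an assumption.
DATE_KEYS = ("updated_parsed", "published_parsed", "created_parsed")


def entry_timestamp(entry, now=None):
    """Return (timestamp, used_fallback) for a feed entry.

    `entry` is any mapping; `now` can be injected to make tests repeatable.
    """
    for key in DATE_KEYS:
        value = entry.get(key)
        if value is not None:
            # Parsed dates are UTC time tuples, so convert with timegm.
            return calendar.timegm(value), False
    # No usable date: the entry is stamped with the current time, which
    # makes an old post look brand new after a cache clear.
    return (time.time() if now is None else now), True
```

An entry with no date fields comes back flagged as a fallback, which is the case that would let an old post pass any "too old" filter.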

@tjguk
Member Author

tjguk commented Mar 11, 2016

Examples include:

[http://www.sdjournal.com/archives/categories/languages/python/rss.xml]
name = SDJournal

[http://online.effbot.org/rss.xml]
name = Fredrik Lundh

[http://www.artima.com/weblogs/feeds/bloggers/micheles.rss]
name = Michele Simionato
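
Entries in this form can be read with Python's standard `configparser`. This is a rough sketch, assuming config.ini uses the feed URL as the section name as in the examples above; the helper name `list_feeds` and the inline sample are illustrative only.

```python
import configparser

SAMPLE = """\
[http://online.effbot.org/rss.xml]
name = Fredrik Lundh

[http://www.artima.com/weblogs/feeds/bloggers/micheles.rss]
name = Michele Simionato
"""


def list_feeds(text):
    """Return (url, name) pairs for every section that looks like a feed URL."""
    # Interpolation is disabled so '%' characters in URLs are left alone.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return [
        (section, parser[section].get("name", ""))
        for section in parser.sections()
        if section.startswith(("http://", "https://"))
    ]
```

Filtering on the URL prefix skips any non-feed sections the real file may contain.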

@rochacbruno
Member

We should create a script to validate each feed URL against the W3C RSS validator, so we can remove the invalid feeds.

We also need to store the email address of the person responsible for each feed, so we can notify them about these issues.

What do you think?

I can write a script to validate each feed and remove the invalid ones from config.ini.

@tjguk
Member Author

tjguk commented Mar 11, 2016 via email

@rochacbruno
Member

I checked the 2 examples you mentioned above and they are valid according to the RSS validator, so the validator script will not help with this issue.

However, I will write a script anyway to validate the feeds and check for required fields such as update_date.

@tjguk
Member Author

tjguk commented Mar 11, 2016

After a trawl through the code, it really comes down to two things (I think):

The first is -- I think -- why we're only seeing one item for those older feeds which have shown up. The latter is why we're not excluding all the items as being too old.

@rochacbruno if you were building a validator, I'd add a check that the entries have one of those date elements and, ideally, that the feed itself has an "updated" element.
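
A rough sketch of what such a check could look like, using only the standard library. The element names tested (`pubDate`, `dc:date`, `updated`, `published`, `lastBuildDate`) are my assumption about which date elements matter, and `check_feed_dates` is a hypothetical helper, not part of planet or feedparser.

```python
import xml.etree.ElementTree as ET

ENTRY_TAGS = {"item", "entry"}  # RSS items and Atom entries
ENTRY_DATE_TAGS = {"pubDate", "date", "updated", "published"}
FEED_DATE_TAGS = {"lastBuildDate", "pubDate", "date", "updated"}


def local(tag):
    """Drop any '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def has_date(element, wanted):
    """True if a direct child named in `wanted` has non-empty text."""
    return any(
        local(child.tag) in wanted and bool((child.text or "").strip())
        for child in element
    )


def title_of(element):
    """Return the element's title text, or a placeholder if it has none."""
    for child in element:
        if local(child.tag) == "title":
            return (child.text or "").strip()
    return "(untitled)"


def check_feed_dates(xml_text):
    """Report whether the feed and each of its entries carry a date."""
    root = ET.fromstring(xml_text)
    # RSS keeps feed metadata under <channel>; Atom keeps it on the root.
    channel = next(
        (el for el in root.iter() if local(el.tag) == "channel"), root
    )
    entries = [el for el in root.iter() if local(el.tag) in ENTRY_TAGS]
    return {
        "feed_has_date": has_date(channel, FEED_DATE_TAGS),
        "undated_entries": [
            title_of(e) for e in entries if not has_date(e, ENTRY_DATE_TAGS)
        ],
    }


SAMPLE = """<rss version="2.0"><channel><title>Example</title>
<item><title>Dated</title><pubDate>Fri, 11 Mar 2016 10:00:00 GMT</pubDate></item>
<item><title>Undated</title></item>
</channel></rss>"""

print(check_feed_dates(SAMPLE))
```

This only checks that a date element is present and non-empty; it does not verify that the date parses.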
