Python Automation Cookbook

Subscribing to feeds

RSS is probably the best-kept secret of the internet. Its time in the spotlight was during the 2000s, but it still enables easy subscription to websites. It is present on lots of websites and it's incredibly useful.

At its core, RSS is a way of presenting a succession of ordered references (typically articles, but also other elements such as podcast episodes or YouTube videos) along with their publishing times. This makes for a very natural way of learning which articles are new since the last check, as well as presenting some structured data about them, such as the title and a summary.
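
As an illustration, here is a minimal, simplified RSS 2.0 document (invented for this example, not taken from a real feed). Each item carries a reference (link), a publication time (pubDate), and some structured data such as the title and a description:

<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>http://example.com/</link>
    <description>An illustrative feed</description>
    <item>
      <title>First article</title>
      <link>http://example.com/articles/1</link>
      <pubDate>Fri, 24 Jan 2020 10:00:00 +0000</pubDate>
      <description>A short summary of the article</description>
    </item>
  </channel>
</rss>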

In this recipe, we will introduce the feedparser module and show how to obtain data from an RSS feed.

RSS is not the only available feed format. There's also a format called Atom that serves the same purpose, and feedparser is capable of parsing it as well, so both formats can be processed in the same way.
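
As a quick check, the version attribute of the parsed result reports which format was detected. Here we parse GitHub's Atom feed for CPython releases, used purely as an example of an Atom feed:

>>> import feedparser
>>> atom = feedparser.parse('https://github.com/python/cpython/releases.atom')
>>> atom.version
'atom10'

The entries of an Atom feed are then accessed in exactly the same way as the RSS entries in the rest of this recipe.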

Getting ready

We need to add the feedparser dependency to our requirements.txt file and install the requirements again:

$ echo "feedparser==5.2.1" >> requirements.txt
$ pip install -r requirements.txt

Feed URLs can be found on almost all pages that deal with publications, including blogs, news sites, and podcasts. Sometimes they are very easy to find, but other times they are a little bit hidden; search the page for feed or RSS.

Most newspapers and news agencies have their RSS feeds divided by themes. For our example, we'll parse the New York Times main page feed, https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml. There are more feeds available on the main feed page: https://archive.nytimes.com/www.nytimes.com/services/xml/rss/index.html.

Please note that the feeds may be subject to terms and conditions of use. In the case of the New York Times, the terms and conditions are described at the end of the main feed page.

Please note that this feed changes quite often, meaning that the linked entries will be different from the examples in this book.

How to do it...

  1. Import the feedparser module, as well as datetime, delorean, and requests:
    >>> import feedparser
    >>> import datetime
    >>> import delorean
    >>> import requests
    
  2. Parse the feed (it will be downloaded automatically) and check when it was last updated. Feed information, like the title of the feed, can be obtained in the feed attribute:
    >>> rss = feedparser.parse('https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml')
    >>> rss.channel.updated
    'Fri, 24 Jan 2020 19:42:27 +0000'
    
  3. Get the entries that are less than six hours old:
    >>> time_limit = delorean.parse(rss.channel.updated) - datetime.timedelta(hours=6)
    >>> entries = [entry for entry in rss.entries if delorean.parse(entry.published) > time_limit]
    
  4. Fewer entries are returned than the total in the feed, as some of them are older than six hours:
    >>> len(entries)
    28
    >>> len(rss.entries)
    54
    
  5. Retrieve information about the entries, such as the title. The full entry URL is available as link. Explore the available information in this particular feed:
    >>> entries[18]['title']
    'These People Really Care About Fonts'
    >>> entries[18]['link']
    'https://www.nytimes.com/2020/01/24/style/typography-font-design.html?emc=rss&partner=rss'
    >>> requests.get(entries[18].link)
    <Response [200]>
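
Putting steps 1 to 5 together, the whole process fits in a small helper function. This is only a sketch based on the steps above, and the name get_recent_entries is our own choice:

import datetime

import delorean
import feedparser

def get_recent_entries(feed_url, hours=6):
    """Return the feed entries published in the last `hours` hours."""
    parsed = feedparser.parse(feed_url)
    # Use the feed's own update time as the reference point, as in step 3
    time_limit = delorean.parse(parsed.channel.updated) - datetime.timedelta(hours=hours)
    return [entry for entry in parsed.entries
            if delorean.parse(entry.published) > time_limit]

for entry in get_recent_entries('https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml'):
    print(entry.title, entry.link)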
    

How it works...

The parsed feed object contains the information for the entries, as well as general information about the feed itself, such as when it was last updated. The feed information can be found in the feed attribute:

>>> rss.feed.title
'NYT > Top Stories'

Each of the entries works as a dictionary, so the fields are easy to retrieve. They can also be accessed as attributes, but treating them as dictionaries lets us list all the available fields:

>>> entries[5].keys()
dict_keys(['title', 'title_detail', 'links', 'link', 'id', 'guidislink', 'media_content', 'summary', 'summary_detail', 'media_credit', 'credit', 'content', 'authors', 'author', 'author_detail', 'published', 'published_parsed', 'tags'])

The basic strategy when dealing with feeds is to parse them and go through the entries, performing a quick check on whether they are interesting or not, for example, by checking the description or summary. If an entry seems worth it, it can be fully downloaded through the link field. Then, to avoid rechecking entries, store the latest publication date and, on the next check, only process newer entries, as sketched below.
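
A minimal sketch of that strategy might look like this, where process() is a hypothetical placeholder for whatever should happen with each new entry, and last_seen would be loaded from wherever it was persisted after the previous run:

import delorean
import feedparser

def check_new_entries(feed_url, last_seen):
    """Process entries newer than last_seen; return the new high-water mark."""
    parsed = feedparser.parse(feed_url)
    newest = last_seen
    for entry in parsed.entries:
        published = delorean.parse(entry.published)
        if published > last_seen:
            process(entry)  # hypothetical placeholder, e.g. requests.get(entry.link)
            newest = max(newest, published)
    # Persist `newest` (to a file or database) before the next run
    return newest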

There's more...

The full feedparser documentation can be found here: https://pythonhosted.org/feedparser/.

The information available can differ from feed to feed. In the New York Times example, there's a tags field with tag information, but this is not standard. As a minimum, entries will have a title, a description, and a link.
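
In feedparser, each element of tags is a dictionary-like object with a term key, so the tags of the recent entries can be read defensively, assuming the field may be missing in other feeds:

# `tags`, when present, is a list of dict-like objects with a 'term' key;
# .get() avoids a KeyError on feeds that don't include it
for entry in entries:
    terms = [tag['term'] for tag in entry.get('tags', [])]
    print(entry.title, terms)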

RSS feeds are also a great way of curating your own selection of news sources. There are great feed readers for that.

See also

  • The Installing third-party packages recipe in Chapter 1, Let's Begin Our Automation Journey, to learn the basics of installing external modules.
  • The Downloading web pages recipe, earlier in this chapter, to learn more about making requests and obtaining remote pages.