Tutorial: How to Build a News Monitoring App

News monitoring using Diffbot Knowledge Graph

📘

TL;DR

Skip the prose, go straight to code: news-monitoring.ipynb

With over 1.5 billion articles and continuous crawling of major news sources, the Diffbot Knowledge Graph (KG) provides a vast database for up-to-date news. By leveraging our platform, you can monitor events, trends, and developments in various industries, countries, and topics of interest. You can customize your search parameters based on your preferences, including language, source, date range, and article type.

This guide will describe how to build a simple news monitoring application using Diffbot KG.

Prerequisites

  • You will need a token to access the Diffbot KG. You can sign up for a trial or paid plan on the Plans & Pricing page to get a token.
  • Basic programming or scripting experience with HTTP requests and file or database access. You will find Python examples in this guide.
  • Basic knowledge of DQL - the query language that you will use to query the KG. This tutorial will describe the relevant queries but you can get more details from the DQL documentation.

Basic Concepts

Before writing any code, let's talk about some basic concepts that you will need to be familiar with to implement news monitoring.

Crawl Time vs. Publication Date

When dealing with articles, there are two different kinds of dates to keep in mind: the date the article was published by the publisher, and the date that Diffbot crawled the article.

  • The date field in the article JSON is the date the article was originally published. Diffbot's ML models extract the date visually from the article, the same way a human sees it, and from metadata found in the page source. Bear in mind that not all articles have a date specified.
  • The lastCrawlTime field in the article JSON indicates the time when Diffbot's crawlers found the article. This can vary depending on how often the website is crawled: it may take only seconds for an article to be discovered, or several days after the publication date. The sketch below shows both fields side by side.
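
To make the distinction concrete, here is a minimal sketch that prints both dates for a single article record. The record is hypothetical and trimmed down, and the exact shape of the date field is an assumption for illustration; lastCrawlTime is epoch seconds, as used throughout this guide.

import time

# Hypothetical, trimmed article record (real records contain many more fields;
# the exact shape of the "date" field here is an assumption for illustration)
article = {
    "date": {"str": "d2023-04-03T09:15", "timestamp": 1680513300000},
    "lastCrawlTime": 1680515400,  # epoch seconds when Diffbot crawled the page
}

# Not every article has a publication date
published = article.get("date", {}).get("str", "unknown")
crawled = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(article["lastCrawlTime"]))

print(f"Published: {published}")
print(f"Crawled:   {crawled} UTC")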

Example: Monitoring Earthquake News

We will build an application to monitor news about earthquakes every hour. This is the general flow that we will follow:

  1. Write a DQL query that searches for articles with "Earthquake" in their title and filters them by crawl time.
  2. Write a Python script to retrieve articles. To start monitoring, retrieve articles crawled in the past day. Record the timestamp for one day ago in a file called timestamp.chk. If the query returns new articles, update timestamp.chk with the most recent timestamp.
  3. Set up a cron job to run this script every hour. When the script runs, it checks the timestamp in timestamp.chk to determine the most recent articles to retrieve.

Step 1: DQL query

The easiest way to fetch all articles from the last day with the word "earthquake" in their title is this DQL query: type:Article title:'Earthquake' lastCrawlTime<1d. For the news monitoring application, however, we will use a variant of the date-time filter and specify the actual date as an epoch timestamp (the number of seconds since Jan 1, 1970). The query looks like type:Article title:'Earthquake' lastCrawlTime>${EPOCH_TIME_1_DAY_AGO}.

Because this requires us to compute EPOCH_TIME_1_DAY_AGO, it is best done in a script:

import time

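# Epoch timestamp for 1 day ago (86400 seconds per day)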
EPOCH_TIME_1_DAY_AGO = int(time.time()) - 1 * 86400
query = f"type:Article title:'Earthquake' lastCrawlTime>{EPOCH_TIME_1_DAY_AGO}"

Step 2: Bootstrap monitoring

The next step is to bootstrap monitoring:

  1. Record the timestamp for 1 day ago in a file called timestamp.chk.
  2. Get articles crawled since the time recorded in timestamp.chk (i.e., in the past day).
  3. If the query returns new articles, update timestamp.chk with the most recent timestamp.

Record timestamp for 1 day ago

The get_latest_crawl_time function reads latest_crawl_time from the file timestamp.chk. If the file does not exist, it bootstraps timestamp.chk with the time 1 day ago.

The set_latest_crawl_time function writes latest_crawl_time to timestamp.chk.

import time
import os.path

# Read latest_crawl_time from file timestamp.chk
def get_latest_crawl_time():
    # If timestamp.chk does not exist, create it and set the time to 1 day ago
    if not os.path.isfile('timestamp.chk'):
        # Get the time 1 day ago
        epoch_time_1_day_ago = int(time.time()) - 1 * 86400

        # record this in file timestamp.chk
        with open('timestamp.chk', 'w') as f:
            f.write(str(epoch_time_1_day_ago))

    with open('timestamp.chk', 'r') as f:
        latest_crawl_time = int(f.read())
        return latest_crawl_time


# Update latest_crawl_time in file timestamp.chk
def set_latest_crawl_time(latest_crawl_time):
    with open('timestamp.chk', 'w') as f:
        f.write(str(latest_crawl_time))
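
A quick sanity check of these two helpers, run from the directory where timestamp.chk lives (the +60 bump is only for illustration):

# First call creates timestamp.chk set to 1 day ago; later calls just read it
checkpoint = get_latest_crawl_time()
print('Checkpoint:', time.strftime('%Y-%m-%d %H:%M:%S', time.gmtime(checkpoint)), 'UTC')

# Advance the checkpoint, as the monitoring script does after each batch
set_latest_crawl_time(checkpoint + 60)
assert get_latest_crawl_time() == checkpoint + 60
set_latest_crawl_time(checkpoint)  # restore the original checkpoint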

Function to get articles from DQL

import requests

def query_dql(last_recorded_crawl_time, token):
    querystring = {
        'token': token,
        'query': f'type:Article title:"Earthquake" lastCrawlTime>{last_recorded_crawl_time}',
        'format': "jsonl", # get results one JSON object per line
        'size': -1         # get all records
    }

    return requests.get('https://kg.diffbot.com/kg/v3/dql', params=querystring)
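
You can exercise this function on its own before wiring it into the monitoring flow. A minimal sketch, where DIFFBOT_TOKEN is a placeholder for your own token (the download_articles function below assumes it is defined like this):

import json

DIFFBOT_TOKEN = 'YOUR_DIFFBOT_TOKEN'  # placeholder: replace with your token

response = query_dql(get_latest_crawl_time(), DIFFBOT_TOKEN)
response.raise_for_status()  # fail fast on HTTP errors (e.g. a bad token)

# Peek at the first result, if any
first_line = next(response.iter_lines(), None)
if first_line:
    print(json.loads(first_line).get('title'))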

Fetch newly crawled articles

The download_articles function fetches articles crawled since the time recorded in timestamp.chk, writes them to a timestamped .jsonl file, and updates the checkpoint.

import json

def download_articles():
    # Query DQL
    latest_crawl_time = get_latest_crawl_time()
    print(f'Querying DQL with lastCrawlTime > {latest_crawl_time} ({time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(latest_crawl_time))} UTC)')
    response = query_dql(latest_crawl_time, DIFFBOT_TOKEN)

    # Write to file with name "news-monitoring-yyyymmdd-hh.jsonl"
    # where yyyymmdd-hh is the current time in UTC
    utc_time = time.strftime('%Y%m%d-%H', time.gmtime())
    filename = f'news-monitoring-{utc_time}.jsonl'

    counter = 0
    with open(filename, 'wb') as f:
        # write the response line by line
        for line in response.iter_lines():
            if line:
                f.write(line)
                f.write(b'\n')
                counter += 1

                # Get the lastCrawlTime of the article
                article = json.loads(line)
                last_crawltime = article['lastCrawlTime']
                # Update last_crawltime if the article is newer
                if last_crawltime > latest_crawl_time:
                    latest_crawl_time = last_crawltime

    # Update latest_crawl_time in file timestamp.chk
    set_latest_crawl_time(latest_crawl_time)
    print(f'Wrote {counter} articles to {filename} with lastCrawlTime <= {latest_crawl_time} ({time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(latest_crawl_time))})')
    

# call download_articles()
download_articles()


Step 3: Cron job to monitor for new articles every hour

Use the following crontab entry to schedule the Python script to run every hour:

0 * * * * python3 /path/to/your/news_monitoring.py

There are several resources on the web with instructions for configuring a cron job. Here's one: How to Set Up a Cron Job in Linux
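
If you also want a record of each run, a common variant (the paths are placeholders) appends the script's output, including the print statements above, to a log file:

0 * * * * python3 /path/to/your/news_monitoring.py >> /path/to/your/news_monitoring.log 2>&1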

Alternatively, you can call the download_articles() function every hour from the script itself in a loop:

# Call download_articles() every hour
while True:
    download_articles()
    time.sleep(3600)
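
Note that time.sleep(3600) lets the schedule drift, because each run itself takes time. If you would rather align iterations to the top of the hour, a sketch like this sleeps only until the next hour boundary:

# Call download_articles() at the top of every hour
while True:
    download_articles()
    time.sleep(3600 - (time.time() % 3600))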

Code

The entire code is available here: news-monitoring.ipynb

Filters to Refine Your News Feed

There are many ways to filter articles for your news monitoring application.
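
Each filter is simply a different DQL query string run against the same endpoint as before. As a sketch reusing the v3/dql endpoint and token parameter from the example above (with DIFFBOT_TOKEN as the placeholder token), a small generalization of query_dql lets you try any of the queries below:

import requests

def run_query(dql_query, token, size=25):
    # Same endpoint as query_dql above, with the query string as a parameter
    querystring = {
        'token': token,
        'query': dql_query,
        'format': 'jsonl',
        'size': size,
    }
    return requests.get('https://kg.diffbot.com/kg/v3/dql', params=querystring)

# Example: the company news feed query from the first section below
response = run_query("type:Article tags.label:'Google' lastCrawlTime<30d", DIFFBOT_TOKEN)

Here are some popular ways to filter the news: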

Show a Company News Feed

To find news about a company, you can search for articles whose tags.uri field matches the company's diffbotUri. For example, to monitor news for Google (diffbotUri http://diffbot.com/entity/EUFq-3WlpNsq0pvfUYWXOEA), you can use:

type:Article tags.uri:'http://diffbot.com/entity/EUFq-3WlpNsq0pvfUYWXOEA' lastCrawlTime<30d

Alternatively, you can specify the name of the company with the tags.label field:

type:Article tags.label:'Google' lastCrawlTime<30d

Filter by Article Category

To filter articles by article categories (e.g. business news, sports news), you can specify categories.name. For example, to monitor Business news articles mentioning Google, you can use:

type:Article categories.name:'Business' tags.label:'Google' lastCrawlTime<30d

Or, to monitor Sports news articles mentioning Manchester United, you can use:

type:Article categories.name:'Sports' tags.label:'Manchester United' lastCrawlTime<30d

Article Categories lists the complete taxonomy for article categories that can be used in these queries.

By Keywords

Articles can be filtered by keywords in their title or text fields. For example, to monitor UK articles mentioning Robotic Process Automation in their title, you can use:

type:Article title:"Robotic Process Automation" publisherCountry:"United Kingdom"

Or, to monitor US articles mentioning ESG in their text and Innovate or Innovation in their title, you can use:

type:Article text:"ESG" title:OR("Innovate", "Innovation")
   publisherCountry:"United States" lastCrawlTime<30d

By Source

Articles can be filtered by the publishing source. For example, to monitor US articles from reuters.com mentioning Inflation in their title, you can use:

type:Article site:"reuters.com" title:"Inflation"
  publisherCountry:"United States" lastCrawlTime<30d

📘

The samples above use the shorthand clause lastCrawlTime<30d to search for articles crawled in the last 30 days. For a news monitoring application, you will have to use the variant clause lastCrawlTime>{last_recorded_crawl_time}, as discussed in the example.