Tutorial: How to Build a News Monitoring App
News monitoring using Diffbot Knowledge Graph
TLDR;
Skip the prose, go straight to code: news-monitoring.pynb
With over 1.5 billion articles and continuous crawling of major news sources, the Diffbot Knowledge Graph (KG) provides a vast database for up-to-date news. By leveraging our platform, you can monitor events, trends, and developments in various industries, countries, and topics of interest. You can customize your search parameters based on your preferences, including language, source, date range, and article type.
This guide will describe how to build a simple news monitoring application using Diffbot KG.
Prerequisites
- You will need a token to access the Diffbot KG. You can sign up for a trial or paid plan on the Plans & Pricing page to get a token.
- Basic programming or scripting experience with HTTP requests and file or database access. You will find Python examples in this guide.
- Basic knowledge of DQL - the query language that you will use to query the KG. This tutorial will describe the relevant queries but you can get more details from the DQL documentation.
Basic Concepts
Before writing any code, let's talk about some basic concepts that you will need to be familiar with to implement news monitoring.
Crawl time vs. Publication Date
When dealing with articles, there are two different kinds of dates to keep in mind: the date the article was published by the publisher, and the date that Diffbot crawled the article.
- The date field in the article JSON is the date the article was originally published. Diffbot's ML models extract the date visually from the article in the same way a human sees it, and using metadata found with the page source. Bear in mind that not all articles might have a date specified.
- The lastCrawlTime field in the article JSON indicates the time when Diffbot's crawlers found the article. This can vary depending on how often the website is crawled. It may take only seconds for the article to be discovered, or it may take several days after the publication date.
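As a small illustration of the difference, the sketch below reads both fields from a single article record. It assumes lastCrawlTime is an epoch timestamp in seconds (which is how the queries later in this guide treat it) and only checks whether date is present, since the layout of that field is not described here. The function name describe_article_times and the sample record are hypothetical.

```python
from datetime import datetime, timezone


def describe_article_times(article):
    """Summarize the crawl time and whether a publication date was extracted."""
    # Assumption: lastCrawlTime is an epoch timestamp in seconds.
    crawled = datetime.fromtimestamp(article["lastCrawlTime"], tz=timezone.utc)
    return {
        "crawled_utc": crawled.isoformat(),
        # Not every article has a publication date, so test for presence only.
        "has_publication_date": article.get("date") is not None,
    }


# Hypothetical record with no publication date.
sample = {"title": "Example", "lastCrawlTime": 1700000000}
print(describe_article_times(sample))
```

Because lastCrawlTime is always set by the crawler while date may be missing, the crawl time is the safer field to monitor on.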
Example: Monitoring Earthquake News
We will build an application to monitor news about earthquakes every hour. This is the general flow that we will follow:
- Write a DQL query that searches for articles mentioning "Earthquakes" and filters them by crawl time.
- Write a Python script to retrieve articles. To start monitoring, retrieve articles crawled in the past day. Record the timestamp for one day ago in a file called timestamp.chk. If the query returns new articles, update timestamp.chk with the most recent timestamp.
- Set up a cron job to run this script every hour. When the script runs, it checks the timestamp in timestamp.chk to determine the most recent articles to retrieve.
Step 1: DQL query
The easiest way to fetch all articles crawled in the last day with the word "earthquake" in their title is by using this DQL query: type:Article title:'Earthquake' lastCrawlTime<1d. However, for the news monitoring application, we will use a variant of the date-time filter and specify the actual date as an epoch timestamp (the number of seconds since Jan 1, 1970). The query looks like type:Article title:'Earthquake' lastCrawlTime>${EPOCH_TIME_1_DAY_AGO}.
Because this requires us to compute EPOCH_TIME_1_DAY_AGO, it's best done in a script like the one below:
import time
EPOCH_TIME_1_DAY_AGO = int(time.time()) - 1 * 86400
query = f"type:Article title:'Earthquake' lastCrawlTime>{EPOCH_TIME_1_DAY_AGO}"
Step 2: Bootstrap monitoring
The next step is to bootstrap monitoring:
- Record the timestamp for 1 day ago in a file called timestamp.chk.
- Get articles crawled since the time recorded in timestamp.chk (i.e., in the past 1 day).
- If the query returns new articles, update timestamp.chk with the most recent timestamp.
Record timestamp for 1 day ago
The get_latest_crawl_time function returns the latest_crawl_time from the file timestamp.chk. If the file does not exist, the function bootstraps timestamp.chk with the time 1 day ago.
The set_latest_crawl_time function writes the latest_crawl_time to timestamp.chk.
import time
import os.path

# Read latest_crawl_time from file timestamp.chk
def get_latest_crawl_time():
    # If timestamp.chk does not exist, create it and set the time to 1 day ago
    if not os.path.isfile('timestamp.chk'):
        # Get the time 1 day ago
        epoch_time_1_day_ago = int(time.time()) - 1 * 86400
        # record this in file timestamp.chk
        with open('timestamp.chk', 'w') as f:
            f.write(str(epoch_time_1_day_ago))
    with open('timestamp.chk', 'r') as f:
        latest_crawl_time = int(f.read())
    return latest_crawl_time

# Update latest_crawl_time in file timestamp.chk
def set_latest_crawl_time(latest_crawl_time):
    with open('timestamp.chk', 'w') as f:
        f.write(str(latest_crawl_time))
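If the script is interrupted while it is writing timestamp.chk, the file can be left empty and the next run will fail to parse it. One possible hardening, shown here as a sketch and not part of the original code, is to write to a temporary file and then swap it into place. The function name set_latest_crawl_time_atomic and its path parameter are hypothetical.

```python
import os
import tempfile


def set_latest_crawl_time_atomic(latest_crawl_time, path="timestamp.chk"):
    """Write the checkpoint to a temporary file, then swap it into place."""
    # Create the temporary file next to the target so the swap stays on one filesystem.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".timestamp-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(str(latest_crawl_time))
        # os.replace overwrites the destination; on POSIX systems the rename is atomic.
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the temporary file if anything failed before the swap.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

It can be used as a drop-in replacement for set_latest_crawl_time when the checkpoint lives in the current directory.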
Function to get articles from DQL
import requests

def query_dql(last_recorded_crawl_time, token):
    querystring = {
        'token': token,
        'query': f'type:Article title:"Earthquake" lastCrawlTime>{last_recorded_crawl_time}',
        'format': "jsonl",  # get results one JSON object per line
        'size': -1  # get all records
    }
    return requests.get('https://kg.diffbot.com/kg/v3/dql', params=querystring)
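To see what request this function sends without spending an API call, the sketch below builds an equivalent URL with the standard library only. The helper name preview_dql_url and the placeholder token are hypothetical, and the exact percent-escaping produced by requests may differ slightly from urlencode, although the decoded parameters are the same.

```python
from urllib.parse import urlencode

DQL_ENDPOINT = "https://kg.diffbot.com/kg/v3/dql"


def preview_dql_url(last_recorded_crawl_time, token="YOUR_TOKEN"):
    """Return the GET URL for the earthquake query without sending it."""
    params = {
        "token": token,
        "query": f'type:Article title:"Earthquake" lastCrawlTime>{last_recorded_crawl_time}',
        "format": "jsonl",  # one JSON object per line
        "size": -1,         # all records
    }
    # urlencode escapes spaces, quotes, and the ">" operator in the query.
    return f"{DQL_ENDPOINT}?{urlencode(params)}"


print(preview_dql_url(1700000000))
```

Printing the URL is a quick way to confirm that the timestamp was substituted into the query as intended.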
Fetch articles published
The download_articles function fetches articles crawled since the time recorded in timestamp.chk, writes them to a JSONL file, and updates timestamp.chk with the newest crawl time it saw.
import json

# DIFFBOT_TOKEN must be set to your Diffbot token before this function is called
def download_articles():
    # Query DQL
    latest_crawl_time = get_latest_crawl_time()
    print(f'Querying DQL with lastCrawlTime > {latest_crawl_time} ({time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(latest_crawl_time))} UTC)')
    response = query_dql(latest_crawl_time, DIFFBOT_TOKEN)
    # Write to file with name "news-monitoring-yyyymmdd-hh.jsonl"
    # where yyyymmdd-hh is the current time in UTC
    utc_time = time.strftime('%Y%m%d-%H', time.gmtime())
    filename = f'news-monitoring-{utc_time}.jsonl'
    counter = 0
    with open(filename, 'wb') as f:
        # write the response line by line
        for line in response.iter_lines():
            if line:
                f.write(line)
                f.write(b'\n')
                counter += 1
                # Get the lastCrawlTime of the article
                article = json.loads(line)
                last_crawltime = article['lastCrawlTime']
                # Update latest_crawl_time if the article is newer
                if last_crawltime > latest_crawl_time:
                    latest_crawl_time = last_crawltime
    # Update latest_crawl_time in file timestamp.chk
    set_latest_crawl_time(latest_crawl_time)
    print(f'Wrote {counter} articles to {filename} with lastCrawlTime <= {latest_crawl_time} ({time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(latest_crawl_time))})')

# call download_articles()
download_articles()
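Once a run has written its news-monitoring-yyyymmdd-hh.jsonl file, the articles can be read back one line at a time. The sketch below is one way to do that. The helper names read_articles and summarize are hypothetical, and it assumes each record carries title and lastCrawlTime keys, falling back to None when one is missing.

```python
import json


def read_articles(filename):
    """Yield one article dict per non-empty line of a JSONL file."""
    with open(filename, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def summarize(filename):
    """Return (title, lastCrawlTime) pairs, using None for missing fields."""
    return [(a.get("title"), a.get("lastCrawlTime")) for a in read_articles(filename)]
```

For example, summarize('news-monitoring-20240101-00.jsonl') would list the titles and crawl times from that hour's file, assuming such a file exists.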
Step 3: Cron job for monitoring for new Articles every hour
Use the following crontab entry to schedule the Python script to run every hour:
0 * * * * python3 /path/to/your/news_monitoring.py
There are several resources on the web with instructions for configuring a cron job. Here's one: How to Set Up a Cron Job in Linux
Alternatively, you can call the download_articles() function every hour from the script in a loop:
# Call download_articles() every hour
while True:
    download_articles()
    time.sleep(3600)
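A fixed one-hour sleep drifts by however long each run takes, and a single failed request ends the loop. The variant below is only a sketch of one way to address both. The helper names seconds_until_next_hour and run_hourly are hypothetical, and catching every exception is a simplification.

```python
import time


def seconds_until_next_hour(now=None):
    """Seconds from `now` (epoch seconds) to the start of the next UTC hour."""
    now = time.time() if now is None else now
    return 3600 - (now % 3600)


def run_hourly(job):
    """Run `job` now, then again at the top of every hour, surviving failures."""
    while True:
        try:
            job()
        except Exception as exc:
            # Log and keep going; the checkpoint file is unchanged on failure.
            print(f"Run failed: {exc}")
        time.sleep(seconds_until_next_hour())
```

Calling run_hourly(download_articles) would then behave much like the crontab entry above, which also fires at minute 0 of each hour.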
Code
The entire code is available here: news-monitoring.pynb
Filters to Refine Your News Feed
There are many ways to filter articles for your news monitoring application. Here are some popular ones:
Show a Company News Feed
To find news about a company, you can find articles with the tags.uri field matching the company's diffbotUri. For example, to monitor news for Google (diffbotUri http://diffbot.com/entity/EUFq-3WlpNsq0pvfUYWXOEA), you can use:
type:Article tags.uri:'http://diffbot.com/entity/EUFq-3WlpNsq0pvfUYWXOEA' lastCrawlTime<30d
Alternatively, you can specify the name of the company with the tags.label field:
type:Article tags.label:'Google' lastCrawlTime<30d
Filter by Article Category
To filter articles by article categories (e.g. business news, sports news), you can specify categories.name. For example, to monitor Business news articles mentioning Google, you can use:
type:Article categories.name:'Business' tags.label:'Google' lastCrawlTime<30d
Or, to monitor Sports news articles mentioning Manchester United, you can use:
type:Article categories.name:'Sports' tags.label:'Manchester United' lastCrawlTime<30d
Article Categories lists the complete taxonomy for article categories that can be used in these queries.
By Keywords
Articles can be filtered by keywords in the title or article text fields. For example, to monitor UK articles mentioning Robotic Process Automation in their title, you can use:
type:Article title:"Robotic Process Automation" publisherCountry:"United Kingdom"
Or, to monitor US articles mentioning ESG in their text and Innovate or Innovation in their title, you can use:
type:Article text:"ESG" title:OR("Innovate", "Innovation") publisherCountry:"United States" lastCrawlTime<30d
By Source
Articles can be filtered by the publishing source. For example, to monitor US articles from reuters.com mentioning Inflation in their title, you can use:
type:Article site:"reuters.com" title:"Inflation" publisherCountry:"United States" lastCrawlTime<30d
The samples above use the shorthand clause lastCrawlTime<30d to search for articles crawled in the last 30 days. For a news monitoring application, you will have to use the variant clause lastCrawlTime>{last_recorded_crawl_time}, as discussed in the example.
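To reuse any of the filters above in the monitoring script, the shorthand clause can be dropped and the checkpoint-based clause appended in code. The helper below is a minimal sketch. The name build_monitoring_query is hypothetical, and it assumes the filters string passed in does not already contain a lastCrawlTime clause.

```python
def build_monitoring_query(filters, last_recorded_crawl_time):
    """Append the checkpoint-based crawl-time clause to a DQL filter string."""
    # int() guards against a float or string timestamp slipping into the query.
    return f"{filters.strip()} lastCrawlTime>{int(last_recorded_crawl_time)}"


# Example: the Reuters inflation feed, restricted to articles crawled since the checkpoint.
print(build_monitoring_query('type:Article site:"reuters.com" title:"Inflation"', 1700000000))
```

The returned string can be passed as the query parameter in place of the hard-coded earthquake query in query_dql.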