Tutorial: How to Enhance a CSV
Enrich a humble list of company/people names into an analysis-ready database. (⏲️ 20 Minutes)
You've found yourself with a list of company names and a data analysis task. But you're stuck. You'll need much more than just a list of names to analyze. You could manually look up revenue and employment figures, or you could simply use Diffbot Enhance to automatically enrich every record on your list with data from Diffbot's Knowledge Graph.
Here's what we will cover in this tutorial:
- Basics with the Diffbot Enhance API
- Enhance a CSV with the Enhance API
- Format & Filter Your Enhanced Results
- Advanced Tips & Tricks
Don't feel like coding? Enhance your CSV in the Diffbot Dashboard.
Bulk Enhance in the Diffbot Dashboard supports CSVs and works very well. For one off CSV enhances, this approach is highly recommended. For more bespoke or programmatic enhance tasks, read on.
Prerequisites
- Basic programming or scripting experience with HTTP requests.
- Any version of Python. We will use Python 3.8 in this guide.
- A Diffbot token. You can sign up for a trial or a paid plan on the Plans & Pricing page.
Basics with the Diffbot Enhance API
The Diffbot Enhance API is one of two APIs used to access data from Diffbot's Knowledge Graph. It works by accepting one or more input values for Organization or Person entities, and returns a match to the full entity record as know by the Knowledge Graph. For example, to find the company "Diffbot" with just a name, we can generate an Enhance GET request as follows:
curl --request GET \
--url 'https://kg.diffbot.com/kg/v3/enhance?type=Organization&name=Diffbot&jsonmode=%20&token=<YOURTOKEN>' \
--header 'accept: application/json'
The response of this request includes the full KG record of the entity named "Diffbot" with all known fields listed in the Organization ontology populated. In addition, the response also includes some meta details such as its match score
, a confidence score of how likely the match is correct.
At least one input value belonging to one of name
or url
or organizations and name
for people is required for Enhance to work.
Accepted input values for enhancing People include:
- name
- url
- phone
- employer
- title
- school
- location (most unstructured formats accepted e.g. New York City, U.S.A.)
- ip (as in IP address)
- id (ID of the entity in the Diffbot Knowledge Graph)
Accepted input values for enhancing Organizations include:
- name
- url
- phone
- location (most unstructured formats accepted e.g. New York City, U.S.A.)
- ip (as in IP address)
See the full Enhance API reference for more details on input values.
This endpoint enhances one organization or person at a time, returning a single matching entity record (or none, if one isn't found) with each request. We could technically loop through a CSV of names and enhance each of them individually, but that would be terribly inefficient. The Bulk Enhance API is much better suited for this purpose.
The Bulk Enhance API takes an array of entity inputs, enhances them asynchronously, and returns all matching KG records. It can often be faster to use the Bulk Enhance API for large enhance jobs (>50 records) over the synchronous Enhance API.
The Bulk Enhance API works in 2 steps — Creating a bulk enhance job, and downloading the results of the completed bulk enhance job.
Creating a Bulk Enhance Job
The Create a Bulkjob endpoint accepts a JSON array of inputs in the request body and returns a job_id
for reference on success.
curl --location 'https://kg.diffbot.com/kg/v3/enhance/bulk?token=<YOURTOKEN>' \
--header 'Content-Type: application/json' \
--data '[
{
"type": "Organization",
"name": "Diffbot"
},
{
"type": "Organization",
"url": "apple.com"
},
{
"type": "Organization",
"name": "fedex"
}
]'
import requests
import json
url = "https://kg.diffbot.com/kg/v3/enhance/bulk?token=<YOURTOKEN>"
payload = json.dumps([
{
"type": "Organization",
"name": "Diffbot"
},
{
"type": "Organization",
"url": "apple.com"
},
{
"type": "Organization",
"name": "fedex"
}
])
headers = {
'Content-Type': 'application/json'
}
response = requests.request("POST", url, headers=headers, data=payload)
print(response.text)
Note that each JSON object in the array supplied to the Create a Bulkjob endpoint carries the same key-value pairs accepted in the individual Enhance API.
Downloading the Results of a Bulk Enhance Job
Once complete, supply the previously obtained job_id
to the Download Results of Bulkjob endpoint to download the matched entity results.
curl --request GET \
--url 'https://kg.diffbot.com/kg/v3/enhance/bulk/<bulkjobId>?token=<YOURTOKEN>' \
--header 'accept: application/json'
import requests
url = "https://kg.diffbot.com/kg/v3/enhance/bulk/<bulkJobId>?token=<YOURTOKEN>
headers = {
'Accept': 'application/json'
}
response = requests.request("GET", url, headers=headers)
print(response.text)
Check on the Status of a Bulk Enhance Job
To check on the status of an existing bulkjob to determine when it is ready for download, use the View Bulkjob endpoint with the same job_id
.
curl --request GET \
--url 'https://kg.diffbot.com/kg/v3/enhance/bulk/<bulkjobId>/status?token=<YOURTOKEN>' \
--header 'accept: application/json'
import requests
url = "https://kg.diffbot.com/kg/v3/enhance/bulk/<bulkJobId>/status?token=<YOURTOKEN>
headers = {
'Accept': 'application/json'
}
response = requests.request("GET", url, headers=headers)
print(response.text)
The response of this request includes a complete meta report of the status of the Enhance bulk job. For an example response, click on one of the response codes listed in the Response section of the API reference.
Enhance a CSV with the Bulk Enhance API
With the API basics out of the way, enhancing a CSV is a simple matter of format conversion into the JSON array format accepted by Bulk Enhance. By the way, we don't have to start with a CSV. This guide should be adaptable to any starting point you might be working with, but for simplicity sake we will stick to the more universally recognized format.
Converting a CSV for Bulk Enhance
We will tackle two examples of CSVs in unison with this guide — person and organization, with varying degrees of completeness. Feel free to follow along with your CSV, or grab the CSV examples below.
A quick reminder again that Bulk Enhance in the Diffbot Dashboard supports CSVs and may be easier to use if customization isn't required.
A person CSV (person.csv). Emails, although supported by Enhance and shown below, have been redacted for privacy and will not be used in this tutorial.
name | company | |
---|---|---|
Jerome Choo | Diffbot | ****@diffbot.com |
Nilay Patel | The Verge | |
Molham Aref | ****@relational.ai |
An organization CSV (org.csv).
name | url | location |
---|---|---|
Diffbot | diffbot.com | California |
Vox Media | voxmedia.com | |
Relational AI | ||
postman.com |
Note that there is a high degree of tolerance for missing inputs. For person enhances, name is a minimum requirement. For organization enhances, either one of name or url is the minimum.
To convert either of these CSVs into a JSON array that we can then send up to the Bulk Enhance API, we will use a simple Python script.
import csv
import json
input_csv_file = "person.csv" # or org.csv
output_json_file = "person.json" # or org.json
data = []
# Read the CSV file
with open(input_csv_file, "r") as csv_file:
csv_reader = csv.DictReader(csv_file)
for row in csv_reader:
# Filter out empty fields
filtered_row = {key: value for key, value in row.items() if value}
if filtered_row:
# Define type
filtered_row["type"] = "Person" # or "Organization", if enhancing an organization CSV
data.append(filtered_row)
# Write data to JSON file
with open(output_json_file, "w") as json_file:
json.dump(data, json_file, indent=2)
name,company
Jerome,Diffbot
Nilay Patel,The Verge
Molham Aref,
name,url,location
Diffbot,diffbot.com,California
Vox Media,voxmedia.com,
,postman.com,
DictReader does the heavy lifting of converting each CSV row into a Python dict, and we simply append a type declaration for each record. The result will look like the following:
[
{
"name": "Jerome Choo",
"company": "Diffbot",
"type": "Person"
},
{
"name": "Nilay Patel",
"company": "The Verge",
"type": "Person"
},
{
"name": "Molham Aref",
"type": "Person"
}
]
[
{
"name": "Diffbot",
"url": "diffbot.com",
"location": "California",
"type": "Organization"
},
{
"name": "Vox Media",
"url": "voxmedia.com",
"type": "Organization"
},
{
"url": "postman.com",
"type": "Organization"
}
]
Note that each JSON object in your JSON array should only include a select list of input values. Any additional input values included will not be passed along to the exported results.
Start Your Bulk Enhance Job
We can now create a Bulk Enhance job from our converted JSON array. On success, the endpoint sends back a job_id
we can use to track the status of the job.
import requests
import json
url = "https://kg.diffbot.com/kg/v3/enhance/bulk?token=<YOURTOKEN>"
input_json_file = "person.json" # or org.json
# Read data from JSON file
with open(input_json_file, "r") as json_file:
data = json.load(json_file)
payload = json.dumps(data)
headers = {
'Content-Type': 'application/json'
}
response = requests.request("POST", url, headers=headers, data=payload)
print(response.text)
# {"job_id":"*****"}
Format & Filter Your Enhanced Results
Once your Bulk Enhance job is complete, we're ready to enjoy a massive dump of enhanced data. A JSON array of all the possible data the KG has on the entities in your CSV is available with a single call to the Download Results of Bulkjob endpoint.
For many workflows however, filtering to just the fields you need may be a leaner and cleaner approach. This section of the tutorial will cover how to do this.
Customizing Your JSON Export
By default the Download Results of Bulkjob endpoint will include all populated fields for the matched persons and/or organizations from your input data. An ontology of these fields is available here for reference.
While there is no cost difference to downloading all vs. a subset of fields available for your enhanced data, filtering the JSON export to just the fields you need will dramatically reduce the size and noise of the output file.
To filter the fields in your JSON export, use the filter
parameter in the Download Results of Bulkjob endpoint and list the field names to keep in your export, space delimited. For example:
import requests
url = "https://kg.diffbot.com/kg/v3/enhance/bulk/<bulkJobId>?token=<YOURTOKEN>
payload = {
'filter': 'name homepageUri' # Return just name and homepageUri
}
headers = {
'Accept': 'application/json'
}
response = requests.request("GET", url, headers=headers, data=payload)
print(response.text)
Review the ontology for a complete reference of fields available for each entity type. For even more advanced filtering options, check out Filtering Fields.
Customizing Your CSV Export
In addition to exporting enhanced results as a JSON array natively, the Bulk Enhance API also supports exporting in CSV format. We will just need to define some CSV formatting rules.
Here's a simple request to download the results of a bulk enhance job into a CSV with just the name
and homepageUri
fields.
import requests
url = "https://kg.diffbot.com/kg/v3/enhance/bulk/<bulkJobId>?token=<YOURTOKEN>
payload = {
'format': 'csv',
'exportspec': 'name;homepageUri', # Return name and homepageUri
'exportfile': 'org-enhanced.csv'
}
headers = {
'Accept': 'application/json'
}
response = requests.request("GET", url, headers=headers, data=payload)
print(response.text)
Just as with JSON export, reference the ontology to see all the fields available to filter by.
Don't like our field names? You can customize your own CSV header names with the exportspec
parameter.
payload = {
'format': 'csv',
'exportspec': 'name,Company Name;homepageUri,Website',
'exportfile': 'org-enhanced.csv'
}
You can also opt to include the input data used in the Create a Bulkjob step in your enhanced results export. Set the exportquery
parameter to 1 (default: 0) to do this.
import requests
url = "https://kg.diffbot.com/kg/v3/enhance/bulk/<bulkJobId>?token=<YOURTOKEN>
payload = {
'format': 'csv',
'exportspec': 'name;homepageUri', # Return name and homepageUri
'exportfile': 'org-enhanced.csv',
'exportquery': 1
}
headers = {
'Accept': 'application/json'
}
response = requests.request("GET", url, headers=headers, data=payload)
print(response.text)
For a complete reference on all the options available with CSV export, check out Exporting Columnar Format.
Advanced Tips & Tricks
How to Limit the Size of Exports
Helpful for testing complex filter rules without downloading an entire export file, or simply for downloading your enhanced results in batches.
Set the size
parameter to the number of entities you would like returned in your export.
import requests
url = "https://kg.diffbot.com/kg/v3/enhance/bulk/<bulkJobId>?token=<YOURTOKEN>
payload = {
'filter': 'name homepageUri', # Return just name and homepageUri
'size': 1
}
headers = {
'Accept': 'application/json'
}
response = requests.request("GET", url, headers=headers, data=payload)
print(response.text)
Note that for CSV exports, the size
value includes the header row. So size=1
will only return the header row.
import requests
url = "https://kg.diffbot.com/kg/v3/enhance/bulk/<bulkJobId>?token=<YOURTOKEN>
payload = {
'format': 'csv',
'exportspec': 'name;homepageUri', # Return name and homepageUri
'exportfile': 'org-enhanced.csv',
'exportquery': 1,
'size': 2
}
headers = {
'Accept': 'application/json'
}
response = requests.request("GET", url, headers=headers, data=payload)
print(response.text)
To download your enhanced results in paginated batches, use the size
and from
parameters together. Increment the from
parameter by the value of the size
parameter from the previous batch to ensure no duplicate entities are downloaded.
import requests
url = "https://kg.diffbot.com/kg/v3/enhance/bulk/<bulkJobId>?token=<YOURTOKEN>
# Download First Batch of 1
payload_1 = {
'filter': 'name homepageUri',
'from': 1
'size': 1
}
# Download Second Batch of 1
payload_2 = {
'filter': 'name homepageUri',
'from': 2
'size': 1
}
# Download Third Batch of 1
payload_3 = {
'filter': 'name homepageUri',
'from': 3
'size': 1
}
headers = {
'Accept': 'application/json'
}
response = requests.request("GET", url, headers=headers, data=payload_1)
print(response.text)
How to Merge Enhanced Results With Your Original CSV
The Bulk Enhance API only returns fields that exist in the Knowledge Graph Ontology and in the case of CSV export, an option to return accepted input data. If your original dataset includes data you wish to merge into the enhanced export, you may add a customId
field to each JSON in the JSON array input sent up to Enhance.
[
{
"name": "Diffbot",
"url": "diffbot.com",
"location": "California",
"type": "Organization",
"customId": "1"
},
{
"name": "Vox Media",
"url": "voxmedia.com",
"type": "Organization",
"customId": "2"
},
{
"url": "postman.com",
"type": "Organization",
"customId": "3"
}
]
customId
values are passed along to the exported results, allowing you to perform a VLOOKUP or other reference method to link and merge the two datasets.
How to Get a Field Coverage Report
A field coverage report details the number of fields that were successfully populated across your enhanced results. Because not all fields can be populated for all entities (e.g. some person entities might not have a twitterUri
), this technique is a helpful way to know if a field has enough coverage for a well sampled analysis.
To obtain a field coverage report for a Bulk Enhance job, you will need to have the bulkjobId
and a reportId
. The former is obtained in the success response upon creating your bulk enhance job. reportId
can be acquired from the response of the bulk enhance job status endpoint, under the reports
key.
{
"code": 0,
"content": {
...
"reports": [
{
"format": "...",
"spec": "...",
"filter": "...",
"exportSeparator": "...",
"onlyMatches": true,
"reportId": "REPORTID"
}
]
}
}
Supply both bulkjobId
and reportId
to the Download Bulkjob Coverage Report endpoint to obtain a coverage report.
Wrap Up
By this point of the tutorial, you should have everything you need to know to enhance a CSV of organizations or people. If we missed anything, don't hesitate to reach out to us at [email protected]. And do mention this tutorial so we may improve it for others.
Updated about 1 year ago