Log-Level Data Service
The Log-Level Data Service (formerly known as the Data Siphon Service) is an API for pulling batch Log-Level Data.
These batch files must be retrieved within the default retention period specified for the feed (e.g., 7 days); after that time, the files are purged from our system. See the feed-specific pages for exact retention periods. Also note that you cannot retrieve more than two files concurrently.
Name Change: Although the descriptive name of this service has changed from "Data Siphon" to "Log-Level Data", the names of the web services that you make calls to have not changed. They remain siphon and siphon-download, as described in How to Retrieve a Log-Level Data Feed below. Existing technical integrations are therefore not affected by this change.
Recommendations for Building Against This Service
Log-level data feeds are often quite large; a single hour's worth of data may exceed 1 GB in size, and a single day's worth of data may contain over 1 million records. We therefore recommend using a relational database (e.g., MySQL, PostgreSQL, Microsoft SQL Server) or a data warehousing system (e.g., Netezza, Hadoop) for storage and processing rather than office productivity software such as Microsoft Excel or Microsoft Access.
If AppNexus detects discrepancies in generated data, we regenerate the file (generally within 3 days). In such cases, you will see two files listed for the affected hour: the old, invalid file and the new, regenerated file. The new file will have the same hour but a more recent timestamp than the original. The old file might not have checksums generated (though some do); this is normal and expected. The new file is a fully corrected version of the original file, not just the difference between the two. Therefore, when you pull new files, be sure to look back 3 days for any regenerations.
If you discover data issues in a file, you can request a regeneration by opening a support request at http://support.appnexus.com. However, AppNexus can only regenerate files that were created within the last three days.
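Because a regenerated file shares its hour with the original and differs only in timestamp, a poller needs to deduplicate by timestamp and re-check the trailing 3-day window. A minimal sketch in Python, assuming each feed record is a dict carrying the hour (YYYY_MM_DD_HH) and timestamp (YYYYMMDDHHMMSS) string values described later in this page:

```python
from datetime import datetime, timedelta

def latest_per_hour(records):
    """Keep only the newest file for each feed hour: a regenerated file
    shares the hour of the original but has a later timestamp. The
    fixed-width YYYYMMDDHHMMSS timestamps compare correctly as strings."""
    newest = {}
    for rec in records:
        prev = newest.get(rec["hour"])
        if prev is None or rec["timestamp"] > prev["timestamp"]:
            newest[rec["hour"]] = rec
    return sorted(newest.values(), key=lambda r: r["hour"])

def lookback_start(now, days=3):
    """Earliest hour worth re-checking for regenerations, formatted the
    way the service reports hours (YYYY_MM_DD_HH)."""
    return (now - timedelta(days=days)).strftime("%Y_%m_%d_%H")
```

Keeping the dictionary keyed by hour means a regenerated file silently replaces the stale one on the next poll, with no special-casing needed.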
Filtering on updated_since
Polling for all updates to the log-level data feeds can take a long time. For some larger members with many splits, this call can even time out. To avoid this issue, we suggest that you poll for new updates once an hour (or less frequently) and set the updated_since filter to the last time (in UTC) that you polled for data. This filter ensures that the log-level data service only returns data that has been updated since the last time you polled. For details on how to use this filter, see the View list of hourly data files generated since your last update from the API example below.
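As a sketch of the bookkeeping involved, the last-poll time can be rendered into the UTC YYYY_MM_DD_HH form that the updated_since filter expects (the format is taken from the filter example later on this page):

```python
from datetime import datetime, timezone

def updated_since_value(last_poll):
    """Render a timezone-aware datetime as the UTC YYYY_MM_DD_HH string
    for the updated_since query-string filter."""
    return last_poll.astimezone(timezone.utc).strftime("%Y_%m_%d_%H")
```

One common approach is to record the poll time *before* issuing the request, so records updated while the poll is in flight are picked up on the next pass rather than missed.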
How to Retrieve a Log-Level Data Feed
Retrieving a log-level data feed is a 3-step process. First you identify the feed you want, then you request the download location of the feed, and finally you download the file.
Log-level data files are purged from our system after 10 days. Be sure to retrieve them within this timeframe. Also, please note that you cannot download more than two files concurrently.
- Step 1. View your list of feeds
- Step 2. Request the download location for a feed
- Step 3. Download the file
Step 1. View your list of feeds
First, make a GET call to the siphon service to view information about each available hourly feed, including the exact time when it was generated, the number of file parts into which it is split, and the status and checksum of each part. See JSON Fields below for more details about the fields in the response.
You can filter the response by passing siphon_name and/or hour in the query string of the call. It is strongly recommended that you use one or both of these filters, as they greatly speed up the response of the /siphon call. See the examples below.
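For illustration, a filtered /siphon request URL might be assembled like this in Python. The API host is a placeholder (substitute your actual endpoint); siphon_name and hour are the filters described above:

```python
from urllib.parse import urlencode

API_BASE = "https://api.example.com"  # placeholder host; substitute your endpoint

def siphon_list_url(siphon_name=None, hour=None):
    """Build a GET /siphon URL using the recommended filters."""
    params = {}
    if siphon_name:
        params["siphon_name"] = siphon_name
    if hour:
        params["hour"] = hour  # format: YYYY_MM_DD_HH
    query = urlencode(params)
    return f"{API_BASE}/siphon" + (f"?{query}" if query else "")
```

Issuing the GET against the resulting URL (with your usual authentication) returns the JSON feed list described below.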
To comply with GDPR, the following fields have been deprecated from the Log Level Data service feeds.
Step 2. Request the download location for a feed
Once you've identified the feed that you want to retrieve, make a GET call to the siphon-download service to request the location from which you can download the file. Include the split_part and one of the supported file formats (e.g., protobuf) in the query string of the call.
The download location will be in the header of the response rather than in the body. In the example below, we've used the --verbose option to expose the header. You can also use the -L option, which allows curl to follow the 302 redirect to the feed's download location. This saves you from having to parse the download location out of the HTTP header and allows you to skip Step 3. If you use the -L option, be sure to specify a location to save to.
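As a sketch (not an official client), Step 2 can also be done in Python by issuing the GET without following the redirect and reading the Location header yourself, mirroring the curl --verbose route. The host and the parameter names other than split_part are assumptions; check the service reference for the exact names:

```python
import urllib.error
import urllib.request
from urllib.parse import urlencode

API_BASE = "https://api.example.com"  # placeholder host; substitute your endpoint

def download_request_url(siphon_name, hour, timestamp, split_part, fmt="protobuf"):
    """Build the siphon-download request URL. Parameter names other than
    split_part are assumptions based on the feed-listing fields."""
    params = {"siphon_name": siphon_name, "hour": hour,
              "timestamp": timestamp, "split_part": split_part, "format": fmt}
    return f"{API_BASE}/siphon-download?{urlencode(params)}"

class _NoRedirect(urllib.request.HTTPRedirectHandler):
    # Return None so urllib surfaces the 302 instead of following it,
    # letting us read the Location header ourselves.
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

def fetch_location(url):
    """Issue the GET and return the Location header from the redirect."""
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        opener.open(url)
    except urllib.error.HTTPError as err:
        if err.code in (301, 302, 303, 307):
            return err.headers["Location"]
        raise
    return None
```

Alternatively, letting the HTTP library follow the redirect automatically behaves like curl -L and merges Steps 2 and 3 into one request.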
Step 3. Download the file
Skip this step if you used the -L option in Step 2 above. Otherwise, make a GET call to the download location URL in the header of the previous response and specify a location to save to.
Log-level data files are text files compressed in gzip format.
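Because the parts are gzip-compressed text, they can be stream-decompressed one record at a time rather than loaded whole into memory; a small Python sketch:

```python
import gzip

def read_feed_lines(path):
    """Yield decoded lines from a gzip-compressed log-level data file,
    one record at a time, without loading the whole file into memory."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            yield line.rstrip("\n")
```

Streaming like this pairs well with the bulk-loading tools of the databases recommended above, which typically accept line-oriented input.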
You can then calculate the MD5 checksum of the downloaded file and check it against the checksum in the response from Step 1. If they do not match, an error may have occurred during download and you should try again.
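The verification step can be sketched with the Python standard library; reading in chunks keeps memory use flat even for multi-gigabyte parts:

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file in 1 MB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path, expected_checksum):
    """Compare against the checksum reported in the Step 1 response."""
    return md5_of_file(path) == expected_checksum.lower()
```

A mismatch is a signal to re-download the part, not to retry the checksum: the file on disk is the suspect, not the arithmetic.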
JSON Fields
- name: The name of the log-level data feed. For possible values, see the list of available feeds on Log Level Data Feeds.
- hour: The hour when the feed was generated. The format is YYYY_MM_DD_HH.
- timestamp: The date and time when the feed was generated. The format is YYYYMMDDHHMMSS.
- splits (array of objects): The file parts into which the log-level data feed is split. Each part includes a status and a checksum.
- status: The status of the file part.
- checksum: The MD5 checksum of the file part. After you download the file, you can calculate the MD5 checksum on your end and check it against our checksum. If they do not match, an error may have occurred during download and you should try again.
View list of standard feeds
To get information about standard feeds only, include siphon_name=standard_feed in the query string of the call.
View list of all feeds generated at a specific hour
To get information about all feeds that were generated at a specific hour, include hour=YYYY_MM_DD_HH in the query string of the call.
View list of hourly data files generated since your last update from the API
To return only hours of data that have been updated since the last time you polled the API, use the updated_since filter. Pass a timestamp in the format updated_since=YYYY_MM_DD_HH (in UTC) in the query string of the call. Note that this filter compares against the timestamp field, not the hour field.