Log-Level Data Service

The Log-Level Data Service (formerly known as the Data Siphon Service) is an API for pulling batch Log-Level Data.

These batch files must be retrieved within the default retention period specified in the feed (e.g., 7 days). The files are purged from our system after that time. See the feed-specific pages for retention periods. Also note that you cannot retrieve more than two files concurrently.

Name Change: Although the descriptive name of this service has changed from "Data Siphon" to "Log-Level Data", the names of the web services that you make calls to have not changed. They remain siphon and siphon-download, as described in How to Retrieve a Log-Level Data Feed below. Existing technical integrations are therefore not affected by this change.


Recommendations for Building Against This Service

Storage

Log-level data feeds are often very large; a single hour's worth of data may exceed 1 GB, and a single day's worth may contain over 1 million records. We therefore recommend using a relational database (e.g., MySQL, PostgreSQL, Microsoft SQL Server) or a data warehousing system (e.g., Netezza, Hadoop) for storage and processing, rather than office productivity software such as Microsoft Excel or Microsoft Access.

File Regeneration

If AppNexus detects discrepancies in generated data, we regenerate the file (generally within 3 days). In such cases, you will see two files listed for the affected hour: the old, invalid file and the new, regenerated file. The new file will have the same "hour" but a more recent "timestamp" than the original. The old file might not have checksums generated (though some do); this is normal and expected. The new file is a fully corrected version of the original file, not just the difference between the two. Therefore, when you pull new files, be sure to look back 3 days for any regenerations.

If you discover data issues in a file, you can request a regeneration by opening a support request at http://support.appnexus.com. However, AppNexus can only regenerate files that were created within the last three days.

Filtering on updated_since

Polling for all updates to the log-level data feeds can take a long time; for some larger members with many splits, the call can even time out. To avoid this issue, we suggest that you poll for new updates once an hour (or less frequently) and set the "updated_since" filter to the last time (in UTC) that you polled for data. This filter ensures that the Log-Level Data service only returns data that has been updated since the last time you polled the service. For details on how to use this filter, see the "View list of hourly data files generated since your last update from the API" example below.

How to Retrieve a Log-Level Data Feed

Retrieving a log-level data feed is a three-step process: first you identify the feed you want, then you request its download location, and finally you download the file.

Remember that log-level data files are purged from our system after the feed's retention period (see the feed-specific pages). Be sure to retrieve them within that timeframe. Also, please note that you cannot download more than two files concurrently.

Step 1. View your list of feeds

First, make a GET call to the siphon service to view information about each available hourly feed, including the exact time it was generated, the number of file parts into which it is split, and the status and checksum of each part. See JSON Fields below for more details about the fields in the response.

You can filter the response by passing siphon_name and/or hour in the query string of the call. We strongly recommend using one or both of these filters, as they greatly speed up the response of the /siphon call. See the examples below.
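For instance, a filtered Step 1 call might look like the following sketch. The feed name and hour are placeholders, the api.appnexus.com host assumes the standard production API endpoint, and $TOKEN stands for an authorization token you have already obtained:

    # Hypothetical Step 1 call: list the hourly files for one feed and one hour.
    curl -H "Authorization: $TOKEN" \
      "https://api.appnexus.com/siphon?siphon_name=standard_feed&hour=2012_02_19_19"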

To comply with the GDPR, the following fields have been deprecated from the Log-Level Data service feeds.

  • user_id_64
  • external_uid
  • device_unique_id
  • ip_address 
  • latitude
  • longitude

Subject to requirements under the GDPR, these fields will continue to be available if you receive log-level data via Cloud Export. For details, see Changes to Log-Level Data and Console Reporting.

Step 2. Request the download location for a feed

Once you've identified the feed that you want to retrieve, make a GET call to the siphon-download service to request the location from which you can download the file. Include siphon_name, hour, timestamp, and split_part, along with the file format you want (e.g., protobuf or text), in the query string of the call.

The download location is returned in the header of the response rather than in the body. In the example below, we've used curl's --verbose option to expose the header. You can also use the -L option, which tells curl to follow the 302 redirect to the feed's download location; this saves you from having to parse the download location out of the HTTP header and allows you to skip Step 3. If you use the -L option, be sure to specify a location to save the file to.
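The sketch below illustrates both approaches. The identifying parameters (siphon_name, hour, timestamp, split_part) are as described above; the name of the file-format parameter is an assumption, and all values are placeholders:

    # Hypothetical Step 2 call: request the download location for one split part.
    # --verbose exposes the response headers, where the location is returned.
    # Note: "format" is an assumed parameter name for the file format.
    curl --verbose -H "Authorization: $TOKEN" \
      "https://api.appnexus.com/siphon-download?siphon_name=standard_feed&hour=2012_02_19_19&timestamp=20120209134931&split_part=0&format=text"

    # Or follow the 302 redirect directly, combining Steps 2 and 3.
    # -o specifies the location to save to, as noted above.
    curl -L -o standard_feed_2012_02_19_19_0.gz -H "Authorization: $TOKEN" \
      "https://api.appnexus.com/siphon-download?siphon_name=standard_feed&hour=2012_02_19_19&timestamp=20120209134931&split_part=0&format=text"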

Step 3. Download the file

Skip this step if you used the -L option in Step 2 above.

Make a GET call to the download location URL from the header of the previous response, and specify a location to save the file to.

Log-level data files are text files compressed in gzip format.

You can then calculate the MD5 checksum of the downloaded file and compare it against the checksum in the response from Step 1. If they do not match, an error may have occurred during download and you should try again.
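For example, assuming $DOWNLOAD_LOCATION holds the URL parsed from the Step 2 response header, the download and verification might look like this sketch:

    # Hypothetical Step 3: fetch the file from the location returned in Step 2.
    curl -o standard_feed_2012_02_19_19_0.gz "$DOWNLOAD_LOCATION"

    # Compare against the checksum from the Step 1 response, then decompress
    # the gzip-compressed text file.
    md5sum standard_feed_2012_02_19_19_0.gz
    gunzip standard_feed_2012_02_19_19_0.gz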

JSON Fields

• name (enum) - The name of the log-level data feed. For possible values, see the list of available feeds on Log Level Data Feeds.

• hour (string) - The hour when the feed was generated, in the format YYYY_MM_DD_HH (e.g., "2012_02_19_19").

• timestamp (string) - The date and time when the feed was generated, in the format YYYYMMDDHHMMSS (e.g., "20120209134931").

• split (array of objects) - The file parts into which the log-level data feed is split. Each part includes a status and checksum. An empty split array indicates that there was no data for that hour. See Split below for more details.

Split

• status (enum) - The status of the file. Please note that the status reflects download activity only for the last 4 hours. Possible values:

    • "new" - File location has not been requested within the last 4 hours.
    • "pending" - File location has been requested within the last 4 hours, but the file has not been downloaded.
    • "in_progress" - File is being downloaded.
    • "completed" - File has been downloaded successfully within the last 4 hours.
    • "error" - Location request or file download failed within the last 4 hours.

  The "error" status usually indicates that connection or timeout issues caused the file download to fail. AppNexus allows up to 20 minutes to complete the download of a single file; if your connection is slow and you exceed this limit, the download will fail. A short timeout interval on your end can cause problems as well, especially if it is less than 20 minutes.

  If you receive an "error" status for a file, we recommend requesting a new download location (Step 2 above) and downloading the file again (Step 3 above). If the download continues to fail, please submit a support request.

• checksum (string) - The MD5 checksum of the file. After you download the file, calculate the MD5 checksum on your end and compare it against ours. If they do not match, an error may have occurred during download and you should try again.
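Based on the fields above, a /siphon response for a single hour might be shaped roughly as follows. This is a sketch only: the outer envelope and the per-part keys are assumptions, the checksums are placeholder values, and only name, hour, timestamp, split, status, and checksum are documented fields:

    {
      "response": {
        "status": "OK",
        "siphons": [
          {
            "name": "standard_feed",
            "hour": "2012_02_19_19",
            "timestamp": "20120209134931",
            "split": [
              { "part": "0", "status": "new", "checksum": "ad0f90a81f02c4d1925ffb2e3e4a32c1" },
              { "part": "1", "status": "completed", "checksum": "c91f4e2bb3e0a1d85abe8f0f3d2b6c77" }
            ]
          }
        ]
      }
    }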

Examples

View list of standard feeds
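A sketch of this call, assuming the standard production endpoint and a previously obtained $TOKEN:

    # List all hourly files available for the standard feed.
    curl -H "Authorization: $TOKEN" \
      "https://api.appnexus.com/siphon?siphon_name=standard_feed"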
View list of all feeds generated at a specific hour
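Under the same assumptions, filtering by hour alone returns every feed generated for that hour:

    # List all feeds generated for the hour 2012_02_19_19 (UTC).
    curl -H "Authorization: $TOKEN" \
      "https://api.appnexus.com/siphon?hour=2012_02_19_19"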
View list of hourly data files generated since your last update from the API
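The value format for updated_since is assumed below to follow the same YYYY_MM_DD_HH convention as hour; adjust it if your feed documentation specifies otherwise:

    # Return only files updated since the last time you polled (UTC).
    curl -H "Authorization: $TOKEN" \
      "https://api.appnexus.com/siphon?siphon_name=standard_feed&updated_since=2012_02_19_19"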