Create an Automated Pipeline

Build an Automated Research Data Pipeline (Warehouse Sync)

Introduction

This tutorial shows how to build an automated pipeline that pulls UserTesting Results API data and loads it into a data warehouse (such as Snowflake, BigQuery, or Redshift) on a scheduled basis.

The goal is to enable nightly or periodic syncs of session-level research data so analytics, BI, and AI teams can work with UserTesting insights alongside product and business metrics.

What you’ll build

A repeatable pipeline that:

  1. Uses a known testId to list all completed sessions
  2. Pages through session results safely
  3. Retrieves structured session details
  4. Retrieves session transcripts
  5. Loads the data into warehouse-friendly tables

Target audience

  • Data engineers
  • Analytics engineers
  • Research operations teams
  • ML / AI teams preparing data for downstream analysis

Prerequisites

  • A valid access token (ACCESS_TOKEN). Go to Authorization for details.
  • A known test ID, referred to as TEST_ID or testId.
  • A scheduler (cron, Airflow, dbt Cloud, GitHub Actions, etc.)
  • A destination warehouse (Snowflake, BigQuery, Redshift, etc.)

High-level architecture

  1. Scheduler triggers the job (nightly or on demand)
  2. Extractor calls the Results API
  3. Transformer normalizes JSON into tables
  4. Loader writes data to the warehouse

graph LR;
    Scheduler --> ResultsAPI;
    ResultsAPI --> JSON;
    JSON --> Normalize;
    Normalize --> WarehouseTables;

Steps

Step 1 — List sessions for a test (pagination aware)

Endpoint

GET /api/v2/sessionResults

Why this matters

  • Tests can have many sessions
  • Pagination must be handled correctly for full exports

Example (curl)

curl --location 'https://api.use2.usertesting.com/api/v2/sessionResults?testId=TEST_ID&limit=25&offset=50' \
  --header 'Authorization: Bearer ACCESS_TOKEN' \
  --header 'Content-Type: application/json'
👍 Best Practice: Always use pagination query parameters (limit and offset) along with the appropriate logic to ensure all sessions are retrieved.

  • Use the response data from limit, offset, and totalCount in the meta.pagination property to calculate the number of page iterations needed to collect all sessions in the test.
  • By default, limit is set to 25 and offset to 0.
  • In the response, sessions are sorted in descending order, i.e., from newest to oldest.

Key fields to store

  • testId
  • sessionId
  • status
  • startTime
  • finishTime
  • audienceId (if present)

Pagination logic

  • Retrieve meta.pagination.totalCount to determine the total number of sessions.
  • Increment the offset by the limit value in each request until the offset reaches or exceeds the totalCount, ensuring all sessions are retrieved.
  • Save a checkpoint if running in incremental mode to track progress and avoid reprocessing.
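
The pagination logic above can be sketched in Python using only the standard library. The response envelope key (`data`) is an assumption for illustration; adjust it to match the actual payload. The `page_offsets` helper shows the offset arithmetic on its own.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://api.use2.usertesting.com/api/v2"

def page_offsets(total_count: int, limit: int = 25) -> list:
    """Offsets needed to cover total_count sessions at the given page size."""
    return list(range(0, total_count, limit))

def list_all_sessions(test_id: str, access_token: str, limit: int = 100) -> list:
    """Page through /sessionResults until meta.pagination.totalCount is covered."""
    sessions, offset, total = [], 0, None
    while total is None or offset < total:
        query = urllib.parse.urlencode(
            {"testId": test_id, "limit": limit, "offset": offset}
        )
        req = urllib.request.Request(
            f"{BASE_URL}/sessionResults?{query}",
            headers={"Authorization": f"Bearer {access_token}"},
        )
        with urllib.request.urlopen(req, timeout=30) as resp:
            body = json.load(resp)
        sessions.extend(body.get("data", []))  # payload key is an assumption
        total = body["meta"]["pagination"]["totalCount"]
        offset += limit
    return sessions
```

For example, a test with 60 sessions at a limit of 25 needs requests at offsets 0, 25, and 50.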

Step 2 — Retrieve session details

Endpoint

GET /api/v2/sessionResults/SESSION_ID

Purpose

This endpoint provides the core structured research data, including:

  • Participant demographics
  • Task types and responses
  • IDs needed for joins

Example (curl)

curl --location 'https://api.use2.usertesting.com/api/v2/sessionResults/SESSION_ID' \
  --header 'Authorization: Bearer ACCESS_TOKEN' \
  --header 'Content-Type: application/json'

Recommended warehouse tables

sessions
  • session_id
  • test_id
  • audience_id
  • test_plan_id
demographics
  • participant_id
  • demographic_code
  • label
  • value
  • type
task_results
  • task_id
  • task_type
  • task_response (JSON / VARIANT)
👍 Note that sessionId and participantId share the same value; this simplifies joining the participant's demographic data to the session.
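
A minimal sketch of normalizing one session-details payload into the table rows above. The input field names (demographics, taskResults, and their keys) are illustrative assumptions, not the exact API schema; map them to the real response shape.

```python
import json

def flatten_session(details: dict) -> dict:
    """Split one session-details payload into warehouse-friendly row groups.
    Input key names are assumptions for illustration."""
    session_row = {
        "session_id": details.get("sessionId"),
        "test_id": details.get("testId"),
        "audience_id": details.get("audienceId"),
        "test_plan_id": details.get("testPlanId"),
    }
    demo_rows = [
        {
            "participant_id": details.get("sessionId"),  # sessionId == participantId
            "demographic_code": d.get("code"),
            "label": d.get("label"),
            "value": d.get("value"),
            "type": d.get("type"),
        }
        for d in details.get("demographics", [])
    ]
    task_rows = [
        {
            "task_id": t.get("taskId"),
            "task_type": t.get("taskType"),
            "task_response": json.dumps(t.get("response")),  # raw JSON / VARIANT
        }
        for t in details.get("taskResults", [])
    ]
    return {"sessions": [session_row], "demographics": demo_rows, "task_results": task_rows}
```

Serializing `task_response` as a JSON string keeps the loader schema stable even when task payloads vary by task type.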


Step 3 — Retrieve transcripts (text-based insights)

Transcripts are available for sessions that have video recordings, such as those based on interaction tests.

Endpoint

GET /api/v2/sessionResults/SESSION_ID/transcript

Response format

  • text/vtt (WebVTT subtitle format)

Example (curl)

curl --location 'https://api.use2.usertesting.com/api/v2/sessionResults/SESSION_ID/transcript' \
--header 'Authorization: Bearer ACCESS_TOKEN'

Warehouse strategy

  • Store raw VTT text as-is
  • Optionally:
    • Strip timestamps
    • Store a derived plain_text column
    • Chunk text for LLM usage

Recommended table

transcripts
  • session_id
  • transcript_vtt (TEXT)
  • transcript_plain_text (TEXT, optional)
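
Deriving the optional plain_text column can be as simple as dropping the WebVTT header, cue numbers, and timestamp lines. This sketch assumes numeric cue identifiers; real VTT cues may carry non-numeric IDs or styling, so treat it as a starting point.

```python
def vtt_to_plain_text(vtt: str) -> str:
    """Derive a plain_text column from raw WebVTT by keeping only spoken text."""
    kept = []
    for line in vtt.splitlines():
        line = line.strip()
        # Skip the header, blank lines, timestamp cues, and numeric cue IDs
        if not line or line == "WEBVTT" or "-->" in line or line.isdigit():
            continue
        kept.append(line)
    return " ".join(kept)
```

Store the raw VTT alongside the derived text so timestamps remain recoverable for LLM chunking later.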

Step 4 — (Optional) Video handling strategy

Video downloads are available for sessions with recorded videos, such as those based on interaction tests.

Endpoint

GET /api/v2/sessionResults/SESSION_ID/videoDownloadUrl

Request sample

curl --location 'https://api.use2.usertesting.com/api/v2/sessionResults/SESSION_ID/videoDownloadUrl' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer ACCESS_TOKEN'

Important behavior

  • URLs are pre-signed
  • URLs expire after one hour

Best practice for pipelines

  • Do not store the video URL itself
  • Either:
    • Download the video immediately into object storage (S3, GCS)
    • Or skip video ingestion for nightly syncs

Recommended table (metadata only)

session_videos
  • session_id
  • video_storage_path
  • downloaded_at
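
If you do ingest video, the download must happen in the same run that fetched the pre-signed URL, since it expires after one hour. A minimal sketch, writing to a local path as a stand-in for object storage (the storage-path convention is an assumption):

```python
import urllib.request
from pathlib import Path

def video_storage_path(session_id: str, prefix: str = "videos") -> str:
    """Deterministic storage key for a session's recording (naming is illustrative)."""
    return f"{prefix}/{session_id}.mp4"

def download_video(presigned_url: str, dest: Path) -> None:
    """Stream the recording to storage immediately -- the pre-signed URL
    expires after one hour, so never persist the URL itself."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(presigned_url, timeout=120) as resp, open(dest, "wb") as out:
        while chunk := resp.read(1 << 20):  # stream in 1 MiB chunks
            out.write(chunk)
```

Record only `video_storage_path` and `downloaded_at` in the warehouse, never the expiring URL.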

Rate limits and batching

Keep in mind:

  • Endpoints allow 10 requests per minute

To stay within limits, use:

  • Request batching
  • Exponential backoff on rate-limit errors (HTTP 429)
  • Bounded parallelism (e.g., 3–5 concurrent requests)
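
A simple backoff wrapper like the `with_backoff` used in the pseudocode below might look like this. The retry policy (5 attempts, doubling delay with jitter) is a common default, not a documented requirement of the API.

```python
import random
import time

def with_backoff(call, max_retries: int = 5, base_delay: float = 1.0, is_rate_limited=None):
    """Retry `call` with exponential backoff plus jitter.
    `is_rate_limited` decides whether an exception is retryable (e.g., HTTP 429);
    by default every exception is retried."""
    if is_rate_limited is None:
        is_rate_limited = lambda exc: True
    for attempt in range(max_retries):
        try:
            return call()
        except Exception as exc:
            if attempt == max_retries - 1 or not is_rate_limited(exc):
                raise
            # 1s, 2s, 4s, ... plus random jitter to avoid thundering herds
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

In production you would inspect the response status and honor any Retry-After header instead of retrying every exception.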

Incremental vs full syncs

Full sync (simplest)

  • Re-pull all sessions for a test on a nightly basis
  • Overwrite or upsert warehouse tables

Incremental sync (recommended at scale)

  • Track:
    • session finishTime
    • last successful sync timestamp
  • Only process newly completed sessions

Example pseudo-code

checkpoint = read_checkpoint(test_id)  # e.g., last_finish_time or last_run_at
max_finish_time_processed = checkpoint

for page in paginate_sessions(test_id, status="completed", page_size=100):
    for s in page.sessions:

        if checkpoint and s.finishTime <= checkpoint:
            continue  # incremental mode: already processed

        details = with_backoff(lambda: get_session_details(s.sessionId))
        transcript = with_backoff(lambda: get_transcript(s.sessionId))  # may be absent for some sessions

        upsert_sessions(details)
        upsert_participants(details)
        upsert_demographics(details)
        upsert_task_results(details)
        upsert_transcripts(s.sessionId, transcript)

        # track the newest session processed this run
        if max_finish_time_processed is None or s.finishTime > max_finish_time_processed:
            max_finish_time_processed = s.finishTime

        # optional: if ingesting video
        # video_url = with_backoff(lambda: get_video_download_url(s.sessionId))
        # download_immediately(video_url)  # URL expires quickly
        # upsert_session_videos_metadata(...)

update_checkpoint(test_id, max_finish_time_processed)
🖥️ Code overview: The pipeline reads a saved checkpoint (such as the last processed finishTime), then pages through all completed sessions for a test. For each session, it skips records that were already processed, retrieves structured session details and transcripts with retry and backoff handling, and upserts the normalized data into warehouse tables. An optional step shows how video downloads could be handled immediately due to expiring URLs. Finally, the pipeline updates the checkpoint so future runs only process newly completed sessions.


Common errors and troubleshooting

| Error | Meaning | Fix |
| --- | --- | --- |
| 401 | Invalid or missing access token | Verify the Authorization header |
| 404 | Test or session not found | Validate IDs |
| 429 | Rate limit exceeded | Add backoff and batching |

What you can build next

Once this pipeline is running, you can:

  • Power BI dashboards with UX metrics
  • Feed transcripts into LLM workflows
  • Join UserTesting data with product analytics
  • Trigger alerts when certain task patterns appear

Summary

You now have a scalable pattern to:

  • Extract UserTesting research data programmatically
  • Normalize it for analytics and AI
  • Keep insights continuously in sync with your organization’s data ecosystem

This pipeline forms the backbone for advanced dashboards, automation, and AI-powered insight generation.