Back to Cube

Implementing event analytics

docs-mintlify/recipes/data-modeling/event-analytics.mdx

1.6.4322.1 KB
Original Source

This tutorial walks through how to transform raw event data into sessions. Many “out-of-box” web analytics solutions come already prepackaged with sessions, but they work as a “black box.” It doesn’t give the user either insight into or control how these sessions defined and work.

With Cube SQL-based sessions data model, you’ll have full control over how these metrics are defined. It will give you great flexibility when designing sessions and events to your unique business use case.

A few question we’ll answer with our sessions data model:

  • How do we measure session duration?
  • What is our bounce rate?
  • What areas of the app are most used?
  • Where are users spending most of their time?
  • How do we filter sessions where a user performs a specific action?

We’ll explore the subject using the data from Segment.com’s analytics.js library. The same concept could be applied for different data collection tools, such as Snowplow.

What is a session?

A session is defined as a group of interactions one user takes within a given time frame on your app. Usually that time frame defaults to 30 minutes, meaning that whatever a user does on your app (e.g. browses pages, downloads resources, purchases products) before they leave equals one session.

<div style={{ textAlign: "center" }}> </div>

Unify events and page views into single cube

Segment stores page view data as a pages table and events data as a tracks table. For sessions we want to rely not only on page views data, but on events as well. Imagine you have a highly interactive app, a user loads a page and can stay on this page interacting with the website for while. Hence, you want to count events as part of the session as well.

To do that we need to combine page view data and event data into a single cube. We’ll call the cube just events and assign a page views event type to pageview. Also, we’re going to assign a unique event_id to every event to use as primary key.

<CodeGroup>
yaml
cubes:
  - name: events
    sql: |
      SELECT
        t.id || '-e' as event_id
        , t.anonymous_id as anonymous_id
        , t.timestamp
        , t.event
        , t.context_page_path as page_path
        , NULL as referrer
      from javascript.tracks as t

      UNION ALL

      SELECT
        p.id as event_id
        , p.anonymous_id
        , p.timestamp
        , 'pageview' as event
        , p.context_page_path as page_path
        , p.referrer as referrer
      FROM javascript.pages as p
javascript
cube(`events`, {
  sql: `
     SELECT
      t.id || '-e' as event_id
      , t.anonymous_id as anonymous_id
      , t.timestamp
      , t.event
      , t.context_page_path as page_path
      , NULL as referrer
    from javascript.tracks as t

    UNION ALL

    SELECT
      p.id as event_id
      , p.anonymous_id
      , p.timestamp
      , 'pageview' as event
      , p.context_page_path as page_path
      , p.referrer as referrer
    FROM javascript.pages as p
    `
})
</CodeGroup>

The above SQL creates base table for our events cube. Now we can add some measures to calculate the number of events and number of page views only, using a filter on event column.

<CodeGroup>
yaml
cubes:
  - name: events
    # ...

    measures:
      - name: count
        sql: event_id
        type: count

      - name: page_views_count
        sql: event_id
        type: count
        filters: [{ sql: "{CUBE}.event = 'pageview'" }]
javascript
cube("events", {
  // ...,

  measures: {
    count: {
      sql: `event_id`,
      type: `count`
    },

    page_views_count: {
      sql: `event_id`,
      type: `count`,
      filters: [{ sql: `${CUBE}.event = 'pageview'` }]
    }
  }
})
</CodeGroup>

Having this in place, we will already be able to calculate the total number of events and pageviews. Next, we’re going to add dimensions to be able to filter events in a specific time range and for specific types.

<CodeGroup>
yaml
cubes:
  - name: events
    # ...

    dimensions:
      - name: anonymous_id
        sql: anonymous_id
        type: number
        primary_key: true

      - name: event_id
        sql: event_id
        type: number
        primary_key: true

      - name: timestamp
        sql: timestamp
        type: time

      - name: event
        sql: event
        type: string
javascript
cube("events", {
  // ...,

  dimensions: {
    anonymous_id: {
      sql: `anonymous_id`,
      type: `number`,
      primary_key: true
    },

    event_id: {
      sql: `event_id`,
      type: `number`,
      primary_key: true
    },

    timestamp: {
      sql: `timestamp`,
      type: `time`
    },

    event: {
      sql: `event`,
      type: `string`
    }
  }
})
</CodeGroup>

Now we have everything for Events cube and can move forward to grouping these events into sessions.

Creating Sessions

As a recap, a session is defined as a group of interactions one user takes within a given time frame on your app. Usually that time frame defaults to 30 minutes. First, we’re going to use LAG() function in Redshift to determine an inactivity_time between events.

sql
select
  e.event_id AS event_id
  , e.anonymous_id AS anonymous_id
  , e.timestamp AS timestamp
  , DATEDIFF(minutes, LAG(e.timestamp) OVER(PARTITION BY e.anonymous_id ORDER BY e.timestamp), e.timestamp) AS inactivity_time
FROM events AS e

inactivity_time is the time in minutes between the current event and the previous. We’re going to use inactivity_time to terminate a session based on 30 minutes of inactivity. This window could be changed to any value, based on how users interact with your app. Now we’re ready to introduce our Sessions cube.

<CodeGroup>
yaml
cubes:
  - name: sessions
    sql: |
      SELECT
        ROW_NUMBER() OVER(PARTITION BY event.anonymous_id ORDER BY event.timestamp) || ' - '|| event.anonymous_id AS session_id
        , event.anonymous_id
        , event.timestamp AS session_start_at
        , ROW_NUMBER() OVER(PARTITION BY event.anonymous_id ORDER BY event.timestamp) AS session_sequence
        , LEAD(timestamp) OVER(PARTITION BY event.anonymous_id ORDER BY event.timestamp) AS next_session_start_at
      FROM (
        SELECT e.anonymous_id
          , e.timestamp
          , DATEDIFF(minutes
          , LAG(e.timestamp) OVER(PARTITION BY e.anonymous_id ORDER BY e.timestamp)
          , e.timestamp) AS inactivity_time
        FROM {events.sql()} AS e
      ) AS event
      WHERE (event.inactivity_time > 30 OR event.inactivity_time IS NULL)
javascript
// Create new cube for sessions with the following content
cube(`sessions`, {
  sql: `
    SELECT
      ROW_NUMBER() OVER(PARTITION BY event.anonymous_id ORDER BY event.timestamp) || ' - '|| event.anonymous_id AS session_id
      , event.anonymous_id
      , event.timestamp AS session_start_at
      , ROW_NUMBER() OVER(PARTITION BY event.anonymous_id ORDER BY event.timestamp) AS session_sequence
      , LEAD(timestamp) OVER(PARTITION BY event.anonymous_id ORDER BY event.timestamp) AS next_session_start_at
    FROM (
      SELECT
        e.anonymous_id
        , e.timestamp
        , DATEDIFF(minutes, LAG(e.timestamp) OVER(PARTITION BY e.anonymous_id ORDER BY e.timestamp), e.timestamp) AS inactivity_time
      FROM ${events.sql()} AS e
    ) AS event
    WHERE (event.inactivity_time > 30 OR event.inactivity_time IS NULL)
    `
})
</CodeGroup>

As a primary key, we’re going to use session_id, which is the combination of the anonymous_id and the session sequence, since it’s guaranteed to be unique for each session. Having this in place, we can already count sessions and plot a time series chart of sessions.

<CodeGroup>
yaml
cubes:
  - name: sessions
    # ...

    measures:
      - name: count
        sql: session_id
        type: count

    dimensions:
      - name: anonymous_id
        sql: anonymous_id
        type: number
        primary_key: true

      - name: session_id
        sql: session_id
        type: number
        primary_key: true

      - name: start_at
        sql: session_start_at
        type: time

      - name: next_start_at
        sql: next_session_start_at
        type: time
javascript
cube("sessions", {
  // ...,

  measures: {
    count: {
      sql: `session_id`,
      type: `count`
    }
  },

  dimensions: {
    anonymous_id: {
      sql: `anonymous_id`,
      type: `number`,
      primary_key: true
    },

    session_id: {
      sql: `session_id`,
      type: `number`,
      primary_key: true
    },

    start_at: {
      sql: `session_start_at`,
      type: `time`
    },

    next_start_at: {
      sql: `next_session_start_at`,
      type: `time`
    }
  }
})
</CodeGroup>

Connecting Events to Sessions

The next step is to identify the events contained within the session and the events ending the session. It’s required to get metrics such as session duration and events per session, or to identify sessions where specific events occurred (we’re going to use that for funnel analysis later on). We’re going to declare a join such that the events cube has a many_to_one relation to the sessions cube, and specify a condition, such as all users' events from session start (inclusive) till the start of the next session (exclusive) belong to that session.

<CodeGroup>
yaml
cubes:
  - name: events
    # ...

    joins:
      - name: sessions
        relationship: many_to_one
        sql: |
          {events.anonymous_id} = {sessions.anonymous_id}
          AND {events.timestamp} >= {sessions.start_at}
          AND ({events.timestamp} < {sessions.next_start_at} or {sessions.next_start_at} is null)
javascript
cube("events", {
  // ...,

  joins: {
    sessions: {
      relationship: `many_to_one`,
      sql: `
        ${events.anonymous_id} = ${sessions.anonymous_id}
        AND ${events.timestamp} >= ${sessions.start_at}
        AND (${events.timestamp} < ${sessions.next_start_at} or ${sessions.next_start_at} is null)
      `
    }
  }
})
</CodeGroup>

To determine the end of the session, we’re going to use a subquery dimension.

<CodeGroup>
yaml
cubes:
  - name: events
    # ...

    measures:
      - name: last_event_timestamp
        sql: timestamp
        type: max
        public: false

  - name: sessions
    # ...

    dimensions:
      - name: end_raw
        sql: "{events.last_event_timestamp}"
        type: time
        sub_query: true
        public: false

      - name: end_at
        sql: |
          CASE WHEN {end_raw} + INTERVAL '1 minutes' > {CUBE}.next_session_start_at
            THEN {CUBE}.next_session_start_at
            ELSE {end_raw} + INTERVAL '30 minutes'
          END

      - name: duration_minutes
        sql: "datediff(minutes, {CUBE}.session_start_at, {end_at})"
        type: number
        
    measures:
      - name: average_duration_minutes
        sql: "{duration_minutes}"
        type: avg
javascript
cube("events", {
  // ...,

  measures: {
    last_event_timestamp: {
      sql: `timestamp`,
      type: `max`,
      public: false
    }
  }
})

cube("sessions", {
  // ...,

  dimensions: {
    end_raw: {
      sql: `${events.last_event_timestamp}`,
      type: `time`,
      sub_query: true,
      public: false
    },

    end_at: {
      sql: `CASE WHEN ${end_raw} + INTERVAL '1 minutes' > ${CUBE}.next_session_start_at
         THEN ${CUBE}.next_session_start_at
         ELSE ${end_raw} + INTERVAL '30 minutes'
         END`,
      type: `time`
    },

    duration_minutes: {
      sql: `datediff(minutes, ${CUBE}.session_start_at, ${end_at})`,
      type: `number`
    }
  },

  measures: {
    average_duration_minutes: {
      type: `avg`,
      sql: `${duration_minutes}`
    }
  }
})
</CodeGroup>

Mapping Sessions to Users

Right now all our sessions are anonymous, so the final step in our modeling would be to map sessions to users in case, they have signed up and have been assigned a user_id. Segment keeps track of such assignments in a table called identifies. Every time you identify a user with segment it will connect the current anonymous_id to the identified user id.

We’re going to create an identifies cube, which will not contain any visible measures and dimensions for users to use in Insights, but instead will provide us with a user_id to use in the Sessions cube. Also, identifies could be used later on to join sessions to your users cube, which could be a cube built based on your internal database data for users.

<CodeGroup>
yaml
# Create a new file for the `identifies` cube with following content
cubes:
  - name: identifies
    sql: "SELECT distinct user_id, anonymous_id FROM javascript.identifies"

    dimensions:
      - name: id
        sql: "user_id || '-' || anonymous_id"
        type: string
        primary_key: true

      - name: anonymous_id
        sql: anonymous_id
        type: number

      - name: user_id
        sql: user_id
        type: number
        format: id
javascript
// Create a new file for the `identifies` cube with following content
cube(`identifies`, {
  sql: `SELECT distinct user_id, anonymous_id FROM javascript.identifies`,

  dimensions: {
    id: {
      sql: `user_id || '-' || anonymous_id`,
      type: `string`,
      primary_key: true
    },

    anonymous_id: {
      sql: `anonymous_id`,
      type: `number`
    },

    user_id: {
      sql: `user_id`,
      type: `number`,
      format: `id`
    }
  }
})
</CodeGroup>

We need to declare a relationship between identifies and sessions, where session has a many_to_one relationship with identity.

<CodeGroup>
yaml
cubes:
  - name: sessions
    # ...

    joins:
      - name: identifies
        relationship: many_to_one
        sql: "{identifies.anonymous_id} = {sessions.anonymous_id}"
javascript
cube("sessions", {
  // ...,

  joins: {
    identifies: {
      relationship: `many_to_one`,
      sql: `${identifies.anonymous_id} = ${sessions.anonymous_id}`
    }
  }
})
</CodeGroup>

Once we have it, we can create a dimension user_id, which will be either a user_id from the identifies table or an anonymous_id in case we don’t have the identity of a visitor, which means that this visitor never signed in.

<CodeGroup>
javascript
cube("sessions", {
  // ...,

  dimensions: {
    user_id: {
      sql: `coalesce(${identifies.user_id}, ${CUBE}.anonymous_id)`,
      type: `string`
    }
  }
})
yaml
cubes:
  - name: sessions
    # ...

    dimensions:
      - name: user_id
        sql: "coalesce({identifies.user_id}, {CUBE}.anonymous_id)"
        type: string

</CodeGroup>

Based on the just-created dimension, we can add two new metrics: the count of
users and the average sessions per user.

<CodeGroup>

```yaml title="YAML"
cubes:
  - name: sessions
    # ... 

    measures:
      - name: users_count
        sql: "{user_id}"
        type: count_distinct

      - name: average_sessions_per_user
        sql: "{count}::NUMERIC / NULLIF({users_count}, 0)"
        type: number
javascript
cube("sessions", {
  // ...,

  measures: {
    users_count: {
      sql: `${user_id}`,
      type: `count_distinct`
    },

    average_sessions_per_user: {
      sql: `${count}::NUMERIC / NULLIF(${users_count}, 0)`,
      type: `number`
    }
  }
})
</CodeGroup>

That was our final step in building a foundation for a sessions data model. Congratulations on making it here! Now we’re ready to add some advanced metrics on top of it.

More metrics for Sessions

Number of Events per Session

This one is super easy to add with a subquery dimension. We just calculate the number of events, which we already have as a measure in the events cube, as a dimension in the sessions cube.

<CodeGroup>
yaml
cubes:
  - name: sessions
    # ...

    dimensions:
      - name: number_events
        sql: "{events.count}"
        type: number
        sub_query: true
javascript
cube("sessions", {
  // ...,

  dimensions: {
    number_events: {
      sql: `${events.count}`,
      type: `number`,
      sub_query: true
    }
  }
})
</CodeGroup>

Bounce Rate

we’ve just defined the number of events per session, we can easily add a dimension is_bounced to identify bounced sessions to the Sessions cube. Using this dimension, we can add two measures to the Sessions cube as well - a count of bounced sessions and a bounce rate.

<CodeGroup>
yaml
cubes:
  - name: sessions
    # ...

    dimensions:
      - name: is_bounced
        type: string
        case:
          when: [{ sql: "{number_events} = 1", label: "True" }]
          else: { label: "False" }

    measures:
      - name: bounced_count
        sql: session_id
        type: count
        filters:
          - - sql: "{is_bounced} = 'True'

      - name: bounce_rate
        sql: "1.0 * {bounced_count} / NULLIF({count}, 0)"
        type: number
        format: percent
javascript
cube("sessions", {
  // ...,

  dimensions: {
    is_bounced: {
      type: `string`,
      case: {
        when: [{ sql: `${number_events} = 1`, label: `True` }],
        else: { label: `False` }
      }
    }
  },

  measures: {
    bounced_count: {
      sql: `session_id`,
      type: `count`,
      filters: [
        {
          sql: `${is_bounced} = 'True'`
        }
      ]
    },

    bounce_rate: {
      sql: `1.0 * ${bounced_count} / NULLIF(${count}, 0)`,
      type: `number`,
      format: `percent`
    }
  }
})
</CodeGroup>

First Referrer

We already have this column in place in our base table. We’re just going to define a dimension on top of this.

<CodeGroup>
yaml
cubes:
  - name: sessions
    # ...

    measures:
      - name: first_referrer
        type: string
        sql: first_referrer
javascript
cube("sessions", {
  // ...,

  measures: {
    first_referrer: {
      type: `string`,
      sql: `first_referrer`
    }
  }
})
</CodeGroup>

Sessions New vs Returning

Same as for the first referrer. We already have a session_sequence field in the base table, which we can use for the is_first dimension. If session_sequence is 1 - then it belongs to the first session, otherwise - to a repeated session.

<CodeGroup>
javascript
cube("sessions", {
  // ...,

  dimensions: {
    is_first: {
      type: `string`,
      case: {
        when: [{ sql: `${CUBE}.session_sequence = 1`, label: `First` }],
        else: { label: `Repeat` }
      }
    }
  },

  measures: {
    repeat_count: {
      description: `Repeat Sessions Count`,
      sql: `session_id`,
      type: `count`,
      filters: [{ sql: `${is_first} = 'Repeat'` }]
    },

    repeat_percent: {
      description: `Percent of Repeat Sessions`,
      sql: `1.0 * ${repeat_count} / NULLIF(${count}, 0)`,
      type: `number`,
      format: `percent`
    }
  }
})
yaml
cubes:
  - name: sessions
    # ...

    dimensions: 
      - name: is_first
        type: string
        case:
          when: [{ sql: "{CUBE}.session_sequence = 1", label: "First" }]
          else: { label: "Repeat" }

    measures:
      - name: repeat_count
        description: Repeat Sessions Count
        sql: session_id
        type: count
        filters: [{ sql: "{is_first} = 'Repeat'" }]

      - name: repeat_percent
        description: Percent of Repeat Sessions
        sql: "1.0 * {repeat_count} / NULLIF({count}, 0)"
        type: number
        format: percent

</CodeGroup>

### Filter Sessions, where user performs specific event

Often, you want to select specific sessions where a user performed some
important action. In the example below, we’ll filter out sessions where the
`form_submitted` event happened. To do that, we need to follow 3 steps:

Define a measure on the Events cube to count only `form_submitted` events.

<CodeGroup>

```yaml title="YAML"
cubes:
  - name: events
    # ...

    # Add this measure to the `events` cube
    measures:
      - name: form_submitted_count
        sql: event_id
        type: count
        filters: [{ sql: "{CUBE}.event = 'form_submitted'" }]
javascript
cube("events", {
  // ...,

  // Add this measure to the `events` cube
  measures: {
    form_submitted_count: {
      sql: `event_id`,
      type: `count`,
      filters: [{ sql: `${CUBE}.event = 'form_submitted'` }]
    }
  }
})
</CodeGroup>

Define a dimension form_submitted_count on the Sessions using sub_query.

<CodeGroup>
yaml
cubes:
  - name: sessions
    # ...

    # Add this dimension to the `sessions` cube
    dimensions:
      - name: form_submitted_count
        sql: "{events.form_submitted_count}"
        type: number
        sub_query: true
javascript
cube("sessions", {
  // ...,

  // Add this dimension to the `sessions` cube
  dimensions: {
    form_submitted_count: {
      sql: `${events.form_submitted_count}`,
      type: `number`,
      sub_query: true
    }
  }
})
</CodeGroup>

Create a measure to count only sessions where form_submitted_count is greater than 0.

<CodeGroup>
yaml
cubes:
  - name: sessions
    # ...

    # Add this measure to the `sessions` cube
    measures:
      - name: with_form_submitted_count
        sql: session_id
        type: count
        filters: [{ sql: "{form_submitted_count} > 0" }]
javascript
cube("sessions", {
  // ...,

  // Add this measure to the `sessions` cube
  measures: {
    with_form_submitted_count: {
      type: `count`,
      sql: `session_id`,
      filters: [{ sql: `${form_submitted_count} > 0` }]
    }
  }
})
</CodeGroup>

Now we can use the with_form_submitted_count measure to get only sessions when the form_submitted event occurred.