I am trying to build a user-flow or sankey diagram so I could find out what are the visited pages based on visitor_id or pathname. I would like to be able to see which pages contribute visits to others, and in which proportion. It would also be nice to have things like elapsed time, or number of pages in a single session for a user, or which are the entry and exit pages and so on, similar to web analytics tool.
My issue is that all I capture is a flat table with events, no concept or session as a user can navigate at any time. For the purpose of my analytics, I would define a session like a visit that was more than 1 hour apart from the last one for the same user.
My (simplified) table schema looks like this:
CREATE TABLE events (
visitor_id SYMBOL,
pathname SYMBOL,
timestamp TIMESTAMP,
metric_name SYMBOL
) TIMESTAMP(timestamp) PARTITION BY MONTH WAL;
select CAST(pathname as long) from events;
I have tried different queries, but I struggle to get all I want in a single one. As an example, this is a query I am trying to use for entry pages, but it lacks the rest of features:
WITH FirstVisit AS (
SELECT visitor_id, min(timestamp) AS first_visit, pathname AS landing_page
FROM events WHERE metric_name = 'page_view' AND timestamp > dateadd('d', -7, now())
GROUP BY visitor_id, landing_page
)
SELECT vit.pathname, count(*) AS pageviews FROM vitals as vit
INNER JOIN FirstVisit fv ON vit.visitor_id = fv.visitor_id
WHERE vit.metric_name = 'page_view' AND vit.pathname != fv.landing_page AND vit.timestamp > fv.first_visit AND timestamp > dateadd('d', -7, now())
GROUP BY vit.pathname
ORDER BY pageviews DESC
LIMIT 5;
To make things easier, I have generated a demo CSV file you can use to populate a table with simulated data for 4 visitors during 7 days, with 2 sessions per day.