Data Sources & Pipeline Reference
How the dashboards get their data — last built 2026-04-13 09:56
Overview
This site is generated by a daily Python pipeline that pulls analytics from
Matomo Cloud (cefic.matomo.cloud, site ID 5), stores them in a local
DuckDB database, and renders a set of static HTML dashboards. The pipeline runs
via cron at 06:00 CET every day (run_matomo_cron.sh); the generated site is then
pushed to GitHub and deployed through Netlify. Runs are idempotent: campaign daily metrics are
appended incrementally, while period-snapshot tables (fact_*_period) are
truncated and re-inserted on every run.
Product views
Each main dashboard fits on one screen and answers one business question.
- Executive Overview — What's happening at a glance? KPIs over time, top campaigns with sub-campaign drill-down, top pages by type, MoM / YoY.
- Traffic & Acquisition — How do visitors arrive? Channel mix, keywords, referring websites, social networks, and a world map.
- Campaign Intelligence — What's working in our campaigns? Per-campaign quality score (engagement + conversions + channel diversity) with automated insights.
- Audience & Behaviour — Who are they and how do they navigate? Devices, browsers, new vs returning, entry and exit pages, internal search, events, media play rate.
- Content Performance — Which content actually delivers? Top pages, category breakdown, freshness vs performance scatter, top downloads.
Legacy drill-down pages (Dashboard, Campaigns, Audiences, Trends, Source/Medium, Chord, Monthly Ranking) remain available for deeper exploration on specific slices.
Data sources — Matomo endpoints
Endpoints called during each daily pipeline run, tables they populate, and the views that consume them.
| Endpoint | Data pulled | Tables populated | Used by |
|---|---|---|---|
| Referrers.getCampaigns | Per-campaign daily metrics (visits, page views, bounce, conversions, avg time) | fact_campaign_daily | Overview, Campaigns, Campaign Intel |
| MarketingCampaignsReporting.getName (expanded) | mtm_content sub-campaigns, daily | fact_campaign_content_daily | Overview (campaign expand) |
| MarketingCampaignsReporting.getSourceMedium | Per-campaign source / medium breakdown | in-memory only | Campaign Intel (channel diversity score) |
| Actions.getPageUrls (flat) | Page-level daily stats + page-type classification | fact_page_daily | Overview, Content |
| VisitFrequency.get | Daily new vs returning totals | fact_visit_frequency_daily | Overview, Audience |
| Referrers.getReferrerType | Channel rollup (search, direct, referral, campaign, social, AI) | fact_channel_daily | Traffic |
| Referrers.getKeywords | Top organic search keywords | fact_keyword_period | Traffic |
| Referrers.getWebsites | Top referring websites | fact_website_period | Traffic |
| Referrers.getSocials | Traffic from social networks | fact_social_period | Traffic |
| UserCountry.getCountry | Visits by country | fact_country_daily | Traffic (map + list) |
| UserCountry.getCity | Visits by city | fact_city_period | Traffic |
| DevicesDetection.getType | Desktop / smartphone / tablet split | fact_device_daily | Audience |
| DevicesDetection.getBrowsers | Browser breakdown with engagement | fact_browser_daily | Audience |
| Actions.getEntryPageUrls | Top landing pages with entry bounce | fact_entry_page_period | Audience |
| Actions.getExitPageUrls | Top exit pages with exit rate | fact_exit_page_period | Audience |
| Actions.getSiteSearchKeywords | Internal search terms, hits, exit rate | fact_site_search_period | Audience |
| Events.getCategory | Event categories (CTAs, downloads, etc.) | fact_event_category_daily | Audience |
| Events.getName | Individual event names (used for downloads) | fact_event_name_period | Content (downloads list) |
| MediaAnalytics.get | Video plays, impressions, play rate | dim_media_summary | Audience (video play rate) |
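As an illustration of how one of the endpoints above might be requested, here is a minimal sketch that only builds the Reporting API URL and makes no network call. The helper name build_report_url is hypothetical and not taken from ingest_matomo.py; the query parameters (module, method, idSite, period, date, format, flat, expanded) are standard Matomo Reporting API parameters. Authentication is deliberately left out; in practice a token_auth value would be sent with the request.

```python
from urllib.parse import urlencode

# Host and site ID as stated in the Overview section.
BASE_URL = "https://cefic.matomo.cloud/index.php"
SITE_ID = 5


def build_report_url(method, date, period="day", **extra):
    """Return a Matomo Reporting API URL for one report method.

    Hypothetical helper for illustration only. `extra` carries
    report-specific flags such as flat=1 or expanded=1.
    """
    params = {
        "module": "API",
        "method": method,
        "idSite": SITE_ID,
        "period": period,
        "date": date,
        "format": "JSON",
    }
    params.update(extra)
    return f"{BASE_URL}?{urlencode(params)}"


# The "(flat)" and "(expanded)" notes in the table map to query flags.
pages_url = build_report_url("Actions.getPageUrls", "2026-04-12", flat=1)
campaigns_url = build_report_url(
    "MarketingCampaignsReporting.getName", "2026-04-12", expanded=1
)
```

Keeping URL construction in one function means each endpoint in the table differs only by its method name and optional flags.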
Marts (SQL views)
Views derived from fact tables, computed on demand when the site is built.
- mart_campaign_period: campaign rollup over the full period with audience, bounce, avg time, conversion rate. Feeds Campaigns and Campaign Intel.
- mart_daily_totals: campaign-wide daily totals used by the legacy Dashboard page.
- mart_campaign_movers: last-7 vs previous-7 rising / falling campaigns. Used by the Overview insights panel.
- mart_page_type_period: pageviews, bounce and avg time per page_type. Feeds Content.
- mart_top_pages_by_type: top pages per type with rank. Feeds the Overview page type tabs.
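The last-7 vs previous-7 comparison behind mart_campaign_movers can be sketched as follows. The real mart is a SQL view whose definition is not shown here; this Python version only illustrates the windowing logic. It assumes rows of (campaign, day, visits), and the function name campaign_movers, the choice of visits as the metric, and the sort order are all illustrative assumptions.

```python
from collections import defaultdict
from datetime import date, timedelta


def campaign_movers(rows, as_of):
    """Compare each campaign's last 7 days with the 7 days before.

    rows: iterable of (campaign, day, visits) tuples (assumed shape).
    as_of: last day of the most recent 7-day window.
    Returns (campaign, previous_7, last_7, delta) tuples, biggest
    risers first.
    """
    last_start = as_of - timedelta(days=6)    # first day of the last-7 window
    prev_start = as_of - timedelta(days=13)   # first day of the previous-7 window
    last = defaultdict(int)
    prev = defaultdict(int)
    for campaign, day, visits in rows:
        if last_start <= day <= as_of:
            last[campaign] += visits
        elif prev_start <= day < last_start:
            prev[campaign] += visits
    movers = [
        (c, prev[c], last[c], last[c] - prev[c])
        for c in set(last) | set(prev)
    ]
    movers.sort(key=lambda m: m[3], reverse=True)
    return movers


demo_rows = [
    ("spring_launch", date(2026, 4, 10), 50),
    ("spring_launch", date(2026, 4, 3), 20),
    ("newsletter", date(2026, 4, 11), 5),
    ("newsletter", date(2026, 4, 1), 30),
]
demo_movers = campaign_movers(demo_rows, as_of=date(2026, 4, 12))
```

In this demo, spring_launch rises from 20 to 50 visits and newsletter falls from 30 to 5, so spring_launch sorts first.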
Page taxonomy
How every URL gets classified into a single page_type.
The canonical taxonomy is defined in docs/cefic_site_structure.md in the repository —
it is the single source of truth for PAGE_TYPE_RULES, PAGE_TYPE_LABELS and
PAGE_TYPE_ORDER, consumed by both ingest_matomo.py (to tag each page
when the daily feed is written) and build_site.py (to label and order the charts).
The 13 canonical types are:
news, policy, guidance, case_studies,
events, science, industry_data, sectors,
highlights, resources, about, home, other.
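A rule-based classifier of this kind might look like the sketch below. The actual contents of PAGE_TYPE_RULES live in docs/cefic_site_structure.md and are not reproduced here: the path prefixes below are hypothetical placeholders, and the first-match-wins ordering and the function name classify_page are likewise assumptions.

```python
from urllib.parse import urlparse

# Hypothetical prefix rules for illustration; the canonical rules are
# defined in docs/cefic_site_structure.md and may differ.
PAGE_TYPE_RULES = [
    ("/news/", "news"),
    ("/policy/", "policy"),
    ("/events/", "events"),
]


def classify_page(url):
    """Map a URL (or bare path) to exactly one page_type.

    The site root maps to "home"; anything matching no rule falls
    back to "other", so every URL receives a single type.
    """
    path = urlparse(url).path or "/"
    if path == "/":
        return "home"
    if not path.endswith("/"):
        path += "/"  # so "/news" and "/news/x" both match "/news/"
    for prefix, page_type in PAGE_TYPE_RULES:
        if path.startswith(prefix):
            return page_type  # first matching rule wins
    return "other"
```

Because both ingest_matomo.py and build_site.py read the same taxonomy, a single function of this shape keeps tagging and chart labelling in agreement.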
Known limitations
What is not tracked — and why.
- Scroll depth: not tracked at the cefic.org tag level. Would require custom JS in the Matomo tracker.
- Navigation paths: would require Live.getLastVisitsDetails, which is too expensive at our scale. Entry and exit pages are used as a proxy on the Audience page.
- Per-page conversions: would require one saved Matomo segment per page, not practical at scale.
- Device mix per campaign: same constraint as above (one saved segment per campaign).
- Search Console plugin: not enabled. Matomo's built-in search keywords (Referrers.getKeywords) are used instead.
- Matomo Goals: the nb_conversions field counts all configured goals (1–12) aggregated. Individual goal breakdown is not currently surfaced.
Glossary & methodology
New vs. Returning visitors
- New visitor — a visitor arriving on the site for the very first time.
- Returning visitor — a visitor who has already visited the site and comes back.
Key points often misunderstood:
- The distinction does not depend on the analysed period — it depends on the visitor's full history.
- It relies on cookies / User ID. If cookies are deleted or blocked, a returning visitor can be counted as new.
- Recognition is per device / browser (unless User ID is enabled).
Matomo vs. GA4: Matomo classifies new/returning based on the visitor's full global history (independent of the selected period). GA4 uses events detected within the selected period (first_visit / first_open), which can lead to different numbers between the two tools.
Refresh schedule
The full pipeline (ingest → marts → site → git push) runs every day at
06:00 CET via run_matomo_cron.sh. Logs are written to
logs/ and re-runs are idempotent: campaign daily rows are appended incrementally,
and period-snapshot tables are fully truncated and re-inserted. If ingestion fails partway
through, a re-run simply picks up from MAX(date) + 1 for daily data and rebuilds
the period tables from scratch.
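The two loading strategies described above can be sketched as follows. The pipeline stores data in DuckDB; this sketch uses Python's built-in sqlite3 module purely as a stand-in so it runs without extra dependencies. The column names and the helper names (next_start_date, append_daily, rebuild_period, make_demo_db) are assumptions for illustration, not the pipeline's actual code.

```python
import sqlite3
from datetime import date, timedelta


def make_demo_db():
    """In-memory stand-in database with assumed column names."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE fact_campaign_daily (date TEXT, campaign TEXT, visits INTEGER)"
    )
    conn.execute("CREATE TABLE fact_keyword_period (keyword TEXT, visits INTEGER)")
    return conn


def next_start_date(conn, default_start):
    """Resume daily ingestion from MAX(date) + 1, or a default if empty."""
    row = conn.execute("SELECT MAX(date) FROM fact_campaign_daily").fetchone()
    if row[0] is None:
        return default_start
    return date.fromisoformat(row[0]) + timedelta(days=1)


def append_daily(conn, rows):
    """Append new daily rows; existing days are left untouched."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO fact_campaign_daily (date, campaign, visits) VALUES (?, ?, ?)",
            rows,
        )


def rebuild_period(conn, rows):
    """Empty a period-snapshot table and re-insert it in one transaction."""
    with conn:
        conn.execute("DELETE FROM fact_keyword_period")
        conn.executemany(
            "INSERT INTO fact_keyword_period (keyword, visits) VALUES (?, ?)",
            rows,
        )
```

Running rebuild_period twice with the same rows leaves the table unchanged, which is what makes re-runs safe for the period tables. For daily data, next_start_date ensures a re-run only fetches days that are not yet stored.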