Data Sources – Cefic Matomo Analysis

Overview

This site is generated by a daily Python pipeline that pulls analytics from Matomo Cloud (cefic.matomo.cloud, site ID 5), stores them in a local DuckDB database, and renders a set of static HTML dashboards. The pipeline runs via cron at 06:00 CET every day (run_matomo_cron.sh); the generated site is then pushed to GitHub and deployed through Netlify. Runs are idempotent: campaign daily metrics are appended incrementally, while period-snapshot tables (fact_*_period) are truncated and re-inserted on every run.

Product views

Each main dashboard fits on one screen and answers one business question.

Executive Overview — What's happening at a glance? KPIs over time, top campaigns with sub-campaign drill-down, top pages by type, MoM / YoY.
Traffic & Acquisition — How do visitors arrive? Channel mix, keywords, referring websites, social networks, and a world map.
Campaign Intelligence — What's working in our campaigns? Per-campaign quality score (engagement + conversions + channel diversity) with automated insights.
Audience & Behaviour — Who are they and how do they navigate? Devices, browsers, new vs returning, entry and exit pages, internal search, events, media play rate.
Content Performance — Which content actually delivers? Top pages, category breakdown, freshness vs performance scatter, top downloads.

Legacy drill-down pages (Dashboard, Campaigns, Audiences, Trends, Source/Medium, Chord, Monthly Ranking) remain available for deeper exploration on specific slices.

Data sources — Matomo endpoints

Endpoints called during each daily pipeline run, the tables they populate, and the views that consume them.

Endpoint	Data pulled	Tables populated	Used by
`Referrers.getCampaigns`	Per-campaign daily metrics (visits, page views, bounce, conversions, avg time)	`fact_campaign_daily`	Overview, Campaigns, Campaign Intel
`MarketingCampaignsReporting.getName` (expanded)	`mtm_content` sub-campaigns, daily	`fact_campaign_content_daily`	Overview (campaign expand)
`MarketingCampaignsReporting.getSourceMedium`	Per-campaign source / medium breakdown	in-memory only	Campaign Intel (channel diversity score)
`Actions.getPageUrls` (flat)	Page-level daily stats + page-type classification	`fact_page_daily`	Overview, Content
`VisitFrequency.get`	Daily new vs returning totals	`fact_visit_frequency_daily`	Overview, Audience
`Referrers.getReferrerType`	Channel rollup (search, direct, referral, campaign, social, AI)	`fact_channel_daily`	Traffic
`Referrers.getKeywords`	Top organic search keywords	`fact_keyword_period`	Traffic
`Referrers.getWebsites`	Top referring websites	`fact_website_period`	Traffic
`Referrers.getSocials`	Traffic from social networks	`fact_social_period`	Traffic
`UserCountry.getCountry`	Visits by country	`fact_country_daily`	Traffic (map + list)
`UserCountry.getCity`	Visits by city	`fact_city_period`	Traffic
`DevicesDetection.getType`	Desktop / smartphone / tablet split	`fact_device_daily`	Audience
`DevicesDetection.getBrowsers`	Browser breakdown with engagement	`fact_browser_daily`	Audience
`Actions.getEntryPageUrls`	Top landing pages with entry bounce	`fact_entry_page_period`	Audience
`Actions.getExitPageUrls`	Top exit pages with exit rate	`fact_exit_page_period`	Audience
`Actions.getSiteSearchKeywords`	Internal search terms, hits, exit rate	`fact_site_search_period`	Audience
`Events.getCategory`	Event categories (CTAs, downloads, etc.)	`fact_event_category_daily`	Audience
`Events.getName`	Individual event names (used for downloads)	`fact_event_name_period`	Content (downloads list)
`MediaAnalytics.get`	Video plays, impressions, play rate	`dim_media_summary`	Audience (video play rate)
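
As an illustration of how one of the endpoints above might be requested, here is a minimal sketch that builds a Matomo Reporting API URL. The host and site ID come from this page; the helper names (build_report_params, build_report_url), the example date, and the choice of optional parameters are assumptions, not the pipeline's actual code. The token_auth parameter is deliberately left out of the URL, since Matomo also accepts it as a POST parameter.

```python
from urllib.parse import urlencode

# Host and site ID as stated on this page.
BASE_URL = "https://cefic.matomo.cloud/index.php"
SITE_ID = 5


def build_report_params(method, date, period="day", **extra):
    """Return the query parameters for one Matomo Reporting API call.

    Hypothetical helper: the real ingest script may structure this differently.
    """
    params = {
        "module": "API",      # selects the Reporting API
        "method": method,     # e.g. "Referrers.getCampaigns"
        "idSite": SITE_ID,
        "period": period,     # "day", "week", "month", "year" or "range"
        "date": date,         # "YYYY-MM-DD", "yesterday", or "start,end" for a range
        "format": "JSON",
    }
    params.update(extra)      # optional flags such as flat=1 or expanded=1
    return params


def build_report_url(method, date, period="day", **extra):
    """Return the full request URL (without token_auth)."""
    return f"{BASE_URL}?{urlencode(build_report_params(method, date, period, **extra))}"


# Illustrative call: flattened page URLs for one (made-up) day, no row limit.
url = build_report_url("Actions.getPageUrls", "2024-01-15", flat=1, filter_limit=-1)
print(url)
```

Keeping parameter construction separate from the HTTP call makes it easy to check the request for each endpoint without touching the network.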

Marts (SQL views)

Views derived from fact tables, computed on demand when the site is built.

mart_campaign_period — campaign rollup over the full period with audience, bounce, avg time, conversion rate. Feeds Campaigns and Campaign Intel.
mart_daily_totals — campaign-wide daily totals used by the legacy Dashboard page.
mart_campaign_movers — last-7 vs previous-7 rising / falling campaigns. Used by the Overview insights panel.
mart_page_type_period — pageviews, bounce and avg time per page_type. Feeds Content.
mart_top_pages_by_type — top pages per type with rank. Feeds the Overview page type tabs.
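
To make the last-7 vs previous-7 comparison behind mart_campaign_movers concrete, here is a minimal Python sketch of the same idea. The real mart is a SQL view; the function name, the row shape (campaign, day, visits), the choice of visits as the compared metric, and the sample campaigns are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta


def campaign_movers(rows, as_of, window=7):
    """Compare each campaign's visits in the last `window` days with the
    `window` days before that.

    rows: iterable of (campaign, day, visits) tuples (hypothetical shape).
    Returns (campaign, recent, previous, delta) tuples, biggest risers first.
    """
    recent_start = as_of - timedelta(days=window - 1)
    previous_start = recent_start - timedelta(days=window)

    recent = defaultdict(int)
    previous = defaultdict(int)
    for campaign, day, visits in rows:
        if recent_start <= day <= as_of:
            recent[campaign] += visits
        elif previous_start <= day < recent_start:
            previous[campaign] += visits
        # Rows outside both windows are ignored.

    campaigns = set(recent) | set(previous)
    movers = [
        (c, recent.get(c, 0), previous.get(c, 0), recent.get(c, 0) - previous.get(c, 0))
        for c in campaigns
    ]
    # Sort by delta descending, then by name for a stable order.
    return sorted(movers, key=lambda m: (-m[3], m[0]))


# Made-up sample data: campaign_a rises, campaign_b falls.
sample = [
    ("campaign_a", date(2024, 1, 3), 10),
    ("campaign_a", date(2024, 1, 10), 25),
    ("campaign_b", date(2024, 1, 2), 40),
    ("campaign_b", date(2024, 1, 12), 15),
    ("campaign_a", date(2023, 12, 20), 99),  # older than both windows
]
movers = campaign_movers(sample, as_of=date(2024, 1, 14))
print(movers)
```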

Page taxonomy

How every URL gets classified into a single page_type.

The canonical taxonomy is defined in docs/cefic_site_structure.md in the repository — it is the single source of truth for PAGE_TYPE_RULES, PAGE_TYPE_LABELS and PAGE_TYPE_ORDER, consumed by both ingest_matomo.py (to tag each page when the daily feed is written) and build_site.py (to label and order the charts). The 13 canonical types are: news, policy, guidance, case_studies, events, science, industry_data, sectors, highlights, resources, about, home, other.
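
A rule-based classifier of this kind can be sketched as follows. This is only an illustration of the first-match-wins idea: the path prefixes below are hypothetical placeholders, and the real PAGE_TYPE_RULES live in docs/cefic_site_structure.md and may use a different shape (for example regular expressions) and cover all 13 types.

```python
from urllib.parse import urlparse

# Hypothetical rules for illustration only: (page_type, path prefix).
# Order matters, because the first matching rule wins.
PAGE_TYPE_RULES = [
    ("news", "/news/"),
    ("policy", "/policy/"),
    ("events", "/events/"),
]


def classify_page(url):
    """Return exactly one page_type for a URL or path."""
    path = urlparse(url).path or "/"
    if path == "/":
        return "home"
    for page_type, prefix in PAGE_TYPE_RULES:
        if path.startswith(prefix):
            return page_type
    # Anything that matches no rule falls into the catch-all type.
    return "other"


print(classify_page("https://cefic.org/news/some-article/"))
print(classify_page("https://cefic.org/"))
print(classify_page("/unmapped/page"))
```

Because every URL ends in either a matching rule, "home", or "other", each page receives a single page_type, which is what lets the charts partition pageviews without double counting.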

Known limitations

What is not tracked — and why.

Scroll depth — not tracked at the cefic.org tag level. Would require custom JS in the Matomo tracker.
Navigation paths — would require Live.getLastVisitsDetails, which is too expensive at our scale. Entry and exit pages are used as a proxy on the Audience page.
Per-page conversions — would require one saved Matomo segment per page, not practical at scale.
Device mix per campaign — same constraint as above: one saved segment per campaign.
Search Console plugin — not enabled. Matomo's built-in search keywords (Referrers.getKeywords) are used instead.
Matomo Goals — the nb_conversions field is the aggregate across all configured goals (1–12). A per-goal breakdown is not currently surfaced.

Glossary & methodology

New vs. Returning visitors

New visitor — a visitor arriving on the site for the very first time.
Returning visitor — a visitor who has already visited the site and comes back.

Key points often misunderstood:

The distinction does not depend on the analysed period — it depends on the visitor's full history.
It relies on cookies / User ID. If cookies are deleted or blocked, a returning visitor can be counted as new.
Recognition is per device / browser (unless User ID is enabled).

Matomo vs. GA4: Matomo classifies new/returning based on the visitor's full global history (independent of the selected period). GA4 uses events detected within the selected period (first_visit / first_open), which can lead to different numbers between the two tools.

Refresh schedule

The full pipeline (ingest → marts → site → git push) runs every day at 06:00 CET via run_matomo_cron.sh and writes its logs to logs/. Re-runs are idempotent: campaign daily rows are appended incrementally, and period-snapshot tables are truncated and fully re-inserted. If ingestion fails partway through, a re-run resumes daily data from MAX(date) + 1 and rebuilds the period tables from scratch.
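
The resume-and-rebuild behaviour described above can be sketched as follows. To keep the example runnable with the standard library alone, it uses sqlite3 as a stand-in for DuckDB; the table names come from this page, while the column names, helper functions and sample rows are assumptions rather than the pipeline's actual code.

```python
import sqlite3
from datetime import date, timedelta


def next_date_to_fetch(con, default_start):
    """Resume daily ingestion from MAX(date) + 1, or from default_start if empty."""
    row = con.execute("SELECT MAX(date) FROM fact_campaign_daily").fetchone()
    if row[0] is None:
        return default_start
    return date.fromisoformat(row[0]) + timedelta(days=1)


def append_daily(con, rows):
    """Add new daily rows; existing rows are left untouched."""
    with con:  # commit on success, roll back on error
        con.executemany(
            "INSERT INTO fact_campaign_daily (date, campaign, visits) VALUES (?, ?, ?)",
            rows,
        )


def replace_period_snapshot(con, rows):
    """Empty the snapshot table and re-insert it in a single transaction."""
    with con:
        con.execute("DELETE FROM fact_keyword_period")  # sqlite3 has no TRUNCATE
        con.executemany(
            "INSERT INTO fact_keyword_period (keyword, visits) VALUES (?, ?)",
            rows,
        )


# In-memory database with hypothetical, simplified schemas.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact_campaign_daily (date TEXT, campaign TEXT, visits INTEGER)")
con.execute("CREATE TABLE fact_keyword_period (keyword TEXT, visits INTEGER)")

# Two days are already loaded, so the next run should start on the third.
append_daily(con, [("2024-01-01", "campaign_a", 10), ("2024-01-02", "campaign_a", 12)])
resume_from = next_date_to_fetch(con, default_start=date(2024, 1, 1))

# Running the snapshot load twice leaves the same two rows, not four.
for _ in range(2):
    replace_period_snapshot(con, [("keyword_a", 5), ("keyword_b", 3)])
snapshot_rows = con.execute("SELECT COUNT(*) FROM fact_keyword_period").fetchone()[0]

print(resume_from, snapshot_rows)
```

Wrapping the delete and re-insert in one transaction means a failure midway leaves the previous snapshot in place instead of an empty table.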
