Monitoring behavioural responses during pandemic via reconstructed contact matrices from online and representative surveys
Júlia Koltai, Orsolya Vásárhelyi, Gergely Röst, Márton Karsai
Computational Social Science – Research Center for Educational and Network Studies, Centre for Social Sciences, Budapest, H-1097, Hungary
Faculty of Social Sciences, Eötvös Loránd University, Budapest, H-1117, Hungary
Department of Network and Data Science, Central European University, Vienna, A-1100, Austria
Center for Interdisciplinary Methodologies, University of Warwick, Coventry, United Kingdom
Bolyai Institute, University of Szeged, Szeged, H-6720, Hungary
Alfréd Rényi Institute of Mathematics, Budapest, H-1053, Hungary
+ these authors contributed equally to this work
* Corresponding author: [email protected]
ABSTRACT
The unprecedented behavioural responses of societies have evidently been shaping the COVID-19 pandemic, yet it is a significant challenge to accurately monitor the continuously changing social mixing patterns in real time. Contact matrices, usually stratified by age, summarise interaction motifs efficiently, but their collection relies on conventional representative survey techniques, which are expensive and slow to obtain. Here we report a data collection effort involving over 2.3% of the Hungarian population to simultaneously record contact matrices through a longitudinal online survey and a sequence of representative phone surveys. To correct the non-representative biases characterising the online data, we use census data and the representative samples to develop a reconstruction method that provides a scalable, cheap, and flexible way to dynamically obtain closer-to-representative contact matrices. Our results demonstrate the potential of combined online-offline data collections to understand the changing behavioural responses determining the future evolution of the outbreak, and to inform epidemic models with crucial data.
The spread of directly transmitted diseases such as COVID-19 is largely driven by the social interactions and mixing patterns of people. While person-to-person transmission typically occurs in close contacts, local transportation, commuting, or global travel allow the disease to reach distant territories. Mobility patterns of entire populations can be traced from data coming from transportation or personal digital devices, yet the observation of social interactions is still far from straightforward. The estimation of interactions and mixing patterns via social proximity, commonly coded as contact matrices, is difficult, especially when we can only observe a fraction of the population. Even the recently developed contact tracing apps may fail to meet this challenge, as they collect data that are too sparse due to low adoption rates, and they may not guarantee that people's private data remain intact.
By combining anonymous online data collection techniques with conventional, representative-sample-based survey methods, we propose a privacy-protecting, dynamic, economical, scalable and efficient solution to this problem. Our newly developed large-scale online data collection method, like any other method based on voluntary participation, suffers from unrepresentativity. To overcome this limitation, we developed a detailed weighting methodology that uses the large-scale online data and a smaller-scale representative sample simultaneously. This methodology addresses the question of how voluntary online questionnaires may produce more valid and dynamic contact matrices to inform epidemic models.
The simplest approach to modelling an epidemic assumes that contacts between any two individuals occur randomly with equal probability. This so-called homogeneous mixing assumption dominated the early years of mathematical and computational epidemiology and led to the seminal results on the dynamics of infectious diseases. However, the heterogeneity of populations called for more refined assumptions to bring the models closer to reality. One successful direction assumes networked populations, where the social interaction structure of people is taken as the underlying skeleton for epidemic transmission. Social networks commonly exhibit various structural heterogeneities, which crucially amplify the chances of global spreading scenarios while making them easier to immunise in case their global structure is known. However, collecting data about the precise social network of a large population is difficult. Thus, a middle-way approach between homogeneously mixed and networked populations is necessary; this is offered by contact matrices, which represent the aggregated probabilities that different groups of people are in contact with each other. Most commonly, contacts between age groups are considered, but family structure, gender, education, and other socio-demographic variables have also been used for such stratification. The advantages of contact matrices are manifold, as they can be easily integrated into conventional mathematical frameworks to describe the dynamics of an epidemic. Further, they are privacy preserving as they only record aggregated information, yet they effectively break the homogeneous mixing assumption within a population. They can be dynamically collected and re-scaled to simulate the effects of social distancing or the isolation of different groups for scenario testing of epidemic outcomes.
International and national efforts were implemented worldwide to estimate locally relevant contact matrices for epidemic modelling. One of the largest and earliest efforts was carried out by Mossong et al. in the POLYMOD project, in which 7,290 participants in eight European countries were asked to provide their daily contact data to estimate the aggregated age contact matrices. Following these efforts, similar studies have been conducted in various other countries around the world, while several contact matrix estimation methods were also developed. One important study was published by Prem et al., who, based on the POLYMOD results and local census data, estimated the contact matrices of 152 countries by using Markov Chain Monte Carlo simulation. All these studies were established on a few paradigms of data collection methods.
Several questionnaire-based data collection campaigns were carried out using CATI, CAWI or CAPI survey methodologies. They commonly collected easily interpretable data, sometimes from representative samples using careful sampling design. Nevertheless, all of them suffered from limited sample size and high cost of data collection and, with the exception of some recent examples, as cross-sectional studies they completely failed to capture any dynamical change of contact patterns during normal or pandemic periods. On the other hand, online questionnaires and behavioural data collection apps may open new ways to solve these problems. They can reach large populations of up to millions of people, while collecting data dynamically, even with changing content, at relatively small cost. However, they may raise privacy issues and, due to the voluntary participation, they fall short of providing a representative sample of the observed population. The latter crucially limits their direct applicability, as any interpretation drawn from their results needs to be handled with caution. Thus, the question remains: how can one exploit all the advantages that online data collection methods provide, while ensuring the privacy of the respondents and the representativeness of the data collected?
Actual circumstances
The recent COVID-19 pandemic called for an immediate answer to this question. In the early days of March 2020, as the COVID-19 pandemic started to unfold in Hungary, a collective action of scientists was called for to develop country-specific epidemic models. This effort was supported by an unprecedented data sharing initiative by mobile phone providers and health authorities to help realistic data-driven modelling approaches. However, one important piece of data was missing from the beginning: the spatially and demographically detailed age mixing patterns of the country's population. Although estimated contact matrices were available for Hungary from earlier periods, the actual challenge was to monitor the changes in contact patterns and to measure societal responses, like social distancing or self-protection, to nationwide regulations. The Hungarian Data Provider Questionnaire ("Magyar Adatszolgáltató Kérdőív", MASZK) was developed for these purposes as a voluntary and anonymous online survey, designed by scientists and software engineers, as part of a larger project aiming at the observation and modelling of the unfolding COVID-19 pandemic in Hungary. Beyond collecting static information about the respondents' demography, domicile, education level, or family structure, the primary goal of the questionnaire was to monitor the daily changes in the contact patterns of people in order to calculate the age contact matrices in real time. Additionally, dynamic data was collected about the respondents' employment status, working conditions, physical and mental well-being, and their compliance with recommended self-protection measures during the months of the state of emergency and beyond. This rolling anonymous online data collection campaign is ongoing to date (Spring 2021) and has reached over 2.3% of the population in Hungary, recording over 405,000 questionnaires from more than 226,000 individuals, which amounts, to our knowledge, to the largest dataset ever collected for this purpose.
Problem and focus
However, as participation was voluntary, the obtained dataset, like any data collected in similar ways, was not representative of the population of Hungary. To estimate the level and dimensions of unrepresentativity, we performed parallel data collection campaigns based on the same questionnaire, but conducted on a representative sample of 1,500 people with CATI (computer-assisted telephone interviewing) survey methodology. More precisely, in addition to the online survey, we conducted a cross-sectional representative survey in each month from the beginning of the pandemic, in which we measured the actual and pre-COVID-19 contact patterns of participants, and collected all other information recorded in the current online questionnaire. Through the combined analysis of the online and offline data, we evaluated the results of the large online survey and identified its most severe non-representative biases. To account for these biases, we developed a pipeline using iterative proportional fitting to weight the non-representative data in order to provide more representative contact matrices. This method supports a more realistic measurement of the age contact matrices of a whole population while keeping the advantages (like cost-efficiency, scalability and detailed dynamics) of the online data collection. To describe our results, we first briefly summarize the structure and the content of the questionnaire and explain our data collection methods in detail. Subsequently, we introduce our methodology for weighting the age contact matrices collected online, in which the dimensions of the weights are derived from representative data collections conducted in the same period. Finally, we demonstrate our methodology on contact matrices observed during the first wave of the COVID-19 pandemic in Hungary.
Results
Data collection
The MASZK questionnaire
The primary purpose of our questionnaire was to dynamically estimate the age contact matrices of people in different environments (like home, work, school, or elsewhere). For this reason, we asked the respondents about the number of people from different age groups with whom they had contact. First, we recorded reference contact patterns by asking respondents about their contacts during a typical weekday and weekend before the COVID-19 outbreak in Hungary (13th March 2020). Second, we recorded actual contact patterns of participants by asking them about their contact activities on the day before their actual response. We classified close contacts as physical contacts (direct physical contacts without using personal protective equipment) and proxy contacts (two persons stayed closer than 2 meters to each other for at least 15 minutes). Individual contact patterns were recorded as the approximate number of contacts between the ego and their peers from the different age groups of 0-4, 5-14, 15-29, 30-44, 45-59, 60-69, 70-79, and 80+. For the sake of potential adoption of our method and reproducibility of results, we share the core part of our questionnaire, including the essential questions for our analysis, in the Supplementary Information (SI).
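As a small illustration of the age stratification described above, the following Python sketch maps an age in years to one of the eight age groups. The function name and the list of lower bounds are assumptions introduced here for illustration only; the questionnaire itself asked respondents for approximate contact counts per age group directly.

```python
from bisect import bisect_right

# Lower bounds (in years) of the eight age groups listed in the text.
AGE_GROUP_LOWER_BOUNDS = [0, 5, 15, 30, 45, 60, 70, 80]
AGE_GROUP_LABELS = ["0-4", "5-14", "15-29", "30-44",
                    "45-59", "60-69", "70-79", "80+"]


def age_group_index(age):
    """Return the index (0..7) of the age group that contains `age`."""
    if age < 0:
        raise ValueError("age must be non-negative")
    # bisect_right counts how many lower bounds are <= age.
    return bisect_right(AGE_GROUP_LOWER_BOUNDS, age) - 1


def age_group_label(age):
    """Return the label of the age group that contains `age`."""
    return AGE_GROUP_LABELS[age_group_index(age)]
```

For example, `age_group_label(37)` yields the `30-44` group, and any age of 80 or above falls into the open-ended `80+` group.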
Figure 1.
Contact dynamics, representative and reconstructed age contact matrices.
Age contact matrices measured during the (a) reference and (b) pandemic period via CATI survey methodology on a representative sample (blue) and via weighted non-representative online data collection (orange) after reconstruction (for methodology see section on Construction of age contact matrices). Data for children under 18 (indicated with asterisk and vertical dashed lines) could not be collected directly due to privacy regulations, thus our data cannot provide a representative sample for the first two age groups. (c) Timeline of early pandemic regulations in Hungary and the average number of per capita daily proxy social contacts in rural areas (solid green line), the central area (solid red line) of Hungary, and in the whole country (solid blue line). While online data collection was continuously ongoing after the 23rd March 2020, representative data via telephone surveys were collected during the periods marked by diagonal shading. Blue shades indicate telephone census collection, while grey shades cover the online observation period of the actual study. Both methods retrospectively recorded the contact patterns from the reference period (before 13th March 2020), except for age groups under 15 in the online questionnaire.
Online data collection
MASZK was originally developed as an online survey, and was later published as a mobile phone application. Participation was, and still is, voluntary and the data collection was completely anonymous (for further details see the Methods section). The data collection started on the 23rd March 2020 and is still ongoing (as of Spring 2021). While keeping the core questionnaire (shared in the SI) intact, the additional content has been adjusted to the most pressing current issues of the pandemic, like work and home office conditions, job security, self-protection practices, or intention to get vaccinated in case of availability. Respondents were asked to fill out the questionnaire on as many days as they could, providing ongoing relevant information about their contacts. To date, the questionnaire has been completed 405,984 times by 226,086 respondents, which accounts for ∼2.3% of the population of Hungary. The collected data sensitively reflects public awareness and reactions to national regulations, as can be followed in Fig. 1c. During the reference period, until the 13th of March 2020 when the first regulations were announced, the average daily number of proxy social contacts of individuals was measured to be ∼25. This number dropped radically by 88% to a value of ∼[…] ∼8, which nevertheless never reached its reference value until the end of the observed period (20th June 2020). In this work we analyse a period of three consecutive weeks (29th April to 19th May 2020) during the first relaxation of the restrictive measures, as both types of data collection campaigns were conducted in these days. Using online surveys we recorded 30,770 responses from 12,208 people during this three-week period (see Methods, and SI, Table S1).
Nationally representative telephone survey
In addition to the ongoing online data collection, CATI surveys were conducted by a market research company to ask the same questionnaire on a nationally representative sample of people in each month. The sample size was 1,500.
Construction of age contact matrices
In order to construct the age contact matrix of social contacts for the whole population, we collected information about the number of proxy and physical contacts of each respondent $x$ during the reference and actual periods in different settings. For a given social connection type, period, and setting, we used the age of the respondents to assign them to one of eight age groups $A$ (as defined in section The MASZK questionnaire), and did the same for their contacts. Thus we obtained an individual contact matrix $M_x$ coding, for each user $x$, the number of contacts they had with others from age groups $i \in A$. Assuming an individual representative weight $w_x$ for each respondent, we computed a weighted average contact matrix $(M)_{ij}$, which was column-wise normalised, thus giving us the weighted average number of contacts of a person from age group $j$ with someone from age group $i$. Note that this matrix is not symmetric, and in case of a fully representative sample the weights would be $w_x = 1$, reducing the computation to a simple averaging process (see Methods).
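The weighted averaging described above can be sketched in a few lines of Python. This is only one plausible reading of the text, not the authors' implementation (the precise formula is given in the Methods): each respondent contributes their per-age-group contact counts to the column of their own age group, and every column is divided by the sum of the weights of the respondents in that group. The function name and the input layout are assumptions made for illustration.

```python
def weighted_contact_matrix(respondents, n_groups=8):
    """Weighted average contact matrix (illustrative sketch).

    `respondents` is an iterable of (ego_group, weight, contacts) tuples:
      ego_group: index j of the respondent's own age group,
      weight:    individual weight w_x,
      contacts:  list of length n_groups with the number of contacts
                 reported with people from each age group i.
    Returns M with M[i][j] = weighted mean number of contacts that a
    person from age group j has with people from age group i.
    """
    totals = [[0.0] * n_groups for _ in range(n_groups)]
    weight_sums = [0.0] * n_groups
    for ego_group, weight, contacts in respondents:
        weight_sums[ego_group] += weight
        for i, n_contacts in enumerate(contacts):
            totals[i][ego_group] += weight * n_contacts

    # Column-wise normalisation by the total weight of each ego age group.
    matrix = [[0.0] * n_groups for _ in range(n_groups)]
    for j in range(n_groups):
        if weight_sums[j] > 0:
            for i in range(n_groups):
                matrix[i][j] = totals[i][j] / weight_sums[j]
    return matrix
```

With all weights set to 1, the function reduces to a plain per-age-group average, matching the fully representative case mentioned in the text.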
Social-demographic biases
Despite the many advantages of open online surveys, due to voluntary participation they often record a highly non-representative sample of the observed population, which may lead to misleading conclusions about the nature of the epidemic process. To identify the most relevant social-demographic dimensions along which the online survey data is biased, we compare the non-representative online data to the corresponding national census.
In most cases, the tests for representativeness of an online survey focus on standard social-demographic characteristics of the observed sample and the population. In our case, however, the relevant characteristics are those that significantly influence the contact patterns of the respondents. To explore these underlying factors, we performed regression analysis on the proxy contacts of respondents in the representative sample recorded in the actual period (for further details see Methods and SI). As the goal of the regression analysis was to detect influencing factors relevant for a later weighting process, the independent variables of these models were limited not only to those asked in the questionnaire but also to those available in the census. Although we could identify several dimensions that significantly affect the contact patterns of people, we could not rule out the possibility that other dimensions, not included in the survey or not measured by the census, also influence the contact patterns significantly.
These regression analyses indicated that age, employment status, education, settlement type, gender and geographical region of the domicile are the most significant social-demographic dimensions along which our online data is non-representative. Indeed, the statistics shown in Fig. 2 demonstrate that while the distributions of the nationally representative phone survey show values very similar to the population census data provided by the Hungarian Statistical Office, the online survey presents strong biases along these dimensions. Compared to the census data, those who filled out the online survey are more likely to be middle aged, employed, higher educated, to live in the capital, and to be women. On the other hand, people who are lower educated, older than 70 years, or live in small settlements like towns are under-represented. These striking differences demonstrate that the analysis of the raw online survey would lead to biased contact patterns, which are hardly generalizable to the whole Hungarian population.
Figure 2.
Descriptive statistics of key demographic variables.
Statistics are shown for the variables used in the weighting of the raw online data, in the representative survey (N = 2,290) and in the online survey (N = 12,723), compared to the population data of the Hungarian Statistical Office (census). Note that the statistics shown for age category, employment status and education are based on the adult population of Hungary (15 years old or older), while settlement type, gender and region cover the entire population.
The weighting procedure
After detecting the social-demographic variables that significantly affect the contact patterns observed in the representative survey, we provide a weighting methodology for the online survey to make it more accurate in the measurement of the contact patterns of the whole population. The goal of this procedure is to provide an individual weight $w_x$ for every respondent $x$, which indicates how much they need to be taken into account in the reconstructed online data to make it representative. Respondents who belong to an under-represented social group get higher weights, while those from over-represented groups get lower ones. From the results in Fig. 2 it is evident that the differences between the online and the census data are quite large. This suggests that individual weights will take values from a very broad range, which is undesirable, as extreme weights can result in unstable estimations. Therefore, our weighting methodology needs to meet two goals: (1) bringing the online survey data closer to the Hungarian Census by making it more representative in terms of the identified social-economic dimensions, while (2) keeping the size of the weights in a reasonable range. To meet the second goal, we applied iterative proportional fitting (IPF). IPF is a weighting methodology that adjusts the inner cells of an $n$-dimensional contingency table in such a way that it returns the previously provided expected row and column margins.
In our case, the expected margins (the population distributions of the weighting variables) are taken from census data, and the contingency tables on which we apply the weighting procedure are derived from the online survey data.
To obtain well-fitting weights that satisfy both of our goals, we built on the age-stratified structure of contact matrices. First, as they are built up from age-group-wise normalized vectors for each age group, the relative proportions of age groups can be neglected (not included as expected margins) in the IPF, which considerably decreases the variation of the obtained individual weights. Second, as the same dimensions are not necessarily relevant in each age group (e.g., in some age groups the education level affects contact patterns, while in others the geographical location is important), the identification of relevant weighting dimensions is conducted separately in each age group, which can lead to more realistic weights. The results strengthen this argument, as very different social-demographic dimensions significantly affected the total number of actual proxy contacts in different age groups, as summarised in Table 1 (for margins see SI).

Age group   Variables                                                       Weight min.   Weight max.
0-14        Gender, Settlement Type, Central / Rural Hungary                0.29          3.23
15-29       Gender, Central / Rural Hungary                                 0.47          1.89
30-44       Region*Work, Region*Settlement type                             0.44          3.35
45-59       Employment status, Education*Settlement Type                    0.18          6.62
60-69       Gender, Employment status                                       0.50          1.73
70+         Education*Gender, Central / Rural Hungary, Employment status    0.04          24.79

Table 1.
Table of age groups, the corresponding social-demographic variables and weight limits.
Social-demographic variables are listed for each age group, which were used as margins in the IPF procedure, together with the minimum and maximum values of the calculated individual weights. The symbol * indicates interactions between variables. To increase the precision of the weighting procedure, the regression analyses aimed at detecting the dimensions that affect the contact patterns of people were conducted separately on each age group. The selected dimensions served as expected margins in the IPF procedure. Note that some age groups are merged to make the age categories sufficiently populated and compatible with the age categories of the census.
Compared to standard cell weighting, IPF is less likely to result in extremely small or large weights. In our case, after the selection of the relevant dimensions, the IPF process obtained weights that stayed within the range of 0.04 and 25.49 (as presented in Table 1, with the weight distribution summarised in the SI). The closer an individual weight is to one, the more representative the corresponding individual is of their age group with respect to the listed dimensions. The weight values characterising different age groups can thus disclose which groups are strongly biased in the online survey as compared to population data. From this perspective, the results in Table 1 suggest that the age groups of 60-69 and 15-29 are the ones closest to the population data of the same age group in their composition along the listed dimensions. At the same time, the most problematic age group is 70+, where the observed minimum and maximum weights cover the largest range. The larger range of weights can be explained by the self-selection of respondents: the older generation is less likely to adopt digital technologies or have internet access, thus those respondents from this age group who filled out the online questionnaire are not typical representatives of the whole age group.
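To make the IPF step more concrete, here is a minimal two-dimensional sketch in Python. It is a generic textbook version of the algorithm, not the authors' pipeline: the function names, the tolerance and the toy numbers in the usage note are assumptions for illustration, and the study applied IPF separately within each age group using the margins listed in Table 1. The sketch assumes that the row and column targets sum to the same total.

```python
def ipf_2d(table, row_targets, col_targets, max_iter=1000, tol=1e-10):
    """Rescale a 2-D contingency table so that its row and column sums
    match the target margins (iterative proportional fitting)."""
    fitted = [[float(v) for v in row] for row in table]
    for _ in range(max_iter):
        # Row step: scale every row to its target margin.
        for r, target in enumerate(row_targets):
            row_sum = sum(fitted[r])
            if row_sum > 0:
                factor = target / row_sum
                fitted[r] = [v * factor for v in fitted[r]]
        # Column step: scale every column to its target margin.
        for c, target in enumerate(col_targets):
            col_sum = sum(row[c] for row in fitted)
            if col_sum > 0:
                factor = target / col_sum
                for row in fitted:
                    row[c] *= factor
        # Columns are exact after the column step; stop once rows agree too.
        max_err = max(abs(sum(fitted[r]) - t)
                      for r, t in enumerate(row_targets))
        if max_err < tol:
            break
    return fitted


def cell_weights(table, fitted):
    """Weight of a respondent in each cell: fitted count / observed count."""
    return [[f / o if o > 0 else 0.0 for o, f in zip(obs_row, fit_row)]
            for obs_row, fit_row in zip(table, fitted)]
```

As a usage illustration with made-up numbers, `ipf_2d([[40, 10], [30, 20]], [50, 50], [60, 40])` rescales a sample table so that its margins match the hypothetical population margins, and `cell_weights` then gives the weight shared by all respondents who fall into the same cell. Cells that are over-represented in the sample receive weights below one, under-represented cells receive weights above one.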
Reconstructed matrix analysis
The reconstructed online proxy age contact matrix (panel Fig. 3e) has, as expected, a structure very similar to the representative result (panel Fig. 3c). It exposes a strong diagonal component induced by age homophily (for annotated matrices see SI), while it suggests larger contact numbers between people of age 15-59, which includes the employed population of the country. These matrices were recorded during the period in May 2020 when schools were closed in Hungary. This is reflected in the higher contact numbers between the youngest age groups and their parents' generation from the age group of 30-44. However, if we compare the representative or the reconstructed (weighted) matrices to their corresponding reference period measures (see Fig. 1a and b), we clearly see the radical decrease in the number of contacts (darker shades for the reference period and lighter for the later one), with the closure of schools significantly reducing the number of homophilic contacts between children of age 5-14 as compared to the reference period.
To quantify the precision of our reconstruction method, we compare the raw (not weighted) and reconstructed online proxy contact matrices to the corresponding representative matrix. Although we have demonstrated that the IPF method provides weights within a reasonable range, it is still not evident which age cells were changed the most by the weighting, and which of them became closer to their representative value due to the reconstruction. In the diagonal of Fig. 3 we depict the three actual proxy contact matrices built from the representative survey (Fig. 3c), from the reconstructed (weighted) online survey (Fig. 3e) and from the raw (not weighted) online survey (Fig. 3g). First, in the upper diagonal, we compare these matrices by calculating their pairwise differences (see Fig. 3a, b and d). The difference between the representative survey and the raw online data (Fig. 3a) shows that middle-aged respondents of the online data collection had a higher number of average contacts with young and middle-aged adults than the respondents of the representative survey. Meanwhile, the non-representative online data collection underestimates the number of contacts of elderly people with others of similar age. However, while the absolute difference in the total number of contacts between the representative and the not weighted online survey was 16.4, after reconstruction this difference between the representative and weighted online matrices reduced to 14.[…], corresponding to a […].13% increase in Relative Accuracy Gain (for precise definition see Methods). Our weighting method performs best in cases where the difference between two matrix cells is close to 0 (white in Fig. 3b), as in the case of the 60-69 years old egos and their 30-44 years old alters. The difference matrix of the non-weighted and weighted matrices depicts the effect of the reconstruction process on the online matrix (see Fig. 3d). Although the magnitudes of the differences are not large, certain heterogeneities are visible, like the decrease of contact numbers between middle-aged people and the increase of contacts between 70-79 years old egos and similar others after the reconstruction.
Figure 3.
Results of iterative proportional fitting.
Normalized actual proxy contact matrices (green diagonal), their pairwise difference matrices (above diagonal) and pairwise two-tail T-test results (below diagonal) are depicted for the online non-weighted, online weighted, and representative matrices. In the difference matrices, red or blue cells indicate that the source matrix (column label) has a higher or lower average number of contacts than the target (row label) in a given cell. For the results of the pairwise two-tail T-tests, blue to yellow cells (corresponding to p > 0.05, marked by an arrow beside the colorbar) indicate that the given cell is not significantly different in the source (column label) and target (row label) matrices. Data for children under 18 (indicated with asterisk and vertical dashed lines) could not be collected directly due to privacy regulations, thus our data cannot provide a representative sample for the first two age groups (see Limitations).
To further quantify the goodness of the weighting in detail, we tested whether a cell of a contact matrix is significantly different from the same cell of another contact matrix. Each cell $M_{ij}$ of a contact matrix is the average of the distribution of the number of contacts between the age group $j$ of a respondent and the age group $i$ of their peers. Thus we can perform a pairwise two-tailed independent sample T-test for each cell to see whether the population means of the two groups corresponding to the respective cells measured in different contact matrices are significantly different from each other. These tests show whether the differences presented in the upper diagonal of the figure are statistically significant, or just the result of estimation uncertainties.
In the visualisations of the lower diagonal panels of Fig. 3, yellow cells correspond to p > 0.05 values (p = 0.05 is indicated by arrows near the colorbars), suggesting that the average contact numbers between the corresponding age groups are not significantly different in the two data sources. For example, this is the case in the cell of egos from age group 45-59 and their peers from 15-29 in Fig. 3f, which shows the results of the significance tests comparing the values of the representative and the weighted online matrices. This result suggests that the average contact number between the 45-59 and the 15-29 years old is not significantly different in the representative and in the weighted online matrices. To check the robustness of our matrix reconstruction method, we performed the same significance test between the raw (not weighted) online matrix and the representative matrices (Fig. 3i). Comparing its results to the results of the weighted and representative matrices (Fig. 3f), the number of cells that are not significantly different increased by 6.38% in the latter (from 44 to 47), while the range of similarity has also increased (indicated by more yellow cells). Meanwhile, from the T-test results between the raw (not weighted) and weighted online matrices (see Fig. 3h) it is evident that the weighting helped to capture the contact patterns better in the reconstructed matrix, especially in the case of the active population (30-59) with same-age and older people, and the contacts of elderly people (70-79) with younger others. Precise estimation of the contact patterns of these age groups is especially important for predicting the potential number of infected cases that may end up with severe medical conditions in the COVID-19 pandemic. These results show that the reconstruction caused significant changes in the values of 8 cells out of the 64 and that these changes brought the value of the given cell closer to the representative one in most cases (for exact significance values see SI, Figure S2).
Limitations and future directions
It is very important to emphasize that the comparison of the actual proxy contacts in the representative and the weighted/not weighted online matrices does not follow the same logic for children in the first two age groups. Due to data protection regulations, the CATI survey is only representative for the adult population of Hungary and not for children, while the online survey could not involve under-age children either. Data on children are based on the responses of adult parents estimating the contact patterns of their own children. This estimation is surely biased as, especially for older children, parents may not be fully aware of all daily social contacts of their children. Consequently, we cannot use the representative sample as a 'gold standard' for these age groups, because the population of children recorded in that data is not representative of the child population of the whole country. Correcting this bias would require a separate data collection campaign involving a representative set of children directly, which in turn would raise challenges in meeting the privacy regulations for under-age participants and falls beyond the scope of the present study. Nevertheless, this explains the larger differences between the online and representative matrices in the first two columns of the off-diagonal panels in Fig. 3. If we do not consider these age groups, the Relative Accuracy Gain of the weighting process increases to 11.92% as the absolute difference in the number of contacts between the representative and weighted online survey decreases to 11.36, which corresponds to an increase of 8.33% in the number of cells that are not significantly different. To make this bias evident, we separated the non-representative age groups with a vertical dashed line within the matrices and marked them with asterisks at the labels in each relevant plot.
Another potential limitation may be rooted in the sampling of the observed population. This issue is present in the online data collection, where the number of responses may vary in time. If the size of the online sample is too small, individual weights would diverge and the reconstructed matrices would suffer from large errors. In the present study this is not an issue, as in the examined period the number of daily responses was stable and relatively high. However, in the case of a longitudinal data collection, these parameters can change due to the varying level of public awareness, political influence, or media campaigns. Finally, not only the number but also the composition of the respondents may change in time, thus the precision of the actual weights may decrease. To account for this effect in the dynamical reconstruction of contact matrices, one would need to carry out a representative data collection periodically and recompute the relevant dimensions and weights for each period. Although we have collected representative samples in each month since April 2020, the demonstration of dynamical re-weighting is the subject of a future investigation (in preparation). There we also plan to apply more experimental weighting procedures, where we will include not only variables available in the census but also others available only in the representative data. The goal of these weighting experiments is to increase the Relative Accuracy Gain of the procedure.
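As a hedged illustration of the Relative Accuracy Gain discussed above (defined in Materials and Methods as one minus the ratio of the summed absolute cell differences), the following minimal Python sketch computes it for toy matrices. The function name `relative_accuracy_gain` and the 2 × 2 example values are assumptions made for illustration only; they are not data from this study.

```python
def relative_accuracy_gain(m_rs, m_ow, m_onw):
    """RAG = 1 - sum|M_rs - M_ow| / sum|M_rs - M_onw|, summed over all cells.

    m_rs:  representative matrix (list of rows)
    m_ow:  weighted online matrix
    m_onw: not weighted online matrix
    """
    def total_abs_diff(a, b):
        # Sum of absolute cell-wise differences between two equal-sized matrices.
        return sum(abs(x - y)
                   for row_a, row_b in zip(a, b)
                   for x, y in zip(row_a, row_b))

    weighted_error = total_abs_diff(m_rs, m_ow)
    raw_error = total_abs_diff(m_rs, m_onw)
    if raw_error == 0:
        raise ValueError("the not weighted matrix already equals the representative one")
    return 1.0 - weighted_error / raw_error


# Toy 2 x 2 matrices with made-up values (hypothetical, for illustration only).
m_rs = [[2.0, 1.0], [1.0, 3.0]]
m_onw = [[3.0, 1.5], [0.5, 4.0]]
m_ow = [[2.5, 1.0], [1.0, 3.5]]

rag = relative_accuracy_gain(m_rs, m_ow, m_onw)
print(f"RAG = {rag:.3f}")
```

A positive value means the weighted online matrix is closer to the representative one than the not weighted matrix; a negative value would mean the weighting moved it further away.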
Discussion
Emergency situations, like the current COVID-19 pandemic, may induce radical changes in the behavioural patterns of people, leading to the reduction and re-organisation of their social interactions. Changes may be induced by external influences such as governmental interventions or a change in employment status, but they may also strongly depend on individual decisions induced by self- and environment-awareness or risk-avoiding behaviour. All these influences have convoluted effects on the size and structure of personal interactions, leading to different paths of epidemic transmission in a connected population. Age contact matrices provide a useful way to summarise and follow such changes in the social fabric in different settings and at different times. Importantly, they can be further used for more realistic modelling of epidemic spreading. Nevertheless, their collection has been rather sporadic and expensive, and, apart from some recent studies, they were collected during 'normal' times, thus they commonly failed to capture changes in contact patterns during emergency periods.
In this study we provide a feasible alternative approach, which combines the advantages of online data collection with the precision provided by representative telephone surveys. We report here one of the largest data sets collected to date to estimate age contact matrices in a single country, reaching over 2.3% of the population of Hungary. As the online data provided a non-representative sample of the population, we developed a methodology to reconstruct closer-to-representative contact matrices from the online data by using the simultaneously collected representative samples. This data collection method is not only scalable, flexible in terms of content, and relatively cheap, but it also allows for the dynamical estimation of contact matrices with high temporal and spatial resolution.
The reproducibility of our results and the possible adoption of our methods in different countries are primary concerns for us. For these reasons, along with this study, we share the core questionnaire for further use, together with the raw, reconstructed, and representative matrices and all supporting data calculated for Hungary. To date, our data collection method has already been implemented in Mexico and Cuba. We hope that it will prove useful for collecting relevant data for applied epidemiological modelling in other countries too and, at large, will contribute to the global efforts to fight the current COVID-19 pandemic and any future pandemic.
Materials and Methods
Data collection
MASZK online data collection
The online data collection started on the 23rd of March 2020 through the website covid.sed.hu and later continued using a mobile phone app. The anonymity of participants was ensured by using encrypted browser cookies to store hashed identifiers locally, while transferring only anonymous encrypted data to a central secure server. Encrypted browser cookies were also used to detect returning respondents filling out the questionnaire on multiple days. The participants did not have to give any information that could be used for their re-identification. The data collection fully complied with the applicable European and Hungarian data privacy regulations and was approved by the Hungarian National Authority for Data Protection and Freedom of Information. The data collection was accompanied by an ongoing marketing campaign, including regular radio and newspaper interviews, ads on social media platforms, and posters on public transportation, to reach the broadest audience possible. Targeted campaigns were also published with the help of national organisations to reach parents, university students, and elderly people.
In this study, we analyse data collected between the 29th of April and the 19th of May 2020, comprising 30,770 responses from 12,208 respondents of the online questionnaire. The questionnaire consisted of two parts in order to minimise the burden on participants and their potential churning (sample attrition):
Static questionnaire:
This part was asked only once, upon the first response (controlled by encrypted browser cookies), and covered information that does not change frequently, such as the respondent's year of birth, gender, domicile, education level, etc. This static part also included questions about the proxy contact patterns of the respondent during the reference period, before the official declaration of the pandemic on the 13th of March 2020. We recorded reference contact patterns separately for typical weekdays and weekends of the respondents, together with their household structure detailed by age and gender.
Dynamic questionnaire:
Respondents were asked to complete this part ideally every day, reporting on their activities on the previous day. More specifically, we asked about the reasons they were outside, the places they visited, the protection they wore, the travel modes they used, the changes in their working conditions, etc. We asked questions about their proxy and physical social contacts outside their home, at work, or elsewhere, and also about people with whom they had contacts at home but who are not part of their household. Those who mentioned children under 18 years in their household were asked further questions about the contact patterns of their children at school or elsewhere. We share the full questionnaire, including the essential questions for our analysis, in the SI.
Nationally representative CATI survey
A smaller-scale but nationally representative data collection was also conducted between the 6th and 12th of May 2020, using exactly the same questionnaire as the online survey. The data collection was implemented with the CATI (computer-assisted telephone interviewing) survey methodology, using both landline and mobile phone numbers. A multi-step, proportionally stratified, probabilistic sampling procedure was used for sampling. The sample is representative of the Hungarian population aged 18 or older by gender, age, education, and domicile. Sampling errors were corrected using iterative proportional post-stratification weights. After data collection, only the anonymised and hashed data were shared with people involved in the project, after they had signed non-disclosure agreements.
Contact matrix construction
We categorised people into eight age groups, as defined in the main text, and thus constructed $8 \times 8$ contact matrices. Let $X$ be the set of respondents (egos), and let $Y$ be the set of individuals who are contacts of some $x \in X$. For a specific $x$, let $N_x \subset Y$ be the set of individuals who are contacts of $x$. We denote by $a(x) \in A = \{1, \dots, 8\}$ the age group of an individual $x$. Next we define the matrix $M_{x,y}$ for each $x \in X$ and $y \in N_x$ as follows: $(M_{x,y})_{i,j} = 1$ if $a(x) = j$ and $a(y) = i$, and zero otherwise. For an ego $x$ we can now compute its individual contact matrix as $M_x = \sum_{y \in N_x} M_{x,y}$. Finally, we use an individual weight $w_x$ assigned to each ego, coming from the IPF weighting method described in the main text. This weight effectively describes how much an ego and its contacts should be considered in order to obtain a contact matrix for a closer-to-representative population. The population-level contact matrix is computed as $M = \sum_{x \in X} w_x M_x \big/ \sum_{x \in X} w_x$.
Selection of the weighting dimensions
The goal of the weighting process was to correct the unrepresentativeness of the online data without obtaining very large weights, which may lead to large errors in the estimations. However, unlike in a general survey, representativeness in our case was not a general term for the Hungarian population, but was related to their contact patterns. To uncover which variables affect the actual proxy contacts the most in the different age groups, we applied linear regression analysis on the representative survey data for each age group separately. The dependent variable of these regressions was the total number of actual proxy contacts, and the independent variables were those that we measured in the questionnaire and that were also available on a population level from the census. The following independent variables matched these two criteria: region (the seven main geographical regions of Hungary, according to where the respondent lives), type of settlement of the domicile, gender, highest level of education, and activity (a detailed typology of the work type of the respondent, white or blue collar, or the reason they are not employed). We built three models for each age group. In the first model, only the main effects of these variables were included. In the second model we added the two-way interaction terms of all independent variables. Finally, in the third model we included only those interaction terms in which neither the region nor the activity variable was present, as these categorical variables introduce too many parameters into the interactions. This step was done to see clearer signals, where the large number of categories of these two variables does not distort the effect of the others. For each age group, we selected the significant variables and the significant interaction terms as weighting dimensions. If a main effect of a variable was significant, and an interaction term built from the same variable was also significant, we only included the interaction term, because the margins of the interaction also include the margins of the variables that compose it. Based on the results of the regression analyses and of the comparison of the online data with the population data, in some cases we included aggregated categories (values) of these dimensions in the weighting procedure. For example, in the case of activity, a binary variable was created, whose two categories showed whether the respondent worked or did not work. In the case of geographical region, instead of the original seven categories we used two, which showed whether the respondent lived in the central region of the country (which includes the capital) or in another region. The reason for these simplifications was that, for these variables, the strongest effects on the contact patterns of people were manifested along these cleavages.
Relative Accuracy Gain
We define the Relative Accuracy Gain (RAG) in our setting to quantify how much accuracy we gain in approximating the representative contact matrix through the weighting procedure of the online contact matrix, as compared to the unweighted case. It is defined as a function of the sums of absolute differences in the total number of contacts between the representative (rs) and the weighted online (ow) matrices, and between the representative and the not weighted online (onw) matrices. More formally,
$$\mathrm{RAG} = 1 - \left( \frac{\sum |M_{rs} - M_{ow}|}{\sum |M_{rs} - M_{onw}|} \right), \tag{1}$$
where $M_{rs}$ denotes the actual proxy matrix obtained from the nationally representative survey, $M_{ow}$ is the weighted actual proxy matrix obtained after reconstruction from the online survey, and $M_{onw}$ is the not weighted actual proxy matrix measured directly from the online survey.
Acknowledgements
The authors are very thankful to the COVID-19 development team led by Vilmos Bilicki from the Department of Software Development at the University of Szeged, and to Eszter Bokányi for the data analysis and her constructive comments. This work was done in the framework of the Hungarian National Development, Research, and Innovation (NKFIH) Fund 2020-2.1.1-ED-2020-00003. JK was supported by the Premium Postdoctoral Grant of the Hungarian Academy of Sciences. MK is thankful for the support from the DataRedux (ANR-19-CE46-0008) project funded by ANR and the SoBigData++ (H2020-871042) project. GR was supported by NKFIH FK 124016, EFOP-3.6.1-16-2016-00008, and TUDFO/47138-1/2019-ITM.
Author contributions statement
J.K., M.K., and O.V. contributed equally to this work, collected data, and analysed the results. All authors reviewed the manuscript.
Additional information
Competing interests
The authors declare no competing interests.
Supplementary Information
Monitoring behavioural responses during pandemic via reconstructed contact matrices from online and representative surveys
Júlia Koltai, Orsolya Vásárhelyi, Gergely Röst and Márton Karsai
The goal of the Hungarian Data Provider Questionnaire (MASZK) was to dynamically estimate the age contact matrices of people in different settings (like home, work, school, or elsewhere). To collect such data we developed a questionnaire asking about people's demographic characteristics, domicile, family structure, health conditions, travel patterns, education level, employment situation, and more. More importantly, we asked them about the number of people from different age groups with whom they had contacts. First, we recorded reference contact patterns by asking respondents about their contacts during a typical weekday and weekend before the COVID-19 outbreak in Hungary (13th March 2020). Second, we recorded actual contact patterns of participants by asking them to indicate all their contact activities that happened on the day before their actual response. We defined contacts in two different ways relevant for possible infection transmission. Interactions between people without any protection were called physical contacts, while proxy contacts were identified as two people staying closer than 2 meters to each other for at least 15 minutes. Individual contact patterns were recorded as the number of contacts between the ego and their peers from the age groups 0-4, 5-14, 15-29, 30-44, 45-59, 60-69, 70-79, and 80+.
Due to privacy regulations, contact patterns of under-age people could not be collected directly. Nevertheless, to collect data about children younger than 18 years old, we asked respondents living in the same household with under-age children to estimate the children's number of contacts in different settings.
For the sake of the potential adoption of our method and the reproducibility of results, we share the questionnaire, including the essential questions for our analysis, in this repository.
The overarching goal of the modelling process on the representative survey data was to identify those variables that can be used to weight our non-representative online data coming from the MASZK questionnaire. Since our goal was to weight our dataset to be more representative for the number of proxy contacts of respondents, we first ran regression models to identify those factors that significantly affect the daily number of proxy interactions.
As the contact matrices contain the average number of proxy contacts for each age group separately, we ran general linear models with an identity link function separately for each age group. In this way, we could choose factors that significantly affect the number of proxy contacts specifically for the given age group. This method helps to avoid weighting variables that do not influence the contacts in every age segment, and thus limits the increase in the standard deviation of the estimation.
The dependent variable of the models was the number of proxy contacts; the independent variables were region and activity type as factors, and gender, education, and type of settlement as covariates. These independent variables were selected because census data are available on the distribution of these attributes, which we could use in the weighting procedure later.
Moreover, as these dimensions are commonly recorded in any census, our method can easily be applied in different countries without collecting expensive representative samples. In addition to the baseline models, we built extended models, in which we included the interaction terms of the independent variables. For each age group we selected for the further weighting procedure those variables or interactions that significantly (at the 0.05 level) affected the number of proxy contacts. In the case of the region variable, the results suggested that the main differences are between Central Hungary and the other regions, so in the weighting procedure we treated the region variable as a dummy. Similarly, in the case of the independent variable measuring activity type, the breakpoints were mostly between the active and not active groups, thus we included this variable in the weighting procedure as a dummy as well. The resulting variables are available in Table 1 in the main text, while the full model tables are presented below in Tables S2, S3 and S4.

All respondents       13,790
Children* (0-14)       1,582
Adults (15+)          12,208

Age groups - weighting    N    N after weighting
Missing
ALL                       12,723

Table 2. Number of responses after filtering out users with unavailable data points. The original dataset contains responses from adults, and responses for children based on their parents' answers. We applied a weighting methodology called iterative proportional fitting to the online survey to make it measure the contact patterns of the whole population more accurately. Since this methodology requires each variable used in the weighting to have a non-zero entry, we had to drop respondents with no data on the variables used in the weighting procedure. See Tables S2 and S3 for more details about the variables used in the weighting procedure.
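The iterative proportional fitting named in the caption of Table 2 can be sketched as follows. This is a minimal raking illustration under stated assumptions, not the implementation used in the study: the function name `rake_weights`, the two dimensions, the four toy respondents, and the target shares are all hypothetical. The explicit error for an empty category mirrors the requirement above that every category used in the weighting must have a non-zero entry.

```python
def rake_weights(respondents, targets, max_iter=100, tol=1e-10):
    """Iterative proportional fitting (raking) of individual weights.

    respondents: list of dicts, each mapping a dimension name to a category.
    targets: dict mapping each dimension to {category: population share}.
    Returns one weight per respondent; the weights average to 1.
    """
    weights = [1.0] * len(respondents)
    for _ in range(max_iter):
        largest_adjustment = 0.0
        for dim, shares in targets.items():
            total = sum(weights)
            # Current weighted size of every category along this dimension.
            current = {category: 0.0 for category in shares}
            for w, person in zip(weights, respondents):
                current[person[dim]] += w
            for category, size in current.items():
                if size == 0.0:
                    raise ValueError(f"no respondent in {dim}={category}")
            # Rescale so that the weighted margins match the target shares.
            for k, person in enumerate(respondents):
                category = person[dim]
                factor = shares[category] * total / current[category]
                weights[k] *= factor
                largest_adjustment = max(largest_adjustment, abs(factor - 1.0))
        if largest_adjustment < tol:
            break  # all margins already match their targets
    mean_weight = sum(weights) / len(weights)
    return [w / mean_weight for w in weights]


# Hypothetical toy sample: men and the central region are over-represented.
sample = [
    {"gender": "male", "region": "central"},
    {"gender": "male", "region": "central"},
    {"gender": "male", "region": "other"},
    {"gender": "female", "region": "other"},
]
# Hypothetical population shares (not census values from this study).
population = {
    "gender": {"male": 0.5, "female": 0.5},
    "region": {"central": 0.3, "other": 0.7},
}
ipf_weights = rake_weights(sample, population)
print([round(w, 3) for w in ipf_weights])
```

In this toy case the single female respondent receives the largest weight, which illustrates why small or unbalanced samples can produce extreme weights.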
Gender
Category       N            %
Male           865,533      51.6%
Female         812,678      48.4%

Region
Category       N            %
Central        519,266      30.2%
Not Central    1,198,076    69.8%

Settlement Type
Category       N            %

Gender
Category       N            %
Male           865,533      51.6%
Female         812,678      48.4%

Region
Category       N            %
Central        519,266      30.2%
Not Central    1,198,076    69.8%

Center / Rural Hungary
Category                           N            %
Central Hungary*Working            579,409      25.2%
Central Hungary*Not Working        143,751      6.3%
Not Central Hungary*Working        1,150,796    50.1%
Not Central Hungary*Not Working    421,478      18.4%

Region*Settlement Type
Category                       N          %
Not Central*County city        460,025    20%
Not Central*Other city         532,181    23%
Not Central*Town               562,267    24%
Budapest                       435,057    19%
Central Hungary*County city    203,519    9%
Central Hungary*Town           102,385    4%
Table 3. Population census data for the variables used when applying the weighting methodology called iterative proportional fitting to the online survey, to make it measure the contact patterns of the whole population more accurately, in age groups 0-4, 5-14, 15-29, and 30-44.

Employment Status
Category       N            %
Working        1,617,252    96%
Not Working    317,408      19%

Education*Settlement Type
Category                          N          %
Capital*max elementary            100,013    6%
Capital*high school diploma       116,443    7%
Capital*college degree            100,773    6%
County city*max elementary        172,846    10%
Capital*high school diploma       139,471    8%
Capital*college degree            97,680     6%
Other city*max elementary         366,678    21%
Other city*high school diploma    194,829    11%
Other city*college degree         99,679     6%
Town*max elementary               453,990    26%
Town*high school diploma          154,907    9%
Town*college degree               62,801     4%

Gender
Category    N          %
Male        579,975    44%
Female      732,233    56%

Employment Status
Category       N          %
Working        334,104    26%
Not working    953,460    74%

Gender*Education
Category                    N          %
Max. elementary*male        253,655    15%
Max. elementary*female      592,334    35%
High school diploma*male    71,699     4%
High school diploma*male    64,399     4%
College degree*male         70,284     4%
College degree*female       49,415     3%

Region
Category               N          %
Central Hungary        375,500    30.6%
Not Central Hungary    853,082    69.4%

Employment Status
Category       N            %
Working        20,708       2%
Not working    1,133,441    98%
Table 4. Population census data for the variables used when applying the weighting methodology called iterative proportional fitting to the online survey, to make it measure the contact patterns of the whole population more accurately, in age groups 45-59, 60-69, and 70+.
Weight distribution of Iterative Proportional Fitting
Using the detected socio-demographic variables that significantly affect the contact patterns, we described a weighting methodology for the online survey in the main text. Our goal was to provide a method that assigns a weight $w_x$ to each individual $x$ such that the weights are not distributed very broadly, as extreme weights increase the standard errors of the estimates and decrease the accuracy of the estimation. Therefore, our weighting methodology needs to keep the weights in a reasonable range. This was possible by applying iterative proportional fitting, which resulted in individual weights distributed over a relatively small range, $0 < w_x < 25$ (as demonstrated in Fig. S4).
[Figure 4: weight distribution plot; axis label: Frequency]
Figure 4. Resulting weight distribution of age-stratified iterative proportional fitting (IPF). The obtained weights stayed within the range of 0.04 and 25. 49 (as presented in Table 1 in the main text.
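To illustrate how the individual weights $w_x$ enter the population-level matrix $M = \sum_x w_x M_x / \sum_x w_x$ defined in the contact matrix construction of the main text, the following minimal Python sketch builds a weighted contact matrix from toy ego records. The function names, the record layout `(ego_group, contact_groups, weight)`, the use of only two age groups, and all numbers are assumptions for illustration; age groups are indexed from 0 here, whereas the text numbers them from 1.

```python
def individual_matrix(ego_group, contact_groups, n_groups):
    """M_x: entry [i][j] counts the contacts in age group i of an ego in age group j."""
    m = [[0.0] * n_groups for _ in range(n_groups)]
    for contact_group in contact_groups:
        m[contact_group][ego_group] += 1.0
    return m


def population_matrix(egos, n_groups):
    """Weighted average of individual matrices: sum_x w_x M_x / sum_x w_x."""
    total_weight = sum(weight for _, _, weight in egos)
    m = [[0.0] * n_groups for _ in range(n_groups)]
    for ego_group, contact_groups, weight in egos:
        m_x = individual_matrix(ego_group, contact_groups, n_groups)
        for i in range(n_groups):
            for j in range(n_groups):
                m[i][j] += weight * m_x[i][j] / total_weight
    return m


# Hypothetical egos: (age group of ego, age groups of its contacts, weight w_x).
toy_egos = [
    (0, [0, 1, 1], 2.0),  # ego in group 0 with one contact in group 0, two in group 1
    (1, [1], 1.0),        # ego in group 1 with one contact in group 1
    (0, [0], 1.0),        # ego in group 0 with one contact in group 0
]
contact_matrix = population_matrix(toy_egos, n_groups=2)
for row in contact_matrix:
    print(row)
```

Because every ego's matrix is multiplied by its weight, a single respondent with an extreme weight can dominate a cell, which is the reason for keeping the weights within a narrow range.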
To extend our results reported in the main text, here we summarise the measured and reconstructed contact matrices and their comparison in a matrix plot panel, annotated with numerical values. More precisely, in Fig. S5 we show the representative, online-weighted, and online-unweighted matrices in the diagonal. Above the diagonal we depict the pairwise differences between these matrices, while below the diagonal we show the pairwise two-tailed T-test results.
The raw, reconstructed, and representative matrices are shared as data tables in an online repository.
[Figure 5: 3 × 3 grid of panels (a)-(i); rows and columns labelled NOT Weighted Online, Weighted Online, Representative Phone]
Figure 5. Annotated matrices. Normalized actual proxy contact matrices (green diagonal), their pairwise difference matrices (above diagonal), and pairwise two-tailed T-test results (below diagonal) are depicted for the online non-weighted, online weighted, and representative matrices. In the difference matrices, red or blue cells indicate that the source matrix (column label) has a higher or lower average number of contacts than the target (row label) at the given cell. For the results of the pairwise two-tailed T-tests, yellow to blue cells (corresponding to p > .05, assigned by an arrow beside the colorbar) indicate that the given cell is not significantly different in the source (column label) and target (row label) matrices. Data for children under 18 (indicated with asterisks and vertical dashed lines) could not be collected directly due to privacy regulations, thus our data cannot provide a representative sample for the first two age groups.
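The cell-wise comparison summarised in the lower panels of Figure 5 can be sketched as follows. This is a simplified stand-in, not the exact test used in the study: it computes the unequal-variance (Welch) test statistic for one matrix cell and, as a stated assumption, replaces the t distribution by a normal approximation for the p-value, which is reasonable only for large samples. The function name `cell_test` and the toy contact counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def cell_test(sample_a, sample_b):
    """Two-sided test of equal means for one contact matrix cell.

    sample_a, sample_b: per-respondent contact numbers behind the same cell
    in two different matrices. Returns (statistic, approximate p-value).
    """
    standard_error = sqrt(variance(sample_a) / len(sample_a)
                          + variance(sample_b) / len(sample_b))
    if standard_error == 0.0:
        raise ValueError("both samples are constant; the statistic is undefined")
    statistic = (mean(sample_a) - mean(sample_b)) / standard_error
    # Normal approximation to the t distribution (assumption: large samples).
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(statistic)))
    return statistic, p_value


# Hypothetical contact counts for one cell in two data sources.
cell_a = [1.0, 2.0, 3.0, 2.0, 1.0, 2.0, 3.0, 2.0] * 25
cell_b = list(cell_a)                 # identical distribution
cell_c = [x + 1.0 for x in cell_a]    # one extra contact on average

z_same, p_same = cell_test(cell_a, cell_b)
z_diff, p_diff = cell_test(cell_a, cell_c)
print(f"same: p = {p_same:.3f}; shifted: p = {p_diff:.3g}")
```

Cells with an approximate p-value above .05 would be the ones drawn as not significantly different; applying the function to every cell of two matrices yields a significance map of the kind described in the caption.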