A Tale of Three Datasets: Towards Characterizing Mobile Broadband Access in the United States

Abstract

Understanding and improving mobile broadband deployment is critical to bridging the digital divide and targeting future investments. Yet accurately mapping mobile coverage is challenging. In 2019, the Federal Communications Commission (FCC) released a report on the progress of mobile broadband deployment in the United States. This report received a significant amount of criticism, with claims that the cellular coverage, mainly available through Long-Term Evolution (LTE), was over-reported in some areas, especially those that are rural and/or tribal [12]. We evaluate the validity of this criticism using a quantitative analysis of both the dataset on which the FCC based its report and a crowdsourced LTE coverage dataset. Our analysis is focused on the state of New Mexico, a region characterized by a diverse mix of demographics and geography and by poor broadband access. We then performed a controlled measurement campaign in northern New Mexico during May 2019. Our findings reveal significant disagreement between the crowdsourced dataset and the FCC dataset regarding the presence of LTE coverage in rural and tribal census blocks, with the FCC dataset reporting higher coverage than the crowdsourced dataset. Interestingly, both the FCC and the crowdsourced data report higher coverage compared to our on-the-ground measurements. Based on these findings, we discuss our recommendations for improved LTE coverage measurements, whose importance has only increased in the COVID-19 era of performing work and school from home, especially in rural and tribal areas.


Tarun Mangla

University of Chicago

Esther Showalter

University of California, Santa Barbara

Vivek Adarsh

University of California, Santa Barbara

Kipp Jones

Skyhook

Morgan Vigil-Hayes

Northern Arizona University

Elizabeth Belding

University of California, Santa Barbara

Ellen Zegura

Georgia Institute of Technology


Affordable, quality Internet access is critical for full participation in the 21st century economy, education system, and government [24]. Mobile broadband can be achieved through commercial Long-Term Evolution (LTE) cellular networks, which are a proven means of expanding this access [13], but are often concentrated in urban areas and leave economically marginalized and sparsely populated areas underserved [6]. The U.S. Federal Communications Commission (FCC) incentivizes LTE operators serving rural areas [7, 23] and maintains transparency by releasing maps from each operator showing geographic areas of coverage [9]. Recently, third parties have challenged the veracity of these maps, claiming that they over-represent true coverage and thus may discourage much-needed investments. Most of these claims, however, are either focused on limited areas where a few dedicated researchers can collect controlled coverage measurements (e.g., through wardriving), or are mainly qualitative in nature [1, 14, 25].

As dependence on mobile broadband connectivity increases, especially in the face of the COVID-19 pandemic, mechanisms that quantitatively validate FCC coverage datasets at scale are becoming acutely necessary to evaluate and direct resources in Internet access deployment efforts [17, 22].

This is an issue of technology and technology policy, with equity and fairness implications for society.

An increasingly widespread approach to measure coverage at scale is through crowdsourcing, wherein users of the LTE network contribute to coverage measurements. The FCC has recently advocated for the use of crowdsourcing to validate coverage data reported by operators [19]. In this context, we take a data-driven, empirical approach in this work, comparing coverage from a representative crowdsourced dataset with the FCC data. More specifically, our analysis is guided by the following questions: (i) how consistent are existing LTE coverage datasets, and (ii) where and how do their coverage estimations differ, and what trends are present?

We specifically consider a crowdsourced coverage estimate from Skyhook, a commercial location service provider that uses a variety of positioning tools to offer precise geolocation. We select Skyhook because it crowdsources cellular coverage measurements from end-user applications that subscribe to its location services. Such incidental crowdsourcing can potentially provide richer coverage data compared to a voluntary form of crowdsourcing where a user has to explicitly commit to contributing coverage data. We examine this by comparing the Skyhook measurements with those of OpenCellID, an open but voluntary crowdsourced dataset [21]. As will be shown in Section 3.1, we find that the density of the crowdsourced datasets varies significantly by the methodology of data collection, especially in rural areas. In the regions we studied, incidental crowdsourcing (Skyhook) gathered up to 11.1x more cell IDs than voluntary crowdsourcing (OpenCellID).

Using Skyhook as an extensive crowdsourced dataset, we quantify how widely and where the crowdsourced coverage data differs from the FCC data. We specifically focus on the state of New Mexico, selected for its mix of demographics, diverse geographic landscape, and our partnership with community stakeholders within the state. Our methodology is not specific to New Mexico and can be easily extended to other regions in the U.S. We compare coverage at the level of census blocks, which are further grouped into urban, rural, and tribal categories. (We use the FCC methodology wherein a census block is considered covered if the centroid is covered [8]; tribal areas have consistently experienced the lowest broadband coverage rates in the United States for the past decade [6].) We find that the FCC and Skyhook LTE datasets have a disagreement as great as 15% in rural census blocks, with the data from the FCC claiming higher coverage than Skyhook. A major concern in interpreting this comparison is accounting for coverage disagreement as a result of a lack of data points in the crowdsourced dataset. To confirm the availability of users to provide data points, we check for the presence of alternate cellular technologies (e.g., 2G or 3G) within these census blocks and observe a significant number (up to 9% in tribal rural areas) where such alternates are present, providing evidence that users do visit those blocks but cannot access LTE. These results, similar to a recent study on fixed broadband [18], suggest a need for incorporating mechanisms to validate the operator-submitted data into the FCC's LTE access measurement methodology, especially in rural and tribal areas.

Finally, we compare both FCC and Skyhook coverage maps to our own controlled coverage measurements collected from a northern section of New Mexico. Interestingly, we find that both FCC and Skyhook datasets report higher coverage relative to our controlled measurements, with the former showing a higher degree (by up to 26.7%) of over-reporting than the latter. Understanding the causes of these inconsistencies is important for effectively using crowdsourced data to measure LTE coverage, especially as crowdsourcing is increasingly viewed as preferable to provider reports. We conclude with recommendations for improving LTE coverage measurements, whose importance has only increased in the COVID-19 era of performing work and school from home.

| Data Set | Points of Collection | Format | Methodology |
|---|---|---|---|
| FCC | Polygon overlay | Shapefile | Operator-reported with Form 477 |
| Skyhook | Cell signal point | CSV | Incidental crowdsourcing |
| Author Controlled Measurements | Cell signal point | CSV | Wardriving |

Table 1: Summary of coverage data sets.

Figure 1: LTE operators by census block coverage based on FCC data (operators shown: T-Mobile, Verizon, AT&T, Sprint, Choice, Cell. One, ClearTalk, NE_Col.; vertical axis: % census blocks covered).

Figure 2: Map of author wardriving areas in New Mexico.

In this section, we first provide an overview of the LTE network architecture. This is followed by a description of the LTE coverage datasets compared in our analysis. These datasets are summarized in Table 1. We also note the limitations associated with each data collection methodology.

Internet access in an LTE network is available through base stations (known as eNodeBs) operated by the network provider. User equipment (UE), such as smartphones, tablets, or LTE modems, connects to the eNodeB over the radio link. The eNodeB is connected to a centralized cellular core known as the Evolved Packet Core (EPC). This connection is typically through a wired link forming a middle-mile connection. The EPC consists of several network elements including a Packet Data Network Gateway (PGW), which is the connecting node between an end-user device and the public Internet. Thus, LTE broadband access depends on multiple factors including radio coverage, middle-mile capacity, and interconnection links with other networks (e.g., transit providers, content providers) in the public Internet. However, the focus of this article is on understanding the last-mile LTE connectivity characterized by the radio coverage of the eNodeB.

An eNodeB controls a single cell site and consists of several radio transceivers or cells mounted on a raised structure such as a mast or a tower. The radio cells use directional antennas, where each antenna provides coverage in a smaller geographical area using one frequency band. The radio cells can be identified through a globally unique number called the cell identifier (or cell ID), which is also visible to an end-user device in range of the cell. The cell ID enables aggregation of connectivity and signal strength information from multiple UEs connected to the same cell, which can then be used to estimate the geolocation of a cell along with its coverage (see Section 2.3).

The FCC LTE broadband dataset consists of coverage maps in shapefile format that depict geospatial LTE network deployment for each cellular operator in the U.S. The FCC compiles this dataset semi-annually from operators through Form 477. Every operator that owns cellular network facilities must participate in this data collection. The operators submit shapefiles containing detailed network information in the form of geo-polygons along with the frequency band used in the polygon and the minimum advertised upload and download speeds. The methodology used for obtaining these polygons is proprietary to each operator. Ultimately, the FCC publishes only a coverage map that represents coverage as a binary indicator: in any location, cellular service is either available through an operator, or it is not.

We use the binary coverage shapefiles, available on the FCC's website, from June 2019. (At the time of this analysis, data from December 2019 was also available on the FCC website. However, we use data from June 2019 as the other two datasets in our analysis were collected around this period.) Figure 1 shows the eight LTE network operators present in the state of New Mexico (NM) and the percentage of total census blocks in NM covered by each operator. Note that we use one of the FCC methodologies to report mobile broadband access, wherein a census block is considered covered if the centroid of the census block is covered [8]. In this paper, we limit our analysis to the top four cellular operators due to their significantly greater prevalence in NM; these operators are also the top four cellular operators in the United States more broadly.

Limitations: These coverage maps are generated using predictive models that are proprietary to the operator [12] and not generally reproducible. Furthermore, the publicly available dataset consists of binary coverage and lacks any performance-related data. (The FCC has only recently, beginning December 2019, started providing speed data along with coverage information.)

Skyhook is a location service provider that uses a variety of positioning tools, including a database of cell locations, to offer precise geolocation to subscribed applications [26]. Through apps that subscribe to Skyhook's location services, user devices report back network information, which is gathered into anonymous logs and used to further improve the localization service. Through a data access agreement we are able to view the cell location database consisting of a list of unique cell IDs along with the cell technology (e.g., 3G vs. LTE), estimated location, and the estimated coverage. The database was originally constructed through extensive wardriving but is now managed and updated using measurements gathered by devices using the Skyhook API for localization. The device measurements with the same cell ID are combined to estimate the cell location and coverage in the following manner:

Cell location estimation: A grid-based methodology similar to that proposed by Nurmi et al. [20] is used to estimate the cell tower location. Specifically, Skyhook divides the geographic area into 7 m squares and groups measurements in the same square to obtain a central measure of the square's signal strength. This is done to reduce the bias due to large numbers of measurements coming from the same area (e.g., a popular gathering place). A weighted average of the signal strength is then used to estimate the cell location.
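The grid-based estimate described above can be sketched as follows. This is a minimal illustration and not Skyhook's implementation: the local planar coordinates, the use of the median as the per-square central measure, and the conversion from dBm to linear power for the weights are all assumptions made for the example, and the function name is hypothetical.

```python
import math
from collections import defaultdict
from statistics import median

GRID_M = 7.0  # square edge length in meters, as stated in the text


def estimate_cell_location(measurements, grid_m=GRID_M):
    """Estimate a cell position from (x_m, y_m, rssi_dbm) tuples.

    Coordinates are assumed to be meters in a local planar frame.
    """
    # Group readings by the grid square they fall in.
    squares = defaultdict(list)
    for x, y, rssi_dbm in measurements:
        key = (math.floor(x / grid_m), math.floor(y / grid_m))
        squares[key].append(rssi_dbm)
    if not squares:
        raise ValueError("no measurements")

    # One representative value per square keeps a crowded square
    # (e.g., a popular gathering place) from dominating the estimate.
    sum_w = sum_x = sum_y = 0.0
    for (ix, iy), readings in squares.items():
        weight = 10 ** (median(readings) / 10.0)  # assumed weighting: dBm to mW
        sum_w += weight
        sum_x += weight * (ix + 0.5) * grid_m  # square center, x
        sum_y += weight * (iy + 0.5) * grid_m  # square center, y
    return sum_x / sum_w, sum_y / sum_w
```

With this grouping, one hundred readings from a single square count once, exactly like a single reading from a neighboring square with the same signal strength.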

Estimation of cell coverage radius: Skyhook also provides an estimate of the cell's coverage radius using a proprietary method based on the path-loss gradient [27]. The path-loss gradient approximates how the wireless signal attenuates as a function of the distance from the transmitter (the radio cell in this case). The value of the path-loss gradient depends on several factors such as environment (foliage, buildings), geographic topography, and cell signal frequency. Skyhook estimates the path-loss gradient using field observations of cell signal strength readings along with their distributed geographic locations. Ideally, the signal attenuation varies based on the direction and the distance from the cell. However, to reduce the complexity of coverage estimation, Skyhook's cell coverage estimation heuristic calculates only one path-loss gradient for a single cell. The path-loss gradient is then used in a set of parameterized equations to estimate the cell coverage radius. The parameters in these equations have been determined with careful research and testing over more than 10 years.

The cell location database is updated regularly with recalculation of cell location and cell coverage radius using the new device measurements that have been collected since the last update. For our analysis, we use the cell location database last updated on June 10, 2019.

Limitations: Since database entries are crowdsourced when a device passes within range of a cell, this dataset is more comprehensive in population centers and along highways, where people are present more often. If there are too few measurements overall, or if measurements are primarily sourced from the same grid section, then the cell location estimate can be inaccurate.
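To make the role of a single per-cell path-loss gradient concrete, the sketch below fits the textbook log-distance model, rssi = p0 - 10 n log10(d), to signal readings and solves it for the distance at which the signal drops to a chosen threshold. Skyhook's parameterized equations are proprietary, so this model, the threshold, and the function names are assumptions used only for illustration.

```python
import math


def fit_path_loss(samples):
    """Least-squares fit of rssi = p0 - 10 * n * log10(d).

    samples: iterable of (distance_m, rssi_dbm) with distance_m > 0.
    Returns (n, p0): the path-loss gradient and the modeled RSSI at 1 m.
    """
    samples = list(samples)
    xs = [math.log10(d) for d, _ in samples]
    ys = [rssi for _, rssi in samples]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("need readings at two or more distinct distances")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    gradient = -slope / 10.0
    p0 = mean_y - slope * mean_x
    return gradient, p0


def coverage_radius_m(gradient, p0, threshold_dbm):
    """Distance at which the modeled RSSI equals threshold_dbm."""
    return 10 ** ((p0 - threshold_dbm) / (10.0 * gradient))
```

Because one gradient is fitted per cell, the resulting coverage area is a circle; direction-dependent attenuation from terrain or antenna orientation is not represented.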

To complement these datasets, we performed a targeted measurement campaign collecting coverage information through 120 miles of Rio Arriba county in New Mexico over a period of five days beginning May 28, 2019. Figure 2 shows the locations of ground measurements and the four descriptive area labels we use for this analysis. The North area measurements were taken on highways passing primarily through national forest. The Pueblo area measurements were taken from highways within tribal jurisdiction boundaries. In Santa Clara Pueblo, tribal leadership permitted us to collect additional measurements in residential zones. Finally, the Santa Fe area consists of highway measurements between the pueblos and downtown Santa Fe. While limited in scale, these active measurements provide an important comparison point for coverage and user experience. As described in Section 1, we selected these areas of New Mexico for their mix of tribal and non-tribal demographics; tribal lands tend to have the highest coverage over-statements and the most limited cellular availability within the United States [6].

Our measurements consist of service state and signal strength readings recorded on four Motorola G7 Power (XT1955-5) phones running Android Pie (9.0.0). Service State is a discrete variable indicating whether the phone is connected to a cell. Measurements were collected using the Network Monitor application [16]. An external GlobalSat BU-353-S4 GPS connected to an Ubuntu Lenovo ThinkPad laptop gathered geolocation tags that were matched to network measurements by timestamp. Each phone was outfitted with a SIM card from one of the four top cellular operators in the area: Verizon, T-Mobile, AT&T, and Sprint. The phones recorded service state and signal strength every 10 seconds while we drove at highway speeds (between 40 and 65 miles per hour) in most places and less than 10 miles per hour in residential areas (Santa Clara Pueblo).
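Matching geolocation tags to network measurements by timestamp can be sketched as a nearest-neighbor lookup over the sorted GPS log. The record layout, the function name, and the 5-second tolerance (half of the 10-second sampling interval) are assumptions for this example rather than details of the campaign's tooling.

```python
from bisect import bisect_left


def tag_with_location(gps_fixes, readings, max_gap_s=5.0):
    """Attach the nearest-in-time GPS fix to each phone reading.

    gps_fixes: list of (timestamp_s, lat, lon), sorted by timestamp.
    readings:  list of (timestamp_s, payload).
    Returns a list of (payload, (lat, lon)) pairs, with None in place of
    the location when no fix lies within max_gap_s of the reading.
    """
    times = [t for t, _, _ in gps_fixes]
    tagged = []
    for t, payload in readings:
        i = bisect_left(times, t)
        # Only the fixes just before and just after t can be the nearest.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(times)]
        best = min(candidates, key=lambda j: abs(times[j] - t), default=None)
        if best is None or abs(times[best] - t) > max_gap_s:
            tagged.append((payload, None))
        else:
            _, lat, lon = gps_fixes[best]
            tagged.append((payload, (lat, lon)))
    return tagged
```

Leaving a reading untagged when no fix is close enough avoids assigning it to a location the vehicle had already left, which matters at highway speeds.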

Limitations: Our wardriving campaign was intensive in terms of human effort, economic cost, and time, making it difficult to scale. The dataset does not capture any temporal variations in coverage as the measurements were collected over a short span of time. It is possible that driving speed or device configuration affects the measurements, e.g., indicating no coverage when a stationary measurement might have detected coverage [10]. We have no evidence that this occurred, but it might warrant some additional investigation.

Figure 3: CDF of cell updates in Skyhook (S) and OpenCellID (O).

In this section, we first evaluate Skyhook as a representative crowdsourced dataset by comparing it with a popular voluntary crowdsourced dataset from OpenCellID [21]. This is followed by a comparison of coverage across the FCC, Skyhook, and our wardriving measurement data. Our comparison is guided by the following questions: (i) what is the degree of coverage agreement across the datasets, and (ii) where and how do their coverage estimations differ?

We compare the Skyhook dataset with a publicly available crowdsourced dataset, OpenCellID. Unwired Labs' OpenCellID project provides a publicly available dataset of cell IDs along with their estimated location. (The OpenCellID Project is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.) The dataset is derived from crowdsourced UE signal strength measurements similar to Skyhook. However, the UE measurements in this case come from users voluntarily installing the OpenCellID application on their smartphone [21] and manually choosing what data to upload. We differentiate this voluntary crowdsourcing method of data collection from Skyhook's incidental crowdsourcing method, where users of the Skyhook API contribute to the data by default. We specifically compare the number of unique LTE cells and the recentness of the measurements in both datasets. We consider each of these factors to contribute to the overall density of the dataset.

Methodology: While our coverage comparison will be focused on New Mexico, we analyze our selected crowdsourced data more broadly by considering these datasets within a set of counties of differing population densities across the United States. The counties are selected from three areas of the United States: Western (California), Central (New Mexico and Colorado), and Eastern (Georgia). Within each region, we consider three different kinds of counties as defined by the National Center for Health Statistics' 2013 Urban-Rural Classification Guide [3]. These are: (i) large metropolitan (large), which contain a population of at least one million and a principal city; (ii) small metropolitan (small), which contain a population of less than 250,000; and (iii) micropolitan (micro), which must have at least one urban cluster of at least 10,000, but a total population of less than 50,000. This enables us to study differences based on population density and geographic region for the crowdsourced datasets. We select three counties of each population category, for a total of nine counties, to compare these two datasets. We describe these counties in Table 2. For each county, we show the 2018 population density estimated from the U.S. Census Bureau's 2010 census records [2]. We first count the number of unique cell IDs that appear in both datasets for each county, as shown in Table 2. The "% Overlap" column in Table 2 shows the percentage of each dataset's cell IDs that also appear in the other dataset, and the "Common CIDs" column shows the exact number of common cell IDs.

| County classification | Region | County name | Population density (per sq. mile) | Skyhook CIDs | Skyhook % Overlap | OpenCellID CIDs | OpenCellID % Overlap | Common CIDs |
|---|---|---|---|---|---|---|---|---|
| Large Metro | Western | Los Angeles, CA | 2,490.3 | 133,484 | 28% | 39,875 | 92% | 36,816 |
| Large Metro | Central | Denver, CO | 4,683.0 | 11,061 | 24% | 3,136 | 86% | 2,689 |
| Large Metro | Eastern | Fulton, GA | 1,994.0 | 27,809 | 22% | 7,225 | 86% | 6,194 |
| Small Metro | Western | Imperial, CA | 43.5 | 1,818 | 17% | 336 | 93% | 311 |
| Small Metro | Central | Doña Ana, NM | 57.1 | 1,870 | 32% | 663 | 89% | 592 |
| Small Metro | Eastern | Bibb, GA | 613.0 | 1,953 | 21% | 464 | 89% | 413 |
| Micropolitan | Western | Tehama, CA | 21.7 | 733 | 17% | 158 | 80% | 126 |
| Micropolitan | Central | Rio Arriba, NM | 6.7 | 333 | 8% | 30 | 87% | 26 |
| Micropolitan | Eastern | Pierce, GA | 61.3 | 164 | 9% | 21 | 67% | 14 |

Table 2: Characteristics and cell ID (CID) counts in selected counties.

Results: Overall, Skyhook reports a greater number of cells (2.8x - 11.1x) for all counties. The difference is particularly pronounced in micro counties. This suggests that relying on volunteers to download an application and offer network measurements may not be the most accurate method for assessing LTE coverage in rural areas. Furthermore, Skyhook includes a majority of the cells that appear in OpenCellID.

We next consider how recently each cell ID record was updated with a new measurement. Figure 3 shows the CDF of the latest measurement date for cells in both datasets, where cells are split into those located in urban and rural census blocks. Almost 60% of the cells in Skyhook were last updated in the month of June 2019, but the most recent update in OpenCellID was in February 2019. Furthermore, cells in rural census blocks were updated less recently than urban census blocks in OpenCellID, while the difference is negligible in the Skyhook dataset. This suggests that the Skyhook dataset is updated more regularly than OpenCellID, thus making it more likely to represent any changes in the network infrastructure.
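The cell-count ratios and overlap percentages quoted above follow directly from the counts in Table 2. The short sketch below recomputes them; the counts are copied from the table, while the variable and function names are chosen only for this example.

```python
# (Skyhook CIDs, OpenCellID CIDs, common CIDs) per county, from Table 2.
COUNTS = {
    "Los Angeles, CA": (133484, 39875, 36816),
    "Denver, CO": (11061, 3136, 2689),
    "Fulton, GA": (27809, 7225, 6194),
    "Imperial, CA": (1818, 336, 311),
    "Doña Ana, NM": (1870, 663, 592),
    "Bibb, GA": (1953, 464, 413),
    "Tehama, CA": (733, 158, 126),
    "Rio Arriba, NM": (333, 30, 26),
    "Pierce, GA": (164, 21, 14),
}


def summarize(skyhook, opencellid, common):
    """Return (Skyhook/OpenCellID ratio, Skyhook % overlap, OpenCellID % overlap)."""
    ratio = round(skyhook / opencellid, 1)
    skyhook_overlap = round(100 * common / skyhook)
    opencellid_overlap = round(100 * common / opencellid)
    return ratio, skyhook_overlap, opencellid_overlap


SUMMARY = {county: summarize(*counts) for county, counts in COUNTS.items()}
RATIOS = [ratio for ratio, _, _ in SUMMARY.values()]

if __name__ == "__main__":
    for county, (ratio, sky_pct, ocid_pct) in SUMMARY.items():
        print(f"{county}: {ratio}x, {sky_pct}% / {ocid_pct}% overlap")
```

The smallest ratio comes from Doña Ana, NM and the largest from Rio Arriba, NM, which is the micropolitan county that also contains the wardriving area.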

We first compare a coverage shapefile generated from Skyhook cell locations and estimated coverage ranges with the FCC map for each operator.

Methodology: We consider coverage at the census block level for this comparison. In addition to reporting coverage shapefiles, the FCC reports coverage at a census block level and considers a census block as covered if the centroid of the census block falls within a covered region [8]. We generate a similar census block level coverage map per operator using Skyhook's estimated coverage. To do so, we first obtain the coverage shapefile for each operator using a cell's estimated location and coverage radius. Then we use the FCC centroid methodology to generate the Skyhook LTE coverage map at the census block level. We use the Python GeoPandas 0.8.2 library for the associated spatial operations [11]. We group census blocks into four categories: Non-Tribal Urban, Non-Tribal Rural, Tribal Urban, and Tribal Rural. This is done to explore whether the degree of agreement of the two datasets varies across these dimensions. We use the U.S. Census Bureau's classification of urban and rural blocks and its boundary definitions of tribal jurisdiction for this categorization [4]. In this analysis we consider census blocks as tribal if they overlap with any tribal boundaries. We also varied the tribal labeling scheme, for example by classifying a census block as tribal only if the centroid of the block is within a tribal boundary. However, the results remain qualitatively similar and do not impact the findings presented here.
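The centroid rule can be illustrated without a GIS library. The analysis itself uses GeoPandas on shapefiles; the sketch below instead assumes circular coverage areas given as (latitude, longitude, radius) and uses great-circle distance, so it is a simplified stand-in with hypothetical function names and a hypothetical input layout.

```python
import math

EARTH_RADIUS_M = 6_371_000.0


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def block_is_covered(centroid, cells):
    """Centroid rule: covered if the centroid lies inside any cell's circle.

    centroid: (lat, lon); cells: iterable of (lat, lon, radius_m).
    """
    lat, lon = centroid
    return any(haversine_m(lat, lon, c_lat, c_lon) <= radius
               for c_lat, c_lon, radius in cells)


def coverage_by_block(block_centroids, cells):
    """Map each census block ID to True or False under the centroid rule."""
    cells = list(cells)
    return {block_id: block_is_covered(centroid, cells)
            for block_id, centroid in block_centroids.items()}
```

Under this rule, a large rural block counts as fully covered whenever its centroid is in range, even if most of its area is not, which is one reason the later comparison with active measurements uses the coverage shapes directly.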

Results: Table 3 shows the percentage of total census blocks covered by each cellular operator, according to the FCC and Skyhook data, broken down by census block type. Among the four operators, T-Mobile covers the greatest number of census blocks based on both FCC and Skyhook data, while Sprint covers the fewest. All four cellular operators have relatively higher coverage for both tribal and non-tribal urban census blocks. However, all operators except Verizon offer their lowest coverage in tribal rural areas. For some operators, the differences between non-tribal rural and tribal rural are as great as 23% (based on Skyhook data) and 11% (based on FCC data).

Table 3: Percentage of total census blocks covered according to FCC and Skyhook. Columns: census block type, total census blocks, and FCC and Skyhook coverage for each of Verizon, T-Mobile, AT&T, and Sprint. Rows: Non-Tribal Rural, Non-Tribal Urban, Tribal Rural, Tribal Urban.

The extent of LTE coverage differs between the two datasets. For three out of four providers, Skyhook shows lower coverage than the FCC, particularly in the rural census blocks. For instance, the FCC T-Mobile data shows coverage in 92% of tribal rural blocks, whereas Skyhook shows coverage in only 63% of such blocks. On the other hand, Skyhook shows a higher number of census blocks covered than the FCC for Sprint. The higher coverage in the case of Sprint could have been due to multiple reasons, including: (i) there are differences in the propagation models used by Skyhook and Sprint to estimate coverage, with the former's models being more generous than the latter's, and (ii) the Skyhook data is collected across time and Sprint may have discontinued or temporarily disabled some of the cells, which is challenging to detect from the crowdsourced data.

Figure 4: Comparison of LTE coverage maps of New Mexico for (a) Verizon and (b) Sprint. Yellow blocks are covered in the FCC map but not in Skyhook; purple blocks are covered in the Skyhook map but not the FCC. Green blocks are covered in both, and pink blocks are covered in neither.

Figure 4 visually compares the LTE coverage maps from the FCC and the Skyhook datasets for Verizon and Sprint. We more deeply examine the discrepancy mapped in yellow in Figure 4a. Table 4 shows the number of census blocks where there is coverage according to the FCC but none according to Skyhook for each operator. Coverage claims in both tribal and non-tribal rural census blocks disagree the most. The number of such blocks is particularly high for Verizon (19,126 overall) and T-Mobile (18,189 overall). There are two possible reasons for this disagreement: network operators lack adequate infrastructure in rural areas, but tend to overestimate coverage while reporting it to the FCC, or Skyhook is missing data points from rural census blocks where fewer people carry UEs. The latter case will lead to either some LTE cells not being detected or an inaccurate characterization of cell coverage due to fewer measurements.

Table 4: Number of census blocks where there is coverage according to FCC but no coverage according to Skyhook. Columns: block type, total blocks, Verizon, T-Mobile, AT&T, Sprint. Rows: Non-Tribal Rural, Non-Tribal Urban, Tribal Rural, Tribal Urban.

To understand which of these potential reasons for disagreement is more likely, we check whether Skyhook shows 3G coverage for these census blocks (where the FCC reports LTE coverage but Skyhook does not). If Skyhook reports 3G coverage in these blocks, this suggests that users may have contributed to the Skyhook dataset in these census blocks; therefore, LTE coverage would have been detected if it existed. Note that a more accurate way would have been to directly consider the location of end-user measurements connected using 3G technology and analyze whether they fall within LTE coverage areas in the FCC data. However, we did not have access to these end-user measurements due to Skyhook's privacy policy. Instead, we consider the 3G coverage maps as a reasonable approximation for our analysis and generate a 3G coverage map at the census block level for these areas in the same manner as described previously for LTE. The number of census blocks that show only 3G coverage according to Skyhook is presented in Table 5. We observe a significant number of census blocks where Skyhook detects 3G coverage, indicating that the FCC LTE coverage claims may be overstated in these areas. The number of such blocks is greater for tribal rural areas (up to 9%), thus indicating a higher mismatch of the two datasets in tribal rural areas.

| Block type | Verizon | T-Mobile | AT&T | Sprint |
|---|---|---|---|---|
| Non-Tribal Rural | 528 (1%) | 2,575 (3%) | 5,342 (6%) | 19 (<1%) |
| Non-Tribal Urban | | | | |
| Tribal Rural | | | | |
| Tribal Urban | | | | |

Table 5: Number of census blocks with LTE coverage according to the FCC, but only 3G coverage according to Skyhook. The numbers in parentheses report the same data as a percentage of total census blocks of the corresponding type.
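The 3G check described above reduces to set operations over census block IDs once each coverage map has been converted to the block level. The sketch below assumes three sets of block IDs per operator; the names and inputs are hypothetical and stand in for the block-level maps built with the centroid methodology.

```python
def lte_disagreement(fcc_lte, skyhook_lte, skyhook_3g):
    """Split FCC-only LTE blocks by whether Skyhook saw 3G there.

    Each argument is a set of census block IDs reported as covered.
    Returns (fcc_only, fcc_only_with_3g):
      fcc_only          blocks with LTE per the FCC but not per Skyhook
      fcc_only_with_3g  the subset where Skyhook reports 3G, i.e. devices
                        were likely present but did not observe LTE
    """
    fcc_only = fcc_lte - skyhook_lte
    fcc_only_with_3g = fcc_only & skyhook_3g
    return fcc_only, fcc_only_with_3g


def percent_of_total(blocks, total_blocks):
    """Share of all blocks of a given type, as a percentage."""
    return 100.0 * len(blocks) / total_blocks
```

Blocks in the first set but not the second remain ambiguous: the disagreement there could still come from a lack of crowdsourced measurements rather than from missing LTE coverage.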

In this section, we compare our own active measurements with the coverage maps from the FCC and Skyhook described in Section 3.2.1. We focus now on the geographic region around Santa Clara Pueblo, which lies north of Santa Fe (see Figure 2), a region with a mix of urban, rural, and tribal population blocks.

Methodology: We use the Service State readings collected in our measurements for this analysis (see Section 2.4). We also collected information about the connected cell's technology (e.g., LTE) and the geolocation of the measurements. This information is used to infer whether LTE coverage exists at a location. We consider LTE to be available if the Service State shows IN_SERVICE to indicate an active connection, and if the associated cell is an LTE cell. We term this the active LTE coverage. We then compare the FCC and Skyhook coverage with the active LTE coverage to see whether the datasets agree. Note that we use the coverage shapefiles for both Skyhook and the FCC in this comparison instead of the census block centroid approach in Section 3.2.1. This allows us to compare coverage more precisely for a location, especially if a census block is only partially covered.

Results: Table 6 shows the confusion matrices that compare active LTE coverage with reported coverage from the FCC and Skyhook maps. Both maps show coverage at locations where our measurements did not. In the case of Verizon, 81% of the measurements with no coverage are from locations reported as covered by the FCC. This over-reporting is lowest for Sprint and highest for T-Mobile.

We also observe significant disagreement (up to 79%) between Skyhook coverage and our measurements. Two possibilities may cause this: (i) a paucity of Skyhook UE signal strength readings available for cell location and coverage radius estimation, or (ii) error in the cell propagation model itself, possibly due to variations in environmental conditions such as the terrain. In either case, Skyhook agrees better with our measurements than the FCC in reporting areas with no LTE coverage. For example, in the case of AT&T, 75% of our measurements with no LTE coverage belong to areas reported as covered by the FCC, as compared to 48% by Skyhook.

(a) Verizon
Active              Total   FCC NC   FCC C   Skyhook NC   Skyhook C
No Coverage (NC)      266      19%     81%          32%         68%
Coverage (C)        1,440       0%    100%           5%         95%

(b) T-Mobile
Active              Total   FCC NC   FCC C   Skyhook NC   Skyhook C
No Coverage (NC)      324       6%     94%          21%         79%
Coverage (C)        1,361       0%    100%           5%         95%

(c) AT&T
Active              Total   FCC NC   FCC C   Skyhook NC   Skyhook C
No Coverage (NC)      568      25%     75%          53%         48%
Coverage (C)        1,095       2%     98%           7%         93%

(d) Sprint
Active              Total   FCC NC   FCC C   Skyhook NC   Skyhook C
No Coverage (NC)      231      96%      4%          99%          2%
Coverage (C)        1,122      21%     79%          20%         80%

Table 6: Confusion matrices comparing active measurement coverage with FCC and Skyhook. Total denotes the number of active measurements in each category.
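As an illustration of how each panel of Table 6 is laid out, the sketch below builds a row-normalized confusion matrix from (active, reported) pairs. The function name and the boolean-pair input format are assumptions made for this example, and any input passed to it here would be toy data rather than the campaign's measurements.

```python
from typing import Dict, Iterable, Tuple


def confusion_percentages(
    pairs: Iterable[Tuple[bool, bool]]
) -> Dict[str, Dict[str, float]]:
    """Row-normalized confusion matrix.

    Each pair is (active_lte_coverage, reported_coverage). Rows are the
    active-measurement outcome, columns the map's claim; each row is
    expressed as a percentage of that row's total.
    """
    counts = {"NC": {"NC": 0, "C": 0}, "C": {"NC": 0, "C": 0}}
    for active, reported in pairs:
        row = "C" if active else "NC"
        col = "C" if reported else "NC"
        counts[row][col] += 1

    result: Dict[str, Dict[str, float]] = {}
    for row, cols in counts.items():
        total = cols["NC"] + cols["C"]
        result[row] = {
            "total": total,
            # Guard against empty rows so the function never divides by zero.
            "NC": 100.0 * cols["NC"] / total if total else 0.0,
            "C": 100.0 * cols["C"] / total if total else 0.0,
        }
    return result
```

In this layout, the "NC" row's "C" column is the over-reporting rate discussed in the text: the share of locations without measured LTE coverage that the map nevertheless reports as covered.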

In this section, we discuss some of the implications of our experience collecting and analyzing coverage data, recommendations based on our findings, and directions for future work.

Recommendations for the FCC: Our findings make a case for including mechanisms that validate ISP-reported coverage data, especially in rural and tribal regions. Given the scale of cellular networks, crowdsourcing coverage measurements is a viable approach to validate access, as opposed to controlled measurements. Within crowdsourcing, we suggest leveraging incidental rather than voluntary approaches, possibly working with third-party services that collect network measurements as part of their service process (as in the case of Skyhook).

In addition, crowdsourcing alone may not be sufficient for determining coverage in some cases. Even with the more complete datasets provided through incidental crowdsourcing, rural areas tended to receive significantly fewer measurements per tower. In such cases, mechanisms need to be developed to precisely determine areas of greatest disagreement using sparse crowdsourced datasets. Resources can then be focused on targeted data collection in these areas instead of a blanket approach that measures coverage everywhere.

Recommendations for crowdsourced data collection: We find some shortcomings in the existing crowdsourced datasets. First, existing datasets only report areas with positive coverage, i.e., areas where coverage is observed. This makes it difficult to distinguish areas that lack coverage from areas for which no measurements were gathered. Recording areas that lack a usable signal can enable stronger conclusions from crowdsourced data.

Second, we note that even crowdsourced datasets are prone to overestimation of coverage, potentially due to errors in cell location and coverage estimation. Research efforts that effectively utilize the knowledge of cellular network design are needed for an accurate characterization of coverage from crowdsourced measurements. For instance, existing cell location estimation techniques localize cells independently (see Section 2.3) and are prone to errors when there are few end-user measurements [15]. Instead, one can utilize the fact that a single physical tower in an LTE network hosts multiple cells. Thus, algorithms that jointly localize cells whose end-user measurements are in physical proximity may provide higher accuracy even with fewer end-user measurements. Similarly, alternate data sources can also be considered for localizing cell infrastructure, such as using geo-imagery data to identify physical towers or directly obtaining infrastructure data from entities that build and manage physical cell towers (usually different from cellular ISPs).
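One way to picture the joint localization idea is the sketch below. It is purely illustrative and not a method from this paper or from Skyhook: the per-cell centroid estimator, the greedy grouping rule, and the 500 m `radius_m` threshold are all assumptions chosen for brevity. Cells whose independently estimated locations fall close together are treated as co-located on one tower, and their measurements are pooled to produce a shared estimate.

```python
import math
from typing import Dict, List, Tuple

LatLon = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_m(a: LatLon, b: LatLon) -> float:
    """Great-circle distance in meters between two (lat, lon) points."""
    r = 6371000.0  # mean Earth radius in meters
    lat1, lon1 = math.radians(a[0]), math.radians(a[1])
    lat2, lon2 = math.radians(b[0]), math.radians(b[1])
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(h))


def centroid(points: List[LatLon]) -> LatLon:
    """Unweighted mean position (adequate for small, local point sets)."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def jointly_localize(
    measurements: Dict[str, List[LatLon]], radius_m: float = 500.0
) -> Dict[str, LatLon]:
    """Assign each cell a tower estimate pooled over nearby cells.

    measurements maps a cell ID to the locations where that cell was seen.
    radius_m is a hypothetical grouping threshold, not a value from the paper.
    """
    # Step 1: independent per-cell estimate (the baseline being improved on).
    per_cell = {cid: centroid(pts) for cid, pts in measurements.items() if pts}

    # Step 2: greedily group cells whose estimates lie within radius_m
    # of the first cell placed in a group.
    groups: List[List[str]] = []
    for cid in sorted(per_cell):
        for group in groups:
            if haversine_m(per_cell[cid], per_cell[group[0]]) <= radius_m:
                group.append(cid)
                break
        else:
            groups.append([cid])

    # Step 3: pool all measurements in a group into one tower estimate.
    estimates: Dict[str, LatLon] = {}
    for group in groups:
        pooled = [p for cid in group for p in measurements[cid]]
        tower = centroid(pooled)
        for cid in group:
            estimates[cid] = tower
    return estimates
```

Pooling means a cell with only one or two readings inherits the information gathered for its sibling cells on the same tower. A practical version would need a principled grouping criterion and a signal-strength-aware estimator.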

Measuring access beyond binary coverage: While the focus of this work is on understanding coverage, we recognize that a binary notion of coverage alone does not necessarily indicate the existence of usable LTE connectivity. Various other factors can impact end-user experience in a “covered” area, such as low signal strength or poor middle-mile connectivity. Thus, future coverage measurement efforts need to augment coverage reports with measurements of performance to provide models that are more aligned with user experiences. Measuring such performance metrics poses a greater challenge because end-user experience depends on a myriad of factors beyond just last-mile link quality. We believe that efforts that lead to increased community awareness (e.g., workshops in public libraries, community meetings) of the importance of measuring mobile coverage are the way to tackle this problem.

Finally, we also note that access and adoption are different, and there are issues beyond access that might also warrant measurement and consideration as accountability measures for operators. Our collection of ground truth datasets involved five days driving through Rio Arriba County in northern New Mexico. In preparation for the trip, we worked to obtain SIM cards that would enable us to access the networks of the four major U.S. LTE operators. This was surprisingly difficult; over the course of a month leading up to the measurement campaign, we spent a collective 24 hours in various operator kiosks and stores in three states in order to obtain four SIM cards (one for each major operator). At one of the stores in Santa Fe, we encountered a woman who had to drive an hour from Las Vegas, NM to address some of the issues she was having with her mobile service operator that were preventing her from using her data plan. While these anecdotal experiences mirror the qualitative claims of coverage overestimation, they do introduce a new set of issues that need to be taken into account to effectively reduce the barriers to Internet access for rural communities.

In this paper, we quantitatively examine the LTE coverage disagreement among existing datasets collected using different methodologies. We find that existing datasets display the most divergence when compared with each other in rural and tribal areas. We discuss our findings with respect to their implications for telecommunications policy. We also identify several future research directions for the computing community, including: mechanisms to augment existing datasets to precisely determine areas where more concerted measurement efforts are needed, improved coverage estimation models especially for areas with a lower density of crowdsourced measurements, and accurate and scalable measurement of access beyond a binary notion of coverage.

ACKNOWLEDGEMENTS

This work is funded in part by National Science Foundation Smart and Connected Communities grant NSF-1831698.

REFERENCES

[1] Rural Wireless Association. 2018. RWA Calls for FCC Investigation of T-Mobile Coverage Data. https://ruralwireless . . census . gov/programs-surveys/decennial-census/decade . . . cdc . gov/nchs/data_access/urban_rural . . census . gov/programs-surveys/geography/guidance/geo-areas/urban-rural . . fcc . . fcc . . fcc . gov/general/connect-america-fund-caf.
[8] Federal Communications Commission. 2019. FCC centroid methodology. https://docs . fcc . gov/public/attachments/DA-16-1107A1_Rcd . . fcc . gov/form-477-mobile-voice-and-broadband-coverage-areas
[10] Mah-Rukh Fida and Mahesh K Marina. 2018. Impact of device diversity on crowdsourced mobile coverage maps. In IEEE CNSM.
[11] GeoPandas. 2019. Python library for geospatial operations. http://geopandas . . itu . int/en/ITU-D/Statistics/Documents/facts/ICTFactsFigures2017 . pdf.
[14] John Kahan. 2019. It’s time for a new approach for mapping broadband data to better serve Americans. https://blogs.microsoft.com/on-the-issues/2019/04/08/its-time-for-a-new-approach-for-mapping-broadband-data-to-better-serve-americans/
[15] Zhijing Li, Ana Nika, Xinyi Zhang, Yanzi Zhu, Yuanshun Yao, Ben Y Zhao, and Haitao Zheng. 2017. Identifying value in crowdsourced wireless signal measurements. In Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 607–616.
[16] Benoit Lubek. [n.d.]. https://github.com/caarmen/network-monitor
[17] Andra Lutu, Diego Perino, Marcelo Bagnulo, Enrique Frias-Martinez, and Javad Khangosstar. 2020. A Characterization of the COVID-19 Pandemic Impact on a Mobile Network Operator Traffic. In ACM IMC.
[18] David Major, Ross Teixeira, and Jonathan Mayer. 2020. No WAN’s Land: Mapping US Broadband Coverage with Millions of Address Queries to ISPs. In ACM IMC . ntia . doc . gov/files/ntia/publications/ntia_comments_on_modernizing_the_fcc_form_477_data_program . pdf.
[20] Petteri Nurmi, Sourav Bhattacharya, and Joonas Kukkonen. 2010. A grid-based algorithm for on-device GSM positioning. In Proceedings of the 12th ACM international conference on Ubiquitous computing. ACM, 227–236.
[21] OpenCellID. 2019. OpenCellID Open Data. https://opencellid.org/downloads.php
[22] Pew Research Center. 2019. Mobile Fact Sheet. https://pewresearch-org-preprod.go-vip.co/pewinternet/fact-sheet/mobile/.
[23] James E Prieger. 2017. Mobile data roaming and incentives for investment in rural broadband infrastructure. Available at SSRN 3391478 (2017).
[24] Elisabeth Roberts, David Beel, Lorna Philip, and Leanne Townsend. 2017. Rural Resilience in a Digital Society.