Semantic Word Cloud Representations: Hardness and Approximation Algorithms
Lukas Barth, Sara Irina Fabrikant, Stephen Kobourov, Anna Lubiw, Martin Nöllenburg, Yoshio Okamoto, Sergey Pupyrev, Claudio Squarcella, Torsten Ueckerdt, Alexander Wolff
Abstract
We study a geometric representation problem, where we are given a set $R$ of axis-aligned rectangles with fixed dimensions and a graph with vertex set $R$. The task is to place the rectangles without overlap such that two rectangles touch if and only if the graph contains an edge between them. We call this problem CONTACT REPRESENTATION OF WORD NETWORKS (CROWN). It formalizes the geometric problem behind drawing word clouds in which semantically related words are close to each other. Here, we represent words by rectangles and semantic relationships by edges.

We show that CROWN is strongly NP-hard even if restricted to trees and weakly NP-hard if restricted to stars. We consider the optimization problem MAX-CROWN where each adjacency induces a certain profit and the task is to maximize the sum of the profits. For this problem, we present constant-factor approximations for several graph classes, namely stars, trees, planar graphs, and graphs of bounded degree. Finally, we evaluate the algorithms experimentally and show that our best method improves upon the best existing heuristic by 45%.

Word clouds and tag clouds are popular tools for visualizing text. The practical tool, Wordle [VWF09], took word clouds to the next level with high quality design, graphics, style and functionality. Such word cloud visualizations provide an appealing way to summarize the content of a webpage, a research paper, or a political speech. Often such visualizations are used to contrast two documents; for example, word cloud visualizations of the speeches given by the candidates in the 2008 US Presidential elections were used to draw sharp contrast between them in the popular media.

While some of the more recent word cloud visualization tools aim to incorporate semantics in the layout, none provides any guarantees about the quality of the layout in terms of semantics. We propose a mathematical model of the problem, via a simple edge-weighted graph. The vertices in the graph are the words in the document. The edges in the graph correspond to semantic relatedness, with weights corresponding to the strength of the relation. Each vertex must be drawn as an axis-aligned rectangle (box, for short) with fixed dimensions. Usually, the dimensions will be determined by the size of the word in a certain font, and the font size will be related to the importance of the word. The goal is to "realize" as many edges as possible, by contacts between their corresponding rectangles; see Fig. 1.
Figure 1: A hierarchical word cloud for complexity classes. A class is above another class when the former contains the latter. The font size is the square root of millions of Google hits for the corresponding word. This is an instance of the problem variant HIER-CROWN. (Words shown: DECIDABLE, LOG-SPACE, P-TIME, NP-COMPLETE, P-SPACE, EXP-TIME, EXP-SPACE, RECOGNIZABLE, NP-TIME, coNP-TIME.)

Hierarchically clustered document collections are visualized with self-organizing maps [LHKK96] and Voronoi treemaps [NB12]. The early word-cloud approaches did not explicitly use semantic information, such as word relatedness, in placing the words in the cloud. More recent approaches attempt to do so, as in ManiWordle [KLKS10] and in parallel tag clouds [CVW09]. The most relevant approaches rely on force-directed graph visualization methods [CWL+10] and a seam-carving image processing method together with a force-directed heuristic [WPW+11].

In rectangle representations of graphs, vertices are axis-aligned rectangles with non-intersecting interiors and edges correspond to rectangles with non-zero length common boundary. Every graph that can be represented this way is planar and every triangle in such a graph is a facial triangle; these two conditions are also sufficient to guarantee a rectangle representation [Tho86, RT86, BGPV08, Fus09]. In a recent survey, Felsner [Fel13] reviews many rectangulation variants, including squarings. Algorithms for area-preserving rectangular cartograms are also related [Rai34]. Area-universal rectangular representations, where vertex weights are represented by area, have been characterized [EMSV12], and edge-universal representations, where edge weights are represented by length of contacts, have been studied [NPR13]. Unlike cartograms, in our setting there is no inherent geography, and hence, words can be positioned anywhere. Moreover, each word has fixed dimensions enforced by its frequency in the input text, rather than just fixed area.

Author affiliations. Lukas Barth and Martin Nöllenburg: Institute of Theoretical Informatics, Karlsruhe Institute of Technology. Sara Irina Fabrikant: Department of Geography, University of Zurich. Stephen Kobourov and Sergey Pupyrev: Department of Computer Science, University of Arizona. Anna Lubiw: School of Computer Science, University of Waterloo. Yoshio Okamoto: Dept. Comm. Engineering and Informatics, University of Electro-Communications. Claudio Squarcella: Dipartimento di Ingegneria, Roma Tre University. Torsten Ueckerdt: Department of Mathematics, Karlsruhe Institute of Technology. Alexander Wolff: Lehrstuhl für Informatik I, Universität Würzburg. arXiv category: cs.CG.

The input to the problem variants that we consider is a sequence $B_1, \dots, B_n$ of axis-aligned boxes with fixed positive dimensions. Box $B_i$ is encoded by $(w_i, h_i)$, where $w_i$ and $h_i$ are its width and height. For some of our results, some boxes may be rotated by $90^\circ$, which means exchanging $w_i$ and $h_i$. A representation of the boxes $B_1, \dots, B_n$ is a map that associates with each box a position in the plane so that no two boxes overlap. A contact between two boxes is a line segment (possibly a point) in the boundary of both. If two boxes are in contact, we say that they touch. If two boxes touch and one lies above the other, we call this a vertical contact. We define horizontal contact symmetrically. For $1 \le i \ne j \le n$, a non-negative profit $p_{ij}$ represents the gain for making boxes $B_i$ and $B_j$ touch.
The supporting graph has a vertex for each box and an edge for each non-zero profit. Finally, we define the total profit of a representation to be the sum of profits over all pairs of touching boxes. Our problems and results are as follows.

Contact Representation of Word Networks (CROWN): In this decision problem, we assume 0–1 profits. The task is to decide whether there exists a representation of the boxes with total profit $\sum_{i \ne j} p_{ij}$. This is equivalent to finding a representation whose contact graph contains the supporting graph as a subgraph. If such a representation exists, we say that it realizes the supporting graph and that the instance of the CROWN problem is realizable. We show that CROWN is strongly NP-hard even if restricted to trees and weakly NP-hard if restricted to stars; see Theorem 1.

We also consider two variants of the problem that can be solved efficiently. First we present a linear-time algorithm for CROWN on so-called irreducible triangulations; see Section 2.1. Then we turn to the problem variant HIER-CROWN, where the supporting graph is a single-sink directed acyclic graph with fixed plane embedding, and the task is to find a representation in which each edge corresponds to a vertical contact directed upwards; see Fig. 1. We solve this variant efficiently; see Section 2.2.

MAX-CROWN: In this optimization problem, the task is to find a representation of the given boxes maximizing the total profit. We present constant-factor approximation algorithms for stars, trees, and planar graphs, and a $2/(\Delta + 1)$-approximation for graphs of maximum degree $\Delta$; see Section 3. We have implemented two approximation algorithms and evaluated them experimentally in comparison to three existing algorithms (two of which are semantics-aware). Based on a dataset of 120 Wikipedia documents, our best method outperforms the best previous methods by more than 45%; see Section 5. We also consider an extremal version of the MAX-CROWN problem and show that if the supporting graph is $K_n$ ($n \ge$ […]) and each profit is 1, then there always exists a representation with total profit $2n - 2$ and that this is sometimes the best possible. Such a representation can be found in linear time.

Figure 2: Given an instance $S$ of 3-PARTITION, we construct a tree $T_S$ (thick red line segments) and define boxes such that $T_S$ has a realization if and only if $S$ is feasible.

AREA-CROWN is as follows: Given a realizable instance of CROWN, find a representation that realizes the supporting graph and minimizes the area of a box containing all input boxes. We show that this problem is NP-hard even if restricted to paths; see Section 4.
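The definitions of contact and total profit given above can be made concrete with a short sketch. The snippet below is only an illustration under stated assumptions: the tuple layout `(x, y, w, h)` (lower-left corner plus dimensions) and the function names `overlaps`, `touches`, and `total_profit` are chosen for this example and do not come from the paper.

```python
from itertools import combinations


def overlaps(a, b):
    """True if the interiors of boxes a and b intersect."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def touches(a, b):
    """True if a and b share a boundary segment or point but do not overlap."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    closed_meet = (ax <= bx + bw and bx <= ax + aw and
                   ay <= by + bh and by <= ay + ah)
    return closed_meet and not overlaps(a, b)


def total_profit(boxes, profit):
    """Sum the profits p_ij over all pairs of touching boxes.

    boxes:  dict mapping a name to (x, y, w, h)   (assumed layout)
    profit: dict mapping frozenset({i, j}) to a non-negative profit
    """
    total = 0
    for i, j in combinations(boxes, 2):
        if overlaps(boxes[i], boxes[j]):
            raise ValueError(f"boxes {i} and {j} overlap")
        if touches(boxes[i], boxes[j]):
            total += profit.get(frozenset((i, j)), 0)
    return total


boxes = {"A": (0, 0, 2, 1), "B": (2, 0, 1, 1), "C": (0, 1, 1, 1), "D": (5, 5, 1, 1)}
profit = {frozenset("AB"): 3, frozenset("AC"): 2, frozenset("BC"): 5, frozenset("AD"): 1}
print(total_profit(boxes, profit))  # only A-B and A-C touch, so this prints 5
```

With 0–1 profits, an instance of CROWN is realized by a representation exactly when `total_profit` equals the number of edges of the supporting graph.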
In this section, we investigate the complexity of CROWN for several graph classes.
Theorem 1.
CROWN is (strongly) NP-hard. The problem remains strongly NP-hard even if restricted to trees and weakly NP-hard if restricted to stars.

Proof. To show that CROWN on stars is weakly NP-hard, we reduce from the weakly NP-hard problem PARTITION, which asks whether a given multiset of $n$ positive integers $a_1, \dots, a_n$ that sum to $B$ can be partitioned into two subsets, each of sum $B/2$. We construct a star graph whose central vertex corresponds to a $(B/2, \delta)$-box (for some $0 < \delta < \min_i a_i$). We add four leaves corresponding to $(B, B)$-squares and, for $i = 1, \dots, n$, a leaf corresponding to an $(a_i, a_i)$-square. It is easy to verify that there is a realization for this instance of CROWN if and only if the set can be partitioned.

To show that CROWN is (strongly) NP-hard, we reduce from 3-PARTITION: Given a multiset $S$ of $n = 3m$ integers with $\sum S = mB$, is there a partition of $S$ into $m$ subsets $S_1, \dots, S_m$ such that $\sum S_1 = \dots = \sum S_m = B$? It is known that 3-PARTITION is NP-hard even if, for every $s \in S$, we have $B/4 < s < B/2$, which implies that each of the subsets $S_1, \dots, S_m$ must contain exactly three elements [GJ79].

Given an instance $S = \{s_1, s_2, \dots, s_n\}$ of 3-PARTITION as described above, we define a tree $T_S$ on $n + 4(m - 1) + 7$ vertices as in Fig. 2 (for $n = 9$ and $m = 3$). Let $K = (m + 1)B + m + 1$. We make a vertex $c$ of size $(K, 1/2)$. For each $i = 1, \dots, n$, we make a vertex $v_i$ of size $(s_i, B)$. For each $j = 0, \dots, m$, we make vertices $u_j$ and $b_j$ of size $(1, B)$ and vertices $\ell_j$ and $r_j$ of size $(B/2, B)$. Finally, we make vertices $a_1, \dots, a_5$ of size $(K, K)$, and vertices $d_1$ and $d_2$ of size $(B/2, B)$. The tree $T_S$ is as shown by the thick lines in Fig. 2: vertex $c$ is adjacent to all the $v_i$'s, $u_j$'s, $a_k$'s, and $d_k$'s; and each vertex $u_j$ is adjacent to $b_j$, $\ell_j$, and $r_j$.

We claim that an instance $S$ of 3-PARTITION is feasible if and only if $T_S$ can be realized with the given box sizes. It is easy to see that $T_S$ can be realized if $S$ is feasible: we simply partition vertices $v_1, \dots, v_n$ into groups of three (by vertices $u_0, \dots, u_m$) in the same way as their widths $s_1, \dots, s_n$ are partitioned in groups of three; see Fig. 2.

For the other direction, consider any realization of $T_S$. By abusing notation, we refer to the box of some vertex $v$ also as $v$. Since $c$ touches the five large squares $a_1, \dots, a_5$, at least three sides of $c$ are partially covered by some $a_k$ and at least one horizontal side of $c$ is completely covered by some $a_k$. Since $c$ has height 1/2 only, but touches all the $v_i$'s and $u_j$'s and $d_1$ and $d_2$ (each of height $B$, which exceeds the height of $c$), all these boxes must touch $c$ on its free horizontal side, say, the bottom side. Furthermore, the sum of the widths of the boxes exactly matches the width of $c$; so they must pack side by side in some order.

Figure 3: Left: starting configuration with rays $v_S$ and $v_W$. Center: representation at an intermediate step: vertex $w$ fits into concavity $p$ and results in a staircase, vertex $v$ fits into concavity $s$ but does not result in a staircase. Adding box $w$ to the representation introduces a new concavity $q$ and allows wider boxes to be placed at $r$. Right: no box can be placed, so the algorithm terminates.

This means that the only free boundary of $u_j$ is at the bottom, and $u_j$ must make contact there with $b_j$, $\ell_j$, and $r_j$. This is only possible if $b_j$ is placed directly beneath $u_j$, and $\ell_j$ and $r_j$ make contact with the bottom corners of $u_j$. (They need not appear to the left and right as shown in Fig. 2.) Because the sum of the widths of the $b_j$'s, $\ell_j$'s, and $r_j$'s exactly matches the width of $c$, they must pack side by side, and therefore the $u_j$'s are spaced distance $B$ apart. There is a gap of width $B/2$ before the first $u_j$ and after the last $u_j$. These gaps are too wide for one box in $v_1, \dots, v_n$ and too small for two of them since their widths are contained in the open interval $(B/4, B/2)$. Therefore, the boxes $d_1$ and $d_2$ must occupy these gaps, and the boxes $v_1, \dots, v_n$ are packed into $m$ groups each of width $B$, as required.

Note that the proof of the weak NP-hardness for stars still works in case rectangles may be rotated because all boxes but one are squares. The same holds for the strong NP-hardness for trees; for details see Appendix A.

Although CROWN is NP-hard in general, there are graph classes for which the problem can be solved efficiently. In the remainder of this section, we investigate such a class, irreducible triangulations, and we consider a restricted variant of CROWN: HIER-CROWN.
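To make the 3-PARTITION reduction in the proof of Theorem 1 more tangible, the following sketch builds the box sizes and tree edges of $T_S$ as described there and checks the two width identities the packing argument relies on. The function name `build_tree_instance` and the string labels for vertices are hypothetical, chosen for this illustration; the index ranges ($j = 0, \dots, m$ and five large squares) follow the proof text, and widths $B/2$ are stored as exact fractions.

```python
from fractions import Fraction


def build_tree_instance(S, B):
    """Return (boxes, edges, K) for the tree T_S of a 3-PARTITION instance.

    boxes maps a vertex label to (width, height); edges lists the tree edges.
    """
    m = len(S) // 3
    K = (m + 1) * B + m + 1
    half_B = Fraction(B, 2)
    boxes = {"c": (K, Fraction(1, 2))}      # the thin central box
    edges = []
    for i, s in enumerate(S, start=1):      # one box per input number
        boxes[f"v{i}"] = (s, B)
        edges.append(("c", f"v{i}"))
    for j in range(m + 1):                  # separators u_j with their three leaves
        boxes[f"u{j}"] = (1, B)
        boxes[f"b{j}"] = (1, B)
        boxes[f"l{j}"] = (half_B, B)
        boxes[f"r{j}"] = (half_B, B)
        edges.append(("c", f"u{j}"))
        edges += [(f"u{j}", f"{x}{j}") for x in "blr"]
    for k in range(1, 6):                   # five large squares around c
        boxes[f"a{k}"] = (K, K)
        edges.append(("c", f"a{k}"))
    for k in (1, 2):                        # fillers for the two end gaps
        boxes[f"d{k}"] = (half_B, B)
        edges.append(("c", f"d{k}"))
    return boxes, edges, K


# Example instance with B/4 < s < B/2 for every s and total sum m * B.
S, B = [30, 30, 40, 26, 34, 40, 33, 33, 34], 100
boxes, edges, K = build_tree_instance(S, B)
row_below_c = sum(w for name, (w, _) in boxes.items() if name[0] in "vud")
row_below_u = sum(w for name, (w, _) in boxes.items() if name[0] in "blr")
print(K, row_below_c, row_below_u)  # all three values equal K = 404
```

Both row sums equal $K$, which is what forces the boxes to pack side by side without gaps in any realization.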
A box representation is called a rectangular dual if the union of all rectangles is again a rectangle whose boundary is formed by exactly four rectangles. A graph $G$ admits a rectangular dual if and only if $G$ is planar, internally triangulated, has a quadrangular outer face and does not contain separating triangles [BGPV08]. Such graphs are known as irreducible triangulations. The four outer vertices of an irreducible triangulation are denoted by $v_N, v_E, v_S, v_W$ in clockwise order around the outer quadrangle. An irreducible triangulation $G$ may have exponentially many rectangular duals. Any rectangular dual of $G$, however, can be built up by placing one rectangle at a time, always keeping the union of the placed rectangles in staircase shape.

Theorem 2. CROWN on irreducible triangulations can be solved in linear time.

Proof sketch. The algorithm greedily builds up the supporting graph $G$, similarly to an algorithm for edge-proportional rectangular duals [NPR13]. We define a concavity as a point on the boundary of the so-far constructed representation, which is a bottom-right or top-left corner of some rectangle. Start with a vertical and a horizontal ray emerging from the same point $p$, as placeholders for the right side of $v_W$ and the top side of $v_S$, respectively. Then at each step consider a concavity, with $p$ as the initial one. Since each concavity $p$ is contained in exactly two rectangles, there exists a unique rectangle $R_p$ that is yet to be placed and has to touch both these rectangles. If by adding $R_p$ we still have a staircase shape representation, then we do so. If no such rectangle can be added, we conclude that $G$ is not realizable; see Fig. 3. The complete proof is in the appendix.

The HIER-CROWN problem

The HIER-CROWN problem is a restricted variant of the CROWN problem that can be used to create word clouds with a hierarchical structure; see Fig. 1. The input is a directed acyclic graph $G$ with only one sink and with a plane embedding. The task is to find a representation that hierarchically realizes $G$, meaning that for each directed edge $(v, u)$ in $G$ the top of the box for $v$ is in contact with the bottom of the box for $u$.

If the embedding of $G$ is not fixed, the problem is NP-hard even for a tree, by an easy adaptation of the proof of Theorem 1. (Remove three of the five vertices $a_1, \dots, a_5$, and orient the remaining edges of $T_S$ upward according to the representation shown in Fig. 2.) However, if we fix the embedding of the supporting graph $G$, then HIER-CROWN can be solved efficiently.
Theorem 3. HIER-CROWN can be solved in polynomial time.

Proof. Let $G$ be the given supporting graph, with vertices corresponding to boxes $B_1, \dots, B_n$ where $B_i$ has height $h_i$ and width $w_i$, and $B_1$ is the unique sink. We first check that the orientation and embedding of $G$ are compatible, that is, that incoming edges and outgoing edges are consecutive in the cyclic order around each vertex.

The main idea is to set up a system of linear equations for the $x$- and $y$-coordinates of the sides of the boxes. Let variables $t_i$ and $b_i$ represent the $y$-coordinates of the top and bottom of $B_i$ respectively, and variables $\ell_i$ and $r_i$ represent the $x$-coordinates of the left and right of $B_i$ respectively. For each $i = 1, \dots, n$, impose the linear constraints $t_i = b_i + h_i$ and $r_i = \ell_i + w_i$. For each directed edge $(B_i, B_j)$, impose the constraints $t_i = b_j$, $r_i > \ell_j$, and $r_j > \ell_i$. The last two constraints force $B_i$ and $B_j$ to share some $x$-range in which they can make vertical contact. Initialize $t_1 = 0$.

With these equations, variables $t_i$ and $b_i$ are completely determined since every box $B_i$ has a directed path to $B_1$. Furthermore, the values for $t_i$ and $b_i$ can be found using a depth-first search of $G$ starting from $B_1$.

The $x$-coordinates are not yet determined and depend on the horizontal order of the boxes, which can be established as follows. We scan the boxes from top to bottom, keeping track of the left-to-right order of boxes intersected by a horizontal line that sweeps from $y = 0$ downwards. Initially the line is at $y = 0$ and intersects only $B_1$. When the line reaches the bottom of a box $B$, we replace $B$ in the left-to-right order by all its predecessors in $G$, using the order given by the plane embedding. In case multiple boxes end at the same $y$-coordinate, we make the update for all of them. Whenever boxes $B_a$ and $B_b$ appear consecutively in the left-to-right order, we impose the constraint $r_a \le \ell_b$.
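The $y$-coordinate step of this proof (propagating $t_i = b_j$ downward from the sink) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `y_coordinates`, the dict-based input format, and the explicit consistency check for boxes reached along two paths are assumptions of this example.

```python
def y_coordinates(heights, edges, sink):
    """Propagate t_i = b_j along edges (i, j), starting with t_sink = 0.

    heights: dict box -> height h_i
    edges:   list of (i, j), meaning the top of box i touches the bottom of box j
    Returns a dict box -> (t_i, b_i); raises ValueError if two paths disagree.
    """
    below = {}
    for i, j in edges:
        below.setdefault(j, []).append(i)
    top = {sink: 0}
    stack = [sink]                      # iterative depth-first search
    while stack:
        j = stack.pop()
        bottom_j = top[j] - heights[j]  # b_j = t_j - h_j
        for i in below.get(j, []):
            if i not in top:
                top[i] = bottom_j       # t_i = b_j
                stack.append(i)
            elif top[i] != bottom_j:
                raise ValueError(f"box {i} has inconsistent heights")
    return {i: (top[i], top[i] - heights[i]) for i in top}


heights = {1: 2, 2: 1, 3: 1, 4: 3}
edges = [(2, 1), (3, 1), (4, 2)]
print(y_coordinates(heights, edges, sink=1))
```

If a box is reachable along two paths that imply different top coordinates, the equations $t_i = b_j$ have no solution, which the sketch reports as an error.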
The scan can be performed in $O(n \log n)$ time using a priority queue to determine which boxes in the current left-to-right order have maximum $b_i$ value. The resulting system of equations has size $O(n)$ (because the constraints correspond to edges of a planar graph). It is straightforward to verify that the system of equations has a solution if and only if there is a representation of the boxes that hierarchically realizes $G$. The constraints define a linear program (LP) and can be solved efficiently. (A feasible solution can be found faster than with an LP, but we omit the details in this paper.)

We can show that HIER-CROWN becomes weakly NP-complete if rectangles may be rotated, by a simple reduction from SUBSET SUM (details in Appendix A.2).

The MAX-CROWN problem

In this section, we study approximation algorithms for MAX-CROWN and consider an extremal variant of the problem.

We present approximation algorithms for MAX-CROWN restricted to certain graph classes. Our basic building blocks are an approximation algorithm for stars and an exact algorithm for cycles. Our general technique is to find a collection of disjoint stars or cycles in a graph. We begin with stars, using a reduction to the MAXIMUM GENERALIZED ASSIGNMENT PROBLEM (GAP) defined as follows: Given a set of bins with capacity constraints and a set of items that may have different sizes and values in each bin, pack a maximum-value subset of items into the bins. It is known that the problem is NP-hard (KNAPSACK and BIN PACKING are special cases of GAP), and there exists a $(1 - 1/e)$-approximation algorithm [FGMS11]. In the remainder, we assume that there is an $\alpha$-approximation algorithm for GAP, setting $\alpha = 1 - 1/e > 0.632$.

Theorem 4.
There exists an $\alpha$-approximation algorithm for MAX-CROWN on stars.

Proof. Let $B$ denote the box corresponding to the center of the star. In any optimal solution for the MAX-CROWN problem there are four boxes whose sides contain one corner of $B$ each. Given these four corner boxes, the problem reduces to assigning each remaining box $B_i$ to one of the four sides of $B$, where it makes contact for its whole length; see Fig. 4.

Figure 4: An optimal representation for the MAX-CROWN problem whose supporting graph is a star with center $B$. The striped boxes did not fit into the solution.

This is a special case of GAP: The bins are the four sides of $B$, the size of an item is its width for the horizontal bins and its height for the vertical bins, and the value of an item is the profit of its adjacency to the central box. We can now apply the algorithm for the GAP problem, which gives an $\alpha$-approximation for the set of boxes. To get an approximation for the MAX-CROWN problem, we consider all possible ways of choosing the four corner boxes, which increases the runtime only by a polynomial factor.

In the case where rectangles may be rotated by $90^\circ$, the MAX-CROWN problem on a star reduces to an easier problem, the MULTIPLE KNAPSACK PROBLEM, where every item has the same size and value no matter which bin it is placed in. This is because we will always attach a rectangle to the central rectangle of the star using its smaller dimension. There is a PTAS for MULTIPLE KNAPSACK [CK05]. Therefore, there is a PTAS for MAX-CROWN on stars if we may rotate rectangles.

A star forest is a disjoint union of stars. Theorem 4 applies to a star forest since we can combine the solutions for the disjoint stars.

Theorem 5. MAX-CROWN on the class of graphs that can be partitioned in polynomial time into $k$ star forests admits an $\alpha/k$-approximation algorithm.

Proof. The algorithm is to partition the edges of the supporting graph into $k$ star forests, apply the approximation algorithm of Theorem 4 to each star forest, and take the best of the $k$ solutions. This takes polynomial time. We claim this gives the desired approximation factor. Consider an optimum solution, and let $W_{\mathrm{opt}}$ be the total profit of edges that are realized as contacts.
By the pigeonhole principle, there is a star forest $F$ in the partition with realized profit at least $W_{\mathrm{opt}}/k$ in the optimum solution. Therefore our approximation achieves at least $\alpha W_{\mathrm{opt}}/k$ profit for $F$.

Corollary 1. MAX-CROWN admits
• an $\alpha/2$-approximation algorithm on trees,
• an $\alpha/5$-approximation algorithm on planar graphs.

Proof. It is easy to partition any tree into two star forests in linear time. Moreover, it is known that every planar graph has star arboricity at most 5, that is, it can be partitioned into at most 5 star forests, and such a partition can be found in polynomial time [HMS96]. The results now follow directly from Theorem 5.

Our star forest partition method is possibly not optimal. Nguyen et al. [NSH+08] show how to find a star forest of an arbitrary weighted graph carrying at least half of the profits of an optimal star forest in polynomial time. We cannot, however, guarantee that the approximation of the optimal star forest carries a positive fraction of the total profit in an optimal solution to MAX-CROWN. Hence, approximating MAX-CROWN for general graphs remains an open problem. As a first step in this direction, we present a constant-factor approximation for supporting graphs with bounded maximum degree. First we need the following lemma.

Lemma 1.
Given a sequence of $n \ge 3$ boxes, we can find a representation realizing the $n$-cycle in linear time.

Figure 5: Left: Realizing cycle $(v_1, \dots, v_9)$. Right: adjacencies with boxes in Theorem 7.

Proof. Let $C = (v_1, v_2, \dots, v_n)$ be a cycle. Let $W$ be the sum of all the widths, $W = \sum_i w_i$, and let $t$ be the maximum index such that $\sum_{i \le t} w_i < W/2$. We place $v_1, v_2, \dots, v_t$ side by side in order from left to right with their bottoms on a horizontal line $h$. We call this the "top channel". Starting from the same point on $h$ we place $v_n, v_{n-1}, \dots, v_{t+2}$ side by side in order from left to right with their tops on $h$. We call this the "bottom channel". Note that $v_1$ and $v_n$ are in contact. It remains to place $v_{t+1}$ in contact with $v_t$ and $v_{t+2}$. It is easy to show that the following works: add $v_{t+1}$ to the channel of minimum width, or in case of a tie, place $v_{t+1}$ straddling the line $h$.

Following the idea of Theorem 5, we can approximate MAX-CROWN by applying Lemma 1 to a partition of the supporting graph into sets of disjoint cycles.

Theorem 6. MAX-CROWN on the class of graphs that can be partitioned into $k$ sets of disjoint cycles (in polynomial time) admits a (polynomial-time) algorithm that achieves total profit at least $\frac{1}{k} \sum_{i \ne j} p_{ij}$. In particular, there is a $1/k$-approximation algorithm for MAX-CROWN on this graph class.
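The two-channel construction from the proof of Lemma 1 can be sketched in a few lines. The code below is an illustrative reading of that proof, not the authors' implementation: the names `realize_cycle`, `overlaps`, and `touches`, the `(x, y, w, h)` tuple layout, and the choice to center the straddling box on the line are assumptions of this example, and integer box dimensions are assumed so that exact arithmetic can be used.

```python
from fractions import Fraction


def overlaps(a, b):
    """True if the interiors of boxes a and b (as (x, y, w, h)) intersect."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def touches(a, b):
    """True if a and b share a boundary segment or point but do not overlap."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    closed_meet = (ax <= bx + bw and bx <= ax + aw and
                   ay <= by + bh and by <= ay + ah)
    return closed_meet and not overlaps(a, b)


def realize_cycle(sizes):
    """Place boxes given as (w, h) pairs in cycle order; return (x, y, w, h) tuples."""
    n = len(sizes)
    total = sum(w for w, _ in sizes)
    # t = number of leading boxes whose widths sum to less than W/2
    t, top_width = 0, 0
    while 2 * (top_width + sizes[t][0]) < total:
        top_width += sizes[t][0]
        t += 1
    pos = [None] * n
    x = 0
    for i in range(t):                 # top channel: bottoms on the line y = 0
        w, h = sizes[i]
        pos[i] = (x, 0, w, h)
        x += w
    x = 0
    for i in range(n - 1, t, -1):      # bottom channel: tops on the line y = 0
        w, h = sizes[i]
        pos[i] = (x, -h, w, h)
        x += w
    bottom_width = x
    w, h = sizes[t]                    # the remaining box v_{t+1}
    if top_width < bottom_width:
        pos[t] = (top_width, 0, w, h)
    elif top_width > bottom_width:
        pos[t] = (bottom_width, -h, w, h)
    else:                              # tie: straddle the line y = 0
        pos[t] = (top_width, -Fraction(h, 2), w, h)
    return pos


sizes = [(3, 1), (1, 2), (2, 2), (4, 1), (2, 3)]
placed = realize_cycle(sizes)
n = len(placed)
print(all(touches(placed[i], placed[(i + 1) % n]) for i in range(n)))  # True
```

The placement may create additional contacts beyond the cycle edges; this is harmless, since realizing a supporting graph only requires that its edges appear among the contacts.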
Corollary 2. MAX-CROWN on graphs of maximum degree $\Delta$ admits a $2/(\Delta + 1)$-approximation.

Proof. As Petersen [Pet91] shows, the edges of any graph of maximum degree $\Delta$ can be covered by $\lceil \Delta/2 \rceil$ sets of cycles, and such sets can be found in polynomial time. The result now follows from Theorem 6.

An Extremal MAX-CROWN Problem

In the following, we bound the maximum number of contacts that can be made when placing $n$ boxes. It is easy to see that for $n = 2$, any two boxes can be placed in contact. In case $n = 4$ the boxes can be arranged so that their corners meet at a point, thus realizing $2n - 2 = 6$ contacts. For larger $n$ we have:

Theorem 7. For $n \ge$ […] and any set of $n$ boxes, the boxes can be placed in the plane to realize $2n - 2$ contacts. For some sets of boxes this is the best possible.

Proof. Let $B_1, \dots, B_n$ be any set of boxes. We place the first 5 boxes to make 8 contacts, and place the remaining boxes to make 2 contacts each for a total of $8 + 2(n - 5) = 2n - 2$ contacts. Among the first 5 boxes, we identify the two boxes with largest height and the two boxes with largest width, and place the five boxes as in Fig. 5. Place the remaining boxes one by one as in the proof of Lemma 1 along the horizontal line between two of the placed boxes (see Fig. 5). Then each remaining box makes two new contacts.

Next we describe a set of $n$ boxes for which the maximum number of contacts is $2n - 2$. Let $B_i$ be a square box of side length $i$. Consider any placement of the boxes and partition the contacts into horizontal and vertical contacts. Here we assume that a point contact of two boxes is horizontal if the point is the south-west corner of the first box and the north-east corner of the second; otherwise, a point contact is vertical. From the side lengths of boxes, it follows that neither set of contacts contains a cycle. Thus each set of contacts has size at most $n - 1$ for a total of $2n - 2$.

The AREA-CROWN problem
The same supporting graph can often be realized by different contact representations, not all of which are equally useful or visually appealing when viewed as word clouds. In this section we consider the AREA-CROWN problem and show that finding a "compact" representation that fits into a small bounding box is another NP-hard problem.

The reduction is from the (strongly) NP-hard 2D STRIP PACKING problem [LMM02]: The input is a set $R$ of $n$ rectangles with width and height functions $w : R \to \mathbb{N}$ and $h : R \to \mathbb{N}$, and a strip of width $W$ and height $H$. All the input numbers are bounded by some polynomial in $n$. The task is to pack the given rectangles into the strip.

The STRIP PACKING problem is actually equivalent to AREA-CROWN when the supporting graph is an independent set. However, edges in the supporting graph impose additional constraints on the representation, which might make AREA-CROWN easier. The following theorem (proved in the appendix) shows that this is not the case.
Theorem 8. AREA-CROWN is NP-hard even on paths.
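The objective of AREA-CROWN, the area of a box containing all input boxes, is easy to evaluate for a given representation even though minimizing it is hard. The sketch below compares two layouts that both realize the same path; the function names `bounding_box` and `fits_in_strip` and the `(x, y, w, h)` tuple layout are assumptions made for this illustration.

```python
def bounding_box(boxes):
    """Return (width, height) of the smallest axis-aligned box containing all boxes."""
    left = min(x for x, _, _, _ in boxes)
    right = max(x + w for x, _, w, _ in boxes)
    bottom = min(y for _, y, _, _ in boxes)
    top = max(y + h for _, y, _, h in boxes)
    return right - left, top - bottom


def fits_in_strip(boxes, W, H):
    """True if the whole representation fits into a strip of width W and height H."""
    width, height = bounding_box(boxes)
    return width <= W and height <= H


# Two layouts of the path A - B - C with box sizes (4, 1), (1, 3), (4, 1).
in_a_row = [(0, 0, 4, 1), (4, 0, 1, 3), (5, 0, 4, 1)]
stacked = [(0, 0, 4, 1), (4, 0, 1, 3), (0, 1, 4, 1)]
for layout in (in_a_row, stacked):
    width, height = bounding_box(layout)
    print(width * height)  # 27 for the row layout, 15 for the stacked one
```

In the stacked layout both A and C touch the left side of B, so the path is still realized while the bounding box shrinks.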
We implemented our new methods for constructing word clouds: the STAR FOREST algorithm based on extracting star forests (Corollary 1), and the CYCLE COVER algorithm based on decomposing edges of a graph into cycle covers (Theorem 6). We compared the algorithms with the existing method from [VWF09] (referred to as RANDOM), the algorithm from [CWL+10] (referred to as CPDWCV), and the algorithm from [WPW+11] (referred to as SEAM CARVING). Our dataset is 120 Wikipedia documents, with 400 words or more. For the word clouds, we removed stop-words (e.g., "and", "the", "of"), and constructed two supporting graphs, each over a fixed number of the most frequent words. Implementation details are provided in the appendix.

We compare the percentage of realized profit in the representation of the supporting graphs. Since STAR FOREST handles planar supporting graphs, we first extract a maximal planar subgraph of the supporting graph, and then apply the algorithm on the subgraph. The percentage of realized profit is presented in the table. Our results indicate that, in terms of the realized profit, CYCLE COVER and STAR FOREST outperform existing approaches; see Fig. 8. In practice, CYCLE COVER realizes more than […] of the total profit of graphs with […] vertices. On the other hand, existing algorithms may perform better in terms of compactness, aspect ratio, and other aesthetic criteria; we leave a deeper comparison of word cloud algorithms as a future research direction.

Algorithm | Realized profit, first graph | Realized profit, second graph
RANDOM [VWF09] | ….4% | 2.…
CPDWCV [CWL+10] | ….2% | 8.…
SEAM CARVING [WPW+11] | ….4% | 5.…
STAR FOREST | ….4% | 8.…
CYCLE COVER | ….8% | 13.…

We formulated the Contact Representation of Word Networks (CROWN) problem, motivated by the desire to provide theoretical guarantees for semantics-preserving word cloud visualization. We described efficient algorithms for variants of CROWN, showed that some variants are NP-hard, and presented several approximation algorithms. A natural open problem is to find an approximation algorithm for general graphs with arbitrary profits.
Acknowledgments. Work on this problem began at Dagstuhl Seminar 12261. We thank the organizers, participants, Therese Biedl, Steve Chaplick, and Günter Rote.
References
[BGPV08] Adam L. Buchsbaum, Emden R. Gansner, Cecilia Magdalena Procopiuc, and Suresh Venkatasubramanian. Rectangular layouts and contact graphs. ACM Transactions on Algorithms, 4(1), 2008.
[CK05] C. Chekuri and S. Khanna. A polynomial time approximation scheme for the multiple knapsack problem. SIAM Journal on Computing, 35(3):713–728, 2005.
[CVW09] Christopher Collins, Fernanda B. Viégas, and Martin Wattenberg. Parallel tag clouds to explore and analyze faceted text corpora. In IEEE VAST, pages 91–98, 2009.
[CWL+10] W. Cui, Y. Wu, S. Liu, F. Wei, M. X. Zhou, and H. Qu. Context-preserving, dynamic word cloud visualization. Computer Graphics and Applications, 30:42–53, 2010.
[DMS05] Tim Dwyer, Kim Marriott, and Peter J. Stuckey. Fast node overlap removal. In Graph Drawing, volume 3843 of LNCS, pages 153–164, 2005.
[EMSV12] David Eppstein, Elena Mumford, Bettina Speckmann, and Kevin Verbeek. Area-universal and constrained rectangular layouts. SIAM Journal on Computing, 41(3):537–564, 2012.
[Fel13] Stefan Felsner. Rectangle and square representations of planar graphs. In Thirty Essays on Geometric Graph Theory, pages 213–248. Springer, 2013.
[FGMS11] Lisa Fleischer, Michel X. Goemans, Vahab S. Mirrokni, and Maxim Sviridenko. Tight approximation algorithms for maximum separable assignment problems. Math. Oper. Res., 36(3):416–431, 2011.
[Fus09] Éric Fusy. Transversal structures on triangulations: A combinatorial study and straight-line drawings. Discrete Mathematics, 309(7):1870–1894, 2009.
[GH10] Emden R. Gansner and Yifan Hu. Efficient, proximity-preserving node overlap removal. J. Graph Algorithms Appl., 14(1):53–74, 2010.
[GJ79] Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, USA, 1979.
[HMS96] S. L. Hakimi, J. Mitchem, and E. Schmeichel. Star arboricity of graphs. Discrete Mathematics, 149(1–3):93–98, 1996.
[KLKS10] Kyle Koh, Bongshin Lee, Bo Hyoung Kim, and Jinwook Seo. Maniwordle: Providing flexible control over Wordle. IEEE Trans. Vis. Comput. Graph., 16(6):1190–1197, 2010.
[LHKK96] Krista Lagus, Timo Honkela, Samuel Kaski, and Teuvo Kohonen. Self-organizing maps of document collections: A new approach to interactive exploration. In KDD, pages 238–243, 1996.
[LMM02] Andrea Lodi, Silvano Martello, and Michele Monaci. Two-dimensional packing problems: A survey. European Journal of Operational Research, 141(2):241–252, 2002.
[NB12] Arlind Nocaj and Ulrik Brandes. Organizing search results with a reference map. IEEE Transactions on Visualization and Computer Graphics, 18(12):2546–2555, 2012.
[NPR13] Martin Nöllenburg, Roman Prutkin, and Ignaz Rutter. Edge-weighted contact representations of planar graphs. In Graph Drawing, volume 7704 of LNCS, pages 224–235. Springer, 2013.
[NSH+08] C. Thach Nguyen, Jian Shen, Minmei Hou, Li Sheng, Webb Miller, and Louxin Zhang. Approximating the spanning star forest problem and its application to genomic sequence alignment. SIAM Journal on Computing, 38(3):946–962, 2008.
[Pet91] Julius Petersen. Die Theorie der regulären Graphen. Acta Mathematica, 15(1):193–220, 1891.
[Rai34] Erwin Raisz. The rectangular statistical cartogram. Geographical Review, 24(3):292–296, 1934.
[RT86] Pierre Rosenstiehl and Robert E. Tarjan. Rectilinear planar layouts and bipolar orientations of planar graphs. Discrete & Computational Geometry, 1(1):343–353, 1986.
[Tho86] Carsten Thomassen. Interval representations of planar graphs. Journal of Combinatorial Theory, Series B, 40(1):9–20, 1986.
[VWF09] Fernanda B. Viégas, Martin Wattenberg, and Jonathan Feinberg. Participatory visualization with Wordle. IEEE Trans. Vis. Comput. Graph., 15(6):1137–1144, 2009.
[WPW+11] Yingcai Wu, Thomas Provan, Furu Wei, Shixia Liu, and Kwan-Liu Ma. Semantic-preserving word clouds by seam carving. Computer Graphics Forum, 30(3):741–750, 2011.
Appendix
A The CROWN problem
Theorem 1 still holds in the case where rectangles may be rotated. The construction uses squares for the $a_i$'s. Rectangle $c$ has height 1/2 and all the other rectangles have both width and height greater than 1/2, so no rectangle can make contact along the sides of $c$ in between the $a_k$'s. Finally, all rectangles have height greater than width, so there is no advantage to rotating any rectangle.
A.1 The CROWN problem on irreducible triangulations
Theorem 2. CROWN on irreducible triangulations can be solved in linear time.
Proof. Let $G$ be the supporting graph, an irreducible triangulation. We consider $G$ embedded in the plane with outer face $\{v_N, v_E, v_S, v_W\}$. Note that this embedding is unique. By abuse of notation, we refer to a vertex and its corresponding box by the same letter.
We begin by placing a horizontal and a vertical ray emerging from the same point in positive $x$-direction and positive $y$-direction, respectively. For the first phase of the algorithm let us pretend that the horizontal ray is the box $v_S$ (imagine a rectangle with tiny height and huge width) and the vertical ray is the box $v_W$ (imagine a rectangle with tiny width and huge height), independent of what the actual boxes look like; see Fig. 3.
We build up a representation by adding one rectangle at a time. At every intermediate step the representation is rectilinear convex, that is, its intersection with any horizontal or vertical line is connected. In other words, the representation has no holes and a “staircase shape”. We maintain the set of all concavities, that is, points on the boundary of the representation that are bottom-right or top-left corners of some rectangle but not a top-right corner of any rectangle. Initially there is only one concavity, namely the point where the rays $v_W$ and $v_S$ meet.
Each concavity $p$ is a point on the boundary of two rectangles, say $u$ and $v$. Since $G$ has no separating triangles, there are exactly two vertices that are adjacent to both $u$ and $v$, or only one if $\{u, v\} = \{v_S, v_W\}$. For exactly one of these vertices, call it $w$, the rectangle is not yet placed because its bottom-left corner is supposed to be placed on the concavity $p$. We say that $w$ fits into the concavity $p$. We call a vertex $w$ applicable to an intermediate representation if it fits into some concavity and adding the rectangle $w$ gives a representation that is rectilinear convex. In the very beginning the unique common neighbor of $v_S$ and $v_W$ is applicable.
The algorithm proceeds in $n - 4$ steps, one per inner vertex, as follows.
At each step we identify an inner vertex $w$ of $G$ that is applicable to the current representation. We add the rectangle $w$ to the representation and update the set of concavities and applicable vertices. At most two points have to be added to the set of concavities, while one is removed from this set. The vertices that fit into the new concavities can easily be read off from the plane embedding of $G$. Checking whether these vertices are applicable is easy. If the top-left or bottom-right corner of $w$ does not define a concavity, then one has to check whether the vertices that fit into existing concavities to the left or below, respectively, are now applicable. So each step can be done in constant time.
If the algorithm has placed the last inner vertex, it suffices to check whether the representation without the two rays is a rectangle, that is, whether there are exactly two concavities left. If so, call this rectangle $R$; we check whether the width of $R$ is at most the width of $v_N$ and $v_S$ and whether the height of $R$ is at most the height of $v_E$ and $v_W$. If this holds true, we can easily place the rectangles $v_N, v_E, v_S, v_W$ to get a representation that realizes $G$. The total running time is linear.
On the other hand, if the algorithm stops because there is no applicable vertex, or the height/width conditions in the end phase are not met, then there is no representation that realizes $G$. This is due to the lack of choice in building the representation: if a vertex $v$ is applicable to a concavity $p$, then the bottom-left corner of $v$ has to be placed at $p$ in order to establish the contacts of $v$ with the two rectangles containing $p$.
A.2 The HIER-CROWN problem
We justify the claim that HIER-CROWN becomes weakly NP-complete if rectangles may be rotated. We use a reduction from SUBSET SUM. Given an instance of SUBSET SUM consisting of $n$ items $s_i$, $i = 1, \dots, n$, and a desired sum $S$, construct a large top square $T$ and a large bottom square $B$, and a square $M$ of side length $n + S$ that must lie between $B$ and $T$. Add a chain of rectangles $B_1, \dots, B_n$ from $B$ to $T$ where $B_i$ has dimensions $1 \times (1 + s_i)$. Rectangles $B_i$ that are oriented to have height $1 + s_i$ correspond to “chosen” elements for the SUBSET SUM problem. Via this correspondence, the SUBSET SUM instance has a solution if and only if the constructed HIER-CROWN instance has a solution.
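The correspondence between orientations and chosen items can be illustrated with a small brute-force sketch. It assumes that the chain must span exactly the side length $n + S$ of $M$, so that a rectangle laid flat contributes height 1 and an upright one contributes $1 + s_i$. The function names `chain_height` and `find_orientation` are hypothetical, and the exponential search is only meant to make the equivalence concrete, not to solve the problem efficiently.

```python
from itertools import product


def chain_height(sizes, chosen):
    """Height of the chain B_1..B_n for a given orientation.

    B_i has dimensions 1 x (1 + s_i): it contributes 1 + s_i when
    upright ("chosen") and 1 when laid flat.
    """
    return sum(1 + s if c else 1 for s, c in zip(sizes, chosen))


def find_orientation(sizes, target):
    """Return an orientation whose chain height equals the side of M.

    Assumption of this sketch: the chain has to match the side length
    n + S of the square M exactly. Returns None if no orientation works.
    """
    side_of_m = len(sizes) + target
    for chosen in product([False, True], repeat=len(sizes)):
        if chain_height(sizes, chosen) == side_of_m:
            return chosen
    return None


if __name__ == "__main__":
    # Items 3 and 5 sum to 8, so B_1 and B_2 are upright.
    print(find_orientation([3, 5, 7], 8))
    # No subset of {3, 5, 7} sums to 4.
    print(find_orientation([3, 5, 7], 4))
```

An orientation is returned exactly when some subset of the items sums to $S$, because the $n$ unit contributions cancel against the $n$ in the side length of $M$.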
B The AREA-CROWN problem
Theorem 8. AREA-CROWN is NP-hard even on paths.
Proof. We use a reduction from STRIP PACKING, so fix any instance $I$ of STRIP PACKING consisting of rectangles $r_1, \dots, r_n$ and two integers $H$ and $W$. Let $d = \epsilon / \max(W, H)$ for some $\epsilon \in (0, 1)$.
We define an instance of the AREA-CROWN problem by slightly increasing the heights and widths in $I$. The idea is to lay a unit square grid over the strip and blow each grid line up to have a thickness of $d$; see Fig. 6. Each rectangle in $I$ is stretched according to the number of grid lines it intersects.
More precisely, we define for $i = 1, \dots, n$ a rectangle $r'_i$ of width $w(r_i) + (w(r_i) - 1)d$ and height $h(r_i) + (h(r_i) - 1)d$. Further we define $W' = W + (W - 1)d$ and $H' = H + (H - 1)d$. Finally, we arrange the rectangles $r'_1, \dots, r'_n$ into a path $P$ by introducing between $r'_i$ and $r'_{i+1}$ ($i = 1, \dots, n - 1$), as well as before $r'_1$, $k$ small $x \times x$ squares, called connector squares. We choose $k$ and $x$ to satisfy
$$kx = 4(n + 3)(H + 2nW) \qquad (1)$$
and
$$n(kx^2 + 2x) = d. \qquad (2)$$
In particular, we choose
$$x = \frac{d}{2n(2Hn + 6H + 4n^2W + 12nW + 1)} \quad \text{and} \quad k = \frac{4(n + 3)(H + 2nW)}{x}.$$
We claim that there is a representation realizing $P$ within the $W' \times H'$ bounding box if and only if the original rectangles $r_1, \dots, r_n$ can be packed into the original $W \times H$ bounding box.
First consider any representation realizing $P$ within the $W' \times H'$ bounding box and remove all connector squares from it. Since $W' < W + \epsilon < W + 1$ and $H' < H + \epsilon < H + 1$, the stretched bounding box has the same number of grid lines as the original. Hence the rectangles $r'_1, \dots, r'_n$ can be replaced by the corresponding rectangles $r_1, \dots, r_n$ and perturbed slightly such that every corner lies on a grid point. This way we obtain a solution for the original instance of STRIP PACKING.
Now consider any solution for the STRIP PACKING instance, that is, any packing of the rectangles $r_1, \dots, r_n$ within the $W \times H$ bounding box. We will construct a representation realizing the path $P$ within the $W' \times H'$ bounding box. We start by blowing up the grid lines of the $W \times H$ bounding box to thickness $d$ each, which also affects all rectangles intersected by a grid line in their interior. This way we obtain a placement of bigger rectangles $r'_1, \dots, r'_n$ in the bigger $W' \times H'$ bounding box, such that every rectangle $r'_i$ intersects the interiors of exactly those blown-up grid lines corresponding to the grid lines that intersect $r_i$ interiorly. Thus any two rectangles $r'_i$ and $r'_j$ are separated by a vertical or horizontal corridor of thickness at least $d$. We will refer to the grid lines of thickness $d$ as gaps.
Figure 6: Grid before and after stretching.
Figure 7: Illustrations for Theorem 8. (a) NP-hardness of AREA-CROWN for paths; (b) folding connector rectangles inside a gap; (c) connectors before rerouting; (d) connectors after rerouting.
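The parameter choice in the proof of Theorem 8 can be checked with exact rational arithmetic. The following sketch assumes that equation (2) has the form $n(kx^2 + 2x) = d$, which is the form that agrees with the closed-form choice of $x$. The function names `stretch_parameters` and `stretch`, the default $\epsilon = 1/2$, and the sample instance are hypothetical.

```python
from fractions import Fraction


def stretch_parameters(n, W, H, eps=Fraction(1, 2)):
    """Return (d, x, k) for n rectangles in a W x H strip.

    Assumes equation (2) reads n * (k * x**2 + 2 * x) == d.
    """
    d = eps / max(W, H)
    length = 4 * (n + 3) * (H + 2 * n * W)  # right-hand side of (1)
    # Substituting k * x = length into (2) gives n * x * (length + 2) = d,
    # i.e. x = d / (2n(2Hn + 6H + 4n^2 W + 12nW + 1)).
    x = d / (n * (length + 2))
    k = length / x
    return d, x, k


def stretch(size, d):
    """Stretched side length: a side of integer length `size` contains
    size - 1 grid lines in its interior, each widened to thickness d."""
    return size + (size - 1) * d


if __name__ == "__main__":
    n, W, H = 3, 4, 5
    d, x, k = stretch_parameters(n, W, H)
    assert k * x == 4 * (n + 3) * (H + 2 * n * W)  # equation (1)
    assert n * (k * x * x + 2 * x) == d            # equation (2)
    print("d =", d, " x =", x, " k =", k)
    print("W' =", stretch(W, d), " H' =", stretch(H, d))
```

Using `Fraction` avoids rounding, so both equations hold exactly for the sample instance rather than only up to floating-point error.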
It remains to place all the connector squares so as to realize the path $P$. The idea is the following. We start in the lower-left corner of the bounding box, and lay out connector squares horizontally to the right inside the bottommost horizontal gap until we reach the vertical gap that contains the lower-left corner of $r'_1$. We then start laying out the connector squares inside this vertical gap upwards, until we reach the lower-left corner of $r'_1$. Whenever a rectangle $r'_i$ overlaps with this vertical gap, we go around $r'_i$; see Fig. 7d. This way we lay out at most $(3W' + H')/x$ connector squares, which by (1) is less than $k$. The remaining connector squares are “folded up” inside the vertical gap; see Fig. 7b.
Next we lay out the connector squares between $r'_1$ and $r'_2$. We start where we ended before, that is, at the lower-left corner of $r'_1$, and go along the path we took before until we reach the bottommost gap. Then we lay connector squares along the outermost gaps in counterclockwise direction, that is, first horizontally to the rightmost gap, then up to the topmost gap, left to the leftmost gap, and down to the bottommost gap. Now we do the same for $r'_2$ as we did for $r'_1$. If while going right we “hit” the connector squares going up to $r'_1$, we follow them up, go around $r'_1$, and go down again. This is possible since there are gaps all around $r'_1$; see Fig. 7a. Note that the red line of connectors actually sits on the dashed, expanded grid lines but is drawn next to them for better readability. We repeat the process for all the rectangles.
We have to show two things: first, that the number of connector squares between $r'_i$ and $r'_{i+1}$ is large enough so that the length of the string of connectors is sufficient; second, that the gaps have sufficient space so that we can fold up the connectors in them.
The first condition is taken care of by equation (1). We divide the path of the connectors into up to $n + 3$ parts. The first part $p^{\mathrm{down}}_i$ goes down from $r'_i$ to the bottom gap. The second part $p^{\mathrm{circle}}$ goes around the bounding box in counterclockwise order to the vertical gap containing the lower-left corner of $r'_{i+1}$. This part is interrupted by up to $n$ parts $p^{\mathrm{avoid}}_k$ where we hit a string of connectors going up to another rectangle $r'_k$ and have to follow it, go around $r'_k$, and come down again. The last part $p^{\mathrm{up}}_{i+1}$ goes up from the bottom gap to the position of $r'_{i+1}$. We will now show that each of these parts has a maximum length of $4(H' + 2nW')$.
The parts $p^{\mathrm{up}}_{i+1}$ and $p^{\mathrm{down}}_i$ have to span the height $H'$ at most once, and may encounter all other rectangles $r'_k$ at most once. Going around any such $r'_k$ means at most traversing its width twice, which is at most $2W'$. Hence each of $p^{\mathrm{up}}_{i+1}$ and $p^{\mathrm{down}}_i$ has a total length of at most $H' + 2nW' < 4(H' + 2nW')$. Since every $p^{\mathrm{avoid}}_k$ exactly follows $p^{\mathrm{up}}_k$, then surrounds $r'_k$ (which has maximum width $W'$ and maximum height $H'$), and then follows $p^{\mathrm{down}}_k$, it has a maximum length of $2(W' + H') + 2(H' + 2nW') \le 4(H' + 2nW')$. Finally, $p^{\mathrm{circle}}$ has a maximum length of $2H' + 2W' \le 4(H' + 2nW')$.
Thus, the total length of the path of connectors, comprised of $n + 3$ parts of length at most $4(H' + 2nW')$ each, is at most $4(n + 3)(H' + 2nW')$. Equation (1) ensures that our string of connectors has sufficient length.
The second condition is covered by equation (2). Consider Fig. 7b. If a string of connectors just passes through a gap, it takes up exactly $1 \times x$ space. If it folds $m$ connector rectangles inside the gap, it takes $mx \times x$ plus the “wasted” space (the red shaded space in Fig. 7b). The wasted space can be at most $2 \times x$, and since every string of connectors has $k$ connector rectangles, the space taken up by those can be at most $kx^2$; thus every string of connectors can take at most $kx^2 + 2x$ space in any given gap. Since there are $n$ such strings of connectors and every gap has dimensions $1 \times d$, equation (2) ensures that the space in every gap is sufficient.
We showed that we can find a layout of the path that corresponds to the optimum packing of the rectangles, if such a packing exists within the desired bounding box. Thus, finding the most space-efficient layout for a path of rectangles is NP-hard.
C Experimental Results
Here we provide some details regarding the implementation of the algorithms.
Before the algorithms are applied, the text is preprocessed using the following workflow: the text is split into sentences, and the sentences are split into words using Apache OpenNLP. We then remove stop words, perform stemming on the words, and group the words with the same stem. The similarity of words is computed using Latent Semantic Analysis based on the co-occurrence of the words within the same sentence.
In the implementation of STAR FOREST, we use the $(\frac{\beta}{\beta + 1} - \epsilon)$-approximation of Fleischer et al. [FGMS11] combined with an FPTAS for KNAPSACK to approximate the stars. In the implementation of CPDWCV, we achieved the best results in our experiments with parameters $K_r = 1000$ and $K_a = 25$. Some results are given in Fig. 8.
The experiments have been run on an Intel i5 3.2GHz with 8GB RAM. The RANDOM, STAR FOREST, and CYCLE COVER algorithms finish in under a second. The CPDWCV and SEAM CARVING algorithms are based on a force-directed model, and compute the word cloud within several seconds.
Figure 8: Word clouds generated for Obama’s 2013 State of the Union Speech by various algorithms, including (a) Random Layout and (b) Seam Carving [WPW+11]. The panels (a) to (e) are annotated with the percentage of the total realized profit.
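The preprocessing workflow described at the start of this appendix can be sketched in a few lines. This is only an illustration under several assumptions: it uses regular expressions instead of Apache OpenNLP, a tiny hypothetical stop-word list, a crude suffix-stripping stand-in for a real stemmer, and plain cosine similarity of sentence co-occurrence vectors instead of Latent Semantic Analysis (which would additionally reduce the term-sentence matrix by a truncated singular value decomposition). All names are hypothetical.

```python
import math
import re
from collections import defaultdict

# Tiny illustrative stop-word list; a real list would be much larger.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are", "we"}


def stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def sentence_occurrences(text):
    """Map each stem to the set of sentence indices in which it occurs."""
    occurrences = defaultdict(set)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    for index, sentence in enumerate(sentences):
        for token in re.findall(r"[a-z]+", sentence.lower()):
            if token not in STOP_WORDS:
                occurrences[stem(token)].add(index)
    return occurrences


def similarity(occurrences, u, v):
    """Cosine similarity of the binary sentence-occurrence vectors of u and v."""
    a, b = occurrences.get(u, set()), occurrences.get(v, set())
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


if __name__ == "__main__":
    text = "Graphs model words. Words touch in clouds. Clouds are drawn."
    occ = sentence_occurrences(text)
    print(similarity(occ, "word", "cloud"))
    print(similarity(occ, "graph", "drawn"))
```

The resulting pairwise similarities play the role of the edge weights (profits) of the input graph; words that never share a sentence get similarity zero.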