Phitchaya Mangpo Phothilimthana

Publication

Featured research published by Phitchaya Mangpo Phothilimthana.

architectural support for programming languages and operating systems | 2013

Portable performance on heterogeneous architectures

Phitchaya Mangpo Phothilimthana; Jason Ansel; Jonathan Ragan-Kelley; Saman P. Amarasinghe

Trends in both consumer and high performance computing are bringing not only more cores, but also increased heterogeneity among the computational resources within a single machine. In many machines, one of the greatest computational resources is now their graphics coprocessors (GPUs), not just their primary CPUs. But GPU programming and memory models differ dramatically from conventional CPUs, and the relative performance characteristics of the different processors vary widely between machines. Different processors within a system often perform best with different algorithms and memory usage patterns, and achieving the best overall performance may require mapping portions of programs across all types of resources in the machine. To address the problem of efficiently programming machines with increasingly heterogeneous computational resources, we propose a programming model in which the best mapping of programs to processors and memories is determined empirically. Programs define choices in how their individual algorithms may work, and the compiler generates further choices in how they can map to CPU and GPU processors and memory systems. These choices are given to an empirical autotuning framework that allows the space of possible implementations to be searched at installation time. The rich choice space allows the autotuner to construct poly-algorithms that combine many different algorithmic techniques, using both the CPU and the GPU, to obtain better performance than any one technique alone. Experimental results show that algorithmic changes, and the varied use of both CPUs and GPUs, are necessary to obtain up to a 16.5x speedup over using a single program configuration for all architectures.
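
As a rough illustration of the empirical autotuning idea in this abstract, the sketch below times a small poly-algorithm (merge sort that switches to insertion sort below a cutoff) for several cutoff values and keeps the fastest. It is not the paper's framework: the names `poly_sort` and `autotune`, the candidate cutoffs, and the CPU-only setting are all assumptions made for illustration.

```python
import random
import time

def insertion_sort(xs):
    """Simple O(n^2) sort; tends to be fast on very short lists."""
    out = []
    for x in xs:
        i = len(out)
        while i > 0 and out[i - 1] > x:
            i -= 1
        out.insert(i, x)
    return out

def merge(left, right):
    """Merge two sorted lists into one sorted list."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out

def poly_sort(xs, cutoff):
    """Merge sort that falls back to insertion sort at or below `cutoff`."""
    if len(xs) <= max(cutoff, 1):
        return insertion_sort(xs)
    mid = len(xs) // 2
    return merge(poly_sort(xs[:mid], cutoff), poly_sort(xs[mid:], cutoff))

def autotune(cutoffs, sample, repeats=3):
    """Time each cutoff on `sample` and return (best_cutoff, timings)."""
    timings = {}
    for cutoff in cutoffs:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            poly_sort(sample, cutoff)
            best = min(best, time.perf_counter() - start)
        timings[cutoff] = best
    return min(timings, key=timings.get), timings

# Hypothetical tuning run on one machine; the winner may differ elsewhere.
rng = random.Random(0)
sample = [rng.randrange(10**6) for _ in range(2000)]
best, timings = autotune([1, 8, 32, 128], sample)
```

The selected cutoff depends on the machine the search runs on, which is the point of tuning at installation time rather than fixing one configuration for all architectures.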

programming language design and implementation | 2014

Chlorophyll: synthesis-aided compiler for low-power spatial architectures

Phitchaya Mangpo Phothilimthana; Tikhon Jelvis; Rohin Shah; Nishant Totla; Sarah Chasins; Rastislav Bodik

We developed Chlorophyll, a synthesis-aided programming model and compiler for the GreenArrays GA144, an extremely minimalist low-power spatial architecture that requires partitioning the program into fragments of no more than 256 instructions and 64 words of data. This processor is 100 times more energy efficient than its competitors, but currently can only be programmed using a low-level stack-based language. The Chlorophyll programming model allows programmers to provide human insight by specifying partial partitioning of data and computation. The Chlorophyll compiler relies on synthesis, sidestepping the need to develop classical optimizations, which may be challenging given the unusual architecture. To scale synthesis to real problems, we decompose the compilation into smaller synthesis subproblems: partitioning, layout, and code generation. We show that the synthesized programs are no more than 65% slower than highly optimized expert-written programs and are faster than programs produced by a heuristic, non-synthesizing version of our compiler.

architectural support for programming languages and operating systems | 2016

Scaling up Superoptimization

Phitchaya Mangpo Phothilimthana; Aditya V. Thakur; Rastislav Bodik; Dinakar Dhurjati

Developing a code optimizer is challenging, especially for new, idiosyncratic ISAs. Superoptimization can, in principle, discover machine-specific optimizations automatically by searching the space of all instruction sequences. If we can increase the size of code fragments a superoptimizer can optimize, we will be able to discover more optimizations. We develop LENS, a search algorithm that increases the size of code a superoptimizer can synthesize by rapidly pruning away invalid candidate programs. Pruning is achieved by selectively refining the abstraction under which candidates are considered equivalent, only in the promising part of the candidate space. LENS also uses a bidirectional search strategy to prune the candidate space from both forward and backward directions. These pruning strategies allow LENS to solve twice as many benchmarks as existing enumerative search algorithms, while LENS is about 11 times faster. Additionally, we increase the effective size of the superoptimized fragments by relaxing the correctness condition using contexts (surrounding code). Finally, we combine LENS with complementary search techniques into a cooperative superoptimizer, which exploits the stochastic search to make random jumps in a large candidate space, and a symbolic (SAT-solver-based) search to synthesize arbitrary constants. While existing superoptimizers consistently solve 9 to 16 out of 32 benchmarks, the cooperative superoptimizer solves 29 benchmarks. It can synthesize code fragments that are up to 82% faster than code generated by gcc -O3 from WiBench and MiBench.
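
To make the enumerative search described above concrete, here is a minimal sketch of a plain enumerative superoptimizer over a made-up 8-bit, single-register instruction set. It is not LENS: there is no abstraction refinement or bidirectional search, and an exhaustive check over all 256 inputs stands in for a solver-based equivalence check. The instruction names in `OPS` and the functions `run` and `superoptimize` are hypothetical.

```python
from itertools import product

MASK = 0xFF  # toy 8-bit machine (assumption for illustration)

# Each instruction maps (register value r, original input x) to a new r.
OPS = {
    "neg": lambda r, x: (-r) & MASK,
    "not": lambda r, x: (~r) & MASK,
    "inc": lambda r, x: (r + 1) & MASK,
    "shl": lambda r, x: (r << 1) & MASK,
    "addx": lambda r, x: (r + x) & MASK,
}

def run(program, x):
    """Execute a sequence of instruction names with the register starting at x."""
    r = x & MASK
    for op in program:
        r = OPS[op](r, x)
    return r

def superoptimize(spec, max_len=4, tests=(0, 1, 2, 77, 255)):
    """Return a shortest program matching `spec` on all 8-bit inputs, or None."""
    for length in range(max_len + 1):
        for program in product(OPS, repeat=length):
            # Cheap pruning: reject candidates that fail any test input.
            if any(run(program, t) != (spec(t) & MASK) for t in tests):
                continue
            # Full verification over the whole (tiny) input domain.
            if all(run(program, x) == (spec(x) & MASK) for x in range(256)):
                return list(program)
    return None

# Search for a shortest instruction sequence computing 3 * x (mod 256).
times_three = superoptimize(lambda x: 3 * x)
```

Because candidates are enumerated in order of increasing length, the first verified program is a shortest one for this toy instruction set. The candidate count grows exponentially with length, which is the scaling problem that motivates pruning strategies like those in the abstract.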

international conference on image processing | 2013

Communication-minimizing 2D convolution in GPU registers

Forrest N. Iandola; David Sheffield; Michael J. Anderson; Phitchaya Mangpo Phothilimthana; Kurt Keutzer

2D image convolution is ubiquitous in image processing and computer vision problems such as feature extraction. Exploiting parallelism is a common strategy for accelerating convolution. Parallel processors keep getting faster, but algorithms such as image convolution remain memory-bound on parallel processors such as GPUs. Therefore, reducing memory communication is fundamental to accelerating image convolution. To reduce memory communication, we reorganize the convolution algorithm to prefetch image regions into registers, and we do more work per thread with fewer threads. To enable portability to future architectures, we implement a convolution autotuner that sweeps the design space of memory layouts and loop unrolling configurations. We focus on convolution with small filters (2×2 to 7×7), but our techniques can be extended to larger filter sizes. Depending on filter size, our speedups on two NVIDIA architectures range from 1.2× to 4.5× over state-of-the-art GPU libraries.

compiler construction | 2016

GreenThumb: superoptimizer construction framework

Phitchaya Mangpo Phothilimthana; Aditya V. Thakur; Rastislav Bodik; Dinakar Dhurjati

Developing an optimizing compiler backend remains a laborious process, especially for nontraditional ISAs that have been appearing recently. Superoptimization sidesteps the need for many code transformations by searching for the optimal instruction sequence semantically equivalent to the original code fragment. Even though superoptimization discovers the best machine-specific code optimizations, it has yet to become widely used. We propose GreenThumb, an extensible framework that reduces the cost of constructing superoptimizers and provides a fast search algorithm that can be reused for any ISA, exploiting the unique strengths of enumerative, stochastic, and symbolic (SAT-solver-based) search algorithms. To extend GreenThumb to a new ISA, it is only necessary to implement an emulator for the ISA and provide some ISA-specific search utility functions.

journal of experimental algorithmics | 2016

Short and Simple Cycle Separators in Planar Graphs

Eli Fox-Epstein; Shay Mozes; Phitchaya Mangpo Phothilimthana; Christian Sommer

We provide an implementation of an algorithm that, given a triangulated planar graph with m edges, returns a simple cycle that is a 3/4-balanced separator consisting of at most √(8m) edges. An efficient construction of a short and balanced separator that forms a simple cycle is essential in numerous planar graph algorithms, for example, for computing shortest paths, minimum cuts, or maximum flows. To the best of our knowledge, this is the first implementation of such a cycle separator algorithm with a worst-case guarantee on the cycle length. We evaluate the performance of our algorithm and compare it to the planar separator algorithms recently studied by Holzer et al. [2009]. Out of these algorithms, only the Fundamental Cycle Separator (FCS) produces a simple cycle separator. However, FCS does not provide a worst-case size guarantee. We demonstrate that (1) our algorithm is competitive across all test cases in terms of running time, balance, and cycle length; (2) it provides worst-case guarantees on the cycle length, significantly outperforming FCS on some instances; and (3) it scales to large graphs.
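
The two guarantees stated in this abstract can be checked with simple arithmetic. The sketch below is only a checker for a separator that has already been computed, not the separator algorithm itself. The function names are hypothetical, and measuring balance as "neither side holds more than 3/4 of a given total weight" is an assumption about how the weights are defined.

```python
import math

def max_cycle_length(m):
    """Largest integer cycle length allowed by the sqrt(8m) bound."""
    return math.isqrt(8 * m)

def satisfies_length_bound(cycle_len, m):
    """True if cycle_len <= sqrt(8m), tested in exact integer arithmetic."""
    return cycle_len * cycle_len <= 8 * m

def is_balanced(inside, outside, total, ratio=(3, 4)):
    """True if neither side of the cycle exceeds ratio * total weight."""
    num, den = ratio
    return den * max(inside, outside) <= num * total

# Worked example: m = 200 edges gives 8m = 1600, so the cycle may use
# at most sqrt(1600) = 40 edges.
bound = max_cycle_length(200)
```

For instance, with a total weight of 100, a split of 70 inside and 25 outside passes the 3/4 balance test, while 80 inside and 15 outside does not.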

2nd Summit on Advances in Programming Languages (SNAPL 2017) | 2017

Domain-Specific Symbolic Compilation

Rastislav Bodik; Kartik Chandra; Phitchaya Mangpo Phothilimthana; Nathaniel Yazdani

A symbolic compiler translates a program to symbolic constraints, automatically reducing model checking and synthesis to constraint solving. We show that new applications of constraint solving require domain-specific encodings that yield the required orders of magnitude improvements in solver efficiency. Unfortunately, these encodings cannot be obtained with today's symbolic compilation. We introduce symbolic languages that encapsulate domain-specific encodings under abstractions that behave as their non-symbolic counterparts: client code using the abstractions can be tested and debugged on concrete inputs. When client code is symbolically compiled, the resulting constraints use domain-specific encodings. We demonstrate the idea on the first fully symbolic checker of type systems; a program partitioner; and a parallelizer of tree computations. In each of these case studies, symbolic languages improved on classical symbolic compilers by orders of magnitude.

computer aided verification | 2017

Data-Driven Synthesis of Full Probabilistic Programs

Sarah Chasins; Phitchaya Mangpo Phothilimthana

Probabilistic programming languages (PPLs) provide users a clean syntax for concisely representing probabilistic processes and easy access to sophisticated built-in inference algorithms. Unfortunately, writing a PPL program by hand can be difficult for non-experts, requiring extensive knowledge of statistics and deep insights into the data. To make the modeling process easier, we have created a tool that synthesizes PPL programs from relational datasets. Our synthesizer leverages the input data to generate a program sketch, then applies simulated annealing to complete the sketch. We introduce a data-guided approach to the program mutation stage of simulated annealing; this innovation allows our tool to scale to synthesizing complete probabilistic programs from scratch. We find that our synthesizer produces accurate programs from 10,000-row datasets in 21 s on average.
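
The sketch-completion step can be pictured as a generic simulated annealing loop. The example below is a minimal sketch under strong assumptions: the "sketch" is a fixed model y ≈ a·x + b with two integer holes, the cost is squared error, and mutations are uniformly random rather than data-guided as in the paper. The names `cost`, `mutate`, and `anneal` are hypothetical.

```python
import math
import random

def cost(params, data):
    """Squared error of the model y = a*x + b on the dataset."""
    a, b = params
    return sum((y - (a * x + b)) ** 2 for x, y in data)

def mutate(params, rng):
    """Randomly nudge one of the two holes by +1 or -1."""
    a, b = params
    if rng.random() < 0.5:
        a += rng.choice((-1, 1))
    else:
        b += rng.choice((-1, 1))
    return (a, b)

def anneal(data, start=(0, 0), steps=2000, t0=10.0, seed=0):
    """Simulated annealing with a linear cooling schedule; returns (best, best_cost)."""
    rng = random.Random(seed)
    current, current_cost = start, cost(start, data)
    best, best_cost = current, current_cost
    for step in range(steps):
        temperature = t0 * (1 - step / steps) + 1e-9
        candidate = mutate(current, rng)
        candidate_cost = cost(candidate, data)
        delta = candidate_cost - current_cost
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature drops.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
    return best, best_cost

# Synthetic rows generated from y = 3x - 2 (illustrative data only).
rows = [(x, 3 * x - 2) for x in range(-5, 6)]
best_params, best_cost = anneal(rows)
```

A data-guided mutation stage, as the abstract describes, would replace the uniform choice in `mutate` with proposals informed by the dataset.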

international world wide web conferences | 2015

Dicer: A Framework for Controlled, Large-Scale Web Experiments

Sarah Chasins; Phitchaya Mangpo Phothilimthana

As dynamic, complex, and non-deterministic webpages proliferate, running controlled web experiments on live webpages is becoming increasingly difficult. To compare algorithms that take webpages as inputs, an experimenter must worry about ever-changing webpages, and also about scalability. Because webpage contents are constantly changing, experimenters must intervene to hold webpages constant, in order to guarantee a fair comparison between algorithms. Because webpages are increasingly customized and diverse, experimenters must test web algorithms over thousands of webpages, and thus need to implement their experiments efficiently. Unfortunately, no existing testing frameworks have been designed for this type of experiment. We introduce Dicer, a framework for running large-scale controlled experiments on live webpages. Dicer's programming model allows experimenters to easily 1) control when to enforce a same-page guarantee and 2) parallelize test execution. The same-page guarantee ensures that all loads of a given URL produce the same response. The framework utilizes a specialized caching proxy server to enforce this guarantee. We evaluate the tool on a dataset of 1,000 real webpages, and find it upholds the same-page guarantee with little overhead.
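
The same-page guarantee can be pictured as a cache that records the first response for each URL and replays it on every later load. The sketch below is an in-process stand-in, not Dicer's caching proxy: the class `SamePageCache`, its `get` method, and the fake fetcher are hypothetical, and no real network access is performed.

```python
import itertools

class SamePageCache:
    """Replay the first response seen for each URL on all later loads."""

    def __init__(self, fetch):
        self._fetch = fetch      # callable: url -> response body
        self._responses = {}     # url -> first response recorded

    def get(self, url):
        if url not in self._responses:
            self._responses[url] = self._fetch(url)
        return self._responses[url]

# Fake non-deterministic "web": every fetch returns different content.
_counter = itertools.count()

def changing_fetch(url):
    return f"{url} version {next(_counter)}"

cache = SamePageCache(changing_fetch)
first = cache.get("http://example.com/")
second = cache.get("http://example.com/")
other = cache.get("http://example.com/other")
```

Here `first` and `second` are identical even though the underlying fetcher would have returned new content, so two algorithms under comparison see the same page.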

EDM (Workshops) | 2014