Zubair Nabi | Researchain

Publication

Featured research published by Zubair Nabi.

Pro Spark Streaming | 2016

Real-Time Route 66: Linking External Data Sources

Zubair Nabi

The time budget for streaming data can be on a millisecond scale. Regardless of latency requirements, the first step is invariably transporting the data to a processing platform while perhaps traversing the entire Internet. Any pipelined architecture can only be as fast as its slowest link. For this reason, even before the data has landed in the data center, the choice of the transport solution—even though technically it is not part of your application—can substantially affect performance. With this in mind, this chapter is dedicated to ingesting data from solutions such as Kafka, Flume, and MQTT. In the process, you also write your own connector for HTTP to learn the ropes of connecting to external data sources.

Pro Spark Streaming | 2016

The Art of Side Effects

Zubair Nabi

Spark Streaming applications by design are stateless and side-effect free: running the same application an infinite number of times results in the same behavior and output. Similar to functional programming, this simplifies debugging and reasoning about the state of a program, because input and output paths are deterministic. Although side-effect-free applications have many advantages, in distributed systems side effects cannot be completely avoided, especially when interfacing with external systems. For this reason, Spark Streaming provides a primitive called foreachRDD, which is the Swiss Army Knife of side effects for micro-batch processing. This chapter introduces design patterns for enabling side effects in Spark Streaming applications.
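One widely used pattern for such side effects is to open one connection to the external system per partition of each micro-batch, rather than one per record. The sketch below illustrates only that idea and is not taken from the chapter. It uses no Spark API: plain Python lists stand in for the partitions of a single micro-batch, and `FakeConnection`, `write_partition`, and `foreach_batch` are hypothetical names introduced for illustration.

```python
from typing import Iterable, List


class FakeConnection:
    """Hypothetical stand-in for a client of an external system (e.g. a database)."""

    opened = 0  # class-level count of connections created so far

    def __init__(self) -> None:
        FakeConnection.opened += 1
        self.sent: List[str] = []
        self.closed = False

    def send(self, record: str) -> None:
        self.sent.append(record)

    def close(self) -> None:
        self.closed = True


def write_partition(records: Iterable[str]) -> FakeConnection:
    """Open one connection, push every record of the partition, then close it."""
    conn = FakeConnection()
    try:
        for record in records:
            conn.send(record)
    finally:
        conn.close()
    return conn


def foreach_batch(partitions: List[List[str]]) -> List[FakeConnection]:
    """Apply the per-partition writer to every partition of one micro-batch."""
    return [write_partition(partition) for partition in partitions]


# One micro-batch made of three partitions (assumed example data).
batch = [["a", "b"], ["c"], ["d", "e", "f"]]
connections = foreach_batch(batch)
print(FakeConnection.opened)  # 3: one connection per partition, not one per record
```

In an actual Spark Streaming job, a function shaped like `write_partition` would typically be handed to `foreachPartition` on each RDD inside `foreachRDD`, so that the connection is created on the worker that holds the data.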

Archive | 2016

High-Velocity Streams: Parallelism and Other Stories

Zubair Nabi

To engender applications that scale with the input and are production ready, you must have the right tools and the knowledge to use those tools, as well as insight into the inner wirings of the target system. Having gotten a taste for real-time data processing the Spark Streaming way, you are now ready to take a deep dive into its internals.

Archive | 2016

Introduction to Spark

Zubair Nabi

The first version of Spark was open sourced in 2010, and it went into Apache incubation in 2013. By early 2014, it was promoted to a top-level Apache project. It has already replaced Hadoop as the Big Data processing engine of choice in most organizations. This is a testament to its maturity and the richness of its design. Batch processing, iterative and interactive computation, stream processing, graph analytics, ETL, machine learning, and data warehousing; you name it and Spark can already handle it. This chapter is a hands-on primer to Spark to set the stage for the rest of the book.

Archive | 2016

Of Clouds, Lambdas, and Pythons

Zubair Nabi

In the real world, deployments of Big Data systems fit into a larger ecosystem that consists of managed cloud instances, cluster managers, data stores and warehouses, and so on. Cloud deployments enable organizations to pass the buck of DevOps to the cloud service provider and thus free them to focus on application development and business operations. Along with economies of scale, this provides on-demand horizontal elasticity and simplified scheduling. Similar to other systems, Spark can work in the cloud with fully managed instances provided by a number of companies including Google (Dataproc), Databricks, and IBM (Bluemix). No book about Spark would be complete without a discussion of running it in the cloud. Other topics covered in this chapter include the Spark Python API, the lambda architecture, and graph processing.

Archive | 2016

Machine Learning at Scale

Zubair Nabi

Data by itself is a static, lifeless entity. You need analytics to breathe life into it and make it talk or even sing. The most sophisticated and popular class of such analytics revolves around nowcasting, forecasting, and recommendations, more generally known as machine learning and data mining. Machine-learning algorithms learn patterns in data and can then be used to make predictions, whereas data mining helps extract structure from unstructured data. Machine learning at scale is the key to practical predictions and recommendations, which are essential to drive the needs of consumers: commercial, academic, or scientific. This chapter uses MLlib to enable such applications.

Archive | 2016

Real-Time ETL and Analytics Magic

Zubair Nabi

Data (big or otherwise) has been woven into the fabric of most businesses. The world is at a stage where Big Data directly drives corporate strategy. To maintain a competitive edge, most businesses try to run their analytics pipeline in near real-time. Although this captures the behavior of a large class of applications that rely on unstructured data, it is not exhaustive: a significant chunk of data sources are structured, and their analysis applications require data-warehousing capabilities. One way to handle these requirements is to blend the existing Spark API with an external warehousing solution such as Hive, but this is a marriage of convenience rather than a natural fit: data must be copied back and forth, not to mention the burden of maintaining two different APIs. A better solution is Spark SQL.

Archive | 2016

The Hitchhiker’s Guide to Big Data

Zubair Nabi

By the time you get to the end of this paragraph, you will have processed 1,700 bytes of data. This number will grow to 500,000 bytes by the end of this book. Taking that as the average size of a book and multiplying it by the total number of books in the world (according to a Google estimate, there were 130 million books in the world in 2010) gives 65 TB. That is a staggering amount of data that would require 130 standard, off-the-shelf 500 GB hard drives to store. Now imagine you are a book publisher and you want to translate all of these books into multiple languages (for simplicity, let’s assume all these books are in English). You would like to translate each line as soon as it is written by the author; that is, you want to perform the translation in real time using a stream of lines rather than waiting for the book to be finished. The average number of characters or bytes per line is 80 (this also includes spaces). Let’s assume the author of each book can churn out 4 lines per minute (320 bytes per minute), and all the authors are writing concurrently and nonstop. Across the entire 130 million-book corpus, the figure is 41,600,000,000 bytes, or 41.6 GB per minute. This is well beyond the processing capabilities of a single machine and requires a multi-node cluster. Atop this cluster, you also need a real-time data-processing framework to run your translation application. Enter Spark Streaming. Appropriately, this book will teach you to architect and implement applications that can process data at scale and at line rate.
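The back-of-the-envelope figures above can be checked with a few lines of Python. This is only a sketch of the arithmetic, using the numbers quoted in the paragraph and assuming decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes); the variable names are illustrative.

```python
# Figures quoted in the paragraph; decimal units are an assumption.
BYTES_PER_BOOK = 500_000
BOOKS = 130_000_000
DRIVE_BYTES = 500 * 10**9   # one 500 GB hard drive
BYTES_PER_LINE = 80
LINES_PER_MINUTE = 4

# Size of the whole corpus and the number of drives needed to hold it.
corpus_bytes = BYTES_PER_BOOK * BOOKS
drives_needed = corpus_bytes // DRIVE_BYTES

# Streaming rate if every author writes concurrently and nonstop.
bytes_per_author_minute = BYTES_PER_LINE * LINES_PER_MINUTE
stream_bytes_per_minute = bytes_per_author_minute * BOOKS

print(corpus_bytes / 10**12)            # 65.0 (TB)
print(drives_needed)                    # 130
print(bytes_per_author_minute)          # 320
print(stream_bytes_per_minute / 10**9)  # 41.6 (GB per minute)
```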

Archive | 2016

Getting Ready for Prime Time

Zubair Nabi

Application development is an incremental and continuous process: once an application has been designed, implemented, and deployed, it needs to be constantly monitored and improved. The same applies to real-time pipelines, with additional variables: scalability and capacity. There may be an increase in the volume and velocity of the incoming data or lower latency requirements. Over time, as requirements change, initial design choices need to be reevaluated. Developers and infrastructure engineers clamor to squeeze the last bit of performance out of both the software stack and the hardware. Regardless of the cause and effect, all such projects require rigorous and generous instrumentation—from logging and monitoring to alerting and metrics.

Archive | 2016

DStreams: Real-Time RDDs

Zubair Nabi

According to IBM, 60% of all sensory information loses value in a few milliseconds if it is not acted on. Bearing in mind that the Big Data and analytics market has reached
