Monir Hossain, Alireza Goli, Omid Ardakanian* and Hamzeh Khazaei
Department of Electrical and Computer Engineering
*Department of Computing Science
University of Alberta
Gunjan Kaur and Dmitriy Volinskiy
In this paper we describe the design of a collaborative real-time machine learning system and discuss preliminary results and experience gained from two redundancy-themed design studies conducted in the context of a financial institution. A rationale is provided for the chosen approaches, alternatives are discussed, and directions for future system development are outlined.
Redundancy in the software and information domain used to be mostly about a system’s fault tolerance. The increasing availability and affordability of cloud computing has changed many design paradigms and led to the proliferation of software systems running on cloud providers’ “everything-as-a-service” platforms. Such systems can no longer be vertically architected and centrally managed. On one hand, this is nothing short of a dependability bliss as, arguably, design diversity becomes both deeply intrinsic and multi-faceted. On the other hand, the typically loose coupling of system components creates new challenges: the state the system is in, the total number of its states, as well as many other metrics turn from a quantity uncertain into a quantity unknown. Most importantly, the very meaning of certain key redundancy notions may change, as the system is now a loose agglomeration of modules built in-house, cloud services, APIs, distributed storage, etc.
At the center of this short paper is the design of a real-time machine learning system which we dub “collaborative”. Similar to how a social network enables its participants to interact in multiple planes, components of our system listen and/or publish to a variety of topics on a high-throughput pub/sub messaging bus. This leads to a de facto complete de-coupling, as any given component has no inherent knowledge of the existence, kind, or state of any other component except for the messaging bus. There is no coordination, nor any direct information flow, between any set of components. A key feature this seemingly primordial architecture yields is the ease with which an author can connect a component of arbitrary design to the system, and the way the system naturally containerizes its connected software. The system will thus grow, optimize itself, and develop more functionality not according to a centrally provided blueprint, but as a result of the collaboration of multiple contributing authors, hence the “collaborative” part of the name. As multiple authors are highly likely to supply multiple equivalent solutions, redundancy comes to the forefront.
The paper also covers two mini-case studies: one related to software redundancy, the second discussing certain aspects of information redundancy. On the software redundancy front, we consider the case of two databases, of very different kinds, storing and retrieving identical information from the messaging bus. Not only do we comment on the setup needed to achieve this, given that the information requestor knows nothing about the existence, nature, and query syntax of the databases, we also consider how one can piggyback off this redundancy to handle data requests intelligently given the requestor’s preference for either low latency or data consistency. The information redundancy that the second study deals with arises because our real-time system has no facility to synchronize or dispatch data flows in a particular way. Uncurated data get released into the system the moment the information becomes available, which may lead to it appearing with delays or in bursts. This may wreak havoc on machine learning models deployed at the orchestration level of the system: the models use time series data summarized over various time windows and have no way of telling an artifact due to a delay or burst from a meaningful change in the data-generating process. A technique we consider to remedy this employs streams of artificial, predicted data which are regularized and blended with the real data when the handler detects an irregularity in the respective data stream.
And, concluding the paper, we offer some musings on the nature of cheese on a cheeseburger, the efficiency of collaborative content creation, and other non-technical aspects which, curiously, are quite important given the nature of our endeavor.
The present section outlines the general design of the collaborative real-time machine learning system (to be referred to as CMLS).
A. System Layers
Conceptually, CMLS is based on the two-layer architectureshown in Fig. 1.
B. De-coupling of data transformation and storage
In the preceding paragraphs we gave a rather non-specific, cursory overview of the system architecture. To understand the two case studies that follow, one needs to take a closer look at parts of the proof-of-concept (PoC) available at the moment of writing. The PoC implements a significant part of the system; all development and integration work was done in Google Cloud Platform (GCP) using solely GCP’s managed services.
Fig. 2 illustrates a part of the PoC build, showing how a derived feature gets calculated and stored. The role of the messaging bus is played by Google’s managed service, Cloud Pub/Sub, a scalable event ingestion and delivery system. By providing many-to-many, asynchronous messaging that decouples senders and receivers, it allows for highly available communication among independently written applications. Cloud Dataflow is a managed service for executing a wide variety of data processing patterns; we use it to create streaming data processing pipelines. Pipelines are built using the Apache Beam software development kit (SDK) and then run on the Cloud Dataflow service.
To further illustrate the process, suppose a certain stochastic process generates and publishes to Pub/Sub events x_it pertaining to an observational unit i in discrete time indexed by t. The investigator desires to obtain hourly counts of the events, Σ_{1hr} x_it. There are two Apache Beam pipelines: one gets x_it from the topic where it is published (“raw data”), applies the respective transforms, and publishes the result to another topic, “derived data”. The other pipeline listens on the derived data topic, ingests the counts as they arrive, and channels them to external storage.
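The two-pipeline arrangement can be sketched in plain Python. The in-memory bus, topic names, and event format below are illustrative stand-ins for Cloud Pub/Sub and the actual Beam pipelines, not the PoC’s code:

```python
from collections import defaultdict

class Bus:
    """Minimal pub/sub bus: each topic maps to a list of subscriber callbacks."""
    def __init__(self):
        self.subs = defaultdict(list)

    def subscribe(self, topic, fn):
        self.subs[topic].append(fn)

    def publish(self, topic, msg):
        for fn in self.subs[topic]:
            fn(msg)

bus = Bus()
counts = defaultdict(int)   # (unit i, hour bucket of t) -> running hourly count
store = []                  # stand-in for the external storage

def transform(event):
    """Pipeline 1: raw event x_it -> hourly count, published to 'derived'."""
    key = (event["unit"], event["ts"] // 3600)   # hour bucket of the timestamp
    counts[key] += 1
    bus.publish("derived", {"key": key, "count": counts[key]})

def sink(record):
    """Pipeline 2: listens on 'derived' and channels records to storage."""
    store.append(record)

bus.subscribe("raw", transform)
bus.subscribe("derived", sink)

# Two events in hour 0, one in hour 1, for unit "i1".
for ts in (10, 70, 3700):
    bus.publish("raw", {"unit": "i1", "ts": ts})
```

Note that neither callback knows about the other; they are coupled only through the topic names, which is precisely the de-coupling the text describes.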
This may strike the reader as completely sub-optimal and redundant. Indeed, a pipeline is supposed to encapsulate the entire data processing task, from start to finish, and the two stages, the transform and the output to storage, can easily be made parts of a single pipeline. However, the redundancy suffered by the system to enable this de-coupling has its raison d’être, and the next section presents two case studies addressing different aspects of redundancy in CMLS.
As we have alluded to in the previous sections, this mini-case study addresses software redundancy in the system. Most commonly, redundancy is a purposeful part of the system design; it may be beneficial to deliberately add redundancy to increase the system’s fault tolerance. An example is N-version programming, which diversifies the design process to produce redundant functionality [1] and is itself a special case of the more general design diversity approach [2]. An alternative is to let redundancy develop naturally in the system and use it as needed. For example, a multitude of non-replicated components can be connected and reconnected in a multitude of ways, thus allowing for a Lego-like redundancy to provide automatic workarounds in self-healing systems; see [3], [4].
To facilitate exposition, let us trivialize the problem and introduce only two competing data storage solutions, albeit of very different kinds. We can consider the solutions to be coming from two different authors, as they have little in common, if anything at all, in terms of design and implementation.
A practical rationale for introducing such a pair of storage solutions is the availability-versus-consistency dilemma. By the “CAP theorem”, it is conjectured to be impossible for a distributed data store to simultaneously provide more than two of the following three guarantees: (a) consistency, (b) availability, and (c) partition tolerance [5].
Fig. 3 zooms in on the part of the system PoC with the two connected databases. We add Pub/Sub topics for data queries (Q) and data query results (A), and introduce a dummy module that creates data requests for items from a certain kind of transactional time series data. Requests for different data elements also come with an urgency indicator; we use only two levels: zero for non-urgent requests and unity for urgent ones. One can consider urgent requests as those prioritizing availability, while non-urgent requests require consistency. No information about the kinds of the databases is available to the data requesting module and, conversely, the handlers can only see data requests as published in the topic.
Data requests can now be fulfilled in one of three ways.
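One plausible routing policy, assuming the urgency flag alone decides which store answers, can be sketched as follows. The handler names, store contents, and dispatch rule are hypothetical, not the PoC’s actual policy:

```python
# Each database handler subscribes to the query topic Q and decides
# independently whether to answer; neither handler knows the other exists,
# mirroring the de-coupling described in the text.

fast_db = {"balance": 98}         # low-latency store, may lag behind writes
consistent_db = {"balance": 100}  # strongly consistent store, slower

answers = []                      # stand-in for the answer topic A

def fast_handler(request):
    # Serves urgent requests (urgency == 1): availability over consistency.
    if request["urgency"] == 1:
        answers.append(("fast", fast_db[request["item"]]))

def consistent_handler(request):
    # Serves non-urgent requests (urgency == 0): consistency preferred.
    if request["urgency"] == 0:
        answers.append(("consistent", consistent_db[request["item"]]))

def publish_query(request):
    # Every request on topic Q is delivered to every subscribed handler.
    for handler in (fast_handler, consistent_handler):
        handler(request)

publish_query({"item": "balance", "urgency": 1})  # urgent -> fast store
publish_query({"item": "balance", "urgency": 0})  # non-urgent -> consistent store
```

The stale value returned for the urgent request is exactly the trade-off the urgency indicator encodes: the requestor accepted lower consistency in exchange for lower latency.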
Perhaps counter-intuitively, the previous section dealt with software redundancy in CMLS using the example of redundant data storage. While early studies considered information or data redundancy an extension or a development of the software redundancy concept (e.g., see [6]), the former is currently viewed as a separate paradigm. Information redundancy includes the use of information with data and the use of additional forms of data to assist in fault tolerance [7].
The canonical form of the alternate input generation process proceeds as follows. An input, x, to the Program (we will be using Model instead, as more relevant in the CMLS context) is re-expressed into a different form r = R(x), which is then supplied to the Model. The Model’s output, M(r), is adjusted by applying some A(M(r)) to compensate for distortions introduced at the re-expression stage. While the above considerations are still quite relevant in our study, the real-time nature of the system introduces some unique challenges, which lead to a significant departure from the canon.
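As a toy numeric illustration of the canonical scheme, take R to scale the input, M to square it, and A to undo the resulting distortion; these particular choices of R, M, and A are ours, for illustration only:

```python
def R(x):
    """Re-expression: supply the Model with a scaled version of the input."""
    return 2 * x

def M(r):
    """The Model under protection (a stand-in function): square its input."""
    return r * r

def A(y):
    """Adjustment: since M(R(x)) = M(2x) = 4x^2, dividing by 4 recovers M(x)."""
    return y / 4

x = 3.0
# The adjusted re-expressed run reproduces the original run: A(M(R(x))) == M(x).
```

In the data diversity setting, R produces a logically equivalent but physically different input, so a fault triggered by x may be avoided on R(x) while A keeps the final answer comparable.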
Technology-wise, Cloud Dataflow (its programming model open-sourced as Apache Beam) is a unified model for defining both batch and streaming data-parallel processing pipelines, which is well suited for most of the data transformation in this exercise. The goal is to provide an easy-to-use but powerful model for data-parallel processing, both streaming and batch, portable across a variety of run-time platforms. Dataflow is designed with infinite data sets in mind and, as such, can deal with both bounded (batch) and unbounded (streaming) data in a reliable manner, as opposed to the Lambda Architecture [8]. The challenge, and a quagmire at times, is the need to build, provision, and maintain two independent versions of the pipeline, and then also somehow merge the results from the two pipelines at the end.
Figs. 5 and 6 show two out of, admittedly, many options to implement data stream curation.
The linear design in Fig. 5 is straightforward and comes directly from our earlier discussion of what the alternative stream generating process would look like in CMLS. Note that the simulation, that is, the generation of the predicted data stream, happens in Cloud Datalab. Cloud Datalab is an interactive tool created to explore, analyze, transform, and visualize data and build machine learning models on Google Cloud Platform. It is not, however, a production tool, which makes it the system’s weakest link. The design in Fig. 6 is more robust. The simulation moves to Google Compute Engine, Google’s general computing infrastructure; Dataflow still handles basic transforms, and Stackdriver is added to perform certain dispatch functions. This adds more modules to the system, but it also relieves the simulation and transformation modules from the necessity of constantly monitoring all data streams.
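The curation idea, substituting predicted values when a window of real data fails to arrive on time, might be sketched as follows. The naive persistence predictor and the missing-window irregularity test are placeholder assumptions, not the deployed models:

```python
def predict(window, history):
    """Placeholder predictor: naive persistence of the last curated value."""
    return history[-1] if history else 0.0

def curate(real, n_windows):
    """Blend real and predicted data across n_windows time windows.

    `real` maps a window index to its observed summary value; a window
    absent from `real` is treated as an irregularity (delay or burst)
    and is filled with a predicted value instead.
    """
    curated, history = [], []
    for w in range(n_windows):
        if w in real:                  # regular arrival: pass real data through
            value = real[w]
        else:                          # delay/burst artifact: blend in prediction
            value = predict(w, history)
        history.append(value)
        curated.append(value)
    return curated

# Window 2 arrived late and is missing from the real stream.
real = {0: 10.0, 1: 12.0, 3: 11.0}
curated = curate(real, 4)
```

The downstream models then see a gap-free series, so a missing or bursty window no longer masquerades as a change in the data-generating process.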
So, would the esteemed patron want cheese on their cheeseburger? It is always tempting to attempt to design a system optimally and efficiently. We all wield our Occam’s razor, even subconsciously, looking for simple, minimalist designs. Collaborative content creation, that is, attempts to create online content through the collective action of many autonomous individuals with little coordination, no common affiliation, and no explicit incentives to collaborate, should be futile, as aptly noted in [9]. So, the reader probably figures, it only makes sense to skip the talk and serve the cheeseburger to the customer straight away, without asking them about any added cheese. Or does it?
A collaborative process is by its nature inefficient, as it involves multiple parties. A lot of time and development effort is bound to be spent in discussions, negotiation, and the much-dreaded “back and forth”. This essentially harks back to the so-called Waldo’s argument [10]: Waldo proposed the existence of a negative, linear relation between efficiency and democracy, rooted in the impossibility of reconciling efficiency from a managerial perspective with the concept of democracy through public engagement and dialog. Yet the success of such seemingly wasteful collaborative creation ecosystems as Wikipedia and GitHub may provide a wee testimonial to the contrary.
This collaborative real-time machine learning system is being built as we write; the techniques we are discovering and the observations we are making will require a substantial level of maturity before forming a solid corpus of knowledge, which we anticipate will evolve rapidly. Nonetheless, we believe the present paper will be instrumental to practitioners in the field, hopefully shining some new light on familiar concepts as they relate to redundancy in software systems.
[1] A. Avizienis, “The N-version approach to fault-tolerant software,” IEEE Transactions on Software Engineering, no. 12, pp. 1491–1501, 1985.
[2] J. P. J. Kelly, T. I. McVittie, and W. I. Yamamoto, “Implementing design diversity to achieve fault tolerance,” IEEE Software, vol. 8, no. 4, pp. 61–71, 1991.
[3] A. Carzaniga, A. Gorla, and M. Pezzè, “Self-healing by means of automatic workarounds,” in Proceedings of the 2008 International Workshop on Software Engineering for Adaptive and Self-Managing Systems. ACM, 2008, pp. 17–24.
[4] A. Carzaniga, A. Gorla, N. Perino, and M. Pezzè, “RAW: runtime automatic workarounds,” in Software Engineering, 2010 ACM/IEEE 32nd International Conference on, vol. 2. IEEE, 2010, pp. 321–322.
[5] S. Gilbert and N. Lynch, “Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services,” ACM SIGACT News, vol. 33, no. 2, pp. 51–59, 2002.
[6] P. E. Ammann and J. C. Knight, “Data diversity: an approach to software fault tolerance,” IEEE Transactions on Computers, no. 4, pp. 418–425, 1988.
[7] L. L. Pullum, Software Fault Tolerance Techniques and Implementation. Artech House, 2001, ch. Data Diversity (2.3).
[8] E. Friedman and K. Tzoumas, Introduction to Apache Flink: Stream Processing for Real Time and Beyond. O’Reilly Media, Inc., 2016.
[9] C. Wagner and P. Prasarnphanich, “Innovating collaborative content creation: the role of altruism and wiki technology,” in System Sciences, 2007. HICSS 2007. 40th Annual Hawaii International Conference on. IEEE, 2007, pp. 18–18.
[10] D. Waldo, The Administrative State: A Study of the Political Theory of American Public Administration. Routledge, 2017.