Runlet: A Cross-platform IoT Tool for Interactive Job Execution Over Heterogeneous Devices with Reliable Message Delivery

IoT combines different hardware and software components in a mixed environment, and interoperability and reliability are key issues in managing such devices. Interactive job execution is another important concept for management in different scenarios. The literature lacks a tool with these characteristics. This work fills the gap in the state of the art by introducing a tool that achieves interactive job execution over a network of heterogeneous devices with reliable message delivery. The tool leverages the Advanced Message Queuing Protocol (AMQP) and the message broker RabbitMQ. AMQP is an open standard Machine-to-Machine (M2M) publish/subscribe messaging protocol optimized for high-latency and unreliable networks that enables client applications to communicate with conforming messaging middleware brokers. RabbitMQ is an open-source message broker that supports various messaging protocols. The architecture of Runlet is discussed in detail, including the reasoning behind architectural decisions. The evaluation is conducted through an experimental approach that assesses interactivity and reliability on a testbed of devices composed of single-board ARM computers and laptop devices. The experimental results show that the application offers interactivity under different scenarios and provides reliable message delivery after node and


INTRODUCTION
The heterogeneous and dynamic nature of the Internet of Things (IoT) creates challenges that go beyond the traditional computer-based network model. These challenges are commonly related to the unpredictable mixture of devices with individual capabilities that may pose a barrier to achieving interoperability and managing devices in the context of IoT (Elkhodr et al., 2016).
The IEEE describes interoperability as "the ability of two or more systems or components to exchange information and to use the information that has been exchanged" (Geraci et al., 1991). Achieving interoperability is indispensable for devices across different networks and a difficult task given the IoT's competitive nature and rapidly evolving wireless technologies. This commonly results in integration issues where heterogeneous devices cannot communicate with one another (Elkhodr et al., 2016).
In addition, IoT solutions need to handle unique management scenarios that go beyond the traditional capabilities of remote control, monitoring, and maintenance of devices. For example, a system that supports remote monitoring of tasks across a fleet of heterogeneous devices may benefit greatly from a tool that offers the ability to monitor and troubleshoot errors in real time.
This work fills the gap in the state of the art by presenting a tool that achieves interactive job execution over a network of heterogeneous devices with reliable message delivery. The tool, named Runlet, grants the user the capability of interacting with the shell from any connected device during job execution, which is particularly useful in a scenario that requires user input for successful script execution. A few potential use areas include Continuous Integration (CI), Continuous Delivery (CD), and remote monitoring and management of devices.
Runlet is a cross-platform application that runs across many architectures and operating systems. It uses both the protocol AMQP and the broker RabbitMQ for reliable message delivery. The protocol AMQP is an open standard M2M publish/subscribe messaging protocol optimized for high-latency and unreliable networks that enables client applications to communicate with middleware brokers. RabbitMQ is an open-source lightweight message broker that supports various messaging protocols and can be deployed on-premises and in the cloud.
An experimental evaluation assesses interactivity and reliability on a testbed of heterogeneous devices ranging from single-board ARM computers to laptop devices. The results show that Runlet is interactive under different use cases and that message delivery is reliable after node and server failures. This paper is organized as follows. Section 2 evaluates the related work. Section 3 describes the application and the reasoning behind architectural decisions. Section 4 describes the experimental evaluation used to evaluate interactivity and reliability. Finally, Section 5 presents our conclusions.

RELATED WORK
There are several studies that evaluate messaging protocols and distributed message brokers. However, to the best of our knowledge, there is no other academic study pursuing the goal of performing interactive job execution across a fleet of network devices with reliable message delivery.
A selection of academic studies is presented next, ranging from monitoring systems to surveys that evaluate messaging protocols and message brokers. However, none of the studies mentions interactive job execution.
Krishna and Sasikala (2019) introduce a healthcare monitoring system that monitors patients' vitals in real time and sends the information to a web server for persistence and decision-making. Sensors are connected to a Raspberry Pi that acts as a gateway and sends the data to a server using AMQP through the board's Wi-Fi.
Liang and Chen (2018) introduce the design of a real-time data acquisition and monitoring system for medical data on smart campuses. The design includes both AMQP and RabbitMQ for data exchange and the HL7 protocol to facilitate information management via standard-compliant messages. The HL7 standard applies to the exchange of medical records between different medical systems, which is also the main difference between this design and that of Krishna and Sasikala (2019), given that both introduce a similar concept for healthcare monitoring.
Kostromina et al. (2018) present a concept for a resilient meteorological monitoring system using AMQP. The system design considers the need for stable and resilient delivery of data without losses, even in case of network outages, and includes a predictive monitoring system that monitors memory usage on single-board computers to prevent system crashes caused by lack of available memory. RabbitMQ is the selected message broker since it implements AMQP and has features like federation, clustering, and persistence.
Apart from the present study, Krishna and Sasikala (2019) is the only one that presents screen captures of the prototype. None of the studies other than Runlet addresses interactive job execution, which is further motivation for conducting this study.

RUNLET
This study addresses the lack of a tool for interactive job execution by introducing an approach with reliable message delivery across heterogeneous devices. The queuing protocol and the message broker are two key elements discussed in this section, which also details the architecture and the reasoning behind technology choices.
Runlet is a cross-platform application that features a daemon and a desktop manager. The desktop manager includes both the daemon and a full-featured GUI that provides an easy-to-use interface for managing jobs across a fleet of connected devices. The Advanced RISC Machine (ARM) distribution supports ARMv6, ARMv7, and ARM64.
Next, we present the reasoning and choices for the queuing protocol and the message broker. We then present Runlet's architecture.

The Queuing Protocol
The CoAP and XMPP protocols are discarded due to some of their characteristics described in the previous section. XMPP does not provide any Quality of Service (QoS) options (Dhas and Jeyanthi, 2019), which makes the protocol unsuitable for Runlet, since ensuring reliable message delivery is a crucial aspect of the application. CoAP seems like a great option from a resource usage and network bandwidth standpoint. However, it is discarded due to its lack of support for the publish/subscribe model (Karagiannis et al., 2015). The publish/subscribe model is a must-have for reaching optimal performance, as job logs are updated continuously.
MQTT is an excellent option and very similar to AMQP in many aspects. Both have brokered architectures, support the publish/subscribe model, use TCP for transport and TLS/SSL for security, and offer QoS levels for reliability, with MQTT having the advantage of a smaller header size (Dhas and Jeyanthi, 2019). However, the extra features provided by AMQP made the trade-off advantageous despite the increased overhead and latency. AMQP has the advantage of flexible message routing, reliable message queues with fine-grained control over bindings, and multiple types of exchanges (AMQP, 2020). After a detailed analysis of the different protocols, we selected AMQP.

The Message Broker
RabbitMQ is the selected broker for the following reasons (RabbitMQ, 2020):
• Web management UI for short-term metric collection and monitoring, which offers many administrative features convenient for development.
• Built-in support for using long-term metric visualization tools such as Prometheus and Grafana in production.
• Concise documentation and tutorials.
• Feature flag subsystem for upgradability.
ActiveMQ Artemis also comes with a web management console, supports metric tools, and has well-organized documentation and an active community (ActiveMQ, 2020). Nonetheless, the decision ultimately came down to a personal preference for RabbitMQ's management tool and its built-in support for Prometheus and Grafana in production. Qpid Broker-J and Qpid C++ Broker, on the other hand, fall short on development tools and lack support for metrics (Qpid, 2020). The Qpid community is also not as active and engaged as the RabbitMQ and ActiveMQ communities.

Architecture
This section details the conceptual architecture composed of the following components: GUI, daemon, server, database, and cloud storage. Fig. 1 shows the conceptual architecture and communication flow between components.

GUI
The Graphical User Interface (GUI) is a cross-platform application that runs on all major operating systems such as Mac, Linux, and Windows. It provides a friendly interface for the management and monitoring of devices and applies reactive programming concepts to observe, compute, and react to changes.
The GUI is built with JavaScript, HTML, and CSS on top of well-adopted open-source libraries such as Electron, React, TypeScript, MobX, Blueprint, Xterm.js, and Jest. Electron is a library developed by GitHub for building cross-platform desktop applications by combining Chromium and Node.js into a single runtime environment system (Electron, 2020). Chromium is used for rendering pages, while the Node.js API is used for lower-level system actions such as managing process events and interacting with the file system.

Daemon
The daemon, written in Golang, is responsible for job orchestration and has an internal queue that executes jobs in the received order, 'First In, First Out' (FIFO), as well as a parallelism controller that helps to prevent CPU throttling by limiting the number of jobs that run in parallel. It runs on machine startup as a background process capable of recovering from crashes through a process restart.
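The combination of a FIFO queue and a parallelism limit can be illustrated with a minimal sketch. It is written in Python for brevity rather than in the daemon's Go, and the class name JobQueue, the max_parallel parameter, and the job names are hypothetical; it shows the scheduling idea only and is not Runlet's implementation.

```python
import queue
import threading
import time


class JobQueue:
    """Dispatches jobs in FIFO order with at most `max_parallel` running at once."""

    def __init__(self, max_parallel):
        self._jobs = queue.Queue()                      # FIFO by construction
        self._slots = threading.Semaphore(max_parallel)
        self._lock = threading.Lock()
        self._running = 0
        self.started = []                               # job names in dispatch order
        self.peak = 0                                   # highest observed concurrency

    def submit(self, name, fn):
        self._jobs.put((name, fn))

    def _run(self, fn):
        try:
            fn()
        finally:
            with self._lock:
                self._running -= 1
            self._slots.release()                       # free the slot for the next job

    def drain(self):
        """Dispatch every queued job and wait for all of them to finish."""
        workers = []
        while not self._jobs.empty():
            name, fn = self._jobs.get()
            self._slots.acquire()                       # block until a slot is free
            with self._lock:
                self._running += 1
                self.peak = max(self.peak, self._running)
                self.started.append(name)
            worker = threading.Thread(target=self._run, args=(fn,))
            worker.start()
            workers.append(worker)
        for worker in workers:
            worker.join()


jobs = JobQueue(max_parallel=2)
for i in range(5):
    jobs.submit(f"job-{i}", lambda: time.sleep(0.01))
jobs.drain()
print(jobs.started, jobs.peak)
```

Jobs are dispatched strictly in submission order, while the semaphore keeps the number of concurrently running jobs at or below the configured limit.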
Each daemon communicates with the server directly. It acts both as a producer and a consumer simultaneously, meaning that jobs are triggered and executed from any active node in the network. It supports the execution of selected methods via Command-Line Interface (CLI) by appending the term runlet to method names, which is useful for script automation and remote access on devices without a graphical interface.
The daemon is also responsible for End-to-End Encryption (E2EE) of logs, local disk persistence, and data dispatch for cloud storage. Logs are first encrypted and persisted on the local disk for quick access and then submitted to the cloud storage server. Logs are only retrieved from the cloud when the local copy is outdated, according to the following rules:
1. If the log does not exist locally: a local copy is created from the remote copy and used for reading operations.
2. If the log exists locally: the hash of the local copy is sent to the server and compared with the hash present in the metadata of the remote copy. The remote copy replaces the local copy if the hashes differ and the remote copy has a more recent modification date. Otherwise, the local copy is used for reading operations, avoiding unnecessary updates.
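The two retrieval rules can be sketched as a single decision function. This is an illustration under stated assumptions: the names LogCopy and resolve_log are hypothetical, SHA-256 is assumed as the hash (the text does not name one), and the comparison is done in one place here, whereas in Runlet the local hash is sent to the server and compared there.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class LogCopy:
    content: bytes
    modified: datetime

    @property
    def digest(self) -> str:
        # Assumption: SHA-256 stands in for whatever hash the metadata stores.
        return hashlib.sha256(self.content).hexdigest()


def resolve_log(local: Optional[LogCopy], remote: LogCopy) -> LogCopy:
    """Return the copy that should serve read operations."""
    if local is None:
        return remote                                   # rule 1: no local copy yet
    if local.digest != remote.digest and remote.modified > local.modified:
        return remote                                   # rule 2: remote differs and is newer
    return local                                        # otherwise keep the local copy


old = LogCopy(b"line 1\n", datetime(2020, 1, 1))
new = LogCopy(b"line 1\nline 2\n", datetime(2020, 1, 2))
same = LogCopy(b"line 1\n", datetime(2020, 1, 3))
```

With these values, resolve_log(old, new) selects the remote copy, while resolve_log(old, same) keeps the local one because the hashes match even though the remote modification date is later.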

Database
Runlet uses PostgreSQL, an open-source object-relational database that uses the Structured Query Language (SQL) and is highly scalable both in the amount of data it can manage and in the number of concurrent users it accommodates (PostgreSQL, 2020). The database stores data from users and jobs. It also provides relational operators to manipulate data.
However, the database does not store job logs. They are shared between devices using RabbitMQ's messaging system. This decision reduces security concerns related to the leakage of sensitive information that records may contain and improves performance by ensuring that updates are queued and delivered reliably without the need for querying and storing large amounts of data in a database.

Cloud Storage
A cloud-based solution is used exclusively for log storing. This eliminates ownership costs associated with managing a data storage infrastructure and ensures data durability, availability, and security.
Log storing is accomplished with MinIO, an open-source distributed object storage server designed to be cloud-native, performant, scalable, and lightweight. The server runs as lightweight containers managed by external orchestration services and is highly efficient in its use of CPU and memory resources (MinIO, 2020).

Server
The server is the central point of the architecture and consists of a NestJS API, a RabbitMQ message broker, and metric tools Prometheus and Grafana for monitoring. It orchestrates the communication between daemons and the message broker through an API, persists data in the database, and files in the cloud.
RabbitMQ is the message broker that handles message exchanging between network devices. It has two subcomponents: (1) a server operator where all broker features are configurable, such as clustering, high availability, persistence, and access control; and (2) a web-based tool for external management, monitoring, and short-term visualization of broker events.
A topic exchange is created for each user. This type of exchange is used for multicast routing of messages, as it routes messages to one or many queues bound with a matching binding key. All exchanges are durable and survive a broker restart. Queues, on the other hand, are temporary and short-lived to reduce workload and avoid leaving unused queues behind failed connections. Queues are both exclusive and auto-deleted, meaning that they are only used by their declaring connections and are automatically deleted when the last consumer is gone or cancels its subscription.
Queues are also created following the single responsibility principle that dictates that each queue has a single concern. That means that jobs end up having multiple queues to control execution rather than having a single queue for everything.
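How a topic exchange fans a message out to several single-purpose queues can be illustrated without a broker. The sketch below reimplements the AMQP 0-9-1 topic matching rule, where '*' stands for exactly one word and '#' for zero or more words. The queue names and routing keys are hypothetical and are not taken from Runlet.

```python
from typing import Dict, List


def topic_matches(binding_key: str, routing_key: str) -> bool:
    """AMQP 0-9-1 topic matching: '*' is exactly one word, '#' is zero or more."""
    pattern = binding_key.split(".")
    words = routing_key.split(".")

    def match(p: int, w: int) -> bool:
        if p == len(pattern):
            return w == len(words)
        if pattern[p] == "#":
            # '#' either absorbs nothing, or absorbs one word and stays active.
            return match(p + 1, w) or (w < len(words) and match(p, w + 1))
        if w == len(words):
            return False
        return pattern[p] in ("*", words[w]) and match(p + 1, w + 1)

    return match(0, 0)


def route(bindings: Dict[str, str], routing_key: str) -> List[str]:
    """Return the queues (name -> binding key) that receive the message."""
    return [name for name, key in bindings.items() if topic_matches(key, routing_key)]


# Hypothetical queues, each with a single concern.
bindings = {
    "job-42-logs": "job.42.logs",
    "job-42-input": "job.42.input",
    "all-logs": "job.*.logs",
    "audit": "job.#",
}
print(route(bindings, "job.42.logs"))
print(route(bindings, "job.42.input"))
```

A log message for job 42 reaches the job's log queue and the two wildcard queues, while an input message reaches only the input queue and the catch-all, which is the kind of separation of concerns described above.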

EXPERIMENTAL EVALUATION
This section describes the experiments that evaluate Runlet with regard to interactivity and reliability. The server, currently hosted on DigitalOcean, is a standard droplet with two (2) vCPUs, four (4) GB RAM, fifty (50) GB SSD storage, and four (4) TB of data transfer. A vCPU is a processing power unit that corresponds to a single hyperthread on a processor core (DigitalOcean, 2020). The testbed of devices used for the experiments is presented in Table 1, including device model, CPU, RAM, operating system, and a nickname used for study reference.

Interactivity
Two (2) experiments are conducted to evaluate the user's capability to interact with job executions and make decisions when requested. The first experiment verifies the ability to trigger a macOS job from a Windows manager. The job is called brew-cask-upgrade and executes two commands: (1) neofetch and (2) brew cu -a. neofetch is a command-line system tool written in bash that displays information about the operating system and hardware (dylanaraps, 2020). brew cu -a is a command from brew-cask-upgrade, a command-line tool for upgrading outdated apps installed by Homebrew Cask (buo, 2020). Homebrew Cask is a tool that extends Homebrew and is used for the installation and management of macOS applications distributed as binaries (Homebrew, 2020).
The second experiment shows how a Raspberry Pi can be remotely monitored using htop, which is an interactive text-mode process viewer for Unix systems (Muhammad, 2020). This viewer is highly configurable and gives the option to view information such as CPU load, memory consumption, hostname, tasks, load averages, and uptime.
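The interaction pattern that both experiments exercise, a running job that pauses until input arrives from the user, can be illustrated locally with a short sketch. The JOB script and the run_job helper are hypothetical stand-ins, and the sketch uses plain pipes on one machine instead of Runlet's message broker. Full-screen programs such as htop additionally require a pseudo-terminal, which this sketch does not provide.

```python
import subprocess
import sys

# Stand-in for a job step that pauses for confirmation before continuing.
JOB = (
    "answer = input('Upgrade apps? [y/N] ')\n"
    "print('upgrading' if answer.strip().lower() == 'y' else 'skipped')\n"
)


def run_job(user_input: str) -> str:
    """Start the job, forward the user's input to its stdin, and return its output."""
    proc = subprocess.Popen(
        [sys.executable, "-c", JOB],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    output, _ = proc.communicate(input=user_input, timeout=30)
    return output


print(run_job("y\n"))
```

In a distributed setting, the input and the output would travel as messages between the device that runs the job and the device where the user is typing.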

Reliability
Reliability stands for the system's ability to deliver queued messages after failures. This study investigated two types of failures: (1) node failure and (2) server failure.
1. Node Failure: the number of nodes in the cluster is downscaled from four (4) nodes to a single node. This change does not cause any issues as RabbitMQ tolerates individual nodes' failure as long as there are other known nodes in the cluster at the time. It may also not affect queues as the current infrastructure uses queue mirroring to replicate queues across nodes.
2. Server Failure: the single node is removed for a brief period to observe how the broker reacts in the absence of nodes. It does not cause any significant issues to the server apart from becoming temporarily unavailable, as the broker is configured to use lazy queues, a policy that moves messages from a large number of queue types to disk as early as possible and loads them in memory only when requested. These messages are expected to be retrieved from the disk after new nodes become operational. All queues are also likely to return.
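Queue mirroring and lazy queues are both enabled through RabbitMQ policies. The sketch below builds a policy body of the form accepted by the management HTTP API (PUT /api/policies/{vhost}/{name}), using the ha-mode and queue-mode keys that apply to classic queues in the RabbitMQ 3.x series. The paper does not give the actual configuration, so the pattern, the choice of mirroring to all nodes, and the function name queue_policy are assumptions.

```python
import json


def queue_policy(pattern: str = ".*") -> str:
    """Build a policy body that mirrors matching queues to all nodes and makes them lazy."""
    body = {
        "pattern": pattern,                 # assumption: apply to every queue name
        "apply-to": "queues",
        "definition": {
            "ha-mode": "all",               # assumption: mirror to every cluster node
            "queue-mode": "lazy",           # move messages to disk as early as possible
        },
    }
    return json.dumps(body, indent=2)


print(queue_policy())
```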

Discussion
This section describes the results obtained from experiments. The results presented throughout this section are supported by screenshots of executed jobs and metrics from Prometheus and Grafana.

Interactivity
The first experiment triggers a job on a Windows desktop manager to update system applications on a macOS device using Homebrew. Fig. 2 shows the output of command (1) neofetch. The output of command (2) brew cu -a lists applications, including current and latest versions, and a list of outdated apps. The option -a passed with brew cu indicates that applications with the auto-update functionality are also listed. The question "Do you want to upgrade 7 apps or enter [i]nteractive mode [y/i/N]?" is shown after the list of outdated apps and requires user interaction. Fig. 4 shows an excerpt of the upgrading process that takes place after user confirmation [y]. The progress of each upgrade is displayed individually, which helps to visualize the overall progress.
The second experiment triggers a job on a Windows desktop manager to display running processes of a Raspberry Pi using htop. Fig. 5 shows the initial output of htop, which is a list of system processes including PID, user, percentage of CPU and memory used, virtual memory, and time in execution. The bottom bar has all the options that are available through function keys. Fig. 6 shows all the options available under setup when the key F2 is pressed. These options can customize meters shown at the top and change display options, colors, and active columns. Navigation using arrow keys works without issues, as Runlet captures all keystrokes. The key F10 can be pressed at any time to indicate that the setup is done and return to the previous screen.

Reliability
Reliability is investigated by observing the broker's behavior considering two failure events. The impact is observed through a Grafana dashboard that monitors the following metrics using Prometheus:
• Total number of nodes, queues, connections, and channels.
• Per-node memory available.
• Messages published, delivered, and routed to queues at per-node, per-second granularity.
• Total number of queues, channels, and connections per node per second.
The first experiment investigates how the broker responds to node failure by changing the cluster composition from four (4) nodes to a single node at about 16:15. Fig. 7 shows that all messages, queues, channels, and connections are kept in the remaining node after removing three (3) nodes, meaning no messages are lost. Grafana may use different colors to indicate the same node across graph panels. For example, green is designated for node one (1) on messages published, delivered, and routed to queues, while yellow is designated for the same node on total queues, channels, and connections.
A slight increase in memory consumption happens on the remaining node as it starts to handle all the load. The difference between the number of messages published and routed to queues for delivery may be attributed to messages routed to multiple queues. The number of queues, channels, and connections varies according to the workload of messages.
The second experiment investigates how the broker reacts to a server failure by removing the only active node at about 16:25. Fig. 8 shows that messages stop being transmitted for about 5 minutes during the failure, but the process resumes as soon as the node joins the cluster again to take over. The node re-establishes all the old connections and recovers persisted messages and queues, resulting in no lost messages. The number of channels drops significantly as previous connections and channels are closed. Grafana changes the color from yellow to green to differentiate instances of the same node.

CONCLUSIONS
This study introduces a tool that achieves interactive job execution across heterogeneous devices using the protocol AMQP and the message broker RabbitMQ.
The approach fills a gap in the state-of-the-art, as confirmed by the findings in section 2. A few potential use areas include Continuous Integration (CI), Continuous Delivery (CD), and remote monitoring and management of devices.
The protocol AMQP was selected due to its flexible message routing, reliable message queues, and multiple exchange types. The message broker RabbitMQ was selected based on a personal preference for the built-in web management tool and support for Prometheus and Grafana. This combination proved very useful in the experimental evaluation as the broker could recover messages and queues after node and server failures with no messages lost.
The architecture and motivation behind architectural decisions are discussed in detail in section 3, including a conceptual view that introduces all the components. The experimental evaluation is conducted in section 4. It shows that interactive job execution on heterogeneous devices is achieved under various scenarios and that message delivery is reliable even after node and server failures.
The application has been released on GitHub and is easily found over the internet. The website with more information is available at https://runlet.app.
The roadmap for future work includes an experimental evaluation to benchmark scalability by measuring message throughput and latency as the network grows. It also includes an analysis of the security measures that are put in place to protect data from cyberattacks.