
DATA INTEGRITY IN CLOUD: ISSUES AND CURRENT SOLUTIONS

 

Prepared by the researchers

Mokhtar Mohammed Mohammed Ali – Associate Professor of Computer Science – Elimam Elmahadi University, Kosti, Sudan

Dr. Mahala Elzain Beraima Ahmed – Assistant Professor of Computer Science – Mahalaelzain2020@gmail.com – White Nile University, Kosti, Sudan

Democratic Arab Center

Journal of Afro-Asian Studies: Fourteenth Issue – August 2022

A Periodical International Journal published by the “Democratic Arab Center” Germany – Berlin

Nationales ISSN-Zentrum für Deutschland
ISSN  2628-6475
Journal of Afro-Asian Studies

To download the PDF version of the research papers, please visit the following link:

https://democraticac.de/wp-content/uploads/2022/08/Journal-of-Afro-Asian-Studies-Fourteenth-Issue-%E2%80%93-August-2022.pdf

Abstract 

Cloud computing is empowering new innovations for big data, and cloud analytic applications sit at the heart of this revolution. They offer remarkable benefits for big data processing, making it easy, fast, scalable, and cost-effective, although they pose many security risks. Security breaches due to malicious, vulnerable, or misconfigured analytic applications are considered among the top security risks to big data, and the risk is further amplified by the coupling of data analytics with the cloud. Effective security measures, delivered by cloud analytic providers, to detect such malicious and anomalous activities are still missing. This paper presents real-time Security Monitoring as a Service (SMaaS), a novel framework that aims to detect security anomalies in cloud analytic applications running on Hadoop clusters. It targets vulnerable, malicious, and misconfigured applications that violate data integrity and confidentiality. Towards this goal, we leverage a big data pipeline that combines advanced software technologies (Apache NiFi, Hive, and Zeppelin) to automate the collection, management, analysis, and visualization of log data from multiple sources, making it cohesive and comprehensive for security inspection. SMaaS monitors a candidate application by collecting log data in real time. It then leverages log data analysis to model the application's execution in terms of information flow. The information flow model is crucial for profiling the processing activities conducted throughout the application's execution and, in turn, enriches the detection of various types of security anomalies. We evaluate the detection effectiveness and performance efficiency of our framework through experiments conducted over benchmark applications. The evaluation results demonstrate that our system is a viable and efficient solution: it neither modifies the monitored cluster nor imposes overhead on the monitored cluster's performance.

1. INTRODUCTION

The rapid adoption of and investment in cloud analytic applications is magnified by the 4Vs of big data: velocity, volume, variety, and veracity. This increasing momentum has given rise to the hype around analytic technologies that perform data-parallel computing on commodity hardware, empowered by on-demand cloud services. As cloud analytic applications grow in popularity, security concerns grow in importance even more. Analytic applications are prone to data breaches due to insecure computations, misconfiguration, and unauthorized access resulting from vulnerable, malicious, or misconfigured nodes/tasks [1].

The risk is further amplified by the loss of control that comes with running analytic applications in the cloud. Furthermore, the unique features of analytic clusters, which provide distributed large-scale heterogeneous computing environments, render traditional security technologies and regulations ineffective. Moreover, there is a lack of effective security measures provided by cloud analytic providers to detect such malicious and anomalous activities.

Amongst the five pillars of Information Assurance (IA) in sharing data/information (integrity, authentication, availability, confidentiality, and non-repudiation), integrity tops the list. From the perspective of a data owner, data should provide the strengths of all five pillars, but it may be corrupted for many reasons, such as human errors, software bugs, hardware malfunctions, or malicious attacks. Common solutions such as backing up data, applying security mechanisms [Indumathi J. (2012, 2013); Indumathi J. and Uma G.V. (2008)], and using error detection and correction software can help provide data integrity.

Several research efforts, proposed to fortify the analytic world against security threats, span different research directions, ranging from differential privacy [2], integrity verification [3-6], policy enforcement [7-10], data provenance [11-14], and honeypot-based [15] to encryption-based [16] mechanisms, among others. The effectiveness of differential privacy as a widespread solution has yet to be demonstrated.

The integrity verification approaches require intercepting the application execution to verify result integrity, which comes at the cost of a performance penalty. Access control policies cannot prevent misuse activities that breach data security after access has been granted. Provenance mechanisms incur overhead from collecting, storing, and analyzing provenance data, which can render them impractical. Honeypot-based and encryption-based approaches entail modifications to the analytic applications to add the security attestations. In this paper, we propose the Security Monitoring as a Service (SMaaS) framework. SMaaS is a novel information flow-based log analysis solution for detecting security anomalies in cloud analytic applications. It exemplifies one of the services that can be offered by the Information Flow Control as a Service (IFCaaS) model, our previously proposed notion [17]. Inspired by Security as a Service (SecaaS), IFCaaS expands the horizon of SecaaS by featuring cloud-delivered IFC-based security analysis and monitoring services [17].

Hadoop, the most prominent analytic technology, was not originally designed with security, compliance, and risk management support in mind. Recently, it has evolved to support authentication and encryption mechanisms for protecting data at rest and in transit. Despite the evolving efforts in securing Hadoop, it is still exposed to weak authentication and infrastructure attacks. Such attacks increase the security risk of analytic applications against data confidentiality and integrity. The distinct features of computations and data in distributed large-scale analytic systems raise several challenges to developing an effective log analysis solution for anomaly detection. These challenges are summarized as follows:

a) handling log data that is characterized by the 4Vs and collected across the cluster nodes;
b) involving the complex data and control flows enclosed among the cluster nodes to execute analytic applications;
c) considering the different roles of the core daemons responsible for running such analytic applications; and
d) mining for tangible evidence of security anomalies from log data.

In this work, we propose a novel approach that addresses the aforesaid challenges by leveraging a streaming data pipeline for security inspection.

Fig. 1: Operational overview of the SMaaS framework

The data pipeline combines advanced software technologies (Apache NiFi, Hive, Zeppelin) to automate the collection, management, processing, analysis, and visualization of log data from multiple sources, making it valuable, comprehensive, and cohesive for security inspection. Cluster log data is but one part of the whole picture; thus, SMaaS also relies on system logs to complete the picture for security inspection. SMaaS works towards extracting an information flow profile from log data to model the execution of a candidate application. Upon the information flow profile, SMaaS employs several techniques for the detection of security anomalies. These anomalies indicate data integrity and confidentiality violations. Our overall contributions are as follows:

1) We propose a novel framework called Security Monitoring as a Service (SMaaS) for analytic applications;

2) We introduce an advanced approach that leverages a streaming data pipeline to automate log data ingestion, processing, analysis, and visualization for real-time security inspection;

3) We propose several techniques for detecting different types of security anomalies based on information flow analysis; and

4) We demonstrate the detection effectiveness and performance efficiency of our framework through a set of experiments over benchmark applications.

The remainder of this paper is structured as follows: the operational overview of our proposed framework, the threat model it assumes, and Hadoop in a nutshell are introduced in Section 2. Section 3 presents the details of the framework. The framework implementation and experimental evaluation are presented in Section 4. Section 5 outlines the related work. Section 6 draws the concluding remarks of the paper and outlines future work.

2. OVERVIEW

This section outlines the operational overview of our proposed framework, the threat model it assumes, and a brief outline about Hadoop.

2.1 The SMaaS Operational Overview

We consider three main entities that comprise the cloud service models for this framework: the cloud analytics provider, a trusted party, and consumers running data analytic applications over the provided cluster/service. There are different architectural deployment offerings for analytic technologies (e.g., Hadoop) in the cloud. These offerings range from basic services (e.g., IaaS, PaaS) to specifically tailored services (e.g., Data Analytics as a Service). Such offerings facilitate running analytic applications in the cloud. The SMaaS design supports these different offerings. We offer SMaaS as an advanced security monitoring feature from the cloud analytics provider.

As depicted in Fig. 1, a provider of analytic technology (e.g., Hadoop) offers its consumer the option to subscribe to the security service (1). For subscribed consumers, the provider enables collecting log data from the respective consumers' clusters. The provider delegates the trusted party to further analyze the collected data (2). The trusted party, in turn, employs the proposed framework to detect anomalous activities indicating data breaches. Monitoring reports are published on the dashboard to the consumers, and email alerts with detailed analysis reports are sent upon detecting security violations (3).

2.2 Threat Model

Analytic applications (e.g., MapReduce jobs) can be misconfigured, malicious, or vulnerable and may breach the security of processed data. This can happen throughout their execution via multiple activities (e.g., modifying, copying, or deleting data) at different levels (i.e., input, intermediate results, output) in a way that violates data integrity and confidentiality. Our solution aims to detect five anomaly types: 1) data leakage; 2) data tampering; 3) access violation; 4) misconfigurations; and 5) insecure computation. We assume the correctness and integrity of the log files upon which we build our security analysis solution. We also assume the security of the cluster, the underlying platform, and the infrastructure on which SMaaS is deployed.

2.3 Hadoop in a Nutshell

In this work, we are mostly interested in Hadoop's latest versions (the 2.x and 3.x series). The Hadoop stack consists of the following core modules:

1) YARN, responsible for job scheduling and cluster resource management;

2) MapReduce, based on YARN for parallel processing of large data sets; and

3) HDFS, a distributed file system for high-throughput access to application data.

Each module consists of several daemons. To be specific, YARN comprises the resource manager, node manager, and job history server daemons. HDFS has many daemons such as the name node and data node, among others. An analytic application performing a MapReduce job breaks the input data into multiple splits, each equivalent in size to an HDFS block. Then, the application breaks the processing into two main phases: map and reduce. The map phase is responsible for mapping the input data into key/value pairs forming intermediate results. Multiple map tasks are initiated on the cluster's nodes to simultaneously process each input split. Then, the reduce phase takes the intermediate key/value pairs and produces the final output.

Multiple reduce tasks are launched to process all relevant intermediate pairs in parallel to perform the required processing task. YARN executes a MapReduce application as follows:

1) the resource manager assigns a unique ID to the application and copies the resources required to run it;

2) then, the resource manager starts an application master to coordinate the execution of the application’s tasks; and

3) the node managers control containers on each individual node to concurrently run the tasks.
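To make the map/reduce semantics concrete, the following is a minimal Python sketch of a word count in the MapReduce style. It is illustrative only: in Hadoop, map and reduce tasks run in parallel inside YARN containers across the cluster, whereas this sketch runs both phases sequentially in one process.

```python
# Minimal sketch of MapReduce semantics (illustrative, not Hadoop code).
from collections import defaultdict

def map_phase(split):
    # Map each line of an input split into intermediate (word, 1) pairs.
    for line in split:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Group the intermediate pairs by key, then sum the values per key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

split = ["big data in the cloud", "big data analytics"]
print(reduce_phase(map_phase(split)))
# {'big': 2, 'data': 2, 'in': 1, 'the': 1, 'cloud': 1, 'analytics': 1}
```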

The infrastructure knowledge required about Hadoop adds another obstacle to implementing security monitoring initiatives. Hadoop can produce various types of log data from different sources (e.g., applications, daemons, audit actions). Such log data is considered a rich source of information for troubleshooting and performance debugging. However, it has a complex, confounding structure that precludes mining useful knowledge for security inspections.

Deriving a comprehensive profile of the behavior of an analytic application from the emitted log data, with the goal of fostering the detection of security anomalies, is a critical issue. There is no direct means to relate information about an executed application (in YARN) and the processed data (in HDFS) from log data. In a typical Hadoop cluster, each individual node/daemon generates its own log data.

Logging and auditing are further configured through complicated settings to expose data that can have various granularity levels (e.g., DEBUG, WARN, INFO), reside in different storage locations, and have different retention and deletion policies. In this sense, logging and auditing in Hadoop can convey rapidly growing quantities of data with low quality in terms of redundancy, heterogeneity, and diversity. In addition, it falls short of providing cohesive insights about user activities running analytic operations. Thus, Hadoop log data can be burdensome to mine for meaningful information or tangible evidence of security anomalies.

3. THE FRAMEWORK

Fig. 2: Components of the SMaaS framework

As illustrated in Fig. 2, we build the architecture of the SMaaS framework on a distributed model. Such an architecture supports the scalability required for monitoring distributed analytic applications, which may execute on a cluster spanning thousands of nodes. The proposed framework comprises four main components: Data Operator, Data Aggregator, Security Analyzer, and Visualization Manager. These components, according to their essential tasks, logically form two core engines: observation and inspection. To support our distributed architecture, observation operates transparently on the monitored cluster without introducing any changes or overhead, while inspecting the security of the monitored cluster takes place separately by performing the log data analysis on a different cluster.

The observation engine involves the data operator component, acting as a transparent agent to collect log data from each individual node in the monitored cluster. The inspection engine entails the data aggregator, security analyzer, and visualization manager components. The data aggregator component consumes the collected data from the data operator. It consolidates the data to model the execution events in the context of the whole application. This component leverages novel techniques to profile the execution behavior in terms of information flow, including data and control dependencies. Upon the consolidated profile, the security analyzer component detects security anomalies and conducts alerting and reporting actions.

The profile is inspected against expected features that characterize benign applications. Such an inspection approach enables this component to detect deviations indicating anomalous and suspicious events. Monitoring reports are displayed in graphical web-based frontend dashboards, managed by the visualization manager component. The components are further detailed in the light of the aforesaid engines in the following subsections.

3.1 Observation Engine

Data Operator

The data operator component represents the observation engine in SMaaS. It is responsible for observing the monitored cluster through auditing and logging. We devise an approach for log data collection that promotes transparency and overhead efficiency. Instead of setting up a custom log collection process, the data operator component leverages the log4j API and the Syslog protocol as its workhorses to collect log data from the cluster nodes. We specifically leverage the log4j API to enable the extensive native logging capabilities in Hadoop. The log4j API is the heart of the data operator component, employed in the logging process for Hadoop applications during the course of their execution.

We focus on collecting YARN application logs and HDFS data node daemon logs in order to capture both control and data flow activities relevant to an application's execution. We automate storing the application logs in an aggregated fashion in HDFS. In this respect, logs from all containers allocated to run the monitored application on the cluster's distributed nodes are unified in one location, ready for ingestion by the inspection engine. The benefit of our approach is twofold:

1) it facilitates serving and managing the collected logs directly by YARN daemons (i.e., ResourceManager or JobHistoryServer) in HDFS; and

2) it enables, in turn, retrieving the collected data in an easily managed way, as sketched below.
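As a minimal sketch of this retrieval path, the aggregated logs of a finished application can be pulled from HDFS through the standard `yarn logs` command. The application ID below is hypothetical, and the wrapper itself is only illustrative of how the inspection engine could consume the data.

```python
# Illustrative sketch: fetching YARN-aggregated application logs.
# Assumes a Hadoop client is installed; the application ID is made up.
import subprocess

def fetch_application_logs(app_id: str) -> str:
    # `yarn logs -applicationId <id>` serves the logs that YARN
    # aggregated into HDFS for the given application.
    result = subprocess.run(
        ["yarn", "logs", "-applicationId", app_id],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

logs = fetch_application_logs("application_1650000000000_0001")
```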

Users are allowed to run analytic applications and operations through the command line from any node within the cluster. The command line provides an option to run an application under another user (specified in the command). Hadoop log data falls short of recording this information: it records the specified user whose name appears in the command, but not the actual user who submitted the command to run under the specified user's environment. In this case, a user executing a malicious or vulnerable application can go undetected, while another user can incorrectly be held accountable for breaching security. To cover this gap and achieve deeper monitoring, we augment our approach to log user activities from the host operating system (OS) on the Hadoop nodes.

The data operator component leverages the Syslog protocol to configure the host OS to collect system logs of user command activities. Our system supports a transparent, distributed, hierarchical architecture for collecting system logs from the cluster nodes. The cluster is forked into groups of nodes forming sub-clusters. A localized relay is configured to poll the system logs from each individual group of nodes (sub-cluster) and instantly forward the received logs to a central remote collector server, ready for ingestion by the inspection engine. In this sense, our approach exposes both Hadoop and system log data transparently without requiring any intrusive changes, installing custom agents, or introducing any overhead on the monitored cluster. It also supports scalability by efficiently managing log data collection.
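As an illustration of why these system logs close the gap described above, the sketch below recovers both the invoking user and the effective user from a forwarded sudo log entry. The log line and regular expression are illustrative; the exact layout depends on the OS and Syslog configuration.

```python
# Illustrative sketch: extracting the actual submitting user and the
# effective user from a sudo entry in the collected system logs.
import re

SUDO_PATTERN = re.compile(
    r"sudo:\s+(?P<actual>\S+)\s+:.*USER=(?P<effective>\S+)\s+;\s+COMMAND=(?P<cmd>.+)"
)

line = ("Aug 10 09:14:02 node3 sudo: alice : TTY=pts/0 ; PWD=/home/alice ; "
        "USER=hdfs ; COMMAND=/usr/bin/hadoop jar wordcount.jar /in /out")

match = SUDO_PATTERN.search(line)
if match:
    # Hadoop would log only 'hdfs'; Syslog reveals 'alice' submitted it.
    print(match.group("actual"), "ran as", match.group("effective"),
          "->", match.group("cmd"))
```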

3.2 Inspection Engine

The data aggregator, security analyzer, and visualization manager components embody the inspection engine in SMaaS. These components work together towards automating the processing, analysis and visualization of log data, making it valuable, comprehensive, and cohesive for security inspection.

They employ a data pipeline as an advanced log data processing and analysis approach to reason about the security of analytic applications. This section starts by presenting the main idea of the data pipeline. Then, it dives into the detailed steps conducted by each component in the following subsections.

Fig. 3: The SMaaS data pipeline

Our approach leverages Apache NiFi, Hive, and Zeppelin to build the data pipeline, as outlined in Fig. 3. We build the data aggregator and security analyzer components on top of the Apache NiFi (Niagara Files) platform. These components are designed as groups of processors running in an integral workflow inside NiFi. This design gives our system the ability to stream log data from different sources (Hadoop and system); ingest high volumes of data in real time; consume and transform many log data formats; and process log data in a distributed, scalable manner. These features, in turn, empower our system with real-time security detection and decision-making capabilities.

The aggregated data is streamed into Apache Hive to enable historical analysis. The visualization manager component provides, on a Zeppelin dashboard, relational views extracted from the aggregated data residing in Hive. These visualized views are valuable for determining effective responses and decisions to the detected security issues.

Data Aggregator

The data aggregator component is responsible for ingesting the log data, collected by the data operator, at runtime. It then consolidates the data to profile the execution of MapReduce applications in terms of an information flow view.

This component utilizes a combination of NiFi standard processors. It starts by remotely retrieving information about the running/finished applications in the monitored cluster. We optimize the component to fetch log data only if the retrieved information has changed, indicating newly submitted applications. For each candidate application, its log data is fetched from the YARN daemons. Recall that the fetched data represents logs from all containers allocated to run the candidate application over the cluster's distributed nodes. The data aggregator component consumes and parses the data in order to obtain every piece of information relevant to the application's execution. Then, it transforms and consolidates these pieces of information into a single profile in a condensed format, JSON (JavaScript Object Notation).

Fig. 4: The JSON schema of the information flow profile

This profile represents an information flow view that models the application's execution. The JSON schema of the profile is illustrated in Fig. 4. The profile models the application's execution from two angles: global and partial. The former captures global attributes about the application such as id, name, start time, finish time, average map time, etc. Furthermore, attributes detailing the data flow processed by the application, broken down into input, output, and intermediate data, are also tracked. Such attributes are derived directly from HDFS and the nodes where the data is stored. The latter portrays the application's processing activities (map and reduce tasks) and the dependencies (data and control) between them. For each map task, the profile captures the data split that flows as input to the task. Similarly, for each reduce task, the profile shows the list of dependent map tasks that flow input data to the reduce task and the data split that flows as output from the reduce task.
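For illustration, a minimal instance of such a profile might look as follows. The values are fabricated and the field names merely echo the attributes named above; the actual schema is the one shown in Fig. 4.

```json
{
  "Id": "application_1650000000000_0001",
  "Name": "word count",
  "Start Time": "2022-08-10 09:14:02",
  "Finish Time": "2022-08-10 09:16:40",
  "Data Flow": {
    "Input":  [{"Dir Path": "/user/alice/in",  "Blk No": "blk_1073741901", "Permissions": "rw-r-----"}],
    "Output": [{"Dir Path": "/user/alice/out", "Blk No": "blk_1073741935", "Permissions": "rw-r-----"}]
  },
  "Map Tasks": [
    {"TaskId": "m_000000", "Data Flow Split": "blk_1073741901"}
  ],
  "Reduce Tasks": [
    {"TaskId": "r_000000", "Dependent TaskIds": ["m_000000"], "Data Flow Split": "blk_1073741935"}
  ]
}
```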

The last step is streaming the constructed profile into Apache Hive. This step facilitates incrementally storing the profiles of applications, which ran on the monitored cluster, into a unified database. This database enables performing historical analysis by security administrators.

Security Analyzer

The security analyzer component utilizes the information flow profile received from the data aggregator component to detect anomalies.  This component is designed as a combination of NiFi standard processors. We implement the security analysis logic as scripts that automatically execute inside the processors. The analyzer reasons about five anomaly types:

1) data leakage;

2) data tampering;

3) access violation;

4) misconfigurations; and

5) insecure computation.

The analyzer takes the information flow profile as a baseline for the security analysis. It checks the profile against expected features that govern benign applications. Any deviation from the expected features indicates a security anomaly. The analyzer also correlates information from daemon logs and Syslogs when needed to identify a compromise. These logs are collected by the data operator and ingested by the data aggregator to be ready for this component. The analyzer sends email alerts to the security administrator with detailed analysis reports upon detecting security violations. The analysis techniques used to detect the anomalies are further explained in Section 4.

Visualization Manager

The visualization manager component provides a dashboard as an important feature of our monitoring solution. We leverage Apache Zeppelin for the visualization dashboard. The dashboard provides security administrators with the ability to visualize the aggregated information flow profile as well as relational views of the aggregated data from the Hive database. This is important for determining effective responses to security issues. For example, an administrator can query for other applications executed by a malicious or victim user/node involved in a detected security violation. As a remedy, the administrator may block or isolate activities from this user/node until proper mitigation actions are conducted.

4. EXPERIMENTAL EVALUATION

This section describes our experimental setup and evaluation results. Our evaluation targets answering two questions:

1) What is the effectiveness of SMaaS in detecting security anomalies in MapReduce applications?

2) What is the performance efficiency of our system?

The following sections describe the experiments conducted to address each question, respectively. We set up our experiments over a private cloud consisting of six VMs. Each VM runs Ubuntu 14.04 with 16 GB of RAM, 4 CPUs, and 100 GB of storage. We chose the Hortonworks data platform distribution to build the monitored Hadoop cluster. We built our system over a NiFi cluster using the Hortonworks data flow distribution. Each cluster has one master and two slave nodes.

We managed both clusters via the Ambari server. The experiments are conducted over three popular MapReduce benchmark applications [18]: TeraSort, TeraGen, and WordCount. They are shipped with the Hadoop distribution. TeraSort sorts data stored in files. We used TeraGen to generate data as input to TeraSort with sizes of 1 GB, 5 GB, and 10 GB. WordCount counts word occurrences in files.

4.1 Anomaly Detection

We assess the detection effectiveness of our solution over the aforesaid five types of anomalies. We crafted the code of the WordCount application to implement the data leakage, data tampering, access violation, and misconfiguration types. For the insecure computation type, we changed the permissions of the default locations to open access. In the end, we have six versions of the WordCount application, five of which are malicious, vulnerable, or misconfigured. Our solution detected all anomalies with 100% accuracy. In what follows, the detection techniques employed by the security analyzer are explained for each anomaly type.

1) Data Leakage

A malicious application may copy input or output data to an unauthorized location. To detect data leakage activity, the analyzer reasons about the “Blk No” attribute of all splits of the input and output dataflow in the application's profile. Then, it analyzes the data node daemon logs looking for unauthorized write operations on any block of data that is not related to the expected MapReduce control flow. The information is correlated based on the “Start Time” and “Blk No” attributes in the profile. An occurrence in the data node logs indicates data leakage activity, since only MapReduce-related activities are expected from a benign application. By knowing the compromised data split, the analyzer can further track the affected map and reduce tasks by traversing the profile. In case of input leakage, the “Data Flow Split” attribute can lead to the affected map task. From there, we can infer the reduce task having a dependency on it by linking its “TaskId” with the reduce task's “Dependent TaskIds”. In case of output leakage, the “Data Flow Split” suffices to identify the affected reduce task.
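A simplified sketch of this check, assuming profile fields like those in the illustrative JSON above and plain-text datanode log lines, might look as follows; real datanode logs are more structured, and the time-window correlation on “Start Time” is omitted for brevity.

```python
# Simplified sketch of the data-leakage check: flag datanode log lines
# that touch a profiled block through an operation outside the expected
# MapReduce control flow. Field names and log format are illustrative.
def detect_leakage(profile, datanode_log_lines,
                   expected_ops=("writeBlock", "readBlock")):
    profiled_blocks = {split["Blk No"]
                       for flow in ("Input", "Output")
                       for split in profile["Data Flow"][flow]}
    alerts = []
    for line in datanode_log_lines:
        for blk in profiled_blocks:
            if blk in line and not any(op in line for op in expected_ops):
                alerts.append((blk, line))  # block touched outside the flow
    return alerts
```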

2) Data Tampering

A malicious application may replace input or output data with wrong data. To detect data tampering, the analyzer reasons about the “Dir Path” attributes of the input and output dataflow in the application's profile. Then, it analyzes the Syslogs to reason about the input and output paths that were configured when a user submitted the application. The information is precisely correlated based on the “Submit Time” and “Map Class Name” attributes from the profile. A mismatch between the configured and processed paths indicates data tampering activity. This is because a benign application is expected to process the input and output data as configured by the user.
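The core of this comparison can be sketched as follows, assuming the configured paths have already been parsed out of the Syslog command line and the profile follows the illustrative schema above.

```python
# Simplified sketch of the data-tampering check: the paths configured at
# submission time (from Syslog) must match the paths the application
# actually processed (from the profile). Field names are illustrative.
def detect_tampering(profile, configured_input, configured_output):
    processed_in = {s["Dir Path"] for s in profile["Data Flow"]["Input"]}
    processed_out = {s["Dir Path"] for s in profile["Data Flow"]["Output"]}
    mismatches = []
    if configured_input not in processed_in:
        mismatches.append(("input", configured_input, sorted(processed_in)))
    if configured_output not in processed_out:
        mismatches.append(("output", configured_output, sorted(processed_out)))
    return mismatches  # a non-empty list indicates tampering
```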

3) Access Violation

A vulnerable application may be submitted to process input at, or produce output to, unsafe locations. In addition, a malicious application may change the access permissions of the input or output, making them prone to security compromise. Access violation is detected by reasoning about the “Permissions” attribute of the I/O dataflow in the profile. An authorization specifying that the input or output paths have open access indicates a violation. This is because a benign application is expected to process and produce data that has access restricted to authorized users only.
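Assuming POSIX-style permission strings in the profile, this check reduces to inspecting the “others” bits of each I/O split, as in the following sketch; the field names again follow the illustrative schema above.

```python
# Simplified sketch of the access-violation check: any input or output
# split whose "others" permission bits are not empty is world-accessible
# and therefore flagged. A POSIX-like permission string is assumed.
def detect_access_violation(profile):
    violations = []
    for flow in ("Input", "Output"):
        for split in profile["Data Flow"][flow]:
            perms = split["Permissions"]      # e.g. "rw-r-----"
            if perms[-3:] != "---":           # open access for "others"
                violations.append((flow, split["Dir Path"], perms))
    return violations
```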

Fig. 5: I/O performance evaluation over various data sizes of the SMaaS components: (a) data aggregator; (b) security analyzer.

4) Misconfigurations

Throughout the execution of an application, YARN stores the application's files inside a local cache. The cache specifically contains the work directory of each individual container assigned to execute the application's tasks. It also stores the intermediate results throughout the application's execution. On the other hand, the input and output data processed by an application are typically stored in HDFS. The datanode daemons store data blocks in local directories. The locations of the local cache and directories are normally configured at cluster set-up time. A misconfigured application may change the configurations to locations different from the typical expected ones. A misconfiguration violation is found by inspecting the “Dir Path” attribute of the cache and intermediate dataflow in the profile and linking them with the values of the corresponding properties in Hadoop's configuration files.

5) Insecure Computation

A vulnerable application may execute while the intermediate data, application files, or data blocks are stored in locations that have open access. In this sense, the application is prone to insecure computations due to the risk of tampering with its processing in terms of data and control flows. The analyzer detects insecure computations by examining the “Permissions” attribute of the intermediate and cache dataflow in the profile. Such data is expected to have restrictive access.

4.2 Performance

We conducted several experiments to evaluate the performance of SMaaS as a streaming analytic solution. The experiments exercise different aspects, including I/O performance, execution time, and resource usage. They are designed to appraise SMaaS performance at two levels: component and system. Component-level experiments focus mainly on the data aggregator and security analyzer, the two main components that process streaming data in SMaaS. System-level experiments reflect the overall performance of SMaaS, hosted on the NiFi cluster.

1) Component-Level Performance

This section presents the experiments employed to measure a) I/O performance and b) execution time. We employ customized reporting tasks inside NiFi to send the collected metrics about each component to Grafana. The metrics are measured over a five-minute rolling window. The I/O performance is measured in terms of the BytesRead and BytesWrite metrics. To estimate the execution time, we use the TotalTaskDurationSeconds metric. These metrics are defined and measured as follows:

  • BytesRead: The total number of bytes that the component read from the disk during the rolling window.
  • BytesWrite: The total number of bytes written by the component to the disk during the rolling window.
  • TotalTaskDurationSeconds: The total time that the component used to complete its task during the rolling window.

We perform two different experiments to evaluate each component. First, we assess the aggregator component in analyzing different volumes of application logs. We execute the TeraSort and TeraGen benchmark applications over various data volumes (1 GB, 5 GB, 10 GB); these growing volumes consequently increase the volume of each application's logs that need to be analyzed. Second, we exercise the analyzer component in processing various workloads (1.2 GB, 6.9 GB, 15 GB) of Syslog and datanode log data. We created the workloads by combining the logs collected over a period of one month.

a) I/O Performance

The BytesRead and BytesWrite for varying data quantities for both the aggregator and the analyzer are shown in Fig. 5, parts (a) and (b), respectively. As inferred from the figures, performance grows proportionally with the increasing size of data. The data aggregator involves more I/O operations than the security analyzer: the latter performs the analysis on the fly, whereas the aggregator interacts with the disk while preparing the information flow profile. Thus, the performance of the analyzer is not affected by the size of the processed data, while the aggregator's performance depends on the data volume. Even so, it stays within an efficient range between 127 and 224 MB.

b) Execution Time

Fig. 6, parts (a) and (b), illustrates the execution time of the aggregator and the analyzer over different data loads, respectively. The execution time increases non-linearly as the volume of data grows. We notice that the execution time of the analyzer is higher than that of the aggregator. The main reason is that the analyzer executes diverse algorithms for the security inspection that entail processing time. Both components are highly efficient, as their execution does not exceed 1 second, thanks to the proposed data pipeline that enables our system to efficiently support data-parallel computing.

2) System-Level Performance

This section highlights our experiments to gauge two aspects: a) CPU utilization and b) memory consumption of our system. Monitoring resource consumption is not supported through NiFi reporting tasks. Thus, we implemented our own monitoring component as a group of NiFi processors. The monitoring component leverages the NiFi API to fetch a system diagnostics report about the SMaaS cluster. The report captures the heapUtilization and processorLoadAverage metrics. The monitoring component then publishes the refined metrics into Grafana through the AMS API. Recall that SMaaS runs on top of NiFi. As NiFi executes within a Java Virtual Machine (JVM) on the host VMs, the SMaaS resources are limited to the CPU capacity and memory space afforded by NiFi. In this sense, the SMaaS components share the same resources dedicated to the JVM. We perform the experiments in real-world settings: SMaaS is an online solution, so it runs continuously while we execute 12 versions of the benchmark applications over various data sizes and anomaly types.
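As a sketch of this monitoring step, NiFi's REST API exposes a system diagnostics report that includes the two metrics above; the host, port, and the surrounding wrapper below are illustrative, and a secured cluster would additionally require authentication.

```python
# Illustrative sketch: polling NiFi's REST API for system diagnostics.
import requests

def fetch_nifi_diagnostics(host="localhost", port=8080):
    url = f"http://{host}:{port}/nifi-api/system-diagnostics"
    report = requests.get(url, timeout=10).json()
    snapshot = report["systemDiagnostics"]["aggregateSnapshot"]
    return {
        "heapUtilization": snapshot["heapUtilization"],
        "processorLoadAverage": snapshot["processorLoadAverage"],
    }
```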

a) CPU Utilization

Fig. 7 shows the average processor utilization of our system. The utilization varies throughout the experiment time; it does not exceed 15% and averages 8%. Building our system on NiFi to support data-parallel processing diminishes the impact on CPU usage.

Fig. 6: Execution time evaluation over various data sizes of the SMaaS components: (a) data aggregator; (b) security analyzer.

Fig. 7: The SMaaS CPU utilization evaluation

Fig. 8: The SMaaS memory consumption evaluation

b) Memory Consumption

Fig. 8 shows the percentage of heap memory consumed by SMaaS during the experiments. The average consumption is 37% and the maximum reached is 46%. The average used heap memory is 3.7 GB, the maximum heap memory used is 4.6 GB, and the 95th percentile is 4.2 GB. As observed from the results, our system achieves an efficient CPU utilization and memory consumption footprint.

5. RELATED WORK

Several research efforts, proposed to fortify the analytic world against security threats, span different research directions, ranging from differential privacy [2], integrity verification [3-6], policy enforcement [7-10], data provenance [11-14], and honeypot-based [15] to encryption-based [16] mechanisms, among others. Interested readers can find further details of intrusion detection systems in the cloud in a recent survey [19]. Airavat [2] applied differential privacy to protect data from malicious MapReduce jobs. Although the differential privacy mechanism has recently attracted researchers as an effective solution in specific problem contexts, its effectiveness as a widespread solution has yet to be demonstrated. Integrity verification is a mechanism applied for decades, which recently appears in the context of MapReduce to examine the security of results produced by MapReduce jobs.

SecureMR [3] and TrustMR [4] rely mainly on replication-based computations, while VIAF [5] extends the mechanism with a query-based approach to further detect colluding attacks. IntegrityMR [6] performs the integrity checks at the application layer along with the MapReduce task layer. These approaches differ in the scope of integrity assurance, the logical layer of operation, and the mode of checking. In general, they require intercepting the computation of MapReduce tasks to verify result integrity, which comes at the cost of a performance penalty. Some approaches enforce security policies for access control, such as GuardMR [7] and Vigiles [8], at different granularities by modifying the underlying platform [8] or adding an extra access control layer [7].

Access control policies cannot prevent misuse activities that breach data security after access has been granted. One approach [9] proposes an IFC-based access control model that supports multitenancy in SaaS systems. IFC endorses advancement over access control as it can provide end-to-end protection. As an alternative enhancement, accountability mechanisms have been proposed to harden access control policies. AccountableMR [10] incorporates such an enhancement into MapReduce. Accountability is achieved by verifying that data access granted after authorization complies with the security policies governing data security.

The data provenance (or lineage) mechanism is typically used to keep a history about data for the purpose of reproducibility. Recently, a few approaches have embraced this mechanism for big data security [11-14]. One approach [11] proposed a formal perception of the provenance mechanism to enable forward and backward tracing of data during the execution of MapReduce tasks. Other approaches [12-14] perform data provenance by analyzing metadata information and system log files to collect traces about data processing for the purpose of detecting anomalies. This mechanism faces several challenges that may hinder its practicability, such as the volume of captured provenance data, the storage and integration required to effectively analyze these data, and, most importantly, the overhead incurred from collecting these data during the execution of distributed analytic tasks.

Another approach [15] adopts a honeypot-based mechanism to detect unauthorized access in MapReduce. A different approach [16] leverages encryption to protect data stored in Spark. The encryption mechanism may disrupt the typical operations within the system when data is being processed. Furthermore, encryption and decryption are costly operations that may impose a performance burden and reduce system throughput.

6. CONCLUSION

In this paper, we introduce a novel Security Monitoring as a Service for cloud analytic applications. Our system provides an advanced online defense in depth for analytic applications. The benefit is twofold: 1) it hardens the security of analytic clusters (e.g., Hadoop) by inspecting the applications running over them, and 2) it protects big data processed by analytic applications by detecting anomalies that breach its security. The SMaaS framework is offered as a security monitoring service from the cloud analytic provider. The provider puts a trusted party in charge of analyzing log data collected from the monitored cluster. The trusted party, in turn, employs SMaaS to detect security anomalies.

By intelligently leveraging a streaming big data pipeline and cloud technologies, SMaaS eludes the challenges that originate in analytic applications. The SMaaS pipeline automates the collection, management, processing, analysis, and visualization of log data. The framework extracts an information flow profile to model the execution of the analyzed application. The profile captures both control and data dependencies across the distributed tasks/nodes running the analyzed application. Several techniques are employed for the detection of security anomalies based on analyzing the information flow profile. Our solution checks the profile against expected features that govern benign applications. Any deviation from the expected features indicates a security anomaly. The system helps security administrators take mitigation actions when alerted about discovered anomalies. In this sense, our system conceals the log processing for security inspection behind the analytic cluster's scene. It does not require any intrusive changes, installing custom agents, or introducing any overhead on the monitored cluster. In doing so, it addresses the following challenges:

a) handling log data that is characterized by the 4Vs and collected across the cluster nodes;
b) involving the complex data and control flows enclosed among the cluster nodes to execute analytic applications;
c) considering the different roles of the core daemons responsible for running such analytic applications; and
d) mining for tangible evidence of security anomalies from cluster and system logs.

We implement, deploy, and evaluate SMaaS in a private cloud.

Our experiments validate the detection effectiveness and performance efficiency of our framework. We conduct our experiments over Hadoop-based benchmark applications. The results demonstrate that our system attains high detection accuracy for five different anomaly types, while achieving high performance with a lightweight footprint on resource utilization.

REFERENCES

  • [1] Cloud Security Alliance, “Big Data Security and Privacy Handbook: 100 Best Practices in Big Data Security and Privacy,” 2016.
  • [2] I. Roy, S. Setty, A. Kilzer, V. Shmatikov, and E. Witchel, “Airavat: Security and Privacy for MapReduce,” Proc. of the 7th USENIX Conference on Networked Systems Design and Implementation (NSDI’10), USENIX Association, Berkeley, CA, USA, pp. 297-312, 2010.
  • [3] W. Wei, J. Du, T. Yu, and X. Gu, “SecureMR: A Service Integrity Assurance Framework for MapReduce,” Proc. of the 2009 Annual Computer Security Applications Conf. (ACSAC), 2009.
  • [4] H. Ulusoy, M. Kantarcioglu, and E. Pattuk, “TrustMR: Computation integrity assurance system for MapReduce,” Proc. of the IEEE Int. Conf. on Big Data (Big Data), pp. 441-450, IEEE, 2015.
  • [5] Y. Wang and J. Wei, “VIAF: Verification-based integrity assurance framework for MapReduce,” Proc. of the IEEE Int. Conf. on Cloud Computing (CLOUD), pp. 300-307, 2011.
  • [6] Y. Wang, J. Wei, M. Srivatsa, Y. Duan, and W. Du, “IntegrityMR: Integrity assurance framework for big data analytics and management applications,” Proc. of the IEEE Int. Conf. on Big Data (Big Data), pp. 33-40, IEEE, 2013.
  • [7] H. Ulusoy, P. Colombo, E. Ferrari, M. Kantarcioglu, and E. Pattuk, “GuardMR: Fine-grained security policy enforcement for MapReduce systems,” Proc. of the 10th ACM Symposium on Information, Computer and Communications Security, pp. 285-296, ACM, 2015.
  • [8] H. Ulusoy, M. Kantarcioglu, K. Hamlen, and E. Pattuk, “Vigiles: Fine-grained access control for MapReduce systems,” Proc. of the IEEE Int. Conf. on Big Data (Big Data), 2014.
  • [9] Solanki, W. Zhu, I. Yen, F. Bastani, and E. Rezvani, “Multitenant Access and Information Flow Control for SaaS,” Proc. of the IEEE Int. Conf. on Web Services (ICWS), pp. 99-106, 2016.
  • [10] H. Ulusoy, M. Kantarcioglu, E. Pattuk, and L. Kagal, “AccountableMR: Toward accountable MapReduce systems,” Proc. of the 2015 IEEE Int. Conf. on Big Data (Big Data), pp. 451-460, 2015.
  • [11] R. Ikeda, H. Park, and J. Widom, “Provenance for generalized map and reduce workflows,” Proc. of the 5th Biennial Conf. on Innovative Data Systems Research (CIDR’11), California, USA, 2011.
  • [12] E. Yoon and A. Squicciarini, “Toward detecting compromised MapReduce workers through log analysis,” Proc. of the 14th IEEE/ACM Int. Symposium on Cluster, Cloud and Grid Computing (CCGrid), pp. 41-50, 2014.
  • [13] Liao and A. Squicciarini, “Towards provenance-based anomaly detection in MapReduce,” Proc. of the 15th IEEE/ACM Int. Symposium on Cluster, Cloud and Grid Computing (CCGrid), pp. 647-656, 2015.
  • [14] Alabi, J. Beckman, M. Dark, and J. Springer, “Toward a data spillage prevention process in Hadoop using data provenance,” Proc. of the 2015 Workshop on Changing Landscapes in HPC Security, pp. 9-13, ACM, 2015.
  • [15] H. Ulusoy, M. Kantarcioglu, B. Thuraisingham, and L. Khan, “Honeypot based unauthorized data access detection in MapReduce systems,” Proc. of the 2015 IEEE Int. Conf. on Intelligence and Security Informatics (ISI), pp. 126-131, 2015.
  • [16] Shah, B. Paulovicks, and P. Zerfos, “Data-at-rest security for Spark,” Proc. of the 2016 IEEE Int. Conf. on Big Data (Big Data), pp. 1464-1473, 2016.
  • [17] M. Elsayed and M. Zulkernine, “IFCaaS: Information Flow Control as a Service for Cloud Security,” Proc. of the 11th Int. Conf. on Availability, Reliability and Security (ARES), Salzburg, Austria, pp. 211-216, 2016. doi: 10.1109/ARES.2016.27
  • [18] D. Eadline, Hadoop 2 Quick-Start Guide: Learn the Essentials of Big Data Computing in the Apache Hadoop 2 Ecosystem, Addison-Wesley Professional, 2015.
  • [19] M. Elsayed and M. Zulkernine, “A Classification of Intrusion Detection Systems in the Cloud,” IPSJ Journal of Information Processing, vol. 23, no. 4, pp. 392-401, 2015.
  • [20] Y. Zhang, C. Xu, X. Liang, et al., “Efficient public verification of data integrity for cloud storage systems from indistinguishability obfuscation,” IEEE Transactions on Information Forensics and Security, vol. 12, no. 3, pp. 676-688, 2017.
  • [21] Li, L. Zhang, J. K. Liu, H. Qian, and Z. Dong, “Privacy-preserving public auditing protocol for low-performance end devices in cloud,” IEEE Transactions on Information Forensics and Security, vol. 11, no. 11, pp. 2572-2583, 2016.
  • [22] Yu and H. Wang, “Strong key-exposure resilient auditing for secure cloud storage,” IEEE Transactions on Information Forensics and Security, vol. 12, no. 8, pp. 1931-1940, 2017.
  • [23] Wang, Q. Wu, B. Qin, et al., “Identity-based data outsourcing with comprehensive auditing in clouds,” IEEE Transactions on Information Forensics and Security, vol. 12, no. 4, pp. 940-952, 2017.
  • [24] Wang, D. He, and S. Tang, “Identity-based proxy-oriented data uploading and remote data integrity checking in public cloud,” IEEE Transactions on Information Forensics and Security, vol. 11, no. 6, pp. 1165-1176, 2016.
  • [25] Yu, H. A. A. Man, G. Ateniese, et al., “Identity-based remote data integrity checking with perfect data privacy preserving for cloud storage,” IEEE Transactions on Information Forensics and Security, vol. 12, no. 4, pp. 767-778, 2017.
  • [26] Wang, S. S. M. Chow, Q. Wang, K. Ren, and W. Lou, “Privacy-Preserving Public Auditing for Secure Cloud Storage,” IEEE Transactions on Computers, vol. 62, no. 2, pp. 362-375, 2013.
  • [27] Yuan and S. Yu, “Public Integrity Auditing for Dynamic Data Sharing with Multiuser Modification,” IEEE Transactions on Information Forensics and Security, vol. 10, no. 8, pp. 1717-1726, 2015.
  • Gitanjali J., Banu S.N., Indumathi J., Uma G.V.(2008), ‘A Panglossian Solitary-Skim Sanitization for Privacy Preserving Data Archaeology’, International Journal of Electrical and Power Engineering. 2, No. 3, pp.154 -165.
  • Gitanjali J., Md.Rukunuddin Ghalib., Murugesan K., Indumathi J., Manjula D. (2009) ,‘ An Object-Oriented Scaffold Premeditated For Privacy Preserving Data Mining of Outsourced Medical Data’,International Journal  of Software Engineering and Its Applications , Accepted for publication. In press. 2009.
  • Gitanjali J., Md.Rukunuddin Ghalib., Murugesan K., Indumathi J., Manjula D. (2009) ,‘ A Hybrid Scheme Of Data Camouflaging For Privacy Preserved Electronic Copyright  Publishing Using Cryptography And Watermarking Technologies’,  International Journal of Security and Its Applications.
  • Gitanjali J., Shaik Nusrath Banu, Geetha Mary A., Indumathi J., Uma G.V. (2007), ‘An Agent Based Burgeoning Framework for Privacy Preserving Information Harvesting Systems’, International Journal of Computer Science and Network Security, Vol. 7, No. 11, pp. 268-276.
  • Indumathi J. (2012), ‘A Generic Scaffold Housing The Innovative Modus Operandi For Selection Of The Superlative Anonymisation Technique For Optimized Privacy Preserving Data Mining’, Chapter 6 of Data Mining Applications in Engineering and Medicine, edited by Adem Karahoca, InTech, ISBN 9789535107200, pp. 133-156.
  • Indumathi J. (2013a), ‘Amelioration of Anonymity Modus Operandi for Privacy Preserving Data Publishing’, Chapter 7 of Network Security Technologies: Design and Applications, edited by Abdelmalek Amine (Tahar Moulay University, Algeria), Otmane Ait Mohamed (Concordia University, USA), and Boualem Benatallah (University of New South Wales, Australia), November 2013, pp. 96-107.
  • Indumathi J., (2013b), “An Enhanced Secure Agent-Oriented Burgeoning Integrated Home Tele Health Care Framework for the Silver Generation”, J. Advanced Networking and Applications Volume: 04, Issue: 04, Pages: 16-21, Special Issue on “Computational Intelligence – A Research Perspective” held on “21st -22nd Feburary, 2013”
  • Indumathi J., (2013c), “State-of-the-Art in Reconstruction-Based Modus Operandi for Privacy Preserving Data Dredging”, J. Advanced Networking and Applications Volume: 04, Issue: 04, Pages: 9-15, Special Issue on “Computational Intelligence – A Research Perspective” held on “21st -22nd Feburary, 2013”
  • Indumathi J., Uma G.V.(2007a), ‘Customized Privacy Preservation Using Unknowns to Stymie Unearthing Of Association Rules’, Journal of Computer Science, 3, No. 12, pp. 874-881.
  • Indumathi J., Uma G.V.(2007b), ‘Using Privacy Preserving Techniques to Accomplish a Secure Accord’, International Journal of Computer Science and Network Security, 7, No.8, pp. 258-266.
  • Indumathi J., Uma G.V.(2008a), ‘A Bespoked Secure Framework for an Ontology-Based Data-Extraction System’, Journal of Software Engineering, Vol. 2, No. pp. 1-13.
  • Indumathi J., Uma G.V.(2008b), ‘A New flustering approach for Privacy Preserving Data Fishing in Tele-Health Care Systems’, International Journal of Healthcare Technology and Management. Special Issue on: “Tele-Healthcare System Implementation, Challenges and Issues.”9 No.5-6, pp.495 – 516(22).
  • Indumathi J., Uma G.V.(2008c), ‘A Novel Framework for Optimized Privacy Preserving Data Mining Using the innovative Desultory Technique’, International Journal of Computer Applications in Technology ; Special Issue on: “Computer Applications in Knowledge-Based Systems”. In press. 2008. Vol.35 Nos.2/3/4, pp.194 – 203.
  • Indumathi J., Uma G.V.(2008d), ‘An Aggrandized Framework For Genetic Privacy Preserving Pattern Analysis Using Cryptography And Contravening – Conscious Knowledge Management Systems’, International Journal of Molecular Medicine and Advance Sciences. 4, No. 1, pp.33-40.
  • Murugesan K., Gitanjali J., Indumathi J., Manjula D.(2009),‘Sprouting Modus Operandi for Selection of the Best PPDM Technique for Health Care Domain’, International Journal Conference in recent trends in computer science. Vol.1, No.1, pp. 627-629.
  • Murugesan K., Indumathi J., Manjula D. (2010a), “An Optimised Intellectual Agent Based Secure Decision System For Health Care”, International Journal of Engineering Science and Technology Vol. 2(8), 2010, 3662-3675
  • Murugesan K., Indumathi J., Manjula D. (2010b), “A Framework for an Ontology-Based Data-Gleaning and Agent Based Intelligent Decision Support PPDM System Employing Generalization Technique for Health Care”, International Journal on Computer Science and Engineering Vol. 02, No. 05, 2010, 1588-1596
  • Prakash D., Murugesan K., Indumathi J., Manjula D. (2009), ‘A Novel Cardiac Attack Prediction and Classification using Supervised Agent Techniques’, In the CiiT International Journal of Artificial Intelligent Systems and Machine Learning, May 2009. Vol.1, No.2, P.59.
  • Satheesh Kumar K., Indumathi J., Uma G.V. (2008), ‘Design of Smoke Screening Techniques for Data Surreptitiousness in Privacy Preserving Data Snooping Using Object Oriented Approach and UML’, IJCSNS International Journal of Computer Science and Network Security, Vol.8 No.4, pp.106 – 115.
  • Vasudevan V., Sivaraman N., SenthilKumar S., Muthuraj R., Indumathi J., Uma G.V. (2007), ‘A Comparative Study of SPKI/SDSI and K-SPKI/SDSI Systems’, Information Technology Journal, 6(8), pp. 1208-1216.