Browsing by Author "Pallickara, Shrideep, committee member"
Item Open Access: A locality-aware scientific workflow engine for fast-evolving spatiotemporal sensor data (Colorado State University. Libraries, 2017)
Kachikaran Arulswamy, Johnson Charles, author; Pallickara, Sangmi Lee, advisor; Pallickara, Shrideep, committee member; von Fischer, Joseph, committee member

Discerning knowledge from voluminous data involves a series of data manipulation steps. Scientists typically compose and execute workflows for these steps using scientific workflow management systems (SWfMSs). SWfMSs have been developed for several research communities including, but not limited to, bioinformatics, biology, astronomy, computational science, and physics. Parallel execution of workflows has been widely employed in SWfMSs by exploiting the storage and computing resources of grid and cloud services. However, none of these systems have been tailored for the needs of spatiotemporal analytics on real-time sensor data with high arrival rates. This thesis demonstrates the development and evaluation of a target-oriented workflow model that enables a user to specify dependencies among the workflow components, including data availability. The underlying spatiotemporal data dispersion and indexing scheme provides fast data search and retrieval to plan and execute computations comprising the workflow. This work includes a scheduling algorithm that targets minimizing data movement across machines while ensuring fair and efficient resource allocation among multiple users. The study includes empirical evaluations performed on the Google cloud.
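The scheduling algorithm itself is not given in the abstract; the sketch below illustrates one locality-aware placement heuristic of the general kind described, greedily preferring the machine that already holds the most input data. All names (place_task, the node labels) are hypothetical, not from the thesis.

def place_task(task_inputs, machine_slots):
    """Greedy locality-aware placement: among machines with free execution
    slots, pick the one already holding the most input bytes for this task.
    task_inputs: dict machine -> bytes of this task's input stored there.
    machine_slots: dict machine -> number of free execution slots."""
    candidates = [m for m in machine_slots if machine_slots[m] > 0]
    if not candidates:
        raise RuntimeError("no free slots")
    # Most local bytes first; falls back to any free machine (0 local bytes).
    best = max(candidates, key=lambda m: task_inputs.get(m, 0))
    machine_slots[best] -= 1
    return best

placement = place_task({"node1": 4_000_000, "node2": 9_000_000},
                       {"node1": 2, "node2": 1, "node3": 4})
print(placement)  # "node2": most of the input is already there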
Item Open Access: A questionnaire integration system based on question classification and short text semantic textual similarity (Colorado State University. Libraries, 2018)
Qiu, Yu, author; Pallickara, Sangmi Lee, advisor; Pallickara, Shrideep, committee member; Li, Kaigang, committee member

Semantic integration from heterogeneous sources involves a series of NLP tasks. Existing research has focused mainly on measuring two paired sentences. However, to find possible identical texts between two datasets, the sentences are not paired. To avoid pair-wise comparison, this thesis proposes a semantic similarity measuring system equipped with a pre-categorization module. It applies a hybrid question classification module, which subdivides all texts into coarse categories. The sentences are then paired from these subcategories. The core task is to detect identical texts between two sentences, which relates to the semantic textual similarity task in the NLP field. We built a short text semantic textual similarity measuring module. It combines conventional NLP techniques, including both semantic and syntactic features, with a Recurrent Convolutional Neural Network to form an ensemble model. We also conducted a set of empirical evaluations. The results show that our system possesses a degree of generalization ability, and it performs well on heterogeneous sources.

Item Open Access: Adaptive spatiotemporal data integration using distributed query relaxation over heterogeneous observational datasets (Colorado State University. Libraries, 2018)
Mitra, Saptashwa, author; Pallickara, Sangmi Lee, advisor; Pallickara, Shrideep, committee member; Li, Kaigang, committee member

Combining data from disparate sources enhances the opportunity to explore different aspects of the phenomena under consideration. However, there are several challenges in doing so effectively, including, inter alia, heterogeneity in data representation and format, collection patterns, and the integration of foreign data attributes in a ready-to-use condition. In this study, we propose a scalable query-oriented data integration framework that provides estimations for spatiotemporally aligned data points. We have designed Confluence, a distributed data integration framework that dynamically generates accurate interpolations for the targeted spatiotemporal scopes along with an estimate of the uncertainty involved. Confluence orchestrates computations to evaluate spatial and temporal query joins and to interpolate values. Our methodology facilitates distributed query evaluations with a dynamic relaxation of query constraints. Query evaluations are locality-aware, and we leverage model-based dynamic parameter selection to provide accurate estimation for data points. We have included empirical benchmarks that profile the suitability of our approach in terms of accuracy, latency, and throughput at scale.
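As a point of reference for the interpolation-with-uncertainty idea above, the following sketch uses inverse-distance weighting with a weighted sample variance as a crude uncertainty estimate. Confluence's actual model-based estimator is not reproduced here; all names are illustrative.

def idw_estimate(neighbors, power=2.0):
    """Inverse-distance-weighted estimate at a queried spatiotemporal point.
    neighbors: list of (distance, value) pairs for nearby observations.
    Returns (estimate, weighted sample variance as an uncertainty proxy)."""
    weights = [1.0 / (d ** power + 1e-9) for d, _ in neighbors]
    total = sum(weights)
    estimate = sum(w * v for w, (_, v) in zip(weights, neighbors)) / total
    variance = sum(w * (v - estimate) ** 2
                   for w, (_, v) in zip(weights, neighbors)) / total
    return estimate, variance

est, var = idw_estimate([(0.5, 21.3), (1.2, 19.8), (2.0, 20.4)])
print(est, var)  # estimate dominated by the closest observation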
Item Open Access: Counting isogeny classes of Drinfeld modules over finite fields via Frobenius distributions (Colorado State University. Libraries, 2024)
Bray, Amie M., author; Achter, Jeffrey, advisor; Gillespie, Maria, committee member; Hulpke, Alexander, committee member; Pallickara, Shrideep, committee member; Pries, Rachel, committee member

Classically, the size of an isogeny class of an elliptic curve -- or more generally, a principally polarized abelian variety -- over a finite field is given by a suitable class number. Gekeler expressed the size of an isogeny class of an elliptic curve over a prime field in terms of a product over all primes of local density functions. These local density functions are what one might expect given a random matrix heuristic. In his proof, Gekeler shows that the product of these factors gives the size of an isogeny class by appealing to class numbers of imaginary quadratic orders. Achter, Altug, Garcia, and Gordon generalized Gekeler's product formula to higher dimensional abelian varieties over prime power fields without the calculation of class numbers. Their proof uses the formula of Langlands and Kottwitz that expresses the size of an isogeny class in terms of adelic orbital integrals. This dissertation focuses on the function field analog of the same problem. Due to Laumon, one can express the size of an isogeny class of Drinfeld modules over finite fields via adelic orbital integrals. Meanwhile, Gekeler proved a product formula for rank two Drinfeld modules using a similar argument to that for elliptic curves. We generalize Gekeler's formula to higher rank Drinfeld modules by the direct comparison of Gekeler-style density functions with orbital integrals.

Item Open Access: Detecting advanced botnets in enterprise networks (Colorado State University. Libraries, 2017)
Zhang, Han, author; Papadopoulos, Christos, advisor; Ray, Indrakshi, committee member; Pallickara, Shrideep, committee member; Hayne, Stephen C., committee member

A botnet is a network composed of compromised computers that are controlled by a botmaster through a command and control (C&C) channel. Botnets are more destructive than common viruses and malware because they control the resources of many compromised computers. Botnets provide a very important platform for attacks, such as Distributed Denial-of-Service (DDoS), spamming, scanning, and many more. To foil detection systems, botnets began to use various evasion techniques, including encrypted communications, dynamically generated C&C domains, and more. We call botnets that use such evasion techniques advanced botnets. In this dissertation, we introduce various algorithms and systems to detect advanced botnets in enterprise-like network environments. Encrypted botnets introduce several problems to detection. First, to enable research in detecting encrypted botnets, researchers need samples of encrypted botnet traces with ground truth, which are very hard to get. Traces that are available are not customizable, which prevents testing under various controlled scenarios. To address this problem we introduce BotTalker, a tool that can be used to generate customized encrypted botnet communication traffic. BotTalker emulates the actions a bot would take to encrypt communication. To the best of our knowledge, BotTalker is the first work that provides users customized encrypted botnet traffic. The second problem introduced by encrypted botnets is that Deep Packet Inspection (DPI)-based security systems are foiled. We measure the effects of encryption on three security systems, including Snort, Suricata and BotHunter (BH), using the encrypted botnet traffic generated by BotTalker. The results show that encryption foils these systems greatly. Then, we introduce a method to detect encrypted botnet traffic based on the fact that encryption increases data's entropy. In particular, we present two high-entropy (HE) classifiers and add one of them to enhance BH by utilizing the other detectors it provides. By doing this, the HE classifier restores BH's ability to detect bots, even when they use encryption. Entropy calculation at line speed is expensive, especially when the flows are very long. To deal with this issue, we introduce two algorithms to classify flows as HE by looking at only part of a flow. In particular, we classify a flow as HE or low entropy (LE) by considering only the first M packets of the flow. These early HE classifiers are used in two ways: (a) to improve the speed of bot detection tools, and (b) as a filter to reduce the load on an Intrusion Detection System (IDS). We implement the filter as a preprocessor in Snort. The results show that by using the first 15 packets of a flow, the traffic delivered to the IDS is reduced by more than 50% while maintaining more than 99.9% of the original alerts. Comparing our traffic reduction scheme with other work, we find that other approaches need to inspect at least 13 times more packets than ours or miss about 70 times more alerts. To improve the resiliency of communication between bots and C&C servers, botmasters began utilizing Domain Generation Algorithms (DGA). The DGA technique avoids static blacklists and prevents security specialists from registering the C&C domain before the botmaster does. We introduce BotDigger, a system that detects DGA-based bots using DNS traffic without a priori knowledge of the domain generation algorithm. BotDigger utilizes a chain of evidence, including quantity, temporal, and linguistic evidence, to detect an individual bot by monitoring traffic only at the DNS servers of a single network. We evaluate BotDigger's performance using traces from two DGA-based botnets, Kraken and Conficker, as well as a one-week DNS trace captured from our university and three traces collected from our research lab. Our results show that BotDigger detects all the Kraken bots and 99.8% of Conficker bots with very low false positives.
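The early high-entropy (HE) classification described above can be illustrated with Shannon entropy computed over the payloads of a flow's first M packets. This is a minimal sketch, not the dissertation's tuned classifier; the threshold and the M default are assumed values.

import math
from collections import Counter

def byte_entropy(payload: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not payload:
        return 0.0
    counts = Counter(payload)
    n = len(payload)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def is_high_entropy(first_packets, threshold=7.0):
    """Classify a flow as HE using only the payloads of its first M packets,
    so the decision can be made early instead of scanning the whole flow."""
    payload = b"".join(first_packets)
    return byte_entropy(payload) >= threshold

# Encrypted-looking payloads approach 8 bits/byte; plaintext sits much lower.
print(is_high_entropy([bytes(range(256)) * 4]))  # True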
Item Open Access: Determining disease outbreak influence from voluminous epidemiology data on enhanced distributed graph-parallel system (Colorado State University. Libraries, 2017)
Shah, Naman, author; Pallickara, Sangmi Lee, advisor; Pallickara, Shrideep, committee member; Turk, Daniel E., committee member

Historically, catastrophe has resulted from large-scale epidemiological outbreaks in livestock populations. Efforts to prepare for these inevitable disasters are critical, and these efforts primarily involve the efficient use of limited available resources. Therefore, determining the relative influence of the entities involved in large-scale outbreaks is mandatory. Planning for outbreaks often involves executing compute-intensive disease spread simulations. To capture the probabilities of various outcomes, these simulations are executed several times over a collection of representative input scenarios, producing voluminous data. The resulting datasets contain valuable insights, including sequences of events that lead to extreme outbreaks. However, discovering and leveraging such information is also computationally expensive. This thesis proposes a distributed approach for aggregating and analyzing voluminous epidemiology data to determine the influence of the entities in a disease outbreak using the PageRank algorithm. Using the Disease Transmission Network (DTN) established in this research, planners or analysts can accomplish effective allocation of limited resources, such as vaccinations and field personnel, by observing the relative influence of the entities. To improve the performance of the analysis execution pipeline, we also propose an extension to the Apache Spark GraphX distributed graph-parallel system.
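The influence measure named above is PageRank; a minimal power-iteration sketch over a toy transmission network follows. The thesis's implementation runs on Apache Spark GraphX, not pure Python, and the node names here are invented.

def pagerank(adj, damping=0.85, iters=50):
    """adj: dict node -> list of nodes it can transmit disease to."""
    nodes = list(adj)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        nxt = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n, outs in adj.items():
            if outs:
                share = damping * rank[n] / len(outs)
                for m in outs:
                    nxt[m] += share
            else:  # dangling node: spread its rank uniformly
                for m in nodes:
                    nxt[m] += damping * rank[n] / len(nodes)
        rank = nxt
    return rank

# Premises ranked by relative influence in a toy transmission network.
print(pagerank({"farm_a": ["farm_b", "market"], "farm_b": ["market"], "market": []}))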
Item Open Access: Embedding based clustering of time series data using dynamic time warping (Colorado State University. Libraries, 2022)
Mendis, R. A. C. Laksheen, author; Pallickara, Sangmi Lee, advisor; Pallickara, Shrideep, committee member; Hayne, Stephen, committee member

Voluminous time-series observational data impose challenges pertaining to storage and analytics. Identifying patterns in such climate time-series data is critical for many geospatial applications. Over recent years, clustering has become a key computational technique for identifying patterns/clusters. However, data with complex structures and high dimensions can lead to uninformative clusters and hinder the quality of clustering. In this research, we use state-of-the-art autoencoders with LSTMs, Bidirectional LSTMs, and GRUs to learn highly non-linear mapping functions by training the networks with subsequences of time series to perform data reconstruction. Next, we extract the trained encoders to generate embeddings, which are lightweight. These embeddings are more space efficient than the original time series data and require less computational power and fewer resources for further processing. In the final step of clustering, instead of using common distance-based metrics like Euclidean distance, we use DTW, an algorithm for computing similarity between time series that ignores variations in speed, to calculate similarity between the embeddings during the application of the k-Means algorithm. Based on Silhouette scores, this method generates better clusters than other reduction techniques.
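For reference, the DTW similarity used in place of Euclidean distance above can be computed with the classic dynamic program. This quadratic-time sketch omits windowing or other speedups the thesis may employ and treats each embedding as a plain numeric sequence.

def dtw(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]

# 0.0: identical shapes despite different pacing, which Euclidean distance
# would penalize.
print(dtw([0.1, 0.5, 0.9], [0.1, 0.1, 0.5, 0.9]))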
Item Open Access: Enabling autoscaling for in-memory storage in cluster computing framework (Colorado State University. Libraries, 2019)
Shrestha, Bibek Raj, author; Pallickara, Sangmi Lee, advisor; Pallickara, Shrideep, committee member; Hayne, Stephen C., committee member

IoT enabled devices and observational instruments continuously generate voluminous data. A large portion of these datasets is delivered with the associated geospatial locations. The increased volumes of geospatial data, alongside the emerging geospatial services, pose computational challenges for large-scale geospatial analytics. We have designed and implemented STRETCH, an in-memory distributed geospatial storage system that preserves spatial proximity and enables proactive autoscaling for frequently accessed data. STRETCH stores data with a delayed data dispersion scheme that incrementally adds data nodes to the storage system. We have devised an autoscaling feature that proactively repartitions data to alleviate computational hotspots before they occur. We compared the performance of STRETCH with Apache Ignite, and the results show that STRETCH provides up to 3 times the throughput when the system encounters hotspots. STRETCH is built on Apache Spark and Ignite and interacts with them at runtime.

Item Open Access: GeoLens: enabling interactive visual analytics over large-scale, multidimensional geospatial datasets (Colorado State University. Libraries, 2015)
Koontz, Jared, author; Pallickara, Sangmi, advisor; Pallickara, Shrideep, committee member; Schumacher, Russ, committee member

With the rapid increase of scientific data volumes, interactive tools that enable effective visual representation for scientists are needed. This is critical when scientists are manipulating voluminous datasets and especially when they need to explore datasets interactively to develop their hypotheses. In this paper, we present an interactive visual analytics framework, GeoLens. GeoLens provides fast and expressive interactions with voluminous geospatial datasets. We provide an expressive visual query evaluation scheme to support advanced interactive visual analytics techniques, such as brushing and linking. To achieve this, we designed and developed a geohash-based image tile generation algorithm that automatically adjusts the range of data to access based on the minimum acceptable size of the image tile. In addition, we have also designed an autonomous histogram generation algorithm that generates histograms of user-defined data subsets that do not have pre-computed data properties. Using our approach, applications can generate histograms of datasets containing millions of data points with sub-second latency. The work builds on our visual query coordinating scheme that evaluates geospatial queries and orchestrates data aggregation in a distributed storage environment while preserving data locality and minimizing data movement. This paper includes empirical benchmarks of our framework encompassing a billion-file dataset published by the National Climatic Data Center.
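The geohash-based tile idea above can be pictured as grouping observations by geohash cell, where a coarser precision widens the range of data behind each tile. This sketch assumes a third-party library exposing encode(lat, lon, precision), e.g., python-geohash; it is not GeoLens code.

from collections import defaultdict
import geohash  # e.g., the python-geohash package (an assumption)

def bin_points(points, precision):
    """Group (lat, lon, value) observations into geohash cells; lowering
    'precision' yields larger tiles, so a tile generator can coarsen until
    each tile aggregates enough points to render."""
    tiles = defaultdict(list)
    for lat, lon, value in points:
        tiles[geohash.encode(lat, lon, precision)].append(value)
    return {cell: sum(vals) / len(vals) for cell, vals in tiles.items()}

tiles = bin_points([(40.57, -105.08, 21.3), (40.58, -105.09, 19.9)], precision=5)
print(tiles)  # one ~5 km cell averaging both observations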
Item Open Access: Hermes - scalable real-time BGP broker with routing streams integration (Colorado State University. Libraries, 2011)
Belyaev, Kirill Alexandrovich, author; Massey, Daniel F., advisor; Papadopoulos, Christos, committee member; Pallickara, Shrideep, committee member; Hayne, Stephen C., committee member

BGP is the de facto inter-domain routing protocol of the Internet, and understanding BGP is critically important for current Internet research and operations. Current Internet research is heavily dependent upon the availability of reliable, up-to-date BGP data sources and is often evaluated using data drawn from the operational Internet. Real BGP data supports a wide range of efforts, ranging from understanding the Internet topology to building more accurate simulations for network protocols. To study and address Internet research challenges, accessible BGP data is needed. Fortunately, a number of BGP monitoring projects have been deployed for BGP data provision. However, experience over a number of years has also indicated some major limitations in the current BGP data collection model, the most dramatic being the inability to deliver real-time data and to process and analyze this data quickly in a flexible and efficient manner. This thesis presents the design and implementation of a new tool for analyzing BGP routing data in real time: the Hermes BGP Broker. Hermes is built upon the solid foundation of a related project, BGPmon [CSU], a BGP aggregation and monitoring platform that uses a publish/subscribe overlay network to provide real-time access to vast numbers of peers and clients. All routing events are consolidated into a single XML stream. XML makes it possible to add features such as labeling updates, which allows clients to easily identify useful data, along with other related data structuring. Hermes, as the broker for BGPmon, represents the next generation of route monitoring and analysis tools that bring routing data to the level of end-user applications. The main contribution of this thesis is the design and implementation of a new BGP route analysis platform that can be used extensively in both research and operational communities. Our work on Hermes has delivered a system that analyzes a continuous XML data stream of BGP updates in real time and selects non-duplicate messages that match a specified regular expression pattern. Besides its effective filtering mechanism, Hermes scales well to large numbers of concurrent stream subscribers. Its performance under intensive benchmarking has been evaluated and found suitable for real-world deployment under heavy load with a large number of concurrent clients. The system is also able to distribute the filtering computations among a number of nodes and form Hermes data stream meshes of various topologies.

Item Open Access: Leveraging structural-context similarity of Wikipedia links to predict twitter user locations (Colorado State University. Libraries, 2017)
Huang, Chuanqi, author; Pallickara, Sangmi Lee, advisor; Pallickara, Shrideep, committee member; Hayne, Stephen C., committee member

Twitter is a widely used social media service. Several efforts have targeted understanding the patterns of information dissemination underlying this social network. A user's location is one of the most important information items for analyzing content. However, location information tends to be unavailable because most users do not (want to) include geo-tags in their tweets. To predict a user's location, existing approaches require voluminous training datasets of geo-tagged tweets. However, some of the characteristics of tweets, such as compact, non-traditional linguistic expressions, have posed significant challenges when applying model-fitting approaches. In this thesis, we propose a novel framework for predicting the location of a social media user by leveraging structural-context similarity over Wikipedia links. We measure SimRank scores between pages over the Wikipedia dump dataset and build a knowledge base, mapping location information (e.g., cities and states) to related vocabularies along with the likelihood of these mappings. Our estimates evolve as the user's tweet stream grows. We have implemented this framework using Apache Storm to observe real-time tweets. Finally, our framework provides a list of ranked "probable" cities based on the distances between candidate locations and their weights. This thesis includes empirical evaluations that demonstrate performance in line with current state-of-the-art location prediction approaches.
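SimRank, the structural-context similarity measure used above, is available off the shelf; the toy example below uses networkx's simrank_similarity on a tiny link graph. The thesis computes SimRank over the full Wikipedia dump, not with networkx, so this is purely a reference point.

import networkx as nx

# Toy Wikipedia-style link graph (undirected here for simplicity).
G = nx.Graph([
    ("Fort Collins", "Colorado"),
    ("Denver", "Colorado"),
    ("Fort Collins", "Colorado State University"),
])

sim = nx.simrank_similarity(G)
# Nonzero: both pages share the structural context of linking to "Colorado".
print(sim["Fort Collins"]["Denver"])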
Item Open Access: Management of internet-based service quality (Colorado State University. Libraries, 2012)
Yan, He, author; Massey, Daniel, advisor; Papadopoulos, Christos, committee member; Pallickara, Shrideep, committee member; Turk, Dan, committee member; Ge, Zihui, committee member; Yates, Jennifer, committee member

An increasingly diverse set of services, including content distribution networks (CDN), Internet games, streaming video, online banking, IPTV, VPN, cloud computing, and VoIP, is built on top of the Internet. For most of these Internet-based services, best-effort delivery is no longer an acceptable mode of operation, as ultra-high reliability and performance are demanded to meet stringent service-level requirements. In this dissertation, we focus on the research problem of how to manage Internet-based service quality in an efficient and proactive manner from a service provider's point of view. Managing Internet-based service quality is extremely challenging due to its massive scale, complicated topology, high protocol complexity, ever-changing software and hardware environments, and multiple administrative domains. We propose to look into this problem from two views (the user view and the network view) and design a novel infrastructure that consists of three systems (Argus, G-RCA and TowerScan) to enable managing Internet-based service quality from both views. We deployed our infrastructure in a tier-1 ISP that provides various Internet-based services, and it has proven to be a highly effective way to manage the quality of Internet-based services.

Item Open Access: Measuring named data networks (Colorado State University. Libraries, 2020)
Fan, Chengyu, author; Partridge, Craig, advisor; Papadopoulos, Christos, advisor; Pallickara, Shrideep, committee member; Pallickara, Sangmi, committee member; Luo, J. Rockey, committee member

Named Data Networking (NDN) is a promising information-centric networking (ICN) Internet architecture that addresses content directly rather than addressing servers. NDN provides new features, such as content-centric security, stateful forwarding, and in-network caches, to better satisfy the needs of today's applications. After many years of technological research and experimentation, the community has started to explore the deployment path for NDN. One NDN deployment challenge is measurement. Unlike IP, which has a suite of measurement approaches and tools, NDN has only a few. NDN routing and forwarding are based on name prefixes that do not refer to individual endpoints. While rich NDN functionalities facilitate data distribution, they also break traditional end-to-end probing-based measurement methods. In this dissertation, we present our work to investigate NDN measurements and fill some research gaps in the field. The thesis of this dissertation is that we can capture a substantial amount of useful and actionable measurements of NDN networks from end hosts. We start by comparing IP and NDN to propose a conceptual framework for NDN measurements. We claim that NDN can be seen as a superset of IP. NDN supports functionalities similar to those provided by IP, but it has unique features that facilitate data retrieval. The framework helps identify that NDN lacks measurements in various respects. This dissertation focuses on investigating active measurements from end hosts. We present our studies in two directions to support the thesis statement. We first present a study that leverages the similarities to replicate IP approaches in NDN networks. We show the first work to measure the NDN-DPDK forwarder, a high-speed NDN forwarder designed and implemented by the National Institute of Standards and Technology (NIST), in a real testbed. The results demonstrate that Data payload sizes dominate forwarding performance, and that efficiently using every fragment improves goodput. We then present the first work to replicate packet dispersion techniques in NDN networks. Based on the findings in the NDN-DPDK forwarder benchmark, we devise techniques to measure interarrivals for Data packets. The results show that the techniques successfully estimate the capacity on end hosts when 1 Gbps network cards are used. Our measurements also indicate that the NDN-DPDK forwarder introduces variance in Data packet interarrivals. We identify the potential bottlenecks and the possible causes of the variance. We then address NDN-specific measurements: measuring the caching state in NDN networks from end hosts. We propose a novel method to extract fingerprints for various caching decision mechanisms. Our simulation results demonstrate that the method can detect caching decisions in a few rounds. We also show that the method is not sensitive to cross-traffic and can be deployed on real topologies for caching policy detection.
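The packet dispersion technique referenced above estimates path capacity from the minimum spacing between back-to-back, equal-sized packets. A toy end-host calculation follows; the timings are made up, and this is not the dissertation's measurement code.

def capacity_estimate(arrival_times, packet_bits):
    """Capacity ~= packet size / minimum interarrival ("dispersion") observed
    between consecutive back-to-back packets of equal size."""
    gaps = [t2 - t1 for t1, t2 in zip(arrival_times, arrival_times[1:])]
    return packet_bits / min(gaps)  # bits per second

# 8800-bit (1100-byte) Data packets with a minimum gap of ~70 microseconds
# suggest a ~125 Mbps bottleneck on this toy input.
print(capacity_estimate([0.0, 7.04e-5, 1.50e-4, 2.30e-4], packet_bits=8800))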
Item Open Access: Prediction based scaling in a distributed stream processing cluster (Colorado State University. Libraries, 2020)
Khurana, Kartik, author; Pallickara, Sangmi Lee, advisor; Pallickara, Shrideep, committee member; Carter, Ellison, committee member

The proliferation of IoT sensors and applications has enabled us to monitor and analyze scientific and social phenomena with continuously arriving voluminous data. To provide real-time processing capabilities over streaming data, distributed stream processing engines (DSPEs) such as Apache STORM and Apache FLINK have been widely deployed. These frameworks support computations over large-scale, high-frequency streaming data. However, current on-demand auto-scaling features in these systems may result in inefficient resource utilization, which is closely related to cost effectiveness in popular cloud-based computing environments. We propose ARSTREAM, an auto-scaling computing environment that manages fluctuating throughputs for data from sensor networks while ensuring efficient resource utilization. We have built an Artificial Neural Network model for predicting data processing queues, and this model captures non-linear relationships between data arrival rates, resource utilization, and the size of the data processing queue. If a bottleneck is predicted, ARSTREAM scales out the current cluster automatically for current jobs without halting them at the user level. In addition, ARSTREAM incorporates threshold-based re-balancing to minimize data loss during extreme peak traffic that could not be predicted by our model. Our empirical benchmarks show that ARSTREAM forecasts data processing queue sizes with an RMSE of 0.0429 when tested on real-time data.
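As a rough analogue of the queue-size predictor described above, one can regress queue length on arrival rate and resource utilization with a small feed-forward network. This scikit-learn sketch trains on synthetic data; ARSTREAM's actual model, features, and threshold are not reproduced here.

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# Synthetic training data: [arrival_rate, cpu_util, mem_util] -> queue size,
# with a deliberately non-linear relationship plus noise.
X = rng.uniform(0, 1, size=(500, 3))
y = 100 * X[:, 0] * (1 + 3 * X[:, 1] ** 2) + rng.normal(0, 2, 500)

model = MLPRegressor(hidden_layer_sizes=(16, 16), max_iter=2000, random_state=0)
model.fit(X, y)

# Scale out before the bottleneck occurs, once the predicted queue size
# crosses an (assumed) threshold.
predicted_queue = model.predict([[0.9, 0.8, 0.5]])[0]
if predicted_queue > 150:
    print("bottleneck predicted: trigger scale-out")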
Item Open Access: Secure, accurate, real-time, and heterogeneity-resilient indoor localization with smartphones (Colorado State University. Libraries, 2022)
Tiku, Saideep, author; Pasricha, Sudeep, advisor; Pallickara, Shrideep, committee member; Maciejewski, Anthony, committee member; Siegel, H. J., committee member

The advent of the Global Positioning System (GPS) reformed the global transportation industry and allowed vehicles not only to localize themselves but also to navigate reliably and securely across the world at high speeds. Today, indoor localization is an emerging IoT domain that is poised to reinvent the way we navigate within buildings and subterranean locales, with many benefits, e.g., directing emergency response services after a 911 call to a precise location (with sub-meter accuracy) inside a building, and accurate tracking of equipment and inventory in hospitals, factories, and warehouses. While GPS is the de facto solution for outdoor positioning with a clear sky view, there is no prevailing technology for GPS-deprived areas, including dense city centers, urban canyons, and the interiors of buildings and other covered structures, where GPS signals are severely attenuated or totally blocked and affected by multipath interference. Thus, very different solutions are needed to support localization in indoor locales. Popular solutions for high-accuracy indoor positioning leverage wireless radio signals, such as WiFi, Bluetooth, ultra-wideband (UWB), etc. Due to the existing widespread deployment of WiFi access points (WAPs) in most indoor locales, using WiFi for indoor localization can lead to low-cost solutions. Many localization algorithms that utilize these wireless signals have been proposed, e.g., based on the principles of proximity, trilateration, triangulation, and fingerprinting. Studies have shown that fingerprinting-based algorithms deliver higher accuracy than the alternatives, without stringent synchronization or line-of-sight requirements, and enable greater error resilience in the presence of frequently encountered multipath signal interference effects. A fingerprinting-based approach for indoor localization has two phases. In an offline phase, location-tagged wireless signal signatures, i.e., fingerprints, at known indoor locations are captured along a path and stored in a database. Each fingerprint in the database consists of a location and wireless signal characteristics, e.g., received signal strength (RSSI, which varies as a function of distance from the WAP) from visible WAPs at that location. This phase requires the considerable manual effort of collecting several fingerprints at each location and comes at considerable cost. In the online phase, the RSS observed on the user's mobile device is used to query the fingerprint database and determine location (potentially after some interpolation). Such WiFi-based fingerprinting is a promising building block for low-cost indoor localization with mobile devices. Unfortunately, there are many unaddressed challenges before a viable WiFi fingerprinting based solution can be realized: (i) the algorithms used for matching fingerprints in the online phase have a major impact on accuracy, but the limited CPU/memory/battery resources in mobile devices require careful algorithm design and deployment that can trade off accuracy, energy-efficiency, and performance (localization decision latency); (ii) the diversity of mobile devices poses another challenge, as smartphones from different vendors may have varying device characteristics, leading to different fingerprints being captured at the same location; (iii) security vulnerabilities due to unintentional or intentional WiFi jamming and spoofing attacks can create significant errors that must be overcome; and (iv) short-term and long-term variations in WAP power levels and indoor environments (e.g., adding/moving furniture or equipment, changes in the density of people) can also introduce errors during location estimation that are often corrected only through the expensive collection of new fingerprints. In this dissertation, we propose a new real-time machine learning based framework called SARTHI that addresses all of the abovementioned key challenges towards realizing a viable indoor localization solution with smart mobile devices. To enable energy-efficient enhancements in localization accuracy, SARTHI includes lightweight yet powerful machine learning algorithms with a focus on achieving a balance between battery life and response time. To enable device heterogeneity resilience, we analyzed and identified device-diversity-invariant pattern matching metrics that can be incorporated into a variety of machine learning based indoor localization frameworks. SARTHI also addresses the challenges associated with the security of fingerprinting-based indoor localization frameworks in the presence of spoofing and jamming attacks. This is achieved by devising a novel methodology for training and deploying deep-learning algorithms that are specifically designed to be resilient to the vulnerabilities associated with intentional power-level-variation-based attacks. Finally, SARTHI addresses the challenges associated with short-term and long-term variations in WiFi fingerprints using novel low-overhead relativistic-learning-based deep-learning algorithms that can deliver high accuracy while simultaneously minimizing the fingerprint collection effort in the offline phase.
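The online matching phase described above is often realized as a k-nearest-neighbors lookup in signal space; the minimal sketch below averages the positions of the k closest stored fingerprints. It ignores the heterogeneity, security, and deep-learning components that SARTHI adds, and all names and values are illustrative.

import math

def locate(observed, database, k=3):
    """observed: dict WAP id -> RSSI (dBm). database: list of
    (x, y, fingerprint) records with fingerprints of the same form.
    Returns the centroid of the k closest fingerprints in signal space."""
    def dist(fp):
        waps = set(observed) | set(fp)
        # WAPs missing from either side default to a weak -100 dBm floor.
        return math.sqrt(sum((observed.get(w, -100) - fp.get(w, -100)) ** 2
                             for w in waps))
    nearest = sorted(database, key=lambda rec: dist(rec[2]))[:k]
    x = sum(rec[0] for rec in nearest) / k
    y = sum(rec[1] for rec in nearest) / k
    return x, y

db = [(0, 0, {"ap1": -40, "ap2": -70}), (5, 0, {"ap1": -70, "ap2": -45}),
      (2, 3, {"ap1": -55, "ap2": -60})]
print(locate({"ap1": -50, "ap2": -62}, db, k=2))  # (1.0, 1.5)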
Item Open Access: Separating implementation concerns in stencil computations for semiregular grids (Colorado State University. Libraries, 2013)
Stone, Andrew, author; Strout, Michelle Mills, advisor; Massey, Daniel, committee member; Pallickara, Shrideep, committee member; Randall, David, committee member

In atmospheric and ocean simulation programs, stencil computations occur on semiregular grids where subdomains of the grid are regular (i.e., stored in an array) but boundaries between subdomains connect in an irregular fashion. Implementations of stencils on semiregular grids often have grid connectivity details tangled with stencil computation code. When grid connectivity concerns are tangled with stencil code, it becomes difficult for programmers to modify the code, because any change made will have to account for grid connectivity. In this dissertation we introduce programming abstractions for the class of semiregular grids and describe a prototype Fortran 90+ library called GridWeaver that implements these abstractions. Implementing these abstractions requires determining the communication schedule given an orthogonal specification of the grid decomposition, and handling nodes in the grid with a non-standard number of neighbors. We present solutions to these issues that work within the context of grids used in atmospheric and ocean simulations. We also show that, to maintain performance while still providing a separation of concerns, it is necessary for a source-to-source translator to perform inlining between user code and the GridWeaver runtime library code. We present performance results for stencil computations extracted from the Parallel Ocean Program and the Global Cloud-Resolving Model.
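GridWeaver itself is a Fortran 90+ library; as a conceptual analogue only, the sketch below writes a stencil against an abstract neighbor relation, so seam nodes with a non-standard number of neighbors need no special-case stencil code. The grid and names are invented for illustration.

def smooth(values, neighbors):
    """One Jacobi-style averaging sweep. 'neighbors' abstracts connectivity
    (node id -> list of adjacent node ids), so nodes on irregular subdomain
    seams are handled by the same stencil expression as interior nodes."""
    return {n: sum(values[m] for m in nbrs) / len(nbrs)
            for n, nbrs in neighbors.items()}

# A toy grid where node "c" sits on a seam with 4 neighbors while the
# surrounding nodes have only one each.
neighbors = {"a": ["c"], "b": ["c"], "c": ["a", "b", "d", "e"],
             "d": ["c"], "e": ["c"]}
values = {"a": 1.0, "b": 2.0, "c": 0.0, "d": 3.0, "e": 4.0}
print(smooth(values, neighbors))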
Item Open Access: Systems for characterizing Internet routing (Colorado State University. Libraries, 2018)
Shah, Anant, author; Papadopoulos, Christos, advisor; Pallickara, Shrideep, committee member; Ray, Indrakshi, committee member; Gersch, Joseph, committee member; Luo, J. Rockey, committee member; Bush, Randy, committee member

Today the Internet plays a critical role in our lives; we rely on it for communication, business, and, more recently, smart home operations. Users expect high performance and availability of the Internet. To meet such high demands, all Internet components, including routing, must operate at peak efficiency. However, events that hamper the routing system over the Internet are very common, causing millions of dollars of financial loss, traffic exposed to attacks, or even loss of national connectivity. Moreover, real-time detection and reporting of such events for the public is sparse. A key challenge in addressing such issues is the lack of a methodology to study, evaluate, and characterize Internet connectivity. While many networks operating autonomously have made the Internet robust, the complexity of understanding how users interconnect, interact, and retrieve content has also increased. Characterizing how data is routed, measuring dependency on external networks, and detecting outages quickly using public measurement infrastructures and data sources have become very necessary. From a regulatory standpoint, there is an immediate need for systems that detect and report routing events where a content provider's routing policies may run afoul of state policies. In this dissertation, we design, build, and evaluate systems that leverage existing infrastructure and report routing events in near-real time. In particular, we focus on geographic routing anomalies, i.e., detours; routing failures, i.e., outages; and measuring structural changes in routing policies.

Item Open Access: The future of networking is the future of big data (Colorado State University. Libraries, 2019)
Shannigrahi, Susmit, author; Papadopoulos, Christos, advisor; Partridge, Craig, advisor; Pallickara, Shrideep, committee member; Ray, Indrakshi, committee member; Burns, Patrick J., committee member; Monga, Inder, committee member

Scientific domains such as Climate Science, High Energy Particle Physics (HEP), Genomics, Biology, and many others are increasingly moving towards data-oriented workflows where each of these communities generates, stores, and uses massive datasets that reach into the terabytes and petabytes, and are projected soon to reach exabytes. These communities are also increasingly moving towards a global collaborative model where scientists routinely exchange significant amounts of data. The sheer volume of data and the complexities associated with maintaining, transferring, and using it continue to push the limits of current technologies in multiple dimensions: storage, analysis, networking, and security. This thesis tackles the networking aspect of big-data science. Networking is the glue that binds all the components of modern scientific workflows, and these communities are becoming increasingly dependent on high-speed, highly reliable networks. The network, as the common layer across big-science communities, provides an ideal place for implementing common services. Big-science applications also need to work closely with the network to ensure optimal usage of resources, intelligent routing of requests, and data. Finally, as more communities move towards data-intensive, connected workflows, adopting a service model where the network provides some of the common services reduces not only application complexity but also the need for duplicate implementations. Named Data Networking (NDN) is a new network architecture whose service model aligns better with the needs of these data-oriented applications. NDN's name-based paradigm makes it easier to provide intelligent features at the network layer rather than at the application layer. This thesis shows that NDN can push several standard features to the network. This work is the first attempt to apply NDN in the context of large scientific data; in the process, this thesis touches upon scientific data naming, name discovery, real-world deployment of NDN for scientific data, feasibility studies, and the design of in-network protocols for big-data science.

Item Open Access: Toward effective high-throughput georeferencing over voluminous observational data in the domain of precision agriculture (Colorado State University. Libraries, 2018)
Roselius, Maxwell L., author; Pallickara, Sangmi Lee, advisor; Pallickara, Shrideep, committee member; McKay, John, committee member

Remote sensing of plant traits and their environment facilitates non-invasive, high-throughput monitoring of a plant's physiological characteristics. Effective ingestion of these sensing data into a storage subsystem, while georeferencing phenotyping setups, is key to providing timely access to scientists and modelers. In this thesis, we propose a high-throughput distributed data ingestion framework with support for fine-grained georeferencing. The methodology includes a novel spatial indexing scheme, the nested hash grid, for fine-grained georeferencing of data while conserving memory footprints and ensuring acceptable latency. We include empirical evaluations performed on a commodity machine cluster with up to 1 TB of data. The benchmarks demonstrate the efficacy of our approach.
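The nested hash grid named above is, at heart, a hierarchy of spatial cells; the sketch below nests a fine grid inside a coarse one using simple rounded-coordinate keys. The two-level layout, the cell-key scheme, and all names are assumptions for illustration, not the thesis's actual design.

from collections import defaultdict

def cell(lat, lon, scale):
    """Grid cell key at a given resolution: a coarser scale -> bigger cells."""
    return (round(lat / scale), round(lon / scale))

class NestedHashGrid:
    """Two-level spatial index: coarse cells -> fine cells -> records.
    A sketch of the nesting idea only; a production scheme would add levels
    where data is dense to balance memory footprint against lookup latency."""
    def __init__(self, coarse=1.0, fine=0.01):
        self.coarse, self.fine = coarse, fine
        self.index = defaultdict(lambda: defaultdict(list))

    def insert(self, lat, lon, record):
        self.index[cell(lat, lon, self.coarse)][cell(lat, lon, self.fine)].append(record)

    def query(self, lat, lon):
        return self.index[cell(lat, lon, self.coarse)][cell(lat, lon, self.fine)]

grid = NestedHashGrid()
grid.insert(40.5734, -105.0865, {"plot": 12, "ndvi": 0.71})
print(grid.query(40.5734, -105.0865))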
Item Open Access: Towards federated learning over large-scale streaming data (Colorado State University. Libraries, 2020)
Pereira, Aaron, author; Pallickara, Sangmi, advisor; Pallickara, Shrideep, committee member; Zahran, Sammy, committee member

Distributed Stream Processing Engines (DSPEs) have seen significant deployment growth along with an increase in streaming data sources such as sensor networks. These DSPEs enable processing large amounts of streaming data in a cluster of commodity machines to extract knowledge and insights in real time. Due to fluctuating data arrival rates in real-world applications, modern DSPEs often provide auto-scaling. However, the existing designs of advanced analytical frameworks are not effectively aligned with scalable streaming computing environments. We have designed and developed ORCA, a federated learning architecture that supports the training of traditional Artificial Neural Networks as well as Convolutional Neural Network and Long Short-Term Memory network based models while ensuring resiliency during scaling. ORCA also introduces dynamic adjustment of the 'elasticity' hyper-parameter for rescaled computing environments. We estimate this elasticity hyper-parameter using reinforcement learning. Our empirical benchmarks show that ORCA is capable of achieving an MSE of 0.038 over real-world streaming datasets.
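ORCA's training loop is not shown in the abstract; as a generic reference point for federated learning, the FedAvg-style weight averaging below combines locally trained parameters into a global model. This is plain NumPy with illustrative names, not ORCA's stream-partitioned training or elasticity control.

import numpy as np

def federated_average(worker_weights, worker_sizes):
    """Weighted average of per-worker model parameters (FedAvg-style).
    worker_weights: one list of np.ndarray parameter tensors per worker.
    worker_sizes: number of samples each worker trained on."""
    total = sum(worker_sizes)
    return [sum(w[i] * (s / total) for w, s in zip(worker_weights, worker_sizes))
            for i in range(len(worker_weights[0]))]

w1 = [np.ones((2, 2)), np.zeros(2)]
w2 = [3 * np.ones((2, 2)), np.ones(2)]
global_weights = federated_average([w1, w2], worker_sizes=[100, 300])
print(global_weights[0])  # 0.25*1 + 0.75*3 = 2.5 everywhere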