The Infona portal uses cookies, i.e. strings of text saved by a browser on the user's device. The portal can access those files and use them to remember the user's data, such as their chosen settings (screen view, interface language, etc.), or their login data. By using the Infona portal the user accepts automatic saving and using this information for portal operation purposes. More information on the subject can be found in the Privacy Policy and Terms of Service. By closing this window the user confirms that they have read the information on cookie usage, and they accept the privacy policy and the way cookies are used by the portal. You can change the cookie settings in your browser.
In this demonstration paper, we present SPATE, an innovative telco big data exploration framework whose objectives are two-fold: (i) minimizing the storage space needed to incrementally retain data over time, and (ii) minimizing the response time for spatiotemporal data exploration queries over stored data. Our framework deploys lossless data compression to ingest streams of telco big data in the...
According to the Centers for Disease Control, in the United States there are 6.8 million children living with asthma. Despite the importance of the disease, the available prognostic tools are not sufficient for biomedical researchers to thoroughly investigate the potential risks of the disease at scale. To overcome these challenges we present a big data integration and analysis infrastructure developed...
The paper demonstrates Hippo a lightweight database indexing scheme that significantly reduces the storage and maintenance overhead without compromising much on the query execution performance. Hippo stores disk page ranges instead of tuple pointers in the indexed table to reduce the storage space occupied by the index. It maintains simplified histograms that represent the data distribution and adopts...
We present the results of our work on integrating database and programming language runtimes through code generation and extensive just-in-time adaptation. Our techniques deliver significant performance improvements over non-integrated solutions. Our work makes important first steps towards a future where data processing applications will commonly run on machines that can store their datasets entirely...
With the rapid advancement of mobile devices and crowdsourcing platforms, spatial crowdsourcing has attracted much attention from various research communities. A spatial crowdsourcing system periodically matches a number of locationbased workers with nearby spatial tasks (e.g., taking photos or videos at some specific locations). Previous studies on spatial crowdsourcing focus on task assignment strategies...
Prescription drug abuse is one of the fastest growing public health problems in the USA. To address this epidemic, a near real-time monitoring strategy, instead of one resorting to a retrospective health records, may improve detecting the prevalence and patterns of abuse of both illegal drugs and prescription medications. In this paper, our primary goals are to demonstrate the possibility of utilizing...
Entity resolution (ER) is the process of identifying which entities in a dataset represent the same real-world object. This paper proposes a progressive approach to ER using MapReduce. In contrast to traditional ER, progressive ER aims to resolve the dataset such that the rate at which the data quality improves is maximized. Such a progressive approach is useful for many emerging analytical applications...
In this paper, we discuss an efficient and effective index mechanism to do the string matching with k mismatches, by which we will find all the substrings in a target string s having at most k positions different from a pattern string r. The main idea is to transform s to a BWT-array as index, denoted as BWT(s), and search r against it. During the process, the precomputed mismatch information of r...
Graph databases have become increasingly important with the rise of social networks, and with the growth of the Semantic Web and characterization of biological networks. Regular path queries (RPQs) are a way to explore path patterns in graphs which have become a standard method to explore graph databases. SPARQL 1.1 includes property paths, and so now encompasses RPQs as a fragment. In many environments,...
In the first wave of data science education programs, data engineering topics (systems, scalable algorithms, data management, integration) tended to be de-emphasized in favor of machine learning and statistical modeling. The anecdotal evidence suggests this was a mistake: data scientists report spending most of their time grappling with data far upstream of modeling activities. A second wave of data...
An inherent challenge arising in any dataset containing information of space and/or time is uncertainty due to various sources of imprecision. Integrating the impact of the uncertainty is a paramount when estimating the reliability (confidence) of any query result from the underlying input data. To deal with uncertainty, solutions have been proposed independently in the geo-science and the data-science...
We present SIRUM: a system for Scalable Informative RUle Mining from multi-dimensional data. Informative rules have recently been studied in several contexts, including data summarization, data cube exploration and data quality. The objective is to produce a small set of rules (patterns) over the values of the dimension attributes that provide the most information about the distribution of a numeric...
Data provenance is essential for debugging query results, auditing data in cloud environments, and explaining outputs of Big Data analytics. A well-established technique is to represent provenance as annotations on data and to instrument queries to propagate these annotations to produce results annotated with provenance. However, even sophisticated optimizers are often incapable of producing efficient...
User-managed-events is a popular feature on social networks. Take Facebook Events as an example: over 135 million events were created in 2015 and over 550 million people use events each month. In this work, we consider the heavy sparseness in both user and event feedback history caused by short lifespans (transiency) of events and user participation patterns in a production event system. We propose...
The proliferation of geo-textual data gives prominence to spatial keyword search. The basic top-k spatial keyword query, returns k geo-textual objects that rank the highest according to their textual relevance and spatial proximity to query keywords and a query location. We define, study, and provide means of computing the reverse top-k keyword-based location query. This new type of query takes a...
Set the date range to filter the displayed results. You can set a starting date, ending date or both. You can enter the dates manually or choose them from the calendar.