This work presents the design and development of a web-based system that supports cross-language similarity analysis and plagiarism detection. A suspicious document dq in a language Lq is submitted to the system via a PHP web-based interface. The system accepts the text either as an uploaded file or pasted directly into a text area. In order to lighten large texts and provide an ideal set...
This paper presents a method that applies text mining techniques and data mining tools to detect pharmaceutical spam in Twitter data. A simple method based on a manually selected list of 65 discriminating pharmaceutical words is used to label spam training tweets. Preliminary experimental results show that the J48 decision tree classifier performs better than the Naïve Bayesian algorithm.
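The keyword-based labeling step described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not give the 65-word list, so `PHARMA_WORDS` is a hypothetical placeholder, and the function name `label_tweet`, the `min_hits` threshold, and the tokenization rule are likewise assumptions.

```python
import re

# Hypothetical stand-in for the paper's manually selected 65-word list,
# which the abstract does not reproduce.
PHARMA_WORDS = {"viagra", "cialis", "pills", "pharmacy", "prescription"}


def label_tweet(text, keywords=PHARMA_WORDS, min_hits=1):
    """Label a tweet "spam" if it contains at least `min_hits` listed words.

    The threshold and the simple lowercase tokenization are assumptions
    made for this sketch; the abstract does not specify them.
    """
    # Lowercase the text and split it into alphanumeric tokens.
    tokens = set(re.findall(r"[a-z0-9']+", text.lower()))
    hits = tokens & set(keywords)
    return "spam" if len(hits) >= min_hits else "ham"


if __name__ == "__main__":
    for tweet in ["Cheap VIAGRA pills online!!", "Lovely weather today"]:
        print(tweet, "->", label_tweet(tweet))
```

Tweets labeled this way could then serve as training data for a classifier such as the decision tree or Naïve Bayes models the abstract compares.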
In this paper, we present an approach to automatically extract and classify opinions in texts. We propose a similarity measure that calculates semantic distances between a word and predefined subgroups of seed words. We evaluated our algorithm on the “SemEval 2007” semantic evaluation corpus and obtained best Precision and F1 values of 62% and 61%, respectively. As an improvement of 20 %...
Recent library digitization projects attempt to provide large collections of printed material from varying sources in a searchable format. The scanned documents are typically processed using Optical Character Recognition (OCR), which often introduces errors in the text. This paper proposes a technique for correction of OCR degraded text that is independent of character-level OCR errors, and hence...
Composition style is often an important factor in readers' selection of reading materials. For example, a reader may seek out articles written in a style similar to that of his or her favorite writer. We present a new method for providing recommendations based on composition style. Our algorithm analyzes and encodes the readability index and syntactical structure of a model document, and then searches for...
Word Alignment is an important supporting task for different NLP applications such as training of machine translation systems, translation lexicon induction, word sense discovery, word sense disambiguation, information extraction and the cross-lingual projection of linguistic information. In this paper we study the main rules and guidelines required to build an aligner tool for the Arabic language which...
We propose a classification model for the cognitive level of question items in examinations based on Bloom's taxonomy. The model implements the artificial neural network approach, which is trained using the scaled conjugate gradient learning algorithm. Several data preprocessing techniques such as word extraction, stop word removal, stemming, and vector representation are applied to a feature set...
Nowadays, satisfying user needs has become the main challenge in a variety of web applications. Recommender systems play a major role in that direction. However, as most of the information is present in a textual form, recommender systems face the challenge of efficiently analyzing huge amounts of text. The usage of semantic-based analysis has gained much interest in recent years. The emergence of...
Stemming is a fundamental step in processing textual data that precedes the tasks of text mining, Information Retrieval (IR), and natural language processing (NLP). The common goal of stemming is to standardize words by reducing each word to its base (root or stem), so it can also be considered a feature reduction technique. This paper aims at presenting a new dictionary-free, content-based Arabic stemmer...