The Infona portal uses cookies, i.e. strings of text saved by a browser on the user's device. The portal can access those files and use them to remember the user's data, such as their chosen settings (screen view, interface language, etc.), or their login data. By using the Infona portal the user accepts automatic saving and using this information for portal operation purposes. More information on the subject can be found in the Privacy Policy and Terms of Service. By closing this window the user confirms that they have read the information on cookie usage, and they accept the privacy policy and the way cookies are used by the portal. You can change the cookie settings in your browser.
Nowadays, text classification (TC) becomes the main applications of NLP (natural language processing). Actually, we have a lot of researches in classifying text documents, such as Random Forest, Support Vector Machines and Naive Bayes. However, most of them are applied for English documents. Therefore, the text classification researches on Vietnamese still are limited. By using a Vietnamese news corpus,...
The widespread prevalence of dietary supplements has drawn extensive attention due to the safety and efficacy issue. Clinical notes document a great amount of detailed information on dietary supplement usage, thus providing a rich source for clinical research on supplement safety surveillance. Identification the use status of dietary supplements is one of the initial steps for the ultimate goal of...
Twitter is one of the most popular microblog sites developed in recent years. Feelings are analysed on the messages shared on Twitter so that users ideas on the products and companies can be determined. Sentiment analysis helps companies to improve their products and services based on the feedback obtained from the users through Twitter. In this study, it was aimed to perform sentiment analysis on...
Event logging is a key source of information on a system state. Reading logs provides insights on its activity, assess its correct state and allows to diagnose problems. However, reading does not scale: with the number of machines increasingly rising, and the complexification of systems, the task of auditing systems' health based on logfiles is becoming overwhelming for system administrators. This...
In the era of Internet and electronic devices bullying shifted its place from schools and backyards into the cyberspace; it is now known as Cyberbullying. Children of the Arab countries are suffering from cyberbullying same as children worldwide. Thus concerns from cyberbullying are elevating. A lot of research is done for the purpose of handling this situation. The current research is focusing on...
Word-sense disambiguation is one of the key concepts in natural language processing. The main goal of a language is to present a specific concept to the audience. This concept is extracted from the meaning of words in that language. System should be able to identify role and meaning of words in order to identify the concepts in texts properly. This issue becomes more problematic if there are words...
Word2Vec (Word to Vector) processes natural language by calculating the cosine similarity. However, the serial algorithm of original Word2Vec fails to satisfy the demands of training of corpus text because of the explosive growth of information. It has become the bottleneck owing to its comparatively low processing efficiency. The High Performance Computing (HPC) specializes in improving the calculation...
In this paper, by learning the origin of the word distributed representation, knowing the distributed representation is one of the bridges of natural language processing mapping to mathematical calculations. Through the learning distributed representation model: neural network language model, CBOW model and Skip-gram model, the advantages and disadvantages of each model are clarified. Through the...
This work presents a method of classification of text documents using deep neural network with LSTM (long short-term memory) units. We have tested different approaches to build feature vectors, which represent documents to be classified: we used feature vectors constructed as sequences of words included in the documents, or, alternatively, we first converted words into vector representations using...
Duplicate Bug Detection is the problem of identifying whether a newly reported bug is a duplicate of an existing bug in the system and retrieving the original or similar bugs from the past. This is required to avoid costly rediscovery and redundant work. In typical software projects, the number of duplicate bugs reported may run into the order of thousands, making it expensive in terms of cost and...
Meaningless identifiers as well as inconsistent use of identifiers in the source code might hinder code readability and result in increased software maintenance efforts. Over the past years, effort has been devoted to promoting a consistent usage of identifiers across different parts of a system through approaches exploiting static code analysis and Natural Language Processing (NLP). These techniques...
Paraphrase Detection is the task of examining if two sentences convey the same meaning or not. Here, in this paper, we have chosen a sentence embedding by unsupervised RAE vectors for capturing syntactic as well as semantic information. The RAEs learn features from the nodes of the parse tree and chunk information along with unsupervised word embedding. These learnt features are used for measuring...
Coreference resolution plays a significant role in natural language processing systems. It is the method of figuring out all the noun phrases that refer back to the identical real world entity. Several researches have been done in noun phrase coreference resolution by using certain machine learning techniques. Our paper proposes a machine learning approach using support vector machines (SVM) towards...
Popular social media sites like Facebook, Twitter, YouTube etc. has become a platform for the user to express their stance towards the target entity. Target entity can be a person, a location, an organization, a new government policy etc. The stance expressed by the user may be favor towards the target entity, against or sometimes neutral. For the first time, we present stance detection system implemented...
Part of speech tagging has some different methods or techniques to the problem in assigning each word of a text with a part-of-speech tag. In this paper, we conducted some part-of-speech tagging techniques for Bahasa Indonesia experiments using statistical approach (Unigram, Hidden Markov Models) and Brill's tagger. In this study, we used Supervised POS Tagging approach requiring a large number of...
This paper presents the results of systematic and comparative experimentation with major types of methodologies for automatic duplicate question detection when these are applied to datasets of progressively larger sizes, thus allowing to study the learning profiles of this task under these different approaches and evaluate their merits. This study was made possible by resorting to the recent release...
This paper presents a combination of machine learning and lexicon-based approaches for sentiment analysis of students feedback. The textual feedback, typically collected towards the end of a semester, provides useful insights into the overall teaching quality and suggests valuable ways for improving teaching methodology. The paper describes a sentiment analysis model trained using TF-IDF and lexicon-based...
With the exponential growth of web meta-data, exploiting multimodal online sources via standard search engine has become a trend in visual recognition as it effectively alleviates the shortage of training data. However, the web meta-data such as text data is usually not as cooperative as expected due to its unstructured nature. To address this problem, this paper investigates the numerical representation...
Data analytics is being increasingly used in cyber-security problems, and found to be useful in cases where data volumes and heterogeneity make it cumbersome for manual assessment by security experts. In practical cyber-security scenarios involving data-driven analytics, obtaining data with annotations (i.e. ground-truth labels) is a challenging and known limiting factor for many supervised security...
More and more modern applications make use of natural language data, e. g. Information Extraction (IE) or Question Answering (QA) systems. Those application require preprocessing through Natural Language Processing (NLP) pipelines, and the output quality of these applications depends on the output quality of NLP pipelines. If NLP pipelines are applied in different domains, the output quality decreases...
Set the date range to filter the displayed results. You can set a starting date, ending date or both. You can enter the dates manually or choose them from the calendar.