



TECH SPACE
Building trustworthy big data algorithms
by Staff Writers
Chicago IL (SPX) Feb 23, 2015



Much of the data we collect sits in large databases of unstructured text. Finding insights among emails, text documents, and websites is extremely difficult unless we can search, characterize, and classify their text data in a meaningful way.

One of the leading big data algorithms for finding related topics within unstructured text (an area called topic modeling) is latent Dirichlet allocation (LDA). But when Northwestern University professor Luis Amaral set out to test LDA, he found that it was neither as accurate nor as reproducible as a leading topic modeling algorithm should be.

Using his network analysis background, Amaral, professor of chemical and biological engineering in Northwestern's McCormick School of Engineering and Applied Science, developed a new topic modeling algorithm that has shown very high accuracy and reproducibility during tests.

His results, co-authored with Konrad Kording, associate professor of physical medicine and rehabilitation, physiology, and applied mathematics at Northwestern, were published Jan. 29 in Physical Review X.

Topic modeling algorithms take unstructured text and find a set of topics that can be used to describe each document in the set. They are the workhorses of big data science, used as the foundation for recommendation systems, spam filtering, and digital image processing. The LDA topic modeling algorithm was developed in 2003 and has been widely used for academic research and for commercial applications, like search engines.

When Amaral explored how LDA worked, he found that the algorithm produced different results each time it was run on the same set of data, and that those results were often inaccurate. Amaral and his group tested LDA by running it on documents they created that were written in English, French, Spanish, and other languages. Using different languages let them prevent text overlap among documents.

"In this simple case, the algorithm should be able to perform at 100 percent accuracy and reproducibility," he said. But when LDA was used, it separated these documents into similar groups with only 90 percent accuracy and 80 percent reproducibility. "While these numbers may appear to be good, they are actually very poor, since they are for an exceedingly easy case," Amaral said.
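One way to put numbers on accuracy and reproducibility is sketched below. The article does not say which measure the researchers used, so this sketch assumes a simple pair-agreement score (the Rand index): the fraction of document pairs that two groupings treat the same way. The function names and the label lists are hypothetical illustrations, not the study's data.

```python
from itertools import combinations


def pair_agreement(labels_a, labels_b):
    """Fraction of document pairs on which two groupings agree (Rand index).

    Two groupings agree on a pair when both put the two documents together
    or both keep them apart, so the label names themselves do not matter.
    """
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def accuracy(run, truth):
    """Agreement between one run and the known correct grouping."""
    return pair_agreement(run, truth)


def reproducibility(runs):
    """Average agreement over every pair of runs on the same data."""
    scores = [pair_agreement(a, b) for a, b in combinations(runs, 2)]
    return sum(scores) / len(scores)


# Made-up example: four documents, two languages.
truth = [0, 0, 1, 1]
runs = [
    [0, 0, 1, 1],  # correct grouping
    [1, 1, 0, 0],  # same grouping, labels swapped
    [0, 1, 1, 1],  # one document misplaced
]

print([accuracy(run, truth) for run in runs])
print(reproducibility(runs))
```

An algorithm that is both accurate and reproducible would score 1.0 on both measures for a test as easy as the language one.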

To create a better algorithm, Amaral took a network approach. The result, called TopicMapping, begins by preprocessing data to replace words with their stems (so "star" and "stars" would be considered the same word). It then builds a network of connected words and identifies "communities" of related words (just as one could look for communities of people on Facebook). The words within a given community define a topic.
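The steps described above can be illustrated with a toy sketch. This is not the TopicMapping implementation: the stemmer is a crude stand-in that only strips a trailing "s", words are linked whenever they share a document, and connected components of the network stand in for real community detection. All function names and the sample documents are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def stem(word):
    """Crude stand-in stemmer: lowercase and strip one trailing 's'."""
    word = word.lower()
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def build_word_network(documents):
    """Link every pair of stemmed words that appear in the same document."""
    graph = defaultdict(set)
    for text in documents:
        stems = {stem(w) for w in text.split()}
        for word in stems:
            graph[word]  # register the word as a node even if it has no links
        for a, b in combinations(stems, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def find_communities(graph):
    """Group words into connected components, a simple proxy for communities."""
    seen, communities = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        communities.append(component)
    return communities


def assign_topic(text, communities):
    """Return the index of the community sharing the most stems with the text."""
    stems = {stem(w) for w in text.split()}
    return max(range(len(communities)), key=lambda i: len(stems & communities[i]))


documents = [
    "the stars shine at night",
    "a star is bright at night",
    "las estrellas brillan de noche",
    "una estrella brilla de noche",
]
communities = find_communities(build_word_network(documents))
labels = [assign_topic(doc, communities) for doc in documents]
print(labels)  # → [0, 0, 1, 1]
```

Because the English and Spanish documents share no words, the network splits into two separate groups, one per language. Real text has overlapping vocabulary, so a proper community detection method would be needed in place of connected components.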

The algorithm was able to perfectly separate the documents according to language and was able to reproduce its results. It also had high accuracy and reproducibility when separating 23,000 scientific papers and 1.2 million Wikipedia articles by topic.

These results show the need for more testing of big data algorithms and more research into making them more accurate and reproducible, Amaral said.

"Companies that make products must show that their products work," he said. "They must be certified. There is no such case for algorithms. We have a lot of uninformed consumers of big data algorithms that are using tools that haven't been tested for reproducibility and accuracy."


















