Method enables machine learning from unwieldy data sets
by Staff Writers
Boston MA (SPX) Dec 20, 2016


Researchers from MIT's Computer Science and Artificial Intelligence Laboratory and its Laboratory for Information and Decision Systems have designed a new algorithm that makes it much more practical to select diverse subsets from a much larger dataset. Image courtesy Christine Daniloff and MIT.

When data sets get too big, sometimes the only way to do anything useful with them is to extract much smaller subsets and analyze those instead.

Those subsets have to preserve certain properties of the full sets, however, and one property that's useful in a wide range of applications is diversity. If, for instance, you're using your data to train a machine-learning system, you want to make sure that the subset you select represents the full range of cases that the system will have to confront.

Last week at the Conference on Neural Information Processing Systems, researchers from MIT's Computer Science and Artificial Intelligence Laboratory and its Laboratory for Information and Decision Systems presented a new algorithm that makes the selection of diverse subsets much more practical.

Whereas the running times of earlier subset-selection algorithms depended on the number of data points in the complete data set, the running time of the new algorithm depends on the number of points in the extracted subset. So if the goal is to winnow a data set of 1 million points down to one of 1,000, the new algorithm is roughly 1 billion times faster than its predecessors.

"We want to pick sets that are diverse," says Stefanie Jegelka, the X-Window Consortium Career Development Assistant Professor in MIT's Department of Electrical Engineering and Computer Science and senior author on the new paper.

"Why is this useful? One example is recommendation. If you recommend books or movies to someone, you maybe want to have a diverse set of items, rather than 10 little variations on the same thing. Or if you search for, say, the word 'Washington.'

"There's many different meanings that this word can have, and you maybe want to show a few different ones. Or if you have a large data set and you want to explore - say, a large collection of images or health records - and you want a brief synopsis of your data, you want something that is diverse, that captures all the directions of variation of the data.

"The other application where we actually use this thing is in large-scale learning. You have a large data set again, and you want to pick a small part of it from which you can learn very well."

Joining Jegelka on the paper are first author Chengtao Li, a graduate student in electrical engineering and computer science; and Suvrit Sra, a principal research scientist at MIT's Laboratory for Information and Decision Systems.

Thinking small
Traditionally, if you want to extract a diverse subset from a large data set, the first step is to create a similarity matrix - a huge table that maps every point in the data set against every other point. The intersection of the row representing one data item and the column representing another contains the points' similarity score on some standard measure.

There are several standard methods for extracting diverse subsets, but they all involve operations performed on the matrix as a whole. For a data set of a million points - and thus a million-by-million similarity matrix - that is prohibitively time consuming.
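To make the similarity matrix concrete: one common choice (an illustrative assumption here, not specified in the article) is a radial-basis-function kernel, where each entry scores how alike two points are, with 1.0 meaning identical. A minimal NumPy sketch:

```python
import numpy as np

def similarity_matrix(X, gamma=1.0):
    """Dense similarity matrix: entry (i, j) scores how alike
    points i and j are (RBF kernel; 1.0 on the diagonal)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

# Five random 3-dimensional points yield a 5 x 5 table
X = np.random.default_rng(0).normal(size=(5, 3))
L = similarity_matrix(X)
```

For a million points, storing this table alone takes a trillion entries, which is why whole-matrix methods become impractical.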

The MIT researchers' algorithm begins, instead, with a small subset of the data, chosen at random. Then it picks one point inside the subset and one point outside it and randomly selects one of three simple operations: swapping the points, adding the point outside the subset to the subset, or deleting the point inside the subset.

The probability with which the algorithm selects one of those operations depends on both the size of the full data set and the size of the subset, so it changes slightly with every addition or deletion. But the algorithm doesn't necessarily perform the operation it selects.

Again, the decision to perform the operation or not is probabilistic, but here the probability depends on the improvement in diversity that the operation affords. For additions and deletions, the decision also depends on the size of the subset relative to that of the original data set. That is, as the subset grows, it becomes harder to add new points unless they improve diversity dramatically.

This process repeats until the diversity of the subset reflects that of the full set. Since the diversity of the full set is never calculated, however, the question is how many repetitions are enough. The researchers' chief results are a way to answer that question and a proof that the answer will be reasonable.
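The swap/add/delete scheme described above can be sketched as a simple Metropolis-style chain. This is an illustrative simplification, not the paper's actual algorithm: the log-determinant diversity score is a standard DPP-style stand-in, the acceptance rule here omits the size-dependent probabilities the article mentions, and all function names and parameters are our own assumptions.

```python
import numpy as np

def log_det_diversity(L, S):
    """DPP-style diversity score: log-determinant of the sub-matrix
    of similarities among the chosen points. It is large when the
    chosen points are mutually dissimilar."""
    if not S:
        return 0.0
    idx = sorted(S)
    sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
    return logdet if sign > 0 else -np.inf

def sample_diverse_subset(L, steps=2000, seed=0):
    rng = np.random.default_rng(seed)
    n = L.shape[0]
    # Start from a small random subset
    S = set(int(i) for i in rng.choice(n, size=min(5, n), replace=False))
    cur = log_det_diversity(L, S)
    for _ in range(steps):
        outside_pool = [i for i in range(n) if i not in S]
        if not outside_pool:
            continue
        inside = int(rng.choice(sorted(S))) if S else None
        outside = int(rng.choice(outside_pool))
        move = rng.integers(3)  # 0: swap, 1: add, 2: delete
        T = set(S)
        if move == 0 and inside is not None:
            T.discard(inside)
            T.add(outside)
        elif move == 1:
            T.add(outside)
        elif move == 2 and inside is not None:
            T.discard(inside)
        else:
            continue
        new = log_det_diversity(L, T)
        # Metropolis rule: always accept a diversity improvement,
        # otherwise accept with probability exp(new - cur).
        if np.log(rng.random()) < new - cur:
            S, cur = T, new
    return S

# Toy data: 8 points in the plane, RBF similarities
rng = np.random.default_rng(1)
X = rng.normal(size=(8, 2))
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
L = np.exp(-D2)
S = sample_diverse_subset(L)
```

Note that each step only evaluates a determinant over the current small subset, never an operation on the full n-by-n matrix, which is the source of the speedup the article describes.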

Research paper: Fast mixing Markov chains for strongly Rayleigh measures, DPPs, and constrained sampling













