User:Ian Helmke/Sandbox: Difference between revisions

From Citizendium
imported>Ian Helmke
mNo edit summary
No edit summary
 
(10 intermediate revisions by one other user not shown)
Line 1: Line 1:
= Clustering Additions =
{{AccountNotLive}}
= Machine Learning Additions =


This is text I'd like to add to the clustering page (pending heavy editing)
This is text I'd like to add to the [[machine learning]] page (pending
heavy editing). This text would go underneath the "Issues in Training
and Evaluation" section.


== Overfitting ==
== Overfitting ==


Overfitting occurs when a classifier trains so closely to model data
Overfitting occurs in classifiers when a classifier creates an extremely accurate [[model_(machine learning)|model]] for classifying the example data used to create the model, yet is inaccurate on other examples.
that it is not useful for classifying things outside of the
model. The classifier is extremely accurate when classifying the
training data, but when the classifier is used to classify data
outside of the model, it will often classify data incorrectly.


This can sometimes occur because a classifier looks at irrelevant data
This can sometimes occur because a classifier looks at irrelevant data
points in the training data. Classifiers often give more weight to
points in the training data. Classifiers often give more weight to
features seen less commonly in the training data, and if data given to
[[feature_(vector)|features]] seen less commonly in the training data, and if data given to
the classifier shares similar uncommon values, it is grouped
the classifier shares similar uncommon values, it groups those values
accordingly. This produces inaccurate results.
accordingly. As a result, the classifier is extremely accurate when
classifying the training data and appears to be useful, but when used
to organize data outside of the model, it becomes inaccurate.


Machine learning techniques can prevent overfitting to some extent by
Machine learning techniques can prevent overfitting by
penalizing overly complicated models. A simpler model is oftentimes
imposing a [[penalty_(machine learning)|penalty]] upon itself for complicated models. A
more consistent with the data in question.
simpler model is oftentimes more consistent with the trends of the
data in question.


== Imbalance of Data ==
== Imbalance of Data ==


At times, data within training sets is imbalanced, where one sample
Data within training sets can be imbalanced, where the training set
category has many more points of data than another. Oftentimes, the
has more examples of one category of data than another. Many
category with less available data is the more interesting one, and we
classifying algorithms assume that the ratio of the different
want to find characteristics that are common to the minority group
categories of training data that they receive are roughly equivalent
that are also not shared by the majority group. Classifiers assume
to the ratio of real data that will belong in each category.
that the ratio of the different categories of training data that they
receive are roughly equivalent to the ratio of real data that will
belong in each category.


One way to prevent this imbalance of data is to use a technique called
There may also be characteristics that differentiate members of the
active example selection. Active example selection builds the
minority class that are missed when data is imbalanced. For example,
classifier model slowly, adding a few pieces of sample data of each
a classifier training on examples of cancer patients may fail to
class at a time. The model is tested at every stage, and documents
differentiate between different types of cancer in the minority
which improve the accuracy of the model are kept in the classifier,
class<ref name="ensemble">
while ones that do not change the model or make its performance worse
{{cite journal
are removed. This ensures that only meaningful pieces of data are used
      |author=Oh, Sangyoon et al.
in the classifier.  
      |title=Ensemble Learning with Active Example Selection for Imbalanced Biomedical Data Classification
      |journal=IEEE/ACM Trans. Comput. Biol. Bioinformatics
      |publisher=IEEE Computer Society Press
      |year=2011
      |id=http://dx.doi.org/10.1109/TCBB.2010.96}}</ref>.
 
One way to prevent this imbalance of data is to use
[[active example selection]]<ref name="ensemble" />. Active example
selection builds the classifier model slowly, adding a few pieces of
sample data of each class at a time. The model is tested at every
stage, and documents which improve the accuracy of the model are kept
in the classifier, while ones that do not change the model or make its
performance worse are removed. This ensures that only meaningful
pieces of data are used to train the classifier, and since the ratio
of documents is closer to 1:1, an imbalance of data is less of an
issue.


== Evaluating Results ==
== Evaluating Results ==


There are a number of techniques for evaluating the results of a
There are a number of techniques for evaluating the results of a
machine learning algorithm. Some of these techniques are also used in
machine learning [[algorithm]]. Some of these techniques are also used in
natural-language processing. Machine learning techniques are generally
[[computational linguistics|natural language processing]]. Machine learning techniques are generally
evaluated by their results, and some methods (such as neural networks)
evaluated by their results, and some methods (such as [[neural networks]])
are considered "black box" forms of classification, since it is not
are considered "black box" forms of classification, since it is not
easy to understand how or why the underlying implementation is sorting
easy to understand how or why the underlying implementation is sorting
the way it is.
a particular way.


In some cases, particularly with classifiers, the outcome of the
In some cases, particularly with classifiers, the outcome of the
machine learning algorithm is compared to a set of data classified by
machine learning algorithm is compared to a set of data classified by
experts. In this case, the machine learning algorithm is given a set
experts. The machine learning algorithm creates a model based on a set
of training data, and then classifies a second sample set of data. A group of
of training data, and classifies a second sample set of data. A
experts also annotate this second set of data, marking it according to how
group of experts also annotate the second set of data, marking it
they believe it should be classified (not according to how they
according to how they believe it should be classified (not according
believe a machine would classify it). This training and evaluation
to how they believe a machine would classify it). The training and
data is generally a tiny subset of the data available. If a classifier
evaluation data is generally a tiny subset of the data available.  
is able to produce good results for a subset of data, it should also
be successful at classifying a larger set of similar data.
The results of the algorithm are compared to the results of the
The results of the algorithm are compared to the results of the
experts and arranged into two scores: precision and recall. Precision
experts and arranged into two scores: [[precision]] and [[recall]]. Precision
accounts for situations where the classifier put something in a
accounts for situations where the classifier put something in a
category where it did not belong. Recall accounts for situations where
category where it did not belong. Recall accounts for situations where
the classifier did not put something in a category it should have.
the classifier did not put something in a category it should have.
If a classifier is able to produce good results for a subset of data, it
should also be successful at classifying a larger set of similar data.


Classification and clustering algorithms can also be measured against
Classification and clustering algorithms can also be measured against
Line 77: Line 91:
== Scalability ==
== Scalability ==


The ability of machine learning algorithms to take advantage of modern
Modern computers have the ability to [[multitask]] exceptionally
computers, which tend to have the ability to multitask exceptionally
well through the use of multiple cores or [[CPU]]s. Machine learning
well through the use of multiple cores or CPUs, is important. Since
algorithms are often used to process large quantities  
these algorithms are often used to process large quantities of data,
of data, and optimizing them for multiple CPUs greatly
the ability of an algorithm to be able to run in parallel and scale
improves their performance.
accordingly (so that an algorithm running on two cores runs twice as
fast, for example) allows it to
run faster on modern hardware, so that larger quantities of data can
be processed.


= Biclustering =
= Biclustering =


The following material would go on a new page of the above title (probably linked to the ML page)
''The following material would go on a new page of the above title
(probably linked to the ML page)''


Biclustering is an unsupervised [[machine learning]] method which searches
'''Biclustering''' is an unsupervised [[machine learning]] method
for a set of common features across the input data. It is unique among
which searches for similarities in specific subsections of the input
machine learning methods because it searches for similarities in
data. Biclustering is unique among machine learning methods because it searches
parts of the data, instead of putting a piece of data into a single
for similarities in small parts of the data, instead of putting a
group.
piece of data into a single group.


Biclustering was first discovered in 1970. Today, it is a commonly
Biclustering was first discovered in 1970. As of 2011, it is a commonly
used technique in bioinformatics, paricularly in the area of gene
used technique in [[bioinformatics]], particularly in the area of  
expression, or identifying groups of genes that are similar between
[[gene expression]], or identifying groups of [[genes]] that are similar between
different people.
different people.


== Processing ==
== Processing ==


Clustering algorithms take vectors as input. They sort the vectors
Clustering algorithms take [[vector|vectors]] as input. They sort the
according to how similar they are by comparing all of the features (data) in
vectors according to how similar they are by comparing all of the
the vector, and the result of the clustering is several groups that
[[feature_(vector)|features]] in the vector, creating
each contain a bunch of (hopefully) similar vectors. Biclustering
several groups that each contain a bunch of (hopefully) similar
looks at all of the vectors as a single input matrix, and attempts to
vectors. Biclustering looks at all of the vectors as a single input
find regions of the input which look similar.
matrix, and attempts to find regions of the input which look similar.


Biclustering is a useful technique for finding trends in data when
Biclustering is a useful technique for finding trends in data when
each vector of data is very large because it can spot trends in
each vector of data is large because it can spot trends in
specific parts of data that clustering cannot. A normal clustering
specific parts of data that clustering cannot. A normal clustering
algorithm sorts pieces of data according to features that the majority
algorithm sorts pieces of data according to features that the majority
of them share. Biclustering is able to organize data according to
of them share. Biclustering organizes data according to
parts of them that seem similar.
parts of them that seem similar.


Line 122: Line 133:
Biclustering is particularly useful in the medical field, where it can
Biclustering is particularly useful in the medical field, where it can
be used, for example, to find genes related to a specific disease in a
be used, for example, to find genes related to a specific disease in a
group of patients. If each vector represents how a person expresses
group of patients<ref>
{{cite journal
      |author=Sill, Martin et al.
      |title=Robust biclustering by sparse singular value decomposition incorporating stability selection
      |publisher=Oxford University Press
      |journal=Bioinformatics
      |year=2011
      |id=http://dx.doi.org/10.1093/bioinformatics/btr322}}</ref>.
If each vector represents how a person expresses
traits, biclustering can be used to determine a set of genes which is
traits, biclustering can be used to determine a set of genes which is
associated with cancer. It can even be used to find similarities and
associated with cancer. It can even be used to find similarities and
differences between different varieties of cancers.
differences between different varieties of cancers.
== References ==
<references />

Latest revision as of 02:58, 22 November 2023


The account of this former contributor was not re-activated after the server upgrade of March 2022.


Machine Learning Additions

This is text I'd like to add to the machine learning page (pending heavy editing). This text would go underneath the "Issues in Training and Evaluation" section.

Overfitting

Overfitting occurs when a classifier builds a model that is extremely accurate on the example data used to create it, yet is inaccurate on other examples.

This can occur when a classifier pays attention to irrelevant data points in the training data. Classifiers often give more weight to features seen less commonly in the training data, and if data given to the classifier shares similar uncommon values, the classifier groups that data accordingly. As a result, the classifier is extremely accurate when classifying the training data and appears to be useful, but it becomes inaccurate when used to organize data outside of the training set.

Machine learning techniques can reduce overfitting by imposing a penalty on complicated models. A simpler model is often more consistent with the underlying trends of the data in question.
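As a rough illustration of what such a penalty can look like, the sketch below fits a one-feature model y = w * x while adding a cost proportional to the square of the weight (often called an L2 penalty). The function names, the toy data, and the choice of an L2 penalty are assumptions made for this example; the text above does not prescribe a particular penalty.

```python
def penalized_loss(w, xs, ys, lam):
    """Squared error on the training data plus an L2 penalty on the weight."""
    fit_error = sum((y - w * x) ** 2 for x, y in zip(xs, ys))
    return fit_error + lam * w ** 2


def fit_weight(xs, ys, lam):
    """Weight that minimizes penalized_loss for the model y = w * x.

    Setting the derivative of the loss to zero gives
    w = sum(x * y) / (sum(x * x) + lam).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


# Hypothetical toy training data.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

unpenalized = fit_weight(xs, ys, lam=0.0)   # reproduces the training data exactly
penalized = fit_weight(xs, ys, lam=14.0)    # pulled toward the simpler model w = 0
```

With a larger `lam`, the fitted weight moves further toward zero, trading some accuracy on the training data for a simpler model.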

Imbalance of Data

Data within training sets can be imbalanced, meaning that the training set has more examples of one category of data than another. Many classification algorithms assume that the ratio of the different categories in the training data they receive is roughly equal to the ratio of real data that will belong in each category.

There may also be characteristics that differentiate members of the minority class that are missed when data is imbalanced. For example, a classifier training on examples of cancer patients may fail to differentiate between different types of cancer in the minority class[1].

One way to mitigate this imbalance of data is to use active example selection[1]. Active example selection builds the classifier model incrementally, adding a few pieces of sample data of each class at a time. The model is tested at every stage: examples that improve the accuracy of the model are kept in the classifier, while those that do not change the model or that make its performance worse are removed. This ensures that only meaningful pieces of data are used to train the classifier, and since the ratio of examples is closer to 1:1, the imbalance of data is less of an issue.
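The keep-or-discard loop described above can be sketched roughly as follows. This is a simplified illustration, not the method from the cited paper: it adds one candidate at a time rather than a few per class, it uses a hypothetical one-dimensional nearest-centroid classifier, and all names and data are invented for the example.

```python
def centroid_classifier(examples):
    """Return a predict(x) function based on the mean value of each class."""
    sums, counts = {}, {}
    for x, label in examples:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    centroids = {label: sums[label] / counts[label] for label in sums}
    return lambda x: min(centroids, key=lambda label: abs(x - centroids[label]))


def accuracy(examples, validation):
    """Fraction of validation items that a model trained on examples gets right."""
    predict = centroid_classifier(examples)
    return sum(predict(x) == label for x, label in validation) / len(validation)


def select_examples(seed, candidates, validation):
    """Keep only the candidates that improve accuracy on the validation data."""
    kept = list(seed)
    best = accuracy(kept, validation)
    for example in candidates:
        score = accuracy(kept + [example], validation)
        if score > best:  # unchanged or worse results mean the example is dropped
            kept.append(example)
            best = score
    return kept, best


# Hypothetical data: (feature value, class label).
seed = [(0.0, "common"), (10.0, "rare")]
candidates = [(5.0, "rare"), (8.0, "common"), (0.5, "common")]
validation = [(1.0, "common"), (2.0, "common"), (3.0, "common"),
              (4.0, "rare"), (9.0, "rare")]

kept, best = select_examples(seed, candidates, validation)
```

In this toy run only the first candidate is kept: the second makes accuracy worse and the third leaves it unchanged, so both are discarded.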

Evaluating Results

There are a number of techniques for evaluating the results of a machine learning algorithm. Some of these techniques are also used in natural language processing. Machine learning techniques are generally evaluated by their results, and some methods (such as neural networks) are considered "black box" forms of classification, since it is not easy to understand how or why the underlying implementation is sorting data in a particular way.

In some cases, particularly with classifiers, the outcome of the machine learning algorithm is compared to a set of data classified by experts. The machine learning algorithm creates a model based on a set of training data, and then classifies a second sample set of data. A group of experts also annotates the second set of data, marking it according to how they believe it should be classified (not according to how they believe a machine would classify it). The training and evaluation data are generally a tiny subset of the data available. The results of the algorithm are compared to the results of the experts and summarized in two scores: precision and recall. Precision is the fraction of the items the classifier placed in a category that actually belong there, so it falls when the classifier puts something in a category where it does not belong. Recall is the fraction of the items belonging to a category that the classifier placed there, so it falls when the classifier fails to put something in a category it should have. If a classifier is able to produce good results for a subset of data, it should also be successful at classifying a larger set of similar data.
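The two scores can be computed directly from the classifier's labels and the experts' labels. The sketch below assumes both are given as parallel lists; the function name, the category names, and the sample labels are hypothetical.

```python
def precision_recall(predicted, expert, category):
    """Precision and recall for one category, given parallel label lists."""
    pairs = list(zip(predicted, expert))
    true_pos = sum(p == category and e == category for p, e in pairs)
    false_pos = sum(p == category and e != category for p, e in pairs)
    false_neg = sum(p != category and e == category for p, e in pairs)
    # Guard against division by zero when a category is never predicted or never present.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Hypothetical labels for eight items in the evaluation set.
expert = ["spam", "spam", "spam", "spam", "spam", "ham", "ham", "ham"]
predicted = ["spam", "spam", "spam", "ham", "ham", "spam", "ham", "ham"]

precision, recall = precision_recall(predicted, expert, "spam")
```

Here the classifier marks four items as "spam", three of them correctly, and misses two of the five items the experts marked as "spam", giving a precision of 3/4 and a recall of 3/5.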

Classification and clustering algorithms can also be measured against other algorithms. This is useful when an algorithm is attempting to improve performance (speedwise, for example) while providing similar levels of precision and recall relative to another algorithm. It can also be used to show that an algorithm is an improvement over a previous generation, or to show which algorithm is most useful for organizing data for a particular problem.

Scalability

Modern computers can run many tasks in parallel through the use of multiple cores or CPUs. Machine learning algorithms are often used to process large quantities of data, and optimizing them to run in parallel across multiple CPUs can greatly improve their performance.

Biclustering

The following material would go on a new page of the above title (probably linked to the ML page)

Biclustering is an unsupervised machine learning method which searches for similarities in specific subsections of the input data. Unlike conventional clustering methods, which put each piece of data into a single group, biclustering searches for similarities in small parts of the data.

Biclustering was first discovered in 1970. As of 2011, it is a commonly used technique in bioinformatics, particularly in the area of gene expression, or identifying groups of genes that are similar between different people.

Processing

Clustering algorithms take vectors as input. They sort the vectors according to how similar they are by comparing all of the features in each vector, creating several groups that each contain a set of (ideally) similar vectors. Biclustering instead treats all of the vectors as a single input matrix, and attempts to find regions of that matrix which look similar.

Biclustering is a useful technique for finding trends in data when each vector of data is large, because it can spot trends in specific parts of the data that clustering cannot. A normal clustering algorithm sorts pieces of data according to features that the majority of them share. Biclustering organizes data according to the parts of it that seem similar.
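One way to make "a region of the matrix that looks similar" concrete is to score a chosen set of rows and columns. The sketch below uses the mean squared residue, a score from the biclustering literature (Cheng and Church), as an assumed example; the text above does not name a specific scoring method, and the matrix values and function name are invented for illustration. A score of zero means every selected row follows the same pattern across the selected columns, up to a constant offset.

```python
def mean_squared_residue(matrix, rows, cols):
    """Score how coherent the submatrix picked out by rows and cols is (0 is perfect)."""
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(rows), len(cols)
    row_means = [sum(row) / n_cols for row in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = 0.0
    for i, row in enumerate(sub):
        for j, value in enumerate(row):
            # Residue: what is left after removing the row, column and overall effects.
            residue = value - row_means[i] - col_means[j] + overall
            total += residue ** 2
    return total / (n_rows * n_cols)


# Hypothetical data: each row is one vector, each column one feature.
matrix = [
    [9.0, 1.0, 2.0, 0.0],
    [4.0, 7.0, 3.0, 8.0],
    [0.0, 3.0, 4.0, 7.0],
    [6.0, 2.0, 9.0, 1.0],
]

# Rows 0 and 2 follow the same pattern on columns 1 and 2, but not elsewhere.
bicluster_score = mean_squared_residue(matrix, rows=[0, 2], cols=[1, 2])
whole_score = mean_squared_residue(matrix, rows=[0, 1, 2, 3], cols=[0, 1, 2, 3])
```

Rows 0 and 2 look quite different when all four features are compared, which is why an ordinary clustering algorithm might not group them, yet the region restricted to columns 1 and 2 scores as perfectly coherent.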

Applications

Biclustering is particularly useful in the medical field, where it can be used, for example, to find genes related to a specific disease in a group of patients[2]. If each vector represents how a person expresses traits, biclustering can be used to determine a set of genes which is associated with cancer. It can even be used to find similarities and differences between different varieties of cancers.

References

  1. Oh, Sangyoon et al. (2011). "Ensemble Learning with Active Example Selection for Imbalanced Biomedical Data Classification". IEEE/ACM Trans. Comput. Biol. Bioinformatics. http://dx.doi.org/10.1109/TCBB.2010.96.
  2. Sill, Martin et al. (2011). "Robust biclustering by sparse singular value decomposition incorporating stability selection". Bioinformatics. http://dx.doi.org/10.1093/bioinformatics/btr322.