EMBL
CAIPIRINI


Frequently Asked Questions

INTRODUCTION

1. What does Caipirini do? (*) Updated.
2. What can I do with Caipirini? For which cases is Caipirini appropriate?
3. What is important with Caipirini?
4. What is context-based classification & regression? How is Caipirini concept-based? (*) Updated.
5. Can Caipirini solve my problem?

INPUT

6. How do I find a Set A and a Set B? What is interesting and what is not?
7. What do I need in order to use Caipirini? (*) Updated, again.
8. How can I collect abstracts for the Training?!
9. What is the input format of the lists?
10. Why avoid using as training input a query from PubMed?
11. How many abstracts do I need for training?
12. I do not have any Set B (uninteresting set); what should I do? (*) Updated.
13. How many abstracts can I give? What limitations are there? What should I keep in mind?
14. I do not want to query PubMed because I already have the set of abstracts that I want to classify; how can I enter it to be classified?
15. What are the term types used for?
16. Which term types should I use?
17. What do the term types mean?
18. Are there other term types?
19. How many e-mail addresses can I enter? (*) Updated.

METHODS

20. How does Caipirini work? What happens in the background? (*) Updated, again.
21. How does Caipirini process the abstracts? (*) Updated, again.
22. Does Caipirini take into account all the abstracts? (*) Updated, again.
23. How does the classification and the ranking work? (*) Updated, again.

RESULTS

24. What do I see in the raw classification file? (*) Updated.
25. What do I see on the result page? (*) Updated.
26. How should I interpret the results?
27. I want to work with the results of Caipirini; how can I access them?
28. I want to work with the results of Caipirini; how can I process them?
29. I cannot find all my PubMed Ids in the results! What has happened?
30. An abstract assigned to Set A contains more terms that are mentioned only in the input Set B; is this wrong?
31. The classification results do not satisfy me. Why do I get these results?!
32. The classification results do not satisfy me. What should I do? How can I influence the performance?
33. I gave as query the PubMed Ids of Sets A and B so that I could check the performance, but I cannot find them in the results; what happened? (*) Updated, again.
34. What is a Support Vector Machine (SVM)?
35. Why RBFs and not another kernel!? (*) Updated.
36. Why do I receive an e-mail with the results of a Martini job before the e-mail with the results of Caipirini!? (*) Updated

OTHER

37. Can one enter both Entrez and Ensembl ids in a single set?! What happens then?! (*) Updated.
38. How can I automatically submit Martini and/or Caipirini jobs?! Is there an API?! (*) Updated.
39. I submitted an older job of mine but this time I receive slightly different results!? What is wrong?! (*) Updated.

1. What does Caipirini do?

Caipirini works on the texts of the abstracts extracted from the user's input in order to learn the difference between the two given sets. The results are announced at the output page indicated after submission; one can reload this page until the results have been prepared.

There are three ways of entering data in Caipirini for Sets A and B:

(a) One can enter lists of PubMed Ids directly, or
(b) One can enter lists of Entrez and/or Ensembl Gene Ids, or
(c) One can query PubMed directly.

In all three cases Caipirini retrieves the corresponding abstracts and processes them to provide Classification and Ranking of Set C:

Caipirini processes the third set of abstracts (Set C), defined through a PubMed query, to find out which of these (and how strongly they) are related to Set A or to Set B. If this is not clear, they are assigned to an 'ambiguous' category. To do this, Caipirini uses the abstracts of Sets A and B to train a Support Vector Machine (SVM), creating an SVM model that can distinguish between them. The SVM model is then used to classify the results of the PubMed query that you have entered: it separates them into Set A and Set B abstracts and ranks them based on the input that you have given.

[return to top of page]

2. What can I do with Caipirini? For which cases is Caipirini appropriate?

Caipirini can be used in several different cases, for example in the following ways:

(a) Starting from specific sets A & B one can expand to or identify another set or sets related to (any of) the original one(s). By doing this, one can update, enrich or expand an already known or studied collection of abstracts.

(b) In addition, by using Caipirini, one can identify the subset of a collection that refers to specific context requirements.

(c) Furthermore, with Caipirini, one can connect collections of abstracts that refer to a different context but share a common concept. Thus, one can study the combined sets and transfer information and knowledge from one field or case study to another.

[return to top of page]

3. What is important with Caipirini?

Caipirini carries several innovations on different levels.
(a) It is the first system that directly uses genes to characterize abstracts, in a simple manner.
(b) It extracts information from the literature directly, without performing complex functions on the sets of abstracts.
(c) In addition, it is the first system that has automated the procedure of classifying abstracts. The user does not have to take care of all the details related to the classic 'train and classify' process.
(d) Furthermore, the system is applicable to a wide range of cases related to the life sciences. This is because of the dictionaries used and the control that the user can have over them.
(e) The classification can, thus, be 'personalized' to each research field and user separately. By changing the context (selecting different combinations of term types to be taken into account) the user can define different concepts to be learned and used.

[return to top of page]

4. What is context-based classification & regression? How is Caipirini concept-based?

Caipirini does not only take into account which terms occur in each abstract and how many times they occur. Caipirini also takes into account the combinations of the terms (as facilitated by the SVM), i.e. the context in which the terms occur, or in other words the concept that is represented by these abstracts.
By changing the context, i.e. by taking into account a specific set of terms, a different concept is created. Caipirini gives the user the opportunity to do this: it takes into account only the term types of the training set that have been selected by the user at the interface, thus learning a different concept.
For example, we trained Caipirini using the Arabidopsis thaliana (AT) set in order to learn what a resistance-related abstract looks like. By removing the term types organisms, genes/proteins and diseases, we learn what a resistance abstract looks like without teaching the SVM that such an abstract should refer to AT or to any of the specific diseases or genes/proteins mentioned in the example abstracts. This way, one could apply the same training and examples to another organism as well, e.g. tomato, rice, etc.

[return to top of page]

5. Can Caipirini solve my problem?

It depends; but Caipirini most probably can.
Classification and Ranking Analysis is a very useful methodology, especially in the life sciences. Caipirini automates a 'standard' classification procedure. Such classification (& regression) methods have been used in different cases for a variety of projects within the life sciences.
The advantages of Caipirini are that it is generic and can be personalized to the needs of different users, and that it achieves context- and concept-based classification. The method used in the background, the Support Vector Machine, has lately gained increasing interest within the life sciences.
However, Caipirini does not claim to be able to solve every possible abstract classification problem. Many problems are too difficult to be solved by any type of classifier. In addition, Caipirini does not cope well with the problem of imbalanced data; a user may, however, adapt their datasets accordingly.

[return to top of page]

6. How do I find a Set A and a Set B? What is interesting and what is not?

This depends entirely on the type of problem you would like to solve.
Sets A and B are, in general, two sets of genes or abstracts whose difference (as compared to each other) one wants to learn. One can then also classify another set of abstracts and discover more literature related to either Set A or Set B.
In general, with the classification one can answer the question "which abstracts of this Set C are 'closer' to Set A and which are 'closer' to Set B". In most cases, both Sets A and B might be of interest, but they simply belong to different categories; the categories that Caipirini should learn to separate. For example, the sets might come from the literature of two different but interesting gene lists (or diseases, organisms, etc.).
In the specific case that someone wants to identify a subpart of a literature collection, one set should consist of examples that describe what the user is looking for (let's call it interesting), as opposed to the other set, which should include examples of what should not be included (let's call them uninteresting). In general, the list of interesting PubMed Ids should include a number of example abstracts of the type that one looks for. On the other hand, abstracts containing information which is not useful, but which is somehow related to or retrieved together with the interesting ones, should go into the uninteresting list of examples. In other words, interesting abstracts are the ones that describe what is to be looked for, and uninteresting the ones that should be excluded. This way, Caipirini can be used for the "disambiguation of abstracts".

[return to top of page]

7. What do I need in order to use Caipirini?

For a Caipirini job one needs:

1. Two lists of PubMed Ids or Entrez and/or Ensembl Gene Ids or two queries for PubMed (or a combination of these) for Sets A and B - the retrieved abstracts will be then used as training examples.
2. A query for PubMed or a list of PubMed Ids for Set C, i.e. a set of abstracts to be classified.
3. An e-mail address (only one, and valid) where the results will be sent.
4. Optional: The user can select different term types to be taken into account. The default is that all terms will be used.
5. Optional: A description of the job to be submitted. The description is used as the subject of the e-mail sent with the results. This is useful in case a user plans to submit several jobs.

Alternatively, a user may want to submit only Set A. In this case, Set B is automatically populated with a number of randomly selected abstracts that are not contained in Set A but are equal in number to those extracted from Set A. Set C is then considered to be all the abstracts indexed by the AKS2 instance currently used by Caipirini.
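
For illustration, here is a minimal Python sketch of the automatic Set B population just described (all names are hypothetical; this is not Caipirini's actual code):

    import random

    def auto_populate_set_b(all_indexed_pmids, set_a_pmids):
        set_a = set(set_a_pmids)
        # Candidate abstracts: everything indexed that is not contained in Set A.
        candidates = [pmid for pmid in all_indexed_pmids if pmid not in set_a]
        # Draw as many random abstracts as were extracted for Set A.
        return random.sample(candidates, len(set_a))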

[return to top of page]

8. How can I collect abstracts for the Training?!

There are many web sites, tools, systems and databases that one can use to collect literature directly. One can also use Caipirini iteratively, moving uninteresting examples into Set B in each round.

[return to top of page]

9. What is the input format of the lists?

There should be one PubMed Id per line, or one Entrez/Ensembl Gene Id per line. If the input is a query for PubMed, then the query should be in exactly the same format as when used in the NCBI PubMed interface.
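
For illustration, a PubMed Id input list would look like the following (the ids below are hypothetical placeholders):

    15456405
    16403520
    17284678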

[return to top of page]

10. Why avoid using as training input a query from PubMed?

Because we believe that, in most cases, query results from PubMed can be too generic or noisy to be used directly as training examples. It is better if the training examples are carefully chosen so that they really describe what Caipirini should learn.

[return to top of page]

11. How many abstracts do I need for training?

The more abstracts you use, the better the results you will receive. A small number of abstracts is, most of the time, not enough to describe the difference between the two sets. (As a rule of thumb, we propose starting with at least 50 abstracts per input list.)

[return to top of page]

12. I do not have any Set B (uninteresting set); what should I do?

In the current version, Caipirini will not proceed without all sets being declared by the user.
This case may happen when one knows what one is looking for (i.e. one has the interesting list of PubMed Ids or query) but does not have an uninteresting list of examples. To populate the uninteresting list of PubMed Ids, one should think about what should not be included in the results, e.g. what kind of literature complicates the procedure of retrieving more of the (or only the) interesting abstracts. Entering a single random id as Set B to start with can be a workaround, but it will result in poor classification performance. A bigger set of randomly selected ids for Set B could work.

Currently, one can also use the simple web version where only Set A is required. However, it is better when a reference set is explicitly defined as background.

[return to top of page]

13. How many abstracts can I give? What limitations are there? What should I keep in mind?

In general, the more ids you give, the better the results you will receive. Keep in mind, however, that this increases the processing time. A job that includes a few hundred up to a few thousand abstracts will be handled in a reasonable time frame, i.e. some minutes or hours. Even bigger jobs will take longer. You can paste as many PubMed Ids as you want into the input lists, as long as they are fewer than 25,000. The query results will all be processed as long as they also amount to fewer than 25,000 abstracts. (The methodology used can be applied to any amount of data; we restrict the size simply for time performance reasons.)
If you decide to enter as input a query that directly contains a list of PubMed Ids, keep in mind that there is a limit; up to 500 PubMed Ids can be retrieved this way. If you have a bigger set of abstracts to be classified, split them into separate queries with subsets of size 500 (see the sketch below) and use the same training set (Sets A and B) in all cases.
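
A minimal Python sketch of this splitting (a hypothetical helper, not part of Caipirini):

    def chunk_pubmed_ids(pubmed_ids, size=500):
        # Split a long id list into query-sized subsets of at most 500 ids.
        return [pubmed_ids[i:i + size] for i in range(0, len(pubmed_ids), size)]

    # Each chunk is then submitted as a separate Set C query, reusing the
    # same training Sets A and B for every job.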

[return to top of page]

14. I do not want to query PubMed because I already have the set of abstracts that I want to classify; how can I enter it to be classified?

Just paste the PubMed Ids into the form of Set C. If you would like to submit this as a query to PubMed through the interface, be aware that there is a limit on the query size that the current connection with the Entrez Programming Utilities can handle (we propose 500 PubMed Ids per query; we have also entered bigger sets by splitting them into sets of 500 PubMed Ids, as described in question 13).

[return to top of page]

15. What are the term types used for?

The term types are important because their selection leads to a different, user-defined concept being learned by the SVM. This makes it possible to perform a different, user-defined, i.e. "personalized", classification on the same set of abstracts.

[return to top of page]

16. Which term types should I use?

This depends on the problem that you would like Caipirini to solve. By default, Caipirini will take into account all the term types mentioned in the abstracts that correspond to the given input. If your purpose is that Caipirini learns a general concept (i.e. without being restricted to the specific features of Sets A and B), e.g. in order to identify a set of abstracts that is related to the input but refers to a different case (such as a different disease, organism or gene, or a different life-sciences field), then the selection of term types can influence the performance of Caipirini.

[return to top of page]

17. What do the term types mean?

There are five different term types that the users can (un-) select:
Organisms: these are terms that refer to an organism.
Genes/Proteins: these are terms that refer to genes and/or their proteins.
Small Molecules: these are terms that refer to chemicals or drugs.
Diseases: these are terms that refer to diseases.
Symptoms: these are terms that refer to phenotypes of or reactions to diseases and other stimuli (e.g. pain).

When deselected, the terms that correspond to the respective semantic categories will not be taken into account.

[return to top of page]

18. Are there other term types?

Yes, there are. If one unselects all the available term types of the interface, Caipirini will still use terms in order to understand what the difference between the two given lists is. These are terms that refer to biological actions, such as "enhances" and "regulates", and other biomedical terms that do not belong to any other category. Our statistics have shown that these two types of terms are necessary for a valid classification to take place. In addition, they comprise the vast majority of the terms mentioned in an abstract.

[return to top of page]

19. How many e-mail addresses can I enter?

Please, enter only one e-mail address. Make sure that it is correct and valid; otherwise the results will never arrive (to you at least!).

In the current version, Caipirini does not send results via e-mail. Instead, the user is forwarded to a page where the results are announced when the task has finished. Reload frequently to be informed about the status of the job.

[return to top of page]

20. How does Caipirini work? What happens in the background?

Caipirini uses in the background text mining information retrieved from the AKS2 database, an industrial system and product of Bioalma. The AKS2 manages information extracted directly from the scientific literature. The system is updated daily and has indexed more than 8,000,000 of the latest PubMed abstracts. Caipirini uses instances of the AKS2 database, updated less frequently.
The SVM used is an implementation of the LIBLINEAR library (R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9, 2008, 1871-1874), which replaced the LIBSVM library (Chih-Chung Chang and Chih-Jen Lin, 2001) used in earlier versions.
When the user enters PubMed Ids directly, the abstracts are registered directly. When the user enters lists of Entrez and/or Ensembl Gene Ids, the corresponding abstracts are collected first and then registered to each set; Caipirini retrieves all the abstracts linked to the given gene records. When the user enters queries, Caipirini uses the Entrez Programming Utilities of NCBI to query PubMed for the retrieval.
After the abstracts have been collected, Caipirini takes care of cases where the user has entered duplicate records and cases where collected abstracts belong to more than one input set (Set A, B or C). Caipirini first removes the duplicates. In order to reduce the noise entered, Caipirini excludes from the Classification and Ranking Analysis the abstracts that belong to more than one set. When this assignment procedure has finished, only unique abstracts that belong to exactly one set remain. These abstracts are then mapped to the AKS2; only the abstracts indexed by the AKS2 update instance currently used in the background are used by the system. If after these steps there are no abstracts remaining in any of the sets, the user will be notified at the results page that "no indexed abstracts can be found".
Then, the terms mentioned in each set of abstracts are extracted for further use. If no terms are found, the user is notified by the web interface as well.
Last, the training vectors (i.e. the term vectors that describe the example Sets A and B) are based on the terms of Sets A and B only. For the creation of the vectors of Set C (i.e. the vectors corresponding to the abstracts to be classified), out of all the existing terms only the ones that occur in Sets A and B are taken into account, since the separation will take place based on these terms. The value of each dimension in a vector (in other words, the weight of a term per abstract) is the number of times the term is mentioned in the abstract.
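
As an illustration of the vector creation just described, here is a minimal Python sketch (not Caipirini's actual code; the term extraction performed via AKS2 is abstracted away as ready-made term lists):

    from collections import Counter

    def build_vocabulary(training_term_lists):
        # The features are the terms occurring in the training Sets A and B;
        # `training_term_lists` holds one list of extracted terms per abstract.
        vocabulary = sorted({term for terms in training_term_lists for term in terms})
        return {term: index for index, term in enumerate(vocabulary)}

    def abstract_to_vector(abstract_terms, vocabulary):
        # Terms of a Set C abstract that do not occur in Sets A and B are
        # ignored; each kept dimension holds the term's occurrence count.
        counts = Counter(abstract_terms)
        vector = [0] * len(vocabulary)
        for term, count in counts.items():
            if term in vocabulary:
                vector[vocabulary[term]] = count
        return vector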

(*) Updates applied: Parts of the processing described above no longer take place as such. Please see the answer to question 39 for more information.

(*) Updates applied: Currently, Caipirini removes duplicate records only from Set C (if a valid entry is inserted n times within Set C, it will still be classified only once). If an entry is not unique to Set C (i.e. it is also registered to Set A and/or B), it will not be taken into account for Set C.

[return to top of page]

21. How does Caipirini process the abstracts?

Caipirini extracts the terms mentioned in each abstract. For the classification, vectors corresponding to each abstract are created. The features of the vectors are the terms found in the training set, and the values are the number of times they are mentioned (i.e. indexed by AKS2) per abstract.

(*) Updates applied: In the new version of Caipirini, if an entry of the training set is inserted multiple times in Set A and/or B, then its corresponding vector is taken into account for the training as many times as the entry was inserted (i.e. if an abstract X is mentioned twice within Set A and three times within Set B, then the corresponding vector vX will also be used twice as an example of Set A and three times as an example of Set B).

(*) Updates applied: For Set C only unique abstracts are taken into account (i.e. duplicates within the set and entries common with the training set are removed). For example, if the Set C input consists of abstract X twice, where X is also contained in the training set, and abstract Y n times, then abstract X will be omitted from the results and abstract Y (i.e. its vector) will be classified once.

[return to top of page]

22. Does Caipirini take into account all the abstracts?

Caipirini filters the abstracts in 3 stages: (a) Each input entry is checked against the registered identifiers for the type of input defined by the user, in order to remove any "noise" input (an older filter, which simply checked whether each entry corresponds to a number, has been inactivated because Caipirini now accepts Ensembl ids too). If there are no abstracts retrieved, the procedure stops there. (b) Duplicates are removed from Set C. If an abstract belongs to both Sets A and B, then Caipirini does not take it into account at all. If an abstract belongs to Set A and/or Set B and also to the set to be classified (Set C), it is assigned only to Set A and/or B, where it was found. (c) From the remaining abstracts, the ones that have been indexed are used for training and classification.
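
A minimal Python sketch (hypothetical names) of the set-membership rules of stage (b):

    def filter_sets(set_a, set_b, set_c):
        set_a, set_b, set_c = set(set_a), set(set_b), set(set_c)
        # Abstracts belonging to both training sets are dropped entirely.
        common_ab = set_a & set_b
        set_a -= common_ab
        set_b -= common_ab
        # Abstracts shared with the training input stay in Set A and/or B only.
        set_c -= set_a | set_b | common_ab
        return set_a, set_b, set_c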

[return to top of page]

23. How does the classification and the ranking work?

First, the SVM is trained using the training vectors, and then the classification takes place.

Training Sets. The training set is derived from Sets A and B. For the classification and ranking, Caipirini handles lists of genes and abstracts in the same way. First, common abstracts are removed. Next, Caipirini creates one feature vector for each abstract (a vector is taken into account for Set A and/or B as many times as its corresponding abstract is associated with each input set). Caipirini then scales these vectors so that each dimension ranges from 0.0 to 1.0. The same scale factors that were used for the training vectors are then used to scale the vectors to be classified.
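
A minimal Python sketch of this scaling step, assuming simple per-dimension min-max scaling (the exact implementation details belong to the library used):

    def fit_scale_factors(training_vectors):
        # One (min, max) pair per dimension, learned from the training vectors.
        dims = range(len(training_vectors[0]))
        return ([min(v[d] for v in training_vectors) for d in dims],
                [max(v[d] for v in training_vectors) for d in dims])

    def scale(vector, mins, maxs):
        # Reuse the training scale factors for the vectors to be classified.
        return [0.0 if hi == lo else (x - lo) / (hi - lo)
                for x, lo, hi in zip(vector, mins, maxs)]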

SVM Training. The SVM models are based on (dual) L2-regularized logistic regression. The next step is to search for the best penalty factor c that gives a model with the highest possible accuracy for the given input. The accuracy is assessed by LIBLINEAR using five-fold cross-validation on the training data. Caipirini does this parameter search in two steps. First, it evaluates the accuracy at eight points: c = 2^x for integer x in [-3, 4]. If the total number of training vectors is less than 5,000, or if it is less than 10,000 and the best of the tested c points gives a cross-validated accuracy smaller than 80%, then Caipirini runs a further grid search in the neighborhood of the best c.
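
A minimal Python sketch of this two-step search, assuming scikit-learn's LIBLINEAR-backed logistic regression as a stand-in (the half-step refinement grid around the best c is an assumption, as the exact neighborhood is not specified above):

    import math
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def cv_accuracy(c, X, y):
        # L2-regularized logistic regression, five-fold cross-validation.
        clf = LogisticRegression(C=c, solver="liblinear")
        return cross_val_score(clf, X, y, cv=5).mean()

    def search_best_c(X, y):
        # Step 1: evaluate c = 2^x for integer x in [-3, 4].
        coarse = {2.0 ** x: cv_accuracy(2.0 ** x, X, y) for x in range(-3, 5)}
        best_c = max(coarse, key=coarse.get)
        # Step 2: refine around the best c when the training set is small or
        # the coarse accuracy stays below 80%.
        if len(X) < 5000 or (len(X) < 10000 and coarse[best_c] < 0.80):
            x0 = math.log2(best_c)
            fine = {2.0 ** x: cv_accuracy(2.0 ** x, X, y)
                    for x in (x0 - 0.5, x0 + 0.5)}
            fine[best_c] = coarse[best_c]
            best_c = max(fine, key=fine.get)
        return best_c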

Prediction. The optimal configuration found during the training is then used with the full training set data (i.e. all 'folds' included) to construct the final linear SVM model. The abstracts (vectors) to be classified are then passed to this model: each vector is classified as belonging to Set A if the model assigns it to Set A with a probability of belonging to Set A > 0.5. Vectors are assigned to Set B using the opposite criterion. In the rare case that no clear assignment can be made, the vector is marked as ambiguous. Finally, the abstracts are listed together with these assignments, and ranked according to their probability scores.
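
A minimal Python sketch of this assignment and ranking rule (names are hypothetical; `predict_proba` is assumed to behave as in scikit-learn, returning [P(Set B), P(Set A)] per vector):

    def classify_and_rank(model, pubmed_ids, vectors):
        results = []
        for pmid, vector in zip(pubmed_ids, vectors):
            p_a = model.predict_proba([vector])[0][1]
            assigned = "A" if p_a > 0.5 else "B"
            certainty = max(p_a, 1.0 - p_a)
            results.append((pmid, assigned, certainty))
        # Rank by the certainty of the assignment, highest first.
        return sorted(results, key=lambda r: r[2], reverse=True)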

[return to top of page]

24. What do I see in the raw classification file?

These are the raw classification results, summarized.
The first column refers to the classified abstract's PubMed id. The second column shows the SVM's prediction, i.e., the set the abstract was assigned to. The respective scores (probabilities) for Sets A and B are listed in the third and fourth columns.

[return to top of page]

25. What do I see on the result page?

On the result page, one can see the best cross-validation accuracy achieved during the training. This can be taken as a measure of how successful the training has been. Next, one can see what percentage of abstracts has been assigned to each set (A or B). Last, one can see what percentage of abstracts has been correctly assigned to each set from the default test set (i.e., Sets A and B). The pages 'Top Set A Results' and 'Top Set B Results' list the abstracts found to belong to the respective set with a score >= 85%.
The tables on each page display the PubMed id of each abstract (linked to NCBI), the title of each abstract (the link forwards to the tagged text of the abstract, where the color coding used by Caipirini shows which of the abstract's terms were used, their type and the set(s) they were mentioned in; further information is also provided per term: synonyms, database identifiers, the abstracts from Sets A or B it is mentioned in, etc.), and the abstract's ranking (ranging from 1 to 0.85, i.e., abstracts assigned to a set with certainty between 100% and 85%). The raw results list the SVM scores for all classified abstracts (from Sets A, B and C) in text format.
The colors blue, black and gray correspond to abstracts assigned to Set A, Set B and the ambiguous category, respectively.


[return to top of page]

26. How should I interpret the results?

At the top of the result page, one can find the best cross-validation accuracy (CV accuracy) that the SVM achieved during the training. This is an indication of how well the SVM has been trained on the given data. If the CV accuracy is not high, this does not necessarily mean that the classification and ranking results are wrong. In general, the performance also depends on the difficulty of the problem (some measures for defining and assessing the difficulty of a classification problem and the classification performance can be found in the following article: http://www.siam.org/meetings/sdm01/pdf/sdm01_16.pdf).
Also, one can view and download the raw Caipirini results. There, only the PubMed Ids of the abstracts extracted from the input entered by the user are mentioned.
When title links are provided, the user can see the text of the abstract and the terms indexed for it. For each term there are external links to publicly available databases and collections, such as Entrez, PubChem, SRS3D, etc. The indexed terms are colored based on the term type they belong to. If a term has been found in the training set of abstracts, then it also has a background color: blue represents terms found only in the input Set A, gray represents terms found in abstracts of both Sets A and B, and dark gray represents terms found only in the input Set B.
This coloring scheme can sometimes give a quick visual interpretation of why an abstract has been assigned to Set A or B.

[return to top of page]

27. I want to work with the results of Caipirini; how can I access them?

The raw results file provides the essence of the Caipirini results. The results are presented in this form so that you can download them and then work on/with them.

[return to top of page]

28. I want to work with the results of Caipirini; how can I process them?

One could do many things with Caipirini and its results (text mining, statistical analysis, experiments, etc.). There is no specific direction; this is entirely up to the user. A minimal parsing example is sketched below.
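
For example, a Python sketch for reading the raw classification file, assuming the four whitespace-separated columns described in question 24 (the file name is a placeholder, and the exact label encoding of the second column may differ):

    def read_raw_results(path):
        # Columns: PubMed id, predicted set, score for Set A, score for Set B.
        results = []
        with open(path) as handle:
            for line in handle:
                fields = line.split()
                if len(fields) != 4:
                    continue  # skip any header or malformed line
                pmid, predicted_set, score_a, score_b = fields
                results.append((pmid, predicted_set, float(score_a), float(score_b)))
        return results

    # Example: keep the abstracts assigned to Set A with a score of at least 0.85.
    top_a = [row for row in read_raw_results("raw_results.txt")
             if row[2] > row[3] and row[2] >= 0.85]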

[return to top of page]

29. I cannot find all my PubMed Ids in the results! What has happened?

They have probably been removed during the filtering procedures (see questions 20 and 22).

[return to top of page]

30. An abstract assigned to Set A contains more terms that are mentioned only in the input Set B; is this wrong?

Well, not necessarily. The abstract has most probably been assigned to the right set. The fact that it contains more terms that belong to the examples of the other training set means that the rest of this abstract's features (such as the combination of terms, the number of times they occur and the terms that belong to the training examples of the set it has been assigned to) have had a stronger effect.

[return to top of page]

31. The classification results do not satisfy me. Why do I get these results?!

There might be several reasons:
(a) The amount of example PubMed Ids that have been entered is not enough to describe the difference.
(b) The PubMed Ids that have been entered as examples do not really describe the desired difference.
(c) The selection of the term types might have not been the appropriate one.
(d) The abstracts do not contain enough of the selected term types so that the difference in their context can be really identified.
(e) If the example lists contain a very imbalanced set of PubMed Ids (i.e. the examples of one list are many more than the examples of the other one) the SVM tends to consider everything as belonging to the class with the most examples.
(f) The separation problem is too difficult to be solved by any classifier.

[return to top of page]

32. The classification results do not satisfy me. What should I do? How can I influence the performance?

You could (a) increase the number of examples, (b) refine your example abstracts, i.e. find more descriptive ones, or (c) change the term type configuration.

[return to top of page]

33. I gave as query the PubMed Ids of Sets A and B so that I could check the performance, but I cannot find them in the results; what happened?

You cannot classify the same abstracts that you have entered as examples, because Caipirini removes them from Set C (it does not take them into account simply for computational speed reasons).

This should not happen anymore; note that the training set is also used as a default test set.

[return to top of page]

34. What is a Support Vector Machine (SVM)?

SVMs belong to the category of supervised learning methods and are mainly used for classification and regression. An SVM maps the input vectors to a higher-dimensional space, where a separating hyperplane between the classes to be distinguished is constructed. From all possible separating hyperplanes, the SVM chooses the maximal-margin one (i.e. two parallel hyperplanes are placed against each side of the data; the maximal separating hyperplane is the one that maximizes the distance between these two hyperplanes), based on the notion that the larger the margin between the classes, the better the classifier.
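
In standard SVM notation (a general sketch, not specific to Caipirini): for training vectors x_i with class labels y_i = +1 or -1, the maximal-margin hyperplane w . x + b = 0 is found by solving

    \min_{w,\,b} \; \tfrac{1}{2}\lVert w \rVert^2
    \quad \text{subject to} \quad
    y_i \,(w \cdot x_i + b) \ge 1, \qquad i = 1, \dots, n

The margin between the two parallel supporting hyperplanes (w . x + b = +1 and w . x + b = -1) equals 2 / ||w||, so minimizing ||w|| maximizes the margin.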

[return to top of page]

35. Why RBFs and not another kernel!?

Initially, RBFs were chosen as they are considered more flexible and successful in solving difficult classification problems. Nevertheless, linear SVMs, which are computationally much faster, are considered to perform almost equally well for the classification of text data. For this reason, the LIBSVM library was replaced by LIBLINEAR, also improving Caipirini's response times.

[return to top of page]

36. Why do I receive an e-mail with the results of a Martini job before the e-mail with the results of Caipirini!?

When a Caipirini job is launched, and while Caipirini itself is running, it by default initiates a Martini Keyword Enhancement Analysis with its training data (i.e. Sets A and B), so that the user can be informed on whether, and how much, these sets are suitable for training a classifier (here an SVM). Although this cannot be considered an absolute measurement, one would expect that the more significantly enhanced terms there are, the more separable or distinct the two training Sets A and B can be considered to be, and thus the better they are for training, i.e. for learning a difference. Nevertheless, although Martini can grasp the content of two sets by finding the significantly enhanced terms, the two methods are based on completely different mechanisms and have different purposes. As such, these results can only be considered a useful indication of the fitness of two sets as training data for Caipirini, and nothing more, as the SVM may successfully grasp, or may be negatively influenced by, further sensitive term trends identified during its own analysis.

(*) Updates applied: The user no longer receives a second e-mail with Martini's results when a Caipirini job is launched. If you wish to receive them, either submit the same data set to Martini or contact us for instructions on how to access them.

(*) Updates applied: Caipirini does not use e-mails anymore.

[return to top of page]

37. Can one enter both Entrez and Ensembl ids in a single set?! What happens then?!

Yes, one can enter in the same set (A or B) a list of gene ids from different organisms, consisting of Entrez ids, Ensembl ids or both. In this case, Caipirini will remove duplicate records in the sense that
(a) if an id is pasted more than once, then it will be taken into account only a single time for the analysis, and
(b) if an Ensembl and an Entrez id correspond to the same gene, then only one of the two will be taken into account for the analysis.

Internally, Caipirini first maps the Ensembl ids in the background to their corresponding Entrez Gene entries, and it is then their respective abstracts that are used for the analysis.

(*) Updates applied: Parts of the processing described above no longer take place as such. Please see the answer to question 39 for more information.

[return to top of page]

38. How can I automatically submit Martini and/or Caipirini jobs?! Is there an API?!

(*) Updates applied: No, currently you cannot! Martini and Caipirini do not overlap as much anymore, because the structure of Caipirini has been updated. We plan to provide this possibility again in the near future.

Yes, you can! There are two ways:

1. One can sequentially call from another program the script that can be downloaded here, passing the proper parameters in order to launch Martini Keyword Enhancement Analyses and Caipirini Classification and Ranking jobs. The script constructs the respective Martini and/or Caipirini job urls and in turn submits the data for the analyses to be launched. When the jobs have started, the script prints out the response message from the server and finishes. The results are then delivered as from the main interface.
If the output you receive mentions 'Access denied due to security policy violation', please check your firewall and inter- or intranet settings.

2. Alternatively, one can construct the Martini and/or Caipirini job urls directly, as follows below.

To repeat, the input for both Martini and Caipirini consists of:
- A list of PubMed ids or Entrez/Ensembl Gene ids or a PubMed query to define Set A.
- A list of PubMed ids or Entrez/Ensembl Gene ids or a PubMed query to define Set B.
- A set of term types to be used for analysis.
- If a Caipirini job is to be started, a third set (Set C), given as a PubMed query, must also be defined.

To launch Martini or Caipirini jobs, one must define the url prefix (i.e. http://martini.embl.de/martini/startGO? and http://caipirini.org/startGO? respectively) and the following parameters:
* positive_list   : The input for Set A (one id per line or a PubMed query; all types should be URL-encoded for proper submission to take place).
* inputA          : The input type for Set A. Please enter:
      pubmedids : if the input consists of a list of PubMed ids, formatted as described above.
      geneids   : if the input consists of a list of Entrez and/or Ensembl Gene ids, formatted as described above.
      qpubmed   : if the input consists of a PubMed query, formatted as described above.
* negative_list   : The input for Set B (as above, i.e. as for option 'positive_list').
* inputB          : The input type for Set B (as above, i.e. as for option 'inputA').
* descriptionData : A description of the input data or the job to be submitted (URL-encoded; optional).
* email           : The e-mail address to which the reply with the results of the launched Martini or Caipirini job is sent. Please enter one valid e-mail address (URL-encoded).
* queryPubmed     : For a Caipirini job to be launched, Set C must be defined as a PubMed query (URL-encoded); otherwise, set the value equal to do_enhancement_case to launch Martini.
* termTypes1      : Define whether terms of type 'Organisms' should be taken into account for the analysis by setting the value '4'.
* termTypes2      : Define whether terms of type 'Genes/Proteins' should be taken into account for the analysis by setting the value '2'.
* termTypes3      : Define whether terms of type 'Small Molecules' should be taken into account for the analysis by setting the value '1__8'.
* termTypes4      : Define whether terms of type 'Diseases' should be taken into account for the analysis by setting the value '3'.
* termTypes5      : Define whether terms of type 'Symptoms' should be taken into account for the analysis by setting the value '7'.
If a term type definition is omitted or assigned a wrong value, it will not be taken into account. If any of the essential parameters (i.e. positive_list, inputA, negative_list, inputB, queryPubmed and email) is omitted or not assigned a value, there will be no response from the server and no job will be launched.
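
To illustrate, here is a minimal Python sketch that constructs a Caipirini job url from the parameters above (the ids, query and e-mail address are hypothetical placeholders):

    from urllib.parse import urlencode

    params = {
        "positive_list": "15456405\n16403520",  # Set A: one PubMed id per line
        "inputA": "pubmedids",
        "negative_list": "17284678\n18276894",  # Set B: one PubMed id per line
        "inputB": "pubmedids",
        "email": "user@example.org",
        "queryPubmed": "tomato resistance",     # Set C as a PubMed query
        "termTypes1": "4",     # take 'Organisms' into account
        "termTypes2": "2",     # take 'Genes/Proteins' into account
        "termTypes3": "1__8",  # take 'Small Molecules' into account
        "termTypes4": "3",     # take 'Diseases' into account
        "termTypes5": "7",     # take 'Symptoms' into account
    }

    # urlencode performs the URL-encoding that the interface expects.
    print("http://caipirini.org/startGO?" + urlencode(params))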

Examples (to see the constructed url, please copy the underlying link and paste it):
- Martini: Evaluation Data Set 1 (Set A and B as PubMed Ids; all term types used) .
- Martini: Evaluation Data Set 1 (Set A as PubMed Ids; Set B as PubMed query; only 'Gene/Proteins' and 'Small Molecules' term types used)
- Caipirini: Set A as Entrez Gene Ids; Set B as PubMed ids; 'Organisms' and 'Small Molecules' term types not used; Set C a query of two abstracts.


[return to top of page]

39. I submitted an older job of mine, but this time I receive slightly different results!? The same applies to the examples that you provide! What is wrong?!

There is nothing wrong. Most probably the background data have been updated and this has influenced the results. The background data consist mainly of the Gene-to-PubMed id mappings and the AKS2 repository information. The results of the analysis may have been affected primarily by the fact that:
(a) a different, new set of abstracts may be associated with a gene entry (abstracts may have been added and/or removed); (b) more abstracts may be indexed by AKS2; (c) more (new) terms may be indexed by AKS2 for a given abstract.

(*) Updates applied: As a matter of fact, Caipirini background data have been updated already.

(*) Updates applied: In addition to the background information data-update, Caipirini now also processes the input in a different manner.

Users have asked, and it has been considered proper strategy, that the removal of repeated entries within a set, as well as of entries common to both Sets A and B, should not take place. The reason for this is that, for some users, the existence of repeated entries within a set, or of entries common to both Sets A and B, constitutes valuable information for the modelling of their data. Originally, the developers had assumed that users would not want to take such computational issues into account and that repeated entries would thus be a 'copy-paste' oversight in cleaning up the input data. Thus, Caipirini itself was filtering and cleaning the input sets.

Nevertheless, the requested change in the input-processing strategy has now taken place and can give different results than before. For this reason, a 'Warning' message accompanies the results, bringing this issue to the attention of users, who may this time want to give as input new lists with unique entries.

(*) Updates applied: Furthermore, Martini can now accept as input not only Entrez Gene ids (from any organism) but also Ensembl Gene ids.

A single gene set is allowed to comprise a mix of gene ids from Entrez, from Ensembl or from both, and from whichever organism is applicable in these sources. Ensembl genes are not linked directly to the literature but via their corresponding Entrez Gene records. In the case of a one-to-many Ensembl-to-Entrez Gene mapping, the literature of the Ensembl record is set to be the union of the abstracts linked to the corresponding Entrez Gene records.

(*) Updates applied: These updates may lead to longer processing times, as the number of entries to be taken into account becomes larger.

[return to top of page]