Idea
As a mathematician, I often read articles from other scientific fields to widen my view and to see where the kinds of tools math provides can be applied. A while ago I was looking for numerical simulations of the spread of cancer. I found two things: first, a giant sea of publications that was impossible to navigate; second, two papers called “The Hallmarks of Cancer” and “The Hallmarks of Cancer: The Next Generation”.
These two publications describe essential components of cancer, necessary for its survival and spread. Specifically, they offer a categorization of the processes associated with cancer and a description thereof, which I found fascinating. They are also really well-written and offer a lot of insight for someone completely uneducated on the topic of cancer, like me.
Similar to oncology, my knowledge of machine learning and artificial intelligence applied to natural language is very limited, but I do know that context is one of the biggest problems in this field: it is hard for a machine to understand what people are talking about if no context is given. Here, however, scientific texts stand out, since they make extensive use of topic-specific words, such as abbreviations or scientific terms.
So let’s assume that research on cancer focuses on specific fields, as the hallmarks suggest. It would then be reasonable to expect those clusters of researchers to have their own specific terms: the names of proteins associated with the processes they investigate, names of tests, chemicals, symptoms, and so on.
I came to the conclusion that if I were able to rate every scientific term by how strongly it relates to the individual fields, I would then be able to rate any text as a set of words, adding up the ratings of the terms it contains.
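As a minimal sketch of this idea (all term ratings and field names below are invented for illustration, not taken from any real dataset), a text's rating could be computed by looking up each word in a table of per-field term scores and summing:

```python
import re
from collections import Counter

# Hypothetical per-term ratings: each term maps to a score for each field.
# The fields loosely follow the hallmark categories; all numbers are invented.
TERM_RATINGS = {
    "apoptosis": {"resisting cell death": 0.9},
    "angiogenesis": {"inducing angiogenesis": 0.95},
    "metastasis": {"activating invasion and metastasis": 0.9},
}

def rate_text(text):
    """Rate a text as a set of words by adding up the ratings of known terms."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    rating = {}
    for term, fields in TERM_RATINGS.items():
        if counts[term]:
            for field, score in fields.items():
                rating[field] = rating.get(field, 0.0) + score * counts[term]
    return rating

print(rate_text("Tumour angiogenesis often precedes metastasis."))
```

A real system would of course have to learn the term table from a corpus of publications rather than hand-code it.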
Proposition
There are millions of publications on cancer and oncology, and hundreds of thousands are added every year, making it impossible for researchers to read them all – even the titles would most likely be too much. To aid in the selection of relevant articles, a tool that rates the most likely relevant ones would be useful. Because there are so many publications, however, the classification cannot be done by me or by a panel of experts, since the work would then be limited to the experts’ fields of expertise. Also, while a pre-classified dataset gets better with the amount of data it contains, the classification process deteriorates the larger the target dataset becomes: you need more people, which diversifies the perspectives (what seems relevant to a field to one specialist might not to another), and the process takes longer, so the experts themselves change over time and, consequently, so does their way of classifying. So I want to create a tool that can be applied to any scientific field to classify publications and rate them. There should be two main functions: on the one hand, it should be able to rate any publication; on the other hand, it should be able to show other publications with similar ratings – in effect, a kind of search engine.
Search
Given an article title or DOI, it should be possible to retrieve its rating. This would be useful to determine how well the system works, because experts could then look up the ratings of articles they know.
Rating
Given a new text – like an abstract or entire article – the system should provide its rating. This would help users find out which area in the realm of possible ratings is relevant to them.
Sorting
Given a reference point (either an article or its rating), it should be possible either to search for the closest articles or to sort a list of articles supplied by the user.
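Treating each rating as a point in the space of possible ratings, this sorting feature reduces to a nearest-neighbour search. A minimal sketch (the article names and rating vectors are invented):

```python
import math

# Each article's rating is a vector over the same fields (values invented).
ARTICLES = {
    "Paper A": [0.9, 0.1, 0.0],
    "Paper B": [0.2, 0.7, 0.1],
    "Paper C": [0.7, 0.2, 0.1],
}

def distance(a, b):
    """Euclidean distance between two rating vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def sort_by_closeness(reference, articles):
    """Order articles so those rated most similarly to the reference come first."""
    return sorted(articles, key=lambda title: distance(articles[title], reference))

print(sort_by_closeness([0.85, 0.15, 0.05], ARTICLES))
```

For a handful of articles a plain sort like this is enough; at the scale of millions of publications, an index structure for nearest-neighbour queries would be needed instead.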
Implementation
Since this tool is supposed to aid in research, it should be accessible to the general public to have the maximal positive impact. As a consequence, it has to be open source, and it may not use content that is subject to restrictive licenses. All code will be published on my GitHub page, and it will be developed in a way that makes it useful to people in other fields as well. It would also be useful if it were built with tools that many other developers know, since more people would then be able to reuse the code. As the primary language, I have therefore decided to use Python for all the functional components and web technologies (Angular, TypeScript, HTML, and CSS) for the front-end. The front-end (i.e. the user interface once the classifier is trained and ready to use) would be a little more difficult to personalize for people inexperienced with web design; however, the advantages of using web technology outweigh the downsides here, since it offers a simple way to make a classifier available to coworkers or the general public. This project is also intended to be completed within a very short timeframe, so as not to interfere with other projects and also to show people what is possible in a short time.
More information on the progress and scope of this project will soon be released here!