“A Diagnosing Challenge” won first place for the 2017 S. Klein Prize for scientific writing.
Of all the stressful jobs in a hospital, perhaps none is more pressure-packed than that of a pathologist tasked with examining a tiny sample of cells on a glass slide and determining whether they are malignant or benign.
Like a map of a newly discovered territory, cellular regions with flexible borders, stained different colors, crowd along a twisting edge. As the pathologist peers through a microscope at a region of cells, she might notice an abnormality, a cancerous pattern or distribution of nuclei, and zoom in on the glass slide to see it at higher resolution. Then she pulls back and, finding another troubling patch, takes note. A pathologist might spend twenty minutes with a single slide, and there might be twelve slides made from a single lymph node, one of the first glands that breast cancer invades.
The work is precise, methodical, and ancient—even in an age of digitization and the internet of things, the pathologist’s process remains decades old. Pathologists still search for the same cancerous signatures that were used to diagnose cancer almost a hundred years ago.
Frequently isolated from the rest of a hospital, a pathologist delivers verdicts that become a doctor’s critical decisions. Often, a pathologist reviews samples while the patient in question is still on the operating table, the surgeon wondering whether the whole tumor has been removed.
If examining a 3-micron slice of tissue—about the width of a bacterium—proves inconclusive, the slide is sent away for another round of processing, or packaged up and shared with another pathologist: time spent waiting while a tumor might be growing. The human eye is easily exhausted. And the mind is finite. There are only so many subtle malignant patterns that even an expert pathologist can remember and distinguish. Eventually the pathologist might miss a micrometastasis, the first sign that a cancer has spread.
Human fallibility is inevitable. But the ability to make split-second, dispassionate diagnostic decisions based on an image has existed for more than a decade. People working in facial recognition and computer vision—using computers to extract useful information from images and turn it into decisions—have the tools for counting proliferating nuclei and identifying metastasis, but have never applied them to medical diagnosis.
Not for lack of interest. “For a lot of people it is more rewarding to work on cancer detection, or finding disease, than on a typical computer vision task, like labelling videos on the internet,” says Bram van Ginneken, a professor of functional image analysis at Radboud University in the Netherlands. Heavily restricted access to medical images keeps repetitive tasks that would be ideally suited to automation in the hands of pathologists. “There is a big hurdle for people working in computer vision to work with medical data,” he says. “If you are not working directly with doctors, the medical data is locked up inside the hospitals.”
While individuals could join a lab like van Ginneken’s, the field remains closed to the sorts of collaborations, forums, and competitions that drive progress in computer vision. In contrast, the computer vision community uses open sites like Kaggle or ImageNet to host challenges for visual recognition and image classification, where individuals or teams have access to large datasets and can submit their algorithms and win prizes. Van Ginneken says that he notices a trend of including Kaggle scores on CVs—and that these scores are often what gets people hired.
In 2006, van Ginneken and his colleague Tobias Heimann had a radical idea: introduce a practical component to their circuit of annual medical imaging conferences. In addition to seminars, the Medical Image Computing and Computer Assisted Intervention meeting would host a “Grand Challenge”—a chance for groups to pit diagnostic algorithms against one another. By 2007, van Ginneken had helped organize two challenges for algorithms to automate the analysis of computed tomography (CT) and magnetic resonance imaging (MRI) scans. Challenges have since been adopted by all of the major medical imaging conferences. Topics have ranged from aligning lung scans to staging retinopathy, using materials from digitized glass slides to electron microscopy images.
“The challenges really allow people who do not have direct access to medical data to download a big data set and to work on a relevant problem. We’ve actually seen in a number of challenges that people—sort of outsiders—participated and scored very well. I think it is definitely a way to attract new talent into the medical image analysis community,” van Ginneken says.
Last April, Babak Ehteshami-Bejnordi and Oscar Geessink, a pair of scientists at Radboud, hosted Camelyon16, a digital pathology challenge in which teams competed to write machine learning algorithms that would identify whether lymph node tissue was cancerous, and then highlight where the troubling region lay. “You really have to look for a needle in a haystack, and that is what these computer algorithms are really suited for, they can process the whole image, and the pathologist will never be able to do that—to go at the highest magnification and really search every part of the image,” says van Ginneken of the topic. “[Pathologists] are basically taking shortcuts. Computers have an enormous potential in this area because of the nature of the task.”
By striking a balance between solving a real-world problem and staging a competition, Camelyon16 saw the highest participation rate on record for any International Symposium on Biomedical Imaging (ISBI) meeting, and possibly any medical challenge to date.
Camelyon16 also gave birth to the first algorithm to outcompete a human pathologist. Ehteshami-Bejnordi’s excitement for applying machine learning to digital pathology comes out in gesticulations, and his explanations are accompanied by generous line drawings and abstract sketches of neural networks and layers of breast tissue. He credits this leap to leveraging the skills of people not previously in the field—and to advertising outside the medical world, on LinkedIn, Google+, and Reddit. Until then, the problem had not been presented to the right community.
Using computers to analyze medical images is not new. The FDA approved the first software to analyze mammograms in 1998, and other software has been developed since to process radiology and X-ray images. No one has printed an MRI on physical film in years. But pathology lags more than fifteen years behind. The majority of pathologists still deal in actual glass slides.
It wasn’t until the last six years or so that pathology began inching toward the digital. Initially, scanners were installed in remote hospitals without a pathologist on site: slides were fed in and a digital copy created for review elsewhere. The ease of sharing digital slides quickly became apparent during consultations between pathologists. Before scanners, a questioning pathologist would carefully package up the glass slides and mail them away for a second opinion.
Pathology is almost more art than science, and its practitioners are aging quickly—the average age of a pathologist is fifty-seven. And to Zoya Volynskaya, the director of clinical and research informatics at the University Health Network in Toronto, pathologists tend to come across as a bit stuck in their ways. “They have a specific way of analyzing and reviewing slides for twenty or thirty years [using] a microscope, so they ask ‘why do I need to learn something new if it doesn’t give me anything in return?’” Volynskaya says. But because the profession is aging, Volynskaya says that even the United States is on the verge of a mass pathologist shortage. The general population is aging too, and people are being screened, tested, and biopsied more than ever, just as the lack of pathologists is starting to be felt.
Though pathologists rarely interact with patients, everyone interviewed for this article described them as the busiest, most burdened link in a hospital; every second of a pathologist’s time is precious. Asking a pathologist to learn a new system, one still under development and liable to change significantly, on top of their daily work is a challenge.
Healthcare organizations resist digitization because of storage. The College of American Pathologists recommends that slides be kept indefinitely, but concedes that physical limitations might impose a ten-year limit. Hospitals accommodate by creating temperature-controlled, offsite annexes for slides and human tissue samples preserved in paraffin wax. When an older specimen is needed for comparison, the physical slide is summoned, and little about it will have changed. While this system is cumbersome, it is also reliable—and somehow, astonishingly, until very recently still cheaper than the massive amounts of digital storage needed for pathology images. Nobody is sure how sample banking will work with digitization, another reason why systems are still resisting what seems like an obvious step. “If you go backward twenty years, to 1996, do you remember what kind of storage we had? Who knows what we will have in twenty years, and, what’s important, we will still need to open images after that amount of time,” Volynskaya says.
Issues of standardization extend beyond image format. Not only does each institution see different patients, but every lab prepares slides with a different protocol—and as a result, the hematoxylin and eosin dyes stain the same cellular features with a spectrum of color, some more red, some more brown, all mostly pink and purple. Not all tissue is cut to an exact three-micron slice. And though it can recognize the complex patterns and features that describe a face (though arguably not all faces), computer vision software often fails when confronted with color variations—and makes the sort of mistakes that a pathologist never would.
When a group develops a machine learning algorithm, the algorithm must be trained: by feeding it images of cancerous and healthy tissues, the system gradually learns which elements matter most for judging an image. Groups train their algorithms on data available to their institution, and often this data is private. When comparing their algorithm against leading algorithms from other groups—the publication standard—researchers use algorithms that have been trained on other institutions’ data sets to analyze their own data. But because the training data varies so much, this is almost like asking a concert pianist to perform a piece on a cello. Algorithms from other places invariably perform worse and give inconsistent estimates of accuracy. This can lead a developer to say his algorithm “outperforms the state of the art, but the state of the art is poorly defined,” says Ehteshami-Bejnordi.
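That training process can be sketched in miniature. Everything below is a toy illustration, not the actual challenge code: the two “features” per tissue patch, the labels, and the simple logistic-regression learner are invented stand-ins, but the loop shows how a model’s weights gradually come to encode which elements matter for judging an image.

```python
import numpy as np

# Toy stand-ins for image-derived features: each row is a tissue patch
# summarized by two hypothetical measurements (say, nuclei density and
# stain intensity); labels mark patches as healthy (0) or cancerous (1).
rng = np.random.default_rng(0)
healthy = rng.normal(loc=[0.3, 0.3], scale=0.1, size=(50, 2))
cancer = rng.normal(loc=[0.7, 0.7], scale=0.1, size=(50, 2))
X = np.vstack([healthy, cancer])
y = np.array([0] * 50 + [1] * 50)

# Logistic regression trained by gradient descent: with each pass over the
# labeled examples, the weights shift toward the features that best
# separate cancerous from healthy patches.
w, b = np.zeros(2), 0.0
for _ in range(2000):
    p = 1 / (1 + np.exp(-(X @ w + b)))   # predicted probability of cancer
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

p = 1 / (1 + np.exp(-(X @ w + b)))
accuracy = float(np.mean((p > 0.5) == y))
```

Real challenge entries used deep neural networks over raw pixels rather than two hand-picked numbers, but the principle is the same: the model is only as good as the labeled data it learns from.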
In the most common type of breast cancer, the tumor forms in the milk duct of the breast and spreads to surrounding tissue and lymph nodes, earning the name invasive ductal carcinoma. Metastasis in ductal carcinoma is relatively easy to spot. But in the roughly 10% of breast cancers known as lobular carcinomas, metastasis looks strikingly similar to normal lymph node cellular patterning, making diagnosis much more difficult. Often, these are the cancers that are missed.
The shift in cellular features is subtle, but Camelyon16 participants had a lot of data to train their algorithms on. Ehteshami-Bejnordi gave participants 400 images from 270 patients at two different hospitals, to demonstrate that the algorithms would perform well in a world with more than one institution. While 400 images might be a well-documented vacation’s worth of snaps, the size of each image makes this data set massive. At the resolution of an iPhone photo, one image would take up roughly the area of the side of an apartment building. Compressed, each image occupies only 30 gigabytes, which still exceeds the memory of an average laptop. In the months leading up to the challenge, Ehteshami-Bejnordi answered emails in the middle of the night from groups who were having difficulties even accessing the images.
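A back-of-the-envelope calculation suggests why even downloading the data was a struggle. The slide dimensions below are hypothetical round numbers, not the actual Camelyon16 specifications:

```python
# Hypothetical whole-slide image at its highest magnification; the
# dimensions are illustrative round numbers, not the challenge's specs.
width_px, height_px = 200_000, 100_000
bytes_per_px = 3                       # 8-bit RGB, uncompressed

uncompressed_gb = width_px * height_px * bytes_per_px / 1e9
print(f"one slide, uncompressed: {uncompressed_gb:.0f} GB")   # 60 GB

# Even compressed to the ~30 GB the article cites, 400 such images come
# to roughly 12 terabytes -- far beyond an average laptop.
compressed_dataset_tb = 400 * 30 / 1000
```

At that scale, moving the data across the internet becomes a research bottleneck in its own right.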
Dayong Wang builds algorithms that get people’s attention. A soft-spoken postdoc in Andrew Beck’s lab at Harvard Medical School and the leader of his Camelyon16 team, Wang joined the group last year without any prior medical background. As a postdoc at Michigan State University, he developed a large-scale face retrieval system that was licensed to NEC Corporation of America, one of the largest suppliers of biometric technology to law enforcement agencies. But “facial recognition is a popular topic, a mature technique. Everything is on a large scale with low resolution,” says Wang. “I wanted something more exciting. If you combine domains, you get a more interesting problem.”
At their core, the algorithms that Wang had worked on before were similar: image classification. He identified people and faces, and now he needed to identify tumors. He began discussions with medical doctors in the group to learn how they performed “classifications,” and spoke with pathologists who could describe which features of a slide most factored into their predictions about a tumor and the surrounding tissue.
The algorithm Wang developed used two steps of classification. The first locates cancerous regions on the slide. The second revisits the areas that the first has labeled “clean”—potential false negatives—to distinguish between faint cancer patterning and the noise of the slide. Ehteshami-Bejnordi says that this two-step approach was quickly adopted by several teams, and will likely set the baseline of complexity for the Camelyon17 challenge.
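In outline, the cascade works something like the sketch below, where the models, scores, and threshold are all invented for illustration rather than taken from Wang’s system:

```python
def two_stage_classify(patches, coarse_model, fine_model, threshold=0.5):
    """Cascade: a coarse pass flags obvious tumor; a second, more
    sensitive pass re-examines only the patches the first pass called
    clean, hunting for faint signals among potential false negatives."""
    results = {}
    for name, patch in patches.items():
        score = coarse_model(patch)
        if score < threshold:                  # labeled "clean": look again
            score = max(score, fine_model(patch))
        results[name] = "tumor" if score >= threshold else "clean"
    return results

# Toy models scoring a fake "patch" (here just a number): the coarse model
# misses faint signals, the fine model amplifies them.
coarse = lambda p: p
fine = lambda p: p * 1.8

labels = two_stage_classify({"a": 0.9, "b": 0.35, "c": 0.1}, coarse, fine)
# {"a": "tumor", "b": "tumor", "c": "clean"}
```

The design choice is about economy: the expensive, sensitive model runs only where the cheap model is uncertain, which matters when a single slide contains millions of patches.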
Visualizing how an algorithm learned to make decisions, or classifications, provides insight into additional information that humans have missed. Beck says that the algorithm might be “pointing at something that you have been ignoring about cancer that the computer learned on its own— points to aspects of the image that are often overlooked, for example the stroma,” a connective tissue whose importance in indicating cancer has only recently been identified.
On a bright cold morning in Prague last April, the International Symposium on Biomedical Imaging opened with a day of challenge results. The conference room was packed as more teams than ever before vied to place their classification algorithms first on the leaderboard. Many teams had been working for months, if not half a year, to get ready for this day, and sent representatives across the Atlantic to be present for the Camelyon16 reveal.
The algorithms competed against a pathologist who had an unlimited amount of time to spend with each slide—in some cases up to twenty minutes apiece, which inflates the estimate of the human’s accuracy (in routine diagnostic situations, pathologists spend an average of two to five minutes per slide). Verified against each patient’s known diagnosis, the pathologist’s identifications were 96% accurate. Before the challenge, the ideal envisioned workflow was one in which the computer looked through the slides, ordered them by risk, and presented them to a pathologist for cross-examination. This way the pathologist could spot-check the work of the algorithm.
At the opening of Camelyon16, the winning algorithm for whole-slide classification, submitted by Wang, scored 92% accuracy; coupled with the pathologist, this increased to 99%. This was far better than had previously been thought possible—a resounding success. Because of the open format of the challenge, another group (also from Harvard, but collaborating with the Gordon Center at Massachusetts General Hospital) submitted an algorithm that performed at 97%. Right before the final close of the challenge, Beck and Wang submitted another that scored 99%. Two teams have now beaten the pathologist working alone. “Computers have outperformed pathologists for the first time,” says Ehteshami-Bejnordi, waving his hands and nearly spilling a cup of coffee in his excitement.
By focusing the attention of two disparate groups of specialists—pathologists and computer vision researchers—in the context of a competition, a method that might have taken four or five years to develop emerged in a matter of months.
In the spirit of developing the best solution for pathologists, Camelyon16 accepted resubmissions from groups who had further collaborated on or developed their algorithms up until the announcement of the 2017 rules. Participants had the choice of making their algorithm open source by posting it to a site like GitHub, allowing other groups to use successful code to build variations or to inspire another submission. (Companies that participated tended to opt not to.) An algorithm that Ehteshami-Bejnordi developed to standardize the pinks and purples between slides, for example, was incorporated by many groups as the first step in their own algorithms.
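The article doesn’t detail Ehteshami-Bejnordi’s normalization method, but one simple, hypothetical way to standardize color between labs is to shift and scale each color channel so its statistics match those of a reference slide:

```python
import numpy as np

def match_channel_stats(image, reference):
    """Shift and scale each RGB channel of `image` so its mean and spread
    match the reference slide's -- a crude form of stain normalization."""
    out = np.empty_like(image, dtype=float)
    for c in range(3):
        src = image[..., c].astype(float)
        ref = reference[..., c].astype(float)
        scale = ref.std() / (src.std() + 1e-8)
        out[..., c] = (src - src.mean()) * scale + ref.mean()
    return np.clip(out, 0, 255).astype(np.uint8)

# Toy 4x4 "slides": one stained too red, one used as the reference palette.
rng = np.random.default_rng(1)
reference = rng.integers(80, 200, size=(4, 4, 3))
too_red = reference.copy()
too_red[..., 0] = np.clip(too_red[..., 0] + 40, 0, 255)   # red cast
normalized = match_channel_stats(too_red, reference)
```

Published stain-normalization methods are considerably more sophisticated, separating the hematoxylin and eosin stains themselves rather than raw RGB channels, but the goal is the same: make slides from different labs look alike before an algorithm sees them.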
The preliminary results of Camelyon16—Wang and Beck’s first algorithm—were featured in The National Artificial Intelligence Research and Development Strategic Plan this October. This is the first time that a machine learning competition made it into a report by the National Science and Technology Council, the White House group that coordinates science policy. Radboud University pathologists have been using a simpler algorithm for a couple of years, and will likely update to a newer one. “And at some point we might remove the pathologist,” says Ehteshami-Bejnordi.
The success of Beck and Wang’s team at Camelyon16 also marked a foray into what might be a lucrative pursuit. This summer Beck and Aditya Khosla, another member of his and Wang’s team, launched PathAI, a startup that will analyze a customer’s pathology dataset. “The interest of the patients and the machines are very well aligned, particularly in resource poor areas. This will give patients access to a pathologist who has seen thousands or millions of cancers,” says Beck. But it is unclear how much of the algorithms they are making proprietary came from open source algorithms written by others, or how adopting a business strategy might hamper further collaboration.
Despite generating rapid results, challenges are not without controversy. Van Ginneken remembers that when he first proposed to introduce challenges in 2007, the majority of the scientific community resisted. “We got a lot of opposition from people who really didn’t believe that this was very scientific. Many people think this is just about tweaking the settings of your algorithm, and from a scientific point of view you learn very little,” van Ginneken says. “[They think] applying five methods on the same data set is more an engineering, optimization process. And actually I can see that point of view and I can agree, but the success has been from the participants.” Challenges have become a component of every biomedical imaging analysis conference, and most conferences host more than one. There is even a competition process to have a challenge idea chosen.
And though challenges don’t require much funding, progress is not without cost. Teams are working to solve somebody else’s problem, for free. Of the twenty-one teams that entered Camelyon16, Ehteshami-Bejnordi thinks that maybe only the top five will be able to publish the results of their research. Though the other groups learned something through participation, diverting the work of a team for two or three months comes at a cost to research. Several teams lost weeks just trying to download massive images.
Van Ginneken also offers a critique of the entire challenge structure, one echoed by the computer vision community. Perhaps challenges should be designed to encourage collaboration, not competition. Teams compete for prizes and fame, and because they are all trying to solve the same problem, their time is spent doing the same work—and not collaborating.
“It would be nicer to say, ‘here is a problem, here is a big data set with it, and now let’s try to let everybody in the world collaborate to get the best solution for this problem,’” says van Ginneken. But this requires a paradigm change beyond a once-a-year challenge at a conference. The academic environment that many teams come from is also one of cutthroat competition—groups race to be the first to solve a problem and the first to publish. “The question really is, how you can organize this in such a way so that everybody is willing to contribute to it?”