Liveblog of Hong Qu: “Keepr: Algorithm for Extracting Entities, Eyewitnesses and Amplifiers”

“When a big news story breaks, Twitter goes crazy. Keepr tries to make sense of these periodic bursts by implementing natural language processing and social network analysis algorithms to surface topics, eyewitnesses, and amplifiers. A live demo will be followed by a discussion of the capabilities and limitations of computational newsgathering, along with reports of how it is being used in newsrooms.”

Seth Mnookin introduces our speaker today: Hong Qu is a digital toolmaker. He has led teams at YouTube and Upworthy. He enjoys building social media tools that help us better understand ourselves and the world around us.


Hong’s sister went to MIT as an undergrad; they were competitive. He came to the U.S. at nine, she at fourteen. She went to MIT, he went to Berkeley. As an undergrad, he took a linguistics course that introduced him to Noam Chomsky’s deep structure theory. This made him excited to do further research in linguistics. He then went to grad school at Berkeley’s I School. The faculty there told him about research that had already been done at MIT at the intersection he was interested in; for example, Vannevar Bush came up with the Memex idea. Google Glass is the 5th generation of Memex.

Next, Hong shares slides about his history and research.

He started his work at YouTube when it was a brand new start-up, when much of the video content was low-quality. He began to wonder, “how can we use computation to create more high-quality, human-centered content?” Next, he transitioned into journalism. More recently, he was on the founding team of Upworthy. It’s gone from zero to 20 million unique visitors per month since its launch. In 2013 he left Upworthy to join the Nieman Foundation, where he was immersed in a rich environment for learning with practitioners as well as academics.

That’s where he developed Keepr.

Keepr: “Mining gold from the exhaust fumes of social media”

During the 2009 Hudson River plane crash in New York, citizens were the first to break the news via Twitter. Some talked about this as the source “going direct.” After moving back to NYC in 2010, he was trying to plan his next career move, and a Union Square venture capitalist told him “there is gold in the exhaust fumes of social media streams.” Social media is junk, it’s toxic fumes, but there’s gold if you can find a way to mine it, filter it, extract useful information from it. With that philosophy, he’s been developing a more computational approach to news and storytelling.

At Berkeley, Hong had learned about natural language processing. He learned from Marti Hearst and wrote a paper called “Automated Blog Classification: Challenges and Pitfalls,” which used an algorithm to classify blogs. He also taught (with Marti Hearst) a course on “Analyzing Big Data with Twitter” at UC Berkeley.

Human language is inherently difficult to understand, even for other humans. For computers, capturing its nuances is even harder, and making sense of tweets, which are just 140 characters, is harder still. Yet 140 characters can carry a lot of meaning. If a tweet includes a link, you can treat all the other words around the link as a description of the link. This ends up being the strongest indicator of what category the linked blog’s content falls into. If you have thousands of tweets, patterns begin to emerge.
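To make the "words around a link describe the link" idea concrete, here is a minimal sketch. It is not code from the talk: the function name `describe_links`, the regular expressions, and the sample tweets are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict

# Rough URL pattern; real tweets would need more careful handling.
URL_RE = re.compile(r"https?://\S+")

def describe_links(tweets):
    """For each URL, count the words that appear alongside it across all tweets."""
    descriptions = defaultdict(Counter)
    for text in tweets:
        urls = URL_RE.findall(text)
        # Strip the URLs out, then treat every remaining word as a descriptor.
        words = re.findall(r"#?\w+", URL_RE.sub(" ", text).lower())
        for url in urls:
            descriptions[url].update(words)
    return descriptions

# Made-up sample data.
sample = [
    "Great recipe for dumplings http://example.com/post",
    "Dumplings recipe, so good http://example.com/post",
]
top = describe_links(sample)["http://example.com/post"].most_common(2)
print(top)
```

With only two tweets the counts are tiny, but over thousands of tweets the most frequent co-occurring words start to act like a crowd-written label for the linked page.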

In his application to Nieman, he proposed to autosummarize highlights from live-tweeted events, using an algorithm that would filter and sort thousands of tweets with NLP. Initially, he suggested an algorithm focused on finding tweets with the “strongest resonance.” If an event is being live-tweeted, one can analyze the tweets with NLP algorithms and find the strongest-resonance tweets relatively easily. However, he discovered that this was the wrong way to frame the issue. The reason he crossed out “tweets with the strongest resonance” is that this only leaves you with the mainstream version of the story: the version that will be most widely tweeted, quoted, linked to, and referenced. But that doesn’t mean it’s the best version. Sometimes, traditional outlets are slow, behind the curve, or even get it wrong.

From the perspective of his Nieman colleagues, journalists want to be the first to understand the context of any emerging story, and to data mine so that they can create a richer narrative and fill in the gaps that no one else sees.

Example: The Boston Bombings

When the Boston bombings happened, the mainstream media got it wrong. CNN and AP reported it in a similar manner, and they spread misinformation. For example, CNN reported arrests made on the 17th, and the report was retweeted 3,809 times. What sources were they using to verify these stories? In addition, the authorities asked for articles to be removed.

Suspects were also misidentified on Twitter and in other online communities: members of the public wrongly identified one of the suspects as missing Brown student Sunil Tripathi. This happened even though Pete Williams from NBC categorically said in his tweet that “Sunil is not a suspect.”

When there was a shooting at MIT, he tried to find out what exactly happened. The news reports were repeating the same thing without much insight into the situation. He started to follow police radio, and found a Twitter user who was trying to compile all the information. This account was updated far more frequently, and was more accurate, than TV and blogs. A user called Michael Skolnik was tweeting the play-by-play of events; he was trying to synthesize the information sources available, acting as a real-time individual “newsroom.” Much of his information may have been coming from Seth Mnookin, who was also livetweeting from the scene. Seth’s Twitter account went from 8k to 45k followers in the space of two hours.

How does Keepr work? By suggesting terms within stories that are unfolding in real-time as well. For example, in following tweets from Watertown, #policeradio came up as a popular tag. This turned out to be because thousands of people were tuning in to a live stream of the police radio feed.

You put in a search term, Keepr collects the 100 most recent tweets, and discovers the topic. It then surfaces the related topics in a link at the top of the page. If you follow that, it will search the subtopic and come up with another set of tweets. The journalism fellows at Nieman said “this isn’t enough.” Just suggesting topics doesn’t help that much. The journalists actually needed sources: people on the scene who could provide real information, or experts who were local to the event. Source credibility is much more valuable in following a story than just discovering topics.
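A rough sketch of the "collect a batch of tweets, then surface related topics" step. This is not Keepr's actual code: it skips the Twitter API call entirely and works on a list of tweet texts already in memory, and the function name `related_topics`, the stopword list, and the sample tweets are assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "on", "is", "are", "and", "for", "at", "rt"}

def related_topics(tweets, query, top_n=3):
    """Return the most common hashtags and words in a batch of tweets, excluding the query itself."""
    counts = Counter()
    query = query.lower()
    for text in tweets:
        # Count each term at most once per tweet so one repetitive tweet cannot dominate.
        terms = set(re.findall(r"#?\w+", text.lower()))
        counts.update(t for t in terms if t not in STOPWORDS and t.lstrip("#") != query)
    return [term for term, _ in counts.most_common(top_n)]

# Made-up tweets standing in for the 100 most recent search results.
sample = [
    "Listening to the scanner in Watertown #policeradio",
    "Watertown lockdown continues #policeradio",
    "Thousands tuned in to #policeradio for Watertown updates",
]
print(related_topics(sample, "watertown", top_n=1))  # ['#policeradio']
```

Each suggested term could then be fed back in as a new search, which is the drill-down behaviour described above.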

If an abnormal frequency of events can be detected, that is a very good signal to follow, an insight he gained from a Cornell professor. Also, any algorithm that can take continuous snapshots of metadata can turn a manual technique into a computational one.
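One simple way to flag an "abnormal frequency" is to compare the latest count against a running baseline. The sketch below uses a plain z-score heuristic; it is an assumption chosen for illustration, not the technique Hong refers to, and the counts are hypothetical.

```python
from statistics import mean, stdev

def is_burst(history, current, threshold=3.0):
    """Flag `current` if it is more than `threshold` standard deviations above the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to estimate a baseline
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return current > mu  # flat baseline: any increase counts as abnormal
    return (current - mu) / sigma > threshold

# Hypothetical counts of tweets per minute containing one term.
baseline = [4, 6, 5, 7, 5, 6]
print(is_burst(baseline, 40))  # True: a sudden spike
print(is_burst(baseline, 7))   # False: within normal variation
```

Run on every new snapshot of counts, a check like this turns "keep an eye on the feed" into something a machine can do continuously.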

During the police chase, there were two suspects. In the Twitter feed, some people claimed one person was captured, others said both were captured. Hong was trying to make sense of this. He shows a slide with many tweets claiming “two suspects in custody.” This was erroneous at the time.

One of the biggest challenges of computational linguistics is to check who is spreading misconceptions, and how to stop that at the first moment. Also, journalists want to find words that are irregular, that are coming up frequently; these can provide clues to the unfolding story.

Side note re: Design Process: At UC Berkeley, Hong was taught to begin at very low fidelity, with simple UI sketches, then iterate. He shows us examples of early sketches for the Keepr UI, playing with gathering and placing different information elements. He also shows a picture of his work desk, where he worked every day for four months at the Nieman Foundation.

People just want a story: a summary of past events. They want a list of verified sources.

He likes social media because it democratizes, in contrast to the monopolizing nature of mainstream media.

Hong explains that by harnessing the power of natural language processing through a tool like Keepr, journalists increase their capacity to digest information. What journalists are doing is finding good sources and developing them into Twitter lists. That is the newest innovation in the industry.

Humans vs. Machines

Machines and humans have different strengths. Hong shows a list of what humans are good at (meaning, feeling, etc.) and what machines are good at (memory, matching, etc.). Humans can derive meaning; they have feelings and a physical existence. Humans are great at telling a story. Snowden didn’t send his data files to Wikileaks; he sent them to Guardian journalists, and the journalists, based on a judgement call after meeting Snowden, published the story.

He asks the audience how they would find information about Obamacare: which websites would they look at, and what kinds of media would they use to research the topic? Some people mention newspapers or websites; another mentions Twitter. Hong asks how we would use Twitter to find this information. By searching for a related term or hashtag. He pulls up a screen showing a TweetDeck search.

Hong says that investigative journalists are starting to use Twitter as a primary research tool. They create two columns of Twitter search results (“Obamacare” and “Obamacare defund”), then let it update in realtime and scan for possible leads. Hong believes that computers can do this more effectively. So, journalists can spend their time investigating sources and adding new insights, rather than manually filtering out the gold from crappy tweets.

The 140 characters of a tweet can be broken down into individual words. That’s called tokenizing in language processing: taking sentences and paragraphs and breaking them into individual words. The resulting units are called unigrams, bigrams, etc., depending on the number of words in each token.
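A minimal sketch of tokenizing a tweet and building unigrams and bigrams. The regular expression and the helper names `tokenize` and `ngrams` are assumptions made for illustration, not Keepr's implementation.

```python
import re

def tokenize(text):
    """Lowercase a tweet and split it into word tokens, keeping #hashtags and @mentions whole."""
    return re.findall(r"[#@]?\w+", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens (n=1 gives unigrams, n=2 gives bigrams)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tweet = "Two suspects in custody #Watertown"
tokens = tokenize(tweet)
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
print(tokens)   # ['two', 'suspects', 'in', 'custody', '#watertown']
print(bigrams)  # [('two', 'suspects'), ('suspects', 'in'), ('in', 'custody'), ('custody', '#watertown')]
```

Counting these tokens across a batch of tweets is what makes the later steps (topics, frequent phrases) possible.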

140 characters x 100 tweets is 14,000 characters. Keepr archives and organizes tweets by parsing, counting, visualizing, and zooming in. Archiving is done to extract meaning from tweets, and journalists can decide what further action is needed based on that knowledge.

Keepr lets the journalist organize and archive tweets. Archiving matters because users can delete their tweets; if you haven’t archived a tweet, it is gone once its author deletes it.

Language has structure; there is dimensionality and meaning. Algorithms can let you do sentiment analysis, summarization, topic classification, and dialogue analysis. Keepr takes users’ tweets and summarizes the topics.

Obamacare Example

Next, Hong does a live demo. He shows us what happens if you go to Keepr and type in Obamacare.

The left column shows the mainstream report about that topic; in the middle column of the website are the amplifiers, who are followed by a lot of people and who are shaping the conversation. On the far right is the list of possible sources for the top stories.

Click on a name in the right column, and the tweets and interactions involving that name are shown in the middle column.

Keepr is open source and it summarizes the conversations that are going on.

Conclusion Points

Humans are better than computers at telling stories.

Journalists are needed to add context to a story and verify accuracy. The reporter’s job is to create context and create a narrative of the story. The key is to get the stories right rather than getting it first. Humans are better at doing this part (checking the sources).

In sum, what does Keepr do?

  • Extracts topics from a collection of words.
  • Extracts media that is part of a tweet.
  • Analyzes conversations to find the major and minor sources, discovering each source and tracking the amplification process.
  • Verifies sources by geolocation and by importing the source’s other social media profiles.
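The conversation-analysis step can be approximated very simply. In the Q&A, Hong describes the algorithm as counting the number of times a user occurs, so the sketch below does just that over a batch of tweet texts. The function name `rank_users`, the regular expression, and the sample tweets are assumptions, not Keepr's code.

```python
import re
from collections import Counter

def rank_users(tweets):
    """Count how often each @username is mentioned or retweeted across a batch of tweet texts."""
    counts = Counter()
    for text in tweets:
        counts.update(name.lower() for name in re.findall(r"@(\w+)", text))
    return counts.most_common()

# Made-up sample data.
sample = [
    "RT @eyewitness: photo from the scene",
    "RT @eyewitness: photo from the scene",
    "@reporter can you confirm?",
]
print(rank_users(sample))  # [('eyewitness', 2), ('reporter', 1)]
```

A raw count like this cannot tell a source from a target of messages, which is the limitation raised about John Boehner in the Q&A; the journalist still has to make that call.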

A fundamental problem in text data mining is to extract meaningful structure from document streams that arrive continuously over time like Twitter.

Journalists think more about the practical applications of Keepr: they need an automatically curated list of sources. They want passive monitoring of alerts, visual statistics, trends and graphs, and ease of use.


Verification Junkie has been keeping up with Keepr

A beta program is being rolled out. Hong needs to continue pitching the tool, and he will be at ONA13 (Online News Association).

Keepr is open source and can be found on GitHub.

Question and Answer

Seth Mnookin: If I were a newsroom, why would I pay for this instead of using what’s already there? Or alternately, just grab it from GitHub?

Hong: I am keeping it open source. I am thinking of the news junkies and the people who want to keep up with the news today. The business model is to charge a minimal fee; something that will sustain the project. The business model is closest to WordPress. You can install your own, or use a hosted version.

Seth: A hundred tweets is more than a human can process quickly, but it’s also nothing in the context of a big event. How did you come up with that number?

Hong: The Twitter API allowed for 100 tweets, and moreover the information has to be realtime.

You could probably go up to 500k. If it’s too much, it’s not realtime; you have to wait. Users have to drill down. I’ll iterate and change it as I go.

Erik Stayton: You talked about John Boehner in your section about sources and amplifiers. It didn’t look like he was a source, but a sink: messages targeted to his Twitter account. Is that something you can separate out?

Hong: The algorithm is just counting the number of times a user occurs in the source. He could even be a topic. Since Obamacare is an ongoing public debate, it’s not breaking news. In breaking news situations, the users that get surfaced usually are people who are real sources uploading images and links. I hope the journalists will be able to differentiate!

Liam Andrew: The Boston Marathon and Obamacare are different in that one is longtail and the other is not. Do you have any way to optimize for longtail stories?

Hong: One of the approaches to summarizing a long-term story is to keep a database of searches and then summarize the sources longitudinally.
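A sketch of what "keep a database of searches and summarize the sources longitudinally" could look like, using Python's built-in sqlite3 module. The table layout, function names, dates, and user names are all hypothetical; nothing here is taken from Keepr.

```python
import sqlite3

# In-memory database for the sketch; a real tool would use a file on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mentions (day TEXT, query TEXT, user TEXT, count INTEGER)")

def save_search(day, query, user_counts):
    """Store one day's per-user mention counts for a search term."""
    conn.executemany(
        "INSERT INTO mentions VALUES (?, ?, ?, ?)",
        [(day, query, user, n) for user, n in user_counts.items()],
    )

def top_sources(query, limit=3):
    """Sum the stored counts across all days and return the most-mentioned users."""
    rows = conn.execute(
        "SELECT user, SUM(count) AS total FROM mentions WHERE query = ? "
        "GROUP BY user ORDER BY total DESC, user LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()

# Hypothetical results from two days of searching the same term.
save_search("2013-10-01", "obamacare", {"user_a": 5, "user_b": 2})
save_search("2013-10-02", "obamacare", {"user_a": 1, "user_b": 6})
print(top_sources("obamacare"))  # [('user_b', 8), ('user_a', 6)]
```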

Ian Condry: I’m interested in what algorithms do well versus what humans do well. Have you thought of ways to integrate what humans do best into the process?

Hong:  Definitely. When I watch users, especially journalists, they don’t necessarily know where to begin. For journalists, as they write the story, they don’t know what angle to use. It works backwards from how they want to tell the story, then they do the research. It’s a dialectical process. Keepr does extract topics, but they have to make a value judgement about how to frame and source the story. In the earlier stages—let’s say the editor assigns the story—the reporter still has to get approval for their narrative.

Jim Paradis: How do journalists contact sources they find on Keepr? Do they contact the source?

Hong: Based on my classes at journalism school, and also watching the Nieman fellows follow the bombing, journalists are incredible at getting sources to talk to them. They’re aggressive. For example, one journalist went to the hospital to try to talk to the relatives of injured people. By any means possible.

Jim: So geoinformation in the tweet says “there’s a lot coming from this location,” so that’s one kind of resource. Another would be identification of an individual. What are some other ways people might use this?

Jesse: Does this only work for breaking events?

Hong: When I’m designing the tool, I don’t want journalists to just go and search in the search box. I want to be curating and showing them things. On load, I use CNN, BBC, AP breaking news accounts. On the homepage you see the most

Sasha Costanza-Chock: In your design process, have you thought about doing collaborative design workshops to brainstorm about features that journalists would need?

Hong: Fortunately for me, the Online News Association recently conducted a session with the Boston Globe where they did exactly that. Happily, many of the features they wanted aligned with my own observations. I believe the best thing is to have an alpha version in the hands of the user, and get feedback that way. For example, I designed an email alert feature, but nobody signed in to use that. Perhaps because they didn’t want to give up their email to an unknown site, so there are a lot of variables. There are many barriers to adoption, but I believe that you develop a basic version of an application, then let the user give insights about further changes.

Rodrigo Davies: What do you think about the much-maligned field of sentiment analysis?

Hong: The technology is not foolproof yet. The context matters a lot. The reason I didn’t use sentiment analysis is that it takes a lot of computing power for the value it provides.

Q: Which service are you using for search on the backend?

Hong: Currently it’s a call to the Twitter search API.

Wang Yu: I want to ask what kind of algorithm is involved in the processing of Twitter for journalism, and what’s your next step?

Hong: I really feel that the promising area is network analysis to know who is saying what to whom. The visualization of information flow is crucial. For example, I want to know who said the word “boat,” who else said it, who they were saying it to. Also visualization is very powerful. I think that’s the most promising angle.
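A small sketch of the "who said it, and to whom" question for a single word. It assumes each tweet is available as an (author, text) pair; the function name `who_said_to_whom` and the sample data are made up for illustration.

```python
import re
from collections import defaultdict

def who_said_to_whom(tweets, word):
    """Map each author who used `word` to the set of accounts they addressed in those tweets."""
    edges = defaultdict(set)
    for author, text in tweets:
        if word.lower() in re.findall(r"\w+", text.lower()):
            # Every @mention in a matching tweet becomes an author -> recipient edge.
            edges[author].update(name.lower() for name in re.findall(r"@(\w+)", text))
    return dict(edges)

# Made-up (author, text) pairs.
sample = [
    ("alice", "@bob he is hiding in the boat"),
    ("carol", "Police surround a boat in Watertown"),
    ("dave", "@alice stay safe"),
]
network = who_said_to_whom(sample, "boat")
print(network)
```

Here `alice` addressed `bob`, `carol` used the word without addressing anyone, and `dave` is left out because he never said it. The resulting author-to-recipient edges are the raw material for the kind of information-flow visualization Hong describes.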

Chelsea Barabas: You’ve designed this specifically for journalists, but what other end users could this be useful for?

Hong: In the financial community, it can be used in tracking […]; data miners could use it to monitor changes in policy. There are many applications for data mining. The reason I want to help journalists is that I feel they need a level playing field; they need to get access to information first.

TL Taylor: I have a metaquestion. We’re in a moment of the hopes of Big Data and the power of algorithms. I’m an ethnographer, so I love the fact that you have a slide about interpretation as the work that humans are doing. Do you end up having to navigate cultural enthusiasm for these techniques against the realities?

Seth: You mean the enthusiasm of newsrooms that they won’t have to pay reporters? [laughs]

Hong: It is Intelligence Augmentation, or IA, not Artificial Intelligence, or AI. No matter how small the algorithm, it is doing pattern detection, but it is not deriving the values or meanings. Journalists still play a crucial role.

Seth: We are talking a lot about content and content producers. We’ve been focusing on the role of professional content producers. But Twitter and social media are also content producers, and we are just scratching the surface at the ways this content could be monetized. Do you think at some point people will stop using these services, because they don’t like being part of these billion-dollar revenue streams?

Hong: My perspective is that the companies will come and go but behaviors will persist, content will be produced on social media websites.

Whether the company emphasizes monetization or not, there will be new companies, new ways to incentivize these behaviors. There’s plenty of opportunity. Look at Instagram: it’s huge, even for journalism. Why do people post there instead of to Twitter? Computational metrics should just follow where the activities are taking place.

When I was at Berkeley, I read Goffman. Any scholar can tell you why people post: they want to be visible and present themselves, and be perceived. We should be respectful of the intent of people, not violate their expectations. Companies are being respectful of people’s privacy…the only actor who is not is the government.

Seth: Do you still [see] any tension between the professional content producers and the mass content producers? When I asked the journalists why they weren’t on the scene they replied that their job was to post.

Hong: Journalism doesn’t even pay the bills. On the other hand, they can innovate on the data movement. If they are able to [become] more accurate than the sources themselves then they are adding value.

There are A-list journalists who can sustain themselves, but I am talking about staff writers and they need to jump on the data bandwagon.

Denise Cheng: I used to work in grassroots local journalism, meeting with people from community radio and public radio. I get the sense that we’re looking at Keepr as a replacement of work. From what I’ve seen of the industry, people are so tied to their computer that they want to get out there, but don’t have the time. You’re strapped to your computer. You’d rather be out there talking to people, but sitting looking at tweets gets you a lot of information rapidly. It’s not how to take an algorithm and create content, it’s that suddenly journalists’ capacity can be increased to go out there and talk to people. We don’t want to talk about data

Sasha: Seth, you let Hong get off too easily earlier. How do you get from the quote you showed from your thesis, which was about this massive cultural system which extracts cultural labor from people and then commodifies and monetizes it, to “the companies respect the users and the state is the only bad actor?” Aren’t social media firms the new algorithmic merchants of cool, who benefit from the free content production of their users, data mine them, and sell their information to third parties?

Hong: When social media becomes too bad, somebody else will take its place. The companies also put considerable effort into building the system. If companies try to create a more equitable system of compensating the content producers, then the situation will be better. Media literacy is another important skill.

Sasha: So, if Facebook becomes too invasive of our privacy, we’ll all just leave for another platform?

Ian Condry: The values of the platform and the algorithms could lead to another kind of manipulation.

TL Taylor: Do you have any views on ethics of large scale pattern recognition?

Hong: Any technology in this field can be used in a malicious way. Any data that is private or semi-private should not be used by companies. There are the corporate forces, the government forces, and the people; I hope that the fourth estate is a voice of society and counterbalances mass behavior.

Jim: What about the digital divide? Some of these aggregation systems push in a certain direction. You wonder about the voices of people that aren’t represented in the social media sphere. Sometimes it feels like news is generated in a smaller and smaller, more technically sophisticated sector.

Hong: That was probably the biggest concern during my research and development of the project. Everyone on Twitter is of a certain demographic or SES. If a journalist thinks that Keepr will give them access to the “full story,” they’re not a good journalist. They’ll have to get out of their offices, visit neighborhoods, make phone calls, and be on their beat in an old-school way. We have to train journalists to do traditional reporting plus the data journalism. Teaching at CUNY, the student body is very diverse and representative of the city itself. That’s a mission to keep in mind as we proceed with analyzing the social media data; the on-the-ground data is also key.

Seth: Thanks!