News videos are valuable sources of multimedia information on real-world events, so there is a demand for ways to view them efficiently. However, summarization methods based on auditory contents do not take the visual contents into account. In news videos, whose presentation style means that the audio and visual contents do not necessarily come from the same source, this can severely reduce the amount of informative visual content included in the summarized video. We therefore propose a method for summarizing a sequence of news videos that considers the consistency of both auditory and visual contents. The proposed method first selects key-sentences from the auditory contents (Closed Caption) of each news story in the sequence, and then, for each key-sentence, selects the shot within the same news story whose "Visual Concepts", detected from the visual contents, are the most consistent with that key-sentence. Finally, the audio segment corresponding to each key-sentence is overlaid onto the selected shot, and the resulting segments are concatenated to generate a summarized video. The effectiveness of the proposed method was confirmed through a subjective experiment on several news topics.
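
The shot-selection step described above can be illustrated with a minimal sketch. It assumes that key-sentences have already been chosen and that each shot comes with a dictionary of detected concept names and confidences. The scoring rule (summing the confidences of concepts whose names appear in the sentence) and the names `consistency`, `select_shot`, and `summarize` are hypothetical stand-ins for illustration, not the consistency measure actually used by the proposed method.

```python
import re

def consistency(sentence, shot_concepts):
    """Hypothetical score: sum of detector confidences for concepts
    whose names literally appear as words in the key-sentence."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    return sum(conf for concept, conf in shot_concepts.items() if concept in words)

def select_shot(sentence, shots):
    """Return the index of the shot most consistent with the key-sentence.
    On ties, the earliest shot wins (behavior of max over a range)."""
    return max(range(len(shots)), key=lambda i: consistency(sentence, shots[i]))

def summarize(stories):
    """stories: list of (key_sentence, shots) pairs, one per news story.
    Returns (story_index, shot_index) pairs in story order; a real system
    would overlay each key-sentence's audio on its shot and concatenate."""
    return [(s, select_shot(sentence, shots))
            for s, (sentence, shots) in enumerate(stories)]

# Toy data (invented for illustration): concept name -> detector confidence.
stories = [
    ("The prime minister spoke to reporters outside the building.",
     [{"studio": 0.9, "anchor": 0.8}, {"building": 0.7, "reporters": 0.6}]),
    ("Heavy rain caused a flood in the city.",
     [{"flood": 0.8, "city": 0.5}, {"studio": 0.9}]),
]
plan = summarize(stories)
print(plan)
```

In this toy input, the studio shots score zero because none of their concepts are mentioned in the key-sentences, which mirrors the motivation of preferring shots whose visual content matches what is being said.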