Reading for learning frequently requires integrating text and picture information into coherent knowledge structures. This article presents an experimental study analyzing the strategies students use to integrate text and picture information. Four combinations of texts and pictures (text–picture units) were selected from textbooks on biology and geography, each combined with 3 comprehension test items of different complexity. Item difficulties were assessed in terms of item-response theory and through a cognitive task analysis. The texts, pictures, and items were presented to 40 students in Grades 5 and 8 from the higher and lower tiers of the German school system. Participants were asked to process the material and answer the items. Students’ eye movements were recorded and analyzed in terms of the number of fixations on different areas of interest as well as eye-movement transitions between these areas. Results suggest that text and pictures serve fundamentally different functions associated with different processing strategies in goal-directed knowledge acquisition. Texts are more likely to be used for coherence-oriented general processing. They guide the learner’s conceptual analysis of the subject matter, which results in a coherent semantic network and an initial mental model. Pictures are used as scaffolds for the initial mental model construction. Afterward, however, they are more likely to be used for task-driven selective processing, serving as easily accessible visual representations on demand for item-specific mental model updates.