Aspose.PDF for Java: TextFragment still retains the original text after replacing it with longer text

Ethan111 · December 11, 2024, 8:02am

Hello Aspose Team,
No matter how I delete the textFragment, the original text always remains. Why is this happening? Especially when I use textFragment.setText("Longer text Longer text Longer text Longer text"), the previous text still exists after the replacement.

version: 24.11

Setting TextReplaceOptions.ReplaceAdjustment.WholeWordsHyphenation affects this: with it, the original text remains. However, I need that option so the line length is adjusted automatically.
How can I call setText on each existing TextFragment to replace its text and still have the line length adjusted automatically?

// code snippet

public ByteArrayOutputStream extractAndReplace(InputStream file, String from, String to) throws Exception {
    List<String> list = new ArrayList<>();
    Document doc = new Document(file);
    ParagraphAbsorber absorber = new ParagraphAbsorber();
    absorber.visit(doc);
    for (PageMarkup markup : absorber.getPageMarkups()) {
        for (MarkupSection section : markup.getSections()) {
            int k = 0;
            List<TextFragment> newTextFragments = new ArrayList<>();
            for (MarkupParagraph paragraph : section.getParagraphs()) {
                for (int i = 0; i < paragraph.getFragments().size(); i++) {
                    TextFragment fragment = paragraph.getFragments().get(i);
                    fragment.getReplaceOptions().setReplaceAdjustmentAction(TextReplaceOptions.ReplaceAdjustment.WholeWordsHyphenation);
                    fragment.setText("");
                    fragment.getSegments().clear();
                    k++;
                }
            }
        }
    }

    log.info("output list:{}", JSON.toJSONString(list));
    ByteArrayOutputStream output = new ByteArrayOutputStream();
    // SaveFormat.Pdf
    PdfSaveOptions saveOptions = new PdfSaveOptions();
    doc.save(output, saveOptions);
    return output;
}

image.png (93.9 KB)

image.png (219.5 KB)

25024.pdf (76.2 KB)

asad.ali · December 18, 2024, 9:56pm

@Ethan111

We tested the scenario using the code sample below to replace the text in your PDF document and could not reproduce the issue you described:

// Load the PDF document
Document pdfDoc = new Document(dataDir + "25024.pdf");

// Initialize TextFragmentAbsorber with search term (case-insensitive regular expression can be used)
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("Section"); // NO I18N

// Configure search options for the absorber
TextSearchOptions textSearchOptions = textFragmentAbsorber.getTextSearchOptions();
textSearchOptions.setRegularExpressionUsed(true);

// Accept the absorber for all pages in the document
pdfDoc.getPages().accept(textFragmentAbsorber);

// Get the extracted text fragments into a collection
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();

// Loop through the text fragments
for (TextFragment textFragment : (Iterable<TextFragment>) textFragmentCollection) {
    // Print the text fragment to the console
    System.out.println(textFragment.getText());

    // Replace the text with an empty string
    textFragment.setText("");
}

// Save the modified document
pdfDoc.save(dataDir + "replaced.pdf");

replaced.pdf (101.3 KB)

Can you please share why you are using ParagraphAbsorber instead of TextFragmentAbsorber? TextFragmentAbsorber is designed for modifying and removing text, whereas ParagraphAbsorber is intended only for text extraction.
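
As a side note on the "adjust line length automatically" requirement: the idea behind a whole-word adjustment can be pictured with the small standalone sketch below. It is only an illustration under simplifying assumptions and is not Aspose.PDF code or a description of how the library lays out text. The class name WholeWordWrapSketch, the wrap method, and the maxWidth parameter are hypothetical, and width is counted in characters, whereas a real PDF layout would measure glyph widths in points.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: NOT Aspose.PDF code and not the library's layout algorithm.
final class WholeWordWrapSketch {

    // Greedily packs whole words into lines of at most maxWidth characters.
    // A single word longer than maxWidth is kept intact on its own line
    // (this sketch never splits or hyphenates a word).
    static List<String> wrap(String text, int maxWidth) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // only happens when the input is blank
            }
            // Start a new line if appending " " + word would exceed the limit.
            if (current.length() > 0 && current.length() + 1 + word.length() > maxWidth) {
                lines.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(word);
        }
        if (current.length() > 0) {
            lines.add(current.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        // Replacement string taken from the question; 20 is an arbitrary width.
        String replacement = "Longer text Longer text Longer text Longer text";
        for (String line : wrap(replacement, 20)) {
            System.out.println(line);
        }
    }
}
```

With a width of 20 characters, the replacement string from the question is split into three lines: "Longer text Longer", "text Longer text", and "Longer text". The point is only that a longer replacement has to flow onto additional lines instead of overrunning the space of the original fragment.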