DOCX Parsing with Tracked Changes

When parsing DOCX files that contain tracked changes or comments, Tensorlake preserves this collaboration metadata in the HTML output. This enables workflows that need to process document revisions, review comments, or extract specific change history. Tracked changes and comments are preserved using semantic HTML markup:

Tracked Changes:

Insertions: <ins>inserted text</ins> - Text that was added to the document
Deletions: <del>deleted text</del> - Text that was removed or struck through

Comments:

Comment ranges: <span class="comment" data-note="comment text">highlighted text</span> - Comments anchored to selected text
Comment references:  - Comments at cursor positions without highlighted text

Example Output

Markdown

<p>Initial damage estimates suggest total losses between $2.8M and 
<span class="comment" data-note="Michael Torres: Need to verify this upper bound">$3.4M</span>,
<ins>based on preliminary contractor assessments,</ins> which falls within policy limits
<del>though a complete forensic analysis is pending</del>.</p>
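
Because insertions and deletions are kept side by side in the markup, the same HTML can be read as either the revised or the original wording. The following is a minimal sketch of that idea using only Python's standard library. The names `RevisionText` and `resolve` are hypothetical helpers, not part of Tensorlake, and the sketch assumes `<ins>` and `<del>` are the only tracked-change tags present.

```python
from html.parser import HTMLParser


class RevisionText(HTMLParser):
    """Collect plain text while skipping everything inside one tag type."""

    def __init__(self, skip_tag):
        super().__init__()
        self.skip_tag = skip_tag  # "del" to accept changes, "ins" to reject them
        self.skip_depth = 0       # > 0 while inside a skipped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == self.skip_tag:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == self.skip_tag and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def resolve(html, accept=True):
    """Return the text with all tracked changes accepted or rejected."""
    parser = RevisionText("del" if accept else "ins")
    parser.feed(html)
    parser.close()
    # Collapse the whitespace left behind where a span was dropped.
    return " ".join("".join(parser.parts).split())


sample = (
    '<p>Initial damage estimates suggest total losses between $2.8M and '
    '<span class="comment" data-note="Michael Torres: Need to verify this upper bound">$3.4M</span>, '
    '<ins>based on preliminary contractor assessments,</ins> which falls within policy limits '
    '<del>though a complete forensic analysis is pending</del>.</p>'
)

accepted = resolve(sample, accept=True)   # keeps <ins> text, drops <del> text
rejected = resolve(sample, accept=False)  # keeps <del> text, drops <ins> text
print(accepted)
print(rejected)
```

Comment spans are left in place in both views, since a comment annotates text rather than changing it. Spacing around a dropped span may need further cleanup depending on how the source document was punctuated.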

Extracting Change Data Programmatically

Use these HTML patterns to extract specific content types:

Python

from bs4 import BeautifulSoup

# html_content is assumed to hold the HTML string produced for the parsed DOCX
soup = BeautifulSoup(html_content, 'html.parser')

# Extract all comments
comments = []
for span in soup.find_all('span', class_='comment'):
    comments.append({
        'text': span.get_text(strip=True),
        'comment': span.get('data-note', '')
    })

# Extract all deletions
deletions = [del_tag.get_text() for del_tag in soup.find_all('del')]
for deletion in deletions:
    print(f"Deleted: {deletion}")

# Extract all insertions
insertions = [ins_tag.get_text() for ins_tag in soup.find_all('ins')]
for insertion in insertions:
    print(f"Inserted: {insertion}")

# Print all comments
for comment in comments:
    print(f"Comment: {comment['text']} - {comment['comment']}")
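
In the example output above, the `data-note` value reads like `Author: comment text`. If that pattern holds for your documents, the author can be separated from the comment body. The sketch below rests on that assumption, which the source does not guarantee, and `Note` and `split_note` are hypothetical names introduced for illustration.

```python
from typing import NamedTuple, Optional


class Note(NamedTuple):
    author: Optional[str]  # None when no "Author: " prefix is found
    body: str


def split_note(data_note: str) -> Note:
    """Split an assumed 'Author: body' string at the first ': '."""
    author, sep, body = data_note.partition(": ")
    if not sep or not author.strip():
        # No separator: treat the whole value as the comment body.
        return Note(None, data_note.strip())
    return Note(author.strip(), body.strip())


note = split_note("Michael Torres: Need to verify this upper bound")
print(note.author, "|", note.body)
```

A comment with no author prefix whose text happens to contain `": "` would be split incorrectly, so it is worth validating the extracted author against a known reviewer list before relying on it.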

Tracked changes are only preserved when parsing DOCX files that contain Microsoft Word’s revision history. Regular text formatting (bold, italic) is handled separately through standard HTML markup.
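
Since this markup only appears when the source DOCX carried revision history or comments, a pipeline may want to check for it before running any extraction. The following is a minimal standard-library sketch. `MarkupCounter` and `count_review_markup` are hypothetical names, and the check covers only the `<ins>`, `<del>`, and `span.comment` patterns listed above.

```python
from collections import Counter
from html.parser import HTMLParser


class MarkupCounter(HTMLParser):
    """Count tracked-change and comment-range elements in an HTML string."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag in ("ins", "del"):
            self.counts[tag] += 1
        elif tag == "span":
            # class may hold several space-separated names, or be absent
            classes = (dict(attrs).get("class") or "").split()
            if "comment" in classes:
                self.counts["comment"] += 1


def count_review_markup(html):
    parser = MarkupCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


counts = count_review_markup(
    '<p>a <ins>b</ins> <del>c</del> '
    '<span class="comment" data-note="x">d</span> <b>bold</b></p>'
)
if counts:
    print("Review markup found:", dict(counts))
else:
    print("No tracked changes or comment ranges in this output")
```

An empty `Counter` is falsy, so the result can be used directly as a guard, and ordinary formatting tags such as `<b>` are ignored.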
