When parsing DOCX files that contain tracked changes or comments, Tensorlake preserves this collaboration metadata in the HTML output. This enables workflows that need to process document revisions, review comments, or extract specific change history.

Tracked changes and comments are preserved using semantic HTML markup:

Tracked Changes:
  • Insertions: <ins>inserted text</ins> - Text that was added to the document
  • Deletions: <del>deleted text</del> - Text that was removed or struck through
Comments:
  • Comment ranges: <span class="comment" data-note="comment text">highlighted text</span> - Comments anchored to selected text
  • Comment references: <!-- Comment: comment text --> - Comments at cursor positions without highlighted text

Example Output

HTML
<p>Initial damage estimates suggest total losses between $2.8M and 
<span class="comment" data-note="Michael Torres: Need to verify this upper bound">$3.4M</span>,
<ins>based on preliminary contractor assessments,</ins> which falls within policy limits
<del>though a complete forensic analysis is pending</del>.</p>
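
Because insertions and deletions are wrapped in `<ins>` and `<del>`, the same HTML can be read two ways: as the text with every tracked change accepted, or with every change rejected. The following is a minimal sketch of that idea, not part of Tensorlake's API. `RevisionTextExtractor` and `resolve_changes` are hypothetical names, the sample string is made up for illustration, and the code relies only on Python's standard-library `html.parser`. It returns plain text, so all other tags are dropped and whitespace is collapsed.

```python
from html.parser import HTMLParser


class RevisionTextExtractor(HTMLParser):
    """Collect text while skipping everything inside one tag ('del' or 'ins')."""

    def __init__(self, skip_tag):
        super().__init__(convert_charrefs=True)
        self.skip_tag = skip_tag
        self.skip_depth = 0  # > 0 while inside the skipped tag
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == self.skip_tag:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == self.skip_tag and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def resolve_changes(html, accept=True):
    """Return plain text with all tracked changes accepted or rejected.

    Accepting keeps <ins> text and drops <del> text; rejecting does the reverse.
    """
    parser = RevisionTextExtractor("del" if accept else "ins")
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left behind by removed elements and line breaks.
    return " ".join("".join(parser.parts).split())


# Hypothetical sample, not taken from a real document.
sample = "<p>The limit is <del>$2M</del><ins>$3M</ins> per claim.</p>"
print(resolve_changes(sample, accept=True))   # The limit is $3M per claim.
print(resolve_changes(sample, accept=False))  # The limit is $2M per claim.
```

Dropping a whole element can leave a stray space before punctuation when the removed text sat between a word and a period, so real output may need a light cleanup pass.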

Extracting Change Data Programmatically

Use these HTML patterns to extract specific content types:
Python
from bs4 import BeautifulSoup

# html_content is assumed to hold the HTML string produced by parsing the DOCX file
soup = BeautifulSoup(html_content, 'html.parser')

# Extract all comment ranges (highlighted text plus the attached note)
comments = []
for span in soup.find_all('span', class_='comment'):
    comments.append({
        'text': span.get_text(strip=True),
        'comment': span.get('data-note', '')
    })

# Extract all deletions
deletions = [del_tag.get_text() for del_tag in soup.find_all('del')]
for deletion in deletions:
    print(f"Deleted: {deletion}")

# Extract all insertions
insertions = [ins_tag.get_text() for ins_tag in soup.find_all('ins')]
for insertion in insertions:
    print(f"Inserted: {insertion}")

# Print all comments
for comment in comments:
    print(f"Comment: {comment['text']} - {comment['comment']}")
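
The `span` loop above picks up comment ranges only. Comment references are written as HTML comment nodes (`<!-- Comment: comment text -->`), so they need a separate pass. The sketch below shows one way to collect them, assuming the `Comment:` prefix appears exactly as in the pattern listed earlier. `CommentReferenceParser` and `extract_comment_references` are hypothetical names, and the code uses Python's standard-library `html.parser` so that it runs without extra dependencies.

```python
from html.parser import HTMLParser


class CommentReferenceParser(HTMLParser):
    """Collect HTML comment nodes that start with the 'Comment:' prefix."""

    PREFIX = "Comment:"  # assumed prefix, taken from the documented pattern

    def __init__(self):
        super().__init__()
        self.references = []

    def handle_comment(self, data):
        # data is the raw text between '<!--' and '-->'
        text = data.strip()
        if text.startswith(self.PREFIX):
            self.references.append(text[len(self.PREFIX):].strip())


def extract_comment_references(html):
    """Return the text of every comment reference found in the HTML string."""
    parser = CommentReferenceParser()
    parser.feed(html)
    parser.close()
    return parser.references


# Hypothetical sample, not taken from a real document.
sample = (
    "<p>Coverage applies.<!-- Comment: Confirm with underwriting --> "
    "See terms.<!-- layout marker --></p>"
)
print(extract_comment_references(sample))  # ['Confirm with underwriting']
```

HTML comments without the prefix are ignored, which keeps unrelated markup comments out of the results. The same filtering could also be done with BeautifulSoup by selecting its `Comment` string nodes.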
Tracked changes are only preserved when parsing DOCX files that contain Microsoft Word’s revision history. Regular text formatting (bold, italic) is handled separately through standard HTML markup.