HTML to JSON Conversion Using lxml

  • Share this:

Code introduction


This function takes HTML content and XML namespaces as input, parses the HTML using the lxml library, and returns a JSON string containing the parsed HTML content.


Technology Stack : lxml

Code Type : Function

Code Difficulty : Intermediate


                
                    
def parse_html_to_json(html_content, namespaces):
    from lxml import etree

    # Parse the HTML content using lxml.etree
    tree = etree.HTML(html_content)

    # Extract all XML namespaces defined in the HTML
    namespaces_dict = dict([ns.split('}') for ns in tree.iterparse(None, events=['start-ns']) if ns])

    # Merge the provided namespaces with the ones extracted from the HTML
    merged_namespaces = {**namespaces_dict, **namespaces}

    # Convert the XML tree to a JSON string
    json_result = etree.tostring(tree, pretty_print=True, xml_declaration=False, encoding='utf-8')

    return json_result                
              
Tags: