HTML Text Extraction by Tag

Code introduction


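This function takes an HTML string and a tag name, parses the HTML with lxml's HTMLParser, selects every element with that tag using the XPath expression //tag, and returns a list with the text of each matched element, including text nested in its child elements.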


Technology Stack : lxml, HTML parsing, XPath

Code Type : Function

Code Difficulty : Intermediate


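from lxml import etree


def extract_text_from_html(html_content, tag):
    """Return the text content of every <tag> element in html_content."""

    def extract_text_from_element(element):
        # itertext() yields the element's own text and that of its descendants
        return ''.join(element.itertext())

    # HTMLParser recovers from malformed markup that a strict XML parser rejects
    parser = etree.HTMLParser()
    tree = etree.fromstring(html_content, parser)
    # "//tag" matches the tag at any depth in the document
    elements = tree.xpath(f"//{tag}")
    return [extract_text_from_element(element) for element in elements]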
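
A short usage sketch may help show what the function returns. The sample markup and the name sample_html are assumptions made up for illustration, and the function is repeated in compact form only so the snippet runs on its own.

```python
from lxml import etree


def extract_text_from_html(html_content, tag):
    # Compact restatement of the function above, so this example is self-contained
    parser = etree.HTMLParser()
    tree = etree.fromstring(html_content, parser)
    return ["".join(el.itertext()) for el in tree.xpath(f"//{tag}")]


# Hypothetical sample document, not taken from the original page
sample_html = """
<html><body>
  <h1>Report</h1>
  <p>First <b>bold</b> paragraph.</p>
  <p>Second paragraph.</p>
</body></html>
"""

paragraphs = extract_text_from_html(sample_html, "p")
print(paragraphs)  # ['First bold paragraph.', 'Second paragraph.']

# A tag that does not occur in the document yields an empty list
print(extract_text_from_html(sample_html, "table"))  # []
```

Note that text inside the nested <b> element is merged into its parent paragraph's string. Because the tag name is interpolated directly into the XPath expression, it is probably safest to pass only trusted, plain tag names.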