Discuss - 廖雪峰的官方网站

LesLieM樂

#1 Created at ...

#!/usr/bin/env python3
# -*- coding: utf-8 -*-


# Approach based on http://blog.csdn.net/nwpulei/article/details/7272832


from html.parser import HTMLParser


class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.li = False
        self.h3 = False
        self.a = False
        self.p = False
        self.time = False
        self.span1 = False
        self.span2 = False
        self.event_dict = {}
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'li':
            self.li = True
        elif tag == 'h3':
            for k, v in attrs:
                if k == 'class' and v == 'event-title':
                    self.h3 = True
        elif tag == 'a':
            self.a = True
        elif tag == 'p':
            self.p = True
        elif tag == 'time':
            self.time = True
        elif tag == 'span':
            for k, v in attrs:
                if k == 'class' and v == 'say-no-more':
                    self.span1 = True
                elif k == 'class' and v == 'event-location':
                    self.span2 = True

    def handle_data(self, data):
        if self.li:
            if self.h3 and self.a:
                self.count += 1  # self.count is the key into self.event_dict: the running number of the event
                self.event_dict[self.count] = {}
                self.event_dict[self.count]['name'] = data
            elif self.p:
                if self.time:
                    if not self.span1:
                        self.event_dict[self.count]['time'] = data
                    else:
                        self.event_dict[self.count]['time'] += (',' + data)
                else:
                    if self.span2:
                        self.event_dict[self.count]['site'] = data

    def handle_endtag(self, tag):
        if tag == 'a':
            self.a = False
        elif tag == 'h3':
            self.h3 = False
        elif tag == 'span':
            self.span1 = False
            self.span2 = False
        elif tag == 'time':
            self.time = False
        elif tag == 'p':
            self.p = False
        elif tag == 'li':
            self.li = False


def parse_python_event(html_data):
    # Use a fresh parser per call so state from earlier input cannot leak in.
    parser = MyHTMLParser()
    parser.feed(html_data)
    return parser.event_dict


if __name__ == '__main__':
    html_data = r'''HTML Data'''  # placeholder: paste the HTML of the events page here
    event = parse_python_event(html_data)
    print('Conference: %s' % event)  # dump the raw dict
    for i in sorted(event):
        print('event %d:' % i, event[i]['name'], '\n', event[i]['time'], '\t', event[i]['site'])

Output:

event 1: PyCon SK 2017 
 10 March – 13 March , 2017      Bratislava, Slovakia
event 2: Rencontres Django 2017 
 01 April – 03 April , 2017      Toulon, France
event 3: DjangoCon Europe 2017 
 03 April – 08 April , 2017      Florence, Italy
event 4: PyCon UA 2017 
 08 April – 10 April , 2017      Lviv Arena Stadium, Lviv City, Ukraine
event 5: PythonCamp 2017 
 08 April – 10 April , 2017      GFU Cyrus AG , Am Grauen Stein 27, 51105 Köln, Germany
event 6: PyDays Vienna 
 05 May – 07 May , 2017      FH Technikum Wien, Höchstädtplatz 6, 1200 Vienna, Austria
event 7: PyCon Philippines 2017 
 25 Feb. – 27 Feb. , 2017      Cagayan de Oro City, Philippines
event 8: PyWeek 23 - Python community game jam 
 19 Feb. – 27 Feb. , 2017      Online Event
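
For anyone who wants to try the idea without downloading the page, here is a minimal self-contained sketch. It is not the code posted above: it swaps the boolean flags for a stack of open elements, and the `SAMPLE_HTML` snippet, the `EventParser` class and the `parse_events` helper are all made up for illustration, modeled only on the tag and class names the parser above checks for.

```python
from html.parser import HTMLParser

# Hypothetical sample, shaped like the markup the parser above expects.
SAMPLE_HTML = '''
<ul>
  <li>
    <h3 class="event-title"><a href="#">PyCon SK 2017</a></h3>
    <p>
      <time>10 March – 13 March <span class="say-no-more"> 2017</span></time>
      <span class="event-location">Bratislava, Slovakia</span>
    </p>
  </li>
</ul>
'''


class EventParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []   # (tag, class attribute) of every element currently open
        self.events = []  # one dict per event, in document order

    def handle_starttag(self, tag, attrs):
        self.stack.append((tag, dict(attrs).get('class')))

    def handle_endtag(self, tag):
        # Close the innermost open element with this tag name, if there is one.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # skip whitespace between tags
        tags = [t for t, _ in self.stack]
        classes = [c for _, c in self.stack]
        if 'event-title' in classes and 'a' in tags:
            self.events.append({'name': text, 'time': '', 'site': ''})
        elif self.events and 'time' in tags:
            # The year sits in a nested <span>, so append it to the date text.
            joined = self.events[-1]['time'] + ' ' + text
            self.events[-1]['time'] = joined.strip()
        elif self.events and 'event-location' in classes:
            self.events[-1]['site'] = text


def parse_events(html):
    parser = EventParser()
    parser.feed(html)
    parser.close()
    return parser.events


events = parse_events(SAMPLE_HTML)
for number, item in enumerate(events, start=1):
    print('event %d:' % number, item['name'], '\n', item['time'], '\t', item['site'])
```

Tracking the open elements in a stack means a new field only needs one more branch in `handle_data`, rather than a new flag in three methods. Whether that is worth it for a page this small is a matter of taste.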

LesLieM樂

#2 Created at ...

@廖雪峰 Teacher, a small question: why do these lines from the example seem to have no effect?

from html.entities import name2codepoint

def handle_entityref(self, name):
    print('&%s;' % name)

def handle_charref(self, name):
    print('&#%s;' % name)

parser.feed('''<html>

<body>  <p>Some <a href=\"#\">html</a> HTML&nbsp;tutorial...<br>END</p> </body></html>''')

The &nbsp; here is never printed; the output still shows a plain space. It looks as if handle_entityref() is not called at all.

用户5370054522

#3 Created at ...

Those handlers are for special characters, and our example does not contain any special characters.

zwq54argfun

#4 Created at ...

def handle_entityref(self, name):
    name = chr(name2codepoint[name])
    print('&%s;' % name)

def handle_charref(self, name):
    if name.startswith('x'):
        name = chr(int(name[1:], 16))
    else:
        name = chr(int(name))
    print('&#%s;' % name)


Insert this code in the corresponding place and it should work.
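
One likely reason the handlers never fire: since Python 3.5, `HTMLParser` defaults to `convert_charrefs=True`, which turns character references into the corresponding characters and passes them to `handle_data()`, so `handle_entityref()` and `handle_charref()` are not called. The sketch below compares the two modes, assuming Python 3.5 or later; the `EntityParser` class, the `parse` helper and the `seen` list are made-up names for illustration.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class EntityParser(HTMLParser):
    def __init__(self, convert_charrefs):
        super().__init__(convert_charrefs=convert_charrefs)
        self.seen = []  # records which handler saw what

    def handle_data(self, data):
        self.seen.append(('data', data))

    def handle_entityref(self, name):
        # Named reference such as &nbsp; (name == 'nbsp')
        self.seen.append(('entityref', name, chr(name2codepoint[name])))

    def handle_charref(self, name):
        # Numeric reference such as &#62; or &#x3E;
        if name.startswith(('x', 'X')):
            char = chr(int(name[1:], 16))
        else:
            char = chr(int(name))
        self.seen.append(('charref', name, char))


def parse(html, convert_charrefs):
    parser = EntityParser(convert_charrefs)
    parser.feed(html)
    parser.close()
    return parser.seen


html = 'HTML&nbsp;tutorial&#62;'
default_mode = parse(html, convert_charrefs=True)   # the default since 3.5
legacy_mode = parse(html, convert_charrefs=False)   # handlers are called

print(default_mode)
print(legacy_mode)
```

In the default mode the `&nbsp;` arrives inside the text as the character `'\xa0'`, which looks like a space when printed. Passing `convert_charrefs=False` to `super().__init__()` is what lets the two handlers run.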

Fire___Within

#5 Created at ...

Can anyone help? Running this on Windows I get an error like UnicodeEncodeError: 'gbk' codec can't encode character '\xa0' in position. What should I do?

淡淡水声

#6 Created at ...

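On Windows you get UnicodeEncodeError: 'gbk'... because cmd (or PowerShell) uses GBK as its encoding, and GBK cannot encode some of the characters being printed. The fix is to encode the text yourself in the program first: print(data.encode('GB18030')) or print(data.encode('GBK','ignore'))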
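
Note that print() on a bytes object shows its repr (b'...') rather than readable text. If the goal is readable output, one option is to encode with an error handler and decode straight back, so that only the unencodable characters are replaced. A small sketch, where `to_console_safe` is a made-up helper name and GBK is assumed to be the console encoding:

```python
def to_console_safe(text, encoding='gbk'):
    """Replace characters that `encoding` cannot represent with '?'."""
    return text.encode(encoding, errors='replace').decode(encoding)


sample = 'HTML\xa0tutorial'      # '\xa0' is what &nbsp; decodes to
safe = to_console_safe(sample)
print(safe)
```

Text that GBK can represent, Chinese included, passes through unchanged; only characters such as '\xa0' are swapped for '?'.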

北京钱有用

#7 Created at ...

Hey, is this the complete code?

我敢说认识林迪格尔的不超过三个

#8 Created at ...

Did you put coding: utf-8 at the top of the file?

御蓝破

#9 Created at ...

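This is a limitation on the print() side: it is bound by the console's encoding and cannot output every Unicode character. Running the script in IDLE avoids the problem. Alternatively, switch the output encoding to 'utf-8':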

import io
import sys
sys.stdout = io.TextIOWrapper(sys.stdout.buffer,encoding='utf8')
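
If the console really is GBK, forcing UTF-8 output may just trade the exception for garbled text. Another possibility is to keep the encoding and change only the error handler. The sketch below shows the effect on an in-memory stream instead of the real console; `wrap_for_console` is a made-up helper, and GBK is assumed as the console encoding.

```python
import io


def wrap_for_console(raw, encoding='gbk'):
    # backslashreplace writes an escape such as \xa0 instead of raising.
    return io.TextIOWrapper(raw, encoding=encoding,
                            errors='backslashreplace', newline='\n')


buffer = io.BytesIO()            # stands in for sys.stdout.buffer
out = wrap_for_console(buffer)
print('HTML\xa0tutorial', file=out)
out.flush()
written = buffer.getvalue()
print(written)
```

On a real console the same wrapper would be applied to sys.stdout.buffer, as in the snippet above.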

Homework submission
