Notes on code snippets that tend to come in handy on App Engine (taken from the rsser source)

First, create a class that inherits from webapp.RequestHandler and put the shared validation logic there

(This isn't limited to validation, of course.)

import logging

from google.appengine.ext import webapp

class AbstractHandler(webapp.RequestHandler):
    def v_url(self, url):
        # Return the url as a UTF-8 byte string, or None if it is rejected
        if not url:
            logging.warning('url is blank.')
            return None
        url = url.encode('utf-8')
        # Reject urls that point back at this app itself
        if url.startswith(self.request.host_url):
            logging.warning('url startswith host_url.')
            return None
        return url

class EditHandler(AbstractHandler):
    def get(self):
        url = self.v_url(self.request.get('url'))
        if not url:
            # self.message() is not shown in this excerpt
            self.message('Invalid parameter')
            return

class SetHandler(AbstractHandler):
    def get(self):
        url = self.v_url(self.request.get('url'))
        if not url:
            self.message('Invalid parameter')
            return

Plain functions would work too, but collecting the checks in a base class makes the code easier to follow
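For comparison, here is a minimal sketch of the plain-function alternative, written as standalone Python 3 so it can be tried outside App Engine. The name validate_url and the host_url parameter are made up for this illustration; they are not from the rsser source.

```python
import logging

def validate_url(url, host_url):
    """Return url if it passes the checks, otherwise None.

    host_url is passed in explicitly here; in the handler version it
    comes from self.request.host_url.
    """
    if not url:
        logging.warning('url is blank.')
        return None
    if url.startswith(host_url):
        # A url pointing back at the app itself is rejected
        logging.warning('url startswith host_url.')
        return None
    return url

print(validate_url('http://other.example/feed', 'http://myapp.example'))
```

Every handler would have to pass host_url in by hand, which is one reason the base-class version reads more cleanly.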

Serve the running app's own source code as-is

import os

rootPath = os.path.dirname(__file__)
appPath  = os.path.join(rootPath, 'hoge.py')

class ViewSourceHandler(webapp.RequestHandler):
    def get(self):
        # Read the file and close it right away (the name 'all' would shadow the builtin)
        f = open(appPath, 'r')
        try:
            source = f.read()
        finally:
            f.close()
        self.response.headers['Content-Type'] = 'text/plain; charset=utf-8'
        self.response.out.write(source)

Nothing more to it.
App Engine only forbids writing files, so reading the source and returning it works without any problem.

Take the diff of two texts (extract only the updated parts)

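from difflib import ndiff

def getDiff(strOld, strNew):
    # Keep only the lines that ndiff marks as added in strNew
    diff = []
    for line in ndiff(strOld.splitlines(), strNew.splitlines()):
        if line.startswith('+ '):
            # The ndiff prefix is two characters long ('+ '), so drop both
            diff.append(line[2:])
    return '\n'.join(diff)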
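To see what this does, here is a small standalone Python 3 sketch. difflib.ndiff prefixes every output line with a two-character code ('+ ' added, '- ' removed, '  ' unchanged, '? ' hint), so the added lines are the ones starting with '+ '. The function name added_lines and the sample texts are made up for this illustration.

```python
from difflib import ndiff

def added_lines(old_text, new_text):
    """Return the lines that ndiff reports as added in new_text."""
    added = []
    for line in ndiff(old_text.splitlines(), new_text.splitlines()):
        if line.startswith('+ '):
            # Strip the two-character '+ ' prefix
            added.append(line[2:])
    return added

old = "apple\nbanana"
new = "apple\nbanana\ncherry"
print(added_lines(old, new))
```

A changed line shows up as a removal plus an addition, so the new version of the line is included in the result as well.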

Extract only the text from HTML retrieved with urlfetch

First, detect the character encoding.
Collect candidate charsets and try them one after another.

From the urlfetch response:

- If headers['Content-Type'] contains a charset=... declaration, try that (the response header)
- If content contains charset=..., try that (the declaration inside the HTML file)

def isCharset(charset):
    # Check whether Python can decode with this charset
    try:
        unicode('hoge', charset)
        return True
    except Exception:
        return False

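import re

# res is the urlfetch response; html is assumed here to be its body (res.content)
charset = None  # stays None when no usable declaration is found

if res.headers.has_key('Content-Type'):
    c = re.search(r'charset=([0-9a-zA-Z\-_]+)', res.headers['Content-Type'])
    if c:
        charset = c.group(1)
        if not isCharset(charset):
            charset = None

if not charset:
    # Decode only the head of the document, leniently, just to find the charset= declaration
    c = re.search(r'charset=([0-9a-zA-Z\-_]+)', unicode(html[:1000], 'shift-jis', errors='replace'))
    if c:
        charset = c.group(1)
        if not isCharset(charset):
            charset = None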

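Sometimes neither place declares a charset, and sometimes the declared one (x-sjis, for example) is rejected by the check, which treats any charset Python fails to decode with as invalid.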
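The two lookups can be folded into one helper. This is a minimal standalone sketch in Python 3, not the rsser code: the names find_declared_charset and is_known_charset are made up, codecs.lookup stands in for the trial decode, and the regex additionally tolerates an opening quote so that a declaration like charset="utf-8" is picked up.

```python
import codecs
import re

# Optional quote after '=' is an addition for this sketch
CHARSET_RE = re.compile(r'charset=["\']?([0-9a-zA-Z\-_]+)')

def is_known_charset(name):
    """True if Python has a codec registered under this name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False

def find_declared_charset(content_type, body):
    """Try the Content-Type header first, then the first 1000 bytes of the body."""
    # The declaration itself is ASCII, so a lossy ASCII decode is enough to search it
    head = body[:1000].decode('ascii', errors='replace')
    for text in (content_type or '', head):
        match = CHARSET_RE.search(text)
        if match and is_known_charset(match.group(1)):
            return match.group(1)
    return None

print(find_declared_charset('text/html; charset=UTF-8', b''))
```

When the header names a charset Python does not know, the loop simply moves on to the body, and the function returns None if that fails too.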

In that case:

- guess the charset with chardet

That can fail as well, so:

- as a last resort, fall back to a hard-coded shift-jis

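if not charset:
    try:
        charset = chardet.detect(html)['encoding']
    except Exception:
        charset = None
    if not charset:
        # chardet can also return None as the encoding without raising
        charset = 'shift-jis'
html = unicode(html, charset, errors='replace')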
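The whole fallback chain (declared charset, then a detector, then a fixed default) can be sketched as one function. This is a standalone Python 3 illustration under assumptions: decode_html and its parameters are made-up names, and the detector is passed in as a callable so the sketch runs without chardet installed. chardet.detect, which returns a dict with an 'encoding' key, could be passed as detect.

```python
import codecs

def _is_known(name):
    """True if Python has a codec registered under this name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False

def decode_html(raw, declared=None, detect=None, default='shift_jis'):
    """Decode raw bytes using declared, then detect(raw), then default."""
    charset = declared
    if not charset and detect is not None:
        try:
            charset = detect(raw).get('encoding')
        except Exception:
            charset = None
    # Covers a missing guess as well as a name Python does not know
    if not charset or not _is_known(charset):
        charset = default
    return raw.decode(charset, errors='replace')

raw = 'こんにちは'.encode('shift_jis')
print(decode_html(raw, detect=lambda data: {'encoding': None}))
```

Because errors='replace' is used, the final decode never raises; undecodable bytes turn into replacement characters instead.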

Once the HTML is unicode, parse it.
html5lib is a good choice for the parser.

parser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("beautifulsoup"))
bs = parser.parse(html)

After that, use BeautifulSoup to pull out the title and the text

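#http://python.g.hatena.ne.jp/y_yanbe/20081025/1224910392
from BeautifulSoup import NavigableString, Comment, Declaration

def getNavigableStrings(soup):
  # Recursively yield the non-blank text nodes, skipping comments and declarations
  if isinstance(soup, NavigableString):
    if type(soup) not in (Comment, Declaration) and soup.strip():
      yield soup
  # Do not descend into elements whose content is not body text
  elif soup.name not in ('script', 'style', 'title', 'noscript'):
    for c in soup.contents:
      for g in getNavigableStrings(c):
        yield g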

title = bs.find('title')
if title:
    title = '\n'.join(title.contents)
    title = re.sub(r'\n', ' ', title)
else:
    title = ''
text = '\n'.join(getNavigableStrings(bs))
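# Strip any tags that leaked into the extracted text
text = re.sub(r'<([^>]+)?>', '', text)
# This fragment sits inside a function, which returns the (title, text) pair
return ( title, text )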

HTML markup sometimes leaks into the text, so that part is stripped out.
I haven't looked into BeautifulSoup properly yet, so this is a stopgap.
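For trying the same idea without html5lib or BeautifulSoup, here is a rough standalone sketch in Python 3. It swaps in the standard library's html.parser.HTMLParser, so it is a different technique from the code above and will not cope with broken markup as well as html5lib does. The names TextExtractor and extract_title_and_text are made up for this illustration.

```python
from html.parser import HTMLParser

SKIPPED_TAGS = ('script', 'style', 'title', 'noscript')

class TextExtractor(HTMLParser):
    """Collect the title and the visible text while parsing."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0      # > 0 while inside a skipped element
        self.in_title = False
        self.title_parts = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        if tag == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        # Comments never arrive here; they go to handle_comment
        if self.in_title:
            self.title_parts.append(data)
        elif self.skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())

def extract_title_and_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse newlines and runs of spaces in the title into single spaces
    title = ' '.join(''.join(parser.title_parts).split())
    return title, '\n'.join(parser.text_parts)

sample = '<html><head><title>Hello</title></head><body><p>First</p><p>Second</p></body></html>'
print(extract_title_and_text(sample))
```

Since the parser hands tags and text to separate callbacks, no markup ends up in the collected text here, so the regex cleanup step is not needed in this variant.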

The full source is at

http://rss-er.appspot.com/source