This presentation focuses on the importance of web crawling and page ranking
algorithms in dealing with the massive amount of data present on the World Wide
Web. As the web continues to grow exponentially, efficient search and retrieval
methods become crucial. Web crawling is a process that converts unstructured
data into structured data, enabling effective information retrieval.
Additionally, page ranking algorithms play a significant role in assessing the
quality and popularity of web pages. The presentation explores the background
of these algorithms and evaluates five different crawling algorithms: Shark
Search, Priority-Based Queue, Naive Bayes, Breadth-First, and Depth-First. The
goal is to identify the most effective algorithm for crawling web pages. By
understanding these algorithms, we can enhance our ability to navigate the web
and extract valuable information efficiently.
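
As a rough illustration of how two of the listed strategies differ, the sketch below runs Breadth-First and Depth-First traversal over a small in-memory link graph. The graph `LINKS`, the function `crawl_order`, and the `strategy` parameter are hypothetical names introduced here, not taken from the presentation, and no real pages are fetched.

```python
from collections import deque

# Hypothetical toy link graph: page -> outgoing links (illustrative only).
LINKS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": [],
    "E": ["A"],
}


def crawl_order(links, seed, strategy="bfs"):
    """Return the order in which pages are visited starting from `seed`."""
    frontier = deque([seed])
    seen = {seed}  # pages already queued, so each page is visited once
    order = []
    while frontier:
        # Breadth-first takes the oldest frontier entry, depth-first the newest.
        page = frontier.popleft() if strategy == "bfs" else frontier.pop()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return order


print(crawl_order(LINKS, "A", "bfs"))
print(crawl_order(LINKS, "A", "dfs"))
```

The only difference between the two strategies in this sketch is which end of the frontier is consumed. The depth-first variant here is stack-based, so it follows the last-listed link of each page first.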

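This presentation covers the importance of web crawling and page ranking algorithms for handling the massive amount of data on the web, discusses five different crawling algorithms, and aims to identify the most effective one in order to improve web navigation and information extraction.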

Comparative analysis of various web crawler algorithms

The need for large raw corpora has dramatically increased in recent years
with the introduction of transfer learning and semi-supervised learning methods
to Natural Language Processing. And while there have been some recent attempts
to manually curate the amount of data necessary to train large language models,
the main way to obtain this data is still through automatic web crawling. In
this paper we take the existing multilingual web corpus OSCAR and its pipeline
Ungoliant that extracts and classifies data from Common Crawl at the line
level, and propose a set of improvements and automatic annotations in order to
produce a new document-oriented version of OSCAR that could prove more suitable
to pre-train large generative language models as well as hopefully other
applications in Natural Language Processing and Digital Humanities.

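This paper describes a method that applies automatic annotations and improvements to the existing multilingual web corpus OSCAR to obtain a new version better suited to pre-training large generative language models.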

Towards a Cleaner Document-Oriented Multilingual Crawled Corpus

Web crawling is the problem of keeping a cache of webpages fresh, i.e.,
having the most recent copy available when a page is requested. This problem is
usually coupled with the natural restriction that the bandwidth available to
the web crawler is limited. The corresponding optimization problem was solved
optimally by Azar et al. [2018] under the assumption that, for each webpage,
both the elapsed time between two changes and the elapsed time between two
requests follow a Poisson distribution with known parameters. In this paper, we
study the same control problem but under the assumption that the change rates
are unknown a priori, and thus we need to estimate them in an online fashion
using only partial observations (i.e., single-bit signals indicating whether
the page has changed since the last refresh). As a point of departure, we
characterise the conditions under which one can solve the problem with such
partial observability. Next, we propose a practical estimator and compute
confidence intervals for it in terms of the elapsed time between the
observations. Finally, we show that the explore-and-commit algorithm achieves
an $\mathcal{O}(\sqrt{T})$ regret with a carefully chosen exploration horizon.
Our simulation study shows that our online policy scales well and achieves
close to optimal performance for a wide range of the parameters.
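
To make the single-bit observation model concrete, the sketch below estimates a change rate in a simplified setting. It assumes a Poisson change process with rate λ and a page that is refreshed at a fixed interval τ, so each observation reports a change with probability 1 - exp(-λτ), and the maximum-likelihood estimate is -ln(1 - k/n)/τ for k changes seen in n refreshes. This is a textbook special case, not the estimator proposed in the paper, which works with varying elapsed times and comes with confidence intervals. The names `estimate_change_rate` and `simulate_flags` are hypothetical.

```python
import math
import random


def estimate_change_rate(changed_flags, interval):
    """MLE of a Poisson change rate from one-bit signals at a fixed refresh interval."""
    n = len(changed_flags)
    if n == 0:
        raise ValueError("need at least one observation")
    k = sum(changed_flags)  # number of refreshes that saw a change
    if k == n:
        # Every refresh saw a change: the likelihood has no finite maximiser.
        return math.inf
    return -math.log(1.0 - k / n) / interval


def simulate_flags(rate, interval, n, rng):
    """Simulate whether the page changed at least once in each refresh interval."""
    p_change = 1.0 - math.exp(-rate * interval)
    return [rng.random() < p_change for _ in range(n)]


rng = random.Random(0)
flags = simulate_flags(rate=0.5, interval=1.0, n=20000, rng=rng)
print(estimate_change_rate(flags, interval=1.0))
```

The `math.inf` branch shows why partial observability matters: if the page is refreshed too rarely relative to its change rate, every signal reads "changed" and the rate cannot be pinned down from the bits alone.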

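This paper studies the web crawling optimization problem when page change rates are unknown and must be estimated online from partially observable signals, proposes a practical estimator, and proves a regret bound for the explore-and-commit algorithm.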

Learning to Crawl

Parallel sentences are a relatively scarce but extremely useful resource for
many applications including cross-lingual retrieval and statistical machine
translation. This research explores our methodology for mining such data from
previously obtained comparable corpora. The task is highly practical since
non-parallel multilingual data exist in far greater quantities than parallel
corpora, but parallel sentences are a much more useful resource. Here we
propose a web crawling method for building subject-aligned comparable corpora
from Wikipedia articles. We also introduce a method for extracting truly
parallel sentences that are filtered out from noisy or just comparable sentence
pairs. We describe our implementation of a specialized tool for this task as
well as training and adaption of a machine translation system that supplies our
filter with additional information about the similarity of comparable sentence
pairs.

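Using a web crawling method and a machine translation system, this paper proposes a way to build subject-aligned comparable corpora from Wikipedia articles and to extract parallel sentences by filtering out noisy or merely comparable pairs.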