如何从字母矩阵中找到可能的单词列表[Boggle Solver]

2018-06-13 22:18:25

最近我一直在我的iPhone上玩一个叫做Scramble的游戏。你们中有些人可能会认为这个游戏是Boggle。基本上，当游戏开始时，你会得到一个字母矩阵，如下所示：

F X I E
A M L O
E W B X
A S T U

游戏的目标是尽可能多地找到可以通过将字母链接在一起而形成的单词。你可以从任何字母开始，并且围绕它的所有字母都是公平的游戏，然后一旦你移动到下一个字母，围绕该字母的所有字母都是公平的游戏， 除了以前使用的任何字母 。所以在上面的网格中，例如，我可以想出LOB ， TUX ， SEA ， FAME等词语。单词必须至少有3个字符，并且不能超过NxN个字符，这在这个游戏中应该是16个字符，但可以在一些实现中变化。虽然这个游戏很有趣并且令人上瘾，但我显然不是很擅长这个游戏，我想通过制作一个能够给我最好的单词的程序（这个单词越长得分越多）来欺骗一点。

示例Boggle http://www.boggled.org/sample.gif

不幸的是，我不擅长算法或其效率等等。我第一次尝试使用这样一个字典（约2.3MB），并进行线性搜索，尝试将组合与词典条目进行匹配。这需要很长时间才能找到可能的单词，并且由于每轮只能获得2分钟，所以这是不够的。

我很有兴趣看看是否有任何Stackoverflowers能够提供更高效的解决方案。我主要是在寻找使用Big 3 Ps的解决方案：Python，PHP和Perl，不过任何Java或C ++都很酷，因为速度是必不可少的。

当前解决方案 ：

Adam Rosenfield，Python，大约20多岁

John Fouhy，Python，〜3s

Kent Fredric，Perl，〜1s

大流士培根，Python，〜1s

rvarcher，VB.NET（live link），〜1s

Paolo Bergantino，PHP（现场链接），约5秒（本地约2秒）

BOUNTY ：

我为这个问题添加了一笔赏金，这是我向所有参与其计划的人表示感谢的方式。不幸的是，我只能给你们其中一个人接受答案，所以我将在7天后衡量谁拥有最快的解决方案，并奖励获奖者。

赏金颁发。感谢参与的每个人。

我的答案与其他人一样，但我会发布它，因为它看起来比其他Python解决方案快一些，从设置字典更快。（我对John Fouhy的解决方案进行了检查。）安装完成后，解决问题的时间缩短了。

grid = "fxie amlo ewbx astu".split()
nrows, ncols = len(grid), len(grid[0])

# A dictionary word that could be a solution must use only the grid's
# letters and have length >= 3. (With a case-insensitive match.)
import re
alphabet = ''.join(set(''.join(grid)))
bogglable = re.compile('[' + alphabet + ']{3,}$', re.I).match

words = set(word.rstrip('n') for word in open('words') if bogglable(word))
prefixes = set(word[:i] for word in words
               for i in range(2, len(word)+1))

def solve():
    for y, row in enumerate(grid):
        for x, letter in enumerate(row):
            for result in extending(letter, ((x, y),)):
                yield result

def extending(prefix, path):
    if prefix in words:
        yield (prefix, path)
    for (nx, ny) in neighbors(path[-1]):
        if (nx, ny) not in path:
            prefix1 = prefix + grid[ny][nx]
            if prefix1 in prefixes:
                for result in extending(prefix1, path + ((nx, ny),)):
                    yield result

def neighbors((x, y)):
    for nx in range(max(0, x-1), min(x+2, ncols)):
        for ny in range(max(0, y-1), min(y+2, nrows)):
            yield (nx, ny)

示例用法：

# Print a maximal-length word and its path:
print max(solve(), key=lambda (word, path): len(word))

编辑：过滤掉少于3个字母的单词。

编辑2：我很好奇为什么Kent Fredric的Perl解决方案更快; 原来使用正则表达式匹配而不是一组字符。在Python中执行相同的操作可以将速度加倍。

你将得到的最快解决方案可能涉及将你的字典存储在trie中。然后，创建三元组（x，y，s）的队列，其中队列中的每个元素都对应于可以在网格中拼写的单词的前缀s，结束于位置（x，y）。使用N x N个元素初始化队列（其中N是网格的大小），网格中每个正方形的一个元素。然后，算法进行如下：

While the queue is not empty:
  Dequeue a triple (x, y, s)
  For each square (x', y') with letter c adjacent to (x, y):
    If s+c is a word, output s+c
    If s+c is a prefix of a word, insert (x', y', s+c) into the queue

如果将字典存储在一个trie中，测试s + c是一个单词还是一个单词的前缀可以在恒定时间内完成（前提是您还在每个队列数据中保留一些额外的元数据，例如指向当前节点的指针在trie中），所以这个算法的运行时间是O（可以拼写的单词的数量）。

[编辑]这里有一个我刚刚编码的Python实现：

#!/usr/bin/python

class TrieNode:
    def __init__(self, parent, value):
        self.parent = parent
        self.children = [None] * 26
        self.isWord = False
        if parent is not None:
            parent.children[ord(value) - 97] = self

def MakeTrie(dictfile):
    dict = open(dictfile)
    root = TrieNode(None, '')
    for word in dict:
        curNode = root
        for letter in word.lower():
            if 97 <= ord(letter) < 123:
                nextNode = curNode.children[ord(letter) - 97]
                if nextNode is None:
                    nextNode = TrieNode(curNode, letter)
                curNode = nextNode
        curNode.isWord = True
    return root

def BoggleWords(grid, dict):
    rows = len(grid)
    cols = len(grid[0])
    queue = []
    words = []
    for y in range(cols):
        for x in range(rows):
            c = grid[y][x]
            node = dict.children[ord(c) - 97]
            if node is not None:
                queue.append((x, y, c, node))
    while queue:
        x, y, s, node = queue[0]
        del queue[0]
        for dx, dy in ((1, 0), (1, -1), (0, -1), (-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1)):
            x2, y2 = x + dx, y + dy
            if 0 <= x2 < cols and 0 <= y2 < rows:
                s2 = s + grid[y2][x2]
                node2 = node.children[ord(grid[y2][x2]) - 97]
                if node2 is not None:
                    if node2.isWord:
                        words.append(s2)
                    queue.append((x2, y2, s2, node2))

    return words

用法示例：

d = MakeTrie('/usr/share/dict/words')
print(BoggleWords(['fxie','amlo','ewbx','astu'], d))

输出：

['fa'，'xi'，'ie'，'io'，'el'，'am'，'ax'，'ae'，'aw'，'mi'，'ma'，'我'，' lo'，'li'，'oe'，'ox'，'em'，'ea'，'ea'，'es'，'wa'，'我们'，'wa'，'bo'，'bu' ，as，aw，ae，st，se，sa，tu，ut，fam，fae，imi，eli， elm'，'ami'，'ama'，'ame'，'aes'，'awl'，'awa'，'awe'，'awa'，'mix'，'mim'，'mil' ，'mae'，'maw'，'mew'，'mem'，'mes'，'lob'，'lox'，'lei'，'leo'，'lie'，' lim'，'oil'，'olm'，'ewe'，'eme'，'wax'，'waf'，'wae'，'waw'，'wem'，'wea'，'wea'，'was' ，'waw'，'wae'，'bob'，'blo'，'bub'，'but'，'ast'，'ase'，'asa'，'awl'，'awa'，'awe' 'swa'''sew'，'sea'，'sea'，'saw'，'tux'，'tub'，'tut'，'twa'，'twa' ，'tt'，'utu'，'fama'，'fame'，'ixil'，'imam'，'amli'，'amil'，'ambo'，'axil'，'轴'，'mimi'，' mime'mime'milo'mile'mewl'mese'mesa'loolo'lobo''lima''lime'''limb''lile' ，'oime'，'oleo'，'olio'，'oboe'，'obol'，'emim'，'emil'，'east'，'ease'，'wame'，'wawa'，'wawa'，' weam'，'west'，'wese'，'wast'，'wase' ，'wawa'，'wawa'，'boil'，'bolo'，'bole'，'bobo'，'blob'，'bleo'，'bubo'，'asem'，'stub'，'stut' Swam'，'semi'，'seme'，'seam'，'seax'，'sasa'，'sawt'，'tutu'，'tuts'，'twae'，'twas'，'twae'，'ilima' ，amble，axile，awest，mamie，mambo，maxim，mease，mesem，limax，limes，limbo，limbu， obama'，'emesa'，'embox'，'awest'，'swami'，'famble'，'mimble'，'maxima'，'embolo'，'embole'，'wamble'，'semese'，'semble' ，'sawbwa'，'sawbwa']

注意：该程序不输出单字母单词，或根据单词长度进行过滤。这很容易添加，但与问题无关。如果能够以多种方式拼写，它还会多次输出一些单词。如果给定的单词可以用许多不同的方式拼写（最坏的情况：网格中的每个字母都是相同的（例如'A'），并且在您的字典中有'aaaaaaaaaa'这样的单词），那么运行时间会变得非常指数。算法结束后，过滤掉重复项并排序很容易到期。

对于字典加速，可以进行一个通用转换/过程，以提前大大减少字典比较。

鉴于上述网格只包含16个字符，其中一些重复，您可以通过简单筛选出具有难以达到的字符的条目，大大减少字典中的总密钥数量。

我认为这是显而易见的优化，但看到没人提到它，我提到它。

它简单地在输入过程中将我从200,000个键的字典中减少到仅2,000个键。这至少可以减少内存开销，而且由于内存不是无限快的，这肯定会映射到某处的速度增加。

Perl实现

我的实现有点头重脚轻，因为我重视能够知道每个提取的字符串的确切路径，而不仅仅是其中的有效性。

我在这里也有一些适应，理论上允许一个有孔的网格起作用，而网格有不同大小的线（假设你得到的输入是正确的，并且它以某种方式排列）。

早期的过滤器在我的应用中是迄今为止最重要的瓶颈，正如前面所怀疑的那样，评论说这条线从1.5s扩展到7.5s。

在执行时，它似乎认为所有的单个数字都在他们自己的有效单词上，但我很确定这是由于字典文件的工作原理。

它有点臃肿，但至少我从cpan重用Tree :: Trie

其中一些部分受到现有实施的部分启发，其中一些我已经想到了。

建设性批评及其可以改进的方式欢迎（/我注意到他从来没有在CPAN中搜寻过一个难以理解的求解器，但这样做更有趣）

更新了新的标准

#!/usr/bin/perl 

use strict;
use warnings;

{

  # this package manages a given path through the grid.
  # Its an array of matrix-nodes in-order with
  # Convenience functions for pretty-printing the paths
  # and for extending paths as new paths.

  # Usage:
  # my $p = Prefix->new(path=>[ $startnode ]);
  # my $c = $p->child( $extensionNode );
  # print $c->current_word ;

  package Prefix;
  use Moose;

  has path => (
      isa     => 'ArrayRef[MatrixNode]',
      is      => 'rw',
      default => sub { [] },
  );
  has current_word => (
      isa        => 'Str',
      is         => 'rw',
      lazy_build => 1,
  );

  # Create a clone of this object
  # with a longer path

  # $o->child( $successive-node-on-graph );

  sub child {
      my $self    = shift;
      my $newNode = shift;
      my $f       = Prefix->new();

      # Have to do this manually or other recorded paths get modified
      push @{ $f->{path} }, @{ $self->{path} }, $newNode;
      return $f;
  }

  # Traverses $o->path left-to-right to get the string it represents.

  sub _build_current_word {
      my $self = shift;
      return join q{}, map { $_->{value} } @{ $self->{path} };
  }

  # Returns  the rightmost node on this path

  sub tail {
      my $self = shift;
      return $self->{path}->[-1];
  }

  # pretty-format $o->path

  sub pp_path {
      my $self = shift;
      my @path =
        map { '[' . $_->{x_position} . ',' . $_->{y_position} . ']' }
        @{ $self->{path} };
      return "[" . join( ",", @path ) . "]";
  }

  # pretty-format $o
  sub pp {
      my $self = shift;
      return $self->current_word . ' => ' . $self->pp_path;
  }

  __PACKAGE__->meta->make_immutable;
}

{

  # Basic package for tracking node data
  # without having to look on the grid.
  # I could have just used an array or a hash, but that got ugly.

# Once the matrix is up and running it doesn't really care so much about rows/columns,
# Its just a sea of points and each point has adjacent points.
# Relative positioning is only really useful to map it back to userspace

  package MatrixNode;
  use Moose;

  has x_position => ( isa => 'Int', is => 'rw', required => 1 );
  has y_position => ( isa => 'Int', is => 'rw', required => 1 );
  has value      => ( isa => 'Str', is => 'rw', required => 1 );
  has siblings   => (
      isa     => 'ArrayRef[MatrixNode]',
      is      => 'rw',
      default => sub { [] }
  );

# Its not implicitly uni-directional joins. It would be more effient in therory
# to make the link go both ways at the same time, but thats too hard to program around.
# and besides, this isn't slow enough to bother caring about.

  sub add_sibling {
      my $self    = shift;
      my $sibling = shift;
      push @{ $self->siblings }, $sibling;
  }

  # Convenience method to derive a path starting at this node

  sub to_path {
      my $self = shift;
      return Prefix->new( path => [$self] );
  }
  __PACKAGE__->meta->make_immutable;

}

{

  package Matrix;
  use Moose;

  has rows => (
      isa     => 'ArrayRef',
      is      => 'rw',
      default => sub { [] },
  );

  has regex => (
      isa        => 'Regexp',
      is         => 'rw',
      lazy_build => 1,
  );

  has cells => (
      isa        => 'ArrayRef',
      is         => 'rw',
      lazy_build => 1,
  );

  sub add_row {
      my $self = shift;
      push @{ $self->rows }, [@_];
  }

  # Most of these functions from here down are just builder functions,
  # or utilities to help build things.
  # Some just broken out to make it easier for me to process.
  # All thats really useful is add_row
  # The rest will generally be computed, stored, and ready to go
  # from ->cells by the time either ->cells or ->regex are called.

  # traverse all cells and make a regex that covers them.
  sub _build_regex {
      my $self  = shift;
      my $chars = q{};
      for my $cell ( @{ $self->cells } ) {
          $chars .= $cell->value();
      }
      $chars = "[^$chars]";
      return qr/$chars/i;
  }

  # convert a plain cell ( ie: [x][y] = 0 )
  # to an intelligent cell ie: [x][y] = object( x, y )
  # we only really keep them in this format temporarily
  # so we can go through and tie in neighbouring information.
  # after the neigbouring is done, the grid should be considered inoperative.

  sub _convert {
      my $self = shift;
      my $x    = shift;
      my $y    = shift;
      my $v    = $self->_read( $x, $y );
      my $n    = MatrixNode->new(
          x_position => $x,
          y_position => $y,
          value      => $v,
      );
      $self->_write( $x, $y, $n );
      return $n;
  }

# go through the rows/collums presently available and freeze them into objects.

  sub _build_cells {
      my $self = shift;
      my @out  = ();
      my @rows = @{ $self->{rows} };
      for my $x ( 0 .. $#rows ) {
          next unless defined $self->{rows}->[$x];
          my @col = @{ $self->{rows}->[$x] };
          for my $y ( 0 .. $#col ) {
              next unless defined $self->{rows}->[$x]->[$y];
              push @out, $self->_convert( $x, $y );
          }
      }
      for my $c (@out) {
          for my $n ( $self->_neighbours( $c->x_position, $c->y_position ) ) {
              $c->add_sibling( $self->{rows}->[ $n->[0] ]->[ $n->[1] ] );
          }
      }
      return @out;
  }

  # given x,y , return array of points that refer to valid neighbours.
  sub _neighbours {
      my $self = shift;
      my $x    = shift;
      my $y    = shift;
      my @out  = ();
      for my $sx ( -1, 0, 1 ) {
          next if $sx + $x < 0;
          next if not defined $self->{rows}->[ $sx + $x ];
          for my $sy ( -1, 0, 1 ) {
              next if $sx == 0 && $sy == 0;
              next if $sy + $y < 0;
              next if not defined $self->{rows}->[ $sx + $x ]->[ $sy + $y ];
              push @out, [ $sx + $x, $sy + $y ];
          }
      }
      return @out;
  }

  sub _has_row {
      my $self = shift;
      my $x    = shift;
      return defined $self->{rows}->[$x];
  }

  sub _has_cell {
      my $self = shift;
      my $x    = shift;
      my $y    = shift;
      return defined $self->{rows}->[$x]->[$y];
  }

  sub _read {
      my $self = shift;
      my $x    = shift;
      my $y    = shift;
      return $self->{rows}->[$x]->[$y];
  }

  sub _write {
      my $self = shift;
      my $x    = shift;
      my $y    = shift;
      my $v    = shift;
      $self->{rows}->[$x]->[$y] = $v;
      return $v;
  }

  __PACKAGE__->meta->make_immutable;
}

use Tree::Trie;

sub readDict {
  my $fn = shift;
  my $re = shift;
  my $d  = Tree::Trie->new();

  # Dictionary Loading
  open my $fh, '<', $fn;
  while ( my $line = <$fh> ) {
      chomp($line);

 # Commenting the next line makes it go from 1.5 seconds to 7.5 seconds. EPIC.
      next if $line =~ $re;    # Early Filter
      $d->add( uc($line) );
  }
  return $d;
}

sub traverseGraph {
  my $d     = shift;
  my $m     = shift;
  my $min   = shift;
  my $max   = shift;
  my @words = ();

  # Inject all grid nodes into the processing queue.

  my @queue =
    grep { $d->lookup( $_->current_word ) }
    map  { $_->to_path } @{ $m->cells };

  while (@queue) {
      my $item = shift @queue;

      # put the dictionary into "exact match" mode.

      $d->deepsearch('exact');

      my $cword = $item->current_word;
      my $l     = length($cword);

      if ( $l >= $min && $d->lookup($cword) ) {
          push @words,
            $item;    # push current path into "words" if it exactly matches.
      }
      next if $l > $max;

      # put the dictionary into "is-a-prefix" mode.
      $d->deepsearch('boolean');

    siblingloop: foreach my $sibling ( @{ $item->tail->siblings } ) {
          foreach my $visited ( @{ $item->{path} } ) {
              next siblingloop if $sibling == $visited;
          }

          # given path y , iterate for all its end points
          my $subpath = $item->child($sibling);

          # create a new path for each end-point
          if ( $d->lookup( $subpath->current_word ) ) {

             # if the new path is a prefix, add it to the bottom of the queue.
              push @queue, $subpath;
          }
      }
  }
  return @words;
}

sub setup_predetermined { 
  my $m = shift; 
  my $gameNo = shift;
  if( $gameNo == 0 ){
      $m->add_row(qw( F X I E ));
      $m->add_row(qw( A M L O ));
      $m->add_row(qw( E W B X ));
      $m->add_row(qw( A S T U ));
      return $m;
  }
  if( $gameNo == 1 ){
      $m->add_row(qw( D G H I ));
      $m->add_row(qw( K L P S ));
      $m->add_row(qw( Y E U T ));
      $m->add_row(qw( E O R N ));
      return $m;
  }
}
sub setup_random { 
  my $m = shift; 
  my $seed = shift;
  srand $seed;
  my @letters = 'A' .. 'Z' ; 
  for( 1 .. 4 ){ 
      my @r = ();
      for( 1 .. 4 ){
          push @r , $letters[int(rand(25))];
      }
      $m->add_row( @r );
  }
}

# Here is where the real work starts.

my $m = Matrix->new();
setup_predetermined( $m, 0 );
#setup_random( $m, 5 );

my $d = readDict( 'dict.txt', $m->regex );
my $c = scalar @{ $m->cells };    # get the max, as per spec

print join ",n", map { $_->pp } @{
  traverseGraph( $d, $m, 3, $c ) ;
};

Arch /执行信息进行比较：

model name      : Intel(R) Core(TM)2 Duo CPU     T9300  @ 2.50GHz
cache size      : 6144 KB
Memory usage summary: heap total: 77057577, heap peak: 11446200, stack peak: 26448
       total calls   total memory   failed calls
 malloc|     947212       68763684              0
realloc|      11191        1045641              0  (nomove:9063, dec:4731, free:0)
 calloc|     121001        7248252              0
   free|     973159       65854762

Histogram for block sizes:
  0-15         392633  36% ==================================================
 16-31          43530   4% =====
 32-47          50048   4% ======
 48-63          70701   6% =========
 64-79          18831   1% ==
 80-95          19271   1% ==
 96-111        238398  22% ==============================
112-127          3007  <1% 
128-143        236727  21% ==============================

如何从字母矩阵中找到可能的单词列表[Boggle Solver]

Perl实现

更多的正则表达式优化